CN109063824B - Deep three-dimensional convolutional neural network creation method and device, storage medium and processor - Google Patents


Info

Publication number
CN109063824B
CN109063824B
Authority
CN
China
Prior art keywords
dimensional, neural network, network model, dimensional convolutional, convolutional neural
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201810824695.3A
Other languages
Chinese (zh)
Other versions
CN109063824A (en)
Inventor
王志鹏 (Wang Zhipeng)
周文明 (Zhou Wenming)
Current Assignee
Shenzhen Zhongyue Technology Co ltd
Original Assignee
Shenzhen Zhongyue Technology Co ltd
Priority date
Filing date
Publication date
Application filed by Shenzhen Zhongyue Technology Co ltd
Priority to CN201810824695.3A
Publication of CN109063824A
Application granted
Publication of CN109063824B
Legal status: Active

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02T: CLIMATE CHANGE MITIGATION TECHNOLOGIES RELATED TO TRANSPORTATION
    • Y02T10/00: Road transport of goods or passengers
    • Y02T10/10: Internal combustion engine [ICE] based vehicles
    • Y02T10/40: Engine management systems

Abstract

The invention discloses a method and an apparatus for creating a deep three-dimensional convolutional neural network, together with a storage medium and a processor. The method comprises the following steps: training a preset shallow three-dimensional convolutional neural network model on a preset video sequence data set; creating a deep three-dimensional convolutional neural network model in a dense connection manner; splitting all or some of the three-dimensional convolutional layers in the deep three-dimensional convolutional neural network model into three-dimensional convolution units to obtain a first target deep three-dimensional convolutional network model; and setting the preset video sequence data set and the converged preset shallow three-dimensional convolutional neural network model as a supervision signal, and training the first target deep three-dimensional convolutional network model according to the supervision signal to obtain a second target deep three-dimensional convolutional network model. The invention solves the technical problem that three-dimensional convolutional neural networks created in the prior art exhibit poor network performance.

Description

Deep three-dimensional convolutional neural network creating method and device, storage medium and processor
Technical Field
The invention relates to the field of video identification and processing, in particular to a method and a device for creating a deep three-dimensional convolutional neural network, a storage medium and a processor.
Background
A three-dimensional Convolutional Neural Network (CNN) takes a multi-frame image sequence as input and can simultaneously extract abstract features along both the spatial and temporal dimensions of the sequence; it has achieved major breakthroughs in image-sequence analysis applications such as video classification and action recognition. Compared with a conventional convolutional neural network of the same depth, however, a three-dimensional convolutional neural network has far more parameters, requires much more training data, and is considerably harder to train. In practical applications, the available video training samples are limited and difficult to meet these requirements.
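To make the parameter growth concrete, the following sketch counts the weights of a single 2D versus 3D convolutional layer. The layer sizes (64 input channels, 128 kernels, 3x3 and 3x3x3 kernels) are illustrative assumptions of our own choosing, not values taken from the patent.

```python
# Parameter counts for one convolutional layer, illustrating why a 3D ConvNet
# of the same depth has many more parameters than its 2D counterpart.

def conv2d_params(in_ch, out_ch, kh, kw, bias=True):
    """Weights (+ optional biases) of a standard 2D convolutional layer."""
    return out_ch * (in_ch * kh * kw + (1 if bias else 0))

def conv3d_params(in_ch, out_ch, kt, kh, kw, bias=True):
    """Weights of a 3D convolutional layer: an extra temporal kernel dimension."""
    return out_ch * (in_ch * kt * kh * kw + (1 if bias else 0))

p2d = conv2d_params(64, 128, 3, 3)      # 3x3 spatial kernel
p3d = conv3d_params(64, 128, 3, 3, 3)   # 3x3x3 spatio-temporal kernel
print(p2d, p3d, p3d / p2d)              # the 3D layer is roughly 3x larger
```

With a 3x3x3 kernel the temporal dimension multiplies the weight count by the kernel depth, which is where the extra training-data demand comes from.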
To address this problem, prior-art approaches reduce the parameter count of a three-dimensional convolutional neural network by limiting the number of layers or by limiting the number of channels in the spatial dimensions. Some of these techniques guarantee network convergence on small data sets through shallow, structurally simple network designs; others substitute two-dimensional convolution for three-dimensional convolution in the spatial dimensions to reduce training difficulty. However, such methods also weaken the network's ability to extract abstract features from video sequences, and their performance is difficult to guarantee in complex video analysis applications. The three-dimensional convolutional neural networks created in the prior art therefore suffer from excessive parameters, training difficulty, and poor performance, i.e. poor network performance overall.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
The embodiment of the invention provides a method and a device for creating a deep three-dimensional convolutional neural network, a storage medium and a processor, which are used for at least solving the technical problem of poor network performance of the created three-dimensional convolutional neural network in the prior art.
According to an aspect of an embodiment of the present invention, there is provided a method for creating a deep three-dimensional convolutional neural network, the method including: training a preset shallow three-dimensional convolutional neural network model on a preset video sequence data set until the preset shallow three-dimensional convolutional neural network model reaches a convergence state; creating a deep three-dimensional convolutional neural network model in a dense connection manner, wherein the deep three-dimensional convolutional neural network model contains more three-dimensional convolutional layers than the preset shallow three-dimensional convolutional neural network model; splitting all or some of the three-dimensional convolutional layers in the deep three-dimensional convolutional neural network model into three-dimensional convolution units to obtain a first target deep three-dimensional convolutional network model, wherein each three-dimensional convolution unit comprises a bottleneck layer and a three-dimensional convolutional layer; and setting the preset video sequence data set and the converged preset shallow three-dimensional convolutional neural network model as a supervision signal, and training the first target deep three-dimensional convolutional network model according to the supervision signal to obtain a second target deep three-dimensional convolutional network model, wherein the second target deep three-dimensional convolutional network model is the first target deep three-dimensional convolutional network model having reached the convergence state.
Further, creating the deep three-dimensional convolutional neural network model in the dense connection manner includes: densely connecting a plurality of three-dimensional convolutional layers to obtain the deep three-dimensional convolutional neural network model, wherein the input of any three-dimensional convolutional layer in the deep three-dimensional convolutional neural network model may comprise the concatenation of the output features of every three-dimensional convolutional layer preceding it.
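The channel-accumulation effect of this dense connection rule can be sketched as follows. The 3-channel input and the growth rate of 16 feature maps per layer are illustrative assumptions, not values fixed by the patent.

```python
# Dense connection: the input to layer L is the concatenation of the outputs of
# all preceding layers (plus the network input), so input channels grow additively.

def dense_input_channels(input_ch, growth_rate, num_layers):
    """Input channel count seen by each densely connected conv layer."""
    channels = []
    current = input_ch
    for _ in range(num_layers):
        channels.append(current)   # this layer concatenates everything so far
        current += growth_rate     # it then contributes growth_rate new maps
    return channels

# e.g. a dense block: 3-channel input, each layer emitting 16 feature maps
print(dense_input_channels(3, 16, 5))   # [3, 19, 35, 51, 67]
```

This steady channel growth is exactly what makes the later 1x1x1 bottleneck split worthwhile: it compresses the accumulated channels before the expensive 3x3x3 convolution.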
Further, setting the preset video sequence data set and the converged preset shallow three-dimensional convolutional neural network model as a supervision signal and training the first target deep three-dimensional convolutional network model according to the supervision signal to obtain a second target deep three-dimensional convolutional network model includes: splitting the converged shallow three-dimensional convolutional neural network model into a plurality of first sub-models; splitting the first target deep three-dimensional convolutional network model into a plurality of second sub-models, the number of second sub-models being equal to the number of first sub-models; ordering the first sub-models and the second sub-models respectively, and training each second sub-model, using the first sub-model at the same position in the ordering as its supervision signal, until that second sub-model reaches the convergence state; and cascading all of the converged second sub-models to obtain the second target deep three-dimensional convolutional network model.
Further, training the second sub-model at the same position as a first sub-model, using that first sub-model as a supervision signal, may include: taking as the supervision signal the weighted combination of the output of the last first sub-model in the ordering and the labels of the preset video sequence data set, and training the last second sub-model in the ordering with this signal until it reaches the convergence state.
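This weighted supervision signal can be sketched as a simple element-wise blend of the converged shallow network's output (the teacher) with the ground-truth labels, in the spirit of knowledge distillation. The weight alpha and the example vectors below are illustrative assumptions; the patent does not fix their values.

```python
# Supervision signal for the last sub-model: weighted combination of the
# shallow (teacher) network's output and the data-set labels.

def supervision_signal(teacher_output, labels, alpha=0.5):
    """Element-wise weighted blend of teacher predictions and ground-truth labels."""
    assert len(teacher_output) == len(labels)
    return [alpha * t + (1.0 - alpha) * y
            for t, y in zip(teacher_output, labels)]

teacher = [0.7, 0.2, 0.1]   # e.g. the shallow network's softmax output
onehot  = [1.0, 0.0, 0.0]   # one-hot label from the video data set
print(supervision_signal(teacher, onehot, alpha=0.4))
```

The teacher term smooths the hard labels, which is what lets the deeper network converge on a data set that would be too small for direct supervised training.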
According to another aspect of the embodiments of the present invention, there is also provided an apparatus for creating a deep three-dimensional convolutional neural network, the apparatus including: a first training unit, configured to train a preset shallow three-dimensional convolutional neural network model on a preset video sequence data set until the preset shallow three-dimensional convolutional neural network model reaches a convergence state; a creating unit, configured to create a deep three-dimensional convolutional neural network model in a dense connection manner, wherein the deep three-dimensional convolutional neural network model contains more three-dimensional convolutional layers than the preset shallow three-dimensional convolutional neural network model; a splitting unit, configured to split all or some of the three-dimensional convolutional layers in the deep three-dimensional convolutional neural network model into three-dimensional convolution units to obtain a first target deep three-dimensional convolutional network model, wherein each three-dimensional convolution unit comprises a bottleneck layer and a three-dimensional convolutional layer; and a second training unit, configured to set the preset video sequence data set and the converged preset shallow three-dimensional convolutional neural network model as a supervision signal and to train the first target deep three-dimensional convolutional network model according to the supervision signal to obtain a second target deep three-dimensional convolutional network model, wherein the second target deep three-dimensional convolutional network model is the first target deep three-dimensional convolutional network model having reached the convergence state.
Further, the creating unit includes: a connection unit, configured to densely connect a plurality of three-dimensional convolutional layers to obtain the deep three-dimensional convolutional neural network model, wherein the input of any three-dimensional convolutional layer in the deep three-dimensional convolutional neural network model may comprise the concatenation of the output features of every three-dimensional convolutional layer preceding it.
Further, the second training unit includes: a first splitting subunit, configured to split the converged shallow three-dimensional convolutional neural network model into a plurality of first sub-models; a second splitting subunit, configured to split the first target deep three-dimensional convolutional network model into a plurality of second sub-models, the number of second sub-models being equal to the number of first sub-models; a processing subunit, configured to order the first sub-models and the second sub-models respectively and to train each second sub-model, using the first sub-model at the same position in the ordering as its supervision signal, until that second sub-model reaches the convergence state; and a cascading subunit, configured to cascade all of the converged second sub-models to obtain the second target deep three-dimensional convolutional network model.
Further, the processing subunit includes: a processing module, configured to take as a supervision signal the weighted combination of the output of the last first sub-model in the ordering and the labels of the preset video sequence data set, and to train the last second sub-model in the ordering with this signal until it reaches the convergence state.
According to another aspect of the embodiments of the present invention, there is provided a storage medium, where the storage medium includes a stored program, and when the program runs, the apparatus where the storage medium is located is controlled to execute the method for creating the deep three-dimensional convolutional neural network.
According to another aspect of the embodiments of the present invention, there is further provided a processor, wherein the processor is configured to execute a program, where the program executes the method for creating a deep three-dimensional convolutional neural network.
In the embodiments of the present invention, a preset shallow three-dimensional convolutional neural network model is trained on a preset video sequence data set until it reaches a convergence state; a deep three-dimensional convolutional neural network model is created in a dense connection manner, the deep model containing more three-dimensional convolutional layers than the preset shallow model; all or some of the three-dimensional convolutional layers in the deep model are split into three-dimensional convolution units, each comprising a bottleneck layer and a three-dimensional convolutional layer, to obtain a first target deep three-dimensional convolutional network model; and the preset video sequence data set together with the converged shallow model is set as a supervision signal, with which the first target model is trained to obtain a second target deep three-dimensional convolutional network model, i.e. the first target model having reached the convergence state.
The embodiments of the present invention thereby improve the network performance of the three-dimensional convolutional neural network while reducing the parameter count and deepening the network hierarchy during network creation, solving the technical problem of poor network performance in three-dimensional convolutional neural networks created by the prior art.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
FIG. 1 is a schematic flow chart diagram illustrating an alternative method for creating a deep three-dimensional convolutional neural network, according to an embodiment of the present invention;
FIG. 2 is a schematic flow chart diagram illustrating an alternative method for creating a deep three-dimensional convolutional neural network, in accordance with an embodiment of the present invention;
FIG. 3 is a schematic structural diagram of an alternative deep three-dimensional convolutional neural network creation apparatus according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
Example 1
In accordance with an embodiment of the present invention, there is provided an embodiment of a method for creating a deep three-dimensional convolutional neural network, wherein the steps illustrated in the flow chart of the figure may be performed in a computer system, such as a set of computer-executable instructions, and although a logical order is illustrated in the flow chart, in some cases, the steps illustrated or described may be performed in an order different than that illustrated herein.
Fig. 1 is a schematic flowchart of an optional method for creating a deep three-dimensional convolutional neural network according to an embodiment of the present invention, and as shown in fig. 1, the method includes the following steps:
step S102, training a preset shallow three-dimensional convolutional neural network model according to a preset video sequence data set until the preset shallow three-dimensional convolutional neural network model reaches a convergence state;
step S104, creating a deep three-dimensional convolutional neural network model according to a dense connection mode, wherein the number of the layers of the three-dimensional convolutional layers contained in the deep three-dimensional convolutional neural network model is higher than the number of the layers of the three-dimensional convolutional layers contained in a preset shallow three-dimensional convolutional neural network model;
step S106, splitting all or part of three-dimensional convolution layers in the deep three-dimensional convolution neural network model into three-dimensional convolution units to obtain a first target deep three-dimensional convolution network model, wherein the three-dimensional convolution units comprise bottleneck layers and three-dimensional convolution layers;
step S108, setting the preset video sequence data set and the converged preset shallow three-dimensional convolutional neural network model as a supervision signal, and training the first target deep three-dimensional convolutional network model according to the supervision signal to obtain a second target deep three-dimensional convolutional network model, wherein the second target deep three-dimensional convolutional network model is the first target deep three-dimensional convolutional network model having reached the convergence state.
Optionally, in step S102, the shallow three-dimensional convolutional neural network model may be constructed from three-dimensional convolution kernels; for example, the shallow three-dimensional convolutional neural network model may be a three-dimensional convolutional neural network A, and network A may comprise multiple three-dimensional convolutional layers, three-dimensional pooling layers, and fully connected layers.
Further, the three-dimensional convolution kernel of the i-th convolutional layer of network A has size [w_i, h_i, f_i], corresponding to the three dimensions of width, height, and frame number/channel number.
Optionally, network A may adopt the C3D network structure, which comprises 8 three-dimensional convolutional layers and 2 fully connected layers, respectively:
a first convolutional layer, with three-dimensional convolution kernel size 3x3x3, stride 1, and 64 kernels;
a first pooling layer, with three-dimensional pooling kernel size 2x2x1 and stride 2x2x1;
a second convolutional layer, with three-dimensional convolution kernel size 3x3x3, stride 1, and 128 kernels;
a second pooling layer, with three-dimensional pooling kernel size 2x2x2 and stride 2x2x2;
a third convolutional layer, with three-dimensional convolution kernel size 3x3x3, stride 1, and 256 kernels;
a fourth convolutional layer, with three-dimensional convolution kernel size 3x3x3, stride 1, and 256 kernels;
a third pooling layer, with three-dimensional pooling kernel size 2x2x2 and stride 2x2x2;
a fifth convolutional layer, with three-dimensional convolution kernel size 3x3x3, stride 1, and 512 kernels;
a sixth convolutional layer, with three-dimensional convolution kernel size 3x3x3, stride 1, and 512 kernels;
a fourth pooling layer, with three-dimensional pooling kernel size 2x2x2 and stride 2x2x2;
a seventh convolutional layer, with three-dimensional convolution kernel size 3x3x3, stride 1, and 512 kernels;
an eighth convolutional layer, with three-dimensional convolution kernel size 3x3x3, stride 1, and 512 kernels;
a fifth pooling layer, with three-dimensional pooling kernel size 2x2x2 and stride 2x2x2;
a first fully connected layer, with 4096 neurons;
a second fully connected layer, with 4096 neurons.
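As a sanity check on the scale of such a network, the sketch below tallies the convolutional-layer parameters of the C3D-style configuration listed above. The 3-channel RGB input and the one-bias-per-kernel convention are assumptions; the fully connected layers are omitted here.

```python
# Convolutional-layer parameter tally for the C3D-style network A above
# (3x3x3 kernels throughout; biases included; FC layers omitted).

layers = [  # (in_channels, out_channels) for the eight conv layers listed
    (3, 64), (64, 128), (128, 256), (256, 256),
    (256, 512), (512, 512), (512, 512), (512, 512),
]

def conv3d_params(in_ch, out_ch, k=3):
    return out_ch * (in_ch * k * k * k + 1)   # weights + one bias per kernel

total = sum(conv3d_params(i, o) for i, o in layers)
print(f"conv parameters: {total:,}")
```

Even before the large fully connected layers are counted, the convolutional stack alone runs to tens of millions of parameters, which is the training burden the patent's bottleneck split is designed to relieve.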
Alternatively, in step S102, the preset video sequence data set may be a behavior recognition data set such as UCF101 or Sports-1M.
Optionally, in step S104, the deep three-dimensional convolutional neural network model, for example network B, is created in the dense connection manner; network B comprises multiple three-dimensional convolutional layers, three-dimensional pooling layers, and fully connected layers. The three-dimensional convolutional layers adopt the dense connection form, which raises feature reuse and thereby improves network accuracy.
Further, the three-dimensional convolution kernel of the j-th convolutional layer of network B has size [w_j, h_j, f_j], corresponding to the width, height, and frame number/channel number.
Further, the dense connection of network B means that the input of each three-dimensional convolutional layer is the concatenation of the output feature maps of all three-dimensional convolutional layers preceding that layer.
optionally, the network B comprises 20 three-dimensional convolutional layers and 1 fully-connected layer, which are:
the method comprises the steps of firstly, laminating layers, wherein the size of a three-dimensional convolution kernel is 3x3x3, the step length is 1, and the number is 16;
a second convolution layer, the size of the three-dimensional convolution kernel is 3x3x3, the step length is 1, and the number is 16;
a first pooling layer, the three-dimensional pooling core size being 2x2x1, the step size being 2x2x1;
a third convolution layer, the size of the three-dimensional convolution kernel is 3x3x3, the step length is 1, and the number is 32;
a fourth convolution layer, the size of the three-dimensional convolution kernel is 3x3x3, the step length is 1, and the number is 32;
a second pooling layer, the three-dimensional pooling core size being 2x2x2, the step size being 2x2x2;
a fifth convolution layer, the size of the three-dimensional convolution kernel is 3x3x3, the step length is 1, and the number is 32;
a sixth convolution layer, the size of the three-dimensional convolution kernel is 3x3x3, the step length is 1, and the number is 32;
a third pooling layer, the three-dimensional pooling core size being 2x2x2, the step size being 2x2x2;
a seventh convolution layer, the size of the three-dimensional convolution kernel is 3x3x3, the step length is 1, and the number is 64;
the eighth convolution layer, the size of the three-dimensional convolution kernel is 3x3x3, the step length is 1, and the number is 64;
a fourth pooling layer, the three-dimensional pooling core size being 2x2x2, the step length being 2x2x2;
the ninth convolution layer has a three-dimensional convolution kernel size of 3x3x3, a step length of 1 and a number of 64;
the tenth convolution layer has a three-dimensional convolution kernel size of 3x3x3, a step length of 1 and a number of 64;
a fifth pooling layer, the three-dimensional pooling core size being 2x2x2, the step length being 2x2x2;
an eleventh convolution layer, the three-dimensional convolution kernel size being 3x3x3, the step size being 1, the number being 128;
a twelfth convolution layer, the size of the three-dimensional convolution kernel is 3x3x3, the step length is 1, and the number is 128;
a sixth pooling layer, the three-dimensional pooling core size being 2x2x2, the step size being 2x2x2;
a thirteenth convolution layer, the size of the three-dimensional convolution kernel is 3x3x3, the step length is 1, and the number is 128;
a fourteenth convolution layer, the size of the three-dimensional convolution kernel is 3x3x3, the step length is 1, and the number is 128;
a seventh pooling layer, the three-dimensional pooling core size being 2x2x2, the step size being 2x2x2;
a fifteenth convolution layer, the size of the three-dimensional convolution kernel is 3x3x3, the step length is 1, and the number is 256;
sixteenth convolution layer, the size of the three-dimensional convolution kernel is 3x3x3, the step length is 1, and the number is 256;
seventeenth convolution layer, the size of the three-dimensional convolution kernel is 3x3x3, the step length is 1, and the number is 256;
an eighth pooling layer, the three-dimensional pooling core size being 2x2x2, the step size being 2x2x2;
eighteenth convolution layer, three-dimensional convolution kernel size is 3x3x3, step length is 1, number is 512;
the nineteenth convolution layer has a three-dimensional convolution kernel size of 3x3x3, a step length of 1 and a number of 512;
a twentieth convolution layer, the three-dimensional convolution kernel size is 3x3x3, the step length is 1, and the number is 512;
a ninth pooling layer, the three-dimensional pooling core size being 2x2x2, the step size being 2x2x2;
and the number of the neurons in the full connection layer is 1024.
Optionally, in step S106, all or some of the three-dimensional convolutional layers in the deep three-dimensional convolutional neural network model are split into three-dimensional convolution units to obtain a first target deep three-dimensional convolutional network model; for example, the first target deep three-dimensional convolutional network model may be network C. Specifically, some or all of the three-dimensional convolutional layers in network B are split into 1x1x1 grouped convolutional layers and three-dimensional convolutional layers to obtain network C, thereby reducing the parameter count. The splitting means that the j-th convolutional layer, whose kernel size is [w_j, h_j, f_j], is split into a 1x1x1 grouped convolutional layer and a three-dimensional convolutional layer with kernel size [w_j, h_j, f_j], which reduces the number of input channels produced by the dense connections and thus lowers both the parameter count and the computational cost.
Optionally, a grouped convolution layer means that each convolution kernel of the layer performs the convolution operation only with the input channels in its corresponding group, and not with the input channels of the other groups, further reducing the number of parameters. If the number of input channels is m and the number of groups is n, where n is a positive integer not less than 1, then each group contains m/n input channels. In conventional convolution, each kernel is computed over all m input channels; when n is 1, grouped convolution is therefore equivalent to conventional convolution. In grouped convolution, each kernel is connected only to the m/n input channels of its own group.
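A minimal sketch of this connectivity (our own hypothetical implementation, shown at a single voxel since a 1x1x1 convolution mixes channels independently at every spatial position):

```python
def grouped_1x1x1_conv(channels, weight, n_groups):
    """Apply a 1x1x1 grouped convolution at one voxel.
    channels: list of m input-channel values.
    weight:   list of `out` kernels, each a list of m // n_groups coefficients,
              because a kernel only connects to its own group's channels."""
    m, out = len(channels), len(weight)
    gi, go = m // n_groups, out // n_groups   # channels per group, kernels per group
    y = []
    for k, kernel in enumerate(weight):
        g = k // go                              # which group this kernel belongs to
        x_g = channels[g * gi:(g + 1) * gi]      # its m/n input channels only
        y.append(sum(w * v for w, v in zip(kernel, x_g)))
    return y

x = [1.0, 2.0, 3.0, 4.0]                 # m = 4 input channels
w = [[1, 0], [0, 1], [1, 1], [2, 0]]     # 4 kernels, n = 2 -> 2 coefficients each
assert grouped_1x1x1_conv(x, w, n_groups=2) == [1.0, 2.0, 7.0, 6.0]

# Group 0's outputs never see group 1's inputs: zeroing the second group
# leaves the first two output channels unchanged.
assert grouped_1x1x1_conv([1.0, 2.0, 0.0, 0.0], w, 2)[:2] == [1.0, 2.0]
```

With n_groups=1 the loop degenerates to every kernel seeing all m channels, i.e. the conventional convolution case described above.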
Optionally, splitting all three-dimensional convolutional layers of the network B to obtain a network C, including:
a first bottleneck layer, the size of the three-dimensional convolution kernel is 1x1x1, the number of groups is 1, the step length is 1, and the number is 16;
a first convolution layer, the size of the three-dimensional convolution kernel is 3x3x3, the step length is 1, and the number is 16;
a second bottleneck layer, the size of the three-dimensional convolution kernel is 1x1x1, the number of groups is 1, the step length is 1, and the number is 16;
a second convolution layer, the size of the three-dimensional convolution kernel is 3x3x3, the step length is 1, and the number is 16;
a first pooling layer, the three-dimensional pooling core size being 2x2x1, the step size being 2x2x1;
a third bottleneck layer, the size of the three-dimensional convolution kernel is 1x1x1, the number of groups is 2, the step length is 1, and the number is 32;
a third convolution layer, the size of the three-dimensional convolution kernel is 3x3x3, the step length is 1, and the number is 32;
a fourth bottleneck layer, the size of the three-dimensional convolution kernel is 1x1x1, the number of groups is 2, the step length is 1, and the number is 32;
a fourth convolution layer, the size of the three-dimensional convolution kernel is 3x3x3, the step length is 1, and the number is 32;
a second pooling layer, the three-dimensional pooling core size being 2x2x2, the step size being 2x2x2;
a fifth bottleneck layer, the size of the three-dimensional convolution kernel is 1x1x1, the number of groups is 2, the step length is 1, and the number is 32;
a fifth convolution layer, the size of the three-dimensional convolution kernel is 3x3x3, the step length is 1, and the number is 32;
a sixth bottleneck layer, the size of the three-dimensional convolution kernel is 1x1x1, the number of groups is 2, the step length is 1, and the number is 32;
a sixth convolution layer, the size of the three-dimensional convolution kernel is 3x3x3, the step length is 1, and the number is 32;
a third pooling layer, the three-dimensional pooling core size being 2x2x2, the step length being 2x2x2;
a seventh bottleneck layer, the size of the three-dimensional convolution kernel is 1x1x1, the number of groups is 4, the step length is 1, and the number is 64;
a seventh convolution layer, the size of the three-dimensional convolution kernel is 3x3x3, the step length is 1, and the number is 64;
an eighth bottleneck layer, the size of the three-dimensional convolution kernel is 1x1x1, the number of groups is 4, the step length is 1, and the number is 64;
the eighth convolution layer, the size of the three-dimensional convolution kernel is 3x3x3, the step length is 1, and the number is 64;
a fourth pooling layer, the three-dimensional pooling core size being 2x2x2, the step length being 2x2x2;
a ninth bottleneck layer, the size of the three-dimensional convolution kernel is 1x1x1, the number of groups is 4, the step length is 1, and the number is 64;
the ninth convolution layer has a three-dimensional convolution kernel size of 3x3x3, a step length of 1 and a number of 64;
in the tenth bottleneck layer, the size of a three-dimensional convolution kernel is 1x1x1, the number of groups is 4, the step length is 1, and the number is 64;
the tenth convolution layer has a three-dimensional convolution kernel size of 3x3x3, a step length of 1 and a number of 64;
a fifth pooling layer, the three-dimensional pooling core size being 2x2x2, the step length being 2x2x2;
in the eleventh bottleneck layer, the size of the three-dimensional convolution kernel is 1x1x1, the number of groups is 8, the step length is 1, and the number is 128;
an eleventh convolution layer, the three-dimensional convolution kernel size being 3x3x3, the step size being 1, the number being 128;
a twelfth bottleneck layer, the size of the three-dimensional convolution kernel is 1x1x1, the number of the groups is 8, the step length is 1, and the number is 128;
a twelfth convolution layer, the size of the three-dimensional convolution kernel is 3x3x3, the step length is 1, and the number is 128;
a sixth pooling layer, the three-dimensional pooling core size being 2x2x2, the step size being 2x2x2;
a thirteenth bottleneck layer, the size of the three-dimensional convolution kernel is 1x1x1, the number of groups is 8, the step length is 1, and the number is 128;
a thirteenth convolution layer, the size of the three-dimensional convolution kernel is 3x3x3, the step length is 1, and the number is 128;
a fourteenth bottleneck layer, the size of the three-dimensional convolution kernel is 1x1x1, the number of groups is 8, the step length is 1, and the number is 128;
a fourteenth convolution layer, the size of the three-dimensional convolution kernel is 3x3x3, the step length is 1, and the number is 128;
a seventh pooling layer, the three-dimensional pooling core size being 2x2x2, the step size being 2x2x2;
a fifteenth bottleneck layer, the size of the three-dimensional convolution kernel is 1x1x1, the number of groups is 16, the step length is 1, and the number is 256;
a fifteenth convolution layer, the size of the three-dimensional convolution kernel is 3x3x3, the step length is 1, and the number is 256;
in the sixteenth bottleneck layer, the size of the three-dimensional convolution kernel is 1x1x1, the number of groups is 16, the step length is 1, and the number is 256;
sixteenth convolution layer, the size of the three-dimensional convolution kernel is 3x3x3, the step length is 1, and the number is 256;
a seventeenth bottleneck layer, the size of the three-dimensional convolution kernel is 1x1x1, the number of groups is 16, the step length is 1, and the number is 256;
seventeenth convolution layer, the size of the three-dimensional convolution kernel is 3x3x3, the step length is 1, and the number is 256;
an eighth pooling layer, the three-dimensional pooling core size being 2x2x2, the step size being 2x2x2;
an eighteenth bottleneck layer, the size of the three-dimensional convolution kernel is 1x1x1, the number of groups is 32, the step length is 1, and the number is 512;
eighteenth convolution layer, three-dimensional convolution kernel size is 3x3x3, step length is 1, number is 512;
in the nineteenth bottleneck layer, the size of the three-dimensional convolution kernel is 1x1x1, the number of groups is 32, the step length is 1, and the number is 512;
the nineteenth convolution layer has a three-dimensional convolution kernel size of 3x3x3, a step length of 1 and a number of 512;
in the twentieth bottleneck layer, the size of the three-dimensional convolution kernel is 1x1x1, the number of the groups is 32, the step length is 1, and the number is 512;
a twentieth convolution layer, the three-dimensional convolution kernel size is 3x3x3, the step length is 1, and the number is 512;
a ninth pooling layer, the three-dimensional pooling core size being 2x2x2, the step size being 2x2x2;
and a full connection layer, the number of neurons being 1024.
Optionally, in step S108, the preset video sequence data set and the preset shallow three-dimensional convolutional neural network model that has reached the convergence state are set as the supervision signal, and the first target deep three-dimensional convolutional network model is trained according to the supervision signal to obtain the second target deep three-dimensional convolutional network model. The second target deep three-dimensional convolutional network model has a deep structure that can fully extract abstract features of the input data; because the network adopts a densely connected form, the feature utilization rate is high and performance accuracy is guaranteed.
Optionally, creating the deep three-dimensional convolutional neural network model according to the dense connection mode comprises: densely connecting the plurality of three-dimensional convolutional layers to obtain the deep three-dimensional convolutional neural network model, wherein the input of any three-dimensional convolutional layer in the deep three-dimensional convolutional neural network model may comprise the cascade (concatenation) of the output features of every three-dimensional convolutional layer preceding that layer.
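The dense connection pattern described above can be sketched as follows (an illustrative example of ours — the toy "layers" merely produce fixed channel counts, standing in for real 3D convolutions):

```python
def dense_forward(x, layers, concat):
    """x: initial feature; layers: callables; concat: joins feature maps along
    the channel axis. Layer i receives the cascade of the input and the
    outputs of every preceding layer."""
    feats = [x]
    for layer in layers:
        out = layer(concat(feats))   # input = concatenation of all earlier features
        feats.append(out)
    return feats

# Toy demonstration: a "feature map" is a flat list of channel values, and
# every "convolution" emits 16 channels regardless of how many it receives.
concat = lambda fs: [c for f in fs for c in f]
layer = lambda inp: [0.0] * 16

feats = dense_forward([0.0] * 3, [layer] * 4, concat)

# Input channel counts seen by layers 1..4 grow as earlier outputs accumulate.
assert [len(concat(feats[:i + 1])) for i in range(4)] == [3, 19, 35, 51]
```

This growth of the input-channel count at later layers is exactly what motivates the 1x1x1 bottleneck split described in step S106.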
Fig. 2 is a schematic flowchart of an optional method for creating a deep three-dimensional convolutional neural network according to an embodiment of the present invention, and as shown in fig. 2, executing step S108, setting a preset video sequence data set and a preset shallow three-dimensional convolutional neural network model that reaches a convergence state as a supervision signal, and training a first target deep three-dimensional convolutional network model according to the supervision signal to obtain a second target deep three-dimensional convolutional network model includes:
step S202, splitting the shallow three-dimensional convolutional neural network model reaching the convergence state to obtain a plurality of first sub-models;
step S204, splitting the first target deep three-dimensional convolutional neural network model respectively to obtain a plurality of second submodels, wherein the number of the second submodels is equal to that of the first submodels;
step S206, respectively arranging the plurality of first submodels and the plurality of second submodels, taking the first submodels in arrangement as supervision signals, and training the second submodels with the same arrangement serial numbers as the first submodels until the second submodels reach a convergence state;
and S208, cascading all the second submodels which reach the convergence state to obtain a second target deep three-dimensional convolution network model.
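Steps S202 to S208 can be outlined structurally as follows (a sketch of ours: "models" are just lists of layer names, and the `train` callback — which in practice would run knowledge distillation — is simulated):

```python
import math

def split_into_submodels(layers, s):
    """Return s nested front-portions of a model; the s-th equals the model."""
    n = len(layers)
    return [layers[:math.ceil(n * i / s)] for i in range(1, s + 1)]

def staged_training(teacher_layers, student_layers, s, train):
    """Pair the i-th teacher submodel with the i-th student submodel (same
    position in the arrangement) and train them in order; each converged
    student seeds the parameters of the next, larger student."""
    teachers = split_into_submodels(teacher_layers, s)
    students = split_into_submodels(student_layers, s)
    converged = None
    for t, c in zip(teachers, students):
        converged = train(teacher=t, student=c, init_from=converged)
    return converged   # final student spans the whole deep network

# Toy run: `train` just records the pairing instead of optimising anything.
log = []
train = lambda teacher, student, init_from: (
    log.append((len(teacher), len(student))) or student)

final = staged_training(list("ABCD"), list("abcdefgh"), s=2, train=train)
assert log == [(2, 4), (4, 8)]          # submodel i trained against teacher i
assert final == list("abcdefgh")        # last student is the full network
```

The cascading of step S208 corresponds here to the fact that each student submodel is a front-portion of the next, so the final converged student already contains all earlier ones.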
Optionally, the process from step S202 to step S208 is executed, for example, as follows:
(1) Set the number of supervision signals to s, where s is an integer not less than 1; when s is 1, the method is equivalent to the conventional training mode. (2) Based on the converged network A, obtain s supervision networks [A_1, A_2, ..., A_s], where the i-th supervision network A_i comprises the front ⌈i·L_A/s⌉ layers of network A (L_A being the number of layers of A and ⌈·⌉ denoting rounding up), so that A_s is equivalent to A. (3) Based on the network to be trained C, obtain s networks to be trained [C_1, C_2, ..., C_s], where the j-th network to be trained C_j comprises the front ⌈j·L_C/s⌉ layers of network C, so that C_s is equivalent to C. (4) With network A_1 as the teacher model, train C_1 by the knowledge distillation method until convergence, then update the parameters of C_2 from the converged C_1. (5) With network A_2 as the teacher model, train C_2 by the knowledge distillation method until convergence, then update the parameters of C_3 from the converged C_2. (6) Repeat this process until network A_s is used as the teacher model to train C_s by the knowledge distillation method until convergence; the resulting network is the target network.
Optionally, s is set to 2. Supervision network A_1 comprises 4 convolutional layers and A_2 is equivalent to A; network to be trained C_1 comprises 10 bottleneck layers and 10 convolutional layers and C_2 is equivalent to C.
Optionally, taking a first sub-model in the permutation as a supervision signal, and training a second sub-model with the same permutation number as that of the first sub-model includes: and taking the output of the first sub-model corresponding to the last arrangement serial number in the arrangement and the calculation result obtained after the weighting calculation of the label of the preset video sequence data set as a supervision signal, and training the second sub-model corresponding to the last arrangement serial number in the arrangement until the second sub-model reaches a convergence state.
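The weighted supervision signal for the last sub-model can be sketched as follows (an illustrative form of ours — the patent specifies a weighting of the teacher output with the data-set label but not the coefficient, which we call `weight` here):

```python
def weighted_supervision(teacher_probs, label, weight=0.7):
    """Blend the teacher sub-model's output distribution with the one-hot
    label from the video sequence data set; `weight` balances the two."""
    num_classes = len(teacher_probs)
    one_hot = [1.0 if i == label else 0.0 for i in range(num_classes)]
    return [weight * t + (1.0 - weight) * o
            for t, o in zip(teacher_probs, one_hot)]

target = weighted_supervision([0.6, 0.3, 0.1], label=0, weight=0.7)
# Still a valid distribution, pulled further toward the ground-truth class.
assert abs(sum(target) - 1.0) < 1e-9
assert target[0] == max(target)
```

The resulting blended distribution then serves as the supervision signal for training the final second sub-model until convergence.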
Optionally, in the embodiment of the present invention, a deep three-dimensional convolutional neural network is constructed; the deep network structure helps extract high-dimensional abstract features and thereby improves the expressive capability of the network. Meanwhile, the network adopts a dense connection mode, in which later convolutional layers receive output features from the earlier layers, improving feature reuse and ensuring network performance. Secondly, to address the difficulty of training, a multi-supervision-signal training mode is adopted: combined with the knowledge distillation method, several shallow supervision models conduct guided training to accelerate network convergence. Moreover, to reduce the number of parameters, the network uses 1x1x1 bottleneck layers to reduce the dimensionality of the input channels, and further reduces the number of connections and parameters, as well as the computational burden, by means of grouped convolution. In summary, the invention effectively solves the technical problem of poor network performance of three-dimensional convolutional neural networks created by the prior art, provides a method for constructing and training a three-dimensional convolutional neural network that is deep in hierarchy, low in parameter count and high in performance, and can be used in application fields such as action recognition, sequence analysis and video similarity comparison.
In the embodiment of the invention, a preset shallow three-dimensional convolutional neural network model is trained according to a preset video sequence data set until the preset shallow three-dimensional convolutional neural network model reaches a convergence state; creating a deep three-dimensional convolutional neural network model according to a dense connection mode, wherein the number of the three-dimensional convolutional layers contained in the deep three-dimensional convolutional neural network model is higher than that of the three-dimensional convolutional layers contained in a preset shallow three-dimensional convolutional neural network model; splitting all or part of three-dimensional convolution layers in the deep three-dimensional convolution neural network model into three-dimensional convolution units to obtain a first target deep three-dimensional convolution network model, wherein the three-dimensional convolution units comprise bottleneck layers and three-dimensional convolution layers; the preset video sequence data set and the preset shallow three-dimensional convolutional neural network model reaching the convergence state are set as the supervision signal, and the purpose of training the first target deep three-dimensional convolutional network model according to the supervision signal to obtain the second target deep three-dimensional convolutional network model is achieved, wherein the second target deep three-dimensional convolutional network model is the first target deep three-dimensional convolutional network model reaching the convergence state. 
The embodiment of the invention realizes the technical effects of improving the network performance of the three-dimensional convolutional neural network, reducing the network parameter and deepening the network hierarchy in the network creating process, thereby solving the technical problem of poor network performance of the created three-dimensional convolutional neural network in the prior art.
Example 2
According to another aspect of the embodiments of the present invention, there is also provided a device for creating a deep three-dimensional convolutional neural network, as shown in fig. 3, the device including: the first training unit 301 is configured to train a preset shallow three-dimensional convolutional neural network model according to a preset video sequence data set until the preset shallow three-dimensional convolutional neural network model reaches a convergence state; a creating unit 303, configured to create a deep three-dimensional convolutional neural network model according to a dense connection manner, where a number of layers of three-dimensional convolutional layers included in the deep three-dimensional convolutional neural network model is higher than a number of layers of three-dimensional convolutional layers included in a preset shallow three-dimensional convolutional neural network model; the splitting unit 305 is configured to split all or part of three-dimensional convolution layers in the deep three-dimensional convolutional neural network model into three-dimensional convolution units to obtain a first target deep three-dimensional convolutional network model, where the three-dimensional convolution units include a bottleneck layer and a three-dimensional convolution layer; the second training unit 307 is configured to set a preset video sequence data set and a preset shallow three-dimensional convolutional neural network model which reaches a convergence state as a supervision signal, train the first target deep three-dimensional convolutional network model according to the supervision signal, and obtain a second target deep three-dimensional convolutional network model, where the second target deep three-dimensional convolutional network model is the first target deep three-dimensional convolutional network model which reaches the convergence state.
Optionally, the creating unit includes: and the connecting unit is used for densely connecting the plurality of three-dimensional convolutional layers to obtain a deep three-dimensional convolutional neural network model, wherein the input of any one layer of three-dimensional convolutional layer in the deep three-dimensional convolutional neural network model can comprise cascade connection of the output characteristics of each three-dimensional convolutional layer in front of the any one layer of three-dimensional convolutional layer.
Optionally, the second training unit comprises: the first splitting subunit is used for splitting the shallow three-dimensional convolutional neural network model reaching the convergence state to obtain a plurality of first submodels; the second splitting subunit is used for splitting the first target deep three-dimensional convolutional neural network model respectively to obtain a plurality of second submodels, wherein the number of the second submodels is equal to that of the first submodels; the processing subunit is used for respectively arranging the plurality of first submodels and the plurality of second submodels, taking the first submodels in arrangement as supervision signals, and training the second submodels with the same arrangement serial numbers as the first submodels until the second submodels reach a convergence state; and the cascade subunit is used for cascading all the second submodels which reach the convergence state to obtain a second target deep three-dimensional convolution network model.
Optionally, the processing subunit comprises: and the processing module is used for taking the calculation result obtained after the output of the first sub-model corresponding to the last arrangement serial number in the arrangement and the label of the preset video sequence data set are subjected to weighting calculation as a supervision signal, and training the second sub-model corresponding to the last arrangement serial number in the arrangement until the second sub-model reaches a convergence state.
According to another aspect of the embodiments of the present invention, there is further provided a storage medium, where the storage medium includes a stored program, and when the program runs, the apparatus on which the storage medium is located is controlled to execute the method for creating the deep three-dimensional convolutional neural network in embodiment 1 of the present application.
According to another aspect of the embodiments of the present invention, there is further provided a processor, wherein the processor is configured to execute a program, and the program executes the method for creating the deep three-dimensional convolutional neural network in embodiment 1 of the present application.
In the embodiment of the invention, a preset shallow three-dimensional convolutional neural network model is trained according to a preset video sequence data set until the preset shallow three-dimensional convolutional neural network model reaches a convergence state; creating a deep three-dimensional convolutional neural network model according to a dense connection mode, wherein the hierarchy number of three-dimensional convolutional layers contained in the deep three-dimensional convolutional neural network model is higher than that of three-dimensional convolutional layers contained in a preset shallow three-dimensional convolutional neural network model; splitting all or part of three-dimensional convolution layers in the deep three-dimensional convolution neural network model into three-dimensional convolution units to obtain a first target deep three-dimensional convolution network model, wherein the three-dimensional convolution units comprise bottleneck layers and three-dimensional convolution layers; the preset video sequence data set and the preset shallow three-dimensional convolutional neural network model reaching the convergence state are set as the supervision signal, and the purpose of training the first target deep three-dimensional convolutional network model according to the supervision signal to obtain the second target deep three-dimensional convolutional network model is achieved, wherein the second target deep three-dimensional convolutional network model is the first target deep three-dimensional convolutional network model reaching the convergence state. 
The embodiment of the invention realizes the technical effects of improving the network performance of the three-dimensional convolutional neural network, reducing the network parameter and deepening the network hierarchy in the network creating process, thereby solving the technical problem of poor network performance of the created three-dimensional convolutional neural network in the prior art.
The above-mentioned serial numbers of the embodiments of the present invention are only for description, and do not represent the advantages and disadvantages of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed coupling or direct coupling or communication connection between each other may be an indirect coupling or communication connection through some interfaces, units or modules, and may be electrical or in other forms.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one position, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention, in essence or in the part contributing to the prior art, or all or part of the technical solution, may be embodied in a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to perform all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes: a USB flash disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, and various other media capable of storing program code.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that those skilled in the art can make various modifications and refinements without departing from the principle of the present invention; such modifications and refinements shall also fall within the protection scope of the present invention.

Claims (10)

1. A method for creating a deep three-dimensional convolutional neural network, comprising:
training a preset shallow three-dimensional convolutional neural network model according to a preset video sequence data set until the preset shallow three-dimensional convolutional neural network model reaches a convergence state;
creating a deep three-dimensional convolutional neural network model according to a dense connection mode, wherein the three-dimensional convolutional layers contained in the deep three-dimensional convolutional neural network model have higher layer numbers than the three-dimensional convolutional layers contained in the preset shallow three-dimensional convolutional neural network model;
splitting all or part of the three-dimensional convolution layer in the deep three-dimensional convolution neural network model into three-dimensional convolution units to obtain a first target deep three-dimensional convolution neural network model, wherein the three-dimensional convolution units comprise bottleneck layers and the three-dimensional convolution layers;
setting the preset video sequence data set and the preset shallow three-dimensional convolutional neural network model reaching the convergence state as a supervision signal, and training the first target deep three-dimensional convolutional neural network model according to the supervision signal to obtain a second target deep three-dimensional convolutional network model, wherein the second target deep three-dimensional convolutional network model is the first target deep three-dimensional convolutional neural network model reaching the convergence state;
the splitting is as follows: the size of the j layer convolution kernel is [ w ] j ,h j ,f j ]Split into a layer of 1x1x1 packet convolutional layers and a layer of convolutional kernel size [ w [ [ w ] j ,h j ,f j ]The three-dimensional convolution layer of [ w ] j ,h j ,f j ]Corresponding to the width, length, frame number/channel number.
2. The method of claim 1, wherein creating the deep three-dimensional convolutional neural network model according to the dense connection approach comprises:
and densely connecting the three-dimensional convolutional layers to obtain the deep three-dimensional convolutional neural network model, wherein the input of any layer of the three-dimensional convolutional layer in the deep three-dimensional convolutional neural network model comprises the cascade connection of the output characteristics of each three-dimensional convolutional layer in front of the any layer of the three-dimensional convolutional layer.
3. The method of claim 1, wherein the setting the preset video sequence data set and the preset shallow three-dimensional convolutional neural network model reaching the convergence state as a supervisory signal, and the training the first target deep three-dimensional convolutional neural network model according to the supervisory signal to obtain a second target deep three-dimensional convolutional neural network model comprises:
splitting the shallow three-dimensional convolutional neural network model reaching the convergence state to obtain a plurality of first sub-models;
splitting the first target deep three-dimensional convolutional neural network model respectively to obtain a plurality of second submodels, wherein the number of the second submodels is equal to that of the first submodels;
respectively arranging the plurality of first submodels and the plurality of second submodels, taking the first submodels in the arrangement as supervision signals, and training the second submodels with the same arrangement sequence numbers as the first submodels until the second submodels reach the convergence state;
and all the second submodels which reach the convergence state are cascaded to obtain the second target deep three-dimensional convolution network model.
4. The method of claim 3, wherein training the second sub-model with the same permutation number as the first sub-model using the first sub-model in the permutation as a supervisory signal comprises:
and taking the output of the first sub-model corresponding to the last arrangement serial number in the arrangement and the calculation result obtained after the weighting calculation of the label of the preset video sequence data set as a supervision signal, and training the second sub-model corresponding to the last arrangement serial number in the arrangement until the second sub-model reaches the convergence state.
5. An apparatus for creating a deep three-dimensional convolutional neural network, comprising:
the system comprises a first training unit, a second training unit and a third training unit, wherein the first training unit is used for training a preset shallow three-dimensional convolutional neural network model according to a preset video sequence data set until the preset shallow three-dimensional convolutional neural network model reaches a convergence state;
the creating unit is used for creating a deep three-dimensional convolutional neural network model according to a dense connection mode, wherein the number of layers of three-dimensional convolutional layers contained in the deep three-dimensional convolutional neural network model is higher than the number of layers of the three-dimensional convolutional layers contained in the preset shallow three-dimensional convolutional neural network model;
the splitting unit is used for splitting all or part of the three-dimensional convolution layers in the deep three-dimensional convolution neural network model into three-dimensional convolution units to obtain a first target deep three-dimensional convolution neural network model, wherein the three-dimensional convolution units comprise bottleneck layers and the three-dimensional convolution layers;
the splitting being as follows: a j-th-layer three-dimensional convolutional layer with convolution kernel size [w_j, h_j, f_j] is split into a 1x1x1 grouped convolutional layer and a three-dimensional convolutional layer with convolution kernel size [w_j, h_j, f_j], wherein [w_j, h_j, f_j] corresponds to the three dimensions of width, length, and frame number/channel number;
and a second training unit, configured to set the preset video sequence data set and the preset shallow three-dimensional convolutional neural network model that has reached the convergence state as a supervision signal, and to train the first target deep three-dimensional convolutional neural network model according to the supervision signal to obtain a second target deep three-dimensional convolutional network model, wherein the second target deep three-dimensional convolutional network model is the first target deep three-dimensional convolutional neural network model having reached the convergence state.
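A back-of-the-envelope sketch (not from the patent text) of why the split in claim 5 saves parameters: a [w_j, h_j, f_j] three-dimensional convolution is replaced by a 1x1x1 grouped (bottleneck) convolution followed by a [w_j, h_j, f_j] convolution over fewer channels. The channel counts, bottleneck width `c_mid`, and group count below are illustrative assumptions:

```python
def conv3d_params(c_in, c_out, w, h, f, groups=1):
    """Weight count of a 3-D convolution (bias terms ignored)."""
    return (c_in // groups) * c_out * w * h * f

c_in, c_out = 256, 256
w_j, h_j, f_j = 3, 3, 3        # kernel: width, length, frames/channels
c_mid, groups = 64, 4          # assumed bottleneck width and group count

# One monolithic 3-D convolutional layer.
original = conv3d_params(c_in, c_out, w_j, h_j, f_j)

# The same layer split per the claim: 1x1x1 grouped bottleneck, then a
# full-size 3-D kernel operating on the reduced channel count.
split = (conv3d_params(c_in, c_mid, 1, 1, 1, groups=groups)
         + conv3d_params(c_mid, c_out, w_j, h_j, f_j))

print(original, split)  # 1769472 vs 446464
```

Under these assumed widths the split unit needs roughly a quarter of the weights, which is the usual motivation for bottleneck-plus-grouped-convolution factorizations.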
6. The apparatus of claim 5, wherein the creating unit comprises:
a connecting unit, configured to densely connect the three-dimensional convolutional layers to obtain the deep three-dimensional convolutional neural network model, wherein the input of any three-dimensional convolutional layer in the deep three-dimensional convolutional neural network model comprises a concatenation of the output features of every three-dimensional convolutional layer preceding it.
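A minimal sketch (an illustration, not the patent's implementation) of the dense connection in claim 6: since each layer's input concatenates the outputs of all preceding layers, its input channel count grows linearly with depth. The initial width `c0` and per-layer output width `growth` are assumed values:

```python
def dense_input_channels(c0, growth, num_layers):
    """Input channel count seen by each layer when the outputs of all
    earlier layers (each `growth` channels wide) are concatenated with
    the original `c0`-channel input."""
    channels = []
    for layer in range(num_layers):
        # c0 base channels plus one growth-wide output per earlier layer.
        channels.append(c0 + layer * growth)
    return channels

print(dense_input_channels(64, 32, 4))  # [64, 96, 128, 160]
```

This linear growth is why densely connected blocks are typically paired with the bottleneck layers of claim 5: the 1x1x1 convolution compresses the concatenated input before the expensive [w_j, h_j, f_j] kernel is applied.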
7. The apparatus of claim 5, wherein the second training unit comprises:
a first splitting subunit, configured to split the shallow three-dimensional convolutional neural network model that has reached the convergence state to obtain a plurality of first sub-models;
a second splitting subunit, configured to split the first target deep three-dimensional convolutional neural network model to obtain a plurality of second sub-models, wherein the number of second sub-models is equal to the number of first sub-models;
a processing subunit, configured to arrange the plurality of first sub-models and the plurality of second sub-models respectively, use the first sub-model in the arrangement as a supervision signal, and train the second sub-model having the same arrangement serial number as the first sub-model until the second sub-model reaches the convergence state;
and a cascading subunit, configured to cascade all the second sub-models that have reached the convergence state to obtain the second target deep three-dimensional convolutional network model.
8. The apparatus of claim 7, wherein the processing subunit comprises:
a processing module, configured to take, as a supervision signal, the calculation result obtained by weighting the output of the first sub-model corresponding to the last arrangement serial number in the arrangement with the labels of the preset video sequence data set, and to train the second sub-model corresponding to the last arrangement serial number in the arrangement until the second sub-model reaches the convergence state.
9. A storage medium comprising a stored program, wherein when the program runs, a device in which the storage medium is located is controlled to perform the method for creating a deep three-dimensional convolutional neural network according to any one of claims 1 to 4.
10. A processor configured to run a program, wherein when the program runs, the method for creating a deep three-dimensional convolutional neural network according to any one of claims 1 to 4 is performed.
CN201810824695.3A 2018-07-25 2018-07-25 Deep three-dimensional convolutional neural network creation method and device, storage medium and processor Active CN109063824B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810824695.3A CN109063824B (en) 2018-07-25 2018-07-25 Deep three-dimensional convolutional neural network creation method and device, storage medium and processor

Publications (2)

Publication Number Publication Date
CN109063824A CN109063824A (en) 2018-12-21
CN109063824B true CN109063824B (en) 2023-04-07

Family

ID=64835460

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810824695.3A Active CN109063824B (en) 2018-07-25 2018-07-25 Deep three-dimensional convolutional neural network creation method and device, storage medium and processor

Country Status (1)

Country Link
CN (1) CN109063824B (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109801268B (en) * 2018-12-28 2023-03-14 东南大学 CT radiography image renal artery segmentation method based on three-dimensional convolution neural network
CN110058943B (en) * 2019-04-12 2021-09-21 三星(中国)半导体有限公司 Memory optimization method and device for electronic device
CN110020639B (en) * 2019-04-18 2021-07-23 北京奇艺世纪科技有限公司 Video feature extraction method and related equipment
CN115176244A (en) * 2020-02-28 2022-10-11 华为技术有限公司 Image search method and image search device
CN113761785A (en) * 2021-06-11 2021-12-07 神华国能宁夏煤电有限公司 Method and device for determining thinning value of boiler and storage medium
CN114065927B (en) * 2021-11-22 2023-05-05 中国工程物理研究院电子工程研究所 Excitation data block processing method of hardware accelerator and hardware accelerator

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105631480A (en) * 2015-12-30 2016-06-01 哈尔滨工业大学 Hyperspectral data classification method based on multi-layer convolution network and data organization and folding
CN106557778A (en) * 2016-06-17 2017-04-05 北京市商汤科技开发有限公司 Generic object detection method and device, data processing equipment and terminal device
CN107077625A (en) * 2014-10-27 2017-08-18 电子湾有限公司 The deep convolutional neural networks of layering
CN107240066A (en) * 2017-04-28 2017-10-10 天津大学 Image super-resolution rebuilding algorithm based on shallow-layer and deep layer convolutional neural networks
CN107633296A (en) * 2017-10-16 2018-01-26 中国电子科技集团公司第五十四研究所 A kind of convolutional neural networks construction method
CN107679619A (en) * 2017-10-13 2018-02-09 中国人民解放军信息工程大学 The building method and device of one species convolution artificial neural network
CN107766934A (en) * 2017-10-31 2018-03-06 天津大学 A kind of depth convolutional neural networks method based on convolution unit arranged side by side
CN107808150A (en) * 2017-11-20 2018-03-16 珠海习悦信息技术有限公司 The recognition methods of human body video actions, device, storage medium and processor
CN108009525A (en) * 2017-12-25 2018-05-08 北京航空航天大学 A kind of specific objective recognition methods over the ground of the unmanned plane based on convolutional neural networks
CN108256454A (en) * 2018-01-08 2018-07-06 浙江大华技术股份有限公司 A kind of training method based on CNN models, human face posture estimating and measuring method and device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9904874B2 (en) * 2015-11-05 2018-02-27 Microsoft Technology Licensing, Llc Hardware-efficient deep convolutional neural networks
US11250327B2 (en) * 2016-10-26 2022-02-15 Cognizant Technology Solutions U.S. Corporation Evolution of deep neural network structures

Also Published As

Publication number Publication date
CN109063824A (en) 2018-12-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant