CN113344182A

CN113344182A - Network model compression method based on deep learning

Info

Publication number: CN113344182A
Application number: CN202110608182.0A
Authority: CN
Inventors: 饶云波; 郭毅; 薛俊民
Original assignee: University of Electronic Science and Technology of China
Current assignee: University of Electronic Science and Technology of China
Priority date: 2021-06-01
Filing date: 2021-06-01
Publication date: 2021-09-03

Abstract

The invention discloses a network model compression method based on deep learning, and belongs to the field of artificial intelligence. The invention designs a new lightweight network structure by combining grouped convolution with factorized (decomposed) convolution, thereby reducing the complexity of the model and compressing its weights at the network design stage. The designed network model is then pruned with a model pruning method that reduces the number of channels of the convolution kernels as far as possible while controlling the error between the feature maps before and after pruning. Image classification or target detection and identification tasks are then performed on input images with the pruned network model, which reduces the amount of computation and meets the deployment requirements of mobile or embedded devices.

Description

Network model compression method based on deep learning

Technical Field

The invention belongs to the field of artificial intelligence, and particularly relates to a network model compression method based on deep learning.

Background

There are two main research directions in the field of deep learning. One is academic: it pursues powerful, complex network models and experimental methods in search of the best possible performance. The other is engineering: it aims to deploy algorithms on hardware more stably and efficiently, with algorithm performance and ease of deployment as its main goals. Although complex models perform well, they are large, demand a lot of storage and consume substantial computing resources, which makes them difficult to apply effectively in current hardware environments. Algorithm deployment is now widely studied, and research on target detection has begun to shift from theory to platform implementation and model compression, with considerable results. Deep learning model compression and acceleration algorithms currently follow three main directions: lightweight network structure design, model pruning, and kernel sparsification.

In lightweight network structure design, current research focuses on proposing new convolution methods and designing new network structures. A popular convolution method is grouped convolution: the input feature maps are divided into groups along the channel dimension, and convolution is performed on each group separately, so that each convolution kernel is connected only to the input feature maps of its own group, whereas a standard convolution kernel is connected to all input feature maps. The larger the number of groups k, the smaller the total number of parameters and the total computation of the convolution (both are reduced by a factor of k). However, grouped convolution has a serious drawback: it reduces the information flow between the channels of different groups, that is, each output feature map takes into account only part of the input features. In practice, an information fusion operation is therefore performed after the grouped convolution. ShuffleNet and MobileNet are two representative networks based on the idea of grouped convolution. ShuffleNet uses a channel shuffle to mix the feature maps evenly across channels after the group convolution, as shown in FIG. 1. MobileNet adopts depthwise separable convolutions, decomposing a convolution into a depthwise (DW) convolution and a 1x1 pointwise (PW) convolution. The depthwise convolution operates on each channel separately and can be regarded as a grouped convolution with only one channel per group; the low-cost 1x1 convolution is then used for channel fusion, which greatly reduces the amount of computation.
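The parameter savings described above can be checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original disclosure; the function names and the channel counts (64 input channels, 128 output channels, 3 x 3 kernels, 4 groups) are hypothetical values chosen only for illustration, and bias terms are ignored.

```python
def conv_params(c_in, c_out, k, groups=1):
    """Weight count of a k x k convolution split into `groups` groups (bias ignored)."""
    if c_in % groups or c_out % groups:
        raise ValueError("channel counts must be divisible by the number of groups")
    # Each output channel only sees c_in / groups input channels.
    return (c_in // groups) * c_out * k * k


def depthwise_separable_params(c_in, c_out, k):
    """Depthwise k x k convolution (one channel per group) plus a 1x1 pointwise convolution."""
    depthwise = conv_params(c_in, c_in, k, groups=c_in)
    pointwise = conv_params(c_in, c_out, 1)
    return depthwise + pointwise


# Hypothetical layer: 64 input channels, 128 output channels, 3 x 3 kernels.
standard = conv_params(64, 128, 3)                   # 64 * 128 * 9
grouped = conv_params(64, 128, 3, groups=4)          # reduced by a factor of 4
separable = depthwise_separable_params(64, 128, 3)   # 64 * 9 + 64 * 128

print(standard, grouped, separable)
```

With these illustrative sizes the grouped layer has exactly one quarter of the weights of the standard layer, matching the "reduced by a factor of k" statement, and the depthwise separable variant is smaller still.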

Unstructured model pruning generally refers to reducing network complexity and addressing overfitting by means of network pruning and sharing techniques. An early pruning approach is Biased Weight Decay. The Optimal Brain Damage and Optimal Brain Surgeon methods reduce the number of connections based on the Hessian matrix of the loss function. Research shows that pruning methods based on the loss function are more accurate than pruning methods based on importance (weight magnitude).

However, the pruning and sharing approach has some potential problems. First, if L1 or L2 regularization is used, the pruning method requires more iterations to converge. In addition, all pruning methods require the sensitivity of each layer to be set manually, i.e., the related parameters must be fine-tuned, which can be tedious and burdensome in some applications. Moreover, unstructured methods are poorly compatible with hardware: the compressed model depends on third-party hardware or algorithm libraries and is therefore not general-purpose.

Disclosure of Invention

The invention provides a network model compression method based on deep learning, which is used for reducing the operation amount of classification or target detection and identification tasks of images based on deep learning.

The invention relates to a network model compression method based on deep learning, which comprises the following steps:

setting a network model for image classification or target detection and identification tasks, wherein the network model adopts a lightweight network structure based on grouped convolution and factorized (decomposed) convolution and comprises a plurality of convolutional layers;

acquiring a training data set based on an image classification task or a target detection and identification task of the network model, wherein the training data set comprises image data and image labels (classification labels or target detection labels);

pruning the network model based on the training data set, and performing image classification processing or target detection and identification processing on the image to be processed input in real time based on the network model obtained through the pruning processing, namely obtaining an image classification result or a target detection and identification result of the image to be processed input currently based on the output of the trained network model;

wherein, the pruning treatment comprises the following steps:

acquiring network parameters of a network model before pruning: initializing network parameters of a network model, performing deep learning training on the network model based on a training data set and a preset loss function, and stopping training when preset prediction precision is met;

traversing the convolutional layers of the network model layer by layer in the forward propagation direction of the network, and pruning layer by layer based on the kernel weight of each convolution kernel of each convolutional layer; after each pruning, initializing the weights of the remaining convolution kernels of the current convolutional layer and of the next convolutional layer, performing deep learning training on the currently pruned network model based on the training data set, and stopping training when the preset prediction precision is met;

wherein the kernel weight of a convolution kernel is the sum of the absolute values of the kernel weights contained in the convolution kernel, that is, the sum of the absolute kernel weights of all two-dimensional convolution kernels corresponding to the same output channel;

and the pruning comprises: removing the convolution kernel with the smallest kernel weight in the current convolutional layer together with its corresponding feature map, and removing, in the next convolutional layer, the kernels that correspond to the removed feature map.

In a possible implementation manner, the pruning process may also be:

traversing the convolutional layers of the network model layer by layer in the forward propagation direction of the network, pruning layer by layer based on the kernel weight of each convolution kernel of each convolutional layer, and, after all convolutional layers have been pruned, performing deep learning training on the pruned network model based on the training data set and stopping training when the preset prediction precision is met.

In one possible implementation, the pruning process is:

pruning designated convolutional layers of the network model in sequence, in the forward propagation direction of the network and based on the kernel weights of the convolution kernels of the designated convolutional layers, wherein deep learning training of the pruned network model is performed either once after each pruning, or once after all the designated convolutional layers have been pruned.

In one possible implementation, when the kernel weight of each convolution kernel is calculated, the kernel weights of the kernels corresponding to already pruned feature maps are discarded.

In one possible implementation, when the network parameters of the network model are initialized, existing pre-trained kernel weights are loaded directly as the kernel weights in the network parameters.

The technical scheme provided by the invention has at least the following beneficial effects: a lightweight network structure is designed by combining grouped convolution with factorized (decomposed) convolution, so as to reduce the complexity of the model and compress its weights; and the pruning strategy reduces the number of channels of the convolution kernels as far as possible while controlling the error between the feature maps before and after pruning. Image classification or target detection and identification tasks are then performed on input images with the pruned network model, which reduces the amount of computation and meets the deployment requirements of mobile or embedded devices.

Drawings

In order to illustrate the technical solutions in the embodiments of the present invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.

FIG. 1 is a schematic diagram of a conventional shufflenet convolution;

FIG. 2 is a diagram of convolutional layer convolution kernels employed in an embodiment of the present invention;

FIG. 3 is a schematic diagram of a network pruning process employed in an embodiment of the present invention;

FIG. 4 is a schematic diagram of a single-layer pruning strategy employed in an embodiment of the present invention;

FIG. 5 is a schematic illustration of a multi-layer pruning strategy employed in an embodiment of the present invention;

FIG. 6 is a comparison diagram of the recognition results of the target detection network before and after pruning in an embodiment of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, embodiments of the present invention will be described in detail with reference to the accompanying drawings.

The traditional model pruning method generally consists of three steps: training an original model; pruning "unimportant" channels with a pruning algorithm; and fine-tuning the parameters of the pruned model. In recent years, some work has argued that the parameters inherited by the pruned model are not important, and that directly re-initializing the parameters of the pruned model and retraining it can achieve the same or even higher accuracy. Model pruning can therefore be treated as a model structure search problem, in which a search algorithm finds the number of channels of each layer of the model.

To reduce the computational cost of convolutional layers, some earlier approaches approximate the convolution operation, i.e., represent the weight matrix as a low-rank product of two smaller matrices without changing the original number of convolution kernels. Other methods of reducing convolution overhead include FFT-based (fast Fourier transform) convolution and fast convolution using the Winograd algorithm. In addition, quantization and binarization can be used to reduce model size and computational overhead. There is also a three-level weight pruning method that uses particle filtering to locate pruning candidates, selecting the best combination from a large number of randomly generated masks.

In an embodiment of the present invention, convolution kernel weights are analyzed, and convolution kernels and their corresponding feature maps are pruned, using a simple magnitude-based metric, without examining the possible combinations. A whole-network pruning strategy that removes entire convolution kernels is also provided for both simple and complex convolutional network architectures.

The L1 norm is used to select and prune unimportant convolution kernels. The fine-tuning process is the same as the conventional training process, and no additional regularization is introduced. Apart from the percentage of convolution kernels to be pruned, the model compression method provided by the embodiment of the invention introduces no additional regularizer parameters. With stage-wise, layer-by-layer pruning, a single pruning rate can be set for all layers in a stage.

In the embodiment of the invention, a new lightweight network structure is designed by combining grouped convolution with factorized (decomposed) convolution, so that the complexity of the model is reduced and its weights are compressed at the network design stage. The designed network model is then pruned with a model pruning method that reduces the number of channels of the convolution kernels as far as possible while controlling the error between the feature maps before and after pruning. The specific implementation is as follows:

First, lightweight network design.

Convolution plays an important role in network models, especially image-related networks. The convolutional layers act as the eyes of the network, and the recognition quality of the network (recognition of target objects in the input image) depends largely on them. The convolutional layer is also the unit with the largest amount of computation, so in the embodiment of the present invention, the design of the lightweight network focuses on optimizing the convolutional layers.

To reduce the computation of the convolutional layers, a separation idea is adopted: an N × N convolution (convolution kernel) is replaced by an N × 1 convolution and a 1 × N convolution. The network is also built with fewer channels and more layers, so as to extract deep information as far as possible. For example, in the embodiment of the present invention, all of the designed network models use only 1 × 1 and 3 × 3 convolution kernels, convolution kernels of size 5 × 5 or larger are abandoned, and two output layers are provided for target recognition prediction on images of different scales, as shown in FIG. 2. Compared with common lightweight network models, this design reduces the number of channels and increases the network depth.
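The effect of replacing an N × N kernel with an N × 1 and a 1 × N kernel can be illustrated by counting weights. This is a rough sketch under stated assumptions: the two factorized convolutions are applied one after the other, the intermediate channel count `c_mid` defaults to the output channel count, and the channel sizes are hypothetical. None of these details are specified in the text above.

```python
def full_kernel_params(n, c_in, c_out):
    """Weights of a standard n x n convolution (bias ignored)."""
    return n * n * c_in * c_out


def factorized_kernel_params(n, c_in, c_out, c_mid=None):
    """Weights of an n x 1 convolution followed by a 1 x n convolution.

    c_mid is the (assumed) channel count between the two convolutions.
    """
    if c_mid is None:
        c_mid = c_out
    return n * c_in * c_mid + n * c_mid * c_out


# Hypothetical layer with 64 input and 64 output channels.
full_3 = full_kernel_params(3, 64, 64)
fact_3 = factorized_kernel_params(3, 64, 64)
full_5 = full_kernel_params(5, 64, 64)
fact_5 = factorized_kernel_params(5, 64, 64)

print(full_3, fact_3)  # ratio 2/3 for n = 3
print(full_5, fact_5)  # ratio 2/5 for n = 5
```

Under these assumptions the saving grows with the kernel size (a factor of 2/N of the original weights), which is consistent with the preference for small kernels stated above.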

Second, the network pruning strategy.

Referring to FIG. 3, the embodiment of the present invention prunes the network following two approaches: training while pruning, and training after pruning is complete.

First, the reduction in computational cost achieved by the strategy is derived.

Let $n_i$ denote the number of input channels of the i-th convolutional layer, and let $h_i$ and $w_i$ denote the height and width of the input feature map of the i-th convolutional layer, i.e., its input feature map size. The convolutional layer transforms the input feature maps $X_i \in \mathbb{R}^{n_i \times h_i \times w_i}$ into the output feature maps $X_{i+1} \in \mathbb{R}^{n_{i+1} \times h_{i+1} \times w_{i+1}}$, which serve as the input feature maps of the next convolutional layer. Here $\mathbb{R}$ denotes the real number field, $n_{i+1}$ denotes the number of output channels of the i-th convolutional layer, i.e., the number of input channels of the (i+1)-th convolutional layer, and $h_{i+1}$, $w_{i+1}$ denote the size of the output feature map of the i-th convolutional layer. Each convolution kernel (one per output channel) consists of $n_i$ two-dimensional convolution kernels $K \in \mathbb{R}^{k \times k}$, so the size of a convolution kernel can be written as $n_i \times k \times k$, where $k \times k$ is also the number of kernel values in a two-dimensional convolution kernel. All convolution kernels together form the convolution kernel matrix $F_i \in \mathbb{R}^{n_i \times n_{i+1} \times k \times k}$.

The number of operations of the convolutional layer is $n_i \times n_{i+1} \times k \times k \times h_{i+1} \times w_{i+1}$. As shown in FIG. 4, when the convolution kernel $F_{i,j}$ (the j-th convolution kernel in the convolution kernel matrix $F_i$, i.e., the kernel corresponding to the j-th output channel) is pruned, the corresponding feature map $X_{i+1,j}$ is removed, which saves $n_i \times k \times k \times h_{i+1} \times w_{i+1}$ operations. The kernels of the next convolutional layer that apply to the removed feature map are also removed, which saves an additional $n_{i+2} \times k \times k \times h_{i+2} \times w_{i+2}$ operations.

Pruning m convolution kernels of the i-th layer reduces the computational cost of both the i-th and (i+1)-th layers by a fraction $m / n_{i+1}$. FIG. 4 is a schematic diagram of the pruning strategy, where h and w denote the input dimensions and i denotes the layer index. The black parts of the convolution kernels mark the network structure to be deleted by the pruning strategy; when the black part of the i-th layer is deleted, the associated parts of the adjacent layers are deleted as well, and they are also shown in black.
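The cost reduction stated above can be verified numerically. The sketch below is illustrative only: the function names and the layer sizes (64, 128 and 128 channels, 3 × 3 kernels, 32 × 32 feature maps, m = 32) are hypothetical, and the operation count follows the per-layer formula n_in × n_out × k × k × h_out × w_out given in the text.

```python
def conv_ops(n_in, n_out, k, h_out, w_out):
    """Operation count of one convolutional layer: n_in * n_out * k * k * h_out * w_out."""
    return n_in * n_out * k * k * h_out * w_out


def ops_before_and_after(n_i, n_i1, n_i2, k, size_i1, size_i2, m):
    """Operation counts of layers i and i+1 before and after pruning m kernels from layer i.

    n_i1 is the number of output channels of layer i (input channels of layer i+1).
    """
    (h1, w1), (h2, w2) = size_i1, size_i2
    before = conv_ops(n_i, n_i1, k, h1, w1) + conv_ops(n_i1, n_i2, k, h2, w2)
    # Pruning removes m output channels of layer i and m input channels of layer i+1.
    after = conv_ops(n_i, n_i1 - m, k, h1, w1) + conv_ops(n_i1 - m, n_i2, k, h2, w2)
    return before, after


# Hypothetical layer sizes chosen only for illustration.
before, after = ops_before_and_after(
    n_i=64, n_i1=128, n_i2=128, k=3, size_i1=(32, 32), size_i2=(32, 32), m=32
)
saved_fraction = (before - after) / before
print(saved_fraction)  # equals m / n_i1
```

Because both layers scale linearly with the number of channels they share, the saved fraction depends only on m and on the number of output channels of layer i, not on the kernel size or feature map size.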

Cutting convolution kernels of low importance (importance below a specified threshold) out of a trained model improves computational efficiency while keeping the loss of accuracy to a minimum. The relative importance of a convolution kernel within each layer is measured by the sum of its absolute weights $\sum |F_{i,j}|$, i.e., its L1 norm $\|F_{i,j}\|_1$. Since every convolution kernel in a layer has the same number of input channels $n_i$, and hence the same number of weights, $\sum |F_{i,j}|$ also represents the average magnitude of its kernel weights. This value gives an expectation of the magnitude of the output feature map: a convolution kernel whose kernel weight is small compared with the other convolution kernels of the layer can be expected to produce a weakly activated feature map, and is therefore of low importance to the model.
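As an illustration of this importance measure, the following sketch computes the sum of absolute weights for every convolution kernel of one layer with NumPy. The weight layout `(n_out, n_in, k, k)` and the random test data are assumptions made for the example, not details taken from the disclosure.

```python
import numpy as np


def kernel_weights(layer):
    """Sum of absolute weights (L1 norm) of each convolution kernel.

    layer: array of shape (n_out, n_in, k, k), one convolution kernel per output channel
    (an assumed layout). Returns an array of length n_out.
    """
    return np.abs(layer).sum(axis=(1, 2, 3))


rng = np.random.default_rng(0)
layer = rng.normal(size=(8, 4, 3, 3))  # hypothetical layer: 8 kernels, 4 input channels
layer[5] *= 0.01                       # make kernel 5 artificially weak

scores = kernel_weights(layer)
weakest = int(np.argmin(scores))       # index of the least important kernel

# Every kernel holds n_in * k * k = 36 weights, so the L1 norm is 36 times the mean magnitude.
mean_magnitude = np.abs(layer).mean(axis=(1, 2, 3))
print(weakest, np.allclose(scores, 36 * mean_magnitude))
```

Ranking by the sum or by the mean gives the same order within a layer, since all kernels of the layer contain the same number of weights.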

As a possible implementation, the specific steps of pruning layer by layer are as follows:

the process of pruning m convolution kernels from the ith convolution layer is as follows:

(1) For each convolution kernel $F_{i,j}$, calculate the sum of its absolute weights as its kernel weight $S_{i,j}$, i.e.,

$$S_{i,j} = \sum_{l=1}^{n_i} \sum_{\rho=1}^{k \times k} \left| K_{l,\rho} \right|$$

where $K_{l,\rho}$ denotes the ρ-th weight (kernel value) of the l-th two-dimensional convolution kernel of $F_{i,j}$ in the i-th layer; for a k × k two-dimensional convolution kernel, ρ = 1, …, k × k.

(2) Sort the convolution kernels by their kernel weights $S_{i,j}$.

(3) Prune the convolution kernels $F_{i,j}$ with the smallest kernel weights (m in total) and their corresponding feature maps. The kernels in the next convolutional layer that correspond to the pruned feature maps are also pruned.

(4) Create new kernel matrices for the i-th and (i+1)-th layers from the remaining convolution kernels, and initialize the new model with their weights.

The pruning approach described above prunes the weights layer by layer and then retrains iteratively to compensate for the loss of accuracy.
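The four steps above can be sketched for a pair of adjacent layers as follows. This is a minimal NumPy illustration rather than the implementation used in the embodiment: the function name `prune_layer`, the weight layout `(n_out, n_in, k, k)`, and the choice to copy the surviving weights into the new matrices are assumptions.

```python
import numpy as np


def prune_layer(w_i, w_next, m):
    """Prune the m kernels of layer i with the smallest kernel weight.

    w_i:    weights of layer i,   shape (n_out_i, n_in_i, k, k)   (assumed layout)
    w_next: weights of layer i+1, shape (n_out_next, n_out_i, k, k)
    Returns the new weight arrays and the indices of the kept kernels.
    """
    if not 0 <= m < w_i.shape[0]:
        raise ValueError("m must leave at least one kernel in the layer")
    s = np.abs(w_i).sum(axis=(1, 2, 3))    # step 1: kernel weight of each kernel
    order = np.argsort(s, kind="stable")   # step 2: sort by kernel weight
    keep = np.sort(order[m:])              # step 3: drop the m smallest, keep channel order
    new_w_i = w_i[keep].copy()             # step 4: new kernel matrix for layer i
    new_w_next = w_next[:, keep].copy()    #         and for layer i+1 (input channels removed)
    return new_w_i, new_w_next, keep


rng = np.random.default_rng(0)
w_i = rng.normal(size=(6, 3, 3, 3))        # hypothetical layer i: 6 kernels
w_i[[1, 4]] *= 0.01                        # kernels 1 and 4 are made weak
w_next = rng.normal(size=(4, 6, 3, 3))     # hypothetical layer i+1: 6 input channels

new_w_i, new_w_next, keep = prune_layer(w_i, w_next, m=2)
print(keep, new_w_i.shape, new_w_next.shape)
```

In a real framework the same index set would also be applied to any per-channel parameters attached to layer i, such as biases or batch normalization statistics.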

However, pruning across multiple layers of the network is also necessary:

1) For very deep networks, pruning and retraining layer by layer can be very time-consuming.

2) For complex networks, a holistic approach tends to work better than a purely layer-by-layer approach.

For example, in networks with residual structures, such as residual networks (ResNet) and related networks, pruning the identity feature maps or the second layer of each residual block may lead to additional pruning of other layers (since the residual structure is designed with more feature maps than conventional modules, pruning it has a wider effect). Existing strategies for pruning across multiple layers generally fall into the following two types, illustrated in FIG. 5:

(a) Independent pruning strategy: determines which convolution kernels should be pruned in each layer independently of the other layers, without taking into account what has been pruned in previous layers.

(b) Greedy pruning strategy: takes into account the convolution kernels already deleted in previous layers. When computing the sum of absolute weights, this strategy does not consider the kernels for previously pruned feature maps.

For simple convolutional neural networks such as VGG, any convolution kernel in any convolutional layer can easily be deleted. For complex network structures such as residual networks, however, pruning convolution kernels is not straightforward. The residual architecture imposes constraints, and convolution kernels must be pruned carefully. The convolution kernels of the first layer of a residual block can be pruned arbitrarily, because doing so does not change the number of feature maps output by the block. The output feature maps of the second convolutional layer, however, are difficult to prune because they correspond one-to-one with the identity feature maps. Therefore, to prune the second convolutional layer of a residual block, the corresponding feature maps on the identity path must be pruned as well. Since the identity feature maps are more important than the added residual maps, and considering the feature dimension alignment problem after pruning a residual structure, the feature maps to be pruned should be determined by the previous pruning result.

The pruning strategy across multiple layers is shown by the gray part in FIG. 5; the black part is the network structure that the j-th layer (corresponding to the j-th output channel) has decided to prune. When the second row of the convolution kernels of the (j+1)-th layer is to be pruned, the pruning structure of the (j+1)-th layer depends on the pruning structure of the previous layer. The independent pruning strategy does not take into account the feature maps removed by the previous layer (shown in black in FIG. 5), so the kernel weights of the overlapping structure are still included in the calculation. The greedy pruning strategy does not count the kernels for feature maps that have already been pruned, i.e., the kernel weights of the overlapping structure are excluded. Both methods output matrices of the same dimensions and do not affect the alignment of feature dimensions.
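The difference between the two strategies comes down to which input channels contribute to the kernel weight of the next layer. The sketch below contrasts them on a tiny hand-made example; the function names, the weight layout `(n_out, n_in, k, k)` and the numbers are hypothetical and chosen so that the two strategies select different kernels.

```python
import numpy as np


def select_independent(w_next, m):
    """Independent strategy: rank kernels using all input channels,
    including channels whose feature maps were already pruned in the previous layer."""
    s = np.abs(w_next).sum(axis=(1, 2, 3))
    return np.sort(np.argsort(s, kind="stable")[:m])


def select_greedy(w_next, kept_inputs, m):
    """Greedy strategy: rank kernels using only the input channels that survived
    pruning in the previous layer."""
    s = np.abs(w_next[:, kept_inputs]).sum(axis=(1, 2, 3))
    return np.sort(np.argsort(s, kind="stable")[:m])


# Hypothetical layer with 3 kernels, 4 input channels and 1 x 1 kernels.
w_next = np.array([
    [0.1, 0.1, 0.1, 5.0],   # kernel 0: large weight only on input channel 3
    [1.0, 1.0, 1.0, 0.1],   # kernel 1
    [2.0, 2.0, 2.0, 2.0],   # kernel 2
]).reshape(3, 4, 1, 1)

kept_inputs = [0, 1, 2]     # input channel 3 was pruned in the previous layer

independent_choice = select_independent(w_next, m=1)
greedy_choice = select_greedy(w_next, kept_inputs, m=1)
print(independent_choice, greedy_choice)
```

Here the independent strategy removes kernel 1 because kernel 0 still appears important through a channel that no longer exists, while the greedy strategy ignores that channel and removes kernel 0.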

In addition, better results can be obtained by leaving certain layers of residual structures unpruned. Retraining the pruned network to recover accuracy is an important step in network pruning, since any model compression method should keep the loss of accuracy as small as possible. After the convolution kernels are pruned, the performance degradation should be compensated by retraining the network, i.e., by fine-tuning the pruned network. For a network model with a residual structure, the feature dimensions of the input and output of the residual structure must also be considered. To address this technical problem, the embodiment of the invention adopts the following two retraining strategies:

(1) Prune once and retrain: prune the convolution kernels of multiple layers at once, then retrain until the original accuracy is restored.

(2) Prune and retrain iteratively: prune the convolution kernels layer by layer, retraining the model after each pruning step and before pruning the next layer, so that the weights adapt to the changes introduced by pruning.

For layers that are robust to pruning, the prune-once-and-retrain strategy (1) can be used to prune away a significant portion of the network, and any loss of accuracy can be recovered by retraining for a short time (less than the initial training time). However, when some convolution kernels of sensitive layers are removed, or when a large part of the network is pruned away, it may not be possible to restore the original accuracy. Iterative pruning and retraining may then yield better results, but the iterative process takes more time, especially for very deep networks. Different strategies can therefore be adopted for networks of different sizes and types, taking full account of the network depth and the sensitivity of the individual layers.
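The two retraining schedules differ only in where the training step sits relative to the pruning loop. The following schematic sketch makes that explicit; `prune_fn` and `train_fn` are hypothetical callbacks standing in for the actual pruning and training routines, and the "model" in the usage example is just a list that records the order of operations.

```python
def prune_once_and_retrain(model, layers, prune_fn, train_fn):
    """Strategy (1): prune all selected layers first, then retrain once."""
    for layer in layers:
        model = prune_fn(model, layer)
    return train_fn(model)


def prune_and_retrain_iteratively(model, layers, prune_fn, train_fn):
    """Strategy (2): retrain after each layer is pruned, before pruning the next one."""
    for layer in layers:
        model = prune_fn(model, layer)
        model = train_fn(model)
    return model


# Toy stand-ins that only log what would happen (not real pruning or training).
def log_prune(log, layer):
    return log + ["prune " + layer]


def log_train(log):
    return log + ["train"]


layers = ["conv1", "conv2"]
one_shot = prune_once_and_retrain([], layers, log_prune, log_train)
iterative = prune_and_retrain_iteratively([], layers, log_prune, log_train)
print(one_shot)
print(iterative)
```

The iterative schedule runs one training phase per pruned layer, which is why its cost grows with network depth, as noted above.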

The beneficial effects of the deep-learning-based model compression method provided by the embodiment of the invention are described in detail below with reference to specific simulation experiments. The simulation experiments examine how network depth and channel count affect the performance of the network model, seeking the best balance for shallow networks. Networks such as LightNet (the lightweight network) are tested and evaluated on the MS-COCO 2014 data set (an image recognition data set) to verify the influence of network depth and number of channels on the performance of the network models listed in Table 2. It can be seen that a deep network with few channels performs better than a shallow network with many channels, even when their convolution kernels have the same size. Table 1 compares LightNet networks with different numbers of layers and channels; the three LightNet networks in the table are designed identically except for the number of network layers. Two N × S × S convolution kernels can be merged into one 2N × S × S convolution kernel, where N denotes the number of input channels and S × S denotes the convolution kernel size.

Table 1. Computation and accuracy of LightNet networks with different numbers of layers, designed using the lightweight network design idea

Here, "Parameters" is the number of parameters, "FLOPs" is the computational cost, i.e., the number of floating-point operations, and "mAP" is the mean average precision.

The network structure (with N × N convolution kernels replaced by N × 1 and 1 × N kernels) is evaluated experimentally against currently popular lightweight networks on the MS-COCO 2014 data set. The operating system is Ubuntu 16.04, the CPU is an Intel Core i7-7700 (8 GB), and the GPU is an NVIDIA GeForce GTX 1070. The method is compared with popular related neural networks (including CornerNet-Lite, YOLOv3-tiny, RFBNet300 and RFBNet512) to verify the effectiveness and usability of the embodiment. The results of the experiment are shown in Table 2.

Table 2. Accuracy results of related target detection network models

The experimental environment for the network pruning experiments is the same as that for the lightweight network model. The CIFAR-10 data set (a color image data set) and the VGG-16 classification network model are used. The experimental results are shown in Table 3, which quantifies the parameters and computation of each layer and analyzes the reduction before and after pruning; the last two columns show the number of feature maps after pruning and the percentage reduction.

Table 3. VGG-16 experimental results

Wherein, the channel number 1 refers to the output channel number before pruning, and the channel number 2 refers to the output channel number after pruning.

In the simulation experiments, two types of image classification networks are pruned: in addition to the VGG-16 network above, ResNet, which has a residual structure, is also pruned. When convolution kernels are pruned, the remaining parameters of the modified layers and the parameters of the unaffected layers are copied into the new model. In addition, when a convolutional layer is pruned, the corresponding weights of the subsequent batch normalization layer are also deleted. To obtain the baseline accuracy of each network, each model is trained from scratch with the same preprocessing and hyperparameters as ResNet. For retraining, a constant learning rate lr = 0.001 is used, and the method is experimentally validated on the VGGNet-16, ResNet-56, ResNet-110 and LightNet networks, as shown in Table 4. Since LightNet is proposed as a target detection network, no classification error rate is given for it in the table; instead, the error-rate column gives its target detection accuracy, measured on the MS-COCO 2014 test set.

Table 4. Experimental results of the pruning algorithm

FIG. 6 shows an experimental result of the pruned LightNet32 network, where the text in the figure gives the recognition results (names of the recognized targets). It can be seen that the results are degraded relative to the accuracy before pruning, particularly in the recognition result shown on the right, because pruning inevitably affects feature extraction to some degree. However, the complexity of the target detection network model is reduced, and it is worth noting that the model can be further compressed through pruning, so as to achieve deployment on mobile terminals or embedded devices.

The embodiment of the invention provides a deep-learning-based model compression method comprising a lightweight network design and a new network model pruning strategy. The lightweight network design takes into account the design ideas of current lightweight models and combines them with a new convolutional layer method, so that the number of parameters of the model is reduced while the accuracy remains stable. The new network pruning method is a form of structured pruning: it uses the L1 norm, prunes layer by layer and then retrains the network to recover accuracy, and, for networks of different depths, applies different multi-layer pruning strategies according to the sensitivity of each layer. Experiments show that the model compression method provided by the embodiment of the invention performs well in practical applications.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

What has been described above are merely some embodiments of the present invention. It will be apparent to those skilled in the art that various changes and modifications can be made without departing from the inventive concept thereof, and these changes and modifications can be made without departing from the spirit and scope of the invention.

Claims

1. A network model compression method based on deep learning is characterized by comprising the following steps:

acquiring a training data set based on an image classification task or a target detection and identification task of the network model, wherein the training data set comprises image data and image labels;

pruning the network model based on the training data set, and carrying out image classification processing or target detection and identification processing on the image to be processed input in real time based on the network model obtained by pruning;

wherein, the pruning treatment comprises the following steps:

wherein, the kernel weight of the convolution kernel is: the sum of absolute kernel weights comprised by the convolution kernel;

2. The method of claim 1, wherein the pruning process is replaced with:

3. The method of claim 1, wherein the pruning process is replaced with:

pruning the designated convolutional layers in the network model according to the forward propagation direction of the network and based on the kernel weights of the convolutional kernels of the designated convolutional layers in sequence;

performing deep learning training on the currently obtained network model once after each pruning, and stopping when the preset prediction precision is met; or, after pruning of all the designated convolutional layers is completed, performing deep learning training on the currently obtained network model based on the training data set, and stopping when the preset prediction precision is met.

4. A method as claimed in any one of claims 1 to 3, wherein the kernel weights of the kernels corresponding to the pruned feature maps are discarded when calculating the kernel weights of the respective convolution kernels.

5. A method according to any one of claims 1 to 3, characterized in that when initializing the network parameters of the network model, the pre-trained kernel weights are loaded directly for the kernel weights in the network parameters.

6. The method of claim 1, wherein the convolutional layers of the network model employ convolutional kernels having a size of less than 5 x 5.