CN112036475A - Fusion module, multi-scale feature fusion convolutional neural network and image identification method - Google Patents

Fusion module, multi-scale feature fusion convolutional neural network and image identification method Download PDF

Info

Publication number
CN112036475A
CN112036475A (application CN202010888768.2A)
Authority
CN
China
Prior art keywords
layer
convolution
fusion
fusion module
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202010888768.2A
Other languages
Chinese (zh)
Inventor
钱雪忠
陈鑫华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangnan University
Original Assignee
Jiangnan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangnan University filed Critical Jiangnan University
Priority to CN202010888768.2A priority Critical patent/CN112036475A/en
Publication of CN112036475A publication Critical patent/CN112036475A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/253 Fusion techniques of extracted features
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 18/00 Pattern recognition
    • G06F 18/20 Analysing
    • G06F 18/25 Fusion techniques
    • G06F 18/254 Fusion techniques of classification results, e.g. of results related to same input data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 3/00 Computing arrangements based on biological models
    • G06N 3/02 Neural networks
    • G06N 3/04 Architecture, e.g. interconnection topology
    • G06N 3/045 Combinations of networks

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Artificial Intelligence (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Computational Linguistics (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Image Analysis (AREA)

Abstract

The invention relates to a fusion module, a multi-scale feature fusion convolutional neural network and an image identification method. The method achieves a low error rate, effectively reduces the number of network parameters, and helps improve the generalization ability of the model.

Description

Fusion module, multi-scale feature fusion convolutional neural network and image identification method
Technical Field
The invention relates to the technical field of image recognition, in particular to a fusion module, a multi-scale feature fusion convolutional neural network and an image recognition method.
Background
Deep learning, one of the research hotspots in the field of machine learning, has made major breakthroughs in recent years. Convolutional neural networks are the most widely used representatives of deep learning; typical applications include image recognition, scene segmentation and object detection.
Convolutional neural networks have achieved remarkable results across different tasks, but this performance comes at a significant computational cost: their computation and parameter counts are far larger than those of traditional algorithms in the field. As convolutional neural network models approach the accuracy limits of computer vision tasks, the structures of most deep convolutional neural networks become increasingly complex. Computing the convolutional and fully-connected layers requires a large number of floating-point matrix operations, so the computational cost is extremely high, and the models are too large and too deep for practical deployment. This makes them unsuitable for mobile devices and embedded chips, which impose strict limits on real-time performance, storage space and energy consumption. For ease of application, it is therefore necessary to design a more lightweight network model.
Disclosure of Invention
Therefore, the technical problem to be solved by the invention is to overcome the drawbacks of convolutional neural networks in the prior art, namely that the convolution kernels are too uniform and lack diversity, and that the network structure is complex with redundant parameters. To this end, a fusion module, a multi-scale feature fusion convolutional neural network and an image identification method are provided that have a low error rate, effectively reduce the number of network parameters and improve the generalization ability of the model.
In order to solve the technical problem, the fusion module comprises a previous layer, a fusion layer and a cascade layer. The previous layer feeds the input picture into the fusion layer for computation, and the results of the computation are then merged and output by the cascade layer. The fusion layer is formed by adding a bottleneck structure and a dilated (hole) convolution structure to the original module and combining them together.
In one embodiment of the invention, different scales of convolution kernels are used in the fusion layer.
In one embodiment of the present invention, the bottleneck structure is provided with 1 × 1 convolutional layers and 3 × 3 convolutional layers.
In an embodiment of the present invention, the formula for the size of the region covered by the dilated convolution is: F(r) = (2^(r+1) − 1) × (2^(r+1) − 1), where the hyperparameter r indicates that r − 1 spaces are inserted between adjacent pixels.
In one embodiment of the invention, the fusion layer has five convolution branch channels.
In an embodiment of the present invention, the number of parameters F(i, n) of the fusion module is calculated as:
[Formula given as an image in the original filing: F(i, n) expressed in terms of k × k, C_in and C_out.]
where k × k is the size of the current convolution kernel, C_in is the number of input channels, and C_out is the number of output channels.
In an embodiment of the present invention, the computational cost Flops(i, n) of the fusion module is calculated as:
[Formula given as an image in the original filing: Flops(i, n) expressed in terms of k × k, C_in and C_out.]
where k × k is the size of the current convolution kernel, C_in is the number of input channels, and C_out is the number of output channels.
The invention also discloses a multi-scale feature fusion convolutional neural network, which comprises the following components: an input layer, a first convolution layer, a maximum pooling layer, a plurality of fusion modules as described in any of the above, a classification output layer, and an output layer, and the previous layer and the subsequent layer of each fusion module use a 1 × 1 convolution operation.
In one embodiment of the present invention, using a 1 × 1 convolution operation before and after each fusion module includes: connecting, in order according to the topological structure, the input layer, the first convolution layer, the maximum pooling layer, the second convolution layer, the first fusion module, the third convolution layer, the second fusion module, the fourth convolution layer, the third fusion module, the fifth convolution layer, the fourth fusion module, the sixth convolution layer, the seventh convolution layer, the classification output layer and the output layer.
The invention also discloses an image identification method of the multi-scale feature fusion convolutional neural network, which comprises the following steps: step S1: performing convolution pooling operation on the input picture of the multi-scale feature fusion convolution neural network; step S2: using 1 × 1 convolution operation for the previous layer and the next layer of each fusion module; step S3: and classifying and outputting the output end of the multi-scale feature fusion convolutional neural network through a global average pooling layer or a convolutional layer.
Compared with the prior art, the technical scheme of the invention has the following advantages:
aiming at the shortcomings of traditional convolutional neural networks, namely excessive parameters and complex network structures, the fusion module and the multi-scale feature fusion convolutional neural network MS-FNet are designed by analyzing and comparing traditional convolutional neural network models. The fusion module is introduced into the network, features of different scales are extracted by convolution kernels of different sizes, and the width and robustness of the network are increased, which solves the problem that the convolution kernels used for feature extraction in ordinary convolutional neural networks are too uniform and lack diversity and comprehensiveness. In addition, a convolutional layer replaces the traditional fully-connected layer at the end of the network, which effectively controls the number of parameters of the network model. Experimental results show that, compared with common convolutional neural network models, MS-FNet has a more reasonable network structure, faster convergence, a lower error rate and better generalization ability while using far fewer parameters than other convolutional neural network models.
Drawings
In order that the present disclosure may be more readily and clearly understood, reference is now made to the following detailed description of the embodiments of the present disclosure taken in conjunction with the accompanying drawings, in which
FIG. 1 is a schematic diagram of a fusion module of the present invention;
FIG. 2 is a schematic diagram of a primitive module of the present invention;
FIGS. 3a and 3b are schematic diagrams of the standard convolution and bottleneck structures, respectively, of the present invention;
FIG. 4 is a schematic diagram of a multi-scale feature fusion convolutional neural network of the present invention;
FIG. 5 shows the parameter settings of the MS-FNet of the present invention on the CIFAR-10 data set;
FIG. 6 is a comparison of the results of the CIFAR-10 experiment of the present invention;
FIG. 7 is a comparison of the results of the CIFAR-100 experiment of the present invention;
FIG. 8 is a comparison of the present invention with and without dilated convolution;
FIG. 9 is a comparison of the results of an MNIST data set experiment of the present invention;
FIG. 10 is a graphical illustration of a comparison of accuracy curves on the MNIST data set of the present invention;
FIG. 11 is a graphical illustration of cross-entropy loss function curve comparison on a CIFAR-10 dataset according to the present invention.
The specification reference numbers indicate: 10, previous layer; 20, fusion layer; 21, bottleneck structure; 22, dilated convolution structure; 30, cascade layer.
Detailed Description
Example one
As shown in fig. 1, this embodiment provides a Fusion module, which includes a previous layer 10, a fusion layer 20 and a cascade layer 30. The previous layer 10 feeds the input picture into the fusion layer 20 for computation, and the results of the computation are then merged and output by the cascade layer 30. The fusion layer 20 is formed by adding a bottleneck structure 21 and a dilated convolution (also called hole or atrous convolution) structure 22 to the original module and combining them together.
The fusion module of this embodiment includes a previous layer 10, a fusion layer 20 and a cascade layer 30. The previous layer 10 feeds the input image into the fusion layer 20 for computation, and the results are merged and output by the cascade layer 30; the fusion layer 20 is formed by adding a bottleneck structure 21 and a dilated convolution structure 22 to the original module. Because the fusion layer 20 uses features of different pixel sizes to represent different extracted details and then fuses these regions, it is equivalent to performing convolution at multiple scales simultaneously: the extracted features cover different scales and are therefore richer, which benefits the final classification output. At the same time, different feature channels can have different receptive fields, so the output features are no longer uniformly distributed; instead, strongly correlated features are gathered together and the output features contain less redundant information. This solves the problem that the convolution kernels used for feature extraction in ordinary convolutional neural networks are too uniform and lack diversity and comprehensiveness, and helps improve the feature extraction ability of the network model.
In the fusion layer 20, using convolution kernels of different scales gives different feature channels different receptive fields, so the output features are no longer uniformly distributed; strongly correlated features are gathered together, the output features contain less redundant information, and the feature extraction ability of the network model is improved.
Specifically, as shown in fig. 2, the structural features of the original module are as follows. First, convolution decomposition is used: convolution kernels of different scales (such as 1 × 1 and 3 × 3) are applied to the same input, and the results of these operations are then merged and output. Second, in the original module a small block replaces the traditional convolutional layer, which improves both the learning ability and the efficiency with which parameters are used. Moreover, 1 × 1 point-wise convolutions are used; this sparse structure expresses features efficiently, improves accuracy, avoids overfitting, reduces the number of network parameters, and allows a deeper network structure to be built with limited computing resources, efficiently expanding the depth and width of the network.
The principle of the original module is to fit a sparse structure with a compact part: feature extractors with different pixel sizes represent different regions, and these regions are then combined, so the extracted feature details are richer and the classification results of the network model are more accurate.
As shown in fig. 3a and fig. 3b, compared with the standard convolutional layer of fig. 3a, assuming that the number of input and output feature maps is N = 128, the parameter count of the standard convolutional layer is P = 128 × 3 × 3 × 128 = 147456, while the parameter count of the convolutional layers in the bottleneck structure 21 is P = 128 × 1 × 1 × 32 + 32 × 3 × 3 × 32 + 32 × 1 × 1 × 128 = 17408. The calculation shows that the parameter count of the bottleneck structure 21 is about one eighth of that of the standard convolutional layer. In addition, the bottleneck structure 21 is deeper than the standard convolution structure and has stronger nonlinear expressive power, so the bottleneck structure 21 reduces the network parameters while increasing the network depth, which improves the accuracy of the network.
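The figures above can be checked with a few lines of arithmetic. The minimal sketch below reproduces the two parameter counts and their ratio, assuming the bottleneck is a 1 × 1 → 3 × 3 → 1 × 1 stack with 32 internal channels as stated, and ignoring bias terms.

```python
# Minimal arithmetic check of the parameter counts quoted above
# (bias terms ignored; bottleneck assumed to be 1x1 -> 3x3 -> 1x1 with
# 32 internal channels, as in the text).
n = 128  # number of input and output feature maps

standard_params = n * 3 * 3 * n                      # 128*3*3*128 = 147456
bottleneck_params = (n * 1 * 1 * 32                  # 1x1, 128 -> 32
                     + 32 * 3 * 3 * 32               # 3x3, 32 -> 32
                     + 32 * 1 * 1 * n)               # 1x1, 32 -> 128

print(standard_params, bottleneck_params)            # 147456 17408
print(round(standard_params / bottleneck_params, 2)) # ~8.47, about one eighth
```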
The bottleneck structure 21 is provided with 1 × 1 convolutional layers and a 3 × 3 convolutional layer; this structure not only reduces the number of network parameters but also improves the computational efficiency of the network.
In a conventional convolutional neural network, the feature map is generally down-sampled after the convolutional layers, which causes information loss. To reduce this loss, a dilated convolution structure 22 can be added to the network structure, so that the network enlarges the receptive field and improves performance without increasing the model parameters or the amount of computation.
The dilated convolution structure 22 is a special data sampling mode: it changes dense sampling of the data into sparse sampling. It does not change the learned parameters, only the way the input data is sampled, so it can be used seamlessly in a network model without changing the model structure and without adding extra parameters or computation.
The formula for the size of the region covered by the dilated convolution structure 22 is: F(r) = (2^(r+1) − 1) × (2^(r+1) − 1), where the hyperparameter r indicates that r − 1 spaces are inserted between adjacent pixels.
Assuming that the convolution kernel size is k × k and the dilation rate is r, the k² values used to compute the convolution are taken from locations in the feature map that are spaced r − 1 apart, so the receptive field grows from k × k to [k + (r − 1)(k − 1)] × [k + (r − 1)(k − 1)].
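A short worked example of these two formulas follows; the kernel size of 3 and the dilation rates tried are illustrative values rather than values taken from the patent.

```python
# Worked example of the two receptive-field formulas in the text:
#   F(r) = (2^(r+1) - 1) x (2^(r+1) - 1)   (region size)
#   k + (r - 1) * (k - 1)                  (effective kernel size for a k x k
#                                           kernel with dilation rate r)
def region_side(r: int) -> int:
    return 2 ** (r + 1) - 1

def effective_kernel(k: int, r: int) -> int:
    return k + (r - 1) * (k - 1)

for r in (1, 2, 3):
    print(f"r={r}: region {region_side(r)}x{region_side(r)}, "
          f"effective 3x3 kernel -> {effective_kernel(3, r)}x{effective_kernel(3, r)}")
# r=1: region 3x3,   effective kernel 3x3 (ordinary convolution)
# r=2: region 7x7,   effective kernel 5x5
# r=3: region 15x15, effective kernel 7x7
```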
The fusion layer 20 has five convolution branch channels. Because features with different pixel sizes extract different details, and these regions are then fused to represent a fusion of different details, this is equivalent to performing convolution at multiple scales simultaneously: the extracted features cover different scales and are therefore richer, which benefits the final classification output.
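A minimal Keras sketch of such a five-branch fusion layer is given below. The exact kernel sizes, the dilation rate and the per-branch channel count (32) are assumptions for illustration, not the values of fig. 1; only the overall pattern (parallel branches including a bottleneck and a dilated convolution, merged by a cascade/concatenation layer) follows the description above.

```python
# Sketch of a five-branch fusion layer merged by a cascade (concatenation) layer.
# Kernel sizes, dilation rate and branch width (32) are illustrative assumptions.
import tensorflow as tf
from tensorflow.keras import layers

def conv_bn_relu(x, filters, kernel_size, dilation_rate=1):
    """Convolution followed by batch normalization and ReLU, as described."""
    x = layers.Conv2D(filters, kernel_size, padding="same",
                      dilation_rate=dilation_rate, use_bias=False)(x)
    x = layers.BatchNormalization()(x)
    return layers.ReLU()(x)

def fusion_module(x, branch_channels=32):
    b1 = conv_bn_relu(x, branch_channels, 1)                   # 1x1 branch
    b2 = conv_bn_relu(conv_bn_relu(x, branch_channels, 1),
                      branch_channels, 3)                      # bottleneck branch
    b3 = conv_bn_relu(x, branch_channels, 3, dilation_rate=2)  # dilated 3x3 branch
    b4 = conv_bn_relu(conv_bn_relu(x, branch_channels, 1),
                      branch_channels, 5)                      # larger-kernel branch
    b5 = layers.MaxPooling2D(3, strides=1, padding="same")(x)
    b5 = conv_bn_relu(b5, branch_channels, 1)                  # pooling branch
    return layers.Concatenate()([b1, b2, b3, b4, b5])          # cascade layer

inputs = tf.keras.Input(shape=(28, 28, 64))
outputs = fusion_module(inputs)
print(tf.keras.Model(inputs, outputs).output_shape)            # (None, 28, 28, 160)
```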
In this application, the number of parameters F(i, n) of the fusion module is calculated as:
[Formula given as an image in the original filing: F(i, n) expressed in terms of k × k, C_in and C_out.]
where k × k is the size of the current convolution kernel, C_in is the number of input channels, and C_out is the number of output channels.
The computational cost Flops(i, n) of the fusion module is calculated as:
[Formula given as an image in the original filing: Flops(i, n) expressed in terms of k × k, C_in and C_out.]
where k × k is the size of the current convolution kernel, C_in is the number of input channels, and C_out is the number of output channels.
Example two
As shown in fig. 4, the present embodiment provides a multi-scale feature fusion convolutional neural network (MS-FNet for short), comprising: an input layer, a first convolutional layer, a max-pooling layer, a plurality of the fusion modules according to the first embodiment, a classification output layer and an output layer, wherein a 1 × 1 convolution operation is used before and after each fusion module.
The multi-scale feature fusion convolutional neural network of this embodiment includes an input layer, a first convolutional layer, a max-pooling layer, a plurality of fusion modules, a classification output layer and an output layer. A 1 × 1 convolution is used before and after each fusion module, so the feature map can be reduced or expanded in dimension; at the same time, a nonlinear layer is added at small computational cost, spatial information exchange in the concatenated feature map is strengthened, and feature fusion across channels with different receptive fields is achieved. In addition, at the output of the network a convolutional layer replaces the fully-connected layer for classification output; the feature map is mapped back to the original input size, which ensures that the output-layer nodes have a large receptive field over the original input space, and pixel-level classification is achieved through up-sampling.
In this embodiment, in order to increase the width of the network and improve robustness, the number of fusion modules is four. Specifically, using a 1 × 1 convolution before and after each fusion module means connecting, in order according to the topological structure, the input layer, the first convolution layer, the maximum pooling layer, the second convolution layer, the first fusion module, the third convolution layer, the second fusion module, the fourth convolution layer, the third fusion module, the fifth convolution layer, the fourth fusion module, the sixth convolution layer, the seventh convolution layer, the classification output layer and the output layer. In this way, different feature channels can have different receptive fields and the output features are no longer uniformly distributed, which achieves higher accuracy while reducing computational cost.
In the multi-scale feature fusion convolutional neural network, a convolution and pooling operation is first applied to the network input, so the extracted features are combined and the number of output channels of the next layer is increased. A 1 × 1 convolution is then used before and after each fusion module, which allows dimension reduction or expansion of the feature map, adds a nonlinear layer at small computational cost, strengthens spatial information exchange in the concatenated feature map, and fuses features across channels with different receptive fields. Finally, a convolutional layer is used for the classification output of the network.
Because convolution operations change the distribution of the input data, batch normalization (BN) is applied to the output of the previous convolution before each activation function: the data distribution is normalized so that the mean at the BN layer becomes 0 and the standard deviation becomes 1. This reduces the correlation between features and normalizes the network outputs to a normal distribution, which speeds up the training of deep convolutional networks, alleviates the internal covariate shift caused by changing data distributions during training, and mitigates overfitting. In addition, all convolution operations are activated by the ReLU activation function, which enhances feature sparsity and linear separability.
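A minimal sketch of this per-layer pattern follows: convolution, then batch normalization, then ReLU, with a 1 × 1 convolution before and after each fusion module. The fusion module body is passed in as a callable (for example, the fusion_module sketch from Example one); the channel counts are illustrative assumptions, not the values of fig. 5.

```python
# Sketch of the per-layer pattern described above: every convolution is followed
# by batch normalization and ReLU, and a 1x1 convolution is placed before and
# after each fusion module. The fusion module body is passed in as a callable.
import tensorflow as tf
from tensorflow.keras import layers

def conv_bn_relu(x, filters, kernel_size=1, strides=1):
    x = layers.Conv2D(filters, kernel_size, strides=strides,
                      padding="same", use_bias=False)(x)
    x = layers.BatchNormalization()(x)  # zero mean, unit standard deviation
    return layers.ReLU()(x)             # ReLU after every convolution

def fusion_stage(x, fusion_block, channels):
    x = conv_bn_relu(x, channels, 1)    # 1x1 convolution before the fusion module
    x = fusion_block(x)
    return conv_bn_relu(x, channels, 1) # 1x1 convolution after the fusion module
```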
The main index for evaluating the time performance of an algorithm is its time complexity. If the dimension of the input feature map of the multi-scale feature fusion convolutional neural network is n, the overall time complexity of MS-FNet is:
Time ≈ O( Σ_{i=1}^{N} M_i^2 × K_i^2 × N_{i−1} × N_i )
where N denotes the depth of the network, i denotes the i-th convolutional layer, M denotes the size of the output feature map, K denotes the size of the convolution kernel, N_{i−1} denotes the number of feature maps output by the previous layer, and N_i denotes the number of feature maps output by the i-th layer.
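A small worked example of this expression follows; the layer configuration used is illustrative only and does not reproduce the actual MS-FNet settings of fig. 5.

```python
# Worked example of the expression above: the total multiply-accumulate cost is
# the sum over convolutional layers of M_i^2 * K_i^2 * N_(i-1) * N_i.
# The layer configuration below is illustrative, not the actual MS-FNet one.
def total_macs(layer_cfgs):
    """layer_cfgs: list of (M, K, n_prev, n_cur) tuples, one per conv layer."""
    return sum(M * M * K * K * n_prev * n_cur
               for (M, K, n_prev, n_cur) in layer_cfgs)

example = [
    (28, 3, 3, 64),     # an early 3x3 convolution on a 28x28 input
    (14, 3, 64, 128),   # a 3x3 convolution after pooling
    (14, 1, 128, 128),  # a 1x1 convolution around a fusion module
]
print(total_macs(example))  # 19016704, i.e. about 1.9e7 multiply-accumulates
```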
EXAMPLE III
The embodiment provides an image identification method of a multi-scale feature fusion convolutional neural network, which comprises the following steps: step S1: performing convolution pooling operation on the input picture of the multi-scale feature fusion convolution neural network; step S2: using 1 × 1 convolution operation for the previous layer and the next layer of each fusion module; step S3: and classifying and outputting the output end of the multi-scale feature fusion convolutional neural network through a global average pooling layer or a convolutional layer.
In the image identification method of the multi-scale feature fusion convolutional neural network of this embodiment: in step S1, a convolution and pooling operation is applied to the input picture of the network, which helps combine the extracted features and increases the number of output channels of the next layer; in step S2, a 1 × 1 convolution is applied before and after each fusion module, which allows dimension reduction or expansion of the feature map, adds a nonlinear layer at small computational cost, strengthens spatial information exchange in the concatenated feature map, and fuses features across channels with different receptive fields; in step S3, the output of the network is classified through a global average pooling layer or a convolutional layer, the feature map is mapped back to the original input size, the output-layer nodes are guaranteed a large receptive field over the original input space, and pixel-level classification is achieved through up-sampling.
The test results and analyses of the present application are as follows:
the experimental platform is a GTX 1080Ti GPU and the deep learning framework used is TensorFlow. Experiments are carried out on the MNIST, CIFAR-10 and CIFAR-100 data sets, and the performance of the network model MS-FNet is verified using the number of parameters and the error rate of the network.
CIFAR-10 dataset:
the CIFAR-10 data set consists of 60000 RGB three-channel images of size 32 × 32, of which 50000 are used for training and 10000 for testing. The data set has 10 classes, each with 5000 training images and 1000 test images. Before training the model, the samples are preprocessed: each image is randomly flipped, random brightness and contrast changes are applied, and the images are cropped to 28 × 28. This produces more samples, improves the utilization of the sample images, adds random noise to them and serves as data augmentation.
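A TensorFlow sketch of this preprocessing pipeline is given below. The random flip and the 28 × 28 crop follow the description above; the brightness and contrast ranges are assumed values, since the patent does not state them.

```python
# Sketch of the CIFAR-10 preprocessing described above: random flip, random
# brightness/contrast changes, and a random 28x28 crop. The brightness and
# contrast ranges are assumed values, not taken from the patent.
import tensorflow as tf

def augment(image):
    image = tf.image.random_flip_left_right(image)            # random flip
    image = tf.image.random_brightness(image, max_delta=63)   # brightness jitter
    image = tf.image.random_contrast(image, lower=0.2, upper=1.8)
    image = tf.image.random_crop(image, size=[28, 28, 3])     # 32x32 -> 28x28
    return tf.image.per_image_standardization(image)

(x_train, y_train), _ = tf.keras.datasets.cifar10.load_data()
train_ds = (tf.data.Dataset.from_tensor_slices((x_train.astype("float32"), y_train))
            .map(lambda img, label: (augment(img), label))
            .shuffle(10000)
            .batch(128))
```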
As shown in fig. 5, the parameter settings of the network model MS-FNet are described in detail using the CIFAR-10 data set as an example. The first column, Layer name, lists the operation of each layer of the network model; the second column, Filter size/stride, gives the size and stride of the convolution kernel used by the current layer; the third column, Output size, gives the size of the output feature map; the fourth column, Filter size/Channel number (Fusion module), gives the filter sizes and channel numbers used inside the fusion modules; and the last column gives the number of parameters produced by each layer. The last row sums the parameters of all layers, giving a total of about 0.37M parameters for MS-FNet.
As shown in fig. 6, the parameters and error rates of the network model MS-FNet are compared with those of other network models on the CIFAR-10 data set. The comparison shows that MS-FNet uses only 0.37M parameters on CIFAR-10, about 1/7 of FitNet, 1/23 of SqueezeNet and 1/100 of FractalNet. At the same time, its error rate is only 6.19%, lower than that of most other models and much lower than that of conventional convolutional neural network models. Although WideResNet and FractalNet achieve lower error rates, their parameter counts are far larger than that of MS-FNet. MS-FNet therefore greatly reduces the number of network parameters while keeping the error rate low, and offers good classification accuracy and generalization ability.
CIFAR-100 dataset:
the CIFAR-100 data set is similar to CIFAR-10 and likewise contains 60000 RGB three-channel images of size 32 × 32. Unlike CIFAR-10, it contains 100 classes, which are subdivided from 20 superclasses. Each class contains 500 training images and 100 test images, 600 images in total.
As shown in fig. 7, the network configuration parameters on the CIFAR-100 data set are set consistently with fig. 5. FIG. 7 compares MS-FNet with other network models on the CIFAR-100 data set; it can be seen that MS-FNet greatly reduces the number of network parameters while ensuring that the error rate does not increase.
MNIST dataset:
the MNIST data set is a subset of the NIST handwriting database and contains the handwritten digits 0 to 9; each digit class has 6000 training images and 1000 test images, each image being a single-channel 28 × 28 image, for 70000 images in total. Because the MNIST data set is simple, the numbers of convolutional layers and Fusion modules in the MS-FNet configuration are reduced, and the number of feature-map channels is reduced to a single channel.
As can be seen from fig. 8, under the same experimental environment and number of iterations, the training time with dilated convolution is on average 13% lower than without it. As the number of training iterations increases, the training time of both models (with and without dilated convolution) increases and the error rate decreases, although the error rate of the model with dilated convolution is slightly higher than that of the model without it. This group of experiments shows that adding dilated convolution appropriately to the network model can speed up training, so the network model converges faster and the network performance is improved.
Fig. 9 compares the results of MS-FNet and other network models on the MNIST data set. It can be seen that the error rate of the network model MS-FNet is only 0.42%, lower than that of the other network models, indicating that MS-FNet has good classification ability.
Analyzing the network performance:
in order to intuitively analyze the network performance of the network model, an accuracy curve and a loss function curve of the MS-FNet are listed respectively.
As shown in fig. 10, the accuracy curves of MS-FNet on the MNIST data set are given for three different output processing modes: a fully-connected layer, a global average pooling layer and a convolutional layer. The abscissa is the number of iterations of network model training, with 1500 iterations on the MNIST data set; the ordinate is the accuracy of the network model. As the number of iterations increases, the accuracy of all three processing modes rises, but the curve for the convolutional layer rises the fastest and reaches the highest final accuracy; the global average pooling curve rises slightly more slowly than the convolutional-layer curve but its final accuracy is close to it; the fully-connected-layer curve rises the slowest and its final accuracy is the lowest. The analysis of the accuracy curves therefore shows that replacing the fully-connected layer with a convolutional layer speeds up network training and correspondingly improves accuracy.
The loss function curve is one of the performance indexes used to evaluate a convolutional neural network; it reflects the convergence of network model training, and the convergence speed and converged value are generally used as standards for evaluating model performance. FIG. 11 compares the cross-entropy loss curves of the network model MS-FNet on the CIFAR-10 data set for three different processing modes at the end of the network: a fully-connected layer, a global average pooling layer and a convolutional layer. The convergence ability of the network model is analyzed from these three loss curves. The abscissa is the number of iterations of network model training, 5000 iterations on the CIFAR-10 data set; the ordinate is the error loss value of the different network models during training. As the number of iterations increases, the loss curves of all three processing modes converge, but the fully-connected-layer curve converges at a higher value and descends the slowest; the global-average-pooling curve descends faster and converges to a smaller value; and the convolutional-layer curve descends the fastest and converges to an even smaller value. Overall, replacing the fully-connected layer with a convolutional layer makes the network model converge faster and gives a better result.
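The three output-processing modes compared in figs. 10 and 11 can be sketched as follows. The backbone feature-map shape and the ten-class setting are illustrative assumptions; the point of the comparison is the parameter-count gap between the fully-connected head and the other two heads.

```python
# Sketch of the three classification heads compared above, applied to the same
# (assumed) backbone feature map; the class count of 10 matches CIFAR-10/MNIST.
import tensorflow as tf
from tensorflow.keras import layers

features = tf.keras.Input(shape=(7, 7, 128))  # assumed backbone output shape

# (a) fully-connected head: flatten, then a dense softmax classifier
fc_head = layers.Dense(10, activation="softmax")(layers.Flatten()(features))

# (b) global average pooling head: one value per channel, then a small classifier
gap_head = layers.Dense(10, activation="softmax")(
    layers.GlobalAveragePooling2D()(features))

# (c) convolutional head: 1x1 convolution to 10 channels, spatial averaging, softmax
conv_logits = layers.Conv2D(10, 1)(features)
conv_head = layers.Softmax()(layers.GlobalAveragePooling2D()(conv_logits))

for name, head in [("fc", fc_head), ("gap", gap_head), ("conv", conv_head)]:
    print(name, tf.keras.Model(features, head).count_params())
# fc ~62,730 parameters; gap and conv ~1,290 parameters each
```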
As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, system, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be understood that the above examples are given only for clarity of illustration and are not intended to limit the embodiments. Other variations and modifications will be apparent to persons skilled in the art in light of the above description. It is neither necessary nor possible to list all embodiments exhaustively here, and obvious variations or modifications derived therefrom remain within the scope of the invention.

Claims (10)

1. A fusion module, characterized in that: the fusion module comprises a previous layer, a fusion layer and a cascade layer, wherein the previous layer feeds an input picture into the fusion layer for computation, the results of the computation are then merged and output by the cascade layer, and the fusion layer is formed by adding a bottleneck structure and a dilated (hole) convolution structure to the original module and combining them together.
2. The fusion module of claim 1, wherein: in the fusion layer, convolution kernels of different scales are used.
3. The fusion module of claim 1, wherein: the bottleneck structure is provided with a 1 × 1 convolution layer and a 3 × 3 convolution layer.
4. The fusion module of claim 1, wherein: the formula for the size of the region covered by the dilated convolution is: F(r) = (2^(r+1) − 1) × (2^(r+1) − 1), where the hyperparameter r indicates that r − 1 spaces are inserted between adjacent pixels.
5. The fusion module of claim 1, wherein: the fusion layer has five convolution branch channels.
6. The fusion module of claim 1, wherein: the number of parameters F(i, n) of the fusion module is calculated as:
[Formula given as an image in the original filing: F(i, n) expressed in terms of k × k, C_in and C_out.]
where k × k is the size of the current convolution kernel, C_in is the number of input channels, and C_out is the number of output channels.
7. The fusion module of claim 1, wherein: the computational cost Flops(i, n) of the fusion module is calculated as:
[Formula given as an image in the original filing: Flops(i, n) expressed in terms of k × k, C_in and C_out.]
where k × k is the size of the current convolution kernel, C_in is the number of input channels, and C_out is the number of output channels.
8. A multi-scale feature fusion convolutional neural network, characterized by comprising: an input layer, a first convolution layer, a max-pooling layer, a plurality of the fusion modules of any of claims 1-7, a classification output layer and an output layer, wherein a 1 × 1 convolution operation is used before and after each fusion module.
9. The multi-scale feature fusion convolutional neural network of claim 8, wherein: using a 1 × 1 convolution operation before and after each fusion module comprises: connecting, in order according to the topological structure, the input layer, the first convolution layer, the maximum pooling layer, the second convolution layer, the first fusion module, the third convolution layer, the second fusion module, the fourth convolution layer, the third fusion module, the fifth convolution layer, the fourth fusion module, the sixth convolution layer, the seventh convolution layer, the classification output layer and the output layer.
10. An image identification method of a multi-scale feature fusion convolutional neural network is characterized by comprising the following steps:
step S1: performing convolution pooling operation on the input picture of the multi-scale feature fusion convolution neural network;
step S2: using 1 × 1 convolution operation for the previous layer and the next layer of each fusion module;
step S3: and classifying and outputting the output end of the multi-scale feature fusion convolutional neural network through a global average pooling layer or a convolutional layer.
CN202010888768.2A 2020-08-28 2020-08-28 Fusion module, multi-scale feature fusion convolutional neural network and image identification method Pending CN112036475A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010888768.2A CN112036475A (en) 2020-08-28 2020-08-28 Fusion module, multi-scale feature fusion convolutional neural network and image identification method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010888768.2A CN112036475A (en) 2020-08-28 2020-08-28 Fusion module, multi-scale feature fusion convolutional neural network and image identification method

Publications (1)

Publication Number Publication Date
CN112036475A true CN112036475A (en) 2020-12-04

Family

ID=73586929

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010888768.2A Pending CN112036475A (en) 2020-08-28 2020-08-28 Fusion module, multi-scale feature fusion convolutional neural network and image identification method

Country Status (1)

Country Link
CN (1) CN112036475A (en)

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112660655A (en) * 2020-12-10 2021-04-16 成都工业学院 Intelligent classification garbage bin based on deep learning
CN113283466A (en) * 2021-04-12 2021-08-20 开放智能机器(上海)有限公司 Instrument reading identification method and device and readable storage medium
CN113300796A (en) * 2021-07-26 2021-08-24 南京邮电大学 Frequency spectrum sensing method and device based on machine learning in NOMA system
CN114360009A (en) * 2021-12-23 2022-04-15 电子科技大学长三角研究院(湖州) Multi-scale characteristic face attribute recognition system and method under complex scene
CN114387467A (en) * 2021-12-09 2022-04-22 哈工大(张家口)工业技术研究院 Medical image classification method based on multi-module convolution feature fusion
CN114466531A (en) * 2022-03-09 2022-05-10 江门市尚智电子材料有限公司 Environment-friendly processing method of multilayer FPC flexible circuit board

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109344883A (en) * 2018-09-13 2019-02-15 西京学院 Fruit tree diseases and pests recognition methods under a kind of complex background based on empty convolution
CN110197217A (en) * 2019-05-24 2019-09-03 中国矿业大学 It is a kind of to be interlocked the image classification method of fused packet convolutional network based on depth
US20200182995A1 (en) * 2015-07-17 2020-06-11 Origin Wireless, Inc. Method, apparatus, and system for outdoor target tracking

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20200182995A1 (en) * 2015-07-17 2020-06-11 Origin Wireless, Inc. Method, apparatus, and system for outdoor target tracking
CN109344883A (en) * 2018-09-13 2019-02-15 西京学院 Fruit tree diseases and pests recognition methods under a kind of complex background based on empty convolution
CN110197217A (en) * 2019-05-24 2019-09-03 中国矿业大学 It is a kind of to be interlocked the image classification method of fused packet convolutional network based on depth

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112660655A (en) * 2020-12-10 2021-04-16 成都工业学院 Intelligent classification garbage bin based on deep learning
CN113283466A (en) * 2021-04-12 2021-08-20 开放智能机器(上海)有限公司 Instrument reading identification method and device and readable storage medium
CN113300796A (en) * 2021-07-26 2021-08-24 南京邮电大学 Frequency spectrum sensing method and device based on machine learning in NOMA system
CN113300796B (en) * 2021-07-26 2021-10-08 南京邮电大学 Frequency spectrum sensing method and device based on machine learning in NOMA system
CN114387467A (en) * 2021-12-09 2022-04-22 哈工大(张家口)工业技术研究院 Medical image classification method based on multi-module convolution feature fusion
CN114387467B (en) * 2021-12-09 2022-07-29 哈工大(张家口)工业技术研究院 Medical image classification method based on multi-module convolution feature fusion
CN114360009A (en) * 2021-12-23 2022-04-15 电子科技大学长三角研究院(湖州) Multi-scale characteristic face attribute recognition system and method under complex scene
CN114466531A (en) * 2022-03-09 2022-05-10 江门市尚智电子材料有限公司 Environment-friendly processing method of multilayer FPC flexible circuit board

Similar Documents

Publication Publication Date Title
CN112036475A (en) Fusion module, multi-scale feature fusion convolutional neural network and image identification method
CN107229757B (en) Video retrieval method based on deep learning and Hash coding
CN111091130A (en) Real-time image semantic segmentation method and system based on lightweight convolutional neural network
CN110443805B (en) Semantic segmentation method based on pixel density
CN111612008A (en) Image segmentation method based on convolution network
CN112529146B (en) Neural network model training method and device
CN110222718B (en) Image processing method and device
CN112784756B (en) Human body identification tracking method
CN113298032A (en) Unmanned aerial vehicle visual angle image vehicle target detection method based on deep learning
JP7085600B2 (en) Similar area enhancement method and system using similarity between images
CN110568445A (en) Laser radar and vision fusion perception method of lightweight convolutional neural network
CN110782430A (en) Small target detection method and device, electronic equipment and storage medium
Wang et al. Evolutionary multi-objective model compression for deep neural networks
CN112906865A (en) Neural network architecture searching method and device, electronic equipment and storage medium
CN116863194A (en) Foot ulcer image classification method, system, equipment and medium
Li et al. NDNet: Spacewise multiscale representation learning via neighbor decoupling for real-time driving scene parsing
Kajkamhaeng et al. SE-SqueezeNet: SqueezeNet extension with squeeze-and-excitation block
CN114092773A (en) Signal processing method, signal processing device, electronic apparatus, and storage medium
CN113780140A (en) Gesture image segmentation and recognition method and device based on deep learning
CN113327227A (en) Rapid wheat head detection method based on MobilenetV3
Dasgupta et al. Scale-invariant multi-oriented text detection in wild scene image
CN117152438A (en) Lightweight street view image semantic segmentation method based on improved deep LabV3+ network
CN114091648A (en) Image classification method and device based on convolutional neural network and convolutional neural network
CN114494284A (en) Scene analysis model and method based on explicit supervision area relation
CN114612758A (en) Target detection method based on deep grouping separable convolution

Legal Events

Date Code Title Description
PB01 Publication
PB01 Publication
SE01 Entry into force of request for substantive examination
SE01 Entry into force of request for substantive examination