CN111767860A

CN111767860A - Method and terminal for realizing image recognition through convolutional neural network

Info

Publication number: CN111767860A
Application number: CN202010613939.0A
Authority: CN
Inventors: 仲会娟; 蔡清泳
Original assignee: Yango University
Current assignee: Yango University
Priority date: 2020-06-30
Filing date: 2020-06-30
Publication date: 2020-10-13

Abstract

The invention discloses a method and a terminal for realizing image recognition through a convolutional neural network. A data set is acquired and preprocessed; initial parameters of a convolutional neural network model are set, wherein the model comprises a plurality of pooling layers whose output features are cascaded; the model is trained according to the data set, and image recognition is performed with the trained model. By cascading the output of each pooling layer, the method and the terminal can use the features of the image at every scale for classification, improving both the accuracy and the speed of classification by the convolutional neural network.

Description

Method and terminal for realizing image recognition through convolutional neural network

Technical Field

The invention relates to the field of image processing, in particular to a method and a terminal for realizing image recognition through a convolutional neural network.

Background

With the development of the global economy, living standards have improved greatly and vehicle ownership has risen sharply, giving people more choice and convenience in daily travel. However, the gap between this growth and the relatively lagging construction of traffic safety infrastructure and relatively weak traffic safety management has become increasingly prominent, so traffic accidents and congestion occur frequently and have become a major social problem affecting people's lives. Intelligent transportation systems are therefore receiving wide attention, and active safe driving and unmanned driving technologies have become a research focus for scholars and enterprises at home and abroad. Road traffic sign recognition is an important component of active safety driving systems and automatic driving systems and plays an important role in road driving safety. Because passengers' personal safety is at stake, the automotive industry places very strict requirements on safety and reliability; a traffic sign recognition system must therefore offer both high recognition accuracy and real-time performance, which makes traffic sign recognition a challenging task.

In recent years, convolutional neural network methods such as LeNet, AlexNet, VGG, GoogLeNet, YOLO and ResNet have achieved remarkable performance in image detection and recognition. A conventional lightweight LeNet-5 convolutional neural network model, shown in fig. 1, includes five layers. For the recognition of traffic sign images, an excessively complex convolutional neural network can provide a more reliable recognition result, but it wastes resources unnecessarily and computes more slowly.

Disclosure of Invention

The technical problem to be solved by the invention is to provide a method and a terminal for realizing image recognition through a convolutional neural network that can recognize images quickly and correctly.

In order to solve the technical problems, the invention adopts a technical scheme that:

a method for implementing image recognition by a convolutional neural network, comprising the steps of:

S1, acquiring a data set, and preprocessing the data set;

S2, setting initial parameters of a convolutional neural network model, wherein the convolutional neural network model comprises a plurality of pooling layers, and cascading the output features of the pooling layers;

S3, training the convolutional neural network model according to the data set, and performing image recognition by using the trained convolutional neural network model.

In order to solve the technical problem, the invention adopts another technical scheme as follows:

a terminal for implementing image recognition through a convolutional neural network, comprising a memory, a processor, and a computer program stored on the memory and executable on the processor, the processor implementing the following steps when executing the computer program:

S1, acquiring a data set, and preprocessing the data set;

The invention has the following beneficial effects: a data set is acquired and preprocessed, and the output features of the pooling layers are cascaded, so that both the local and the global features of an image are used and the image is analyzed from feature information at different scales, which improves the accuracy of the output result. Preprocessing ensures that the input images meet the input requirements of the convolutional neural network, which further improves accuracy, and it unifies the images, which speeds up the network's analysis of them.

Drawings

FIG. 1 is a diagram of a conventional convolutional neural network model LeNet-5;

FIG. 2 is a flowchart illustrating steps of a method for performing image recognition via a convolutional neural network, according to an embodiment of the present invention;

fig. 3 is a schematic structural diagram of a terminal for implementing image recognition through a convolutional neural network according to an embodiment of the present invention;

FIG. 4 is a sample distribution diagram of a data set according to an embodiment of the present invention;

fig. 5 is a schematic diagram of GTSRB dataset sample distribution according to an embodiment of the present invention;

FIG. 6 is a schematic diagram of image augmentation according to an embodiment of the present invention;

FIG. 7 is a block diagram of a convolutional neural network model according to an embodiment of the present invention;

FIG. 8 is a graph of the loss function variation during the convolutional neural network model training process according to an embodiment of the present invention;

FIG. 9 is a graph of the accuracy function variation during the convolutional neural network model training process according to an embodiment of the present invention;

FIG. 10 is a selection of convolution kernel size parameter candidates and their results of operation in a convolutional neural network model, in accordance with an embodiment of the present invention;

FIG. 11 is a Dropout parameter selection and its results of operation in a convolutional neural network model, in accordance with an embodiment of the present invention;

FIG. 12 is a diagram of a candidate set of full-link layer neuron numbers and their results of operation in a convolutional neural network model, according to an embodiment of the present invention;

FIG. 13 is a comparison of model data for different convolutional neural networks according to embodiments of the present invention;

description of reference numerals:

1. a terminal for realizing image recognition through a convolutional neural network; 2. a processor;

3. a memory.

Detailed Description

In order to explain technical contents, achieved objects, and effects of the present invention in detail, the following description is made with reference to the accompanying drawings in combination with the embodiments.

Referring to fig. 2, a method for implementing image recognition through a convolutional neural network includes the steps of:

S1, acquiring a data set, and preprocessing the data set;

From the above description, the beneficial effects of the present invention are as follows: a data set is acquired and preprocessed, and the output features of the pooling layers are cascaded, so that both the local and the global features of an image are used and the image is analyzed from feature information at different scales, which improves the accuracy of the output result. Preprocessing ensures that the input images meet the input requirements of the convolutional neural network, which further improves accuracy, and it unifies the images, which speeds up the network's analysis of them.

Further, the convolutional neural network model further comprises a global average pooling layer;

in S2, the step of cascading the output characteristics of each pooling layer specifically includes:

unifying the size of the output features of each of the pooling layers;

connecting the output features of the pooling layers with uniform size in series to form a tensor;

inputting the tensor into the global average pooling layer.

As can be seen from the above description, the outputs of the pooling layers are resized, fused and then fed to the global average pooling layer, so that the multi-scale features are condensed in the global average pooling layer before being handed to the subsequent classifier. Using global average pooling instead of a conventional flattening layer (Flatten) makes the transition from feature maps to classification more natural, and traffic signs are classified using feature information at different scales, such as the local and global features of the image.

Further, the data set consists of road traffic sign images acquired by a vehicle-mounted camera in an actual traffic environment;

the road traffic sign images in the data set include images shot under different lighting, different degrees of occlusion, different shooting angles and different vehicle speeds.

Preprocessing the data set in S1 includes:

S11, normalizing the size of the images in the data set to obtain a first data set;

S12, normalizing the pixels of the images in the first data set to obtain a second data set;

S13, augmenting the images in the second data set to obtain a third data set;

S3 specifically includes: training the convolutional neural network model according to the third data set.

According to the above description, the practicality of the trained model is considered when the traffic sign images in the data set are selected: images acquired by a vehicle-mounted camera in an actual traffic environment are used so that the trained model meets the needs of actual use, and images under different lighting, degrees of occlusion, shooting angles and vehicle speeds are included so that the final model can recognize images shot in different scenes, which improves its robustness. The data set is preprocessed before it is used to train the model, which helps ensure the accuracy with which the final model recognizes images.

Further, the augmentation is specifically:

S131, rotating in a random direction about the center of the image in units of 10 degrees;

S132, randomly translating the image by 4 pixels, removing the vacated part on one side, and stretching the image back to the size of the original image;

S133, randomly flipping the image;

the operations of S131, S132 and S133 are performed in ascending order of step number, although the order is not limited thereto.

According to the above description, augmenting the data set yields a data set with a sufficient amount of data for training the model, which prevents overfitting and improves the generalization capability of the model.
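The augmentation steps S131 to S133 can be illustrated with a minimal pure-Python sketch. It is not the patent's implementation: it assumes a single-channel image stored as a list of rows, nearest-neighbour resampling, zero fill outside the rotated image, a shift of exactly 4 pixels along one randomly chosen axis direction, and a 50% chance of flipping. The function names are hypothetical.

```python
import math
import random

def rotate(img, degrees):
    """S131: rotate about the image center (nearest neighbour, zero fill)."""
    h, w = len(img), len(img[0])
    cy, cx = (h - 1) / 2, (w - 1) / 2
    t = math.radians(degrees)
    cos_t, sin_t = math.cos(t), math.sin(t)
    out = [[0.0] * w for _ in range(h)]
    for i in range(h):
        for j in range(w):
            # Inverse-map each output pixel to its source position.
            y, x = i - cy, j - cx
            si = round(-sin_t * x + cos_t * y + cy)
            sj = round(cos_t * x + sin_t * y + cx)
            if 0 <= si < h and 0 <= sj < w:
                out[i][j] = img[si][sj]
    return out

def translate_and_stretch(img, dx, dy):
    """S132: shift by (dx, dy), drop the vacated band, stretch back to size."""
    h, w = len(img), len(img[0])
    rows = range(max(0, -dy), h - max(0, dy))
    cols = range(max(0, -dx), w - max(0, dx))
    cropped = [[img[i][j] for j in cols] for i in rows]
    ch, cw = len(cropped), len(cropped[0])
    return [[cropped[i * ch // h][j * cw // w] for j in range(w)] for i in range(h)]

def flip(img, horizontal=True):
    """S133: mirror the image left-right or top-bottom."""
    if horizontal:
        return [row[::-1] for row in img]
    return [list(row) for row in reversed(img)]

def augment(img, rng):
    """Apply S131, S132 and S133 in ascending order of step number."""
    out = rotate(img, 10 * rng.randrange(36))            # multiple of 10 degrees
    dx, dy = rng.choice([(4, 0), (-4, 0), (0, 4), (0, -4)])
    out = translate_and_stretch(out, dx, dy)
    if rng.random() < 0.5:                               # assumed flip probability
        out = flip(out, horizontal=rng.random() < 0.5)
    return out
```

Calling `augment(image, random.Random(0))` several times with different seeds would produce several variants of one 32 × 32 sample, each with the same size as the original.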

Further, the convolutional neural network further includes: a plurality of convolutional layers alternating with the pooling layers;

each single convolutional layer is formed by cascading a plurality of small convolutional layers, and, apart from the small convolutional layers at the head and the tail, the input of each small convolutional layer is connected to the output of the preceding small convolutional layer.

From the above description, it can be seen that cascading a plurality of small convolutional layers in place of a traditional single large convolutional layer preserves the size of the receptive field while keeping the number of parameters under control.
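As a hedged illustration of the receptive-field argument, the sketch below computes the receptive field of a stack of convolutions and the parameter count of one convolution. The 5 × 5 and 9 × 9 kernel sizes come from the second embodiment; the channel widths passed to `conv_params` are hypothetical, and the actual totals depend on the widths of each small layer, which this text does not list.

```python
def receptive_field(kernel_sizes, strides=None):
    """Receptive field (in input pixels) of a stack of convolutions."""
    strides = strides or [1] * len(kernel_sizes)
    field, jump = 1, 1
    for kernel, stride in zip(kernel_sizes, strides):
        field += (kernel - 1) * jump   # each layer widens the field
        jump *= stride                 # spacing between adjacent outputs
    return field

def conv_params(kernel, c_in, c_out, bias=True):
    """Weights (plus optional biases) of one kernel x kernel convolution."""
    return kernel * kernel * c_in * c_out + (c_out if bias else 0)

print(receptive_field([5, 5]))   # two cascaded 5 x 5 layers -> 9
print(receptive_field([9]))      # one 9 x 9 layer           -> 9
```

With stride 1, two cascaded 5 × 5 layers therefore see the same 9 × 9 input window as one 9 × 9 layer; `conv_params` can then be used to compare parameter counts for whichever channel widths are chosen.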

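Further, a batch normalization algorithm is set after each convolution layer:

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_B\right)^2$$

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}, \qquad y_i = \gamma\,\hat{x}_i + \beta$$

wherein $B = \{x_1, x_2, x_3, \ldots, x_m\}$ represents a batch of m data input into the convolutional neural network model, $\mu_B$ represents the mean of the m data, $\sigma_B^2$ represents the variance of the m data, $\varepsilon$ is a small positive number, $\gamma$ is a scaling factor, and $\beta$ is a translation factor.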

From the above description, it can be seen that setting a batch normalization algorithm after each convolution layer accelerates the convergence of the model; adding a small positive number prevents division by zero, and adding a scaling factor and a translation factor enhances the expressive capability of the network.
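A minimal sketch of batch normalization over one batch of scalar activations follows. It assumes the standard formulation (subtract the batch mean, divide by the square root of the variance plus a small positive number, then scale and shift); the default values of `gamma`, `beta` and `eps` are illustrative and are not taken from the patent.

```python
import math

def batch_norm(batch, gamma=1.0, beta=0.0, eps=1e-5):
    """Normalize a batch of m values, then scale by gamma and shift by beta."""
    m = len(batch)
    mean = sum(batch) / m
    variance = sum((x - mean) ** 2 for x in batch) / m
    # eps keeps the divisor away from zero when the variance is 0.
    return [gamma * (x - mean) / math.sqrt(variance + eps) + beta for x in batch]
```

In a convolutional layer the same computation would typically be applied per channel, with one learned `gamma` and `beta` per channel.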

Further, the convolutional neural network further includes: a fully-connected layer;

setting the initial parameters of the convolutional neural network model in step S2 includes:

setting the number of convolution kernels, the weights of the convolution kernels, the biases and the number of neurons of the fully-connected layer;

setting an initial learning rate, a target minimum error, a training period and the number of samples selected in a single training step.

As can be seen from the above description, the initial parameters of the convolutional neural network model can be set according to different conditions to control the behavior of the model and meet different requirements, which makes the model highly flexible.

Further, the data set in S1 includes a training set and a test set;

the step S3 specifically includes:

training the convolutional neural network model according to the training set to obtain a first convolutional neural network model;

and verifying the first convolutional neural network model according to the test set to obtain the trained convolutional neural network model.

According to the above description, the data set is divided into a training set and a test set. The training set is used to train the convolutional neural network model, and the model is then verified on the test set; if it does not meet the performance requirement, it is adjusted according to the verification result, which further ensures the recognition accuracy of the final model.

Further, training the convolutional neural network model according to the data set in step S3 specifically includes:

S31, selecting a cross entropy loss function as the objective function, outputting the recognition result through a Softmax classifier, setting EPOCHS to 50, BATCH_SIZE to 64 and the initial learning rate to 0.001, and gradually attenuating the learning rate through the Adam optimization algorithm as the number of iterations increases;

S32, sending the data set into the convolutional neural network model, and calculating the forward output;

S33, calculating the error of the forward output, and updating the weights and biases in the convolutional neural network model by means of the back propagation algorithm;

S34, repeating steps S32 and S33 until the objective function converges, and saving the convolutional neural network model at that point.

From the above description, the parameters of the convolutional neural network model are adjusted according to the objective function in combination with the back propagation algorithm, so that the reliability of the completed model is further ensured.
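Steps S31 to S34 can be sketched on a deliberately tiny stand-in model. The code below is an assumption-laden illustration, not the patent's network: it replaces the convolutional model with a single linear layer followed by Softmax, runs a fixed number of epochs instead of testing for convergence, and uses the common default Adam constants (0.9, 0.999, 1e-8). EPOCHS, BATCH_SIZE and the learning rate are the values stated in S31.

```python
import math
import random

EPOCHS, BATCH_SIZE, LEARNING_RATE = 50, 64, 0.001  # values stated in S31

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def forward(theta, x, n_classes):
    """S32: forward output of one linear layer followed by Softmax."""
    width = len(x) + 1  # one weight per feature plus one bias per class
    logits = []
    for c in range(n_classes):
        row = theta[c * width:(c + 1) * width]
        logits.append(sum(w * xi for w, xi in zip(row, x)) + row[-1])
    return softmax(logits)

def cross_entropy(theta, data, n_classes):
    """S31: cross entropy objective averaged over (features, label) pairs."""
    total = sum(math.log(forward(theta, x, n_classes)[y]) for x, y in data)
    return -total / len(data)

def batch_gradient(theta, batch, n_classes):
    """S33: error of the forward output propagated back to weights and biases."""
    grad = [0.0] * len(theta)
    for x, y in batch:
        probs = forward(theta, x, n_classes)
        width = len(x) + 1
        for c in range(n_classes):
            err = probs[c] - (1.0 if c == y else 0.0)  # dLoss / dlogit_c
            for k, xi in enumerate(x):
                grad[c * width + k] += err * xi / len(batch)
            grad[c * width + len(x)] += err / len(batch)
    return grad

def train(data, n_classes, seed=0):
    """S34 (simplified): repeat S32 and S33 for a fixed number of epochs."""
    rng = random.Random(seed)
    data = list(data)
    theta = [0.0] * (n_classes * (len(data[0][0]) + 1))
    m, v = [0.0] * len(theta), [0.0] * len(theta)
    beta1, beta2, eps, step = 0.9, 0.999, 1e-8, 0
    for _ in range(EPOCHS):
        rng.shuffle(data)
        for start in range(0, len(data), BATCH_SIZE):
            grad = batch_gradient(theta, data[start:start + BATCH_SIZE], n_classes)
            step += 1
            for i, g in enumerate(grad):  # Adam update for each parameter
                m[i] = beta1 * m[i] + (1 - beta1) * g
                v[i] = beta2 * v[i] + (1 - beta2) * g * g
                m_hat = m[i] / (1 - beta1 ** step)
                v_hat = v[i] / (1 - beta2 ** step)
                theta[i] -= LEARNING_RATE * m_hat / (math.sqrt(v_hat) + eps)
    return theta
```

In a full implementation the same loop structure would wrap the convolutional model, and the saved parameters would be the ones reached when the objective function stops improving.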

Referring to fig. 3, a terminal for implementing image recognition through a convolutional neural network includes a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor implements the following steps when executing the computer program:

S1, acquiring a data set, and preprocessing the data set;

As can be seen from the above description, the beneficial effects of the present invention are as follows: a data set is acquired and preprocessed, and the output features of the pooling layers are cascaded, so that both the local and the global features of an image are used and the image is analyzed from feature information at different scales, which improves the accuracy of the output result. Preprocessing ensures that the input images meet the input requirements of the convolutional neural network, which further improves accuracy, and it unifies the images, which speeds up the network's analysis of them.

Referring to fig. 2, fig. 4, fig. 6 and fig. 10-12, a first embodiment of the present invention is:

a method for realizing image recognition through a convolutional neural network specifically comprises the following steps:

S1, acquiring a data set, and preprocessing the data set;

the data set consists of road traffic sign images acquired by a vehicle-mounted camera in an actual traffic environment;

the road traffic sign images in the data set include images shot under different lighting, degrees of occlusion, shooting angles and vehicle speeds; each road traffic sign image contains only one traffic sign, and the image resolution ranges from 16 × 16 to 250 × 250;

the data set comprises a training set and a test set;

the pre-processing of the data set comprises:

S11, normalizing the size of the images in the data set to obtain a first data set: uniformly scaling the sample images to 32 × 32 by bilinear interpolation;

S12, normalizing the pixels of the images in the first data set to obtain a second data set: compressing the value range of each pixel from 0-255 to 0-1, which accelerates the convergence of the neural network;

S13, augmenting the images in the second data set to obtain a third data set:

referring to fig. 6, the first column from the left in fig. 6 shows images in the second data set; the augmentation is specifically:

S131, rotating in a random direction about the center of the image in units of 10 degrees, for example by 20 degrees, 80 degrees or 170 degrees;

S133, randomly flipping the image;

the operations of S131, S132 and S133 are performed in ascending order of step number, although the order is not limited thereto;
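The size and pixel normalization of S11 and S12 can be sketched as follows. This is an illustrative sketch under stated assumptions: the image is single-channel and stored as a list of rows, and the bilinear interpolation maps the corner pixels of the source onto the corner pixels of the output; the patent does not specify either detail. The function names are hypothetical.

```python
def bilinear_resize(img, out_h, out_w):
    """S11: resize a 2-D image (list of rows) by bilinear interpolation."""
    in_h, in_w = len(img), len(img[0])
    out = []
    for i in range(out_h):
        # Fractional source row for this output row (corners map to corners).
        y = i * (in_h - 1) / (out_h - 1) if out_h > 1 else 0.0
        y0 = int(y)
        y1 = min(y0 + 1, in_h - 1)
        dy = y - y0
        row = []
        for j in range(out_w):
            x = j * (in_w - 1) / (out_w - 1) if out_w > 1 else 0.0
            x0 = int(x)
            x1 = min(x0 + 1, in_w - 1)
            dx = x - x0
            top = img[y0][x0] * (1 - dx) + img[y0][x1] * dx
            bottom = img[y1][x0] * (1 - dx) + img[y1][x1] * dx
            row.append(top * (1 - dy) + bottom * dy)
        out.append(row)
    return out

def normalize_pixels(img):
    """S12: compress pixel values from the range 0-255 to the range 0-1."""
    return [[p / 255.0 for p in row] for row in img]

def preprocess(img, size=32):
    """Apply S11 and then S12 to one image."""
    return normalize_pixels(bilinear_resize(img, size, size))
```

A colour image could be handled by applying `preprocess` to each channel separately.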

in an alternative embodiment, the constructed data set comprises 3250 images of 40 types of traffic signs, of which 2275 images form the training set and 975 images form the test set; the distribution of the 3250 images over the 40 types of traffic signs is shown in fig. 4;
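The 2275 / 975 division corresponds to a 70% / 30% split of the 3250 images. A minimal sketch of such a split is shown below; shuffling before splitting and the fixed seed are assumptions, since the patent does not say how the images were assigned to the two sets.

```python
import random

def split_dataset(samples, n_train, seed=0):
    """Shuffle the samples and return (training set, test set)."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]

# Image indices stand in for the 3250 images of the constructed data set.
train_set, test_set = split_dataset(range(3250), n_train=2275)
```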

S2, setting initial parameters of the convolutional neural network model, including:

setting the size of the convolution kernels, the number of convolution kernels, the weights of the convolution kernels, the biases and the number of neurons of the fully-connected layer;

setting an initial learning rate, a target minimum error, a training period and the number of samples selected in a single training step (BATCH_SIZE);

performing a parameter selection comparison experiment on the data set for the convolution kernel size, the Dropout parameter and the number of neurons in the fully-connected layer of the network structure, which in an optional implementation are set to 5 × 5, 0.5 and 256 respectively; specifically, referring to fig. 10 to fig. 12, a preset candidate set is defined for each type of parameter, the value of one type of parameter at a time is varied within its candidate set by the control variable method, the convolutional neural network model is run, and the best-performing value in each candidate set is selected from the results;

in an alternative embodiment, EPOCHS is set to 50 (one epoch is one complete forward and backward pass of the whole data set through the neural network), BATCH_SIZE is set to 64, the initial learning rate is 0.001, and the learning rate is gradually attenuated by the Adam optimization algorithm as the number of iterations increases;
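The control variable method used for parameter selection can be sketched as a one-parameter-at-a-time search. In this hedged sketch, `evaluate` is a hypothetical callback that would train and score the model for a given parameter setting; carrying each selected value forward to the next parameter is also an assumption. The real candidate sets are those of fig. 10 to fig. 12 and are not reproduced here.

```python
def select_parameters(defaults, candidates, evaluate):
    """Vary one type of parameter at a time and keep its best-scoring value.

    defaults:   dict of starting values, one per parameter name
    candidates: dict mapping each parameter name to its candidate set
    evaluate:   hypothetical callback returning a score (higher is better)
    """
    chosen = dict(defaults)
    for name, values in candidates.items():
        # All other parameters stay fixed while this one is varied.
        scores = {value: evaluate({**chosen, name: value}) for value in values}
        chosen[name] = max(scores, key=scores.get)
    return chosen
```

For example, `candidates` might map `"kernel_size"`, `"dropout"` and `"fc_neurons"` to illustrative candidate lists, with `evaluate` returning test-set accuracy after a training run.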

S3, training the convolutional neural network model according to the data set, and performing image recognition by using the trained convolutional neural network model;

specifically, the data set is the third data set obtained after size normalization, pixel normalization and augmentation;

the method specifically comprises the following steps: training the convolutional neural network model according to the training set to obtain a first convolutional neural network model; and verifying the first convolutional neural network model according to the test set to obtain the convolutional neural network model meeting the requirement.

Referring to fig. 7, a second embodiment of the present invention is:

a method for realizing image recognition through a convolutional neural network, which is different from the first embodiment in that:

the convolutional neural network model comprises an input layer (input), a plurality of pooling layers (Maxpool), a plurality of convolutional layers (Conv), a global average pooling layer (GlobalAveragePooling), a fully connected layer (Full) and a classifier;

the convolutional layers alternate with the pooling layers;

the convolutional layers all adopt the ReLU activation function ReLU(x) = max(0, x), whose gradient is 1 when x is greater than 0 and 0 otherwise;

a Dropout strategy is added after each pooling layer, that is, part of the hidden layer neurons are randomly switched off (ignored) during training;
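The ReLU activation and the Dropout strategy can be sketched in a few lines. The rescaling of the surviving activations by 1 / (1 - rate), often called inverted dropout, is an assumption; the patent only states that part of the neurons are randomly switched off during training.

```python
import random

def relu(x):
    """ReLU(x) = max(0, x)."""
    return max(0.0, x)

def relu_grad(x):
    """Gradient of ReLU: 1 when x > 0, otherwise 0."""
    return 1.0 if x > 0 else 0.0

def dropout(values, rate, rng, training=True):
    """Randomly zero a fraction `rate` of the activations during training."""
    if not training or rate == 0.0:
        return list(values)
    keep = 1.0 - rate
    # Survivors are rescaled so the expected activation is unchanged.
    return [v / keep if rng.random() < keep else 0.0 for v in values]
```

With `rate=0.5`, the value selected in the first embodiment, each activation is either zeroed or doubled during training and passed through unchanged at inference time.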

unifying the size of the output features of each of the pooling layers;

inputting the tensor into the global average pooling layer;

specifically, in this embodiment, the input of the global average pooling layer is the cascade of the pooling layers:

GlobalAveragePooling merge_input = [Max_11, Max_22, Maxpool3];

the output of a pooling layer is connected to a size-unifying pooling layer where needed, so that the output features of all pooling layers have the same size;

Max_11 and Max_22 are two size-unifying pooling layers connected to Maxpool1 and Maxpool2 respectively, which reduce the output features of Maxpool1 and Maxpool2 from 16 × 16 and 8 × 8 to 4 × 4;

[Max_11, Max_22, Maxpool3] indicates that the outputs of the two size-unifying pooling layers and of pooling layer Maxpool3 are connected in series to form one tensor; the output feature size of Maxpool3 is already 4 × 4, so it does not need a size-unifying pooling layer;
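The multi-scale fusion described here can be sketched with plain nested lists. The sketch assumes feature maps laid out as [channels][height][width], pooling windows of 4 × 4 for Max_11 and 2 × 2 for Max_22 (chosen so that 16 × 16 and 8 × 8 both become 4 × 4), and channel counts of 32, 64 and 128 taken from the layer sizes of the optional embodiment. None of these implementation details is stated in the patent.

```python
def max_pool(fmap, size):
    """Max-pool a [channels][height][width] map with window = stride = size."""
    pooled = []
    for channel in fmap:
        h, w = len(channel), len(channel[0])
        pooled.append([[max(channel[i + di][j + dj]
                            for di in range(size) for dj in range(size))
                        for j in range(0, w, size)]
                       for i in range(0, h, size)])
    return pooled

def global_average_pool(fmap):
    """Reduce every channel to the mean of its values."""
    return [sum(map(sum, channel)) / (len(channel) * len(channel[0]))
            for channel in fmap]

def fuse(maxpool1, maxpool2, maxpool3):
    """Unify sizes, concatenate along the channel axis, then average."""
    max_11 = max_pool(maxpool1, 4)         # 16 x 16 -> 4 x 4
    max_22 = max_pool(maxpool2, 2)         # 8 x 8   -> 4 x 4
    merged = max_11 + max_22 + maxpool3    # channel-wise concatenation
    return global_average_pool(merged)
```

Under these assumptions the fused vector has 32 + 64 + 128 = 224 entries, one per channel, and is what the fully connected layer would receive.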

each single convolutional layer is formed by cascading a plurality of small convolutional layers, and, apart from the small convolutional layers at the head and the tail, the input of each small convolutional layer is connected to the output of the preceding small convolutional layer;

wherein, the classifier can be a SoftMax classifier;

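a Batch Normalization algorithm (BN) is set after each of the convolution layers:

$$\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i, \qquad \sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m}\left(x_i - \mu_B\right)^2$$

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \varepsilon}}, \qquad y_i = \gamma\,\hat{x}_i + \beta$$

wherein $B = \{x_1, x_2, x_3, \ldots, x_m\}$ represents a batch of m data input into the convolutional neural network model, $\mu_B$ represents the mean of the m data, $\sigma_B^2$ represents the variance of the m data, $\varepsilon$ is a small positive number added to prevent a divisor of 0, $\gamma$ is a scaling factor, and $\beta$ is a translation factor;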

in an optional embodiment, the convolutional neural network model comprises 3 convolutional layers, 3 pooling layers, 1 global average pooling layer, 1 fully-connected layer and 1 classification output layer, wherein the convolutional layers and the pooling layers are alternately connected; the 3 convolutional layers all use convolution kernels of size 5 × 5 with a stride of 1, and the numbers of neurons are 32, 64 and 128 respectively; the 3 pooling layers all use 2 × 2 windows with a stride of 2 and max pooling;
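The feature-map sizes implied by this configuration can be traced with a short calculation. The sketch assumes a 32 × 32 input, "same" padding in the convolutions (inferred from the 16 × 16, 8 × 8 and 4 × 4 pooling outputs stated in this embodiment, not stated explicitly), and that 32, 64 and 128 are the output channel counts of the three convolutional layers.

```python
def conv_out(size, kernel, stride=1, padding="same"):
    """Spatial output size of a convolution."""
    if padding == "same":
        return -(-size // stride)            # ceil(size / stride)
    return (size - kernel) // stride + 1     # "valid" padding

def pool_out(size, window=2, stride=2):
    """Spatial output size of a pooling layer."""
    return (size - window) // stride + 1

def trace_shapes(size=32, channels=(32, 64, 128)):
    """Return (height, width, channels) after each conv + max-pool stage."""
    shapes = []
    for c in channels:
        size = pool_out(conv_out(size, kernel=5))
        shapes.append((size, size, c))
    return shapes
```

With unpadded ("valid") 5 × 5 convolutions the first stage would instead give (32 - 5 + 1) / 2 = 14, which does not match the 16 × 16 figure in the text.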

the numbers of parameters of the convolutional neural network model provided by the invention are shown in Table 1:

TABLE 1

Each convolutional layer is formed by cascading a plurality of 5 × 5 convolutional layers in place of a traditional single 9 × 9 convolutional layer; Table 2 shows the numbers of parameters of the traditional convolutional neural network model with large convolution kernels:

TABLE 2

As can be seen from a comparison of Tables 1 and 2, the convolutional neural network model provided by the invention has only slightly more parameters, while its reliability is clearly improved.

Referring to fig. 5, 8, 9 and 13, a third embodiment of the present invention is:

a method for implementing image recognition through a convolutional neural network, which is different from the first embodiment or the second embodiment in that training the convolutional neural network model according to the data set in S3 specifically includes:

S31, selecting a cross entropy loss function as the objective function;

specifically, the biases and the weights of the convolution kernels are updated;

S34, repeating steps S32 and S33 until the objective function converges, and saving the convolutional neural network model at that point;

taking a road traffic sign image that is acquired from a natural scene and contains only one traffic sign as the input of the saved convolutional neural network model, and outputting the specific class of the traffic sign image;

referring to fig. 8, the value of the loss function decreases rapidly in the early stage of training, where the difference between the predicted result and the real result is still large; as the number of training iterations increases, the loss curve flattens and the model gradually converges, that is, the model fits the training data better and better;

referring to fig. 9, which shows the curve of the accuracy during training, the accuracy rises rapidly in the early stage of training; after about thirty iterations the accuracy curve flattens, the increase becomes very small, and the model gradually converges;

in an alternative embodiment, the German Traffic Sign Recognition Benchmark (GTSRB) data set is used, which contains 51839 traffic sign images in 5 major classes and 43 subclasses, of which 39209 images form the training set and 12630 images form the test set; the distribution of the 51839 images over the 43 subclasses is shown in fig. 5;

the GTSRB is used to run a comparison experiment between the convolutional neural network model of the invention (MS-TSRCNN) and two other convolutional neural network models (a single-scale feature connected model and a convolutional layer feature connected model), wherein the convolutional layer feature connected model has the same structure as the MS-TSRCNN except that the pooling layers are not cascaded; the experimental result is shown in fig. 13.

Referring to fig. 3, a fourth embodiment of the present invention is:

a terminal 1 for implementing image recognition through a convolutional neural network, comprising a processor 2, a memory 3 and a computer program stored on the memory 3 and operable on the processor 2, wherein the processor 2 implements the steps of the first embodiment, the second embodiment or the third embodiment when executing the computer program.

In summary, the invention provides a method and a terminal for realizing image recognition through a convolutional neural network. When the training data set is selected, road traffic sign images acquired by a vehicle-mounted camera in an actual traffic environment are used, covering different lighting, degrees of occlusion, shooting angles and vehicle speeds; the conditions likely to occur in practice are thus anticipated and the model is trained for them in advance, which ensures the recognition accuracy of the model in actual application. When the model is constructed, all pooling layers are cascaded and a global average pooling layer is provided, so that feature information at multiple scales, such as local and global features, is fully used for traffic sign classification, and the transition from feature maps to classification is more natural. A plurality of small convolutional layers replaces one large convolutional layer, which preserves the receptive field; although the number of parameters increases slightly, the reliability of the convolutional neural network model is clearly improved and the calculation speed of the model is improved. The parameters are determined by first running a parameter selection experiment on the data set, which shortens model training and improves the recognition accuracy of the final model. The final model has a simple structure and can recognize traffic signs quickly and correctly.

The above description is only an embodiment of the present invention, and not intended to limit the scope of the present invention, and all equivalent changes made by using the contents of the present specification and the drawings, or applied directly or indirectly to the related technical fields, are included in the scope of the present invention.

Claims

1. A method for performing image recognition via a convolutional neural network, comprising the steps of:

S1, acquiring a data set, and preprocessing the data set;

2. The method of claim 1, wherein the convolutional neural network model further comprises a global average pooling layer;

the step of cascading the output characteristics of each pooling layer specifically includes:

unifying the size of the output features of each of the pooling layers;

inputting the tensor into the global average pooling layer.

3. The method of claim 1, wherein the data set is a road traffic sign image collected by a vehicle-mounted camera in an actual traffic environment;

each road traffic sign image in the data set comprises traffic sign images shot under different light rays, different shielding degrees, different shooting angles and different vehicle motion speeds;

the preprocessing the data set in S1 includes:

S13, augmenting the images in the second data set to obtain a third data set;

4. The method of claim 3, wherein the augmenting is specifically:

S133, randomly flipping the image;

5. The method of claim 1, wherein the convolutional neural network further comprises: a plurality of convolutional layers alternating with the pooling layers;

6. The method of claim 1, wherein each convolutional layer is followed by a batch normalization algorithm:

wherein $\sigma_B^2$ is the variance of the m data, $\varepsilon$ is a small positive number added to prevent a divisor of 0, $\gamma$ is a scaling factor, and $\beta$ is a translation factor.

7. The method of claim 1, wherein the convolutional neural network further comprises: a fully-connected layer;

8. The method of claim 1, wherein the data set in S1 includes a training set and a test set;

the step S3 specifically includes:

9. The method according to claim 1, wherein the training of the convolutional neural network model according to the data set in step S3 specifically comprises:

10. A terminal for implementing image recognition by a convolutional neural network, comprising a memory, a processor and a computer program stored in the memory and executable on the processor, wherein the processor implements a method for implementing image recognition by a convolutional neural network according to any one of claims 1 to 9 when executing the computer program.