CN110443286B

CN110443286B - Training method of neural network model, image recognition method and device

Info

Publication number: CN110443286B
Application number: CN201910651552.1A
Authority: CN
Inventors: 曾葆明; 王雷; 梁炎
Original assignee: Guangzhou Cubesili Information Technology Co Ltd
Current assignee: Guangzhou Cubesili Information Technology Co Ltd
Priority date: 2019-07-18
Filing date: 2019-07-18
Publication date: 2024-06-04
Anticipated expiration: 2039-07-18
Also published as: CN110443286A

Abstract

The application discloses a training method, an image recognition method and a device of a neural network model, wherein the training method of the neural network model comprises the following steps: acquiring a neural network model; the neural network model is a neural network model obtained by training, and at least comprises a first branch network; adding a second branch network in the neural network model; inputting the data set to be trained into a second branch network to train the second branch network independently; and fusing the first branch network and the second branch network to complete training of the neural network model. By the mode, the training efficiency of the neural network model can be improved, and the identification effect of the original neural network model is not affected.

Description

Training method of neural network model, image recognition method and device

Technical Field

The present application relates to the field of image processing technologies, and in particular, to a training method of a neural network model, an image recognition method and an image recognition device.

Background

With the advent of deep learning, more and more techniques have employed deep learning to enable image recognition of pictures or video streams. Compared with the traditional method, the deep learning avoids the complexity of manual parameter adjustment and artificial feature selection, and the data are subjected to multi-layer analysis and abstract feature extraction by constructing a deep neural network model, so that the method has the characteristics of high accuracy, high reliability and high adaptability. Common image recognition applications cover motion recognition, face recognition, object recognition, scene recognition, etc. The object recognition and scene recognition serve as the basis of image retrieval, image classification, scene understanding and environment perception, and play an important role in the fields of pattern recognition, machine learning and the like.

When a trained neural network model is adopted for image recognition, if new features are needed to be added, two methods exist at present: 1. independently creating a neural network model; 2. the image with the new characteristics is input into the original neural network model for continuous training. The former consumes doubled computing resources, the latter takes longer time to train, can not react rapidly, and can not control the recognition effect of the original category after the new sample is added, so that the original recognition effect is likely to be influenced.

Disclosure of Invention

In order to solve the problems, the application provides a training method, an image recognition method and a device for a neural network model, which can improve the training efficiency of the neural network model and do not influence the recognition effect of the original neural network model.

The application adopts a technical scheme that: there is provided a training method of a neural network model, the method comprising: acquiring a neural network model; the neural network model is a neural network model obtained by training, and at least comprises a first branch network; adding a second branch network in the neural network model; inputting the data set to be trained into a second branch network to train the second branch network independently; and fusing the first branch network and the second branch network to complete training of the neural network model.

Wherein adding a second branch network in the neural network model comprises: determining output scales of a plurality of convolution modules of the first branch network; the second branch network is added to a particular convolution module in the first branch network based on the output scale requirements.

Wherein the first branch network comprises: an input layer; a first convolution module; a first pooling layer; a second convolution module; a second pooling layer; a third convolution module; a fourth convolution module; a fifth convolution module; a first global averaging pooling layer; a first full connection layer; a first classification network layer; a first branch network output layer.

Wherein the second branch network comprises: the characteristic selection layer is connected with the fourth convolution module; a sixth convolution module; a second global average pooling layer; a second full connection layer; a second hierarchical network layer; and a second branch network output layer.

Wherein the network mode further comprises: the fusion layer is connected with the first branch network output layer and the second branch network output layer; and fusing the output layers.

Wherein inputting the data set to be trained into the second branch network to perform individual training on the second branch network comprises: acquiring a data set to be trained; performing data enhancement processing on the data set to be trained; and inputting the data set to be trained after the data enhancement processing into a second branch network, and independently training the second branch network.

The data set to be trained after the data enhancement processing is input to a second branch network, and the second branch network is independently trained, including: setting convolution initialization parameters of a second branch network; and fixing parameters of a plurality of convolution modules of the first branch network, inputting the data set to be trained after data enhancement processing into the second branch network, and independently training the second branch network.

The application adopts another technical scheme that: there is provided an image recognition method including: acquiring an image to be identified; inputting an image to be identified into a set neural network model; the neural network model is set and trained by the method; and outputting the identification result.

The application adopts another technical scheme that: there is provided an image recognition apparatus comprising a processor and a memory coupled to the processor, the memory for storing program data, the processor for executing the program data to implement a method as described above.

The application adopts another technical scheme that: there is provided a computer storage medium having stored therein program data which, when executed by a processor, is adapted to carry out a method as described above.

The training method of the neural network model provided by the application comprises the following steps: acquiring a neural network model; the neural network model is a neural network model obtained by training, and at least comprises a first branch network; adding a second branch network in the neural network model; inputting the data set to be trained into a second branch network to train the second branch network independently; and fusing the first branch network and the second branch network to complete training of the neural network model. By the method, when the existing neural network model is required to be utilized to identify new features, a new training of a neural network model or retraining of an original neural network model is not required, the training efficiency of the neural network model is improved, and the identification effect of the original neural network model is not affected.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. Wherein:

FIG. 1 is a flow chart of a training method of a neural network model according to an embodiment of the present application;

FIG. 2 is a schematic diagram of a neural network model according to an embodiment of the present application;

FIG. 3 is a schematic flow chart of training of a second branch network according to an embodiment of the present application;

FIG. 4 is another flow diagram of training of a second branch network provided by an embodiment of the present application;

FIG. 5 is a schematic flow chart of an image recognition method provided by the application;

Fig. 6 is a schematic structural diagram of an image recognition device according to an embodiment of the present application;

Fig. 7 is a schematic structural diagram of a computer storage medium according to an embodiment of the present application.

Detailed Description

The technical solutions in the embodiments of the present application will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present application. It is to be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the application. It should be further noted that, for convenience of description, only some, but not all of the structures related to the present application are shown in the drawings. All other embodiments, which can be made by those skilled in the art based on the embodiments of the application without making any inventive effort, are intended to be within the scope of the application.

The terms "first," "second," and the like in this disclosure are used for distinguishing between different objects and not for describing a particular sequential order. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion. For example, a process, method, system, article, or apparatus that comprises a list of steps or elements is not limited to only those listed steps or elements but may include other steps or elements not listed or inherent to such process, method, article, or apparatus.

Reference herein to "an embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the application. The appearances of such phrases in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments. Those of skill in the art will explicitly and implicitly appreciate that the embodiments described herein may be combined with other embodiments.

Referring to fig. 1, fig. 1 is a flowchart of a training method of a neural network model according to an embodiment of the present application, where the method includes:

Step 11: acquiring a neural network model; the neural network model is a neural network model obtained by training, and at least comprises a first branch network.

The neural network model is a carrier for deep learning (DEEP LEARNING, DL), which is one of the technical and research fields of machine learning, and artificial intelligence is realized in a computing system by establishing an artificial neural network (ARTIFITIAL NEURAL NETWORKS, ANNs) with a hierarchical structure. Because hierarchical ANN can extract and screen input information layer by layer, deep learning has characteristic learning (representation learning) capability, and end-to-end supervised learning and unsupervised learning can be realized. In addition, deep learning may also be involved in constructing a reinforcement learning (reinforcement learning) system, forming deep reinforcement learning.

Taking convolutional neural networks as an example, convolutional neural networks (Convolutional Neural Networks, CNN) are a type of feedforward neural network (Feedforward Neural Networks) that includes convolutional computations and has a deep structure, and are one of representative algorithms for deep learning (DEEP LEARNING).

The convolutional neural network comprises an input layer, an implicit layer and an output layer. The hidden layers comprise a convolution module, a pooling layer and a full connection layer.

1) The input layer of the convolutional neural network can process multidimensional data, and the input layer of the one-dimensional convolutional neural network receives a one-dimensional or two-dimensional array, wherein the one-dimensional array is usually time or frequency spectrum sampling; the two-dimensional array may include a plurality of channels; the input layer of the two-dimensional convolutional neural network receives a two-dimensional or three-dimensional array; the input layer of the three-dimensional convolutional neural network receives a four-dimensional array.

In this embodiment, the convolutional neural network is mainly used for processing an image, and thus, a three-dimensional convolutional neural network including three-dimensional data channels, that is, two-dimensional pixel points and RGB (red, green, blue) data channels may be employed.

2) The function of the convolution module is to perform feature extraction on the input data, wherein the convolution module internally comprises a plurality of convolution kernels, and each element forming the convolution kernels corresponds to a weight coefficient and a bias vector, which is similar to a neuron (neuron) of a feedforward neural network. Each neuron in the convolution module is connected to a plurality of neurons in a region of the previous layer that is located close to the region, the size of the region being dependent on the size of the convolution kernel.

After the convolution module performs feature extraction, the output feature map is transferred to a pooling layer for feature selection and information filtering. The pooling layer contains a predefined pooling function that functions to replace the results of individual points in the feature map with the feature map statistics of its neighboring regions. The pooling layer selects pooling area and the step of the convolution kernel scanning characteristic diagram are the same, and the pooling area, step length and filling are controlled.

The fully connected layer in convolutional neural networks is equivalent to the hidden layer in conventional feed forward neural networks. The fully connected layer is typically built on the last part of the hidden layer of the convolutional neural network and only transmits signals to the other fully connected layers. The feature map loses three-dimensional structure in the fully connected layers, is expanded into vectors and is passed on to the next layer by the excitation function.

3) The output layer upstream of the convolutional neural network is usually a fully-connected layer, so that the structure and the working principle of the convolutional neural network are the same as those of the output layer of the traditional feedforward neural network. For image classification problems, the output layer outputs classification labels using a logic function or a normalized exponential function (softmax function).

For example, the recognition output layer for an image in this embodiment may be designed to output the center coordinates, size, and classification of objects in the image. In image semantic segmentation, the output layer directly outputs the classification result of each pixel.

Step 12: a second branch network is added to the neural network model.

Optionally, step 12 may specifically include: determining output scales of a plurality of convolution modules of the first branch network; the second branch network is added to a particular convolution module in the first branch network based on the output scale requirements.

As shown in fig. 2, fig. 2 is a schematic diagram of a neural network model according to an embodiment of the present application.

Wherein the first branch network comprises: an INPUT layer (INPUT); a first convolution module (ConvBlock); a first pooling layer (Pooling); a second convolution module; a second pooling layer; a third convolution module; a fourth convolution module; a fifth convolution module; a first global average pooling layer (Global Average Pooling, GAP); a first fully-connected layer (fully connected layers, FC); a first classification network layer (Softmax); a first branch network output layer (main_output).

Wherein the second branch network comprises: a feature selection layer (SelectBlock) connected to the fourth convolution module; a sixth convolution module; a second global average pooling layer; a second full connection layer; a second hierarchical network layer; and a second Branch network output layer (branch_output).

In addition, the system also comprises a fusion layer (Fusing) and a fusion output layer (Fusing _output), wherein the fusion layer is connected with the first branch network output layer and the second branch network output layer.

In an embodiment, the output scale of the first convolution module may be set according to requirements, for example, the output scale may be n×n, where 100 < N < 300. For example, a common value for N may be 227. Further, the output scale of the first pooling layer is N/2*N/2; the output scale of the second convolution module is N/2*N/2; the output scale of the second pooling layer is N/4*N/4; the output scale of the third convolution module is N/8*N/8; the output scale of the fourth convolution module is N/16 x N/16; the output scale of the fifth convolution module is N/16 x N/16. The characteristic selection layer is connected with the fourth convolution module, and the output scales of the characteristic selection layer and the fourth convolution module are the same and are N/16 x N/16.

In a specific embodiment, the output scale of the first convolution module is 168×168; the output scale of the first pooling layer is 84 x 84; the output scale of the second convolution module is 84 x 84; the output scale of the second pooling layer is 42 x 42; the output scale of the third convolution module is 21 x 21; the output scale of the fourth convolution module is 11 x 11; the output scale of the fifth convolution module is 11 x 11. The output scale of the feature selection layer is 11 x 11.

In this embodiment, the second branch network needs to be branched when the dimension of the penultimate feature map is reduced, that is, the dimension of 11×11, because the shallow layer is mainly used for extracting features, and the deep layer network is mainly used for changing features, and extracting high-level semantic information; if only the last full connection layer is used for bifurcation, the extracted information has great influence on the final result by the first branch network, and the final effect is poor; the right side as shown in fig. 2 is a second branch network part, firstly, a feature selection layer (SelectBlock) is used for weighting and recombining features, the weight of the features with large new sample effect is given to the features with larger weight, then, the convolution transformation of the original network is carried out for several times, and then, the full connection layer is connected for classification.

Step 13: the data set to be trained is input to the second branch network for individual training of the second branch network.

Optionally, as shown in fig. 3, fig. 3 is a schematic flow chart of training of the second branch network provided in the embodiment of the present application, and step 12 may specifically include:

step 131: and acquiring a data set to be trained.

Wherein the data set to be trained is data with new characteristics. Taking an image as an example, in an application scenario, an image with an a feature in the image needs to be identified, and then the first branch network is obtained by training the image with the a feature. Further, if the B feature is to be newly added, a second branch network is added, and an image with the B feature is input for training.

Step 132: and carrying out data enhancement processing on the data set to be trained.

In general, neural networks require a large number of parameters, many of which are millions, and so that they can function properly requires a large amount of data to train, which in practice is not as much as we imagine. The enhancement of the data, namely, the creation of more data by utilizing the existing data such as overturn, translation or rotation, enables the neural network to have better generalization effect.

Step 133: and inputting the data set to be trained after the data enhancement processing into a second branch network, and independently training the second branch network.

In addition, as shown in fig. 4, fig. 4 is another schematic flow chart of training of the second branch network according to the embodiment of the present application, and step 12 may specifically include:

step 136: and setting convolution initialization parameters of the second branch network.

The parameters of the convolution module comprise the size of the convolution kernel, the step length and the filling, and the three determine the size of the output characteristic diagram of the convolution module together, so that the parameters are super parameters of the convolution neural network. Where the convolution kernel size may be specified as any value less than the input image size, the larger the convolution kernel, the more complex the extractable input feature.

The convolution step length defines the distance between the positions of the convolution kernel when the convolution kernel scans the feature map twice, when the convolution step length is 1, the convolution kernel scans the elements of the feature map one by one, and when the convolution step length is n, n-1 pixels are skipped in the next scanning.

As can be seen from the cross-correlation calculation of the convolution kernels, the feature map size gradually decreases as the convolution modules are stacked, for example, a 16×16 input image, after passing through a unit step, unfilled 5×5 convolution kernel, outputs a 12×12 feature map. To this end, padding is a method of artificially increasing the size of the feature map before it passes through the convolution kernel to counteract the effects of size shrinkage in the computation. A common filling method is filling with 0 and repeating boundary values (replication padding).

Step 137: and fixing parameters of a plurality of convolution modules of the first branch network, inputting the data set to be trained after data enhancement processing into the second branch network, and independently training the second branch network.

Step 14: and fusing the first branch network and the second branch network to complete training of the neural network model.

The training method of the neural network model provided by the embodiment comprises the following steps: acquiring a neural network model; the neural network model is a neural network model obtained by training, and at least comprises a first branch network; adding a second branch network in the neural network model; inputting the data set to be trained into a second branch network to train the second branch network independently; and fusing the first branch network and the second branch network to complete training of the neural network model. By the method, when the existing neural network model is required to be utilized to identify new features, a new training of a neural network model or retraining of an original neural network model is not required, the training efficiency of the neural network model is improved, and the identification effect of the original neural network model is not affected.

It will be appreciated that the method of the present embodiment may be applied to training and identifying illegal pictures or videos of a network. For example, when the external yellow identification is output, if the application scenes are different, different outputs can be customized, and a branch network can be adopted for adaptation; if some sudden illegal pictures appear in the short video, the existing model cannot be identified, and the existing effect can be influenced by adding a training set, the problem can be solved by adopting a branch network, for example, the illegal video revealed by the short video application can be continuously spread on a platform, and for example, a watermark of a pornography website exists; the method has the advantages that the method has certain characteristics, specific pictures are high in illegal picture recognition rate and few in false recognition after the method is adopted.

Taking the SE-BN-Inception model as an example, the following table shows:

according to the model for adding the second branch network based on the first branch network, when a single picture is processed, the single calculation time consumption is averagely increased by 4.8ms, the video memory consumption is increased by 69MB, and when batchs size is 12, the single calculation time consumption is averagely increased by 1ms, and the video memory consumption is increased by 129MB. From the above data, it can be seen that when a new neural network model with a branch network is used for image processing, the time consumption increase is small compared with that of the original neural network model, the video memory consumption increase is also low, and compared with that of performing image processing twice through two different neural network models, the processing time is greatly shortened, and the memory consumption is reduced.

Referring to fig. 5, fig. 5 is a flowchart of an image recognition method provided by the present application, where the method includes:

step 51: and acquiring an image to be identified.

The image may be a single picture or an image frame in a video stream, which is not limited herein.

Step 52: and inputting the image to be identified into a set neural network model.

The neural network model is set and trained by the method in the above embodiment, which is not described herein.

Step 53: and outputting the identification result.

Referring to fig. 6, fig. 6 is a schematic structural diagram of an image recognition device according to an embodiment of the present application, where the image recognition device 60 includes a processor 61 and a memory 62 connected to the processor 61, the memory 62 is used for storing program data, and the processor 61 is used for executing the program data to implement the following method:

Acquiring a neural network model; the neural network model is a neural network model obtained by training, and at least comprises a first branch network; adding a second branch network in the neural network model; inputting the data set to be trained into a second branch network to train the second branch network independently; and fusing the first branch network and the second branch network to complete training of the neural network model.

Optionally, in another embodiment, the processor 61 is configured to execute the program data to implement the following method: acquiring an image to be identified; inputting an image to be identified into a set neural network model; and outputting the identification result.

Referring to fig. 7, fig. 7 is a schematic structural diagram of a computer storage medium provided in an embodiment of the present application, in which program data 71 is stored in the computer storage medium 70, and when the program data 71 is executed by a processor, the program data 71 is configured to implement the following method:

Optionally, in another embodiment, the program data 71, when executed by the processor, is further configured to implement the following method: determining output scales of a plurality of convolution modules of the first branch network; the second branch network is added to a particular convolution module in the first branch network based on the output scale requirements.

Wherein the first branch network comprises: an input layer; the first convolution module has an output scale of 168×168; a first pooling layer having an output scale of 84 x 84; a second convolution module with an output scale of 84 x 84; a second pooling layer with an output scale of 42 x 42; the third convolution module has an output scale of 21 x 21; the output scale of the fourth convolution module is 11 x 11; a fifth convolution module, the output scale of which is 11 x 11; a first global averaging pooling layer; a first full connection layer; a first classification network layer; a first branch network output layer.

Wherein the second branch network comprises: the characteristic selection layer is connected with the fourth convolution module, and the scale of the characteristic selection layer is 11 x 11; a sixth convolution module; a second global average pooling layer; a second full connection layer; a second hierarchical network layer; and a second branch network output layer.

Optionally, in another embodiment, the program data 71, when executed by the processor, is further configured to implement the following method: acquiring a data set to be trained; performing data enhancement processing on the data set to be trained; and inputting the data set to be trained after the data enhancement processing into a second branch network, and independently training the second branch network.

Optionally, in another embodiment, the program data 71, when executed by the processor, is further configured to implement the following method: setting convolution initialization parameters of a second branch network; and fixing parameters of a plurality of convolution modules of the first branch network, inputting the data set to be trained after data enhancement processing into the second branch network, and independently training the second branch network.

In the several embodiments provided in the present application, it should be understood that the disclosed method and apparatus may be implemented in other manners. For example, the above-described device embodiments are merely illustrative, e.g., the division of the modules or units is merely a logical functional division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed.

The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the embodiment.

In addition, each functional unit in each embodiment of the present application may be integrated in one processing unit, each unit may exist alone physically, or two or more units may be integrated in one unit. The integrated units may be implemented in hardware or in software functional units.

The integrated units of the other embodiments described above may be stored in a computer readable storage medium if implemented in the form of software functional units and sold or used as stand alone products. Based on such understanding, the technical solution of the present application may be embodied in essence or a part contributing to the prior art or all or part of the technical solution in the form of a software product stored in a storage medium, including several instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) or a processor (processor) to execute all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a usb disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (RAM, random Access Memory), a magnetic disk, or an optical disk, or other various media capable of storing program codes.

The foregoing description is only of embodiments of the present application, and is not intended to limit the scope of the application, and all equivalent structures or equivalent processes according to the present application and the accompanying drawings, or direct or indirect application in other related technical fields, are included in the scope of the present application.

Claims

1. An image recognition method, the method comprising:

acquiring an image to be identified; the image to be identified is a single picture or one image frame in a video stream;

Inputting the image to be identified into a set neural network model; the set neural network model is trained by the following method: acquiring a neural network model for image recognition; the neural network model is a trained neural network model for image recognition, is a three-dimensional convolution neural network for processing images and comprises three-dimensional data channels, wherein the three-dimensional data channels are two-dimensional pixel points and red, green and blue data channels, and at least comprises a first branch network; adding a second branch network in the neural network model; inputting a data set to be trained into the second branch network to train the second branch network independently; the data set to be trained is an image with new characteristics, and each branch network is adapted to different application scenes; fusing the first branch network and the second branch network to complete training of the neural network model; the first branch network comprises an input layer, a first convolution module, a first pooling layer, a second convolution module, a second pooling layer, a third convolution module, a fourth convolution module, a fifth convolution module, a first global average pooling layer, a first full connection layer, a first classification network layer and a first branch network output layer which are sequentially connected; the second branch network comprises a feature selection layer, a sixth convolution module, a second global average pooling layer, a second full connection layer, a second classification network layer and a second branch network output layer which are sequentially connected; wherein the feature selection layer is connected with the fourth convolution module;

Outputting a recognition result; the identification result comprises center coordinates, sizes and classifications of objects in the image to be identified.

2. The method of claim 1, wherein the step of determining the position of the substrate comprises,

The network model further includes:

the fusion layer is connected with the first branch network output layer and the second branch network output layer;

and fusing the output layers.

3. The method of claim 1, wherein the step of determining the position of the substrate comprises,

The inputting the data set to be trained into the second branch network to train the second branch network independently includes:

Acquiring a data set to be trained;

performing data enhancement processing on the data set to be trained;

And inputting the data set to be trained after the data enhancement processing to the second branch network, and independently training the second branch network.

4. The method of claim 3, wherein the step of,

The data set to be trained after the data enhancement processing is input to the second branch network, and the second branch network is independently trained, including:

Setting convolution initialization parameters of the second branch network;

and fixing parameters of a plurality of convolution modules of the first branch network, inputting the data set to be trained after data enhancement processing to the second branch network, and independently training the second branch network.

5. An image recognition device, characterized in that the image recognition device comprises a processor and a memory connected to the processor, the memory being for storing program data, the processor being for executing the program data for implementing the method according to any of claims 1-4.

6. A computer storage medium, characterized in that the computer storage medium has stored therein program data, which, when being executed by a processor, is adapted to carry out the method according to any one of claims 1-4.