CN112529146B - Neural network model training method and device - Google Patents

Neural network model training method and device

Info

Publication number
CN112529146B
Authority
CN
China
Prior art keywords
data
neural network
network model
training data
training
Prior art date
Legal status
Active
Application number
CN201910883124.1A
Other languages
Chinese (zh)
Other versions
CN112529146A (en
Inventor
于德权
吴觊豪
贾明波
马杰延
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN201910883124.1A priority Critical patent/CN112529146B/en
Priority to PCT/CN2020/102594 priority patent/WO2021051987A1/en
Publication of CN112529146A publication Critical patent/CN112529146A/en
Application granted granted Critical
Publication of CN112529146B publication Critical patent/CN112529146B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G06N3/08 Learning methods
    • Y02T10/40 Engine management systems (climate change mitigation technologies related to transportation)

Abstract

The application provides a neural network model training method in the field of artificial intelligence, which comprises the following steps: acquiring a neural network model, first training data and categories of the first training data, wherein the neural network model has been trained according to second training data, the first training data comprises support data and query data, the support data comprises all or part of the data of each category in the first training data, and the query data comprises all or part of the data of each category in the first training data; extracting features of the first training data by using the neural network model to obtain the features of the first training data; and adjusting parameters of partial layers of the neural network model according to the feature distance between the class center feature of each class and the query data features to obtain an adjusted neural network model. By adjusting the parameters of only part of the layers of the trained neural network model, a neural network model with good accuracy and generalization capability is obtained.

Description

Neural network model training method and device
Technical Field
The application relates to the field of artificial intelligence, in particular to a method and a device for training a neural network model.
Background
Artificial intelligence (artificial intelligence, AI) is the theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain optimal results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the study of the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making.
Computer vision is an integral part of various intelligent/autonomous systems in application fields such as manufacturing, inspection, document analysis, medical diagnosis and military, and is the study of how to use cameras/video cameras and computers to acquire the data and information of a photographed subject. Figuratively speaking, eyes (cameras/video cameras) and a brain (algorithms) are installed on a computer to identify, track and measure targets in place of human eyes, so that the computer can perceive the environment. Because perception can be seen as the extraction of information from sensory signals, computer vision can also be seen as the science of how to make an artificial system "perceive" from images or multi-dimensional data. In general, computer vision acquires input information through various imaging systems in place of the visual organs, and a computer takes the place of the brain to complete the processing and interpretation of this input information. The ultimate goal of computer vision is to enable computers to observe and understand the world visually like humans, with the ability to adapt to the environment autonomously.
In a clustering-based small-sample learning scheme, features of the training data are extracted through a neural network model, the distances among the features of training data of different categories are calculated, and the neural network model is trained accordingly. Because the training data available to a small-sample learning scheme is limited, the generalization capability of the neural network model obtained by training is poor.
Disclosure of Invention
The application provides a neural network model training method, which can train a neural network model with high accuracy and good generalization capability even when the amount of training data is small or the amount of data is unbalanced.
In a first aspect, a method for training a neural network model is provided, including: acquiring a neural network model, first training data and categories of the first training data, wherein the neural network model has been trained according to second training data, the first training data comprises support data and query data, the support data comprises all or part of the data of each category in the first training data, and the query data comprises all or part of the data of each category in the first training data; extracting features of the first training data by using the neural network model to obtain the features of the first training data; and adjusting parameters of partial layers of the neural network model according to the feature distance between the class center feature of each class and the query data features to obtain an adjusted neural network model, wherein each bit of the class center feature of each class is the average of the corresponding bits of the features of the support data of that class.
By extracting features of the training data with the neural network model obtained through training, and adjusting the parameters of part of the layers of the neural network model according to the feature distances among the features of the training data, a neural network model with higher accuracy and stronger generalization capability can be obtained.
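For illustration only, a minimal PyTorch-style sketch of such class-center-based fine-tuning might look as follows; the Euclidean distance, the cross-entropy over negative distances, and all function and variable names are assumptions rather than the claimed implementation.

```python
import torch
import torch.nn.functional as F

def class_centers(support_feats, support_labels, num_classes):
    # Each bit of a class center is the mean of the corresponding bits
    # of that class's support-data features.
    return torch.stack([
        support_feats[support_labels == c].mean(dim=0)
        for c in range(num_classes)
    ])  # shape: [num_classes, feat_dim]

def finetune_step(model, head, support, support_labels,
                  query, query_labels, num_classes, optimizer):
    # One adjustment step: features come from the pre-trained model, but only
    # the parameters handed to `optimizer` (e.g. the last layers) are updated.
    with torch.no_grad():                        # frozen part of the backbone
        s_feat = model(support)
        q_feat = model(query)
    s_feat, q_feat = head(s_feat), head(q_feat)  # trainable partial layers

    centers = class_centers(s_feat, support_labels, num_classes)
    dists = torch.cdist(q_feat, centers)          # distance to each class center
    loss = F.cross_entropy(-dists, query_labels)  # closer center -> higher score

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Here `model` stands for the backbone obtained by training on the second training data, and `head` for the partial layers whose parameters are handed to `optimizer`.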
With reference to the first aspect, in some possible implementations, the adjusting parameters of a partial hierarchy of the neural network model according to feature distances between class center features of each class and the query data features includes: and adjusting the parameters of the partial layers according to the characteristic distance between the class center characteristic of each class and the query data characteristic and the average value of the characteristic distances between the characteristics of the first training data of each class.
The center loss represents the feature distance between the class center feature of each class and the query data features. Introducing the center loss in the training process of the neural network model can improve both the training efficiency and the accuracy of the neural network model.
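As a sketch of how such a term could be added on top of the previous example, one common formulation pulls each feature toward its own class center; the exact definition and the weighting used below are assumptions.

```python
def center_loss(feats, labels, centers):
    # Mean squared feature distance between each sample's feature and the
    # center feature of its own class (one common formulation).
    return ((feats - centers[labels]) ** 2).sum(dim=1).mean()

# Possible combination with the distance-based loss of the previous sketch,
# with an illustrative weight of 0.1:
# loss = F.cross_entropy(-dists, query_labels) + 0.1 * center_loss(q_feat, query_labels, centers)
```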
With reference to the first aspect, in some possible implementations, the extracting features of the first training data using the neural network model to obtain features of the first training data includes: inputting the first training data into the neural network model; and carrying out deep hash on the features extracted by the neural network model to obtain the features of the first training data.
By deep hashing the features extracted by the neural network model, the size of the features can be reduced and the training time shortened, while still ensuring high training accuracy. It also improves the inference speed when the trained neural network model is used to determine the category of data.
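A minimal sketch of one way such a deep-hash step and a Hamming distance could be realized is given below, assuming a PyTorch backbone; the tanh relaxation, the code length, and the class name are assumptions.

```python
import torch
import torch.nn as nn

class DeepHashHead(nn.Module):
    # Maps backbone features to a short binary-like code.
    def __init__(self, in_dim, code_bits=64):
        super().__init__()
        self.fc = nn.Linear(in_dim, code_bits)

    def forward(self, feats):
        # tanh keeps the codes in (-1, 1) and differentiable during training
        return torch.tanh(self.fc(feats))

def hamming_distance(code_a, code_b):
    # Hamming distance between binarized (+1/-1) codes.
    a, b = torch.sign(code_a), torch.sign(code_b)
    return (a.unsqueeze(1) != b.unsqueeze(0)).sum(dim=-1)  # shape: [Na, Nb]
```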
With reference to the first aspect, in some possible implementations, the adjusting parameters of a partial hierarchy of the neural network model according to feature distances between class center features of each class and the query data features includes: when the data volume of the first training data is smaller than a preset value, the super parameters are adjusted through a Bayesian optimization scheme, and the parameters of the partial layers are adjusted according to the feature distance between the class center features of each class and the query data features; and when the data volume of the first training data is larger than or equal to the preset value, adjusting the parameters of the partial layers according to the preset super parameters corresponding to the neural network model and the characteristic distance between the class center characteristics of each class and the query data characteristics.
When the data volume is large, training the neural network model through the Bayesian optimization scheme is inefficient. When the data volume is small, training the neural network model according to the preset super parameters corresponding to the neural network model yields lower accuracy. By applying the Bayesian optimization scheme only when the data amount of the first training data is small, both the accuracy of the trained neural network model and the training efficiency can be improved.
With reference to the first aspect, in some possible implementations, the super-parameters include one or more of a learning rate, a learning rate decay period, a number of iteration periods, a batch size, a network structure parameter of the neural network model.
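Purely as an illustration of this data-volume-dependent choice, a sketch might look as follows; the threshold, the search ranges, and the helper callables are hypothetical.

```python
def choose_hyperparams(num_samples, bayes_search, preset, threshold=1000):
    """Use a Bayesian optimization scheme only when the first training data is small.

    `bayes_search` is any callable implementing Bayesian optimization (see the
    sketch given with the description of Fig. 8); `preset` holds the model's
    preset hyperparameters. The threshold and ranges are illustrative assumptions.
    """
    if num_samples < threshold:
        search_space = {                    # hypothetical search space
            "lr": (1e-4, 1e-1),
            "lr_decay_period": (1, 10),
            "epochs": (5, 100),
            "batch_size": (4, 64),
        }
        return bayes_search(search_space)
    return preset
```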
In a second aspect, a method for training a neural network model is provided, including: acquiring first training data and categories of the first training data; when the data amount of the first training data is smaller than a preset value, adjusting super parameters through a Bayesian optimization scheme, and training a neural network model according to the first training data and the category of the first training data; and training the neural network model according to preset super parameters corresponding to the neural network model, the first training data and the categories of the first training data when the data amount of the first training data is larger than or equal to the preset value.
It should be appreciated that the type of the neural network model may be a default or may be specified. The neural network model may be stored in a memory of the electronic device that performs the neural network model training method, or the neural network model may be received from another electronic device.
With reference to the second aspect, in some possible implementations, the method further includes: acquiring the neural network model, wherein the neural network model is obtained by training according to second training data; said training a neural network model according to said first training data and a class of said first training data, comprising: extracting features of the first training data by using the neural network model to obtain features of the first training data, wherein the first training data comprises supporting data and query data, the supporting data comprises all or part of data of each type in the first training data, and the query data comprises all or part of data of each type in the first training data; and adjusting parameters of partial layers of the neural network model according to the feature distance between the class center feature of each class and the query data feature to obtain an adjusted neural network model, wherein each bit in the class center feature of each class is an average value of the feature corresponding bits of the support data of each class.
With reference to the second aspect, in some possible implementations, the method further includes: and adjusting the parameters of the partial layers according to the characteristic distance between the class center characteristic of each class and the query data characteristic and the average value of the characteristic distances between the characteristics of the first training data of each class.
With reference to the second aspect, in some possible implementations, the method further includes: inputting the first training data into the neural network model; and carrying out deep hash on the features extracted by the neural network model to obtain the features of the first training data.
With reference to the second aspect, in some possible implementations, the super-parameters include one or more of a learning rate, a learning rate decay period, a number of iteration periods, a batch size, a network structure parameter of the neural network model.
In a third aspect, an apparatus for training a neural network model is provided, including means for performing the methods of the first aspect described above.
In a fourth aspect, an apparatus for training a neural network model is provided, including means for performing the method in the second aspect.
In a fifth aspect, an apparatus for training a neural network model is provided, the apparatus comprising: a memory for storing a program; a processor for executing the program stored in the memory, the processor being configured to perform the method of the first aspect described above when the program stored in the memory is executed.
In a sixth aspect, there is provided a training apparatus for a neural network, the apparatus comprising: a memory for storing a program; a processor for executing the program stored in the memory, the processor being configured to perform the method of the second aspect described above when the program stored in the memory is executed.
In a seventh aspect, there is provided a computer storage medium storing program code comprising instructions for performing the steps of the method of the first or second aspects.
In an eighth aspect, there is provided a chip system comprising at least one processor, wherein program instructions, when executed in the at least one processor, cause the chip system to perform the method of the first or second aspect.
Optionally, as an implementation manner, the chip system may further include a memory, where the memory stores instructions, and the processor is configured to execute the instructions stored on the memory, and when the instructions are executed, the processor is configured to perform the method in the first aspect.
The chip system can be a field programmable gate array FPGA or an application specific integrated circuit ASIC.
It should be understood that, in the present application, the method of the first aspect may specifically refer to the method of the first aspect and any implementation manner of the various implementation manners of the first aspect.
Drawings
Fig. 1 is a schematic structural diagram of a system architecture according to an embodiment of the present application.
Fig. 2 is a schematic diagram of a convolutional neural network model provided by an embodiment of the present application.
Fig. 3 is a schematic diagram of a chip hardware structure according to an embodiment of the present application.
Fig. 4 is a schematic flow chart of a method for training a neural network model according to an embodiment of the present application.
Fig. 5 is a schematic flow chart of a method for training a neural network model according to another embodiment of the present application.
Fig. 6 is a schematic flow chart of a method of cluster-based small sample learning provided by an embodiment of the application.
Fig. 7 is a schematic diagram of a fine tuning method according to an embodiment of the present application.
Fig. 8 is a schematic flow chart of a bayesian optimization scheme.
Fig. 9 is a schematic block diagram of an apparatus for training a neural network model according to another embodiment of the present application.
Fig. 10 is a schematic hardware structure of an electronic device according to an embodiment of the application.
Detailed Description
The technical scheme of the application will be described below with reference to the accompanying drawings.
Because the embodiments of the present application relate to a large number of applications of neural network models, in order to facilitate understanding, related terms and related concepts such as neural network models related to the embodiments of the present application are first described below.
(1) Neural network model
The neural network model may be composed of neural units. A neural unit may be an arithmetic unit that takes $x_s$ and an intercept $b$ as inputs, and the output of the arithmetic unit may be:

$$h_{W,b}(x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$

where $s = 1, 2, \ldots, n$, $n$ is a natural number greater than 1, $W_s$ is the weight of $x_s$, and $b$ is the bias of the neural unit. $f$ is the activation function (activation function) of the neural unit, used to introduce a nonlinear characteristic into the neural network model so as to convert the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input of the next convolutional layer. The activation function may be a sigmoid function. A neural network model is a network formed by joining many such single neural units together, i.e., the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to a local receptive field of the previous layer to extract the features of the local receptive field; the local receptive field may be an area composed of several neural units.
(2) Deep neural network model
Deep neural network models (deep neural network, DNN), also known as multi-layer neural network models, can be understood as neural network models with many hidden layers; there is no particular metric for "many" here. Based on the positions of the different layers, the layers inside a DNN can be divided into three classes: the input layer, the hidden layers, and the output layer. Typically the first layer is the input layer, the last layer is the output layer, and all intermediate layers are hidden layers. For example, in a fully connected neural network model the layers are fully connected, that is, any neuron in layer i must be connected to any neuron in layer i+1. Although DNN appears complex, the work of each layer is actually not complex; each layer simply computes the linear relational expression $\vec{y} = \alpha(W \cdot \vec{x} + \vec{b})$, where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, $W$ is the weight matrix (also called the coefficients), and $\alpha(\cdot)$ is the activation function. Each layer merely performs this simple operation on the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Since a DNN has a large number of layers, the coefficients $W$ and offset vectors $\vec{b}$ are also numerous. These parameters are defined in the DNN as follows, taking the coefficient $W$ as an example: assume that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as $W^3_{24}$, where the superscript 3 represents the layer in which the coefficient $W$ is located, and the subscripts correspond to the output third-layer index 2 and the input second-layer index 4. In summary, the coefficient from the kth neuron of the (L-1)th layer to the jth neuron of the Lth layer is defined as $W^L_{jk}$. It should be noted that the input layer has no $W$ parameters. In a deep neural network model, more hidden layers make the network better able to characterize complex situations in the real world. In theory, a model with more parameters has higher complexity and greater "capacity", which means that it can accomplish more complex learning tasks. Training the deep neural network model is the process of learning the weight matrices, and its final objective is to obtain the weight matrices of all layers of the trained deep neural network model (the weight matrices formed by the vectors W of many layers).
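As a small numerical illustration of the per-layer computation $\vec{y} = \alpha(W \cdot \vec{x} + \vec{b})$ described above, a plain NumPy forward pass over two fully connected layers might look like this; the layer sizes and the ReLU activation are assumptions.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def dnn_forward(x, weights, biases):
    # Each layer only performs y = alpha(W @ x + b) on its input vector.
    # weights[l] has shape (n_out, n_in); its entry [j, k] is the coefficient
    # from the k-th neuron of layer l to the j-th neuron of layer l+1.
    for W, b in zip(weights, biases):
        x = relu(W @ x + b)
    return x

# Example with illustrative sizes: 4 -> 3 -> 2
rng = np.random.default_rng(0)
weights = [rng.standard_normal((3, 4)), rng.standard_normal((2, 3))]
biases = [np.zeros(3), np.zeros(2)]
y = dnn_forward(rng.standard_normal(4), weights, biases)
```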
(3) Convolutional neural network model
The convolutional neural network model (convolutional neuron network, CNN) is a deep neural network model with a convolutional structure. The convolutional neural network model comprises a feature extractor consisting of a convolutional layer and a sub-sampling layer. The feature extractor can be seen as a filter and the convolution process can be seen as a convolution with an input image or convolution feature plane (feature map) using a trainable filter. The convolution layer refers to a neuron layer for performing convolution processing on an input signal in a convolution neural network model. In the convolutional layer of the convolutional neural network model, one neuron may be connected with only a part of adjacent layer neurons. A convolutional layer typically contains a number of feature planes, each of which may be composed of a number of neural elements arranged in a rectangular pattern. Neural elements of the same feature plane share weights, where the shared weights are convolution kernels. Sharing weights can be understood as the way image information is extracted is independent of location. The underlying principle in this is: the statistics of a certain part of the image are the same as other parts. I.e. meaning that the image information learned in one part can also be used in another part. The same learned image information can be used for all locations on the image. In the same convolution layer, a plurality of convolution kernels may be used to extract different image information, and in general, the greater the number of convolution kernels, the more abundant the image information reflected by the convolution operation.
The convolution kernel can be initialized in the form of a matrix with random size, and reasonable weight can be obtained through learning in the training process of the convolution neural network model. In addition, the direct benefit of sharing weights is to reduce the connections between the layers of the convolutional neural network model, while reducing the risk of overfitting.
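For illustration, the following sketch shows how multiple convolution kernels of the same size produce stacked feature maps whose depth equals the number of kernels; the concrete sizes are assumptions.

```python
import torch
import torch.nn as nn

# 16 convolution kernels of size 3x3 applied to a 3-channel input image:
conv = nn.Conv2d(in_channels=3, out_channels=16, kernel_size=3, stride=1, padding=1)

image = torch.randn(1, 3, 32, 32)   # batch of one 32x32 RGB image
feature_maps = conv(image)          # shape [1, 16, 32, 32]: one feature map per kernel
```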
(4) Loss function
In training the deep neural network model, because the output of the deep neural network model is expected to be as close as possible to the value actually desired, the weight vector of each layer of the neural network model can be updated according to the difference between the predicted value of the current network and the actually desired target value (of course, there is usually an initialization process before the first update, that is, parameters are preconfigured for each layer of the deep neural network model). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the deep neural network model can predict the actually desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value"; this is the role of the loss function (loss function) or the objective function (objective function), which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network model becomes a process of reducing this loss as much as possible.
(5) Residual error network
As the depth of a neural network model keeps increasing, a degradation problem occurs: the accuracy first increases with the depth of the neural network model, then saturates, and finally decreases as the depth continues to increase. The biggest difference between an ordinary direct-connected convolutional neural network model and a residual network (ResNet) is that ResNet has many branches that pass the input directly to later layers; by passing the input information directly to the output, the integrity of the information is protected, which alleviates the degradation problem. The residual network includes convolutional layers and/or pooling layers.
The residual network may be understood as follows: in addition to the layer-by-layer connections among the multiple hidden layers of the deep neural network model (for example, the 1st hidden layer connects to the 2nd hidden layer, the 2nd to the 3rd, and the 3rd to the 4th; these are the data operation paths of the neural network model, which may also be intuitively called neural network model transmission), the residual network additionally has a direct-connection branch that goes from the 1st hidden layer directly to the 4th hidden layer, i.e., the processing of the 2nd and 3rd hidden layers is skipped and the data of the 1st hidden layer is transmitted directly to the 4th hidden layer for operation. The highway network may be understood as follows: besides the above operation path and direct-connection branch, the deep neural network model further includes a weight acquisition branch, which introduces a transmission gate (transmission gate) to acquire a weight value and outputs the weight value T for the subsequent operation of the operation path and the direct-connection branch.
(6) Back propagation algorithm
The convolutional neural network model can use the back propagation (back propagation, BP) algorithm to correct the values of the parameters in the initial neural network model during training, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, the input signal is propagated forward until the output produces an error loss, and the parameters in the initial neural network model are updated by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a back propagation process dominated by the error loss, aiming to obtain the optimal parameters of the neural network model, such as the weight matrices.
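A minimal training-step sketch of forward propagation, error loss, and back propagation is shown below; the toy model, data, and loss function are assumptions.

```python
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 3))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.CrossEntropyLoss()

inputs = torch.randn(4, 8)              # forward: the signal propagates to the output
targets = torch.tensor([0, 2, 1, 0])
loss = criterion(model(inputs), targets)

optimizer.zero_grad()
loss.backward()                         # backward: error loss information propagates back
optimizer.step()                        # parameters are updated so the loss converges
```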
(7) Pixel value
The pixel value of an image may be a red-green-blue (RGB) color value, and the pixel value may be a long integer representing a color. For example, the pixel value is 256×Red+100×Green+76×Blue, where Blue represents the blue component, Green represents the green component, and Red represents the red component. For each color component, a smaller value indicates lower luminance and a larger value indicates higher luminance. For a grayscale image, the pixel value may be a gray value.
(8) Small sample learning
The purpose of small-sample learning is to design a learning model such that the model can learn quickly and identify the classes of new samples from only a small number of labeled samples. Existing research ideas suitable for the small-sample problem include transfer learning methods and semi-supervised learning methods, which can to some extent alleviate the over-fitting problem and the data-scarcity problem that arise when training with a small amount of data.
Some of the basic contents of the neural network model are briefly described above, and some of the specific neural network models that may be used in image data processing are described below.
The system architecture of the embodiment of the present application is described in detail below with reference to fig. 1.
FIG. 1 is a schematic diagram of a system architecture according to an embodiment of the present application. As shown in fig. 1, system architecture 100 includes an execution device 110, a training device 120, a database 130, a client device 140, a data storage system 150, and a data acquisition system 160.
In addition, the execution device 110 includes a calculation module 111, an I/O interface 112, a preprocessing module 113, and a preprocessing module 114. Among other things, the calculation module 111 may include the target model/rule 101, with the preprocessing module 113 and the preprocessing module 114 being optional.
The data acquisition device 160 is used to acquire training data. For the neural network model training method of the embodiment of the present application, the training data may include first training data and a class of the first training data. After the training data is collected, the data collection device 160 stores the training data in the database 130 and the training device 120 trains the target model/rule 101 based on the training data maintained in the database 130.
The training device 120 obtains the target model/rule 101 based on the training data as follows: the training device 120 processes the input first training data and calculates the feature distance between the features of the output query data and the class center feature of each class, until the feature distance between the features of the query data output by the training device 120 and the class center feature of each class meets a preset condition, thereby completing the training of the target model/rule 101.
The target model/rule 101 can be used to implement classification of the neural network model according to the embodiment of the present application, that is, the data to be processed (after the related preprocessing) is input into the target model/rule 101, and the classification of the data to be processed can be obtained. The target model/rule 101 in the embodiment of the present application may be specifically a neural network model. It should be noted that, in practical applications, the training data maintained in the database 130 is not necessarily all acquired by the data acquisition device 160, but may be received from other devices. It should be noted that the training device 120 is not necessarily completely based on the training data maintained by the database 130 to perform training of the target model/rule 101, and it is also possible to obtain the training data from the cloud or other places to perform model training, which should not be taken as a limitation of the embodiments of the present application.
The target model/rule 101 obtained by training according to the training device 120 may be applied to different systems or devices, such as the execution device 110 shown in fig. 1, where the execution device 110 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (augmented reality, AR)/Virtual Reality (VR), a vehicle-mounted terminal, or may also be a server or cloud. In fig. 1, an execution device 110 configures an input/output (I/O) interface 112 for data interaction with an external device, and a user may input data to the I/O interface 112 through a client device 140, where the input data may include in an embodiment of the present application: and the data to be processed is input by the client device. The client device 140 here may be in particular a terminal device.
The preprocessing module 113 and the preprocessing module 114 are used for preprocessing according to input data (such as data to be processed) received by the I/O interface 112, and in an embodiment of the present application, there may be no preprocessing module 113 and no preprocessing module 114 or only one preprocessing module. When the preprocessing module 113 and the preprocessing module 114 are not present, the calculation module 111 may be directly employed to process the input data.
In preprocessing input data by the execution device 110, or in performing processing related to computation or the like by the computation module 111 of the execution device 110, the execution device 110 may call data, codes or the like in the data storage system 150 for corresponding processing, or may store data, instructions or the like obtained by corresponding processing in the data storage system 150.
Finally, the I/O interface 112 presents the processing results, such as classification results calculated by the target model/rule 101, to the client device 140 for presentation to the user.
Specifically, the classification result obtained by the processing of the target model/rule 101 in the calculation module 111 may be sent to the I/O interface after the processing of the preprocessing module 113 (or the processing of the preprocessing module 114 may be added), and then the processing result is sent to the client device 140 by the I/O interface for display.
It should be understood that when the preprocessing module 113 and the preprocessing module 114 are not present in the system architecture 100, the computing module 111 may also transmit the classification result obtained by the processing to the I/O interface, and then send the processing result to the client device 140 for display by the I/O interface.
It should be noted that the training device 120 may generate, based on different training data, a corresponding target model/rule 101 for different targets or different tasks, where the corresponding target model/rule 101 may be used to achieve the targets or complete the tasks, thereby providing the user with the desired result.
In the case shown in FIG. 1, the user may manually give input data that may be manipulated through an interface provided by the I/O interface 112. In another case, the client device 140 may automatically send the input data to the I/O interface 112, and if the client device 140 is required to automatically send the input data requiring the user's authorization, the user may set the corresponding permissions in the client device 140. The user may view the results output by the execution device 110 at the client device 140, and the specific presentation may be in the form of a display, a sound, an action, or the like. The client device 140 may also be used as a data collection terminal to collect input data of the input I/O interface 112 and output results of the output I/O interface 112 as new sample data as shown in the figure, and store the new sample data in the database 130. Of course, instead of being collected by the client device 140, the I/O interface 112 may directly store the input data input to the I/O interface 112 and the output result output from the I/O interface 112 as new sample data into the database 130.
It should be noted that fig. 1 is only a schematic diagram of a system architecture provided by an embodiment of the present application, and the positional relationship among devices, apparatuses, modules, etc. shown in the drawing is not limited in any way, for example, in fig. 1, the data storage system 150 is an external memory with respect to the execution device 110, and in other cases, the data storage system 150 may be disposed in the execution device 110.
As shown in fig. 1, the target model/rule 101 may be a neural network model according to an embodiment of the present application, and specifically, the neural network model provided by the embodiment of the present application may be a CNN and deep convolutional neural network model (deep convolutional neural networks, DCNN), and so on.
Since CNN is a very common neural network model, the structure of CNN will be described in detail with reference to fig. 2. As described in the basic concept introduction above, the convolutional neural network model is a deep neural network model with a convolutional structure, and is a deep learning architecture, in which multiple levels of learning are performed at different abstraction levels through machine learning algorithms. As a deep learning architecture, CNN is a feed-forward artificial neural network model in which individual neurons can respond to data input thereto.
As shown in fig. 2, a convolutional neural network model (CNN) 200 may include an input layer 210, a convolutional layer/pooling layer 220 (where the pooling layer is optional), and a fully-connected layer (fully connected layer) 230. The relevant contents of these layers are described in detail below.
Convolution layer/pooling layer 220:
convolution layer:
The convolutional layer/pooling layer 220 shown in fig. 2 may include layers 221-226. For example: in one implementation, layer 221 is a convolutional layer, layer 222 is a pooling layer, layer 223 is a convolutional layer, layer 224 is a pooling layer, layer 225 is a convolutional layer, and layer 226 is a pooling layer; in another implementation, 221 and 222 are convolutional layers, 223 is a pooling layer, 224 and 225 are convolutional layers, and 226 is a pooling layer. That is, the output of a convolutional layer may be used as the input of a subsequent pooling layer, or as the input of another convolutional layer to continue the convolution operation.
The internal principle of operation of a convolution layer to process an image will be described below using the convolution layer 221 as an example.
The convolutional layer 221 may include many convolution operators, also known as kernels, whose role in image processing is equivalent to a filter that extracts specific information from the input image matrix. A convolution operator may essentially be a weight matrix, which is usually predefined. During a convolution operation on an image, the weight matrix is usually processed over the input image in the horizontal direction, pixel by pixel (or two pixels by two pixels, depending on the value of the stride), so as to complete the task of extracting specific features from the image. The size of the weight matrix should be related to the size of the image. It should be noted that the depth dimension (depth dimension) of the weight matrix is the same as the depth dimension of the input image, and the weight matrix extends over the entire depth of the input image during the convolution operation. Therefore, convolving with a single weight matrix produces a convolved output of a single depth dimension, but in most cases a single weight matrix is not used; instead, multiple weight matrices of the same size (rows × columns) are applied. The outputs of the weight matrices are stacked to form the depth dimension of the convolved image, where the dimension can be understood as being determined by the "multiple" described above. Different weight matrices may be used to extract different features in the image: for example, one weight matrix is used to extract image edge information, another weight matrix is used to extract a particular color of the image, yet another weight matrix is used to blur unwanted noise in the image, and so on. The multiple weight matrices have the same size (rows × columns), the convolution feature maps extracted by weight matrices of the same size also have the same size, and the extracted convolution feature maps of the same size are then combined to form the output of the convolution operation.
The weight values in the weight matrices are required to be obtained through a large amount of training in practical application, and each weight matrix formed by the weight values obtained through training can be used for extracting information from an input image, so that the convolutional neural network model 200 can perform correct prediction.
When convolutional neural network model 200 has multiple convolutional layers, the initial convolutional layer (e.g., 221) tends to extract more general features, which may also be referred to as low-level features; as the depth of the convolutional neural network model 200 increases, features extracted by the later convolutional layers (e.g., 226) become more complex, such as features of high level semantics, which are more suitable for the problem to be solved.
Pooling layer:
Since it is often desirable to reduce the number of training parameters, pooling layers often need to be introduced periodically after convolutional layers: in the layers 221-226 of 220 illustrated in fig. 2, one convolutional layer may be followed by one pooling layer, or multiple convolutional layers may be followed by one or more pooling layers. During image processing, the only purpose of the pooling layer is to reduce the spatial size of the image. The pooling layer may include an average pooling operator and/or a maximum pooling operator for sampling the input image to obtain an image of smaller size. The average pooling operator may compute the average of the pixel values within a particular range as the result of average pooling. The max pooling operator may take the pixel with the largest value within a particular range as the result of max pooling. In addition, just as the size of the weight matrix used in the convolutional layer should be related to the image size, the operators in the pooling layer should also be related to the image size. The size of the image output after processing by the pooling layer may be smaller than the size of the image input to the pooling layer, and each pixel in the image output by the pooling layer represents the average value or the maximum value of the corresponding sub-region of the image input to the pooling layer.
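The following sketch illustrates average pooling and max pooling reducing the spatial size of a feature map; the 2×2 window and tensor sizes are assumptions.

```python
import torch
import torch.nn as nn

feature_map = torch.randn(1, 16, 32, 32)

avg_pool = nn.AvgPool2d(kernel_size=2)   # each output pixel is the mean of a 2x2 region
max_pool = nn.MaxPool2d(kernel_size=2)   # each output pixel is the max of a 2x2 region

print(avg_pool(feature_map).shape)       # torch.Size([1, 16, 16, 16])
print(max_pool(feature_map).shape)       # torch.Size([1, 16, 16, 16])
```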
Full connection layer 230:
after processing by the convolutional layer/pooling layer 220, the convolutional neural network model 200 is not yet sufficient to output the required output information. Because, as previously described, the convolution/pooling layer 220 will only extract features and reduce the parameters imposed by the input image. However, in order to generate the final output information (the required class information or other relevant information), the convolutional neural network model 200 needs to utilize the fully-connected layer 230 to generate the output of the number of classes required for one or a set. Thus, multiple hidden layers (231, 232 to 23n as shown in fig. 2) may be included in the fully connected layer 230, and the output layer 240, where parameters included in the multiple hidden layers may be pre-trained according to relevant training data of a specific task type, for example, the task type may include image recognition, image classification, image super-resolution reconstruction, and so on.
After the hidden layers in the fully connected layer 230, that is, as the final layer of the overall convolutional neural network model 200, comes the output layer 240. The output layer 240 has a loss function similar to categorical cross entropy, specifically used for calculating the prediction error. Once the forward propagation of the overall convolutional neural network model 200 (e.g., propagation from 210 to 240 in fig. 2) is completed, back propagation (e.g., propagation from 240 to 210 in fig. 2) begins to update the weight values and biases of the aforementioned layers, so as to reduce the loss of the convolutional neural network model 200 and the error between the result output by the convolutional neural network model 200 through the output layer and the ideal result.
It should be noted that the convolutional neural network model 200 shown in fig. 2 is only an example of a convolutional neural network model, and the convolutional neural network model may also exist in the form of other network models in a specific application.
It should be understood that the classification method according to the embodiment of the present application may be performed using the convolutional neural network model (CNN) 200 shown in fig. 2, and the classification of the data to be processed may be obtained after the data to be processed is processed by the input layer 210, the convolutional layer/pooling layer 220, and the full-connection layer 230 as shown in fig. 2.
Fig. 3 is a hardware structure of a chip according to an embodiment of the present application, where the chip includes a neural network model processor 50. The chip may be provided in an execution device 110 as shown in fig. 1 for performing the calculation of the calculation module 111. The chip may also be provided in the training device 120 as shown in fig. 1 to complete the training work of the training device 120 and output the target model/rule 101. The algorithms of the layers in the convolutional neural network model shown in fig. 2 can be implemented in a chip as shown in fig. 3.
The neural network model processor (NPU) 50 is mounted as a coprocessor onto a main central processing unit (central processing unit, CPU) (host CPU), and the main CPU distributes tasks. The core part of the NPU is the arithmetic circuit 503; the controller 504 controls the arithmetic circuit 503 to extract data from memory (the weight memory or the input memory) and perform operations.
In some implementations, the arithmetic circuitry 503 internally includes a plurality of processing units (PEs). In some implementations, the operational circuitry 503 is a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or other electronic circuitry capable of performing mathematical operations such as multiplication and addition. In some implementations, the operation circuit 503 is a general-purpose matrix processor.
For example, assume that there is an input matrix a, a weight matrix B, and an output matrix C. The arithmetic circuit 503 takes the data corresponding to the matrix B from the weight memory 502 and buffers the data on each PE in the arithmetic circuit 503. The arithmetic circuit 503 performs matrix operation on the matrix a data and the matrix B data from the input memory 501, and the partial result or the final result of the matrix obtained is stored in an accumulator (accumulator) 508.
The vector calculation unit 507 may further process the output of the operation circuit 503, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and the like. For example, the vector calculation unit 507 may be used for network calculations of non-convolutional/non-FC layers in the neural network model, such as pooling, batch normalization (batch normalization), local response normalization (local response normalization), and the like.
In some implementations, the vector computation unit 507 can store the vector of processed outputs to the unified buffer 506. For example, the vector calculation unit 507 may apply a nonlinear function to an output of the operation circuit 503, such as a vector of accumulated values, to generate an activation value. In some implementations, the vector calculation unit 507 generates a normalized value, a combined value, or both. In some implementations, the vector of processed outputs can be used as an activation input to the operational circuitry 503, for example for use in subsequent layers in a neural network model.
The unified memory 506 is used for storing input data and output data.
The storage unit access controller 505 (direct memory access controller, DMAC) transfers input data in an external memory to the input memory 501 and/or the unified memory 506, stores weight data in the external memory into the weight memory 502, and stores data in the unified memory 506 into the external memory.
A bus interface unit (bus interface unit, BIU) 510 is used for interaction between the main CPU, the DMAC and the instruction fetch memory 509 via a bus.
An instruction fetch memory (instruction fetch buffer) 509 connected to the controller 504 for storing instructions used by the controller 504;
And a controller 504 for calling the instruction cached in the instruction memory 509 to control the operation of the operation accelerator.
Typically, the unified memory 506, the input memory 501, the weight memory 502 and the instruction fetch memory 509 are on-chip (on-chip) memories, and the external memory is a memory external to the NPU; the external memory may be a double data rate synchronous dynamic random access memory (double data rate synchronous dynamic random access memory, DDR SDRAM for short), a high bandwidth memory (high bandwidth memory, HBM) or another readable and writable memory.
In addition, in the present application, the operations of the respective layers in the convolutional neural network model shown in fig. 2 may be performed by the operation circuit 503 or the vector calculation unit 507.
Deep learning technology is developing rapidly; however, training a neural network model currently involves certain difficulties, and experienced engineers are required to adjust the parameters of the neural network model and to select a learning model. There are various kinds of learning models, such as small-sample learning and transfer learning. At present, achieving high accuracy of a neural network model requires adjusting the training parameters of the neural network model based on expert experience, which is time-consuming and labor-intensive and is unfavorable for the rapid iteration of related services.
Under the framework of traditional machine learning, the task of machine learning is to learn a classification model on the basis of given sufficient training data; the learned model is then used to classify and predict the test data. However, machine learning algorithms have a key problem: it is difficult to obtain a large amount of training data in some newly emerging fields.
A large number of new fields keep emerging, and traditional machine learning requires a large amount of labeled training data for each field, which consumes a great deal of manpower and material resources. Without a large amount of annotated data, many learning-related studies and applications cannot be carried out. A situation that typically occurs is the expiration of training data, which often requires re-labeling a large amount of training data to meet the training needs; however, labeling new data is very expensive and requires a great deal of manpower and material resources. From another perspective, if there is a large amount of training data with different distributions, discarding the data entirely is also very wasteful. How to reasonably utilize such data is the main problem solved by transfer learning. Transfer learning (transfer learning) can transfer knowledge from existing data to assist future learning. The goal of transfer learning is to use knowledge learned in one environment to assist the learning task in a new environment.
In the training process of the neural network model, super-parameters (hyper-parameters) affecting the performance need to be set and adjusted. Parameters defining neural network model properties or defining a training process may be referred to as hyper-parameters. Super parameters include one or more of Learning Rate (LR), learning rate decay rate, learning rate decay period, number of iteration (iterations) periods, batch size, network structure parameters of the neural network model, and the like.
When the gradient descent algorithm is used for optimization, a coefficient is multiplied before the gradient term in the updating rule of the weight, and the coefficient can be called a learning rate. The learning rate is an important super-parameter in supervised learning and deep learning, which determines whether and when the objective function can converge to a local minimum.
If the learning rate is too large, it will oscillate back and forth around the global optimum when converging. To prevent this, a learning rate decay rate can be set so that the learning rate keeps decreasing with the number of training rounds, allowing gradient descent to converge with a gradually decreasing step size. The learning rate is reduced as the number of iterations increases so as to accelerate learning. The learning rate set in the super parameters can also be understood as the initial learning rate.
The learning rate decay rate may be understood as the decreasing value of the learning rate per iteration cycle. The learning rate decreases every time a learning rate decay period passes. The learning rate decay period may be a positive integer multiple of the iteration period.
The number of iteration cycles may also be referred to as the number of rounds (epochs), which can be understood as a single training pass over all batches in both forward and backward propagation. That is, 1 epoch is a single forward and backward pass over the entire input data. In brief, the number of epochs refers to how many times the training data is traversed during training. For example, if the training set has 1000 samples and the batch size is 10, then 100 iterations are required to train over the entire sample set, which is 1 epoch.
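The relationship among batch size, iterations, epochs, and a periodic learning rate decay can be sketched as follows; apart from the 1000-sample/batch-size-10 example above, all numbers are assumptions, and a multiplicative step decay is shown only as one possible form of the decay.

```python
num_samples = 1000
batch_size = 10
iterations_per_epoch = num_samples // batch_size   # 100 iterations = 1 epoch

initial_lr = 0.1      # the learning rate set in the super parameters
decay_rate = 0.5      # illustrative decay factor applied once per decay period
decay_period = 5      # in epochs; a positive integer multiple of the iteration period

for epoch in range(30):
    lr = initial_lr * (decay_rate ** (epoch // decay_period))
    # ... run `iterations_per_epoch` forward/backward passes with this lr ...
```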
The super parameters may be adjusted by means of automatic parameter tuning, such as grid search (grid search), random search (random search), genetic algorithm (genetic algorithm), particle swarm optimization (particle swarm optimization), Bayesian optimization (Bayesian optimization), etc. The following description takes Bayesian optimization as an example.
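As a hedged sketch of Bayesian hyperparameter optimization, the following assumes the scikit-optimize library is available and that a hypothetical train_and_evaluate helper returns a validation error to be minimized; any of the other schemes listed above could be substituted.

```python
# Assumes scikit-optimize (`pip install scikit-optimize`) is available.
from skopt import gp_minimize
from skopt.space import Real, Integer

def objective(params):
    lr, batch_size = params
    # Hypothetical helper: trains the model with these super parameters and
    # returns a validation error to be minimized.
    return train_and_evaluate(lr=lr, batch_size=batch_size)

search_space = [
    Real(1e-4, 1e-1, prior="log-uniform", name="lr"),
    Integer(4, 64, name="batch_size"),
]

result = gp_minimize(objective, search_space, n_calls=20, random_state=0)
best_lr, best_batch_size = result.x
```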
Under the condition that the data volume of the training data is small, a neural network model with a classification function can be obtained through a clustering scheme. However, the resulting model has a weak generalization ability due to a small data amount of the training data. In order to solve the problems, the application provides a neural network model training method.
Fig. 4 is a schematic flow chart of a method for training a neural network model according to an embodiment of the present application.
In step S401, a neural network model, first training data, and a class of the first training data are acquired.
The acquisition may be from a memory or received from another device. The neural network model may be trained from the second training data. The second training data may be different data than the first training data, and the second training data may be, for example, all or part of the public data set.
The first training data includes support data and query data. The support data includes all or part of each class of data in the first training data. The query data includes all or part of each class of data in the first training data.
The first training data may be, for example, text, speech, or images. The category of the first training data may be, for example, the part of speech (noun, verb, etc.) of each word in a sentence, the emotion of the speaker to whom a piece of speech corresponds, or the category of the person or object in a picture.
In step S402, feature extraction is performed on the first training data by using the neural network model, so as to obtain features of the first training data.
The features of the first training data may be the features extracted by the neural network model, or may be obtained by further processing the features extracted by the neural network model. For example, the first training data can be input into the neural network model, and deep hashing can be applied to the features extracted by the neural network model to obtain the features of the first training data. That is, the features of the first training data may be the result of deep hashing the features extracted by the neural network model. The feature distance may then be represented by a Hamming distance.
By deep hashing the features extracted by the neural network model, the volume of the features can be reduced, the training time is shortened, and the neural network model training is ensured to have higher precision. In the process of determining the category of the data by adopting the neural network model obtained by training, the reasoning speed can be improved.
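As a hedged illustration of the deep-hashing idea described above, the following PyTorch-style sketch binarizes the extracted features so that feature distances can be measured as Hamming distances; the sign-based binarization and the function names are assumptions, not the application's prescribed implementation.

    import torch

    def deep_hash(features):
        # Binarize real-valued features to {0, 1} bits; thresholding at 0 is one common choice.
        return (features > 0).to(torch.uint8)

    def hamming_distance(code_a, code_b):
        # Number of bit positions in which the two hash codes differ.
        return (code_a != code_b).sum(dim=-1)

    # Example: two 8-bit hash codes derived from extracted features.
    f1 = torch.tensor([0.3, -1.2, 0.7, 0.1, -0.5, 2.0, -0.1, 0.4])
    f2 = torch.tensor([-0.2, -0.9, 0.6, 0.2, 0.8, 1.5, -0.3, -0.6])
    print(hamming_distance(deep_hash(f1), deep_hash(f2)))  # prints tensor(3)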
In step S403, according to the feature distance between the class center feature of each class and the query data feature, parameters of a partial layer of the neural network model are adjusted to obtain an adjusted neural network model.
Before step S403, class center features for each class may be calculated. Each bit in the class center feature of each class is an average value of the corresponding bits of the feature of the support data of each class.
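A hedged sketch of this class-center computation follows, with PyTorch-style tensors assumed: each class center is the element-wise mean of the features of that class's support data.

    import torch

    def class_centers(support_features, support_labels, num_classes):
        # support_features: [num_support, feature_dim]; support_labels: [num_support]
        # Each bit of a class center is the mean of the corresponding bit of that
        # class's support-data features.
        return torch.stack([support_features[support_labels == c].mean(dim=0)
                            for c in range(num_classes)])   # [num_classes, feature_dim]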
The parameters of the partial layers may also be adjusted according to an average value of feature distances between features of the first training data of each class. The center loss may be used to represent an average of feature distances between features of each class of first training data.
In the training process of the neural network model, center loss is introduced, so that the training efficiency of the neural network model can be improved, and the accuracy of the neural network model can be improved.
When the data volume of the first training data meets the preset condition, the network structure of a part of layers of the neural network model is adjusted through a Bayesian optimization scheme, super parameters are optimized, and parameters of the part of layers of the neural network model are adjusted according to the feature distance between the class center feature of each class and the query data feature.
Parameters of a portion of the layers of the neural network model are adjusted, and the number of adjusted layers may be a preset value. For example, the parameters of the last several layers of the neural network model may be adjusted.
And when the data volume of the first training data does not meet the preset condition, adjusting the parameters of the partial layers according to the preset super parameters corresponding to the neural network model and the characteristic distance between the class center characteristics of each class and the query data characteristics.
The preset hyper-parameters may be determined based on expert experience. The preset hyper-parameters may correspond one-to-one to the neural network model. The device for training the neural network model can store the corresponding relation between the preset super parameters and the neural network model.
When the data volume is large, training the neural network model through the Bayesian optimization scheme is inefficient. When the data volume is small, training the neural network model with the preset super parameters corresponding to the neural network model yields lower accuracy. By using the Bayesian optimization scheme only when the data amount of the first training data is small, the accuracy of the trained neural network model can be improved while the training efficiency is maintained.
Through steps S401 to S403, the generalization ability of the neural network model can be improved by adjusting the parameters of the partial layers of the neural network model obtained by training.
Fig. 5 is a schematic flow chart of a method for training a neural network model according to an embodiment of the present application.
To address the inefficiency of tuning parameters manually when training a neural network model, an embodiment of the application provides a neural network model training method.
First, preprocessing of training data is performed. The training data will be described by taking the image data as an example.
The training data may be verified. During verification, whether each picture is damaged can be checked; damaged pictures are deleted and undamaged pictures are left unprocessed. Whether each picture is a three-channel picture can also be checked; if not, it is converted into a three-channel jpg format. A balance check can further be performed on the training data: a preset condition on the ratio of the data amounts of the different categories can be set in advance. If the amounts of the various categories of training data are approximately equal, that is, the ratio is within the preset condition, no processing is performed. If the amounts differ greatly, that is, the ratio between two categories does not meet the preset condition, warning information can be output to indicate that the training data is unbalanced.
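A minimal sketch of this verification step is given below, assuming the Pillow library for image checks and a directory layout with one subdirectory per class; the balance threshold, the file layout, and the warning text are illustrative assumptions.

    import os
    from PIL import Image

    def verify_dataset(root_dir, max_ratio=5.0):
        counts = {}
        for cls in os.listdir(root_dir):
            cls_dir = os.path.join(root_dir, cls)
            if not os.path.isdir(cls_dir):
                continue
            kept = 0
            for name in os.listdir(cls_dir):
                path = os.path.join(cls_dir, name)
                try:
                    Image.open(path).verify()            # damaged pictures raise an error
                except Exception:
                    os.remove(path)                      # delete the damaged picture
                    continue
                img = Image.open(path)
                if img.mode != "RGB":                    # convert non-three-channel pictures
                    img.convert("RGB").save(os.path.splitext(path)[0] + ".jpg", "JPEG")
                kept += 1
            counts[cls] = kept
        # Balance check: warn if the largest class is far bigger than the smallest one.
        if counts and max(counts.values()) > max_ratio * max(1, min(counts.values())):
            print("Warning: training data is unbalanced:", counts)
        return counts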
The training data may be formatted. Data format conversion is also understood as the sorting or packing of training data. In the training data format conversion process, the picture data and its tag may be converted into tfrecord format.
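A hedged sketch of packing picture/label pairs into the tfrecord format follows, assuming TensorFlow is available; the feature keys "image" and "label" are illustrative.

    import tensorflow as tf

    def write_tfrecord(samples, output_path):
        # samples: iterable of (jpeg_bytes, integer_label) pairs.
        with tf.io.TFRecordWriter(output_path) as writer:
            for image_bytes, label in samples:
                example = tf.train.Example(features=tf.train.Features(feature={
                    "image": tf.train.Feature(bytes_list=tf.train.BytesList(value=[image_bytes])),
                    "label": tf.train.Feature(int64_list=tf.train.Int64List(value=[label])),
                }))
                writer.write(example.SerializeToString())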
And then training the neural network model according to the training data obtained by the preprocessing.
Indication information may be obtained to indicate the type of neural network model to be trained, that is, a neural network model of a specified type. A neural network model of a default type may also be trained.
The Bayesian optimization scheme can adjust the super parameters automatically. However, the Bayesian optimization scheme is relatively inefficient, and optimizing the super parameters takes a long time. The Bayesian optimization scheme is illustrated in fig. 8.
When the data amount of a single category of training data is greater than or equal to a first preset value (for example, 200), the neural network model may be trained according to the training data.
The data amount of the training data of the single category may be the data amount of the category having the smallest data amount in the training data, or the data amount of each category in the training data may be averaged to obtain the data amount of the training data of the single category.
When the total data amount of the training data is smaller than a second preset value (for example, 200,000), the network structure of the neural network model can be adjusted and the super parameters optimized through the Bayesian optimization scheme, and the neural network model is trained according to the training data. The structure of all or part of the layers of the neural network model can be adjusted, and the parameters of all or part of the layers can be adjusted.
When the total data amount of the training data is greater than or equal to a second preset value, training the neural network model according to the preset super parameters and the training data corresponding to the neural network model.
By training the neural network model, the optimal neural network model is obtained.
When the data amount of the training data of the single category is smaller than the first preset value, the neural network model may be trained according to a small sample learning scheme. Through a small sample learning scheme, the robustness of the neural network model obtained through training can be enhanced, namely the generalization capability is improved, and the accuracy is improved.
When the data amount of a single category of training data is smaller than the first preset value, the neural network model can first be trained according to the preset super parameters corresponding to the neural network model. If the accuracy of the resulting neural network model meets the required standard, for example 95%, small sample learning is not performed. Small sample learning schemes include a cluster-based small sample learning scheme, a fine-tune-based small sample learning scheme, and the like. The cluster-based small sample learning scheme is described with reference to fig. 6, and the fine-tune-based small sample learning scheme with reference to fig. 7. The accuracy of the neural network model may also be understood as its precision, which may be determined on the training data or on other labeled data.
The neural network model can be trained by a plurality of small sample learning schemes, and the neural network model with highest precision can be used as the optimal neural network model in the plurality of trained neural network models.
When the data volume is small, training the neural network model according to the preset super parameters yields lower accuracy. A higher-precision neural network model can be obtained through the Bayesian optimization scheme, but when the data volume is large this scheme is less efficient and takes a long time.
Whether to use the Bayesian optimization scheme to adjust the network structure of the neural network model and the super parameters used for training is therefore selected according to the total data amount of the training data. This reduces the time and resources occupied by training while ensuring the accuracy of the trained neural network model.
And finally, outputting a training result. The training result comprises the optimal neural network model obtained through training. The training results may also include the processing results of the optimal neural network model on portions of the training data, and highlighting indicia of the portions of each training data that have the greatest impact on the processing results. For example, a portion of pixels in an image of training data that have the greatest influence on the processing result may be highlighted.
According to the highlighting mark of the part with the greatest influence on the processing result in each training data, the reason for influencing the accuracy of the neural network model obtained by training can be judged manually. The reasons may include, for example, poor training data, and/or the need for further optimization of the super parameters being trained, etc.
According to the neural network model training method provided by the embodiment of the application, when the sample size is small, for example fewer than 200 images per class, the neural network model is trained with a small sample learning scheme. When the sample size is between 200 and 2000 images per class, the classification model is trained using Bayesian optimization combined with whole-network fine-tuning. When the sample size is greater than 2000 images per class, the sample size is sufficient, so the neural network model is trained directly with preset super parameters determined from manual experience, yielding a high-precision classification model. During training, an early stop technique can also be applied: if the accuracy of the neural network model stops improving before the preset number of iterations is reached, training can be stopped early, as sketched below. The neural network model training method provided by the embodiment of the application is fully automatic, does not depend on expert tuning, and is simple and easy to use. In particular, when there are fewer than 30 images per class, model accuracy is guaranteed by using cluster-based small sample learning.
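A minimal sketch of the early-stop logic mentioned above follows; the patience value and the class and method names are assumptions for illustration.

    class EarlyStopper:
        """Stop training when accuracy has not improved for `patience` evaluations."""

        def __init__(self, patience=5):
            self.patience = patience
            self.best_accuracy = 0.0
            self.rounds_without_improvement = 0

        def should_stop(self, accuracy):
            # Track the best accuracy seen so far; stop once it stops improving.
            if accuracy > self.best_accuracy:
                self.best_accuracy = accuracy
                self.rounds_without_improvement = 0
            else:
                self.rounds_without_improvement += 1
            return self.rounds_without_improvement >= self.patience

    # Usage: break out of the training loop when should_stop(validation_accuracy) returns True.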
By combining the automatic parameter tuning scheme with the preset super parameters corresponding to the neural network model, the need for manual parameter tuning is removed: the tedious manual tuning process is eliminated, the parameters are adjusted automatically, and the accuracy of the neural network model is improved. Training the neural network model with the preset super parameters corresponding to it can be understood as using a general training strategy preset in the system.
A neural network model trained by the neural network model training method provided by the embodiment of the application can achieve the same or better precision than manual parameter tuning.
Fig. 6 is a schematic flow chart of a cluster-based small sample learning scheme provided by an embodiment of the application.
In order to ensure the accuracy of the neural network model under the condition of small sample size, a small sample learning scheme can be adopted to train the neural network model. Referring to fig. 5, the neural network model may be trained using a small sample learning scheme when the data amount of a single type of training data in the training data is less than 200.
Features of the training data may be extracted through a neural network model. According to the distance between the features of the training data, the relationship network may cluster the training data using a clustering algorithm to determine the class of the training data.
By using the relational network, the training data can be clustered according to the characteristics of the training data, so that a clustering result is determined.
In a conventional small sample learning scheme, the neural network model is adjusted using a cross entropy loss. By minimizing the cross entropy loss, the feature distance between features of different classes extracted by the neural network model can be increased.
And extracting the characteristics of the training data by utilizing the neural network model, so that the characteristics of the training data can be obtained. The training data includes support data and query data. The support data may include all or part of the training data. The query data may include all or part of the training data. The union of the support data and the query data may include all of the data in the training data. The support data may or may not intersect with the query data. The support data includes all or part of each type of data in the training data. The query data includes all or part of each type of data in the training data.
Class center features for each class in the first training data may be calculated based on the features of the support data. The class center feature of each class is the average of the feature correspondence bits of all the support data of the class.
And adjusting parameters of partial layers of the neural network model according to the feature distance between the class center feature of each class and the query data feature, so as to obtain the adjusted neural network model. The cross entropy penalty may be calculated from the feature distance of each class center feature from the query data feature. And adjusting parameters of partial layers of the neural network model to minimize cross entropy loss, thereby obtaining an adjusted neural network model.
Some data may be randomly selected from the training data to form a support set, and other data to form a query set. The per-class average of the features of the support data in the support set extracted by the neural network may be referred to as the support features, and the features of the query data in the query set extracted by the neural network may be referred to as the query features. The support set includes support data of each class, and the query set includes query data of each class. The cross entropy loss is calculated from the feature distance between each query feature and the support feature of the class to which the corresponding query data belongs, as sketched below. The parameters of the neural network model are adjusted according to the cross entropy loss to train the neural network model.
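One way to realize such an episode — class centers computed from the support set, and cross entropy computed from the distances between query features and each class center — is the following PyTorch-style sketch; using the negative squared distance as the logit is an assumed design choice rather than the application's prescribed formulation.

    import torch
    import torch.nn.functional as F

    def episode_loss(support_feats, support_labels, query_feats, query_labels, num_classes):
        # Class center: mean of the support features of each class.
        centers = torch.stack([support_feats[support_labels == c].mean(dim=0)
                               for c in range(num_classes)])          # [C, D]
        # Feature distance between every query feature and every class center.
        dists = torch.cdist(query_feats, centers)                     # [Q, C]
        # Smaller distance -> larger logit; cross entropy pulls queries toward their own center.
        return F.cross_entropy(-dists, query_labels)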
When the data volume of the training data is small, training an initial neural network model from scratch may result in a model that is applicable only to the training data and has poor generalization ability. Therefore, a transfer learning scheme can be adopted in the small sample learning process. In the transfer learning scheme, the parameters of a pre-trained neural network model are fine-tuned, that is, the parameters of partial layers are adjusted. This transfer learning approach may also be referred to as fine-tuning of the neural network model. The fine-tuning scheme is described with reference to fig. 7.
The pre-trained neural network model may be a neural network model trained on a common data set. Under the condition that the data volume of the training data is smaller, the generalization capability of the neural network model obtained by final training is improved by adjusting the parameters of partial layering of the neural network model obtained by pre-training.
To improve the training efficiency and the accuracy of the neural network model, a center loss may be introduced when adjusting the neural network model. The center loss may be calculated from the average of the distances between the features of the first training data of each class.
Training through cross entropy loss can increase inter-class distances; training through center loss may reduce intra-class distances. According to the cross entropy loss and the center loss, the neural network model is trained, so that the training efficiency of the neural network model can be improved, and the accuracy of the neural network model can be improved.
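A hedged sketch of such a center loss follows: the average distance between each sample's feature and the center of its own class, which can be added to the cross entropy loss with an assumed weighting factor.

    import torch

    def center_loss(features, labels, centers):
        # features: [N, D]; labels: [N]; centers: [C, D] (one center per class).
        # Average squared distance between each feature and its own class center;
        # minimizing this reduces intra-class distances.
        return (features - centers[labels]).pow(2).sum(dim=1).mean()

    # total_loss = cross_entropy + lambda_center * center_loss(features, labels, centers)
    # (lambda_center is an assumed weighting factor, not specified by the application)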
The features extracted by the neural network model may have a large bit width, so storing the features of the training data and computing with them occupies more resources. The features extracted by the neural network model may therefore be compressed, for example by deep hashing, and the feature distance may then be represented by a Hamming distance. By deep hashing the features extracted by the neural network model, the size of the features can be reduced and the training time shortened while the training of the neural network model retains high precision. When the trained neural network model is used to determine the category of data, the inference speed can also be improved.
Before classifying data by using the neural network model obtained by the small sample learning scheme based on clustering, the average value of the characteristics of each type of training data can be determined as the central characteristic of each type according to the characteristics of the training data extracted by the neural network model obtained by training. When the neural network model obtained by applying the small sample learning scheme based on the clustering is used for classifying the data to be classified, the characteristics of the data to be classified can be extracted according to the neural network model obtained by training, and the data to be classified can be classified according to the characteristic distance between the characteristics of the data to be classified and the central characteristics of each class. For example, it may be determined that a category of the data to be classified, which corresponds to the smallest feature distance among the feature distances of the center features of the respective categories, is the category of the data to be classified.
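A minimal sketch of this inference step follows, assuming both the class center features and the feature of the data to be classified have been binarized by the deep-hash step sketched earlier, so that the feature distance is a Hamming distance; the tensor names are illustrative.

    import torch

    def classify(query_code, class_center_codes):
        # query_code: [bits] hash code of the data to be classified.
        # class_center_codes: [num_classes, bits] center code of each class.
        hamming = (class_center_codes != query_code).sum(dim=-1)   # distance to each class center
        return int(torch.argmin(hamming))                          # class with the smallest feature distance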
According to the cluster-based small sample learning scheme provided by the embodiment of the application, feature extraction is performed in a transfer learning manner, which ensures the accuracy of the trained neural network model with a small number of training samples and reduces the dependence of neural network model training on the data amount of the training data. The extracted features are deep-hashed to compress them, so the feature size is small, the efficiency of computing with the features is improved, and resource occupation is reduced.
Fig. 7 is a schematic flow chart of a fine-tuning-based small sample learning scheme provided by an embodiment of the present application.
The ability of the neural network model to predict samples outside the training set may be referred to as the generalization ability of the neural network model. An important topic in machine learning is improving the generalization ability of a neural network model; a model with strong generalization ability is a good model. When the data volume of the training data is insufficient, the trained neural network model is prone to under-fitting: with too little training data to learn from, the model cannot learn the general rules in the training data, so its generalization ability is weak.
To address the weak generalization ability of a neural network model trained when the data volume of the training data is insufficient, transfer learning can be performed. A neural network model that has already been trained on a large amount of data is trained again on the smaller amount of training data, and the parameters of partial layers of the neural network model are adjusted. Adjusting the parameters of partial layers of the neural network model may also be referred to as fine-tuning the neural network model.
Starting from an open-source neural network model trained on a large data set, the parameters of the shallow layers can be kept unchanged, that is, the weights of the shallow network are frozen, and only the parameters of the last several layers of the neural network model are adjusted. By fine-tuning a neural network model trained on a large data set in this way, the robustness of the model can be ensured while maintaining accuracy under a small sample size.
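A minimal PyTorch-style sketch of this fine-tuning step follows, assuming a recent torchvision and a ResNet-18 pretrained on a public data set; the choice of ResNet-18, the 10 target classes, the optimizer, and the learning rate are illustrative assumptions, and freezing everything except the final fully connected layer is one possible reading of "the last layers".

    import torch
    import torch.nn as nn
    from torchvision import models

    # Pre-trained model standing in for one trained on a large public data set (assumption).
    model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
    for param in model.parameters():
        param.requires_grad = False                     # keep the shallow-layer weights unchanged
    model.fc = nn.Linear(model.fc.in_features, 10)      # replace the last layer; 10 classes assumed
    # Only the parameters of the new last layer are passed to the optimizer and adjusted.
    optimizer = torch.optim.SGD(model.fc.parameters(), lr=1e-3)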
The neural network model trained on the large dataset can also be fine-tuned in combination with a bayesian optimization scheme.
Fig. 8 is a schematic flow chart of a bayesian optimization scheme.
The Bayesian optimization scheme can adopt Gaussian process regression, random forest regression, and the like. Different modes use different surrogate functions for the objective function, that is, different fitting functions when curve fitting is performed. Gaussian process regression is taken as an example for illustration.
In step S801, the super parameters are initialized.
By initialization, multiple groups of super parameters, namely parameters for training the neural network model, can be obtained.
In step S802, a neural network model is trained.
For each group of super parameters obtained by initialization, the network structure of the neural network model can be adjusted according to that group, and the neural network model is trained with that group of super parameters, so that a trained neural network model is obtained for each of the multiple groups of super parameters.
In step S803, curve fitting is performed.
It is assumed that the relation between the super parameters of the neural network model and the accuracy of the neural network model follows a Gaussian distribution, and a curve is fitted for each super parameter accordingly.
In step S804, the super parameters corresponding to the maximum expected accuracy are determined.
Each super parameter corresponding to the neural network model with the highest expected accuracy is obtained from the fitted curve.
Steps S802 to S804 are then repeated: the neural network model is trained according to the super parameters corresponding to the highest expected accuracy, curve fitting is performed again to update the fitted curve, and the super parameters corresponding to the neural network model with the highest expected accuracy are obtained from the updated curve.
In step S805, the neural network model is optimized.
When the preset maximum number of iterations is reached, or when the super parameters corresponding to the maximum expected accuracy obtained through curve fitting no longer change, the finally obtained neural network model can be taken as the optimal neural network model produced by training.
Through steps S801 to S805, one or more of the super parameters, such as the learning rate (LR), the learning rate decay rate, the learning rate decay period, the number of iteration periods, the batch size, and the dropout rate, may be optimized.
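The loop of steps S801 to S805 can be sketched with an off-the-shelf Gaussian-process-based optimizer. The example below assumes the scikit-optimize library (skopt) is available and uses a hypothetical train_and_evaluate function standing in for training the model and measuring its accuracy; the search space, the iteration budget, and the library choice are illustrative, not part of the application.

    from skopt import gp_minimize
    from skopt.space import Real, Integer

    def train_and_evaluate(learning_rate, batch_size):
        # Hypothetical placeholder: train the neural network model with these super
        # parameters and return its validation accuracy. A dummy value is returned here.
        return 0.5

    def objective(params):
        learning_rate, batch_size = params
        # gp_minimize minimizes the objective, so the negative accuracy is returned.
        return -train_and_evaluate(learning_rate, batch_size)

    result = gp_minimize(
        objective,
        dimensions=[Real(1e-5, 1e-1, prior="log-uniform"),  # learning rate
                    Integer(8, 256)],                        # batch size
        n_initial_points=5,   # step S801: initialize several groups of super parameters
        n_calls=30,           # steps S802-S804 repeated until the iteration budget is reached
    )
    best_learning_rate, best_batch_size = result.x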
The neural network model training method according to the embodiment of the present application has been described in detail above with reference to the accompanying drawings. The neural network model training apparatus according to the embodiment of the present application is described in detail below with reference to the accompanying drawings. It should be understood that the apparatus described below can perform each step of the neural network model training method of the embodiment of the present application; to avoid unnecessary repetition, repetitive descriptions are omitted below when describing the apparatus.
Fig. 9 is a schematic structural diagram of a neural network training device according to an embodiment of the present application. The apparatus 3000 includes an acquisition module 3001 and a processing module 3002. The acquisition module 3001 and the processing module 3002 may be used to perform the method of neural network training of embodiments of the present application.
In some embodiments, in particular, the acquisition module 3001 may perform step S401 and the processing module 3002 may perform steps S402-S403.
The acquiring module 3001 is configured to acquire a neural network model, first training data and a class of the first training data, where the neural network model is trained according to second training data, the first training data includes support data and query data, the support data includes all or part of data of each class of the first training data, and the query data includes all or part of data of each class of the first training data.
The processing module 3002 is configured to perform feature extraction on the first training data by using the neural network model, so as to obtain features of the first training data.
The processing module 3002 is configured to adjust parameters of a partial hierarchy of the neural network model according to feature distances between a class center feature of each class and the query data feature, so as to obtain an adjusted neural network model, where each bit in the class center feature of each class is an average value of bits corresponding to features of the support data of each class.
Optionally, the processing module 3002 is configured to adjust the parameters of the partial layer according to the feature distance between the class center feature of each class and the query data feature, and the average value of feature distances between features of the first training data of each class.
Optionally, the processing module 3002 is configured to input the first training data into the neural network model; and carrying out deep hash on the features extracted by the neural network model to obtain the features of the first training data.
Optionally, the processing module 3002 is configured to adjust, when the data amount of the first training data is smaller than a preset value, a super parameter through a bayesian optimization scheme, and adjust parameters of the partial layer according to feature distances between the class center feature of each class and the query data feature; and when the data volume of the first training data is larger than or equal to the preset value, adjusting the parameters of the partial layers according to the preset super parameters corresponding to the neural network model and the characteristic distance between the class center characteristics of each class and the query data characteristics.
Optionally, the super parameters include one or more of a learning rate, a learning rate decay period, a number of iteration periods, a batch size, a dropout rate, and network structure parameters of the neural network model.
In other embodiments, the acquiring module 3001 is configured to acquire the first training data and a category of the first training data.
The processing module 3002 is configured to, when the data amount of the first training data is smaller than a preset value, adjust a super parameter through a bayesian optimization scheme, and train a neural network model according to the first training data and the class of the first training data; and training the neural network model according to preset super parameters corresponding to the neural network model, the first training data and the categories of the first training data when the data amount of the first training data is larger than or equal to the preset value.
Optionally, the neural network model is trained from the second training data.
The processing module 3002 is configured to perform feature extraction on the first training data by using the neural network model to obtain features of the first training data, where the first training data includes support data and query data, the support data includes all or part of data of each class in the first training data, and the query data includes all or part of data of each class in the first training data.
The processing module 3002 is configured to adjust parameters of a partial hierarchy of the neural network model according to feature distances between a class center feature of each class and the query data feature, so as to obtain an adjusted neural network model, where each bit in the class center feature of each class is an average value of bits corresponding to features of the support data of each class.
Optionally, the processing module 3002 is configured to adjust the parameters of the partial layer according to the feature distance between the class center feature of each class and the query data feature, and the average value of feature distances between features of the first training data of each class.
Optionally, the processing module 3002 is configured to input the first training data into the neural network model; and carrying out deep hash on the features extracted by the neural network model to obtain the features of the first training data.
Optionally, the super parameters include one or more of a learning rate, a learning rate decay period, a number of iteration periods, a batch size, a dropout rate, and network structure parameters of the neural network model.
Fig. 10 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application. The electronic apparatus 1000 shown in fig. 10 (the apparatus 1000 may be a computer device in particular) comprises a memory 1001, a processor 1002, a communication interface 1003, and a bus 1004. The memory 1001, the processor 1002, and the communication interface 1003 are connected to each other by a bus 1004.
The memory 1001 may be a Read Only Memory (ROM), a static storage device, a dynamic storage device, or a random access memory (random access memory, RAM). The memory 1001 may store a program, and the processor 1002 and the communication interface 1003 are for performing the respective steps of the neural network model training method of the embodiment of the present application when the program stored in the memory 1001 is executed by the processor 1002.
The processor 1002 may employ a general-purpose central processing unit (central processing unit, CPU), microprocessor, application specific integrated circuit (application specific integrated circuit, ASIC), graphics processor (graphics processing unit, GPU) or one or more integrated circuits for executing associated programs to perform the functions required by the elements in the apparatus for neural network model training of an embodiment of the present application, or to perform the method for neural network model training of an embodiment of the present application.
The processor 1002 may also be an integrated circuit chip with signal processing capabilities. In implementation, the various steps of the neural network model training method of the present application may be performed by instructions in the form of integrated logic circuits or software in hardware in the processor 1002. The processor 1002 may also be a general purpose processor, a digital signal processor (digital signal processing, DSP), an application specific integrated circuit, an off-the-shelf programmable gate array (field programmable gate array, FPGA) or other programmable logic device, discrete gate or transistor logic device, discrete hardware components. The disclosed methods, steps, and logic blocks in the embodiments of the present application may be implemented or performed. A general purpose processor may be a microprocessor or the processor may be any conventional processor or the like. The steps of the method disclosed in connection with the embodiments of the present application may be embodied directly in the execution of a hardware decoding processor, or in the execution of a combination of hardware and software modules in a decoding processor. The software modules may be located in a random access memory, flash memory, read only memory, programmable read only memory, or electrically erasable programmable memory, registers, etc. as well known in the art. The storage medium is located in the memory 1001, and the processor 1002 reads information in the memory 1001, and combines the hardware thereof to perform functions required to be performed by units included in the neural network model training apparatus according to the embodiment of the present application, or perform the neural network model training method according to the embodiment of the present application.
Communication interface 1003 enables communication between apparatus 1000 and other devices or communication networks using a transceiving apparatus such as, but not limited to, a transceiver. For example, one or more of the neural network model, the first training data, and the like may be acquired through the communication interface 1003.
Bus 1004 may include a path to transfer information between elements of device 1000 (e.g., memory 1001, processor 1002, communication interface 1003).
The embodiment of the application also provides a computer program storage medium, characterized in that the computer program storage medium has program instructions which, when executed directly or indirectly, cause the foregoing method to be implemented.
An embodiment of the present application also provides a chip system, where the chip system includes at least one processor, and when program instructions are executed in the at least one processor, the foregoing method is implemented.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It will be clear to those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, and are not repeated herein.
In the several embodiments provided by the present application, it should be understood that the disclosed systems, devices, and methods may be implemented in other manners. For example, the apparatus embodiments described above are merely illustrative, e.g., the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, e.g., multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, each functional unit in the embodiments of the present application may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application may be embodied essentially or in a part contributing to the prior art or in a part of the technical solution, in the form of a software product stored in a storage medium, comprising several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the method according to the embodiments of the present application. And the aforementioned storage medium includes: a U-disk, a removable hard disk, a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, or an optical disk, or other various media capable of storing program codes.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (20)

1. A method for neural network model training, comprising:
acquiring a neural network model, first training data and categories of the first training data, wherein the neural network model is obtained by training according to second training data, the first training data comprises supporting data and query data, the supporting data comprises all or part of data of each category in the first training data, the query data comprises all or part of data of each category in the first training data, and the first training data and the second training data are image data;
extracting features of the first training data by using the neural network model to obtain features of the first training data;
according to the characteristic distance between the class center characteristic of each class and the query data characteristic and the average value of the characteristic distance between the characteristics of the first training data of each class, adjusting the parameters of partial layering of the neural network model to obtain an adjusted neural network model, wherein each bit in the class center characteristic of each class is the average value of the corresponding bits of the characteristics of the support data of each class;
The neural network model comprises a convolution layer and a pooling layer, the convolution layer comprises a plurality of convolution operators, the convolution operators are used for extracting specific features of a first image, the pooling layer comprises an average pooling operator and/or a maximum pooling operator, the average pooling operator and/or the maximum pooling operator are used for sampling the first image to obtain a second image, the first image is determined according to the second training data, and the size of the second image is smaller than that of the first image.
2. The method of claim 1, wherein the feature extraction of the first training data using the neural network model to obtain the features of the first training data comprises:
inputting the first training data into the neural network model;
and carrying out deep hash on the features extracted by the neural network model to obtain the features of the first training data.
3. The method of claim 1, wherein adjusting the parameters of the partial hierarchy of the neural network model based on the feature distance between the class center feature of each class and the query data feature and the average of feature distances between features of the first training data of each class comprises:
When the data volume of the first training data is smaller than a preset value, the super parameters are adjusted through a Bayesian optimization scheme, and the parameters of the partial layers are adjusted according to the characteristic distance between the class center characteristics of each class and the query data characteristics and the average value of the characteristic distances between the characteristics of the first training data of each class;
and when the data volume of the first training data is larger than or equal to the preset value, adjusting the parameters of the partial layer according to the preset super parameters corresponding to the neural network model, the characteristic distance between the class center characteristic of each class and the query data characteristic and the average value of the characteristic distances between the characteristics of the first training data of each class.
4. The method of claim 3, wherein the super-parameters comprise one or more of a learning rate, a learning rate decay period, a number of iteration periods, a batch size, a network structure parameter of a neural network model.
5. A method for neural network model training, comprising:
acquiring a neural network model, first training data and categories of the first training data, wherein the first training data is image data;
When the data amount of the first training data is smaller than a preset value, adjusting super parameters through a Bayesian optimization scheme, and training the neural network model according to the first training data and the category of the first training data;
when the data amount of the first training data is larger than or equal to the preset value, training the neural network model according to preset super parameters corresponding to the neural network model, the first training data and the categories of the first training data;
the neural network model comprises a convolution layer and a pooling layer, the convolution layer comprises a plurality of convolution operators, the convolution operators are used for extracting specific features of a first image, the pooling layer comprises an average pooling operator and/or a maximum pooling operator, the average pooling operator and/or the maximum pooling operator are used for sampling the first image to obtain a second image, the first image is determined according to second training data, the second training data are image data, the neural network model is obtained according to second training data in a training mode, and the size of the second image is smaller than that of the first image.
6. The method of claim 5, wherein training a neural network model based on the first training data and the class of the first training data comprises:
Extracting features of the first training data by using the neural network model to obtain features of the first training data, wherein the first training data comprises supporting data and query data, the supporting data comprises all or part of data of each type in the first training data, and the query data comprises all or part of data of each type in the first training data;
and adjusting parameters of partial layers of the neural network model according to the feature distance between the class center feature of each class and the query data feature to obtain an adjusted neural network model, wherein each bit in the class center feature of each class is an average value of the feature corresponding bits of the support data of each class.
7. The method of claim 6, wherein adjusting parameters of a partial hierarchy of the neural network model based on feature distances of class center features of each class from the query data features comprises:
and adjusting the parameters of the partial layers according to the characteristic distance between the class center characteristic of each class and the query data characteristic and the average value of the characteristic distances between the characteristics of the first training data of each class.
8. The method according to claim 6 or 7, wherein the feature extraction of the first training data using the neural network model to obtain the features of the first training data comprises:
inputting the first training data into the neural network model;
and carrying out deep hash on the features extracted by the neural network model to obtain the features of the first training data.
9. The method of any of claims 5-7, wherein the super-parameters comprise one or more of a learning rate, a learning rate decay period, a number of iteration periods, a batch size, a network structure parameter of a neural network model.
10. An apparatus for neural network model training, comprising:
the system comprises an acquisition module, a first training module and a second training module, wherein the acquisition module is used for acquiring a neural network model, first training data and categories of the first training data, the neural network model is trained according to second training data, the first training data comprises supporting data and query data, the supporting data comprises all or part of data of each category in the first training data, the query data comprises all or part of data of each category in the first training data, and the first training data and the second training data are image data;
A processing module for:
extracting features of the first training data by using the neural network model to obtain features of the first training data;
according to the characteristic distance between the class center characteristic of each class and the query data characteristic and the average value of the characteristic distance between the characteristics of the first training data of each class, adjusting the parameters of partial layering of the neural network model to obtain an adjusted neural network model, wherein each bit in the class center characteristic of each class is the average value of the corresponding bits of the characteristics of the support data of each class;
the neural network model comprises a convolution layer and a pooling layer, the convolution layer comprises a plurality of convolution operators, the convolution operators are used for extracting specific features of a first image, the pooling layer comprises an average pooling operator and/or a maximum pooling operator, the average pooling operator and/or the maximum pooling operator are used for sampling the first image to obtain a second image, the first image is determined according to the second training data, and the size of the second image is smaller than that of the first image.
11. The apparatus of claim 10, wherein the processing module is configured to:
Inputting the first training data into the neural network model;
and carrying out deep hash on the features extracted by the neural network model to obtain the features of the first training data.
12. The apparatus of claim 10, wherein the processing module is configured to:
when the data volume of the first training data is smaller than a preset value, adjusting a network structure of a part of layers of the neural network model through a Bayesian optimization scheme, optimizing super parameters, and adjusting parameters of the part of layers according to the characteristic distance between the class center characteristics of each class and the query data characteristics and the average value of the characteristic distances between the characteristics of the first training data of each class;
and when the data volume of the first training data is larger than or equal to the preset value, adjusting the parameters of the partial layer according to the preset super parameters corresponding to the neural network model, the characteristic distance between the class center characteristic of each class and the query data characteristic, and the average value of the characteristic distances between the characteristics of the first training data of each class.
13. The apparatus of claim 12, wherein the super-parameters comprise one or more of a learning rate, a learning rate decay period, a number of iteration periods, a batch size, a network structure parameter of a neural network model.
14. An apparatus for neural network model training, comprising:
the acquisition module is used for acquiring the neural network model, the first training data and the category of the first training data, wherein the first training data is image data;
a processing module for:
when the data volume of the first training data is smaller than a preset value, the network structure of the neural network model is adjusted through a Bayesian optimization scheme, super parameters are optimized, and the neural network model is trained according to the first training data and the category of the first training data;
when the data amount of the first training data is larger than or equal to the preset value, training the neural network model according to preset super parameters corresponding to the neural network model, the first training data and the categories of the first training data;
the neural network model comprises a convolution layer and a pooling layer, the convolution layer comprises a plurality of convolution operators, the convolution operators are used for extracting specific features of a first image, the pooling layer comprises an average pooling operator and/or a maximum pooling operator, the average pooling operator and/or the maximum pooling operator are used for sampling the first image to obtain a second image, the first image is determined according to second training data, the second training data are image data, and the size of the second image is smaller than that of the first image.
15. The apparatus of claim 14, wherein the processing module is configured to:
extracting features of the first training data by using the neural network model to obtain features of the first training data, wherein the first training data comprises supporting data and query data, the supporting data comprises all or part of data of each type in the first training data, and the query data comprises all or part of data of each type in the first training data;
and adjusting parameters of partial layers of the neural network model according to the feature distance between the class center feature of each class and the query data feature to obtain an adjusted neural network model, wherein each bit in the class center feature of each class is an average value of the feature corresponding bits of the support data of each class.
16. The apparatus of claim 15, wherein the processing module is configured to:
and adjusting the parameters of the partial layers according to the characteristic distance between the class center characteristic of each class and the query data characteristic and the average value of the characteristic distances between the characteristics of the first training data of each class.
17. The apparatus of claim 15 or 16, wherein the processing module is configured to:
inputting the first training data into the neural network model;
and carrying out deep hash on the features extracted by the neural network model to obtain the features of the first training data.
18. The apparatus of any one of claims 15-17, wherein the super-parameters comprise one or more of a learning rate, a learning rate decay period, a number of iteration periods, a batch size, a network structure parameter of a neural network model.
19. A computer readable storage medium storing program code for execution by a device, the program code comprising instructions for performing the method of any one of claims 1-9.
20. A chip comprising a processor and a data interface, the processor reading instructions stored on a memory via the data interface to perform the method of any of claims 1-9.
CN201910883124.1A 2019-09-18 2019-09-18 Neural network model training method and device Active CN112529146B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201910883124.1A CN112529146B (en) 2019-09-18 2019-09-18 Neural network model training method and device
PCT/CN2020/102594 WO2021051987A1 (en) 2019-09-18 2020-07-17 Method and apparatus for training neural network model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910883124.1A CN112529146B (en) 2019-09-18 2019-09-18 Neural network model training method and device

Publications (2)

Publication Number Publication Date
CN112529146A CN112529146A (en) 2021-03-19
CN112529146B true CN112529146B (en) 2023-10-17

Family

ID=74883014

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910883124.1A Active CN112529146B (en) 2019-09-18 2019-09-18 Neural network model training method and device

Country Status (2)

Country Link
CN (1) CN112529146B (en)
WO (1) WO2021051987A1 (en)

Families Citing this family (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113407820B (en) * 2021-05-29 2023-09-15 华为技术有限公司 Method for processing data by using model, related system and storage medium
CN113535899B (en) * 2021-07-07 2024-02-27 西安康奈网络科技有限公司 Automatic studying and judging method for emotion tendencies of internet information
CN113807183A (en) * 2021-08-17 2021-12-17 华为技术有限公司 Model training method and related equipment
CN114723998B (en) * 2022-05-05 2023-06-20 兰州理工大学 Small sample image classification method and device based on large-boundary Bayesian prototype learning
CN116503674B (en) * 2023-06-27 2023-10-20 中国科学技术大学 Small sample image classification method, device and medium based on semantic guidance
CN117892799A (en) * 2024-03-15 2024-04-16 中国科学技术大学 Financial intelligent analysis model training method and system with multi-level tasks as guidance

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107609598A (en) * 2017-09-27 2018-01-19 武汉斗鱼网络科技有限公司 Image authentication model training method, device and readable storage medium storing program for executing
CN109558942A (en) * 2018-11-20 2019-04-02 电子科技大学 A kind of neural network moving method based on either shallow study
CN109740657A (en) * 2018-12-27 2019-05-10 郑州云海信息技术有限公司 A kind of training method and equipment of the neural network model for image data classification
CN109947940A (en) * 2019-02-15 2019-06-28 平安科技(深圳)有限公司 File classification method, device, terminal and storage medium
CN110163234A (en) * 2018-10-10 2019-08-23 腾讯科技(深圳)有限公司 A kind of model training method, device and storage medium

Family Cites Families (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
NL2015087B1 (en) * 2015-06-05 2016-09-09 Univ Amsterdam Deep receptive field networks.
CN106874921B (en) * 2015-12-11 2020-12-04 清华大学 Image classification method and device
CN107480261B (en) * 2017-08-16 2020-06-16 上海荷福人工智能科技(集团)有限公司 Fine-grained face image fast retrieval method based on deep learning
US10990901B2 (en) * 2017-11-13 2021-04-27 Accenture Global Solutions Limited Training, validating, and monitoring artificial intelligence and machine learning models
CN108898162B (en) * 2018-06-08 2021-03-30 东软集团股份有限公司 Data annotation method, device and equipment and computer readable storage medium
CN108875045B (en) * 2018-06-28 2021-06-04 第四范式(北京)技术有限公司 Method of performing machine learning process for text classification and system thereof
CN110175655B (en) * 2019-06-03 2020-12-25 中国科学技术大学 Data identification method and device, storage medium and electronic equipment

Also Published As

Publication number Publication date
WO2021051987A1 (en) 2021-03-25
CN112529146A (en) 2021-03-19

Similar Documents

Publication Publication Date Title
CN112529146B (en) Neural network model training method and device
CN110378381B (en) Object detection method, device and computer storage medium
WO2022083536A1 (en) Neural network construction method and apparatus
CN112446270B (en) Training method of pedestrian re-recognition network, pedestrian re-recognition method and device
CN112446476A (en) Neural network model compression method, device, storage medium and chip
CN112308200B (en) Searching method and device for neural network
CN110222717B (en) Image processing method and device
CN111291809B (en) Processing device, method and storage medium
WO2022042713A1 (en) Deep learning training method and apparatus for use in computing device
CN109993707B (en) Image denoising method and device
US20220215227A1 (en) Neural Architecture Search Method, Image Processing Method And Apparatus, And Storage Medium
CN112639828A (en) Data processing method, method and equipment for training neural network model
WO2022001805A1 (en) Neural network distillation method and device
CN112215332B (en) Searching method, image processing method and device for neural network structure
CN110222718B (en) Image processing method and device
CN114255361A (en) Neural network model training method, image processing method and device
CN113705769A (en) Neural network training method and device
CN111914997B (en) Method for training neural network, image processing method and device
CN112668366B (en) Image recognition method, device, computer readable storage medium and chip
CN113570029A (en) Method for obtaining neural network model, image processing method and device
CN111797882A (en) Image classification method and device
CN113592060A (en) Neural network optimization method and device
WO2022267036A1 (en) Neural network model training method and apparatus and data processing method and apparatus
CN112561028A (en) Method for training neural network model, and method and device for data processing
CN115081588A (en) Neural network parameter quantification method and device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant