CN111401546B - Training method of neural network model, medium and electronic equipment thereof - Google Patents


Info

Publication number
CN111401546B
Authority
CN
China
Prior art keywords: network layer, initial, data, ith, weights
Legal status: Active
Application number
CN202010086380.0A
Other languages: Chinese (zh)
Other versions: CN111401546A
Inventors: 刘默翰, 周力, 白立勋, 石文元, 俞清华, 隋志成
Current Assignee: Huawei Technologies Co Ltd
Original Assignee: Huawei Technologies Co Ltd
Application filed by Huawei Technologies Co Ltd
Priority to CN202010086380.0A
Publication of CN111401546A
Application granted
Publication of CN111401546B


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks


Abstract

The application relates to the technical field of neural networks, and discloses a training method for a neural network model, together with a corresponding medium and electronic device. The training method of the neural network model comprises the following steps: a first network layer of the n network layers acquires sample data and inputs the sample data to a second network layer; for the i-th network layer of the n network layers, the following operations are performed: when i = 2, the output data of the i-th network layer are obtained based on the initial input data and a plurality of initial weights of the i-th network layer, and when 2 < i ≤ n, the output data of the i-th network layer are obtained based on the output data of the (i-1)-th network layer and a plurality of initial weights of the i-th network layer, where the plurality of initial weights of the i-th network layer are obtained based on m discrete values. By setting the initial weights of the neural network model to low-bit discrete values, the application effectively avoids the vanishing-gradient problem of the neural network model during low-bit weight training and accelerates convergence of the neural network model.

Description

Training method of neural network model, medium and electronic equipment thereof
Technical Field
The application relates to the technical field of neural networks, in particular to a training method of a neural network model, a medium and electronic equipment thereof.
Background
A neural network model is an operational model composed of a large number of interconnected nodes (or neurons). A common neural network model includes an input layer, an output layer, and a plurality of intermediate layers (also referred to as hidden layers). The inputs to each node of a layer are typically weighted, producing a weighted sum (or other weighted operation result) at the node, and the weights of each layer may be adjusted during training.
When a conventional neural network model is trained, the weights are randomly initialized for each training run. The weights of a conventional neural network model are generally floating-point numbers within a certain value range, and random initialization starts training from an arbitrary floating-point number within that range. Because of the large number of floating-point values and the many training passes involved, training such a neural network model takes a long time.
Disclosure of Invention
The embodiment of the application provides a training method of a neural network model, a medium and electronic equipment thereof.
In a first aspect, an embodiment of the present application provides a training method for a neural network model, where the neural network model includes n network layers, where n is a positive integer greater than 1; and the method comprises:
A first network layer of the n network layers acquires sample data and inputs the sample data to a second network layer, wherein the sample data comprises initial input data and expected result data;
for an ith network layer of the n network layers, performing the following operations:
when i = 2, the output data of the i-th network layer is obtained based on the initial input data and a plurality of initial weights $\hat{W}_b$ of the i-th network layer;
when 2 < i ≤ n, the output data of the i-th network layer is obtained based on the output data of the (i-1)-th network layer and a plurality of initial weights $\hat{W}_b$ of the i-th network layer, wherein
the plurality of initial weights $\hat{W}_b$ of the i-th network layer are obtained based on m discrete values, the value range of the initial weights satisfies $-1 \le \hat{W}_b \le 1$, and m ∈ {2, 3}, i.e., there may be two or three discrete values;
based on the error between the output data of the n-th network layer and the expected result data in the sample data, the plurality of initial weights of the i-th network layer are adjusted.
For example, the value range of the plurality of initial weights $\hat{W}_b$ of the i-th network layer may be set to {-1, 1} or {-1, 0, 1}. That is, in this embodiment, in order to limit the final weights to 1 and -1 and convert multiplication operations into bitwise exclusive-NOR operations, thereby reducing memory accesses and occupancy, the plurality of initial weights of the neural network model are set to the discrete values {-1, 1} or {-1, 0, 1}, which accelerates model convergence while avoiding vanishing model gradients.
In a possible implementation of the first aspect, the method further includes: each of the plurality of initial weights $\hat{W}_b$ of the i-th network layer is one of the m discrete values.
In a possible implementation of the first aspect, the method further includes: the m discrete values are -1 and 1, and the plurality of initial weights of the i-th network layer have a mean of 0 and a variance of 1.
In a possible implementation of the first aspect, the method further includes: the m discrete values are -1, 0 and 1, and the plurality of initial weights of the i-th network layer have a mean of 0 and a variance of 2/3.
In a possible implementation of the first aspect, the method further includes: the i-th network layer has p initial weights $\hat{W}_b$, and the p initial weights of the i-th network layer are calculated by the following formula:

$$\hat{W}_b = \alpha \cdot W_b$$

where $W_b$ is one of the m discrete values, the value range of $W_b$ satisfies $-1 \le W_b \le 1$, and the p values $W_b$ corresponding to the p initial weights have a mean of 0 and a variance of 1 or 2/3; α is a scaling factor, a positive number less than 1, used to adjust the distribution of the output data of the i-th network layer. If values were simply selected from the discrete values as initial weights, the distributions of the input data and the output data might not match; therefore, in order to keep the distributions of the input data and output data of a network layer substantially consistent, a scaling factor is set here. The scaling factor is obtained by a normalization method and scales the variance of the initial weights of the neural network model so that the neural network model can propagate to deeper layers.
In a possible implementation of the first aspect, the method further includes: the p values $W_b$ corresponding to the p initial weights have a variance of 1, and the m discrete values are -1 and 1.
In a possible implementation of the first aspect, the method further includes: the p values $W_b$ corresponding to the p initial weights have a variance of 2/3, and the m discrete values are -1, 0 and 1.
In a possible implementation of the first aspect, the method further includes: the scaling factor is obtained by the following formula:

$$\alpha = \frac{1}{\sqrt{l_i \cdot \frac{1}{p}\sum_{j=1}^{p}\left(W_j^b - \overline{W^b}\right)^2}}$$

where $W_j^b$ is, among the p values $W_b$, the discrete value corresponding to the j-th of the p initial weights $\hat{W}_b$, $\overline{W^b}$ is the average of the p values $W_b$ of the i-th network layer, and $l_i$ is the number of input channels of the i-th network layer.
In a possible implementation of the first aspect, the method further includes: the plurality of initial weights $\hat{W}_b$ of the i-th network layer are calculated by the following formula:

$$\hat{W}_b = \alpha \cdot W_t$$

where $W_t$ is any one of the plurality of weights determined in the previous training of the i-th network layer, the value range of $W_t$ satisfies $-1 \le W_t \le 1$, and α is a scaling factor, a positive number less than 1, used to adjust the distribution of the output data of the i-th network layer. If a value were simply selected from the discrete values as an initial weight, the distributions of the input data and the output data might not match; therefore, in order to keep the distributions of the input data and output data of a network layer substantially consistent, a scaling factor is set here. The scaling factor is obtained by a normalization method and scales the variance of the initial weights of the neural network model so that the neural network model can propagate to deeper layers.
In a possible implementation of the first aspect, the method further includes: the scaling factor is obtained by the following formula:

$$\alpha = \sqrt{\frac{1}{p}\sum_{j=1}^{p}\left(W_j^t - \overline{W^t}\right)^2}$$

where p is the number of weights determined in the previous training of the i-th network layer, $W_j^t$ denotes the j-th of the p previously determined weights, and $\overline{W^t}$ is the average of the p previously determined weights.
In a possible implementation of the first aspect, the method further includes: the scaling factor is obtained by the following formula:

$$\alpha = \sqrt{\frac{2}{l_i + l_{i+1}}}$$

where $l_i$ is the number of input channels of the i-th network layer and $l_{i+1}$ is the number of input channels of the (i+1)-th network layer.
In a second aspect, an embodiment of the present application provides a training method for a neural network model, where the neural network model includes n network layers and has already converged, n being a positive integer greater than 1.
The method is applied to the converged neural network model and performs low-bit quantization on its trained full-precision weights, so that multiplication operations can be converted into bitwise exclusive-NOR operations, reducing memory accesses and occupancy. Specifically, the method comprises the following steps:
a first network layer of the n network layers acquires sample data and inputs the sample data to a second network layer, wherein the sample data comprises initial input data and expected result data;
for an ith network layer of the n network layers, performing the following operations:
when i = 2, the sign of each of the plurality of full-precision weights of the i-th network layer is taken to obtain the plurality of initial weights $\hat{W}$ of the i-th network layer, and the output data of the i-th network layer is obtained based on the initial input data and the plurality of initial weights $\hat{W}$;
when 2 < i ≤ n, the sign of each of the plurality of full-precision weights of the i-th network layer is taken to obtain the plurality of initial weights $\hat{W}$ of the i-th network layer, and the output data of the i-th network layer is obtained based on the output data of the (i-1)-th network layer and the plurality of initial weights $\hat{W}$, wherein
the plurality of initial weights $\hat{W}$ of the i-th network layer are derived based on m discrete values, the value range of the initial weights satisfies $-1 \le \hat{W} \le 1$, and m ∈ {2, 3}, i.e., there may be two or three discrete values;
based on the error between the output data of the n-th network layer and the expected result data in the sample data, the plurality of initial weights of the i-th network layer are adjusted.
In a possible implementation of the second aspect, the method further includes: the m discrete values are -1 and 1, and the plurality of initial weights of the i-th network layer have a mean of 0 and a variance of 1; and
taking the sign of each of the plurality of full-precision weights of the i-th network layer to obtain the plurality of initial weights of the i-th network layer includes:
if a full-precision weight is less than or equal to 0, taking -1 as the initial weight corresponding to that full-precision weight;
if a full-precision weight is greater than 0, taking 1 as the initial weight corresponding to that full-precision weight.
Each trained full-precision weight in the converged neural network model is thus converted, by taking its sign, into either 1 or -1.
In a possible implementation of the second aspect, the method further includes: the m discrete values are -1, 0 and 1, and the plurality of initial weights of the i-th network layer have a mean of 0 and a variance of 2/3; and
taking the sign of each of the plurality of full-precision weights of the i-th network layer to obtain the plurality of initial weights of the i-th network layer includes:
if a full-precision weight is less than 0, taking -1 as the initial weight corresponding to that full-precision weight;
if a full-precision weight is equal to 0, taking 0 as the initial weight corresponding to that full-precision weight;
if a full-precision weight is greater than 0, taking 1 as the initial weight corresponding to that full-precision weight.
Each trained full-precision weight in the converged neural network model is thus converted, by taking its sign, into one of 1, 0 and -1.
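A minimal sketch of this sign-taking rule (NumPy-based; the helper name is the author's, not from the patent):

    import numpy as np

    def sign_quantize(w_full, ternary=False):
        # Binary:  w <= 0 -> -1, w > 0 -> 1
        # Ternary: w <  0 -> -1, w == 0 -> 0, w > 0 -> 1
        if ternary:
            return np.sign(w_full)
        return np.where(w_full > 0, 1.0, -1.0)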
In a possible implementation of the second aspect, the method further includes: the m discrete values are -1 and 1, and
taking the sign of each of the plurality of full-precision weights of the i-th network layer to obtain the plurality of initial weights of the i-th network layer includes:
if a full-precision weight is less than or equal to 0, taking the product of -1 and the scaling factor as the initial weight corresponding to that full-precision weight;
if a full-precision weight is greater than 0, taking the product of 1 and the scaling factor as the initial weight corresponding to that full-precision weight;
the scaling factor being a positive number less than 1, used to adjust the distribution of the output data of the i-th network layer.
Each trained full-precision weight in the converged neural network model is thus converted, by taking its sign, into the product of the scaling factor and either 1 or -1. If a value were simply selected from 1 and -1 as the initial weight after sign-taking, the distributions of the input data and output data of a network layer might not match; therefore, in order to keep these distributions substantially consistent, a scaling factor is set here. The scaling factor is obtained by a normalization method and scales the variance of the initial weights of the neural network model so that the neural network model can propagate to deeper layers.
In a possible implementation of the second aspect, the method further includes: the m discrete values are -1, 0 and 1, and taking the sign of each of the plurality of full-precision weights of the i-th network layer to obtain the plurality of initial weights of the i-th network layer includes:
if a full-precision weight is less than 0, taking the product of -1 and the scaling factor as the initial weight corresponding to that full-precision weight;
if a full-precision weight is equal to 0, taking 0 as the initial weight corresponding to that full-precision weight;
if a full-precision weight is greater than 0, taking the product of 1 and the scaling factor as the initial weight corresponding to that full-precision weight;
the scaling factor being a positive number less than 1, used to adjust the distribution of the output data of the i-th network layer.
Each trained full-precision weight in the converged neural network model is thus converted, by taking its sign, into 0 or the product of the scaling factor and either 1 or -1. If a value were simply selected from the discrete values as the initial weight after sign-taking, the distributions of the input data and output data of a network layer might not match; therefore, in order to keep these distributions substantially consistent, a scaling factor is set here. The scaling factor is obtained by a normalization method and scales the variance of the initial weights of the neural network model so that the neural network model can propagate to deeper layers.
In a possible implementation of the second aspect, the method further includes: the scaling factor is obtained by the following formula:

$$\alpha = \sqrt{\frac{2}{l_i + l_{i+1}}}$$

where α is the scaling factor, $l_i$ is the number of input channels of the i-th network layer, and $l_{i+1}$ is the number of input channels of the (i+1)-th network layer.
In a possible implementation of the second aspect, the method further includes: the scaling factor is obtained by the following formula:

$$\alpha = \frac{1}{\sqrt{l_i \cdot \frac{1}{p}\sum_{j=1}^{p}\left(W_j^z - \overline{W^z}\right)^2}}$$

where α is the scaling factor; p denotes the number of the plurality of initial weights of the i-th network layer; $W_j^z$ is one of -1 and 1 and corresponds to the j-th of the p initial weights, the p values $W_j^z$ corresponding to the p initial weights having a mean of 0 and a variance of 1; $\overline{W^z}$ is the average of the p values $W_j^z$; and $l_i$ is the number of input channels of the i-th network layer.
In a possible implementation of the second aspect, the method further includes: the scaling factor is obtained by the following formula:

$$\alpha = \frac{1}{\sqrt{l_i \cdot \frac{1}{p}\sum_{j=1}^{p}\left(W_j^q - \overline{W^q}\right)^2}}$$

where α is the scaling factor; p denotes the number of the plurality of initial weights of the i-th network layer; $W_j^q$ is one of -1, 0 and 1 and corresponds to the j-th of the p initial weights, the p values $W_j^q$ corresponding to the p initial weights having a mean of 0 and a variance of 2/3; $\overline{W^q}$ is the average of the p values $W_j^q$; and $l_i$ is the number of input channels of the i-th network layer.
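Putting the second aspect together, a hedged sketch of quantizing one trained layer (sign-taking followed by the Xavier-style scaling factor reconstructed above; the names and the exact form of α are assumptions, not taken from the patent):

    import numpy as np

    def quantize_trained_layer(w_full, l_i, l_next, ternary=False):
        # l_i: input channels of this layer; l_next: input channels of the next layer.
        w_sign = np.sign(w_full) if ternary else np.where(w_full > 0, 1.0, -1.0)
        alpha = np.sqrt(2.0 / (l_i + l_next))    # assumed form of the scaling factor
        return alpha * w_sign                    # low-bit initial weights for retraining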
In a possible implementation of the second aspect, the method further includes: the sample data includes image data and the neural network model is used for image recognition.
In a third aspect, an embodiment of the present application provides an electronic device for training a neural network model, including:
a first data acquisition module for acquiring sample data and inputting the sample data to a second network layer, wherein the sample data comprises initial input data and expected result data;
the first data processing module is used for executing the following operations:
when i = 2, obtaining the output data of the i-th network layer based on the initial input data and a plurality of initial weights $\hat{W}_b$ of the i-th network layer;
when 2 < i ≤ n, obtaining the output data of the i-th network layer based on the output data of the (i-1)-th network layer and a plurality of initial weights $\hat{W}_b$ of the i-th network layer, wherein
the plurality of initial weights $\hat{W}_b$ of the i-th network layer are obtained based on m discrete values, the value range of the initial weights satisfies $-1 \le \hat{W}_b \le 1$, and m ∈ {2, 3};
and a first weight adjustment module for adjusting the plurality of initial weights of the i-th network layer based on the error between the output data of the n-th network layer and the expected result data in the sample data.
In a fourth aspect, an embodiment of the present application provides an electronic device for training a neural network model, including:
a second data acquisition module for acquiring sample data and inputting the sample data to a second network layer, wherein the sample data comprises initial input data and expected result data;
a second data processing module for performing the following operations
When i=2, performing symbol value on the full-precision weights of the ith network layer to obtain initial weights of the ith network layerAnd based on the initial input data and the plurality of initial weights +.>The output data of the ith network layer is obtained,
when 2<When i is less than or equal to n, performing symbol value taking on the full-precision weights of the ith network layer to obtain initial weights of the ith network layerAnd based on the output data of the i-1 th network layer and the plurality of initial weights +.>Obtaining output data of the ith network layer, wherein,
The plurality of initial weights of the ith network layerIs derived based on m discrete values and the plurality of initial weights +.>The numerical range of (2) is +.>And m= {2,3};
a second weight adjustment module for adjusting the output data of the n network layers based on the output data of the n network layersError between expected result data in sample data, the plurality of initial weights for the ith network layerAnd adjusting.
In a fifth aspect, an embodiment of the present application provides a computer-readable medium having instructions stored thereon which, when executed on a computer, cause the computer to perform the training method of a neural network model according to any one of the first and second aspects.
In a sixth aspect, an embodiment of the present application provides an electronic device, including:
a memory for storing instructions for execution by one or more processors of the system, and
a processor, which is one of the processors of the system, for executing the training method of the neural network model according to any one of the first aspect and the second aspect.
Drawings
FIG. 1 illustrates a block diagram of an electronic device, according to some embodiments of the application;
FIG. 2 is a schematic diagram of a neural network model;
FIG. 3 illustrates a schematic diagram of a computational process for a node of a neural network model, according to some embodiments of the application;
FIG. 4 is a graph showing the output distribution of several layers of activation functions near the output layer in a randomly initialized neural network model using a Gaussian distribution with a mean of 0 and a variance of 1;
FIG. 5 (a) shows the weight distribution of a convolutional neural network model at 1-bit weight initialization, according to some embodiments of the present application;
FIG. 5 (b) illustrates the weight distribution of a convolutional neural network model, initialized with 1-bit weights, at some moment during model convergence, in accordance with some embodiments of the present application;
FIG. 5 (c) illustrates the weight distribution of a convolutional neural network model, initialized with 1-bit weights and then trained, after model convergence, in accordance with some embodiments of the present application;
FIG. 6 (a) shows a graph of training a 1-bit fixed-point quantization model using an Xavier initialization function;
FIG. 6 (b) is a graph illustrating training of a 1-bit fixed-point quantization model using the weight initialization method illustrated in FIG. 2, according to some embodiments of the present application;
FIG. 7 illustrates a schematic diagram of an electronic device for training a neural network model, according to some embodiments of the application;
FIG. 8 illustrates a schematic diagram of another electronic device for training a neural network model, in accordance with some embodiments of the application;
fig. 9 illustrates a schematic diagram of an electronic device, according to some embodiments of the application.
Detailed Description
Illustrative embodiments of the application include, but are not limited to, a method, apparatus, medium, and electronic device for initializing weights for a neural network.
It is to be appreciated that as used herein, the term module may refer to or include an Application Specific Integrated Circuit (ASIC), an electronic circuit, a processor (shared, dedicated, or group) and/or memory that execute one or more software or firmware programs, a combinational logic circuit, and/or other suitable hardware components that provide the described functionality.
It is to be appreciated that in various embodiments of the application, the processor may be a microprocessor, a digital signal processor, a microcontroller, or the like, and/or any combination thereof. According to another aspect, the processor may be a single core processor, a multi-core processor, or the like, and/or any combination thereof.
Embodiments of the present application will be described in further detail below with reference to the accompanying drawings.
It is understood that the neural network model provided by the present application may be any artificial neural network model that uses multiply-accumulate operations, such as a convolutional neural network (Convolutional Neural Network, CNN), a deep neural network (Deep Neural Networks, DNN), a recurrent neural network (Recurrent Neural Networks, RNN), a binary neural network (Binary Neural Network, BNN), and the like.
It is to be appreciated that the methods of weight initialization of neural networks provided by the present application can be implemented on a variety of electronic devices, including, but not limited to, servers, distributed server clusters of multiple servers, cell phones, tablet computers, laptop computers, desktop computers, wearable devices, head mounted displays, mobile email devices, portable gaming devices, portable music players, reader devices, personal digital assistants, virtual or augmented reality devices, televisions with one or more processors embedded or coupled therein, and the like.
In particular, the neural network weight initialization provided by the application is suitable for edge devices. Edge computing is a distributed open platform (architecture) that fuses network, computing, storage and core application capabilities at the network edge, close to the objects or data sources, and provides edge-intelligent services nearby, thereby meeting key requirements in real-time business, data optimization, application intelligence, security, and privacy protection. For example, the edge device may be a device in a video surveillance system that performs edge computation on video data near the video data source (a network smart camera).
The following describes the weight initialization scheme of the neural network disclosed in the present application, taking the electronic device 100 as an example.
Fig. 1 illustrates a block diagram of an electronic device 100, according to some embodiments of the application. Specifically, as shown in FIG. 1, the electronic device 100 includes one or more processors 104, system control logic 108 coupled to at least one of the processors 104, system memory 112 coupled to the system control logic 108, non-volatile memory (NVM) 116 coupled to the system control logic 108, and a network interface 120 coupled to the system control logic 108.
In some embodiments, the processor 104 may include one or more single-core or multi-core processors. In some embodiments, the processor 104 may include any combination of general-purpose and special-purpose processors (e.g., graphics processor, application processor, baseband processor, etc.). In embodiments where the electronic device 100 employs an enhanced Node B (eNB) or radio access network (Radio Access Network, RAN) controller, the processor 104 may be configured to perform the methods of the corresponding embodiments.
In some embodiments, the processor 104 may be configured to invoke training information to train out a neural network model. Specifically, for example, the processor 104 may obtain initialization information and input data information (e.g., image information, voice information, etc.) for the neural network model weights, and train the neural network model. The neural network model may be quantized into a binary network or a ternary network, and the weight of the neural network model may be set to a preset discrete value. In each layer of training of the neural network model, the processor 104 continuously adjusts the weights according to the acquired training information until the model converges. The processor 104 may also periodically update the neural network model described above to facilitate better adaptation to changes in various actual demands of the neural network model.
In some embodiments, the system control logic 108 may include any suitable interface controller to provide any suitable interface to at least one of the processors 104 and/or any suitable device or component in communication with the system control logic 108.
In some embodiments, system control logic 108 may include one or more memory controllers to provide an interface to system memory 112. The system memory 112 may be used to load and store data and/or instructions. The memory 112 of the electronic device 100 may include any suitable volatile memory in some embodiments, such as a suitable Dynamic Random Access Memory (DRAM). In some embodiments, system memory 112 may be used to load or store instructions that implement the neural network model described above, or system memory 112 may be used to load or store instructions that implement an application that utilizes the neural network model described above.
NVM/memory 116 may include one or more tangible, non-transitory computer-readable media for storing data and/or instructions. In some embodiments, NVM/memory 116 may include any suitable nonvolatile memory, such as flash memory, and/or any suitable nonvolatile storage device, such as at least one of a Hard Disk Drive (HDD), compact Disc (CD) Drive, digital versatile Disc (Digital Versatile Disc, DVD) Drive. NVM/memory 116 may also be used to store trained weights for the neural network model described above.
NVM/storage 116 may include part of the storage resources of the apparatus on which the electronic device 100 is installed, or it may be accessible by the device without necessarily being part of it. For example, NVM/storage 116 may be accessed over a network via the network interface 120.
In particular, system memory 112 and NVM/storage 116 may each include: a temporary copy and a permanent copy of instructions 124. The instructions 124 may include: instructions that, when executed by at least one of the processors 104, cause the electronic device 100 to implement the method as shown in fig. 3. In some embodiments, instructions 124, hardware, firmware, and/or software components thereof may additionally/alternatively be disposed in system control logic 108, network interface 120, and/or processor 104.
The network interface 120 may include a transceiver to provide a radio interface for the electronic device 100 to communicate with any other suitable device (e.g., front end module, antenna, etc.) over one or more networks. In some embodiments, the network interface 120 may be integrated with other components of the electronic device 100. For example, the network interface 120 may be integrated with at least one of the processor 104, the system memory 112, the NVM/storage 116, and a firmware device (not shown) having instructions that, when executed by at least one of the processors 104, cause the electronic device 100 to implement the method as shown in fig. 3.
The network interface 120 may further include any suitable hardware and/or firmware to provide a multiple-input multiple-output radio interface. For example, network interface 120 may be a network adapter, a wireless network adapter, a telephone modem, and/or a wireless modem.
In some embodiments, at least one of the processors 104 may be packaged together with logic for one or more controllers of the system control logic 108 to form a System In Package (SiP). In some embodiments, at least one of the processors 104 may be integrated on the same die with logic for one or more controllers of the system control logic 108 to form a system on a chip (SoC).
The electronic device 100 may further include: input/output (I/O) devices 132. The I/O device 132 may include a user interface to enable a user to interact with the electronic device 100; the design of the peripheral component interface enables the peripheral component to also interact with the electronic device 100. In some embodiments, the electronic device 100 further comprises a sensor for determining at least one of environmental conditions and location information related to the electronic device 100.
It will be appreciated that exemplary applications of the neural network model to which the method of weight initialization provided by the embodiments of the present application is applicable include, but are not limited to, image recognition in the machine vision field, speech recognition, and the like.
In the following, a technical solution for training the neural network model 200 shown in fig. 2 by using the electronic device 100 shown in fig. 1 will be described in detail, taking image recognition as an example (for example, for performing face recognition, and recognizing facial features such as mouth shapes, eyebrow features, eye features, etc. in a face image).
Specifically, as shown in fig. 2, the neural network model 200 includes n network layers, i.e., an input layer, a plurality of hidden layers, and an output layer. The first layer is called an input layer, the last layer is called an output layer, and the other layers are called hidden layers. Each layer has several nodes (e.g., the input layer in fig. 2 has s nodes), each node having a corresponding weight. The layers of the n network layers are in full cross connection, and the output of the upper layer is the input of the next adjacent layer. The calculation formula of each node in the network layer is as follows:
y=f(Wx+b)
wherein W is the weight, b is the bias, x is the input, y is the output, and f is the activation function.
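For illustration, the per-node computation can be sketched as follows (NumPy; the names are the author's, not from the patent):

    import numpy as np

    def node_output(W, x, b, f=np.tanh):
        # y = f(Wx + b): weighted sum of the inputs plus bias, passed
        # through the activation function f (Tanh here as an example).
        return f(W @ x + b)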
The specific process of training the neural network model 200 shown in fig. 2 using sample images in performing image recognition, e.g., face recognition, is described in detail below.
In training the neural network model 200, a large amount of sample image data and expected result data may be input into the model 200. The image data of each sample image are input into the s nodes of the input layer of the neural network model 200 shown in fig. 2, pass through the calculations of the hidden layers, and finally yield face recognition result data after the calculation of the output layer. It should be noted that when the model is trained with a large number of sample images, each complete training pass corresponds to only one image; for example, with 1000 images in total, the first image is trained first, then the second image after the first is finished, and so on until the neural network model 200 converges. After each image is trained, the face recognition result data finally output by the neural network model 200 are compared with the expected result data, an error is calculated, partial derivatives are computed from the error, and the weights of the nodes in the network layers other than the input layer are adjusted based on the computed partial derivatives. In this way the neural network model 200 is trained image by image, the weights being continually adjusted; when the error between the face recognition result data finally output by the neural network model 200 and the expected result data is smaller than the error threshold, the neural network model 200 is determined to have converged.
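A minimal sketch of this per-image training loop, assuming a generic model object with forward and backward passes (none of these names come from the patent):

    def train_until_converged(model, samples, error_threshold, lr=0.01):
        # samples: list of (image_data, expected_result_data) pairs.
        while True:
            worst_error = 0.0
            for x, target in samples:
                output = model.forward(x)       # input layer -> hidden layers -> output layer
                error = output - target         # compare with the expected result data
                model.backward(error, lr)       # partial derivatives adjust non-input layers
                worst_error = max(worst_error, float(abs(error).max()))
            if worst_error < error_threshold:   # convergence criterion described above
                return model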
Specifically, FIG. 3 illustrates the calculation of various network layers in the neural network model 200, in some embodiments. As shown in fig. 3, the calculation process of each network layer in the neural network model 200 includes:
1. calculation of input layer
The image data of sample image A are input as input data to the input layer.
2. Calculation of hidden layer
a) The input layer outputs the image data of sample image A to the first hidden layer. For example, the input image data may be the color information (e.g., numbers between 0 and 255 in RGB color space) of each pixel in the image; the image data are input into the s nodes (i.e., inputs x1, x2 to xs) of the input layer (first layer) of the model shown in fig. 2.
b) Initializing the weight of each hidden layer to obtain an initial weight.
In the embodiments of the present application, a low-bit weight initialization model is used to obtain the initial weights of each hidden layer, for example a 1-bit or 2-bit weight initialization model. The weight value range in 1-bit weight initialization is {1, -1}, and the weight value range in 2-bit weight initialization is {1, 0, -1}. That is, when weights are initialized with the 1-bit weight initialization model, the initial weight of a node in a hidden layer is set to one of the two values 1 and -1, and all weights of the hidden layer must follow a distribution with mean 0 and variance 1. When weights are initialized with the 2-bit weight initialization model, the initial weight of a node in a hidden layer is set to one of the three values -1, 0 and 1, and all weights of the hidden layer must follow a distribution with mean 0 and variance 2/3.
Specifically, for example, in some embodiments, the following 1-bit weight initialization method may be employed to initialize the floating point weights of the untrained neural network model:
The initial weights of the neural network model 200 are quantized to 1 bit, each quantized initial weight taking the discrete value 1 or -1. Since a distribution with mean 0 and variance 1 must be satisfied, the numbers of weights taking 1 and taking -1 among all weights of the same network layer are substantially the same (for example, uniform probabilities are generated with a uniform distribution function, and 1 and -1 are sampled uniformly).
Furthermore, in other embodiments, to ensure that the distributions of the inputs and outputs of each layer of the neural network model remain substantially consistent and to alleviate the problem of gradients vanishing during propagation to deeper layers, the discrete value $W_b$ selected from 1 and -1 for a node during quantization can be compressed based on a preset scaling factor to obtain the initial weight $\hat{W}_b$ of the neural network model. The preset scaling factor is obtained by a normalization method and scales the variance of the weights of the neural network model so that the network can propagate to deeper layers.
For example, in some embodiments, the already-selected discrete value $W_b$ may be compressed as follows:

$$\hat{W}_b = \alpha \cdot W_b$$

where α is a scaling factor, typically a positive fraction less than 1, used to adjust the distribution of the output data of the i-th network layer.
In some embodiments, the scaling factor α may be calculated as follows:

$$\alpha = \sqrt{\frac{2}{l_i + l_{i+1}}}$$

where $l_i$ denotes the number of input channels of the i-th network layer of the neural network model and $l_{i+1}$ denotes the number of input channels of the (i+1)-th network layer. It will be appreciated that for the input layer, i here is 1; i+1 denotes the network layer immediately following the i-th network layer in the neural network model.
In some embodiments, assume that the i-th network layer has p initial weights $\hat{W}_b$ and that p selected discrete values $W_b$ correspond to the p initial weights. The scaling factor α may also be calculated according to the following formula:

$$\alpha = \frac{1}{\sqrt{l_i \cdot \frac{1}{p}\sum_{j=1}^{p}\left(W_j^b - \overline{W^b}\right)^2}}$$

where $W_j^b$ denotes, among the p values $W_b$, the discrete value corresponding to the j-th of the p initial weights, $\overline{W^b}$ is the average of the p values $W_b$ of the i-th network layer, and $l_i$ denotes the number of input channels of the i-th network layer.
In this way, the scaling factor α of the weights of each network layer in the neural network model 200 is calculated and multiplied by the weights of the corresponding network layer. During training of the neural network model, the input data of each layer are weighted with the compressed weights and propagated forward, which keeps the distributions of the input and output of each layer of the neural network model substantially consistent and alleviates the problem of gradients vanishing during propagation to deeper layers.
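A small numerical check of this effect (a sketch under the assumed 1/√(l·Var) form of α above; not taken from the patent):

    import numpy as np

    rng = np.random.default_rng(0)
    x = rng.standard_normal(256)                          # unit-variance input
    for depth in range(20):
        fan_in = x.size
        W = rng.choice([-1.0, 1.0], size=(256, fan_in))   # 1-bit discrete weights
        alpha = 1.0 / np.sqrt(fan_in * W.var())           # per-layer scaling factor
        z = (alpha * W) @ x                               # pre-activation
        x = np.tanh(z)
        # z.std() stays on the order of the input spread at every depth,
        # instead of growing by sqrt(fan_in) or collapsing toward 0.
        print(depth, round(float(z.std()), 3))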
It will be appreciated that the above method of calculating the scaling factor α is merely exemplary and not limiting, and that in other embodiments, other normalization methods may be employed to calculate the scaling factor α.
In some embodiments, the weights of the neural network model may be initialized using a 2-bit weight initialization method as follows:
first, in some embodiments, the weights of the floating point type of the untrained neural network model may be quantized by 2 bits, and the quantized initial weights take one of discrete-1, 0 or 1, and since the average value is 0 and the variance is 2/3, the number of weights taking 1 and-1 in all weights in the same network layer is substantially the same (e.g., uniform probability is generated by uniform distribution function unimorph, and 1 and-1 are uniformly sampled).
In particular, in other embodiments, to ensure that the distributions of the inputs and outputs of each layer of the neural network model remain substantially consistent and to alleviate the problem of gradients vanishing during propagation to deeper layers, the discrete value $W_b$ selected from -1, 0 and 1 for a node during quantization can be compressed based on a preset scaling factor to obtain the initial weight $\hat{W}_b$ of the neural network model. The preset scaling factor is obtained by a normalization method and scales the variance of the weights of the neural network model so that the network can propagate to deeper layers. $W_b$ can be compressed to obtain the initial weight $\hat{W}_b$ using a method similar to the 1-bit weight initialization method; the specific compression method is described above and is not repeated here.
Because a full-precision neural network model occupies a large amount of storage space and floating-point multiply-accumulate operations consume extensive computing resources, edge devices in particular, whose computing and storage resources are limited, generally cannot bear a large number of floating-point multiply-accumulate operations or the storage of a large number of floating-point numbers. 8-bit quantization is currently a common solution, but it supports at most 4x compression, and although integer operations replace floating-point operations, the resource consumption remains large. Therefore, the 1-bit or 2-bit weight initialization method is used here to perform low-bit quantization (1 or 2 bits) on the weights of the full-precision neural network model, which far surpasses an 8-bit quantization model in storage space and computational efficiency, is well suited to running on edge devices, and can reduce device power consumption. Compared with existing initialization methods, the 1-bit or 2-bit weight initialization model also makes the neural network model converge more easily. As described above, when training a BNN, the final weights are expected to be limited to 1 and -1, and multiplication operations are converted into bitwise exclusive-NOR operations to reduce memory accesses and occupancy; by setting the initial weights directly to {-1, 1} or {-1, 0, 1}, the scheme of the application avoids vanishing model gradients and accelerates convergence.
It will be appreciated that in some embodiments, in addition to 1-bit or 2-bit quantization of the initial weights of the neural network model, the inputs to the neural network model (e.g., the image data of sample image A) may also be quantized, so that the matrix multiplication between the initial weights and the inputs can be equivalently replaced by XNOR operations, which further accelerates convergence.
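A sketch of this equivalence for ±1 vectors (the bit packing and helper names are the author's; here 1 encodes +1 and 0 encodes -1):

    import numpy as np

    def xnor_dot(a_bits, b_bits, n):
        # For +/-1 values, a_j * b_j = +1 exactly when the two bits agree, so
        # dot(a, b) = agreements - disagreements = 2 * popcount(XNOR(a, b)) - n.
        agree = bin(~(a_bits ^ b_bits) & ((1 << n) - 1)).count("1")
        return 2 * agree - n

    rng = np.random.default_rng(1)
    a = rng.choice([-1, 1], size=16)
    b = rng.choice([-1, 1], size=16)
    pack = lambda v: sum(1 << i for i, s in enumerate(v) if s == 1)
    assert xnor_dot(pack(a), pack(b), 16) == int(a @ b)   # matches multiply-accumulate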
It will be appreciated that the value range {-1, 1} of the 1-bit quantized discrete value $W_b$ is merely exemplary and not limiting. In some embodiments, the 1-bit quantized discrete value $W_b$ may take other integer discrete values, for example {-2, 2} or {1, 100}. In order to satisfy the condition that the mean is 0 and the variance is 1, both {-2, 2} and {1, 100} are converted into {-1, 1} when each network layer in the neural network model is calculated. After the calculation of each network layer in the neural network model is completed, the weights of each network layer in the model are restored according to a preset ratio, for example, 0.8 is restored to 90 and 0.5 is restored to 2.
It will be appreciated that the value range {-1, 0, 1} of the 2-bit quantized discrete value $W_b$ is likewise merely exemplary and not limiting. In some embodiments, the 2-bit quantized discrete value $W_b$ may take other integer discrete values, such as {-2, 0, 2} or {0, 50, 100}. In order to satisfy the condition that the mean is 0 and the variance is 2/3, both {-2, 0, 2} and {0, 50, 100} are converted into {-1, 0, 1} when each network layer in the neural network model is calculated. After the calculation of each network layer in the neural network model is completed, the weights of each network layer in the model are restored according to a preset ratio, for example, 1 is restored to 100.
c) The input data of each hidden layer and the corresponding initial weights $\hat{W}_b$ undergo a weighting operation. For example, the image data of sample image A are segmented into s data blocks, which are input as s input data to the s nodes of the input layer in fig. 2 and weighted with the weights of the first hidden layer (for example, for the first node of the first hidden layer the operation is: x1·w11 + x2·w12 + x3·w13 + … + xs·w1s + b).
d) An activation operation is performed on the result of the weighting operation using an activation function.
For example, in some embodiments, the activation function may be a Sigmoid function, a Tanh function, or the like. Specifically, the feature data of sample image A output by each node of the input layer and the initial weights $\hat{W}_b$ of each node of the first hidden layer (h nodes in fig. 2) undergo a weighting operation, and the output of each node of the first hidden layer (i.e., the input of the second hidden layer) is generated by the activation function (e.g., Sigmoid, Tanh) of the first hidden layer. The input of the second hidden layer and the initial weights of each node of the second hidden layer (u nodes in fig. 2) undergo a weighting operation, after which the output of each node of the second hidden layer (i.e., the input of the third hidden layer) is generated by the activation function of the second hidden layer. The n-2 hidden layers (as in fig. 2) are computed successively in the same manner: the inputs of the v nodes of the (n-2)-th hidden layer (the outputs of the (n-3)-th hidden layer) and the corresponding initial weights undergo a weighting operation, after which the output of each node of the (n-2)-th hidden layer (i.e., the input of the output layer) is generated by the activation function of each node of the (n-2)-th hidden layer.
It will be appreciated that in some embodiments, after the initial weights and the inputs of the neural network model have been quantized according to the 1-bit or 2-bit quantization method above, a corresponding sign function or an equivalent function can simply be used as the activation function, i.e., the inputs are converted into {-1, 1} or {-1, 0, 1}, which guarantees that XNOR and popcount bit operations can be used, saving computing and storage resources.
3. Calculation of output layer
The calculation of the output layer is similar to that of the hidden layers described above. Specifically, the input of each node of the output layer (i.e., the output of each node of the (n-2)-th hidden layer) and the initial weights $\hat{W}_b$ of each node of the output layer undergo a weighting operation, after which the final output of one training pass of the neural network model with sample image A is generated by the activation functions (e.g., ReLU, Tanh, Sigmoid) of the nodes of the output layer. The initial weights of the output layer may likewise be obtained with the low-bit weight initialization scheme above (the 1-bit or 2-bit weight initialization model); see the detailed description above, which is not repeated here.
4. Adjusting weights
Each time the output layer outputs a face recognition result value, the face recognition result data are compared with the expected result data corresponding to the image data of the sample image input that time to obtain an error. Partial derivatives are computed from this error, and the weights of the nodes in the network layers other than the input layer are adjusted based on the computed partial derivatives. The weights are continually adjusted until the final error reaches the error threshold.
That is, the output of the model (i.e., the result of training the model with sample image A) is compared with the actual image features of sample image A to obtain an error (the difference between the two); partial derivatives of the error are computed and the weights are updated accordingly. Other sample image data may subsequently be input for model training, so that over the training of a large number of sample images, with the weights continually adjusted, the neural network model 200 is considered converged when the output error becomes small (e.g., meets a predetermined error threshold), and model training is complete.
After each time of training the input sample image data, the weights of the network layers of the neural network model 200 are adjusted, and the adjusted weights can be directly used as initial weights for training the next time of input sample image data, or can be scaled by a scaling factor and then used as initial weights for training the next time of input sample image data.
For example, in some embodiments, after training the neural network model via sample image a, the weights of the various network layers determined after training the neural network model via sample image a may be used as initial weights for the neural network model at the next training (e.g., training the neural network model trained via sample image a with sample image B).
In yet other embodiments, to keep the distributions of the inputs and outputs of each layer of the neural network model substantially consistent and mitigate vanishing gradients during propagation to deeper layers, after the neural network model has been trained via sample image A, the product of the weights of each network layer determined by that training and a scaling factor may be used as the initial weights of the neural network model for the next training (e.g., training the model with sample image B).
For example, in some embodiments, the plurality of initial weights $\hat{W}_b$ of the i-th of the n network layers of the neural network model can be calculated by the following formula:

$$\hat{W}_b = \alpha \cdot W_t$$

where $W_t$ is any one of the plurality of weights of the i-th network layer determined after training on sample image A, the value range of $W_t$ satisfies $-1 \le W_t \le 1$, and α is a scaling factor, a positive number less than 1, used to adjust the distribution of the output data of the i-th network layer. In some embodiments, the scaling factor α may be calculated as follows:

$$\alpha = \sqrt{\frac{2}{l_i + l_{i+1}}}$$

where $l_i$ denotes the number of input channels of the i-th of the n network layers of the neural network model and $l_{i+1}$ denotes the number of input channels of the (i+1)-th network layer. It will be appreciated that for the input layer, i here is 1; i+1 denotes the network layer immediately following the i-th network layer in the neural network model.
In other embodiments, the scaling factor α may also be calculated according to the following formula:

$$\alpha = \sqrt{\frac{1}{p}\sum_{j=1}^{p}\left(W_j^t - \overline{W^t}\right)^2}$$

where p is the number of weights of the i-th network layer determined after training on sample image A, $W_j^t$ denotes the j-th of those p weights, and $\overline{W^t}$ is the average of the p weights determined after training on sample image A.
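An illustrative sketch of this re-initialization between training passes (the std-based reading of α above is the author's reconstruction, not taken verbatim from the patent):

    import numpy as np

    def reinit_from_previous(w_prev):
        # w_prev: weights of one layer determined by the previous training,
        # assumed to lie in [-1, 1]; alpha is their standard deviation.
        alpha = np.sqrt(np.mean((w_prev - w_prev.mean()) ** 2))
        return alpha * w_prev            # initial weights for the next training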
In addition, for the detailed process of training, with sample image B, the neural network model previously trained via sample image A, see the above description of training the neural network model with sample image A; it is not repeated here. It will also be appreciated that, in the present application, in order to keep the distributions of the input and output of each layer of the neural network model substantially consistent and thereby alleviate vanishing gradients during propagation to deeper layers, a scaling factor is set; the method of calculating the scaling factor is not limited to the formulas shown above, and other calculation methods may also be adopted without limitation here.
As described above, the weights generated by the above weight initialization method avoid the tendency of the gradients of the neural network model to vanish during convergence and allow the neural network model to converge rapidly. Fig. 4 and fig. 5 show the convergence of a model whose initial weights are obtained by the Gaussian-distribution random initialization of the related art and of a model whose discrete-valued initial weights are obtained by the initialization method of the present application, respectively. For the model initialized with discrete values by the present method, the initial weight distribution is almost the same as the weight distribution after convergence, and the convergence speed and stability of the model are better.
It will be appreciated that the above description of the training of the neural network model 200 shown in fig. 2 is merely exemplary and not limiting, and that in other embodiments, the weight initialization method of the present application may be used for speech recognition, etc.
Specifically, as shown in fig. 4, when the neural network model uses a Sign function as the activation function, as the number of layers increases, the model produces a gradient only when its weights lie between -k and k (for example, between -1 and 1); it can be seen that the output values of the activation functions of the later layers are almost all close to 0, which easily causes the model gradient to vanish.
Fig. 5(a) shows a weight distribution diagram of a convolutional neural network model of the present application immediately after 1-bit weight initialization, according to some embodiments of the present application. Fig. 5(b) shows the weight distribution of the model at some time during convergence after the 1-bit weight initialization. Fig. 5(c) shows the weight distribution after the model, trained following the 1-bit weight initialization, has converged. In each diagram, the horizontal axis is the value of the weight and the vertical axis is the number of weight samples.
In the illustrated embodiment, the convolution kernel of the trained convolutional neural network model is 3x3x64x128. Referring to Fig. 5(a), at initialization the number of weights equal to -1 and the number equal to 1 are the same (about 35,000 each). Referring to Fig. 5(b), during convergence only a very small number of weights take values between -0.004 and 0.004; the weights essentially cluster at the discrete values -0.004 and 0.004. Note that the weights participating in training are the compressed weights; for the specific compression method, refer to the description above. Referring to Fig. 5(c), the weight distribution after convergence is substantially the same as the initial weight distribution, although the number of weights at -1 slightly exceeds the number at 1 (the difference is small). In other embodiments, the number at -1 may instead be smaller than the number at 1. Therefore, when the target neural network model is trained after its weights are initialized by the 1-bit weight initialization method, the initial weight distribution is almost the same as the converged weight distribution, and the convergence speed and stability of the model are good.
It will be appreciated that, in some embodiments, the convolutional neural network model is trained after 2-bit weight initialization (initial weights take -1, 0, and 1), and after the model converges the weights of the network layers likewise take the values -1, 0, and 1.
Fig. 6(a) and Fig. 6(b) show the convergence of an existing model trained with the Xavier initialization function and with the weight initialization method provided by the present application, respectively. It can be seen that, with the weight initialization method of the present application, training the 1-bit fixed-point quantized model yields higher accuracy, and the model is more stable and converges more easily.
Specifically, as shown in Fig. 6(a), the weights of a ResNet-32 (Deep residual network, ResNet) model are initialized with the Xavier initialization function, and the 1-bit fixed-point quantized ResNet-32 model is trained on the Cifar-10 dataset. Cifar-10 consists of 60,000 32x32 RGB color images in 10 classes (airplane, car, bird, cat, deer, dog, frog, horse, ship, truck). Referring to Fig. 6(a), where the abscissa is the number of training steps and the ordinate is the accuracy, it can be seen that the accuracy of the ResNet-32 model peaks at about 82%, the accuracy swings widely (i.e., the noise is large), and model convergence is unstable: training stops after 150 epochs and the accuracy cannot be improved further.
As shown in Fig. 6(b), the weight initialization method provided by the embodiment of the present application is used to initialize the weights of the ResNet-32 (Deep residual network, ResNet) model, and the 1-bit fixed-point quantized ResNet-32 model is trained on the Cifar-10 dataset. Referring to Fig. 6(b), where the abscissa is the number of training steps and the ordinate is the accuracy, it can be seen that when the ResNet-32 model is initialized with the weight initialization method shown in Fig. 3, its accuracy stabilizes at about 98% as training proceeds, with little noise. Compared with the result of training the 1-bit fixed-point quantized model with the Xavier initialization function in Fig. 6(a), the weight initialization method of the present application yields higher accuracy, a more stable model, and easier convergence (300 epochs). It will be appreciated that the embodiments shown in Fig. 6(a) and Fig. 6(b), which initialize the ResNet-32 model with the conventional Xavier initialization function and with the initialization method of the present application respectively and train it on the Cifar-10 dataset, are merely exemplary and not limiting. It should be noted that the weight initialization method provided by the embodiments of the present application is applicable to all neural network models that use multiply-add operations; for example, other types of datasets (such as ILSVRC2012, COCO, and UCF11) may be used to train other CNN models (such as the ResNet, MobileNet, EfficientNet, and VGG models).
For example, when the low-bit weight initialization method provided by the embodiment of the present application is used for image recognition, the acquired image information to be learned undergoes the necessary preprocessing (such as sampling, analog-to-digital conversion, and feature extraction) to form the data on which the neural network model operates. The data to be trained is then input into the neural network model, and the low-bit weight initialization method is applied to the model during training, so that model convergence is efficient and stable while the operation precision meets the requirement.
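As a rough illustration of this preprocessing step, the sketch below normalizes a raw RGB image and flattens it into the s input values fed to the input layer; the function name, image size, and normalization choice are illustrative assumptions, not the patent's exact pipeline.

```python
import numpy as np

def preprocess_image(img_u8):
    # Scale 0-255 RGB values to [0, 1] and flatten the image into the
    # s input values x1..xs for the input layer (illustrative steps only).
    x = img_u8.astype(np.float32) / 255.0
    return x.reshape(-1)                       # s = H * W * C inputs

img = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)  # stand-in image
x = preprocess_image(img)
print(x.shape)  # (3072,) -> s = 3072 input nodes
```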
In the following, a technical solution for initializing weights of a trained neural network model by using the terminal device 100 shown in fig. 1 is described in detail according to some embodiments of the present application.
1. Calculation of input layer
The image data of the sample image C is input as input data to the input layer.
2. Calculation of hidden layer
a) The input layer outputs image data of the sample image C to the first hidden layer. For example, the input image data may be color information (e.g., numbers between 0 and 255 in RGB color space) of each pixel point in the image, and the image data is input into s nodes (i.e., inputs x1, x2 to xs) of an input layer (first layer) of the model shown in fig. 2.
b) Initializing the weight of each hidden layer to obtain an initial weight.
In embodiments of the present application, a low-bit weight initialization model is used to obtain the initial weight of each hidden layer; for example, the following 1-bit or 2-bit weight initialization model is used to initialize the weights of a trained neural network model (for example, a converged 8-bit model). The value set for 1-bit weight initialization is {1, -1}; for 2-bit weight initialization it is {1, 0, -1}. That is, when the 1-bit weight initialization model is used, the initial weight of a node in a hidden layer is set to either 1 or -1, and all the weights of the hidden layer must follow a distribution with a mean of 0 and a variance of 1. When the 2-bit weight initialization model is used, the initial weight of a node in a hidden layer is set to one of the three values -1, 0, and 1, and all the weights of the hidden layer must follow a distribution with a mean of 0 and a variance of 2/3.
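The following sketch illustrates these two initialization schemes; the uniform sampling probabilities are an assumption (any draw with the stated mean and variance satisfies the constraint).

```python
import numpy as np

rng = np.random.default_rng(0)

def init_1bit(shape):
    # 1-bit initialization: each weight is -1 or 1 with equal probability,
    # so the layer's weights have mean ~0 and variance ~1.
    return rng.choice([-1.0, 1.0], size=shape)

def init_2bit(shape):
    # 2-bit initialization: weights drawn from {-1, 0, 1}. A uniform draw
    # (an assumption for illustration) gives mean ~0 and variance ~2/3,
    # matching the constraint stated above.
    return rng.choice([-1.0, 0.0, 1.0], size=shape)

w1 = init_1bit((128, 64))
w2 = init_2bit((128, 64))
print(w1.mean(), w1.var())   # approx. 0.0 and 1.0
print(w2.mean(), w2.var())   # approx. 0.0 and 2/3
```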
Specifically, for example, in some embodiments, assuming that the trained model has converged, the following 1-bit weight initialization method may be used to initialize the full-precision weights of the neural network model:
W^z = Sign(W), which equals 1 if W > 0 and -1 if W ≤ 0.

That is, the Sign function Sign(W) takes the sign of each full-precision weight of the trained neural network model: a full-precision weight greater than 0 (e.g., 0.23) is converted to 1, and a full-precision weight less than or equal to 0 (e.g., -0.15) is converted to -1. The quantized initial weight W^z thus takes one of the discrete values 1 or -1, and the distribution of the quantized weights still satisfies a mean of 0 and a variance of 1.
Furthermore, in some embodiments, to ensure that the distribution of the input and output of each layer of the neural network model remains substantially consistent and thus alleviate the gradient-vanishing problem during transfer to deeper layers, the discrete value W^z (selected from 1 and -1) obtained for a node during quantization can additionally be compressed based on a preset scaling factor α to obtain the initial weight Ŵ of the neural network model, where the preset scaling factor is obtained by a normalization method. For example, in some embodiments, the quantized discrete value W^z may be compressed as follows:

Ŵ = α · W^z

where α is a scaling factor, typically a positive fraction less than 1, used to adjust the distribution of the output data of the i-th network layer.
In some embodiments, the scaling factor α may be calculated as follows:

α = √(2 / (l_i + l_{i+1}))

where l_i is the number of input channels of the i-th network layer and l_{i+1} is the number of input channels of the (i+1)-th network layer. It will be appreciated that for the input layer, i here is 1; i+1 denotes the network layer next to the i-th network layer in the neural network model.
In addition, in other embodiments, the scaling factor α may also be calculated according to the following formula:

α = √( Σ_{j=1}^{p} (W_j^z - W̄^z)² / (p · l_i) )

where α is the scaling factor; p is the number of initial weights of the i-th network layer; W_j^z is one of -1 and 1 and corresponds to the j-th of the p initial weights, the p values W_j^z having a mean of 0 and a variance of 1; W̄^z is the average of the p values W_j^z; and l_i is the number of input channels of the i-th network layer.
Thus, the scaling factor α of each layer of the neural network model 200 is calculated, and the discrete values W^z of the corresponding network layer are compressed by multiplying them by α. During training, the input data of each layer and the compressed weights undergo the weighting operation and are propagated forward. This keeps the input and output distributions of the neural network model consistent and alleviates the gradient-vanishing problem as data is transferred to deeper layers.
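A minimal sketch of this compression step follows, using the two scaling-factor forms reconstructed above (the exact expressions in the patent's figures are not reproduced on this page, so both forms are assumptions consistent with the variables the text names):

```python
import numpy as np

def alpha_from_channels(l_i, l_i_plus_1):
    # Channel-count form reconstructed above: sqrt(2 / (l_i + l_{i+1})).
    return np.sqrt(2.0 / (l_i + l_i_plus_1))

def alpha_from_weights(w_z, l_i):
    # Statistic form reconstructed above: the standard deviation of the
    # p discrete weights W^z, normalized by the input-channel count l_i.
    p = w_z.size
    return np.sqrt(np.sum((w_z - w_z.mean()) ** 2) / (p * l_i))

rng = np.random.default_rng(0)
w_z = rng.choice([-1.0, 1.0], size=(64, 128))   # quantized 1-bit weights W^z, l_i = 64
alpha = alpha_from_weights(w_z, l_i=64)
w_hat = alpha * w_z                              # compressed initial weights W-hat
print(round(alpha, 4), round(w_hat.var(), 6))    # alpha ~ 1/8, variance ~ 1/64
```

Since the discrete weights have variance close to 1, the statistic form shrinks the variance of the compressed weights to roughly 1/l_i, which is what keeps the per-layer output distribution stable.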
It will be appreciated that the above method of calculating the scaling factor α is merely exemplary and not limiting, and that in other embodiments, other normalization methods may be employed to calculate the scaling factor α.
In some embodiments, the full-precision weights of the trained neural network model may be initialized using the following 2-bit weight initialization method:
W^q = Sign(W), which equals 1 if W > 0, 0 if W = 0, and -1 if W < 0.

That is, the Sign function Sign(W) takes the sign of each full-precision weight of the trained model: a full-precision weight greater than 0 (e.g., 0.31) is converted to 1, a full-precision weight less than 0 (e.g., -0.17) is converted to -1, and a full-precision weight equal to 0 is taken as 0. The quantized initial weight W^q thus takes one of the discrete values -1, 0, or 1, and the distribution of the quantized weights still satisfies a mean of 0 and a variance of 2/3.
In order to ensure that the distribution of the input and output of each layer of the neural network model remains substantially consistent and thus alleviate the gradient-vanishing problem during transfer to deeper layers, the discrete value W^q (selected from -1, 0, and 1) obtained for a node during quantization can be compressed based on a preset scaling factor to obtain the initial weight Ŵ of the neural network model. The preset scaling factor, obtained by a normalization method, scales the variance of the weights of the neural network model so that the network can propagate to deeper layers. For example, in some embodiments, the scaling factor is calculated as follows:

α = √(2 / (l_i + l_{i+1}))

where l_i is the number of input channels of the i-th network layer and l_{i+1} is the number of input channels of the (i+1)-th network layer.
For another example, in other embodiments, the scaling factor α may also be calculated according to the following formula:

α = √( Σ_{j=1}^{p} (W_j^q - W̄^q)² / (p · l_i) )

where α is the scaling factor; p denotes the number of the plurality of initial weights of the i-th network layer; W_j^q is one of -1, 0, and 1 and corresponds to the j-th of the p initial weights, the p values W_j^q having a mean of 0 and a variance of 2/3; W̄^q is the average of the p values W_j^q; and l_i is the number of input channels of the i-th network layer.
Therefore, compared with conventional full-precision floating-point operations and with the 8-bit models commonly used in the prior art, the embodiment of the present application quantizes the full-precision neural network model to 1 bit or 2 bits, which greatly reduces the model size, the computing resources required, and the power consumption.
c) The input data of each hidden layer undergoes a weighting operation with the corresponding initial weights Ŵ. For example, the image data of sample image C is divided into s data blocks, which are input as s input data to the s nodes of the input layer in Fig. 2 and weighted with the weights of the first hidden layer (for example, the weighting operation of the first node of the first hidden layer is x1×w11 + x2×w12 + x3×w13 + … + xs×w1s + b).
d) And performing an activation operation on the weighted operation result by using an activation function.
For example, in some embodiments, the activation function may be a Sigmoid function, a Tanh function, or the like. Specifically, the feature data of sample image C output by each node of the input layer undergoes a weighting operation with the initial weights Ŵ of each node of the first hidden layer (h nodes shown in Fig. 2), and the activation function of the first hidden layer (e.g., a Sigmoid or Tanh function) generates the output of each node of the first hidden layer (i.e., the input of the second hidden layer). The input of the second hidden layer undergoes a weighting operation with the initial weights Ŵ of each node of the second hidden layer (u nodes shown in Fig. 2), and the activation function of the second hidden layer generates the output of each node of the second hidden layer (i.e., the input of the third hidden layer). The n-2 hidden layers in Fig. 2 are computed successively in the same manner: the input of the u nodes of the (n-2)-th hidden layer (the output of the (n-3)-th hidden layer) undergoes a weighting operation with the corresponding initial weights Ŵ, and the activation function of each node of the (n-2)-th hidden layer generates the output of each node of the (n-2)-th hidden layer (i.e., the input of the output layer).
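A compact sketch of one hidden-layer step as just described follows. The layer sizes, the Sigmoid choice, and the channel-count form of α are illustrative assumptions consistent with the description above, not values from the patent.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_layer(x, w_hat, b):
    # Weighting operation (x1*w11 + x2*w12 + ... + xs*w1s + b for each
    # node) followed by the activation operation.
    return sigmoid(w_hat @ x + b)

rng = np.random.default_rng(0)
s, h = 3072, 256                       # illustrative node counts
x = rng.random(s)                      # preprocessed data of sample image C
alpha = np.sqrt(2.0 / (s + h))         # scaling factor (reconstructed form)
w_hat = alpha * rng.choice([-1.0, 1.0], size=(h, s))  # compressed 1-bit initial weights
b = np.zeros(h)
out = hidden_layer(x, w_hat, b)        # becomes the input of the second hidden layer
print(out.shape)                       # (256,)
```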
It will be appreciated that the hidden-layer calculation is similar to that described in the above scheme for training the untrained neural network model 200, except that the weight initialization method adopts the Sign function Sign(W) to take the full-precision weights of the trained model according to the sign of their full-precision values. Refer to the description above for details, which are not repeated here.
3. Calculation of output layer
The calculation of the output layer is similar to that described in the above scheme for training the untrained neural network model 200, except that the weight initialization method adopts the Sign function Sign(W) to take the full-precision weights of the trained model according to the sign of their full-precision values. Refer to the description above for details, which are not repeated here.
4. The weight adjustment method is similar to the weight adjustment method described in the above scheme for training the untrained neural network model 200, and the detailed description is referred to above, and will not be repeated here.
Although the above embodiment is exemplified by face recognition of an image, the weight initialization model of the present application can be applied to any neural network model, such as a convolutional neural network (Convolutional Neural Network, CNN), a deep neural network (Deep Neural Networks, DNN), and a recurrent neural network (Recurrent Neural Networks, RNN).
After each round of training on input sample image data, the weights of the network layers of the neural network model 200 are adjusted. The adjusted weights can be used directly as the initial weights for training on the next input sample image data, or they can first be scaled by a scaling factor and then used as those initial weights.
For example, in some embodiments, after training the already trained neural network model via sample image C, the weights of the various network layers determined after training the already trained neural network model via sample image C may be taken as initial weights of the neural network model at the next training (e.g., training the neural network model trained via sample image C using sample image D).
In yet other embodiments, to ensure that the distribution of the inputs and outputs of each layer of the neural network model remains substantially consistent and thus mitigate the gradient-vanishing problem during transfer to deeper layers, after the neural network model is trained via sample image C, the product of the weights of the network layers determined by that training and the scaling factor may be used as the initial weights of the neural network model for the next training (e.g., training the model trained via sample image C using sample image D).
For example, in some embodiments, the plurality of initial weights Ŵ of the i-th network layer of the n network layers of the neural network model can be calculated by the following formula:

Ŵ = α · W_r

where W_r is any one of the plurality of weights of the i-th network layer determined after training on sample image C, with -1 ≤ W_r ≤ 1, and α is a scaling factor, a positive fraction less than 1, used to adjust the distribution of the output data of the i-th network layer. In some embodiments, the scaling factor α may be calculated as follows:

α = √(2 / (l_i + l_{i+1}))

where l_i denotes the number of input channels of the i-th network layer of the n network layers of the neural network model, and l_{i+1} denotes the number of input channels of the (i+1)-th network layer. It will be appreciated that for the input layer, i here is 1; i+1 denotes the network layer next to the i-th network layer in the neural network model.

In other embodiments, the scaling factor α may also be calculated according to the following formula:

α = √( (1/p) Σ_{j=1}^{p} (W_j^r - W̄^r)² )

where p is the number of weights of the i-th network layer determined after training on sample image C, W_j^r denotes the j-th of those p weights, and W̄^r is the average of the p weights determined after training on sample image C.
In addition, for the detailed process of training, using sample image D, the neural network model trained via sample image C, refer to the above description of the process of training the neural network model using sample image C; it will not be repeated here.
In addition, it can be understood that, in the present application, the scaling factor is set to ensure that the distribution of the input and the output of each layer of the neural network model remains substantially consistent, so as to alleviate the gradient-vanishing problem during transfer to deeper layers; the calculation of the scaling factor is not limited to the formulas shown above, and other calculation methods may also be adopted, without limitation here.
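The two re-initialization options described above can be sketched as follows; the weight statistics and the standard-deviation form of α are illustrative assumptions consistent with the formulas reconstructed above.

```python
import numpy as np

def next_round_init(w_trained, alpha=None):
    # Option 1 (alpha=None): reuse the weights determined after the previous
    # training round directly. Option 2: scale them by the factor alpha first,
    # to keep the per-layer output distribution consistent.
    return w_trained if alpha is None else alpha * w_trained

rng = np.random.default_rng(0)
# Weights of one layer after training on sample image C, clipped to [-1, 1].
w_r = np.clip(rng.normal(0.0, 0.5, size=(64, 128)), -1.0, 1.0)
alpha = np.sqrt(np.mean((w_r - w_r.mean()) ** 2))   # std-based form reconstructed above
w_init_direct = next_round_init(w_r)                # option 1: use as-is
w_init_scaled = next_round_init(w_r, alpha=alpha)   # option 2: scaled, for sample image D
```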
Fig. 7 provides a schematic structural diagram of an electronic device 700 for training a neural network model, according to some embodiments of the application. As shown in fig. 7, the electronic device 700 includes:
a first data acquisition module 702, configured to acquire sample data and input the sample data to the second network layer, where the sample data includes initial input data and expected result data;
a first data processing module 704, configured to perform the following operations:
when i = 2, obtaining the output data of the i-th network layer based on the initial input data and the plurality of initial weights Ŵ of the i-th network layer;

when 2 < i ≤ n, obtaining the output data of the i-th network layer based on the output data of the (i-1)-th network layer and the plurality of initial weights Ŵ of the i-th network layer, wherein the plurality of initial weights Ŵ of the i-th network layer are obtained based on m discrete values, the plurality of initial weights Ŵ take values in the range -1 ≤ Ŵ ≤ 1, and m = {2, 3};
a first weight adjustment module 706, configured to adjust a plurality of initial weights of the ith network layer based on an error between output data of the n network layers and expected result data in the sample data.
It can be understood that the electronic device 700 for training a neural network model shown in fig. 7 corresponds to the training method of the neural network model provided by the present application, and the technical details in the specific description of the training method of the neural network model provided by the present application still apply to the electronic device 700 for training a neural network model shown in fig. 7, and the specific description is referred to above and will not be repeated here.
Fig. 8 provides a schematic structural diagram of an electronic device 800 for training a neural network model, according to some embodiments of the application. As shown in fig. 8, the electronic device 800 includes:
a second data acquisition module 802, configured to acquire sample data and input the sample data to a second network layer, where the sample data includes initial input data and expected result data;
a second data processing module 804, configured to perform the following operations:

when i = 2, performing sign-valuing on the plurality of full-precision weights of the i-th network layer to obtain the plurality of initial weights Ŵ of the i-th network layer, and obtaining the output data of the i-th network layer based on the initial input data and the plurality of initial weights Ŵ;

when 2 < i ≤ n, performing sign-valuing on the plurality of full-precision weights of the i-th network layer to obtain the plurality of initial weights Ŵ of the i-th network layer, and obtaining the output data of the i-th network layer based on the output data of the (i-1)-th network layer and the plurality of initial weights Ŵ, wherein

the plurality of initial weights Ŵ of the i-th network layer are obtained based on m discrete values, the plurality of initial weights Ŵ take values in the range -1 ≤ Ŵ ≤ 1, and m = {2, 3};

a second weight adjustment module 806, configured to adjust the plurality of initial weights of the i-th network layer based on errors between the output data of the n network layers and the expected result data in the sample data.
It can be understood that the electronic device 800 for training a neural network model shown in fig. 8 corresponds to the training method of a neural network model provided by the present application, and the technical details in the above detailed description about the training method of a neural network model provided by the present application still apply to the electronic device 800 for training a neural network model shown in fig. 8, and the detailed description is omitted herein.
Fig. 9 shows a schematic structural diagram of an electronic device 900 according to an embodiment of the present application. The electronic device 900 is also capable of performing the training of the neural network model disclosed in the above embodiments of the present application. In fig. 9, similar parts have the same reference numerals. As shown in fig. 9, the electronic device 900 may include a processor 910, a power module 940, a memory 980, a mobile communication module 930, a wireless communication module 920, a sensor module 990, an audio module 950, a camera 970, an interface module 960, keys 901, a display 902, and the like.
It should be understood that the illustrated structure of the embodiment of the present application does not constitute a specific limitation on the electronic device 900. In other embodiments of the application, electronic device 900 may include more or less components than illustrated, or certain components may be combined, or certain components may be split, or different arrangements of components. The illustrated components may be implemented in hardware, software, or a combination of software and hardware.
The processor 910 may include one or more processing units, for example, processing modules or processing circuits that may include a central processing unit (CPU), a graphics processing unit (GPU), a digital signal processor (DSP), a microcontroller unit (MCU), an artificial intelligence (AI) processor, a field-programmable gate array (FPGA), and the like. The different processing units may be separate devices or may be integrated in one or more processors. A storage unit may be provided in the processor 910 for storing instructions and data. In some embodiments, the storage unit in the processor 910 is a cache. The memory 980 mainly includes a storage program area 9801 and a storage data area 9802, where the storage program area 9801 can store an operating system and at least one application program required for functions (such as audio playing and image recognition). The neural network model provided in the embodiment of the present application may be regarded as an application program in the storage program area 9801 capable of implementing functions such as image processing and voice processing. The weights of each network layer of the neural network model are stored in the storage data area 9802 described above.
The power module 940 may include a power source, a power management component, and the like. The power source may be a battery. The power management component is used for managing the charging of the power supply and the power supply supplying of the power supply to other modules. In some embodiments, the power management component includes a charge management module and a power management module. The charging management module is used for receiving charging input from the charger; the power management module is used to connect the power source, the charge management module and the processor 910. The power management module receives input from the power and/or charge management module and provides power to the processor 910, the display 902, the camera 970, and the wireless communication module 920.
The mobile communication module 930 may include, but is not limited to, an antenna, a power amplifier, a filter, and a low-noise amplifier (LNA). The mobile communication module 930 may provide wireless communication solutions, including 2G/3G/4G/5G, applied to the electronic device 900. The mobile communication module 930 may receive electromagnetic waves from the antenna, filter and amplify the received electromagnetic waves, and transmit them to the modem processor for demodulation. The mobile communication module 930 may also amplify signals modulated by the modem processor and convert them into electromagnetic waves for radiation through the antenna. In some embodiments, at least some of the functional modules of the mobile communication module 930 may be disposed in the processor 910. In some embodiments, at least some of the functional modules of the mobile communication module 930 and at least some of the modules of the processor 910 may be disposed in the same device. The wireless communication technologies may include global system for mobile communications (GSM), general packet radio service (GPRS), code division multiple access (CDMA), wideband code division multiple access (WCDMA), time-division code division multiple access (TD-SCDMA), long term evolution (LTE), Bluetooth (BT), global navigation satellite system (GNSS), wireless local area network (WLAN), near field communication (NFC), frequency modulation (FM), infrared (IR) technology, and the like. The GNSS may include the global positioning system (GPS), the global navigation satellite system (GLONASS), the BeiDou navigation satellite system (BDS), the quasi-zenith satellite system (QZSS), and/or satellite-based augmentation systems (SBAS).
The wireless communication module 920 may include an antenna, and implements transmission and reception of electromagnetic waves via the antenna. The wireless communication module 920 may provide wireless communication solutions applied to the electronic device 900, including wireless local area network (WLAN) (e.g., wireless fidelity (Wi-Fi)) networks, Bluetooth (BT), global navigation satellite system (GNSS), frequency modulation (FM), near field communication (NFC), and infrared (IR) technology. The electronic device 900 may communicate with networks and other devices through wireless communication technologies.
In some embodiments, the mobile communication module 930 and the wireless communication module 920 of the electronic device 900 may also be located in the same module.
The display 902 is used to display a human-machine interface, images, video, and the like. The display 902 includes a display panel. The display panel may be a liquid crystal display (LCD), an organic light-emitting diode (OLED), an active-matrix organic light-emitting diode (AMOLED), a flexible light-emitting diode (FLED), a Mini-LED, a Micro-LED, a quantum dot light-emitting diode (QLED), or the like.
The sensor module 990 may include a proximity light sensor, a pressure sensor, a gyroscope sensor, a barometric sensor, a magnetic sensor, an acceleration sensor, a distance sensor, a fingerprint sensor, a temperature sensor, a touch sensor, an ambient light sensor, a bone conduction sensor, and the like.
The audio module 950 is used to convert digital audio information into an analog audio signal output, or to convert an analog audio input into a digital audio signal. The audio module 950 may also be used to encode and decode audio signals. In some embodiments, the audio module 950 may be disposed in the processor 910, or a portion of the functional modules of the audio module 950 may be disposed in the processor 910. In some embodiments, the audio module 950 may include a speaker, an earpiece, a microphone, and an earphone interface.
Camera 970 is used to capture still images or video. The object generates an optical image through the lens and projects the optical image onto the photosensitive element. The photosensitive element converts the optical signal into an electrical signal, which is then transferred to an ISP (Image Signal Processing ) to be converted into a digital image signal. The electronic device 900 may implement shooting functions through an ISP, a camera 970, a video codec, a GPU (Graphic Processing Unit, a graphics processor), a display screen 902, an application processor, and the like.
The interface module 960 includes an external memory interface, a universal serial bus (universal serial bus, USB) interface, a subscriber identity module (subscriber identification module, SIM) card interface, and the like. Wherein the external memory interface may be used to connect an external memory card, such as a Micro SD card, to enable expansion of the memory capabilities of the electronic device 900. The external memory card communicates with the processor 910 through an external memory interface to implement data storage functions. The universal serial bus interface is used for communication between the electronic device 900 and other electronic devices. The subscriber identity module card interface is used to communicate with a SIM card mounted to the electronic device 900, for example, by reading a telephone number stored in the SIM card or by writing a telephone number to the SIM card.
In some embodiments, the electronic device 900 also includes keys 901, motors, indicators, and the like. The keys 901 may include a volume key, an on/off key, and the like. The motor is used to cause the electronic device 900 to generate a vibration effect, such as when the user's electronic device 900 is called, to prompt the user to answer the electronic device 900. The indicators may include laser indicators, radio frequency indicators, LED indicators, and the like.
Embodiments of the disclosed mechanisms may be implemented in hardware, software, firmware, or a combination of these implementations. Embodiments of the application may be implemented as a computer program or program code that is executed on a programmable system comprising at least one processor, a storage system (including volatile and non-volatile memory and/or storage elements), at least one input device, and at least one output device.
Program code may be applied to input instructions to perform the functions described herein and generate output information. The output information may be applied to one or more output devices in a known manner. For the purposes of this application, a processing system includes any system having a processor such as, for example, a Digital Signal Processor (DSP), a microcontroller, an Application Specific Integrated Circuit (ASIC), or a microprocessor.
The program code may be implemented in a high level procedural or object oriented programming language to communicate with a processing system. Program code may also be implemented in assembly or machine language, if desired. Indeed, the mechanisms described in the present application are not limited in scope by any particular programming language. In either case, the language may be a compiled or interpreted language.
In some cases, the disclosed embodiments may be implemented in hardware, firmware, software, or any combination thereof. The disclosed embodiments may also be implemented as instructions carried by or stored on one or more transitory or non-transitory machine-readable (e.g., computer-readable) storage media, which may be read and executed by one or more processors. For example, the instructions may be distributed over a network or through other computer-readable media. Thus, a machine-readable medium may include any mechanism for storing or transmitting information in a form readable by a machine (e.g., a computer), including, but not limited to, floppy diskettes, optical disks, compact disc read-only memories (CD-ROMs), magneto-optical disks, read-only memories (ROMs), random access memories (RAMs), erasable programmable read-only memories (EPROMs), electrically erasable programmable read-only memories (EEPROMs), magnetic or optical cards, flash memory, or tangible machine-readable memory used to transmit information over the Internet in an electrical, optical, acoustical, or other form of propagated signal (e.g., carrier waves, infrared signals, digital signals, etc.). Thus, a machine-readable medium includes any type of machine-readable medium suitable for storing or transmitting electronic instructions or information in a form readable by a machine (e.g., a computer).
In the drawings, some structural or methodological features may be shown in a particular arrangement and/or order. However, it should be understood that such a particular arrangement and/or ordering may not be required. Rather, in some embodiments, these features may be arranged in a different manner and/or order than shown in the illustrative figures. Additionally, the inclusion of structural or methodological features in a particular figure is not meant to imply that such features are required in all embodiments, and in some embodiments, may not be included or may be combined with other features.
It should be noted that, in the embodiments of the present application, each unit/module mentioned in each device is a logic unit/module, and in physical terms, one logic unit/module may be one physical unit/module, or may be a part of one physical unit/module, or may be implemented by a combination of multiple physical units/modules, where the physical implementation manner of the logic unit/module itself is not the most important, and the combination of functions implemented by the logic unit/module is only a key for solving the technical problem posed by the present application. Furthermore, in order to highlight the innovative part of the present application, the above-described device embodiments of the present application do not introduce units/modules that are less closely related to solving the technical problems posed by the present application, which does not indicate that the above-described device embodiments do not have other units/modules.
It should be noted that in the examples and descriptions of this patent, relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising" does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
While the application has been shown and described with reference to certain preferred embodiments thereof, it will be understood by those skilled in the art that various changes in form and details may be made therein without departing from the spirit and scope of the application.

Claims (23)

1. A training method of a neural network model, characterized in that the neural network model comprises n network layers, where n is a positive integer greater than 1; and
The method comprises the following steps:
a first network layer of the n network layers acquires sample data and inputs the sample data to a second network layer, wherein the sample data comprises initial input data and expected result data, the sample data comprises image data, the neural network model is used for image recognition, the initial input data is image data, the first network layer is an input layer comprising s nodes, the image data represented by the initial input data is color information of each pixel point in an image, and the initial input data is divided into s input data and is respectively input into the s nodes;
for an ith network layer of the n network layers, performing the following operations:
when i = 2, obtaining output data of the i-th network layer based on the initial input data and a plurality of initial weights Ŵ of the i-th network layer, wherein the output data of the i-th network layer is obtained by performing a weighting operation on the plurality of initial weights Ŵ of the i-th network layer and the s input data, and performing an activation operation on the result of the weighting operation using an activation function,

when 2 < i ≤ n, obtaining output data of the i-th network layer based on the output data of the (i-1)-th network layer and the plurality of initial weights Ŵ of the i-th network layer, wherein the output data of the i-th network layer is obtained by performing a weighting operation on the plurality of initial weights Ŵ of the i-th network layer and the output data of the (i-1)-th network layer, and performing an activation operation on the result of the weighting operation using an activation function, wherein

the plurality of initial weights Ŵ of the i-th network layer are obtained based on m discrete values, the plurality of initial weights Ŵ take values in the range -1 ≤ Ŵ ≤ 1, and m = {2, 3};

adjusting the plurality of initial weights Ŵ of the i-th network layer based on an error between an image recognition result corresponding to the output data of the n network layers and the expected result data in the sample data.
2. The method of claim 1, wherein each of the plurality of initial weights Ŵ of the i-th network layer is one of the m discrete values.
3. The method of claim 2, wherein the m discrete values are -1 and 1, and the plurality of initial weights Ŵ of the i-th network layer have a mean of 0 and a variance of 1.
4. The method of claim 2, wherein the m discrete values are -1, 0, and 1, and the plurality of initial weights Ŵ of the i-th network layer have a mean of 0 and a variance of 2/3.
5. The method of claim 1, wherein the i-th network layer has p initial weights Ŵ, and the p initial weights Ŵ of the i-th network layer are calculated by the following formula:

Ŵ = α · W_b

wherein W_b is one of the m discrete values, with -1 ≤ W_b ≤ 1, and the p values W_b corresponding to the p initial weights Ŵ have a mean of 0 and a variance of 1 or 2/3; α is a scaling factor and is a positive number less than 1, for adjusting the distribution of the output data of the i-th network layer.
6. The method of claim 5, wherein the p values W_b corresponding to the p initial weights Ŵ have a variance of 1, and the m discrete values are -1 and 1.
7. The method of claim 5, wherein the p values W_b corresponding to the p initial weights Ŵ have a variance of 2/3, and the m discrete values are -1, 0, and 1.
8. The method according to any of claims 5 to 7, wherein the scaling factor is obtained by the following formula:

α = √( Σ_{j=1}^{p} (W_j^b - W̄^b)² / (p · l_i) )

wherein W_j^b is, among the p values W_b, the discrete value corresponding to the j-th initial weight of the p initial weights Ŵ; W̄^b is the average of the p values W_b of the i-th network layer; and l_i is the number of input channels of the i-th network layer.
9. The method of claim 1, wherein the plurality of initial weights Ŵ of the i-th network layer are calculated by the following formula:

Ŵ = α · W_t

wherein W_t is any one of a plurality of weights determined by the previous training of the i-th network layer, with -1 ≤ W_t ≤ 1, and α is a scaling factor and is a positive number less than 1, for adjusting the distribution of the output data of the i-th network layer.
10. The method of claim 9, wherein the scaling factor is obtained by the following formula:

α = √( (1/p) Σ_{j=1}^{p} (W_j^t - W̄^t)² )

wherein p is the number of weights determined by the previous training of the i-th network layer, W_j^t denotes the j-th of the p weights determined by the previous training, and W̄^t is the average of the p weights determined by the previous training.
11. The method according to claim 5 or 9, wherein the scaling factor is obtained by the following formula:

α = √(2 / (l_i + l_{i+1}))

wherein l_i is the number of input channels of the i-th network layer and l_{i+1} is the number of input channels of the (i+1)-th network layer.
12. A training method of a neural network model, characterized in that the neural network model comprises n network layers and the neural network model has converged, n being a positive integer greater than 1; and
The method comprises the following steps:
a first network layer of the n network layers acquires sample data and inputs the sample data to a second network layer, wherein the sample data comprises initial input data and expected result data, the sample data comprises image data, the neural network model is used for image recognition, the initial input data is image data, the first network layer is an input layer comprising s nodes, the image data represented by the initial input data is color information of each pixel point in an image, and the initial input data is divided into s input data which are respectively input into the s nodes;
for an ith network layer of the n network layers, performing the following operations:
when i = 2, performing sign-valuing on a plurality of full-precision weights of the i-th network layer to obtain a plurality of initial weights Ŵ of the i-th network layer, and obtaining output data of the i-th network layer based on the initial input data and the plurality of initial weights Ŵ, wherein the output data of the i-th network layer is obtained by performing a weighting operation on the plurality of initial weights Ŵ of the i-th network layer and the s input data, and performing an activation operation on the result of the weighting operation using an activation function,

when 2 < i ≤ n, performing sign-valuing on the plurality of full-precision weights of the i-th network layer to obtain the plurality of initial weights Ŵ of the i-th network layer, and obtaining output data of the i-th network layer based on the output data of the (i-1)-th network layer and the plurality of initial weights Ŵ, wherein the output data of the i-th network layer is obtained by performing a weighting operation on the plurality of initial weights Ŵ of the i-th network layer and the output data of the (i-1)-th network layer, and performing an activation operation on the result of the weighting operation using an activation function, wherein

the plurality of initial weights Ŵ of the i-th network layer are obtained based on m discrete values, the plurality of initial weights Ŵ take values in the range -1 ≤ Ŵ ≤ 1, and m = {2, 3};

adjusting the plurality of initial weights Ŵ of the i-th network layer based on an error between an image recognition result corresponding to the output data of the n network layers and the expected result data in the sample data.
13. The method of claim 12, wherein the m discrete values are -1 and 1, and the plurality of initial weights Ŵ of the i-th network layer have a mean of 0 and a variance of 1; and

the performing sign-valuing on the plurality of full-precision weights of the i-th network layer to obtain the plurality of initial weights Ŵ of the i-th network layer comprises:

if a full-precision weight is less than or equal to 0, taking -1 as the initial weight corresponding to the full-precision weight;

if a full-precision weight is greater than 0, taking 1 as the initial weight corresponding to the full-precision weight.
14. The method of claim 12, wherein the m discrete values are -1, 0, and 1, and the plurality of initial weights Ŵ of the i-th network layer have a mean of 0 and a variance of 2/3; and

the performing sign-valuing on the plurality of full-precision weights of the i-th network layer to obtain the plurality of initial weights Ŵ of the i-th network layer comprises:

if a full-precision weight is less than 0, taking -1 as the initial weight corresponding to the full-precision weight;

if a full-precision weight is equal to 0, taking 0 as the initial weight corresponding to the full-precision weight;

if a full-precision weight is greater than 0, taking 1 as the initial weight corresponding to the full-precision weight.
15. The method of claim 12, wherein the m discrete values are -1 and 1, and

the performing sign-valuing on the plurality of full-precision weights of the i-th network layer to obtain the plurality of initial weights Ŵ of the i-th network layer comprises:

if a full-precision weight is less than or equal to 0, taking the product of -1 and a scaling factor as the initial weight corresponding to the full-precision weight;

if a full-precision weight is greater than 0, taking the product of 1 and the scaling factor as the initial weight corresponding to the full-precision weight;

wherein the scaling factor is a positive number less than 1, for adjusting the distribution of the output data of the i-th network layer.
16. The method of claim 12, wherein the m discrete values are -1, 0, and 1, and the performing sign-valuing on the plurality of full-precision weights of the i-th network layer to obtain the plurality of initial weights Ŵ of the i-th network layer comprises:

if a full-precision weight is less than 0, taking the product of -1 and a scaling factor as the initial weight corresponding to the full-precision weight;

if a full-precision weight is equal to 0, taking 0 as the initial weight corresponding to the full-precision weight;

if a full-precision weight is greater than 0, taking the product of 1 and the scaling factor as the initial weight corresponding to the full-precision weight;

wherein the scaling factor is a positive number less than 1, for adjusting the distribution of the output data of the i-th network layer.
17. The method of claim 15 or 16, wherein the scaling factor is obtained by the following formula:

α = √(2 / (l_i + l_{i+1}))

wherein α is the scaling factor, l_i is the number of input channels of the i-th network layer, and l_{i+1} is the number of input channels of the (i+1)-th network layer.
18. The method of claim 15, wherein the scaling factor is obtained by the following formula:

α = √( Σ_{j=1}^{p} (W_j^z - W̄^z)² / (p · l_i) )

wherein α is the scaling factor; p denotes the number of the plurality of initial weights of the i-th network layer; W_j^z is one of -1 and 1 and corresponds to the j-th initial weight of the p initial weights, the p values W_j^z corresponding to the p initial weights having a mean of 0 and a variance of 1; W̄^z is the average of the p values W_j^z; and l_i is the number of input channels of the i-th network layer.
19. The method of claim 16, wherein the scaling factor is obtained by the following formula:

α = √( Σ_{j=1}^{p} (W_j^q - W̄^q)² / (p · l_i) )

wherein α is the scaling factor; p denotes the number of the plurality of initial weights of the i-th network layer; W_j^q is one of -1, 0, and 1 and corresponds to the j-th initial weight of the p initial weights, the p values W_j^q corresponding to the p initial weights having a mean of 0 and a variance of 2/3; W̄^q is the average of the p values W_j^q; and l_i is the number of input channels of the i-th network layer.
20. An electronic device for training a neural network model, comprising:
the system comprises a first data acquisition module, a second data acquisition module and a storage module, wherein the first data acquisition module is used for acquiring sample data and inputting the sample data to a second network layer, the sample data comprises initial input data and expected result data, the sample data comprises image data, the neural network model is used for image recognition, the initial input data is image data, the first network layer is an input layer containing s nodes, the image data represented by the initial input data is color information of each pixel point in an image, and the initial input data is divided into s input data and is respectively input into the s nodes;
the first data processing module is used for executing the following operations:
when i = 2, obtaining output data of the i-th network layer based on the initial input data and a plurality of initial weights Ŵ of the i-th network layer, wherein the output data of the i-th network layer is obtained by performing a weighting operation on the plurality of initial weights Ŵ of the i-th network layer and the s input data, and performing an activation operation on the result of the weighting operation using an activation function,

when 2 < i ≤ n, obtaining output data of the i-th network layer based on the output data of the (i-1)-th network layer and the plurality of initial weights Ŵ of the i-th network layer, wherein the output data of the i-th network layer is obtained by performing a weighting operation on the plurality of initial weights Ŵ of the i-th network layer and the output data of the (i-1)-th network layer, and performing an activation operation on the result of the weighting operation using an activation function, wherein

the plurality of initial weights Ŵ of the i-th network layer are obtained based on m discrete values, the plurality of initial weights Ŵ take values in the range -1 ≤ Ŵ ≤ 1, and m = {2, 3};
and the first weight adjusting module is used for adjusting the initial weights of the ith network layer based on errors between image recognition results corresponding to the output data of the n network layers and expected result data in the sample data.
21. An electronic device for training a neural network model, comprising:
the second data acquisition module is used for acquiring sample data and inputting the sample data into a second network layer, wherein the sample data comprises initial input data and expected result data, the sample data comprises image data, the neural network model is used for image recognition, the initial input data is image data, the first network layer is an input layer comprising s nodes, the image data represented by the initial input data is color information of each pixel point in an image, and the initial input data is divided into s input data which are respectively input into the s nodes;
a second data processing module, configured to perform the following operations:

when i = 2, performing sign-valuing on a plurality of full-precision weights of the i-th network layer to obtain a plurality of initial weights Ŵ of the i-th network layer, and obtaining output data of the i-th network layer based on the initial input data and the plurality of initial weights Ŵ, wherein the output data of the i-th network layer is obtained by performing a weighting operation on the plurality of initial weights Ŵ of the i-th network layer and the s input data, and performing an activation operation on the result of the weighting operation using an activation function,

when 2 < i ≤ n, performing sign-valuing on the plurality of full-precision weights of the i-th network layer to obtain the plurality of initial weights Ŵ of the i-th network layer, and obtaining output data of the i-th network layer based on the output data of the (i-1)-th network layer and the plurality of initial weights Ŵ, wherein the output data of the i-th network layer is obtained by performing a weighting operation on the plurality of initial weights Ŵ of the i-th network layer and the output data of the (i-1)-th network layer, and performing an activation operation on the result of the weighting operation using an activation function, wherein

the plurality of initial weights Ŵ of the i-th network layer are obtained based on m discrete values, the plurality of initial weights Ŵ take values in the range -1 ≤ Ŵ ≤ 1, and m = {2, 3};

a second weight adjustment module, configured to adjust the plurality of initial weights Ŵ of the i-th network layer based on an error between an image recognition result corresponding to the output data of the n network layers and the expected result data in the sample data.
22. A computer readable medium having instructions stored thereon, which when executed on a computer cause the computer to perform the method of training a neural network model according to any of claims 1-19.
23. An electronic device, comprising:
a memory for storing instructions for execution by one or more processors of the system, and
a processor, which is one of the processors of the system, configured to perform the training method of the neural network model according to any one of claims 1-19.
CN202010086380.0A 2020-02-11 2020-02-11 Training method of neural network model, medium and electronic equipment thereof Active CN111401546B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010086380.0A CN111401546B (en) 2020-02-11 2020-02-11 Training method of neural network model, medium and electronic equipment thereof

Publications (2)

Publication Number Publication Date
CN111401546A CN111401546A (en) 2020-07-10
CN111401546B true CN111401546B (en) 2023-12-08

Family

ID=71428349

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010086380.0A Active CN111401546B (en) 2020-02-11 2020-02-11 Training method of neural network model, medium and electronic equipment thereof

Country Status (1)

Country Link
CN (1) CN111401546B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113408197A (en) * 2021-06-11 2021-09-17 华帝股份有限公司 Training method of temperature field mathematical model
CN113642740B (en) * 2021-08-12 2023-08-01 百度在线网络技术(北京)有限公司 Model training method and device, electronic equipment and medium
CN113467590B (en) * 2021-09-06 2021-12-17 南京大学 Many-core chip temperature reconstruction method based on correlation and artificial neural network

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6269351B1 (en) * 1999-03-31 2001-07-31 Dryken Technologies, Inc. Method and system for training an artificial neural network
US10192327B1 (en) * 2016-02-04 2019-01-29 Google Llc Image compression with recurrent neural networks
CN109447532A (en) * 2018-12-28 2019-03-08 中国石油大学(华东) A kind of oil reservoir inter well connectivity based on data-driven determines method
US10422854B1 (en) * 2019-05-01 2019-09-24 Mapsted Corp. Neural network training for mobile device RSS fingerprint-based indoor navigation
WO2019180314A1 (en) * 2018-03-20 2019-09-26 Nokia Technologies Oy Artificial neural networks
CN110490295A (en) * 2018-05-15 2019-11-22 华为技术有限公司 A kind of neural network model, data processing method and processing unit
EP3591586A1 (en) * 2018-07-06 2020-01-08 Capital One Services, LLC Data model generation using generative adversarial networks and fully automated machine learning system which generates and optimizes solutions given a dataset and a desired outcome
CN110717585A (en) * 2019-09-30 2020-01-21 上海寒武纪信息科技有限公司 Training method of neural network model, data processing method and related product

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Convolutional neural network quantization and compression method for "edge" applications; Cai Ruichu et al.; Journal of Computer Applications (计算机应用); 2018-09-10; Vol. 38, No. 9; pp. 2449-2454 *

Also Published As

Publication number Publication date
CN111401546A (en) 2020-07-10

Similar Documents

Publication Publication Date Title
CN111401546B (en) Training method of neural network model, medium and electronic equipment thereof
US11475298B2 (en) Using quantization in training an artificial intelligence model in a semiconductor solution
US11715486B2 (en) Convolutional, long short-term memory, fully connected deep neural networks
CN111164601B (en) Emotion recognition method, intelligent device and computer readable storage medium
CN110084281B (en) Image generation method, neural network compression method, related device and equipment
US10282641B2 (en) Technologies for classification using sparse coding in real time
WO2023005386A1 (en) Model training method and apparatus
CN112990440B (en) Data quantization method for neural network model, readable medium and electronic device
US20190348062A1 (en) System and method for encoding data using time shift in an audio/image recognition integrated circuit solution
CN114207605A (en) Text classification method and device, electronic equipment and storage medium
CN116720563B (en) Method and device for improving fixed-point neural network model precision and electronic equipment
US20230386448A1 (en) Method of training speech recognition model, electronic device and storage medium
CN114239792B (en) System, apparatus and storage medium for image processing using quantization model
US11526691B2 (en) Learning device, learning method, and storage medium
CN114758130B (en) Image processing and model training method, device, equipment and storage medium
CN112435169A (en) Image generation method and device based on neural network
CN115129877A (en) Method and device for generating punctuation mark prediction model and electronic equipment
KR20210141252A (en) Electronic apparatus and method for controlling thereof
WO2024012171A1 (en) Binary quantization method, neural network training method, device and storage medium
CN112801116A (en) Image feature extraction method and device, electronic equipment and storage medium
KR20210082993A (en) Quantized image generation method and sensor debice for perfoming the same
CN116993619B (en) Image processing method and related equipment
CN116072096B (en) Model training method, acoustic model, voice synthesis system and electronic equipment
US20240193422A1 (en) Neuron-by-neuron quantization for efficient training of low-bit quantized neural networks
WO2022239967A1 (en) Electronic device and control method thereof

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant