CN110909860A - Method and device for initializing neural network parameters


Info

Publication number
CN110909860A
Authority
CN
China
Prior art keywords
network
local
local network
training set
neural network
Prior art date
Legal status
Pending
Application number
CN201811072803.2A
Other languages
Chinese (zh)
Inventor
Yang Ning (杨宁)
Current Assignee
Huawei Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd filed Critical Huawei Technologies Co Ltd
Priority to CN201811072803.2A
Publication of CN110909860A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06N: COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00: Computing arrangements based on biological models
    • G06N3/02: Neural networks
    • G06N3/04: Architecture, e.g. interconnection topology
    • G06N3/045: Combinations of networks
    • G06N3/08: Learning methods
    • G06N3/084: Backpropagation, e.g. using gradient descent

Abstract

The application provides a method and an apparatus for initializing neural network parameters in the field of artificial intelligence. The method comprises the following steps: determining, for each local network of the neural network, a corresponding feature training set; training the plurality of local networks based on their feature training sets, where each local network in the plurality of local networks covers a part of the neural network; and using all or part of the parameters of the trained local networks as initialization parameters of the neural network. According to this technical scheme, the initialization parameters of the neural network are obtained by training the plurality of local networks, so that the convergence speed of training the neural network and the performance of parameter initialization can be improved.

Description

Method and device for initializing neural network parameters
Technical Field
The present application relates to the field of artificial intelligence, and more particularly, to a method and apparatus for neural network parameter initialization.
Background
The study of neural networks has split into two directions. One direction focuses on the process of biological information processing and is known as biological neural networks; the other focuses on engineering applications and is known as artificial neural networks. With the introduction of the concepts of deep networks and deep learning in 2006, research on neural networks took a new turn.
Deep neural networks (DNNs) refer to deep neural network algorithms and have become a popular topic in machine learning, in both industry and academia, in recent years. DNN algorithms have raised recognition rates well beyond those of conventional artificial neural networks.
At present, deep neural networks are one of the hot topics in machine learning and can learn high-level abstract semantic features of the training data. In recent years, deep neural networks have been extensively validated and applied in image processing, speech recognition, text classification, and so on. Neural networks themselves date back to the 1940s, when they were used to implement some logical operations.
When a deep neural network is trained with the back propagation algorithm, vanishing or exploding gradients occur easily because the network has many layers. Initialization of the neural network parameters is therefore very important, and a good parameter initialization method helps improve both the performance of the parameter initialization and the convergence speed of training the neural network.
In the prior art, randomization methods and transfer learning methods are used for deep neural network parameter initialization, which leads to a slow convergence rate when training the deep neural network and poor performance of the parameter initialization. How to improve the convergence rate of training a neural network and the performance of neural network parameter initialization has therefore become an urgent problem to be solved.
Disclosure of Invention
The application provides a method and a device for initializing parameters of a neural network, which can improve the convergence speed of training the neural network and the performance of initializing the parameters of the neural network.
In a first aspect, a method for initializing neural network parameters is provided, including: respectively determining a feature training set corresponding to each local network of the neural network, wherein the neural network comprises a plurality of local networks, and each local network of the plurality of local networks covers a part of the neural network; training each local network by using the corresponding characteristic training set to obtain the parameters of the local network; determining initialization parameters of the neural network, wherein the initialization parameters of the neural network comprise: all or a portion of the parameters of the plurality of local networks.
According to the method for initializing the parameters of the neural network, a plurality of feature training sets are determined for a plurality of local networks, the plurality of local networks are respectively trained based on the feature training sets corresponding to the local networks, and all or part of the parameters in the trained local networks are used as the initialization parameters of the neural network. Because each of the plurality of local networks covers a part of the neural network, which is equivalent to performing parameter initialization on the plurality of parts of the neural network, the convergence rate of the training neural network and the performance of parameter initialization of the neural network can be improved.
It should be understood that using the parameters of the trained local networks as the initialization parameters of the neural network means the following: each local network in the plurality of local networks covers a part of the neural network, and after the training of a local network is completed, its parameters are used as the parameters of the part of the neural network that it covers.
Here, saying that a local network covers a portion of the neural network means that the local network is a portion of the neural network.
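As a rough illustration only, the overall procedure described above might be sketched as follows. This is a minimal sketch, assuming a feedforward network expressed as an ordered sequence of layers; the helpers split_into_local_networks, build_feature_training_set and train_local_network are hypothetical placeholders rather than functions defined by this application.

```python
import torch.nn as nn

def initialize_by_local_networks(neural_network: nn.Sequential,
                                 preset_training_set,
                                 split_into_local_networks,
                                 build_feature_training_set,
                                 train_local_network):
    # Each local network covers a part of the neural network (here, a slice of its layers).
    local_networks = split_into_local_networks(neural_network)

    # Train each local network on its own feature training set.
    for local_net in local_networks:
        feature_training_set = build_feature_training_set(local_net, preset_training_set)
        train_local_network(local_net, feature_training_set)

    # The slices share modules with the full network, so all or part of the trained
    # parameters are now in place as initialization parameters of the neural network.
    return neural_network
```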
With reference to the first aspect, in certain implementations of the first aspect, the determining initialization parameters of the neural network includes: the initialization parameters of the first part of the neural network are all or part of the parameters of the last trained local network in a first local network set, the first local network set comprises one or more trained local networks, and the one or more trained local networks cover the first part of the neural network.
According to the method for initializing neural network parameters provided in this embodiment of the application, when a first part of the neural network is covered by a plurality of local networks, part or all of the parameters of the last trained local network among the local networks covering the first part can be used as the initialization parameters of the first part. In this way, the optimal parameters can be used as the initialization parameters of the first part, improving the performance of neural network parameter initialization.
The first part may be covered by part of a local network or by all of a local network. When the first part is covered by all of a local network, the initialization parameters of the first part are all of the parameters of that local network; when the first part is covered by only a partial network of a local network, the initialization parameters of the first part are the part of that local network's parameters corresponding to the partial network covering the first part.
With reference to the first aspect and the foregoing implementation manner of the first aspect, in another implementation manner of the first aspect, the determining, for each local network of the neural network, a corresponding feature training set includes: determining, for each local network of the neural network, a corresponding training subset and a corresponding feature training set generation network; and generating the feature training set of each local network from its training subset based on its feature training set generation network.
According to the method for initializing neural network parameters provided in this embodiment of the application, the feature training set corresponding to each local network of the neural network is determined from that local network's training subset and feature training set generation network.
The determination of a feature training set is described in detail below, taking any one of the plurality of local networks as an example.
For example, a first feature training set is determined, where the first feature training set is the feature training set used for training a first local network, and the first local network is any one of the plurality of local networks. First, a first training subset is assigned to the first local network, and a first feature training set generation network for generating the feature training set of the first local network is determined. The first feature training set is then determined based on the first training subset and the first feature training set generation network. In this way, the feature training set used for training the local network can be accurately determined.
With reference to the first aspect and the foregoing implementation manner of the first aspect, in another implementation manner of the first aspect, the training subset includes: a preset training set, or a part of the preset training set.
According to the method for initializing neural network parameters provided by the embodiment of the application, the training subset may be a preset training set or a part of the preset training set. Various schemes are provided for determining the training subsets.
With reference to the first aspect and the foregoing implementation manner of the first aspect, in another implementation manner of the first aspect, the feature training set generation network of the local network includes all or part of a back network, where the back network is a network between the local network and an input of the neural network.
According to the method for initializing neural network parameters provided in this embodiment of the application, the feature training set generation network can be determined based on the back network between the local network and the input of the neural network. When all or part of the back network is covered by a trained local network, the feature training set generation network of the local network includes all or part of the back network.
It should be understood that, in the present application, when no part of the back network is covered by a trained local network, the training subset itself serves as the feature training set.
With reference to the first aspect and the foregoing implementation manner of the first aspect, in another implementation manner of the first aspect, the initialization parameters of the feature training set generation network of a local network include: all or part of the parameters of the last trained local network among the at least one trained local network covering the feature training set generation network of the local network; or, parameters obtained through a preset calculation on all or part of the parameters of the at least one trained local network covering the feature training set generation network of the local network.
According to the method for initializing parameters of a neural network provided in this embodiment of the application, the initialization parameters of the feature training set generation network may be all or part of the parameters of the last trained local network among the at least one trained local network covering the feature training set generation network, or may be parameters obtained through a preset calculation on all or part of the parameters of the at least one trained local network covering the feature training set generation network. Multiple flexible schemes are thus provided for determining the initialization parameters of the feature training set generation network.
With reference to the first aspect and the foregoing implementation manner of the first aspect, in another implementation manner of the first aspect, the feature training set of the local network includes data and labels, where the data is the output generated by forward-propagating the training subset of the local network through the feature training set generation network of the local network, and the labels are the labels of the neural network.
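As a minimal sketch of this step, assuming a PyTorch-style generation network and that the labels of the training subset are the labels of the neural network (as described above), the feature training set could be produced roughly as follows; the function name is illustrative only.

```python
import torch

@torch.no_grad()
def build_feature_training_set(generation_network, training_subset):
    """Forward-propagate the training subset through the feature training set
    generation network and pair the resulting features with the original labels."""
    feature_training_set = []
    for x, y in training_subset:           # (input sample, label of the neural network)
        if generation_network is None:     # back network not covered: subset is used as-is
            feature_training_set.append((x, y))
        else:
            feature_training_set.append((generation_network(x), y))
    return feature_training_set
```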
With reference to the first aspect and the foregoing implementation manner of the first aspect, in another implementation manner of the first aspect, the covering a portion of the neural network by each of the plurality of local networks includes: a first local network covers a second portion of the neural network; a second local network covers a third portion of the neural network; where the first local network and the second local network are two local networks of the plurality of local networks, and the second portion is partially or completely the same as the third portion.
According to the method for initializing neural network parameters provided in this embodiment of the application, when the plurality of local networks cover different parts of the neural network, overlap between the local networks may occur. Various ways in which the local networks cover the neural network can thus be provided.
With reference to the first aspect and the foregoing implementation manner of the first aspect, in another implementation manner of the first aspect, the training using the feature training set corresponding to each local network respectively includes: in the forward direction of the neural network, training the local networks sequentially from back to front according to the front-to-back order of the local networks.
According to the method for initializing neural network parameters provided in this embodiment of the application, when the plurality of local networks are trained, the local networks can be trained sequentially from back to front according to their front-to-back order in the forward direction of the neural network. Continuity between the local networks can thus be ensured, further improving the performance of neural network parameter initialization.
With reference to the first aspect and the foregoing implementation manner of the first aspect, in another implementation manner of the first aspect, before training each local network with the feature training set corresponding to the local network, the method further includes: adding an auxiliary output layer to each local network, among the plurality of local networks, that does not include an output layer, where the auxiliary output layer is used to make the output of the local network meet a preset condition.
According to the method for initializing the neural network parameters provided by the embodiment of the application, when some local networks do not comprise an output layer, an auxiliary output layer can be added to the local networks, so that the outputs of the local networks meet the preset conditions. The parameter initialization of the neural network can be accurately performed.
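For illustration only, an auxiliary output layer might be attached as in the sketch below, assuming a classification task in which the auxiliary head maps the output width of the local network to the number of classes; this particular construction is an assumption and not a definition given by the application.

```python
import torch.nn as nn

def add_auxiliary_output_layer(local_network: nn.Sequential,
                               local_output_width: int,
                               num_classes: int) -> nn.Sequential:
    # The auxiliary head lets a local network that lacks the output layer be trained
    # against the labels; it is discarded afterwards, and only the local network's
    # parameters are migrated into the neural network as initialization parameters.
    auxiliary_head = nn.Linear(local_output_width, num_classes)
    return nn.Sequential(local_network, auxiliary_head)
```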
In a second aspect, an apparatus for neural network parameter initialization is provided. The apparatus for neural network parameter initialization may be used to perform the method for neural network parameter initialization in the first aspect and any possible implementation manner of the first aspect. In particular, the means for neural network parameter initialization comprises means (means) for performing the steps or functions described in the first aspect above. The steps or functions may be implemented by software, or hardware, or by a combination of hardware and software.
In a third aspect, a server is provided, and the server includes a processor in its structure. The processor is configured to support the server in performing the functions of the first aspect and any possible implementation manner of the first aspect. In one possible design, the server may further include a transceiver configured to support the server in receiving or transmitting information.
In one possible design, the server may further include a memory, which is coupled with the processor and stores necessary program instructions and data in the server.
Alternatively, the server comprises a memory for storing a computer program and a processor for calling and running the computer program from the memory, so that the server performs the method for initializing the neural network parameters according to any one of the first aspect and any possible implementation manner of the first aspect.
In a fourth aspect, there is provided a computer program product comprising: a computer program (which may also be referred to as code, or instructions), which when executed, causes a computer to perform the method of neural network parameter initialization described above in the first aspect and any possible implementation manner of the first aspect.
In a fifth aspect, a computer-readable storage medium is provided, where the computer-readable storage medium stores a program, where the program makes a server in a computer perform the method for initializing the neural network parameters in the first aspect and any possible implementation manner of the first aspect.
Alternatively, the computer readable storage medium is used for storing computer software instructions for the server, which includes a program designed to execute the method for initializing neural network parameters according to any one of the first aspect and any possible implementation manner of the first aspect.
A sixth aspect provides a chip system, which includes a processor, and is configured to enable a server in a computer to implement the functions recited in the first aspect and any possible implementation manner of the first aspect.
According to the method and the device for initializing the parameters of the neural network, the plurality of local networks covering the neural network are trained, and part or all of the parameters in the trained local networks are used as the initialization parameters of the neural network, so that the convergence speed of the trained neural network and the parameter initialization performance of the neural network can be improved.
Drawings
Fig. 1 is a schematic diagram of a deep neural network 100 to which an embodiment of the present application is applicable.
Fig. 2 is a schematic diagram of a neural network parameter initialization method according to an embodiment of the present disclosure.
Fig. 3 is a schematic diagram of a local network according to an embodiment of the present application.
Fig. 4 is a schematic diagram of another local network provided in the embodiments of the present application.
Fig. 5 is a schematic diagram of determining a feature training set according to an embodiment of the present application.
Fig. 6 is a schematic diagram of determining a feature training set generation network according to an embodiment of the present application.
Fig. 7 is a schematic diagram of a local network location provided in an embodiment of the present application.
Fig. 8(a) is a schematic diagram of generating a feature training set according to an embodiment of the present application; Fig. 8(b) is another schematic diagram of generating a feature training set according to an embodiment of the present application; Fig. 8(c) is yet another schematic diagram of generating a feature training set according to an embodiment of the present application.
Fig. 9 is a flowchart of training a local network according to an embodiment of the present application.
Fig. 10 is a schematic diagram of a specific embodiment provided in an embodiment of the present application.
Fig. 11 is a schematic diagram of a local network including an auxiliary output layer according to an embodiment of the present application.
Fig. 12 is a schematic diagram of another local network including an auxiliary output layer according to an embodiment of the present application.
Fig. 13 is a schematic diagram of a feature training set generation network according to an embodiment of the present application.
Fig. 14 is a schematic diagram of a local network according to an embodiment of the present application.
Fig. 15 is a schematic diagram of another feature training set generation network provided in the embodiment of the present application.
Fig. 16 is a schematic diagram of parameter migration according to an embodiment of the present application.
Fig. 17 is a schematic diagram of another specific embodiment provided in the embodiments of the present application.
Fig. 18 is a schematic diagram of a local network including an auxiliary output layer according to an embodiment of the present application.
Fig. 19 is a schematic diagram of local network parameter initialization according to an embodiment of the present application.
Fig. 20 is a schematic diagram of another local network including an auxiliary output layer according to an embodiment of the present application.
Fig. 21 is a schematic diagram of determining a feature training set generation network according to an embodiment of the present application.
Fig. 22 is a schematic diagram of generating a feature training set according to an embodiment of the present application.
Fig. 23 is a schematic diagram of local network parameter initialization according to an embodiment of the present application.
Fig. 24 is a schematic diagram of a local network according to an embodiment of the present application.
Fig. 25 is a schematic diagram of a network for generating a training set of determined features according to an embodiment of the present application.
Fig. 26 is a schematic diagram of generating a feature training set according to an embodiment of the present application.
Fig. 27 is a schematic diagram of parameter migration according to an embodiment of the present application.
Fig. 28 is a schematic block diagram of an apparatus 2800 for neural network parameter initialization provided by an embodiment of the present application.
Fig. 29 is a schematic diagram of a server 2900 according to an embodiment of the present application.
Fig. 30 is a schematic structural diagram of a neural network processor according to an embodiment of the present application.
Detailed Description
The technical solution in the present application will be described below with reference to the accompanying drawings.
The technical scheme of the embodiment of the application can be applied to a deep neural network, for example: fig. 1 shows a deep neural network 100.
Fig. 1 is a schematic diagram of a deep neural network 100 to which an embodiment of the present application is applicable. The schematic includes an input layer, a hidden layer, and an output layer.
Deep neural networks are, literally, neural networks that are deep, that is, networks with many layers. A deep neural network includes a plurality of layers; the leftmost layer shown in fig. 1 is referred to as the input layer, and neurons located in the input layer are referred to as input neurons. The rightmost layer shown in fig. 1 is referred to as the output layer, neurons located in the output layer are referred to as output neurons, and the output layer shown in fig. 1 has only one output neuron. The layers between the input layer and the output layer are called hidden layers, and the neurons in a hidden layer are neither input neurons nor output neurons.
It should be understood that the terms input layer, output layer, and hidden layer are only used for the convenience of distinguishing different layers of the deep neural network and should not constitute any limitation on the present application. These are the terms used in the prior art, and other terms may be adopted as neural network technology develops; this is not limited in the present application. For example, other names may also be used for the hidden layer.
So far we have discussed neural networks in which the output of the previous layer is the input of the next layer. Such networks are known as feedforward neural networks. This means that there is no loop in the neural network: information always moves forward and never backward. Therefore, the direction of the neural network from the input layer to the output layer may also be referred to as the forward direction of the neural network.
For example, the output of the input layer shown in FIG. 1 is taken as the input of a first hidden layer, the output of the first hidden layer is taken as the input of a second hidden layer, and the output of the second hidden layer is taken as the input of the output layer.
However, some artificial neural networks may contain feedback loops. Such a neural network model is called a recurrent neural network. The idea of a recurrent neural network is to let a neuron fire for a limited duration of time before becoming quiescent. That firing can stimulate other neurons in the recurrent neural network to fire a little later, which in turn causes more neurons to fire, so that over time a whole cascade of neurons fires. In a recurrent neural network model, loops do not cause problems, because the output of a neuron only affects its input at a later time rather than immediately.
Recurrent neural networks have so far been less influential than feedforward neural networks, partly because the learning algorithms for recurrent networks are, at least to date, less powerful. But recurrent networks remain of great research interest: they are closer to the way our brains work than feedforward networks, and they may be able to solve problems that are difficult to solve with feedforward networks.
It should be understood that the method for initializing neural network parameters in the embodiments of the present application can be applied not only to the feedforward networks that are widely used at present but also to the recurrent neural networks described above. The method for initializing neural network parameters provided in the present application does not limit the specific type of neural network to which it is applied.
The deep neural network 100 shown in fig. 1 includes two hidden layers. The hidden layer connected to the input layer is referred to as a first hidden layer, and the hidden layer connected to the output layer is referred to as a second hidden layer. Further, it should be understood that more than two hidden layers may be included in the deep neural network, or only one hidden layer may be included in the deep neural network, and fig. 1 is only an example and does not limit the scope of the present application.
The deep neural network is briefly described above with reference to fig. 1, and in order to more clearly understand the technical solution to be described in the present application, the basic concept involved in the present application is first described below.
1. A back propagation algorithm.
Illustratively, the back propagation algorithm is explained here in the setting of supervised learning. Supervised learning means that, in order to train a model, many training samples must be provided, where each training sample includes both input features x and the corresponding output y. The output y is also called a label.
For example, features may be collected from many people, where the features of each person include age, industry, income, and so on. The neural network takes each person's features as one sample, and the model is then trained on these samples, so that the model sees both each question posed (the input features x) and the answer to that question (the label y). Once the model has seen enough samples, it can summarize some of the underlying rules and can then predict the label y for input features x it has not seen before.
Exemplarily, it is assumed that the activation function f of the neuron is an S-shaped (sigmoid) function. It should be understood that different activation functions correspond to different calculation formulas in the back propagation algorithm.
Illustratively, assume that each training sample is (x, t), where x is the feature of the training sample and t is the target value of the sample.
Based on the sigmoid function, the features x of the sample are used to calculate the output a_i of each hidden-layer neuron in the neural network and the output y_i of each node of the output layer. Then, the error term δ_i of each neuron of the output layer is calculated according to the following formula:
δ_i = y_i(1 - y_i)(t_i - y_i)
where δ_i is the error term of output-layer neuron i, y_i is the output value of output-layer neuron i, and t_i is the target value of the sample corresponding to output-layer neuron i.
The error term δ_i of each neuron of the hidden layer is
δ_i = a_i(1 - a_i) Σ_k w_ki δ_k
where a_i is the output value of hidden-layer neuron i, w_ki is the weight of the connection from hidden-layer neuron i to node k in its next layer, and δ_k is the error term of node k in the next layer of hidden-layer neuron i.
Finally, the weight on each connection is updated:
w_ij ← w_ij + η δ_j x_ij
where w_ij is the weight of the connection from node i to node j, η is a constant called the learning rate, δ_j is the error term of node j, and x_ij is the input that node i delivers to node j.
The calculation of each neuron's error term and the weight update method of the neural network have been introduced above. To calculate the error term of a neuron, the error term of each neuron in the next layer connected to that neuron must first be calculated. This requires that the error terms be calculated starting from the output layer, with the error terms of each hidden layer then calculated in reverse order, up to the hidden layer connected to the input layer. This is the origin of the name of the back propagation algorithm. After the error terms of all neurons have been calculated, all weights can be updated according to the weight update formula for each connection.
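For concreteness, a minimal NumPy sketch of one back propagation step for a single-hidden-layer sigmoid network, following the formulas above, might look like this; the variable names and dimensions are illustrative assumptions rather than anything defined by the application.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def backprop_step(x, t, W1, W2, eta=0.1):
    """One weight update for a one-hidden-layer sigmoid network.
    W1: (n_hidden, n_in) input-to-hidden weights; W2: (n_out, n_hidden) hidden-to-output weights."""
    # Forward pass: hidden outputs a_i and output-layer outputs y_i.
    a = sigmoid(W1 @ x)
    y = sigmoid(W2 @ a)

    # Error terms: delta_i = y_i(1 - y_i)(t_i - y_i) for the output layer,
    # delta_i = a_i(1 - a_i) * sum_k w_ki delta_k for the hidden layer.
    delta_out = y * (1 - y) * (t - y)
    delta_hidden = a * (1 - a) * (W2.T @ delta_out)

    # Weight update on each connection: w_ij <- w_ij + eta * delta_j * x_ij.
    W2 += eta * np.outer(delta_out, a)
    W1 += eta * np.outer(delta_hidden, x)
    return W1, W2
```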
2. The gradient is unstable.
Gradient instability includes a gradient disappearance or a gradient explosion.
Gradient vanishing means that, in some neural networks, the gradient becomes smaller and smaller when viewed from back to front through the hidden layers. This means that the front layers learn significantly more slowly than the back layers.
Gradient explosion means that, in some neural networks, the gradient becomes larger and larger when viewed from back to front through the hidden layers. This means that the front layers learn significantly faster than the back layers.
For the neural network to learn useful information during training, the parameter gradients should not be 0. In a fully connected neural network, the parameter gradients are related to the state gradients obtained by back propagation and to the activation values. Neural network parameter initialization should therefore satisfy the following two conditions (an illustrative check is sketched after this list):
Parameter initialization requirement 1: the activation values of all layers must not be saturated;
Parameter initialization requirement 2: the activation value of each layer must not be 0.
From the above basic concepts, in the course of training a deep neural network with the back propagation algorithm, gradients vanish or explode very easily when the number of network layers is large. The factors causing vanishing or exploding gradients are mainly related to the initialization parameters of the deep neural network.
Therefore, the initialization of the deep neural network parameters is very important, and a good deep neural network parameter initialization method helps improve model performance and the convergence rate of training.
To overcome the defects of the deep neural network parameter initialization methods in the prior art, the application provides a neural network parameter initialization method that can greatly improve the convergence speed and generalization performance of neural network training.
Illustratively, the neural network parameter initialization method provided by the present application can be applied to the deep neural network shown in fig. 1.
Illustratively, the neural network parameter initialization method provided by the application can also be applied to a non-fully connected neural network and the like.
It should be understood that, in the present application, a neural network refers to the following concept: an operational model composed of a large number of interconnected nodes (also called neurons). Each node represents a particular output function, called the excitation (activation) function. Each connection between two nodes carries a weighted value, called a weight, for the signal passing through that connection, which serves as the memory of the artificial neural network. The output of the network differs according to the connection pattern of the network, the weight values, and the excitation functions. The network itself is usually an approximation of some algorithm or function in nature, and may also be an expression of a logical strategy. The specific form of the neural network is not limited in the present application and may be any neural network in the prior art.
In some applications, neural networks are used to perform machine learning tasks, receive various data inputs and generate various scores, classifications, or regression outputs, etc., based on the inputs.
For example, if the input to the neural network is an image or feature extracted from an image, the output generated by the neural network for a given image may be a score for each object class in a set of object classes, where each score represents a probability or likelihood that the image contains an image of an object belonging to that class.
For another example, if the input to the neural network is an internet resource (e.g., a web page), a document or portion of a document or a feature extracted from an internet resource, document or portion of a document, the output generated by the neural network for a given internet resource, document or portion of a document may be a score for each topic in a set of topics, wherein each score represents a probability or likelihood that the internet resource, document or portion of a document is relevant to that topic.
As another example, if the input to the neural network is characteristic of the context of a particular interactive content (e.g., content containing hyperlinks to other content), the output generated by the neural network may be a score representing the probability or likelihood that the particular content will be clicked on or interacted with.
As another example, if the input to the neural network is a feature of a personalized recommendation for the user, such as a feature characterizing the context for the recommendation, or a feature characterizing a previous action taken by the user, etc., then the output generated by the neural network may be a score for each of a set of content items, where each score represents a likelihood that the user will respond to the recommended content item.
As another example, if the input to the neural network is text in one language a, the output generated by the neural network may be a score for each segment in the set of segment texts in another language B, where each score represents a probability or likelihood that a piece of text in another language B is a correct translation of the input text into another language B.
As another example, if the input to the neural network is a spoken utterance, a sequence of spoken utterances, or a feature derived from one of the two, the output generated by the neural network may be a score for each piece of text in a set of text snippets, each score representing a probability or likelihood that the piece of text is a correct transcription of the utterance or the sequence of utterances.
It should be understood that the specific tasks performed by the neural network in the present application are not limiting, and any tasks that the neural network can perform in the prior art may be used.
The following describes in detail a flow of the method for initializing neural network parameters provided in this embodiment with reference to fig. 2.
Fig. 2 is a schematic diagram of a neural network parameter initialization method according to an embodiment of the present disclosure. The schematic includes two steps S210-S220, which are described in detail below.
And S210, determining a feature training set.
And respectively determining a feature training set corresponding to each local network of the neural network, wherein the neural network comprises a plurality of local networks, and each local network in the plurality of local networks covers a part of the neural network.
It will be appreciated that the portions of the neural network covered by the local networks are not all exactly the same. If every local network in the plurality of local networks covered exactly the same portion of the neural network, the neural network would in effect contain only one local network rather than a plurality of local networks. Therefore, at least two of the plurality of local networks cover portions of the neural network that are not exactly the same.
It should be understood that each of the local networks described above covers a portion of the neural network; structurally speaking, a local network is a portion of the neural network, which can also be understood as the neural network being divided into a plurality of local networks.
For example, suppose the neural network is a 10-layer neural network comprising 5 local networks, where each local network covers 2 of the 10 layers. From the structures of the neural network and the local networks, each local network can be understood as a 2-layer portion of the neural network; equivalently, the neural network is divided into 5 local networks.
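A minimal PyTorch-style sketch of this example (the layer widths are arbitrary assumptions chosen only for illustration) could slice a 10-layer sequential network into 5 two-layer local networks as follows:

```python
import torch.nn as nn

# A toy 10-layer network; the widths are arbitrary and chosen only for illustration.
layers = [nn.Linear(32, 32) for _ in range(9)] + [nn.Linear(32, 10)]
neural_network = nn.Sequential(*layers)

# Divide the network into 5 local networks, each covering 2 consecutive layers.
# The slices share the underlying modules, so parameters learned while training a
# local network are directly the parameters of the part of the network it covers.
local_networks = [neural_network[i:i + 2] for i in range(0, 10, 2)]
```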
Illustratively, as shown in fig. 3 and 4, each of the 5 local networks covers a portion of the neural network.
Fig. 3 is a schematic diagram of a local network according to an embodiment of the present application. The schematic includes one neural network and 5 local networks.
Wherein each local network covers a portion of the neural network and there is no overlapping portion between the portions covered by each local network. Specifically, every two adjacent local networks are connected.
Optionally, each of the plurality of local networks covering a portion of the neural network includes: a first local network covers a second portion of the neural network; a second local network covers a third portion of the neural network; where the first local network and the second local network are two local networks of the plurality of local networks, and the second portion is partially or completely the same as the third portion. The case in which the second portion is partially the same as the third portion is shown in fig. 4.
Fig. 4 is a schematic diagram of another local network provided in the embodiments of the present application. The schematic includes one neural network and 5 local networks.
Wherein each local network covers a portion of the neural network with a partial overlap between the portions covered by each local network.
It should be understood that only a portion of the neural network is covered by each of the plurality of local networks, and that partial overlap of the portion covered by each local network occurs as shown in fig. 4.
Further, when each of the plurality of local networks covers a portion of the neural network, the portions covered by two of the local networks may also completely overlap, which is not described herein again.
It should also be understood that fig. 3 and fig. 4 are only examples of two forms of local networks and are not intended to limit the scope of the present application. Other local network forms are also within the scope of the present application. For example, each of 3 local networks may cover a portion of the neural network, where two of the local networks cover portions that completely overlap, and the third local network covers a portion that partially overlaps, or does not overlap, with the portions covered by the other two.
In the general case, the plurality of local networks covering the neural network can be understood as shown in fig. 3 or fig. 4: between any two consecutive local networks, there is no part of the neural network left uncovered by a local network. However, the present application is not limited to the plurality of local networks being contiguous; they may also be non-contiguous.
Further, the form in which the neural network is covered by the plurality of local networks is related to the number of local networks and the coverage relationship between the parts of the neural network covered by the plurality of local networks. It cannot be enumerated, and is not described in detail here.
Illustratively, the feature training set is the signal used to train a local network, for example input images, sounds, and the like. The specific type of the feature training set is not limited in the present application and may be determined according to the task that the neural network needs to complete.
For example, the neural network includes N local networks, where N is an integer greater than 1. Then, N feature training sets are respectively determined for the N local networks, the N feature training sets are respectively used for training the N local networks, and the N feature training sets correspond to the N local networks one to one. That is, a feature training set can only be used to train the local network corresponding to the feature training set.
Optionally, the determining, for each local network of the neural network, a feature training set corresponding to the local network respectively includes:
determining, for each local network of the neural network, a corresponding training subset and a corresponding feature training set generation network;
generating the feature training set of each local network from its training subset based on its feature training set generation network. Specifically, the training subset includes:
a preset training set, or a portion of the preset training set.
Illustratively, the preset training set is a training set of the neural network, or the preset training set is a subset of the training set of the neural network.
When the preset training set is a subset of the training set of the neural network, one half of the training set may, for example, be randomly selected with equal probability from the training set of the neural network to serve as that subset, or the subset of the training set of the neural network may be selected from the training set of the neural network in another manner.
Illustratively, the training set of the neural network may be input signals of the neural network. For example, a number of pictures, sounds, etc. of the neural network are input. In particular, the type of training set of the neural network is related to the task that the neural network needs to perform, and the application is not limited thereto.
Determining a training subset corresponding to each local network of the neural network respectively comprises:
dividing a preset training set into a plurality of training subsets, where the plurality of training subsets correspond to the plurality of local networks one to one, and the plurality of training subsets have no intersection; or,
dividing a preset training set into a plurality of training subsets, where the training subsets correspond to the local networks one to one, and intersections exist among some or all of the training subsets; or,
taking the preset training set as the plurality of training subsets, where every training subset in the plurality of training subsets is the same.
For example, the preset training set may be divided into the plurality of training subsets by equal-probability random selection, or by another dividing manner, as sketched below.
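As a small illustrative sketch (plain Python, with a list-like preset training set assumed), equal-probability random division into disjoint training subsets, one per local network, could look like this:

```python
import random

def split_training_set(preset_training_set, num_local_networks, seed=0):
    """Randomly divide the preset training set into disjoint training subsets,
    one subset per local network (the no-intersection case described above)."""
    samples = list(preset_training_set)
    random.Random(seed).shuffle(samples)   # equal-probability random selection
    return [samples[i::num_local_networks] for i in range(num_local_networks)]
```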
As described above, the assignment of the training subset to each local network in the plurality of local networks includes the following four cases:
the first condition is as follows: the training set of the neural network is divided into a plurality of training subsets the number of which is the same as that of the plurality of local networks. There may or may not be an intersection between the training subsets. And the training subsets are in one-to-one correspondence with the local networks and are respectively used for generating feature training sets of the local networks.
For example, the neural network includes N local networks in the aforementioned S210. Then, the training set of the neural network is divided into N training subsets, the N training subsets are respectively used for generating feature training sets of the N local networks, and the N training subsets are in one-to-one correspondence with the N local networks. Wherein, the N training subsets may or may not have an intersection.
Case two: the subsets of the training set of the neural network are divided into a plurality of training subsets the number of which is the same as that of the plurality of local networks. There may or may not be an intersection between the training subsets. And the training subsets are in one-to-one correspondence with the local networks and are respectively used for generating feature training sets of the local networks.
The subset of the training set of the neural network may be any one of a plurality of subsets of the training set of the neural network.
For example, the neural network includes N local networks in the aforementioned S210. Then, any subset in the training set of the neural network is divided into N training subsets, the N training subsets are respectively used for generating feature training sets of the N local networks, and the N training subsets are in one-to-one correspondence with the N local networks. Wherein, the N training subsets may or may not have an intersection.
Case three: and directly taking the training set of the neural network as the plurality of training subsets, wherein each training subset in the plurality of training subsets is the same and is respectively used for generating the characteristic training sets of the plurality of local networks.
For example, the neural network includes N local networks in the aforementioned S210. Then, the training set of the neural network is directly used as N training subsets of the N local networks, and is respectively used for generating feature training sets of the N local networks.
Case four: and directly taking the same subset of the training set of the neural network as the plurality of training subsets to generate the feature training sets of the plurality of local networks.
For example, the neural network includes N local networks in the aforementioned S210. Then, the same subset of the training set of the neural network is directly used as N training subsets of N local networks, and is respectively used for generating feature training sets of the N local networks.
In a special case, the feature training set of a certain local network is a training subset allocated by the system to the local network, which will be described in detail below with reference to fig. 6 and will not be described in detail here.
The following describes how to determine a feature training set corresponding to each local network of the neural network separately, taking the determination of the first feature training set as an example in conjunction with fig. 5.
The first feature training set is a feature training set corresponding to the first local network, and the parameters of the first local network can be obtained by training the first local network with the first feature training set. The first local network is any one of the plurality of local networks and has generality.
Fig. 5 is a schematic diagram of determining a feature training set according to an embodiment of the present application. The schematic includes steps S510-S530, which are described in detail below.
S510, a first training subset is determined.
The system assigns a first training subset to the first local network.
In particular, the first training subset comprises a preset training set, or a part of a preset training set.
S520, determining a first feature training set generation network.
The system determines a first feature training set generation network for generating the first feature training set.
It should be understood that the system should determine, for each of the plurality of local networks, a feature training set generation network corresponding to the local network to be generated.
The feature training set generation network of the local network comprises all or part of a back network, wherein the back network is a network between the local network and an input of the neural network.
In the following, an example of determining the first feature training set generation network is taken in conjunction with fig. 6, and how to determine the feature training set generation network corresponding to each local network of the neural network is described in detail. The first local network is any one of a plurality of local networks and has generality.
Fig. 6 is a schematic diagram of determining a feature training set generation network according to an embodiment of the present application. The schematic includes steps S610-S630, which are described in detail below.
S610, determining the back network.
Wherein the back network is a network between the first local network and an input of the neural network.
Specifically, the first local network and the back network are located in the neural network as shown in fig. 7. Fig. 7 is a schematic diagram of a local network location provided in an embodiment of the present application.
S620, determining the part of the back network covered by trained local networks.
Case one: no part of the back network is covered by a trained local network.
Case two: part or all of the back network is covered by at least one trained local network.
Whether the back network is covered by at least one trained local network may be determined according to the positional relationship between the back network and the plurality of local networks.
When the plurality of local networks are trained, the plurality of local networks are trained sequentially from back to front according to the front-back sequence of the plurality of local networks in the forward direction of the neural network. I.e. from the first local network to the last local network, respectively.
Therefore, whether the back network of a local network is covered by a trained local network is judged according to the positional relationship between that back network and the plurality of local networks: when the position of the back network in the neural network is covered by one or more trained local networks, the back network is covered by at least one trained local network.
S630, determining a first feature training set generation network.
Exemplarily, when the back network corresponds to case one in S620, the first feature training set generation network may be regarded as the identity function x = y; that is, the feature training set of the first local network is the training subset of the first local network. It can also be understood that, when the back network corresponds to case one in S620, the first feature training set generation network does not need to be determined.
Illustratively, when the back network is case two in S620, all or part of the above back networks constitute the first feature training set generation network.
Optionally, the initialization parameters of the feature training set generation network of a local network include:
all or part of the parameters of the last trained local network among the at least one trained local network covering the feature training set generation network of the local network; or,
parameters obtained through a preset calculation on all or part of the parameters of the at least one trained local network covering the feature training set generation network of the local network.
Then, the initialization parameters of the first feature training set generation network include:
all or part of the parameters of the last trained local network among the at least one trained local network covering the first feature training set generation network; or,
parameters obtained through a preset calculation on all or part of the parameters of the at least one trained local network covering the first feature training set generation network. The preset calculation may be averaging all or part of the parameters of the at least one trained local network covering the first feature training set generation network, or another operation, as sketched below.
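As an illustrative sketch of one such preset calculation (averaging), assuming PyTorch modules with identical architectures so that their state dictionaries share the same keys; the function name is a placeholder and not part of the application:

```python
import torch

def average_parameters(trained_local_networks):
    """Average the parameters of several trained local networks covering the same
    feature training set generation network (one possible preset calculation)."""
    state_dicts = [net.state_dict() for net in trained_local_networks]
    averaged = {}
    for key in state_dicts[0]:
        averaged[key] = torch.stack([sd[key].float() for sd in state_dicts]).mean(dim=0)
    return averaged   # usable via generation_network.load_state_dict(averaged)
```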
It should be understood that, in the embodiment of the present application, there is no limitation on how all or part of the parameters of the at least one trained local network covering the first feature training set generation network is calculated to obtain the initialization parameters of the first feature training set generation network.
Specifically, when the back network is the case two in S620, determining the first feature training set generation network and determining the initialization parameter of the first feature training set generation network include the following four cases:
the first condition is as follows: part of the network of the back network is covered by a trained local network. In the following, the part of the back network covered by one trained local network is referred to as a first partial network of the back network, and the part of the back network not covered by the trained local network is referred to as a second partial network of the back network.
First, deleting a second part of the network of the rear part;
when the first partial network of the back network after deleting the second partial network of the back network is a continuous one: and taking the first part network of the rear network as the first feature training set generation network, and taking all or part of the parameters of the trained local network covering the first part network of the rear network as initialization parameters of the first feature training set generation network.
When the first partial network of the rear network after deleting the second partial network of the rear network is a plurality of discontinuous networks: and directly connecting the plurality of discontinuous networks according to the sequence to be used as the first characteristic training set generation network, and using all or part of the parameters of the trained local network of the first part network covering the rear network as the initialization parameters of the first characteristic training set generation network.
Case two: a part of the back network is covered by a plurality of trained local networks. In the following, the part of the back network covered by the plurality of trained local networks is referred to as the first partial network of the back network, and the part of the back network not covered by any trained local network is referred to as the second partial network of the back network.
First, the second partial network of the back network is deleted;
when, after the second partial network of the back network is deleted, the first partial network of the back network is one continuous network: the first partial network of the back network is taken as the first feature training set generation network, and all or part of the parameters of the last trained local network among the plurality of trained local networks covering the first feature training set generation network are taken as the initialization parameters of the first feature training set generation network;
or, all or part of the parameters of the first trained local network among the plurality of trained local networks covering the first feature training set generation network are taken as the initialization parameters of the first feature training set generation network;
or, parameters obtained through a preset calculation on all or part of the parameters of the plurality of trained local networks covering the first feature training set generation network are taken as the initialization parameters of the first feature training set generation network, and the like.
When, after the second partial network of the back network is deleted, the first partial network of the back network consists of a plurality of discontinuous networks: the plurality of discontinuous networks are directly connected in order and taken as the first feature training set generation network. The initialization parameters of the first feature training set generation network are determined similarly to the case in which the first partial network of the back network is one continuous network, and are not described again here.
In general, the discontinuous networks can be directly connected without introducing new parameters.
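As a minimal sketch of such a direct connection, assuming the surviving sub-networks are available as PyTorch modules whose output and input shapes are compatible (an illustrative simplification), the pieces can simply be chained in their original order without introducing any new trainable parameters:

```python
import torch.nn as nn

def connect_pieces(pieces):
    """Directly connect the remaining, discontinuous sub-networks in
    their original order to form the feature training set generation
    network; no new trainable parameters are introduced."""
    return nn.Sequential(*pieces)
```

For example, if only the pieces covering layers 1-3 and layers 7-9 survive the deletion, connect_pieces([piece_a, piece_b]) forward-propagates data through them back to back, where piece_a and piece_b are hypothetical module names.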
Case three: when the entire back network is covered by one trained local network, the back network is taken as the first feature training set generation network.
The initialization parameters of the first feature training set generation network are all or part of the parameters of that trained local network. Case four: when the entire back network is covered by a plurality of trained local networks, the back network is taken as the first feature training set generation network.
Taking all or part of the parameters of the last trained local network in a plurality of trained local networks covering the first feature training set generation network as the initialization parameters of the first feature training set generation network;
or, all or part of the parameters of the first trained local network in a plurality of trained local networks covering the first feature training set generation network is used as the initialization parameters of the first feature training set generation network;
or, all or part of parameters of a plurality of trained local networks covering the first feature training set generation network are subjected to preset calculation to obtain parameters which are used as initialization parameters of the first feature training set generation network, and the like.
It should be understood that the first local network illustrated in fig. 6 is any one of the aforementioned local networks and is therefore generic. Accordingly, the feature training set generation network corresponding to each of the plurality of local networks may be determined by using the method shown in fig. 6, which is not described in detail again here.
S530, determining a first feature training set.
The feature training set corresponding to each local network is generated based on the feature training set generation network of that local network and its training subset.
Then the first feature training set is generated based on the first feature training set generation network and the first training subset.
Optionally, the feature training set of the local network includes data and a label, where the data is the output obtained by forward-propagating the training subset of the local network through the feature training set generation network of the local network, and the label is the label of the neural network.
Specifically, determining the first feature training set includes the following cases:
Case one: the system assigns a different training subset to each local network, as described in case one or case two in S510. The training subset of the first local network is the first training subset, the feature training set generation network of the first local network is the first feature training set generation network, and part or all of the back network of the first local network is covered by at least one trained local network.
The feature training set of the first local network, referred to as the first feature training set, is determined based on the first training subset and the first feature training set generation network. The first training subset is the training subset corresponding to the first local network among the plurality of training subsets allocated by the system to the plurality of local networks.
Taking the determination of the first feature training set as an example, how the feature training set of each local network is generated from its feature training set generation network and its training subset is described in detail below in conjunction with fig. 8.
Fig. 8 a is a schematic diagram of generating a feature training set according to an embodiment of the present application. The schematic includes a first feature training set generation network, a first training subset, and data. Wherein the data is data included in the first feature training set.
As shown in a in fig. 8, the first training subset is propagated forward through the first feature training set generation network, and the output of the first feature training set generation network is the data included in the first feature training set, where the first local network is any one of the plurality of local networks. The data of the first feature training set are combined with the labels to obtain the first feature training set.
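A minimal sketch of this generation step is given below, assuming a PyTorch DataLoader over the first training subset and a module gen_net standing in for the first feature training set generation network; these names and the use of PyTorch are illustrative only.

```python
import torch

@torch.no_grad()
def build_feature_training_set(gen_net, subset_loader, device="cpu"):
    """Forward-propagate the training subset through the feature training
    set generation network; the outputs, paired with the original labels,
    form the feature training set of the local network."""
    gen_net.eval()
    features, labels = [], []
    for x, y in subset_loader:
        features.append(gen_net(x.to(device)).cpu())
        labels.append(y)
    return torch.cat(features), torch.cat(labels)
```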
Case two: the system assigns a different training subset to each local network, as described in case one or case two in S510. The training subset of the first local network is the first training subset, the feature training set generation network of the first local network is the first feature training set generation network, no part of the back network of the first local network is covered by a trained local network, and the first feature training set generation network can be understood as y = x. The first feature training set is then the first training subset.
Case three: the system assigns the same training subset to each local network, as described in case three or case four in S510. The training subset of the first local network is the first training subset, the feature training set generation network of the first local network is the first feature training set generation network, no part of the back network of the first local network is covered by a trained local network, and the first feature training set generation network can be understood as y = x. The first feature training set is then the first training subset.
Case four: the system assigns the same training subset to each local network, as described in case three or case four in S510. The training subset of the first local network is the first training subset, the feature training set generation network of the first local network is the first feature training set generation network, and part of the first feature training set generation network is covered by all or part of a trained second local network.
The feature training set of the first local network, referred to as the first feature training set, is determined based on the feature training set of the second local network and all or part of the second local network.
Fig. 8 b is a schematic diagram of another feature training set generation provided in the embodiment of the present application. The schematic includes all or part of a second local network, a training set of features for the second local network, and data. Wherein the data is data included in the first feature training set.
As shown in b in fig. 8, the feature training set of the second local network is propagated forward through all or part of the second local network, and the output of all or part of the second local network is the data included in the first feature training set, where the first local network is any one of the plurality of local networks. The data of the first feature training set are combined with the labels to obtain the first feature training set.
Case five: the system assigns the same training subset to each local network, as described in case three or case four in S510. The training subset of the first local network is the first training subset, the feature training set generation network of the first local network is the first feature training set generation network, and the local networks are directly connected to one another (as shown in fig. 3). The second local network is assumed to be a local network directly connected to the first local network and located behind the first local network in the forward direction of the neural network.
The feature training set of the first local network is determined based on the feature training set of the second local network and the second local network, referred to as a first feature training set.
Fig. 8 c is a schematic diagram of another feature training set generation provided by the embodiment of the present application. The schematic includes the second local network, the feature training set of the second local network, and data, where the data is the data included in the first feature training set.
As shown in c in fig. 8, the feature training set of the second local network is propagated forward through the second local network, and the output of the second local network is the data included in the first feature training set, where the first local network is any one of the plurality of local networks. The data of the first feature training set are combined with the labels to obtain the first feature training set.
S220, training a local network.
Each local network is trained by using its corresponding feature training set to obtain the parameters of that local network. Specifically, each local network is trained based on the feature training set corresponding to it.
Optionally, the training using the feature training set corresponding to each local network respectively comprises:
and in the forward direction of the neural network, training the local networks in sequence from back to front according to the front-back sequence of the local networks.
Optionally, before each local network is trained by using the feature training set corresponding to the local network, an auxiliary output layer is added to a local network, which does not include an output layer, of the plurality of local networks, and the auxiliary output layer is used for supporting that the output of the local network meets a preset condition.
For example, training a local network based on its feature training set in the present application may use any prior-art algorithm for training a neural network based on a training set; this is not limited by the present application.
For example, the stochastic gradient descent method and its variants may be used.
The following takes training the first local network as an example, and briefly introduces a procedure of training the local network in the present application with reference to fig. 9.
Fig. 9 is a flowchart of training a local network according to an embodiment of the present application. The flowchart includes steps S910-S930, which are described in detail below.
S910, initializing a first local network parameter.
The initialization of the parameters of the first local network includes the following three cases:
Case one: the initialization parameters of the portion of the neural network covered by the first local network are directly inherited from the first initialization parameters of the neural network. The first initialization parameters of the neural network may be initialization parameters obtained by initializing the parameters of the neural network based on the prior art.
It can be understood that the method for initializing the neural network parameters provided in the embodiments of the present application is to enhance or optimize the first initialization parameters of the neural network.
For example, if the first local network covers the input layer portion of the neural network, the parameter initialization of the first local network may directly inherit, from the first initialization parameters of the neural network, the initialization parameters of the input layer portion of the neural network.
Case two: a prior-art network parameter initialization method is adopted to perform the parameter initialization of the first local network.
For example, the parameter initialization of the first local network is performed by the random method described above.
Case three: parameters are migrated to the first local network.
If the initialization parameters of the first local network are contained in a trained local network, the corresponding parameters are migrated to the first local network.
Optionally, fig. 9 further includes S911, adding an auxiliary output layer for the first local network.
Adding an auxiliary output layer for a first local network, wherein the auxiliary output layer is used for supporting that the output of the first local network meets a preset condition, and the first local network is a local network which does not include an output layer in the plurality of local networks.
The main function of the auxiliary output layer of each local network is to make the output of the local network meet the task requirement of the neural network, so that the data and labels in the feature training set can be used for training. If the auxiliary output layer has parameters to be initialized, they are initialized by adopting an existing initialization scheme.
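The following sketch shows one possible auxiliary output layer of the kind used in the embodiments below (average pooling, a 10-class fully-connected layer and a softmax), with the fully-connected parameters initialized uniformly in [-1, 1]. The class name, the use of PyTorch, and the assumption that the local network outputs a 4-D feature map are illustrative only; the softmax is left to the loss function, which is a common implementation choice rather than a requirement of this application.

```python
import torch.nn as nn

class AuxiliaryOutputLayer(nn.Module):
    """Auxiliary output head appended to a local network that has no
    output layer, so that its output matches the classification task."""
    def __init__(self, in_channels, num_classes=10):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)             # average pooling layer
        self.fc = nn.Linear(in_channels, num_classes)   # 10-class fully-connected layer
        nn.init.uniform_(self.fc.weight, -1.0, 1.0)     # uniform over [-1, 1]
        nn.init.uniform_(self.fc.bias, -1.0, 1.0)

    def forward(self, x):
        x = self.pool(x).flatten(1)
        # Logits are returned; the softmax is applied implicitly by a loss
        # such as cross-entropy during training.
        return self.fc(x)
```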
S920, training the first local network.
Specifically, after the initialization of the first local network parameters is completed, the network composed of the first local network and its auxiliary output layer is trained by using the first feature training set until the training is completed.
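A minimal sketch of this training step is given below, assuming the first local network and its auxiliary output layer have already been composed into a single PyTorch module and the first feature training set is wrapped in a DataLoader; the hyper-parameters are placeholders and not values prescribed by this application.

```python
import torch
import torch.nn as nn

def train_local_network(local_net_with_head, feature_loader,
                        epochs=10, lr=0.01, device="cpu"):
    """Train a local network (plus its auxiliary output layer, if any)
    on its feature training set with stochastic gradient descent."""
    local_net_with_head.to(device).train()
    criterion = nn.CrossEntropyLoss()
    optimizer = torch.optim.SGD(local_net_with_head.parameters(), lr=lr)
    for _ in range(epochs):
        for data, label in feature_loader:
            data, label = data.to(device), label.to(device)
            optimizer.zero_grad()
            loss = criterion(local_net_with_head(data), label)
            loss.backward()
            optimizer.step()
    return local_net_with_head
```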
The first local network is any one of the plurality of local networks, and each of the plurality of local networks is trained according to the local network training method shown in fig. 9, so as to obtain the parameters of each of the plurality of local networks.
And S230, determining the neural network initialization parameters.
The initialization parameters of the neural network include: all or a portion of the parameters of the plurality of local networks.
Further, all or part of the parameters of the trained local networks are migrated to the corresponding positions in the neural network, which completes the initialization of the neural network parameters. Here, the corresponding position refers to the portion of the neural network covered by the local network.
Optionally, when a part of the neural network is covered by a plurality of local networks, the corresponding parameter values of the last trained local network among the plurality of local networks are selected and migrated into that part of the neural network; or,
when a part of the neural network is covered by a plurality of local networks, the corresponding parameter values of the first trained local network among the plurality of local networks are selected and migrated into that part of the neural network.
Illustratively, the initialization parameters of the first part of the neural network are taken as an example for simple explanation. Wherein the first part of the neural network is any one part of the neural network.
Case one: the first part is covered by one local network. All or part of the parameters of the trained local network covering the first part are used as the initialization parameters of the first part.
When the first part is covered by the whole of the local network, all of the parameters of the local network are used as the initialization parameters of the first part; when the first part is covered by a partial network of the local network, the parameters of the partial network of the local network covering the first part are used as the initialization parameters of the first part.
Case two: the first part is covered by a plurality of local networks. All or part of the parameters of the last trained local network among the plurality of trained local networks covering the first part are used as the initialization parameters of the first part.
Case three: the first part is not covered by any local network. The initialization parameters of the first part are the initialization parameters obtained when the parameters of the neural network are initialized based on the prior art; that is, the initialization parameters of the first part are not optimized.
Fig. 2-9 illustrate the main flow of the neural network parameter initialization method of the present application in detail.
In the present application, the above process, in which a neural network comprising several local networks is initialized by training the several local networks and migrating the parameter values back to the neural network, is referred to as local training of the neural network. Meanwhile, the training set used for training a local network is referred to as the feature training set of that local network.
In the local training of the neural network, since the number of layers included in the local network is small, the convergence rate in the training is high. Meanwhile, the feature training set of the local network comes from the forward propagation process of the trained local network and contains feature information with a certain abstraction degree. Therefore, the local training of the neural network can obtain a good network parameter initialization state at a low cost, and the training convergence process of the neural network is accelerated. In addition, the local network contains fewer layers, so that the over-fitting problem does not exist in local training, the over-fitting problem of the neural network is reduced, and the generalization performance of the neural network is improved.
The method for initializing neural network parameters provided by the present application will be described in detail below with reference to specific embodiments.
First, a neural network is taken as an example of a convolutional neural network for image classification, and a specific embodiment of the method for initializing the neural network parameters in the present application is described.
A Convolutional Neural Network (CNN) is a deep neural network with a convolutional structure, and is a deep learning (deep learning) architecture, where the deep learning architecture refers to learning at multiple levels in different abstraction levels through a machine learning algorithm. As a deep learning architecture, CNN is a feed-forward artificial neural network in which individual neurons respond to overlapping regions in an image input thereto.
Illustratively, a convolutional neural network is composed of an input layer, a Convolution (CONV) layer, an activation function layer, a Batch Normalization (BN) layer, a pooling (pool) layer, a fully-connected (FC) layer, and the like.
Specifically, the convolutional neural network used in this embodiment includes 14 convolutional layers, 14 batch normalization layers, 14 activation function layers, 2 pooling layers, a fully-connected layer for 10 classes, and a softmax (Softmax) layer for outputting decision probabilities.
Each convolutional layer is followed by a BN layer and an activation function layer, and for simplicity of expression, the combination of the three layers is denoted as convolutional layer (CONV). Meanwhile, it is assumed that the convolutional neural network has been parameter-initialized using an existing initialization method, see the left side of fig. 10. Shown on the left side of fig. 10 is the convolutional neural network after initialization of parameters according to the prior art, wherein the parameters in each layer are the first initialization parameters of the convolutional neural network.
For convenience of expression, the convolutional neural network is referred to as a global network in the following embodiment.
Fig. 10 is a schematic diagram of a specific embodiment provided in an embodiment of the present application. The schematic includes a convolutional neural network, local network 1-local network 3. Wherein 7 × 7 and 3 × 3 shown in fig. 10 indicate that the convolution kernels of the convolution layers have sizes of 7 × 7 and 3 × 3, respectively; 1/2 shown in FIG. 10 refers to a sliding step size of 2 for the convolution kernel; the 64, 128, 256, 512 shown in FIG. 10 refer to the number of output channels.
First, the global network includes three local networks (e.g., local network 1, local network 2, and local network 3 shown in fig. 10), the three local networks respectively cover different parts of the global network, and there is no overlapping part among the three local networks, as shown in fig. 10.
Illustratively, the training set of the global network is divided, by equal-probability random selection, into three mutually disjoint groups containing the same number of samples. The three training subsets are denoted training subset 1, training subset 2 and training subset 3, and are used to generate the feature training sets of local network 1, local network 2 and local network 3, respectively.
It should be understood that the above-mentioned assignment of the training subsets 1 to 3 for the local network 1 to the local network 3 is only an example and is not intended to limit the scope of the present application. In the embodiment of the present application, the training set of the global network may also be directly used as the training subset 1 to the training subset 3; alternatively, a subset of the training set of the global network is directly used as the training subset 1-the training subset 3.
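A minimal sketch of this equal-probability split is given below, assuming the training set is an index-addressable PyTorch Dataset; the use of random_split and the seed are illustrative only.

```python
import torch
from torch.utils.data import random_split

def split_training_set(train_set, num_subsets=3, seed=0):
    """Randomly divide the training set into mutually disjoint subsets
    of (nearly) equal size, one per local network."""
    n = len(train_set)
    sizes = [n // num_subsets] * num_subsets
    sizes[-1] += n - sum(sizes)  # absorb any remainder in the last subset
    generator = torch.Generator().manual_seed(seed)
    return random_split(train_set, sizes, generator=generator)
```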
The first local network in the forward direction of the global network (i.e. local network 1) is initialized with parameters: the initialization parameters of the local network adopt the initialization parameters of the partial network covered by the first local network in the global network.
Since the output layer is not included in local network 1, an auxiliary output layer is added to local network 1, and the auxiliary output layer includes: one average pooling layer, one 10-class fully-connected layer, and one SoftMax layer. The parameters of the fully-connected layer are initialized with a uniform distribution over [-1, 1], as shown in fig. 11.
Fig. 11 is a schematic diagram of a local network including an auxiliary output layer according to an embodiment of the present application. The schematic comprises a local network 1 and an auxiliary output layer.
Further, since no part of the back network of local network 1 is covered by a trained local network, training subset 1 is used as the feature training set of local network 1 and is denoted feature training set 1.
The network consisting of local network 1 and its auxiliary output layer in fig. 11 is trained by using feature training set 1 and the stochastic gradient descent method until the training is finished. Specifically, any prior-art training method may be adopted.
Further, for the second local network in the forward direction of the global network (i.e. local network 2), parameter initialization is performed: the initialization parameters of the local network adopt the initialization parameters of a part of the global network covered by a second local network.
Since the output layer is not included in local network 2, an auxiliary output layer is added to local network 2, which includes: an average pooling layer, a 10-class fully-connected layer and a SoftMax layer, where the parameters of the fully-connected layer are initialized with a uniform distribution over [-1, 1], as shown in fig. 12.
Fig. 12 is a schematic diagram of another local network including an auxiliary output layer according to an embodiment of the present application. The schematic comprises a local network 2 and an auxiliary output layer.
The back network of local network 2 is determined in the global network; the back network of local network 2 coincides with the whole of local network 1. Therefore, the trained local network 1 is used as the feature training set generation network of local network 2, as shown in fig. 13.
Fig. 13 is a schematic diagram of a feature training set generation network according to an embodiment of the present application. The schematic diagram includes a trained local network 1, a training subset 2, and a feature training set 2.
Fig. 13 shows that training subset 2 is propagated forward through the feature training set generation network of local network 2, and the output result together with the corresponding labels is used as the feature training set of local network 2, denoted feature training set 2. Here, the feature training set generation network is the trained local network 1.
The network consisting of local network 2 and its auxiliary output layer in fig. 12 is trained by using feature training set 2 and the stochastic gradient descent method until the training is finished.
Further, for the third local network in the forward direction of the global network (i.e. local network 3), parameter initialization is performed: the initialization parameters of the local network adopt the initialization parameters of a part of the global network covered by a third local network.
Since the local network is already provided with an output layer, there is no need to add an auxiliary output layer, as shown in fig. 14.
Fig. 14 is a schematic diagram of a local network according to an embodiment of the present application. The schematic diagram includes a local network 3.
Since the entire back network of local network 3 is covered by the trained local network 1 and local network 2, the network formed by the trained local network 1 and local network 2 is used as the feature training set generation network of local network 3, as shown in fig. 15.
Fig. 15 is a schematic diagram of another feature training set generation network provided in the embodiment of the present application. The schematic includes a trained local network 1, a trained local network 2, a training subset 3, and a feature training set 3.
Fig. 15 shows that training subset 3 is propagated forward through the feature training set generation network of local network 3, and the output result together with the corresponding labels is used as the feature training set of local network 3, denoted feature training set 3. The feature training set generation network is composed of the trained local network 1 and the trained local network 2.
Local network 3 is trained by using feature training set 3 and the stochastic gradient descent method until the training is finished.
The trained parameter values of the local network 1, the local network 2 and the local network 3 are migrated to corresponding parameters in the global network, so as to complete the initialization enhancement of the parameters of the global network, as shown in fig. 16.
Fig. 16 is a schematic diagram of parameter migration according to an embodiment of the present application. The schematic comprises a global network, a trained local network 1, a trained local network 2 and a trained local network 3.
As shown in fig. 16, local network 1 covers layers 1 to 6 of the global network, local network 2 covers layers 7 to 12, and local network 3 covers layers 13 to 19. The parameters of the trained local network 1 are taken as the initialization parameters of layers 1-6 of the global network; the parameters of the trained local network 2 as the initialization parameters of layers 7-12 of the global network; and the parameters of the trained local network 3 as the initialization parameters of layers 13-19 of the global network.
In the following, the convolutional neural network on the left side of fig. 10 is taken as an example, and it is also assumed that the convolutional neural network has been initialized with parameters by using the existing initialization method.
First, the global network is covered by three local networks, namely local network 1, local network 2, and local network 3. Here, local network 1 and local network 2 both cover one part of the global network, and local network 2 and local network 3 both cover another part of the global network, as shown in fig. 17.
Fig. 17 is a schematic diagram of another specific embodiment provided in the embodiments of the present application. The schematic includes a convolutional neural network, local network 1-local network 3.
Illustratively, 1/2 of the samples of the training set of the global network are randomly chosen with equal probability and denoted as the training subset, and this training subset is used to generate the feature training sets of local network 1, local network 2 and local network 3.
It should be understood that the above-mentioned configuration of the same training subset for the local network 1 to the local network 3 is only an example and is not intended to limit the scope of the present application.
For the first local network in the forward direction of the global network (i.e. local network 1), parameter initialization is performed: the initialization parameters of the local network adopt the initialization parameters of the partial network covered by the first local network in the global network.
Since the output layer is not included in local network 1, an auxiliary output layer is added to local network 1, and the auxiliary output layer includes: one average pooling layer, one 10-class fully-connected layer, and one SoftMax layer. The parameters of the fully-connected layer are initialized with a uniform distribution over [-1, 1], as shown in fig. 18.
Fig. 18 is a schematic diagram of a local network including an auxiliary output layer according to an embodiment of the present application. The schematic comprises a local network 1 and an auxiliary output layer.
Further, since no part of the back network of local network 1 is covered by a trained local network, the training subset is used as the feature training set of local network 1 and is denoted feature training set 1.
The network consisting of local network 1 and its auxiliary output layer in fig. 18 is trained by using feature training set 1 and the stochastic gradient descent method until the training is finished. Specifically, any prior-art training method may be adopted.
Further, for the second local network in the forward direction of the global network (i.e. local network 2), the parameter initialization includes:
first, a part of the local network 1 overlapping with the local network 2 is referred to as a first partial network of the local network 2, wherein the initialization parameter of the first partial network of the local network 2 is a parameter of a part of the trained local network 1 overlapping with the local network 2. As shown in fig. 19.
Fig. 19 is a schematic diagram of local network parameter initialization according to an embodiment of the present application. The schematic diagram includes a local network 2 and a trained local network 1.
Next, a part of the local network 2 that does not overlap with the local network 1 is referred to as a second partial network of the local network 2, and the initialization parameter of the second partial network of the local network 2 is an initialization parameter of a partial network of the global network that is covered by the second partial network of the local network 2.
Illustratively, the initialization parameters of the local network 2 may not need to be migrated from the trained local network 1 as shown in fig. 19, and the initialization parameters of the partial network covered by the local network 2 in the global network are directly used as the initialization parameters of the local network 2.
Since the output layer is not included in local network 2, an auxiliary output layer is added to local network 2, which includes: an average pooling layer, a 10-class fully-connected layer and a SoftMax layer, where the parameters of the fully-connected layer are initialized with a uniform distribution over [-1, 1], as shown in fig. 20.
Fig. 20 is a schematic diagram of another local network including an auxiliary output layer according to an embodiment of the present application. The schematic comprises a local network 2 and an auxiliary output layer.
The back network of local network 2 is determined from the global network. As can be seen from the way the global network shown in fig. 17 is covered by local networks 1-3, the whole back network of local network 2 is covered by a partial network of the trained local network 1. Then, the parameters of the partial network of the trained local network 1 covering the back network of local network 2 are used as the initialization parameters of the corresponding layers of the back network of local network 2 (as shown in fig. 21), and the back network of local network 2 with these initialization parameters is used as the feature training set generation network of local network 2.
Fig. 21 is a schematic diagram of determining a feature training set generation network according to an embodiment of the present application. The schematic diagram includes a feature training set generation network of a local network 2 and a trained local network 1.
Fig. 21 shows the migration of the parameters of the trained local network 1 to the part of the back network of local network 2 that is covered by local network 1. The feature training set generation network of local network 2 is thereby determined: it is the back network of local network 2, and its initialization parameters are the parameters of the partial network of the trained local network 1 covering the back network of local network 2.
Further, the training subset is propagated forward through the feature training set generation network of local network 2, and the output result together with the labels is used as the feature training set of local network 2, denoted feature training set 2, as shown in fig. 22.
Fig. 22 is a schematic diagram of generating a feature training set according to an embodiment of the present application. The schematic includes a feature training set generation network for the local network 2, a training subset, and a feature training set 2.
The network consisting of local network 2 and its auxiliary output layer in fig. 20 is trained by using feature training set 2 and the stochastic gradient descent method until the training is finished.
Further, for the third local network in the forward direction of the global network (i.e. local network 3), the parameter initialization includes:
first, a part of the local network 3 overlapping with the local network 2 is referred to as a first partial network of the local network 3, wherein the initialization parameter of the first partial network of the local network 3 is a parameter of a part of the trained local network 2 overlapping with the local network 3. As shown in fig. 23.
Fig. 23 is a schematic diagram of local network parameter initialization according to an embodiment of the present application. The schematic diagram includes a local network 3 and a trained local network 2.
Next, a part of the local network 3 that does not overlap with the local network 2 is referred to as a second partial network of the local network 3, and the initialization parameter of the second partial network of the local network 3 is an initialization parameter of a partial network of the global network that is covered by the second partial network of the local network 3.
Illustratively, the initialization parameters of the local network 3 may not need to be migrated from the trained local network 2 as shown in fig. 23, and the initialization parameters of the partial network covered by the local network 3 in the global network are directly used as the initialization parameters of the local network 3.
The local network 3 is already provided with an output layer and therefore no additional auxiliary output layer needs to be added, as shown in fig. 24. Fig. 24 is a schematic diagram of a local network according to an embodiment of the present application. The schematic diagram includes a local network 3.
The back network of local network 3 is determined from the global network. As can be seen from the way the global network shown in fig. 17 is covered by local networks 1-3, the whole back network of local network 3 is covered by the trained local network 1 and the trained local network 2.
Then, the parameters of the part of the trained local network 1 covering the back network of local network 3 are migrated to the corresponding parameters of the back network of local network 3 (as shown on the right side of fig. 25), and the parameters of the part of the trained local network 2 covering the back network of local network 3 are migrated to the corresponding parameters of the back network of local network 3 (as shown on the left side of fig. 25).
As can be seen from fig. 25, layers 7 and 8 of the back network of local network 3 are covered by partial networks of both the trained local network 1 and the trained local network 2. In this embodiment of the present application, the parameters of the partial network of the trained local network 2 covering layers 7 and 8 of the back network of local network 3 are used as the initialization parameters of layers 7 and 8 of the back network of local network 3.
It should be understood that the initialization parameters of layers 7 and 8 of the back network of local network 3 may also be selected as the parameters of the partial network of the trained local network 1 covering layers 7 and 8 of the back network of local network 3; or,
the initialization parameters of layers 7 and 8 of the back network of local network 3 may also be obtained by a weighted average, or another calculation, of the parameters of the trained local network 2 covering those layers and the parameters of the trained local network 1 covering those layers.
Fig. 25 is a schematic diagram of a network for generating a training set of determined features according to an embodiment of the present application. The schematic diagram includes a feature training set generation network of a local network 3, a trained local network 1, and a trained local network 2.
The training subset is propagated forward through the feature training set generation network of local network 3, and the output result together with the labels is used as the feature training set of local network 3, denoted feature training set 3, as shown on the left side of fig. 26.
Because all local networks use the same training subset, feature training set 2 of local network 2 may alternatively be propagated forward through the part of the feature training set generation network of local network 3 that is covered by local network 2, so as to generate feature training set 3 of local network 3, as shown on the right side of fig. 26. Fig. 26 is a schematic diagram of generating a feature training set according to an embodiment of the present application.
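A minimal sketch of this shortcut is given below, assuming covered_part denotes the portion of the feature training set generation network of local network 3 that is covered by local network 2, and that feature training set 2 is available as tensors; all names are illustrative.

```python
import torch

@torch.no_grad()
def extend_feature_training_set(covered_part, features_2, labels_2,
                                batch_size=64):
    """Reuse feature training set 2: forward-propagate its data through the
    part of local network 3's generation network covered by local network 2
    to obtain the data of feature training set 3; the labels are unchanged."""
    covered_part.eval()
    outputs = []
    for i in range(0, features_2.size(0), batch_size):
        outputs.append(covered_part(features_2[i:i + batch_size]))
    return torch.cat(outputs), labels_2
```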
Local network 3 in fig. 24 is trained by using the aforementioned feature training set 3 and the stochastic gradient descent method until the training is completed.
The parameter values of the trained local network 1, the trained local network 2, and the trained local network 3 are migrated to the corresponding parameters in the global network; for a part of the global network covered by a plurality of local networks, the parameter values corresponding to the last trained of those local networks are selected for migration, as shown in fig. 27.
Fig. 27 is a schematic diagram of parameter migration according to an embodiment of the present application. The schematic comprises a global network, a trained local network 1, a trained local network 2 and a trained local network 3.
As shown in fig. 27, local network 1 covers layers 1 to 8 of the global network, local network 2 covers layers 7 to 14, and local network 3 covers layers 13 to 19. Layers 7 and 8 of the global network are covered by both local network 1 and local network 2, and of the two, local network 2 is trained last; layers 13 and 14 of the global network are covered by both local network 2 and local network 3, and of the two, local network 3 is trained last.
The parameters of layers 1-6 of the trained local network 1 are taken as the initialization parameters of layers 1-6 of the global network; the parameters of layers 1-6 of the trained local network 2 as the initialization parameters of layers 7-12 of the global network; and the parameters of layers 1-7 of the trained local network 3 as the initialization parameters of layers 13-19 of the global network.
It should be understood that the two embodiments described above with reference to fig. 10-27 are only exemplary and should not be construed as limiting the scope of the present application, and that other, easily conceivable and modified embodiments are within the scope of the present application.
In the embodiments of the present application, the first, second, third, and the like are merely for convenience of distinguishing different objects, and should not be construed as limiting the present application in any way. For example, to distinguish between different ones of a plurality of local networks, etc. In addition, terms such as "forward direction" and "forward propagation" appearing in the present application are terms commonly used in the prior art, and are not limited to the present application, and may also be referred to as "forward direction", "forward propagation" or other terms specified in the subsequent technical development.
The method for initializing neural network parameters provided by the embodiment of the present application is described in detail above with reference to fig. 2 to 27. The following describes in detail the apparatus for initializing neural network parameters provided in the embodiments of the present application with reference to fig. 28 to 30.
Fig. 28 is a schematic block diagram of an apparatus 2800 for neural network parameter initialization according to an embodiment of the present application, where the apparatus for neural network parameter initialization includes a processing unit 2801, a parameter determining unit 2802, and a training unit 2803.
A processing unit 2801, configured to determine a feature training set corresponding to each local network of the neural network, where the neural network includes a plurality of local networks, and each local network of the plurality of local networks covers a part of the neural network.
A training unit 2803, configured to train each local network with a feature training set corresponding to the local network to obtain parameters of the local network;
a parameter determining unit 2802, configured to determine an initialization parameter of the neural network, where the initialization parameter of the neural network includes: all or a portion of the parameters of the plurality of local networks.
The parameter determination unit 2802 determines initialization parameters of the neural network, including: the initialization parameters of the first part of the neural network are all or part of the parameters of the last trained local network in a first local network set, the first local network set comprises one or more trained local networks, and the one or more trained local networks cover the first part of the neural network.
The processing unit 2801, configured to determine the feature training set corresponding to each local network of the neural network, is specifically configured as follows: the processing unit 2801 determines a training subset and a feature training set generation network corresponding to each local network of the neural network; and the processing unit 2801 generates the feature training set corresponding to each local network based on the feature training set generation network and the training subset of that local network.
Specifically, the training subset includes: the training set is preset, or a part of the training set is preset.
Specifically, the feature training set generation network of the local network includes all or part of a back network, where the back network is a network between the local network and an input of the neural network. Specifically, the initialization parameters of the feature training set generation network of the local network include:
all or part of the parameters of the last trained local network among the at least one trained local network covering the feature training set generation network of the local network; or,
parameters obtained through a preset calculation on all or part of the parameters of the at least one trained local network covering the feature training set generation network of the local network.
Specifically, the feature training set of the local network includes data and a label, where the data is the output obtained by forward-propagating the training subset of the local network through the feature training set generation network of the local network, and the label is the label of the neural network.
Specifically, each of the plurality of local networks covering a portion of the neural network includes: a first local network covers a second portion of the neural network; a second local network covers a third portion of the neural network;
wherein the first local network and the second local network are two local networks of the plurality of local networks, and the second part is partially or completely the same as the third part.
A training unit 2803, configured to train each local network using the corresponding feature training set respectively, including: the training unit 2803 is configured to train the plurality of local networks sequentially from back to front in the forward direction of the neural network according to the sequence of the plurality of local networks.
Before the training unit 2803 trains each local network with its corresponding feature training set, the processing unit 2801 is further configured to add an auxiliary output layer to a local network of the multiple local networks that does not include an output layer, where the auxiliary output layer is used to support that the output of the local network meets a preset condition.
As shown in fig. 29, the embodiment of the present application further provides a server, which includes a processor 2901 and a memory 2902, where the memory 2902 stores instructions or programs, and the processor 2901 is configured to execute the instructions or programs stored in the memory 2902. When the instructions or programs stored in the memory 2902 are executed, the processor 2901 is configured to perform the operations performed by the processing unit 2801 in the embodiment shown in fig. 28. In particular, the server may also include a transceiver 2903 for interacting with information with the system.
Specifically, in fig. 29, the processor 2901 shown may be implemented by a Network Processing Unit (NPU) chip shown in fig. 30.
The neural network processor 50 is mounted on a main CPU (Host CPU) as a coprocessor, and tasks are allocated by the Host CPU. The core portion of the NPU is the arithmetic circuit 503, and the controller 504 controls the arithmetic circuit 503 to extract matrix data from the memory and perform multiplication.
In some implementations, the arithmetic circuit 503 includes a plurality of processing units (PEs) therein. In some implementations, the operational circuitry 503 is a two-dimensional systolic array. The arithmetic circuit 503 may also be a one-dimensional systolic array or other electronic circuit capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuitry 503 is a general-purpose matrix processor.
For example, assume that there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from the weight memory 502 and buffers it on each PE in the arithmetic circuit. The arithmetic circuit takes the matrix A data from the input memory 501 and performs a matrix operation with matrix B, and the partial or final results of the obtained matrix are stored in an accumulator (accumulator) 508.
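The accumulation of partial results can be illustrated in software as a blocked matrix multiplication in which partial products over slices of the inner dimension are added into an accumulator; this is only a conceptual sketch of the data flow, not a description of the hardware itself.

```python
import numpy as np

def blocked_matmul(a, b, block=4):
    """Compute C = A @ B by accumulating partial products over blocks of
    the inner dimension, mimicking partial results held in an accumulator."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    c = np.zeros((m, n), dtype=a.dtype)              # accumulator
    for start in range(0, k, block):
        end = min(start + block, k)
        c += a[:, start:end] @ b[start:end, :]       # accumulate partial product
    return c
```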
The unified memory 506 is used to store input data as well as output data. The weight data is transferred directly to the weight memory 502 through the direct memory access controller (DMAC) 505. The input data is also carried into the unified memory 506 through the DMAC.
A Bus Interface Unit (BIU) 510 for interaction of the AXI bus with the DMAC and an Instruction Fetch Buffer (IFB) 509.
The BIU is specifically used for the instruction fetch memory 509 to fetch instructions from the external memory, and is also used for the memory unit access controller 505 to fetch the raw data of the input matrix A or the weight matrix B from the external memory.
The DMAC is mainly used to transfer input data in the external memory DDR to the unified memory 506 or to transfer weight data into the weight memory 502 or to transfer input data into the input memory 501.
The vector calculation unit 507 includes a plurality of operation processing units, and further processes the output of the operation circuit if necessary, such as vector multiplication, vector addition, exponential operation, logarithmic operation, magnitude comparison, and the like. It is mainly used for non-convolution/FC layer calculation in the neural network, such as pooling, batch normalization, local response normalization (LRN), and the like.
In some implementations, the vector calculation unit 507 can store the processed output vector to the unified buffer 506. For example, the vector calculation unit 507 may apply a non-linear function to the output of the arithmetic circuit 503, such as a vector of accumulated values, to generate the activation value. In some implementations, the vector calculation unit 507 generates normalized values, combined values, or both. In some implementations, the vector of processed outputs can be used as activation inputs to the arithmetic circuitry 503, for example for use in subsequent layers in a neural network.
An instruction fetch buffer 509 is coupled to the controller 504 for storing instructions used by the controller 504.
The unified memory 506, the input memory 501, the weight memory 502, and the instruction fetch memory 509 are all On-Chip memories. The external memory is private to the NPU hardware architecture.
The operations of the layers in the convolutional neural network shown in the foregoing embodiment can be performed by the matrix calculation unit 212 or the vector calculation unit 507.
Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.
In the several embodiments provided in the present application, it should be understood that the disclosed server, apparatus and method may be implemented in other ways. For example, the above-described apparatus embodiments are merely illustrative, and for example, the division of the units is only one logical division, and other divisions may be realized in practice, for example, a plurality of units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may be in an electrical, mechanical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.
The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such understanding, the technical solution of the present application, or the part thereof contributing to the prior art, may in essence be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes any medium capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above description is only for the specific embodiments of the present application, but the scope of the present application is not limited thereto, and any person skilled in the art can easily conceive of the changes or substitutions within the technical scope of the present application, and shall be covered by the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (23)

1. A method of neural network parameter initialization, comprising:
respectively determining a feature training set corresponding to each local network of the neural network, wherein the neural network comprises a plurality of local networks, and each local network of the plurality of local networks covers a part of the neural network;
training each local network by using a feature training set corresponding to each local network to obtain parameters of the local network;
determining initialization parameters of the neural network, wherein the initialization parameters of the neural network comprise: all or a portion of the parameters of the plurality of local networks.
2. The method of claim 1, wherein the determining initialization parameters for the neural network comprises:
the initialization parameters of the first part of the neural network are all or part of the parameters of the last trained local network in a first local network set, the first local network set comprises one or more trained local networks, and the one or more trained local networks cover the first part of the neural network.
3. The method of claim 1 or 2, wherein the determining a feature training set corresponding to each local network of the neural network comprises:
respectively determining, for each local network of the neural network, a corresponding training subset and a corresponding feature training set generation network;
and generating the feature training set corresponding to each local network based on the feature training set generation network of the local network and the corresponding training subset.
4. The method of claim 3, wherein the training subset comprises:
a pre-set training set, or a portion of the pre-set training set.
5. The method of claim 3 or 4, wherein the feature training set generation network of the local network comprises all or part of a back network, wherein the back network is a network between the local network and an input of the neural network.
6. The method of claim 5, wherein initialization parameters of the feature training set generation network of the local network comprise:
all or part of the parameters of the last trained local network, where the feature training set generation network of the local network covers all or part of at least one trained local network; or,
preset or calculated parameters covering all or part of the parameters of at least one trained local network in the feature training set generation network of the local network.
7. The method according to any one of claims 3-6, wherein the feature training set of the local network comprises:
data and labels, wherein the data is the output generated by forward-propagating the training subset of the local network through the feature training set generation network of the local network, and the labels are labels of the neural network.
8. The method of any one of claims 1-7, wherein each local network of the plurality of local networks covering a part of the neural network comprises:
a first local network covers a second part of the neural network;
a second local network covers a third part of the neural network;
wherein the first local network and the second local network are two local networks of the plurality of local networks, and the second part is partially or completely the same as the third part.
9. The method according to any one of claims 1-8, wherein training each local network with its corresponding feature training set comprises:
training the local networks in sequence from back to front, according to the front-to-back order of the local networks in the forward direction of the neural network.
10. The method according to any one of claims 1-9, wherein before each local network is trained with its corresponding feature training set, the method further comprises:
adding an auxiliary output layer to each local network, among the plurality of local networks, that does not comprise an output layer, wherein the auxiliary output layer is used to enable the output of the local network to meet a preset condition.
11. An apparatus for neural network parameter initialization, comprising:
the processing unit is used for respectively determining a feature training set corresponding to each local network of the neural network, wherein the neural network comprises a plurality of local networks, and each local network of the plurality of local networks covers a part of the neural network;
the training unit is used for training each local network by using the corresponding characteristic training set to obtain the parameters of the local network;
a parameter determining unit, configured to determine an initialization parameter of the neural network, where the initialization parameter of the neural network includes: all or a portion of the parameters of the plurality of local networks.
12. The apparatus of claim 11, wherein the parameter determining unit determines initialization parameters of the neural network comprises:
the initialization parameters of the first part of the neural network are all or part of the parameters of the last trained local network in a first local network set, the first local network set comprises one or more trained local networks, and the one or more trained local networks cover the first part of the neural network.
13. The apparatus according to claim 11 or 12, wherein the processing unit, configured to determine, for each local network of the neural network, a feature training set corresponding thereto respectively comprises:
the processing unit respectively determines, for each local network of the neural network, a corresponding training subset and a corresponding feature training set generation network;
the processing unit generates the feature training set corresponding to each local network based on the feature training set generation network of the local network and the corresponding training subset.
14. The apparatus of claim 13, wherein the training subset comprises:
the training set is preset, or a part of the training set is preset.
15. The apparatus of claim 13 or 14, wherein the feature training set generation network of the local network comprises all or part of a back network, wherein the back network is a network between the local network and an input of the neural network.
16. The apparatus of claim 15, wherein the initialization parameters of the feature training set generation network of the local network comprise:
all or part of the parameters of the last trained local network, where the feature training set generation network of the local network covers all or part of at least one trained local network; or,
preset or calculated parameters covering all or part of the parameters of at least one trained local network in the feature training set generation network of the local network.
17. The apparatus of any of claims 13-16, wherein the feature training set of the local network comprises:
data and labels, wherein the data is the output generated by forward-propagating the training subset of the local network through the feature training set generation network of the local network, and the labels are labels of the neural network.
18. The apparatus of any one of claims 11-17, wherein each local network of the plurality of local networks covering a part of the neural network comprises:
a first local network covers a second part of the neural network;
a second local network covers a third part of the neural network;
wherein the first local network and the second local network are two local networks of the plurality of local networks, and the second part is partially or completely the same as the third part.
19. The apparatus according to any one of claims 11-18, wherein the training unit configured to train each local network with its corresponding feature training set respectively comprises:
and the training unit is used for sequentially training the local networks from back to front according to the front and back sequence of the local networks in the forward direction of the neural network.
20. The apparatus according to any one of claims 11-19, wherein before the training unit trains each local network with its corresponding feature training set, the processing unit is further configured to add an auxiliary output layer to each local network, among the plurality of local networks, that does not comprise an output layer, the auxiliary output layer being used to enable the output of the local network to meet a preset condition.
21. A server, comprising:
a memory for storing a computer program;
a processor for executing the computer program stored in the memory to cause the server to perform the method of neural network parameter initialization of any one of claims 1-10.
22. A computer-readable storage medium, comprising a computer program which, when run on a computer, causes the computer to perform a method of neural network parameter initialization according to any one of claims 1 to 10.
23. A chip system comprising a processor for supporting a server in a computer to perform the method of neural network parameter initialization of any one of claims 1 to 10.
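
As an informal illustration of the claimed initialization procedure, the sketch below splits a small multilayer perceptron into two local networks, trains them one after another (attaching an auxiliary output layer to the local network that lacks an output layer, and generating the second local network's feature training set by forward-propagating the training subset through the already-trained preceding network), and then takes the trained parameters as the initialization parameters of the full network. The network sizes, loss function, optimizer, and training order shown here are assumptions for illustration and are not prescribed by the claims.

import torch
from torch import nn

# Informal sketch: local-network training used to produce initialization
# parameters for the full network. Sizes, loss, and optimizer are assumptions.

torch.manual_seed(0)
x = torch.randn(256, 16)                      # preset training set (data)
y = torch.randn(256, 1)                       # labels of the neural network

# Full neural network viewed as two "local networks" covering its two parts.
local1 = nn.Sequential(nn.Linear(16, 32), nn.ReLU())
local2 = nn.Sequential(nn.Linear(32, 1))      # already contains the output layer
aux_out = nn.Linear(32, 1)                    # auxiliary output layer for local1

def train(module, inputs, targets, steps=200):
    opt = torch.optim.SGD(module.parameters(), lr=0.1)
    loss_fn = nn.MSELoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(module(inputs), targets)
        loss.backward()
        opt.step()

# Local network 1: trained directly on the training subset via its auxiliary output layer.
train(nn.Sequential(local1, aux_out), x, y)

# Feature training set for local network 2: forward-propagate the training subset
# through its feature training set generation network (here, the trained local1).
with torch.no_grad():
    features = local1(x)
train(local2, features, y)

# Initialization parameters of the full network: parameters of the trained local networks.
full_net = nn.Sequential(local1, local2)
init_params = {k: v.clone() for k, v in full_net.state_dict().items()}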
CN201811072803.2A 2018-09-14 2018-09-14 Method and device for initializing neural network parameters Pending CN110909860A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811072803.2A CN110909860A (en) 2018-09-14 2018-09-14 Method and device for initializing neural network parameters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811072803.2A CN110909860A (en) 2018-09-14 2018-09-14 Method and device for initializing neural network parameters

Publications (1)

Publication Number Publication Date
CN110909860A true CN110909860A (en) 2020-03-24

Family

ID=69813250

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811072803.2A Pending CN110909860A (en) 2018-09-14 2018-09-14 Method and device for initializing neural network parameters

Country Status (1)

Country Link
CN (1) CN110909860A (en)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113469348A (en) * 2021-06-21 2021-10-01 安徽大学 Neural morphological circuit for multiple generalization and differentiation in associative memory
CN113469348B (en) * 2021-06-21 2024-02-20 安徽大学 Nerve morphology circuit for multiple generalization and differentiation in associative memory
CN114240764A (en) * 2021-11-12 2022-03-25 清华大学 Deblurring convolution neural network training method, device, equipment and storage medium
CN114240764B (en) * 2021-11-12 2024-04-23 清华大学 De-blurring convolutional neural network training method, device, equipment and storage medium

Similar Documents

Publication Publication Date Title
US11367271B2 (en) Similarity propagation for one-shot and few-shot image segmentation
US10635858B2 (en) Electronic message classification and delivery using a neural network architecture
US11151417B2 (en) Method of and system for generating training images for instance segmentation machine learning algorithm
US11010664B2 (en) Augmenting neural networks with hierarchical external memory
KR102410820B1 (en) Method and apparatus for recognizing based on neural network and for training the neural network
CN109214006B (en) Natural language reasoning method for image enhanced hierarchical semantic representation
US11715008B2 (en) Neural network training utilizing loss functions reflecting neighbor token dependencies
US11093714B1 (en) Dynamic transfer learning for neural network modeling
KR20200128938A (en) Model training method and apparatus, and data recognizing method
US9536206B2 (en) Method and apparatus for improving resilience in customized program learning network computational environments
JP7158236B2 (en) Speech recognition method and device
JP7342242B2 (en) A framework for learning to transfer learning
WO2021057884A1 (en) Sentence paraphrasing method, and method and apparatus for training sentence paraphrasing model
KR20190099930A (en) Method and apparatus for controlling data input and output of fully connected network
US20230222318A1 (en) Attention neural networks with conditional computation
KR20190136578A (en) Method and apparatus for speech recognition
US11574190B2 (en) Method and apparatus for determining output token
KR102469679B1 (en) Method and apparatus for recommending customised food based on artificial intelligence
US20220253680A1 (en) Sparse and differentiable mixture of experts neural networks
Meedeniya Deep Learning: A Beginners' Guide
CN110909860A (en) Method and device for initializing neural network parameters
CN111753995A (en) Local interpretable method based on gradient lifting tree
US20210042625A1 (en) Performance of neural networks using learned specialized transformation functions
CN114492758A (en) Training neural networks using layer-by-layer losses
CA3070816A1 (en) Method of and system for generating training images for instance segmentation machine learning algorithm

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication
Application publication date: 20200324