CN115600635A

CN115600635A - Training method of neural network model, and data processing method and device

Info

Publication number: CN115600635A
Application number: CN202111014266.8A
Authority: CN
Inventors: 孟庆春
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2021-07-08
Filing date: 2021-08-31
Publication date: 2023-01-13

Abstract

The application provides a training method for a neural network model, a data processing method, and a data processing device in the field of artificial intelligence. The training method comprises: training a neural network model based on a second training data set to obtain a target neural network model, wherein the neural network model comprises an expert network layer, the expert network layer comprises a first expert network of a first service field, the initial weight of the first expert network is determined according to a first word vector matrix, and the first word vector matrix is obtained by training based on a first training data set of the first service field. The method can reduce the training time of the model and improve its training efficiency.

Description

Training method of neural network model, and data processing method and device

Technical Field

The present application relates to the field of artificial intelligence, and more particularly, to a training method for a neural network model, a data processing method, and an apparatus.

Background

Artificial Intelligence (AI) is a theory, method, technique and application system that uses a digital computer or a machine controlled by a digital computer to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge and use the knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new intelligent machine that can react in a manner similar to human intelligence. Artificial intelligence is the research of the design principle and the realization method of various intelligent machines, so that the machines have the functions of perception, reasoning and decision making. Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision and reasoning, human-computer interaction, recommendation and search, AI basic theory, and the like.

In the field of deep learning, large-scale training can improve the performance of a neural network model. Generally, a neural network model processes all of its inputs with the same parameters, so as the number of parameters increases, the computational resources required by the model also increase. A mixture of experts (MoE) includes a plurality of expert networks, each having different parameters. The MoE may selectively activate different expert networks in the model to participate in the computation for different inputs. The number of parameters actually participating in the computation is therefore greatly reduced, which lowers the demand for computational resources and makes it possible to train models whose parameter count reaches the trillions or more.

However, the neural network model using MoE requires a long training time, which affects the use of the model.

Therefore, how to improve the training efficiency of the model becomes an urgent problem to be solved.

Disclosure of Invention

The application provides a training method of a neural network model, a data processing method and a data processing device, so that the training time of the model is reduced, and the training efficiency of the model is improved.

In a first aspect, a method for training a neural network model is provided, including: acquiring a first word vector matrix, wherein the first word vector matrix is obtained by training based on a first training data set of a first service field; acquiring a second training data set; and training the neural network model based on the second training data set to obtain a target neural network model, wherein the neural network model comprises an expert network layer, the expert network layer comprises a first expert network in the first service field, and the initial weight of the first expert network is determined according to the first word vector matrix.

According to the scheme of the embodiments of the application, a word vector matrix is obtained by training on a training data set and contains a large amount of semantic information. Initializing the weights of some or all of the expert networks in the model with the word vector matrix introduces this semantic information into the expert networks and provides them with prior knowledge, which reduces the training time. The reduction is especially large when the neural network model is large. In addition, introducing semantic information into the expert networks effectively improves their semantic representation capability, which further improves the training performance of the model.

With reference to the first aspect, in certain implementations of the first aspect, the method further includes: and acquiring a second word vector matrix, wherein the second word vector matrix is obtained by training based on a third training data set of the second service field, the expert network layer further comprises a second expert network of the second service field, and the initial weight of the second expert network is determined according to the second word vector matrix.

In the scheme of the embodiments of the application, different word vector matrices are obtained by training on training data sets of different business fields and therefore carry different semantic information. When different expert networks in the expert network layer are initialized with different word vector matrices, they have different semantic representation capabilities, and the semantic combination among them can further improve the understanding of natural language semantics, and thus the performance of the model.

With reference to the first aspect, in certain implementations of the first aspect, the expert network layer is configured to process data input to the expert network layer through a selected first expert network, and the first expert network is selected according to the data input to the expert network layer.

With reference to the first aspect, in certain implementations of the first aspect, the first training data set is determined from a first knowledge graph of the first business field.

In the scheme of the embodiment of the application, a training data set of a service field can be constructed by a knowledge graph of the service field, and the knowledge graph of the service field can indicate the relation between each entity in the service field, so that the learning of the knowledge of the service field by a word vector matrix is facilitated, and the semantic expression capability is improved.

With reference to the first aspect, in certain implementations of the first aspect, the first training data set being determined from a first knowledge graph of the first business field includes: at least one first text sequence in the first training data set is generated according to at least one first triple in the first knowledge graph, and the three words in the first triple respectively represent a subject in the first business field, an object in the first business field, and a relation between the subject and the object.

A triple may be written in the form (subject, relation, object). The subject and the object are concepts in the business field.

A text sequence may be generated from a triple. In other words, a triple may constitute a sentence, i.e., a text sequence.

Illustratively, triples may be converted into sentences by a language model, for example an n-gram language model, where n may be 2 or 3.
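The step from triples to text sequences can be pictured with a minimal sketch. The text above mentions a language model for this conversion; the sketch swaps in a fixed "subject relation object" template, which is an assumption made purely for illustration. The names `Triple`, `triple_to_sequence`, and `build_training_set`, as well as the sample triples, are hypothetical.

```python
from typing import List, NamedTuple


class Triple(NamedTuple):
    """One piece of knowledge: (subject, relation, object)."""
    subject: str
    relation: str
    object: str


def triple_to_sequence(triple: Triple) -> List[str]:
    # Assumption: a fixed "subject relation object" template stands in
    # for the language model described in the text.
    return [triple.subject, triple.relation, triple.object]


def build_training_set(triples: List[Triple]) -> List[List[str]]:
    # One text sequence is generated per triple.
    return [triple_to_sequence(t) for t in triples]


# Hypothetical triples from an imagined business field.
sequences = build_training_set([
    Triple("bank", "issues", "loan"),
    Triple("customer", "repays", "loan"),
])
```

Each resulting sequence is a short sentence whose words all come from one triple, which is the property the later word vector training relies on.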

With reference to the first aspect, in certain implementations of the first aspect, the first word vector matrix being obtained by training based on a first training data set of the first business field includes: the first word vector matrix is the weight of a hidden layer in a first target word vector generation model. The first target word vector generation model is obtained by training a word vector generation model with the words other than a target word in the at least one first text sequence as the input of the word vector generation model and the target word as its target output, where the target word is a word in the at least one first triple.

The word vector generation model may include an input layer, a hidden layer, and an output layer. The hidden layer is a fully connected layer.

The weight of the hidden layer may also be referred to as an embedding matrix or a word vector matrix.

Optionally, the target word in the at least one first text sequence is an object in the at least one first triple.

Optionally, the target word in the at least one first text sequence is a subject in the at least one first triple.

Optionally, the target word in the at least one first text sequence is a relationship in the at least one first triple.
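The three options above differ only in which word of the text sequence is held out as the target. A minimal sketch of building one training example, assuming the sequence keeps the triple order (subject, relation, object); the function name `make_example` and the sample sequence are hypothetical.

```python
from typing import List, Tuple


def make_example(sequence: List[str], target_index: int) -> Tuple[List[str], str]:
    """Split one text sequence into (input words, target word)."""
    target = sequence[target_index]
    # Every word except the target becomes model input.
    context = [w for i, w in enumerate(sequence) if i != target_index]
    return context, target


# Assumption: the sequence preserves the order subject, relation, object.
SUBJECT, RELATION, OBJECT = 0, 1, 2
sequence = ["bank", "issues", "loan"]

object_example = make_example(sequence, OBJECT)      # predict the object
subject_example = make_example(sequence, SUBJECT)    # predict the subject
relation_example = make_example(sequence, RELATION)  # predict the relation
```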

With reference to the first aspect, in some implementations of the first aspect, the initial weights of the first expert network are determined according to a first word vector matrix, including: the initial weight of the first expert network is a first word vector matrix.

With reference to the first aspect, in certain implementations of the first aspect, the initial weights of the first expert network are determined from a first word vector matrix, including: the initial weight of the first expert network is obtained by adjusting the first word vector matrix.
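The two initialization options can be sketched as follows. The text does not say how the word vector matrix is adjusted, so the cropping and zero-padding used here to fit the expert's weight shape is only an assumed example of an adjustment; `init_expert_weight` and the sample matrix are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def init_expert_weight(word_vectors: Matrix, rows: int, cols: int) -> Matrix:
    """Return a rows x cols initial weight derived from word_vectors."""
    weight = []
    for r in range(rows):
        source_row = word_vectors[r] if r < len(word_vectors) else []
        # Copy the values that exist and zero-pad the rest
        # (an assumed form of "adjusting" the matrix).
        weight.append([source_row[c] if c < len(source_row) else 0.0
                       for c in range(cols)])
    return weight


first_word_vector_matrix = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]

# Option 1: shapes match, so the initial weight equals the matrix.
same_shape = init_expert_weight(first_word_vector_matrix, 3, 2)
# Option 2: shapes differ, so the matrix is adjusted to fit.
adjusted = init_expert_weight(first_word_vector_matrix, 2, 3)
```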

With reference to the first aspect, in certain implementations of the first aspect, the neural network model is a Natural Language Processing (NLP) model or a speech processing model.

If the neural network model is an NLP model, the data in the second training data set may be text data.

If the neural network model is a speech processing model, the data in the second training data set may be speech data.

Illustratively, the speech processing model may be an end-to-end speech processing model, for example, a Listen, Attend and Spell (LAS) model.

In a second aspect, a method for data processing is provided, including: acquiring data to be processed; and processing the data to be processed by using a target neural network model, wherein the target neural network model is obtained by training the neural network model based on a second training data set, the neural network model comprises an expert network layer, the expert network layer comprises a first expert network in a first service field, the initial weight of the first expert network is determined according to a first word vector matrix, and the first word vector matrix is obtained by training based on a first training data set in the first service field.

According to the scheme of the embodiments of the application, a word vector matrix is obtained by training on a training data set and contains a large amount of semantic information. Initializing the weights of some or all of the expert networks in the model with the word vector matrix introduces this semantic information into the expert networks and provides them with prior knowledge, which shortens the training time. The reduction is especially large when the neural network model is large. In addition, introducing semantic information into the expert networks effectively improves their semantic representation capability, which further improves the performance of the target neural network model.

With reference to the second aspect, in certain implementations of the second aspect, the expert network layer further includes a second expert network in the second business domain, initial weights of the second expert network are determined according to a second word vector matrix, and the second word vector matrix is obtained by training based on a third training data set in the second business domain.

With reference to the second aspect, in some implementations of the second aspect, the expert network layer is configured to process data input to the expert network layer through a selected first expert network, and the first expert network is selected according to the data input to the expert network layer.

With reference to the second aspect, in certain implementations of the second aspect, the first training data set is determined from a first knowledge graph of the first business field.

With reference to the second aspect, in certain implementations of the second aspect, the first training data set being determined from a first knowledge graph includes: at least one first text sequence in the first training data set is generated according to at least one first triple in the first knowledge graph, and the three words in the first triple respectively represent a subject in the first business field, an object in the first business field, and a relation between the subject and the object.

With reference to the second aspect, in some implementations of the second aspect, the first word vector matrix is trained based on a first training data set of a first business domain, and includes: the first word vector matrix is the weight of a hidden layer in the first target word vector generation model, the first target word vector generation model is obtained by taking words except target words in at least one first text sequence as input of the word vector generation model and taking the target words as target output of the word vector generation model to train the word vector generation model, and the target words are words in at least one first triple.

With reference to the second aspect, in some implementations of the second aspect, the initial weights of the first expert network are determined from a first word vector matrix, including: the initial weight of the first expert network is a first word vector matrix.

With reference to the second aspect, in some implementations of the second aspect, the neural network model is a natural language processing NLP model or a speech processing model.

In a third aspect, an apparatus for training a neural network model is provided, which includes means for performing the method of any one of the implementations of the first aspect.

In a fourth aspect, an apparatus for data processing is provided, the apparatus comprising means for performing the method of any one of the implementations of the second aspect.

It is to be understood that extensions, definitions, and explanations of relevant matters in the above-mentioned first aspect also apply to the same matters in the second, third and fourth aspects.

In a fifth aspect, an apparatus for training a neural network model is provided, the apparatus comprising: a memory for storing a program; a processor configured to execute the program stored in the memory, and when the program stored in the memory is executed, the processor is configured to perform the method in any one of the implementation manners of the first aspect.

The processor in the fifth aspect may be a central processing unit (CPU), or may be a combination of a CPU and a neural network computing processor, where the neural network computing processor may include a graphics processing unit (GPU), a neural-network processing unit (NPU), a tensor processing unit (TPU), and the like. The TPU is an application-specific integrated circuit that Google fully customized as an artificial intelligence accelerator for machine learning.

In a sixth aspect, an apparatus for data processing is provided, the apparatus comprising: a memory for storing a program; a processor configured to execute the memory-stored program, the processor being configured to perform the method of any one of the implementations of the second aspect when the memory-stored program is executed.

The processor in the above sixth aspect may be a CPU, or a combination of a CPU and a neural network arithmetic processor, where the neural network arithmetic processor may include a GPU, an NPU, a TPU, and the like.

In a seventh aspect, a computer-readable medium is provided, which stores program code for execution by a device, the program code including instructions for performing the method in any one of the implementations of the first or second aspect.

In an eighth aspect, there is provided a computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any one of the implementations of the first or second aspect.

A ninth aspect provides a chip, where the chip includes a processor and a data interface, and the processor reads instructions stored in a memory through the data interface to execute the method in any one implementation manner of the first aspect or the second aspect.

Optionally, as an implementation manner, the chip may further include a memory, where instructions are stored in the memory, and the processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the processor is configured to execute the method in any one implementation manner of the first aspect or the second aspect.

Drawings

FIG. 1 is a schematic diagram of a dialog system provided in an embodiment of the present application;

FIG. 2 is a schematic diagram of a process of generating a word vector model according to an embodiment of the present application;

FIG. 3 is a schematic diagram of a natural language processing system according to an embodiment of the present application;

FIG. 4 is a schematic diagram of a system architecture according to an embodiment of the present application;

FIG. 5 is a schematic diagram of a training apparatus for a neural network model according to an embodiment of the present application;

FIG. 6 is a schematic flow chart of a training method for a neural network model according to an embodiment of the present application;

FIG. 7 is a schematic flow chart of a method for data processing according to an embodiment of the present application;

FIG. 8 is a schematic block diagram of a training apparatus for a neural network model according to an embodiment of the present application;

FIG. 9 is a schematic block diagram of a data processing apparatus according to an embodiment of the present application;

FIG. 10 is a schematic block diagram of another training apparatus for a neural network model provided in an embodiment of the present application;

FIG. 11 is a schematic block diagram of another data processing apparatus provided in an embodiment of the present application.

Detailed Description

The technical solution in the present application will be described below with reference to the accompanying drawings.

The embodiments of the application can be applied to the field of natural language processing or the field of speech processing.

The following description will be given taking an example in which the scheme of the embodiment of the present application is applied to a dialogue system.

Dialog systems are an important application in the field of natural language processing. As shown in FIG. 1, the dialog system includes an automatic speech recognition (ASR) subsystem, a natural language understanding (NLU) subsystem, a dialog management (DM) subsystem, a natural language generation (NLG) subsystem, and a text-to-speech (TTS) subsystem.

The ASR subsystem converts audio input by the user into text. The NLU subsystem analyzes the text produced by the ASR subsystem to obtain the user's intention. Based on that intention and the current dialog state, the DM subsystem performs a corresponding action, such as querying a knowledge base, and returns the result. The NLG subsystem generates text data from the result returned by the DM subsystem, and the TTS subsystem converts the text data into audio data and feeds it back to the user.

The scheme of the embodiment of the application can be utilized in the NLU subsystem to obtain or optimize a neural network model suitable for natural language understanding. By adopting the scheme of the embodiment of the application, the training efficiency of the neural network model can be improved, and the neural network model can be obtained more quickly.

It should be understood that the above is only an example, and the scheme of the embodiments of the application is not limited to the natural language understanding subsystem of a dialog system. The scheme can also be applied to other scenarios related to natural language understanding.

Since the embodiments of the present application relate to the application of a large number of neural networks, for the sake of understanding, the following description will be made first of all with respect to terms and concepts of the neural networks to which the embodiments of the present application may relate.

(1) Neural network

The neural network may be composed of neural units. A neural unit may be an arithmetic unit that takes $x_s$ and an intercept of 1 as inputs, and its output may be:

$$f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$

where $s = 1, 2, \ldots, n$, $n$ is a natural number greater than 1, $W_s$ is the weight of $x_s$, and $b$ is the bias of the neural unit. $f$ is the activation function of the neural unit, which introduces a nonlinear characteristic into the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function may be used as an input for the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by connecting many such neural units together, i.e., the output of one neural unit may be the input of another neural unit. The input of each neural unit can be connected to the local receptive field of the previous layer to extract the features of that local receptive field, and the local receptive field can be a region composed of several neural units.
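As a concrete reading of the neural unit described above, the sketch below computes the activation of the weighted sum of the inputs plus the bias, with sigmoid as the activation function. The function names and the sample numbers are illustrative only.

```python
import math
from typing import Sequence


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def neural_unit(x: Sequence[float], w: Sequence[float], b: float) -> float:
    """Output f(sum_s W_s * x_s + b) with f = sigmoid."""
    z = sum(w_s * x_s for w_s, x_s in zip(w, x)) + b
    return sigmoid(z)


# Weighted sum is 0.5 * 1.0 + (-0.25) * 2.0 + 0.0 = 0.0, and sigmoid(0) = 0.5.
output = neural_unit(x=[1.0, 2.0], w=[0.5, -0.25], b=0.0)
```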

(2) Deep neural network

A deep neural network (DNN), also called a multi-layer neural network, can be understood as a neural network with multiple hidden layers. Divided according to the positions of the layers, the layers inside a DNN fall into three categories: input layer, hidden layer, and output layer. Typically, the first layer is the input layer, the last layer is the output layer, and all the layers in between are hidden layers. The layers are fully connected; that is, any neuron in the i-th layer is connected to every neuron in the (i+1)-th layer.

Although a DNN appears complex, the work of each layer is not. Each layer simply evaluates the following linear relational expression:

$$\vec{y} = \alpha\left(W\vec{x} + \vec{b}\right)$$

where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, $W$ is the weight matrix (also called the coefficients), and $\alpha()$ is the activation function. Each layer merely applies this simple operation to the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Because a DNN has many layers, the number of coefficients $W$ and offset vectors $\vec{b}$ is also large. These parameters are defined in a DNN as follows, taking the coefficient $W$ as an example. Assume that in a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as $W^{3}_{24}$. The superscript 3 represents the layer in which the coefficient $W$ is located, while the subscripts correspond to the output index 2 in the third layer and the input index 4 in the second layer.

In summary, the coefficient from the k-th neuron of layer $L-1$ to the j-th neuron of layer $L$ is defined as $W^{L}_{jk}$.

Note that the input layer has no W parameter. In a deep neural network, more hidden layers make the network better able to describe complex situations in the real world. In theory, a model with more parameters has higher complexity and larger "capacity", which means that it can accomplish more complex learning tasks. Training the deep neural network is the process of learning the weight matrices, and its final goal is to obtain the weight matrices of all layers of the trained deep neural network (formed by the W of many layers).
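The layer-by-layer computation can be sketched as a short forward pass in which each layer applies an activation to the weight matrix times the input plus the offset vector. The indexing follows the convention above: `W[j][k]` is the coefficient from the k-th neuron of the previous layer to the j-th neuron of the current layer. Using tanh as the activation and the specific weights below are assumptions for illustration.

```python
import math
from typing import List, Tuple

Vector = List[float]
Matrix = List[List[float]]


def layer(W: Matrix, b: Vector, x: Vector) -> Vector:
    """One layer: y = alpha(W x + b), with alpha = tanh (an assumption)."""
    return [math.tanh(sum(w_jk * x_k for w_jk, x_k in zip(row, x)) + b_j)
            for row, b_j in zip(W, b)]


def forward(params: List[Tuple[Matrix, Vector]], x: Vector) -> Vector:
    # The input layer has no W; each later layer transforms the previous output.
    for W, b in params:
        x = layer(W, b, x)
    return x


params = [
    ([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]], [0.0, 0.0, 0.0]),  # 2 inputs -> 3 hidden
    ([[1.0, 1.0, 1.0]], [0.0]),                               # 3 hidden -> 1 output
]
y = forward(params, [0.5, -0.5])
```

With these weights the hidden activations are tanh(0.5), tanh(-0.5), and tanh(0), which cancel in the output layer, so the single output is approximately zero.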

(3) Loss function

In the process of training a deep neural network, the output of the network is expected to be as close as possible to the value that is actually desired. The weight vectors of each layer can therefore be updated according to the difference between the predicted value of the current network and the desired target value (an initialization process usually takes place before the first update, that is, parameters are preconfigured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make the prediction lower, and the adjustment continues until the deep neural network can predict the desired target value or a value very close to it. It is therefore necessary to define in advance how to compare the difference between the predicted value and the target value. This is the role of the loss function or objective function, which are important equations for measuring the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes a process of reducing the loss as much as possible.
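To make "a higher loss indicates a larger difference" concrete, the sketch below uses mean squared error. The text does not name a specific loss function, so this choice and the sample values are assumptions.

```python
from typing import Sequence


def mse_loss(predicted: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared difference between predicted and target values."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


target = [1.0, 0.0]
close = mse_loss([0.9, 0.1], target)  # prediction near the target
far = mse_loss([0.1, 0.9], target)    # prediction far from the target
```

The prediction that is closer to the target yields the smaller loss, which is the quantity that training tries to reduce.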

(4) Back propagation algorithm

A neural network can use the back propagation (BP) algorithm to correct the values of the parameters in the initial neural network model during training, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, an input signal is propagated forward until an error loss is produced at the output, and the parameters in the initial neural network model are updated by propagating the error loss information backward, so that the error loss converges. The back propagation algorithm is a backward propagation process dominated by the error loss, and it aims to obtain the optimal parameters of the neural network model, such as the weight matrix.
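A minimal sketch of this forward and backward cycle for a single sigmoid neuron follows. The squared-error loss, the learning rate, and the training sample are assumptions chosen to keep the example small; a real model repeats the same chain-rule step across every layer.

```python
import math
from typing import Tuple


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def train_step(w: float, b: float, x: float, target: float,
               lr: float) -> Tuple[float, float, float]:
    # Forward pass: propagate the input to the output and measure the loss.
    y = sigmoid(w * x + b)
    loss = (y - target) ** 2
    # Backward pass: chain rule from the loss back to each parameter.
    dloss_dy = 2.0 * (y - target)
    dy_dz = y * (1.0 - y)
    grad_w = dloss_dy * dy_dz * x
    grad_b = dloss_dy * dy_dz
    # Update: move each parameter against its gradient.
    return w - lr * grad_w, b - lr * grad_b, loss


w, b = 0.0, 0.0
losses = []
for _ in range(200):
    w, b, loss = train_step(w, b, x=1.0, target=1.0, lr=0.5)
    losses.append(loss)
```

The recorded losses shrink over the iterations, which illustrates the error loss converging as the parameters are updated.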

(5) Natural Language Processing (NLP)

Natural language is human language, and natural language processing (NLP) is the processing of human language. Natural language processing is the systematic analysis, understanding, and extraction of information from text data in an intelligent and efficient manner. With NLP and its components, very large blocks of text data can be managed, a large number of automated tasks can be performed, and a wide variety of problems can be solved, such as automatic summarization, machine translation (MT), named entity recognition (NER), relation extraction (RE), information extraction (IE), sentiment analysis, speech recognition, question answering, and topic segmentation.

(6) Knowledge-graph (KG)

A knowledge graph is a semantic network that reveals the relationships between entities. Building on information, it establishes connections between entities to form "knowledge". A knowledge graph is composed of pieces of knowledge, and each piece of knowledge can be represented as a triple consisting of a subject, a relation, and an object, written as (subject, relation, object).

Entities, i.e., subjects and objects in triplets, generally represent concepts, and are generally composed of nouns or noun phrases. A relationship represents a connection between two entities, typically consisting of a verb, an adjective, or a noun.

For example, the knowledge indicated by the triple (scotch, teacher, asialdod) is that scotch is a teacher of asialdod.

(7) Mixture of experts (MoE) system

The mixture of experts system is a neural network architecture in which a number of linear models are trained using local input data, and the outputs of these models are combined, using weights generated by a gate network, to form the output of the MoE. These linear models are called experts, and may also be called expert networks or expert models.

Specifically, the MoE includes at least one gate network and a plurality of expert networks. Different expert networks have different parameters. The gate network may selectively activate some of the parameters in the MoE for different input data. In other words, the gate network may select different expert networks to participate in the actual computation of the current input based on the different inputs.
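The selection performed by the gate network can be sketched as follows. The gate here is a single linear layer followed by softmax, and only the highest-scoring expert is activated (top-1 routing); both choices, along with the toy experts and weights, are assumptions for illustration rather than the configuration used in the application.

```python
import math
from typing import Callable, List, Sequence, Tuple

Vector = List[float]


def softmax(scores: Sequence[float]) -> Vector:
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(x: Vector,
              gate_weights: List[Vector],
              experts: List[Callable[[Vector], Vector]]) -> Tuple[int, Vector]:
    # Gate network: one score per expert, computed from the input itself.
    scores = [sum(g * v for g, v in zip(row, x)) for row in gate_weights]
    probs = softmax(scores)
    # Assumption: top-1 routing, so only one expert takes part in the computation.
    chosen = max(range(len(experts)), key=lambda i: probs[i])
    y = experts[chosen](x)
    # Weight the expert output by its gate probability.
    return chosen, [probs[chosen] * v for v in y]


# Two toy experts with different parameters (hypothetical).
experts = [lambda x: [2.0 * v for v in x], lambda x: [-1.0 * v for v in x]]
gate_weights = [[1.0, 0.0], [0.0, 1.0]]

chosen_a, out_a = moe_layer([3.0, 0.0], gate_weights, experts)
chosen_b, out_b = moe_layer([0.0, 3.0], gate_weights, experts)
```

Different inputs are routed to different experts, so only the parameters of the chosen expert are used for a given input.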

The same expert network may be deployed on multiple devices. In other words, the same expert network deployed on different devices has the same parameters. In this way, multiple devices can share parameters, which is beneficial for training large-scale models, for example, models with parameter quantities reaching trillion or even higher.

(8) Word vector

A word in NLP generally has two kinds of representation: one-hot representation and distributed representation.

The distributed representation maps a word or phrase from the vocabulary into a new space and represents it as a real-valued vector, i.e., a word vector. This approach may be referred to as word embedding. Word to vector (word2vec) is one method of word embedding.
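The difference between the two representations can be seen in a small sketch: a one-hot vector has one position per vocabulary word, while the distributed representation is a short real-valued vector obtained by multiplying the one-hot vector by a word vector matrix, which amounts to selecting one row. The vocabulary and matrix values are hypothetical.

```python
from typing import List

vocabulary = ["the", "dog", "bark"]
# Hypothetical 3 x 2 word vector matrix: one row per vocabulary word.
embedding_matrix = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]


def one_hot(word: str) -> List[int]:
    return [1 if w == word else 0 for w in vocabulary]


def word_vector(word: str) -> List[float]:
    # Multiplying a one-hot row vector by the matrix selects one row.
    hot = one_hot(word)
    dims = len(embedding_matrix[0])
    return [sum(hot[i] * embedding_matrix[i][d] for i in range(len(hot)))
            for d in range(dims)]


dog_one_hot = one_hot("dog")
dog_vector = word_vector("dog")
```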

The word2vec model may include an input layer, a hidden layer, and an output layer. The hidden layer is a fully connected layer. As shown in FIG. 2, the weight of the hidden layer in the trained model is the word vector matrix, which may also be referred to as the embedding matrix.

The word2vec model includes two types of models: the skip-gram model and the continuous bag-of-words (CBOW) model.

The skip-gram model is used to generate the words in the context of a word based on that word. In other words, a word is input to the skip-gram model, and the words in its context are the target outputs of the skip-gram model. For example, w(t) is taken as the input, and w(t-2), w(t-1), w(t+1), and w(t+2) in the context of w(t) are taken as the target outputs.

The CBOW model is used to generate a word based on the words in its context. In other words, the words in the context of a word are used as the input of the CBOW model, and the word itself is used as the target output of the CBOW model. For example, w(t-2), w(t-1), w(t+1), and w(t+2) in the context of w(t) are taken as the inputs, and w(t) is taken as the target output.
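The two models differ in which side of the (word, context) relationship is the input. The sketch below builds training examples for both from one word sequence with a context window of two words on each side, matching the w(t-2) to w(t+2) example; the helper names and the sample sequence are illustrative.

```python
from typing import List, Tuple


def context_of(words: List[str], t: int, window: int = 2) -> List[str]:
    lo, hi = max(0, t - window), min(len(words), t + window + 1)
    return [words[i] for i in range(lo, hi) if i != t]


def cbow_examples(words: List[str], window: int = 2) -> List[Tuple[List[str], str]]:
    # CBOW: the context words are the input, the centre word is the target.
    return [(context_of(words, t, window), words[t]) for t in range(len(words))]


def skip_gram_examples(words: List[str], window: int = 2) -> List[Tuple[str, str]]:
    # Skip-gram: the centre word is the input, each context word is a target.
    return [(words[t], c) for t in range(len(words))
            for c in context_of(words, t, window)]


words = ["a", "b", "c", "d", "e"]
cbow = cbow_examples(words)
skip = skip_gram_examples(words)
```

For the centre word "c", CBOW produces one example with four input words, while skip-gram produces four examples that each pair "c" with one context word.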

FIG. 2 shows a schematic diagram of the processing of a CBOW model. A "1" in the input layer indicates that the word corresponding to that position is input, and a "0" indicates that the word corresponding to that position is not input. Likewise, a "1" in the output layer indicates that the word corresponding to that position is output, and a "0" indicates that it is not. For example, in the sentence "the dog bark at mailman", "the" and "bark" are the context of "dog". The one-hot codes of "the" and "bark" are input into the CBOW model shown in FIG. 2, that is, the positions corresponding to "the" and "bark" in the input layer are set to 1. After processing by the CBOW model, the position corresponding to "dog" in the output is 1, that is, "dog" is output.
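The example can be traced through a toy forward pass with an input layer, a fully connected hidden layer, and a softmax output layer. The weights below are hand-picked, hypothetical values chosen so that the context "the" and "bark" points at "dog"; a trained model would learn them instead. Summing the context rows (rather than averaging them) is also an assumption.

```python
import math
from typing import List

vocabulary = ["the", "dog", "bark", "at", "mailman"]


def multi_hot(words: List[str]) -> List[int]:
    # Positions of the context words are set to 1, all others to 0.
    return [1 if w in words else 0 for w in vocabulary]


def cbow_forward(inputs: List[int],
                 hidden_weights: List[List[float]],
                 output_weights: List[List[float]]) -> List[float]:
    # Hidden layer: input vector times the V x N word vector matrix.
    n = len(hidden_weights[0])
    hidden = [sum(inputs[i] * hidden_weights[i][d] for i in range(len(inputs)))
              for d in range(n)]
    # Output layer: N x V weights followed by softmax over the vocabulary.
    scores = [sum(hidden[d] * output_weights[d][j] for d in range(n))
              for j in range(len(vocabulary))]
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical weights: V = 5 vocabulary words, N = 2 hidden units.
hidden_weights = [[1.0, 0.0], [0.0, 0.0], [0.0, 1.0], [0.0, 0.0], [0.0, 0.0]]
output_weights = [[0.0, 2.0, 0.0, 0.0, 0.0],
                  [0.0, 2.0, 0.0, 0.0, 0.0]]

probs = cbow_forward(multi_hot(["the", "bark"]), hidden_weights, output_weights)
predicted = vocabulary[probs.index(max(probs))]
```

Here `hidden_weights` plays the role of the word vector matrix: after training, its rows are the word vectors.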

Fig. 3 is a schematic diagram of a natural language processing system according to an embodiment of the present application.

As shown in fig. 3 (a), the natural language processing system may include a user device and a data processing device. The user device includes an intelligent terminal such as a mobile phone, a personal computer, or an information processing center. The user device is the initiating end of natural language data processing and acts as the initiator of requests such as language question answering or queries; usually, a user initiates a request through the user device.

The data processing device may be a device or a server having a data processing function, such as a cloud server, a network server, an application server, or a management server. The data processing device receives questions such as query statements, speech, or text from the intelligent terminal through an interactive interface, and then performs language data processing by means of machine learning, deep learning, searching, reasoning, decision making, and the like, using a memory for storing data and a processor for processing data. The memory may be a general term that includes local storage and a database storing historical data; the database may be located on the data processing device or on another network server.

In the natural language processing system shown in fig. 3 (a), the user device may receive an instruction of a user. For example, the user device may receive a piece of text input by the user and then initiate a request to the data processing device, so that the data processing device executes a natural language processing application (e.g., intention recognition, text classification, text sequence labeling, translation, etc.) on the piece of text obtained by the user device, thereby obtaining a processing result of the corresponding natural language processing application for the piece of text.

For example, the user device may receive a text to be processed input by a user, and then initiate a request to the data processing device, so that the data processing device classifies the text to be processed, thereby obtaining a classification result for the text to be processed. The classification result may refer to a semantic intention of the user indicated by the text to be processed, for example, an intention of the user for indicating playing a song, setting time, and starting navigation; alternatively, the classification result may also be used to indicate an emotion classification result of the user, for example, the classification result may indicate that the emotion of the user corresponding to the text to be processed is classified as depression, happy or angry.

The target neural network model obtained by the training method of the neural network model according to the embodiment of the present application may be deployed in the data processing device in (a) of fig. 3, and the target neural network model may be used to execute a natural language processing application (e.g., intention recognition, text classification, text sequence labeling, translation, etc.), so as to obtain a processing result of the natural language processing application.

Fig. 3 (b) shows another application scenario of the natural language processing system. In this scenario, the intelligent terminal directly serves as the data processing device: it directly receives an input from a user, and the input is processed by the hardware of the intelligent terminal itself. The specific process is similar to that in (a) of fig. 3; reference may be made to the above description, which is not repeated here.

In the natural language processing system shown in fig. 3 (b), the user device may receive an instruction of a user, and the user device itself processes the data to be processed to obtain a processing result of the data to be processed.

In the natural language processing system shown in fig. 3 (b), the user device may receive an instruction of the user. For example, the user device may receive a piece of text input by the user, and the user device itself then executes a natural language processing application (e.g., intention recognition, text classification, text sequence labeling, translation, etc.) on the piece of text, so as to obtain a processing result of the corresponding natural language processing application for the piece of text.

The target neural network model obtained by the training method of the neural network model according to the embodiment of the present application may be deployed in the user device in (b) of fig. 3, and the target neural network model may be used to execute a natural language processing application (e.g., intention recognition, text classification, text sequence labeling, translation, etc.), so as to obtain a processing result of the natural language processing application.

Fig. 3 (c) is a schematic diagram of a related device of the natural language processing system provided in the embodiment of the present application.

The user device in (a) and (b) of fig. 3 may specifically be the local device 301 or the local device 302 in (c) of fig. 3, and the data processing device may be the execution device 310, where the data storage system 350 may be integrated on the execution device 310, or may be disposed on a cloud or another network server.

The local device 301 and the local device 302 are connected to the execution device 310 through a communication network.

The execution device 310 may be implemented by one or more servers. Optionally, the execution device 310 may be used together with other computing devices, such as data storage devices, routers, and load balancers. The execution device 310 may be disposed at one physical site or distributed across multiple physical sites. The execution device 310 may use data in the data storage system 350 or call program code in the data storage system 350 to implement the training method of the neural network model of the embodiment of the present application.

Specifically, in one implementation, the execution device 310 may perform the following processes:

acquiring a first word vector matrix, wherein the first word vector matrix is obtained by training based on a first training data set of a first service field;

acquiring a second training data set;

and training the neural network model based on the second training data set to obtain a target neural network model, wherein the neural network model comprises an expert network layer, the expert network layer comprises a first expert network in the first service field, and the initial weight of the first expert network is determined according to the first word vector matrix.

Through the above process, the execution device 310 can obtain a trained neural network, i.e., a target neural network model, which can be used for natural language processing and the like.
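The initialization in the third step can be sketched as follows. The description above states only that the initial weight of the first expert network is determined according to the first word vector matrix; the sketch assumes the simplest possible mapping, namely that the expert weight has the same shape as the word vector matrix and starts as a copy of it, while experts without a word vector matrix are initialized randomly. The function names and the toy matrix values are hypothetical.

```python
import copy
import random


def random_matrix(rows, cols, rng):
    """Small random weights for experts that have no word vector matrix."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def init_expert_layer(num_experts, rows, cols, domain_matrices, seed=0):
    """Build the initial weights of an expert network layer.

    domain_matrices maps an expert index to a pre-trained word vector matrix.
    Assumption: the expert weight has the same shape as that matrix and is
    initialised as a copy of it; the remaining experts are random.
    """
    rng = random.Random(seed)
    experts = []
    for e in range(num_experts):
        if e in domain_matrices:
            m = domain_matrices[e]
            if len(m) != rows or any(len(r) != cols for r in m):
                raise ValueError("word vector matrix shape does not match expert")
            experts.append(copy.deepcopy(m))
        else:
            experts.append(random_matrix(rows, cols, rng))
    return experts


# Toy first word vector matrix (3 words, dimension 2); values are made up.
first_word_vector_matrix = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
experts = init_expert_layer(3, 3, 2, {0: first_word_vector_matrix})
```

The copy is deliberate: subsequent training on the second training data set updates the expert weights without modifying the stored word vector matrix.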

Illustratively, users may operate their respective user devices (e.g., the local device 301 and the local device 302) to interact with the execution device 310. Each local device may represent any computing device, such as a personal computer, a computer workstation, a smartphone or other type of cellular phone, a tablet, a smart camera, a smart car, a media consumption device, a wearable device, a set-top box, a gaming console, and so forth.

The local device of each user may interact with the execution device 310 via a communication network of any communication mechanism/standard, such as a wide area network, a local area network, a peer-to-peer connection, etc., or any combination thereof.

In one implementation, the local device 301 or the local device 302 acquires the relevant parameters of the target neural network model from the execution device 310, deploys the target model on the local device 301 or the local device 302, and performs voice processing or text processing, etc. by using the target model.

In another implementation, the execution device 310 may directly deploy the target neural network model, and the execution device 310 acquires the data to be processed from the local device 301 and the local device 302 and processes the data to be processed by using the target neural network model, and further may return the processing result to the local device 301 and the local device 302.

It is noted that all of the functions of the execution device 310 may also be implemented by a local device. For example, the local device 301 implements the functions of the execution device 310 and provides services to its own user, or provides services to a user of the local device 302.

The execution device 310 may also be a cloud device, and in this case, the execution device 310 may be deployed in a cloud; alternatively, the execution device 310 may also be a terminal device, in which case, the execution device 310 may be deployed at a user terminal side, which is not limited in this embodiment of the application.

As shown in fig. 4, the present embodiment provides a system architecture 100. In fig. 4, a data acquisition device 160 is used to acquire training data. For the training method of the neural network model in the embodiment of the present application, if the data is text data, the training data may include a text sequence and a processing result corresponding to the text sequence, for example, the processing result corresponding to the text sequence may be an intention recognition result for the text sequence.

After the training data is collected, the data acquisition device 160 stores the training data in the database 130, and the training device 120 trains the target model/rule 101 based on the training data maintained in the database 130.

The following describes how the training device 120 obtains the target model/rule 101 based on the training data: the training device 120 processes the input raw data and compares the output value with the target value, until the difference between the output value of the training device 120 and the target value is smaller than a certain threshold, thereby completing the training of the target model/rule 101.
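The stopping criterion described above can be illustrated with a minimal sketch. A one-parameter linear model trained by gradient descent stands in for the target model/rule 101; the model, the mean squared difference used as the comparison, the learning rate, the threshold value, and the toy samples are all assumptions made for illustration.

```python
def train_until_threshold(samples, threshold=1e-4, lr=0.1, max_steps=10_000):
    """Fit y = w * x by gradient descent until the mean squared difference
    between the output values and the target values is below `threshold`."""
    w = 0.0
    loss = float("inf")
    for _ in range(max_steps):
        # Compare the model outputs with the target values.
        loss = sum((w * x - y) ** 2 for x, y in samples) / len(samples)
        if loss < threshold:
            break
        # Gradient of the mean squared difference with respect to w.
        grad = sum(2 * (w * x - y) * x for x, y in samples) / len(samples)
        w -= lr * grad
    return w, loss


# Toy (input, target) pairs that follow y = 2 * x.
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w, loss = train_until_threshold(samples)
```

The `max_steps` bound is a practical safeguard so that the loop terminates even if the threshold is never reached.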

The above-described target model/rule 101 can be used to implement the data processing method of the embodiment of the present application. The target model/rule 101 in the embodiment of the present application may specifically be a neural network model. It should be noted that, in practical applications, the training data maintained in the database 130 does not necessarily all come from the data acquisition device 160, and may also be received from other devices. It should also be noted that the training device 120 does not necessarily train the target model/rule 101 entirely based on the training data maintained in the database 130, and may also obtain training data from the cloud or elsewhere for model training; the above description should not be taken as a limitation on the embodiment of the present application.

The target model/rule 101 trained by the training device 120 may be applied to different systems or devices, such as the execution device 110 shown in fig. 4.

The execution device 110 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (AR)/virtual reality (VR) device, or a vehicle-mounted terminal, or may be a server or a cloud. In fig. 4, the execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices, and a user may input data to the I/O interface 112 through the client device 140, where the input data may include data to be processed that is input by the client device.

In the process that the execution device 110 preprocesses the input data or in the process that the calculation module 111 of the execution device 110 executes the calculation or other related processes, the execution device 110 may call data, codes or the like in the data storage system 150 for corresponding processes, or store data, instructions or the like obtained by corresponding processes in the data storage system 150.

Finally, the I/O interface 112 returns the processing result, such as the processing result of the data obtained as described above, to the client device 140, thereby providing it to the user.

It is worth noting that the training device 120 may generate corresponding target models/rules 101 for different targets or different tasks based on different training data, and the corresponding target models/rules 101 may be used to achieve the targets or complete the tasks, thereby providing the user with desired results.

In the case shown in fig. 4, the user may manually give the input data, and this manual operation may be performed through an interface provided by the I/O interface 112. Alternatively, the client device 140 may automatically send the input data to the I/O interface 112; if automatic sending of the input data by the client device 140 requires authorization from the user, the user may set the corresponding permissions in the client device 140. The user can view the result output by the execution device 110 at the client device 140, and the specific presentation form may be display, sound, action, and the like. The client device 140 may also serve as a data collection terminal that collects the input data input to the I/O interface 112 and the output results output from the I/O interface 112, as shown in the figure, as new sample data and stores the new sample data in the database 130. Of course, the input data input to the I/O interface 112 and the output results output from the I/O interface 112, as shown in the figure, may also be stored directly in the database 130 as new sample data by the I/O interface 112, without being collected by the client device 140.

It should be noted that fig. 4 is only a schematic diagram of a system architecture provided in the embodiment of the present application, and the position relationship between the devices, modules, and the like shown in the diagram does not constitute any limitation, for example, in fig. 4, the data storage system 150 is an external memory with respect to the execution device 110, and in other cases, the data storage system 150 may also be disposed in the execution device 110.

As shown in fig. 4, the target model/rule 101 is obtained through training by the training device 120, and in the embodiment of the present application the target model/rule 101 may be the target neural network model of the present application.

MoE can be used to expand the number of parameters of a model, making it possible to train models whose scale reaches a trillion parameters or more and thereby further improve model performance. However, a neural network model that adopts MoE requires a longer training time, which affects the application of the model.

The embodiment of the application provides a training method of a neural network model, which initializes the weight of an expert network in the neural network model by using a word vector matrix, can provide prior knowledge for model training, reduces the training time of the model and improves the training efficiency of the model.

In order to better explain the training method of the neural network model in the embodiment of the present application, a training apparatus of the neural network model in the embodiment of the present application is described below with reference to fig. 5. The apparatus 500 shown in fig. 5 may be deployed on a cloud service device or a terminal device, for example, a computer, a server, a vehicle, or a mobile phone, or on a system formed by a cloud service device and a terminal device. The apparatus 500 may be, for example, the training device 120 in fig. 4, or the execution device 310 or a local device in fig. 3.

The apparatus 500 includes a knowledge graph construction module 510, a language generation module 520, a word vector matrix generation module 530, and a training module 540.

The knowledge graph construction module 510 is configured to construct a knowledge graph according to the corpus of a business domain.

Illustratively, a knowledge-graph may include at least one triplet. The detailed description may refer to step S610 in method 600.

The language generation module 520 is configured to generate at least one text sequence from the at least one triple. The detailed description may refer to step S620 in the method 600.

The word vector matrix generation module 530 is configured to obtain a word vector matrix by training based on the at least one text sequence.

The at least one text sequence may constitute a training data set. In other words, the word vector matrix generation module 530 is configured to obtain a word vector matrix according to the training data set. The detailed description may refer to step S630 in the method 600.

The training module 540 is configured to train the neural network model to obtain a target neural network model. Wherein the target neural network model comprises an expert network layer. The initial weight of at least one of the expert networks in the expert network layer is determined from the word vector matrix. In other words, the initial weights of at least one expert network in the expert network layer are initialized according to the word vector matrix. The detailed description may refer to step S650 in method 600.

The following describes a training method of a neural network model in the embodiment of the present application with reference to fig. 6.

Fig. 6 illustrates a method 600 for training a neural network model provided in an embodiment of the present application. The method shown in fig. 6 may be executed by a cloud service device or a terminal device, for example, a computer, a server, a vehicle, or a mobile phone, or by a system formed by a cloud service device and a terminal device. Illustratively, the method 600 may be performed by the training device 120 in fig. 4, or by the execution device 310 or a local device in fig. 3.

The method 600 includes steps S610 to S650. The following describes steps S610 to S650 in detail.

S610, acquiring a knowledge graph of at least one service field.

The knowledge graph can be constructed according to the linguistic data of the business field. For example, the corpus may include website articles or books, etc.

Illustratively, the knowledge-graph may be constructed by a knowledge-graph construction module 510 in the apparatus 500.

Knowledge graphs of different business domains can be constructed separately based on the corpora of the different business domains.

Illustratively, the at least one business domain includes a first business domain, and the first knowledge-graph of the first business domain may be a knowledge-graph constructed from corpora of the first business domain.

Further, the at least one business domain further includes a second business domain, and a second knowledge graph of the second business domain may be a knowledge graph constructed according to the corpus of the second business domain.

For example, if the first business domain is a financial domain and the second business domain is an internet domain, a first knowledge-graph of the financial domain and a second knowledge-graph of the internet domain may be respectively constructed based on the corpus of the financial domain and the corpus of the internet domain, and in step S610, the first knowledge-graph and the second knowledge-graph may be obtained.

For convenience of description, in the embodiment of the present application, S610 is described by taking only the first business domain and the second business domain as an example. In step S610, knowledge graphs of more or fewer business domains may also be obtained; the number of knowledge graphs is not limited in the embodiment of the present application.

Illustratively, a knowledge-graph includes at least one triple.

In other words, relationships between entities in a knowledge graph are represented in the form of triples.

A triple in the knowledge graph includes three elements, namely a subject, a relationship, and an object, and may be represented in the form (subject, relationship, object), for example, the triple (Socrates, teacher, Aristotle). The subject and the object may be concepts in the business domain of the knowledge graph. The relationship indicates the relationship between the subject and the object.
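One way to hold such triples in a program is sketched below. The class name `Triple`, the field names, the helper `entities`, and the toy finance triples are hypothetical and chosen only for illustration; they are not part of the present application.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    subject: str
    relation: str
    obj: str  # "object" is a Python builtin name, so a shorter name is used


# A knowledge graph of one business domain as a list of triples (toy data).
finance_graph = [
    Triple("bank", "issues", "loan"),
    Triple("loan", "has", "interest rate"),
]


def entities(graph):
    """Collect every subject and object that appears in the graph."""
    found = set()
    for t in graph:
        found.add(t.subject)
        found.add(t.obj)
    return found


finance_entities = entities(finance_graph)
```

Keeping one such list per business domain mirrors the first and second knowledge graphs described for step S610.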

If the knowledge-graph obtained in step S610 is a plurality of knowledge-graphs, each knowledge-graph of the plurality of knowledge-graphs may include at least one triple.

Illustratively, the first knowledge-graph includes at least one first triple. Three words in the first triple are used to represent the subject in the first business field, the object in the first business field, and the relationship between the subject and the object, respectively.

The second knowledge-graph includes at least one second triple. Three words in the second triple respectively represent the subject in the second business field, the object in the second business field and the relationship between the subject and the object.

It should be understood that the "first" of the "first triples" is only used to define the triples as triples in the first knowledge-graph, and has no other defining role. In other words, the triples in the first knowledge-graph may each be referred to as a first triplet.

Similarly, the "second" of the "second triples" is only used to define the triples as the triples in the second knowledge-graph, and has no other defining role. In other words, the triples in the second knowledge-graph may each be referred to as a second triplet.

It should be understood that the above is only an example, and the knowledge graph may also be represented in forms other than triples, which is not limited by the embodiment of the present application.

It should be noted that step S610 is an optional step.

S620, a training data set of at least one business field is obtained.

Illustratively, if the at least one business domain includes a first business domain, step S620 may include obtaining a first training data set of the first business domain.

Further, if the at least one business domain also includes a second business domain, step S620 may include obtaining a first training data set of the first business domain and a third training data set of the second business domain.

In the case where the method 600 includes the step S610, the step S620 may include: and respectively constructing a training data set of the at least one business field according to the knowledge graph of the at least one business field.

In other words, the training data sets of the at least one business domain are determined according to the knowledge-graph of the at least one business domain, respectively.

Optionally, the first training data set of the first business domain is determined from a first knowledge-graph of the first business domain.

Further, the third training data set of the second business domain is determined based on a second knowledge graph of the second business domain.

For convenience of description, in this embodiment of the present application, S620 is described by taking only the first business domain and the second business domain as an example. In step S620, training data sets of more or fewer business domains may also be obtained, and for the manner of obtaining the training data sets of other business domains, reference may be made to the manner of obtaining the first training data set and the third training data set, which is not limited in this embodiment of the present application.

Each of the at least one training data set includes at least one text sequence.

Illustratively, the first training data set includes at least one first text sequence.

Further, the third training data set comprises at least one second text sequence.

It should be understood that the "first" of the "first text sequence" is only used to define the text sequence as a text sequence in the first training data set, and has no other limiting role. In other words, the text sequences in the first training data set may each be referred to as a first text sequence.

Similarly, "second" in "second text sequence" is only used to limit the text sequence to the text sequence in the third training data set, and has no other limiting effect. In other words, the text sequences in the third training data set may each be referred to as a second text sequence.

Optionally, the first training data set is determined from a first knowledge-graph, comprising: at least one first text sequence in the first training data set is generated from at least one first triplet in the first knowledge-graph, respectively.

The third training data set is determined from a second knowledge-graph, comprising: at least one second text sequence in the third training data set is generated from at least one second triplet in the second knowledge-graph, respectively.

A text sequence may be generated from a triplet. A text sequence may be considered a training sample of a word vector generation model.

In other words, a triplet may constitute a sentence, i.e. a text sequence.

For example, a text sequence generated from the triple (Socrates, teacher, Aristotle) may be: Socrates is the teacher of Aristotle.

Illustratively, a language model used to generate the text sequences may be deployed in the language generation module 520 of the apparatus 500; that is, the triples are converted into text sequences by the language generation module 520.
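The conversion from triples to text sequences can be sketched as follows. Note that a fixed sentence template stands in for the language model here, which is a simplification and not the approach of the language generation module 520; the function names and the template wording are assumptions.

```python
def triple_to_text(subject, relation, obj):
    """Render a (subject, relation, object) triple as one sentence.

    A fixed template stands in for a language model in this sketch.
    """
    return f"{subject} is the {relation} of {obj}"


def build_training_set(triples):
    """Generate one text sequence per triple."""
    return [triple_to_text(*t) for t in triples]


first_triples = [("Socrates", "teacher", "Aristotle")]
first_training_set = build_training_set(first_triples)
```

A template only suits relations that read naturally in the chosen sentence pattern; a language model can produce fluent text for arbitrary relations, which is presumably why the description refers to one.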

In step S620, the training data set of at least one business field may also be constructed in other manners, for example, a plurality of text sequences are respectively collected in at least one business field to construct the training data set of at least one business field. The embodiment of the present application does not limit this.

For example, the obtaining of the training data set of the at least one business domain may be to construct a training data set of the at least one business domain, or the obtaining of the training data set of the at least one business domain may also be to receive the training data set of the at least one business domain from another device, or the obtaining of the training data set of the at least one business domain may also be to read a locally stored training data set of the at least one business domain. In the case that the at least one business domain includes a plurality of business domains, the training data sets of the plurality of business domains may be obtained in the same manner or different manners. The embodiment of the present application does not limit the specific manner of "obtaining".

It should be noted that step S620 is an optional step.

S630, at least one word vector matrix is obtained. The at least one word vector matrix is trained based on a training data set of at least one business domain, respectively.

In case the method 600 comprises step S620, step S630 comprises training based on the at least one training data set to obtain at least one word vector matrix, respectively.

Illustratively, the at least one word vector matrix may be trained by the word vector matrix generation module 530 in the apparatus 500.

Optionally, step S630 includes obtaining a first word vector matrix, where the first word vector matrix is trained based on the first training data set.

Optionally, step S630 further includes obtaining a second word vector matrix, where the second word vector matrix is obtained by training based on a third training data set.

The knowledge graph of one service field can indicate the relation between each entity in the service field, and the training data set of the service field can be constructed by the knowledge graph of the service field, so that the learning of the knowledge of the service field by a word vector matrix is facilitated, and the semantic representation capability is improved.

The at least one word vector matrix is, respectively, the weights of the hidden layer in at least one target word vector generation model. The at least one target word vector generation model is obtained by training a word vector generation model based on the training data set of the at least one business domain, respectively.

In this case, step S630 may also be understood as obtaining the weight of the hidden layer in at least one target word vector generation model.

The target word vector generation model is the trained word vector generation model. Training the word vector generation model based on training data sets of different business fields can obtain target word vector generation models of different business fields.

The word vector generation model may include an input layer, a hidden layer, and an output layer. The hidden layer is a fully connected layer. The word vector generation model may be an existing model; for example, the word vector generation model may be a CBOW model.
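How a word vector matrix is read out of a trained model can be illustrated with a deliberately tiny CBOW trainer. This is a sketch under stated assumptions: it uses a full softmax output layer and plain gradient descent for brevity, and the dimension, learning rate, number of epochs, and the single toy sample are all made up; a practical implementation would differ.

```python
import math
import random


def train_cbow(samples, vocab, dim=4, lr=0.1, epochs=200, seed=0):
    """Train a tiny CBOW model with a full softmax output layer.

    samples: list of (context_words, target_word) pairs.
    Returns (W_in, losses), where W_in (len(vocab) x dim) holds the
    hidden-layer weights, i.e. the word vector matrix.
    """
    rng = random.Random(seed)
    idx = {w: i for i, w in enumerate(vocab)}
    V = len(vocab)
    W_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(V)]
    W_out = [[rng.uniform(-0.5, 0.5) for _ in range(V)] for _ in range(dim)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for context, target in samples:
            ctx = [idx[w] for w in context]
            # Hidden layer: mean of the context word vectors.
            h = [sum(W_in[i][j] for i in ctx) / len(ctx) for j in range(dim)]
            scores = [sum(h[j] * W_out[j][k] for j in range(dim)) for k in range(V)]
            m = max(scores)
            exps = [math.exp(s - m) for s in scores]
            z = sum(exps)
            probs = [e / z for e in exps]
            t = idx[target]
            total += -math.log(probs[t])
            # Gradient of the cross-entropy loss with respect to the scores.
            err = [p - (1.0 if k == t else 0.0) for k, p in enumerate(probs)]
            # Backpropagate to the hidden layer before W_out is updated.
            grad_h = [sum(W_out[j][k] * err[k] for k in range(V)) for j in range(dim)]
            for j in range(dim):
                for k in range(V):
                    W_out[j][k] -= lr * h[j] * err[k]
            for i in ctx:
                for j in range(dim):
                    W_in[i][j] -= lr * grad_h[j] / len(ctx)
        losses.append(total / len(samples))
    return W_in, losses


vocab = ["Socrates", "is", "teacher", "Aristotle"]
samples = [(["Socrates", "is", "teacher"], "Aristotle")]
word_vector_matrix, losses = train_cbow(samples, vocab)
```

After training, `word_vector_matrix` has one row per vocabulary word; training the same model on the data set of a different business domain would yield a different matrix for that domain.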

The target word vector generation model is obtained by training the word vector generation model with the words other than the target word in at least one text sequence in the training data set of a business domain as the input of the word vector generation model and with the target word as the target output of the word vector generation model, where the target word is a word in at least one triple in the knowledge graph of the business domain.

Illustratively, for a text sequence, the words other than the target word in the text sequence are used as the input of the word vector generation model, and the target word is used as the target output of the word vector generation model, so as to train the word vector generation model. The target word is a word in the triple corresponding to the text sequence. The target word may be any one of the three elements of the triple: the subject, the object, or the relationship.

The triples corresponding to the text sequence refer to the triples used to guide the generation of the text sequence. Alternatively, the text sequence may be generated based on the triples to which the text sequence corresponds.
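The construction of one training sample from a text sequence and its corresponding triple can be sketched as follows. The function name `make_sample`, the `role` argument, and the whitespace tokenization of the example sentence are assumptions made for illustration.

```python
def make_sample(tokens, triple, role):
    """Split a tokenised text sequence into (input words, target word).

    `triple` is (subject, relation, object); `role` selects which element
    is the target word. All other tokens form the model input.
    """
    position = {"subject": 0, "relation": 1, "object": 2}[role]
    target = triple[position]
    if target not in tokens:
        raise ValueError(f"target word {target!r} not found in text sequence")
    inputs = [tok for tok in tokens if tok != target]
    return inputs, target


tokens = "Socrates is the teacher of Aristotle".split()
triple = ("Socrates", "teacher", "Aristotle")

object_sample = make_sample(tokens, triple, "object")
subject_sample = make_sample(tokens, triple, "subject")
relation_sample = make_sample(tokens, triple, "relation")
```

Each choice of `role` yields a different sample from the same text sequence, so one triple can contribute up to three training samples.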

Optionally, the first word vector matrix is trained based on a first training data set of the first business domain, and includes: the first word vector matrix is the weight of a hidden layer in the first target word vector generation model, the first target word vector generation model is obtained by taking words except for target words in at least one first text sequence as input of the word vector generation model and taking the target words as target output of the word vector generation model to train the word vector generation model, and the target words are words in at least one first triple.

Specifically, words except the object in the at least one first triple in the at least one first text sequence are used as input of a word vector generation model, and the object in the at least one first triple is used as target output of the word vector generation model to train the word vector generation model, so that a first target word vector generation model is obtained.

In other words, for a first text sequence, the words in the first text sequence other than the object in the first triple corresponding to the first text sequence are used as the input of the word vector generation model, and the object in the first triple corresponding to the first text sequence is used as the target output of the word vector generation model.

The target output may also be understood as the positive sample label of a training sample. In this case, the positive sample label is the object. The negative sample labels may be word pairs obtained by negative sampling.
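Negative sampling can be sketched as drawing words other than the positive sample label from the vocabulary. The sketch below assumes uniform sampling without replacement, which is a simplification; word2vec implementations commonly sample according to a smoothed word-frequency distribution instead. The function name and the toy vocabulary are hypothetical.

```python
import random


def negative_samples(vocab, positive, k, seed=0):
    """Draw k distinct words from vocab, excluding the positive label."""
    candidates = [w for w in vocab if w != positive]
    if k > len(candidates):
        raise ValueError("not enough candidate words for negative sampling")
    return random.Random(seed).sample(candidates, k)


vocab = ["Socrates", "is", "the", "teacher", "of", "Aristotle"]
negatives = negative_samples(vocab, positive="Aristotle", k=2)
```

During training, the model is then pushed to score the positive label higher than the sampled negatives, which avoids computing a softmax over the whole vocabulary.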

For example, the text sequence is: Socrates is the teacher of Aristotle. The triple corresponding to the text sequence is (Socrates, teacher, Aristotle), and the object in the triple is Aristotle. The words in the text sequence other than Aristotle, that is, (Socrates, is, teacher), are used as the input of the CBOW model, and Aristotle is used as the target output of the CBOW model.

Specifically, in another case, the words in the at least one first text sequence other than the subject in the at least one first triple are used as the input of the word vector generation model, and the subject in the at least one first triple is used as the target output of the word vector generation model, so as to train the word vector generation model and obtain the first target word vector generation model.

In other words, for a first text sequence, the words in the first text sequence other than the subject in the corresponding first triple are used as the input of the word vector generation model, and the subject in that first triple is used as the target output of the word vector generation model.

In this case, the positive sample label is the subject. The negative sample labels may be word pairs obtained by negative sampling.

For example, the text sequence is: Socrates is the teacher of Aristotle. The triple corresponding to the text sequence is (Socrates, teacher, Aristotle), and the subject in the triple is Socrates. The words in the text sequence other than Socrates, that is, (is the teacher of Aristotle), are used as the input of the CBOW model, and Socrates is used as the target output of the CBOW model.

Specifically, in yet another case, the words in the at least one first text sequence other than the relationship in the at least one first triple are used as the input of the word vector generation model, and the relationship in the at least one first triple is used as the target output of the word vector generation model, so as to train the word vector generation model and obtain the first target word vector generation model.

In other words, for a first text sequence, the words in the first text sequence other than the relationship in the corresponding first triple are used as the input of the word vector generation model, and the relationship in that first triple is used as the target output of the word vector generation model.

In this case, the positive sample label is the relationship. The negative sample labels may be word pairs obtained by negative sampling.

For example, the text sequence is: Socrates is the teacher of Aristotle. The triple corresponding to the text sequence is (Socrates, teacher, Aristotle), and the relationship in the triple is teacher. The words in the text sequence other than teacher, that is, (Socrates, of Aristotle), are used as the input of the CBOW model, and teacher is used as the target output of the CBOW model.
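The three cases above differ only in which element of the triple is held out as the target word. A minimal sketch of building the (input, target output) pair for each case follows. The function name `make_cbow_sample`, the `role` argument, and the whitespace-style tokenization of an English rendering of the example sentence are assumptions for illustration.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relationship, object)


def make_cbow_sample(tokens: List[str], triple: Triple,
                     role: str) -> Tuple[List[str], str]:
    """Return (context words, target word) for one text sequence.

    role selects which triple element is the target word:
    "subject", "relationship" or "object".
    """
    position = {"subject": 0, "relationship": 1, "object": 2}[role]
    target = triple[position]
    if target not in tokens:
        raise ValueError(f"target word {target!r} does not occur in the text sequence")
    # Every word except the target word becomes model input.
    context = [token for token in tokens if token != target]
    return context, target


# Hypothetical tokenization of the example sentence.
tokens = ["Socrates", "is", "the", "teacher", "of", "Aristotle"]
triple = ("Socrates", "teacher", "Aristotle")
for role in ("object", "subject", "relationship"):
    context, target = make_cbow_sample(tokens, triple, role)
    print(role, context, "->", target)
```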

Optionally, the second word vector matrix is obtained by training based on a third training data set of the second service domain, and includes:

the second word vector matrix is the weight of a hidden layer in the second target word vector generation model, the second target word vector generation model is obtained by taking words except the target words in at least one second text sequence as the input of the word vector generation model and taking the target words as the target output of the word vector generation model to train the word vector generation model, and the target words are words in at least one second triple.

For the training process of the second target word vector generation model, refer to the training process of the first target word vector generation model described above. Replacing the first text sequence in that training process with a second text sequence, and the first triple with a second triple, yields the second target word vector generation model.
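To make concrete the idea that the word vector matrix is the hidden-layer weight of a trained word vector generation model, the following is a minimal CBOW-style sketch in plain Python. It is not the implementation of the embodiments: the full-softmax output layer, the hyperparameters (`dim`, `epochs`, `lr`), the uniform initialization and the function name `train_cbow` are all assumptions chosen to keep the example small.

```python
import math
import random
from typing import List, Tuple

Sample = Tuple[List[str], str]  # (context words, target word)


def train_cbow(samples: List[Sample], vocab: List[str], dim: int = 8,
               epochs: int = 300, lr: float = 0.2, seed: int = 0):
    """Train a tiny CBOW model; return (hidden-layer weights, mean loss per epoch)."""
    rng = random.Random(seed)
    index = {word: i for i, word in enumerate(vocab)}
    size = len(vocab)
    # Hidden layer: one row per vocabulary word. This is the word vector matrix.
    w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(size)]
    # Output layer: maps the hidden vector to one score per vocabulary word.
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(size)] for _ in range(dim)]
    losses = []
    for _ in range(epochs):
        total = 0.0
        for context, target in samples:
            ctx = [index[word] for word in context]
            # Forward pass: average the context rows, then score every word.
            h = [sum(w_hidden[c][d] for c in ctx) / len(ctx) for d in range(dim)]
            scores = [sum(h[d] * w_out[d][j] for d in range(dim)) for j in range(size)]
            top = max(scores)
            exps = [math.exp(s - top) for s in scores]
            norm = sum(exps)
            # Cross-entropy loss of the softmax, written in a numerically stable form.
            total += math.log(norm) - (scores[index[target]] - top)
            # Backward pass: d(loss)/d(scores) = softmax(scores) - one_hot(target).
            err = [e / norm for e in exps]
            err[index[target]] -= 1.0
            grad_h = [sum(w_out[d][j] * err[j] for j in range(size)) for d in range(dim)]
            for d in range(dim):
                for j in range(size):
                    w_out[d][j] -= lr * h[d] * err[j]
            for c in ctx:
                for d in range(dim):
                    w_hidden[c][d] -= lr * grad_h[d] / len(ctx)
        losses.append(total / len(samples))
    return w_hidden, losses


vocab = ["Socrates", "is", "the", "teacher", "of", "Aristotle"]
samples = [
    (["Socrates", "is", "the", "teacher", "of"], "Aristotle"),   # object as target
    (["is", "the", "teacher", "of", "Aristotle"], "Socrates"),   # subject as target
    (["Socrates", "is", "the", "of", "Aristotle"], "teacher"),   # relationship as target
]
word_vectors, losses = train_cbow(samples, vocab)
print(len(word_vectors), len(word_vectors[0]))
```

Training on second text sequences and second triples with the same function would yield a second matrix in the same way.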

It should be understood that, in the embodiment of the present application, the step S630 is described by taking only the first word vector matrix and the second word vector matrix as examples, and in practical applications, more word vector matrices may be obtained in the step S630, which is not limited in the embodiment of the present application.

Illustratively, the obtaining of the at least one word vector matrix may be training to obtain the at least one word vector matrix, or may also be receiving the at least one word vector matrix from another device, or may also be reading at least one locally stored word vector matrix. The embodiment of the present application does not limit the specific manner of "obtaining".

S640, acquiring a second training data set.

The data types in the second training data set are related to task types of the neural network model.

Alternatively, the neural network model may be an NLP model. Accordingly, the data in the second training data set may be text data.

Alternatively, the neural network model may be a speech processing model. Accordingly, the data in the second training data set may be speech data.

Illustratively, the executing device of step S640 may be the training device 120 as shown in fig. 4. The second training data set may be training data maintained in a database 130 as shown in fig. 4.

S650, training the neural network model based on the second training data set to obtain a target neural network model, wherein the neural network model includes an expert network layer, and the initial weight of at least one expert network in the expert network layer is determined according to the at least one word vector matrix.

Illustratively, step S650 may be performed by the training module 540 in the apparatus 500.

Specifically, the expert network layer includes a first expert network of the first business domain, and the initial weight of the first expert network is determined according to the first word vector matrix.

Optionally, the expert network layer further includes a second expert network of a second business domain, and the initial weight of the second expert network is determined according to the second word vector matrix.

That the initial weight of the at least one expert network is determined according to the at least one word vector matrix may also be understood as follows: the weight of the at least one expert network is initialized according to the at least one word vector matrix.

The expert network layer includes a plurality of expert networks, and the expert network layer processes data input into the expert network layer through a target expert network of the plurality of expert networks. The target expert network is determined based on data input to the expert network layer.

That is, during training or inference of the target neural network model, the target expert network is selected based on the data input to the expert network layer.

Optionally, the target expert network may include a first expert network. For example, the expert network layer may process data input to the expert network layer through a selected first expert network, which is selected based on the data input to the expert network layer.

Optionally, the target expert network may include a second expert network. For example, the expert network layer may process data input to the expert network layer through a selected second expert network, which is selected based on the data input to the expert network layer.

The neural network model is trained based on the second training data set, and the resulting trained neural network model is the target neural network model.

Illustratively, the neural network model may be an existing neural network model.

For example, the neural network model may be a Switch Transformer model.

Alternatively, the neural network model may be custom-built. This is not limited in the embodiments of the present application, as long as the neural network model includes an expert network layer.

The number of the expert network layers may be one or multiple, which is not limited in the embodiment of the present application.

In the case where the neural network model includes a plurality of expert network layers, the initial weights of some or all of the expert network layers may be determined in the manner of step S650. For convenience of description, the embodiments of the present application take only one expert network layer as an example, which does not limit the scheme of the embodiments of the present application.

An expert network layer includes a plurality of expert networks, the plurality of expert networks having different parameters.

It should be noted that the plurality of expert networks may be deployed on one device or may be deployed on a plurality of devices. If the plurality of expert networks are deployed on a plurality of devices, method 600 may also be understood to be performed collectively by the plurality of devices.

Illustratively, the expert network layer may include a gate network. Based on the data input to the expert network layer, the gate network may select one or more expert networks to take part in the actual computation on that data. In other words, the gate network may route the data input to the expert network layer to one or more expert networks for processing. The selected one or more expert networks are the target expert networks. The target expert network may be determined using an existing scheme, for example, the routing method in MoE or the routing method of the switch layer in a Switch Transformer, which is not limited in the embodiments of the present application.

If the target expert network includes a plurality of expert networks, each of them processes the input data, and their outputs may be combined, using weights generated by the gate network, into the output of the expert network layer. The weights may be calculated using an existing scheme, for example, the calculation method in MoE or the weight calculation method of the switch layer in a Switch Transformer, which is not limited in the embodiments of the present application.

That is, the target expert network in the expert network layer may be different for different input data.
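The gate-network behaviour described above can be sketched as follows. This is a generic top-k routing illustration, not the routing of any particular existing model: the linear gate, the softmax, the renormalization of the gate weights over the selected experts, and the names `expert_layer`, `gate_weights` and `experts` are all assumptions.

```python
import math
from typing import Callable, List, Tuple

Vector = List[float]
Expert = Callable[[Vector], Vector]


def softmax(values: Vector) -> Vector:
    top = max(values)
    exps = [math.exp(v - top) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def expert_layer(x: Vector, gate_weights: List[Vector], experts: List[Expert],
                 k: int = 1) -> Tuple[Vector, List[int]]:
    """Route x to the k highest-scoring experts and mix their outputs.

    gate_weights holds one row per expert; experts are assumed to return
    a vector of the same length as x.
    """
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    probs = softmax(logits)
    # The selected indices play the role of the target expert networks.
    selected = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in selected)
    output = [0.0] * len(x)
    for i in selected:
        weight = probs[i] / norm  # assumption: renormalize over the selected experts
        y = experts[i](x)
        output = [o + weight * yi for o, yi in zip(output, y)]
    return output, selected


# Two toy experts with different behaviour, and a gate that prefers
# expert 0 for the first input dimension and expert 1 for the second.
experts = [lambda v: [2.0 * a for a in v], lambda v: [-a for a in v]]
gate_weights = [[5.0, 0.0], [0.0, 5.0]]
print(expert_layer([1.0, 0.0], gate_weights, experts, k=1))
print(expert_layer([0.0, 1.0], gate_weights, experts, k=1))
```

Different inputs reach different experts here, which is the sense in which the target expert network may differ for different input data.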

The initial weight of at least one expert network in the expert network layer is determined according to the at least one word vector matrix. In other words, the initial weight of the at least one expert network is determined according to the weight of the hidden layer in the at least one target word vector generation model. That is, the structure of the at least one expert network is the same as that of the hidden layer of the at least one target word vector generation model, and may be a fully connected layer. The weight of the at least one expert network is initialized according to the weight distribution of the fully connected layer of the at least one target word vector generation model.

The expert networks in the expert network layer other than the at least one expert network may be initialized in an existing manner, for example, with random values drawn from a Gaussian distribution.
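The split between experts initialized from word vector matrices and experts initialized randomly can be sketched as below. The function name `init_expert_weights`, the mapping from expert index to matrix, and the standard deviation `std` are hypothetical; each expert is reduced to a single weight matrix for brevity.

```python
import random
from typing import Dict, List

Matrix = List[List[float]]


def init_expert_weights(num_experts: int, rows: int, cols: int,
                        word_vector_matrices: Dict[int, Matrix],
                        std: float = 0.02, seed: int = 0) -> List[Matrix]:
    """Initialize one weight matrix per expert.

    Experts listed in word_vector_matrices start from the given matrix;
    all other experts start from Gaussian random values.
    """
    rng = random.Random(seed)
    weights = []
    for expert in range(num_experts):
        if expert in word_vector_matrices:
            matrix = word_vector_matrices[expert]
            if len(matrix) != rows or any(len(row) != cols for row in matrix):
                raise ValueError("word vector matrix must match the expert weight shape")
            weights.append([list(row) for row in matrix])  # copy, do not alias
        else:
            weights.append([[rng.gauss(0.0, std) for _ in range(cols)]
                            for _ in range(rows)])
    return weights


# Hypothetical 2x2 matrices standing in for the first and second word vector matrices.
first_matrix = [[0.1, 0.2], [0.3, 0.4]]
second_matrix = [[-0.1, 0.5], [0.0, 0.2]]
layer = init_expert_weights(4, 2, 2, {0: first_matrix, 1: second_matrix})
print(layer[0], layer[1])
```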

For convenience of description, the manner of weight initialization is described below by taking the first expert network as an example.

The first expert network may include one or more expert networks.

In other words, the first word vector matrix may be used to initialize the weight of one expert network or the weights of multiple expert networks.

Optionally, the initial weight of the first expert network is determined from a first word vector matrix, comprising: the initial weight of the first expert network is a first word vector matrix.

Optionally, the initial weight of the first expert network is determined from a first word vector matrix, including: the initial weight of the first expert network is obtained by adjusting the first word vector matrix.

Specifically, one or more values in the first word vector matrix may be adjusted, and the adjusted first word vector matrix may be used as an initial weight of the first expert network.

Optionally, the initial weight of the first expert network is determined from a first word vector matrix, comprising: the initial weight of one part of the expert network in the first expert network is a first word vector matrix, and the initial weight of the other part of the expert network is obtained by adjusting the first word vector matrix.

In this case, the first expert network includes two or more expert networks.
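The three options above (use the matrix directly, use an adjusted matrix, or mix both across several experts) can be sketched as follows. The text does not say how the matrix is adjusted, so the small additive Gaussian noise in `adjust`, the `mode` names, and the half-and-half split in the mixed case are purely illustrative assumptions.

```python
import random
from typing import List

Matrix = List[List[float]]


def adjust(matrix: Matrix, noise_std: float, rng: random.Random) -> Matrix:
    """Assumed adjustment: add small Gaussian noise to every value."""
    return [[value + rng.gauss(0.0, noise_std) for value in row] for row in matrix]


def init_first_expert_network(word_matrix: Matrix, num_experts: int, mode: str,
                              noise_std: float = 0.01, seed: int = 0) -> List[Matrix]:
    """Return initial weights for the expert(s) making up the first expert network."""
    rng = random.Random(seed)
    copy = lambda: [list(row) for row in word_matrix]
    if mode == "copy":        # initial weight is the word vector matrix itself
        return [copy() for _ in range(num_experts)]
    if mode == "adjusted":    # initial weight is an adjusted word vector matrix
        return [adjust(word_matrix, noise_std, rng) for _ in range(num_experts)]
    if mode == "mixed":       # some experts copy, the others adjust
        if num_experts < 2:
            raise ValueError("the mixed mode needs at least two expert networks")
        num_copies = num_experts // 2
        return ([copy() for _ in range(num_copies)]
                + [adjust(word_matrix, noise_std, rng)
                   for _ in range(num_experts - num_copies)])
    raise ValueError(f"unknown mode: {mode!r}")


word_matrix = [[0.1, 0.2], [0.3, 0.4]]  # hypothetical first word vector matrix
mixed = init_first_expert_network(word_matrix, num_experts=3, mode="mixed")
print(mixed[0] == word_matrix, mixed[1] == word_matrix)
```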

For the weight initialization of other expert networks, refer to the weight initialization of the first expert network. For example, replacing the first expert network in the initialization process with the second expert network, and the first word vector matrix with the second word vector matrix, yields the initial weight of the second expert network.

It should be understood that, in the embodiment of the present application, only the first expert network and the second expert network are taken as examples to describe step S650, and in practical applications, other expert networks in the expert network layer may also perform weight initialization by using the scheme of the embodiment of the present application, which is not limited in the embodiment of the present application.

According to the scheme of the embodiments of the present application, the word vector matrix is obtained by training on a training data set and therefore contains a large amount of semantic information. Initializing the weights of some or all of the expert networks in the model with the word vector matrix introduces this semantic information into the expert networks and provides them with prior knowledge, which shortens the training time. In particular, when the neural network model is large, the scheme of the embodiments of the present application can greatly shorten the training time. In addition, introducing semantic information into the expert networks can effectively improve their semantic representation capability and thereby improve the performance of the trained model.

In addition, different word vector matrices are obtained by training on training data sets of different business domains and carry different semantic information. When different expert networks in the expert network layer are initialized with different word vector matrices, they have different semantic representation capabilities, and combining the semantics of the different expert networks can further improve the model's understanding of natural language semantics and hence its performance. For example, if a plurality of expert networks in the expert network layer are respectively initialized with a plurality of word vector matrices obtained by training on training data sets of a plurality of business domains, the expert network layer has the semantic representation capability of those business domains, which improves the model's understanding of natural language semantics. When the target neural network model is applied, the data of each business domain can then be routed to the corresponding expert network for processing, which further improves the performance of the model.

In addition, the knowledge graph of a business domain indicates the relationships among the entities in that domain. Constructing the training data set of the business domain from its knowledge graph helps the word vector matrix learn the knowledge of the domain and improves its semantic representation capability.
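As a rough illustration of building a training data set from a knowledge graph, the sketch below turns each triple into a tokenized text sequence and keeps the triple alongside it. The fixed sentence template in `text_sequence_from_triple` is an assumption; the text does not specify how text sequences are generated from triples.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relationship, object)


def text_sequence_from_triple(triple: Triple) -> List[str]:
    """Assumed template: '<subject> is the <relationship> of <object>'."""
    subject, relationship, obj = triple
    return [subject, "is", "the", relationship, "of", obj]


def build_training_set(knowledge_graph: List[Triple]) -> List[Tuple[List[str], Triple]]:
    """Pair every generated text sequence with the triple that guided it."""
    return [(text_sequence_from_triple(triple), triple) for triple in knowledge_graph]


# A one-triple knowledge graph using the example from this description.
knowledge_graph = [("Socrates", "teacher", "Aristotle")]
for tokens, triple in build_training_set(knowledge_graph):
    print(" ".join(tokens), triple)
```

Keeping the triple next to each sequence makes it straightforward to pick the subject, object or relationship as the target word when training the word vector generation model.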

Fig. 7 shows a schematic flowchart of a data processing method 700 according to an embodiment of the present application. The method may be performed by an apparatus or device capable of data processing, for example, a cloud service device, a terminal device such as a computer or server with sufficient computing power to perform the data processing method, or a system formed by a cloud service device and a terminal device. For example, method 700 may be performed by the execution device 110 in fig. 4, the execution device 310 in fig. 3, or a local device.

For example, the method 700 may be specifically executed by the execution device 110 shown in fig. 4, and the data to be processed in the method 700 may be input data given by the client device 140 shown in fig. 4.

The model used in method 700 of data processing in FIG. 7 may be constructed by the method described above in FIG. 6. Specific implementations of the method 700 may refer to the method 600 described above, and in order to avoid unnecessary repetition, the repeated description is appropriately omitted below when describing the method 700.

The method 700 includes steps S710 to S720, and the steps S710 to S720 are described below.

S710, acquiring data to be processed.

The type of data to be processed is related to the task type of the neural network model.

Alternatively, the neural network model may be an NLP model. Accordingly, the data to be processed may be text data.

Alternatively, the neural network model may be a speech processing model. Accordingly, the data to be processed may be voice data.

S720, processing the data to be processed by using a target neural network model, wherein the target neural network model is obtained by training the neural network model based on a second training data set, the neural network model comprises an expert network layer, the expert network layer comprises a first expert network in a first service field, the initial weight of the first expert network is determined according to a first word vector matrix, and the first word vector matrix is obtained by training based on a first training data set in the first service field.

Optionally, the expert network layer further includes a second expert network in a second business domain, an initial weight of the second expert network is determined according to a second word vector matrix, and the second word vector matrix is trained based on a third training data set in the second business domain.

Optionally, the expert network layer is configured to process data input to the expert network layer through a selected first expert network, the first expert network being selected based on the data input to the expert network layer.

Optionally, the first training data set is determined from a first knowledge graph of the first business domain.

Optionally, the first training data set is determined from a first knowledge-graph, comprising: at least one first text sequence in the first training data set is generated according to at least one first triple in the first knowledge graph, and three words in the first triple are used for representing a subject in the first business field, an object in the first business field and a relationship between the subject and the object respectively.

Optionally, the initial weight of the first expert network is determined from a first word vector matrix, including: the initial weight of the first expert network is a first word vector matrix.

The apparatus of the embodiment of the present application will be described with reference to fig. 8 to 11. It should be understood that the apparatus described below is capable of performing the method of the embodiments of the present application described above, and in order to avoid unnecessary repetition, the repetitive description will be appropriately omitted when describing the apparatus of the embodiments of the present application.

Fig. 8 is a schematic block diagram of a training apparatus for a neural network model according to an embodiment of the present application. The apparatus 3000 shown in fig. 8 includes an obtaining unit 3010 and a processing unit 3020.

The obtaining unit 3010 and the processing unit 3020 may be configured to perform the method 600 for training a neural network model according to the embodiment of the present application.

The obtaining unit 3010 is configured to obtain a first word vector matrix, where the first word vector matrix is obtained by training based on a first training data set in a first service domain.

The obtaining unit 3010 is further configured to obtain a second training data set.

The processing unit 3020 is configured to train the neural network model based on the second training data set to obtain a target neural network model, where the neural network model includes an expert network layer, the expert network layer includes a first expert network in the first service field, and an initial weight of the first expert network is determined according to the first word vector matrix.

Optionally, as an embodiment, the obtaining unit 3010 is further configured to obtain a second word vector matrix, where the second word vector matrix is obtained by training based on a third training data set of the second business domain, the expert network layer further includes a second expert network of the second business domain, and the initial weight of the second expert network is determined according to the second word vector matrix.

Optionally, as an embodiment, the expert network layer is configured to process data input to the expert network layer through a selected first expert network, the first expert network being selected based on the data input to the expert network layer.

Optionally, as an embodiment, the first training data set is determined from a first knowledge graph of the first business domain.

Optionally, as an embodiment, that the first training data set is determined according to a first knowledge graph of the first business domain includes: at least one first text sequence in the first training data set is generated according to at least one first triple in the first knowledge graph, where the first triple includes a subject in the first business domain, an object in the first business domain, and a relationship between the subject and the object.

Optionally, as an embodiment, the first word vector matrix is trained based on a first training data set of the first business domain, and includes: the first word vector matrix is the weight of a hidden layer in the first target word vector generation model, the first target word vector generation model is obtained by taking words except target words in at least one first text sequence as input of the word vector generation model and taking the target words as target output of the word vector generation model to train the word vector generation model, and the target words are words in at least one first triple.

Optionally, as an embodiment, the initial weight of the first expert network is determined according to a first word vector matrix, which includes: the initial weight of the first expert network is a first word vector matrix.

Optionally, as an embodiment, the neural network model is an NLP model or a speech processing model.

Fig. 9 is a schematic block diagram of a data processing apparatus according to an embodiment of the present application. The apparatus 4000 shown in fig. 9 includes an obtaining unit 4010 and a processing unit 4020.

The obtaining unit 4010 and the processing unit 4020 can be used to execute the method 700 of data processing in the embodiment of the present application.

An obtaining unit 4010 is configured to obtain data to be processed.

The processing unit 4020 is configured to process data to be processed by using a target neural network model, where the target neural network model is obtained by training a neural network model based on a second training data set, the neural network model includes an expert network layer, the expert network layer includes a first expert network in a first service field, an initial weight of the first expert network is determined according to a first word vector matrix, and the first word vector matrix is obtained by training based on the first training data set in the first service field.

Optionally, as an embodiment, the expert network layer further includes a second expert network in the second business domain, the initial weight of the second expert network is determined according to a second word vector matrix, and the second word vector matrix is trained based on a third training data set in the second business domain.

Optionally, as an embodiment, the expert network layer is configured to process data input to the expert network layer through a selected first expert network, the first expert network being selected according to the data input to the expert network layer.

Optionally, as an embodiment, the first training data set is determined according to a first knowledge-graph, including: at least one first text sequence in the first training data set is generated according to at least one first triple in the first knowledge graph, and three words in the first triple are used for representing a subject in the first business field, an object in the first business field and a relationship between the subject and the object respectively.

Optionally, as an embodiment, the neural network model is a natural language processing NLP model or a speech processing model.

It should be noted that the above apparatus 3000 and apparatus 4000 are embodied in the form of functional units. A "unit" herein may be implemented in software and/or hardware, which is not specifically limited.

For example, a "unit" may be a software program, a hardware circuit, or a combination of both that implement the above-described functions. The hardware circuitry may include an Application Specific Integrated Circuit (ASIC), an electronic circuit, a processor (e.g., a shared, dedicated, or group processor) and memory that execute one or more software or firmware programs, a combinational logic circuit, and/or other suitable components that support the described functionality.

Thus, the units of each example described in the embodiments of the present application can be implemented in electronic hardware, or a combination of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the technical solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

Fig. 10 is a schematic hardware structure diagram of a training apparatus for a neural network model according to an embodiment of the present application. The training apparatus 5000 of the neural network model shown in fig. 10 (the apparatus 5000 may be a computer device) includes a memory 5001, a processor 5002, a communication interface 5003 and a bus 5004. The memory 5001, the processor 5002 and the communication interface 5003 are connected to each other via a bus 5004.

The memory 5001 may be a Read Only Memory (ROM), a static memory device, a dynamic memory device, or a Random Access Memory (RAM). The memory 5001 may store a program, and the processor 5002 is configured to perform the steps of the training method of the neural network model of the embodiments of the present application when the program stored in the memory 5001 is executed by the processor 5002.

The processor 5002 may be a general-purpose Central Processing Unit (CPU), a microprocessor, an Application Specific Integrated Circuit (ASIC), a Graphics Processing Unit (GPU) or one or more integrated circuits, and is configured to execute related programs to implement the method for training the neural network model according to the embodiment of the present disclosure.

The processor 5002 may also be an integrated circuit chip having signal processing capabilities. In implementation, the steps of the training method of the neural network model of the present application may be implemented by integrated logic circuits of hardware in the processor 5002 or instructions in the form of software.

The processor 5002 may also be a general purpose processor, a digital signal processor (DSP), an ASIC, a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments of the present application may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an erasable programmable read-only memory, or a register. The storage medium is located in the memory 5001. The processor 5002 reads information in the memory 5001 and, in combination with its hardware, performs the functions required of the units included in the apparatus shown in fig. 8, or performs the training method of the neural network model according to the method embodiments of the present application.

The communication interface 5003 enables communication between the apparatus 5000 and other devices or communication networks using transceiving means such as, but not limited to, a transceiver. For example, the second training data set may be acquired through the communication interface 5003.

The bus 5004 may include a pathway to transfer information between the various components of the apparatus 5000 (e.g., the memory 5001, the processor 5002, the communication interface 5003).

Fig. 11 is a schematic hardware structure diagram of a data processing apparatus according to an embodiment of the present application. A data processing apparatus 6000 (the apparatus 6000 may specifically be a computer device) shown in fig. 11 includes a memory 6001, a processor 6002, a communication interface 6003, and a bus 6004. The memory 6001, the processor 6002, and the communication interface 6003 are connected to each other in a communication manner via a bus 6004.

The memory 6001 may be a ROM, a static storage device, a dynamic storage device, or a RAM. The memory 6001 may store a program, and when the program stored in the memory 6001 is executed by the processor 6002, the processor 6002 is configured to perform the steps of the data processing method of the embodiments of the present application.

The processor 6002 may be a general-purpose CPU, a microprocessor, an ASIC, a GPU or one or more integrated circuits, and is used to execute the related programs to implement the data processing methods of the embodiments of the present application.

The processor 6002 could also be an integrated circuit chip that has signal processing capabilities. In implementation, the steps of the data processing method of the present application may be performed by integrated logic circuits of hardware or instructions in the form of software in the processor 6002.

The processor 6002 may also be a general purpose processor, a DSP, an ASIC, an FPGA or another programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component, and may implement or perform the methods, steps, and logic blocks disclosed in the embodiments of the present application. A general purpose processor may be a microprocessor or any conventional processor. The steps of the method disclosed in connection with the embodiments of the present application may be performed directly by a hardware decoding processor, or by a combination of hardware and software modules in the decoding processor. The software module may be located in a storage medium well known in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an erasable programmable read-only memory, or a register. The storage medium is located in the memory 6001. The processor 6002 reads information in the memory 6001 and, in combination with its hardware, performs the functions required of the units included in the apparatus shown in fig. 9, or performs the data processing method according to the method embodiments of the present application.

The communication interface 6003 uses a transceiver apparatus, such as but not limited to a transceiver, to enable communication between the apparatus 6000 and other devices or communication networks. For example, the data to be processed may be acquired through the communication interface 6003.

The bus 6004 may include a path that transfers information between the components of the apparatus 6000 (e.g., the memory 6001, the processor 6002, and the communication interface 6003).

It should be noted that although the apparatus 5000 and the apparatus 6000 described above show only a memory, a processor, and a communication interface, those skilled in the art will appreciate that, in a specific implementation, the apparatus 5000 and the apparatus 6000 may also include other components necessary for normal operation. Depending on specific needs, the apparatus 5000 and the apparatus 6000 may also include hardware components that implement other additional functions. Furthermore, those skilled in the art will understand that the apparatus 5000 and the apparatus 6000 may include only the components necessary to implement the embodiments of the present application, and need not include all of the components shown in fig. 10 and fig. 11.

Embodiments of the present application further provide a computer-readable medium storing program code for execution by a device, where the program code includes instructions for performing the training method of a neural network model or the data processing method in the embodiments of the present application.

Embodiments of the present application further provide a computer program product including instructions, which when run on a computer, cause the computer to execute the data processing method in the embodiments of the present application.

The embodiment of the present application further provides a chip, where the chip includes a processor and a data interface, and the processor reads an instruction stored in a memory through the data interface, and executes a training method of a neural network model or a data processing method in the embodiment of the present application.

Optionally, as an implementation manner, the chip may further include a memory, where the memory stores instructions, and the processor is configured to execute the instructions stored on the memory, and when the instructions are executed, the processor is configured to execute the training method of the neural network model or the data processing method in the embodiment of the present application.

It should be understood that the processor in the embodiments of the present application may be a central processing unit (CPU), or may be another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or another programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. A general-purpose processor may be a microprocessor, or the processor may be any conventional processor or the like.

It should also be understood that the memory in the embodiments of the present application may be a volatile memory or a non-volatile memory, or may include both volatile and non-volatile memory. The non-volatile memory may be a read-only memory (ROM), a programmable ROM (PROM), an erasable PROM (EPROM), an electrically erasable PROM (EEPROM), or a flash memory. The volatile memory may be a random access memory (RAM), which serves as an external cache. By way of example and not limitation, many forms of RAM are available, such as static RAM (SRAM), dynamic RAM (DRAM), synchronous DRAM (SDRAM), double data rate SDRAM (DDR SDRAM), enhanced SDRAM (ESDRAM), synchlink DRAM (SLDRAM), and direct Rambus RAM (DR RAM).

The above-described embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When implemented in software, the above-described embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product comprises one or more computer instructions or computer programs. When the computer instructions or computer programs are loaded or executed on a computer, the procedures or functions according to the embodiments of the present application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another; for example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless (e.g., infrared, radio, microwave) manner. The computer-readable storage medium may be any usable medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that contains one or more usable media. The usable medium may be a magnetic medium (e.g., a floppy disk, a hard disk, or a magnetic tape), an optical medium (e.g., a DVD), or a semiconductor medium. The semiconductor medium may be a solid-state drive.

It should be understood that the term "and/or" herein merely describes an association relationship between associated objects and indicates that three relationships may exist. For example, A and/or B may mean: A exists alone, both A and B exist, or B exists alone, where A and B may be singular or plural. In addition, the character "/" herein generally indicates an "or" relationship between the associated objects before and after it, but may also indicate an "and/or" relationship; the meaning should be understood from the context.

In this application, "at least one" means one or more, and "a plurality" means two or more. "At least one of the following" or a similar expression refers to any combination of the listed items, including any combination of single items or plural items. For example, at least one of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where each of a, b, and c may be single or multiple.
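As an illustration only, the seven combinations listed above can be enumerated with a short Python sketch. The function name `at_least_one_of` is hypothetical and is not part of the present application.

```python
from itertools import combinations


def at_least_one_of(items):
    """Return every non-empty combination of items, smallest combinations first."""
    return [
        "-".join(combo)
        for size in range(1, len(items) + 1)
        for combo in combinations(items, size)
    ]


# Enumerate what "at least one of a, b, or c" may represent.
print(at_least_one_of(["a", "b", "c"]))
```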

It should be understood that, in the various embodiments of the present application, the sequence numbers of the above-mentioned processes do not mean the execution sequence, and the execution sequence of each process should be determined by its function and inherent logic, and should not constitute any limitation to the implementation process of the embodiments of the present application.

Those of ordinary skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.

It is clear to those skilled in the art that, for convenience and brevity of description, the specific working processes of the above-described systems, apparatuses and units may refer to the corresponding processes in the foregoing method embodiments, and are not described herein again.

In the several embodiments provided in the present application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. The above-described apparatus embodiments are merely illustrative. For example, the division into units is only a division by logical function, and other divisions are possible in an actual implementation: multiple units or components may be combined or integrated into another system, or some features may be omitted or not executed. In addition, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in another form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present application may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit.

The functions, if implemented in the form of software functional units and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such an understanding, the technical solutions of the present application, or the part thereof that contributes to the prior art, may be embodied in the form of a software product. The software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disk.

The above description covers only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art can readily conceive of within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims

1. A method for training a neural network model, comprising:

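acquiring a first word vector matrix, wherein the first word vector matrix is obtained by training based on a first training data set of a first service field;

acquiring a second training data set;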

training a neural network model based on the second training data set to obtain a target neural network model, wherein the neural network model comprises an expert network layer, the expert network layer comprises a first expert network of the first service field, and the initial weight of the first expert network is determined according to the first word vector matrix.

2. The training method of claim 1, wherein the method further comprises:

and acquiring a second word vector matrix, wherein the second word vector matrix is obtained by training based on a third training data set in a second service field, the expert network layer further comprises a second expert network in the second service field, and the initial weight of the second expert network is determined according to the second word vector matrix.

3. The training method according to claim 1 or 2, wherein the expert network layer is configured to process, through the selected first expert network, data input into the expert network layer, the first expert network being selected according to the data input into the expert network layer.
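By way of illustration only, the selection of an expert network according to the data input into the expert network layer might be sketched as follows. The top-1 gating rule, the gating vectors, the class name `ExpertLayer`, and the field names "finance" and "telecom" are all assumptions made for this sketch; the claim itself does not prescribe how the selection is computed.

```python
from typing import Dict, List, Tuple

Vector = List[float]
Matrix = List[Vector]


def dot(u: Vector, v: Vector) -> float:
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


class ExpertLayer:
    """Toy expert network layer with top-1 selection (an assumed gating rule)."""

    def __init__(self, experts: Dict[str, Matrix], gates: Dict[str, Vector]):
        self.experts = experts  # service field -> expert weight matrix
        self.gates = gates      # service field -> gating vector (hypothetical)

    def select(self, x: Vector) -> str:
        # Pick the expert whose gating vector scores the input highest.
        return max(self.gates, key=lambda name: dot(self.gates[name], x))

    def forward(self, x: Vector) -> Tuple[str, Vector]:
        name = self.select(x)
        # Only the selected expert processes the input.
        return name, [dot(row, x) for row in self.experts[name]]


layer = ExpertLayer(
    experts={
        "finance": [[1.0, 0.0], [0.0, 1.0]],
        "telecom": [[2.0, 0.0], [0.0, 2.0]],
    },
    gates={"finance": [1.0, 0.0], "telecom": [0.0, 1.0]},
)
print(layer.forward([0.2, 0.9]))
```

In this sketch the input alone determines which expert runs, which is the property the claim relies on; a learned gating network could replace the fixed gating vectors.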

4. The training method according to any one of claims 1 to 3, wherein the first training data set is determined from a first knowledge graph of the first service field.

5. The training method according to claim 4, wherein the first training data set being determined from the first knowledge graph of the first service field comprises:

at least one first text sequence in the first training data set is generated according to at least one first triple in the first knowledge graph, and the three words in the first triple respectively represent a subject in the first service field, an object in the first service field, and a relationship between the subject and the object.
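As a minimal sketch of how a text sequence could be generated from a triple, the following Python example assumes that each text sequence simply lists the subject, the relationship, and the object in that order. The ordering, the function name `triples_to_text_sequences`, and the example triples are assumptions for illustration and are not taken from the present application.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relationship, object)


def triples_to_text_sequences(triples: List[Triple]) -> List[List[str]]:
    """Turn each triple into one three-word text sequence."""
    return [[subject, relationship, obj] for subject, relationship, obj in triples]


# Hypothetical triples for an illustrative service field.
knowledge_graph = [("router", "supports", "WiFi"), ("phone", "supports", "5G")]
sequences = triples_to_text_sequences(knowledge_graph)
print(sequences)
```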

6. The training method according to claim 5, wherein the first word vector matrix being obtained by training based on the first training data set of the first service field comprises:

the first word vector matrix is the weight of a hidden layer in a first target word vector generation model, wherein the first target word vector generation model is obtained by training a word vector generation model with the words other than a target word in the at least one first text sequence as the input of the word vector generation model and with the target word as the target output of the word vector generation model, and the target word is a word in the at least one first triple.
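The training scheme in this claim resembles a continuous bag-of-words (CBOW) setup: the remaining words of a sequence predict the target word, and the hidden-layer weights are kept as the word vectors. The sketch below is one possible reading under stated assumptions: a full softmax output, averaged context vectors, plain stochastic gradient descent, and every word of each sequence taking a turn as the target word. The function name `train_word_vectors` and all hyperparameters are hypothetical.

```python
import math
import random


def train_word_vectors(sequences, dim=4, epochs=50, lr=0.1, seed=0):
    """Train a tiny CBOW-style model and return (vocab, hidden-layer weights)."""
    rng = random.Random(seed)
    vocab = sorted({word for seq in sequences for word in seq})
    index = {word: i for i, word in enumerate(vocab)}
    size = len(vocab)
    # Hidden-layer weights, one row per word: this is the word vector matrix.
    w_in = [[rng.uniform(-0.5, 0.5) for _ in range(dim)] for _ in range(size)]
    # Output-layer weights, used only during training.
    w_out = [[rng.uniform(-0.5, 0.5) for _ in range(size)] for _ in range(dim)]

    for _ in range(epochs):
        for seq in sequences:
            for pos, target in enumerate(seq):
                # Input: all words of the sequence except the target word.
                context = [index[w] for i, w in enumerate(seq) if i != pos]
                if not context:
                    continue
                hidden = [sum(w_in[c][d] for c in context) / len(context)
                          for d in range(dim)]
                scores = [sum(hidden[d] * w_out[d][j] for d in range(dim))
                          for j in range(size)]
                top = max(scores)
                exps = [math.exp(s - top) for s in scores]
                total = sum(exps)
                # Softmax cross-entropy gradient: probabilities minus one-hot target.
                err = [e / total for e in exps]
                err[index[target]] -= 1.0
                # Gradient w.r.t. the hidden vector, computed before w_out changes.
                grad_hidden = [sum(w_out[d][j] * err[j] for j in range(size))
                               for d in range(dim)]
                for d in range(dim):
                    for j in range(size):
                        w_out[d][j] -= lr * hidden[d] * err[j]
                for c in context:
                    for d in range(dim):
                        w_in[c][d] -= lr * grad_hidden[d] / len(context)
    return vocab, w_in


sequences = [["router", "supports", "WiFi"], ["phone", "supports", "5G"]]
vocab, word_vector_matrix = train_word_vectors(sequences)
print(len(vocab), len(word_vector_matrix[0]))
```

The returned matrix has one row per vocabulary word and `dim` columns; a production system would more likely use an established word-embedding library than this hand-written loop.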

7. Training method according to any of the claims 1 to 6, wherein the initial weights of the first expert network are determined from the first word vector matrix, comprising:

the initial weight of the first expert network is the first word vector matrix.
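For illustration, using a word vector matrix directly as the initial weight of an expert network can be sketched as a copy of the matrix into the expert's weights, one expert per service field. The function name `init_expert_weights`, the dictionary layout, and the numeric values are assumptions made for this sketch.

```python
import copy
from typing import Dict, List

Matrix = List[List[float]]


def init_expert_weights(word_vector_matrices: Dict[str, Matrix]) -> Dict[str, Matrix]:
    """Create one expert per service field, initialised to its word vector matrix."""
    # Deep-copy so that later training of an expert leaves the source matrix intact.
    return {field: copy.deepcopy(matrix)
            for field, matrix in word_vector_matrices.items()}


first_word_vector_matrix = [[0.1, 0.2], [0.3, 0.4]]  # made-up values
experts = init_expert_weights({"first_service_field": first_word_vector_matrix})
initial_weight = copy.deepcopy(experts["first_service_field"])

experts["first_service_field"][0][0] = 9.9  # stand-in for a training update
print(first_word_vector_matrix[0][0])
```

Starting from weights that already encode field-specific knowledge, rather than from random values, is what the application credits with shortening training.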

8. A training method as claimed in any one of claims 1 to 7, wherein the neural network model is a Natural Language Processing (NLP) model or a speech processing model.

9. A method of data processing, comprising:

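acquiring data to be processed;

processing the data to be processed by using a target neural network model, wherein the target neural network model is obtained by training a neural network model based on a second training data set, the neural network model comprises an expert network layer, the expert network layer comprises a first expert network of a first service field, the initial weight of the first expert network is determined according to a first word vector matrix, and the first word vector matrix is obtained by training based on a first training data set of the first service field.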

10. The method of claim 9, wherein the expert network layer further comprises a second expert network of a second service field, wherein the initial weight of the second expert network is determined according to a second word vector matrix, and the second word vector matrix is obtained by training based on a third training data set of the second service field.

11. The method according to claim 9 or 10, wherein said expert network layer is configured to process data inputted into said expert network layer through said selected first expert network, said first expert network being selected according to said data inputted into said expert network layer.

12. The method according to any one of claims 9 to 11, wherein the first training data set is determined from a first knowledge graph of the first service field.

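13. The method of claim 12, wherein the first training data set being determined from the first knowledge graph comprises:

at least one first text sequence in the first training data set is generated according to at least one first triple in the first knowledge graph, and the three words in the first triple respectively represent a subject in the first service field, an object in the first service field, and a relationship between the subject and the object.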

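14. The method of claim 13, wherein the first word vector matrix being obtained by training based on the first training data set of the first service field comprises:

the first word vector matrix is the weight of a hidden layer in a first target word vector generation model, wherein the first target word vector generation model is obtained by training a word vector generation model with the words other than a target word in the at least one first text sequence as the input of the word vector generation model and with the target word as the target output of the word vector generation model, and the target word is a word in the at least one first triple.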

15. The method of any one of claims 9 to 14, wherein the initial weight of the first expert network is determined from a first word vector matrix comprising:

the initial weight of the first expert network is the first word vector matrix.

16. The method according to any one of claims 9 to 15, wherein the neural network model is a Natural Language Processing (NLP) model or a speech processing model.

17. An apparatus for training a neural network model, comprising:

an acquisition unit configured to acquire a first word vector matrix, wherein the first word vector matrix is obtained by training based on a first training data set of a first service field;

the acquisition unit is further configured to acquire a second training data set;

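a processing unit configured to:

train a neural network model based on the second training data set to obtain a target neural network model, wherein the neural network model comprises an expert network layer, the expert network layer comprises a first expert network of the first service field, and the initial weight of the first expert network is determined according to the first word vector matrix.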

18. The training apparatus of claim 17, wherein the acquisition unit is further configured to:

acquire a second word vector matrix, wherein the second word vector matrix is obtained by training based on a third training data set of a second service field, the expert network layer further comprises a second expert network of the second service field, and the initial weight of the second expert network is determined according to the second word vector matrix.

19. Training apparatus according to claim 17 or 18, wherein the expert network layer is configured to process data input to the expert network layer via the selected first expert network, the first expert network being selected on the basis of the data input to the expert network layer.

20. The training apparatus according to any one of claims 17 to 19, wherein the first training data set is determined from a first knowledge graph of the first service field.

21. The training apparatus according to claim 20, wherein the first training data set being determined from the first knowledge graph of the first service field comprises:

at least one first text sequence in the first training data set is generated according to at least one first triple in the first knowledge graph, the first triple comprising a subject in the first service field, an object in the first service field, and a relationship between the subject and the object.

22. The training apparatus of claim 21, wherein the first word vector matrix being obtained by training based on the first training data set of the first service field comprises:

the first word vector matrix is the weight of a hidden layer in a first target word vector generation model, wherein the first target word vector generation model is obtained by training a word vector generation model with the words other than a target word in the at least one first text sequence as the input of the word vector generation model and with the target word as the target output of the word vector generation model, and the target word is a word in the at least one first triple.

23. Training apparatus according to any of the claims 17-22, wherein the initial weights of the first expert network are determined from the first word vector matrix, comprising:

the initial weight of the first expert network is the first word vector matrix.

24. Training apparatus according to any of claims 17 to 23, wherein the neural network model is a natural language processing NLP model or a speech processing model.

25. An apparatus for data processing, comprising:

an acquisition unit configured to acquire data to be processed;

the processing unit is used for processing the data to be processed by using a target neural network model, the target neural network model is obtained by training the neural network model based on a second training data set, the neural network model comprises an expert network layer, the expert network layer comprises a first expert network in a first service field, the initial weight of the first expert network is determined according to a first word vector matrix, and the first word vector matrix is obtained by training based on a first training data set in the first service field.

26. The apparatus of claim 25, wherein the expert network layer further comprises a second expert network of a second service field, wherein the initial weight of the second expert network is determined according to a second word vector matrix, and the second word vector matrix is obtained by training based on a third training data set of the second service field.

27. The apparatus according to claim 25 or 26, wherein said expert network layer is configured to process data inputted into said expert network layer through said selected first expert network, said first expert network being selected according to said data inputted into said expert network layer.

28. The apparatus according to any one of claims 25 to 27, wherein the first training data set is determined from a first knowledge graph of the first service field.

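29. The apparatus of claim 28, wherein the first training data set being determined from the first knowledge graph comprises:

at least one first text sequence in the first training data set is generated according to at least one first triple in the first knowledge graph, and the three words in the first triple respectively represent a subject in the first service field, an object in the first service field, and a relationship between the subject and the object.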

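30. The apparatus of claim 29, wherein the first word vector matrix being obtained by training based on the first training data set of the first service field comprises:

the first word vector matrix is the weight of a hidden layer in a first target word vector generation model, wherein the first target word vector generation model is obtained by training a word vector generation model with the words other than a target word in the at least one first text sequence as the input of the word vector generation model and with the target word as the target output of the word vector generation model, and the target word is a word in the at least one first triple.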

31. The apparatus of any one of claims 25 to 30, wherein the initial weight of the first expert network is determined from a first word vector matrix comprising:

the initial weight of the first expert network is the first word vector matrix.

32. The apparatus of any one of claims 25 to 31, wherein the neural network model is a Natural Language Processing (NLP) model or a speech processing model.

33. An apparatus for training a neural network model, comprising a processor and a memory, the memory being configured to store program instructions, the processor being configured to invoke the program instructions to perform the method of any one of claims 1 to 8.

34. An apparatus for data processing, comprising a processor and a memory, the memory being configured to store program instructions, the processor being configured to invoke the program instructions to perform the method of any one of claims 9 to 16.

35. A computer-readable storage medium, characterized in that the computer-readable medium stores program code for execution by a device, the program code comprising instructions for performing the method of any of claims 1 to 8 or claims 9 to 16.

36. A computer program product comprising instructions which, when run on a computer, cause the computer to perform the method of any one of claims 1 to 8 or claims 9 to 16.