WO2023279921A1 - Training method for neural network model, data processing method, and apparatus - Google Patents

Training method for neural network model, data processing method, and apparatus

Info

Publication number
WO2023279921A1
Authority
WO
WIPO (PCT)
Prior art keywords
word vector
training
model
expert
data set
Prior art date
Application number
PCT/CN2022/098621
Other languages
English (en)
French (fr)
Inventor
孟庆春
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN202111014266.8A (publication CN115600635A)
Application filed by 华为技术有限公司
Priority to EP22836675.3A (publication EP4318311A4)
Publication of WO2023279921A1
Priority to US18/401,738 (publication US20240232618A9)



Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition

Definitions

  • the present application relates to the field of artificial intelligence, and more specifically, to a neural network model training method, data processing method and device.
  • Artificial intelligence is a theory, method, technology and application system that uses digital computers, or machines controlled by digital computers, to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • In other words, artificial intelligence is the branch of computer science that attempts to understand the nature of intelligence and to produce a new class of intelligent machines that respond in ways similar to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
  • Research in the field of artificial intelligence includes robotics, natural language processing, computer vision, decision-making and reasoning, human-computer interaction, recommendation and search, basic AI theory, etc.
  • In general, a neural network model processes all inputs to the model based on the same set of parameters.
  • As the number of parameters of the model grows, the computing resources required by the model also increase accordingly.
  • A mixture of experts (MoE) includes multiple expert networks, each with different parameters. For different inputs, the MoE can selectively activate different expert networks in the model to participate in the calculation. This can greatly reduce the number of parameters actually involved in the calculation and reduce the demand for computing resources, making it possible to train models with trillions of parameters or even more.
  • the present application provides a neural network model training method, data processing method and device, which reduce the training time of the model and improve the training efficiency of the model.
  • a training method of a neural network model including: obtaining a first word vector matrix, the first word vector matrix is trained based on the first training data set in the first business domain; obtaining the second training data set; the neural network model is trained based on the second training data set to obtain the target neural network model, the neural network model includes an expert network layer, the expert network layer includes the first expert network in the first business domain, and the initial weight of the first expert network is determined according to the first word vector matrix.
  • In the solution of the embodiment of the present application, the word vector matrix is trained according to a training data set and contains a large amount of semantic information. Initializing the weights of some or all expert networks in the model with the word vector matrix introduces this semantic information into the expert networks, provides prior knowledge for the expert networks, and reduces training time; especially when the scale of the neural network model is large, the solution of the embodiment of the present application can greatly reduce the training time. At the same time, introducing semantic information into the expert networks can effectively improve their semantic representation ability, thereby improving the training performance of the model.
  • The method further includes: obtaining a second word vector matrix, where the second word vector matrix is trained based on the third training data set in the second business domain; the expert network layer further includes a second expert network in the second business domain, and the initial weight of the second expert network is determined according to the second word vector matrix.
  • In the solution of the embodiment of the present application, different word vector matrices are trained based on training data sets in different business domains and carry different semantic information.
  • Different expert networks in the expert network layer are initialized with different word vector matrices, so that different expert networks have different semantic representation capabilities; the combination of semantics across different expert networks can further improve the ability to understand natural language semantics and further improve the performance of the model.
  • The expert network layer is used to process the data input to the expert network layer through the selected first expert network, and the first expert network is selected based on the data input to the expert network layer.
  • the first training data set is determined according to the first knowledge graph of the first business domain.
  • In the solution of the embodiment of the present application, the training data set of a business domain can be constructed through the knowledge graph of that business domain. The knowledge graph of the business domain can indicate the relationships between entities in the business domain, which helps the word vector matrix learn the knowledge of the business domain and improves its semantic representation ability.
  • The first training data set is determined according to the first knowledge graph of the first business domain, including: at least one first text sequence in the first training data set is generated according to at least one first triple in the first knowledge graph, and the three words in the first triple are respectively used to represent a subject in the first business domain, an object in the first business domain, and the relationship between the subject and the object.
  • a triple can be represented in the form of a triple (subject, relation, object).
  • Subject and object are concepts in the business domain.
  • a text sequence can be generated from a triple.
  • a triple can form a sentence, that is, a sequence of text.
  • triples can be converted into sentences through a language model.
  • the language model may be an n-gram language model.
  • n can be 2, or n can be 3.
  • The first word vector matrix being trained based on the first training data set in the first business domain includes: the first word vector matrix is the weight of the hidden layer in a first target word vector generation model.
  • The first target word vector generation model is obtained by training the word vector generation model with the words other than the target word in the at least one first text sequence as the input of the word vector generation model and the target word as the target output of the word vector generation model, where the target word is a word in the at least one first triple.
  • a word embedding generation model can include an input layer, a hidden layer, and an output layer.
  • the hidden layer is a fully connected layer.
  • the weight of the hidden layer can also be called an embedding matrix or a word vector matrix.
  • the target word in the at least one first text sequence is the object in the at least one first triple.
  • the target word in the at least one first text sequence is the subject in the at least one first triple.
  • the target words in the at least one first text sequence are relations in the at least one first triple.
  • the initial weight of the first expert network is determined according to the first word vector matrix, including: the initial weight of the first expert network is the first word vector matrix.
  • The initial weight of the first expert network is determined according to the first word vector matrix, including: the initial weight of the first expert network is obtained by adjusting the first word vector matrix.
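  • As a hedged illustration of this initialization (not the exact implementation of the present application), the sketch below assumes an expert network built from a single fully connected layer whose weight tensor has the same shape as the word vector matrix; the names ExpertNetwork and init_expert_from_word_vectors are hypothetical.

```python
# Hypothetical sketch: initializing an expert network's weight from a trained
# word vector (embedding) matrix. Class/function names and shapes are assumptions.
import torch
import torch.nn as nn

class ExpertNetwork(nn.Module):
    """A minimal expert: a single fully connected layer (assumed structure)."""
    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.fc = nn.Linear(in_dim, out_dim, bias=False)

    def forward(self, x):
        return self.fc(x)

def init_expert_from_word_vectors(expert: ExpertNetwork, word_vector_matrix: torch.Tensor):
    # Option described above: the initial weight *is* the word vector matrix.
    # (The other option, adjusting the matrix first, could rescale/reshape it here.)
    with torch.no_grad():
        expert.fc.weight.copy_(word_vector_matrix)

# Stand-in for a word vector matrix trained on the first business domain.
vocab_size, hidden_dim = 1000, 128
first_word_vector_matrix = torch.randn(vocab_size, hidden_dim)
first_expert = ExpertNetwork(in_dim=hidden_dim, out_dim=vocab_size)
init_expert_from_word_vectors(first_expert, first_word_vector_matrix)
```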
  • the neural network model is a natural language processing (natural language processing, NLP) model or a speech processing model.
  • the data in the second training data set may be text data.
  • the data in the second training data set may be speech data.
  • The speech processing model may be an end-to-end speech processing model; for example, the end-to-end speech processing model may be a listen, attend and spell (LAS) model.
  • a data processing method including: acquiring data to be processed;
  • the target neural network model is used to process the data to be processed.
  • the target neural network model is obtained by training the neural network model based on the second training data set.
  • The neural network model includes an expert network layer, and the expert network layer includes a first expert network in a first business domain; the initial weight of the first expert network is determined according to a first word vector matrix, and the first word vector matrix is trained based on a first training data set in the first business domain.
  • In the solution of the embodiment of the present application, the word vector matrix is trained according to a training data set and contains a large amount of semantic information. Initializing the weights of some or all expert networks in the model with the word vector matrix introduces this semantic information into the expert networks, provides prior knowledge for the expert networks, and reduces training time; especially when the scale of the neural network model is large, the solution of the embodiment of the present application can greatly reduce the training time. At the same time, introducing semantic information into the expert networks can effectively improve their semantic representation ability, and thus improve the performance of the target neural network model.
  • the expert network layer further includes a second expert network in the second business domain, the initial weight of the second expert network is determined according to the second word vector matrix, and the second The word vector matrix is trained based on the third training data set in the second business domain.
  • The expert network layer is used to process the data input to the expert network layer through the selected first expert network, and the first expert network is selected based on the data input to the expert network layer.
  • the first training data set is determined according to the first knowledge graph of the first business domain.
  • The first training data set is determined according to the first knowledge graph, including: at least one first text sequence in the first training data set is generated according to at least one first triple in the first knowledge graph, and the three words in the first triple are respectively used to represent the subject in the first business domain, the object in the first business domain, and the relationship between the subject and the object.
  • The first word vector matrix being trained based on the first training data set in the first business domain includes: the first word vector matrix is the weight of the hidden layer in the first target word vector generation model.
  • The first target word vector generation model is obtained by training the word vector generation model with the words other than the target word in the at least one first text sequence as the input of the word vector generation model and the target word as the target output of the word vector generation model, where the target word is a word in the at least one first triple.
  • the initial weight of the first expert network is determined according to the first word vector matrix, including: the initial weight of the first expert network is the first word vector matrix.
  • the neural network model is a natural language processing (NLP) model or a speech processing model.
  • In a third aspect, a neural network model training device is provided, which includes units for performing the method in any implementation manner of the first aspect above.
  • In a fourth aspect, a data processing device is provided, which includes units configured to execute the method in any implementation manner of the second aspect above.
  • In a fifth aspect, a training device for a neural network model is provided, comprising: a memory for storing a program; and a processor for executing the program stored in the memory; when the program stored in the memory is executed, the processor is configured to execute the method in any one implementation manner of the first aspect.
  • The processor in the fifth aspect above may be a central processing unit (CPU), or a combination of a CPU and a neural network computing processor, where the neural network computing processor may include a graphics processing unit (GPU), a neural-network processing unit (NPU), a tensor processing unit (TPU), and the like.
  • TPU is an artificial intelligence accelerator ASIC fully customized by Google for machine learning.
  • In a sixth aspect, a data processing device is provided, which includes: a memory for storing a program; and a processor for executing the program stored in the memory; when the program stored in the memory is executed, the processor is configured to execute the method in any one implementation manner of the second aspect.
  • the processor in the sixth aspect above can be either a CPU, or a combination of a CPU and a neural network computing processor, where the neural network computing processor can include a GPU, an NPU, a TPU, and the like.
  • A computer-readable medium is provided, which stores program code for execution by a device, and the program code is used to execute the method in any one of the implementation manners of the first aspect or the second aspect.
  • a computer program product containing instructions is provided, and when the computer program product is run on a computer, the computer is made to execute the method in any one of the above-mentioned first aspect or the second aspect.
  • A chip is provided; the chip includes a processor and a data interface, and the processor reads, through the data interface, the instructions stored in a memory and executes the method in any one of the implementation manners of the above-mentioned first aspect or second aspect.
  • The chip may further include a memory, the memory stores instructions, the processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the processor is configured to execute the method in any one of the implementation manners of the first aspect or the second aspect.
  • FIG. 1 is a schematic diagram of a dialogue system provided by an embodiment of the present application.
  • FIG. 2 is a schematic diagram of a processing procedure of a word vector generation model provided by an embodiment of the present application
  • FIG. 3 is a schematic diagram of a natural language processing system provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a system architecture provided by an embodiment of the present application.
  • FIG. 5 is a schematic diagram of a training device for a neural network model provided in an embodiment of the present application.
  • FIG. 6 is a schematic flowchart of a training method for a neural network model provided in an embodiment of the present application.
  • FIG. 7 is a schematic flowchart of a data processing method provided by an embodiment of the present application.
  • FIG. 8 is a schematic block diagram of a training device for a neural network model provided in an embodiment of the present application.
  • FIG. 9 is a schematic block diagram of a data processing device provided by an embodiment of the present application.
  • FIG. 10 is a schematic block diagram of another neural network model training device provided in the embodiment of the present application.
  • FIG. 11 is a schematic block diagram of another data processing apparatus provided by an embodiment of the present application.
  • the embodiments of the present application may be applied in the field of natural language processing or speech processing.
  • Dialogue systems are an important application in the field of natural language processing.
  • The dialogue system includes an automatic speech recognition (ASR) subsystem, a natural language understanding (NLU) subsystem, a dialogue management (DM) subsystem, a natural language generation (NLG) subsystem, and a text-to-speech (TTS) subsystem.
  • the ASR subsystem converts the audio information input by the user into text information.
  • the NLU subsystem analyzes the text information obtained by the ASR subsystem to analyze the user's intention.
  • The DM subsystem combines the user's intention obtained by the NLU subsystem with the current dialogue state, executes the corresponding action, for example, querying a knowledge base, and returns the result.
  • the NLG subsystem generates text data according to the results returned by the DM subsystem, and the TTS subsystem converts the text data into audio data and feeds it back to the user.
  • the solutions of the embodiments of the present application can be used to obtain or optimize a neural network model suitable for natural language understanding. Adopting the solution of the embodiment of the present application can improve the training efficiency of the neural network model and obtain the neural network model faster.
  • A neural network can be composed of neural units. A neural unit can refer to an operation unit that takes $x_s$ and an intercept 1 as input, and the output of the operation unit can be: $h_{W,b}(x) = f(W^T x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$
  • where $W_s$ is the weight of $x_s$, $b$ is the bias of the neural unit, and $f$ is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal of the neural unit into an output signal.
  • the output signal of the activation function can be used as the input of the next convolutional layer, and the activation function can be a sigmoid function.
  • a neural network is a network formed by connecting multiple above-mentioned single neural units, that is, the output of one neural unit can be the input of another neural unit.
  • the input of each neural unit can be connected with the local receptive field of the previous layer to extract the features of the local receptive field.
  • the local receptive field can be an area composed of several neural units.
  • A deep neural network (DNN), also known as a multi-layer neural network, can be understood as a neural network with multiple hidden layers.
  • According to the positions of different layers, the layers inside a DNN can be divided into three categories: the input layer, the hidden layers, and the output layer.
  • the first layer is the input layer
  • the last layer is the output layer
  • the layers in the middle are all hidden layers.
  • the layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1-th layer.
  • Although a DNN looks complicated, the work of each layer is actually not complicated.
  • In simple terms, each layer performs the following linear relationship expression: $\vec{y} = \alpha(W\vec{x} + \vec{b})$, where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, $W$ is the weight matrix (also called coefficients), and $\alpha()$ is the activation function.
  • Each layer simply performs such an operation on the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Because a DNN has a large number of layers, the number of coefficient matrices $W$ and offset vectors $\vec{b}$ is also large.
  • These parameters are defined in a DNN as follows, taking the coefficient $W$ as an example. Assume that in a three-layer DNN the linear coefficient from the fourth neuron of the second layer to the second neuron of the third layer is defined as $W^{3}_{24}$: the superscript 3 represents the layer number of the coefficient $W$, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer.
  • In summary, the coefficient from the k-th neuron of the (L-1)-th layer to the j-th neuron of the L-th layer is defined as $W^{L}_{jk}$.
  • the input layer has no W parameter.
  • more hidden layers make the network more capable of describing complex situations in the real world. Theoretically speaking, a model with more parameters has a higher complexity and a greater "capacity", which means that it can complete more complex learning tasks.
  • Training the deep neural network is the process of learning the weight matrix, and its ultimate goal is to obtain the weight matrix of all layers of the trained deep neural network (the weight matrix formed by the vector W of many layers).
  • the neural network can use the error back propagation (back propagation, BP) algorithm to correct the size of the parameters in the initial neural network model during the training process, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, passing the input signal forward until the output will generate an error loss, and updating the parameters in the initial neural network model by backpropagating the error loss information, so that the error loss converges.
  • the backpropagation algorithm is a backpropagation movement dominated by error loss, aiming to obtain the optimal parameters of the neural network model, such as the weight matrix.
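  • A minimal sketch of the above, with made-up dimensions and data: each fully connected layer computes y = α(Wx + b), a forward pass produces an error loss, and one back-propagation step updates W and b. It illustrates the general principle only, not the model of the present application.

```python
# Minimal illustration: each fully connected layer computes y = alpha(W x + b);
# a forward pass yields an error loss, which back propagation uses to update W and b.
# Dimensions and data are arbitrary.
import torch
import torch.nn as nn

dnn = nn.Sequential(                  # input layer -> hidden layers -> output layer
    nn.Linear(8, 16), nn.Sigmoid(),
    nn.Linear(16, 16), nn.Sigmoid(),
    nn.Linear(16, 4),
)
optimizer = torch.optim.SGD(dnn.parameters(), lr=0.1)
loss_fn = nn.MSELoss()

x = torch.randn(32, 8)                # a batch of input vectors
target = torch.randn(32, 4)           # target values

y = dnn(x)                            # pass the input signal forward
loss = loss_fn(y, target)             # error loss between output and target
loss.backward()                       # back propagate the error loss
optimizer.step()                      # update W and b so the error loss decreases
```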
  • Natural language is human language, and natural language processing (NLP) is the processing of human language. Natural language processing is the process of systematically analyzing, understanding and extracting information from text data in an intelligent and efficient manner. NLP and its components can manage very large chunks of text data, perform a large number of automated tasks, and solve a variety of problems, such as automatic summarization, machine translation (MT), named entity recognition (NER), relation extraction (RE), information extraction (IE), sentiment analysis, speech recognition, question answering and topic segmentation, etc.
  • A knowledge graph is a semantic network that reveals the relationships between entities; on the basis of information, connections between entities are established to form "knowledge".
  • The knowledge graph is composed of pieces of knowledge, and each piece of knowledge can be expressed as a triple composed of a subject, a relation and an object, which can be written in the form (subject, relation, object).
  • Entities, i.e. the subject and the object in a triple, usually represent concepts and generally consist of nouns or noun phrases.
  • A relation represents a link between two entities and is generally composed of verbs, adjectives, or nouns.
  • For example, the triple (Socrates, teacher, Aristotle) expresses the knowledge that Socrates was Aristotle's teacher.
  • A mixture of experts (MoE) is a neural network architecture in which several linear models are trained using local input data, and the outputs of these linear models are combined, through weights generated by a gate network, as the output of the MoE.
  • These linear models are called experts, or may also be called expert networks or expert models.
  • MoE includes at least one gate network and multiple expert networks. Different expert networks have different parameters.
  • the gate network can selectively activate some parameters in MoE for different input data. In other words, the gate network can select different expert networks to participate in the actual calculation of the current input according to different inputs.
  • the same expert network can be deployed on multiple devices.
  • the same expert network deployed on different devices has the same parameters.
  • multiple devices can share parameters, which is conducive to training large-scale models, for example, models with trillions of parameters or even higher.
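  • A minimal sketch of a mixture-of-experts layer as described above, assuming top-k gating over simple linear experts; it is a generic illustration rather than the gating rule or expert structure of the present application.

```python
# Generic mixture-of-experts layer: a gate network scores the experts for each
# input, only the top-k experts compute, and their outputs are combined with
# the gate weights. Dimensions and the number of experts are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELayer(nn.Module):
    def __init__(self, dim: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList([nn.Linear(dim, dim) for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts)        # the gate network
        self.top_k = top_k

    def forward(self, x):                              # x: (batch, dim)
        scores = F.softmax(self.gate(x), dim=-1)       # gate weight for each expert
        top_w, top_idx = scores.topk(self.top_k, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):                 # only selected experts compute
            for e, expert in enumerate(self.experts):
                sel = top_idx[:, slot] == e
                if sel.any():
                    out[sel] += top_w[sel, slot].unsqueeze(-1) * expert(x[sel])
        return out

moe = MoELayer(dim=16)
print(moe(torch.randn(8, 16)).shape)                   # torch.Size([8, 16])
```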
  • In NLP, a word usually has two kinds of representations: the one-hot representation and the distributed representation.
  • The distributed representation maps a word or phrase from the vocabulary into a new space and represents the word or phrase with a real-valued vector, that is, a word vector. This approach is called word embedding. Word to vector (word2vec) is one way of performing word embedding.
  • the Word2vec model can include an input layer, a hidden layer, and an output layer.
  • the hidden layer is a fully connected layer.
  • the weight of the hidden layer in the trained model is the word vector matrix, or it can also be called the embedding matrix.
  • the word2vec model includes two types of models: the skip-gram model and the continuous bag-of-words (CBOW) model.
  • the skip-gram model is used to generate words in the context of a word based on that word.
  • a word is used as the input of the skip-gram model, and the words in the context of the word are used as the target output of the skip-gram model.
  • For example, w(t) is used as the input, and w(t-1), w(t-2), w(t+1) and w(t+2) in the context of w(t) are used as target outputs.
  • the CBOW model is used to generate a word based on the words in its context.
  • the word in the context of a word is used as the input of the CBOW model, and the word is used as the target output of the CBOW model.
  • For example, the words w(t-1), w(t-2), w(t+1) and w(t+2) in the context of w(t) are used as the input, and w(t) is the target output.
  • Fig. 2 shows a schematic diagram of the processing procedure of a CBOW model.
  • A "1" at a position of the input layer indicates that the word corresponding to that position is input, and a "0" indicates that the word corresponding to that position is not input.
  • A "1" at a position of the output layer indicates that the word corresponding to that position is output, and a "0" indicates that the word corresponding to that position is not output.
  • For example, for the sentence "the dog bark at mailman", "the" and "bark" are the context of "dog". The one-hot codes of "the" and "bark" in the sentence are input into the CBOW model shown in Figure 2, that is, the positions corresponding to "the" and "bark" in the input are set to 1.
  • The position corresponding to "dog" in the output result is 1, that is, "dog" is output.
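  • A minimal CBOW-style sketch, assuming a toy vocabulary and made-up dimensions, showing how the hidden-layer (embedding) weight of the trained model serves as the word vector matrix; the nn.Embedding lookup is equivalent to multiplying a one-hot input by the hidden-layer weight matrix.

```python
# Toy CBOW model: context words in -> target word out. After training, the
# hidden-layer weight (embed.weight) is the word vector (embedding) matrix.
import torch
import torch.nn as nn

vocab = ["the", "dog", "bark", "at", "mailman"]
word_to_id = {w: i for i, w in enumerate(vocab)}

class CBOW(nn.Module):
    def __init__(self, vocab_size: int, embed_dim: int):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)   # hidden-layer weight
        self.out = nn.Linear(embed_dim, vocab_size)        # output layer

    def forward(self, context_ids):                        # (batch, context_len)
        return self.out(self.embed(context_ids).mean(dim=1))

model = CBOW(len(vocab), embed_dim=8)
optimizer = torch.optim.SGD(model.parameters(), lr=0.05)
loss_fn = nn.CrossEntropyLoss()

# "the" and "bark" are the context of "dog" in "the dog bark at mailman".
context = torch.tensor([[word_to_id["the"], word_to_id["bark"]]])
target = torch.tensor([word_to_id["dog"]])

for _ in range(100):
    optimizer.zero_grad()
    loss_fn(model(context), target).backward()
    optimizer.step()

word_vector_matrix = model.embed.weight.detach()   # the trained word vector matrix
print(word_vector_matrix.shape)                    # torch.Size([5, 8])
```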
  • Fig. 3 is a schematic diagram of a natural language processing system according to an embodiment of the present application.
  • the natural language processing system may include user equipment and data processing equipment.
  • The user equipment includes smart terminals used by a user, such as a mobile phone, a personal computer, or an information processing center.
  • The user equipment is the initiator of natural language data processing; as the initiator of requests such as question answering or queries, the user usually initiates the request through the user equipment.
  • The data processing device may be a device or server with data processing functions, such as a cloud server, a network server, an application server, or a management server.
  • The data processing device receives questions such as query sentences, voice or text from the intelligent terminal through an interactive interface, and then performs machine learning, deep learning, search, reasoning, decision-making and other language data processing through a memory for storing data and a processor for data processing.
  • The memory can be a general term that includes local storage and databases storing historical data, and the databases can be located on the data processing device or on other network servers.
  • For example, the user equipment can receive a piece of text input by the user and then initiate a request to the data processing device, so that the data processing device executes a natural language processing application (for example, intent recognition, text classification, text sequence labeling, translation, etc.) on the piece of text obtained from the user equipment, so as to obtain the processing result of the corresponding natural language processing application for the piece of text.
  • the user equipment may receive the text to be processed input by the user, and then initiate a request to the data processing device, so that the data processing device classifies the text to be processed, so as to obtain a classification result for the text to be processed.
  • The classification result may refer to the semantic intention of the user indicated by the text to be processed, for example, an intention of playing a song, setting a time, or starting navigation; alternatively, the classification result may also be used to indicate the user's emotion classification result, for example, the classification result may indicate that the user emotion corresponding to the text to be processed is classified as depression, happiness, or anger.
  • The target neural network model obtained by the neural network model training method of the embodiment of the present application can be deployed in the data processing device in (a) of Figure 3. The target neural network model can be used to execute natural language processing applications (for example, intent recognition, text classification, text sequence labeling, translation, etc.), so as to obtain the processing results of the natural language processing applications.
  • Another application scenario of the natural language processing system is shown in (b) of FIG. 3.
  • the smart terminal directly serves as a data processing device, directly receiving input from the user and processing it directly by the hardware of the smart terminal itself.
  • The specific process is similar to that in (a) of Figure 3; please refer to the above description, and details are not repeated here.
  • the user equipment may receive an instruction from the user, and the user equipment itself processes the data to be processed to obtain a processing result of the data to be processed.
  • The user equipment can receive user instructions; for example, the user equipment can receive a piece of text input by the user, and then the user equipment itself executes a natural language processing application (for example, intent recognition, text classification, text sequence labeling, translation, etc.) for the piece of text, so as to obtain the processing result of the corresponding natural language processing application for the piece of text.
  • The target neural network model obtained by the neural network model training method of the embodiment of the present application can be deployed in the user equipment in (b) of Figure 3. The target neural network model can be used to execute natural language processing applications (for example, intent recognition, text classification, text sequence labeling, translation, etc.), so as to obtain the processing results of the natural language processing applications.
  • FIG. 3 is a schematic diagram of related equipment of the natural language processing system provided by the embodiment of the present application.
  • The user equipment in (a) and (b) of FIG. 3 above may specifically be the local device 301 or the local device 302 in (c) of FIG. 3, and the execution device 310 may also be set on the cloud or on other network servers.
  • the local device 301 and the local device 302 are connected to the execution device 310 through a communication network.
  • Execution device 310 may be implemented by one or more servers.
  • the execution device 310 may be used in cooperation with other computing devices, such as data storage, routers, load balancers and other devices.
  • Execution device 310 may be arranged on one physical site, or distributed on multiple physical sites.
  • the execution device 310 can use the data in the data storage system 350 or call the program code in the data storage system 350 to implement the neural network model training method of the embodiment of the present application.
  • the execution device 310 may perform the following process:
  • the neural network model is trained based on the second training data set to obtain the target neural network model.
  • the neural network model includes an expert network layer, and the expert network layer includes the first expert network in the first business domain.
  • The initial weight of the first expert network is determined according to the first word vector matrix.
  • Through the above process, a trained neural network, that is, a target neural network model, can be obtained.
  • the target neural network model can be used for natural language processing and the like.
  • the user may operate respective user devices (such as the local device 301 and the local device 302 ) to interact with the execution device 310 .
  • Each local device can represent any computing device, such as a personal computer, computer workstation, smartphone, tablet, smart camera, smart car or other type of cellular phone, media consumption device, wearable device, set-top box, game console, etc.
  • Each user's local device can interact with the execution device 310 through any communication mechanism/communication standard communication network, and the communication network can be a wide area network, a local area network, a point-to-point connection, etc., or any combination thereof.
  • The local device 301 and the local device 302 obtain the relevant parameters of the target neural network model from the execution device 310, deploy the target neural network model on the local device 301 and the local device 302, and use the target neural network model to perform speech processing, text processing, or the like.
  • Alternatively, the target neural network model can be directly deployed on the execution device 310; the execution device 310 obtains the data to be processed from the local device 301 and the local device 302, uses the target neural network model to process the data to be processed, and may further return the processing result to the local device 301 and the local device 302.
  • the execution device 310 may also be implemented by a local device.
  • For example, the local device 301 implements the functions of the execution device 310 and provides services for its own users, or provides services for the users of the local device 302.
  • the above execution device 310 may also be a cloud device. In this case, the execution device 310 may be deployed on the cloud; or, the above execution device 310 may also be a terminal device. In this case, the execution device 310 may be deployed on the user terminal side. This is not limited.
  • the embodiment of the present application provides a system architecture 100 .
  • the data collection device 160 is used to collect training data.
  • The training data may include a text sequence and the processing result corresponding to the text sequence; for example, the processing result corresponding to the text sequence may be the intent recognition result of the text sequence.
  • After collecting the training data, the data collection device 160 stores the training data in the database 130, and the training device 120 obtains the target model/rule 101 by training based on the training data maintained in the database 130.
  • the training device 120 obtains the target model/rule 101 based on the training data.
  • The training device 120 processes the input raw data and compares the output value with the target value until the difference between the value output by the training device 120 and the target value is less than a certain threshold, thereby completing the training of the target model/rule 101.
  • the above target model/rule 101 can be used to implement the data processing method of the embodiment of the present application.
  • the target model/rule 101 in the embodiment of the present application may specifically be a neural network model.
  • the training data maintained in the database 130 may not all be collected by the data collection device 160, but may also be received from other devices.
  • The training device 120 does not necessarily train the target model/rule 101 entirely based on the training data maintained in the database 130; it may also obtain training data from the cloud or elsewhere for model training, and the above description should not be taken as a limitation on the embodiments of the present application.
  • the target model/rule 101 obtained by training according to the training device 120 can be applied to different systems or devices, such as the execution device 110 shown in FIG. 4 .
  • The execution device 110 may be a terminal, such as a mobile phone terminal, a tablet computer, a notebook computer, an augmented reality (AR)/virtual reality (VR) device, or a vehicle-mounted terminal, or may be a server, a cloud device, or the like.
  • The execution device 110 is configured with an input/output (I/O) interface 112 for data interaction with external devices. The user can input data to the I/O interface 112 through the client device 140, and the input data may include: the data to be processed input by the client device.
  • When the execution device 110 preprocesses the input data, or when the calculation module 111 of the execution device 110 performs calculation and other related processing, the execution device 110 can call data, codes, etc. in the data storage system 150 for the corresponding processing, and the correspondingly processed data and instructions may also be stored in the data storage system 150.
  • the I/O interface 112 returns the processing result, such as the processing result of the data obtained above, to the client device 140, thereby providing it to the user.
  • the training device 120 can generate corresponding target models/rules 101 based on different training data for different goals or different tasks, and the corresponding target models/rules 101 can be used to achieve the above-mentioned goals or complete the above-mentioned task to provide the user with the desired result.
  • the user can manually specify the input data, and the manual specification can be operated through the interface provided by the I/O interface 112 .
  • the client device 140 can automatically send the input data to the I/O interface 112 . If the client device 140 is required to automatically send the input data to obtain the user's authorization, the user can set the corresponding authority in the client device 140 .
  • the user can view the results output by the execution device 110 on the client device 140, and the specific presentation form may be specific ways such as display, sound, and action.
  • the client device 140 can also be used as a data collection terminal, collecting the input data input to the I/O interface 112 as shown in the figure and the output results of the output I/O interface 112 as new sample data, and storing them in the database 130 .
  • Alternatively, the client device 140 may not be used for collection; instead, the I/O interface 112 directly stores, as new sample data, the input data input to the I/O interface 112 and the output results of the I/O interface 112 shown in the figure in the database 130.
  • FIG. 4 is only a schematic diagram of a system architecture provided by the embodiment of the present application, and the positional relationship between devices, devices, modules, etc. shown in the figure does not constitute any limitation.
  • For example, the data storage system 150 is an external memory relative to the execution device 110; in other cases, the data storage system 150 may also be placed in the execution device 110.
  • The target model/rule 101 is obtained through training by the training device 120, and the target model/rule 101 may be the target neural network model in the embodiment of the present application.
  • By using MoE, the number of parameters of a model can be expanded, and a model with trillions of parameters or even more can be trained, thereby improving the performance of the model.
  • the training time required for the neural network model using MoE is long, which affects the application of the model.
  • the embodiment of the present application provides a training method for a neural network model, using the word vector matrix to initialize the weight of the expert network in the neural network model, which can provide prior knowledge for model training, reduce the training time of the model, and improve the training efficiency of the model .
  • the apparatus 500 shown in FIG. 5 can be deployed on cloud service equipment or terminal equipment, such as computers, servers, vehicles, mobile phones, etc., or can be deployed on a system composed of cloud service equipment and terminal equipment.
  • the apparatus 500 may be the training device 120 in FIG. 4 or the execution device 310 in FIG. 3 or a local device.
  • the device 500 includes a knowledge graph construction module 510 , a language generation module 520 , a word vector matrix generation module 530 and a training module 540 .
  • The knowledge graph construction module 510 is used to construct a knowledge graph according to the corpus of a business domain.
  • the knowledge graph may include at least one triple.
  • The language generation module 520 is configured to generate at least one text sequence according to the at least one triple. For a specific description, reference may be made to step S620 in the method 600.
  • the word vector matrix generation module 530 is used for training and obtaining a word vector matrix based on the at least one triplet.
  • the at least one triplet may constitute a training data set.
  • the word vector matrix generation module 530 is used to train and obtain a word vector matrix according to the training data set.
  • For a specific description, reference may be made to step S630 in the method 600.
  • the training module 540 is used to train the neural network model to obtain the target neural network model.
  • the target neural network model includes an expert network layer.
  • the initial weight of at least one expert network in the expert network layer is determined according to the word vector matrix. In other words, initial weights of at least one expert network in the expert network layer are initialized according to the word vector matrix.
  • For a specific description, reference may be made to step S650 in the method 600.
  • the training method of the neural network model in the embodiment of the present application will be described below with reference to FIG. 6 .
  • FIG. 6 shows a method 600 for training a neural network model provided by an embodiment of the present application.
  • the method shown in FIG. 6 may be executed by a cloud service device or a terminal device, such as a computer, server, vehicle, mobile phone, or other devices, or may be a system composed of a cloud service device and a terminal device.
  • the method 600 may be executed by the training device 120 in FIG. 4 or the executing device 310 in FIG. 3 or a local device.
  • the method 600 includes step S610 to step S650. Step S610 to step S650 will be described in detail below.
  • The knowledge graph can be constructed based on the corpus of the business domain.
  • corpus can include website articles or books, etc.
  • the knowledge graph may be constructed by the knowledge graph construction module 510 in the apparatus 500 .
  • the at least one business domain includes a first business domain
  • the first knowledge graph of the first business domain may be a knowledge graph constructed according to corpus of the first business domain.
  • the at least one business domain also includes a second business domain
  • the second knowledge graph of the second business domain may be a knowledge graph constructed according to the corpus of the second business domain.
  • the first knowledge graph in the financial field and the second knowledge graph in the Internet field can be constructed based on the corpus in the financial field and the corpus in the Internet field, respectively.
  • the first knowledge graph and the second knowledge graph may be acquired.
  • In step S610, knowledge graphs of more business domains or of fewer business domains can also be obtained; the embodiment of this application does not limit the number of knowledge graphs.
  • a knowledge graph includes at least one triple.
  • The triple in the knowledge graph includes three elements: a subject, a relation and an object, which can be expressed in the form (subject, relation, object), for example, the triple (Socrates, teacher, Aristotle).
  • subject and object can be concepts in the business domain where the knowledge graph is located. Relationships are used to indicate a connection between a subject and an object.
  • each knowledge graph in the plurality of knowledge graphs may include at least one triple.
  • the first knowledge graph includes at least one first triple.
  • the three words in the first triple are respectively used to represent the subject in the first business domain, the object in the first business domain, and the relationship between the subject and the object.
  • the second knowledge graph includes at least one second triple.
  • the three words in the second triplet respectively represent the subject in the second business domain, the object in the second business domain, and the relationship between the subject and the object.
  • The "first" in the "first triple" is only used to limit the triple to a triple in the first knowledge graph, and has no other limiting effect.
  • all triples in the first knowledge graph can be called first triples.
  • the "second” in the "second triplet” is only used to limit the triplet as a triplet in the second knowledge graph, and has no other limiting effect.
  • all triples in the second knowledge graph can be called second triples.
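  • For illustration only (the entities below are invented and are not from the present application), the first and second knowledge graphs can be pictured as lists of (subject, relation, object) tuples:

```python
# Invented example triples standing in for the first (e.g. financial) and
# second (e.g. Internet) business-domain knowledge graphs.
# Each tuple is (subject, relation, object).
first_knowledge_graph = [
    ("deposit", "is_a", "financial product"),
    ("bond", "issued_by", "issuer"),
]
second_knowledge_graph = [
    ("router", "forwards", "packet"),
    ("HTTP", "runs_over", "TCP"),
]
```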
  • step S610 is an optional step.
  • the at least one business domain includes a first business domain
  • step S620 may include acquiring a first training data set of the first business domain.
  • step S620 may include acquiring the first training data set of the first business domain and the third training data set of the second business domain.
  • step S620 may include: respectively constructing a training data set of the at least one business domain according to the knowledge map of the at least one business domain.
  • The training data sets of the at least one business domain are respectively determined according to the knowledge graphs of the at least one business domain.
  • the first training data set in the first business domain is determined according to the first knowledge graph in the first business domain.
  • The third training data set of the second business domain is determined according to the second knowledge graph of the second business domain.
  • In step S620, training data sets of more business domains or of fewer business domains can also be obtained.
  • For the acquisition methods of training data sets in other business domains, reference may be made to the acquisition methods of the first training data set and the third training data set, which are not limited in this embodiment of the present application.
  • Each training data set of the at least one training data set includes at least one text sequence.
  • the first training data set includes at least one first text sequence.
  • the third training data set includes at least one second text sequence.
  • The "first" in the "first text sequence" is only used to limit the text sequence to a text sequence in the first training data set, and has no other limiting effect.
  • all text sequences in the first training data set can be referred to as first text sequences.
  • the "second” in the "second text sequence” is only used to limit the text sequence to the text sequence in the third training data set, and has no other limiting effect.
  • all the text sequences in the third training data set can be called the second text sequences.
  • the first training data set is determined according to the first knowledge graph, including: at least one first text sequence in the first training data set is respectively generated according to at least one first triplet in the first knowledge graph .
  • the third training data set is determined according to the second knowledge graph, including: at least one second text sequence in the third training data set is respectively generated according to at least one second triplet in the second knowledge graph.
  • a text sequence can be generated from a triple.
  • a text sequence can be regarded as a training sample for the word vector generation model.
  • a triple can form a sentence, that is, a sequence of text.
  • For example, a text sequence generated from a triple could be: Socrates was the teacher of Aristotle.
  • triples can be converted into sentences through a language model.
  • the language model may be an n-gram language model.
  • n can be 2, or n can be 3.
  • the language model can be deployed in the language generation module 520 of the device 500 . That is, the triplet is converted into a text sequence by the language generation module 520 .
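  • As a simple stand-in for the language generation module (the present application describes using an n-gram language model for this step; the fixed template below is only an illustration), a triple can be turned into a text sequence as follows:

```python
# Template-based stand-in for the language generation module: it turns a
# (subject, relation, object) triple into a text sequence.
def triple_to_sentence(subject: str, relation: str, obj: str) -> str:
    return f"{subject} was the {relation} of {obj}."

print(triple_to_sentence("Socrates", "teacher", "Aristotle"))
# -> Socrates was the teacher of Aristotle.
```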
  • In step S620, other methods may also be used to construct the training data set of the at least one business domain, for example, collecting multiple text sequences in the at least one business domain to form the training data set of the at least one business domain.
  • This embodiment of the present application does not limit it.
  • "Obtaining" the training data set of the at least one business domain may mean constructing the training data set of the at least one business domain, receiving the training data set of the at least one business domain from another device, or reading a locally stored training data set of the at least one business domain.
  • the training data sets of the multiple business domains may be obtained in the same manner or differently. The embodiment of this application does not limit the specific way of "obtaining”.
  • step S620 is an optional step.
  • the at least one word vector matrix is obtained through training based on training data sets in at least one business domain.
  • step S630 includes training and obtaining at least one word vector matrix based on the at least one training data set.
  • the at least one word vector matrix can be trained by the word vector matrix generation module 530 in the apparatus 500 .
  • step S630 includes acquiring a first word vector matrix, where the first word vector matrix is obtained through training based on the first training data set.
  • step S630 further includes acquiring a second word vector matrix, where the second word vector matrix is trained based on the third training data set.
  • the knowledge graph of a business domain can indicate the relationship between entities in the business domain, and the training data set of the business domain can be constructed through the knowledge graph of the business domain, which is beneficial to the word vector matrix learning of the business domain Knowledge, improve semantic representation ability.
  • the at least one word vector matrix is respectively the weight of a hidden layer in the at least one target word vector generation model.
  • the at least one target word vector generation model is obtained by training the word vector generation model based on the training data set of at least one business domain.
  • Step S630 can also be understood as acquiring the weight of the hidden layer in at least one target word vector generation model.
  • the target word vector generation model is the trained word vector generation model.
  • the word vector generation model is trained based on the training data sets of different business fields, and the target word vector generation models of different business fields can be obtained.
  • a word embedding generation model can include an input layer, a hidden layer, and an output layer.
  • the hidden layer is a fully connected layer.
  • the word vector generation model may adopt an existing model, for example, the word vector generation model may be a CBOW model.
  • the weight of the hidden layer can also be called an embedding matrix or a word vector matrix.
  • A target word vector generation model is obtained by training the word vector generation model with the words other than the target word in at least one text sequence of the training data set of a business domain as the input of the word vector generation model and the target word as the target output of the word vector generation model; the target word is a word in at least one triple in the knowledge graph of the business domain.
  • words other than the target word in the text sequence are used as the input of the word vector generation model, and the target word is used as the target output of the word vector generation model to train the word vector model.
  • the target word is the word in the triplet corresponding to the text sequence.
  • the target word can be any one of the three elements in the triplet, subject, object or relation.
  • the triplet corresponding to the text sequence refers to the triplet used to guide the generation of the text sequence.
  • the text sequence can be generated based on the triples corresponding to the text sequence.
  • That is, the first word vector matrix being trained based on the first training data set in the first business domain includes: the first word vector matrix is the weight of the hidden layer in the first target word vector generation model; the first target word vector generation model is obtained by training the word vector generation model with the words other than the target word in the at least one first text sequence as the input of the word vector generation model and the target word as the target output of the word vector generation model, where the target word is a word in the at least one first triple.
  • the target word in the at least one first text sequence is the object in the at least one first triple.
  • that is, the word vector generation model is trained with the object as the target output of the word vector generation model, so as to obtain the first target word vector generation model.
  • the words other than the object of the first triplet corresponding to the first text sequence are used as the input of the word vector generation model, and the object of the first triplet corresponding to the first text sequence is used as the target output of the word vector generation model.
  • the target output can also be understood as a positive sample label of a training sample.
  • the positive sample label is the object.
  • Negative sample labels can be word pairs obtained by negative sampling.
  • the text sequence is: Socrates was the teacher of Aristotle.
  • the triple corresponding to the text sequence is the triple (Socrates, teacher, Aristotle), and the object in the triple is Aristotle.
  • the words other than Aristotle in the text sequence are used as the input of the CBOW model, that is, (Socrates, is, of, teacher) is used as the input of the CBOW model.
  • Use Aristotle as the target output of the CBOW model.
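  • As an illustrative sketch only (not the claimed training procedure), the following PyTorch code builds one such training sample from a hypothetical tokenization of the sentence and trains a tiny CBOW-style word vector generation model on it; the vocabulary, dimensions and number of iterations are invented for the example, and the hidden-layer weight obtained at the end plays the role of the word vector matrix.

```python
# Minimal CBOW-style sketch: context words in, target word out (hypothetical example).
import torch
import torch.nn as nn

sentence = ["Socrates", "is", "the", "teacher", "of", "Aristotle"]   # hypothetical tokenization
triple = ("Socrates", "teacher", "Aristotle")                        # (subject, relation, object)
target_word = triple[2]                                              # here the object is the target word

vocab = sorted(set(sentence))
word2id = {w: i for i, w in enumerate(vocab)}

context_ids = torch.tensor([word2id[w] for w in sentence if w != target_word])
target_id = torch.tensor([word2id[target_word]])

embed_dim = 8
# The embedding lookup is equivalent to a fully connected hidden layer applied to one-hot inputs;
# its weight is the word vector (embedding) matrix.
hidden = nn.Embedding(len(vocab), embed_dim)
output = nn.Linear(embed_dim, len(vocab), bias=False)
loss_fn = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(list(hidden.parameters()) + list(output.parameters()), lr=0.1)

for _ in range(100):
    optimizer.zero_grad()
    context_vec = hidden(context_ids).mean(dim=0, keepdim=True)      # average the context embeddings
    loss = loss_fn(output(context_vec), target_id)
    loss.backward()
    optimizer.step()

word_vector_matrix = hidden.weight.detach()                          # candidate initial weight for an expert network
print(word_vector_matrix.shape)                                      # (vocab_size, embed_dim)
```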
  • the target word in the at least one first text sequence is the subject in the at least one first triple.
  • that is, the word vector generation model is trained with the subject as the target output of the word vector generation model, so as to obtain the first target word vector generation model.
  • the words other than the subject of the first triplet corresponding to the first text sequence are used as the input of the word vector generation model, and the subject of the first triplet corresponding to the first text sequence is used as the target output of the word vector generation model.
  • the positive sample label is the subject.
  • Negative sample labels can be word pairs obtained by negative sampling.
  • the text sequence is: Socrates was the teacher of Aristotle.
  • the triple corresponding to the text sequence is the triple (Socrates, teacher, Aristotle), and the subject in the triple is Socrates.
  • the words other than Socrates in the text sequence are used as the input of the CBOW model, that is, (is, Aristotle, of, teacher) is used as the input of the CBOW model.
  • Use Socrates as the target output of the CBOW model.
  • the target words in the at least one first text sequence are relations in the at least one first triple.
  • that is, the word vector generation model is trained with the relation as the target output of the word vector generation model, so as to obtain the first target word vector generation model.
  • the words other than the relation of the first triplet corresponding to the first text sequence are used as the input of the word vector generation model, and the relation of the first triplet corresponding to the first text sequence is used as the target output of the word vector generation model.
  • Negative sample labels can be word pairs obtained by negative sampling.
  • the text sequence is: Socrates was the teacher of Aristotle.
  • the triple corresponding to the text sequence is a triple (Socrates, teacher, Aristotle), and the relationship in the triple is teacher.
  • Words other than teacher in the text sequence are used as the input of the CBOW model, that is, (Socrates, is, Aristotle, of) is used as the input of the CBOW model.
  • the second word vector matrix is trained based on the third training data set in the second business domain, including:
  • the second word vector matrix is the weight of the hidden layer in the second target word vector generation model, and the second target word vector generation model uses words other than the target words in at least one second text sequence as the input of the word vector generation model , which is obtained by training the word vector generation model with the target word as the target output of the word vector generation model, and the target word is a word in at least one second triplet.
  • for the training process of the second target word vector generation model, reference may be made to the training process of the first target word vector generation model described above.
  • the first text sequence is replaced by the second text sequence
  • the first triplet is replaced by the second triplet to train the second target word vector generation model.
  • only the first word vector matrix and the second word vector matrix are used as examples to illustrate step S630; in practical applications, more word vector matrices can be obtained in step S630, which is not limited in the embodiments of the present application.
  • obtaining at least one word vector matrix may mean obtaining the at least one word vector matrix through training, receiving the at least one word vector matrix from another device, or reading at least one locally stored word vector matrix.
  • the embodiment of this application does not limit the specific way of "obtaining”.
  • the data type in the second training data set is related to the task type of the neural network model.
  • the neural network model may be an NLP model.
  • the data in the second training data set may be text data.
  • the neural network model may be a speech processing model.
  • the data in the second training data set may be voice data.
  • the speech processing model may be an end-to-end speech processing model, for example, a listen, attend and spell (LAS) model.
  • the execution device of step S640 may be the training device 120 as shown in FIG. 4 .
  • the second training data set may be the training data maintained in the database 130 as shown in FIG. 4 .
  • the neural network model is trained based on the second training data set to obtain a target neural network model.
  • the neural network model includes an expert network layer, and the initial weight of at least one expert network in the expert network layer is respectively determined according to at least one word vector matrix.
  • step S650 may be executed by the training module 540 in the device 500 .
  • the expert network layer includes a first expert network in the first business domain, and the initial weight of the first expert network is determined according to the first word vector matrix.
  • the expert network layer further includes a second expert network in the second business domain, and the initial weight of the second expert network is determined according to the second word vector matrix.
  • the initial weights of the at least one expert network are respectively determined according to at least one word vector matrix, which can also be understood as initializing the weights of the at least one expert network based on the at least one word vector matrix.
  • the expert network layer includes a plurality of expert networks, and the expert network layer is used to process the data input to the expert network layer through a target expert network in the plurality of expert networks.
  • the target expert network is determined according to the data input to the expert network layer.
  • the target expert network is selected according to the data input to the expert network layer.
  • the target expert network may include a first expert network.
  • the expert network layer can process the data input to the expert network layer through the selected first expert network, and the first expert network is selected according to the data input to the expert network layer.
  • the target expert network may include a second expert network.
  • the expert network layer can process the data input to the expert network layer through the selected second expert network, and the second expert network is selected according to the data input to the expert network layer.
  • the neural network model is trained based on the second training data set, and the trained neural network model obtained is the target neural network model.
  • the neural network model may be an existing neural network model.
  • the neural network model can be a switch transformer model.
  • the neural network model may also be constructed by itself, which is not limited in the embodiment of the present application, as long as the neural network model includes an expert network layer.
  • the number of expert network layers may be one or multiple, which is not limited in this embodiment of the present application.
  • the neural network model includes multiple expert network layers
  • some or all of the multiple expert network layers can determine initial weights in the manner in step S650.
  • the embodiment of the present application only uses an expert network layer as an example, and does not limit the solution of the embodiment of the present application.
  • An expert network layer includes a plurality of expert networks, and the parameters of the plurality of expert networks are different.
  • the multiple expert networks may be deployed on one device, or may be deployed on multiple devices. If the multiple expert networks are deployed on multiple devices, the method 600 can also be understood as being jointly executed by the multiple devices.
  • the expert network layer may include a gate network.
  • the gate network can select one or more expert networks to participate in the actual calculation of the currently input data according to the data input to the expert network layer.
  • the gate network can route the data input to the expert network layer to one or more expert networks for processing.
  • the one or more selected expert networks are the target expert networks.
  • the specific method for determining the target expert network can use existing solutions, for example, the routing method in MoE, or the routing method in the switch layer in the switch transformer can also be used, which is not limited in the embodiment of the present application.
  • if the target expert network includes multiple expert networks, the multiple expert networks process the input data respectively.
  • the outputs of the plurality of expert networks can be combined through the weights generated by the gate network as the output of the expert network layer.
  • the calculation method of the weight may adopt an existing solution, for example, the calculation method in MoE, or the weight calculation method in the switch layer in the switch transformer may also be used, which is not limited in the embodiment of the present application.
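  • As a simplified illustration (not the routing of any particular MoE or switch transformer implementation), the following PyTorch sketch shows a gate network scoring the experts for each input, selecting the top-scoring experts as the target expert network, and combining their outputs with the gate weights; all dimensions and the top-k value are hypothetical.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ExpertLayer(nn.Module):
    """Simplified expert network layer: a gate network routes each input to its top-k experts."""
    def __init__(self, d_model: int, num_experts: int, k: int = 2):
        super().__init__()
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(num_experts))
        self.gate = nn.Linear(d_model, num_experts)      # gate network
        self.k = k

    def forward(self, x):                                 # x: (batch, d_model)
        scores = F.softmax(self.gate(x), dim=-1)          # gate weights over experts
        top_w, top_idx = scores.topk(self.k, dim=-1)      # select the target expert networks per input
        out = torch.zeros_like(x)
        for slot in range(self.k):
            for e, expert in enumerate(self.experts):
                mask = top_idx[:, slot] == e              # inputs routed to expert e in this slot
                if mask.any():
                    out[mask] += top_w[mask, slot].unsqueeze(-1) * expert(x[mask])
        return out

layer = ExpertLayer(d_model=16, num_experts=4, k=2)
y = layer(torch.randn(8, 16))                             # only the selected experts contribute per input
print(y.shape)                                            # torch.Size([8, 16])
```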
  • the target expert network in the expert network layer may be different.
  • the initial weight of at least one expert network in the expert network layer is determined according to at least one word vector matrix.
  • the initial weight of the at least one expert network is determined according to the weight of a hidden layer in at least one target word vector generation model. That is to say, the structure of the at least one expert network is the same as the structure of the hidden layer of the at least one target word vector generation model, and the hidden layer may be a fully connected layer. That is, the weights of the at least one expert network are initialized according to the weight distribution of the fully connected layer of the at least one target word vector generation model.
  • existing methods may be used for initialization, for example, random initialization using random values generated by Gaussian distribution.
  • the first expert network may include one or more expert networks.
  • the first word vector matrix can be used to initialize the weight of an expert network or the weights of multiple expert networks.
  • the initial weight of the first expert network is determined according to the first word vector matrix, including: the initial weight of the first expert network is the first word vector matrix.
  • the initial weight of the first expert network is determined according to the first word vector matrix, including: the initial weight of the first expert network is obtained by adjusting the first word vector matrix.
  • one or more values in the first word vector matrix may be adjusted, and the adjusted first word vector matrix may be used as the initial weight of the first expert network.
  • the initial weight of the first expert network is determined according to the first word vector matrix, including: the initial weight of a part of the expert networks in the first expert network is the first word vector matrix, and the initial weights of another part of the expert networks are obtained by adjusting the first word vector matrix.
  • the first expert network includes two or more expert networks.
  • the initial weights of the second expert network can be obtained by replacing the first expert network in the above initialization process with the second expert network, and replacing the first word vector matrix with the second word vector matrix.
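  • The following sketch illustrates the initialization idea with hypothetical shapes and names (it is not the exact procedure of the embodiment): a trained word vector matrix is copied into the weight of one expert network, and a second expert's initial weight is derived by slightly perturbing the same matrix, corresponding to "adjusting" the first word vector matrix.

```python
import torch
import torch.nn as nn

vocab_size, embed_dim = 1000, 64
# Hypothetical stand-in for a trained first word vector matrix (hidden-layer weight of the
# first target word vector generation model); in practice it comes from training.
first_word_vector_matrix = torch.randn(vocab_size, embed_dim)

# Expert networks whose structure matches the fully connected hidden layer (shape orientation
# is an assumption of this sketch).
expert_a = nn.Linear(vocab_size, embed_dim, bias=False)
expert_b = nn.Linear(vocab_size, embed_dim, bias=False)

with torch.no_grad():
    # Initial weight of expert_a is the first word vector matrix itself.
    expert_a.weight.copy_(first_word_vector_matrix.t())
    # Initial weight of expert_b is obtained by adjusting the matrix (small perturbation).
    expert_b.weight.copy_(first_word_vector_matrix.t() + 0.01 * torch.randn(embed_dim, vocab_size))

print(expert_a.weight.shape)   # torch.Size([64, 1000])
```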
  • the word vector matrix is obtained through training based on a training data set and contains a large amount of semantic information. Using the word vector matrix to initialize the weights of some or all expert networks in the model introduces the semantic information into the expert networks, provides prior knowledge for the expert networks, and reduces the training time; especially when the scale of the neural network model is large, the solution of the embodiments of the present application can greatly reduce the training time. At the same time, introducing semantic information into the expert networks can effectively improve the semantic representation capability of the expert networks, thereby improving the training performance of the model.
  • different word vector matrices are trained based on training data sets in different business fields and have different semantic information.
  • when different expert networks in the expert network layer are initialized with different word vector matrices, the different expert networks have different semantic representation capabilities, and the semantic combination between different expert networks can further improve the ability to understand natural language semantics and further improve the performance of the model.
  • multiple expert networks in the expert network layer are initialized with multiple word vector matrices respectively, and the multiple word vector matrices are respectively obtained through training based on training data sets of multiple business domains. In this way, the expert network layer has the semantic representation capabilities of multiple business domains, which improves the model's ability to understand natural language semantics.
  • the data of each business domain can be routed to the corresponding expert network for processing, which further improves the performance of the model.
  • the knowledge graph of a business domain can indicate the relationships between the entities in the business domain, and the training data set of the business domain can be constructed from the knowledge graph of the business domain, which helps the word vector matrix learn the knowledge of the business domain and improves its semantic representation capability.
  • FIG. 7 shows a schematic flow chart of a data processing method 700 provided by an embodiment of the present application.
  • This method can be executed by a device or device capable of data processing.
  • the device may be a cloud service device or a terminal device, for example, a computer, a server, or another device with sufficient computing power to execute the data processing method, or it may be a system composed of a cloud service device and a terminal device.
  • the method 700 may be executed by the executing device 110 in FIG. 4 or the executing device 310 in FIG. 3 or a local device.
  • the method 700 may be specifically executed by the executing device 110 as shown in FIG. 4 , and the data to be processed in the method 700 may be input data provided by the client device 140 as shown in FIG. 4 .
  • the model used in the data processing method 700 in FIG. 7 may be constructed by the above-mentioned method in FIG. 6.
  • repeated descriptions are appropriately omitted when introducing the method 700 below.
  • the method 700 includes steps S710 to S720, and the steps S710 to S720 will be described below.
  • the type of data to be processed is related to the task type of the neural network model.
  • the neural network model may be an NLP model.
  • the data to be processed may be text data.
  • the neural network model may be a speech processing model.
  • the data to be processed may be voice data.
  • the data to be processed is processed by using the target neural network model; the target neural network model is obtained by training the neural network model based on the second training data set, the neural network model includes an expert network layer, the expert network layer includes a first expert network of the first business domain, the initial weight of the first expert network is determined according to the first word vector matrix, and the first word vector matrix is obtained through training based on the first training data set of the first business domain.
  • the expert network layer further includes a second expert network in the second business domain, the initial weight of the second expert network is determined according to the second word vector matrix, and the second word vector matrix is based on the second business domain obtained by training on the third training data set in the domain.
  • the expert network layer is used to process the data input to the expert network layer through the selected first expert network, and the first expert network is selected according to the data input to the expert network layer.
  • the first training data set is determined according to the first knowledge map of the first business domain.
  • that the first training data set is determined according to the first knowledge graph includes: at least one first text sequence in the first training data set is generated according to at least one first triplet in the first knowledge graph, and the three words in the first triplet are respectively used to represent the subject in the first business domain, the object in the first business domain, and the relationship between the subject and the object.
  • that the first word vector matrix is trained based on the first training data set of the first business domain includes: the first word vector matrix is the weight of the hidden layer in the first target word vector generation model, and the first target word vector generation model is obtained by training the word vector generation model with words other than the target word in at least one first text sequence as the input of the word vector generation model and with the target word as the target output of the word vector generation model, where the target word is a word in at least one first triplet.
  • the initial weight of the first expert network is determined according to the first word vector matrix, including: the initial weight of the first expert network is the first word vector matrix.
  • the word vector matrix is obtained through training based on a training data set and contains a large amount of semantic information. Using the word vector matrix to initialize the weights of some or all expert networks in the model introduces the semantic information into the expert networks, provides prior knowledge for the expert networks, and reduces the training time; especially when the scale of the neural network model is large, the solution of the embodiments of the present application can greatly reduce the training time. At the same time, introducing semantic information into the expert networks can effectively improve the semantic representation capability of the expert networks, thereby improving the performance of the target neural network model.
  • the device of the embodiment of the present application will be described below with reference to FIG. 8 to FIG. 11 . It should be understood that the device described below can execute the method of the aforementioned embodiment of the present application. In order to avoid unnecessary repetition, repeated descriptions are appropriately omitted when introducing the device of the embodiment of the present application below.
  • Fig. 8 is a schematic block diagram of an apparatus for training a neural network according to an embodiment of the present application.
  • the apparatus 3000 shown in FIG. 8 includes an acquisition unit 3010 and a processing unit 3020 .
  • the acquisition unit 3010 and the processing unit 3020 can be used to execute the neural network model training method 600 of the embodiment of the present application.
  • the obtaining unit 3010 is used for obtaining a first word vector matrix, and the first word vector matrix is obtained through training based on the first training data set in the first business domain.
  • the obtaining unit is also used to obtain a second training data set.
  • the processing unit 3020 is used to train the neural network model based on the second training data set to obtain the target neural network model.
  • the neural network model includes an expert network layer, and the expert network layer includes a first expert network in the first business field.
  • the first expert network The initial weight of is determined according to the first word vector matrix.
  • the obtaining unit 3010 is further configured to: obtain a second word vector matrix, where the second word vector matrix is obtained through training based on the third training data set of the second business domain, the expert network layer further includes a second expert network of the second business domain, and the initial weight of the second expert network is determined according to the second word vector matrix.
  • the expert network layer is configured to process the data input to the expert network layer through the selected first expert network, and the first expert network is selected according to the data input to the expert network layer.
  • the first training data set is determined according to the first knowledge map of the first business domain.
  • that the first training data set is determined according to the first knowledge graph of the first business domain includes: at least one first text sequence in the first training data set is generated according to at least one first triplet in the first knowledge graph, and the first triplet includes a subject in the first business domain, an object in the first business domain, and a relationship between the subject and the object.
  • that the first word vector matrix is trained based on the first training data set of the first business domain includes: the first word vector matrix is the weight of the hidden layer in the first target word vector generation model, the first target word vector generation model is obtained by training the word vector generation model with words other than the target word in at least one first text sequence as the input of the word vector generation model and with the target word as the target output of the word vector generation model, and the target word is a word in at least one first triplet.
  • the initial weight of the first expert network is determined according to the first word vector matrix, including: the initial weight of the first expert network is the first word vector matrix.
  • the neural network model is an NLP model or a speech processing model.
  • Fig. 9 is a schematic block diagram of a data processing device according to an embodiment of the present application.
  • the apparatus 4000 shown in FIG. 9 includes an acquisition unit 4010 and a processing unit 4020 .
  • the acquiring unit 4010 and the processing unit 4020 may be used to execute the data processing method 700 of the embodiment of the present application.
  • the acquiring unit 4010 is configured to acquire data to be processed.
  • the processing unit 4020 is configured to use the target neural network model to process the data to be processed, the target neural network model is obtained by training the neural network model based on the second training data set, the neural network model includes an expert network layer, and the expert network layer includes For the first expert network in the first business domain, the initial weight of the first expert network is determined according to the first word vector matrix, and the first word vector matrix is trained based on the first training data set in the first business domain.
  • the expert network layer further includes a second expert network in the second business domain, the initial weight of the second expert network is determined according to the second word vector matrix, and the second word vector matrix is based on the second obtained through training on the third training data set in the business domain.
  • the expert network layer is configured to process the data input to the expert network layer through the selected first expert network, and the first expert network is selected according to the data input to the expert network layer.
  • the first training data set is determined according to the first knowledge map of the first business domain.
  • that the first training data set is determined according to the first knowledge graph includes: at least one first text sequence in the first training data set is generated according to at least one first triplet in the first knowledge graph, and the three words in the first triplet are respectively used to represent the subject in the first business domain, the object in the first business domain, and the relationship between the subject and the object.
  • that the first word vector matrix is trained based on the first training data set of the first business domain includes: the first word vector matrix is the weight of the hidden layer in the first target word vector generation model, the first target word vector generation model is obtained by training the word vector generation model with words other than the target word in at least one first text sequence as the input of the word vector generation model and with the target word as the target output of the word vector generation model, and the target word is a word in at least one first triplet.
  • the initial weight of the first expert network is determined according to the first word vector matrix, including: the initial weight of the first expert network is the first word vector matrix.
  • the neural network model is a natural language processing (NLP) model or a speech processing model.
  • apparatus 3000 and apparatus 4000 are embodied in the form of functional units.
  • unit here may be implemented in the form of software and/or hardware, which is not specifically limited.
  • a "unit” may be a software program, a hardware circuit or a combination of both to realize the above functions.
  • for example, the hardware circuit may include an application specific integrated circuit (ASIC), an electronic circuit, a processor (such as a shared processor, a dedicated processor, or a group processor) and a memory for executing one or more software or firmware programs, a merged logic circuit, and/or other suitable components that support the described functionality.
  • the units of each example described in the embodiments of the present application can be realized by electronic hardware, or a combination of computer software and electronic hardware. Whether these functions are executed by hardware or software depends on the specific application and design constraints of the technical solution. Skilled artisans may use different methods to implement the described functions for each specific application, but such implementation should not be regarded as exceeding the scope of the present application.
  • FIG. 10 is a schematic diagram of the hardware structure of the neural network model training device provided by the embodiment of the present application.
  • the neural network model training apparatus 5000 shown in FIG. 10 includes a memory 5001, a processor 5002, a communication interface 5003 and a bus 5004, where the memory 5001, the processor 5002 and the communication interface 5003 are connected to each other through the bus 5004.
  • the memory 5001 may be a read only memory (read only memory, ROM), a static storage device, a dynamic storage device or a random access memory (random access memory, RAM).
  • the memory 5001 may store programs, and when the programs stored in the memory 5001 are executed by the processor 5002, the processor 5002 is configured to execute various steps of the neural network model training method of the embodiment of the present application.
  • the processor 5002 may be a general-purpose central processing unit (CPU), a microprocessor, an application specific integrated circuit (ASIC), a graphics processing unit (GPU), or one or more integrated circuits, and is configured to execute related programs so as to implement the neural network model training method of the method embodiments of the present application.
  • the processor 5002 may also be an integrated circuit chip with signal processing capabilities.
  • each step of the neural network model training method of the present application can be completed by an integrated logic circuit of hardware in the processor 5002 or instructions in the form of software.
  • the above-mentioned processor 5002 may also be a general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
  • Various methods, steps, and logic block diagrams disclosed in the embodiments of the present application may be implemented or executed.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
  • the software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
  • the storage medium is located in the memory 5001, and the processor 5002 reads the information in the memory 5001 and, in combination with its hardware, completes the functions required by the units included in the apparatus shown in FIG. 8, or executes the neural network model training method of the method embodiments of the present application.
  • the communication interface 5003 implements communication between the apparatus 5000 and other devices or communication networks by using a transceiver device such as but not limited to a transceiver.
  • the second training data set can be obtained through the communication interface 5003 .
  • the bus 5004 may include a pathway for transferring information between various components of the device 5000 (eg, memory 5001, processor 5002, communication interface 5003).
  • FIG. 11 is a schematic diagram of a hardware structure of a data processing device provided by an embodiment of the present application.
  • the data processing apparatus 6000 shown in FIG. 11 includes a memory 6001 , a processor 6002 , a communication interface 6003 and a bus 6004 .
  • the memory 6001 , the processor 6002 , and the communication interface 6003 are connected to each other through a bus 6004 .
  • the memory 6001 may be a ROM, a static storage device, a dynamic storage device or a RAM.
  • the memory 6001 may store a program, and when the program stored in the memory 6001 is executed by the processor 6002, the processor 6002 is configured to execute each step of the data processing method of the embodiment of the present application.
  • the processor 6002 may adopt a general-purpose CPU, microprocessor, ASIC, GPU or one or more integrated circuits for executing related programs, so as to implement the data processing method of the method embodiment of the present application.
  • the processor 6002 may also be an integrated circuit chip with signal processing capabilities. In the implementation process, each step of the data processing method of the present application may be completed by an integrated logic circuit of hardware in the processor 6002 or instructions in the form of software.
  • the above processor 6002 may also be a general processor, DSP, ASIC, FPGA or other programmable logic devices, discrete gate or transistor logic devices, or discrete hardware components.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the steps of the method disclosed in connection with the embodiments of the present application may be directly implemented by a hardware decoding processor, or implemented by a combination of hardware and software modules in the decoding processor.
  • the software module may be located in a storage medium mature in the art, such as a random access memory, a flash memory, a read-only memory, a programmable read-only memory, an electrically erasable programmable memory, or a register.
  • the storage medium is located in the memory 6001, and the processor 6002 reads the information in the memory 6001 and, in combination with its hardware, completes the functions required by the units included in the apparatus shown in FIG. 9, or executes the data processing method of the method embodiments of the present application.
  • the communication interface 6003 implements communication between the apparatus 6000 and other devices or communication networks by using a transceiver device such as but not limited to a transceiver.
  • the second training data set can be obtained through the communication interface 6003 .
  • the bus 6004 may include pathways for transferring information between various components of the device 6000 (eg, memory 6001 , processor 6002 , communication interface 6003 ).
  • the device 5000 and device 6000 may also include other devices.
  • the apparatus 5000 and the apparatus 6000 may also include hardware devices for implementing other additional functions.
  • the device 5000 and the device 6000 may also include only the devices necessary to realize the embodiment of the present application, instead of all the devices shown in FIG. 10 and FIG. 11 .
  • the embodiment of the present application also provides a computer-readable medium; the computer-readable medium stores program code for execution by a device, and the program code includes instructions for executing the neural network model training method or the data processing method in the embodiments of the present application.
  • the embodiment of the present application also provides a computer program product including instructions, and when the computer program product is run on a computer, the computer is made to execute the data processing method in the embodiment of the present application.
  • the embodiment of the present application also provides a chip; the chip includes a processor and a data interface, and the processor reads instructions stored in a memory through the data interface to execute the neural network model training method or the data processing method in the embodiments of the present application.
  • the chip may further include a memory, the memory stores instructions, the processor is configured to execute the instructions stored in the memory, and when the instructions are executed, the processor is configured to execute the neural network model training method or the data processing method in the embodiments of the present application.
  • the processor in the embodiment of the present application may be a central processing unit (central processing unit, CPU), and the processor may also be other general-purpose processors, digital signal processors (digital signal processor, DSP), application specific integrated circuits (application specific integrated circuit, ASIC), off-the-shelf programmable gate array (field programmable gate array, FPGA) or other programmable logic devices, discrete gate or transistor logic devices, discrete hardware components, etc.
  • a general-purpose processor may be a microprocessor, or the processor may be any conventional processor, or the like.
  • the memory in the embodiments of the present application may be a volatile memory or a nonvolatile memory, or may include both volatile and nonvolatile memories.
  • the non-volatile memory may be a read-only memory (ROM), a programmable read-only memory (PROM), an erasable programmable read-only memory (EPROM), an electrically erasable programmable read-only memory (EEPROM) or a flash memory.
  • Volatile memory can be random access memory (RAM), which acts as external cache memory.
  • By way of example but not limitation, many forms of RAM may be used, for example, static random access memory (SRAM), dynamic random access memory (DRAM), synchronous dynamic random access memory (SDRAM), double data rate synchronous dynamic random access memory (DDR SDRAM), enhanced synchronous dynamic random access memory (ESDRAM), serial link DRAM (SLDRAM) and direct rambus random access memory (DR RAM).
  • the above-mentioned embodiments may be implemented in whole or in part by software, hardware, firmware or other arbitrary combinations.
  • the above-described embodiments may be implemented in whole or in part in the form of computer program products.
  • the computer program product comprises one or more computer instructions or computer programs.
  • when the computer instructions or computer programs are loaded and executed on a computer, the processes or functions according to the embodiments of the present application are generated in whole or in part.
  • the computer may be a general purpose computer, a special purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another computer-readable storage medium; for example, the computer instructions may be transmitted from one website, computer, server or data center to another website, computer, server or data center in a wired or wireless (for example, infrared, radio or microwave) manner.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center that includes one or more sets of available media.
  • the available media may be magnetic media (eg, floppy disk, hard disk, magnetic tape), optical media (eg, DVD), or semiconductor media.
  • the semiconductor medium may be a solid state drive.
  • "At least one" means one or more, and "multiple" means two or more.
  • "At least one of the following" or similar expressions refer to any combination of these items, including any combination of a single item or plural items.
  • for example, at least one item of a, b, or c may represent: a, b, c, a-b, a-c, b-c, or a-b-c, where a, b and c may each be singular or plural.
  • the sequence numbers of the above-mentioned processes do not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and should not constitute any limitation on the implementation process of the embodiments of the present application.
  • the disclosed systems, devices and methods may be implemented in other ways.
  • the device embodiments described above are only illustrative.
  • the division of the units is only a logical function division, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the mutual coupling or direct coupling or communication connection shown or discussed may be through some interfaces, and the indirect coupling or communication connection of devices or units may be in electrical, mechanical or other forms.
  • the units described as separate components may or may not be physically separated, and the components shown as units may or may not be physical units, that is, they may be located in one place, or may be distributed to multiple network units. Part or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, each unit may exist separately physically, or two or more units may be integrated into one unit.
  • if the functions described above are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium.
  • the technical solution of the present application is essentially or the part that contributes to the prior art or the part of the technical solution can be embodied in the form of a software product, and the computer software product is stored in a storage medium, including Several instructions are used to make a computer device (which may be a personal computer, a server, or a network device, etc.) execute all or part of the steps of the methods described in the various embodiments of the present application.
  • the aforementioned storage medium includes: a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disc, or other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • Computing Systems (AREA)
  • Molecular Biology (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Health & Medical Sciences (AREA)
  • Machine Translation (AREA)

Abstract

The present application provides a neural network model training method, a data processing method and apparatuses in the field of artificial intelligence. The training method includes: training a neural network model based on a second training data set to obtain a target neural network model, where the neural network model includes an expert network layer, the expert network layer includes a first expert network of a first business domain, the initial weight of the first expert network is determined according to a first word vector matrix, and the first word vector matrix is obtained through training based on a first training data set of the first business domain. The method of the present application can reduce the training time of the model and improve the training efficiency of the model.

Description

Neural network model training method, data processing method and apparatus

This application claims priority to the Chinese patent application No. 202110773754.0, filed with the Chinese Patent Office on July 8, 2021 and entitled "Training method and apparatus for a natural language model", which is incorporated herein by reference in its entirety.

This application claims priority to the Chinese patent application No. 202111014266.8, filed with the Chinese Patent Office on August 31, 2021 and entitled "Neural network model training method, data processing method and apparatus", which is incorporated herein by reference in its entirety.
技术领域
本申请涉及人工智能领域,并且更具体地,涉及一种神经网络模型的训练方法、数据处理的方法及装置。
背景技术
人工智能(artificial intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个分支,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式作出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。人工智能领域的研究包括机器人,自然语言处理,计算机视觉,决策与推理,人机交互,推荐与搜索,AI基础理论等。
在深度学习领域中,大规模的训练能够提高神经网络模型的性能。通常,神经网络模型基于相同的参数对模型的所有输入进行处理。而当模型的参数量增大时,模型所需要的的计算资源也会随之增大。混合专家(mixture of experts,MoE)包括多个专家网络,每个专家网络具有不同的参数。MoE可以针对不同的输入选择性地激活模型中的不同专家网络参与计算。这样能够大幅降低实际参与计算的参数量,减少计算资源的需求量,从而训练规模达到万亿甚至更高的模型。
然而采用MoE的神经网络模型所需的训练时间较长,影响模型的使用。
因此,如何提高模型的训练效率成为一个亟待解决的问题。
发明内容
本申请提供一种神经网络模型的训练方法、数据处理的方法及装置,减少了模型的训练时间,提高了模型的训练效率。
第一方面,提供了一种神经网络模型的训练方法,包括:获取第一词向量矩阵,第一词向量矩阵是基于第一业务领域的第一训练数据集训练得到的;获取第二训练数据集;基于第二训练数据集对神经网络模型进行训练,得到目标神经网络模型,神经网络模型包括 专家网络层,专家网络层包括第一业务领域的第一专家网络,第一专家网络的初始权重是根据第一词向量矩阵确定的。
根据本申请实施例的方案,词向量矩阵是根据训练数据集训练得到的,词向量矩阵中包含大量的语义信息,利用词向量矩阵初始化模型中的部分或全部专家网络的权重,能够将语义信息引入专家网络中,为专家网络提供先验知识,减少训练时间,尤其是在神经网络模型的规模较大时,本申请实施例的方案能够大幅减少训练时间。同时,将语义信息引入专家网络中,能够有效提高专家网络的语义表示能力,进而提高模型的训练性能。
结合第一方面,在第一方面的某些实现方式中,方法还包括:获取第二词向量矩阵,第二词向量矩阵是基于第二业务领域的第三训练数据集训练得到的,专家网络层还包括第二业务领域的第二专家网络,第二专家网络的初始权重是根据第二词向量矩阵确定的。
在本申请实施例的方案中,不同的词向量矩阵是基于不同的业务领域的训练数据集训练得到的,具备不同的语义信息,在专家网络层中的不同专家网络是通过不同的词向量矩阵初始化的情况下,不同的专家网络具备不同的语义表示能力,不同的专家网络之间的语义组合能够进一步提升自然语言语义的理解能力,进一步提高模型的性能。
结合第一方面,在第一方面的某些实现方式中,专家网络层用于通过选择的第一专家网络对输入专家网络层的数据进行处理,第一专家网络是根据输入专家网络层的数据选择的。
结合第一方面,在第一方面的某些实现方式中,第一训练数据集是根据第一业务领域的第一知识图谱确定的。
在本申请实施例的方案中,一个业务领域的训练数据集可以是通过该业务领域的知识图谱构建的,该业务领域的知识图谱能够指示该业务领域中的各个实体之间的关系,这样有利于词向量矩阵学习该业务领域的知识,提高语义表示能力。
结合第一方面,在第一方面的某些实现方式中,第一训练数据集是根据第一业务领域的第一知识图谱确定的,包括:第一训练数据集中的至少一个第一文本序列是根据第一知识图谱中的至少一个第一三元组生成的,第一三元组中的三个词语分别用于表示第一业务领域中的主体、第一业务领域中的客体以及主体与客体之间的关系。
一个三元组可以表示为三元组(主体,关系,客体)的形式。主体和客体为业务领域中的概念。
一个文本序列可以是根据一个三元组生成的。换言之,一个三元组可以组成一个句子,即一个文本序列。
示例性地,可以通过语言模型将三元组转换为句子。该语言模型可以是n字(n-gram)语言模型。例如,n可以为2,或者,n可以为3。
结合第一方面,在第一方面的某些实现方式中,第一词向量矩阵是基于第一业务领域的第一训练数据集训练得到的,包括:第一词向量矩阵为第一目标词向量生成模型中的隐层的权重,第一目标词向量生成模型是以至少一个第一文本序列中的目标词语之外的词语作为词向量生成模型的输入,以目标词语为词向量生成模型的目标输出对词向量生成模型进行训练得到的,目标词语为至少一个第一三元组中的词语。
词向量生成模型可以包括输入层、隐层和输出层。隐层为全连接层。
隐层的权重也可以称为嵌入矩阵(embedding matrix)或词向量矩阵。
可选地,至少一个第一文本序列中的目标词语为该至少一个第一三元组中的客体。
可选地,至少一个第一文本序列中的目标词语为该至少一个第一三元组中的主体。
可选地,该至少一个第一文本序列中的目标词语为该至少一个第一三元组中的关系。
结合第一方面,在第一方面的某些实现方式中,第一专家网络的初始权重是根据第一词向量矩阵确定的,包括:第一专家网络的初始权重为第一词向量矩阵。
结合第一方面,在第一方面的某些实现方式中,第一专家网络的初始权重是根据第一词向量矩阵确定的,包括:第一专家网络的初始权重是通过调整第一词向量矩阵得到的。
结合第一方面,在第一方面的某些实现方式中,神经网络模型为自然语言处理(natural language processing,NLP)模型或者语音处理模型。
若神经网络模型为NLP模型,则第二训练数据集中的数据可以为文本数据。
若神经网络模型为语音处理模型,则第二训练数据集中的数据可以为语音数据。
示例性地,语音处理模型可以为端到端的语音处理模型,例如,该端到端的语音处理模型可以为聆听参与拼写(listen,attend,spell,LAS)模型。
第二方面,提供了一种数据处理的方法,包括:获取待处理的数据;
利用目标神经网络模型对待处理的数据进行处理,目标神经网络模型是基于第二训练数据集对神经网络模型进行训练得到的,神经网络模型包括专家网络层,专家网络层包括第一业务领域的第一专家网络,第一专家网络的初始权重是根据第一词向量矩阵确定的,第一词向量矩阵是基于第一业务领域的第一训练数据集训练得到的。
根据本申请实施例的方案,词向量矩阵是根据训练数据集训练得到的,词向量矩阵中包含大量的语义信息,利用词向量矩阵初始化模型中的部分或全部专家网络的权重,能够将语义信息引入专家网络中,为专家网络提供先验知识,减少训练时间,尤其是在神经网络模型的规模较大时,本申请实施例的方案能够大幅减少训练时间。同时,将语义信息引入专家网络中,能够有效提高专家网络的语义表示能力,进而提高目标神经网络模型的性能。
结合第二方面,在第二方面的某些实现方式中,专家网络层还包括第二业务领域的第二专家网络,第二专家网络的初始权重是根据第二词向量矩阵确定的,第二词向量矩阵是基于第二业务领域的第三训练数据集训练得到的。
结合第二方面,在第二方面的某些实现方式中,专家网络层用于通过选择的第一专家网络对输入专家网络层的数据进行处理,第一专家网络是根据输入专家网络层的数据选择的。
结合第二方面,在第二方面的某些实现方式中,第一训练数据集是根据第一业务领域的第一知识图谱确定的。
结合第二方面,在第二方面的某些实现方式中,第一训练数据集是根据第一知识图谱确定的,包括:第一训练数据集中的至少一个第一文本序列是根据第一知识图谱中的至少一个第一三元组生成的,第一三元组中的三个词语分别用于表示第一业务领域中的主体、第一业务领域中的客体以及主体与客体之间的关系。
结合第二方面,在第二方面的某些实现方式中,第一词向量矩阵是基于第一业务领域的第一训练数据集训练得到的,包括:第一词向量矩阵为第一目标词向量生成模型中的隐层的权重,第一目标词向量生成模型是以至少一个第一文本序列中的目标词语之外的词语 作为词向量生成模型的输入,以目标词语为词向量生成模型的目标输出对词向量生成模型进行训练得到的,目标词语为至少一个第一三元组中的词语。
结合第二方面,在第二方面的某些实现方式中,第一专家网络的初始权重是根据第一词向量矩阵确定的,包括:第一专家网络的初始权重为第一词向量矩阵。
结合第二方面,在第二方面的某些实现方式中,神经网络模型为自然语言处理NLP模型或语音处理模型。
第三方面,提供了一种神经网络模型的训练装置,该装置包括用于执行上述第一方面的任意一种实现方式的方法的单元。
第四方面,提供了一种数据处理的装置,该装置包括用于执行上述第二方面的任意一种实现方式的方法的单元。
应理解,在上述第一方面中对相关内容的扩展、限定、解释和说明也适用于第二方面、第三方面和第四方面中相同的内容。
第五方面,提供了一种神经网络模型的训练装置,该装置包括:存储器,用于存储程序;处理器,用于执行所述存储器存储的程序,当所述存储器存储的程序被执行时,所述处理器用于执行第一方面的任意一种实现方式中的方法。
上述第五方面中的处理器既可以是中央处理器(central processing unit,CPU),也可以是CPU与神经网络运算处理器的组合,这里的神经网络运算处理器可以包括图形处理器(graphics processing unit,GPU)、神经网络处理器(neural-network processing unit,NPU)和张量处理器(tensor processing unit,TPU)等等。其中,TPU是谷歌(google)为机器学习全定制的人工智能加速器专用集成电路。
第六方面,提供了一种数据处理的装置,该装置包括:存储器,用于存储程序;处理器,用于执行所述存储器存储的程序,当所述存储器存储的程序被执行时,所述处理器用于执行第二方面的任意一种实现方式中的方法。
上述第六方面中的处理器既可以是CPU,也可以是CPU与神经网络运算处理器的组合,这里的神经网络运算处理器可以包括GPU、NPU和TPU等等。
第七方面,提供一种计算机可读介质,该计算机可读介质存储用于设备执行的程序代码,该程序代码包括用于执行第一方面或第二方面中的任意一种实现方式中的方法。
第八方面,提供一种包含指令的计算机程序产品,当该计算机程序产品在计算机上运行时,使得计算机执行上述第一方面或第二方面中的任意一种实现方式中的方法。
第九方面,提供一种芯片,所述芯片包括处理器与数据接口,所述处理器通过所述数据接口读取存储器上存储的指令,执行上述第一方面或第二方面中的任意一种实现方式中的方法。
可选地,作为一种实现方式,所述芯片还可以包括存储器,所述存储器中存储有指令,所述处理器用于执行所述存储器上存储的指令,当所述指令被执行时,所述处理器用于执行第一方面或第二方面中的任意一种实现方式中的方法。
Brief Description of Drawings

FIG. 1 is a schematic diagram of a dialogue system provided by an embodiment of the present application;

FIG. 2 is a schematic diagram of the processing procedure of a word vector generation model provided by an embodiment of the present application;

FIG. 3 is a schematic diagram of a natural language processing system provided by an embodiment of the present application;

FIG. 4 is a schematic diagram of a system architecture provided by an embodiment of the present application;

FIG. 5 is a schematic diagram of a neural network model training apparatus provided by an embodiment of the present application;

FIG. 6 is a schematic flowchart of a neural network model training method provided by an embodiment of the present application;

FIG. 7 is a schematic flowchart of a data processing method provided by an embodiment of the present application;

FIG. 8 is a schematic block diagram of a neural network model training apparatus provided by an embodiment of the present application;

FIG. 9 is a schematic block diagram of a data processing apparatus provided by an embodiment of the present application;

FIG. 10 is a schematic block diagram of another neural network model training apparatus provided by an embodiment of the present application;

FIG. 11 is a schematic block diagram of another data processing apparatus provided by an embodiment of the present application.
Detailed Description of Embodiments

The technical solutions of the present application are described below with reference to the accompanying drawings.

The embodiments of the present application can be applied to the field of natural language processing or the field of speech processing.

The following takes the application of the solution of the embodiments of the present application to a dialogue system as an example for description.

A dialogue system is an important application in the field of natural language processing. As shown in FIG. 1, a dialogue system includes an automatic speech recognition (ASR) subsystem, a natural language understanding (NLU) subsystem, a dialogue manager (DM) subsystem, a natural language generation (NLG) subsystem and a text to speech (TTS) subsystem.

The ASR subsystem converts the audio information input by the user into text information. The NLU subsystem analyzes the text information obtained by the ASR subsystem and parses out the user's intent. The DM subsystem, according to the user's intent obtained by the NLU subsystem and in combination with the current dialogue state, executes a corresponding action, for example, querying a knowledge base, and returns a result. The NLG subsystem generates text data according to the result returned by the DM subsystem, and the TTS subsystem converts the text data into audio data and feeds it back to the user.

In the NLU subsystem, the solution of the embodiments of the present application can be used to obtain or optimize a neural network model suitable for natural language understanding. Adopting the solution of the embodiments of the present application can improve the training efficiency of the neural network model, so that the neural network model is obtained faster.

It should be understood that applying the solution of the embodiments of the present application to the natural language understanding subsystem of a dialogue system is only used as an example here and does not constitute a limitation on the solution of the embodiments of the present application. The solution of the embodiments of the present application can also be applied to other scenarios related to natural language understanding.

Since the embodiments of the present application involve a large number of applications of neural networks, for ease of understanding, related terms and concepts of neural networks that may be involved in the embodiments of the present application are first introduced below.

(1) Neural network

A neural network may be composed of neural units. A neural unit may be a computing unit that takes $x_s$ and an intercept of 1 as inputs, and the output of the computing unit may be:

$$h_{W,b}(x) = f\left(\sum_{s=1}^{n} W_s x_s + b\right)$$

where $s = 1, 2, \ldots, n$, $n$ is a natural number greater than 1, $W_s$ is the weight of $x_s$, and $b$ is the bias of the neural unit. $f$ is the activation function of the neural unit, which is used to introduce nonlinear characteristics into the neural network to convert the input signal of the neural unit into an output signal. The output signal of the activation function may serve as the input of the next convolutional layer, and the activation function may be a sigmoid function. A neural network is a network formed by connecting many single neural units of this kind together, that is, the output of one neural unit may be the input of another neural unit. The input of each neural unit may be connected to a local receptive field of the previous layer to extract features of the local receptive field, and the local receptive field may be a region composed of several neural units.
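As a short numerical illustration of the formula above (with made-up values and a sigmoid activation), the output of a single neural unit can be computed as follows.

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0])      # inputs x_s (hypothetical values)
W = np.array([0.2, 0.4, -0.1])      # weights W_s (hypothetical values)
b = 0.3                             # bias of the neural unit

f = lambda z: 1.0 / (1.0 + np.exp(-z))    # sigmoid activation function
output = f(np.dot(W, x) + b)
print(output)                        # f(0.1 - 0.4 - 0.2 + 0.3) = f(-0.2) ≈ 0.45
```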
(2) Deep neural network

A deep neural network (DNN), also called a multi-layer neural network, may be understood as a neural network with multiple hidden layers. The DNN is divided according to the positions of different layers, and the layers inside the DNN may be divided into three categories: the input layer, the hidden layers and the output layer. Generally, the first layer is the input layer, the last layer is the output layer, and all the layers in between are hidden layers. The layers are fully connected, that is, any neuron in the i-th layer is connected to any neuron in the (i+1)-th layer.

Although the DNN looks complicated, the work of each layer is actually not complicated. Simply put, it is the following linear relationship expression:

$$\vec{y} = \alpha\left(W \cdot \vec{x} + \vec{b}\right)$$

where $\vec{x}$ is the input vector, $\vec{y}$ is the output vector, $\vec{b}$ is the offset vector, $W$ is the weight matrix (also called the coefficients), and $\alpha(\cdot)$ is the activation function. Each layer simply performs this operation on the input vector $\vec{x}$ to obtain the output vector $\vec{y}$. Because the DNN has many layers, there are also many coefficients $W$ and offset vectors $\vec{b}$. These parameters are defined in the DNN as follows. Taking the coefficient $W$ as an example, assume that in a three-layer DNN, the linear coefficient from the fourth neuron in the second layer to the second neuron in the third layer is defined as $W_{24}^{3}$: the superscript 3 represents the layer at which the coefficient $W$ is located, and the subscripts correspond to the output third-layer index 2 and the input second-layer index 4.

In summary, the coefficient from the k-th neuron in the (L-1)-th layer to the j-th neuron in the L-th layer is defined as $W_{jk}^{L}$.

It should be noted that the input layer has no $W$ parameters. In a deep neural network, more hidden layers enable the network to better describe complex situations in the real world. Theoretically, a model with more parameters has higher complexity and a larger "capacity", which means that it can complete more complex learning tasks. Training a deep neural network is the process of learning the weight matrices, and its ultimate goal is to obtain the weight matrices of all layers of the trained deep neural network (the weight matrices formed by the vectors $W$ of many layers).
(3) Loss function

In the process of training a deep neural network, it is hoped that the output of the deep neural network is as close as possible to the value that is really desired to be predicted. Therefore, the weight vectors of each layer of the neural network can be updated by comparing the predicted value of the current network with the really desired target value, according to the difference between the two (of course, there is usually an initialization process before the first update, that is, parameters are pre-configured for each layer of the deep neural network). For example, if the predicted value of the network is too high, the weight vectors are adjusted to make the prediction lower, and adjustments are made continuously until the deep neural network can predict the really desired target value or a value very close to it. Therefore, it is necessary to define in advance "how to compare the difference between the predicted value and the target value", which is what the loss function or objective function is for: they are important equations used to measure the difference between the predicted value and the target value. Taking the loss function as an example, a higher output value (loss) of the loss function indicates a larger difference, so training the deep neural network becomes a process of reducing this loss as much as possible.

(4) Back propagation algorithm

A neural network may use the error back propagation (BP) algorithm to correct the values of the parameters in the initial neural network model during the training process, so that the reconstruction error loss of the neural network model becomes smaller and smaller. Specifically, the input signal is passed forward until the output produces an error loss, and the parameters in the initial neural network model are updated by back-propagating the error loss information, so that the error loss converges. The back propagation algorithm is a back propagation movement dominated by the error loss, and aims to obtain the optimal parameters of the neural network model, for example, the weight matrices.

(5) Natural language processing (NLP)

Natural language is human language, and natural language processing (NLP) is the processing of human language. Natural language processing is the process of systematically analyzing, understanding and extracting information from text data in an intelligent and efficient manner. NLP and its components can manage very large chunks of text data, or perform a large number of automated tasks, and solve a wide variety of problems, such as automatic summarization, machine translation (MT), named entity recognition (NER), relation extraction (RE), information extraction (IE), sentiment analysis, speech recognition, question answering and topic segmentation.

(6) Knowledge graph (KG)

A knowledge graph is a semantic network that reveals the relationships between entities. On the basis of information, connections between entities are established to form "knowledge". A knowledge graph is composed of pieces of knowledge, and each piece of knowledge can be expressed as a triplet, that is, a triplet composed of a subject, a relation and an object, which can be written in the form (subject, relation, object).

Entities, that is, the subjects and objects in triplets, usually represent concepts and generally consist of nouns or noun phrases. A relation represents the connection between two entities and generally consists of a verb, an adjective or a noun.

For example, the knowledge indicated by the triplet (Socrates, teacher, Aristotle) is that Socrates is the teacher of Aristotle.

(7) Mixture of experts (MoE) system

A mixture of experts system is a neural network architecture in which several linear models are trained with local input data, and the outputs of these linear models are combined by the weights produced by a gate network as the output of the MoE. These linear models are called experts, and may also be called expert networks or expert models.

Specifically, the MoE includes at least one gate network and multiple expert networks. Different expert networks have different parameters. The gate network can selectively activate some of the parameters in the MoE for different input data. In other words, the gate network can select different expert networks to participate in the actual computation of the current input according to different inputs.

The same expert network can be deployed on multiple devices. In other words, identical expert networks deployed on different devices have the same parameters. In this way, multiple devices can share parameters, which is beneficial for training larger-scale models, for example, models with trillions of parameters or even more.

(8) Word vector

A word in NLP usually has two kinds of representation: one-hot representation and distributed representation.

Distributed representation maps a word or phrase from the vocabulary into a new space and represents the word or phrase with a real-valued vector, that is, a word vector. This approach may be called word embedding. Word to vector (word2vec) is one way of word embedding.

A word2vec model may include an input layer, a hidden layer and an output layer. The hidden layer is a fully connected layer. As shown in FIG. 2, the weight of the hidden layer in the trained model is the word vector matrix, which may also be called the embedding matrix.

The word2vec model includes two types of models: the skip-gram model and the continuous bag-of-words (CBOW) model.

The skip-gram model is used to generate the words in the context of a word based on that word. In other words, a word is used as the input of the skip-gram model, and the words in the context of the word are used as the target output of the skip-gram model. For example, w(t) is used as the input, and w(t-1), w(t-2), w(t+1) and w(t+2) in the context of w(t) are used as the target output.

The CBOW model is used to generate a word based on the words in the context of the word. In other words, the words in the context of a word are used as the input of the CBOW model, and the word is used as the target output of the CBOW model. For example, w(t-1), w(t-2), w(t+1) and w(t+2) in the context of w(t) are used as the input, and w(t) is used as the target output.

FIG. 2 shows a schematic diagram of the processing procedure of a CBOW model. A "1" in the input layer indicates that the word corresponding to the position of the "1" is input, and a "0" indicates that the word corresponding to the position of the "0" is not input. A "1" in the output layer indicates that the word corresponding to the position of the "1" is output, and a "0" indicates that the word corresponding to the position of the "0" is not output. For example, for the sentence "the dog bark at mailman", "the" and "bark" are the context of "dog". The one-hot codes of "the" and "bark" in the sentence are input into the CBOW model shown in FIG. 2, that is, the positions corresponding to "the" and "bark" in the input layer are set to 1. After processing by the CBOW model, the position corresponding to "dog" in the output result is 1, that is, "dog" is output.
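The following short sketch (with hypothetical sizes) shows why the hidden-layer weight is called the word vector matrix: multiplying a one-hot input by the hidden-layer weight of a word2vec model simply selects the row of that weight corresponding to the input word, that is, the word's vector.

```python
import numpy as np

vocab_size, embed_dim = 5, 3
rng = np.random.default_rng(0)
hidden_weight = rng.standard_normal((vocab_size, embed_dim))  # word vector (embedding) matrix

one_hot = np.zeros(vocab_size)
one_hot[2] = 1.0                       # one-hot code of the word with index 2 (e.g., "dog")

word_vector = one_hot @ hidden_weight  # equals hidden_weight[2]
assert np.allclose(word_vector, hidden_weight[2])
print(word_vector)
```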
图3是本申请实施例的一种自然语言处理系统的示意图。
如图3的(a)所示,自然语言处理系统可以包括用户设备以及数据处理设备。用户设备包括用户以及手机、个人电脑或者信息处理中心等智能终端。用户设备为自然语言数据处理的发起端,作为语言问答或者查询等请求的发起方,通常用户通过用户设备发起请求。
数据处理设备可以是云服务器、网络服务器、应用服务器以及管理服务器等具有数据处理功能的设备或服务器。数据处理设备通过交互接口接收来自智能终端的查询语句/语音/文本等问句,再通过存储数据的存储器以及数据处理的处理器环节进行机器学习,深度学习,搜索,推理,决策等方式的语言数据处理。存储器可以是一个统称,包括本地存储以及存储历史数据的数据库,数据库可以位于数据处理设备上,也可以位于其它网络服务器上。
在图3的(a)所示的自然语言处理系统中,用户设备可以接收用户的指令,例如,用户设备可以接收用户输入的一段文本,然后向数据处理设备发起请求,使得数据处理设备针对用户设备得到的该一段文本执行自然语言处理应用(例如,意图识别、文本分类、文本序列标注、翻译等),从而得到针对该一段文本的对应的自然语言处理应用的处理结果(例如,意图识别、文本分类、文本序列标注、翻译等)。
示例性地,用户设备可以接收用户输入的待处理文本,然后向数据处理设备发起请求,使得数据处理设备对该待处理文本进行分类,从而得到针对该待处理文本的分类结果。其中,分类结果可以是指该待处理文本所指示的用户语义意图,比如,用户用于指示放歌、设置时间、开启导航的意图;或者,分类结果还可以用于指示用户的情感分类结果,比如,分类结果可以指示待处理文本对应的用户情感分类为抑郁、开心或者生气等。
采用本申请实施例的神经网络模型的训练方法得到的目标神经网络模型可以部署于图3的(a)中的数据处理设备中,该目标神经网络模型可以用于执行自然语言处理应用执行自然语言处理应用(例如,意图识别、文本分类、文本序列标注、翻译等),从而得 到自然语言处理应用的处理结果(例如,意图识别、文本分类、文本序列标注、翻译等)。
如图3的(b)所示为自然语言处理系统的另一个应用场景。此场景中智能终端直接作为数据处理设备,直接接收来自用户的输入并直接由智能终端本身的硬件进行处理,具体过程与图3的(a)相似,可参考上面的描述,在此不再赘述。
在图3的(b)所示的自然语言处理系统中,用户设备可以接收用户的指令,由用户设备自身对待处理数据进行处理得到待处理数据的处理结果。
在图3的(b)所示的自然语言处理系统中,用户设备可以接收用户的指令,例如用户设备可以接收用户输入的一段文本,然后再由用户设备自身针对该一段文本执行自然语言处理应用(例如,意图识别、文本分类、文本序列标注、翻译等),从而得到针对该一段文本的对应的自然语言处理应用的处理结果(例如,意图识别、文本分类、文本序列标注、翻译等)。
采用本申请实施例的神经网络模型的训练方法得到的目标神经网络模型可以部署于图3的(b)中的用户设备中,该目标神经网络模型可以用于执行自然语言处理应用执行自然语言处理应用(例如,意图识别、文本分类、文本序列标注、翻译等),从而得到自然语言处理应用的处理结果(例如,意图识别、文本分类、文本序列标注、翻译等)。
图3的(c)是本申请实施例提供的自然语言处理系统的相关设备的示意图。
上述图3的(a)和(b)中的用户设备具体可以是如图3的(c)的本地设备301或302,数据处理设备可以是执行设备310,其中数据存储系统350可以集成在执行设备310上,也可以设置在云上或其它网络服务器上。
本地设备301和本地设备302通过通信网络与执行设备310连接。
执行设备310可以由一个或多个服务器实现。可选的,执行设备310可以与其它计算设备配合使用,例如:数据存储器、路由器、负载均衡器等设备。执行设备310可以布置在一个物理站点上,或者分布在多个物理站点上。执行设备310可以使用数据存储系统350中的数据,或者调用数据存储系统350中的程序代码来实现本申请实施例的神经网络模型的训练方法。
具体地,在一种实现方式中,执行设备310可以执行以下过程:
获取第一词向量矩阵,第一词向量矩阵是基于第一业务领域的第一训练数据集训练得到的;
获取第二训练数据集;
基于第二训练数据集对神经网络模型进行训练,得到目标神经网络模型,神经网络模型包括专家网络层,专家网络层包括第一业务领域的第一专家网络,第一专家网络的初始权重是根据第一词向量矩阵确定的。
通过上述过程执行设备310能够获取一个训练好的神经网络,即目标神经网络模型,该目标神经网络模型可以用于进行自然语言处理等等。
示例性地,用户可以操作各自的用户设备(例如本地设备301和本地设备302)与执行设备310进行交互。每个本地设备可以表示任何计算设备,例如个人计算机、计算机工作站、智能手机、平板电脑、智能摄像头、智能汽车或其他类型蜂窝电话、媒体消费设备、可穿戴设备、机顶盒、游戏机等。
每个用户的本地设备可以通过任何通信机制/通信标准的通信网络与执行设备310进 行交互,通信网络可以是广域网、局域网、点对点连接等方式,或它们的任意组合。
在一种实现方式中,本地设备301、本地设备302从执行设备310获取到目标神经网络模型的相关参数,将目标模型部署在本地设备301、本地设备302上,利用目标模型进行语音处理或者文本处理等等。
在另一种实现中,执行设备310上可以直接部署目标神经网络模型,执行设备310通过从本地设备301和本地设备302获取待处理数据,并采用目标神经网络模型对待处理数据进行处理,进一步地,可以将处理结果返回至本地设备301和本地设备302。
需要注意的,执行设备310的所有功能也可以由本地设备实现。例如,本地设备301实现执行设备310的功能并为自己的用户提供服务,或者为本地设备302的用户提供服务。
上述执行设备310也可以为云端设备,此时,执行设备310可以部署在云端;或者,上述执行设备310也可以为终端设备,此时,执行设备310可以部署在用户终端侧,本申请实施例对此并不限定。
如图4所示,本申请实施例提供了一种系统架构100。在图4中,数据采集设备160用于采集训练数据。针对本申请实施例的神经网络模型的训练方法来说,若数据为文本数据,则训练数据可以包括文本序列以及文本序列对应的处理结果,例如,文本序列对应的处理结果可以为对文本系列的意图识别结果。
在采集到训练数据之后,数据采集设备160将这些训练数据存入数据库130,训练设备120基于数据库130中维护的训练数据训练得到目标模型/规则101。
下面对训练设备120基于训练数据得到目标模型/规则101进行描述,训练设备120对输入的原始数据进行处理,将输出值与目标值进行对比,直到训练设备120输出的值与目标值的差值小于一定的阈值,从而完成目标模型/规则101的训练。
上述目标模型/规则101能够用于实现本申请实施例的数据处理方法。本申请实施例中的目标模型/规则101具体可以为神经网络模型。需要说明的是,在实际的应用中,所述数据库130中维护的训练数据不一定都来自于数据采集设备160的采集,也有可能是从其他设备接收得到的。另外需要说明的是,训练设备120也不一定完全基于数据库130维护的训练数据进行目标模型/规则101的训练,也有可能从云端或其他地方获取训练数据进行模型训练,上述描述不应该作为对本申请实施例的限定。
根据训练设备120训练得到的目标模型/规则101可以应用于不同的系统或设备中,如应用于图4所示的执行设备110。
执行设备110可以是终端，如手机终端，平板电脑，笔记本电脑，增强现实（augmented reality，AR）设备/虚拟现实（virtual reality，VR）设备，车载终端等，还可以是服务器或者云端等。在图4中，执行设备110配置输入/输出（input/output，I/O）接口112，用于与外部设备进行数据交互，用户可以通过客户设备140向I/O接口112输入数据，输入数据在本申请实施例中可以包括：客户设备输入的待处理的数据。
在执行设备110对输入数据进行预处理,或者在执行设备110的计算模块111执行计算等相关的处理过程中,执行设备110可以调用数据存储系统150中的数据、代码等以用于相应的处理,也可以将相应处理得到的数据、指令等存入数据存储系统150中。
最后,I/O接口112将处理结果,如上述得到的数据的处理结果返回给客户设备140,从而提供给用户。
值得说明的是,训练设备120可以针对不同的目标或不同的任务,基于不同的训练数据生成相应的目标模型/规则101,该相应的目标模型/规则101即可以用于实现上述目标或完成上述任务,从而为用户提供所需的结果。
在图4中所示情况下,用户可以手动给定输入数据,该手动给定可以通过I/O接口112提供的界面进行操作。另一种情况下,客户设备140可以自动地向I/O接口112发送输入数据,如果要求客户设备140自动发送输入数据需要获得用户的授权,则用户可以在客户设备140中设置相应权限。用户可以在客户设备140查看执行设备110输出的结果,具体的呈现形式可以是显示、声音、动作等具体方式。客户设备140也可以作为数据采集端,采集如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果作为新的样本数据,并存入数据库130。当然,也可以不经过客户设备140进行采集,而是由I/O接口112直接将如图所示输入I/O接口112的输入数据及输出I/O接口112的输出结果,作为新的样本数据存入数据库130。
值得注意的是,图4仅是本申请实施例提供的一种系统架构的示意图,图中所示设备、器件、模块等之间的位置关系不构成任何限制,例如,在图4中,数据存储系统150相对执行设备110是外部存储器,在其它情况下,也可以将数据存储系统150置于执行设备110中。
如图4所示,根据训练设备120训练得到目标模型/规则101,该目标模型/规则101在本申请实施例中可以是本申请中的目标神经网络模型。
通过使用MoE能够扩大模型的参数量,训练出规模达到万亿甚至更高的模型,进而提高模型的性能。然而采用MoE的神经网络模型所需的训练时间较长,影响了模型的应用。
本申请实施例提供了一种神经网络模型的训练方法,利用词向量矩阵初始化神经网络模型中的专家网络的权重,能够为模型训练提供先验知识,减少模型的训练时间,提高模型的训练效率。
为了更好地说明本申请实施例的神经网络模型的训练方法,下面结合图5对本申请实施例的神经网络模型的训练装置进行说明。图5所示的装置500可以部署于云服务设备或终端设备上,例如,电脑、服务器、车辆、手机等设备,也可以部署于由云服务设备和终端设备构成的系统上。示例性地,装置500可以为图4中的训练设备120或图3中的执行设备310或本地设备。
装置500包括知识图谱构建模块510、语言生成模块520、词向量矩阵生成模块530和训练模块540。
其中,知识图谱构建模块510用于根据业务领域的语料构建知识图谱。
示例性地,知识图谱可以包括至少一个三元组。具体描述可以参考方法600中的步骤S610。
语言生成模块520用于根据该至少一个三元组生成至少一个文本序列。具体描述可以参考方法600中的步骤S620。
词向量矩阵生成模块530用于基于该至少一个三元组训练得到词向量矩阵。
该至少一个三元组可以构成训练数据集。换言之,词向量矩阵生成模块530用于根据该训练数据集训练得到词向量矩阵。具体描述可以参考方法600中的步骤S630。
训练模块540用于对神经网络模型进行训练,得到目标神经网络模型。其中,目标神经网络模型包括专家网络层。专家网络层中的至少一个专家网络的初始权重是根据词向量矩阵确定的。换言之,根据词向量矩阵初始化专家网络层中的至少一个专家网络的初始权重。具体描述可以参考方法600中的步骤S650。
下面结合图6对本申请实施例中的神经网络模型的训练方法进行说明。
图6示出了本申请实施例提供的神经网络模型的训练方法600。图6所示的方法可以由云服务设备或终端设备执行,例如,电脑、服务器、车辆、手机等装置,也可以是由云服务设备和终端设备构成的系统。示例性地,方法600可以由图4中的训练设备120或图3中的执行设备310或本地设备执行。
方法600包括步骤S610至步骤S650。下面对步骤S610至步骤S650进行详细介绍。
S610,获取至少一个业务领域的知识图谱。
知识图谱可以是根据业务领域的语料构建的。例如,语料可以包括网站文章或图书等。
示例性地,可以通过装置500中的知识图谱构建模块510构建知识图谱。
基于不同业务领域的语料可以分别构建不同业务领域的知识图谱。
示例性地,该至少一个业务领域包括第一业务领域,第一业务领域的第一知识图谱可以是根据第一业务领域的语料构建的知识图谱。
进一步地,该至少一个业务领域还包括第二业务领域,第二业务领域的第二知识图谱可以是根据第二业务领域的语料构建的知识图谱。
例如,第一业务领域为金融领域,第二业务领域为互联网领域,则可以基于金融领域的语料和互联网领域的语料分别构建金融领域的第一知识图谱和互联网领域的第二知识图谱,在步骤S610中,可以获取第一知识图谱和第二知识图谱。
为了便于描述,本申请实施例中仅以第一业务领域和第二业务领域为例对S610进行说明,步骤S610中还可以获取更多的业务领域的知识图谱或更少的业务领域的知识图谱,本申请实施例对知识图谱的数量不做限定。
示例性地,一个知识图谱包括至少一个三元组。
换言之,以三元组的形式表示一个知识图谱中的实体之间的关系。
知识图谱中的三元组包括主体、关系和客体这三个元素,可以表示为三元组(主体,关系,客体)的形式,例如,三元组(苏格拉底,老师,亚里士多德)。其中,主体和客体可以为知识图谱所在业务领域中的概念。关系即用于指示主体和客体之间的联系。
若步骤S610中获取的知识图谱为多个知识图谱,该多个知识图谱中的每个知识图谱均可以包括至少一个三元组。
示例性地,第一知识图谱包括至少一个第一三元组。第一三元组中的三个词语分别用于表示第一业务领域中的主体、第一业务领域中的客体以及主体和客体之间的关系。
第二知识图谱包括至少一个第二三元组。第二三元组中的三个词语分别表示第二业务领域中的主体、第二业务领域中的客体以及主体和客体之间的关系。
应理解,“第一三元组”中的“第一”仅用于限定该三元组为第一知识图谱中的三元组,不具有其他限定作用。换言之,第一知识图谱中的三元组均可以称为第一三元组。
同理,“第二三元组”中的“第二”仅用于限定该三元组为第二知识图谱中的三元组,不具有其他限定作用。换言之,第二知识图谱中的三元组均可以称为第二三元组。
应理解,此处仅为示例,还可以通过三元组以外的其他形式表示知识图谱,本申请实施例对此不做限定。
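为便于理解，下面给出一段示意性的Python代码（其中的业务领域与三元组内容均为假设），演示以三元组（主体，关系，客体）的形式表示不同业务领域的知识图谱：

from collections import namedtuple

# 三元组（主体，关系，客体）的一种示意性表示
Triple = namedtuple("Triple", ["subject", "relation", "object"])

# 第一业务领域的第一知识图谱（三元组内容为假设）
first_knowledge_graph = [
    Triple("苏格拉底", "老师", "亚里士多德"),
]

# 第二业务领域的第二知识图谱（三元组内容为假设）
second_knowledge_graph = [
    Triple("股票", "属于", "有价证券"),
]

for t in first_knowledge_graph:
    print((t.subject, t.relation, t.object))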
需要说明的是,步骤S610为可选步骤。
S620,获取至少一个业务领域的训练数据集。
示例性地,该至少一个业务领域包括第一业务领域,则步骤S620可以包括获取第一业务领域的第一训练数据集。
进一步地,该至少一个业务领域还包括第二业务领域,则步骤S620可以包括获取第一业务领域的第一训练数据集和第二业务领域的第三训练数据集。
在方法600包括步骤S610的情况下,步骤S620可以包括:根据该至少一个业务领域的知识图谱分别构建该至少一个业务领域的训练数据集。
换言之,该至少一个业务领域的训练数据集是分别根据该至少一个业务领域的知识图谱确定的。
可选地,第一业务领域的第一训练数据集是根据第一业务领域的第一知识图谱确定的。
进一步地,第二业务领域的第三训练数据集是根据第二业务领域的第二知识图谱确定的。
为了便于描述,本申请实施例中仅以第一业务领域和第二业务领域为例对S620进行说明,步骤S620中还可以获取更多的业务领域的训练数据集或更少的业务领域的训练数据集,其他业务领域的训练数据集的获取方式可以参考第一训练数据集和第二训练数据集的获取方式,本申请实施例对此不做限定。
该至少一个训练数据集中的每个训练数据集包括至少一个文本序列。
示例性地,第一训练数据集包括至少一个第一文本序列。
进一步地,第三训练数据集包括至少一个第二文本序列。
应理解,“第一文本序列”中的“第一”仅用于限定该文本序列为第一训练数据集中的文本序列,不具有其他限定作用。换言之,第一训练数据集中的文本序列均可以称为第一文本序列。
同理,“第二文本序列”中的“第二”仅用于限定该文本序列为第三训练数据集中的文本序列,不具有其他限定作用。换言之,第三训练数据集中的文本序列均可以称为第二文本序列。
可选地,第一训练数据集是根据第一知识图谱确定的,包括:第一训练数据集中的至少一个第一文本序列分别是根据第一知识图谱中的至少一个第一三元组生成的。
第三训练数据集是根据第二知识图谱确定的,包括:第三训练数据集中的至少一个第二文本序列分别是根据第二知识图谱中的至少一个第二三元组生成的。
一个文本序列可以是根据一个三元组生成的。一个文本序列可以视为词向量生成模型的一个训练样本。
换言之,一个三元组可以组成一个句子,即一个文本序列。
例如,根据三元组(苏格拉底,老师,亚里士多德)生成的文本序列可以为,苏格拉底是亚里士多德的老师。
示例性地，可以通过语言模型将三元组转换为句子。该语言模型可以是n元（n-gram）语言模型。例如，n可以为2，或者，n可以为3。
示例性地,该语言模型可以部署于装置500的语言生成模块520中。即由语言生成模块520将三元组转换为文本序列。
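作为一种示意，下面给出一段基于模板的Python代码（模板内容为假设，实际也可以采用上述n元语言模型等其他方式），演示将一个三元组转换为一个文本序列：

# 按照假设的模板将三元组（主体，关系，客体）转换为一个文本序列
def triple_to_sentence(subject, relation, obj):
    return f"{subject}是{obj}的{relation}。"

print(triple_to_sentence("苏格拉底", "老师", "亚里士多德"))
# 输出：苏格拉底是亚里士多德的老师。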
步骤S620中也可以采用其他方式构建至少一个业务领域的训练数据集,例如,在至少一个业务领域分别采集多个文本序列,构成至少一个业务领域的训练数据集。本申请实施例对此不做限定。
示例性地,获取至少一个业务领域的训练数据集可以为构建至少一个业务领域的训练数据集,或者,获取该至少一个业务领域的训练数据集也可以为从其他设备接收该至少一个业务领域的训练数据集,或者,获取该至少一个业务领域的训练数据集还可以为读取本地存储的该至少一个业务领域的训练数据集。在该至少一个业务领域包括多个业务领域的情况下,该多个业务领域的训练数据集的获取方式可以相同,也可以不同。本申请实施例对“获取”的具体方式不做限定。
需要说明的是,步骤S620为可选步骤。
S630,获取至少一个词向量矩阵。该至少一个词向量矩阵分别是基于至少一个业务领域的训练数据集训练得到的。
在方法600包括步骤S620的情况下,步骤S630包括,基于该至少一个训练数据集训练分别得到至少一个词向量矩阵。
示例性地,该至少一个词向量矩阵可以由装置500中的词向量矩阵生成模块530训练得到。
可选地,步骤S630包括获取第一词向量矩阵,第一词向量矩阵是基于第一训练数据集训练得到的。
可选地,步骤S630还包括获取第二词向量矩阵,第二词向量矩阵是基于第三训练数据集训练得到的。
一个业务领域的知识图谱能够指示该业务领域中的各个实体之间的关系,该业务领域的训练数据集可以是通过该业务领域的知识图谱构建的,这样有利于词向量矩阵学习该业务领域的知识,提高语义表示能力。
该至少一个词向量矩阵分别为至少一个目标词向量生成模型中的隐层的权重。该至少一个目标词向量生成模型分别是基于至少一个业务领域的训练数据集对词向量生成模型进行训练得到的。
在该情况下,步骤S630也可以理解为获取至少一个目标词向量生成模型中的隐层的权重。
目标词向量生成模型即为训练好的词向量生成模型。基于不同业务领域的训练数据集对词向量生成模型进行训练,可以得到不同业务领域的目标词向量生成模型。
词向量生成模型可以包括输入层、隐层和输出层。隐层为全连接层。词向量生成模型可以采用现有模型,例如,词向量生成模型可以为CBOW模型。
隐层的权重也可以称为嵌入矩阵(embedding matrix)或词向量矩阵。
一个目标词向量生成模型是以一个业务领域的训练数据集中的至少一个文本序列中的目标词语之外的词语作为词向量生成模型的输入，以目标词语作为词向量生成模型的目标输出对词向量生成模型进行训练得到的，目标词语为该业务领域的知识图谱中的至少一个三元组中的词语。
示例性地,对于一个文本序列而言,将该文本序列中的目标词语以外的词语作为词向量生成模型的输入,以目标词语作为词向量生成模型的目标输出训练该词向量模型。目标词语为该文本序列对应的三元组中的词语。目标词语可以为该三元组中的主体、客体或关系这三个元素中的任一项。
文本序列对应的三元组指的是用于指导生成该文本序列的三元组。或者说,基于该文本序列对应的三元组可以生成该文本序列。
可选地,第一词向量矩阵是基于第一业务领域的第一训练数据集训练得到的,包括:第一词向量矩阵为第一目标词向量生成模型中的隐层的权重,第一目标词向量生成模型是以至少一个第一文本序列中的目标词语之外的词语作为词向量生成模型的输入,以目标词语为词向量生成模型的目标输出对词向量生成模型进行训练得到的,目标词语为至少一个第一三元组中的词语。
可选地,至少一个第一文本序列中的目标词语为该至少一个第一三元组中的客体。
具体地，以该至少一个第一文本序列中该至少一个第一三元组中的客体之外的词作为词向量生成模型的输入，以该至少一个第一三元组中的客体作为词向量生成模型的目标输出训练该词向量生成模型，得到第一目标词向量生成模型。
换言之,对一个文本序列而言,该第一文本序列中该第一文本序列对应的第一三元组中的客体之外的词作为词向量生成模型的输入,该第一文本序列对应的第一三元组中的客体作为词向量生成模型的目标输出。
目标输出也可以理解为一个训练样本的正样本标签。在该情况下,正样本标签即为客体。负样本标签可以为通过负采样得到的单词对。
例如,文本序列为:苏格拉底是亚里士多德的老师。该文本序列对应的三元组为三元组(苏格拉底,老师,亚里士多德),该三元组中的客体为亚里士多德。将该文本序列中亚里士多德以外的词作为CBOW模型的输入,即将(苏格拉底,是,的,老师)作为CBOW模型的输入。将亚里士多德作为CBOW模型的目标输出。
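下面给出一段示意性的Python代码（分词方式为假设的简单切分），演示如何按上述例子从一个第一文本序列及其对应的第一三元组中构造词向量生成模型的一个训练样本，即以目标词语（此处以客体为例）之外的词语作为输入，以目标词语作为目标输出：

def build_cbow_sample(tokens, target_word):
    # 以目标词语之外的词语作为输入（上下文），以目标词语作为目标输出
    context = [w for w in tokens if w != target_word]
    return context, target_word

# 文本序列“苏格拉底是亚里士多德的老师”的分词结果（分词方式为假设）
tokens = ["苏格拉底", "是", "亚里士多德", "的", "老师"]

# 以该文本序列对应的第一三元组中的客体“亚里士多德”作为目标词语
context, target = build_cbow_sample(tokens, "亚里士多德")
print(context)   # ['苏格拉底', '是', '的', '老师']
print(target)    # 亚里士多德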
可选地,至少一个第一文本序列中的目标词语为该至少一个第一三元组中的主体。
具体地,以该至少一个第一文本序列中该至少一个第一三元组中的主体之外的词作为词向量生成模型的输入,以该至少一个第一三元组中的主体作为词向量生成模型的目标输出训练该词向量生成模型,得到第一目标词向量生成模型。
换言之,对一个第一文本序列而言,该第一文本序列中该第一文本序列对应的第一三元组中的主体之外的词作为词向量生成模型的输入,该第一文本序列对应的第一三元组中的主体作为词向量生成模型的目标输出。
在该情况下,正样本标签即为主体。负样本标签可以为通过负采样得到的单词对。
例如,文本序列为:苏格拉底是亚里士多德的老师。该文本序列对应的三元组为三元组(苏格拉底,老师,亚里士多德),该三元组中的主体为苏格拉底。将该文本序列中苏格拉底以外的词作为CBOW模型的输入,即将(是,亚里士多德,的,老师)作为CBOW模型的输入。将苏格拉底作为CBOW模型的目标输出。
可选地,该至少一个第一文本序列中的目标词语为该至少一个第一三元组中的关系。
具体地，以该至少一个第一文本序列中该至少一个第一三元组中的关系之外的词作为词向量生成模型的输入，以该至少一个第一三元组中的关系作为词向量生成模型的目标输出训练该词向量生成模型，得到第一目标词向量生成模型。
换言之,对一个第一文本序列而言,该第一文本序列中该第一文本序列对应的三元组中的关系之外的词作为词向量生成模型的输入,该第一文本序列对应的三元组中的关系作为词向量生成模型的目标输出。
在该情况下,正样本标签即为关系。负样本标签可以为通过负采样得到的单词对。
例如,文本序列为:苏格拉底是亚里士多德的老师。该文本序列对应的三元组为三元组(苏格拉底,老师,亚里士多德),该三元组中的关系为老师。将该文本序列中的老师以外的词作为CBOW模型的输入,即将(苏格拉底,是,亚里士多德,的)作为CBOW模型的输入。将老师作为CBOW模型的目标输出。
可选地,第二词向量矩阵是基于第二业务领域的第三训练数据集训练得到的,包括:
第二词向量矩阵为第二目标词向量生成模型中的隐层的权重,第二目标词向量生成模型是以至少一个第二文本序列中的目标词语之外的词语作为词向量生成模型的输入,以目标词语为词向量生成模型的目标输出对词向量生成模型进行训练得到的,目标词语为至少一个第二三元组中的词语。
第二词向量生成模型的训练过程可以参考上述第一词向量生成模型的训练过程。将上述训练过程中的第一文本序列替换为第二文本序列,将第一三元组替换为第二三元组即可训练得到第二目标词向量生成模型。
应理解,本申请实施例中仅以第一词向量矩阵和第二词向量矩阵为例对步骤S630进行说明,在实际应用中,步骤S630中可以获取更多的词向量矩阵,本申请实施例对此不做限定。
示例性地,获取至少一个词向量矩阵可以为通过训练得到至少一个词向量矩阵,或者,也可以为从其他设备接收至少一个词向量矩阵,或者,还可以为读取本地存储的至少一个词向量矩阵。本申请实施例对“获取”的具体方式不做限定。
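结合步骤S620与步骤S630，下面给出一段示意性的Python代码（训练样本、词表与超参数均为假设，且为简化起见此处采用softmax交叉熵而非负采样），演示基于一个业务领域的训练数据集训练词向量生成模型，并将训练后隐层的权重作为该业务领域的词向量矩阵：

import numpy as np

# 假设的第一训练数据集：每个样本为（目标词语之外的词语，目标词语）
samples = [
    (["苏格拉底", "是", "的", "老师"], "亚里士多德"),
    (["是", "亚里士多德", "的", "老师"], "苏格拉底"),
    (["苏格拉底", "是", "亚里士多德", "的"], "老师"),
]
vocab = sorted({w for ctx, t in samples for w in ctx + [t]})
idx = {w: i for i, w in enumerate(vocab)}
V, H, lr = len(vocab), 16, 0.1
rng = np.random.default_rng(0)
W_in = rng.normal(scale=0.1, size=(V, H))    # 隐层权重，训练完成后即为第一词向量矩阵
W_out = rng.normal(scale=0.1, size=(H, V))   # 输出层权重

for epoch in range(200):
    for ctx, target in samples:
        c = [idx[w] for w in ctx]
        h = W_in[c].mean(axis=0)             # 隐层表示：上下文词向量取平均
        scores = h @ W_out
        p = np.exp(scores - scores.max())
        p /= p.sum()
        grad = p.copy()
        grad[idx[target]] -= 1.0             # softmax交叉熵对输出得分的梯度
        dh = W_out @ grad                    # 传回隐层的梯度
        W_out -= lr * np.outer(h, grad)
        W_in[c] -= lr * dh / len(c)

first_word_vector_matrix = W_in              # 第一词向量矩阵，形状为（词表大小，隐层维度）
print(first_word_vector_matrix.shape)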
S640,获取第二训练数据集。
第二训练数据集中的数据类型与神经网络模型的任务类型有关。
可选地,神经网络模型可以为NLP模型。相应地,第二训练数据集中的数据可以为文本数据。
可选地,神经网络模型可以为语音处理模型。相应地,第二训练数据集中的数据可以为语音数据。
示例性地,语音处理模型可以为端到端的语音处理模型,例如,该端到端的语音处理模型可以为聆听参与拼写(listen,attend,spell,LAS)模型。
示例性地,步骤S640的执行设备可以为如图4所示的训练设备120。第二训练数据集可以是如图4所示的数据库130中维护的训练数据。
S650,基于第二训练数据集对神经网络模型进行训练,得到目标神经网络模型。其中,神经网络模型包括专家网络层,专家网络层中的至少一个专家网络的初始权重分别是根据至少一个词向量矩阵确定的。
示例性地,步骤S650可以由装置500中的训练模块540执行。
具体地，专家网络层包括第一业务领域的第一专家网络，第一专家网络的初始权重是根据第一词向量矩阵确定的。
可选地,专家网络层还包括第二业务领域的第二专家网络,第二专家网络的初始权重是根据第二词向量矩阵确定的。
该至少一个专家网络的初始权重分别是根据至少一个词向量矩阵确定的,也可以理解为,基于该至少一个词向量矩阵对至少一个专家网络的权重进行初始化。
专家网络层包括多个专家网络,专家网络层用于通过多个专家网络中的目标专家网络对输入专家网络层的数据进行处理。该目标专家网络是根据输入专家网络层的数据确定的。
也就是说,在目标神经网络模型的训练或推理过程中,目标专家网络是根据输入专家网络层的数据选择的。
可选地,目标专家网络可以包括第一专家网络。例如,专家网络层可以通过选择的第一专家网络对输入专家网络层的数据进行处理,第一专家网络是根据输入专家网络层的数据选择的。
可选地,目标专家网络可以包括第二专家网络。例如,专家网络层可以通过选择的第二专家网络对输入专家网络层的数据进行处理,第二专家网络是根据输入专家网络层的数据选择的。
基于第二训练数据集对神经网络模型进行训练,得到的训练好的神经网络模型即为目标神经网络模型。
示例性地,该神经网络模型可以是现有的神经网络模型。
例如,该神经网络模型可以是switch transformer模型。
可替换地,该神经网络模型也可以是自行构建的,本申请实施例对此不做限定,只要该神经网络模型中包括专家网络层即可。
专家网络层的数量可以为一个,也可以为多个,本申请实施例对此不做限定。
在神经网络模型包括多个专家网络层的情况下,该多个专家网络层中的部分或全部专家网络层均可以采用步骤S650中的方式确定初始权重。为了便于描述,本申请实施例中仅以一个专家网络层作为示例,不对本申请实施例的方案构成限定。
一个专家网络层包括多个专家网络,该多个专家网络的参数不同。
需要说明的是,该多个专家网络可以部署于一个设备上,也可以部署于多个设备上。若该多个专家网络部署于多个设备上,方法600也可以理解为由该多个设备共同执行。
示例性地,专家网络层可以包括门网络。门网络可以根据输入专家网络层的数据选择一个或多个专家网络参与当前输入的数据的实际计算中。或者说,门网络可以将输入专家网络层的数据路由至一个或多个专家网络中进行处理。该一个或多个被选择的专家网络即为目标专家网络。目标专家网络的具体确定方式可以采用现有方案,例如,采用MoE中的路由方式,或者,也可以采用switch transformer中的switch层中的路由方式,本申请实施例对此不作限定。若目标专家网络包括多个专家网络,该多个专家网络分别对输入的数据进行处理。该多个专家网络的输出可以通过门网络产生的权值组合起来,作为专家网络层的输出。权值的计算方式可以采用现有方案,例如,采用MoE中的计算方式,或者,也可以采用switch transformer中的switch层中的权值计算方式,本申请实施例对此不作限定。
也就是说,对于不同的输入数据,专家网络层中的目标专家网络可能是不同的。
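为便于理解上述处理过程，下面给出一段示意性的Python代码（专家数量、维度以及top-k路由方式均为假设，实际可以采用MoE或switch transformer中的路由与权值计算方式），演示门网络根据输入数据选择目标专家网络并将其输出加权组合作为专家网络层的输出：

import numpy as np

rng = np.random.default_rng(0)
D, E = 16, 4                                       # 输入维度、专家网络数量
experts = [rng.normal(scale=0.1, size=(D, D)) for _ in range(E)]   # 每个专家以一个全连接层权重示意
W_gate = rng.normal(scale=0.1, size=(D, E))        # 门网络权重

def expert_layer(x, k=1):
    logits = x @ W_gate
    p = np.exp(logits - logits.max())
    p /= p.sum()
    top = np.argsort(p)[-k:]                       # 被选中的目标专家网络的序号
    weights = p[top] / p[top].sum()                # 门网络产生的组合权值
    # 仅由目标专家网络参与实际计算，其输出按权值组合作为专家网络层的输出
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

x = rng.normal(size=D)                             # 输入专家网络层的数据
print(expert_layer(x, k=1).shape)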
专家网络层中的至少一个专家网络的初始权重是根据至少一个词向量矩阵确定的。或者说,该至少一个专家网络的初始权重是根据至少一个目标词向量生成模型中的隐层的权重确定的。也就是说,该至少一个专家网络的结构与该至少一个目标词向量生成模型的隐层的结构是相同的,隐层可以为全连接层。即根据至少一个目标词向量生成模型的全连接层的权重分布情况对该至少一个专家网络的权重进行初始化。
对于专家网络层中的该至少一个专家网络以外的其他专家网络,可以采用现有的方式进行初始化,例如,通过高斯分布产生的随机值进行随机初始化。
为了便于描述,下面以第一专家网络为例对权重初始化的方式进行说明。
第一专家网络可以包括一个或多个专家网络。
换言之,第一词向量矩阵可以用于初始化一个专家网络的权重或者多个专家网络的权重。
可选地,第一专家网络的初始权重是根据第一词向量矩阵确定的,包括:第一专家网络的初始权重为第一词向量矩阵。
可选地,第一专家网络的初始权重是根据第一词向量矩阵确定的,包括:第一专家网络的初始权重是通过调整第一词向量矩阵得到的。
具体地,可以调整第一词向量矩阵中的一个或多个值,并将调整后的第一词向量矩阵作为第一专家网络的初始权重。
可选地,第一专家网络的初始权重是根据第一词向量矩阵确定的,包括:第一专家网络中的一部分专家网络的初始权重为第一词向量矩阵,另一部分专家网络的初始权重为通过调整第一词向量矩阵得到的。
在该情况下,第一专家网络包括两个及两个以上的专家网络。
其他专家网络的权重初始化方式可以参考上述第一专家网络的权重初始化方式。例如,将上述初始化过程中的第一专家网络替换为第二专家网络,将第一词向量矩阵替换为第二词向量矩阵即可得到第二专家网络的初始权重。
应理解,本申请实施例中仅以第一专家网络和第二专家网络为例对步骤S650进行说明,在实际应用中,专家网络层中的其他专家网络也可以采用本申请实施例的方案进行权重初始化,本申请实施例对此不做限定。
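下面给出一段示意性的Python代码（矩阵维度与调整幅度均为假设），演示步骤S650中根据词向量矩阵确定专家网络的初始权重的几种方式：直接使用第一词向量矩阵、对第一词向量矩阵做少量调整，以及对其余专家网络采用高斯分布产生的随机值进行随机初始化：

import numpy as np

rng = np.random.default_rng(0)
V, H = 1000, 64                                    # 假设的词表大小与隐层维度
first_word_vector_matrix = rng.normal(scale=0.1, size=(V, H))   # 此处以随机矩阵代替训练得到的第一词向量矩阵

# 方式一：第一专家网络的初始权重即为第一词向量矩阵
expert_a_init = first_word_vector_matrix.copy()

# 方式二：通过调整第一词向量矩阵中的一个或多个值得到另一个专家网络的初始权重
expert_b_init = first_word_vector_matrix + rng.normal(scale=0.01, size=(V, H))

# 其余专家网络可以通过高斯分布产生的随机值进行随机初始化
expert_c_init = rng.normal(scale=0.1, size=(V, H))

print(expert_a_init.shape, expert_b_init.shape, expert_c_init.shape)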
根据本申请实施例的方案,词向量矩阵是根据训练数据集训练得到的,词向量矩阵中包含大量的语义信息,利用词向量矩阵初始化模型中的部分或全部专家网络的权重,能够将语义信息引入专家网络中,为专家网络提供先验知识,减少训练时间,尤其是在神经网络模型的规模较大时,本申请实施例的方案能够大幅减少训练时间。同时,将语义信息引入专家网络中,能够有效提高专家网络的语义表示能力,进而提高模型的训练性能。
此外，不同的词向量矩阵是基于不同的业务领域的训练数据集训练得到的，具备不同的语义信息，在专家网络层中的不同专家网络是通过不同的词向量矩阵初始化的情况下，不同的专家网络具备不同的语义表示能力，不同的专家网络之间的语义组合能够进一步提升自然语言语义的理解能力，进一步提高模型的性能。例如，专家网络层的多个专家网络分别是通过多个词向量矩阵进行初始化的，该多个词向量矩阵分别是基于多个业务领域的训练数据集训练得到的，这样，该专家网络层具备多个业务领域的语义表示能力，提高了模型的自然语言语义的理解能力，在目标神经网络模型应用的过程中，可以将各个业务领域的数据分别路由至对应的专家网络进行处理，进一步提高模型的性能。
此外,一个业务领域的知识图谱能够指示该业务领域中的各个实体之间的关系,该业务领域的训练数据集可以是通过该业务领域的知识图谱构建的,这样有利于词向量矩阵学习该业务领域的知识,提高语义表示能力。
图7示出了本申请实施例提供的数据处理的方法700的示意性流程图,该方法可以由能够进行数据处理的装置或设备执行,例如,该装置可以是云服务设备,也可以是终端设备,例如,电脑、服务器等运算能力足以用来执行数据处理的方法的装置,也可以是由云服务设备和终端设备构成的系统。示例性地,方法700可以由图4中的执行设备110或图3中的执行设备310或本地设备执行。
例如,方法700具体可以由如图4所示的执行设备110执行,方法700中的待处理数据可以是如图4所示的客户设备140给出的输入数据。
图7中的数据处理的方法700中使用的模型可以是通过上述图6中的方法构建的。方法700中的具体实现方式可以参照前述方法600,为了避免不必要的重复,下面在介绍方法700时适当省略重复的描述。
方法700包括步骤S710至步骤S720,下面对步骤S710至步骤S720进行描述。
S710,获取待处理的数据。
待处理的数据的类型与神经网络模型的任务类型有关。
可选地,神经网络模型可以为NLP模型。相应地,待处理的数据可以为文本数据。
可选地,神经网络模型可以为语音处理模型。相应地,待处理的数据可以为语音数据。
S720,利用目标神经网络模型对待处理的数据进行处理,目标神经网络模型是基于第二训练数据集对神经网络模型进行训练得到的,神经网络模型包括专家网络层,专家网络层包括第一业务领域的第一专家网络,第一专家网络的初始权重是根据第一词向量矩阵确定的,第一词向量矩阵是基于第一业务领域的第一训练数据集训练得到的。
可选地，专家网络层还包括第二业务领域的第二专家网络，第二专家网络的初始权重是根据第二词向量矩阵确定的，第二词向量矩阵是基于第二业务领域的第三训练数据集训练得到的。
可选地,专家网络层用于通过选择的第一专家网络对输入专家网络层的数据进行处理,第一专家网络是根据输入专家网络层的数据选择的。
可选地,第一训练数据集是根据第一业务领域的第一知识图谱确定的。
可选地,第一训练数据集是根据第一知识图谱确定的,包括:第一训练数据集中的至少一个第一文本序列是根据第一知识图谱中的至少一个第一三元组生成的,第一三元组中的三个词语分别用于表示第一业务领域中的主体、第一业务领域中的客体以及主体与客体之间的关系。
可选地,第一词向量矩阵是基于第一业务领域的第一训练数据集训练得到的,包括:第一词向量矩阵为第一目标词向量生成模型中的隐层的权重,第一目标词向量生成模型是以至少一个第一文本序列中的目标词语之外的词语作为词向量生成模型的输入,以目标词语为词向量生成模型的目标输出对词向量生成模型进行训练得到的,目标词语为至少一个第一三元组中的词语。
可选地,第一专家网络的初始权重是根据第一词向量矩阵确定的,包括:第一专家网络的初始权重为第一词向量矩阵。
根据本申请实施例的方案,词向量矩阵是根据训练数据集训练得到的,词向量矩阵中包含大量的语义信息,利用词向量矩阵初始化模型中的部分或全部专家网络的权重,能够将语义信息引入专家网络中,为专家网络提供先验知识,减少训练时间,尤其是在神经网络模型的规模较大时,本申请实施例的方案能够大幅减少训练时间。同时,将语义信息引入专家网络中,能够有效提高专家网络的语义表示能力,进而提高目标神经网络模型的性能。
下面结合图8至图11对本申请实施例的装置进行说明。应理解,下面描述的装置能够执行前述本申请实施例的方法,为了避免不必要的重复,下面在介绍本申请实施例的装置时适当省略重复的描述。
图8是本申请实施例的一种神经网络的训练装置的示意性框图。图8所示的装置3000包括获取单元3010和处理单元3020。
获取单元3010和处理单元3020可以用于执行本申请实施例的神经网络模型的训练方法600。
获取单元3010用于获取第一词向量矩阵，所述第一词向量矩阵是基于第一业务领域的第一训练数据集训练得到的。
所述获取单元还用于获取第二训练数据集。
处理单元3020用于基于第二训练数据集对神经网络模型进行训练,得到目标神经网络模型,神经网络模型包括专家网络层,专家网络层包括第一业务领域的第一专家网络,第一专家网络的初始权重是根据第一词向量矩阵确定的。
可选地,作为一个实施例,获取单元3010还用于:获取第二词向量矩阵,第二词向量矩阵是基于第二业务领域的第三训练数据集训练得到的,专家网络层还包括第二业务领域的第二专家网络,第二专家网络的初始权重是根据第二词向量矩阵确定的。
可选地,作为一个实施例,专家网络层用于通过选择的第一专家网络对输入专家网络层的数据进行处理,第一专家网络是根据输入专家网络层的数据选择的。
可选地,作为一个实施例,第一训练数据集是根据第一业务领域的第一知识图谱确定的。
可选地,作为一个实施例,第一训练数据集是根据第一业务领域的第一知识图谱确定的,包括:第一训练数据集中的至少一个第一文本序列是根据第一知识图谱中的至少一个第一三元组生成的,第一三元组包括第一业务领域中的主体、第一业务领域中的客体以及主体与客体之间的关系。
可选地,作为一个实施例,第一词向量矩阵是基于第一业务领域的第一训练数据集训练得到的,包括:第一词向量矩阵为第一目标词向量生成模型中的隐层的权重,第一目标词向量生成模型是以至少一个第一文本序列中的目标词语之外的词语作为词向量生成模型的输入,以目标词语为词向量生成模型的目标输出对词向量生成模型进行训练得到的,目标词语为至少一个第一三元组中的词语。
可选地,作为一个实施例,第一专家网络的初始权重是根据第一词向量矩阵确定的,包括:第一专家网络的初始权重为第一词向量矩阵。
可选地,作为一个实施例,神经网络模型为NLP模型或者语音处理模型。
图9是本申请实施例的一种数据处理装置的示意性框图。图9所示的装置4000包括获取单元4010和处理单元4020。
获取单元4010和处理单元4020可以用于执行本申请实施例的数据处理的方法700。
获取单元4010,用于获取待处理的数据。
处理单元4020,用于利用目标神经网络模型对待处理的数据进行处理,目标神经网络模型是基于第二训练数据集对神经网络模型进行训练得到的,神经网络模型包括专家网络层,专家网络层包括第一业务领域的第一专家网络,第一专家网络的初始权重是根据第一词向量矩阵确定的,第一词向量矩阵是基于第一业务领域的第一训练数据集训练得到的。
可选地,作为一个实施例,专家网络层还包括第二业务领域的第二专家网络,第二专家网络的初始权重是根据第二词向量矩阵确定的,第二词向量矩阵是基于第二业务领域的第三训练数据集训练得到的。
可选地,作为一个实施例,专家网络层用于通过选择的第一专家网络对输入专家网络层的数据进行处理,第一专家网络是根据输入专家网络层的数据选择的。
可选地,作为一个实施例,第一训练数据集是根据第一业务领域的第一知识图谱确定的。
可选地,作为一个实施例,第一训练数据集是根据第一知识图谱确定的,包括:第一训练数据集中的至少一个第一文本序列是根据第一知识图谱中的至少一个第一三元组生成的,第一三元组中的三个词语分别用于表示第一业务领域中的主体、第一业务领域中的客体以及主体与客体之间的关系。
可选地,作为一个实施例,第一词向量矩阵是基于第一业务领域的第一训练数据集训练得到的,包括:第一词向量矩阵为第一目标词向量生成模型中的隐层的权重,第一目标词向量生成模型是以至少一个第一文本序列中的目标词语之外的词语作为词向量生成模型的输入,以目标词语为词向量生成模型的目标输出对词向量生成模型进行训练得到的,目标词语为至少一个第一三元组中的词语。
可选地,作为一个实施例,第一专家网络的初始权重是根据第一词向量矩阵确定的,包括:第一专家网络的初始权重为第一词向量矩阵。
可选地,作为一个实施例,神经网络模型为自然语言处理NLP模型或语音处理模型。
需要说明的是,上述装置3000和装置4000以功能单元的形式体现。这里的术语“单元”可以通过软件和/或硬件形式实现,对此不作具体限定。
例如，“单元”可以是实现上述功能的软件程序、硬件电路或二者结合。所述硬件电路可能包括应用专用集成电路（application specific integrated circuit，ASIC）、电子电路、用于执行一个或多个软件或固件程序的处理器（例如共享处理器、专有处理器或组处理器等）和存储器、合并逻辑电路和/或其它支持所描述的功能的合适组件。
因此,在本申请的实施例中描述的各示例的单元,能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行,取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能,但是这种实现不应认为超出本申请的范围。
图10是本申请实施例提供的神经网络模型的训练装置的硬件结构示意图。图10所示的神经网络模型的训练装置5000(该装置5000具体可以是一种计算机设备)包括存储器5001、处理器5002、通信接口5003以及总线5004。其中,存储器5001、处理器5002、通信接口5003通过总线5004实现彼此之间的通信连接。
存储器5001可以是只读存储器(read only memory,ROM),静态存储设备,动态存储设备或者随机存取存储器(random access memory,RAM)。存储器5001可以存储程序,当存储器5001中存储的程序被处理器5002执行时,处理器5002用于执行本申请实施例的神经网络模型的训练方法的各个步骤。
处理器5002可以采用通用的中央处理器(central processing unit,CPU),微处理器,应用专用集成电路(application specific integrated circuit,ASIC),图形处理器(graphics processing unit,GPU)或者一个或多个集成电路,用于执行相关程序,以实现本申请方法实施例的神经网络模型的训练方法。
处理器5002还可以是一种集成电路芯片,具有信号的处理能力。在实现过程中,本申请的神经网络模型的训练方法的各个步骤可以通过处理器5002中的硬件的集成逻辑电路或者软件形式的指令完成。
上述处理器5002还可以是通用处理器、数字信号处理器(digital signal processing,DSP)、专用集成电路(ASIC)、现成可编程门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器5001,处理器5002读取存储器5001中的信息,结合其硬件完成图8所示的装置中包括的单元所需执行的功能,或者,执行本申请方法实施例的神经网络模型的训练方法。
通信接口5003使用例如但不限于收发器一类的收发装置,来实现装置5000与其他设备或通信网络之间的通信。例如,可以通过通信接口5003获取第二训练数据集。
总线5004可包括在装置5000各个部件(例如,存储器5001、处理器5002、通信接口5003)之间传送信息的通路。
图11是本申请实施例提供的数据处理的装置的硬件结构示意图。图11所示的数据处理的装置6000(该装置6000具体可以是一种计算机设备)包括存储器6001、处理器6002、通信接口6003以及总线6004。其中,存储器6001、处理器6002、通信接口6003通过总线6004实现彼此之间的通信连接。
存储器6001可以是ROM,静态存储设备,动态存储设备或者RAM。存储器6001可以存储程序,当存储器6001中存储的程序被处理器6002执行时,处理器6002用于执行本申请实施例的数据处理的方法的各个步骤。
处理器6002可以采用通用的CPU,微处理器,ASIC,GPU或者一个或多个集成电路,用于执行相关程序,以实现本申请方法实施例的数据处理的方法。
处理器6002还可以是一种集成电路芯片，具有信号的处理能力。在实现过程中，本申请的数据处理的方法的各个步骤可以通过处理器6002中的硬件的集成逻辑电路或者软件形式的指令完成。
上述处理器6002还可以是通用处理器、DSP、ASIC、FPGA或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件。可以实现或者执行本申请实施例中的公开的各方法、步骤及逻辑框图。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。结合本申请实施例所公开的方法的步骤可以直接体现为硬件译码处理器执行完成,或者用译码处理器中的硬件及软件模块组合执行完成。软件模块可以位于随机存储器,闪存、只读存储器,可编程只读存储器或者电可擦写可编程存储器、寄存器等本领域成熟的存储介质中。该存储介质位于存储器6001,处理器6002读取存储器6001中的信息,结合其硬件完成图9所示的装置中包括的单元所需执行的功能,或者,执行本申请方法实施例的数据处理的方法。
通信接口6003使用例如但不限于收发器一类的收发装置,来实现装置6000与其他设备或通信网络之间的通信。例如,可以通过通信接口6003获取第二训练数据集。
总线6004可包括在装置6000各个部件(例如,存储器6001、处理器6002、通信接口6003)之间传送信息的通路。
应注意,尽管上述装置5000和装置6000仅仅示出了存储器、处理器、通信接口,但是在具体实现过程中,本领域的技术人员应当理解,装置5000和装置6000还可以包括实现正常运行所必须的其他器件。同时,根据具体需要,本领域的技术人员应当理解,装置5000和装置6000还可包括实现其他附加功能的硬件器件。此外,本领域的技术人员应当理解,装置5000和装置6000也可仅仅包括实现本申请实施例所必须的器件,而不必包括图10和图11中所示的全部器件。
本申请实施例还提供一种计算机可读介质,该计算机可读介质存储用于设备执行的程序代码,该程序代码包括用于执行本申请实施例中的神经网络模型的训练方法或数据处理的方法。
本申请实施例还提供一种包含指令的计算机程序产品，当该计算机程序产品在计算机上运行时，使得计算机执行本申请实施例中的神经网络模型的训练方法或数据处理的方法。
本申请实施例还提供一种芯片,该芯片包括处理器与数据接口,该处理器通过该数据接口读取存储器上存储的指令,执行本申请实施例中的神经网络模型的训练方法或数据处理的方法。
可选地,作为一种实现方式,该芯片还可以包括存储器,该存储器中存储有指令,该处理器用于执行该存储器上存储的指令,当该指令被执行时,该处理器用于执行本申请实施例中的神经网络模型的训练方法或数据处理的方法。
应理解,本申请实施例中的处理器可以为中央处理单元(central processing unit,CPU),该处理器还可以是其他通用处理器、数字信号处理器(digital signal processor,DSP)、专用集成电路(application specific integrated circuit,ASIC)、现成可编程门阵列(field programmable gate array,FPGA)或者其他可编程逻辑器件、分立门或者晶体管逻辑器件、分立硬件组件等。通用处理器可以是微处理器或者该处理器也可以是任何常规的处理器等。
还应理解,本申请实施例中的存储器可以是易失性存储器或非易失性存储器,或可包 括易失性和非易失性存储器两者。其中,非易失性存储器可以是只读存储器(read-only memory,ROM)、可编程只读存储器(programmable ROM,PROM)、可擦除可编程只读存储器(erasable PROM,EPROM)、电可擦除可编程只读存储器(electrically EPROM,EEPROM)或闪存。易失性存储器可以是随机存取存储器(random access memory,RAM),其用作外部高速缓存。通过示例性但不是限制性说明,许多形式的随机存取存储器(random access memory,RAM)可用,例如静态随机存取存储器(static RAM,SRAM)、动态随机存取存储器(DRAM)、同步动态随机存取存储器(synchronous DRAM,SDRAM)、双倍数据速率同步动态随机存取存储器(double data rate SDRAM,DDR SDRAM)、增强型同步动态随机存取存储器(enhanced SDRAM,ESDRAM)、同步连接动态随机存取存储器(synchlink DRAM,SLDRAM)和直接内存总线随机存取存储器(direct rambus RAM,DR RAM)。
上述实施例,可以全部或部分地通过软件、硬件、固件或其他任意组合来实现。当使用软件实现时,上述实施例可以全部或部分地以计算机程序产品的形式实现。所述计算机程序产品包括一个或多个计算机指令或计算机程序。在计算机上加载或执行所述计算机指令或计算机程序时,全部或部分地产生按照本申请实施例所述的流程或功能。所述计算机可以为通用计算机、专用计算机、计算机网络、或者其他可编程装置。所述计算机指令可以存储在计算机可读存储介质中,或者从一个计算机可读存储介质向另一个计算机可读存储介质传输,例如,所述计算机指令可以从一个网站站点、计算机、服务器或数据中心通过有线(例如红外、无线、微波等)方式向另一个网站站点、计算机、服务器或数据中心进行传输。所述计算机可读存储介质可以是计算机能够存取的任何可用介质或者是包含一个或多个可用介质集合的服务器、数据中心等数据存储设备。所述可用介质可以是磁性介质(例如,软盘、硬盘、磁带)、光介质(例如,DVD)、或者半导体介质。半导体介质可以是固态硬盘。
应理解,本文中术语“和/或”,仅仅是一种描述关联对象的关联关系,表示可以存在三种关系,例如,A和/或B,可以表示:单独存在A,同时存在A和B,单独存在B这三种情况,其中A,B可以是单数或者复数。另外,本文中字符“/”,一般表示前后关联对象是一种“或”的关系,但也可能表示的是一种“和/或”的关系,具体可参考前后文进行理解。
本申请中,“至少一个”是指一个或者多个,“多个”是指两个或两个以上。“以下至少一项(个)”或其类似表达,是指的这些项中的任意组合,包括单项(个)或复数项(个)的任意组合。例如,a,b,或c中的至少一项(个),可以表示:a,b,c,a-b,a-c,b-c,或a-b-c,其中a,b,c可以是单个,也可以是多个。
应理解,在本申请的各种实施例中,上述各过程的序号的大小并不意味着执行顺序的先后,各过程的执行顺序应以其功能和内在逻辑确定,而不应对本申请实施例的实施过程构成任何限定。
本领域普通技术人员可以意识到，结合本文中所公开的实施例描述的各示例的单元及算法步骤，能够以电子硬件、或者计算机软件和电子硬件的结合来实现。这些功能究竟以硬件还是软件方式来执行，取决于技术方案的特定应用和设计约束条件。专业技术人员可以对每个特定的应用来使用不同方法来实现所描述的功能，但是这种实现不应认为超出本申请的范围。
所属领域的技术人员可以清楚地了解到,为描述的方便和简洁,上述描述的系统、装置和单元的具体工作过程,可以参考前述方法实施例中的对应过程,在此不再赘述。
在本申请所提供的几个实施例中,应该理解到,所揭露的系统、装置和方法,可以通过其它的方式实现。例如,以上所描述的装置实施例仅仅是示意性的,例如,所述单元的划分,仅仅为一种逻辑功能划分,实际实现时可以有另外的划分方式,例如多个单元或组件可以结合或者可以集成到另一个系统,或一些特征可以忽略,或不执行。另一点,所显示或讨论的相互之间的耦合或直接耦合或通信连接可以是通过一些接口,装置或单元的间接耦合或通信连接,可以是电性,机械或其它的形式。
所述作为分离部件说明的单元可以是或者也可以不是物理上分开的,作为单元显示的部件可以是或者也可以不是物理单元,即可以位于一个地方,或者也可以分布到多个网络单元上。可以根据实际的需要选择其中的部分或者全部单元来实现本实施例方案的目的。
另外,在本申请各个实施例中的各功能单元可以集成在一个处理单元中,也可以是各个单元单独物理存在,也可以两个或两个以上单元集成在一个单元中。
所述功能如果以软件功能单元的形式实现并作为独立的产品销售或使用时,可以存储在一个计算机可读取存储介质中。基于这样的理解,本申请的技术方案本质上或者说对现有技术做出贡献的部分或者该技术方案的部分可以以软件产品的形式体现出来,该计算机软件产品存储在一个存储介质中,包括若干指令用以使得一台计算机设备(可以是个人计算机,服务器,或者网络设备等)执行本申请各个实施例所述方法的全部或部分步骤。而前述的存储介质包括:U盘、移动硬盘、只读存储器(read-only memory,ROM)、随机存取存储器(random access memory,RAM)、磁碟或者光盘等各种可以存储程序代码的介质。
以上所述,仅为本申请的具体实施方式,但本申请的保护范围并不局限于此,任何熟悉本技术领域的技术人员在本申请揭露的技术范围内,可轻易想到变化或替换,都应涵盖在本申请的保护范围之内。因此,本申请的保护范围应以所述权利要求的保护范围为准。

Claims (36)

  1. 一种神经网络模型的训练方法,其特征在于,包括:
    获取第一词向量矩阵,所述第一词向量矩阵是基于第一业务领域的第一训练数据集训练得到的;
    获取第二训练数据集;
    基于所述第二训练数据集对神经网络模型进行训练,得到目标神经网络模型,所述神经网络模型包括专家网络层,所述专家网络层包括所述第一业务领域的第一专家网络,所述第一专家网络的初始权重是根据所述第一词向量矩阵确定的。
  2. 根据权利要求1所述的训练方法,其特征在于,所述方法还包括:
    获取第二词向量矩阵,所述第二词向量矩阵是基于第二业务领域的第三训练数据集训练得到的,所述专家网络层还包括所述第二业务领域的第二专家网络,所述第二专家网络的初始权重是根据所述第二词向量矩阵确定的。
  3. 根据权利要求1或2所述的训练方法,其特征在于,所述专家网络层用于通过选择的所述第一专家网络对输入所述专家网络层的数据进行处理,所述第一专家网络是根据所述输入所述专家网络层的数据选择的。
  4. 根据权利要求1至3中任一项所述的训练方法,其特征在于,所述第一训练数据集是根据所述第一业务领域的第一知识图谱确定的。
  5. 根据权利要求4所述的训练方法,其特征在于,所述第一训练数据集是根据所述第一业务领域的第一知识图谱确定的,包括:
    所述第一训练数据集中的至少一个第一文本序列是根据所述第一知识图谱中的至少一个第一三元组生成的,所述第一三元组中的三个词语分别用于表示所述第一业务领域中的主体、所述第一业务领域中的客体以及所述主体与客体之间的关系。
  6. 根据权利要求5所述的训练方法,其特征在于,所述第一词向量矩阵是基于第一业务领域的第一训练数据集训练得到的,包括:
    所述第一词向量矩阵为第一目标词向量生成模型中的隐层的权重,所述第一目标词向量生成模型是以所述至少一个第一文本序列中的目标词语之外的词语作为词向量生成模型的输入,以所述目标词语为所述词向量生成模型的目标输出对所述词向量生成模型进行训练得到的,所述目标词语为所述至少一个第一三元组中的词语。
  7. 根据权利要求1至6中任一项所述的训练方法,其特征在于,所述第一专家网络的初始权重是根据所述第一词向量矩阵确定的,包括:
    所述第一专家网络的初始权重为所述第一词向量矩阵。
  8. 根据权利要求1至7中任一项所述的训练方法,其特征在于,所述神经网络模型为自然语言处理NLP模型或者语音处理模型。
  9. 一种数据处理的方法,其特征在于,包括:
    获取待处理的数据;
    利用目标神经网络模型对所述待处理的数据进行处理，所述目标神经网络模型是基于第二训练数据集对神经网络模型进行训练得到的，所述神经网络模型包括专家网络层，所述专家网络层包括第一业务领域的第一专家网络，所述第一专家网络的初始权重是根据第一词向量矩阵确定的，所述第一词向量矩阵是基于所述第一业务领域的第一训练数据集训练得到的。
  10. 根据权利要求9所述的方法,其特征在于,所述专家网络层还包括第二业务领域的第二专家网络,所述第二专家网络的初始权重是根据第二词向量矩阵确定的,所述第二词向量矩阵是基于所述第二业务领域的第三训练数据集训练得到的。
  11. 根据权利要求9或10所述的方法,其特征在于,所述专家网络层用于通过选择的所述第一专家网络对输入所述专家网络层的数据进行处理,所述第一专家网络是根据所述输入所述专家网络层的数据选择的。
  12. 根据权利要求9至11中任一项所述的方法,其特征在于,所述第一训练数据集是根据所述第一业务领域的第一知识图谱确定的。
  13. 根据权利要求12所述的方法,其特征在于,所述第一训练数据集是根据所述第一知识图谱确定的,包括:
    所述第一训练数据集中的至少一个第一文本序列是根据所述第一知识图谱中的至少一个第一三元组生成的,所述第一三元组中的三个词语分别用于表示所述第一业务领域中的主体、所述第一业务领域中的客体以及所述主体与客体之间的关系。
  14. 根据权利要求13所述的方法,其特征在于,所述第一词向量矩阵是基于所述第一业务领域的第一训练数据集训练得到的,包括:
    所述第一词向量矩阵为第一目标词向量生成模型中的隐层的权重,所述第一目标词向量生成模型是以所述至少一个第一文本序列中的目标词语之外的词语作为词向量生成模型的输入,以所述目标词语为所述词向量生成模型的目标输出对所述词向量生成模型进行训练得到的,所述目标词语为所述至少一个第一三元组中的词语。
  15. 根据权利要求9至14中任一项所述的方法,其特征在于,所述第一专家网络的初始权重是根据第一词向量矩阵确定的,包括:
    所述第一专家网络的初始权重为所述第一词向量矩阵。
  16. 根据权利要求9至15中任一项所述的方法,其特征在于,所述神经网络模型为自然语言处理NLP模型或语音处理模型。
  17. 一种神经网络模型的训练装置,其特征在于,包括:
    获取单元,用于获取第一词向量矩阵,所述第一词向量矩阵是基于第一业务领域的第一训练数据集训练得到的;
    所述获取单元还用于获取第二训练数据集;
    处理单元,用于:
    基于所述第二训练数据集对神经网络模型进行训练,得到目标神经网络模型,所述神经网络模型包括专家网络层,所述专家网络层包括所述第一业务领域的第一专家网络,所述第一专家网络的初始权重是根据所述第一词向量矩阵确定的。
  18. 根据权利要求17所述的训练装置,其特征在于,所述获取单元还用于:
    获取第二词向量矩阵,所述第二词向量矩阵是基于第二业务领域的第三训练数据集训练得到的,所述专家网络层还包括所述第二业务领域的第二专家网络,所述第二专家网络的初始权重是根据所述第二词向量矩阵确定的。
  19. 根据权利要求17或18所述的训练装置,其特征在于,所述专家网络层用于通过选择的所述第一专家网络对输入所述专家网络层的数据进行处理,所述第一专家网络是根据所述输入所述专家网络层的数据选择的。
  20. 根据权利要求17至19中任一项所述的训练装置,其特征在于,所述第一训练数据集是根据所述第一业务领域的第一知识图谱确定的。
  21. 根据权利要求20所述的训练装置,其特征在于,所述第一训练数据集是根据所述第一业务领域的第一知识图谱确定的,包括:
    所述第一训练数据集中的至少一个第一文本序列是根据所述第一知识图谱中的至少一个第一三元组生成的,所述第一三元组包括所述第一业务领域中的主体、所述第一业务领域中的客体以及所述主体与客体之间的关系。
  22. 根据权利要求21所述的训练装置,其特征在于,所述第一词向量矩阵是基于第一业务领域的第一训练数据集训练得到的,包括:
    所述第一词向量矩阵为第一目标词向量生成模型中的隐层的权重,所述第一目标词向量生成模型是以所述至少一个第一文本序列中的目标词语之外的词语作为词向量生成模型的输入,以所述目标词语为所述词向量生成模型的目标输出对所述词向量生成模型进行训练得到的,所述目标词语为所述至少一个第一三元组中的词语。
  23. 根据权利要求17至22中任一项所述的训练装置,其特征在于,所述第一专家网络的初始权重是根据所述第一词向量矩阵确定的,包括:
    所述第一专家网络的初始权重为所述第一词向量矩阵。
  24. 根据权利要求17至23中任一项所述的训练装置,其特征在于,所述神经网络模型为自然语言处理NLP模型或者语音处理模型。
  25. 一种数据处理的装置,其特征在于,包括:
    获取单元,用于获取待处理的数据;
    处理单元,用于利用目标神经网络模型对所述待处理的数据进行处理,所述目标神经网络模型是基于第二训练数据集对神经网络模型进行训练得到的,所述神经网络模型包括专家网络层,所述专家网络层包括第一业务领域的第一专家网络,所述第一专家网络的初始权重是根据第一词向量矩阵确定的,所述第一词向量矩阵是基于所述第一业务领域的第一训练数据集训练得到的。
  26. 根据权利要求25所述的装置,其特征在于,所述专家网络层还包括第二业务领域的第二专家网络,所述第二专家网络的初始权重是根据第二词向量矩阵确定的,所述第二词向量矩阵是基于所述第二业务领域的第三训练数据集训练得到的。
  27. 根据权利要求25或26所述的装置,其特征在于,所述专家网络层用于通过选择的所述第一专家网络对输入所述专家网络层的数据进行处理,所述第一专家网络是根据所述输入所述专家网络层的数据选择的。
  28. 根据权利要求25至27中任一项所述的装置,其特征在于,所述第一训练数据集是根据所述第一业务领域的第一知识图谱确定的。
  29. 根据权利要求28所述的装置,其特征在于,所述第一训练数据集是根据所述第一知识图谱确定的,包括:
    所述第一训练数据集中的至少一个第一文本序列是根据所述第一知识图谱中的至少一个第一三元组生成的，所述第一三元组中的三个词语分别用于表示所述第一业务领域中的主体、所述第一业务领域中的客体以及所述主体与客体之间的关系。
  30. 根据权利要求29所述的装置,其特征在于,所述第一词向量矩阵是基于所述第一业务领域的第一训练数据集训练得到的,包括:
    所述第一词向量矩阵为第一目标词向量生成模型中的隐层的权重,所述第一目标词向量生成模型是以所述至少一个第一文本序列中的目标词语之外的词语作为词向量生成模型的输入,以所述目标词语为所述词向量生成模型的目标输出对所述词向量生成模型进行训练得到的,所述目标词语为所述至少一个第一三元组中的词语。
  31. 根据权利要求25至30中任一项所述的装置，其特征在于，所述第一专家网络的初始权重是根据第一词向量矩阵确定的，包括：
    所述第一专家网络的初始权重为所述第一词向量矩阵。
  32. 根据权利要求25至31中任一项所述的装置,其特征在于,所述神经网络模型为自然语言处理NLP模型或语音处理模型。
  33. 一种神经网络模型的训练装置,其特征在于,包括处理器和存储器,所述存储器用于存储程序指令,所述处理器用于调用所述程序指令来执行如权利要求1至8中任一项所述的方法。
  34. 一种数据处理的装置，其特征在于，包括处理器和存储器，所述存储器用于存储程序指令，所述处理器用于调用所述程序指令来执行如权利要求9至16中任一项所述的方法。
  35. 一种计算机可读存储介质,其特征在于,所述计算机可读介质存储用于设备执行的程序代码,该程序代码包括用于执行如权利要求1至8或权利要求9至16中任一项所述的方法。
  36. 一种包含指令的计算机程序产品,其特征在于,当所述计算机程序产品在计算机上运行时,使得所述计算机执行如权利要求1至8或权利要求9至16中任一项所述的方法。
PCT/CN2022/098621 2021-07-08 2022-06-14 神经网络模型的训练方法、数据处理的方法及装置 WO2023279921A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP22836675.3A EP4318311A4 (en) 2021-07-08 2022-06-14 NEURAL NETWORK MODEL TRAINING METHOD, DATA PROCESSING METHOD AND APPARATUSES
US18/401,738 US20240232618A9 (en) 2021-07-08 2024-01-02 Training method and apparatus for neural network model, and data processing method and apparatus

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202110773754.0 2021-07-07
CN202110773754 2021-07-08
CN202111014266.8 2021-08-30
CN202111014266.8A CN115600635A (zh) 2021-07-08 2021-08-31 神经网络模型的训练方法、数据处理的方法及装置

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/401,738 Continuation US20240232618A9 (en) 2021-07-08 2024-01-02 Training method and apparatus for neural network model, and data processing method and apparatus

Publications (1)

Publication Number Publication Date
WO2023279921A1 true WO2023279921A1 (zh) 2023-01-12

Family

ID=84801212

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/098621 WO2023279921A1 (zh) 2021-07-08 2022-06-14 神经网络模型的训练方法、数据处理的方法及装置

Country Status (3)

Country Link
US (1) US20240232618A9 (zh)
EP (1) EP4318311A4 (zh)
WO (1) WO2023279921A1 (zh)


Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106294322A (zh) * 2016-08-04 2017-01-04 哈尔滨工业大学 一种基于lstm的汉语零指代消解方法
CN106599933A (zh) * 2016-12-26 2017-04-26 哈尔滨工业大学 一种基于联合深度学习模型的文本情感分类方法
CN110968688A (zh) * 2018-09-30 2020-04-07 北京国双科技有限公司 司法数据的处理方法及系统
US20200364576A1 (en) * 2019-05-14 2020-11-19 Adobe Inc. Utilizing deep recurrent neural networks with layer-wise attention for punctuation restoration

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4318311A4

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116050433A (zh) * 2023-02-13 2023-05-02 北京百度网讯科技有限公司 自然语言处理模型的场景适配方法、装置、设备及介质
CN116050433B (zh) * 2023-02-13 2024-03-26 北京百度网讯科技有限公司 自然语言处理模型的场景适配方法、装置、设备及介质

Also Published As

Publication number Publication date
EP4318311A4 (en) 2024-10-16
EP4318311A1 (en) 2024-02-07
US20240232618A9 (en) 2024-07-11
US20240135176A1 (en) 2024-04-25


Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22836675

Country of ref document: EP

Kind code of ref document: A1

WWE Wipo information: entry into national phase

Ref document number: 2022836675

Country of ref document: EP

ENP Entry into the national phase

Ref document number: 2022836675

Country of ref document: EP

Effective date: 20231102

NENP Non-entry into the national phase

Ref country code: DE