WO2020147369A1 - Natural language processing method, training method, and data processing device - Google Patents


Publication number
WO2020147369A1
Authority
WO
WIPO (PCT)
Prior art keywords
network
processing
granularity
feature
words
Prior art date
Application number
PCT/CN2019/114146
Other languages
French (fr)
Chinese (zh)
Inventor
李梓超
蒋欣
刘群
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2020147369A1 publication Critical patent/WO2020147369A1/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the field of natural language processing, in particular to a natural language processing method, training method and data processing equipment.
  • Artificial Intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision-making.
  • Natural language processing tasks can be divided according to different granularities, generally into character level, word level, phrase level, sentence level, discourse level, and so on; these granularities become progressively coarser.
  • part-of-speech tagging is a word-level task
  • named entity recognition is a phrase-level task
  • syntactic analysis is usually a sentence-level task.
  • Information at different granularities is not isolated; it is passed between granularities.
  • For tasks such as sentence classification and sentence-to-sentence semantic matching, word-level and phrase-level features are usually considered.
  • For tasks such as sentence translation or rewriting, it is usually necessary to use information at multiple granularities and finally synthesize it.
  • the current mainstream natural language processing methods based on deep learning process natural language text through neural networks.
  • However, in these methods words of different granularities are mixed together during the neural network's processing, so the probability of obtaining a correct processing result is low. Therefore, new solutions need to be studied.
  • the embodiments of the present application provide a natural language processing method, training method, and data processing device, which can avoid the process of obtaining coarser-grained information from finer-grained information, and can effectively improve the performance of processing natural language processing tasks.
  • the embodiments of the present application provide a natural language processing method, which includes: obtaining natural language text to be processed; and processing the natural language text using a deep neural network obtained by training, and outputting the target result obtained by processing the natural language text; wherein the deep neural network includes: a granularity annotation network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network, and the processing includes: using the granularity annotation network to determine the granularity of each word in the natural language text; using the first feature network to perform feature extraction on words of the first granularity in the natural language text, and outputting the obtained first feature information to the first processing network; using the second feature network to perform feature extraction on words of the second granularity in the natural language text, and outputting the obtained second feature information to the second processing network; using the first processing network to process the first feature information, and outputting the obtained first processing result to the fusion network; using the second processing network to process the second feature information, and outputting the obtained second processing result to the fusion network; and using the fusion network to fuse the first processing result and the second processing result to obtain the target result, where the first granularity and the second granularity are different.
  • the deep neural network may include N feature networks and N processing networks.
  • the N feature networks and the N processing networks have a one-to-one correspondence, and N is an integer greater than one.
  • a pair of corresponding feature network and processing network is used to process words of the same granularity. Since the data processing device processes words of different granularities separately, the processing operations for words of each granularity do not depend on the processing results of words of other granularities, which avoids the process of obtaining coarser-grained information from finer-grained information and greatly reduces the probability that the data processing device will get wrong results.
  • the data processing device uses a deep neural network to independently process words of different granularities, avoiding the process of obtaining coarser-grained information from finer-grained information, and can effectively improve the performance of processing natural language processing tasks.
  • the architectures of the first feature network and the second feature network are different, and/or the architectures of the first processing network and the second processing network are different.
  • Words with different granularities have different characteristics. Using networks with different architectures to process words with different granularities can more specifically process words with different granularities.
  • words of different granularities are processed through feature networks of different architectures or processing networks of different architectures, which further improves the performance of the data processing device in processing natural language processing tasks.
  • the input of the granular annotation network is the natural language text
  • the using the granular annotation network to determine the granularity of each word in the natural language text includes: using the granular annotation network Determine the granularity of each word in the natural language text according to N granularities to obtain the annotation information of the natural language text, and output the annotation information to the first feature network and the second feature network; wherein, The label information is used to describe the granularity of each word or the probability that each word belongs to the N granularities; N is an integer greater than 1;
  • the using the first feature network to perform feature extraction on the words of the first granularity in the natural language text includes: using the first feature network to process the words of the first granularity to obtain the first feature information,
  • the first feature information is a vector or matrix representing words of the first granularity;
  • the using the second feature network to perform feature extraction on the words of the second granularity in the natural language text includes: using the second feature network to process the words of the second granularity to obtain the second feature information,
  • the second feature information is a vector or matrix representing words of the second granularity.
  • the granular annotation network can accurately determine the granularity of each word in the natural language text, so that each feature network can process words with a specific granularity.
  • the granular labeling network includes a long short-term memory (LSTM) network and a bidirectional long short-term memory (BiLSTM) network; and the using the granular labeling network to determine the granularity of each word in the natural language text includes:
  • g_l = LSTM([h_l, z_{l-1}; g_{l-1}]);
  • BiLSTM() represents the processing operation of the BiLSTM network, and LSTM() represents the processing operation of the LSTM network;
  • x represents a word in the natural language text, and x_l represents the l-th word in the natural language text X;
  • h represents the hidden state variable in the BiLSTM network; h_l, h_{l-1}, and h_{l+1} respectively represent the hidden state variables when the BiLSTM network processes the l-th, (l-1)-th, and (l+1)-th words;
  • g represents the hidden state variable in the LSTM network; g_l and g_{l-1} respectively represent the hidden state variables when the LSTM network processes the l-th word and the (l-1)-th word in the natural language text;
  • z represents the probability that a word belongs to the reference granularity; z_{l-1} and z_l respectively represent the probabilities that the (l-1)-th word and the l-th word in the natural language text belong to the reference granularity;
  • the reference granularity is any one of the N granularities;
  • GS represents the Gumbel-Softmax function, and the temperature of the Gumbel-Softmax function is a hyperparameter;
  • W_g is a parameter matrix, that is, a parameter matrix in the granularity annotation network.
  • the granular annotation network uses the architecture of a multi-layer LSTM network to determine the granularity of each word in the natural language text, and can make full use of the granularities already determined for previous words when determining the granularity of a new word (a word whose granularity is yet to be determined); it is simple to implement and has high processing efficiency.
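  • As a rough illustration only (not the reference implementation of this application), the following PyTorch-style sketch shows one way the granularity labeling network described above could be organized: a BiLSTM over the input words, an LSTM whose per-step input combines the BiLSTM state with the previous word's granularity label, and a Gumbel-Softmax over W_g·g_l. All module names and dimensions (GranularityTagger, hidden_dim, and so on) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GranularityTagger(nn.Module):
    """Sketch of a granularity labeling network: a BiLSTM over the input words
    plus a forward LSTM whose input combines the BiLSTM state with the previous
    word's granularity label, followed by Gumbel-Softmax (all names assumed)."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=128, n_granularities=2, tau=1.0):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        # g_l = LSTM([h_l, z_{l-1}; g_{l-1}]) realized with a per-step LSTMCell
        self.cell = nn.LSTMCell(2 * hidden_dim + n_granularities, hidden_dim)
        self.W_g = nn.Linear(hidden_dim, n_granularities)   # parameter matrix W_g
        self.n_granularities = n_granularities
        self.tau = tau                                       # Gumbel-Softmax temperature

    def forward(self, x):
        # x: (batch, L) word indices of the natural language text
        h, _ = self.bilstm(self.embed(x))                    # (batch, L, 2*hidden_dim)
        batch, L, _ = h.shape
        g = h.new_zeros(batch, self.cell.hidden_size)
        c = torch.zeros_like(g)
        z_prev = h.new_zeros(batch, self.n_granularities)
        labels = []
        for l in range(L):
            g, c = self.cell(torch.cat([h[:, l], z_prev], dim=-1), (g, c))
            z = F.gumbel_softmax(self.W_g(g), tau=self.tau, hard=False)
            labels.append(z)
            z_prev = z
        return torch.stack(labels, dim=1)                    # (batch, L, n_granularities)
```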
  • the using the first feature network to perform feature extraction on words with a first granularity in the natural language text includes:
  • U_z = ENC_z(X, Z_X);
  • ENC_z represents the first feature network;
  • the first feature network is a Transformer model;
  • ENC_z() represents the processing operation performed by the first feature network;
  • X represents the natural language text;
  • Z_X = [z_1, z_2, ..., z_L] represents the annotation information;
  • z_1 to z_L sequentially represent the granularities of the first word to the L-th (last) word in the natural language text;
  • U_z represents the first feature information output by the first feature network.
  • the feature network can be used to accurately and quickly extract the feature information of words of the corresponding granularity.
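  • The following sketch, again illustrative only, shows one plausible way a feature network ENC_z could be realized as a Transformer encoder that uses the annotation information as a mask so that only words of granularity z contribute to U_z; the masking strategy and all names are assumptions rather than details taken from this application.

```python
import torch
import torch.nn as nn

class GranularityEncoder(nn.Module):
    """Sketch of one feature network ENC_z: a Transformer encoder over the
    input sentence, with the annotation Z_X supplied as a key-padding mask so
    that only words of this encoder's granularity z contribute to U_z."""
    def __init__(self, vocab_size, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x, z_mask):
        # x: (batch, L) word indices; z_mask: (batch, L) bool, True where the
        # word belongs to granularity z (assumes at least one True per row).
        u = self.encoder(self.embed(x), src_key_padding_mask=~z_mask)
        return u                                  # U_z: feature information for granularity z
```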
  • the first processing result is a sequence containing one or more words
  • the processing of the first feature information using the first processing network includes: using the first processing network to process the input first feature information and the words that the first processing network has already output in the process of processing the first feature information, to obtain the first processing result.
  • the first processing network adopts a recursive manner to process the feature information output by the corresponding feature network, which can make full use of the relevance of each word in the natural language text, thereby improving the efficiency and accuracy of processing.
  • the target result output by the fusion network is a sequence containing one or more words, and the using of the fusion network to fuse the first processing result and the second processing result to obtain the target result includes: using the fusion network to process the first processing result, the second processing result, and the words that the fusion network has already output in the process of processing the first processing result and the second processing result, so as to determine the target word to be output, and outputting the target word.
  • the fusion network uses a recursive method to process the processing results input to it by each processing network, which can make full use of the relevance of each word in the natural language text, thereby improving the efficiency and accuracy of its processing.
  • the fusion network includes at least one LSTM network, and the using of the fusion network to process the first processing result, the second processing result, and the sequence that the fusion network has already output in the process of processing the first processing result and the second processing result, so as to determine the target word to be output, includes:
  • using the LSTM network to calculate, with the following formula, the probability of a word of the reference granularity currently to be output:
  • h_t = LSTM(h_{t-1}, y_{t-1}, v0, v1);
  • h_t represents the hidden state variable in the LSTM network when the LSTM network processes the t-th word;
  • h_{t-1} represents the hidden state variable in the LSTM network when the LSTM network processes the (t-1)-th word;
  • LSTM() represents the processing operation performed by the LSTM;
  • the LSTM network has currently output (t-1) words;
  • y_{t-1} represents the (t-1)-th word output by the fusion network;
  • v0 represents the first processing result
  • v1 represents the second processing result
  • W_z is a parameter matrix in the fusion network, and the formula also uses a hyperparameter;
  • P(z_t = z | y_{1:t-1}, X) is the probability that the word currently to be output belongs to the reference granularity (granularity z);
  • t is an integer greater than 1;
  • P(y_t | z_t = z, y_{1:t-1}, X) represents the probability of outputting the target word y_t at the reference granularity;
  • P(y_t | y_{1:t-1}, X) represents the probability of outputting the target word;
  • P(y_t | z_t = z, y_{1:t-1}, X) can be given by the processing network;
  • the processing network of granularity z can input to the fusion network the probability of each word among the words (words of granularity z) currently to be output;
  • the fusion network can then calculate, for each candidate word currently to be output, the probability of that word being output, and output the word with the highest probability of being output (the target word).
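  • The sketch below illustrates, as an assumption-laden example only, one possible form of a single decoding step of the fusion network: an LSTM cell driven by the previously emitted word and the two processing results v0 and v1, a distribution over granularities derived from W_z and the hidden state, and a mixture over the word probabilities supplied by the processing networks, with the highest-probability word emitted as the target word. The softmax mixing used here is an assumption, not the application's exact formula.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionDecoderStep(nn.Module):
    """Sketch of one decoding step of the fusion network (names assumed)."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256, feat_dim=256, n_granularities=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # h_t = LSTM(h_{t-1}, y_{t-1}, v0, v1) realized as an LSTM cell whose
        # input concatenates the previous word embedding with v0 and v1.
        self.cell = nn.LSTMCell(emb_dim + 2 * feat_dim, hidden_dim)
        self.W_z = nn.Linear(hidden_dim, n_granularities)    # parameter matrix W_z

    def forward(self, y_prev, state, v0, v1, p_words_per_granularity):
        # y_prev: (batch,) index of the word emitted at step t-1
        # v0, v1: (batch, feat_dim) first and second processing results
        # p_words_per_granularity: (batch, n_granularities, vocab) probabilities
        #   of candidate words, as supplied by each processing network
        h, c = self.cell(torch.cat([self.embed(y_prev), v0, v1], dim=-1), state)
        p_z = F.softmax(self.W_z(h), dim=-1)                  # P(z_t = z | y_1:t-1, X)
        p_word = torch.einsum('bz,bzv->bv', p_z, p_words_per_granularity)
        y_t = p_word.argmax(dim=-1)                           # emit the highest-probability word
        return y_t, (h, c)
```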
  • the embodiments of the present application provide a training method, which includes: inputting a training sample into a deep neural network for processing to obtain a prediction processing result; wherein the deep neural network includes: a granularity annotation network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network.
  • the processing includes: using the granularity annotation network to determine the granularity of each word in the training sample; using the first feature network to perform feature extraction on words of the first granularity in the training sample, and outputting the obtained third feature information to the first processing network; using the second feature network to perform feature extraction on words of the second granularity in the training sample, and outputting the obtained fourth feature information to the second processing network; using the first processing network to perform target processing on the third feature information, and outputting the obtained third processing result to the fusion network; using the second processing network to perform the target processing on the fourth feature information, and outputting the obtained fourth processing result to the fusion network; and using the fusion network to fuse the third processing result and the fourth processing result to obtain the prediction processing result, where the first granularity and the second granularity are different; determining, according to the prediction processing result and a standard result, the loss corresponding to the training sample, where the standard result is the processing result expected to be obtained by using the deep neural network to process the training sample; and using the loss corresponding to the training sample to update the parameters of the deep neural network through an optimization algorithm.
  • the data processing device trains a deep neural network that can independently process words of different granularities, so as to obtain a deep neural network that can avoid the process of obtaining coarser-grained information from finer-grained information, and is simple to implement.
  • the architectures of the first feature network and the second feature network are different, and/or the architectures of the first processing network and the second processing network are different.
  • the input of the granular annotation network is the natural language text
  • the using the granular annotation network to determine the granularity of each word in the natural language text includes: using the granular annotation network Determine the granularity of each word in the natural language text according to N granularities to obtain the annotation information of the natural language text, and output the annotation information to the first feature network and the second feature network; wherein, The label information is used to describe the granularity of each word or the probability that each word belongs to the N granularities; N is an integer greater than 1;
  • the using the first feature network to perform feature extraction on the words of the first granularity in the natural language text includes: using the first feature network to process the words of the first granularity to obtain the third feature information,
  • the third feature information is a vector or matrix representing words of the first granularity;
  • the using the second feature network to perform feature extraction on the words of the second granularity in the natural language text includes: using the second feature network to process the words of the second granularity to obtain the fourth feature information,
  • the fourth feature information is a vector or matrix representing words of the second granularity.
  • the granular labeling network includes a long short-term memory (LSTM) network and a bidirectional long short-term memory (BiLSTM) network; and the using the granular labeling network to determine the granularity of each word in the natural language text includes:
  • g_l = LSTM([h_l, z_{l-1}; g_{l-1}]);
  • BiLSTM() represents the processing operation of the BiLSTM network, and LSTM() represents the processing operation of the LSTM network;
  • x represents a word in the natural language text, and x_l represents the l-th word in the natural language text X;
  • h represents the hidden state variable in the BiLSTM network; h_l, h_{l-1}, and h_{l+1} respectively represent the hidden state variables when the BiLSTM network processes the l-th, (l-1)-th, and (l+1)-th words;
  • g represents the hidden state variable in the LSTM network; g_l and g_{l-1} respectively represent the hidden state variables when the LSTM network processes the l-th word and the (l-1)-th word in the natural language text;
  • z represents the probability that a word belongs to the reference granularity; z_{l-1} and z_l respectively represent the probabilities that the (l-1)-th word and the l-th word in the natural language text belong to the reference granularity;
  • the reference granularity is any one of the N granularities;
  • GS represents the Gumbel-Softmax function, and the temperature of the Gumbel-Softmax function is a hyperparameter;
  • W_g is a parameter matrix, that is, a parameter matrix in the granularity annotation network.
  • the granular annotation network uses the architecture of a multi-layer LSTM network to determine the granularity of each word in the natural language text, and can make full use of the granularities already determined for previous words when determining the granularity of a new word (a word whose granularity is yet to be determined); it is simple to implement and has high processing efficiency.
  • the using the first feature network to perform feature extraction on words with a first granularity in the natural language text includes:
  • U_z = ENC_z(X, Z_X);
  • ENC_z represents the first feature network;
  • the first feature network is a Transformer model;
  • ENC_z() represents the processing operation performed by the first feature network;
  • X represents the natural language text;
  • Z_X = [z_1, z_2, ..., z_L] represents the annotation information;
  • z_1 to z_L sequentially represent the granularities of the first word to the L-th (last) word in the natural language text;
  • U_z represents the third feature information output by the first feature network.
  • the third processing result is a sequence containing one or more words
  • the processing of the third feature information using the first processing network includes: using the first processing network to process the input third feature information and the words that the first processing network has already output in the process of processing the third feature information, to obtain the third processing result.
  • the target result output by the fusion network is a sequence containing one or more words
  • the fusion network is used to fuse the third processing result and the fourth processing result
  • Obtaining the target result includes: using the fusion network to process the third processing result, the fourth processing result, and the words that the fusion network has already output in the process of processing the third processing result and the fourth processing result, so as to determine the target word to be output, and outputting the target word.
  • the fusion network includes at least one LSTM network, and the using of the fusion network to process the third processing result, the fourth processing result, and the sequence that the fusion network has already output in the process of processing the third processing result and the fourth processing result, so as to determine the target word to be output, includes:
  • using the LSTM network to calculate, with the following formula, the probability of a word of the reference granularity currently to be output:
  • h_t = LSTM(h_{t-1}, y_{t-1}, v2, v3);
  • h_t represents the hidden state variable in the LSTM network when the LSTM network processes the t-th word;
  • h_{t-1} represents the hidden state variable in the LSTM network when the LSTM network processes the (t-1)-th word;
  • LSTM() represents the processing operation performed by the LSTM;
  • the LSTM network has currently output (t-1) words;
  • y_{t-1} represents the (t-1)-th word output by the fusion network;
  • v2 represents the third processing result
  • v3 represents the fourth processing result
  • W_z is a parameter matrix in the fusion network, and the formula also uses a hyperparameter;
  • P(z_t = z | y_{1:t-1}, X) is the probability that the word currently to be output belongs to the reference granularity (granularity z);
  • t is an integer greater than 1;
  • P(y_t | z_t = z, y_{1:t-1}, X) represents the probability of outputting the target word y_t at the reference granularity;
  • P(y_t | y_{1:t-1}, X) represents the probability of outputting the target word.
  • the using the loss corresponding to the training sample to update the parameters of the deep neural network through an optimization algorithm includes: updating the parameters of at least one network included in the deep neural network by using the gradient value of the loss function relative to the at least one network;
  • the loss function is used to calculate the loss between the prediction processing result and the standard result; wherein, during the update of any one of the first feature network, the second feature network, the first processing network, and the second processing network, the parameters of the other three networks remain unchanged.
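  • As an illustration of the alternating update described above, the following sketch updates only one sub-network per step while the parameters of the other three networks are kept constant; the attribute names (feature_net1, processing_net1, and so on), the choice of the Adam optimizer, and the batch layout are assumptions, not details taken from this application.

```python
import torch

def update_one_subnetwork(model, batch, loss_fn, lr=1e-3):
    """Sketch: compute the loss between the prediction processing result and the
    standard result, then update only one sub-network (here the first feature
    network), keeping the parameters of the other three networks constant."""
    frozen = [model.feature_net2, model.processing_net1, model.processing_net2]
    for net in frozen:
        for p in net.parameters():
            p.requires_grad_(False)                 # keep these parameters unchanged
    optimizer = torch.optim.Adam(model.feature_net1.parameters(), lr=lr)

    prediction = model(batch["sample"])             # prediction processing result
    loss = loss_fn(prediction, batch["standard"])   # loss vs. the standard (expected) result
    optimizer.zero_grad()
    loss.backward()                                 # gradient of the loss w.r.t. the chosen network
    optimizer.step()

    for net in frozen:                              # restore for the next alternation
        for p in net.parameters():
            p.requires_grad_(True)
    return loss.item()
```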
  • the embodiments of the application provide a data processing device.
  • the data processing device includes: an acquisition unit, configured to obtain natural language text to be processed; and a processing unit, configured to process the natural language text using a deep neural network obtained by training, and output the target result obtained by processing the natural language text; wherein the deep neural network includes: a granularity annotation network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network, and the processing includes: using the granularity annotation network to determine the granularity of each word in the natural language text; using the first feature network to perform feature extraction on words of the first granularity in the natural language text, and outputting the obtained first feature information to the first processing network; using the second feature network to perform feature extraction on words of the second granularity in the natural language text, and outputting the obtained second feature information to the second processing network; using the first processing network to process the first feature information, and outputting the obtained first processing result to the fusion network; using the second processing network to process the second feature information, and outputting the obtained second processing result to the fusion network; and using the fusion network to fuse the first processing result and the second processing result to obtain the target result, where the first granularity and the second granularity are different.
  • the data processing device uses a deep neural network to independently process words of different granularities, avoiding the process of obtaining coarser-grained information from finer-grained information, and can effectively improve the performance of processing natural language processing tasks.
  • the architectures of the first feature network and the second feature network are different, and/or the architectures of the first processing network and the second processing network are different.
  • the input of the granular annotation network is the natural language text; the processing unit is specifically configured to use the granular annotation network to determine, according to N granularities, the granularity of each word in the natural language text to obtain the annotation information of the natural language text, and to output the annotation information to the first feature network and the second feature network; wherein the annotation information is used to describe the granularity of each word or the probability that each word belongs to the N granularities; N is an integer greater than 1;
  • the processing unit is specifically configured to process the words of the first granularity by using the first feature network to obtain the first feature information, where the first feature information is a vector or matrix representing the words of the first granularity;
  • the processing unit is specifically configured to process the words of the second granularity by using the second feature network to obtain the second feature information, where the second feature information is a vector or matrix representing the words of the second granularity.
  • the granular labeling network includes a long short-term memory (LSTM) network and a bidirectional long short-term memory (BiLSTM) network; the processing unit is specifically configured to use the granular labeling network to determine the granularity of each word in the natural language text using the following formula:
  • g_l = LSTM([h_l, z_{l-1}; g_{l-1}]);
  • BiLSTM() represents the processing operation of the BiLSTM network, and LSTM() represents the processing operation of the LSTM network;
  • x represents a word in the natural language text, and x_l represents the l-th word in the natural language text X;
  • h represents the hidden state variable in the BiLSTM network; h_l, h_{l-1}, and h_{l+1} respectively represent the hidden state variables when the BiLSTM network processes the l-th, (l-1)-th, and (l+1)-th words;
  • g represents the hidden state variable in the LSTM network; g_l and g_{l-1} respectively represent the hidden state variables when the LSTM network processes the l-th word and the (l-1)-th word in the natural language text;
  • z represents the probability that a word belongs to the reference granularity; z_{l-1} and z_l respectively represent the probabilities that the (l-1)-th word and the l-th word in the natural language text belong to the reference granularity;
  • the reference granularity is any one of the N granularities;
  • GS represents the Gumbel-Softmax function, and the temperature of the Gumbel-Softmax function is a hyperparameter;
  • W_g is a parameter matrix, that is, a parameter matrix in the granularity annotation network.
  • the processing unit is specifically configured to use the first feature network to use the following formula to perform feature extraction on words of the first granularity in the natural language text:
  • U_z = ENC_z(X, Z_X);
  • ENC_z represents the first feature network;
  • the first feature network is a Transformer model;
  • ENC_z() represents the processing operation performed by the first feature network;
  • X represents the natural language text;
  • Z_X = [z_1, z_2, ..., z_L] represents the annotation information;
  • z_1 to z_L sequentially represent the granularities of the first word to the L-th (last) word in the natural language text;
  • U_z represents the first feature information output by the first feature network.
  • the first processing result is a sequence containing one or more words; the processing unit is specifically configured to use the first processing network to process the input first feature information and the words that the first processing network has already output in the process of processing the first feature information, to obtain the first processing result.
  • the target result output by the fusion network is a sequence containing one or more words; the processing unit is specifically configured to use the fusion network to process the first processing result, The second processing result and the words that have been output by the fusion network in the process of processing the first processing result and the second processing result to determine the target word to be output, and output the target word.
  • the fusion network includes at least one LSTM network;
  • the processing unit is specifically configured to input a vector obtained by combining the first processing result and the second processing result into the LSTM network;
  • the processing unit is specifically configured to use the LSTM network to calculate the probability of a word with a reference granularity to be output by using the following formula:
  • h_t = LSTM(h_{t-1}, y_{t-1}, v0, v1);
  • h_t represents the hidden state variable in the LSTM network when the LSTM network processes the t-th word;
  • h_{t-1} represents the hidden state variable in the LSTM network when the LSTM network processes the (t-1)-th word;
  • LSTM() represents the processing operation performed by the LSTM;
  • the LSTM network has currently output (t-1) words;
  • y_{t-1} represents the (t-1)-th word output by the fusion network;
  • v0 represents the first processing result
  • v1 represents the second processing result
  • W_z is a parameter matrix in the fusion network, and the formula also uses a hyperparameter;
  • P(z_t = z | y_{1:t-1}, X) is the probability that the word currently to be output belongs to the reference granularity (granularity z);
  • t is an integer greater than 1.
  • the processing unit is specifically configured to use the fusion network to calculate the probability of the target word to be output by using the following formula:
  • P(y_t | z_t = z, y_{1:t-1}, X) represents the probability of outputting the target word y_t at the reference granularity;
  • P(y_t | y_{1:t-1}, X) represents the probability of outputting the target word.
  • the embodiments of the present application provide another data processing device.
  • the data processing device includes: a processing unit, configured to input a training sample into a deep neural network for processing to obtain a prediction processing result; wherein the deep neural network includes: a granularity annotation network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network; the processing includes: using the granularity annotation network to determine the granularity of each word in the training sample; using the first feature network to perform feature extraction on words of the first granularity in the training sample, and outputting the obtained third feature information to the first processing network; using the second feature network to perform feature extraction on words of the second granularity in the training sample, and outputting the obtained fourth feature information to the second processing network; using the first processing network to perform target processing on the third feature information, and outputting the obtained third processing result to the fusion network; using the second processing network to perform the target processing on the fourth feature information, and outputting the obtained fourth processing result to the fusion network; and using the fusion network to fuse the third processing result and the fourth processing result to obtain the prediction processing result, where the first granularity and the second granularity are different; the processing unit is further configured to determine, according to the prediction processing result and a standard result, the loss corresponding to the training sample, where the standard result is the processing result expected to be obtained by using the deep neural network to process the training sample, and to use the loss corresponding to the training sample to update the parameters of the deep neural network through an optimization algorithm.
  • the data processing device trains a deep neural network that can independently process words of different granularities, so as to obtain a deep neural network that can avoid the process of obtaining coarser-grained information from finer-grained information, and is simple to implement.
  • the first characteristic network and the second characteristic network have different architectures, and/or the first processing network and the second processing network have different architectures.
  • the input of the granular annotation network is the natural language text; the processing unit is specifically configured to use the granular annotation network to determine, according to N granularities, the granularity of each word in the natural language text to obtain the annotation information of the natural language text, and to output the annotation information to the first feature network and the second feature network; wherein the annotation information is used to describe the granularity of each word or the probability that each word belongs to the N granularities; N is an integer greater than 1;
  • the processing unit is specifically configured to process the words of the first granularity by using the first feature network to obtain the third feature information, where the third feature information is a vector or matrix representing the words of the first granularity;
  • the processing unit is specifically configured to process the words of the second granularity by using the second feature network to obtain the fourth feature information, where the fourth feature information is a vector or matrix representing the words of the second granularity.
  • the granular labeling network includes a long short-term memory (LSTM) network and a bidirectional long short-term memory (BiLSTM) network; the processing unit is specifically configured to use the granular labeling network to determine the granularity of each word in the natural language text using the following formula:
  • g_l = LSTM([h_l, z_{l-1}; g_{l-1}]);
  • BiLSTM() represents the processing operation of the BiLSTM network, and LSTM() represents the processing operation of the LSTM network;
  • x represents a word in the natural language text, and x_l represents the l-th word in the natural language text X;
  • h represents the hidden state variable in the BiLSTM network; h_l, h_{l-1}, and h_{l+1} respectively represent the hidden state variables when the BiLSTM network processes the l-th, (l-1)-th, and (l+1)-th words;
  • g represents the hidden state variable in the LSTM network; g_l and g_{l-1} respectively represent the hidden state variables when the LSTM network processes the l-th word and the (l-1)-th word in the natural language text;
  • z represents the probability that a word belongs to the reference granularity; z_{l-1} and z_l respectively represent the probabilities that the (l-1)-th word and the l-th word in the natural language text belong to the reference granularity;
  • the reference granularity is any one of the N granularities;
  • GS represents the Gumbel-Softmax function, and the temperature of the Gumbel-Softmax function is a hyperparameter;
  • W_g is a parameter matrix, that is, a parameter matrix in the granularity annotation network.
  • the processing unit is specifically configured to use the first feature network to use the following formula to perform feature extraction on words of the first granularity in the natural language text:
  • U_z = ENC_z(X, Z_X);
  • ENC_z represents the first feature network;
  • the first feature network is a Transformer model;
  • ENC_z() represents the processing operation performed by the first feature network;
  • X represents the natural language text;
  • Z_X = [z_1, z_2, ..., z_L] represents the annotation information;
  • z_1 to z_L sequentially represent the granularities of the first word to the L-th (last) word in the natural language text;
  • U_z represents the third feature information output by the first feature network.
  • the third processing result is a sequence containing one or more words; the processing unit is specifically configured to use the first processing network to process the input third feature information and the words that the first processing network has already output in the process of processing the third feature information, to obtain the third processing result.
  • the target result output by the fusion network is a sequence containing one or more words; the processing unit is specifically configured to use the fusion network to process the third processing result, The fourth processing result and the words that have been output by the fusion network in the process of processing the third processing result and the fourth processing result to determine the target word to be output, and output the target word.
  • the fusion network includes at least one LSTM network; the processing unit is specifically configured to input a vector obtained by combining the third processing result and the fourth processing result into the LSTM network;
  • the LSTM network uses the following formula to calculate the probability of a word with a reference granularity to be output:
  • h_t = LSTM(h_{t-1}, y_{t-1}, v2, v3);
  • h_t represents the hidden state variable in the LSTM network when the LSTM network processes the t-th word;
  • h_{t-1} represents the hidden state variable in the LSTM network when the LSTM network processes the (t-1)-th word;
  • LSTM() represents the processing operation performed by the LSTM;
  • the LSTM network has currently output (t-1) words;
  • y_{t-1} represents the (t-1)-th word output by the fusion network;
  • v2 represents the third processing result
  • v3 represents the fourth processing result
  • W_z is a parameter matrix in the fusion network, and the formula also uses a hyperparameter;
  • P(z_t = z | y_{1:t-1}, X) is the probability that the word currently to be output belongs to the reference granularity (granularity z);
  • t is an integer greater than 1;
  • P(y_t | z_t = z, y_{1:t-1}, X) represents the probability of outputting the target word y_t at the reference granularity;
  • P(y_t | y_{1:t-1}, X) represents the probability of outputting the target word.
  • the processing unit is specifically configured to update the parameters of at least one network included in the deep neural network by using the gradient value of the loss function relative to the at least one network; the loss function is used to calculate the loss between the prediction processing result and the standard result; wherein, during the update of any one of the first feature network, the second feature network, the first processing network, and the second processing network, the parameters of the other three networks remain unchanged.
  • the embodiments of the present application provide yet another data processing device.
  • the data processing device includes: a processor, a memory, an input device, and an output device.
  • the memory is used to store code;
  • the code is used to execute the method provided in the first aspect or the second aspect, the input device is used to obtain the natural language text to be processed, and the output device is used to output the target result obtained by the processor processing the natural language text.
  • the embodiments of the present application provide a computer program product.
  • the computer program product includes program instructions that, when executed by a processor, cause the processor to execute the method of the first aspect or the second aspect described above.
  • the embodiments of the present application provide a computer-readable storage medium; the computer storage medium stores a computer program, and the computer program includes program instructions that, when executed by a processor, cause the processor to perform the method of the above-mentioned first aspect or the above-mentioned second aspect.
  • Figures 1A to 1C are application scenarios of natural language processing systems
  • Fig. 2 is a flowchart of a natural language processing method provided by an embodiment of the application
  • FIG. 3 is a schematic structural diagram of a deep neural network provided by an embodiment of this application.
  • FIG. 4 is a schematic structural diagram of a granular labeling network 301 provided by an embodiment of this application.
  • FIG. 5 is a schematic structural diagram of a feature network provided by an embodiment of this application.
  • FIG. 6 is a schematic structural diagram of a deep neural network provided by an embodiment of this application.
  • FIG. 7 is a flowchart of a training method provided by an embodiment of the application.
  • FIG. 8 is a schematic structural diagram of a data processing device provided by an embodiment of this application.
  • FIG. 9 is a schematic structural diagram of a neural network processor provided by an embodiment of this application.
  • FIG. 10 is a block diagram of a partial structure of an intelligent terminal provided by an embodiment of the application.
  • FIG. 11 is a block diagram of a part of the structure of another data processing device provided by an embodiment of the application.
  • In the current technology, the network models used to process natural language processing tasks do not separate the operations performed on words of different granularities in natural language text. That is to say, in the currently adopted schemes, operations on words of different granularities are not decoupled.
  • a pooling operation is usually used to synthesize finer-grained features to form coarser-grained features.
  • the word-level and phrase-level features are integrated through the pooling operation to form sentence-level features. It can be understood that if the finer-grained features obtained are wrong, the coarser-grained features obtained from the finer-grained features will also be wrong.
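  • A minimal illustration of this conventional pooling step (not part of this application's method, and with arbitrary example dimensions): word-level features are mean-pooled into a single sentence-level feature, so any error in the word-level features propagates directly into the coarser-grained feature.

```python
import torch

# Illustration only: a conventional pipeline that mean-pools finer-grained
# (word-level) features into one coarser-grained (sentence-level) feature.
word_features = torch.randn(7, 128)           # 7 words, 128-dim word-level features
sentence_feature = word_features.mean(dim=0)  # sentence-level feature obtained by pooling
print(sentence_feature.shape)                 # torch.Size([128])
```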
  • the networks in the deep neural network that implement operations at different granularities can be analyzed or adjusted.
  • the deep neural network used in this application includes multiple decoupled sub-networks for processing words of different granularities; these sub-networks can be optimized in a targeted manner to ensure that the operations at each granularity are controllable.
  • Reusability and transferability: operations at different granularities have different reusability or transferability characteristics.
  • For example, sentence-level operations include sentence translation or transformation.
  • Phrase-level or word-level operations carry more domain-specific features.
  • since the deep neural network includes multiple independent sub-networks for processing words of different granularities, some of the sub-networks obtained by training with samples from a certain field can be applied to other fields.
  • a natural language processing system includes user equipment and data processing equipment.
  • the user equipment may be a mobile phone, a personal computer, a tablet computer, a wearable device, a personal digital assistant, a game console, an information processing center, and other smart terminals.
  • the user equipment is the initiator of natural language data processing and serves as the initiator of natural language processing tasks (for example, translation tasks or paraphrase tasks).
  • users initiate natural language processing tasks through the user equipment.
  • the paraphrase task is the task of transforming a natural language text into another text that has the same meaning as but a different expression from the original text. For example, "What makes the second world war happen" can be paraphrased as "What is the reason of world war II".
  • the data processing device may be a device or server with data processing functions such as a cloud server, a network server, an application server, and a management server.
  • the data processing device receives query sentences/voice/text questions from the smart terminal through an interactive interface, and then performs language data processing such as machine learning, deep learning, search, reasoning, and decision-making through a memory that stores data and a processor that performs data processing.
  • the memory may be a general term that includes local storage and a database storing historical data.
  • the database may be on a data processing device or on other network servers.
  • FIG. 1B shows another application scenario of the natural language processing system.
  • the smart terminal is directly used as a data processing device, directly receiving input from the user and directly processed by the hardware of the smart terminal itself.
  • the specific process is similar to that of FIG. 1A, and the above description can be referred to, which will not be repeated here.
  • the user equipment may be a local device 101 or 102
  • the data processing device may be an execution device 210
  • the data storage system 250 may be integrated on the execution device 210, or set in the cloud or on other network servers.
  • FIG. 2 is a flowchart of a natural language processing method provided by an embodiment of the application. As shown in FIG. 2, the method may include:
  • the natural language text to be processed may be a sentence currently to be processed by the data processing device.
  • the data processing device can process the received natural language text or the natural language text obtained by recognizing voice sentence by sentence.
  • obtaining the natural language text to be processed may be that the data processing device receives data such as voice or text sent by the user equipment, and obtains the natural language text to be processed according to the received voice or text data.
  • For example, if the data processing device receives two sentences sent by the user equipment, the data processing device obtains the first sentence (natural language text to be processed), processes the first sentence using the trained deep neural network, and outputs the result obtained by processing the first sentence; it then obtains the second sentence (natural language text to be processed), processes the second sentence using the trained deep neural network, and outputs the result obtained by processing the second sentence.
  • obtaining the natural language text to be processed may be that the smart terminal directly receives data such as voice or text input by the user, and obtains the natural language text to be processed according to the received voice or text data.
  • For example, if the smart terminal receives two sentences input by the user, the smart terminal obtains the first sentence (natural language text to be processed), processes the first sentence using the trained deep neural network, and outputs the result obtained by processing the first sentence; it then obtains the second sentence (natural language text to be processed), processes the second sentence using the trained deep neural network, and outputs the result obtained by processing the second sentence.
  • the deep neural network may include: a granular annotation network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network.
  • the processing that the data processing device performs on the natural language text using the deep neural network may include: using the granularity annotation network to determine the granularity of each word in the natural language text; using the first feature network to perform feature extraction on words of the first granularity in the natural language text, and outputting the obtained first feature information to the first processing network; using the second feature network to perform feature extraction on words of the second granularity in the natural language text, and outputting the obtained second feature information to the second processing network; using the first processing network to perform target processing on the first feature information, and outputting the obtained first processing result to the fusion network; using the second processing network to perform the target processing on the second feature information, and outputting the obtained second processing result to the fusion network; and using the fusion network to fuse the first processing result and the second processing result to obtain the target result; the first granularity and the second granularity are different.
  • the first granularity and the second granularity may be any two different granularities among character level, word level, phrase level, and sentence level.
  • the granularity of a word refers to the granularity of the word in the natural language text (sentence).
  • the target processing can be translation, retelling, abstract generation, etc.
  • the target result is another natural language text obtained by processing the natural language text.
  • the target result is a natural language text obtained by translating the natural language text.
  • the target result is another natural language text obtained by retelling the natural language text.
  • the natural language text to be processed can be regarded as an input sequence, and the target result (another natural language text) obtained by the data processing device processing the natural language text can be regarded as a generated sequence.
  • the deep neural network may include N feature networks and N processing networks.
  • the N feature networks and the N processing networks have a one-to-one correspondence, and N is an integer greater than one.
  • a pair of corresponding feature network and processing network are used to process words of the same granularity.
  • the first feature network performs feature extraction on words of the first granularity in the natural language text to obtain first feature information
  • the first processing network performs target processing on the first feature information.
  • the deep neural network may also include feature networks for performing feature extraction on words of other granularities (granularities other than the first granularity and the second granularity).
  • the deep neural network may also include processing networks for performing target processing on the feature information of words of other granularities (granularities other than the first granularity and the second granularity).
  • the number of feature networks and the number of processing networks included in the deep neural network are not limited. If the words in the natural language text are classified into N granularities, the deep neural network includes N feature networks and N processing networks.
  • For example, if the words in the natural language text are divided into phrase-level words and sentence-level words, the deep neural network includes two feature networks: one feature network is used to perform feature extraction on phrase-level words to obtain the feature information of phrase-level words, and the other feature network is used to perform feature extraction on sentence-level words to obtain the feature information of sentence-level words. The deep neural network also includes two processing networks: one processing network is used to perform target processing on the feature information of phrase-level words, and the other processing network is used to perform target processing on the feature information of sentence-level words. A composition sketch for this two-granularity case is shown below.
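  • Illustrative composition sketch only (assumed structure that reuses the illustrative sub-modules sketched earlier; the fusion and processing call signatures are assumptions, not this application's reference implementation):

```python
import torch.nn as nn

class MultiGranularityModel(nn.Module):
    """Sketch of the overall deep neural network for N = 2 granularities:
    a granularity tagger, one feature network and one processing network per
    granularity (one-to-one correspondence, no parameter sharing), and a
    fusion network that fuses the N processing results."""
    def __init__(self, tagger, feature_nets, processing_nets, fusion):
        super().__init__()
        assert len(feature_nets) == len(processing_nets)      # one-to-one correspondence
        self.tagger = tagger
        self.feature_nets = nn.ModuleList(feature_nets)        # e.g. phrase-level, sentence-level
        self.processing_nets = nn.ModuleList(processing_nets)
        self.fusion = fusion

    def forward(self, x):
        labels = self.tagger(x)                                 # (batch, L, N) granularity labels
        results = []
        for z, (enc, proc) in enumerate(zip(self.feature_nets, self.processing_nets)):
            z_mask = labels.argmax(dim=-1) == z                 # words of granularity z
            results.append(proc(enc(x, z_mask)))                # feature extraction + target processing
        return self.fusion(results)                             # fuse the N processing results
```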
  • the deep neural network includes N feature networks and N processing networks
  • the N feature networks output N feature information
  • the N processing networks output N processing results
  • the fusion network is used to fuse the N processing results to obtain the final output result.
  • the fusion network is not limited to fusing two processing results.
  • any two of the N feature networks perform feature extraction on words with different granularities in natural language text; any two of the N processing networks perform target processing on the feature information of words with different granularities.
  • any two characteristic networks of the N characteristic networks do not share parameters; any two of the N processing networks do not share parameters.
  • the target processing can be translation, retelling, abstract generation, etc.
  • the parameters of the first feature network and the second feature network are different, and the architectures adopted are the same or different.
  • the first feature network uses a deep neural network architecture
  • the second feature network uses a Transformer architecture.
  • the first processing network and the second processing network have different parameters and adopt the same or different architectures.
  • the first processing network uses a deep neural network architecture
  • the second processing network uses a Transformer architecture. It can be understood that the multiple feature networks included in the deep neural network may adopt different architectures, and the multiple processing networks included in the deep neural network may adopt different architectures.
  • the data processing device uses the mutually decoupled networks in the deep neural network to separately process words of different granularities, which can effectively improve the performance of processing natural language processing tasks.
  • FIG. 3 is a schematic structural diagram of a deep neural network provided by an embodiment of the application.
  • the deep neural network may include N feature networks and N processing networks. To facilitate understanding, only two feature networks (the first feature network and the second feature network) and two processing networks (the first processing network and the second processing network) are shown.
  • 301 is a granular annotation network
  • 302 is a first feature network
  • 303 is a second feature network
  • 304 is a first processing network
  • 305 is a second processing network
  • 306 is a converged network.
  • the data processing equipment uses the deep neural network in Figure 3 to process natural language text as follows:
  • the granularity labeling network 301 determines the granularity of each word in the natural language text according to N types of granularities to obtain the labeling information of the natural language text, and outputs the labeling information to the first feature network 302 and the second feature network 303.
  • the input of the granular annotation network 301 is the natural language text to be processed; the output may be annotation information, or annotation information and the natural language text.
  • the input of the first feature network 302 and the input of the second feature network 303 are both the annotation information and the natural language text.
  • the annotation information is used to describe the granularity of each word in the natural language text or the probability that each word in the natural language text belongs to the N types of granularities; N is an integer greater than 1.
•   the granularity labeling network 301 labels the granularity to which each word (taking the word as the basic processing unit) in the input natural language text (input sequence) belongs, that is, it determines the granularity label of each word in the natural language text. Assuming that two granularities are considered, phrase-level granularity and sentence-level granularity, the granularity of each word in the input natural language text (sentence) is determined to be one of these two granularities.
•   for example, the granularity annotation network 301 determines the granularity of each word in the input natural language text "what makes the second world war happen", where words such as "what", "makes", and "happen" are determined to be of sentence-level granularity, and words such as "the", "second", "world", and "war" are determined to be of phrase-level granularity. It is worth noting that the natural language text to be processed does not carry granularity labels; instead, the granularity annotation network 301 determines the granularity of each word in the input natural language text.
  • the first feature network 302 uses the input natural language text and annotation information to perform feature extraction, and outputs the obtained first feature information to the first processing network 304.
  • the first feature information is a vector or matrix representing words of the first granularity.
•   the input of the first feature network 302 is the natural language text and the tagging information. The first feature network 302 performs feature extraction on the words of the first granularity in the natural language text and obtains the vector or matrix representation of the words of the first granularity in the natural language text, that is, the first feature information.
  • the second feature network 303 uses the input natural language text and annotation information to perform feature extraction, and outputs the obtained second feature information to the second processing network 305.
  • the second feature information is a vector or matrix representing words of the second granularity.
•   the input of the second feature network 303 is the natural language text and the tagging information. The second feature network 303 performs feature extraction on the words of the second granularity in the natural language text and obtains the vector or matrix representation of the words of the second granularity in the natural language text, that is, the second feature information.
  • the embodiment of the present application does not limit the order in which the data processing device performs step 313 and step 312. Step 313 and step 312 can be performed at the same time, or step 312 can be performed before step 313, or step 313 can be performed before step 312.
•   the first processing network 304 performs processing using the input first feature information and the processing results that the first processing network 304 has already output in the process of processing the first feature information, to obtain the first processing result.
•   the first processing network 304 processes the input first feature information in a recursive manner (for example, translation, paraphrase, abstract extraction, etc.); that is, the first processing network 304 takes the output of the first feature network 302 (the first feature information) and the processing results (sequence) it has output previously as input, and calculates a vector or matrix representation (the first processing result) through the deep neural network.
•   the second processing network 305 performs processing using the input second feature information and the processing results that the second processing network 305 has already output in the process of processing the second feature information, to obtain the second processing result.
•   the second processing network 305 processes the input second feature information in a recursive manner (for example, translation, paraphrase, abstract extraction, etc.); that is, the second processing network 305 takes the output of the second feature network 303 (the second feature information) and the processing results (sequence) it has output previously as input, and calculates a vector or matrix representation (the second processing result) through the deep neural network.
  • the embodiment of the present application does not limit the order in which the data processing device executes step 314 and step 315. Step 314 and step 315 can be executed simultaneously, or step 314 can be executed first and then step 315 can be executed, or step 315 can be executed before step 314 is executed.
•   the fusion network 306 uses the first processing result, the second processing result, and the processing results that the fusion network 306 has already output in the process of processing the first processing result and the second processing result to determine the target word to be output, and outputs the target word.
  • the target word is included in the first processing result or the second processing result.
  • the fusion network 306 can merge the output of processing networks of different granularities, that is, determine the granularity of the current word to be output and then determine the word to be output.
•   in the first step, the fusion network determines that the word to be output has "sentence level" granularity and outputs "what"; in the second step, it determines that the word to be output has "sentence level" granularity and outputs "is"; the previous operation is repeated until the generation of the final output sentence (corresponding to the target result) is completed. It should be noted that the above steps 311 to 316 are all completed by deep neural network calculations.
  • the data processing device uses feature networks of different granularities and processing networks of different granularities to independently process words of different granularities, which can effectively improve the probability of obtaining correct results.
  • FIG. 4 is a schematic structural diagram of a granular labeling network 301 provided by an embodiment of this application.
•   the granularity annotation network 301 includes a Long Short-Term Memory (LSTM) network 402 and a bidirectional LSTM (BiLSTM) network 401. It can be seen from FIG. 4 that the granularity labeling network 301 uses a multi-layer LSTM network architecture.
•   the input of the BiLSTM network 401 is the natural language text
•   the output of the LSTM network 402 is the labeling information, that is, the granularity label of each word or the probability that each word belongs to each granularity.
  • the granularity annotation network 301 is used to predict the granularity corresponding to each word in the input sentence (natural language text).
  • the BiLSTM network 401 is used to convert the input natural language text into a vector, which is used as the input of the next layer of the LSTM network 402; the LSTM network 402 calculates and outputs the probability that each word in the natural language text belongs to each granularity.
•   the labeling information can be generated by using the Gumbel-Softmax (GS) function instead of the commonly used Softmax operation.
•   with the GS function, the probability that each word belongs to each granularity is close to 0 or 1.
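•   as a brief illustration (assuming a PyTorch-style implementation; the logits below are made-up numbers), the Gumbel-Softmax function with a low temperature yields per-word granularity probabilities that are close to 0 or 1 while remaining differentiable:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, -1.0],    # word 1: leans toward granularity 0
                       [-0.5, 1.5]])   # word 2: leans toward granularity 1
probs = F.gumbel_softmax(logits, tau=0.1, hard=False)  # near one-hot, still differentiable
print(probs)  # e.g. [[0.99, 0.01], [0.02, 0.98]]; values vary because Gumbel noise is random
```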
  • the following uses mathematical formulas to describe the manner in which the granularity annotation network 301 predicts the granularity of each word in the natural language text.
•   the processing performed by the granularity annotation network 301 (the BiLSTM network 401 and the LSTM network 402) corresponds to the following formulas:
•   h = BiLSTM(x), where h = [h_1, ..., h_L] are the hidden states produced for the words of the input sentence;
•   g_l = LSTM([h_l, z_{l-1}; g_{l-1}]);
•   z_l = GS(W_g · g_l, τ);
•   BiLSTM() in the formulas represents the processing of a bidirectional recurrent deep neural network, and LSTM() represents the processing of a (one-way) recurrent deep neural network;
•   l represents the position index of a word, x represents the input sentence (natural language text), and x_l represents the l-th word in the input sentence x;
•   h represents the hidden states in the BiLSTM network 401; h_l and h_{l-1} represent the hidden states of the BiLSTM network 401 for the l-th word and the (l-1)-th word respectively;
•   g represents the hidden state variable in the (one-way) LSTM network 402, and its calculation follows the calculation rules of the LSTM network; g_l and g_{l-1} are the hidden state variables obtained when the LSTM network 402 processes the l-th word and the (l-1)-th word in the input sentence respectively;
•   z represents the probability that a word belongs to a certain granularity (phrase-level granularity, sentence-level granularity, or another granularity); z_l and z_{l-1} represent these probabilities for the l-th word and the (l-1)-th word in the input sentence respectively;
•   GS represents the Gumbel-Softmax function, τ is the hyperparameter (temperature) in the Gumbel-Softmax function, and W_g is a parameter matrix in the granularity annotation network.
•   the granularity annotation network 301 uses a multi-layer LSTM network architecture to determine the granularity of each word in the natural language text, and can make full use of the granularities already determined for previous words when determining the granularity of a new word (a word whose granularity is to be determined), which is simple to implement and efficient to process.
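•   a compact sketch of this multi-layer LSTM architecture follows (assuming a PyTorch-style implementation; the class name, dimensions, and hyperparameters are invented for illustration and are not the concrete network of this application): a BiLSTM encodes the sentence, a one-way LSTM cell consumes [h_l; z_{l-1}] word by word, and Gumbel-Softmax produces the per-word granularity probabilities z_l.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GranularityAnnotator(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid_dim=64, n_granularities=2, tau=0.5):
        super().__init__()
        self.tau = tau
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hid_dim, bidirectional=True, batch_first=True)
        self.cell = nn.LSTMCell(2 * hid_dim + n_granularities, hid_dim)
        self.w_g = nn.Linear(hid_dim, n_granularities)        # parameter matrix W_g

    def forward(self, token_ids):                              # token_ids: (batch, length)
        h, _ = self.bilstm(self.embed(token_ids))              # hidden state h_l for every word
        batch, length, _ = h.shape
        g = h.new_zeros(batch, self.cell.hidden_size)
        c = h.new_zeros(batch, self.cell.hidden_size)
        z_prev = h.new_zeros(batch, self.w_g.out_features)
        zs = []
        for l in range(length):
            g, c = self.cell(torch.cat([h[:, l], z_prev], dim=-1), (g, c))
            z_prev = F.gumbel_softmax(self.w_g(g), tau=self.tau)   # z_l = GS(W_g g_l, tau)
            zs.append(z_prev)
        return torch.stack(zs, dim=1)                           # (batch, length, n_granularities)
```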
  • FIG. 5 is a schematic structural diagram of a first characteristic network 302 and a second characteristic network 303 provided by an embodiment of this application.
  • the first feature network 302 performs feature extraction on words of the first granularity in the natural language text
•   the second feature network 303 performs feature extraction on the words of the second granularity in the natural language text.
  • the network architectures adopted by the first feature network 302 and the second feature network 303 may be the same or different.
  • a feature network that processes words of a certain granularity can be understood as a feature network of that granularity, and feature networks of different granularities process words of different granularity.
  • the parameters of the first characteristic network 302 and the second characteristic network 303 are not shared, and the hyperparameter settings are different.
•   both the first feature network 302 and the second feature network 303 adopt the Transformer model.
•   this model is based on a multi-head self-attention mechanism; it processes the words of a certain granularity in the input sentence (natural language text), so as to construct vectors as the feature information of the words of that granularity.
•   the first feature network 302 may only focus on the words of the first granularity in the input sentence (natural language text), and the second feature network 303 may only focus on the words of the second granularity in the input sentence (natural language text).
•   when the granularity annotation network 301 determines the probability that each word in the natural language text belongs to each of the aforementioned N granularities, the first feature network 302 can focus on the words of the first granularity in the input sentence (natural language text), and the second feature network 303 can focus on the words of the second granularity in the input sentence (natural language text).
•   for the first feature network 302, it focuses on the words in the input sentence with a higher probability of belonging to the first granularity; for the second feature network 303, it focuses on the words with a higher probability of belonging to the second granularity. It can be understood that the higher the probability that a word belongs to the first granularity, the more attention the first feature network 302 pays to that word.
•   the first feature network 302 can use a self-attention mechanism with a limited window (a local mechanism whose weights are still calculated by attention).
•   the first feature network 302 will focus on words of the first granularity in the input sentence and ignore words of other granularities.
•   the first feature network 302 can be a feature network of phrase-level granularity; when extracting the features of each word, it only pays attention to the two words adjacent to that word, as shown in FIG. 5.
  • the second feature network 303 can adopt the Self-Attention mechanism of the whole sentence, so as to be able to pay attention to the global information of the sentence.
•   the second feature network 303 can be a feature network of sentence-level granularity; when extracting the features of each word, it focuses on the entire input sentence, as shown in FIG. 5.
•   the second feature network 303 will focus on the words of the second granularity in the input sentence, while ignoring words of other granularities.
  • the Transformer model is a commonly used model in the field, and the working principle of the model will not be described in detail here.
•   each feature network obtains the vector representation of the words of its granularity in the input sentence (natural language text); the vector representation obtained by the feature network of granularity z is denoted as U_z.
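•   a small sketch of the two attention patterns described above (assuming a PyTorch-style implementation; the window size and dimensions are illustrative only): the phrase-level feature network restricts each word's self-attention to a local window, while the sentence-level feature network attends over the whole sentence.

```python
import torch
import torch.nn as nn

seq_len, dim = 7, 32                        # e.g. "what makes the second world war happen"
x = torch.randn(1, seq_len, dim)

attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

# Phrase-level pattern: mask out everything outside a +/-1 window around each position.
idx = torch.arange(seq_len)
local_mask = (idx[:, None] - idx[None, :]).abs() > 1    # True = not allowed to attend
phrase_feat, _ = attn(x, x, x, attn_mask=local_mask)

# Sentence-level pattern: no mask, every word attends to the entire sentence.
sentence_feat, _ = attn(x, x, x)
```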
•   the processing operations implemented by the first feature network 302 and the second feature network 303 can be described as follows: the input of each feature network is the input sentence X and the annotation information Z_X of the input sentence, and its output is the vector representation U_z of the words of the corresponding granularity.
•   if the annotation information output by the granularity annotation network 301 is the granularity of each word in the natural language text, the annotation information of the input sentence that is input to the feature networks is the annotation information output by the granularity annotation network 301.
•   for example, the annotation information output by the granularity annotation network 301 is [1100001]; these binary values in turn represent the granularity of the first word to the last word in the input sentence, where 0 indicates phrase-level granularity and 1 indicates sentence-level granularity.
•   if the annotation information output by the granularity annotation network 301 is the probability that each word in the natural language text belongs to the above N granularities, the annotation information of the input sentence that is input to the feature networks is obtained from the annotation information output by the granularity annotation network 301.
  • the data processing device may further process the annotation information output by the granular annotation network 301 to obtain the annotation information that can be input to the feature network.
•   the data processing device takes, for each word in the natural language text, the granularity to which the word belongs with the maximum probability as the granularity of that word. For example, if the probabilities that a word in the input sentence (natural language text) belongs to the phrase-level granularity and the sentence-level granularity are 0.85 and 0.15 respectively, the granularity of that word is the phrase-level granularity. In other words, the granularity of each word in the natural language text is classified into phrase-level granularity or sentence-level granularity.
•   for example, the annotation information output by the granularity annotation network 301 is [0.92 0.88 0.08 0.07 0.04 0.06 0.97], where the values in turn indicate the probability that the first word to the last word in the natural language text belong to the sentence-level granularity.
•   the data processing device can set the values less than 0.5 in the label information to 0 and the values greater than or equal to 0.5 to 1, obtaining the new label information [1100001], which is input into the feature networks.
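•   a tiny illustration of this thresholding (plain Python; the probabilities are the example values above): probabilities of belonging to the sentence-level granularity are mapped to 0/1 annotation information.

```python
probs = [0.92, 0.88, 0.08, 0.07, 0.04, 0.06, 0.97]
annotation = [1 if p >= 0.5 else 0 for p in probs]
print(annotation)  # [1, 1, 0, 0, 0, 0, 1]
```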
•   alternatively, the data processing device samples according to the probability that each word in the natural language text belongs to the aforementioned N granularities, obtains the annotation information of the natural language text from the granularity of each word obtained by the sampling, and inputs it into the feature networks.
•   each feature network included in the deep neural network independently processes the words of one granularity, and networks of different architectures can be used for words of different granularities, giving better feature extraction performance.
•   the processing performed by the processing networks and by the fusion network 306 will be introduced below in conjunction with the structures of the first feature network 302, the second feature network 303, the first processing network 304, the second processing network 305, and the fusion network 306.
  • Fig. 6 is a schematic structural diagram of a deep neural network provided by an embodiment of the application, and Fig. 6 does not show a granular annotation network.
•   the input of the first processing network 304 is the first feature information output by the first feature network 302 and the processing results (words) that the first processing network 304 has already output in the process of processing the first feature information;
•   the input of the second processing network 305 is the second feature information output by the second feature network 303 and the processing results (words) that the second processing network 305 has already output in the process of processing the second feature information;
•   the input of the fusion network 306 is the first processing result, the second processing result, and the words that have been output in the process of processing the first processing result and the second processing result;
•   the output of the fusion network 306 is the target result obtained by fusing the first processing result and the second processing result.
  • the architectures adopted by the first processing network 304 and the second processing network 305 may be the same or different.
  • the first processing network 304 and the second processing network 305 may not share parameters.
  • a processing network that processes words of a certain granularity can be understood as a processing network of that granularity, and processing networks of different granularities process words of different granularity.
  • each granularity has a corresponding processing network.
  • the granularity of each word in a natural language text is divided into phrase-level granularity and sentence-level granularity.
  • Deep neural networks include a phrase-level granularity processing network and a sentence-level granularity processing network.
  • the processing networks of different granularities are decoupled, which means that they do not share parameters and can adopt different architectures.
  • the phrase-level granularity processing network uses a deep neural network architecture
  • the sentence-level granularity processing network uses a Transformer architecture.
  • the processing network can output one word at a time and the granularity of the word.
•   the processing can be performed in a recursive manner, that is, the processing network of each granularity takes the output of the feature network of the corresponding granularity and the words it has already output as input, calculates the probabilities of the multiple words currently to be output, and outputs the word with the highest probability together with the label information corresponding to that word.
•   alternatively, the processing network uses its input to calculate the probability of each word currently to be output, performs sampling according to these probabilities, and outputs the sampled word and the label information corresponding to that word.
•   alternatively, the processing network uses its input to calculate the probability of each word currently to be output (that is, the probability that each word is currently output), and outputs these probabilities.
•   for example, the processing network currently has F words to be output; the processing network uses its input to calculate the probability of the first word being output, the probability of the second word being output, ..., and the probability of the F-th word being output, and inputs these probabilities into the fusion network, where F is an integer greater than 1.
  • the label information corresponding to a word may be the probability that the word belongs to a certain granularity, or the granularity of the word, or the probability that the word belongs to various granularities.
•   the processing performed by the first processing network 304 may be as follows: in the first step, the first processing network 304 processes the input first feature information to predict the first word currently required to be output, and outputs the first word and the label information corresponding to the first word; in the second step, the first processing network 304 processes the input first feature information and the first word to predict the second word currently required to be output, and outputs the second word and the label information corresponding to the second word; in the third step, the first processing network 304 processes the input first feature information, the first word, and the second word to predict the third word currently required to be output, and outputs the third word and the label information corresponding to the third word; the previous steps are repeated until the first processing result is completed.
•   each processing network included in the deep neural network can process its input feature information in a manner similar to the first processing network 304.
•   for example, the input of a certain processing network is the feature information obtained by its corresponding feature network performing feature extraction on "a good geologist"; the processing network processes the input feature information, predicts that "a" currently needs to be output, and outputs it; the processing network then processes the input feature information and the previously output "a", predicts that "great" currently needs to be output, and outputs it; the processing network then processes the input feature information and the previously output "a" and "great", predicts that "geologist" currently needs to be output, and outputs it.
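•   a hedged sketch of this recursive (autoregressive) behaviour (assuming a PyTorch-style implementation; `processing_net` and its interface are placeholders, not the patent's concrete network): at each step the processing network takes the feature information plus the words it has already output and predicts the next word.

```python
import torch

def decode(processing_net, feature_info, max_len, eos_id):
    output_ids = []                                         # words already output
    for _ in range(max_len):
        logits = processing_net(feature_info, output_ids)   # uses features + previous outputs
        next_id = int(torch.argmax(logits, dim=-1))         # word with the highest probability
        output_ids.append(next_id)
        if next_id == eos_id:                               # stop when the sentence is finished
            break
    return output_ids
```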
•   the first processing network 304 receives the input of the first feature network 302 and the words it has already output for calculation; the calculation method is to use the self-attention mechanism with a limited window. The second processing network 305 receives the input of the second feature network 303 and the words it has already output for calculation; the calculation method is to use the self-attention mechanism over the whole sentence.
  • the processing result obtained by the processing network at each granularity is denoted as Vz, and z represents the index of the granularity level, namely the granularity z.
•   the first processing network 304 and the second processing network 305 may also adopt different architectures. The following describes the operations performed by the fusion network 306 on the processing results input by each processing network.
  • the fusion network 306 can merge the processing results output by the processing network at different granularities to obtain the target result.
  • the output of the fusion network 306 is a sequence containing words.
  • the input of the fusion network 306 is the processing results of each processing network (the first processing result and the second processing result) and the sequence that the fusion network 306 has output in the process of processing these processing results.
•   the operations performed by the fusion network 306 can be as follows: the fusion network 306 merges the processing results input by each processing network into a vector, and inputs the vector into an LSTM network for processing to determine the granularity of the word currently to be output;
•   the fusion network 306 then outputs the target word currently to be output by the processing network of that granularity.
•   inputting the vector into an LSTM network for processing to determine the granularity of the word currently to be output may be: inputting the vector into an LSTM network for processing to determine, for each of the above N granularities, the probability that a word of that granularity is output, and then determining the granularity of the word currently to be output; the word currently to be output has the granularity with the highest probability of being output.
•   this granularity is any one of the above-mentioned N granularities.
  • the target word is the word with the highest probability of being output among the multiple words currently to be output by the processing network of the granularity to be output.
•   for example, the probabilities of the first word, the second word, and the third word currently to be output by the processing network of the reference granularity are 0.06, 0.8, and 0.14 respectively; the target word to be output by the processing network of the reference granularity is the second word, that is, the word with the highest probability of being output. It can be understood that the fusion network 306 may first determine the granularity of the word currently to be output, and then output the word to be output by the processing network of that granularity.
•   the operations performed by the fusion network 306 can also be as follows: the fusion network 306 merges the processing results input by each processing network into a vector; it inputs the vector into an LSTM network for processing to determine, among the words currently to be output by each processing network, the probability of each word being output; the fusion network 306 then outputs the target word with the highest probability of being output among these words.
  • Each processing network refers to a processing network of each granularity.
•   for example, the words currently to be output by the first processing network include "a", "good", and "geologist", and the words currently to be output by the second processing network include "How", "can", "I", and "be"; the fusion network calculates the probability of each of these 7 words being output, and outputs the word with the highest probability of being output among these 7 words.
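•   an illustration of this fusion step (plain Python; the candidate probabilities are made up for illustration): given the candidate words and probabilities currently proposed by the phrase-level and sentence-level processing networks, the word with the highest probability is output.

```python
candidates = {
    "a": 0.05, "good": 0.03, "geologist": 0.02,          # phrase-level candidates
    "How": 0.45, "can": 0.25, "I": 0.12, "be": 0.08,      # sentence-level candidates
}
target_word = max(candidates, key=candidates.get)
print(target_word)  # "How"
```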
  • the following describes how to calculate the probability of each word being output in each word currently to be output by the processing network with reference granularity.
•   the reference granularity is any one of the above-mentioned N granularities.
  • the (t-1) words already output by the fusion network 306 are denoted as [y 1 ,y 2 ,...,y t-1 ], and t is an integer greater than 1.
  • the vectors (processing results) output by the first processing network and the second processing network are v0 and v1, respectively.
•   the fusion network 306 combines these two vectors with the sequence that the fusion network 306 has already output, and inputs the merged vector into the LSTM network included in the fusion network 306 for processing, so as to calculate the probability that the word currently to be output has the reference granularity.
  • the LSTM network can use the following formula to calculate the probability of words with a reference granularity to be output:
•   h_t = LSTM(h_{t-1}, y_{t-1}, v0, v1);
•   p(z_t = z | y_{1:t-1}, X) = GS(W_z · h_t, τ);
•   h_t represents the hidden state variable in the LSTM network when the LSTM network processes the t-th word, and LSTM() represents the processing operation performed by that LSTM network;
•   y_{t-1} represents the (t-1)-th word that has already been output;
•   W_z is a parameter matrix in the fusion network and τ is a hyperparameter (temperature);
•   p(z_t = z | y_{1:t-1}, X) is the probability that the word currently to be output has granularity z.
  • the fusion network 306 can use a similar method to calculate the probability of currently outputting words of any one of the above N granularities.
  • the mixed probability model is used to calculate the probability of outputting the target word.
  • the target word is a word currently to be output by the processing network of the granularity z.
•   the formula for calculating the probability of outputting the target word is as follows:
•   p(y_t | y_{1:t-1}, X) = Σ_z p(z_t = z | y_{1:t-1}, X) · p(y_t | z_t = z, y_{1:t-1}, X);
•   p(y_t | z_t = z, y_{1:t-1}, X) represents the probability of outputting the target word y_t at granularity z, and p(y_t | y_{1:t-1}, X) represents the probability of outputting the target word;
•   p(y_t | z_t = z, y_{1:t-1}, X) can be given by the processing network of granularity z.
•   the processing network of granularity z can input into the fusion network the probability of each of the words (words of granularity z) it currently has to output, that is, the probability that each of those words is output.
•   for example, the input of the first processing network is the feature information obtained by the first feature network performing feature extraction on "a good geologist"; the processing network processes the feature information to obtain the probability that "a" is currently to be output, the probability that "great" is currently to be output, and the probability that "geologist" is currently to be output, and inputs these words and the corresponding probabilities into the fusion network.
•   in this case, p(y_t = "great" | z_t = z, y_{1:t-1}, X) represents the probability of outputting "great" at granularity z.
•   the fusion network 306 may first calculate, for each of the above N granularities, the probability that the word currently to be output has that granularity, then calculate the probability of each candidate word being output, and finally output the word with the highest probability of being output.
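•   a minimal sketch of this mixed probability model (assuming a PyTorch-style implementation; the probability values and vocabulary size are illustrative only): the probability of outputting word y_t is the sum over granularities z of p(z_t = z | history) multiplied by p(y_t | z_t = z, history).

```python
import torch

p_granularity = torch.tensor([0.3, 0.7])           # p(z_t = z | y_1:t-1, X), from the fusion LSTM
p_word_given_z = torch.tensor([[0.6, 0.3, 0.1],     # word distribution from the granularity-0 network
                               [0.1, 0.2, 0.7]])    # word distribution from the granularity-1 network
p_word = (p_granularity[:, None] * p_word_given_z).sum(dim=0)  # p(y_t | y_1:t-1, X)
next_word = int(torch.argmax(p_word))               # word with the highest probability of being output
```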
  • the foregoing embodiment describes the use of a deep neural network obtained by training to implement a natural language processing method.
  • the following describes how to train a required deep neural network.
  • FIG. 7 is a flowchart of a training method provided by an embodiment of the application. As shown in FIG. 7, the method may include:
  • the data processing device inputs the training samples to the deep neural network for processing, and obtains a prediction processing result.
  • the deep neural network includes: a granular labeling network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network.
•   the processing includes: using the granularity labeling network to determine the granularity of each word in the training sample; using the first feature network to perform feature extraction on the words of the first granularity in the training sample and outputting the obtained third feature information to the first processing network; using the second feature network to perform feature extraction on the words of the second granularity in the training sample and outputting the obtained fourth feature information to the second processing network; using the first processing network to perform the target processing on the third feature information and outputting the obtained third processing result to the fusion network; using the second processing network to perform the target processing on the fourth feature information and outputting the obtained fourth processing result to the fusion network; and using the fusion network to fuse the third processing result and the fourth processing result to obtain the prediction processing result; the first granularity and the second granularity are different.
  • the architectures of the first feature network and the second feature network are different, and/or the architectures of the first processing network and the second processing network are different.
  • the input of the granular annotation network is the natural language text
•   the granularity annotation network is used to determine the granularity of each word in the natural language text according to N granularities to obtain the annotation information of the natural language text, and to output the annotation information to the first feature network and the second feature network; the annotation information is used to describe the granularity of each word or the probability that each word belongs to the N granularities; N is an integer greater than 1.
•   the first feature network is used to perform feature extraction using the input natural language text and the annotation information, and to output the obtained third feature information to the first processing network, where the third feature information is a vector or matrix representing the words of the first granularity; the first processing network is used to perform the target processing using the input third feature information and the processing results that the first processing network has already output, to obtain the third processing result.
•   the fusion network outputs one word at a time; the fusion network is used to determine the target word to be output by using the third processing result, the fourth processing result, and the words that the fusion network has already output in the process of processing the third processing result and the fourth processing result, and to output the target word.
  • the data processing device determines the loss corresponding to the training sample according to the predicted processing result and the standard result.
  • the standard result is the expected processing result obtained by using the deep neural network to process the training sample.
  • each training sample corresponds to a standard result, so that the data processing device can calculate and use the deep neural network to process the loss of each training sample, thereby optimizing the deep neural network.
  • the following takes training a deep neural network to process retelling tasks as an example to introduce the training samples and standard results that can be used by the data processing device to train the deep neural network.
•   the granularity annotation network 301 is obtained through end-to-end learning. Because of end-to-end learning, in order to ensure that the granularity labeling network 301 is differentiable, during the training process the granularity labeling network 301 actually gives the probability that each word belongs to each granularity, rather than an absolute 0/1 label.
•   the data processing device trains the deep neural network to process different natural language processing tasks using different training samples and standard results. For example, if the deep neural network is trained to handle retelling (paraphrase) tasks, training samples and standard results similar to those in Table 1 can be used. For another example, if the deep neural network is trained to handle translation tasks, the training samples used are English texts, and the standard results are the standard Chinese texts corresponding to the training samples.
  • the data processing device uses the loss corresponding to the training sample to update the parameters of the deep neural network through an optimization algorithm.
  • data processing equipment can train deep neural networks to handle different natural language processing tasks.
•   when the data processing device trains the deep neural network to process different natural language processing tasks, the loss between the predicted processing result and the standard result is calculated differently, that is, the method of calculating the loss corresponding to the training sample differs.
•   using the loss corresponding to the training sample and updating the parameters of the deep neural network through an optimization algorithm (for example, a gradient descent algorithm) may be: using the gradient values of the loss function with respect to at least one network included in the deep neural network to update the parameters of the at least one network, where the loss function is used to calculate the loss between the predicted processing result and the standard result; during the update of any one of the first feature network, the second feature network, the first processing network, and the second processing network, the parameters of the other three networks remain unchanged.
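•   a hedged training-step sketch of this update scheme (assuming a PyTorch-style implementation; `model`, `loss_fn`, `model.feature_nets`, and the data are placeholders, not the concrete networks of this application): the loss between the predicted processing result and the standard result is computed, then one sub-network is updated by gradient descent while the parameters of the other sub-networks stay frozen.

```python
import torch

def train_step(model, sample, standard_result, loss_fn, lr=1e-3):
    # Freeze everything except, for example, the first feature network for this update.
    for p in model.parameters():
        p.requires_grad_(False)
    for p in model.feature_nets[0].parameters():
        p.requires_grad_(True)

    optimizer = torch.optim.SGD(model.feature_nets[0].parameters(), lr=lr)
    prediction = model(sample)                      # predicted processing result
    loss = loss_fn(prediction, standard_result)     # loss corresponding to the training sample
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                # gradient-descent update of that network only
    return loss.item()
```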
  • the deep neural network used in the foregoing embodiment is a network obtained by using the training method in FIG. 7. It should be understood that the structure and processing process of the deep neural network in FIG. 7 are the same as the deep neural network in the foregoing embodiment.
  • the data processing device trains a deep neural network that can independently process words of different granularities, so as to obtain a deep neural network that can avoid the process of obtaining coarser-grained information from finer-grained information, and is simple to implement.
  • FIG. 8 is a schematic structural diagram of a data processing device provided by an embodiment of the application. As shown in FIG. 8, the data processing device may include:
  • the obtaining unit 801 is configured to obtain the natural language text to be processed
  • the processing unit 802 is configured to process the natural language text by using the deep neural network obtained by training;
  • the output unit 803 is configured to output the target result obtained by processing the natural language text.
  • the deep neural network includes: a granular labeling network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network.
•   the processing includes: using the granularity labeling network to determine the granularity of each word in the natural language text; using the first feature network to perform feature extraction on the words of the first granularity in the natural language text and outputting the obtained first feature information to the first processing network; using the second feature network to perform feature extraction on the words of the second granularity in the natural language text and outputting the obtained second feature information to the second processing network; using the first processing network to process the first feature information and outputting the obtained first processing result to the fusion network; using the second processing network to process the second feature information and outputting the obtained second processing result to the fusion network; and using the fusion network to fuse the first processing result and the second processing result to obtain the target result; the first granularity and the second granularity are different.
  • the processing unit 802 may be a central processing unit (Central Processing Unit, CPU) in a data processing device, a neural network processor (Neural-network Processing Unit, NPU), or other types of processors.
  • the output unit 803 may be a display, a display screen, an audio device, etc.
  • the target result may be another natural language text obtained from the natural language text, and the obtained natural language text is displayed on the display screen of the data processing device.
  • the target result can be a voice corresponding to another natural language text obtained from the natural language text, and the audio device in the data processing device plays the voice.
  • the processing unit 802 is also used to input training samples into the deep neural network for processing to obtain prediction processing results; according to the prediction processing results and standard results, determine the loss corresponding to the training samples;
  • the standard result is the processing result expected to be obtained by using the deep neural network to process the training sample; using the loss corresponding to the training sample, the parameters of the deep neural network are updated through an optimization algorithm.
  • the deep neural network includes: a granular labeling network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network.
•   the processing includes: using the granularity labeling network to determine the granularity of each word in the training sample; using the first feature network to perform feature extraction on the words of the first granularity in the training sample and outputting the obtained third feature information to the first processing network; using the second feature network to perform feature extraction on the words of the second granularity in the training sample and outputting the obtained fourth feature information to the second processing network; using the first processing network to perform the target processing on the third feature information and outputting the obtained third processing result to the fusion network; using the second processing network to perform the target processing on the fourth feature information and outputting the obtained fourth processing result to the fusion network; and using the fusion network to fuse the third processing result and the fourth processing result to obtain the prediction processing result; the first granularity and the second granularity are different.
•   a deep neural network (DNN) can be understood as a neural network with many hidden layers; there is no special metric for "many" here.
•   the multi-layer neural network and the deep neural network that we often speak of are essentially the same thing.
•   the layers inside a DNN can be divided into three categories: the input layer, the hidden layers, and the output layer.
•   generally, the first layer is the input layer, the last layer is the output layer, and all the layers in the middle are hidden layers.
  • the layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1th layer.
•   for example, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as W^3_24, where the superscript 3 represents the layer index of the coefficient W, and the subscript corresponds to the output index 2 of the third layer and the input index 4 of the second layer.
•   in general, the coefficient from the k-th neuron of the (L-1)-th layer to the j-th neuron of the L-th layer is defined as W^L_jk. Note that the input layer has no W parameters.
  • more hidden layers allow the network to better describe complex situations in the real world. Theoretically speaking, a model with more parameters is more complex and has a greater "capacity", which means it can complete more complex learning tasks.
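•   an illustration of the coefficient notation above (assuming a PyTorch-style implementation): a fully connected layer stores its weights as a matrix of shape (out_features, in_features), so the element at row j-1 and column k-1 corresponds to W^L_jk, the coefficient from the k-th neuron of layer L-1 to the j-th neuron of layer L.

```python
import torch.nn as nn

layer3 = nn.Linear(in_features=5, out_features=4)    # layer 2 (5 neurons) -> layer 3 (4 neurons)
w_3_24 = layer3.weight[2 - 1, 4 - 1]                  # W^3_24: from neuron 4 of layer 2 to neuron 2 of layer 3
```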
  • FIG. 9 is a schematic structural diagram of a neural network processor provided by an embodiment of the application.
•   the neural network processor NPU 90 is mounted on the main CPU (Host CPU) as a coprocessor, and the Host CPU allocates tasks (for example, natural language processing tasks) to it.
•   the core part of the NPU is the arithmetic circuit 903; the arithmetic circuit 903 is controlled by the controller 904 to extract matrix data from the memory and perform multiplication operations.
  • the arithmetic circuit 903 includes multiple processing units (Process Engine, PE). In some implementations, the arithmetic circuit 903 is a two-dimensional systolic array. The arithmetic circuit 903 may also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 903 is a general-purpose matrix processor.
  • the arithmetic circuit fetches the data corresponding to matrix B from the weight memory 902 and caches it on each PE in the arithmetic circuit.
  • the arithmetic circuit takes the matrix A data and matrix B from the input memory 901 to perform matrix operations, and the partial or final result of the obtained matrix is stored in the accumulator 908.
  • the unified memory 906 is used to store input data and output data.
  • the weight data is directly transferred to the weight memory 902 through the direct memory access controller (DMAC) 905.
  • the input data is also transferred to the unified memory 906 through the DMAC.
  • the Bus Interface Unit (BIU) 510 is used for the interaction between the AXI bus and the DMAC and the instruction fetch buffer (Instruction Fetch Buffer) 909.
  • the bus interface unit 510 is also used for the instruction fetch memory 909 to obtain instructions from the external memory, and also used for the storage unit access controller 905 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 906 or the weight data to the weight memory 902 or the input data to the input memory 901.
  • the vector calculation unit 907 has multiple arithmetic processing units, if necessary, further processing the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on.
  • the vector calculation unit 907 can store the processed output vector in the unified buffer 906.
  • the vector calculation unit 907 may apply a nonlinear function to the output of the arithmetic circuit 903, such as a vector of accumulated values, to generate the activation value.
  • the vector calculation unit 907 generates a normalized value, a combined value, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 903, for example for use in a subsequent layer in a neural network.
  • the instruction fetch buffer 909 connected to the controller 904 is used to store instructions used by the controller 904;
  • the unified memory 906, the input memory 901, the weight memory 902, and the fetch memory 909 are all On-Chip memories.
  • each layer in the deep neural network shown in FIG. 3 may be executed by the matrix calculation unit 212 or the vector calculation unit 907.
•   the NPU is used to implement the deep-neural-network-based natural language processing method and training method, which can greatly improve the efficiency with which the data processing device processes natural language processing tasks and trains the deep neural network.
  • FIG. 10 is a block diagram of a partial structure of an intelligent terminal provided by an embodiment of the application.
  • the smart terminal includes: a radio frequency (RF) circuit 1010, a memory 1020, an input unit 1030, a display unit 1040, a sensor 1050, an audio circuit 1060, a wireless fidelity (WiFi) module 1070, a system on chip (System On Chip, SoC) 1080 and power supply 1090 and other components.
•   the memory 1020 includes DDR memory, and of course may also include high-speed random access memory or other storage units such as non-volatile memory, for example at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
  • the structure of the smart terminal shown in FIG. 10 does not constitute a limitation on the smart terminal, and may include more or less components than those shown in the figure, or a combination of certain components, or different component arrangements.
  • the RF circuit 1010 can be used for receiving and sending signals during the process of sending and receiving information or talking. In particular, after receiving the downlink information of the base station, it is processed by SoC 1080; in addition, the designed uplink data is sent to the base station.
  • the RF circuit 1010 includes but is not limited to an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like.
  • the RF circuit 1010 can also communicate with the network and other devices through wireless communication.
  • the above wireless communication can use any communication standard or protocol, including but not limited to Global System of Mobile Communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (Code Division) Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), Email, Short Messaging Service (SMS), etc.
  • the memory 1020 may be used to store software programs and modules.
  • the SoC 1080 runs the software programs and modules stored in the memory 1020 to execute various functional applications and data processing of the smart terminal.
  • the memory 1020 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function (such as a sound playback function, an image playback function, a translation function, a retelling function, etc.), etc.;
  • the data storage area can store data (such as audio data, phone book, etc.) created according to the use of the smart terminal.
  • the input unit 1030 can be used to receive input natural language text and voice data, and generate key signal inputs related to user settings and function control of the smart terminal.
  • the input unit 1030 may include a touch panel 1031 and other input devices 1032.
•   the touch panel 1031, also known as a touch screen, can collect the user's touch operations on or near it (for example, operations performed by the user on or near the touch panel 1031 using a finger, a stylus, or any other suitable object or accessory), and drive the corresponding connection device according to a preset program.
  • the touch panel 1031 is used to receive the natural language text input by the user and input the natural language text into the SoC1080.
  • the touch panel 1031 may include two parts: a touch detection device and a touch controller.
•   the touch detection device detects the user's touch position, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, and then sends it to the SoC 1080, and can also receive commands from the SoC 1080 and execute them.
  • the touch panel 1031 can be realized by various types such as resistive, capacitive, infrared, and surface acoustic wave.
  • the input unit 1030 may also include other input devices 1032.
  • other input devices 1032 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control buttons, switch buttons, etc.), trackball, mouse, joystick, touch screen, microphone, etc.
  • the microphone included in the input device 1032 can receive the voice data input by the user and input the voice data to the SoC1080.
  • the SoC 1080 runs the software programs and modules stored in the memory 1020 to execute the data processing method provided in this application to process the natural language text input by the input unit 1030 to obtain the target result. SoC 1080 may also convert the voice data input by the input unit 1030 into natural language text, and then execute the data processing method provided in this application to process the natural language text to obtain the target result.
  • the display unit 1040 may be used to display information input by the user or information provided to the user and various menus of the smart terminal.
  • the display unit 1040 may include a display panel 1041, and optionally, the display panel 1041 may be configured in the form of a liquid crystal display (Liquid Crystal Display, LCD), an organic light-emitting diode (Organic Light-Emitting Diode, OLED), etc.
  • the display unit 1040 can be used to display the target result obtained by the SoC 1080 processing natural language text. Further, the touch panel 1031 can cover the display panel 1041.
•   when the touch panel 1031 detects a touch operation on or near it, the operation is sent to the SoC 1080 to determine the type of the touch event, and then the SoC 1080 provides corresponding visual output on the display panel 1041 according to the type of the touch event.
•   although in FIG. 10 the touch panel 1031 and the display panel 1041 are used as two independent components to implement the input and output functions of the smart terminal, in some embodiments the touch panel 1031 and the display panel 1041 can be integrated to implement the input and output functions of the smart terminal.
  • the smart terminal may also include at least one sensor 1050, such as a light sensor, a motion sensor, and other sensors.
  • the light sensor can include an ambient light sensor and a proximity sensor.
  • the ambient light sensor can adjust the brightness of the display panel 1041 according to the brightness of the ambient light.
•   the proximity sensor can turn off the display panel 1041 and/or the backlight when the smart terminal is moved to the ear.
•   as a kind of motion sensor, the accelerometer sensor can detect the magnitude of acceleration in various directions (usually three axes), and can detect the magnitude and direction of gravity when stationary; it can be used for applications that recognize the posture of the smart terminal (such as switching between horizontal and vertical screens, related games, and magnetometer posture calibration) and for vibration-recognition related functions (such as a pedometer or tap detection). Other sensors such as gyroscopes, barometers, hygrometers, thermometers, and infrared sensors that can be configured in the smart terminal will not be described here.
  • the audio circuit 1060, the speaker 1061, and the microphone 1062 can provide an audio interface between the user and the smart terminal.
•   the audio circuit 1060 can transmit the electrical signal converted from the received audio data to the speaker 1061, and the speaker 1061 converts it into a sound signal for output; on the other hand, the microphone 1062 converts the collected sound signal into an electrical signal, which is received by the audio circuit 1060 and converted into audio data; the audio data is then output to the SoC 1080 for processing and sent to another smart terminal through the RF circuit 1010, or the audio data is output to the memory 1020 for further processing.
  • WiFi is a short-distance wireless transmission technology.
  • the smart terminal can help users send and receive emails, browse web pages, and access streaming media through the WiFi module 1070. It provides users with wireless broadband Internet access.
•   although FIG. 10 shows the WiFi module 1070, it is understandable that it is not a necessary component of the smart terminal and can be omitted as needed without changing the essence of the invention.
  • SoC 1080 is the control center of the intelligent terminal. It uses various interfaces and lines to connect the various parts of the entire intelligent terminal. By running or executing software programs and/or modules stored in the memory 1020, and calling data stored in the memory 1020, Perform various functions of the smart terminal and process data, thereby monitoring the smart terminal as a whole.
•   the SoC 1080 may include multiple processing units, such as CPUs or various service processors; the SoC 1080 may also integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interfaces, and application programs, and the modem processor mainly handles wireless communication. It is understandable that the above modem processor may not be integrated into the SoC 1080.
  • the smart terminal also includes a power supply 1090 (such as a battery) for supplying power to various components.
  • the power supply can be logically connected to the SoC 1080 through a power management system, so that functions such as charging, discharging, and power management are realized through the power management system.
  • the smart terminal may also include a camera, a Bluetooth module, etc., which will not be repeated here.
  • Fig. 11 is a block diagram of a partial structure of a data processing device provided by an embodiment of the present application.
  • the data processing device 1100 may include a processor 1101, a memory 1102, an input device 1103, an output device 1104, and a bus 1105.
  • the processor 1101, the memory 1102, the input device 1103, and the output device 1104 realize the communication connection between each other through the bus 1105.
  • the processor 1101 may be a general-purpose CPU, a microprocessor, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits for executing related programs, so as to implement the technical solutions provided by the embodiments of the present application.
  • the processor 1101 corresponds to the processing unit 802 in FIG. 8.
  • the memory 1102 may be a read only memory (Read Only Memory, ROM), a static storage device, a dynamic storage device, or a random access memory (Random Access Memory, RAM).
  • the memory 1102 may store an operating system and other application programs.
  • the program code used to implement, through software or firmware, the modules and components of the data processing device provided in the embodiments of the present application, or the program code used to implement the methods provided in the method embodiments of the present application, is stored in the memory 1102; the processor 1101 reads the code in the memory 1102 to execute the operations required by the modules and components included in the data processing device, or to execute the above-mentioned methods provided in the embodiments of the present application.
  • the input device 1103, corresponding to the acquiring unit 801, is used to input natural language text to be processed by the data processing device.
  • the output device 1104, corresponding to the output unit 803, is used to output the target result obtained by the data processing device.
  • the bus 1105 may include a path for transferring information between various components of the data processing device (for example, the processor 1101, the memory 1102, the input device 1103, and the output device 1104).
  • although the data processing device 1100 shown in FIG. 11 only shows the processor 1101, the memory 1102, the input device 1103, the output device 1104, and the bus 1105, those skilled in the art should understand that, in a specific implementation, the data processing device 1100 also includes other devices necessary for normal operation. At the same time, according to specific needs, those skilled in the art should understand that the data processing device 1100 may also include hardware devices that implement other additional functions. In addition, those skilled in the art should understand that the data processing device 1100 may also include only the components necessary to implement the embodiments of the present application, and not necessarily all the components shown in FIG. 11.
  • An embodiment of the present application provides a computer-readable storage medium.
  • the above-mentioned computer-readable storage medium stores a computer program.
  • the above-mentioned computer program includes software program instructions; when the above-mentioned program instructions are executed by a processor in a data processing device, the data processing method and/or training method in the foregoing embodiments is implemented.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a dedicated computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted through the computer-readable storage medium.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media.
  • the usable medium may be a magnetic medium (eg, floppy disk, hard disk, magnetic tape), optical medium (eg, DVD), or semiconductor medium (eg, solid state disk (SSD)), or the like.

Abstract

The present application discloses a natural language processing method, a training method and a data processing device in the field of artificial intelligence. Said method comprises: obtaining a natural language text to be processed; and processing the natural language text by means of a trained deep neural network, and outputting a target result obtained by processing the natural language text, the deep neural network comprising: a granularity labeling network, a first feature network, a second feature network, a first processing network, a second processing network and a fusing network. In the present application, the data processing device uses networks decoupled from one another to process words of different granularities in a natural language text, effectively improving the performance of processing a natural language processing task.

Description

自然语言处理方法、训练方法及数据处理设备Natural language processing method, training method and data processing equipment
本申请要求于2019年01月18日提交中国国家知识产权局、申请号为201910108559.9、申请名称为“自然语言处理方法、训练方法及数据处理设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of a Chinese patent application filed with the State Intellectual Property Office of China, the application number is 201910108559.9, and the application name is "Natural language processing methods, training methods, and data processing equipment" on January 18, 2019. The reference is incorporated in this application.
技术领域Technical field
本申请涉及自然语言处理领域,特别涉及一种自然语言处理方法、训练方法及数据处理设备。This application relates to the field of natural language processing, in particular to a natural language processing method, training method and data processing equipment.
背景技术Background technique
人工智能(Artificial Intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个分支,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式作出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。Artificial Intelligence (AI) is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a similar way to human intelligence. Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
随着人工智能技术的不断发展,越来越多的自然语言处理任务可以采用人工智能技术来实现,例如采用人工智能技术来实现翻译任务。自然语言处理任务可以分为不同的粒度,一般分为字符级(character level)、词语级(word level)、短语级(phrase level)、句子级(sentence level)、篇章级(discourse level)等,这些粒度依次变粗。例如词性标注是词语级任务,命名实体识别(named entity recognition)是短语级任务,句法分析通常是句子级的任务。不同粒度上的信息并不是孤立的,而是相互传递的。例如在做句法分析时,通常也要考虑到词语级和短语级的特征。在一些相对更加复杂的任务中,例如句子的分类、句子与句子之间的语义匹配、句子的翻译或改写,通常需要用到多个粒度上的信息,最后再进行综合。With the continuous development of artificial intelligence technology, more and more natural language processing tasks can be implemented using artificial intelligence technology, for example, using artificial intelligence technology to implement translation tasks. Natural language processing tasks can be divided into different granularities, generally divided into character level, word level, phrase level, sentence level, discourse level, etc. These particle sizes become coarser in turn. For example, part-of-speech tagging is a word-level task, named entity recognition (named entity recognition) is a phrase-level task, and syntactic analysis is usually a sentence-level task. Information at different granularities is not isolated, but is transmitted to each other. For example, when doing syntactic analysis, the word-level and phrase-level features are usually considered. In some relatively more complex tasks, such as sentence classification, sentence-to-sentence semantic matching, sentence translation or rewriting, it is usually necessary to use multiple granular information, and finally synthesize it.
目前主流的基于深度学习的自然语言处理方法是通过神经网络对自然语言文本做处理。在主流的方法中,神经网络在处理过程中对不同粒度的词语的处理是混合在一起的,得到正确的处理结果的概率较低。因此,需要研究新的方案。The current mainstream natural language processing method based on deep learning is to process natural language text through neural networks. In the mainstream method, the neural network processes the words of different granularity in the processing process are mixed together, and the probability of obtaining the correct processing result is low. Therefore, new solutions need to be studied.
发明内容Summary of the invention
本申请实施例提供一种自然语言处理方法、训练方法及数据处理设备,可以避免由较细粒度的信息得到较粗粒度的信息的过程,可以有效改善处理自然语言处理任务的性能。The embodiments of the present application provide a natural language processing method, training method, and data processing device, which can avoid the process of obtaining coarser-grained information from finer-grained information, and can effectively improve the performance of processing natural language processing tasks.
第一方面本申请实施例提供了一种自然语言处理方法,该方法包括:获得待处理的自然语言文本;利用训练得到的深度神经网络对所述自然语言文本做处理,输出处理所述自然语言文本得到的目标结果;其中,所述深度神经网络包括:粒度标注网络、第一特征网络、第二特征网络、第一处理网络、第二处理网络以及融合网络,所述处理包括:利用所述粒度标注网络确定所述自然语言文本中各词语的粒度;利用所述第一特征网络对所述自然语言文本中第一粒度的词语进行特征提取,将得到的第一特征信息输出至所述第一处理 网络;利用所述第二特征网络对所述自然语言文本中第二粒度的词语进行特征提取,将得到的第二特征信息输出至所述第二处理网络;利用所述第一处理网络对所述第一特征信息做处理,将得到的第一处理结果输出至所述融合网络;利用所述第二处理网络对所述第二特征信息做所述处理,将得到的第二处理结果输出至所述融合网络;利用所述融合网络融合所述第一处理结果和所述第二处理结果得到所述目标结果;所述第一粒度和所述第二粒度不同。In the first aspect, the embodiments of the present application provide a natural language processing method, which includes: obtaining natural language text to be processed; processing the natural language text using a deep neural network obtained by training, and output processing the natural language text The target result obtained from the text; wherein the deep neural network includes: a granular annotation network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network, and the processing includes: using the The granularity tagging network determines the granularity of each word in the natural language text; using the first feature network to perform feature extraction on the first granular words in the natural language text, and output the obtained first feature information to the first feature information A processing network; using the second feature network to perform feature extraction on words with a second granularity in the natural language text, and output the obtained second feature information to the second processing network; using the first processing network Process the first characteristic information, and output the obtained first processing result to the fusion network; use the second processing network to perform the processing on the second characteristic information, and obtain the second processing result Output to the fusion network; use the fusion network to fuse the first processing result and the second processing result to obtain the target result; the first granularity and the second granularity are different.
该深度神经网络可以包括N个特征网络以及N个处理网络,该N个特征网络以及该N个处理网络一一对应,N为大于1的整数。一对相对应的特征网络和处理网络用于处理同一粒度的词语。由于数据处理设备将不同粒度的词语分开进行处理,对各粒度的词语所做的处理操作不依赖于其他粒度的词语的处理结果,这就避免了由较细粒度的信息得到较粗粒度的信息的过程,从而大大降低该数据处理设备得到错误结果的概率。The deep neural network may include N feature networks and N processing networks. The N feature networks and the N processing networks have a one-to-one correspondence, and N is an integer greater than one. A pair of corresponding feature networks and processing networks are used to process words of the same granularity. Since the data processing equipment separates words of different granularities for processing, the processing operations for words of each granularity do not depend on the processing results of words of other granularities, which avoids obtaining coarser-grained information from finer-grained information This process greatly reduces the probability that the data processing device will get wrong results.
本申请实施例中,数据处理设备利用深度神经网络独立处理不同粒度的词语,避免了由较细粒度的信息得到较粗粒度的信息的过程,可以有效提高处理自然处理任务的性能。In the embodiments of the present application, the data processing device uses a deep neural network to independently process words of different granularity, avoiding the process of obtaining coarser-grained information from finer-grained information, and can effectively improve the performance of processing natural processing tasks.
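As an illustration only, the following is a minimal PyTorch-style sketch of how the decoupled pipeline described above could be wired together: a granularity tagger, one feature network and one processing network per granularity, and a fusion step. The module choices, names, and dimensions are assumptions for exposition and are not the reference implementation of this application.

```python
# Minimal sketch (assumed shapes/names) of the decoupled per-granularity pipeline.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiGranularityNet(nn.Module):
    def __init__(self, vocab_size, d_model=256, num_granularities=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # simplified stand-in for the granularity labeling network
        # (the BiLSTM/LSTM tagger of the optional implementation is sketched further below)
        self.granularity_scorer = nn.Linear(d_model, num_granularities)
        # one feature network and one processing network per granularity (decoupled branches)
        self.feature_nets = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(num_granularities))
        self.processing_nets = nn.ModuleList(
            nn.LSTM(d_model, d_model, batch_first=True)
            for _ in range(num_granularities))
        self.fusion = nn.Linear(num_granularities * d_model, vocab_size)

    def forward(self, token_ids):
        x = self.embed(token_ids)                                           # [B, L, d]
        z = F.gumbel_softmax(self.granularity_scorer(x), tau=1.0, dim=-1)   # per-word granularity
        branch_outputs = []
        for k, (feat, proc) in enumerate(zip(self.feature_nets, self.processing_nets)):
            xk = x * z[..., k:k + 1]      # each branch only "sees" words of its own granularity
            uk = feat(xk)                 # first / second feature information
            vk, _ = proc(uk)              # first / second processing result
            branch_outputs.append(vk)
        fused = torch.cat(branch_outputs, dim=-1)                           # fusion network input
        return self.fusion(fused)                                           # target result (token scores)
```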
在一个可选的实现方式中,所述第一特征网络和所述第二特征网络的架构不同,和/或,所述第一处理网络和所述第二处理网络的架构不同。In an optional implementation manner, the architecture of the first characteristic network and the second characteristic network are different, and/or the architecture of the first processing network and the second processing network are different.
不同粒度的词语的特征不同,采用不同架构的网络来处理不同粒度的词语,可以更有针对性的处理不同粒度的词语。Words with different granularities have different characteristics. Using networks with different architectures to process words with different granularities can more specifically process words with different granularities.
在该实现方式中,通过不同架构的特征网络或不同架构的处理网络来处理不同粒度的词语,进一步提升数据处理设备处理自然语言处理任务的性能。In this implementation manner, words of different granularities are processed through feature networks of different architectures or processing networks of different architectures, which further improves the performance of the data processing device in processing natural language processing tasks.
在一个可选的实现方式中,所述粒度标注网络的输入为所述自然语言文本,所述利用所述粒度标注网络确定所述自然语言文本中各词语的粒度包括:利用所述粒度标注网络按照N种粒度确定所述自然语言文本中每个词语的粒度以得到所述自然语言文本的标注信息,向所述第一特征网络和所述第二特征网络输出所述标注信息;其中,所述标注信息用于描述所述每个词语的粒度或者所述每个词语分别属于所述N种粒度的概率;N为大于1的整数;In an optional implementation manner, the input of the granular annotation network is the natural language text, and the using the granular annotation network to determine the granularity of each word in the natural language text includes: using the granular annotation network Determine the granularity of each word in the natural language text according to N granularities to obtain the annotation information of the natural language text, and output the annotation information to the first feature network and the second feature network; wherein, The label information is used to describe the granularity of each word or the probability that each word belongs to the N granularities; N is an integer greater than 1;
所述利用所述第一特征网络对所述自然语言文本中第一粒度的词语进行特征提取包括:利用所述第一特征网络处理所述第一粒度的词语以得到所述第一特征信息,所述第一特征信息为表示所述第一粒度的词语的向量或矩阵;The using the first feature network to perform feature extraction on the words of the first granularity in the natural language text includes: using the first feature network to process the words of the first granularity to obtain the first feature information, The first feature information is a vector or matrix representing words of the first granularity;
所述利用所述第二特征网络对所述自然语言文本中第二粒度的词语进行特征提取包括:利用所述第二特征网络处理所述第二粒度的词语以得到所述第二特征信息,所述述第二特征信息为表示所述第二粒度的词语的向量或矩阵。The using the second feature network to perform feature extraction on the words of the second granularity in the natural language text includes: using the second feature network to process the words of the second granularity to obtain the second feature information, The second feature information is a vector or matrix representing words of the second granularity.
在该实现方式中,粒度标注网络可以准确地确定自然语言文本中各词语的粒度,以便于各特征网络处理特定粒度的词语。In this implementation, the granular annotation network can accurately determine the granularity of each word in the natural language text, so that each feature network can process words with a specific granularity.
在一个可选的实现方式中,所述粒度标注网络包括长短期记忆网络LSTM和双向长短期记忆网络BiLSTM;所述利用所述粒度标注网络确定所述自然语言文本中各词语的粒度包括:In an optional implementation manner, the granular labeling network includes a long and short-term memory network LSTM and a bidirectional long short-term memory network BiLSTM; and the using the granular labeling network to determine the granularity of each word in the natural language text includes:
利用所述粒度标注网络采用如下公式确定所述自然语言文本中各词语的粒度:The granularity labeling network is used to determine the granularity of each word in the natural language text using the following formula:
h_l = BiLSTM([x_l; h_{l-1}, h_{l+1}]);

g_l = LSTM([h_l, z_{l-1}; g_{l-1}]);

z_l = GS(W_g g_l, τ);
wherein BiLSTM() in the formulas represents the processing operation of the BiLSTM, and LSTM() represents the processing operation of the LSTM; x represents a word in the natural language text, and x_l represents the l-th word in the natural language text x; h represents a hidden state variable in the BiLSTM network, and h_l, h_{l-1}, h_{l+1} respectively represent the hidden state variables when the BiLSTM network processes the l-th, (l-1)-th, and (l+1)-th words in the natural language text; g represents a hidden state variable in the LSTM network, and g_l, g_{l-1} respectively represent the hidden state variables when the LSTM network processes the l-th and (l-1)-th words in the natural language text; z represents the probability that a word belongs to the reference granularity, and z_{l-1}, z_l respectively represent the probabilities that the (l-1)-th and l-th words in the natural language text belong to the reference granularity, the reference granularity being any one of the N granularities; GS represents the Gumbel Softmax function, τ is a hyperparameter (temperature) of the Gumbel Softmax function, and W_g is a parameter matrix, that is, a parameter matrix in the granularity labeling network.
在该实现方式中,粒度标注网络使用多层LSTM网络的架构来确定自然语言文本中各词语的粒度,可以充分利用已确定的词语的粒度来确定新的词语(待确定粒度的词语)的粒度,实现简单,处理效率高。In this implementation, the granular annotation network uses the architecture of a multi-layer LSTM network to determine the granularity of each word in the natural language text, and can make full use of the granularity of the determined word to determine the granularity of the new word (word of the granularity to be determined) , Simple implementation and high processing efficiency.
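For illustration, the following sketch implements the tagger formulas above (h_l = BiLSTM([x_l; h_{l-1}, h_{l+1}]), g_l = LSTM([h_l, z_{l-1}; g_{l-1}]), z_l = GS(W_g g_l, τ)) with standard PyTorch building blocks; the hidden sizes, batch-first layout, and even d_model are assumptions rather than the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GranularityTagger(nn.Module):
    """Sketch of the two-layer tagger: a BiLSTM over the sentence followed by a
    unidirectional LSTM that also consumes the previous granularity decision z_{l-1}."""
    def __init__(self, d_model, num_granularities, tau=1.0):
        super().__init__()
        self.bilstm = nn.LSTM(d_model, d_model // 2, bidirectional=True, batch_first=True)
        self.cell = nn.LSTMCell(d_model + num_granularities, d_model)
        self.W_g = nn.Linear(d_model, num_granularities, bias=False)
        self.num_granularities = num_granularities
        self.tau = tau

    def forward(self, x):                        # x: [B, L, d_model] word embeddings
        h, _ = self.bilstm(x)                    # h_l = BiLSTM([x_l; h_{l-1}, h_{l+1}])
        batch, length, _ = h.shape
        g = x.new_zeros(batch, self.cell.hidden_size)
        c = x.new_zeros(batch, self.cell.hidden_size)
        z_prev = x.new_zeros(batch, self.num_granularities)
        zs = []
        for l in range(length):
            g, c = self.cell(torch.cat([h[:, l], z_prev], dim=-1), (g, c))  # g_l
            z_prev = F.gumbel_softmax(self.W_g(g), tau=self.tau, dim=-1)    # z_l = GS(W_g g_l, tau)
            zs.append(z_prev)
        return torch.stack(zs, dim=1)            # [B, L, N] per-word granularity probabilities
```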
在一个可选的实现方式中,所述利用所述第一特征网络对所述自然语言文本中第一粒度的词语进行特征提取包括:In an optional implementation manner, the using the first feature network to perform feature extraction on words with a first granularity in the natural language text includes:
利用所述第一特征网络采用如下公式对所述自然语言文本中第一粒度的词语进行特征提取:Use the first feature network to use the following formula to perform feature extraction on words of the first granularity in the natural language text:
U_z = ENC_z(X, Z_X);

wherein ENC_z represents the first feature network, which is a Transformer model; ENC_z() represents the processing operation performed by the first feature network; X represents the natural language text; Z_X = [z1, z2, ..., zL] represents the annotation information, and z1 to zL sequentially represent the granularities of the first word to the L-th (last) word in the natural language text; U_z represents the first feature information output by the first feature network.
在该实现方式中,利用特征网络可以准确、快速地提取出相应粒度的词语的特征信息。In this implementation, the feature network can be used to accurately and quickly extract the feature information of the corresponding granular words.
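As a sketch of one feature network ENC_z, under the assumption that it is a small Transformer encoder as in the optional implementation above, the granularity probabilities from the annotation information can be used to suppress words of other granularities; the weighting scheme and layer sizes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GranularityFeatureNet(nn.Module):
    """Sketch of one feature network ENC_z: a small Transformer encoder that extracts
    feature information for the words assigned to a single granularity."""
    def __init__(self, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x, z_k):
        # x:   [B, L, d_model] embedded natural language text X
        # z_k: [B, L] probability (from the annotation information Z_X) that each word
        #      belongs to this network's granularity
        x = x * z_k.unsqueeze(-1)     # suppress words of other granularities
        return self.encoder(x)        # U_z: feature information for this granularity
```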
在一个可选的实现方式中,所述第一处理结果为包含一个或多个词语的序列,所述利用所述第一处理网络对所述第一特征信息做处理包括:利用所述第一处理网络对输入的所述第一特征信息和所述第一处理网络在处理所述第一特征信息的过程中已输出的词语做处理以得到所述第一处理结果。In an optional implementation manner, the first processing result is a sequence containing one or more words, and the processing of the first characteristic information using the first processing network includes: using the first The processing network processes the input first feature information and the words that have been output by the first processing network in the process of processing the first feature information to obtain the first processing result.
在该实现方式中,第一处理网络采用递归的方式来处理对应特征网络输出的特征信息,可以充分利用自然语言文本中各词语的相关性,进而提高处理的效率和准确性。In this implementation manner, the first processing network adopts a recursive manner to process the feature information output by the corresponding feature network, which can make full use of the relevance of each word in the natural language text, thereby improving the efficiency and accuracy of processing.
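The recursive behaviour described above, where a processing network consumes its feature information together with the words it has already output, can be sketched as a simple greedy LSTM decoder; the pooled context, special-token id, and maximum output length below are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class ProcessingNet(nn.Module):
    """Sketch of one processing network: an LSTM decoder that reads the feature
    information U_z and feeds back the words it has already output (greedy decoding)."""
    def __init__(self, vocab_size, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.cell = nn.LSTMCell(2 * d_model, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, u_z, max_len=20, bos_id=1):
        # u_z: [B, L, d_model] feature information from the matching feature network
        batch = u_z.size(0)
        context = u_z.mean(dim=1)                      # simple pooled context (assumption)
        h = u_z.new_zeros(batch, self.cell.hidden_size)
        c = u_z.new_zeros(batch, self.cell.hidden_size)
        prev = torch.full((batch,), bos_id, dtype=torch.long, device=u_z.device)
        step_logits = []
        for _ in range(max_len):
            h, c = self.cell(torch.cat([self.embed(prev), context], dim=-1), (h, c))
            logits = self.out(h)
            prev = logits.argmax(dim=-1)               # feed back the word just output
            step_logits.append(logits)
        return torch.stack(step_logits, dim=1)         # per-step word distributions
```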
在一个可选的实现方式中,所述融合网络输出的所述目标结果为包含一个或多个词语的序列,所述利用所述融合网络融合所述第一处理结果和所述第二处理结果得到所述目标结果包括:利用所述融合网络处理所述第一处理结果、所述第二处理结果以及所述融合网络在处理所第一处理结果和所述第二处理结果的过程中已输出的词语以确定待输出目标词语,输出所述目标词语。In an optional implementation manner, the target result output by the fusion network is a sequence containing one or more words, and the fusion network is used to fuse the first processing result and the second processing result Obtaining the target result includes: using the fusion network to process the first processing result, the second processing result, and the fusion network has outputted in the process of processing the first processing result and the second processing result To determine the target words to be output, output the target words.
在该实现方式中,融合网络采用递归的方式来处理各处理网络向其输入的处理结果,可以充分利用自然语言文本中各词语的相关性,进而提高其处理的效率和准确性。In this implementation, the fusion network uses a recursive method to process the processing results input to it by each processing network, which can make full use of the relevance of each word in the natural language text, thereby improving the efficiency and accuracy of its processing.
在一个可选的实现方式中,所述融合网络包括至少一个LSTM网络,所述利用所述融合网络处理所述第一处理结果、所述第二处理结果以及所述融合网络在处理所第一处理结果和所述第二处理结果的过程中已输出的序列以确定待输出目标词语包括:In an optional implementation manner, the converged network includes at least one LSTM network, and the converged network is used to process the first processing result, the second processing result, and the converged network is processing the first The processing result and the sequence output in the process of the second processing result to determine the target word to be output include:
将所述第一处理结果和所述第二处理结果合并得到的向量输入至所述LSTM网络;Input the vector obtained by merging the first processing result and the second processing result to the LSTM network;
利用所述LSTM网络采用如下公式计算待输出参考粒度的词语的概率:The LSTM network uses the following formula to calculate the probability of a word with a reference granularity to be output:
h_t = LSTM(h_{t-1}, y_{t-1}, v0, v1);

P(z_t | y_{1:t-1}, X) = GS(W_z h_t, τ);

wherein h_t represents the hidden state variable in the LSTM network when the LSTM network processes the t-th word, h_{t-1} represents the hidden state variable in the LSTM network when the LSTM network processes the (t-1)-th word, LSTM() represents the processing operation performed by the LSTM, the LSTM network has currently output (t-1) words, y_{t-1} represents the (t-1)-th word output by the fusion network, v0 represents the first processing result, v1 represents the second processing result, W_z is a parameter matrix in the fusion network, τ is a hyperparameter, P(z_t | y_{1:t-1}, X) is the probability that the word currently to be output is of the reference granularity (granularity z), and t is an integer greater than 1.
利用所述融合网络采用如下公式计算待输出所述目标词语的概率:Use the fusion network to calculate the probability of the target word to be output using the following formula:
P(y_t | y_{1:t-1}, X) = Σ_z P(z_t = z | y_{1:t-1}, X) · P_z(y_t | y_{1:t-1}, X);

wherein P_{z_t}(y_t | y_{1:t-1}, X) represents the probability of outputting the target word y_t at the reference granularity, and P(y_t | y_{1:t-1}, X) represents the probability of outputting the target word.

P_{z_t}(y_t | y_{1:t-1}, X) can be given by the processing networks: the processing network of granularity z can input to the fusion network the probability of each of the words (of granularity z) it currently intends to output. The fusion network can then compute, for each candidate word, the probability of that word being output, and output the word with the highest probability (the target word).
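A sketch of one fusion step under these formulas: an LSTM cell consumes the previously output word y_{t-1} and the processing results v0, v1, a granularity distribution is drawn with Gumbel Softmax, and the per-granularity word distributions are mixed. Tensor shapes and the greedy output rule are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionNet(nn.Module):
    """Sketch of one step of the fusion network: an LSTM cell over the previous output
    word and the per-granularity processing results, a Gumbel Softmax choice of
    granularity, and a mixture of the per-granularity word distributions."""
    def __init__(self, vocab_size, d_model=256, num_granularities=2, tau=1.0):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.cell = nn.LSTMCell((1 + num_granularities) * d_model, d_model)
        self.W_z = nn.Linear(d_model, num_granularities, bias=False)
        self.tau = tau

    def step(self, y_prev, v_list, p_word_list, state):
        # y_prev:      [B] word output at step t-1
        # v_list:      per-granularity processing results v0, v1, ..., each [B, d_model]
        # p_word_list: per-granularity distributions P_z(y_t | y_{1:t-1}, X), each [B, vocab]
        h, c = state
        h, c = self.cell(torch.cat([self.embed(y_prev)] + v_list, dim=-1), (h, c))  # h_t
        p_z = F.gumbel_softmax(self.W_z(h), tau=self.tau, dim=-1)   # P(z_t | y_{1:t-1}, X)
        # P(y_t | y_{1:t-1}, X) = sum_z P(z_t = z | ...) * P_z(y_t | ...)
        p_y = sum(p_z[:, k:k + 1] * p_word_list[k] for k in range(len(p_word_list)))
        return p_y.argmax(dim=-1), (h, c)               # emit the most probable target word
```

Decoding would call step() repeatedly, feeding back the word emitted at each step, which mirrors the recursive fusion described above.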
第二方面本申请实施例提供了一种训练方法,该方法包括:将训练样本输入至深度神经网络做处理,得到预测处理结果;其中,所述深度神经网络包括:粒度标注网络、第一特征网络、第二特征网络、第一处理网络、第二处理网络以及融合网络,所述处理包括:利用所述粒度标注网络确定所述训练样本中各词语的粒度;利用所述第一特征网络对所述训练样本中第一粒度的词语进行特征提取,将得到的第三特征信息输出至所述第一处理网络;利用所述第二特征网络对所述训练样本中第二粒度的词语进行特征提取,将得到的第四特征信息输出至所述第二处理网络;利用所述第一处理网络对所述第三特征信息做目标处理,将得到的第三处理结果输出至所述融合网络;利用所述第二处理网络对所述第四特征信息做所述目标处理,将得到的第四处理结果输出至所述融合网络;利用所述融合网络融合所述第三处理结果和所述第四处理结果得到所述预测处理结果;所述第一粒度和所述第二粒度不同;根据所述预测处理结果和标准结果,确定所述训练样本对应的损失;所述标准结果为利用所述深度神经网络处理所述训练样本期望得到的处理结果;利用所述训练样本对应的损失,通过优化算法更新所述深度神经网络的参数。In the second aspect, the embodiments of the present application provide a training method, which includes: inputting training samples into a deep neural network for processing to obtain a prediction processing result; wherein the deep neural network includes: a granular annotation network, a first feature Network, a second feature network, a first processing network, a second processing network, and a fusion network. The processing includes: using the granularity labeling network to determine the granularity of each word in the training sample; using the first feature network to Perform feature extraction on words of the first granularity in the training sample, and output the obtained third feature information to the first processing network; use the second feature network to feature words of the second granularity in the training sample Extracting, outputting the obtained fourth characteristic information to the second processing network; using the first processing network to perform target processing on the third characteristic information, and outputting the obtained third processing result to the fusion network; Use the second processing network to perform the target processing on the fourth characteristic information, and output the obtained fourth processing result to the fusion network; use the fusion network to fuse the third processing result and the first Four processing results obtain the prediction processing result; the first granularity and the second granularity are different; according to the prediction processing result and the standard result, the loss corresponding to the training sample is determined; the standard result is using the The deep neural network processes the expected processing result of the training sample; using the loss corresponding to the training sample, the parameters of the deep neural network are updated through an optimization algorithm.
本申请实施例中,数据处理设备训练可以独立处理不同粒度的词语的深度神经网络,以便于得到能够避免由较细粒度的信息得到较粗粒度的信息的过程的深度神经网络,实现简单。In the embodiments of the present application, the data processing device trains a deep neural network that can independently process words of different granularities, so as to obtain a deep neural network that can avoid the process of obtaining coarser-grained information from finer-grained information, and is simple to implement.
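A minimal sketch of one training step as described in this aspect, assuming a cross-entropy loss between the prediction processing result and the standard result and a standard gradient-based optimizer; the function and argument names are illustrative only.

```python
import torch.nn.functional as F

def train_step(model, optimizer, sample_ids, target_ids, pad_id=0):
    logits = model(sample_ids)                         # prediction processing result [B, L, vocab]
    loss = F.cross_entropy(                            # loss between prediction and standard result
        logits.reshape(-1, logits.size(-1)),
        target_ids.reshape(-1),
        ignore_index=pad_id)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                   # optimization-algorithm parameter update
    return loss.item()
```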
在一个可选的实现方式中,所述第一特征网络和所述第二特征网络的架构不同,和/或,所述第一处理网络和所述第二处理网络的架构不同。In an optional implementation manner, the architecture of the first characteristic network and the second characteristic network are different, and/or the architecture of the first processing network and the second processing network are different.
在一个可选的实现方式中,所述粒度标注网络的输入为所述自然语言文本,所述利用所述粒度标注网络确定所述自然语言文本中各词语的粒度包括:利用所述粒度标注网络按照N种粒度确定所述自然语言文本中每个词语的粒度以得到所述自然语言文本的标注信息,向所述第一特征网络和所述第二特征网络输出所述标注信息;其中,所述标注信息用于描述所述每个词语的粒度或者所述每个词语分别属于所述N种粒度的概率;N为大于1的整数;In an optional implementation manner, the input of the granular annotation network is the natural language text, and the using the granular annotation network to determine the granularity of each word in the natural language text includes: using the granular annotation network Determine the granularity of each word in the natural language text according to N granularities to obtain the annotation information of the natural language text, and output the annotation information to the first feature network and the second feature network; wherein, The label information is used to describe the granularity of each word or the probability that each word belongs to the N granularities; N is an integer greater than 1;
所述利用所述第一特征网络对所述自然语言文本中第一粒度的词语进行特征提取包括:利用所述第一特征网络处理所述第一粒度的词语以得到所述第三特征信息,所述第三特征信息为表示所述第一粒度的词语的向量或矩阵;The using the first feature network to perform feature extraction on the words of the first granularity in the natural language text includes: using the first feature network to process the words of the first granularity to obtain the third feature information, The third feature information is a vector or matrix representing words of the first granularity;
所述利用所述第二特征网络对所述自然语言文本中第二粒度的词语进行特征提取包括:利用所述第二特征网络处理所述第二粒度的词语以得到所述第四特征信息,所述述第四特征信息为表示所述第二粒度的词语的向量或矩阵。The using the second feature network to perform feature extraction on the words of the second granularity in the natural language text includes: using the second feature network to process the words of the second granularity to obtain the fourth feature information, The fourth feature information is a vector or matrix representing words of the second granularity.
在一个可选的实现方式中,所述粒度标注网络包括长短期记忆网络LSTM和双向长短期记忆网络BiLSTM;所述利用所述粒度标注网络确定所述自然语言文本中各词语的粒度包括:In an optional implementation manner, the granular labeling network includes a long and short-term memory network LSTM and a bidirectional long short-term memory network BiLSTM; and the using the granular labeling network to determine the granularity of each word in the natural language text includes:
利用所述粒度标注网络采用如下公式确定所述自然语言文本中各词语的粒度:The granularity labeling network is used to determine the granularity of each word in the natural language text using the following formula:
h_l = BiLSTM([x_l; h_{l-1}, h_{l+1}]);

g_l = LSTM([h_l, z_{l-1}; g_{l-1}]);

z_l = GS(W_g g_l, τ);
wherein BiLSTM() in the formulas represents the processing operation of the BiLSTM, and LSTM() represents the processing operation of the LSTM; x represents a word in the natural language text, and x_l represents the l-th word in the natural language text x; h represents a hidden state variable in the BiLSTM network, and h_l, h_{l-1}, h_{l+1} respectively represent the hidden state variables when the BiLSTM network processes the l-th, (l-1)-th, and (l+1)-th words in the natural language text; g represents a hidden state variable in the LSTM network, and g_l, g_{l-1} respectively represent the hidden state variables when the LSTM network processes the l-th and (l-1)-th words in the natural language text; z represents the probability that a word belongs to the reference granularity, and z_{l-1}, z_l respectively represent the probabilities that the (l-1)-th and l-th words in the natural language text belong to the reference granularity, the reference granularity being any one of the N granularities; GS represents the Gumbel Softmax function, τ is a hyperparameter (temperature) of the Gumbel Softmax function, and W_g is a parameter matrix, that is, a parameter matrix in the granularity labeling network.
在该实现方式中,粒度标注网络使用多层LSTM网络的架构来确定自然语言文本中各词语的粒度,可以充分利用已确定的词语的粒度来确定新的词语(待确定粒度的词语)的粒度,实现简单,处理效率高。In this implementation, the granular annotation network uses the architecture of a multi-layer LSTM network to determine the granularity of each word in the natural language text, and can make full use of the granularity of the determined word to determine the granularity of the new word (word of the granularity to be determined) , Simple implementation and high processing efficiency.
在一个可选的实现方式中,所述利用所述第一特征网络对所述自然语言文本中第一粒度的词语进行特征提取包括:In an optional implementation manner, the using the first feature network to perform feature extraction on words with a first granularity in the natural language text includes:
利用所述第一特征网络采用如下公式对所述自然语言文本中第一粒度的词语进行特征提取:Use the first feature network to use the following formula to perform feature extraction on words of the first granularity in the natural language text:
U_z = ENC_z(X, Z_X);

wherein ENC_z represents the first feature network, which is a Transformer model; ENC_z() represents the processing operation performed by the first feature network; X represents the natural language text; Z_X = [z1, z2, ..., zL] represents the annotation information, and z1 to zL sequentially represent the granularities of the first word to the L-th (last) word in the natural language text; U_z represents the third feature information output by the first feature network.
在一个可选的实现方式中,所述第三处理结果为包含一个或多个词语的序列,所述利用所述第一处理网络对所述第三特征信息做处理包括:利用所述第一处理网络对输入的所述第三特征信息和所述第一处理网络在处理所述第三特征信息的过程中已输出的词语做处理以得到所述第三处理结果。In an optional implementation manner, the third processing result is a sequence containing one or more words, and the processing of the third characteristic information using the first processing network includes: using the first processing network The processing network processes the input third characteristic information and the words that have been output by the first processing network in the process of processing the third characteristic information to obtain the third processing result.
在一个可选的实现方式中,所述融合网络输出的所述目标结果为包含一个或多个词语的序列,所述利用所述融合网络融合所述第三处理结果和所述第四处理结果得到所述目标结果包括:利用所述融合网络处理所述第三处理结果、所述第四处理结果以及所述融合网络在处理所第三处理结果和所述第四处理结果的过程中已输出的词语以确定待输出目标词语,输出所述目标词语。In an optional implementation manner, the target result output by the fusion network is a sequence containing one or more words, and the fusion network is used to fuse the third processing result and the fourth processing result Obtaining the target result includes: using the fusion network to process the third processing result, the fourth processing result, and the fusion network has output in the process of processing the third processing result and the fourth processing result To determine the target words to be output, output the target words.
在一个可选的实现方式中,所述融合网络包括至少一个LSTM网络,所述利用所述融合网络处理所述第三处理结果、所述第四处理结果以及所述融合网络在处理所第三处理结果和所述第四处理结果的过程中已输出的序列以确定待输出目标词语包括:In an optional implementation manner, the converged network includes at least one LSTM network, and the converged network is used to process the third processing result, the fourth processing result, and the third processing result of the converged network. The processing result and the sequence output in the process of the fourth processing result to determine the target word to be output include:
将所述第三处理结果和所述第四处理结果合并得到的向量输入至所述LSTM网络;Input the vector obtained by merging the third processing result and the fourth processing result to the LSTM network;
利用所述LSTM网络采用如下公式计算待输出参考粒度的词语的概率:The LSTM network uses the following formula to calculate the probability of a word with a reference granularity to be output:
h_t = LSTM(h_{t-1}, y_{t-1}, v2, v3);

P(z_t | y_{1:t-1}, X) = GS(W_z h_t, τ);

wherein h_t represents the hidden state variable in the LSTM network when the LSTM network processes the t-th word, h_{t-1} represents the hidden state variable in the LSTM network when the LSTM network processes the (t-1)-th word, LSTM() represents the processing operation performed by the LSTM, the LSTM network has currently output (t-1) words, y_{t-1} represents the (t-1)-th word output by the fusion network, v2 represents the third processing result, v3 represents the fourth processing result, W_z is a parameter matrix in the fusion network, τ is a hyperparameter, P(z_t | y_{1:t-1}, X) is the probability that the word currently to be output is of the reference granularity (granularity z), and t is an integer greater than 1.
利用所述融合网络采用如下公式计算待输出所述目标词语的概率:Use the fusion network to calculate the probability of the target word to be output using the following formula:
P(y_t | y_{1:t-1}, X) = Σ_z P(z_t = z | y_{1:t-1}, X) · P_z(y_t | y_{1:t-1}, X);

wherein P_{z_t}(y_t | y_{1:t-1}, X) represents the probability of outputting the target word y_t at the reference granularity, and P(y_t | y_{1:t-1}, X) represents the probability of outputting the target word.
在一个可选的实现方式中,所述利用所述训练样本对应的损失,通过优化算法更新所述深度神经网络的参数包括:In an optional implementation manner, the using the loss corresponding to the training sample to update the parameters of the deep neural network through an optimization algorithm includes:
The parameters of at least one network included in the deep neural network are updated by using the gradient values of a loss function with respect to the at least one network; the loss function is used to calculate the loss between the prediction processing result and the standard result. During the updating of any one of the first feature network, the second feature network, the first processing network, and the second processing network, the parameters of each of the other three networks remain unchanged.
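A sketch of this alternating update, in which only one of the feature/processing networks is updated while the parameters of the other three are held fixed; the sub-module naming and the freezing mechanism via requires_grad are assumptions, not the patent's prescribed procedure.

```python
def update_one_subnetwork(model, optimizer, batch, loss_fn, active_prefix="feature_nets.0"):
    # Freeze every parameter except those of the sub-network currently being updated,
    # so the other three networks keep their parameters unchanged during this step.
    for name, param in model.named_parameters():
        param.requires_grad_(name.startswith(active_prefix))
    logits = model(batch["input_ids"])            # prediction processing result
    loss = loss_fn(logits, batch["target_ids"])   # loss w.r.t. the standard result
    optimizer.zero_grad(set_to_none=True)
    loss.backward()                               # gradients flow only to the active sub-network
    optimizer.step()
    return loss.item()
```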
第三方面本申请实施例提供了一种数据处理设备,该数据处理设备包括:获取单元,用于获得待处理的自然语言文本;处理单元,用于利用训练得到的深度神经网络对所述自然语言文本做处理;其中,所述深度神经网络包括:粒度标注网络、第一特征网络、第二 特征网络、第一处理网络、第二处理网络以及融合网络,所述处理包括:利用所述粒度标注网络确定所述自然语言文本中各词语的粒度;利用所述第一特征网络对所述自然语言文本中第一粒度的词语进行特征提取,将得到的第一特征信息输出至所述第一处理网络;利用所述第二特征网络对所述自然语言文本中第二粒度的词语进行特征提取,将得到的第二特征信息输出至所述第二处理网络;利用所述第一处理网络对所述第一特征信息做处理,将得到的第一处理结果输出至所述融合网络;利用所述第二处理网络对所述第二特征信息做所述处理,将得到的第二处理结果输出至所述融合网络;利用所述融合网络融合所述第一处理结果和所述第二处理结果得到所述目标结果;所述第一粒度和所述第二粒度不同;输出单元,用于输出处理所述自然语言文本得到的目标结果。In the third aspect, the embodiments of the application provide a data processing device. The data processing device includes: an acquisition unit for obtaining natural language texts to be processed; a processing unit for processing the natural language text obtained by training using a deep neural network; Language and text are processed; wherein the deep neural network includes: a granular annotation network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network, and the processing includes: using the granularity The tagging network determines the granularity of each word in the natural language text; using the first feature network to perform feature extraction on words with the first granularity in the natural language text, and output the obtained first feature information to the first Processing network; using the second feature network to perform feature extraction on words of the second granularity in the natural language text, and output the obtained second feature information to the second processing network; using the first processing network to The first characteristic information is processed, and the obtained first processing result is output to the fusion network; the second processing network is used to perform the processing on the second characteristic information, and the obtained second processing result is output To the fusion network; use the fusion network to fuse the first processing result and the second processing result to obtain the target result; the first granularity and the second granularity are different; an output unit for outputting The target result obtained by processing the natural language text.
本申请实施例中,数据处理设备利用深度神经网络独立处理不同粒度的词语,避免了由较细粒度的信息得到较粗粒度的信息的过程,可以有效提高处理自然处理任务的性能。In the embodiments of the present application, the data processing device uses a deep neural network to independently process words of different granularity, avoiding the process of obtaining coarser-grained information from finer-grained information, and can effectively improve the performance of processing natural processing tasks.
在一个可选的实现方式中,所述第一特征网络和所述第二特征网络的架构不同,和/或,所述第一处理网络和所述第二处理网络的架构不同。In an optional implementation manner, the architecture of the first characteristic network and the second characteristic network are different, and/or the architecture of the first processing network and the second processing network are different.
在一个可选的实现方式中,所述粒度标注网络的输入为所述自然语言文本;所述处理单元,具体用于利用所述粒度标注网络按照N种粒度确定所述自然语言文本中每个词语的粒度以得到所述自然语言文本的标注信息,向所述第一特征网络和所述第二特征网络输出所述标注信息;其中,所述标注信息用于描述所述每个词语的粒度或者所述每个词语分别属于所述N种粒度的概率;N为大于1的整数;In an optional implementation manner, the input of the granular annotation network is the natural language text; the processing unit is specifically configured to use the granular annotation network to determine each of the natural language texts according to N types of granularities. The granularity of words is used to obtain the annotation information of the natural language text, and the annotation information is output to the first feature network and the second feature network; wherein the annotation information is used to describe the granularity of each word Or the probability that each word belongs to the N types of granularities; N is an integer greater than 1;
所述处理单元,具体用于利用所述第一特征网络处理所述第一粒度的词语以得到所述第一特征信息,所述第一特征信息为表示所述第一粒度的词语的向量或矩阵;The processing unit is specifically configured to process the words of the first granularity by using the first characteristic network to obtain the first characteristic information, where the first characteristic information is a vector or word representing the words of the first granularity matrix;
所述处理单元,具体用于利用所述第二特征网络处理所述第二粒度的词语以得到所述第二特征信息,所述第二特征信息为表示所述第二粒度的词语的向量或矩阵。The processing unit is specifically configured to use the second feature network to process the words of the second granularity to obtain the second feature information, where the second feature information is a vector or word representing the words of the second granularity. matrix.
在一个可选的实现方式中,所述粒度标注网络包括长短期记忆网络LSTM和双向长短期记忆网络BiLSTM;所述处理单元,具体用于利用所述粒度标注网络采用如下公式确定所述自然语言文本中各词语的粒度:In an optional implementation, the granular labeling network includes a long short-term memory network LSTM and a bidirectional long short-term memory network BiLSTM; the processing unit is specifically configured to use the granular labeling network to determine the natural language using the following formula The granularity of words in the text:
h_l = BiLSTM([x_l; h_{l-1}, h_{l+1}]);

g_l = LSTM([h_l, z_{l-1}; g_{l-1}]);

z_l = GS(W_g g_l, τ);
wherein BiLSTM() in the formulas represents the processing operation of the BiLSTM, and LSTM() represents the processing operation of the LSTM; x represents a word in the natural language text, and x_l represents the l-th word in the natural language text x; h represents a hidden state variable in the BiLSTM network, and h_l, h_{l-1}, h_{l+1} respectively represent the hidden state variables when the BiLSTM network processes the l-th, (l-1)-th, and (l+1)-th words in the natural language text; g represents a hidden state variable in the LSTM network, and g_l, g_{l-1} respectively represent the hidden state variables when the LSTM network processes the l-th and (l-1)-th words in the natural language text; z represents the probability that a word belongs to the reference granularity, and z_{l-1}, z_l respectively represent the probabilities that the (l-1)-th and l-th words in the natural language text belong to the reference granularity, the reference granularity being any one of the N granularities; GS represents the Gumbel Softmax function, τ is a hyperparameter (temperature) of the Gumbel Softmax function, and W_g is a parameter matrix, that is, a parameter matrix in the granularity labeling network.
在一个可选的实现方式中,所述处理单元,具体用于利用所述第一特征网络采用如下公式对所述自然语言文本中第一粒度的词语进行特征提取:In an optional implementation manner, the processing unit is specifically configured to use the first feature network to use the following formula to perform feature extraction on words of the first granularity in the natural language text:
U_z = ENC_z(X, Z_X);

wherein ENC_z represents the first feature network, which is a Transformer model; ENC_z() represents the processing operation performed by the first feature network; X represents the natural language text; Z_X = [z1, z2, ..., zL] represents the annotation information, and z1 to zL sequentially represent the granularities of the first word to the L-th (last) word in the natural language text; U_z represents the first feature information output by the first feature network.
在一个可选的实现方式中,所述第一处理结果为包含一个或多个词语的序列;所述处理单元,具体用于利用所述第一处理网络对输入的所述第一特征信息和所述第一处理网络在处理所述第一特征信息的过程中已输出的词语做处理以得到所述第一处理结果。In an optional implementation manner, the first processing result is a sequence containing one or more words; the processing unit is specifically configured to use the first processing network to compare the input first feature information and The first processing network processes the output words in the process of processing the first characteristic information to obtain the first processing result.
在一个可选的实现方式中,所述融合网络输出的所述目标结果为包含一个或多个词语的序列;所述处理单元,具体用于利用所述融合网络处理所述第一处理结果、所述第二处理结果以及所述融合网络在处理所第一处理结果和所述第二处理结果的过程中已输出的词语以确定待输出目标词语,输出所述目标词语。In an optional implementation manner, the target result output by the fusion network is a sequence containing one or more words; the processing unit is specifically configured to use the fusion network to process the first processing result, The second processing result and the words that have been output by the fusion network in the process of processing the first processing result and the second processing result to determine the target word to be output, and output the target word.
在一个可选的实现方式中,所述融合网络包括至少一个LSTM网络;In an optional implementation manner, the converged network includes at least one LSTM network;
所述处理单元,具体用于利用将所述第一处理结果和所述第二处理结果合并得到的向量输入至所述LSTM网络;The processing unit is specifically configured to use a vector obtained by combining the first processing result and the second processing result to input to the LSTM network;
所述处理单元,具体用于利用所述LSTM网络采用如下公式计算待输出参考粒度的词语的概率:The processing unit is specifically configured to use the LSTM network to calculate the probability of a word with a reference granularity to be output by using the following formula:
h_t = LSTM(h_{t-1}, y_{t-1}, v0, v1);

P(z_t | y_{1:t-1}, X) = GS(W_z h_t, τ);

wherein h_t represents the hidden state variable in the LSTM network when the LSTM network processes the t-th word, h_{t-1} represents the hidden state variable in the LSTM network when the LSTM network processes the (t-1)-th word, LSTM() represents the processing operation performed by the LSTM, the LSTM network has currently output (t-1) words, y_{t-1} represents the (t-1)-th word output by the fusion network, v0 represents the first processing result, v1 represents the second processing result, W_z is a parameter matrix in the fusion network, τ is a hyperparameter, P(z_t | y_{1:t-1}, X) is the probability that the word currently to be output is of the reference granularity (granularity z), and t is an integer greater than 1.
所述处理单元,具体用于利用所述融合网络采用如下公式计算待输出所述目标词语的概率:The processing unit is specifically configured to use the fusion network to calculate the probability of the target word to be output by using the following formula:
P(y_t | y_{1:t-1}, X) = Σ_z P(z_t = z | y_{1:t-1}, X) · P_z(y_t | y_{1:t-1}, X);

wherein P_{z_t}(y_t | y_{1:t-1}, X) represents the probability of outputting the target word y_t at the reference granularity, and P(y_t | y_{1:t-1}, X) represents the probability of outputting the target word.
第四方面本申请实施例提供了另一种数据处理设备,该数据处理设备包括:处理单元,用于将训练样本输入至深度神经网络做处理,得到预测处理结果;其中,所述深度神经网络包括:粒度标注网络、第一特征网络、第二特征网络、第一处理网络、第二处理网络以及融合网络,所述处理包括:利用所述粒度标注网络确定所述训练样本中各词语的粒度;利用所述第一特征网络对所述训练样本中第一粒度的词语进行特征提取,将得到的第三特征信息输出至所述第一处理网络;利用所述第二特征网络对所述训练样本中第二粒度的词 语进行特征提取,将得到的第四特征信息输出至所述第二处理网络;利用所述第一处理网络对所述第三特征信息做目标处理,将得到的第三处理结果输出至所述融合网络;利用所述第二处理网络对所述第四特征信息做所述目标处理,将得到的第四处理结果输出至所述融合网络;利用所述融合网络融合所述第三处理结果和所述第四处理结果得到所述预测处理结果;所述第一粒度和所述第二粒度不同;所述处理单元,还用于根据所述预测处理结果和标准结果,确定所述训练样本对应的损失;所述标准结果为利用所述深度神经网络处理所述训练样本期望得到的处理结果;利用所述训练样本对应的损失,通过优化算法更新所述深度神经网络的参数。In the fourth aspect, the embodiments of the present application provide another data processing device. The data processing device includes: a processing unit for inputting training samples into a deep neural network for processing to obtain a prediction processing result; wherein, the deep neural network Including: a granular labeling network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network. The processing includes: using the granular labeling network to determine the granularity of each word in the training sample Use the first feature network to perform feature extraction on words of the first granularity in the training sample, and output the obtained third feature information to the first processing network; use the second feature network to perform feature extraction on the training Perform feature extraction on words of the second granularity in the sample, and output the obtained fourth feature information to the second processing network; use the first processing network to perform target processing on the third feature information, and the obtained third The processing result is output to the fusion network; the second processing network is used to perform the target processing on the fourth characteristic information, and the obtained fourth processing result is output to the fusion network; The third processing result and the fourth processing result obtain the predicted processing result; the first granularity and the second granularity are different; the processing unit is further configured to, according to the predicted processing result and the standard result, Determine the loss corresponding to the training sample; the standard result is the processing result expected to be obtained by using the deep neural network to process the training sample; use the loss corresponding to the training sample to update the deep neural network through an optimization algorithm parameter.
本申请实施例中,数据处理设备训练可以独立处理不同粒度的词语的深度神经网络,以便于得到能够避免由较细粒度的信息得到较粗粒度的信息的过程的深度神经网络,实现简单。In the embodiments of the present application, the data processing device trains a deep neural network that can independently process words of different granularities, so as to obtain a deep neural network that can avoid the process of obtaining coarser-grained information from finer-grained information, and is simple to implement.
在一个可选的实现方式中,所述第一特征网络和所述第二特征网络架构不同,和/或,所述第一处理网络和所述第二处理网络架构不同。In an optional implementation manner, the first characteristic network and the second characteristic network have different architectures, and/or the first processing network and the second processing network have different architectures.
在一个可选的实现方式中,所述粒度标注网络的输入为所述自然语言文本;所述处理单元,具体用于利用所述粒度标注网络按照N种粒度确定所述自然语言文本中每个词语的粒度以得到所述自然语言文本的标注信息,向所述第一特征网络和所述第二特征网络输出所述标注信息;其中,所述标注信息用于描述所述每个词语的粒度或者所述每个词语分别属于所述N种粒度的概率;N为大于1的整数;In an optional implementation manner, the input of the granular annotation network is the natural language text; the processing unit is specifically configured to use the granular annotation network to determine each of the natural language texts according to N types of granularities. The granularity of words is used to obtain the annotation information of the natural language text, and the annotation information is output to the first feature network and the second feature network; wherein the annotation information is used to describe the granularity of each word Or the probability that each word belongs to the N types of granularities; N is an integer greater than 1;
所述处理单元,具体用于利用所述第一特征网络处理所述第一粒度的词语以得到所述第三特征信息,所述第三特征信息为表示所述第一粒度的词语的向量或矩阵;The processing unit is specifically configured to process the words of the first granularity by using the first characteristic network to obtain the third characteristic information, where the third characteristic information is a vector or word representing the words of the first granularity matrix;
所述处理单元,具体用于利用所述第二特征网络处理所述第二粒度的词语以得到所述第四特征信息,所述述第四特征信息为表示所述第二粒度的词语的向量或矩阵。The processing unit is specifically configured to process the words of the second granularity by using the second characteristic network to obtain the fourth characteristic information, where the fourth characteristic information is a vector representing the words of the second granularity Or matrix.
在一个可选的实现方式中,所述粒度标注网络包括长短期记忆网络LSTM和双向长短期记忆网络BiLSTM;所述处理单元,具体用于利用所述粒度标注网络采用如下公式确定所述自然语言文本中各词语的粒度:In an optional implementation, the granular labeling network includes a long short-term memory network LSTM and a bidirectional long short-term memory network BiLSTM; the processing unit is specifically configured to use the granular labeling network to determine the natural language using the following formula The granularity of words in the text:
h_l = BiLSTM([x_l; h_{l-1}, h_{l+1}]);

g_l = LSTM([h_l, z_{l-1}; g_{l-1}]);

z_l = GS(W_g g_l, τ);
wherein BiLSTM() in the formulas represents the processing operation of the BiLSTM, and LSTM() represents the processing operation of the LSTM; x represents a word in the natural language text, and x_l represents the l-th word in the natural language text x; h represents a hidden state variable in the BiLSTM network, and h_l, h_{l-1}, h_{l+1} respectively represent the hidden state variables when the BiLSTM network processes the l-th, (l-1)-th, and (l+1)-th words in the natural language text; g represents a hidden state variable in the LSTM network, and g_l, g_{l-1} respectively represent the hidden state variables when the LSTM network processes the l-th and (l-1)-th words in the natural language text; z represents the probability that a word belongs to the reference granularity, and z_{l-1}, z_l respectively represent the probabilities that the (l-1)-th and l-th words in the natural language text belong to the reference granularity, the reference granularity being any one of the N granularities; GS represents the Gumbel Softmax function, τ is a hyperparameter (temperature) of the Gumbel Softmax function, and W_g is a parameter matrix, that is, a parameter matrix in the granularity labeling network.
In an optional implementation, the processing unit is specifically configured to use the first feature network to perform feature extraction on the words of the first granularity in the natural language text with the following formula:
U_z = ENC_z(X, Z_X);
Here ENC_z denotes the first feature network, which is a Transformer model, and ENC_z() denotes the processing operation performed by the first feature network; X denotes the natural language text; Z_X = [z1, z2, ..., zL] denotes the annotation information, where z1 to zL denote the granularities of the first to the L-th (last) word of the natural language text; and U_z denotes the third feature information output by the first feature network.
In an optional implementation, the first processing result is a sequence containing one or more words; the processing unit is specifically configured to use the first processing network to process the input third feature information together with the words the first processing network has already output while processing the third feature information, so as to obtain the third processing result.
In an optional implementation, the target result output by the fusion network is a sequence containing one or more words; the processing unit is specifically configured to use the fusion network to process the third processing result, the fourth processing result, and the words the fusion network has already output while processing the third and fourth processing results, so as to determine the target word to be output, and to output the target word.
In an optional implementation, the fusion network includes at least one LSTM network; the processing unit is specifically configured to input a vector obtained by combining the third processing result and the fourth processing result into the LSTM network;
and to use the LSTM network to calculate, with the following formulas, the probability that the word to be output is of the reference granularity:
h_t = LSTM(h_{t-1}, y_{t-1}, v2, v3);
P(z_t | y_{1:t-1}, X) = GS(W_z h_t, τ);
Here h_t denotes the hidden state variable of the LSTM network when it processes the t-th word, h_{t-1} denotes the hidden state variable of the LSTM network when it processes the (t-1)-th word, and LSTM() denotes the processing operation performed by the LSTM. The LSTM network has currently output (t-1) words, and y_{t-1} denotes the (t-1)-th word output by the fusion network. v2 denotes the third processing result, v3 denotes the fourth processing result, W_z is a parameter matrix of the fusion network, τ is a hyperparameter, P(z_t | y_{1:t-1}, X) is the probability that the word currently to be output is of the reference granularity (granularity z), and t is an integer greater than 1.
利用所述融合网络采用如下公式计算待输出所述目标词语的概率:Use the fusion network to calculate the probability of the target word to be output using the following formula:
P(y_t | y_{1:t-1}, X) = Σ_{z_t} P(z_t | y_{1:t-1}, X) · P_{z_t}(y_t | y_{1:t-1}, X);
Here P_{z_t}(y_t | y_{1:t-1}, X) denotes the probability of outputting the target word y_t at the reference granularity, and P(y_t | y_{1:t-1}, X) denotes the probability of outputting the target word.
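Read together, the formulas above amount to the fusion network's LSTM state selecting a granularity at each step and the final word distribution being a weighted combination of the per-granularity distributions. The following is a minimal sketch of one such decoding step, assuming PyTorch, assuming the two processing networks expose per-step vectors v2 and v3 together with per-granularity word distributions, and assuming the fusion marginalises over granularities as the surrounding definitions suggest; all names and interfaces are illustrative, not the claimed implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionStep(nn.Module):
    """One decoding step of a fusion network: an LSTM cell consumes the previous
    output word and the per-granularity results, and Gumbel-Softmax yields the
    granularity probabilities P(z_t | y_{1:t-1}, X)."""
    def __init__(self, emb_dim, v_dim, hidden_dim, num_granularities=2, tau=1.0):
        super().__init__()
        self.cell = nn.LSTMCell(emb_dim + 2 * v_dim, hidden_dim)
        self.w_z = nn.Linear(hidden_dim, num_granularities, bias=False)
        self.tau = tau

    def forward(self, y_prev_emb, v2, v3, state):
        # h_t = LSTM(h_{t-1}, y_{t-1}, v2, v3)
        h, c = self.cell(torch.cat([y_prev_emb, v2, v3], dim=-1), state)
        # P(z_t | y_{1:t-1}, X) = GS(W_z h_t, τ)
        p_z = F.gumbel_softmax(self.w_z(h), tau=self.tau)
        return p_z, (h, c)

def mix_word_distributions(p_z, p_words):
    """P(y_t|·) = sum over z of P(z_t=z|·) * P_z(y_t|·); p_z has shape
    (batch, num_granularities), p_words has shape (num_granularities, batch, vocab)."""
    return torch.einsum('bz,zbv->bv', p_z, p_words)
```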
In an optional implementation, the processing unit is specifically configured to update the parameters of at least one network included in the deep neural network by using the gradient values of a loss function with respect to the at least one network; the loss function is used to calculate the loss between the predicted processing result and the standard result; while any one of the first feature network, the second feature network, the first processing network, and the second processing network is being updated, the parameters of the other three networks remain unchanged.
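A minimal sketch of this training constraint follows, assuming PyTorch and an assumed `model` object whose sub-networks are exposed as attributes; the simplest way to keep the other three networks unchanged while one is updated is to give the optimizer only the parameters of the network being trained.

```python
import torch

def train_one_subnet(model, train_net, batches, loss_fn, lr=1e-4):
    """Update only train_net's parameters; the other sub-networks stay fixed
    because the optimizer never sees their parameters."""
    opt = torch.optim.Adam(train_net.parameters(), lr=lr)
    for text, target in batches:
        pred = model(text)            # full forward pass through all sub-networks
        loss = loss_fn(pred, target)  # loss between prediction and reference result
        opt.zero_grad()
        loss.backward()               # gradients are computed everywhere, but ...
        opt.step()                    # ... only train_net's parameters are changed
    return model
```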
In a fifth aspect, an embodiment of this application provides yet another data processing device, including a processor, a memory, an input device, and an output device. The memory is configured to store code; the processor executes the method provided in the first aspect or the second aspect by reading the code stored in the memory; the input device is configured to obtain the natural language text to be processed; and the output device is configured to output the target result obtained by the processor processing the natural language text.
In a sixth aspect, an embodiment of this application provides a computer program product. The computer program product includes program instructions that, when executed by a processor, cause the processor to perform the method of the first aspect or the second aspect.
In a seventh aspect, an embodiment of this application provides a computer-readable storage medium. The computer storage medium stores a computer program; the computer program includes program instructions that, when executed by a processor, cause the processor to perform the method of the first aspect or the second aspect.
附图说明BRIEF DESCRIPTION
图1A至图1C为自然语言处理系统的应用场景;Figures 1A to 1C are application scenarios of natural language processing systems;
图2为本申请实施例提供的一种自然语言处理方法流程图;Fig. 2 is a flowchart of a natural language processing method provided by an embodiment of the application;
图3为本申请实施例提供的一种深度神经网络的结构示意图;FIG. 3 is a schematic structural diagram of a deep neural network provided by an embodiment of this application;
图4为本申请实施例提供的一种粒度标注网络301的结构示意图;FIG. 4 is a schematic structural diagram of a granular labeling network 301 provided by an embodiment of this application;
图5为本申请实施例提供的一种特征网络的结构示意图;FIG. 5 is a schematic structural diagram of a feature network provided by an embodiment of this application;
图6为本申请实施例提供的一种深度神经网络的结构示意图;FIG. 6 is a schematic structural diagram of a deep neural network provided by an embodiment of this application;
图7为本申请实施例提供的一种训练方法流程图;FIG. 7 is a flowchart of a training method provided by an embodiment of the application;
图8为本申请实施例提供的一种数据处理设备的结构示意图;FIG. 8 is a schematic structural diagram of a data processing device provided by an embodiment of this application;
图9为本申请实施例提供的一种神经网络处理器的结构示意图;FIG. 9 is a schematic structural diagram of a neural network processor provided by an embodiment of this application;
图10为本申请实施例提供的一种智能终端的部分结构的框图;FIG. 10 is a block diagram of a partial structure of an intelligent terminal provided by an embodiment of the application;
图11为本申请实施例提供的另一种数据处理设备的部分结构的框图。FIG. 11 is a block diagram of a part of the structure of another data processing device provided by an embodiment of the application.
具体实施方式detailed description
为了使本技术领域的人员更好地理解本申请实施例方案,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚地描述,显然,所描述的实施例仅仅是本申请一部分的实施例,而不是全部的实施例。In order to enable those skilled in the art to better understand the solutions of the embodiments of the present application, the technical solutions in the embodiments of the present application will be clearly described below in conjunction with the drawings in the embodiments of the present application. Obviously, the described embodiments are only These are a part of the embodiments of this application, not all of the embodiments.
本申请的说明书实施例和权利要求书及上述附图中的术语“第一”、“第二”、和“第三”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元。方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。“和/或”用于表示在其所连接的两个对象之间选择一个或全部。例如“A和/或B”表示A、B或A+B。The terms "first", "second", and "third" in the specification embodiments and claims of this application and the above-mentioned drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or Priority. In addition, the terms "including" and "having" and any variations of them are intended to cover non-exclusive inclusion, for example, a series of steps or units are included. The method, system, product, or device is not necessarily limited to those clearly listed steps or units, but may include other steps or units that are not clearly listed or are inherent to these processes, methods, products, or devices. "And/or" is used to indicate one or all of the two connected objects. For example, "A and/or B" means A, B, or A+B.
At present, network models used for natural language processing tasks, such as the typical Google Neural Machine Translation (GNMT) model and the Transformer, do not separate the operations performed on words of different granularities in a natural language text. In other words, in the solutions currently adopted, the operations performed on words of different granularities are not decoupled. When a deep neural network is used to process a natural language processing task, coarser-grained features are usually formed by combining finer-grained features through pooling operations. For example, word-level and phrase-level features are combined through a pooling operation to form sentence-level features. It can be understood that if the finer-grained features are wrong, the coarser-grained features derived from them will also be wrong. This causes difficulties in understanding and applying deep neural networks to natural language processing tasks; for example, when an error occurs, it is impossible to locate which granularity's operation went wrong. An operation on words of a certain granularity can be understood as an operation of that granularity; for example, an operation on phrase-level words is a phrase-level operation, and an operation on sentence-level words is a sentence-level operation. The main principle of the solution of this application is to use mutually decoupled networks to process words of different granularities to obtain processing results for each granularity, and then to fuse these processing results to obtain the final result. In other words, the multiple networks that process words of different granularities are decoupled from one another; two networks being decoupled can be understood as the processing done by the two networks not affecting each other. Because the deep neural network used in this application has this decoupling capability, processing natural language processing tasks with the solution of this application has at least the following benefits:
Interpretability: when the deep neural network produces a wrong result for a natural language text, it is possible to locate exactly which granularity's operation went wrong, which facilitates subsequent analysis and correction.
Controllability: in the solution of this application, because the networks that process words of different granularities are decoupled, the sub-networks of the deep neural network that implement the operations of each granularity can be analyzed or adjusted separately. The deep neural network used in this application includes multiple mutually decoupled sub-networks for processing words of different granularities, and these sub-networks can be optimized in a targeted manner to ensure that the operations at each granularity are controllable.
Reusability and transferability: operations at different granularities have different reusability and transferability characteristics. Generally, in machine translation or sentence rewriting, sentence-level operations (translation or transformation of sentence patterns) are easier to reuse or transfer to other domains, while phrase-level or word-level operations are more domain-specific. In the solution of this application, because the deep neural network includes multiple independent sub-networks for processing words of different granularities, some of the sub-networks trained on samples from one domain can be applied to other domains.
下面介绍本申请方案可以应用的场景。The following describes the scenarios in which this application solution can be applied.
如图1A所示,一种自然语言处理系统包括用户设备以及数据处理设备。As shown in FIG. 1A, a natural language processing system includes user equipment and data processing equipment.
所述用户设备可以是手机、个人电脑、平板电脑、可穿戴设备、个人数字助理、游戏机、信息处理中心等智能终端。所述用户设备为自然语言数据处理的发起端,作为自然语言处理任务(例如翻译任务、复述任务等)的发起方,通常用户通过所述用户设备发起自然语言处理任务。复述任务是将一个自然语言文本转换为另一个与该自然语言文本意思相同但表达不同的文本的任务。例如,“What makes the second world war happen”可以复述为“What is the reason of world war II”。The user equipment may be a mobile phone, a personal computer, a tablet computer, a wearable device, a personal digital assistant, a game console, an information processing center, and other smart terminals. The user equipment is the initiator of natural language data processing, and serves as the initiator of natural language processing tasks (for example, translation tasks, paraphrase tasks, etc.). Generally, users initiate natural language processing tasks through the user equipment. The paraphrase task is the task of transforming a natural language text into another text with the same meaning but different expressions as the natural language text. For example, "What makes the second world war happen" can be repeated as "What is the reason of world war II".
所述数据处理设备可以是云服务器、网络服务器、应用服务器以及管理服务器等具有数据处理功能的设备或服务器。所述数据处理设备通过交互接口接收来自所述智能终端的查询语句/语音/文本等问句,再通过存储数据的存储器以及执行数据处理的处理器进行机器学习,深度学习,搜索,推理,决策等方式的语言数据处理。所述存储器可以是一个统称,包括本地存储以及存储历史数据的数据库,所述数据库可以在数据处理设备上,也可以在其它网络服务器上。The data processing device may be a device or server with data processing functions such as a cloud server, a network server, an application server, and a management server. The data processing device receives query sentences/voice/text questions from the smart terminal through an interactive interface, and then performs machine learning, deep learning, search, reasoning, and decision-making through a memory that stores data and a processor that performs data processing. Language data processing in other ways. The storage may be a general term including a database for local storage and storing historical data. The database may be on a data processing device or on other network servers.
如图1B所示为自然语言处理系统的另一个应用场景。此场景中智能终端直接作为数据处理设备,直接接收来自用户的输入并直接由智能终端本身的硬件进行处理,具体过程与图1A相似,可参考上面的描述,在此不再赘述。Figure 1B shows another application scenario of the natural language processing system. In this scenario, the smart terminal is directly used as a data processing device, directly receiving input from the user and directly processed by the hardware of the smart terminal itself. The specific process is similar to that of FIG. 1A, and the above description can be referred to, which will not be repeated here.
如图1C所示,所述用户设备可以是本地设备101或102,所述数据处理设备可以是执行设备210,其中数据存储系统250可以集成在所述执行设备210上,也可以设置在云上或其它网络服务器上。As shown in FIG. 1C, the user equipment may be a local device 101 or 102, the data processing device may be an execution device 210, and the data storage system 250 may be integrated on the execution device 210 or set on the cloud Or on other network servers.
本申请方案可以应用到多种场景,下面介绍利用数据处理设备如何执行自然语言处理任务。图2为本申请实施例提供的一种自然语言处理方法流程图,如图2所示,该方法可包括:The solution of this application can be applied to a variety of scenarios. The following describes how to perform natural language processing tasks using data processing equipment. FIG. 2 is a flowchart of a natural language processing method provided by an embodiment of the application. As shown in FIG. 2, the method may include:
201、获得待处理的自然语言文本。201. Obtain natural language text to be processed.
该待处理的自然语言文本可以是数据处理设备当前待处理的一个句子。该数据处理设备可以逐句对接收到的自然语言文本或者识别语音得到的自然语言文本做处理。The natural language text to be processed may be a sentence currently to be processed by the data processing device. The data processing device can process the received natural language text or the natural language text obtained by recognizing voice sentence by sentence.
在图1A和图1C场景中,获得待处理的自然语言文本可以是数据处理设备接收用户设备发送的语音或文本等数据,根据接收到的语音或文本等数据获得待处理的自然语言文本。举例来说,数据处理设备接收到用户设备发送的2个句子,该数据处理设备获取第1个句子(待处理的自然语言文本),利用训练得到的深度神经网络对该第1个句子做处理,输出处理该第1个句子得到结果;获取第2个句子(待处理的自然语言文本),利用训练得到的深度神经网络对该第2个句子做处理,输出处理该第2个句子得到结果。In the scenarios in FIG. 1A and FIG. 1C, obtaining the natural language text to be processed may be that the data processing device receives data such as voice or text sent by the user equipment, and obtains the natural language text to be processed according to the received voice or text data. For example, the data processing device receives 2 sentences sent by the user device, the data processing device obtains the first sentence (natural language text to be processed), and uses the trained deep neural network to process the first sentence , Output and process the first sentence to get the result; get the second sentence (natural language text to be processed), use the trained deep neural network to process the second sentence, and output and process the second sentence to get the result .
在图1B场景中,获得待处理的自然语言文本可以是智能终端直接接收用户输入的语音或文本等数据,根据接收到的语音或文本等数据获得待处理的自然语言文本。举例来说,智能终端接收到用户输入的2个句子,该智能终端获取第1个句子(待处理的自然语言文本),利用训练得到的深度神经网络对该第1个句子做处理,输出处理该第1个句子得到结果;获取第2个句子(待处理的自然语言文本),利用训练得到的深度神经网络对该第2个句子做处理,输出处理该第2个句子得到结果。In the scenario in FIG. 1B, obtaining the natural language text to be processed may be that the smart terminal directly receives data such as voice or text input by the user, and obtains the natural language text to be processed according to the received voice or text data. For example, the smart terminal receives 2 sentences input by the user, the smart terminal obtains the first sentence (natural language text to be processed), uses the trained deep neural network to process the first sentence, and outputs the processing The first sentence is the result; the second sentence (natural language text to be processed) is obtained, the second sentence is processed by the deep neural network obtained by training, and the second sentence is output and processed to obtain the result.
202、利用训练得到的深度神经网络对该自然语言文本做处理,输出处理该自然语言文本得到的目标结果。202. Use the deep neural network obtained by training to process the natural language text, and output a target result obtained by processing the natural language text.
该深度神经网络可以包括:粒度标注网络、第一特征网络、第二特征网络、第一处理网络、第二处理网络以及融合网络,数据处理设备利用该深度神经网络对该自然语言文本所做的处理可以包括:利用该粒度标注网络确定该自然语言文本中各词语的粒度;利用该第一特征网络对该自然语言文本中第一粒度的词语进行特征提取,将得到的第一特征信息输出至该第一处理网络;利用该第二特征网络对该自然语言文本中第二粒度的词语进行特征提取,将得到的第二特征信息输出至该第二处理网络;利用该第一处理网络对该第一特征信息做目标处理,将得到的第一处理结果输出至该融合网络;利用该第二处理网络对该第二特征信息做该目标处理,将得到的第二处理结果输出至该融合网络;利用该融合网络融合该第一处理结果和该第二处理结果得到该目标结果;该第一粒度和该第二粒度不同。该第一粒度和该第二粒度可以为字符级、词语级、短语级、句子级中任意两种不同的粒度。本申请中,一个词语的粒度是指该词语在自然语言文本(句子)中所属的粒度。该目标处理可以是翻译、复述、摘要生成等。该目标结果为处理该自然语言文本得到的另一个自然语言文本。例如,目标结果为翻译该自然语言文本得到的一个自然语言文本。又例如,目标结果为复述该自然语言文本得到的另一个自然语言文本。待处理的自然语言文本可以视为输入序列,数据处理设备处理该自然语言文本得到的目标结果(另一个自然语言文本)可以视为生成序列。The deep neural network may include: a granular annotation network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network. The data processing device uses the deep neural network to do the natural language text The processing may include: using the granular annotation network to determine the granularity of each word in the natural language text; using the first feature network to perform feature extraction on the first granular word in the natural language text, and output the obtained first feature information to The first processing network; using the second feature network to perform feature extraction on words of the second granularity in the natural language text, and output the obtained second feature information to the second processing network; using the first processing network to Perform target processing on the first characteristic information, and output the obtained first processing result to the fusion network; use the second processing network to perform the target processing on the second characteristic information, and output the obtained second processing result to the fusion network Use the fusion network to fuse the first processing result and the second processing result to obtain the target result; the first granularity and the second granularity are different. The first granularity and the second granularity may be any two different granularities among character level, word level, phrase level, and sentence level. In this application, the granularity of a word refers to the granularity of the word in the natural language text (sentence). The target processing can be translation, retelling, abstract generation, etc. The target result is another natural language text obtained by processing the natural language text. For example, the target result is a natural language text obtained by translating the natural language text. For another example, the target result is another natural language text obtained by retelling the natural language text. The natural language text to be processed can be regarded as an input sequence, and the target result (another natural language text) obtained by the data processing device processing the natural language text can be regarded as a generated sequence.
该深度神经网络可以包括N个特征网络以及N个处理网络,该N个特征网络以及该N个处理网络一一对应,N为大于1的整数。一对相对应的特征网络和处理网络用于处理同 一粒度的词语。例如,第一特征网络对自然语言文本中第一粒度的词语进行特征提取以得到第一特征信息,第一处理网络对该第一特征信息做目标处理。可以理解,该深度神经网络除了包括第一特征网络和第二特征网络之外,还可以包括用于对其他粒度(除第一粒度和第二粒度之外的粒度)的词语进行特征提取的特征网络;该深度神经网络除了包括第一处理网络和第二处理网络之外,还可以包括用于对其他粒度(除第一粒度和第二粒度之外的粒度)的词语的特征信息做目标处理的处理网络。本申请中,不对深度神经网络包括的特征网络的个数和处理网络的个数作限定。若自然语言文本中的词语被划为N种粒度,则该深度神经网络包括N个特征网络以及N个处理网络。也就是说,如果按照N种粒度划为自然语言文本中的各词语,则深度神经网络包括N个特征网络和N个特征网络。例如,自然语言文本中的词语划分为短语级词语和句子级词语,则深度神经网络包括两个特征网络,一个特征网络用于对短语级的词语进行特征提取以得到短语级的词语的特征信息,另一个特征网络用于对句子级的词语进行特征提取以得到句子级的词语的特征信息;该深度神经网络包括两个处理网络,一个处理网络用于对短语级的词语的特征信息做目标处理,另一个处理网络用于对句子级的词语的特征信息做目标处理。在该深度神经网络包括N个特征网络以及N个处理网络的情况下,该N个特征网络输出N个特征信息,该N个处理网络输出N个处理结果,该融合网络用于融合该N个处理结果得到最终的输出结果。也就是说,该融合网络并不限于融合两个处理结果。The deep neural network may include N feature networks and N processing networks. The N feature networks and the N processing networks have a one-to-one correspondence, and N is an integer greater than one. A pair of corresponding feature network and processing network are used to process words of the same granularity. For example, the first feature network performs feature extraction on words of the first granularity in the natural language text to obtain first feature information, and the first processing network performs target processing on the first feature information. It can be understood that, in addition to the first feature network and the second feature network, the deep neural network may also include features for feature extraction of words of other granularities (granularities other than the first granularity and the second granularity). Network; In addition to the first processing network and the second processing network, the deep neural network can also include target processing for the feature information of words of other granularities (granularities other than the first granularity and the second granularity) Processing network. In this application, the number of feature networks included in the deep neural network and the number of processing networks are not limited. If the words in the natural language text are classified into N granularities, the deep neural network includes N feature networks and N processing networks. That is to say, if the words in the natural language text are classified according to N granularities, the deep neural network includes N feature networks and N feature networks. For example, the words in natural language text are divided into phrase-level words and sentence-level words, then the deep neural network includes two feature networks, one feature network is used to extract the feature of phrase-level words to obtain the feature information of phrase-level words Another feature network is used to extract feature information of sentence-level words to obtain feature information of sentence-level words; the deep neural network includes two processing networks, one processing network is used to target the feature information of phrase-level words Processing, another processing network is used to target the feature information of sentence-level words. In the case that the deep neural network includes N feature networks and N processing networks, the N feature networks output N feature information, the N processing networks output N processing results, and the fusion network is used to fuse the N The processing result is the final output result. In other words, the fusion network is not limited to fusing two processing results.
该N个特征网络中任意两个特征网络对自然语言文本中不同粒度的词语进行特征提取;该N个处理网络中任意两个处理网络对不同粒度的词语的特征信息做目标处理。可选的,该N个特征网络中任意两个特征网络不共享参数;该N个处理网络中任意两个处理网络不共享参数。该目标处理可以是翻译、复述、摘要生成等。该第一特征网络和该第二特征网络的参数不同,且采用的架构相同或不同。例如,第一特征网络采用深度神经网络架构,第二特征网络采用Transformer架构。该第一处理网络和该第二处理网络的参数不同,且采用的架构相同或不同。例如,第一处理网络采用深度神经网络架构,第二处理网络采用Transformer架构。可以理解,该深度神经网络包括的多个特征网络采用的架构可以不同,该深度神经网络包括的多个处理网络采用的架构也可以不同。Any two of the N feature networks perform feature extraction on words with different granularities in natural language text; any two of the N processing networks perform target processing on the feature information of words with different granularities. Optionally, any two characteristic networks of the N characteristic networks do not share parameters; any two of the N processing networks do not share parameters. The target processing can be translation, retelling, abstract generation, etc. The parameters of the first feature network and the second feature network are different, and the architectures adopted are the same or different. For example, the first feature network uses a deep neural network architecture, and the second feature network uses a Transformer architecture. The first processing network and the second processing network have different parameters and adopt the same or different architectures. For example, the first processing network uses a deep neural network architecture, and the second processing network uses a Transformer architecture. It can be understood that the multiple feature networks included in the deep neural network may adopt different architectures, and the multiple processing networks included in the deep neural network may adopt different architectures.
本申请实施例中,数据处理设备利用深度神经网络中相互解耦的网络分别处理不同粒度的词语,可以有效提高处理自然处理任务的性能。In the embodiment of the present application, the data processing device uses the mutually decoupled network in the deep neural network to process words of different granularity respectively, which can effectively improve the performance of processing natural processing tasks.
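The processing described in step 202 can be summarised by the following minimal sketch, assuming PyTorch-style modules; the sub-module names and call signatures are illustrative placeholders rather than the patented implementation.

```python
import torch.nn as nn

class MultiGranularityNLP(nn.Module):
    """Decoupled pipeline: a granularity annotation network, one feature network
    and one processing network per granularity, and a fusion network."""
    def __init__(self, tagger, feature_nets, processing_nets, fusion):
        super().__init__()
        self.tagger = tagger                              # granularity annotation network
        self.feature_nets = nn.ModuleList(feature_nets)   # one feature network per granularity
        self.processing_nets = nn.ModuleList(processing_nets)
        self.fusion = fusion

    def forward(self, text):
        z = self.tagger(text)                             # per-word granularity labels/probabilities
        results = []
        for feat, proc in zip(self.feature_nets, self.processing_nets):
            u = feat(text, z)                             # features of this granularity only
            results.append(proc(u))                       # e.g. translation at this granularity
        return self.fusion(results)                       # merge into the final output sequence
```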
下面结合本申请采用的深度神经网络的结构来描述如何对自然语言文本做处理的流程。图3为本申请实施例提供的一种深度神经网络的结构示意图,该深度神经网络可以包括N个特征网络和N个处理网络,为方便理解图中仅示出2个特征网络(第一特征网络和第二特征网络)和2个处理网络(第一处理网络和第二处理网络)。如图3所示,301为粒度标注网络,302为第一特征网络,303为第二特征网络,304为第一处理网络,305为第二处理网络,306为融合网络。数据处理设备利用图3中的深度神经网络对自然语言文本的处理流程如下:The following describes how to process natural language text in conjunction with the structure of the deep neural network used in this application. Figure 3 is a schematic structural diagram of a deep neural network provided by an embodiment of the application. The deep neural network may include N feature networks and N processing networks. To facilitate understanding, only two feature networks (the first feature Network and second characteristic network) and 2 processing networks (first processing network and second processing network). As shown in Figure 3, 301 is a granular annotation network, 302 is a first feature network, 303 is a second feature network, 304 is a first processing network, 305 is a second processing network, and 306 is a converged network. The data processing equipment uses the deep neural network in Figure 3 to process natural language text as follows:
311、粒度标注网络301按照N种粒度确定自然语言文本中每个词语的粒度以得到该自然语言文本的标注信息,向第一特征网络302和第二特征网络303输出该标注信息。311. The granularity labeling network 301 determines the granularity of each word in the natural language text according to N types of granularities to obtain the labeling information of the natural language text, and outputs the labeling information to the first feature network 302 and the second feature network 303.
粒度标注网络301的输入为待处理的自然语言文本;输出可以为标注信息,也可以为 标注信息以及该自然语言文本。第一特征网络302的输入和第二特征网络303的输入均为该标注信息以及该自然语言文本。该标注信息用于描述自然语言文本中每个词语的粒度或者该自然语言文本中每个词语分别属于该N种粒度的概率;N为大于1的整数。The input of the granular annotation network 301 is the natural language text to be processed; the output may be annotation information, or annotation information and the natural language text. The input of the first feature network 302 and the input of the second feature network 303 are both the annotation information and the natural language text. The annotation information is used to describe the granularity of each word in the natural language text or the probability that each word in the natural language text belongs to the N types of granularities; N is an integer greater than 1.
粒度标注网络301对输入的自然语言文本(输入序列)中的每个词(假设以词为基本处理单位)所属的粒度进行标注,即确定该自然语言文本中每个词的标注。假设我们考虑两种粒度:短语级粒度和句子级粒度,输入的自然语言文本(语句)中的每个词的粒度都被确定为这两种粒度中的一种。举例来说,粒度标注网络301确定输入的自然语言文本“what makes the second world war happen”中每个词语的粒度,其中,“what”、“makes”、“happen”等词被确定为句子级粒度,“the”、“second”、“world”、“war”等词被确定为短语级粒度。值得注意的是,对于待处理的自然语言文本中各词语所属的粒度,并没有标注数据(lable),而是由粒度标注网络301确定其输入的自然语言文本中各词语的粒度。The granularity labeling network 301 labels the granularity to which each word (assuming the word is the basic processing unit) in the input natural language text (input sequence), that is, determines the label of each word in the natural language text. Assuming that we consider two granularities: phrase-level granularity and sentence-level granularity, the granularity of each word in the input natural language text (sentence) is determined to be one of these two granularities. For example, the granularity annotation network 301 determines the granularity of each word in the input natural language text "what makes the second world war happen", where words such as "what", "makes", and "happen" are determined to be sentence-level Granularity, words such as "the", "second", "world", and "war" are determined as phrase-level granularity. It is worth noting that the granularity of each word in the natural language text to be processed is not labeled with data (label), but the granularity annotation network 301 determines the granularity of each word in the input natural language text.
312、第一特征网络302利用输入的自然语言文本和标注信息进行特征提取,将得到的第一特征信息输出至第一处理网络304。312. The first feature network 302 uses the input natural language text and annotation information to perform feature extraction, and outputs the obtained first feature information to the first processing network 304.
该第一特征信息为表示第一粒度的词语的向量或矩阵。第一特征网络302的输入为自然语言文本和标注信息,可以对该自然语言文本中第一粒度的词语进行特征提取,并得到自然语言文本中第一粒度的词语的向量或矩阵表示,即该第一特征信息。The first feature information is a vector or matrix representing words of the first granularity. The input of the first feature network 302 is natural language text and tagging information. The natural language text can be feature-extracted from the first-granularity words, and the vector or matrix representation of the first-granularity words in the natural language text can be obtained, that is, the The first feature information.
313、第二特征网络303利用输入的自然语言文本和标注信息进行特征提取,将得到的第二特征信息输出至第二处理网络305。313. The second feature network 303 uses the input natural language text and annotation information to perform feature extraction, and outputs the obtained second feature information to the second processing network 305.
该第二特征信息为表示第二粒度的词语的向量或矩阵。第二特征网络303的输入为自然语言文本和标注信息,可以对该自然语言文本中第二粒度的词语进行特征提取,并得到自然语言文本中第二粒度的词语的向量或矩阵表示,即该第二特征信息。本申请实施例不对数据处理设备执行步骤313和步骤312的顺序做限定,步骤313和步骤312可以同时执行,也可以先执行步骤312再执行步骤313,还可以先执行步骤313再执行步骤312。The second feature information is a vector or matrix representing words of the second granularity. The input of the second feature network 303 is natural language text and tagging information, and the words of the second granularity in the natural language text can be feature extracted, and the vector or matrix representation of the words of the second granularity in the natural language text can be obtained, that is, the The second feature information. The embodiment of the present application does not limit the order in which the data processing device performs step 313 and step 312. Step 313 and step 312 can be performed at the same time, or step 312 can be performed before step 313, or step 313 can be performed before step 312.
314、第一处理网络304利用输入的第一特征信息和第一处理网络304在处理该第一特征信息的过程中已输出的处理结果做处理以得到第一处理结果。314. The first processing network 304 uses the input first characteristic information and the processing result output by the first processing network 304 in the process of processing the first characteristic information for processing to obtain the first processing result.
第一处理网络304通过递归的方式对输入的该第一特征信息做处理(例如翻译、复述、摘要提取等),即第一处理网络304以其对应的第一特征网络302的输出(第一特征信息)以及其之前已经输出的处理结果(序列)为输入,通过深度神经网络计算出向量或矩阵的表示(第一处理结果)。The first processing network 304 processes the input first feature information in a recursive manner (for example, translation, paraphrase, abstract extraction, etc.), that is, the first processing network 304 uses the output of the first feature network 302 (first The feature information) and the previously output processing result (sequence) are input, and the representation of the vector or matrix (the first processing result) is calculated through the deep neural network.
315、第二处理网络305利用输入的第二特征信息和第二处理网络305在处理该第二特征信息的过程中已输出的处理结果做处理以得到第二处理结果。315. The second processing network 305 uses the input second characteristic information and the processing result output by the second processing network 305 in the process of processing the second characteristic information for processing to obtain the second processing result.
第二处理网络305通过递归的方式对输入的该第二特征信息做处理(例如翻译、复述、摘要提取等),即第二处理网络305以其对应的第二特征网络303的输出(第二特征信息)以及其之前已经输出的处理结果(序列)为输入,通过深度神经网络计算出向量或矩阵的表示(第二处理结果)。本申请实施例不对数据处理设备执行步骤314和步骤315的顺序做限定,步骤314和步骤315可以同时执行,也可以先执行步骤314再执行步骤315,还可以先执行步骤315再执行步骤314。The second processing network 305 processes the input second feature information in a recursive manner (for example, translation, paraphrase, abstract extraction, etc.), that is, the second processing network 305 uses the output of the second feature network 303 (second The feature information) and the previously output processing result (sequence) are input, and the representation of the vector or matrix is calculated through the deep neural network (the second processing result). The embodiment of the present application does not limit the order in which the data processing device executes step 314 and step 315. Step 314 and step 315 can be executed simultaneously, or step 314 can be executed first and then step 315 can be executed, or step 315 can be executed before step 314 is executed.
316、融合网络306利用第一处理结果、第二处理结果以及融合网络306在处理该第一 处理结果和该第二处理结果的过程中已输出的处理结果,确定待输出目标词语,输出该目标词语。316. The fusion network 306 uses the first processing result, the second processing result, and the processing results that the fusion network 306 has output in the process of processing the first processing result and the second processing result to determine the target word to be output, and output the target Words.
该目标词语包含于该第一处理结果或该第二处理结果。融合网络306可以将不同粒度的处理网络的输出进行融合,即确定当前待输出词的粒度进而确定待输出的词。例如,第一步确定待输出“句子级”粒度的词语,输出“what”;第二步确定待输出“句子级”粒度的词语,输出“is”;重复之前的操作,直至最终完成输出语句(对应于目标结果)的生成。需要指出的是,上述步骤311至316均通过深度神经网络计算完成。The target word is included in the first processing result or the second processing result. The fusion network 306 can merge the output of processing networks of different granularities, that is, determine the granularity of the current word to be output and then determine the word to be output. For example, the first step is to determine the words to be output with "sentence level" granularity and output "what"; the second step to determine the words to be output with "sentence level" granularity and output "is"; repeat the previous operation until the final output sentence is completed (Corresponding to the target result) generation. It should be noted that the above steps 311 to 316 are all completed by deep neural network calculations.
本申请实施例中,数据处理设备利用不同粒度的特征网络和不同粒度的处理网络独立处理不同粒度的词语,可以有效提高得到正确结果的概率。In the embodiments of the present application, the data processing device uses feature networks of different granularities and processing networks of different granularities to independently process words of different granularities, which can effectively improve the probability of obtaining correct results.
下面结合粒度标注网络301的结构来描述粒度标注网络301如何确定自然语言文本中各词语的粒度。图4为本申请实施例提供的一种粒度标注网络301的结构示意图。如图4所示,粒度标注网络301包括长短期记忆网络(Long Short-Term Memory,LSTM)402和Bi LSTM(双向LSTM)网络401。从图4可以看出,粒度标注网络301使用多层LSTM网络的架构。Bi LSTM401的输入为自然语言文本,LSTM402的输出为标注信息,即每个词语的粒度标签或者每个词分别属于各种粒度的概率。粒度标注网络301用于预测输入句子(自然语言文本)中的每个词所对应的粒度。可选的,利用BiLSTM网络401将输入的自然语言文本转换成向量,作为下一层的LSTM网络402的输入;LSTM网络402计算该自然语言文本中每一个词属于每种粒度的概率并输出。为了保证整个粒度标注网络301的可微分,同时进一步地解耦开不同粒度的信息,标注信息可以使用GS(Gumbel-Softmax)函数代替常用的Softmax操作。这种情况下,每个词都有属于每种粒度的概率,且这个值接近0或1。The following describes how the granular annotation network 301 determines the granularity of each word in the natural language text in conjunction with the structure of the granular annotation network 301. FIG. 4 is a schematic structural diagram of a granular labeling network 301 provided by an embodiment of this application. As shown in FIG. 4, the granular annotation network 301 includes a Long Short-Term Memory (LSTM) 402 and a Bi LSTM (Bi-directional LSTM) network 401. It can be seen from FIG. 4 that the granular labeling network 301 uses a multilayer LSTM network architecture. The input of LSTM401 is natural language text, and the output of LSTM402 is labeling information, that is, the granularity label of each word or the probability that each word belongs to various granularities. The granularity annotation network 301 is used to predict the granularity corresponding to each word in the input sentence (natural language text). Optionally, the BiLSTM network 401 is used to convert the input natural language text into a vector, which is used as the input of the next layer of the LSTM network 402; the LSTM network 402 calculates and outputs the probability that each word in the natural language text belongs to each granularity. In order to ensure the differentiability of the entire granularity labeling network 301 and to further decouple information of different granularities, the labeling information can use the GS (Gumbel-Softmax) function instead of the commonly used Softmax operation. In this case, each word has a probability of belonging to each granularity, and this value is close to 0 or 1.
下面借助数学公式来描述粒度标注网络301预测自然语言文本中各词语的粒度的方式。BiLSTM网络401的处理过程对应的数学公式如下:The following uses mathematical formulas to describe the manner in which the granularity annotation network 301 predicts the granularity of each word in the natural language text. The mathematical formula corresponding to the processing process of BiLSTM network 401 is as follows:
h_l = BiLSTM([x_l; h_{l-1}, h_{l+1}]);
LSTM网络402的处理过程对应的数学公式如下:The mathematical formula corresponding to the processing process of the LSTM network 402 is as follows:
g_l = LSTM([h_l, z_{l-1}; g_{l-1}]);
z_l = GS(W_g g_l, τ);
In these formulas, BiLSTM() denotes the processing of the bidirectional recurrent deep neural network and LSTM() denotes the processing of the (unidirectional) recurrent deep neural network; l is the index of a word position; x denotes the input sentence (natural language text), and x_l denotes the l-th word of the input sentence x. h denotes the hidden state variables of the BiLSTM network 401; h_l, h_{l-1}, and h_{l+1} denote the hidden state variables of the BiLSTM network 401 when it processes the l-th, (l-1)-th, and (l+1)-th words of the input sentence, respectively. g denotes the hidden state variables of the (unidirectional) LSTM network, computed according to the usual LSTM update rules; g_l and g_{l-1} denote the hidden state variables of the LSTM network 402 when it processes the l-th and (l-1)-th words of the input sentence, respectively. z denotes the probability that a word belongs to a certain granularity (phrase-level, sentence-level, or another granularity); z_l and z_{l-1} denote the probabilities that the l-th and (l-1)-th words of the input sentence belong to that granularity, respectively. GS denotes the Gumbel-Softmax function, τ is a hyperparameter (temperature) of the Gumbel-Softmax function, and W_g is a parameter matrix, that is, a parameter matrix of the granularity annotation network.
粒度标注网络301使用多层LSTM网络的架构来确定自然语言文本中各词语的粒度, 可以充分利用已确定的词语的粒度来确定新的词语(待确定粒度的词语)的粒度,实现简单,处理效率高。The granularity annotation network 301 uses the architecture of a multi-layer LSTM network to determine the granularity of each word in a natural language text, and can make full use of the granularity of the determined word to determine the granularity of a new word (word with a granularity to be determined), which is simple to implement and process efficient.
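A minimal sketch of the annotation computation defined by the formulas above follows, assuming PyTorch and word-embedding inputs; all dimensions and module names are illustrative assumptions rather than the implementation of network 301.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GranularityTagger(nn.Module):
    """Granularity annotation sketch: a BiLSTM encodes the sentence, a unidirectional
    LSTM consumes the BiLSTM state together with the previous granularity decision,
    and Gumbel-Softmax produces a near one-hot granularity label per word."""
    def __init__(self, emb_dim, hidden_dim, num_granularities=2, tau=1.0):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.cell = nn.LSTMCell(2 * hidden_dim + num_granularities, hidden_dim)
        self.w_g = nn.Linear(hidden_dim, num_granularities, bias=False)
        self.num_granularities = num_granularities
        self.tau = tau

    def forward(self, x):                        # x: (batch, seq_len, emb_dim)
        h, _ = self.bilstm(x)                    # h_l for every position
        batch, seq_len, _ = h.shape
        g = x.new_zeros(batch, self.cell.hidden_size)
        c = x.new_zeros(batch, self.cell.hidden_size)
        z_prev = x.new_zeros(batch, self.num_granularities)
        z_all = []
        for l in range(seq_len):                 # g_l = LSTM([h_l, z_{l-1}; g_{l-1}])
            g, c = self.cell(torch.cat([h[:, l], z_prev], dim=-1), (g, c))
            z_prev = F.gumbel_softmax(self.w_g(g), tau=self.tau)  # z_l = GS(W_g g_l, τ)
            z_all.append(z_prev)
        return torch.stack(z_all, dim=1)         # (batch, seq_len, num_granularities)
```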
下面结合第一特征网络302的结构和第二特征网络303的结构来介绍特征网络的特征提取操作。图5为本申请实施例提供的一种第一特征网络302和第二特征网络303的结构示意图。如图5所示,第一特征网络302的输入和第二特征网络303的输入相同,第一特征网络302对自然语言文本中第一粒度的词语进行特征提取,第二特征网络303对自然语言文本中第二粒度的词语进行特征提取。第一特征网络302和第二特征网络303采用的网络架构可以相同,也可以不同。处理某种粒度的词语的特征网络可以理解为该粒度的特征网络,不同粒度的特征网络处理不同粒度的词语。第一特征网络302和第二特征网络303的参数不共享,且超参数的设置不同。可选的,第一特征网络302和第二特征网络303都采用Transformer模型,此模型基于多头自注意力机制(Multi-head Self-Attention),处理输入语句(自然语言文本)中某一粒度的词语,从而构造一个向量作为该粒度的词语的特征信息。在粒度特征网络301确定自然语言文本中每个词语的粒度的情况下,第一特征网络302可以仅关注输入语句(自然语言文本)中的第一粒度的词语;第二特征网络303可以仅关注输入语句(自然语言文本)中的第二粒度的词语。在粒度特征网络301确定自然语言文本中每个词语分别属于上述N种粒度的概率的情况下,第一特征网络302可以重点关注输入语句(自然语言文本)中的第一粒度的词语;第二特征网络303可以重点关注输入语句(自然语言文本)中的第二粒度的词语。在这种情况下,对于第一特征网络302来说,其重点关注输入语句中属于第一粒度的概率较高的词语;对于第二特征网络303来说,其重点关注输入语句中属于第二粒度的概率较高的词语。可以理解,一个词属于第一粒度的概率越高,第一特征网络302对该词语的关注度越高。The following describes the feature extraction operation of the feature network in combination with the structure of the first feature network 302 and the structure of the second feature network 303. FIG. 5 is a schematic structural diagram of a first characteristic network 302 and a second characteristic network 303 provided by an embodiment of this application. As shown in Figure 5, the input of the first feature network 302 and the input of the second feature network 303 are the same. The first feature network 302 performs feature extraction on words of the first granularity in the natural language text, and the second feature network 303 performs feature extraction on the natural language text. The words of the second granularity in the text are feature extracted. The network architectures adopted by the first feature network 302 and the second feature network 303 may be the same or different. A feature network that processes words of a certain granularity can be understood as a feature network of that granularity, and feature networks of different granularities process words of different granularity. The parameters of the first characteristic network 302 and the second characteristic network 303 are not shared, and the hyperparameter settings are different. Optionally, both the first feature network 302 and the second feature network 303 adopt the Transformer model. This model is based on a multi-head self-attention mechanism, which processes input sentences (natural language text) at a certain granularity. Words, so as to construct a vector as the characteristic information of the granular words. In the case that the granular feature network 301 determines the granularity of each word in the natural language text, the first feature network 302 may only focus on the words of the first granularity in the input sentence (natural language text); the second feature network 303 may only focus on Input sentences (natural language text) in the second granularity of words. In the case that the granular feature network 301 determines the probability that each word in the natural language text belongs to the aforementioned N types of granularities, the first feature network 302 can focus on the words of the first granularity in the input sentence (natural language text); The feature network 303 can focus on the words of the second granularity in the input sentence (natural language text). In this case, for the first feature network 302, it focuses on words with a higher probability of belonging to the first granularity in the input sentence; for the second feature network 303, it focuses on words belonging to the second Words with higher probability of granularity. 
It can be understood that the higher the probability that a word belongs to the first granularity, the higher the attention of the first feature network 302 to the word.
如5所示,第一特征网络302可以采用限定窗口的自注意力(Self-Attention)机制(类似深度神经网络的机制,但其权重仍由attention计算得出。对于输入语句(自然语言文本),第一特征网络302会重点关注该输入语句中第一粒度的词,而忽视其他粒度层级上的词。第一特征网络302可以是短语级粒度的特征网络,提取每个词语的特征时仅关注该词相邻的两个词语,如图5所示。第二特征网络303可以采用整句范围的Self-Attention机制,从而能够关注到句子全局的信息。第二特征网络303可以是句子级粒度的特征网络,提取每个词语的特征时都关注整个输入语句,如图5所示。对于输入语句(自然语言文本),第二特征网络303会重点关注该输入语句中第二粒度的词,而忽略其他粒度层级上的词。Transformer模型是本领域常用的一种模型,这里不再详述该模型的工作原理。最终,第一特征网络302可以得到输入语句(自然语言文本)中第一粒度的各词语的向量表示(第一特征信息);第二特征网络303可以得到输入语句(自然语言文本)中第二粒度的各词语的向量表示(第二特征信息)。在实际应用中,通过深度神经网络(Transformer)的计算,每个粒度上的特征网络得到该粒度上的词语的向量表示,记为Uz。As shown in 5, the first feature network 302 can use a self-attention mechanism with a limited window (similar to a deep neural network mechanism, but its weight is still calculated by attention. For the input sentence (natural language text) , The first feature network 302 will focus on words at the first granularity in the input sentence and ignore words at other granularity levels. The first feature network 302 can be a feature network with a phrase-level granularity. When extracting the features of each word, only Pay attention to the two adjacent words of the word, as shown in Figure 5. The second feature network 303 can adopt the Self-Attention mechanism of the whole sentence, so as to be able to pay attention to the global information of the sentence. The second feature network 303 can be sentence-level The granular feature network focuses on the entire input sentence when extracting the features of each word, as shown in Figure 5. For the input sentence (natural language text), the second feature network 303 will focus on the second granular word in the input sentence , While ignoring words at other levels of granularity. The Transformer model is a commonly used model in the field, and the working principle of the model will not be described in detail here. Finally, the first feature network 302 can obtain the input sentence (natural language text). The vector representation (first feature information) of each word at one granularity; the second feature network 303 can obtain the vector representation (second feature information) of each word at the second granularity in the input sentence (natural language text). In practical applications , Through the calculation of the deep neural network (Transformer), the feature network at each granularity obtains the vector representation of the word at the granularity, denoted as Uz.
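A small sketch of the two attention patterns just described follows, assuming boolean masks of the kind a Transformer self-attention layer can consume; the window size of 1 (a word and its two neighbours) reflects the phrase-level example above and is otherwise an assumption.

```python
import torch

def build_attention_masks(seq_len, window=1):
    """Local-window mask for the phrase-level feature network and full-sentence
    mask for the sentence-level one (True = position may be attended to)."""
    idx = torch.arange(seq_len)
    local = (idx[None, :] - idx[:, None]).abs() <= window   # a word and its neighbours
    full = torch.ones(seq_len, seq_len, dtype=torch.bool)   # the whole sentence
    return local, full
```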
下面借助数学公式来描述第一特征网络302和第二特征网络303实现的处理操作。第一特征网络302和第二特征网络303实现的处理操作对应的数学公式如下:The processing operations implemented by the first feature network 302 and the second feature network 303 are described below with the aid of mathematical formulas. The mathematical formulas corresponding to the processing operations implemented by the first feature network 302 and the second feature network 303 are as follows:
U_z = ENC_z(X, Z_X);
Here z denotes the granularity-level index (for example, z = 0 denotes word-level granularity and z = 1 denotes sentence-level granularity), ENC_z denotes the feature network at granularity z (the first feature network or the second feature network), which is a Transformer model, and ENC_z() denotes the processing operation performed by that feature network. X denotes the input sentence (natural language text) of the feature network, and Z_X = [z1, z2, ..., zL] denotes the annotation information (granularity levels) of the input sentence, which is determined by the output of the granularity annotation network; z1 to zL denote the granularities of the first to the L-th (last) word of the input sentence. U_z denotes the final output of the feature network ENC_z. The inputs of the feature network are the input sentence X and the annotation information Z_X. When the annotation information output by the granularity annotation network 301 is the granularity of each word in the natural language text, the annotation information of the input sentence fed to the feature network is the annotation information output by the granularity annotation network 301. For example, the annotation information output by the granularity annotation network 301 is [1100001]; these binary values in turn denote the granularities of the first to the last word of the input sentence, where 0 denotes word-level granularity and 1 denotes sentence-level granularity. When the annotation information output by the granularity annotation network 301 is the probability that each word of the natural language text belongs to each of the N granularities, the annotation information of the input sentence fed to the feature network is obtained from the annotation information output by the granularity annotation network 301. In practical applications, the data processing device may further process the annotation information output by the granularity annotation network 301 to obtain annotation information that can be input to the feature network.
在一个可选的实现方式中,数据处理设备将自然语言文本中每个词语属于最大概率的那种粒度作为每个词语的粒度。举例来说,输入语句(自然语言文本)中某个词语属于短语级粒度、句子级粒度的概率分别为0.85和0.15,则该词语的粒度为短语级粒度。又举例来说,按照短语级粒度和句子级粒度划为自然语言文本中的各词语的粒度,粒度标注网络301输出的标注信息为[0.92 0.88 0.08 0.07 0.04 0.06 0.97],该标注信息中的数值依次表示该自然语言文本中第一个词语至最后一个词语分别属于句子级粒度的概率,数据处理设备可以将标注信息中小于0.5的数值置为0,大于或等于0.5的数值置为1,得到新的标注信息[1100001]并输入至特征网络。In an optional implementation manner, the data processing device uses the granularity at which each word in the natural language text belongs to the maximum probability as the granularity of each word. For example, if the probability that a word in the input sentence (natural language text) belongs to the phrase-level granularity and sentence-level granularity are 0.85 and 0.15, respectively, the granularity of the word is the phrase-level granularity. For another example, according to phrase-level granularity and sentence-level granularity, the granularity of each word in the natural language text is classified. The annotation information output by the granularity annotation network 301 is [0.92 0.88 0.08 0.07 0.04 0.06 0.97], and the value in the annotation information In turn, it indicates the probability that the first word to the last word in the natural language text belong to the sentence-level granularity. The data processing device can set the value less than 0.5 in the label information to 0, and the value greater than or equal to 0.5 to 1, to get The new label information [1100001] is input into the feature network.
在一个可选的实现方式中,数据处理设备根据自然语言文本中每个词语分别属于上述N种粒度的概率进行采样,利用采样得到的每个词语所属的粒度得到该自然语言文本的标注信息,并输入至特征网络。In an optional implementation manner, the data processing device samples the natural language text according to the probability that each word in the natural language text belongs to the aforementioned N types of granularities, and obtains the annotation information of the natural language text by using the granularity of each word obtained by the sampling. And input to the feature network.
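Both optional implementations can be illustrated with a short example, assuming PyTorch and the per-word sentence-level probabilities from the [0.92 0.88 0.08 0.07 0.04 0.06 0.97] example above.

```python
import torch

# Per-word probabilities (here, of being sentence-level) output by the annotation network.
probs = torch.tensor([0.92, 0.88, 0.08, 0.07, 0.04, 0.06, 0.97])

# Option 1: threshold at 0.5 to obtain the 0/1 labels fed to the feature networks,
# matching the [1100001] example above.
hard_labels = (probs >= 0.5).long()             # tensor([1, 1, 0, 0, 0, 0, 1])

# Option 2: sample each word's granularity from its probability instead of thresholding.
sampled_labels = torch.bernoulli(probs).long()  # one random 0/1 draw per word
```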
深度神经网络包括的各特征网络独立处理不同粒度的词语,采用不同架构的网络来处理不同粒度的词语,特征提取性能较好。Each feature network included in the deep neural network independently processes words of different granularities, and uses networks of different architectures to process words of different granularities, with better feature extraction performance.
下面结合第一特征网络302、第二特征网络303、第一处理网络304、第二处理网络305以及融合网络306的结构来介绍处理网络所做的处理以及融合网络306所做的处理。The processing performed by the processing network and the processing performed by the convergence network 306 will be introduced below in conjunction with the structures of the first feature network 302, the second feature network 303, the first processing network 304, the second processing network 305, and the converged network 306.
图6为本申请实施例提供的一种深度神经网络的结构示意图,图6未示出粒度标注网络。图6所示,第一处理网络304的输入为第一特征网络302输出的第一特征信息以及第一处理网络304在处理该第一特征信息的过程出已输出的处理结果(词语);第二处理网络305的输入为第二特征网络303输出的第二特征信息以及第二处理网络305在处理该第二特征信息的过程出已输出的处理结果(词语);融合网络306的输入为第一处理结果、第二处理结果以及在处理该第一处理结果和该第二处理结果的过程中已输出的词语,融合网络306的输出为融合该第一处理结果和该第二处理结果得到的目标结果。第一处理网络304和第二处理网络305采用的架构可以相同,也可以不同。第一处理网络304和第二处理网络305可以不共享参数。Fig. 6 is a schematic structural diagram of a deep neural network provided by an embodiment of the application, and Fig. 6 does not show a granular annotation network. As shown in FIG. 6, the input of the first processing network 304 is the first characteristic information output by the first characteristic network 302, and the first processing network 304 has outputted processing results (words) in the process of processing the first characteristic information; The input of the second processing network 305 is the second feature information output by the second feature network 303, and the second processing network 305 outputs the processed results (words) that have been output in the process of processing the second feature information; the input of the fusion network 306 is the first A processing result, a second processing result, and words that have been output in the process of processing the first processing result and the second processing result. The output of the fusion network 306 is obtained by fusing the first processing result and the second processing result Target result. The architectures adopted by the first processing network 304 and the second processing network 305 may be the same or different. The first processing network 304 and the second processing network 305 may not share parameters.
A processing network that processes words of a certain granularity can be understood as the processing network of that granularity; processing networks of different granularities process words of different granularities. In other words, each granularity has a corresponding processing network. For example, when the words in a natural language text are divided into phrase-level and sentence-level granularities, the deep neural network includes one phrase-level processing network and one sentence-level processing network. The processing networks of different granularities are decoupled, meaning that they do not share parameters and can adopt different architectures; for example, the phrase-level processing network may use a deep neural network architecture while the sentence-level processing network uses a Transformer architecture. A processing network may output one word at a time together with the granularity of that word. The processing may proceed recursively: the processing network of each granularity takes as input the output of the feature network of the corresponding granularity and the words it has already output, computes the probabilities of the candidate words currently to be output, and outputs the word with the highest probability together with the annotation information corresponding to that word. Optionally, the processing network uses its input to compute the probability of each candidate word, samples according to these probabilities, and outputs the sampled word and its corresponding annotation information. Optionally, the processing network uses its input to compute the probability of each candidate word (that is, the probability that each word is output at the current step) and outputs these probabilities. For example, if the processing network currently has F candidate words, it uses its input to compute the probability of outputting the first word, the second word, ..., and the F-th word, and inputs these probabilities to the fusion network, where F is an integer greater than 1. The annotation information corresponding to a word may be the probability that the word belongs to a certain granularity, the granularity of the word, or the probabilities that the word belongs to each of the granularities.
The processing performed by the first processing network 304 may be as follows. In a first step, the first processing network 304 processes the input first feature information to predict the first word currently to be output, and outputs that first word and the annotation information corresponding to it. In a second step, the first processing network 304 processes the input first feature information and the first word to predict the second word currently to be output, and outputs that second word and its corresponding annotation information. The first processing network 304 then processes the input first feature information, the first word, and the second word to predict the third word currently to be output, and outputs that third word and its corresponding annotation information; the preceding steps are repeated until the first processing result is complete. It should be understood that each processing network included in the deep neural network may process its input feature information in a manner similar to the first processing network 304. For example, suppose the input of a certain processing network is the feature information obtained by its corresponding feature network performing feature extraction on "a good geologist". The processing network processes this feature information, predicts that "a" currently needs to be output, and outputs it; it then processes the feature information together with the previously output "a", predicts that "great" currently needs to be output, and outputs it; it then processes the feature information together with the previously output "a" and "great", predicts that "geologist" currently needs to be output, and outputs it.
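A minimal sketch of this recursive prediction loop; `score_next` stands in for whatever forward pass a particular processing network implements and is not part of the embodiment:

```python
# Illustrative greedy autoregressive decoding by one processing network.
# score_next(features, prefix) must return a dict {candidate_word: probability}.
def decode_greedy(features, score_next, max_len=10, eos="</s>"):
    prefix = []
    for _ in range(max_len):
        probs = score_next(features, prefix)   # probabilities of the candidate words
        word = max(probs, key=probs.get)       # emit the most probable word
        if word == eos:
            break
        prefix.append(word)                    # the emitted word feeds the next step
    return prefix
```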
As shown in FIG. 6, the first processing network 304 receives the output of the first feature network 302 and the words it has already output, and performs its computation using a self-attention mechanism with a limited window; the second processing network 305 receives the output of the second feature network 303 and the words it has already output, and performs its computation using a self-attention mechanism over the whole sentence. The processing result obtained by the processing network at each granularity is denoted Vz, where z is the index of the granularity level, that is, granularity z. The first processing network 304 and the second processing network 305 may also adopt different architectures. The following describes the operations performed by the fusion network 306 on the processing results input by the processing networks.
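Before turning to the fusion network 306, the difference between the two attention patterns just mentioned can be sketched as attention masks; this is an illustrative reading, not the exact computation of the embodiment:

```python
from typing import Optional
import torch

def attention_mask(seq_len: int, window: Optional[int]) -> torch.Tensor:
    """Entry (j, k) is 1 if position j may attend to position k.
    window=None gives whole-sentence self-attention; a finite window
    gives the limited-window self-attention."""
    if window is None:
        return torch.ones(seq_len, seq_len)
    idx = torch.arange(seq_len)
    return ((idx[:, None] - idx[None, :]).abs() <= window).float()
```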
The fusion network 306 can fuse the processing results output by the processing networks of different granularities to obtain the target result. The output of the fusion network 306 is a sequence of words. The input of the fusion network 306 is the processing results of the processing networks (the first processing result and the second processing result) together with the sequence that the fusion network 306 has already output while processing these results. The operations performed by the fusion network 306 may be as follows: the fusion network 306 merges the processing results input by the processing networks into one vector; the vector is input to an LSTM network for processing to determine the granularity of the word currently to be output, that is, which granularity level's word should be output now; the fusion network 306 then outputs the target word currently to be output by the processing network of that granularity. Inputting the vector to an LSTM network to determine the granularity of the word to be output may consist of inputting the vector to an LSTM network to determine, for each of the N granularities, the probability that a word of that granularity is output, and thereby determining the granularity to be output now, where the granularity to be output is the one whose word currently has the highest probability of being output. This granularity is any one of the N granularities. The target word is the word with the highest probability of being output among the candidate words currently to be output by the processing network of the granularity to be output. For example, if the probabilities of the first, second, and third candidate words of the processing network of a reference granularity are 0.06, 0.8, and 0.14 respectively, the target word currently to be output by that processing network is the second word, that is, the word with the highest probability of being output. It can be understood that the fusion network 306 may first determine which granularity's word is currently to be output, and then output the word to be output by the processing network of that granularity.
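A minimal sketch of this first fusion variant (choose the granularity first, then emit that granularity's most probable word); the inputs are assumed to have already been computed by the LSTM and the processing networks:

```python
# Illustrative: gran_probs[z] = probability that a word of granularity z is emitted now;
# word_probs[z] = dict {word: probability} given by the processing network of granularity z.
def fuse_pick_granularity(gran_probs, word_probs):
    z = max(range(len(gran_probs)), key=lambda i: gran_probs[i])  # granularity to emit
    return max(word_probs[z], key=word_probs[z].get)              # its most probable word

fuse_pick_granularity([0.3, 0.7], [{"a": 0.6, "good": 0.4}, {"How": 0.8, "can": 0.2}])  # -> "How"
```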
The operations performed by the fusion network 306 may also be as follows: the fusion network 306 merges the processing results input by the processing networks into one vector; the vector is input to an LSTM network for processing to determine the probability of each of the candidate words currently to be output by the processing networks; the fusion network 306 then outputs the target word, that is, the candidate word with the highest probability of being output. Here, the processing networks refer to the processing networks of the respective granularities. For example, if the candidate words currently to be output by the first processing network include "a", "good", and "geologist", and the candidate words currently to be output by the second processing network include "How", "can", "I", and "be", the fusion network computes the probability that each of these seven words is output at the current step and outputs the one with the highest probability of being output.
The following describes how to compute, for the processing network of a reference granularity, the probability that each of its candidate words currently to be output is output. The reference granularity is any one of the N granularities mentioned above.
Suppose that before the fusion network 306 outputs the t-th word, the (t-1) words already output by the fusion network 306 are denoted [y_1, y_2, ..., y_{t-1}], where t is an integer greater than 1, and that the vectors (processing results) output by the first processing network and the second processing network are v_0 and v_1 respectively. The fusion network 306 concatenates these two vectors with the sequence it has already output and inputs the concatenated vector into an LSTM network for processing to compute the probability that a word of the reference granularity is to be output; the fusion network 306 includes this LSTM network. The LSTM network may compute this probability with the following formulas:

h_t = LSTM(h_{t-1}, y_{t-1}, v_0, v_1);

P(z_t | y_{1:t-1}, X) = GS(W_z h_t, τ);

where h_t denotes the hidden state of the LSTM network when it processes the t-th word, LSTM() denotes the processing performed by the LSTM network, y_{t-1} denotes the (t-1)-th word output by the fusion network, W_z is a parameter matrix in the fusion network, τ is a hyperparameter, and P(z_t | y_{1:t-1}, X) is the probability that the word currently to be output is of granularity z. It can be understood that the fusion network 306 can compute, in a similar way, the probability of currently outputting a word of any one of the N granularities. After computing these probabilities, the probability of outputting the target word is computed through a mixture probability model; the target word is a word currently to be output by the processing network of granularity z. The probability of outputting the target word is computed as:

P(y_t | y_{1:t-1}, X) = Σ_z P(z_t = z | y_{1:t-1}, X) · P_z(y_t | y_{1:t-1}, X);

where P_z(y_t | y_{1:t-1}, X) denotes the probability of outputting the target word y_t at granularity z, and P(y_t | y_{1:t-1}, X) denotes the probability of outputting the target word. P_z(y_t | y_{1:t-1}, X) can be given by the processing network: the processing network of granularity z can input to the fusion network the probability of each of its candidate words (words of granularity z), that is, the probability that each word currently to be output by that processing network is output. For example, the input of the first processing network is the feature information obtained by the first feature network performing feature extraction on "a good geologist"; the processing network processes this feature information to obtain the probability of outputting "a", the probability of outputting "great", and the probability of outputting "geologist", and inputs these words and the corresponding probabilities to the fusion network. Assuming the target word y_t is "great", P_z(y_t | y_{1:t-1}, X) denotes the probability of outputting "great" at granularity z. It can be understood that the fusion network 306 may first compute, for each of the N granularities, the probability that a word of that granularity is currently to be output, then compute the probability of each candidate word being output, and finally output the word with the highest probability of being output.
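Under the assumption that GS denotes a Gumbel-Softmax-style operator with temperature τ (the abbreviation is not expanded in the text), the formulas above can be sketched as follows; the sizes, module names, and batched tensor shapes are illustrative, not taken from the embodiment:

```python
import torch
import torch.nn.functional as F

hidden_size, num_granularities = 128, 2
lstm = torch.nn.LSTMCell(input_size=3 * hidden_size, hidden_size=hidden_size)
W_z = torch.nn.Linear(hidden_size, num_granularities, bias=False)
tau = 1.0

def fusion_step(h_prev, c_prev, y_prev_emb, v0, v1, word_probs):
    """One output step. All tensors have shape (1, hidden_size);
    word_probs[z] is a dict {word: P_z(word | y_{1:t-1}, X)} from granularity z."""
    x = torch.cat([y_prev_emb, v0, v1], dim=-1)             # merge inputs into one vector
    h_t, c_t = lstm(x, (h_prev, c_prev))                    # h_t = LSTM(h_{t-1}, y_{t-1}, v0, v1)
    p_z = F.gumbel_softmax(W_z(h_t), tau=tau, hard=False)   # P(z_t | y_{1:t-1}, X)
    mixed = {}                                               # P(y_t) = sum_z P(z_t=z) * P_z(y_t)
    for z, probs in enumerate(word_probs):
        for w, p in probs.items():
            mixed[w] = mixed.get(w, 0.0) + p_z[0, z].item() * p
    return max(mixed, key=mixed.get), (h_t, c_t)
```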
前述实施例描述了利用训练得到的深度神经网络来实现自然语言处理方法,下面介绍如何训练得到所需的深度神经网络。The foregoing embodiment describes the use of a deep neural network obtained by training to implement a natural language processing method. The following describes how to train a required deep neural network.
图7为本申请实施例提供的一种训练方法流程图,如图7所示,该方法可包括:FIG. 7 is a flowchart of a training method provided by an embodiment of the application. As shown in FIG. 7, the method may include:
701、数据处理设备将训练样本输入至深度神经网络做处理,得到预测处理结果。701. The data processing device inputs the training samples to the deep neural network for processing, and obtains a prediction processing result.
The deep neural network includes a granularity annotation network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network, and the processing includes: using the granularity annotation network to determine the granularity of each word in the training sample; using the first feature network to perform feature extraction on the words of the first granularity in the training sample and outputting the obtained third feature information to the first processing network; using the second feature network to perform feature extraction on the words of the second granularity in the training sample and outputting the obtained fourth feature information to the second processing network; using the first processing network to perform target processing on the third feature information and outputting the obtained third processing result to the fusion network; using the second processing network to perform the target processing on the fourth feature information and outputting the obtained fourth processing result to the fusion network; and using the fusion network to fuse the third processing result and the fourth processing result to obtain the predicted processing result; the first granularity and the second granularity are different.
The architectures of the first feature network and the second feature network are different, and/or the architectures of the first processing network and the second processing network are different. The input of the granularity annotation network is the natural language text; the granularity annotation network is used to determine, according to N granularities, the granularity of each word in the natural language text to obtain the annotation information of the natural language text, and to output the annotation information to the first feature network and the second feature network, where the annotation information describes the granularity of each word or the probabilities that each word belongs to the N granularities, and N is an integer greater than 1. The first feature network is used to perform feature extraction using the input natural language text and the annotation information and to output the obtained third feature information to the first processing network, where the third feature information is a vector or matrix representing the words of the first granularity. The first processing network is used to perform the target processing using the input third feature information and the processing results it has already output, to obtain the third processing result. The fusion network outputs one word at a time; it is used to determine the target word to be output using the third processing result, the fourth processing result, and the words it has already output while processing the third and fourth processing results, and to output that target word.
702、数据处理设备根据该预测处理结果和标准结果,确定该训练样本对应的损失。702. The data processing device determines the loss corresponding to the training sample according to the predicted processing result and the standard result.
The standard result, that is, the ground truth, is the processing result expected to be obtained by processing the training sample with the deep neural network. It can be understood that each training sample corresponds to one standard result, so that the data processing device can compute the loss of processing each training sample with the deep neural network and thereby optimize the deep neural network. Taking the training of a deep neural network for the paraphrase task as an example, the following introduces training samples and standard results that the data processing apparatus may use to train the deep neural network.
Table 1: example training samples and their corresponding standard results (reference outputs) for the paraphrase task.
There is no labeled data for the granularity of each word in the training samples; the granularity annotation network 301 is obtained through end-to-end learning. Because of this end-to-end learning, and to keep the granularity annotation network 301 differentiable, during training the network actually outputs the probability that each word belongs to each granularity rather than an absolute 0/1 label. It should be understood that when the data processing device trains the deep neural network for different natural language processing tasks, different training samples and standard results are used. For example, if the data processing device trains the network for the paraphrase task, training samples and standard results similar to those in Table 1 may be used. As another example, if the data processing device trains the network for a translation task, the training samples may be English texts and the standard results the corresponding standard Chinese texts.
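The differentiability point made at the start of the preceding paragraph can be illustrated with a small sketch contrasting soft (training-time) and hard (inference-time) granularity labels; the tensor names and sizes are assumptions:

```python
import torch

logits = torch.randn(7, 2, requires_grad=True)   # one row per word, N = 2 granularities

soft_labels = torch.softmax(logits, dim=-1)      # training: probabilities, gradients can flow
hard_labels = soft_labels.argmax(dim=-1)         # inference: absolute 0/1 assignment (not differentiable)
```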
703、数据处理设备利用该训练样本对应的损失,通过优化算法更新该深度神经网络的参数。703. The data processing device uses the loss corresponding to the training sample to update the parameters of the deep neural network through an optimization algorithm.
In practical applications, the data processing device can train the deep neural network to handle different natural language processing tasks. Depending on which natural language processing task the deep neural network is trained for, the data processing device computes the loss between the predicted processing result and the standard result differently, that is, the method of computing the loss corresponding to a training sample differs.
In an optional implementation, using the loss corresponding to the training sample to update the parameters of the deep neural network through an optimization algorithm may consist of using the gradient of the loss function with respect to at least one network included in the deep neural network to update the parameters of that at least one network, where the loss function is used to compute the loss between the predicted processing result and the standard result. While any one of the first feature network, the second feature network, the first processing network, and the second processing network is being updated, the parameters of the other three networks remain unchanged. Using the gradient of a loss function with respect to a network to update that network's parameters through an optimization algorithm (for example, a gradient descent algorithm) is a common technique in the art and is not described in detail here.
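A minimal sketch of one such update step, assuming the model exposes its sub-networks by name and only the first processing network is updated in this step; nothing here is taken from the embodiment's actual code:

```python
import torch

def train_step(model, optimizer, loss_fn, sample, standard_result):
    # Freeze everything except the first processing network for this update.
    for name, p in model.named_parameters():
        p.requires_grad = name.startswith("first_processing")
    prediction = model(sample)                    # forward pass: predicted processing result
    loss = loss_fn(prediction, standard_result)   # loss between prediction and ground truth
    optimizer.zero_grad()
    loss.backward()                               # gradient of the loss w.r.t. the unfrozen network
    optimizer.step()                              # e.g. a gradient-descent update
    return loss.item()
```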
前述实施例采用的深度神经网络为采用图7中的训练方法得到的网络,应理解图7中 的深度神经网络与前述实施例中的深度神经网络的结构和处理过程均相同。The deep neural network used in the foregoing embodiment is a network obtained by using the training method in FIG. 7. It should be understood that the structure and processing process of the deep neural network in FIG. 7 are the same as the deep neural network in the foregoing embodiment.
本申请实施例中,数据处理设备训练可以独立处理不同粒度的词语的深度神经网络,以便于得到能够避免由较细粒度的信息得到较粗粒度的信息的过程的深度神经网络,实现简单。In the embodiments of the present application, the data processing device trains a deep neural network that can independently process words of different granularities, so as to obtain a deep neural network that can avoid the process of obtaining coarser-grained information from finer-grained information, and is simple to implement.
前述实施例介绍了自然语言处理方法训练方法,下面介绍实现这些方法的数据处理设备的结构。图8为本申请实施例提供的一种数据处理设备的结构示意图,如图8所示,该数据处理设备可包括:The foregoing embodiments introduced training methods for natural language processing methods, and the structure of data processing equipment implementing these methods is described below. FIG. 8 is a schematic structural diagram of a data processing device provided by an embodiment of the application. As shown in FIG. 8, the data processing device may include:
获取单元801,用于获得待处理的自然语言文本;The obtaining unit 801 is configured to obtain the natural language text to be processed;
处理单元802,用于利用训练得到的深度神经网络对该自然语言文本做处理;The processing unit 802 is configured to process the natural language text by using the deep neural network obtained by training;
输出单元803,用于输出处理该自然语言文本得到的目标结果。The output unit 803 is configured to output the target result obtained by processing the natural language text.
The deep neural network includes a granularity annotation network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network, and the processing includes: using the granularity annotation network to determine the granularity of each word in the natural language text; using the first feature network to perform feature extraction on the words of the first granularity in the natural language text and outputting the obtained first feature information to the first processing network; using the second feature network to perform feature extraction on the words of the second granularity in the natural language text and outputting the obtained second feature information to the second processing network; using the first processing network to process the first feature information and outputting the obtained first processing result to the fusion network; using the second processing network to perform the processing on the second feature information and outputting the obtained second processing result to the fusion network; and using the fusion network to fuse the first processing result and the second processing result to obtain the target result; the first granularity and the second granularity are different.
处理单元802可以是数据处理设备中的中央处理器(Central Processing Unit,CPU),也可以是神经网络处理器(Neural-network Processing Unit,NPU),还可以是其他类型的处理器。输出单元803可以是显示器、显示屏、音频设备等。该目标结果可以是由该自然语言文本得到的另一个自然语言文本,数据处理设备的显示屏显示得到的自然语言文本。该目标结果可以由该自然语言文本得到的另一个自然语言文本对应的语音,数据处理设备中的音频设备播放该语音。The processing unit 802 may be a central processing unit (Central Processing Unit, CPU) in a data processing device, a neural network processor (Neural-network Processing Unit, NPU), or other types of processors. The output unit 803 may be a display, a display screen, an audio device, etc. The target result may be another natural language text obtained from the natural language text, and the obtained natural language text is displayed on the display screen of the data processing device. The target result can be a voice corresponding to another natural language text obtained from the natural language text, and the audio device in the data processing device plays the voice.
在一个可选的实现方式中,处理单元802,还用于将训练样本输入至深度神经网络做处理,得到预测处理结果;根据该预测处理结果和标准结果,确定该训练样本对应的损失;该标准结果为利用该深度神经网络处理该训练样本期望得到的处理结果;利用该训练样本对应的损失,通过优化算法更新该深度神经网络的参数。In an optional implementation manner, the processing unit 802 is also used to input training samples into the deep neural network for processing to obtain prediction processing results; according to the prediction processing results and standard results, determine the loss corresponding to the training samples; The standard result is the processing result expected to be obtained by using the deep neural network to process the training sample; using the loss corresponding to the training sample, the parameters of the deep neural network are updated through an optimization algorithm.
The deep neural network includes a granularity annotation network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network, and the processing includes: using the granularity annotation network to determine the granularity of each word in the training sample; using the first feature network to perform feature extraction on the words of the first granularity in the training sample and outputting the obtained third feature information to the first processing network; using the second feature network to perform feature extraction on the words of the second granularity in the training sample and outputting the obtained fourth feature information to the second processing network; using the first processing network to perform target processing on the third feature information and outputting the obtained third processing result to the fusion network; using the second processing network to perform the target processing on the fourth feature information and outputting the obtained fourth processing result to the fusion network; and using the fusion network to fuse the third processing result and the fourth processing result to obtain the predicted processing result; the first granularity and the second granularity are different.
详细的训练方法参阅图7,这里不再详述。The detailed training method is shown in Figure 7, which will not be detailed here.
前述实施例描述了数据处理设备利用深度神经网络来处理自然语言任务的方法。下面介绍一下深度神经网络以方便读者进一步理解本方案。The foregoing embodiments describe a method in which a data processing device uses a deep neural network to process natural language tasks. The following introduces the deep neural network to facilitate readers to further understand this scheme.
A deep neural network (DNN) can be understood as a neural network with many hidden layers; there is no particular threshold for "many" here, and what are commonly called multi-layer neural networks and deep neural networks are essentially the same thing. Divided by the positions of the different layers, the layers inside a DNN fall into three categories: the input layer, the hidden layers, and the output layer. Generally, the first layer is the input layer, the last layer is the output layer, and all the layers in between are hidden layers. The layers are fully connected, that is, any neuron in the i-th layer is connected to every neuron in the (i+1)-th layer. Although a DNN looks complicated, the work of each layer is actually simple: it is the linear relation y = α(W·x + b), where x is the input vector, y is the output vector, b is the offset vector, W is the weight matrix (also called the coefficients), and α() is the activation function. Each layer simply performs this operation on its input vector x to obtain the output vector y. Because a DNN has many layers, there are correspondingly many coefficient matrices W and offset vectors b. These parameters are defined in the DNN as follows, taking the coefficient W as an example. In a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as W^3_{24}: the superscript 3 denotes the layer in which the coefficient W is located, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer. In summary, the coefficient from the k-th neuron of the (L-1)-th layer to the j-th neuron of the L-th layer is defined as W^L_{jk}. Note that the input layer has no W parameters. In a deep neural network, more hidden layers enable the network to better characterize complex situations in the real world. Theoretically, a model with more parameters has higher complexity and larger "capacity", which means it can accomplish more complex learning tasks.
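A minimal sketch of the single-layer operation just described, with ReLU standing in for the activation function α; the sizes are arbitrary:

```python
import numpy as np

def dense_layer(x: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    # y = alpha(W @ x + b), here with alpha = ReLU
    return np.maximum(W @ x + b, 0.0)

x = np.array([1.0, -2.0, 0.5])   # input vector (3 inputs)
W = np.random.randn(2, 3)        # weight matrix (2 outputs x 3 inputs)
b = np.zeros(2)                  # offset vector
y = dense_layer(x, W, b)         # output vector
```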
前述实施例中数据处理设备采用深度神经网络所执行的方法可以在NPU中实现。图9为本申请实施例提供的一种神经网络处理器的结构示意图。The method executed by the data processing device using the deep neural network in the foregoing embodiment can be implemented in the NPU. FIG. 9 is a schematic structural diagram of a neural network processor provided by an embodiment of the application.
The neural network processor (NPU) 90 is mounted on the host CPU as a coprocessor, and the host CPU allocates tasks (for example, natural language processing tasks) to it. The core part of the NPU is the arithmetic circuit 903; the controller 904 controls the arithmetic circuit 903 to fetch matrix data from the memories and perform multiplication operations.
在一些实现中,运算电路903内部包括多个处理单元(Process Engine,PE)。在一些实现中,运算电路903是二维脉动阵列。运算电路903还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路903是通用的矩阵处理器。In some implementations, the arithmetic circuit 903 includes multiple processing units (Process Engine, PE). In some implementations, the arithmetic circuit 903 is a two-dimensional systolic array. The arithmetic circuit 903 may also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 903 is a general-purpose matrix processor.
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器902中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器901中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器908accumulator中。For example, suppose there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from the weight memory 902 and caches it on each PE in the arithmetic circuit. The arithmetic circuit takes the matrix A data and matrix B from the input memory 901 to perform matrix operations, and the partial or final result of the obtained matrix is stored in the accumulator 908.
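A software analogue of this accumulate-as-you-go matrix multiplication can be sketched as follows; the tiling is illustrative and does not reflect the circuit's actual PE layout:

```python
import numpy as np

def blocked_matmul(A: np.ndarray, B: np.ndarray, tile: int = 2) -> np.ndarray:
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))                                  # plays the role of the accumulator 908
    for k0 in range(0, K, tile):
        C += A[:, k0:k0 + tile] @ B[k0:k0 + tile, :]      # partial results accumulated block by block
    return C
```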
统一存储器906用于存放输入数据以及输出数据。权重数据直接通过存储单元访问控制器(Direct Memory Access Controller,DMAC)905被搬运到权重存储器902中。输入数据也通过DMAC被搬运到统一存储器906中。The unified memory 906 is used to store input data and output data. The weight data is directly transferred to the weight memory 902 through the direct memory access controller (DMAC) 905. The input data is also transferred to the unified memory 906 through the DMAC.
总线接口单元(Bus Interface Unit,BIU)510,用于AXI总线与DMAC和取指存储器(Instruction Fetch Buffer)909的交互。The Bus Interface Unit (BIU) 510 is used for the interaction between the AXI bus and the DMAC and the instruction fetch buffer (Instruction Fetch Buffer) 909.
总线接口单元510还用于取指存储器909从外部存储器获取指令,还用于存储单元访 问控制器905从外部存储器获取输入矩阵A或者权重矩阵B的原数据。The bus interface unit 510 is also used for the instruction fetch memory 909 to obtain instructions from the external memory, and also used for the storage unit access controller 905 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
DMAC主要用于将外部存储器DDR中的输入数据搬运到统一存储器906或将权重数据搬运到权重存储器902中或将输入数据数据搬运到输入存储器901中。The DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 906 or the weight data to the weight memory 902 or the input data to the input memory 901.
向量计算单元907多个运算处理单元,在需要的情况下,对运算电路的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。主要用于神经网络中非卷积/FC层网络计算,如Pooling(池化),Batch Normalization(批归一化),Local Response Normalization(局部响应归一化)等。The vector calculation unit 907 has multiple arithmetic processing units, if necessary, further processing the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on. Mainly used for non-convolution/FC layer network calculations in neural networks, such as Pooling, Batch Normalization, Local Response Normalization, etc.
在一些实现种,向量计算单元能907将经处理的输出的向量存储到统一缓存器906。例如,向量计算单元907可以将非线性函数应用到运算电路903的输出,例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元907生成归一化的值、合并值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路903的激活输入,例如用于在神经网络中的后续层中的使用。In some implementations, the vector calculation unit 907 can store the processed output vector in the unified buffer 906. For example, the vector calculation unit 907 may apply a nonlinear function to the output of the arithmetic circuit 903, such as a vector of accumulated values, to generate the activation value. In some implementations, the vector calculation unit 907 generates a normalized value, a combined value, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 903, for example for use in a subsequent layer in a neural network.
控制器904连接的取指存储器(instruction fetch buffer)909,用于存储控制器904使用的指令;The instruction fetch buffer 909 connected to the controller 904 is used to store instructions used by the controller 904;
统一存储器906,输入存储器901,权重存储器902以及取指存储器909均为On-Chip存储器。The unified memory 906, the input memory 901, the weight memory 902, and the fetch memory 909 are all On-Chip memories.
其中,图3所示的深度神经网络中各层的运算可以由矩阵计算单元212或向量计算单元907执行。Among them, the operations of each layer in the deep neural network shown in FIG. 3 may be executed by the matrix calculation unit 212 or the vector calculation unit 907.
本申请采用NPU实现基于深度神经网络的自然语言处理方法以及训练方法,可以大大提高数据处理设备的处理自然语言处理任务以及训练深度神经网络的效率。In this application, NPU is used to implement a natural language processing method and training method based on a deep neural network, which can greatly improve the efficiency of processing natural language processing tasks and training a deep neural network of a data processing device.
下面从硬件处理的角度对本发明实施例中的数据处理设备进行描述。The following describes the data processing device in the embodiment of the present invention from the perspective of hardware processing.
图10为本申请实施例提供的一种智能终端的部分结构的框图。参考图10,智能终端包括:射频(Radio Frequency,RF)电路1010、存储器1020、输入单元1030、显示单元1040、传感器1050、音频电路1060、无线保真(wireless fidelity,WiFi)模块1070、片上系统(System On Chip,SoC)1080以及电源1090等部件。FIG. 10 is a block diagram of a partial structure of an intelligent terminal provided by an embodiment of the application. 10, the smart terminal includes: a radio frequency (RF) circuit 1010, a memory 1020, an input unit 1030, a display unit 1040, a sensor 1050, an audio circuit 1060, a wireless fidelity (WiFi) module 1070, a system on chip (System On Chip, SoC) 1080 and power supply 1090 and other components.
存储器1020包括DDR存储器,当然还可以包括高速随机存取存储器,或者包括非易失性存储器等其他存储单元,例如至少一个磁盘存储器件、闪存器件、或其他易失性固态存储器件等。The memory 1020 includes DDR memory, of course, may also include high-speed random access memory, or include other storage units such as non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage devices.
本领域技术人员可以理解,图10中示出的智能终端结构并不构成对智能终端的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。Those skilled in the art can understand that the structure of the smart terminal shown in FIG. 10 does not constitute a limitation on the smart terminal, and may include more or less components than those shown in the figure, or a combination of certain components, or different component arrangements.
下面结合图10对智能终端的各个构成部件进行具体的介绍:The components of the smart terminal are specifically introduced below in conjunction with Figure 10:
RF电路1010可用于收发信息或通话过程中,信号的接收和发送,特别地,将基站的下行信息接收后,给SoC 1080处理;另外,将设计上行的数据发送给基站。通常,RF电路1010包括但不限于天线、至少一个放大器、收发信机、耦合器、低噪声放大器(Low Noise Amplifier,LNA)、双工器等。此外,RF电路1010还可以通过无线通信与网络和其他设备通信。上述无线通信可以使用任一通信标准或协议,包括但不限于全球移动通讯系统(Global System of Mobile communication,GSM)、通用分组无线服务(General Packet Radio Service,GPRS)、码分多址(Code Division Multiple Access,CDMA)、宽带码分多址(Wideband  Code Division Multiple Access,WCDMA)、长期演进(Long Term Evolution,LTE)、电子邮件、短消息服务(Short Messaging Service,SMS)等。The RF circuit 1010 can be used for receiving and sending signals during the process of sending and receiving information or talking. In particular, after receiving the downlink information of the base station, it is processed by SoC 1080; in addition, the designed uplink data is sent to the base station. Generally, the RF circuit 1010 includes but is not limited to an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 1010 can also communicate with the network and other devices through wireless communication. The above wireless communication can use any communication standard or protocol, including but not limited to Global System of Mobile Communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (Code Division) Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), Email, Short Messaging Service (SMS), etc.
存储器1020可用于存储软件程序以及模块,SoC 1080通过运行存储在存储器1020的软件程序以及模块,从而执行智能终端的各种功能应用以及数据处理。存储器1020可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需的应用程序(比如声音播放功能、图像播放功能、翻译功能、复述功能等)等;存储数据区可存储根据智能终端的使用所创建的数据(比如音频数据、电话本等)等。The memory 1020 may be used to store software programs and modules. The SoC 1080 runs the software programs and modules stored in the memory 1020 to execute various functional applications and data processing of the smart terminal. The memory 1020 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function (such as a sound playback function, an image playback function, a translation function, a retelling function, etc.), etc.; The data storage area can store data (such as audio data, phone book, etc.) created according to the use of the smart terminal.
输入单元1030可用于接收输入的自然语言文本以及语音数据,以及产生与智能终端的用户设置以及功能控制有关的键信号输入。具体地,输入单元1030可包括触控面板1031以及其他输入设备1032。触控面板1031,也称为触摸屏,可收集用户在其上或附近的触摸操作(比如用户使用手指、触笔等任何适合的物体或附件在触控面板1031上或在触控面板1031附近的操作),并根据预先设定的程式驱动相应的连接装置。触控面板1031用于接收用户输入的自然语言文本,并将该自然语言文本输入至SoC1080。可选的,触控面板1031可包括触摸检测装置和触摸控制器两个部分。其中,触摸检测装置检测用户的触摸方位,并检测触摸操作带来的信号,将信号传送给触摸控制器;触摸控制器从触摸检测装置上接收触摸信息,并将它转换成触点坐标,再送给SoC 1080,并能接收SoC 1080发来的命令并加以执行。此外,可以采用电阻式、电容式、红外线以及表面声波等多种类型实现触控面板1031。除了触控面板1031,输入单元1030还可以包括其他输入设备1032。具体地,其他输入设备1032可以包括但不限于物理键盘、功能键(比如音量控制按键、开关按键等)、轨迹球、鼠标、操作杆、触摸屏、话筒等中的一种或多种。输入设备1032包括的话筒可以接收用户输入的语音数据,并将该语音数据输入至SoC1080。The input unit 1030 can be used to receive input natural language text and voice data, and generate key signal inputs related to user settings and function control of the smart terminal. Specifically, the input unit 1030 may include a touch panel 1031 and other input devices 1032. The touch panel 1031, also known as a touch screen, can collect user touch operations on or near it (for example, the user uses any suitable objects or accessories such as fingers, stylus, etc.) on the touch panel 1031 or near the touch panel 1031. Operation), and drive the corresponding connection device according to the preset program. The touch panel 1031 is used to receive the natural language text input by the user and input the natural language text into the SoC1080. Optionally, the touch panel 1031 may include two parts: a touch detection device and a touch controller. Among them, the touch detection device detects the user's touch position, and detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, and then sends it Give SoC 1080, and can receive commands from SoC 1080 and execute them. In addition, the touch panel 1031 can be realized by various types such as resistive, capacitive, infrared, and surface acoustic wave. In addition to the touch panel 1031, the input unit 1030 may also include other input devices 1032. Specifically, other input devices 1032 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control buttons, switch buttons, etc.), trackball, mouse, joystick, touch screen, microphone, etc. The microphone included in the input device 1032 can receive the voice data input by the user and input the voice data to the SoC1080.
SoC 1080通过运行存储在存储器1020的软件程序以及模块,从而执行本申请提供的数据处理方法对输入单元1030输入的自然语言文本做处理,得到目标结果。SoC 1080也可以在将输入单元1030输入的语音数据转换为自然语言文本后,执行本申请提供的数据处理方法对该自然语言文本做处理,得到目标结果。The SoC 1080 runs the software programs and modules stored in the memory 1020 to execute the data processing method provided in this application to process the natural language text input by the input unit 1030 to obtain the target result. SoC 1080 may also convert the voice data input by the input unit 1030 into natural language text, and then execute the data processing method provided in this application to process the natural language text to obtain the target result.
显示单元1040可用于显示由用户输入的信息或提供给用户的信息以及智能终端的各种菜单。显示单元1040可包括显示面板1041,可选的,可以采用液晶显示器(Liquid Crystal Display,LCD)、有机发光二极管(Organic Light-Emitting Diode,OLED)等形式来配置显示面板1041。显示单元1040可用于显示SoC 1080处理自然语言文本得到的目标结果。进一步的,触控面板1031可覆盖显示面板1041,当触控面板1031检测到在其上或附近的触摸操作后,传送给SoC 1080以确定触摸事件的类型,随后SoC 1080根据触摸事件的类型在显示面板1041上提供相应的视觉输出。虽然在图10中,触控面板1031与显示面板1041是作为两个独立的部件来实现智能终端的输入和输入功能,但是在某些实施例中,可以将触控面板1031与显示面板1041集成而实现智能终端的输入和输出功能。The display unit 1040 may be used to display information input by the user or information provided to the user and various menus of the smart terminal. The display unit 1040 may include a display panel 1041, and optionally, the display panel 1041 may be configured in the form of a liquid crystal display (Liquid Crystal Display, LCD), an organic light-emitting diode (Organic Light-Emitting Diode, OLED), etc. The display unit 1040 can be used to display the target result obtained by the SoC 1080 processing natural language text. Further, the touch panel 1031 can cover the display panel 1041. When the touch panel 1031 detects a touch operation on or near it, it is sent to SoC 1080 to determine the type of touch event, and then SoC 1080 displays the touch event according to the type of touch event. The display panel 1041 provides corresponding visual output. Although in FIG. 10, the touch panel 1031 and the display panel 1041 are used as two independent components to implement the input and input functions of the smart terminal, in some embodiments, the touch panel 1031 and the display panel 1041 can be integrated And realize the input and output functions of the intelligent terminal.
智能终端还可包括至少一种传感器1050,比如光传感器、运动传感器以及其他传感器。具体地,光传感器可包括环境光传感器及接近传感器,其中,环境光传感器可根据环境光线的明暗来调节显示面板1041的亮度,接近传感器可在智能终端移动到耳边时,关闭显示面板1041和/或背光。作为运动传感器的一种,加速计传感器可检测各个方向上(一般为 三轴)加速度的大小,静止时可检测出重力的大小及方向,可用于识别智能终端姿态的应用(比如横竖屏切换、相关游戏、磁力计姿态校准)、振动识别相关功能(比如计步器、敲击)等;至于智能终端还可配置的陀螺仪、气压计、湿度计、温度计、红外线传感器等其他传感器,在此不再赘述。The smart terminal may also include at least one sensor 1050, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor can include an ambient light sensor and a proximity sensor. The ambient light sensor can adjust the brightness of the display panel 1041 according to the brightness of the ambient light. The proximity sensor can close the display panel 1041 and the display panel 1041 when the smart terminal is moved to the ear. / Or backlight. As a kind of motion sensor, the accelerometer sensor can detect the magnitude of acceleration in various directions (usually three axes), and can detect the magnitude and direction of gravity when it is stationary, and can be used to identify smart terminal posture applications (such as horizontal and vertical screen switching, Related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer, percussion), etc.; as for other sensors such as gyroscopes, barometers, hygrometers, thermometers, infrared sensors, etc. that can be configured in smart terminals, here No longer.
音频电路1060、扬声器1061,传声器1062可提供用户与智能终端之间的音频接口。音频电路1060可将接收到的音频数据转换后的电信号,传输到扬声器1061,由扬声器1061转换为声音信号输出;另一方面,传声器1062将收集的声音信号转换为电信号,由音频电路1060接收后转换为音频数据,再将音频数据输出SoC 1080处理后,经RF电路1010以发送给比如另一智能终端,或者将音频数据输出至存储器1020以便进一步处理。The audio circuit 1060, the speaker 1061, and the microphone 1062 can provide an audio interface between the user and the smart terminal. The audio circuit 1060 can transmit the electrical signal converted from the received audio data to the speaker 1061, and the speaker 1061 converts it into a sound signal for output; on the other hand, the microphone 1062 converts the collected sound signal into an electrical signal, which is then output by the audio circuit 1060. After being received, the audio data is converted into audio data, and then the audio data is output to SoC 1080 for processing, and then sent to another smart terminal through the RF circuit 1010, or the audio data is output to the memory 1020 for further processing.
WiFi属于短距离无线传输技术,智能终端通过WiFi模块1070可以帮助用户收发电子邮件、浏览网页和访问流式媒体等,它为用户提供了无线的宽带互联网访问。虽然图10示出了WiFi模块1070,但是可以理解的是,其并不属于智能终端的必须构成,完全可以根据需要在不改变发明的本质的范围内而省略。WiFi is a short-distance wireless transmission technology. The smart terminal can help users send and receive emails, browse web pages, and access streaming media through the WiFi module 1070. It provides users with wireless broadband Internet access. Although FIG. 10 shows the WiFi module 1070, it is understandable that it is not a necessary component of the smart terminal, and can be omitted as needed without changing the essence of the invention.
SoC 1080是智能终端的控制中心,利用各种接口和线路连接整个智能终端的各个部分,通过运行或执行存储在存储器1020内的软件程序和/或模块,以及调用存储在存储器1020内的数据,执行智能终端的各种功能和处理数据,从而对智能终端进行整体监控。可选的,SoC 1080可包括多个处理单元,例如CPU或者各种业务处理器;SoC 1080还可集成应用处理器和调制解调处理器,其中,应用处理器主要处理操作系统、用户界面和应用程序等,调制解调处理器主要处理无线通信。可以理解的是,上述调制解调处理器也可以不集成到SoC 1080中。 SoC 1080 is the control center of the intelligent terminal. It uses various interfaces and lines to connect the various parts of the entire intelligent terminal. By running or executing software programs and/or modules stored in the memory 1020, and calling data stored in the memory 1020, Perform various functions of the smart terminal and process data, thereby monitoring the smart terminal as a whole. Optionally, SoC 1080 may include multiple processing units, such as CPUs or various service processors; SoC 1080 may also integrate application processors and modem processors, where the application processor mainly processes operating systems, user interfaces, and For application programs, the modem processor mainly deals with wireless communication. It is understandable that the above modem processor may not be integrated into SoC 1080.
智能终端还包括给各个部件供电的电源1090(比如电池),优选的,电源可以通过电源管理系统与SoC 1080逻辑相连,从而通过电源管理系统实现管理充电、放电、以及功耗管理等功能。The smart terminal also includes a power supply 1090 (such as a battery) for supplying power to various components. Preferably, the power supply can be logically connected to the SoC 1080 through a power management system, so that functions such as charging, discharging, and power management are realized through the power management system.
尽管未示出,智能终端还可以包括摄像头、蓝牙模块等,在此不再赘述。Although not shown, the smart terminal may also include a camera, a Bluetooth module, etc., which will not be repeated here.
图11是本申请实施例提供的一种数据处理设备的部分结构的框图。如图11所示,数据处理设备1100可以处理器1101、存储器1102、输入设备1103、输出设备1104以及总线1105。其中,处理器1101、存储器1102、输入设备1103、输出设备1104通过总线1105实现彼此之间的通信连接。Fig. 11 is a block diagram of a partial structure of a data processing device provided by an embodiment of the present application. As shown in FIG. 11, the data processing device 1100 may include a processor 1101, a memory 1102, an input device 1103, an output device 1104, and a bus 1105. Among them, the processor 1101, the memory 1102, the input device 1103, and the output device 1104 realize the communication connection between each other through the bus 1105.
处理器1101可以采用通用的CPU,微处理器,应用专用集成电路(Application Specific Integrated Circuit,ASIC),或者一个或多个集成电路,用于执行相关程序,以实现本发明实施例所提供的技术方案。处理器1101对应于图8中的处理单元802。The processor 1101 may adopt a general CPU, a microprocessor, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits for executing related programs to implement the technology provided by the embodiments of the present invention Program. The processor 1101 corresponds to the processing unit 802 in FIG. 8.
存储器1102可以是只读存储器(Read Only Memory,ROM),静态存储设备,动态存储设备或者随机存取存储器(Random Access Memory,RAM)。存储器1102可以存储操作系统、以及其他应用程序。用于通过软件或者固件来实现本申请实施例提供的数据处理设备包括的模块以及部件所需执行的功能,或者用于实现本申请方法实施例提供的上述方法的程序代码存储在存储器1102中,并由处理器1101读取存储器1102中的代码来执行数据处理设备包括的模块以及部件所需执行的操作,或者执行本申请实施例提供的上述方法。The memory 1102 may be a read only memory (Read Only Memory, ROM), a static storage device, a dynamic storage device, or a random access memory (Random Access Memory, RAM). The memory 1102 may store an operating system and other application programs. The program code used to implement the modules and components of the data processing device provided in the embodiment of the present application through software or firmware, or the program code used to implement the foregoing method provided in the method embodiment of the present application is stored in the memory 1102, And the processor 1101 reads the code in the memory 1102 to execute operations required by the modules and components included in the data processing device, or execute the above-mentioned methods provided in the embodiments of the present application.
输入设备1103,对应于获取单元801,用于输入数据处理设备待处理的自然语言文本。The input device 1103, corresponding to the acquiring unit 801, is used to input natural language text to be processed by the data processing device.
输出设备1104,对应于输出单元803,用于输出数据处理设备得到的目标结果。The output device 1104, corresponding to the output unit 803, is used to output the target result obtained by the data processing device.
总线1105可包括在数据处理设备各个部件(例如处理器1101、存储器1102、输入设备1103、输出设备1104)之间传送信息的通路。The bus 1105 may include a path for transferring information between various components of the data processing device (for example, the processor 1101, the memory 1102, the input device 1103, and the output device 1104).
应注意,尽管图11所示的数据处理设备1100仅仅示出了处理器1101、存储器1102、输入设备1103、输出设备1104以及总线1105,但是在具体实现过程中,本领域的技术人员应当明白,数据处理设备1100还包含实现正常运行所必须的其他器件。同时,根据具体需要,本领域的技术人员应当明白,数据处理设备1100还可包含实现其他附加功能的硬件器件。此外,本领域的技术人员应当明白,数据处理设备1100也可仅仅包含实现本申请实施例所必须的器件,而不必包含图11中所示的全部器件。It should be noted that although the data processing device 1100 shown in FIG. 11 only shows the processor 1101, the memory 1102, the input device 1103, the output device 1104, and the bus 1105, in the specific implementation process, those skilled in the art should understand that, The data processing device 1100 also includes other devices necessary for normal operation. At the same time, according to specific needs, those skilled in the art should understand that the data processing device 1100 may also include hardware devices that implement other additional functions. In addition, those skilled in the art should understand that the data processing device 1100 may also only include the components necessary to implement the embodiments of the present application, and not necessarily include all the components shown in FIG. 11.
An embodiment of this application provides a computer-readable storage medium. The computer-readable storage medium stores a computer program, and the computer program includes software program instructions. When the program instructions are executed by a processor in a data processing device, the data processing method and/or the training method in the foregoing embodiments are implemented.
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used for implementation, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted through a computer-readable storage medium. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or a data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).
The foregoing descriptions are merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any person skilled in the art can readily conceive of equivalent modifications or replacements within the technical scope disclosed in this application, and such modifications or replacements shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (23)

  1. A natural language processing method, characterized in that the method comprises:
    obtaining natural language text to be processed; and
    processing the natural language text by using a deep neural network obtained through training, and outputting a target result obtained by processing the natural language text, wherein the deep neural network comprises a granularity tagging network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network, and the processing comprises: determining the granularity of each word in the natural language text by using the granularity tagging network; performing feature extraction on words of a first granularity in the natural language text by using the first feature network, and outputting obtained first feature information to the first processing network; performing feature extraction on words of a second granularity in the natural language text by using the second feature network, and outputting obtained second feature information to the second processing network; processing the first feature information by using the first processing network, and outputting an obtained first processing result to the fusion network; processing the second feature information by using the second processing network, and outputting an obtained second processing result to the fusion network; and fusing the first processing result and the second processing result by using the fusion network to obtain the target result, wherein the first granularity and the second granularity are different.
  2. The method according to claim 1, wherein the first feature network and the second feature network have different architectures, and/or the first processing network and the second processing network have different architectures.
  3. The method according to claim 1 or 2, wherein the input of the granularity tagging network is the natural language text, and the determining the granularity of each word in the natural language text by using the granularity tagging network comprises:
    determining, by using the granularity tagging network, the granularity of each word in the natural language text according to N granularities to obtain tagging information of the natural language text, and outputting the tagging information to the first feature network and the second feature network, wherein the tagging information is used to describe the granularity of each word or the probability that each word belongs to each of the N granularities, and N is an integer greater than 1;
    the performing feature extraction on words of the first granularity in the natural language text by using the first feature network comprises:
    processing the words of the first granularity by using the first feature network to obtain the first feature information, wherein the first feature information is a vector or matrix representing the words of the first granularity; and
    the performing feature extraction on words of the second granularity in the natural language text by using the second feature network comprises:
    processing the words of the second granularity by using the second feature network to obtain the second feature information, wherein the second feature information is a vector or matrix representing the words of the second granularity.
  4. The method according to claim 3, wherein the first processing result is a sequence containing one or more words, and the processing the first feature information by using the first processing network comprises:
    processing, by using the first processing network, the input first feature information and the words that the first processing network has already output in the course of processing the first feature information, to obtain the first processing result.
  5. The method according to claim 4, wherein the target result output by the fusion network is a sequence containing one or more words, and the fusing the first processing result and the second processing result by using the fusion network to obtain the target result comprises:
    processing, by using the fusion network, the first processing result, the second processing result, and the words that the fusion network has already output in the course of processing the first processing result and the second processing result, to determine a target word to be output, and outputting the target word.
  6. A training method, characterized in that the method comprises:
    inputting a training sample into a deep neural network for processing to obtain a predicted processing result, wherein the deep neural network comprises a granularity tagging network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network, and the processing comprises: determining the granularity of each word in the training sample by using the granularity tagging network; performing feature extraction on words of a first granularity in the training sample by using the first feature network, and outputting obtained third feature information to the first processing network; performing feature extraction on words of a second granularity in the training sample by using the second feature network, and outputting obtained fourth feature information to the second processing network; performing target processing on the third feature information by using the first processing network, and outputting an obtained third processing result to the fusion network; performing the target processing on the fourth feature information by using the second processing network, and outputting an obtained fourth processing result to the fusion network; and fusing the third processing result and the fourth processing result by using the fusion network to obtain the predicted processing result, wherein the first granularity and the second granularity are different;
    determining a loss corresponding to the training sample according to the predicted processing result and a standard result, wherein the standard result is the processing result expected to be obtained by processing the training sample with the deep neural network; and
    updating parameters of the deep neural network through an optimization algorithm by using the loss corresponding to the training sample.
  7. The method according to claim 6, wherein the first feature network and the second feature network have different architectures, and/or the first processing network and the second processing network have different architectures.
  8. The method according to claim 6 or 7, wherein the input of the granularity tagging network is the natural language text, and the determining the granularity of each word in the natural language text by using the granularity tagging network comprises:
    determining, by using the granularity tagging network, the granularity of each word in the natural language text according to N granularities to obtain tagging information of the natural language text, and outputting the tagging information to the first feature network and the second feature network, wherein the tagging information is used to describe the granularity of each word or the probability that each word belongs to each of the N granularities, and N is an integer greater than 1;
    the performing feature extraction on words of the first granularity in the natural language text by using the first feature network comprises:
    processing the words of the first granularity by using the first feature network to obtain the third feature information, wherein the third feature information is a vector or matrix representing the words of the first granularity; and
    the performing feature extraction on words of the second granularity in the natural language text by using the second feature network comprises:
    processing the words of the second granularity by using the second feature network to obtain the fourth feature information, wherein the fourth feature information is a vector or matrix representing the words of the second granularity.
  9. The method according to claim 8, wherein the third processing result is a sequence containing one or more words, and the processing the third feature information by using the first processing network comprises:
    processing, by using the first processing network, the input third feature information and the words that the first processing network has already output in the course of processing the third feature information, to obtain the third processing result.
  10. The method according to claim 9, wherein the target result output by the fusion network is a sequence containing one or more words, and the fusing the third processing result and the fourth processing result by using the fusion network to obtain the target result comprises:
    processing, by using the fusion network, the third processing result, the fourth processing result, and the words that the fusion network has already output in the course of processing the third processing result and the fourth processing result, to determine a target word to be output, and outputting the target word.
  11. The method according to any one of claims 6 to 10, wherein the updating parameters of the deep neural network through an optimization algorithm by using the loss corresponding to the training sample comprises:
    updating parameters of at least one network included in the deep neural network by using gradient values of a loss function with respect to the at least one network, wherein the loss function is used to calculate the loss between the predicted processing result and the standard result, and while any one of the first feature network, the second feature network, the first processing network, and the second processing network is being updated, the parameters of any one of the other three networks remain unchanged.
  12. A data processing device, characterized in that the device comprises:
    an acquiring unit, configured to obtain natural language text to be processed;
    a processing unit, configured to process the natural language text by using a deep neural network obtained through training, wherein the deep neural network comprises a granularity tagging network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network, and the processing comprises: determining the granularity of each word in the natural language text by using the granularity tagging network; performing feature extraction on words of a first granularity in the natural language text by using the first feature network, and outputting obtained first feature information to the first processing network; performing feature extraction on words of a second granularity in the natural language text by using the second feature network, and outputting obtained second feature information to the second processing network; processing the first feature information by using the first processing network, and outputting an obtained first processing result to the fusion network; processing the second feature information by using the second processing network, and outputting an obtained second processing result to the fusion network; and fusing the first processing result and the second processing result by using the fusion network to obtain a target result, wherein the first granularity and the second granularity are different; and
    an output unit, configured to output the target result obtained by processing the natural language text.
  13. The data processing device according to claim 12, wherein the first feature network and the second feature network have different architectures, and/or the first processing network and the second processing network have different architectures.
  14. The data processing device according to claim 12 or 13, wherein the input of the granularity tagging network is the natural language text;
    the processing unit is specifically configured to determine, by using the granularity tagging network, the granularity of each word in the natural language text according to N granularities to obtain tagging information of the natural language text, and output the tagging information to the first feature network and the second feature network, wherein the tagging information is used to describe the granularity of each word or the probability that each word belongs to each of the N granularities, and N is an integer greater than 1;
    the processing unit is specifically configured to process the words of the first granularity by using the first feature network to obtain the first feature information, wherein the first feature information is a vector or matrix representing the words of the first granularity; and
    the processing unit is specifically configured to process the words of the second granularity by using the second feature network to obtain the second feature information, wherein the second feature information is a vector or matrix representing the words of the second granularity.
  15. The data processing device according to claim 14, wherein the first processing result is a sequence containing one or more words; and
    the processing unit is specifically configured to process, by using the first processing network, the input first feature information and the words that the first processing network has already output in the course of processing the first feature information, to obtain the first processing result.
  16. The data processing device according to claim 15, wherein the target result output by the fusion network is a sequence containing one or more words; and
    the processing unit is specifically configured to process, by using the fusion network, the first processing result, the second processing result, and the words that the fusion network has already output in the course of processing the first processing result and the second processing result, to determine a target word to be output, and output the target word.
  17. A data processing device, characterized in that the device comprises:
    a processing unit, configured to input a training sample into a deep neural network for processing to obtain a predicted processing result, wherein the deep neural network comprises a granularity tagging network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network, and the processing comprises: determining the granularity of each word in the training sample by using the granularity tagging network; performing feature extraction on words of a first granularity in the training sample by using the first feature network, and outputting obtained third feature information to the first processing network; performing feature extraction on words of a second granularity in the training sample by using the second feature network, and outputting obtained fourth feature information to the second processing network; performing target processing on the third feature information by using the first processing network, and outputting an obtained third processing result to the fusion network; performing the target processing on the fourth feature information by using the second processing network, and outputting an obtained fourth processing result to the fusion network; and fusing the third processing result and the fourth processing result by using the fusion network to obtain the predicted processing result, wherein the first granularity and the second granularity are different; and
    the processing unit is further configured to determine a loss corresponding to the training sample according to the predicted processing result and a standard result, wherein the standard result is the processing result expected to be obtained by processing the training sample with the deep neural network, and to update parameters of the deep neural network through an optimization algorithm by using the loss corresponding to the training sample.
  18. The data processing device according to claim 17, wherein the first feature network and the second feature network have different architectures, and/or the first processing network and the second processing network have different architectures.
  19. The data processing device according to claim 17 or 18, wherein the input of the granularity tagging network is the natural language text;
    the processing unit is specifically configured to determine, by using the granularity tagging network, the granularity of each word in the natural language text according to N granularities to obtain tagging information of the natural language text, and output the tagging information to the first feature network and the second feature network, wherein the tagging information is used to describe the granularity of each word or the probability that each word belongs to each of the N granularities, and N is an integer greater than 1;
    the processing unit is specifically configured to process the words of the first granularity by using the first feature network to obtain the third feature information, wherein the third feature information is a vector or matrix representing the words of the first granularity; and
    the processing unit is specifically configured to process the words of the second granularity by using the second feature network to obtain the fourth feature information, wherein the fourth feature information is a vector or matrix representing the words of the second granularity.
  20. The data processing device according to claim 19, wherein the third processing result is a sequence containing one or more words; and
    the processing unit is specifically configured to process, by using the first processing network, the input third feature information and the words that the first processing network has already output in the course of processing the third feature information, to obtain the third processing result.
  21. The data processing device according to claim 20, wherein the target result output by the fusion network is a sequence containing one or more words; and
    the processing unit is specifically configured to process, by using the fusion network, the third processing result, the fourth processing result, and the words that the fusion network has already output in the course of processing the third processing result and the fourth processing result, to determine a target word to be output, and output the target word.
  22. The data processing device according to any one of claims 17 to 21, wherein
    the processing unit is specifically configured to update parameters of at least one network included in the deep neural network by using gradient values of a loss function with respect to the at least one network, wherein the loss function is used to calculate the loss between the predicted processing result and the standard result, and while any one of the first feature network, the second feature network, the first processing network, and the second processing network is being updated, the parameters of any one of the other three networks remain unchanged.
  23. A computer-readable storage medium, wherein the computer storage medium stores a computer program, the computer program comprises program instructions, and the program instructions, when executed by a processor, cause the processor to perform the method according to any one of claims 1 to 11.
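For orientation, the following is a minimal, hypothetical sketch of the structure recited in claims 1, 6 and 11: a granularity tagging network that scores each word against two granularities, two feature networks, two processing networks, and a fusion network, plus a training step in which one sub-network is updated while the parameters of the others are held fixed. PyTorch, the soft routing of words by granularity probability, the single-step classification output (instead of the word-by-word sequence generation of claims 4, 5, 9 and 10), and all module names and dimensions are assumptions made for illustration; the claims do not prescribe any particular architecture.

```python
# Hypothetical sketch (PyTorch assumed); illustrates the data flow of claims 1 and 6 only.
import torch
import torch.nn as nn

class MultiGranularityNet(nn.Module):
    def __init__(self, vocab_size, hidden=128, out_vocab=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        # Granularity tagging network: per-word distribution over N = 2 granularities.
        self.tagger = nn.Linear(hidden, 2)
        # Two feature networks (they may have different architectures, cf. claim 2).
        self.feat1 = nn.GRU(hidden, hidden, batch_first=True)   # first-granularity words
        self.feat2 = nn.Linear(hidden, hidden)                   # second-granularity words
        # Two processing networks and a fusion network.
        self.proc1 = nn.Linear(hidden, hidden)
        self.proc2 = nn.Linear(hidden, hidden)
        self.fuse = nn.Linear(2 * hidden, out_vocab)

    def forward(self, token_ids):                      # token_ids: (batch, seq_len)
        x = self.embed(token_ids)                      # (batch, seq_len, hidden)
        # Probability that each word belongs to each of the two granularities.
        gran = torch.softmax(self.tagger(x), dim=-1)   # (batch, seq_len, 2)
        # Soft routing: weight word features by their granularity probabilities.
        x1 = x * gran[..., 0:1]                        # first-granularity share
        x2 = x * gran[..., 1:2]                        # second-granularity share
        f1, _ = self.feat1(x1)                         # first feature information
        f2 = torch.relu(self.feat2(x2))                # second feature information
        p1 = torch.relu(self.proc1(f1)).mean(dim=1)    # first processing result
        p2 = torch.relu(self.proc2(f2)).mean(dim=1)    # second processing result
        return self.fuse(torch.cat([p1, p2], dim=-1))  # fused target result (logits)

# Training step in the spirit of claims 6 and 11: compute the loss between the predicted
# result and the standard (expected) result, then update only one sub-network while the
# parameters of the other sub-networks remain unchanged.
def train_step(model, optimizer, token_ids, target, update_only="proc1"):
    for name, p in model.named_parameters():
        p.requires_grad = name.startswith(update_only)
    logits = model(token_ids)
    loss = nn.functional.cross_entropy(logits, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage (shapes are illustrative):
# model = MultiGranularityNet(vocab_size=5000)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# ids = torch.randint(0, 5000, (4, 12)); gold = torch.randint(0, 1000, (4,))
# train_step(model, optimizer, ids, gold, update_only="proc1")
```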

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910108559.9A CN109902296B (en) 2019-01-18 2019-01-18 Natural language processing method, training method and data processing equipment
CN201910108559.9 2019-01-18

Publications (1)

Publication Number Publication Date
WO2020147369A1 (en) 2020-07-23

Family

ID=66944544

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/114146 WO2020147369A1 (en) 2019-01-18 2019-10-29 Natural language processing method, training method, and data processing device

Country Status (2)

Country Link
CN (1) CN109902296B (en)
WO (1) WO2020147369A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109902296B (en) * 2019-01-18 2023-06-30 华为技术有限公司 Natural language processing method, training method and data processing equipment
CN110472063B (en) * 2019-07-12 2022-04-08 新华三大数据技术有限公司 Social media data processing method, model training method and related device
CN112329465A (en) * 2019-07-18 2021-02-05 株式会社理光 Named entity identification method and device and computer readable storage medium
CN110705273B (en) * 2019-09-02 2023-06-13 腾讯科技(深圳)有限公司 Information processing method and device based on neural network, medium and electronic equipment
CN110837738B (en) * 2019-09-24 2023-06-30 平安科技(深圳)有限公司 Method, device, computer equipment and storage medium for identifying similarity
CN110674783B (en) * 2019-10-08 2022-06-28 山东浪潮科学研究院有限公司 Video description method and system based on multi-stage prediction architecture
CN111444686B (en) * 2020-03-16 2023-07-25 武汉中科医疗科技工业技术研究院有限公司 Medical data labeling method, medical data labeling device, storage medium and computer equipment
CN112488290B (en) * 2020-10-21 2021-09-07 上海旻浦科技有限公司 Natural language multitask modeling and predicting method and system with dependency relationship

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10635949B2 (en) * 2015-07-07 2020-04-28 Xerox Corporation Latent embeddings for word images and their semantics
CN107918782B (en) * 2016-12-29 2020-01-21 中国科学院计算技术研究所 Method and system for generating natural language for describing image content
EP3376400A1 (en) * 2017-03-14 2018-09-19 Fujitsu Limited Dynamic context adjustment in language models
CN108460089B (en) * 2018-01-23 2022-03-01 海南师范大学 Multi-feature fusion Chinese text classification method based on Attention neural network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160162478A1 (en) * 2014-11-25 2016-06-09 Lionbridge Techologies, Inc. Information technology platform for language translation and task management
CN107145483A (en) * 2017-04-24 2017-09-08 北京邮电大学 A kind of adaptive Chinese word cutting method based on embedded expression
CN107797985A (en) * 2017-09-27 2018-03-13 百度在线网络技术(北京)有限公司 Establish synonymous discriminating model and differentiate the method, apparatus of synonymous text
CN108268643A (en) * 2018-01-22 2018-07-10 北京邮电大学 A kind of Deep Semantics matching entities link method based on more granularity LSTM networks
CN109902296A (en) * 2019-01-18 2019-06-18 华为技术有限公司 Natural language processing method, training method and data processing equipment

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116032798A (en) * 2022-12-28 2023-04-28 天翼云科技有限公司 Automatic testing method and device for zero-trust identity authorization

Also Published As

Publication number Publication date
CN109902296A (en) 2019-06-18
CN109902296B (en) 2023-06-30

Legal Events

Code  Title / Description
121   Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19910183; Country of ref document: EP; Kind code of ref document: A1)
NENP  Non-entry into the national phase (Ref country code: DE)
122   Ep: pct application non-entry in european phase (Ref document number: 19910183; Country of ref document: EP; Kind code of ref document: A1)