WO2020147369A1 - Natural language processing method, training method, and data processing device - Google Patents


Publication number
WO2020147369A1
Authority
WO
WIPO (PCT)
Prior art keywords
network
processing
granularity
feature
words
Prior art date
Application number
PCT/CN2019/114146
Other languages
French (fr)
Chinese (zh)
Inventor
李梓超
蒋欣
刘群
Original Assignee
华为技术有限公司 (Huawei Technologies Co., Ltd.)
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司 (Huawei Technologies Co., Ltd.)
Publication of WO2020147369A1 publication Critical patent/WO2020147369A1/en


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Definitions

  • This application relates to the field of natural language processing, in particular to a natural language processing method, training method and data processing equipment.
  • Artificial Intelligence is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results.
  • Artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and to produce a new kind of intelligent machine that can react in a way similar to human intelligence.
  • Artificial intelligence studies the design principles and implementation methods of various intelligent machines, so that machines have the functions of perception, reasoning, and decision-making.
  • Natural language processing tasks can be divided according to different granularities, generally into character level, word level, phrase level, sentence level, discourse level, and so on; these granularities become progressively coarser.
  • part-of-speech tagging is a word-level task
  • named entity recognition is a phrase-level task
  • syntactic analysis is usually a sentence-level task.
  • Information at different granularities is not isolated; it is passed between granularities.
  • For tasks such as sentence classification and sentence-to-sentence semantic matching, word-level and phrase-level features are usually considered.
  • For tasks such as sentence translation or rewriting, it is usually necessary to use information at multiple granularities and finally synthesize it.
  • the current mainstream natural language processing methods based on deep learning process natural language text through neural networks.
  • However, in these methods words of different granularities are mixed together during the neural network's processing, so the probability of obtaining a correct processing result is low. Therefore, new solutions need to be studied.
  • the embodiments of the present application provide a natural language processing method, training method, and data processing device, which can avoid the process of obtaining coarser-grained information from finer-grained information, and can effectively improve the performance of processing natural language processing tasks.
  • the embodiments of the present application provide a natural language processing method, which includes: obtaining natural language text to be processed; and processing the natural language text using a deep neural network obtained by training, and outputting the target result obtained by processing the natural language text; wherein the deep neural network includes: a granularity annotation network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network, and the processing includes: using the granularity annotation network to determine the granularity of each word in the natural language text; using the first feature network to perform feature extraction on words of the first granularity in the natural language text, and outputting the obtained first feature information to the first processing network; using the second feature network to perform feature extraction on words of the second granularity in the natural language text, and outputting the obtained second feature information to the second processing network; using the first processing network to process the first feature information, and outputting the obtained first processing result to the fusion network; using the second processing network to process the second feature information, and outputting the obtained second processing result to the fusion network; and using the fusion network to fuse the first processing result and the second processing result to obtain the target result, where the first granularity and the second granularity are different.
  • the deep neural network may include N feature networks and N processing networks.
  • the N feature networks and the N processing networks have a one-to-one correspondence, and N is an integer greater than one.
  • a pair of corresponding feature network and processing network is used to process words of the same granularity. Since the data processing device processes words of different granularities separately, the processing operations for words of each granularity do not depend on the processing results of words of other granularities, which avoids the process of obtaining coarser-grained information from finer-grained information and greatly reduces the probability that the data processing device will get wrong results.
  • the data processing device uses a deep neural network to independently process words of different granularities, avoiding the process of obtaining coarser-grained information from finer-grained information, and can effectively improve the performance of processing natural language processing tasks.
  • the architectures of the first feature network and the second feature network are different, and/or the architectures of the first processing network and the second processing network are different.
  • Words with different granularities have different characteristics. Using networks with different architectures to process words with different granularities can more specifically process words with different granularities.
  • words of different granularities are processed through feature networks of different architectures or processing networks of different architectures, which further improves the performance of the data processing device in processing natural language processing tasks.
  • the input of the granular annotation network is the natural language text
  • the using the granular annotation network to determine the granularity of each word in the natural language text includes: using the granular annotation network Determine the granularity of each word in the natural language text according to N granularities to obtain the annotation information of the natural language text, and output the annotation information to the first feature network and the second feature network; wherein, The label information is used to describe the granularity of each word or the probability that each word belongs to the N granularities; N is an integer greater than 1;
  • the using the first feature network to perform feature extraction on the words of the first granularity in the natural language text includes: using the first feature network to process the words of the first granularity to obtain the first feature information,
  • the first feature information is a vector or matrix representing words of the first granularity;
  • the using the second feature network to perform feature extraction on the words of the second granularity in the natural language text includes: using the second feature network to process the words of the second granularity to obtain the second feature information,
  • the second feature information is a vector or matrix representing words of the second granularity.
  • the granular annotation network can accurately determine the granularity of each word in the natural language text, so that each feature network can process words with a specific granularity.
  • the granular labeling network includes a long short-term memory (LSTM) network and a bidirectional long short-term memory (BiLSTM) network; and the using the granular labeling network to determine the granularity of each word in the natural language text includes:
  • g_l = LSTM([h_l, z_{l-1}; g_{l-1}]);
  • BiLSTM() represents the processing operation of the BiLSTM network, and LSTM() represents the processing operation of the LSTM network;
  • x represents a word in the natural language text, and x_l represents the l-th word in the natural language text X;
  • h represents the hidden state variable in the BiLSTM network; h_l, h_{l-1}, and h_{l+1} respectively represent the hidden state variables when the BiLSTM network processes the l-th, (l-1)-th, and (l+1)-th words;
  • g represents the hidden state variable in the LSTM network; g_l and g_{l-1} respectively represent the hidden state variables when the LSTM network processes the l-th word and the (l-1)-th word in the natural language text;
  • z represents the probability that a word belongs to the reference granularity; z_{l-1} and z_l respectively represent the probabilities that the (l-1)-th word and the l-th word in the natural language text belong to the reference granularity;
  • the reference granularity is any one of the N granularities;
  • GS represents the Gumbel-Softmax function, and the temperature of the Gumbel-Softmax function is a hyperparameter;
  • W_g is a parameter matrix, that is, a parameter matrix in the granularity annotation network.
  • the granular annotation network uses the architecture of a multi-layer LSTM network to determine the granularity of each word in the natural language text, and can make full use of the granularities already determined for previous words when determining the granularity of a new word (a word whose granularity is yet to be determined); it is simple to implement and has high processing efficiency.
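  • As a rough illustration only (not the reference implementation of this application), the following PyTorch-style sketch shows one way the granularity labeling network described above could be organized: a BiLSTM over the input words, an LSTM whose per-step input combines the BiLSTM state with the previous word's granularity label, and a Gumbel-Softmax over W_g·g_l. All module names and dimensions (GranularityTagger, hidden_dim, and so on) are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GranularityTagger(nn.Module):
    """Sketch of a granularity labeling network: a BiLSTM over the input words
    plus a forward LSTM whose input combines the BiLSTM state with the previous
    word's granularity label, followed by Gumbel-Softmax (all names assumed)."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=128, n_granularities=2, tau=1.0):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, batch_first=True, bidirectional=True)
        # g_l = LSTM([h_l, z_{l-1}; g_{l-1}]) realized with a per-step LSTMCell
        self.cell = nn.LSTMCell(2 * hidden_dim + n_granularities, hidden_dim)
        self.W_g = nn.Linear(hidden_dim, n_granularities)   # parameter matrix W_g
        self.n_granularities = n_granularities
        self.tau = tau                                       # Gumbel-Softmax temperature

    def forward(self, x):
        # x: (batch, L) word indices of the natural language text
        h, _ = self.bilstm(self.embed(x))                    # (batch, L, 2*hidden_dim)
        batch, L, _ = h.shape
        g = h.new_zeros(batch, self.cell.hidden_size)
        c = torch.zeros_like(g)
        z_prev = h.new_zeros(batch, self.n_granularities)
        labels = []
        for l in range(L):
            g, c = self.cell(torch.cat([h[:, l], z_prev], dim=-1), (g, c))
            z = F.gumbel_softmax(self.W_g(g), tau=self.tau, hard=False)
            labels.append(z)
            z_prev = z
        return torch.stack(labels, dim=1)                    # (batch, L, n_granularities)
```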
  • the using the first feature network to perform feature extraction on words with a first granularity in the natural language text includes:
  • U_z = ENC_z(X, Z_X);
  • ENC_z represents the first feature network;
  • the first feature network is a Transformer model;
  • ENC_z() represents the processing operation performed by the first feature network;
  • X represents the natural language text;
  • Z_X = [z_1, z_2, ..., z_L] represents the annotation information;
  • z_1 to z_L sequentially represent the granularities of the first word to the L-th (last) word in the natural language text;
  • U_z represents the first feature information output by the first feature network.
  • the feature network can be used to accurately and quickly extract the feature information of words of the corresponding granularity.
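  • The following sketch, again illustrative only, shows one plausible way a feature network ENC_z could be realized as a Transformer encoder that uses the annotation information as a mask so that only words of granularity z contribute to U_z; the masking strategy and all names are assumptions rather than details taken from this application.

```python
import torch
import torch.nn as nn

class GranularityEncoder(nn.Module):
    """Sketch of one feature network ENC_z: a Transformer encoder over the
    input sentence, with the annotation Z_X supplied as a key-padding mask so
    that only words of this encoder's granularity z contribute to U_z."""
    def __init__(self, vocab_size, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x, z_mask):
        # x: (batch, L) word indices; z_mask: (batch, L) bool, True where the
        # word belongs to granularity z (assumes at least one True per row).
        u = self.encoder(self.embed(x), src_key_padding_mask=~z_mask)
        return u                                  # U_z: feature information for granularity z
```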
  • the first processing result is a sequence containing one or more words
  • the processing of the first feature information using the first processing network includes: using the first processing network to process the input first feature information and the words that the first processing network has already output in the process of processing the first feature information, to obtain the first processing result.
  • the first processing network adopts a recursive manner to process the feature information output by the corresponding feature network, which can make full use of the relevance of each word in the natural language text, thereby improving the efficiency and accuracy of processing.
  • the target result output by the fusion network is a sequence containing one or more words, and the using of the fusion network to fuse the first processing result and the second processing result to obtain the target result includes: using the fusion network to process the first processing result, the second processing result, and the words that the fusion network has already output in the process of processing the first processing result and the second processing result, so as to determine the target word to be output, and outputting the target word.
  • the fusion network uses a recursive method to process the processing results input to it by each processing network, which can make full use of the relevance of each word in the natural language text, thereby improving the efficiency and accuracy of its processing.
  • the fusion network includes at least one LSTM network, and the using of the fusion network to process the first processing result, the second processing result, and the sequence that the fusion network has already output in the process of processing the first processing result and the second processing result, so as to determine the target word to be output, includes:
  • using the LSTM network to calculate, with the following formula, the probability of a word of the reference granularity currently to be output:
  • h_t = LSTM(h_{t-1}, y_{t-1}, v0, v1);
  • h_t represents the hidden state variable in the LSTM network when the LSTM network processes the t-th word;
  • h_{t-1} represents the hidden state variable in the LSTM network when the LSTM network processes the (t-1)-th word;
  • LSTM() represents the processing operation performed by the LSTM;
  • the LSTM network has currently output (t-1) words;
  • y_{t-1} represents the (t-1)-th word output by the fusion network;
  • v0 represents the first processing result
  • v1 represents the second processing result
  • W_z is a parameter matrix in the fusion network, and the formula also uses a hyperparameter;
  • P(z_t = z | y_{1:t-1}, X) is the probability that the word currently to be output belongs to the reference granularity (granularity z);
  • t is an integer greater than 1;
  • P(y_t | z_t = z, y_{1:t-1}, X) represents the probability of outputting the target word y_t at the reference granularity;
  • P(y_t | y_{1:t-1}, X) represents the probability of outputting the target word;
  • P(y_t | z_t = z, y_{1:t-1}, X) can be given by the processing network;
  • the processing network of granularity z can input to the fusion network the probability of each word among the words (words of granularity z) currently to be output;
  • the fusion network can then calculate, for each candidate word currently to be output, the probability of that word being output, and output the word with the highest probability of being output (the target word).
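  • The sketch below illustrates, as an assumption-laden example only, one possible form of a single decoding step of the fusion network: an LSTM cell driven by the previously emitted word and the two processing results v0 and v1, a distribution over granularities derived from W_z and the hidden state, and a mixture over the word probabilities supplied by the processing networks, with the highest-probability word emitted as the target word. The softmax mixing used here is an assumption, not the application's exact formula.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionDecoderStep(nn.Module):
    """Sketch of one decoding step of the fusion network (names assumed)."""
    def __init__(self, vocab_size, emb_dim=128, hidden_dim=256, feat_dim=256, n_granularities=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        # h_t = LSTM(h_{t-1}, y_{t-1}, v0, v1) realized as an LSTM cell whose
        # input concatenates the previous word embedding with v0 and v1.
        self.cell = nn.LSTMCell(emb_dim + 2 * feat_dim, hidden_dim)
        self.W_z = nn.Linear(hidden_dim, n_granularities)    # parameter matrix W_z

    def forward(self, y_prev, state, v0, v1, p_words_per_granularity):
        # y_prev: (batch,) index of the word emitted at step t-1
        # v0, v1: (batch, feat_dim) first and second processing results
        # p_words_per_granularity: (batch, n_granularities, vocab) probabilities
        #   of candidate words, as supplied by each processing network
        h, c = self.cell(torch.cat([self.embed(y_prev), v0, v1], dim=-1), state)
        p_z = F.softmax(self.W_z(h), dim=-1)                  # P(z_t = z | y_1:t-1, X)
        p_word = torch.einsum('bz,bzv->bv', p_z, p_words_per_granularity)
        y_t = p_word.argmax(dim=-1)                           # emit the highest-probability word
        return y_t, (h, c)
```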
  • the embodiments of the present application provide a training method, which includes: inputting a training sample into a deep neural network for processing to obtain a prediction processing result; wherein the deep neural network includes: a granularity annotation network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network.
  • the processing includes: using the granularity annotation network to determine the granularity of each word in the training sample; using the first feature network to perform feature extraction on words of the first granularity in the training sample, and outputting the obtained third feature information to the first processing network; using the second feature network to perform feature extraction on words of the second granularity in the training sample, and outputting the obtained fourth feature information to the second processing network; using the first processing network to perform target processing on the third feature information, and outputting the obtained third processing result to the fusion network; using the second processing network to perform the target processing on the fourth feature information, and outputting the obtained fourth processing result to the fusion network; and using the fusion network to fuse the third processing result and the fourth processing result to obtain the prediction processing result, where the first granularity and the second granularity are different; determining, according to the prediction processing result and a standard result, the loss corresponding to the training sample, where the standard result is the processing result expected to be obtained by using the deep neural network to process the training sample; and using the loss corresponding to the training sample to update the parameters of the deep neural network through an optimization algorithm.
  • the data processing device trains a deep neural network that can independently process words of different granularities, so as to obtain a deep neural network that can avoid the process of obtaining coarser-grained information from finer-grained information, and is simple to implement.
  • the architectures of the first feature network and the second feature network are different, and/or the architectures of the first processing network and the second processing network are different.
  • the input of the granular annotation network is the natural language text
  • the using the granular annotation network to determine the granularity of each word in the natural language text includes: using the granular annotation network Determine the granularity of each word in the natural language text according to N granularities to obtain the annotation information of the natural language text, and output the annotation information to the first feature network and the second feature network; wherein, The label information is used to describe the granularity of each word or the probability that each word belongs to the N granularities; N is an integer greater than 1;
  • the using the first feature network to perform feature extraction on the words of the first granularity in the natural language text includes: using the first feature network to process the words of the first granularity to obtain the third feature information,
  • the third feature information is a vector or matrix representing words of the first granularity;
  • the using the second feature network to perform feature extraction on the words of the second granularity in the natural language text includes: using the second feature network to process the words of the second granularity to obtain the fourth feature information,
  • the fourth feature information is a vector or matrix representing words of the second granularity.
  • the granular labeling network includes a long short-term memory (LSTM) network and a bidirectional long short-term memory (BiLSTM) network; and the using the granular labeling network to determine the granularity of each word in the natural language text includes:
  • g_l = LSTM([h_l, z_{l-1}; g_{l-1}]);
  • BiLSTM() represents the processing operation of the BiLSTM network, and LSTM() represents the processing operation of the LSTM network;
  • x represents a word in the natural language text, and x_l represents the l-th word in the natural language text X;
  • h represents the hidden state variable in the BiLSTM network; h_l, h_{l-1}, and h_{l+1} respectively represent the hidden state variables when the BiLSTM network processes the l-th, (l-1)-th, and (l+1)-th words;
  • g represents the hidden state variable in the LSTM network; g_l and g_{l-1} respectively represent the hidden state variables when the LSTM network processes the l-th word and the (l-1)-th word in the natural language text;
  • z represents the probability that a word belongs to the reference granularity; z_{l-1} and z_l respectively represent the probabilities that the (l-1)-th word and the l-th word in the natural language text belong to the reference granularity;
  • the reference granularity is any one of the N granularities;
  • GS represents the Gumbel-Softmax function, and the temperature of the Gumbel-Softmax function is a hyperparameter;
  • W_g is a parameter matrix, that is, a parameter matrix in the granularity annotation network.
  • the granular annotation network uses the architecture of a multi-layer LSTM network to determine the granularity of each word in the natural language text, and can make full use of the granularities already determined for previous words when determining the granularity of a new word (a word whose granularity is yet to be determined); it is simple to implement and has high processing efficiency.
  • the using the first feature network to perform feature extraction on words with a first granularity in the natural language text includes:
  • U_z = ENC_z(X, Z_X);
  • ENC_z represents the first feature network;
  • the first feature network is a Transformer model;
  • ENC_z() represents the processing operation performed by the first feature network;
  • X represents the natural language text;
  • Z_X = [z_1, z_2, ..., z_L] represents the annotation information;
  • z_1 to z_L sequentially represent the granularities of the first word to the L-th (last) word in the natural language text;
  • U_z represents the third feature information output by the first feature network.
  • the third processing result is a sequence containing one or more words
  • the processing of the third feature information using the first processing network includes: using the first processing network to process the input third feature information and the words that the first processing network has already output in the process of processing the third feature information, to obtain the third processing result.
  • the target result output by the fusion network is a sequence containing one or more words
  • the fusion network is used to fuse the third processing result and the fourth processing result
  • Obtaining the target result includes: using the fusion network to process the third processing result, the fourth processing result, and the words that the fusion network has already output in the process of processing the third processing result and the fourth processing result, so as to determine the target word to be output, and outputting the target word.
  • the fusion network includes at least one LSTM network, and the using of the fusion network to process the third processing result, the fourth processing result, and the sequence that the fusion network has already output in the process of processing the third processing result and the fourth processing result, so as to determine the target word to be output, includes:
  • using the LSTM network to calculate, with the following formula, the probability of a word of the reference granularity currently to be output:
  • h_t = LSTM(h_{t-1}, y_{t-1}, v2, v3);
  • h_t represents the hidden state variable in the LSTM network when the LSTM network processes the t-th word;
  • h_{t-1} represents the hidden state variable in the LSTM network when the LSTM network processes the (t-1)-th word;
  • LSTM() represents the processing operation performed by the LSTM;
  • the LSTM network has currently output (t-1) words;
  • y_{t-1} represents the (t-1)-th word output by the fusion network;
  • v2 represents the third processing result
  • v3 represents the fourth processing result
  • W_z is a parameter matrix in the fusion network, and the formula also uses a hyperparameter;
  • P(z_t = z | y_{1:t-1}, X) is the probability that the word currently to be output belongs to the reference granularity (granularity z);
  • t is an integer greater than 1;
  • P(y_t | z_t = z, y_{1:t-1}, X) represents the probability of outputting the target word y_t at the reference granularity;
  • P(y_t | y_{1:t-1}, X) represents the probability of outputting the target word.
  • the using the loss corresponding to the training sample to update the parameters of the deep neural network through an optimization algorithm includes: updating the parameters of at least one network included in the deep neural network by using the gradient value of the loss function relative to the at least one network;
  • the loss function is used to calculate the loss between the prediction processing result and the standard result; wherein, during the update of any one of the first feature network, the second feature network, the first processing network, and the second processing network, the parameters of the other three networks remain unchanged.
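  • As an illustration of the alternating update described above, the following sketch updates only one sub-network per step while the parameters of the other three networks are kept constant; the attribute names (feature_net1, processing_net1, and so on), the choice of the Adam optimizer, and the batch layout are assumptions, not details taken from this application.

```python
import torch

def update_one_subnetwork(model, batch, loss_fn, lr=1e-3):
    """Sketch: compute the loss between the prediction processing result and the
    standard result, then update only one sub-network (here the first feature
    network), keeping the parameters of the other three networks constant."""
    frozen = [model.feature_net2, model.processing_net1, model.processing_net2]
    for net in frozen:
        for p in net.parameters():
            p.requires_grad_(False)                 # keep these parameters unchanged
    optimizer = torch.optim.Adam(model.feature_net1.parameters(), lr=lr)

    prediction = model(batch["sample"])             # prediction processing result
    loss = loss_fn(prediction, batch["standard"])   # loss vs. the standard (expected) result
    optimizer.zero_grad()
    loss.backward()                                 # gradient of the loss w.r.t. the chosen network
    optimizer.step()

    for net in frozen:                              # restore for the next alternation
        for p in net.parameters():
            p.requires_grad_(True)
    return loss.item()
```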
  • the embodiments of the application provide a data processing device.
  • the data processing device includes: an acquisition unit, configured to obtain natural language text to be processed; and a processing unit, configured to process the natural language text using a deep neural network obtained by training, and output the target result obtained by processing the natural language text; wherein the deep neural network includes: a granularity annotation network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network, and the processing includes: using the granularity annotation network to determine the granularity of each word in the natural language text; using the first feature network to perform feature extraction on words of the first granularity in the natural language text, and outputting the obtained first feature information to the first processing network; using the second feature network to perform feature extraction on words of the second granularity in the natural language text, and outputting the obtained second feature information to the second processing network; using the first processing network to process the first feature information, and outputting the obtained first processing result to the fusion network; using the second processing network to process the second feature information, and outputting the obtained second processing result to the fusion network; and using the fusion network to fuse the first processing result and the second processing result to obtain the target result, where the first granularity and the second granularity are different.
  • the data processing device uses a deep neural network to independently process words of different granularities, avoiding the process of obtaining coarser-grained information from finer-grained information, and can effectively improve the performance of processing natural language processing tasks.
  • the architectures of the first feature network and the second feature network are different, and/or the architectures of the first processing network and the second processing network are different.
  • the input of the granular annotation network is the natural language text; the processing unit is specifically configured to use the granular annotation network to determine, according to N granularities, the granularity of each word in the natural language text to obtain the annotation information of the natural language text, and to output the annotation information to the first feature network and the second feature network; wherein the annotation information is used to describe the granularity of each word or the probability that each word belongs to the N granularities; N is an integer greater than 1;
  • the processing unit is specifically configured to process the words of the first granularity by using the first feature network to obtain the first feature information, where the first feature information is a vector or matrix representing the words of the first granularity;
  • the processing unit is specifically configured to process the words of the second granularity by using the second feature network to obtain the second feature information, where the second feature information is a vector or matrix representing the words of the second granularity.
  • the granular labeling network includes a long short-term memory (LSTM) network and a bidirectional long short-term memory (BiLSTM) network; the processing unit is specifically configured to use the granular labeling network to determine the granularity of each word in the natural language text using the following formula:
  • g_l = LSTM([h_l, z_{l-1}; g_{l-1}]);
  • BiLSTM() represents the processing operation of the BiLSTM network, and LSTM() represents the processing operation of the LSTM network;
  • x represents a word in the natural language text, and x_l represents the l-th word in the natural language text X;
  • h represents the hidden state variable in the BiLSTM network; h_l, h_{l-1}, and h_{l+1} respectively represent the hidden state variables when the BiLSTM network processes the l-th, (l-1)-th, and (l+1)-th words;
  • g represents the hidden state variable in the LSTM network; g_l and g_{l-1} respectively represent the hidden state variables when the LSTM network processes the l-th word and the (l-1)-th word in the natural language text;
  • z represents the probability that a word belongs to the reference granularity; z_{l-1} and z_l respectively represent the probabilities that the (l-1)-th word and the l-th word in the natural language text belong to the reference granularity;
  • the reference granularity is any one of the N granularities;
  • GS represents the Gumbel-Softmax function, and the temperature of the Gumbel-Softmax function is a hyperparameter;
  • W_g is a parameter matrix, that is, a parameter matrix in the granularity annotation network.
  • the processing unit is specifically configured to use the first feature network to use the following formula to perform feature extraction on words of the first granularity in the natural language text:
  • U_z = ENC_z(X, Z_X);
  • ENC_z represents the first feature network;
  • the first feature network is a Transformer model;
  • ENC_z() represents the processing operation performed by the first feature network;
  • X represents the natural language text;
  • Z_X = [z_1, z_2, ..., z_L] represents the annotation information;
  • z_1 to z_L sequentially represent the granularities of the first word to the L-th (last) word in the natural language text;
  • U_z represents the first feature information output by the first feature network.
  • the first processing result is a sequence containing one or more words; the processing unit is specifically configured to use the first processing network to process the input first feature information and the words that the first processing network has already output in the process of processing the first feature information, to obtain the first processing result.
  • the target result output by the fusion network is a sequence containing one or more words; the processing unit is specifically configured to use the fusion network to process the first processing result, The second processing result and the words that have been output by the fusion network in the process of processing the first processing result and the second processing result to determine the target word to be output, and output the target word.
  • the fusion network includes at least one LSTM network;
  • the processing unit is specifically configured to input a vector obtained by combining the first processing result and the second processing result into the LSTM network;
  • the processing unit is specifically configured to use the LSTM network to calculate the probability of a word with a reference granularity to be output by using the following formula:
  • h_t = LSTM(h_{t-1}, y_{t-1}, v0, v1);
  • h_t represents the hidden state variable in the LSTM network when the LSTM network processes the t-th word;
  • h_{t-1} represents the hidden state variable in the LSTM network when the LSTM network processes the (t-1)-th word;
  • LSTM() represents the processing operation performed by the LSTM;
  • the LSTM network has currently output (t-1) words;
  • y_{t-1} represents the (t-1)-th word output by the fusion network;
  • v0 represents the first processing result
  • v1 represents the second processing result
  • W_z is a parameter matrix in the fusion network, and the formula also uses a hyperparameter;
  • P(z_t = z | y_{1:t-1}, X) is the probability that the word currently to be output belongs to the reference granularity (granularity z);
  • t is an integer greater than 1.
  • the processing unit is specifically configured to use the fusion network to calculate the probability of the target word to be output by using the following formula:
  • P(y_t | z_t = z, y_{1:t-1}, X) represents the probability of outputting the target word y_t at the reference granularity;
  • P(y_t | y_{1:t-1}, X) represents the probability of outputting the target word.
  • the embodiments of the present application provide another data processing device.
  • the data processing device includes: a processing unit, configured to input a training sample into a deep neural network for processing to obtain a prediction processing result; wherein the deep neural network includes: a granularity annotation network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network; the processing includes: using the granularity annotation network to determine the granularity of each word in the training sample; using the first feature network to perform feature extraction on words of the first granularity in the training sample, and outputting the obtained third feature information to the first processing network; using the second feature network to perform feature extraction on words of the second granularity in the training sample, and outputting the obtained fourth feature information to the second processing network; using the first processing network to perform target processing on the third feature information, and outputting the obtained third processing result to the fusion network; using the second processing network to perform the target processing on the fourth feature information, and outputting the obtained fourth processing result to the fusion network; and using the fusion network to fuse the third processing result and the fourth processing result to obtain the prediction processing result, where the first granularity and the second granularity are different; the processing unit is further configured to determine, according to the prediction processing result and a standard result, the loss corresponding to the training sample, where the standard result is the processing result expected to be obtained by using the deep neural network to process the training sample, and to use the loss corresponding to the training sample to update the parameters of the deep neural network through an optimization algorithm.
  • the data processing device trains a deep neural network that can independently process words of different granularities, so as to obtain a deep neural network that can avoid the process of obtaining coarser-grained information from finer-grained information, and is simple to implement.
  • the first characteristic network and the second characteristic network have different architectures, and/or the first processing network and the second processing network have different architectures.
  • the input of the granular annotation network is the natural language text; the processing unit is specifically configured to use the granular annotation network to determine, according to N granularities, the granularity of each word in the natural language text to obtain the annotation information of the natural language text, and to output the annotation information to the first feature network and the second feature network; wherein the annotation information is used to describe the granularity of each word or the probability that each word belongs to the N granularities; N is an integer greater than 1;
  • the processing unit is specifically configured to process the words of the first granularity by using the first feature network to obtain the third feature information, where the third feature information is a vector or matrix representing the words of the first granularity;
  • the processing unit is specifically configured to process the words of the second granularity by using the second feature network to obtain the fourth feature information, where the fourth feature information is a vector or matrix representing the words of the second granularity.
  • the granular labeling network includes a long short-term memory (LSTM) network and a bidirectional long short-term memory (BiLSTM) network; the processing unit is specifically configured to use the granular labeling network to determine the granularity of each word in the natural language text using the following formula:
  • g_l = LSTM([h_l, z_{l-1}; g_{l-1}]);
  • BiLSTM() represents the processing operation of the BiLSTM network, and LSTM() represents the processing operation of the LSTM network;
  • x represents a word in the natural language text, and x_l represents the l-th word in the natural language text X;
  • h represents the hidden state variable in the BiLSTM network; h_l, h_{l-1}, and h_{l+1} respectively represent the hidden state variables when the BiLSTM network processes the l-th, (l-1)-th, and (l+1)-th words;
  • g represents the hidden state variable in the LSTM network; g_l and g_{l-1} respectively represent the hidden state variables when the LSTM network processes the l-th word and the (l-1)-th word in the natural language text;
  • z represents the probability that a word belongs to the reference granularity; z_{l-1} and z_l respectively represent the probabilities that the (l-1)-th word and the l-th word in the natural language text belong to the reference granularity;
  • the reference granularity is any one of the N granularities;
  • GS represents the Gumbel-Softmax function, and the temperature of the Gumbel-Softmax function is a hyperparameter;
  • W_g is a parameter matrix, that is, a parameter matrix in the granularity annotation network.
  • the processing unit is specifically configured to use the first feature network to use the following formula to perform feature extraction on words of the first granularity in the natural language text:
  • U_z = ENC_z(X, Z_X);
  • ENC_z represents the first feature network;
  • the first feature network is a Transformer model;
  • ENC_z() represents the processing operation performed by the first feature network;
  • X represents the natural language text;
  • Z_X = [z_1, z_2, ..., z_L] represents the annotation information;
  • z_1 to z_L sequentially represent the granularities of the first word to the L-th (last) word in the natural language text;
  • U_z represents the third feature information output by the first feature network.
  • the third processing result is a sequence containing one or more words; the processing unit is specifically configured to use the first processing network to process the input third feature information and the words that the first processing network has already output in the process of processing the third feature information, to obtain the third processing result.
  • the target result output by the fusion network is a sequence containing one or more words; the processing unit is specifically configured to use the fusion network to process the third processing result, The fourth processing result and the words that have been output by the fusion network in the process of processing the third processing result and the fourth processing result to determine the target word to be output, and output the target word.
  • the fusion network includes at least one LSTM network; the processing unit is specifically configured to input a vector obtained by combining the third processing result and the fourth processing result into the LSTM network;
  • the LSTM network uses the following formula to calculate the probability of a word with a reference granularity to be output:
  • h_t = LSTM(h_{t-1}, y_{t-1}, v2, v3);
  • h_t represents the hidden state variable in the LSTM network when the LSTM network processes the t-th word;
  • h_{t-1} represents the hidden state variable in the LSTM network when the LSTM network processes the (t-1)-th word;
  • LSTM() represents the processing operation performed by the LSTM;
  • the LSTM network has currently output (t-1) words;
  • y_{t-1} represents the (t-1)-th word output by the fusion network;
  • v2 represents the third processing result
  • v3 represents the fourth processing result
  • W_z is a parameter matrix in the fusion network, and the formula also uses a hyperparameter;
  • P(z_t = z | y_{1:t-1}, X) is the probability that the word currently to be output belongs to the reference granularity (granularity z);
  • t is an integer greater than 1;
  • P(y_t | z_t = z, y_{1:t-1}, X) represents the probability of outputting the target word y_t at the reference granularity;
  • P(y_t | y_{1:t-1}, X) represents the probability of outputting the target word.
  • the processing unit is specifically configured to update the parameters of at least one network included in the deep neural network by using the gradient value of the loss function relative to the at least one network; the loss function is used to calculate the loss between the prediction processing result and the standard result; wherein, during the update of any one of the first feature network, the second feature network, the first processing network, and the second processing network, the parameters of the other three networks remain unchanged.
  • the embodiments of the present application provide yet another data processing device.
  • the data processing device includes: a processor, a memory, an input device, and an output device.
  • the memory is used to store code;
  • the code is used to execute the method provided in the first aspect or the second aspect, the input device is used to obtain the natural language text to be processed, and the output device is used to output the target result obtained by the processor processing the natural language text.
  • the embodiments of the present application provide a computer program product.
  • the computer program product includes program instructions that, when executed by a processor, cause the processor to execute the method of the first aspect or the second aspect described above.
  • the embodiments of the present application provide a computer-readable storage medium; the computer storage medium stores a computer program, and the computer program includes program instructions that, when executed by a processor, cause the processor to perform the method of the above-mentioned first aspect or the above-mentioned second aspect.
  • Figures 1A to 1C are application scenarios of natural language processing systems
  • Fig. 2 is a flowchart of a natural language processing method provided by an embodiment of the application
  • FIG. 3 is a schematic structural diagram of a deep neural network provided by an embodiment of this application.
  • FIG. 4 is a schematic structural diagram of a granular labeling network 301 provided by an embodiment of this application.
  • FIG. 5 is a schematic structural diagram of a feature network provided by an embodiment of this application.
  • FIG. 6 is a schematic structural diagram of a deep neural network provided by an embodiment of this application.
  • FIG. 7 is a flowchart of a training method provided by an embodiment of the application.
  • FIG. 8 is a schematic structural diagram of a data processing device provided by an embodiment of this application.
  • FIG. 9 is a schematic structural diagram of a neural network processor provided by an embodiment of this application.
  • FIG. 10 is a block diagram of a partial structure of an intelligent terminal provided by an embodiment of the application.
  • FIG. 11 is a block diagram of a part of the structure of another data processing device provided by an embodiment of the application.
  • In the current technology, the network models used to process natural language processing tasks do not separate the operations performed on words of different granularities in natural language text. That is to say, in the currently adopted schemes, operations on words of different granularities are not decoupled.
  • a pooling operation is usually used to synthesize finer-grained features to form coarser-grained features.
  • the word-level and phrase-level features are integrated through the pooling operation to form sentence-level features. It can be understood that if the finer-grained features obtained are wrong, the coarser-grained features obtained from the finer-grained features will also be wrong.
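  • A minimal illustration of this conventional pooling step (not part of this application's method, and with arbitrary example dimensions): word-level features are mean-pooled into a single sentence-level feature, so any error in the word-level features propagates directly into the coarser-grained feature.

```python
import torch

# Illustration only: a conventional pipeline that mean-pools finer-grained
# (word-level) features into one coarser-grained (sentence-level) feature.
word_features = torch.randn(7, 128)           # 7 words, 128-dim word-level features
sentence_feature = word_features.mean(dim=0)  # sentence-level feature obtained by pooling
print(sentence_feature.shape)                 # torch.Size([128])
```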
  • the networks in the deep neural network that implement operations at different granularities can be analyzed or adjusted.
  • the deep neural network used in this application includes multiple decoupled sub-networks for processing words of different granularities; these sub-networks can be optimized in a targeted manner to ensure that the operations at each granularity are controllable.
  • Reusability and transferability: operations at different granularities have different reusability or transferability characteristics.
  • For example, sentence-level operations include sentence translation or transformation.
  • Phrase-level or word-level operations carry more domain-specific features.
  • since the deep neural network includes multiple independent sub-networks for processing words of different granularities, some of the sub-networks obtained by training with samples from a certain field can be applied to other fields.
  • a natural language processing system includes user equipment and data processing equipment.
  • the user equipment may be a mobile phone, a personal computer, a tablet computer, a wearable device, a personal digital assistant, a game console, an information processing center, and other smart terminals.
  • the user equipment is the initiator of natural language data processing and serves as the initiator of natural language processing tasks (for example, translation tasks or paraphrase tasks).
  • users initiate natural language processing tasks through the user equipment.
  • the paraphrase task is the task of transforming a natural language text into another text that has the same meaning as but a different expression from the original text. For example, "What makes the second world war happen" can be paraphrased as "What is the reason of world war II".
  • the data processing device may be a device or server with data processing functions such as a cloud server, a network server, an application server, and a management server.
  • the data processing device receives query sentences/voice/text questions from the smart terminal through an interactive interface, and then performs language data processing such as machine learning, deep learning, search, reasoning, and decision-making through a memory that stores data and a processor that performs data processing.
  • the memory may be a general term that includes local storage and a database storing historical data.
  • the database may be on a data processing device or on other network servers.
  • FIG. 1B shows another application scenario of the natural language processing system.
  • the smart terminal is directly used as a data processing device, directly receiving input from the user and directly processed by the hardware of the smart terminal itself.
  • the specific process is similar to that of FIG. 1A, and the above description can be referred to, which will not be repeated here.
  • the user equipment may be a local device 101 or 102
  • the data processing device may be an execution device 210
  • the data storage system 250 may be integrated on the execution device 210, or set in the cloud or on other network servers.
  • FIG. 2 is a flowchart of a natural language processing method provided by an embodiment of the application. As shown in FIG. 2, the method may include:
  • the natural language text to be processed may be a sentence currently to be processed by the data processing device.
  • the data processing device can process the received natural language text or the natural language text obtained by recognizing voice sentence by sentence.
  • obtaining the natural language text to be processed may be that the data processing device receives data such as voice or text sent by the user equipment, and obtains the natural language text to be processed according to the received voice or text data.
  • For example, if the data processing device receives two sentences sent by the user equipment, the data processing device obtains the first sentence (natural language text to be processed), processes the first sentence using the trained deep neural network, and outputs the result obtained by processing the first sentence; it then obtains the second sentence (natural language text to be processed), processes the second sentence using the trained deep neural network, and outputs the result obtained by processing the second sentence.
  • obtaining the natural language text to be processed may be that the smart terminal directly receives data such as voice or text input by the user, and obtains the natural language text to be processed according to the received voice or text data.
  • For example, if the smart terminal receives two sentences input by the user, the smart terminal obtains the first sentence (natural language text to be processed), processes the first sentence using the trained deep neural network, and outputs the result obtained by processing the first sentence; it then obtains the second sentence (natural language text to be processed), processes the second sentence using the trained deep neural network, and outputs the result obtained by processing the second sentence.
  • the deep neural network may include: a granular annotation network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network.
  • the processing that the data processing device performs on the natural language text using the deep neural network may include: using the granularity annotation network to determine the granularity of each word in the natural language text; using the first feature network to perform feature extraction on words of the first granularity in the natural language text, and outputting the obtained first feature information to the first processing network; using the second feature network to perform feature extraction on words of the second granularity in the natural language text, and outputting the obtained second feature information to the second processing network; using the first processing network to perform target processing on the first feature information, and outputting the obtained first processing result to the fusion network; using the second processing network to perform the target processing on the second feature information, and outputting the obtained second processing result to the fusion network; and using the fusion network to fuse the first processing result and the second processing result to obtain the target result; the first granularity and the second granularity are different.
  • the first granularity and the second granularity may be any two different granularities among character level, word level, phrase level, and sentence level.
  • the granularity of a word refers to the granularity of the word in the natural language text (sentence).
  • the target processing can be translation, retelling, abstract generation, etc.
  • the target result is another natural language text obtained by processing the natural language text.
  • the target result is a natural language text obtained by translating the natural language text.
  • the target result is another natural language text obtained by retelling the natural language text.
  • the natural language text to be processed can be regarded as an input sequence, and the target result (another natural language text) obtained by the data processing device processing the natural language text can be regarded as a generated sequence.
  • the deep neural network may include N feature networks and N processing networks.
  • the N feature networks and the N processing networks have a one-to-one correspondence, and N is an integer greater than one.
  • a pair of corresponding feature network and processing network are used to process words of the same granularity.
  • the first feature network performs feature extraction on words of the first granularity in the natural language text to obtain first feature information
  • the first processing network performs target processing on the first feature information.
  • the deep neural network may also include feature networks for performing feature extraction on words of other granularities (granularities other than the first granularity and the second granularity).
  • the deep neural network may also include processing networks for performing target processing on the feature information of words of other granularities (granularities other than the first granularity and the second granularity).
  • the number of feature networks and the number of processing networks included in the deep neural network are not limited. If the words in the natural language text are classified into N granularities, the deep neural network includes N feature networks and N processing networks.
  • For example, if the words in the natural language text are divided into phrase-level words and sentence-level words, the deep neural network includes two feature networks: one feature network is used to perform feature extraction on phrase-level words to obtain the feature information of phrase-level words, and the other feature network is used to perform feature extraction on sentence-level words to obtain the feature information of sentence-level words. The deep neural network also includes two processing networks: one processing network is used to perform target processing on the feature information of phrase-level words, and the other processing network is used to perform target processing on the feature information of sentence-level words. A composition sketch for this two-granularity case is shown below.
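  • Illustrative composition sketch only (assumed structure that reuses the illustrative sub-modules sketched earlier; the fusion and processing call signatures are assumptions, not this application's reference implementation):

```python
import torch.nn as nn

class MultiGranularityModel(nn.Module):
    """Sketch of the overall deep neural network for N = 2 granularities:
    a granularity tagger, one feature network and one processing network per
    granularity (one-to-one correspondence, no parameter sharing), and a
    fusion network that fuses the N processing results."""
    def __init__(self, tagger, feature_nets, processing_nets, fusion):
        super().__init__()
        assert len(feature_nets) == len(processing_nets)      # one-to-one correspondence
        self.tagger = tagger
        self.feature_nets = nn.ModuleList(feature_nets)        # e.g. phrase-level, sentence-level
        self.processing_nets = nn.ModuleList(processing_nets)
        self.fusion = fusion

    def forward(self, x):
        labels = self.tagger(x)                                 # (batch, L, N) granularity labels
        results = []
        for z, (enc, proc) in enumerate(zip(self.feature_nets, self.processing_nets)):
            z_mask = labels.argmax(dim=-1) == z                 # words of granularity z
            results.append(proc(enc(x, z_mask)))                # feature extraction + target processing
        return self.fusion(results)                             # fuse the N processing results
```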
  • the deep neural network includes N feature networks and N processing networks
  • the N feature networks output N feature information
  • the N processing networks output N processing results
  • the fusion network is used to fuse the N processing results to obtain the final output result.
  • the fusion network is not limited to fusing two processing results.
  • any two of the N feature networks perform feature extraction on words with different granularities in natural language text; any two of the N processing networks perform target processing on the feature information of words with different granularities.
  • any two characteristic networks of the N characteristic networks do not share parameters; any two of the N processing networks do not share parameters.
  • the target processing can be translation, retelling, abstract generation, etc.
  • the parameters of the first feature network and the second feature network are different, and the architectures adopted are the same or different.
  • the first feature network uses a deep neural network architecture
  • the second feature network uses a Transformer architecture.
  • the first processing network and the second processing network have different parameters and adopt the same or different architectures.
  • the first processing network uses a deep neural network architecture
  • the second processing network uses a Transformer architecture. It can be understood that the multiple feature networks included in the deep neural network may adopt different architectures, and the multiple processing networks included in the deep neural network may adopt different architectures.
  • the data processing device uses the mutually decoupled networks in the deep neural network to separately process words of different granularities, which can effectively improve the performance of processing natural language processing tasks.
  • FIG. 3 is a schematic structural diagram of a deep neural network provided by an embodiment of the application.
  • the deep neural network may include N feature networks and N processing networks. To facilitate understanding, only two feature networks (the first feature network and the second feature network) and two processing networks (the first processing network and the second processing network) are shown.
  • 301 is a granular annotation network
  • 302 is a first feature network
  • 303 is a second feature network
  • 304 is a first processing network
  • 305 is a second processing network
  • 306 is a converged network.
  • the data processing equipment uses the deep neural network in Figure 3 to process natural language text as follows:
  • the granularity labeling network 301 determines the granularity of each word in the natural language text according to N types of granularities to obtain the labeling information of the natural language text, and outputs the labeling information to the first feature network 302 and the second feature network 303.
  • the input of the granular annotation network 301 is the natural language text to be processed; the output may be annotation information, or annotation information and the natural language text.
  • the input of the first feature network 302 and the input of the second feature network 303 are both the annotation information and the natural language text.
  • the annotation information is used to describe the granularity of each word in the natural language text or the probability that each word in the natural language text belongs to the N types of granularities; N is an integer greater than 1.
•   the granularity labeling network 301 labels the granularity to which each word (taking the word as the basic processing unit) in the input natural language text (input sequence) belongs, that is, it determines the granularity label of each word in the natural language text. Assuming that two granularities are considered, phrase-level granularity and sentence-level granularity, the granularity of each word in the input natural language text (sentence) is determined to be one of these two granularities.
•   for example, the granularity annotation network 301 determines the granularity of each word in the input natural language text "what makes the second world war happen", where words such as "what", "makes", and "happen" are determined to be of sentence-level granularity, and words such as "the", "second", "world", and "war" are determined to be of phrase-level granularity. It is worth noting that the natural language text to be processed does not carry granularity labels; instead, the granularity annotation network 301 determines the granularity of each word in the input natural language text.
  • the first feature network 302 uses the input natural language text and annotation information to perform feature extraction, and outputs the obtained first feature information to the first processing network 304.
  • the first feature information is a vector or matrix representing words of the first granularity.
•   the input of the first feature network 302 is the natural language text and the tagging information. The first feature network 302 performs feature extraction on the words of the first granularity in the natural language text and obtains the vector or matrix representation of the words of the first granularity in the natural language text, that is, the first feature information.
  • the second feature network 303 uses the input natural language text and annotation information to perform feature extraction, and outputs the obtained second feature information to the second processing network 305.
  • the second feature information is a vector or matrix representing words of the second granularity.
•   the input of the second feature network 303 is the natural language text and the tagging information. The second feature network 303 performs feature extraction on the words of the second granularity in the natural language text and obtains the vector or matrix representation of the words of the second granularity in the natural language text, that is, the second feature information.
  • the embodiment of the present application does not limit the order in which the data processing device performs step 313 and step 312. Step 313 and step 312 can be performed at the same time, or step 312 can be performed before step 313, or step 313 can be performed before step 312.
•   the first processing network 304 performs processing using the input first feature information and the processing results that the first processing network 304 has already output in the process of processing the first feature information, to obtain the first processing result.
•   the first processing network 304 processes the input first feature information in a recursive manner (for example, translation, paraphrase, abstract extraction, etc.); that is, the first processing network 304 takes the output of the first feature network 302 (the first feature information) and the processing results (sequence) it has output previously as input, and calculates a vector or matrix representation (the first processing result) through the deep neural network.
•   the second processing network 305 performs processing using the input second feature information and the processing results that the second processing network 305 has already output in the process of processing the second feature information, to obtain the second processing result.
•   the second processing network 305 processes the input second feature information in a recursive manner (for example, translation, paraphrase, abstract extraction, etc.); that is, the second processing network 305 takes the output of the second feature network 303 (the second feature information) and the processing results (sequence) it has output previously as input, and calculates a vector or matrix representation (the second processing result) through the deep neural network.
  • the embodiment of the present application does not limit the order in which the data processing device executes step 314 and step 315. Step 314 and step 315 can be executed simultaneously, or step 314 can be executed first and then step 315 can be executed, or step 315 can be executed before step 314 is executed.
•   the fusion network 306 uses the first processing result, the second processing result, and the processing results that the fusion network 306 has already output in the process of processing the first processing result and the second processing result to determine the target word to be output, and outputs the target word.
  • the target word is included in the first processing result or the second processing result.
  • the fusion network 306 can merge the output of processing networks of different granularities, that is, determine the granularity of the current word to be output and then determine the word to be output.
•   in the first step, the fusion network determines that the word to be output has "sentence level" granularity and outputs "what"; in the second step, it determines that the word to be output has "sentence level" granularity and outputs "is"; the previous operation is repeated until the generation of the final output sentence (corresponding to the target result) is completed. It should be noted that the above steps 311 to 316 are all completed by deep neural network calculations.
  • the data processing device uses feature networks of different granularities and processing networks of different granularities to independently process words of different granularities, which can effectively improve the probability of obtaining correct results.
  • FIG. 4 is a schematic structural diagram of a granular labeling network 301 provided by an embodiment of this application.
•   the granularity annotation network 301 includes a Long Short-Term Memory (LSTM) network 402 and a bidirectional LSTM (BiLSTM) network 401. It can be seen from FIG. 4 that the granularity labeling network 301 uses a multi-layer LSTM network architecture.
•   the input of the BiLSTM network 401 is the natural language text
•   the output of the LSTM network 402 is the labeling information, that is, the granularity label of each word or the probability that each word belongs to each granularity.
  • the granularity annotation network 301 is used to predict the granularity corresponding to each word in the input sentence (natural language text).
  • the BiLSTM network 401 is used to convert the input natural language text into a vector, which is used as the input of the next layer of the LSTM network 402; the LSTM network 402 calculates and outputs the probability that each word in the natural language text belongs to each granularity.
•   the labeling information can be generated by using the Gumbel-Softmax (GS) function instead of the commonly used Softmax operation.
•   with the GS function, the probability that each word belongs to each granularity is close to 0 or 1.
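•   as a brief illustration (assuming a PyTorch-style implementation; the logits below are made-up numbers), the Gumbel-Softmax function with a low temperature yields per-word granularity probabilities that are close to 0 or 1 while remaining differentiable:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, -1.0],    # word 1: leans toward granularity 0
                       [-0.5, 1.5]])   # word 2: leans toward granularity 1
probs = F.gumbel_softmax(logits, tau=0.1, hard=False)  # near one-hot, still differentiable
print(probs)  # e.g. [[0.99, 0.01], [0.02, 0.98]]; values vary because Gumbel noise is random
```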
  • the following uses mathematical formulas to describe the manner in which the granularity annotation network 301 predicts the granularity of each word in the natural language text.
•   the processing performed by the granularity annotation network 301 (the BiLSTM network 401 and the LSTM network 402) corresponds to the following formulas:
•   h = BiLSTM(x), where h = [h_1, ..., h_L] are the hidden states produced for the words of the input sentence;
•   g_l = LSTM([h_l, z_{l-1}; g_{l-1}]);
•   z_l = GS(W_g · g_l, τ);
•   BiLSTM() in the formulas represents the processing of a bidirectional recurrent deep neural network, and LSTM() represents the processing of a (one-way) recurrent deep neural network;
•   l represents the position index of a word, x represents the input sentence (natural language text), and x_l represents the l-th word in the input sentence x;
•   h represents the hidden states in the BiLSTM network 401; h_l and h_{l-1} represent the hidden states of the BiLSTM network 401 for the l-th word and the (l-1)-th word respectively;
•   g represents the hidden state variable in the (one-way) LSTM network 402, and its calculation follows the calculation rules of the LSTM network; g_l and g_{l-1} are the hidden state variables obtained when the LSTM network 402 processes the l-th word and the (l-1)-th word in the input sentence respectively;
•   z represents the probability that a word belongs to a certain granularity (phrase-level granularity, sentence-level granularity, or another granularity); z_l and z_{l-1} represent these probabilities for the l-th word and the (l-1)-th word in the input sentence respectively;
•   GS represents the Gumbel-Softmax function, τ is the hyperparameter (temperature) in the Gumbel-Softmax function, and W_g is a parameter matrix in the granularity annotation network.
•   the granularity annotation network 301 uses a multi-layer LSTM network architecture to determine the granularity of each word in the natural language text, and can make full use of the granularities already determined for previous words when determining the granularity of a new word (a word whose granularity is to be determined), which is simple to implement and efficient to process.
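•   a compact sketch of this multi-layer LSTM architecture follows (assuming a PyTorch-style implementation; the class name, dimensions, and hyperparameters are invented for illustration and are not the concrete network of this application): a BiLSTM encodes the sentence, a one-way LSTM cell consumes [h_l; z_{l-1}] word by word, and Gumbel-Softmax produces the per-word granularity probabilities z_l.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GranularityAnnotator(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hid_dim=64, n_granularities=2, tau=0.5):
        super().__init__()
        self.tau = tau
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hid_dim, bidirectional=True, batch_first=True)
        self.cell = nn.LSTMCell(2 * hid_dim + n_granularities, hid_dim)
        self.w_g = nn.Linear(hid_dim, n_granularities)        # parameter matrix W_g

    def forward(self, token_ids):                              # token_ids: (batch, length)
        h, _ = self.bilstm(self.embed(token_ids))              # hidden state h_l for every word
        batch, length, _ = h.shape
        g = h.new_zeros(batch, self.cell.hidden_size)
        c = h.new_zeros(batch, self.cell.hidden_size)
        z_prev = h.new_zeros(batch, self.w_g.out_features)
        zs = []
        for l in range(length):
            g, c = self.cell(torch.cat([h[:, l], z_prev], dim=-1), (g, c))
            z_prev = F.gumbel_softmax(self.w_g(g), tau=self.tau)   # z_l = GS(W_g g_l, tau)
            zs.append(z_prev)
        return torch.stack(zs, dim=1)                           # (batch, length, n_granularities)
```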
  • FIG. 5 is a schematic structural diagram of a first characteristic network 302 and a second characteristic network 303 provided by an embodiment of this application.
  • the first feature network 302 performs feature extraction on words of the first granularity in the natural language text
•   the second feature network 303 performs feature extraction on the words of the second granularity in the natural language text.
  • the network architectures adopted by the first feature network 302 and the second feature network 303 may be the same or different.
  • a feature network that processes words of a certain granularity can be understood as a feature network of that granularity, and feature networks of different granularities process words of different granularity.
  • the parameters of the first characteristic network 302 and the second characteristic network 303 are not shared, and the hyperparameter settings are different.
•   both the first feature network 302 and the second feature network 303 adopt the Transformer model.
•   this model is based on a multi-head self-attention mechanism; it processes the words of a certain granularity in the input sentence (natural language text), so as to construct vectors as the feature information of the words of that granularity.
•   the first feature network 302 may only focus on the words of the first granularity in the input sentence (natural language text), and the second feature network 303 may only focus on the words of the second granularity in the input sentence (natural language text).
•   when the granularity annotation network 301 determines the probability that each word in the natural language text belongs to each of the aforementioned N granularities, the first feature network 302 can focus on the words of the first granularity in the input sentence (natural language text), and the second feature network 303 can focus on the words of the second granularity in the input sentence (natural language text).
•   for the first feature network 302, it focuses on the words in the input sentence with a higher probability of belonging to the first granularity; for the second feature network 303, it focuses on the words with a higher probability of belonging to the second granularity. It can be understood that the higher the probability that a word belongs to the first granularity, the more attention the first feature network 302 pays to that word.
•   the first feature network 302 can use a self-attention mechanism with a limited window (a local mechanism whose weights are still calculated by attention).
•   the first feature network 302 will focus on words of the first granularity in the input sentence and ignore words of other granularities.
•   the first feature network 302 can be a feature network of phrase-level granularity; when extracting the features of each word, it only pays attention to the two words adjacent to that word, as shown in FIG. 5.
  • the second feature network 303 can adopt the Self-Attention mechanism of the whole sentence, so as to be able to pay attention to the global information of the sentence.
•   the second feature network 303 can be a feature network of sentence-level granularity; when extracting the features of each word, it focuses on the entire input sentence, as shown in FIG. 5.
•   the second feature network 303 will focus on the words of the second granularity in the input sentence, while ignoring words of other granularities.
  • the Transformer model is a commonly used model in the field, and the working principle of the model will not be described in detail here.
•   each feature network obtains the vector representation of the words of its granularity in the input sentence (natural language text); the vector representation obtained by the feature network of granularity z is denoted as U_z.
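•   a small sketch of the two attention patterns described above (assuming a PyTorch-style implementation; the window size and dimensions are illustrative only): the phrase-level feature network restricts each word's self-attention to a local window, while the sentence-level feature network attends over the whole sentence.

```python
import torch
import torch.nn as nn

seq_len, dim = 7, 32                        # e.g. "what makes the second world war happen"
x = torch.randn(1, seq_len, dim)

attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

# Phrase-level pattern: mask out everything outside a +/-1 window around each position.
idx = torch.arange(seq_len)
local_mask = (idx[:, None] - idx[None, :]).abs() > 1    # True = not allowed to attend
phrase_feat, _ = attn(x, x, x, attn_mask=local_mask)

# Sentence-level pattern: no mask, every word attends to the entire sentence.
sentence_feat, _ = attn(x, x, x)
```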
•   the processing operations implemented by the first feature network 302 and the second feature network 303 can be described as follows: the input of each feature network is the input sentence X and the annotation information Z_X of the input sentence, and its output is the vector representation U_z of the words of the corresponding granularity.
•   if the annotation information output by the granularity annotation network 301 is the granularity of each word in the natural language text, the annotation information of the input sentence that is input to the feature networks is the annotation information output by the granularity annotation network 301.
•   for example, the annotation information output by the granularity annotation network 301 is [1100001]; these binary values in turn represent the granularity of the first word to the last word in the input sentence, where 0 indicates phrase-level granularity and 1 indicates sentence-level granularity.
•   if the annotation information output by the granularity annotation network 301 is the probability that each word in the natural language text belongs to the above N granularities, the annotation information of the input sentence that is input to the feature networks is obtained from the annotation information output by the granularity annotation network 301.
  • the data processing device may further process the annotation information output by the granular annotation network 301 to obtain the annotation information that can be input to the feature network.
•   the data processing device takes, for each word in the natural language text, the granularity to which the word belongs with the maximum probability as the granularity of that word. For example, if the probabilities that a word in the input sentence (natural language text) belongs to the phrase-level granularity and the sentence-level granularity are 0.85 and 0.15 respectively, the granularity of that word is the phrase-level granularity. In other words, the granularity of each word in the natural language text is classified into phrase-level granularity or sentence-level granularity.
•   for example, the annotation information output by the granularity annotation network 301 is [0.92 0.88 0.08 0.07 0.04 0.06 0.97], where the values in turn indicate the probability that the first word to the last word in the natural language text belong to the sentence-level granularity.
•   the data processing device can set the values less than 0.5 in the label information to 0 and the values greater than or equal to 0.5 to 1, obtaining the new label information [1100001], which is input into the feature networks.
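•   a tiny illustration of this thresholding (plain Python; the probabilities are the example values above): probabilities of belonging to the sentence-level granularity are mapped to 0/1 annotation information.

```python
probs = [0.92, 0.88, 0.08, 0.07, 0.04, 0.06, 0.97]
annotation = [1 if p >= 0.5 else 0 for p in probs]
print(annotation)  # [1, 1, 0, 0, 0, 0, 1]
```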
•   alternatively, the data processing device samples according to the probability that each word in the natural language text belongs to the aforementioned N granularities, obtains the annotation information of the natural language text from the granularity of each word obtained by the sampling, and inputs it into the feature networks.
•   each feature network included in the deep neural network independently processes the words of one granularity, and networks of different architectures can be used for words of different granularities, giving better feature extraction performance.
•   the processing performed by the processing networks and by the fusion network 306 will be introduced below in conjunction with the structures of the first feature network 302, the second feature network 303, the first processing network 304, the second processing network 305, and the fusion network 306.
  • Fig. 6 is a schematic structural diagram of a deep neural network provided by an embodiment of the application, and Fig. 6 does not show a granular annotation network.
•   the input of the first processing network 304 is the first feature information output by the first feature network 302 and the processing results (words) that the first processing network 304 has already output in the process of processing the first feature information;
•   the input of the second processing network 305 is the second feature information output by the second feature network 303 and the processing results (words) that the second processing network 305 has already output in the process of processing the second feature information;
•   the input of the fusion network 306 is the first processing result, the second processing result, and the words that have been output in the process of processing the first processing result and the second processing result;
•   the output of the fusion network 306 is the target result obtained by fusing the first processing result and the second processing result.
  • the architectures adopted by the first processing network 304 and the second processing network 305 may be the same or different.
  • the first processing network 304 and the second processing network 305 may not share parameters.
  • a processing network that processes words of a certain granularity can be understood as a processing network of that granularity, and processing networks of different granularities process words of different granularity.
  • each granularity has a corresponding processing network.
  • the granularity of each word in a natural language text is divided into phrase-level granularity and sentence-level granularity.
  • Deep neural networks include a phrase-level granularity processing network and a sentence-level granularity processing network.
  • the processing networks of different granularities are decoupled, which means that they do not share parameters and can adopt different architectures.
  • the phrase-level granularity processing network uses a deep neural network architecture
  • the sentence-level granularity processing network uses a Transformer architecture.
  • the processing network can output one word at a time and the granularity of the word.
•   the processing can be performed in a recursive manner, that is, the processing network of each granularity takes the output of the feature network of the corresponding granularity and the words it has already output as input, calculates the probabilities of the multiple words currently to be output, and outputs the word with the highest probability together with the label information corresponding to that word.
•   alternatively, the processing network uses its input to calculate the probability of each word currently to be output, performs sampling according to these probabilities, and outputs the sampled word and the label information corresponding to that word.
•   alternatively, the processing network uses its input to calculate the probability of each word currently to be output (that is, the probability that each word is currently output), and outputs these probabilities.
•   for example, the processing network currently has F words to be output; the processing network uses its input to calculate the probability of the first word being output, the probability of the second word being output, ..., and the probability of the F-th word being output, and inputs these probabilities into the fusion network, where F is an integer greater than 1.
  • the label information corresponding to a word may be the probability that the word belongs to a certain granularity, or the granularity of the word, or the probability that the word belongs to various granularities.
•   the processing performed by the first processing network 304 may be as follows: in the first step, the first processing network 304 processes the input first feature information to predict the first word currently required to be output, and outputs the first word and the label information corresponding to the first word; in the second step, the first processing network 304 processes the input first feature information and the first word to predict the second word currently required to be output, and outputs the second word and the label information corresponding to the second word; in the third step, the first processing network 304 processes the input first feature information, the first word, and the second word to predict the third word currently required to be output, and outputs the third word and the label information corresponding to the third word; the previous steps are repeated until the first processing result is completed.
•   each processing network included in the deep neural network can process its input feature information in a manner similar to the first processing network 304.
•   for example, the input of a certain processing network is the feature information obtained by its corresponding feature network performing feature extraction on "a good geologist"; the processing network processes the input feature information, predicts that "a" currently needs to be output, and outputs it; the processing network then processes the input feature information and the previously output "a", predicts that "great" currently needs to be output, and outputs it; the processing network then processes the input feature information and the previously output "a" and "great", predicts that "geologist" currently needs to be output, and outputs it.
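•   a hedged sketch of this recursive (autoregressive) behaviour (assuming a PyTorch-style implementation; `processing_net` and its interface are placeholders, not the patent's concrete network): at each step the processing network takes the feature information plus the words it has already output and predicts the next word.

```python
import torch

def decode(processing_net, feature_info, max_len, eos_id):
    output_ids = []                                         # words already output
    for _ in range(max_len):
        logits = processing_net(feature_info, output_ids)   # uses features + previous outputs
        next_id = int(torch.argmax(logits, dim=-1))         # word with the highest probability
        output_ids.append(next_id)
        if next_id == eos_id:                               # stop when the sentence is finished
            break
    return output_ids
```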
•   the first processing network 304 receives the input of the first feature network 302 and the words it has already output for calculation; the calculation method is to use the self-attention mechanism with a limited window. The second processing network 305 receives the input of the second feature network 303 and the words it has already output for calculation; the calculation method is to use the self-attention mechanism over the whole sentence.
  • the processing result obtained by the processing network at each granularity is denoted as Vz, and z represents the index of the granularity level, namely the granularity z.
•   the first processing network 304 and the second processing network 305 may also adopt different architectures. The following describes the operations performed by the fusion network 306 on the processing results input by each processing network.
  • the fusion network 306 can merge the processing results output by the processing network at different granularities to obtain the target result.
  • the output of the fusion network 306 is a sequence containing words.
  • the input of the fusion network 306 is the processing results of each processing network (the first processing result and the second processing result) and the sequence that the fusion network 306 has output in the process of processing these processing results.
•   the operations performed by the fusion network 306 can be as follows: the fusion network 306 merges the processing results input by each processing network into a vector, and inputs the vector into an LSTM network for processing to determine the granularity of the word currently to be output;
•   the fusion network 306 then outputs the target word currently to be output by the processing network of that granularity.
•   inputting the vector into an LSTM network for processing to determine the granularity of the word currently to be output may be: inputting the vector into an LSTM network for processing to determine, for each of the above N granularities, the probability that a word of that granularity is output, and then determining the granularity of the word currently to be output; the word currently to be output has the granularity with the highest probability of being output.
•   this granularity is any one of the above-mentioned N granularities.
  • the target word is the word with the highest probability of being output among the multiple words currently to be output by the processing network of the granularity to be output.
•   for example, the probabilities of the first word, the second word, and the third word currently to be output by the processing network of the reference granularity are 0.06, 0.8, and 0.14 respectively; the target word to be output by the processing network of the reference granularity is the second word, that is, the word with the highest probability of being output. It can be understood that the fusion network 306 may first determine the granularity of the word currently to be output, and then output the word to be output by the processing network of that granularity.
•   the operations performed by the fusion network 306 can also be as follows: the fusion network 306 merges the processing results input by each processing network into a vector; it inputs the vector into an LSTM network for processing to determine, among the words currently to be output by each processing network, the probability of each word being output; the fusion network 306 then outputs the target word with the highest probability of being output among these words.
  • Each processing network refers to a processing network of each granularity.
•   for example, the words currently to be output by the first processing network include "a", "good", and "geologist", and the words currently to be output by the second processing network include "How", "can", "I", and "be"; the fusion network calculates the probability of each of these 7 words being output, and outputs the word with the highest probability of being output among these 7 words.
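•   an illustration of this fusion step (plain Python; the candidate probabilities are made up for illustration): given the candidate words and probabilities currently proposed by the phrase-level and sentence-level processing networks, the word with the highest probability is output.

```python
candidates = {
    "a": 0.05, "good": 0.03, "geologist": 0.02,          # phrase-level candidates
    "How": 0.45, "can": 0.25, "I": 0.12, "be": 0.08,      # sentence-level candidates
}
target_word = max(candidates, key=candidates.get)
print(target_word)  # "How"
```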
  • the following describes how to calculate the probability of each word being output in each word currently to be output by the processing network with reference granularity.
•   the reference granularity is any one of the above-mentioned N granularities.
  • the (t-1) words already output by the fusion network 306 are denoted as [y 1 ,y 2 ,...,y t-1 ], and t is an integer greater than 1.
  • the vectors (processing results) output by the first processing network and the second processing network are v0 and v1, respectively.
•   the fusion network 306 combines these two vectors with the sequence that the fusion network 306 has already output, and inputs the merged vector into the LSTM network included in the fusion network 306 for processing, so as to calculate the probability that the word currently to be output has the reference granularity.
  • the LSTM network can use the following formula to calculate the probability of words with a reference granularity to be output:
•   h_t = LSTM(h_{t-1}, y_{t-1}, v0, v1);
•   p(z_t = z | y_{1:t-1}, X) = GS(W_z · h_t, τ);
•   h_t represents the hidden state variable in the LSTM network when the LSTM network processes the t-th word, and LSTM() represents the processing operation performed by that LSTM network;
•   y_{t-1} represents the (t-1)-th word that has already been output;
•   W_z is a parameter matrix in the fusion network and τ is a hyperparameter (temperature);
•   p(z_t = z | y_{1:t-1}, X) is the probability that the word currently to be output has granularity z.
  • the fusion network 306 can use a similar method to calculate the probability of currently outputting words of any one of the above N granularities.
  • the mixed probability model is used to calculate the probability of outputting the target word.
  • the target word is a word currently to be output by the processing network of the granularity z.
•   the formula for calculating the probability of outputting the target word is as follows:
•   p(y_t | y_{1:t-1}, X) = Σ_z p(z_t = z | y_{1:t-1}, X) · p(y_t | z_t = z, y_{1:t-1}, X);
•   p(y_t | z_t = z, y_{1:t-1}, X) represents the probability of outputting the target word y_t at granularity z, and p(y_t | y_{1:t-1}, X) represents the probability of outputting the target word;
•   p(y_t | z_t = z, y_{1:t-1}, X) can be given by the processing network of granularity z.
•   the processing network of granularity z can input into the fusion network the probability of each of the words (words of granularity z) it currently has to output, that is, the probability that each of those words is output.
•   for example, the input of the first processing network is the feature information obtained by the first feature network performing feature extraction on "a good geologist"; the processing network processes the feature information to obtain the probability that "a" is currently to be output, the probability that "great" is currently to be output, and the probability that "geologist" is currently to be output, and inputs these words and the corresponding probabilities into the fusion network.
•   in this case, p(y_t = "great" | z_t = z, y_{1:t-1}, X) represents the probability of outputting "great" at granularity z.
•   the fusion network 306 may first calculate, for each of the above N granularities, the probability that the word currently to be output has that granularity, then calculate the probability of each candidate word being output, and finally output the word with the highest probability of being output.
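•   a minimal sketch of this mixed probability model (assuming a PyTorch-style implementation; the probability values and vocabulary size are illustrative only): the probability of outputting word y_t is the sum over granularities z of p(z_t = z | history) multiplied by p(y_t | z_t = z, history).

```python
import torch

p_granularity = torch.tensor([0.3, 0.7])           # p(z_t = z | y_1:t-1, X), from the fusion LSTM
p_word_given_z = torch.tensor([[0.6, 0.3, 0.1],     # word distribution from the granularity-0 network
                               [0.1, 0.2, 0.7]])    # word distribution from the granularity-1 network
p_word = (p_granularity[:, None] * p_word_given_z).sum(dim=0)  # p(y_t | y_1:t-1, X)
next_word = int(torch.argmax(p_word))               # word with the highest probability of being output
```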
  • the foregoing embodiment describes the use of a deep neural network obtained by training to implement a natural language processing method.
  • the following describes how to train a required deep neural network.
  • FIG. 7 is a flowchart of a training method provided by an embodiment of the application. As shown in FIG. 7, the method may include:
  • the data processing device inputs the training samples to the deep neural network for processing, and obtains a prediction processing result.
  • the deep neural network includes: a granular labeling network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network.
•   the processing includes: using the granularity labeling network to determine the granularity of each word in the training sample; using the first feature network to perform feature extraction on the words of the first granularity in the training sample and outputting the obtained third feature information to the first processing network; using the second feature network to perform feature extraction on the words of the second granularity in the training sample and outputting the obtained fourth feature information to the second processing network; using the first processing network to perform the target processing on the third feature information and outputting the obtained third processing result to the fusion network; using the second processing network to perform the target processing on the fourth feature information and outputting the obtained fourth processing result to the fusion network; and using the fusion network to fuse the third processing result and the fourth processing result to obtain the prediction processing result; the first granularity and the second granularity are different.
  • the architectures of the first feature network and the second feature network are different, and/or the architectures of the first processing network and the second processing network are different.
  • the input of the granular annotation network is the natural language text
•   the granularity annotation network is used to determine the granularity of each word in the natural language text according to N granularities to obtain the annotation information of the natural language text, and to output the annotation information to the first feature network and the second feature network; the annotation information is used to describe the granularity of each word or the probability that each word belongs to the N granularities; N is an integer greater than 1.
•   the first feature network is used to perform feature extraction using the input natural language text and the annotation information, and to output the obtained third feature information to the first processing network, where the third feature information is a vector or matrix representing the words of the first granularity; the first processing network is used to perform the target processing using the input third feature information and the processing results that the first processing network has already output, to obtain the third processing result.
•   the fusion network outputs one word at a time; the fusion network is used to determine the target word to be output by using the third processing result, the fourth processing result, and the words that the fusion network has already output in the process of processing the third processing result and the fourth processing result, and to output the target word.
  • the data processing device determines the loss corresponding to the training sample according to the predicted processing result and the standard result.
  • the standard result is the expected processing result obtained by using the deep neural network to process the training sample.
  • each training sample corresponds to a standard result, so that the data processing device can calculate and use the deep neural network to process the loss of each training sample, thereby optimizing the deep neural network.
  • the following takes training a deep neural network to process retelling tasks as an example to introduce the training samples and standard results that can be used by the data processing device to train the deep neural network.
•   the granularity annotation network 301 is obtained through end-to-end learning. Because of end-to-end learning, in order to ensure that the granularity labeling network 301 is differentiable, during the training process the granularity labeling network 301 actually gives the probability that each word belongs to each granularity, rather than an absolute 0/1 label.
•   the data processing device trains the deep neural network to process different natural language processing tasks using different training samples and standard results. For example, if the deep neural network is trained to handle retelling (paraphrase) tasks, training samples and standard results similar to those in Table 1 can be used. For another example, if the deep neural network is trained to handle translation tasks, the training samples used are English texts, and the standard results are the standard Chinese texts corresponding to the training samples.
  • the data processing device uses the loss corresponding to the training sample to update the parameters of the deep neural network through an optimization algorithm.
  • data processing equipment can train deep neural networks to handle different natural language processing tasks.
•   when the data processing device trains the deep neural network to process different natural language processing tasks, the loss between the predicted processing result and the standard result is calculated differently, that is, the method of calculating the loss corresponding to the training sample differs.
•   using the loss corresponding to the training sample and updating the parameters of the deep neural network through an optimization algorithm (for example, a gradient descent algorithm) may be: using the gradient values of the loss function with respect to at least one network included in the deep neural network to update the parameters of the at least one network, where the loss function is used to calculate the loss between the predicted processing result and the standard result; during the update of any one of the first feature network, the second feature network, the first processing network, and the second processing network, the parameters of the other three networks remain unchanged.
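•   a hedged training-step sketch of this update scheme (assuming a PyTorch-style implementation; `model`, `loss_fn`, `model.feature_nets`, and the data are placeholders, not the concrete networks of this application): the loss between the predicted processing result and the standard result is computed, then one sub-network is updated by gradient descent while the parameters of the other sub-networks stay frozen.

```python
import torch

def train_step(model, sample, standard_result, loss_fn, lr=1e-3):
    # Freeze everything except, for example, the first feature network for this update.
    for p in model.parameters():
        p.requires_grad_(False)
    for p in model.feature_nets[0].parameters():
        p.requires_grad_(True)

    optimizer = torch.optim.SGD(model.feature_nets[0].parameters(), lr=lr)
    prediction = model(sample)                      # predicted processing result
    loss = loss_fn(prediction, standard_result)     # loss corresponding to the training sample
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                # gradient-descent update of that network only
    return loss.item()
```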
  • the deep neural network used in the foregoing embodiment is a network obtained by using the training method in FIG. 7. It should be understood that the structure and processing process of the deep neural network in FIG. 7 are the same as the deep neural network in the foregoing embodiment.
  • the data processing device trains a deep neural network that can independently process words of different granularities, so as to obtain a deep neural network that can avoid the process of obtaining coarser-grained information from finer-grained information, and is simple to implement.
  • FIG. 8 is a schematic structural diagram of a data processing device provided by an embodiment of the application. As shown in FIG. 8, the data processing device may include:
  • the obtaining unit 801 is configured to obtain the natural language text to be processed
  • the processing unit 802 is configured to process the natural language text by using the deep neural network obtained by training;
  • the output unit 803 is configured to output the target result obtained by processing the natural language text.
  • the deep neural network includes: a granular labeling network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network.
•   the processing includes: using the granularity labeling network to determine the granularity of each word in the natural language text; using the first feature network to perform feature extraction on the words of the first granularity in the natural language text and outputting the obtained first feature information to the first processing network; using the second feature network to perform feature extraction on the words of the second granularity in the natural language text and outputting the obtained second feature information to the second processing network; using the first processing network to process the first feature information and outputting the obtained first processing result to the fusion network; using the second processing network to process the second feature information and outputting the obtained second processing result to the fusion network; and using the fusion network to fuse the first processing result and the second processing result to obtain the target result; the first granularity and the second granularity are different.
  • the processing unit 802 may be a central processing unit (Central Processing Unit, CPU) in a data processing device, a neural network processor (Neural-network Processing Unit, NPU), or other types of processors.
  • the output unit 803 may be a display, a display screen, an audio device, etc.
  • the target result may be another natural language text obtained from the natural language text, and the obtained natural language text is displayed on the display screen of the data processing device.
  • the target result can be a voice corresponding to another natural language text obtained from the natural language text, and the audio device in the data processing device plays the voice.
  • the processing unit 802 is also used to input training samples into the deep neural network for processing to obtain prediction processing results; according to the prediction processing results and standard results, determine the loss corresponding to the training samples;
  • the standard result is the processing result expected to be obtained by using the deep neural network to process the training sample; using the loss corresponding to the training sample, the parameters of the deep neural network are updated through an optimization algorithm.
  • the deep neural network includes: a granular labeling network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network.
•   the processing includes: using the granularity labeling network to determine the granularity of each word in the training sample; using the first feature network to perform feature extraction on the words of the first granularity in the training sample and outputting the obtained third feature information to the first processing network; using the second feature network to perform feature extraction on the words of the second granularity in the training sample and outputting the obtained fourth feature information to the second processing network; using the first processing network to perform the target processing on the third feature information and outputting the obtained third processing result to the fusion network; using the second processing network to perform the target processing on the fourth feature information and outputting the obtained fourth processing result to the fusion network; and using the fusion network to fuse the third processing result and the fourth processing result to obtain the prediction processing result; the first granularity and the second granularity are different.
•   a deep neural network (DNN) can be understood as a neural network with many hidden layers; there is no special metric for "many" here.
•   the multi-layer neural network and the deep neural network that we often speak of are essentially the same thing.
•   the layers inside a DNN can be divided into three categories: the input layer, the hidden layers, and the output layer.
•   generally, the first layer is the input layer, the last layer is the output layer, and all the layers in the middle are hidden layers.
  • the layers are fully connected, that is, any neuron in the i-th layer must be connected to any neuron in the i+1th layer.
•   for example, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as W^3_24, where the superscript 3 represents the layer index of the coefficient W, and the subscript corresponds to the output index 2 of the third layer and the input index 4 of the second layer.
•   in general, the coefficient from the k-th neuron of the (L-1)-th layer to the j-th neuron of the L-th layer is defined as W^L_jk. Note that the input layer has no W parameters.
  • more hidden layers allow the network to better describe complex situations in the real world. Theoretically speaking, a model with more parameters is more complex and has a greater "capacity", which means it can complete more complex learning tasks.
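•   an illustration of the coefficient notation above (assuming a PyTorch-style implementation): a fully connected layer stores its weights as a matrix of shape (out_features, in_features), so the element at row j-1 and column k-1 corresponds to W^L_jk, the coefficient from the k-th neuron of layer L-1 to the j-th neuron of layer L.

```python
import torch.nn as nn

layer3 = nn.Linear(in_features=5, out_features=4)    # layer 2 (5 neurons) -> layer 3 (4 neurons)
w_3_24 = layer3.weight[2 - 1, 4 - 1]                  # W^3_24: from neuron 4 of layer 2 to neuron 2 of layer 3
```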
  • FIG. 9 is a schematic structural diagram of a neural network processor provided by an embodiment of the application.
•   the neural network processor NPU 90 is mounted on the main CPU (Host CPU) as a coprocessor, and the Host CPU allocates tasks (for example, natural language processing tasks) to it.
•   the core part of the NPU is the arithmetic circuit 903; the arithmetic circuit 903 is controlled by the controller 904 to extract matrix data from the memory and perform multiplication operations.
  • the arithmetic circuit 903 includes multiple processing units (Process Engine, PE). In some implementations, the arithmetic circuit 903 is a two-dimensional systolic array. The arithmetic circuit 903 may also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 903 is a general-purpose matrix processor.
  • the arithmetic circuit fetches the data corresponding to matrix B from the weight memory 902 and caches it on each PE in the arithmetic circuit.
  • the arithmetic circuit takes the matrix A data and matrix B from the input memory 901 to perform matrix operations, and the partial or final result of the obtained matrix is stored in the accumulator 908.
  • the unified memory 906 is used to store input data and output data.
  • the weight data is directly transferred to the weight memory 902 through the direct memory access controller (DMAC) 905.
  • the input data is also transferred to the unified memory 906 through the DMAC.
  • the Bus Interface Unit (BIU) 510 is used for the interaction between the AXI bus and the DMAC and the instruction fetch buffer (Instruction Fetch Buffer) 909.
  • the bus interface unit 510 is also used for the instruction fetch memory 909 to obtain instructions from the external memory, and also used for the storage unit access controller 905 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
  • the DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 906 or the weight data to the weight memory 902 or the input data to the input memory 901.
  • the vector calculation unit 907 has multiple arithmetic processing units, if necessary, further processing the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on.
  • the vector calculation unit 907 can store the processed output vector in the unified buffer 906.
  • the vector calculation unit 907 may apply a nonlinear function to the output of the arithmetic circuit 903, such as a vector of accumulated values, to generate the activation value.
  • the vector calculation unit 907 generates a normalized value, a combined value, or both.
  • the processed output vector can be used as an activation input to the arithmetic circuit 903, for example for use in a subsequent layer in a neural network.
  • the instruction fetch buffer 909 connected to the controller 904 is used to store instructions used by the controller 904;
  • the unified memory 906, the input memory 901, the weight memory 902, and the fetch memory 909 are all On-Chip memories.
  • each layer in the deep neural network shown in FIG. 3 may be executed by the matrix calculation unit 212 or the vector calculation unit 907.
•   the NPU is used to implement the deep-neural-network-based natural language processing method and training method, which can greatly improve the efficiency with which the data processing device processes natural language processing tasks and trains the deep neural network.
  • FIG. 10 is a block diagram of a partial structure of an intelligent terminal provided by an embodiment of the application.
  • the smart terminal includes: a radio frequency (RF) circuit 1010, a memory 1020, an input unit 1030, a display unit 1040, a sensor 1050, an audio circuit 1060, a wireless fidelity (WiFi) module 1070, a system on chip (System On Chip, SoC) 1080 and power supply 1090 and other components.
•   the memory 1020 includes DDR memory, and of course may also include high-speed random access memory or other storage units such as non-volatile memory, for example at least one magnetic disk storage device, flash memory device, or other non-volatile solid-state storage device.
  • the structure of the smart terminal shown in FIG. 10 does not constitute a limitation on the smart terminal, and may include more or less components than those shown in the figure, or a combination of certain components, or different component arrangements.
  • the RF circuit 1010 can be used for receiving and sending signals during the process of sending and receiving information or talking. In particular, after receiving the downlink information of the base station, it is processed by SoC 1080; in addition, the designed uplink data is sent to the base station.
  • the RF circuit 1010 includes but is not limited to an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like.
  • the RF circuit 1010 can also communicate with the network and other devices through wireless communication.
  • the above wireless communication can use any communication standard or protocol, including but not limited to Global System of Mobile Communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (Code Division) Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), Email, Short Messaging Service (SMS), etc.
  • the memory 1020 may be used to store software programs and modules.
  • the SoC 1080 runs the software programs and modules stored in the memory 1020 to execute various functional applications and data processing of the smart terminal.
  • the memory 1020 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function (such as a sound playback function, an image playback function, a translation function, a retelling function, etc.), etc.;
  • the data storage area can store data (such as audio data, phone book, etc.) created according to the use of the smart terminal.
  • the input unit 1030 can be used to receive input natural language text and voice data, and generate key signal inputs related to user settings and function control of the smart terminal.
  • the input unit 1030 may include a touch panel 1031 and other input devices 1032.
•   the touch panel 1031, also known as a touch screen, can collect the user's touch operations on or near it (for example, operations performed by the user on or near the touch panel 1031 using a finger, a stylus, or any other suitable object or accessory), and drive the corresponding connection device according to a preset program.
  • the touch panel 1031 is used to receive the natural language text input by the user and input the natural language text into the SoC1080.
  • the touch panel 1031 may include two parts: a touch detection device and a touch controller.
•   the touch detection device detects the user's touch position, detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, and then sends it to the SoC 1080, and can also receive commands from the SoC 1080 and execute them.
  • the touch panel 1031 can be realized by various types such as resistive, capacitive, infrared, and surface acoustic wave.
  • the input unit 1030 may also include other input devices 1032.
  • other input devices 1032 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control buttons, switch buttons, etc.), trackball, mouse, joystick, touch screen, microphone, etc.
  • the microphone included in the input device 1032 can receive the voice data input by the user and input the voice data to the SoC1080.
  • the SoC 1080 runs the software programs and modules stored in the memory 1020 to execute the data processing method provided in this application to process the natural language text input by the input unit 1030 to obtain the target result. SoC 1080 may also convert the voice data input by the input unit 1030 into natural language text, and then execute the data processing method provided in this application to process the natural language text to obtain the target result.
  • the display unit 1040 may be used to display information input by the user or information provided to the user and various menus of the smart terminal.
  • the display unit 1040 may include a display panel 1041, and optionally, the display panel 1041 may be configured in the form of a liquid crystal display (Liquid Crystal Display, LCD), an organic light-emitting diode (Organic Light-Emitting Diode, OLED), etc.
  • the display unit 1040 can be used to display the target result obtained by the SoC 1080 processing natural language text. Further, the touch panel 1031 can cover the display panel 1041.
•   when the touch panel 1031 detects a touch operation on or near it, the operation is sent to the SoC 1080 to determine the type of the touch event, and then the SoC 1080 provides corresponding visual output on the display panel 1041 according to the type of the touch event.
•   although in FIG. 10 the touch panel 1031 and the display panel 1041 are used as two independent components to implement the input and output functions of the smart terminal, in some embodiments the touch panel 1031 and the display panel 1041 can be integrated to implement the input and output functions of the smart terminal.
  • the smart terminal may also include at least one sensor 1050, such as a light sensor, a motion sensor, and other sensors.
  • the light sensor can include an ambient light sensor and a proximity sensor.
  • the ambient light sensor can adjust the brightness of the display panel 1041 according to the brightness of the ambient light.
•   the proximity sensor can turn off the display panel 1041 and/or the backlight when the smart terminal is moved to the ear.
•   as a kind of motion sensor, the accelerometer sensor can detect the magnitude of acceleration in various directions (usually three axes), and can detect the magnitude and direction of gravity when stationary; it can be used for applications that recognize the posture of the smart terminal (such as switching between horizontal and vertical screens, related games, and magnetometer posture calibration) and for vibration-recognition related functions (such as a pedometer or tap detection). Other sensors such as gyroscopes, barometers, hygrometers, thermometers, and infrared sensors that can be configured in the smart terminal will not be described here.
  • the audio circuit 1060, the speaker 1061, and the microphone 1062 can provide an audio interface between the user and the smart terminal.
•   the audio circuit 1060 can transmit the electrical signal converted from the received audio data to the speaker 1061, and the speaker 1061 converts it into a sound signal for output; on the other hand, the microphone 1062 converts the collected sound signal into an electrical signal, which is received by the audio circuit 1060 and converted into audio data; the audio data is then output to the SoC 1080 for processing and sent to another smart terminal through the RF circuit 1010, or the audio data is output to the memory 1020 for further processing.
  • WiFi is a short-distance wireless transmission technology.
  • the smart terminal can help users send and receive emails, browse web pages, and access streaming media through the WiFi module 1070. It provides users with wireless broadband Internet access.
•   although FIG. 10 shows the WiFi module 1070, it is understandable that it is not a necessary component of the smart terminal and can be omitted as needed without changing the essence of the invention.
  • SoC 1080 is the control center of the intelligent terminal. It uses various interfaces and lines to connect the various parts of the entire intelligent terminal. By running or executing software programs and/or modules stored in the memory 1020, and calling data stored in the memory 1020, Perform various functions of the smart terminal and process data, thereby monitoring the smart terminal as a whole.
•   the SoC 1080 may include multiple processing units, such as CPUs or various service processors; the SoC 1080 may also integrate an application processor and a modem processor, where the application processor mainly handles the operating system, user interfaces, and application programs, and the modem processor mainly handles wireless communication. It is understandable that the above modem processor may not be integrated into the SoC 1080.
  • the smart terminal also includes a power supply 1090 (such as a battery) for supplying power to various components.
  • the power supply can be logically connected to the SoC 1080 through a power management system, so that functions such as charging, discharging, and power management are realized through the power management system.
  • the smart terminal may also include a camera, a Bluetooth module, etc., which will not be repeated here.
  • Fig. 11 is a block diagram of a partial structure of a data processing device provided by an embodiment of the present application.
  • the data processing device 1100 may include a processor 1101, a memory 1102, an input device 1103, an output device 1104, and a bus 1105.
  • the processor 1101, the memory 1102, the input device 1103, and the output device 1104 realize the communication connection between each other through the bus 1105.
  • the processor 1101 may be a general-purpose CPU, a microprocessor, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits for executing related programs, so as to implement the technical solutions provided by the embodiments of the present application.
  • the processor 1101 corresponds to the processing unit 802 in FIG. 8.
  • the memory 1102 may be a read only memory (Read Only Memory, ROM), a static storage device, a dynamic storage device, or a random access memory (Random Access Memory, RAM).
  • the memory 1102 may store an operating system and other application programs.
  • the program code used to implement, through software or firmware, the modules and components of the data processing device provided in the embodiments of the present application, or the program code used to implement the methods provided in the method embodiments of the present application, is stored in the memory 1102; the processor 1101 reads the code in the memory 1102 to execute the operations required by the modules and components included in the data processing device, or to execute the above-mentioned methods provided in the embodiments of the present application.
  • the input device 1103, corresponding to the acquiring unit 801, is used to input natural language text to be processed by the data processing device.
  • the output device 1104, corresponding to the output unit 803, is used to output the target result obtained by the data processing device.
  • the bus 1105 may include a path for transferring information between various components of the data processing device (for example, the processor 1101, the memory 1102, the input device 1103, and the output device 1104).
  • although the data processing device 1100 shown in FIG. 11 only shows the processor 1101, the memory 1102, the input device 1103, the output device 1104, and the bus 1105, those skilled in the art should understand that, in a specific implementation, the data processing device 1100 also includes other devices necessary for normal operation. At the same time, according to specific needs, those skilled in the art should understand that the data processing device 1100 may also include hardware devices that implement other additional functions. In addition, those skilled in the art should understand that the data processing device 1100 may also include only the components necessary to implement the embodiments of the present application, and not necessarily all the components shown in FIG. 11.
  • An embodiment of the present application provides a computer-readable storage medium.
  • the above-mentioned computer-readable storage medium stores a computer program.
  • the above-mentioned computer program includes software program instructions; when the above-mentioned program instructions are executed by a processor in a data processing device, the data processing method and/or training method in the foregoing embodiments is implemented.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a dedicated computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium or transmitted through the computer-readable storage medium.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device, such as a server or a data center, that integrates one or more available media.
  • the usable medium may be a magnetic medium (eg, floppy disk, hard disk, magnetic tape), optical medium (eg, DVD), or semiconductor medium (eg, solid state disk (SSD)), or the like.

Abstract

The present application discloses a natural language processing method, a training method and a data processing device in the field of artificial intelligence. Said method comprises: obtaining a natural language text to be processed; and processing the natural language text by means of a trained deep neural network, and outputting a target result obtained by processing the natural language text, the deep neural network comprising: a granularity labeling network, a first feature network, a second feature network, a first processing network, a second processing network and a fusing network. In the present application, the data processing device uses networks decoupled from one another to process words of different granularities in a natural language text, effectively improving the performance of processing a natural language processing task.

Description

自然语言处理方法、训练方法及数据处理设备Natural language processing method, training method and data processing equipment
本申请要求于2019年01月18日提交中国国家知识产权局、申请号为201910108559.9、申请名称为“自然语言处理方法、训练方法及数据处理设备”的中国专利申请的优先权,其全部内容通过引用结合在本申请中。This application claims the priority of a Chinese patent application filed with the State Intellectual Property Office of China, the application number is 201910108559.9, and the application name is "Natural language processing methods, training methods, and data processing equipment" on January 18, 2019. The reference is incorporated in this application.
技术领域Technical field
本申请涉及自然语言处理领域,特别涉及一种自然语言处理方法、训练方法及数据处理设备。This application relates to the field of natural language processing, in particular to a natural language processing method, training method and data processing equipment.
背景技术Background technique
人工智能(Artificial Intelligence,AI)是利用数字计算机或者数字计算机控制的机器模拟、延伸和扩展人的智能,感知环境、获取知识并使用知识获得最佳结果的理论、方法、技术及应用系统。换句话说,人工智能是计算机科学的一个分支,它企图了解智能的实质,并生产出一种新的能以人类智能相似的方式作出反应的智能机器。人工智能也就是研究各种智能机器的设计原理与实现方法,使机器具有感知、推理与决策的功能。Artificial Intelligence (AI) is a theory, method, technology and application system that uses digital computers or machines controlled by digital computers to simulate, extend and expand human intelligence, perceive the environment, acquire knowledge, and use knowledge to obtain the best results. In other words, artificial intelligence is a branch of computer science that attempts to understand the essence of intelligence and produce a new kind of intelligent machine that can react in a similar way to human intelligence. Artificial intelligence is to study the design principles and implementation methods of various intelligent machines, so that the machines have the functions of perception, reasoning and decision-making.
随着人工智能技术的不断发展,越来越多的自然语言处理任务可以采用人工智能技术来实现,例如采用人工智能技术来实现翻译任务。自然语言处理任务可以分为不同的粒度,一般分为字符级(character level)、词语级(word level)、短语级(phrase level)、句子级(sentence level)、篇章级(discourse level)等,这些粒度依次变粗。例如词性标注是词语级任务,命名实体识别(named entity recognition)是短语级任务,句法分析通常是句子级的任务。不同粒度上的信息并不是孤立的,而是相互传递的。例如在做句法分析时,通常也要考虑到词语级和短语级的特征。在一些相对更加复杂的任务中,例如句子的分类、句子与句子之间的语义匹配、句子的翻译或改写,通常需要用到多个粒度上的信息,最后再进行综合。With the continuous development of artificial intelligence technology, more and more natural language processing tasks can be implemented using artificial intelligence technology, for example, using artificial intelligence technology to implement translation tasks. Natural language processing tasks can be divided into different granularities, generally divided into character level, word level, phrase level, sentence level, discourse level, etc. These particle sizes become coarser in turn. For example, part-of-speech tagging is a word-level task, named entity recognition (named entity recognition) is a phrase-level task, and syntactic analysis is usually a sentence-level task. Information at different granularities is not isolated, but is transmitted to each other. For example, when doing syntactic analysis, the word-level and phrase-level features are usually considered. In some relatively more complex tasks, such as sentence classification, sentence-to-sentence semantic matching, sentence translation or rewriting, it is usually necessary to use multiple granular information, and finally synthesize it.
目前主流的基于深度学习的自然语言处理方法是通过神经网络对自然语言文本做处理。在主流的方法中,神经网络在处理过程中对不同粒度的词语的处理是混合在一起的,得到正确的处理结果的概率较低。因此,需要研究新的方案。The current mainstream natural language processing method based on deep learning is to process natural language text through neural networks. In the mainstream method, the neural network processes the words of different granularity in the processing process are mixed together, and the probability of obtaining the correct processing result is low. Therefore, new solutions need to be studied.
发明内容Summary of the invention
本申请实施例提供一种自然语言处理方法、训练方法及数据处理设备,可以避免由较细粒度的信息得到较粗粒度的信息的过程,可以有效改善处理自然语言处理任务的性能。The embodiments of the present application provide a natural language processing method, training method, and data processing device, which can avoid the process of obtaining coarser-grained information from finer-grained information, and can effectively improve the performance of processing natural language processing tasks.
第一方面本申请实施例提供了一种自然语言处理方法,该方法包括:获得待处理的自然语言文本;利用训练得到的深度神经网络对所述自然语言文本做处理,输出处理所述自然语言文本得到的目标结果;其中,所述深度神经网络包括:粒度标注网络、第一特征网络、第二特征网络、第一处理网络、第二处理网络以及融合网络,所述处理包括:利用所述粒度标注网络确定所述自然语言文本中各词语的粒度;利用所述第一特征网络对所述自然语言文本中第一粒度的词语进行特征提取,将得到的第一特征信息输出至所述第一处理 网络;利用所述第二特征网络对所述自然语言文本中第二粒度的词语进行特征提取,将得到的第二特征信息输出至所述第二处理网络;利用所述第一处理网络对所述第一特征信息做处理,将得到的第一处理结果输出至所述融合网络;利用所述第二处理网络对所述第二特征信息做所述处理,将得到的第二处理结果输出至所述融合网络;利用所述融合网络融合所述第一处理结果和所述第二处理结果得到所述目标结果;所述第一粒度和所述第二粒度不同。In the first aspect, the embodiments of the present application provide a natural language processing method, which includes: obtaining natural language text to be processed; processing the natural language text using a deep neural network obtained by training, and output processing the natural language text The target result obtained from the text; wherein the deep neural network includes: a granular annotation network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network, and the processing includes: using the The granularity tagging network determines the granularity of each word in the natural language text; using the first feature network to perform feature extraction on the first granular words in the natural language text, and output the obtained first feature information to the first feature information A processing network; using the second feature network to perform feature extraction on words with a second granularity in the natural language text, and output the obtained second feature information to the second processing network; using the first processing network Process the first characteristic information, and output the obtained first processing result to the fusion network; use the second processing network to perform the processing on the second characteristic information, and obtain the second processing result Output to the fusion network; use the fusion network to fuse the first processing result and the second processing result to obtain the target result; the first granularity and the second granularity are different.
该深度神经网络可以包括N个特征网络以及N个处理网络,该N个特征网络以及该N个处理网络一一对应,N为大于1的整数。一对相对应的特征网络和处理网络用于处理同一粒度的词语。由于数据处理设备将不同粒度的词语分开进行处理,对各粒度的词语所做的处理操作不依赖于其他粒度的词语的处理结果,这就避免了由较细粒度的信息得到较粗粒度的信息的过程,从而大大降低该数据处理设备得到错误结果的概率。The deep neural network may include N feature networks and N processing networks. The N feature networks and the N processing networks have a one-to-one correspondence, and N is an integer greater than one. A pair of corresponding feature networks and processing networks are used to process words of the same granularity. Since the data processing equipment separates words of different granularities for processing, the processing operations for words of each granularity do not depend on the processing results of words of other granularities, which avoids obtaining coarser-grained information from finer-grained information This process greatly reduces the probability that the data processing device will get wrong results.
本申请实施例中,数据处理设备利用深度神经网络独立处理不同粒度的词语,避免了由较细粒度的信息得到较粗粒度的信息的过程,可以有效提高处理自然处理任务的性能。In the embodiments of the present application, the data processing device uses a deep neural network to independently process words of different granularity, avoiding the process of obtaining coarser-grained information from finer-grained information, and can effectively improve the performance of processing natural processing tasks.
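As an illustration only, the following is a minimal PyTorch-style sketch of how the decoupled pipeline described above could be wired together: a granularity tagger, one feature network and one processing network per granularity, and a fusion step. The module choices, names, and dimensions are assumptions for exposition and are not the reference implementation of this application.

```python
# Minimal sketch (assumed shapes/names) of the decoupled per-granularity pipeline.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiGranularityNet(nn.Module):
    def __init__(self, vocab_size, d_model=256, num_granularities=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        # simplified stand-in for the granularity labeling network
        # (the BiLSTM/LSTM tagger of the optional implementation is sketched further below)
        self.granularity_scorer = nn.Linear(d_model, num_granularities)
        # one feature network and one processing network per granularity (decoupled branches)
        self.feature_nets = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
            for _ in range(num_granularities))
        self.processing_nets = nn.ModuleList(
            nn.LSTM(d_model, d_model, batch_first=True)
            for _ in range(num_granularities))
        self.fusion = nn.Linear(num_granularities * d_model, vocab_size)

    def forward(self, token_ids):
        x = self.embed(token_ids)                                           # [B, L, d]
        z = F.gumbel_softmax(self.granularity_scorer(x), tau=1.0, dim=-1)   # per-word granularity
        branch_outputs = []
        for k, (feat, proc) in enumerate(zip(self.feature_nets, self.processing_nets)):
            xk = x * z[..., k:k + 1]      # each branch only "sees" words of its own granularity
            uk = feat(xk)                 # first / second feature information
            vk, _ = proc(uk)              # first / second processing result
            branch_outputs.append(vk)
        fused = torch.cat(branch_outputs, dim=-1)                           # fusion network input
        return self.fusion(fused)                                           # target result (token scores)
```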
在一个可选的实现方式中,所述第一特征网络和所述第二特征网络的架构不同,和/或,所述第一处理网络和所述第二处理网络的架构不同。In an optional implementation manner, the architecture of the first characteristic network and the second characteristic network are different, and/or the architecture of the first processing network and the second processing network are different.
不同粒度的词语的特征不同,采用不同架构的网络来处理不同粒度的词语,可以更有针对性的处理不同粒度的词语。Words with different granularities have different characteristics. Using networks with different architectures to process words with different granularities can more specifically process words with different granularities.
在该实现方式中,通过不同架构的特征网络或不同架构的处理网络来处理不同粒度的词语,进一步提升数据处理设备处理自然语言处理任务的性能。In this implementation manner, words of different granularities are processed through feature networks of different architectures or processing networks of different architectures, which further improves the performance of the data processing device in processing natural language processing tasks.
在一个可选的实现方式中,所述粒度标注网络的输入为所述自然语言文本,所述利用所述粒度标注网络确定所述自然语言文本中各词语的粒度包括:利用所述粒度标注网络按照N种粒度确定所述自然语言文本中每个词语的粒度以得到所述自然语言文本的标注信息,向所述第一特征网络和所述第二特征网络输出所述标注信息;其中,所述标注信息用于描述所述每个词语的粒度或者所述每个词语分别属于所述N种粒度的概率;N为大于1的整数;In an optional implementation manner, the input of the granular annotation network is the natural language text, and the using the granular annotation network to determine the granularity of each word in the natural language text includes: using the granular annotation network Determine the granularity of each word in the natural language text according to N granularities to obtain the annotation information of the natural language text, and output the annotation information to the first feature network and the second feature network; wherein, The label information is used to describe the granularity of each word or the probability that each word belongs to the N granularities; N is an integer greater than 1;
所述利用所述第一特征网络对所述自然语言文本中第一粒度的词语进行特征提取包括:利用所述第一特征网络处理所述第一粒度的词语以得到所述第一特征信息,所述第一特征信息为表示所述第一粒度的词语的向量或矩阵;The using the first feature network to perform feature extraction on the words of the first granularity in the natural language text includes: using the first feature network to process the words of the first granularity to obtain the first feature information, The first feature information is a vector or matrix representing words of the first granularity;
所述利用所述第二特征网络对所述自然语言文本中第二粒度的词语进行特征提取包括:利用所述第二特征网络处理所述第二粒度的词语以得到所述第二特征信息,所述述第二特征信息为表示所述第二粒度的词语的向量或矩阵。The using the second feature network to perform feature extraction on the words of the second granularity in the natural language text includes: using the second feature network to process the words of the second granularity to obtain the second feature information, The second feature information is a vector or matrix representing words of the second granularity.
在该实现方式中,粒度标注网络可以准确地确定自然语言文本中各词语的粒度,以便于各特征网络处理特定粒度的词语。In this implementation, the granular annotation network can accurately determine the granularity of each word in the natural language text, so that each feature network can process words with a specific granularity.
在一个可选的实现方式中,所述粒度标注网络包括长短期记忆网络LSTM和双向长短期记忆网络BiLSTM;所述利用所述粒度标注网络确定所述自然语言文本中各词语的粒度包括:In an optional implementation manner, the granular labeling network includes a long and short-term memory network LSTM and a bidirectional long short-term memory network BiLSTM; and the using the granular labeling network to determine the granularity of each word in the natural language text includes:
利用所述粒度标注网络采用如下公式确定所述自然语言文本中各词语的粒度:The granularity labeling network is used to determine the granularity of each word in the natural language text using the following formula:
h_l = BiLSTM([x_l; h_{l-1}, h_{l+1}]);

g_l = LSTM([h_l, z_{l-1}; g_{l-1}]);

z_l = GS(W_g g_l, τ);
wherein BiLSTM() in the formulas represents the processing operation of the BiLSTM, and LSTM() represents the processing operation of the LSTM; x represents a word in the natural language text, and x_l represents the l-th word in the natural language text x; h represents a hidden state variable in the BiLSTM network, and h_l, h_{l-1}, h_{l+1} respectively represent the hidden state variables when the BiLSTM network processes the l-th, (l-1)-th, and (l+1)-th words in the natural language text; g represents a hidden state variable in the LSTM network, and g_l, g_{l-1} respectively represent the hidden state variables when the LSTM network processes the l-th and (l-1)-th words in the natural language text; z represents the probability that a word belongs to the reference granularity, and z_{l-1}, z_l respectively represent the probabilities that the (l-1)-th and l-th words in the natural language text belong to the reference granularity, the reference granularity being any one of the N granularities; GS represents the Gumbel Softmax function, τ is a hyperparameter (temperature) of the Gumbel Softmax function, and W_g is a parameter matrix, that is, a parameter matrix in the granularity labeling network.
在该实现方式中,粒度标注网络使用多层LSTM网络的架构来确定自然语言文本中各词语的粒度,可以充分利用已确定的词语的粒度来确定新的词语(待确定粒度的词语)的粒度,实现简单,处理效率高。In this implementation, the granular annotation network uses the architecture of a multi-layer LSTM network to determine the granularity of each word in the natural language text, and can make full use of the granularity of the determined word to determine the granularity of the new word (word of the granularity to be determined) , Simple implementation and high processing efficiency.
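For illustration, the following sketch implements the tagger formulas above (h_l = BiLSTM([x_l; h_{l-1}, h_{l+1}]), g_l = LSTM([h_l, z_{l-1}; g_{l-1}]), z_l = GS(W_g g_l, τ)) with standard PyTorch building blocks; the hidden sizes, batch-first layout, and even d_model are assumptions rather than the reference implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GranularityTagger(nn.Module):
    """Sketch of the two-layer tagger: a BiLSTM over the sentence followed by a
    unidirectional LSTM that also consumes the previous granularity decision z_{l-1}."""
    def __init__(self, d_model, num_granularities, tau=1.0):
        super().__init__()
        self.bilstm = nn.LSTM(d_model, d_model // 2, bidirectional=True, batch_first=True)
        self.cell = nn.LSTMCell(d_model + num_granularities, d_model)
        self.W_g = nn.Linear(d_model, num_granularities, bias=False)
        self.num_granularities = num_granularities
        self.tau = tau

    def forward(self, x):                        # x: [B, L, d_model] word embeddings
        h, _ = self.bilstm(x)                    # h_l = BiLSTM([x_l; h_{l-1}, h_{l+1}])
        batch, length, _ = h.shape
        g = x.new_zeros(batch, self.cell.hidden_size)
        c = x.new_zeros(batch, self.cell.hidden_size)
        z_prev = x.new_zeros(batch, self.num_granularities)
        zs = []
        for l in range(length):
            g, c = self.cell(torch.cat([h[:, l], z_prev], dim=-1), (g, c))  # g_l
            z_prev = F.gumbel_softmax(self.W_g(g), tau=self.tau, dim=-1)    # z_l = GS(W_g g_l, tau)
            zs.append(z_prev)
        return torch.stack(zs, dim=1)            # [B, L, N] per-word granularity probabilities
```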
在一个可选的实现方式中,所述利用所述第一特征网络对所述自然语言文本中第一粒度的词语进行特征提取包括:In an optional implementation manner, the using the first feature network to perform feature extraction on words with a first granularity in the natural language text includes:
利用所述第一特征网络采用如下公式对所述自然语言文本中第一粒度的词语进行特征提取:Use the first feature network to use the following formula to perform feature extraction on words of the first granularity in the natural language text:
U_z = ENC_z(X, Z_X);

wherein ENC_z represents the first feature network, which is a Transformer model; ENC_z() represents the processing operation performed by the first feature network; X represents the natural language text; Z_X = [z1, z2, ..., zL] represents the annotation information, and z1 to zL sequentially represent the granularities of the first word to the L-th (last) word in the natural language text; U_z represents the first feature information output by the first feature network.
在该实现方式中,利用特征网络可以准确、快速地提取出相应粒度的词语的特征信息。In this implementation, the feature network can be used to accurately and quickly extract the feature information of the corresponding granular words.
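As a sketch of one feature network ENC_z, under the assumption that it is a small Transformer encoder as in the optional implementation above, the granularity probabilities from the annotation information can be used to suppress words of other granularities; the weighting scheme and layer sizes below are illustrative assumptions.

```python
import torch
import torch.nn as nn

class GranularityFeatureNet(nn.Module):
    """Sketch of one feature network ENC_z: a small Transformer encoder that extracts
    feature information for the words assigned to a single granularity."""
    def __init__(self, d_model=256, nhead=4, num_layers=2):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, x, z_k):
        # x:   [B, L, d_model] embedded natural language text X
        # z_k: [B, L] probability (from the annotation information Z_X) that each word
        #      belongs to this network's granularity
        x = x * z_k.unsqueeze(-1)     # suppress words of other granularities
        return self.encoder(x)        # U_z: feature information for this granularity
```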
在一个可选的实现方式中,所述第一处理结果为包含一个或多个词语的序列,所述利用所述第一处理网络对所述第一特征信息做处理包括:利用所述第一处理网络对输入的所述第一特征信息和所述第一处理网络在处理所述第一特征信息的过程中已输出的词语做处理以得到所述第一处理结果。In an optional implementation manner, the first processing result is a sequence containing one or more words, and the processing of the first characteristic information using the first processing network includes: using the first The processing network processes the input first feature information and the words that have been output by the first processing network in the process of processing the first feature information to obtain the first processing result.
在该实现方式中,第一处理网络采用递归的方式来处理对应特征网络输出的特征信息,可以充分利用自然语言文本中各词语的相关性,进而提高处理的效率和准确性。In this implementation manner, the first processing network adopts a recursive manner to process the feature information output by the corresponding feature network, which can make full use of the relevance of each word in the natural language text, thereby improving the efficiency and accuracy of processing.
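The recursive behaviour described above, where a processing network consumes its feature information together with the words it has already output, can be sketched as a simple greedy LSTM decoder; the pooled context, special-token id, and maximum output length below are assumptions made only for illustration.

```python
import torch
import torch.nn as nn

class ProcessingNet(nn.Module):
    """Sketch of one processing network: an LSTM decoder that reads the feature
    information U_z and feeds back the words it has already output (greedy decoding)."""
    def __init__(self, vocab_size, d_model=256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.cell = nn.LSTMCell(2 * d_model, d_model)
        self.out = nn.Linear(d_model, vocab_size)

    def forward(self, u_z, max_len=20, bos_id=1):
        # u_z: [B, L, d_model] feature information from the matching feature network
        batch = u_z.size(0)
        context = u_z.mean(dim=1)                      # simple pooled context (assumption)
        h = u_z.new_zeros(batch, self.cell.hidden_size)
        c = u_z.new_zeros(batch, self.cell.hidden_size)
        prev = torch.full((batch,), bos_id, dtype=torch.long, device=u_z.device)
        step_logits = []
        for _ in range(max_len):
            h, c = self.cell(torch.cat([self.embed(prev), context], dim=-1), (h, c))
            logits = self.out(h)
            prev = logits.argmax(dim=-1)               # feed back the word just output
            step_logits.append(logits)
        return torch.stack(step_logits, dim=1)         # per-step word distributions
```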
在一个可选的实现方式中,所述融合网络输出的所述目标结果为包含一个或多个词语的序列,所述利用所述融合网络融合所述第一处理结果和所述第二处理结果得到所述目标结果包括:利用所述融合网络处理所述第一处理结果、所述第二处理结果以及所述融合网络在处理所第一处理结果和所述第二处理结果的过程中已输出的词语以确定待输出目标词语,输出所述目标词语。In an optional implementation manner, the target result output by the fusion network is a sequence containing one or more words, and the fusion network is used to fuse the first processing result and the second processing result Obtaining the target result includes: using the fusion network to process the first processing result, the second processing result, and the fusion network has outputted in the process of processing the first processing result and the second processing result To determine the target words to be output, output the target words.
在该实现方式中,融合网络采用递归的方式来处理各处理网络向其输入的处理结果,可以充分利用自然语言文本中各词语的相关性,进而提高其处理的效率和准确性。In this implementation, the fusion network uses a recursive method to process the processing results input to it by each processing network, which can make full use of the relevance of each word in the natural language text, thereby improving the efficiency and accuracy of its processing.
在一个可选的实现方式中,所述融合网络包括至少一个LSTM网络,所述利用所述融合网络处理所述第一处理结果、所述第二处理结果以及所述融合网络在处理所第一处理结果和所述第二处理结果的过程中已输出的序列以确定待输出目标词语包括:In an optional implementation manner, the converged network includes at least one LSTM network, and the converged network is used to process the first processing result, the second processing result, and the converged network is processing the first The processing result and the sequence output in the process of the second processing result to determine the target word to be output include:
将所述第一处理结果和所述第二处理结果合并得到的向量输入至所述LSTM网络;Input the vector obtained by merging the first processing result and the second processing result to the LSTM network;
利用所述LSTM网络采用如下公式计算待输出参考粒度的词语的概率:The LSTM network uses the following formula to calculate the probability of a word with a reference granularity to be output:
h_t = LSTM(h_{t-1}, y_{t-1}, v0, v1);

P(z_t | y_{1:t-1}, X) = GS(W_z h_t, τ);

wherein h_t represents the hidden state variable in the LSTM network when the LSTM network processes the t-th word, h_{t-1} represents the hidden state variable in the LSTM network when the LSTM network processes the (t-1)-th word, LSTM() represents the processing operation performed by the LSTM, the LSTM network has currently output (t-1) words, y_{t-1} represents the (t-1)-th word output by the fusion network, v0 represents the first processing result, v1 represents the second processing result, W_z is a parameter matrix in the fusion network, τ is a hyperparameter, P(z_t | y_{1:t-1}, X) is the probability that the word currently to be output is of the reference granularity (granularity z), and t is an integer greater than 1.
利用所述融合网络采用如下公式计算待输出所述目标词语的概率:Use the fusion network to calculate the probability of the target word to be output using the following formula:
P(y_t | y_{1:t-1}, X) = Σ_z P(z_t = z | y_{1:t-1}, X) · P_z(y_t | y_{1:t-1}, X);

wherein P_{z_t}(y_t | y_{1:t-1}, X) represents the probability of outputting the target word y_t at the reference granularity, and P(y_t | y_{1:t-1}, X) represents the probability of outputting the target word.

P_{z_t}(y_t | y_{1:t-1}, X) can be given by the processing networks: the processing network of granularity z can input to the fusion network the probability of each of the words (of granularity z) it currently intends to output. The fusion network can then compute, for each candidate word, the probability of that word being output, and output the word with the highest probability (the target word).
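A sketch of one fusion step under these formulas: an LSTM cell consumes the previously output word y_{t-1} and the processing results v0, v1, a granularity distribution is drawn with Gumbel Softmax, and the per-granularity word distributions are mixed. Tensor shapes and the greedy output rule are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionNet(nn.Module):
    """Sketch of one step of the fusion network: an LSTM cell over the previous output
    word and the per-granularity processing results, a Gumbel Softmax choice of
    granularity, and a mixture of the per-granularity word distributions."""
    def __init__(self, vocab_size, d_model=256, num_granularities=2, tau=1.0):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        self.cell = nn.LSTMCell((1 + num_granularities) * d_model, d_model)
        self.W_z = nn.Linear(d_model, num_granularities, bias=False)
        self.tau = tau

    def step(self, y_prev, v_list, p_word_list, state):
        # y_prev:      [B] word output at step t-1
        # v_list:      per-granularity processing results v0, v1, ..., each [B, d_model]
        # p_word_list: per-granularity distributions P_z(y_t | y_{1:t-1}, X), each [B, vocab]
        h, c = state
        h, c = self.cell(torch.cat([self.embed(y_prev)] + v_list, dim=-1), (h, c))  # h_t
        p_z = F.gumbel_softmax(self.W_z(h), tau=self.tau, dim=-1)   # P(z_t | y_{1:t-1}, X)
        # P(y_t | y_{1:t-1}, X) = sum_z P(z_t = z | ...) * P_z(y_t | ...)
        p_y = sum(p_z[:, k:k + 1] * p_word_list[k] for k in range(len(p_word_list)))
        return p_y.argmax(dim=-1), (h, c)               # emit the most probable target word
```

Decoding would call step() repeatedly, feeding back the word emitted at each step, which mirrors the recursive fusion described above.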
第二方面本申请实施例提供了一种训练方法,该方法包括:将训练样本输入至深度神经网络做处理,得到预测处理结果;其中,所述深度神经网络包括:粒度标注网络、第一特征网络、第二特征网络、第一处理网络、第二处理网络以及融合网络,所述处理包括:利用所述粒度标注网络确定所述训练样本中各词语的粒度;利用所述第一特征网络对所述训练样本中第一粒度的词语进行特征提取,将得到的第三特征信息输出至所述第一处理网络;利用所述第二特征网络对所述训练样本中第二粒度的词语进行特征提取,将得到的第四特征信息输出至所述第二处理网络;利用所述第一处理网络对所述第三特征信息做目标处理,将得到的第三处理结果输出至所述融合网络;利用所述第二处理网络对所述第四特征信息做所述目标处理,将得到的第四处理结果输出至所述融合网络;利用所述融合网络融合所述第三处理结果和所述第四处理结果得到所述预测处理结果;所述第一粒度和所述第二粒度不同;根据所述预测处理结果和标准结果,确定所述训练样本对应的损失;所述标准结果为利用所述深度神经网络处理所述训练样本期望得到的处理结果;利用所述训练样本对应的损失,通过优化算法更新所述深度神经网络的参数。In the second aspect, the embodiments of the present application provide a training method, which includes: inputting training samples into a deep neural network for processing to obtain a prediction processing result; wherein the deep neural network includes: a granular annotation network, a first feature Network, a second feature network, a first processing network, a second processing network, and a fusion network. The processing includes: using the granularity labeling network to determine the granularity of each word in the training sample; using the first feature network to Perform feature extraction on words of the first granularity in the training sample, and output the obtained third feature information to the first processing network; use the second feature network to feature words of the second granularity in the training sample Extracting, outputting the obtained fourth characteristic information to the second processing network; using the first processing network to perform target processing on the third characteristic information, and outputting the obtained third processing result to the fusion network; Use the second processing network to perform the target processing on the fourth characteristic information, and output the obtained fourth processing result to the fusion network; use the fusion network to fuse the third processing result and the first Four processing results obtain the prediction processing result; the first granularity and the second granularity are different; according to the prediction processing result and the standard result, the loss corresponding to the training sample is determined; the standard result is using the The deep neural network processes the expected processing result of the training sample; using the loss corresponding to the training sample, the parameters of the deep neural network are updated through an optimization algorithm.
本申请实施例中,数据处理设备训练可以独立处理不同粒度的词语的深度神经网络,以便于得到能够避免由较细粒度的信息得到较粗粒度的信息的过程的深度神经网络,实现简单。In the embodiments of the present application, the data processing device trains a deep neural network that can independently process words of different granularities, so as to obtain a deep neural network that can avoid the process of obtaining coarser-grained information from finer-grained information, and is simple to implement.
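A minimal sketch of one training step as described in this aspect, assuming a cross-entropy loss between the prediction processing result and the standard result and a standard gradient-based optimizer; the function and argument names are illustrative only.

```python
import torch.nn.functional as F

def train_step(model, optimizer, sample_ids, target_ids, pad_id=0):
    logits = model(sample_ids)                         # prediction processing result [B, L, vocab]
    loss = F.cross_entropy(                            # loss between prediction and standard result
        logits.reshape(-1, logits.size(-1)),
        target_ids.reshape(-1),
        ignore_index=pad_id)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()                                   # optimization-algorithm parameter update
    return loss.item()
```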
在一个可选的实现方式中,所述第一特征网络和所述第二特征网络的架构不同,和/或,所述第一处理网络和所述第二处理网络的架构不同。In an optional implementation manner, the architecture of the first characteristic network and the second characteristic network are different, and/or the architecture of the first processing network and the second processing network are different.
在一个可选的实现方式中,所述粒度标注网络的输入为所述自然语言文本,所述利用所述粒度标注网络确定所述自然语言文本中各词语的粒度包括:利用所述粒度标注网络按照N种粒度确定所述自然语言文本中每个词语的粒度以得到所述自然语言文本的标注信息,向所述第一特征网络和所述第二特征网络输出所述标注信息;其中,所述标注信息用于描述所述每个词语的粒度或者所述每个词语分别属于所述N种粒度的概率;N为大于1的整数;In an optional implementation manner, the input of the granular annotation network is the natural language text, and the using the granular annotation network to determine the granularity of each word in the natural language text includes: using the granular annotation network Determine the granularity of each word in the natural language text according to N granularities to obtain the annotation information of the natural language text, and output the annotation information to the first feature network and the second feature network; wherein, The label information is used to describe the granularity of each word or the probability that each word belongs to the N granularities; N is an integer greater than 1;
所述利用所述第一特征网络对所述自然语言文本中第一粒度的词语进行特征提取包括:利用所述第一特征网络处理所述第一粒度的词语以得到所述第三特征信息,所述第三特征信息为表示所述第一粒度的词语的向量或矩阵;The using the first feature network to perform feature extraction on the words of the first granularity in the natural language text includes: using the first feature network to process the words of the first granularity to obtain the third feature information, The third feature information is a vector or matrix representing words of the first granularity;
所述利用所述第二特征网络对所述自然语言文本中第二粒度的词语进行特征提取包括:利用所述第二特征网络处理所述第二粒度的词语以得到所述第四特征信息,所述述第四特征信息为表示所述第二粒度的词语的向量或矩阵。The using the second feature network to perform feature extraction on the words of the second granularity in the natural language text includes: using the second feature network to process the words of the second granularity to obtain the fourth feature information, The fourth feature information is a vector or matrix representing words of the second granularity.
在一个可选的实现方式中,所述粒度标注网络包括长短期记忆网络LSTM和双向长短期记忆网络BiLSTM;所述利用所述粒度标注网络确定所述自然语言文本中各词语的粒度包括:In an optional implementation manner, the granular labeling network includes a long and short-term memory network LSTM and a bidirectional long short-term memory network BiLSTM; and the using the granular labeling network to determine the granularity of each word in the natural language text includes:
利用所述粒度标注网络采用如下公式确定所述自然语言文本中各词语的粒度:The granularity labeling network is used to determine the granularity of each word in the natural language text using the following formula:
h_l = BiLSTM([x_l; h_{l-1}, h_{l+1}]);

g_l = LSTM([h_l, z_{l-1}; g_{l-1}]);

z_l = GS(W_g g_l, τ);
wherein BiLSTM() in the formulas represents the processing operation of the BiLSTM, and LSTM() represents the processing operation of the LSTM; x represents a word in the natural language text, and x_l represents the l-th word in the natural language text x; h represents a hidden state variable in the BiLSTM network, and h_l, h_{l-1}, h_{l+1} respectively represent the hidden state variables when the BiLSTM network processes the l-th, (l-1)-th, and (l+1)-th words in the natural language text; g represents a hidden state variable in the LSTM network, and g_l, g_{l-1} respectively represent the hidden state variables when the LSTM network processes the l-th and (l-1)-th words in the natural language text; z represents the probability that a word belongs to the reference granularity, and z_{l-1}, z_l respectively represent the probabilities that the (l-1)-th and l-th words in the natural language text belong to the reference granularity, the reference granularity being any one of the N granularities; GS represents the Gumbel Softmax function, τ is a hyperparameter (temperature) of the Gumbel Softmax function, and W_g is a parameter matrix, that is, a parameter matrix in the granularity labeling network.
在该实现方式中,粒度标注网络使用多层LSTM网络的架构来确定自然语言文本中各词语的粒度,可以充分利用已确定的词语的粒度来确定新的词语(待确定粒度的词语)的粒度,实现简单,处理效率高。In this implementation, the granular annotation network uses the architecture of a multi-layer LSTM network to determine the granularity of each word in the natural language text, and can make full use of the granularity of the determined word to determine the granularity of the new word (word of the granularity to be determined) , Simple implementation and high processing efficiency.
在一个可选的实现方式中,所述利用所述第一特征网络对所述自然语言文本中第一粒度的词语进行特征提取包括:In an optional implementation manner, the using the first feature network to perform feature extraction on words with a first granularity in the natural language text includes:
利用所述第一特征网络采用如下公式对所述自然语言文本中第一粒度的词语进行特征提取:Use the first feature network to use the following formula to perform feature extraction on words of the first granularity in the natural language text:
U_z = ENC_z(X, Z_X);

wherein ENC_z represents the first feature network, which is a Transformer model; ENC_z() represents the processing operation performed by the first feature network; X represents the natural language text; Z_X = [z1, z2, ..., zL] represents the annotation information, and z1 to zL sequentially represent the granularities of the first word to the L-th (last) word in the natural language text; U_z represents the third feature information output by the first feature network.
在一个可选的实现方式中,所述第三处理结果为包含一个或多个词语的序列,所述利用所述第一处理网络对所述第三特征信息做处理包括:利用所述第一处理网络对输入的所述第三特征信息和所述第一处理网络在处理所述第三特征信息的过程中已输出的词语做处理以得到所述第三处理结果。In an optional implementation manner, the third processing result is a sequence containing one or more words, and the processing of the third characteristic information using the first processing network includes: using the first processing network The processing network processes the input third characteristic information and the words that have been output by the first processing network in the process of processing the third characteristic information to obtain the third processing result.
在一个可选的实现方式中,所述融合网络输出的所述目标结果为包含一个或多个词语的序列,所述利用所述融合网络融合所述第三处理结果和所述第四处理结果得到所述目标结果包括:利用所述融合网络处理所述第三处理结果、所述第四处理结果以及所述融合网络在处理所第三处理结果和所述第四处理结果的过程中已输出的词语以确定待输出目标词语,输出所述目标词语。In an optional implementation manner, the target result output by the fusion network is a sequence containing one or more words, and the fusion network is used to fuse the third processing result and the fourth processing result Obtaining the target result includes: using the fusion network to process the third processing result, the fourth processing result, and the fusion network has output in the process of processing the third processing result and the fourth processing result To determine the target words to be output, output the target words.
在一个可选的实现方式中,所述融合网络包括至少一个LSTM网络,所述利用所述融合网络处理所述第三处理结果、所述第四处理结果以及所述融合网络在处理所第三处理结果和所述第四处理结果的过程中已输出的序列以确定待输出目标词语包括:In an optional implementation manner, the converged network includes at least one LSTM network, and the converged network is used to process the third processing result, the fourth processing result, and the third processing result of the converged network. The processing result and the sequence output in the process of the fourth processing result to determine the target word to be output include:
将所述第三处理结果和所述第四处理结果合并得到的向量输入至所述LSTM网络;Input the vector obtained by merging the third processing result and the fourth processing result to the LSTM network;
利用所述LSTM网络采用如下公式计算待输出参考粒度的词语的概率:The LSTM network uses the following formula to calculate the probability of a word with a reference granularity to be output:
h_t = LSTM(h_{t-1}, y_{t-1}, v2, v3);

P(z_t | y_{1:t-1}, X) = GS(W_z h_t, τ);

wherein h_t represents the hidden state variable in the LSTM network when the LSTM network processes the t-th word, h_{t-1} represents the hidden state variable in the LSTM network when the LSTM network processes the (t-1)-th word, LSTM() represents the processing operation performed by the LSTM, the LSTM network has currently output (t-1) words, y_{t-1} represents the (t-1)-th word output by the fusion network, v2 represents the third processing result, v3 represents the fourth processing result, W_z is a parameter matrix in the fusion network, τ is a hyperparameter, P(z_t | y_{1:t-1}, X) is the probability that the word currently to be output is of the reference granularity (granularity z), and t is an integer greater than 1.
利用所述融合网络采用如下公式计算待输出所述目标词语的概率:Use the fusion network to calculate the probability of the target word to be output using the following formula:
P(y_t | y_{1:t-1}, X) = Σ_z P(z_t = z | y_{1:t-1}, X) · P_z(y_t | y_{1:t-1}, X);

wherein P_{z_t}(y_t | y_{1:t-1}, X) represents the probability of outputting the target word y_t at the reference granularity, and P(y_t | y_{1:t-1}, X) represents the probability of outputting the target word.
在一个可选的实现方式中,所述利用所述训练样本对应的损失,通过优化算法更新所述深度神经网络的参数包括:In an optional implementation manner, the using the loss corresponding to the training sample to update the parameters of the deep neural network through an optimization algorithm includes:
The parameters of at least one network included in the deep neural network are updated by using the gradient values of a loss function with respect to the at least one network; the loss function is used to calculate the loss between the prediction processing result and the standard result. During the updating of any one of the first feature network, the second feature network, the first processing network, and the second processing network, the parameters of each of the other three networks remain unchanged.
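A sketch of this alternating update, in which only one of the feature/processing networks is updated while the parameters of the other three are held fixed; the sub-module naming and the freezing mechanism via requires_grad are assumptions, not the patent's prescribed procedure.

```python
def update_one_subnetwork(model, optimizer, batch, loss_fn, active_prefix="feature_nets.0"):
    # Freeze every parameter except those of the sub-network currently being updated,
    # so the other three networks keep their parameters unchanged during this step.
    for name, param in model.named_parameters():
        param.requires_grad_(name.startswith(active_prefix))
    logits = model(batch["input_ids"])            # prediction processing result
    loss = loss_fn(logits, batch["target_ids"])   # loss w.r.t. the standard result
    optimizer.zero_grad(set_to_none=True)
    loss.backward()                               # gradients flow only to the active sub-network
    optimizer.step()
    return loss.item()
```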
第三方面本申请实施例提供了一种数据处理设备,该数据处理设备包括:获取单元,用于获得待处理的自然语言文本;处理单元,用于利用训练得到的深度神经网络对所述自然语言文本做处理;其中,所述深度神经网络包括:粒度标注网络、第一特征网络、第二 特征网络、第一处理网络、第二处理网络以及融合网络,所述处理包括:利用所述粒度标注网络确定所述自然语言文本中各词语的粒度;利用所述第一特征网络对所述自然语言文本中第一粒度的词语进行特征提取,将得到的第一特征信息输出至所述第一处理网络;利用所述第二特征网络对所述自然语言文本中第二粒度的词语进行特征提取,将得到的第二特征信息输出至所述第二处理网络;利用所述第一处理网络对所述第一特征信息做处理,将得到的第一处理结果输出至所述融合网络;利用所述第二处理网络对所述第二特征信息做所述处理,将得到的第二处理结果输出至所述融合网络;利用所述融合网络融合所述第一处理结果和所述第二处理结果得到所述目标结果;所述第一粒度和所述第二粒度不同;输出单元,用于输出处理所述自然语言文本得到的目标结果。In the third aspect, the embodiments of the application provide a data processing device. The data processing device includes: an acquisition unit for obtaining natural language texts to be processed; a processing unit for processing the natural language text obtained by training using a deep neural network; Language and text are processed; wherein the deep neural network includes: a granular annotation network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network, and the processing includes: using the granularity The tagging network determines the granularity of each word in the natural language text; using the first feature network to perform feature extraction on words with the first granularity in the natural language text, and output the obtained first feature information to the first Processing network; using the second feature network to perform feature extraction on words of the second granularity in the natural language text, and output the obtained second feature information to the second processing network; using the first processing network to The first characteristic information is processed, and the obtained first processing result is output to the fusion network; the second processing network is used to perform the processing on the second characteristic information, and the obtained second processing result is output To the fusion network; use the fusion network to fuse the first processing result and the second processing result to obtain the target result; the first granularity and the second granularity are different; an output unit for outputting The target result obtained by processing the natural language text.
本申请实施例中,数据处理设备利用深度神经网络独立处理不同粒度的词语,避免了由较细粒度的信息得到较粗粒度的信息的过程,可以有效提高处理自然处理任务的性能。In the embodiments of the present application, the data processing device uses a deep neural network to independently process words of different granularity, avoiding the process of obtaining coarser-grained information from finer-grained information, and can effectively improve the performance of processing natural processing tasks.
在一个可选的实现方式中,所述第一特征网络和所述第二特征网络的架构不同,和/或,所述第一处理网络和所述第二处理网络的架构不同。In an optional implementation manner, the architecture of the first characteristic network and the second characteristic network are different, and/or the architecture of the first processing network and the second processing network are different.
在一个可选的实现方式中,所述粒度标注网络的输入为所述自然语言文本;所述处理单元,具体用于利用所述粒度标注网络按照N种粒度确定所述自然语言文本中每个词语的粒度以得到所述自然语言文本的标注信息,向所述第一特征网络和所述第二特征网络输出所述标注信息;其中,所述标注信息用于描述所述每个词语的粒度或者所述每个词语分别属于所述N种粒度的概率;N为大于1的整数;In an optional implementation manner, the input of the granular annotation network is the natural language text; the processing unit is specifically configured to use the granular annotation network to determine each of the natural language texts according to N types of granularities. The granularity of words is used to obtain the annotation information of the natural language text, and the annotation information is output to the first feature network and the second feature network; wherein the annotation information is used to describe the granularity of each word Or the probability that each word belongs to the N types of granularities; N is an integer greater than 1;
所述处理单元,具体用于利用所述第一特征网络处理所述第一粒度的词语以得到所述第一特征信息,所述第一特征信息为表示所述第一粒度的词语的向量或矩阵;The processing unit is specifically configured to process the words of the first granularity by using the first characteristic network to obtain the first characteristic information, where the first characteristic information is a vector or word representing the words of the first granularity matrix;
所述处理单元,具体用于利用所述第二特征网络处理所述第二粒度的词语以得到所述第二特征信息,所述第二特征信息为表示所述第二粒度的词语的向量或矩阵。The processing unit is specifically configured to use the second feature network to process the words of the second granularity to obtain the second feature information, where the second feature information is a vector or word representing the words of the second granularity. matrix.
在一个可选的实现方式中,所述粒度标注网络包括长短期记忆网络LSTM和双向长短期记忆网络BiLSTM;所述处理单元,具体用于利用所述粒度标注网络采用如下公式确定所述自然语言文本中各词语的粒度:In an optional implementation, the granular labeling network includes a long short-term memory network LSTM and a bidirectional long short-term memory network BiLSTM; the processing unit is specifically configured to use the granular labeling network to determine the natural language using the following formula The granularity of words in the text:
h_l = BiLSTM([x_l; h_{l-1}, h_{l+1}]);

g_l = LSTM([h_l, z_{l-1}; g_{l-1}]);

z_l = GS(W_g g_l, τ);
wherein BiLSTM() in the formulas represents the processing operation of the BiLSTM, and LSTM() represents the processing operation of the LSTM; x represents a word in the natural language text, and x_l represents the l-th word in the natural language text x; h represents a hidden state variable in the BiLSTM network, and h_l, h_{l-1}, h_{l+1} respectively represent the hidden state variables when the BiLSTM network processes the l-th, (l-1)-th, and (l+1)-th words in the natural language text; g represents a hidden state variable in the LSTM network, and g_l, g_{l-1} respectively represent the hidden state variables when the LSTM network processes the l-th and (l-1)-th words in the natural language text; z represents the probability that a word belongs to the reference granularity, and z_{l-1}, z_l respectively represent the probabilities that the (l-1)-th and l-th words in the natural language text belong to the reference granularity, the reference granularity being any one of the N granularities; GS represents the Gumbel Softmax function, τ is a hyperparameter (temperature) of the Gumbel Softmax function, and W_g is a parameter matrix, that is, a parameter matrix in the granularity labeling network.
在一个可选的实现方式中,所述处理单元,具体用于利用所述第一特征网络采用如下公式对所述自然语言文本中第一粒度的词语进行特征提取:In an optional implementation manner, the processing unit is specifically configured to use the first feature network to use the following formula to perform feature extraction on words of the first granularity in the natural language text:
U_z = ENC_z(X, Z_X);

wherein ENC_z represents the first feature network, which is a Transformer model; ENC_z() represents the processing operation performed by the first feature network; X represents the natural language text; Z_X = [z1, z2, ..., zL] represents the annotation information, and z1 to zL sequentially represent the granularities of the first word to the L-th (last) word in the natural language text; U_z represents the first feature information output by the first feature network.
在一个可选的实现方式中,所述第一处理结果为包含一个或多个词语的序列;所述处理单元,具体用于利用所述第一处理网络对输入的所述第一特征信息和所述第一处理网络在处理所述第一特征信息的过程中已输出的词语做处理以得到所述第一处理结果。In an optional implementation manner, the first processing result is a sequence containing one or more words; the processing unit is specifically configured to use the first processing network to compare the input first feature information and The first processing network processes the output words in the process of processing the first characteristic information to obtain the first processing result.
在一个可选的实现方式中,所述融合网络输出的所述目标结果为包含一个或多个词语的序列;所述处理单元,具体用于利用所述融合网络处理所述第一处理结果、所述第二处理结果以及所述融合网络在处理所第一处理结果和所述第二处理结果的过程中已输出的词语以确定待输出目标词语,输出所述目标词语。In an optional implementation manner, the target result output by the fusion network is a sequence containing one or more words; the processing unit is specifically configured to use the fusion network to process the first processing result, The second processing result and the words that have been output by the fusion network in the process of processing the first processing result and the second processing result to determine the target word to be output, and output the target word.
在一个可选的实现方式中,所述融合网络包括至少一个LSTM网络;In an optional implementation manner, the converged network includes at least one LSTM network;
所述处理单元,具体用于利用将所述第一处理结果和所述第二处理结果合并得到的向量输入至所述LSTM网络;The processing unit is specifically configured to use a vector obtained by combining the first processing result and the second processing result to input to the LSTM network;
所述处理单元,具体用于利用所述LSTM网络采用如下公式计算待输出参考粒度的词语的概率:The processing unit is specifically configured to use the LSTM network to calculate the probability of a word with a reference granularity to be output by using the following formula:
h_t = LSTM(h_{t-1}, y_{t-1}, v0, v1);

P(z_t | y_{1:t-1}, X) = GS(W_z h_t, τ);

wherein h_t represents the hidden state variable in the LSTM network when the LSTM network processes the t-th word, h_{t-1} represents the hidden state variable in the LSTM network when the LSTM network processes the (t-1)-th word, LSTM() represents the processing operation performed by the LSTM, the LSTM network has currently output (t-1) words, y_{t-1} represents the (t-1)-th word output by the fusion network, v0 represents the first processing result, v1 represents the second processing result, W_z is a parameter matrix in the fusion network, τ is a hyperparameter, P(z_t | y_{1:t-1}, X) is the probability that the word currently to be output is of the reference granularity (granularity z), and t is an integer greater than 1.
所述处理单元,具体用于利用所述融合网络采用如下公式计算待输出所述目标词语的概率:The processing unit is specifically configured to use the fusion network to calculate the probability of the target word to be output by using the following formula:
P(y_t | y_{1:t-1}, X) = Σ_z P(z_t = z | y_{1:t-1}, X) · P_z(y_t | y_{1:t-1}, X);

wherein P_{z_t}(y_t | y_{1:t-1}, X) represents the probability of outputting the target word y_t at the reference granularity, and P(y_t | y_{1:t-1}, X) represents the probability of outputting the target word.
第四方面本申请实施例提供了另一种数据处理设备,该数据处理设备包括:处理单元,用于将训练样本输入至深度神经网络做处理,得到预测处理结果;其中,所述深度神经网络包括:粒度标注网络、第一特征网络、第二特征网络、第一处理网络、第二处理网络以及融合网络,所述处理包括:利用所述粒度标注网络确定所述训练样本中各词语的粒度;利用所述第一特征网络对所述训练样本中第一粒度的词语进行特征提取,将得到的第三特征信息输出至所述第一处理网络;利用所述第二特征网络对所述训练样本中第二粒度的词 语进行特征提取,将得到的第四特征信息输出至所述第二处理网络;利用所述第一处理网络对所述第三特征信息做目标处理,将得到的第三处理结果输出至所述融合网络;利用所述第二处理网络对所述第四特征信息做所述目标处理,将得到的第四处理结果输出至所述融合网络;利用所述融合网络融合所述第三处理结果和所述第四处理结果得到所述预测处理结果;所述第一粒度和所述第二粒度不同;所述处理单元,还用于根据所述预测处理结果和标准结果,确定所述训练样本对应的损失;所述标准结果为利用所述深度神经网络处理所述训练样本期望得到的处理结果;利用所述训练样本对应的损失,通过优化算法更新所述深度神经网络的参数。In the fourth aspect, the embodiments of the present application provide another data processing device. The data processing device includes: a processing unit for inputting training samples into a deep neural network for processing to obtain a prediction processing result; wherein, the deep neural network Including: a granular labeling network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network. The processing includes: using the granular labeling network to determine the granularity of each word in the training sample Use the first feature network to perform feature extraction on words of the first granularity in the training sample, and output the obtained third feature information to the first processing network; use the second feature network to perform feature extraction on the training Perform feature extraction on words of the second granularity in the sample, and output the obtained fourth feature information to the second processing network; use the first processing network to perform target processing on the third feature information, and the obtained third The processing result is output to the fusion network; the second processing network is used to perform the target processing on the fourth characteristic information, and the obtained fourth processing result is output to the fusion network; The third processing result and the fourth processing result obtain the predicted processing result; the first granularity and the second granularity are different; the processing unit is further configured to, according to the predicted processing result and the standard result, Determine the loss corresponding to the training sample; the standard result is the processing result expected to be obtained by using the deep neural network to process the training sample; use the loss corresponding to the training sample to update the deep neural network through an optimization algorithm parameter.
本申请实施例中,数据处理设备训练可以独立处理不同粒度的词语的深度神经网络,以便于得到能够避免由较细粒度的信息得到较粗粒度的信息的过程的深度神经网络,实现简单。In the embodiments of the present application, the data processing device trains a deep neural network that can independently process words of different granularities, so as to obtain a deep neural network that can avoid the process of obtaining coarser-grained information from finer-grained information, and is simple to implement.
在一个可选的实现方式中,所述第一特征网络和所述第二特征网络架构不同,和/或,所述第一处理网络和所述第二处理网络架构不同。In an optional implementation manner, the first characteristic network and the second characteristic network have different architectures, and/or the first processing network and the second processing network have different architectures.
在一个可选的实现方式中,所述粒度标注网络的输入为所述自然语言文本;所述处理单元,具体用于利用所述粒度标注网络按照N种粒度确定所述自然语言文本中每个词语的粒度以得到所述自然语言文本的标注信息,向所述第一特征网络和所述第二特征网络输出所述标注信息;其中,所述标注信息用于描述所述每个词语的粒度或者所述每个词语分别属于所述N种粒度的概率;N为大于1的整数;In an optional implementation manner, the input of the granular annotation network is the natural language text; the processing unit is specifically configured to use the granular annotation network to determine each of the natural language texts according to N types of granularities. The granularity of words is used to obtain the annotation information of the natural language text, and the annotation information is output to the first feature network and the second feature network; wherein the annotation information is used to describe the granularity of each word Or the probability that each word belongs to the N types of granularities; N is an integer greater than 1;
所述处理单元,具体用于利用所述第一特征网络处理所述第一粒度的词语以得到所述第三特征信息,所述第三特征信息为表示所述第一粒度的词语的向量或矩阵;The processing unit is specifically configured to process the words of the first granularity by using the first characteristic network to obtain the third characteristic information, where the third characteristic information is a vector or word representing the words of the first granularity matrix;
所述处理单元,具体用于利用所述第二特征网络处理所述第二粒度的词语以得到所述第四特征信息,所述述第四特征信息为表示所述第二粒度的词语的向量或矩阵。The processing unit is specifically configured to process the words of the second granularity by using the second characteristic network to obtain the fourth characteristic information, where the fourth characteristic information is a vector representing the words of the second granularity Or matrix.
在一个可选的实现方式中,所述粒度标注网络包括长短期记忆网络LSTM和双向长短期记忆网络BiLSTM;所述处理单元,具体用于利用所述粒度标注网络采用如下公式确定所述自然语言文本中各词语的粒度:In an optional implementation, the granular labeling network includes a long short-term memory network LSTM and a bidirectional long short-term memory network BiLSTM; the processing unit is specifically configured to use the granular labeling network to determine the natural language using the following formula The granularity of words in the text:
h_l = BiLSTM([x_l; h_{l-1}, h_{l+1}]);

g_l = LSTM([h_l, z_{l-1}; g_{l-1}]);

z_l = GS(W_g g_l, τ);
wherein BiLSTM() in the formulas represents the processing operation of the BiLSTM, and LSTM() represents the processing operation of the LSTM; x represents a word in the natural language text, and x_l represents the l-th word in the natural language text x; h represents a hidden state variable in the BiLSTM network, and h_l, h_{l-1}, h_{l+1} respectively represent the hidden state variables when the BiLSTM network processes the l-th, (l-1)-th, and (l+1)-th words in the natural language text; g represents a hidden state variable in the LSTM network, and g_l, g_{l-1} respectively represent the hidden state variables when the LSTM network processes the l-th and (l-1)-th words in the natural language text; z represents the probability that a word belongs to the reference granularity, and z_{l-1}, z_l respectively represent the probabilities that the (l-1)-th and l-th words in the natural language text belong to the reference granularity, the reference granularity being any one of the N granularities; GS represents the Gumbel Softmax function, τ is a hyperparameter (temperature) of the Gumbel Softmax function, and W_g is a parameter matrix, that is, a parameter matrix in the granularity labeling network.
In an optional implementation, the processing unit is specifically configured to use the first feature network to perform feature extraction on the words of the first granularity in the natural language text with the following formula:
U_z = ENC_z(X, Z_X);
Here ENC_z denotes the first feature network, which is a Transformer model, and ENC_z() denotes the processing operation performed by the first feature network; X denotes the natural language text; Z_X = [z1, z2, ..., zL] denotes the annotation information, where z1 to zL denote the granularities of the first to the L-th (last) word of the natural language text; and U_z denotes the third feature information output by the first feature network.
In an optional implementation, the first processing result is a sequence containing one or more words; the processing unit is specifically configured to use the first processing network to process the input third feature information together with the words the first processing network has already output while processing the third feature information, so as to obtain the third processing result.
In an optional implementation, the target result output by the fusion network is a sequence containing one or more words; the processing unit is specifically configured to use the fusion network to process the third processing result, the fourth processing result, and the words the fusion network has already output while processing the third and fourth processing results, so as to determine the target word to be output, and to output the target word.
In an optional implementation, the fusion network includes at least one LSTM network; the processing unit is specifically configured to input a vector obtained by combining the third processing result and the fourth processing result into the LSTM network;
and to use the LSTM network to calculate, with the following formulas, the probability that the word to be output is of the reference granularity:
h_t = LSTM(h_{t-1}, y_{t-1}, v2, v3);
P(z_t | y_{1:t-1}, X) = GS(W_z h_t, τ);
Here h_t denotes the hidden state variable of the LSTM network when it processes the t-th word, h_{t-1} denotes the hidden state variable of the LSTM network when it processes the (t-1)-th word, and LSTM() denotes the processing operation performed by the LSTM. The LSTM network has currently output (t-1) words, and y_{t-1} denotes the (t-1)-th word output by the fusion network. v2 denotes the third processing result, v3 denotes the fourth processing result, W_z is a parameter matrix of the fusion network, τ is a hyperparameter, P(z_t | y_{1:t-1}, X) is the probability that the word currently to be output is of the reference granularity (granularity z), and t is an integer greater than 1.
利用所述融合网络采用如下公式计算待输出所述目标词语的概率:Use the fusion network to calculate the probability of the target word to be output using the following formula:
P(y_t | y_{1:t-1}, X) = Σ_{z_t} P(z_t | y_{1:t-1}, X) · P_{z_t}(y_t | y_{1:t-1}, X);
Here P_{z_t}(y_t | y_{1:t-1}, X) denotes the probability of outputting the target word y_t at the reference granularity, and P(y_t | y_{1:t-1}, X) denotes the probability of outputting the target word.
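Read together, the formulas above amount to the fusion network's LSTM state selecting a granularity at each step and the final word distribution being a weighted combination of the per-granularity distributions. The following is a minimal sketch of one such decoding step, assuming PyTorch, assuming the two processing networks expose per-step vectors v2 and v3 together with per-granularity word distributions, and assuming the fusion marginalises over granularities as the surrounding definitions suggest; all names and interfaces are illustrative, not the claimed implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionStep(nn.Module):
    """One decoding step of a fusion network: an LSTM cell consumes the previous
    output word and the per-granularity results, and Gumbel-Softmax yields the
    granularity probabilities P(z_t | y_{1:t-1}, X)."""
    def __init__(self, emb_dim, v_dim, hidden_dim, num_granularities=2, tau=1.0):
        super().__init__()
        self.cell = nn.LSTMCell(emb_dim + 2 * v_dim, hidden_dim)
        self.w_z = nn.Linear(hidden_dim, num_granularities, bias=False)
        self.tau = tau

    def forward(self, y_prev_emb, v2, v3, state):
        # h_t = LSTM(h_{t-1}, y_{t-1}, v2, v3)
        h, c = self.cell(torch.cat([y_prev_emb, v2, v3], dim=-1), state)
        # P(z_t | y_{1:t-1}, X) = GS(W_z h_t, τ)
        p_z = F.gumbel_softmax(self.w_z(h), tau=self.tau)
        return p_z, (h, c)

def mix_word_distributions(p_z, p_words):
    """P(y_t|·) = sum over z of P(z_t=z|·) * P_z(y_t|·); p_z has shape
    (batch, num_granularities), p_words has shape (num_granularities, batch, vocab)."""
    return torch.einsum('bz,zbv->bv', p_z, p_words)
```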
In an optional implementation, the processing unit is specifically configured to update the parameters of at least one network included in the deep neural network by using the gradient values of a loss function with respect to the at least one network; the loss function is used to calculate the loss between the predicted processing result and the standard result; while any one of the first feature network, the second feature network, the first processing network, and the second processing network is being updated, the parameters of the other three networks remain unchanged.
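A minimal sketch of this training constraint follows, assuming PyTorch and an assumed `model` object whose sub-networks are exposed as attributes; the simplest way to keep the other three networks unchanged while one is updated is to give the optimizer only the parameters of the network being trained.

```python
import torch

def train_one_subnet(model, train_net, batches, loss_fn, lr=1e-4):
    """Update only train_net's parameters; the other sub-networks stay fixed
    because the optimizer never sees their parameters."""
    opt = torch.optim.Adam(train_net.parameters(), lr=lr)
    for text, target in batches:
        pred = model(text)            # full forward pass through all sub-networks
        loss = loss_fn(pred, target)  # loss between prediction and reference result
        opt.zero_grad()
        loss.backward()               # gradients are computed everywhere, but ...
        opt.step()                    # ... only train_net's parameters are changed
    return model
```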
In a fifth aspect, an embodiment of this application provides yet another data processing device, including a processor, a memory, an input device, and an output device. The memory is configured to store code; the processor executes the method provided in the first aspect or the second aspect by reading the code stored in the memory; the input device is configured to obtain the natural language text to be processed; and the output device is configured to output the target result obtained by the processor processing the natural language text.
In a sixth aspect, an embodiment of this application provides a computer program product. The computer program product includes program instructions that, when executed by a processor, cause the processor to perform the method of the first aspect or the second aspect.
In a seventh aspect, an embodiment of this application provides a computer-readable storage medium. The computer storage medium stores a computer program; the computer program includes program instructions that, when executed by a processor, cause the processor to perform the method of the first aspect or the second aspect.
附图说明BRIEF DESCRIPTION
图1A至图1C为自然语言处理系统的应用场景;Figures 1A to 1C are application scenarios of natural language processing systems;
图2为本申请实施例提供的一种自然语言处理方法流程图;Fig. 2 is a flowchart of a natural language processing method provided by an embodiment of the application;
图3为本申请实施例提供的一种深度神经网络的结构示意图;FIG. 3 is a schematic structural diagram of a deep neural network provided by an embodiment of this application;
图4为本申请实施例提供的一种粒度标注网络301的结构示意图;FIG. 4 is a schematic structural diagram of a granular labeling network 301 provided by an embodiment of this application;
图5为本申请实施例提供的一种特征网络的结构示意图;FIG. 5 is a schematic structural diagram of a feature network provided by an embodiment of this application;
图6为本申请实施例提供的一种深度神经网络的结构示意图;FIG. 6 is a schematic structural diagram of a deep neural network provided by an embodiment of this application;
图7为本申请实施例提供的一种训练方法流程图;FIG. 7 is a flowchart of a training method provided by an embodiment of the application;
图8为本申请实施例提供的一种数据处理设备的结构示意图;FIG. 8 is a schematic structural diagram of a data processing device provided by an embodiment of this application;
图9为本申请实施例提供的一种神经网络处理器的结构示意图;FIG. 9 is a schematic structural diagram of a neural network processor provided by an embodiment of this application;
图10为本申请实施例提供的一种智能终端的部分结构的框图;FIG. 10 is a block diagram of a partial structure of an intelligent terminal provided by an embodiment of the application;
图11为本申请实施例提供的另一种数据处理设备的部分结构的框图。FIG. 11 is a block diagram of a part of the structure of another data processing device provided by an embodiment of the application.
具体实施方式detailed description
为了使本技术领域的人员更好地理解本申请实施例方案,下面将结合本申请实施例中的附图,对本申请实施例中的技术方案进行清楚地描述,显然,所描述的实施例仅仅是本申请一部分的实施例,而不是全部的实施例。In order to enable those skilled in the art to better understand the solutions of the embodiments of the present application, the technical solutions in the embodiments of the present application will be clearly described below in conjunction with the drawings in the embodiments of the present application. Obviously, the described embodiments are only These are a part of the embodiments of this application, not all of the embodiments.
本申请的说明书实施例和权利要求书及上述附图中的术语“第一”、“第二”、和“第三”等是用于区别类似的对象,而不必用于描述特定的顺序或先后次序。此外,术语“包括”和“具有”以及他们的任何变形,意图在于覆盖不排他的包含,例如,包含了一系列步骤或单元。方法、系统、产品或设备不必限于清楚地列出的那些步骤或单元,而是可包括没有清楚地列出的或对于这些过程、方法、产品或设备固有的其它步骤或单元。“和/或”用于表示在其所连接的两个对象之间选择一个或全部。例如“A和/或B”表示A、B或A+B。The terms "first", "second", and "third" in the specification embodiments and claims of this application and the above-mentioned drawings are used to distinguish similar objects, and are not necessarily used to describe a specific order or Priority. In addition, the terms "including" and "having" and any variations of them are intended to cover non-exclusive inclusion, for example, a series of steps or units are included. The method, system, product, or device is not necessarily limited to those clearly listed steps or units, but may include other steps or units that are not clearly listed or are inherent to these processes, methods, products, or devices. "And/or" is used to indicate one or all of the two connected objects. For example, "A and/or B" means A, B, or A+B.
At present, network models used for natural language processing tasks, such as the typical Google Neural Machine Translation (GNMT) model and the Transformer, do not separate the operations performed on words of different granularities in a natural language text. In other words, in the solutions currently adopted, the operations performed on words of different granularities are not decoupled. When a deep neural network is used to process a natural language processing task, coarser-grained features are usually formed by combining finer-grained features through pooling operations. For example, word-level and phrase-level features are combined through a pooling operation to form sentence-level features. It can be understood that if the finer-grained features are wrong, the coarser-grained features derived from them will also be wrong. This causes difficulties in understanding and applying deep neural networks to natural language processing tasks; for example, when an error occurs, it is impossible to locate which granularity's operation went wrong. An operation on words of a certain granularity can be understood as an operation of that granularity; for example, an operation on phrase-level words is a phrase-level operation, and an operation on sentence-level words is a sentence-level operation. The main principle of the solution of this application is to use mutually decoupled networks to process words of different granularities to obtain processing results for each granularity, and then to fuse these processing results to obtain the final result. In other words, the multiple networks that process words of different granularities are decoupled from one another; two networks being decoupled can be understood as the processing done by the two networks not affecting each other. Because the deep neural network used in this application has this decoupling capability, processing natural language processing tasks with the solution of this application has at least the following benefits:
Interpretability: when the deep neural network produces a wrong result for a natural language text, it is possible to locate exactly which granularity's operation went wrong, which facilitates subsequent analysis and correction.
Controllability: in the solution of this application, because the networks that process words of different granularities are decoupled, the sub-networks of the deep neural network that implement the operations of each granularity can be analyzed or adjusted separately. The deep neural network used in this application includes multiple mutually decoupled sub-networks for processing words of different granularities, and these sub-networks can be optimized in a targeted manner to ensure that the operations at each granularity are controllable.
Reusability and transferability: operations at different granularities have different reusability and transferability characteristics. Generally, in machine translation or sentence rewriting, sentence-level operations (translation or transformation of sentence patterns) are easier to reuse or transfer to other domains, while phrase-level or word-level operations are more domain-specific. In the solution of this application, because the deep neural network includes multiple independent sub-networks for processing words of different granularities, some of the sub-networks trained on samples from one domain can be applied to other domains.
下面介绍本申请方案可以应用的场景。The following describes the scenarios in which this application solution can be applied.
如图1A所示,一种自然语言处理系统包括用户设备以及数据处理设备。As shown in FIG. 1A, a natural language processing system includes user equipment and data processing equipment.
所述用户设备可以是手机、个人电脑、平板电脑、可穿戴设备、个人数字助理、游戏机、信息处理中心等智能终端。所述用户设备为自然语言数据处理的发起端,作为自然语言处理任务(例如翻译任务、复述任务等)的发起方,通常用户通过所述用户设备发起自然语言处理任务。复述任务是将一个自然语言文本转换为另一个与该自然语言文本意思相同但表达不同的文本的任务。例如,“What makes the second world war happen”可以复述为“What is the reason of world war II”。The user equipment may be a mobile phone, a personal computer, a tablet computer, a wearable device, a personal digital assistant, a game console, an information processing center, and other smart terminals. The user equipment is the initiator of natural language data processing, and serves as the initiator of natural language processing tasks (for example, translation tasks, paraphrase tasks, etc.). Generally, users initiate natural language processing tasks through the user equipment. The paraphrase task is the task of transforming a natural language text into another text with the same meaning but different expressions as the natural language text. For example, "What makes the second world war happen" can be repeated as "What is the reason of world war II".
所述数据处理设备可以是云服务器、网络服务器、应用服务器以及管理服务器等具有数据处理功能的设备或服务器。所述数据处理设备通过交互接口接收来自所述智能终端的查询语句/语音/文本等问句,再通过存储数据的存储器以及执行数据处理的处理器进行机器学习,深度学习,搜索,推理,决策等方式的语言数据处理。所述存储器可以是一个统称,包括本地存储以及存储历史数据的数据库,所述数据库可以在数据处理设备上,也可以在其它网络服务器上。The data processing device may be a device or server with data processing functions such as a cloud server, a network server, an application server, and a management server. The data processing device receives query sentences/voice/text questions from the smart terminal through an interactive interface, and then performs machine learning, deep learning, search, reasoning, and decision-making through a memory that stores data and a processor that performs data processing. Language data processing in other ways. The storage may be a general term including a database for local storage and storing historical data. The database may be on a data processing device or on other network servers.
如图1B所示为自然语言处理系统的另一个应用场景。此场景中智能终端直接作为数据处理设备,直接接收来自用户的输入并直接由智能终端本身的硬件进行处理,具体过程与图1A相似,可参考上面的描述,在此不再赘述。Figure 1B shows another application scenario of the natural language processing system. In this scenario, the smart terminal is directly used as a data processing device, directly receiving input from the user and directly processed by the hardware of the smart terminal itself. The specific process is similar to that of FIG. 1A, and the above description can be referred to, which will not be repeated here.
如图1C所示,所述用户设备可以是本地设备101或102,所述数据处理设备可以是执行设备210,其中数据存储系统250可以集成在所述执行设备210上,也可以设置在云上或其它网络服务器上。As shown in FIG. 1C, the user equipment may be a local device 101 or 102, the data processing device may be an execution device 210, and the data storage system 250 may be integrated on the execution device 210 or set on the cloud Or on other network servers.
本申请方案可以应用到多种场景,下面介绍利用数据处理设备如何执行自然语言处理任务。图2为本申请实施例提供的一种自然语言处理方法流程图,如图2所示,该方法可包括:The solution of this application can be applied to a variety of scenarios. The following describes how to perform natural language processing tasks using data processing equipment. FIG. 2 is a flowchart of a natural language processing method provided by an embodiment of the application. As shown in FIG. 2, the method may include:
201、获得待处理的自然语言文本。201. Obtain natural language text to be processed.
该待处理的自然语言文本可以是数据处理设备当前待处理的一个句子。该数据处理设备可以逐句对接收到的自然语言文本或者识别语音得到的自然语言文本做处理。The natural language text to be processed may be a sentence currently to be processed by the data processing device. The data processing device can process the received natural language text or the natural language text obtained by recognizing voice sentence by sentence.
在图1A和图1C场景中,获得待处理的自然语言文本可以是数据处理设备接收用户设备发送的语音或文本等数据,根据接收到的语音或文本等数据获得待处理的自然语言文本。举例来说,数据处理设备接收到用户设备发送的2个句子,该数据处理设备获取第1个句子(待处理的自然语言文本),利用训练得到的深度神经网络对该第1个句子做处理,输出处理该第1个句子得到结果;获取第2个句子(待处理的自然语言文本),利用训练得到的深度神经网络对该第2个句子做处理,输出处理该第2个句子得到结果。In the scenarios in FIG. 1A and FIG. 1C, obtaining the natural language text to be processed may be that the data processing device receives data such as voice or text sent by the user equipment, and obtains the natural language text to be processed according to the received voice or text data. For example, the data processing device receives 2 sentences sent by the user device, the data processing device obtains the first sentence (natural language text to be processed), and uses the trained deep neural network to process the first sentence , Output and process the first sentence to get the result; get the second sentence (natural language text to be processed), use the trained deep neural network to process the second sentence, and output and process the second sentence to get the result .
在图1B场景中,获得待处理的自然语言文本可以是智能终端直接接收用户输入的语音或文本等数据,根据接收到的语音或文本等数据获得待处理的自然语言文本。举例来说,智能终端接收到用户输入的2个句子,该智能终端获取第1个句子(待处理的自然语言文本),利用训练得到的深度神经网络对该第1个句子做处理,输出处理该第1个句子得到结果;获取第2个句子(待处理的自然语言文本),利用训练得到的深度神经网络对该第2个句子做处理,输出处理该第2个句子得到结果。In the scenario in FIG. 1B, obtaining the natural language text to be processed may be that the smart terminal directly receives data such as voice or text input by the user, and obtains the natural language text to be processed according to the received voice or text data. For example, the smart terminal receives 2 sentences input by the user, the smart terminal obtains the first sentence (natural language text to be processed), uses the trained deep neural network to process the first sentence, and outputs the processing The first sentence is the result; the second sentence (natural language text to be processed) is obtained, the second sentence is processed by the deep neural network obtained by training, and the second sentence is output and processed to obtain the result.
202、利用训练得到的深度神经网络对该自然语言文本做处理,输出处理该自然语言文本得到的目标结果。202. Use the deep neural network obtained by training to process the natural language text, and output a target result obtained by processing the natural language text.
该深度神经网络可以包括:粒度标注网络、第一特征网络、第二特征网络、第一处理网络、第二处理网络以及融合网络,数据处理设备利用该深度神经网络对该自然语言文本所做的处理可以包括:利用该粒度标注网络确定该自然语言文本中各词语的粒度;利用该第一特征网络对该自然语言文本中第一粒度的词语进行特征提取,将得到的第一特征信息输出至该第一处理网络;利用该第二特征网络对该自然语言文本中第二粒度的词语进行特征提取,将得到的第二特征信息输出至该第二处理网络;利用该第一处理网络对该第一特征信息做目标处理,将得到的第一处理结果输出至该融合网络;利用该第二处理网络对该第二特征信息做该目标处理,将得到的第二处理结果输出至该融合网络;利用该融合网络融合该第一处理结果和该第二处理结果得到该目标结果;该第一粒度和该第二粒度不同。该第一粒度和该第二粒度可以为字符级、词语级、短语级、句子级中任意两种不同的粒度。本申请中,一个词语的粒度是指该词语在自然语言文本(句子)中所属的粒度。该目标处理可以是翻译、复述、摘要生成等。该目标结果为处理该自然语言文本得到的另一个自然语言文本。例如,目标结果为翻译该自然语言文本得到的一个自然语言文本。又例如,目标结果为复述该自然语言文本得到的另一个自然语言文本。待处理的自然语言文本可以视为输入序列,数据处理设备处理该自然语言文本得到的目标结果(另一个自然语言文本)可以视为生成序列。The deep neural network may include: a granular annotation network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network. The data processing device uses the deep neural network to do the natural language text The processing may include: using the granular annotation network to determine the granularity of each word in the natural language text; using the first feature network to perform feature extraction on the first granular word in the natural language text, and output the obtained first feature information to The first processing network; using the second feature network to perform feature extraction on words of the second granularity in the natural language text, and output the obtained second feature information to the second processing network; using the first processing network to Perform target processing on the first characteristic information, and output the obtained first processing result to the fusion network; use the second processing network to perform the target processing on the second characteristic information, and output the obtained second processing result to the fusion network Use the fusion network to fuse the first processing result and the second processing result to obtain the target result; the first granularity and the second granularity are different. The first granularity and the second granularity may be any two different granularities among character level, word level, phrase level, and sentence level. In this application, the granularity of a word refers to the granularity of the word in the natural language text (sentence). The target processing can be translation, retelling, abstract generation, etc. The target result is another natural language text obtained by processing the natural language text. For example, the target result is a natural language text obtained by translating the natural language text. For another example, the target result is another natural language text obtained by retelling the natural language text. The natural language text to be processed can be regarded as an input sequence, and the target result (another natural language text) obtained by the data processing device processing the natural language text can be regarded as a generated sequence.
该深度神经网络可以包括N个特征网络以及N个处理网络,该N个特征网络以及该N个处理网络一一对应,N为大于1的整数。一对相对应的特征网络和处理网络用于处理同 一粒度的词语。例如,第一特征网络对自然语言文本中第一粒度的词语进行特征提取以得到第一特征信息,第一处理网络对该第一特征信息做目标处理。可以理解,该深度神经网络除了包括第一特征网络和第二特征网络之外,还可以包括用于对其他粒度(除第一粒度和第二粒度之外的粒度)的词语进行特征提取的特征网络;该深度神经网络除了包括第一处理网络和第二处理网络之外,还可以包括用于对其他粒度(除第一粒度和第二粒度之外的粒度)的词语的特征信息做目标处理的处理网络。本申请中,不对深度神经网络包括的特征网络的个数和处理网络的个数作限定。若自然语言文本中的词语被划为N种粒度,则该深度神经网络包括N个特征网络以及N个处理网络。也就是说,如果按照N种粒度划为自然语言文本中的各词语,则深度神经网络包括N个特征网络和N个特征网络。例如,自然语言文本中的词语划分为短语级词语和句子级词语,则深度神经网络包括两个特征网络,一个特征网络用于对短语级的词语进行特征提取以得到短语级的词语的特征信息,另一个特征网络用于对句子级的词语进行特征提取以得到句子级的词语的特征信息;该深度神经网络包括两个处理网络,一个处理网络用于对短语级的词语的特征信息做目标处理,另一个处理网络用于对句子级的词语的特征信息做目标处理。在该深度神经网络包括N个特征网络以及N个处理网络的情况下,该N个特征网络输出N个特征信息,该N个处理网络输出N个处理结果,该融合网络用于融合该N个处理结果得到最终的输出结果。也就是说,该融合网络并不限于融合两个处理结果。The deep neural network may include N feature networks and N processing networks. The N feature networks and the N processing networks have a one-to-one correspondence, and N is an integer greater than one. A pair of corresponding feature network and processing network are used to process words of the same granularity. For example, the first feature network performs feature extraction on words of the first granularity in the natural language text to obtain first feature information, and the first processing network performs target processing on the first feature information. It can be understood that, in addition to the first feature network and the second feature network, the deep neural network may also include features for feature extraction of words of other granularities (granularities other than the first granularity and the second granularity). Network; In addition to the first processing network and the second processing network, the deep neural network can also include target processing for the feature information of words of other granularities (granularities other than the first granularity and the second granularity) Processing network. In this application, the number of feature networks included in the deep neural network and the number of processing networks are not limited. If the words in the natural language text are classified into N granularities, the deep neural network includes N feature networks and N processing networks. That is to say, if the words in the natural language text are classified according to N granularities, the deep neural network includes N feature networks and N feature networks. For example, the words in natural language text are divided into phrase-level words and sentence-level words, then the deep neural network includes two feature networks, one feature network is used to extract the feature of phrase-level words to obtain the feature information of phrase-level words Another feature network is used to extract feature information of sentence-level words to obtain feature information of sentence-level words; the deep neural network includes two processing networks, one processing network is used to target the feature information of phrase-level words Processing, another processing network is used to target the feature information of sentence-level words. In the case that the deep neural network includes N feature networks and N processing networks, the N feature networks output N feature information, the N processing networks output N processing results, and the fusion network is used to fuse the N The processing result is the final output result. In other words, the fusion network is not limited to fusing two processing results.
该N个特征网络中任意两个特征网络对自然语言文本中不同粒度的词语进行特征提取;该N个处理网络中任意两个处理网络对不同粒度的词语的特征信息做目标处理。可选的,该N个特征网络中任意两个特征网络不共享参数;该N个处理网络中任意两个处理网络不共享参数。该目标处理可以是翻译、复述、摘要生成等。该第一特征网络和该第二特征网络的参数不同,且采用的架构相同或不同。例如,第一特征网络采用深度神经网络架构,第二特征网络采用Transformer架构。该第一处理网络和该第二处理网络的参数不同,且采用的架构相同或不同。例如,第一处理网络采用深度神经网络架构,第二处理网络采用Transformer架构。可以理解,该深度神经网络包括的多个特征网络采用的架构可以不同,该深度神经网络包括的多个处理网络采用的架构也可以不同。Any two of the N feature networks perform feature extraction on words with different granularities in natural language text; any two of the N processing networks perform target processing on the feature information of words with different granularities. Optionally, any two characteristic networks of the N characteristic networks do not share parameters; any two of the N processing networks do not share parameters. The target processing can be translation, retelling, abstract generation, etc. The parameters of the first feature network and the second feature network are different, and the architectures adopted are the same or different. For example, the first feature network uses a deep neural network architecture, and the second feature network uses a Transformer architecture. The first processing network and the second processing network have different parameters and adopt the same or different architectures. For example, the first processing network uses a deep neural network architecture, and the second processing network uses a Transformer architecture. It can be understood that the multiple feature networks included in the deep neural network may adopt different architectures, and the multiple processing networks included in the deep neural network may adopt different architectures.
本申请实施例中,数据处理设备利用深度神经网络中相互解耦的网络分别处理不同粒度的词语,可以有效提高处理自然处理任务的性能。In the embodiment of the present application, the data processing device uses the mutually decoupled network in the deep neural network to process words of different granularity respectively, which can effectively improve the performance of processing natural processing tasks.
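The processing described in step 202 can be summarised by the following minimal sketch, assuming PyTorch-style modules; the sub-module names and call signatures are illustrative placeholders rather than the patented implementation.

```python
import torch.nn as nn

class MultiGranularityNLP(nn.Module):
    """Decoupled pipeline: a granularity annotation network, one feature network
    and one processing network per granularity, and a fusion network."""
    def __init__(self, tagger, feature_nets, processing_nets, fusion):
        super().__init__()
        self.tagger = tagger                              # granularity annotation network
        self.feature_nets = nn.ModuleList(feature_nets)   # one feature network per granularity
        self.processing_nets = nn.ModuleList(processing_nets)
        self.fusion = fusion

    def forward(self, text):
        z = self.tagger(text)                             # per-word granularity labels/probabilities
        results = []
        for feat, proc in zip(self.feature_nets, self.processing_nets):
            u = feat(text, z)                             # features of this granularity only
            results.append(proc(u))                       # e.g. translation at this granularity
        return self.fusion(results)                       # merge into the final output sequence
```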
下面结合本申请采用的深度神经网络的结构来描述如何对自然语言文本做处理的流程。图3为本申请实施例提供的一种深度神经网络的结构示意图,该深度神经网络可以包括N个特征网络和N个处理网络,为方便理解图中仅示出2个特征网络(第一特征网络和第二特征网络)和2个处理网络(第一处理网络和第二处理网络)。如图3所示,301为粒度标注网络,302为第一特征网络,303为第二特征网络,304为第一处理网络,305为第二处理网络,306为融合网络。数据处理设备利用图3中的深度神经网络对自然语言文本的处理流程如下:The following describes how to process natural language text in conjunction with the structure of the deep neural network used in this application. Figure 3 is a schematic structural diagram of a deep neural network provided by an embodiment of the application. The deep neural network may include N feature networks and N processing networks. To facilitate understanding, only two feature networks (the first feature Network and second characteristic network) and 2 processing networks (first processing network and second processing network). As shown in Figure 3, 301 is a granular annotation network, 302 is a first feature network, 303 is a second feature network, 304 is a first processing network, 305 is a second processing network, and 306 is a converged network. The data processing equipment uses the deep neural network in Figure 3 to process natural language text as follows:
311、粒度标注网络301按照N种粒度确定自然语言文本中每个词语的粒度以得到该自然语言文本的标注信息,向第一特征网络302和第二特征网络303输出该标注信息。311. The granularity labeling network 301 determines the granularity of each word in the natural language text according to N types of granularities to obtain the labeling information of the natural language text, and outputs the labeling information to the first feature network 302 and the second feature network 303.
粒度标注网络301的输入为待处理的自然语言文本;输出可以为标注信息,也可以为 标注信息以及该自然语言文本。第一特征网络302的输入和第二特征网络303的输入均为该标注信息以及该自然语言文本。该标注信息用于描述自然语言文本中每个词语的粒度或者该自然语言文本中每个词语分别属于该N种粒度的概率;N为大于1的整数。The input of the granular annotation network 301 is the natural language text to be processed; the output may be annotation information, or annotation information and the natural language text. The input of the first feature network 302 and the input of the second feature network 303 are both the annotation information and the natural language text. The annotation information is used to describe the granularity of each word in the natural language text or the probability that each word in the natural language text belongs to the N types of granularities; N is an integer greater than 1.
粒度标注网络301对输入的自然语言文本(输入序列)中的每个词(假设以词为基本处理单位)所属的粒度进行标注,即确定该自然语言文本中每个词的标注。假设我们考虑两种粒度:短语级粒度和句子级粒度,输入的自然语言文本(语句)中的每个词的粒度都被确定为这两种粒度中的一种。举例来说,粒度标注网络301确定输入的自然语言文本“what makes the second world war happen”中每个词语的粒度,其中,“what”、“makes”、“happen”等词被确定为句子级粒度,“the”、“second”、“world”、“war”等词被确定为短语级粒度。值得注意的是,对于待处理的自然语言文本中各词语所属的粒度,并没有标注数据(lable),而是由粒度标注网络301确定其输入的自然语言文本中各词语的粒度。The granularity labeling network 301 labels the granularity to which each word (assuming the word is the basic processing unit) in the input natural language text (input sequence), that is, determines the label of each word in the natural language text. Assuming that we consider two granularities: phrase-level granularity and sentence-level granularity, the granularity of each word in the input natural language text (sentence) is determined to be one of these two granularities. For example, the granularity annotation network 301 determines the granularity of each word in the input natural language text "what makes the second world war happen", where words such as "what", "makes", and "happen" are determined to be sentence-level Granularity, words such as "the", "second", "world", and "war" are determined as phrase-level granularity. It is worth noting that the granularity of each word in the natural language text to be processed is not labeled with data (label), but the granularity annotation network 301 determines the granularity of each word in the input natural language text.
312、第一特征网络302利用输入的自然语言文本和标注信息进行特征提取,将得到的第一特征信息输出至第一处理网络304。312. The first feature network 302 uses the input natural language text and annotation information to perform feature extraction, and outputs the obtained first feature information to the first processing network 304.
该第一特征信息为表示第一粒度的词语的向量或矩阵。第一特征网络302的输入为自然语言文本和标注信息,可以对该自然语言文本中第一粒度的词语进行特征提取,并得到自然语言文本中第一粒度的词语的向量或矩阵表示,即该第一特征信息。The first feature information is a vector or matrix representing words of the first granularity. The input of the first feature network 302 is natural language text and tagging information. The natural language text can be feature-extracted from the first-granularity words, and the vector or matrix representation of the first-granularity words in the natural language text can be obtained, that is, the The first feature information.
313、第二特征网络303利用输入的自然语言文本和标注信息进行特征提取,将得到的第二特征信息输出至第二处理网络305。313. The second feature network 303 uses the input natural language text and annotation information to perform feature extraction, and outputs the obtained second feature information to the second processing network 305.
该第二特征信息为表示第二粒度的词语的向量或矩阵。第二特征网络303的输入为自然语言文本和标注信息,可以对该自然语言文本中第二粒度的词语进行特征提取,并得到自然语言文本中第二粒度的词语的向量或矩阵表示,即该第二特征信息。本申请实施例不对数据处理设备执行步骤313和步骤312的顺序做限定,步骤313和步骤312可以同时执行,也可以先执行步骤312再执行步骤313,还可以先执行步骤313再执行步骤312。The second feature information is a vector or matrix representing words of the second granularity. The input of the second feature network 303 is natural language text and tagging information, and the words of the second granularity in the natural language text can be feature extracted, and the vector or matrix representation of the words of the second granularity in the natural language text can be obtained, that is, the The second feature information. The embodiment of the present application does not limit the order in which the data processing device performs step 313 and step 312. Step 313 and step 312 can be performed at the same time, or step 312 can be performed before step 313, or step 313 can be performed before step 312.
314、第一处理网络304利用输入的第一特征信息和第一处理网络304在处理该第一特征信息的过程中已输出的处理结果做处理以得到第一处理结果。314. The first processing network 304 uses the input first characteristic information and the processing result output by the first processing network 304 in the process of processing the first characteristic information for processing to obtain the first processing result.
第一处理网络304通过递归的方式对输入的该第一特征信息做处理(例如翻译、复述、摘要提取等),即第一处理网络304以其对应的第一特征网络302的输出(第一特征信息)以及其之前已经输出的处理结果(序列)为输入,通过深度神经网络计算出向量或矩阵的表示(第一处理结果)。The first processing network 304 processes the input first feature information in a recursive manner (for example, translation, paraphrase, abstract extraction, etc.), that is, the first processing network 304 uses the output of the first feature network 302 (first The feature information) and the previously output processing result (sequence) are input, and the representation of the vector or matrix (the first processing result) is calculated through the deep neural network.
315、第二处理网络305利用输入的第二特征信息和第二处理网络305在处理该第二特征信息的过程中已输出的处理结果做处理以得到第二处理结果。315. The second processing network 305 uses the input second characteristic information and the processing result output by the second processing network 305 in the process of processing the second characteristic information for processing to obtain the second processing result.
第二处理网络305通过递归的方式对输入的该第二特征信息做处理(例如翻译、复述、摘要提取等),即第二处理网络305以其对应的第二特征网络303的输出(第二特征信息)以及其之前已经输出的处理结果(序列)为输入,通过深度神经网络计算出向量或矩阵的表示(第二处理结果)。本申请实施例不对数据处理设备执行步骤314和步骤315的顺序做限定,步骤314和步骤315可以同时执行,也可以先执行步骤314再执行步骤315,还可以先执行步骤315再执行步骤314。The second processing network 305 processes the input second feature information in a recursive manner (for example, translation, paraphrase, abstract extraction, etc.), that is, the second processing network 305 uses the output of the second feature network 303 (second The feature information) and the previously output processing result (sequence) are input, and the representation of the vector or matrix is calculated through the deep neural network (the second processing result). The embodiment of the present application does not limit the order in which the data processing device executes step 314 and step 315. Step 314 and step 315 can be executed simultaneously, or step 314 can be executed first and then step 315 can be executed, or step 315 can be executed before step 314 is executed.
316、融合网络306利用第一处理结果、第二处理结果以及融合网络306在处理该第一 处理结果和该第二处理结果的过程中已输出的处理结果,确定待输出目标词语,输出该目标词语。316. The fusion network 306 uses the first processing result, the second processing result, and the processing results that the fusion network 306 has output in the process of processing the first processing result and the second processing result to determine the target word to be output, and output the target Words.
该目标词语包含于该第一处理结果或该第二处理结果。融合网络306可以将不同粒度的处理网络的输出进行融合,即确定当前待输出词的粒度进而确定待输出的词。例如,第一步确定待输出“句子级”粒度的词语,输出“what”;第二步确定待输出“句子级”粒度的词语,输出“is”;重复之前的操作,直至最终完成输出语句(对应于目标结果)的生成。需要指出的是,上述步骤311至316均通过深度神经网络计算完成。The target word is included in the first processing result or the second processing result. The fusion network 306 can merge the output of processing networks of different granularities, that is, determine the granularity of the current word to be output and then determine the word to be output. For example, the first step is to determine the words to be output with "sentence level" granularity and output "what"; the second step to determine the words to be output with "sentence level" granularity and output "is"; repeat the previous operation until the final output sentence is completed (Corresponding to the target result) generation. It should be noted that the above steps 311 to 316 are all completed by deep neural network calculations.
本申请实施例中,数据处理设备利用不同粒度的特征网络和不同粒度的处理网络独立处理不同粒度的词语,可以有效提高得到正确结果的概率。In the embodiments of the present application, the data processing device uses feature networks of different granularities and processing networks of different granularities to independently process words of different granularities, which can effectively improve the probability of obtaining correct results.
下面结合粒度标注网络301的结构来描述粒度标注网络301如何确定自然语言文本中各词语的粒度。图4为本申请实施例提供的一种粒度标注网络301的结构示意图。如图4所示,粒度标注网络301包括长短期记忆网络(Long Short-Term Memory,LSTM)402和Bi LSTM(双向LSTM)网络401。从图4可以看出,粒度标注网络301使用多层LSTM网络的架构。Bi LSTM401的输入为自然语言文本,LSTM402的输出为标注信息,即每个词语的粒度标签或者每个词分别属于各种粒度的概率。粒度标注网络301用于预测输入句子(自然语言文本)中的每个词所对应的粒度。可选的,利用BiLSTM网络401将输入的自然语言文本转换成向量,作为下一层的LSTM网络402的输入;LSTM网络402计算该自然语言文本中每一个词属于每种粒度的概率并输出。为了保证整个粒度标注网络301的可微分,同时进一步地解耦开不同粒度的信息,标注信息可以使用GS(Gumbel-Softmax)函数代替常用的Softmax操作。这种情况下,每个词都有属于每种粒度的概率,且这个值接近0或1。The following describes how the granular annotation network 301 determines the granularity of each word in the natural language text in conjunction with the structure of the granular annotation network 301. FIG. 4 is a schematic structural diagram of a granular labeling network 301 provided by an embodiment of this application. As shown in FIG. 4, the granular annotation network 301 includes a Long Short-Term Memory (LSTM) 402 and a Bi LSTM (Bi-directional LSTM) network 401. It can be seen from FIG. 4 that the granular labeling network 301 uses a multilayer LSTM network architecture. The input of LSTM401 is natural language text, and the output of LSTM402 is labeling information, that is, the granularity label of each word or the probability that each word belongs to various granularities. The granularity annotation network 301 is used to predict the granularity corresponding to each word in the input sentence (natural language text). Optionally, the BiLSTM network 401 is used to convert the input natural language text into a vector, which is used as the input of the next layer of the LSTM network 402; the LSTM network 402 calculates and outputs the probability that each word in the natural language text belongs to each granularity. In order to ensure the differentiability of the entire granularity labeling network 301 and to further decouple information of different granularities, the labeling information can use the GS (Gumbel-Softmax) function instead of the commonly used Softmax operation. In this case, each word has a probability of belonging to each granularity, and this value is close to 0 or 1.
下面借助数学公式来描述粒度标注网络301预测自然语言文本中各词语的粒度的方式。BiLSTM网络401的处理过程对应的数学公式如下:The following uses mathematical formulas to describe the manner in which the granularity annotation network 301 predicts the granularity of each word in the natural language text. The mathematical formula corresponding to the processing process of BiLSTM network 401 is as follows:
h_l = BiLSTM([x_l; h_{l-1}, h_{l+1}]);
LSTM网络402的处理过程对应的数学公式如下:The mathematical formula corresponding to the processing process of the LSTM network 402 is as follows:
g_l = LSTM([h_l, z_{l-1}; g_{l-1}]);
z_l = GS(W_g g_l, τ);
In these formulas, BiLSTM() denotes the processing of the bidirectional recurrent deep neural network and LSTM() denotes the processing of the (unidirectional) recurrent deep neural network; l is the index of a word position; x denotes the input sentence (natural language text), and x_l denotes the l-th word of the input sentence x. h denotes the hidden state variables of the BiLSTM network 401; h_l, h_{l-1}, and h_{l+1} denote the hidden state variables of the BiLSTM network 401 when it processes the l-th, (l-1)-th, and (l+1)-th words of the input sentence, respectively. g denotes the hidden state variables of the (unidirectional) LSTM network, computed according to the usual LSTM update rules; g_l and g_{l-1} denote the hidden state variables of the LSTM network 402 when it processes the l-th and (l-1)-th words of the input sentence, respectively. z denotes the probability that a word belongs to a certain granularity (phrase-level, sentence-level, or another granularity); z_l and z_{l-1} denote the probabilities that the l-th and (l-1)-th words of the input sentence belong to that granularity, respectively. GS denotes the Gumbel-Softmax function, τ is a hyperparameter (temperature) of the Gumbel-Softmax function, and W_g is a parameter matrix, that is, a parameter matrix of the granularity annotation network.
粒度标注网络301使用多层LSTM网络的架构来确定自然语言文本中各词语的粒度, 可以充分利用已确定的词语的粒度来确定新的词语(待确定粒度的词语)的粒度,实现简单,处理效率高。The granularity annotation network 301 uses the architecture of a multi-layer LSTM network to determine the granularity of each word in a natural language text, and can make full use of the granularity of the determined word to determine the granularity of a new word (word with a granularity to be determined), which is simple to implement and process efficient.
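A minimal sketch of the annotation computation defined by the formulas above follows, assuming PyTorch and word-embedding inputs; all dimensions and module names are illustrative assumptions rather than the implementation of network 301.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GranularityTagger(nn.Module):
    """Granularity annotation sketch: a BiLSTM encodes the sentence, a unidirectional
    LSTM consumes the BiLSTM state together with the previous granularity decision,
    and Gumbel-Softmax produces a near one-hot granularity label per word."""
    def __init__(self, emb_dim, hidden_dim, num_granularities=2, tau=1.0):
        super().__init__()
        self.bilstm = nn.LSTM(emb_dim, hidden_dim, bidirectional=True, batch_first=True)
        self.cell = nn.LSTMCell(2 * hidden_dim + num_granularities, hidden_dim)
        self.w_g = nn.Linear(hidden_dim, num_granularities, bias=False)
        self.num_granularities = num_granularities
        self.tau = tau

    def forward(self, x):                        # x: (batch, seq_len, emb_dim)
        h, _ = self.bilstm(x)                    # h_l for every position
        batch, seq_len, _ = h.shape
        g = x.new_zeros(batch, self.cell.hidden_size)
        c = x.new_zeros(batch, self.cell.hidden_size)
        z_prev = x.new_zeros(batch, self.num_granularities)
        z_all = []
        for l in range(seq_len):                 # g_l = LSTM([h_l, z_{l-1}; g_{l-1}])
            g, c = self.cell(torch.cat([h[:, l], z_prev], dim=-1), (g, c))
            z_prev = F.gumbel_softmax(self.w_g(g), tau=self.tau)  # z_l = GS(W_g g_l, τ)
            z_all.append(z_prev)
        return torch.stack(z_all, dim=1)         # (batch, seq_len, num_granularities)
```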
下面结合第一特征网络302的结构和第二特征网络303的结构来介绍特征网络的特征提取操作。图5为本申请实施例提供的一种第一特征网络302和第二特征网络303的结构示意图。如图5所示,第一特征网络302的输入和第二特征网络303的输入相同,第一特征网络302对自然语言文本中第一粒度的词语进行特征提取,第二特征网络303对自然语言文本中第二粒度的词语进行特征提取。第一特征网络302和第二特征网络303采用的网络架构可以相同,也可以不同。处理某种粒度的词语的特征网络可以理解为该粒度的特征网络,不同粒度的特征网络处理不同粒度的词语。第一特征网络302和第二特征网络303的参数不共享,且超参数的设置不同。可选的,第一特征网络302和第二特征网络303都采用Transformer模型,此模型基于多头自注意力机制(Multi-head Self-Attention),处理输入语句(自然语言文本)中某一粒度的词语,从而构造一个向量作为该粒度的词语的特征信息。在粒度特征网络301确定自然语言文本中每个词语的粒度的情况下,第一特征网络302可以仅关注输入语句(自然语言文本)中的第一粒度的词语;第二特征网络303可以仅关注输入语句(自然语言文本)中的第二粒度的词语。在粒度特征网络301确定自然语言文本中每个词语分别属于上述N种粒度的概率的情况下,第一特征网络302可以重点关注输入语句(自然语言文本)中的第一粒度的词语;第二特征网络303可以重点关注输入语句(自然语言文本)中的第二粒度的词语。在这种情况下,对于第一特征网络302来说,其重点关注输入语句中属于第一粒度的概率较高的词语;对于第二特征网络303来说,其重点关注输入语句中属于第二粒度的概率较高的词语。可以理解,一个词属于第一粒度的概率越高,第一特征网络302对该词语的关注度越高。The following describes the feature extraction operation of the feature network in combination with the structure of the first feature network 302 and the structure of the second feature network 303. FIG. 5 is a schematic structural diagram of a first characteristic network 302 and a second characteristic network 303 provided by an embodiment of this application. As shown in Figure 5, the input of the first feature network 302 and the input of the second feature network 303 are the same. The first feature network 302 performs feature extraction on words of the first granularity in the natural language text, and the second feature network 303 performs feature extraction on the natural language text. The words of the second granularity in the text are feature extracted. The network architectures adopted by the first feature network 302 and the second feature network 303 may be the same or different. A feature network that processes words of a certain granularity can be understood as a feature network of that granularity, and feature networks of different granularities process words of different granularity. The parameters of the first characteristic network 302 and the second characteristic network 303 are not shared, and the hyperparameter settings are different. Optionally, both the first feature network 302 and the second feature network 303 adopt the Transformer model. This model is based on a multi-head self-attention mechanism, which processes input sentences (natural language text) at a certain granularity. Words, so as to construct a vector as the characteristic information of the granular words. In the case that the granular feature network 301 determines the granularity of each word in the natural language text, the first feature network 302 may only focus on the words of the first granularity in the input sentence (natural language text); the second feature network 303 may only focus on Input sentences (natural language text) in the second granularity of words. In the case that the granular feature network 301 determines the probability that each word in the natural language text belongs to the aforementioned N types of granularities, the first feature network 302 can focus on the words of the first granularity in the input sentence (natural language text); The feature network 303 can focus on the words of the second granularity in the input sentence (natural language text). In this case, for the first feature network 302, it focuses on words with a higher probability of belonging to the first granularity in the input sentence; for the second feature network 303, it focuses on words belonging to the second Words with higher probability of granularity. 
It can be understood that the higher the probability that a word belongs to the first granularity, the higher the attention of the first feature network 302 to the word.
如5所示,第一特征网络302可以采用限定窗口的自注意力(Self-Attention)机制(类似深度神经网络的机制,但其权重仍由attention计算得出。对于输入语句(自然语言文本),第一特征网络302会重点关注该输入语句中第一粒度的词,而忽视其他粒度层级上的词。第一特征网络302可以是短语级粒度的特征网络,提取每个词语的特征时仅关注该词相邻的两个词语,如图5所示。第二特征网络303可以采用整句范围的Self-Attention机制,从而能够关注到句子全局的信息。第二特征网络303可以是句子级粒度的特征网络,提取每个词语的特征时都关注整个输入语句,如图5所示。对于输入语句(自然语言文本),第二特征网络303会重点关注该输入语句中第二粒度的词,而忽略其他粒度层级上的词。Transformer模型是本领域常用的一种模型,这里不再详述该模型的工作原理。最终,第一特征网络302可以得到输入语句(自然语言文本)中第一粒度的各词语的向量表示(第一特征信息);第二特征网络303可以得到输入语句(自然语言文本)中第二粒度的各词语的向量表示(第二特征信息)。在实际应用中,通过深度神经网络(Transformer)的计算,每个粒度上的特征网络得到该粒度上的词语的向量表示,记为Uz。As shown in 5, the first feature network 302 can use a self-attention mechanism with a limited window (similar to a deep neural network mechanism, but its weight is still calculated by attention. For the input sentence (natural language text) , The first feature network 302 will focus on words at the first granularity in the input sentence and ignore words at other granularity levels. The first feature network 302 can be a feature network with a phrase-level granularity. When extracting the features of each word, only Pay attention to the two adjacent words of the word, as shown in Figure 5. The second feature network 303 can adopt the Self-Attention mechanism of the whole sentence, so as to be able to pay attention to the global information of the sentence. The second feature network 303 can be sentence-level The granular feature network focuses on the entire input sentence when extracting the features of each word, as shown in Figure 5. For the input sentence (natural language text), the second feature network 303 will focus on the second granular word in the input sentence , While ignoring words at other levels of granularity. The Transformer model is a commonly used model in the field, and the working principle of the model will not be described in detail here. Finally, the first feature network 302 can obtain the input sentence (natural language text). The vector representation (first feature information) of each word at one granularity; the second feature network 303 can obtain the vector representation (second feature information) of each word at the second granularity in the input sentence (natural language text). In practical applications , Through the calculation of the deep neural network (Transformer), the feature network at each granularity obtains the vector representation of the word at the granularity, denoted as Uz.
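A small sketch of the two attention patterns just described follows, assuming boolean masks of the kind a Transformer self-attention layer can consume; the window size of 1 (a word and its two neighbours) reflects the phrase-level example above and is otherwise an assumption.

```python
import torch

def build_attention_masks(seq_len, window=1):
    """Local-window mask for the phrase-level feature network and full-sentence
    mask for the sentence-level one (True = position may be attended to)."""
    idx = torch.arange(seq_len)
    local = (idx[None, :] - idx[:, None]).abs() <= window   # a word and its neighbours
    full = torch.ones(seq_len, seq_len, dtype=torch.bool)   # the whole sentence
    return local, full
```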
下面借助数学公式来描述第一特征网络302和第二特征网络303实现的处理操作。第一特征网络302和第二特征网络303实现的处理操作对应的数学公式如下:The processing operations implemented by the first feature network 302 and the second feature network 303 are described below with the aid of mathematical formulas. The mathematical formulas corresponding to the processing operations implemented by the first feature network 302 and the second feature network 303 are as follows:
U_z = ENC_z(X, Z_X);
Here z denotes the granularity-level index (for example, z = 0 denotes word-level granularity and z = 1 denotes sentence-level granularity), ENC_z denotes the feature network at granularity z (the first feature network or the second feature network), which is a Transformer model, and ENC_z() denotes the processing operation performed by that feature network. X denotes the input sentence (natural language text) of the feature network, and Z_X = [z1, z2, ..., zL] denotes the annotation information (granularity levels) of the input sentence, which is determined by the output of the granularity annotation network; z1 to zL denote the granularities of the first to the L-th (last) word of the input sentence. U_z denotes the final output of the feature network ENC_z. The inputs of the feature network are the input sentence X and the annotation information Z_X. When the annotation information output by the granularity annotation network 301 is the granularity of each word in the natural language text, the annotation information of the input sentence fed to the feature network is the annotation information output by the granularity annotation network 301. For example, the annotation information output by the granularity annotation network 301 is [1100001]; these binary values in turn denote the granularities of the first to the last word of the input sentence, where 0 denotes word-level granularity and 1 denotes sentence-level granularity. When the annotation information output by the granularity annotation network 301 is the probability that each word of the natural language text belongs to each of the N granularities, the annotation information of the input sentence fed to the feature network is obtained from the annotation information output by the granularity annotation network 301. In practical applications, the data processing device may further process the annotation information output by the granularity annotation network 301 to obtain annotation information that can be input to the feature network.
在一个可选的实现方式中,数据处理设备将自然语言文本中每个词语属于最大概率的那种粒度作为每个词语的粒度。举例来说,输入语句(自然语言文本)中某个词语属于短语级粒度、句子级粒度的概率分别为0.85和0.15,则该词语的粒度为短语级粒度。又举例来说,按照短语级粒度和句子级粒度划为自然语言文本中的各词语的粒度,粒度标注网络301输出的标注信息为[0.92 0.88 0.08 0.07 0.04 0.06 0.97],该标注信息中的数值依次表示该自然语言文本中第一个词语至最后一个词语分别属于句子级粒度的概率,数据处理设备可以将标注信息中小于0.5的数值置为0,大于或等于0.5的数值置为1,得到新的标注信息[1100001]并输入至特征网络。In an optional implementation manner, the data processing device uses the granularity at which each word in the natural language text belongs to the maximum probability as the granularity of each word. For example, if the probability that a word in the input sentence (natural language text) belongs to the phrase-level granularity and sentence-level granularity are 0.85 and 0.15, respectively, the granularity of the word is the phrase-level granularity. For another example, according to phrase-level granularity and sentence-level granularity, the granularity of each word in the natural language text is classified. The annotation information output by the granularity annotation network 301 is [0.92 0.88 0.08 0.07 0.04 0.06 0.97], and the value in the annotation information In turn, it indicates the probability that the first word to the last word in the natural language text belong to the sentence-level granularity. The data processing device can set the value less than 0.5 in the label information to 0, and the value greater than or equal to 0.5 to 1, to get The new label information [1100001] is input into the feature network.
在一个可选的实现方式中,数据处理设备根据自然语言文本中每个词语分别属于上述N种粒度的概率进行采样,利用采样得到的每个词语所属的粒度得到该自然语言文本的标注信息,并输入至特征网络。In an optional implementation manner, the data processing device samples the natural language text according to the probability that each word in the natural language text belongs to the aforementioned N types of granularities, and obtains the annotation information of the natural language text by using the granularity of each word obtained by the sampling. And input to the feature network.
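Both optional implementations can be illustrated with a short example, assuming PyTorch and the per-word sentence-level probabilities from the [0.92 0.88 0.08 0.07 0.04 0.06 0.97] example above.

```python
import torch

# Per-word probabilities (here, of being sentence-level) output by the annotation network.
probs = torch.tensor([0.92, 0.88, 0.08, 0.07, 0.04, 0.06, 0.97])

# Option 1: threshold at 0.5 to obtain the 0/1 labels fed to the feature networks,
# matching the [1100001] example above.
hard_labels = (probs >= 0.5).long()             # tensor([1, 1, 0, 0, 0, 0, 1])

# Option 2: sample each word's granularity from its probability instead of thresholding.
sampled_labels = torch.bernoulli(probs).long()  # one random 0/1 draw per word
```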
深度神经网络包括的各特征网络独立处理不同粒度的词语,采用不同架构的网络来处理不同粒度的词语,特征提取性能较好。Each feature network included in the deep neural network independently processes words of different granularities, and uses networks of different architectures to process words of different granularities, with better feature extraction performance.
下面结合第一特征网络302、第二特征网络303、第一处理网络304、第二处理网络305以及融合网络306的结构来介绍处理网络所做的处理以及融合网络306所做的处理。The processing performed by the processing network and the processing performed by the convergence network 306 will be introduced below in conjunction with the structures of the first feature network 302, the second feature network 303, the first processing network 304, the second processing network 305, and the converged network 306.
图6为本申请实施例提供的一种深度神经网络的结构示意图,图6未示出粒度标注网络。图6所示,第一处理网络304的输入为第一特征网络302输出的第一特征信息以及第一处理网络304在处理该第一特征信息的过程出已输出的处理结果(词语);第二处理网络305的输入为第二特征网络303输出的第二特征信息以及第二处理网络305在处理该第二特征信息的过程出已输出的处理结果(词语);融合网络306的输入为第一处理结果、第二处理结果以及在处理该第一处理结果和该第二处理结果的过程中已输出的词语,融合网络306的输出为融合该第一处理结果和该第二处理结果得到的目标结果。第一处理网络304和第二处理网络305采用的架构可以相同,也可以不同。第一处理网络304和第二处理网络305可以不共享参数。Fig. 6 is a schematic structural diagram of a deep neural network provided by an embodiment of the application, and Fig. 6 does not show a granular annotation network. As shown in FIG. 6, the input of the first processing network 304 is the first characteristic information output by the first characteristic network 302, and the first processing network 304 has outputted processing results (words) in the process of processing the first characteristic information; The input of the second processing network 305 is the second feature information output by the second feature network 303, and the second processing network 305 outputs the processed results (words) that have been output in the process of processing the second feature information; the input of the fusion network 306 is the first A processing result, a second processing result, and words that have been output in the process of processing the first processing result and the second processing result. The output of the fusion network 306 is obtained by fusing the first processing result and the second processing result Target result. The architectures adopted by the first processing network 304 and the second processing network 305 may be the same or different. The first processing network 304 and the second processing network 305 may not share parameters.
A processing network that processes words of a certain granularity can be understood as the processing network of that granularity; processing networks of different granularities process words of different granularities. In other words, each granularity has a corresponding processing network. For example, when the words in a natural language text are divided into phrase-level and sentence-level granularities, the deep neural network includes one phrase-level processing network and one sentence-level processing network. The processing networks of different granularities are decoupled, meaning that they do not share parameters and can adopt different architectures; for example, the phrase-level processing network may use a deep neural network architecture while the sentence-level processing network uses a Transformer architecture. A processing network may output one word at a time together with the granularity of that word. The processing may proceed recursively: the processing network of each granularity takes as input the output of the feature network of the corresponding granularity and the words it has already output, computes the probabilities of the candidate words currently to be output, and outputs the word with the highest probability together with the annotation information corresponding to that word. Optionally, the processing network uses its input to compute the probability of each candidate word, samples according to these probabilities, and outputs the sampled word and its corresponding annotation information. Optionally, the processing network uses its input to compute the probability of each candidate word (that is, the probability that each word is output at the current step) and outputs these probabilities. For example, if the processing network currently has F candidate words, it uses its input to compute the probability of outputting the first word, the second word, ..., and the F-th word, and inputs these probabilities to the fusion network, where F is an integer greater than 1. The annotation information corresponding to a word may be the probability that the word belongs to a certain granularity, the granularity of the word, or the probabilities that the word belongs to each of the granularities.
The processing performed by the first processing network 304 may be as follows. In a first step, the first processing network 304 processes the input first feature information to predict the first word currently to be output, and outputs that first word and the annotation information corresponding to it. In a second step, the first processing network 304 processes the input first feature information and the first word to predict the second word currently to be output, and outputs that second word and its corresponding annotation information. The first processing network 304 then processes the input first feature information, the first word, and the second word to predict the third word currently to be output, and outputs that third word and its corresponding annotation information; the preceding steps are repeated until the first processing result is complete. It should be understood that each processing network included in the deep neural network may process its input feature information in a manner similar to the first processing network 304. For example, suppose the input of a certain processing network is the feature information obtained by its corresponding feature network performing feature extraction on "a good geologist". The processing network processes this feature information, predicts that "a" currently needs to be output, and outputs it; it then processes the feature information together with the previously output "a", predicts that "great" currently needs to be output, and outputs it; it then processes the feature information together with the previously output "a" and "great", predicts that "geologist" currently needs to be output, and outputs it.
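A minimal sketch of this recursive prediction loop; `score_next` stands in for whatever forward pass a particular processing network implements and is not part of the embodiment:

```python
# Illustrative greedy autoregressive decoding by one processing network.
# score_next(features, prefix) must return a dict {candidate_word: probability}.
def decode_greedy(features, score_next, max_len=10, eos="</s>"):
    prefix = []
    for _ in range(max_len):
        probs = score_next(features, prefix)   # probabilities of the candidate words
        word = max(probs, key=probs.get)       # emit the most probable word
        if word == eos:
            break
        prefix.append(word)                    # the emitted word feeds the next step
    return prefix
```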
As shown in FIG. 6, the first processing network 304 receives the output of the first feature network 302 and the words it has already output, and performs its computation using a self-attention mechanism with a limited window; the second processing network 305 receives the output of the second feature network 303 and the words it has already output, and performs its computation using a self-attention mechanism over the whole sentence. The processing result obtained by the processing network at each granularity is denoted Vz, where z is the index of the granularity level, that is, granularity z. The first processing network 304 and the second processing network 305 may also adopt different architectures. The following describes the operations performed by the fusion network 306 on the processing results input by the processing networks.
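Before turning to the fusion network 306, the difference between the two attention patterns just mentioned can be sketched as attention masks; this is an illustrative reading, not the exact computation of the embodiment:

```python
from typing import Optional
import torch

def attention_mask(seq_len: int, window: Optional[int]) -> torch.Tensor:
    """Entry (j, k) is 1 if position j may attend to position k.
    window=None gives whole-sentence self-attention; a finite window
    gives the limited-window self-attention."""
    if window is None:
        return torch.ones(seq_len, seq_len)
    idx = torch.arange(seq_len)
    return ((idx[:, None] - idx[None, :]).abs() <= window).float()
```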
The fusion network 306 can fuse the processing results output by the processing networks of different granularities to obtain the target result. The output of the fusion network 306 is a sequence of words. The input of the fusion network 306 is the processing results of the processing networks (the first processing result and the second processing result) together with the sequence that the fusion network 306 has already output while processing these results. The operations performed by the fusion network 306 may be as follows: the fusion network 306 merges the processing results input by the processing networks into one vector; the vector is input to an LSTM network for processing to determine the granularity of the word currently to be output, that is, which granularity level's word should be output now; the fusion network 306 then outputs the target word currently to be output by the processing network of that granularity. Inputting the vector to an LSTM network to determine the granularity of the word to be output may consist of inputting the vector to an LSTM network to determine, for each of the N granularities, the probability that a word of that granularity is output, and thereby determining the granularity to be output now, where the granularity to be output is the one whose word currently has the highest probability of being output. This granularity is any one of the N granularities. The target word is the word with the highest probability of being output among the candidate words currently to be output by the processing network of the granularity to be output. For example, if the probabilities of the first, second, and third candidate words of the processing network of a reference granularity are 0.06, 0.8, and 0.14 respectively, the target word currently to be output by that processing network is the second word, that is, the word with the highest probability of being output. It can be understood that the fusion network 306 may first determine which granularity's word is currently to be output, and then output the word to be output by the processing network of that granularity.
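A minimal sketch of this first fusion variant (choose the granularity first, then emit that granularity's most probable word); the inputs are assumed to have already been computed by the LSTM and the processing networks:

```python
# Illustrative: gran_probs[z] = probability that a word of granularity z is emitted now;
# word_probs[z] = dict {word: probability} given by the processing network of granularity z.
def fuse_pick_granularity(gran_probs, word_probs):
    z = max(range(len(gran_probs)), key=lambda i: gran_probs[i])  # granularity to emit
    return max(word_probs[z], key=word_probs[z].get)              # its most probable word

fuse_pick_granularity([0.3, 0.7], [{"a": 0.6, "good": 0.4}, {"How": 0.8, "can": 0.2}])  # -> "How"
```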
The operations performed by the fusion network 306 may also be as follows: the fusion network 306 merges the processing results input by the processing networks into one vector; the vector is input to an LSTM network for processing to determine the probability of each of the candidate words currently to be output by the processing networks; the fusion network 306 then outputs the target word, that is, the candidate word with the highest probability of being output. Here, the processing networks refer to the processing networks of the respective granularities. For example, if the candidate words currently to be output by the first processing network include "a", "good", and "geologist", and the candidate words currently to be output by the second processing network include "How", "can", "I", and "be", the fusion network computes the probability that each of these seven words is output at the current step and outputs the one with the highest probability of being output.
The following describes how to compute, for the processing network of a reference granularity, the probability that each of its candidate words currently to be output is output. The reference granularity is any one of the N granularities mentioned above.
Suppose that before the fusion network 306 outputs the t-th word, the (t-1) words already output by the fusion network 306 are denoted [y_1, y_2, ..., y_{t-1}], where t is an integer greater than 1, and that the vectors (processing results) output by the first processing network and the second processing network are v_0 and v_1 respectively. The fusion network 306 concatenates these two vectors with the sequence it has already output and inputs the concatenated vector into an LSTM network for processing to compute the probability that a word of the reference granularity is to be output; the fusion network 306 includes this LSTM network. The LSTM network may compute this probability with the following formulas:

h_t = LSTM(h_{t-1}, y_{t-1}, v_0, v_1);

P(z_t | y_{1:t-1}, X) = GS(W_z h_t, τ);

where h_t denotes the hidden state of the LSTM network when it processes the t-th word, LSTM() denotes the processing performed by the LSTM network, y_{t-1} denotes the (t-1)-th word output by the fusion network, W_z is a parameter matrix in the fusion network, τ is a hyperparameter, and P(z_t | y_{1:t-1}, X) is the probability that the word currently to be output is of granularity z. It can be understood that the fusion network 306 can compute, in a similar way, the probability of currently outputting a word of any one of the N granularities. After computing these probabilities, the probability of outputting the target word is computed through a mixture probability model; the target word is a word currently to be output by the processing network of granularity z. The probability of outputting the target word is computed as:

P(y_t | y_{1:t-1}, X) = Σ_z P(z_t = z | y_{1:t-1}, X) · P_z(y_t | y_{1:t-1}, X);

where P_z(y_t | y_{1:t-1}, X) denotes the probability of outputting the target word y_t at granularity z, and P(y_t | y_{1:t-1}, X) denotes the probability of outputting the target word. P_z(y_t | y_{1:t-1}, X) can be given by the processing network: the processing network of granularity z can input to the fusion network the probability of each of its candidate words (words of granularity z), that is, the probability that each word currently to be output by that processing network is output. For example, the input of the first processing network is the feature information obtained by the first feature network performing feature extraction on "a good geologist"; the processing network processes this feature information to obtain the probability of outputting "a", the probability of outputting "great", and the probability of outputting "geologist", and inputs these words and the corresponding probabilities to the fusion network. Assuming the target word y_t is "great", P_z(y_t | y_{1:t-1}, X) denotes the probability of outputting "great" at granularity z. It can be understood that the fusion network 306 may first compute, for each of the N granularities, the probability that a word of that granularity is currently to be output, then compute the probability of each candidate word being output, and finally output the word with the highest probability of being output.
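Under the assumption that GS denotes a Gumbel-Softmax-style operator with temperature τ (the abbreviation is not expanded in the text), the formulas above can be sketched as follows; the sizes, module names, and batched tensor shapes are illustrative, not taken from the embodiment:

```python
import torch
import torch.nn.functional as F

hidden_size, num_granularities = 128, 2
lstm = torch.nn.LSTMCell(input_size=3 * hidden_size, hidden_size=hidden_size)
W_z = torch.nn.Linear(hidden_size, num_granularities, bias=False)
tau = 1.0

def fusion_step(h_prev, c_prev, y_prev_emb, v0, v1, word_probs):
    """One output step. All tensors have shape (1, hidden_size);
    word_probs[z] is a dict {word: P_z(word | y_{1:t-1}, X)} from granularity z."""
    x = torch.cat([y_prev_emb, v0, v1], dim=-1)             # merge inputs into one vector
    h_t, c_t = lstm(x, (h_prev, c_prev))                    # h_t = LSTM(h_{t-1}, y_{t-1}, v0, v1)
    p_z = F.gumbel_softmax(W_z(h_t), tau=tau, hard=False)   # P(z_t | y_{1:t-1}, X)
    mixed = {}                                               # P(y_t) = sum_z P(z_t=z) * P_z(y_t)
    for z, probs in enumerate(word_probs):
        for w, p in probs.items():
            mixed[w] = mixed.get(w, 0.0) + p_z[0, z].item() * p
    return max(mixed, key=mixed.get), (h_t, c_t)
```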
前述实施例描述了利用训练得到的深度神经网络来实现自然语言处理方法,下面介绍如何训练得到所需的深度神经网络。The foregoing embodiment describes the use of a deep neural network obtained by training to implement a natural language processing method. The following describes how to train a required deep neural network.
图7为本申请实施例提供的一种训练方法流程图,如图7所示,该方法可包括:FIG. 7 is a flowchart of a training method provided by an embodiment of the application. As shown in FIG. 7, the method may include:
701、数据处理设备将训练样本输入至深度神经网络做处理,得到预测处理结果。701. The data processing device inputs the training samples to the deep neural network for processing, and obtains a prediction processing result.
The deep neural network includes a granularity annotation network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network, and the processing includes: using the granularity annotation network to determine the granularity of each word in the training sample; using the first feature network to perform feature extraction on the words of the first granularity in the training sample and outputting the obtained third feature information to the first processing network; using the second feature network to perform feature extraction on the words of the second granularity in the training sample and outputting the obtained fourth feature information to the second processing network; using the first processing network to perform target processing on the third feature information and outputting the obtained third processing result to the fusion network; using the second processing network to perform the target processing on the fourth feature information and outputting the obtained fourth processing result to the fusion network; and using the fusion network to fuse the third processing result and the fourth processing result to obtain the predicted processing result; the first granularity and the second granularity are different.
The architectures of the first feature network and the second feature network are different, and/or the architectures of the first processing network and the second processing network are different. The input of the granularity annotation network is the natural language text; the granularity annotation network is used to determine, according to N granularities, the granularity of each word in the natural language text to obtain the annotation information of the natural language text, and to output the annotation information to the first feature network and the second feature network, where the annotation information describes the granularity of each word or the probabilities that each word belongs to the N granularities, and N is an integer greater than 1. The first feature network is used to perform feature extraction using the input natural language text and the annotation information and to output the obtained third feature information to the first processing network, where the third feature information is a vector or matrix representing the words of the first granularity. The first processing network is used to perform the target processing using the input third feature information and the processing results it has already output, to obtain the third processing result. The fusion network outputs one word at a time; it is used to determine the target word to be output using the third processing result, the fourth processing result, and the words it has already output while processing the third and fourth processing results, and to output that target word.
702、数据处理设备根据该预测处理结果和标准结果,确定该训练样本对应的损失。702. The data processing device determines the loss corresponding to the training sample according to the predicted processing result and the standard result.
The standard result, that is, the ground truth, is the processing result expected to be obtained by processing the training sample with the deep neural network. It can be understood that each training sample corresponds to one standard result, so that the data processing device can compute the loss of processing each training sample with the deep neural network and thereby optimize the deep neural network. Taking the training of a deep neural network for the paraphrase task as an example, the following introduces training samples and standard results that the data processing apparatus may use to train the deep neural network.
Table 1: example training samples and their corresponding standard results (reference outputs) for the paraphrase task.
There is no labeled data for the granularity of each word in the training samples; the granularity annotation network 301 is obtained through end-to-end learning. Because of this end-to-end learning, and to keep the granularity annotation network 301 differentiable, during training the network actually outputs the probability that each word belongs to each granularity rather than an absolute 0/1 label. It should be understood that when the data processing device trains the deep neural network for different natural language processing tasks, different training samples and standard results are used. For example, if the data processing device trains the network for the paraphrase task, training samples and standard results similar to those in Table 1 may be used. As another example, if the data processing device trains the network for a translation task, the training samples may be English texts and the standard results the corresponding standard Chinese texts.
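The differentiability point made at the start of the preceding paragraph can be illustrated with a small sketch contrasting soft (training-time) and hard (inference-time) granularity labels; the tensor names and sizes are assumptions:

```python
import torch

logits = torch.randn(7, 2, requires_grad=True)   # one row per word, N = 2 granularities

soft_labels = torch.softmax(logits, dim=-1)      # training: probabilities, gradients can flow
hard_labels = soft_labels.argmax(dim=-1)         # inference: absolute 0/1 assignment (not differentiable)
```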
703、数据处理设备利用该训练样本对应的损失,通过优化算法更新该深度神经网络的参数。703. The data processing device uses the loss corresponding to the training sample to update the parameters of the deep neural network through an optimization algorithm.
In practical applications, the data processing device can train the deep neural network to handle different natural language processing tasks. Depending on which natural language processing task the deep neural network is trained for, the data processing device computes the loss between the predicted processing result and the standard result differently, that is, the method of computing the loss corresponding to a training sample differs.
In an optional implementation, using the loss corresponding to the training sample to update the parameters of the deep neural network through an optimization algorithm may consist of using the gradient of the loss function with respect to at least one network included in the deep neural network to update the parameters of that at least one network, where the loss function is used to compute the loss between the predicted processing result and the standard result. While any one of the first feature network, the second feature network, the first processing network, and the second processing network is being updated, the parameters of the other three networks remain unchanged. Using the gradient of a loss function with respect to a network to update that network's parameters through an optimization algorithm (for example, a gradient descent algorithm) is a common technique in the art and is not described in detail here.
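A minimal sketch of one such update step, assuming the model exposes its sub-networks by name and only the first processing network is updated in this step; nothing here is taken from the embodiment's actual code:

```python
import torch

def train_step(model, optimizer, loss_fn, sample, standard_result):
    # Freeze everything except the first processing network for this update.
    for name, p in model.named_parameters():
        p.requires_grad = name.startswith("first_processing")
    prediction = model(sample)                    # forward pass: predicted processing result
    loss = loss_fn(prediction, standard_result)   # loss between prediction and ground truth
    optimizer.zero_grad()
    loss.backward()                               # gradient of the loss w.r.t. the unfrozen network
    optimizer.step()                              # e.g. a gradient-descent update
    return loss.item()
```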
前述实施例采用的深度神经网络为采用图7中的训练方法得到的网络,应理解图7中 的深度神经网络与前述实施例中的深度神经网络的结构和处理过程均相同。The deep neural network used in the foregoing embodiment is a network obtained by using the training method in FIG. 7. It should be understood that the structure and processing process of the deep neural network in FIG. 7 are the same as the deep neural network in the foregoing embodiment.
本申请实施例中,数据处理设备训练可以独立处理不同粒度的词语的深度神经网络,以便于得到能够避免由较细粒度的信息得到较粗粒度的信息的过程的深度神经网络,实现简单。In the embodiments of the present application, the data processing device trains a deep neural network that can independently process words of different granularities, so as to obtain a deep neural network that can avoid the process of obtaining coarser-grained information from finer-grained information, and is simple to implement.
前述实施例介绍了自然语言处理方法训练方法,下面介绍实现这些方法的数据处理设备的结构。图8为本申请实施例提供的一种数据处理设备的结构示意图,如图8所示,该数据处理设备可包括:The foregoing embodiments introduced training methods for natural language processing methods, and the structure of data processing equipment implementing these methods is described below. FIG. 8 is a schematic structural diagram of a data processing device provided by an embodiment of the application. As shown in FIG. 8, the data processing device may include:
获取单元801,用于获得待处理的自然语言文本;The obtaining unit 801 is configured to obtain the natural language text to be processed;
处理单元802,用于利用训练得到的深度神经网络对该自然语言文本做处理;The processing unit 802 is configured to process the natural language text by using the deep neural network obtained by training;
输出单元803,用于输出处理该自然语言文本得到的目标结果。The output unit 803 is configured to output the target result obtained by processing the natural language text.
The deep neural network includes a granularity annotation network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network, and the processing includes: using the granularity annotation network to determine the granularity of each word in the natural language text; using the first feature network to perform feature extraction on the words of the first granularity in the natural language text and outputting the obtained first feature information to the first processing network; using the second feature network to perform feature extraction on the words of the second granularity in the natural language text and outputting the obtained second feature information to the second processing network; using the first processing network to process the first feature information and outputting the obtained first processing result to the fusion network; using the second processing network to perform the processing on the second feature information and outputting the obtained second processing result to the fusion network; and using the fusion network to fuse the first processing result and the second processing result to obtain the target result; the first granularity and the second granularity are different.
处理单元802可以是数据处理设备中的中央处理器(Central Processing Unit,CPU),也可以是神经网络处理器(Neural-network Processing Unit,NPU),还可以是其他类型的处理器。输出单元803可以是显示器、显示屏、音频设备等。该目标结果可以是由该自然语言文本得到的另一个自然语言文本,数据处理设备的显示屏显示得到的自然语言文本。该目标结果可以由该自然语言文本得到的另一个自然语言文本对应的语音,数据处理设备中的音频设备播放该语音。The processing unit 802 may be a central processing unit (Central Processing Unit, CPU) in a data processing device, a neural network processor (Neural-network Processing Unit, NPU), or other types of processors. The output unit 803 may be a display, a display screen, an audio device, etc. The target result may be another natural language text obtained from the natural language text, and the obtained natural language text is displayed on the display screen of the data processing device. The target result can be a voice corresponding to another natural language text obtained from the natural language text, and the audio device in the data processing device plays the voice.
在一个可选的实现方式中,处理单元802,还用于将训练样本输入至深度神经网络做处理,得到预测处理结果;根据该预测处理结果和标准结果,确定该训练样本对应的损失;该标准结果为利用该深度神经网络处理该训练样本期望得到的处理结果;利用该训练样本对应的损失,通过优化算法更新该深度神经网络的参数。In an optional implementation manner, the processing unit 802 is also used to input training samples into the deep neural network for processing to obtain prediction processing results; according to the prediction processing results and standard results, determine the loss corresponding to the training samples; The standard result is the processing result expected to be obtained by using the deep neural network to process the training sample; using the loss corresponding to the training sample, the parameters of the deep neural network are updated through an optimization algorithm.
The deep neural network includes a granularity annotation network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network, and the processing includes: using the granularity annotation network to determine the granularity of each word in the training sample; using the first feature network to perform feature extraction on the words of the first granularity in the training sample and outputting the obtained third feature information to the first processing network; using the second feature network to perform feature extraction on the words of the second granularity in the training sample and outputting the obtained fourth feature information to the second processing network; using the first processing network to perform target processing on the third feature information and outputting the obtained third processing result to the fusion network; using the second processing network to perform the target processing on the fourth feature information and outputting the obtained fourth processing result to the fusion network; and using the fusion network to fuse the third processing result and the fourth processing result to obtain the predicted processing result; the first granularity and the second granularity are different.
详细的训练方法参阅图7,这里不再详述。The detailed training method is shown in Figure 7, which will not be detailed here.
前述实施例描述了数据处理设备利用深度神经网络来处理自然语言任务的方法。下面介绍一下深度神经网络以方便读者进一步理解本方案。The foregoing embodiments describe a method in which a data processing device uses a deep neural network to process natural language tasks. The following introduces the deep neural network to facilitate readers to further understand this scheme.
A deep neural network (DNN) can be understood as a neural network with many hidden layers; there is no particular threshold for "many" here, and what are commonly called multi-layer neural networks and deep neural networks are essentially the same thing. Divided by the positions of the different layers, the layers inside a DNN fall into three categories: the input layer, the hidden layers, and the output layer. Generally, the first layer is the input layer, the last layer is the output layer, and all the layers in between are hidden layers. The layers are fully connected, that is, any neuron in the i-th layer is connected to every neuron in the (i+1)-th layer. Although a DNN looks complicated, the work of each layer is actually simple: it is the linear relation y = α(W·x + b), where x is the input vector, y is the output vector, b is the offset vector, W is the weight matrix (also called the coefficients), and α() is the activation function. Each layer simply performs this operation on its input vector x to obtain the output vector y. Because a DNN has many layers, there are correspondingly many coefficient matrices W and offset vectors b. These parameters are defined in the DNN as follows, taking the coefficient W as an example. In a three-layer DNN, the linear coefficient from the 4th neuron of the second layer to the 2nd neuron of the third layer is defined as W^3_{24}: the superscript 3 denotes the layer in which the coefficient W is located, and the subscripts correspond to the output index 2 of the third layer and the input index 4 of the second layer. In summary, the coefficient from the k-th neuron of the (L-1)-th layer to the j-th neuron of the L-th layer is defined as W^L_{jk}. Note that the input layer has no W parameters. In a deep neural network, more hidden layers enable the network to better characterize complex situations in the real world. Theoretically, a model with more parameters has higher complexity and larger "capacity", which means it can accomplish more complex learning tasks.
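A minimal sketch of the single-layer operation just described, with ReLU standing in for the activation function α; the sizes are arbitrary:

```python
import numpy as np

def dense_layer(x: np.ndarray, W: np.ndarray, b: np.ndarray) -> np.ndarray:
    # y = alpha(W @ x + b), here with alpha = ReLU
    return np.maximum(W @ x + b, 0.0)

x = np.array([1.0, -2.0, 0.5])   # input vector (3 inputs)
W = np.random.randn(2, 3)        # weight matrix (2 outputs x 3 inputs)
b = np.zeros(2)                  # offset vector
y = dense_layer(x, W, b)         # output vector
```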
前述实施例中数据处理设备采用深度神经网络所执行的方法可以在NPU中实现。图9为本申请实施例提供的一种神经网络处理器的结构示意图。The method executed by the data processing device using the deep neural network in the foregoing embodiment can be implemented in the NPU. FIG. 9 is a schematic structural diagram of a neural network processor provided by an embodiment of the application.
The neural network processor (NPU) 90 is mounted on the host CPU as a coprocessor, and the host CPU allocates tasks (for example, natural language processing tasks) to it. The core part of the NPU is the arithmetic circuit 903; the controller 904 controls the arithmetic circuit 903 to fetch matrix data from the memories and perform multiplication operations.
在一些实现中,运算电路903内部包括多个处理单元(Process Engine,PE)。在一些实现中,运算电路903是二维脉动阵列。运算电路903还可以是一维脉动阵列或者能够执行例如乘法和加法这样的数学运算的其它电子线路。在一些实现中,运算电路903是通用的矩阵处理器。In some implementations, the arithmetic circuit 903 includes multiple processing units (Process Engine, PE). In some implementations, the arithmetic circuit 903 is a two-dimensional systolic array. The arithmetic circuit 903 may also be a one-dimensional systolic array or other electronic circuits capable of performing mathematical operations such as multiplication and addition. In some implementations, the arithmetic circuit 903 is a general-purpose matrix processor.
举例来说,假设有输入矩阵A,权重矩阵B,输出矩阵C。运算电路从权重存储器902中取矩阵B相应的数据,并缓存在运算电路中每一个PE上。运算电路从输入存储器901中取矩阵A数据与矩阵B进行矩阵运算,得到的矩阵的部分结果或最终结果,保存在累加器908accumulator中。For example, suppose there is an input matrix A, a weight matrix B, and an output matrix C. The arithmetic circuit fetches the data corresponding to matrix B from the weight memory 902 and caches it on each PE in the arithmetic circuit. The arithmetic circuit takes the matrix A data and matrix B from the input memory 901 to perform matrix operations, and the partial or final result of the obtained matrix is stored in the accumulator 908.
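A software analogue of this accumulate-as-you-go matrix multiplication can be sketched as follows; the tiling is illustrative and does not reflect the circuit's actual PE layout:

```python
import numpy as np

def blocked_matmul(A: np.ndarray, B: np.ndarray, tile: int = 2) -> np.ndarray:
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N))                                  # plays the role of the accumulator 908
    for k0 in range(0, K, tile):
        C += A[:, k0:k0 + tile] @ B[k0:k0 + tile, :]      # partial results accumulated block by block
    return C
```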
统一存储器906用于存放输入数据以及输出数据。权重数据直接通过存储单元访问控制器(Direct Memory Access Controller,DMAC)905被搬运到权重存储器902中。输入数据也通过DMAC被搬运到统一存储器906中。The unified memory 906 is used to store input data and output data. The weight data is directly transferred to the weight memory 902 through the direct memory access controller (DMAC) 905. The input data is also transferred to the unified memory 906 through the DMAC.
总线接口单元(Bus Interface Unit,BIU)510,用于AXI总线与DMAC和取指存储器(Instruction Fetch Buffer)909的交互。The Bus Interface Unit (BIU) 510 is used for the interaction between the AXI bus and the DMAC and the instruction fetch buffer (Instruction Fetch Buffer) 909.
总线接口单元510还用于取指存储器909从外部存储器获取指令,还用于存储单元访 问控制器905从外部存储器获取输入矩阵A或者权重矩阵B的原数据。The bus interface unit 510 is also used for the instruction fetch memory 909 to obtain instructions from the external memory, and also used for the storage unit access controller 905 to obtain the original data of the input matrix A or the weight matrix B from the external memory.
DMAC主要用于将外部存储器DDR中的输入数据搬运到统一存储器906或将权重数据搬运到权重存储器902中或将输入数据数据搬运到输入存储器901中。The DMAC is mainly used to transfer the input data in the external memory DDR to the unified memory 906 or the weight data to the weight memory 902 or the input data to the input memory 901.
向量计算单元907多个运算处理单元,在需要的情况下,对运算电路的输出做进一步处理,如向量乘,向量加,指数运算,对数运算,大小比较等等。主要用于神经网络中非卷积/FC层网络计算,如Pooling(池化),Batch Normalization(批归一化),Local Response Normalization(局部响应归一化)等。The vector calculation unit 907 has multiple arithmetic processing units, if necessary, further processing the output of the arithmetic circuit, such as vector multiplication, vector addition, exponential operation, logarithmic operation, size comparison and so on. Mainly used for non-convolution/FC layer network calculations in neural networks, such as Pooling, Batch Normalization, Local Response Normalization, etc.
在一些实现种,向量计算单元能907将经处理的输出的向量存储到统一缓存器906。例如,向量计算单元907可以将非线性函数应用到运算电路903的输出,例如累加值的向量,用以生成激活值。在一些实现中,向量计算单元907生成归一化的值、合并值,或二者均有。在一些实现中,处理过的输出的向量能够用作到运算电路903的激活输入,例如用于在神经网络中的后续层中的使用。In some implementations, the vector calculation unit 907 can store the processed output vector in the unified buffer 906. For example, the vector calculation unit 907 may apply a nonlinear function to the output of the arithmetic circuit 903, such as a vector of accumulated values, to generate the activation value. In some implementations, the vector calculation unit 907 generates a normalized value, a combined value, or both. In some implementations, the processed output vector can be used as an activation input to the arithmetic circuit 903, for example for use in a subsequent layer in a neural network.
控制器904连接的取指存储器(instruction fetch buffer)909,用于存储控制器904使用的指令;The instruction fetch buffer 909 connected to the controller 904 is used to store instructions used by the controller 904;
统一存储器906,输入存储器901,权重存储器902以及取指存储器909均为On-Chip存储器。The unified memory 906, the input memory 901, the weight memory 902, and the fetch memory 909 are all On-Chip memories.
其中,图3所示的深度神经网络中各层的运算可以由矩阵计算单元212或向量计算单元907执行。Among them, the operations of each layer in the deep neural network shown in FIG. 3 may be executed by the matrix calculation unit 212 or the vector calculation unit 907.
本申请采用NPU实现基于深度神经网络的自然语言处理方法以及训练方法,可以大大提高数据处理设备的处理自然语言处理任务以及训练深度神经网络的效率。In this application, NPU is used to implement a natural language processing method and training method based on a deep neural network, which can greatly improve the efficiency of processing natural language processing tasks and training a deep neural network of a data processing device.
下面从硬件处理的角度对本发明实施例中的数据处理设备进行描述。The following describes the data processing device in the embodiment of the present invention from the perspective of hardware processing.
图10为本申请实施例提供的一种智能终端的部分结构的框图。参考图10,智能终端包括:射频(Radio Frequency,RF)电路1010、存储器1020、输入单元1030、显示单元1040、传感器1050、音频电路1060、无线保真(wireless fidelity,WiFi)模块1070、片上系统(System On Chip,SoC)1080以及电源1090等部件。FIG. 10 is a block diagram of a partial structure of an intelligent terminal provided by an embodiment of the application. 10, the smart terminal includes: a radio frequency (RF) circuit 1010, a memory 1020, an input unit 1030, a display unit 1040, a sensor 1050, an audio circuit 1060, a wireless fidelity (WiFi) module 1070, a system on chip (System On Chip, SoC) 1080 and power supply 1090 and other components.
存储器1020包括DDR存储器,当然还可以包括高速随机存取存储器,或者包括非易失性存储器等其他存储单元,例如至少一个磁盘存储器件、闪存器件、或其他易失性固态存储器件等。The memory 1020 includes DDR memory, of course, may also include high-speed random access memory, or include other storage units such as non-volatile memory, such as at least one magnetic disk storage device, flash memory device, or other volatile solid-state storage devices.
本领域技术人员可以理解,图10中示出的智能终端结构并不构成对智能终端的限定,可以包括比图示更多或更少的部件,或者组合某些部件,或者不同的部件布置。Those skilled in the art can understand that the structure of the smart terminal shown in FIG. 10 does not constitute a limitation on the smart terminal, and may include more or less components than those shown in the figure, or a combination of certain components, or different component arrangements.
下面结合图10对智能终端的各个构成部件进行具体的介绍:The components of the smart terminal are specifically introduced below in conjunction with Figure 10:
RF电路1010可用于收发信息或通话过程中,信号的接收和发送,特别地,将基站的下行信息接收后,给SoC 1080处理;另外,将设计上行的数据发送给基站。通常,RF电路1010包括但不限于天线、至少一个放大器、收发信机、耦合器、低噪声放大器(Low Noise Amplifier,LNA)、双工器等。此外,RF电路1010还可以通过无线通信与网络和其他设备通信。上述无线通信可以使用任一通信标准或协议,包括但不限于全球移动通讯系统(Global System of Mobile communication,GSM)、通用分组无线服务(General Packet Radio Service,GPRS)、码分多址(Code Division Multiple Access,CDMA)、宽带码分多址(Wideband  Code Division Multiple Access,WCDMA)、长期演进(Long Term Evolution,LTE)、电子邮件、短消息服务(Short Messaging Service,SMS)等。The RF circuit 1010 can be used for receiving and sending signals during the process of sending and receiving information or talking. In particular, after receiving the downlink information of the base station, it is processed by SoC 1080; in addition, the designed uplink data is sent to the base station. Generally, the RF circuit 1010 includes but is not limited to an antenna, at least one amplifier, a transceiver, a coupler, a low noise amplifier (LNA), a duplexer, and the like. In addition, the RF circuit 1010 can also communicate with the network and other devices through wireless communication. The above wireless communication can use any communication standard or protocol, including but not limited to Global System of Mobile Communication (GSM), General Packet Radio Service (GPRS), Code Division Multiple Access (Code Division) Multiple Access (CDMA), Wideband Code Division Multiple Access (WCDMA), Long Term Evolution (LTE), Email, Short Messaging Service (SMS), etc.
存储器1020可用于存储软件程序以及模块,SoC 1080通过运行存储在存储器1020的软件程序以及模块,从而执行智能终端的各种功能应用以及数据处理。存储器1020可主要包括存储程序区和存储数据区,其中,存储程序区可存储操作系统、至少一个功能所需的应用程序(比如声音播放功能、图像播放功能、翻译功能、复述功能等)等;存储数据区可存储根据智能终端的使用所创建的数据(比如音频数据、电话本等)等。The memory 1020 may be used to store software programs and modules. The SoC 1080 runs the software programs and modules stored in the memory 1020 to execute various functional applications and data processing of the smart terminal. The memory 1020 may mainly include a program storage area and a data storage area, where the program storage area may store an operating system, an application program required by at least one function (such as a sound playback function, an image playback function, a translation function, a retelling function, etc.), etc.; The data storage area can store data (such as audio data, phone book, etc.) created according to the use of the smart terminal.
输入单元1030可用于接收输入的自然语言文本以及语音数据,以及产生与智能终端的用户设置以及功能控制有关的键信号输入。具体地,输入单元1030可包括触控面板1031以及其他输入设备1032。触控面板1031,也称为触摸屏,可收集用户在其上或附近的触摸操作(比如用户使用手指、触笔等任何适合的物体或附件在触控面板1031上或在触控面板1031附近的操作),并根据预先设定的程式驱动相应的连接装置。触控面板1031用于接收用户输入的自然语言文本,并将该自然语言文本输入至SoC1080。可选的,触控面板1031可包括触摸检测装置和触摸控制器两个部分。其中,触摸检测装置检测用户的触摸方位,并检测触摸操作带来的信号,将信号传送给触摸控制器;触摸控制器从触摸检测装置上接收触摸信息,并将它转换成触点坐标,再送给SoC 1080,并能接收SoC 1080发来的命令并加以执行。此外,可以采用电阻式、电容式、红外线以及表面声波等多种类型实现触控面板1031。除了触控面板1031,输入单元1030还可以包括其他输入设备1032。具体地,其他输入设备1032可以包括但不限于物理键盘、功能键(比如音量控制按键、开关按键等)、轨迹球、鼠标、操作杆、触摸屏、话筒等中的一种或多种。输入设备1032包括的话筒可以接收用户输入的语音数据,并将该语音数据输入至SoC1080。The input unit 1030 can be used to receive input natural language text and voice data, and generate key signal inputs related to user settings and function control of the smart terminal. Specifically, the input unit 1030 may include a touch panel 1031 and other input devices 1032. The touch panel 1031, also known as a touch screen, can collect user touch operations on or near it (for example, the user uses any suitable objects or accessories such as fingers, stylus, etc.) on the touch panel 1031 or near the touch panel 1031. Operation), and drive the corresponding connection device according to the preset program. The touch panel 1031 is used to receive the natural language text input by the user and input the natural language text into the SoC1080. Optionally, the touch panel 1031 may include two parts: a touch detection device and a touch controller. Among them, the touch detection device detects the user's touch position, and detects the signal brought by the touch operation, and transmits the signal to the touch controller; the touch controller receives the touch information from the touch detection device, converts it into contact coordinates, and then sends it Give SoC 1080, and can receive commands from SoC 1080 and execute them. In addition, the touch panel 1031 can be realized by various types such as resistive, capacitive, infrared, and surface acoustic wave. In addition to the touch panel 1031, the input unit 1030 may also include other input devices 1032. Specifically, other input devices 1032 may include, but are not limited to, one or more of a physical keyboard, function keys (such as volume control buttons, switch buttons, etc.), trackball, mouse, joystick, touch screen, microphone, etc. The microphone included in the input device 1032 can receive the voice data input by the user and input the voice data to the SoC1080.
SoC 1080通过运行存储在存储器1020的软件程序以及模块,从而执行本申请提供的数据处理方法对输入单元1030输入的自然语言文本做处理,得到目标结果。SoC 1080也可以在将输入单元1030输入的语音数据转换为自然语言文本后,执行本申请提供的数据处理方法对该自然语言文本做处理,得到目标结果。The SoC 1080 runs the software programs and modules stored in the memory 1020 to execute the data processing method provided in this application to process the natural language text input by the input unit 1030 to obtain the target result. SoC 1080 may also convert the voice data input by the input unit 1030 into natural language text, and then execute the data processing method provided in this application to process the natural language text to obtain the target result.
显示单元1040可用于显示由用户输入的信息或提供给用户的信息以及智能终端的各种菜单。显示单元1040可包括显示面板1041,可选的,可以采用液晶显示器(Liquid Crystal Display,LCD)、有机发光二极管(Organic Light-Emitting Diode,OLED)等形式来配置显示面板1041。显示单元1040可用于显示SoC 1080处理自然语言文本得到的目标结果。进一步的,触控面板1031可覆盖显示面板1041,当触控面板1031检测到在其上或附近的触摸操作后,传送给SoC 1080以确定触摸事件的类型,随后SoC 1080根据触摸事件的类型在显示面板1041上提供相应的视觉输出。虽然在图10中,触控面板1031与显示面板1041是作为两个独立的部件来实现智能终端的输入和输入功能,但是在某些实施例中,可以将触控面板1031与显示面板1041集成而实现智能终端的输入和输出功能。The display unit 1040 may be used to display information input by the user or information provided to the user and various menus of the smart terminal. The display unit 1040 may include a display panel 1041, and optionally, the display panel 1041 may be configured in the form of a liquid crystal display (Liquid Crystal Display, LCD), an organic light-emitting diode (Organic Light-Emitting Diode, OLED), etc. The display unit 1040 can be used to display the target result obtained by the SoC 1080 processing natural language text. Further, the touch panel 1031 can cover the display panel 1041. When the touch panel 1031 detects a touch operation on or near it, it is sent to SoC 1080 to determine the type of touch event, and then SoC 1080 displays the touch event according to the type of touch event. The display panel 1041 provides corresponding visual output. Although in FIG. 10, the touch panel 1031 and the display panel 1041 are used as two independent components to implement the input and input functions of the smart terminal, in some embodiments, the touch panel 1031 and the display panel 1041 can be integrated And realize the input and output functions of the intelligent terminal.
智能终端还可包括至少一种传感器1050,比如光传感器、运动传感器以及其他传感器。具体地,光传感器可包括环境光传感器及接近传感器,其中,环境光传感器可根据环境光线的明暗来调节显示面板1041的亮度,接近传感器可在智能终端移动到耳边时,关闭显示面板1041和/或背光。作为运动传感器的一种,加速计传感器可检测各个方向上(一般为 三轴)加速度的大小,静止时可检测出重力的大小及方向,可用于识别智能终端姿态的应用(比如横竖屏切换、相关游戏、磁力计姿态校准)、振动识别相关功能(比如计步器、敲击)等;至于智能终端还可配置的陀螺仪、气压计、湿度计、温度计、红外线传感器等其他传感器,在此不再赘述。The smart terminal may also include at least one sensor 1050, such as a light sensor, a motion sensor, and other sensors. Specifically, the light sensor can include an ambient light sensor and a proximity sensor. The ambient light sensor can adjust the brightness of the display panel 1041 according to the brightness of the ambient light. The proximity sensor can close the display panel 1041 and the display panel 1041 when the smart terminal is moved to the ear. / Or backlight. As a kind of motion sensor, the accelerometer sensor can detect the magnitude of acceleration in various directions (usually three axes), and can detect the magnitude and direction of gravity when it is stationary, and can be used to identify smart terminal posture applications (such as horizontal and vertical screen switching, Related games, magnetometer posture calibration), vibration recognition related functions (such as pedometer, percussion), etc.; as for other sensors such as gyroscopes, barometers, hygrometers, thermometers, infrared sensors, etc. that can be configured in smart terminals, here No longer.
音频电路1060、扬声器1061,传声器1062可提供用户与智能终端之间的音频接口。音频电路1060可将接收到的音频数据转换后的电信号,传输到扬声器1061,由扬声器1061转换为声音信号输出;另一方面,传声器1062将收集的声音信号转换为电信号,由音频电路1060接收后转换为音频数据,再将音频数据输出SoC 1080处理后,经RF电路1010以发送给比如另一智能终端,或者将音频数据输出至存储器1020以便进一步处理。The audio circuit 1060, the speaker 1061, and the microphone 1062 can provide an audio interface between the user and the smart terminal. The audio circuit 1060 can transmit the electrical signal converted from the received audio data to the speaker 1061, and the speaker 1061 converts it into a sound signal for output; on the other hand, the microphone 1062 converts the collected sound signal into an electrical signal, which is then output by the audio circuit 1060. After being received, the audio data is converted into audio data, and then the audio data is output to SoC 1080 for processing, and then sent to another smart terminal through the RF circuit 1010, or the audio data is output to the memory 1020 for further processing.
WiFi属于短距离无线传输技术,智能终端通过WiFi模块1070可以帮助用户收发电子邮件、浏览网页和访问流式媒体等,它为用户提供了无线的宽带互联网访问。虽然图10示出了WiFi模块1070,但是可以理解的是,其并不属于智能终端的必须构成,完全可以根据需要在不改变发明的本质的范围内而省略。WiFi is a short-distance wireless transmission technology. The smart terminal can help users send and receive emails, browse web pages, and access streaming media through the WiFi module 1070. It provides users with wireless broadband Internet access. Although FIG. 10 shows the WiFi module 1070, it is understandable that it is not a necessary component of the smart terminal, and can be omitted as needed without changing the essence of the invention.
SoC 1080是智能终端的控制中心,利用各种接口和线路连接整个智能终端的各个部分,通过运行或执行存储在存储器1020内的软件程序和/或模块,以及调用存储在存储器1020内的数据,执行智能终端的各种功能和处理数据,从而对智能终端进行整体监控。可选的,SoC 1080可包括多个处理单元,例如CPU或者各种业务处理器;SoC 1080还可集成应用处理器和调制解调处理器,其中,应用处理器主要处理操作系统、用户界面和应用程序等,调制解调处理器主要处理无线通信。可以理解的是,上述调制解调处理器也可以不集成到SoC 1080中。 SoC 1080 is the control center of the intelligent terminal. It uses various interfaces and lines to connect the various parts of the entire intelligent terminal. By running or executing software programs and/or modules stored in the memory 1020, and calling data stored in the memory 1020, Perform various functions of the smart terminal and process data, thereby monitoring the smart terminal as a whole. Optionally, SoC 1080 may include multiple processing units, such as CPUs or various service processors; SoC 1080 may also integrate application processors and modem processors, where the application processor mainly processes operating systems, user interfaces, and For application programs, the modem processor mainly deals with wireless communication. It is understandable that the above modem processor may not be integrated into SoC 1080.
智能终端还包括给各个部件供电的电源1090(比如电池),优选的,电源可以通过电源管理系统与SoC 1080逻辑相连,从而通过电源管理系统实现管理充电、放电、以及功耗管理等功能。The smart terminal also includes a power supply 1090 (such as a battery) for supplying power to various components. Preferably, the power supply can be logically connected to the SoC 1080 through a power management system, so that functions such as charging, discharging, and power management are realized through the power management system.
尽管未示出,智能终端还可以包括摄像头、蓝牙模块等,在此不再赘述。Although not shown, the smart terminal may also include a camera, a Bluetooth module, etc., which will not be repeated here.
图11是本申请实施例提供的一种数据处理设备的部分结构的框图。如图11所示,数据处理设备1100可以处理器1101、存储器1102、输入设备1103、输出设备1104以及总线1105。其中,处理器1101、存储器1102、输入设备1103、输出设备1104通过总线1105实现彼此之间的通信连接。Fig. 11 is a block diagram of a partial structure of a data processing device provided by an embodiment of the present application. As shown in FIG. 11, the data processing device 1100 may include a processor 1101, a memory 1102, an input device 1103, an output device 1104, and a bus 1105. Among them, the processor 1101, the memory 1102, the input device 1103, and the output device 1104 realize the communication connection between each other through the bus 1105.
处理器1101可以采用通用的CPU,微处理器,应用专用集成电路(Application Specific Integrated Circuit,ASIC),或者一个或多个集成电路,用于执行相关程序,以实现本发明实施例所提供的技术方案。处理器1101对应于图8中的处理单元802。The processor 1101 may adopt a general CPU, a microprocessor, an application specific integrated circuit (Application Specific Integrated Circuit, ASIC), or one or more integrated circuits for executing related programs to implement the technology provided by the embodiments of the present invention Program. The processor 1101 corresponds to the processing unit 802 in FIG. 8.
存储器1102可以是只读存储器(Read Only Memory,ROM),静态存储设备,动态存储设备或者随机存取存储器(Random Access Memory,RAM)。存储器1102可以存储操作系统、以及其他应用程序。用于通过软件或者固件来实现本申请实施例提供的数据处理设备包括的模块以及部件所需执行的功能,或者用于实现本申请方法实施例提供的上述方法的程序代码存储在存储器1102中,并由处理器1101读取存储器1102中的代码来执行数据处理设备包括的模块以及部件所需执行的操作,或者执行本申请实施例提供的上述方法。The memory 1102 may be a read only memory (Read Only Memory, ROM), a static storage device, a dynamic storage device, or a random access memory (Random Access Memory, RAM). The memory 1102 may store an operating system and other application programs. The program code used to implement the modules and components of the data processing device provided in the embodiment of the present application through software or firmware, or the program code used to implement the foregoing method provided in the method embodiment of the present application is stored in the memory 1102, And the processor 1101 reads the code in the memory 1102 to execute operations required by the modules and components included in the data processing device, or execute the above-mentioned methods provided in the embodiments of the present application.
输入设备1103,对应于获取单元801,用于输入数据处理设备待处理的自然语言文本。The input device 1103, corresponding to the acquiring unit 801, is used to input natural language text to be processed by the data processing device.
输出设备1104,对应于输出单元803,用于输出数据处理设备得到的目标结果。The output device 1104, corresponding to the output unit 803, is used to output the target result obtained by the data processing device.
总线1105可包括在数据处理设备各个部件(例如处理器1101、存储器1102、输入设备1103、输出设备1104)之间传送信息的通路。The bus 1105 may include a path for transferring information between various components of the data processing device (for example, the processor 1101, the memory 1102, the input device 1103, and the output device 1104).
应注意,尽管图11所示的数据处理设备1100仅仅示出了处理器1101、存储器1102、输入设备1103、输出设备1104以及总线1105,但是在具体实现过程中,本领域的技术人员应当明白,数据处理设备1100还包含实现正常运行所必须的其他器件。同时,根据具体需要,本领域的技术人员应当明白,数据处理设备1100还可包含实现其他附加功能的硬件器件。此外,本领域的技术人员应当明白,数据处理设备1100也可仅仅包含实现本申请实施例所必须的器件,而不必包含图11中所示的全部器件。It should be noted that although the data processing device 1100 shown in FIG. 11 only shows the processor 1101, the memory 1102, the input device 1103, the output device 1104, and the bus 1105, in the specific implementation process, those skilled in the art should understand that, The data processing device 1100 also includes other devices necessary for normal operation. At the same time, according to specific needs, those skilled in the art should understand that the data processing device 1100 may also include hardware devices that implement other additional functions. In addition, those skilled in the art should understand that the data processing device 1100 may also only include the components necessary to implement the embodiments of the present application, and not necessarily include all the components shown in FIG. 11.
An embodiment of this application provides a computer-readable storage medium. The computer-readable storage medium stores a computer program, and the computer program includes software program instructions. When the program instructions are executed by a processor in a data processing device, the data processing method and/or the training method in the foregoing embodiments are implemented.
The foregoing embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used for implementation, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer program instructions are loaded and executed on a computer, the procedures or functions according to the embodiments of this application are produced in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted through a computer-readable storage medium. The computer-readable storage medium may be any usable medium accessible to a computer, or a data storage device such as a server or a data center integrating one or more usable media. The usable medium may be a magnetic medium (for example, a floppy disk, a hard disk, or a magnetic tape), an optical medium (for example, a DVD), or a semiconductor medium (for example, a solid state disk (SSD)).
The foregoing descriptions are merely specific implementations of this application, but the protection scope of this application is not limited thereto. Any person skilled in the art can readily conceive of equivalent modifications or replacements within the technical scope disclosed in this application, and such modifications or replacements shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.

Claims (23)

  1. A natural language processing method, characterized in that the method comprises:
    obtaining natural language text to be processed; and
    processing the natural language text by using a deep neural network obtained through training, and outputting a target result obtained by processing the natural language text, wherein the deep neural network comprises a granularity tagging network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network, and the processing comprises: determining the granularity of each word in the natural language text by using the granularity tagging network; performing feature extraction on words of a first granularity in the natural language text by using the first feature network, and outputting obtained first feature information to the first processing network; performing feature extraction on words of a second granularity in the natural language text by using the second feature network, and outputting obtained second feature information to the second processing network; processing the first feature information by using the first processing network, and outputting an obtained first processing result to the fusion network; processing the second feature information by using the second processing network, and outputting an obtained second processing result to the fusion network; and fusing the first processing result and the second processing result by using the fusion network to obtain the target result, wherein the first granularity and the second granularity are different.
  2. The method according to claim 1, wherein the first feature network and the second feature network have different architectures, and/or the first processing network and the second processing network have different architectures.
  3. The method according to claim 1 or 2, wherein the input of the granularity tagging network is the natural language text, and the determining the granularity of each word in the natural language text by using the granularity tagging network comprises:
    determining, by using the granularity tagging network, the granularity of each word in the natural language text according to N granularities to obtain tagging information of the natural language text, and outputting the tagging information to the first feature network and the second feature network, wherein the tagging information is used to describe the granularity of each word or the probability that each word belongs to each of the N granularities, and N is an integer greater than 1;
    the performing feature extraction on words of the first granularity in the natural language text by using the first feature network comprises:
    processing the words of the first granularity by using the first feature network to obtain the first feature information, wherein the first feature information is a vector or matrix representing the words of the first granularity; and
    the performing feature extraction on words of the second granularity in the natural language text by using the second feature network comprises:
    processing the words of the second granularity by using the second feature network to obtain the second feature information, wherein the second feature information is a vector or matrix representing the words of the second granularity.
  4. The method according to claim 3, wherein the first processing result is a sequence containing one or more words, and the processing the first feature information by using the first processing network comprises:
    processing, by using the first processing network, the input first feature information and the words that the first processing network has already output in the course of processing the first feature information, to obtain the first processing result.
  5. The method according to claim 4, wherein the target result output by the fusion network is a sequence containing one or more words, and the fusing the first processing result and the second processing result by using the fusion network to obtain the target result comprises:
    processing, by using the fusion network, the first processing result, the second processing result, and the words that the fusion network has already output in the course of processing the first processing result and the second processing result, to determine a target word to be output, and outputting the target word.
  6. A training method, characterized in that the method comprises:
    inputting a training sample into a deep neural network for processing to obtain a predicted processing result, wherein the deep neural network comprises a granularity tagging network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network, and the processing comprises: determining the granularity of each word in the training sample by using the granularity tagging network; performing feature extraction on words of a first granularity in the training sample by using the first feature network, and outputting obtained third feature information to the first processing network; performing feature extraction on words of a second granularity in the training sample by using the second feature network, and outputting obtained fourth feature information to the second processing network; performing target processing on the third feature information by using the first processing network, and outputting an obtained third processing result to the fusion network; performing the target processing on the fourth feature information by using the second processing network, and outputting an obtained fourth processing result to the fusion network; and fusing the third processing result and the fourth processing result by using the fusion network to obtain the predicted processing result, wherein the first granularity and the second granularity are different;
    determining a loss corresponding to the training sample according to the predicted processing result and a standard result, wherein the standard result is the processing result expected to be obtained by processing the training sample with the deep neural network; and
    updating parameters of the deep neural network through an optimization algorithm by using the loss corresponding to the training sample.
  7. The method according to claim 6, wherein the first feature network and the second feature network have different architectures, and/or the first processing network and the second processing network have different architectures.
  8. The method according to claim 6 or 7, wherein the input of the granularity tagging network is the natural language text, and the determining the granularity of each word in the natural language text by using the granularity tagging network comprises:
    determining, by using the granularity tagging network, the granularity of each word in the natural language text according to N granularities to obtain tagging information of the natural language text, and outputting the tagging information to the first feature network and the second feature network, wherein the tagging information is used to describe the granularity of each word or the probability that each word belongs to each of the N granularities, and N is an integer greater than 1;
    the performing feature extraction on words of the first granularity in the natural language text by using the first feature network comprises:
    processing the words of the first granularity by using the first feature network to obtain the third feature information, wherein the third feature information is a vector or matrix representing the words of the first granularity; and
    the performing feature extraction on words of the second granularity in the natural language text by using the second feature network comprises:
    processing the words of the second granularity by using the second feature network to obtain the fourth feature information, wherein the fourth feature information is a vector or matrix representing the words of the second granularity.
  9. The method according to claim 8, wherein the third processing result is a sequence containing one or more words, and the processing the third feature information by using the first processing network comprises:
    processing, by using the first processing network, the input third feature information and the words that the first processing network has already output in the course of processing the third feature information, to obtain the third processing result.
  10. The method according to claim 9, wherein the target result output by the fusion network is a sequence containing one or more words, and the fusing the third processing result and the fourth processing result by using the fusion network to obtain the target result comprises:
    processing, by using the fusion network, the third processing result, the fourth processing result, and the words that the fusion network has already output in the course of processing the third processing result and the fourth processing result, to determine a target word to be output, and outputting the target word.
  11. The method according to any one of claims 6 to 10, wherein the updating parameters of the deep neural network through an optimization algorithm by using the loss corresponding to the training sample comprises:
    updating parameters of at least one network included in the deep neural network by using gradient values of a loss function with respect to the at least one network, wherein the loss function is used to calculate the loss between the predicted processing result and the standard result, and while any one of the first feature network, the second feature network, the first processing network, and the second processing network is being updated, the parameters of any one of the other three networks remain unchanged.
  12. A data processing device, characterized in that the device comprises:
    an acquiring unit, configured to obtain natural language text to be processed;
    a processing unit, configured to process the natural language text by using a deep neural network obtained through training, wherein the deep neural network comprises a granularity tagging network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network, and the processing comprises: determining the granularity of each word in the natural language text by using the granularity tagging network; performing feature extraction on words of a first granularity in the natural language text by using the first feature network, and outputting obtained first feature information to the first processing network; performing feature extraction on words of a second granularity in the natural language text by using the second feature network, and outputting obtained second feature information to the second processing network; processing the first feature information by using the first processing network, and outputting an obtained first processing result to the fusion network; processing the second feature information by using the second processing network, and outputting an obtained second processing result to the fusion network; and fusing the first processing result and the second processing result by using the fusion network to obtain a target result, wherein the first granularity and the second granularity are different; and
    an output unit, configured to output the target result obtained by processing the natural language text.
  13. The data processing device according to claim 12, wherein the first feature network and the second feature network have different architectures, and/or the first processing network and the second processing network have different architectures.
  14. The data processing device according to claim 12 or 13, wherein the input of the granularity tagging network is the natural language text;
    the processing unit is specifically configured to determine, by using the granularity tagging network, the granularity of each word in the natural language text according to N granularities to obtain tagging information of the natural language text, and output the tagging information to the first feature network and the second feature network, wherein the tagging information is used to describe the granularity of each word or the probability that each word belongs to each of the N granularities, and N is an integer greater than 1;
    the processing unit is specifically configured to process the words of the first granularity by using the first feature network to obtain the first feature information, wherein the first feature information is a vector or matrix representing the words of the first granularity; and
    the processing unit is specifically configured to process the words of the second granularity by using the second feature network to obtain the second feature information, wherein the second feature information is a vector or matrix representing the words of the second granularity.
  15. The data processing device according to claim 14, wherein the first processing result is a sequence containing one or more words; and
    the processing unit is specifically configured to process, by using the first processing network, the input first feature information and the words that the first processing network has already output in the course of processing the first feature information, to obtain the first processing result.
  16. The data processing device according to claim 15, wherein the target result output by the fusion network is a sequence containing one or more words; and
    the processing unit is specifically configured to process, by using the fusion network, the first processing result, the second processing result, and the words that the fusion network has already output in the course of processing the first processing result and the second processing result, to determine a target word to be output, and output the target word.
  17. A data processing device, characterized in that the device comprises:
    a processing unit, configured to input a training sample into a deep neural network for processing to obtain a predicted processing result, wherein the deep neural network comprises a granularity tagging network, a first feature network, a second feature network, a first processing network, a second processing network, and a fusion network, and the processing comprises: determining the granularity of each word in the training sample by using the granularity tagging network; performing feature extraction on words of a first granularity in the training sample by using the first feature network, and outputting obtained third feature information to the first processing network; performing feature extraction on words of a second granularity in the training sample by using the second feature network, and outputting obtained fourth feature information to the second processing network; performing target processing on the third feature information by using the first processing network, and outputting an obtained third processing result to the fusion network; performing the target processing on the fourth feature information by using the second processing network, and outputting an obtained fourth processing result to the fusion network; and fusing the third processing result and the fourth processing result by using the fusion network to obtain the predicted processing result, wherein the first granularity and the second granularity are different; and
    the processing unit is further configured to determine a loss corresponding to the training sample according to the predicted processing result and a standard result, wherein the standard result is the processing result expected to be obtained by processing the training sample with the deep neural network, and to update parameters of the deep neural network through an optimization algorithm by using the loss corresponding to the training sample.
  18. The data processing device according to claim 17, wherein the first feature network and the second feature network have different architectures, and/or the first processing network and the second processing network have different architectures.
  19. The data processing device according to claim 17 or 18, wherein the input of the granularity tagging network is the natural language text;
    the processing unit is specifically configured to determine, by using the granularity tagging network, the granularity of each word in the natural language text according to N granularities to obtain tagging information of the natural language text, and output the tagging information to the first feature network and the second feature network, wherein the tagging information is used to describe the granularity of each word or the probability that each word belongs to each of the N granularities, and N is an integer greater than 1;
    the processing unit is specifically configured to process the words of the first granularity by using the first feature network to obtain the third feature information, wherein the third feature information is a vector or matrix representing the words of the first granularity; and
    the processing unit is specifically configured to process the words of the second granularity by using the second feature network to obtain the fourth feature information, wherein the fourth feature information is a vector or matrix representing the words of the second granularity.
  20. The data processing device according to claim 19, wherein the third processing result is a sequence containing one or more words; and
    the processing unit is specifically configured to process, by using the first processing network, the input third feature information and the words that the first processing network has already output in the course of processing the third feature information, to obtain the third processing result.
  21. The data processing device according to claim 20, wherein the target result output by the fusion network is a sequence containing one or more words; and
    the processing unit is specifically configured to process, by using the fusion network, the third processing result, the fourth processing result, and the words that the fusion network has already output in the course of processing the third processing result and the fourth processing result, to determine a target word to be output, and output the target word.
  22. The data processing device according to any one of claims 17 to 21, wherein
    the processing unit is specifically configured to update parameters of at least one network included in the deep neural network by using gradient values of a loss function with respect to the at least one network, wherein the loss function is used to calculate the loss between the predicted processing result and the standard result, and while any one of the first feature network, the second feature network, the first processing network, and the second processing network is being updated, the parameters of any one of the other three networks remain unchanged.
  23. A computer-readable storage medium, wherein the computer storage medium stores a computer program, the computer program comprises program instructions, and the program instructions, when executed by a processor, cause the processor to perform the method according to any one of claims 1 to 11.
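For orientation, the following is a minimal, hypothetical sketch of the structure recited in claims 1, 6 and 11: a granularity tagging network that scores each word against two granularities, two feature networks, two processing networks, and a fusion network, plus a training step in which one sub-network is updated while the parameters of the others are held fixed. PyTorch, the soft routing of words by granularity probability, the single-step classification output (instead of the word-by-word sequence generation of claims 4, 5, 9 and 10), and all module names and dimensions are assumptions made for illustration; the claims do not prescribe any particular architecture.

```python
# Hypothetical sketch (PyTorch assumed); illustrates the data flow of claims 1 and 6 only.
import torch
import torch.nn as nn

class MultiGranularityNet(nn.Module):
    def __init__(self, vocab_size, hidden=128, out_vocab=1000):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        # Granularity tagging network: per-word distribution over N = 2 granularities.
        self.tagger = nn.Linear(hidden, 2)
        # Two feature networks (they may have different architectures, cf. claim 2).
        self.feat1 = nn.GRU(hidden, hidden, batch_first=True)   # first-granularity words
        self.feat2 = nn.Linear(hidden, hidden)                   # second-granularity words
        # Two processing networks and a fusion network.
        self.proc1 = nn.Linear(hidden, hidden)
        self.proc2 = nn.Linear(hidden, hidden)
        self.fuse = nn.Linear(2 * hidden, out_vocab)

    def forward(self, token_ids):                      # token_ids: (batch, seq_len)
        x = self.embed(token_ids)                      # (batch, seq_len, hidden)
        # Probability that each word belongs to each of the two granularities.
        gran = torch.softmax(self.tagger(x), dim=-1)   # (batch, seq_len, 2)
        # Soft routing: weight word features by their granularity probabilities.
        x1 = x * gran[..., 0:1]                        # first-granularity share
        x2 = x * gran[..., 1:2]                        # second-granularity share
        f1, _ = self.feat1(x1)                         # first feature information
        f2 = torch.relu(self.feat2(x2))                # second feature information
        p1 = torch.relu(self.proc1(f1)).mean(dim=1)    # first processing result
        p2 = torch.relu(self.proc2(f2)).mean(dim=1)    # second processing result
        return self.fuse(torch.cat([p1, p2], dim=-1))  # fused target result (logits)

# Training step in the spirit of claims 6 and 11: compute the loss between the predicted
# result and the standard (expected) result, then update only one sub-network while the
# parameters of the other sub-networks remain unchanged.
def train_step(model, optimizer, token_ids, target, update_only="proc1"):
    for name, p in model.named_parameters():
        p.requires_grad = name.startswith(update_only)
    logits = model(token_ids)
    loss = nn.functional.cross_entropy(logits, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Example usage (shapes are illustrative):
# model = MultiGranularityNet(vocab_size=5000)
# optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
# ids = torch.randint(0, 5000, (4, 12)); gold = torch.randint(0, 1000, (4,))
# train_step(model, optimizer, ids, gold, update_only="proc1")
```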

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910108559.9A CN109902296B (en) 2019-01-18 2019-01-18 Natural language processing method, training method and data processing equipment
CN201910108559.9 2019-01-18

Publications (1)

Publication Number Publication Date
WO2020147369A1 (en) 2020-07-23

Family

ID=66944544

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/114146 WO2020147369A1 (en) 2019-01-18 2019-10-29 Natural language processing method, training method, and data processing device

Country Status (2)

Country Link
CN (1) CN109902296B (en)
WO (1) WO2020147369A1 (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109902296B (en) * 2019-01-18 2023-06-30 华为技术有限公司 Natural language processing method, training method and data processing equipment
CN110472063B (en) * 2019-07-12 2022-04-08 新华三大数据技术有限公司 Social media data processing method, model training method and related device
CN112329465A (en) * 2019-07-18 2021-02-05 株式会社理光 Named entity identification method and device and computer readable storage medium
CN110705273B (en) * 2019-09-02 2023-06-13 腾讯科技(深圳)有限公司 Information processing method and device based on neural network, medium and electronic equipment
CN110837738B (en) * 2019-09-24 2023-06-30 平安科技(深圳)有限公司 Method, device, computer equipment and storage medium for identifying similarity
CN110674783B (en) * 2019-10-08 2022-06-28 山东浪潮科学研究院有限公司 Video description method and system based on multi-stage prediction architecture
CN111444686B (en) * 2020-03-16 2023-07-25 武汉中科医疗科技工业技术研究院有限公司 Medical data labeling method, medical data labeling device, storage medium and computer equipment
CN112488290B (en) * 2020-10-21 2021-09-07 上海旻浦科技有限公司 Natural language multitask modeling and predicting method and system with dependency relationship

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10635949B2 (en) * 2015-07-07 2020-04-28 Xerox Corporation Latent embeddings for word images and their semantics
CN107918782B (en) * 2016-12-29 2020-01-21 中国科学院计算技术研究所 Method and system for generating natural language for describing image content
EP3376400A1 (en) * 2017-03-14 2018-09-19 Fujitsu Limited Dynamic context adjustment in language models
CN108460089B (en) * 2018-01-23 2022-03-01 海南师范大学 Multi-feature fusion Chinese text classification method based on Attention neural network

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20160162478A1 (en) * 2014-11-25 2016-06-09 Lionbridge Techologies, Inc. Information technology platform for language translation and task management
CN107145483A (en) * 2017-04-24 2017-09-08 北京邮电大学 A kind of adaptive Chinese word cutting method based on embedded expression
CN107797985A (en) * 2017-09-27 2018-03-13 百度在线网络技术(北京)有限公司 Establish synonymous discriminating model and differentiate the method, apparatus of synonymous text
CN108268643A (en) * 2018-01-22 2018-07-10 北京邮电大学 A kind of Deep Semantics matching entities link method based on more granularity LSTM networks
CN109902296A (en) * 2019-01-18 2019-06-18 华为技术有限公司 Natural language processing method, training method and data processing equipment

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116032798A (en) * 2022-12-28 2023-04-28 天翼云科技有限公司 Automatic testing method and device for zero-trust identity authorization

Also Published As

Publication number Publication date
CN109902296A (en) 2019-06-18
CN109902296B (en) 2023-06-30

Legal Events

Code  Title / Description
121   Ep: the epo has been informed by wipo that ep was designated in this application (Ref document number: 19910183; Country of ref document: EP; Kind code of ref document: A1)
NENP  Non-entry into the national phase (Ref country code: DE)
122   Ep: pct application non-entry in european phase (Ref document number: 19910183; Country of ref document: EP; Kind code of ref document: A1)