CN112464673B - Language meaning understanding method for fusing meaning original information - Google Patents

Language meaning understanding method for fusing meaning original information

Info

Publication number
CN112464673B
CN112464673B (application CN202011431776.0A)
Authority
CN
China
Prior art keywords
sense
word
original
decoder
steps
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011431776.0A
Other languages
Chinese (zh)
Other versions
CN112464673A (en)
Inventor
王念滨
汪先慈
张耘
周连科
王红滨
张毅
崔琎
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Harbin Engineering University
Original Assignee
Harbin Engineering University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Harbin Engineering University
Priority to CN202011431776.0A
Publication of CN112464673A
Application granted
Publication of CN112464673B
Status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/30 Semantic analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00 Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Machine Translation (AREA)

Abstract

A language meaning understanding method fusing sense-origin information, belonging to the technical field of language information processing. The method aims to solve the problems that existing language modeling methods are highly complex and cannot balance complexity against effectiveness. First, the language is processed word by word along two paths. Left path: word encoder + RNN + word decoder, whose output is denoted w_l. Right path: sense original encoder + RNN + sense original decoder + word decoder + sigmoid, whose output is denoted w_r. The outputs of the two paths are then fused. The method is mainly used for language meaning understanding.

Description

Language meaning understanding method for fusing meaning original information
Technical Field
The invention relates to a language meaning understanding method, and belongs to the technical field of language information processing.
Background
Language modeling (LM) is a core task of natural language processing (NLP) and language understanding. The purpose of language modeling is to estimate the probability distribution of a word sequence. LMs play a key role in text generation tasks such as machine translation and document summarization.
A language model always predicts words based on context. A simple N-gram language model assumes that each word depends only on the previous N-1 words. In recent years, more and more neural network models have been built; they have achieved state-of-the-art performance and are still improving. A classical neural network language model consists of an encoder and a decoder. In the encoder, a neural network receives a language sequence and encodes the left context of each word into a vector. In the decoder, the neural network takes the vector from the encoder and attempts to predict the word. The invention refers to all the words to the left of a word as its "left context".
These LMs rely on recurrent neural networks (RNNs) or other modern neural networks to learn the context of words, and then use a similar decoder to obtain the probability distribution of the next word. As computing power increases, many researchers have focused on creating complex structures that can better understand context. These efforts help, but more effort should be put into how the context is utilized.
Expert-annotated data helps to model the context and increases its interpretability. HowNet is an annotated resource that labels each concept in Chinese with one or more related sense origins and reveals the relationships between sense origins. A sense origin (sememe) is an indivisible semantic unit of human language defined by linguists. HowNet contains the definition, sentiment, examples, etc. of a word, but only the definition is always filled in. Lower layers refine the definitions of the layers above them, and except for the first layer, each sense origin has exactly one parent sense origin. The invention considers that HowNet can be fused into a neural network and brings an obvious improvement.
In recent years, much work has been done with HowNet, such as sentiment analysis and word similarity calculation. Some of these works have achieved good results, such as the sememe-driven language model (SDLM) and representation learning, and they are instructive for the present invention. They use sense-origin information for language modeling, attention, and sense-origin prediction. However, compared with the quality and cost of HowNet, these methods are not advanced enough; they are complex and difficult to popularize.
Disclosure of Invention
The invention aims to solve the problems that existing language modeling methods are highly complex and cannot balance complexity against effectiveness.
The language meaning understanding method of the fusion meaning original information comprises the following steps:
processing the language by taking each word as a unit along two paths; left path: word encoder + RNN + word decoder, with the left-path output denoted w_l; right path: sense original encoder + RNN + sense original decoder + word decoder + sigmoid, with the right-path output denoted w_r; then fusing the outputs of the two paths;
the processing procedure of the original encoder comprises the following steps:
creating a trainable matrix with N rows, where N is the number of words and word i corresponds to the i-th row of the matrix, and taking each sense origin as a vector:
s=select(w,M)
emb = ∑_j s_j * EMB(j)
where M ∈ R^{H1×H2} is a constant matrix of sense-origin weights for words; H1 and H2 denote the number of words and the number of sense origins; select is a function that extracts from M the sense-origin weight vector of word w; s is the vector composed of the s_j and represents the probability distribution of the sense origins contained in the word, each s_j denoting the weight of one sense origin contained in the word; EMB is a trainable embedding layer that converts each sense origin into a corresponding vector;
the processing procedure of the word decoder comprises the following steps:
according to the base Yu Yiyuan decoder, the sense original probability needs to be converted into words:
w=sM T
wherein M is T Is the transpose of M;
the process of fusing the outputs of the two paths comprises the following steps:
according to the output w_l of the left path and the output w_r of the right path, the fused output is obtained:
p(w|g) = softmax(w_l · w_r)
where the product is taken element-wise and p(w|g) represents the probability of word w with respect to the context;
further, the weight s_j of a sense origin contained in a word is assigned according to the sense-origin tree, where word w has l_h sense origins at the h-th level of the tree and j denotes one of them;
further, when the number of layers in the sense-origin tree is 2 or more, the weight is computed with g', the total number of layers of the sense-origin tree;
further, the processing procedure of the sense original decoder comprises the following steps:
the output of RNN in the right path is first processed as follows:
q = σ(g^T V + b)
where V and b are trainable parameters, σ represents an activation function, and g is the output of the RNN;
then the hidden state q is taken as input to calculate the sense-origin score; for sense origin k in vector s, the score is:
s_k = q^T U_k
where U_k is a conversion matrix;
further, the conversion matrix U_k is modeled as a weighted sum of R basis matrices:
U_k = ∑_{r=1}^{R} α_{k,r} U_r
where the U_r are trainable basis matrices and the α_{k,r} > 0 are trainable parameters;
Further, the sigmoid processing procedure in the right path includes the following steps:
first, the output of the word decoder is processed with a Sigmoid function to obtain w_r;
then w_r is updated using the following formula:
w_r = w_r × (1 - X) + X
where X is a constant parameter that determines the offset.
The beneficial effects are that:
the invention provides a new data expansion method which is realized by taking a meaning source as an additional input. Thus, for the input of one word, there are two types of inputs, one is the original word and the other is the sense source. The two types of inputs are modeled by the same intermediate layer, and at the last layer of the model, the outputs of the two paths are fused together. This is a simple generation method that fuses the sense primitive knowledge with the neural network. The present invention refers to this approach as data enhancement (SBDE) of the base Yu Yi source. Thus, the integration of SBDE as a model may make the model more robust. The effectiveness of SBDE was demonstrated by language modeling experiments and by downstream application experiments of title generation.
Drawings
FIG. 1 is a schematic diagram of the relation between words and sense origins;
FIG. 2 is a diagram of the sense-origin generation model;
FIG. 3 is a schematic diagram of the language model of the present invention;
FIG. 4 is an example diagram of the sense-origin embedding layer.
Detailed Description
The first embodiment is as follows:
the meaning sources are unique concepts in HowNet, and the design purpose is to represent basic semantic information by using a fixed part of words, so that any other words can be represented by the meaning sources. These representations are defined by the authors of HowNet, are numerous in terms of vocabulary, are structurally rich, and contain a large amount of human experience information. However, as a database defined by people, limited by the increase of expert level and knowledge amount, there are inevitably some subjectivity and timeliness related problems. Although the errors caused by these problems may not be obvious for the current deep learning method, the fusion of the problems with the deep learning method can complement each other, and combine the advantages of different methods.
HowNet was designed and constructed by the father and son Dong Zhendong and Dong Qiang. It takes the concepts (senses) represented by English and Chinese words as its objects of description and attempts to build a knowledge base whose basic content is the relations between concepts and between the attributes of concepts. The idea mainly comes from reductionism: any meaning of a word can be composed of one or more smaller semantic units, which are called sense origins (sememes) in HowNet. Sense origins are the smallest, indivisible semantic units. HowNet not only lists the sense origins contained in a concept but also records the relations and structure among them, which as a whole form a tree. The nodes of the tree represent sense origins, while the edges represent relations between sense origins, such as modification and composition.
Sense origins were pursued by researchers of various traditional methods in the early years, but with the decline of traditional statistical methods they appeared in papers less and less frequently. In the current era dominated by deep learning models, researchers mostly pursue ever larger amounts of data and parameters rather than finely expert-annotated data. The present invention seeks to fuse such data with a neural network model. Since the information attached to sense origins is complicated and varied and is difficult to arrange into the matrix format commonly used in neural networks, the present invention extracts only the sense origins and the hierarchical relations between them from HowNet, and does not record the modification relations between sense origins, in order to reduce the complexity of the model. And since in written language words with different semantics may contain the same sense origin, the invention chooses to take only the highest one in the tree.
As shown in FIG. 1, a word may contain multiple senses, and for each sense a relation graph between different sense origins is built according to that sense. These relations form a tree, and the deeper the level, the more fine-grained the semantics it represents.
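To make this extracted structure concrete, the following Python sketch shows one way the word-to-sense-to-sense-origin annotation could be held in memory. It is a minimal illustration under our own assumptions: the word "apple" and the sense-origin names are invented examples rather than actual HowNet entries, and only the node names and hierarchy levels kept by the invention are stored (levels deeper than 2 are folded into level 2).

```python
# Minimal sketch of the extracted word -> sense -> sense-origin annotation.
# The word, senses, and sense-origin names are invented, not real HowNet data.
from dataclasses import dataclass, field
from typing import List

@dataclass
class SememeNode:
    name: str                                  # sense-origin name
    level: int                                 # level in the sense-origin tree
    children: List["SememeNode"] = field(default_factory=list)

# One word may carry several senses; each sense has its own sense-origin tree.
word_annotation = {
    "apple": [
        SememeNode("fruit", 1, [SememeNode("edible", 2)]),    # hypothetical sense 1
        SememeNode("computer", 1, [SememeNode("tool", 2)]),   # hypothetical sense 2
    ]
}

def sememes_by_level(word: str):
    """Collect (name, level) pairs, folding levels deeper than 2 into level 2."""
    out = []
    def walk(node: SememeNode):
        out.append((node.name, min(node.level, 2)))
        for child in node.children:
            walk(child)
    for root in word_annotation.get(word, []):
        walk(root)
    return out

print(sememes_by_level("apple"))  # [('fruit', 1), ('edible', 2), ('computer', 1), ('tool', 2)]
```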
Language models are the basis for various NLP tasks. Modern language models typically analyze the probability distribution of words in a given context. When a statistical language model is presented, the model assigns probabilities to a sequence S with N words, as shown in the following formula:
P(s) = P(w_1 w_2 … w_N) = P(w_1 w_2 … w_{N-1}) P(w_N | w_1 w_2 … w_{N-1}) = P(w_1) P(w_2 | w_1) … P(w_N | w_1 w_2 … w_{N-1})
where w_i represents the i-th word in the sequence s; the joint probability of the N words can be decomposed into the product of the joint probability of the first N-1 words and the conditional probability of the N-th word.
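To illustrate this chain-rule decomposition, the short Python sketch below scores a toy three-word sequence with hand-written conditional probabilities; the vocabulary and the probability values are invented solely for the example.

```python
import math

# Toy conditional probabilities P(w_i | w_1 ... w_{i-1}); the numbers are invented.
cond_prob = {
    ("<s>",): {"the": 0.5, "a": 0.5},
    ("<s>", "the"): {"apple": 0.3, "orange": 0.2, "model": 0.5},
    ("<s>", "the", "apple"): {"is": 0.6, "was": 0.4},
}

def sequence_log_prob(words):
    """log P(w_1 ... w_N) = sum_i log P(w_i | w_1 ... w_{i-1})."""
    context = ("<s>",)
    log_p = 0.0
    for w in words:
        log_p += math.log(cond_prob[context][w])
        context = context + (w,)
    return log_p

# P("the apple is") = 0.5 * 0.3 * 0.6 = 0.09
print(round(math.exp(sequence_log_prob(["the", "apple", "is"])), 4))
```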
With the popularity of neural networks, Mikolov et al. first proposed applying RNNs to language models. They rely on a recurrent structure to capture longer context information and train through backpropagation through time. While an RNN can in theory fit any function, it has difficulty learning very long dependencies, partly because of the gradient explosion and vanishing problems. To address this, new models were developed and many regularization and optimization methods were introduced. AWD-LSTM is an outstanding model that reached the state of the art on the Penn Treebank (PTB) and WikiText-2 datasets.
Recently, the effectiveness of pre-trained language models has been remarkable and has attracted strong research interest. These pre-trained models often contain many parameters and are trained on vast corpora.
With the development of deep learning, it has become desirable to train networks on large corpora. In recent years, the importance of expert manually annotated knowledge bases has received more and more attention, and some research results on sense-origin prediction have appeared. Niu et al. use the sense-origin information of words to improve the quality of word embeddings and accurately capture the meaning of a word in a particular context. Ruobing Xie uses pre-trained word embeddings and matrix factorization to predict the sense origins of words, and Jin et al. further consider Chinese characters and position information on this basis.
There has also been a prior attempt to combine sense origins with a language model, named SDLM (Sememe-Driven Language Model). SDLM is a valuable attempt to fuse sense-origin information into a neural network, proving it beneficial to language modeling. Its input and encoder are the same as in a usual language model; only the decoder is changed. As shown in FIG. 2, SDLM uses two paths to generate the sense-origin vector and computes through a constant sense-origin matrix. While successful, the encoder of SDLM receives words as input while its decoder needs to generate the sense-origin context; thus, the encoder has to capture not only the context of the words but also the context of the sense origins. But there is a problem here: the encoder cannot directly obtain the sense-origin information. To bridge the gap between encoder and decoder, the present invention attempts to solve this problem by the following method and offers a different research perspective in this field.
The language meaning understanding method of the fusion sense original information according to the embodiment comprises the following steps:
generating a language based on a sense source:
in this section, the present invention proposes a simple method to use the semblance information to refine the results of the NLP model. Some words are not candidates if their sense origins are too far from the context. This method will be described in detail in the subsequent experiments and examples of the present invention. In a general language model, an input sequence is fed into a neural network and the next word is predicted from its preceding word. As a description of the knowledge network, each word may be defined as a structured meaning source. Based on the theory, the invention can take the meaning source of the context as input, and further predict the probability of the meaning source contained in the current word.
In fact, the main benefit of HowNet is to provide a smaller, simpler space, which helps the model learn more easily. Since the sense-origin space is much smaller than the word space, different words may share the same sense origins. Consider two sentences: "the apple is delicious" and "the orange is delicious". Obviously, learning "the fruit is delicious" once is much easier than learning "the apple is delicious" and "the orange is delicious" separately, so a neural network can benefit from the specific information carried by sense origins.
The method proposed by the invention establishes a parallel path from input to output. If the details of the encoder and decoder are ignored, the parallel path is identical to the other path. In the encoder, words are converted into sense origins and an encoded vector is derived from them; in the decoder, there is a new path to determine which sense origins are not contained in the current word. The language model of the present invention is shown in FIG. 3 and includes two paths. Left path: word encoder (abbreviated emb) + RNN + word decoder, whose output is denoted w_l. Right path: sense-origin encoder (sememe embedding, abbreviated sememe emb) + RNN + sense-origin decoder (sememe decoder) + word decoder + sigmoid, whose output is denoted w_r. The outputs of the two paths are then fused (MUL).
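For concreteness, the following PyTorch sketch wires the two paths together under our own assumptions: the class and parameter names are ours, a GRU stands in for the RNN, the sense-origin decoder is reduced to a single linear layer (the two-step decoder and the basis-matrix trick are described below), and the constant word-to-sense-origin weight matrix M is taken as given.

```python
import torch
import torch.nn as nn

class TwoPathLM(nn.Module):
    """Sketch of the two-path model of FIG. 3; names, sizes, and the GRU choice are illustrative."""
    def __init__(self, n_words, n_sememes, emb_dim, hidden_dim, M, X=0.1):
        super().__init__()
        # Left path: word encoder (emb) + RNN + word decoder.
        self.word_emb = nn.Embedding(n_words, emb_dim)
        self.left_rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.word_decoder = nn.Linear(hidden_dim, n_words)
        # Right path: sense-origin encoder + RNN + sense-origin decoder + word decoder + sigmoid.
        self.register_buffer("M", M)               # (n_words, n_sememes) constant sense-origin weights
        self.sememe_emb = nn.Embedding(n_sememes, emb_dim)
        self.right_rnn = nn.GRU(emb_dim, hidden_dim, batch_first=True)
        self.sememe_decoder = nn.Linear(hidden_dim, n_sememes)  # simplified stand-in for q and s_k
        self.X = X                                  # offset of the modified sigmoid

    def forward(self, word_ids):                    # word_ids: (batch, seq)
        # Left path.
        g_l, _ = self.left_rnn(self.word_emb(word_ids))
        w_l = self.word_decoder(g_l)                              # (batch, seq, n_words)
        # Right path: words -> sense-origin weight vectors -> weighted sense-origin embeddings.
        s_in = self.M[word_ids]                                   # (batch, seq, n_sememes)
        emb_r = s_in @ self.sememe_emb.weight                     # emb = sum_j s_j * EMB(j)
        g_r, _ = self.right_rnn(emb_r)
        s = self.sememe_decoder(g_r)                              # sense-origin scores
        w_r = s @ self.M.t()                                      # w = s * M^T, back to word scores
        w_r = torch.sigmoid(w_r) * (1 - self.X) + self.X          # modified sigmoid, range (X, 1)
        # Fusion: element-wise product of the two paths, then softmax over the vocabulary.
        return torch.softmax(w_l * w_r, dim=-1)

# Toy usage with random data; M would normally be built from the HowNet annotation.
M = torch.rand(100, 30)
model = TwoPathLM(n_words=100, n_sememes=30, emb_dim=16, hidden_dim=32, M=M)
probs = model(torch.randint(0, 100, (2, 5)))
print(probs.shape)    # torch.Size([2, 5, 100])
```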
An Embedding part:
in the original LM (language model), embedding is a step of taking words as vectors. Because vectors can obtain more information than a single number. In the left path, most models of the NLP region design an embedded component. The embedding is accomplished by creating a trainable matrix having N rows, where N is the number of words and i represents the sequence number of the word, corresponding to the ith row of the trainable matrix, as with the original LM, the right path of the model of the present invention takes each sense as a vector. The method comprises the following steps:
s=select(w,M)
emb = ∑_j s_j * EMB(j)
where M ∈ R^{H1×H2} is a constant matrix of sense-origin weights for words; H1 and H2 denote the number of words and the number of sense origins. select is a function that extracts from M the sense-origin weight vector of word w; s is the sense-origin weight vector composed of the s_j and represents the probability distribution of the sense origins contained in the word, where each s_j denotes the weight of one sense origin. EMB is a trainable embedding layer that converts each sense origin into a corresponding vector.
Obviously, these sense origins play different roles in the context: some are important and some are not. Thus, if word w has l_h sense origins at the h-th level of the sense-origin tree and j is one of them, a level-dependent weight is used to distinguish the sense origins of different layers. However, since the sense origins are mostly located in the first and second layers, the present invention treats all sense origins with h > 2 as h = 2; in other words, when the number of layers is 2 or more, the weight is computed with g', the total number of layers of the sense-origin tree. This weighting can also be represented as FIG. 4, from which the sense-origin and hierarchy information of each word can be obtained.
For each sense origin, the invention performs the same embedding operation as for word embedding.
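A small sketch of this sense-origin embedding step is given below. The exact level-dependent weight formula appears only as an equation image in the original publication, so the weighting used here, equal sharing within each level and across the g' levels, is a placeholder assumption; the sense-origin indices and levels are likewise invented.

```python
import torch
import torch.nn as nn

def sememe_weight_vector(sememe_levels, n_sememes, total_levels=2):
    """Build the weight vector s for one word from its sense origins and their levels.

    `sememe_levels` maps sense-origin index j -> level h; levels deeper than 2 are
    folded into level 2. The weighting below (each of the g' levels shares the mass
    equally among its l_h sense origins) is only a stand-in for the level-dependent
    weight defined in the original publication.
    """
    s = torch.zeros(n_sememes)
    by_level = {}
    for j, h in sememe_levels.items():
        by_level.setdefault(min(h, 2), []).append(j)
    for h, members in by_level.items():
        for j in members:
            s[j] = 1.0 / (total_levels * len(members))
    return s

n_sememes, emb_dim = 2000, 300
EMB = nn.Embedding(n_sememes, emb_dim)                        # trainable sense-origin embedding layer
s = sememe_weight_vector({3: 1, 17: 2, 42: 3}, n_sememes)     # hypothetical indices and levels
emb = s @ EMB.weight                                          # emb = sum_j s_j * EMB(j), shape (emb_dim,)
print(emb.shape)
```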
RNN part:
in the right path, the present invention uses the same intermediate layer (RNN) between the encoder and decoder as the left path. The right path of the invention is completely independent of the left path, so that the performance of the original network is not affected.
A Decoder section:
the function of the decoder is to convert the hidden representation into word probabilities. The structure of the knowledge network does not contain hidden rules which are not known or written out by human expert. But such rules contain word-to-word inherent relationships and can be used to eliminate word uncertainty.
1) Sense original decoder
The present invention designs the sense original decoder in two steps: the first step decodes the output of the intermediate layer, and the second step selects a parameterized matrix for each sense origin k.
The first step: a linear decoder is designed with an activation function:
q = σ(g^T V + b)
where V and b are trainable parameters, σ represents the activation function, and g is the output of the RNN.
The second step: the hidden state q is used as input to calculate the sense-origin score. The invention uses its own linear function to describe the probability of each sense origin. For sense origin k in vector s, the score is:
s_k = q^T U_k
where U_k is a conversion matrix;
the invention regards each sense primitive as an expert and predicts probability distributions according to context, which reflect the prediction of sense primitive probabilities of all other words to the sense primitive of the current word. The sense origin decoder portion calculates the sense origin probability of the word.
2) Word decoder
The final output of the language model is a word probability. Therefore, the sense-origin probabilities need to be converted into word probabilities:
w = s M^T
where s is the vector of sense-origin probabilities and each sense origin is associated with several words. M^T ∈ R^{H2×H1} is the transpose of M and represents the sense-origin probability distribution of each word. As described above, M^T has H2 rows, each containing H1 numbers, which represent the probabilities defined by HowNet. Once the sense-origin probabilities and the sense-origin-to-word relation matrix are obtained, the word probability distribution can be calculated from s.
Fusion:
although the net is known to be an excellent word dataset that annotates most of the meaning of words, some information may be lost or outdated so the present invention attempts to propose a fusion method to avoid the above-mentioned problems. The invention merges the original neural network (left path) and the additional path (right path). The invention selects multiplication and makes some corrections.
For the above reasons, the present invention treats the language model as a multi-label classification task. Thus, the present invention applies a sigmoid function to the output of the word decoder in the right path and denotes its output as w_r. At the same time, on the basis of HowNet, w_r can be used to prune some words with low likelihood.
In the left path, w_l is obtained from the fully connected part of a standard word decoder; in the right path, w_r is obtained from the final sigmoid function. Finally, multiplying w_l by w_r achieves the effect of pruning the left-path result.
Thus, the overall model output p(w|g), i.e. the probability distribution of word w with respect to the context g, can be derived from the results of the left and right paths:
p(w|g) = softmax(w_l · w_r)
additional details:
sigmoid: since the scope of the Sigmoid function is (0, 1). In a given context, there will always be more words that are irrelevant than there are words that are relevant, these words letting w l w r Becomes w l 0. Therefore, it is very difficult to obtain a counter-propagating gradient for the left path, and it is very difficult to train better with the model, many times. The invention herein uses a sigmoid function and modifies the output of the sigmoid (the whole can be regarded as a modified sigmoid function), modifying its range to (X, 1); the final formula is w r X (1-X) +x, where X determines the constant parameter of the offset.
Basis matrix: providing a separate conversion matrix U_k for each sense origin k is extremely space-consuming, so a trick is designed here. U_k is modeled with R basis matrices and their weighted sum, as follows:
U_k = ∑_{r=1}^{R} α_{k,r} U_r
where the U_r are trainable basis matrices and the α_{k,r} > 0 are trainable parameters.
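The sketch below illustrates this weighted-sum construction under the same assumptions as the decoder sketch above (each U_k stored as one hidden_dim-sized row). Enforcing α_{k,r} > 0 with softplus and normalizing the weights to sum to 1 are our own choices; the exact constraint on the α_{k,r} appears only as an equation image in the original publication.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BasisSememeMatrices(nn.Module):
    """Sketch: model each U_k as a weighted sum of R shared basis matrices."""
    def __init__(self, hidden_dim, n_sememes, R=8):
        super().__init__()
        # R shared basis matrices (stored, as in the decoder sketch above, with an
        # assumed per-sense-origin shape of hidden_dim x 1, i.e. one row each).
        self.bases = nn.Parameter(torch.randn(R, hidden_dim) * 0.02)
        # Raw weights; softplus keeps alpha_{k,r} > 0, and normalizing each row to
        # sum to 1 is assumed here since the exact constraint is not spelled out.
        self.raw_alpha = nn.Parameter(torch.zeros(n_sememes, R))

    def forward(self):
        alpha = F.softplus(self.raw_alpha)
        alpha = alpha / alpha.sum(dim=1, keepdim=True)       # rows sum to 1
        # U[k] = sum_r alpha[k, r] * bases[r]; this needs only n_sememes*R + R*hidden_dim
        # parameters instead of n_sememes*hidden_dim.
        return alpha @ self.bases                             # (n_sememes, hidden_dim)

U = BasisSememeMatrices(hidden_dim=256, n_sememes=2000)()
print(U.shape)   # torch.Size([2000, 256])
```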
The present invention is capable of other and further embodiments and its several details are capable of modification and variation in light of the present invention, as will be apparent to those skilled in the art, without departing from the spirit and scope of the invention as defined in the appended claims.

Claims (6)

1. A language meaning understanding method fusing sense original information, characterized by comprising the following steps:
processing the language by taking each word as a unit along two paths; left path: word encoder + RNN + word decoder, with the left-path output denoted w_l; right path: sense original encoder + RNN + sense original decoder + word decoder + sigmoid, with the right-path output denoted w_r; then fusing the outputs of the two paths;
the processing procedure of the original encoder comprises the following steps:
creating a trainable matrix with N rows, where N is the number of words and word i corresponds to the i-th row of the matrix, and taking each sense origin as a vector:
s=select(w,M)
emb = ∑_j s_j * EMB(j)
where M ∈ R^{H1×H2} is a constant matrix of sense-origin weights for words; H1 and H2 denote the number of words and the number of sense origins; select is a function that extracts from M the sense-origin weight vector of word w; s is the vector composed of the s_j and represents the probability distribution of the sense origins contained in the word, each s_j denoting the weight of one sense origin contained in the word; EMB is a trainable embedding layer that converts each sense origin into a corresponding vector;
the processing procedure of the word decoder comprises the following steps:
according to the base Yu Yiyuan decoder, the sense original probability needs to be converted into words:
w=sM T
wherein M is T Is the transpose of M;
the process of fusing the outputs of the two paths comprises the following steps:
according to the output w_l of the left path and the output w_r of the right path, the fused output is obtained:
p(w|g) = softmax(w_l · w_r)
where the product is taken element-wise and p(w|g) represents the probability of word w with respect to the context.
2. The language meaning understanding method fusing sense original information according to claim 1, wherein the weight s_j of a sense origin contained in a word is assigned according to the sense-origin tree, where word w has l_h sense origins at the h-th level of the tree and j denotes one of them.
3. The language meaning understanding method fusing sense original information according to claim 2, wherein, when the number of layers in the sense-origin tree is 2 or more, the weight is computed with g', the total number of layers of the sense-origin tree.
4. The language meaning understanding method fusing sense original information according to claim 1, 2 or 3, wherein the processing procedure of the sense original decoder comprises the following steps:
the output of RNN in the right path is first processed as follows:
q = σ(g^T V + b)
where V and b are trainable parameters, σ represents an activation function, and g is the output of the RNN;
then the hidden state q is taken as input to calculate the sense-origin score; for sense origin k in vector s, the score is:
s_k = q^T U_k
where U_k is a conversion matrix.
5. The language meaning understanding method fusing sense original information according to claim 4, wherein the conversion matrix U_k is modeled as a weighted sum of R basis matrices:
U_k = ∑_{r=1}^{R} α_{k,r} U_r
where the U_r are trainable basis matrices and the α_{k,r} > 0 are trainable parameters.
6. The language meaning understanding method fusing sense original information according to claim 5, wherein the sigmoid processing procedure in the right path comprises the following steps:
first, the output of the word decoder is processed with a Sigmoid function to obtain w_r;
then w_r is updated using the following formula:
w_r = w_r × (1 - X) + X
where X is a constant parameter that determines the offset.
CN202011431776.0A 2020-12-09 2020-12-09 Language meaning understanding method for fusing meaning original information Active CN112464673B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011431776.0A CN112464673B (en) 2020-12-09 2020-12-09 Language meaning understanding method for fusing meaning original information

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202011431776.0A CN112464673B (en) 2020-12-09 2020-12-09 Language meaning understanding method for fusing meaning original information

Publications (2)

Publication Number Publication Date
CN112464673A CN112464673A (en) 2021-03-09
CN112464673B true CN112464673B (en) 2023-05-26

Family

ID=74801137

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011431776.0A Active CN112464673B (en) 2020-12-09 2020-12-09 Language meaning understanding method for fusing meaning original information

Country Status (1)

Country Link
CN (1) CN112464673B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113468884B (en) * 2021-06-10 2023-06-16 北京信息科技大学 Chinese event trigger word extraction method and device

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109446518A (en) * 2018-10-09 2019-03-08 清华大学 The coding/decoding method and decoder of language model
CN109597988A (en) * 2018-10-31 2019-04-09 清华大学 Cross-lingual lexical sememe prediction method, device and electronic equipment
CN110555203A (en) * 2018-05-31 2019-12-10 北京百度网讯科技有限公司 Text replying method, device, server and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20180107940A1 (en) * 2010-04-27 2018-04-19 Jeremy Lieberman Artificial intelligence method and apparatus

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110555203A (en) * 2018-05-31 2019-12-10 北京百度网讯科技有限公司 Text replying method, device, server and storage medium
CN109446518A (en) * 2018-10-09 2019-03-08 清华大学 The coding/decoding method and decoder of language model
CN109597988A (en) * 2018-10-31 2019-04-09 清华大学 Cross-lingual lexical sememe prediction method, device and electronic equipment

Also Published As

Publication number Publication date
CN112464673A (en) 2021-03-09

Similar Documents

Publication Publication Date Title
CN110717334B (en) Text emotion analysis method based on BERT model and double-channel attention
CN110134771B (en) Implementation method of multi-attention-machine-based fusion network question-answering system
CN111177394B (en) Knowledge map relation data classification method based on syntactic attention neural network
CN110795556B (en) Abstract generation method based on fine-grained plug-in decoding
CN108829722A (en) A kind of Dual-Attention relationship classification method and system of remote supervisory
CN111858932A (en) Multiple-feature Chinese and English emotion classification method and system based on Transformer
CN110929030A (en) Text abstract and emotion classification combined training method
CN109086270B (en) Automatic poetry making system and method based on ancient poetry corpus vectorization
CN111125333B (en) Generation type knowledge question-answering method based on expression learning and multi-layer covering mechanism
CN110688862A (en) Mongolian-Chinese inter-translation method based on transfer learning
CN114881042B (en) Chinese emotion analysis method based on graph-convolution network fusion of syntactic dependency and part of speech
CN110688861A (en) Multi-feature fusion sentence-level translation quality estimation method
CN115422939B (en) Fine granularity commodity named entity identification method based on big data
CN113657123A (en) Mongolian aspect level emotion analysis method based on target template guidance and relation head coding
CN112784604A (en) Entity linking method based on entity boundary network
CN114880461A (en) Chinese news text summarization method combining contrast learning and pre-training technology
CN114969304A (en) Case public opinion multi-document generation type abstract method based on element graph attention
CN115329088B (en) Robustness analysis method of graph neural network event detection model
CN114757184B (en) Method and system for realizing knowledge question and answer in aviation field
CN114254645A (en) Artificial intelligence auxiliary writing system
CN114972848A (en) Image semantic understanding and text generation based on fine-grained visual information control network
CN115496072A (en) Relation extraction method based on comparison learning
CN115238691A (en) Knowledge fusion based embedded multi-intention recognition and slot filling model
CN112464673B (en) Language meaning understanding method for fusing meaning original information
CN114781356B (en) Text abstract generation method based on input sharing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant