CN107423284B - Method and system for constructing sentence representation fusing internal structure information of Chinese words - Google Patents

Method and system for constructing sentence representation fusing internal structure information of Chinese words

Info

Publication number
CN107423284B
Authority
CN
China
Prior art keywords
word
training
corpus
vector
vectors
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710449875.3A
Other languages
Chinese (zh)
Other versions
CN107423284A (en)
Inventor
王少楠 (Shaonan Wang)
张家俊 (Jiajun Zhang)
宗成庆 (Chengqing Zong)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Institute of Automation of Chinese Academy of Science
Original Assignee
Institute of Automation of Chinese Academy of Science
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Institute of Automation of Chinese Academy of Science
Priority to CN201710449875.3A
Publication of CN107423284A
Application granted
Publication of CN107423284B
Legal status: Active
Anticipated expiration


Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/279: Recognition of textual entities
    • G06F40/289: Phrasal analysis, e.g. finite state techniques or chunking
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30: Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35: Clustering; Classification
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/20: Natural language analysis
    • G06F40/205: Parsing
    • G06F40/211: Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to the technical field of natural language processing, and in particular to a method and a system for constructing sentence representations that fuse the internal structure information of Chinese words, aiming to solve the problem that this internal structure information is poorly utilized. The construction method comprises the following steps: performing word segmentation on all Chinese paraphrase sentence pairs in the training corpus to obtain a corpus of words; pre-training to obtain a pre-training character vector for each character and a pre-training word vector for each word; integrating all the pre-training character vectors of each word with the word's pre-training word vector to obtain a combined word vector for that word; determining a final word vector for each word according to its pre-training word vector and its combined word vector, the final word vector representing the word's internal structure information; and integrating the final word vectors of the words in the sentence to be processed to obtain the representation vector of the sentence to be processed. The invention improves the utilization of the internal structure information of words.

Description

Method and system for constructing sentence representation fusing internal structure information of Chinese words
Technical Field
The invention relates to the technical field of natural language processing, and in particular to a method and a system for constructing sentence representations that fuse the internal structure information of Chinese words.
Background
Sentence representation maps a natural language sentence into a high-dimensional space such that semantically similar sentences lie closer together in that space. It is a fundamental task in natural language processing and directly affects the performance of the entire language processing system. Much effort has therefore been devoted to designing sentence representation methods suited to specific tasks in order to improve the performance of language processing systems.
Traditional sentence representation methods use large numbers of manually designed features to represent the meaning of a sentence and have achieved good results in various natural language processing tasks. However, they demand substantial manpower and domain expertise, and features often must be selected anew for each task, leading to poor model generalization and difficult feature design. In recent years it has been found that neural-network-based models can automatically extract the semantic features of sentences from large-scale text, greatly improving the quality of sentence representations.
However, most research on sentence representation targets English sentences, designing different neural network architectures at the word granularity to encode sentence semantics. Unlike English, Chinese words are composed of characters, and these characters carry rich semantic information that reflects the meaning of the word. Researchers have noticed this and improved word vector learning by exploiting the characters inside Chinese words, but these methods do not fully use the internal information of Chinese words, such as the relationships between characters, and they are limited to word vector learning without being explored for sentence representation. How to fully exploit the internal structure information of words to learn a better sentence representation model is therefore a topic worth studying.
Disclosure of Invention
In order to solve the problem in the prior art, namely the poor utilization of the internal structure information of words, the invention provides a method and a system for constructing sentence representations that fuse the internal structure information of Chinese words.
In order to solve the technical problems, the invention provides the following scheme:
a construction method for sentence representation fusing internal structure information of Chinese words comprises the following steps:
performing word segmentation on all Chinese repeated statement sentence pairs in the training corpus to obtain a plurality of word corpuses;
pre-training each word corpus to obtain pre-training word vectors and pre-training word vectors;
integrating all pre-training word vectors and pre-training word vectors in each word corpus to obtain a combined word vector corresponding to the word corpus;
determining a final word vector of each word corpus according to the pre-training word vector and the combined word vector in each word corpus, wherein the final word vector represents word internal structure information;
and integrating the final word vector of each word corpus in the sentence to be processed to obtain the expression vector of the sentence to be processed.
Optionally, the pre-training specifically includes:
splitting each word into characters to obtain a character corpus;
concatenating the character corpus and the word corpus;
and pre-training the character vectors and word vectors with an open-source model to obtain the corresponding pre-training character vectors and pre-training word vectors.
Optionally, integrating all the pre-training character vectors of each word with the word's pre-training word vector specifically includes:
concatenating each pre-training character vector of the word with the word's pre-training word vector to obtain a concatenated vector corresponding to that character vector;
inputting the concatenated vector into a feedforward neural network and applying a nonlinear transformation to obtain a mask vector corresponding to that character vector;
and determining the combined word vector of the word from all its pre-training character vectors and the corresponding mask vectors.
Optionally, inputting the concatenated vector into a feedforward neural network and applying a nonlinear transformation specifically includes:
determining the mask vector v_ij according to the following formula:

v_ij = tanh(W · [c_ij; x_i])

where tanh(·) is the hyperbolic tangent function, W is a parameter matrix of the feedforward neural network, and c_ij is the j-th pre-training character vector of the i-th word x_i.
Optionally, determining the combined word vector of the word from all its pre-training character vectors and the corresponding mask vectors specifically includes:
summing the element-wise products of all the pre-training character vectors in the word with their corresponding mask vectors to obtain the combined word vector x̂_i of the word, according to the following formula:

x̂_i = Σ_{j=1}^{m} v_ij ⊙ c_ij

where c_ij is the j-th pre-training character vector of the i-th word x_i, v_ij is the mask vector corresponding to c_ij, ⊙ denotes the element-wise product, and m is the number of characters in the i-th word.
Optionally, determining the final word vector of each word according to its pre-training word vector and its combined word vector specifically includes:
based on a max-pooling method, taking the maximum of the pre-training word vector and the combined word vector in each dimension as the final word vector x̃_i, according to the following formula:

x̃_i^(k) = max(x_i^(k), x̂_i^(k)),  k = 1, …, d

where x_i^(k) is the k-th dimension of the pre-training word vector of the i-th word, x̂_i^(k) is the k-th dimension of its combined word vector, d is the dimensionality of the word vectors, and max(·) takes the maximum value.
Optionally, integrating the final word vectors of the words in the sentence to be processed to obtain the representation vector of the sentence to be processed specifically includes:
integrating the final word vectors into the representation vector of the sentence to be processed through a sentence combination function.
Optionally, the sentence combination function includes at least one of an Average model function, a Matrix model function, a Dan model function, an RNN model function, and an LSTM model function.
Optionally, the training corpus is a Chinese text corpus crawled from Baidu Encyclopedia.
Embodiments of the invention provide the following technical effects:
the method for constructing the sentence representation fusing the internal structure information of the Chinese words integrates the final word vectors representing the internal structure information of the words so as to accurately determine the representation vector of the sentence to be processed and improve the utilization rate of the internal structure information of the words by segmenting the training corpus, pre-training the word corpus, integrating the pre-training word vectors and determining the final word vectors.
In order to solve the technical problems, the invention also provides the following scheme:
a construction system for fusing sentence representations of internal structural information of chinese words, the construction system comprising:
the word segmentation unit is used for performing word segmentation on all Chinese repeated statement sentence pairs in the training corpus to obtain a plurality of word corpuses;
the pre-training unit is used for pre-training each word corpus to obtain pre-training word vectors and pre-training word vectors;
the first integration unit is used for integrating all pre-training word vectors and pre-training word vectors in each word corpus to obtain a combined word vector corresponding to the word corpus;
the determining unit is used for determining a final word vector of each word corpus according to a pre-training word vector and the combined word vector in each word corpus, and the final word vector represents word internal structure information;
and the second integration unit is used for integrating the final word vector of each word corpus in the sentence to be processed to obtain the expression vector of the sentence to be processed.
Embodiments of the invention provide the following technical effects:
the building system for sentence representation of Chinese word internal structure information is provided with the word segmentation unit, the pre-training unit, the first integration unit, the determination unit and the second integration unit, can perform word segmentation processing on a training corpus, pre-train a word corpus, integrate a pre-training word vector and determine a final word vector, thereby integrating a plurality of final word vectors representing the word internal structure information to accurately determine the representation vector of the sentence to be processed and improving the utilization rate of the word internal structure information.
Drawings
FIG. 1 is a flow chart of the method for constructing sentence representations fusing the internal structure information of Chinese words according to the present invention;
FIG. 2 is a schematic diagram of the module structure of the system for constructing sentence representations fusing the internal structure information of Chinese words according to the present invention.
Description of the symbols:
word segmentation unit-1, pre-training unit-2, first integration unit-3, determination unit-4, second integration unit-5.
Detailed Description
Preferred embodiments of the present invention are described below with reference to the accompanying drawings. It should be understood by those skilled in the art that these embodiments are only for explaining the technical principle of the present invention, and are not intended to limit the scope of the present invention.
The invention provides a method for constructing sentence representations fused with the internal structure information of Chinese words. By performing word segmentation on the training corpus, pre-training character and word vectors, integrating the pre-training character vectors with the word vectors, and determining final word vectors, the method integrates final word vectors that represent the internal structure information of words, so as to accurately determine the representation vector of the sentence to be processed and improve the utilization of that information.
In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.
As shown in FIG. 1, the method for constructing sentence representations fusing the internal structure information of Chinese words comprises:
Step 100: performing word segmentation on all Chinese paraphrase sentence pairs in the training corpus to obtain a corpus of words;
Step 200: pre-training to obtain a pre-training character vector for each character and a pre-training word vector for each word;
Step 300: integrating all the pre-training character vectors of each word with the word's pre-training word vector to obtain a combined word vector for that word;
Step 400: determining a final word vector for each word according to its pre-training word vector and its combined word vector, the final word vector representing the word's internal structure information;
Step 500: integrating the final word vectors of the words in the sentence to be processed to obtain the representation vector of the sentence to be processed.
In step 100, the training corpus is a Chinese text corpus crawled from Baidu Encyclopedia.
There are many ways to segment Chinese sentences into words. In this embodiment, Chinese sentences are segmented with an open-source word segmentation tool.
Given a Chinese paraphrase sentence pair (the example pair appears as an image in the original document), word segmentation expresses each sentence of the pair as a sequence of words.
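The patent does not name its segmentation tool. As a minimal sketch, the open-source jieba tokenizer can play this role; the sample sentence below is illustrative, not the pair from the original figure:

```python
# Word-segmentation sketch using the open-source jieba tokenizer.
# jieba is an assumption: the patent only says "an open-source tool".
import jieba

sentence = "日本的首都是东京"  # illustrative sentence, not from the patent
words = list(jieba.cut(sentence))
print(words)  # e.g. ['日本', '的', '首都', '是', '东京']
```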
in step 200, the pre-training of each word corpus specifically includes:
step 201: and splitting each word corpus according to characters to obtain a word corpus.
Step 202: and splicing the word linguistic data and the word linguistic data to obtain a word vector and a word vector.
Step 203: and pre-training the character vectors and the word vectors by utilizing an open source model to obtain corresponding pre-training character vectors and pre-training word vectors.
In this embodiment, the open-source model is the skip-gram model, but the method is not limited thereto.
Taking "japan" as an example, the obtained 300-dimensional word vector and word vector are:
"Japanese-0.2434300.2944200.188458-0.0929210.1392860.1865990.011289-0.218883-0.1810620.152754 …";
"Ri-0.3849000.2144930.187968-0.0384640.0575210.069445-0.218115-0.035687-0.126120-0.419776-0.312976 …".
In step 300, integrating all the pre-training character vectors of each word with the word's pre-training word vector specifically includes:
Step 301: concatenating each pre-training character vector of the word with the word's pre-training word vector to obtain a concatenated vector corresponding to that character vector.
Taking "japan" as an example, all the pre-training word vectors "day", "this", and the pre-training word vector "japan" in one word corpus "japan" are spliced to obtain two 600-dimensional spliced vectors.
Step 302: inputting each concatenated vector into a feedforward neural network and applying a nonlinear transformation to obtain the mask vector corresponding to that character vector.
This specifically includes determining the mask vector v_ij, as shown in equation (1):

v_ij = tanh(W · [c_ij; x_i])    (1)

where tanh(·) is the hyperbolic tangent function, W is a parameter matrix of the feedforward neural network, and c_ij is the j-th pre-training character vector of the i-th word x_i. The mask vector v_ij controls the contribution of the j-th character vector to the meaning of the i-th word x_i.
Step 303: determining the combined word vector of the word from all its pre-training character vectors and the corresponding mask vectors.
This specifically includes summing the element-wise products of all the pre-training character vectors in the word with their corresponding mask vectors to obtain the combined word vector x̂_i of the word, as shown in equation (2):

x̂_i = Σ_{j=1}^{m} v_ij ⊙ c_ij    (2)

where c_ij is the j-th pre-training character vector of the i-th word x_i, v_ij is the mask vector corresponding to c_ij, ⊙ denotes the element-wise product, and m is the number of characters in the i-th word.
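A numpy sketch of equation (2); in real use the mask vectors come from equation (1), while here random placeholders stand in for both the character vectors and their masks:

```python
# Sketch of equation (2): combined word vector as a mask-gated sum over characters.
import numpy as np

d, m = 300, 2                          # dimensionality; m = characters in the word
rng = np.random.default_rng(0)
C = rng.normal(size=(m, d))            # character vectors c_i1 .. c_im (placeholders)
V = np.tanh(rng.normal(size=(m, d)))   # mask vectors v_i1 .. v_im (placeholders)

x_hat_i = (V * C).sum(axis=0)          # element-wise products, summed over characters
```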
In step 400, determining the final word vector of each word according to its pre-training word vector and its combined word vector specifically includes:
based on a max-pooling method, taking the maximum of the pre-training word vector and the combined word vector in each dimension as the final word vector x̃_i, as shown in equation (3):

x̃_i^(k) = max(x_i^(k), x̂_i^(k)),  k = 1, …, d    (3)

where x_i^(k) is the k-th dimension of the pre-training word vector of the i-th word, x̂_i^(k) is the k-th dimension of its combined word vector, d is the dimensionality of the word vectors, and max(·) takes the maximum value.
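Equation (3) reduces to a single element-wise maximum, sketched below with placeholder vectors:

```python
# Sketch of equation (3): dimension-wise max pooling of the pre-training
# word vector and the combined word vector.
import numpy as np

rng = np.random.default_rng(0)
x_i = rng.normal(size=300)             # pre-training word vector (placeholder)
x_hat_i = rng.normal(size=300)         # combined word vector (placeholder)

x_final_i = np.maximum(x_i, x_hat_i)   # final word vector, one max per dimension
```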
In step 500, integrating the final word vectors of the words in the sentence to be processed to obtain the representation vector of the sentence specifically includes:
integrating the final word vectors into the representation vector of the sentence to be processed through a sentence combination function.
The sentence combination function includes at least one of an Average model function, a Matrix model function, a Dan (deep averaging network) model function, an RNN (recurrent neural network) model function, and an LSTM (long short-term memory) model function.
The Average model function averages the vector representations of all n words in a sentence to obtain the final sentence representation R_sentence, as shown in equation (4):

R_sentence = (1/n) Σ_{i=1}^{n} x̃_i    (4)

The Matrix model function first obtains the averaged sentence vector R_avg with the Average model function, then multiplies it by a matrix and applies a nonlinear transformation to obtain the final sentence representation, as shown in equation (5):

R_sentence = f(W_m · R_avg)    (5)

The Dan model function first obtains the averaged sentence vector R_avg with the Average model function, then transforms it through a multi-layer feedforward neural network to obtain the final sentence representation, as shown in equation (6):

R_sentence = f(W_2 · f(W_1 · R_avg + b_1) + b_2)    (6)
The RNN model function combines the word representations in a sentence into the final sentence representation, as shown in equation (7):

h_i = f(W_x · x̃_i + W_h · h_{i-1} + b),  R_sentence = h_n    (7)

The LSTM model function combines the word representations in a sentence into the final sentence representation through the standard LSTM recurrence, as shown in equation (8):

i_t = σ(W_i · [x̃_t; h_{t-1}] + b_i)
f_t = σ(W_f · [x̃_t; h_{t-1}] + b_f)
o_t = σ(W_o · [x̃_t; h_{t-1}] + b_o)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ tanh(W_c · [x̃_t; h_{t-1}] + b_c)
h_t = o_t ⊙ tanh(c_t),  R_sentence = h_n    (8)

where σ(·) is the sigmoid function.
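A numpy sketch of two of these combination functions, the Average model of equation (4) and the plain RNN of equation (7); the weights are random placeholders for parameters that the patent learns from paraphrase pairs:

```python
# Sketch of equations (4) and (7): Average and RNN sentence combination.
import numpy as np

rng = np.random.default_rng(0)
d = 300
words = rng.normal(size=(5, d))      # final word vectors of a 5-word sentence

# Average model: R_sentence is the mean of the final word vectors.
r_avg = words.mean(axis=0)

# RNN model: h_i = f(W_x·x_i + W_h·h_{i-1} + b); the last hidden state
# serves as the sentence representation.
W_x = rng.normal(scale=0.01, size=(d, d))
W_h = rng.normal(scale=0.01, size=(d, d))
b = np.zeros(d)
h = np.zeros(d)
for x in words:
    h = np.tanh(W_x @ x + W_h @ h + b)
r_rnn = h
```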
After the representation vector of each sentence in a pair is obtained, the model parameters are learned with a max-margin objective that maximizes the separation between positive and negative examples, as shown in equation (9):

L = Σ max(0, δ − sim(R_x1, R_x2) + sim(R_t1, R_t2))    (9)

where (x1, x2) is a positive example, i.e., a sentence pair with similar meaning; (t1, t2) is a negative example formed by randomly combining sentences; R_x denotes the representation vector of sentence x; sim(·, ·) is a similarity function; and δ is the margin.
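A sketch of one plausible instantiation of equation (9), assuming cosine similarity and a fixed margin; the exact similarity function and margin value are assumptions, not stated in the patent:

```python
# Sketch of the max-margin objective of equation (9), assuming cosine
# similarity and margin delta = 1.0 (both assumptions).
import numpy as np

def cos(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

def max_margin_loss(r_x1, r_x2, r_t1, r_t2, delta=1.0):
    # Push the similarity of a paraphrase pair (x1, x2) above that of a
    # randomly combined negative pair (t1, t2) by at least the margin.
    return max(0.0, delta - cos(r_x1, r_x2) + cos(r_t1, r_t2))

rng = np.random.default_rng(0)
r_x1, r_x2, r_t1, r_t2 = rng.normal(size=(4, 300))
print(max_margin_loss(r_x1, r_x2, r_t1, r_t2))
```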
Table 1 compares the present invention with a word-based model, a character-based model and a character-averaging model on three sentence similarity test sets (the large-scale set, the Baidu set, and Total, the union of the two). The training data comprises 30,846 sentence pairs. From Table 1 it can be seen that, measured by the Pearson correlation between model predictions and gold-standard scores, the invention improves on the word-based model by 2.00% on average and on the character-averaging model by 1.52% on average, which fully demonstrates the effectiveness and superiority of the proposed construction method.
TABLE 1: Pearson correlation on different sentence similarity test sets
(The table data appears as an image in the original document.)
In addition, Table 2 compares the present invention with the word-based, character-based and character-averaging models on a word similarity test set. It can be seen directly that the invention effectively improves the quality of word representations.
TABLE 2: Pearson correlation on the word similarity test set
(The table data appears as an image in the original document.)
The construction method for sentence representations fusing the internal structure information of Chinese words has the following positive effects. Chinese words are formed from characters, and for most words the meanings of the characters strongly influence the meaning of the word they compose; a small portion of Chinese words, however, are non-compositional, their meaning independent of the meanings of their constituent characters. By modeling the internal structure of Chinese words, the invention effectively improves word representations and can, to a certain extent, automatically identify non-compositional words. The invention uses a mask gate mechanism to control how much each character contributes to the word's semantics, uses max pooling to select, per dimension, whether the word's meaning is taken as a whole or composed from the meanings of its characters, and learns the weights of the two automatically.
Experiments on a Chinese sentence similarity task show that, compared with a word-based sentence representation model, the average Pearson correlation improves by 2.00%, and compared with a sentence representation model based on character averaging, it improves by 1.52% on average. This fully demonstrates the effectiveness and superiority of fusing the internal structure of words.
In addition, the invention also provides a system for constructing sentence representations fusing the internal structure information of Chinese words. As shown in FIG. 2, the system includes a word segmentation unit 1, a pre-training unit 2, a first integration unit 3, a determining unit 4 and a second integration unit 5.
The word segmentation unit 1 performs word segmentation on all Chinese paraphrase sentence pairs in the training corpus to obtain a corpus of words; the pre-training unit 2 pre-trains to obtain a pre-training character vector for each character and a pre-training word vector for each word; the first integration unit 3 integrates all the pre-training character vectors of each word with the word's pre-training word vector to obtain a combined word vector for that word; the determining unit 4 determines a final word vector for each word according to its pre-training word vector and its combined word vector, the final word vector representing the word's internal structure information; and the second integration unit 5 integrates the final word vectors of the words in the sentence to be processed to obtain the representation vector of the sentence to be processed.
Compared with the prior art, the system for constructing sentence representations fusing the internal structure information of Chinese words has the same beneficial effects as the construction method described above, which are not repeated here.
The technical solutions of the present invention have thus been described with reference to the preferred embodiments shown in the drawings, but those skilled in the art will readily understand that the scope of the present invention is not limited to these specific embodiments. Equivalent changes or substitutions of the related technical features may be made without departing from the principle of the present invention, and the technical solutions after such changes or substitutions fall within the protection scope of the present invention.

Claims (8)

1. A method for constructing sentence representations fusing the internal structure information of Chinese words, characterized by comprising the following steps:
performing word segmentation on all Chinese paraphrase sentence pairs in the training corpus to obtain a corpus of words;
pre-training to obtain a pre-training character vector for each character and a pre-training word vector for each word;
integrating all the pre-training character vectors of each word with the word's pre-training word vector to obtain a combined word vector for that word;
determining a final word vector for each word according to its pre-training word vector and its combined word vector, the final word vector representing the word's internal structure information;
integrating the final word vectors of the words in the sentence to be processed to obtain the representation vector of the sentence to be processed;
wherein integrating all the pre-training character vectors of each word with the word's pre-training word vector specifically comprises:
concatenating each pre-training character vector of the word with the word's pre-training word vector to obtain a concatenated vector corresponding to that character vector;
inputting the concatenated vector into a feedforward neural network and applying a nonlinear transformation to obtain a mask vector corresponding to that character vector;
and determining the combined word vector of the word from all its pre-training character vectors and the corresponding mask vectors.
2. The method for constructing sentence representations fusing the internal structure information of Chinese words according to claim 1, wherein the pre-training specifically comprises:
splitting each word into characters to obtain a character corpus;
concatenating the character corpus and the word corpus;
and pre-training the character vectors and word vectors with an open-source model to obtain the corresponding pre-training character vectors and pre-training word vectors.
3. The method for constructing sentence representations fusing the internal structure information of Chinese words according to claim 1, wherein inputting the concatenated vector into a feedforward neural network and applying a nonlinear transformation specifically comprises:
determining the mask vector v_ij according to the following formula:

v_ij = tanh(W · [c_ij; x_i])

where tanh(·) is the hyperbolic tangent function, W is a parameter matrix of the feedforward neural network, and c_ij is the j-th pre-training character vector of the i-th word x_i.
4. The method according to claim 1, wherein determining the combined word vector of the word from all its pre-training character vectors and the corresponding mask vectors specifically comprises:
summing the element-wise products of all the pre-training character vectors in the word with their corresponding mask vectors to obtain the combined word vector x̂_i of the word, according to the following formula:

x̂_i = Σ_{j=1}^{m} v_ij ⊙ c_ij

where c_ij is the j-th pre-training character vector of the i-th word x_i, v_ij is the mask vector corresponding to c_ij, ⊙ denotes the element-wise product, and m is the number of characters in the i-th word.
5. The method according to claim 1, wherein determining the final word vector of each word according to its pre-training word vector and its combined word vector specifically comprises:
based on a max-pooling method, taking the maximum of the pre-training word vector and the combined word vector in each dimension as the final word vector x̃_i, according to the following formula:

x̃_i^(k) = max(x_i^(k), x̂_i^(k)),  k = 1, …, d

where x_i^(k) is the k-th dimension of the pre-training word vector of the i-th word, x̂_i^(k) is the k-th dimension of its combined word vector, d is the dimensionality of the word vectors, and max(·) takes the maximum value.
6. The method according to claim 1, wherein integrating the final word vectors of the words in the sentence to be processed to obtain the representation vector of the sentence to be processed specifically comprises:
integrating the final word vectors into the representation vector of the sentence to be processed through a sentence combination function.
7. The method of claim 6, wherein the sentence combination function comprises at least one of an Average model function, a Matrix model function, a Dan model function, an RNN model function, and an LSTM model function.
8. The method for constructing sentence representations fusing the internal structure information of Chinese words according to any of claims 1-7, wherein the training corpus is a Chinese text corpus crawled from Baidu Encyclopedia.
CN201710449875.3A 2017-06-14 2017-06-14 Method and system for constructing sentence representation fusing internal structure information of Chinese words Active CN107423284B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710449875.3A CN107423284B (en) 2017-06-14 2017-06-14 Method and system for constructing sentence representation fusing internal structure information of Chinese words

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710449875.3A CN107423284B (en) 2017-06-14 2017-06-14 Method and system for constructing sentence representation fusing internal structure information of Chinese words

Publications (2)

Publication Number Publication Date
CN107423284A CN107423284A (en) 2017-12-01
CN107423284B true CN107423284B (en) 2020-03-06

Family

ID=60428673

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710449875.3A Active CN107423284B (en) 2017-06-14 2017-06-14 Method and system for constructing sentence representation fusing internal structure information of Chinese words

Country Status (1)

Country Link
CN (1) CN107423284B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108595416A (en) * 2018-03-27 2018-09-28 义语智能科技(上海)有限公司 Character string processing method and equipment
CN108717406B (en) * 2018-05-10 2021-08-24 平安科技(深圳)有限公司 Text emotion analysis method and device and storage medium
CN108986797B (en) * 2018-08-06 2021-07-06 中国科学技术大学 Voice theme recognition method and system
CN111382249B (en) * 2018-12-29 2023-10-10 深圳市优必选科技有限公司 Chat corpus cleaning method and device, computer equipment and storage medium
CN111538817B (en) * 2019-01-18 2024-06-18 北京京东尚科信息技术有限公司 Man-machine interaction method and device
CN109992788B (en) * 2019-04-10 2023-08-29 鼎富智能科技有限公司 Deep text matching method and device based on unregistered word processing
CN110263323B (en) * 2019-05-08 2020-08-28 清华大学 Keyword extraction method and system based on barrier type long-time memory neural network
CN110245353B (en) * 2019-06-20 2022-10-28 腾讯科技(深圳)有限公司 Natural language expression method, device, equipment and storage medium
CN112825109B (en) * 2019-11-20 2024-02-23 南京贝湾信息科技有限公司 Sentence alignment method and computing device
CN112906370B (en) * 2019-12-04 2022-12-20 马上消费金融股份有限公司 Intention recognition model training method, intention recognition method and related device
CN111259148B (en) * 2020-01-19 2024-03-26 北京小米松果电子有限公司 Information processing method, device and storage medium
CN111581335B (en) * 2020-05-14 2023-11-24 腾讯科技(深圳)有限公司 Text representation method and device
CN111507099A (en) * 2020-06-19 2020-08-07 平安科技(深圳)有限公司 Text classification method and device, computer equipment and storage medium
CN111832301A (en) * 2020-07-28 2020-10-27 电子科技大学 Chinese word vector generation method based on adaptive component n-tuple
CN112733520B (en) * 2020-12-30 2023-07-18 望海康信(北京)科技股份公司 Text similarity calculation method, system, corresponding equipment and storage medium
CN112765325A (en) * 2021-01-27 2021-05-07 语联网(武汉)信息技术有限公司 Vertical field corpus data screening method and system
CN113158624B (en) * 2021-04-09 2023-12-08 中国人民解放军国防科技大学 Method and system for fine tuning pre-training language model by fusing language information in event extraction
CN113379032A (en) * 2021-06-08 2021-09-10 全球能源互联网研究院有限公司 Layered bidirectional LSTM sequence model training method and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN105653671A (en) * 2015-12-29 2016-06-08 畅捷通信息技术股份有限公司 Similar information recommendation method and system
CN106227721A (en) * 2016-08-08 2016-12-14 中国科学院自动化研究所 Chinese Prosodic Hierarchy prognoses system
CN106383816A (en) * 2016-09-26 2017-02-08 大连民族大学 Chinese minority region name identification method based on deep learning
CN106547735A (en) * 2016-10-25 2017-03-29 复旦大学 The structure and using method of the dynamic word or word vector based on the context-aware of deep learning


Also Published As

Publication number Publication date
CN107423284A (en) 2017-12-01

Similar Documents

Publication Publication Date Title
CN107423284B (en) Method and system for constructing sentence representation fusing internal structure information of Chinese words
CN110287494A (en) A method of the short text Similarity matching based on deep learning BERT algorithm
Palangi et al. Deep sentence embedding using long short-term memory networks: Analysis and application to information retrieval
Chen et al. Research on text sentiment analysis based on CNNs and SVM
CN110134954B (en) Named entity recognition method based on Attention mechanism
Li et al. A method of emotional analysis of movie based on convolution neural network and bi-directional LSTM RNN
CN109086269B (en) Semantic bilingual recognition method based on semantic resource word representation and collocation relationship
Cai et al. Intelligent question answering in restricted domains using deep learning and question pair matching
CN110489554B (en) Attribute-level emotion classification method based on location-aware mutual attention network model
Diao et al. A multi-dimension question answering network for sarcasm detection
Hu et al. Considering optimization of English grammar error correction based on neural network
CN112784602A (en) News emotion entity extraction method based on remote supervision
CN110297986A (en) A kind of Sentiment orientation analysis method of hot microblog topic
CN111159405B (en) Irony detection method based on background knowledge
Wu et al. An effective approach of named entity recognition for cyber threat intelligence
CN115357719A (en) Power audit text classification method and device based on improved BERT model
El Desouki et al. Exploring the recent trends of paraphrase detection
CN114417851A (en) Emotion analysis method based on keyword weighted information
CN114722176A (en) Intelligent question answering method, device, medium and electronic equipment
Wu et al. Machine translation of English speech: Comparison of multiple algorithms
Zheng et al. A novel hierarchical convolutional neural network for question answering over paragraphs
CN114970557A (en) Knowledge enhancement-based cross-language structured emotion analysis method
Ji et al. Research on semantic similarity calculation methods in Chinese financial intelligent customer service
CN115878752A (en) Text emotion analysis method, device, equipment, medium and program product
Xu et al. Research on multi-feature fusion entity relation extraction based on deep learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant