CN112232090A

CN112232090A - Chinese-Vietnamese parallel sentence pair extraction method fusing syntactic structure and Tree-LSTM

Info

Publication number: CN112232090A
Application number: CN202010978713.0A
Authority: CN
Inventors: 高盛祥; 张迎晨; 余正涛; 朱浩东
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2020-09-17
Filing date: 2020-09-17
Publication date: 2021-01-15

Abstract

The invention relates to a Chinese-Vietnamese parallel sentence pair extraction method fusing syntactic structure and Tree-LSTM. The method pre-trains Chinese-Vietnamese bilingual word vectors and maps them into the same semantic space. Because Chinese and Vietnamese sentence structures differ, the sequential structure of each sentence is converted into a dependency tree by dependency parsing, and the syntactic structure information of the sentence is captured by a Tree-LSTM. The part-of-speech information of the Chinese and Vietnamese sentences is spliced into the sentence semantic vectors to form feature vectors, which are input into a fully connected layer to train a Chinese-Vietnamese parallel sentence pair classifier. The invention uses deep learning to learn sentence representation patterns automatically from large amounts of data, which removes the heavy manual effort that traditional parallel sentence pair extraction spends on feature design. The method also addresses the effect that structural differences between Chinese and Vietnamese have on the performance of the extraction model, and improves the accuracy of the parallel sentence pair extraction model.

Description

Chinese-Vietnamese parallel sentence pair extraction method fusing syntactic structure and Tree-LSTM

Technical Field

The invention relates to a Chinese-Vietnamese parallel sentence pair extraction method fusing syntactic structure and Tree-LSTM, and belongs to the technical field of natural language processing.

Background

Parallel corpora are an important resource for machine translation research. In recent years, exchange and cooperation between China and Vietnam have become increasingly close, machine translation is becoming an important tool for supporting cooperation between the two countries, and parallel corpora have considerable application prospects. Vietnamese is a typical low-resource language, and Chinese-Vietnamese parallel sentence pair data are scarce. Using parallel sentence pair extraction to obtain Chinese-Vietnamese parallel sentences from the large amount of available Chinese-Vietnamese comparable corpora can alleviate the data sparsity problem in Chinese-Vietnamese machine translation. Traditional parallel sentence pair extraction methods rarely consider the syntactic structure information of sentences, so a Chinese-Vietnamese parallel sentence pair extraction method fusing syntactic structure and Tree-LSTM is of significant value for extracting Chinese-Vietnamese parallel sentence pairs from the Chinese-Vietnamese comparable corpus in Wikipedia.

Disclosure of Invention

The invention provides a Chinese-Vietnamese parallel sentence pair extraction method fusing syntactic structure and Tree-LSTM. It removes the heavy manual effort that traditional parallel sentence pair extraction spends on feature design, and it addresses the effect that structural differences between Chinese and Vietnamese have on the performance of the extraction model. The method uses deep learning to represent sentence semantics while taking the differences between Chinese and Vietnamese syntactic structure into account, so as to improve the accuracy of the Chinese-Vietnamese parallel sentence pair extraction model.

The technical scheme of the invention is as follows: a Chinese-Vietnamese parallel sentence pair extraction method fusing syntactic structure and Tree-LSTM comprises the following steps:

Step1, collecting a Chinese-Vietnamese parallel corpus used for training the parallel sentence pair extraction model and a Chinese-Vietnamese comparable corpus from Wikipedia as the extraction source, and dividing the collected parallel corpus into a training corpus and a test corpus; Scrapy is used as the crawler tool to simulate user operations and crawl Chinese-Vietnamese parallel sentence pairs according to the XPath of the page data elements; at the same time, the Chinese and Vietnamese Wikipedia dump data sets, which contain all Chinese and Vietnamese Wikipedia data, are downloaded, and the Chinese-Vietnamese comparable corpus is extracted by ID alignment;

Step2, training Chinese and Vietnamese monolingual word vectors using the Chinese and Vietnamese monolingual corpora, and training Chinese-Vietnamese bilingual word vectors using a bilingual dictionary;

Step2.1, training Chinese and Vietnamese monolingual word vectors separately on the Chinese and Vietnamese monolingual corpora using fastText;

Step2.2, performing word segmentation on the collected bilingual corpus, and constructing a Chinese-Vietnamese bilingual dictionary to serve as the labels for the bilingual word vector training task;

Step2.3, training the Chinese-Vietnamese bilingual word vectors using MUSE.

Bilingual word vector training brings the word vectors of Chinese and Vietnamese words with the same meaning close together in the bilingual semantic space, while leaving the distances among the original word vectors unchanged. The invention uses MUSE to train the Chinese-Vietnamese bilingual word vectors, as shown in the following formula:

argmin∑_i‖X_i*W-Y_i*‖² (1)

Step3, converting the Chinese and Vietnamese sentence sequences into dependency tree structures using dependency parsing models, and taking the resulting trees as the input of the Tree-LSTM model;

In Step3, Chinese and Vietnamese are both isolating languages whose main grammatical devices are word order and function words. The word order of the two languages is similar: the main constituents follow the same order, and both languages have a subject-verb-object (SVO) structure. The grammatical difference lies mainly in the order of modifiers (attributives and adverbials). Chinese places modifiers before the head word, and the same holds for multi-layer modification, for example the Chinese sentence glossed as "She is the most beautiful girl that I have seen." Vietnamese places modifiers after the head word, and the same holds for multi-layer modification, for example the Vietnamese sentence whose word-by-word order is "[…] là (she is) […] gái (girl) xinh […] (the most beautiful) mà […] (I have seen)." To extract the grammatical information of Chinese and Vietnamese, the Chinese dependency parsing tool provided by Stanford University is used to parse Chinese and generate a syntactic dependency tree, and the Vietnamese dependency treebank (VnDT) tool is used to parse Vietnamese and generate a syntactic dependency tree; the generated dependency trees are used as the input of the Tree-LSTM model.

Step4, tagging the part of speech of each word in the Chinese-Vietnamese parallel sentence pairs of the Chinese-Vietnamese parallel corpus, converting the parts of speech into vectors, and splicing the vectors into the sentence semantic vectors;

Step5, applying the element-wise product and the element-wise difference to the final Chinese and Vietnamese sentence semantic vectors to capture how they differ, and then inputting the result into a fully connected layer for supervised training.

As a further scheme of the invention, in Step4, part-of-speech tagging is performed separately on each word of the Chinese and Vietnamese sentences in the parallel sentence pairs of the Chinese-Vietnamese parallel corpus; the parts of speech are then converted into vectors and spliced into the sentence semantic vectors to generate the input vectors of the final extraction model.

In Step5, the matching information of the final sentence semantic vectors, which contain part-of-speech and syntactic structure information, is captured by the element-wise product and the absolute element-wise difference, and a fully connected layer is used to calculate the probability that the two sentences are translations of each other, thereby performing supervised training.

The invention has the beneficial effects that:

1. When sentences are encoded with the Tree-LSTM model fused with sentence structure information, the effect is clearly better than that of the Bi-LSTM-based model, and using part-of-speech information as auxiliary information improves overall model performance;

2. The method uses deep learning to learn sentence representation patterns automatically from large amounts of data, which removes the heavy manual effort that traditional parallel sentence pair extraction spends on feature design; it also addresses the effect that structural differences between Chinese and Vietnamese have on the performance of the extraction model, and improves the accuracy of the parallel sentence pair extraction model. Experiments show that the method outperforms the baseline model on three metrics: precision, recall, and F value.

Drawings

FIG. 1 is a flow chart of the present invention;

FIG. 2 is an illustration of the present invention converting sequences into tree structures.

Detailed Description

Example 1: as shown in FIGS. 1-2, a Chinese-Vietnamese parallel sentence pair extraction method fusing syntactic structure and Tree-LSTM includes:

Step2, training monolingual word vectors on the Chinese and Vietnamese monolingual corpora using fastText, obtaining a Chinese-Vietnamese bilingual dictionary to supervise the training of the Chinese-Vietnamese bilingual word vectors, and training with MUSE according to the following formula:

argmin∑_i‖X_i*W-Y_i*‖² (1)
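A minimal sketch of how an objective of the form in formula (1) can be solved in closed form. It assumes, as MUSE's supervised mode does, that W is constrained to be orthogonal (the Procrustes solution); the text itself does not state this constraint. The function name `procrustes_mapping` and the random toy vectors are hypothetical and stand in for real fastText embeddings of dictionary pairs.

```python
import numpy as np

def procrustes_mapping(X, Y):
    """Return the orthogonal W that minimises sum_i ||X_i W - Y_i||^2.

    X, Y: arrays of shape (n_pairs, dim); row i holds the source-language and
    target-language vectors of the i-th dictionary entry.
    """
    # The SVD of the cross-covariance X^T Y gives the closed form W = U V^T.
    U, _, Vt = np.linalg.svd(X.T @ Y)
    return U @ Vt

# Toy data (hypothetical): 50 four-dimensional source vectors and target
# vectors produced from them by a hidden rotation Q.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 4))
Q, _ = np.linalg.qr(rng.normal(size=(4, 4)))
Y = X @ Q

W = procrustes_mapping(X, Y)
residual = np.linalg.norm(X @ W - Y)

# An orthogonal map leaves distances between source vectors unchanged.
dist_before = np.linalg.norm(X[0] - X[1])
dist_after = np.linalg.norm(X[0] @ W - X[1] @ W)
```

Under the orthogonality assumption, the mapping moves dictionary pairs together without distorting the geometry of the original monolingual space, which matches the property described for the bilingual word vectors.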

Step3, as shown in FIG. 2, using the Chinese dependency parsing tool provided by Stanford University and the Vietnamese dependency treebank (VnDT) tool, the Chinese and Vietnamese sentences of each parallel sentence pair in the Chinese-Vietnamese parallel corpus are converted into a Chinese dependency tree and a Vietnamese dependency tree respectively; FIG. 2 shows the conversion of the sequential structure of a sentence into a tree structure that serves as the input of the Tree-LSTM model.
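The conversion from a word sequence to a tree can be illustrated with a short sketch. It assumes the parser output is available as a list of head indices (1-based, with 0 marking the root), which is a common representation but is not specified in the text. The `Node` class, the function names, and the hand-written head indices below are hypothetical and are not the output of the Stanford or VnDT tools.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    index: int                      # 1-based position of the word in the sentence
    word: str
    children: list = field(default_factory=list)

def build_tree(words, heads):
    """Build a dependency tree; heads[i] is the head of word i+1, 0 marks the root."""
    nodes = [Node(i + 1, w) for i, w in enumerate(words)]
    root = None
    for node, head in zip(nodes, heads):
        if head == 0:
            root = node
        else:
            nodes[head - 1].children.append(node)
    return root

def postorder(node):
    """Yield words with children before parents, the order a Tree-LSTM processes nodes."""
    for child in node.children:
        yield from postorder(child)
    yield node.word

# Hand-assigned heads for illustration only (not real parser output).
words = ["她", "是", "最", "漂亮", "的", "女孩"]
heads = [6, 6, 4, 6, 4, 0]

root = build_tree(words, heads)
order = list(postorder(root))
```

The post-order traversal guarantees that every child has been encoded before its head, so the head node can read the hidden states of all its dependents.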

The Tree-LSTM is computed as follows:

f_jk = σ(W_f x_j + U_f h_k + b_f) (4)

c_j = i_j * u_j + ∑_{k∈C(j)} f_jk * c_k (7)

h_j = o_j * tanh(c_j) (8)
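A minimal NumPy sketch of one node update built around equations (4), (7) and (8). The input gate, output gate and candidate vector are not shown in the text, so they are assumed here to follow the standard Child-Sum Tree-LSTM formulation, computed from the node input and the sum of the children's hidden states. The class name, the dimensions and the random weights are hypothetical, and no training is performed.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

class ChildSumTreeLSTMCell:
    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = np.random.default_rng(seed)

        def mat(rows, cols):
            return rng.normal(scale=0.1, size=(rows, cols))

        # One (W, U, b) triple per gate: input i, forget f, output o, candidate u.
        self.W = {g: mat(hidden_dim, input_dim) for g in "ifou"}
        self.U = {g: mat(hidden_dim, hidden_dim) for g in "ifou"}
        self.b = {g: np.zeros(hidden_dim) for g in "ifou"}
        self.hidden_dim = hidden_dim

    def forward(self, x_j, children):
        """x_j: input vector of node j; children: list of (h_k, c_k) pairs."""
        # Assumed (standard Child-Sum form): gates i, o, u read the summed child states.
        h_sum = sum((h for h, _ in children), np.zeros(self.hidden_dim))
        i_j = sigmoid(self.W["i"] @ x_j + self.U["i"] @ h_sum + self.b["i"])
        o_j = sigmoid(self.W["o"] @ x_j + self.U["o"] @ h_sum + self.b["o"])
        u_j = np.tanh(self.W["u"] @ x_j + self.U["u"] @ h_sum + self.b["u"])
        c_j = i_j * u_j
        for h_k, c_k in children:
            # Equation (4): a separate forget gate for each child k.
            f_jk = sigmoid(self.W["f"] @ x_j + self.U["f"] @ h_k + self.b["f"])
            # Equation (7): add the gated memory of child k.
            c_j = c_j + f_jk * c_k
        h_j = o_j * np.tanh(c_j)  # Equation (8)
        return h_j, c_j

# Two leaves feeding one head node, with random 4-dimensional word vectors.
rng = np.random.default_rng(1)
cell = ChildSumTreeLSTMCell(input_dim=4, hidden_dim=5)
leaf_a = cell.forward(rng.normal(size=4), [])
leaf_b = cell.forward(rng.normal(size=4), [])
h_root, c_root = cell.forward(rng.normal(size=4), [leaf_a, leaf_b])
```

The hidden state of the root node can then serve as the sentence vector; a per-child forget gate lets the head keep or discard the memory of each dependent independently.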

Step4, performing part-of-speech tagging on each word in the Chinese-Vietnamese parallel sentence pairs of the Chinese-Vietnamese parallel corpus, converting each part of speech into a vector, and splicing the vectors into the sentence semantic vector.
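The splicing in Step4 might look like the sketch below. The text does not say how the part-of-speech tags are turned into a vector, so this example assumes a normalised tag-count vector; the tag inventory `POS_TAGS`, the function names and the toy sentence vector are all hypothetical.

```python
from collections import Counter

# Hypothetical tag inventory; real Chinese and Vietnamese taggers use their own tag sets.
POS_TAGS = ["NOUN", "VERB", "ADJ", "ADV", "PRON", "OTHER"]

def pos_vector(tags):
    """Map a sentence's tag sequence to relative tag frequencies (an assumed encoding)."""
    counts = Counter(t if t in POS_TAGS else "OTHER" for t in tags)
    total = len(tags) or 1
    return [counts[t] / total for t in POS_TAGS]

def splice(sentence_vector, tags):
    """Concatenate the part-of-speech vector onto the sentence semantic vector."""
    return list(sentence_vector) + pos_vector(tags)

# Toy sentence vector (for example, a Tree-LSTM root state) and toy tags.
sentence_vector = [0.2, -0.5, 0.7]
tags = ["PRON", "VERB", "ADV", "ADJ", "NOUN"]
features = splice(sentence_vector, tags)
```

A learned part-of-speech embedding could be substituted for the count vector without changing the splicing step.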

Step5, after the final sentence vectors containing part-of-speech and syntactic structure information are obtained, their matching information is captured by the element-wise product and the absolute element-wise difference, and the probability that the two sentences are translations of each other is calculated using the fully connected layer.

p(y_j|c_j)＝σ(W^ch_j+c) (20)
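A minimal sketch of the matching step, assuming the two feature vectors are concatenated and passed through a single sigmoid output unit. The text does not give the layer sizes or the way the two features are combined, so the concatenation, the function names and the toy vectors are assumptions made for illustration.

```python
import numpy as np

def matching_features(h_zh, h_vi):
    """Element-wise product and absolute element-wise difference, concatenated."""
    return np.concatenate([h_zh * h_vi, np.abs(h_zh - h_vi)])

def translation_probability(h_zh, h_vi, w, b):
    """Single fully connected unit with a sigmoid output (assumed layer shape)."""
    z = w @ matching_features(h_zh, h_vi) + b
    return 1.0 / (1.0 + np.exp(-z))

# Toy sentence vectors for a Chinese and a Vietnamese sentence.
h_zh = np.array([1.0, -2.0, 0.5])
h_vi = np.array([1.0, 2.0, 0.5])

feats = matching_features(h_zh, h_vi)
# With untrained (all-zero) weights the unit is undecided and returns 0.5.
prob = translation_probability(h_zh, h_vi, np.zeros(6), 0.0)
```

The product highlights dimensions where the two sentences agree in sign and magnitude, while the absolute difference highlights dimensions where they diverge.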

Step6, comparative experiments were conducted using precision (P), recall (R), and F value (F) as evaluation metrics.

To verify the effectiveness of the model, the following experiments were set up. To verify the influence of the pre-trained bilingual word vector model on the extraction method, the baseline model was tested directly, as shown in the following table:

Table 1: Effect of pre-trained bilingual word vectors on the baseline model

As the table shows, the pre-trained word vectors bring words with the same meaning close to each other without changing the original word vectors, and the final classification results all improve slightly.

In addition, building on the pre-trained word vectors, and in order to study the effect of the differences between the two languages, the Bi-LSTM-based model was replaced with a Tree-LSTM model based on dependency syntax trees. The experimental results are shown in Table 2.

Table 2: Influence of syntactic structure information and part-of-speech information on the model

According to the experimental results, when the threshold ρ is set to 0.7 the model is more prone to confusion in its judgments, and when ρ is 0.9 all three evaluation metrics improve markedly. In addition, when sentences are encoded with the Tree-LSTM model fused with sentence structure information, the effect is clearly better than that of the Bi-LSTM-based model. Finally, part-of-speech information, used as auxiliary information, improves overall model performance.
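The three metrics and the role of the threshold ρ can be sketched as follows. The example assumes that a pair is accepted as parallel when its predicted probability is at least ρ and that the F value is the harmonic mean of precision and recall (F1); neither detail is stated in the text. The scores and gold labels are made-up toy values, not experimental data.

```python
def evaluate(probabilities, gold_labels, rho):
    """Return (precision, recall, F value) for pairs accepted at threshold rho."""
    predicted = [p >= rho for p in probabilities]
    tp = sum(1 for p, g in zip(predicted, gold_labels) if p and g)
    fp = sum(1 for p, g in zip(predicted, gold_labels) if p and not g)
    fn = sum(1 for p, g in zip(predicted, gold_labels) if not p and g)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall:
        f_value = 2 * precision * recall / (precision + recall)
    else:
        f_value = 0.0
    return precision, recall, f_value

# Toy classifier scores and gold labels (True means the pair is parallel).
scores = [0.95, 0.8, 0.75, 0.4, 0.92]
gold = [True, True, False, True, False]

metrics_07 = evaluate(scores, gold, rho=0.7)
metrics_09 = evaluate(scores, gold, rho=0.9)
```

Raising ρ accepts fewer candidate pairs, so the effect on each metric depends on how well the classifier's high-confidence scores line up with the true parallel pairs.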

While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.

Claims

1. A Chinese-Vietnamese parallel sentence pair extraction method fusing syntactic structure and Tree-LSTM, characterized in that the method comprises the following steps:

Step1, collecting a Chinese-Vietnamese parallel corpus used for training the parallel sentence pair extraction model and a Chinese-Vietnamese comparable corpus from Wikipedia as the extraction source, and dividing the collected parallel corpus into a training corpus and a test corpus;

Step4, tagging the part of speech of each word in the Chinese-Vietnamese parallel sentence pairs of the Chinese-Vietnamese parallel corpus, converting the parts of speech into vectors, and splicing the vectors into the sentence semantic vectors;

2. The Chinese-Vietnamese parallel sentence pair extraction method fusing syntactic structure and Tree-LSTM according to claim 1, characterized in that: in Step1, Scrapy is used as the crawler tool to simulate user operations and crawl Chinese-Vietnamese parallel sentence pairs according to the XPath of the page data elements; at the same time, the Chinese and Vietnamese Wikipedia dump data sets, which contain all Chinese and Vietnamese Wikipedia data, are downloaded, and the Chinese-Vietnamese comparable corpus is extracted by ID alignment.

3. The Chinese-Vietnamese parallel sentence pair extraction method fusing syntactic structure and Tree-LSTM according to claim 1, characterized in that: the specific steps of Step2 are as follows:

Step2.3, training the Chinese-Vietnamese bilingual word vectors using MUSE.

4. The Chinese-Vietnamese parallel sentence pair extraction method fusing syntactic structure and Tree-LSTM according to claim 1, characterized in that: in Step3, to extract the grammatical information of Chinese and Vietnamese, the Chinese dependency parsing tool provided by Stanford University is used to parse Chinese and generate a syntactic dependency tree, and the Vietnamese dependency treebank (VnDT) tool is used to parse Vietnamese and generate a syntactic dependency tree; the generated dependency trees are used as the input of the Tree-LSTM model.

5. The Chinese-Vietnamese parallel sentence pair extraction method fusing syntactic structure and Tree-LSTM according to claim 1, characterized in that: in Step4, part-of-speech tagging is performed separately on each word of the Chinese and Vietnamese sentences in the parallel sentence pairs of the Chinese-Vietnamese parallel corpus; the parts of speech are then converted into vectors and spliced into the sentence semantic vectors to generate the input vectors of the final extraction model.

6. The Chinese-Vietnamese parallel sentence pair extraction method fusing syntactic structure and Tree-LSTM according to claim 1, characterized in that: in Step5, the matching information of the final sentence semantic vectors, which contain part-of-speech and syntactic structure information, is captured by the element-wise product and the absolute element-wise difference, and a fully connected layer is used to calculate the probability that the two sentences are translations of each other, thereby performing supervised training.