CN109783809B - Method for extracting aligned sentences from Laos-Chinese chapter level aligned corpus
- Publication number
- CN109783809B (application CN201811577667.2A)
- Authority
- CN
- China
- Prior art keywords
- aligned
- sentences
- input
- word
- corpus
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Machine Translation (AREA)
- Document Processing Apparatus (AREA)
Abstract
The invention discloses a method for extracting aligned sentences from Laos-Chinese chapter-level aligned corpora, and belongs to the technical field of natural language processing and machine learning. The chapter-level aligned corpus is first processed with regular expressions in Python to remove noise data and is used as input; because the Laos and Chinese sentences appear in the same order, the chapter-level aligned corpus can first be processed into single aligned sentences, and the aligned sentences are then split. The aligned sentences are then segmented into words and taken as LSTM input; by keeping the intermediate outputs of the LSTM encoder over the input sequence, the model is trained to learn the input selectively and to associate it with the output sequence at output time, thereby extracting parallel sentence pairs from the bilingual corpus. The method is of research significance for the extraction of Laos parallel sentences.
Description
Technical Field
The invention relates to a method for extracting aligned sentences from Laos-Chinese chapter-level aligned corpora, in particular to a method for extracting aligned sentences from Laos-Chinese chapter-level aligned corpora based on an LSTM (Long Short-Term Memory) network, and belongs to the technical field of natural language processing and machine learning.
Background
Bilingual corpora are an important basic resource in research fields such as statistical machine translation, cross-language retrieval and bilingual dictionary construction, and their quantity and quality influence, and to a great extent even determine, the final result of the related task. Mining parallel sentence pairs is a key technology for constructing bilingual corpora and therefore has important research value. In many cases a bilingual corpus is available, but the text is usually not aligned in sentence units; for example, some corpora are aligned by paragraph or by whole article. In this case, the corpora that are not aligned in sentence units need to be arranged into a sentence-aligned format so that parallel sentence pairs can be extracted.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method for extracting aligned sentences from Laos-Chinese chapter-level aligned corpora, which is used to extract aligned sentences from Chinese-Laos aligned corpora and can effectively improve the accuracy of sentence alignment.
The technical scheme adopted by the invention is as follows: a method for extracting aligned sentences from Laos-Chinese chapter-level aligned corpora comprises the following steps:
Step1, performing noise processing on the Chinese-Laos bilingual corpus with regular expressions in Python code, and then dividing the aligned segments into data sets, wherein the aligned training set accounts for 90% and the out-of-order test set accounts for 10%;
Step2, counting the distinct words in the training-set and test-set sentences and the occurrence frequency of each word, and computing word vectors for the sentences through word embedding;
Step3, taking the word vectors obtained in Step2 as the input of the LSTM algorithm, i.e. the LSTM algorithm serves as the encoder and the word vectors are the input of the encoder side, and performing similarity calculation at the encoder side against the initialization vector of the LSTM algorithm;
Step4, passing each word vector through the encoder and obtaining the semantic code C of each sentence's word vectors through a softmax function, forming a vector sequence;
Step5, taking the vector sequence obtained in Step4 as the initial input of the decoder and adding an Attention mechanism to the decoder; during decoding, a subset of the vector sequence of the semantic code C is selected at each step for further processing, so that in the decoder the output at each time step serves as the input of the next time step, and every output can make full use of the information carried by the input sequence, and so on until the end;
Step6, calculating the similarity between the encoder and the decoder to obtain the sentence word vectors with the highest similarity, and composing the sentences from these word vectors, thereby completing the extraction of Chinese-Laos bilingual aligned sentences from the aligned chapter-level corpus.
Specifically, the aligned segments in Step1 are the aligned chapter-level corpus after noise processing.
Specifically, Step2 performs sentence segmentation on the initial chapter-level aligned corpus through Python code, splits out individual Laos sentences and Chinese sentences, and counts the number of words.
Specifically, Step3 comprises the following specific steps:
The split sentences are taken as input and divided into words; after word embedding, the words are fed into the LSTM, and the hidden-layer information h_1, h_2, ... is obtained through the hidden layer. The hidden state of the encoder at the first time step is taken as the initial variable Z_0; similarity calculation is then performed between Z_0 and h_1, h_2, ..., giving a_10, a_20, a_30, ..., a_ij at each time step, where the subscript i of a denotes the index of the hidden-layer information in the encoder and the subscript j of a denotes the index of the initial variable of the neural network.
In particular, in Step5, at each decoding step in the decoder stage there is an input, and a weighted sum is taken over all hidden-layer information h_1, h_2, ..., h_t of the input sequence; that is, each time the next word is predicted, the hidden-layer information of the whole input sequence is visible, and the words of the input sequence most relevant to the current word are determined when it is predicted. The Attention mechanism means that at each step of the decoder stage a context vector C_i is input, and the new hidden-layer state S_i is a non-linear function of the previous state S_{i-1}, the previous output Y_i and C_i, as shown in formula (1), where C_i is the weighted sum of the encoder output states at each time step and is computed as in formula (2), S_{i-1} and Y_i are respectively the previous state and the previous predicted output of the decoder stage, h_j is the encoder output state at each time step, and a_{ij} is the weight of h_j corresponding to input i of the decoder stage;
S_i = F(S_{i-1}, Y_i, C_i)    (1)
C_i = Σ_j a_{ij} h_j    (2)
Specifically, in Step6, after the similarity calculation the sentences are composed from the word vectors, completing the extraction of Chinese-Laos bilingual aligned sentences from the aligned chapter-level corpus.
The invention has the beneficial effects that:
(1) The LSTM-based method for extracting aligned sentences from Laos-Chinese chapter-level aligned corpora improves the accuracy of Chinese-Laos extraction compared with a single encoder-decoder algorithm model.
(2) The method uses the LSTM algorithm, which improves the effect of feature extraction compared with other algorithms.
(3) The method integrates Laos and Chinese grammatical features, which can be recognized automatically through deep learning; compared with manual recognition, this is faster, generalizes more strongly, and saves time and labour.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a basic block diagram of the LSTM used in the present invention to train word vectors;
FIG. 3 is a schematic diagram of the encoder-decoder model with the Attention mechanism of the present invention;
FIG. 4 is a diagram of the Attention model computing word vectors of the present invention.
Detailed Description
Example 1: as shown in FIGS. 1-4, a method for extracting aligned sentences from Laos-Chinese chapter-level aligned corpora includes the following steps:
Step1, performing noise processing on the Chinese-Laos bilingual corpus with regular expressions in Python code, and then dividing the aligned segments into data sets, wherein the aligned training set accounts for 90% and the out-of-order test set accounts for 10%;
Step2, counting the distinct words in the training-set and test-set sentences and the occurrence frequency of each word, and computing word vectors for the sentences through word embedding;
Step3, taking the word vectors obtained in Step2 as the input of the LSTM algorithm, i.e. the LSTM algorithm serves as the encoder and the word vectors are the input of the encoder side, and performing similarity calculation at the encoder side against the initialization vector of the LSTM algorithm;
Step4, passing each word vector through the encoder and obtaining the semantic code C of each sentence's word vectors through a softmax function, forming a vector sequence;
Step5, taking the vector sequence obtained in Step4 as the initial input of the decoder and adding an Attention mechanism to the decoder; during decoding, a subset of the vector sequence of the semantic code C is selected at each step for further processing, so that in the decoder the output at each time step serves as the input of the next time step, and every output can make full use of the information carried by the input sequence, and so on until the end;
Step6, calculating the similarity between the encoder and the decoder to obtain the sentence word vectors with the highest similarity, and composing the sentences from these word vectors, thereby completing the extraction of Chinese-Laos bilingual aligned sentences from the aligned chapter-level corpus.
Further, the aligned segments in Step1 are the aligned chapter-level corpus after noise processing.
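A minimal Python sketch of this Step1 preprocessing is given below; the regular expressions, the tab-separated input file name `lo_zh_chapters.txt` and the fixed random seed are illustrative assumptions rather than the patterns actually used by the invention.

```python
import random
import re

def clean(text):
    """Remove noise from a segment; the patterns here are illustrative assumptions."""
    text = re.sub(r"<[^>]+>", " ", text)        # residual HTML/XML tags
    text = re.sub(r"[\u200b\ufeff]", "", text)  # zero-width and BOM characters
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace

# Assumed input: one tab-separated Laos/Chinese aligned segment per line.
with open("lo_zh_chapters.txt", encoding="utf-8") as f:
    segments = [tuple(map(clean, line.split("\t", 1))) for line in f if "\t" in line]

random.seed(0)
random.shuffle(segments)                 # shuffling yields the out-of-order test set
split = int(0.9 * len(segments))
train_set = segments[:split]             # 90% aligned training set
test_set = segments[split:]              # 10% out-of-order test set
```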
Further, Step2 performs sentence segmentation on the initial chapter-level aligned corpus through Python code, splits out individual Laos sentences and Chinese sentences, and counts the number of words.
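The sentence splitting and word counting of Step2 can be sketched as below; the punctuation sets, the character-level stand-in segmenter and the demonstration sentence pair are illustrative assumptions, and a dedicated Chinese and Laos word segmenter would be used in practice.

```python
import re
from collections import Counter

def split_sentences(zh_text, lo_text):
    """Split an aligned Chinese/Laos segment pair into sentence pairs.
    This relies on the two languages keeping the same sentence order."""
    zh = [s.strip() for s in re.split(r"[。！？]", zh_text) if s.strip()]
    lo = [s.strip() for s in re.split(r"[.!?]", lo_text) if s.strip()]
    return list(zip(zh, lo))

def tokenize(sentence):
    # Character-level stand-in for a real word segmenter.
    return [ch for ch in sentence if not ch.isspace()]

pairs = split_sentences("今天天气很好。我们去公园。",
                        "ມື້ນີ້ອາກາດດີ. ພວກເຮົາໄປສວນສາທາລະນະ.")
counts = Counter(tok for zh, lo in pairs for tok in tokenize(zh) + tokenize(lo))
vocab = {tok: i + 1 for i, (tok, _) in enumerate(counts.most_common())}  # 0 is padding
```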
Further, Step3 comprises the following specific steps:
The split sentences are taken as input and divided into words; after word embedding, the words are fed into the LSTM, and the hidden-layer information h_1, h_2, ... is obtained through the hidden layer. The hidden state of the encoder at the first time step is taken as the initial variable Z_0; similarity calculation is then performed between Z_0 and h_1, h_2, ..., giving a_10, a_20, a_30, ..., a_ij at each time step, where the subscript i of a denotes the index of the hidden-layer information in the encoder and the subscript j of a denotes the index of the initial variable of the neural network.
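The encoder pass of Step3 can be sketched as follows. PyTorch, the dot-product similarity and all sizes are assumptions; the patent only specifies an LSTM encoder, an initial state Z_0 and a similarity calculation between Z_0 and the hidden-layer information h_1, h_2, ....

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim, seq_len = 1000, 128, 256, 12

embedding = nn.Embedding(vocab_size, emb_dim)
encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

tokens = torch.randint(1, vocab_size, (1, seq_len))   # one word-indexed sentence
h, _ = encoder(embedding(tokens))                     # h holds h_1 ... h_t, shape (1, seq_len, hidden_dim)

z0 = torch.zeros(hidden_dim)                          # initial variable Z_0
scores = h.squeeze(0) @ z0                            # dot-product similarity of Z_0 with each h_j
a = torch.softmax(scores, dim=0)                      # normalised weights a_10, a_20, ...
```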
Further, in Step5, at each decoding step in the decoder stage there is an input, and a weighted sum is taken over all hidden-layer information h_1, h_2, ..., h_t of the input sequence; that is, each time the next word is predicted, the hidden-layer information of the whole input sequence is visible, and the words of the input sequence most relevant to the current word are determined when it is predicted. The Attention mechanism means that at each step of the decoder stage a context vector C_i is input, and the new hidden-layer state S_i is a non-linear function of the previous state S_{i-1}, the previous output Y_i and C_i, as shown in formula (1), where C_i is the weighted sum of the encoder output states at each time step and is computed as in formula (2), S_{i-1} and Y_i are respectively the previous state and the previous predicted output of the decoder stage, h_j is the encoder output state at each time step, and a_{ij} is the weight of h_j corresponding to input i of the decoder stage;
S_i = F(S_{i-1}, Y_i, C_i)    (1)
C_i = Σ_j a_{ij} h_j    (2)
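The sketch below works through formulas (1) and (2) for a single decoding step, again assuming PyTorch and dot-product attention scores; using a GRU cell as the non-linear function F is a simplification for brevity (the patent's decoder is an LSTM, which would also carry a cell state).

```python
import torch
import torch.nn as nn

hidden_dim, emb_dim, seq_len = 256, 128, 12

h = torch.randn(seq_len, hidden_dim)     # encoder output states h_1 ... h_t from Step3
s_prev = torch.zeros(1, hidden_dim)      # previous decoder state S_{i-1}
y_prev = torch.randn(1, emb_dim)         # embedding of the previously predicted word Y_i

# Formula (2): weights a_ij over the encoder states, then C_i as their weighted sum.
scores = h @ s_prev.squeeze(0)                          # one score per encoder position
a = torch.softmax(scores, dim=0)                        # attention weights a_ij
c_i = (a.unsqueeze(1) * h).sum(dim=0, keepdim=True)     # context vector C_i

# Formula (1): S_i = F(S_{i-1}, Y_i, C_i), with F realised here as a GRU cell over [Y_i; C_i].
decoder_cell = nn.GRUCell(emb_dim + hidden_dim, hidden_dim)
s_i = decoder_cell(torch.cat([y_prev, c_i], dim=1), s_prev)
```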
Further, in Step6, after the similarity calculation the sentences are composed from the word vectors, completing the extraction of Chinese-Laos bilingual aligned sentences from the aligned chapter-level corpus.
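A sketch of the Step6 pairing by similarity, assuming PyTorch, mean-pooled sentence vectors and cosine similarity with a hypothetical threshold; the patent itself only requires selecting, for each sentence, the counterpart with the highest similarity and composing the sentence from its word vectors.

```python
import torch
import torch.nn.functional as F

def sentence_vector(hidden_states):
    # Mean-pool per-word hidden states into one sentence vector (the pooling choice is an assumption).
    return hidden_states.mean(dim=0)

def extract_pairs(lao_states, zh_states, threshold=0.7):
    """Match each Laos sentence to the Chinese sentence with the highest similarity."""
    lao_vecs = torch.stack([sentence_vector(h) for h in lao_states])
    zh_vecs = torch.stack([sentence_vector(h) for h in zh_states])
    sim = F.cosine_similarity(lao_vecs.unsqueeze(1), zh_vecs.unsqueeze(0), dim=-1)
    pairs = []
    for i in range(sim.size(0)):
        j = int(torch.argmax(sim[i]))
        if float(sim[i, j]) >= threshold:        # keep only sufficiently similar matches
            pairs.append((i, j, float(sim[i, j])))
    return pairs

# Toy usage with random hidden states for two Laos and two Chinese sentences.
demo = extract_pairs([torch.randn(5, 256), torch.randn(7, 256)],
                     [torch.randn(6, 256), torch.randn(4, 256)], threshold=0.0)
```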
Bilingual corpora are among the most important language resources in natural language research; as research on language information processing deepens, great progress has been made in corpus acquisition. The invention mainly incorporates Laos linguistic features into the algorithm model, combines several models to improve recognition accuracy, and uses an Attention mechanism with the LSTM as the encoder-decoder. First, the chapter-level aligned corpus is processed with regular expressions in Python to remove noise data and is used as input; because the Laos and Chinese sentences appear in the same order, the chapter-level aligned corpus can first be processed into single aligned sentences, and the aligned sentences are then split. The aligned sentences are then segmented into words and taken as LSTM input; by keeping the intermediate outputs of the LSTM encoder over the input sequence, the model is trained to learn the input selectively and to associate it with the output sequence at output time, thereby extracting parallel sentence pairs from the bilingual corpus. The method is of research significance for the extraction of Laos parallel sentences.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit and scope of the present invention.
Claims (6)
1. A method for extracting aligned sentences from Laos-Chinese chapter-level aligned corpora, characterized in that the method comprises the following steps:
Step1, performing noise processing on the Chinese-Laos bilingual corpus with regular expressions in Python code, and then dividing the aligned segments into data sets, wherein the aligned training set accounts for 90% and the out-of-order test set accounts for 10%;
Step2, counting the distinct words in the training-set and test-set sentences and the occurrence frequency of each word, and computing word vectors for the sentences through word embedding;
Step3, taking the word vectors obtained in Step2 as the input of the LSTM algorithm, i.e. the LSTM algorithm serves as the encoder and the word vectors are the input of the encoder side, and performing similarity calculation at the encoder side against the initialization vector of the LSTM algorithm;
Step4, passing each word vector through the encoder and obtaining the semantic code C of each sentence's word vectors through a softmax function, forming a vector sequence;
Step5, taking the vector sequence obtained in Step4 as the initial input of the decoder and adding an Attention mechanism to the decoder; during decoding, a subset of the vector sequence of the semantic code C is selected at each step for further processing, so that in the decoder the output at each time step serves as the input of the next time step, and every output can make full use of the information carried by the input sequence, and so on until the end;
Step6, calculating the similarity between the encoder and the decoder to obtain the sentence word vectors with the highest similarity, and composing the sentences from these word vectors, thereby completing the extraction of Chinese-Laos bilingual aligned sentences from the aligned chapter-level corpus.
2. The method of claim 1, characterized in that the aligned segments in Step1 are the aligned chapter-level corpus after noise processing.
3. The method of claim 1, characterized in that Step2 performs sentence segmentation on the initial chapter-level aligned corpus through Python code, splits out individual Laos sentences and Chinese sentences, and counts the number of words.
4. The method of claim 1, characterized in that Step3 comprises the following specific steps:
The split sentences are taken as input and divided into words; after word embedding, the words are fed into the LSTM, and the hidden-layer information h_1, h_2, ... is obtained through the hidden layer. The hidden state of the encoder at the first time step is taken as the initial variable Z_0; similarity calculation is then performed between Z_0 and h_1, h_2, ..., giving a_10, a_20, a_30, ..., a_ij at each time step, where the subscript i of a denotes the index of the hidden-layer information in the encoder and the subscript j of a denotes the index of the initial variable of the neural network.
5. The method of claim 4, characterized in that in Step5, at each decoding step in the decoder stage there is an input, and a weighted sum is taken over all hidden-layer information h_1, h_2, ..., h_t of the input sequence; that is, each time the next word is predicted, the hidden-layer information of the whole input sequence is visible, and the words of the input sequence most relevant to the current word are determined when it is predicted; the Attention mechanism means that at each step of the decoder stage a context vector C_i is input, and the new hidden-layer state S_i is a non-linear function of the previous state S_{i-1}, the previous output Y_i and C_i, as shown in formula (1), where C_i is the weighted sum of the encoder output states at each time step and is computed as in formula (2), S_{i-1} and Y_i are respectively the previous state and the previous predicted output of the decoder stage, h_j is the encoder output state at each time step, and a_{ij} is the weight of h_j corresponding to input i of the decoder stage;
S_i = F(S_{i-1}, Y_i, C_i)    (1)
C_i = Σ_j a_{ij} h_j    (2)
6. The method of claim 1, characterized in that in Step6, after the similarity calculation the sentences are composed from the word vectors, completing the extraction of Chinese-Laos bilingual aligned sentences from the aligned chapter-level corpus.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811577667.2A CN109783809B (en) | 2018-12-22 | 2018-12-22 | Method for extracting aligned sentences from Laos-Chinese chapter level aligned corpus |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811577667.2A CN109783809B (en) | 2018-12-22 | 2018-12-22 | Method for extracting aligned sentences from Laos-Chinese chapter level aligned corpus |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109783809A CN109783809A (en) | 2019-05-21 |
CN109783809B true CN109783809B (en) | 2022-04-12 |
Family
ID=66498083
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811577667.2A Active CN109783809B (en) | 2018-12-22 | 2018-12-22 | Method for extracting aligned sentences from Laos-Chinese chapter level aligned corpus |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109783809B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110362820B (en) * | 2019-06-17 | 2022-11-01 | 昆明理工大学 | Bi-LSTM algorithm-based method for extracting bilingual parallel sentences in old and Chinese |
CN110414009B (en) * | 2019-07-09 | 2021-02-05 | 昆明理工大学 | Burma bilingual parallel sentence pair extraction method and device based on BilSTM-CNN |
CN110489102B (en) * | 2019-07-29 | 2021-06-18 | 东北大学 | Method for automatically generating Python code from natural language |
CN110717341B (en) * | 2019-09-11 | 2022-06-14 | 昆明理工大学 | Method and device for constructing old-Chinese bilingual corpus with Thai as pivot |
CN112232090A (en) * | 2020-09-17 | 2021-01-15 | 昆明理工大学 | Chinese-crossing parallel sentence pair extraction method fusing syntactic structure and Tree-LSTM |
CN112287688B (en) * | 2020-09-17 | 2022-02-11 | 昆明理工大学 | English-Burmese bilingual parallel sentence pair extraction method and device integrating pre-training language model and structural features |
CN113095091A (en) * | 2021-04-09 | 2021-07-09 | 天津大学 | Chapter machine translation system and method capable of selecting context information |
CN113705168B (en) * | 2021-08-31 | 2023-04-07 | 苏州大学 | Chapter neural machine translation method and system based on cross-level attention mechanism |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104391885A (en) * | 2014-11-07 | 2015-03-04 | 哈尔滨工业大学 | Method for extracting chapter-level parallel phrase pair of comparable corpus based on parallel corpus training |
CN105022728A (en) * | 2015-07-13 | 2015-11-04 | 广西达译商务服务有限责任公司 | Automatic acquisition system of Chinese and Lao bilingual parallel texts and implementation method |
JP2018072979A (en) * | 2016-10-26 | 2018-05-10 | 株式会社エヌ・ティ・ティ・データ | Parallel translation sentence extraction device, parallel translation sentence extraction method and program |
CN107967262A (en) * | 2017-11-02 | 2018-04-27 | 内蒙古工业大学 | A kind of neutral net covers Chinese machine translation method |
CN108549629A (en) * | 2018-03-19 | 2018-09-18 | 昆明理工大学 | A kind of combination similarity and scheme matched old-Chinese bilingual sentence alignment schemes |
CN109062897A (en) * | 2018-07-26 | 2018-12-21 | 苏州大学 | Sentence alignment method based on deep neural network |
Non-Patent Citations (2)
Title |
---|
Research on Chinese-Lao bilingual sentence alignment methods (汉老双语句子对齐方法研究); 让子强; China Master's Theses Full-text Database, Information Science and Technology; 20180115; I138-2044 *
A Chinese-Lao bilingual alignment method incorporating multiple features (融入多特征的汉-老双语对齐方法); 贾善崇 et al.; China Water Transport (中国水运); 20200331; Vol. 20, No. 3; 78-80 *
Also Published As
Publication number | Publication date |
---|---|
CN109783809A (en) | 2019-05-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109783809B (en) | Method for extracting aligned sentences from Laos-Chinese chapter level aligned corpus | |
CN111143550B (en) | Method for automatically identifying dispute focus based on hierarchical attention neural network model | |
CN109635124B (en) | Remote supervision relation extraction method combined with background knowledge | |
CN111046946B (en) | Burma language image text recognition method based on CRNN | |
CN110532554B (en) | Chinese abstract generation method, system and storage medium | |
CN110969020B (en) | CNN and attention mechanism-based Chinese named entity identification method, system and medium | |
CN109543181B (en) | Named entity model and system based on combination of active learning and deep learning | |
CN114861600B (en) | NER-oriented Chinese clinical text data enhancement method and device | |
CN108984526A (en) | A kind of document subject matter vector abstracting method based on deep learning | |
CN110688862A (en) | Mongolian-Chinese inter-translation method based on transfer learning | |
CN107480143A (en) | Dialogue topic dividing method and system based on context dependence | |
CN108491372B (en) | Chinese word segmentation method based on seq2seq model | |
CN110083826A (en) | A kind of old man's bilingual alignment method based on Transformer model | |
CN110414009B (en) | Burma bilingual parallel sentence pair extraction method and device based on BilSTM-CNN | |
CN114757182A (en) | BERT short text sentiment analysis method for improving training mode | |
CN110555084A (en) | remote supervision relation classification method based on PCNN and multi-layer attention | |
CN114818891B (en) | Small sample multi-label text classification model training method and text classification method | |
CN112560486A (en) | Power entity identification method based on multilayer neural network, storage medium and equipment | |
CN110222338B (en) | Organization name entity identification method | |
CN107894975A (en) | A kind of segmenting method based on Bi LSTM | |
CN106610937A (en) | Information theory-based Chinese automatic word segmentation method | |
CN113553847A (en) | Method, device, system and storage medium for parsing address text | |
CN107992468A (en) | A kind of mixing language material name entity recognition method based on LSTM | |
CN114036908A (en) | Chinese chapter-level event extraction method and device integrated with word list knowledge | |
CN112380882B (en) | Mongolian Chinese neural machine translation method with error correction function |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||