CN109783809A

CN109783809A - A method for extracting aligned sentences from a Lao-Chinese document-level aligned corpus

Info

Publication number: CN109783809A
Application number: CN201811577667.2A
Authority: CN
Inventors: 周兰江; 贾善崇
Original assignee: Kunming University of Science and Technology
Current assignee: Kunming University of Science and Technology
Priority date: 2018-12-22
Filing date: 2018-12-22
Publication date: 2019-05-21
Anticipated expiration: 2038-12-22
Also published as: CN109783809B

Abstract

The invention discloses a method for extracting aligned sentences from a Lao-Chinese document-level aligned corpus, belonging to the technical field of natural language processing and machine learning. The document-level aligned corpus is first cleaned with regular expressions in Python to remove noise data and is then used as input. Because Lao and Chinese sentences appear in the same order, the document-level corpus can first be processed into individual aligned sentences, which are then split apart. The aligned sentences are segmented into words, and the segmented text serves as input to an LSTM. By retaining the intermediate outputs of the LSTM encoder over the input sequence, a model is trained to attend selectively to these inputs and to associate the output sequence with them as it generates output, thereby extracting parallel sentence pairs from the bilingual corpus. The invention has research value for the extraction of Lao parallel sentence pairs.

Description

A method for extracting aligned sentences from a Lao-Chinese document-level aligned corpus

Technical field

The present invention relates to a method for extracting aligned sentences from a Lao-Chinese document-level aligned corpus, and in particular to such a method based on LSTM (Long Short-Term Memory network). It belongs to the technical field of natural language processing and machine learning.

Background

Bilingual corpora are a fundamental resource for research areas such as statistical machine translation, cross-language retrieval, and bilingual dictionary construction; their quantity and quality strongly influence, and can even determine, the final results of related tasks. Mining parallel sentence pairs is a key technique for building bilingual corpora and therefore has significant research value. In many cases bilingual text can be obtained, but it is usually not aligned at the sentence level; some of it is aligned by paragraph or by whole article. Such corpora must first be rearranged into a sentence-aligned format before parallel sentence pairs can be extracted.

Summary of the invention

The technical problem to be solved by the present invention is to provide a method for extracting aligned sentences from a Lao-Chinese document-level aligned corpus, which extracts aligned sentences from a Chinese-Lao aligned corpus and can effectively improve the accuracy of sentence alignment.

The technical solution adopted by the present invention is a method for extracting aligned sentences from a Lao-Chinese document-level aligned corpus, comprising the following steps:

Step1: the Chinese-Lao bilingual corpus is first denoised with regular expressions in Python code, and the resulting aligned segments are then divided into data sets, with the aligned training set accounting for 90% and the out-of-order test set accounting for 10%;

Step2: based on the sentences in the training set and test set, the distinct words and the number of occurrences of each word are counted, and the word vectors of each sentence are computed by word embedding;

Step3: the word vectors obtained in Step2 are used as input to the LSTM algorithm, which here acts as the encoder; these word vectors are fed to the encoder side, and the encoder performs a similarity calculation using the initialization vector of the LSTM algorithm;

Step4: each word vector is output by the encoder and passed through a softmax function to obtain the semantic encoding C of each sentence's word vectors, forming a vector sequence;

Step5: the vector sequence obtained in Step4 is used as the initial input of the decoder, to which an attention mechanism is added. During decoding, each step selectively picks a subset of the vector sequence of the semantic encoding C for further processing. In the decoder, the output at each time step serves as the input of the next time step, so every output can make full use of the information carried by the input sequence, and so on until the end of the sequence;

Step6: the similarity between the encoder part and the decoder part is calculated to obtain the sentence word vectors with the highest similarity, and sentences are assembled from these word vectors, completing the extraction of Chinese-Lao aligned sentences from the aligned document-level corpus.
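As an illustration of Step1 only, the following is a minimal sketch of regular-expression cleaning and a 90%/10% split. The noise patterns (markup tags, URLs, repeated whitespace), the function names, and the reading of "out-of-order test set" as a test set whose target sentences are shuffled are all assumptions; the patent does not specify them.

```python
import random
import re

# Hypothetical noise patterns; the patent does not list the expressions it uses.
TAG_RE = re.compile(r"<[^>]+>")
URL_RE = re.compile(r"https?://\S+")
SPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Strip markup tags, URLs, and redundant whitespace from one segment."""
    text = TAG_RE.sub(" ", text)
    text = URL_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


def split_dataset(pairs, train_ratio=0.9, seed=0):
    """Return (train, test).

    train keeps the first 90% of pairs aligned; test holds the remaining
    10% with the target side shuffled (an assumed reading of "out-of-order").
    """
    cut = int(len(pairs) * train_ratio)
    train, held_out = pairs[:cut], pairs[cut:]
    targets = [zh for _, zh in held_out]
    random.Random(seed).shuffle(targets)
    test = list(zip((lo for lo, _ in held_out), targets))
    return train, test
```

A fixed seed keeps the shuffled test set reproducible between runs, which makes alignment accuracy comparable across experiments.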

Specifically, the aligned segments in Step1 are the aligned document-level corpus after noise processing.

Specifically, Step2 is implemented in Python: the initial document-level aligned corpus is split into sentences, each Lao sentence and each Chinese sentence is segmented into words by code, and the word counts are tallied.
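The counting part of Step2 might look like the sketch below. The `segment` function is a placeholder that assumes tokens are already separated by spaces; real Lao and Chinese word segmentation needs a dedicated segmenter, which the patent does not name. `build_vocab` is likewise a hypothetical helper.

```python
from collections import Counter


def segment(sentence: str) -> list[str]:
    """Placeholder segmenter: assumes tokens are already space-separated."""
    return sentence.split()


def build_vocab(sentences):
    """Count each distinct token and assign ids by descending frequency."""
    counts = Counter(tok for s in sentences for tok in segment(s))
    vocab = {tok: idx for idx, (tok, _) in enumerate(counts.most_common())}
    return counts, vocab
```

The resulting ids can index rows of an embedding table, which is one common way to obtain the word vectors that Step2 refers to.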

Specifically, the steps of Step3 are as follows:

The separated sentences are input and segmented into words, embedded by word embedding, and fed into the LSTM; the hidden layer then yields the hidden-layer information h_1, h_2, .... The hidden state of the encoder at the first time step is assumed to be Z_0 (the initialization variable). Z_0 is then compared with h_1, h_2, ... by similarity calculation to obtain a_{10}, a_{20}, a_{30}, ..., a_{ij} for each time step, where the subscript i of a is the index of the hidden-layer information in the encoder and the subscript j of a is the index of the neural network's initialization variable.
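The similarity calculation between Z_0 and the hidden states can be sketched as follows. The patent does not say which similarity measure is used; this sketch assumes a dot product followed by softmax normalization, and the function names are illustrative.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(u, v))


def attention_weights(z, hidden_states):
    """Score each encoder state h_i against z, then normalise with softmax."""
    scores = [dot(z, h) for h in hidden_states]
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

The returned list plays the role of a_{10}, a_{20}, a_{30}, ...: one non-negative weight per encoder time step, summing to 1.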

Specifically, in Step5 each decoding step of the decoder stage receives an input formed as a weighted sum of the hidden-layer information h_1, h_2, ..., h_t of the whole input sequence. That is, each time the next word is predicted, the hidden-layer information of the entire input sequence is read through to determine which words of the input sequence are most relevant to the word currently being predicted. With the attention mechanism, a context vector C_i is input at every step of the decoder stage, and the new hidden state S_i is obtained from a nonlinear function of the previous state S_{i-1}, Y_i, and C_i, as in formula (1). C_i is the weighted sum of the encoder output states over all time steps and is computed by formula (2). S_{i-1} and Y_i are the previous state of the decoder stage and the previously output prediction, respectively; h_j is the encoder output state at time step j, and a_{ij} is the weight of h_j for input i of the decoder stage:

S_i = F(S_{i-1}, Y_i, C_i)    (1)

C_i = \sum_{j=1}^{t} a_{ij} h_j    (2)
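A small numeric sketch of one decoder step: the context vector C_i is built as the weighted sum of encoder states h_j with weights a_{ij}, and a new state S_i is derived from S_{i-1}, Y_i, and C_i. The function F is not specified in the patent beyond being nonlinear; the element-wise tanh used here is only a stand-in for a trained LSTM cell, and both function names are hypothetical.

```python
import math


def context_vector(weights, hidden_states):
    """C_i: sum over j of a_ij * h_j, computed per vector dimension."""
    dim = len(hidden_states[0])
    return [sum(a * h[k] for a, h in zip(weights, hidden_states))
            for k in range(dim)]


def decoder_step(prev_state, prev_output, context):
    """Toy stand-in for F: element-wise tanh of the summed inputs.

    A real decoder would use a learned LSTM cell here (assumption).
    """
    return [math.tanh(s + y + c)
            for s, y, c in zip(prev_state, prev_output, context)]
```

With weights [0.25, 0.75] and states [4, 0] and [0, 4], the context vector is [1, 3], so the second encoder state dominates the next decoder input.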

Specifically, in Step6, after the similarity calculation, sentences are assembled from the word vectors, completing the extraction of Chinese-Lao aligned sentences from the aligned document-level corpus.
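Selecting the highest-similarity candidate in Step6 could be sketched as below. The patent does not name the similarity measure; cosine similarity over sentence vectors is an assumption, as are the function names.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return sum(x * y for x, y in zip(u, v)) / norm if norm else 0.0


def best_match(source_vec, candidate_vecs):
    """Return (index, score) of the candidate most similar to source_vec."""
    scores = [cosine(source_vec, c) for c in candidate_vecs]
    idx = max(range(len(scores)), key=scores.__getitem__)
    return idx, scores[idx]
```

In practice a minimum score threshold would likely be needed so that a source sentence with no true counterpart is not forced onto the nearest candidate.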

The beneficial effects of the present invention are:

(1) Compared with a plain encoder-decoder model, the LSTM-based method for extracting aligned sentences from a Lao-Chinese document-level aligned corpus improves accuracy on Chinese-Lao extraction.

(2) By using the LSTM algorithm, the method achieves better feature extraction than other algorithms.

(3) The method incorporates the grammatical features of both Lao and Chinese, which deep learning recognizes automatically; compared with manual identification, it is faster, generalizes better, and saves time and labor.

Brief description of the drawings

Fig. 1 is the flow chart of the present invention;

Fig. 2 is the basic structure diagram of the LSTM used by the present invention to train word vectors;

Fig. 3 is a schematic diagram of the encoder-decoder model with the attention mechanism of the present invention;

Fig. 4 is a schematic diagram of how the attention model of the present invention computes word vectors.

Detailed description of the embodiments

Embodiment 1: as shown in Figs. 1-4, a method for extracting aligned sentences from a Lao-Chinese document-level aligned corpus comprises Step1 to Step6 as set out in the Summary of the invention above.

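Further, the aligned segments in Step1 are the aligned document-level corpus after noise processing.

Further, Step2 is implemented in Python: the initial document-level aligned corpus is split into sentences, each Lao sentence and each Chinese sentence is segmented into words by code, and the word counts are tallied.

Further, the steps of Step3 are as follows: the separated sentences are input and segmented into words, embedded by word embedding, and fed into the LSTM; the hidden layer then yields the hidden-layer information h_1, h_2, .... The hidden state of the encoder at the first time step is assumed to be Z_0 (the initialization variable), and Z_0 is compared with h_1, h_2, ... by similarity calculation to obtain a_{10}, a_{20}, a_{30}, ..., a_{ij} for each time step.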

Further, in Step5 each decoding step of the decoder stage receives an input formed as a weighted sum of the hidden-layer information h_1, h_2, ..., h_t of the whole input sequence. That is, each time the next word is predicted, the hidden-layer information of the entire input sequence is read through to determine which words of the input sequence are most relevant to the word currently being predicted. With the attention mechanism, a context vector C_i is input at every step of the decoder stage, and the new hidden state S_i is obtained from a nonlinear function of the previous state S_{i-1}, Y_i, and C_i, as in formula (1). C_i is the weighted sum of the encoder output states over all time steps and is computed by formula (2). S_{i-1} and Y_i are the previous state of the decoder stage and the previously output prediction, respectively; h_j is the encoder output state at time step j, and a_{ij} is the weight of h_j for input i of the decoder stage:

S_i = F(S_{i-1}, Y_i, C_i)    (1)

C_i = \sum_{j=1}^{t} a_{ij} h_j    (2)

Further, in Step6, after the similarity calculation, sentences are assembled from the word vectors, completing the extraction of Chinese-Lao aligned sentences from the aligned document-level corpus.

Bilingual corpora are among the most important language resources in natural language research, and as research on language information processing has deepened, the acquisition and processing of corpora have advanced considerably. The present invention mainly integrates Lao linguistic features into the algorithm model and combines several models to improve recognition accuracy, using the attention mechanism and taking LSTM as the encoder-decoder. The document-level aligned corpus is first cleaned with regular expressions in Python to remove noise data and is then used as input. Because Lao and Chinese sentences appear in the same order, the document-level corpus can first be processed into individual aligned sentences, which are then split apart. The aligned sentences are then segmented into words, and the segmented text is fed to the LSTM. By retaining the intermediate outputs of the LSTM encoder over the input sequence, a model is trained to attend selectively to these inputs and to associate the output sequence with them as it generates output, thereby extracting parallel sentence pairs from the bilingual corpus. The invention has research value for the extraction of Lao parallel sentence pairs.
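Because the sentence order is stated to be the same in both languages, the initial split into sentence pairs can be sketched as a positional pairing. The terminator set, the assumption that Lao sentences end in ASCII punctuation, and the function names are all illustrative and are not taken from the patent.

```python
import re

# Assumed sentence terminators: Chinese marks and their ASCII equivalents.
SENTENCE_RE = re.compile(r"[^。！？.!?]+[。！？.!?]?")


def split_sentences(text: str) -> list[str]:
    """Split text into sentences, keeping the closing punctuation mark."""
    return [s.strip() for s in SENTENCE_RE.findall(text) if s.strip()]


def pair_by_order(lao_doc: str, zh_doc: str):
    """Pair the n-th Lao sentence with the n-th Chinese sentence."""
    lao, zh = split_sentences(lao_doc), split_sentences(zh_doc)
    if len(lao) != len(zh):
        # Positional pairing is unsafe when the counts disagree.
        raise ValueError("sentence counts differ")
    return list(zip(lao, zh))
```

Documents whose sentence counts differ are rejected rather than paired, since a single merged or split sentence would shift every later pair.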

The embodiments of the present invention have been explained in detail above with reference to the accompanying drawings, but the present invention is not limited to these embodiments; within the knowledge of a person skilled in the art, various changes can be made without departing from the concept of the present invention.

Claims

1. A method for extracting aligned sentences from a Lao-Chinese document-level aligned corpus, characterized in that it comprises the following steps:

Step1: the Chinese-Lao bilingual corpus is first denoised with regular expressions in Python code, and the resulting aligned segments are then divided into data sets, with the aligned training set accounting for 90% and the out-of-order test set accounting for 10%;

Step2: based on the sentences in the training set and test set, the distinct words and the number of occurrences of each word are counted, and the word vectors of each sentence are computed by word embedding;

Step3: the word vectors obtained in Step2 are used as input to the LSTM algorithm, which here acts as the encoder; these word vectors are fed to the encoder side, and the encoder performs a similarity calculation using the initialization vector of the LSTM algorithm;

Step4: each word vector is output by the encoder and passed through a softmax function to obtain the semantic encoding C of each sentence's word vectors, forming a vector sequence;

Step5: the vector sequence obtained in Step4 is used as the initial input of the decoder, to which an attention mechanism is added; during decoding, each step selectively picks a subset of the vector sequence of the semantic encoding C for further processing; in the decoder, the output at each time step serves as the input of the next time step, so every output can make full use of the information carried by the input sequence, and so on until the end of the sequence;

Step6: the similarity between the encoder part and the decoder part is calculated to obtain the sentence word vectors with the highest similarity, and sentences are assembled from these word vectors, completing the extraction of Chinese-Lao aligned sentences from the aligned document-level corpus.

2. The method for extracting aligned sentences from a Lao-Chinese document-level aligned corpus according to claim 1, characterized in that the aligned segments in Step1 are the aligned document-level corpus after noise processing.

3. The method for extracting aligned sentences from a Lao-Chinese document-level aligned corpus according to claim 1, characterized in that Step2 is implemented in Python: the initial document-level aligned corpus is split into sentences, each Lao sentence and each Chinese sentence is segmented into words by code, and the word counts are tallied.

4. The method for extracting aligned sentences from a Lao-Chinese document-level aligned corpus according to claim 1, characterized in that the steps of Step3 are as follows:

The separated sentences are input and segmented into words, embedded by word embedding, and fed into the LSTM; the hidden layer then yields the hidden-layer information h_1, h_2, .... The hidden state of the encoder at the first time step is assumed to be Z_0 (the initialization variable). Z_0 is then compared with h_1, h_2, ... by similarity calculation to obtain a_{10}, a_{20}, a_{30}, ..., a_{ij} for each time step, where the subscript i of a is the index of the hidden-layer information in the encoder and the subscript j of a is the index of the neural network's initialization variable.

5. The method for extracting aligned sentences from a Lao-Chinese document-level aligned corpus according to claim 4, characterized in that in Step5 each decoding step of the decoder stage receives an input formed as a weighted sum of the hidden-layer information h_1, h_2, ..., h_t of the whole input sequence; that is, each time the next word is predicted, the hidden-layer information of the entire input sequence is read through to determine which words of the input sequence are most relevant to the word currently being predicted. With the attention mechanism, a context vector C_i is input at every step of the decoder stage, and the new hidden state S_i is obtained from a nonlinear function of the previous state S_{i-1}, Y_i, and C_i, as in formula (1). C_i is the weighted sum of the encoder output states over all time steps and is computed by formula (2). S_{i-1} and Y_i are the previous state of the decoder stage and the previously output prediction, respectively; h_j is the encoder output state at time step j, and a_{ij} is the weight of h_j for input i of the decoder stage:

S_i = F(S_{i-1}, Y_i, C_i)    (1)

C_i = \sum_{j=1}^{t} a_{ij} h_j    (2)

6. The method for extracting aligned sentences from a Lao-Chinese document-level aligned corpus according to claim 1, characterized in that in Step6, after the similarity calculation, sentences are assembled from the word vectors, completing the extraction of Chinese-Lao aligned sentences from the aligned document-level corpus.