CN109783809B - Method for extracting aligned sentences from Laos-Chinese chapter level aligned corpus - Google Patents

Method for extracting aligned sentences from Laos-Chinese chapter level aligned corpus

Info

Publication number
CN109783809B
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201811577667.2A
Other languages
Chinese (zh)
Other versions
CN109783809A (en)
Inventor
周兰江
贾善崇
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN201811577667.2A
Publication of CN109783809A
Application granted
Publication of CN109783809B
Legal status: Active

Landscapes

  • Machine Translation (AREA)
  • Document Processing Apparatus (AREA)

Abstract

The invention discloses a method for extracting aligned sentences from a Lao-Chinese chapter-level aligned corpus, and belongs to the technical field of natural language processing and machine learning. The chapter-level aligned corpus is first cleaned in Python with regular expressions to remove noise data and is then used as input. Because Lao and Chinese sentences follow the same order, the chapter-level aligned corpus can first be processed into individual aligned sentences, which are then split. The aligned sentences are then segmented into words and fed to an LSTM. By retaining the intermediate outputs of the LSTM encoder over the input sequence, the model is trained to attend selectively to those inputs and to associate them with the output sequence as it is generated, thereby extracting parallel sentence pairs from the bilingual corpus. The method has research significance for the extraction of Lao parallel sentences.

Description

Method for extracting aligned sentences from Laos-Chinese chapter level aligned corpus
Technical Field
The invention relates to a method for extracting aligned sentences from a Lao-Chinese chapter-level aligned corpus, in particular to such a method based on an LSTM (Long Short-Term Memory) network, and belongs to the technical field of natural language processing and machine learning.
Background
The bilingual corpus is an important basic resource in research fields such as statistical machine translation, cross-language retrieval, bilingual dictionary construction and the like, and the quantity and quality of the bilingual corpus influence and even determine the final result of a related task to a great extent. The mining of the parallel sentence pairs is a key technology for constructing bilingual corpus, so that the method has important research value. In many cases, bilingual corpus is available, but the resulting text is usually not aligned in sentence units, e.g., some aligned in paragraphs or in whole articles. In this case, it is necessary to extract parallel sentence pairs by arranging these corpora, which are not aligned in units of sentences, into a sentence alignment format.
Disclosure of Invention
The technical problem to be solved by the invention is as follows: to provide a method for extracting aligned sentences from a Lao-Chinese chapter-level aligned corpus that extracts aligned sentences from Chinese-Lao aligned text and effectively improves the accuracy of sentence alignment.
The technical scheme adopted by the invention is as follows: a method for extracting aligned sentences from a Lao-Chinese chapter-level aligned corpus, comprising the following steps:
Step1, removing noise from the Chinese-Lao bilingual corpus with regular expressions in Python code, and then dividing the aligned segments into data sets, wherein the aligned training set accounts for 90% and the out-of-order test set accounts for 10%;
Step2, counting, from the sentences in the training set and the test set, the distinct phrases and the number of occurrences of each phrase, and computing the word vectors of the sentences through word embedding;
Step3, using the word vectors obtained in Step2 as the input of the LSTM, which serves as the encoder at this stage, and performing similarity calculation in the encoder against the initialization vector of the LSTM;
Step4, outputting each word vector through the encoder, and obtaining the semantic code C of each sentence's word vectors through a softmax function to form a vector sequence;
Step5, taking the vector sequence obtained in Step4 as the initial input of the decoder and adding an Attention mechanism to the decoder, so that at each decoding step a subset of the vector sequence of the semantic code C is selected for further processing; in the decoder, the output at each moment is used as the input at the next moment, so that every output can make full use of the information carried by the input sequence, and so on until the end;
Step6, calculating the similarity between the encoder and the decoder to obtain the sentence word vectors with the highest similarity, and composing the sentence from those word vectors, thereby completing the extraction of Chinese-Lao bilingual aligned sentences from the chapter-level aligned corpus.
Specifically, the aligned segments in Step1 are the chapter-level aligned corpus after noise processing.
Specifically, Step2 splits the initial chapter-level aligned corpus into sentences in Python code, separating it into individual Lao sentences and Chinese sentences, and counts the number of words.
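The preprocessing described in Step1 and Step2 can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the noise pattern, the set of sentence terminators, the whitespace-based tokenization (Lao and Chinese text would first need a word segmenter) and all function names are hypothetical, since the patent does not specify them.

```python
import random
import re
from collections import Counter

# Hypothetical noise pattern: markup tags, URLs and bracketed markers.
NOISE = re.compile(r"<[^>]+>|https?://\S+|\[[^\]]*\]")
# Assumed sentence shape: text up to one of these terminators.
SENTENCE = re.compile(r"[^。！？.!?]+[。！？.!?]?")


def clean(text):
    """Remove noise spans and collapse runs of whitespace."""
    return re.sub(r"\s+", " ", NOISE.sub("", text)).strip()


def split_sentences(text):
    """Split cleaned text into sentences at the assumed terminators."""
    parts = (s.strip() for s in SENTENCE.findall(clean(text)))
    return [s for s in parts if s]


def split_dataset(pairs, train_ratio=0.9, seed=0):
    """Keep the first 90% aligned; shuffle the targets of the last 10%."""
    cut = int(len(pairs) * train_ratio)
    train, test = pairs[:cut], pairs[cut:]
    targets = [tgt for _, tgt in test]
    random.Random(seed).shuffle(targets)
    test = [(src, tgt) for (src, _), tgt in zip(test, targets)]
    return train, test


def count_words(sentences):
    """Count occurrences of each token (tokens assumed space-separated)."""
    return Counter(word for s in sentences for word in s.split())
```

Shuffling only the target side of the held-out pairs is one way to obtain the "out-of-order" test set the text mentions; the patent does not say how the shuffling is done.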
Specifically, Step3 proceeds as follows:
The split sentences are input and divided into words; the words, after word embedding, are fed into the LSTM, and the hidden layer yields the hidden-layer information $h_1, h_2, \ldots$. The hidden state of the encoder at the first moment is assumed to be $Z_0$ (the initial variable). Similarity is then calculated between $Z_0$ and $h_1, h_2, \ldots$ to obtain $a_{10}, a_{20}, a_{30}, \ldots, a_{ij}$ for each moment, where the subscript $i$ of $a$ is the index of the hidden-layer information in the encoder and the subscript $j$ of $a$ is the index of the initial variable of the neural network.
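The similarity calculation between the initial state and the encoder hidden states can be illustrated with a small sketch. The patent does not name the similarity function, so the dot product used here is an assumption, as is normalising the scores with softmax (the function Step4 mentions); the function names are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_weights(z, hidden_states):
    """Score every encoder state h against the state z, then normalise.

    Assumption: similarity is a dot product; the source only says
    "similarity calculation".
    """
    return softmax([dot(z, h) for h in hidden_states])
```

With `z = [1.0, 0.0]` and hidden states `[[1.0, 0.0], [0.0, 1.0]]`, the first state receives the larger weight because it points in the same direction as `z`.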
Specifically, in Step5 each decoding step of the decoder stage receives an input formed as a weighted sum over all hidden-layer information $h_1, h_2, \ldots, h_t$ of the input sequence. That is, the hidden-layer information of the whole input sequence is consulted each time the next word is predicted, which determines which word of the input sequence is most relevant to the word currently being predicted. Under the Attention mechanism, a context vector $C_i$ is input at each step of the decoder stage, and the new hidden-layer state $S_i$ is a nonlinear function of $S_{i-1}$, $Y_i$ and $C_i$, as shown in formula (1). $S_{i-1}$ and $Y_i$ are, respectively, the previous state and the predicted value of the previous output of the decoder stage. $C_i$ is the weighted sum of the output states of the encoder stage at every moment, computed by formula (2), where $h_j$ is the output state of the encoder stage at moment $j$ and $a_{ij}$ is the weight of $h_j$ for input $i$ of the decoder stage:
$$S_i = F(S_{i-1}, Y_i, C_i) \tag{1}$$
$$C_i = \sum_{j=1}^{t} a_{ij} h_j \tag{2}$$
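A short sketch of formulas (1) and (2) may make the data flow concrete. The context vector follows the weighted sum described in the text. The patent leaves the nonlinear function F unspecified, so the element-wise tanh below is purely a hypothetical stand-in (a trained decoder would use a learned LSTM cell), and the function names are illustrative.

```python
import math


def context_vector(weights, hidden_states):
    """Formula (2): C_i is the sum over j of a_ij * h_j."""
    dim = len(hidden_states[0])
    return [
        sum(a * h[k] for a, h in zip(weights, hidden_states))
        for k in range(dim)
    ]


def decoder_step(s_prev, y_prev, c):
    """Formula (1) with a hypothetical F.

    Assumption: F is an element-wise tanh of the sum of the three
    vectors; the source only states that F is nonlinear.
    """
    return [math.tanh(s + y + ci) for s, y, ci in zip(s_prev, y_prev, c)]
```

For example, `context_vector([0.5, 0.5], [[1.0, 0.0], [0.0, 1.0]])` averages the two encoder states, and the result is passed to `decoder_step` together with the previous state and previous output.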
Specifically, after the similarity calculation, Step6 composes sentences from the word vectors, completing the extraction of Chinese-Lao bilingual aligned sentences from the chapter-level aligned corpus.
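The selection in Step6 can be pictured as a nearest-neighbour search over sentence vectors. The sketch below assumes cosine similarity and assumes each sentence is already represented by a single vector; the patent specifies neither, and the function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (norm_u * norm_v)


def best_matches(source_vectors, target_vectors):
    """For each source vector, return the index of the most similar target."""
    return [
        max(range(len(target_vectors)),
            key=lambda j: cosine(src, target_vectors[j]))
        for src in source_vectors
    ]
```

Each returned index pairs a source sentence with its highest-scoring candidate, which corresponds to picking "the sentence word vector with the highest similarity" in the text.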
The invention has the beneficial effects that:
(1) Compared with a single encoder-decoder model, the LSTM-based method for extracting aligned sentences from a Lao-Chinese chapter-level aligned corpus improves the accuracy of Chinese-Lao extraction.
(2) The method uses the LSTM algorithm, which improves feature extraction compared with other algorithms.
(3) The method integrates Lao and Chinese grammatical features, which are recognized automatically through deep learning; compared with manual recognition, it is faster, generalizes better, and saves time and labor.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a basic block diagram of an LSTM training word vector used in the present invention;
FIG. 3 is a schematic diagram of an encoder-decoder model of the Attention mechanism of the present invention;
FIG. 4 is a diagram of the Attention model computing word vectors of the present invention.
Detailed Description
Example 1: as shown in FIGS. 1-4, a method for extracting aligned sentences from a Lao-Chinese chapter-level aligned corpus includes the following steps:
Step1, removing noise from the Chinese-Lao bilingual corpus with regular expressions in Python code, and then dividing the aligned segments into data sets, wherein the aligned training set accounts for 90% and the out-of-order test set accounts for 10%;
Step2, counting, from the sentences in the training set and the test set, the distinct phrases and the number of occurrences of each phrase, and computing the word vectors of the sentences through word embedding;
Step3, using the word vectors obtained in Step2 as the input of the LSTM, which serves as the encoder at this stage, and performing similarity calculation in the encoder against the initialization vector of the LSTM;
Step4, outputting each word vector through the encoder, and obtaining the semantic code C of each sentence's word vectors through a softmax function to form a vector sequence;
Step5, taking the vector sequence obtained in Step4 as the initial input of the decoder and adding an Attention mechanism to the decoder, so that at each decoding step a subset of the vector sequence of the semantic code C is selected for further processing; in the decoder, the output at each moment is used as the input at the next moment, so that every output can make full use of the information carried by the input sequence, and so on until the end;
Step6, calculating the similarity between the encoder and the decoder to obtain the sentence word vectors with the highest similarity, and composing the sentence from those word vectors, thereby completing the extraction of Chinese-Lao bilingual aligned sentences from the chapter-level aligned corpus.
Further, the aligned segments in Step1 are the chapter-level aligned corpus after noise processing.
Further, Step2 splits the initial chapter-level aligned corpus into sentences in Python code, separating it into individual Lao sentences and Chinese sentences, and counts the number of words.
Further, Step3 proceeds as follows:
The split sentences are input and divided into words; the words, after word embedding, are fed into the LSTM, and the hidden layer yields the hidden-layer information $h_1, h_2, \ldots$. The hidden state of the encoder at the first moment is assumed to be $Z_0$ (the initial variable). Similarity is then calculated between $Z_0$ and $h_1, h_2, \ldots$ to obtain $a_{10}, a_{20}, a_{30}, \ldots, a_{ij}$ for each moment, where the subscript $i$ of $a$ is the index of the hidden-layer information in the encoder and the subscript $j$ of $a$ is the index of the initial variable of the neural network.
Further, in Step5 each decoding step of the decoder stage receives an input formed as a weighted sum over all hidden-layer information $h_1, h_2, \ldots, h_t$ of the input sequence. That is, the hidden-layer information of the whole input sequence is consulted each time the next word is predicted, which determines which word of the input sequence is most relevant to the word currently being predicted. Under the Attention mechanism, a context vector $C_i$ is input at each step of the decoder stage, and the new hidden-layer state $S_i$ is a nonlinear function of $S_{i-1}$, $Y_i$ and $C_i$, as shown in formula (1). $S_{i-1}$ and $Y_i$ are, respectively, the previous state and the predicted value of the previous output of the decoder stage. $C_i$ is the weighted sum of the output states of the encoder stage at every moment, computed by formula (2), where $h_j$ is the output state of the encoder stage at moment $j$ and $a_{ij}$ is the weight of $h_j$ for input $i$ of the decoder stage:
$$S_i = F(S_{i-1}, Y_i, C_i) \tag{1}$$
$$C_i = \sum_{j=1}^{t} a_{ij} h_j \tag{2}$$
Further, after the similarity calculation, Step6 composes sentences from the word vectors, completing the extraction of Chinese-Lao bilingual aligned sentences from the chapter-level aligned corpus.
A bilingual corpus is the most important language resource in natural language research, and as research on language information processing has deepened, great progress has been made in corpus acquisition. The invention mainly fuses Lao linguistic features into the algorithm model, combines multiple models to improve recognition accuracy, uses an Attention mechanism, and takes the LSTM as the encoder-decoder. The chapter-level aligned corpus is first cleaned in Python with regular expressions to remove noise data and is then used as input. Because Lao and Chinese sentences follow the same order, the chapter-level aligned corpus can first be processed into individual aligned sentences, which are then split. The aligned sentences are then segmented into words and fed to the LSTM. By retaining the intermediate outputs of the LSTM encoder over the input sequence, the model is trained to attend selectively to those inputs and to associate them with the output sequence as it is generated, thereby extracting parallel sentence pairs from the bilingual corpus. The method has research significance for the extraction of Lao parallel sentences.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit and scope of the present invention.

Claims (6)

1. A method for extracting aligned sentences from a Lao-Chinese chapter-level aligned corpus, characterized in that the method comprises the following steps:
Step1, removing noise from the Chinese-Lao bilingual corpus with regular expressions in Python code, and then dividing the aligned segments into data sets, wherein the aligned training set accounts for 90% and the out-of-order test set accounts for 10%;
Step2, counting, from the sentences in the training set and the test set, the distinct phrases and the number of occurrences of each phrase, and computing the word vectors of the sentences through word embedding;
Step3, using the word vectors obtained in Step2 as the input of the LSTM, which serves as the encoder at this stage, and performing similarity calculation in the encoder against the initialization vector of the LSTM;
Step4, outputting each word vector through the encoder, and obtaining the semantic code C of each sentence's word vectors through a softmax function to form a vector sequence;
Step5, taking the vector sequence obtained in Step4 as the initial input of the decoder and adding an Attention mechanism to the decoder, so that at each decoding step a subset of the vector sequence of the semantic code C is selected for further processing; in the decoder, the output at each moment is used as the input at the next moment, so that every output can make full use of the information carried by the input sequence, and so on until the end;
Step6, calculating the similarity between the encoder and the decoder to obtain the sentence word vectors with the highest similarity, and composing the sentence from those word vectors, thereby completing the extraction of Chinese-Lao bilingual aligned sentences from the chapter-level aligned corpus.
2. The method for extracting aligned sentences from a Lao-Chinese chapter-level aligned corpus according to claim 1, characterized in that: the aligned segments in Step1 are the chapter-level aligned corpus after noise processing.
3. The method for extracting aligned sentences from a Lao-Chinese chapter-level aligned corpus according to claim 1, characterized in that: Step2 splits the initial chapter-level aligned corpus into sentences in Python code, separating it into individual Lao sentences and Chinese sentences, and counts the number of words.
4. The method for extracting aligned sentences from a Lao-Chinese chapter-level aligned corpus according to claim 1, characterized in that Step3 proceeds as follows:
the split sentences are input and divided into words; the words, after word embedding, are fed into the LSTM, and the hidden layer yields the hidden-layer information $h_1, h_2, \ldots$; the hidden state of the encoder at the first moment is assumed to be the initial variable $Z_0$; similarity is then calculated between $Z_0$ and $h_1, h_2, \ldots$ to obtain $a_{10}, a_{20}, a_{30}, \ldots, a_{ij}$ for each moment, where the subscript $i$ of $a$ is the index of the hidden-layer information in the encoder and the subscript $j$ of $a$ is the index of the initial variable of the neural network.
5. The method for extracting aligned sentences from a Lao-Chinese chapter-level aligned corpus according to claim 4, characterized in that: in Step5 each decoding step of the decoder stage receives an input formed as a weighted sum over all hidden-layer information $h_1, h_2, \ldots, h_t$ of the input sequence; that is, the hidden-layer information of the whole input sequence is consulted each time the next word is predicted, which determines which word of the input sequence is most relevant to the word currently being predicted; under the Attention mechanism, a context vector $C_i$ is input at each step of the decoder stage, and the new hidden-layer state $S_i$ is a nonlinear function of $S_{i-1}$, $Y_i$ and $C_i$, as shown in formula (1); $S_{i-1}$ and $Y_i$ are, respectively, the previous state and the predicted value of the previous output of the decoder stage; $C_i$ is the weighted sum of the output states of the encoder stage at every moment, computed by formula (2), where $h_j$ is the output state of the encoder stage at moment $j$ and $a_{ij}$ is the weight of $h_j$ for input $i$ of the decoder stage;
$$S_i = F(S_{i-1}, Y_i, C_i) \tag{1}$$
$$C_i = \sum_{j=1}^{t} a_{ij} h_j \tag{2}$$
6. The method for extracting aligned sentences from a Lao-Chinese chapter-level aligned corpus according to claim 1, characterized in that: after the similarity calculation, Step6 composes sentences from the word vectors, completing the extraction of Chinese-Lao bilingual aligned sentences from the chapter-level aligned corpus.
CN201811577667.2A 2018-12-22 2018-12-22 Method for extracting aligned sentences from Laos-Chinese chapter level aligned corpus Active CN109783809B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201811577667.2A CN109783809B (en) 2018-12-22 2018-12-22 Method for extracting aligned sentences from Laos-Chinese chapter level aligned corpus

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201811577667.2A CN109783809B (en) 2018-12-22 2018-12-22 Method for extracting aligned sentences from Laos-Chinese chapter level aligned corpus

Publications (2)

Publication Number Publication Date
CN109783809A CN109783809A (en) 2019-05-21
CN109783809B true CN109783809B (en) 2022-04-12

Family

ID=66498083

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201811577667.2A Active CN109783809B (en) 2018-12-22 2018-12-22 Method for extracting aligned sentences from Laos-Chinese chapter level aligned corpus

Country Status (1)

Country Link
CN (1) CN109783809B (en)

Families Citing this family (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110362820B (en) * 2019-06-17 2022-11-01 昆明理工大学 Bi-LSTM algorithm-based method for extracting Lao-Chinese bilingual parallel sentences
CN110414009B (en) * 2019-07-09 2021-02-05 昆明理工大学 Burma bilingual parallel sentence pair extraction method and device based on BilSTM-CNN
CN110489102B (en) * 2019-07-29 2021-06-18 东北大学 Method for automatically generating Python code from natural language
CN110717341B (en) * 2019-09-11 2022-06-14 昆明理工大学 Method and device for constructing Lao-Chinese bilingual corpus with Thai as pivot
CN112232090A (en) * 2020-09-17 2021-01-15 昆明理工大学 Chinese-crossing parallel sentence pair extraction method fusing syntactic structure and Tree-LSTM
CN112287688B (en) * 2020-09-17 2022-02-11 昆明理工大学 English-Burmese bilingual parallel sentence pair extraction method and device integrating pre-training language model and structural features
CN113095091A (en) * 2021-04-09 2021-07-09 天津大学 Chapter machine translation system and method capable of selecting context information
CN113705168B (en) * 2021-08-31 2023-04-07 苏州大学 Chapter neural machine translation method and system based on cross-level attention mechanism

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104391885A (en) * 2014-11-07 2015-03-04 哈尔滨工业大学 Method for extracting chapter-level parallel phrase pair of comparable corpus based on parallel corpus training
CN105022728A (en) * 2015-07-13 2015-11-04 广西达译商务服务有限责任公司 Automatic acquisition system of Chinese and Lao bilingual parallel texts and implementation method
CN107967262A (en) * 2017-11-02 2018-04-27 内蒙古工业大学 A neural network Mongolian-Chinese machine translation method
JP2018072979A (en) * 2016-10-26 2018-05-10 株式会社エヌ・ティ・ティ・データ Parallel translation sentence extraction device, parallel translation sentence extraction method and program
CN108549629A (en) * 2018-03-19 2018-09-18 昆明理工大学 A Lao-Chinese bilingual sentence alignment method combining similarity and graph matching
CN109062897A (en) * 2018-07-26 2018-12-21 苏州大学 Sentence alignment method based on deep neural network

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
汉老双语句子对齐方法研究;让子强;《中国优秀硕士论文全文数据库 信息科技辑》;20180115;I138-2044 *
融入多特征的汉-老双语对齐方法;贾善崇 等;《中国水运》;20200331;第20卷(第3期);78-80 *

Also Published As

Publication number Publication date
CN109783809A (en) 2019-05-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant