CN109783809B - Method for extracting aligned sentences from Laos-Chinese chapter level aligned corpus
- Publication number
- CN109783809B (application CN201811577667.2A)
- Authority
- CN
- China
- Prior art keywords
- aligned
- sentences
- input
- word
- corpus
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Landscapes
- Machine Translation (AREA)
- Document Processing Apparatus (AREA)
Abstract
The invention discloses a method for extracting aligned sentences from Laos-Chinese chapter-level aligned corpora, and belongs to the technical field of natural language processing and machine learning. The chapter-level aligned corpus is first processed with regular expressions in Python to remove noise data and is used as input; because the Laos and Chinese sentences appear in the same order, the chapter-level aligned corpus can first be processed into single aligned sentences, and the aligned sentences are then split. The aligned sentences are then segmented into words and taken as LSTM input; by keeping the intermediate outputs of the LSTM encoder over the input sequence, the model is trained to learn the input selectively and to associate it with the output sequence at output time, thereby extracting parallel sentence pairs from the bilingual corpus. The method is of research significance for the extraction of Laos parallel sentences.
Description
Technical Field
The invention relates to a method for extracting aligned sentences from Laos-Chinese chapter-level aligned corpora, in particular to a method for extracting aligned sentences from Laos-Chinese chapter-level aligned corpora based on an LSTM (Long Short-Term Memory) network, and belongs to the technical field of natural language processing and machine learning.
Background
Bilingual corpora are an important basic resource in research fields such as statistical machine translation, cross-language retrieval and bilingual dictionary construction, and their quantity and quality influence, and to a great extent even determine, the final result of the related task. Mining parallel sentence pairs is a key technology for constructing bilingual corpora and therefore has important research value. In many cases a bilingual corpus is available, but the text is usually not aligned in sentence units; for example, some corpora are aligned by paragraph or by whole article. In this case, the corpora that are not aligned in sentence units need to be arranged into a sentence-aligned format so that parallel sentence pairs can be extracted.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a method for extracting aligned sentences from Laos-Chinese chapter-level aligned corpora, which is used to extract aligned sentences from Chinese-Laos aligned corpora and can effectively improve the accuracy of sentence alignment.
The technical scheme adopted by the invention is as follows: a method for extracting aligned sentences from Laos-Chinese chapter-level aligned corpora comprises the following steps:
Step1, performing noise processing on the Chinese-Laos bilingual corpus with regular expressions in Python code, and then dividing the aligned segments into data sets, wherein the aligned training set accounts for 90% and the out-of-order test set accounts for 10%;
Step2, counting the distinct words in the training-set and test-set sentences and the occurrence frequency of each word, and computing word vectors for the sentences through word embedding;
Step3, taking the word vectors obtained in Step2 as the input of the LSTM algorithm, i.e. the LSTM algorithm serves as the encoder and the word vectors are the input of the encoder side, and performing similarity calculation at the encoder side against the initialization vector of the LSTM algorithm;
Step4, passing each word vector through the encoder and obtaining the semantic code C of each sentence's word vectors through a softmax function, forming a vector sequence;
Step5, taking the vector sequence obtained in Step4 as the initial input of the decoder and adding an Attention mechanism to the decoder; during decoding, a subset of the vector sequence of the semantic code C is selected at each step for further processing, so that in the decoder the output at each time step serves as the input of the next time step, and every output can make full use of the information carried by the input sequence, and so on until the end;
Step6, calculating the similarity between the encoder and the decoder to obtain the sentence word vectors with the highest similarity, and composing the sentences from these word vectors, thereby completing the extraction of Chinese-Laos bilingual aligned sentences from the aligned chapter-level corpus.
Specifically, the aligned segments in Step1 are the aligned chapter-level corpus after noise processing.
Specifically, Step2 performs sentence segmentation on the initial chapter-level aligned corpus through Python code, splits out individual Laos sentences and Chinese sentences, and counts the number of words.
Specifically, Step3 comprises the following specific steps:
The split sentences are taken as input and divided into words; after word embedding, the words are fed into the LSTM, and the hidden-layer information h_1, h_2, ... is obtained through the hidden layer. The hidden state of the encoder at the first time step is taken as the initial variable Z_0; similarity calculation is then performed between Z_0 and h_1, h_2, ..., giving a_10, a_20, a_30, ..., a_ij at each time step, where the subscript i of a denotes the index of the hidden-layer information in the encoder and the subscript j of a denotes the index of the initial variable of the neural network.
In particular, in Step5, at each decoding step in the decoder stage there is an input, and a weighted sum is taken over all hidden-layer information h_1, h_2, ..., h_t of the input sequence; that is, each time the next word is predicted, the hidden-layer information of the whole input sequence is visible, and the words of the input sequence most relevant to the current word are determined when it is predicted. The Attention mechanism means that at each step of the decoder stage a context vector C_i is input, and the new hidden-layer state S_i is a non-linear function of the previous state S_{i-1}, the previous output Y_i and C_i, as shown in formula (1), where C_i is the weighted sum of the encoder output states at each time step and is computed as in formula (2), S_{i-1} and Y_i are respectively the previous state and the previous predicted output of the decoder stage, h_j is the encoder output state at each time step, and a_{ij} is the weight of h_j corresponding to input i of the decoder stage;
S_i = F(S_{i-1}, Y_i, C_i)    (1)
C_i = Σ_j a_{ij} h_j    (2)
Specifically, in Step6, after the similarity calculation the sentences are composed from the word vectors, completing the extraction of Chinese-Laos bilingual aligned sentences from the aligned chapter-level corpus.
The invention has the beneficial effects that:
(1) The LSTM-based method for extracting aligned sentences from Laos-Chinese chapter-level aligned corpora improves the accuracy of Chinese-Laos extraction compared with a single encoder-decoder algorithm model.
(2) The method uses the LSTM algorithm, which improves the effect of feature extraction compared with other algorithms.
(3) The method integrates Laos and Chinese grammatical features, which can be recognized automatically through deep learning; compared with manual recognition, this is faster, generalizes more strongly, and saves time and labour.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a basic block diagram of the LSTM used in the present invention to train word vectors;
FIG. 3 is a schematic diagram of the encoder-decoder model with the Attention mechanism of the present invention;
FIG. 4 is a diagram of the Attention model computing word vectors of the present invention.
Detailed Description
Example 1: as shown in FIGS. 1-4, a method for extracting aligned sentences from Laos-Chinese chapter-level aligned corpora includes the following steps:
Step1, performing noise processing on the Chinese-Laos bilingual corpus with regular expressions in Python code, and then dividing the aligned segments into data sets, wherein the aligned training set accounts for 90% and the out-of-order test set accounts for 10%;
Step2, counting the distinct words in the training-set and test-set sentences and the occurrence frequency of each word, and computing word vectors for the sentences through word embedding;
Step3, taking the word vectors obtained in Step2 as the input of the LSTM algorithm, i.e. the LSTM algorithm serves as the encoder and the word vectors are the input of the encoder side, and performing similarity calculation at the encoder side against the initialization vector of the LSTM algorithm;
Step4, passing each word vector through the encoder and obtaining the semantic code C of each sentence's word vectors through a softmax function, forming a vector sequence;
Step5, taking the vector sequence obtained in Step4 as the initial input of the decoder and adding an Attention mechanism to the decoder; during decoding, a subset of the vector sequence of the semantic code C is selected at each step for further processing, so that in the decoder the output at each time step serves as the input of the next time step, and every output can make full use of the information carried by the input sequence, and so on until the end;
Step6, calculating the similarity between the encoder and the decoder to obtain the sentence word vectors with the highest similarity, and composing the sentences from these word vectors, thereby completing the extraction of Chinese-Laos bilingual aligned sentences from the aligned chapter-level corpus.
Further, the aligned segments in Step1 are the aligned chapter-level corpus after noise processing.
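A minimal Python sketch of this Step1 preprocessing is given below; the regular expressions, the tab-separated input file name `lo_zh_chapters.txt` and the fixed random seed are illustrative assumptions rather than the patterns actually used by the invention.

```python
import random
import re

def clean(text):
    """Remove noise from a segment; the patterns here are illustrative assumptions."""
    text = re.sub(r"<[^>]+>", " ", text)        # residual HTML/XML tags
    text = re.sub(r"[\u200b\ufeff]", "", text)  # zero-width and BOM characters
    return re.sub(r"\s+", " ", text).strip()    # collapse whitespace

# Assumed input: one tab-separated Laos/Chinese aligned segment per line.
with open("lo_zh_chapters.txt", encoding="utf-8") as f:
    segments = [tuple(map(clean, line.split("\t", 1))) for line in f if "\t" in line]

random.seed(0)
random.shuffle(segments)                 # shuffling yields the out-of-order test set
split = int(0.9 * len(segments))
train_set = segments[:split]             # 90% aligned training set
test_set = segments[split:]              # 10% out-of-order test set
```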
Further, Step2 performs sentence segmentation on the initial chapter-level aligned corpus through Python code, splits out individual Laos sentences and Chinese sentences, and counts the number of words.
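The sentence splitting and word counting of Step2 can be sketched as below; the punctuation sets, the character-level stand-in segmenter and the demonstration sentence pair are illustrative assumptions, and a dedicated Chinese and Laos word segmenter would be used in practice.

```python
import re
from collections import Counter

def split_sentences(zh_text, lo_text):
    """Split an aligned Chinese/Laos segment pair into sentence pairs.
    This relies on the two languages keeping the same sentence order."""
    zh = [s.strip() for s in re.split(r"[。！？]", zh_text) if s.strip()]
    lo = [s.strip() for s in re.split(r"[.!?]", lo_text) if s.strip()]
    return list(zip(zh, lo))

def tokenize(sentence):
    # Character-level stand-in for a real word segmenter.
    return [ch for ch in sentence if not ch.isspace()]

pairs = split_sentences("今天天气很好。我们去公园。",
                        "ມື້ນີ້ອາກາດດີ. ພວກເຮົາໄປສວນສາທາລະນະ.")
counts = Counter(tok for zh, lo in pairs for tok in tokenize(zh) + tokenize(lo))
vocab = {tok: i + 1 for i, (tok, _) in enumerate(counts.most_common())}  # 0 is padding
```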
Further, Step3 comprises the following specific steps:
The split sentences are taken as input and divided into words; after word embedding, the words are fed into the LSTM, and the hidden-layer information h_1, h_2, ... is obtained through the hidden layer. The hidden state of the encoder at the first time step is taken as the initial variable Z_0; similarity calculation is then performed between Z_0 and h_1, h_2, ..., giving a_10, a_20, a_30, ..., a_ij at each time step, where the subscript i of a denotes the index of the hidden-layer information in the encoder and the subscript j of a denotes the index of the initial variable of the neural network.
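The encoder pass of Step3 can be sketched as follows. PyTorch, the dot-product similarity and all sizes are assumptions; the patent only specifies an LSTM encoder, an initial state Z_0 and a similarity calculation between Z_0 and the hidden-layer information h_1, h_2, ....

```python
import torch
import torch.nn as nn

vocab_size, emb_dim, hidden_dim, seq_len = 1000, 128, 256, 12

embedding = nn.Embedding(vocab_size, emb_dim)
encoder = nn.LSTM(emb_dim, hidden_dim, batch_first=True)

tokens = torch.randint(1, vocab_size, (1, seq_len))   # one word-indexed sentence
h, _ = encoder(embedding(tokens))                     # h holds h_1 ... h_t, shape (1, seq_len, hidden_dim)

z0 = torch.zeros(hidden_dim)                          # initial variable Z_0
scores = h.squeeze(0) @ z0                            # dot-product similarity of Z_0 with each h_j
a = torch.softmax(scores, dim=0)                      # normalised weights a_10, a_20, ...
```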
Further, in Step5, at each decoding step in the decoder stage there is an input, and a weighted sum is taken over all hidden-layer information h_1, h_2, ..., h_t of the input sequence; that is, each time the next word is predicted, the hidden-layer information of the whole input sequence is visible, and the words of the input sequence most relevant to the current word are determined when it is predicted. The Attention mechanism means that at each step of the decoder stage a context vector C_i is input, and the new hidden-layer state S_i is a non-linear function of the previous state S_{i-1}, the previous output Y_i and C_i, as shown in formula (1), where C_i is the weighted sum of the encoder output states at each time step and is computed as in formula (2), S_{i-1} and Y_i are respectively the previous state and the previous predicted output of the decoder stage, h_j is the encoder output state at each time step, and a_{ij} is the weight of h_j corresponding to input i of the decoder stage;
S_i = F(S_{i-1}, Y_i, C_i)    (1)
C_i = Σ_j a_{ij} h_j    (2)
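The sketch below works through formulas (1) and (2) for a single decoding step, again assuming PyTorch and dot-product attention scores; using a GRU cell as the non-linear function F is a simplification for brevity (the patent's decoder is an LSTM, which would also carry a cell state).

```python
import torch
import torch.nn as nn

hidden_dim, emb_dim, seq_len = 256, 128, 12

h = torch.randn(seq_len, hidden_dim)     # encoder output states h_1 ... h_t from Step3
s_prev = torch.zeros(1, hidden_dim)      # previous decoder state S_{i-1}
y_prev = torch.randn(1, emb_dim)         # embedding of the previously predicted word Y_i

# Formula (2): weights a_ij over the encoder states, then C_i as their weighted sum.
scores = h @ s_prev.squeeze(0)                          # one score per encoder position
a = torch.softmax(scores, dim=0)                        # attention weights a_ij
c_i = (a.unsqueeze(1) * h).sum(dim=0, keepdim=True)     # context vector C_i

# Formula (1): S_i = F(S_{i-1}, Y_i, C_i), with F realised here as a GRU cell over [Y_i; C_i].
decoder_cell = nn.GRUCell(emb_dim + hidden_dim, hidden_dim)
s_i = decoder_cell(torch.cat([y_prev, c_i], dim=1), s_prev)
```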
Further, in Step6, after the similarity calculation the sentences are composed from the word vectors, completing the extraction of Chinese-Laos bilingual aligned sentences from the aligned chapter-level corpus.
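A sketch of the Step6 pairing by similarity, assuming PyTorch, mean-pooled sentence vectors and cosine similarity with a hypothetical threshold; the patent itself only requires selecting, for each sentence, the counterpart with the highest similarity and composing the sentence from its word vectors.

```python
import torch
import torch.nn.functional as F

def sentence_vector(hidden_states):
    # Mean-pool per-word hidden states into one sentence vector (the pooling choice is an assumption).
    return hidden_states.mean(dim=0)

def extract_pairs(lao_states, zh_states, threshold=0.7):
    """Match each Laos sentence to the Chinese sentence with the highest similarity."""
    lao_vecs = torch.stack([sentence_vector(h) for h in lao_states])
    zh_vecs = torch.stack([sentence_vector(h) for h in zh_states])
    sim = F.cosine_similarity(lao_vecs.unsqueeze(1), zh_vecs.unsqueeze(0), dim=-1)
    pairs = []
    for i in range(sim.size(0)):
        j = int(torch.argmax(sim[i]))
        if float(sim[i, j]) >= threshold:        # keep only sufficiently similar matches
            pairs.append((i, j, float(sim[i, j])))
    return pairs

# Toy usage with random hidden states for two Laos and two Chinese sentences.
demo = extract_pairs([torch.randn(5, 256), torch.randn(7, 256)],
                     [torch.randn(6, 256), torch.randn(4, 256)], threshold=0.0)
```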
Bilingual corpora are among the most important language resources in natural language research; as research on language information processing deepens, great progress has been made in corpus acquisition. The invention mainly incorporates Laos linguistic features into the algorithm model, combines several models to improve recognition accuracy, and uses an Attention mechanism with the LSTM as the encoder-decoder. First, the chapter-level aligned corpus is processed with regular expressions in Python to remove noise data and is used as input; because the Laos and Chinese sentences appear in the same order, the chapter-level aligned corpus can first be processed into single aligned sentences, and the aligned sentences are then split. The aligned sentences are then segmented into words and taken as LSTM input; by keeping the intermediate outputs of the LSTM encoder over the input sequence, the model is trained to learn the input selectively and to associate it with the output sequence at output time, thereby extracting parallel sentence pairs from the bilingual corpus. The method is of research significance for the extraction of Laos parallel sentences.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit and scope of the present invention.
Claims (6)
1. A method for extracting aligned sentences from Laos-Chinese chapter-level aligned corpora, characterized in that the method comprises the following steps:
Step1, performing noise processing on the Chinese-Laos bilingual corpus with regular expressions in Python code, and then dividing the aligned segments into data sets, wherein the aligned training set accounts for 90% and the out-of-order test set accounts for 10%;
Step2, counting the distinct words in the training-set and test-set sentences and the occurrence frequency of each word, and computing word vectors for the sentences through word embedding;
Step3, taking the word vectors obtained in Step2 as the input of the LSTM algorithm, i.e. the LSTM algorithm serves as the encoder and the word vectors are the input of the encoder side, and performing similarity calculation at the encoder side against the initialization vector of the LSTM algorithm;
Step4, passing each word vector through the encoder and obtaining the semantic code C of each sentence's word vectors through a softmax function, forming a vector sequence;
Step5, taking the vector sequence obtained in Step4 as the initial input of the decoder and adding an Attention mechanism to the decoder; during decoding, a subset of the vector sequence of the semantic code C is selected at each step for further processing, so that in the decoder the output at each time step serves as the input of the next time step, and every output can make full use of the information carried by the input sequence, and so on until the end;
Step6, calculating the similarity between the encoder and the decoder to obtain the sentence word vectors with the highest similarity, and composing the sentences from these word vectors, thereby completing the extraction of Chinese-Laos bilingual aligned sentences from the aligned chapter-level corpus.
2. The method of claim 1, characterized in that the aligned segments in Step1 are the aligned chapter-level corpus after noise processing.
3. The method of claim 1, characterized in that Step2 performs sentence segmentation on the initial chapter-level aligned corpus through Python code, splits out individual Laos sentences and Chinese sentences, and counts the number of words.
4. The method of claim 1, characterized in that Step3 comprises the following specific steps:
The split sentences are taken as input and divided into words; after word embedding, the words are fed into the LSTM, and the hidden-layer information h_1, h_2, ... is obtained through the hidden layer. The hidden state of the encoder at the first time step is taken as the initial variable Z_0; similarity calculation is then performed between Z_0 and h_1, h_2, ..., giving a_10, a_20, a_30, ..., a_ij at each time step, where the subscript i of a denotes the index of the hidden-layer information in the encoder and the subscript j of a denotes the index of the initial variable of the neural network.
5. The method of claim 4, characterized in that in Step5, at each decoding step in the decoder stage there is an input, and a weighted sum is taken over all hidden-layer information h_1, h_2, ..., h_t of the input sequence; that is, each time the next word is predicted, the hidden-layer information of the whole input sequence is visible, and the words of the input sequence most relevant to the current word are determined when it is predicted; the Attention mechanism means that at each step of the decoder stage a context vector C_i is input, and the new hidden-layer state S_i is a non-linear function of the previous state S_{i-1}, the previous output Y_i and C_i, as shown in formula (1), where C_i is the weighted sum of the encoder output states at each time step and is computed as in formula (2), S_{i-1} and Y_i are respectively the previous state and the previous predicted output of the decoder stage, h_j is the encoder output state at each time step, and a_{ij} is the weight of h_j corresponding to input i of the decoder stage;
S_i = F(S_{i-1}, Y_i, C_i)    (1)
C_i = Σ_j a_{ij} h_j    (2)
6. The method of claim 1, characterized in that in Step6, after the similarity calculation the sentences are composed from the word vectors, completing the extraction of Chinese-Laos bilingual aligned sentences from the aligned chapter-level corpus.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811577667.2A CN109783809B (en) | 2018-12-22 | 2018-12-22 | Method for extracting aligned sentences from Laos-Chinese chapter level aligned corpus |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201811577667.2A CN109783809B (en) | 2018-12-22 | 2018-12-22 | Method for extracting aligned sentences from Laos-Chinese chapter level aligned corpus |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109783809A CN109783809A (en) | 2019-05-21 |
CN109783809B true CN109783809B (en) | 2022-04-12 |
Family
ID=66498083
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201811577667.2A Active CN109783809B (en) | 2018-12-22 | 2018-12-22 | Method for extracting aligned sentences from Laos-Chinese chapter level aligned corpus |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109783809B (en) |
Families Citing this family (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110362820B (en) * | 2019-06-17 | 2022-11-01 | 昆明理工大学 | Bi-LSTM algorithm-based method for extracting bilingual parallel sentences in old and Chinese |
CN110414009B (en) * | 2019-07-09 | 2021-02-05 | 昆明理工大学 | Burma bilingual parallel sentence pair extraction method and device based on BilSTM-CNN |
CN110489102B (en) * | 2019-07-29 | 2021-06-18 | 东北大学 | Method for automatically generating Python code from natural language |
CN110717341B (en) * | 2019-09-11 | 2022-06-14 | 昆明理工大学 | Method and device for constructing old-Chinese bilingual corpus with Thai as pivot |
CN112232090A (en) * | 2020-09-17 | 2021-01-15 | 昆明理工大学 | Chinese-crossing parallel sentence pair extraction method fusing syntactic structure and Tree-LSTM |
CN112287688B (en) * | 2020-09-17 | 2022-02-11 | 昆明理工大学 | English-Burmese bilingual parallel sentence pair extraction method and device integrating pre-training language model and structural features |
CN113095091A (en) * | 2021-04-09 | 2021-07-09 | 天津大学 | Chapter machine translation system and method capable of selecting context information |
CN113705168B (en) * | 2021-08-31 | 2023-04-07 | 苏州大学 | Chapter neural machine translation method and system based on cross-level attention mechanism |
Patent Citations (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN104391885A (en) * | 2014-11-07 | 2015-03-04 | 哈尔滨工业大学 | Method for extracting chapter-level parallel phrase pair of comparable corpus based on parallel corpus training |
CN105022728A (en) * | 2015-07-13 | 2015-11-04 | 广西达译商务服务有限责任公司 | Automatic acquisition system of Chinese and Lao bilingual parallel texts and implementation method |
JP2018072979A (en) * | 2016-10-26 | 2018-05-10 | 株式会社エヌ・ティ・ティ・データ | Parallel translation sentence extraction device, parallel translation sentence extraction method and program |
CN107967262A (en) * | 2017-11-02 | 2018-04-27 | 内蒙古工业大学 | A kind of neutral net covers Chinese machine translation method |
CN108549629A (en) * | 2018-03-19 | 2018-09-18 | 昆明理工大学 | A kind of combination similarity and scheme matched old-Chinese bilingual sentence alignment schemes |
CN109062897A (en) * | 2018-07-26 | 2018-12-21 | 苏州大学 | Sentence alignment method based on deep neural network |
Non-Patent Citations (2)
Title |
---|
Research on Chinese-Lao bilingual sentence alignment methods (汉老双语句子对齐方法研究); 让子强; China Master's Theses Full-text Database, Information Science and Technology; 20180115; I138-2044 *
A Chinese-Lao bilingual alignment method incorporating multiple features (融入多特征的汉-老双语对齐方法); 贾善崇 et al.; China Water Transport (中国水运); 20200331; Vol. 20, No. 3; 78-80 *
Also Published As
Publication number | Publication date |
---|---|
CN109783809A (en) | 2019-05-21 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109783809B (en) | Method for extracting aligned sentences from Laos-Chinese chapter level aligned corpus | |
CN111143550B (en) | Method for automatically identifying dispute focus based on hierarchical attention neural network model | |
CN109635124B (en) | Remote supervision relation extraction method combined with background knowledge | |
CN111046946B (en) | Burma language image text recognition method based on CRNN | |
CN110532554B (en) | Chinese abstract generation method, system and storage medium | |
CN110969020B (en) | CNN and attention mechanism-based Chinese named entity identification method, system and medium | |
CN109543181B (en) | Named entity model and system based on combination of active learning and deep learning | |
CN114861600B (en) | NER-oriented Chinese clinical text data enhancement method and device | |
CN108984526A (en) | A kind of document subject matter vector abstracting method based on deep learning | |
CN110688862A (en) | Mongolian-Chinese inter-translation method based on transfer learning | |
CN107480143A (en) | Dialogue topic dividing method and system based on context dependence | |
CN108491372B (en) | Chinese word segmentation method based on seq2seq model | |
CN110083826A (en) | A kind of old man's bilingual alignment method based on Transformer model | |
CN110414009B (en) | Burma bilingual parallel sentence pair extraction method and device based on BilSTM-CNN | |
CN114757182A (en) | BERT short text sentiment analysis method for improving training mode | |
CN110555084A (en) | remote supervision relation classification method based on PCNN and multi-layer attention | |
CN114818891B (en) | Small sample multi-label text classification model training method and text classification method | |
CN112560486A (en) | Power entity identification method based on multilayer neural network, storage medium and equipment | |
CN110222338B (en) | Organization name entity identification method | |
CN107894975A (en) | A kind of segmenting method based on Bi LSTM | |
CN106610937A (en) | Information theory-based Chinese automatic word segmentation method | |
CN113553847A (en) | Method, device, system and storage medium for parsing address text | |
CN107992468A (en) | A kind of mixing language material name entity recognition method based on LSTM | |
CN114036908A (en) | Chinese chapter-level event extraction method and device integrated with word list knowledge | |
CN112380882B (en) | Mongolian Chinese neural machine translation method with error correction function |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||