CN110717341A - Method and device for constructing a Laos-Chinese bilingual corpus with Thai as pivot
- Publication number
- CN110717341A (application number CN201910856645.8A)
- Authority
- CN
- China
- Prior art keywords
- thai
- laos
- sentence
- parallel
- sentences
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/33—Querying
- G06F16/3331—Query processing
- G06F16/334—Query execution
- G06F16/3344—Query execution using natural language analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- Computational Linguistics (AREA)
- Machine Translation (AREA)
Abstract
The invention relates to a method and a device for constructing a Laos-Chinese bilingual corpus with Thai as a pivot, and belongs to the field of natural language processing. First, Thai word segmentation is performed on Chinese-Thai parallel corpus data. A Laos-Thai bilingual dictionary is then constructed and used to translate the Thai sentences word by word into Laos sentences, yielding candidate Laos-Thai parallel sentence pairs. A Laos-Thai parallel sentence pair classification model based on a bidirectional LSTM is constructed to classify the candidate pairs and obtain Laos-Thai bilingual parallel sentence pairs. Finally, Laos is matched with Chinese by taking Thai as the pivot language to construct a Laos-Chinese bilingual parallel corpus. The invention addresses the scarcity of Laos-Chinese corpus data and has theoretical significance and practical application value for the construction of Laos-Chinese bilingual corpora.
Description
Technical Field
The invention relates to a method and a device for constructing a Laos-Chinese bilingual corpus with Thai as a pivot, and belongs to the technical field of natural language processing.
Background
Corpus construction is a prerequisite for natural language processing research. A Laos-Chinese bilingual corpus is an important data resource for Chinese-Laos machine translation and cross-language retrieval. Laos is a low-resource language among the Southeast Asian languages, Laos-Chinese bilingual parallel resources are scarce, and obtaining them directly from the Internet is very difficult.
Laos and Thai both belong to the Zhuang-Dai branch of the Zhuang-Dong group of the Sino-Tibetan language family. Their basic vocabularies are almost identical or similar and their syntactic structures are highly similar, while Chinese-Thai parallel corpora are relatively easy to obtain. The similarity between Laos and Thai can therefore be used to obtain Laos-Thai parallel sentence pairs, on the basis of which a Laos-Chinese bilingual parallel corpus is constructed with Thai as the pivot.
Disclosure of Invention
The invention provides a method and a device for constructing a Laos-Chinese bilingual corpus with Thai as a pivot, which are used for constructing a Laos-Chinese bilingual parallel corpus.
The technical scheme of the invention is as follows: a method for constructing a Laos-Chinese bilingual corpus with Thai as a pivot comprises the following steps:
Step1, extracting Thai sentences from existing Chinese-Thai parallel corpus data and performing Thai word segmentation;
Step2, constructing a Laos-Thai bilingual dictionary, and translating the Thai sentences word by word into Laos sentences using the Laos-Thai bilingual dictionary to obtain candidate Laos-Thai parallel sentence pairs;
Step3, constructing a Laos-Thai parallel sentence pair classification model based on a bidirectional LSTM, classifying the candidate Laos-Thai parallel sentence pairs, and extracting the inter-translated Laos-Thai parallel sentences to obtain Laos-Thai bilingual parallel sentence pairs;
Step4, matching Laos with Chinese, using the obtained Laos-Thai bilingual parallel sentence pairs and the existing Chinese-Thai parallel corpus with Thai as the pivot language, to construct a Laos-Chinese bilingual parallel corpus.
Further, the specific steps of Step1 are as follows:
Step1.1, selecting Thai sentences of 20-50 characters from an existing Chinese-Thai bilingual parallel corpus;
Step1.2, performing word segmentation on the selected Thai sentences, wherein the word segmentation tool is the Southeast Asian minority-language information processing platform developed by Kunming University of Science and Technology, available at http://222.197.219.24:8099/.
Thai is written continuously, without separators between words, so unsegmented Thai text can neither be translated word by word nor used in the model. The Thai sentences are therefore segmented with the Thai word segmentation tool to obtain word-segmented Thai sentences.
This preferred scheme is an important component of the invention: it provides the corpus and the data preprocessing for the invention, and the corpus basis for the subsequent dictionary translation and model use.
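Step1 can be illustrated with a minimal Python sketch. It is not the patent's implementation: the function names are hypothetical, sentence length is measured with Python's `len` (Unicode code points, an assumption about how "characters" are counted), and the segmenter is passed in as a callable because the segmentation platform's interface is not described in the text.

```python
def select_thai_sentences(pairs, min_chars=20, max_chars=50):
    """Keep (chinese, thai) pairs whose Thai side has 20-50 characters.

    Length is counted in Unicode code points; this is an assumption.
    """
    return [(zh, th) for zh, th in pairs if min_chars <= len(th) <= max_chars]


def segment_thai(sentence, segmenter):
    """Split a Thai sentence into words with a caller-supplied segmenter.

    `segmenter` is a hypothetical callable (str -> list of str) standing in
    for the external word segmentation tool.
    """
    return [word for word in segmenter(sentence) if word.strip()]


# Toy data: Latin placeholders stand in for real Chinese and Thai text.
pairs = [("zh-1", "t" * 10), ("zh-2", "t" * 30), ("zh-3", "t" * 60)]
selected = select_thai_sentences(pairs)
```

Only the second pair survives the filter here, because its Thai side is the only one between 20 and 50 characters long.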
Further, the specific steps of Step2 are as follows:
Step2.1, constructing the Laos-Thai bilingual dictionary: English is used as an intermediate language, and Laos words and Thai words are aligned through their English translations on the basis of a Laos-English dictionary and a Thai-English dictionary, so as to construct the Laos-Thai bilingual dictionary;
Step2.2, because Laos and Thai are extremely similar, the Thai sentences in the acquired Chinese-Thai bilingual parallel sentence pairs are translated word by word using the Laos-Thai bilingual dictionary; because a word can have several meanings, dictionary translation may generate several Laos sentences with different semantics, so that candidate Laos-Thai parallel sentence pairs are obtained, wherein the candidate Laos-Thai parallel sentence pairs are groups in which one Thai sentence corresponds to several Laos sentences, not all of which are translations of the Thai sentence.
This preferred scheme is the key process for obtaining candidate Laos-Thai parallel sentences: the similarity of Laos and Thai in word formation and other aspects is analyzed and exploited, and a dictionary is constructed for word-by-word translation to obtain a candidate parallel corpus, in preparation for the subsequent extraction of the Laos-Thai parallel corpus by the model.
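The English-pivot alignment of Step2.1 can be sketched as follows. This is an illustration under stated assumptions, not the patent's implementation: both input dictionaries are assumed to map a word to a list of English glosses, two words are aligned when they share at least one gloss (compared case-insensitively), and the function name and toy entries are hypothetical.

```python
from collections import defaultdict


def build_pivot_dictionary(laos_english, thai_english):
    """Map each Thai word to the Laos words that share an English gloss."""
    english_to_laos = defaultdict(set)
    for laos_word, glosses in laos_english.items():
        for gloss in glosses:
            english_to_laos[gloss.lower()].add(laos_word)

    thai_to_laos = {}
    for thai_word, glosses in thai_english.items():
        matches = set()
        for gloss in glosses:
            matches |= english_to_laos.get(gloss.lower(), set())
        if matches:  # Thai words without any aligned Laos word are skipped
            thai_to_laos[thai_word] = sorted(matches)
    return thai_to_laos


# Toy dictionaries: placeholder identifiers stand in for real Laos/Thai words.
laos_english = {"lo_company": ["company", "firm"], "lo_boss": ["boss"], "lo_river": ["river"]}
thai_english = {"th_company": ["Company"], "th_boss": ["boss", "chief"], "th_sea": ["sea"]}
thai_to_laos = build_pivot_dictionary(laos_english, thai_english)
```

A Thai word with several aligned Laos words keeps all of them, which is what later produces several candidate Laos sentences per Thai sentence.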
Further, the specific steps of Step3 are as follows:
Step3.1, manually constructing a sentence-aligned Laos-Thai parallel corpus;
The invention trains the model on a Laos-Thai parallel corpus, so a high-quality parallel corpus is required for the trained model to be effective. The Laos-Thai parallel corpus is therefore constructed manually, ensuring that the training data consist of accurate parallel sentences, from which the Laos-Thai parallel sentence pair classification model is obtained.
Step3.2, because Laos and Thai are highly similar in vocabulary and pronunciation, the Laos-Thai parallel sentence pairs are represented in a shared semantic space using a bidirectional LSTM; specifically, the bidirectional LSTM is used to obtain the forward and backward state vectors, which are concatenated to obtain the sentence vector representation in the shared semantic space, namely:
$$\overrightarrow{h}^{T}_{i,N} = \mathrm{LSTM}\left(\overrightarrow{h}^{T}_{i,N-1},\ x^{T}_{i,N}\right), \qquad \overleftarrow{h}^{T}_{i,N} = \mathrm{LSTM}\left(\overleftarrow{h}^{T}_{i,N+1},\ x^{T}_{i,N}\right), \qquad h^{T}_{i} = \left[\overrightarrow{h}^{T}_{i};\ \overleftarrow{h}^{T}_{i}\right]$$
$$\overrightarrow{h}^{L}_{i,N} = \mathrm{LSTM}\left(\overrightarrow{h}^{L}_{i,N-1},\ x^{L}_{i,N}\right), \qquad \overleftarrow{h}^{L}_{i,N} = \mathrm{LSTM}\left(\overleftarrow{h}^{L}_{i,N+1},\ x^{L}_{i,N}\right), \qquad h^{L}_{i} = \left[\overrightarrow{h}^{L}_{i};\ \overleftarrow{h}^{L}_{i}\right]$$
wherein $\overrightarrow{h}^{T}_{i,N}$ is the forward hidden vector of the i-th Thai sentence at state N; $\overrightarrow{h}^{T}_{i,N-1}$ is the forward hidden vector of the i-th Thai sentence at state N-1; $x^{T}_{i,N}$ is the word vector of the i-th Thai sentence at state N; and LSTM denotes the LSTM unit;
$\overleftarrow{h}^{T}_{i,N}$ is the backward hidden vector of the i-th Thai sentence at state N; $\overleftarrow{h}^{T}_{i,N+1}$ is the backward hidden vector of the i-th Thai sentence at state N+1;
$h^{T}_{i}$ is the sentence vector of the i-th Thai sentence, obtained by concatenating the final vectors $\overrightarrow{h}^{T}_{i}$ and $\overleftarrow{h}^{T}_{i}$ from the two directions;
$\overrightarrow{h}^{L}_{i,N}$ is the forward hidden vector of the i-th Laos sentence at state N; $\overrightarrow{h}^{L}_{i,N-1}$ is the forward hidden vector of the i-th Laos sentence at state N-1; $x^{L}_{i,N}$ is the word vector of the i-th Laos sentence at state N;
$\overleftarrow{h}^{L}_{i,N}$ is the backward hidden vector of the i-th Laos sentence at state N; $\overleftarrow{h}^{L}_{i,N+1}$ is the backward hidden vector of the i-th Laos sentence at state N+1;
$h^{L}_{i}$ is the sentence vector of the i-th Laos sentence, obtained by concatenating the final vectors $\overrightarrow{h}^{L}_{i}$ and $\overleftarrow{h}^{L}_{i}$ from the two directions;
then, the vector dot product and the vector difference are used to capture the matching information between the two sentence vectors, giving the matching vectors $h^{(1)}_i$ and $h^{(2)}_i$ and, from them, the final vector $h_i$;
wherein $h^{(1)}_i$ and $h^{(2)}_i$ respectively denote the matching vectors containing sentence matching information obtained from the dot product and from the difference of the Laos and Thai sentence vectors; $h_i$ is the final vector representation containing the matching information; and $W_1$, $W_2$ and $b$ are parameters of the bidirectional LSTM model;
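One plausible concrete form of this matching layer is written out below as a sketch. Here $h^{L}_i$ and $h^{T}_i$ denote the Laos and Thai sentence vectors, $h^{(1)}_i$ and $h^{(2)}_i$ the two matching vectors, and $h_i$ the final matching representation. The element-wise product (so that the "dot product" yields a vector), the absolute value on the difference, and the tanh nonlinearity are assumptions that follow a common design for sentence-pair classifiers; the text above confirms only that a product, a difference, and the parameters $W_1$, $W_2$ and $b$ are involved.

```latex
\begin{aligned}
h^{(1)}_i &= h^{L}_i \odot h^{T}_i \\
h^{(2)}_i &= \left| h^{L}_i - h^{T}_i \right| \\
h_i &= \tanh\left( W_1 h^{(1)}_i + W_2 h^{(2)}_i + b \right)
\end{aligned}
```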
Step3.3, finally, a fully connected layer of a convolutional neural network with a sigmoid function is used to calculate the probability that the Laos sentence and the Thai sentence are parallel sentences, in order to judge whether the two sentences are translations of each other:
$$p(y_i = 1 \mid h_i) = \sigma(W_3 h_i + c)$$
wherein $p(y_i = 1 \mid h_i)$ is the probability, given the obtained vector $h_i$, that the two sentences are translations of each other; $y_i = 1$ means that the two sentences are translations of each other; $W_3$ and $c$ are parameters of the convolutional neural network model; and $\sigma$ is the activation function;
Step3.4, the cross-entropy loss is used as the loss function; over multiple iterations the parameters of the bidirectional LSTM model and the convolutional neural network model are updated, so that the two models are trained, that is, the Laos-Thai parallel sentence pair classification model is trained; the candidate Laos-Thai parallel sentence pairs are then classified by the trained classification model, and the inter-translated Laos-Thai parallel sentences are extracted to obtain Laos-Thai bilingual parallel sentence pairs;
wherein, in the cross-entropy loss function, $y_i = 1$ or $y_i = 0$; $y_i = 1$ indicates that the Laos sentence and the Thai sentence are parallel, and $y_i = 0$ indicates that they are not parallel; n is the number of positive samples, i.e. parallel sentences, used in training the model, and m is the number of negative samples, i.e. non-parallel sentences, used in training the model.
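A standard binary cross-entropy consistent with these definitions is written out below as a sketch. Averaging over all $n + m$ training pairs is an assumption; the text above fixes only the meaning of $y_i$, $n$ and $m$, and $p(y_i = 1 \mid h_i)$ is the probability produced in Step3.3.

```latex
\mathcal{L} = -\frac{1}{n+m} \sum_{i=1}^{n+m}
  \Big[\, y_i \log p(y_i = 1 \mid h_i)
      + (1 - y_i) \log\big(1 - p(y_i = 1 \mid h_i)\big) \Big]
```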
A device for constructing a Laos-Chinese bilingual corpus with Thai as a pivot comprises a data preprocessing module, a dictionary translation module, a Laos-Thai parallel sentence pair extraction module and a Laos-Chinese parallel corpus construction module;
the data preprocessing module is used for extracting Thai sentences from existing Chinese-Thai parallel corpus data and performing Thai word segmentation;
the dictionary translation module is used for constructing a Laos-Thai bilingual dictionary, and translating the Thai sentences word by word into Laos sentences using the Laos-Thai bilingual dictionary to obtain candidate Laos-Thai parallel sentence pairs;
the Laos-Thai parallel sentence pair extraction module is used for constructing a Laos-Thai parallel sentence pair classification model based on a bidirectional LSTM, classifying the candidate Laos-Thai parallel sentence pairs, and extracting the inter-translated Laos-Thai parallel sentences to obtain Laos-Thai bilingual parallel sentence pairs;
the Laos-Chinese parallel corpus construction module is used for matching Laos with Chinese, using the obtained Laos-Thai bilingual parallel sentence pairs and the existing Chinese-Thai parallel corpus with Thai as the pivot language, to construct a Laos-Chinese bilingual parallel corpus.
The invention has the beneficial effects that:
Laos is a low-resource language among the Southeast Asian languages, and it is very difficult to obtain Laos-Chinese bilingual parallel resources directly from the Internet.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 is a diagram of the syntactic similarity of Laos and Thai in the present invention;
FIG. 3 is a diagram of word polysemy in translation in the present invention;
FIG. 4 is a flow chart of parallel sentence classification in the present invention;
FIG. 5 is a structural diagram of the device of the present invention;
FIG. 6 is a block diagram of the general process flow of the present invention.
Detailed Description
Example 1: as shown in FIGS. 1-6, a method for constructing a Laos-Chinese bilingual corpus with Thai as a pivot comprises the following steps:
Step1, extracting Thai sentences from existing Chinese-Thai parallel corpus data and performing Thai word segmentation;
As a preferable scheme of the invention, the specific steps of Step1 are as follows:
Step1.1, selecting Thai sentences of 20-50 characters from an existing Chinese-Thai bilingual parallel corpus;
Step1.2, performing word segmentation on the selected Thai sentences, for which the Southeast Asian minority-language information processing platform developed by Kunming University of Science and Technology, available at http://222.197.219.24:8099/, can be used.
Thai is written continuously, without separators between words, so unsegmented Thai text can neither be translated word by word nor used in the model. The Thai sentences are therefore segmented with the Thai word segmentation tool to obtain word-segmented Thai sentences.
This preferred scheme is an important component of the invention: it provides the corpus and the data preprocessing for the invention, and the corpus basis for the subsequent dictionary translation and model use.
Step2, constructing a Laos-Thai bilingual dictionary, and translating the Thai sentences word by word into Laos sentences using the Laos-Thai bilingual dictionary to obtain candidate Laos-Thai parallel sentence pairs;
As a preferable scheme of the invention, the specific steps of Step2 are as follows:
Step2.1, constructing the Laos-Thai bilingual dictionary: English is used as an intermediate language, and Laos words and Thai words are aligned through their English translations on the basis of a Laos-English dictionary and a Thai-English dictionary, so as to construct the Laos-Thai bilingual dictionary;
Step2.2, the similarity of the syntactic structures of Thai and Laos is analyzed manually; Laos and Thai are essentially consistent in sentence composition, that is, their word order is the same, as shown in FIG. 2, so candidate Laos-Thai parallel sentences can be generated by translating word by word with the dictionary;
Specifically, the Thai sentences in the acquired Chinese-Thai bilingual parallel sentence pairs are translated word by word using the Laos-Thai bilingual dictionary; because a word can have several meanings, dictionary translation may generate several Laos sentences with different semantics, so that candidate Laos-Thai parallel sentence pairs are obtained, as shown in FIG. 3, wherein the candidate Laos-Thai parallel sentence pairs are groups in which one Thai sentence corresponds to several Laos sentences, not all of which are translations of the Thai sentence.
This preferred scheme is the key process for obtaining candidate Laos-Thai parallel sentences: the similarity of Laos and Thai in word formation and other aspects is analyzed and exploited, and a dictionary is constructed for word-by-word translation to obtain a candidate parallel corpus, in preparation for the subsequent extraction of the Laos-Thai parallel corpus by the model.
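The word-by-word candidate generation of Step2.2 can be sketched in Python as below. This is an illustration, not the patent's implementation: the dictionary is assumed to map each Thai word to a list of Laos translations, words missing from the dictionary are copied unchanged, and the `max_candidates` cap is a hypothetical safeguard against combinatorial growth. Word order is preserved, reflecting the shared word order of the two languages.

```python
from itertools import islice, product


def candidate_translations(thai_words, thai_to_laos, max_candidates=100):
    """Translate a segmented Thai sentence word by word, keeping word order.

    Every combination of per-word translation choices yields one candidate
    Laos sentence. Unknown words are copied as-is (an assumption).
    """
    options = [thai_to_laos.get(word) or [word] for word in thai_words]
    combos = islice(product(*options), max_candidates)
    return [list(combo) for combo in combos]


# Toy data: placeholder identifiers stand in for real Thai/Laos words.
thai_to_laos = {"th_a": ["lo_a1", "lo_a2"], "th_b": ["lo_b"]}
thai_sentence = ["th_a", "th_b", "th_c"]
candidates = candidate_translations(thai_sentence, thai_to_laos)

# Each candidate is paired with the same Thai sentence for later classification.
candidate_pairs = [(laos, thai_sentence) for laos in candidates]
```

The ambiguous word `th_a` has two translations, so the single Thai sentence produces two candidate Laos sentences, of which the classifier in Step3 is expected to keep only the true translation.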
Step3, constructing a Laos-Thai parallel sentence pair classification model based on a bidirectional LSTM, classifying the candidate Laos-Thai parallel sentence pairs, and extracting the inter-translated Laos-Thai parallel sentences to obtain Laos-Thai bilingual parallel sentence pairs;
As a preferable scheme of the invention, the specific steps of Step3 are as follows:
Step3.1, manually constructing 9483 sentence-aligned Laos-Thai parallel sentence pairs;
The invention trains the model on a Laos-Thai parallel corpus, so a high-quality parallel corpus is required for the trained model to be effective. The Laos-Thai parallel corpus is therefore constructed manually, ensuring that the training data consist of accurate parallel sentences, from which the Laos-Thai parallel sentence pair classification model is obtained.
The similarity of Thai and Laos in word formation and pronunciation is analyzed. Laos and Thai share many words that are not only synonymous but also written very similarly; for example, the Thai and Laos words meaning "company", "ahead of time" and "boss" are each written in almost the same way. In pronunciation, the Thai word for "Mekong River" is pronounced menamkong, and the Laos word is also pronounced menamkong. As these examples show, Thai and Laos words are written in essentially the same way and pronounced in essentially the same way, and these language characteristics can be used to represent sentences.
Step3.2, because Laos and Thai are highly similar in vocabulary and pronunciation, sentences of the two languages can be represented in a shared semantic space; as shown in FIG. 4, the Laos-Thai parallel sentence pairs are represented in the shared semantic space using a bidirectional LSTM. Compared with a unidirectional LSTM, the bidirectional LSTM additionally encodes the sentence from back to front, and so better captures the relation between forward and backward semantics. The specific process is as follows:
First, each word in the sentence is encoded as a word vector, that is, the word vector is the product of the embedding matrix and the one-hot vector of the word;
wherein E is the embedding matrix, $w_k$ is the one-hot representation of the k-th word in the vocabulary, and i is the sequence number of the sentence.
After the word vectors are obtained, the sentence is fed into the bidirectional LSTM, and the vector of the last state in each of the forward and backward directions is selected as the final representation vector:
$$\overrightarrow{h}^{T}_{i,N} = \mathrm{LSTM}\left(\overrightarrow{h}^{T}_{i,N-1},\ x^{T}_{i,N}\right), \qquad \overleftarrow{h}^{T}_{i,N} = \mathrm{LSTM}\left(\overleftarrow{h}^{T}_{i,N+1},\ x^{T}_{i,N}\right)$$
After the final state vectors in the two directions are obtained, the two vectors are concatenated to give the final representation $h^{T}_{i} = \left[\overrightarrow{h}^{T}_{i};\ \overleftarrow{h}^{T}_{i}\right]$. The Laos sentence is processed in the same way to obtain its final sentence representation $h^{L}_{i} = \left[\overrightarrow{h}^{L}_{i};\ \overleftarrow{h}^{L}_{i}\right]$,
wherein $\overrightarrow{h}^{T}_{i,N}$ is the forward hidden vector of the i-th Thai sentence at state N; $\overrightarrow{h}^{T}_{i,N-1}$ is the forward hidden vector of the i-th Thai sentence at state N-1; $x^{T}_{i,N}$ is the word vector of the i-th Thai sentence at state N; and LSTM denotes the LSTM unit;
$\overleftarrow{h}^{T}_{i,N}$ is the backward hidden vector of the i-th Thai sentence at state N; $\overleftarrow{h}^{T}_{i,N+1}$ is the backward hidden vector of the i-th Thai sentence at state N+1;
$h^{T}_{i}$ is the sentence vector of the i-th Thai sentence, obtained by concatenating the final vectors $\overrightarrow{h}^{T}_{i}$ and $\overleftarrow{h}^{T}_{i}$ from the two directions;
$\overrightarrow{h}^{L}_{i,N}$ is the forward hidden vector of the i-th Laos sentence at state N; $\overrightarrow{h}^{L}_{i,N-1}$ is the forward hidden vector of the i-th Laos sentence at state N-1; $x^{L}_{i,N}$ is the word vector of the i-th Laos sentence at state N;
$\overleftarrow{h}^{L}_{i,N}$ is the backward hidden vector of the i-th Laos sentence at state N; $\overleftarrow{h}^{L}_{i,N+1}$ is the backward hidden vector of the i-th Laos sentence at state N+1;
$h^{L}_{i}$ is the sentence vector of the i-th Laos sentence, obtained by concatenating the final vectors $\overrightarrow{h}^{L}_{i}$ and $\overleftarrow{h}^{L}_{i}$ from the two directions;
In order to measure the degree to which the two sentences are translations of each other, the vector dot product and the vector difference of the two sentence vectors are computed to capture the matching information between them, giving the matching vectors $h^{(1)}_i$ and $h^{(2)}_i$ and, from them, the final vector $h_i$;
wherein $h^{(1)}_i$ and $h^{(2)}_i$ respectively denote the matching vectors containing sentence matching information obtained from the dot product and from the difference of the Laos and Thai sentence vectors; $h_i$ is the final vector representation containing the matching information; and $W_1$, $W_2$ and $b$ are parameters of the bidirectional LSTM model;
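The matching step can be sketched in plain Python as below. The element-wise product, the absolute difference, and the tanh combination are assumptions borrowed from a common sentence-pair classifier design; the text confirms only that a product, a difference, and the parameters W1, W2 and b are used. All function names and the toy values are hypothetical.

```python
import math


def matching_vectors(h_laos, h_thai):
    """Return the element-wise product and absolute difference of two vectors."""
    if len(h_laos) != len(h_thai):
        raise ValueError("sentence vectors must have the same dimension")
    product = [a * b for a, b in zip(h_laos, h_thai)]
    difference = [abs(a - b) for a, b in zip(h_laos, h_thai)]
    return product, difference


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def combine(h1, h2, W1, W2, b):
    """Assumed combination: h = tanh(W1 h1 + W2 h2 + b)."""
    left, right = matvec(W1, h1), matvec(W2, h2)
    return [math.tanh(x + y + z) for x, y, z in zip(left, right, b)]


# Toy 2-dimensional sentence vectors and identity weights.
h1, h2 = matching_vectors([1.0, 2.0], [3.0, -1.0])
identity = [[1.0, 0.0], [0.0, 1.0]]
h = combine(h1, h2, identity, identity, [0.0, 0.0])
```

In a trained model W1, W2 and b would be learned jointly with the bidirectional LSTM rather than fixed as here.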
Step3.3, a fully connected (FC) layer acts as the classifier in the convolutional neural network; after the vector representing the degree of matching between the Laos and Thai sentences is obtained, the fully connected layer of the convolutional neural network with a sigmoid function is used to calculate the probability that the Laos sentence and the Thai sentence are parallel sentences, in order to judge whether the two sentences are parallel (translations of each other):
$$p(y_i = 1 \mid h_i) = \sigma(W_3 h_i + c)$$
wherein $p(y_i = 1 \mid h_i)$ is the probability, given the obtained vector $h_i$, that the two sentences are translations of each other; $y_i = 1$ means that the two sentences are parallel (translations of each other); $W_3$ and $c$ are parameters of the convolutional neural network model; and $\sigma$ is the activation function;
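The probability formula above can be evaluated with a few lines of Python. This is a sketch under assumptions: W3 is treated as a single weight vector (one output unit), and the 0.5 decision threshold is hypothetical, since the text does not state the cut-off used to accept a pair as parallel.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def parallel_probability(h, w3, c):
    """p(y = 1 | h) = sigmoid(w3 . h + c)."""
    return sigmoid(sum(w * x for w, x in zip(w3, h)) + c)


def is_parallel(h, w3, c, threshold=0.5):
    """Accept the pair as parallel when the probability reaches the threshold."""
    return parallel_probability(h, w3, c) >= threshold


# Toy matching vector and parameters.
p = parallel_probability([1.0, 2.0], [1.0, 1.0], -1.0)  # sigmoid(2.0)
```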
Step3.4, the cross-entropy loss is used as the loss function and training is iterated 15 times; the parameters of the bidirectional LSTM model and the convolutional neural network model are updated, so that the two models are trained, that is, the Laos-Thai parallel sentence pair classification model is trained; the candidate Laos-Thai parallel sentence pairs are then classified by the trained classification model, and the inter-translated Laos-Thai parallel sentences are extracted to obtain Laos-Thai bilingual parallel sentence pairs;
wherein, in the cross-entropy loss function, $y_i = 1$ or $y_i = 0$; $y_i = 1$ indicates that the Laos sentence and the Thai sentence are parallel, and $y_i = 0$ indicates that they are not parallel; n is the number of positive samples, i.e. parallel sentences, used in training the model, and m is the number of negative samples, i.e. non-parallel sentences, used in training the model.
The 9483 manually constructed Laos-Thai bilingual parallel sentence pairs are used to train the model: after word segmentation they are divided into a training set of 8883 pairs and a test set of 600 pairs, the test set being used to evaluate the training result of the model.
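A cross-entropy loss of the kind used in Step3.4 can be computed as in the sketch below. It assumes the standard binary cross-entropy averaged over all training pairs (positive and negative together); the averaging and the clipping constant `eps` are assumptions, not details given in the text.

```python
import math


def cross_entropy(probabilities, labels, eps=1e-12):
    """Mean binary cross-entropy over all (probability, label) pairs.

    probabilities[i] is the predicted p(y_i = 1 | h_i); labels[i] is 1 for a
    parallel pair and 0 for a non-parallel pair.
    """
    if not labels or len(probabilities) != len(labels):
        raise ValueError("need equally many probabilities and labels")
    total = 0.0
    for p, y in zip(probabilities, labels):
        p = min(max(p, eps), 1.0 - eps)  # avoid log(0)
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)


# One positive and one negative sample, both predicted fairly well.
loss = cross_entropy([0.9, 0.1], [1, 0])
```

The loss shrinks toward zero as the predicted probabilities approach the labels, which is what the parameter updates in each training iteration aim for.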
Step4, matching the obtained Laos-Thai bilingual parallel sentence corpus with the existing Chinese-Thai parallel corpus by taking Thai as pivot language to build a Laos-Chinese bilingual parallel corpus.
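The pivot matching of Step4 amounts to a join on the Thai sentence, sketched below. Exact string equality of the Thai side is an assumption (in practice the two corpora would need identical normalization and segmentation), and the function name and toy data are hypothetical.

```python
from collections import defaultdict


def pivot_join(laos_thai_pairs, chinese_thai_pairs):
    """Pair each Laos sentence with the Chinese sentences sharing its Thai sentence."""
    thai_to_chinese = defaultdict(list)
    for chinese, thai in chinese_thai_pairs:
        thai_to_chinese[thai].append(chinese)

    laos_chinese = []
    for laos, thai in laos_thai_pairs:
        for chinese in thai_to_chinese.get(thai, []):
            laos_chinese.append((laos, chinese))
    return laos_chinese


# Toy corpora: placeholder identifiers stand in for real sentences.
chinese_thai = [("zh-1", "th-1"), ("zh-2", "th-2")]
laos_thai = [("lo-1", "th-1"), ("lo-3", "th-3")]
laos_chinese = pivot_join(laos_thai, chinese_thai)
```

Laos sentences whose Thai pivot has no Chinese counterpart, such as `lo-3` here, simply drop out of the resulting corpus.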
For the classification of Laos-Thai parallel sentences by the classification model, the invention uses the F1 value to evaluate the quality of the model, calculated as follows:
$$P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}, \qquad F1 = \frac{2 \cdot P \cdot R}{P + R}$$
wherein TP is the number of positive samples predicted as positive, FN is the number of positive samples predicted as negative, FP is the number of negative samples predicted as positive, P is the precision and R is the recall. The F1 value is the harmonic mean of precision and recall.
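The F1 value can be computed from predictions as in the following sketch, which uses the standard definitions of precision and recall. Returning 0.0 when a denominator is zero is a convention chosen here, not something specified in the text.

```python
def confusion_counts(predicted, actual):
    """Count true positives, false positives and false negatives (labels are 0/1)."""
    tp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 1)
    fp = sum(1 for p, a in zip(predicted, actual) if p == 1 and a == 0)
    fn = sum(1 for p, a in zip(predicted, actual) if p == 0 and a == 1)
    return tp, fp, fn


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall; 0.0 if undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Toy predictions: one true positive, one false positive, one false negative.
tp, fp, fn = confusion_counts([1, 1, 0, 0], [1, 0, 1, 0])
score = f1_score(tp, fp, fn)
```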
In order to compare the Laos-Thai parallel sentence pair classification model with traditional machine learning methods on parallel sentence classification, the classification model of the invention is compared with several common machine learning models, as shown in Table 1.
Table 1: Comparison of parallel sentence classification model results
No. | Model | F1 value (%) |
1 | SVM | 68.78 |
2 | LR | 65.04 |
3 | Random forest | 51.49 |
4 | GBDT | 60.03 |
5 | Laos-Thai parallel sentence pair classification model | 71.30 |
The results in Table 1 show that the Laos-Thai parallel sentence pair classification model achieves a higher F1 value on parallel sentence classification than the traditional machine learning methods. The Laos-Thai bilingual parallel sentence pairs obtained with it are therefore more accurate, and so is the Laos-Chinese bilingual parallel corpus constructed by matching Laos with Chinese through the existing Chinese-Thai parallel corpus with Thai as the pivot language.
Referring to FIG. 5, the invention provides a device for constructing a Laos-Chinese bilingual corpus with Thai as a pivot, which comprises a data preprocessing module, a dictionary translation module, a Laos-Thai parallel sentence pair extraction module and a Laos-Chinese parallel corpus construction module;
the data preprocessing module is used for extracting Thai sentences from existing Chinese-Thai parallel corpus data and performing Thai word segmentation;
the dictionary translation module is used for constructing a Laos-Thai bilingual dictionary, and translating the Thai sentences word by word into Laos sentences using the Laos-Thai bilingual dictionary to obtain candidate Laos-Thai parallel sentence pairs;
the Laos-Thai parallel sentence pair extraction module is used for constructing a Laos-Thai parallel sentence pair classification model based on a bidirectional LSTM, classifying the candidate Laos-Thai parallel sentence pairs, and extracting the inter-translated Laos-Thai parallel sentences to obtain Laos-Thai bilingual parallel sentence pairs;
the Laos-Chinese parallel corpus construction module is used for matching Laos with Chinese, using the obtained Laos-Thai bilingual parallel sentence pairs and the existing Chinese-Thai parallel corpus with Thai as the pivot language, to construct a Laos-Chinese bilingual parallel corpus.
While the present invention has been described in detail with reference to the embodiments shown in the drawings, the present invention is not limited to the embodiments, and various changes can be made without departing from the spirit of the present invention within the knowledge of those skilled in the art.
Claims (5)
1. A method for constructing an old-Chinese bilingual corpus with Thai as a pivot is characterized by comprising the following steps of: the method comprises the following steps:
step1, extracting Thai sentences from the existing Chinese-Thai parallel corpus data and carrying out Thai word segmentation processing;
step2, constructing a Laos-Thai bilingual dictionary, and translating the Thai sentences into Laos sentence subsequences word by using the Laos-Thai bilingual dictionary to obtain candidate Laos-Thai parallel sentence pairs;
step3, constructing a Laos-Thai parallel sentence pair classification model based on bidirectional LSTM, classifying candidate Laos-Thai parallel sentence pairs, and extracting inter-translated Laos-Thai parallel sentences to obtain Laos-Thai bilingual parallel sentence pairs;
step4, matching the obtained Laos-Thai bilingual parallel sentence corpus with the existing Chinese-Thai parallel corpus by taking Thai as pivot language to build a Laos-Chinese bilingual parallel corpus.
2. The method of claim 1 for constructing an old-chinese bilingual corpus pivoted in thai, wherein: the specific steps of Step1 are as follows:
step1.1, selecting Thai sentences with 20-50 characters from an existing Chinese-Thai bilingual parallel corpus;
and Step1.2, performing word segmentation on the selected Thai sentences.
3. The method of claim 1 for constructing an old-chinese bilingual corpus pivoted in thai, wherein: the specific Step of Step2 is as follows:
construction of Step2.1 and Laos-Thai bilingual dictionary: mainly using English as an intermediate language, aligning Laos and Thai words by using English words on the basis of a Laos-English dictionary and a Thai-English dictionary, and constructing a Laos-Thai bilingual dictionary;
step2.2, because Laos-Thai are extremely similar, the Thai sentences in the acquired Chinese-Thai bilingual parallel sentence pairs are translated word by using a Laos-Thai bilingual dictionary, and because the situation of one word is ambiguous, a plurality of Laos sentences with different semantemes can be generated during translation by the dictionary, so that candidate Laos-Thai parallel sentence pairs are obtained, wherein the candidate Laos-Thai parallel sentence pairs are a plurality of groups of sentences of a plurality of Laos corresponding to one Thai sentence, and the Laos sentences are not completely translated with each other.
4. The method for constructing a Laos-Chinese bilingual corpus with Thai as pivot according to claim 1, wherein the specific steps of Step3 are as follows:
Step3.1, manually constructing a sentence-aligned Laos-Thai parallel corpus;
Step3.2, because Laos and Thai are highly similar in vocabulary and pronunciation, the constructed Laos-Thai parallel sentence pairs are represented in a shared semantic space using a bidirectional LSTM; specifically, the bidirectional LSTM is used to obtain forward and backward state vectors, which are concatenated to give the sentence vector representation in the shared semantic space, namely (in the notation used here, the superscripts $Th$ and $Lo$ mark Thai and Laos):

$$\overrightarrow{h}^{Th}_{i,N} = \mathrm{LSTM}\left(\overrightarrow{h}^{Th}_{i,N-1},\; x^{Th}_{i,N}\right)$$

$$\overleftarrow{h}^{Th}_{i,N} = \mathrm{LSTM}\left(\overleftarrow{h}^{Th}_{i,N+1},\; x^{Th}_{i,N}\right)$$

$$h^{Th}_{i} = \left[\overrightarrow{h}^{Th}_{i};\; \overleftarrow{h}^{Th}_{i}\right]$$

where $\overrightarrow{h}^{Th}_{i,N}$ is the forward hidden vector of the i-th Thai sentence in state N; $\overrightarrow{h}^{Th}_{i,N-1}$ is the forward hidden vector of the i-th Thai sentence in state N-1; $x^{Th}_{i,N}$ is the word vector of the i-th Thai sentence in state N; and $\mathrm{LSTM}(\cdot)$ denotes the LSTM activation function;
$\overleftarrow{h}^{Th}_{i,N}$ is the backward hidden vector of the i-th Thai sentence in state N; $\overleftarrow{h}^{Th}_{i,N+1}$ is the backward hidden vector of the i-th Thai sentence in state N+1;
$h^{Th}_{i}$ is the sentence vector of the i-th Thai sentence, obtained by concatenating the final vectors from the two directions;

$$\overrightarrow{h}^{Lo}_{i,N} = \mathrm{LSTM}\left(\overrightarrow{h}^{Lo}_{i,N-1},\; x^{Lo}_{i,N}\right)$$

$$\overleftarrow{h}^{Lo}_{i,N} = \mathrm{LSTM}\left(\overleftarrow{h}^{Lo}_{i,N+1},\; x^{Lo}_{i,N}\right)$$

$$h^{Lo}_{i} = \left[\overrightarrow{h}^{Lo}_{i};\; \overleftarrow{h}^{Lo}_{i}\right]$$

where $\overrightarrow{h}^{Lo}_{i,N}$ is the forward hidden vector of the i-th Laos sentence in state N; $\overrightarrow{h}^{Lo}_{i,N-1}$ is the forward hidden vector of the i-th Laos sentence in state N-1; $x^{Lo}_{i,N}$ is the word vector of the i-th Laos sentence in state N;
$\overleftarrow{h}^{Lo}_{i,N}$ is the backward hidden vector of the i-th Laos sentence in state N; $\overleftarrow{h}^{Lo}_{i,N+1}$ is the backward hidden vector of the i-th Laos sentence in state N+1;
$h^{Lo}_{i}$ is the sentence vector of the i-th Laos sentence, obtained by concatenating the final vectors from the two directions;
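The bidirectional encoding described above can be illustrated with a small pure-Python LSTM. This is a sketch under stated assumptions, not the patented model: the class `LSTMCell`, the helper names, the standard four-gate LSTM formulation, the random initialisation, and the zero initial states are all choices made here for illustration, and no training is shown.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def matvec(matrix, vector):
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]

class LSTMCell:
    """Standard LSTM cell with input, forget, output and candidate gates."""

    def __init__(self, input_size, hidden_size, seed=0):
        rng = random.Random(seed)
        width = input_size + hidden_size
        self.hidden_size = hidden_size
        # One weight matrix and one bias vector per gate (i, f, o, g).
        self.W = {gate: [[rng.uniform(-0.1, 0.1) for _ in range(width)]
                         for _ in range(hidden_size)] for gate in "ifog"}
        self.b = {gate: [0.0] * hidden_size for gate in "ifog"}

    def step(self, x, h, c):
        z = h + x  # list concatenation [h; x]
        pre = {gate: [a + b for a, b in zip(matvec(self.W[gate], z), self.b[gate])]
               for gate in "ifog"}
        i = [sigmoid(v) for v in pre["i"]]
        f = [sigmoid(v) for v in pre["f"]]
        o = [sigmoid(v) for v in pre["o"]]
        g = [math.tanh(v) for v in pre["g"]]
        c_new = [fk * ck + ik * gk for fk, ck, ik, gk in zip(f, c, i, g)]
        h_new = [ok * math.tanh(ck) for ok, ck in zip(o, c_new)]
        return h_new, c_new

def final_state(cell, word_vectors):
    """Run the cell over the sequence and return the last hidden vector."""
    h = [0.0] * cell.hidden_size
    c = [0.0] * cell.hidden_size
    for x in word_vectors:
        h, c = cell.step(x, h, c)
    return h

def encode_sentence(forward_cell, backward_cell, word_vectors):
    """Concatenate the final forward state and the final backward state."""
    forward = final_state(forward_cell, word_vectors)
    backward = final_state(backward_cell, list(reversed(word_vectors)))
    return forward + backward
```

With a hidden size of 4 per direction, `encode_sentence` returns an 8-dimensional sentence vector; the same pair of cells, or a separate pair, would be applied to the Laos and Thai sides depending on how the shared space is realised.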
Then the matching information between the two sentence vectors is captured using a vector dot product and a vector difference to obtain the matching vectors, where the two matching vectors contain the sentence matching information obtained by computing, respectively, the dot product and the difference of the Laos and Thai sentence vectors; $h_i$ is the final vector representation containing the matching information; and $W_1$, $W_2$ and $b$ are parameters of the bidirectional LSTM model;
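One plausible form of the matching step is sketched below. The exact equations are not recoverable from the text, so the following are assumptions: the "dot product" is taken as an element-wise product (so that the result is a vector), the difference is taken in absolute value, and the two matching vectors are combined as `tanh(W1·h1 + W2·h2 + b)`. The function name and argument layout are likewise hypothetical.

```python
import math

def matching_representation(h_lao, h_thai, W1, W2, b):
    """Combine product and difference features of two sentence vectors.

    Assumed form: h = tanh(W1 @ (h_lao * h_thai) + W2 @ |h_lao - h_thai| + b).
    """
    h1 = [a * t for a, t in zip(h_lao, h_thai)]       # element-wise product
    h2 = [abs(a - t) for a, t in zip(h_lao, h_thai)]  # absolute difference
    h = []
    for row1, row2, bias in zip(W1, W2, b):
        z = (sum(w * v for w, v in zip(row1, h1))
             + sum(w * v for w, v in zip(row2, h2))
             + bias)
        h.append(math.tanh(z))
    return h

# Tiny hand-checkable example: h1 = [3.0, 1.0], h2 = [2.0, 1.5].
h = matching_representation(
    [1.0, 2.0], [3.0, 0.5],
    W1=[[1.0, 0.0], [0.0, 0.0]],
    W2=[[0.0, 0.0], [0.0, 1.0]],
    b=[0.0, 0.0],
)
```

The product term is large where both vectors agree in sign and magnitude, while the difference term is large where they diverge, so the pair gives the classifier complementary evidence about whether the sentences match.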
Step3.3, finally, the probability that the Laos sentence and the Thai sentence are parallel is computed by a fully connected layer of a convolutional neural network through a sigmoid function, in order to judge whether the two sentences are translations of each other:

$$p(y_i = 1 \mid h_i) = \sigma(W_3 h_i + c)$$

where $p(y_i = 1 \mid h_i)$ is the probability, given the obtained vector $h_i$, that the two sentences are translations of each other; $y_i = 1$ means that the two sentences are translations of each other; $W_3$ and $c$ are parameters of the convolutional neural network model; and $\sigma$ is the activation function;
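The probability formula of Step3.3 translates directly into a few lines of code. The sketch below treats $W_3$ as a single weight vector producing one scalar score; the function names and the decision threshold of 0.5 are assumptions, since the claim states only that the probability is used to judge whether the sentences are mutual translations.

```python
import math

def parallel_probability(h, w3, c):
    """p(y = 1 | h) = sigmoid(w3 . h + c)."""
    z = sum(w * v for w, v in zip(w3, h)) + c
    return 1.0 / (1.0 + math.exp(-z))

def is_parallel(h, w3, c, threshold=0.5):
    """Assumed decision rule: accept the pair when the probability reaches the threshold."""
    return parallel_probability(h, w3, c) >= threshold
```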
Step3.4, a cross-entropy loss is used as the loss function; over multiple iterations the parameters of the bidirectional LSTM model and the convolutional neural network model are updated, which trains the two models, that is, the Laos-Thai parallel sentence pair classification model; the candidate Laos-Thai parallel sentence pairs are then classified by the trained classification model, and the mutually translated Laos-Thai sentences are extracted to obtain Laos-Thai parallel sentence pairs;
in the loss function, $y_i = 1$ or $y_i = 0$: $y_i = 1$ indicates that the Laos sentence and the Thai sentence are parallel, and $y_i = 0$ indicates that they are not parallel; $n$ is the number of positive samples, i.e. parallel sentence pairs, used in training, and $m$ is the number of negative samples, i.e. non-parallel sentence pairs, used in training.
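For orientation, a standard binary cross-entropy over labelled sentence pairs looks as follows. The averaging over all samples (the n positive plus m negative pairs) and the clipping constant `eps` are assumptions made here; the normalisation used in the claimed loss is not recoverable from the text.

```python
import math

def cross_entropy_loss(labels, probabilities, eps=1e-12):
    """Mean binary cross-entropy: -(1/N) * sum(y*log(p) + (1-y)*log(1-p)).

    labels: 1 for parallel pairs, 0 for non-parallel pairs.
    probabilities: predicted p(y = 1 | h) for each pair.
    """
    total = 0.0
    for y, p in zip(labels, probabilities):
        p = min(max(p, eps), 1.0 - eps)  # keep log() finite
        total += y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return -total / len(labels)

# One positive and one negative pair, both predicted at 0.5.
loss = cross_entropy_loss([1, 0], [0.5, 0.5])
```

The loss falls towards zero as parallel pairs receive probabilities near 1 and non-parallel pairs receive probabilities near 0, which is the signal used to update the model parameters across iterations.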
5. A Laos-Chinese bilingual corpus construction device with Thai as pivot, characterized by comprising a data preprocessing module, a dictionary translation module, a Laos-Thai parallel sentence pair extraction module and a Laos-Chinese parallel corpus construction module;
the data preprocessing module is used for extracting Thai sentences from the existing Chinese-Thai parallel corpus data and performing Thai word segmentation;
the dictionary translation module is used for constructing a Laos-Thai bilingual dictionary and translating the Thai sentences into Laos sentences word by word using the Laos-Thai bilingual dictionary to obtain candidate Laos-Thai parallel sentence pairs;
the Laos-Thai parallel sentence pair extraction module is used for constructing a Laos-Thai parallel sentence pair classification model based on a bidirectional LSTM, classifying the candidate Laos-Thai parallel sentence pairs, and extracting the mutually translated Laos-Thai sentences to obtain Laos-Thai bilingual parallel sentence pairs;
the Laos-Chinese parallel corpus construction module is used for matching, with Thai as the pivot language, the obtained Laos-Thai bilingual parallel sentence pairs with the existing Chinese-Thai parallel corpus so as to pair Laos with Chinese and construct the Laos-Chinese bilingual parallel corpus.
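The four modules of the device can be pictured as stages wired in sequence. The sketch below is only one way to express that wiring: the class name `CorpusBuilder`, the field names, and the data shapes passed between stages are hypothetical, and each stage is an injected callable so that any concrete implementation can be plugged in.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Pair = Tuple[str, str]

@dataclass
class CorpusBuilder:
    # (Chinese, Thai) pairs -> segmented Thai sentences
    preprocess: Callable[[List[Pair]], List[List[str]]]
    # segmented Thai sentences -> candidate (Laos, Thai) pairs
    translate: Callable[[List[List[str]]], List[Pair]]
    # candidate pairs -> pairs judged to be mutual translations
    extract: Callable[[List[Pair]], List[Pair]]
    # (Laos, Thai) pairs + (Chinese, Thai) pairs -> (Laos, Chinese) pairs
    pivot: Callable[[List[Pair], List[Pair]], List[Pair]]

    def run(self, chinese_thai_pairs: List[Pair]) -> List[Pair]:
        segmented = self.preprocess(chinese_thai_pairs)
        candidates = self.translate(segmented)
        lao_thai = self.extract(candidates)
        return self.pivot(lao_thai, chinese_thai_pairs)

# Toy stages operating on placeholder strings.
builder = CorpusBuilder(
    preprocess=lambda pairs: [thai.split() for _, thai in pairs],
    translate=lambda sents: [(" ".join("lo_" + t for t in s), " ".join(s)) for s in sents],
    extract=lambda cands: cands,
    pivot=lambda lt, zt: [(lao, zh) for lao, thai in lt for zh, th in zt if th == thai],
)
corpus = builder.run([("zh_1", "a b")])
```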
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910856645.8A CN110717341B (en) | 2019-09-11 | 2019-09-11 | Method and device for constructing old-Chinese bilingual corpus with Thai as pivot |
Publications (2)
Publication Number | Publication Date |
---|---|
CN110717341A (en) | 2020-01-21 |
CN110717341B (en) | 2022-06-14 |
Family
ID=69209837
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910856645.8A Active CN110717341B (en) | 2019-09-11 | 2019-09-11 | Method and device for constructing old-Chinese bilingual corpus with Thai as pivot |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110717341B (en) |
Patent Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN102855263A (en) * | 2011-06-30 | 2013-01-02 | 富士通株式会社 | Method and device for aligning sentences in bilingual corpus |
US9348809B1 (en) * | 2015-02-02 | 2016-05-24 | Linkedin Corporation | Modifying a tokenizer based on pseudo data for natural language processing |
CN108363704A (en) * | 2018-03-02 | 2018-08-03 | 北京理工大学 | A kind of neural network machine translation corpus expansion method based on statistics phrase table |
CN108491383A (en) * | 2018-03-14 | 2018-09-04 | 昆明理工大学 | A kind of Thai sentence cutting method based on maximum entropy disaggregated model and the correction of Thai syntax rule |
CN108549629A (en) * | 2018-03-19 | 2018-09-18 | 昆明理工大学 | A kind of combination similarity and scheme matched old-Chinese bilingual sentence alignment schemes |
CN109783809A (en) * | 2018-12-22 | 2019-05-21 | 昆明理工大学 | A method of alignment sentence is extracted from Laos-Chinese chapter grade alignment corpus |
CN109885686A (en) * | 2019-02-20 | 2019-06-14 | 延边大学 | A kind of multilingual file classification method merging subject information and BiLSTM-CNN |
CN110083826A (en) * | 2019-03-21 | 2019-08-02 | 昆明理工大学 | A kind of old man's bilingual alignment method based on Transformer model |
CN110110061A (en) * | 2019-04-26 | 2019-08-09 | 同济大学 | Low-resource languages entity abstracting method based on bilingual term vector |
Non-Patent Citations (4)
Title |
---|
KEYSERS DANIEL et al.: "Multi-language online handwriting recognition", IEEE TRANSACTIONS ON PATTERN ANALYSIS AND MACHINE INTELLIGENCE * |
WANG YONGQIANG et al.: "Research on the Recognition of Offline Handwritten New Tai Lue Characters Based on Bidirectional LSTM", INTERNATIONAL CONFERENCE ON NETWORK, COMMUNICATION, COMPUTER ENGINEERING * |
YANG BEI et al.: "Research on a semi-supervised learning method for Lao part-of-speech tagging", Computer Science * |
NIE NAN: "Research on a method for constructing a Lao-Chinese bilingual corpus with Thai as pivot", China Doctoral and Master's Theses Full-text Database (Master's), Philosophy and Humanities * |
Cited By (10)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112287688A (en) * | 2020-09-17 | 2021-01-29 | 昆明理工大学 | English-Burmese bilingual parallel sentence pair extraction method and device integrating pre-training language model and structural features |
CN112287688B (en) * | 2020-09-17 | 2022-02-11 | 昆明理工大学 | English-Burmese bilingual parallel sentence pair extraction method and device integrating pre-training language model and structural features |
RU2790026C2 (en) * | 2020-12-22 | 2023-02-14 | Общество С Ограниченной Ответственностью "Яндекс" | Method and server for training machine learning algorithm for translation |
CN113627150A (en) * | 2021-07-01 | 2021-11-09 | 昆明理工大学 | Method and device for extracting parallel sentence pairs for transfer learning based on language similarity |
CN113627150B (en) * | 2021-07-01 | 2022-12-20 | 昆明理工大学 | Language similarity-based parallel sentence pair extraction method and device for transfer learning |
CN114417807A (en) * | 2022-01-24 | 2022-04-29 | 中国电子科技集团公司第五十四研究所 | Human-like language description expression method oriented to presence or absence of human collaboration scene |
CN114417807B (en) * | 2022-01-24 | 2023-09-22 | 中国电子科技集团公司第五十四研究所 | Human-like language description expression method for collaboration scene of presence or absence |
CN115329785A (en) * | 2022-10-15 | 2022-11-11 | 小语智能信息科技(云南)有限公司 | Phoneme feature-fused English-Tai-old multi-language neural machine translation method and device |
CN116822495A (en) * | 2023-08-31 | 2023-09-29 | 小语智能信息科技(云南)有限公司 | Chinese-old and Tai parallel sentence pair extraction method and device based on contrast learning |
CN116822495B (en) * | 2023-08-31 | 2023-11-03 | 小语智能信息科技(云南)有限公司 | Chinese-old and Tai parallel sentence pair extraction method and device based on contrast learning |
Also Published As
Publication number | Publication date |
---|---|
CN110717341B (en) | 2022-06-14 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN110717341B (en) | Method and device for constructing old-Chinese bilingual corpus with Thai as pivot | |
CN109213995B (en) | Cross-language text similarity evaluation technology based on bilingual word embedding | |
CN108614875B (en) | Chinese emotion tendency classification method based on global average pooling convolutional neural network | |
CN110059188B (en) | Chinese emotion analysis method based on bidirectional time convolution network | |
CN111931506B (en) | Entity relationship extraction method based on graph information enhancement | |
CN106599032B (en) | Text event extraction method combining sparse coding and structure sensing machine | |
CN111061861B (en) | Text abstract automatic generation method based on XLNet | |
CN110414009B (en) | Burma bilingual parallel sentence pair extraction method and device based on BilSTM-CNN | |
CN110619043A (en) | Automatic text abstract generation method based on dynamic word vector | |
CN112231472B (en) | Judicial public opinion sensitive information identification method integrated with domain term dictionary | |
CN110489750A (en) | Burmese participle and part-of-speech tagging method and device based on two-way LSTM-CRF | |
CN112016320A (en) | English punctuation adding method, system and equipment based on data enhancement | |
CN112561718A (en) | Case microblog evaluation object emotion tendency analysis method based on BilSTM weight sharing | |
Huang et al. | End-to-end sequence labeling via convolutional recurrent neural network with a connectionist temporal classification layer | |
CN114818717A (en) | Chinese named entity recognition method and system fusing vocabulary and syntax information | |
CN111581943A (en) | Chinese-over-bilingual multi-document news viewpoint sentence identification method based on sentence association graph | |
CN115510863A (en) | Question matching task oriented data enhancement method | |
CN111553157A (en) | Entity replacement-based dialog intention identification method | |
Bigot et al. | Person name recognition in ASR outputs using continuous context models | |
CN112632272A (en) | Microblog emotion classification method and system based on syntactic analysis | |
Zhao et al. | Tibetan Multi-Dialect Speech and Dialect Identity Recognition. | |
CN117332789A (en) | Semantic analysis method and system for dialogue scene | |
Sun | Analysis of Chinese machine translation training based on deep learning technology | |
CN115934948A (en) | Knowledge enhancement-based drug entity relationship combined extraction method and system | |
Suleiman et al. | Recurrent neural network techniques: Emphasis on use in neural machine translation |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||