CN110717341A - Method and device for constructing a Lao-Chinese bilingual corpus with Thai as pivot

Publication number
CN110717341A
Authority
CN
China
Prior art keywords
thai
laos
sentence
parallel
sentences
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910856645.8A
Other languages
Chinese (zh)
Other versions
CN110717341B (en)
Inventor
毛存礼
高旭
余正涛
高盛祥
王振晗
聂男
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kunming University of Science and Technology
Original Assignee
Kunming University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kunming University of Science and Technology
Priority to CN201910856645.8A
Publication of CN110717341A
Application granted
Publication of CN110717341B
Legal status: Active

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/33 Querying
    • G06F16/3331 Query processing
    • G06F16/334 Query execution
    • G06F16/3344 Query execution using natural language analysis
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method and a device for constructing a Lao-Chinese bilingual corpus with Thai as a pivot, and belongs to the field of natural language processing. First, Thai word segmentation is performed on Chinese-Thai parallel corpus data. A Lao-Thai bilingual dictionary is then constructed and used to translate the Thai sentences into Lao sentences word by word, yielding candidate Lao-Thai parallel sentence pairs. A Lao-Thai parallel sentence pair classification model based on a bidirectional LSTM is constructed to classify the candidate pairs and obtain Lao-Thai bilingual parallel sentence pairs. Finally, Lao is matched with Chinese by taking Thai as the pivot language to construct a Lao-Chinese bilingual parallel corpus. The invention addresses the scarcity of Lao-Chinese corpus data and has theoretical significance and practical value for the construction of Lao-Chinese bilingual corpora.

Description

Method and device for constructing a Lao-Chinese bilingual corpus with Thai as pivot
Technical Field
The invention relates to a method and a device for constructing a Lao-Chinese bilingual corpus with Thai as a pivot, and belongs to the technical field of natural language processing.
Background
Corpus construction is a prerequisite for natural language processing research. A Lao-Chinese bilingual corpus is an important data resource for developing Chinese-Lao machine translation and cross-language retrieval. Lao is a low-resource language among the Southeast Asian languages, Lao-Chinese bilingual parallel resources are scarce, and obtaining them directly from the Internet is very difficult.
Lao and Thai both belong to the Zhuang-Dai branch of the Zhuang-Dong group of the Sino-Tibetan language family. Their basic vocabularies are almost identical or similar, their syntactic structures are highly similar, and Chinese-Thai parallel corpora are relatively easy to obtain. The similarity between Lao and Thai can therefore be used to obtain Lao-Thai parallel sentence pairs, and a Lao-Chinese bilingual parallel corpus can then be constructed with Thai as the pivot.
Disclosure of Invention
The invention provides a method and a device for constructing a Lao-Chinese bilingual corpus with Thai as a pivot, which are used to construct a Lao-Chinese bilingual parallel corpus.
The technical scheme of the invention is as follows: a method for constructing a Lao-Chinese bilingual corpus with Thai as a pivot comprises the following steps:
Step1, extracting Thai sentences from existing Chinese-Thai parallel corpus data and performing Thai word segmentation;
Step2, constructing a Lao-Thai bilingual dictionary, and translating the Thai sentences into Lao sentences word by word using the Lao-Thai bilingual dictionary to obtain candidate Lao-Thai parallel sentence pairs;
Step3, constructing a Lao-Thai parallel sentence pair classification model based on a bidirectional LSTM, classifying the candidate Lao-Thai parallel sentence pairs, and extracting the Lao-Thai sentences that are translations of each other to obtain Lao-Thai bilingual parallel sentence pairs;
Step4, matching the obtained Lao-Thai bilingual parallel sentence pairs with the existing Chinese-Thai parallel corpus, taking Thai as the pivot language to match Lao with Chinese, to construct a Lao-Chinese bilingual parallel corpus.
Further, the specific steps of Step1 are as follows:
Step1.1, selecting Thai sentences of 20-50 characters from an existing Chinese-Thai bilingual parallel corpus;
Step1.2, performing word segmentation on the selected Thai sentences, wherein the word segmentation tool is the Southeast Asian minor-language information processing platform developed by Kunming University of Science and Technology, available at http://222.197.219.24:8099/.
Thai is written as continuous text without spaces between words, so unsegmented Thai sentences can neither be translated word by word nor be fed into the model. Word segmentation is therefore performed with the Thai word segmentation tool to obtain segmented Thai sentences.
This preferred scheme is an important component of the invention: it provides the corpus and the data preprocessing for the invention, and supplies the corpus basis for the subsequent dictionary translation and model use.
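The length filter of Step1.1 can be sketched as follows. This is a minimal illustration under stated assumptions, not the platform's implementation: the function name `select_thai_sentences`, the (Chinese, Thai) tuple layout, and counting "characters" as Unicode code points are all hypothetical choices. Word segmentation is left to the external tool and is not reproduced here.

```python
def select_thai_sentences(pairs, min_len=20, max_len=50):
    """Keep (chinese, thai) pairs whose Thai side has min_len..max_len characters.

    Assumption: "characters" are counted as Unicode code points via len().
    """
    return [(zh, th) for zh, th in pairs if min_len <= len(th) <= max_len]


# Hypothetical toy data; the Latin letters merely stand in for Thai text.
pairs = [("zh-short", "a" * 10), ("zh-ok", "b" * 30), ("zh-long", "c" * 60)]
selected = select_thai_sentences(pairs)
```

Keeping the Chinese side attached to each Thai sentence at this stage makes the later pivot matching a simple lookup.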
Further, the specific steps of Step2 are as follows:
Step2.1, construction of the Lao-Thai bilingual dictionary: using English as an intermediate language, Lao and Thai words are aligned through their English translations on the basis of a Lao-English dictionary and a Thai-English dictionary, thereby constructing a Lao-Thai bilingual dictionary;
Step2.2, because Lao and Thai are extremely similar, the Thai sentences in the acquired Chinese-Thai bilingual parallel sentence pairs are translated word by word using the Lao-Thai bilingual dictionary; because a word may have several meanings, dictionary translation can produce several Lao sentences with different semantics, so that candidate Lao-Thai parallel sentence pairs are obtained, wherein the candidate Lao-Thai parallel sentence pairs are groups in which one Thai sentence corresponds to several Lao sentences, not all of which are translations of the Thai sentence.
This preferred scheme is an important process for obtaining candidate Lao-Thai parallel sentences: the similarity of Lao and Thai in word formation and other respects is analyzed and exploited, and a dictionary is constructed for word-by-word translation to obtain a candidate parallel corpus, in preparation for the subsequent model-based extraction of the Lao-Thai parallel corpus.
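The English-pivot dictionary construction of Step2.1 can be illustrated with a short sketch. The input format (a word mapped to a list of English glosses), the case-insensitive gloss matching, and the function name `build_lao_thai_dictionary` are assumptions made for illustration; the source does not specify how the dictionaries are stored or how glosses are compared.

```python
from collections import defaultdict


def build_lao_thai_dictionary(lao_en, thai_en):
    """Align Thai and Lao words that share at least one English translation.

    lao_en / thai_en map a word to a list of English glosses (assumed format).
    Returns a dict: Thai word -> sorted list of candidate Lao words.
    """
    en_to_lao = defaultdict(set)
    for lao_word, glosses in lao_en.items():
        for gloss in glosses:
            en_to_lao[gloss.lower()].add(lao_word)

    thai_to_lao = defaultdict(set)
    for thai_word, glosses in thai_en.items():
        for gloss in glosses:
            thai_to_lao[thai_word] |= en_to_lao.get(gloss.lower(), set())

    # Drop Thai words for which no Lao word was found.
    return {th: sorted(lo) for th, lo in thai_to_lao.items() if lo}


# Hypothetical placeholder entries (romanised tokens, not real dictionary data).
lao_en = {"lo_company": ["company", "firm"], "lo_boss": ["boss"]}
thai_en = {"th_company": ["Company"], "th_boss": ["boss", "chief"], "th_river": ["river"]}
lao_thai = build_lao_thai_dictionary(lao_en, thai_en)
```

A Thai word can map to several Lao words when its English glosses are ambiguous, which is one source of the multiple candidates described in Step2.2.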
Further, the specific steps of Step3 are as follows:
Step3.1, manually constructing a sentence-aligned Lao-Thai parallel corpus;
The invention trains the model on a Lao-Thai parallel corpus, so a high-quality parallel corpus is required for the trained model to be effective. The Lao-Thai parallel corpus is therefore constructed manually, which ensures that the training data are accurate parallel sentences, so that the Lao-Thai parallel sentence pair classification model can be obtained.
Step3.2, because Lao and Thai are highly similar in vocabulary and pronunciation, Lao-Thai parallel sentence pairs are represented in a shared semantic space using a bidirectional LSTM; specifically, the bidirectional LSTM is used to obtain forward and backward state vectors, which are concatenated to obtain the sentence vector representation in the shared semantic space, namely:
$$\overrightarrow{h}^{th}_{i,N} = \mathrm{LSTM}\left(\overrightarrow{h}^{th}_{i,N-1},\ x^{th}_{i,N}\right)$$
$$\overleftarrow{h}^{th}_{i,N} = \mathrm{LSTM}\left(\overleftarrow{h}^{th}_{i,N+1},\ x^{th}_{i,N}\right)$$
$$h^{th}_{i} = \left[\overrightarrow{h}^{th}_{i};\ \overleftarrow{h}^{th}_{i}\right]$$
$$\overrightarrow{h}^{lo}_{i,N} = \mathrm{LSTM}\left(\overrightarrow{h}^{lo}_{i,N-1},\ x^{lo}_{i,N}\right)$$
$$\overleftarrow{h}^{lo}_{i,N} = \mathrm{LSTM}\left(\overleftarrow{h}^{lo}_{i,N+1},\ x^{lo}_{i,N}\right)$$
$$h^{lo}_{i} = \left[\overrightarrow{h}^{lo}_{i};\ \overleftarrow{h}^{lo}_{i}\right]$$
where $\overrightarrow{h}^{th}_{i,N}$ is the forward hidden vector of the i-th Thai sentence at state N;
$\overrightarrow{h}^{th}_{i,N-1}$ is the forward hidden vector of the i-th Thai sentence at state N-1;
$x^{th}_{i,N}$ is the word vector of the i-th Thai sentence at state N, and LSTM denotes the LSTM function;
$\overleftarrow{h}^{th}_{i,N}$ is the backward hidden vector of the i-th Thai sentence at state N;
$\overleftarrow{h}^{th}_{i,N+1}$ is the backward hidden vector of the i-th Thai sentence at state N+1;
$h^{th}_{i}$ is the sentence vector of the i-th Thai sentence, obtained by concatenating the final vectors $\overrightarrow{h}^{th}_{i}$ and $\overleftarrow{h}^{th}_{i}$ obtained from the two directions;
$\overrightarrow{h}^{lo}_{i,N}$ is the forward hidden vector of the i-th Lao sentence at state N;
$\overrightarrow{h}^{lo}_{i,N-1}$ is the forward hidden vector of the i-th Lao sentence at state N-1;
$x^{lo}_{i,N}$ is the word vector of the i-th Lao sentence at state N;
$\overleftarrow{h}^{lo}_{i,N}$ is the backward hidden vector of the i-th Lao sentence at state N;
$\overleftarrow{h}^{lo}_{i,N+1}$ is the backward hidden vector of the i-th Lao sentence at state N+1;
$h^{lo}_{i}$ is the sentence vector of the i-th Lao sentence, obtained by concatenating the final vectors $\overrightarrow{h}^{lo}_{i}$ and $\overleftarrow{h}^{lo}_{i}$ obtained from the two directions;
then, the matching information between the two sentence vectors is captured using the vector dot product and the vector difference to obtain the matching vector:
$$h^{(1)}_{i} = h^{lo}_{i} \odot h^{th}_{i}$$
$$h^{(2)}_{i} = \left|h^{lo}_{i} - h^{th}_{i}\right|$$
$$h_{i} = \tanh\left(W_1 h^{(1)}_{i} + W_2 h^{(2)}_{i} + b\right)$$
where $h^{(1)}_{i}$ and $h^{(2)}_{i}$ respectively denote the matching vectors containing sentence matching information, obtained from the dot product and the difference of the Lao and Thai sentence vectors; $h_i$ is the final vector representation containing the matching information; and $W_1$, $W_2$ and $b$ are parameters of the bidirectional LSTM model;
Step3.3, finally, the probability that the Lao sentence and the Thai sentence are parallel is calculated through a sigmoid function using a fully connected layer of a convolutional neural network, in order to judge whether the two sentences are translations of each other:
$$p(y_i = 1 \mid h_i) = \sigma(W_3 h_i + c)$$
where $p(y_i = 1 \mid h_i)$ is the probability, given the vector $h_i$, that the two sentences are translations of each other; $y_i = 1$ means that the two sentences are translations of each other; $W_3$ and $c$ are parameters of the convolutional neural network model; and $\sigma$ is the sigmoid activation function;
Step3.4, using the following cross-entropy loss as the loss function, the parameters of the bidirectional LSTM model and the convolutional neural network model are updated over multiple iterations; training these two models constitutes training the Lao-Thai parallel sentence pair classification model; the candidate Lao-Thai parallel sentence pairs are then classified by the trained model, and the Lao-Thai sentences that are translations of each other are extracted to obtain Lao-Thai bilingual parallel sentence pairs;
wherein the loss function is as follows:
$$L = -\sum_{i=1}^{n+m}\left[y_i \log \sigma(W_3 h_i + c) + (1 - y_i)\log\left(1 - \sigma(W_3 h_i + c)\right)\right]$$
where $y_i = 1$ or $y_i = 0$; $y_i = 1$ indicates that the Lao sentence and the Thai sentence are parallel, and $y_i = 0$ indicates that they are not parallel; n is the number of positive samples (parallel sentence pairs) used to train the model, and m is the number of negative samples (non-parallel sentence pairs) used to train the model.
A device for constructing a Lao-Chinese bilingual corpus with Thai as a pivot comprises a data preprocessing module, a dictionary translation module, a Lao-Thai parallel sentence pair extraction module and a Lao-Chinese parallel corpus construction module;
Data preprocessing module: used to extract Thai sentences from existing Chinese-Thai parallel corpus data and to perform Thai word segmentation;
Dictionary translation module: used to construct a Lao-Thai bilingual dictionary and to translate the Thai sentences into Lao sentences word by word using the Lao-Thai bilingual dictionary, obtaining candidate Lao-Thai parallel sentence pairs;
Lao-Thai parallel sentence pair extraction module: used to construct a Lao-Thai parallel sentence pair classification model based on a bidirectional LSTM, to classify the candidate Lao-Thai parallel sentence pairs, and to extract the Lao-Thai sentences that are translations of each other, obtaining Lao-Thai bilingual parallel sentence pairs;
Lao-Chinese parallel corpus construction module: used to match the obtained Lao-Thai bilingual parallel sentence pairs with the existing Chinese-Thai parallel corpus, taking Thai as the pivot language to match Lao with Chinese, and to construct the Lao-Chinese bilingual parallel corpus.
The invention has the beneficial effects that:
Lao is a low-resource language among the Southeast Asian languages, and it is very difficult to obtain Lao-Chinese bilingual parallel resources directly from the Internet.
Drawings
FIG. 1 is a flow chart of the present invention;
FIG. 2 illustrates the syntactic similarity of Lao and Thai in the present invention;
FIG. 3 illustrates polysemy in word-by-word translation in the present invention;
FIG. 4 is a flow chart of parallel sentence classification in the present invention;
FIG. 5 is a structural diagram of the device of the present invention;
FIG. 6 is a block diagram of the general process flow of the present invention.
Detailed Description
Example 1: as shown in FIGS. 1-6, a method for constructing a Lao-Chinese bilingual corpus with Thai as a pivot comprises the following steps:
Step1, extracting Thai sentences from existing Chinese-Thai parallel corpus data and performing Thai word segmentation;
As a preferred scheme of the invention, the specific steps of Step1 are as follows:
Step1.1, selecting Thai sentences of 20-50 characters from an existing Chinese-Thai bilingual parallel corpus;
Step1.2, performing word segmentation on the selected Thai sentences, for which the Southeast Asian minor-language information processing platform developed by Kunming University of Science and Technology, available at http://222.197.219.24:8099/, can be used.
Thai is written as continuous text without spaces between words, so unsegmented Thai sentences can neither be translated word by word nor be fed into the model. Word segmentation is therefore performed with the Thai word segmentation tool to obtain segmented Thai sentences.
This preferred scheme is an important component of the invention: it provides the corpus and the data preprocessing for the invention, and supplies the corpus basis for the subsequent dictionary translation and model use.
Step2, constructing a Lao-Thai bilingual dictionary, and translating the Thai sentences into Lao sentences word by word using the Lao-Thai bilingual dictionary to obtain candidate Lao-Thai parallel sentence pairs;
As a preferred scheme of the invention, the specific steps of Step2 are as follows:
Step2.1, construction of the Lao-Thai bilingual dictionary: using English as an intermediate language, Lao and Thai words are aligned through their English translations on the basis of a Lao-English dictionary and a Thai-English dictionary, thereby constructing a Lao-Thai bilingual dictionary;
Step2.2, manual analysis of the syntactic structures of Thai and Lao shows that Lao and Thai sentences are composed in essentially the same way, that is, their word order is consistent, as shown in FIG. 2, so candidate Lao-Thai parallel sentences can be generated by translating word by word with the dictionary;
Specifically, the Thai sentences in the acquired Chinese-Thai bilingual parallel sentence pairs are translated word by word using the Lao-Thai bilingual dictionary. Because a word may have several meanings, dictionary translation can produce several Lao sentences with different semantics, so that candidate Lao-Thai parallel sentence pairs are obtained, as shown in FIG. 3, wherein the candidate Lao-Thai parallel sentence pairs are groups in which one Thai sentence corresponds to several Lao sentences, not all of which are translations of the Thai sentence.
This preferred scheme is an important process for obtaining candidate Lao-Thai parallel sentences: the similarity of Lao and Thai in word formation and other respects is analyzed and exploited, and a dictionary is constructed for word-by-word translation to obtain a candidate parallel corpus, in preparation for the subsequent model-based extraction of the Lao-Thai parallel corpus.
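The word-by-word candidate generation of Step2.2 can be sketched as a Cartesian product over each token's dictionary entries. The function name `generate_candidates`, the cap `max_candidates`, and the choice to copy out-of-dictionary tokens unchanged are assumptions for illustration; the source does not say how unknown words or very large candidate sets are handled.

```python
from itertools import islice, product


def generate_candidates(thai_tokens, thai_to_lao, max_candidates=100):
    """Translate a segmented Thai sentence word by word into Lao candidates.

    Each Thai token is replaced by each of its dictionary translations, so a
    polysemous token multiplies the number of candidate Lao sentences.
    Tokens missing from the dictionary are copied unchanged (an assumption).
    """
    options = [thai_to_lao.get(tok, [tok]) for tok in thai_tokens]
    # Word order is preserved; cap the output because the product grows fast.
    return [list(c) for c in islice(product(*options), max_candidates)]


# Hypothetical dictionary with one ambiguous entry.
thai_to_lao = {"th_a": ["lo_a"], "th_b": ["lo_b1", "lo_b2"]}
candidates = generate_candidates(["th_a", "th_b", "th_c"], thai_to_lao)
```

Because word order is kept fixed, the sketch relies on the consistent word order of the two languages described above; each candidate is later paired with its source Thai sentence and passed to the classifier.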
Step3, constructing a Lao-Thai parallel sentence pair classification model based on a bidirectional LSTM, classifying the candidate Lao-Thai parallel sentence pairs, and extracting the Lao-Thai sentences that are translations of each other to obtain Lao-Thai bilingual parallel sentence pairs;
As a preferred scheme of the invention, the specific steps of Step3 are as follows:
Step3.1, manually constructing 9483 sentence-aligned Lao-Thai parallel sentence pairs;
The invention trains the model on a Lao-Thai parallel corpus, so a high-quality parallel corpus is required for the trained model to be effective. The Lao-Thai parallel corpus is therefore constructed manually, which ensures that the training data are accurate parallel sentences, so that the Lao-Thai parallel sentence pair classification model can be obtained.
The similarity of Thai and Lao in word formation and pronunciation is analyzed. Lao and Thai share many words that are not only synonymous but also written very similarly. For example, the Thai word and the Lao word meaning "company", the Thai word and the Lao word meaning "ahead of time", and the Thai word and the Lao word meaning "boss" are written almost identically in the two scripts (the word images are not reproduced in this text). In pronunciation, the Thai word for "Mekong River" is pronounced "Menamkong", and the Lao word is also pronounced "menamkong". As these examples show, the same words are written in essentially the same way and pronounced in essentially the same way in Thai and Lao, and this linguistic characteristic can be used to represent sentences.
Step3.2, because Lao and Thai are highly similar in vocabulary and pronunciation, sentences of the two languages can be represented in a shared semantic space. As shown in FIG. 4, Lao-Thai parallel sentence pairs are represented in the shared semantic space using a bidirectional LSTM. Compared with a unidirectional LSTM, the bidirectional LSTM additionally encodes the sentence from back to front when modeling it, and can therefore better capture the relation between forward and backward semantics. The specific process is as follows:
First, each word vector is obtained from the embedding matrix and the one-hot vector of the word in the sentence, i.e.:
$$x^{i}_{k} = E^{\top} w_k$$
where E is the embedding matrix, $w_k$ is the one-hot representation of the k-th word in the vocabulary, and i is the index of the sentence.
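The embedding lookup described above amounts to selecting one row of the embedding matrix. The sketch below uses plain Python lists; the row-per-word layout of `E`, the helper names, and the toy values are assumptions for illustration only.

```python
def one_hot(index, size):
    """Return a one-hot list of length `size` with a 1.0 at `index`."""
    return [1.0 if j == index else 0.0 for j in range(size)]


def embed(one_hot_vec, embedding):
    """Compute E^T w for a one-hot w; `embedding` has one row per vocabulary word."""
    dim = len(embedding[0])
    return [sum(w * row[d] for w, row in zip(one_hot_vec, embedding)) for d in range(dim)]


# Hypothetical 3-word vocabulary with 2-dimensional embeddings.
E = [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]]
vector = embed(one_hot(1, len(E)), E)
```

In practice the product with a one-hot vector is replaced by direct indexing (`E[k]`), which gives the same vector without the multiplication.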
After the vector representations are obtained, the sentence is fed into the bidirectional LSTM, and the last state vectors in the forward and backward directions are taken as the final representation vectors:
$$\overrightarrow{h}^{th}_{i,N} = \mathrm{LSTM}\left(\overrightarrow{h}^{th}_{i,N-1},\ x^{th}_{i,N}\right)$$
$$\overleftarrow{h}^{th}_{i,N} = \mathrm{LSTM}\left(\overleftarrow{h}^{th}_{i,N+1},\ x^{th}_{i,N}\right)$$
After the final state vectors in the two directions are obtained, the two vectors are concatenated to obtain the final representation:
$$h^{th}_{i} = \left[\overrightarrow{h}^{th}_{i};\ \overleftarrow{h}^{th}_{i}\right]$$
Lao sentences are processed in the same way to obtain the final Lao sentence representation $h^{lo}_{i}$:
$$\overrightarrow{h}^{lo}_{i,N} = \mathrm{LSTM}\left(\overrightarrow{h}^{lo}_{i,N-1},\ x^{lo}_{i,N}\right)$$
$$\overleftarrow{h}^{lo}_{i,N} = \mathrm{LSTM}\left(\overleftarrow{h}^{lo}_{i,N+1},\ x^{lo}_{i,N}\right)$$
$$h^{lo}_{i} = \left[\overrightarrow{h}^{lo}_{i};\ \overleftarrow{h}^{lo}_{i}\right]$$
where $\overrightarrow{h}^{th}_{i,N}$ is the forward hidden vector of the i-th Thai sentence at state N;
$\overrightarrow{h}^{th}_{i,N-1}$ is the forward hidden vector of the i-th Thai sentence at state N-1;
$x^{th}_{i,N}$ is the word vector of the i-th Thai sentence at state N, and LSTM denotes the LSTM function;
$\overleftarrow{h}^{th}_{i,N}$ is the backward hidden vector of the i-th Thai sentence at state N;
$\overleftarrow{h}^{th}_{i,N+1}$ is the backward hidden vector of the i-th Thai sentence at state N+1;
$h^{th}_{i}$ is the sentence vector of the i-th Thai sentence, obtained by concatenating the final vectors $\overrightarrow{h}^{th}_{i}$ and $\overleftarrow{h}^{th}_{i}$ obtained from the two directions;
$\overrightarrow{h}^{lo}_{i,N}$ is the forward hidden vector of the i-th Lao sentence at state N;
$\overrightarrow{h}^{lo}_{i,N-1}$ is the forward hidden vector of the i-th Lao sentence at state N-1;
$x^{lo}_{i,N}$ is the word vector of the i-th Lao sentence at state N;
$\overleftarrow{h}^{lo}_{i,N}$ is the backward hidden vector of the i-th Lao sentence at state N;
$\overleftarrow{h}^{lo}_{i,N+1}$ is the backward hidden vector of the i-th Lao sentence at state N+1;
$h^{lo}_{i}$ is the sentence vector of the i-th Lao sentence, obtained by concatenating the final vectors $\overrightarrow{h}^{lo}_{i}$ and $\overleftarrow{h}^{lo}_{i}$ obtained from the two directions;
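The text treats the LSTM as a single function of the previous hidden state and the current word vector. For reference, the standard LSTM cell is given below in generic notation; this is an assumption about what that function denotes, since the source does not spell out the gates. The symbols $x_t$, $h_t$, $c_t$ and the weight names $W_\ast$, $U_\ast$, $b_\ast$ are generic and do not appear in the original.

```latex
\begin{aligned}
i_t &= \sigma(W_i x_t + U_i h_{t-1} + b_i) && \text{(input gate)}\\
f_t &= \sigma(W_f x_t + U_f h_{t-1} + b_f) && \text{(forget gate)}\\
o_t &= \sigma(W_o x_t + U_o h_{t-1} + b_o) && \text{(output gate)}\\
g_t &= \tanh(W_g x_t + U_g h_{t-1} + b_g) && \text{(candidate cell state)}\\
c_t &= f_t \odot c_{t-1} + i_t \odot g_t\\
h_t &= o_t \odot \tanh(c_t)
\end{aligned}
```

The forward direction applies this recurrence from the first word to the last, and the backward direction applies a separately parameterized copy from the last word to the first.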
In order to measure the degree to which the two sentences are translations of each other, the vector dot product and the vector difference of the two sentence vectors are computed to capture the matching information between them, giving the matching vector:
$$h^{(1)}_{i} = h^{lo}_{i} \odot h^{th}_{i}$$
$$h^{(2)}_{i} = \left|h^{lo}_{i} - h^{th}_{i}\right|$$
$$h_{i} = \tanh\left(W_1 h^{(1)}_{i} + W_2 h^{(2)}_{i} + b\right)$$
where $h^{(1)}_{i}$ and $h^{(2)}_{i}$ respectively denote the matching vectors containing sentence matching information, obtained from the dot product and the difference of the Lao and Thai sentence vectors; $h_i$ is the final vector representation containing the matching information; and $W_1$, $W_2$ and $b$ are parameters of the bidirectional LSTM model;
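The matching step can be sketched in plain Python as follows. Several details are assumptions rather than statements of the source: the "dot product" is read as an element-wise product (so that the result is a vector), the difference is taken in absolute value, and tanh is used as the nonlinearity that combines the two matching vectors. The function name and the list-of-rows weight layout are likewise hypothetical.

```python
import math


def matching_vector(h_lao, h_thai, W1, W2, b):
    """Combine two sentence vectors into one matching vector h_i."""
    h1 = [a * c for a, c in zip(h_lao, h_thai)]        # element-wise product (assumed)
    h2 = [abs(a - c) for a, c in zip(h_lao, h_thai)]   # absolute difference (assumed)
    out = []
    for row1, row2, bias in zip(W1, W2, b):
        z = sum(w * x for w, x in zip(row1, h1))
        z += sum(w * x for w, x in zip(row2, h2)) + bias
        out.append(math.tanh(z))                       # tanh is an assumed nonlinearity
    return out


# Hypothetical 2-dimensional example: identical sentence vectors give a zero difference.
zero = [[0.0, 0.0], [0.0, 0.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
h_same = matching_vector([1.0, 2.0], [1.0, 2.0], zero, identity, [0.0, 0.0])
```

The product term responds to dimensions on which the two sentences agree, while the difference term responds to dimensions on which they diverge, so the classifier sees both kinds of evidence.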
Step3.3, a fully connected (FC) layer acts as the classifier in the whole convolutional neural network. After the vector representation of the matching degree between the Lao and Thai sentences is obtained, the fully connected layer of the convolutional neural network is used, and the probability that the Lao and Thai sentences are parallel is calculated through a sigmoid function to judge whether the two sentences are parallel (translations of each other):
$$p(y_i = 1 \mid h_i) = \sigma(W_3 h_i + c)$$
where $p(y_i = 1 \mid h_i)$ is the probability, given the vector $h_i$, that the two sentences are translations of each other; $y_i = 1$ means that the two sentences are parallel (translations of each other); $W_3$ and $c$ are parameters of the convolutional neural network model; and $\sigma$ is the sigmoid activation function;
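The probability formula above can be evaluated directly. In this sketch $W_3$ is treated as a single weight vector so that the output is a scalar, and a 0.5 decision threshold is used; both are assumptions, as the source gives neither the shape of $W_3$ nor the threshold. The function names are hypothetical.

```python
import math


def parallel_probability(h, W3, c):
    """p(y=1 | h) = sigmoid(W3 . h + c) for a single matching vector h."""
    z = sum(w * x for w, x in zip(W3, h)) + c
    return 1.0 / (1.0 + math.exp(-z))


def is_parallel(h, W3, c, threshold=0.5):
    """Decide whether a pair is parallel; the 0.5 threshold is an assumption."""
    return parallel_probability(h, W3, c) >= threshold


# Hypothetical values: a zero matching vector with zero bias gives probability 0.5.
p_neutral = parallel_probability([0.0, 0.0], [1.0, 1.0], 0.0)
```

Raising the threshold trades recall for precision, which matters here because accepted pairs flow directly into the final corpus.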
Step3.4, using the following cross-entropy loss as the loss function, the parameters of the bidirectional LSTM model and the convolutional neural network model are updated over 15 iterations; training these two models constitutes training the Lao-Thai parallel sentence pair classification model; the candidate Lao-Thai parallel sentence pairs are then classified by the trained model, and the Lao-Thai sentences that are translations of each other are extracted to obtain Lao-Thai bilingual parallel sentence pairs;
wherein the loss function is as follows:
$$L = -\sum_{i=1}^{n+m}\left[y_i \log \sigma(W_3 h_i + c) + (1 - y_i)\log\left(1 - \sigma(W_3 h_i + c)\right)\right]$$
where $y_i = 1$ or $y_i = 0$; $y_i = 1$ indicates that the Lao sentence and the Thai sentence are parallel, and $y_i = 0$ indicates that they are not parallel; n is the number of positive samples (parallel sentence pairs) used to train the model, and m is the number of negative samples (non-parallel sentence pairs) used to train the model.
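A binary cross-entropy loss over positive and negative pairs can be computed as in the sketch below. Whether the original sums or averages the per-pair terms is not recoverable from the text, so summation is an assumption, as are the function name and the clamping constant `eps` added for numerical safety.

```python
import math


def cross_entropy_loss(probs, labels, eps=1e-12):
    """Sum of binary cross-entropy terms over all training pairs."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        total -= y * math.log(p) + (1 - y) * math.log(1.0 - p)
    return total


# Hypothetical predictions for one positive and one negative pair.
loss_uncertain = cross_entropy_loss([0.5, 0.5], [1, 0])
loss_confident = cross_entropy_loss([0.9, 0.1], [1, 0])
```

The loss falls as the predicted probability moves toward 1 for parallel pairs and toward 0 for non-parallel pairs, which is what the parameter updates aim for.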
The 9483 manually constructed Lao-Thai bilingual parallel sentence pairs are used to train the model. After word segmentation, they are divided into a training set of 8883 pairs and a test set of 600 pairs, the test set being used to evaluate the trained model.
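The 8883/600 split can be reproduced with a seeded shuffle, as sketched below. The source does not say how the test pairs were chosen, so random selection, the seed value, and the function name `split_corpus` are assumptions.

```python
import random


def split_corpus(pairs, test_size=600, seed=0):
    """Shuffle sentence pairs and hold out `test_size` of them for testing."""
    shuffled = list(pairs)
    random.Random(seed).shuffle(shuffled)
    return shuffled[test_size:], shuffled[:test_size]


# Placeholder integers stand in for the 9483 sentence pairs.
train, test = split_corpus(range(9483))
```

Fixing the seed keeps the held-out set stable across runs, so F1 values from different models remain comparable.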
Step4, matching the obtained Lao-Thai bilingual parallel sentence pairs with the existing Chinese-Thai parallel corpus, taking Thai as the pivot language to match Lao with Chinese, to construct a Lao-Chinese bilingual parallel corpus.
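The pivot matching of Step4 is essentially a join on the Thai sentence. The sketch below assumes exact string equality on the Thai side and (sentence, Thai) tuples as input; both the matching criterion and the function name `pivot_join` are hypothetical.

```python
def pivot_join(chinese_thai_pairs, lao_thai_pairs):
    """Pair Lao with Chinese sentences that share the same Thai pivot sentence."""
    thai_to_chinese = {}
    for zh, th in chinese_thai_pairs:
        thai_to_chinese.setdefault(th, []).append(zh)

    lao_chinese = []
    for lo, th in lao_thai_pairs:
        for zh in thai_to_chinese.get(th, []):
            lao_chinese.append((lo, zh))
    return lao_chinese


# Hypothetical placeholder sentences.
chinese_thai = [("zh1", "th1"), ("zh2", "th2")]
lao_thai = [("lo1", "th1"), ("lo3", "th3")]
lao_chinese = pivot_join(chinese_thai, lao_thai)
```

Because the Thai sentences were drawn from the Chinese-Thai corpus in Step1, every accepted Lao-Thai pair should find its Chinese counterpart; unmatched pairs are simply skipped.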
For the classification performed by the Lao-Thai parallel sentence pair classification model, the invention uses the F1 value to evaluate the quality of the model, with the following formulas:
$$P = \frac{TP}{TP + FP}$$
$$R = \frac{TP}{TP + FN}$$
$$F1 = \frac{2 \times P \times R}{P + R}$$
where TP is the number of positive samples predicted as positive, FN is the number of positive samples predicted as negative, and FP is the number of negative samples predicted as positive; P is the precision and R is the recall. The F1 value is the harmonic mean of precision and recall.
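The evaluation metric can be computed from the three counts defined above. The function name and the convention of returning 0.0 when a denominator is zero are assumptions added to make the sketch safe to run; the counts in the example are hypothetical, not results from the source.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1) from true/false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    return precision, recall, 2 * precision * recall / (precision + recall)


# Hypothetical counts for illustration only.
p, r, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
```

True negatives do not enter the F1 value, so the score is not inflated by a large number of easily rejected candidate pairs.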
In order to compare the Lao-Thai parallel sentence pair classification model with conventional machine learning methods on parallel sentence classification, the model of the invention is compared with several common machine learning models, as shown in Table 1.
Table 1: Comparison of parallel sentence classification model results
No.  Model                                                  F1 value (%)
1    SVM                                                    68.78
2    LR                                                     65.04
3    Random forest                                          51.49
4    GBDT                                                   60.03
5    Lao-Thai parallel sentence pair classification model   71.30
The results in Table 1 show that the Lao-Thai parallel sentence pair classification model achieves a higher F1 value than the conventional machine learning methods when classifying parallel sentences. The Lao-Thai bilingual parallel sentence pairs obtained are therefore more accurate, and so is the Lao-Chinese bilingual parallel corpus constructed by matching them with the existing Chinese-Thai parallel corpus with Thai as the pivot language.
Referring to FIG. 5, the invention provides a device for constructing a Lao-Chinese bilingual corpus with Thai as a pivot, which comprises a data preprocessing module, a dictionary translation module, a Lao-Thai parallel sentence pair extraction module and a Lao-Chinese parallel corpus construction module;
Data preprocessing module: used to extract Thai sentences from existing Chinese-Thai parallel corpus data and to perform Thai word segmentation;
Dictionary translation module: used to construct a Lao-Thai bilingual dictionary and to translate the Thai sentences into Lao sentences word by word using the Lao-Thai bilingual dictionary, obtaining candidate Lao-Thai parallel sentence pairs;
Lao-Thai parallel sentence pair extraction module: used to construct a Lao-Thai parallel sentence pair classification model based on a bidirectional LSTM, to classify the candidate Lao-Thai parallel sentence pairs, and to extract the Lao-Thai sentences that are translations of each other, obtaining Lao-Thai bilingual parallel sentence pairs;
Lao-Chinese parallel corpus construction module: used to match the obtained Lao-Thai bilingual parallel sentence pairs with the existing Chinese-Thai parallel corpus, taking Thai as the pivot language to match Lao with Chinese, and to construct the Lao-Chinese bilingual parallel corpus.
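How the four modules fit together can be sketched as a single function that takes the module behaviours as callables. The function name and the three callables (`segment`, `translate`, `classify`) are hypothetical stand-ins for the segmentation tool, the dictionary translation and the trained classifier; the length filter of Step1.1 is omitted for brevity.

```python
def build_lao_chinese_corpus(chinese_thai_pairs, segment, translate, classify):
    """Chain the four modules: preprocess, translate, classify, pivot match.

    segment(thai) -> list of tokens; translate(tokens) -> list of Lao candidates;
    classify(lao, thai) -> bool. All three callables are hypothetical stand-ins.
    """
    corpus = []
    for zh, th in chinese_thai_pairs:
        tokens = segment(th)                 # data preprocessing module
        for lao in translate(tokens):        # dictionary translation module
            if classify(lao, th):            # parallel sentence pair extraction module
                corpus.append((lao, zh))     # Lao-Chinese corpus construction module
    return corpus


# Toy stand-ins: upper-casing plays the role of "translation".
demo = build_lao_chinese_corpus(
    [("zh1", "a b")],
    segment=str.split,
    translate=lambda toks: [" ".join(t.upper() for t in toks)],
    classify=lambda lao, th: lao.lower() == th,
)
```

Passing the modules in as callables keeps the data flow visible and lets each module be replaced or tested independently.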
Although the present invention has been described in detail with reference to the embodiments shown in the drawings, it is not limited to those embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the spirit of the invention.

Claims (5)

1. A method for constructing a Lao-Chinese bilingual corpus with Thai as a pivot, characterized by comprising the following steps:
Step1, extracting Thai sentences from existing Chinese-Thai parallel corpus data and performing Thai word segmentation;
Step2, constructing a Lao-Thai bilingual dictionary, and translating the Thai sentences into Lao sentences word by word using the Lao-Thai bilingual dictionary to obtain candidate Lao-Thai parallel sentence pairs;
Step3, constructing a Lao-Thai parallel sentence pair classification model based on a bidirectional LSTM, classifying the candidate Lao-Thai parallel sentence pairs, and extracting the Lao-Thai sentences that are translations of each other to obtain Lao-Thai bilingual parallel sentence pairs;
Step4, matching the obtained Lao-Thai bilingual parallel sentence pairs with the existing Chinese-Thai parallel corpus, taking Thai as the pivot language to match Lao with Chinese, to construct a Lao-Chinese bilingual parallel corpus.
2. The method for constructing a Lao-Chinese bilingual corpus with Thai as a pivot according to claim 1, characterized in that the specific steps of Step1 are as follows:
Step1.1, selecting Thai sentences of 20-50 characters from an existing Chinese-Thai bilingual parallel corpus;
Step1.2, performing word segmentation on the selected Thai sentences.
3. The method for constructing a Laos-Chinese bilingual corpus with Thai as the pivot according to claim 1, wherein the specific steps of Step2 are as follows:
Step2.1, constructing the Laos-Thai bilingual dictionary: using English as an intermediate language, aligning Laos words and Thai words through their English equivalents on the basis of a Laos-English dictionary and a Thai-English dictionary, thereby constructing the Laos-Thai bilingual dictionary;
Step2.2, because Laos and Thai are extremely similar, translating the Thai sentences in the acquired Chinese-Thai bilingual parallel sentence pairs word by word using the Laos-Thai bilingual dictionary; because a word may have several meanings, dictionary translation can generate a plurality of Laos sentences with different semantics, so that candidate Laos-Thai parallel sentence pairs are obtained, wherein the candidate Laos-Thai parallel sentence pairs are groups in which one Thai sentence corresponds to a plurality of Laos sentences, and not all of these Laos sentences are translations of the Thai sentence.
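The two sub-steps above can be sketched as follows, under stated assumptions: dictionaries are plain Python mappings from a word to a list of English glosses, all names (`build_th_lo_dict`, `candidate_laos_sentences`, the placeholder words) are hypothetical, and Thai words missing from the dictionary are copied through unchanged, which the claim does not specify.

```python
from collections import defaultdict
from itertools import product

def build_th_lo_dict(th_en, lo_en):
    """Align Thai and Laos words that share at least one English gloss."""
    lo_by_en = defaultdict(set)
    for lo_word, glosses in lo_en.items():
        for en in glosses:
            lo_by_en[en].add(lo_word)
    th_lo = {}
    for th_word, glosses in th_en.items():
        matches = set()
        for en in glosses:
            matches |= lo_by_en.get(en, set())
        if matches:
            th_lo[th_word] = sorted(matches)
    return th_lo

def candidate_laos_sentences(thai_tokens, th_lo):
    """Translate word by word; each ambiguous word multiplies the candidates."""
    # Unknown words are kept as-is (an assumption made for this sketch).
    options = [th_lo.get(tok, [tok]) for tok in thai_tokens]
    return [list(choice) for choice in product(*options)]

# Placeholder vocabulary: one Thai word with two English senses.
th_en = {"th_bank": ["bank", "shore"], "th_go": ["go"]}
lo_en = {"lo_bank_money": ["bank"], "lo_bank_river": ["shore"], "lo_go": ["go"]}
th_lo = build_th_lo_dict(th_en, lo_en)
candidates = candidate_laos_sentences(["th_go", "th_bank"], th_lo)
```

Because the candidates grow as the product of the per-word options, one Thai sentence can yield many Laos sentences, which is why a classifier is needed afterwards to keep only the true translations.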
4. The method for constructing a Laos-Chinese bilingual corpus with Thai as the pivot according to claim 1, wherein the specific steps of Step3 are as follows:
Step3.1, manually constructing a sentence-aligned Laos-Thai parallel corpus;
Step3.2, because Laos and Thai are highly similar in word usage and pronunciation, representing the constructed Laos-Thai parallel sentence pairs in a shared semantic space by using a bidirectional LSTM; specifically, the bidirectional LSTM is used to obtain forward and backward state vectors, which are spliced to obtain the sentence vector representation in the shared semantic space, namely:
$\overrightarrow{h}^{th}_{i,N} = \mathrm{LSTM}\left(\overrightarrow{h}^{th}_{i,N-1},\ w^{th}_{i,N}\right)$
$\overleftarrow{h}^{th}_{i,N} = \mathrm{LSTM}\left(\overleftarrow{h}^{th}_{i,N+1},\ w^{th}_{i,N}\right)$
$h^{th}_{i} = \left[\overrightarrow{h}^{th}_{i,N};\ \overleftarrow{h}^{th}_{i,N}\right]$
$\overrightarrow{h}^{lo}_{i,N} = \mathrm{LSTM}\left(\overrightarrow{h}^{lo}_{i,N-1},\ w^{lo}_{i,N}\right)$
$\overleftarrow{h}^{lo}_{i,N} = \mathrm{LSTM}\left(\overleftarrow{h}^{lo}_{i,N+1},\ w^{lo}_{i,N}\right)$
$h^{lo}_{i} = \left[\overrightarrow{h}^{lo}_{i,N};\ \overleftarrow{h}^{lo}_{i,N}\right]$
wherein $\overrightarrow{h}^{th}_{i,N}$ is the forward hidden vector of the i-th Thai sentence in state N;
$\overrightarrow{h}^{th}_{i,N-1}$ is the forward hidden vector of the i-th Thai sentence in state N-1;
$w^{th}_{i,N}$ is the word vector of the i-th Thai sentence in state N, and LSTM denotes the LSTM function;
$\overleftarrow{h}^{th}_{i,N}$ is the backward hidden vector of the i-th Thai sentence in state N;
$\overleftarrow{h}^{th}_{i,N+1}$ is the backward hidden vector of the i-th Thai sentence in state N+1;
$h^{th}_{i}$ is the sentence vector of the i-th Thai sentence, obtained by splicing the final vectors from the two directions;
$\overrightarrow{h}^{lo}_{i,N}$ is the forward hidden vector of the i-th Laos sentence in state N;
$\overrightarrow{h}^{lo}_{i,N-1}$ is the forward hidden vector of the i-th Laos sentence in state N-1;
$w^{lo}_{i,N}$ is the word vector of the i-th Laos sentence in state N;
$\overleftarrow{h}^{lo}_{i,N}$ is the backward hidden vector of the i-th Laos sentence in state N;
$\overleftarrow{h}^{lo}_{i,N+1}$ is the backward hidden vector of the i-th Laos sentence in state N+1;
$h^{lo}_{i}$ is the sentence vector of the i-th Laos sentence, obtained by splicing the final vectors from the two directions;
then, capturing the matching information between the two sentence vectors by using the vector product and the vector difference to obtain a matching vector:
$h_i^{(1)} = h_i^{lo} \odot h_i^{th}$
$h_i^{(2)} = \left| h_i^{lo} - h_i^{th} \right|$
$h_i = \tanh\left(W_1 h_i^{(1)} + W_2 h_i^{(2)} + b\right)$
wherein $h_i^{lo}$ and $h_i^{th}$ are the sentence vectors of the i-th Laos sentence and the i-th Thai sentence; $h_i^{(1)}$ and $h_i^{(2)}$ respectively represent the matching vectors, containing sentence matching information, obtained by calculating the product and the difference of the Laos and Thai sentence vectors; $h_i$ is the final vector representation containing the matching information; and $W_1$, $W_2$ and $b$ are parameters of the bidirectional LSTM model;
Step3.3, finally, calculating, through a sigmoid function in a fully connected layer of a convolutional neural network, the probability that the Laos sentence and the Thai sentence are parallel sentences, so as to judge whether the two sentences are translations of each other:
$p(y_i = 1 \mid h_i) = \sigma(W_3 h_i + c)$
wherein $p(y_i = 1 \mid h_i)$ represents the probability, given the obtained vector $h_i$, that the two sentences are translations of each other; $y_i = 1$ means that the two sentences are translations of each other; $W_3$ and $c$ are parameters of the convolutional neural network model; and $\sigma$ is the sigmoid activation function;
Step3.4, using the following cross-entropy loss as the loss function, iterating multiple times to update the parameters of the bidirectional LSTM model and the convolutional neural network model, thereby training the bidirectional LSTM model and the convolutional neural network model, that is, training the Laos-Thai parallel sentence pair classification model; classifying the candidate Laos-Thai parallel sentence pairs with the trained classification model, and extracting the Laos-Thai sentences that are translations of each other to obtain Laos-Thai parallel sentence pairs;
wherein the loss function is as follows:
$L = -\frac{1}{n+m} \sum_{i=1}^{n+m} \left[ y_i \log \sigma(W_3 h_i + c) + (1 - y_i) \log\left(1 - \sigma(W_3 h_i + c)\right) \right]$
wherein $y_i = 1$ or $y_i = 0$; $y_i = 1$ indicates that the Laos sentence and the Thai sentence are parallel, and $y_i = 0$ indicates that they are not parallel; n represents the number of positive samples, i.e. parallel sentence pairs, used in training the model, and m represents the number of negative samples, i.e. non-parallel sentence pairs, used in training the model.
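The scoring and loss computation of Step3.2 to Step3.4 can be sketched in plain Python as below. This is only an illustration under stated assumptions: the sentence vectors are taken as given (the bidirectional LSTM encoder is not shown), the product is taken element-wise, the difference is taken in absolute value, a tanh nonlinearity combines the two matching vectors, and the loss is averaged over all positive and negative pairs. All function names and parameter values are hypothetical.

```python
import math

def matching_vector(h_lo, h_th, W1, W2, b):
    """h = tanh(W1 (h_lo * h_th) + W2 |h_lo - h_th| + b), with * element-wise."""
    prod = [x * y for x, y in zip(h_lo, h_th)]
    diff = [abs(x - y) for x, y in zip(h_lo, h_th)]
    return [
        math.tanh(sum(w * p for w, p in zip(row1, prod))
                  + sum(w * d for w, d in zip(row2, diff)) + bias)
        for row1, row2, bias in zip(W1, W2, b)
    ]

def parallel_probability(h, W3, c):
    """p(y = 1 | h) = sigmoid(W3 . h + c)."""
    z = sum(w * x for w, x in zip(W3, h)) + c
    return 1.0 / (1.0 + math.exp(-z))

def cross_entropy(probs, labels):
    """Mean binary cross-entropy over all positive and negative pairs."""
    eps = 1e-12  # guards against log(0)
    total = sum(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps)
                for p, y in zip(probs, labels))
    return -total / len(labels)

# Toy parameters (assumed, untrained): identical sentence vectors give a
# zero difference, so only the product term contributes to the score.
h_same = matching_vector([0.5, -0.2], [0.5, -0.2],
                         W1=[[1.0, 0.0], [0.0, 1.0]],
                         W2=[[-1.0, 0.0], [0.0, -1.0]],
                         b=[0.0, 0.0])
p_same = parallel_probability(h_same, W3=[1.0, 1.0], c=0.0)
loss = cross_entropy([p_same, 0.1], [1, 0])
```

In training, the gradient of this loss with respect to the parameters would be used to update them over multiple iterations; that optimization loop is omitted here.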
5. A device for constructing a Laos-Chinese bilingual corpus with Thai as the pivot, characterized in that the device comprises a data preprocessing module, a dictionary translation module, a Laos-Thai parallel sentence pair extraction module and a Laos-Chinese parallel corpus construction module;
the data preprocessing module is used for extracting Thai sentences from existing Chinese-Thai parallel corpus data and performing Thai word segmentation;
the dictionary translation module is used for constructing a Laos-Thai bilingual dictionary, and translating the Thai sentences into Laos sentence sequences word by word using the Laos-Thai bilingual dictionary to obtain candidate Laos-Thai parallel sentence pairs;
the Laos-Thai parallel sentence pair extraction module is used for constructing a Laos-Thai parallel sentence pair classification model based on a bidirectional LSTM, classifying the candidate Laos-Thai parallel sentence pairs, and extracting the Laos-Thai sentences that are translations of each other to obtain Laos-Thai bilingual parallel sentence pairs;
the Laos-Chinese parallel corpus construction module is used for matching Laos with Chinese, with Thai as the pivot language, by matching the obtained Laos-Thai bilingual parallel sentence pairs against the existing Chinese-Thai parallel corpus, thereby constructing the Laos-Chinese bilingual parallel corpus.
CN201910856645.8A 2019-09-11 2019-09-11 Method and device for constructing old-Chinese bilingual corpus with Thai as pivot Active CN110717341B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910856645.8A CN110717341B (en) 2019-09-11 2019-09-11 Method and device for constructing old-Chinese bilingual corpus with Thai as pivot

Publications (2)

Publication Number Publication Date
CN110717341A true CN110717341A (en) 2020-01-21
CN110717341B CN110717341B (en) 2022-06-14

Family

ID=69209837

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910856645.8A Active CN110717341B (en) 2019-09-11 2019-09-11 Method and device for constructing old-Chinese bilingual corpus with Thai as pivot

Country Status (1)

Country Link
CN (1) CN110717341B (en)

Also Published As

Publication number Publication date
CN110717341B (en) 2022-06-14

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant