CN109783825B - Neural network-based ancient language translation method - Google Patents

Neural network-based ancient language translation method

Info

Publication number
CN109783825B
CN109783825B (application CN201910012805.0A)
Authority
CN
China
Prior art keywords
ancient
clause
modern chinese
translation
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910012805.0A
Other languages
Chinese (zh)
Other versions
CN109783825A (en)
Inventor
吕建成
杨可心
屈茜
刘大一恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University
Priority to CN201910012805.0A
Publication of CN109783825A
Application granted
Publication of CN109783825B

Abstract

The invention discloses a neural network-based method for translating ancient Chinese, comprising the following steps: S1, obtaining ancient text chapters and their corresponding translations as initial samples; S2, sequentially performing clause alignment, data word segmentation and data augmentation on the initial samples to obtain an ancient-modern translation corpus; S3, using the corpus as the training corpus of a neural machine translation model and training a sequence-to-sequence model to obtain a trained neural network; S4, feeding the ancient Chinese to be translated into the trained neural network to complete the translation. By introducing several dictionaries for word segmentation, the invention provides the translation model with accurate person-name, place-name and colloquialism information and improves the translation of proper nouns; clauses are aligned automatically, the implicit alignment between characters and words is completed by an attention mechanism, and the ancient clauses to be translated are translated by the neural network, which effectively improves translation efficiency and accuracy.

Description

Neural network-based ancient language translation method
Technical Field
The invention relates to the field of ancient language translation, in particular to an ancient language translation method based on a neural network.
Background
The achievements of the ancients in thought, science and literature are the crystallization of the wisdom and toil of the Chinese nation, and as national treasures they cannot be discarded. Most of this cultural heritage has come down to us in written form; however, the language used by the ancients differs greatly from the language used today, so ancient texts are difficult for modern readers to understand, which invisibly raises a high threshold for research on ancient culture. In the past, such texts could only be translated word by word and sentence by sentence by scholars of classical Chinese, which is time-consuming, labor-intensive and costly.
Disclosure of Invention
Aiming at the above deficiencies in the prior art, the neural network-based ancient language translation method provided by the invention solves the problem of the low efficiency of manual word-by-word, sentence-by-sentence translation.
In order to achieve the purpose of the invention, the invention adopts the technical scheme that:
the ancient Chinese translation method based on the neural network comprises the following steps:
S1, obtaining ancient text chapters and corresponding translation data as initial samples;
S2, sequentially performing clause alignment, data word segmentation and data augmentation on the initial samples to obtain an ancient-modern translation corpus;
S3, using the ancient-modern translation corpus as the training corpus of a neural machine translation model and training a sequence-to-sequence model to obtain a trained neural network;
S4, taking the ancient Chinese to be translated as the input of the trained neural network to complete the translation of the ancient Chinese.
Further, the specific method of step S1 is:
The ancient text chapters and the corresponding translation data are crawled from the Internet, the crawled data are cleaned, and the cleaned data are taken as the initial samples.
Further, the method for clause alignment of the initial samples in step S2 includes the following sub-steps:
S2-1-1, performing word segmentation on the modern Chinese in the initial samples, and matching the ancient text with the modern Chinese from left to right;
S2-1-2, deleting the matched words from the original sentences, introducing an ancient-text dictionary to build an inverse document frequency (IDF) dictionary for the ancient sentences that do not correspond to the modern Chinese, and obtaining the IDF score of each character of the unmatched ancient sentences;
S2-1-3, looking up the definition of each unmatched ancient character in the ancient-text dictionary and matching it against the remaining modern Chinese vocabulary;
S2-1-4, according to the formula

L(s,t) = (1/|s|) · ( Σ_{c∈s} 1_t(c) + β · Σ_{c∈ŝ} max_{k∈dict(c)} idf_k · 1_{t̂}(k) )

obtaining the matching degree L(s, t) of lexical matching; wherein t denotes a modern Chinese clause; s denotes an ancient clause; |s| denotes the length of the ancient clause; 1_t(·) is an indicator function whose value is 1 if the character c in s can be directly matched with a word in the modern Chinese clause t, and 0 otherwise; ŝ and t̂ are the character strings consisting of the characters remaining unmatched in s and t, respectively; 1_{t̂}(·) is an indicator function: if a character k in the modern interpretation dict(c) of the ancient character c matches the remaining modern Chinese vocabulary, its score is taken from the IDF dictionary and recorded as idf_k, and otherwise the term is 0; β is the normalization parameter of the inverse document frequency;
S2-1-5, establishing a translation correspondence model between the ancient clauses and the modern Chinese clauses; wherein the correspondence modes of the translation correspondence model include a 1→0 mode, a 0→1 mode, a 1→2 mode, a 2→1 mode and a 2→2 mode; → denotes translation correspondence, the number in front of → is the number of ancient clauses and the number behind → is the number of modern Chinese clauses;
S2-1-6, obtaining the probability Pr(a→b) of each correspondence mode in the translation correspondence model of each ancient clause, where 0 ≤ a ≤ 2 and 0 ≤ b ≤ 2;
S2-1-7, obtaining the length ratio of each ancient text paragraph to its corresponding modern Chinese paragraph, and obtaining the mean μ and the standard deviation σ of all the length ratios;
S2-1-8, according to the formula

S(s,t) = Pr(a→b) · N(|t|/|s| ; μ, σ)

acquiring the statistical information score S(s, t); wherein N(· ; μ, σ) is the normal distribution probability density function with mean μ and standard deviation σ;
S2-1-9, according to the formula

E(s,t) = 1 - EditDis(s,t) / max(|s|, |t|)

acquiring the edit distance score E(s, t); wherein EditDis(s, t) is the number of operations needed when a sentence of ancient text is turned into the modern Chinese sentence, i.e. the total number of insertions, deletions and substitutions;
S2-1-10, according to the formulas

d(s,t) = L(s,t) + γ·S(s,t) + λ·E(s,t)

D(i,j) = max{ D(i-1,j-1) + d(s_i, t_j),
              D(i-1,j)   + d(s_i, NULL),
              D(i,j-1)   + d(NULL, t_j),
              D(i-2,j-1) + d(s_{i-1}s_i, t_j),
              D(i-1,j-2) + d(s_i, t_{j-1}t_j),
              D(i-2,j-2) + d(s_{i-1}s_i, t_{j-1}t_j) }

obtaining the score D(i, j) of any ancient clause corresponding to each modern Chinese clause; wherein D(i, j) is the score obtained when the i-th ancient clause corresponds to the j-th modern Chinese clause; γ and λ are both weight parameters; s_i is the i-th ancient clause, s_{i-1} the (i-1)-th and s_{i-2} the (i-2)-th; t_j is the j-th modern Chinese clause, t_{j-1} the (j-1)-th and t_{j-2} the (j-2)-th; NULL indicates an empty sentence, i.e. there is no corresponding clause.
S2-1-11, for any ancient clause, selecting the modern Chinese clause with the largest corresponding score as its aligned clause, completing the clause alignment.
Further, the specific method for data word segmentation in step S2 is as follows:
A person-name dictionary, a place-name dictionary and a colloquialism dictionary are respectively constructed, and the person names, place names and colloquialisms in the ancient clauses are segmented according to the constructed dictionaries.
Further, the specific method for data augmentation in step S2 includes the following sub-steps:
S2-2-1, constructing a synonym dictionary with word2vec, keeping for each word only the synonyms whose similarity exceeds 0.8, and obtaining a cleaned synonym dictionary in which each entry consists of one word and the two or three words closest to it, thereby completing synonym-based augmentation;
S2-2-2, splicing each piece of data with the data following it until the sentence-final punctuation is an exclamation mark, a question mark or a full stop, or until the spliced data reaches four clauses, and taking the spliced clause as new clause data, thereby completing clause-based augmentation;
S2-2-3, obtaining the alignment information between all the words of each ancient sentence and all the words of its corresponding modern Chinese with the GIZA++ alignment tool of statistical machine translation, and adjusting the ancient word order according to the alignment information to obtain the ancient-modern translation corpus.
Further, the specific method of step S3 includes the following sub-steps:
S3-1, converting the ancient clauses in the ancient-modern translation corpus into vector form to obtain ancient clause vectors, and inputting the ancient clause vector corresponding to each ancient clause into the sequence-to-sequence model as a basic training unit;
S3-2, according to the formulas

forget_m = sigmoid(W_1·[hidden_{m-1}, x_m] + b_1)
input_m = sigmoid(W_2·[hidden_{m-1}, x_m] + b_2)
C~_m = tanh(W_3·[hidden_{m-1}, x_m] + b_3)
C_m = forget_m * C_{m-1} + input_m * C~_m
output_m = sigmoid(W_4·[hidden_{m-1}, x_m] + b_4)
hidden_m = output_m * tanh(C_m)

obtaining the hidden layer state hidden_m of any neuron in the encoder of the sequence-to-sequence model after the m-th ancient text vector element x_m of the basic training unit is input; wherein hidden_{m-1} is the hidden layer state of the neuron in the encoder after the (m-1)-th ancient text vector element is input; sigmoid(·) is the sigmoid function; tanh(·) is the hyperbolic tangent function; forget_m, input_m, C~_m, C_m and output_m are all intermediate parameters after the m-th ancient text vector element is input; C_{m-1} is the intermediate parameter after the (m-1)-th ancient text vector element is input; b_1, b_2, b_3 and b_4 all denote biases; W_1, W_2, W_3 and W_4 all denote weights; the initial hidden layer state of the encoder is set by random initialization;
S3-3, combining the hidden layer states of all the neurons in the encoder of the sequence-to-sequence model after the last ancient character vector element is input into one vector, to obtain the hidden layer state vector hidden_M of the encoder corresponding to the current basic training unit;
S3-4, inputting the modern Chinese clause corresponding to the ancient clause input to the encoder into the decoder as the basic supervision unit, and according to the formulas

forget_n = sigmoid(W_5·[state_{n-1}, y_n] + b_5)
input_n = sigmoid(W_6·[state_{n-1}, y_n] + b_6)
C~_n = tanh(W_7·[state_{n-1}, y_n] + b_7)
C_n = forget_n * C_{n-1} + input_n * C~_n
output_n = sigmoid(W_8·[state_{n-1}, y_n] + b_8)
state_n = output_n * tanh(C_n)

obtaining the hidden layer state state_n of any neuron in the decoder of the sequence-to-sequence model after the n-th modern Chinese word y_n is input; wherein state_{n-1} is the hidden layer state of the neuron in the decoder after the (n-1)-th modern Chinese word is input; sigmoid(·) is the sigmoid function; tanh(·) is the hyperbolic tangent function; forget_n, input_n, C~_n, C_n and output_n are all intermediate parameters after the n-th modern Chinese word is input; C_{n-1} is the intermediate parameter after the (n-1)-th modern Chinese word is input; b_5, b_6, b_7 and b_8 all denote biases; W_5, W_6, W_7 and W_8 all denote weights; the initial hidden layer state of the decoder is set to the value of the hidden layer state of the encoder;
S3-5, combining the hidden layer states of all the neurons in the decoder of the sequence-to-sequence model after the n-th modern Chinese word is input into one vector, to obtain the hidden layer state vector state_Y of the decoder corresponding to the n-th modern Chinese word;
S3-6, according to the formulas

a_{nM} = exp(e_{nM}) / Σ_{x=1}^{M} exp(e_{nx})
e_{nM} = bmm(state_Y, hidden_M)
e_{nx} = bmm(state_Y, hidden_x)

obtaining the attention a_{nM} of the sequence-to-sequence model between the last ancient text vector element and the n-th modern Chinese word; wherein exp(·) is the exponential function with the natural constant e as base; e_{nM} and e_{nx} are intermediate parameters; bmm(·) denotes the dot product; M is the number of elements in the ancient text vector; hidden_x is the hidden layer state of the encoder after the x-th ancient text vector element is input, hidden_x ∈ (hidden_1, hidden_2, ..., hidden_M);
S3-7, according to the formula

context_n = Σ_{x=1}^{M} a_{nx} · hidden_x

obtaining the weighted average context_n of the hidden layer states output by the encoder for the input ancient clause, corresponding to the n-th modern Chinese word, i.e. the context vector of the n-th modern Chinese word; wherein a_{nx} is the attention between the x-th ancient text vector element and the n-th modern Chinese word, obtained as in step S3-6;
S3-8, according to the formula

state~_n = tanh(W_context · [context_n ; state_Y])

cascading the context vector corresponding to the n-th modern Chinese word with the hidden layer state of the decoder and sending it into the fully connected network W_context, obtaining the cascade state state~_n;
S3-9, according to the formula

ŷ_n = softmax(W_s · state~_n)

obtaining the output ŷ_n of the sequence-to-sequence model corresponding to the n-th modern Chinese word, and then obtaining the output ŷ = (ŷ_1, ŷ_2, ..., ŷ_N) of the sequence-to-sequence model corresponding to the whole modern Chinese sentence; wherein softmax(·) is the softmax function; W_s is a network weight;
S3-10, according to the formula

loss(ŷ, y) = -(1/N) · Σ_{n=1}^{N} log ŷ_n(y_n)

obtaining the difference loss(ŷ, y) between the output ŷ of the sequence-to-sequence model corresponding to the modern Chinese sentence and the true answer y; if the difference loss(ŷ, y) is larger than a threshold, the parameters of the sequence-to-sequence model are updated until the difference loss(ŷ, y) is less than or equal to the threshold, and the trained neural network is obtained; wherein N is the total number of words in the modern Chinese sentence; y_n is the n-th modern Chinese word, y_n ∈ y; the true answer y is the modern Chinese input to the decoder.
The invention has the following beneficial effects: by introducing several dictionaries for word segmentation, the invention provides the translation model with accurate person-name, place-name and colloquialism information, improving the translation of proper nouns; clauses are aligned automatically; the implicit alignment between characters and words is completed by the attention mechanism; and the ancient clauses to be translated are translated by the neural network, effectively improving translation efficiency and accuracy.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
Detailed Description
The following description of the embodiments of the invention is provided to help those skilled in the art understand the invention, but it should be understood that the invention is not limited to the scope of these embodiments. To those of ordinary skill in the art, various changes are apparent within the spirit and scope of the invention as defined by the appended claims, and everything made using the inventive concept is protected.
As shown in fig. 1, the neural network-based ancient language translation method includes the following steps:
S1, obtaining ancient text chapters and corresponding translation data as initial samples;
S2, sequentially performing clause alignment, data word segmentation and data augmentation on the initial samples to obtain an ancient-modern translation corpus;
S3, using the ancient-modern translation corpus as the training corpus of a neural machine translation model and training a sequence-to-sequence model to obtain a trained neural network;
S4, taking the ancient Chinese to be translated as the input of the trained neural network to complete the translation of the ancient Chinese.
The specific method of step S1 is:
The ancient text chapters and the corresponding translation data are crawled from the Internet, the crawled data are cleaned, and the cleaned data are taken as the initial samples.
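For illustration, a minimal Python sketch of this crawling-and-cleaning step follows; the URL handling, CSS selectors and cleaning rules are hypothetical assumptions, since the patent names neither a source site nor a page layout.

import re
import requests
from bs4 import BeautifulSoup

def crawl_pairs(url):
    """Fetch one page and return (ancient text, modern translation) paragraph pairs.
    The selectors below are illustrative; real pages need their own."""
    html = requests.get(url, timeout=10).text
    soup = BeautifulSoup(html, "html.parser")
    ancient = [p.get_text(strip=True) for p in soup.select("div.original p")]
    modern = [p.get_text(strip=True) for p in soup.select("div.translation p")]
    return list(zip(ancient, modern))

def clean(text):
    """Basic cleaning: drop bracketed editorial notes and collapse whitespace."""
    text = re.sub(r"[（(][^（）()]*[)）]", "", text)  # remove parenthetical notes
    return re.sub(r"\s+", "", text)  # classical Chinese text uses no spaces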
The method for clause alignment of the initial samples in step S2 includes the following sub-steps:
S2-1-1, performing word segmentation on the modern Chinese in the initial samples, and matching the ancient text with the modern Chinese from left to right;
S2-1-2, deleting the matched words from the original sentences, introducing an ancient-text dictionary to build an inverse document frequency (IDF) dictionary for the ancient sentences that do not correspond to the modern Chinese, and obtaining the IDF score of each character of the unmatched ancient sentences;
S2-1-3, looking up the definition of each unmatched ancient character in the ancient-text dictionary and matching it against the remaining modern Chinese vocabulary;
S2-1-4, according to the formula

L(s,t) = (1/|s|) · ( Σ_{c∈s} 1_t(c) + β · Σ_{c∈ŝ} max_{k∈dict(c)} idf_k · 1_{t̂}(k) )

obtaining the matching degree L(s, t) of lexical matching; wherein t denotes a modern Chinese clause; s denotes an ancient clause; |s| denotes the length of the ancient clause; 1_t(·) is an indicator function whose value is 1 if the character c in s can be directly matched with a word in the modern Chinese clause t, and 0 otherwise; ŝ and t̂ are the character strings consisting of the characters remaining unmatched in s and t, respectively; 1_{t̂}(·) is an indicator function: if a character k in the modern interpretation dict(c) of the ancient character c matches the remaining modern Chinese vocabulary, its frequency score is taken from the inverse document frequency dictionary and recorded as idf_k, and otherwise the term is 0; β is the normalization parameter of the inverse document frequency;
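A minimal Python sketch of the lexical matching degree L(s, t) just defined; the ancient-character dictionary is assumed to map each character to a string of modern-interpretation characters, matching is approximated at the character level, and β = 0.1 is an illustrative value.

def lexical_match(s, t_words, char_dict, idf, beta=0.1):
    """L(s, t) per step S2-1-4 (a sketch; beta and data structures are assumptions).

    s: ancient clause (string of characters); t_words: segmented modern clause;
    char_dict: ancient character -> modern interpretation (string of characters);
    idf: character -> inverse document frequency score.
    """
    t_chars = set("".join(t_words))
    direct = [c for c in s if c in t_chars]     # characters with 1_t(c) = 1
    s_hat = [c for c in s if c not in t_chars]  # remaining ancient characters
    t_hat = t_chars - set(direct)               # remaining modern characters
    score = float(len(direct))
    for c in s_hat:
        # max over characters k in the dictionary definition of c that remain in t
        ks = [idf.get(k, 0.0) for k in char_dict.get(c, "") if k in t_hat]
        score += beta * max(ks, default=0.0)
    return score / max(len(s), 1)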
S2-1-5, establishing a translation correspondence model between the ancient clauses and the modern Chinese clauses; wherein the correspondence modes of the translation correspondence model include a 1→0 mode, a 0→1 mode, a 1→2 mode, a 2→1 mode and a 2→2 mode; → denotes translation correspondence, the number in front of → is the number of ancient clauses and the number behind → is the number of modern Chinese clauses;
S2-1-6, obtaining the probability Pr(a→b) of each correspondence mode in the translation correspondence model of each ancient clause, where 0 ≤ a ≤ 2 and 0 ≤ b ≤ 2;
S2-1-7, obtaining the length ratio of each ancient text paragraph to its corresponding modern Chinese paragraph, and obtaining the mean μ and the standard deviation σ of all the length ratios;
S2-1-8, according to the formula

S(s,t) = Pr(a→b) · N(|t|/|s| ; μ, σ)

acquiring the statistical information score S(s, t); wherein N(· ; μ, σ) is the normal distribution probability density function with mean μ and standard deviation σ;
S2-1-9, according to the formula

E(s,t) = 1 - EditDis(s,t) / max(|s|, |t|)

acquiring the edit distance score E(s, t); wherein EditDis(s, t) is the number of operations needed when a sentence of ancient text is turned into the modern Chinese sentence, i.e. the total number of insertions, deletions and substitutions;
S2-1-10, according to the formulas

d(s,t) = L(s,t) + γ·S(s,t) + λ·E(s,t)

D(i,j) = max{ D(i-1,j-1) + d(s_i, t_j),
              D(i-1,j)   + d(s_i, NULL),
              D(i,j-1)   + d(NULL, t_j),
              D(i-2,j-1) + d(s_{i-1}s_i, t_j),
              D(i-1,j-2) + d(s_i, t_{j-1}t_j),
              D(i-2,j-2) + d(s_{i-1}s_i, t_{j-1}t_j) }

obtaining the score D(i, j) of any ancient clause corresponding to each modern Chinese clause; wherein D(i, j) is the score obtained when the i-th ancient clause corresponds to the j-th modern Chinese clause; γ and λ are both weight parameters; s_i is the i-th ancient clause, s_{i-1} the (i-1)-th and s_{i-2} the (i-2)-th; t_j is the j-th modern Chinese clause, t_{j-1} the (j-1)-th and t_{j-2} the (j-2)-th; NULL indicates an empty sentence, i.e. there is no corresponding clause.
S2-1-11, for any ancient clause, selecting the modern Chinese clause with the largest corresponding score as its aligned clause, completing the clause alignment.
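Steps S2-1-10 and S2-1-11 amount to a dynamic program over the correspondence modes; a Python sketch follows, in which d_score stands for d(s,t) = L(s,t) + γS(s,t) + λE(s,t) and the empty string stands for NULL (assumptions of this sketch, not the patent's exact implementation).

import numpy as np

def align(ancient, modern, d_score):
    """Dynamic-programming clause alignment (steps S2-1-10 / S2-1-11).

    ancient, modern: lists of clause strings; d_score(s, t) is assumed to
    return d(s, t), with s or t == "" standing in for NULL.
    """
    I, J = len(ancient), len(modern)
    D = np.full((I + 1, J + 1), -np.inf)
    D[0, 0] = 0.0
    back = {}
    # candidate modes: (ancient clauses consumed, modern clauses consumed)
    modes = [(1, 1), (1, 0), (0, 1), (2, 1), (1, 2), (2, 2)]
    for i in range(I + 1):
        for j in range(J + 1):
            if i == 0 and j == 0:
                continue
            for a, b in modes:
                if i - a < 0 or j - b < 0:
                    continue
                s = "".join(ancient[i - a:i])  # "" acts as NULL
                t = "".join(modern[j - b:j])
                score = D[i - a, j - b] + d_score(s, t)
                if score > D[i, j]:
                    D[i, j] = score
                    back[(i, j)] = (a, b)
    pairs, i, j = [], I, J  # trace back the best-scoring alignment
    while (i, j) in back:
        a, b = back[(i, j)]
        pairs.append(("".join(ancient[i - a:i]), "".join(modern[j - b:j])))
        i, j = i - a, j - b
    return list(reversed(pairs))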
The specific method for data word segmentation in step S2 is as follows:
A person-name dictionary, a place-name dictionary and a colloquialism dictionary are respectively constructed, and the person names, place names and colloquialisms in the ancient clauses are segmented according to the constructed dictionaries.
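A Python sketch of dictionary-driven segmentation by forward maximum matching; the entries shown are illustrative only, the real lexicon being the union of the person-name, place-name and colloquialism dictionaries built in this step.

def segment(sentence, lexicon, max_len=4):
    """Forward maximum matching: take the longest dictionary entry at each
    position; unmatched text falls back to single characters."""
    tokens, i = [], 0
    while i < len(sentence):
        for l in range(min(max_len, len(sentence) - i), 0, -1):
            if l == 1 or sentence[i:i + l] in lexicon:
                tokens.append(sentence[i:i + l])
                i += l
                break
    return tokens

lexicon = {"诸葛亮", "南阳"}  # illustrative entries only
print(segment("诸葛亮躬耕于南阳", lexicon))  # ['诸葛亮', '躬', '耕', '于', '南阳']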
The specific method for data augmentation in step S2 includes the following substeps:
S2-2-1, constructing a synonym dictionary with word2vec, keeping for each word only the synonyms whose similarity exceeds 0.8, and obtaining a cleaned synonym dictionary in which each entry consists of one word and the two or three words closest to it, thereby completing synonym-based augmentation; during similarity calculation, the cosine of the angle between the word2vec vectors of the two words is taken as the similarity value;
S2-2-2, splicing each piece of data with the data following it until the sentence-final punctuation is an exclamation mark, a question mark or a full stop, or until the spliced data reaches four clauses, and taking the spliced clause as new clause data, thereby completing clause-based augmentation;
S2-2-3, obtaining the alignment information between all the words of each ancient sentence and all the words of its corresponding modern Chinese with the GIZA++ alignment tool of statistical machine translation, and adjusting the ancient word order according to the alignment information to obtain the ancient-modern translation corpus.
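The synonym augmentation of S2-2-1 and the clause splicing of S2-2-2 can be sketched in Python with gensim's word2vec implementation (an assumed choice of library; the hyperparameters are illustrative, and most_similar already returns cosine similarities as described above).

from gensim.models import Word2Vec

def build_synonym_dict(tokenized_corpus, threshold=0.8, topn=3):
    """Step S2-2-1: keep the 2-3 nearest words with cosine similarity > 0.8."""
    model = Word2Vec(tokenized_corpus, vector_size=100, min_count=5)
    syn = {}
    for word in model.wv.index_to_key:
        near = [w for w, sim in model.wv.most_similar(word, topn=topn) if sim > threshold]
        if near:
            syn[word] = near
    return syn

def concat_clauses(clauses, max_pieces=4, stops="！？。"):
    """Step S2-2-2: splice consecutive clauses until a sentence-final mark or four pieces."""
    out, buf = [], []
    for c in clauses:
        buf.append(c)
        if c and (c[-1] in stops or len(buf) == max_pieces):
            out.append("".join(buf))
            buf = []
    if buf:
        out.append("".join(buf))
    return out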
The specific method of step S3 includes the following substeps:
S3-1, converting the ancient clauses in the ancient-modern translation corpus into vector form to obtain ancient clause vectors, and inputting the ancient clause vector corresponding to each ancient clause into the sequence-to-sequence model as a basic training unit;
S3-2, according to the formulas

forget_m = sigmoid(W_1·[hidden_{m-1}, x_m] + b_1)
input_m = sigmoid(W_2·[hidden_{m-1}, x_m] + b_2)
C~_m = tanh(W_3·[hidden_{m-1}, x_m] + b_3)
C_m = forget_m * C_{m-1} + input_m * C~_m
output_m = sigmoid(W_4·[hidden_{m-1}, x_m] + b_4)
hidden_m = output_m * tanh(C_m)

obtaining the hidden layer state hidden_m of any neuron in the encoder of the sequence-to-sequence model after the m-th ancient text vector element x_m of the basic training unit is input; wherein hidden_{m-1} is the hidden layer state of the neuron in the encoder after the (m-1)-th ancient text vector element is input; sigmoid(·) is the sigmoid function; tanh(·) is the hyperbolic tangent function; forget_m, input_m, C~_m, C_m and output_m are all intermediate parameters after the m-th ancient text vector element is input; C_{m-1} is the intermediate parameter after the (m-1)-th ancient text vector element is input; b_1, b_2, b_3 and b_4 all denote biases; W_1, W_2, W_3 and W_4 all denote weights; the initial hidden layer state of the encoder is set by random initialization;
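The S3-2 equations are those of a standard LSTM cell and can be transcribed directly into numpy; x_m denotes the m-th input vector element, and the weight and bias shapes are assumed to match the concatenated input [hidden_{m-1}, x_m].

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_m, hidden_prev, C_prev, W, b):
    """One encoder step per S3-2, with W = (W1, W2, W3, W4) and b = (b1, b2, b3, b4)."""
    W1, W2, W3, W4 = W
    b1, b2, b3, b4 = b
    z = np.concatenate([hidden_prev, x_m])       # [hidden_{m-1}, x_m]
    forget_m = sigmoid(W1 @ z + b1)              # forget gate
    input_m = sigmoid(W2 @ z + b2)               # input gate
    C_tilde = np.tanh(W3 @ z + b3)               # candidate cell state C~_m
    C_m = forget_m * C_prev + input_m * C_tilde  # new cell state
    output_m = sigmoid(W4 @ z + b4)              # output gate
    hidden_m = output_m * np.tanh(C_m)           # new hidden state
    return hidden_m, C_m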
S3-3, combining the hidden layer states of all the neurons in the encoder of the sequence-to-sequence model after the last ancient character vector element is input into one vector, to obtain the hidden layer state vector hidden_M of the encoder corresponding to the current basic training unit;
S3-4, inputting the modern Chinese clause corresponding to the ancient clause input to the encoder into the decoder as the basic supervision unit, and according to the formulas

forget_n = sigmoid(W_5·[state_{n-1}, y_n] + b_5)
input_n = sigmoid(W_6·[state_{n-1}, y_n] + b_6)
C~_n = tanh(W_7·[state_{n-1}, y_n] + b_7)
C_n = forget_n * C_{n-1} + input_n * C~_n
output_n = sigmoid(W_8·[state_{n-1}, y_n] + b_8)
state_n = output_n * tanh(C_n)

obtaining the hidden layer state state_n of any neuron in the decoder of the sequence-to-sequence model after the n-th modern Chinese word y_n is input; wherein state_{n-1} is the hidden layer state of the neuron in the decoder after the (n-1)-th modern Chinese word is input; sigmoid(·) is the sigmoid function; tanh(·) is the hyperbolic tangent function; forget_n, input_n, C~_n, C_n and output_n are all intermediate parameters after the n-th modern Chinese word is input; C_{n-1} is the intermediate parameter after the (n-1)-th modern Chinese word is input; b_5, b_6, b_7 and b_8 all denote biases; W_5, W_6, W_7 and W_8 all denote weights; the initial hidden layer state of the decoder is set to the value of the hidden layer state of the encoder;
S3-5, combining the hidden layer states of all the neurons in the decoder of the sequence-to-sequence model after the n-th modern Chinese word is input into one vector, to obtain the hidden layer state vector state_Y of the decoder corresponding to the n-th modern Chinese word;
S3-6, according to the formulas

a_{nM} = exp(e_{nM}) / Σ_{x=1}^{M} exp(e_{nx})
e_{nM} = bmm(state_Y, hidden_M)
e_{nx} = bmm(state_Y, hidden_x)

obtaining the attention a_{nM} of the sequence-to-sequence model between the last ancient text vector element and the n-th modern Chinese word; wherein exp(·) is the exponential function with the natural constant e as base; e_{nM} and e_{nx} are intermediate parameters; bmm(·) denotes the dot product; M is the number of elements in the ancient text vector; hidden_x is the hidden layer state of the encoder after the x-th ancient text vector element is input, hidden_x ∈ (hidden_1, hidden_2, ..., hidden_M);
S3-7, according to the formula

context_n = Σ_{x=1}^{M} a_{nx} · hidden_x

obtaining the weighted average context_n of the hidden layer states output by the encoder for the input ancient clause, corresponding to the n-th modern Chinese word, i.e. the context vector of the n-th modern Chinese word; wherein a_{nx} is the attention between the x-th ancient text vector element and the n-th modern Chinese word, obtained as in step S3-6;
S3-8, according to the formula

state~_n = tanh(W_context · [context_n ; state_Y])

cascading the context vector corresponding to the n-th modern Chinese word with the hidden layer state of the decoder and sending it into the fully connected network W_context, obtaining the cascade state state~_n;
S3-9, according to the formula

ŷ_n = softmax(W_s · state~_n)

obtaining the output ŷ_n of the sequence-to-sequence model corresponding to the n-th modern Chinese word, and then obtaining the output ŷ = (ŷ_1, ŷ_2, ..., ŷ_N) of the sequence-to-sequence model corresponding to the whole modern Chinese sentence; wherein softmax(·) is the softmax function; W_s is a network weight;
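Steps S3-6 to S3-9 amount to dot-product attention followed by a concatenation layer and a softmax over the vocabulary; a PyTorch sketch under assumed batch-first shapes follows (treating W_context and W_s as nn.Linear layers is an assumption consistent with the formulas).

import torch
import torch.nn.functional as F

def attention_step(state_y, encoder_hiddens, W_context, W_s):
    """S3-6 to S3-9 for one decoder step.

    state_y:         (B, H)    decoder hidden state vector at step n
    encoder_hiddens: (B, M, H) hidden_1 .. hidden_M from the encoder
    W_context: nn.Linear(2H, H); W_s: nn.Linear(H, V)
    """
    # e_nx = bmm(state_Y, hidden_x): dot-product scores over all encoder steps
    e = torch.bmm(encoder_hiddens, state_y.unsqueeze(2)).squeeze(2)  # (B, M)
    a = F.softmax(e, dim=1)                                          # a_n1 .. a_nM
    # context_n = sum over x of a_nx * hidden_x
    context = torch.bmm(a.unsqueeze(1), encoder_hiddens).squeeze(1)  # (B, H)
    # cascade [context_n; state_Y] through the fully connected network W_context
    state_cas = torch.tanh(W_context(torch.cat([context, state_y], dim=1)))
    # output distribution over the modern-Chinese vocabulary
    return F.softmax(W_s(state_cas), dim=1), a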
S3-10, according to the formula

loss(ŷ, y) = -(1/N) · Σ_{n=1}^{N} log ŷ_n(y_n)

obtaining the difference loss(ŷ, y) between the output ŷ of the sequence-to-sequence model corresponding to the modern Chinese sentence and the true answer y; if the difference loss(ŷ, y) is larger than a threshold, the parameters of the sequence-to-sequence model are updated until the difference loss(ŷ, y) is less than or equal to the threshold, and the trained neural network is obtained; wherein N is the total number of words in the modern Chinese sentence; y_n is the n-th modern Chinese word, y_n ∈ y; the true answer y is the modern Chinese input to the decoder.
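The threshold-based stopping rule of S3-10 can be sketched as the following training loop, with cross-entropy standing in for the difference measure and a hypothetical model/optimizer interface (the patent specifies neither).

import torch

def train_until_threshold(model, optimizer, batches, threshold, max_epochs=100):
    """S3-10: update parameters until loss(y_hat, y) <= threshold.

    model(ancient, modern) is assumed to return per-word softmax outputs
    y_hat of shape (N, V); modern is a length-N tensor of word indices."""
    loss_fn = torch.nn.NLLLoss()
    for _ in range(max_epochs):
        total = 0.0
        for ancient, modern in batches:
            y_hat = model(ancient, modern)            # (N, V) softmax outputs
            loss = loss_fn(torch.log(y_hat), modern)  # -(1/N) sum log y_hat_n(y_n)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
        if total / len(batches) <= threshold:         # stop once below the threshold
            break
    return model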
In conclusion, by introducing several dictionaries for word segmentation, the invention provides the translation model with accurate person-name, place-name and colloquialism information, improving the translation of proper nouns; clauses are aligned automatically; the implicit alignment between characters and words is completed by the attention mechanism; and the ancient clauses to be translated are translated by the neural network, effectively improving translation efficiency and accuracy.

Claims (5)

1. An ancient language translation method based on a neural network is characterized by comprising the following steps:
S1, obtaining ancient text chapters and corresponding translation data as initial samples;
S2, sequentially performing clause alignment, data word segmentation and data augmentation on the initial samples to obtain an ancient-modern translation corpus;
S3, using the ancient-modern translation corpus as the training corpus of a neural machine translation model and training a sequence-to-sequence model to obtain a trained neural network;
S4, taking the ancient Chinese to be translated as the input of the trained neural network to complete the translation of the ancient Chinese;
the method for clause alignment of the initial samples in step S2 includes the following sub-steps:
S2-1-1, performing word segmentation on the modern Chinese in the initial samples, and matching the ancient text with the modern Chinese from left to right;
S2-1-2, deleting the matched words from the original sentences, introducing an ancient-text dictionary to build an inverse document frequency (IDF) dictionary for the ancient sentences that do not correspond to the modern Chinese, and obtaining the IDF score of each character of the unmatched ancient sentences;
S2-1-3, looking up the definition of each unmatched ancient character in the ancient-text dictionary and matching it against the remaining modern Chinese vocabulary;
S2-1-4, according to the formula

L(s,t) = (1/|s|) · ( Σ_{c∈s} 1_t(c) + β · Σ_{c∈ŝ} max_{k∈dict(c)} idf_k · 1_{t̂}(k) )

obtaining the matching degree L(s, t) of lexical matching; wherein t denotes a modern Chinese clause; s denotes an ancient clause; |s| denotes the length of the ancient clause; 1_t(·) is an indicator function whose value is 1 if the character c in s can be directly matched with a word in the modern Chinese clause t, and 0 otherwise; ŝ and t̂ are the character strings consisting of the characters remaining unmatched in s and t, respectively; 1_{t̂}(·) is an indicator function: if a character k in the modern interpretation dict(c) of the ancient character c matches the remaining modern Chinese vocabulary, its frequency score is taken from the inverse document frequency dictionary and recorded as idf_k, and otherwise the term is 0; β is the normalization parameter of the inverse document frequency;
S2-1-5, establishing a translation correspondence model between the ancient clauses and the modern Chinese clauses; wherein the correspondence modes of the translation correspondence model include a 1→0 mode, a 0→1 mode, a 1→2 mode, a 2→1 mode and a 2→2 mode; → denotes translation correspondence, the number in front of → is the number of ancient clauses and the number behind → is the number of modern Chinese clauses;
S2-1-6, obtaining the probability Pr(a→b) of each correspondence mode in the translation correspondence model of each ancient clause, where 0 ≤ a ≤ 2 and 0 ≤ b ≤ 2;
S2-1-7, obtaining the length ratio of each ancient text paragraph to its corresponding modern Chinese paragraph, and obtaining the mean μ and the standard deviation σ of all the length ratios;
S2-1-8, according to the formula

S(s,t) = Pr(a→b) · N(|t|/|s| ; μ, σ)

acquiring the statistical information score S(s, t); wherein N(· ; μ, σ) is the normal distribution probability density function with mean μ and standard deviation σ;
S2-1-9, according to the formula

E(s,t) = 1 - EditDis(s,t) / max(|s|, |t|)

acquiring the edit distance score E(s, t); wherein EditDis(s, t) is the number of operations needed when a sentence of ancient text is turned into the modern Chinese sentence, i.e. the total number of insertions, deletions and substitutions;
S2-1-10, according to the formulas

d(s,t) = L(s,t) + γ·S(s,t) + λ·E(s,t)

D(i,j) = max{ D(i-1,j-1) + d(s_i, t_j),
              D(i-1,j)   + d(s_i, NULL),
              D(i,j-1)   + d(NULL, t_j),
              D(i-2,j-1) + d(s_{i-1}s_i, t_j),
              D(i-1,j-2) + d(s_i, t_{j-1}t_j),
              D(i-2,j-2) + d(s_{i-1}s_i, t_{j-1}t_j) }

obtaining the score D(i, j) of any ancient clause corresponding to each modern Chinese clause; wherein D(i, j) is the score obtained when the i-th ancient clause corresponds to the j-th modern Chinese clause; γ and λ are both weight parameters; s_i is the i-th ancient clause, s_{i-1} the (i-1)-th and s_{i-2} the (i-2)-th; t_j is the j-th modern Chinese clause, t_{j-1} the (j-1)-th and t_{j-2} the (j-2)-th; NULL indicates an empty sentence, i.e. there is no corresponding clause;
S2-1-11, for any ancient clause, selecting the modern Chinese clause with the largest corresponding score as its aligned clause, completing the clause alignment.
2. The neural network-based ancient language translation method according to claim 1, wherein the specific method of step S1 is:
The ancient text chapters and the corresponding translation data are crawled from the Internet, the crawled data are cleaned, and the cleaned data are taken as the initial samples.
3. The neural network-based ancient language translation method according to claim 1, wherein the specific method of the data word segmentation in step S2 is:
A person-name dictionary, a place-name dictionary and a colloquialism dictionary are respectively constructed, and the person names, place names and colloquialisms in the ancient clauses are segmented according to the constructed dictionaries.
4. The neural network-based ancient language translation method according to claim 3, wherein the specific method of the data augmentation in step S2 comprises the following sub-steps:
S2-2-1, constructing a synonym dictionary with word2vec, keeping for each word only the synonyms whose similarity exceeds 0.8, and obtaining a cleaned synonym dictionary in which each entry consists of one word and the two or three words closest to it, thereby completing synonym-based augmentation;
S2-2-2, splicing each piece of data with the data following it until the sentence-final punctuation is an exclamation mark, a question mark or a full stop, or until the spliced data reaches four clauses, and taking the spliced clause as new clause data, thereby completing clause-based augmentation;
S2-2-3, obtaining the alignment information between all the words of each ancient sentence and all the words of its corresponding modern Chinese with the GIZA++ alignment tool of statistical machine translation, and adjusting the ancient word order according to the alignment information to obtain the ancient-modern translation corpus.
5. The neural network-based ancient language translation method according to claim 4, wherein the specific method of step S3 comprises the following sub-steps:
S3-1, converting the ancient clauses in the ancient-modern translation corpus into vector form to obtain ancient clause vectors, and inputting the ancient clause vector corresponding to each ancient clause into the sequence-to-sequence model as a basic training unit;
S3-2, according to the formulas

forget_m = sigmoid(W_1·[hidden_{m-1}, x_m] + b_1)
input_m = sigmoid(W_2·[hidden_{m-1}, x_m] + b_2)
C~_m = tanh(W_3·[hidden_{m-1}, x_m] + b_3)
C_m = forget_m * C_{m-1} + input_m * C~_m
output_m = sigmoid(W_4·[hidden_{m-1}, x_m] + b_4)
hidden_m = output_m * tanh(C_m)

obtaining the hidden layer state hidden_m of any neuron in the encoder of the sequence-to-sequence model after the m-th ancient text vector element x_m of the basic training unit is input; wherein hidden_{m-1} is the hidden layer state of the neuron in the encoder after the (m-1)-th ancient text vector element is input; sigmoid(·) is the sigmoid function; tanh(·) is the hyperbolic tangent function; forget_m, input_m, C~_m, C_m and output_m are all intermediate parameters after the m-th ancient text vector element is input; C_{m-1} is the intermediate parameter after the (m-1)-th ancient text vector element is input; b_1, b_2, b_3 and b_4 all denote biases; W_1, W_2, W_3 and W_4 all denote weights; the initial hidden layer state of the encoder is set by random initialization;
S3-3, combining the hidden layer states of all the neurons in the encoder of the sequence-to-sequence model after the last ancient character vector element is input into one vector, to obtain the hidden layer state vector hidden_M of the encoder corresponding to the current basic training unit;
S3-4, inputting the modern Chinese clause corresponding to the ancient clause input to the encoder into the decoder as the basic supervision unit, and according to the formulas

forget_n = sigmoid(W_5·[state_{n-1}, y_n] + b_5)
input_n = sigmoid(W_6·[state_{n-1}, y_n] + b_6)
C~_n = tanh(W_7·[state_{n-1}, y_n] + b_7)
C_n = forget_n * C_{n-1} + input_n * C~_n
output_n = sigmoid(W_8·[state_{n-1}, y_n] + b_8)
state_n = output_n * tanh(C_n)

obtaining the hidden layer state state_n of any neuron in the decoder of the sequence-to-sequence model after the n-th modern Chinese word y_n is input; wherein state_{n-1} is the hidden layer state of the neuron in the decoder after the (n-1)-th modern Chinese word is input; sigmoid(·) is the sigmoid function; tanh(·) is the hyperbolic tangent function; forget_n, input_n, C~_n, C_n and output_n are all intermediate parameters after the n-th modern Chinese word is input; C_{n-1} is the intermediate parameter after the (n-1)-th modern Chinese word is input; b_5, b_6, b_7 and b_8 all denote biases; W_5, W_6, W_7 and W_8 all denote weights; the initial hidden layer state of the decoder is set to the value of the hidden layer state of the encoder;
S3-5, combining the hidden layer states of all the neurons in the decoder of the sequence-to-sequence model after the n-th modern Chinese word is input into one vector, to obtain the hidden layer state vector state_Y of the decoder corresponding to the n-th modern Chinese word;
S3-6, according to the formulas

a_{nM} = exp(e_{nM}) / Σ_{x=1}^{M} exp(e_{nx})
e_{nM} = bmm(state_Y, hidden_M)
e_{nx} = bmm(state_Y, hidden_x)

obtaining the attention a_{nM} of the sequence-to-sequence model between the last ancient text vector element and the n-th modern Chinese word; wherein exp(·) is the exponential function with the natural constant e as base; e_{nM} and e_{nx} are intermediate parameters; bmm(·) denotes the dot product; M is the number of elements in the ancient text vector; hidden_x is the hidden layer state of the encoder after the x-th ancient text vector element is input, hidden_x ∈ (hidden_1, hidden_2, ..., hidden_M);
S3-7, according to the formula

context_n = Σ_{x=1}^{M} a_{nx} · hidden_x

obtaining the weighted average context_n of the hidden layer states output by the encoder for the input ancient clause, corresponding to the n-th modern Chinese word, i.e. the context vector of the n-th modern Chinese word; wherein a_{nx} is the attention between the x-th ancient text vector element and the n-th modern Chinese word, obtained as in step S3-6;
S3-8, according to the formula

state~_n = tanh(W_context · [context_n ; state_Y])

cascading the context vector corresponding to the n-th modern Chinese word with the hidden layer state of the decoder and sending it into the fully connected network W_context, obtaining the cascade state state~_n;
S3-9, according to the formula

ŷ_n = softmax(W_s · state~_n)

obtaining the output ŷ_n of the sequence-to-sequence model corresponding to the n-th modern Chinese word, and then obtaining the output ŷ = (ŷ_1, ŷ_2, ..., ŷ_N) of the sequence-to-sequence model corresponding to the whole modern Chinese sentence; wherein softmax(·) is the softmax function; W_s is a network weight;
S3-10, according to the formula

loss(ŷ, y) = -(1/N) · Σ_{n=1}^{N} log ŷ_n(y_n)

obtaining the difference loss(ŷ, y) between the output ŷ of the sequence-to-sequence model corresponding to the modern Chinese sentence and the true answer y; if the difference loss(ŷ, y) is larger than a threshold, the parameters of the sequence-to-sequence model are updated until the difference loss(ŷ, y) is less than or equal to the threshold, and the trained neural network is obtained; wherein N is the total number of words in the modern Chinese sentence; y_n is the n-th modern Chinese word, y_n ∈ y; the true answer y is the modern Chinese input to the decoder.
CN201910012805.0A 2019-01-07 2019-01-07 Neural network-based ancient language translation method Active CN109783825B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910012805.0A CN109783825B (en) 2019-01-07 2019-01-07 Neural network-based ancient language translation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910012805.0A CN109783825B (en) 2019-01-07 2019-01-07 Neural network-based ancient language translation method

Publications (2)

Publication Number Publication Date
CN109783825A CN109783825A (en) 2019-05-21
CN109783825B (en) 2020-04-28

Family

ID=66499178

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910012805.0A Active CN109783825B (en) 2019-01-07 2019-01-07 Neural network-based ancient language translation method

Country Status (1)

Country Link
CN (1) CN109783825B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222349B (en) * 2019-06-13 2020-05-19 成都信息工程大学 Method and computer for deep dynamic context word expression
CN110795552B (en) * 2019-10-22 2024-01-23 腾讯科技(深圳)有限公司 Training sample generation method and device, electronic equipment and storage medium
CN112270190A (en) * 2020-11-13 2021-01-26 浩鲸云计算科技股份有限公司 Attention mechanism-based database field translation method and system
CN116070643B (en) * 2023-04-03 2023-08-15 武昌理工学院 Fixed style translation method and system from ancient text to English
CN116701961B (en) * 2023-08-04 2023-10-20 北京语言大学 Method and system for automatically evaluating machine translation result of cultural relics

Patent Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007068123A1 (en) * 2005-12-16 2007-06-21 National Research Council Of Canada Method and system for training and applying a distortion component to machine translation
CN103955454A (en) * 2014-03-19 2014-07-30 北京百度网讯科技有限公司 Method and equipment for carrying out literary form conversion between vernacular Chinese and classical Chinese
CN108090050A (en) * 2017-11-08 2018-05-29 江苏名通信息科技有限公司 Game translation system based on deep neural network
CN109033094A (en) * 2018-07-18 2018-12-18 五邑大学 A kind of writing in classical Chinese writings in the vernacular inter-translation method and system based on sequence to series neural network model

Also Published As

Publication number Publication date
CN109783825A (en) 2019-05-21

Similar Documents

Publication Publication Date Title
CN109783825B (en) Neural network-based ancient language translation method
WO2019196314A1 (en) Text information similarity matching method and apparatus, computer device, and storage medium
CN109948165B (en) Fine granularity emotion polarity prediction method based on mixed attention network
CN108628823B (en) Named entity recognition method combining attention mechanism and multi-task collaborative training
CN108399163B (en) Text similarity measurement method combining word aggregation and word combination semantic features
Yao et al. An improved LSTM structure for natural language processing
CN108363743B (en) Intelligent problem generation method and device and computer readable storage medium
Cui et al. Attention-over-attention neural networks for reading comprehension
Cho et al. Learning phrase representations using RNN encoder-decoder for statistical machine translation
CN109726389B (en) Chinese missing pronoun completion method based on common sense and reasoning
CN109992780B (en) Specific target emotion classification method based on deep neural network
US8386234B2 (en) Method for generating a text sentence in a target language and text sentence generating apparatus
CN110737758A (en) Method and apparatus for generating a model
CN109871541B (en) Named entity identification method suitable for multiple languages and fields
CN111444700A (en) Text similarity measurement method based on semantic document expression
CN107967318A (en) A kind of Chinese short text subjective item automatic scoring method and system using LSTM neutral nets
CN108647191B (en) Sentiment dictionary construction method based on supervised sentiment text and word vector
CN109726745B (en) Target-based emotion classification method integrating description knowledge
CN110765769A (en) Entity attribute dependency emotion analysis method based on clause characteristics
Xu et al. Sentence segmentation for classical Chinese based on LSTM with radical embedding
Sun et al. VCWE: visual character-enhanced word embeddings
CN114943230A (en) Chinese specific field entity linking method fusing common knowledge
Xu et al. Implicitly incorporating morphological information into word embedding
Ye et al. Improving cross-domain Chinese word segmentation with word embeddings
Greenstein et al. Japanese-to-english machine translation using recurrent neural networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant