CN109783825B - Neural network-based ancient language translation method - Google Patents
- Publication number: CN109783825B (application CN201910012805.0A)
- Authority: CN (China)
- Prior art keywords: ancient, clause, modern Chinese, translation, sequence
- Prior art date: 2019-01-07
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Abstract
The invention discloses a neural-network-based ancient Chinese translation method comprising the following steps: S1, obtain ancient text chapters and the corresponding translation data as an initial sample; S2, sequentially perform clause alignment, data word segmentation, and data augmentation on the initial sample to obtain an ancient-translation corpus; S3, take the ancient-translation corpus as the database of a neural machine translation model and train the sequence-to-sequence model to obtain a trained neural network; S4, take the ancient Chinese to be translated as the input of the trained neural network and complete the translation. By introducing several dictionaries for word segmentation, the invention supplies the translation model with accurate person-name, place-name, and colloquialism information and improves the translation of proper nouns; clauses are aligned automatically, the implicit alignment between characters and words is completed by an attention mechanism, and the ancient clauses to be translated are translated by the neural network, effectively improving translation efficiency and accuracy.
Description
Technical Field
The invention relates to the field of ancient language translation, in particular to an ancient language translation method based on a neural network.
Background
The achievements of the ancients in thought, science, and literature are the crystallization of the wisdom and toil of the Chinese nation, a national treasure that cannot be discarded. Most of this cultural heritage survives in written form, yet the language of the ancients differs greatly from that of modern speakers, which makes the heritage difficult for modern readers to understand and silently raises a high threshold for research on antiquity. In the past, such texts could only be translated word by word and sentence by sentence by scholars of classical Chinese, which is time-consuming, labor-intensive, and costly.
Disclosure of Invention
Aiming at the above defects in the prior art, the neural-network-based ancient language translation method provided by the invention solves the problem of the low efficiency of word-by-word, sentence-by-sentence manual translation.
In order to achieve the above purpose, the invention adopts the following technical scheme:
The neural-network-based ancient Chinese translation method comprises the following steps:
S1, obtaining ancient text chapters and the corresponding translation data as the initial sample;
S2, carrying out clause alignment, data word segmentation, and data augmentation on the initial sample in sequence to obtain an ancient-translation corpus;
S3, taking the ancient-translation corpus as the database of a neural machine translation model and training the sequence-to-sequence model to obtain a trained neural network;
S4, taking the ancient Chinese to be translated as the input of the trained neural network and completing the translation of the ancient Chinese.
Further, the specific method of step S1 is:
crawling ancient text chapters and the corresponding translation data from the Internet, cleaning the crawled data, and taking the cleaned data as the initial sample.
Further, the method for clause alignment of the initial samples in step S2 includes the following sub-steps:
S2-1-1, performing word segmentation on the modern Chinese in the initial sample, and matching the ancient text with the modern Chinese from left to right;
S2-1-2, deleting the matched words from the original sentences; for the ancient text that has no modern Chinese correspondence, introducing an ancient-Chinese dictionary, building an inverse-document-frequency (IDF) dictionary, and obtaining the IDF score of each character of the unmatched ancient text;
S2-1-3, retrieving the definition of each unmatched ancient character from the ancient-Chinese dictionary and matching it against the remaining modern Chinese vocabulary;
S2-1-4, according to the formula

L(s, t) = (1/|s|) · [ Σ_{c∈s} δ(c, t) + β · Σ_{c∈s̄} Σ_{k∈def(c)} δ̄(k, t̄) · idf_k ]

obtaining the lexical matching degree L(s, t), where t denotes a modern Chinese clause; s denotes an ancient clause; |s| denotes the length of the ancient clause; δ(c, t) is an indicator function that equals 1 if the character c in s can be matched directly with a word in the modern Chinese clause t, and 0 otherwise; s̄ and t̄ are the strings formed by the characters of s and t that remain unmatched; δ̄(k, t̄) is an indicator function: if a character k in the modern-Chinese definition def(c) of the ancient character c matches the remaining modern Chinese vocabulary, its score idf_k is taken from the IDF dictionary, and otherwise it contributes 0; and β is the normalization parameter of the inverse document frequency;
S2-1-5, establishing a translation correspondence model between ancient clauses and modern Chinese clauses, the correspondence modes of the model including the 1→0, 0→1, 1→2, 2→1, and 2→2 modes, where → denotes translation correspondence, the number before → is the number of ancient clauses involved, and the number after → is the number of modern Chinese clauses involved;
S2-1-6, obtaining the probability Pr(a→b) of each correspondence mode of the translation correspondence model for each ancient clause, where 0 ≤ a ≤ 2 and 0 ≤ b ≤ 2;
S2-1-7, obtaining the length ratio of each natural paragraph of the ancient Chinese to its corresponding natural paragraph of the modern Chinese, and obtaining the mean μ and standard deviation σ of all the length ratios;
S2-1-8, according to the formula

S(s, t) = N(|t|/|s|; μ, σ²) = (1/(σ·√(2π))) · exp(−(|t|/|s| − μ)² / (2σ²))

obtaining the statistical score S(s, t), where N(·; μ, σ²) is the normal-distribution probability density function;
S2-1-9, obtaining the edit-distance score E(s, t) from EditDis(s, t), the number of operations needed to convert a sentence of ancient text into the modern Chinese, counting the total number of insertions, deletions, and substitutions;
S2-1-10, according to the formula

d(s, t) = L(s, t) + γ·S(s, t) + λ·E(s, t)

and the recurrence over the correspondence modes

D(i, j) = max{ D(i−1, j) + d(s_i, NULL)·Pr(1→0), D(i, j−1) + d(NULL, t_j)·Pr(0→1), D(i−1, j−1) + d(s_i, t_j)·Pr(1→1), D(i−1, j−2) + d(s_i, t_{j−1} t_j)·Pr(1→2), D(i−2, j−1) + d(s_{i−1} s_i, t_j)·Pr(2→1), D(i−2, j−2) + d(s_{i−1} s_i, t_{j−1} t_j)·Pr(2→2) }

obtaining the score D(i, j) of making any ancient clause correspond to each modern Chinese clause, where D(i, j) is the score of making the i-th ancient clause correspond to the j-th modern Chinese clause; γ and λ are both weight parameters; s_i is the i-th ancient clause; s_{i−1} is the (i−1)-th ancient clause; s_{i−2} is the (i−2)-th ancient clause; t_j is the j-th modern Chinese clause; t_{j−1} is the (j−1)-th modern Chinese clause; t_{j−2} is the (j−2)-th modern Chinese clause; and NULL denotes an empty sentence, i.e., no corresponding clause.
S2-1-11, for any ancient clause, selecting the modern Chinese clause with the largest corresponding score as the alignment clause to complete clause alignment.
Further, the specific method for data word segmentation in step S2 is as follows:
constructing a person-name dictionary, a place-name dictionary, and a colloquialism dictionary, and segmenting the person names, place names, and colloquialisms in the ancient clauses according to the constructed dictionaries.
Further, the specific method for data augmentation in step S2 includes the following sub-steps:
S2-2-1, using word2vec to construct a near-synonym dictionary, keeping for each word only the near-synonyms whose similarity exceeds 0.8, to obtain a cleaned near-synonym dictionary in which each entry consists of one word and the two or three words closest to it, completing near-synonym augmentation;
S2-2-2, splicing each piece of data with the data that follow it until the sentence-ending punctuation is an exclamation mark, a question mark, or a full stop, or until four clauses have been spliced, and using the spliced clauses as new clause data, completing clause-based augmentation;
S2-2-3, using the GIZA++ alignment tool of statistical machine translation to obtain the alignment between all the words of each ancient sentence and all the words of its corresponding modern Chinese, and adjusting the ancient word order according to the alignment information, obtaining the ancient-translation corpus.
Further, the specific method of step S3 includes the following sub-steps:
S3-1, converting the ancient clauses in the ancient-translation corpus into vector form to obtain ancient-clause vectors, and inputting the ancient-clause vector corresponding to each ancient clause into the sequence-to-sequence model as a basic training unit;
S3-2, according to the formulas

forget_m = sigmoid(W_1 · [hidden_{m−1}, x_m] + b_1)
input_m = sigmoid(W_2 · [hidden_{m−1}, x_m] + b_2)
C_m = forget_m * C_{m−1} + input_m * tanh(W_3 · [hidden_{m−1}, x_m] + b_3)
output_m = sigmoid(W_4 · [hidden_{m−1}, x_m] + b_4)
hidden_m = output_m * tanh(C_m)

obtaining the hidden-layer state hidden_m of any neuron in the encoder of the sequence-to-sequence model after the m-th ancient-text vector element x_m of the basic training unit is input, where hidden_{m−1} is the hidden-layer state of the neuron in the encoder after the (m−1)-th ancient-text vector element is input; sigmoid(·) is the sigmoid function; tanh(·) is the hyperbolic tangent function; forget_m, input_m, C_m, and output_m are all intermediate quantities after the m-th ancient-text vector element is input; C_{m−1} is the intermediate quantity after the (m−1)-th ancient-text vector element is input; b_1, b_2, b_3, and b_4 all denote biases; W_1, W_2, W_3, and W_4 all denote weights; and the initial state of the hidden layer in the encoder is set by random initialization;
S3-3, combining the hidden-layer states of all the neurons of the encoder of the sequence-to-sequence model after the last ancient-character vector element is input into one vector, obtaining the hidden-layer state vector hidden_M of the encoder corresponding to the current basic training unit;
S3-4, taking the modern Chinese clause corresponding to the ancient clause input to the encoder as the basic checking unit, inputting it to the decoder, and according to the formulas

forget_n = sigmoid(W_5 · [state_{n−1}, y_n] + b_5)
input_n = sigmoid(W_6 · [state_{n−1}, y_n] + b_6)
C_n = forget_n * C_{n−1} + input_n * tanh(W_7 · [state_{n−1}, y_n] + b_7)
output_n = sigmoid(W_8 · [state_{n−1}, y_n] + b_8)
state_n = output_n * tanh(C_n)

obtaining the hidden-layer state state_n of any neuron in the decoder of the sequence-to-sequence model after the n-th modern Chinese word y_n is input, where state_{n−1} is the hidden-layer state of the neuron in the decoder after the (n−1)-th modern Chinese word is input; sigmoid(·) is the sigmoid function; tanh(·) is the hyperbolic tangent function; forget_n, input_n, C_n, and output_n are all intermediate quantities after the n-th modern Chinese word is input; C_{n−1} is the intermediate quantity after the (n−1)-th modern Chinese word is input; b_5, b_6, b_7, and b_8 all denote biases; W_5, W_6, W_7, and W_8 all denote weights; and the initial state of the hidden layer in the decoder is set to the value of the hidden-layer state of the encoder;
S3-5, combining the hidden-layer states of all the neurons of the decoder of the sequence-to-sequence model after the n-th modern Chinese word is input into one vector, obtaining the hidden-layer state vector state_Y of the decoder corresponding to the n-th modern Chinese word;
S3-6, according to the formulas

e_{nM} = bmm(state_Y, hidden_M)
e_{nx} = bmm(state_Y, hidden_x)
a_{nM} = exp(e_{nM}) / Σ_{x=1}^{M} exp(e_{nx})

obtaining the attention a_{nM} of the sequence-to-sequence model between the last ancient-character vector element and the n-th modern Chinese word, where exp(·) is the exponential function with the natural constant e as its base; e_{nM} and e_{nx} are intermediate quantities; bmm(·) denotes the dot product; M is the number of elements of the ancient-text vector; and hidden_x is the hidden-layer state of the encoder after the x-th ancient-text vector element is input, hidden_x ∈ (hidden_1, hidden_2, …, hidden_M);
S3-7, according to the formula

context_n = Σ_{x=1}^{M} a_{nx} · hidden_x

obtaining context_n, the weighted average of the hidden-layer states output by the encoder for the input ancient clause that corresponds to the n-th modern Chinese word, i.e., the context vector of the n-th modern Chinese word;
S3-8, according to the formula

state′_n = tanh(W_context · [context_n; state_n])

concatenating the context vector context_n of the n-th modern Chinese word with the hidden-layer state state_n of the decoder and feeding the result through the fully connected network W_context, obtaining the concatenated state state′_n;
S3-9, according to the formula

p_n = softmax(W_s · state′_n)

obtaining the output p_n of the sequence-to-sequence model for the n-th modern Chinese word, and thereby the output p = (p_1, p_2, …, p_N) of the model for the whole modern Chinese sentence, where softmax(·) is the softmax function and W_s is a network weight;
S3-10, according to the formula

Loss = −(1/N) · Σ_{n=1}^{N} log p_n(y_n)

obtaining the difference Loss between the output p of the sequence-to-sequence model for the modern Chinese sentence and the true answer y; if the difference Loss is greater than the threshold, updating the parameters of the sequence-to-sequence model until the difference Loss is less than or equal to the threshold, obtaining the trained neural network, where N is the total number of words in the modern Chinese sentence; y_n is the n-th modern Chinese word, y_n ∈ y; and the true answer y is the modern Chinese input to the decoder.
The invention has the following beneficial effects: by introducing several dictionaries for word segmentation, the invention supplies the translation model with accurate person-name, place-name, and colloquialism information and improves the translation of proper nouns; clauses are aligned automatically, the implicit alignment between characters and words is completed by the attention mechanism, and the ancient clauses to be translated are translated by the neural network, effectively improving translation efficiency and accuracy.
Drawings
FIG. 1 is a schematic flow chart of the present invention.
Detailed Description
The following description of the embodiments of the invention is provided to help those skilled in the art understand the invention, but it should be understood that the invention is not limited to the scope of these embodiments. For those of ordinary skill in the art, various changes are possible without departing from the spirit and scope of the invention as defined by the appended claims, and everything made using the inventive concept falls within the protection of the invention.
As shown in FIG. 1, the neural-network-based ancient language translation method comprises the following steps:
S1, obtaining ancient text chapters and the corresponding translation data as the initial sample;
S2, carrying out clause alignment, data word segmentation, and data augmentation on the initial sample in sequence to obtain an ancient-translation corpus;
S3, taking the ancient-translation corpus as the database of a neural machine translation model and training the sequence-to-sequence model to obtain a trained neural network;
S4, taking the ancient Chinese to be translated as the input of the trained neural network and completing the translation of the ancient Chinese.
The specific method of step S1 is:
crawling ancient text chapters and the corresponding translation data from the Internet, cleaning the crawled data, and taking the cleaned data as the initial sample.
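By way of a non-limiting illustration, the following Python sketch shows one possible cleaning routine for the crawled pages; the function name and the cleaning rules are assumptions of this sketch, since the invention does not prescribe a particular cleaning procedure:

```python
import re

def clean_crawled_text(raw_html: str) -> str:
    """One possible cleaning routine for a crawled page: drop scripts,
    styles, tags, and entities, then normalize whitespace. A sketch only;
    the invention does not fix the cleaning rules."""
    text = re.sub(r"<(script|style)[^>]*>.*?</\1>", "", raw_html, flags=re.S | re.I)
    text = re.sub(r"<[^>]+>", "", text)           # strip remaining HTML tags
    text = re.sub(r"&[#a-zA-Z0-9]+;", " ", text)  # drop HTML entities
    text = re.sub(r"[ \t\u3000]+", " ", text)     # collapse spaces (incl. full-width)
    return "\n".join(line.strip() for line in text.splitlines() if line.strip())
```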
The method for clause alignment of the initial samples in step S2 includes the following sub-steps:
S2-1-1, performing word segmentation on the modern Chinese in the initial sample, and matching the ancient text with the modern Chinese from left to right;
S2-1-2, deleting the matched words from the original sentences; for the ancient text that has no modern Chinese correspondence, introducing an ancient-Chinese dictionary, building an inverse-document-frequency (IDF) dictionary, and obtaining the IDF score of each character of the unmatched ancient text;
S2-1-3, retrieving the definition of each unmatched ancient character from the ancient-Chinese dictionary and matching it against the remaining modern Chinese vocabulary;
S2-1-4, according to the formula

L(s, t) = (1/|s|) · [ Σ_{c∈s} δ(c, t) + β · Σ_{c∈s̄} Σ_{k∈def(c)} δ̄(k, t̄) · idf_k ]

obtaining the lexical matching degree L(s, t), where t denotes a modern Chinese clause; s denotes an ancient clause; |s| denotes the length of the ancient clause; δ(c, t) is an indicator function that equals 1 if the character c in s can be matched directly with a word in the modern Chinese clause t, and 0 otherwise; s̄ and t̄ are the strings formed by the characters of s and t that remain unmatched; δ̄(k, t̄) is an indicator function: if a character k in the modern-Chinese definition def(c) of the ancient character c matches the remaining modern Chinese vocabulary, its score idf_k is taken from the IDF dictionary, and otherwise it contributes 0; and β is the normalization parameter of the inverse document frequency;
S2-1-5, establishing a translation correspondence model between ancient clauses and modern Chinese clauses, the correspondence modes of the model including the 1→0, 0→1, 1→2, 2→1, and 2→2 modes, where → denotes translation correspondence, the number before → is the number of ancient clauses involved, and the number after → is the number of modern Chinese clauses involved;
S2-1-6, obtaining the probability Pr(a→b) of each correspondence mode of the translation correspondence model for each ancient clause, where 0 ≤ a ≤ 2 and 0 ≤ b ≤ 2;
S2-1-7, obtaining the length ratio of each natural paragraph of the ancient Chinese to its corresponding natural paragraph of the modern Chinese, and obtaining the mean μ and standard deviation σ of all the length ratios;
S2-1-8, according to the formula

S(s, t) = N(|t|/|s|; μ, σ²) = (1/(σ·√(2π))) · exp(−(|t|/|s| − μ)² / (2σ²))

obtaining the statistical score S(s, t), where N(·; μ, σ²) is the normal-distribution probability density function;
S2-1-9, obtaining the edit-distance score E(s, t) from EditDis(s, t), the number of operations needed to convert a sentence of ancient text into the modern Chinese, counting the total number of insertions, deletions, and substitutions;
S2-1-10, according to the formula

d(s, t) = L(s, t) + γ·S(s, t) + λ·E(s, t)

and the recurrence over the correspondence modes

D(i, j) = max{ D(i−1, j) + d(s_i, NULL)·Pr(1→0), D(i, j−1) + d(NULL, t_j)·Pr(0→1), D(i−1, j−1) + d(s_i, t_j)·Pr(1→1), D(i−1, j−2) + d(s_i, t_{j−1} t_j)·Pr(1→2), D(i−2, j−1) + d(s_{i−1} s_i, t_j)·Pr(2→1), D(i−2, j−2) + d(s_{i−1} s_i, t_{j−1} t_j)·Pr(2→2) }

obtaining the score D(i, j) of making any ancient clause correspond to each modern Chinese clause, where D(i, j) is the score of making the i-th ancient clause correspond to the j-th modern Chinese clause; γ and λ are both weight parameters; s_i is the i-th ancient clause; s_{i−1} is the (i−1)-th ancient clause; s_{i−2} is the (i−2)-th ancient clause; t_j is the j-th modern Chinese clause; t_{j−1} is the (j−1)-th modern Chinese clause; t_{j−2} is the (j−2)-th modern Chinese clause; and NULL denotes an empty sentence, i.e., no corresponding clause (steps S2-1-4 to S2-1-11 are illustrated in the sketch following step S2-1-11).
S2-1-11, for any ancient clause, selecting the modern Chinese clause with the largest corresponding score as the alignment clause to complete clause alignment.
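By way of a non-limiting illustration of steps S2-1-4 to S2-1-11, the following Python sketch computes the three scores and selects, for each ancient clause, the best-scoring modern clause. The helper names, the character-level matching inside lexical_score, the greedy selection (in place of the full recurrence of step S2-1-10), and the edit-distance normalization are assumptions of this sketch:

```python
import math

def lexical_score(s, t_words, gloss_of, idf, beta=0.1):
    """L(s, t): direct character matches plus IDF-weighted matches routed
    through the ancient-character dictionary, normalized by |s| (S2-1-4)."""
    t_chars = set("".join(t_words))
    matched = [c for c in s if c in t_chars]           # indicator term
    rest_s = [c for c in s if c not in t_chars]        # unmatched ancient characters
    rest_t = t_chars - set(matched)                    # unmatched modern characters
    score = float(len(matched))
    for c in rest_s:
        for k in gloss_of.get(c, ""):                  # modern gloss of character c
            if k in rest_t:
                score += beta * idf.get(k, 0.0)        # idf_k from the IDF dictionary
                break
    return score / max(len(s), 1)

def length_score(s, t, mu, sigma):
    """S(s, t): normal pdf of the modern/ancient length ratio (S2-1-7/S2-1-8)."""
    ratio = len(t) / max(len(s), 1)
    return math.exp(-(ratio - mu) ** 2 / (2 * sigma ** 2)) / (sigma * math.sqrt(2 * math.pi))

def edit_score(s, t):
    """E(s, t): edit distance (insertions, deletions, substitutions), here
    normalized so that higher is better; one plausible choice (S2-1-9)."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1,
                          d[i - 1][j - 1] + (s[i - 1] != t[j - 1]))
    return 1.0 - d[m][n] / max(m, n, 1)

def align(ancient, modern, seg, gloss_of, idf, mu, sigma, gamma=1.0, lam=1.0):
    """S2-1-10/S2-1-11 in greedy form: for each ancient clause pick the modern
    clause maximizing d(s, t) = L + gamma*S + lambda*E."""
    pairs = []
    for s in ancient:
        best = max(modern, key=lambda t: (
            lexical_score(s, seg(t), gloss_of, idf)
            + gamma * length_score(s, t, mu, sigma)
            + lam * edit_score(s, t)))
        pairs.append((s, best))
    return pairs
```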
The specific method for data word segmentation in step S2 is as follows:
constructing a person-name dictionary, a place-name dictionary, and a colloquialism dictionary, and segmenting the person names, place names, and colloquialisms in the ancient clauses according to the constructed dictionaries.
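By way of a non-limiting illustration, the following Python sketch segments an ancient clause against the three dictionaries by forward maximum matching; the matching strategy and the maximum entry length are assumptions of this sketch, since the invention names the dictionaries but not the matching algorithm:

```python
def dict_segment(clause, names, places, idioms, max_len=4):
    """Forward maximum matching against the person-name, place-name, and
    colloquialism dictionaries; characters covered by no dictionary entry
    are emitted singly."""
    entries = set(names) | set(places) | set(idioms)
    tokens, i = [], 0
    while i < len(clause):
        for L in range(min(max_len, len(clause) - i), 0, -1):
            if L > 1 and clause[i:i + L] in entries:
                tokens.append(clause[i:i + L])   # longest dictionary entry wins
                i += L
                break
        else:
            tokens.append(clause[i])             # fall back to a single character
            i += 1
    return tokens
```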
The specific method for data augmentation in step S2 includes the following substeps:
S2-2-1, using word2vec to construct a near-synonym dictionary, keeping for each word only the near-synonyms whose similarity exceeds 0.8, to obtain a cleaned near-synonym dictionary in which each entry consists of one word and the two or three words closest to it, completing near-synonym augmentation; when computing similarity, the cosine of the angle between the word2vec vectors of the two words is taken as the similarity value;
S2-2-2, splicing each piece of data with the data that follow it until the sentence-ending punctuation is an exclamation mark, a question mark, or a full stop, or until four clauses have been spliced, and using the spliced clauses as new clause data, completing clause-based augmentation;
S2-2-3, using the GIZA++ alignment tool of statistical machine translation to obtain the alignment between all the words of each ancient sentence and all the words of its corresponding modern Chinese, and adjusting the ancient word order according to the alignment information, obtaining the ancient-translation corpus (steps S2-2-1 and S2-2-2 are illustrated in the sketch below).
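By way of a non-limiting illustration of steps S2-2-1 and S2-2-2, the following Python sketch builds the near-synonym dictionary with the gensim implementation of word2vec (gensim is an assumption; the invention only specifies word2vec, and gensim's most_similar ranks neighbours by cosine similarity) and splices consecutive clauses up to a sentence-final mark or four clauses:

```python
from gensim.models import Word2Vec

def build_synonym_dict(token_lists, threshold=0.8, top_k=3):
    """Step S2-2-1: keep, for each word, the two or three nearest neighbours
    whose cosine similarity exceeds 0.8."""
    model = Word2Vec(token_lists, vector_size=100, window=5, min_count=2)
    synonyms = {}
    for word in model.wv.index_to_key:
        near = [w for w, sim in model.wv.most_similar(word, topn=top_k)
                if sim > threshold]
        if near:
            synonyms[word] = near
    return synonyms

def splice_clauses(clauses, max_join=4):
    """Step S2-2-2: join consecutive clauses until a sentence-final mark
    (exclamation, question, or full stop) or until four clauses are joined."""
    out, buf = [], []
    for c in clauses:
        buf.append(c)
        if (c and c[-1] in "！？。!?.") or len(buf) == max_join:
            out.append("".join(buf))
            buf = []
    if buf:
        out.append("".join(buf))
    return out
```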
The specific method of step S3 includes the following substeps:
S3-1, converting the ancient clauses in the ancient-translation corpus into vector form to obtain ancient-clause vectors, and inputting the ancient-clause vector corresponding to each ancient clause into the sequence-to-sequence model as a basic training unit;
S3-2, according to the formulas

forget_m = sigmoid(W_1 · [hidden_{m−1}, x_m] + b_1)
input_m = sigmoid(W_2 · [hidden_{m−1}, x_m] + b_2)
C_m = forget_m * C_{m−1} + input_m * tanh(W_3 · [hidden_{m−1}, x_m] + b_3)
output_m = sigmoid(W_4 · [hidden_{m−1}, x_m] + b_4)
hidden_m = output_m * tanh(C_m)

obtaining the hidden-layer state hidden_m of any neuron in the encoder of the sequence-to-sequence model after the m-th ancient-text vector element x_m of the basic training unit is input, where hidden_{m−1} is the hidden-layer state of the neuron in the encoder after the (m−1)-th ancient-text vector element is input; sigmoid(·) is the sigmoid function; tanh(·) is the hyperbolic tangent function; forget_m, input_m, C_m, and output_m are all intermediate quantities after the m-th ancient-text vector element is input; C_{m−1} is the intermediate quantity after the (m−1)-th ancient-text vector element is input; b_1, b_2, b_3, and b_4 all denote biases; W_1, W_2, W_3, and W_4 all denote weights; and the initial state of the hidden layer in the encoder is set by random initialization (steps S3-2 to S3-10 are illustrated in the code sketch following step S3-10);
S3-3, combining the hidden-layer states of all the neurons of the encoder of the sequence-to-sequence model after the last ancient-character vector element is input into one vector, obtaining the hidden-layer state vector hidden_M of the encoder corresponding to the current basic training unit;
S3-4, taking the modern Chinese clause corresponding to the ancient clause input to the encoder as the basic checking unit, inputting it to the decoder, and according to the formulas

forget_n = sigmoid(W_5 · [state_{n−1}, y_n] + b_5)
input_n = sigmoid(W_6 · [state_{n−1}, y_n] + b_6)
C_n = forget_n * C_{n−1} + input_n * tanh(W_7 · [state_{n−1}, y_n] + b_7)
output_n = sigmoid(W_8 · [state_{n−1}, y_n] + b_8)
state_n = output_n * tanh(C_n)

obtaining the hidden-layer state state_n of any neuron in the decoder of the sequence-to-sequence model after the n-th modern Chinese word y_n is input, where state_{n−1} is the hidden-layer state of the neuron in the decoder after the (n−1)-th modern Chinese word is input; sigmoid(·) is the sigmoid function; tanh(·) is the hyperbolic tangent function; forget_n, input_n, C_n, and output_n are all intermediate quantities after the n-th modern Chinese word is input; C_{n−1} is the intermediate quantity after the (n−1)-th modern Chinese word is input; b_5, b_6, b_7, and b_8 all denote biases; W_5, W_6, W_7, and W_8 all denote weights; and the initial state of the hidden layer in the decoder is set to the value of the hidden-layer state of the encoder;
S3-5, combining the hidden-layer states of all the neurons of the decoder of the sequence-to-sequence model after the n-th modern Chinese word is input into one vector, obtaining the hidden-layer state vector state_Y of the decoder corresponding to the n-th modern Chinese word;
S3-6, according to the formulas

e_{nM} = bmm(state_Y, hidden_M)
e_{nx} = bmm(state_Y, hidden_x)
a_{nM} = exp(e_{nM}) / Σ_{x=1}^{M} exp(e_{nx})

obtaining the attention a_{nM} of the sequence-to-sequence model between the last ancient-character vector element and the n-th modern Chinese word, where exp(·) is the exponential function with the natural constant e as its base; e_{nM} and e_{nx} are intermediate quantities; bmm(·) denotes the dot product; M is the number of elements of the ancient-text vector; and hidden_x is the hidden-layer state of the encoder after the x-th ancient-text vector element is input, hidden_x ∈ (hidden_1, hidden_2, …, hidden_M);
S3-7, according to the formula

context_n = Σ_{x=1}^{M} a_{nx} · hidden_x

obtaining context_n, the weighted average of the hidden-layer states output by the encoder for the input ancient clause that corresponds to the n-th modern Chinese word, i.e., the context vector of the n-th modern Chinese word;
S3-8, according to the formula

state′_n = tanh(W_context · [context_n; state_n])

concatenating the context vector context_n of the n-th modern Chinese word with the hidden-layer state state_n of the decoder and feeding the result through the fully connected network W_context, obtaining the concatenated state state′_n;
S3-9, according to the formula

p_n = softmax(W_s · state′_n)

obtaining the output p_n of the sequence-to-sequence model for the n-th modern Chinese word, and thereby the output p = (p_1, p_2, …, p_N) of the model for the whole modern Chinese sentence, where softmax(·) is the softmax function and W_s is a network weight;
S3-10, according to the formula

Loss = −(1/N) · Σ_{n=1}^{N} log p_n(y_n)

obtaining the difference Loss between the output p of the sequence-to-sequence model for the modern Chinese sentence and the true answer y; if the difference Loss is greater than the threshold, updating the parameters of the sequence-to-sequence model until the difference Loss is less than or equal to the threshold, obtaining the trained neural network, where N is the total number of words in the modern Chinese sentence; y_n is the n-th modern Chinese word, y_n ∈ y; and the true answer y is the modern Chinese input to the decoder.
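By way of a non-limiting illustration of steps S3-2 to S3-10, the following NumPy sketch implements one LSTM step with the named weights W_1 to W_4 and biases b_1 to b_4 (the cell-state update through W_3 and b_3 is reconstructed from the named parameters, as the source formula image is unavailable), the dot-product attention of steps S3-6 and S3-7, the concatenation and projection of steps S3-8 and S3-9, and a cross-entropy reading of the difference of step S3-10; all helper names and x_m, the m-th input element, are assumptions of this sketch:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, W1, W2, W3, W4, b1, b2, b3, b4):
    """One step of the encoder (S3-2) or decoder (S3-4): the gates act on
    the concatenation [hidden_{m-1}, x_m]."""
    z = np.concatenate([h_prev, x])
    forget = sigmoid(W1 @ z + b1)                      # forget_m
    inp = sigmoid(W2 @ z + b2)                         # input_m
    c = forget * c_prev + inp * np.tanh(W3 @ z + b3)   # C_m (reconstructed update)
    out = sigmoid(W4 @ z + b4)                         # output_m
    return out * np.tanh(c), c                         # hidden_m, C_m

def encode(xs, dim, rng):
    """S3-2/S3-3: run the ancient clause through the encoder and stack the
    per-step hidden states hidden_1 ... hidden_M."""
    in_dim = dim + xs[0].shape[0]
    Ws = [rng.normal(0, 0.1, (dim, in_dim)) for _ in range(4)]
    bs = [np.zeros(dim) for _ in range(4)]
    h, c = rng.normal(0, 0.1, dim), np.zeros(dim)      # random initial state (S3-2)
    states = []
    for x in xs:
        h, c = lstm_step(x, h, c, *Ws, *bs)
        states.append(h)
    return np.stack(states)

def attention_context(state_y, enc_states):
    """S3-6/S3-7: dot-product scores, softmax weights, weighted average."""
    e = enc_states @ state_y                           # e_nx = bmm(state_Y, hidden_x)
    a = np.exp(e - e.max())
    a = a / a.sum()                                    # attention weights a_nx
    return a @ enc_states                              # context_n

def word_output(context_n, state_n, W_context, W_s):
    """S3-8/S3-9: concatenate, pass through W_context, then project and softmax."""
    cascade = np.tanh(W_context @ np.concatenate([context_n, state_n]))
    logits = W_s @ cascade
    p = np.exp(logits - logits.max())
    return p / p.sum()                                 # p_n over the vocabulary

def sentence_loss(probs, target_ids):
    """S3-10: mean negative log-probability of the reference words y_n
    (cross-entropy; one plausible reading of the unshown difference)."""
    return -np.mean([np.log(p[y] + 1e-12) for p, y in zip(probs, target_ids)])
```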
In conclusion, by introducing several dictionaries for word segmentation, the invention supplies the translation model with accurate person-name, place-name, and colloquialism information and improves the translation of proper nouns; clauses are aligned automatically, the implicit alignment between characters and words is completed by the attention mechanism, and the ancient clauses to be translated are translated by the neural network, effectively improving translation efficiency and accuracy.
Claims (5)
1. An ancient language translation method based on a neural network is characterized by comprising the following steps:
S1, obtaining ancient text chapters and the corresponding translation data as the initial sample;
S2, carrying out clause alignment, data word segmentation, and data augmentation on the initial sample in sequence to obtain an ancient-translation corpus;
S3, taking the ancient-translation corpus as the database of a neural machine translation model and training the sequence-to-sequence model to obtain a trained neural network;
S4, taking the ancient Chinese to be translated as the input of the trained neural network and completing the translation of the ancient Chinese;
the method for clause alignment of the initial samples in step S2 includes the following sub-steps:
S2-1-1, performing word segmentation on the modern Chinese in the initial sample, and matching the ancient text with the modern Chinese from left to right;
S2-1-2, deleting the matched words from the original sentences; for the ancient text that has no modern Chinese correspondence, introducing an ancient-Chinese dictionary, building an inverse-document-frequency (IDF) dictionary, and obtaining the IDF score of each character of the unmatched ancient text;
S2-1-3, retrieving the definition of each unmatched ancient character from the ancient-Chinese dictionary and matching it against the remaining modern Chinese vocabulary;
S2-1-4, according to the formula

L(s, t) = (1/|s|) · [ Σ_{c∈s} δ(c, t) + β · Σ_{c∈s̄} Σ_{k∈def(c)} δ̄(k, t̄) · idf_k ]

obtaining the lexical matching degree L(s, t), where t denotes a modern Chinese clause; s denotes an ancient clause; |s| denotes the length of the ancient clause; δ(c, t) is an indicator function that equals 1 if the character c in s can be matched directly with a word in the modern Chinese clause t, and 0 otherwise; s̄ and t̄ are the strings formed by the characters of s and t that remain unmatched; δ̄(k, t̄) is an indicator function: if a character k in the modern-Chinese definition def(c) of the ancient character c matches the remaining modern Chinese vocabulary, its score idf_k is taken from the IDF dictionary, and otherwise it contributes 0; and β is the normalization parameter of the inverse document frequency;
S2-1-5, establishing a translation correspondence model between ancient clauses and modern Chinese clauses, the correspondence modes of the model including the 1→0, 0→1, 1→2, 2→1, and 2→2 modes, where → denotes translation correspondence, the number before → is the number of ancient clauses involved, and the number after → is the number of modern Chinese clauses involved;
S2-1-6, obtaining the probability Pr(a→b) of each correspondence mode of the translation correspondence model for each ancient clause, where 0 ≤ a ≤ 2 and 0 ≤ b ≤ 2;
S2-1-7, obtaining the length ratio of each natural paragraph of the ancient Chinese to its corresponding natural paragraph of the modern Chinese, and obtaining the mean μ and standard deviation σ of all the length ratios;
S2-1-8, according to the formula

S(s, t) = N(|t|/|s|; μ, σ²) = (1/(σ·√(2π))) · exp(−(|t|/|s| − μ)² / (2σ²))

obtaining the statistical score S(s, t), where N(·; μ, σ²) is the normal-distribution probability density function;
S2-1-9, obtaining the edit-distance score E(s, t) from EditDis(s, t), the number of operations needed to convert a sentence of ancient text into the modern Chinese, counting the total number of insertions, deletions, and substitutions;
S2-1-10, according to the formula

d(s, t) = L(s, t) + γ·S(s, t) + λ·E(s, t)

and the recurrence over the correspondence modes

D(i, j) = max{ D(i−1, j) + d(s_i, NULL)·Pr(1→0), D(i, j−1) + d(NULL, t_j)·Pr(0→1), D(i−1, j−1) + d(s_i, t_j)·Pr(1→1), D(i−1, j−2) + d(s_i, t_{j−1} t_j)·Pr(1→2), D(i−2, j−1) + d(s_{i−1} s_i, t_j)·Pr(2→1), D(i−2, j−2) + d(s_{i−1} s_i, t_{j−1} t_j)·Pr(2→2) }

obtaining the score D(i, j) of making any ancient clause correspond to each modern Chinese clause, where D(i, j) is the score of making the i-th ancient clause correspond to the j-th modern Chinese clause; γ and λ are both weight parameters; s_i is the i-th ancient clause; s_{i−1} is the (i−1)-th ancient clause; s_{i−2} is the (i−2)-th ancient clause; t_j is the j-th modern Chinese clause; t_{j−1} is the (j−1)-th modern Chinese clause; t_{j−2} is the (j−2)-th modern Chinese clause; and NULL denotes an empty sentence, i.e., no corresponding clause;
S2-1-11, for any ancient clause, selecting the modern Chinese clause with the largest corresponding score as the alignment clause to complete clause alignment.
2. The neural-network-based ancient language translation method according to claim 1, wherein the specific method of step S1 is:
crawling ancient text chapters and the corresponding translation data from the Internet, cleaning the crawled data, and taking the cleaned data as the initial sample.
3. The neural-network-based ancient language translation method according to claim 1, wherein the specific method of data word segmentation in step S2 is:
constructing a person-name dictionary, a place-name dictionary, and a colloquialism dictionary, and segmenting the person names, place names, and colloquialisms in the ancient clauses according to the constructed dictionaries.
4. The neural-network-based ancient language translation method according to claim 3, wherein the specific method of data augmentation in step S2 comprises the following sub-steps:
S2-2-1, using word2vec to construct a near-synonym dictionary, keeping for each word only the near-synonyms whose similarity exceeds 0.8, to obtain a cleaned near-synonym dictionary in which each entry consists of one word and the two or three words closest to it, completing near-synonym augmentation;
S2-2-2, splicing each piece of data with the data that follow it until the sentence-ending punctuation is an exclamation mark, a question mark, or a full stop, or until four clauses have been spliced, and using the spliced clauses as new clause data, completing clause-based augmentation;
S2-2-3, using the GIZA++ alignment tool of statistical machine translation to obtain the alignment between all the words of each ancient sentence and all the words of its corresponding modern Chinese, and adjusting the ancient word order according to the alignment information, obtaining the ancient-translation corpus.
5. The neural-network-based ancient language translation method according to claim 4, wherein the specific method of step S3 comprises the following sub-steps:
S3-1, converting the ancient clauses in the ancient-translation corpus into vector form to obtain ancient-clause vectors, and inputting the ancient-clause vector corresponding to each ancient clause into the sequence-to-sequence model as a basic training unit;
S3-2, according to the formulas

forget_m = sigmoid(W_1 · [hidden_{m−1}, x_m] + b_1)
input_m = sigmoid(W_2 · [hidden_{m−1}, x_m] + b_2)
C_m = forget_m * C_{m−1} + input_m * tanh(W_3 · [hidden_{m−1}, x_m] + b_3)
output_m = sigmoid(W_4 · [hidden_{m−1}, x_m] + b_4)
hidden_m = output_m * tanh(C_m)

obtaining the hidden-layer state hidden_m of any neuron in the encoder of the sequence-to-sequence model after the m-th ancient-text vector element x_m of the basic training unit is input, where hidden_{m−1} is the hidden-layer state of the neuron in the encoder after the (m−1)-th ancient-text vector element is input; sigmoid(·) is the sigmoid function; tanh(·) is the hyperbolic tangent function; forget_m, input_m, C_m, and output_m are all intermediate quantities after the m-th ancient-text vector element is input; C_{m−1} is the intermediate quantity after the (m−1)-th ancient-text vector element is input; b_1, b_2, b_3, and b_4 all denote biases; W_1, W_2, W_3, and W_4 all denote weights; and the initial state of the hidden layer in the encoder is set by random initialization;
S3-3, combining the hidden-layer states of all the neurons of the encoder of the sequence-to-sequence model after the last ancient-character vector element is input into one vector, obtaining the hidden-layer state vector hidden_M of the encoder corresponding to the current basic training unit;
S3-4, taking the modern Chinese clause corresponding to the ancient clause input to the encoder as the basic checking unit, inputting it to the decoder, and according to the formulas

forget_n = sigmoid(W_5 · [state_{n−1}, y_n] + b_5)
input_n = sigmoid(W_6 · [state_{n−1}, y_n] + b_6)
C_n = forget_n * C_{n−1} + input_n * tanh(W_7 · [state_{n−1}, y_n] + b_7)
output_n = sigmoid(W_8 · [state_{n−1}, y_n] + b_8)
state_n = output_n * tanh(C_n)

obtaining the hidden-layer state state_n of any neuron in the decoder of the sequence-to-sequence model after the n-th modern Chinese word y_n is input, where state_{n−1} is the hidden-layer state of the neuron in the decoder after the (n−1)-th modern Chinese word is input; sigmoid(·) is the sigmoid function; tanh(·) is the hyperbolic tangent function; forget_n, input_n, C_n, and output_n are all intermediate quantities after the n-th modern Chinese word is input; C_{n−1} is the intermediate quantity after the (n−1)-th modern Chinese word is input; b_5, b_6, b_7, and b_8 all denote biases; W_5, W_6, W_7, and W_8 all denote weights; and the initial state of the hidden layer in the decoder is set to the value of the hidden-layer state of the encoder;
S3-5, combining the hidden-layer states of all the neurons of the decoder of the sequence-to-sequence model after the n-th modern Chinese word is input into one vector, obtaining the hidden-layer state vector state_Y of the decoder corresponding to the n-th modern Chinese word;
S3-6, according to the formulas

e_{nM} = bmm(state_Y, hidden_M)
e_{nx} = bmm(state_Y, hidden_x)
a_{nM} = exp(e_{nM}) / Σ_{x=1}^{M} exp(e_{nx})

obtaining the attention a_{nM} of the sequence-to-sequence model between the last ancient-character vector element and the n-th modern Chinese word, where exp(·) is the exponential function with the natural constant e as its base; e_{nM} and e_{nx} are intermediate quantities; bmm(·) denotes the dot product; M is the number of elements of the ancient-text vector; and hidden_x is the hidden-layer state of the encoder after the x-th ancient-text vector element is input, hidden_x ∈ (hidden_1, hidden_2, …, hidden_M);
S3-7, according to the formula

context_n = Σ_{x=1}^{M} a_{nx} · hidden_x

obtaining context_n, the weighted average of the hidden-layer states output by the encoder for the input ancient clause that corresponds to the n-th modern Chinese word, i.e., the context vector of the n-th modern Chinese word;
S3-8, according to the formula

state′_n = tanh(W_context · [context_n; state_n])

concatenating the context vector context_n of the n-th modern Chinese word with the hidden-layer state state_n of the decoder and feeding the result through the fully connected network W_context, obtaining the concatenated state state′_n;
S3-9, according to the formula

p_n = softmax(W_s · state′_n)

obtaining the output p_n of the sequence-to-sequence model for the n-th modern Chinese word, and thereby the output p = (p_1, p_2, …, p_N) of the model for the whole modern Chinese sentence, where softmax(·) is the softmax function and W_s is a network weight;
S3-10, according to the formula

Loss = −(1/N) · Σ_{n=1}^{N} log p_n(y_n)

obtaining the difference Loss between the output p of the sequence-to-sequence model for the modern Chinese sentence and the true answer y; if the difference Loss is greater than the threshold, updating the parameters of the sequence-to-sequence model until the difference Loss is less than or equal to the threshold, obtaining the trained neural network, where N is the total number of words in the modern Chinese sentence; y_n is the n-th modern Chinese word, y_n ∈ y; and the true answer y is the modern Chinese input to the decoder.
Priority Applications (1)

Application Number | Priority Date | Filing Date | Title
---|---|---|---
CN201910012805.0A | 2019-01-07 | 2019-01-07 | Neural network-based ancient language translation method
Publications (2)

Publication Number | Publication Date
---|---
CN109783825A | 2019-05-21
CN109783825B | 2020-04-28
Family
ID=66499178

Family Applications (1)

Application Number | Title | Priority Date | Filing Date
---|---|---|---
CN201910012805.0A | Neural network-based ancient language translation method | 2019-01-07 | 2019-01-07

Country Status (1)

Country | Link
---|---
CN | CN109783825B (en)
Families Citing this family (5)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
CN110222349B | 2019-06-13 | 2020-05-19 | 成都信息工程大学 | Method and computer for deep dynamic context word expression
CN110795552B | 2019-10-22 | 2024-01-23 | 腾讯科技(深圳)有限公司 | Training sample generation method and device, electronic equipment and storage medium
CN112270190A | 2020-11-13 | 2021-01-26 | 浩鲸云计算科技股份有限公司 | Attention mechanism-based database field translation method and system
CN116070643B | 2023-04-03 | 2023-08-15 | 武昌理工学院 | Fixed style translation method and system from ancient text to English
CN116701961B | 2023-08-04 | 2023-10-20 | 北京语言大学 | Method and system for automatically evaluating machine translation result of cultural relics
Citations (4)

Publication number | Priority date | Publication date | Assignee | Title
---|---|---|---|---
WO2007068123A1 | 2005-12-16 | 2007-06-21 | National Research Council of Canada | Method and system for training and applying a distortion component to machine translation
CN103955454A | 2014-03-19 | 2014-07-30 | 北京百度网讯科技有限公司 | Method and equipment for carrying out literary form conversion between vernacular Chinese and classical Chinese
CN108090050A | 2017-11-08 | 2018-05-29 | 江苏名通信息科技有限公司 | Game translation system based on deep neural network
CN109033094A | 2018-07-18 | 2018-12-18 | 五邑大学 | Classical-Chinese/vernacular inter-translation method and system based on a sequence-to-sequence neural network model
Legal Events

Date | Code | Title | Description
---|---|---|---
| PB01 | Publication |
| SE01 | Entry into force of request for substantive examination |
| GR01 | Patent grant |