CN109783825B

CN109783825B - Neural network-based ancient language translation method

Info

Publication number: CN109783825B
Application number: CN201910012805.0A
Authority: CN
Inventors: 吕建成; 杨可心; 屈茜; 刘大一恒
Original assignee: Sichuan University
Current assignee: Sichuan University
Priority date: 2019-01-07
Filing date: 2019-01-07
Publication date: 2020-04-28
Anticipated expiration: 2039-01-07
Also published as: CN109783825A

Abstract

The invention discloses an ancient Chinese translation method based on a neural network, which comprises the following steps: s1, obtaining an ancient text chapter and corresponding translation data as an initial sample, and S2, sequentially performing clause alignment, data word segmentation and data augmentation on the initial sample to obtain an ancient translation corpus; s3, taking the ancient language translation corpus as a database of a neural machine translation model, and training the sequence-to-sequence model to obtain a trained neural network; and S4, taking the ancient Chinese to be translated as the input of the trained neural network, and completing the translation of the ancient Chinese. The invention can provide accurate name, place name and colloquial information for the translation model by introducing various dictionaries for word segmentation, improve the translation effect of proper nouns, automatically align clauses, complete implicit alignment between characters and words by an attention mechanism, translate ancient clauses to be translated by a neural network and effectively improve the translation efficiency and accuracy.

Description

Neural network-based ancient language translation method

Technical Field

The invention relates to the field of ancient language translation, in particular to an ancient language translation method based on a neural network.

Background

The high degree of achievement and achievement of the ancient people in the aspects of thought, science and literature are the Chinese national wisdom and the crystallization of blood sweat, and the crystal as a national treasure is not discardable. Most of the cultural heritages are born in the form of literature, however, the languages used by the ancient people are greatly different from the languages used by the modern people, so that the ancient cultural heritages are difficult to understand by the modern people, and a high threshold is set for the ancient cultural research invisibly. In the past, the translation can only be carried out word by word and sentence by an ancient scholar, and the translation is time-consuming, labor-consuming and high in cost.

Disclosure of Invention

Aiming at the defects in the prior art, the ancient translation method based on the neural network solves the problem of low translation efficiency word by word and sentence by sentence.

In order to achieve the purpose of the invention, the invention adopts the technical scheme that:

the ancient Chinese translation method based on the neural network comprises the following steps:

s1, obtaining ancient text chapters and corresponding translation data as initial samples,

s2, carrying out clause alignment, data word segmentation and data augmentation operations on the initial sample in sequence to obtain an ancient translation corpus;

s3, taking the ancient language translation corpus as a database of a neural machine translation model, and training the sequence-to-sequence model to obtain a trained neural network;

and S4, taking the ancient Chinese to be translated as the input of the trained neural network, and completing the translation of the ancient Chinese.

Further, the specific method of step S1 is:

and (3) crawling the ancient text chapters and the corresponding translation data from the Internet, cleaning the crawled data, and taking the cleaned data as an initial sample.

Further, the method for clause alignment of the initial samples in step S2 includes the following sub-steps:

s2-1-1, performing word segmentation on the modern Chinese in the initial sample, and matching the ancient Chinese with the modern Chinese according to the sequence from left to right;

s2-1-2, deleting matched words from the original sentence, introducing an ancient sentence dictionary to establish an inverse document frequency dictionary for ancient sentences which do not correspond to the modern Chinese, and acquiring an inverse document frequency score of each character of the unmatched ancient sentences;

s2-1-3, retrieving each unmatched ancient character defined by the ancient dictionary and matching the character with the rest modern Chinese vocabulary;

s2-1-4, according to the formula

Obtaining the matching degree L (s, t) of lexical matching; wherein t represents a modern Chinese clause; s representsAn ancient clause; | s | represents the length of a foreign clause;

to indicate the function, if the character c in s can be directly matched with the word in the modern Chinese clause t

Is 1, otherwise

Is 0;

and

respectively forming character strings consisting of the characters which are remained in s and t and are not matched;

to indicate the function, if a character k in the modern interpretation of the ancient character c matches the vocabulary of the remaining modern Chinese, its score is taken from the IDF dictionary and is denoted as IDF_kOtherwise 0, β is the normalized parameter of the inverse document frequency;

s2-1-5, establishing a translation corresponding model of the ancient clause and the modern Chinese clause; wherein the translation correspondence modes of the translation correspondence model include a 1 → 0 mode, a 0 → 1 mode, a 1 → 2 mode, a 2 → 1 mode, and a 2 → 2 mode; → represents translation correspondence, the front end of the → is the number corresponding to the ancient clauses, and the rear end of the → is the number corresponding to the modern Chinese clauses;

s2-1-6, acquiring the probability Pr (a → b) of each translation corresponding mode in the translation corresponding model of each ancient clause; a is more than or equal to 0, and b is less than or equal to 2;

s2-1-7, acquiring the length proportion of each ancient Chinese natural segment and the corresponding modern Chinese natural segment, and acquiring the mean value u and the standard deviation sigma of all the length proportions;

s2-1-8, according to the formula

Acquiring statistical information S (S, t); wherein

Is a normal distribution probability density function;

s2-1-9, according to the formula

Acquiring an edit distance value E (s, t); wherein EditDis (s, t) is an operand when a sentence of ancient text is translated into modern Chinese, and the operand comprises the total times of insertion, deletion and replacement;

s2-1-10, according to the formula

d(s,t)＝L(s,t)+γS(s,t)+λE(s,t)

Obtaining a score D (i, j) corresponding to each modern Chinese clause by any ancient clause; d (i, j) is a score obtained by corresponding the ith ancient clause to the jth modern Chinese clause; both gamma and lambda are weight parameters; s_iIs the ith ancient clause; s_i-1Is the i-1 st ancient clause; s_i-2Is the ith-2 ancient clauses; t is t_jIs the jth modern Chinese clause; t is t_j-1Is the j-1 th modern Chinese clause; t is t_j-2Is the j-2 th modern Chinese clause; NULL indicates that the sentence is empty, i.e., there is no corresponding clause.

S2-1-11, for any ancient clause, selecting the modern Chinese clause with the largest corresponding score as the alignment clause to complete clause alignment.

Further, the specific method for data word segmentation in step S2 is as follows:

and respectively constructing a name dictionary, a place dictionary and a colloquial dictionary, and segmenting the names, the place names and the colloquial in the ancient clauses according to the constructed name dictionary, place dictionary and colloquial dictionary.

Further, the specific method for data augmentation in step S2 includes the following sub-steps:

s2-2-1, adopting word2vec to construct a near meaning word dictionary, only selecting near meaning words with similarity exceeding 0.8 from each word in the near meaning word dictionary, obtaining a cleaned near meaning word dictionary with each piece of data consisting of one word and two to three words closest to the word, and completing near meaning word augmentation;

s2-2-2, splicing each piece of data and the data behind the data until the end punctuation of the sentence is an exclamation mark, a question mark or a sentence mark or the spliced clause data reaches four pieces, and using the spliced clause as new clause data to finish the augmentation based on the clause;

s2-2-3, obtaining alignment information of all terms of each sentence of ancient and all terms of modern Chinese corresponding to the ancient by adopting a giza + + alignment tool of a statistical machine translation model, and adjusting the ancient language order according to the alignment information to obtain an ancient translation corpus.

Further, the specific method of step S3 includes the following sub-steps:

s3-1, converting the ancient clauses in the ancient translation corpus into vector forms to obtain ancient clause vectors, and inputting the ancient clause corresponding to each ancient clause into a sequence model as a training basic unit;

s3-2, according to the formula

forget_m＝sigmoid(W₁·[hidden_m-1，m]+b₁)

input_m＝sigmoid(W₂·[hidden_m-1，m]+b₂)

output_m＝sigmoid(W₄·[hidden_m-₁，m]+b₄)

hidden_m＝output_m*tanh(C_m)

Obtaining hidden layer state hidden of any neuron in coder from sequence to sequence model after inputting mth ancient text vector element in training basic unit_m(ii) a Wherein hidden_m-₁The state of a hidden layer of the neuron in the encoder after the (m-1) th ancient text vector element is input; sigmoid (·) is a sigmoid function; tan h (·) is a hyperbolic tangent function; forget_m、input_m、

C_mAnd output_mAll the parameters are intermediate parameters after the mth ancient text vector element is input; c_m-1The intermediate parameter after the (m-1) th ancient text vector element is input; b₁、b₂、b₃And b₄All represent a bias; w₁、W₂、W₃And W₄All represent weights; setting the initial state of a hidden layer in an encoder by adopting random initialization;

s3-3, combining hidden layer states of each neuron in the encoder from the sequence to the sequence model after the last ancient character vector element is input into a vector to obtain a hidden layer state vector hidden corresponding to the encoder from the sequence to the sequence model and the current training basic unit_M；

S3-4, inputting the modern Chinese clause corresponding to the ancient clause input in the encoder as the basic check unit to the decoder, and according to the formula

forget_n＝sigmoid(W₅·[state_n-1，n]+b₅)

input_n＝sigmoid(W₆·[state_n-1，n]+b₆)

output_n＝sigmoid(W₈·[state_n-1，n]+b₈)

state_n＝output_n*tanh(C_n)

Obtaining the corresponding hidden layer state of any neuron in the decoder from the sequence to the sequence model after inputting the nth modern Chinese word_n；state_n-1The hidden layer state of the neuron in the encoder after inputting the (n-1) th modern Chinese is input; sigmoid (·) is a sigmoid function; tan h (·) is a hyperbolic tangent function; forget_n、input_n、

C_nAnd output_nAll are intermediate parameters after the nth modern Chinese is input; c_n-1Inputting intermediate parameters after the (n-1) th modern Chinese; b₅、b₆、b₇And b₈All represent a bias; w₅、W₆、W₇And W₈All represent weights; setting the initial state of a hidden layer in a decoder by adopting the value of the state of the hidden layer of an encoder;

s3-5, combining the hidden layer states corresponding to the nth modern Chinese word after each neuron in the decoder of the sequence-to-sequence model inputs the nth modern Chinese word into a vector to obtain the hidden layer state vector state corresponding to the nth modern Chinese word in the decoder of the sequence-to-sequence model_Y；

S3-6, according to the formula

e_nM＝bmm(state_Y，hidden_M)

e_nx＝bmm(state_Y，hidden_x)

Obtaining the attention a from the sequence of the last ancient Chinese vector element and the nth modern Chinese word to the sequence model_nM(ii) a exp (·) is an exponential function with a natural constant e as the base; e.g. of the type_nMAnd e_nxIs an intermediate parameter; bmm (·) denotes dot product; m is the number of elements in the ancient text vector; hidden_xHidden layer state, hidden, after the mth ancient vector element is input to the encoder_x∈(hidden₁，hidden₂，...，hidden_M)；

S3-7, according to the formula

Obtaining the weighted average context of hidden layer state output by the encoder corresponding to the nth modern Chinese word from the input ancient clause_nI.e. the context vector corresponding to the nth modern chinese word;

s3-8, according to the formula

Context vectors corresponding to the nth modern Chinese word and hidden layer states of a decoder are cascaded and sent to a fully-connected network W_contextTo obtain a cascade state

S3-9, according to the formula

Obtaining the output of the sequence-to-sequence model corresponding to the nth modern Chinese word

And then obtaining the output of the sequence-to-sequence model corresponding to the sentence of modern Chinese

Wherein

softmax (·) is a softmax function; w_sIs the network weight;

s3-10, according to the formula

Obtaining the output of the sequence-to-sequence model corresponding to the sentence of the modern Chinese

The difference from the true answer y

If there is a difference

If the difference is larger than the threshold value, the parameters of the sequence to the sequence model are updated until the difference

Less than or equal to the threshold value, and obtaining a trained neural network; wherein N is the total number of words in the modern Chinese of the sentence; y is_nIs the nth modern Chinese word, y_nE.g. y, the real answer y is the modern Chinese input to the decoder.

The invention has the beneficial effects that: the invention can provide accurate name, place name and colloquial information for the translation model by introducing various dictionaries for word segmentation, improve the translation effect of proper nouns, automatically align clauses, complete implicit alignment between characters and words by an attention mechanism, translate ancient clauses to be translated by a neural network and effectively improve the translation efficiency and accuracy.

Drawings

FIG. 1 is a schematic flow chart of the present invention.

Detailed Description

The following description of the embodiments of the present invention is provided to facilitate the understanding of the present invention by those skilled in the art, but it should be understood that the present invention is not limited to the scope of the embodiments, and it will be apparent to those skilled in the art that various changes may be made without departing from the spirit and scope of the invention as defined and defined in the appended claims, and all matters produced by the invention using the inventive concept are protected.

As shown in fig. 1, the neural network-based ancient language translation method includes the following steps:

The specific method of step S1 is:

The method for clause alignment of the initial samples in step S2 includes the following sub-steps:

s2-1-4, according to the formula

Obtaining the matching degree L (s, t) of lexical matching; wherein t represents a modern Chinese clause; s represents an ancient clause; | s | represents the length of a foreign clause;

Is 1, otherwise

Is 0;

and

to indicate the function, if a character k in the modern interpretation of the ancient character c matches the remaining modern vocabulary in the modern Chinese, its frequency score, denoted idf, is taken from the inverse document frequency dictionary_kOtherwise 0, β is the normalized parameter of the inverse document frequency;

s2-1-8, according to the formula

Acquiring statistical information S (S, t); wherein

Is a normal distribution probability density function;

s2-1-9, according to the formula

s2-1-10, according to the formula

d(s,t)＝L(s,t)+γS(s,t)+λE(s,t)

The specific method for data word segmentation in step S2 is as follows:

The specific method for data augmentation in step S2 includes the following substeps:

s2-2-1, adopting word2vec to construct a near meaning word dictionary, only selecting near meaning words with similarity exceeding 0.8 from each word in the near meaning word dictionary, obtaining a cleaned near meaning word dictionary with each piece of data consisting of one word and two to three words closest to the word, and completing near meaning word augmentation; during similarity calculation, calculating a cosine value of an included angle between two vectors of word2vec vectors of two words as a similarity value;

The specific method of step S3 includes the following substeps:

s3-2, according to the formula

forget_m＝sigmoid(W₁·[hidden_m-1，m]+b₁)

input_m＝sigmoid(W₂·[hidden_m-1，m]+b₂)

output_m＝sigmoid(W₄·[hidden_m-1，m]+b₄)

hidden_m＝output_m*tanh(C_m)

Obtaining hidden layer state hidden of any neuron in coder from sequence to sequence model after inputting mth ancient text vector element in training basic unit_m(ii) a Wherein hidden_m-1The state of a hidden layer of the neuron in the encoder after the (m-1) th ancient text vector element is input; sigmoid (·) is a sigmoid function; tan h (·) is a hyperbolic tangent function; forget_m、input_m、

forget_n＝sigmoid(W₅·[state_n-1，n]+b₅)

input_n＝sigmoid(W₆·[state_n-1，n]+b₆)

output_n＝sigmoid(W₈·[state_n-1，n]+b₈)

state_n＝output_n*tanh(C_n)

S3-6, according to the formula

e_nM＝bmm(state_Y，hidden_M)

e_nx＝bmm(state_Y，hidden_x)

Obtaining the attention a from the sequence of the last ancient Chinese vector element and the nth modern Chinese word to the sequence model_nM(ii) a exp (·) is an exponential function with a natural constant e as the base; e.g. of the type_nMAnd e_nxIs an intermediate parameter; bmm (·) denotes dot product; m is the number of elements in the ancient text vector; hidden_xHidden layer state hfdden after input of mth ancient vector element for encoder_x∈(hfdden₁，hfdden₂，...，hfdden_M)；

S3-7, according to the formula

s3-8, according to the formula

S3-9, according to the formula

Wherein

softmax (·) is a softmax function; w_sIs the network weight;

s3-10, according to the formula

The difference from the true answer y

If there is a difference

In conclusion, the invention can provide accurate name and place name and common language information for the translation model by introducing various dictionaries for word segmentation, improve the proper noun translation effect, automatically align clauses, complete implicit alignment between characters and words by attention, translate ancient clauses to be translated by a neural network and effectively improve translation efficiency and accuracy.

Claims

1. An ancient language translation method based on a neural network is characterized by comprising the following steps:

s4, taking the ancient Chinese to be translated as the input of the trained neural network to finish the translation of the ancient Chinese;

s2-1-4, according to the formula

Is 1, otherwise

Is 0;

and

s2-1-7, acquiring the length proportion of each ancient Chinese natural segment and the corresponding modern Chinese natural segment, and acquiring the mean value mu and the standard deviation sigma of all the length proportions;

s2-1-8, according to the formula

Acquiring statistical information S (S, t); wherein

Is a normal distribution probability density function;

s2-1-9, according to the formula

s2-1-10, according to the formula

d(s,t)＝L(s,t)+γS(s,t)+λE(s,t)

Obtaining a score D (i, j) corresponding to each modern Chinese clause by any ancient clause; d (i, j) is a score obtained by corresponding the ith ancient clause to the jth modern Chinese clause; both gamma and lambda are weight parameters; s_iIs the ith ancient clause; s_i-1Is the i-1 st ancient clause; s_i-2Is the ith-2 ancient clauses; t is t_jIs the jth modern Chinese clause; t is t_j-1Is the j-1 th modern Chinese clause; t is t_j-2Is the j-2 th modern Chinese clause; NULL indicates that the sentence is empty, i.e. there is no corresponding clause;

2. The ancient sentence translation method based on neural network according to claim 1, wherein the specific method of step S1 is:

3. The ancient language translation method based on neural network as claimed in claim 1, wherein the specific method of data word segmentation in step S2 is:

4. The neural network-based ancient sentence translation method according to claim 3, wherein the specific method of data augmentation in the step S2 comprises the following sub-steps:

5. The neural network-based ancient sentence translation method according to claim 4, wherein the specific method of the step S3 comprises the following sub-steps:

s3-2, according to the formula

forget_m＝sigmoid(W₁·[hidden_m-1,m]+b₁)

input_m＝sigmoid(W₂·[hidden_m-1,m]+b₂)

output_m＝sigmoid(W₄·[hidden_m-1,m]+b₄)

hidden_m＝output_m*tanh(C_m)

forget_n＝sigmoid(W₅·[state_n-1,n]+b₅)

input_n＝sigmoid(W₆·[state_n-1,n]+b₆)

output_n＝sigmoid(W₈·[state_n-1,n]+b₈)

state_n＝output_n*tanh(C_n)

s3-5, combining the hidden layer states corresponding to the nth modern Chinese word after each neuron in the decoder of the sequence-to-sequence model inputs the nth modern Chinese word into a vector to obtain the hidden layer states corresponding to the nth modern Chinese word in the decoder of the sequence-to-sequence modelLayer state vector state_Y；

S3-6, according to the formula

e_nM＝bmm(state_Y,hidden_M)

e_nx＝bmm(state_Y,hidden_x)

Obtaining the attention a from the sequence of the last ancient Chinese vector element and the nth modern Chinese word to the sequence model_nM(ii) a exp (·) is an exponential function with a natural constant e as the base; e.g. of the type_nMAnd e_nxIs an intermediate parameter; bmm (·) denotes dot product; m is the number of elements in the ancient text vector; hidden_xHidden layer state, hidden, after the mth ancient vector element is input to the encoder_x∈(hidden₁，hidden₂，…，hidden_M)；

S3-7, according to the formula