CN109783825A - Neural-network-based method for translating ancient Chinese prose - Google Patents

Neural-network-based method for translating ancient Chinese prose

Info

Publication number
CN109783825A
CN109783825A (application CN201910012805.0A)
Authority
CN
China
Prior art keywords
clause
ancient chinese
prose
ancient
chinese prose
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910012805.0A
Other languages
Chinese (zh)
Other versions
CN109783825B (en)
Inventor
吕建成
杨可心
屈茜
刘大一恒
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan University
Original Assignee
Sichuan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan University filed Critical Sichuan University
Priority to CN201910012805.0A priority Critical patent/CN109783825B/en
Publication of CN109783825A publication Critical patent/CN109783825A/en
Application granted granted Critical
Publication of CN109783825B publication Critical patent/CN109783825B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Abstract

The invention discloses a neural-network-based method for translating ancient Chinese prose, comprising the following steps: S1, obtain ancient Chinese texts and their corresponding translations as the initial sample; S2, successively apply clause alignment, word segmentation and data augmentation to the initial sample to obtain an ancient Chinese translation corpus; S3, use the corpus as the database of a neural machine translation model and train a sequence-to-sequence model, obtaining a trained neural network; S4, feed the ancient Chinese text to be translated into the trained network to complete the translation. By introducing several dictionaries for word segmentation, the invention supplies the translation model with accurate person names, place names and common sayings, improving the translation of proper nouns. The method aligns clauses automatically, completes the implicit word-level alignment through the attention mechanism, and translates the ancient Chinese clauses with the neural network, effectively improving translation efficiency and accuracy.

Description

A neural-network-based method for translating ancient Chinese prose
Technical field
The present invention relates to the field of ancient Chinese translation, and in particular to a neural-network-based method for translating ancient Chinese prose.
Background art
The achievements of the ancients in thought, science, literature and art are the crystallization of the wisdom and toil of the Chinese nation, a national treasure that must not be discarded. Most of this cultural heritage is carried in the form of documents, yet the language of the ancients differs greatly from modern language and is hard for modern readers to understand, which in effect sets a very high threshold for the study of ancient culture. In the past such texts could only be translated word by word and sentence by sentence by scholars of ancient culture, which was time-consuming, laborious and costly.
Summary of the invention
In view of the above deficiencies in the prior art, the neural-network-based method for translating ancient Chinese prose provided by the present invention solves the problem of the low efficiency of word-by-word, sentence-by-sentence translation.
In order to achieve the above object of the invention, the technical solution adopted by the present invention is as follows:
A neural-network-based method for translating ancient Chinese prose is provided, comprising the following steps:
S1, obtain ancient Chinese texts and their corresponding translations as the initial sample;
S2, successively apply clause alignment, word segmentation and data augmentation to the initial sample to obtain an ancient Chinese translation corpus;
S3, use the ancient Chinese translation corpus as the database of the neural machine translation model and train the sequence-to-sequence model, obtaining a trained neural network;
S4, take the ancient Chinese text to be translated as the input of the trained neural network to complete its translation.
Further, the specific method of step S1 is as follows:
Crawl ancient Chinese texts and their corresponding translations from the internet, clean the crawled data, and take the cleaned data as the initial sample.
Further, the method of clause alignment applied to the initial sample in step S2 comprises the following sub-steps:
S2-1-1, segment the Modern Chinese in the initial sample into words, and match the ancient Chinese against the Modern Chinese in left-to-right order;
S2-1-2, delete the matched words from the original sentences; for ancient Chinese characters with no Modern Chinese counterpart, introduce an ancient Chinese dictionary, build an inverse document frequency (IDF) dictionary, and obtain the IDF score of each unmatched ancient Chinese character;
S2-1-3, retrieve the dictionary definition of each unmatched ancient Chinese character and match it against the remaining Modern Chinese words;
S2-1-4, according to the formula
L(s, t) = (1/|s|) Σ_{c∈s} [1_t(c) + β·Σ_k 1_{t̄}(k)·idf_k]
obtain the lexical matching degree L(s, t), where t denotes a Modern Chinese clause; s denotes an ancient Chinese clause; |s| is the length of the ancient Chinese clause; 1_t(c) is an indicator function that equals 1 if character c in s directly matches a word in the Modern Chinese clause t and 0 otherwise; s̄ and t̄ are the strings formed by the characters of s and t that remain unmatched; 1_{t̄}(k) is an indicator function: if some character k in the modern-language definition of ancient character c matches a remaining Modern Chinese word, its score idf_k is taken from the IDF dictionary, and the term is 0 otherwise; β is the normalization parameter of the inverse document frequency;
S2-1-5, establish the model of translation correspondence between ancient Chinese clauses and Modern Chinese clauses, in which the correspondence patterns are 1→0, 0→1, 1→1, 1→2, 2→1 and 2→2; → denotes the translation correspondence, the number before → being the number of ancient Chinese clauses involved and the number after → the number of Modern Chinese clauses;
S2-1-6, for each ancient Chinese clause, obtain the probability Pr(a→b) of each correspondence pattern in the model, 0 ≤ a, b ≤ 2;
S2-1-7, obtain the length ratio of each ancient Chinese clause to its corresponding Modern Chinese clause, and compute the mean u and standard deviation σ of all the length ratios;
S2-1-8, obtain the statistical score S(s, t) from the correspondence probability Pr(a→b) and the normal probability density function φ(·) evaluated at the length ratio of s and t;
S2-1-9, obtain the edit distance score E(s, t), where EditDis(s, t) is the number of operations needed to turn the ancient Chinese clause into the Modern Chinese one, counting the total number of insertions, deletions and substitutions;
S2-1-10, according to the formula
D(s, t) = L(s, t) + γS(s, t) + λE(s, t)
obtain the score D(i, j) of any ancient Chinese clause against each Modern Chinese clause, where D(i, j) is the score of the i-th ancient Chinese clause against the j-th Modern Chinese clause; γ and λ are weight parameters; s_i is the i-th ancient Chinese clause, s_{i-1} the (i-1)-th and s_{i-2} the (i-2)-th; t_j is the j-th Modern Chinese clause, t_{j-1} the (j-1)-th and t_{j-2} the (j-2)-th; NULL indicates an empty sentence, i.e. no corresponding clause;
S2-1-11, for each ancient Chinese clause, choose the Modern Chinese clause with the highest score as its aligned clause, completing clause alignment.
Further, the specific method of word segmentation in step S2 is as follows:
Construct a person-name dictionary, a place-name dictionary and a common-saying dictionary, and segment the person names, place names and common sayings in the ancient Chinese clauses according to these dictionaries.
Further, the specific method of data augmentation in step S2 comprises the following sub-steps:
S2-2-1, construct a near-synonym dictionary with word2vec, keeping for each word only the near synonyms whose similarity to it exceeds 0.8; this yields a cleaned near-synonym dictionary in which each entry consists of a word and its two or three most similar words, completing near-synonym augmentation;
S2-2-2, splice each piece of data with the data following it until the end punctuation is an exclamation mark, question mark or full stop, or until four clauses have been spliced; take the spliced clauses as new clause data, completing clause-based augmentation;
S2-2-3, use the giza++ alignment tool of the statistical machine translation model to obtain the alignment between all words of each ancient Chinese text and all words of the corresponding Modern Chinese, and adjust the ancient Chinese word order according to this alignment, obtaining the ancient Chinese translation corpus.
Further, the specific method of step S3 comprises the following sub-steps:
S3-1, convert each ancient Chinese clause in the ancient Chinese translation corpus into vector form to obtain an ancient Chinese vector, and input the vector of each clause into the sequence-to-sequence model as one training base unit;
S3-2, according to the formulas
forget_m = sigmoid(W1·[hidden_{m-1}, m] + b1)
input_m = sigmoid(W2·[hidden_{m-1}, m] + b2)
C'_m = tanh(W3·[hidden_{m-1}, m] + b3)
C_m = forget_m * C_{m-1} + input_m * C'_m
output_m = sigmoid(W4·[hidden_{m-1}, m] + b4)
hidden_m = output_m * tanh(C_m)
obtain the hidden-layer state hidden_m of any neuron in the encoder of the sequence-to-sequence model after the m-th element of the ancient Chinese vector in the training base unit has been input; hidden_{m-1} is the hidden-layer state of the neuron after the (m-1)-th element; sigmoid(·) is the sigmoid function; tanh(·) is the hyperbolic tangent; forget_m, input_m, C'_m, C_m and output_m are intermediate quantities after the m-th element has been input; C_{m-1} is the cell state after the (m-1)-th element; b1, b2, b3 and b4 are biases; W1, W2, W3 and W4 are weights; the initial hidden-layer state of the encoder is set by random initialization;
S3-3, combine the hidden-layer states of all neurons in the encoder after the last element of the ancient Chinese vector has been input into one vector, obtaining the encoder hidden-layer state vector hidden_M corresponding to the current training base unit;
S3-4, input the Modern Chinese clause corresponding to the ancient Chinese clause given to the encoder into the decoder as the verification base unit, and according to the formulas
forget_n = sigmoid(W5·[state_{n-1}, n] + b5)
input_n = sigmoid(W6·[state_{n-1}, n] + b6)
C'_n = tanh(W7·[state_{n-1}, n] + b7)
C_n = forget_n * C_{n-1} + input_n * C'_n
output_n = sigmoid(W8·[state_{n-1}, n] + b8)
state_n = output_n * tanh(C_n)
obtain the hidden-layer state state_n of any neuron in the decoder after the n-th Modern Chinese word has been input; state_{n-1} is the hidden-layer state of the neuron after the (n-1)-th Modern Chinese word; forget_n, input_n, C'_n, C_n and output_n are intermediate quantities after the n-th Modern Chinese word has been input; C_{n-1} is the cell state after the (n-1)-th word; b5, b6, b7 and b8 are biases; W5, W6, W7 and W8 are weights; the initial hidden-layer state of the decoder is set to the value of the encoder's hidden-layer state;
S3-5, combine the hidden-layer states of all neurons in the decoder after the n-th Modern Chinese word has been input into one vector, obtaining the decoder hidden-layer state vector state_Y corresponding to the n-th Modern Chinese word;
S3-6, according to the formulas
e_nM = bmm(state_Y, hidden_M)
e_nx = bmm(state_Y, hidden_x)
a_nM = exp(e_nM) / Σ_{x=1..M} exp(e_nx)
obtain the attention weight a_nM jointly corresponding to the last element of the ancient Chinese vector and the n-th Modern Chinese word; exp(·) is the exponential function with base e; e_nM and e_nx are intermediate quantities; bmm(·) denotes the dot product; M is the number of elements in the ancient Chinese vector; hidden_x is the encoder hidden-layer state after the x-th element of the ancient Chinese vector, hidden_x ∈ (hidden_1, hidden_2, ..., hidden_M);
S3-7, according to the formula
context_n = Σ_{x=1..M} a_nx · hidden_x
obtain the weighted average context_n of the encoder hidden-layer states corresponding to the n-th Modern Chinese word, i.e. the context vector of the n-th Modern Chinese word;
S3-8, concatenate the context vector of the n-th Modern Chinese word with the hidden-layer state of the decoder and feed the result into the fully connected network W_context, obtaining the concatenated state state'_n;
S3-9, according to the formula
y'_n = softmax(W_s · state'_n)
obtain the output y'_n of the sequence-to-sequence model for the n-th Modern Chinese word, and hence the output y' of the model for this Modern Chinese sentence, y' = (y'_1, ..., y'_N); softmax(·) is the softmax function; W_s is a network weight;
S3-10, obtain the gap between the output y' of the sequence-to-sequence model for this Modern Chinese sentence and the true answer y; if the gap is greater than the threshold, update the parameters of the sequence-to-sequence model until the gap is less than or equal to the threshold, obtaining the trained neural network; N is the total number of words in this Modern Chinese sentence; y_n is the n-th Modern Chinese word, y_n ∈ y; the true answer y is the Modern Chinese input to the decoder.
The beneficial effects of the invention are as follows: by introducing several dictionaries for word segmentation, the invention supplies the translation model with accurate person names, place names and common sayings, improving the translation of proper nouns; it aligns clauses automatically, completes the implicit word-level alignment through the attention mechanism, and translates the ancient Chinese clauses to be translated with the neural network, effectively improving translation efficiency and accuracy.
Brief description of the drawings
Fig. 1 is a flow diagram of the invention.
Specific embodiment
A specific embodiment of the invention is described below to help those skilled in the art understand the invention. It should be apparent that the invention is not limited to the scope of this specific embodiment; to those skilled in the art, various changes within the spirit and scope of the invention as limited and determined by the appended claims are obvious, and all innovations and creations that use the present inventive concept fall within the scope of protection.
As shown in Figure 1, the neural-network-based method for translating ancient Chinese prose comprises the following steps:
S1, obtain ancient Chinese texts and their corresponding translations as the initial sample;
S2, successively apply clause alignment, word segmentation and data augmentation to the initial sample to obtain an ancient Chinese translation corpus;
S3, use the ancient Chinese translation corpus as the database of the neural machine translation model and train the sequence-to-sequence model, obtaining a trained neural network;
S4, take the ancient Chinese text to be translated as the input of the trained neural network to complete its translation.
The specific method of step S1 is as follows:
Crawl ancient Chinese texts and their corresponding translations from the internet, clean the crawled data, and take the cleaned data as the initial sample.
The method of clause alignment applied to the initial sample in step S2 comprises the following sub-steps:
S2-1-1, segment the Modern Chinese in the initial sample into words, and match the ancient Chinese against the Modern Chinese in left-to-right order;
S2-1-2, delete the matched words from the original sentences; for ancient Chinese characters with no Modern Chinese counterpart, introduce an ancient Chinese dictionary, build an inverse document frequency (IDF) dictionary, and obtain the IDF score of each unmatched ancient Chinese character;
S2-1-3, retrieve the dictionary definition of each unmatched ancient Chinese character and match it against the remaining Modern Chinese words;
S2-1-4, according to the formula
L(s, t) = (1/|s|) Σ_{c∈s} [1_t(c) + β·Σ_k 1_{t̄}(k)·idf_k]
obtain the lexical matching degree L(s, t); here t denotes a Modern Chinese clause; s denotes an ancient Chinese clause; |s| is the length of the ancient Chinese clause; 1_t(c) is an indicator function that equals 1 if character c in s directly matches a word in the Modern Chinese clause t and 0 otherwise; s̄ and t̄ are the strings formed by the characters of s and t that remain unmatched; 1_{t̄}(k) is an indicator function: if some character k in the modern-language definition of ancient character c matches a remaining Modern Chinese word, its frequency score idf_k is taken from the IDF dictionary, and the term is 0 otherwise; β is the normalization parameter of the inverse document frequency;
S2-1-5, establish the model of translation correspondence between ancient Chinese clauses and Modern Chinese clauses, in which the correspondence patterns are 1→0, 0→1, 1→1, 1→2, 2→1 and 2→2; → denotes the translation correspondence, the number before → being the number of ancient Chinese clauses involved and the number after → the number of Modern Chinese clauses;
S2-1-6, for each ancient Chinese clause, obtain the probability Pr(a→b) of each correspondence pattern in the model, 0 ≤ a, b ≤ 2;
S2-1-7, obtain the length ratio of each ancient Chinese clause to its corresponding Modern Chinese clause, and compute the mean u and standard deviation σ of all the length ratios;
S2-1-8, obtain the statistical score S(s, t) from the correspondence probability Pr(a→b) and the normal probability density function φ(·) evaluated at the length ratio of s and t;
S2-1-9, obtain the edit distance score E(s, t), where EditDis(s, t) is the number of operations needed to turn the ancient Chinese clause into the Modern Chinese one, counting the total number of insertions, deletions and substitutions;
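EditDis(s, t) in step S2-1-9 is, by its description, a standard Levenshtein edit distance counting insertions, deletions and substitutions. A minimal sketch (the function name is ours, not the patent's):

```python
def edit_dis(s: str, t: str) -> int:
    """Minimum number of insertions, deletions and substitutions
    turning clause s into clause t (EditDis in the text)."""
    m, n = len(s), len(t)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i          # delete all of s[:i]
    for j in range(n + 1):
        d[0][j] = j          # insert all of t[:j]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if s[i - 1] == t[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]
```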
S2-1-10, according to the formula
D(s, t) = L(s, t) + γS(s, t) + λE(s, t)
obtain the score D(i, j) of any ancient Chinese clause against each Modern Chinese clause, where D(i, j) is the score of the i-th ancient Chinese clause against the j-th Modern Chinese clause; γ and λ are weight parameters; s_i is the i-th ancient Chinese clause, s_{i-1} the (i-1)-th and s_{i-2} the (i-2)-th; t_j is the j-th Modern Chinese clause, t_{j-1} the (j-1)-th and t_{j-2} the (j-2)-th; NULL indicates an empty sentence, i.e. no corresponding clause;
S2-1-11, for each ancient Chinese clause, choose the Modern Chinese clause with the highest score as its aligned clause, completing clause alignment.
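Step S2-1-11 amounts to picking, for each ancient clause, the modern clause that maximizes D = L + γS + λE. The sketch below uses a caller-supplied `score` function standing in for the L, S, E components defined above; the greedy per-clause selection (rather than a full dynamic program over the 1→0 … 2→2 patterns) and the weight values are simplifications:

```python
def align_clauses(ancient, modern, score, gamma=1.0, lam=1.0):
    """For each ancient clause pick the modern clause with the
    highest combined score D = L + gamma*S + lam*E (step S2-1-11).
    `score(s, t)` must return the (L, S, E) component tuple."""
    alignment = []
    for s in ancient:
        best = max(modern, key=lambda t: sum(
            w * v for w, v in zip((1.0, gamma, lam), score(s, t))))
        alignment.append((s, best))
    return alignment
```
A toy component function suffices to exercise it, e.g. scoring only by length similarity.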
The specific method of word segmentation in step S2 is as follows:
Construct a person-name dictionary, a place-name dictionary and a common-saying dictionary, and segment the person names, place names and common sayings in the ancient Chinese clauses according to these dictionaries.
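The dictionary-driven segmentation of this step can be approximated with forward maximum matching over the merged person-name, place-name and common-saying lexicon. Real segmenters combine dictionaries with statistical models, so this is only an illustrative sketch (names and the 4-character window are our assumptions):

```python
def segment(sentence, lexicon, max_len=4):
    """Forward maximum matching: at each position, take the longest
    dictionary entry starting there, falling back to one character."""
    words, i = [], 0
    while i < len(sentence):
        for l in range(min(max_len, len(sentence) - i), 0, -1):
            cand = sentence[i:i + l]
            if l == 1 or cand in lexicon:
                words.append(cand)
                i += l
                break
    return words
```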
The specific method of data augmentation in step S2 comprises the following sub-steps:
S2-2-1, construct a near-synonym dictionary with word2vec, keeping for each word only the near synonyms whose similarity to it exceeds 0.8; this yields a cleaned near-synonym dictionary in which each entry consists of a word and its two or three most similar words, completing near-synonym augmentation. The similarity of two words is computed as the cosine of the angle between their word2vec vectors;
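The cosine similarity used in S2-2-1 is the dot product of the two word2vec vectors divided by the product of their norms. A pure-Python sketch (the 0.8 threshold and the two-to-three-word cap follow the text; the function names and the plain-dict vector store are our assumptions):

```python
import math

def cosine(u, v):
    """Cosine of the angle between two word vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def near_synonyms(word, vectors, top=3, threshold=0.8):
    """Keep at most `top` words whose similarity to `word`
    exceeds `threshold` (0.8 in the text), best first."""
    scored = [(w, cosine(vectors[word], vec))
              for w, vec in vectors.items() if w != word]
    scored = [(w, s) for w, s in scored if s > threshold]
    return [w for w, _ in sorted(scored, key=lambda p: -p[1])[:top]]
```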
S2-2-2, splice each piece of data with the data following it until the end punctuation is an exclamation mark, question mark or full stop, or until four clauses have been spliced; take the spliced clauses as new clause data, completing clause-based augmentation;
S2-2-3, use the giza++ alignment tool of the statistical machine translation model to obtain the alignment between all words of each ancient Chinese text and all words of the corresponding Modern Chinese, and adjust the ancient Chinese word order according to this alignment, obtaining the ancient Chinese translation corpus.
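The clause splicing of S2-2-2 might look as follows. Whether intermediate splices are also kept as samples is not specified in the text, so the version below, which keeps every prefix, is an assumption:

```python
END_PUNCT = "!?。！？"  # exclamation mark, question mark, full stop

def splice_clauses(clauses, max_join=4):
    """Starting from each clause, splice the following clauses until
    the spliced text ends in sentence-final punctuation or `max_join`
    clauses (four in the text) have been joined; every spliced string
    becomes a new sample."""
    samples = []
    for i in range(len(clauses)):
        joined = ""
        for j in range(i, min(i + max_join, len(clauses))):
            joined += clauses[j]
            samples.append(joined)  # assumption: every prefix is kept
            if joined[-1] in END_PUNCT:
                break
    return samples
```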
The specific method of step S3 comprises the following sub-steps:
S3-1, convert each ancient Chinese clause in the ancient Chinese translation corpus into vector form to obtain an ancient Chinese vector, and input the vector of each clause into the sequence-to-sequence model as one training base unit;
S3-2, according to the formulas
forget_m = sigmoid(W1·[hidden_{m-1}, m] + b1)
input_m = sigmoid(W2·[hidden_{m-1}, m] + b2)
C'_m = tanh(W3·[hidden_{m-1}, m] + b3)
C_m = forget_m * C_{m-1} + input_m * C'_m
output_m = sigmoid(W4·[hidden_{m-1}, m] + b4)
hidden_m = output_m * tanh(C_m)
obtain the hidden-layer state hidden_m of any neuron in the encoder of the sequence-to-sequence model after the m-th element of the ancient Chinese vector in the training base unit has been input; hidden_{m-1} is the hidden-layer state of the neuron after the (m-1)-th element; sigmoid(·) is the sigmoid function; tanh(·) is the hyperbolic tangent; forget_m, input_m, C'_m, C_m and output_m are intermediate quantities after the m-th element has been input; C_{m-1} is the cell state after the (m-1)-th element; b1, b2, b3 and b4 are biases; W1, W2, W3 and W4 are weights; the initial hidden-layer state of the encoder is set by random initialization;
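One encoder step of S3-2 can be sketched in NumPy as follows. Stacking the four gates' weights W1–W4 into one tensor `W` and the biases b1–b4 into `b` is our packaging choice, not the patent's:

```python
import numpy as np

def lstm_step(x_m, hidden_prev, c_prev, W, b):
    """One LSTM step per S3-2: W has shape (4, h, h+d), b (4, h),
    where h is the hidden size and d the input element size."""
    z = np.concatenate([hidden_prev, x_m])   # [hidden_{m-1}, m]
    sigmoid = lambda a: 1.0 / (1.0 + np.exp(-a))
    forget = sigmoid(W[0] @ z + b[0])        # forget gate
    inp = sigmoid(W[1] @ z + b[1])           # input gate
    c_tilde = np.tanh(W[2] @ z + b[2])       # candidate cell state
    c = forget * c_prev + inp * c_tilde      # new cell state C_m
    out = sigmoid(W[3] @ z + b[3])           # output gate
    hidden = out * np.tanh(c)                # hidden_m
    return hidden, c
```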
S3-3, combine the hidden-layer states of all neurons in the encoder after the last element of the ancient Chinese vector has been input into one vector, obtaining the encoder hidden-layer state vector hidden_M corresponding to the current training base unit;
S3-4, input the Modern Chinese clause corresponding to the ancient Chinese clause given to the encoder into the decoder as the verification base unit, and according to the formulas
forget_n = sigmoid(W5·[state_{n-1}, n] + b5)
input_n = sigmoid(W6·[state_{n-1}, n] + b6)
C'_n = tanh(W7·[state_{n-1}, n] + b7)
C_n = forget_n * C_{n-1} + input_n * C'_n
output_n = sigmoid(W8·[state_{n-1}, n] + b8)
state_n = output_n * tanh(C_n)
obtain the hidden-layer state state_n of any neuron in the decoder after the n-th Modern Chinese word has been input; state_{n-1} is the hidden-layer state of the neuron after the (n-1)-th Modern Chinese word; sigmoid(·) is the sigmoid function; tanh(·) is the hyperbolic tangent; forget_n, input_n, C'_n, C_n and output_n are intermediate quantities after the n-th Modern Chinese word has been input; C_{n-1} is the cell state after the (n-1)-th word; b5, b6, b7 and b8 are biases; W5, W6, W7 and W8 are weights; the initial hidden-layer state of the decoder is set to the value of the encoder's hidden-layer state;
S3-5, combine the hidden-layer states of all neurons in the decoder after the n-th Modern Chinese word has been input into one vector, obtaining the decoder hidden-layer state vector state_Y corresponding to the n-th Modern Chinese word;
S3-6, according to the formulas
e_nM = bmm(state_Y, hidden_M)
e_nx = bmm(state_Y, hidden_x)
a_nM = exp(e_nM) / Σ_{x=1..M} exp(e_nx)
obtain the attention weight a_nM jointly corresponding to the last element of the ancient Chinese vector and the n-th Modern Chinese word; exp(·) is the exponential function with base e; e_nM and e_nx are intermediate quantities; bmm(·) denotes the dot product; M is the number of elements in the ancient Chinese vector; hidden_x is the encoder hidden-layer state after the x-th element of the ancient Chinese vector, hidden_x ∈ (hidden_1, hidden_2, ..., hidden_M);
S3-7, according to the formula
context_n = Σ_{x=1..M} a_nx · hidden_x
obtain the weighted average context_n of the encoder hidden-layer states corresponding to the n-th Modern Chinese word given the input ancient Chinese clause, i.e. the context vector of the n-th Modern Chinese word;
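Steps S3-6 and S3-7 amount to dot-product attention followed by a weighted average of the encoder states. A NumPy sketch (the max-subtraction is a standard numerical-stability trick, not part of the text):

```python
import numpy as np

def attention_context(state_y, encoder_hiddens):
    """Dot-product attention: scores e_nx between the decoder state
    and each encoder hidden state, softmax to weights a_nx, then the
    context vector as the weighted average (steps S3-6/S3-7)."""
    e = np.array([state_y @ h for h in encoder_hiddens])  # bmm(...)
    a = np.exp(e - e.max())                               # stable softmax
    a = a / a.sum()
    context = sum(w * h for w, h in zip(a, encoder_hiddens))
    return a, context
```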
S3-8, concatenate the context vector of the n-th Modern Chinese word with the hidden-layer state of the decoder and feed the result into the fully connected network W_context, obtaining the concatenated state state'_n;
S3-9, according to the formula
y'_n = softmax(W_s · state'_n)
obtain the output y'_n of the sequence-to-sequence model for the n-th Modern Chinese word, and hence the output y' of the model for this Modern Chinese sentence, y' = (y'_1, ..., y'_N); softmax(·) is the softmax function; W_s is a network weight;
S3-10, obtain the gap between the output y' of the sequence-to-sequence model for this Modern Chinese sentence and the true answer y; if the gap is greater than the threshold, update the parameters of the sequence-to-sequence model until the gap is less than or equal to the threshold, obtaining the trained neural network; N is the total number of words in this Modern Chinese sentence; y_n is the n-th Modern Chinese word, y_n ∈ y; the true answer y is the Modern Chinese input to the decoder.
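The gap of S3-10 is not given explicitly; a standard choice for a sequence-to-sequence model is the average cross-entropy over the N target words, sketched here under that assumption:

```python
import math

def gap(pred_probs, target_ids):
    """Average negative log-probability the model assigns to each true
    word: pred_probs[n] is the softmax output y'_n over the vocabulary,
    target_ids[n] the index of the true word y_n. This cross-entropy
    reading of the 'gap' is our assumption, not the patent's formula."""
    n = len(target_ids)
    return -sum(math.log(pred_probs[i][y])
                for i, y in enumerate(target_ids)) / n
```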
In conclusion, by introducing several dictionaries for word segmentation, the present invention supplies the translation model with accurate person names, place names and common sayings, improving the translation of proper nouns; it aligns clauses automatically, completes the implicit word-level alignment through the attention mechanism, and translates the ancient Chinese clauses to be translated with the neural network, effectively improving translation efficiency and accuracy.

Claims (6)

1. A neural-network-based method for translating ancient Chinese prose, characterized by comprising the following steps:
S1, obtaining ancient Chinese texts and their corresponding translations as the initial sample;
S2, successively applying clause alignment, word segmentation and data augmentation to the initial sample to obtain an ancient Chinese translation corpus;
S3, using the ancient Chinese translation corpus as the database of the neural machine translation model and training the sequence-to-sequence model to obtain a trained neural network;
S4, taking the ancient Chinese text to be translated as the input of the trained neural network to complete its translation.
2. The neural-network-based method for translating ancient Chinese prose according to claim 1, characterized in that the specific method of step S1 is:
crawling ancient Chinese texts and their corresponding translations from the internet, cleaning the crawled data, and taking the cleaned data as the initial sample.
3. ancient Chinese prose interpretation method neural network based according to claim 1, which is characterized in that right in the step S2 The method that initial sample carries out clause's alignment includes following sub-step:
S2-1-1, the Modern Chinese in initial sample is segmented, and according to sequence from left to right by ancient Chinese prose and the modern Chinese Language is matched;
S2-1-2, matched word is deleted from former sentence, for ancient Chinese prose not corresponding with Modern Chinese, is introduced ancient Text allusion quotation establishes inverse document frequency dictionary, and obtains the inverse document frequency score for not matching ancient Chinese prose character each;
The unmatched ancient Chinese prose character of each of S2-1-3, retrieval ancient Chinese prose dictionary definition, and use it and remaining modern Chinese word Remittance is matched;
S2-1-4, according to formula
Obtain the matched matching degree L (s, t) of morphology;Wherein t indicates Modern Chinese clause;S indicates ancient Chinese prose clause;| s | indicate ancient The length of literary clause;For indicator function, if the word that the character c in s can be matched directly in Modern Chinese clause tIt is 1, otherwiseIt is 0;WithCharacter string composed by remaining not yet matched character in respectively s and t;For indicator function, if the modern language of ancient Chinese prose character c explain in have some character k to match remaining in Modern Chinese Modern cliction is converged, then takes out its frequency score from inverse document frequency dictionary, be denoted as idfk, it is otherwise 0;β is inverse document frequency The normalizing parameter of rate;
S2-1-5, ancient Chinese prose clause model corresponding with the translation of Modern Chinese clause is established;Wherein translate the translation pair of corresponding model Answering mode includes 1 → 0 mode, 0 → 1 mode, 1 → 1 mode, 1 → 2 mode, 2 → 1 modes and 2 → 2 modes;→ indicate translation pair Answer, → front end be ancient Chinese prose clause corresponding number, → rear end be Modern Chinese clause corresponding number;
S2-1-6, for each ancient Chinese prose clause, obtain its probability P r for translating every kind of translation corresponded manner in corresponding model (a → b);0≤a,b≤2;
S2-1-7, the length ratio for obtaining each ancient Chinese prose paragragh and corresponding Modern Chinese paragragh, and obtain all length The mean μ and standard deviation sigma of ratio;
S2-1-8, according to formula
It obtains statistical information S (s, t);WhereinIt is normpdf;
S2-1-9, according to formula
It obtains editing distance numerical value E (s, t);Wherein EditDis (s, t) is an operation when ancient Chinese prose translates to Modern Chinese Number, operand include the total degree of insertion, deletion and replacement;
S2-1-10, according to formula
D (s, t)=L (s, t)+γ S (s, t)+λ E (s, t)
Obtain the score value D (i, j) that any ancient Chinese prose clause corresponds to each Modern Chinese clause;Wherein D (i, j) is specially i-th Ancient Chinese prose clause corresponds to j-th of resulting score of Modern Chinese clause;γ and λ is weight parameter;siFor i-th of ancient Chinese prose Sentence;si-1For (i-1)-th ancient Chinese prose clause;si-2For the i-th -2 ancient Chinese prose clauses;tjFor j-th of Modern Chinese clause;tj-1For jth -1 A Modern Chinese clause;tj-2For -2 Modern Chinese clauses of jth;NULL indicates that sentence is sky, i.e., without corresponding clause.
S2-1-11, for each ancient Chinese clause, choosing the Modern Chinese clause with the highest score as its aligned clause, thereby completing clause alignment.
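Step S2-1-11 reduces to an argmax over the scores D(i, j). A sketch with a stand-in score function (the real method would plug in the combined score D from S2-1-10; the toy score below is purely illustrative):

```python
def align_clauses(ancient, modern, score):
    """For each ancient clause, pick the modern clause with the highest
    score as its aligned clause (step S2-1-11)."""
    alignment = {}
    for i, s in enumerate(ancient):
        best_j = max(range(len(modern)), key=lambda j: score(s, modern[j]))
        alignment[i] = best_j
    return alignment

# Toy stand-in for D(s, t): number of shared characters.
toy = lambda s, t: len(set(s) & set(t))
pairs = align_clauses(["ab", "cd"], ["ax", "cy"], toy)
```

With the toy score, "ab" aligns to "ax" and "cd" to "cy".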
4. The neural-network-based ancient Chinese prose translation method according to claim 3, wherein the data segmentation in step S2 specifically comprises:
constructing a person-name dictionary, a place-name dictionary and a common-saying dictionary, and segmenting the person names, place names and common sayings in the ancient Chinese clauses according to the constructed dictionaries.
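Dictionary-driven segmentation of this kind is commonly realized with forward maximum matching against the user lexicons. A self-contained sketch (the algorithm choice and helper names are assumptions; the patent only specifies that the three dictionaries guide segmentation):

```python
def fmm_segment(sentence, lexicons, max_len=4):
    """Forward-maximum-matching segmentation against user dictionaries
    (e.g. person names, place names, common sayings).  Characters covered
    by no dictionary fall back to single-character tokens."""
    vocab = set().union(*lexicons)
    tokens, i = [], 0
    while i < len(sentence):
        # Try the longest dictionary entry first, shrink down to one char.
        for L in range(min(max_len, len(sentence) - i), 0, -1):
            piece = sentence[i:i + L]
            if L == 1 or piece in vocab:
                tokens.append(piece)
                i += L
                break
    return tokens

# "孔子" (person name) and "鲁" (place name) come from the dictionaries.
toks = fmm_segment("孔子居鲁", [{"孔子"}, {"鲁"}, set()])
```

This keeps multi-character proper nouns intact, which is what gives the translation model the accurate name and place information described above.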
5. The neural-network-based ancient Chinese prose translation method according to claim 4, wherein the data augmentation in step S2 specifically comprises the following sub-steps:
S2-2-1, constructing a near-synonym dictionary with word2vec, keeping for each word only the near synonyms whose similarity to it exceeds 0.8; after cleaning, each entry consists of a word together with its two to three most similar words; this completes near-synonym augmentation;
S2-2-2, splicing each data item with the data items that follow it, until the end-of-sentence punctuation is an exclamation mark, question mark or full stop, or until four clauses have been spliced; taking the spliced clauses as new clause data, thereby completing the clause-based augmentation;
S2-2-3, using the giza++ alignment tool of the statistical machine translation model to obtain the alignment between all words of each ancient text and all words of the corresponding Modern Chinese, and adjusting the word order of the ancient text according to the alignment, thereby obtaining the ancient Chinese translation corpus.
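The clause-splicing rule of step S2-2-2 is a small deterministic procedure. A sketch (function name and the exact terminal punctuation set are illustrative; the patent names exclamation mark, question mark and full stop):

```python
def splice_clauses(clauses, max_clauses=4, terminals="。！？.!?"):
    """Clause-based augmentation (step S2-2-2): starting from each clause,
    splice the following clauses until terminal punctuation appears or
    four clauses have been joined; each splice becomes a new data item."""
    out = []
    for start in range(len(clauses)):
        buf = ""
        for k in range(start, min(start + max_clauses, len(clauses))):
            buf += clauses[k]
            if buf[-1] in terminals:   # exclamation/question mark or full stop
                break
        out.append(buf)
    return out

aug = splice_clauses(["a,", "b.", "c,"])
```

Starting at the first clause, splicing stops at "b." because of the full stop, producing the new item "a,b.".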
6. The neural-network-based ancient Chinese prose translation method according to claim 5, wherein step S3 specifically comprises the following sub-steps:
S3-1, converting the ancient Chinese clauses in the ancient Chinese translation corpus into vector form to obtain ancient-text vectors, and using the ancient-text vector corresponding to each ancient Chinese clause as one training base unit input to the sequence-to-sequence model;
S3-2, according to the formulas

forget_m = sigmoid(W1·[hidden_{m-1}, m] + b1)
input_m = sigmoid(W2·[hidden_{m-1}, m] + b2)
C̃_m = tanh(W3·[hidden_{m-1}, m] + b3)
C_m = forget_m * C_{m-1} + input_m * C̃_m
output_m = sigmoid(W4·[hidden_{m-1}, m] + b4)
hidden_m = output_m * tanh(C_m)

obtaining the hidden-layer state hidden_m of any neuron in the encoder of the sequence-to-sequence model after the m-th ancient-text vector element of the input training base unit; where hidden_{m-1} is the hidden-layer state of that neuron after the (m-1)-th ancient-text vector element; sigmoid(·) is the sigmoid function; tanh(·) is the hyperbolic tangent function; forget_m, input_m, C̃_m, C_m and output_m are intermediate parameters after the m-th ancient-text vector element; C_{m-1} is the intermediate parameter after the (m-1)-th ancient-text vector element; b1, b2, b3 and b4 denote biases; W1, W2, W3 and W4 denote weights; the initial hidden-layer state of the encoder is set by random initialization;
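The encoder update in step S3-2 is the standard LSTM cell. A dependency-free sketch over scalar inputs, so the gate arithmetic is visible; the weight layout and function names are illustrative placeholders, not the patent's parameters:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def lstm_step(x, h_prev, c_prev, W, b):
    """One LSTM step mirroring S3-2.  W[g] = (weight on h_prev, weight on x)
    and b[g] are the parameters of gate g, applied to the concatenation
    [h_prev, x]; gates in order: forget, input, candidate, output."""
    z = [W[g][0] * h_prev + W[g][1] * x + b[g] for g in range(4)]
    f = sigmoid(z[0])              # forget gate  (forget_m)
    i = sigmoid(z[1])              # input gate   (input_m)
    c_tilde = math.tanh(z[2])      # candidate cell state
    o = sigmoid(z[3])              # output gate  (output_m)
    c = f * c_prev + i * c_tilde   # new cell state C_m
    h = o * math.tanh(c)           # new hidden state hidden_m
    return h, c

# With all-zero parameters every gate is 0.5 and the state stays at zero.
h, c = lstm_step(1.0, 0.0, 0.0, [(0.0, 0.0)] * 4, [0.0] * 4)
```

In the actual model each of h, c, x would be a vector and W a matrix; the scalar form keeps the six equations of S3-2 readable.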
S3-3, combining the hidden-layer states of all neurons in the encoder of the sequence-to-sequence model after the last ancient-text vector element into a vector, obtaining the encoder hidden-layer state vector hidden_M corresponding to the current training base unit;
S3-4, inputting the Modern Chinese clause corresponding to the ancient Chinese clause fed to the encoder into the decoder as the basic verification unit, and according to the formulas

forget_n = sigmoid(W5·[state_{n-1}, n] + b5)
input_n = sigmoid(W6·[state_{n-1}, n] + b6)
C̃_n = tanh(W7·[state_{n-1}, n] + b7)
C_n = forget_n * C_{n-1} + input_n * C̃_n
output_n = sigmoid(W8·[state_{n-1}, n] + b8)
state_n = output_n * tanh(C_n)

obtaining the hidden-layer state state_n of any neuron in the decoder of the sequence-to-sequence model after the n-th Modern Chinese word; where state_{n-1} is the hidden-layer state of that neuron after the (n-1)-th Modern Chinese word; sigmoid(·) is the sigmoid function; tanh(·) is the hyperbolic tangent function; forget_n, input_n, C̃_n, C_n and output_n are intermediate parameters after the n-th Modern Chinese word; C_{n-1} is the intermediate parameter after the (n-1)-th Modern Chinese word; b5, b6, b7 and b8 denote biases; W5, W6, W7 and W8 denote weights; the initial hidden-layer state of the decoder is set to the value of the encoder's hidden-layer state;
S3-5, combining the corresponding hidden-layer states of all neurons in the decoder of the sequence-to-sequence model after the n-th Modern Chinese word into a vector, obtaining the decoder hidden-layer state vector state_Y corresponding to the n-th Modern Chinese word;
S3-6, according to the formulas

e_nM = bmm(state_Y, hidden_M)
e_nx = bmm(state_Y, hidden_x)
a_nM = exp(e_nM) / Σ_x exp(e_nx)

obtaining the attention a_nM jointly corresponding to the last ancient-text vector element and the n-th Modern Chinese word in the sequence-to-sequence model; where exp(·) is the exponential function with the natural constant e as base; e_nM and e_nx are intermediate parameters; bmm(·) denotes the dot product; M is the number of elements in the ancient-text vector; hidden_x is the hidden-layer state of the encoder after the x-th ancient-text vector element, hidden_x ∈ (hidden_1, hidden_2, ..., hidden_M);
S3-7, according to the formula

context_n = Σ_x a_nx · hidden_x

obtaining the weighted average context_n of the hidden-layer states output by the encoder for the input ancient Chinese clause, corresponding to the n-th Modern Chinese word, i.e. the context vector of the n-th Modern Chinese word;
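Steps S3-6 and S3-7 together form dot-product attention: score every encoder state against the decoder state, softmax-normalize, and take the weighted average. A minimal list-based sketch (names are illustrative):

```python
import math

def attention_context(state_y, encoder_hiddens):
    """Dot-product attention over encoder hidden states (steps S3-6/S3-7).

    state_y         : decoder hidden-state vector (state_Y)
    encoder_hiddens : list of encoder hidden states hidden_1 .. hidden_M
    Returns the attention weights a_n1..a_nM and the context vector."""
    dot = lambda u, v: sum(a * b for a, b in zip(u, v))
    scores = [dot(state_y, h) for h in encoder_hiddens]        # e_nx
    mx = max(scores)                                           # for stability
    exps = [math.exp(e - mx) for e in scores]
    weights = [e / sum(exps) for e in exps]                    # softmax: a_nx
    dim = len(encoder_hiddens[0])
    context = [sum(w * h[d] for w, h in zip(weights, encoder_hiddens))
               for d in range(dim)]                            # context_n
    return weights, context

w, ctx = attention_context([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
```

The first encoder state aligns better with the decoder state, so it receives the larger weight and dominates the context vector.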
S3-8, concatenating the context vector corresponding to the n-th Modern Chinese word with the hidden-layer state of the decoder, and feeding the result into the fully connected network W_context, obtaining the concatenated state;
S3-9, obtaining the output ŷ_n of the sequence-to-sequence model corresponding to the n-th Modern Chinese word, and thereby the output ŷ of the sequence-to-sequence model corresponding to this Modern Chinese sentence; where ŷ_n ∈ ŷ; softmax(·) is the softmax function; W_s is the network weight;
S3-10, obtaining the gap between the output ŷ of the sequence-to-sequence model for this Modern Chinese sentence and the true answer y; if the gap is greater than the threshold, updating the parameters of the sequence-to-sequence model until the gap is less than or equal to the threshold, obtaining the trained neural network; where N is the total number of words in this Modern Chinese sentence; y_n is the n-th Modern Chinese word, y_n ∈ y; the true answer y is the Modern Chinese sentence fed to the decoder.
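The gap in step S3-10 compares the model's word distributions against the reference over all N words. The patent's exact loss formula is not reproduced in this text; a common instantiation is the average cross-entropy, sketched here under that assumption:

```python
import math

def cross_entropy_gap(probs_of_truth):
    """Average negative log-probability the model assigns to each true
    Modern Chinese word y_n (n = 1..N).  One plausible form of the S3-10
    gap; the patent's own formula may differ."""
    n = len(probs_of_truth)
    return -sum(math.log(p) for p in probs_of_truth) / n

# A perfect model puts probability 1.0 on every reference word: gap is 0.
gap = cross_entropy_gap([1.0, 1.0])
```

Training then reduces this gap by gradient updates until it falls below the threshold.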
CN201910012805.0A 2019-01-07 2019-01-07 Neural network-based ancient language translation method Active CN109783825B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910012805.0A CN109783825B (en) 2019-01-07 2019-01-07 Neural network-based ancient language translation method


Publications (2)

Publication Number Publication Date
CN109783825A true CN109783825A (en) 2019-05-21
CN109783825B CN109783825B (en) 2020-04-28

Family

ID=66499178

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910012805.0A Active CN109783825B (en) 2019-01-07 2019-01-07 Neural network-based ancient language translation method

Country Status (1)

Country Link
CN (1) CN109783825B (en)

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110222349A (en) * 2019-06-13 2019-09-10 成都信息工程大学 Deep dynamic context word representation model, method and computer
CN110795552A (en) * 2019-10-22 2020-02-14 腾讯科技(深圳)有限公司 Training sample generation method and device, electronic equipment and storage medium
CN112270190A (en) * 2020-11-13 2021-01-26 浩鲸云计算科技股份有限公司 Attention mechanism-based database field translation method and system
CN116070643A (en) * 2023-04-03 2023-05-05 武昌理工学院 Fixed style translation method and system from ancient text to English
CN116701961A (en) * 2023-08-04 2023-09-05 北京语言大学 Method and system for automatically evaluating machine translation result of cultural relics

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007068123A1 (en) * 2005-12-16 2007-06-21 National Research Council Of Canada Method and system for training and applying a distortion component to machine translation
CN103955454A (en) * 2014-03-19 2014-07-30 北京百度网讯科技有限公司 Method and equipment for carrying out literary form conversion between vernacular Chinese and classical Chinese
CN108090050A (en) * 2017-11-08 2018-05-29 江苏名通信息科技有限公司 Game translation system based on deep neural network
CN109033094A (en) * 2018-07-18 2018-12-18 五邑大学 Classical Chinese and vernacular Chinese inter-translation method and system based on a sequence-to-sequence neural network model



Also Published As

Publication number Publication date
CN109783825B (en) 2020-04-28

Similar Documents

Publication Publication Date Title
CN109783825A (en) A neural-network-based ancient Chinese prose translation method
Yao et al. An improved LSTM structure for natural language processing
CN109359294B (en) Ancient Chinese translation method based on neural machine translation
CN108363743B (en) Intelligent problem generation method and device and computer readable storage medium
CN107798140B (en) Dialog system construction method, semantic controlled response method and device
WO2020062770A1 (en) Method and apparatus for constructing domain dictionary, and device and storage medium
CN110807320B (en) Short text emotion analysis method based on CNN bidirectional GRU attention mechanism
CN110297908A (en) Diagnosis and treatment program prediction method and device
CN108345585A (en) A kind of automatic question-answering method based on deep learning
CN110348535A (en) A kind of vision Question-Answering Model training method and device
CN107480132A (en) An image-content-based classical poetry generation method
CN106202010A (en) The method and apparatus building Law Text syntax tree based on deep neural network
CN110781306B (en) English text aspect layer emotion classification method and system
CN109086269B (en) Semantic bilingual recognition method based on semantic resource word representation and collocation relationship
CN108153864A (en) Method based on neural network generation text snippet
CN108628935A (en) A kind of answering method based on end-to-end memory network
CN112232087B (en) Specific aspect emotion analysis method of multi-granularity attention model based on Transformer
Liu et al. A multi-modal chinese poetry generation model
CN109284361A (en) A kind of entity abstracting method and system based on deep learning
CN108647191A (en) It is a kind of based on have supervision emotion text and term vector sentiment dictionary construction method
CN110909736A (en) Image description method based on long-short term memory model and target detection algorithm
CN111144410B (en) Cross-modal image semantic extraction method, system, equipment and medium
CN110765769A (en) Entity attribute dependency emotion analysis method based on clause characteristics
Xu et al. Implicitly incorporating morphological information into word embedding
CN113094502A (en) Multi-granularity takeaway user comment sentiment analysis method

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant