CN116070643B

CN116070643B - Fixed style translation method and system from ancient text to English

Info

Publication number: CN116070643B
Application number: CN202310343986.1A
Authority: CN
Inventors: 杨红征; 刘鑫
Original assignee: Wuchang University of Technology
Current assignee: Wuchang University of Technology
Priority date: 2023-04-03
Filing date: 2023-04-03
Publication date: 2023-08-15
Anticipated expiration: 2043-04-03
Also published as: CN116070643A

Abstract

The invention provides a fixed style translation method and system from ancient text to English. The method comprises: performing clause alignment and word segmentation on acquired ancient poems and their corresponding English translation poems to obtain an ancient poetry word segmentation corpus, an ancient text translation word segmentation corpus and an ancient poetry translation style corpus; training a word segmentation model and an ancient poetry translation style model on these corpora respectively; and predicting on ancient poems with the ancient poetry translation style model, outputting English translations, and analyzing the translation style. To address translation in a fixed style, the invention constructs a fixed-translation-style ancient poetry word segmentation corpus and the corresponding ancient text translation word segmentation corpus, and trains the word segmentation network according to the fixed translation style to form a fixed-translation-style word segmentation model.

Description

Fixed style translation method and system from ancient text to English

Technical Field

The invention relates to the field of ancient Chinese text translation, and in particular to a fixed style translation method and system from ancient text to English.

Background

With the development of deep learning in machine translation, deep neural networks can learn translation knowledge automatically from a corpus, which has greatly improved translation quality, with accuracy reaching more than 90%. Neural machine translation automatically converts one language into another through a neural network. An encoder-decoder framework is usually employed, but the encoder state is passed only to the first node of the decoder, so the information from the encoder becomes less and less relevant at later time steps. To address the long-range dependency problem of ancient Chinese grammar, an attention mechanism network is introduced to decode the encoded context fragments, which solves the feature learning problem for long sentences.

However, in machine translation, ancient text translation mostly means translating ancient text into modern Chinese; the ancient text cannot be translated directly into English.

Disclosure of Invention

Aiming at the technical problems in the prior art, the invention provides a fixed style translation method and a system from ancient text to English.

According to a first aspect of the present invention, there is provided a fixed style translation method from ancient text to English, comprising:

S1, acquiring ancient poems and the English translation poems corresponding to them in a specific translation style as an initial sample set, performing clause alignment and word segmentation operations on the ancient poems and the English translation poems in the initial sample set, and obtaining an ancient poetry word segmentation corpus, an ancient text translation word segmentation corpus and an ancient poetry translation style corpus;

S2, training a word segmentation model based on the ancient poetry word segmentation corpus, the ancient text translation word segmentation corpus and the SIGHAN 2005 PKU corpus to obtain a trained word segmentation model;

S3, preprocessing and segmenting the ancient poetry translation style corpus based on the trained word segmentation model, and encoding the preprocessed data into vectors that serve as a training data set for training an ancient poetry translation style model;

S4, predicting on ancient poems based on the ancient poetry translation style model, outputting the corresponding English translations, and analyzing the translation style based on the ancient poems and the corresponding English translations to obtain translation style analysis results.

According to a second aspect of the present invention, there is provided a fixed style translation system from ancient text to English, comprising:

an acquisition module, configured to acquire ancient poems and the English translation poems corresponding to them in a specific translation style as an initial sample set, perform clause alignment and word segmentation operations on the ancient poems and the English translation poems in the initial sample set, and obtain an ancient poetry word segmentation corpus, an ancient text translation word segmentation corpus and an ancient poetry translation style corpus;

a first training module, configured to train a word segmentation model based on the ancient poetry word segmentation corpus, the ancient text translation word segmentation corpus and the SIGHAN 2005 PKU corpus to obtain a trained word segmentation model;

a second training module, configured to preprocess and segment the ancient poetry translation style corpus based on the trained word segmentation model, and encode the preprocessed data into vectors that serve as a training data set for training an ancient poetry translation style model;

and an analysis module, configured to predict on ancient poems based on the ancient poetry translation style model, output the corresponding English translations, and analyze the translation style based on the ancient poems and the corresponding English translations to obtain translation style analysis results.

According to a third aspect of the present invention, there is provided an electronic device including a memory and a processor, the processor implementing the steps of the fixed style translation method from ancient text to English when executing a computer management program stored in the memory.

According to a fourth aspect of the present invention, there is provided a computer-readable storage medium storing a computer management program which, when executed by a processor, implements the steps of the fixed style translation method from ancient text to English.

With the fixed style translation method and system from ancient text to English provided by the invention, an ancient poetry translation style corpus is constructed in which ancient poems are aligned with English translations in a fixed translation style, so that ancient text can be translated directly into English. To address the fixed translation style, a fixed-translation-style ancient poetry word segmentation corpus and the corresponding ancient text translation word segmentation corpus are constructed, and a word segmentation network is trained according to the fixed translation style to obtain a fixed-translation-style word segmentation model for subsequent machine translation. The invention thus provides a deep-learning-based automatic fixed style translation method from ancient text to English.

Drawings

FIG. 1 is a flow chart of the fixed style translation method from ancient text to English;

FIG. 2 is a schematic flow chart of obtaining the corpora;

FIG. 3 is a flow chart of preprocessing the sentence lengths in the training sample set of the word segmentation model;

FIG. 4 is a schematic diagram of the translation style analysis of the ancient poetry translation style model;

FIG. 5 is a schematic diagram of the fixed style translation system from ancient text to English according to the present invention;

FIG. 6 is a schematic hardware structure of one possible electronic device according to the present invention;

FIG. 7 is a schematic hardware structure of one possible computer-readable storage medium according to the present invention.

Detailed Description

To make the objects, technical solutions and advantages of the embodiments of the invention clearer, the technical solutions of the embodiments are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, but not all, embodiments of the invention. All other embodiments obtained by those skilled in the art based on the embodiments of the invention without inventive effort fall within the scope of the invention. In addition, the technical features of the embodiments, or of a single embodiment, may be combined with one another freely to form a feasible technical solution. Such a combination is not limited by the order of steps and/or the structural composition, but it must be one that a person of ordinary skill in the art can implement. Where a combination is contradictory or cannot be implemented, it is deemed not to exist and is not within the scope of protection claimed by the invention.

FIG. 1 is a flow chart of the fixed style translation method from ancient text to English provided by the invention. As shown in FIG. 1, the method includes:

S1, acquiring ancient poems and the English translation poems corresponding to them in a specific translation style as an initial sample set, performing clause alignment and word segmentation operations on the ancient poems and the English translation poems in the initial sample set, and obtaining an ancient poetry word segmentation corpus, an ancient text translation word segmentation corpus and an ancient poetry translation style corpus.

As an embodiment, acquiring the ancient poems and the translation data corresponding to a specific translation style as the initial sample set in S1 includes: acquiring original text samples of the ancient poems and English translation samples of the ancient poems in a specific translation style; and, based on the original text samples and the corresponding English translation samples, forming an initial sample set with the structure [poem number, Chinese title, Chinese poem, English title, English poem].

It can be understood that the original text samples of the ancient poems and the English translation samples in a specific translation style are obtained, the ancient poem samples and the English translation samples are matched and aligned, and the initial sample set is formed in a fixed structure.

It should be noted that the specific translation style refers to the style in which a certain well-known translator or a certain well-known translation tool renders ancient poetry.

For example, taking Su Shi's "定风波" ("Ding Feng Bo") and Zhuo Zhenying's translation as an example, the initial sample is: (284, "定风波", "莫听穿林打叶声，何妨吟啸且徐行。…

Taking Du Mu's "清明" ("Qingming") and the translation by Yang Xianyi and Dai Naidie as an example, the initial sample is: (789, "清明", "清明时节雨纷纷，路上行人欲断魂。…

The method provided by the invention is described below taking Su Shi's "饮湖上初晴后雨" and Xu Yuanchong's translation as an example. The initial sample is: (22, "饮湖上初晴后雨", "水光潋滟晴方好，山色空蒙雨亦奇。欲把西湖比西子，淡妆浓抹总相宜。", "Drinking at the Lake First in Sunny and then in Rainy Weather", "The brimming waves delight the eye on sunny days; the dimming hills give a rare view in rainy haze. The West Lake looks like the fair lady at her best; whether she is richly adorned or plainly dressed.").

Referring to FIG. 2, after the initial sample set is obtained, obtaining the ancient poetry word segmentation corpus in S1 includes performing clause alignment and word segmentation on the ancient poems of the initial sample set according to segmentation rules: dividing each ancient poem into several clauses at its punctuation marks, assisted by manual review, with each clause occupying one line; segmenting each clause into words, with each word occupying one line; and obtaining the ancient poetry word segmentation corpus from the segmentation results, with the structure [poem number, Chinese title, Chinese poem, Chinese word, Chinese line number, Chinese column number].

It can be understood that clause alignment and word segmentation are performed on the ancient poems in the initial sample set according to the segmentation rules to obtain the ancient poetry word segmentation corpus. Specifically, the ancient poem is split at its punctuation marks, each split being one clause. A four-line poem, for example, is divided into four clauses, one line of verse per clause.

Taking the initial sample as an example, the clause division result is: (水光潋滟晴方好，/山色空蒙雨亦奇。/欲把西湖比西子，/淡妆浓抹总相宜。)

Then each clause is segmented into words, following the reading and comprehension habits of ancient text as closely as possible, with each word occupying one line.

Taking "水光潋滟晴方好，" from the above sample as an example, the word segmentation result is: (水光/潋滟/晴方好/，).

After each clause is segmented, the ancient poetry word segmentation corpus is obtained from the segmentation results in a fixed structure. The sample structure of the ancient poetry word segmentation corpus is (poem number, Chinese title, Chinese poem, Chinese word, Chinese line number, Chinese column number).

Taking the above sample as an example, the corpus sample is: (22, "饮湖上初晴后雨", "水光潋滟晴方好，", "水光", 1, 1_2).
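The record construction described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `split_clauses` and `word_records` are hypothetical, the word list is supplied by hand (the text relies on manual review), and the Chinese line is assumed to be the original of the running example.

```python
import re

PUNCT = "，。；！？"  # clause-ending punctuation assumed for this sketch

def split_clauses(poem: str) -> list[str]:
    """Split a poem into clauses, each keeping its closing punctuation mark."""
    return re.findall(rf"[^{PUNCT}]+[{PUNCT}]?", poem)

def word_records(poem_id, title, clause, words, line_no):
    """Build one corpus record per word with a 1-based column span such as '1_2'."""
    if "".join(words) != clause:
        raise ValueError("words must concatenate to the clause")
    records, col = [], 1
    for word in words:
        last = col + len(word) - 1
        span = str(col) if last == col else f"{col}_{last}"
        records.append((poem_id, title, clause, word, line_no, span))
        col = last + 1
    return records

poem = "水光潋滟晴方好，山色空蒙雨亦奇。欲把西湖比西子，淡妆浓抹总相宜。"
clauses = split_clauses(poem)
# Hand-supplied segmentation of the first clause, as in the example above.
records = word_records(22, "饮湖上初晴后雨", clauses[0],
                       ["水光", "潋滟", "晴方好", "，"], 1)
print(records[0])
```

The first record reproduces the sample structure above; single-character words get a single column number rather than a range.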

Obtaining the ancient text translation word segmentation corpus in S1 includes performing clause alignment and word segmentation on the English translation poems of the initial sample set according to segmentation rules: dividing each English translation poem into several clauses at its punctuation marks, assisted by manual review, with each clause occupying one line; segmenting each clause of the English translation poem into words, with the words separated by spaces; and obtaining the ancient text translation word segmentation corpus from the segmentation results, with the structure [poem number, English title, English poem, English word, English line number, English column number].

It can be understood that clause alignment and word segmentation are performed on the English translation poems in the initial sample set according to the segmentation rules to obtain the ancient text translation word segmentation corpus. Specifically, each English translation poem in the initial sample set is divided into several clauses at its punctuation marks, assisted by manual review, so that each clause matches the meaning of the corresponding ancient text as closely as possible, and each clause occupies one line.

Taking the initial sample as an example, the clause division result is: (The brimming waves delight the eye on sunny days;/The dimming hills give a rare view in rainy haze./The West Lake looks like the fair lady at her best;/Whether she is richly adorned or plainly dressed.).

Each clause of the English translation poem is then segmented so that the segments match the meaning of the ancient text as closely as possible, with the words separated by spaces.

Taking the above sample as an example, the word segmentation result of "The brimming waves delight the eye on sunny days;" is: (The brimming waves/delight the eye/on sunny days/;).

From the word segmentation results of the English translation poems, the ancient text translation word segmentation corpus is formed in a fixed structure. Its sample structure is (poem number, English title, English poem, English word, English line number, English column number).

Taking the above sample as an example, the corpus sample is: (22, "Drinking at the Lake First in Sunny and then in Rainy Weather", "The brimming waves delight the eye on sunny days; the dimming hills give a rare view in rainy haze. The West Lake looks like the fair lady at her best; whether she is richly adorned or plainly dressed.", "delight the eye", 1, 4_6).
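For the English side, column numbers count words rather than characters. The sketch below shows one way to derive spans such as 4_6 from a hand-segmented line; `english_spans` and the tokenization pattern are assumptions made for illustration only.

```python
import re

TOKEN = r"[A-Za-z']+|[;,.!?]"  # words and single punctuation marks (assumed rule)

def english_spans(line: str, phrases: list[str]) -> list[tuple[str, str]]:
    """Return (phrase, word-column span) pairs for consecutive phrases of a line."""
    tokens = re.findall(TOKEN, line)
    spans, col = [], 0
    for phrase in phrases:
        parts = re.findall(TOKEN, phrase)
        if tokens[col:col + len(parts)] != parts:
            raise ValueError(f"phrase {phrase!r} does not match the line at column {col + 1}")
        first, last = col + 1, col + len(parts)
        spans.append((phrase, str(first) if first == last else f"{first}_{last}"))
        col = last
    return spans

spans = english_spans(
    "The brimming waves delight the eye on sunny days;",
    ["The brimming waves", "delight the eye", "on sunny days", ";"],
)
print(spans)
```

The phrase "delight the eye" covers words 4 to 6 of the line, which agrees with the 4_6 column number in the corpus sample above.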

After the ancient poetry word segmentation corpus and the ancient text translation word segmentation corpus are obtained, the English segmentation results in the ancient text translation word segmentation corpus are aligned with the ancient text segmentation results in the ancient poetry word segmentation corpus to obtain the ancient poetry translation style corpus.

First, a script is used to delete punctuation from the ancient text and English word segmentation results.

Taking the initial sample as an example, the final word segmentation result of "水光潋滟晴方好，" is: (水光/潋滟/晴方好).

The final word segmentation result of "The brimming waves delight the eye on sunny days;" is: (The brimming waves/delight the eye/on sunny days).

Then the processed English segmentation results are matched with the ancient text segmentation results, each matched word occupying one line, forming samples with the structure (Chinese word, English word, Chinese title, English title, Chinese line number, English line number, Chinese column number, English column number, jump flag).

The Chinese and English line numbers give the line of the poem in which the word is located in the corresponding word segmentation corpus, and the Chinese and English column numbers give the position of the word within that line.

Taking the above sample as an example, the corpus sample is: ("水光", "delight the eye", "饮湖上初晴后雨", "Drinking at the Lake First in Sunny and then in Rainy Weather", 1, "1_2", "4_6", "0").

It should be noted that, when aligning and matching the English and ancient text segmentation results, if an English translation has no corresponding ancient text, the corresponding Chinese title, Chinese word, line number and column number are left empty; if an ancient text word has no corresponding translation, the corresponding English title, English word, line number and column number are left empty.

For example, in Su Shi's "和子由渑池怀旧", the line "泥上偶然留指爪" is segmented into four parts: (泥上/偶然/留/指爪), and its translation is segmented into six parts: (See/the claw and nail prints/by chance/mud/and snow/bear). Here "and snow" and "see" are added because the translation context requires them, so they have no corresponding original ancient text, and the corpus sample is expressed as (None, "and snow", None, "Recalling the Old Days at Mianchi in the Same Rhymes as Ziyou's Poem", None, 3, None, "9_10", "0").

If a jump translation exists in the Chinese/English matching, that is, one word is matched to non-adjacent positions, the column number ranges are separated by commas and the jump flag is set to 1; otherwise the jump flag is set to 0.

For example, if "归" in the line "整驾催归及未晡" of "腊日游孤山访惠勤惠思二僧" is translated as "to go … to my abode", the corpus sample is expressed as ("归", "to go … to my abode", "腊日游孤山访惠勤惠思二僧", "Visiting in Winter the Two Learned Monks in the Lonely Hill", 15, "4", "4_5,8_10", "1").

Aligning and matching the ancient text and English word segmentation results in this way yields the ancient poetry translation style corpus.
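The alignment rules above (empty fields for unmatched words, comma-separated column ranges and a 0/1 jump flag) can be sketched as a small record builder. `AlignedPair` and `make_pair` are hypothetical names, the record follows the nine-field structure stated in the text, and the English line number 15 in the second example is an assumption because the sample in the text does not show it.

```python
from typing import NamedTuple, Optional

class AlignedPair(NamedTuple):
    zh_word: Optional[str]
    en_word: Optional[str]
    zh_title: Optional[str]
    en_title: Optional[str]
    zh_line: Optional[int]
    en_line: Optional[int]
    zh_cols: Optional[str]
    en_cols: Optional[str]
    jump: str

def make_pair(zh_word, en_word, zh_title, en_title,
              zh_line, en_line, zh_cols, en_cols) -> AlignedPair:
    """Blank out the side that has no counterpart and derive the jump flag."""
    if zh_word is None:
        zh_title = zh_line = zh_cols = None
    if en_word is None:
        en_title = en_line = en_cols = None
    # A comma in either column field marks non-adjacent (jump) positions.
    jump = "1" if any(c and "," in c for c in (zh_cols, en_cols)) else "0"
    return AlignedPair(zh_word, en_word, zh_title, en_title,
                       zh_line, en_line, zh_cols, en_cols, jump)

added = make_pair(None, "and snow", "和子由渑池怀旧",
                  "Recalling the Old Days at Mianchi in the Same Rhymes as Ziyou's Poem",
                  3, 3, "1_2", "9_10")
jumped = make_pair("归", "to go … to my abode", "腊日游孤山访惠勤惠思二僧",
                   "Visiting in Winter the Two Learned Monks in the Lonely Hill",
                   15, 15, "4", "4_5,8_10")  # English line 15 is assumed
print(added.jump, jumped.jump)
```

Deriving the flag from the column fields keeps the two pieces of information from drifting apart, which is one possible design choice rather than something the text prescribes.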

S2, training a word segmentation model based on the ancient poetry word segmentation corpus, the ancient text translation word segmentation corpus and the SIGHAN 2005 PKU corpus to obtain a trained word segmentation model.

As an embodiment, training the word segmentation model in S2 includes: encoding each word segmentation result in the ancient poetry word segmentation corpus, the ancient text translation word segmentation corpus and the SIGHAN 2005 PKU corpus into vectors, and training the word segmentation model on the encoded vectors to obtain the trained word segmentation model.

It can be understood that the ancient poetry word segmentation corpus, the ancient text translation word segmentation corpus and the SIGHAN 2005 PKU corpus are combined as the training database of the word segmentation model, and the word segmentation model is obtained through mixed training.

The Second International Chinese Word Segmentation Bakeoff (SIGHAN 2005) provides the AS, CITYU, MSR and PKU corpora, with training, validation and test sets annotated to different standards (sentence + word segmentation labels). Obtaining a word segmentation model suitable for ancient poetry requires a large-scale corpus, so the invention combines the SIGHAN 2005 PKU corpus with the custom corpora for mixed training.

The training sample data are preprocessed and encoded into vectors that are input to the word segmentation model for training. The preprocessing steps are:

First, the maximum length of a sentence input to the word segmentation model is specified. If a sentence is too long, the part beyond the maximum length is cut off; if it is too short, it is padded. A [CLS] mark is added at the beginning of the sentence, a [SEP] mark at the end of the sentence or between two sentences, and, where padding is needed, the corresponding number of [PAD] marks is appended, giving the sample format: [CLS] + sentence + [SEP] + [PAD] × N, where N is the number of [PAD] marks to be added.

Because the maximum sentence length is specified, there are three cases; the processing flow is shown in FIG. 3:

(1) input sentence length = maximum length: the corresponding identifiers are added at the beginning and the end of the sentence;

(2) input sentence length < maximum length: the corresponding identifiers are added at the beginning and the end of the sentence, and [PAD] marks are added to fill the sentence to the maximum length (if the maximum length is x and the sentence length is y, x − y [PAD] marks are needed, that is, N = x − y in 221);

(3) input sentence length > maximum length: the sentence is cut at the maximum length into two sentences, one falling under case (1) and the other under case (2), and the corresponding operations are carried out.
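The three cases can be sketched in a few lines. This is an illustration under stated assumptions: `frame_sentence` is a hypothetical helper, the maximum length counts sentence characters only (so that N = x − y as stated above), and an over-long sentence is cut into consecutive pieces as described in case (3).

```python
def frame_sentence(sentence: str, max_len: int) -> list[list[str]]:
    """Cut a sentence into pieces of at most max_len characters and frame each
    piece as [CLS] + characters + [SEP] + [PAD] * N, with N = max_len - len(piece)."""
    pieces = [sentence[i:i + max_len] for i in range(0, len(sentence), max_len)] or [""]
    framed = []
    for piece in pieces:
        n_pad = max_len - len(piece)
        framed.append(["[CLS]", *piece, "[SEP]"] + ["[PAD]"] * n_pad)
    return framed

line = "水光潋滟晴方好，"             # 8 characters
exact = frame_sentence(line, 8)    # case (1): no padding
short = frame_sentence(line, 10)   # case (2): two [PAD] marks
long_ = frame_sentence(line, 5)    # case (3): cut into a 5- and a 3-character piece
print(short[0])
```

Every framed piece has the same length, max_len + 2, which is what lets the pieces be batched together.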

Taking "水光潋滟晴方好，" as an example, the sentence becomes: [CLS] 水光潋滟晴方好， [SEP].

Each character in the sample is labeled with one of four tags using the BMES scheme, where B marks the beginning of a word, M the middle of a word, E the end of a word, and S a single-character word.

Taking the above sample as an example, "水光潋滟晴方好，" is labeled as: ['B', 'E', 'B', 'E', 'B', 'M', 'E', 'S'], and the corresponding tag ids are: [0, 1, 0, 1, 0, 2, 1, 3].
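A short sketch of BMES labeling from a segmented sentence follows. The tag-to-id mapping (B=0, E=1, M=2, S=3) is read off the example above; the function name `bmes_tags` is hypothetical.

```python
TAG_TO_ID = {"B": 0, "E": 1, "M": 2, "S": 3}  # mapping implied by the example ids

def bmes_tags(words: list[str]) -> list[str]:
    """Label every character of every word with its BMES tag."""
    tags = []
    for word in words:
        if len(word) == 1:
            tags.append("S")
        else:
            tags.extend(["B"] + ["M"] * (len(word) - 2) + ["E"])
    return tags

tags = bmes_tags(["水光", "潋滟", "晴方好", "，"])
tag_ids = [TAG_TO_ID[t] for t in tags]
print(tags, tag_ids)
```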

The sample is mapped to the dictionary by the tokenizer: each input character is mapped to its dictionary ID in the model and encoded into a vector.

Taking the above sample as an example, the token id vector of "水光潋滟晴方好，" is: [101, 3717, 1045, 4046, 4006, 3252, 3175, 1962, 8024, 102].

Sentence segment identifiers are set: positions of the first sentence are all 0, positions of the second sentence are all 1, and [PAD] positions are all 0.

Taking the above sample as an example, the segment identifier vector of "水光潋滟晴方好，" is: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0].

The attention mask is set: [PAD] positions are set to 0 and all other positions to 1.

Taking the above sample as an example, the mask vector of "水光潋滟晴方好，" is: [1, 1, 1, 1, 1, 1, 1, 1, 1, 1].
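The three input vectors can be produced together, as the sketch below shows. The small `VOCAB` table is only an illustration: its ids for the characters, [CLS] and [SEP] are read off the token vector in the example above, while the [PAD] id of 0 and the function name `encode` are assumptions.

```python
# Toy dictionary: ids taken from the example token vector; [PAD] = 0 is assumed.
VOCAB = {"[PAD]": 0, "[CLS]": 101, "[SEP]": 102, "水": 3717, "光": 1045,
         "潋": 4046, "滟": 4006, "晴": 3252, "方": 3175, "好": 1962, "，": 8024}

def encode(tokens: list[str]):
    """Return (token ids, segment ids, attention mask) for a framed token list."""
    ids = [VOCAB[t] for t in tokens]
    segments, current = [], 0
    for t in tokens:
        segments.append(0 if t == "[PAD]" else current)
        if t == "[SEP]":
            current = 1  # tokens after the first [SEP] belong to the second sentence
    mask = [0 if t == "[PAD]" else 1 for t in tokens]
    return ids, segments, mask

tokens = ["[CLS]", *"水光潋滟晴方好，", "[SEP]"]
ids, segments, mask = encode(tokens)
_, _, padded_mask = encode(tokens + ["[PAD]", "[PAD]"])
print(ids, segments, mask)
```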

The encoded vector features are input to the word segmentation model to train it. The training process is as follows:

The training data pass through the encoder layer of the word segmentation model, which outputs vector features;

taking the above samples as an example, the input feature vectors of each training sample are:

The vector features pass through the pooling layer, where the representation corresponding to the [CLS] mark is extracted and transformed to serve as the representation of the whole sequence and returned, while the representations of all marks are returned unchanged, and the classification result is output.

Taking the above sample as an example, the vector features x have the shape (1, 10, 4). After the first mark is removed, the output labels are [0, 1, 0, 1, 0, 2, 1, 3], consistent with the tag ids.
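Once per-character labels are predicted, they can be turned back into words. The sketch below is one way to do this, assuming the same id mapping as the example (0=B, 1=E, 2=M, 3=S); `decode_words` is a hypothetical helper and not part of the described model.

```python
ID_TO_TAG = {0: "B", 1: "E", 2: "M", 3: "S"}  # inverse of the mapping in the example

def decode_words(chars: str, tag_ids: list[int]) -> list[str]:
    """Group characters into words: a word closes at every E or S tag."""
    words, current = [], ""
    for ch, tag_id in zip(chars, tag_ids):
        current += ch
        if ID_TO_TAG[tag_id] in ("E", "S"):
            words.append(current)
            current = ""
    if current:  # tolerate a sequence that ends without E or S
        words.append(current)
    return words

words = decode_words("水光潋滟晴方好，", [0, 1, 0, 1, 0, 2, 1, 3])
print(words)
```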

Training according to the steps to obtain the word segmentation model.

S3, preprocessing and segmenting the ancient poetry translation style corpus based on the trained word segmentation model, and encoding the preprocessed data into vectors that serve as a training data set for training the ancient poetry translation style model.

As an embodiment, S3 includes: aligning the poems through the attributes "poem number", "Chinese line number" and "English line number" according to the ancient poetry word segmentation corpus and the ancient text translation word segmentation corpus, and taking out one line of English translation poetry; performing string normalization preprocessing on the English translation line and adding start and end marks to it; segmenting the preprocessed ancient poem and English translation respectively with the word segmentation model to obtain a character list for each line of the poem, and representing the characters in the list as vectors through one-hot encoding, so that each line is represented by a vector matrix; and training the ancient poetry translation style model on the encoded vector matrices to obtain the trained model.

Referring to FIG. 4, after the word segmentation model is trained in step S2, the ancient poetry translation style corpus is preprocessed and segmented based on that model. The preprocessing and segmentation steps are:

Poem alignment data are taken from the ancient poetry translation style corpus line by line. Combining the ancient text translation word segmentation corpus and the ancient poetry word segmentation corpus, the poems are aligned through the attributes "poem number", "Chinese line number" and "English line number", and one line of poem translation data is taken out, for example: "The brimming waves delight the eye on sunny days;", "水光潋滟晴方好，".

The extracted data are preprocessed, for example by normalizing character strings, filtering unnecessary characters, and adding spaces before punctuation marks.

Taking the above sample as an example, the preprocessing outputs: "the brimming waves delight the eye on sunny days", "水光潋滟晴方好，".

Start and end marks are added to the sentence so that the model knows when to start and stop predicting.

Taking the above sample as an example, the processed data are: "<start> the brimming waves delight the eye on sunny days <end>", "水光潋滟晴方好，".
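The English-side preprocessing can be sketched with the standard library. The exact normalization and filtering rules are not given in the text, so the choices below (NFKC normalization, lowercasing, keeping letters, apostrophes and a few punctuation marks) and the name `preprocess_english` are assumptions.

```python
import re
import unicodedata

def preprocess_english(line: str) -> str:
    """Normalize a line, space out punctuation, drop other characters,
    and wrap the result in <start> and <end> marks."""
    line = unicodedata.normalize("NFKC", line).lower().strip()
    line = re.sub(r"([?.!,;:])", r" \1 ", line)   # put spaces around punctuation
    line = re.sub(r"[^a-z?.!,;:']+", " ", line)    # filter unnecessary characters
    line = re.sub(r"\s+", " ", line).strip()
    return f"<start> {line} <end>"

processed = preprocess_english("The brimming waves delight the eye on sunny days;")
print(processed)
```

In this sketch the punctuation mark is kept as its own token; a pipeline that deletes punctuation first, as in the alignment step above, would produce the same string without the ";".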

The preprocessed ancient text data and English data are segmented respectively by the word segmentation model from S2 to obtain a character list for each line of the poem, and the characters are represented as vectors through one-hot encoding, so that each line is represented by a vector matrix. The steps are:

The ancient text data and the English data are segmented respectively with the word segmentation model obtained in step S2;

Taking the above sample as an example, the ancient text word segmentation result is: ['水光', '潋滟', '晴方好', '，'], and the English word segmentation result is: ['<start>', 'the brimming waves', 'delight the eye', 'on sunny days', '<end>'].

A character list for the translation style of the ancient poems is created from the word segmentation results;

Taking the above sample as an example, the character lists of the English and the ancient text are: {'': 1, 'the brimming waves': 2, 'delight the eye': 3, 'on sunny days': 4, '<start>': 5, '<end>': 6} and {'，': 1, '水光': 2, '潋滟': 3, '晴方好': 4}.

One-hot encoding is carried out according to the character list to obtain the character vectors, and each line of the poem is represented by a vector matrix.

Taking "the brimming waves" as an example, the character is represented as a one-hot vector, and the whole sentence can be encoded as a vector matrix.
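One-hot encoding from a character list can be sketched as below. The index table mirrors the English character list in the example, with the first entry read as an empty string, which is an assumption; the matrix here stores one row per token, whereas the text speaks of column vectors, so it is the transpose of that layout.

```python
# English character list from the example; the '' key for index 1 is an assumed reading.
EN_INDEX = {"": 1, "the brimming waves": 2, "delight the eye": 3,
            "on sunny days": 4, "<start>": 5, "<end>": 6}

def one_hot(token: str, index: dict[str, int]) -> list[int]:
    """Return a vector with a single 1 at the (1-based) index of the token."""
    return [1 if index[token] == k else 0 for k in range(1, len(index) + 1)]

def one_hot_matrix(tokens: list[str], index: dict[str, int]) -> list[list[int]]:
    """Encode a segmented sentence as a matrix with one row per token."""
    return [one_hot(t, index) for t in tokens]

sentence = ["<start>", "the brimming waves", "delight the eye", "on sunny days", "<end>"]
matrix = one_hot_matrix(sentence, EN_INDEX)
print(one_hot("the brimming waves", EN_INDEX))
```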

Inputting the coded vector matrix into a network model as training data, and training to obtain a ancient poetry translation style model, wherein the training process comprises the following steps:

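The input vector matrix is $X = (x_1, x_2, \dots, x_M)$, where $M$ is the number of column vectors in the input vector matrix, and the target translation sentence is expressed as $Y = (y_1, y_2, \dots, y_N)$, where $N$ is the number of column vectors in the target translation vector. Reading the input sequence $X$ forward through the attention layer gives the forward hidden state sequence $(\overrightarrow{h}_1, \dots, \overrightarrow{h}_M)$, and reading it in reverse gives the reverse hidden state sequence $(\overleftarrow{h}_1, \dots, \overleftarrow{h}_M)$. The forward and reverse hidden states are combined to obtain the attention of each $x$, $a_j = [\overrightarrow{h}_j; \overleftarrow{h}_j]$, where $a_j$ indicates the attention of the $j$-th $x$.

The probability of the target word $y_i$ is $p(y_i \mid y_1, \dots, y_{i-1}, X) = g(y_{i-1}, d_i, c_i)$, where the network hidden state corresponding to position $i$ is $d_i = f(d_{i-1}, y_{i-1}, c_i)$, and $g$ and $f$ denote the output function and the state update function of the network. The context vector $c_i$ is the weighted sum of the $a_j$, that is, $c_i = \sum_{j=1}^{M} \alpha_{ij} a_j$. The weight $\alpha_{ij}$ is calculated as $\alpha_{ij} = \exp(s_{ij}) / \sum_{k=1}^{M} \exp(s_{ik})$, where $s_{ij}$ is a score of how well the input around position $j$ matches the output at position $i$, calculated as $s_{ij} = \mathrm{score}(d_{i-1}, a_j)$; that is, $s_{ij}$ is computed from the network hidden state $d_{i-1}$ and the $j$-th attention $a_j$ of the input sentence.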
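As a rough illustration of the weighting step, the sketch below turns scores s_ij into weights α_ij and forms the context vector c_i as the weighted sum of the a_j. It assumes a softmax over the scores and a dot-product score function; both are assumptions of this sketch rather than details confirmed by the text, and the names `dot_score`, `attention_weights` and `context_vector` are hypothetical.

```python
import math
from typing import List, Sequence


def dot_score(d_prev: Sequence[float], a_j: Sequence[float]) -> float:
    """Assumed score s_ij: dot product of the state d_{i-1} and the attention a_j."""
    return sum(d * a for d, a in zip(d_prev, a_j))


def attention_weights(scores: Sequence[float]) -> List[float]:
    """Softmax over s_i1..s_iM, giving the weights alpha_i1..alpha_iM."""
    peak = max(scores)  # subtract the maximum for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def context_vector(d_prev: Sequence[float],
                   annotations: Sequence[Sequence[float]]) -> List[float]:
    """Context vector c_i: weighted sum of the a_j with weights alpha_ij."""
    scores = [dot_score(d_prev, a_j) for a_j in annotations]
    alphas = attention_weights(scores)
    dim = len(annotations[0])
    return [sum(alpha * a_j[k] for alpha, a_j in zip(alphas, annotations))
            for k in range(dim)]


# Toy values chosen only for illustration.
annotations = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
d_prev = [0.5, -0.5]
alphas = attention_weights([dot_score(d_prev, a) for a in annotations])
c_i = context_vector(d_prev, annotations)
```

The weights always sum to 1, so c_i stays within the range spanned by the a_j; the input position with the highest score contributes most.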

And training through the training process to obtain the ancient poetry translation style model.

And predicting the ancient text poetry by using the trained ancient text poetry translation style model, and outputting English translation.

And counting the translation of the fixed vocabulary, and analyzing the translation style of the ancient poetry translation style model.

Taking the above sample as an example, a common translator translates "billow" into "ripple", whereas the ancient poetry translation style model outputs "the brimming waves", which uses more elaborate vocabulary.
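The fixed-vocabulary statistics can be sketched as a frequency table that records how often each source word is rendered by each English phrase. This is only an illustration: the function name `translation_frequencies` and the sample pairs are hypothetical, and the English glosses stand in for the source words.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def translation_frequencies(pairs: Iterable[Tuple[str, str]]) -> Dict[str, Counter]:
    """Count how often each source word is rendered as each English phrase."""
    table: Dict[str, Counter] = defaultdict(Counter)
    for source_word, english in pairs:
        table[source_word][english] += 1
    return dict(table)


# Hypothetical aligned (source word, English rendering) pairs.
pairs = [
    ("billow", "the brimming waves"),
    ("billow", "the brimming waves"),
    ("billow", "ripple"),
]
freq = translation_frequencies(pairs)
preferred, count = freq["billow"].most_common(1)[0]
```

A rendering that dominates the counts for a given source word is a candidate for a fixed translation characteristic of the style.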

And carrying out translation style analysis according to the 0/1 jump rule formulated in the Chinese and English alignment matching process.

Taking the above sample as an example, the common translator translates "complete driving guidance and no-after-the-body" into "The whole drive has been urged back and not late", whereas the prediction output of the ancient poetry translation style model is "They hurry me to go before dusk to my abode". The ancient poetry translation style model translates "Gui" into "to go".
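One simple way to summarize the jump marks is the share of aligned rows whose mark is 1. The sketch below assumes that a mark of 1 flags a row where the alignment jumps (for example, one side has no counterpart); that reading, the row layout and the function name `jump_rate` are assumptions made for illustration.

```python
from typing import Dict, List


def jump_rate(rows: List[Dict[str, object]]) -> float:
    """Share of aligned rows whose jump mark is 1 (0.0 for an empty corpus)."""
    if not rows:
        return 0.0
    jumps = sum(1 for row in rows if row["jump"] == 1)
    return jumps / len(rows)


# Hypothetical aligned rows; the words are placeholders.
rows = [
    {"zh_word": "A", "en_word": "a", "jump": 0},
    {"zh_word": "", "en_word": "b", "jump": 1},
    {"zh_word": "C", "en_word": "c", "jump": 0},
    {"zh_word": "D", "en_word": "", "jump": 1},
]
rate = jump_rate(rows)
```

Comparing this rate between translators, or between a translator and the model, gives one coarse indicator of how freely a style departs from word-by-word rendering.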

Carrying out translation style analysis according to the rhetorical devices used;

taking "Duchesner's half-sky fish tail red" as an example, a common translator renders it as "Broken clouds in mid-air fish tail red", whereas the prediction output of the ancient poetry translation style model is "Rosy clouds in mid-air like fish-tails undulate", which uses a simile as a rhetorical device and is therefore more vivid than the common translator's version.

The analyses based on fixed vocabulary translation, the jump translation rule and rhetorical devices are then combined to obtain the translation style analysis result of the ancient poetry translation style model.

Referring to fig. 5, in an embodiment of the present invention, a fixed style translation system from ancient text to English includes an obtaining module 501, a first training module 502, a second training module 503, and an analyzing module 504, where:

the obtaining module 501 is configured to obtain ancient poetry and the corresponding English translations of a specific translation style as an initial sample set, and to perform clause alignment and word segmentation on the ancient poetry and the English translations in the initial sample set to obtain an ancient poetry word segmentation corpus, an ancient text translation word segmentation corpus and an ancient poetry translation style corpus;

the first training module 502 is configured to train a word segmentation model based on the ancient poetry word segmentation corpus, the ancient text translation word segmentation corpus and the SIGHAN 2005 PKU corpus, and to obtain a trained word segmentation model;

the second training module 503 is configured to preprocess and segment the ancient poetry translation style corpus based on the trained word segmentation model, and to encode the preprocessed data into vectors as a training data set, so as to train and obtain the ancient poetry translation style model;

the analysis module 504 is configured to predict the ancient poetry based on the ancient poetry translation style model, output the corresponding English translation, and analyze the translation style based on the ancient poetry and the corresponding English translation, so as to obtain a translation style analysis result.

It can be understood that the fixed style translation system from ancient text to English provided by the present invention corresponds to the fixed style translation method from ancient text to English provided in the foregoing embodiments; for the relevant technical features of the system, reference may be made to the relevant technical features of the method, which are not repeated here.

Referring to fig. 6, fig. 6 is a schematic diagram of an embodiment of an electronic device according to an embodiment of the invention. As shown in fig. 6, an embodiment of the present invention provides an electronic device 600, including a memory 610, a processor 620, and a computer program 611 stored in the memory 610 and capable of running on the processor 620, wherein the processor 620 implements the steps of the fixed style translation method from ancient text to English when executing the computer program 611.

Referring to fig. 7, fig. 7 is a schematic diagram of an embodiment of a computer readable storage medium according to the present invention. As shown in fig. 7, the present embodiment provides a computer-readable storage medium 700 on which a computer program 711 is stored, the computer program 711 implementing the steps of the fixed style translation method from ancient text to English when executed by a processor.

According to the fixed style translation method and system from ancient text to English provided by the invention, an ancient poetry translation style corpus is constructed in which the ancient poetry is aligned with English translations of a fixed translation style, so that ancient poetry is translated directly into English. An ancient poetry word segmentation corpus of the fixed translation style and the corresponding ancient text translation word segmentation corpus are also constructed. This addresses the translation problem of a fixed style and provides a deep-learning-based method for translating ancient poetry into English.

In the foregoing embodiments, each embodiment is described with its own emphasis; for portions that are not described in detail in one embodiment, reference may be made to the related descriptions of the other embodiments.

It will be appreciated by those skilled in the art that embodiments of the present invention may be provided as a method, system, or computer program product. Accordingly, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present invention may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

The present invention is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded computer, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiments and all such alterations and modifications as fall within the scope of the invention.

It will be apparent to those skilled in the art that various modifications and variations can be made to the present invention without departing from the spirit or scope of the invention. Thus, it is intended that the present invention also include such modifications and alterations insofar as they come within the scope of the appended claims or the equivalents thereof.

Claims

1. A fixed style translation method from ancient text to English is characterized by comprising the following steps:

S4, predicting the ancient poetry based on the ancient poetry translation style model, outputting the corresponding English translation, and analyzing the translation style based on the ancient poetry and the corresponding English translation to obtain a translation style analysis result;

wherein obtaining the ancient poetry word segmentation corpus in step S1, by performing clause alignment and word segmentation on the ancient poetry of the initial sample set according to segmentation rules, comprises:

dividing the ancient poetry into a plurality of sections, using the punctuation in the ancient poetry as section boundaries with the assistance of manual review, wherein each section occupies one line;

segmenting each section of the ancient poetry into words, wherein each word occupies one line;

obtaining the ancient poetry word segmentation corpus according to the word segmentation result of the ancient poetry, wherein the structure of the ancient poetry word segmentation corpus is [poem serial number, Chinese title, Chinese poem, Chinese word, Chinese line number, Chinese column number];

wherein obtaining the ancient text translation word segmentation corpus in step S1, by performing clause alignment and word segmentation on the English translations of the initial sample set according to segmentation rules, comprises:

dividing each English translation into a plurality of sections, using its punctuation as section boundaries with the assistance of manual review, wherein each section occupies one line;

segmenting each section of the English translation into words, the words being separated by spaces;

obtaining the ancient text translation word segmentation corpus according to the word segmentation result of the English translation, wherein the structure of the ancient text translation word segmentation corpus is [poem serial number, English title, English poem, English word, English line number, English column number];

wherein obtaining the ancient poetry translation style corpus in step S1, based on the ancient poetry word segmentation corpus and the ancient text translation word segmentation corpus, comprises:

deleting, by script, the punctuation in the ancient poetry word segmentation results of the ancient poetry word segmentation corpus and in the English word segmentation results of the ancient text translation word segmentation corpus;

and matching the processed English word segmentation results with the ancient text word segmentation results, each segment occupying one line, to form samples with the structure [Chinese word, English word, Chinese title, English title, Chinese line number, English line number, Chinese column number, English column number, jump mark], thereby obtaining the ancient poetry translation style corpus.
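For illustration only, the punctuation removal and row formation described in this claim could be sketched as below. The sketch pairs segments by position within each line, which is a stand-in for the actual matching (the claim also relies on manual review), and it assumes that a jump mark of 1 flags a row in which one side has no counterpart. The names `StyleRow`, `drop_punct` and `build_rows`, the punctuation set and the placeholder tokens are all hypothetical.

```python
import string
from itertools import zip_longest
from typing import List, NamedTuple, Optional

# Punctuation set is an assumption: ASCII punctuation plus common Chinese marks.
PUNCT = set(string.punctuation) | set("，。；！？、：")


class StyleRow(NamedTuple):
    zh_word: Optional[str]
    en_word: Optional[str]
    zh_title: Optional[str]
    en_title: Optional[str]
    zh_line: Optional[int]
    en_line: Optional[int]
    zh_col: Optional[int]
    en_col: Optional[int]
    jump: int


def drop_punct(tokens: List[str]) -> List[str]:
    """Remove tokens that consist only of punctuation."""
    return [t for t in tokens if t and not all(ch in PUNCT for ch in t)]


def build_rows(zh_title: str, en_title: str,
               zh_lines: List[List[str]], en_lines: List[List[str]]) -> List[StyleRow]:
    """Pair segments line by line; fields of a missing side are left as None."""
    rows: List[StyleRow] = []
    paired_lines = zip_longest(zh_lines, en_lines, fillvalue=[])
    for line_no, (zh, en) in enumerate(paired_lines, start=1):
        pairs = zip_longest(drop_punct(zh), drop_punct(en))
        for col_no, (zh_word, en_word) in enumerate(pairs, start=1):
            has_zh, has_en = zh_word is not None, en_word is not None
            rows.append(StyleRow(
                zh_word, en_word,
                zh_title if has_zh else None, en_title if has_en else None,
                line_no if has_zh else None, line_no if has_en else None,
                col_no if has_zh else None, col_no if has_en else None,
                0 if has_zh and has_en else 1,
            ))
    return rows


# Placeholder tokens; real input would come from the two segmentation corpora.
rows = build_rows("Title ZH", "Title EN",
                  [["zh1", "zh2", "。"]], [["en1", "en2", "en3", "."]])
```

In this toy input the third English segment has no Chinese counterpart, so its Chinese fields stay empty and its jump mark is set to 1.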

2. The method for fixed style translation according to claim 1, wherein the step of obtaining translation data of the ancient poetry and the specific translation style corresponding to the ancient poetry as the initial sample set in S1 comprises:

acquiring an original text sample of the ancient text poetry and acquiring an English translation sample of a specific translation style of the ancient text poetry;

based on the original text sample of the ancient text poetry and the corresponding English translation sample of the specific translation style, an initial sample set with the structure of [ poetry serial number, chinese title, chinese poetry, english title and English poetry ] is formed.

3. The method according to claim 1, wherein, in the process of matching the English word segmentation results with the ancient text word segmentation results, if an English translation has no corresponding ancient text, the corresponding Chinese title, Chinese word, and line and column numbers are left empty;

if an ancient text has no corresponding translation, the corresponding English title, English word, and line and column numbers are left empty;

4. The method of claim 1, wherein the step of training the word segmentation model based on the ancient poetry word segmentation corpus, the ancient text translation word segmentation corpus and the SIGHAN 2005 PKU corpus in step S2, and obtaining the trained word segmentation model, comprises:

encoding each word segmentation result in the ancient poetry word segmentation corpus, the ancient text translation word segmentation corpus and the SIGHAN 2005 PKU corpus into vectors, training the word segmentation model based on the encoded vectors, and obtaining the trained word segmentation model.

5. The method of claim 1, wherein the step in S3 of preprocessing and segmenting based on the trained word segmentation model, and encoding the preprocessed data into vectors as a training data set to train and obtain the ancient poetry translation style model, comprises:

according to the ancient text poetry segmentation corpus and the ancient text translation segmentation corpus, aligning the poetry according to the attributes of 'poetry sequence number', 'Chinese line number' and 'English line number', and taking out one line of English translation poetry;

performing character string standardization preprocessing on the English translation poem, and adding start and end marks to the English translation poem;

dividing words of the preprocessed ancient text poems and English translation poems based on word division models respectively to obtain a character list of each sentence of the poems, and representing characters in the character list by vectors through one-hot coding, wherein each sentence of the poems is represented by a vector matrix;

and training the ancient poetry translation style model based on the encoded vector matrix to obtain the trained ancient poetry translation style model.
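The string standardization and start/end marking in this claim might look like the following minimal sketch. The specific normalization choices (Unicode NFKC, lower-casing, spacing out punctuation) and the function name `preprocess_english` are assumptions for illustration; the claim itself only requires standardization and the addition of start and end marks.

```python
import re
import unicodedata


def preprocess_english(line: str) -> str:
    """Normalize an English line and wrap it in start and end marks."""
    text = unicodedata.normalize("NFKC", line).lower().strip()
    text = re.sub(r"([?.!,;:])", r" \1 ", text)  # put spaces around punctuation
    text = re.sub(r"\s+", " ", text).strip()     # collapse repeated whitespace
    return f"<start> {text} <end>"


sample = preprocess_english("  The brimming waves delight the eye on sunny days;  ")
```

The marked line can then be passed to the word segmentation model and encoded in the same way as the other segments.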

6. The method for fixed style translation according to claim 1, wherein the step S4 of analyzing the translation style based on the ancient poetry and the corresponding english translation to obtain a translation style analysis result includes:

counting the frequency of fixed English translation of the same ancient poetry and analyzing the translation style of the ancient poetry translation style model;

analyzing the translation style of the ancient poetry translation style model according to the jump flag bit;

and analyzing the translation style of the ancient poetry translation style model according to the rhetorical devices used in the English translation.

7. A fixed style translation system from ancient text to English, comprising:

the analysis module is used for predicting the ancient poems based on the ancient poems translation style model, outputting corresponding English translations, and analyzing the translation style based on the ancient poems and the corresponding English translations to obtain translation style analysis results;

wherein obtaining the ancient poetry word segmentation corpus comprises performing clause alignment and word segmentation on the ancient poetry of an initial sample set according to segmentation rules to obtain the ancient poetry word segmentation corpus:

obtaining the ancient text translation word segmentation corpus comprises performing clause alignment and word segmentation on the English translation poems of the initial sample set according to segmentation rules to obtain the ancient text translation word segmentation corpus:

and obtaining the ancient poetry translation style corpus comprises obtaining the ancient poetry translation style corpus based on the ancient poetry word segmentation corpus and the ancient text translation word segmentation corpus:

8. An electronic device, comprising a memory and a processor, wherein the processor is configured to implement the steps of the fixed style translation method from ancient text to English according to any one of claims 1-6 when executing a computer management program stored in the memory.

9. A computer-readable storage medium, having stored thereon a computer management program which, when executed by a processor, implements the steps of the fixed style translation method from ancient text to English as claimed in any one of claims 1-6.