Disclosure of Invention
The invention aims to provide a method for analyzing the translation difficulty of a document, which solves the problem of how to assign each document to a suitable translator.
The invention discloses a method for analyzing document translation difficulty, which comprises the following steps:
scanning a document to be translated, and determining all words and sentences in the document to be translated;
carrying out complexity calculations on the determined words and sentences respectively, to obtain the vocabulary complexity and the sentence complexity of the document;
calculating a translation difficulty value of the document according to the vocabulary complexity and the sentence complexity of the document;
and determining the translation difficulty level of the document from a difficulty level table according to the translation difficulty value of the document.
Preferably, the process of calculating the lexical complexity of the document comprises:
calculating the vocabulary level, the standardized type-token ratio, and the word-sense density of the notional words of the document;
and calculating according to a vocabulary complexity calculation formula to obtain the vocabulary complexity of the document, wherein the vocabulary complexity calculation formula is as follows:
diff_word=K11·grade_word+K12·STTR+K13·density_notional;
where diff_word is the vocabulary complexity of the document, grade_word is the vocabulary level of the document, STTR is the standardized type-token ratio of the document, density_notional is the word-sense density of the notional words of the document, and K11, K12 and K13 are vocabulary complexity adjustment coefficients obtained through sample calculation.
Preferably, before calculating the vocabulary level of the document, the method further comprises:
performing word segmentation processing on the document to obtain all words, and counting the total number of words;
matching each obtained word in a vocabulary classification table to obtain the vocabulary level of each word, the vocabulary level being a first level, a second level, a third level or a fourth level;
respectively counting the numbers of words whose vocabulary level is level two or higher;
the process of calculating the lexical rating of the document includes:
and calculating the vocabulary level of the document according to a vocabulary level calculation formula, wherein the vocabulary level calculation formula is as follows:
grade_word = K111·(word2/word) + K112·(word3/word) + K113·(word4/word);
where wordX is the number of words at level X, word is the total number of words, and K111, K112 and K113 are vocabulary level adjustment coefficients obtained through sample calculation.
Preferably, the process of calculating the type-token ratio of the document comprises:
counting, from all the obtained words, the number of types and the number of tokens, and calculating the ratio of the number of types to the number of tokens to obtain the type-token ratio of the document; or
dividing all the obtained words, according to a standard quantity, into a plurality of subdocuments of the standard quantity plus one subdocument with fewer than the standard quantity of words, and calculating the standardized type-token ratio of the document according to a standardized type-token ratio calculation formula, which is as follows:
STTR = [1/((n+1)·ST·token)]·(type·ST + token·Σ_{i=1}^{n} type_i),  when n ≥ 1;
STTR = type/token,  when n = 0;
where token is the number of tokens of the subdocument with fewer than the standard quantity of words, type is the number of types of that subdocument, type_i is the number of types of the i-th subdocument containing the standard quantity of words, n is the number of subdocuments containing the standard quantity of words, and ST is the standard quantity of words per division unit.
Preferably, before calculating the word-sense density of the notional words of the document, the method further comprises:
performing part-of-speech tagging on all the obtained words to obtain the notional words among them;
arranging all the obtained notional words in a fixed order;
obtaining the number of senses meanings_i of each notional word from a word-sense ontology tool, where i is the sequence number of the notional word, and counting the total number of senses of the notional words;
calculating the word-sense density of the notional words of the document according to a word-sense density calculation formula; the word-sense density calculation formula is as follows:
density_notional = (Σ_{i=1}^{count_notional} meanings_i) / (Σ_{i=1}^{count_notional} meanings_i + (word − count_notional));
where meanings_i is the number of senses of the i-th notional word, and count_notional is the number of notional words.
Preferably, the notional words include at least one of the following parts of speech: nouns, pronouns, verbs, adjectives, adverbs, and interjections.
Preferably, before calculating the sentence complexity of the document, the method further comprises:
calculating the average length of the whole sentence by determining the number of the whole sentences in the document;
calculating the average length of the first type clauses in the whole sentence by determining the number of the first type clauses in all the whole sentences in the document;
calculating the average length of the long sentences by determining the number of the long sentences in the document and the length of each long sentence;
calculating the average length of the second type clauses in the long sentences by determining the number of the second type clauses in all the long sentences in the document;
the process of calculating the sentence complexity of the document comprises:
calculating the sentence complexity of the document according to a sentence complexity calculation formula; the sentence complexity calculation formula is as follows:
diff_sentence=K21·MLS+K22·MLC+K23·MLL+K24·MLCL;
where MLS is the average length of the whole sentences, MLC is the average length of the first-type clauses, MLL is the average length of the long sentences, MLCL is the average length of the second-type clauses, and K21, K22, K23 and K24 are sentence complexity adjustment coefficients obtained through sample calculation.
Preferably, the process of calculating the average length of the whole sentence and the first type clause includes:
dividing the total vocabulary number by the whole sentence number to obtain the average length of the whole sentence;
and dividing the total vocabulary number by the number of the first type clauses to obtain the average length of the first type clauses.
Preferably, the process of calculating the average length of the long sentence and the second type clause comprises:
counting the length word_long_i of each long sentence, where 1 ≤ i ≤ count_long and i is the sequence number of the long sentence;
calculating the average length of the long sentences according to a long sentence average length calculation formula; the long sentence average length calculation formula is as follows:
MLL = (1/count_long)·Σ_{i=1}^{count_long} word_long_i;
where count_long is the number of long sentences;
calculating the average length of the second-type clauses according to a second-type clause average length calculation formula; the second-type clause average length calculation formula is as follows:
MLCL = (1/count_clause_long)·Σ_{i=1}^{count_long} word_long_i;
where count_clause_long is the number of second-type clauses.
Preferably, the calculation process of the translation difficulty value of the document comprises the following steps:
calculating to obtain a translation difficulty value of the document according to a translation difficulty calculation formula; the translation difficulty calculation formula is as follows:
diff_doc=K1·diff_word+K2·diff_sentence;
where K1 and K2 are translation difficulty adjustment coefficients obtained through sample calculation, and diff_doc is the translation difficulty value.
The method for analyzing the document translation difficulty has the following advantages:
1. the translation difficulty of the document is uniformly and objectively calculated, so that the accuracy of the calculated translation difficulty is improved;
2. the method can be used to assign translation tasks to translators and achieve a reasonable, optimal allocation of resources.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
The technical scheme analyzes the translation difficulty of a document to be translated from two aspects, namely its vocabulary complexity and its sentence complexity, and specifically comprises the following steps:
S11, scanning the document to be translated, and determining all words and sentences in the document to be translated;
S12, carrying out complexity calculations on the determined words and sentences respectively, to obtain the vocabulary complexity and the sentence complexity of the document;
S13, calculating a translation difficulty value of the document according to the vocabulary complexity and the sentence complexity of the document;
S14, determining the translation difficulty level of the document from a difficulty level table according to the translation difficulty value of the document.
Based on the above method, a preferred embodiment is provided as follows:
determining a document to be translated, hereinafter referred to simply as the document;
1. The vocabulary complexity of the document is calculated as follows:
performing word segmentation processing on the document to obtain all words in the document, where the term "word" should be understood to cover not only English words but also words of character-based scripts such as Chinese, Japanese and Korean, and words of other alphabetic scripts such as French and Russian; and "all words" should be understood to include repeated words;
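How the word segmentation step might look is sketched below; the patent does not prescribe a particular segmenter, so the regular-expression split used here is only an illustrative stand-in for space-delimited languages, and a dictionary-based segmenter would have to be substituted for languages written without word spacing.

```python
import re

def segment_words(text: str) -> list:
    """Rough word segmentation for space-delimited languages; a stand-in
    only, since scripts without word spacing (Chinese, Japanese, ...)
    need a dictionary-based segmenter instead."""
    # runs of letters/digits are taken as words; repetitions are kept,
    # because the total word count includes repeated words
    return re.findall(r"[^\W_]+", text)

words = segment_words("Translation difficulty depends on the words and sentences used.")
print(len(words), words[:4])
```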
1) calculating the vocabulary level of the document:
matching each obtained word in the vocabulary classification table to obtain the level of each word, the level being first, second, third or fourth; the first, second and third levels are obtained by table lookup, and words not successfully matched in the vocabulary classification table are taken as the fourth level;
The vocabulary of each language can be classified according to how frequently its words appear in actual use. The technical scheme establishes a vocabulary classification table for each language according to that language's authoritative classification standards, and divides the words of each language into three grades according to how common they are. For example, for Chinese the "Table of General Standard Chinese Characters" and the "Chinese character coding character set for information exchange - basic set" are used as the grading references, and Chinese characters are assigned to the first, second and third levels according to whether they are commonly used, less commonly used or rarely used.
counting the number of words at level one as word1, the number of words at level two as word2, the number of words at level three as word3, and the number of words at level four as word4;
counting the number of all words in the document as the total word count word;
calculating the proportions of level-two and higher words in the document as follows:
the proportion of level-two words is word2/word;
the proportion of level-three words is word3/word;
and the proportion of level-four words is word4/word;
Calculating according to a vocabulary level calculation formula to obtain the vocabulary level of the document; the formula is as follows:
grade_word = K111·(word2/word) + K112·(word3/word) + K113·(word4/word);
where grade_word is the vocabulary level, and K111, K112 and K113 are the vocabulary level adjustment coefficients calculated from the collected samples; they belong to the third-stage adjustment coefficients and are multiple linear regression coefficients that can be calculated by the least squares method. The specific calculation is as follows:
Let Y = grade_word;
for n sets of collected sample data:
{X11, X12, X13};
{X21, X22, X23};
…
{Xn1, Xn2, Xn3};
and the corresponding vocabulary levels Y1, Y2, …, Yn evaluated by experts for the samples;
the following system of linear equations can thus be obtained:
Y1 = K111·X11 + K112·X12 + K113·X13;
Y2 = K111·X21 + K112·X22 + K113·X23;
…
Yn = K111·Xn1 + K112·Xn2 + K113·Xn3;
obtaining:
[K111, K112, K113]′ = (X′X)^(-1)·X′·Y;
where X′ is the transpose of X.
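A minimal sketch of this fitting procedure with numpy is given below: the level ratios of each sample document form the rows of X, the expert-rated vocabulary levels form Y, and the coefficients follow from the closed form (X′X)^(-1)·X′·Y above (numpy.linalg.lstsq would be an equivalent, numerically safer call). The vocabulary classification table is represented as a plain dictionary and all sample numbers are made up for illustration.

```python
import numpy as np

def level_ratios(words, level_table):
    """Return (word2/word, word3/word, word4/word) for one document;
    level_table maps a word to level 1, 2 or 3, unmatched words count
    as level 4."""
    counts = {2: 0, 3: 0, 4: 0}
    for w in words:
        level = level_table.get(w, 4)
        if level in counts:
            counts[level] += 1
    return np.array([counts[2], counts[3], counts[4]]) / len(words)

# made-up samples: ratio rows X and expert-rated vocabulary levels Y
X = np.array([[0.30, 0.10, 0.05],
              [0.20, 0.15, 0.10],
              [0.40, 0.20, 0.15],
              [0.10, 0.05, 0.02]])
Y = np.array([0.8, 0.9, 1.5, 0.4])

K = np.linalg.inv(X.T @ X) @ X.T @ Y     # (X'X)^(-1)·X'·Y, as in the text
K111, K112, K113 = K

# vocabulary level of a new document with the fitted coefficients
table = {"the": 1, "translation": 2, "ontology": 3}   # toy classification table
doc_words = ["the", "translation", "ontology", "lexeme", "the"]
grade_word = float(K @ level_ratios(doc_words, table))
print(K111, K112, K113, grade_word)
```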
2) Calculating the standardized type-token ratio of the document:
counting the tokens in the document, i.e. the total number of words appearing in the document;
counting the types in the document, i.e. the number of different words appearing in the document;
The type-token ratio (TTR) represents the rate of vocabulary variation and reflects the richness of the document's vocabulary. The higher the TTR, the more different words the text uses, and the reading difficulty increases correspondingly. Since the number of words in any language is finite, the larger the document, the smaller its type-token ratio becomes, which distorts the statistic. In practice, therefore, the TTR can be calculated per unit of a standard quantity ST of words (for example, ST = 1000), and the average of all per-unit TTRs is taken as the final value, i.e. the standardized type-token ratio (STTR). For documents with fewer than the standard quantity of words, the TTR is calculated directly. The specific steps are as follows:
dividing all words of the document, according to the standard quantity ST, into n first subdocuments, the number of types in each first subdocument being type_i, where i is the sequence number of the first subdocument;
and one second subdocument with fewer than ST words, the numbers of types and tokens in the second subdocument being type and token, respectively;
calculating the standardized type-token ratio of the document according to the standardized type-token ratio calculation formula; the formula is as follows:
STTR = [1/((n+1)·ST·token)]·(type·ST + token·Σ_{i=1}^{n} type_i),  when n ≥ 1;
STTR = type/token,  when n = 0;
where token is the number of tokens of the subdocument with fewer than the standard quantity of words, type is the number of types of that subdocument, type_i is the number of types of the i-th subdocument containing the standard quantity of words, n is the number of subdocuments containing the standard quantity of words, and ST is the standard quantity of words per division unit.
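A sketch of this calculation follows, with ST kept as a parameter (ST = 1000 in the example above; a tiny ST is used below only so the toy data exercises the chunked branch). When the word count is an exact multiple of ST the remainder subdocument is empty, which is handled here, as an assumption, by averaging over the full chunks only.

```python
def standardized_ttr(words, ST=1000):
    """Standardized type-token ratio (STTR): split the word list into n
    full chunks of ST words plus one remainder chunk, then average the
    type-token ratio of every chunk (remainder included). With fewer
    than ST words this reduces to the plain TTR, type/token."""
    n = len(words) // ST
    chunks = [words[i * ST:(i + 1) * ST] for i in range(n)]
    remainder = words[n * ST:]
    if remainder:
        chunks.append(remainder)
    ttrs = [len(set(c)) / len(c) for c in chunks]
    return sum(ttrs) / len(ttrs) if ttrs else 0.0

words = "a b c a b d e a f g".split()
print(standardized_ttr(words, ST=4))   # (3/4 + 4/4 + 2/2) / 3
```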
3) Calculating the word-sense density of the notional words of the document:
Lexical density is the proportion of notional words to the total number of words in a text. Generally, the higher the lexical density, the larger the proportion of notional words in the text, the greater the amount of information, and the harder the text is to read and translate.
counting the number count_notional of notional words in the document, i.e. counting the number of nouns, pronouns, verbs, adjectives, adverbs, interjections and the like;
arranging all the obtained notional words in a fixed order;
obtaining the number of senses meanings_i of each notional word from a word-sense ontology tool (1 ≤ i ≤ count_notional), where i is the sequence number of the notional word;
counting the senses of all notional words, i.e. adding up the numbers of senses of all notional words to obtain the total number of senses of the notional words;
calculating the word-sense density of the notional words of the document according to the word-sense density calculation formula; the formula is as follows:
density_notional = (Σ_{i=1}^{count_notional} meanings_i) / (Σ_{i=1}^{count_notional} meanings_i + (word − count_notional));
where density_notional is the word-sense density of the notional words, meanings_i is the number of senses of the i-th notional word, and count_notional is the number of notional words;
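A sketch of the word-sense density calculation is given below; the patent does not name a specific sense ontology, so NLTK's WordNet interface is used purely as an illustrative stand-in for English (requires nltk and the wordnet corpus), and counting out-of-vocabulary notional words as having one sense is an assumption of this sketch, not part of the method.

```python
from nltk.corpus import wordnet as wn   # stand-in sense ontology (assumption)

def word_sense_density(words, notional_words):
    """density_notional: total sense count of the notional words divided
    by that total plus the number of non-notional words."""
    # meanings_i: sense count of the i-th notional word; words missing
    # from the sense inventory are assumed here to have one sense
    meanings = [max(len(wn.synsets(w)), 1) for w in notional_words]
    total_senses = sum(meanings)
    return total_senses / (total_senses + (len(words) - len(notional_words)))

words = ["the", "cat", "sat", "on", "the", "mat"]
notional = ["cat", "sat", "mat"]          # e.g. selected by a POS tagger
print(word_sense_density(words, notional))
```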
The steps of calculating the vocabulary level, the standardized type-token ratio and the word-sense density of the document are not ordered; they can be performed separately or simultaneously.
4) Calculating the vocabulary complexity of the document according to its vocabulary level, standardized type-token ratio and word-sense density of the notional words:
calculating the vocabulary complexity of the document according to a vocabulary complexity calculation formula; the formula is as follows:
diff_word = K11·grade_word + K12·STTR + K13·density_notional;
where diff_word is the vocabulary complexity, grade_word is the vocabulary level, STTR is the standardized type-token ratio, and density_notional is the word-sense density of the notional words; K11, K12 and K13 are the vocabulary complexity adjustment coefficients calculated from the collected samples, belong to the second-stage adjustment coefficients, and are multiple linear regression coefficients that can be calculated by the least squares method, in the same way as the vocabulary level adjustment coefficients.
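Combining the three features is then a single weighted sum; a minimal sketch, with placeholder coefficients that are assumed to have been fitted as described above:

```python
def vocabulary_complexity(grade_word, sttr, density_notional,
                          K11=0.4, K12=0.3, K13=0.3):   # placeholder coefficients
    """diff_word = K11·grade_word + K12·STTR + K13·density_notional."""
    return K11 * grade_word + K12 * sttr + K13 * density_notional

print(vocabulary_complexity(0.45, 0.82, 0.67))
```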
2. Calculating the sentence complexity of the document, specifically as follows:
The term "whole sentence" is to be understood as a set of words expressing a complete meaning, for example the set of words from the first character of the document to the first ending symbol, the ending symbol being one of a period, an exclamation mark, a question mark and an ellipsis, or the set of words from the first character after the first ending symbol to the second ending symbol;
the term "clause" should be understood as a portion of a whole sentence, i.e. a set of words or phrases separated by symbols such as commas, pause marks and semicolons;
the term "long sentence" should be understood as a whole sentence whose word count is greater than a preset threshold;
the terms "first type" and "second type" are used herein for distinction only.
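A rough segmentation sketch under these definitions is given below; the punctuation sets and the long-sentence threshold are parameters, since the patent leaves their concrete values open (the threshold used in the example is made up), and words are counted by a simple whitespace split as a stand-in for the word segmentation step above.

```python
import re

END_MARKS = r"[.!?…。！？]"    # period, exclamation mark, question mark, ellipsis
CLAUSE_MARKS = r"[,;，、；]"    # comma, pause mark, semicolon

def sentence_counts(text, long_threshold=20, count_words=lambda s: len(s.split())):
    """Count whole sentences, first-type clauses, long sentences (with
    their word counts) and second-type clauses."""
    sentences = [s for s in re.split(END_MARKS, text) if s.strip()]
    clauses = lambda s: [c for c in re.split(CLAUSE_MARKS, s) if c.strip()]
    count_sentence = len(sentences)
    count_clause = sum(len(clauses(s)) for s in sentences)
    long_sentences = [s for s in sentences if count_words(s) > long_threshold]
    word_long = [count_words(s) for s in long_sentences]
    count_clause_long = sum(len(clauses(s)) for s in long_sentences)
    return count_sentence, count_clause, word_long, count_clause_long

print(sentence_counts("First clause, second clause. A short one!", long_threshold=3))
```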
The scheme is as follows:
scanning the document, determining all whole sentences in the document, and counting the total number of whole sentences, recorded as count_sentence;
taking whole sentences whose word count is greater than the preset threshold as long sentences, counting the total number of long sentences, recorded as count_long, and counting the number of words in each long sentence as word_long_i, where 1 ≤ i ≤ count_long and i is the sequence number of the long sentence;
the clauses in the whole sentences are first-type clauses; the total number of first-type clauses is counted and recorded as count_clause;
the clauses in the long sentences are second-type clauses; the total number of second-type clauses is counted and recorded as count_clause_long;
respectively calculating the average length of the whole sentence, the average length of the long sentence, the average length of the first type of clause and the average length of the second type of clause; the following were used:
the average length of the whole sentences (MLS) is calculated as: MLS = word/count_sentence;
the average length of the first-type clauses (MLC) is calculated as: MLC = word/count_clause;
the average length of long sentences (MLL) is calculated by the following method:
MLL = (1/count_long)·Σ_{i=1}^{count_long} word_long_i;
the average length (MLCL) of the second type of clauses is calculated as follows:
MLCL = (1/count_clause_long)·Σ_{i=1}^{count_long} word_long_i;
calculating according to a sentence complexity calculation formula to obtain sentence complexity; the sentence complexity calculation formula is as follows:
diff_sentence=K21·MLS+K22·MLC+K23·MLL+K24·MLCL;
where K21, K22, K23 and K24 are the sentence complexity adjustment coefficients calculated from the collected samples; they belong to the second-stage adjustment coefficients and are multiple linear regression coefficients that can be calculated by the least squares method, in the same way as the vocabulary level adjustment coefficients.
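The four averages and their combination might then be computed as below; the coefficients are placeholders assumed to have been fitted from samples, and the counts are the ones produced by the segmentation sketch above.

```python
def sentence_complexity(word_total, count_sentence, count_clause,
                        word_long, count_clause_long,
                        K21=0.3, K22=0.3, K23=0.2, K24=0.2):  # placeholder coefficients
    """diff_sentence = K21·MLS + K22·MLC + K23·MLL + K24·MLCL."""
    MLS = word_total / count_sentence              # average whole-sentence length
    MLC = word_total / count_clause                # average first-type clause length
    MLL = sum(word_long) / len(word_long) if word_long else 0.0
    MLCL = sum(word_long) / count_clause_long if count_clause_long else 0.0
    return K21 * MLS + K22 * MLC + K23 * MLL + K24 * MLCL

print(sentence_complexity(word_total=120, count_sentence=8, count_clause=20,
                          word_long=[30, 26], count_clause_long=9))
```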
3. Calculating a translation difficulty value of the document;
calculating to obtain a translation difficulty value of the document according to the acquired vocabulary complexity and sentence complexity of the document and a translation difficulty calculation formula; the formula is as follows:
diff_doc=K1·diff_word+K2·diff_sentence;
where K1 and K2 are the translation difficulty adjustment coefficients calculated from the collected samples; they belong to the first-stage adjustment coefficients and are multiple linear regression coefficients that can be calculated by the least squares method, in the same way as the vocabulary level adjustment coefficients.
4. Determining a translation difficulty level of the document;
matching in a difficulty level table according to the translation difficulty value of the document to obtain a difficulty level corresponding to the value;
the difficulty level table is in a form similar to a dictionary and comprises a plurality of difficulty levels and translation difficulty value ranges corresponding to the difficulty levels;
and the translation difficulty value ranges in the difficulty level table are obtained by learning or training.
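A final sketch of the combination and table lookup; the level names, value ranges and coefficients below are placeholders, since the patent obtains the ranges by learning or training and the coefficients by sample regression.

```python
def difficulty_level(diff_word, diff_sentence, level_table, K1=0.6, K2=0.4):
    """Compute diff_doc = K1·diff_word + K2·diff_sentence and look up the
    matching level in a dictionary-like difficulty level table."""
    diff_doc = K1 * diff_word + K2 * diff_sentence
    for level, (low, high) in level_table.items():
        if low <= diff_doc < high:
            return diff_doc, level
    return diff_doc, None   # value falls outside every range

table = {"easy": (0.0, 5.0), "medium": (5.0, 10.0), "hard": (10.0, float("inf"))}
print(difficulty_level(diff_word=0.6, diff_sentence=13.1, level_table=table))
```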
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.