Summary of the invention
It is desirable to provide a method for analyzing document translation difficulty, so as to solve the problem of how to assign each document to a suitable translator.
The invention discloses a method for analyzing document translation difficulty, including:
Scanning a document to be translated, and determining all words and all sentences in the document;
Performing complexity calculations on the determined words and sentences respectively, to obtain the vocabulary complexity and the sentence complexity of the document;
Calculating a translation difficulty value of the document from the vocabulary complexity and the sentence complexity of the document;
Matching the translation difficulty value of the document against a difficulty grade table, to determine the translation difficulty grade of the document.
Preferably, the process of calculating the vocabulary complexity of the document includes:
Calculating the vocabulary grade, the type-token ratio and the notional word sense density of the document;
Calculating the vocabulary complexity of the document according to the vocabulary complexity formula, which is as follows:
$$\text{diff\_word} = K_{11}\cdot\text{grade\_word} + K_{12}\cdot\text{STTR} + K_{13}\cdot\text{density\_notional}$$
Where diff_word is the vocabulary complexity of the document, grade_word is the vocabulary grade of the document, STTR is the type-token ratio of the document, density_notional is the notional word sense density of the document, and $K_{11}$, $K_{12}$ and $K_{13}$ are vocabulary complexity adjustment coefficients calculated from samples.
Preferably, before calculating the vocabulary grade of the document, the method further includes:
Performing word segmentation on the document to obtain all words, and counting them to obtain the total word count;
Matching each obtained word against a vocabulary level table to obtain the vocabulary level of that word, the vocabulary level being level one, level two, level three or level four;
Counting, separately for each level, the number of words whose vocabulary level is level two or above;
The process of calculating the vocabulary grade of the document includes:
Calculating the vocabulary grade of the document according to the vocabulary grade formula, which is as follows:
$$\text{grade\_word} = K_{111}\cdot\frac{\text{word}_2}{\text{word}} + K_{112}\cdot\frac{\text{word}_3}{\text{word}} + K_{113}\cdot\frac{\text{word}_4}{\text{word}}$$
Where $\text{word}_X$ is the number of words whose vocabulary level is level X, $K_{111}$, $K_{112}$ and $K_{113}$ are vocabulary grade adjustment coefficients calculated from samples, and word is the total word count.
Preferably, the process of calculating the type-token ratio of the document includes:
Counting the number of types and the number of tokens among all the obtained words, and calculating the ratio of the number of types to the number of tokens to obtain the type-token ratio of the document; or
Dividing all the obtained words, in units of a standard number of words, into a plurality of subdocuments each containing the standard number of words and one subdocument containing fewer than the standard number of words, and calculating the type-token ratio of the document according to the type-token ratio formula, which is as follows:
$$\text{STTR} = \frac{1}{n+1}\left(\sum_{i=1}^{n}\frac{\text{type}_i}{ST} + \frac{\text{type}}{\text{token}}\right)$$
Where token is the number of tokens in the subdocument containing fewer than the standard number of words, type is the number of types in that subdocument, $\text{type}_i$ is the number of types in the i-th subdocument containing the standard number of words, n is the number of subdocuments containing the standard number of words, and ST is the standard number of words used as the division unit.
Preferably, before calculating the notional word sense density of the document, the method further includes:
Performing part-of-speech tagging on all the obtained words to obtain the notional words among them;
Arranging all the obtained notional words in a certain order;
Obtaining the number of senses $\text{meanings}_i$ of each notional word using a synonym ontology tool, where i is the sequence number of the notional word, and summing to obtain the total number of senses of the notional words;
Calculating the notional word sense density of the document according to the notional word sense density formula, which is as follows:
$$\text{density\_notional} = \frac{\sum_{i=1}^{\text{count\_notional}}\text{meanings}_i}{\text{count\_notional}}$$
Where $\text{meanings}_i$ is the number of senses of the i-th notional word, and count_notional is the number of notional words.
Preferably, the notional words include at least one of the following parts of speech: noun, synonym, verb, adjective, adverb and interjection.
Preferably, before calculating the sentence complexity of the document, the method further includes:
Calculating the average length of full sentences from the determined number of full sentences in the document;
Calculating the average length of first-type clauses from the determined number of first-type clauses in all the full sentences of the document;
Calculating the average length of long sentences from the determined number of long sentences in the document and the length of each long sentence;
Calculating the average length of second-type clauses from the determined number of second-type clauses in all the long sentences of the document;
The process of calculating the sentence complexity of the document includes:
Calculating the sentence complexity of the document according to the sentence complexity formula, which is as follows:
$$\text{diff\_sentence} = K_{21}\cdot\text{MLS} + K_{22}\cdot\text{MLC} + K_{23}\cdot\text{MLL} + K_{24}\cdot\text{MLCL}$$
Where MLS is the average length of the full sentences, MLC is the average length of the first-type clauses, MLL is the average length of the long sentences, MLCL is the average length of the second-type clauses, and $K_{21}$, $K_{22}$, $K_{23}$ and $K_{24}$ are sentence complexity adjustment coefficients calculated from samples.
Preferably, the process of calculating the average lengths of the full sentences and the first-type clauses includes:
Dividing the total word count by the number of full sentences to obtain the average length of the full sentences;
Dividing the total word count by the number of first-type clauses to obtain the average length of the first-type clauses.
Preferably, the process of calculating the average lengths of the long sentences and the second-type clauses includes:
Counting the length $\text{word\_long}_i$ of each long sentence, 1≤i≤count_long, where i is the sequence number of the long sentence;
Calculating the average length of the long sentences according to the long sentence average length formula, which is as follows:
$$\text{MLL} = \frac{\sum_{i=1}^{\text{count\_long}}\text{word\_long}_i}{\text{count\_long}}$$
Where count_long is the number of long sentences;
Calculating the average length of the second-type clauses according to the second-type clause average length formula, which is as follows:
$$\text{MLCL} = \frac{\sum_{i=1}^{\text{count\_long}}\text{word\_long}_i}{\text{count\_clause\_long}}$$
Where count_clause_long is the number of second-type clauses.
Preferably, the process of calculating the translation difficulty value of the document includes:
Calculating the translation difficulty value of the document according to the translation difficulty formula, which is as follows:
$$\text{diff\_doc} = K_{1}\cdot\text{diff\_word} + K_{2}\cdot\text{diff\_sentence}$$
Where $K_{1}$ and $K_{2}$ are translation difficulty adjustment coefficients calculated from samples, and diff_doc is the translation difficulty value.
The method for analyzing document translation difficulty of the present invention has the following advantages:
1. The translation difficulty of documents is calculated in a uniform and objective way, which improves the accuracy of the calculated translation difficulty;
2. It can be used to assign translation tasks to translators, achieving a rational, optimized allocation of resources.
Detailed description of the invention
The present invention is described in detail below with reference to the accompanying drawings and in conjunction with embodiments.
This technical solution analyzes the translation difficulty of a document to be translated from two aspects, vocabulary complexity and sentence complexity, and determines the translation difficulty of the document from its vocabulary complexity and sentence complexity. Specifically, it includes:
S11: scanning the document to be translated, and determining all words and all sentences in the document;
S12: performing complexity calculations on the determined words and sentences respectively, to obtain the vocabulary complexity and the sentence complexity of the document;
S13: calculating the translation difficulty value of the document from the vocabulary complexity and the sentence complexity of the document;
S14: matching the translation difficulty value of the document against a difficulty grade table, to determine the translation difficulty grade of the document.
Based on the above method, a preferred embodiment is provided below:
Determine the document to be translated, hereinafter "the document";
1. Calculate the vocabulary complexity of the document; the process is as follows:
Perform word segmentation on the document to obtain all words in the document. The term "word" should not be understood as meaning only English words; it also covers words in character-based scripts, such as Chinese, Japanese and Korean, and/or words in alphabetic scripts, such as French and Russian. "All words" is understood to include repeated words;
1) Calculate the vocabulary grade of the document:
Match each obtained word against the vocabulary level table to obtain the level matched by that word, the level being level one, level two, level three or level four. Levels one, two and three are obtained by table lookup; words that fail to match in the vocabulary level table are treated as level four;
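The lookup rule above can be sketched as follows. The table contents and the name `count_levels` are hypothetical; only the rule itself (levels one to three by lookup, unmatched words as level four) comes from the text.

```python
from collections import Counter

# Hypothetical vocabulary level table: word -> level (1, 2 or 3).
LEVEL_TABLE = {"the": 1, "cat": 1, "translate": 2, "ontology": 3}


def count_levels(words, level_table):
    """Count words per level; words missing from the table count as level 4."""
    counts = Counter(level_table.get(w, 4) for w in words)
    return {level: counts.get(level, 0) for level in (1, 2, 3, 4)}


if __name__ == "__main__":
    words = ["the", "cat", "translate", "ontology", "zyzzyva", "the"]
    print(count_levels(words, LEVEL_TABLE))  # {1: 3, 2: 1, 3: 1, 4: 1}
```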
Each language can grade its vocabulary according to how frequently the words occur in actual use. This technical solution builds a vocabulary level table for each language according to the authoritative grading rules for that language's vocabulary, dividing the words of each language into three levels by how commonly they are used. For example, Chinese uses the "Table of General Standard Chinese Characters" and the "Code of Chinese Graphic Character Set for Information Interchange, Primary Set" as the grading reference for Chinese characters, with common, less common and rare characters corresponding to level one, level two and level three respectively.
Count the number of words at level one as $\text{word}_1$, at level two as $\text{word}_2$, at level three as $\text{word}_3$ and at level four as $\text{word}_4$;
Count all words in the document to obtain the total word count, word;
Calculate the proportion of the document taken up by words at level two and above, as follows:
The proportion of level-two words is $\text{word}_2/\text{word}$, the proportion of level-three words is $\text{word}_3/\text{word}$, and the proportion of level-four words is $\text{word}_4/\text{word}$.
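Calculate the vocabulary grade of the document according to the vocabulary grade formula, which is as follows:
$$\text{grade\_word} = K_{111}\cdot\frac{\text{word}_2}{\text{word}} + K_{112}\cdot\frac{\text{word}_3}{\text{word}} + K_{113}\cdot\frac{\text{word}_4}{\text{word}}$$
Where grade_word is the vocabulary grade, and $K_{111}$, $K_{112}$ and $K_{113}$ are vocabulary grade adjustment coefficients calculated from given samples. They are third-level adjustment coefficients; they are multiple linear regression coefficients and can be obtained by the method of least squares. The calculation is as follows:
Let:
$$X_1 = \frac{\text{word}_2}{\text{word}},\quad X_2 = \frac{\text{word}_3}{\text{word}},\quad X_3 = \frac{\text{word}_4}{\text{word}}$$
For n groups of collected sample data:
$\{X_{11}, X_{12}, X_{13}\}$;
$\{X_{21}, X_{22}, X_{23}\}$;
...;
$\{X_{n1}, X_{n2}, X_{n3}\}$;
and the corresponding vocabulary grades given by expert evaluation:
$$\{Y_1, Y_2, \ldots, Y_n\}$$
the following system of linear equations is obtained:
$$Y_1 = K_{111}\cdot X_{11} + K_{112}\cdot X_{12} + K_{113}\cdot X_{13}$$
$$Y_2 = K_{111}\cdot X_{21} + K_{112}\cdot X_{22} + K_{113}\cdot X_{23}$$
...;
$$Y_n = K_{111}\cdot X_{n1} + K_{112}\cdot X_{n2} + K_{113}\cdot X_{n3}$$
Solving by least squares gives:
$$K = (X'X)^{-1}X'Y$$
Where $K = (K_{111}, K_{112}, K_{113})'$, X is the n×3 matrix whose i-th row is $(X_{i1}, X_{i2}, X_{i3})$, $Y = (Y_1, \ldots, Y_n)'$, and X' is the transpose of X.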
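As an illustration of the least-squares step, the sketch below solves the normal equations (X'X)K = X'Y in plain Python and then applies the fitted coefficients to one document. The sample ratios and expert grades are made-up numbers, and `fit_coefficients` and `grade_word` are hypothetical names; a numerical library could be used instead.

```python
def fit_coefficients(X, Y):
    """Solve the normal equations (X'X) K = X'Y by Gauss-Jordan elimination."""
    m = len(X[0])
    # Augmented matrix [X'X | X'Y].
    A = [[sum(row[i] * row[j] for row in X) for j in range(m)]
         + [sum(row[i] * y for row, y in zip(X, Y))]
         for i in range(m)]
    for col in range(m):
        # Partial pivoting for numerical stability.
        pivot = max(range(col, m), key=lambda r: abs(A[r][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular; collect more varied samples")
        A[col], A[pivot] = A[pivot], A[col]
        for r in range(m):
            if r != col:
                factor = A[r][col] / A[col][col]
                A[r] = [a - factor * b for a, b in zip(A[r], A[col])]
    return [A[i][m] / A[i][i] for i in range(m)]


def grade_word(counts, total, K):
    """Apply the vocabulary grade formula to per-level word counts."""
    ratios = [counts[2] / total, counts[3] / total, counts[4] / total]
    return sum(k * x for k, x in zip(K, ratios))


if __name__ == "__main__":
    # Made-up samples: each row is (word2/word, word3/word, word4/word).
    X = [[0.20, 0.10, 0.05], [0.10, 0.30, 0.10],
         [0.30, 0.05, 0.20], [0.15, 0.15, 0.15]]
    Y = [0.55, 1.0, 1.0, 0.9]  # made-up expert grades
    K = fit_coefficients(X, Y)
    print(K)
    print(grade_word({1: 3, 2: 1, 3: 1, 4: 1}, 6, K))
```

The same routine works unchanged for the other coefficient sets, since it places no limit on the number of columns in X.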
2) Calculate the standardized type-token ratio of the document:
Count the tokens in the document, i.e. the total number of words occurring in the document;
Count the types in the document, i.e. the number of distinct words occurring in the document;
The type-token ratio (TTR) represents the rate of lexical variation, i.e. the richness of the vocabulary in the document. The higher the TTR, the more distinct words the text uses, and the more difficult it is to read. Because the number of characters or words in any language is finite, the larger the document, the smaller the TTR becomes, and the computed TTR is distorted. In practice, therefore, the TTR can be calculated for each unit of a standard number ST of words (for example, ST = 1000), and the mean of all these TTRs is taken as the final value, i.e. the standardized type-token ratio (STTR, Standard TTR). For a document with fewer than the standard number of words, the TTR is calculated directly. Specifically:
Divide all words of the document, in units of the standard number ST, into n first subdocuments; the number of types in each first subdocument is $\text{type}_i$, where i is the sequence number of the first subdocument;
There is also one second subdocument containing fewer than ST words; the number of types in the second subdocument is type and the number of tokens is token;
Calculate the STTR of the document according to the STTR formula, which is as follows:
$$\text{STTR} = \frac{1}{n+1}\left(\sum_{i=1}^{n}\frac{\text{type}_i}{ST} + \frac{\text{type}}{\text{token}}\right)$$
Where token is the number of tokens in the subdocument containing fewer than the standard number of words, type is the number of types in that subdocument, $\text{type}_i$ is the number of types in the i-th subdocument containing the standard number of words, n is the number of subdocuments containing the standard number of words, and ST is the standard number of words used as the division unit.
3) Calculate the notional word sense density of the document:
Lexical density is the proportion of notional words among all the words of a text. In general, the higher the lexical density, the larger the proportion of notional words in the text and the greater the amount of information, and the more difficult the text is to read and translate.
Count the number of notional words in the document, count_notional, i.e. count the nouns, synonyms, verbs, adjectives, adverbs, interjections and so on;
Arrange all the obtained notional words in a certain order;
Using a synonym ontology tool, count the number of senses $\text{meanings}_i$ (1≤i≤count_notional) of each notional word, where i is the sequence number of the notional word;
Add up the numbers of senses of all notional words to obtain the total number of senses of all notional words.
Calculate the notional word sense density of the document according to the notional word sense density formula, which is as follows:
$$\text{density\_notional} = \frac{\sum_{i=1}^{\text{count\_notional}}\text{meanings}_i}{\text{count\_notional}}$$
Where density_notional is the notional word sense density, and $\sum_{i=1}^{\text{count\_notional}}\text{meanings}_i$ is the total number of senses of the notional words;
The steps of calculating the vocabulary grade, the STTR and the notional word sense density of the document have no required order; they can be performed separately or simultaneously.
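A minimal sketch of the density calculation follows. It assumes the words have already been part-of-speech tagged and that sense counts are available as a dictionary. The tag names, the sense counts, the default of one sense for unknown words, and the name `notional_sense_density` are all illustrative assumptions; the source only says that a synonym ontology tool supplies the counts.

```python
# Assumed tag set standing in for the notional parts of speech in the text.
NOTIONAL_TAGS = {"NOUN", "VERB", "ADJ", "ADV", "INTJ"}


def notional_sense_density(tagged_words, sense_counts):
    """Total senses of notional words divided by the number of notional words."""
    notional = [word for word, tag in tagged_words if tag in NOTIONAL_TAGS]
    if not notional:
        return 0.0
    # Assumption: a word missing from the sense table has exactly one sense.
    total_senses = sum(sense_counts.get(word, 1) for word in notional)
    return total_senses / len(notional)


if __name__ == "__main__":
    tagged = [("bank", "NOUN"), ("the", "DET"), ("run", "VERB"), ("quickly", "ADV")]
    senses = {"bank": 4, "run": 6, "quickly": 2}  # made-up sense counts
    print(notional_sense_density(tagged, senses))  # 4.0
```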
4) Calculate the vocabulary complexity of the document from its vocabulary grade, STTR and notional word sense density:
Calculate the vocabulary complexity of the document according to the vocabulary complexity formula, which is as follows:
$$\text{diff\_word} = K_{11}\cdot\text{grade\_word} + K_{12}\cdot\text{STTR} + K_{13}\cdot\text{density\_notional}$$
Where diff_word is the vocabulary complexity, grade_word is the vocabulary grade, STTR is the standardized type-token ratio, and density_notional is the notional word sense density. $K_{11}$, $K_{12}$ and $K_{13}$ are vocabulary complexity adjustment coefficients calculated from given samples. They are second-level adjustment coefficients; they are multiple linear regression coefficients and can be obtained by the method of least squares. The calculation is the same as for the vocabulary grade adjustment coefficients.
2. Calculate the sentence complexity of the document, specifically as follows:
The term "full sentence" should be understood as a set of words expressing a complete meaning, for example: the set of words between the first word of the document and a terminator, where a terminator is one of the full stop, exclamation mark, question mark and ellipsis; or the set of words between the first word after a first terminator and a second terminator;
The term "clause" should be understood as a part of a full sentence: a word or set of words delimited by punctuation such as the comma, the enumeration comma or the semicolon;
The term "long sentence" should be understood as a full sentence whose word count exceeds a predetermined threshold;
"First-type" and "second-type" as used herein serve only to distinguish the two kinds of clause.
The scheme is specifically as follows:
Scan the document, determine all full sentences in the document, and count the total number of full sentences, denoted count_sentence;
Take each full sentence whose word count exceeds the predetermined threshold as a long sentence; count the total number of long sentences, denoted count_long, and the number of words in each long sentence, denoted $\text{word\_long}_i$, 1≤i≤count_long, where i is the sequence number of the long sentence;
The clauses in full sentences are first-type clauses; count the total number of first-type clauses, denoted count_clause;
The clauses in long sentences are second-type clauses; count the total number of second-type clauses, denoted count_clause_long;
Calculate the average length of full sentences, the average length of long sentences, the average length of first-type clauses and the average length of second-type clauses respectively, as follows:
The average length of full sentences (MLS, mean length of sentence) is calculated as:
$$\text{MLS} = \frac{\text{word}}{\text{count\_sentence}}$$
The average length of first-type clauses (MLC, mean length of clause) is calculated as:
$$\text{MLC} = \frac{\text{word}}{\text{count\_clause}}$$
The average length of long sentences (MLL, mean length of long sentence) is calculated as:
$$\text{MLL} = \frac{\sum_{i=1}^{\text{count\_long}}\text{word\_long}_i}{\text{count\_long}}$$
The average length of second-type clauses (MLCL, mean length of clause of long sentence) is calculated as:
$$\text{MLCL} = \frac{\sum_{i=1}^{\text{count\_long}}\text{word\_long}_i}{\text{count\_clause\_long}}$$
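The four averages can be sketched as follows. The punctuation sets, the regex word counter, the default threshold of 20 words, and the choice to return 0 for MLL and MLCL when there are no long sentences are assumptions for illustration; `sentence_metrics` is a hypothetical name. A sentence without clause punctuation is counted as a single clause.

```python
import re

# Assumed punctuation sets; the text names the marks but not exact characters.
SENTENCE_END = r"[.!?。！？…]+"
CLAUSE_SEP = r"[,;，、；]"


def count_words(text):
    # Placeholder counter: runs of word characters. A real system would reuse
    # the document's word segmentation, especially for character-based scripts.
    return len(re.findall(r"\w+", text))


def clause_count(sentence):
    return sum(1 for c in re.split(CLAUSE_SEP, sentence) if count_words(c) > 0)


def sentence_metrics(text, long_threshold=20):
    """Return (MLS, MLC, MLL, MLCL) for `text`."""
    sentences = [s for s in re.split(SENTENCE_END, text) if count_words(s) > 0]
    if not sentences:
        raise ValueError("document contains no full sentences")
    lengths = [count_words(s) for s in sentences]
    word = sum(lengths)
    count_sentence = len(sentences)
    count_clause = sum(clause_count(s) for s in sentences)

    long_pairs = [(s, n) for s, n in zip(sentences, lengths) if n > long_threshold]
    word_long = sum(n for _, n in long_pairs)
    count_long = len(long_pairs)
    count_clause_long = sum(clause_count(s) for s, _ in long_pairs)

    mls = word / count_sentence
    mlc = word / count_clause
    # Assumption: with no long sentences, MLL and MLCL are reported as 0.
    mll = word_long / count_long if count_long else 0.0
    mlcl = word_long / count_clause_long if count_clause_long else 0.0
    return mls, mlc, mll, mlcl


if __name__ == "__main__":
    text = "I came, I saw, I conquered. It rained."
    print(sentence_metrics(text, long_threshold=5))  # (4.0, 2.0, 6.0, 2.0)
```

Splitting on every full stop will also break at abbreviations and decimal numbers, so a real implementation would need a more careful sentence splitter.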
Calculate the sentence complexity according to the sentence complexity formula, which is as follows:
$$\text{diff\_sentence} = K_{21}\cdot\text{MLS} + K_{22}\cdot\text{MLC} + K_{23}\cdot\text{MLL} + K_{24}\cdot\text{MLCL}$$
$K_{21}$, $K_{22}$, $K_{23}$ and $K_{24}$ are sentence complexity adjustment coefficients calculated from the collected samples. They are second-level adjustment coefficients; they are multiple linear regression coefficients and can be obtained by the method of least squares. The calculation is the same as for the vocabulary grade adjustment coefficients.
3. Calculate the translation difficulty value of the document;
From the vocabulary complexity and sentence complexity obtained for the document, calculate the translation difficulty value of the document according to the translation difficulty formula, which is as follows:
$$\text{diff\_doc} = K_{1}\cdot\text{diff\_word} + K_{2}\cdot\text{diff\_sentence}$$
$K_{1}$ and $K_{2}$ are translation difficulty adjustment coefficients calculated from the collected samples. They are first-level adjustment coefficients; they are multiple linear regression coefficients and can be obtained by the method of least squares. The calculation is the same as for the vocabulary grade adjustment coefficients.
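To show how the first-level and second-level coefficients combine, the sketch below chains the three linear formulas. Every coefficient value here is a placeholder chosen for illustration; in the method they come from regression on samples. The names `linear` and `translation_difficulty` are hypothetical.

```python
# Placeholder coefficients (not from the source).
K_WORD = (0.5, 0.3, 0.2)        # K11, K12, K13
K_SENT = (0.4, 0.2, 0.3, 0.1)   # K21, K22, K23, K24
K_DOC = (0.6, 0.4)              # K1, K2


def linear(coeffs, values):
    """Weighted sum of values by their coefficients."""
    return sum(k * v for k, v in zip(coeffs, values))


def translation_difficulty(grade_word, sttr, density_notional, mls, mlc, mll, mlcl):
    """Compute diff_doc from the seven document-level measurements."""
    diff_word = linear(K_WORD, (grade_word, sttr, density_notional))
    diff_sentence = linear(K_SENT, (mls, mlc, mll, mlcl))
    return linear(K_DOC, (diff_word, diff_sentence))


if __name__ == "__main__":
    print(translation_difficulty(1.0, 0.5, 4.0, 10.0, 5.0, 20.0, 10.0))
```

With these placeholder values, diff_word is 1.45 and diff_sentence is 12, giving a diff_doc of about 5.67.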
4. Determine the translation difficulty grade of the document;
Match the translation difficulty value of the document against the difficulty grade table to obtain the difficulty grade corresponding to that value;
The difficulty grade table takes a dictionary-like form, containing several difficulty grades and the range of translation difficulty values corresponding to each grade;
The ranges of translation difficulty values in the difficulty grade table are obtained by learning or training.
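The table lookup can be sketched with a sorted list of range boundaries. The boundaries and grade labels below are invented for illustration, since the source obtains the ranges by learning or training and does not list them; each range is assumed to include its lower bound and exclude its upper bound.

```python
from bisect import bisect_right

# Hypothetical difficulty grade table: grade i covers values below
# UPPER_BOUNDS[i]; the last grade covers everything above the final bound.
UPPER_BOUNDS = [2.0, 4.0, 6.0, 8.0]
GRADES = ["A", "B", "C", "D", "E"]


def difficulty_grade(diff_doc):
    """Return the grade whose value range contains diff_doc."""
    return GRADES[bisect_right(UPPER_BOUNDS, diff_doc)]


if __name__ == "__main__":
    print(difficulty_grade(5.67))  # C
    print(difficulty_grade(2.0))   # B
```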
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit the present invention. For those skilled in the art, the present invention may have various modifications and variations. Any modification, equivalent replacement, improvement and the like made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.