CN103744840B - A method for analyzing document translation difficulty

Info

Publication number
CN103744840B
Authority
CN
China
Prior art keywords
vocabulary
word
document
sentence
average length
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310713175.2A
Other languages
Chinese (zh)
Other versions
CN103744840A (en
Inventor
江潮
张芃
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Language Network (wuhan) Information Technology Co Ltd
Original Assignee
Language Network (wuhan) Information Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Language Network (wuhan) Information Technology Co Ltd
Priority to CN201310713175.2A
Publication of CN103744840A
Application granted
Publication of CN103744840B
Active legal status
Anticipated expiration

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a method for analyzing document translation difficulty, comprising: scanning the document to be translated and identifying all of its words and sentences; performing complexity calculations on the identified words and sentences, respectively, to obtain the vocabulary complexity and the sentence complexity of the document; calculating the translation difficulty value of the document from its vocabulary complexity and sentence complexity; and matching that value in a difficulty level table to determine the translation difficulty level of the document. By providing a way to compute the translation difficulty of a document, the invention calculates the translation difficulty of the document to be translated accurately and improves the accuracy of translation difficulty analysis.

Description

A method for analyzing document translation difficulty
Technical field
The present invention relates to the field of translation technology, and in particular to a method for analyzing document translation difficulty.
Background
Document translation difficulty can be assessed either manually or by machine. In manual assessment, language experts or translation experts annotate and judge the documents to be translated. Because human reading and comprehension are limited, this approach is slow and incurs high labor costs. Moreover, assessors differ in ability and in how they understand a document's difficulty, so their judgments vary widely, cannot follow a unified standard, and lack objectivity. In machine assessment, a computer applies a fixed, uniform method to judge translation difficulty. The most common method at present judges difficulty by counting the rare words in a document. A single-dimension criterion of this kind is a weak basis for judgment and is rather one-sided; its results often differ considerably from the actual situation, so their accuracy cannot be guaranteed. At present there is no method for assessing document translation difficulty that is both efficient and reasonably accurate.
Summary of the invention
The present invention aims to provide a method for analyzing document translation difficulty, in order to solve the problem of how to assign documents to suitable translators.
The invention discloses a method for analyzing document translation difficulty, comprising:
scanning the document to be translated and identifying all words and all sentences in it;
performing complexity calculations on the identified words and sentences, respectively, to obtain the vocabulary complexity and the sentence complexity of the document;
calculating the translation difficulty value of the document from its vocabulary complexity and sentence complexity;
matching the translation difficulty value of the document in a difficulty level table to determine the translation difficulty level of the document.
Preferably, calculating the vocabulary complexity of the document comprises:
calculating the vocabulary grade, the type-token ratio, and the content-word sense density of the document;
calculating the vocabulary complexity of the document according to the vocabulary complexity formula, which is as follows:
$$\mathrm{diff\_word} = K_{11}\cdot\mathrm{grade\_word} + K_{12}\cdot\mathrm{STTR} + K_{13}\cdot\mathrm{density\_notional};$$
where diff_word is the vocabulary complexity of the document, grade_word is the vocabulary grade of the document, STTR is the type-token ratio of the document, density_notional is the content-word sense density of the document, and $K_{11}$, $K_{12}$ and $K_{13}$ are vocabulary complexity adjustment coefficients calculated from samples.
Preferably, before calculating the vocabulary grade of the document, the method further comprises:
performing word segmentation on the document to obtain all words, and counting them to obtain the total word count;
matching each word obtained in a vocabulary level table to obtain its vocabulary level, the vocabulary level being level 1, level 2, level 3, or level 4;
counting, separately for each level, the number of words whose vocabulary level is level 2 or above.
Calculating the vocabulary grade of the document comprises:
calculating the vocabulary grade of the document according to the vocabulary grade formula, which is as follows:
$$\mathrm{grade\_word} = K_{111}\cdot\frac{\mathrm{word}_2}{\mathrm{word}} + K_{112}\cdot\frac{\mathrm{word}_3}{\mathrm{word}} + K_{113}\cdot\frac{\mathrm{word}_4}{\mathrm{word}};$$
where $\mathrm{word}_x$ is the number of words whose vocabulary level is level x, $K_{111}$, $K_{112}$ and $K_{113}$ are vocabulary grade adjustment coefficients calculated from samples, and word is the total word count.
Preferably, calculating the type-token ratio of the document comprises:
counting the number of types and the number of tokens among all the words obtained, and calculating the ratio of the number of types to the number of tokens to obtain the type-token ratio of the document; or
dividing all the words obtained into multiple subdocuments of a standard number of words each, plus one subdocument containing fewer than the standard number of words, and calculating the type-token ratio of the document according to the type-token ratio formula, which is as follows:
$$\mathrm{STTR} = \begin{cases} \dfrac{1}{(n+1)\cdot ST\cdot\mathrm{token}}\left(\mathrm{type}\cdot ST + \mathrm{token}\cdot\sum_{i=1}^{n}\mathrm{type}_i\right), & n \ge 1 \\[2ex] \dfrac{\mathrm{type}}{\mathrm{token}}, & n = 0 \end{cases}$$
where token is the number of tokens in the subdocument with fewer than the standard number of words, type is the number of types in that subdocument, $\mathrm{type}_i$ is the number of types in the i-th subdocument containing the standard number of words, n is the number of subdocuments containing the standard number of words, and ST is the standard number of words per subdocument.
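The STTR formula is the mean of the per-subdocument type-token ratios, which can be sketched in a few lines of Python. This is a minimal illustration, not the patented implementation: the function name `sttr` and the default `st=1000` are assumptions, and the handling of a document whose length is an exact multiple of ST (where no short subdocument exists and the formula's `token` would be zero) is a choice made here, since the text does not cover that case.

```python
def sttr(words, st=1000):
    """Mean type-token ratio over chunks of `st` words plus the short remainder."""
    if st <= 0:
        raise ValueError("st must be positive")
    if not words:
        raise ValueError("document has no words")
    n = len(words) // st  # number of full subdocuments
    # TTR of each full subdocument: type_i / ST
    ratios = [len(set(words[i * st:(i + 1) * st])) / st for i in range(n)]
    remainder = words[n * st:]  # subdocument with fewer than ST words
    if remainder:
        # TTR of the short subdocument: type / token
        ratios.append(len(set(remainder)) / len(remainder))
    # Assumption: with no remainder, only the full chunks are averaged.
    return sum(ratios) / len(ratios)
```

With `n = 0` the function reduces to `type / token`, matching the second branch of the formula.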
Preferably, before calculating the content-word sense density of the document, the method further comprises:
performing part-of-speech tagging on all the words obtained to identify the content words among them;
arranging all the content words obtained in a definite order;
obtaining the number of senses $\mathrm{meanings}_i$ of each content word using a synonym ontology tool, where i is the index of the content word, and summing to obtain the total number of senses of the content words;
calculating the content-word sense density of the document according to the content-word sense density formula, which is as follows:
$$\mathrm{density\_notional} = \frac{\sum_{i=1}^{\mathrm{count\_notional}}\mathrm{meanings}_i}{\sum_{i=1}^{\mathrm{count\_notional}}\mathrm{meanings}_i + (\mathrm{word} - \mathrm{count\_notional})};$$
where $\mathrm{meanings}_i$ is the number of senses of the i-th content word and count_notional is the number of content words.
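The density_notional formula weights each content word by its number of senses and each remaining word by 1. The sketch below assumes the words are already part-of-speech tagged and that a sense-count lookup is supplied by the caller; the function name `sense_density`, the tag names in `CONTENT_TAGS`, and the `sense_count` callable are all hypothetical stand-ins for whatever tagger and synonym ontology tool are actually used.

```python
# Assumed tag names for content words; a real tagger may use different labels.
CONTENT_TAGS = frozenset({"NOUN", "VERB", "ADJ", "ADV", "INTJ"})

def sense_density(tagged_words, sense_count, content_tags=CONTENT_TAGS):
    """tagged_words: list of (word, tag) pairs; sense_count: callable word -> int."""
    word = len(tagged_words)  # total word count
    content = [w for w, tag in tagged_words if tag in content_tags]
    count_notional = len(content)
    total_senses = sum(sense_count(w) for w in content)  # sum of meanings_i
    denominator = total_senses + (word - count_notional)
    if denominator == 0:
        raise ValueError("document has no words")
    return total_senses / denominator
```

For example, a four-word text with two content words carrying 3 and 5 senses gives 8 / (8 + 2) = 0.8.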
Preferably, the content words include words of at least one of the following parts of speech: noun, synonym, verb, adjective, adverb, and interjection.
Preferably, before calculating the sentence complexity of the document, the method further comprises:
calculating the average length of full sentences from the number of full sentences determined in the document;
calculating the average length of first-type clauses from the number of first-type clauses in all the full sentences determined in the document;
calculating the average length of long sentences from the number of long sentences determined in the document and the length of each long sentence;
calculating the average length of second-type clauses from the number of second-type clauses in all the long sentences determined in the document.
Calculating the sentence complexity of the document comprises:
calculating the sentence complexity of the document according to the sentence complexity formula, which is as follows:
$$\mathrm{diff\_sentence} = K_{21}\cdot\mathrm{MLS} + K_{22}\cdot\mathrm{MLC} + K_{23}\cdot\mathrm{MLL} + K_{24}\cdot\mathrm{MLCL};$$
where MLS is the average length of full sentences, MLC is the average length of first-type clauses, MLL is the average length of long sentences, MLCL is the average length of second-type clauses, and $K_{21}$, $K_{22}$, $K_{23}$ and $K_{24}$ are sentence complexity adjustment coefficients calculated from samples.
Preferably, calculating the average length of full sentences and of first-type clauses comprises:
dividing the total word count by the number of full sentences to obtain the average length of full sentences;
dividing the total word count by the number of first-type clauses to obtain the average length of first-type clauses.
Preferably, calculating the average length of long sentences and of second-type clauses comprises:
counting the length $\mathrm{word\_long}_i$ of each long sentence, $1 \le i \le \mathrm{count\_long}$, where i is the index of the long sentence;
calculating the average length of long sentences according to the following formula:
$$\mathrm{MLL} = \frac{1}{\mathrm{count\_long}}\cdot\sum_{i=1}^{\mathrm{count\_long}}\mathrm{word\_long}_i;$$
where count_long is the number of long sentences;
calculating the average length of second-type clauses according to the following formula:
$$\mathrm{MLCL} = \frac{1}{\mathrm{count\_clause\_long}}\cdot\sum_{i=1}^{\mathrm{count\_long}}\mathrm{word\_long}_i;$$
where count_clause_long is the number of second-type clauses.
Preferably, calculating the translation difficulty value of the document comprises:
calculating the translation difficulty value of the document according to the translation difficulty formula, which is as follows:
$$\mathrm{diff\_doc} = K_{1}\cdot\mathrm{diff\_word} + K_{2}\cdot\mathrm{diff\_sentence};$$
where $K_{1}$ and $K_{2}$ are translation difficulty adjustment coefficients calculated from samples, and diff_doc is the translation difficulty value.
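Substituting the vocabulary and sentence complexity formulas into the translation difficulty formula shows how the adjustment coefficients nest. This is only an algebraic expansion of the formulas already given and introduces no new quantities:

$$
\begin{aligned}
\mathrm{diff\_doc} ={}& K_{1}\left(K_{11}\cdot\mathrm{grade\_word} + K_{12}\cdot\mathrm{STTR} + K_{13}\cdot\mathrm{density\_notional}\right) \\
&+ K_{2}\left(K_{21}\cdot\mathrm{MLS} + K_{22}\cdot\mathrm{MLC} + K_{23}\cdot\mathrm{MLL} + K_{24}\cdot\mathrm{MLCL}\right),
\end{aligned}
$$

with

$$\mathrm{grade\_word} = K_{111}\cdot\frac{\mathrm{word}_2}{\mathrm{word}} + K_{112}\cdot\frac{\mathrm{word}_3}{\mathrm{word}} + K_{113}\cdot\frac{\mathrm{word}_4}{\mathrm{word}}.$$

The result is linear in seven document measurements, with the coefficients fitted at three separate levels rather than in a single regression.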
The method for analyzing document translation difficulty of the present invention has the following advantages:
1. It calculates the translation difficulty of documents in a unified and objective way, improving the accuracy of the calculated translation difficulty;
2. It can be used to assign translation tasks to translators, achieving a rational, optimized allocation of resources.
Brief description of the drawings
The accompanying drawings described here provide a further understanding of the present invention and form part of this application. The illustrative embodiments of the invention and their description serve to explain the invention and do not unduly limit it. In the drawings:
Fig. 1 shows the flow chart of the embodiment.
Detailed description
The present invention is described in detail below with reference to the drawings and in conjunction with the embodiments.
This technical solution analyzes the translation difficulty of the document to be translated from two aspects, vocabulary complexity and sentence complexity, and determines the translation difficulty of the document from them. It specifically comprises:
S11. Scan the document to be translated and identify all words and all sentences in it;
S12. Perform complexity calculations on the identified words and sentences, respectively, to obtain the vocabulary complexity and the sentence complexity of the document;
S13. Calculate the translation difficulty value of the document from its vocabulary complexity and sentence complexity;
S14. Match the translation difficulty value of the document in a difficulty level table to determine the translation difficulty level of the document.
Based on the above method, a preferred embodiment is provided below:
Determine the document to be translated (hereinafter, the document);
1. Calculate the vocabulary complexity of the document, as follows:
Perform word segmentation on the document to obtain all the words in it. Here the term "word" should not be understood as meaning English words only; it also covers words written in character-based scripts, such as Chinese, Japanese and Korean, and/or words written in alphabetic scripts, such as French and Russian. "All words" is understood to include repeated words;
1) Calculate the vocabulary grade of the document:
Each word obtained is matched in the vocabulary level table to obtain its level, which is level 1, level 2, level 3, or level 4. Levels 1, 2 and 3 are obtained by table lookup; a word that cannot be matched in the vocabulary level table is treated as level 4;
Each language can grade its vocabulary according to how frequently words occur in actual use. Following the various authoritative vocabulary grading rules of each language, this technical solution builds a vocabulary level table for each language that divides its vocabulary into 3 levels by how commonly the words are used. For Chinese, for example, the "Table of General Standard Chinese Characters" and the "Code of Chinese Graphic Character Set for Information Interchange, Primary Set" serve as the grading reference for Chinese characters, with common, less common, and rare characters corresponding to level 1, level 2 and level 3 respectively.
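The lookup and the resulting vocabulary grade can be sketched as follows. The level table contents, the function name `vocabulary_grade`, and the coefficient values used in any call are illustrative assumptions; in the method described here the table comes from each language's grading references and the coefficients from regression on samples.

```python
from collections import Counter

def vocabulary_grade(words, level_table, k111, k112, k113):
    """grade_word = K111*word_2/word + K112*word_3/word + K113*word_4/word."""
    if not words:
        raise ValueError("document has no words")
    # Words missing from the table are treated as level 4.
    counts = Counter(level_table.get(w, 4) for w in words)
    word = len(words)  # total word count, repeated words included
    return (k111 * counts[2] + k112 * counts[3] + k113 * counts[4]) / word
```

Level-1 words contribute only to the denominator, so a document made up entirely of common words has a grade of 0.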
Count the number of level-1 words as $\mathrm{word}_1$, the number of level-2 words as $\mathrm{word}_2$, the number of level-3 words as $\mathrm{word}_3$, and the number of level-4 words as $\mathrm{word}_4$;
Count all the words in the document to obtain the total word count, word;
Calculate the proportion of the document made up of words at level 2 and above, as follows:
The proportion of level-2 words is $\mathrm{word}_2/\mathrm{word}$, the proportion of level-3 words is $\mathrm{word}_3/\mathrm{word}$, and the proportion of level-4 words is $\mathrm{word}_4/\mathrm{word}$;
Calculate the vocabulary grade of the document according to the vocabulary grade formula, which is as follows:
$$\mathrm{grade\_word} = K_{111}\cdot\frac{\mathrm{word}_2}{\mathrm{word}} + K_{112}\cdot\frac{\mathrm{word}_3}{\mathrm{word}} + K_{113}\cdot\frac{\mathrm{word}_4}{\mathrm{word}};$$
where grade_word is the vocabulary grade, and $K_{111}$, $K_{112}$ and $K_{113}$ are vocabulary grade adjustment coefficients calculated from given samples. They are third-level adjustment coefficients; being multiple linear regression coefficients, they can be calculated by the method of least squares, as follows:
Let:
$$Y = \mathrm{grade\_word},\quad X_1 = \frac{\mathrm{word}_2}{\mathrm{word}},\quad X_2 = \frac{\mathrm{word}_3}{\mathrm{word}},\quad X_3 = \frac{\mathrm{word}_4}{\mathrm{word}};$$
For the n groups of sample data collected:
$$\{X_{11}, X_{12}, X_{13}\},\ \{X_{21}, X_{22}, X_{23}\},\ \ldots,\ \{X_{n1}, X_{n2}, X_{n3}\};$$
with the corresponding vocabulary grades assessed by experts, $Y_1, Y_2, \ldots, Y_n$, the following system of linear equations is obtained:
$$\begin{aligned} Y_1 &= K_{111}\cdot X_{11} + K_{112}\cdot X_{12} + K_{113}\cdot X_{13} \\ Y_2 &= K_{111}\cdot X_{21} + K_{112}\cdot X_{22} + K_{113}\cdot X_{23} \\ &\;\;\vdots \\ Y_n &= K_{111}\cdot X_{n1} + K_{112}\cdot X_{n2} + K_{113}\cdot X_{n3} \end{aligned}$$
Solving gives:
$$\begin{pmatrix} K_{111} \\ K_{112} \\ K_{113} \end{pmatrix} = (X'X)^{-1}X'Y;$$
where
$$X = \begin{pmatrix} X_{11} & X_{12} & X_{13} \\ X_{21} & X_{22} & X_{23} \\ \vdots & \vdots & \vdots \\ X_{n1} & X_{n2} & X_{n3} \end{pmatrix},\quad Y = \begin{pmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{pmatrix},$$
and X' is the transpose of X.
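The closed-form solution $(X'X)^{-1}X'Y$ can be computed without a numerical library by solving the normal equations $(X'X)K = X'Y$ directly. The sketch below uses Gauss-Jordan elimination with partial pivoting; the function name `fit_coefficients` is an assumption, and in practice a library least-squares routine would usually be preferred for numerical stability.

```python
def fit_coefficients(rows, targets):
    """Least-squares K for targets ~ rows @ K, via the normal equations."""
    m = len(rows[0])  # number of coefficients (3 for K111..K113)
    # Augmented matrix [X'X | X'Y]
    aug = [
        [sum(r[i] * r[j] for r in rows) for j in range(m)]
        + [sum(r[i] * y for r, y in zip(rows, targets))]
        for i in range(m)
    ]
    for col in range(m):
        # Partial pivoting: bring the largest remaining entry onto the diagonal.
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular; the samples are not varied enough")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        # Eliminate this column from every other row.
        for r in range(m):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [aug[i][m] / aug[i][i] for i in range(m)]
```

Each row of `rows` is one sample $\{X_{i1}, X_{i2}, X_{i3}\}$ and each entry of `targets` is the matching expert grade $Y_i$. The same routine applies unchanged to the second-level and first-level coefficients, with 3, 4, or 2 columns as appropriate.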
2) Calculate the standardized type-token ratio of the document:
Count the tokens in the document, i.e. the total number of words occurring in the document;
Count the types in the document, i.e. the number of distinct words occurring in the document;
The type-token ratio (TTR) represents the rate of lexical variation, i.e. the richness of the vocabulary in the document. The higher the TTR, the more different words the text uses, and the harder it is to read. Because the number of words in any language is finite, the larger the document, the smaller its type-token ratio becomes, and the computed ratio is distorted. In practice, therefore, a TTR can be computed for each unit of a standard number ST of words (for example, ST = 1000), and the mean of all the TTRs taken as the final value, namely the standardized type-token ratio (STTR, Standard TTR). A document with fewer than the standard number of words has its TTR computed directly. Specifically:
All words of the document are divided by the standard number ST into n first subdocuments; the number of types in the i-th first subdocument is $\mathrm{type}_i$, where i is the index of the first subdocument;
There is also one second subdocument containing fewer than ST words; its number of types is type and its number of tokens is token;
Calculate the standardized type-token ratio of the document according to the following formula:
$$\mathrm{STTR} = \begin{cases} \dfrac{1}{(n+1)\cdot ST\cdot\mathrm{token}}\left(\mathrm{type}\cdot ST + \mathrm{token}\cdot\sum_{i=1}^{n}\mathrm{type}_i\right), & n \ge 1 \\[2ex] \dfrac{\mathrm{type}}{\mathrm{token}}, & n = 0 \end{cases}$$
where token is the number of tokens in the subdocument with fewer than the standard number of words, type is the number of types in that subdocument, $\mathrm{type}_i$ is the number of types in the i-th subdocument containing the standard number of words, n is the number of subdocuments containing the standard number of words, and ST is the standard number of words per subdocument.
3) Calculate the content-word sense density of the document:
Lexical density is the proportion of content words among all the words in a text. In general, the higher the lexical density, the larger the share of content words and the more information the text carries, so reading and translation difficulty increase accordingly.
Count the number of content words in the document, count_notional, i.e. count the nouns, synonyms, verbs, adjectives, adverbs, interjections, etc.;
Arrange all the content words obtained in a definite order;
Using a synonym ontology tool, count the number of senses $\mathrm{meanings}_i$ of each content word ($1 \le i \le \mathrm{count\_notional}$), where i is the index of the content word;
Add the numbers of senses of all the content words together to obtain their total number of senses.
Calculate the content-word sense density of the document according to the following formula:
$$\mathrm{density\_notional} = \frac{\sum_{i=1}^{\mathrm{count\_notional}}\mathrm{meanings}_i}{\sum_{i=1}^{\mathrm{count\_notional}}\mathrm{meanings}_i + (\mathrm{word} - \mathrm{count\_notional})}$$
where density_notional is the content-word sense density and $\sum_{i=1}^{\mathrm{count\_notional}}\mathrm{meanings}_i$ is the total number of senses of the content words;
The steps of calculating the vocabulary grade, the standardized type-token ratio, and the content-word sense density of the document have no required order; they may be performed separately or simultaneously.
4) Calculate the vocabulary complexity of the document from its vocabulary grade, standardized type-token ratio, and content-word sense density:
Calculate the vocabulary complexity of the document according to the vocabulary complexity formula, which is as follows:
$$\mathrm{diff\_word} = K_{11}\cdot\mathrm{grade\_word} + K_{12}\cdot\mathrm{STTR} + K_{13}\cdot\mathrm{density\_notional};$$
where diff_word is the vocabulary complexity, grade_word is the vocabulary grade, STTR is the standardized type-token ratio, and density_notional is the content-word sense density. $K_{11}$, $K_{12}$ and $K_{13}$ are vocabulary complexity adjustment coefficients calculated from given samples. They are second-level adjustment coefficients; being multiple linear regression coefficients, they can be calculated by the method of least squares, in the same way as the vocabulary grade adjustment coefficients.
2. Calculate the sentence complexity of the document, as follows:
The term "full sentence" should be understood as a set of words expressing a complete meaning, for example: the set of words from the first word of the document to a terminator, where a terminator is one of the full stop, exclamation mark, question mark, or ellipsis; or the set of words from the first word after one terminator to the next terminator;
The term "clause" should be understood as a part of a full sentence: a word or set of words delimited by punctuation marks such as the comma, the enumeration comma, or the semicolon;
The term "long sentence" should be understood as a full sentence whose word count exceeds a predetermined threshold;
"First-type" and "second-type" are used herein only to distinguish the two kinds of clause.
The scheme is as follows:
Scan the document, identify all the full sentences in it, and count the total number of full sentences, denoted count_sentence;
Treat each full sentence whose word count exceeds the predetermined threshold as a long sentence; count the total number of long sentences, denoted count_long, and the number of words in each long sentence, denoted $\mathrm{word\_long}_i$, $1 \le i \le \mathrm{count\_long}$, where i is the index of the long sentence;
The clauses in full sentences are first-type clauses; count the total number of first-type clauses, denoted count_clause;
The clauses in long sentences are second-type clauses; count the total number of second-type clauses, denoted count_clause_long;
Calculate the average length of full sentences, of long sentences, of first-type clauses, and of second-type clauses, as follows:
The average length of full sentences (MLS, mean length of sentence) is calculated as:
$$\mathrm{MLS} = \frac{\mathrm{word}}{\mathrm{count\_sentence}};$$
The average length of first-type clauses (MLC, mean length of clause) is calculated as:
$$\mathrm{MLC} = \frac{\mathrm{word}}{\mathrm{count\_clause}};$$
The average length of long sentences (MLL, mean length of long sentence) is calculated as:
$$\mathrm{MLL} = \frac{1}{\mathrm{count\_long}}\cdot\sum_{i=1}^{\mathrm{count\_long}}\mathrm{word\_long}_i;$$
The average length of second-type clauses (MLCL, mean length of clause of long sentence) is calculated as:
$$\mathrm{MLCL} = \frac{1}{\mathrm{count\_clause\_long}}\cdot\sum_{i=1}^{\mathrm{count\_long}}\mathrm{word\_long}_i;$$
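The four averages can be sketched with two regular expressions, one for sentence terminators and one for clause separators. This is a simplified illustration under stated assumptions: words are taken to be whitespace-separated (a segmented text is assumed for languages such as Chinese), every period is treated as a terminator (so decimals and abbreviations would be split incorrectly), and the function name `sentence_metrics` and the default threshold of 20 words are hypothetical. Returning 0.0 when there are no long sentences is also a choice made here; the text does not define that case.

```python
import re

SENTENCE_END = re.compile(r"[.!?。！？…]+")  # full stop, exclamation, question, ellipsis
CLAUSE_SEP = re.compile(r"[,;，、；]")        # comma, enumeration comma, semicolon

def count_words(segment):
    return len(CLAUSE_SEP.sub(" ", segment).split())

def count_clauses(sentence):
    return sum(1 for c in CLAUSE_SEP.split(sentence) if c.strip())

def sentence_metrics(text, long_threshold=20):
    sentences = [s for s in SENTENCE_END.split(text) if count_words(s) > 0]
    if not sentences:
        raise ValueError("no sentences found")
    lengths = [count_words(s) for s in sentences]
    word = sum(lengths)                                      # total word count
    count_clause = sum(count_clauses(s) for s in sentences)  # first-type clauses
    long_pairs = [(s, n) for s, n in zip(sentences, lengths) if n > long_threshold]
    long_words = sum(n for _, n in long_pairs)               # sum of word_long_i
    count_clause_long = sum(count_clauses(s) for s, _ in long_pairs)
    return {
        "MLS": word / len(sentences),
        "MLC": word / count_clause,
        "MLL": long_words / len(long_pairs) if long_pairs else 0.0,
        "MLCL": long_words / count_clause_long if long_pairs else 0.0,
    }
```

A sentence with no clause separator counts as a single clause, so MLC never exceeds MLS.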
Calculate the sentence complexity according to the sentence complexity formula, which is as follows:
$$\mathrm{diff\_sentence} = K_{21}\cdot\mathrm{MLS} + K_{22}\cdot\mathrm{MLC} + K_{23}\cdot\mathrm{MLL} + K_{24}\cdot\mathrm{MLCL};$$
$K_{21}$, $K_{22}$, $K_{23}$ and $K_{24}$ are sentence complexity adjustment coefficients calculated from the collected samples. They are second-level adjustment coefficients; being multiple linear regression coefficients, they can be calculated by the method of least squares, in the same way as the vocabulary grade adjustment coefficients.
3. Calculate the translation difficulty value of the document;
From the vocabulary complexity and sentence complexity obtained for the document, calculate its translation difficulty value according to the translation difficulty formula, which is as follows:
$$\mathrm{diff\_doc} = K_{1}\cdot\mathrm{diff\_word} + K_{2}\cdot\mathrm{diff\_sentence};$$
$K_{1}$ and $K_{2}$ are translation difficulty adjustment coefficients calculated from the collected samples. They are first-level adjustment coefficients; being multiple linear regression coefficients, they can be calculated by the method of least squares, in the same way as the vocabulary grade adjustment coefficients.
4. Determine the translation difficulty level of the document;
Match the translation difficulty value of the document in the difficulty level table to obtain the difficulty level corresponding to that value;
The difficulty level table is similar in form to a dictionary: it contains several difficulty levels and the range of translation difficulty values corresponding to each level;
The translation difficulty value ranges in the difficulty level table are obtained through learning or training.
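Steps 3 and 4 amount to a weighted sum followed by a range lookup, which can be sketched with a sorted list of range boundaries. The function names, the boundary values, and the level labels below are illustrative assumptions; the text only states that the ranges are learned or trained and does not give them. Treating each range as closed on the left and open on the right is likewise a choice made for this sketch.

```python
from bisect import bisect_right

def translation_difficulty(diff_word, diff_sentence, k1, k2):
    """diff_doc = K1*diff_word + K2*diff_sentence."""
    return k1 * diff_word + k2 * diff_sentence

def difficulty_level(diff_doc, boundaries, labels):
    """Map a score to a level; `boundaries` are the sorted cut points between levels."""
    if len(labels) != len(boundaries) + 1:
        raise ValueError("need exactly one more label than boundaries")
    # A score equal to a boundary falls into the higher level.
    return labels[bisect_right(boundaries, diff_doc)]

# Hypothetical table: four levels separated by three cut points.
BOUNDARIES = [2.0, 4.0, 6.0]
LABELS = ["level 1", "level 2", "level 3", "level 4"]
```

Storing only the cut points keeps the ranges contiguous by construction, so no score can fall between two levels.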
The foregoing is merely a preferred embodiment of the present invention and is not intended to limit it; for those skilled in the art, the invention may have various modifications and variations. Any modification, equivalent replacement, improvement, etc. made within the spirit and principles of the present invention shall fall within its scope of protection.

Claims (9)

1. A method for analyzing document translation difficulty, characterized by comprising:
scanning the document to be translated and identifying all words and all sentences in it;
performing complexity calculations on the identified words and sentences, respectively, to obtain the vocabulary complexity and the sentence complexity of the document;
calculating the translation difficulty value of the document from its vocabulary complexity and sentence complexity;
matching the translation difficulty value of the document in a difficulty level table to determine the translation difficulty level of the document;
wherein calculating the vocabulary complexity of the document comprises:
calculating the vocabulary grade, the type-token ratio, and the content-word sense density of the document;
calculating the vocabulary complexity of the document according to the vocabulary complexity formula, which is as follows:
$$\mathrm{diff\_word} = K_{11}\cdot\mathrm{grade\_word} + K_{12}\cdot\mathrm{STTR} + K_{13}\cdot\mathrm{density\_notional};$$
where diff_word is the vocabulary complexity of the document, grade_word is the vocabulary grade of the document, STTR is the type-token ratio of the document, density_notional is the content-word sense density of the document, and $K_{11}$, $K_{12}$ and $K_{13}$ are vocabulary complexity adjustment coefficients calculated from samples.
2. The method according to claim 1, characterized in that, before calculating the vocabulary grade of the document, the method further comprises:
performing word segmentation on the document to obtain all words, and counting them to obtain the total word count;
matching each word obtained in a vocabulary level table to obtain its vocabulary level, the vocabulary level being level 1, level 2, level 3, or level 4;
counting, separately for each level, the number of words whose vocabulary level is level 2 or above;
and calculating the vocabulary grade of the document comprises:
calculating the vocabulary grade of the document according to the vocabulary grade formula, which is as follows:
$$\mathrm{grade\_word} = K_{111}\cdot\frac{\mathrm{word}_2}{\mathrm{word}} + K_{112}\cdot\frac{\mathrm{word}_3}{\mathrm{word}} + K_{113}\cdot\frac{\mathrm{word}_4}{\mathrm{word}};$$
where $\mathrm{word}_x$ is the number of words whose vocabulary level is level x, $K_{111}$, $K_{112}$ and $K_{113}$ are vocabulary grade adjustment coefficients calculated from samples, and word is the total word count.
3. The method according to claim 2, characterized in that the process of calculating the type-token ratio of the document comprises:
counting, over all obtained words, the number of types and the number of tokens, and calculating the ratio of the number of types to the number of tokens to obtain the type-token ratio of the document; or
dividing all obtained words, by a standard number, into multiple sub-documents each containing the standard number of words and one sub-document containing fewer than the standard number of words, and calculating according to the type-token ratio formula to obtain the type-token ratio of the document; the type-token ratio formula is as follows:
$$\text{STTR} = \begin{cases} \dfrac{1}{(n+1)\cdot ST\cdot \text{token}}\cdot\left(\text{type}\cdot ST + \text{token}\cdot\sum_{i=1}^{n}\text{type}_i\right), & n \ge 1 \\[2ex] \dfrac{\text{type}}{\text{token}}, & n = 0 \end{cases}$$
where token is the number of tokens in the sub-document containing fewer than the standard number of words, type is the number of types in that sub-document, $\text{type}_i$ is the number of types in the $i$-th sub-document containing the standard number of words, $n$ is the number of sub-documents containing the standard number of words, and $ST$ is the standard number of words used as the division unit.
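The piecewise formula is the mean of the per-sub-document ratios: each full sub-document contributes type_i/ST and the short one contributes type/token. A minimal Python sketch follows; the function name `sttr` is an assumption. The formula does not say what happens when the word count is an exact multiple of ST (token would be 0), so the sketch assumes the mean is then taken over the full sub-documents only.

```python
def sttr(tokens, st):
    """Type-token ratio averaged over sub-documents of `st` tokens each."""
    if st <= 0:
        raise ValueError("st must be positive")
    if not tokens:
        raise ValueError("document contains no tokens")

    n = len(tokens) // st  # n: number of full sub-documents
    # type_i: number of distinct words in the i-th full sub-document
    full_types = [len(set(tokens[i * st:(i + 1) * st])) for i in range(n)]
    rest = tokens[n * st:]  # the single sub-document shorter than st
    token, type_ = len(rest), len(set(rest))

    if n == 0:
        return type_ / token
    if token == 0:
        # Assumption (not stated in the source): with no short
        # sub-document, average over the full sub-documents only.
        return sum(full_types) / (n * st)
    return (type_ * st + token * sum(full_types)) / ((n + 1) * st * token)


# Hypothetical input: two full sub-documents of 3 tokens plus 1 leftover token.
value = sttr("a b a c d d e".split(), st=3)
```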
4. The method according to claim 2, characterized in that, before calculating the notional-word sense density of the document, the method further comprises:
performing part-of-speech tagging on all obtained words to obtain the notional words among them;
arranging all obtained notional words in a given order;
obtaining the number of senses $\text{meanings}_i$ of each notional word from a synonym ontology tool, where $i$ is the sequence number of the notional word, and counting the total number of senses of the notional words;
calculating according to the notional-word sense density formula to obtain the notional-word sense density of the document; the notional-word sense density formula is as follows:
$$\text{density\_notional} = \frac{\sum_{i=1}^{\text{count\_notional}} \text{meanings}_i}{\sum_{i=1}^{\text{count\_notional}} \text{meanings}_i + (\text{word} - \text{count\_notional})};$$
where $\text{meanings}_i$ is the number of senses of the $i$-th notional word, and count_notional is the number of notional words.
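The density formula can be illustrated with a short Python sketch. The function name and the sample sense counts are assumptions; the part-of-speech tagging and the sense lookup in the synonym ontology tool are taken as already done.

```python
def notional_sense_density(sense_counts, total_words):
    """density_notional = S / (S + (word - count_notional)),
    where S is the sum of meanings_i over all notional words.

    sense_counts holds meanings_i for each notional word, in order.
    """
    count_notional = len(sense_counts)
    if count_notional > total_words:
        raise ValueError("more notional words than words in the document")
    total_senses = sum(sense_counts)
    denominator = total_senses + (total_words - count_notional)
    if denominator == 0:
        raise ValueError("document contains no words")
    return total_senses / denominator


# Hypothetical document: 5 words, 3 of them notional with 3, 1 and 4 senses.
density = notional_sense_density([3, 1, 4], total_words=5)
```

Each non-notional word adds one to the denominator only, so the density approaches 1 as notional words become more polysemous and falls as function words make up more of the text.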
5. The method according to claim 4, characterized in that the notional words include at least one of the following parts of speech: noun, synonym, verb, adjective, adverb and interjection.
6. The method according to claim 2, characterized in that, before calculating the sentence complexity of the document, the method further comprises:
calculating the average length of whole sentences by determining the number of whole sentences in the document;
calculating the average length of first-type clauses in whole sentences by determining the number of first-type clauses in all whole sentences in the document;
calculating the average length of long sentences by determining the number of long sentences in the document and the length of each long sentence;
calculating the average length of second-type clauses in long sentences by determining the number of second-type clauses in all long sentences in the document;
and the process of calculating the sentence complexity of the document comprises:
calculating the sentence complexity of the document according to the sentence complexity formula, which is as follows:
$$\text{diff\_sentence} = K_{21}\cdot\text{MLS} + K_{22}\cdot\text{MLC} + K_{23}\cdot\text{MLL} + K_{24}\cdot\text{MLCL};$$
where MLS is the average length of whole sentences, MLC is the average length of first-type clauses, MLL is the average length of long sentences, MLCL is the average length of second-type clauses, and $K_{21}$, $K_{22}$, $K_{23}$ and $K_{24}$ are sentence-complexity adjustment coefficients calculated from samples.
7. The method according to claim 6, characterized in that the process of calculating the average length of whole sentences and of first-type clauses comprises:
dividing the total number of words by the number of whole sentences to obtain the average length of whole sentences, MLS;
dividing the total number of words by the number of first-type clauses to obtain the average length of first-type clauses, MLC.
8. The method according to claim 6, characterized in that the process of calculating the average length of long sentences and of second-type clauses comprises:
counting the length $\text{word\_long}_i$ of each long sentence, $1 \le i \le \text{count\_long}$, where $i$ is the sequence number of the long sentence;
calculating the average length of long sentences according to the formula for the average length of long sentences, which is as follows:
$$\text{MLL} = \frac{1}{\text{count\_long}}\cdot\sum_{i=1}^{\text{count\_long}} \text{word\_long}_i;$$
where count_long is the number of long sentences;
calculating the average length of second-type clauses according to the formula for the average length of second-type clauses, which is as follows:
$$\text{MLCL} = \frac{1}{\text{count\_clause\_long}}\cdot\sum_{i=1}^{\text{count\_long}} \text{word\_long}_i;$$
where count_clause_long is the number of second-type clauses.
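The four averages that feed the sentence complexity (MLS, MLC, MLL and MLCL) can be computed together, as in the Python sketch below. Several things here are assumptions made for illustration: the input layout (a sentence is a list of clauses, a clause is a list of words), the `long_threshold` parameter that decides which sentences count as long, and returning 0.0 for MLL and MLCL when the document has no long sentence. The claims do not specify any of these.

```python
def sentence_length_stats(sentences, long_threshold):
    """Return MLS, MLC, MLL and MLCL for a segmented document.

    sentences: list of sentences; each sentence is a list of clauses and
    each clause is a list of words (assumed layout).
    long_threshold: hypothetical minimum word count of a "long" sentence.
    """
    lengths = [sum(len(clause) for clause in s) for s in sentences]
    word = sum(lengths)                          # total number of words
    count_clause = sum(len(s) for s in sentences)  # first-type clauses
    if not sentences or count_clause == 0:
        raise ValueError("document contains no sentences or clauses")

    # Keep only the long sentences together with their lengths.
    long_items = [(s, n) for s, n in zip(sentences, lengths)
                  if n >= long_threshold]
    count_long = len(long_items)
    word_long = sum(n for _, n in long_items)    # sum of word_long_i
    count_clause_long = sum(len(s) for s, _ in long_items)

    return {
        "MLS": word / len(sentences),
        "MLC": word / count_clause,
        # Assumption: 0.0 when the document has no long sentence.
        "MLL": word_long / count_long if count_long else 0.0,
        "MLCL": word_long / count_clause_long if count_clause_long else 0.0,
    }


# Hypothetical three-sentence document; only the second sentence is "long".
doc = [
    [["the", "cat", "sat"]],
    [["when", "it", "rained"], ["we", "stayed", "inside", "all", "day"]],
    [["stop"]],
]
stats = sentence_length_stats(doc, long_threshold=6)
```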
9. The method according to claim 1, characterized in that the process of calculating the translation difficulty value of the document comprises:
calculating the translation difficulty value of the document according to the translation difficulty formula, which is as follows:
$$\text{diff\_doc} = K_{1}\cdot\text{diff\_word} + K_{2}\cdot\text{diff\_sentence};$$
where $K_{1}$ and $K_{2}$ are translation-difficulty adjustment coefficients calculated from samples, and diff_doc is the translation difficulty value.
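The vocabulary complexity, the sentence complexity and the translation difficulty value are all weighted sums, so they can be chained as in the following Python sketch. The function names and every numeric value are assumptions chosen for illustration; in the method the coefficients are calculated from samples, which is not shown here.

```python
def weighted_sum(values, coefficients):
    """Return sum(K_i * v_i) for matching sequences of values and weights."""
    if len(values) != len(coefficients):
        raise ValueError("values and coefficients must have the same length")
    return sum(k * v for k, v in zip(coefficients, values))


def translation_difficulty(grade_word, sttr, density_notional,
                           mls, mlc, mll, mlcl,
                           k_word, k_sentence, k_doc):
    """Combine the document features into diff_word, diff_sentence, diff_doc.

    k_word = (K11, K12, K13), k_sentence = (K21, K22, K23, K24),
    k_doc = (K1, K2).
    """
    diff_word = weighted_sum((grade_word, sttr, density_notional), k_word)
    diff_sentence = weighted_sum((mls, mlc, mll, mlcl), k_sentence)
    diff_doc = weighted_sum((diff_word, diff_sentence), k_doc)
    return diff_word, diff_sentence, diff_doc


# Hypothetical feature values and coefficients, for illustration only.
diff_word, diff_sentence, diff_doc = translation_difficulty(
    grade_word=0.875, sttr=0.5, density_notional=0.8,
    mls=4.0, mlc=3.0, mll=8.0, mlcl=4.0,
    k_word=(1.0, 1.0, 1.0),
    k_sentence=(0.1, 0.1, 0.1, 0.1),
    k_doc=(0.5, 0.5),
)
```

The sentence averages are word counts while the vocabulary features are ratios below or near 1, so the coefficients also have to absorb the difference in scale between the two groups.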
CN201310713175.2A 2013-12-23 2013-12-23 A kind of analysis method of document translation difficulty Active CN103744840B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310713175.2A CN103744840B (en) 2013-12-23 2013-12-23 A kind of analysis method of document translation difficulty

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310713175.2A CN103744840B (en) 2013-12-23 2013-12-23 A kind of analysis method of document translation difficulty

Publications (2)

Publication Number Publication Date
CN103744840A CN103744840A (en) 2014-04-23
CN103744840B true CN103744840B (en) 2016-12-07

Family

ID=50501858

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310713175.2A Active CN103744840B (en) 2013-12-23 2013-12-23 A kind of analysis method of document translation difficulty

Country Status (1)

Country Link
CN (1) CN103744840B (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104008094B (en) * 2014-05-22 2017-08-11 武汉传神信息技术有限公司 A kind of method for obtaining document translation difficulty
CN105224524B (en) * 2015-09-02 2022-01-25 网易有道信息技术(北京)有限公司 Document translation difficulty evaluation method and device
CN109086363B (en) * 2018-07-19 2021-03-16 百度在线网络技术(北京)有限公司 File information maintenance degree determining method, device and equipment
CN112232060A (en) * 2020-09-27 2021-01-15 淄博职业学院 Intelligent international Chinese teaching-oriented sentence difficulty level online measuring system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1266238A (en) * 1999-03-04 2000-09-13 英业达股份有限公司 English natural sentences antomatic identification and word querying free automatic processing method
JP2000516749A (en) * 1997-06-26 2000-12-12 コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ Machine construction method and apparatus for translating word source text into word target text
WO2002075585A1 (en) * 2001-03-21 2002-09-26 Fujitsu Limited Machine-translation apparatus
CN102214246A (en) * 2011-07-18 2011-10-12 南京大学 Method for grading Chinese electronic document reading on the Internet

Also Published As

Publication number Publication date
CN103744840A (en) 2014-04-23

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 430070 East Lake Hubei Development Zone, Optics Valley Software Park, a phase of the west, South Lake Road South, Optics Valley Software Park, No. 2, No. 5, layer 205, six

Applicant after: Language network (Wuhan) Information Technology Co., Ltd.

Address before: 430073 East Lake Hubei Development Zone, Optics Valley Software Park, a phase of the west, South Lake Road South, Optics Valley Software Park, No. 2, No. 5, layer 205, six

Applicant before: Wuhan Transn Information Technology Co., Ltd.

CB03 Change of inventor or designer information

Inventor after: Jiang Chao

Inventor after: Zhang Pi

Inventor before: Jiang Chao

COR Change of bibliographic data
C14 Grant of patent or utility model
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Document translation difficulty analyzing method

Effective date of registration: 20181115

Granted publication date: 20161207

Pledgee: Bank of Communications Co., Ltd. Wuhan Branch of Hubei Free Trade Experimental Zone

Pledgor: Language network (Wuhan) Information Technology Co., Ltd.

Registration number: 2018420000061

PC01 Cancellation of the registration of the contract for pledge of patent right

Date of cancellation: 20200617

Granted publication date: 20161207

Pledgee: Bank of Communications Co.,Ltd. Wuhan Branch of Hubei Free Trade Experimental Zone

Pledgor: IOL (WUHAN) INFORMATION TECHNOLOGY Co.,Ltd.

Registration number: 2018420000061
