CN103744840A - Document translation difficulty analyzing method - Google Patents

Document translation difficulty analyzing method

Info

Publication number
CN103744840A
Authority
CN
China
Prior art keywords
document
vocabulary
calculating
sentence
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201310713175.2A
Other languages
Chinese (zh)
Other versions
CN103744840B (en)
Inventor
江潮
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
WUHAN TRANSN INFORMATION TECHNOLOGY Co Ltd
Original Assignee
WUHAN TRANSN INFORMATION TECHNOLOGY Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by WUHAN TRANSN INFORMATION TECHNOLOGY Co Ltd filed Critical WUHAN TRANSN INFORMATION TECHNOLOGY Co Ltd
Priority to CN201310713175.2A priority Critical patent/CN103744840B/en
Publication of CN103744840A publication Critical patent/CN103744840A/en
Application granted granted Critical
Publication of CN103744840B publication Critical patent/CN103744840B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Machine Translation (AREA)

Abstract

The invention discloses a method for analyzing document translation difficulty. The method scans a document to be translated and determines all words and sentences in it; performs complexity calculations on the determined words and sentences to obtain the vocabulary complexity and the sentence complexity of the document; calculates a translation difficulty value of the document from the vocabulary complexity and the sentence complexity; and determines the translation difficulty level of the document from the position of that value in a difficulty level table. The method allows document translation difficulty to be calculated accurately and improves the accuracy of translation difficulty analysis.

Description

Document translation difficulty analysis method
Technical Field
The invention relates to the technical field of translation, in particular to a method for analyzing document translation difficulty.
Background
Document translation difficulty can be assessed manually or by machine. In manual assessment, a language expert or translation expert reads and labels the document to be translated. Because it is limited by human reading and comprehension speed, this approach is slow and incurs very high labor costs; and because assessors differ in ability and in how difficult each finds a given document, their judgments vary widely, so the results are inconsistent and lack objectivity. In machine assessment, a computer applies some method to judge how difficult a document is to translate. The most common method at present is to count the rare words in the document. A method that relies on a single dimension as its only factor has weak reliability and is markedly one-sided; its results often differ greatly from the actual situation, and their accuracy cannot be guaranteed. An efficient and reasonably accurate method for assessing document translation difficulty is therefore still lacking.
Disclosure of Invention
The invention aims to provide a method for analyzing the translation difficulty of a document, so as to solve the problem of how to assign each document to a suitable translator.
The invention discloses a method for analyzing document translation difficulty, which comprises the following steps:
scanning a document to be translated, and determining all words and sentences in the document to be translated;
respectively carrying out complexity calculation according to the determined vocabulary and the sentence to obtain the vocabulary complexity and the sentence complexity of the document;
calculating a translation difficulty value of the document according to the vocabulary complexity and the sentence complexity of the document;
and determining the translation difficulty level of the document by looking up the translation difficulty value of the document in a difficulty level table.
Preferably, the process of calculating the vocabulary complexity of the document comprises:
calculating the vocabulary level, the type-token ratio, and the notional-word sense density of the document;
and calculating the vocabulary complexity of the document according to a vocabulary complexity calculation formula, wherein the vocabulary complexity calculation formula is as follows:
diff_word=K11·grade_word+K12·STTR+K13·density_notional;
wherein diff_word is the vocabulary complexity of the document, grade_word is the vocabulary level of the document, STTR is the type-token ratio of the document, density_notional is the notional-word sense density of the document, and K11, K12 and K13 are vocabulary complexity adjustment coefficients obtained by calculation from samples.
Preferably, before calculating the vocabulary level of the document, the method further comprises:
performing word segmentation processing on the document to obtain all vocabularies, and counting to obtain the total number of the vocabularies;
matching each obtained vocabulary in a vocabulary classification table to obtain the vocabulary level of each vocabulary; the vocabulary level is a first level, a second level, a third level or a fourth level;
respectively counting the number of the vocabularies with the vocabulary level of two or more levels;
the process of calculating the vocabulary level of the document includes:
calculating the vocabulary level of the document according to a vocabulary level calculation formula, wherein the vocabulary level calculation formula is as follows:
$$\mathrm{grade\_word} = K_{111}\cdot\frac{\mathrm{word}_2}{\mathrm{word}} + K_{112}\cdot\frac{\mathrm{word}_3}{\mathrm{word}} + K_{113}\cdot\frac{\mathrm{word}_4}{\mathrm{word}};$$
wherein word_x is the number of words of level x, K111, K112 and K113 are vocabulary level adjustment coefficients obtained by calculation from samples, and word is the total number of words.
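By way of illustration only, the vocabulary level calculation can be sketched in Python as follows. The table contents, the sample tokens, and the coefficient values are hypothetical and are not taken from the invention; the function name `grade_word` simply mirrors the symbol in the formula. Words absent from the table are treated as fourth-level words, as the detailed description specifies.

```python
from collections import Counter


def grade_word(tokens, level_table, k111, k112, k113):
    """Weighted share of level-2, level-3 and level-4 words in a token list."""
    if not tokens:
        raise ValueError("document contains no words")
    # Words missing from the classification table are treated as level 4.
    counts = Counter(level_table.get(tok, 4) for tok in tokens)
    total = len(tokens)  # "word" in the formula; repeated words are counted
    return (k111 * counts[2] / total
            + k112 * counts[3] / total
            + k113 * counts[4] / total)


# Hypothetical classification table and coefficients, for illustration only.
table = {"the": 1, "cat": 1, "sat": 1, "ubiquitous": 2, "sesquipedalian": 3}
tokens = ["the", "ubiquitous", "cat", "sat",
          "the", "sesquipedalian", "zyzzyva", "the"]
score = grade_word(tokens, table, k111=1.0, k112=2.0, k113=3.0)
print(score)
```

In this sample, one of the eight words falls in each of levels two, three, and four, so the score is the sum of the three coefficients divided by eight.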
Preferably, the process of calculating the type-token ratio of the document comprises:
counting, over all the obtained words, the number of types and the number of tokens, and calculating the ratio of the number of types to the number of tokens to obtain the type-token ratio of the document; or
dividing all the obtained words, by a standard quantity, into a plurality of subdocuments each containing the standard quantity of words and one subdocument containing fewer than the standard quantity of words, and calculating the type-token ratio of the document according to a type-token ratio calculation formula; the type-token ratio calculation formula is as follows:
$$\mathrm{STTR} = \begin{cases} \dfrac{1}{(n+1)\cdot ST\cdot \mathrm{token}}\cdot\left(\mathrm{type}\cdot ST + \mathrm{token}\cdot\sum_{i=1}^{n}\mathrm{type}_i\right), & (n \ge 1) \\ \dfrac{\mathrm{type}}{\mathrm{token}}, & (n = 0) \end{cases}$$
wherein token is the number of tokens in the subdocument containing fewer than the standard quantity of words, type is the number of types in that subdocument, type_i is the number of types in the i-th subdocument containing the standard quantity of words, n is the number of subdocuments containing the standard quantity of words, and ST is the standard quantity of words used as the division unit.
Preferably, before calculating the notional-word sense density of the document, the method further comprises:
performing part-of-speech tagging on all the obtained words to identify the notional words among them;
arranging all the obtained notional words in a certain order;
obtaining the number of senses meanings_i of each notional word using a word-sense ontology tool, wherein i is the sequence number of the notional word; and counting the total number of senses of the notional words;
calculating the notional-word sense density of the document according to a notional-word sense density calculation formula; the notional-word sense density calculation formula is as follows:
$$\mathrm{density\_notional} = \frac{\sum_{i=1}^{\mathrm{count\_notional}} \mathrm{meanings}_i}{\sum_{i=1}^{\mathrm{count\_notional}} \mathrm{meanings}_i + (\mathrm{word} - \mathrm{count\_notional})};$$
wherein meanings_i is the number of senses of the i-th notional word, and count_notional is the number of notional words.
Preferably, the notional words include at least one of the following parts of speech: nouns, pronouns, verbs, adjectives, adverbs, and interjections.
Preferably, before calculating the sentence complexity of the document, the method further comprises:
calculating the average length of the whole sentence by determining the number of the whole sentences in the document;
calculating the average length of the first type clauses in the whole sentence by determining the number of the first type clauses in all the whole sentences in the document;
calculating the average length of the long sentences by determining the number of the long sentences in the document and the length of each long sentence;
calculating the average length of the second type clauses in the long sentences by determining the number of the second type clauses in all the long sentences in the document;
the process of calculating the sentence complexity of the document comprises:
calculating the sentence complexity of the document according to a sentence complexity calculation formula; the sentence complexity calculation formula is as follows:
diff_sentence=K21·MLS+K22·MLC+K23·MLL+K24·MLCL;
wherein MLS is the average length of the whole sentences, MLC is the average length of the first-type clauses, MLL is the average length of the long sentences, MLCL is the average length of the second-type clauses, and K21, K22, K23 and K24 are sentence complexity adjustment coefficients obtained by calculation from samples.
Preferably, the process of calculating the average length of the whole sentence and the first type clause includes:
dividing the total vocabulary number by the whole sentence number to obtain the average length of the whole sentence;
and dividing the total vocabulary number by the number of the first type clauses to obtain the average length of the first type clauses.
Preferably, the process of calculating the average length of the long sentences and of the second-type clauses comprises:
counting the length word_long_i of each long sentence, 1 ≤ i ≤ count_long, wherein i is the serial number of the long sentence;
calculating the average length of the long sentences according to a long-sentence average length calculation formula; the long-sentence average length calculation formula is as follows:
$$\mathrm{MLL} = \frac{1}{\mathrm{count\_long}}\cdot\sum_{i=1}^{\mathrm{count\_long}} \mathrm{word\_long}_i;$$
wherein count_long is the number of long sentences;
calculating the average length of the second-type clauses according to a second-type clause average length calculation formula; the second-type clause average length calculation formula is as follows:
$$\mathrm{MLCL} = \frac{1}{\mathrm{count\_clause\_long}}\cdot\sum_{i=1}^{\mathrm{count\_long}} \mathrm{word\_long}_i;$$
wherein count_clause_long is the number of second-type clauses.
Preferably, the calculation process of the translation difficulty value of the document comprises the following steps:
calculating to obtain a translation difficulty value of the document according to a translation difficulty calculation formula; the translation difficulty calculation formula is as follows:
diff_doc=K1·diff_word+K2·diff_sentence;
wherein K1 and K2 are translation difficulty adjustment coefficients obtained by calculation from samples, and diff_doc is the translation difficulty value.
The method for analyzing the document translation difficulty has the following advantages:
1. the translation difficulty of the document is uniformly and objectively calculated, so that the accuracy of the calculated translation difficulty is improved;
2. the method can be used for distributing the translation tasks to the translators and reasonably realizing the optimal configuration of resources.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
fig. 1 shows a flow chart of an embodiment.
Detailed Description
The present invention will be described in detail below with reference to the accompanying drawings in conjunction with embodiments.
The technical scheme analyzes the translation difficulty of the document to be translated from two aspects, determining it from the vocabulary complexity and the sentence complexity of the document. The method specifically comprises the following steps:
S11, scanning the document to be translated, and determining all words and sentences in the document to be translated;
S12, respectively carrying out complexity calculation according to the determined words and sentences to obtain the vocabulary complexity and the sentence complexity of the document;
S13, calculating a translation difficulty value of the document according to the vocabulary complexity and the sentence complexity of the document;
S14, determining the translation difficulty level of the document by looking up the translation difficulty value of the document in a difficulty level table.
Based on the above method, a preferred embodiment is provided as follows:
determining a document to be translated, hereinafter "the document";
1. The vocabulary complexity of the document is calculated as follows:
performing word segmentation on the document to obtain all the words in it, wherein the term "word" should be understood to cover not only English words but also words in character-based scripts, such as Chinese, Japanese, and Korean, and/or words in other alphabetic scripts, such as French and Russian; and "all the words" should be understood to include repeated words;
1) calculating the vocabulary level of the document:
matching each obtained word against a vocabulary classification table to obtain its level, the level being first, second, third, or fourth; the first, second, and third levels are obtained by table lookup, and words that fail to match in the vocabulary classification table are assigned the fourth level;
The vocabulary of each language can be graded by how frequently words occur in actual use. The technical scheme builds a vocabulary classification table for each language from that language's authoritative vocabulary grading standards, and divides the vocabulary of each language into three levels by how common the words are. For example, Chinese uses the "General Standard Chinese Character Table" and the "Chinese Character Coding Character Set for Information Exchange - Basic Set" as its grading references, with commonly used, less commonly used, and rare characters corresponding to the first, second, and third levels respectively.
Counting the number of words at level one as word1, the number of words at level two as word2, the number of words at level three as word3, and the number of words at level four as word4;
counting the number of all words in the document as the total number of words, word;
calculating the proportions of words at level two and above in the document as follows:
the proportion of words at level two is word2/word;
the proportion of words at level three is word3/word;
and the proportion of words at level four is word4/word.
Calculating the vocabulary level of the document according to a vocabulary level calculation formula; the formula is as follows:
$$\mathrm{grade\_word} = K_{111}\cdot\frac{\mathrm{word}_2}{\mathrm{word}} + K_{112}\cdot\frac{\mathrm{word}_3}{\mathrm{word}} + K_{113}\cdot\frac{\mathrm{word}_4}{\mathrm{word}};$$
wherein grade_word is the vocabulary level, and K111, K112 and K113 are the vocabulary level adjustment coefficients calculated from given samples. They are third-level adjustment coefficients; they are multiple linear regression coefficients and can be calculated by the least squares method. The specific calculation method is as follows:
let: $Y = \mathrm{grade\_word}$, $X_1 = \frac{\mathrm{word}_2}{\mathrm{word}}$, $X_2 = \frac{\mathrm{word}_3}{\mathrm{word}}$, $X_3 = \frac{\mathrm{word}_4}{\mathrm{word}}$;
for n sets of collected sample data:
$$\{X_{11}, X_{12}, X_{13}\},\ \{X_{21}, X_{22}, X_{23}\},\ \ldots,\ \{X_{n1}, X_{n2}, X_{n3}\};$$
with the corresponding vocabulary levels assessed by given experts: $Y_1, Y_2, \ldots, Y_n$;
the following system of linear equations can thus be obtained:
$$\begin{aligned} Y_1 &= K_{111}\cdot X_{11} + K_{112}\cdot X_{12} + K_{113}\cdot X_{13} \\ Y_2 &= K_{111}\cdot X_{21} + K_{112}\cdot X_{22} + K_{113}\cdot X_{23} \\ &\ \ \vdots \\ Y_n &= K_{111}\cdot X_{n1} + K_{112}\cdot X_{n2} + K_{113}\cdot X_{n3} \end{aligned}$$
obtaining:
$$\begin{bmatrix} K_{111} \\ K_{112} \\ K_{113} \end{bmatrix} = (X'X)^{-1}X'Y;$$
wherein $X = \begin{bmatrix} X_{11} & X_{12} & X_{13} \\ X_{21} & X_{22} & X_{23} \\ \vdots & \vdots & \vdots \\ X_{n1} & X_{n2} & X_{n3} \end{bmatrix}$, $Y = \begin{bmatrix} Y_1 \\ Y_2 \\ \vdots \\ Y_n \end{bmatrix}$, and $X'$ is the transpose of $X$.
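As a minimal sketch, the least squares solution K = (X'X)^-1 X'Y can be computed in plain Python by forming the normal equations and solving them by Gauss-Jordan elimination. The function name `solve_normal_equations` and the sample values are hypothetical; the expert scores Y are generated here from known coefficients (1, 2, 3) purely so that the recovered values can be checked. A real implementation would more likely use a numerical library.

```python
def solve_normal_equations(X, Y):
    """Return K = (X'X)^-1 X'Y for a list of sample rows X and a list Y."""
    m = len(X[0])
    # Build X'X (m x m) and X'Y (length m) from the samples.
    xtx = [[sum(row[a] * row[b] for row in X) for b in range(m)]
           for a in range(m)]
    xty = [sum(row[a] * y for row, y in zip(X, Y)) for a in range(m)]
    # Gauss-Jordan elimination with partial pivoting on [X'X | X'Y].
    aug = [xtx[a] + [xty[a]] for a in range(m)]
    for col in range(m):
        pivot = max(range(col, m), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("X'X is singular; collect more varied samples")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(m):
            if r != col:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[a][m] for a in range(m)]


# Hypothetical samples: each row is (X1, X2, X3) for one sample document.
X = [[0.10, 0.05, 0.02],
     [0.20, 0.10, 0.01],
     [0.05, 0.15, 0.03],
     [0.12, 0.02, 0.08]]
# Stand-in for expert-assessed levels, generated from K = (1, 2, 3).
Y = [1 * a + 2 * b + 3 * c for a, b, c in X]
K = solve_normal_equations(X, Y)
print(K)
```

The same routine applies unchanged to the second-level and first-level coefficients, since only the number of columns in X differs.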
2) Calculating the standardized type-token ratio of the document:
counting the tokens in the document, namely the total number of words appearing in the document;
counting the types in the document, namely the number of distinct words appearing in the document;
The type-token ratio (TTR) reflects how much the vocabulary varies, that is, the lexical richness of the document. The higher the TTR, the more distinct words the text uses, and the reading difficulty increases accordingly. Because the vocabulary of any language is finite, the TTR tends to fall as the document grows, so a TTR computed over the whole document is distorted. In practice, therefore, the TTR can be calculated for each unit of a standard quantity ST of words (for example, ST = 1000), and the average of all the TTRs is taken as the final value, namely the standardized type-token ratio (STTR). For a document with fewer words than the standard quantity, the TTR is calculated directly. The specific steps are as follows:
dividing all the words of the document by the standard quantity ST into n first subdocuments, the number of types in each first subdocument being type_i, wherein i is the serial number of the first subdocument;
and a second subdocument containing fewer than ST words, the numbers of types and tokens in the second subdocument being type and token respectively.
Calculating the standardized type-token ratio of the document according to a standardized type-token ratio calculation formula; the formula is as follows:
$$\mathrm{STTR} = \begin{cases} \dfrac{1}{(n+1)\cdot ST\cdot \mathrm{token}}\cdot\left(\mathrm{type}\cdot ST + \mathrm{token}\cdot\sum_{i=1}^{n}\mathrm{type}_i\right), & (n \ge 1) \\ \dfrac{\mathrm{type}}{\mathrm{token}}, & (n = 0) \end{cases}$$
wherein token is the number of tokens in the subdocument containing fewer than the standard quantity of words, type is the number of types in that subdocument, type_i is the number of types in the i-th subdocument containing the standard quantity of words, n is the number of subdocuments containing the standard quantity of words, and ST is the standard quantity of words used as the division unit.
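For illustration, the formula is algebraically the mean of the n + 1 per-subdocument ratios (type_i/ST for each full subdocument, type/token for the short one), which the following Python sketch computes directly. The function name `sttr` and the sample tokens are hypothetical. One case is an assumption of this sketch rather than part of the formula: when the document length is an exact multiple of ST there is no short subdocument (token would be 0), so the sketch averages over the n full subdocuments only.

```python
def sttr(tokens, st=1000):
    """Mean of the per-chunk type-token ratios of a token list."""
    if not tokens:
        raise ValueError("document contains no words")
    n, remainder = divmod(len(tokens), st)
    # TTR of each full chunk of `st` tokens: type_i / ST.
    ratios = [len(set(tokens[i * st:(i + 1) * st])) / st for i in range(n)]
    # TTR of the short trailing chunk, if there is one: type / token.
    if remainder:
        tail = tokens[n * st:]
        ratios.append(len(set(tail)) / remainder)
    return sum(ratios) / len(ratios)


# Ten hypothetical tokens with ST = 4: two full chunks and a tail of two.
sample = "a b a c d d e f g h".split()
value = sttr(sample, st=4)
print(value)
```

For this sample, n = 2, the full chunks each contain three types, and the tail has type = token = 2, so the formula gives (2·4 + 2·(3 + 3)) / (3·4·2) = 20/24, which matches the average computed by the sketch.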
3) Calculating the notional-word sense density of the document:
Lexical density is the proportion of notional words among all the words of a text. Generally, the higher the lexical density, the larger the share of notional words in the text and the greater the amount of information, so reading and translation become more difficult.
Counting the number count_notional of notional words in the document, namely the number of nouns, pronouns, verbs, adjectives, adverbs, interjections, and the like;
arranging all the obtained notional words in a certain order;
obtaining the number of senses meanings_i (1 ≤ i ≤ count_notional) of each notional word using a synonym ontology tool, wherein i is the sequence number of the notional word;
and adding up the numbers of senses of all the notional words to obtain the total number of senses of all the notional words.
Calculating the notional-word sense density of the document according to a notional-word sense density calculation formula; the formula is as follows:
$$\mathrm{density\_notional} = \frac{\sum_{i=1}^{\mathrm{count\_notional}} \mathrm{meanings}_i}{\sum_{i=1}^{\mathrm{count\_notional}} \mathrm{meanings}_i + (\mathrm{word} - \mathrm{count\_notional})};$$
wherein density_notional is the notional-word sense density, $\sum_{i=1}^{\mathrm{count\_notional}} \mathrm{meanings}_i$ is the total number of senses of all the notional words, and count_notional is the number of notional words;
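A minimal Python sketch of this calculation follows. The part-of-speech tags, the sense counts, and the function signature are hypothetical stand-ins for the output of a tagger and an ontology tool, neither of which is specified here. As a further assumption of the sketch, a notional word missing from the sense table is counted as having one sense.

```python
def density_notional(tagged_words, sense_counts, notional_tags):
    """tagged_words: list of (word, pos) pairs; sense_counts: word -> senses."""
    if not tagged_words:
        raise ValueError("document contains no words")
    notional = [w for w, pos in tagged_words if pos in notional_tags]
    # Assumption: a notional word without a table entry has one sense.
    total_senses = sum(sense_counts.get(w, 1) for w in notional)
    # word - count_notional: the words that are not notional words.
    other_words = len(tagged_words) - len(notional)
    return total_senses / (total_senses + other_words)


# Hypothetical tagger output and sense counts, for illustration only.
NOTIONAL = {"NOUN", "PRON", "VERB", "ADJ", "ADV", "INTJ"}
tagged = [("the", "DET"), ("bank", "NOUN"), ("runs", "VERB"),
          ("on", "ADP"), ("trust", "NOUN")]
senses = {"bank": 3, "runs": 5, "trust": 2}
density = density_notional(tagged, senses, NOTIONAL)
print(density)
```

Here the three notional words carry ten senses in total and two words are not notional, so the density is 10 / (10 + 2).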
The steps of calculating the vocabulary level, the standardized type-token ratio, and the notional-word sense density of the document need not be performed in any particular order; they can be calculated separately or simultaneously.
4) Calculating the vocabulary complexity of the document from its vocabulary level, standardized type-token ratio, and notional-word sense density:
calculating the vocabulary complexity of the document according to a vocabulary complexity calculation formula; the formula is as follows:
diff_word=K11·grade_word+K12·STTR+K13·density_notional;
wherein diff_word is the vocabulary complexity, grade_word is the vocabulary level, STTR is the standardized type-token ratio, and density_notional is the notional-word sense density; K11, K12 and K13 are the vocabulary complexity adjustment coefficients calculated from given samples. They are second-level adjustment coefficients; they are multiple linear regression coefficients and can be calculated by the least squares method. The specific calculation method is the same as for the vocabulary level adjustment coefficients.
2. Calculating the sentence complexity of the document, specifically as follows:
the term "whole sentence" is to be understood as a set of words that expresses a complete meaning, for example: the set of words from the first character of the document to the first ending symbol, the ending symbol being one of a period, an exclamation mark, a question mark, and an ellipsis; or the set of words from the first character after one ending symbol to the next ending symbol;
the term "clause" should be understood as a part of a whole sentence, namely a set of words delimited by symbols such as commas, pause marks (enumeration commas), and semicolons;
the term "long sentence" should be understood as an entire sentence with a vocabulary number greater than a predetermined threshold;
the first and second categories are used herein for distinction only.
The scheme is as follows:
scanning the document, determining all whole sentences in it, and recording their total number as count_sentence;
taking each whole sentence whose number of words exceeds a preset threshold as a long sentence, recording the total number of long sentences as count_long and the number of words in each long sentence as word_long_i, 1 ≤ i ≤ count_long, wherein i is the serial number of the long sentence;
the clauses in the whole sentences are first-type clauses; their total number is counted and recorded as count_clause;
the clauses in the long sentences are second-type clauses; their total number is counted and recorded as count_clause_long;
calculating respectively the average length of the whole sentences, the average length of the long sentences, the average length of the first-type clauses, and the average length of the second-type clauses, as follows:
the average length of the whole sentences (MLS) is calculated as: MLS = word/count_sentence;
the average length of the first-type clauses (MLC) is calculated as: MLC = word/count_clause;
the average length of the long sentences (MLL) is calculated as:
$$\mathrm{MLL} = \frac{1}{\mathrm{count\_long}}\cdot\sum_{i=1}^{\mathrm{count\_long}} \mathrm{word\_long}_i;$$
the average length of the second-type clauses (MLCL) is calculated as:
$$\mathrm{MLCL} = \frac{1}{\mathrm{count\_clause\_long}}\cdot\sum_{i=1}^{\mathrm{count\_long}} \mathrm{word\_long}_i;$$
calculating according to a sentence complexity calculation formula to obtain sentence complexity; the sentence complexity calculation formula is as follows:
diff_sentence=K21·MLS+K22·MLC+K23·MLL+K24·MLCL;
K21, K22, K23 and K24 are the sentence complexity adjustment coefficients calculated from collected samples. They are second-level adjustment coefficients; they are multiple linear regression coefficients and can be calculated by the least squares method. The specific calculation method is the same as for the vocabulary level adjustment coefficients.
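The four sentence measures can be sketched in Python as follows. This is an illustration under stated assumptions, not the segmentation the invention prescribes: words are taken to be whitespace-delimited (which suits English but not unsegmented Chinese), the punctuation sets and the `long_threshold` parameter are hypothetical, and when a document has no long sentences the sketch reports MLL and MLCL as 0.

```python
import re

SENTENCE_END = r"[.!?…。！？]+"   # period, exclamation, question, ellipsis
CLAUSE_SEP = r"[,;，；、]+"        # comma, semicolon, enumeration comma


def sentence_metrics(text, long_threshold=20):
    """Return (MLS, MLC, MLL, MLCL) for whitespace-delimited text."""
    sentences = [s.strip() for s in re.split(SENTENCE_END, text) if s.strip()]
    if not sentences:
        raise ValueError("document contains no sentences")

    def n_words(s):
        return len(re.sub(CLAUSE_SEP, " ", s).split())

    def n_clauses(s):
        return len([c for c in re.split(CLAUSE_SEP, s) if c.strip()])

    word = sum(n_words(s) for s in sentences)
    count_sentence = len(sentences)
    count_clause = sum(n_clauses(s) for s in sentences)
    long_sentences = [s for s in sentences if n_words(s) > long_threshold]
    words_long = sum(n_words(s) for s in long_sentences)
    count_long = len(long_sentences)
    count_clause_long = sum(n_clauses(s) for s in long_sentences)

    mls = word / count_sentence
    mlc = word / count_clause if count_clause else 0.0
    # Assumption: with no long sentences, MLL and MLCL are reported as 0.
    mll = words_long / count_long if count_long else 0.0
    mlcl = words_long / count_clause_long if count_clause_long else 0.0
    return mls, mlc, mll, mlcl


text = ("The cat sat. The old dog, tired and slow, "
        "slept by the warm fire all day!")
mls, mlc, mll, mlcl = sentence_metrics(text, long_threshold=5)
print(mls, mlc, mll, mlcl)
```

With a threshold of five words, the sample has two whole sentences (3 and 13 words), four first-type clauses, one long sentence, and three second-type clauses. The sentence complexity is then the weighted sum of the four returned values with K21 to K24.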
3. Calculating a translation difficulty value of the document;
calculating the translation difficulty value of the document from the obtained vocabulary complexity and sentence complexity according to a translation difficulty calculation formula; the formula is as follows:
diff_doc=K1·diff_word+K2·diff_sentence;
K1 and K2 are the translation difficulty adjustment coefficients calculated from collected samples. They are first-level adjustment coefficients; they are multiple linear regression coefficients and can be calculated by the least squares method. The specific calculation method is the same as for the vocabulary level adjustment coefficients.
4. Determining a translation difficulty level of the document;
the translation difficulty value of the document is matched against a difficulty level table to obtain the difficulty level corresponding to that value;
the difficulty level table has a dictionary-like form and comprises a plurality of difficulty levels and the translation difficulty value range corresponding to each difficulty level;
the translation difficulty value ranges in the difficulty level table are obtained through learning or training.
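The lookup in step 4 could be sketched as below. The level names and value ranges are invented for illustration; the text only states that the ranges are obtained by learning or training. Each range is treated here as inclusive at the lower bound and exclusive at the upper bound, which is also an assumption.

```python
# Hypothetical difficulty level table: (low, high) range -> level name.
DIFFICULTY_LEVELS = {
    (0.0, 2.0): "easy",
    (2.0, 5.0): "medium",
    (5.0, float("inf")): "hard",
}


def difficulty_level(value, table=DIFFICULTY_LEVELS):
    """Return the level whose range contains `value`, or None if no range matches."""
    for (low, high), level in table.items():
        if low <= value < high:
            return level
    return None
```

A value such as `difficulty_level(3.2)` falls in the middle range of this example table and returns `"medium"`.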
The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (10)

1. A method for analyzing the translation difficulty of a document is characterized by comprising the following steps:
scanning a document to be translated, and determining all words and sentences in the document to be translated;
respectively carrying out complexity calculation according to the determined vocabulary and the sentence to obtain the vocabulary complexity and the sentence complexity of the document;
calculating a translation difficulty value of the document according to the vocabulary complexity and the sentence complexity of the document;
and determining the translation difficulty level of the document according to the position of the translation difficulty value of the document in a difficulty level table.
2. The method of claim 1, wherein calculating the vocabulary complexity of the document comprises:
calculating the vocabulary level, the type-token ratio (STTR), and the word-sense density of the semantic words of the document;
and calculating according to a vocabulary complexity calculation formula to obtain the vocabulary complexity of the document, wherein the vocabulary complexity calculation formula is as follows:
diff_word=K11·grade_word+K12·STTR+K13·density_notional;
wherein diff_word is the vocabulary complexity of the document, grade_word is the vocabulary level of the document, STTR is the type-token ratio of the document, density_notional is the word-sense density of the semantic words of the document, and K11, K12, and K13 are vocabulary complexity adjustment coefficients obtained through sample calculation.
3. The method of claim 2, further comprising, prior to calculating the vocabulary level of the document:
performing word segmentation processing on the document to obtain all vocabularies, and counting to obtain the total number of the vocabularies;
matching each obtained vocabulary in a vocabulary classification table to obtain the vocabulary level of each vocabulary; the vocabulary level is a first level, a second level, a third level or a fourth level;
counting, separately for each level, the number of words whose vocabulary level is the second level or higher;
the process of calculating the vocabulary level of the document includes:
and calculating the vocabulary level of the document according to a vocabulary level calculation formula, wherein the vocabulary level calculation formula is as follows:
Figure FDA0000443899360000021
wherein word_x is the number of words of level x, K111, K112, and K113 are vocabulary level adjustment coefficients obtained through sample calculation, and word is the total number of words.
4. The method of claim 3, wherein the process of calculating the type-token ratio of the document comprises:
counting, from all the obtained words, the number of types (class symbols) and the number of tokens (form symbols), and calculating the ratio of the number of types to the number of tokens to obtain the type-token ratio of the document; or
dividing all the obtained words, according to a standard quantity, into a plurality of sub-documents each containing the standard number of words and one sub-document containing fewer than the standard number of words, and calculating according to a type-token ratio calculation formula to obtain the type-token ratio of the document; the type-token ratio calculation formula is as follows:
Figure FDA0000443899360000031
wherein token is the number of tokens in the sub-document containing fewer than the standard number of words, type is the number of types in that sub-document, type_i is the number of types in the i-th sub-document containing the standard number of words, n is the number of sub-documents containing the standard number of words, and ST is the standard number of words per division unit.
5. The method of claim 3, prior to calculating the word-sense density of the semantic words of the document, further comprising:
performing part-of-speech tagging on all the obtained words to identify the semantic words among them;
arranging all the obtained semantic words in a fixed order;
obtaining the number of senses meanings_i of each semantic word using a semantic word ontology tool, wherein i is the sequence number of the semantic word; counting the total number of senses of the semantic words;
calculating according to a semantic word sense density calculation formula to obtain the word-sense density of the semantic words of the document; the semantic word sense density calculation formula is as follows:
Figure FDA0000443899360000032
wherein meanings_i is the number of senses of the i-th semantic word, and count_notional is the number of semantic words.
6. The method of claim 5, wherein the semantic word comprises at least one of the following parts of speech: nouns, pronouns, verbs, adjectives, adverbs, and interjections.
7. The method of claim 2, prior to calculating the sentence complexity of the document, further comprising:
calculating the average length of the whole sentence by determining the number of the whole sentences in the document;
calculating the average length of the first type clauses in the whole sentence by determining the number of the first type clauses in all the whole sentences in the document;
calculating the average length of the long sentences by determining the number of the long sentences in the document and the length of each long sentence;
calculating the average length of the second type clauses in the long sentences by determining the number of the second type clauses in all the long sentences in the document;
the process of calculating the sentence complexity of the document comprises:
calculating the sentence complexity of the document according to a sentence complexity calculation formula; the sentence complexity calculation formula is as follows:
diff_sentence=K21·MLS+K22·MLC+K23·MLL+K24·MLCL;
wherein MLS is the average length of the whole sentence, MLC is the average length of the first type clause, MLL is the average length of the long sentence, MLCL is the average length of the second type clause, and K21, K22, K23, and K24 are sentence complexity adjustment coefficients obtained through sample calculation.
8. The method of claim 7, wherein calculating the average length of the whole sentence and the first type clause comprises:
dividing the total vocabulary number by the whole sentence number to obtain the average length MLS of the whole sentence;
and dividing the total vocabulary number by the number of the first type clauses to obtain the average length MLC of the first type clauses.
9. The method of claim 7, wherein calculating the average length of the long sentence and the second type clause comprises:
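determining the length word_long_i of each long sentence, wherein 1 ≤ i ≤ count_long and i is the serial number of the long sentence;
calculating according to a long sentence average length calculation formula to obtain the long sentence average length; the long sentence average length calculation formula is as follows:
<math> <mrow> <mi>MLL</mi> <mo>=</mo> <mfrac> <mn>1</mn> <mrow> <mi>count</mi> <mo>_</mo> <mi>long</mi> </mrow> </mfrac> <mo>&CenterDot;</mo> <msubsup> <mi>&Sigma;</mi> <mrow> <mi>i</mi> <mo>=</mo> <mn>1</mn> </mrow> <mrow> <mi>count</mi> <mo>_</mo> <mi>long</mi> </mrow> </msubsup> <mi>word</mi> <mo>_</mo> <msub> <mi>long</mi> <mi>i</mi> </msub> <mo>;</mo> </mrow> </math>
wherein count_long is the number of the long sentences;
calculating according to an average length calculation formula of the second type clauses to obtain the average length of the second type clauses; the average length calculation formula of the second type clause is as follows:
<math> <mrow> <mi>MLCL</mi> <mo>=</mo> <mfrac> <mn>1</mn> <mrow> <mi>count</mi> <mo>_</mo> <mi>clause</mi> <mo>_</mo> <mi>long</mi> </mrow> </mfrac> <mo>&CenterDot;</mo> <msubsup> <mi>&Sigma;</mi> <mrow> <mi>i</mi> <mo>=</mo> <mn>1</mn> </mrow> <mrow> <mi>count</mi> <mo>_</mo> <mi>long</mi> </mrow> </msubsup> <mi>word</mi> <mo>_</mo> <msub> <mi>long</mi> <mi>i</mi> </msub> <mo>;</mo> </mrow> </math>
wherein count_clause_long is the number of the second type clauses.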
10. The method of claim 1, wherein the calculating of the translation difficulty value for the document comprises:
calculating to obtain a translation difficulty value of the document according to a translation difficulty calculation formula; the translation difficulty calculation formula is as follows:
diff_doc=K1·diff_word+K2·diff_sentence;
wherein K1 and K2 are translation difficulty adjustment coefficients obtained through sample calculation, and diff_doc is the translation difficulty value.
CN201310713175.2A 2013-12-23 2013-12-23 A kind of analysis method of document translation difficulty Active CN103744840B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310713175.2A CN103744840B (en) 2013-12-23 2013-12-23 A kind of analysis method of document translation difficulty

Publications (2)

Publication Number Publication Date
CN103744840A true CN103744840A (en) 2014-04-23
CN103744840B CN103744840B (en) 2016-12-07

Family

ID=50501858

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310713175.2A Active CN103744840B (en) 2013-12-23 2013-12-23 A kind of analysis method of document translation difficulty

Country Status (1)

Country Link
CN (1) CN103744840B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1266238A (en) * 1999-03-04 2000-09-13 英业达股份有限公司 English natural sentences antomatic identification and word querying free automatic processing method
JP2000516749A (en) * 1997-06-26 2000-12-12 コーニンクレッカ フィリップス エレクトロニクス エヌ ヴィ Machine construction method and apparatus for translating word source text into word target text
WO2002075585A1 (en) * 2001-03-21 2002-09-26 Fujitsu Limited Machine-translation apparatus
CN102214246A (en) * 2011-07-18 2011-10-12 南京大学 Method for grading Chinese electronic document reading on the Internet

Cited By (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104008094B (en) * 2014-05-22 2017-08-11 武汉传神信息技术有限公司 A kind of method for obtaining document translation difficulty
CN105224524A (en) * 2015-09-02 2016-01-06 网易有道信息技术(北京)有限公司 Document translation difficulty evaluation method and device
CN105224524B (en) * 2015-09-02 2022-01-25 网易有道信息技术(北京)有限公司 Document translation difficulty evaluation method and device
CN109086363A (en) * 2018-07-19 2018-12-25 百度在线网络技术(北京)有限公司 The file information maintenance degree determines method, device and equipment
CN109086363B (en) * 2018-07-19 2021-03-16 百度在线网络技术(北京)有限公司 File information maintenance degree determining method, device and equipment
CN112232060A (en) * 2020-09-27 2021-01-15 淄博职业学院 Intelligent international Chinese teaching-oriented sentence difficulty level online measuring system

Also Published As

Publication number Publication date
CN103744840B (en) 2016-12-07

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 430070 East Lake Hubei Development Zone, Optics Valley Software Park, a phase of the west, South Lake Road South, Optics Valley Software Park, No. 2, No. 5, layer 205, six

Applicant after: Language network (Wuhan) Information Technology Co., Ltd.

Address before: 430073 East Lake Hubei Development Zone, Optics Valley Software Park, a phase of the west, South Lake Road South, Optics Valley Software Park, No. 2, No. 5, layer 205, six

Applicant before: Wuhan Transn Information Technology Co., Ltd.

CB03 Change of inventor or designer information

Inventor after: Jiang Chao

Inventor after: Zhang Pi

Inventor before: Jiang Chao

COR Change of bibliographic data
C14 Grant of patent or utility model
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: Document translation difficulty analyzing method

Effective date of registration: 20181115

Granted publication date: 20161207

Pledgee: Bank of Communications Co., Ltd. Wuhan Branch of Hubei Free Trade Experimental Zone

Pledgor: Language network (Wuhan) Information Technology Co., Ltd.

Registration number: 2018420000061

PC01 Cancellation of the registration of the contract for pledge of patent right

Date of cancellation: 20200617

Granted publication date: 20161207

Pledgee: Bank of Communications Co.,Ltd. Wuhan Branch of Hubei Free Trade Experimental Zone

Pledgor: IOL (WUHAN) INFORMATION TECHNOLOGY Co.,Ltd.

Registration number: 2018420000061
