CN107870905A

CN107870905A - A method for recognizing specific vocabulary

Info

Publication number: CN107870905A
Application number: CN201711253593.2A
Authority: CN
Inventors: 郑丽华; 何征宇
Original assignee: Language Network (wuhan) Information Technology Co Ltd
Current assignee: Language Network (wuhan) Information Technology Co Ltd
Priority date: 2017-12-04
Filing date: 2017-12-04
Publication date: 2018-04-03
Anticipated expiration: 2037-12-04
Also published as: CN107870905B

Abstract

The invention discloses a method, a system, and a computer-readable medium for recognizing specific vocabulary in a document to be translated. With the method and system of the invention, most of the specific unconventional vocabulary encountered during translation can be recognized accurately, and the method can be implemented in computer software and/or hardware so that recognition and output are automatic. Used in actual translation work, the invention helps avoid translation errors involving such special words and improves translation accuracy. In addition, an unconventional-vocabulary lexicon can be built up gradually during translation and continuously enriched by the recognition process, so that, through the continuously updated lexicon, fully automatic translation of all documents to be translated, including their unconventional vocabulary, can eventually be achieved.

Description

A method for recognizing specific vocabulary

Technical field

The invention belongs to the field of vocabulary recognition, and in particular relates to a method for recognizing specific vocabulary in a document to be translated.

Background

Translators often encounter special words that are difficult to translate. These words are neither conventional English words nor conventional Hanyu Pinyin words. If they are translated using an existing conventional translation corpus, it is hard to find a rendering that matches the meaning of the source text. Consequently, whether the translation is done by machine or by a human, deviations are almost unavoidable because of the limits of the corpus or of the translator's knowledge.

A well-known example among translators is the rendering of "Chiang Kai-shek". In the book "A Study of the Academic History of the Eastern Section of the Sino-Russian Border: The Eastern Section Problem as Seen by Chinese, Russian, and Western Scholars", published in October 2008 by the well-known history professor Wang Qi, the name Jiang Jieshi (written "Chiang Kai-shek" in the Wade-Giles source text) was rendered as "Chang Kaishen". This is not an isolated case: another well-known scholar once rendered "Mencius" as "Men Xiusi", where the source text meant "Mengzi". Handling such vocabulary in translation is thus a problem even for experts in the field, let alone for the large number of ordinary translators and for machine translation tools.

The translation of such special words therefore needs special treatment; they cannot be translated as ordinary English or translated literally. Because the total number of such words is relatively small, one possible solution is to skip them during translation and keep the source-text expression, obtain a preliminary translation, and then identify the special words in it for post-processing; alternatively, the special words can be identified before translation and, for example, marked for emphasis, so as to avoid the translation errors described above. However, this kind of special handling reduces the speed and quality of document translation, and manual processing devoted to a small number of special words is time-consuming and laborious.

Summary of the invention

In view of the above problems, the invention proposes a method for recognizing special words that can accurately recognize the special words in a document to be translated, so as to avoid translation errors.

The special words referred to here are mainly words that are neither conventional English words nor words that conform to the Scheme for the Chinese Phonetic Alphabet (Hanyu Pinyin).

A "conventional" English word here means a word commonly encountered in ordinary language study. For example, the conventional English word for the city of Guangzhou is "Guangzhou", and a considerable number of people also understand "Canton". For historical reasons, however, "Kwangchow" and "Kuang-chou", when used as place names, should also be translated as Guangzhou, yet for most people these two words are "non-conventional".

Likewise, "Mao Tse-tung", "I Ching", and "Chunghwa" are not words that conform to the Scheme for the Chinese Phonetic Alphabet, and so they are also special words.

The inventors found through extensive study that most special words are nouns, including place names, personal names, names of institutions, and the like. Restricting the recognition of special words to nouns in the first place therefore meets the needs of practical work.

Accordingly, the recognition method proposed by the invention first comprises the following steps:

The file to be translated is segmented and the nouns in it are identified; all identified nouns are stored in an ordered list according to the order of their positions in the file to be translated.

Many common algorithms exist in the field for segmenting a file to be translated and identifying the nouns in it. For example, the file may first be split into sentences, and each sentence then analyzed semantically, including analysis of sentence constituents, to identify its structural parts (subject, verb, object, and so on), after which nouns are sought in the object part; or the prepositional parts may be identified and nouns sought at other specific positions outside them, such as the subject; or the degree of connection between different words may be analyzed, with a threshold on that degree used to judge whether the connected words, or the words before and after them, are nouns; or a dictionary, lexicon, or corpus may simply be queried to determine whether a word is a noun. These algorithms are not described further here.
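As an illustration of the simplest option mentioned above (looking words up in a lexicon), the following Python sketch collects nouns and orders them by their position in the file. The function name `nouns_in_document_order` and the hand-made lexicon are assumptions made for illustration only; the patent does not prescribe a particular noun-identification algorithm.

```python
import re
from typing import Iterable, List, Tuple


def nouns_in_document_order(text: str, noun_lexicon: Iterable[str]) -> List[Tuple[int, str]]:
    """Return (position, noun) pairs sorted by position in the text.

    Hypothetical helper: a noun is "identified" simply because it appears
    in the supplied lexicon. Real systems would use a proper tagger.
    """
    hits: List[Tuple[int, str]] = []
    for noun in noun_lexicon:
        # re.escape keeps hyphens and apostrophes in the noun literal.
        for match in re.finditer(re.escape(noun), text):
            hits.append((match.start(), noun))
    hits.sort()  # ordered list: earliest position first
    return hits


if __name__ == "__main__":
    sample = "Mao Tse-tung visited Kwangchow and Guangzhou."
    lexicon = {"Guangzhou", "Kwangchow", "Mao Tse-tung"}
    for position, noun in nouns_in_document_order(sample, lexicon):
        print(position, noun)
```

Keeping the position alongside each noun makes it straightforward to map a recognized special word back to its place in the file later on.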

Not every identified noun is a special word, so some preprocessing can be applied to filter out the potential special words and reduce the subsequent workload.

Specifically, the following preprocessing may be applied:

Judge whether the noun contains Latin letters; if it does not, the noun is not stored.

If it does, go on to judge whether the noun conforms to the Scheme for the Chinese Phonetic Alphabet; if it conforms, the noun is not stored.
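The two preprocessing checks can be sketched in Python as follows. This is a minimal sketch under stated assumptions, not the patented implementation: the syllable table `SYLLABLES` is a deliberately approximate reconstruction of the Hanyu Pinyin inventory (tone marks ignored, "ü" written as "v", no apostrophe rule), so it may accept a few strings that are not real pinyin, and the names `conforms_to_pinyin` and `keep_as_candidate` are hypothetical.

```python
import unicodedata
from functools import lru_cache

# Approximate table: which finals each group of initials may take.
_GROUPS = {
    "": "a o e ai ei ao ou an en ang eng er",
    "b p m f": "a o i u ai ei ao ou an en ang eng ie iao ian in ing",
    "d t n l": "a e i u v ai ei ao ou an en ang eng ong "
               "ia ie iao iu ian in ing iang uo ui uan un ve",
    "g k h": "a e u ai ei ao ou an en ang eng ong ua uo uai ui uan un uang",
    "zh ch sh r z c s": "a e i u ai ei ao ou an en ang eng ong "
                        "ua uo uai ui uan un uang",
    "j q x": "i u ia ie iao iu ian in ing iang iong ue uan un",
    "y": "a e i o u ao ou an in ang ing ong ue uan un",
    "w": "a o u ai ei an en ang eng",
}
SYLLABLES = {
    initial + final
    for initials, finals in _GROUPS.items()
    for initial in (initials.split() or [""])
    for final in finals.split()
}


@lru_cache(maxsize=None)
def _segmentable(s: str) -> bool:
    """True if s can be split entirely into syllables from SYLLABLES."""
    if not s:
        return True
    # The longest syllable in the table has six letters (e.g. "zhuang").
    return any(
        s[:n] in SYLLABLES and _segmentable(s[n:])
        for n in range(1, min(6, len(s)) + 1)
    )


def conforms_to_pinyin(noun: str) -> bool:
    """Approximate test: every space-separated part is a run of syllables."""
    parts = noun.lower().replace("ü", "v").split()
    return bool(parts) and all(
        p.isascii() and p.isalpha() and _segmentable(p) for p in parts
    )


def keep_as_candidate(noun: str) -> bool:
    """Preprocessing: keep nouns with Latin letters that are not pinyin."""
    has_latin = any("LATIN" in unicodedata.name(ch, "") for ch in noun)
    return has_latin and not conforms_to_pinyin(noun)
```

Under these assumptions, "Guangzhou" is dropped because it segments into valid syllables, while "Kwangchow" and "I Ching" are kept as candidates for the later analysis.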

The nouns remaining in the ordered list after this preprocessing are all potential special words and proceed to the next stage of analysis: the nouns in the ordered list are read in turn, and semantic analysis is performed on each noun to determine whether it belongs to the specific vocabulary.

At this stage, the means and judgment method adopted by the invention are as follows: the noun is split byte by byte to obtain multiple feature fields; if at least one of the feature fields satisfies a predetermined condition, the noun is determined to belong to the specific vocabulary.

The invention is the first to propose this specific way of recognizing specific vocabulary. First, splitting the noun byte by byte maximizes the accuracy of the resulting feature fields; second, judging whether the byte-level feature fields satisfy the predetermined condition captures the "special" nature of the noun to the greatest extent.

As to the former, the feature fields obtained by splitting the noun byte by byte consist of one or more of the following kinds of field: Latin letter, space, diacritic (additional symbol), connector (for example, a hyphen).

As to the latter, satisfying the predetermined condition means satisfying at least one of the following conditions:

The feature fields include multiple Latin letters and also include a connector;

The feature fields include multiple Latin letters and at least one diacritic, the diacritic being located above or at the upper right of at least one Latin letter.

Through the above steps, the invention can identify at least special words such as "Mao Tse-tung", "Kuang-chou", "Chiang Kai-shek", and "Ch'eng T'ien-fang".
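The splitting into feature fields and the predetermined condition might be sketched in Python as below. Several points are assumptions rather than part of the patent text: the noun is split per Unicode character instead of per raw byte, apostrophe-like marks and Unicode combining marks stand in for "diacritics above or at the upper right of a letter", the hyphen characters listed stand in for the "connector", and the names `feature_fields` and `meets_predetermined_condition` are hypothetical.

```python
import unicodedata
from typing import List, Tuple

# Assumed character sets; the patent does not enumerate them.
CONNECTORS = {"-", "\u2010", "\u2011"}
ASPIRATION_MARKS = {"'", "`", "\u2018", "\u2019", "\u02bb", "\u02bc"}


def feature_fields(noun: str) -> List[Tuple[str, str]]:
    """Split a noun into (character, kind) feature fields."""
    fields = []
    # NFD separates accents from their base letters, e.g. "ü" -> "u" + U+0308.
    for ch in unicodedata.normalize("NFD", noun):
        if ch in CONNECTORS:
            kind = "connector"
        elif ch.isspace():
            kind = "space"
        elif ch in ASPIRATION_MARKS or unicodedata.combining(ch):
            kind = "diacritic"
        elif "LATIN" in unicodedata.name(ch, ""):
            kind = "letter"
        else:
            kind = "other"
        fields.append((ch, kind))
    return fields


def meets_predetermined_condition(noun: str) -> bool:
    """Multiple Latin letters plus a connector, or plus a diacritic."""
    kinds = [kind for _, kind in feature_fields(noun)]
    has_letters = kinds.count("letter") >= 2
    return has_letters and ("connector" in kinds or "diacritic" in kinds)


if __name__ == "__main__":
    for word in ["Mao Tse-tung", "Ch'eng T'ien-fang", "Kwangchow", "I Ching"]:
        print(word, meets_predetermined_condition(word))
```

In this sketch the position of a diacritic is not checked explicitly; a mark is treated as "above or at the upper right" simply because it is a combining mark or an apostrophe-like character.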

The emphasis of the term "diacritic" (additional symbol) here is on "additional": it should be understood as a symbol that would not appear under conventional spelling. For example, aspiration marks (‘) (’) do not normally appear in English-language text, nor do additional marks above a letter, at its upper right, or in other positions.

Accordingly, the diacritics of the invention are not limited to the aspiration marks (‘) (’), nor to other symbols located above or at the upper right of at least one Latin letter; they may also appear in other positions.

The predetermined condition above is one of the most salient features of special words. Cases may still be missed, however, such as the previously mentioned "Kwangchow", "I Ching", and "Chunghwa", and a further judgment is then needed: if the feature fields do not satisfy the predetermined condition, the following identification steps are carried out:

Judge whether the feature fields include a space;

If they do not include a space, judge whether the characters formed by the feature fields conform to the Scheme for the Chinese Phonetic Alphabet; if they do not conform, the noun is determined to belong to the specific vocabulary;

If they include a space, judge whether at least one of the two character strings formed by the feature fields before and after the space fails to conform to the Scheme for the Chinese Phonetic Alphabet; if so, the noun is determined to belong to the specific vocabulary.

By this standard, "Kwangchow" and "Chunghwa" contain no space, but the characters composing them do not conform to the Scheme for the Chinese Phonetic Alphabet; "I Ching" contains a space, but the "Ching" after the space does not conform to the scheme, and a single "I" cannot form a pinyin syllable either.

The invention can therefore go on to identify such special words as well.
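The fallback steps for nouns that fail the predetermined condition could look like the following Python sketch. The pinyin test is passed in as a parameter, because the text does not say how conformity to the Scheme for the Chinese Phonetic Alphabet is checked; the function name `fallback_is_special` and the toy predicate in the usage example are assumptions for illustration.

```python
from typing import Callable


def fallback_is_special(noun: str, conforms_to_pinyin: Callable[[str], bool]) -> bool:
    """Apply the space-based fallback to a noun.

    conforms_to_pinyin is a caller-supplied predicate (hypothetical here)
    that reports whether a string follows the pinyin scheme.
    """
    parts = noun.split()
    if len(parts) <= 1:
        # No space: the whole string must conform, otherwise it is special.
        return not conforms_to_pinyin(noun)
    # With a space: special if at least one part fails to conform.
    return any(not conforms_to_pinyin(part) for part in parts)


if __name__ == "__main__":
    # Toy predicate: a tiny whitelist standing in for a real pinyin check.
    known_pinyin = {"guangzhou", "bei", "jing"}

    def toy_check(s: str) -> bool:
        return s.lower() in known_pinyin

    for word in ["Kwangchow", "I Ching", "Guangzhou", "Bei Jing"]:
        print(word, fallback_is_special(word, toy_check))
```

Injecting the predicate keeps the fallback logic independent of whichever pinyin validator an implementation chooses.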

As can be seen, the recognition method proposed by the invention can be carried out automatically by a computer program. With this method, most of the special words in a document to be translated can be recognized accurately.

In another aspect of the invention, a specific vocabulary recognition system is also provided for recognizing specific vocabulary in a file to be translated, the specific vocabulary including at least one Latin letter; the system comprises the following modules:

An identification module, which segments the file to be translated and identifies and outputs the nouns in it;

A preprocessing module, which preprocesses the nouns output by the identification (segmentation) module; the preprocessing includes judging whether the noun contains Latin letters and judging whether the noun conforms to the Scheme for the Chinese Phonetic Alphabet;

A storage module, which stores the nouns processed by the preprocessing module in an ordered list according to the order of their positions in the file to be translated;

A semantic analysis module, which reads the nouns in the ordered list in turn and performs semantic analysis on each noun to determine whether it belongs to the specific vocabulary;

Characterized in that the semantic analysis module includes a byte splitting module, a judgment module, and a result output module,

The byte splitting module splits the noun byte by byte to obtain multiple feature fields;

The judgment module judges whether at least one of the feature fields satisfies the predetermined condition;

The result output module outputs the recognition result for the vocabulary according to the judgment module.

The above recognition system can be used to carry out the recognition method proposed earlier, contains the corresponding functional modules, and can be implemented in computer hardware or software. When implemented in software, a computer-readable storage medium may be used on which computer-readable instructions are stored; the instructions are executed by a memory and a processor to implement the above method.

It should be pointed out that specific vocabulary, as meant by the invention, is defined not only relative to conventional vocabulary but also relative to the translator's current level of knowledge. For example, when the well-known history professor Wang Qi translated "Chiang Kai-shek", the term was, given the level of awareness at that time, "specific vocabulary" as defined by the invention. With the wide spread of culture and the passage of time, however, "Chiang Kai-shek" no longer counts as specific vocabulary even for a person of ordinary skill in the art; it is a commonly known word, because the relevant translation corpora and translation tools have all stored the correct translation, "Jiang Jieshi". The same holds for "Mencius": existing translation tools can correctly identify it and translate it as "Mengzi".

However, just as when "Chiang Kai-shek" and "Mencius" were first translated, many documents to be translated still contain, for historical reasons, a large number of similar specific vocabulary items. When such a word is translated for the first time, the translator may still make mistakes for lack of any reference; at the same time, existing translation corpora and translation tools have no way of anticipating such cases. In this situation, the method of the invention is still needed to keep identifying specific vocabulary during translation.

For each specific vocabulary item that is identified, it can be judged whether an accurate translation already exists. For example, a specific vocabulary corpus can be established in which existing translations of specific vocabulary are stored; newly identified specific vocabulary is added continuously, so that the specific vocabulary translation corpus is kept up to date.
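One way to picture such a corpus is the small Python sketch below. The class name `SpecialVocabularyCorpus` and its methods are hypothetical, and an in-memory dictionary is assumed purely for illustration; the text does not specify how the corpus is stored.

```python
from typing import Dict, List, Optional


class SpecialVocabularyCorpus:
    """Toy in-memory corpus: special word -> known translation (or None)."""

    def __init__(self) -> None:
        self._entries: Dict[str, Optional[str]] = {}

    def register(self, term: str, translation: Optional[str] = None) -> None:
        """Add a newly identified term, or record a translation for it."""
        # Never overwrite an existing translation with "unknown".
        if translation is not None or term not in self._entries:
            self._entries[term] = translation

    def lookup(self, term: str) -> Optional[str]:
        """Return the stored translation, or None if there is none yet."""
        return self._entries.get(term)

    def pending(self) -> List[str]:
        """Terms that were identified but still lack a translation."""
        return [t for t, tr in self._entries.items() if tr is None]


if __name__ == "__main__":
    corpus = SpecialVocabularyCorpus()
    corpus.register("Chiang Kai-shek", "Jiang Jieshi")  # known translation
    corpus.register("Kuang-chou")                       # newly identified
    print(corpus.lookup("Chiang Kai-shek"))
    print(corpus.pending())
```

Terms returned by `pending()` are the ones a translator would need to resolve before the corpus can support automatic translation of them.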

Therefore, with the method and system of the invention, most of the specific unconventional vocabulary encountered during translation can be recognized accurately, and the method can be implemented in computer software and/or hardware so that recognition and output are automatic. Used in actual translation work, the invention helps avoid translation errors involving such special words and improves translation accuracy. In addition, an unconventional-vocabulary lexicon can be built up gradually during translation and continuously enriched by the recognition process, so that, through the continuously updated lexicon, fully automatic translation of all documents to be translated, including their unconventional vocabulary, can eventually be achieved.

Brief description of the drawings

Fig. 1 is a flow chart of a recognition method of the invention.

Fig. 2 is a block diagram of the recognition system of the invention.

Embodiment

Referring to Fig. 1, the steps of the recognition method proposed by the invention are as follows:

S1: Segment the file to be translated and identify the nouns in it;

S2: Judge whether the current noun contains Latin letters; if it does not, the noun is not stored and the next noun is judged; otherwise, proceed to step S3;

S3: Judge whether the noun conforms to the Scheme for the Chinese Phonetic Alphabet; if it conforms, the noun is not stored and the next noun is judged; otherwise, proceed to step S4;

S4: Store all nouns identified in this way in an ordered list according to the order of their positions in the file to be translated;

S5: Read the nouns in the ordered list in sequence;

S6: Split the noun byte by byte to obtain multiple feature fields;

S7: Judge whether at least one of the feature fields satisfies the predetermined condition; if so, output the noun as a special word; otherwise, read the next noun and continue judging, until all nouns in the ordered list have been processed.
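The flow of steps S2 to S7 can be sketched in Python as a small driver function. It is only a sketch: the three checks are passed in as callables because their concrete form is an implementation choice, the input is assumed to be the nouns from step S1 already in document order, and the name `recognize_special_words` and the toy predicates in the usage example are hypothetical.

```python
from typing import Callable, Iterable, List

Predicate = Callable[[str], bool]


def recognize_special_words(
    nouns: Iterable[str],
    has_latin: Predicate,
    conforms_to_pinyin: Predicate,
    meets_condition: Predicate,
) -> List[str]:
    """Run steps S2 to S7 over nouns given in document order (S1 output)."""
    ordered: List[str] = []
    for noun in nouns:
        if not has_latin(noun):        # S2: no Latin letters, do not store
            continue
        if conforms_to_pinyin(noun):   # S3: valid pinyin, do not store
            continue
        ordered.append(noun)           # S4: keep in positional order

    special: List[str] = []
    for noun in ordered:               # S5: read the list in sequence
        if meets_condition(noun):      # S6 + S7: split and test the fields
            special.append(noun)
    return special


if __name__ == "__main__":
    # Toy predicates standing in for real implementations.
    found = recognize_special_words(
        ["广州", "Guangzhou", "Kuang-chou", "Kwangchow", "Mao Tse-tung"],
        has_latin=lambda n: any(c.isascii() and c.isalpha() for c in n),
        conforms_to_pinyin=lambda n: n.lower() == "guangzhou",
        meets_condition=lambda n: "-" in n or "'" in n,
    )
    print(found)
```

Because each check is injected, the order of S2 and S3 or the addition of a further fallback test can be varied without changing the overall loop.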

The steps shown in Fig. 1 are only one specific implementation of the method of the invention. In practice, the order of steps S2 and S3 may be exchanged; in the current order, S3 may be moved to after step S4, and step S2 may likewise be moved to after step S4; similarly, S2 or S3 may be performed after the judgment in step S7 yields "no". Those skilled in the art will understand that these steps may be separated or merged in different combinations, as long as special words can ultimately be identified according to the predetermined condition.

For example, the method of the invention may skip the judgment of step S3 at the outset and instead, on reaching the step "if the feature fields do not satisfy the predetermined condition", continue with the following identification steps:

Judge whether the feature fields include a space;

Fig. 2 shows the recognition system of the invention, which comprises the following modules:

In summary, with the method and system of the invention, most of the specific unconventional vocabulary encountered during translation can be recognized accurately, and the method can be implemented in computer software and/or hardware so that recognition and output are automatic. Used in practical work, the invention helps avoid translation errors such as those described in the Background section and improves translation accuracy; in addition, an unconventional-vocabulary lexicon can be built up gradually during translation and continuously enriched by the recognition process, so that, through the continuously updated lexicon, fully automatic translation of all documents to be translated, including their unconventional vocabulary, can eventually be achieved.

Claims

1. A method for recognizing specific vocabulary in a file to be translated, the specific vocabulary including at least one Latin letter, the recognition method comprising the following steps:

(1) segmenting the file to be translated, identifying the nouns in it, and storing all identified nouns in an ordered list according to the order of their positions in the file to be translated;

(2) reading the nouns in the ordered list in turn and performing semantic analysis on each noun to determine whether the noun belongs to the specific vocabulary;

characterized in that,

in step (2), performing semantic analysis on the noun to determine whether the noun belongs to the specific vocabulary specifically comprises:

(21) splitting the noun byte by byte to obtain multiple feature fields;

(22) if at least one of the feature fields satisfies a predetermined condition, determining that the noun belongs to the specific vocabulary.

2. The method of claim 1, wherein in step (2) the feature fields obtained by splitting the noun byte by byte consist of one or more of the following kinds of field: Latin letter, space, diacritic, connector.

3. The method of claim 2, wherein satisfying the predetermined condition means satisfying at least one of the following conditions:

(31) the feature fields include multiple Latin letters and also include a connector;

(32) the feature fields include multiple Latin letters and at least one diacritic, the diacritic being located above or at the upper right of at least one Latin letter.

4. The method of claim 3, further comprising, if the feature fields do not satisfy the predetermined condition, carrying out the following identification steps:

(41) judging whether the feature fields include a space;

(42) if they do not include a space, judging whether the characters formed by the feature fields conform to the Scheme for the Chinese Phonetic Alphabet, and if they do not conform, determining that the noun belongs to the specific vocabulary;

(43) if they include a space, judging whether at least one of the two character strings formed by the feature fields before and after the space fails to conform to the Scheme for the Chinese Phonetic Alphabet, and if so, determining that the noun belongs to the specific vocabulary.

5. The method of claim 1, wherein storing all identified nouns in an ordered list according to the order of their positions in the file to be translated further comprises a preprocessing step: judging whether the noun contains Latin letters, and if it does not, not storing the noun.

6. The method of claim 5, wherein it is judged whether the noun contains Latin letters; if it does, it is further judged whether the noun conforms to the Scheme for the Chinese Phonetic Alphabet, and if it conforms, the noun is not stored.

7. A specific vocabulary recognition system for recognizing specific vocabulary in a file to be translated, the specific vocabulary including at least one Latin letter; the system comprises the following modules:

8. The system of claim 7, wherein the feature fields obtained by the byte splitting module splitting the noun byte by byte consist of one or more of the following kinds of field: Latin letter, space, diacritic, connector.

9. The system of claim 7, wherein satisfying the predetermined condition means satisfying at least one of the following conditions:

(91) the feature fields include multiple Latin letters and also include a connector;

(92) the feature fields include multiple Latin letters and at least one diacritic, the diacritic being located above or at the upper right of at least one Latin letter.

10. A computer-readable storage medium on which computer-readable instructions are stored, the instructions being executed by a memory and a processor to implement the method of any one of claims 1 to 6.