CN107870905B

CN107870905B - Method for identifying specific vocabulary

Info

Publication number: CN107870905B
Application number: CN201711253593.2A
Authority: CN
Inventors: 郑丽华; 何征宇
Original assignee: Iol Wuhan Information Technology Co ltd
Current assignee: Iol Wuhan Information Technology Co ltd
Priority date: 2017-12-04
Filing date: 2017-12-04
Publication date: 2021-09-17
Anticipated expiration: 2037-12-04
Also published as: CN107870905A

Abstract

The invention discloses a method, a system and a computer-readable medium for identifying specific words in a document to be translated. With the method and system, most of the specific unconventional words that appear during translation can be accurately recognized, and recognition and output can be automated using computer software and/or hardware. The invention avoids translation errors involving such special vocabulary and improves the accuracy of translation work. In addition, an unconventional-vocabulary library can be built up step by step during translation and continuously enriched through the recognition process; with this constantly updated library, fully automatic translation of all texts to be translated, including those containing unconventional vocabulary, can ultimately be achieved.

Description

Method for identifying specific vocabulary

Technical Field

The invention belongs to the field of vocabulary recognition, and particularly relates to a method for recognizing a specific vocabulary in a document to be translated.

Background

Translation work often runs into the problem of translating certain special words. These special words are neither conventional English words nor conventional Chinese pinyin words. When they are translated by relying on an existing conventional translation corpus, it is difficult to find a translation that matches the meaning of the original text. Therefore, whether the translation is done by machine or by hand, deviations inevitably arise from the limits of the corpus or of the translator's skill.

An example well known to human translators is the translation of "Chiang Kai-shek". In a book published in October 2008 by a well-known history professor, a study of the academic history of the eastern section of the Sino-Russian border as seen by Chinese, Russian and Western scholars, Jiang Jieshi (written in the original text in Wade-Giles romanization as "Chiang Kai-shek") was translated as "Chang Kaishen"; "Mencius" was likewise once translated by another well-known scholar as "Menxiusi" (the intended name is "Mengzi"). Evidently, handling such words in translation work is a difficult problem even for experts in the field, let alone for the many ordinary translators and machine translation tools.

Therefore, translating such special words requires special handling; they cannot simply be translated as ordinary English words, let alone translated literally. Because the total number of such words is relatively small, one possible approach is to skip them during translation and keep the original expression, obtain a preliminary translation, and then identify the special words for later processing; alternatively, the special words can be identified before translation and marked, for example by highlighting, so that translation errors are avoided. Such special handling, however, reduces the translation speed and quality of the document, and manual processing dedicated to a small number of special words is time-consuming and laborious.

Disclosure of Invention

In view of the above problems, the invention provides a special vocabulary identification method that can accurately identify the special vocabulary in a document to be translated, so that translation errors are avoided.

The special vocabulary here mainly refers to words that are neither conventional English words nor words conforming to the Chinese pinyin scheme.

A "traditional" English word here means a word that is commonly used in conventional language learning. For example, the conventional English name of the city is "Guangzhou", and a considerable number of people also know "Canton". For historical reasons, "Kwangchow" and "Kuang-chou" are also accurate place-name renderings of Guangzhou, but for most people these two words are "non-traditional" words.

Similarly, "Mao Tse-tung", "I Ching" and "Chunghwa" do not conform to the Chinese pinyin scheme and therefore also belong to the special vocabulary.

Through extensive statistical research, the inventors found that most special words are nouns, including place names, personal names, organization names and the like. The recognition scope for special words is therefore first limited to nouns, which matches practical working requirements.

Therefore, the identification method provided by the invention comprises the following steps:

The file to be translated is segmented, the nouns in it are identified, and all identified nouns are stored in an ordered list in the order in which they appear in the file.

Many common algorithms exist in the art for segmenting a document to be translated and identifying the nouns in it. For example, the document is first divided into sentences, and each sentence is then analyzed semantically, including sentence-component analysis, to identify structural parts such as the subject, predicate and object, after which nouns are searched for in the object part; or prepositional parts are identified and nouns are then sought at other specific positions, such as the subject; or the connectivity between different words is analyzed, and whether a connected word, or the words before and after it, are nouns is judged by whether the connectivity exceeds a certain threshold; or a dictionary, word bank or corpus is queried directly to determine whether a word is a noun. These algorithms are not described in detail here.
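
As a rough illustration of the last option (direct dictionary lookup) combined with the ordered list, the following Python sketch may help. It is not the patent's implementation: the lexicon `NOUN_LEXICON` and the function `extract_nouns` are hypothetical names introduced here, and a real system would use a full dictionary or a part-of-speech tagger instead of a four-entry set.

```python
import re
from typing import List, Set, Tuple

# Hypothetical noun lexicon (assumption); stands in for a dictionary,
# word bank, or corpus that can answer "is this a noun?".
NOUN_LEXICON: Set[str] = {"I Ching", "Kuang-chou", "Guangzhou", "Canton"}


def extract_nouns(text: str, lexicon: Set[str]) -> List[Tuple[int, str]]:
    """Return (position, noun) pairs ordered by position in the text."""
    found: List[Tuple[int, str]] = []
    for noun in lexicon:
        # Match the lexicon entry only when it is not embedded in a longer word.
        pattern = r"(?<!\w)" + re.escape(noun) + r"(?!\w)"
        for match in re.finditer(pattern, text):
            found.append((match.start(), noun))
    # Sorting by start offset yields the ordered list described in the text.
    found.sort()
    return found
```

Keeping the character offset alongside each noun is one simple way to preserve the position order that the later steps rely on.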

Not all recognized nouns are special words, so some preprocessing can be performed to select the potential special words and reduce the subsequent workload.

Specifically, the following preprocessing means may be adopted:

Judge whether the noun contains Latin letters; if not, the noun does not need to be stored.

If it does, further judge whether the noun conforms to the Chinese pinyin scheme; if it does, the noun does not need to be stored.
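
The two preprocessing checks can be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the patented implementation: `PINYIN_SYLLABLES` is a deliberately tiny, hypothetical subset of the pinyin syllable table (a real system would load the complete table), and `contains_latin`, `is_pinyin` and `preprocess` are names invented for this example.

```python
import re
from typing import FrozenSet, Iterable, List

# Tiny illustrative subset of pinyin syllables (assumption, NOT complete).
PINYIN_SYLLABLES: FrozenSet[str] = frozenset(
    {"guang", "zhou", "meng", "zi", "bei", "jing"}
)
MAX_SYLLABLE_LEN = 6  # longest toneless pinyin syllables, e.g. "zhuang"


def contains_latin(noun: str) -> bool:
    """First check: does the noun contain at least one basic Latin letter?"""
    return re.search(r"[A-Za-z]", noun) is not None


def is_pinyin(word: str, syllables: FrozenSet[str] = PINYIN_SYLLABLES) -> bool:
    """Second check: can the word be split entirely into known syllables?"""
    w = word.lower()
    if not w:
        return False
    # reachable[i] is True when w[:i] can be written as a run of syllables.
    reachable = [True] + [False] * len(w)
    for end in range(1, len(w) + 1):
        for start in range(max(0, end - MAX_SYLLABLE_LEN), end):
            if reachable[start] and w[start:end] in syllables:
                reachable[end] = True
                break
    return reachable[-1]


def preprocess(nouns: Iterable[str]) -> List[str]:
    """Keep only nouns that contain Latin letters and are not plain pinyin."""
    kept = []
    for noun in nouns:
        if not contains_latin(noun):
            continue  # no Latin letters: does not need to be stored
        if is_pinyin(noun):
            continue  # conforms to the pinyin scheme: does not need to be stored
        kept.append(noun)
    return kept
```

With this subset, "Guangzhou" and "Mengzi" are filtered out as ordinary pinyin, while "Kuang-chou" and "Kwangchow" survive as candidates for the later analysis.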

The nouns remaining in the ordered list after preprocessing are all potential special words and proceed to the next analysis: the nouns in the ordered list are read in sequence, and semantic analysis is performed on each noun to determine whether it belongs to the specific vocabulary.

Here, the means and judgment method adopted by the invention are as follows: the noun is segmented in units of bytes to obtain a plurality of feature fields; if at least one of the plurality of feature fields satisfies a predetermined condition, the noun is determined to belong to the specific vocabulary.

The invention is the first to propose a concrete way of recognizing specific words. First, the noun is segmented in units of bytes, which makes the resulting feature fields as accurate as possible; second, the "specificity" of the noun is recognized to the greatest extent according to whether the byte-level feature fields satisfy the predetermined condition.

For the former, the plurality of feature fields obtained by segmenting the noun in units of bytes consist of one or more of the following fields: Latin letters, spaces, additional symbols (such as diacritics), and connectors.

For the latter, the satisfaction of the predetermined condition means that at least one of the following conditions is satisfied:

the plurality of characteristic fields comprise a plurality of Latin letters and connectors;

the plurality of characteristic fields comprise a plurality of Latin letters and at least one additional symbol, and the additional symbol is positioned at the upper part or the upper right corner of at least one Latin letter.

Through the above steps, the invention can at least identify special words such as "Mao Tse-tung", "Kuang-chou", "Chiang Kai-shek" and "Ch'eng T'ien-fang".

The "additional symbol" here, with the emphasis on "additional", should be understood as a symbol that does not appear in conventional spelling; for example, the various aspiration marks (') (') do not normally appear in English text, nor do extra marks above, at the upper right corner of, or elsewhere around a letter.

Therefore, the additional symbols of the present invention are not limited to the aspiration marks (') ('), nor to symbols located above or at the upper right corner of at least one Latin letter; they may also appear at other positions.
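
The segmentation into feature fields and the two predetermined conditions can be sketched as below. Several points are assumptions made for this example and are not stated in the source: the "byte" unit is approximated by one Unicode character after NFD normalization (so that a precomposed letter such as "ü" splits into a letter plus a combining mark), the connector is taken to be a hyphen, and the sets `CONNECTORS` and `ASPIRATION_MARKS` are hypothetical, non-exhaustive choices.

```python
import unicodedata
from typing import List, Tuple

# Hypothetical character sets (assumptions, not exhaustive).
CONNECTORS = {"-", "\u2010", "\u2013"}
ASPIRATION_MARKS = {"'", "\u2018", "\u2019", "\u02bb", "`"}


def feature_fields(noun: str) -> List[Tuple[str, str]]:
    """Split a noun into (kind, character) feature fields."""
    fields = []
    for ch in unicodedata.normalize("NFD", noun):
        if ch == " ":
            kind = "space"
        elif ch in CONNECTORS:
            kind = "connector"
        elif ch in ASPIRATION_MARKS or unicodedata.combining(ch):
            # Apostrophe-like marks and combining diacritics count as
            # "additional symbols" in this sketch.
            kind = "additional"
        elif ch.isalpha() and unicodedata.name(ch, "").startswith("LATIN"):
            kind = "latin"
        else:
            kind = "other"
        fields.append((kind, ch))
    return fields


def meets_predetermined_condition(noun: str) -> bool:
    """True if the noun has several Latin letters plus a connector,
    or several Latin letters plus at least one additional symbol."""
    kinds = [kind for kind, _ in feature_fields(noun)]
    if kinds.count("latin") < 2:
        return False
    return "connector" in kinds or "additional" in kinds
```

Under these assumptions the check accepts "Mao Tse-tung" and "Ch'eng T'ien-fang" but not "Kwangchow" or "I Ching", which is why the text goes on to describe a further determination for such words.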

The predetermined condition captures one of the most prominent features of the specific vocabulary. However, some cases may be missed, such as the aforementioned "Kwangchow", "I Ching" and "Chunghwa", for which further determination is needed: if none of the plurality of characteristic fields meets the predetermined condition, the following identification steps are continued:

judging whether the characteristic fields contain spaces or not;

if no space is contained, judging whether the string formed by the plurality of characteristic fields satisfies the Chinese pinyin scheme; if not, determining that the noun belongs to the specific vocabulary;

if a space is contained, judging whether at least one of the two strings formed by the characteristic fields before and after the space fails to satisfy the Chinese pinyin scheme; if so, determining that the noun belongs to the specific vocabulary.

By this standard, "Kwangchow" and "Chunghwa" contain no spaces, but their letter strings do not conform to the Chinese pinyin scheme; "I Ching" contains a space, but the "Ching" after the space does not satisfy the Chinese pinyin scheme, and the standalone "I" does not form a valid pinyin spelling either.
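
The further determination can be sketched as a small function that receives the pinyin test as a parameter. This is a minimal sketch, not the patent's code: `fallback_is_specific` is a hypothetical name, the caller is assumed to supply a real pinyin checker, and the sketch generalizes "the two strings before and after the space" to any number of space-separated parts.

```python
from typing import Callable


def fallback_is_specific(noun: str, is_pinyin: Callable[[str], bool]) -> bool:
    """Apply the space / pinyin test to a noun that failed the main condition."""
    if " " not in noun:
        # No space: the whole string must conform to the pinyin scheme,
        # otherwise the noun is treated as specific vocabulary.
        return not is_pinyin(noun)
    # Space present: one non-pinyin part is enough to mark the noun as specific.
    parts = [part for part in noun.split(" ") if part]
    return any(not is_pinyin(part) for part in parts)


# Stand-in pinyin test for demonstration only (assumption): a lookup in a
# tiny hand-written set rather than a real pinyin validator.
_KNOWN_PINYIN = {"guangzhou", "meng", "zi"}


def demo_is_pinyin(word: str) -> bool:
    return word.lower() in _KNOWN_PINYIN
```

Passing the checker in as a callable keeps this step independent of how pinyin conformity is actually decided, whether by a syllable table, a rule set, or a dictionary.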

Thus, the present invention can continue to recognize such special words.

It can be seen that the above recognition method proposed by the present invention can be automatically implemented by a computer program. By the method, most of special words in the document to be translated can be accurately identified.

In another aspect of the invention, a specific vocabulary recognition system is also provided for recognizing a specific vocabulary in a to-be-translated file, wherein the specific vocabulary comprises at least one Latin letter; the system comprises the following modules:

the recognition module is used for segmenting the to-be-translated file, recognizing and outputting nouns in the to-be-translated file;

the preprocessing module is used for preprocessing the nouns output by the recognition module; the preprocessing comprises: judging whether the noun contains Latin letters, and judging whether the noun conforms to the Chinese pinyin scheme;

the storage module stores the nouns processed by the preprocessing module in an ordered list according to the position sequence of the nouns in the file to be translated;

the semantic analysis module is used for reading the nouns in the ordered list in sequence and carrying out semantic analysis on the nouns so as to determine whether the nouns belong to a specific vocabulary or not;

it is characterized in that the semantic analysis module comprises a byte segmentation module, a judgment module and a result output module,

the byte segmentation module segments the noun by taking bytes as a unit to obtain a plurality of characteristic fields;

the judging module is used for judging whether at least one of the characteristic fields meets a preset condition;

and the result output module outputs the recognition result of the vocabulary according to the judgment module.
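
One possible way to picture how these modules fit together is the following Python sketch. It is only a structural illustration under assumptions: the class names `RecognitionSystem` and `SemanticAnalysisModule`, their fields, and the use of plain callables for the recognition, preprocessing and judgment logic are all hypothetical and not taken from the source.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SemanticAnalysisModule:
    # Judgment module: decides from the feature fields (supplied by caller).
    judge: Callable[[List[str]], bool]

    def segment(self, noun: str) -> List[str]:
        """Byte segmentation module; one field per character (assumption)."""
        return list(noun)

    def analyze(self, noun: str) -> Dict[str, object]:
        """Result output module: report the judgment for one noun."""
        return {"noun": noun, "specific": self.judge(self.segment(noun))}


@dataclass
class RecognitionSystem:
    recognize: Callable[[str], List[str]]  # recognition module
    keep: Callable[[str], bool]            # preprocessing module
    analysis: SemanticAnalysisModule       # semantic analysis module
    stored: List[str] = field(default_factory=list)  # storage module

    def run(self, document: str) -> List[Dict[str, object]]:
        # Store the preprocessed nouns in document order, then analyze each.
        self.stored = [n for n in self.recognize(document) if self.keep(n)]
        return [self.analysis.analyze(n) for n in self.stored]
```

Each callable can be replaced independently, for example swapping a dictionary-based noun recognizer for a tagger, without touching the other modules.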

The recognition system can be used to execute the recognition method provided by the invention; it comprises the corresponding functional modules and is implemented in computer hardware or software. When implemented in software, the above method may be embodied in a computer-readable storage medium storing computer-readable instructions that are executed by a processor and a memory.

It should be noted that the "specific vocabulary" of the invention is defined relative not only to traditional vocabulary but also to the translator's current level of knowledge. For example, when the well-known history professor Wang translated "Chiang Kai-shek", the word was, given the level of knowledge at that time, a specific word as defined in the present invention. However, with wide cultural dissemination and the passage of time, "Chiang Kai-shek" is now no longer a specific word but a common one, even for a person skilled in the art, because the relevant translation corpora and translation tools already store its correct translation, "Jiang Jieshi". The same is true for "Mencius": existing translation work can correctly recognize it and translate it as "Mengzi".

However, just as when "Chiang Kai-shek" and "Mencius" were first translated, many documents to be translated still contain, for historical reasons, a large number of similar specific words. When such a word is translated for the first time, the translator may still make mistakes because there is no reference; meanwhile, existing translation corpora and translation tools cannot anticipate such cases in advance. In this situation, the method of the present invention is still needed to keep recognizing specific words during the translation process.

For an identified specific word, it can be judged whether an accurate translation already exists. For example, a specific-vocabulary corpus can be established to save the existing translations of specific words, and new specific words can be added continuously so that the specific-vocabulary translation corpus stays up to date.
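
A continuously updated specific-vocabulary corpus could be sketched as a simple in-memory mapping. This is an illustrative assumption, not a structure described in the source: the class `SpecificVocabularyCorpus` and its methods are hypothetical, and a real system would presumably persist the data rather than keep it in a dictionary.

```python
from typing import Dict, List, Optional


class SpecificVocabularyCorpus:
    """Maps each recognized specific word to its translation, if known."""

    def __init__(self) -> None:
        self._translations: Dict[str, Optional[str]] = {}

    def lookup(self, word: str) -> Optional[str]:
        """Return the stored translation, or None if none exists yet."""
        return self._translations.get(word)

    def register(self, word: str, translation: Optional[str] = None) -> None:
        """Add a newly recognized word, or record/replace its translation.

        Registering a known word without a translation leaves the stored
        translation untouched.
        """
        if translation is not None or word not in self._translations:
            self._translations[word] = translation

    def pending(self) -> List[str]:
        """Words recognized as specific but still lacking a translation."""
        return sorted(w for w, t in self._translations.items() if t is None)
```

The `pending` list is one way to surface words that were recognized but still need a translator's attention before the corpus can serve them automatically.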

Therefore, with the method and system of the invention, most of the specific unconventional words that appear during translation can be accurately recognized, and recognition and output can be automated using computer software and/or hardware. The invention avoids translation errors involving such special vocabulary and improves the accuracy of translation work. In addition, an unconventional-vocabulary library can be built up step by step during translation and continuously enriched through the recognition process; with this constantly updated library, fully automatic translation of all texts to be translated, including those containing unconventional vocabulary, can ultimately be achieved.

Drawings

Fig. 1 is a flow chart of the identification method of the present invention.

Fig. 2 is a block diagram of the identification system of the present invention.

Detailed Description

Referring to fig. 1, the proposed identification method of the present invention comprises the following steps:

S1: segmenting the file to be translated and identifying the nouns in it;

S2: judging whether the current noun contains Latin letters; if not, the noun does not need to be stored, and the next noun is judged; otherwise, going to step S3;

S3: judging whether the noun conforms to the Chinese pinyin scheme; if so, the noun does not need to be stored, and the next noun is judged; otherwise, going to step S4;

S4: storing all the recognized nouns in an ordered list according to the position sequence of the nouns in the file to be translated;

S5: reading the nouns in the ordered list in sequence;

S6: segmenting the noun in units of bytes to obtain a plurality of characteristic fields;

S7: determining whether at least one of the plurality of characteristic fields satisfies a predetermined condition; if so, outputting the noun as a special word; otherwise, reading the next noun and continuing the judgment until all nouns in the ordered list have been examined.
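
The flow of steps S2 to S7 can be sketched end to end as follows, taking the nouns from step S1 as input. This is a simplified sketch under assumptions: the function name `identify_special_words` is hypothetical, the pinyin test is supplied by the caller, the "byte" unit is approximated by one character, and only a hyphen and two apostrophe forms are treated as the marks that satisfy the predetermined condition.

```python
import re
from typing import Callable, Iterable, List

# Hypothetical set of marks that trigger the predetermined condition.
MARKS = {"-", "'", "\u2019"}


def identify_special_words(
    nouns: Iterable[str], is_pinyin: Callable[[str], bool]
) -> List[str]:
    """Run steps S2 to S7 over nouns already identified in step S1."""
    ordered: List[str] = []
    for noun in nouns:
        if not re.search(r"[A-Za-z]", noun):
            continue  # S2: no Latin letters, not stored
        if is_pinyin(noun):
            continue  # S3: ordinary pinyin, not stored
        ordered.append(noun)  # S4: store in document order

    special: List[str] = []
    for noun in ordered:  # S5: read in sequence
        fields = list(noun)  # S6: one field per character (assumption)
        letters = sum(ch.isalpha() for ch in fields)
        has_mark = any(ch in MARKS for ch in fields)
        if letters >= 2 and has_mark:  # S7: predetermined condition
            special.append(noun)
    return special
```

A noun such as "Kwangchow" passes S2 to S4 but is not output at S7 in this sketch; it would be picked up only by the further space and pinyin determination described earlier.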

The steps shown in Fig. 1 are only one specific implementation of the method of the present invention. In practice, the order of steps S2 and S3 may be reversed; step S3 or step S2 may instead be performed after step S4 of the present sequence; likewise, S2 or S3 may be performed after the determination result of step S7 is negative. Those skilled in the art will appreciate that the steps described above may be performed separately or in combination, as long as the specific vocabulary is ultimately recognized in accordance with the predetermined criteria.

For example, the method of the present invention may omit the determination of step S3 at first and, on reaching the case where none of the plurality of feature fields satisfies the predetermined condition, continue with the identification steps described above: judging whether the characteristic fields contain spaces, and then checking the string, or the strings before and after the space, against the Chinese pinyin scheme.

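Fig. 2 shows an identification system of the present invention, which includes the modules described above: the recognition module, the preprocessing module, the storage module, and the semantic analysis module with its byte segmentation module, judgment module and result output module.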

In summary, with the method and system of the invention, most of the specific unconventional words that appear during translation can be accurately recognized, and recognition and output can be automated using computer software and/or hardware. The invention avoids translation errors like those described in the Background and improves the accuracy of translation work; in addition, an unconventional-vocabulary library can be built up step by step during translation and continuously enriched through the recognition process; with this constantly updated library, fully automatic translation of all texts to be translated, including those containing unconventional vocabulary, can ultimately be achieved.

Claims

1. A method for recognizing a specific word in a file to be translated, wherein the specific word comprises at least one Latin letter, and the recognition method is used for segmenting the file to be translated and recognizing a noun in the file to be translated;

the method is characterized in that: the method comprises the following steps:

storing all the recognized nouns in an ordered list according to the position sequence of the nouns in the file to be translated;

reading nouns in the ordered list in sequence, and performing semantic analysis on the nouns to determine whether the nouns belong to a specific vocabulary;

performing semantic analysis on the noun to determine whether the noun belongs to a specific vocabulary, specifically including:

the noun is segmented by taking bytes as a unit to obtain a plurality of characteristic fields;

determining that the noun belongs to a specific vocabulary if at least one of the plurality of feature fields satisfies a predetermined condition;

the plurality of characteristic fields obtained by segmenting the noun by taking bytes as a unit are composed of one or more of the following fields: latin letters, spaces, additional symbols and connectors;

the satisfaction of the predetermined condition means that at least one of the following conditions is satisfied:

the plurality of characteristic fields comprise a plurality of Latin letters and at least one additional symbol, and the additional symbol is positioned at the upper part or the upper right corner of at least one Latin letter;

if none of the plurality of characteristic fields meets the predetermined condition, continuing the following identification steps:

(41) judging whether the characteristic fields contain spaces or not;

(42) if no space is contained, judging whether the string formed by the plurality of characteristic fields satisfies a Chinese pinyin scheme; if not, determining that the noun belongs to a specific vocabulary;

(43) if a space is contained, judging whether at least one of the two strings formed by the characteristic fields before and after the space fails to satisfy the Chinese pinyin scheme, and if so, determining that the noun belongs to a specific vocabulary.

2. The method of claim 1, wherein:

the storing of all the recognized nouns in an ordered list according to the position sequence of the nouns in the file to be translated further comprises the preprocessing step of: judging whether the noun contains Latin letters, and if not, the noun does not need to be stored.

3. The method of claim 2, wherein:

judging whether the noun contains Latin letters; if so, further judging whether the noun conforms to the Chinese pinyin scheme, and if it does, the noun does not need to be stored.

4. A specific vocabulary recognition system is used for recognizing a specific vocabulary in a file to be translated, wherein the specific vocabulary comprises at least one Latin letter;

the system comprises the following modules:

the semantic analysis module comprises a byte segmentation module, a judgment module and a result output module, wherein the byte segmentation module segments the noun by taking bytes as units to obtain a plurality of characteristic fields;

the result output module outputs the recognition result of the vocabulary according to the judgment module;

the byte segmentation module is used for segmenting the noun by taking bytes as a unit to obtain a plurality of characteristic fields, and the characteristic fields comprise one or more of the following fields: latin letters, spaces, additional symbols and connectors; the satisfaction of the predetermined condition means that at least one of the following conditions is satisfied:

(41) judging whether the characteristic fields contain spaces or not;

5. A computer-readable storage medium having computer-readable stored instructions stored thereon, the instructions being executable by a memory and a processor for implementing the method of any one of claims 1-3.