CN111178061A

CN111178061A - Multi-lingual word segmentation method based on code conversion

Info

Publication number: CN111178061A
Application number: CN201911324149.4A
Authority: CN
Inventors: 杜权; 徐萍; 朱靖波; 肖桐; 张春良
Original assignee: Shenyang Yaze Network Technology Co ltd
Current assignee: Shenyang Yaze Network Technology Co ltd
Priority date: 2019-12-20
Filing date: 2019-12-20
Publication date: 2020-05-19
Anticipated expiration: 2039-12-20
Also published as: CN111178061B

Abstract

The invention discloses a multilingual word segmentation method based on code conversion, comprising the following steps: 1) data preprocessing: inputting the data to be segmented and a language tag, filtering redundant spaces in the data, and converting the data to UTF-8 encoding; 2) loading a code conversion file: loading the code conversion resource file of the corresponding language according to the language tag input in step 1); 3) code conversion: performing code conversion on the data using the code conversion resource file loaded in step 2); 4) word segmentation: segmenting the code-converted data into words using symbols such as punctuation marks and spaces. The method accommodates the different encoding characteristics of multiple languages, performs targeted analysis and code conversion according to the characteristics of each language, and thereby allows a single word segmentation method to segment multiple languages.

Description

Multi-lingual word segmentation method based on code conversion

Technical Field

The invention relates to a word segmentation method in language processing, in particular to a multi-national-language word segmentation method based on code conversion.

Background

Language is the medium through which humans exchange ideas and is their most important communication tool. It arose and developed together with human society, and it inevitably influences politics, economics, science and technology, and culture. Currently, 5651 languages have been identified in the world, distributed across its different regions.

Based on the common features and the origins of the sounds, grammar and vocabulary of each language, linguists divide the world's languages into a number of language families, each comprising a number of languages. These families and their languages are distributed over particular regions, and many cultural features are closely tied to them.

The word is the smallest unit of a language that can be used independently, and a machine translation system generally takes the word as its basic unit of analysis, so an effective, high-quality word segmentation module is crucial to such a system.

Every language has its own characteristics. From the standpoint of word segmentation, languages can be roughly divided into two types. One type consists of isolating or agglutinative languages such as Chinese and Japanese. The other consists of most Western languages, chiefly English, in which words are delimited by spaces; these are called inflected languages. In the text of an inflected language, the spaces between words mark the word boundaries, so a complete sentence can be split into a sequence of words by using the space as the split mark. Most Western languages are therefore segmented using the space as the segmentation mark.

In the 19th century, European scholars studied nearly one hundred of the world's languages and found correspondences and similarities among the sounds, vocabularies and grammar rules of some of them; they classified such languages into one class, called languages of the same group. Because correspondences also exist between different language groups, these were in turn gathered into languages of the same family; this is the genealogical relationship of languages. In the 20th century, linguists further divided the world's languages into language families such as the Hindu language family, the Tibetan language family, the Cantonese language family, and so on. However, the world's languages are of many types and fall into different language families. Each family has its own characteristics, languages within the same family differ from one another in many ways, and some languages have several encodings and ways of writing. For example:

1) Vietnamese has two encodings: in one a letter is a single independent character, and in the other it is formed by combining two characters.

2) Arabic script has several representations, such as Arabic, Arabic form A and Arabic form B, and Persian data contains characters encoded both as ordinary Arabic characters and as Arabic form B characters.

3) Bulgarian belongs to the South Slavic branch of the Indo-European language family and is written in Cyrillic letters, but Bulgarian data is often mixed with a large number of Latin letters that need to be converted.

As these cases show, a single word segmentation approach cannot accommodate the features of every language, and it is difficult to segment all languages in the same way. Yet languages are so numerous that designing a dedicated segmentation approach for each one is too cumbersome to be practical. The different languages therefore need to be studied and analyzed, the data needs to be preprocessed by targeted code conversion according to the features of each language, and the words can then be segmented uniformly.

Unicode is an encoding scheme created to overcome the limitations of traditional character set encodings. It assigns a unique binary code to every character of every language so as to support cross-language and cross-platform text conversion and processing. Because Unicode has only one character set, it avoids the ambiguity of double-byte character sets, and it is now widely used for information exchange worldwide. In Unicode, each character block, such as Greek, Cyrillic or Armenian, has its own code range under the same standard, and every character falls within a specific code interval. FIG. 1 shows the Unicode code intervals of some languages.
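The idea that each script occupies its own code interval can be illustrated with a short Python sketch. The ranges below are taken from the Unicode standard rather than from FIG. 1, and the names `BLOCKS` and `block_of` are hypothetical helpers introduced only for this illustration.

```python
# A few Unicode block ranges (inclusive), listed here only for illustration.
BLOCKS = [
    ("Basic Latin", 0x0000, 0x007F),
    ("Greek and Coptic", 0x0370, 0x03FF),
    ("Cyrillic", 0x0400, 0x04FF),
    ("Armenian", 0x0530, 0x058F),
    ("Arabic", 0x0600, 0x06FF),
    ("Latin Extended Additional", 0x1E00, 0x1EFF),
    ("Arabic Presentation Forms-B", 0xFE70, 0xFEFF),
]


def block_of(ch: str) -> str:
    """Return the name of the listed block containing ch, or "other"."""
    cp = ord(ch)
    for name, low, high in BLOCKS:
        if low <= cp <= high:
            return name
    return "other"
```

A check of this kind is what lets a converter decide, character by character, whether a character comes from the interval expected for the language being processed.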

Each of the three languages mentioned above has its own code interval, but noise is occasionally mixed into the data: in some languages one character is split into a combination of two characters, as in Vietnamese; in some, the characters of one language come from two different code intervals, as in Persian; and in some, the data contains characters from other code intervals, as in Bulgarian.

A multilingual word segmentation method based on code conversion can unify the different encodings of the same language and consolidate its different representations and ways of writing. This effectively reduces the vocabulary size of the training data and alleviates training data sparsity, which improves the quality of word segmentation results in machine translation and hence the quality of the translated text.

At present, no multilingual word segmentation method based on code conversion that meets these requirements has been reported.

Disclosure of Invention

Existing word segmentation methods for multiple languages mainly split on spaces and punctuation, so they struggle to segment languages that have several encodings and cannot deliver high-quality segmentation results. To address these defects, the invention provides a word segmentation method based on code conversion that supports conversion of multilingual data across multiple code intervals.

In order to solve the technical problems, the invention adopts the following technical scheme:

The invention relates to a multilingual word segmentation method based on code conversion, comprising the following steps:

1) data preprocessing: inputting the data to be segmented and a language tag, filtering redundant spaces in the data, and converting the data to UTF-8 encoding;

2) loading a code conversion file: loading the code conversion resource file of the corresponding language according to the language tag input in step 1);

3) code conversion: performing code conversion on the data using the code conversion resource file loaded in step 2);

4) word segmentation: segmenting the code-converted data into words using symbols such as punctuation marks and spaces.
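A minimal Python sketch of the four steps is given here for orientation. It is an illustration under stated assumptions, not the patented implementation: the `RESOURCES` dictionary stands in for the per-language resource files, whose actual format the text does not specify, and all function names are hypothetical.

```python
import re
from typing import Dict, List

# Hypothetical in-memory stand-in for the per-language resource files.
RESOURCES: Dict[str, Dict[str, str]] = {
    "bg": {"k": "\u043a", "a": "\u0430"},  # Latin look-alikes -> Cyrillic
}


def preprocess(raw: bytes, source_encoding: str = "utf-8") -> str:
    """Step 1: decode the input and collapse redundant whitespace."""
    text = raw.decode(source_encoding)
    return re.sub(r"\s+", " ", text).strip()


def load_table(lang: str) -> Dict[str, str]:
    """Step 2: pick the conversion table for the given language tag."""
    return RESOURCES.get(lang, {})


def convert(text: str, table: Dict[str, str]) -> str:
    """Step 3: replace every character that the table lists."""
    return "".join(table.get(ch, ch) for ch in text)


def tokenize(text: str) -> List[str]:
    """Step 4: split into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


def segment(raw: bytes, lang: str, source_encoding: str = "utf-8") -> List[str]:
    text = preprocess(raw, source_encoding)
    return tokenize(convert(text, load_table(lang)))
```

Keeping the conversion table separate from the tokenizer means that supporting another language only requires another table, which matches the role the text gives to the resource files.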

In step 2), the code conversion resource files are prepared by analyzing the characteristics of each language and distinguishing its ways of writing and usage habits; the code conversion file corresponding to a language is loaded for processing according to that language's code intervals and conversion requirements.

For Vietnamese, the syllable-carrying characters in the data have two ways of writing, as single characters and as combined characters. The Vietnamese data is code-converted before word segmentation so that characters in the non-standard encoding are uniformly converted into the corresponding standard-encoded characters; the resource file covering the two ways of writing is loaded according to the correspondence between the standard and non-standard encodings of the same character.

For Persian, most characters in the data are Arabic characters and a few words consist of characters encoded in Arabic form B. The conversion rules between Arabic characters and Arabic form B characters are loaded, the Arabic form B characters in the Persian data are converted into ordinary Arabic characters, and subsequent word segmentation and machine translation training are then performed, so that Persian data from multiple code intervals can be translated together.

For Bulgarian, the language is written in Cyrillic letters and distinguishes upper and lower case. Some letters look like Latin letters but have different codes, and Bulgarian data is mixed with a large number of Latin letters used in place of Cyrillic letters. A resource file containing the Cyrillic letters of Bulgarian, the confusable Latin letters corresponding to them, and the correspondence between the two is loaded, and the Bulgarian data is converted according to this resource file.

In step 3), the code conversion file of each language is loaded and the data is code-converted, specifically:

301) inputting the language data to be processed and its language tag;

302) reading the resource file corresponding to the language and loading it into memory;

303) traversing the characters of the language data sentence by sentence, and judging whether the current character needs code conversion;

304) if code conversion is needed, converting the character according to the conversion rules of the language and the corresponding resource file;

305) outputting the code-converted sentence to step 4).

In step 303), if no code conversion is needed, the next character is examined, and so on until all characters in the data have been traversed; the process then goes to step 305).
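Steps 302) to 304) can be sketched in Python as follows. The text does not describe the layout of a resource file, so the tab-separated "source, tab, target" format, the `#` comment convention and the function names `load_resource` and `transcode` are all assumptions made for this sketch.

```python
import io
from typing import Dict, TextIO


def load_resource(fileobj: TextIO) -> Dict[str, str]:
    """Step 302: read 'source<TAB>target' lines into an in-memory dict."""
    table: Dict[str, str] = {}
    for line in fileobj:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        source, target = line.split("\t")
        table[source] = target
    return table


def transcode(sentence: str, table: Dict[str, str]) -> str:
    """Steps 303-304: visit each character; convert it only if the table lists it."""
    out = []
    for ch in sentence:
        out.append(table[ch] if ch in table else ch)
    return "".join(out)


# Usage with an in-memory file; a real file opened with encoding="utf-8" works the same way.
demo_table = load_resource(io.StringIO("# demo\nk\t\u043a\n"))
converted = transcode("kok", demo_table)
```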

The invention has the following beneficial effects and advantages:

1. The multilingual word segmentation method based on code conversion accommodates the different encoding characteristics of multiple languages, performs targeted analysis and code conversion according to the characteristics of each language, and allows a single word segmentation method to segment multiple languages.

2. The method analyzes the characteristics of different languages and code-converts multilingual data. It solves the problem of one language's data having several encodings and at the same time filters wrongly encoded data, thereby improving the quality of the multilingual data.

3. The code conversion method effectively reduces data sparsity and improves data quality. It also reduces the vocabulary size of the data in subsequent machine translation training, thereby effectively improving translation quality.

Drawings

FIG. 1 is a table of the code intervals of some languages;

FIG. 2 is a table of some of the language tags used by the method of the present invention;

FIG. 3 is a general flowchart of the multilingual word segmentation method based on code conversion according to the present invention.

Detailed Description

The invention is further elucidated with reference to the accompanying drawings.

The invention applies Unicode code conversion to data in which one language is represented in several encodings, standardizing those encodings so that the data is easier for a word segmentation system to process. Each language in the Unicode code set has a code interval with a specific range, which effectively avoids character set ambiguity; the code intervals of some languages are shown in FIG. 1.

FIG. 3 is a general flowchart of the multilingual word segmentation method based on code conversion. The method specifically includes the following steps:

In step 1), each language has its own language tag, which consists of 2 to 3 English letters and identifies the language. The language tag is input into the word segmentation system so that the system can identify the language and carry out the subsequent code conversion; some of the language tags are shown in FIG. 2.

In the following, Vietnamese, Persian and Bulgarian are taken as examples; the characteristics of the three languages are as follows:

Step 201) Vietnamese is a tonal language that uses tones to distinguish word meanings and is written in Latin letters. The letters are divided into syllable-carrying letters and non-syllable-carrying letters. A standard syllable-carrying letter is a single character, but it can also be written by combining a non-syllable-carrying character with a pronunciation symbol, so syllable-carrying characters have two ways of writing in Vietnamese data.

Take as an example the letter whose standard Unicode code is 1EAC. It is an independent character belonging to the Latin Extended Additional block of the Unicode code set. Vietnamese data also contains a second way of writing the same letter, as the combination of a character and a pronunciation symbol, which is equivalent to two characters in the Unicode code set. The two ways of writing are indistinguishable to a reader of Vietnamese, but when a word segmentation system segments the data, the non-standard form can be split into two characters. This changes the content of the original data so that it loses its meaning, and the quality of the data is reduced. Therefore, before word segmentation, the Vietnamese data needs to be code-converted so that characters in the non-standard encoding are uniformly converted into the corresponding standard-encoded characters.

Vietnamese has 195 letters, most of which are syllable-carrying characters that can also be written as the combination of an ordinary character and a pronunciation symbol. Vietnamese therefore needs a resource file covering the two ways of writing, loaded according to the correspondence between the standard and non-standard encodings of the same character.
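The effect of the two ways of writing on a tokenizer can be reproduced with Python's standard `unicodedata` module. In this sketch, Unicode NFC normalization is used as a stand-in for the resource file described above; the method itself specifies a lookup table, not NFC, and the `naive_tokens` helper is a hypothetical tokenizer introduced only for the demonstration.

```python
import re
import unicodedata

precomposed = "\u1eac"  # a single code point in Latin Extended Additional
# Decompose it into a base letter followed by combining marks.
decomposed = unicodedata.normalize("NFD", precomposed)


def naive_tokens(text):
    """Split into runs of word characters and single non-word symbols."""
    return re.findall(r"\w+|[^\w\s]", text)


# The combining marks are not word characters, so the decomposed form is
# broken apart, while the recomposed form stays in one piece.
broken = naive_tokens(decomposed)
repaired = naive_tokens(unicodedata.normalize("NFC", decomposed))
```

Both strings render identically, yet only the single-character form survives tokenization intact, which is the reason for converting before segmenting.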

Step 202) Persian is written with 28 Arabic letters and 4 newly created Persian letters, and the language contains a large number of Arabic loanwords. The Unicode code set has three blocks containing Arabic letters: Arabic, Arabic form A and Arabic form B. The Arabic letters lie in the code interval 0x0600 to 0x06FF. Owing to the writing conventions of Arabic script, the same letter takes different shapes at different positions in a word, so the Unicode code set also defines Arabic form A and Arabic form B to specify these other representations of Arabic letters. Arabic form B defines the contextual shaping rules and display characters used for Persian, and its code interval is 0xFE70 to 0xFEFF.

In existing Persian data, most characters are Arabic characters, and a few words consist of characters encoded in Arabic form B. Because the two kinds of Persian data come from different code intervals, using both for machine translation training causes data sparsity and an oversized vocabulary, which harms translation quality. Moreover, Persian data in the Arabic form B interval is scarce, and its small share of the training data affects the performance of the machine translation system. For the sake of communication among peoples and languages and of protecting human linguistic culture, Persian data encoded in Arabic form B cannot simply be discarded. It is therefore necessary to load the conversion rules between Arabic characters and Arabic form B characters in Persian, convert the Arabic form B characters in the Persian data into ordinary Arabic characters, and then perform word segmentation and machine translation training, so that Persian data from multiple code intervals can be translated together.

Step 203) Bulgarian developed through three stages: Old Bulgarian, Middle Bulgarian and Modern Bulgarian. Modern Bulgarian is written with 30 Cyrillic letters and distinguishes upper and lower case; some of its letters look like Latin letters but are not the same characters.

For example, the Bulgarian letter "В" and the Latin letter "B" look the same when written, but the Unicode code of the Bulgarian letter "В" is 0x412 whereas that of the Latin letter "B" is 0x42; they are two completely different letters. The same holds for a number of other Bulgarian letters.
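The difference between the two look-alike letters can be checked directly in Python; the variable names are illustrative only.

```python
import unicodedata

cyrillic_ve = "\u0412"  # Bulgarian (Cyrillic) letter that looks like Latin B
latin_b = "\u0042"      # Latin capital letter B

# Print each letter's code and its official Unicode character name.
for ch in (cyrillic_ve, latin_b):
    print(hex(ord(ch)), unicodedata.name(ch))

same_character = cyrillic_ve == latin_b  # False: the glyphs match, the codes do not
```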

Machine translation training data comes from many sources, and in the massive Bulgarian data crawled from the Internet a large number of Latin letters appear in place of the Cyrillic letters of Bulgarian. To retain genuine Bulgarian data and reduce variation in the data, the Bulgarian data can be cleaned before word segmentation by converting the Latin letters in it into the corresponding Cyrillic letters.

The code conversion resource file for Bulgarian needs to contain the Cyrillic letters of Bulgarian, the confusable Latin letters corresponding to them, and the correspondence between the two; the Bulgarian data is converted according to this resource file.

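In step 3), based on the analysis of the three languages in step 2), the code conversion file of each language is loaded and the data is code-converted, specifically:

301) inputting the language data to be processed and its language tag;

302) reading the resource file corresponding to the language and loading it into memory;

303) traversing the characters of the language data sentence by sentence, and judging whether the current character needs code conversion;

304) if code conversion is needed, converting the character according to the conversion rules of the language and the corresponding resource file;

305) outputting the code-converted sentence to step 4).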

In the following, the specific conversion methods are described taking Vietnamese, Persian and Bulgarian as examples.

The code conversion file loaded for Vietnamese contains the standard Unicode encoding and the non-standard way of writing, in one-to-one correspondence. The specific code conversion steps are as follows:

a. inputting the data and the language tag, the tag of Vietnamese being vi; the example sentence is a Vietnamese sentence meaning "the United States withdraws from the Iran nuclear agreement";

b. reading the Vietnamese resource file and loading its contents into memory as a key-value dictionary named Vi_dict: each key of the dictionary is a non-standard combination with a pronunciation symbol, and the corresponding value is the standard Unicode character;

c. traversing each character in the sentence and judging whether the current character is a pronunciation symbol: if it is not, continuing to the next character; if it is, going to step d; after the whole sentence has been traversed, going to step e;

d. combining the previous character with the current pronunciation symbol and judging whether the combination exists in Vi_dict: if it exists, replacing the combination with the corresponding standard-encoded character and returning to step c to traverse the next character; if it does not exist in Vi_dict, returning directly to step c. The pronunciation-symbol combinations in the example sentence that need conversion are found in the dictionary Vi_dict and are replaced with the corresponding standard Unicode characters;

e. returning the code-converted sentence.
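Steps a to e for Vietnamese can be sketched in Python as follows. The three entries of `Vi_dict` are a hypothetical excerpt written for this sketch, not the contents of the actual resource file, and treating a "pronunciation symbol" as a Unicode combining mark is an assumption.

```python
import unicodedata

# Hypothetical excerpt of Vi_dict: key = non-standard combination,
# value = standard single character.
Vi_dict = {
    "a\u0301": "\u00e1",        # a + combining acute accent
    "e\u0302": "\u00ea",        # e + combining circumflex
    "\u00ea\u0323": "\u1ec7",   # e-circumflex + combining dot below
}


def is_pronunciation_symbol(ch):
    # Assumption: a pronunciation symbol is a Unicode combining mark.
    return unicodedata.combining(ch) != 0


def convert_vi(sentence):
    out = []
    for ch in sentence:                      # step c: traverse the sentence
        if is_pronunciation_symbol(ch) and out:
            combined = out[-1] + ch          # step d: previous char + symbol
            if combined in Vi_dict:
                out[-1] = Vi_dict[combined]  # replace with the standard char
                continue
        out.append(ch)                       # not convertible: keep as is
    return "".join(out)                      # step e: return the sentence
```

For example, `convert_vi("Vie\u0302\u0323t")` folds the two marks into the preceding letter one after the other, so the five-character input becomes a four-character word.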

The resource file for Persian contains the Arabic characters and the Arabic form B characters. The specific conversion steps are as follows:

a. inputting the data and the language tag, the tag of Persian being fa; the example sentence is a Persian sentence meaning "the day that shakes violently in the sky";

b. reading the Persian resource file. A Persian word is generally built from 3 root letters, and new words are formed by adding prefixes or suffixes, by changing the internal phonemes of the word, or by inserting other phonemes; accordingly, in the resource file each Arabic character corresponds to a group of four Arabic form B characters, and these correspondences are placed in the structure Fa_map;

c. traversing the characters of the Persian data sentence by sentence and judging whether each character lies in the code interval of Arabic form B in the Unicode standard code set: if the current character is not an Arabic form B character, continuing to the next character; if it lies in the Arabic form B interval, going to step d; after the whole sentence has been traversed, going to step e;

d. searching all characters in Fa_map and judging whether the current character exists in Fa_map: if not, returning to step c to continue with the next character; if so, converting the character into its corresponding Arabic character and then returning to step c;

e. returning the code-converted sentence.
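The Persian steps can be sketched in Python as follows. Because the actual resource file is not reproduced in the text, `Fa_map` is populated here from the compatibility mappings in Python's `unicodedata` module as a stand-in; that choice, and the names `build_fa_map` and `convert_fa`, are assumptions of this sketch.

```python
import unicodedata

# Code interval of the Arabic Presentation Forms-B block ("Arabic form B").
FORM_B_START, FORM_B_END = 0xFE70, 0xFEFF


def build_fa_map():
    """Hypothetical stand-in for loading the Persian resource file."""
    fa_map = {}
    for cp in range(FORM_B_START, FORM_B_END + 1):
        parts = unicodedata.decomposition(chr(cp)).split()
        # Keep only one-to-one entries such as "<initial> 0628".
        if len(parts) == 2 and parts[0].startswith("<"):
            fa_map[chr(cp)] = chr(int(parts[1], 16))
    return fa_map


Fa_map = build_fa_map()


def convert_fa(sentence):
    out = []
    for ch in sentence:
        in_form_b = FORM_B_START <= ord(ch) <= FORM_B_END    # step c
        out.append(Fa_map.get(ch, ch) if in_form_b else ch)  # step d
    return "".join(out)                                      # step e
```

The range test in step c is cheap and skips the dictionary lookup for the ordinary Arabic characters that make up most of the data.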

Code conversion for Bulgarian converts the Latin letters in Bulgarian data into Cyrillic letters, so the resource file contains the correspondence between Latin letters and the Cyrillic letters of Bulgarian. The specific conversion steps are as follows:

a. inputting the data and the language tag, the tag of Bulgarian being bg; the example sentence is as follows:

“Toйkaзa:"He mиcлr,чe иma пpoблem,kaпитaлoвиrт пaзap e mнoгo kooпepaтивeн.”

(Translation: He said: "I think there is no problem; the capital market is very cooperative.")

b. loading the Bulgarian resource file, and loading the Cyrillic letters of Bulgarian and their corresponding Latin letters into the dictionary Bg_dict as key-value pairs, where the Latin letters are the keys and the Cyrillic letters are the values;

c. in the example sentence, letters such as "k", "T", "a", "o" and "r" are Latin letters, and they need to be converted according to Bg_dict into the corresponding Cyrillic letters "к", "Т", "а", "о" and "я";

d. the example sentence after code conversion is “Тойказа:"Не мисля,че има проблем,капиталовият пазар е много кооперативен.”
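The Bulgarian steps amount to a character-for-character substitution, sketched below in Python. The first five entries of `Bg_dict` follow the pairs named in step c; the remaining entries are further look-alikes assumed for this sketch, and the real resource file may differ.

```python
# Keys are Latin letters, values are the Cyrillic letters they stand in for.
Bg_dict = {
    "k": "\u043a", "T": "\u0422", "a": "\u0430", "o": "\u043e", "r": "\u044f",
    # Assumed additional look-alikes, not listed in the text:
    "m": "\u043c", "e": "\u0435", "p": "\u0440", "H": "\u041d", "c": "\u0441",
}

# str.maketrans turns the one-character mapping into a translation table.
_BG_TABLE = str.maketrans(Bg_dict)


def convert_bg(sentence):
    """Replace every Latin look-alike with its Cyrillic counterpart."""
    return sentence.translate(_BG_TABLE)
```

Characters that are already Cyrillic, as well as punctuation and spaces, are absent from the table and pass through unchanged.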

The multilingual word segmentation method based on code conversion analyzes and converts data according to the usage characteristics of each language, so that a user can segment multiple languages with the same word segmentation method. The method is flexible and simple, can be embedded easily into the word segmentation process, and switches conveniently between languages. It supports simultaneous conversion of multiple languages, ensures encoding consistency for languages that have several encodings, effectively improves data quality, reduces data sparsity in neural machine translation training, and improves the quality of neural machine translation output.

Claims

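1. A multilingual word segmentation method based on code conversion, characterized by comprising the following steps:

1) data preprocessing: inputting the data to be segmented and a language tag, filtering redundant spaces in the data, and converting the data to UTF-8 encoding;

2) loading a code conversion file: loading the code conversion resource file of the corresponding language according to the language tag input in step 1);

3) code conversion: performing code conversion on the data using the code conversion resource file loaded in step 2);

4) word segmentation: segmenting the code-converted data into words using symbols such as punctuation marks and spaces.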

2. The multilingual word segmentation method based on code conversion of claim 1, wherein: in step 2), the code conversion resource files are prepared by analyzing the characteristics of each language and distinguishing its ways of writing and usage habits, and the code conversion file corresponding to a language is loaded for processing according to that language's code intervals and conversion requirements.

3. The multilingual word segmentation method based on code conversion of claim 2, wherein: for Vietnamese, the syllable-carrying characters in the data have two ways of writing, as single characters and as combined characters; the Vietnamese data is code-converted before word segmentation so that characters in the non-standard encoding are uniformly converted into the corresponding standard-encoded characters; and the resource file covering the two ways of writing is loaded according to the correspondence between the standard and non-standard encodings of the same character.

4. The multilingual word segmentation method based on code conversion of claim 2, wherein: for Persian, most characters in the data are Arabic characters and a few words consist of characters encoded in Arabic form B; the conversion rules between Arabic characters and Arabic form B characters are loaded, the Arabic form B characters in the Persian data are converted into ordinary Arabic characters, and subsequent word segmentation and machine translation training are then performed, so that Persian data from multiple code intervals can be translated together.

5. The multilingual word segmentation method based on code conversion of claim 2, wherein: for Bulgarian, the language is written in Cyrillic letters and distinguishes upper and lower case, some letters look like Latin letters but have different codes, and Bulgarian data is mixed with a large number of Latin letters used in place of Cyrillic letters; a resource file containing the Cyrillic letters of Bulgarian, the confusable Latin letters corresponding to them, and the correspondence between the two is loaded, and the Bulgarian data is converted according to this resource file.

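6. The multilingual word segmentation method based on code conversion of claim 1, wherein in step 3), the code conversion file of each language is loaded and the data is code-converted, specifically:

301) inputting the language data to be processed and its language tag;

302) reading the resource file corresponding to the language and loading it into memory;

303) traversing the characters of the language data sentence by sentence, and judging whether the current character needs code conversion;

304) if code conversion is needed, converting the character according to the conversion rules of the language and the corresponding resource file;

305) outputting the code-converted sentence to step 4).

7. The method of claim 6, wherein in step 303), if code conversion is not required, the next character is examined, and so on until all characters in the data have been traversed, whereupon the process goes to step 305).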