WO2014101629A1

WO2014101629A1 - Name transliteration method based on classification of name origins

Info

Publication number: WO2014101629A1
Application number: PCT/CN2013/088283
Authority: WO
Inventors: 李婷婷; 张春越; 赵铁军; 曹海龙
Original assignee: 哈尔滨工业大学
Priority date: 2012-12-24
Filing date: 2013-12-02
Publication date: 2014-07-03
Also published as: KR20150128656A; CN103020046B; CN103020046A

Abstract

A name transliteration method based on classification of name origins. The method relates to a translation system. By means of the present invention, the problem of inconsistent transliteration mode for different origin country names in Chinese and English name transliteration is solved. The method comprises: 1) classifying name origins; and 2) integrating a linear interpolation system. According to the method provided in the present invention, a logistic multi-classification regression model is applied to the name origin classification, and the name origin classification is carried out according to a characteristic template of a name composition word characteristic; for the name category of each type of origin, one specific transliteration (translation) model is trained; and then the results of multiple transliteration models are integrated to implement bilingual name inter-translation.

Description

Name transliteration method based on classification of name origins

Technical field

The present invention relates to a translation system.

Background art

The Internet has become an indispensable part of people's lives and the most important channel through which information is acquired, exchanged, and disseminated. Every day, people rely on it for the information they need in daily life, work, and research. To deliver information from the Internet's massive data to users faster, more accurately, and more intelligently, technologies such as information retrieval, information extraction, and question answering have become a research focus in recent years. With the revolution in information exchange brought about by the Internet, the exchange and acquisition of information is no longer limited to a single language, and processing Internet information across languages has become an urgent need, particularly in fields such as news and finance. Research on machine translation, cross-language retrieval, and cross-language question answering has therefore become increasingly important. Within this research, the translation of named entities is an important and fundamental problem. Person names, as an important class of named entity, carry a great deal of meaning and are among the key pieces of information in a document. However, because names form an open set, they are often the main source of out-of-vocabulary words in natural language processing and machine translation. Translating names correctly and automatically is therefore meaningful work, and it can also guide human translation.

Names are translated mainly on the basis of similar pronunciation, so the task is also called name transliteration. Research on transliteration began in the 1990s and has accumulated over more than a decade, centering on two kinds of method: phoneme-based and grapheme-based. The former relies on phonetic knowledge; the latter models the correspondence between graphemes directly; the combined use of the two is called hybrid transliteration. Specifically, the phoneme-based method uses a unified phonetic representation as an intermediate pivot (the symbols of this pivot are usually called phonemes) and converts the source language to phonemes and the phonemes to the target language, so it is also called the pivot method or the pronunciation-based transliteration method. The pronunciation-based method requires several conversion steps: grapheme to phoneme, phoneme to phoneme, and phoneme to grapheme. Each step can introduce errors, so errors accumulate. The method also depends on the specific languages involved: each language uses different intermediate pronunciation units and each language pair needs its own phoneme table, so the method does not scale well. To overcome these shortcomings, and inspired by word alignment in machine translation, researchers built transliteration models directly between the graphemes of the source and target languages. Such methods are called direct transliteration or grapheme-based transliteration. Later, some researchers combined the two approaches into hybrid transliteration, in which system fusion methods such as linear interpolation merge the grapheme-based and pronunciation-based transliteration results. Because the grapheme-based method is independent of the specific language pair and performs well, it has become the mainstream approach in transliteration research.

Although researchers have proposed many transliteration methods, among the factors that affect transliteration quality the origin of a name has not yet received enough attention. Take Chinese-English name transliteration as an example. Note that here a Chinese name means a name written in Chinese characters, and an English name means a name written in English letters. For instance, '德川家康' is a name of Japanese origin whose English form is 'Tokugawa Ieyasu', and the Korean-origin name '卢武铉' is rendered in English as 'Roh Moo-hyun'. The transliteration (translation) of such names differs greatly from the usual pronunciation-based Chinese-English transliteration. If name origins are not distinguished, a single model trained directly on all the data cannot translate such names well, and their presence also degrades the transliteration of names of Chinese and English origin. In summary, transliteration based on the classification of name origins is an important research problem.

Technical problem

The purpose of the invention is to solve the problem that names from different countries of origin follow inconsistent transliteration patterns in Chinese-English name transliteration, by providing a name transliteration method based on classification of name origins.

Technical solution

The name transliteration method based on classification of name origins comprises the following steps:

First, classification of name origins:

A logistic regression model is applied to the name origin feature template to calculate:

Formula 1

Formula 2

In Formula 1 and Formula 2, K is 6 and Y takes the values 1 to 6, where 1 denotes China, 2 Anglo-American, 3 Arabic, 4 Russia, 5 Japan, and 6 Korea; x is the name origin feature template, P is the origin probability, and w is the feature weight vector;
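As a reading aid, a common way to write a multinomial logistic regression over K classes, with class K as the reference class, is given below. Whether Formula 1 and Formula 2 use exactly this parameterization is an assumption, and the per-class weight vectors w_k are notation introduced here for illustration.

```latex
P(Y = k \mid x) = \frac{\exp(w_k \cdot x)}{1 + \sum_{j=1}^{K-1} \exp(w_j \cdot x)}, \qquad k = 1, \dots, K-1

P(Y = K \mid x) = \frac{1}{1 + \sum_{j=1}^{K-1} \exp(w_j \cdot x)}
```

Under this form the K probabilities sum to 1 by construction.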

The name origin feature template in step one is either a Chinese name origin feature template or an English name origin feature template;

The Chinese name origin feature template consists of the language model, the character TF-IDF, the length, and the surname;

The language model comprises an integrated 1-gram model, an integrated 2-gram model, and an integrated 3-gram model. The length is the number of Chinese characters. The surname feature is the surname confidence, defined as the number of times the character occurs as a surname divided by its total number of occurrences, and divided into 20 levels according to the quotient.

An integrated n-gram model is one in which, to keep the number of features of this kind from growing too large, the n-gram probability values are divided into intervals numbered 1 to 100 on a minimum-variance basis, giving 100 features. For the Chinese name origin feature template, the language model is trained with the SRILM tool; each n-gram (n = 1, 2, or 3) has a probability, and the one-dimensional distribution of all n-gram probabilities is collected. This distribution is divided into 100 intervals, which amounts to clustering the n-gram features, with each interval representing one category. The intervals are chosen so that the sum of the variances within the intervals is minimal and the variance between the interval means is maximal, and the boundary points of the 100 intervals are found from the n-gram data:

Formula 3

In Formula 3, λ denotes the set of 100 boundary points, x_i denotes the probability value of each n-gram, and y_j denotes the mean of the j-th interval. This yields 300 language-model features.
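The interval division described above can be sketched as a one-dimensional partition that minimizes the total squared deviation of the values from their interval means. The following is a minimal illustration using dynamic programming, not the patented implementation; the function names `min_variance_boundaries` and `interval_index` are hypothetical, and the objective is an assumed reading of Formula 3.

```python
from typing import List


def min_variance_boundaries(values: List[float], k: int) -> List[float]:
    """Split the sorted values into k contiguous intervals that minimise the
    total within-interval squared deviation; return each interval's upper bound."""
    xs = sorted(values)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and the number of values")
    # Prefix sums give the squared deviation of xs[i:j] in constant time.
    s1 = [0.0] * (n + 1)
    s2 = [0.0] * (n + 1)
    for i, x in enumerate(xs):
        s1[i + 1] = s1[i] + x
        s2[i + 1] = s2[i] + x * x

    def cost(i: int, j: int) -> float:
        total = s1[j] - s1[i]
        return (s2[j] - s2[i]) - total * total / (j - i)

    inf = float("inf")
    # best[c][j]: minimal cost of splitting xs[:j] into c intervals.
    best = [[inf] * (n + 1) for _ in range(k + 1)]
    cut = [[0] * (n + 1) for _ in range(k + 1)]
    best[0][0] = 0.0
    for c in range(1, k + 1):
        for j in range(c, n + 1):
            for i in range(c - 1, j):
                cand = best[c - 1][i] + cost(i, j)
                if cand < best[c][j]:
                    best[c][j] = cand
                    cut[c][j] = i
    # Walk back through the cut table to recover the interval ends.
    bounds = []
    j = n
    for c in range(k, 0, -1):
        bounds.append(xs[j - 1])
        j = cut[c][j]
    return bounds[::-1]


def interval_index(value: float, bounds: List[float]) -> int:
    """Map a value to a 1-based interval number; values above the last
    bound are assigned to the last interval."""
    for idx, upper in enumerate(bounds, start=1):
        if value <= upper:
            return idx
    return len(bounds)
```

This sketch runs in time proportional to k times the square of the number of values, so for a full n-gram table a faster one-dimensional clustering routine would be needed in practice.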

The character TF-IDF consists of the TF and the IDF of each character in the name. From the name corpus, the characters commonly used in names are collected together with the frequency of each, giving a common-character table for each of the 6 origin categories; TF and IDF are then calculated with the following two formulas:

Formula 4

Formula 5

In Formula 4 and Formula 5, x denotes the frequency of the i-th character in the training corpus, the denominator is the total number of occurrences in the training corpus of all characters in the character table, N denotes the number of characters in the character table, and DF denotes the number of name origins that contain character i. As with the language model, TF and IDF are each divided into 100 intervals, giving 200 features.
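The TF and IDF quantities defined above can be illustrated with a short sketch. It assumes the conventional forms TF_i = x_i / Σ_j x_j and IDF_i = log(N / DF_i), which may differ from the exact Formula 4 and Formula 5; the function names and the toy corpus are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List


def char_tf(names: List[str]) -> Dict[str, float]:
    """TF of each character within one origin's name list:
    its count divided by the count of all characters."""
    counts = Counter(ch for name in names for ch in name)
    total = sum(counts.values())
    return {ch: c / total for ch, c in counts.items()}


def char_idf(names_by_origin: Dict[str, List[str]]) -> Dict[str, float]:
    """IDF of each character, with DF taken as the number of origins whose
    names contain the character and N as the size of the character table."""
    df: Counter = Counter()
    for names in names_by_origin.values():
        df.update({ch for name in names for ch in name})
    n = len(df)  # N: number of distinct characters in the table
    return {ch: math.log(n / d) for ch, d in df.items()}
```

With a toy corpus such as `{"china": ["王伟", "李伟"], "japan": ["德川", "王川"]}`, the character 王 appears under both origins and therefore receives a lower IDF than 川, which appears under only one.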

The English name origin feature template consists of the character language model, the syllable language model, the syllable TF-IDF, and the length.

The character language model comprises an integrated 2-gram model, an integrated 3-gram model, and an integrated 4-gram model; the syllable language model comprises an integrated 1-gram model, an integrated 2-gram model, and an integrated 3-gram model. An integrated n-gram model keeps the number of features of this kind from growing too large by dividing the n-gram probability values into intervals numbered 1 to 100 on a minimum-variance basis, giving 100 features. The length is the number of characters and the number of syllables. English names are divided into syllables as follows:

1. Replace 'x' with 'ks';

2. {a, o, e, i, u} are the basic vowel characters; 'y' is treated as a vowel when it follows a consonant;

3. When 'w' is preceded by 'a', 'e', or 'o' and is not followed by 'h', the 'w' and the preceding vowel are treated as a new vowel;

4. Except for {iu, eo, io, oi, ia, ui, ua, uo}, consecutive vowels are treated as a new vowel;

5. Separate the consonants that are next to each other and separate the vowels from the consonants that follow;

6. The consonants and the subsequent vowels form a syllable, and the other isolated vowels and consonants are separate syllables;
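The six rules above can be sketched as a small function. This is one possible reading of the rules, not the patented implementation: the name `syllabify` is hypothetical, input is lowercased as an assumption, and a vowel that follows a merged 'w' is assumed to start a new vowel unit, a case the rules do not address.

```python
from typing import List

VOWELS = set("aeiou")
# Vowel pairs that rule 4 keeps apart instead of merging.
SPLIT_PAIRS = {"iu", "eo", "io", "oi", "ia", "ui", "ua", "uo"}


def syllabify(name: str) -> List[str]:
    s = name.lower().replace("x", "ks")                      # rule 1
    units = []  # each unit is [text, is_vowel]
    for i, ch in enumerate(s):
        prev = units[-1] if units else None
        after_consonant = prev is not None and not prev[1]
        if ch in VOWELS or (ch == "y" and after_consonant):  # rule 2
            last = prev[0][-1] if prev and prev[1] else ""
            if last and last != "w" and last + ch not in SPLIT_PAIRS:
                prev[0] += ch                                # rule 4: merge
            else:
                units.append([ch, True])
        elif (ch == "w" and prev and prev[1] and prev[0][-1] in "aeo"
              and s[i + 1:i + 2] != "h"):
            prev[0] += ch                                    # rule 3
        else:
            units.append([ch, False])                        # rule 5
    syllables = []
    i = 0
    while i < len(units):
        text, is_vowel = units[i]
        if not is_vowel and i + 1 < len(units) and units[i + 1][1]:
            syllables.append(text + units[i + 1][0])         # rule 6
            i += 2
        else:
            syllables.append(text)                           # isolated unit
            i += 1
    return syllables
```

For example, under this reading "Smith" is split into s, mi, t, h and "Louis" into lou, i, s.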

The syllable TF-IDF consists of the TF and the IDF of each syllable. From the name corpus, the syllables commonly used in names are collected together with the frequency of each, giving a common-syllable table for each of the 6 origin categories; TF and IDF are then calculated with the following two formulas:

Formula 4

Formula 5

In Formula 4 and Formula 5, x denotes the frequency of the i-th syllable in the training corpus, the denominator is the total number of occurrences in the training corpus of all syllables in the syllable table, N is the number of syllables in the syllable table, and DF is the number of name origins that contain syllable i.

Second, linear interpolation system fusion:

Formula 6

Formula 7

Formula 8

Formula 9

In Formula 7, Formula 8, and Formula 9, T denotes the translation result, P denotes the probability of the translation result T, and t denotes the t-th source-language position reached in translation. In Formula 6, λ_i denotes the probability that the source name S belongs to origin i. Formula 6 is the multi-system fusion strategy; Formulas 7, 8, and 9 are the decoding algorithm.
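For orientation, a linear interpolation of per-origin transliteration models is conventionally written as below. That Formula 6 has exactly this form is an assumption; P_i, the probability assigned by the transliteration model trained for origin i, is notation introduced here.

```latex
P(T \mid S) = \sum_{i=1}^{K} \lambda_i \, P_i(T \mid S), \qquad \sum_{i=1}^{K} \lambda_i = 1
```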

Since names are divided into several categories by origin, a transliteration model can be trained for each category. To make full use of these models, the present invention proposes a strategy based on actual experimental data. For a name to be translated, its origin category is determined first. The user may specify the origin of the name; if the user does not, the system calls the classification model to calculate the probability that the name belongs to each origin category, and then dynamically combines the outputs of several transliteration systems according to the classification results, as shown in Formula 6.

The specific strategy is as follows:

1) If the user specifies the origin of the name, the probability that the name belongs to that origin is 1 and the probability of every other origin is 0;

2) If the user does not specify an origin, the origin classification system is called to obtain the probability of each origin;

3) If the probability that the name belongs to some origin is greater than a value A (A is, naturally, greater than 0.5), the name is sent only to the corresponding transliteration model to obtain the result;

4) Otherwise, the name is sent to every model whose membership probability is greater than B;

5) If the method in 4) is used, the results of the models are linearly interpolated, with each model's weight equal to the probability that the name belongs to its origin. Taking Chinese-English transliteration as an example, the system works well with A near 0.72 and B near 0.15 (these are empirical values and also depend on the training corpus).
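The five-point strategy above can be sketched as follows. This is a minimal illustration under stated assumptions: each transliteration model is represented by a hypothetical callable that maps a source name to candidate translations with probabilities, the names `select_models` and `fuse` are invented for the example, and the fallback used when no origin exceeds B is an assumption not stated in the text.

```python
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical model interface: source name -> {candidate: probability}.
Model = Callable[[str], Dict[str, float]]


def select_models(probs: Dict[str, float], a: float = 0.72,
                  b: float = 0.15) -> Dict[str, float]:
    """Return the origins to decode with, mapped to their interpolation weights."""
    top = max(probs, key=lambda o: probs[o])
    if probs[top] > a:                       # strategy 3): one confident origin
        return {top: probs[top]}
    chosen = {o: p for o, p in probs.items() if p > b}   # strategy 4)
    # Assumption: fall back to the most likely origin if nothing exceeds b.
    return chosen or {top: probs[top]}


def fuse(name: str, models: Dict[str, Model], probs: Dict[str, float],
         user_origin: Optional[str] = None,
         a: float = 0.72, b: float = 0.15) -> List[Tuple[str, float]]:
    """Rank candidate transliterations by their interpolated score."""
    if user_origin is not None:              # strategy 1): user-specified origin
        probs = {o: 1.0 if o == user_origin else 0.0 for o in models}
    weights = select_models(probs, a, b)
    scores: Dict[str, float] = {}
    for origin, weight in weights.items():
        for candidate, p in models[origin](name).items():
            # strategy 5): weight each model by the origin probability
            scores[candidate] = scores.get(candidate, 0.0) + weight * p
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

The thresholds are left as parameters because the text describes 0.72 and 0.15 as empirical values that depend on the training corpus.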

The model used for transliteration is a phrase-based translation system; when it is used for transliteration, its reordering function is ignored.

Beneficial effect

In application, the transliteration system of the invention is organized in three layers: the front end, the intermediate control layer, and the back-end system. The front end is the interface between the user and the back-end transliteration system; it accepts the name and commands entered by the user, passes them to the control layer, and receives the results and signals the control layer returns. The middle layer connects the front end and the back end: it controls the back-end system according to the input and signals from the front end, and feeds the back end's results back to the front-end interface. The back-end system consists mainly of the name origin classification system and the name transliteration system. The front-end interface is a web page, implemented mainly in HTML and CSS.

Name origin classification follows the principle of the logistic regression model: the classification probabilities are calculated in a multinomial logistic regression model as in Formula 1 and Formula 2 above. The model parameters are trained by maximum likelihood estimation, which yields the equation to be optimized, and the Newton-Raphson method is then used to solve for the feature weights.
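The probability calculation can be illustrated with a short sketch. It assumes the reference-class parameterization commonly used for multinomial logistic regression, in which K-1 weight vectors are learned and the last class has a fixed score of zero; the function name `origin_probabilities` is hypothetical, and training by Newton-Raphson is not shown.

```python
import math
from typing import List, Sequence


def origin_probabilities(x: Sequence[float],
                         weights: Sequence[Sequence[float]]) -> List[float]:
    """Class probabilities for a feature vector x.

    weights holds K-1 weight vectors; class K is the reference class,
    whose score is fixed at 0 (an assumed parameterization).
    """
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]
    scores.append(0.0)                     # reference class
    m = max(scores)                        # subtract the max for stability
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]
```

The returned values always sum to 1, so no separate normalization step is needed under this form.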

The invention proposes a method that classifies a name's origin from the characters that make up the name and combines the outputs of several transliteration models for different origins to translate names between two languages. In bilingual name transliteration, the names in the training corpus usually originate from many countries, and pronunciation and translation conventions differ from one language and country to another. Classifying the training data by name origin and training a translation model per class therefore greatly helps the translation results.

The method proposed by the present invention applies a multi-class logistic regression model to name origin classification, classifying origins according to a feature template built from the characteristics of the characters that make up a name. For each origin category a specific transliteration (translation) model is trained, and the results of the several transliteration models are then fused to translate names between the two languages.

The main inventive content of the method is name origin classification and linear interpolation system fusion.

This patent applies the logistic regression model to name origin classification for the first time. The model is chosen mainly because features can easily be added, deleted, and modified.

Embodiments of the invention

Embodiment 1: In this embodiment, the name transliteration method based on classification of name origins is carried out according to the following steps:

First, classification of name origins:

Formula 1

Formula 2

The language model comprises an integrated 1-gram model, an integrated 2-gram model, and an integrated 3-gram model; the integrated n-gram model keeps the number of features from growing too large by dividing the n-gram probability values into intervals numbered 1 to 100 on a minimum-variance basis, giving 100 features. The length is the number of Chinese characters; the surname feature is the surname confidence, defined as the number of times the character occurs as a surname divided by its total number of occurrences;

Formula 4

Formula 5

In Formula 4 and Formula 5, x denotes the frequency of the i-th character in the training corpus, the denominator is the total number of occurrences in the training corpus of all characters in the character table, N denotes the number of characters in the character table, and DF denotes the number of name origin categories that contain character i;

1. Replace 'x' with 'ks';

Formula 4

Formula 5

In Formula 4 and Formula 5, x denotes the frequency of the i-th syllable in the training corpus, the denominator is the total number of occurrences in the training corpus of all syllables in the syllable table, N denotes the number of syllables in the syllable table, and DF denotes the number of name origin categories that contain syllable i;

Second, linear interpolation system fusion:

Formula 6

Formula 7

Formula 8

Formula 9

In Formula 7, Formula 8, and Formula 9, T denotes the translation result, P denotes the probability of the translation result T, and t denotes the t-th source-language position reached in translation. In Formula 6, λ_i denotes the probability that the source name S belongs to origin i. Formula 6 is the multi-system fusion strategy; Formulas 7, 8, and 9 are the decoding algorithm.

Embodiment 2: This embodiment differs from Embodiment 1 in that, for the Chinese name origin feature template in step one, the language model is trained with the SRILM tool. Each n-gram (n = 1, 2, or 3) has a probability, and the one-dimensional distribution of all n-gram probabilities is collected. This distribution is divided into 100 intervals, which amounts to clustering the n-gram features, with each interval representing one category. The intervals are chosen so that the sum of the variances within the intervals is minimal and the variance between the interval means is maximal, and the boundary points of the 100 intervals are found from the n-gram data:

Formula 3

In Formula 3, λ denotes the set of 100 boundary points, x_i denotes the probability value of each n-gram, and y_j denotes the mean of the j-th interval. The TF and IDF values are divided in the same way.

Surname confidence feature: in Chinese names, surnames are relatively fixed, and a few hundred are in common use. Several hundred surnames were extracted from the names in the "People's Daily 1998" corpus, and each was labeled with a manually defined surname confidence. For example, the surnames "Gong, Liao, 覃" have higher confidence than "Li, Wang, Zhou", while surnames such as "Bai (白), Shi (石), Qian (钱)" have lower confidence. The confidence levels are distinguished by calculating, for each character across the names in the corpus, "number of occurrences as a surname" / "total number of occurrences"; a feature clustering method similar to that used for n-grams then divides the surname confidence into 20 levels.
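The surname confidence ratio can be sketched in a few lines. The sketch assumes that the first character of each name is the surname, and it replaces the clustering into 20 levels with equal-width levels purely as a simplification; `surname_confidence` and `confidence_level` are hypothetical names.

```python
from collections import Counter
from typing import Dict, List


def surname_confidence(names: List[str]) -> Dict[str, float]:
    """Occurrences of a character as a surname divided by all its occurrences.

    Assumption: the first character of each name is the surname.
    """
    as_surname = Counter(name[0] for name in names if name)
    total = Counter(ch for name in names for ch in name)
    return {ch: as_surname[ch] / total[ch] for ch in as_surname}


def confidence_level(confidence: float, levels: int = 20) -> int:
    """Map a confidence in [0, 1] to a level from 1 to `levels`.

    Equal-width levels are a stand-in for the clustering in the text.
    """
    return min(levels, int(confidence * levels) + 1)
```

A character that appears only in surname position gets confidence 1.0 and the highest level, while one that also appears often inside given names gets a lower value.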

Others are the same as the first embodiment.

The effects of the present invention were verified by the following experiment:

1. The user enters the name to be translated on the interactive interface and may or may not specify its category. Here the example is the input name "德川家康" (Tokugawa Ieyasu) with no origin specified (the name is in fact of Japanese origin).

2. Form the feature vector X of the name:

2.1. From the input name and the existing knowledge, form the classification feature vector X of the name "德川家康". First obtain the language-model probabilities of {德, 川, 家, 康, 德川, 川家, 家康, 德川家, 川家康}, then map them onto the 100 intervals of the 1-gram, 2-gram, and 3-gram models according to the boundary points. This gives the Chinese interval numbers {86, 30, 51, 63, 31, 12, 43, 5, 7}, the Japanese interval numbers {51, 70, 81, 53, 11, 42, 43, 5, 7}, the Anglo-American interval numbers {85, 3, 19, 33, 11, 5, 23, 5, 7}, and likewise the feature values for the rest of the six origins.

2.2. Calculate the TF and IDF of the characters {德, 川, 家, 康}. Mapping the IDF values onto the 100 IDF intervals gives the interval numbers {14, 57, 85, 41}; the TF interval numbers are {3, 15, 7} for China, {50, 32, 76, 21} for Japan, and so on for the other countries.

2.3. By default the first character is the surname and the remaining characters are the given name, so the surname confidence of {德} is calculated, giving confidence level {1} out of 20 levels; the higher the level, the greater the confidence.

2.4. Calculate the length of the name, {4}.

2.5. Using the feature information obtained in steps 2.1 to 2.4, set the corresponding positions in the feature vector X to 1 and set the remaining features, which were not hit, to 0.

3. Using Formula 1 and Formula 2, calculate the probability that the name belongs to each class and normalize, giving the normalized probability vector (0.23, 0.07, 0.08, 0.05, 0.43, 0.14), where 1 denotes China, 2 Anglo-American, 3 Arabic, 4 Russia, 5 Japan, and 6 Korea.

4. Following the multi-system fusion strategy of Formula 6, the models for 1: China, 5: Japan, and 6: Korea are chosen for decoding. After fusing the three systems, the top transliteration result is "Tokugawaleyasu", the second is "tokuwavasu", and the third is "dekuanjiaking"; the top result is returned to the user. The mixed model thus helps to obtain the correct translation.

Claims

A name transliteration method based on classification of name origins, characterized in that the classification features, the classification method, and the multi-system fusion method are carried out according to the following steps:

First, classification of name origins:

A logistic regression model is applied to the name origin feature template to calculate:

Formula 1

Formula 2

In Formula 1 and Formula 2, K is 6 and Y takes the values 1 to 6, where 1 denotes China, 2 Anglo-American, 3 Arabic, 4 Russia, 5 Japan, and 6 Korea; x is the name origin feature template, P is the origin probability, and w is the feature weight vector;

The name origin feature template in step one is either a Chinese name origin feature template or an English name origin feature template;

The Chinese name origin feature template consists of the language model, the character TF-IDF, the length, and the surname;

The language model comprises an integrated 1-gram model, an integrated 2-gram model, and an integrated 3-gram model; the integrated n-gram model keeps the number of features of this kind from growing too large by dividing the n-gram probability values into intervals numbered 1 to 100 on a minimum-variance basis, giving 100 features. The length is the number of Chinese characters; the surname feature is the surname confidence, defined as the number of times the character occurs as a surname divided by its total number of occurrences;

The character TF-IDF consists of the TF and the IDF of each character in the name. From the name corpus, the characters commonly used in names are collected together with the frequency of each, giving a common-character table for each of the 6 origin categories; TF and IDF are then calculated with the following two formulas:

Formula 4

Formula 5

In Formula 4 and Formula 5, x denotes the frequency of the i-th character in the training corpus, the denominator is the total number of occurrences in the training corpus of all characters in the character table, N denotes the number of characters in the character table, and DF denotes the number of name origins that contain character i;

The English name origin feature template consists of the character language model, the syllable language model, the syllable TF-IDF, and the length.

The character language model comprises an integrated 2-gram model, an integrated 3-gram model, and an integrated 4-gram model; the syllable language model comprises an integrated 1-gram model, an integrated 2-gram model, and an integrated 3-gram model. The integrated n-gram model keeps the number of features of this kind from growing too large by dividing the n-gram probability values into intervals numbered 1 to 100 on a minimum-variance basis, giving 100 features. The length is the number of characters and the number of syllables. English names are divided into syllables as follows:

1. Replace 'x' with 'ks';

2. {a, o, e, i, u} are the basic vowel characters; 'y' is treated as a vowel when it follows a consonant;

3. When 'w' is preceded by 'a', 'e', or 'o' and is not followed by 'h', the 'w' and the preceding vowel are treated as a new vowel;

4. Except for {iu, eo, io, oi, ia, ui, ua, uo}, consecutive vowels are treated as a new vowel;

5. Separate the consonants that are next to each other and separate the vowels from the consonants that follow;

6. The consonants and the subsequent vowels form a syllable, and the other isolated vowels and consonants are separate syllables;

The syllable TF-IDF consists of the TF and the IDF of each syllable. From the name corpus, the syllables commonly used in names are collected together with the frequency of each, giving a common-syllable table for each of the 6 origin categories; TF and IDF are then calculated with the following two formulas:

Formula 4

Formula 5

In Formula 4 and Formula 5, x denotes the frequency of the i-th syllable in the training corpus, the denominator is the total number of occurrences in the training corpus of all syllables in the syllable table, N denotes the number of syllables in the syllable table, and DF denotes the number of name origins that contain syllable i;

Second, linear interpolation system fusion:

Formula 6

Formula 7

Formula 8

Formula 9

In Formula 7, Formula 8, and Formula 9, T denotes the translation result, P denotes the probability of the translation result T, and t denotes the t-th source-language position reached in translation. In Formula 6, λ_i denotes the probability that the source name S belongs to origin i. Formula 6 is the multi-system fusion strategy; Formulas 7, 8, and 9 are the decoding algorithm.

The name transliteration method based on classification of name origins according to claim 1, characterized in that, in the Chinese name origin feature template, the language model is trained with the SRILM tool. Each n-gram (n = 1, 2, or 3) has a probability, and the one-dimensional distribution of all n-gram probabilities is collected. This distribution is divided into 100 intervals, which amounts to clustering the n-gram features, with each interval representing one category. The intervals are chosen so that the sum of the variances within the intervals is minimal and the variance between the interval means is maximal, and the boundary points of the 100 intervals are found from the n-gram data:

Formula 3

In Formula 3, λ denotes the set of 100 boundary points, x_i denotes the probability value of each n-gram, and y_j denotes the mean of the j-th interval.