WO2014101629A1 - Name transliteration method based on classification of name origins - Google Patents

Name transliteration method based on classification of name origins

Info

Publication number
WO2014101629A1
Authority
WO
WIPO (PCT)
Prior art keywords
name
gram
origin
names
model
Prior art date
Application number
PCT/CN2013/088283
Other languages
English (en)
Chinese (zh)
Inventor
李婷婷
张春越
赵铁军
曹海龙
Original Assignee
哈尔滨工业大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 哈尔滨工业大学
Priority to KR1020157020138A (KR20150128656A)
Publication of WO2014101629A1

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00: Handling natural language data
    • G06F40/40: Processing or translation of natural language
    • G06F40/53: Processing of non-Latin text

Definitions

  • The present invention relates to a translation system.
  • The Internet has become an indispensable part of people's lives and the most important channel through which information is acquired, exchanged, and disseminated. Every day, people rely on the Internet for the information they need in daily life, work, and research. To deliver information from the Internet's massive data to users faster, more accurately, and more intelligently, technologies such as information retrieval, information extraction, and question answering have become a research focus in recent years. With the revolution in information exchange brought by the Internet, the exchange and acquisition of information is no longer limited to a single language, and processing Internet information across languages has become an urgent need, particularly in news and finance. Research on machine translation, cross-language retrieval, and cross-language question answering has therefore become increasingly important.
  • Names are translated mainly on the basis of similar pronunciation, so the task is also called name transliteration.
  • Transliteration research began in the 1990s and has accumulated more than ten years of work, mainly along two lines: phoneme-based methods and grapheme-based methods. The former rely on phonetic knowledge; the latter model the mapping directly between written forms. Using the two together is called a hybrid transliteration method.
  • The phoneme-based transliteration method uses a unified phonetic representation as an intermediate pivot (the symbols of this pivot are usually called phonemes).
  • This method is therefore also called the pivot method or the speech-based transliteration method.
  • The speech-based method requires several conversion steps: grapheme to phoneme, phoneme to phoneme, and phoneme to grapheme. Each step can introduce errors, and the errors accumulate.
  • The method also depends on the specific language. Each language uses different intermediate pronunciation units, and each language pair needs its own phoneme table, so the method does not scale to new languages.
  • Transliteration between Chinese names and English names is taken as an example.
  • A Chinese name is a name written in Chinese characters.
  • An English name is a name written in English letters.
  • For example, 'Takawa Jikang' is a name of Japanese origin.
  • Its English translation is 'Tokugawa Ieyasu'.
  • The transliteration (translation) of such Chinese names differs greatly from the usual pronunciation-based Chinese-English transliteration.
  • The purpose of the invention is to solve the problem of inconsistent transliteration patterns for names from different countries of origin when transliterating between Chinese and English names.
  • To that end, a name transliteration method based on name-origin classification is provided.
  • A logistic regression model is applied to the name-origin feature template to calculate:
  • where K is 6 and Y takes the values 1 to 6 (1 = China, 2 = Anglo, 3 = Arabia, 4 = Russia, 5 = Japan, 6 = Korea), x is the name-origin feature template, P is the origin probability, and w is the feature weight vector;
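  • The two equations referred to here are not reproduced in this text. As a minimal sketch, assuming the standard multinomial logistic regression form with class K as the reference class, P(Y=k|x) = exp(w_k·x) / (1 + Σ_{j<K} exp(w_j·x)) for k < K and P(Y=K|x) = 1 / (1 + Σ_{j<K} exp(w_j·x)), the origin probabilities could be computed as below. The function name `origin_probabilities`, the three-feature input, and the toy weights are illustrative assumptions, not values from the patent.

```python
import math

# Origin categories in the order given in the text (Y = 1..6).
ORIGINS = ["China", "Anglo", "Arabia", "Russia", "Japan", "Korea"]

def origin_probabilities(x, weights):
    """Return [P(Y=1|x), ..., P(Y=K|x)] under a multinomial logistic model.

    x       -- feature vector (list of floats)
    weights -- K-1 weight vectors; class K is the reference class with score 0
               (an assumed parameterisation; the patent text does not show it)
    """
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in weights]
    scores.append(0.0)                  # reference class K
    m = max(scores)                     # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Toy example with made-up weights for the first five classes.
x = [1.0, 0.0, 1.0]
weights = [
    [0.2, 0.1, -0.3],
    [-0.5, 0.4, 0.0],
    [0.0, 0.0, 0.1],
    [0.3, -0.2, 0.2],
    [1.0, 0.5, 1.5],
]
probs = origin_probabilities(x, weights)
best = ORIGINS[probs.index(max(probs))]
```

  • Giving the reference class a score of 0 and normalising all six exponentials is algebraically the same as the "1 + sum" form stated above.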
  • The name-origin feature template in step one is either a Chinese name-origin feature template or an English name-origin feature template.
  • The Chinese name-origin feature template consists of the language model, the word TF-IDF, the length, and the surname.
  • The language model features are the integrated 1-gram model, the integrated 2-gram model, and the integrated 3-gram model.
  • The length is the number of Chinese characters. The surname feature is the surname confidence: the number of times the word occurs as a surname divided by its total number of occurrences, with the quotient divided into 20 levels.
  • The so-called integrated n-gram model means that, to keep the number of features in this class from growing too large, the n-gram probability values are divided into 100 intervals (numbered 1 to 100) on the basis of minimum variance, forming 100 features.
  • The Chinese name-origin feature template uses the SRILM tool to train the language model. Each n-gram (n = 1, 2, or 3) has a probability, and the one-dimensional distribution of all n-gram probabilities is collected. This distribution is divided into 100 intervals, which form a clustering of the n-gram features.
  • Equation 3 is defined over a set of 100 boundary points; x_i is the probability value of each n-gram, and y_j is the mean of the j-th interval. This gives 300 language-model features (100 for each of the three n-gram orders).
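  • Equation 3 itself is not reproduced in this text. As a rough sketch, assuming its objective is the sum of squared deviations of each probability from the mean of its interval, the boundary points can be approximated with one-dimensional k-means (Lloyd's algorithm). This is a swapped-in heuristic, not necessarily the solver used by the inventors; the names `min_variance_boundaries` and `interval_index` are hypothetical, and the example uses 3 intervals instead of 100 to stay readable.

```python
import bisect

def min_variance_boundaries(values, k, iters=100):
    """Approximate k intervals with small within-interval variance (1-D k-means).

    Returns the k-1 boundary points between neighbouring intervals.
    """
    xs = sorted(values)
    n = len(xs)
    # Start from evenly spaced quantiles so the centres are already ordered.
    centres = [xs[(2 * j + 1) * n // (2 * k)] for j in range(k)]
    for _ in range(iters):
        # Each boundary is the midpoint between two neighbouring centres.
        bounds = [(a + b) / 2 for a, b in zip(centres, centres[1:])]
        groups = [[] for _ in range(k)]
        for x in xs:
            groups[bisect.bisect_right(bounds, x)].append(x)
        # Move every centre to the mean of its interval (keep it if empty).
        new = [sum(g) / len(g) if g else c for g, c in zip(groups, centres)]
        if new == centres:
            break
        centres = new
    return [(a + b) / 2 for a, b in zip(centres, centres[1:])]

def interval_index(p, bounds):
    """Return the 1-based interval number that probability p falls into."""
    return bisect.bisect_right(bounds, p) + 1

# Toy n-gram probabilities forming three obvious groups.
probs = [0.01, 0.02, 0.03, 0.50, 0.52, 0.51, 0.90, 0.92, 0.91]
bounds = min_variance_boundaries(probs, k=3)
```

  • With k set to 100, `interval_index` would give the interval number (1 to 100) used as the feature for an n-gram probability.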
  • The word TF-IDF features are the TF and the IDF of the words used in names.
  • From the name corpus, the common words used in names are collected together with the frequency of each, giving a common-word table for each of the 6 origin categories. The following two formulas are then used.
  • The formulas calculate TF and IDF:
  • In Equation 4 and Equation 5, x_i is the frequency of the i-th word in the training corpus;
  • the denominator is the total number of occurrences in the training corpus of all the words in the word table;
  • N is the number of words in the word table;
  • DF is the number of name-origin categories containing word i. As with the language model, TF and IDF are each divided into 100 intervals, giving 200 features.
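  • Equations 4 and 5 are not reproduced in this text. The sketch below assumes the common forms TF_i = x_i / Σ_j x_j and IDF_i = log(N / DF_i), with N and DF taken as the text defines them (N is the size of the table, DF the number of origin categories containing the unit). The function name `tf_idf_tables` and the toy corpus are hypothetical, and the log form of IDF is an assumption.

```python
import math
from collections import Counter

def tf_idf_tables(names_by_origin):
    """Compute TF and IDF for every unit (word/character) seen in the names.

    names_by_origin -- dict mapping an origin label to a list of names,
                       where each name is a sequence of units
    """
    freq = Counter()        # x_i: occurrences of unit i in the corpus
    origins_with = {}       # unit -> set of origins whose names contain it
    for origin, names in names_by_origin.items():
        for name in names:
            for unit in name:
                freq[unit] += 1
                origins_with.setdefault(unit, set()).add(origin)
    total = sum(freq.values())   # denominator of TF
    n = len(freq)                # N: number of units in the table
    tf = {u: c / total for u, c in freq.items()}
    idf = {u: math.log(n / len(origins_with[u])) for u in freq}
    return tf, idf

# Toy corpus: single letters stand in for name characters.
corpus = {"A": ["ab", "ac"], "B": ["ad"]}
tf, idf = tf_idf_tables(corpus)
```

  • The same function applies to syllables if each name is passed as a list of syllables rather than a string of characters.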
  • The English name-origin feature template consists of the character language model, the syllable language model, the syllable TF-IDF, and the length.
  • The character language model comprises the integrated 2-gram model, the integrated 3-gram model, and the integrated 4-gram model; the syllable language model comprises the integrated 1-gram model, the integrated 2-gram model, and the integrated 3-gram model.
  • As above, the integrated n-gram model keeps the number of features in this class from growing too large by dividing the n-gram probability values into 100 intervals (numbered 1 to 100) on the basis of minimum variance, forming 100 features.
  • The length features are the number of characters and the number of syllables. English names are divided into syllables as follows:
  • {a, o, e, i, u} are the basic vowel characters; y is treated as a vowel when it follows a consonant;
  • a consonant and the vowel that follows it form one syllable; all other isolated vowels and consonants are separate syllables.
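  • The two syllable rules above can be sketched as follows. The function names are hypothetical, and treating any non-vowel character (including a leading y) as a consonant is an assumption the text does not spell out.

```python
BASIC_VOWELS = set("aoeiu")

def is_vowel(name, i):
    """True if the character at position i acts as a vowel."""
    ch = name[i]
    if ch in BASIC_VOWELS:
        return True
    # 'y' counts as a vowel only when it directly follows a consonant.
    return ch == "y" and i > 0 and not is_vowel(name, i - 1)

def split_syllables(name):
    """Split a name into syllables using the two rules from the text."""
    name = name.lower()
    syllables = []
    i = 0
    while i < len(name):
        next_is_vowel = i + 1 < len(name) and is_vowel(name, i + 1)
        if not is_vowel(name, i) and next_is_vowel:
            syllables.append(name[i:i + 2])   # consonant + following vowel
            i += 2
        else:
            syllables.append(name[i])         # isolated vowel or consonant
            i += 1
    return syllables
```

  • For instance, `split_syllables("Tokugawa")` yields `to`, `ku`, `ga`, `wa`, while `split_syllables("Smith")` yields `s`, `mi`, `t`, `h`, since only `m` is followed by a vowel.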
  • The syllable TF-IDF features are the TF and the IDF of the syllables.
  • From the name corpus, the common syllables used in names are collected together with the frequency of each.
  • This gives a common-syllable table for each of the 6 origin categories; TF and IDF are then calculated by the following two formulas:
  • In Equation 4 and Equation 5, x_i is the frequency of the i-th syllable in the training corpus; the denominator is the total number of occurrences in the training corpus of all the syllables in the table; N is the number of syllables in the table; and DF is the number of name-origin categories containing syllable i.
  • Equation 7, Equation 8, and Equation 9
  • T represents the translation result;
  • P represents the probability of the translation result;
  • t represents the first position translated into the source language.
  • The weight with subscript i is the probability that S belongs to origin i. Equation 6 is the multi-system fusion strategy; Equations 7, 8, and 9 are the decoding algorithm.
  • The present invention proposes a strategy based on actual experimental data.
  • The origin category of the name is determined first. The user can specify the origin category; if the user does not specify it, the system calls the classification model to calculate the probability that the name belongs to each origin category and then, according to the classification results, dynamically combines the outputs of the multiple transliteration systems, as shown in Equation 6.
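  • Equation 6 is not reproduced in this text. Because the method is described elsewhere as linear interpolation, the sketch below assumes the fused score of a candidate T is the sum over origins of P(origin i | S) times the probability that the origin-i transliteration system assigns to T. The function name `fuse_candidates`, the candidate strings, and all probabilities are invented for illustration.

```python
def fuse_candidates(origin_probs, system_outputs, specified_origin=None):
    """Combine candidate lists from origin-specific transliteration systems.

    origin_probs     -- dict origin -> P(origin | S) from the classifier
    system_outputs   -- dict origin -> dict candidate -> P_i(T | S)
    specified_origin -- if the user names an origin, only that system is used
    """
    if specified_origin is not None:
        origin_probs = {specified_origin: 1.0}
    fused = {}
    for origin, weight in origin_probs.items():
        for cand, p in system_outputs.get(origin, {}).items():
            fused[cand] = fused.get(cand, 0.0) + weight * p
    # Best candidate first.
    return sorted(fused.items(), key=lambda kv: kv[1], reverse=True)

# Invented numbers: the classifier leans towards a Japanese origin.
origin_probs = {"Japan": 0.7, "China": 0.3}
system_outputs = {
    "Japan": {"Tokugawa Ieyasu": 0.8, "Dechuan Jiakang": 0.2},
    "China": {"Dechuan Jiakang": 0.9, "Tokugawa Ieyasu": 0.1},
}
ranked = fuse_candidates(origin_probs, system_outputs)
forced = fuse_candidates(origin_probs, system_outputs, specified_origin="China")
```

  • When the user specifies the origin, the weights collapse to a single system, which mirrors the first branch described in the text.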
  • The model used for transliteration is a phrase-based translation system; when it is applied to transliteration, its reordering function is ignored.
  • The transliteration system is organized in three layers: the front end, the intermediate control layer, and the back-end system.
  • The front end is the interface between the user and the back-end transliteration system. It accepts the name and commands entered by the user, passes them to the control layer, and then receives the results and signals returned by the control layer.
  • The middle layer connects the front end and the back end: it controls the back-end system according to the input and signals from the front end, and returns the back end's results to the front-end interface.
  • The back-end system consists mainly of the name-origin classification system and the name transliteration system.
  • The front-end interface is a web page, implemented mainly with HTML and CSS.
  • The name-origin classification follows the principle of the multinomial logistic regression model: the classification probability is calculated as in Equation 1 and Equation 2 above. The model parameters are trained by maximum likelihood estimation, which yields the equation to be optimized, and the feature weights are then solved with the Newton-Raphson method.
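  • To illustrate the training step, the sketch below runs Newton-Raphson on the log-likelihood of a deliberately simplified model: binary logistic regression with one feature and a bias. The patent's classifier is multinomial with hundreds of features, so this shows only the update rule (parameters minus inverse Hessian times gradient). The function names and the toy data are assumptions.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic_newton(xs, ys, iters=25, tol=1e-10):
    """Fit P(y=1|x) = sigmoid(b + w*x) by Newton-Raphson (maximum likelihood)."""
    b, w = 0.0, 0.0
    for _ in range(iters):
        # Gradient (g0, g1) and Hessian [[h00, h01], [h01, h11]]
        # of the negative log-likelihood.
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = sigmoid(b + w * x)
            r = p - y
            s = p * (1.0 - p)
            g0 += r
            g1 += r * x
            h00 += s
            h01 += s * x
            h11 += s * x * x
        det = h00 * h11 - h01 * h01
        # Newton step, using the closed-form inverse of the 2x2 Hessian.
        db = (h11 * g0 - h01 * g1) / det
        dw = (h00 * g1 - h01 * g0) / det
        b, w = b - db, w - dw
        if abs(db) + abs(dw) < tol:
            break
    return b, w

# Toy, non-separable data so that the maximum-likelihood estimate is finite.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
ys = [0, 0, 1, 0, 1, 1]
b, w = fit_logistic_newton(xs, ys)
grad_b = sum(sigmoid(b + w * x) - y for x, y in zip(xs, ys))
```

  • At convergence the gradient is close to zero, which is the condition the maximum-likelihood equations express.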
  • The invention proposes a method that classifies the origin of a name from the name itself and combines the outputs of several origin-specific transliteration models to translate names between two languages.
  • The names in a training corpus usually originate from multiple countries, and pronunciation and translation conventions vary from country to country. Training separate translation models according to name origin therefore greatly improves the results of bilingual name translation.
  • The method proposed by the present invention applies a multi-class logistic regression model to name-origin classification, classifying the origin according to a feature template built from the characters that make up the name. For each origin category, a specific transliteration (translation) model is trained, and the results of the multiple transliteration models are fused to achieve bilingual name translation.
  • The main inventive content of the method is the name-origin classification and the system fusion by linear interpolation.
  • A logistic regression model is used for the name-origin classification.
  • This model is chosen mainly because features can easily be added, deleted, and modified.
  • A name transliteration method based on name-origin classification is carried out in the following steps:
  • A logistic regression model is applied to the name-origin feature template to calculate:
  • where K is 6 and Y takes the values 1 to 6 (1 = China, 2 = Anglo, 3 = Arabia, 4 = Russia, 5 = Japan, 6 = Korea), x is the name-origin feature template, P is the origin probability, and w is the feature weight vector;
  • The name-origin feature template in step one is either a Chinese name-origin feature template or an English name-origin feature template.
  • The Chinese name-origin feature template consists of the language model, the word TF-IDF, the length, and the surname.
  • The language model features are the integrated 1-gram model, the integrated 2-gram model, and the integrated 3-gram model.
  • The integrated n-gram model keeps the number of features from growing too large by dividing the n-gram probability values into 100 intervals (numbered 1 to 100) on the basis of minimum variance, forming 100 features.
  • The length is the number of Chinese characters. The surname feature is the surname confidence: the number of times the word occurs as a surname divided by its total number of occurrences.
  • The word TF-IDF features are the TF and the IDF of the words used in names.
  • From the name corpus, the common words used in names are collected together with the frequency of each, giving a common-word table for each of the 6 origin categories. The following two formulas are then used.
  • The formulas calculate TF and IDF:
  • In Equation 4 and Equation 5, x_i is
  • the frequency of the i-th word in the training corpus; the denominator is the total number of occurrences in the training corpus of all the words in the word table;
  • N is the number of words in the word table;
  • DF is the number of name-origin categories containing word i.
  • The English name-origin feature template consists of the character language model, the syllable language model, the syllable TF-IDF, and the length.
  • The character language model comprises the integrated 2-gram model, the integrated 3-gram model, and the integrated 4-gram model; the syllable language model comprises the integrated 1-gram model, the integrated 2-gram model, and the integrated 3-gram model.
  • The integrated n-gram model keeps the number of features in this class from growing too large by dividing the n-gram probability values into 100 intervals (numbered 1 to 100) on the basis of minimum variance, forming 100 features.
  • The length features are the number of characters and the number of syllables. English names are divided into syllables as follows:
  • {a, o, e, i, u} are the basic vowel characters; y is treated as a vowel when it follows a consonant;
  • a consonant and the vowel that follows it form one syllable; all other isolated vowels and consonants are separate syllables.
  • The syllable TF-IDF features are the TF and the IDF of the syllables.
  • From the name corpus, the common syllables used in names are collected together with the frequency of each.
  • This gives a common-syllable table for each of the 6 origin categories; TF and IDF are then calculated by the following two formulas:
  • In Equation 4 and Equation 5, x_i is the frequency of the i-th syllable in the training corpus; the denominator is the total number of occurrences in the training corpus of all the syllables in the table; N is the number of syllables in the table; and DF is the number of name-origin categories containing syllable i;
  • Equation 7, Equation 8, and Equation 9
  • T represents the translation result;
  • P represents the probability of the translation result;
  • t represents the first position translated into the source language.
  • The weight with subscript i is the probability that S belongs to origin i.
  • Equation 6 is the multi-system fusion strategy; Equations 7, 8, and 9 are the decoding algorithm.
  • Embodiment 2: This embodiment differs from Embodiment 1 in that, in step one, the Chinese name-origin feature template uses the SRILM tool to train the language model. Each n-gram (n = 1, 2, or 3) has a probability, and the one-dimensional distribution of all n-gram probabilities is collected. This distribution is divided into 100 intervals. The 100 intervals are a clustering of the n-gram features, and each interval represents one category. The boundary points of the 100 intervals are computed from the n-gram data so that the sum of the variances within the intervals is minimal and the sum of the variances between the interval means is maximal:
  • Equation 3 is defined over a set of 100 boundary points; x_i is the probability value of each n-gram, and y_j is the mean of the j-th interval. The same method is used to divide the TF and IDF values.
  • Surname confidence feature: in Chinese names the surnames are relatively fixed, and a few hundred are in common use. Several hundred surnames were extracted from the names in the "People's Daily 1998" corpus, and each surname was manually given a confidence label. Surnames such as "Gong" and "Liao" have higher confidence than "Li, Wang, Zhou", while "White, Stone, Money" have lower confidence. The confidence values are distinguished by calculating, for each word, "the number of occurrences as a surname" / "the total number of occurrences". A feature clustering method similar to the one used for n-grams then divides the surname confidence into 20 levels.
  • The user enters the name to be translated in the interactive interface and may or may not specify an origin category. As an example, the name "Tokugawa Ieyasu" is entered without specifying its origin (the name actually originates from Japan).
  • The classification feature vector X for the name Tokugawa Ieyasu is formed from the interval numbers hit by its feature values under each of the six origins:
  • 30, 51, 63, 31, 12, 43, 5, 7}; Japan: {51, 70, 81, 53, 11, 42, 43, 5, 7}; Europe and America: {85, 3, 19, 33, 11, 5, 23, 5, 7}; and likewise for the remaining origins.
  • For each hit interval, the corresponding position in the feature vector X is set to 1; all features without a hit are set to 0.
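  • The last step can be read as building a binary indicator vector: each feature group contributes one block, and the hit interval inside that block is set to 1. The sketch below illustrates this under assumed block sizes; the function name `build_feature_vector` and the toy sizes are hypothetical, since the text does not state how the nine values map onto feature groups.

```python
def build_feature_vector(interval_hits, intervals_per_group):
    """Build a 0/1 vector with one block per feature group.

    interval_hits       -- 1-based interval number hit in each group
    intervals_per_group -- number of intervals in each group
                           (for example 100 for an n-gram group, 20 for surname levels)
    """
    x = []
    for hit, size in zip(interval_hits, intervals_per_group):
        block = [0] * size
        if 1 <= hit <= size:
            block[hit - 1] = 1     # mark the interval that was hit
        x.extend(block)
    return x

# Toy example: three groups with 3, 2, and 4 intervals.
x_vec = build_feature_vector([2, 1, 3], [3, 2, 4])
```

  • Every position outside the hit intervals stays at 0, so the vector has exactly one 1 per feature group.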

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a name transliteration method based on classification of name origins. The method relates to a translation system. The invention solves the problem of inconsistent transliteration patterns for names from different countries of origin in transliteration between Chinese and English names. The method comprises: 1) classifying name origins; and 2) fusing systems by linear interpolation. In the method described in the present invention, a multi-class logistic regression model is applied to name-origin classification, and the classification is performed according to a feature template built from the characters that make up the name; for each origin category, a specific transliteration (translation) model is trained; the results of the multiple transliteration models are then fused to achieve bilingual name translation.
PCT/CN2013/088283 2012-12-24 2013-12-02 Name transliteration method based on classification of name origins WO2014101629A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020157020138A KR20150128656A (ko) 2012-12-24 2013-12-02 인명 기원 분류를 기반으로 한 인명 음역 방법

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201210566217.X 2012-12-24
CN201210566217.XA CN103020046B (zh) 2012-12-24 2012-12-24 基于人名起源分类的人名音译方法

Publications (1)

Publication Number Publication Date
WO2014101629A1 (fr) 2014-07-03

Family

ID=47968663

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/088283 WO2014101629A1 (fr) 2012-12-24 2013-12-02 Name transliteration method based on classification of name origins

Country Status (3)

Country Link
KR (1) KR20150128656A (fr)
CN (1) CN103020046B (fr)
WO (1) WO2014101629A1 (fr)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103020046B (zh) * 2012-12-24 2016-04-20 哈尔滨工业大学 基于人名起源分类的人名音译方法
KR20180001889A (ko) 2016-06-28 2018-01-05 삼성전자주식회사 언어 처리 방법 및 장치
CN107066447B (zh) * 2017-04-19 2021-03-26 广东惠禾科技发展有限公司 一种无意义句子识别的方法和设备
CN115662392B (zh) * 2022-12-13 2023-04-25 中国科学技术大学 一种基于音素记忆的音译方法、电子设备及存储介质

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101650945A (zh) * 2009-09-17 2010-02-17 浙江工业大学 基于多元核logistic回归模型的说话人辨别实现方法
CN103020046A (zh) * 2012-12-24 2013-04-03 哈尔滨工业大学 基于人名起源分类的人名音译方法

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102033879B (zh) * 2009-09-27 2015-02-18 深圳市世纪光速信息技术有限公司 一种中文人名识别的方法和装置

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI, SIFU ET AL.: "Performing Chinese Text Classification Using Logistic Regression Models.", COMPUTER ENGINEERING AND APPLICATIONS., vol. 45, no. 14, July 2009 (2009-07-01), pages 152 - 154 *

Also Published As

Publication number Publication date
KR20150128656A (ko) 2015-11-18
CN103020046B (zh) 2016-04-20
CN103020046A (zh) 2013-04-03

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13866669

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 20157020138

Country of ref document: KR

Kind code of ref document: A

122 Ep: pct application non-entry in european phase

Ref document number: 13866669

Country of ref document: EP

Kind code of ref document: A1