WO2014101629A1 - Name transliteration method based on classification of name origins - Google Patents

Name transliteration method based on classification of name origins

Info

Publication number
WO2014101629A1
Authority
WO
WIPO (PCT)
Prior art keywords
name
gram
origin
names
model
Prior art date
Application number
PCT/CN2013/088283
Other languages
French (fr)
Chinese (zh)
Inventor
李婷婷
张春越
赵铁军
曹海龙
Original Assignee
哈尔滨工业大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 哈尔滨工业大学
Priority to KR1020157020138A (KR20150128656A)
Publication of WO2014101629A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/53 Processing of non-Latin text

Definitions

  • the present invention relates to a translation system.
  • the Internet has become an indispensable part of people's lives and is now one of the most important channels through which people acquire, exchange and disseminate information. Every day we rely on it for information about everyday services, work and research. To deliver information to users faster, more accurately and more intelligently from the Internet's massive data, technologies such as information retrieval, information extraction and question answering have become a research focus in recent years. With the revolution in information exchange brought by the Internet, communication is no longer confined to a single language, and processing Internet information across languages has become an urgent need, particularly in fields such as news and finance. Research on machine translation, cross-language retrieval and cross-language question answering has therefore become increasingly important.
  • names are translated mainly by similarity of pronunciation, so name translation is also called name transliteration.
  • transliteration research began in the 1990s and has accumulated more than a decade of work. There are two main families of methods, phoneme-based and grapheme-based: the former relies on phonetic knowledge, the latter models the mapping between graphemes directly, and a combination of the two is called a hybrid transliteration method.
  • the phoneme-based transliteration method uses a unified phonetic representation as an intermediate pivot (the symbols of this pivot are usually called phonemes)
  • this method is also called the pivot method or the speech-based transliteration method.
  • the speech-based method requires a multi-step conversion, grapheme to phoneme and then phoneme to grapheme. Each conversion step can introduce errors, so errors accumulate.
  • the method also depends on the specific languages involved. Each language pair uses different intermediate pronunciation units and needs its own phoneme table, so the method does not extend easily to new language pairs.
  • take Chinese-English name transliteration as an example.
  • a Chinese name here means a name written in Chinese characters
  • an English name means a name written in English letters.
  • '德川家康' is a name of Japanese origin
  • its English translation is 'Tokugawa Ieyasu'
  • the transliteration (translation) of such Chinese-character names differs greatly from ordinary pronunciation-based Chinese-English transliteration.
  • the purpose of the invention is to solve the problem that names from different countries of origin follow inconsistent transliteration patterns in Chinese-English name transliteration.
  • a name transliteration method based on classification of name origins is provided.
  • a logistic regression model is applied to the name-origin feature template to compute:
  • K is 6 and Y takes the values 1 to 6, where 1 denotes China, 2 Britain and the United States, 3 Arabia, 4 Russia, 5 Japan and 6 Korea; x is the name-origin feature template, P is the probability of an origin, and w is the feature weight vector;
  • the name-origin feature template described in step one is either a Chinese name-origin feature template or an English name-origin feature template;
  • the Chinese name-origin feature template consists of language-model features, character TF-IDF, length and surname;
  • the language-model features are an integrated 1-gram model, an integrated 2-gram model and an integrated 3-gram model.
  • the length is the number of Chinese characters; the surname feature is the surname confidence, the quotient of the number of occurrences of the surname divided by the total number of occurrences, quantized into 20 levels according to its value.
  • the so-called integrated n-gram model keeps the number of features of this class from growing too large: based on minimum variance, the n-gram probability values are divided into 100 intervals, giving 100 features.
  • for the Chinese name-origin feature template, the language model is trained with the SRILM tool. Each n-gram (n = 1, 2 or 3) has a probability; the one-dimensional distribution of all n-gram probabilities is collected and divided into 100 intervals, which form a clustering of the n-gram features.
  • in Formula 3, λ denotes the set of 100 boundary points, x_i the probability value of each n-gram, and y_j the mean of the j-th interval. This gives 300 language-model features.
  • the character TF-IDF consists of the TF and the IDF of single characters in the given-name ("名") part.
  • from the name corpus, the characters commonly used in names are collected together with the frequency of each, giving common-character lists for the 6 classes of names; TF and IDF are then calculated with the following two formulas:
  • in Formulas 4 and 5, x denotes the frequency of the i-th character in the training corpus
  • the denominator is the total number of occurrences in the training corpus of all characters in the character list
  • N denotes the number of characters in the character list
  • DF denotes the number of name-origin categories that contain character i; as with the language model, TF and IDF are each divided into 100 intervals, giving 200 features.
  • the English name-origin feature template consists of a character language model, a syllable language model, syllable TF-IDF and length.
  • the character language model comprises integrated 2-gram, 3-gram and 4-gram models, and the syllable language model comprises integrated 1-gram, 2-gram and 3-gram models; the integrated n-gram model keeps the number of features of this class from growing too large by dividing the n-gram probability values into 100 intervals based on minimum variance, giving 100 features.
  • the length features are the number of characters and the number of syllables, and English names are split into syllables by the following method:
  • {a,o,e,i,u} are the basic vowel characters, and y is treated as a vowel when it follows a consonant;
  • a consonant and the vowel that follows it form a syllable, and any remaining isolated vowels and consonants are syllables of their own;
  • the syllable TF-IDF consists of the TF and the IDF of syllables.
  • from the name corpus, the syllables commonly used in names are collected together with the frequency of each.
  • common-syllable lists for the 6 classes of names are obtained, and TF and IDF are then calculated with the following two formulas:
  • in Formulas 4 and 5, x denotes the frequency of the i-th syllable in the training corpus, the denominator is the total number of occurrences in the training corpus of all syllables in the syllable list, N is the number of syllables in the list, and DF is the number of name-origin categories that contain syllable i.
  • Formulas 7, 8 and 9
  • T represents the translation result
  • P represents the probability of the translation result
  • t represents the first position translated into the source language.
  • λ_i represents the probability that S belongs to origin i. Formula 6 is the multi-system fusion strategy; Formulas 7, 8 and 9 are the decoding algorithm.
  • the present invention proposes a strategy based on actual experimental data.
  • the origin category of the name is determined first. The user can specify the origin category; if the user does not specify one, the system calls the classification model to calculate the probability that the name belongs to each origin category and then, according to the output of the name-origin classifier, dynamically combines the results of multiple transliteration systems, as shown in Formula 6.
  • the model used for transliteration is a phrase-based translation system, used with its reordering function ignored.
  • the transliteration system is organized in three layers: the front end, the intermediate control layer and the back-end system.
  • the front end is the interface between the user and the back-end transliteration system. It accepts the name and commands entered by the user, passes them to the control layer, and receives the results and signals the control layer returns.
  • the middle layer connects the front end and the back end: it controls the back-end system according to the input and signals from the front end, and returns the back end's results to the front-end interface.
  • the back-end system consists mainly of the name-origin classification system and the name transliteration system.
  • the front-end interface is a web page, implemented mainly with HTML and CSS.
  • name-origin classification is based on the multinomial logistic regression model; the classification probability is calculated as in Formulas 1 and 2 above. Model parameters are trained by maximum likelihood estimation, which yields the objective to be optimized, and the feature weights are then solved with the Newton-Raphson method.
  • the invention proposes a method that classifies the origin of a name from the characters that make it up and combines the outputs of several origin-specific transliteration models to translate names between two languages.
  • the names in a training corpus usually originate from many countries, and pronunciation and translation conventions differ from country to country. Training separate translation models according to name origin is therefore of great help to translation quality when names are translated between two languages.
  • the method proposed by the present invention applies a multi-class logistic regression model to name-origin classification, classifying the origin according to a feature template built from the characters that make up the name. For each origin category a specific transliteration (translation) model is trained, and the results of the multiple transliteration models are then fused to translate names between the two languages.
  • the main inventive content of the method is name-origin classification and linear-interpolation system fusion.
  • a logistic regression model is used for name-origin classification.
  • this model is used mainly because features can easily be added, deleted and modified.
  • Embodiment 2: this embodiment differs from the embodiment above in that, in step one, the language model for the Chinese name-origin feature template is trained with the SRILM tool. Each n-gram (n = 1, 2 or 3) has a probability; the one-dimensional distribution of all n-gram probabilities is collected and divided into 100 intervals. The 100 intervals form a clustering of the n-gram features, each interval representing one category, such that the sum of the variances within the intervals is minimal and the sum of the variances between the interval means is maximal. The boundary points of the 100 intervals are computed from the n-gram data:
  • in Formula 3, λ denotes the set of 100 boundary points, x_i the probability value of each n-gram, and y_j the mean of the j-th interval. The same method is used to divide the TF and IDF values.
  • surname confidence feature: in Chinese names the set of surnames is relatively fixed, with a few hundred in common use. Several hundred surnames were extracted from the names in the "People's Daily 1998" corpus, and each was manually labelled with a confidence. The surnames "Gong, Liao, ⁇" have higher confidence than "Li, Wang, Zhou", while "Bai (白), Shi (石), Qian (钱)" have lower confidence; the confidence of each such character is calculated from names as "the number of occurrences as a surname" / "the total number of occurrences"; a feature clustering method similar to the one used for n-grams divides the surname confidence into 20 levels.
  • the user enters the name to be translated in the interactive interface and may or may not specify an origin category; as an example, the name "Tokugawa Ieyasu" is entered without specifying its origin (the name is in fact of Japanese origin).
  • the classification feature vector X of the name Tokugawa Ieyasu is formed:
  • 30, 51, 63, 31, 12, 43, 5, 7}, the Japanese interval numbers are {51, 70, 81, 53, 11, 42, 43, 5, 7}, the European and American ones are {85, 3, 19, 33, 11, 5, 23, 5, 7}, and so on for the feature values of the six origins.
  • the corresponding positions in the feature vector X are set to 1, and the remaining features that are not hit are set to 0.
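
The definitions above describe fusing several origin-specific transliteration systems, with each system weighted by the probability λ_i that the classifier assigns to origin i. Formula 6 itself is not reproduced in this text, so the following is only a minimal sketch that assumes a plain linear interpolation, score(T) = Σ_i λ_i · P_i(T | S). The function name, the dictionary layout and all numbers are hypothetical.

```python
def fuse_transliterations(origin_probs, system_outputs):
    """Rank candidates by an assumed linear interpolation of system scores.

    origin_probs:   {origin: lambda_i}, classifier probability per origin.
    system_outputs: {origin: {candidate: P_i(candidate | source name)}}.
    """
    combined = {}
    for origin, lam in origin_probs.items():
        for candidate, prob in system_outputs.get(origin, {}).items():
            # Each origin-specific system contributes lambda_i * P_i(T | S).
            combined[candidate] = combined.get(candidate, 0.0) + lam * prob
    # Highest combined score first.
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)
```

When the user specifies the origin, the same function can be called with that origin's probability set to 1 and all others omitted, which reduces to using the single matching system.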

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

A name transliteration method based on classification of name origins. The method relates to a translation system. By means of the present invention, the problem of inconsistent transliteration mode for different origin country names in Chinese and English name transliteration is solved. The method comprises: 1) classifying name origins; and 2) integrating a linear interpolation system. According to the method provided in the present invention, a logistic multi-classification regression model is applied to the name origin classification, and the name origin classification is carried out according to a characteristic template of a name composition word characteristic; for the name category of each type of origin, one specific transliteration (translation) model is trained; and then the results of multiple transliteration models are integrated to implement bilingual name inter-translation.

Description

Name transliteration method based on classification of name origins
Technical field
The present invention relates to a translation system.
Background art
The Internet has become an indispensable part of people's lives and is now one of the most important channels through which people acquire, exchange and disseminate information. Every day we rely on it for information about everyday services, work and research. To deliver information to users faster, more accurately and more intelligently from the Internet's massive data, technologies such as information retrieval, information extraction and question answering have become a research focus in recent years. With the revolution in information exchange brought by the Internet, communication is no longer confined to a single language, and processing Internet information across languages has become an urgent need, particularly in fields such as news and finance. Research on machine translation, cross-language retrieval and cross-language question answering has therefore become increasingly important. Within this research, the translation of named entities is an important and fundamental problem. Personal names, as an important class of named entity, are highly expressive and are among the key information in a document. Because names form an open class, however, they are often the main source of out-of-vocabulary words in natural language processing and machine translation. Translating names correctly and automatically is therefore meaningful work, and can also guide human translators.
Names are translated mainly by similarity of pronunciation, so name translation is also called name transliteration. Transliteration research began in the 1990s and has accumulated more than a decade of work. There are two main families of methods, phoneme-based and grapheme-based: the former relies on phonetic knowledge, while the latter models the mapping between graphemes directly; combining the two is called hybrid transliteration. Specifically, the phoneme-based method uses a unified phonetic representation as an intermediate pivot (the symbols of this pivot are usually called phonemes) and converts the source language to phonemes and the phonemes to the target language, so it is also called the pivot method or the speech-based transliteration method. Because it requires a multi-step conversion, grapheme to phoneme and then phoneme to grapheme, and every step can go wrong, errors accumulate. The method also depends on the specific languages: each language pair uses different intermediate pronunciation units and needs its own phoneme table, so the method does not extend easily.
To overcome these shortcomings of the speech-based method, and inspired by word alignment in machine translation, researchers built transliteration models directly over the graphemes of the source and target languages; such methods are called direct transliteration or grapheme-based transliteration. Later researchers combined the two families into hybrid transliteration, which joins grapheme-based and speech-based methods and mixes their outputs with system-fusion techniques such as linear interpolation. Because the grapheme-based method is independent of the specific language pair and performs well, it has become the main approach in transliteration research.
Although researchers have proposed many transliteration methods, name origin has not yet received enough attention among the many factors that affect transliteration quality. Take Chinese-English name transliteration as an example; here a Chinese name means a name written in Chinese characters, and an English name means a name written in English letters. For instance, '德川家康' is a name of Japanese origin whose English translation is 'Tokugawa Ieyasu', and the name of Korean origin '卢武铉' is transliterated 'Roh Moo-hyun'. The transliteration (translation) of such Chinese-character names differs greatly from ordinary pronunciation-based Chinese-English transliteration. If the origins of these names are not distinguished and a single trained model is used to translate them, correct results cannot be obtained, and their presence also degrades the model's transliteration of names of Chinese and English origin. In summary, transliteration based on name-origin classification is an important problem.
Technical problem
The purpose of the present invention is to solve the problem that names from different countries of origin follow inconsistent transliteration patterns in Chinese-English name transliteration, by providing a name transliteration method based on classification of name origins.
Technical solution
The name transliteration method based on classification of name origins proceeds according to the following steps:
Step 1. Name-origin classification:
A logistic regression model is applied to the name-origin feature template to compute:
Figure PCTCN2013088283-appb-M000001
Formula 1
Figure PCTCN2013088283-appb-M000002
Formula 2
In Formulas 1 and 2, K is 6 and Y takes the values 1 to 6, where 1 denotes China, 2 Britain and the United States, 3 Arabia, 4 Russia, 5 Japan and 6 Korea; x is the name-origin feature template, P is the probability of an origin, and w is the feature weight vector;
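
The formula images are not reproduced in this text, so the following is only a sketch of one common parameterization of multinomial logistic regression, assumed here: classes 1 to K-1 each have a weight vector w_k with P(Y=k|x) proportional to exp(w_k · x), and class K is the reference class with score 0. The function name and the ordering of `ORIGINS` are illustrative.

```python
import math

# Origin categories in the order given in the text (Y = 1..6).
ORIGINS = ["China", "Britain/US", "Arabia", "Russia", "Japan", "Korea"]


def origin_probabilities(x, weights):
    """Return P(Y=k | x) for k = 1..K under the assumed parameterization.

    x:       feature vector (list of floats).
    weights: K-1 weight vectors; the last class is the reference class.
    """
    scores = [sum(w_j * x_j for w_j, x_j in zip(w, x)) for w in weights]
    scores.append(0.0)  # reference class K has score 0
    top = max(scores)   # subtract the maximum for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]
```

With K = 6 this takes five weight vectors and returns six probabilities that sum to 1; the most probable origin is `ORIGINS[probs.index(max(probs))]`.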
The name-origin feature template described in step one is either a Chinese name-origin feature template or an English name-origin feature template;
The Chinese name-origin feature template consists of language-model features, character TF-IDF, length and surname;
The language-model features are an integrated 1-gram model, an integrated 2-gram model and an integrated 3-gram model; the length is the number of Chinese characters; the surname feature is the surname confidence, the quotient of the number of occurrences of the surname divided by the total number of occurrences, quantized into 20 levels according to its value.
The so-called integrated n-gram model keeps the number of features of this class from growing too large: based on minimum variance, the n-gram probability values are divided into 100 intervals, giving 100 features. In the Chinese name-origin feature template the language model is trained with the SRILM tool. Each n-gram (n = 1, 2 or 3) has a probability; the one-dimensional distribution of all n-gram probabilities is collected and divided into 100 intervals. These 100 intervals form a clustering of the n-gram features, each interval representing one category, such that the sum of the variances within the intervals is minimal and the sum of the variances between the interval means is maximal. The boundary points of the 100 intervals are computed from the n-gram data:
Figure PCTCN2013088283-appb-M000003
Formula 3
In Formula 3, λ denotes the set of 100 boundary points, x_i the probability value of each n-gram, and y_j the mean of the j-th interval. This yields 300 language-model features (100 for each of the 1-gram, 2-gram and 3-gram models).
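
Formula 3 is not reproduced in this text. As a hedged illustration of the stated criterion (minimal sum of within-interval variances), the sketch below partitions sorted values into k contiguous intervals by dynamic programming, minimizing the within-interval sum of squared deviations. Placing each boundary midway between neighbouring intervals, and the function names, are assumptions of this sketch rather than the patent's procedure.

```python
from bisect import bisect_right


def min_variance_boundaries(values, k):
    """Split 1-D values into k contiguous intervals with minimal
    within-interval sum of squared deviations; return k-1 boundaries."""
    xs = sorted(values)
    n = len(xs)
    if not 1 <= k <= n:
        raise ValueError("k must be between 1 and len(values)")
    # Prefix sums give the squared-deviation cost of any slice in O(1).
    p1, p2 = [0.0], [0.0]
    for v in xs:
        p1.append(p1[-1] + v)
        p2.append(p2[-1] + v * v)

    def sse(i, j):  # cost of the slice xs[i:j]
        s = p1[j] - p1[i]
        return (p2[j] - p2[i]) - s * s / (j - i)

    inf = float("inf")
    cost = [[inf] * (n + 1) for _ in range(k + 1)]
    back = [[0] * (n + 1) for _ in range(k + 1)]
    cost[0][0] = 0.0
    for g in range(1, k + 1):          # number of intervals used so far
        for j in range(g, n + 1):      # first j values are covered
            for i in range(g - 1, j):  # last interval is xs[i:j]
                c = cost[g - 1][i] + sse(i, j)
                if c < cost[g][j]:
                    cost[g][j], back[g][j] = c, i
    # Walk back through the table to recover the interval starts.
    cuts, j = [], n
    for g in range(k, 1, -1):
        i = back[g][j]
        cuts.append((xs[i - 1] + xs[i]) / 2)  # assumed: midpoint boundary
        j = i
    return sorted(cuts)


def interval_index(value, boundaries):
    """Index (0-based) of the interval that contains value."""
    return bisect_right(boundaries, value)
```

Calling `min_variance_boundaries(probabilities, 100)` would give the 99 interior boundaries of 100 intervals, and `interval_index` then maps an n-gram probability to the feature it activates. The cubic-time loop is fine for a sketch but would need a faster method for a large n-gram inventory.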
The character TF-IDF consists of the TF and the IDF of single characters in the given-name ("名") part. From the name corpus, the characters commonly used in names are collected together with the frequency of each, giving common-character lists for the 6 classes of names; TF and IDF are then calculated with the following two formulas:
Figure PCTCN2013088283-appb-M000004
Formula 4
Figure PCTCN2013088283-appb-M000005
Formula 5
In Formulas 4 and 5, x denotes the frequency of the i-th character in the training corpus, the denominator is the total number of occurrences in the training corpus of all characters in the character list, N denotes the number of characters in the character list, and DF denotes the number of name-origin categories that contain character i. As with the language model, TF and IDF are each divided into 100 intervals, giving 200 features.
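
Formulas 4 and 5 are not reproduced in this text. The sketch below follows the symbol definitions above and assumes the forms TF_i = x_i / Σ_j x_j and IDF_i = log(N / DF_i); the logarithm, the pooling of all origins into one count table, and the function name are assumptions of this sketch.

```python
import math
from collections import Counter


def char_tf_idf(names_by_origin):
    """Compute assumed TF and IDF values for each character.

    names_by_origin: {origin: [name, ...]} with names as character strings.
    Returns (tf, idf) dictionaries keyed by character.
    """
    counts = Counter()
    origins_with_char = {}
    for origin, names in names_by_origin.items():
        for name in names:
            for ch in name:
                counts[ch] += 1
                origins_with_char.setdefault(ch, set()).add(origin)
    total = sum(counts.values())  # occurrences of all listed characters
    n_chars = len(counts)         # N: number of characters in the list
    tf = {ch: c / total for ch, c in counts.items()}
    idf = {ch: math.log(n_chars / len(origins_with_char[ch])) for ch in counts}
    return tf, idf
```

The same function applies to syllables if each name is passed as a list of syllables instead of a string of characters.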
The English name-origin feature template consists of a character language model, a syllable language model, syllable TF-IDF and length.
The character language model comprises integrated 2-gram, 3-gram and 4-gram models, and the syllable language model comprises integrated 1-gram, 2-gram and 3-gram models; the integrated n-gram model keeps the number of features of this class from growing too large by dividing the n-gram probability values into 100 intervals based on minimum variance, giving 100 features. The length features are the number of characters and the number of syllables, and English names are split into syllables by the following method:
1. Replace 'x' with 'ks'.
2. {a, o, e, i, u} are the basic vowel characters; 'y' is treated as a vowel when it follows a consonant.
3. When 'w' is preceded by 'a', 'e' or 'o' and is not followed by 'h', 'w' and the preceding vowel are treated as a single new vowel symbol.
4. Except for {iu, eo, io, oi, ia, ui, ua, uo}, consecutive vowels are treated as a single new vowel symbol.
5. Separate adjacent consonants, and separate a vowel from the consonant that immediately follows it.
6. A consonant and the vowel that follows it form one syllable; the remaining isolated vowels and consonants are each a separate syllable.
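The six syllabification rules above can be sketched in Python as follows. This is one possible reading of the rules, not the inventors' implementation: the function name is hypothetical, the input is assumed to be a single alphabetic token, and the choice to start a new vowel unit after a merged vowel-plus-'w' symbol is an assumption, since the rules do not cover that case.

```python
VOWELS = set("aeiou")
# Rule 4: vowel pairs that stay as two separate vowels
SPLIT_PAIRS = {"iu", "eo", "io", "oi", "ia", "ui", "ua", "uo"}

def syllabify(name):
    s = name.lower().replace("x", "ks")            # rule 1
    units = []                                     # [text, is_vowel] pairs
    for i, ch in enumerate(s):
        nxt = s[i + 1] if i + 1 < len(s) else ""
        prev = units[-1] if units else None
        prev_vowel = prev is not None and prev[1]
        if ch == "w" and prev_vowel and prev[0][-1] in "aeo" and nxt != "h":
            prev[0] += "w"                         # rule 3: merge 'w' into the vowel
            continue
        # rule 2: 'y' counts as a vowel only right after a consonant
        is_vowel = ch in VOWELS or (ch == "y" and prev is not None and not prev_vowel)
        if (is_vowel and prev_vowel and prev[0][-1] != "w"
                and prev[0][-1] + ch not in SPLIT_PAIRS):
            prev[0] += ch                          # rule 4: merge consecutive vowels
        else:
            units.append([ch, is_vowel])           # rule 5: every other unit stands alone
    syllables, i = [], 0
    while i < len(units):                          # rule 6: consonant + following vowel
        if not units[i][1] and i + 1 < len(units) and units[i + 1][1]:
            syllables.append(units[i][0] + units[i + 1][0])
            i += 2
        else:
            syllables.append(units[i][0])
            i += 1
    return syllables
```

Under this reading, `syllabify("Alexander")` yields the units a, le, k, sa, n, de, r, and the number of units returned is the syllable-count length feature.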
The syllable TF-IDF features are the TF and the IDF of each syllable. Commonly used name syllables and the frequency of each are counted from the name corpus, yielding tables of commonly used syllables for the six classes of names, and TF and IDF are then computed with the following two formulas:
Figure PCTCN2013088283-appb-M000006
Formula 4
Figure PCTCN2013088283-appb-M000007
Formula 5
In Formulas 4 and 5, x is the frequency of the i-th syllable in the training corpus; the denominator is the total number of occurrences in the training corpus of all syllables in the table; N is the number of syllables in the table; and DF is the number of name-origin classes that contain syllable i.
II. Linear interpolation system fusion:
Figure PCTCN2013088283-appb-M000008
Formula 6
Figure PCTCN2013088283-appb-M000009
Formula 7
Figure PCTCN2013088283-appb-M000010
Formula 8
Figure PCTCN2013088283-appb-M000011
Formula 9
In Formulas 7, 8 and 9, T is the translation result, P is the probability of the translation result T, and t is the position in the source-language string that the translation has reached. In Formula 6, λ_i is the probability that S belongs to origin i. Formula 6 is the multi-system fusion strategy; Formulas 7, 8 and 9 are the decoding algorithm.
Because names are divided into several classes by origin, a transliteration model can be trained on each class. To make fuller use of these transliteration models, the invention proposes a strategy based on actual experimental data. For a name to be translated, its origin class is determined first. The user may specify the origin of the name; if the user does not, the system calls the classification model to compute the probability that the name belongs to each origin class and then, according to the output of the origin classifier, dynamically fuses the results of several transliteration systems, as shown in Formula 6.
The specific strategy is as follows:
1) If the user specifies the origin of the name, the probability that the name belongs to that origin is 1 and the probability that it belongs to any other origin is 0.
2) If the user does not specify the origin, the origin classification system is called to obtain the probability of each origin.
3) If the probability that the name belongs to some origin exceeds a value A (A is necessarily greater than 0.5), the name is sent only to the corresponding transliteration model.
4) Otherwise, the name is sent to every model whose origin probability is greater than a value B.
5) If the method in 4) is used, the results of the selected models are linearly interpolated, with each model's weight equal to the probability that the name belongs to that origin. For Chinese-English transliteration, values of A and B near 0.72 and 0.15 respectively work well (these are empirical values and also depend on the training corpus).
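The selection-and-interpolation strategy in steps 1) to 5) can be sketched as below. This is a minimal illustration under stated assumptions: the function name, the dictionary-based interfaces, and the idea that each per-origin model returns a candidate-to-probability mapping are all hypothetical, and the defaults for `a` and `b` are simply the empirical values quoted in the text.

```python
from collections import defaultdict

def fuse(name, origin_probs, models, a=0.72, b=0.15, user_origin=None):
    """origin_probs: dict origin -> P(origin | name)
    models: dict origin -> callable(name) returning dict candidate -> probability
    (both interfaces are assumptions made for this sketch)."""
    if user_origin is not None:                      # step 1: user-specified origin
        origin_probs = {o: float(o == user_origin) for o in models}
    best = max(origin_probs, key=origin_probs.get)
    if origin_probs[best] > a:                       # step 3: one confident origin
        selected = {best: origin_probs[best]}
    else:                                            # step 4: all origins above B
        selected = {o: p for o, p in origin_probs.items() if p > b}
    scores = defaultdict(float)
    for origin, weight in selected.items():          # step 5: linear interpolation
        for candidate, prob in models[origin](name).items():
            scores[candidate] += weight * prob
    # candidates ranked by interpolated score, best first
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Because the weights are the classifier's probabilities, a candidate proposed by several plausible origin models accumulates score from each of them. If no origin exceeds B, this sketch returns an empty list; how the real system handles that case is not stated.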
The transliteration model is a phrase-based translation system; when it is used for transliteration, its reordering function is ignored.
Beneficial Effects
The overall transliteration system to which the invention is applied is organized in three layers: a front end, an intermediate control layer, and a back-end system. The front end is the interface through which the user interacts with the back-end transliteration system; it accepts the names and commands entered by the user, passes them to the control layer, and receives the results and signals that the control layer returns. The intermediate layer connects the front end and the back end: it controls the back-end systems according to the input and signals from the front end and feeds the back-end results back to the front-end interface. The back end consists mainly of the name origin classification system and the name transliteration system. The front end is a web page, implemented mainly in HTML and CSS.
Name origin classification is based on the logistic regression model. In the multinomial logistic regression model, the class probabilities are computed with Formulas 1 and 2 above. For parameter training, the equation to be optimized is derived from the maximum-likelihood principle, and the feature weights are then solved for with the Newton-Raphson method.
The invention provides a method for mutual translation of bilingual personal names that classifies the origin of a name from the characteristics of the characters it is written with and fuses the outputs of several transliteration models for different origins. In bilingual name transliteration, the names in the training corpus usually originate from many countries, and pronunciation and translation conventions differ between the languages of different countries. Training separate translation models by name origin therefore helps the translation results considerably.
The proposed method applies a multi-class logistic regression model to name origin classification and classifies names using feature templates built from the characters that make up a name. A dedicated transliteration (translation) model is trained for each origin class, and the results of the several transliteration models are then fused to achieve mutual translation of bilingual names.
The main inventive content of the method lies in two points: name origin classification and linear-interpolation system fusion.
This patent applies the logistic regression model to name origin classification for the first time. The model was chosen mainly because features can be added, deleted and modified conveniently.
Embodiments of the Invention
Embodiment 1: In this embodiment, the name transliteration method based on name origin classification is carried out in the following steps:
I. Name origin classification:
Using the name origin feature template, a logistic regression model is computed:
Figure PCTCN2013088283-appb-M000012
Formula 1
Figure PCTCN2013088283-appb-M000013
Formula 2
In Formulas 1 and 2, K is 6 and Y takes the values 1 to 6, where 1 denotes China, 2 Britain/America, 3 Arabia, 4 Russia, 5 Japan and 6 Korea; x is the name origin feature template, P is the probability of an origin, and w is the feature weight vector.
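For orientation, the class probabilities of a K-class logistic regression can be computed as in the sketch below. Formulas 1 and 2 appear here only as figure references, so the parameterisation used (K-1 weight vectors, with the K-th class as the reference class) is the standard textbook form and an assumption about what the figures contain; the function name is hypothetical.

```python
import math

def origin_probabilities(x, weights):
    """x: feature vector (list of 0/1 values)
    weights: K-1 weight vectors; the K-th class is the reference class
    (assumed parameterisation, not confirmed by the source)."""
    # exp(w_k . x) for each of the first K-1 classes
    scores = [math.exp(sum(wi * xi for wi, xi in zip(w, x))) for w in weights]
    z = 1.0 + sum(scores)                  # the reference class contributes exp(0) = 1
    return [s / z for s in scores] + [1.0 / z]
```

With K = 6, `weights` would hold five vectors, and the returned list gives the probabilities for origins 1 to 6 in order; they sum to 1 by construction.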
The name origin feature template in step I is either the Chinese name origin feature template or the English name origin feature template.
The Chinese name origin feature template consists of a language model, the character TF-IDF, the length, and the surname.
The language model comprises integrated 1-gram, 2-gram and 3-gram models. To keep the number of such features from becoming too large, an integrated n-gram model divides the n-gram probability values into 100 intervals (numbered 1 to 100) on a minimum-variance basis, forming 100 features. The length is the number of Chinese characters. The surname feature is the surname confidence, which is the number of times the character occurs as a surname divided by its total number of occurrences.
The character TF-IDF features are the TF and the IDF of each single character of the given name. Commonly used name characters and the frequency of each are counted from the name corpus, yielding tables of commonly used characters for the six classes of names, and TF and IDF are then computed with the following two formulas:
Figure PCTCN2013088283-appb-M000014
Formula 4
Figure PCTCN2013088283-appb-M000015
Formula 5
In Formulas 4 and 5, x is the frequency of the i-th character in the training corpus; the denominator is the total number of occurrences in the training corpus of all characters in the table; N is the number of characters in the table; and DF is the number of name-origin classes that contain character i.
The English name origin feature template consists of a character language model, a syllable language model, the syllable TF-IDF, and the length.
The character language model comprises integrated 2-gram, 3-gram and 4-gram models, and the syllable language model comprises integrated 1-gram, 2-gram and 3-gram models. To keep the number of such features from becoming too large, an integrated n-gram model divides the n-gram probability values into 100 intervals (numbered 1 to 100) on a minimum-variance basis, forming 100 features. The length features are the number of characters and the number of syllables. English names are split into syllables as follows:
1. Replace 'x' with 'ks'.
2. {a, o, e, i, u} are the basic vowel characters; 'y' is treated as a vowel when it follows a consonant.
3. When 'w' is preceded by 'a', 'e' or 'o' and is not followed by 'h', 'w' and the preceding vowel are treated as a single new vowel symbol.
4. Except for {iu, eo, io, oi, ia, ui, ua, uo}, consecutive vowels are treated as a single new vowel symbol.
5. Separate adjacent consonants, and separate a vowel from the consonant that immediately follows it.
6. A consonant and the vowel that follows it form one syllable; the remaining isolated vowels and consonants are each a separate syllable.
The syllable TF-IDF features are the TF and the IDF of each syllable. Commonly used name syllables and the frequency of each are counted from the name corpus, yielding tables of commonly used syllables for the six classes of names, and TF and IDF are then computed with the following two formulas:
Figure PCTCN2013088283-appb-C000001
Formula 4
Figure PCTCN2013088283-appb-M000016
Formula 5
In Formulas 4 and 5, x is the frequency of the i-th syllable in the training corpus; the denominator is the total number of occurrences in the training corpus of all syllables in the table; N is the number of syllables in the table; and DF is the number of name-origin classes that contain syllable i.
II. Linear interpolation system fusion:
Figure PCTCN2013088283-appb-M000017
Formula 6
Figure PCTCN2013088283-appb-M000018
Formula 7
Figure PCTCN2013088283-appb-M000019
Formula 8
Figure PCTCN2013088283-appb-M000020
Formula 9
In Formulas 7, 8 and 9, T is the translation result, P is the probability of the translation result T, and t is the position in the source-language string that the translation has reached. In Formula 6, λ_i is the probability that S belongs to origin i. Formula 6 is the multi-system fusion strategy; Formulas 7, 8 and 9 are the decoding algorithm.
Embodiment 2: This embodiment differs from Embodiment 1 in that the language models in the Chinese name origin feature template of step I are trained with the SRILM toolkit. Every n-gram (n = 1, 2 or 3) has a probability. The one-dimensional distribution of all n-gram probabilities is collected and divided into 100 intervals. These 100 intervals form a clustering of the n-gram features, each interval representing one class, such that the sum of the variances within the intervals is minimized and the variance between the interval means is maximized. The boundary points of the 100 intervals are obtained from the n-gram data:
Figure PCTCN2013088283-appb-M000021
Formula 3
In Formula 3, λ is the set of 100 boundary points, x_i is the probability value of each n-gram, and y_j is the mean of the j-th interval. The TF and IDF value ranges are divided in the same way.
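One simple way to approximate such a minimum-variance partition of one-dimensional values is a 1-D k-means iteration, sketched below. Formula 3 appears here only as a figure reference, so this is a stand-in heuristic that reduces within-interval variance rather than the patent's exact optimization; the function names and the quantile initialisation are assumptions, and the input must be non-empty.

```python
import bisect

def fit_boundaries(values, k=100, iters=50):
    """Return k-1 cut points that roughly minimise within-interval variance
    (1-D k-means; a heuristic, not guaranteed to reach the global optimum)."""
    xs = sorted(values)
    # initialise the interval means at evenly spaced quantiles
    centres = [xs[(2 * j + 1) * len(xs) // (2 * k)] for j in range(k)]
    for _ in range(iters):
        # a value belongs to the nearest mean, so cuts are midpoints between means
        cuts = [(a + b) / 2 for a, b in zip(centres, centres[1:])]
        bins = [[] for _ in range(k)]
        for x in xs:
            bins[bisect.bisect_right(cuts, x)].append(x)
        # recompute each interval mean; an empty interval keeps its old mean
        centres = [sum(b) / len(b) if b else c for b, c in zip(bins, centres)]
    return [(a + b) / 2 for a, b in zip(centres, centres[1:])]

def bin_index(x, cuts):
    """1-based interval number of x, matching the 1-100 numbering in the text."""
    return bisect.bisect_right(cuts, x) + 1
```

The same two functions could be reused for the TF and IDF values (k = 100) and for the surname confidence (k = 20).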
Surname confidence feature: surnames in Chinese names are relatively fixed, and the commonly used ones are those of the Hundred Family Surnames. Several hundred surnames were extracted from the personal names in the People's Daily 1998 corpus, and each was manually labeled with a confidence value; the confidence is manually defined. Surnames such as 龚, 廖 and 覃 are considered to have higher confidence than 李, 王 and 周, while characters such as 白, 石 and 钱 have lower surname confidence still. The confidence values are distinguished by the ratio of "number of occurrences as a surname" to "total number of occurrences" of the character in the People's Daily. In the same way as the n-gram feature clustering, the surname confidence is divided into 20 levels.
The rest is the same as in Embodiment 1.
The effect of the invention was verified with the following experiment:
1. The user enters the name to be translated in the interactive interface and may or may not specify an origin class. As an example, the name "德川家康" (Tokugawa Ieyasu) is entered here without specifying its origin (the name is in fact Japanese).
2. Form the feature vector X of the name:
2.1 From the input name and the existing knowledge, form the classification vector X of the name "德川家康": obtain the language model probabilities of {德, 川, 家, 康, 德川, 川家, 家康, 德川家, 川家康} and, using the boundary points, map them onto the 100 intervals of the 1-gram, 2-gram and 3-gram models respectively. This gives the interval numbers {86, 30, 51, 63, 31, 12, 43, 5, 7} for Chinese, {51, 70, 81, 53, 11, 42, 43, 5, 7} for Japanese, {85, 3, 19, 33, 11, 5, 23, 5, 7} for European/American, and so on for all six origins.
2.2 Compute the TF and IDF of the characters {德, 川, 家, 康}. Mapping the IDF values onto the 100 IDF intervals gives the interval numbers {14, 57, 85, 41}; the TF values are obtained for each of the six origins, for example {3, 15, 7} for China and {50, 32, 76, 21} for Japan.
2.3 By default the first character is taken as the surname and the remaining characters as the given name, so the surname confidence of {德} is computed, giving confidence level {1}. There are 20 levels in total, and a higher level means higher confidence.
2.4 The length of the name is computed as {4}.
2.5 Using the feature information obtained in steps 2.1 to 2.4, the corresponding positions in the feature vector X are set to 1, and all features that were not hit are set to 0.
3. Using Formulas 1 and 2, the probability that the name belongs to each class is computed and normalized, giving the normalized probability vector (0.23, 0.07, 0.08, 0.05, 0.43, 0.14), where 1 denotes China, 2 Britain/America, 3 Arabia, 4 Russia, 5 Japan and 6 Korea.
4. Following the multi-system fusion strategy of Formula 6, the models 1 (China), 5 (Japan) and 6 (Korea) are selected for decoding. After the three systems are fused, the top-ranked transliteration is "tokugawaleyasu", the second is "tokuwavasu" and the third is "dekuanjiaking"; the top-ranked result is returned to the user. This shows that the mixed model helps to obtain the correct translation.
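Step 2.5 of the experiment above amounts to building a binary indicator vector from the interval numbers that were hit. A minimal sketch follows; the function name, the grouping of features and the group sizes passed in are hypothetical, and only the idea of setting hit positions to 1 and everything else to 0 comes from the text.

```python
def build_feature_vector(hits, group_sizes):
    """hits: dict group -> iterable of 1-based interval numbers that were hit
    group_sizes: dict group -> number of intervals in that group, in vector order
    (the grouping and the sizes are assumptions made for this sketch)."""
    offsets, total = {}, 0
    for group, size in group_sizes.items():   # lay the groups out one after another
        offsets[group] = total
        total += size
    x = [0] * total                           # features that are not hit stay 0
    for group, bins in hits.items():
        for b in bins:
            x[offsets[group] + b - 1] = 1     # hit features are set to 1
    return x
```

For instance, with a 100-interval 1-gram group followed by a 20-level surname group, the Chinese 1-gram interval numbers 86, 30, 51 and 63 and surname level 1 from the example would set five positions of a 120-element vector.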

Claims (2)

  1. A name transliteration method based on name origin classification, wherein the name origin classification features and method and the multi-system fusion method are carried out in the following steps:
    I. Name origin classification:
    Using the name origin feature template, a logistic regression model is computed:
    Figure PCTCN2013088283-appb-M000022
    Formula 1
    Figure PCTCN2013088283-appb-M000023
    Formula 2
    In Formulas 1 and 2, K is 6 and Y takes the values 1 to 6, where 1 denotes China, 2 Britain/America, 3 Arabia, 4 Russia, 5 Japan and 6 Korea; x is the name origin feature template, P is the probability of an origin, and w is the feature weight vector;
    the name origin feature template in step I is either the Chinese name origin feature template or the English name origin feature template;
    the Chinese name origin feature template consists of a language model, the character TF-IDF, the length, and the surname;
    the language model comprises integrated 1-gram, 2-gram and 3-gram models; to keep the number of such features from becoming too large, an integrated n-gram model divides the n-gram probability values into 100 intervals (numbered 1 to 100) on a minimum-variance basis, forming 100 features; the length is the number of Chinese characters; the surname feature is the surname confidence, which is the number of times the character occurs as a surname divided by its total number of occurrences;
    the character TF-IDF features are the TF and the IDF of each single character of the given name; commonly used name characters and the frequency of each are counted from the name corpus, yielding tables of commonly used characters for the six classes of names, and TF and IDF are then computed with the following two formulas:
    Figure PCTCN2013088283-appb-M000024
    Formula 4
    Figure PCTCN2013088283-appb-M000025
    Formula 5
    in Formulas 4 and 5, x is the frequency of the i-th character in the training corpus; the denominator is the total number of occurrences in the training corpus of all characters in the table; N is the number of characters in the table; and DF is the number of name-origin classes that contain character i;
    the English name origin feature template consists of a character language model, a syllable language model, the syllable TF-IDF, and the length;
    the character language model comprises integrated 2-gram, 3-gram and 4-gram models, and the syllable language model comprises integrated 1-gram, 2-gram and 3-gram models; to keep the number of such features from becoming too large, an integrated n-gram model divides the n-gram probability values into 100 intervals (numbered 1 to 100) on a minimum-variance basis, forming 100 features; the length features are the number of characters and the number of syllables; and English names are split into syllables as follows:
    1. replace 'x' with 'ks';
    2. {a, o, e, i, u} are the basic vowel characters, and 'y' is treated as a vowel when it follows a consonant;
    3. when 'w' is preceded by 'a', 'e' or 'o' and is not followed by 'h', 'w' and the preceding vowel are treated as a single new vowel symbol;
    4. except for {iu, eo, io, oi, ia, ui, ua, uo}, consecutive vowels are treated as a single new vowel symbol;
    5. adjacent consonants are separated, and a vowel is separated from the consonant that immediately follows it;
    6. a consonant and the vowel that follows it form one syllable, and the remaining isolated vowels and consonants are each a separate syllable;
    the syllable TF-IDF features are the TF and the IDF of each syllable; commonly used name syllables and the frequency of each are counted from the name corpus, yielding tables of commonly used syllables for the six classes of names, and TF and IDF are then computed with the following two formulas:
    Figure PCTCN2013088283-appb-M000026
    Formula 4
    Figure PCTCN2013088283-appb-M000027
    Formula 5
    in Formulas 4 and 5, x is the frequency of the i-th syllable in the training corpus; the denominator is the total number of occurrences in the training corpus of all syllables in the table; N is the number of syllables in the table; and DF is the number of name-origin classes that contain syllable i;
    II. Linear interpolation system fusion:
    Figure PCTCN2013088283-appb-M000028
    Formula 6
    Figure PCTCN2013088283-appb-M000029
    Formula 7
    Figure PCTCN2013088283-appb-M000030
    Formula 8
    Figure PCTCN2013088283-appb-M000031
    Formula 9
    In Formulas 7, 8 and 9, T is the translation result, P is the probability of the translation result T, and t is the position in the source-language string that the translation has reached; in Formula 6, λ_i is the probability that S belongs to origin i; Formula 6 is the multi-system fusion strategy, and Formulas 7, 8 and 9 are the decoding algorithm.
  2. The name transliteration method based on name origin classification according to claim 1, characterized in that the language models in the Chinese name origin feature template of step I are trained with the SRILM toolkit, wherein every n-gram (n = 1, 2 or 3) has a probability; the one-dimensional distribution of all n-gram probabilities is collected and divided into 100 intervals; these 100 intervals form a clustering of the n-gram features, each interval representing one class, such that the sum of the variances within the intervals is minimized and the variance between the interval means is maximized; and the boundary points of the 100 intervals are obtained from the n-gram data:
    Figure PCTCN2013088283-appb-M000032
    Formula 3
    In Formula 3, λ is the set of 100 boundary points, x_i is the probability value of each n-gram, and y_j is the mean of the j-th interval.
PCT/CN2013/088283 2012-12-24 2013-12-02 Name transliteration method based on classification of name origins WO2014101629A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
KR1020157020138A KR20150128656A (en) 2012-12-24 2013-12-02 Name transliteration method based on classification of name origins

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201210566217.X 2012-12-24
CN201210566217.XA CN103020046B (en) 2012-12-24 2012-12-24 Based on the name transliteration method of name origin classification

Publications (1)

Publication Number Publication Date
WO2014101629A1 true WO2014101629A1 (en) 2014-07-03

Family

ID=47968663

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2013/088283 WO2014101629A1 (en) 2012-12-24 2013-12-02 Name transliteration method based on classification of name origins

Country Status (3)

Country Link
KR (1) KR20150128656A (en)
CN (1) CN103020046B (en)
WO (1) WO2014101629A1 (en)

Families Citing this family (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103020046B (en) * 2012-12-24 2016-04-20 哈尔滨工业大学 Based on the name transliteration method of name origin classification
KR20180001889A (en) 2016-06-28 2018-01-05 삼성전자주식회사 Language processing method and apparatus
CN107066447B (en) * 2017-04-19 2021-03-26 广东惠禾科技发展有限公司 Method and equipment for identifying meaningless sentences
CN115662392B (en) * 2022-12-13 2023-04-25 中国科学技术大学 Transliteration method based on phoneme memory, electronic equipment and storage medium

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101650945A (en) * 2009-09-17 2010-02-17 浙江工业大学 Method for recognizing speaker based on multivariate core logistic regression model
CN103020046A (en) * 2012-12-24 2013-04-03 哈尔滨工业大学 Name transliteration method on the basis of classification of name origin

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102033879B (en) * 2009-09-27 2015-02-18 深圳市世纪光速信息技术有限公司 Method and device for identifying Chinese name

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
LI, SIFU ET AL.: "Performing Chinese Text Classification Using Logistic Regression Models.", COMPUTER ENGINEERING AND APPLICATIONS., vol. 45, no. 14, July 2009 (2009-07-01), pages 152 - 154 *

Also Published As

Publication number Publication date
KR20150128656A (en) 2015-11-18
CN103020046B (en) 2016-04-20
CN103020046A (en) 2013-04-03

Legal Events

Date Code Title Description
121 EP: the EPO has been informed by WIPO that EP was designated in this application

Ref document number: 13866669

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 20157020138

Country of ref document: KR

Kind code of ref document: A

122 EP: PCT application non-entry in European phase

Ref document number: 13866669

Country of ref document: EP

Kind code of ref document: A1