CN103902528A - Uygur language word alignment method - Google Patents

Publication number
CN103902528A
Authority
CN
Grant status
Application
Patent type
Prior art keywords
words
uygur
alignment
uighur
uygur language
Prior art date
Application number
CN 201210579979
Other languages
Chinese (zh)
Inventor
尼加提·纳吉米
买合木提·买买提
帕肉克·司地克
马斌
Original Assignee
新疆电力信息通信有限责任公司
Priority date
Filing date
Publication date

Abstract

The invention discloses a Uygur language word alignment method. The method automatically aligns Uygur words with Chinese words using five alignment relationships: one-to-one, one-to-many, many-to-one, many-to-many, and one-to-null. Words that the automatic alignment gets wrong are aligned manually, which improves the accuracy with which the system processes Uygur. Uygur words are also split and merged according to the characteristics of the language. The method assists Chinese-Uygur machine translation and the construction of electronic Uygur dictionaries, and lays a solid foundation for developing electronic dictionaries and machine-aided translation systems for Uzbek, Kazakh, Kyrgyz, and Turkish.

Description

Uygur language word alignment method

Technical field

[0001] The present invention relates to language information processing technology, and in particular to a Uygur language word alignment method.

Background

[0002] As the national economy and society become increasingly information-based, people demand faster and better ways to obtain, query, and translate information in many languages. Various electronic dictionary products and machine translation systems have been developed in response and have been well received by users. In machine translation, the quality of the corpus directly affects the quality of the translation. A Uygur word alignment system is an auxiliary tool for machine translation and corpus construction.

[0003] As machine translation and natural language processing systems move toward practical use, machine dictionaries and machine translation systems have become the focus of development, and the speed and quality of corpus construction are particularly important. Word alignment is the task of finding word-level translation correspondences in texts that are translations of each other. Natural language processing tasks that rely on bilingual corpora all require word-level alignment. There are currently four main approaches to word alignment: statistical methods, character-based methods, methods based on linguistic knowledge, and hybrid methods. Statistical methods train on a large-scale bilingual corpus to obtain the co-occurrence probabilities of translation-equivalent word pairs, and use these as the basis for alignment. Character-based methods align words using what cognates in the two languages have in common in part of speech. Methods based on linguistic knowledge use resources such as bilingual dictionaries and thesauri as the basis for alignment. Hybrid methods combine several methods, including the three above.
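The statistical approach described above can be illustrated with a minimal Python sketch. It is not the method of this patent, which does not disclose its scoring. It counts sentence-level co-occurrence over a toy parallel corpus and scores each word pair with the Dice coefficient, a common stand-in for the co-occurrence probability mentioned in the text. The function name `cooccurrence_scores` and the romanized toy tokens are assumptions made for illustration.

```python
from collections import Counter
from itertools import product


def cooccurrence_scores(sentence_pairs):
    """Score every (source word, target word) pair that co-occurs in a sentence pair.

    The score is the Dice coefficient 2 * c(s, t) / (c(s) + c(t)), where counts
    are numbers of sentence pairs. This is an illustrative choice, not a measure
    taken from the patent.
    """
    src_count, tgt_count, pair_count = Counter(), Counter(), Counter()
    for src_words, tgt_words in sentence_pairs:
        src_set, tgt_set = set(src_words), set(tgt_words)
        src_count.update(src_set)
        tgt_count.update(tgt_set)
        pair_count.update(product(src_set, tgt_set))
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in pair_count.items()
    }


# Toy corpus with romanized placeholder tokens (hypothetical, not patent data).
corpus = [
    (["men", "kitab"], ["我", "书"]),
    (["men", "alma"], ["我", "苹果"]),
]
scores = cooccurrence_scores(corpus)
```

Pairs that always appear together, such as `("men", "我")` in this toy corpus, receive the highest score, and an aligner could link each source word to its best-scoring target word. A real system would need far more data and a proper alignment model.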

[0004] In recent years, as information technology for ethnic minorities has developed, the construction of minority-language corpora in Xinjiang has also made new progress. Most of this work focuses on Uygur, however, and there are shortcomings both in support for other minority languages and in technical level.

Summary of the invention

[0005] The object of the present invention is to provide a Uygur language word alignment method that automatically aligns Uygur words, assists the construction of Uygur electronic dictionaries and Uygur corpora, provides a basis for research on Chinese-Uygur machine translation systems, and lays a solid foundation for developing electronic dictionaries and machine-aided translation systems for Uzbek, Kazakh, Kyrgyz, and Turkish.

[0006] The object of the present invention is achieved as follows. A Uygur language word alignment method: 1. Uygur words are aligned automatically, and the alignment relationship between Uygur words and Chinese words falls into five types: one-to-one, one-to-many, many-to-one, many-to-many, and one-to-null; 2. words that the automatic alignment gets wrong are aligned manually, which improves the accuracy with which the system processes Uygur; 3. Uygur words are split and merged according to the characteristics of the Uygur language.

[0007] The present invention concerns the alignment of Uygur words, and implements both their automatic alignment and their splitting and merging. Word alignment is one of the basic problems of corpus construction and a long-standing research topic. On the current market, this is the first system capable of aligning Uygur words. The invention automatically aligns submitted Uygur words. It is a useful auxiliary tool for building Uygur electronic dictionaries and Chinese-Uygur machine translation systems, and it lays a solid foundation for future Chinese-Uygur machine translation corpus construction and for developing electronic dictionaries and machine-aided translation systems for Uzbek, Kazakh, Kyrgyz, and Turkish. The invention is a Uygur word alignment system based on computational linguistics, linguistics, sociology, and computer information processing science. Its features are: Uygur words are aligned automatically according to the morphological characteristics of Uygur; words that were not aligned automatically can still be aligned; and Uygur words are split and merged according to the characteristics of the language.
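The source does not say how splitting and merging are carried out. As a hedged illustration only, the two operations can be modeled as edits on a token list, here in Python. The function names `split_token` and `merge_tokens`, the character-offset interface, and the romanized example word are all assumptions.

```python
def split_token(tokens, index, offset):
    """Return a new token list with tokens[index] split in two at `offset`."""
    word = tokens[index]
    if not 0 < offset < len(word):
        raise ValueError("offset must fall strictly inside the word")
    return tokens[:index] + [word[:offset], word[offset:]] + tokens[index + 1:]


def merge_tokens(tokens, start, end, joiner=""):
    """Return a new token list with tokens[start:end] (end exclusive) merged."""
    if not 0 <= start < end <= len(tokens) or end - start < 2:
        raise ValueError("need at least two adjacent tokens to merge")
    return tokens[:start] + [joiner.join(tokens[start:end])] + tokens[end:]


# Hypothetical romanized example: separate a stem from its suffix, then undo it.
tokens = ["kitablar", "oqudum"]
split = split_token(tokens, 0, 5)      # ["kitab", "lar", "oqudum"]
merged = merge_tokens(split, 0, 2)     # ["kitablar", "oqudum"]
```

Both functions return new lists instead of modifying their input, so an annotator's split or merge can be undone by keeping the previous list. Splitting is useful for an agglutinative language such as Uygur, where one written word may correspond to several Chinese words.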

[0008] The beneficial effects of the present invention are that the system automatically aligns Uygur words, assists the construction of Uygur electronic dictionaries and Uygur corpora, provides a basis for research on Chinese-Uygur machine translation systems, and lays a solid foundation for developing electronic dictionaries and machine-aided translation systems for Uzbek, Kazakh, Kyrgyz, and Turkish.

Brief description of the drawings

[0009] The present invention is further described below with reference to the accompanying drawing.

[0010] Fig. 1 is a flowchart of the present invention.

Detailed description

[0011] A Uygur language word alignment method: 1. Uygur words are aligned automatically, and the alignment relationship between Uygur words and Chinese words falls into five types: one-to-one, one-to-many, many-to-one, many-to-many, and one-to-null; 2. words that the automatic alignment gets wrong are aligned manually, which improves the accuracy with which the system processes Uygur; 3. Uygur words are split and merged according to the characteristics of the Uygur language.
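The five alignment relationships can be made concrete with a small Python sketch that labels a single alignment link by the number of words on each side. The names `AlignmentType` and `classify_alignment` are hypothetical, and the sketch assumes that "one-to-null" means one Uygur word with no Chinese counterpart, which the source does not spell out.

```python
from enum import Enum


class AlignmentType(Enum):
    ONE_TO_ONE = "one-to-one"
    ONE_TO_MANY = "one-to-many"
    MANY_TO_ONE = "many-to-one"
    MANY_TO_MANY = "many-to-many"
    ONE_TO_NULL = "one-to-null"


def classify_alignment(uygur_words, chinese_words):
    """Label one alignment link (Uygur side first) with one of the five types."""
    n_uy, n_zh = len(uygur_words), len(chinese_words)
    if n_uy == 0:
        raise ValueError("an alignment link needs at least one Uygur word")
    if n_zh == 0:
        # Assumption: only a single Uygur word may be left unaligned.
        if n_uy > 1:
            raise ValueError("the source lists no many-to-null relationship")
        return AlignmentType.ONE_TO_NULL
    if n_uy == 1:
        return AlignmentType.ONE_TO_ONE if n_zh == 1 else AlignmentType.ONE_TO_MANY
    return AlignmentType.MANY_TO_ONE if n_zh == 1 else AlignmentType.MANY_TO_MANY
```

For example, `classify_alignment(["kitab"], ["书"])` yields `AlignmentType.ONE_TO_ONE`, and an empty Chinese side yields `AlignmentType.ONE_TO_NULL`.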

[0012] As shown in Fig. 1, the user's role is determined first, and the sentences that have passed review are then retrieved. Words are split and merged according to the characteristics of Uygur words, and words that the automatic alignment got wrong are aligned manually. The alignment result is then saved, and sentences that contain errors are registered at the same time.
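The flow in Fig. 1 might be organized roughly as in the following Python sketch. Everything here is assumed for illustration: the role names, the `SentencePair` record, the `auto_align` callable, and the way manual corrections and erroneous sentences are passed in. The source gives only the order of the steps, and the split and merge step is left out of the sketch for brevity.

```python
from dataclasses import dataclass, field


@dataclass
class SentencePair:
    uygur: list
    chinese: list
    approved: bool = False                      # has the sentence passed review?
    links: list = field(default_factory=list)   # (uygur indices, chinese indices)
    flagged: bool = False                       # registered as an erroneous sentence


def run_alignment_session(role, pairs, auto_align, corrections=None, bad_sentences=()):
    """Check the role, align approved sentences, apply manual fixes, return saved pairs."""
    if role not in {"aligner", "admin"}:        # hypothetical role names
        raise PermissionError(f"role {role!r} may not align sentences")
    corrections = corrections or {}
    saved = []
    for i, pair in enumerate(pairs):
        if not pair.approved:                   # only reviewed sentences are aligned
            continue
        pair.links = auto_align(pair.uygur, pair.chinese)
        if i in corrections:                    # manual alignment overrides automatic
            pair.links = corrections[i]
        pair.flagged = i in bad_sentences       # register sentences that contain errors
        saved.append(pair)
    return saved
```

Passing the aligner in as a callable keeps the workflow independent of how automatic alignment is implemented, so a dictionary-based or statistical aligner could be swapped in without changing the session logic.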

Claims (1)

  1. A Uygur language word alignment method, characterized in that: 1. Uygur words are aligned automatically, and the alignment relationship between Uygur words and Chinese words falls into five types: one-to-one, one-to-many, many-to-one, many-to-many, and one-to-null; 2. words that the automatic alignment gets wrong are aligned manually, which improves the accuracy with which the system processes Uygur; 3. Uygur words are split and merged according to the characteristics of the Uygur language.
CN 201210579979 2012-12-28 2012-12-28 Uygur language word alignment method CN103902528A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201210579979 CN103902528A (en) 2012-12-28 2012-12-28 Uygur language word alignment method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN 201210579979 CN103902528A (en) 2012-12-28 2012-12-28 Uygur language word alignment method

Publications (1)

Publication Number Publication Date
CN103902528A (en) 2014-07-02

Family

ID=50993858

Family Applications (1)

Application Number Title Priority Date Filing Date
CN 201210579979 CN103902528A (en) 2012-12-28 2012-12-28 Uygur language word alignment method

Country Status (1)

Country Link
CN (1) CN103902528A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20110246177A1 (en) * 2010-04-06 2011-10-06 Samsung Electronics Co. Ltd. Syntactic analysis and hierarchical phrase model based machine translation system and method
CN102662932A (en) * 2012-03-15 2012-09-12 中国科学院自动化研究所 Method for establishing tree structure and tree-structure-based machine translation system
CN102708098A (en) * 2012-05-30 2012-10-03 中国科学院自动化研究所 Dependency coherence constraint-based automatic alignment method for bilingual words

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
张亚军 等: "汉语维吾尔语的一对一词对齐研究", 《昌吉学院院报》, no. 6, 15 December 2012 (2012-12-15), pages 80 - 83 *
李英 等: "一种基于词典和长度相结合的汉维句子对齐算法", 《新乡学院院报自然科学版》, vol. 29, no. 1, 15 May 2012 (2012-05-15), pages 66 - 68 *
麦热哈巴·艾力 等: "一种提高维吾尔语汉语词语对齐的方法研究", 《小型微型计算机系统》, vol. 33, no. 11, 15 November 2012 (2012-11-15), pages 2551 - 2555 *

Similar Documents

Publication Publication Date Title
Och et al. A systematic comparison of various statistical alignment models
Clark et al. Text normalization in social media: progress, problems and applications for a pre-processing system of casual English
Ramanathan et al. Simple syntactic and morphological processing can help English-Hindi statistical machine translation
CN101079028A (en) On-line translation model selection method of statistic machine translation
Miháltz et al. Methods and results of the Hungarian WordNet project
CN104408078A (en) Construction method for key word-based Chinese-English bilingual parallel corpora
CN103885938A (en) Industry spelling mistake checking method based on user feedback
CN101187924A (en) Method and system for obtaining word pair translation from bilingual sentence
Banea et al. Word sense disambiguation with multilingual features
Weese et al. Joshua 3.0: Syntax-based machine translation with the Thrax grammar extractor
CN101075251A (en) Method for searching file based on data excavation
US20130103390A1 (en) Method and apparatus for paraphrase acquisition
aTilde Improving SMT for Baltic languages with factored models
CN101488126A (en) Double-language sentence alignment method and device
Salloum et al. Dialectal arabic to english machine translation: Pivoting through modern standard arabic
Williams et al. Ghkm rule extraction and scope-3 parsing in moses
Nakazawa et al. Example-based machine translation based on deeper NLP
Koehn et al. More linguistic annotation for statistical machine translation
Islam Research on Bangla language processing in Bangladesh: progress and challenges
Li et al. Enriching Word Alignment with Linguistic Tags.
CN102622342A (en) Interlanguage system and interlanguage engine and interlanguage translation system and corresponding method
He et al. Cross-language information retrieval
Ture et al. Looking inside the box: Context-sensitive translation for cross-language information retrieval
CN102654867A (en) Webpage sorting method and system in cross-language search
CN102681983A (en) Alignment method and device for text data

Legal Events

Date Code Title Description
C06 Publication
C10 Entry into substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)