CN103902528A

CN103902528A - Uygur language word alignment method

Info

Publication number: CN103902528A
Application number: CN201210579979.3A
Authority: CN
Inventors: 尼加提·纳吉米; 买合木提·买买提; 帕肉克·司地克; 马斌
Original assignee: Xinjiang Electric Power Information Communication Co Ltd
Current assignee: Xinjiang Electric Power Information Communication Co Ltd
Priority date: 2012-12-28
Filing date: 2012-12-28
Publication date: 2014-07-02

Abstract

The invention discloses a Uygur language word alignment method. The method realizes automatic alignment of Uygur language words, with five alignment relationships between Uygur language words and Chinese words: one-to-one, one-to-many, many-to-one, many-to-many and one-to-none. Words that the automatic alignment gets wrong are aligned manually, which improves the accuracy with which the system processes the Uygur language. Uygur language words are split and merged according to the characteristics of the language. The method provides assistance for Chinese-Uygur machine translation and for building electronic Uygur language dictionaries, and lays a solid foundation for the development of Uzbek, Kazakh, Kyrgyz and Turkish electronic dictionaries and machine-aided translation systems.

Description

Uighur word alignment method

Technical field

The present invention relates to language information processing technology, and in particular to a Uighur word alignment method.

Background technology

With the growing use of information technology across the national economy and society, people demand faster and better acquisition, retrieval and translation of information in all kinds of languages. In response, a variety of electronic dictionary products and machine translation systems have been developed and have been well received by users. In machine translation, the quality of the corpus directly affects the quality of the translation, and a Uighur word alignment system is an aid to both machine translation and corpus construction.

As machine translation and natural language processing systems are put into practical use, machine dictionaries and machine translation systems have become a focus of development, and the speed and quality of corpus construction are particularly important. Word alignment means finding translation correspondences, at the level of individual words, between texts that are translations of each other. Natural language processing tasks built on bilingual corpora all require alignment at the word level. There are currently four main approaches to word alignment: statistical methods, character-based methods, methods based on linguistic knowledge, and hybrid methods. Statistical methods train on a large-scale bilingual corpus to obtain the co-occurrence probabilities of word pairs that translate each other, and use these probabilities as the basis for alignment. Character-based methods align words using cognates, that is, words in the two languages that share features of word form. Methods based on linguistic knowledge use resources such as bilingual dictionaries and thesauri as the basis for alignment. Hybrid methods combine several of the three preceding methods.
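As a rough illustration of the statistical approach described above, the sketch below scores candidate word pairs by how often they co-occur in aligned sentence pairs. The Dice coefficient is used here only as a simple stand-in for a co-occurrence probability; the patent does not name a specific statistic. The function name `dice_scores` and the tokens `u_a`, `c_a`, etc. are hypothetical placeholders, not real Uighur or Chinese words.

```python
from collections import Counter
from itertools import product


def dice_scores(pairs):
    """Score word pairs by sentence-level co-occurrence (Dice coefficient).

    pairs: list of (source_tokens, target_tokens) for aligned sentences.
    This is an illustrative statistic, not the patent's own formula.
    """
    src_count, tgt_count, joint = Counter(), Counter(), Counter()
    for src, tgt in pairs:
        src_set, tgt_set = set(src), set(tgt)
        src_count.update(src_set)                 # sentences containing s
        tgt_count.update(tgt_set)                 # sentences containing t
        joint.update(product(src_set, tgt_set))   # sentences containing both
    return {
        (s, t): 2 * n / (src_count[s] + tgt_count[t])
        for (s, t), n in joint.items()
    }


# Toy parallel corpus with placeholder tokens (assumption for illustration).
corpus = [
    (["u_a", "u_b"], ["c_a", "c_b"]),
    (["u_a", "u_c"], ["c_a", "c_c"]),
    (["u_b", "u_c"], ["c_b", "c_c"]),
]
scores = dice_scores(corpus)

# Pick the highest-scoring target word for each source word.
best = {}
for (s, t), score in scores.items():
    if s not in best or score > scores[(s, best[s])]:
        best[s] = t
```

With this toy corpus, each source token pairs most strongly with the target token it always appears alongside. A real system would train on a much larger corpus and would typically use a proper alignment model rather than a single association score.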

In recent years, as informatization among ethnic minorities has advanced, corpus construction for the minority languages of Xinjiang has also made new progress. Most of this work, however, focuses on Uighur, and there are still shortcomings in the support for other minority languages and in the level of the technology.

Summary of the invention

The object of the present invention is to provide a Uighur word alignment method that realizes the automatic alignment of Uighur words and thereby assists the building of Uighur electronic dictionaries and the construction of Uighur corpora. It provides a basis for research on Chinese-Uighur machine translation systems and lays a solid foundation for the development of Uzbek, Kazakh, Kyrgyz and Turkish electronic dictionaries and machine-aided translation systems.

The object of the present invention is achieved as follows. A Uighur word alignment method: (1) realizes the automatic alignment of Uighur words, the alignment relation between Uighur words and Chinese words being divided into five kinds, namely one-to-one, one-to-many, many-to-one, many-to-many and one-to-none; (2) manually aligns the words that the automatic alignment got wrong, which improves the accuracy with which the system processes Uighur; (3) splits and merges Uighur words according to the characteristics of the Uighur language.
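The five alignment relations listed above can be made concrete with a small sketch. It assumes, purely for illustration, that an alignment is stored as a set of (Uighur word index, Chinese word index) links; linked words are grouped together and each group is labeled by how many words it holds on each side. The function name `classify_alignment` and the data layout are hypothetical and are not taken from the patent.

```python
from collections import defaultdict


def classify_alignment(n_src, links):
    """Label each group of linked words with one of the five relations.

    n_src: number of Uighur (source) words in the sentence.
    links: set of (src_index, tgt_index) pairs; hypothetical representation.
    Returns a list of (src_indices, tgt_indices, label) tuples.
    """
    src_to_tgt, tgt_to_src = defaultdict(set), defaultdict(set)
    for s, t in links:
        src_to_tgt[s].add(t)
        tgt_to_src[t].add(s)

    groups, seen = [], set()
    for start in range(n_src):
        if start in seen:
            continue
        # Collect every word reachable from `start` through links.
        src_group, tgt_group, stack = set(), set(), [start]
        while stack:
            s = stack.pop()
            if s in src_group:
                continue
            src_group.add(s)
            for t in src_to_tgt.get(s, ()):
                tgt_group.add(t)
                stack.extend(tgt_to_src[t])
        seen |= src_group

        if not tgt_group:
            label = "one-to-none"
        elif len(src_group) == 1 and len(tgt_group) == 1:
            label = "one-to-one"
        elif len(src_group) == 1:
            label = "one-to-many"
        elif len(tgt_group) == 1:
            label = "many-to-one"
        else:
            label = "many-to-many"
        groups.append((sorted(src_group), sorted(tgt_group), label))
    return groups


# Seven source words covering all five relations (illustrative indices only).
links = {(0, 0), (1, 1), (1, 2), (2, 3), (3, 3), (4, 4), (4, 5), (5, 5)}
groups = classify_alignment(7, links)
labels = [label for _, _, label in groups]
```

Grouping by connected links is one possible reading of the five relations; the patent itself does not state how the relation type of a word is determined.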

The present invention relates to the alignment of Uighur words, and realizes both the automatic alignment of Uighur words and their splitting and merging. Word alignment is one of the basic problems of corpus construction and has been studied for a long time. A system of this kind that can align Uighur words is, at present, the first on the market. The invention automatically aligns the Uighur words submitted to it. It is a good aid for building Uighur electronic dictionaries and for Chinese-Uighur machine translation systems; it also lays a solid foundation for the future construction of Chinese-Uighur machine translation corpora and for the development of Uzbek, Kazakh, Kyrgyz and Turkish electronic dictionaries and machine-aided translation systems. The present invention is a Uighur word alignment system based on computational linguistics, linguistics, sociology and computer information processing science. It is characterized in that: Uighur words are aligned automatically according to the morphological features of Uighur; words that were not aligned automatically can be aligned manually; and the system splits and merges Uighur words according to the characteristics of the Uighur language.
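The splitting and merging of words mentioned above can be pictured as two operations on a tokenized sentence. The sketch below is a minimal illustration that assumes the split point or merge range is supplied by the user or by some morphological analysis; the patent does not describe how those boundaries are chosen. The function names and the placeholder tokens are hypothetical.

```python
def split_word(tokens, index, position):
    """Split tokens[index] into two tokens at character offset `position`."""
    word = tokens[index]
    if not 0 < position < len(word):
        raise ValueError("split position must fall inside the word")
    return tokens[:index] + [word[:position], word[position:]] + tokens[index + 1:]


def merge_words(tokens, start, end, joiner=""):
    """Merge tokens[start:end] (end exclusive) into a single token."""
    if not 0 <= start < end <= len(tokens) or end - start < 2:
        raise ValueError("merge range must cover at least two tokens")
    return tokens[:start] + [joiner.join(tokens[start:end])] + tokens[end:]


# Placeholder tokens: "stemSUFFIX" stands in for a stem with an attached suffix.
tokens = ["stemSUFFIX", "w2", "w3"]
after_split = split_word(tokens, 0, 4)
after_merge = merge_words(after_split, 2, 4, joiner=" ")
```

Both functions return new lists instead of modifying their input, so an earlier tokenization stays available if a manual correction needs to be undone.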

The beneficial effects of the invention are as follows. The system realizes the automatic alignment of Uighur words and thereby assists the building of Uighur electronic dictionaries and the construction of Uighur corpora. It provides a basis for research on Chinese-Uighur machine translation systems and lays a solid foundation for the development of Uzbek, Kazakh, Kyrgyz and Turkish electronic dictionaries and machine-aided translation systems.

Description of the drawings

The invention is further described below with reference to the accompanying drawing.

Fig. 1 is a flow chart of the present invention.

Embodiment

A Uighur word alignment method: (1) realizes the automatic alignment of Uighur words, the alignment relation between Uighur words and Chinese words being divided into five kinds, namely one-to-one, one-to-many, many-to-one, many-to-many and one-to-none; (2) manually aligns the words that the automatic alignment got wrong, which improves the accuracy with which the system processes Uighur; (3) splits and merges Uighur words according to the characteristics of the Uighur language.

As shown in Fig. 1, the user's role is determined first, and the sentences that have passed review are then retrieved. Words are split and merged according to the characteristics of Uighur words, and words that the automatic alignment got wrong are aligned manually. The alignment result is then saved, and the sentences that contain errors are recorded at the same time.
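The flow described above can be sketched as a single processing function. Everything concrete in it is an assumption made for illustration: the role names, the `Sentence` fields, the `auto_align` callback and the `corrections` mapping are hypothetical, and the splitting and merging step is left out for brevity.

```python
from dataclasses import dataclass, field

ALLOWED_ROLES = {"aligner", "reviewer"}  # hypothetical role names


@dataclass
class Sentence:
    uighur: list
    chinese: list
    reviewed: bool = True                      # has the sentence passed review?
    links: set = field(default_factory=set)    # (uighur_idx, chinese_idx) pairs
    has_error: bool = False


def process(role, sentences, auto_align, corrections):
    """Role check, automatic alignment, manual correction, save, flag errors.

    auto_align: callable returning links for one sentence (assumed interface).
    corrections: dict mapping sentence index to manually corrected links.
    """
    if role not in ALLOWED_ROLES:
        raise PermissionError(f"role {role!r} may not align sentences")
    saved, flagged = [], []
    for idx, sent in enumerate(sentences):
        if not sent.reviewed:
            continue                           # only reviewed sentences are processed
        sent.links = set(auto_align(sent.uighur, sent.chinese))
        fix = corrections.get(idx)
        if fix is not None:
            sent.links = set(fix)              # manual alignment overrides the automatic one
            sent.has_error = True
            flagged.append(idx)                # record the sentence that had an error
        saved.append(sent)
    return saved, flagged


def diagonal(u, c):
    """Toy aligner: link words at the same position (illustration only)."""
    return {(i, i) for i in range(min(len(u), len(c)))}


sentences = [
    Sentence(["u1", "u2"], ["c1", "c2"]),
    Sentence(["u3"], ["c3", "c4"]),
    Sentence(["u4"], ["c5"], reviewed=False),
]
saved, flagged = process("aligner", sentences, diagonal, {1: {(0, 0), (0, 1)}})
```

Here the second sentence receives a manual one-to-many correction and is recorded as having had an error, while the unreviewed third sentence is skipped.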

Claims

1. A Uighur word alignment method, characterized in that: (1) automatic alignment of Uighur words is realized, the alignment relation between Uighur words and Chinese words being divided into five kinds, namely one-to-one, one-to-many, many-to-one, many-to-many and one-to-none; (2) words that the automatic alignment got wrong are aligned manually, which improves the accuracy with which the system processes Uighur; (3) Uighur words are split and merged according to the characteristics of the Uighur language.