WO2016033617A2

WO2016033617A2 - Method of asynchronous machine translation

Info

Publication number: WO2016033617A2
Application number: PCT/VN2015/000010
Authority: WO
Inventors: Duy Thang Nguyen
Original assignee: Duy Thang Nguyen
Priority date: 2014-08-28
Filing date: 2015-08-27
Publication date: 2016-03-03
Also published as: WO2016033617A3

Abstract

The invention provides a method of asynchronous machine translation that standardizes and simplifies the translation process and improves translation quality. The invention divides the process of translating from language A to language B into two steps, each of which may be developed independently. Step 1: translating language A into values stored by the multipurpose language data storage method (intermediate data). Step 2: converting the intermediate data into any language B; the exported data may be text, sound, image, sign... The intermediate data makes it possible to eliminate synonyms. Dividing the translation process into two independent steps reduces the difficulty of translating between two languages (from multiplication to addition), allows simultaneous translation into multiple languages, allows new languages to be developed independently, and allows translation to be distributed from a single device across various devices.

Description

METHOD OF ASYNCHRONOUS MACHINE TRANSLATION

Technical field

The method of asynchronous machine translation applies to the field of machine translation.

Background of the invention

Three methods are currently used in machine translation:

+ Statistical machine translation (currently the main method)

+ Example-based machine translation

+ Rule-based machine translation (the classical approach)

Because translation with the above methods is a single continuous process, these methods have the following limitations:

- Direct multi-language translation is impossible (based on the concept of isotropy; the multi-language translation currently on offer is actually a repeated bilingual translation process).

- Translation cannot be split across different devices (for example, between a server and a client) or across two different applications.

- The ability to translate a new language cannot be developed independently (languages always come in bilingual pairs).

- If the difficulty of language A is taken as x and the difficulty of language B as y, the process of translating language A into language B has difficulty x*y (x multiplied by y).

Technical nature of the invention:

The purpose of the invention is to standardize the translation process, improve translation quality and simplify the translation process. To achieve this purpose, the invention provides a translation method consisting of two steps. The invention uses data of the multipurpose language data storage method (hereinafter called intermediate data) to connect those two steps.

Language A is translated into language B as follows (the two steps need not be performed in the same system or application):

Step 1: Translate language A into the intermediate data.

Step 2: Convert the intermediate data into language B.

Because this intermediate data is standardized, it can be converted into any other language in the same way as into language B.

Brief description of the drawings

Figure 1 illustrates the translation process.

Detailed description of the invention:

The intermediate data used in the invention is called DLSC (the type of data used in the method of data storage and language conversion). Each value in the database may correspond to a word, a phrase or even a complete sentence of natural language.

The database contains two types of values with a basic length of 4 bytes (4 bytes = 32 bits = 2³² possible values; the length can be extended to 64 bits or more).

- Data of type 1: divided into two parts, Content (21 bits) + Grammar (11 bits). The Content consists of two sub-parts, Part of speech (5 bits) and Value (16 bits). The Grammar consists of three sub-parts: General grammar, Synonyms and Expansion.

General grammar: stores the grammatical information of the most general language unit, common to all languages.

Synonyms: used to distinguish between synonyms. This is a way of reducing all synonyms to a single form. The number of supported synonyms may vary with Part of speech and Expansion.

Expansion: used to add grammatical elements specific to each language.

- Data of type 2: divided into two parts, Content (21 bits) + Additional information (11 bits).

The Content consists of two sub-parts, Part of speech (5 bits) and Value (16 bits). Additional information: used for storage and may be used to support translation.

The Content element has the same value in both of the above types and corresponds to one unit of natural language. Its value is constant across natural languages, so it acts as a bridge between languages and between the two types of data in the database. Data of type 2 may be expanded to 64 bits, 128 bits or more (variable length), because much information must be stored during translation. However, the first 21 bits of this data shall be identical to the first 21 bits of data of type 1.

Distribution of components within the 4-byte value (the positions of the variable areas may vary):

Part of speech is stored in bits 1 to 5, giving 32 values. (Part of speech influences Value and Grammar.)

Value is stored in bits 6 to 21, giving 65536 values.

Grammar is stored in bits 22 to 32, giving 2¹¹ values. (Content = Part of speech + Value = 2²¹ values.)
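The bit layout described above can be sketched as a pair of pack and unpack helpers. This is only an illustrative sketch: the function names are hypothetical, and it assumes that "bit 1" is the most significant bit of the 32-bit value, which the text does not state.

```python
from typing import Tuple

# Field widths taken from the description: 5 + 16 + 11 = 32 bits.
POS_BITS, VALUE_BITS, GRAMMAR_BITS = 5, 16, 11


def pack_dlsc(pos: int, value: int, grammar: int) -> int:
    """Pack Part of speech, Value and Grammar into one 32-bit integer.

    Assumption: Part of speech occupies the highest bits ("bit 1" = MSB).
    """
    if not 0 <= pos < (1 << POS_BITS):
        raise ValueError("Part of speech must fit in 5 bits")
    if not 0 <= value < (1 << VALUE_BITS):
        raise ValueError("Value must fit in 16 bits")
    if not 0 <= grammar < (1 << GRAMMAR_BITS):
        raise ValueError("Grammar must fit in 11 bits")
    return (pos << (VALUE_BITS + GRAMMAR_BITS)) | (value << GRAMMAR_BITS) | grammar


def unpack_dlsc(word: int) -> Tuple[int, int, int]:
    """Split a 32-bit integer back into (pos, value, grammar)."""
    grammar = word & ((1 << GRAMMAR_BITS) - 1)
    value = (word >> GRAMMAR_BITS) & ((1 << VALUE_BITS) - 1)
    pos = (word >> (VALUE_BITS + GRAMMAR_BITS)) & ((1 << POS_BITS) - 1)
    return pos, value, grammar


def content_of(word: int) -> int:
    """Return the 21-bit Content (Part of speech + Value) shared by all languages."""
    return word >> GRAMMAR_BITS
```

Two values that differ only in their Grammar bits return the same result from `content_of`, which is the property that lets Content act as the language-independent part of the value.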

If Part of speech has the value 0, the main value area is the Unicode table (it may be combined with the Grammar part to create codes larger than Unicode, or to determine the natural language from which the stored text originates).

The phrase 'each value of value area will correspond with' is abbreviated as 'evovawcw'.

If Part of speech has a value of 1, evovawcw an adverb.

If Part of speech has a value of 2, evovawcw an adjective.

If Part of speech has a value of 3, 4 or 5, evovawcw a noun denoting an animal.

If Part of speech has a value of 8 or 9, evovawcw a noun denoting a plant.

If Part of speech has a value of 12 or 13, evovawcw a noun denoting an object.

If Part of speech has a value of 16 or 17, evovawcw a noun denoting a fact, phenomenon...

If Part of speech has a value of 20, evovawcw a verb.

If Part of speech has a value of 21, evovawcw a conjunction, preposition, pronoun, interjection or article.

If Part of speech has a value of 22, evovawcw an idiom.

If Part of speech has a value of 23 or 24, evovawcw a sentence.

Part of speech values 6, 7, 10, 11, 14, 15, 18, 19 and 25 to 31 are reserved.
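The Part of speech codes listed above can be collected into a lookup table. The sketch below is illustrative only: the category labels and the function name are hypothetical, and treating the unassigned codes as "reserved" is an interpretation of the text.

```python
# Part of speech code -> category, following the list in the description.
POS_CATEGORIES = {
    0: "Unicode character",
    1: "adverb",
    2: "adjective",
    3: "noun (animal)", 4: "noun (animal)", 5: "noun (animal)",
    8: "noun (plant)", 9: "noun (plant)",
    12: "noun (object)", 13: "noun (object)",
    16: "noun (fact, phenomenon)", 17: "noun (fact, phenomenon)",
    20: "verb",
    21: "conjunction, preposition, pronoun, interjection or article",
    22: "idiom",
    23: "sentence", 24: "sentence",
}


def pos_category(pos: int) -> str:
    """Return the category for a 5-bit Part of speech code."""
    if not 0 <= pos <= 31:
        raise ValueError("Part of speech is a 5-bit field (0 to 31)")
    # Assumption: codes without an assigned category are treated as reserved.
    return POS_CATEGORIES.get(pos, "reserved")
```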

The value of Part of speech also influences the components of Grammar.

If Part of speech has a value of 1 or 2, the Grammar area includes three sub-parts:

The General grammar part determines the kind of comparison (superlative, comparative, comparison of equality, infinitive form, comparative of inferiority and superlative of inferiority).

If Part of speech has a value of 3, 4, 5, 8, 9, 12, 13, 16 or 17, Grammar is divided into the following three sub-parts:

The General grammar part determines the form and gender of the noun (singular, plural, masculine, feminine, neuter, infinitive). For some languages such as Russian, determination of manner is added in Expansion.manner (2³ values). English distinguishes countable and uncountable nouns, so its expansion part is Expansion.countability (2¹ values).

If Part of speech has a value of 20, Grammar is divided into the following three sub-parts:

The General grammar part determines the tense of the verb (past, present, future and infinitive). The sizes of Synonyms and Expansion change with the specific language. For example, Vietnamese uses Synonym 2⁵ + Expansion 2³, whereas English uses Synonym 2² + Expansion 2⁷, where Expansion 2⁷ = Expansion 2⁴ + Expansion.Person 2³. In Vietnamese, verb conjugation does not depend on Person and Form and has only three simple tenses, so Expansion is not needed to add grammatical elements. English verb conjugation is more complex, so it requires Expansion to conjugate tenses fully and to distinguish the Person and Form of the subject on which the verb depends.
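The language-dependent split of the 11-bit Grammar field for verbs can be sketched with a generic bit-field splitter. This is a sketch under several assumptions that are not in the text: the function and layout names are hypothetical, the sub-fields are assumed to be stored in the order listed starting from the highest bit, and the tense field is assumed to take 2 bits because four tense values are listed.

```python
from typing import Dict, List, Tuple

GRAMMAR_BITS = 11

# (name, width in bits). Synonym and Expansion widths come from the description;
# the 2-bit tense width and the field order are assumptions.
ENGLISH_VERB = [("tense", 2), ("synonym", 2), ("expansion", 4), ("person", 3)]
VIETNAMESE_VERB = [("tense", 2), ("synonym", 5), ("expansion", 3)]  # 1 bit unused


def split_grammar(grammar: int, layout: List[Tuple[str, int]]) -> Dict[str, int]:
    """Split an 11-bit Grammar value into named sub-fields, highest bits first."""
    if sum(width for _, width in layout) > GRAMMAR_BITS:
        raise ValueError("layout does not fit in 11 bits")
    if not 0 <= grammar < (1 << GRAMMAR_BITS):
        raise ValueError("Grammar must fit in 11 bits")
    fields = {}
    shift = GRAMMAR_BITS
    for name, width in layout:
        shift -= width  # move down to the start of this sub-field
        fields[name] = (grammar >> shift) & ((1 << width) - 1)
    return fields
```

The same 11 bits are read differently depending on the layout chosen for the language, which is why the Grammar part of a value is only meaningful together with the language that produced it.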

If Part of speech has a value in the range 21 to 24, the Grammar area includes two sub-parts:

The first two values of a data area containing values used in the invention identify the source language that produced the data. The first 4 bytes have the value 0 (check), and the second 4 bytes identify the language (set according to the national code, e.g. Vietnam = 84).
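The 8-byte header described above (4 bytes of zero followed by a 4-byte language code) can be sketched as follows. The function names are hypothetical, and the big-endian unsigned encoding is an assumption, since the text does not specify a byte order.

```python
import struct

VIETNAMESE = 84  # national code given in the description

# Assumption: both header values are big-endian unsigned 32-bit integers.
HEADER_FORMAT = ">II"


def write_header(language_code: int) -> bytes:
    """Build the 8-byte header: a zero check value, then the language code."""
    return struct.pack(HEADER_FORMAT, 0, language_code)


def read_header(data: bytes) -> int:
    """Read the first 8 bytes of a data area and return the source language code."""
    if len(data) < 8:
        raise ValueError("data area is shorter than the 8-byte header")
    check, language_code = struct.unpack(HEADER_FORMAT, data[:8])
    if check != 0:
        raise ValueError("the first 4 bytes must have the value 0")
    return language_code
```

A reader can call `read_header` first to decide whether the data was created from the language it is about to produce.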

Converting text into database values: determine the language used for storage, then use existing search algorithms (branch, node...) to perform the conversion (determining vocabulary, part of speech, general grammar and additional information). Characters that are not vocabulary are included in the variables as vocabulary.

Converting DLSC values into text or other forms such as sound, image...: because DLSC is value-based and structured, this conversion is not difficult, and only a few points need attention. Grammatical elements of DLSC that vary between languages may take different values, depending on the natural language that created those basic values. Two cases arise during data processing (determined by reading the first 8 bytes of the data area):

If the DLSC was created from the same language as the language to be processed (language A maps to the database and this database maps back to language A), Synonyms and Expansion are used in full, so the original can be restored exactly (complete recovery).

If the DLSC was created from a language different from the language to be processed (language B maps to DLSC and this DLSC maps to language A, which amounts to translating from language B into language A), the Synonyms and Expansion components are not used. Those two values are replaced with values of the language to be processed (language A) (incomplete recovery). For example, if a DLSC value is created from the word "soya", it will be processed as "soya" in Vietnamese. It is stored as Part of speech=3; Value=3; Npc.form=0; Npc.gender=0; Synonym=0; Npc.count=0. The values that will be used are Part of speech=3; Value=3; Npc.form=0. Npc.gender and Synonym take the default value 0, which determines the word "dau tuong". An exploratory variable can be used to scan all synonyms. For example, with Synonym=1 the word received is "do tuong"... (If the supplied Synonym value exceeds the number of actual synonyms, the first word is returned.) An example of using the database structure:

The Vietnamese nouns "dau tuong", "do tuong" and "dau nanh" (3 synonyms) are stored as follows:

The phrase 'is stored with Part of speech' is abbreviated as 'iswPos'.

Word "dau tuong" iswPos=Value=3, Npc.form=Npc.gender=0, Synonym=0. Word "do tuong" iswPos=Value=3, Npc.form=Npc.gender=0, Synonym=1. Word "dau nanh" iswPos=Value=3, Npc.form=Npc.gender=0, Synonym=2. Correspondingly, for the words "soya" and "soya bean" in English (2 synonyms), the component values are:

Word "soya" iswPos=Value=3, Npc.form=Npc.gender=0, Synonym=0; Npc.count=0. Word "soya bean" iswPos=Value=3, Npc.form=Npc.gender=0, Synonym=1; Npc.count=0. Therefore, to identify the soybean species in the Vietnamese-English pair (or any other language), only the two areas Part of speech=Value=3 need to be considered. The value of the Synonym area matters only within each specific language. In this way synonyms are eliminated.
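The soybean example can be sketched as a small lookup keyed by (Part of speech, Value), with the Synonym index selecting a word inside one language. The table and function names are hypothetical; the fallback to the first word follows the rule stated in the description for a Synonym value beyond the number of actual synonyms.

```python
# Hypothetical toy lexicon: language -> (Part of speech, Value) -> synonyms in index order.
LEXICON = {
    "vi": {(3, 3): ["dau tuong", "do tuong", "dau nanh"]},
    "en": {(3, 3): ["soya", "soya bean"]},
}


def lookup(language: str, pos: int, value: int, synonym: int = 0) -> str:
    """Return the word for a Content value, selecting one synonym by index."""
    words = LEXICON[language][(pos, value)]
    if synonym >= len(words):
        synonym = 0  # out-of-range Synonym falls back to the first word
    return words[synonym]


def translate(pos: int, value: int, target: str) -> str:
    """Cross-language case: the source Synonym is dropped and the default 0 is used."""
    return lookup(target, pos, value, synonym=0)
```

Only the pair (3, 3) crosses the language boundary; the Synonym index is interpreted separately inside each language's list.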

The intermediate data can store part of speech and vocabulary simultaneously (it cannot store the position of a part of speech within a sentence) and eliminates synonyms. Word order within a sentence differs between languages, so the invention uses a single sentence structure when storing data, so that languages can be converted easily (data export or import): Subject + predicate, where predicate = verb + complement, and a modifier stands after the word it modifies.
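The fixed storage order (modifier after the modified word) means that export has to reorder words for languages that place modifiers first. The sketch below is a minimal illustration with a hypothetical function name; it handles only a head word with its modifiers and uses English words for both orderings.

```python
from typing import List


def export_phrase(head: str, modifiers: List[str], modifier_first: bool) -> str:
    """Render a phrase stored as a head word followed by its modifiers.

    modifier_first=True models a target language that places modifiers
    before the head; False keeps the stored order.
    """
    words = modifiers + [head] if modifier_first else [head] + modifiers
    return " ".join(words)
```

For the stored pair head "house" and modifier "red", a modifier-first target yields "red house", while a target that keeps the stored order yields "house red".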

Machine translation consists of two steps, each of which may be developed independently and used for different purposes. Step 1 is the process of translating natural language (language A) into values stored by the multipurpose language storage method. Since each stored value corresponds to a language unit (word, phrase...), converting language A into database values is likewise based on searching by language unit. Existing algorithms may be used.

Step 2 is the process of converting the values created in Step 1 into different forms of any natural language. The data used in the invention is structured, so apart from being converted into text, it may also be converted into non-text forms. Vocabulary and grammar are stored as values and their positions, so these values can be transmitted to other applications or devices to perform this step.
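The separation of the two steps can be sketched as two functions that share nothing except a byte string, so that each could run in a different application or on a different device. Everything here is a toy: the lexicons and function names are hypothetical, Grammar bits are left at 0, and the big-endian layout with Part of speech in the highest bits is an assumption.

```python
import struct
from typing import List

# Hypothetical toy lexicons; a real system would search by language unit.
EN_TO_CONTENT = {"soya": (3, 3)}          # word -> (Part of speech, Value)
CONTENT_TO_VI = {(3, 3): "dau tuong"}     # (Part of speech, Value) -> word


def step1_encode(words: List[str]) -> bytes:
    """Step 1: convert source-language words into 32-bit intermediate values."""
    payload = b""
    for word in words:
        pos, value = EN_TO_CONTENT[word]
        packed = (pos << 27) | (value << 11)  # Grammar bits left at 0
        payload += struct.pack(">I", packed)
    return payload


def step2_decode(payload: bytes) -> List[str]:
    """Step 2: convert intermediate values into target-language words."""
    words = []
    for (packed,) in struct.iter_unpack(">I", payload):
        pos = packed >> 27
        value = (packed >> 11) & 0xFFFF
        words.append(CONTENT_TO_VI[(pos, value)])
    return words
```

The byte string returned by `step1_encode` is the only thing that has to travel between the two sides, for example from a server to a client.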

In both processes, attention must be paid to the positions of the database values, which are arranged according to rules so that processing is exact.

By dividing the translation process into two steps as described above, the difficulty of translation changes from multiplication to addition. Elimination of synonyms and a fixed grammatical order also make translation easier.
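One way to read the change from multiplication to addition across many languages is to count the components that must be built. This is an interpretation rather than a formula from the text: it assumes one translator per ordered language pair for direct translation, and one Step 1 converter plus one Step 2 converter per language for the two-step method. The function names are hypothetical.

```python
def direct_translators(n_languages: int) -> int:
    """Ordered language pairs needed when every pair is translated directly."""
    return n_languages * (n_languages - 1)


def two_step_components(n_languages: int) -> int:
    """One Step 1 converter and one Step 2 converter per language."""
    return 2 * n_languages


def added_by_new_language(n_existing: int) -> tuple:
    """Extra components needed to add one language: (direct, two-step)."""
    direct = direct_translators(n_existing + 1) - direct_translators(n_existing)
    two_step = two_step_components(n_existing + 1) - two_step_components(n_existing)
    return direct, two_step
```

Under these assumptions, the cost of adding a language to the two-step method stays constant at two components, while the direct approach grows with the number of languages already supported.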

To translate from language A into any language X, only step 1 of the process needs attention. To translate from any language Y into language B, only step 2 of the process needs attention. Development of a new language is therefore easier and more independent.

Achieved efficiency:

Using data of the method of data storage and language conversion enables the invention to separate the translation process into two independent steps, to eliminate synonyms, and to store vocabulary and grammar information in full. Since the translation process is divided into two steps, the invention achieves the following:

Multi-language translation is supported simultaneously without intermediate translation steps, improving quality and time.

Translation of a new language may be developed independently.

The difficulty of translation between languages is reduced.

Translation may be performed across various devices or applications.

Claims

CLAIM

1. Asynchronous machine translation. Unlike the normal translation method, the process of translating from language A to language B is divided into two independent steps (in current translation methods, translation of language A to language B is a continuous process). The intermediate data (intermediate value) used in the invention is the data used in the multipurpose language data storage method. This method includes two parts:

a. Receiving data in the language to be translated (A) and converting language A into intermediate values which store both vocabulary and part of speech; the positions of those values are arranged according to a fixed structure (a modifier stands after the modified word; the sentence structure is Subject + predicate (verb + complements)).

b. Receiving the intermediate data and converting it into language B; part of speech and vocabulary are stored in the intermediate values, and sentence structure and modifier positions are fixed when stored as intermediate values, so positions must be converted to suit language B.

The objective is to make machine translation simpler and more efficient.