WO2016033617A2 - Method of asynchronous machine translation - Google Patents

Publication number
WO2016033617A2
Authority
WO
WIPO (PCT)
Prior art keywords
language
translation
value
data
speech
Prior art date
Application number
PCT/VN2015/000010
Other languages
French (fr)
Other versions
WO2016033617A3 (en)
Inventor
Duy Thang Nguyen
Original Assignee
Duy Thang Nguyen
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Duy Thang Nguyen
Publication of WO2016033617A2
Publication of WO2016033617A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/55 Rule-based translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides an asynchronous machine translation method that standardizes and simplifies the translation process and improves translation quality. The invention divides the process of translating from language A to language B into two steps, each of which may be developed independently. Step 1: translating language A into values stored by the multipurpose language data storage method (intermediate data). Step 2: converting the intermediate data into any language B; the exported data may be text, sound, image, sign... The intermediate data makes it possible to eliminate synonyms. Dividing the translation process into two independent steps reduces the difficulty of translation between two languages (changing it from multiplication to addition), and makes it possible to translate into multiple languages simultaneously, to develop new languages independently, and to spread translation from a single device across various devices.

Description

METHOD OF ASYNCHRONOUS MACHINE TRANSLATION
Technical field
The method of asynchronous machine translation is applied in the field of machine translation.
Background of the invention
Three methods are currently used in machine translation:
+ Statistical machine translation (the main current method).
+ Example-based machine translation.
+ Rule-based machine translation, or the classical approach.
Translation using the above methods is a continuous process, so these methods have the following limitations:
- Direct multi-language translation is impossible (based on the concept of isotropy; the multi-language translation currently on offer is actually a repeated bilingual translation process).
- Translation cannot be split across different systems or devices (for example, between a server and a client) or across two different applications.
- The ability to translate a new language cannot be developed independently (there is always a pair of languages).
- If the difficulty of language A is taken as x and the difficulty of language B as y, the process of translating language A into language B has difficulty x*y (x multiplied by y).
Technical nature of the invention:
The purpose of the invention is to standardize the translation process, improve translation quality and simplify the translation process. To achieve this purpose, the invention provides a translation method consisting of two steps. The invention uses data of the multipurpose language data storage method (hereinafter called intermediate data) to connect the two steps.
Translation of language A into language B is done as follows (the two steps need not be performed in the same system or application):
Step 1: Translate language A into the intermediate language.
Step 2: Convert the intermediate language into language B.
The intermediate data is standardized, so it can be converted into any other language in the same way as into language B.
Brief description of the drawings
Figure 1 in the document describes the process of translating.
Detailed description of the invention:
The intermediate data used in the invention is called DLSC (the type of data used in the method of data storage and language conversion). Each value in the database may correspond to a word, a phrase or even a complete sentence of natural language.
The database includes two types of values with a basic length of 4 bytes (4 bytes = 32 bits = 2^32 values), which can be extended to 64 bits or more.
- Data of type 1: divided into 2 parts, Content (21 bits) + Grammar (11 bits). The Content includes 2 sub-parts, Part of speech (5 bits) and Value (16 bits). The Grammar includes three sub-parts, General grammar, Synonyms and Expansion.
General grammar: stores the grammatical information of a language unit that is most general across all languages.
Synonyms: used to distinguish between synonyms. This is a way of reducing all synonyms to a single form. The number of supported synonyms may vary depending on Part of speech and Expansion.
Expansion: used to add grammatical elements for each specific language.
- Data of type 2: divided into 2 parts, Content (21 bits) + Additional information (11 bits). The Content includes 2 sub-parts, Part of speech (5 bits) and Value (16 bits). Additional information: used for storage and may be used to support translation.
The Content element has the same value in both types and corresponds to a unit of natural language. (The value of this element is constant across different natural languages. It acts as a connecting bridge among languages and between the two types of data in the database.) Data of type 2 may be expanded to 64 bits, 128 bits or more (variable), because much information needs to be stored during translation. However, the first 21 bits of this data shall be identical to the first 21 bits of data of type 1.
Distribution of components within the 4-byte value (the positions of the variable areas may vary):
- Part of speech is stored from bit 1 to bit 5, giving 32 values. (Part of speech influences Value and Grammar.)
- Value is stored from bit 6 to bit 21, giving 65536 values.
- Grammar is stored from bit 22 to bit 32, giving 2^11 values.
(Content = Part of speech + Value = 2^21 values.)
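The bit layout described above can be sketched in code. The following is a minimal illustration, not the inventor's implementation: the function names are hypothetical, and it assumes that "bit 1" is the most significant bit, so that Part of speech and Value form the first 21 bits of the 32-bit value.

```python
# Field widths taken from the description: 5 + 16 + 11 = 32 bits.
POS_BITS, VALUE_BITS, GRAMMAR_BITS = 5, 16, 11


def pack_dlsc(pos: int, value: int, grammar: int) -> int:
    """Pack the three fields into one 32-bit integer.

    Assumption: "bit 1" in the text is the most significant bit,
    so Part of speech occupies the top 5 bits.
    """
    if not 0 <= pos < (1 << POS_BITS):
        raise ValueError("Part of speech must fit in 5 bits")
    if not 0 <= value < (1 << VALUE_BITS):
        raise ValueError("Value must fit in 16 bits")
    if not 0 <= grammar < (1 << GRAMMAR_BITS):
        raise ValueError("Grammar must fit in 11 bits")
    return (pos << (VALUE_BITS + GRAMMAR_BITS)) | (value << GRAMMAR_BITS) | grammar


def unpack_dlsc(word: int) -> tuple[int, int, int]:
    """Split a 32-bit integer back into (Part of speech, Value, Grammar)."""
    grammar = word & ((1 << GRAMMAR_BITS) - 1)
    value = (word >> GRAMMAR_BITS) & ((1 << VALUE_BITS) - 1)
    pos = (word >> (VALUE_BITS + GRAMMAR_BITS)) & ((1 << POS_BITS) - 1)
    return pos, value, grammar


def content_of(word: int) -> int:
    """Return the 21-bit Content (Part of speech + Value), ignoring Grammar."""
    return word >> GRAMMAR_BITS
```

Two values that differ only in their Grammar bits give the same result from `content_of`, which mirrors the statement that Content is what links a unit across languages and data types.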
If Part of speech has a value of 0, the main value area is the Unicode table (it may be combined with the Grammar part to create codes larger than Unicode, or to determine the natural language from which the stored text originates).
The phrase 'each value of value area will correspond with' is 'evovawcw' for short.
If Part of speech has a value of 1, evovawcw an adverb.
If Part of speech has a value of 2, evovawcw an adjective.
If Part of speech has a value of 3, 4 or 5, evovawcw a noun for an animal.
If Part of speech has a value of 8 or 9, evovawcw a noun for a plant.
If Part of speech has a value of 12 or 13, evovawcw a noun for an object.
If Part of speech has a value of 16 or 17, evovawcw a noun for a fact, phenomenon...
If Part of speech has a value of 20, evovawcw a verb.
If Part of speech has a value of 21, evovawcw a conjunction, preposition, pronoun, interjection or article.
If Part of speech has a value of 22, evovawcw an idiom.
If Part of speech has a value of 23 or 24, evovawcw a sentence.
Values 6, 7, 10, 11, 14, 15, 18, 19 and 25 to 31 of Part of speech are held in reserve.
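The assignment of Part of speech values listed above can be expressed as a lookup table. This is only a sketch: the category labels and the function name are hypothetical, and treating the remaining values as "unassigned" is an interpretation of the source wording.

```python
# Category labels are illustrative; the numeric assignments follow the list above.
POS_CATEGORIES = {
    0: "Unicode character",
    1: "adverb",
    2: "adjective",
    3: "noun (animal)", 4: "noun (animal)", 5: "noun (animal)",
    8: "noun (plant)", 9: "noun (plant)",
    12: "noun (object)", 13: "noun (object)",
    16: "noun (fact, phenomenon)", 17: "noun (fact, phenomenon)",
    20: "verb",
    21: "conjunction, preposition, pronoun, interjection or article",
    22: "idiom",
    23: "sentence", 24: "sentence",
}


def pos_category(pos: int) -> str:
    """Return the category for a 5-bit Part of speech value.

    Values with no entry (6, 7, 10, 11, 14, 15, 18, 19 and 25 to 31)
    are reported as "unassigned" (an assumption about the source's intent).
    """
    if not 0 <= pos <= 31:
        raise ValueError("Part of speech is a 5-bit field (0 to 31)")
    return POS_CATEGORIES.get(pos, "unassigned")
```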
The value of Part of speech also influences the components of Grammar.
If Part of speech has a value of 1 or 2, the Grammar area includes three sub-parts. The General grammar part determines the kind of comparison (base (infinitive) form, equality, comparative, superlative, comparative of inferiority and superlative of inferiority).
If Part of speech has a value of 3, 4, 5, 8, 9, 12, 13, 16 or 17, Grammar is divided into three sub-parts. The General grammar part determines the form and gender of the noun (singular, plural, masculine, feminine, neuter, infinitive). Determination of manner in some languages, such as Russian, is added as Expansion.manner (2^3 values). English distinguishes countable and uncountable nouns, so its expansion part is Expansion.countability (2^1 values).
If Part of speech has a value of 20, Grammar is divided into three sub-parts. The General grammar part determines the tense of the verb (past, present, future and infinitive). The number of Synonyms and Expansion values changes with the specific language. For example, for Vietnamese the split is Synonym (2^5 values) + Expansion (2^3 values), whereas for English it becomes Synonym (2^2 values) + Expansion (2^7 values), since the 2^7 Expansion values combine Expansion (2^4) with Expansion.Person (2^3). In Vietnamese, conjugation of the verb does not depend on Person and Form and there are only three simple tenses, so Expansion is not needed to add grammatical elements. English differs: verb conjugation is more complex, so Expansion is required to conjugate tenses fully and to distinguish the Person and Form of the subject on which the verb depends.
If Part of speech has a value in the range 21 to 24, the Grammar area includes two sub-parts.
The first two values of a data area containing values used in the invention identify the source language that produced the data. The first 4 bytes have a value of 0 (check); the second 4 bytes identify the language (set according to the national code, for example Vietnam = 84).
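The 8-byte prefix described here (4 check bytes equal to 0, then 4 bytes identifying the source language) might be written and read as follows. This is a sketch under assumptions: the source does not specify byte order, so big-endian is assumed, and the function names are hypothetical.

```python
import struct

# Assumption: two big-endian unsigned 32-bit integers (check value, language code).
HEADER = struct.Struct(">II")


def write_header(language_code: int) -> bytes:
    """Build the 8-byte prefix: a zero check value followed by the language code."""
    return HEADER.pack(0, language_code)


def read_header(data: bytes) -> int:
    """Return the language code from the first 8 bytes of a data area."""
    check, language_code = HEADER.unpack_from(data, 0)
    if check != 0:
        raise ValueError("the first 4 bytes of the data area must be 0")
    return language_code


if __name__ == "__main__":
    prefix = write_header(84)  # 84 is the code the text gives for Vietnam
    print(prefix.hex(), read_header(prefix))
```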
Converting text into database values: determine the language used for storage, then use current search algorithms (branch, node...) to convert, that is, to determine vocabulary, part of speech, general grammar and additional information. Characters that are not vocabulary are stored in the variables in the same way as vocabulary.
Converting DLSC values into text or other forms such as sound, image...: since DLSC is value-based and structured, this conversion is not difficult, and only a few points need attention. Grammatical elements in DLSC may take different values from language to language, depending on the natural language that created those values. Two cases arise during data processing (determined by reading the first 8 bytes of the data area):
If the DLSC was created from the same language as the language to be processed (language A is mapped to the database and the database is mapped back to language A), Synonyms and Expansion are used in full, so the original can be restored exactly (complete recovery).
If the DLSC was created from a language different from the language to be processed (language B is mapped to DLSC and this DLSC is mapped to language A, which works as a translation from language B into language A), the Synonyms and Expansion components are not used. Those two values are replaced with values of the language to be processed (language A) (incomplete recovery). For example, a DLSC value created from the word "soya" is stored as Part of speech=3; Value=3; Npc.form=0; Npc.gender=0; Synonym=0; Npc.count=0. When it is processed as Vietnamese, the values used are Part of speech=3; Value=3; Npc.form=0. Npc.gender and Synonym take the default value of 0, which determines the word "dau tuong". We can use an exploratory variable to scan all synonyms. For example, if we set Synonym=1, the word obtained is "do tuong"... (If the input value of Synonym is greater than the number of actual synonyms, the first word is returned.)
Example of using the database structure:
The nouns "dau tuong", "do tuong" and "dau nanh" in Vietnamese (3 synonyms) are stored as follows:
The phrase 'is stored with Part of speech' is 'iswPos' for short.
The word "dau tuong" iswPos = Value = 3, Npc.form = Npc.gender = 0, Synonym = 0.
The word "do tuong" iswPos = Value = 3, Npc.form = Npc.gender = 0, Synonym = 1.
The word "dau nanh" iswPos = Value = 3, Npc.form = Npc.gender = 0, Synonym = 2.
Correspondingly, for the words "soya" and "soya bean" in English (2 synonyms), the values of the components are:
The word "soya" iswPos = Value = 3, Npc.form = Npc.gender = 0, Synonym = 0, Npc.count = 0.
The word "soya bean" iswPos = Value = 3, Npc.form = Npc.gender = 0, Synonym = 1, Npc.count = 0.
Therefore, to identify the soybean species in the Vietnamese-English pair (or any pair of languages), we only consider the two areas Part of speech = Value = 3. The value of the Synonym area is significant only within each specific language. In this way synonyms are eliminated.
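The two recovery cases and the synonym fallback can be sketched as a small lookup. The dictionary structure, the language keys "vi" and "en", and the function name are assumptions made for illustration; only the words and the values Part of speech = Value = 3 come from the example above.

```python
# Toy per-language tables keyed by Content (Part of speech, Value).
# The list position of each word is its Synonym value.
LEXICON = {
    "vi": {(3, 3): ["dau tuong", "do tuong", "dau nanh"]},
    "en": {(3, 3): ["soya", "soya bean"]},
}


def render(pos: int, value: int, synonym: int,
           source_lang: str, target_lang: str) -> str:
    """Convert one intermediate value into a word of the target language."""
    words = LEXICON[target_lang][(pos, value)]
    if source_lang != target_lang:
        # Incomplete recovery: Synonym from another language is not used,
        # so the target language's default (0) applies.
        synonym = 0
    if synonym >= len(words):
        # A Synonym value beyond the actual synonyms falls back to the first word.
        synonym = 0
    return words[synonym]


if __name__ == "__main__":
    print(render(3, 3, 1, "vi", "vi"))  # complete recovery
    print(render(3, 3, 1, "en", "vi"))  # incomplete recovery
```

Only the Content pair selects the entry; Synonym picks among forms within one language, which is why it can be dropped when crossing languages.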
The intermediate data can store part of speech and vocabulary at the same time (but cannot store the position of a part of speech within a sentence) and eliminates synonyms. The arrangement of parts of speech within a sentence differs from language to language; the invention uses a single sentence structure when storing data so that languages can be converted easily (data export or import): Subject + predicate, where predicate = verb + complement, and a modifier stands after the word it modifies.
Machine translation includes two steps; each of them may be developed independently and used for different purposes.
Step 1 is the process of translating natural language (language A) into values stored by the multipurpose language storage method. Since each value stored in the database corresponds to a language unit (word, phrase...), the process of converting language A into database values is also based on searching by language unit. Current algorithms may be used.
Step 2 is the process of converting the values created in Step 1 into different forms of any natural language. The data used in the invention is structured, so apart from being converted into text, it may also be converted into non-text forms. Vocabulary and grammar are stored as values and their positions; these values can be transmitted to other applications or devices to perform this step.
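The separation of the two steps can be sketched as two functions that share nothing except a byte string, so each could run in a different application or on a different device. Everything here is a toy under assumptions: the lexicons and function names are hypothetical, the entry (2, 7) for "green" is invented, the byte order is assumed big-endian, and the bit positions assume Part of speech in the top 5 bits. Only the soybean entry (3, 3) and the Vietnamese word come from the description.

```python
import struct

ADJECTIVE = 2  # Part of speech value for adjectives, per the description

# Hypothetical toy lexicons keyed by Content (Part of speech, Value).
EN_TO_CONTENT = {"soya": (3, 3), "green": (2, 7)}
CONTENT_TO_VI = {(3, 3): "dau tuong", (2, 7): "xanh"}


def step1_english(text: str) -> bytes:
    """Step 1: English text -> serialized intermediate values."""
    units = [EN_TO_CONTENT[w] for w in text.lower().split()]
    # Storage order is fixed: a modifier stands after the word it modifies,
    # so an English adjective is moved behind the following word.
    i = 0
    while i + 1 < len(units):
        if units[i][0] == ADJECTIVE and units[i + 1][0] != ADJECTIVE:
            units[i], units[i + 1] = units[i + 1], units[i]
            i += 2
        else:
            i += 1
    # Grammar bits (the low 11 bits) are left at 0 in this sketch.
    words = [(pos << 27) | (value << 11) for pos, value in units]
    return struct.pack(f">{len(words)}I", *words)


def step2_vietnamese(payload: bytes) -> str:
    """Step 2: serialized intermediate values -> Vietnamese text."""
    count = len(payload) // 4
    words = struct.unpack(f">{count}I", payload)
    units = [(w >> 27, (w >> 11) & 0xFFFF) for w in words]
    # Vietnamese also places modifiers after the noun, so no reordering is needed.
    return " ".join(CONTENT_TO_VI[u] for u in units)


if __name__ == "__main__":
    payload = step1_english("green soya")  # could be sent to another device
    print(step2_vietnamese(payload))
```

Adding another target language in this sketch would mean writing one more Step 2 function, with no change to `step1_english`.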
In both processes, attention must be paid to the positions of the database values, which are arranged according to rules so that processing is exact.
By dividing the translation process into two steps as described above, the difficulty of translation changes from multiplication to addition. Elimination of synonyms and a fixed grammatical order also make translation easier.
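The change from multiplication to addition can be illustrated numerically. The sketch below is an interpretation rather than part of the source: the function names are hypothetical, and the translator-count comparison (n*(n-1) direct translators versus 2*n modules through the intermediate data) is a standard counting argument that the text does not state explicitly.

```python
def direct_difficulty(x: float, y: float) -> float:
    """Difficulty of translating A -> B in one continuous process (x*y in the text)."""
    return x * y


def two_step_difficulty(x: float, y: float) -> float:
    """Difficulty when A and B are each handled once, in separate steps (x+y)."""
    return x + y


def direct_translators(n: int) -> int:
    """One-directional translators needed to link n languages pairwise."""
    return n * (n - 1)


def pivot_modules(n: int) -> int:
    """Modules needed via intermediate data: one Step 1 and one Step 2 per language."""
    return 2 * n


if __name__ == "__main__":
    for n in (2, 5, 10):
        print(n, direct_translators(n), pivot_modules(n))
```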
To translate language A into some language X, only step 1 of the process needs attention. To translate some language Y into language B, only step 2 needs attention. Developing a new language is therefore easier and more independent.
Achieved efficiency:
Using data of the method of data storage and language conversion enables the invention to separate the translation process into two independent steps, to eliminate synonyms, and to store information on vocabulary and grammar in full. Since the translation process is divided into two steps, the invention achieves the following:
- Multi-language translation is supported simultaneously without intermediate translation steps, improving quality and time.
- Translation of a new language may be developed independently.
- The difficulty of translation among languages is reduced.
- Translation may be carried out across various devices or applications.

Claims

CLAIM
1. Asynchronous machine translation. Unlike normal translation methods, the process of translating from language A to language B is divided into two independent steps (in current translation methods, translation of language A to language B is a continuous process). The intermediate data (intermediate value) used in the invention is the data used in the multipurpose language data storage method. The method includes two parts:
a) Receiving data in the language A to be translated and converting language A into intermediate values that store both vocabulary and part of speech; the positions of those values are arranged according to a fixed structure (a modifier stands after the word it modifies; the sentence structure is Subject + predicate (verb + complements)).
b) Receiving the intermediate data and converting it into language B; part of speech and vocabulary are stored in the intermediate values, and since sentence structure and modifier position are fixed when stored as intermediate values, positions must be converted to suit language B.
The objective is to make machine translation simpler and more efficient.
PCT/VN2015/000010 2014-08-28 2015-08-27 Method of asynchronous machine translation WO2016033617A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
VN1-2014-02900 2014-08-28
VN201402900 2014-08-28

Publications (2)

Publication Number Publication Date
WO2016033617A2 (en) 2016-03-03
WO2016033617A3 (en) 2016-05-26

Family

ID=55400835

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/VN2015/000010 WO2016033617A2 (en) 2014-08-28 2015-08-27 Method of asynchronous machine translation

Country Status (1)

Country Link
WO (1) WO2016033617A2 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113076760A (en) * 2020-01-03 2021-07-06 阿里巴巴集团控股有限公司 Translation method, commodity retrieval method, translation device, commodity retrieval device, electronic equipment and computer storage medium

Family Cites Families (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US4864503A (en) * 1987-02-05 1989-09-05 Toltran, Ltd. Method of using a created international language as an intermediate pathway in translation between two national languages
JP3066274B2 (en) * 1995-01-12 2000-07-17 シャープ株式会社 Machine translation equipment
US6161082A (en) * 1997-11-18 2000-12-12 At&T Corp Network based language translation system
EP1754169A4 (en) * 2004-04-06 2008-03-05 Dept Of Information Technology A system for multilingual machine translation from english to hindi and other indian languages using pseudo-interlingua and hybridized approach
US8214199B2 (en) * 2006-10-10 2012-07-03 Abbyy Software, Ltd. Systems for translating sentences between languages using language-independent semantic structures and ratings of syntactic constructions

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN113076760A (en) * 2020-01-03 2021-07-06 阿里巴巴集团控股有限公司 Translation method, commodity retrieval method, translation device, commodity retrieval device, electronic equipment and computer storage medium
CN113076760B (en) * 2020-01-03 2024-01-26 阿里巴巴集团控股有限公司 Translation and commodity retrieval method and device, electronic equipment and computer storage medium

Also Published As

Publication number Publication date
WO2016033617A3 (en) 2016-05-26

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 15836950

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase in:

Ref country code: DE

REEP Request for entry into the european phase

Ref document number: 2015836950

Country of ref document: EP

WWE Wipo information: entry into national phase

Ref document number: 2015836950

Country of ref document: EP

122 Ep: pct application non-entry in european phase

Ref document number: 15836950

Country of ref document: EP

Kind code of ref document: A2