WO2008007386A1

WO2008007386A1 - A method for run time translation to create language interoperability environment [lie] and system thereof

Info

Publication number: WO2008007386A1
Application number: PCT/IN2006/000268
Authority: WO
Inventors: Chandrashekar Rudrappa Koranahally
Original assignee: Koranahally Chandrashekar Rudr
Priority date: 2006-07-14
Filing date: 2006-07-31
Publication date: 2008-01-17
Also published as: WO2008007386B1; WO2008007386A9

Abstract

A method and system for run time translation of input, independent of its language and format, wherein the language background knowledge is used to convey context of the text using internet based protocol in order to create a Language Interoperability Environment (LIE), the method comprising steps of sending the input in source language to Source Language Analyzer (SLA), SLA analyzing the input to obtain broken-down word groups along with its grammatical features, replacing the analyzed input to its target language(s) using Multi Language Mapper (MLM), generating words taking root, its lexical category and grammatical features using Target Language Generator (TLG), and receiving the output in target language(s) in identical format at an intended destination as shown in Figure 4.

Description

A METHOD FOR RUN TIME TRANSLATION TO CREATE LANGUAGE INTEROPERABILITY ENVIRONMENT (LIE) AND SYSTEM THEREOF

Field of the Present Invention

The present invention addresses the problem of communication between people speaking different languages, a problem of enormous proportions. An equally large endeavor is to build a Language Interoperability Environment (LIE) which can translate content (text or speech) from a source language to target language(s) at run time and help people of the world to communicate with each other and to access the wealth of information that is available the world over.

Background of the Present Invention:

Communication between people speaking different languages has always troubled mankind. Especially in a country like India with so many languages the problem is more pronounced and magnified. There are a few possible solutions to this problem:

1. Either one of them or both of them know the other person's language or both of them know a common language like English.

2. Employ a translator to help communicate with each other.

3. Use sign language.

Offices have become truly cosmopolitan with people from different parts of the country speaking different languages. With the advent of computers / internet, globalization and the need to communicate with people speaking different languages, English has become the middleman. English is the language of modern day research and scholarship, and has emerged as the lingua-franca of the modern world and the globalized society. All important works are translated into English from the different languages of the world resulting in the creation of an immense knowledgebase.

The number of people who know English is minuscule, yet the Internet is dominated by English. Advanced countries have made enough investments in systems to ensure that their languages are adapted to the digital age. It is in the poor and the developing countries where the problem is acute - in Asia and Africa. Unfortunately it is here that the maximum diversity exists.

These immense groups of people who are left out are forced to catch up at the cost of neglecting their languages. With little investment and lack of focus, the usage of these languages has been coming down drastically. This is evident if you look at the new generation of people.

Experts say in 2-3 centuries there will be only 4-5 languages in use on planet Earth; a repeat of the Rosetta script - only larger in scale. Languages have evolved over centuries and have literature, knowledgebase and a world of their own. Mankind will lose something so great and immense, and the worst part is - it is irreversible. There is an urgent need to evolve a system to keep them alive.

Languages survive only through usage and this can happen only if people are able to transact in their own languages. Computers and Internet have become modern day necessities and people should be able to use them and access the world's information in their own languages and feel empowered.

Prior Art of the present Invention

To reach people speaking different languages, translations have been popular for centuries, and the Bible is the most translated book. There have been developments in the Machine Translation (MT) arena, even though it is in its infancy.

The current state is that, although multilingual sites and translation software are available, they operate independently and are inaccurate.

Many web sites like Alta Vista and Google have a translation facility where a user can type in text and ask the system to translate it to other languages. But the sad part is that it is 'word to word' translation. It is not even repeatable: if a user types in text in English and asks the system to translate it to, say, French, then copies the output, pastes it in the input box again and asks the system to translate it back to English, the output will be different from the original input.

To address this issue of enormous proportions, an equally large endeavor is to build a Language Interoperability Environment (LIE) which can translate content from a source language to target language(s) at run time and help people of the world to communicate with each other and to access the wealth of information that is available the world over.

Even though computers and processing power have increased multifold in the last decade, a major weakness of the machine remains: it has little or no common sense or world knowledge. Fully-automatic general purpose high quality machine translation systems (FGH-MT) are extremely difficult to build. In fact, there is no system in the world that qualifies as a FGH-MT. The reasons are evident and not difficult to locate. Translation is a creative process that involves interpretation of the given text by a translator and also varies depending on the audience and the purpose. This explains the difficulty of building a machine translation system.

The major difficulty the machine faces in interpreting a given text is the lack of general world knowledge or common knowledge, subject specific knowledge, knowledge of the context, etc., which can be collectively called 'background knowledge'. The difficulty the machine faces at the first level pertains to information coded in a text.

To overcome the complexities of such a large scale MT system, the most common approach has been to delimit the subject domain so that the machine works in a narrow subject area, such as weather reports, computer manuals, etc. It has been hoped that by delimiting MT to a narrow area, one stands a better chance of using context, domain knowledge, etc. The system would perform badly when given a text outside the domain, but that is a limitation one would have to live with. The real difficulty is in identifying a domain that is narrow enough that the system works well, and wide enough that enough real texts qualify to be in it, so that it is practically useful.

When some information is transferred from one language to another, there is no way to express it exactly. There will be losses/imperfections to some extent, as in any other case where transmission or interpretation losses occur whenever something is transformed from one medium to another.

LIE addresses these issues. Another important aspect is: LIE is NOT aimed at translating serious stuff like poetry but to do mundane stuff - the kind of language used in everyday life is fairly simple and LIE is to help people as much as possible.

Brief description of the Accompanying drawings:

Figure 1: represents mapping of all the languages in Multi Language Mapper (MLM)

Figure 2: represents a summarized LIE

Figure 3: represents LIE specific to speech

Figure 4: represents an elaborate LIE

Objects of the Present Invention

The main object of the present invention is to develop a method for run time translation of input, independent of its language and format.

Yet another object of the present invention is to develop a method wherein the language background knowledge is used to convey context of the text.

Still another object of the present invention is to develop said method using internet based protocol.

Still another object of the present invention is to develop said method in order to create a Language Interoperability Environment (LIE).

Another main object of the present invention is to develop a system for run time translation of input, independent of its language and format.

Yet another object of the present invention is to develop a system wherein the language background knowledge is used to convey context of the text.

Still another object of the present invention is to develop said system using internet based protocol.

Still another object of the present invention is to develop said system in order to create a Language Interoperability Environment (LIE).

Statement of the present Invention

The present invention is related to a method for run time translation of input, independent of its language and format, wherein the language background knowledge is used to convey context of the input using internet based protocol in order to create a Language Interoperability Environment (LIE), said method comprising steps of sending the input in source language to Source Language Analyzer (SLA), analyzing the input using SLA to obtain broken-down word groups along with its grammatical features, replacing the analyzed input to its target language(s) using Multi Language Mapper (MLM), generating words taking root, its lexical category and grammatical features using Target Language Generator (TLG), receiving the output in target language(s) in identical format at an intended destination; and a system for run time translation of input, independent of its language and format, wherein the language background knowledge is used to convey context of the input using internet based protocol in order to create a Language Interoperability Environment (LIE), said system comprises: means for sending the input in a source language to Source Language Analyzer (SLA); means for analyzing the input using SLA to obtain broken-down word groups along with its grammatical features, thereafter replacing the analyzed text to its target language(s) using Multi Language Mapper (MLM), and thereby generating words taking root, its lexical category and grammatical features using Target Language Generator (TLG); means for receiving the output in target language(s) in identical format at an intended destination.

Detailed description of present invention:

Accordingly, the present invention relates to a method for run time translation of input, independent of its language and format, wherein the language background knowledge is used to convey context of the input using internet based protocol in order to create a Language Interoperability Environment (LIE), said method comprising steps of a) sending the input in source language to Source Language Analyzer (SLA), b) analyzing the input using SLA to obtain broken-down word groups along with its grammatical features, c) replacing the analyzed input to its target language(s) using Multi Language Mapper (MLM), d) generating words taking root, its lexical category and grammatical features using Target Language Generator (TLG), and e) receiving the output in target language(s) in identical format at an intended destination.

In an embodiment of the present invention, wherein the method further comprises editing the text at steps (a) and/or (e) using pre- and post-editor respectively.

In yet another embodiment of the present invention the input and output are text or speech (Figure 3).

In still another embodiment of the present invention, wherein the pre-editor provides for editing, identifying non-standard forms, seeking corrections and offering alternatives in order to choose correct form.

In still another embodiment of the present invention the sent text is tagged to characterize the format.

In still another embodiment of the present invention SLA comprises Word Splitter (WS), Morphological Analyzer (MA) and Language Rules Engine (LRE).

In still another embodiment of the present invention the WS analyzes and separates words and word groups.

In still another embodiment of the present invention, wherein the MA analyzes each word and produces its root and grammatical features.

In still another embodiment of the present invention, wherein the MA breaks up each word into a root and a suffix at different points to look-up the proposed root in dictionary and the proposed suffix in a suffix table.

In still another embodiment of the present invention, wherein characters are added and/or deleted during breakup of words.

In still another embodiment of the present invention, wherein the MLM replaces elements of source language with elements of target language(s) using database having equivalent elements of the source language in all other languages.

In still another embodiment of the present invention, wherein the TLG comprises Word Grouper (WG), Morphological Synthesizer (MS) and Language Rules Engine (LRE).

In still another embodiment of the present invention the WG analyzes and separates and/or combines words and word groups.

In still another embodiment of the present invention, wherein the MS synthesizes words taking root, its lexical category, grammatical rules and features.

In still another embodiment of the present invention, wherein the LRE helps check lexical category, exceptions, grammatical rules and features.

In still another embodiment of the present invention, wherein the received text format reflects characteristics of the tagged sent text.

In still another embodiment of the present invention the editing provides for background knowledge to convey context of the text.

In still another embodiment of the present invention the method maintains meaning, information, context and concordance of the source language in the target language(s).

In another main embodiment of the present invention, wherein a system for run time translation of input, independent of its language and format, wherein the language background knowledge is used to convey context of the input using internet based protocol in order to create a Language Interoperability Environment (LIE), said system comprises:

1. means for sending the input in a source language to Source Language Analyzer (SLA);

2. means for analyzing the input using SLA to obtain broken-down word groups along with its grammatical features, thereafter replacing the analyzed text to its target language(s) using Multi Language Mapper (MLM), and thereby generating words taking root, its lexical category and grammatical features using Target Language Generator (TLG); and

3. means for receiving the output in target language(s) in identical format at an intended destination.

In still another embodiment of the present invention, wherein the system further comprises means for pre-editor/post-editing.

In still another embodiment of the present invention the input and output are text or speech.

In still another embodiment of the present invention the SLA comprises Word Splitter (WS), Morphological Analyzer (MA) and Language Rules Engine (LRE).

In still another embodiment of the present invention the MA has a suffix table in which the proposed suffix is looked up at different points while breaking up each word into a root and a suffix.

In still another embodiment of the present invention the MLM is a database having the equivalent elements of the source language in all the other languages.

In still another embodiment of the present invention the TLG comprises Word Grouper (WG), Morphological Synthesizer (MS) and Language Rule Engine (LRE).

In still another embodiment of the present invention the LRE has entire grammar rules and exceptions of the language.

In still another embodiment of the present invention the system maintains meaning, information, context and concordance of the source language in the target language(s).

Language Interoperability Environment (LIE) shown in Figure 1 is aimed at creating the following:

1. Run time Machine Translation environment.

2. The reference language used to create the MLM is English, as all the languages of the world have dictionaries available between the respective languages and English. For example: Kannada-English, French-English, Hindi-English, etc. This is done for grammatical and morphological purposes also. Hence while designing LIE, English will be used as the gold standard.

3. The LIE engines will translate from the source language to the target language(s) and vice versa. There are 3 components as shown in Figure 2: Source Language Analyzer (SLA), Multi Language Mapper (MLM) and Target Language Generator (TLG), and they have to be built in each language adhering to the overall architecture.

4. The Multi Language Mapper (MLM) is a huge database that has the equivalent elements of the source language in all the other languages under consideration and will be expanded to include many more languages when resources permit.

5. The Language Interoperability is achieved through creating standard interfaces and formats between the different LIE engines. For example:

a. A person can write a document in Kannada. Now the recipients can read the document in Kannada.

b. If a recipient wants to read it in English/German/French/Tamil/Mandarin, he/she can get the Kannada document translated using LIE-English/LIE-German/LIE-French/LIE-Tamil/LIE-Mandarin, etc.

This way we can achieve language interoperability. Thus LIE unites the entire world and its people together by empowering them to transact in their own languages with all others with the help of advanced technology, computers and connectivity. The result is that the entire world, its people and the immense knowledgebase opens in one's own language.
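The role of English as the reference language in the MLM can be illustrated with a small sketch. It assumes each MLM row is keyed by an English element and lists that element's equivalents in other languages; the table contents, the language codes and the names `MLM`, `build_reverse_index` and `map_element` are hypothetical and are not taken from the specification.

```python
from typing import Dict, Optional, Tuple

# Hypothetical MLM fragment: one row per English reference element,
# holding the equivalent element in each supported language.
MLM: Dict[str, Dict[str, str]] = {
    "water": {"en": "water", "fr": "eau", "de": "Wasser"},
    "house": {"en": "house", "fr": "maison", "de": "Haus"},
}


def build_reverse_index(mlm: Dict[str, Dict[str, str]]) -> Dict[Tuple[str, str], str]:
    """Map (language, lowercased element) back to its English reference key."""
    index: Dict[Tuple[str, str], str] = {}
    for key, row in mlm.items():
        for lang, element in row.items():
            index[(lang, element.lower())] = key
    return index


REVERSE = build_reverse_index(MLM)


def map_element(element: str, source: str, target: str) -> Optional[str]:
    """Replace a source-language element with its target-language equivalent.

    The lookup goes through the English reference row, so any pair of
    supported languages can be mapped without a direct dictionary
    between them. Returns None when the element is not in the table.
    """
    key = REVERSE.get((source, element.lower()))
    if key is None:
        return None
    return MLM[key].get(target)
```

With this layout, adding a language means adding one column to each row rather than one dictionary per language pair, which is one plausible reading of why a single reference language is used.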

LIE is a very large and very complex software system hosted on a powerful farm of servers. The system is made available in several flavors like:

1. Freely available on the web for writing and translating - email, chat, browsing and searching the internet in many languages. This supports many concurrent users but limits the input to a few pages at a time.

2. A product that can be set up on powerful servers at local installations like large corporations. This supports a few users but can take large inputs running into several pages.

3. Licensed and secure usage by companies for their corporate communication.

A basic LIE MT System consists of an analyzer of the source language, i.e. the Source Language Analyzer (SLA), whose output is fed to the Target Language Generator (TLG) for the generation of the target language. Between the analyzer and the generator there is a Multi Language Mapper (MLM) which uses multi lingual dictionaries and grammar rules/exceptions with support from LRE to map the source language elements to target language(s) elements.
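The analyzer, mapper and generator chain described above can be sketched as three functions applied in sequence. This is a minimal illustration under stated assumptions: the `Element` record, the toy English lexicon, the English-to-French mapping and the French form table are all hypothetical stand-ins for the dictionaries and rules the specification leaves to each language engine.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Element:
    """One analyzed unit: a root, its lexical category and grammatical features."""
    root: str
    category: str = "unknown"
    features: Dict[str, str] = field(default_factory=dict)


# Hypothetical toy data standing in for real dictionaries and rule tables.
LEXICON = {  # English surface form -> (root, category, features)
    "house": ("house", "noun", {"number": "singular"}),
    "houses": ("house", "noun", {"number": "plural"}),
}
MAPPING = {("house", "fr"): "maison"}  # (English root, target language) -> target root
FORMS = {  # (French root, number) -> surface form
    ("maison", "singular"): "maison",
    ("maison", "plural"): "maisons",
}


def source_language_analyzer(text: str) -> List[Element]:
    """SLA stand-in: split the text into words and attach root and features."""
    elements = []
    for word in text.split():
        root, category, features = LEXICON.get(word.lower(), (word, "unknown", {}))
        elements.append(Element(root, category, dict(features)))
    return elements


def multi_language_mapper(elements: List[Element], target: str) -> List[Element]:
    """MLM stand-in: replace each source root, keeping category and features."""
    return [
        Element(MAPPING.get((e.root, target), e.root), e.category, e.features)
        for e in elements
    ]


def target_language_generator(elements: List[Element]) -> str:
    """TLG stand-in: build a surface word from each root and its features."""
    words = [FORMS.get((e.root, e.features.get("number", "")), e.root) for e in elements]
    return " ".join(words)


def translate(text: str, target: str) -> str:
    """Run the three stages in the order described: SLA, then MLM, then TLG."""
    return target_language_generator(
        multi_language_mapper(source_language_analyzer(text), target)
    )
```

Unknown words pass through unchanged in this sketch; a real engine would presumably route them to the pre-editor or post-editor instead.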

To make the system more usable, User Interface Editors are also provided for human pre-editing of the input and post-editing of the output. These are also part of the overall system. The important components are described in Figure-2:

At a basic level, the Machine Translation (MT) is perceived as a sequence of independent steps/processes executed by the different modules of the overall software system. The Engines are different for different languages; hence for each language a separate system needs to be built which adheres to the overall system needs and architecture.

The input to the system is either formatted text (email, html, Microsoft Word document, Excel spread sheet, pdf ... file) or voice.

The way in which LIE system works is described and represented in Figure-4:

1. A Listener software module receives the formatted input text - identifies and tags them for characteristics such as:

• Original format: html, Microsoft Word document, Excel spread sheet, pdf ... file.

• Format details like paragraphs, fonts, bold/italic, etc.

• Source Language.

• Target Language(s).

2. If the input is speech, the voice modulations are analyzed by the Speech Analyzer (SA), corresponding values are fetched from the MLM-speech database and the output will be given using the Speech Generator (SG) in the target language. Figure-4 explains this scenario.

3. Depending on the source language and the target language(s), the respective software engines are invoked and the input is passed to them. Now onwards the processing steps refer to the specific language engines.

4. The input text is passed to the pre-editor which is a user interface that allows the user to edit and correct the input: words spelt with non-standard spellings are changed to their standard spellings. It also points out the non-standard forms and seeks corrections. It can also present alternatives out of which the user can choose the correct form. The user can avoid this step if he/she wishes to do so.

5. The input text in a source language is passed through Source Language Analyzer (SLA) which has components like: local Word Splitter and Morphological Analyzer.

6. The local Word Splitter (WS) analyzes and separates words and word groups like idioms and phrases.

7. The output is passed to Morphological Analyzer (MA) which is designed to handle inflectional and derivational morphology. It analyzes each word and produces its root and grammatical features using the elaborate Language Rules Engine (LRE) which has the entire grammar rules and exceptions. For a given word, it checks for the lexical category (such as pronoun, post-position, noun, verb, etc.) and other grammatical features. It also tries to see whether the word can be broken up into a root and a suffix. At the breakup point, some characters such as vowels may be added or deleted. It may have to try several times to break the word at different points. For each breakup it looks up the proposed root in the dictionary and the proposed suffix in a suffix table. Whenever both lookups are successful, that value is taken as valid. This is the output of the source system.

8. The Multi Language Mapper (MLM) is a huge database that has the equivalent elements of the source language in all the other languages. MLM takes the output produced so far to replace the elements of the source language with elements of the target language(s) and kick starts the Target Language Generator (TLG) processes of the respective languages.

9. The output of the MLM is fed into the respective Target Language Generator (TLG) which has components like: local Word Grouper (WG) and Morphological Synthesizer (MS). These are the reverse of Word Splitter and Morphological Analyzer.

10. The Word Grouper (WG) analyzes and separates words and word groups like idioms and phrases.

11. Morphological Synthesizer (MS) takes a root, its lexical category and grammatical features and generates words.

12. The output is fed into a Language Packager (LP) to package in the target language. It applies the formats of the original text to the output text such as:

a) Original format: html, Microsoft Word document, Excel spread sheet, pdf, ... file

b) Format details like paragraphs, fonts, bold/italic, etc.

13. The output produced is the LIE system output. The post-editing user interface allows the user to do post-editing rapidly. The user can avoid this step if he/she wishes to do so. There are three levels of post-editing:

a. First level seeks to make the output grammatically correct.

b. In second level, the raw output is corrected not only grammatically but also stylistically.

c. In the third level, the post-editor might change the setting and the events in the story to convey the same meaning to the reader who has a different cultural and social background. This is really trans-creation, and a creative post-editor can go all the way up to this level.
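The root and suffix breakup described in step 7 can be sketched as follows. The sketch assumes a toy English root dictionary and suffix table and restricts the character adjustment at the breakup point to adding or removing a single vowel; `ROOTS`, `SUFFIXES`, `candidate_roots` and `analyze` are hypothetical names, and a real Language Rules Engine would carry far richer rules and exceptions.

```python
from typing import Dict, Iterator, List, Tuple

# Hypothetical toy dictionary (root -> lexical category) and suffix table
# (suffix -> grammatical features).
ROOTS: Dict[str, str] = {"walk": "verb", "make": "verb", "house": "noun"}
SUFFIXES: Dict[str, Dict[str, str]] = {
    "s": {"number": "plural"},
    "ed": {"tense": "past"},
    "ing": {"aspect": "progressive"},
}
VOWELS = "aeiou"


def candidate_roots(stem: str) -> Iterator[str]:
    """Yield the stem as-is, then with one vowel added or removed at the breakup point."""
    yield stem
    for vowel in VOWELS:
        yield stem + vowel          # restore a vowel dropped before the suffix
    if stem and stem[-1] in VOWELS:
        yield stem[:-1]             # drop a vowel inserted before the suffix


def analyze(word: str) -> List[Tuple[str, str, Dict[str, str]]]:
    """Return every (root, category, features) reading where both lookups succeed."""
    word = word.lower()
    if word in ROOTS:
        return [(word, ROOTS[word], {})]
    readings = []
    # Try each breakup point in turn, as the description suggests.
    for cut in range(1, len(word)):
        stem, suffix = word[:cut], word[cut:]
        if suffix not in SUFFIXES:
            continue                # suffix-table lookup failed at this point
        for root in candidate_roots(stem):
            if root in ROOTS:       # dictionary lookup succeeded as well
                readings.append((root, ROOTS[root], SUFFIXES[suffix]))
                break
    return readings
```

For example, `analyze("making")` splits at `mak` + `ing`, fails the dictionary lookup on `mak`, and succeeds after restoring the vowel to give `make`. The function returns a list because more than one breakup point may yield a valid reading.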

LIE takes the information in the source language text and presents it in the target language. Thus, at the prefix/suffix level, a prefix/suffix in the source language is replaced by a suitable element in the target language and at the word level, the source words are replaced by equivalent words in the target language. Similarly, the word groups are also replaced by equivalent groups in the target language.

The LIE system is to be designed so that the combination of man and machine together can perform translations and the output is as close to the target language as possible. If LIE enters into mainstream and common use, it has major implications for global communication and integration as a person can access documents in his/her language which will be a big asset.

The LIE answer to the world's communication problem is that it envisages building a massive IT backbone which can take input in the languages for which the LIE systems are built and give output in other languages and vice versa. The architecture and standards are defined in such a way that all the LIE engines adhere to a standard architecture and talk to each other based on defined document interchange standards which are based on open standards like Unicode, XML and web services. A person from Japan can transact in his own language - Japanese with a person from Germany who is transacting in his own language - German.

The task of building a LIE machine translation system for each language is subdivided into two parts:

1. The first module, the core LIE, does language analysis based on language knowledge: it takes all the information in the source text and presents it in its output, which is quite close to the target language.

2. The second module does domain specific knowledge based processing, statistical processing, etc., in which it utilizes world knowledge, frequency information, concordances, etc. to produce output in the target language.

The advantages of said modular approach are as given below:

1. The first module can be made available for use at an earlier date since it requires less effort and is easier to build. But the user needs a certain amount of training to read the output and make sense of it.

2. The early feedback guides the refinement and building of the system. Since the system can be used at an early date, not only does it serve a useful purpose, it also becomes easier to build the second module.

3. The system provides a robust layer in the first module which can be used even if the second module fails to an extent in any specific context. The second module by its very nature is fragile. The first module is made much more robust.

4. The segregation of said modules is critical to appreciate the boundaries of various activities and accordingly co-ordinate in a better manner. It also facilitates due recognition of language knowledge and also thereby the knowledge background to ultimately achieve LIE.

5. LIE is to be made available in a few languages in the first phase of implementation, as the software is very complex and needs teams from the respective language groups and a lot of money to build and operate. The people speaking those languages will validate the translations and their uses, and will help in refining the system. After several such iterations a more robust environment can be developed and subsequently enhanced to involve more languages. The ultimate aim is to develop the Language Interoperability Environment (LIE) in most of the languages of the world and bring the entire planet under one interoperable umbrella.

6. It is also envisaged that the knowledgebase is made available in the many languages of the world at runtime. The philosophy is to provide access to "all the world's information" through mechanized translation with interoperability mechanisms inbuilt.
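The interoperability mechanisms referred to here rest on the document interchange standards mentioned earlier (Unicode, XML and web services). The specification does not define a concrete schema, so the following is only a sketch of what an XML envelope passed between two LIE engines might look like; the element and attribute names (`lie-document`, `targets`, `element`, `category`) are hypothetical.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def build_envelope(source_lang: str, target_langs: List[str],
                   doc_format: str, elements: List[Dict[str, str]]) -> str:
    """Serialize analyzed elements plus routing tags into a Unicode XML string."""
    root = ET.Element("lie-document", {"source": source_lang, "format": doc_format})
    targets = ET.SubElement(root, "targets")
    for lang in target_langs:
        ET.SubElement(targets, "target", {"lang": lang})
    body = ET.SubElement(root, "elements")
    for item in elements:
        node = ET.SubElement(body, "element", {"category": item["category"]})
        node.text = item["root"]
    # encoding="unicode" makes ElementTree return a str rather than bytes.
    return ET.tostring(root, encoding="unicode")


def read_envelope(xml_text: str) -> Dict[str, object]:
    """Parse an envelope back into plain Python values on the receiving engine."""
    root = ET.fromstring(xml_text)
    return {
        "source": root.get("source"),
        "format": root.get("format"),
        "targets": [t.get("lang") for t in root.iter("target")],
        "elements": [
            {"root": e.text, "category": e.get("category")}
            for e in root.iter("element")
        ],
    }
```

Because every engine reads and writes the same envelope, an engine for a newly added language only has to implement this one format to exchange documents with all existing engines.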

The invention is further elaborated with the help of the following examples. However, such examples should not be construed to limit the scope of the invention.

Example 1

A sample MLM table

A sample Multi Language Mapper (MLM) table which is part of the MLM database is given below.

Example 2

A sample html sentence and the intermediary steps are shown below:

The format of the input is maintained in the output with the $10000 in bold. Similarly input formats like html, Microsoft Word, Excel, etc., are packaged accordingly with the formats like paragraphs, bold, italics, punctuation marks, etc.

Example 3

As shown in Figure 4, Person A sends an email with a Microsoft Word document attachment in the source language to Person B. This email goes to the LIE for transformation to the target language before it reaches Person B. The Listener initially tags the formats, such as the document format (html, Excel, Word, etc.) and the format characteristics (paragraphs, bold, italics, etc.), and optionally passes the input to the User Interface (UI) for pre-editing. Now the user can edit for non-standard forms/spellings. Further, the SLA analyzes the processed input and produces a broken-down structure and its grammatical features. MLM replaces each element of the source language with an element of the target language. In TLG a combination of man and machine together can perform translations and the output is as close to the target language as possible. The Packager applies the original formats with the help of tags produced by the Listener. The user can now edit the output from the Packager for non-standard forms/spellings. Person B now receives the email with the Microsoft Word attachment in the target language.
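The hand-off between the Listener's format tags and the Packager can be illustrated for the html case. The sketch below swaps in a simple regular-expression split in place of a real Listener and Packager: markup is set aside, only the text between tags is passed to a translation callback, and the markup is put back in its original positions. The function name `translate_preserving_markup` and the callback signature are assumptions made for illustration.

```python
import re
from typing import Callable

# A capturing group makes re.split keep the tags in the result list.
TAG_PATTERN = re.compile(r"(<[^>]+>)")


def translate_preserving_markup(html: str, translate_text: Callable[[str], str]) -> str:
    """Translate only the text between tags and leave the markup untouched."""
    pieces = TAG_PATTERN.split(html)
    output = []
    for piece in pieces:
        if TAG_PATTERN.fullmatch(piece) or not piece.strip():
            output.append(piece)                  # tags and whitespace pass through
        else:
            output.append(translate_text(piece))  # text goes to the language engine
    return "".join(output)
```

A usage example with a placeholder callback: `translate_preserving_markup("<p>pay <b>$10000</b> now</p>", str.upper)` leaves the `<b>` tags around `$10000`, so the amount stays bold in the output. Translating each text run separately loses sentence context across tags, so a fuller implementation would likely translate whole sentences and then realign the tags.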

Claims

1. A method for run time translation of input, independent of its language and format, wherein the language background knowledge is used to convey context of the input using internet based protocol in order to create a Language Interoperability Environment (LIE), said method comprising steps of a) sending the input in source language to Source Language Analyzer (SLA), b) analyzing the input using SLA to obtain broken-down word groups along with its grammatical features, c) replacing the analyzed input to its target language(s) using Multi Language Mapper (MLM), d) generating words taking root, its lexical category and grammatical features using Target Language Generator (TLG), and e) receiving the output in target language(s) in identical format at an intended destination.

2. The method as claimed in claim 1, wherein the method further comprises editing the text at steps (a) and/or (e) using pre- and post-editor respectively.

3. The method as claimed in claim 1, wherein the input and output are text or speech.

4. The method as claimed in claim 1, wherein the pre-editor provides for editing, identifying non-standard forms, seeking corrections and offering alternatives in order to choose correct form.

5. The method as claimed in claim 1, wherein the sent text is tagged to characterize the format.

6. The method as claimed in claim 1, wherein SLA comprises Word Splitter (WS), Morphological Analyzer (MA) and Language Rules Engine (LRE).

7. The method as claimed in claim 6, wherein the WS analyzes and separates words and word groups.

8. The method as claimed in claim 6, wherein the MA analyzes each word and produces its root and grammatical features.

9. The method as claimed in claim 6, wherein the MA breaks up each word into a root and a suffix at different points to look-up the proposed root in dictionary and the proposed suffix in a suffix table.

10. The method as claimed in claim 9, wherein characters are added and/or deleted during breakup of words.

11. The method as claimed in claim 1, wherein the MLM replaces elements of source language with elements of target language(s) using database having equivalent elements of the source language in all other languages.

12. The method as claimed in claim 1, wherein the TLG comprises Word Grouper (WG), Morphological Synthesizer (MS) and Language Rules Engine (LRE).

13. The method as claimed in claim 12, wherein the WG analyzes and separates and/or combines words and word groups.

14. The method as claimed in claim 12, wherein the MS synthesizes words taking root, its lexical category, grammatical rules and features.

15. The method as claimed in claims 6 and 12, wherein the LRE helps check lexical category, exceptions, grammatical rules and features.

16. The method as claimed in claim 1, wherein the received text format reflects characteristics of the tagged sent text.

17. The method as claimed in claim 2, wherein the editing provides for background knowledge to convey context of the text.

18. The method as claimed in claim 1, wherein the method maintains meaning, information, context and concordance of the source language in the target language(s).

19. A system for run time translation of input, independent of its language and format, wherein the language background knowledge is used to convey context of the input using internet based protocol in order to create a Language Interoperability Environment (LIE), said system comprises:

a. means for sending the input in a source language to Source Language Analyzer (SLA);

b. means for analyzing the input using SLA to obtain broken-down word groups along with its grammatical features, thereafter replacing the analyzed text to its target language(s) using Multi Language Mapper (MLM), and thereby generating words taking root, its lexical category and grammatical features using Target Language Generator (TLG); and

c. means for receiving the output in target language(s) in identical format at an intended destination.

20. The system as claimed in claim 19, wherein the system further comprises means for pre-editor/post-editing.

21. The system as claimed in claim 19, wherein the input and output are text or speech.

22. The system as claimed in claim 19, wherein the SLA comprises Word Splitter (WS), Morphological Analyzer (MA) and Language Rules Engine (LRE).

23. The system as claimed in claim 22, wherein the MA has proposed suffix in a suffix table to look-up at different point during breaking up of each word into a root and a suffix.

24. The system as claimed in claim 19, wherein the MLM is a database having the equivalent elements of the source language in all the other languages.

25. The system as claimed in claim 19, wherein the TLG comprises Word Grouper (WG), Morphological Synthesizer (MS) and Language Rule Engine (LRE).

26. The system as claimed in claims 22 and 25, wherein the LRE has entire grammar rules and exceptions of the language.

27. The system as claimed in claim 19, wherein the system maintains meaning, information, context and concordance of the source language in the target language(s).