WO2008007386A1 - A method for run time translation to create language interoperability environment [lie] and system thereof

Publication number
WO2008007386A1
Authority
WO
WIPO (PCT)
Prior art keywords
language
input
word
text
sla
Prior art date
Application number
PCT/IN2006/000268
Other languages
French (fr)
Other versions
WO2008007386A9 (en)
WO2008007386B1 (en)
Inventor
Chandrashekar Rudrappa Koranahally
Original Assignee
Koranahally Chandrashekar Rudr
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Koranahally Chandrashekar Rudr filed Critical Koranahally Chandrashekar Rudr
Publication of WO2008007386A1 publication Critical patent/WO2008007386A1/en
Publication of WO2008007386B1 publication Critical patent/WO2008007386B1/en
Publication of WO2008007386A9 publication Critical patent/WO2008007386A9/en

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/55 Rule-based translation

Abstract

A method and system for run time translation of input, independent of its language and format, wherein the language background knowledge is used to convey context of the text using internet based protocol in order to create a Language Interoperability Environment (LIE). The method comprises the steps of sending the input in source language to a Source Language Analyzer (SLA), analyzing the input with the SLA to obtain broken-down word groups along with their grammatical features, replacing the analyzed input with its target language(s) equivalents using a Multi Language Mapper (MLM), generating words from each root, its lexical category and grammatical features using a Target Language Generator (TLG), and receiving the output in target language(s) in identical format at an intended destination, as shown in Figure 4.

Description

A METHOD FOR RUN TIME TRANSLATION TO CREATE LANGUAGE INTEROPERABILITY ENVIRONMENT (LIE) AND SYSTEM THEREOF
Field of the Present Invention
The present invention addresses the issue of communication between people speaking different languages, which is a problem of enormous proportions. An equally enormous endeavor is to build a Language Interoperability Environment (LIE) which can translate content (text or speech) from a source language to target language(s) at run time and help people of the world to communicate with each other and to access the wealth of information that is available the world over.
Background of the Present Invention
Communication between people speaking different languages has always troubled mankind. The problem is especially pronounced in a country like India, which has so many languages. There are a few possible solutions to this problem:
1. One or both of the people know the other person's language, or both know a common language like English.
2. Employ a translator to help them communicate with each other.
3. Use sign language.
Offices have become truly cosmopolitan, with people from different parts of the country speaking different languages. With the advent of computers and the internet, globalization and the need to communicate with people speaking different languages, English has become the middleman. English is the language of modern day research and scholarship, and has emerged as the lingua franca of the modern world and the globalized society. All important works are translated into English from the different languages of the world, resulting in the creation of an immense knowledgebase.
The number of people who know English is minuscule, yet the Internet is dominated by English. Advanced countries have invested enough in systems to ensure that their languages are adapted to the digital age. The problem is acute in the poor and developing countries of Asia and Africa. Unfortunately, it is here that the maximum diversity exists.
These immense groups of people who are left out are forced to catch up at the cost of neglecting their own languages. With little investment and a lack of focus, the usage of these languages has been coming down drastically. This is evident in the new generation of people.
Experts say that in 2-3 centuries there will be only 4-5 languages in use on planet Earth; a repeat of the Rosetta script - only larger in scale. Languages have evolved over centuries and have literature, a knowledgebase and a world of their own. Mankind will lose something great and immense, and the worst part is that the loss is irreversible. There is an urgent need to evolve a system to keep these languages alive.
Languages survive only through usage, and this can happen only if people are able to transact in their own languages. Computers and the Internet have become modern day necessities, and people should be able to use them and access the world's information in their own languages and feel empowered.
Prior Art of the present Invention
To reach people speaking different languages, translations have been popular for centuries, and the Bible is the most translated book. There have been developments in the Machine Translation (MT) arena even though it is in its infancy.
The current state is that, though multilingual sites and translation software are available, they operate independently and are inaccurate.
Many web sites like Alta Vista and Google offer a translation facility where a user can type in text and ask the system to translate it to other languages. But the sad part is that it is 'word to word' translation. It is not even repeatable: if a user types in text in English and asks the system to translate it to, say, French, then copies the output, pastes it in the input box again and asks the system to translate it back to English, the output will be different from the original input.
To address this issue of enormous proportions, an equally enormous endeavor is to build a Language Interoperability Environment (LIE) which can translate content from a source language to target language(s) at run time and help people of the world to communicate with each other and to access the wealth of information that is available the world over.
Even though computers and processing power have increased multifold in the last decade, a major weakness of the machine remains: it has little or no common sense or world knowledge. Fully-automatic general purpose high quality machine translation systems (FGH-MT) are extremely difficult to build. In fact, there is no system in the world that qualifies as an FGH-MT. The reasons are evident and are not difficult to locate. Translation is a creative process that involves interpretation of the given text by a translator and also varies depending on the audience and the purpose. This explains the difficulty of building a machine translation system.
The major difficulty the machine faces in interpreting a given text is the lack of general world knowledge or common knowledge, subject specific knowledge, knowledge of the context, etc., which can be collectively called 'background knowledge'. The difficulty the machine faces at the first level pertains to information coded in a text.
To overcome the complexities of such a large scale MT system, the most common approach has been to delimit the subject domain so that the machine works in a narrow subject area, such as weather reports, computer manuals, etc. It has been hoped that by delimiting MT to a narrow area, one stands a better chance of using context, domain knowledge, etc. The system would perform badly when given a text outside the domain, but that is a limitation one would have to live with. The real difficulty is in identifying a domain that is narrow enough that the system works well, and wide enough that enough real texts qualify to be in it, so that it is practically useful.
When some information is transferred from one language to another, there is no way to express it exactly. There will be losses and imperfections to some extent, just as transmission or interpretation losses occur whenever something is transformed from one medium to another.
LIE addresses these issues. Another important aspect is that LIE is NOT aimed at translating serious material like poetry but at handling mundane material: the kind of language used in everyday life is fairly simple, and LIE is meant to help people as much as possible.
Brief description of the Accompanying drawings:
Figure 1: represents mapping of all the languages in Multi Language Mapper (MLM)
Figure 2: represents a summarized LIE
Figure 3: represents LIE specific to speech
Figure 4: represents an elaborate LIE
Objects of the Present Invention
The main object of the present invention is to develop a method for run time translation of input, independent of its language and format. Yet another object of the present invention is to develop a method wherein the language background knowledge is used to convey context of the text.
Still another object of the present invention is to develop said method using internet based protocol.
Still another object of the present invention is to develop said method in order to create a Language Interoperability Environment (LIE).
Another main object of the present invention is to develop a system for run time translation of input, independent of its language and format. Yet another object of the present invention is to develop a system wherein the language background knowledge is used to convey context of the text.
Still another object of the present invention is to develop said system using internet based protocol. Still another object of the present invention is to develop said system in order to create a Language Interoperability Environment (LIE).
Statement of the present Invention
The present invention is related to a method for run time translation of input, independent of its language and format, wherein the language background knowledge is used to convey context of the input using internet based protocol in order to create a Language Interoperability Environment (LIE), said method comprising steps of sending the input in source language to Source Language Analyzer (SLA), analyzing the input using SLA to obtain broken-down word groups along with their grammatical features, replacing the analyzed input with its target language(s) equivalents using Multi Language Mapper (MLM), generating words taking root, its lexical category and grammatical features using Target Language Generator (TLG), and receiving the output in target language(s) in identical format at an intended destination; and to a system for run time translation of input, independent of its language and format, wherein the language background knowledge is used to convey context of the input using internet based protocol in order to create a Language Interoperability Environment (LIE), said system comprising: means for sending the input in a source language to Source Language Analyzer (SLA); means for analyzing the input using SLA to obtain broken-down word groups along with their grammatical features, thereafter replacing the analyzed text with its target language(s) equivalents using Multi Language Mapper (MLM), and thereby generating words taking root, its lexical category and grammatical features using Target Language Generator (TLG); and means for receiving the output in target language(s) in identical format at an intended destination.
Detailed description of present invention:
Accordingly, the present invention relates to a method for run time translation of input, independent of its language and format, wherein the language background knowledge is used to convey context of the input using internet based protocol in order to create a Language Interoperability Environment (LIE), said method comprising steps of:
a) sending the input in source language to Source Language Analyzer (SLA),
b) analyzing the input using SLA to obtain broken-down word groups along with their grammatical features,
c) replacing the analyzed input with its target language(s) equivalents using Multi Language Mapper (MLM),
d) generating words taking root, its lexical category and grammatical features using Target Language Generator (TLG), and
e) receiving the output in target language(s) in identical format at an intended destination.
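As an illustration only, steps (b) to (d) of the method can be pictured as three functions chained together. The sketch below is a toy under stated assumptions, not the implementation described in this document: the `Analysis` record, the function names, the tiny English lexicon and the English-to-French root table are all hypothetical, and sending and receiving the input (steps (a) and (e)), format tagging and word reordering are left out.

```python
from dataclasses import dataclass, field


@dataclass
class Analysis:
    """Analyzer output for one word: root, lexical category, grammatical features."""
    root: str
    category: str
    features: dict = field(default_factory=dict)


# Hypothetical toy resources; a real system would need full dictionaries.
EN_LEXICON = {
    "book": Analysis("book", "noun", {"number": "singular"}),
    "books": Analysis("book", "noun", {"number": "plural"}),
    "two": Analysis("two", "numeral"),
}
EN_TO_FR_ROOTS = {"book": "livre", "two": "deux"}


def source_language_analyzer(text):
    """Step (b): split the input into words and analyze each one."""
    return [EN_LEXICON[word] for word in text.lower().split()]


def multi_language_mapper(analyses, root_map):
    """Step (c): replace each source root with its target-language root."""
    return [Analysis(root_map[a.root], a.category, dict(a.features)) for a in analyses]


def target_language_generator(analyses):
    """Step (d): generate surface words from root, category and features."""
    words = []
    for a in analyses:
        plural = a.category == "noun" and a.features.get("number") == "plural"
        words.append(a.root + "s" if plural else a.root)
    return " ".join(words)


def translate(text):
    """Chain the three stages for the toy English-to-French pair."""
    analyzed = source_language_analyzer(text)
    mapped = multi_language_mapper(analyzed, EN_TO_FR_ROOTS)
    return target_language_generator(mapped)
```

With these toy tables, `translate("two books")` yields `"deux livres"`. The point of the sketch is only that grammatical features found during analysis (here, the plural) travel through the mapping stage unchanged and are re-applied by the generator.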
In an embodiment of the present invention, the method further comprises editing the text at steps (a) and/or (e) using a pre-editor and a post-editor respectively.
In yet another embodiment of the present invention the input and output are text or speech (Figure 3).
In still another embodiment of the present invention, the pre-editor provides for editing, identifying non-standard forms, seeking corrections and offering alternatives from which the correct form can be chosen.
In still another embodiment of the present invention the sent text is tagged to characterize the format.
In still another embodiment of the present invention the SLA comprises Word Splitter (WS), Morphological Analyzer (MA) and Language Rules Engine (LRE).
In still another embodiment of the present invention the WS analyzes and separates words and word groups.
In still another embodiment of the present invention, the MA analyzes each word and produces its root and grammatical features.
In still another embodiment of the present invention, the MA breaks up each word into a root and a suffix at different points to look up the proposed root in a dictionary and the proposed suffix in a suffix table.
In still another embodiment of the present invention, characters are added and/or deleted during breakup of words.
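The root-and-suffix breakup described for the MA can be sketched as a loop over candidate split points, with a small set of character adjustments tried at each one. This is a minimal sketch under stated assumptions: the English dictionary, the suffix table and the adjustment list below are hypothetical toy data, and the function name `analyze_word` does not come from this document.

```python
# Hypothetical toy dictionary: root -> lexical category.
ROOT_DICTIONARY = {"walk": "verb", "make": "verb", "city": "noun", "box": "noun"}

# Hypothetical toy suffix table: suffix -> grammatical features.
SUFFIX_TABLE = {
    "": {},
    "s": {"number": "plural"},
    "es": {"number": "plural"},
    "ed": {"tense": "past"},
    "ing": {"aspect": "progressive"},
}

# Character changes tried at the breakup point:
# (characters deleted from the end of the stem, characters added back).
ADJUSTMENTS = [("", ""), ("", "e"), ("i", "y")]


def analyze_word(word):
    """Return every (root, category, features) for which both lookups succeed."""
    results = []
    for cut in range(len(word), 0, -1):  # try each breakup point in turn
        stem, suffix = word[:cut], word[cut:]
        if suffix not in SUFFIX_TABLE:
            continue
        for strip, add in ADJUSTMENTS:
            if strip and not stem.endswith(strip):
                continue
            root = stem[: len(stem) - len(strip)] + add
            if root in ROOT_DICTIONARY:
                results.append((root, ROOT_DICTIONARY[root], SUFFIX_TABLE[suffix]))
    return results
```

For example, `analyze_word("cities")` finds the root `city` by deleting `i` and adding `y` at the breakup point before the suffix `es`, while `analyze_word("making")` restores the deleted `e` to reach `make`. A word with no valid breakup returns an empty list, which a fuller system might hand to the pre-editor.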
In still another embodiment of the present invention, the MLM replaces elements of the source language with elements of the target language(s) using a database having equivalent elements of the source language in all other languages.
In still another embodiment of the present invention, the TLG comprises Word Grouper (WG), Morphological Synthesizer (MS) and Language Rules Engine (LRE).
In still another embodiment of the present invention the WG analyzes and separates and/or combines words and word groups.
In still another embodiment of the present invention, the MS synthesizes words taking root, its lexical category, grammatical rules and features.
In still another embodiment of the present invention, the LRE helps check lexical category, exceptions, grammatical rules and features.
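The division of labour between the MS (generating a word from a root, its lexical category and features) and the LRE (checking exceptions before regular rules apply) might look roughly like the following. It is a sketch under stated assumptions: the exception table, the few English spelling rules and the name `synthesize` are hypothetical, and rules such as consonant doubling are deliberately omitted.

```python
# Hypothetical exception table of the kind an LRE might hold.
EXCEPTIONS = {
    ("child", "noun", "plural"): "children",
    ("go", "verb", "past"): "went",
}


def synthesize(root, category, features):
    """Generate a surface word from a root, its lexical category and features."""
    if category == "noun" and features.get("number") == "plural":
        irregular = EXCEPTIONS.get((root, category, "plural"))
        if irregular:
            return irregular  # exceptions are checked before regular rules
        if root.endswith(("s", "x", "z", "ch", "sh")):
            return root + "es"
        if root.endswith("y") and root[-2:-1] not in "aeiou":
            return root[:-1] + "ies"
        return root + "s"
    if category == "verb" and features.get("tense") == "past":
        irregular = EXCEPTIONS.get((root, category, "past"))
        if irregular:
            return irregular
        return root + "d" if root.endswith("e") else root + "ed"
    return root  # no inflection requested
```

Here `synthesize("city", "noun", {"number": "plural"})` gives `"cities"` through a regular rule, while `synthesize("child", "noun", {"number": "plural"})` gives `"children"` from the exception table.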
In still another embodiment of the present invention, the received text format reflects characteristics of the tagged sent text.
In still another embodiment of the present invention the editing provides for background knowledge to convey context of the text.
In still another embodiment of the present invention the method maintains meaning, information, context and concordance of the source language in the target language(s).
In another main embodiment of the present invention, a system for run time translation of input, independent of its language and format, wherein the language background knowledge is used to convey context of the input using internet based protocol in order to create a Language Interoperability Environment (LIE), comprises:
1. means for sending the input in a source language to Source Language Analyzer (SLA);
2. means for analyzing the input using SLA to obtain broken-down word groups along with their grammatical features, thereafter replacing the analyzed text with its target language(s) equivalents using Multi Language Mapper (MLM), and thereby generating words taking root, its lexical category and grammatical features using Target Language Generator (TLG); and
3. means for receiving the output in target language(s) in identical format at an intended destination.
In still another embodiment of the present invention, the system further comprises means for pre-editing/post-editing.
In still another embodiment of the present invention the input and output are text or speech.
In still another embodiment of the present invention the SLA comprises Word Splitter (WS), Morphological Analyzer (MA) and Language Rules Engine (LRE).
In still another embodiment of the present invention the MA looks up each proposed suffix in a suffix table at the different points at which it tries to break up each word into a root and a suffix.
In still another embodiment of the present invention the MLM is a database having the equivalent elements of the source language in all the other languages.
In still another embodiment of the present invention the TLG comprises Word Grouper (WG), Morphological Synthesizer (MS) and Language Rule Engine (LRE).
In still another embodiment of the present invention the LRE has the entire grammar rules and exceptions of the language.
In still another embodiment of the present invention the system maintains meaning, information, context and concordance of the source language in the target language(s).
The Language Interoperability Environment (LIE) shown in Figure 1 is aimed at creating a:
1. Run time Machine Translation environment.
2. English is used as the reference language to create the MLM, as the languages of the world have dictionaries available between the respective languages and English. For example: Kannada-English, French-English, Hindi-English, etc. This is done for grammatical and morphological purposes also. Hence while designing LIE, English will be used as the gold standard.
3. The LIE engines will translate from the source language to the target language(s) and vice versa. There are 3 components as shown in Figure 2: Source Language Analyzer (SLA), Multi Language Mapper (MLM) and Target Language Generator (TLG). These have to be built in each language adhering to the overall architecture.
4. The Multi Language Mapper (MLM) is a huge database that has the equivalent elements of the source language in all the other languages under consideration and will be expanded to include many more languages when resources permit.
5. The Language Interoperability is achieved through creating standard interfaces and formats between the different LIE engines. For example:
a. A person can write a document in Kannada. The recipients can then read the document in Kannada.
b. If a recipient wants to read it in English/German/French/Tamil/Mandarin, he/she can get the Kannada document translated using LIE-English/LIE-German/LIE-French/LIE-Tamil/LIE-Mandarin, etc.
This way we can achieve language interoperability. Thus LIE unites the entire world and its people together by empowering them to transact in their own languages with all others with the help of advanced technology, computers and connectivity. The result is that the entire world, its people and the immense knowledgebase opens in one's own language.
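One way to picture an MLM built around English as the reference language is a table in which each row ties one English entry to its equivalents in the other languages, so that any pair of languages can be mapped through the shared row. The sketch below assumes a hypothetical row layout, language codes and function names; the actual MLM schema is not specified in this part of the document.

```python
# Hypothetical MLM rows: each row links one English reference entry
# to its equivalents in the other languages under consideration.
MLM_ROWS = [
    {"en": "water", "fr": "eau", "de": "Wasser", "kn": "ನೀರು"},
    {"en": "book", "fr": "livre", "de": "Buch", "kn": "ಪುಸ್ತಕ"},
]


def build_index(rows):
    """Index every (language, word) pair to the row that contains it."""
    index = {}
    for row in rows:
        for lang, word in row.items():
            index[(lang, word)] = row
    return index


def map_element(index, word, source, target):
    """Return the target-language equivalent of a source element, or None."""
    row = index.get((source, word))
    return row.get(target) if row else None


MLM_INDEX = build_index(MLM_ROWS)
```

With this toy table, `map_element(MLM_INDEX, "ನೀರು", "kn", "de")` returns `"Wasser"` without a dedicated Kannada-German dictionary, because both words sit in the row anchored on the English entry `water`. Adding a language then means adding one column rather than one dictionary per language pair.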
LIE is a very large and very complex software system hosted on a powerful farm of servers. The system is made available in several flavors like:
1. Freely available on the web for writing and translating - email, chat, browsing and searching the internet in many languages. This supports many concurrent users but limits the input to a few pages at a time.
2. A product that can be set up on powerful servers at local installations like large corporations. This supports a few users but can take large inputs running into several pages.
3. Licensed and secure usage by companies for their corporate communication.
A basic LIE MT System consists of an analyzer of the source language, i.e. the Source Language Analyzer (SLA), whose output is fed to the Target Language Generator (TLG) for the generation of the target language. Between the analyzer and the generator there is a Multi Language Mapper (MLM) which uses multilingual dictionaries and grammar rules/exceptions with support from the LRE to map the source language elements to target language(s) elements.
To make the system more usable, User Interface Editors are also provided for human pre-editing of the input and post-editing of the output. These are also part of the overall system. The important components are described in Figure-2:
At a basic level, the Machine Translation (MT) is perceived as a sequence of independent steps/processes executed by the different modules of the overall software system. The engines are different for different languages; hence, for each language a separate system needs to be built which adheres to the overall system needs and architecture.
The input to the system is either formatted text (email, html, Microsoft Word document, Excel spread sheet, pdf ... file) or voice.
The way in which the LIE system works is represented in Figure-4:
1. A Listener software module receives the formatted input text, then identifies and tags it for characteristics such as:
• Original format: html, Microsoft Word document, Excel spread sheet, pdf ... file.
• Format details like paragraphs, fonts, bold/italic, etc.
• Source Language.
• Target Language(s).
2. If the input is speech, the voice modulations are analyzed by the Speech Analyzer (SA), corresponding values are fetched from the MLM-speech database and the output will be given using the Speech Generator (SG) in the target language. Figure-4 explains this scenario.
3. Depending on the source language and the target language(s), the respective software engines are invoked and the input is passed to them. From this point onwards, the processing steps refer to the specific language engines.
4. The input text is passed to the pre-editor which is a user interface that allows the user to edit and correct the input: words spelt with non-standard spellings are changed to their standard spellings. It also points out the non-standard forms and seeks corrections. It can also present alternatives out of which the user can choose the correct form. The user can avoid this step if he/she wishes to do so.
5. The input text in a source language is passed through the Source Language Analyzer (SLA), which has components like the local Word Splitter and the Morphological Analyzer.
6. The local Word Splitter (WS) analyzes and separates words and word groups like idioms and phrases.
7. The output is passed to the Morphological Analyzer (MA), which is designed to handle inflectional and derivational morphology. It analyzes each word and produces its root and grammatical features using the elaborate Language Rules Engine (LRE), which has the entire grammar rules and exceptions. For a given word, it checks for the lexical category (such as pronoun, post-position, noun, verb, etc.) and other grammatical features. It also tries to see whether the word can be broken up into a root and a suffix. At the breakup point, some characters such as vowels may be added or deleted. It may have to try several times to break the word at different points. For each breakup it looks up the proposed root in the dictionary and the proposed suffix in a suffix table. Whenever both lookups are successful, that breakup is taken as valid. This is the output of the source system.
8. The Multi Language Mapper (MLM) is a huge database that has the equivalent elements of the source language in all the other languages. MLM takes the output produced so far to replace the elements of the source language with elements of the target language(s) and kick starts the Target Language Generator (TLG) processes of the respective languages.
9. The output of the MLM is fed into the respective Target Language Generator (TLG), which has components like the local Word Grouper (WG) and the Morphological Synthesizer (MS). These are the reverse of the Word Splitter and the Morphological Analyzer.
10. The Word Grouper (WG) analyzes and separates and/or combines words and word groups like idioms and phrases.
11. Morphological Synthesizer (MS) takes a root, its lexical category and grammatical features and generates words.
12. The output is fed into a Language Packager (LP) to package it in the target language. It applies the formats of the original text to the output text, such as:
a) Original format: html, Microsoft Word document, Excel spread sheet, pdf, ... file
b) Format details like paragraphs, fonts, bold/italic, etc.
13. The output produced is the LIE system output. The post-editing user interface allows the user to do post-editing rapidly. The user can avoid this step if he/she wishes to do so. There are three levels of post-editing:
a. The first level seeks to make the output grammatically correct.
b. In the second level, the raw output is corrected not only grammatically but also stylistically.
c. In the third level, the post-editor might change the setting and the events in the story to convey the same meaning to the reader who has a different cultural and social background. This is really trans-creation, and a creative post-editor can go all the way up to this level.
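The root/suffix breakup described in step 7 can be sketched roughly as follows. This is only an illustration under stated assumptions: the English toy dictionary ROOTS, the suffix table SUFFIXES and the function analyze_word are hypothetical, and the "add or delete a character at the breakup point" rule is reduced to two simple adjustments (append "e", or drop the last character of the stem).

```python
from typing import List, Optional, Tuple

# Hypothetical toy dictionary: root -> lexical category.
ROOTS = {"walk": "verb", "make": "verb", "run": "verb"}

# Hypothetical toy suffix table: suffix -> grammatical feature.
SUFFIXES = {"ed": "past", "ing": "progressive", "s": "third_person_singular"}

Analysis = Tuple[str, str, Optional[str]]  # (root, category, feature)


def analyze_word(word: str) -> List[Analysis]:
    """Try every breakup point; keep breakups where both lookups succeed."""
    analyses: List[Analysis] = []
    if word in ROOTS:
        # The bare word is itself a root with no suffix.
        analyses.append((word, ROOTS[word], None))
    for cut in range(1, len(word)):
        stem, suffix = word[:cut], word[cut:]
        if suffix not in SUFFIXES:
            continue  # suffix table lookup failed for this breakup
        # Candidate roots: stem as-is, a vowel added, a character deleted.
        for root in (stem, stem + "e", stem[:-1]):
            if root in ROOTS:  # dictionary lookup succeeded
                analyses.append((root, ROOTS[root], SUFFIXES[suffix]))
    return analyses


walked = analyze_word("walked")
making = analyze_word("making")
running = analyze_word("running")
```

The function returns a list because a word may admit more than one valid breakup; choosing among them would fall to the grammar rules and exceptions held in the LRE.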
LIE takes the information in the source language text and presents it in the target language. Thus, at the prefix/suffix level, a prefix/suffix in the source language is replaced by a suitable element in the target language and at the word level, the source words are replaced by equivalent words in the target language. Similarly, the word groups are also replaced by equivalent groups in the target language.
The LIE system is to be designed so that the combination of man and machine together can perform translations and the output is as close to the target language as possible. If LIE enters into mainstream and common use, it has major implications for global communication and integration as a person can access documents in his/her language which will be a big asset.
The LIE answer to the world's communication problem is that it envisages building a massive IT backbone which can take input in the languages for which the LIE systems are built and give output in other languages and vice versa. The architecture and standards are defined in such a way that all the LIE engines adhere to a standard architecture and talk to each other based on defined document interchange standards, which are based on open standards like Unicode, XML and web services. A person from Japan can transact in his own language, Japanese, with a person from Germany who is transacting in his own language, German.
The task of building a LIE machine translation system for each language is subdivided into two parts:
1. The first module, the core LIE, does language analysis based on language knowledge: it takes all the information in the source text and presents it in its output, which is quite close to the target language.
2. The second module does domain-specific knowledge-based processing, statistical processing, etc., in which it utilizes world knowledge, frequency information, concordances, etc. to produce output in the target language.
The advantages of said modular approach are as given below:
1. The first module can be made available for use at an earlier date since it requires less effort and is easier to build. However, the user needs a certain amount of training to read the output and make sense of it.
2. The early feedback guides the refinement and building of the system. Since the system can be used at an early date, not only does it serve a useful purpose, it also becomes easier to build the second module.
3. The system provides a robust layer in the first module which can be used even if the second module fails to an extent in any specific context. The second module by its very nature is fragile. The first module is made much more robust.
4. The segregation of said modules is critical to appreciate the boundaries of the various activities and to co-ordinate them in a better manner. It also facilitates due recognition of language knowledge, and thereby of the background knowledge, to ultimately achieve LIE.
5. Because the software is very complex and needs teams from the respective language groups and a lot of money to build and operate, LIE is made available in a few languages in the first phase of implementation. The people speaking those languages will validate the translations and their uses, and will help in refining the system. After several such iterations a more robust environment can be developed and subsequently enhanced to involve more languages. The ultimate aim is to develop the Language Interoperability Environment (LIE) in most of the languages of the world and bring the entire planet under one interoperable umbrella.
6. It is also envisaged that the knowledgebase is made available in the many languages of the world at runtime. The philosophy is to provide access to "all the world's information" through mechanized translation with interoperability mechanisms inbuilt.
The invention is further elaborated with the help of the following examples. However, such examples should not be construed to limit the scope of the invention.
Example 1
A sample MLM table
A sample Multi Language Mapper (MLM) table, which is part of the MLM database, is given below.
Figure imgf000014_0001
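Since the table itself is only available as a figure, the following Python sketch shows one possible shape for such a table: one row per element, one column per language. The rows, the ISO 639-1 language codes and the functions build_index and map_element are illustrative assumptions, not the contents of the patent's figure.

```python
from typing import Dict, List

# Hypothetical MLM rows: each row holds equivalent elements across languages.
MLM_TABLE: List[Dict[str, str]] = [
    {"en": "water", "de": "Wasser", "fr": "eau"},
    {"en": "house", "de": "Haus", "fr": "maison"},
]


def build_index(table: List[Dict[str, str]],
                source_lang: str) -> Dict[str, Dict[str, str]]:
    """Index the rows by their (lower-cased) entry in the source language."""
    return {row[source_lang].lower(): row
            for row in table if source_lang in row}


def map_element(element: str, source_lang: str,
                target_langs: List[str]) -> Dict[str, str]:
    """Return the equivalents of one source element in each target language.

    Unknown elements, and languages missing from a row, fall back to the
    original element unchanged.
    """
    row = build_index(MLM_TABLE, source_lang).get(element.lower())
    if row is None:
        return {lang: element for lang in target_langs}
    return {lang: row.get(lang, element) for lang in target_langs}


water = map_element("water", "en", ["de", "fr"])
haus = map_element("Haus", "de", ["en"])
```

Because every row lists all languages side by side, any language can act as the source, which matches the description of the MLM as holding equivalents "in all the other languages".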
Example 2
A sample html sentence and the intermediary steps are shown below:
Figure imgf000014_0002
Figure imgf000015_0001
The format of the input is maintained in the output with the $10000 in bold. Similarly input formats like html, Microsoft Word, Excel, etc., are packaged accordingly with the formats like paragraphs, bold, italics, punctuation marks, etc.
Example 3
As shown in Figure 4, Person A sends an email with a Microsoft Word document attachment in the source language to Person B. This email goes to the LIE for transformation to the target language before it reaches Person B. The Listener initially tags the formats, such as the document format (html, Excel, Word, etc.) and the format characteristics (paragraphs, bold, italics, etc.), and optionally passes the input to the User Interface (UI) for pre-editing, where the user can edit non-standard forms/spellings. Further, the SLA takes the processed input, analyzes it and produces a broken-down structure and its grammatical features. The MLM replaces each element of the source language with an element of the target language. In the TLG, a combination of man and machine together can perform translations so that the output is as close to the target language as possible. The Packager applies the original formats with the help of the tags produced by the Listener. The user can now edit the output from the Packager for non-standard forms/spellings. Person B now receives the email with the Microsoft Word attachment in the target language.
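The Listener and Packager roles in this example (tag the format first, reapply it after translation) can be approximated for simple html input as below. This is a minimal sketch under stated assumptions: the function translate_preserving_markup, the toy dictionary and the word-for-word toy_translate are hypothetical, and translating each text segment separately would not be adequate where the target language needs words reordered across tags.

```python
import re
from typing import Callable, Dict

# Matches a single html tag such as <b> or </b>.
TAG = re.compile(r"(<[^>]+>)")

# Hypothetical toy dictionary, for illustration only.
TOY_EN_DE: Dict[str, str] = {"Pay": "Zahle", "now": "jetzt"}


def toy_translate(text: str) -> str:
    """Replace known words one by one, leaving spacing and symbols intact."""
    return re.sub(r"\w+", lambda m: TOY_EN_DE.get(m.group(0), m.group(0)), text)


def translate_preserving_markup(html: str,
                                translate_text: Callable[[str], str]) -> str:
    """Translate only the text between tags; copy the tags through as-is."""
    out = []
    for part in TAG.split(html):  # alternates text segments and tags
        if TAG.fullmatch(part) or not part.strip():
            out.append(part)      # a tag or whitespace: keep unchanged
        else:
            out.append(translate_text(part))
    return "".join(out)


packaged = translate_preserving_markup("Pay <b>$10000</b> now", toy_translate)
```

Here the bold markup around the amount survives untouched, in the same spirit as Example 2, where the $10000 stays bold in the output.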

Claims

1. A method for run time translation of input, independent of its language and format, wherein the language background knowledge is used to convey context of the input using internet based protocol in order to create a Language Interoperability Environment (LIE), said method comprising steps of a) sending the input in source language to Source Language Analyzer (SLA), b) analyzing the input using SLA to obtain broken-down word groups along with its grammatical features, c) replacing the analyzed input to its target language(s) using Multi Language Mapper (MLM), d) generating words taking root, its lexical category and grammatical features using Target Language Generator (TLG), and e) receiving the output in target language(s) in identical format at an intended destination.
2. The method as claimed in claim 1, wherein the method further comprises editing the text at steps (a) and/or (e) using pre- and post-editor respectively.
3. The method as claimed in claim 1, wherein the input and output are text or speech.
4. The method as claimed in claim 1, wherein the pre-editor provides for editing, identifying non-standard forms, seeking corrections and offering alternatives in order to choose correct form.
5. The method as claimed in claim 1, wherein the sent text is tagged to characterize the format.
6. The method as claimed in claim 1, wherein SLA comprises Word Splitter (WS), Morphological Analyzer (MA) and Language Rules Engine (LRE).
7. The method as claimed in claim 6, wherein the WS analyzes and separates words and word groups.
8. The method as claimed in claim 6, wherein the MA analyzes each word and produces its root and grammatical features.
9. The method as claimed in claim 6, wherein the MA breaks up each word into a root and a suffix at different points to look-up the proposed root in dictionary and the proposed suffix in a suffix table.
10. The method as claimed in claim 9, wherein characters are added and/or deleted during breakup of words.
11. The method as claimed in claim 1, wherein the MLM replaces elements of source language with elements of target language(s) using database having equivalent elements of the source language in all other languages.
12. The method as claimed in claim 1, wherein the TLG comprises Word Grouper (WG), Morphological Synthesizer (MS) and Language Rules Engine (LRE).
13. The method as claimed in claim 12, wherein the WG analyzes and separates and/or combines words and word groups.
14. The method as claimed in claim 12, wherein the MS synthesizes words taking root, its lexical category, grammatical rules and features.
15. The method as claimed in claims 6 and 12, wherein the LRE helps check lexical category, exceptions, grammatical rules and features.
16. The method as claimed in claim 1, wherein the received text format reflects characteristics of the tagged sent text.
17. The method as claimed in claim 2, wherein the editing provides for background knowledge to convey context of the text.
18. The method as claimed in claim 1, wherein the method maintains meaning, information, context and concordance of the source language in the target language(s).
19. A system for run time translation of input, independent of its language and format, wherein the language background knowledge is used to convey context of the input using internet based protocol in order to create a Language Interoperability Environment (LIE), said system comprises:
a. means for sending the input in a source language to Source Language Analyzer (SLA);
b. means for analyzing the input using SLA to obtain broken-down word groups along with its grammatical features, thereafter replacing the analyzed text to its target language(s) using Multi Language Mapper (MLM), and thereby generating words taking root, its lexical category and grammatical features using Target Language Generator (TLG); and
c. means for receiving the output in target language(s) in identical format at an intended destination.
20. The system as claimed in claim 19, wherein the system further comprises means for pre-editing/post-editing.
21. The system as claimed in claim 19, wherein the input and output are text or speech.
22. The system as claimed in claim 19, wherein the SLA comprises Word Splitter (WS), Morphological Analyzer (MA) and Language Rules Engine (LRE).
23. The system as claimed in claim 22, wherein the MA has proposed suffix in a suffix table to look-up at different point during breaking up of each word into a root and a suffix.
24. The system as claimed in claim 19, wherein the MLM is a database having the equivalent elements of the source language in all the other languages.
25. The system as claimed in claim 19, wherein the TLG comprises Word Grouper (WG), Morphological Synthesizer (MS) and Language Rules Engine (LRE).
26. The system as claimed in claims 22 and 25, wherein the LRE has entire grammar rules and exceptions of the language.
27. The system as claimed in claim 19, wherein the system maintains meaning, information, context and concordance of the source language in the target language(s).
PCT/IN2006/000268 2006-07-14 2006-07-31 A method for run time translation to create language interoperability environment [lie] and system thereof WO2008007386A1 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
IN1227CH2006 2006-07-14
IN1227/CHE/2006 2006-07-14

Publications (3)

Publication Number Publication Date
WO2008007386A1 WO2008007386A1 (en) 2008-01-17
WO2008007386B1 WO2008007386B1 (en) 2008-03-27
WO2008007386A9 WO2008007386A9 (en) 2008-12-11

Family

ID=38922983

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IN2006/000268 WO2008007386A1 (en) 2006-07-14 2006-07-31 A method for run time translation to create language interoperability environment [lie] and system thereof

Country Status (1)

Country Link
WO (1) WO2008007386A1 (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8364463B2 (en) 2009-09-25 2013-01-29 International Business Machines Corporation Optimizing a language/media translation map

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6470306B1 (en) * 1996-04-23 2002-10-22 Logovista Corporation Automated translation of annotated text based on the determination of locations for inserting annotation tokens and linked ending, end-of-sentence or language tokens
US20020169592A1 (en) * 2001-05-11 2002-11-14 Aityan Sergey Khachatur Open environment for real-time multilingual communication
WO2005096708A2 (en) * 2004-04-06 2005-10-20 Department Of Information Technology A system for multiligual machine translation from english to hindi and other indian languages using pseudo-interlingua and hybridized approach

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
BHARATI A. ET AL.: "Anuvad: Approaches to Translation", ANUSAARAKA: OVERCOMING THE LANGUAGE BARRIER IN INDIA NEW DELHI, 2001, 24 September 2007 (2007-09-24), Retrieved from the Internet <URL:http://www.arxiv.org/abs/cs.CL/0308018> *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8364463B2 (en) 2009-09-25 2013-01-29 International Business Machines Corporation Optimizing a language/media translation map
US8364465B2 (en) 2009-09-25 2013-01-29 International Business Machines Corporation Optimizing a language/media translation map

Also Published As

Publication number Publication date
WO2008007386A9 (en) 2008-12-11
WO2008007386B1 (en) 2008-03-27

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 06780548

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

NENP Non-entry into the national phase

Ref country code: RU

122 Ep: pct application non-entry in european phase

Ref document number: 06780548

Country of ref document: EP

Kind code of ref document: A1