GB2456391A - Reduced lexicon translation - Google Patents



Publication number
GB2456391A
Authority
GB
United Kingdom
Prior art keywords
language
english
words
vocabulary
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
GB0800748A
Other versions
GB0800748D0 (en
Inventor
Marcelo Funes
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Individual
Original Assignee
Individual
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Individual filed Critical Individual
Priority to GB0800748A priority Critical patent/GB2456391A/en
Publication of GB0800748D0 publication Critical patent/GB0800748D0/en
Publication of GB2456391A publication Critical patent/GB2456391A/en
Pending legal-status Critical Current


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/242 Dictionaries
    • G06F17/2795
    • G06F17/28
    • G06F17/2872
    • G06F17/289
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/237 Lexical tools
    • G06F40/247 Thesauruses; Synonyms
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/55 Rule-based translation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/40 Processing or translation of natural language
    • G06F40/58Use of machine translation, e.g. for multi-lingual retrieval, for server-side translation for client devices or for real-time translation

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Electrically Operated Instructional Devices (AREA)

Abstract

A system for specifying equivalence between a language and a reduced lexicon representation of the same language, including grammar, and automatically translating sentences in a language to sentences in a reduced-lexicon version of the same language preferably in a computer environment, while keeping substantially the same semantic content.

Description

Title: Automatic Translating Method and System
Background of the invention
Probably one of the greatest challenges of our time is that of communications. At a technical level of telecommunications technology, it is one of the fastest growing industries. However, once physical communication is achieved between individuals, the next step is that of understanding. This is normally achieved using language, not always phonetically as in sign language. This invention is aimed at spoken and written languages in general and particularly at the English language. The English language is the predominant language for human interaction across the whole world today. Two thirds of the Internet traffic takes place in English. Most of the top universities, business transactions, entertainment, as well as scientific and technological publications use the English language. The present invention proposes a significant reform in the approach to English as a foreign language communications.
Many translating tools are available to translate between languages, but they are in general of very low quality due to the complexities of correctly deriving the equivalence between words and phrases in two distinct natural languages. In general, an automatic translating apparatus reads a source text written in a source language and carries out morphological analysis. Then, a dictionary for translation is accessed to obtain grammatical information such as the part of speech for each word and a corresponding translation word for the target language, followed by analysis of the tense, person, number, and the like. An internal structure of the source language is then derived from the result of the morphological analysis and rules prepared in advance for that source language. The internal structure of the source language is transformed into an internal structure of the target language using a transformation rule. In accordance with the transformed internal structure, a translated sentence in the target language is generated using the translation words obtained by the dictionary look-up process during the morphological analysis. Translation is difficult for numerous reasons, including the lack of one-to-one word correspondences among languages, the existence in every language of homonyms, and the fact that natural grammars are idiosyncratic; they do not conform to an exact set of rules that would facilitate direct, word-to-word substitution. It is toward a computational
"understanding" of these idiosyncrasies that many artificial-intelligence research efforts have been directed, and their limited success testifies to the complexity of the problem.
An alternative is to interact in a language which is widely understood and which many people wish to learn even if at a basic conversational level in order to interact and be entertained, as is the case with the English language. The difficulty then arises of how to assimilate complex material even if only a colloquial level of knowledge is possessed.
At the root of the concept of language lies the very definition of a word. For centuries words like "virtue" have defied an exact generally-accepted definition, though this is a subtle complex term. However, as first noted by Wittgenstein (1953) even simple words like "dog" or "game" can be equally challenging. His suggestion was that there exists a "family resemblance" which allows us to identify a particular instance as a member of a group such as a "dog". Following the work of Rosch (1973) who first tried to connect these ideas into a psychological interpretation, we can conceive of these family resemblances as based on the fact that the human brain really possesses prototypes or exemplars in order to represent the meaning of a particular word. Moreover, Rosch argued that there is a "natural" level of categorization that we tend to use to communicate. This level is known as "basic-level categorization" and has been found to reflect a natural way humans categorize the objects in our world. This is important because we know that the processes of acquisition, storage and retrieval in the human brain are inextricably intertwined. Therefore, we can be fairly certain that basic-level categorization is a fundamental aspect of human cognition and communication. Now, this is not a particularly novel concept in itself, when viewed in a more global context.
Chinese writing for instance possesses more than 40,000 mainly ideographic signs, but knowledge of four thousand is enough for most purposes. Chinese writing, insofar as it is phonetic, is also monosyllabic, for the very good reason that the words of the language consist of only one syllable, with a large number of homophones, which made it important to have signs that distinguished between these homophones, and so the script avoided being purely phonetic. Even in this case, early simplification such as the one performed by James Yen in 1923, resulted in a selection of 1,200 of the traditional characters, in order to form what can be called Basic Chinese, enabling illiterate people to read in this system after four
months' work. A later refinement by Yuan Chao produced a system of about 2,500 of the traditional characters, which it was claimed can cover basically all of the language. Even the Japanese resolved the basic linguistic problem by adding Hiragana; children are taught 1,200 from 40,000 symbols, which often contain a Chinese root and suffixes.
Another attempt at devising a simplified version of a language is that of Basic English, as proposed by Charles K. Ogden in the 1920s. The fact that it is possible to say almost everything we normally desire to say with 850 words, makes Basic English something more than a mere educational experiment. Eight hundred fifty words are sufficient for ordinary communication in idiomatic English. Six hundred words form a first stage at which a wide range of simple matter can be provided. By the addition of 100 words required for any general field such as science, and 50 internationally recognized words, a total of 1,000 words enable successful communication.
With this vocabulary, the style and brevity has no literary pretensions, but is clear and precise. Below the minimum 600, only Pidgin English or traveller's enquiries can emerge. Above the 1,000 maximum, we are at the level of standardizing English. Normal vocabulary hovers between the alleged 300 words of the Somersetshire farmers and the 12,000 of the average undergraduate, though a cultured person can command as much as 50,000. The 850 words can be learned in 40 hours spent during a month by a speaker of a European romance or Germanic language. In the case of % of the world population, i.e. the Orient, the time required can be substantially more than this figure since these languages do not hold the Latin and Greek roots of many words as in European languages. However, this task is considerably easier than being able to command a full vocabulary as required for higher-level communication in business, technology or science, where English is the main language of communication. On the Internet, if we consider scientific and technological words, the required vocabulary comes closer to 100,000 words. Such a number lies well beyond the capacity of the average English as a foreign language prospective learner.
The grammar of this reduced-vocabulary English is similar to Standard English except the rules fill one chapter rather than a whole book. There are fewer exceptions. The chief form-changes are those which make the behaviour of verbs and pronouns the same as in normal English; together with 'plurals,' -ly for 'adverbs', the degrees of comparison, and the -er, -ing, -ed endings of 300 of the names of things. In this way the learner is not troubled by a great
number of forms and endings which are not regular, and the outcome is a simple, natural English in which there is room for addition but no need for change at a later stage. Compound words may be combined from two nouns (milkman) or a noun and a directive (sundown).
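The form-changes and compounds described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the eight-word `BASIC_WORDS` set and the function name `analyse` are hypothetical, the ending list is limited to the endings named in the text, and spelling changes such as consonant doubling are not handled.

```python
# Hypothetical mini-lexicon; the real basic list has roughly 850 to 1,000 entries.
BASIC_WORDS = {"milk", "man", "sun", "down", "quick", "paint", "walk", "book"}

# Endings named in the text: plurals, -ly adverbs, and -er / -ing / -ed.
ENDINGS = ("ing", "ed", "er", "ly", "s")


def analyse(word, lexicon=BASIC_WORDS):
    """Return (kind, parts) describing how `word` relates to the basic list."""
    w = word.lower()
    if w in lexicon:
        return ("basic", (w,))
    # Basic word plus one of the permitted endings, e.g. quick + ly.
    for ending in ENDINGS:
        root = w[: -len(ending)]
        if w.endswith(ending) and root in lexicon:
            return ("derived", (root, ending))
    # Compound of two basic words, e.g. milk + man.
    for i in range(1, len(w)):
        if w[:i] in lexicon and w[i:] in lexicon:
            return ("compound", (w[:i], w[i:]))
    return ("unknown", (w,))
```

For example, `analyse("milkman")` reports a compound of "milk" and "man", while a word with no basic root is reported as unknown and would need a substitute from another table.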
Discussion of prior art necessitates re-evaluating the core assumption on which foreign language communication is based: that English as a foreign language interaction should preferably be based on memorization using the whole vocabulary.
From a practical perspective, the memory-dependent approach has proven ineffective. Two problematic aspects of memorization are time and timing. Cumulative memorization processes require long-term commitments of time for study and for practice. Many aspiring English speakers are unable or unwilling to invest extensive amounts of time in learning a foreign language. Furthermore, many are interested only in casual use or for work reasons: for communication of a scientific or technological nature for instance. Timing is a problem in that second language knowledge tends to decline with non-use. Often, opportunities for English as a foreign language study do not coincide with opportunities for using the language in contextualized environments. Such individuals often face a frustrating cycle of repeatedly restarting English study programs.
Prior art methods, systems and apparatus emphasize production of materials for decontextualized study (academic) environments. That approach has been the norm despite the preponderance of evidence indicating a relatively weak link between English acquisition and English learning. This is illustrated by the millions of bilinguals (adults and children) who have acquired conversational English skills at work or play, on trips, on the streets, or just "picking up" the language.
For individuals interested in developing English language proficiency, the ultimate need for cumulative memorization of English language fundamentals remains constant. However, current methodology directed toward facilitating memorization and learning may be problematic from pedagogical and product-development perspectives. Research indicates a probability that knowledge gained in academic settings does not directly transfer to real-life settings and vice versa. Indeed, many English language students report encountering substantial difficulties in applying complex-vocabulary, academically acquired knowledge to basic conversations with native speakers and vice versa. On-site English as a foreign
language exchanges often are characterized by broad memory gaps. Thus, some researchers contend that knowledge acquired via traditional methods is of little benefit in real-life conversational interactions.
Currently there are numerous resources that address various aspects of English language learning. Those resources include courses and instructive study guides, audio and video tapes and computer software programs. Certain products provide information on particular aspects of the English conversational process. For instance, lexical information is available in dictionaries and, to a lesser extent, in supplemental sections of phrase books. Discrete phrases and questions are available in phrase books and in supplemental sections of certain dictionaries. Syntactic information is available in textbooks and supplemental sections of certain phrase books and dictionaries. Verb conjugations are available in verb dictionaries and, to a lesser extent, in supplemental sections of dictionaries and phrase books. Instruction on pronunciation is available in supplemental sections of certain dictionaries and phrase books. Electronic translators provide varied alternatives, ranging from translations of one-word entries to entire sentences, with higher priced models also offering voice-replication of pronunciations.
Each existing product has specific disadvantages. Electronic translators have numerous drawbacks including high cost, inconvenience of operation, susceptibility to malfunctioning, damage, (battery) power losses, and adverse weather conditions, limited vocabulary, inaccuracies in translations, and some customer resistance to high-technology products. Also, electronic dictionaries provide no visual continuity or permanence of lexical displays, making them particularly undesirable.
Traditional dictionaries provide such large quantities of lexical information that users often are stymied by the volume of selections and the time it takes to locate words. Furthermore, dictionaries typically lack easily accessible and understandable syntactic and pronunciation support. Similarly, phrase books provide so many predefined sentences that the resultant volume results in a tedious search process, thus hindering access to a desired phrase.
There are additional drawbacks relative to particular aspects of English as a foreign language communication. Phonetic transcriptions in dictionaries and phrase books are often difficult to
read and vocalize. Supplemental instructions on how to use those systems are difficult to access, typically being located in separate sections of the book. When the appropriate section is finally located, users often find a complex system of instructions that is difficult to apply on a letter-by-letter, sound-by-sound basis. Those problems render existing systems ineffective. The resultant effect is that users may ignore the resources and/or produce very low quality pronunciations. Typically, generation of original sentences also is addressed by providing instructive guidelines in supplemental sections. However, application of that information requires extensive study and practice prior to application in conversational settings.
A fundamental problem exists in that traditional approaches to English as a foreign language communications and learning have resulted in a fragmented and confusing market. The plethora of problems inherent in learning a foreign language are traditionally addressed by diverse, aspect-specific solutions. However, in many situations, the consumers' interests are better served by consolidating solutions rather than providing products that address only particular aspects. Furthermore, it would be advantageous to provide a product that capitalizes on innate human capacities.
In short, there is a significant and widespread need for new products that are simplified, centralized, rapid-access and that serve as resources for aspiring English as a foreign language speakers who simply wish to engage in basic conversations with native speakers and also use this basic knowledge of the English language for work or learning purposes where the lexicon required may be substantially larger, but do not choose or are unable to invest an extensive amount of time and energy in the effort of achieving a high level of English proficiency.
Summary of the Invention
The present invention addresses the above problems by providing an automatic translation system that enables the user, with prior colloquial English training, to immediately begin assimilating English content however complex. The system provides instant access to the entire English-speaking written word, including scientific and technological material and a
method for performing this translation automatically without human intervention. The present system is preferably embodied in a personal computer device that houses a novel translation system in which linguistic components are identified, grouped, categorized, translated and sequentially displayed. The system typically incorporates provision of a graphic display framework for the processed content that also serves for comparison with the original input content.
Development of the present invention is based on the premise that reducing the size of the vocabulary while substantially maintaining semantic content maximizes potential for English as a foreign language communication and assimilation. Now, the original conception of Basic English was aimed at human interaction using a universal language, where the vocabulary is limited and made up of words likely to be understood by a large number of people, unlike Esperanto which is not like any living language. This original version of a reduced-vocabulary English language, however, was meant for human interaction and did not contemplate its use in an automatic manner to convert Standard English to a reduced vocabulary, and neither did it include the use of complementary databases to accommodate particular instances of the general-use reduced vocabulary employed.
Clearly, where complex or ambiguous material is being automatically translated from English into a reduced-vocabulary representation, there will be some loss of semantic content. This is a fundamental difference between this invention and both language translating methods and reduced-vocabulary Basic English communication. Where a phrase is being translated between two languages, any semantic loss is due to malfunction or a lack of equivalence for a particular concept in the target language, which is rare. When a person has to express an idea, on the other hand, even using a reduced lexicon, he can choose how to express himself in order to be able to convey some particular subtlety.
An automatic system is not able to truly interpret the underlying semantic content of a sentence and therefore where the subject matter contains ambiguities and/or subtleties, there will be some loss of semantic content in the process of translation. Most words provide shades of meaning that are not strictly necessary to convey ideas, though they might be very entertaining or revealing, as any avid Oscar Wilde reader will immediately confirm. This system and method would be unsuitable therefore for subject matter where ambiguity and/or subtlety are a large proportion of the information content. However, material of a legal,
business, scientific and technological nature is normally specifically produced in a way that seeks to be both precise and clear, and is therefore amenable to a reduced-vocabulary representation that substantially maintains semantic content. Thus, although there may be some semantic loss, this is more than compensated for by the increased accessibility of the subject matter to an audience who would otherwise find it extremely difficult or time-consuming to assimilate the material. Furthermore, the use of a reduced vocabulary makes the analysis of scientific and technology material easier using data mining, artificial cognition and ontology techniques.
For the case of more complex words related to the field of science and technology, the Basic Dictionary of Science, first edited in 1965, has been updated so that it can be employed to communicate even high-level scientific information in a reduced lexicon.
In the present invention, several key aspects of the assimilation needs of individuals with only a limited knowledge and vocabulary in the English language are addressed via methods that are system-oriented, rather than instructive, in nature. The need for memorized vocabulary is supplemented by a method for accessing words and phrases expressed in a reduced vocabulary. The need for knowledge of syntax is addressed by providing a simplified syntax that can be quickly learnt. Semantic clarifications are resolved by the use of natural language processing algorithms well known in the prior art. Exposure to irregular phrases and idioms is accommodated by providing equivalents of these expressed in a reduced vocabulary. The need for knowledge relative to conjugating verbs is based on Ogden's Basic English rules of grammar.
According to a first aspect of the invention I provide a method of automatically translating information from standard natural language into a reduced-vocabulary form. Toward that end, the invention provides reduced-vocabulary English translation data in a manner such that it is readily assimilated, based on a basic knowledge of colloquial English. The present invention enables users to understand complex full-vocabulary English language content using only a basic knowledge of the English language.
The user's primary task is to recall the meaning of around 1,000 general words. Specific instances of these are highlighted in some instances, such as proper names, cities or countries.
In the present invention, in order to gain understanding of what is being analyzed, the requirement is to learn 1,000 words and to be aware that specific instances of these will be highlighted; for instance, if a name such as "Albert Einstein" is part of the source text, the user may optionally have "Albert Einstein|name" appear in the translation. The task is further simplified in that idioms are also converted to a reduced-vocabulary equivalent and that scientific words are described in footnotes in the same format, e.g. "atom" may be displayed as "atom|science".
Knowing that even the most complex text can be understood by the user of this system increases his or her confidence level. Visual memory in an actual usage context also reinforces recall of words and phrases, and to this end the interface employs hypertext and colours in order to help fix words and phrases in the user's memory. Additionally, displays of reduced-vocabulary English translation data can be displayed side-by-side with the original text in a user-friendly format that encourages casual review of discrete aspects of reduced-vocabulary English in the context of actual samples of standard English texts.
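The side-by-side presentation mentioned above can be approximated in plain text. The following is a minimal sketch, not the interface the specification describes: the function name `side_by_side`, the column width and the separator are assumptions, and hypertext and colour are left out entirely.

```python
import textwrap


def side_by_side(original, reduced, width=36, gap=" | "):
    """Lay out two texts in parallel columns, one wrapped line per row."""
    left = textwrap.wrap(original, width) or [""]
    right = textwrap.wrap(reduced, width) or [""]
    # Pad the shorter column so both have the same number of rows.
    rows = max(len(left), len(right))
    left += [""] * (rows - len(left))
    right += [""] * (rows - len(right))
    return "\n".join(l.ljust(width) + gap + r for l, r in zip(left, right))
```

Calling `side_by_side(source_text, translated_text)` returns a single string in which each row pairs a wrapped line of the original with a wrapped line of the reduced-vocabulary version.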
In another aspect of the invention, the system and method herein disclosed serves as a vocabulary-reducing means for artificial intelligence systems. A vocabulary-reducing means such as the one herein described is helpful in a number of ways. Semantic web technologies have to interact in a medium where, although English is the predominant language, the vocabulary is so large as to require statistical analysis on corpora of data, and a large amount of ambiguity arises out of this process. An example of this approach is shown by the "swoggle" site, where a semantic search using ontology concepts as well known in the prior art is employed to provide a very simple application. Artificial cognition systems as implemented in biology and other areas also require the processing of articles and texts with such a large vocabulary that only very limited application areas have been attempted, where semantic meaning has been established by fairly rudimentary means and sometimes even by hand. For instance, "WordNet" is a hand-built 95,000-word directed graph of linked synonym sets designed to aid semantics work in artificial intelligence systems, which can only do the most simple association between words because of the sheer number of words. Databases such as the British National Corpus provide probability estimates for such things as probabilistic context-free grammars, which can be applied to combined CYK/Earley algorithms, and source data for strategies such as the WordNet database (nouns, verbs, adjectives, and adverbs).
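To make the reference to probabilistic context-free grammars and CYK parsing concrete, here is a minimal sketch of a probabilistic CYK recogniser. Everything in it is an assumption for illustration: the toy grammar, the probabilities and the name `cyk_best` are invented, and nothing here is taken from the British National Corpus or from the specification. It does suggest why a smaller lexicon helps, since the `LEXICAL` table is the part that grows with vocabulary size.

```python
from collections import defaultdict

# Toy grammar in Chomsky normal form with made-up probabilities.
BINARY = {  # (left child, right child) -> [(parent, probability)]
    ("NP", "VP"): [("S", 1.0)],
    ("Det", "N"): [("NP", 1.0)],
    ("V", "NP"): [("VP", 1.0)],
}
LEXICAL = {  # word -> [(tag, probability)]
    "the": [("Det", 1.0)],
    "dog": [("N", 0.5)],
    "bone": [("N", 0.5)],
    "sees": [("V", 1.0)],
}


def cyk_best(words, start="S"):
    """Return the probability of the best parse of `words`, or 0.0 if none."""
    n = len(words)
    # chart[i][j] maps a nonterminal to its best probability over words[i:j].
    chart = [[defaultdict(float) for _ in range(n + 1)] for _ in range(n + 1)]
    for i, word in enumerate(words):
        for tag, p in LEXICAL.get(word, []):
            chart[i][i + 1][tag] = max(chart[i][i + 1][tag], p)
    for span in range(2, n + 1):
        for i in range(n - span + 1):
            j = i + span
            for k in range(i + 1, j):  # split point between the two children
                for left, pl in list(chart[i][k].items()):
                    for right, pr in list(chart[k][j].items()):
                        for parent, p in BINARY.get((left, right), []):
                            cand = p * pl * pr
                            if cand > chart[i][j][parent]:
                                chart[i][j][parent] = cand
    return chart[0][n].get(start, 0.0)
```

With this toy grammar, `cyk_best("the dog sees the bone".split())` gives a non-zero score, and a word order the grammar cannot derive gives 0.0.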
More advanced ontology-based parsers still attempt to process uncondensed natural language, leading to enormous databases and very limited success. Cognitive systems are natural or artificial information processing systems, including those responsible for perception, learning, reasoning, decision-making, communication and action. Many of the opportunities lie at the interface between life sciences, social sciences, engineering and physical sciences. Human-machine interaction will also benefit because although the human in a two-way interaction understands the machine, the machine oftentimes cannot process what the human is attempting to communicate.
Thus, through a combination of active involvement in reading a complex translated text for instance, casual review, passive exposure, and incidental learning the user may experience improved English language proficiency, apart from or in conjunction with a traditional English study program.
These and further and other objects and features of the invention are apparent in the disclosure, which includes the above and ongoing written specification, with the claims and the drawings.
Brief Description of the Drawings
FIG. 1 illustrates the sequential progression of steps for generating original sentences.
Detailed Description of the Preferred Embodiment
For purposes of simplification, unless otherwise specified, the English language will serve as the exemplary language in the following description of the present invention. Also, unless otherwise indicated, the following disclosure will refer to a preferred embodiment of the invention designed for English general-use applications. Converting the present communication system to accommodate other languages or applications requires implementing modifications relative to the linguistic content using the same methods as applied to produce Basic English, which is considered under the scope of the present
invention. This invention is specifically aimed at "English as a second language" speakers because English is currently the preponderant language for trade, entertainment, scientific and technological interaction.
In the present invention, the system and method of implementing an automatic translating device preferably comprises a machine-based program to translate Standard English into a reduced-lexicon English text.
In a preferred embodiment, a data file such as a scientific, technological or business document is uploaded into an internet site implemented as an access point for users to use the system, in a commonly used text format such as Adobe PostScript or Microsoft Word. Typically, a text input into the system will be processed firstly by a Porter stemmer to isolate suffixes present in words being translated. Secondly, the system uses standard techniques of natural language processing on sentences, such as the natural language toolkit of the University of Pennsylvania, to isolate probable labels for each word. Apart from the standard syntactic labels, the system possesses lists relating particular instances to predefined basic words taken from Ogden's basic list such as "person", "town", "country", "group" and "science". Thus, if a name is found it is considered as a particular instance of a general word found in the basic vocabulary and it is passed on to the translated text unchanged and suitably labelled if the user so selected. For instance, when a name such as "Albert Einstein" is found it is passed to the translated output as "Albert Einstein|person". In this way, several thousand words which are simply specific instances of a basic vocabulary word can be understood by users with limited knowledge of the language, effectively extending the basic 1,000-word vocabulary substantially. If a user reading a text does not know what "Sikkim" means, the output of "Sikkim|country" will be sufficient for understanding purposes. Determiners are also passed through the system unchanged. Natural language processing also helps with the disambiguation of words, where a given word can have more than one possible meaning. Where ambiguity cannot be resolved, the word is suitably labelled as ambiguous: "|doubtful".
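The instance-labelling step described above might look roughly like the following. This is a sketch under several assumptions: the label separator is taken to be a vertical bar, the `INSTANCES` and `AMBIGUOUS` tables and the name `label_text` are hypothetical, and the stemming, tagging and disambiguation stages are replaced here by fixed look-up tables.

```python
import re

# Hypothetical gazetteer: specific instance -> general word from the basic list.
INSTANCES = {
    "albert einstein": "person",
    "sikkim": "country",
}
# Hypothetical stand-in for words a disambiguation step could not resolve.
AMBIGUOUS = {"bank"}


def label_text(text, show_labels=True):
    """Pass names through unchanged, appending |label when the user wants labels."""
    if not show_labels:
        return text  # instances still pass through, just without labels
    # Longest names first so multi-word names are matched before shorter ones.
    for name in sorted(INSTANCES, key=len, reverse=True):
        pattern = re.compile(r"\b" + re.escape(name) + r"\b", re.IGNORECASE)
        label = INSTANCES[name]
        text = pattern.sub(lambda m, lab=label: m.group(0) + "|" + lab, text)
    for word in AMBIGUOUS:
        pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
        text = pattern.sub(lambda m: m.group(0) + "|doubtful", text)
    return text
```

Determiners and all other words are left untouched by this step; only listed instances and unresolved words receive a label.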
Where the user shares the linguistic roots and alphabet of English, specific instances may well be known with only minor changes, although there are well-known trick words like Geneva, which is often confused by foreigners with Genoa because "Genova" is more similar to Geneva than Genoa. However, where a user such as a person of oriental extraction sees an
unknown word, the label typically is quite helpful and enough to assimilate the underlying ideas, if it is labelled as previously described. A special case is that of idioms, which are labelled "idiom" and are also translated into their equivalent basic vocabulary representation. For instance, "smooth as silk" will be translated as "smooth|idiom". Scientific words are described in footnotes in the same format, e.g. "atom" may be displayed as "atom|science" and a footnote will appear explaining the meaning of the word in reduced-vocabulary English. For ease of reading, users can select not to have these labels displayed in the translated text.
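The handling of idioms and scientific words can be illustrated as follows. The two table entries mirror the examples in the text, but the tables themselves, the footnote wording for "atom", the vertical-bar separator and the name `translate_special` are assumptions made for this sketch.

```python
import re

# Hypothetical tables; the gloss text is invented for illustration.
IDIOMS = {"smooth as silk": "smooth"}
SCIENCE = {"atom": "a very small unit of matter"}


def translate_special(text, show_labels=True):
    """Replace idioms and tag science words, returning (text, footnotes)."""
    footnotes = []
    tag = "|idiom" if show_labels else ""
    for idiom, basic in IDIOMS.items():
        pattern = re.compile(r"\b" + re.escape(idiom) + r"\b", re.IGNORECASE)
        text = pattern.sub(basic + tag, text)
    for word, gloss in SCIENCE.items():
        pattern = re.compile(r"\b" + re.escape(word) + r"\b", re.IGNORECASE)
        if pattern.search(text):
            footnotes.append(word + ": " + gloss)  # footnote is kept either way
            if show_labels:
                text = pattern.sub(lambda m: m.group(0) + "|science", text)
    return text, footnotes
```

Passing `show_labels=False` corresponds to the user option of hiding labels: the idiom is still replaced and the footnote still produced, but no label is appended.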
The method then contemplates analysing the text word-by-word in order to establish firstly if each word is part of the basic vocabulary, in which case it is conveyed unaltered. This reduced-vocabulary database, as well as all subsidiary databases, is regularly updated using a methodology similar to that in Ogden's pioneering work and complemented by a statistical analysis of emerging and popular words such as "google" and "internet", words that were simply unknown 20 years ago. Table 1 shows the reduced vocabulary as currently employed, including international words and commonly used scientific words. Idioms, proper names, names of countries, cities etc. are not shown but are listed in the system databases according to the frequency of their use in current publications.
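A word-by-word pass of the kind described above, together with a simple frequency count of words that fall outside the databases, can be sketched as below. The small `BASIC` set stands in for the full list, and the `SUBSTITUTES` table and the name `reduce_text` are hypothetical; how the system actually chooses replacements is not specified here.

```python
import re
from collections import Counter

# A handful of entries standing in for the full basic list.
BASIC = {"the", "a", "dog", "is", "happy", "house", "in", "great"}
# Hypothetical substitution table for words outside the basic list.
SUBSTITUTES = {
    "canine": "dog",
    "residence": "house",
    "joyful": "happy",
    "enormous": "great",
}


def reduce_text(text):
    """Translate word by word; return (reduced words, counts of unlisted words)."""
    reduced, unlisted = [], Counter()
    for word in re.findall(r"[A-Za-z']+", text.lower()):
        if word in BASIC:
            reduced.append(word)  # basic words are conveyed unaltered
        elif word in SUBSTITUTES:
            reduced.append(SUBSTITUTES[word])
        else:
            reduced.append(word)  # left as-is for a later stage to handle
            unlisted[word] += 1   # candidate for a future database update
    return reduced, unlisted
```

The counter returned alongside the translation gives a crude frequency ranking, for example via `unlisted.most_common(10)`, of emerging words that might merit addition to a database.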
Table 1. Basic 1,000 words as of 07/01/08
Basic lexicon: a able about account acid across act addition adjustment advertisement agreement after again against air all almost among amount amusement and angle angry animal answer ant any apparatus apple approval arch argument arm army art as at attack attempt attention attraction authority automatic awake baby back bad bag balance ball band base basin basket bath be beautiful because bed bee before behaviour belief bell bent berry between bird birth bit bite bitter black blade blood blow blue board boat body boiling bone book boot bottle box boy brain brake branch brass bread breath brick bridge bright broken brother brown brush bucket building bulb burn burst business but butter button by cake camera canvas card care carriage cart cat cause certain chain chalk chance change cheap cheese chemical chest chief chin church circle clean clear clock cloth cloud coal coat cold collar colour comb come comfort committee common company comparison competition complete complex condition connection conscious control cook copper copy cord cork cotton cough country cover cow crack credit crime cruel crush cry cup current curtain curve cushion cut damage danger dark daughter day dead dear death debt decision deep degree delicate dependent design desire destruction detail development different digestion direction dirty discovery discussion disease disgust distance distribution division do dog door doubt down drain drawer dress drink driving drop dry dust ear early earth east edge education effect egg elastic electric end engine enough equal error even event ever every example exchange existence expansion experience expert eye face fact fall false family far farm
fat father fear feather feeble feeling female fertile fiction field fight finger fire first fish fixed flag flame flat flight floor flower fly fold food foolish foot for force fork form forward fowl frame free frequent friend from front fruit full future garden general get girl give glass glove go goat gold good government grain grass great green gray grip group growth guide gun hair hammer hand hanging happy harbour hard harmony hat hate have he head healthy heating heart heat help here high history hole hollow hook hope horn horse hospital hour house how humour I ice idea if ill important impulse in increase industry ink insect instrument insurance interest invention iron island jelly jewel join journey judge jump keep kettle key kick kind kiss knee knife knot knowledge land language last late laugh law lead leaf learning leather left leg let letter level library lift light like limit line linen lip liquid list little less least living lock long loose loss loud love low machine make male man manager map mark market married match material mass may meal measure meat medical meeting memory metal middle military milk mind mine minute mist mixed money monkey month moon morning mother motion mountain mouth move much more most muscle music nail name narrow nation natural near necessary neck need needle nerve net new news night no noise normal north nose not note now number nut observation of off offer office oil old on only open operation opinion opposite or orange order organization ornament other out oven over owner page pain paint paper parallel parcel part past paste payment peace pen pencil person physical picture pig pin pipe place plane plant plate play please pleasure plough pocket point poison polish political poor porter position possible pot potato powder power present price print prison private probable process produce profit property prose protest public pull pump punishment purpose push put quality question quick quiet quite rail rain range rat rate ray reaction red reading ready reason receipt record regret regular relation religion representative request respect responsible rest reward rhythm rice right ring river road rod roll roof room root rough round rub rule run sad safe sail salt same sand say scale school science scissors screw sea seat second secret secretary see seed selection self send seem sense separate serious servant sex shade shake shame sharp sheep shelf ship shirt shock shoe short shut side sign silk silver simple sister size skin skirt sky sleep slip slope slow small smash smell smile smoke smooth snake sneeze snow so soap society sock soft solid some son song sort sound south soup space spade special sponge spoon spring square stamp stage star start statement station steam stem steel step stick sticky still stitch stocking stomach stone stop store story strange street stretch stiff straight strong structure substance sugar suggestion summer support surprise such sudden sun sweet swim system table tail take talk tall taste tax teaching tendency test than that the then theory there thick thin thing this though thought thread throat through thumb thunder ticket tight till time tin tired to toe together tomorrow tongue tooth top touch town trade train transport tray tree trick trouble trousers true turn twist umbrella under unit up use value verse very vessel view violent voice waiting walk wall war warm wash waste watch water wave wax way weather week weight well west wet wheel when where while whip whistle white who why wide will wind window wine wing winter wire wise with woman wood wool word work worm wound writing wrong year yellow yes yesterday you young
Common international words: alcohol, aluminium, automobile, bank, bar, beef, beer, calendar, chemist, check, chocolate, chorus, cigarette, club, coffee, colony, dance, engineer, gas, google, hotel, influenza, informatics, kiosk, lava, madam, nickel, opera, orchestra, park, passport, patent, phonograph, piano, police, post, program, propaganda, radio, restaurant, sport, taxi, tea, internet, telephone, terrace, theatre, tobacco, university, whisky, zinc
General scientific and technological words: absorption, active, adjacent, age, alternative, application, arc, area, arrangement, axis, break, bubble, capacity, case, cell, column, component, compound, continuous, cross, decrease, deficiency, deposit, determining, difference, difficulty, direct, disappearance, discharge, disturbance, elimination, environment, equation, evaporation, exact, experiment, explanation, focus, friction, fusion, generation, groove, guard, hinge, impurity, individual, interpretation, investigator, joint, latitude, layer, length, link, longitude, mean, melt, mixture, nucleus, origin, path, pixel, pressure, projection, proof, reference, relative, reproduction, resistance, research, rigidity, rock, rot, rotation, screen, seal, section, sensitivity, shadow, shear, shell, similarity, solution, spark, specialization, specimen, stimulus, strain, strength, stress, substitution, successive, supply, surface, thickness, thrust, tide, transmission, transparent, tube, valve
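The statistical updating of the reduced vocabulary mentioned before Table 1 is not specified in detail. One possible reading, sketched here under the assumption that a simple frequency threshold over recent documents is used, is to propose words that occur often but are missing from the lexicon. The function name and the `min_count` parameter are hypothetical.

```python
from collections import Counter


def propose_new_words(documents, lexicon, min_count=3):
    """Return words absent from the lexicon that occur at least min_count times."""
    counts = Counter(
        word
        for doc in documents
        for word in doc.lower().split()
        if word.isalpha() and word not in lexicon
    )
    # most_common() yields the candidates in order of decreasing frequency.
    return [word for word, n in counts.most_common() if n >= min_count]
```

Candidates found this way would presumably still be reviewed before entering the database, since the description also refers to Ogden's methodology.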
Secondly, the word is compared against the database of English words for which an equivalent expressed in the reduced vocabulary exists and, if found, is replaced by that equivalent. If it is an English word with no current equivalent in reduced-vocabulary English, the word is passed through unchanged and labelled accordingly. Thus, a little-used word like "consilience" will appear as "consilience|unchanged". This is very helpful because, in cases where no translating equivalent is found, the system is still able to maintain readability, unlike a translating system between two languages, where this situation would cause a catastrophic failure. Finally, words not recognized by the system are labelled unknown. Thus a word such as "Zeitgeist", which does appear occasionally in English texts, would be displayed as "zeitgeist|unknown".
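The order of look-ups described so far can be sketched as a simple cascade. The tiny word sets, the translation entry and the function name `render_word` are placeholders chosen for illustration, and `|` is assumed to be the label separator.

```python
BASIC = {"a", "the", "word"}                  # excerpt of the reduced vocabulary
TRANSLATIONS = {"purchase": "get for money"}  # hypothetical translation entry
FULL_ENGLISH = BASIC | set(TRANSLATIONS) | {"consilience"}


def render_word(word):
    """Apply the look-ups in order: basic, translatable, known English, unknown."""
    key = word.lower()
    if key in BASIC:
        return word                    # conveyed unaltered
    if key in TRANSLATIONS:
        return TRANSLATIONS[key]       # replaced by its reduced-vocabulary form
    if key in FULL_ENGLISH:
        return f"{word}|unchanged"     # English, but no equivalent available yet
    return f"{word}|unknown"           # not recognized at all
```

Because every branch returns some text, the output stays readable even when no equivalent exists, which is the property the paragraph above emphasises.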
The user can add any word to the dictionary of words assigned to the Basic English lexicon and therefore not subsequently subject to translation. This is useful where a user is translating material of a technical nature where he is familiar with the meaning of little-known words.
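A personal extension of the lexicon could be kept separate from the shared basic list, so that one user's additions do not affect others. The class below is a minimal sketch under that assumption; its name and methods are hypothetical.

```python
class PersonalLexicon:
    """Basic vocabulary plus words a particular user has chosen to leave untranslated."""

    def __init__(self, basic_words):
        self._basic = frozenset(w.lower() for w in basic_words)
        self._user = set()

    def add(self, word):
        """Mark a word as known to this user."""
        self._user.add(word.lower())

    def __contains__(self, word):
        key = word.lower()
        return key in self._basic or key in self._user
```

A translator would then test `word in lexicon` before attempting any translation of that word.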
With reference to the schematic flow chart in Figure 1, the system is initialized in a start module. The text (or the output from a speech analyzer, as is well known in the art) is rendered into paragraphs and sentences, ready for a natural language analysis algorithm to tag them and produce strings of tagged words and phrases. Tagged items such as determiners, proper names, countries and the like are then analysed. In these cases, the words are not translated but simply labelled according to their nature, as in Cambridge simply being labelled as a "town". This process helps to reduce the number of words to be processed for translation, as they are known to be simply particular cases of words in the reduced lexicon. Handling idioms is a special case because entire phrases are considered and not simply words. Thus, if the text contains "thick as two short planks", the system will simply replace this phrase with "thick". Now, it may be that the fact that the person was impaired is highly relevant to the content, such that one might wish to translate this phrase as "very thick", and there will therefore be a slight loss of semantic content. However, in the case of business, technology or scientific documents, this kind of subtle difference is of little consequence. Scientific words are handled by labelling them and providing a description in the reduced vocabulary as a footnote, with the possibility that the description itself might contain further scientific words that can in turn also be described in the reduced lexicon. Remaining words are then translated from the original language to a reduced-vocabulary description in the same language, making use of semantic and syntactic labels in order to attempt to reduce ambiguity. Semantic labels are provided so that different possible senses of a word can be accommodated, while syntactic labels are used to discern differing uses of a word as a noun, verb, adverb and so on. Words that are not found thus far are searched for in an extensive database of the language, so that they can be labelled as in use but with no current representation in the reduced-vocabulary database. Any remaining words are labelled as "unknown", and the user is given the option to include them in a personalized reduced-vocabulary lexicon. Finally, all these strings are collated in the original order and the translated text is generated. Databases for this procedure include supplementary lists of animal names, proper names, cities, countries, determiners, idioms and the like; a scientific dictionary in the reduced vocabulary; an extensive language dictionary; a reduced-vocabulary database, including compound words and any personalized user-selected additions; an original-language to reduced-vocabulary translation database, including semantic and syntactic labels; and finally a comprehensive language dictionary.
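The remark that a footnote may itself contain further scientific words suggests a recursive look-up. The sketch below shows one way this might work; the dictionary entries and the function name are hypothetical, and the `seen` set is an added safeguard so that two entries referring to each other cannot cause endless recursion.

```python
# Hypothetical scientific dictionary with descriptions in simple English.
SCIENCE = {
    "molecule": "a very small bit made of more than one atom",
    "atom": "a very small bit of a substance",
}


def collect_footnotes(word, seen=None):
    """Return the footnote for word, followed by footnotes for words in its description."""
    seen = set() if seen is None else seen
    if word not in SCIENCE or word in seen:
        return []
    seen.add(word)
    notes = [f"{word}: {SCIENCE[word]}"]
    for inner in SCIENCE[word].split():
        notes.extend(collect_footnotes(inner, seen))
    return notes
```

For "molecule" this yields two footnotes, the second explaining "atom", while a word outside the scientific dictionary yields none.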
Moreover, the translation of a recognized word can be displayed in the text itself, substituted for the recognized word, or normally omitted and shown only when selected by a pointing device such as a mouse. The text being translated can be colour-coded so as to highlight different possibilities: words in green mean they have been translated adequately; words in olive green display a translation which itself contains references to further entries in the scientific dictionary; words in blue have two or more possible meanings which cannot be resolved by the syntax/semantic analysis module; [one further colour category is illegible in the source]; and words in red are not recognized by the translating tool. In every case, words or phrases can be selected by the user and added to a personal dictionary. In computer implementations where the translated text is displayed on a screen, for instance, the user can choose to have recognized words displayed together with their translation, to have the translated words or phrases substituted for the recognized word, or to have them omitted and shown only when selected with the mouse.
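The colour coding could be represented as a mapping from the status of a word to a display colour. The sketch below covers only the four categories that are legible in the description and assumes, purely for illustration, that the output is HTML; the enum and function names are hypothetical.

```python
import html
from enum import Enum


class Status(Enum):
    """Word status mapped to a CSS colour name."""
    TRANSLATED = "green"
    REFERS_TO_SCIENCE_ENTRY = "olive"
    AMBIGUOUS = "blue"
    UNRECOGNIZED = "red"


def to_html(word, status):
    """Wrap a word in a span coloured according to its status."""
    return f'<span style="color:{status.value}">{html.escape(word)}</span>'


print(to_html("atom", Status.REFERS_TO_SCIENCE_ENTRY))
```

Showing a translation only when the word is selected with the mouse could be handled separately, for example by placing it in a tooltip attribute.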
In a preferred embodiment of the present invention, the system is accessible to the user through a world-wide-web portal. Typically, the user will upload a document such as a scientific paper, technical report, business report or thesis, the system will process it and the resulting translation will be either downloaded by the user or sent by the system via electronic mail to a predefined e-mail address. In this way, a user with a limited knowledge of the English language will be able to assimilate even complex material, as is required for business, research or learning purposes. As time goes by and the user becomes more familiar with the subject matter, it is likely that the user will acquire indirectly an enhanced understanding of the language, further supporting existing language learning methods.
In an alternate embodiment of the present invention, the method is implemented as a standalone module, in the form of a wizard or software plug-in, to be used in conjunction with artificial intelligence systems aimed at cognition, command and control, and human-machine interaction, amongst other potential uses. In this case, the system input might come from a speech processing or other interface module. The output of said module can then be passed on to data-mining, ontology-based or other artificial intelligence programs that deal with natural languages as one of their inputs.
In a further embodiment of the present invention, the system can be used to translate a user request, contained for instance in a sentence, to a reduced vocabulary as previously described, and to compare this translated sentence with sentences in internet pages obtained by using existing web searching algorithms, after the content of said internet pages has itself been translated to a reduced vocabulary. This makes a match between the user request and the contents of available web pages more likely, notwithstanding the large vocabulary found in internet pages. A particular embodiment of this matching procedure can be based on ontology, as is well known in the art (e.g. Yu 2007). The advantage of using the invention herein described is that the number of relations is reduced because of the reduction in the size of the vocabulary. The word ontology, as in "semantic web technology based on ontology", is not used here in its traditional meaning of the study of being, but to mean an explicit and formal specification of a conceptualization of a domain of interest, where words and phrases are associated with concepts, relations, instances and axioms in order to elicit the likely meaning of language.
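The effect of reducing both the request and the page content to the same small vocabulary can be shown with a toy comparison. This sketch substitutes a plain Jaccard word overlap for the ontology-based matching named in the description, and the translation entries and function names are hypothetical.

```python
# Hypothetical translation entries into the reduced vocabulary.
TRANSLATIONS = {
    "physician": "medical expert",
    "doctor": "medical expert",
    "big": "great",
    "large": "great",
}


def reduce_words(text):
    """Return the set of reduced-vocabulary words that the text maps to."""
    words = set()
    for word in text.lower().split():
        words.update(TRANSLATIONS.get(word, word).split())
    return words


def similarity(query, page):
    """Jaccard overlap between the reduced forms of a query and a page."""
    q, p = reduce_words(query), reduce_words(page)
    union = q | p
    return len(q & p) / len(union) if union else 0.0
```

The strings "big physician" and "large doctor" share no words as written, yet both reduce to the same set of words and so score 1.0.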
Thus, the invention fills a long standing void in English communication by providing a single automatic resource that replaces solutions which traditionally have been available only by referring to multiple resources, such as lexical dictionaries, phrase books, verb dictionaries and electronic translators. Through a combination of active and passive exposures, the user may enjoy increased English proficiency as a by-product of assimilating complex material expressed in full-vocabulary English.
While the invention has been described with reference to specific embodiments, modifications and variations of the invention may be constructed without departing from the scope of the invention, which is defined in the following claims:
Bibliography
Graham, E.C. (ed.), "The Basic English Dictionary of Science", The MacMillan Company, New York, 1965.
Moorhouse, A.C., "The Triumph of the Alphabet - A History of Writing", Henry Schuman, New York, 1953.
Ogden, C.K., "Basic English: A General Introduction with Rules and Grammar", Paul Treber & Co., Ltd., London, 1930.
Ogden, C.K. (ed.), "The General Basic English Dictionary", London, 1940.
Rosch, E., "On the Internal Structure of Perceptual and Semantic Categories", in T.E. Moore (ed.), "Cognitive Knowledge and the Acquisition of Language", pp. 111-144, Academic Press, New York, 1973.
Wekker, H., Haegeman, L., "A Modern Course in English Syntax", Routledge, London, 1996.
Wittgenstein, L., "Philosophical Investigations", Oxford, England, Blackwell, 1953.
Yu, Liyang, "Introduction to the Semantic Web and Semantic Web Services", Chapman & Hall/CRC, Boca Raton, USA, 2007.

Claims (14)

I claim:
1. An automatic translation system comprising a natural language sentence parser including a word/phrase tagging section to obtain a sequence of tagged text representing the input sentence, a word/phrase translation section, wherein only preselected tagged text is translated, resulting in a translation in the same language as the source material but using a reduced lexicon.
2. The system of claim 1, wherein words and phrases in the translated text are labeled.
3. The system of claim 1, wherein the language being translated is English.
4. The system of claim 1, wherein text containing tags such as determiners, proper names of countries and cities and the like are labeled but are copied unchanged to the resulting translated text.
5. The system of claim 1, further comprising a scientific dictionary expressed in a reduced-vocabulary form of the original language so that tagged text identified as being of a scientific nature is provided with an accompanying explanation using said reduced lexicon.
6. The system of claim 1, wherein the interaction between the system and a user is carried out with the support of the world-wide-web or equivalent means.
7. The system of claim 1, wherein it is implemented to interact with other modules in artificial intelligence systems.
8. The system of claim 1, further comprising a search engine for web services in order to carry out searches based on semantic content.
9. The system of claim 8, wherein the semantic web technology employed is based on ontology.
10. A method for translating comprising the steps of providing a translation system having a natural language sentence parser including a word/phrase tagging section to obtain a sequence of tagged text representing the input sentence, a word/phrase translation section, wherein only pre-selected tagged text is translated, resulting in a translation in the same language as the source material but using a reduced vocabulary.
11. A method for translating substantially as hereinbefore described with reference to the accompanying drawings.
12. A system for translating languages according to the method of any one of the preceding claims.
13. A system for translating languages substantially as hereinbefore described with reference to and as shown in the accompanying drawings.
14. Any novel feature or novel combination of features described herein and/or in the accompanying drawings.
GB0800748A 2008-01-16 2008-01-16 Reduced lexicon translation Pending GB2456391A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
GB0800748A GB2456391A (en) 2008-01-16 2008-01-16 Reduced lexicon translation

Publications (2)

Publication Number Publication Date
GB0800748D0 GB0800748D0 (en) 2008-02-20
GB2456391A true GB2456391A (en) 2009-07-22

Family

ID=39145000

Country Status (1)

Country Link
GB (1) GB2456391A (en)
