EP1155374A1 - Translation - Google Patents

Translation

Info

Publication number
EP1155374A1
Authority
EP
European Patent Office
Prior art keywords
language
alternation
class
roles
word
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP00902775A
Other languages
German (de)
English (en)
Inventor
Stephen Clifford Appleby
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
British Telecommunications PLC
Original Assignee
British Telecommunications PLC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from GBGB9903747.5A external-priority patent/GB9903747D0/en
Application filed by British Telecommunications PLC filed Critical British Telecommunications PLC
Priority to EP00902775A priority Critical patent/EP1155374A1/fr
Publication of EP1155374A1 publication Critical patent/EP1155374A1/fr
Withdrawn legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/55Rule-based translation
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/205Parsing
    • G06F40/211Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/40Processing or translation of natural language
    • G06F40/55Rule-based translation
    • G06F40/56Natural language generation

Definitions

  • This invention relates to automatic language translation.
  • Machine language translators accept input text in a first natural language (the source language) and generate corresponding output text in a second natural language (the target language).
  • Such translators may be classified into two types; those which use a set of translation rules for each possible pair of source and target languages, and those (relatively rare) interlingual systems which translate from the source language into a language independent (interlingual) form, and then from this language independent form to the target language.
  • rules specifying the complements which each verb of all source and target languages could take were present, and were stored with pointers from the corresponding verb entries in a lexical database. These rules also specified the mapping between the complements and the roles (e.g. agent or patient) corresponding to them.
  • an apparatus for translating text from a source language to an interlingual representation which can then be transformed into one or more of a plurality of target languages, each of said languages including words corresponding to events and employing a concept of entities which play predetermined roles comprising: means for storing sets of language-independent roles; means for storing, for each said language, a respective set of language-dependent alternation classes, each alternation class being referenced to a said set of language-independent roles and comprising a set of alternations, and a plurality of words, each word representing a said event and being referenced to an alternation class of its respective set of language-specific alternation classes, each alternation of an alternation class comprising, in a respective order, the roles of said set of language-independent roles to which that class is referenced and listing the correspondence between those roles and complements by which each event referenced to that alternation class can validly be represented in that language; means for locating, in said text, a phrase comprising one or
  • said means for replacing are arranged to locate a preposition occurring with said verb, and to replace said preposition with an appropriate role.
  • multiple said events are referenced to a common said alternation class.
  • alternation classes are referenced to a common said set of language-independent roles.
  • the alternations of an alternation class constitute a set of language-independent rules
  • the respective sets of language-dependent alternation classes for all the languages of the apparatus together constitute said means for storing sets of language-independent roles.
  • a method of setting up a machine translation system comprising: a first stage of establishing, for each language, a respective plurality of language-dependent alternation classes, each alternation class comprising a plurality of alternations derived using a common set of language-independent role data shared by words of each language with a common meaning, and a respective set of word entries, each word entry being referenced to a said respective language-dependent alternation class; and a second stage of creating, for each said alternation class and for each word entry referenced to that alternation class, a corresponding set of word entries identical to the word entry referenced to that alternation class, each word entry of said set being referenced to a different one of the alternations of that alternation class.
  • a method of translating text from a source language to an interlingual representation which can then be transformed into one or more of a plurality of target languages, each of said languages including words corresponding to events and employing a concept of entities which play predetermined roles comprising the steps of: storing sets of language-independent roles; storing, for each said language, a respective set of language-dependent alternation classes, each alternation class being referenced to a said set of language-independent roles and comprising a set of alternations, and a plurality of words, each word representing a said event and being referenced to an alternation class of its respective set of language-specific alternation classes, each alternation of an alternation class comprising, in a respective order, the roles of the set of language-independent roles to which that class is referenced and listing the correspondence between those roles and complements by which each event referenced to that alternation class can validly be represented in that language; locating, in said text, a phrase comprising one or more words
  • said events are defined by a verb in said source language, and said replacing step is arranged to locate a preposition occurring with said verb, and to replace said preposition with an appropriate role.
  • multiple said words representing a said event are referenced to a common said alternation class.
  • multiple said alternation classes are referenced to a common said set of language-independent roles.
  • Figure 1 is a block diagram of the language translation apparatus according to a first embodiment;
  • Figure 2 is a block diagram showing in greater detail the processes present in a client terminal forming part of the embodiment of Figure 1;
  • Figure 3 is a block diagram showing in greater detail the processes present in a server forming part of the embodiment of Figure 1;
  • Figure 4 is a block diagram showing in greater detail the subprocesses present within a translation process forming part of the embodiment of Figure 3;
  • Figure 5 is an illustrative diagram showing the formats through which text passes during the translation process of the embodiment of Figure 1;
  • Figure 6 is a block diagram showing the databases maintained within the server of Figure 1;
  • Figure 7 is a schematic diagram illustrating the word structure produced after text pre-processing in the embodiment of Figure 1;
  • Figure 8 is a diagram illustrating the entity/relationship semantic structure produced after parsing in the embodiment of Figure 1;
  • Figure 9 is a flow diagram showing schematically the operation of the server of the embodiment of Figure 1;
  • Figure 10 is a diagram illustrating the data stores present in the embodiment;
  • Figure 11 is a diagram illustrating the relationship between records in the embodiment;
  • Figure 12 is a flow diagram illustrating a compilation phase of writing records to each word store of Figure 10;
  • Figure 13 is a diagram illustrating the relationship between records during the process of Figure 12;
  • Figure 14a illustrates a pre-abstracted, language-specific structure;
  • Figure 14b illustrates the structure after application of an abstraction rule;
  • Figure 15 is a flow diagram illustrating the abstraction rule operation;
  • Figure 16 (comprising Figures 16a and 16b) is a flow diagram illustrating the process of generating the data used in the present embodiment;
  • Figure 17 illustrates a role set record of the embodiment.
  • the present invention may be employed by a client terminal 100 connected via a telecommunications network 300 such as the Public Switched Telephone Network (PSTN) to a server computer 200.
  • client and server in this embodiment are illustrative but not limiting to any particular architecture or functionality.
  • the client terminal comprises a keyboard 102, a VDU 104, a modem 106, and a computer 108 comprising a processor, mass storage such as a hard disk drive, and working storage, such as RAM.
  • a SUN™ work station or a Pentium™ personal computer may be employed as the client terminal 100.
  • an operating control program 110 comprising an operating system 112 (such as Windows™), a browser 114 (such as Windows Explorer™ Version 3) and an application designed to operate with the browser 114, termed an applet, 116.
  • the function of the operating system is conventional and will not be described further.
  • the function of the browser 114 is to interact, in known fashion, with hypertext information received from the server 200 via the PSTN 300 and modem 106.
  • the browser 114 thereby downloads the applet 116 at the beginning of the communications session, as part of a hypertext document from the server 200.
  • the function of the applet 116 is to control the display of received information, and to allow the input of information for uploading to the server 200 by the user, through the browser 114.
  • the server 200 comprises an operating program 210 comprising an operating system 212 such as Unix™, a server program 214 and a translator program 216.
  • the operating system is conventional and will not be described further.
  • the function of the server program 214 is to receive requests for hypertext documents from the client terminal 100 and to supply hypertext documents in reply.
  • the server program 214 initially downloads a document containing the applet 116 for the client terminal 100.
  • the server program 214 is also arranged to supply data to and receive data from the translator program 216, via, for example, a cgi.bin mechanism.
  • the function of the translator program 216 is to receive text from the client terminal 100 via the telecommunications network 300 and server program 214; to interact with the user as necessary in order to clarify the text; and to produce a translation of the text for supply back to the user (in this embodiment).
  • Figure 4 shows the component programs of the translator 216. It comprises a number of sections, one for each language, of which only a first section 220, relating to a first language (LANG1) and a second section 230 relating to a second language (LANG2), are shown for clarity.
  • Each language section comprises the following subprograms or modules:
  • a source language parser (222, 232)
  • a target language generator (225, 235)
  • FIG. 5 illustrates the stages of translation according to this embodiment.
  • a source language text document (stage A) is received by the translator from the client terminal 100.
  • After operation of the text pre-processor stage (221), the result is an expanded source language text document (stage B).
  • the operation of the pre-processor is to replace contracted forms of words (such as "he's" in English, or "j'ai" in French) with their non-contracted forms.
  • After operation of the source language parser 222, the result at stage C of Figure 5 is a language-specific semantic structure which represents the input text as an encoded entity-relationship graph, where the entities are semantic categories corresponding to the words (in other words, identifying the nouns, verbs and so on), and the relationships are data relating the entities together (e.g. to indicate those which are the subjects or objects of others).
  • the result at stage D is a further semantic structure D, similar to the language specific semantic structure produced at stage C but indicating additional relationships and data which substitute the language-specific meanings of some of the structures represented within the semantic structure C with abstracted structures
  • a phrase such as "My name is David" input as source language text could be represented within a parsed semantic structure by data indicating ownership of the name by the individual first person, and an attribute of the name being that it is "David". This is a grammatically correct expression, from which French or German text could be generated by a suitable generator such as 235.
  • the source language abstractor 223 recognises, within the parsed semantic structure, the occurrence of structures which are not directly translatable, such as structures involving personal names in this example, and replaces those structures with additional data representing them.
  • the abstracted semantic structure produced at stage D of Figure 5 corresponds to a representation of the input text but with the replacement of specific constructs which are known not to meaningfully translate into one or more other languages (whether or not those languages are represented by sections within the translator 216).
  • the abstracted semantic structure produced at stage D is an interlingual form which is unambiguous in relation to each of the target languages which the system is capable of translating into. That is to say that the interlingual form corresponds uniquely with a language-specific semantic structure in each of the target languages.
  • the abstracted semantic structure, or one of the abstracted semantic structures, produced by the abstractor in stage D is then passed to the de-abstractor for the target language, which comprises a series of rules which test for the presence of the additional structures inserted by the language abstractor 223, and translate them into the form used in the target language.
  • the abstracted naming operation would be converted, in French, into "je me appelle" (I call myself).
  • the result is then, at stage E, a semantic structure equivalent to the language-specific semantic structure at stage C but in which the semantic substructures corresponding to phrases or expressions in the input text which would give rise to translation difficulties have been replaced by appropriate substructures in the target language.
  • This structure forms the input to the target language generator 235, which generates a corresponding target language output text (stage F), and therefore applies the reverse process to the parsers 222, 232.
  • the generated output text at stage F is contracted by the text postprocessor 236 which takes the generated text and contracts relevant parts of it.
  • "je me appelle David" would be contracted to "je m'appelle David".
  • Other minor text processing operations such as adding capital letters at appropriate places (for example at the beginning of each sentence), and providing the correct spacing between words, are also carried out.
  • the server 200 stores data for use by the parser and abstractor in each language.
  • This data comprises, for each language, a grammar rules database (227, 237) and an abstraction rules database (228, 238).
  • a multilingual lexical database 240 stores an entry for each concept represented by a word in any language represented within the translator program, the entry pointing to corresponding word entries, in word stores (see Figure 10), in each of the languages within which a word exists which expresses that concept.
  • the word entries give, for each word of the text in the language concerned: the type of lexical element represented by the word (e.g. whether it is a noun, a verb, a pronoun, an adjective and so on); data on the manner in which the word is inflected, if at all, in each language; and various other data.
  • each grammar rules database (227, 237) represents, for the corresponding language, the ways in which words of that language may be combined.
  • the operation of this embodiment will now be disclosed in greater detail with reference to Figures 7-11.
  • In a step 402, text is received from the client terminal 100.
  • the input text is expanded.
  • the start and end of each possible word in the text is located by detecting spaces and punctuation, so as to result in a stream of possible words.
  • any contracted words (such as "j'ai" in French) are expanded to replace them with full words (in that example, "je ai").
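The word-boundary detection and contraction expansion described above can be sketched in Python as follows. This is only an illustrative sketch: the function name `expand_text`, the regular expression, and the contents of the `CONTRACTIONS` table are assumptions, not taken from the patent text.

```python
import re

# Hypothetical contraction table; the source does not list its contents.
CONTRACTIONS = {
    "j'ai": ["je", "ai"],
    "he's": ["he", "is"],
}

def expand_text(text):
    """Split text into candidate words, then expand known contractions."""
    # Words may contain apostrophes; each punctuation mark is its own token.
    tokens = re.findall(r"[\w']+|[^\w\s]", text.lower())
    words = []
    for token in tokens:
        # Replace a contracted form by its full words, else keep the token.
        words.extend(CONTRACTIONS.get(token, [token]))
    return words
```

Calling `expand_text("J'ai un chien.")` gives the stream `je`, `ai`, `un`, `chien`, `.`. A form such as "he's" can also stand for "he has"; a table offering only one expansion is a simplification of this sketch.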
  • the text pre-processor locates and flags special text items such as proper names, dates, times, sums of money and so on.
  • each word is looked up in the lexical database 240.
  • words which are not recognised but are closely matched to others in the source language that is, the language of the input text
  • If, after spell checking, any words have not been recognised (step 405) then a query is transmitted back to the user, comprising a text message saying, for example, "The word (unrecognised word) has not been recognised. Please check the spelling, and resubmit this word or a synonym". This query is then transmitted to the client terminal
  • the expanded text (stage B of Figure 5) is no longer necessarily a linear sequence of words but may, as shown in Figure 7, comprise a network or lattice of words.
  • Figure 7 indicates such a network in which the second word, originally B, has been replaced by two possible alternatives (either alternative spellings or alternative expansions) B1 and B2, and the third word C has been replaced by three possible alternatives C1, C2 and C3. There are thus now six possible routes through the network of words.
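The count of six routes follows from choosing one alternative at each word position (1 x 2 x 3). A minimal Python sketch, assuming the network can be simplified to a list of alternatives per position (the name `routes` and this list layout are assumptions made for illustration; a general lattice would need a graph):

```python
from itertools import product

def routes(lattice):
    """Enumerate every path through a word lattice, one choice per position."""
    return [list(path) for path in product(*lattice)]

# Word A is unchanged, B has two alternatives, C has three.
FIGURE_7 = [["A"], ["B1", "B2"], ["C1", "C2", "C3"]]
```

Here `len(routes(FIGURE_7))` is 6, and the first route is `["A", "B1", "C1"]`.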
  • each word in the network is now replaced by a reference to the corresponding entry in the lexical database 240. If a single word (such as "bank" in English) has two different entries in the lexical database 240 corresponding to different meanings (which would be translated into different words in a target language), the word is replaced by each possible entry in the lexical database 240.
  • the syntactic category information for each word i.e. whether it is a noun, verb etc.
  • a table relating each network position to the corresponding entry in the lexical database 240 may be separately stored for later use.
  • the network of nodes (each corresponding, as noted above, to one of the entries in the lexical database 240 and being represented by the syntactic category of that entry) is processed by the source language parser program, which, for each word, applies the rules within the grammar rules database 227 which are applicable to words of that type.
  • the rule for the active form of the verb "to see" indicates that the verb may be preceded by the seeing "agent" entity (in this case "the dog") and followed by the patient entity (in this case "the cat").
  • the parsed semantic structure (stage C of Figure 5) is represented, for each sentence of the input text, by one or more structures comprising references to entries in the lexical database 240 (the circles in Figure 8) and pointers linking them together (the lines in Figure 8).
  • the topological structure of Figure 8 may be represented as [
  • the unifying variables A and P are the links which unify the first occurrence of "the" with "dog" and the second occurrence of "the" with "cat".
  • the verb "see" is linked by an agent relationship and a patient relationship with the terms linked by the relationship A (i.e. "the dog") and the terms linked by the relationship P (i.e. "the cat").
  • the verb is recorded as an event ("event"), and is linked to the lexical entry in the lexical database 240 for the word "see" and is indicated to be the finite form ("fin") in the past tense ("past").
  • the word "the" is recorded as a determiner, being the definite article ("def"), single rather than plural form ("s"), having neutral gender ("_") and referring to the third person ("third").
  • the terms for "dog" and "cat" are indicated to be entities ("e"), and have a reference to the corresponding word entry in the lexical database 240.
  • parser may be as described in PCT/GB98/02389
  • the abstractor 223 accesses the abstracting rules database 228 to locate those source language phrases which may give rise to translation difficulties.
  • the abstraction process is recursive, insofar as once one abstraction rule has been applied to the parsed text, the entire set of abstraction rules is referred to again when processing the partially abstracted text to identify another abstraction rule to be applied, repetitively until none of the abstraction rules in the set can be applied.
  • In step 412, the abstractor 223 tests each structure generated by the parser, and where one or more of the abstraction rules is applicable, converts the detected structure to the alternative form recorded within the rule.
  • this test is recursive such that the same rule may be applied at different stages of an abstraction process in which a structure generated by the parser is converted to the interlingual structure.
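The repeated application of abstraction rules until none applies can be sketched as a fixed-point loop. The following Python sketch is illustrative only: the representation of a structure as a set of `(relation, a, b)` tuples, the relation names, and the single naming rule are assumptions loosely modelled on the "My name is David" example, not the patent's actual rule format.

```python
def abstract(structure, rules):
    """Apply abstraction rules repeatedly until no rule is applicable."""
    changed = True
    while changed:
        changed = False
        for matches, rewrite in rules:
            if matches(structure):
                structure = rewrite(structure)
                changed = True
                break  # consult the entire rule set again
    return structure

def _is_owned_name(fact):
    return fact[0] == "owns" and fact[2] == "name"

def _is_name_value(fact):
    return fact[0] == "is" and fact[1] == "name"

def naming_matches(facts):
    """Hypothetical rule test: someone owns a name and the name has a value."""
    return any(map(_is_owned_name, facts)) and any(map(_is_name_value, facts))

def naming_rewrite(facts):
    """Replace the ownership/attribute pair by one abstract 'named' relation."""
    owner = next(f[1] for f in facts if _is_owned_name(f))
    value = next(f[2] for f in facts if _is_name_value(f))
    rest = {f for f in facts if not (_is_owned_name(f) or _is_name_value(f))}
    return frozenset(rest | {("named", owner, value)})

RULES = [(naming_matches, naming_rewrite)]
```

Applied to `frozenset({("owns", "speaker", "name"), ("is", "name", "David")})`, the loop returns a structure containing only `("named", "speaker", "David")`. The loop terminates only if each rule removes the pattern it matches, which this sketch assumes of every rule.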
  • the ideal result should be a single, complete interlingual structure. If the structure is incomplete (that is to say, it was not possible to relate together all the words using the grammar and the abstraction rules) then successful translation will not be possible. If more than one possible structure is produced, then the input text is considered ambiguous since it could result in more than one possible translation in at least one of the target languages. If either of these conditions is met (step 414), a query is transmitted to the user (step 406).
  • problematic points within the semantic structure, corresponding to incomplete or ambiguous meanings, are located, and the portions of the input text relating to these are formulated into a message and transmitted back to the user for display and response by the applet 116, with a query text which may for example say "the following text has not been understood/is ambiguous."
  • the de-abstractor and generator 224, 225 corresponding to the input (source) language are employed (as described in greater detail below) to generate a source language text for each possible semantic structure where two or more such structures exist, and the query also includes these texts, prefixed with a statement "one of the following meanings may be intended, please indicate which is applicable: "
  • the message transmitted to the user in step 406 comprises a form, with control areas which may be selected by the user at the client terminal 100 to indicate an intended meaning for the ambiguous words or phrases detected within the input text.
  • the single, unified, interlingual semantic structure produced by the abstractor 223 is then passed to the target language de-abstractor 234 for the or each target language into which the text is to be translated.
  • the de-abstractor 234 accesses the abstracting rules database 238 and, on detection of any of the substituted forms (for example "I sit") substitutes the normal form for the target language (in this case, "I sit myself" in French or "I am sitting" in English).
  • the de-abstracted structure is then more idiomatically correct in the target language than was the semantic structure produced by the parser.
  • the target language generator program 235 accesses the target language grammar rule database 237 and the lexical database 240 and operates upon the de-abstracted semantic structure to generate output target language text.
  • the operation of the generator is essentially the reverse of that of the parser; briefly stated, it operates a chart-parsing algorithm (of a type known of itself) to take the components of the target language semantic structure generated by the de-abstractor, look up the applicable rules in the target language rules database 237, and assemble the corresponding words located from the lexical database 240 into a string of text ordered in accordance with the grammar rules, until a single stream of text which utilises all components of the semantic structure and obeys the grammatical rules is located.
  • the text is post processed (step 420) to add a space before each word; capitalise the first letter in a sentence; add a full stop after the last word; contract any phrases (such as "je ai") which are capable of contraction; and reproduce any special forms of text (such as dates, amounts of money, and personal names), as appropriate for the target language.
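A minimal Python sketch of this post-processing step, covering word spacing, contraction, capitalisation and the final full stop. The name `post_process` and the pair-based `CONTRACTIONS` table are assumptions for illustration; reproduction of dates, money amounts and names is omitted.

```python
# Hypothetical table mapping adjacent word pairs to their contracted form.
CONTRACTIONS = {
    ("je", "ai"): "j'ai",
    ("me", "appelle"): "m'appelle",
}

def post_process(words):
    """Join generated words into a sentence, contracting where possible."""
    out = []
    i = 0
    while i < len(words):
        pair = tuple(words[i:i + 2])
        if pair in CONTRACTIONS:
            out.append(CONTRACTIONS[pair])  # merge the two words
            i += 2
        else:
            out.append(words[i])
            i += 1
    if not out:
        return ""
    sentence = " ".join(out)  # single space between words
    # Capitalise the first letter and close the sentence with a full stop.
    return sentence[0].upper() + sentence[1:] + "."
```

For example, `post_process(["je", "me", "appelle", "David"])` returns `"Je m'appelle David."`.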
  • the resulting formatted text is then formulated into an HTML (or text, or other suitable format) page, which is transmitted back to the user at the client terminal 100 in step 422.
  • On receipt of the translation result at the client terminal 100, the page is displayed via the browser 114 and may be converted and stored for subsequent word-processing by the user.
  • This embodiment therefore provides a new method of handling alternations (particularly for verbs), and new methods of use thereof.
  • the lexical database 240 comprises, for each language, a list 1241, 1242 of word entries or records.
  • Each word record in each language list points to a concept or meaning record entry in a language independent meaning store 1243.
  • Each record in the store 1243 contains meaning (i.e. semantic information) relating to a concept, described by the word entries which point to that record.
  • word records in the word list stores 1241, 1242 of different languages are indirectly linked, in that they point to a common entry in the semantic lexicon 240, which is related to the meaning expressed by the words.
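The indirect link between words of different languages through a shared concept record can be sketched as below. Everything here is hypothetical sample data: the concept identifiers, the `WORD_STORES` layout and the language code `fra` are assumptions (only `eng` appears as a language code in the text).

```python
# Each word store maps a word to the concept records it points to.
WORD_STORES = {
    "eng": {"bank": ["FINANCIAL_INSTITUTION", "RIVER_EDGE"], "dog": ["DOG"]},
    "fra": {"banque": ["FINANCIAL_INSTITUTION"], "rive": ["RIVER_EDGE"],
            "chien": ["DOG"]},
}

def equivalents(word, source, target):
    """For each concept of a source word, list target words sharing it."""
    concepts = WORD_STORES[source].get(word, [])
    return {
        concept: [w for w, cs in WORD_STORES[target].items() if concept in cs]
        for concept in concepts
    }
```

With this data, `equivalents("bank", "eng", "fra")` yields one French word per meaning, which illustrates why an ambiguous source word must be kept as several candidate entries until the meaning is resolved.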
  • each word store (1241, 1242) in a given language of the lexical database 240 contains a respective plurality of word entries (LEX1, LEX2, LEX3, LEX4); each word entry contains a pointer to an alternation class record (702, 704, 706) in a corresponding one of a plurality of language-specific alternation class stores 1238, 1228 provided within respective grammar rules stores
  • the relationship is a many-to-one mapping. That is to say, many verbs (of the order of several thousand in English) map onto a relatively small number of alternation classes (of the order of 200 in this embodiment in English). Each word entry in the word store 1241, 1242 maps onto only one alternation class record per language. Several lexical entries will share the same alternation class record. The significance of the alternation class records will be explained below.
  • each alternation class record 702-706 is linked to one or more alternation records 708-722 by pointers.
  • Compiling the Word Entries
  • the contents of the word stores 1241, 1242 are not fully populated until the apparatus is to be used. This results in a saving of memory space, since those word stores for languages which are not to be used in translation require less memory.
  • a first set of uninflected word entries remain resident in the store 1241 at all times.
  • the word entry is linked by a pointer to an alternation class record (ALT CLASS of Figure 13) in the alternation store 1238 for the language concerned.
  • This record is linked to the alternation records (ALT 1, ALT 2 of Figure 13) of the alternations which that word (and others sharing its alternation class) can take.
  • a program, operated prior to translation, performs the process of Figure 12.
  • the class record for a word is read, and a first alternation record is selected in a step 1404.
  • the program creates a new lexical entry (WORD 2) for the same word as that for WORD 1, which incorporates a pointer to the alternation class record, and stores the alternation as a list of possible complements in an order.
  • If there are more alternation records (step 1408), the process from step 1404 onwards is repeated for the next alternation (step 1410). If not, the next pre-existing word record in the store 1241 is selected (step 1414) and the process of step 1402 is repeated until all word entries have thus been expanded to an entry for each alternation (step 1412).
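The compilation phase described above, in which each resident word entry is expanded into one entry per alternation of its class, can be sketched in Python. The record layouts, the class name `transfer`, the alternation name `double_object` and the role and category labels are all hypothetical; only the alternation name `normal` comes from the text.

```python
from dataclasses import dataclass, field
from typing import List, Tuple

@dataclass
class Alternation:
    name: str
    complements: List[Tuple[str, str]]  # ordered (role, category) pairs

@dataclass
class AlternationClass:
    name: str
    alternations: List[Alternation]

@dataclass
class WordEntry:
    word: str
    alt_class: AlternationClass
    alternation: str = ""  # empty for a resident, uncompiled entry
    complements: List[Tuple[str, str]] = field(default_factory=list)

def compile_entries(base_entries):
    """Create one word entry per alternation of each word's class."""
    compiled = []
    for entry in base_entries:
        for alt in entry.alt_class.alternations:
            # The new entry keeps the class pointer and stores the
            # alternation as an ordered list of complements.
            compiled.append(WordEntry(entry.word, entry.alt_class,
                                      alt.name, list(alt.complements)))
    return compiled

TRANSFER = AlternationClass("transfer", [
    Alternation("normal",
                [("donor", "NP"), ("object", "NP"), ("recipient", "PP")]),
    Alternation("double_object",
                [("donor", "NP"), ("recipient", "NP"), ("object", "NP")]),
])
BASE = [WordEntry("give", TRANSFER), WordEntry("hand", TRANSFER)]
```

Two resident entries sharing a class with two alternations compile to four entries, so storage grows with the product of words and alternations, which is consistent with deferring this step until a language is actually needed.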
  • a role set store 740 storing a plurality of role set records, one of which is indicated (as 730) in
  • Each role set record stores a plurality of role data, shown as R1, R2 and R3 in Figure 17. In the present embodiment, there are on the order of 15-20 role set records in total in the role set record store 740.
  • each event concept in the meaning store 1243 (e.g. each entity corresponding to an event or verb in a word store 1241, 1242), or other concept which can take an alternation in some language, is linked by a pointer to one of the role set records 730. It will therefore be clear that, since there are many thousands of such verb or event records in each word store of the lexical database 240, there is many-to-one mapping of lexical database entries to role set records.
  • Each of the alternation class records in the alternation class store 1241, 1242 for a language is linked by a pointer to one of the role set records 730. Since the number of role set records is substantially smaller than that of alternation class records in each language (e.g. by an order of magnitude), this too is a many-to-one mapping.
  • For each word representing an event in a language (e.g. a verb in English) the lexical database 240 stores a record for one or more semantic concepts which correspond to that word. For example, the word "give" in English may be associated with several different concepts, including "give to" and "give up". Each of these provides a different meaning for the word.
  • Associated with each such event concept, then, are one or more roles; that is to say, people or objects taking part in the event. For example, corresponding to the phrase "they looked for the ball", are an active role or agent ("they" - the people who looked) and a passive role or patient ("the ball" - the thing that was looked for). Other events may involve more parties, for example, "[she] gave [it] [to him]" involves a donor, a recipient and an object. These roles are the same regardless of the word order employed (e.g. "he was given it by her") or the verb form used (e.g. "the book pleases me" and "I like the book").
  • the role set record stores for a concept all roles which might be used in any language for that concept. They do not necessarily correspond to the subject and object of a verb, since these vary between active and passive forms of a verb.
  • the role set records therefore represent a language-independent data structure linked to those concept records of the lexical database records which are for event concepts. Examples of (almost all the commonest) roles are
  • Each alternation class record 702-706 contains a record of the language it is relevant to, together with a name field naming the class record (for example, "reflexive verb"), and is linked by pointers to its role set record and alternation records.
  • Each alternation record 708-722 comprises a name field; a pointer to the alternation class record to which it is linked; the syntactic category (e.g. verb) of the alternation; and a list of terms, each of which maps a role (one of those listed in the role set record associated with the alternation class to which the alternation record belongs) to a syntactic category (e.g. noun, preposition, and so on).
  • The order of the mapping fields is significant. For a verb, the first mapping field is taken to indicate the subject of the verb (when it is present in the active form) and the remaining mapping fields indicate the complements of the verb in order.
  • One alternation of each alternation class is always named "normal"; this alternation specifies the word order most commonly used in the language concerned.
  • The names of the other alternations in each alternation class specify the conditions under which that alternation is used; for example, "polite", "formal", "stressed".
  • 'hold_onto' is the name of the alternation class
  • 'relate' is the name of the role set record
  • 'normal' is the name of this alternation within the alternation class
  • 'eng' is the language code
  • 'V[]' is the syntactic category (verb).
  • Where an alternation specifies a prepositional phrase, the identity of the preposition is stored in the alternation (above, for example, "onto").
  • The data stored in a set of alternations can be used to locate, within a given phrase involving that verb, which of the surrounding words or phrases occupies which role in relation to the verb.
  • This information can then be used to re-generate a corresponding phrase for the same concept expressed in the target language, since the set of roles is defined (in a language-independent fashion) by the role set record to which both the source and target language alternation classes point, and from which both were derived.
  • The present embodiment operates as follows.
  • Parsing is performed as described above, to generate the language-specific parsed semantic structure.
  • Each word in the expanded source language text document is looked up in the word store 1241 of the source language.
  • The word is then replaced by a reference to the word entry or entries corresponding to it, for subsequent use.
  • Where the word is a verb (for example, "give") which can be used in several different senses, several different entries will be found in the word store (e.g. corresponding to "give to", "give up" and so on).
  • Each of these different entries expresses a different concept, and therefore points to a different concept entry in the lexical database 240.
  • Each also points to an alternation class record; however, two or more of these entries may point to a common alternation class record.
  • The parser uses the orders of the possible complements defined in the word records, together with the rules stored in the rule store, to attempt to create paths through the word lattice of Figure 7. Since only one of the alternations will actually be present, those word entries corresponding to alternations which have a complement order other than that detected to be present will be rejected during parsing.
  • The parsed semantic structure will include one or more identified alternations for each event term located in the input document, the alternations being identified where the syntactic categories surrounding the identified event in the source document match those in the order specified in the alternation.
  • The semantic structure will still include any prepositions originally present; for example, the phrase "to the girl" will be identified as the patient entity in the phrase "he gave the ball to the girl", with "to" identified as a preposition. Also, in the case of verbs with prepositions and some other types of verb (for example, verbs in the passive form), the roles identified during parsing will not be language independent.
  • In the abstracting phase, an abstracting rule identifies each word term in the parsed semantic structure (step 1002), looks up the corresponding word record, and from that accesses the corresponding alternation class record (step 1004), and thence the alternation record (step 1006) corresponding to the alternation used to generate that word record.
  • The abstraction rule deletes the entry for the preposition from the parsed semantic structure (step 1010), so that instead of pointing to the prepositional phrase "to the girl" as the object (or some other language-dependent role), the event term points to the phrase "the girl" which followed the preposition.
  • In step 1012, data recording the language-independent role assigned to that prepositional phrase in the alternation record (here, "patient", as shown in Figure 14b) is assigned to the phrase which followed the preposition.
  • The abstraction rule then proceeds in similar fashion until all terms in the parsed semantic structure are processed.
  • Because the abstraction rules can access the alternation records, and because the alternation records are constrained by a language-independent set of roles shared by all translations of the verb being abstracted, the abstraction rules can identify the complements corresponding to each language-independent role and label them correctly where the original role assigned depended upon the source language.
  • The meaning entry references of the terms of the interlingual structure are each looked up in the meaning store 1243, and a word entry (in the target language word store 1242) with tone data corresponding to that stored for each term is selected.
  • The roles present for each event term are then compared with those for each alternation record of the alternation class pointed to by the selected word entry, and the best-matching alternation record is selected.
  • The language-independent roles of the interlingual structure are then replaced, where necessary, as specified by that alternation (for example, where the verb in the target language has its roles reversed relative to the source language).
  • The input and editing processes may be performed using the terminal 100 to access the server 200, from which the lexical database 240 and other records are read and to which they are written, via a browser program providing a graphical user interface into which data may be input and edited.
  • The role set records (or, at any rate, most of them) are created by the user, and each meaning entry in the lexical database 240 which can have multiple alternations is assigned to one of the role sets (as mentioned above, there are typically 15-20 such role sets).
  • A first language word store 1241 employed in the translation system (either as a source or a target language or both) is selected.
  • The word entries will already have been assigned pointers to corresponding meaning entries in the meaning store 1243.
  • A first event entry in the word store is selected.
  • In step 2008, the alternation classes associated with the role set assigned to that first event entry, in the language concerned, are displayed. If the event is the first event associated with that role set to be considered, there will be no alternation class displayed.
  • The data displayed (step 2016) for the alternation class is the list of alternations of the alternation class, displaying for each the role-complement mappings present in that alternation.
  • If no suitable class exists yet (step 2010), a new class is created (step 2012). Usually, a suitable class will exist already. In either case, in step 2014, the event is allocated to the alternation class it matches or to the class which has newly been created. If no alternations have yet been defined for the class, a template alternation listing the roles present in the class, in some order, is displayed, and the user edits the display to re-order the roles into the desired order, add prepositions as desired, and so on.
  • If (step 2018) the list of alternations does not match those known by the inputter to exist for the word in the language concerned, then new alternations are created (step 2020) in the same way and added to the alternation class (step 2022).
  • If the last event in the language has not been reached (step 2024), the next event is selected (step 2026) and steps 2006 onwards are repeated.
  • If there are more languages to process (step 2028), the next is selected (step 2030) and the process returns to step 2004.
  • The number of alternation records will vary from class to class and from language to language. The number of records will increase with the mutability of the word order in each language and with the irregularity of word orders between different verbs.
  • Associated with each of the role fields in the role set records 730 may be a role preference field.
  • The lexical database 240 may be hierarchically arranged, as described in PCT/GB98/03774 (filed 16/12/98, priority 17/12/97), so that, for example, an entry for "computer" points to a hierarchically higher entry for "electrical equipment", which in turn points to a hierarchically higher entry for "man made artefact", which in turn points to a hierarchically higher entry for "artefact" and thence to an entry for "entity".
  • The preference field associated with each role may be set to point to a corresponding entry in the lexical database.
  • The preference data therefore also points to all the hierarchically lower instances of those general classes which are stored in the lexical database.
  • The output produced by the parser may be either incomplete or ambiguous. For example, if a given part of a document can be parsed to give two meanings, allocating different words or phrases to different roles, the ambiguity may be resolved during abstracting.
  • Each possible such parsed structure is matched to locate the corresponding alternation record, from which the presumed roles of each part of the parsed structure are determined.
  • The role set record for the alternations is then examined, and it is determined whether the entities allocated to each role correspond to those specified in the preferences. The meaning for which the entities correspond more closely to the specified preferences is then selected as likelier to be correct.
  • Words other than verbs can benefit from the invention; it may, for example, be used to compile multiple word entries for words which can change their form, e.g. adjectives which have an adverbial form.
  • Each alternation record within a class can have a different syntactic category (e.g. adverb and adjective) and the record can thus be used to specify whether the derivation of a different word form can take place, and what the feature changes should be.
  • The user-created role set records need not be present in the translator, being only used to derive the alternations consistently between different languages as described above. In this case, the alternations of an alternation class constitute a set of language-independent rules, and thus the set of language-dependent alternation classes for all the languages of the translator together constitute a means for storing sets of language-independent roles in accordance with the present invention.
  • Each abstraction rule would similarly include a reference to those languages for which it was necessary, and only the necessary rules for the intended target language(s) would be used. Such an embodiment may prove useful as the number of target languages increases.
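
The record layout and the abstraction step described above (steps 1002 to 1012) can be pictured with a minimal Python sketch. It is an illustration under stated assumptions, not the patented implementation: the class names, the role names "agent" and "object", the role set name "transfer", the alternation name "double_object" and the functions `abstract` and `generate` are all hypothetical. Only the "patient" label for "the girl", the "normal" alternation name, and the convention that the first mapping field is the subject come from the description.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class RoleSet:
    """Language-independent set of roles shared by all languages."""
    name: str
    roles: tuple


@dataclass
class Alternation:
    """One ordered mapping of roles to syntactic categories."""
    name: str        # e.g. "normal"
    category: str    # syntactic category of the alternation, e.g. "V"
    mappings: list   # ordered (role, category, preposition or None)


@dataclass
class AlternationClass:
    """Language-specific class pointing at a shared role set."""
    name: str
    language: str
    role_set: RoleSet
    alternations: list = field(default_factory=list)


def abstract(alternation, complements):
    """Label each complement with its language-independent role,
    deleting the preposition named in the alternation if present."""
    labelled = {}
    for (role, _category, prep), words in zip(alternation.mappings, complements):
        if prep is not None and words and words[0] == prep:
            words = words[1:]  # drop the preposition, keep the phrase after it
        labelled[role] = " ".join(words)
    return labelled


def generate(verb, alternation, labelled):
    """Re-order labelled roles as the alternation specifies:
    first mapping is the subject, the rest are complements in order."""
    subject, *complements = alternation.mappings
    parts = [labelled[subject[0]], verb]
    for role, _category, prep in complements:
        parts.append(f"{prep} {labelled[role]}" if prep else labelled[role])
    return " ".join(parts)


# Hypothetical role set and English alternation class for "give to".
transfer = RoleSet("transfer", ("agent", "object", "patient"))
normal = Alternation("normal", "V", [
    ("agent", "N", None), ("object", "N", None), ("patient", "PP", "to"),
])
double_object = Alternation("double_object", "V", [
    ("agent", "N", None), ("patient", "N", None), ("object", "N", None),
])
give_to = AlternationClass("give_to", "eng", transfer, [normal, double_object])

labelled = abstract(normal, [["he"], ["the", "ball"], ["to", "the", "girl"]])
sentence = generate("gave", double_object, labelled)
```

Here `labelled` maps "patient" to "the girl" with the preposition removed, and `generate` rebuilds a phrase from the same roles using a different alternation. A target-language alternation class pointing at the same role set would be used in the same way.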

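The role preference mechanism described above, in which each role may point to a general class in the hierarchical lexical database and ambiguous parses are ranked by how well their entities fit those classes, might be sketched as follows. This is only an illustration: the `PARENT` table reuses the "computer" chain from the description, while the "person" entry, the preference assignments, the scoring rule (one point per satisfied preference) and the function names are assumptions.

```python
# Child -> parent links in a hierarchically arranged lexical database.
PARENT = {
    "computer": "electrical equipment",
    "electrical equipment": "man made artefact",
    "man made artefact": "artefact",
    "artefact": "entity",
    "person": "entity",  # hypothetical entry, not from the description
}


def is_a(concept, ancestor):
    """True if `concept` is `ancestor` or lies below it in the hierarchy."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = PARENT.get(concept)
    return False


def score(reading, preferences):
    """Count the roles whose entity satisfies the role's preference."""
    return sum(
        1
        for role, concept in reading.items()
        if role in preferences and is_a(concept, preferences[role])
    )


def pick_reading(readings, preferences):
    """Select the candidate parse that fits the preferences most closely."""
    return max(readings, key=lambda reading: score(reading, preferences))


# Hypothetical preferences for one role set, and two competing parses.
preferences = {"agent": "person", "patient": "artefact"}
readings = [
    {"agent": "computer", "patient": "person"},
    {"agent": "person", "patient": "computer"},
]
chosen = pick_reading(readings, preferences)
```

Because a preference points to a general class, any hierarchically lower entry (such as "computer" under "artefact") satisfies it without being listed explicitly.
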
Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

The invention relates to a method of abstracting a language-dependent source language semantic structure so as to produce a language-independent semantic structure, by means of abstraction rules defining a set of alternations which can be associated with words of the source language, a plurality of such words sharing each such set.
EP00902775A 1999-02-18 2000-02-11 Traduction Withdrawn EP1155374A1 (fr)

Priority Applications (1)

Application Number Priority Date Filing Date Title
EP00902775A EP1155374A1 (fr) 1999-02-18 2000-02-11 Traduction

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
GBGB9903747.5A GB9903747D0 (en) 1999-02-18 1999-02-18 Translation
GB9903747 1999-02-18
EP99304784 1999-06-18
EP99304784 1999-06-18
EP00902775A EP1155374A1 (fr) 1999-02-18 2000-02-11 Traduction
PCT/GB2000/000440 WO2000049522A1 (fr) 1999-02-18 2000-02-11 Traduction

Publications (1)

Publication Number Publication Date
EP1155374A1 true EP1155374A1 (fr) 2001-11-21

Family

ID=26153500

Family Applications (1)

Application Number Title Priority Date Filing Date
EP00902775A Withdrawn EP1155374A1 (fr) 1999-02-18 2000-02-11 Traduction

Country Status (5)

Country Link
EP (1) EP1155374A1 (fr)
AU (1) AU2451600A (fr)
CA (1) CA2363008A1 (fr)
HK (1) HK1043701A1 (fr)
WO (1) WO2000049522A1 (fr)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
ES2180392B1 (es) * 2000-09-26 2004-07-16 Pablo Grosschmid Crouy-Chanel Sistema dispositivo e instalacion de interpretacion simultanea mecanizada de idiomas.

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
SE466029B (sv) * 1989-03-06 1991-12-02 Ibm Svenska Ab Anordning och foerfarande foer analys av naturligt spraak i ett datorbaserat informationsbehandlingssystem
GB9716887D0 (en) * 1997-08-08 1997-10-15 British Telecomm Translation

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO0049522A1 *

Also Published As

Publication number Publication date
HK1043701A1 (zh) 2002-09-20
CA2363008A1 (fr) 2000-08-24
WO2000049522A1 (fr) 2000-08-24
AU2451600A (en) 2000-09-04

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20010820

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE

17Q First examination report despatched

Effective date: 20031210

GRAP Despatch of communication of intention to grant a patent

Free format text: ORIGINAL CODE: EPIDOSNIGR1

RTI1 Title (correction)

Free format text: APPARATUS AND METHOD FOR TRANSLATING TEXT

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20050719

REG Reference to a national code

Ref country code: HK

Ref legal event code: WD

Ref document number: 1043701

Country of ref document: HK