ZA200309230B - Content conversion method and apparatus

Info

Publication number
ZA200309230B
ZA200309230B ZA200309230A ZA200309230A ZA200309230B ZA 200309230 B ZA200309230 B ZA 200309230B ZA 200309230 A ZA200309230 A ZA 200309230A ZA 200309230 A ZA200309230 A ZA 200309230A ZA 200309230 B ZA200309230 B ZA 200309230B
Authority
ZA
South Africa
Prior art keywords
state
segments
segment
word
content
Prior art date
Application number
ZA200309230A
Inventor
Eli Abir
Original Assignee
Eli Abir

Description

CONTENT CONVERSION METHOD AND APPARATUS
RELATED APPLICATIONS
This application claims the benefit of U.S. Provisional Application No. 60/276,107 filed March 16, 2001, and U.S. Provisional Application No. 60/299,472 filed June 21, 2001, both of which are hereby incorporated by reference.
FIELD OF THE INVENTION
This invention relates to a method and apparatus for converting content from one state to a second state.
BACKGROUND
Devices and methods for automatically translating documents from one language to another are known. However, these devices and methods often fail to translate documents accurately, can consume large amounts of time, and can be inconvenient to use. In addition to human-based translators, other known devices include commercially available machine translation software. These known systems have flaws that render them susceptible to errors, slow speed and inconvenience. Known translation devices and methods cannot consistently return accurate translations for text input and therefore frequently require intensive user intervention for proofreading and editing. Accurate machine translation is more complicated than providing devices and methods that make word-for-word translations of documents. In these word-for-word systems, the translation often makes little sense to readers of the translated document, as the word-for-word method results in wrong word choices and incoherent grammatical units.
To overcome these deficiencies, known translation devices have for decades attempted to make choices of word translations within the context of a sentence based on a combination or set of lexical, morphological, syntactic and semantic rules. These systems, known in the art as “Rule-Based” machine translation (MT) systems, are flawed because there are so many exceptions to the rules that they cannot provide consistently accurate translation.
In addition to Rule-Based MT, in the last decade a new method for MT known as “example-based” (EBMT) has been developed. EBMT makes use of sentences (or possibly portions of sentences) stored in two different languages in a cross-language database. When a translation query matches a sentence in the database, the translation of the sentence in the target language is produced by the database providing an accurate translation in the second language.
If a portion of a translation query matches a portion of a sentence in the database, these devices attempt to accurately determine which portion of the sentence mapped to the source language sentence is the translation of the query.
EBMT systems cannot provide accurate translation of a broad language because the databases of cross-language sentences are built manually and will always be predominantly “incomplete.” Another flaw of EBMT systems is that partial matches are not reliably translated.
Attempts have been made to automate the creation of cross-language databases using pairs of translated documents for use in EBMT. However, these efforts have not been successful in creating meaningful, accurate cross-language databases of any significant size. None of these attempts uses an algorithm that reliably and accurately distills the translations of a significant number of words and word-strings from a pair of translated documents.
WO 02/075586 PCT/US01/50323
Some translation devices combine both Rule-Based and EBMT engines. Although this combination of approaches may yield a higher rate of accuracy than either system alone, the results remain inadequate for use without significant user intervention and editing.
The problems faced when attempting to translate documents from one language to another can apply more generally to the problem of converting data representing ideas or information from one state, say words, into data representing the ideas in another state, for example, mathematical symbols. In such cases cross-idea association databases that associate data in one state with equivalent data in the second state must be consulted. Therefore, a need exists for an improved and more efficient method and apparatus for creating dictionaries or databases that associate equivalent ideas in different languages or states (e.g., words, word-strings, sounds, movement and the like) and for translating or converting ideas conveyed by documents in one language or state into the same or similar ideas represented by documents in a second language or state.
The invention relates to manipulating content using a cross-idea association database. In particular, the present invention provides a method and apparatus for creating a database of associated ideas and provides a method and apparatus for utilizing that database to convert ideas from one state into other states.
In one embodiment, and by example, the present invention provides a method and apparatus for creating a language translation database, where two languages form the database of associated ideas. The present invention also provides a method and apparatus for utilizing that language database to convert documents (representing ideas) from one language to another (or more generally, from one state to another). However, the present invention is not limited to language translation, although that preferred embodiment will be presented. The database creation aspect of the present invention may be applied to any ideas that are related in some manner but expressed in different states and the conversion aspect of the present invention may be applied to accurately translate ideas from one state to another.
The application of the present invention to a language translation embodiment will now be described. As used herein, the terms related to converting, translating, and manipulating are used interchangeably and in their broadest sense.
SUMMARY OF THE INVENTION
One object of the present invention is to facilitate the efficient translation of documents from one language or state to another language or state by providing a method and apparatus for creating and supplementing cross-idea association databases. These databases generally associate data in a first form or state that represents particular ideas or pieces of information with data in a second form or state that represents the same ideas or pieces of information.
Another object of the present invention is to facilitate the translation of documents from one language or state to another language or state by providing a method and apparatus for creating a second document comprising data in a second state, form, or language, from a first document comprising data in a first state, form, or language, with the result that the first and second documents represent substantially the same ideas or information.
Yet another object of the present invention is to facilitate the translation of documents from one language or state to another language or state by providing a method and apparatus for creating a second document comprising data in a second state, form, or language, from a first document comprising data in a first state, form, or language, with the result that the first and second documents represent substantially the same ideas or information, and wherein the method and apparatus includes using a cross-idea association database.
Yet another object of the present invention is to provide the translation of documents (in a broad sense, the conversion of ideas from one state to another state) in a real-time manner.
The present invention achieves these and other objects by providing a method and apparatus for creating a cross-idea database. The method and apparatus for creating the cross-idea database can include providing one or more pairs of documents in two (or more) different languages representing the same general text (i.e., exact translations of text (“Parallel Text”) or generally related text (“Comparable Text”)). The present invention selects at least a first and a second occurrence of all words and word strings that have a plurality of occurrences in the first language in the available cross-language documents. It then selects at least a first word range and a second word range in the second language documents, wherein the first and second word ranges correspond to the first and second occurrences of the selected word or word-string in the first language documents. Next, it compares words and word-strings found in the first word range with words and word strings found in the second word range, locates words and word-strings common to both word ranges, and stores those located common words and word strings in the cross-idea database. The invention then associates in said cross-idea database located common words or word strings in the two ranges in the second language with the selected word or word string in the first language, ranked by their association frequency (number of recurrences), after adjusting the association frequencies as detailed herein. By testing common words and word-strings across languages in Parallel or Comparable Texts, the database will resolve more associations as more Parallel or Comparable Text becomes available in a variety of different languages.
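By way of illustration only, the comparison of two second-language word ranges might be sketched as follows. The function names, the `max_len` limit on word-string length, and the placeholder tokens are assumptions made for this sketch and do not appear in the specification.

```python
def word_strings(words, max_len=3):
    """Return every word string (as a tuple) of 1..max_len consecutive words."""
    return {
        tuple(words[i:i + n])
        for n in range(1, max_len + 1)
        for i in range(len(words) - n + 1)
    }


def common_word_strings(range_one, range_two, max_len=3):
    """Word strings that occur in both second-language word ranges."""
    return word_strings(range_one, max_len) & word_strings(range_two, max_len)


# Hypothetical ranges surrounding two occurrences of one first-language word string.
range_one = ["b1", "b2", "b3", "b4"]
range_two = ["b7", "b2", "b3", "b9"]
common = common_word_strings(range_one, range_two)
```

In this toy case the word strings common to both ranges are "b2", "b3" and "b2 b3"; these would be the candidates stored as associations for the selected first-language word string.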
The present invention also achieves these and other objectives by providing a method and apparatus for converting a document from one state to another state. The present invention provides a database comprised of data segments in a first language associated with data segments in a second language (created through methods described above or manually). The present invention translates text by accessing the above-referenced database, and identifying the longest word string in the document to be translated (measured by number of words) beginning with the first word of the document, that exists in the database. The system then retrieves from the database a word string in the second language associated with the located word string from the document in the first language. The system then selects a second word string in the document that exists in the database and has an overlapping word (or alternatively word string) with the previously identified word string in the document, and retrieves from the database a word string in the second language associated with the second word string in the first language. If the word string associations in the second language have an overlapping word (or alternatively words) the word string associations in the second language are combined (eliminating redundancies in the overlap) to form a translation; if not, other second language associations to the first language word strings are retrieved and tested for combination through an overlap of words until successful. The next word string in the document in first language is selected by finding the longest word string in the database that has an overlapping word (or alternatively words) with the previously identified first language word string, and the above process continued until the entire first language document is translated into a second language document.
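By way of illustration only, the overlap-based combination described above might be sketched as follows. This is a simplified sketch under stated assumptions: the database is a hypothetical dictionary mapping first-language word strings to ranked second-language word strings, an overlap is taken to mean that the end of the translation so far matches the start of the next association, and only the newly retrieved association is retried on failure. All names and tokens are placeholders.

```python
def merge_on_overlap(left, right):
    """Join two word lists if a suffix of `left` equals a prefix of `right`."""
    for k in range(min(len(left), len(right)), 0, -1):
        if left[-k:] == right[:k]:
            return left + right[k:]  # drop the redundant overlapping words
    return None


def longest_match(words, start, db, min_end=0):
    """End index of the longest db entry starting at `start` and ending past `min_end`."""
    for end in range(len(words), max(start, min_end), -1):
        if tuple(words[start:end]) in db:
            return end
    return None


def translate(words, db):
    end = longest_match(words, 0, db)
    if end is None:
        return None
    result = list(db[tuple(words[:end])][0])  # highest-ranked association first
    prev_start = 0
    while end < len(words):
        # The next source string must overlap the previous one and extend past it.
        best = None
        for start in range(prev_start + 1, end):
            new_end = longest_match(words, start, db, min_end=end)
            if new_end is not None and (best is None or new_end - start > best[1] - best[0]):
                best = (start, new_end)
        if best is None:
            return None
        # Test the ranked associations until one combines through an overlap.
        for candidate in db[tuple(words[best[0]:best[1]])]:
            merged = merge_on_overlap(result, list(candidate))
            if merged is not None:
                result = merged
                break
        else:
            return None
        prev_start, end = best
    return result


# Hypothetical database: first-language word strings -> ranked second-language strings.
db = {
    ("a1", "a2"): [("b1", "b2")],
    ("a2", "a3"): [("b9",), ("b2", "b3")],
    ("a3", "a4"): [("b3", "b4")],
}
translation = translate(["a1", "a2", "a3", "a4"], db)
```

Here the top-ranked association for "a2 a3" ("b9") shares no word with "b1 b2", so the lower-ranked "b2 b3" is tested and accepted, and the process continues to yield "b1 b2 b3 b4".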
BRIEF DESCRIPTION OF THE FIGURES
Figure 1 shows an embodiment of a cross-idea database according to the present invention.
DETAILED DESCRIPTION OF THE INVENTION
The present invention provides a method and apparatus for creating and supplementing a cross-idea database and for translating documents from a first language or state into a second language or state using a cross-idea database. Documents as discussed herein are collections of information as ideas that are represented by symbols and characters fixed in some medium. For example, the documents can be electronic documents stored on magnetic or optical media, or paper documents, such as books. The symbols and characters contained in documents represent ideas and information expressed using one or more systems of expression intended to be understood by users of the documents. The present invention manipulates documents in a first state, i.e., containing information expressed in one system of expression, to produce documents in a second state, i.e., containing substantially the same information expressed using a second system of expression. Thus, the present invention can manipulate or translate documents between systems of expression, for example, written and spoken languages such as English, Hebrew, and Cantonese, into other languages.
A detailed description of the present invention, including the database creation method and apparatus, and the conversion method and apparatus, will now be described.
1. Database Creation Method and Apparatus
a. Overview
The method of the present invention makes use of a cross-idea database for document content manipulation. Figure 1 depicts an embodiment of a cross-idea database. This embodiment of a cross-idea database comprises a listing of associated data segments in columns 1 and 2. The data segments are symbols or groupings of characters that represent a particular idea or piece of information in a system of expression. Thus, System A Segments in column 1 are data segments that represent various ideas and combinations of ideas Da1, Da2, Da3 and Da4 in a hypothetical system of expression A. System B Segments in column 2 are data segments Db1, Db3, Db4, Db5, Db7, Db9, Db10 and Db12, that represent various ideas and some of the combinations of those ideas in a hypothetical system of expression B that are ordered by association frequency with data segments in system of expression A. Column 3 shows the Direct Frequency, which is the number of times the segment or segments in language B were associated with the listed segment (or segments) in language A. Column 4 shows the Frequencies after Subtraction, which represents the number of times a data segment (or segments) in language B has been associated with a segment (or segments) in language A after subtracting the number of times that segment (or segments) has been associated as part of a larger segment, as described more fully later.
As shown in Figure 1, it is possible that a single segment, say Da1, is most appropriately associated with multiple segments, Db1 together with Db3 and Db4. The higher the Frequencies after Subtraction (as described herein) between data segments, the higher the probability that a system A segment is equivalent to a system B segment. In addition to measuring adjusted frequencies by total number of occurrences, the adjusted frequencies can also be measured, for example, by calculating the percentage of time that particular system A segments have corresponded to particular system B segments. When the database is used to translate a document, the highest ranked associated segment will be retrieved from the database first in the process. Often, however, the method used to test the combination of associated segments for translation (as described later) determines that a different, lower ranked association should be tested because the higher ranked association, once tested, cannot be used. For example, if the database was queried for an association for Da1, it would return Db1+Db3+Db4; if Db1+Db3+Db4 could not be used as determined by the process that accurately combines data segments for translation, the database would then return Db9+Db10 to test for accurate combination with another associated segment, for translation.
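By way of illustration only, a database of the kind depicted in Figure 1 and its ranked retrieval might be sketched as follows. The class and function names are assumptions, and the frequency values are hypothetical placeholders, since the values shown in Figure 1 are not reproduced in this text.

```python
from dataclasses import dataclass


@dataclass
class Association:
    segments: tuple                   # system B segment(s) (column 2)
    direct_frequency: int             # column 3
    frequency_after_subtraction: int  # column 4


# Hypothetical entries for segment Da1; the numbers are illustrative only.
cross_idea_db = {
    "Da1": [
        Association(("Db9", "Db10"), 40, 25),
        Association(("Db1", "Db3", "Db4"), 50, 45),
    ],
}


def ranked_associations(db, segment):
    """Associations for a system A segment, highest adjusted frequency first."""
    return sorted(
        db.get(segment, []),
        key=lambda a: a.frequency_after_subtraction,
        reverse=True,
    )


def first_usable(db, segment, is_usable):
    """Return the highest-ranked association accepted by the caller's test."""
    for association in ranked_associations(db, segment):
        if is_usable(association):
            return association
    return None


top = ranked_associations(cross_idea_db, "Da1")[0]
# Simulate the combination test rejecting Db1+Db3+Db4, so the next one is returned.
fallback = first_usable(
    cross_idea_db, "Da1", lambda a: a.segments != ("Db1", "Db3", "Db4")
)
```

The `is_usable` callback stands in for the combination test described later; the database itself only supplies candidates in ranked order.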
In general, the method for creating a cross-idea database of the present invention includes examining and operating on Parallel or Comparable Text. The method and apparatus of the present invention is utilized such that a database is created with associations across the two states: accurate conversions, or more specifically, associations between ideas as expressed in one state and ideas as expressed in another. The translation and other relevant associations between the two states become stronger, i.e. more frequent, as more documents are examined and operated on by the present invention, such that by operation on a large enough “sample” of documents the most common (and, in one sense, the correct) association becomes apparent and the method and apparatus can be utilized for conversion purposes.
In one embodiment of the present invention, the two states represent word languages (e.g., English, Hebrew, Chinese, etc.) such that the present invention creates a cross-language database correlating words and word-strings in one language to their translation counterparts in a second language. Word-strings may be defined as groups of consecutive adjacent words and often include punctuation and any other mark used in the expression of language. In this example, the present invention creates a database by examining documents in the two languages and creating a database of translations for each recurring word or word string in both languages.
However, the present invention need not be limited to language translation. The present invention allows a user to create a database of ideas and associate those ideas to other, differing ideas in a hierarchical manner. Thus, ideas are associated with other ideas and rated according to the frequency of the occurrence. The specific weight given to the occurrence frequency, and the use applied to the database thus created, can vary depending upon the user’s requirements.
For example, in the context of converting text from one language to another the present invention will operate to create language translations of words and word strings between the English and Chinese languages. The present invention will return a ranking of associations between words and word-strings across the two languages. Given a large enough sample size, the word or word-string occurring the most often will be one of the Chinese equivalents of the English word or word-string. However, the present invention will also return other Chinese language associations for the English words or word-strings, and the user may manipulate those associations as desired. For example, the word “mountain,” when operated on according to the present invention may return a list of Chinese language words and word strings in the language being examined. The Chinese language equivalents of the word “mountain” will most likely be ranked the highest; however, the present invention will return other foreign language words or word-strings associated with “mountain,” such as “snow”, “ski”, “a dangerous sport”, “the highest point in the world”, or “Mt. Everest.” These words and word-strings, which will likely be ranked lower than the translations of “mountain,” can be manipulated as desired by the user.
Thus, the present invention is an automated association database creator. The strongest associations represent “translations” or “conversions” in one sense, but other frequent (but weaker) associations represent ideas that are closely related to the idea being examined. The databases can therefore be used by systems using artificial intelligence applications that are well known in the art. Those systems currently use incomplete, manually created idea databases or ontologies as “neural networks” for applications.
Another embodiment of the present invention utilizes a computing device such as a personal computer system of the type readily available in the prior art. Although the computing device is typically a common personal computer (either stand-alone or in a networked environment), other computing devices such as PDAs, wireless devices, servers, mainframes, and the like are similarly contemplated. However, the method and apparatus of the present invention does not need to use such a computing device and can readily be accomplished by other means, including manual creation of the cross-associations. The method by which successive documents are examined to enlarge the “sample” of documents and create the cross-association database is varied: the documents can be set up for analysis and manipulation manually, by automatic feeding (such as automatic paper loaders as known in the prior art), or by using search techniques on the Internet, such as Web Crawlers, to automatically seek out the related documents.
Note that the present invention can produce an associated database by examining Comparable Text, in addition to (or even instead of) Parallel Text. Furthermore, the method looks at all available documents collectively when searching for a recurring word or word-string within a language.
b. Building the Database
According to the present invention, the documents are examined for the purpose of building the database. After document input (again, of a pair of documents representing the same text in two different languages), the creation process begins using the methods and/or apparatus described herein.
For illustrative purposes, assume that the documents contain the same content (or, in a general sense, idea) in two different languages. Document A is in language A, Document B is in language B. The documents have the following text:
The first step in the present invention is to calculate a word range to determine the approximate location of possible associations for any given word or word string. Since a cross-language word-to-word analysis alone will not yield productive results (i.e., word 1 in document A will often not exist as the literal translation of word 1 in document B), and the sentence structure of one language may have an equivalent idea in a different location (or order) of a sentence than another language, the database creation technique of the present invention associates each word or word-string in the first language with all of the words and word strings found in a selected range in the second language document. This is also important because one language often expresses ideas in longer or shorter word strings than another language. The range is determined by examining the two documents, and is used to compare the words and word-strings in the second document against the words and word-strings in the first document. That is, a range of words or word-strings in the second document is examined as possible associations for each word and word string in the first document. By testing against a range, the database creation technique establishes a number of second language words or word-strings that may equate and translate to the first language words and word-strings.
There are two attributes that must be determined in order to establish the range in the second language document in which to look for associations for any given word or word string in the first language document. The first attribute is the value or size of the range in the second document, measured by the number of words in the range. The second attribute is the location of the range in the second document, measured by the placement of the mid-point of the range. Both attributes are user defined, but examples of preferred embodiments are offered below. In defining the size and location of the range, the goal is to ensure a high probability that the second language word or word-string translation of the first language segment being analyzed will be included.
Various techniques can be used to determine the size or value of the range including common statistical techniques such as the derivation of a bell curve based on the number of words in a document. With a statistical technique such as a bell curve, the range at the beginning and end of the document will be smaller than the range in the middle of the document. A bell-shaped frequency for the range allows reasonable chance of extrapolation of the translation whether it is derived according to the absolute number of words in a document or according to a certain percentage of words in a document. Other methods to calculate the range exist, such as a “step” technique where the range exists at one level for a certain percentage of words, a second higher level for another percentage of words, and a third level equal to the first level for the last percentage of words. Again, all range attributes can be user defined or established according to other possible parameters with the goal of capturing useful associations for the word or word string being analyzed in the first language.
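By way of illustration only, the “step” technique might be sketched as follows. The specific sizes (10 and 30 words) and the 20% edge fraction are hypothetical user-defined values chosen for this sketch; the specification does not prescribe them.

```python
def step_range_size(position, doc_length, low=10, high=30, edge_fraction=0.2):
    """Range size (in words) for a 1-based word position, using three steps.

    The first and last `edge_fraction` of the document use the smaller size;
    the middle portion uses the larger size.
    """
    relative = position / doc_length
    if relative <= edge_fraction or relative > 1 - edge_fraction:
        return low
    return high


sizes = [step_range_size(p, 100) for p in (5, 50, 95)]
```

For a 100-word document this gives a 10-word range near the start and end and a 30-word range in the middle, mirroring the smaller-at-the-edges behavior described for the bell curve.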
The location of the range within the second language document may depend on a comparison between the number of words in the two documents. What qualifies as a document for range location purposes is user defined and is exemplified by news articles, book chapters, and any other discretely identifiable units of content, made up of multiple data segments. If the word count of the two documents is roughly equal, the location of the range in the second language will roughly coincide with the location of the word or word-string being analyzed in the first language. If the number of the words in the two documents is not equal, then a ratio may be used to correctly position the location of the range. For example, if document A has 50 words and document B has 100 words, the ratio between the two documents is 1:2. The mid-point of document A is word position 25. If word 25 in document A is being analyzed, however, using this mid-point (word position 25) as the placement of the midpoint of the range in document B is not effective, since this position (word position 25) is not the midpoint of document B. Instead, the midpoint of the range in document B for analysis of word 25 in document A may be determined by the ratio of words between the two documents (i.e., 25 × 2/1 = 50), by manual placement in the mid-point of document B or by other techniques.
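By way of illustration only, the ratio-based placement of the range might be sketched as follows. The function names, the rounding of the midpoint, and the clamping of the range to the document boundaries are assumptions made for this sketch.

```python
def range_midpoint(position_a, len_a, len_b):
    """Scale a 1-based word position in document A to a position in document B."""
    return round(position_a * len_b / len_a)


def range_bounds(position_a, len_a, len_b, size):
    """First and last word positions (1-based, inclusive) of the range in document B."""
    mid = range_midpoint(position_a, len_a, len_b)
    half = size // 2
    return max(1, mid - half), min(len_b, mid + half)


# The example from the text: word 25 of a 50-word document A,
# with a 100-word document B.
mid = range_midpoint(25, 50, 100)
bounds = range_bounds(25, 50, 100, size=20)
```

With these inputs the midpoint falls at word position 50 of document B, and a hypothetical 20-word range spans positions 40 through 60.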
By looking at the position of a word or word-string in the document and noting all the word or word strings that fall within the range as described above, the database creation technique of the present invention returns a possible set of words or word-strings in the second-language document that may translate to each word or word-string in the first document being analyzed. As the database creation technique of the present invention is utilized, the set of words and word strings that qualify as possible translations will be narrowed as association frequencies develop. Thus, after examining a pair of documents, the present invention will create association frequencies for words and word strings in one language with words or word strings in a second language. After a number of document pairs are examined according to the present invention (and thus a large sample created), the cross-language association database creation technique will return higher and higher association frequencies for any one word or word string. After a large enough sample, the highest association frequencies result in possible translations; of course, the ultimate point where the association frequency is deemed to be an accurate translation is user defined and subject to other interpretive translation techniques (such as those described in Provisional Application No. 60/276,107, entitled “Method and Apparatus for Content Manipulation” filed on March 16, 2001 and incorporated herein by reference).
As indicated above, the invention tests not only words but also strings of words (multiple words). As mentioned, word strings include all punctuation and other marks as they occur.
After a single word in a first language is analyzed, the database creation technique of the present invention analyzes a two-word word string, then three-word word string, and so on in an incremental manner. This technique makes possible the translation of words or word strings in one language that translate into a shorter or longer word-string (or word) in another language, as often occurs. If a word or word-string only occurs once in all available documents in the first language, the process immediately proceeds to analyze the next word or word string, where the analysis cycle occurs again. The analysis stops when all word or word strings that have multiple occurrences in the first language in all available Parallel and Comparable Text have been analyzed.
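By way of illustration only, the incremental selection of recurring words and word strings might be sketched as follows. The function name, the `max_len` cap on word-string length, and the sample documents are assumptions made for this sketch.

```python
from collections import Counter


def recurring_word_strings(documents, max_len=3):
    """Count 1..max_len-word strings across all documents; keep those seen more than once."""
    counts = Counter()
    for words in documents:
        for n in range(1, max_len + 1):          # one word, then two words, and so on
            for i in range(len(words) - n + 1):
                counts[tuple(words[i:i + n])] += 1
    # A word string that occurs only once across all documents is skipped.
    return {ws: c for ws, c in counts.items() if c > 1}


# Hypothetical first-language documents, already split into words.
documents = [
    ["i", "love", "you"],
    ["you", "love", "tea"],
    ["i", "love", "tea"],
]
recurring = recurring_word_strings(documents)
```

Counting across the whole collection, rather than per document, reflects the treatment of all available text as one aggregate when looking for recurrences.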
In a sense, any number of documents are aggregated and can be treated as one single document for purposes of looking for reoccurrences of words or word strings. In essence, for a word or word-string not to repeat it would have to occur only once in all available Parallel and Comparable Text. In addition, as another embodiment it is possible to examine the range corresponding to every word and word string regardless of whether or not it occurs more than once in all available Comparable and Parallel Text. As another embodiment, the database can be built by resolving specific words and word strings that are part of a query. When words and word strings are entered for translation, the present invention can look for multiple occurrences of the words or word-strings in cross-language documents stored in memory that have not yet been analyzed, by locating cross-language text on the Internet using web-crawlers and other devices and, finally, by asking the user to supply a missing association based on the analysis of the query and the lack of sufficiently available cross-language material.
The present invention thus operates in such a manner so as to analyze word strings that depend on the correct positioning of words (in that word string), and can operate in such a manner so as to account for context of word choice as well as grammatical idiosyncrasies such as phrasing, style, or abbreviations. These word string associations are also useful for the double overlap translation technique that provides the translation process as described herein.
It is important to note that the present invention can accommodate situations where a subset word or word string of a larger word string is consistently returned as an association for the larger word string. The present invention accounts for these patterns by manipulating the frequency return. For example, proper names are sometimes presented complete (as in “John Doe”), abbreviated by first or surname (“John” or “Doe”), or abbreviated by another manner (“Mr. Doe”). Since the present invention will most likely return more individual word returns than word string returns (i.e., more returns for the first or surnames rather than the full name word string “John Doe”), because the words that make up a word string will necessarily be counted individually as well as part of the phrase, a mechanism to change the ranking should be utilized. For example, in any document the name “John Doe” might occur one hundred times, while “John” by itself or as part of John Doe might occur one hundred twenty times, and “Doe” by itself or as part of John Doe might occur one hundred ten times. The normal translation return (according to the present invention) will rank “John” higher than “Doe,” and both of those words higher than the word string “John Doe”, all when attempting to analyze the word string “John Doe.” By subtracting the number of occurrences of the larger word string from the occurrences of the subset (or individual returns) the proper ordering may be accomplished (although, of course, other methods may be utilized to obtain a similar result). Thus, subtracting one hundred (the number of occurrences for “John Doe”), from one hundred twenty (the number of occurrences for the word “John”), the corrected return for “John” is twenty. Applying this analysis yields one hundred as the number of occurrences for the word string “John Doe” (when analyzing and attempting to translate this word string), twenty for the word “John,” and ten for the word “Doe,” thus creating the proper associations.
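By way of illustration only, the subtraction described above might be sketched as follows using the “John Doe” figures from the text. The function names and the representation of word strings as tuples are assumptions made for this sketch.

```python
def is_contained(sub, whole):
    """True if `sub` appears as consecutive words inside `whole`."""
    n = len(sub)
    return any(whole[i:i + n] == sub for i in range(len(whole) - n + 1))


def adjust_for_larger_string(frequencies, larger):
    """Subtract the larger word string's count from each of its subset associations."""
    adjusted = dict(frequencies)
    for candidate, count in frequencies.items():
        if candidate != larger and is_contained(candidate, larger):
            adjusted[candidate] = count - frequencies[larger]
    return adjusted


# Direct frequencies from the example in the text.
direct = {("John", "Doe"): 100, ("John",): 120, ("Doe",): 110}
adjusted = adjust_for_larger_string(direct, ("John", "Doe"))
ranking = sorted(adjusted, key=adjusted.get, reverse=True)
```

After subtraction the counts are 100 for “John Doe”, 20 for “John” and 10 for “Doe”, so the full name ranks first when the word string “John Doe” is being analyzed.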
Note that this issue is not limited to proper names and often occurs in common phrases and in many different contexts. For example, every time the word-string “I love you” is translated to its most frequent word-string association in another language, the word for “love” in that other language may be associated independently each of those times as well. Additionally, when the word-string is translated differently in other text that is analyzed, the word “love” may again be associated. This will skew the analysis and return the word “love” in the second language instead of “I love you” in the second language for the translation of “I love you” in the first language. Therefore, once again, the system subtracts the number of occurrences of the larger word-string association from the frequency of all subset associations when ranking associations for the larger string. These concepts are also reflected in Figure 1.
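By way of illustration only, the subtraction described above might be sketched as follows. The function names `adjust_subset_counts` and `_is_contiguous_sublist` are assumptions made for this sketch and do not appear in the specification; the counts are the “John Doe” figures given above.

```python
def _is_contiguous_sublist(small, big):
    """True if the word list `small` appears as a contiguous run inside `big`."""
    n = len(small)
    return any(big[i:i + n] == small for i in range(len(big) - n + 1))


def adjust_subset_counts(raw_counts, target):
    """Subtract the count of `target` from the count of each of its sub-strings.

    Hypothetical helper: `raw_counts` maps a word or word string to its raw
    number of returns, and `target` is the larger word string being analyzed.
    """
    target_words = target.split()
    target_count = raw_counts[target]
    adjusted = {}
    for candidate, count in raw_counts.items():
        is_subset = candidate != target and _is_contiguous_sublist(
            candidate.split(), target_words
        )
        adjusted[candidate] = count - target_count if is_subset else count
    return adjusted


raw = {"John Doe": 100, "John": 120, "Doe": 110}
adjusted = adjust_subset_counts(raw, "John Doe")
# Rank candidates from most to least frequent after the adjustment.
ranked = sorted(adjusted, key=adjusted.get, reverse=True)
```

With the adjustment applied, the full word string ranks ahead of its component words, which is the ordering the text describes as proper.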
Additionally, the database can be instructed to ignore common words such as “it”, “an”, “a”, “of”, “as”, “in”, and the like (or any other common words) when counting association frequencies for words and word-strings. This will more accurately reflect the true association frequency numbers, which would otherwise be skewed by the numerous occurrences of common words as part of any given range. This allows the association database creation technique of the present invention to prevent common words from skewing the analysis without excessive subtraction calculations. It should be noted that if these or any other common words are not “subtracted” out of the association database, they would ultimately not be approved as a translation, unless appropriate, because the double overlap process described in more detail herein would not accept them.
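A minimal sketch of ignoring common words while counting association frequencies is given below. The word list is the one quoted above; the function name and the sample returns are hypothetical and serve only to show the filtering.

```python
from collections import Counter

# Common words named in the text; a real list would be longer (assumption).
COMMON_WORDS = {"it", "an", "a", "of", "as", "in"}


def count_association_frequencies(returned_segments, common_words=COMMON_WORDS):
    """Count returned segments, skipping any segment that is a single common word."""
    counts = Counter()
    for segment in returned_segments:
        if segment.lower() in common_words:
            continue  # ignored, so it cannot skew the frequencies
        counts[segment] += 1
    return counts


# Hypothetical returns gathered from several ranges.
returns = ["of", "house", "the house", "of", "house", "in"]
frequencies = count_association_frequencies(returns)
```

Filtering at counting time avoids the later subtraction calculations for these words, which is the benefit the paragraph above points to.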
It should be noted that other calculations to adjust the association frequencies could be made to ensure the accurate reflection of the number of common occurrences of words and word strings. For example, an adjustment to avoid double counting may be appropriate when the ranges of analyzed words overlap. Adjustments are desirable in these cases to build more accurate association frequencies. An example of an embodiment of the method and apparatus for creating and supplementing a cross-idea database according to the present invention will now be described using the two documents described above as an example; the table is re-created as follows:
Note once again that although this embodiment focuses on recurring words and word-strings in only a single document, this is mainly for illustrative purposes. Recurring words and word-strings will be analyzed using all available Parallel and Comparable Text in the aggregate.
Using the two documents listed above (A, the first language and B, the second language), the following steps occur for the database creation technique.
Step 1. First, the size and location of the range are determined. As indicated, the size and location may be user defined or may be approximated by a variety of methods. The word counts of the two documents are approximately equal (ten words in document A, eight words in document B); therefore we will locate the mid-point of the range to coincide with the location of the word or word string in document A. (Note: As the ratio of word counts between the documents is 80%, the location of the range alternatively could have been established by applying a fraction of 4/5ths.) In this example, a range size or value of three may provide the best results to approximate a bell curve; the range will be (+/-) 1 at the beginning and end of the document, and (+/-) 2 in the middle. However, as indicated, the range (or the method used to determine the range) is entirely user defined.
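One possible way to compute such a range is sketched below. The text leaves the rule user defined, so the `edge_width` threshold is purely an assumption, chosen because it reproduces the radii used in the steps that follow (positions 1, 2 and 9 use (+/-) 1; positions 4 and 7 use (+/-) 2). The `scale` flag illustrates the alternative of locating the mid-point by the word-count ratio; all names are hypothetical.

```python
def range_radius(position, doc_length, edge_width=2):
    """Return 1 near either end of the document and 2 in the middle (assumed rule)."""
    near_start = position <= edge_width
    near_end = position > doc_length - edge_width
    return 1 if near_start or near_end else 2


def target_window(position, len_a, len_b, scale=False):
    """1-based positions in document B covered by the range for `position` in A.

    With scale=False the mid-point coincides with the position in document A;
    with scale=True it is moved by the word-count ratio len_b / len_a.
    Positions outside document B are truncated.
    """
    centre = round(position * len_b / len_a) if scale else position
    radius = range_radius(position, len_a)
    low = max(1, centre - radius)
    high = min(len_b, centre + radius)
    return list(range(low, high + 1))
```

For the ten-word document A and eight-word document B, `target_window(4, 10, 8)` covers positions 2 through 6, and `target_window(9, 10, 8)` is truncated to position 8 alone.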
Step 2. Next, the first word in document A is examined and tested against document A to determine the number of occurrences of that word in the document. In this example the first word in document A is X: X occurs three times in document A, at positions 1, 4, and 9. The position numbers of a word or word string are simply the location of that word, or word string in the document relative to other words. Thus, the position numbers correspond to the number of words in a document, ignoring punctuation — for example, if a document has ten words in it, and the word “king” appears twice, the position numbers of the word “king” are merely the places (out of ten words) where the word appears.
Because word X occurs more than once in the document, the process proceeds to the next step. If word X only occurred once, then that word would be skipped and the process continued to the next word and the creation process continued.
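The notion of position numbers can be illustrated with a short sketch. The sample sentence, the lower-casing of words, and the function names are assumptions made for the illustration; only the idea of counting word positions while ignoring punctuation comes from the text.

```python
import re


def positions_of(word, text):
    """1-based positions of `word` among the words of `text`, ignoring punctuation."""
    tokens = [token.lower() for token in re.findall(r"\w+", text)]
    return [i for i, token in enumerate(tokens, start=1) if token == word.lower()]


def is_examined(word, text):
    """A word is analyzed only if it occurs more than once in the document."""
    return len(positions_of(word, text)) > 1


# Hypothetical ten-word document in which "king" appears twice.
sentence = "The king spoke, and the court heard the king again."
king_positions = positions_of("king", sentence)
```

Here `king_positions` holds the two places, out of ten words, where the word appears, and a word such as “court” would be skipped because it occurs only once.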
Step 3. Possible second language translations for first language word X at position 1 are returned: applying the range to document B yields words at positions 1 and 2 (1 +/- 1) in document B: AA and BB (located at positions 1 and 2 in document B). All possible combinations are returned as potential translations or relevant associations for X: AA, BB, and AA BB (as a word string combination). Thus, X1 (the first occurrence of word X) returns AA, BB, and AA BB as associations.
Step 4. The next position of word X is analyzed. This word (X2) occurs at position 4. Since position 4 is near the center of the document, the range (as determined above) will be two words on either side of position 4. Possible associations are returned by looking at word 4 in document B and applying the range (+/-) 2; hence, two words before word 4 and two words after word 4 are returned. Thus, words at positions 2, 3, 4, 5, and 6 are returned. These positions correspond to words BB, CC, AA, EE, and FF in document B. All forward permutations of these words (and their combined word strings) are considered. Thus, X2 returns BB, CC, AA, EE, FF, BB CC, BB CC AA, BB CC AA EE, BB CC AA EE FF, CC AA, CC AA EE, CC AA EE FF, AA EE, AA EE FF, and EE FF as possible associations.
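The forward permutations listed for X2 are the contiguous word strings of the returned window, read left to right. A minimal sketch follows; the function name is hypothetical.

```python
def forward_permutations(words):
    """Single words followed by every contiguous multi-word string, left to right."""
    singles = list(words)
    strings = [
        " ".join(words[start:end])
        for start in range(len(words))
        for end in range(start + 2, len(words) + 1)
    ]
    return singles + strings


# Words returned at positions 2 through 6 of document B for X2.
x2_returns = forward_permutations(["BB", "CC", "AA", "EE", "FF"])
```

For a window of five words this produces fifteen candidates, in the same order as the list given for X2.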
Step 5. The returns of the first occurrence of X (position 1) are compared to the returns of the second occurrence of X (position 4) and matches are determined. Note that returns which include the same word or word string occurring in the overlap of the two ranges should be reduced to a single occurrence. For instance, in this example the word at position 2 is BB; this is returned both for the first occurrence of X (when operated on by the range) and the second occurrence of X (when operated on by the range). Because this same word position is returned for both X1 and X2, the word is counted as one occurrence. If, however, the same word is returned in an overlapping range, but from two different word positions, then the word is counted twice and the association frequency is recorded. In this case the return for word X is AA, since that word (AA) occurs in both association returns for X1 and X2. Note that the other word that occurs in both association returns is BB; however, as described above, since that word is at the same position (and hence is the same word) reached by the operation of the range on the first and second occurrences of X, the word can be disregarded.
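The position-aware comparison of Step 5 might be sketched as follows. The contents of document B are inferred from the positions named in the steps (AA, BB, CC, AA, EE, FF, GG, CC); the function names and the representation of a return as a word string mapped to its start positions are assumptions of this sketch.

```python
# Document B as inferred from the positions named in the steps.
DOC_B = ["AA", "BB", "CC", "AA", "EE", "FF", "GG", "CC"]


def candidates(doc, positions):
    """Map each contiguous word string in the window to the set of its start positions.

    `positions` is a contiguous list of 1-based positions in `doc`.
    """
    found = {}
    for i, start in enumerate(positions):
        for end in positions[i:]:
            text = " ".join(doc[start - 1:end])
            found.setdefault(text, set()).add(start)
    return found


def matches(base, other):
    """Word strings returned for both occurrences from at least two different positions."""
    return sorted(
        text
        for text in base.keys() & other.keys()
        if len(base[text] | other[text]) > 1
    )


x1 = candidates(DOC_B, [1, 2])           # range around position 1
x2 = candidates(DOC_B, [2, 3, 4, 5, 6])  # range around position 4
common = matches(x1, x2)
```

BB is returned for both occurrences but only from position 2, so it is disregarded; AA is returned from positions 1 and 4 and is therefore kept as the match.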
Step 6. The next position of word X (position 9) (X3) is analyzed. Applying a range of (+/-) 1 (near the end of the document) returns associations at positions 8, 9 and 10 of document B. Since document B has only 8 positions, the results are truncated and only word position 8 is returned as a possible value for X: CC. (Note: alternatively, user defined parameters could have called for a minimum of two characters as part of the analysis, which would have returned position 8 and the next closest position (which is GG in position 7)).
Comparing X3’s returns to X1’s returns reveals no matches and thus no associations.
Step 7. The next position of word X is analyzed; however, there are no more occurrences of word X in document A. At this point an association frequency of one (1) is established for word X in Language A, to word AA in Language B.
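Steps 3 through 7 can be tied together in one hedged sketch. It assumes that each later occurrence is compared against the first (base) occurrence, as the steps do, and it reuses an assumed range rule that reproduces the radii of the walk-through. Document B is inferred from the positions named in the steps, and every name below is hypothetical.

```python
from collections import Counter

DOC_B = ["AA", "BB", "CC", "AA", "EE", "FF", "GG", "CC"]  # inferred from the steps
LEN_A = 10  # word count of document A


def window(position):
    """Positions in document B for a word at `position` in A, truncated to B."""
    radius = 1 if position <= 2 or position > LEN_A - 2 else 2  # assumed rule
    return range(max(1, position - radius), min(len(DOC_B), position + radius) + 1)


def returns(position):
    """Set of (start position, word string) pairs for the range around `position`."""
    span = list(window(position))
    return {
        (start, " ".join(DOC_B[start - 1:end]))
        for i, start in enumerate(span)
        for end in span[i:]
    }


def association_frequencies(occurrences):
    """Count, per word string, the later occurrences that match the base occurrence."""
    base, *later = occurrences
    base_returns = returns(base)
    counts = Counter()
    for position in later:
        matched = {
            text
            for start, text in returns(position)
            # Same word string reached from a different position in document B.
            if any(t == text and s != start for s, t in base_returns)
        }
        counts.update(matched)
    return counts


x_frequencies = association_frequencies([1, 4, 9])  # positions of X in document A
```

For word X this yields a single association, AA, with a frequency of one, in agreement with Step 7; the third occurrence contributes nothing because its truncated range returns only CC.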
Step 8. Because no more occurrences of word X occur, the process is incremented by a word and a word string is tested. In this case the word string examined is “X Y”, the first two words in document A. The same technique described in steps 2-7 is applied to this phrase.
Step 9. By looking at document A, we see that there is only one occurrence of the word string X Y. At this point the incrementing process stops and no database creation occurs.
Because an end-point has been reached, the next word is examined (this process occurs whenever no matches occur for a word string); in this case the word in position 2 of document A is “Y”.
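The incrementing process of Steps 8 and 9 can be sketched as follows. Only the positions of X (1, 4, 9) and Y (2, 7) in document A are known from the steps, so the remaining words are filled with placeholder tokens (`w3`, `w5`, and so on) that are purely hypothetical, as are the function names.

```python
# Document A: X and Y sit at the positions named in the steps; every other
# token is a hypothetical placeholder.
DOC_A = ["X", "Y", "w3", "X", "w5", "w6", "Y", "w8", "X", "w10"]


def occurrences(doc, segment):
    """1-based start positions at which the word list `segment` occurs in `doc`."""
    n = len(segment)
    return [i + 1 for i in range(len(doc) - n + 1) if doc[i:i + n] == segment]


def longest_recurring_extension(doc, start):
    """Grow the segment starting at `start` one word at a time while it still recurs."""
    length = 0
    while start - 1 + length < len(doc):
        segment = doc[start - 1:start + length]
        if len(occurrences(doc, segment)) < 2:
            break  # end-point reached: this word string occurs only once
        length += 1
    return doc[start - 1:start - 1 + length]
```

Starting at position 1, the word X recurs but the word string “X Y” occurs only once, so the incrementing stops there and the process moves on to the word at position 2.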
Step 10. Applying the process of steps 2-7 for the word “Y” yields the following:
Two occurrences of word Y (positions 2 and 7) exist, so the database creation process continues (again, if Y only occurred once in document A, then Y would not be examined);
The size of the range at position 2 is (+/-) 1 word;
Application of the range to document B (position 2, the location of the first occurrence of word Y) returns results at positions 1, 2, and 3 in document B;
The corresponding foreign language words in those returned positions are: AA, BB, and CC;
Applying forward permutations yields the following possibilities for Y1: AA, BB, CC, AA BB, AA BB CC, and BB CC;
The next position of Y is analyzed (position 7);
The size of the range at position 7 is (+/-) 2 words;
Application of that range to document B (position 7) returns results at positions 5, 6, 7, and 8: EE, FF, GG, and CC;
All permutations yield the following possibilities for Y2: EE, FF, GG, CC, EE FF, EE FF GG, EE FF GG CC, FF GG, FF GG CC, and GG CC;
Matching the results for Y2 against those for Y1 returns CC as the only match;
Combining matches for Y1 and Y2 yields CC as an association for Y.
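Because the two ranges for Y do not overlap in document B, the comparison of Step 10 reduces to a plain set intersection, which the following sketch shows. Document B is inferred from the positions named in the steps and the function name is hypothetical.

```python
DOC_B = ["AA", "BB", "CC", "AA", "EE", "FF", "GG", "CC"]  # inferred from the steps


def word_strings(words):
    """Every contiguous word string (including single words) of `words`, as a set."""
    return {
        " ".join(words[start:end])
        for start in range(len(words))
        for end in range(start + 1, len(words) + 1)
    }


y1 = word_strings(DOC_B[0:3])  # positions 1-3, range around position 2
y2 = word_strings(DOC_B[4:8])  # positions 5-8, range around position 7
common = y1 & y2
```

The six possibilities for Y1 and the ten for Y2 share only CC. When ranges do overlap, as for X, the position check of Step 5 is still needed.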
Step 11. End of range incrementation: Because the only possible match for word Y (word CC) occurs at the end of the range for the first occurrence of Y (CC occurred at position 3 in document B), the range is incremented by 1 at the first occurrence to return positions 1, 2, 3, and 4: AA, BB, CC, and AA; or the following forward permutations: AA, BB, CC, AA BB, AA BB CC, AA BB CC AA, BB CC, BB CC AA, and CC AA. Applying this result still yields CC as a possible translation for Y. Note that the range was incremented because the returned match was at the end of the range for the first occurrence (the base occurrence for word “Y”); whenever this pattern occurs an end of range incrementation will occur as a sub-step (or alternative step) to ensure completeness.
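The end of range incrementation can be sketched as a check on whether any match sits at the end of the base range, followed by a single one-word extension. Document B is inferred from the positions named in the steps; the function names and the suffix test used to detect a match at the end of the range are assumptions of this sketch.

```python
DOC_B = ["AA", "BB", "CC", "AA", "EE", "FF", "GG", "CC"]  # inferred from the steps


def word_strings(low, high):
    """Contiguous word strings of document B between 1-based positions low..high."""
    words = DOC_B[low - 1:high]
    return {
        " ".join(words[start:end])
        for start in range(len(words))
        for end in range(start + 1, len(words) + 1)
    }


def end_of_range_incrementation(base_range, other_range):
    """Match two ranges; extend the base range by one word if a match ends it."""
    low, high = base_range
    other = word_strings(*other_range)
    common = word_strings(low, high) & other
    base_words = DOC_B[low - 1:high]
    # A match is "at the end of the range" if it is a suffix of the base window.
    at_end = any(base_words[-len(m.split()):] == m.split() for m in common)
    if at_end and high < len(DOC_B):
        high += 1
        common = word_strings(low, high) & other
    return common, (low, high)


y_matches, y_base_range = end_of_range_incrementation((1, 3), (5, 8))
```

For word Y the base range grows from positions 1-3 to positions 1-4, and the match remains CC, as stated in Step 11.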

Claims (20)

I claim:
1. A method for converting content comprising the steps of: receiving content expressed in a first state; parsing said content expressed in a first state into at least a first segment and a second segment, said first segment having a first portion, said second segment having a second portion, said first portion and said second portion having overlapping portions of said content; accessing a third segment of said content expressed in a second state, said third segment corresponding to one of said first and second segments; accessing a fourth segment of said content expressed in the second state, said fourth segment corresponding to the other one of said first and second segments and having an overlapping portion with said third segment; determining said content expressed in the second state based on combining said third and fourth segments; and providing said content expressed in said second state.
2. A method for creating a content conversion database comprising the steps of: providing a pair of documents representing the same idea in two different states; and using said pair of documents to create a database of segment associations between the two different states by parsing segments of a first state and comparing said parsed first state segments to parsed segments of a second state, and by associating an occurrence frequency between parsed segments of the first state and parsed segments of the second state.
3. The method of claim 2, wherein said creating a database includes the step of utilizing ranges of segments in said first state and said second state.
4. The method of claim 2, wherein said method for creating a content conversion database includes the step of providing a plurality of pairs of documents representing the same idea in said first and said second state, and using said plurality of pairs of documents to create a database of segment associations between the two different states by parsing segments of a first state and comparing said parsed first state segments to parsed segments of a second state, and by associating an occurrence frequency between parsed segments of the first state and parsed segments of the second state.
5. The method of claim 2, wherein said method for creating a content conversion database includes the step of providing a plurality of pairs of documents representing the same idea in a plurality of states, and using said plurality of documents to create a database of segment associations between the plurality of states by parsing segments of at least one state of the plurality of states and comparing said parsed segments to parsed segments of at least one other state, and by associating an occurrence frequency between parsed segments of different states.
6. A method of creating a database comprising the steps of: providing one or more pairs of documents representing the same idea in two or more states; selecting at least a first and a second occurrence of a chosen segment in the first state, the chosen segment having a plurality of occurrences in the documents in the first state; selecting at least a first range and a second range in the second state documents, wherein the first and second ranges broadly correspond to the first and second selected segment occurrences in the first state; comparing segments in the first range and the second range and locating segments common to both ranges; storing located common segments in said database; and associating in said database located common segments with the chosen segment, ranked by frequency of occurrence.
7. The method of claim 6, wherein said idea occurs in the form of text.
8. The method of claim 6, wherein said states occur in the form of language.
9. The method of claim 6, wherein said segments occur in the form of a word or a plurality of words.
10. A method for translating idea content from a first state to a second state comprising the steps of utilizing a database of segment associations between content in said first state and said second state to convert the content of the document in a first state into the document of a second state, wherein said conversion includes examining segments of content in said first state and segments of content in said second state, and removing similar segments from said examined first state content and said examined second state content, and associating the content of said first state content with said second state content after removal of similar segments.
11. A method of converting a document, the method comprising the steps of: providing content comprising data segments in a first state associated with data segments in a second state; selecting the largest delimited portion of the document to be translated that begins with the first segment of the document and exists in a database; retrieving from the database a segment in the second state associated with the located first segment in the first state; selecting at least a second delimited portion in the first state that has one or more overlapping segments with the previous delimited segment in the first state; retrieving from the database a segment in the second state associated with the located second segment in the first state; returning, if the two data segments in the first state have overlapping content, a single data segment in the first state; returning, if the two data segments in the second state have overlapping content, a single data segment in the second state; and associating said single data segment in said first state with said single data segment in said second state, thereby returning a conversion of said single data segment from said first state to said second state.
12. The method of claim 11, comprising the additional step of repeating the selection of the largest delimited portion of the document to be translated that exists in the database and begins with an overlapping segment of the last examined segment of the document.
13. The method of claim 11, wherein said states occur in the form of language.
14. The method of claim 11, wherein said segments occur in the form of a word or a plurality of words.
15. The method of claim 12, wherein said states occur in the form of language.
16. The method of claim 12, wherein said segments occur in the form of a word or a plurality of words.
17. A method of converting a document, the method comprising the steps of: (a) providing content comprising data segments in a first state associated with data segments in a second state; (b) selecting the largest delimited segment of the document to be translated that begins with the first word of the document and exists in a database; (c) retrieving from the database a data segment in the second language associated with the located data segment in the first language; (d) selecting at least a second delimited segment in the first language that exists in the database and has one or a plurality of overlapping words with the previous delimited segment in the first language; (e) retrieving from the database a data segment in the second language associated with the located data segment in the first language; and (f) combining the two segments in the second language to form a translation if the two data segments have an overlapping word or plurality of words, and repeating steps (e) and (f) if the two data segments do not have an overlapping word or plurality of words until a data segment is located with an overlapping word or plurality of words.
18. The method of claim 17, further comprising repeating steps (d)-(f) until the document is completely converted into a second state.
19. A computer system for converting content, comprising: a computing device that receives content expressed in a first state and parses said content into at least a first segment and a second segment, said first segment having a first portion, said second segment having a second portion, said first portion and said second portion having overlapping portions of said content; wherein said computing device accesses third and fourth segments of said content that are each expressed in a second state, said third segment corresponding to one of said first and second segments, said fourth segment corresponding to another one of said first and second segments and having an overlapping portion with said third segment; and wherein said computing device determines said content expressed in the second state based on said third and fourth segments having an overlapping portion and provides said content in the second state.
20. The computer system defined in claim 19, further comprising a database system which stores said third and fourth segments, wherein said computing device accesses said third and fourth segments from said database system.
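For illustration only, and without limiting the claims, the combining of two second-state segments that share an overlapping word or plurality of words might be sketched as follows. The function name and the sample segments are hypothetical; the sketch joins the segments on the longest run of words that ends the first segment and begins the second.

```python
def combine_on_overlap(first, second):
    """Join two word lists on the longest suffix of `first` that is a prefix of `second`.

    Returns None when the segments share no overlapping word, in which case
    the pair is not combined.
    """
    for size in range(min(len(first), len(second)), 0, -1):
        if first[-size:] == second[:size]:
            return first + second[size:]
    return None


# Hypothetical second-state segments retrieved from a database.
third_segment = ["AA", "BB", "CC"]
fourth_segment = ["BB", "CC", "DD"]
combined = combine_on_overlap(third_segment, fourth_segment)
```

Here the two segments overlap on “BB CC” and are returned as the single segment AA BB CC DD; segments with no shared words yield no combination.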
ZA200309230A 2001-03-16 2003-11-27 Content conversion method and apparatus. ZA200309230B (en)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
US37610701P 2001-03-16 2001-03-16

Publications (1)

Publication Number Publication Date
ZA200309230B true ZA200309230B (en) 2004-11-29

Family

ID=34589975

Family Applications (1)

Application Number Title Priority Date Filing Date
ZA200309230A ZA200309230B (en) 2001-03-16 2003-11-27 Content conversion method and apparatus.

Country Status (1)

Country Link
ZA (1) ZA200309230B (en)
