WO2015063536A1 - Networked language translation system and method - Google Patents

Networked language translation system and method

Publication number
WO2015063536A1
Authority
WO
Grant status
Application
Patent type
Prior art keywords
translation
segments
segment
source
human
Prior art date
Application number
PCT/IB2013/003079
Other languages
French (fr)
Inventor
Vladimir Gusakov
Artem Ukrainets
Ivan Smolnikov
Original Assignee
Translation Management Systems Ltd.
Priority date
Filing date
Publication date

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/20 Handling natural language data
    • G06F17/28 Processing or translating of natural language
    • G06F17/2809 Data driven translation
    • G06F17/2836 Machine assisted translation, e.g. translation memory
    • G PHYSICS
    • G06 COMPUTING; CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/30 Information retrieval; Database structures therefor; File system structures therefor
    • G06F17/30861 Retrieval from the Internet, e.g. browsers
    • G06F17/30864 Retrieval from the Internet, e.g. browsers by querying, e.g. search engines or meta-search engines, crawling techniques, push systems
    • G06F17/30867 Retrieval from the Internet, e.g. browsers by querying, e.g. search engines or meta-search engines, crawling techniques, push systems with filtering and personalisation

Abstract

A networked language translation system and method allowing access by a distributed network of human and machine translators that communicate electronically to provide for the translation of material. The system and method provide a way to aggregate the resources of a large number of intermittently available, mixed competency translators, human or machine, in order to provide high-quality translations.

Description

TITLE

NETWORKED LANGUAGE TRANSLATION SYSTEM AND METHOD

FIELD OF THE INVENTION

[001] The present invention relates to a method and a system for translation of content and more particularly, to a method and a system for collaborative translation.

BACKGROUND

[002] Information gathering and exchange for any scientific, commercial, political or social purpose often requires fast and easy translation of content in order to make the universe of knowledge and ideas useful on a global scale. Computer programs that translate automatically from one language to another ("machine translation programs") can in principle meet this need and such programs have been developed and are in continued development for a variety of languages. For formal (as opposed to informal, idiomatic, colloquial) content in well-studied languages (e.g., English, French, Spanish, German, and others), such machine translation programs work reasonably well.

[003] However, for more-difficult or less-studied languages (e.g., Arabic), existing machine translation programs do not work well, even for formal communications (e.g., Modern Standard Arabic), and they are particularly weak in the case of informal, colloquial, and idiomatic communications. Similarly, where specificity is needed, machine translation by itself is insufficient even for well-studied languages (e.g., English, French, Spanish, German, and others).

[004] Human translators can in principle provide accurate translations for difficult languages and informal communications, but Internet applications require constant availability and quick response, which cannot be assured in the case of existing methods that use human translators.

[005] In light of the above discussion, a method and a system are needed that enable the efficient use of a memory database, wherein a large team of translators works on content effectively.

SUMMARY OF THE INVENTION

[006] In general, the invention is achieved as follows:

[007] In one aspect, the present invention provides a system and a method for translating the language of a source file. The system comprises a web server to accept uploads of the source file and to process data in the source file to be translated, a storing database to store the translated content, processed source files and glossary terms, a segmentation module capable of segmenting the source file into a plurality of segments, a processing module to match the segments with the existing data stored in the database to get exact and/or fuzzy matches from already translated texts, a machine translation module to derive a machine translation for the segment, a terminology search module to find terms from the terms database occurring in the source text, and a user interface accessible by multiple users to view the machine translation, the exact and fuzzy segment matches and the terms found, and to provide the human translation for the source file. The system and method may be provided as computer executable code (e.g. software), hardware or a combination of both.

[008] In another aspect of the present invention, for each segment saved in the database, the processing module searches for an exact or partial match of previously translated sentences, glossary terms and the machine translations of the sentence. In embodiments, the user interface may be accessed by multiple users and the user's translations are transmitted by the user interface to a database. The user interface may also display other users' translations of completely or partially matching segments.

BRIEF DESCRIPTION OF THE DRAWINGS

[009] Embodiments of the invention will hereinafter be described in conjunction with the appended drawings, provided to illustrate and not to limit the scope of the invention, wherein like designations denote like elements and in which:

[010] FIG. 1 is a flow diagram illustrating a networked language translation system in accordance with an embodiment of the present invention.

[011] FIG. 2 shows a flow diagram illustrating a pre-translation method used in a translation unit 118 of a networked language translation system in accordance with an embodiment of the present invention.

[012] FIG. 3 is a schematic illustration of a platform with integration layer used in a networked language translation system in accordance with an embodiment of the present invention.

DETAILED DESCRIPTION OF THE INVENTION

[013] In the following detailed description of embodiments of the invention, numerous specific details are set forth in order to provide a thorough understanding of the embodiments of the invention. However, it will be obvious to a person skilled in the art that the embodiments of the invention may be practiced with or without these specific details. In other instances, well-known methods, procedures and components have not been described in detail so as not to unnecessarily obscure aspects of the embodiments of the invention.

[014] Furthermore, it will be clear that the invention is not limited to these embodiments only. Numerous modifications, changes, variations, substitutions and equivalents will be apparent to those skilled in the art, without departing from the spirit and scope of the invention.

[015] The present invention is directed to providing a system and method for fast, effective and more reliable enhanced language translation through a networked language translation system. The networked language translation system is a distributed network of human and machine translators that communicate electronically and produce the translation of texts that are challenging for both existing machine translation methods and traditional human translation workflow, including the translation of rapidly-evolving dialogs and other rapidly electronically produced data.

[016] In embodiments of the present invention, the system is a web-based cloud-type platform wherein the access to the system is provided via a web-browser through a user interface and where the interface may have a separate window for translation project management and a translator working interface for parallel text editing. The networked language translation system provides a way to aggregate the resources of a large number of intermittently available, mixed competency translators, human or machine, in order to provide high-quality translations in a cost-effective and timely manner.

[017] In embodiments of the present invention, translations are produced by breaking an input source text into segments, sending each segment as a translation request to a translator with redundant requests being sent to a plurality of machine translation engines, terminology and translation memory repositories, with each source having a varying level of reputation, match and/or self-confidence metric for each particular segment sent. Then, the results of these translations are assembled taking into account the reputation of each source, the statistical properties of the translation results (available for each segment), the linguistic and other properties of the particular source and target languages, and other relevant factors which may be represented as numerical scores.

[018] The networked language translation system is based on the technology of Translation Memory (TM), a parallel sentences storage and search system (source language-target language) which is used to facilitate translation from one language to another. TM stores translations so that next time it is not necessary to translate the same phrases or sentences again. Thus, one of the main functions of TM involves search and comparison of sentences, phrases and their translations.

[019] FIG. 1 is a block diagram illustrating a networked language translation system in accordance with an exemplary embodiment of the present invention. As shown in FIG. 1, the networked language translation system 100 services a plurality of customers 102 that desire source files 104 to be translated. The plurality of customers 102 are connected to the remote web server 112 through the internet 110 with a browser-based user interface 106. The source file 104 is uploaded to the web server 112 through the internet 110. Once a source file 104 is uploaded to the web server 112, a segmentation module 114 processes the source file 104 and breaks its text into a plurality of segments. The segments, each holding some text of the source file 104, are processed by a basic pre-translation unit 118, which matches each source text segment with linguistic resources 116, i.e. finds exactly matching segments, partially (fuzzy) matching segments and glossary terms found in the segment, and associates each segment word with an entry in a frequency dictionary. The linguistic resources 116 consist of a frequency dictionary and glossary terms for the language and the database of previously translated segments, with accompanying document metadata and the translation workflow data of each segment. Thus the pre-translation unit 118 defines for each matching segment found: the document to which it belongs, its respective translators (MT engines used, human translator, editor, proofreader etc.) and the quantitative scores received by these translators, if applicable. Based on the pre-defined pre-translation rules (which can be modified for each project), the pre-translation unit 118 then creates a pre-translated document/package containing exact and partial translation memory matches, glossary entries and the customized machine translation engines' output. These rules are applied to each segment and define which human operations are required for it, based on the source of the selected pre-translation and its confidence metric.

[020] In embodiments of the present invention, TM (Translation Memory) is a system of storage and search of parallel segments (e.g. sentences, expressions or phrases) - namely: original source - translation. It is used to help the translator translate a text from one language into another. TM accumulates translation results which help avoid translating identical segments (e.g. sentences, expressions or phrases) repeatedly. Thus, one TM function is a search for segments (e.g. sentences, expressions or phrases) and the translations thereof which correlate to the content being translated. A scoring function is used to measure similarity between the fragments of the content and those residing in a database.
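By way of illustration only, the storage and search function of a TM described above may be sketched as follows. The class name `TranslationMemory`, the 0.75 threshold and the use of Python's `difflib.SequenceMatcher` as the scoring function are assumptions made for this sketch; the description does not prescribe a particular similarity measure at this point.

```python
from difflib import SequenceMatcher


class TranslationMemory:
    """Toy in-memory translation memory: source segment -> translation."""

    def __init__(self):
        self._entries = {}

    def add(self, source, target):
        self._entries[source] = target

    def lookup(self, segment, threshold=0.75):
        """Return (score, source, target) matches at or above threshold, best first."""
        matches = []
        for source, target in self._entries.items():
            # SequenceMatcher.ratio() is a stand-in scoring function in [0, 1];
            # a score of 1.0 corresponds to an exact match.
            score = SequenceMatcher(None, segment.lower(), source.lower()).ratio()
            if score >= threshold:
                matches.append((score, source, target))
        return sorted(matches, key=lambda m: m[0], reverse=True)
```

An exact match is returned with a score of 1.0, while a segment differing in a single word is returned with a lower score as a fuzzy match.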

[021] A plurality of translators 108 are connected to the system 100 through the web interface 106. The fragments generated by the segmentation module 114 are sent as a translation request to the pre-translation unit 118, which produces a pre-translated document (which can include a selection, on a segment-by-segment level, of the preferred machine translation output from a plurality of machine translation engines) and a set of recommended human translators/editors/proofreaders with varying levels of reputation and varying preference scores. The web-based user interface 106 includes separate views for project management and a translator's working interface for parallel text editing by a plurality of translators/editors/proofreaders/reviewers etc. Using the translator's working interface, the translator 108 can view the translation of the source file 104 created by other translators as well as the translation of a segment produced by the pre-translation unit 118.

[022] Segment translations produced after the pre-translation unit 118 by human translators/editors/proofreaders 108 (different segments can undergo different human operations) are automatically verified after each human operation for compliance with the pre-defined quality assurance rules and with the customer and project glossaries. Based on these checks, each segment can receive a special warning flag that shows the current translator who previously worked on the segment, together with a description of the potential errors found in it. Each translator can correct the errors according to these reports or comment on them. During the next workflow operation the remaining errors and accompanying comments will be visible and can also be corrected. Each segment should pass through all workflow stages defined for the source document, with possible omission of some stages based on the results of the pre-translation unit 118 for that particular segment.

[023] FIG. 2 shows a flow diagram illustrating the pre-translation method used in a networked language translation system in accordance with an embodiment of the present invention. Starting from block 202, a source file 104 is provided as input by uploading it to the networked translation system. The source file 104 can be either text or binary; it is processed by the appropriate filter to convert the content into text form. In block 204, the text content of the source file 104 is divided into segments by a parser compatible with Segmentation Rules eXchange (SRX). The segments are then saved in the database together with the specially processed source file 104.

[024] The process of segmenting text into sentences may be done for example as follows: the cursor moves along the text, one character at a time. At each cursor position rules, consisting of a Before and After pattern, are applied in their given order to see if any of the Before patterns are valid for the text on the left and the corresponding After pattern for the text on the right of the cursor. If the rule matches, either the cursor moves on without inserting a segment break (for an exception rule) or a new segment break is created at the current cursor position (for a break rule).
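By way of illustration only, the cursor-based procedure of paragraph [024] may be sketched as follows. The two rules shown (one exception rule for a handful of abbreviations, one break rule for sentence-final punctuation) and the function name `segment` are assumptions chosen for the example and are not taken from any actual SRX rule file.

```python
import re

# Each rule: (is_break, before_pattern, after_pattern). Rules are tried in order;
# the first rule whose Before and After patterns both match decides the outcome.
RULES = [
    (False, r"\b(?:Mr|Mrs|Dr|e\.g|i\.e)\.\Z", r"\s"),      # exception rule: abbreviations
    (True, r"[.!?]\Z", r"\s+\S"),                           # break rule: end of sentence
]


def segment(text, rules=RULES):
    """Split text into segments by testing every cursor position against the rules."""
    compiled = [(brk, re.compile(b), re.compile(a)) for brk, b, a in rules]
    segments, start = [], 0
    for pos in range(1, len(text)):
        left, right = text[:pos], text[pos:]
        for is_break, before, after in compiled:
            # Before pattern must match the text ending at the cursor,
            # After pattern must match the text starting at the cursor.
            if before.search(left) and after.match(right):
                if is_break:
                    segments.append(text[start:pos].strip())
                    start = pos
                break  # first matching rule wins; exception rules just move on
    tail = text[start:].strip()
    if tail:
        segments.append(tail)
    return segments
```

Because the exception rule is listed first, the cursor position after "Dr." is skipped even though the break rule would also match there. The sketch re-scans the text at every position and is therefore quadratic in the text length; it is meant to show the rule logic, not an efficient parser.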

[025] In block 206, for each sentence segment saved in the database, the system searches for exact or partial matches among previously translated sentences in the Translation Memory 116, the frequency dictionary and the available glossary terms. For each exact or partial match found in the translation memory 116, the associated document metadata and workflow data are also loaded from the database. These data include the MT engines used for pre-translation of the segment, the string modification metrics and time-based productivity metrics of the human translator's post-editing of the segment, and the accounts of the translators/editors/proofreaders who performed workflow operations on this particular segment.

[026] Each translated (parallel) sentence segment added to the translation memory index of the linguistic resources 116 is first split into separate words and/or phrases, which are then looked up in morphological frequency dictionaries containing all possible forms of each word, the definition of its basic form, other possible forms, and metadata associated with the particular word form found in the indexed sentence. These data, for all the words and phrases found, are then added to an index. The target part of the given parallel sentence segment is analyzed in the same way, and a word-by-word alignment is created based on the previously mentioned morphological dictionaries. Each word is also searched for in all available customer and general glossaries 208, and data on the glossary matches are added to the index. Words whose frequency (defined on a general language text corpus) is lower than a pre-defined threshold are marked as low-frequency.

[027] In block 206, a set of translation variants is selected using a fuzzy match metric rule. While searching for previously translated sentences in the Translation Memory index, a morphological analysis of the source sentence is performed; different forms of the same word are considered to be the same word, but with a small penalty reflecting the fact that the word forms differ. A much greater penalty is applied for source segment words not found in the TM sentence and for TM sentence words missing from the source segment.
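By way of illustration only, a fuzzy match metric with the two penalty levels of paragraph [027] may be sketched as follows. The lemma table `LEMMAS`, the penalty values and the function names are hypothetical; the description does not give numerical penalties, and a real system would use a full morphological dictionary and would typically also account for word order.

```python
# Hypothetical lemma table standing in for a morphological dictionary.
LEMMAS = {"buttons": "button", "pressed": "press", "presses": "press"}

FORM_PENALTY = 0.02     # assumed small penalty: same basic form, different word form
MISSING_PENALTY = 0.20  # assumed large penalty: word present on one side only


def lemma(word):
    return LEMMAS.get(word, word)


def fuzzy_match(source, tm_source):
    """Return a score in [0, 1]; 1.0 means both sentences contain the same words."""
    src = source.lower().split()
    remaining = tm_source.lower().split()
    penalty = 0.0
    for word in src:
        if word in remaining:
            remaining.remove(word)            # identical word form, no penalty
            continue
        # Greedy search for a TM word with the same basic form.
        same = next((t for t in remaining if lemma(t) == lemma(word)), None)
        if same is not None:
            remaining.remove(same)
            penalty += FORM_PENALTY           # different form of the same word
        else:
            penalty += MISSING_PENALTY        # source word not found in TM sentence
    penalty += MISSING_PENALTY * len(remaining)  # TM words missing from the source
    return max(0.0, 1.0 - penalty)
```

With these assumed values, a sentence that differs only in a word ending scores close to 1.0, while a sentence with an extra or missing word loses a full 0.2.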

[028] In block 208, a source text sentence is split into words, and each word is looked up in the morphological dictionary to get its basic form and then in the frequency dictionary, so that a frequency metric is attached to each word. Source segment words whose frequency metric (frequency is defined on a general language text corpus) is lower than a pre-defined threshold are marked as low-frequency. Each word is then searched for in all available customer and general glossaries, and the entries found are saved. If a given word is part of a multi-word glossary entry, the presence of the entry's other words in the source segment is checked; if they are present, this multi-word entry is also saved as a glossary match. For each word found in the morphological dictionary we also define its possible parts of speech and assign them to the word. Then, based on the terminology extraction rules and the entire source document text, we define candidate terms to be added to the project glossary. There are two sets of criteria for this candidate selection: linguistic and statistical. Linguistic criteria are defined in the form of rules for acceptable combinations and lists of stop words that are not extracted as terms; e.g. we can define that a combination of two nouns can be a legitimate candidate term. Statistical criteria define a minimal number of occurrences of a candidate term in the source text; this number can be set manually by the project manager and depends on the project phase, the volume of already existing glossaries etc. A set of new candidate terms is then defined and saved into the database in block 218. Project managers can then assign a translator to review this extraction and add term translations; the translator reviews and translates the extraction in block 220 through the web-based user interface 106. The translator can see reference term translations from the customer's translation memories, if available, and from other glossaries and translation memories. Translated and reviewed additions to the project glossary are then transferred to block 208 and are finally included in the aggregated datasets created in block 214.
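By way of illustration only, the two sets of criteria for candidate term selection may be sketched as follows for two-word candidates. The part-of-speech tags, the stop word list, the allowed combinations and the function name `extract_candidates` are assumptions for this sketch; in the described system the parts of speech come from the morphological dictionary and the minimal count is set by the project manager.

```python
from collections import Counter

STOP_WORDS = {"the", "a", "of", "is", "in"}              # assumed stop word list
ALLOWED_PATTERNS = {("NOUN", "NOUN"), ("ADJ", "NOUN")}   # assumed combination rules


def extract_candidates(tagged_tokens, min_count=2):
    """tagged_tokens: list of (word, part_of_speech) pairs for the whole document."""
    counts = Counter()
    for (w1, p1), (w2, p2) in zip(tagged_tokens, tagged_tokens[1:]):
        w1, w2 = w1.lower(), w2.lower()
        if w1 in STOP_WORDS or w2 in STOP_WORDS:
            continue                       # linguistic criterion: stop words
        if (p1, p2) in ALLOWED_PATTERNS:   # linguistic criterion: acceptable combination
            counts[(w1, w2)] += 1
    # Statistical criterion: minimal number of occurrences in the source text.
    return sorted(" ".join(pair) for pair, n in counts.items() if n >= min_count)
```

Raising `min_count` for a large document, or lowering it when few glossaries exist yet, corresponds to the manual adjustment by the project manager mentioned above.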

[029] As described in section [028], when translated (parallel) segments are added to the translation memory 116, the glossary entries found in the source and target sentences and the low-frequency words in the source sentence are identified, and this information is attached to the index. In block 210 we look through the index created in the translation memory 116 for the low-frequency words and glossary entries found in the source segment, and thus define a set of related segments with the same low-frequency words and glossary terms. For each sentence included in the set we define a terminology match metric for machine translation:

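$$\mathrm{MT\_terminology\_match}_{i,j} = \frac{\mathrm{MT\_terminology\_index}_{i,j}}{\max(\mathrm{MT\_terminology\_index})}$$

MT_terminology_index_i = [formula reproduced only as an image in the source (Figure imgf000009_0001); the only surviving text fragment is the label glossary_entries_i]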

Where i denotes a source segment, j denotes a corresponding related segment, and f is the frequency of the respective word. The constants C1, C2, C3 and f0 are chosen empirically for each language to provide the best possible correlation with human evaluation and with automatic string metrics of machine translation engines trained on the texts filtered according to the MT_terminology_match_{i,j} calculation algorithm. MT_terminology_index_{i,j} is calculated according to the same formula as MT_terminology_index_i, but only words present in both sentences are taken into account, and the set of words present in both sentences is saved for all pairs of related sentences i and j. Words are considered matching if their basic forms match, which means that different morphological forms count as the same word. glossary_entries_i and glossary_entries_j are counted here as words found in the source text of the segment; it does not matter whether the translation of the glossary entry in the target text of the parallel segment coincides with the glossary translation or not.

For each pair of related sentences we also calculate a general match metric, in the same way as for translation memory matches. For the purpose of selecting the best applicable human translators/editors/proofreaders we define the following terminology match metrics:

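$$\mathrm{Human\_terminology\_match}_{i,j} = \frac{\mathrm{Human\_terminology\_index}_{i,j}}{\max(\mathrm{Human\_terminology\_index})}$$

Human_terminology_index_i = [formula reproduced only as an image in the source (Figure imgf000010_0001)]

$$\mathrm{Human\_terminology\_match\_strict}_{i,j} = \frac{\mathrm{Human\_terminology\_index\_strict}_{i,j}}{\max(\mathrm{Human\_terminology\_index\_strict})}$$

Human_terminology_index_strict_i = [expression garbled in the source; the surviving fragments show sums over matching_glossary_entries_i and low_freq_words_i, with terms involving C1, words_in_entry, C3, f and f0]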

Where matching_glossary_entries_i are the glossary entries found in the parallel segment whose translation in the segment's target text is the same as in the glossary, and different_glossary_entries_i are the glossary entries found in the parallel segment whose translation in the segment's target text differs from the glossary. Glossary entries of the latter kind are added to the metric only when one of the following conditions applies: the glossary entry has more than one word, or it is a low-frequency word.

[030] For each related segment from block 210 and each fuzzy match translation memory segment from block 206 we define the documents to which they belonged and the workflow used for these documents: MT engines, human translators/editors/proofreaders, the productivity of their work and the string metrics of the modifications made at each workflow stage. Data on translator/editor/proofreader productivity are gathered in real time in the web-based user interface 106: the interface collects, stores and then sends to the server all user actions, i.e. all keystrokes, mouse clicks and complex events such as entering segment editing, leaving segment editing, and substitution of text from the translation memory, glossary or machine translation engine. Editing distance metrics are also calculated. Two types of metrics are used: pure string metrics comparing two strings (e.g. Levenshtein distance) and activity-based metrics in which editor activities (keystrokes, mouse clicks and complex events) are also taken into account. The time spent on editing is also calculated, taking into account inactivity periods when focus was lost or when the inactivity period between two actions was greater than a pre-defined threshold.
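By way of illustration only, a pure string metric and a time metric of the kind described in paragraph [030] may be sketched as follows. The function names, the 30-second idle threshold and the choice to simply discard gaps above the threshold are assumptions for this sketch; focus-loss events and activity-based metrics are not modelled.

```python
def levenshtein(a, b):
    """Classic edit distance: minimal number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def active_editing_time(event_times, idle_threshold=30.0):
    """Sum gaps between consecutive events, skipping gaps longer than the threshold."""
    total = 0.0
    for earlier, later in zip(event_times, event_times[1:]):
        gap = later - earlier
        if gap <= idle_threshold:
            total += gap
    return total
```

Here `event_times` stands for the timestamps, in seconds, of the keystrokes and clicks collected by the interface for one segment.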

[031] Block 210 is repeated for every segment of the source text; thus we obtain a set of related segments in the translation memory database 116. One segment from the translation memory database 116 can be chosen as a related segment for a plurality of source text segments.

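$$\mathrm{MT\_terminology\_document\_match}_j = \max_i\left(\mathrm{MT\_terminology\_match}_{i,j}\right)\;\cdots$$

[The remainder of this expression is garbled in the source; the surviving fragments show a sum of MT_terminology_match_{i,j} terms combined with log N and N.]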

Where MT_terminology_document_match_j is a metric for the match of sentence j from the translation memory database 116 to a source document for translation, i ranges over the matching source text segments with a positive value of the MT_terminology_match_{i,j} metric, and N is the number of distinct source text segments matching sentence j from the translation memory database 116.

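The same formula is used for Human_terminology_match_{i,j}:

$$\mathrm{Human\_terminology\_document\_match}_j = \max_i\left(\mathrm{Human\_terminology\_match}_{i,j}\right)\;\cdots$$

[The remainder of this expression is garbled in the source; the surviving fragments show a sum of Human_terminology_match_{i,j} terms combined with log N.]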

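[032] In block 212, for each document in the translation memory database for which we found matching segments in blocks 210 and 206, we then define a document similarity metric.

$$\mathrm{MT\_document\_similarity}_m = \frac{C_4 \sum_{j \in \mathrm{related\_segments}} \max(\mathrm{MT\_terminology\_document\_match}_j) \cdot \mathrm{segment\_words}_j}{\mathrm{words\_total}_m} + \frac{C_5 \sum_{j \in \mathrm{TM\_matches}} \mathrm{TM\_match\_metric}_j \cdot \mathrm{segment\_words}_j}{\mathrm{words\_total}_m}$$

$$\mathrm{Human\_document\_similarity}_m = \frac{C_4 \sum_{j \in \mathrm{related\_segments}} \max(\mathrm{Human\_terminology\_document\_match}_j) \cdot \mathrm{segment\_words}_j}{\mathrm{words\_total}_m} + \frac{C_5 \sum_{j \in \mathrm{TM\_matches}} \mathrm{TM\_match\_metric}_j \cdot \mathrm{segment\_words}_j}{\mathrm{words\_total}_m}$$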

Where TM_match_metric_j is the fuzzy match percentage metric described above in section [029], normalized to the (0, 1] range, m is a document in the translation memory database, and words_total_m is the number of words in document m. C4 and C5 are empirically pre-defined constants.
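By way of illustration only, the document similarity metric of paragraph [032] may be computed as sketched below, under one reading of the formula: a C4-weighted sum over related segments plus a C5-weighted sum over translation memory matches, each term weighted by the segment's word count and the total divided by the number of words in document m. The function name, the argument layout and the default constants of 1.0 are assumptions.

```python
def document_similarity(related, tm_matches, words_total, c4=1.0, c5=1.0):
    """
    related:     list of (terminology_document_match, segment_words) pairs
    tm_matches:  list of (tm_match_metric, segment_words) pairs, metric in (0, 1]
    words_total: number of words in the translation memory document m
    """
    if words_total <= 0:
        raise ValueError("words_total must be positive")
    # Each segment contributes its match value weighted by its length in words.
    related_part = sum(match * words for match, words in related)
    tm_part = sum(metric * words for metric, words in tm_matches)
    return (c4 * related_part + c5 * tm_part) / words_total
```

The same function serves for both MT_document_similarity and Human_document_similarity; only the terminology match values passed in `related` differ.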

[033] For each document in the translation memory database 116 we can define the glossaries explicitly assigned to this document by the customer or project manager. In block 214 we create four sets of linguistic resources for the source document:

(1) Glossaries and translation memories explicitly assigned to the document by the customer or project manager;

(2) Ordered set of resources for translation model customization:

1. entries from explicitly assigned to the document glossaries;

2. parallel segments from explicitly assigned to the document translation memories;

3. for each document whose MT_document_similarity_m metric exceeds a pre-defined threshold we add (documents are ordered according to their MT_document_similarity_m metric, documents with the higher metric value come first):

a. multiword entries from explicitly assigned to this document glossaries;

b. parallel segments from this document;

c. parallel segments from explicitly assigned to this document translation memories;

(3) Ordered set of resources for language model customization:

1. parallel segments from explicitly assigned to the document translation memories;

2. for each document whose Human_document_similarity_m metric exceeds a pre-defined threshold we add (documents are ordered according to their Human_document_similarity_m metric, documents with the higher metric value come first):

a. parallel segments from this document;

b. parallel segments from explicitly assigned to this document translation memories;

(4) Set of segments of the source document with the ordered data for each segment:

1. fuzzy matching segments - only segments from the documents/translation memories present in the dataset (2) are included, segments are ordered according to the match percentage, segments with higher match percentage come first;

2. related segments with a positive Human_terminology_match_strict_{i,j} metric - only segments of the documents/translation memories present in the dataset (2) are included, segments are ordered according to the match percentage, segments with higher match percentage come first, for each parallel segment a word-by-word alignment of segment source text and its translation is also created and stored together with the segment;
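By way of illustration only, the assembly of the ordered dataset (2) may be sketched as follows. The dictionary layout of `docs`, the representation of glossary entries and segments as (source, target) pairs and the function names are assumptions for this sketch.

```python
def select_documents(similarity_by_doc, threshold):
    """Return ids of documents whose metric exceeds the threshold, highest first."""
    selected = [(score, doc) for doc, score in similarity_by_doc.items() if score > threshold]
    selected.sort(key=lambda item: item[0], reverse=True)
    return [doc for _score, doc in selected]


def build_translation_model_resources(assigned_glossary, assigned_tm, docs, similarity, threshold):
    """docs: {doc_id: {"glossary": [...], "segments": [...], "tm": [...]}} (assumed layout)."""
    # Items 1 and 2: resources explicitly assigned to the source document come first.
    resources = list(assigned_glossary) + list(assigned_tm)
    # Item 3: similar documents, most similar first.
    for doc_id in select_documents(similarity, threshold):
        doc = docs[doc_id]
        resources += [e for e in doc["glossary"] if len(e[0].split()) > 1]  # a. multiword entries
        resources += doc["segments"]                                        # b. document segments
        resources += doc["tm"]                                              # c. assigned TM segments
    return resources
```

Dataset (3) can be built in the same way from Human_document_similarity values, omitting the glossary items.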

[034] In block 226, dataset (1) can be packaged for download and use in an arbitrary external environment. Otherwise these data are stored in the database and shown for each respective segment to the translator/editor/proofreader through the web-based user interface 106.

[035] In block 216, datasets (2) and (3) are added to the statistical and model-based machine translation systems as customization data. The data are added with a higher priority than the general corpus, and inside each dataset they are ordered in the same way as described above. The translation models and language models of the machine translation engines are then retrained if necessary, or custom models are trained.

[036] In block 222 a draft machine translation is performed for each source text segment, which is processed by all available pre-defined machine translation engines. Each engine is customized in block 216 (if it is customizable) with the datasets (2) and (3) created in block 214. To each source text segment sent to each engine for translation we also add data from the dataset (4) created in block 214. Dataset (4) is used in the following way: we take each parallel segment from the fuzzy matching segments of dataset (4) and define the sets of words (substrings) that match the source segment; for each matching set of words we define its translation in the parallel segment based on the word-by-word alignment. We then explore possible combinations of such substrings to get better coverage of the source segment text; only substrings containing more than one word, or low-frequency words, are considered. Thus we produce multiple options with translations from fuzzy matches.

Then we take the related segments with a positive Human_terminology_match_strict_{i,j} metric from the dataset (4) and extract from them, based on the word-by-word alignment of the parallel segments, possible translations for low-frequency words and glossary entries. Then, for each translation option constructed as described above, we look for those low-frequency words and glossary entries that are not yet contained in any of the segment substrings with translation.

Thus, for each segment we define a set of options; each option contains the source segment text marked up with a possible translation of some of its substrings. For each option we define a match percentage metric, which gives the part of the segment covered by substrings with translation. These segment options are then sent to each of the available machine translation engines. For each input option each engine produces a translation and an accompanying self-confidence metric, if available. For all machine translation engines we also calculate our own fluency metric (e.g. perplexity-based) using the statistical machine translation language model trained and customized in block 216. We then exclude machine translations with wrong terminology (terminology is verified against the project glossaries). Then, for each machine translation engine, we select the translation with the best self-confidence metric if it is available, or otherwise with the best value of our own fluency metric. The metric values are saved into the database together with the machine translations.
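By way of illustration only, the per-engine selection step may be sketched as follows. The dictionary keys, the function names and the naive substring test used for terminology verification are assumptions for this sketch; in particular, a real check would work on basic word forms rather than raw substrings.

```python
def violates_glossary(source, translation, glossary):
    """True if a glossary term occurs in the source but its required target is absent."""
    src, tgt = source.lower(), translation.lower()
    return any(term.lower() in src and required.lower() not in tgt
               for term, required in glossary.items())


def pick_translation(source, candidates, glossary):
    """
    candidates: list of dicts for one engine with keys "text", "self_confidence"
    (float or None, higher is better) and "perplexity" (float, lower is better).
    """
    # Exclude translations with wrong terminology.
    valid = [c for c in candidates if not violates_glossary(source, c["text"], glossary)]
    if not valid:
        return None
    # Prefer the engine's own self-confidence metric when it is available.
    if all(c["self_confidence"] is not None for c in valid):
        return max(valid, key=lambda c: c["self_confidence"])
    # Otherwise fall back to the fluency metric computed by the system.
    return min(valid, key=lambda c: c["perplexity"])
```

Each candidate corresponds to the engine's output for one of the segment options described above.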

[037] In an embodiment of the present invention, when a translator (or editor/proofreader) works on the sentence translation via the web page interface 106, all his activities inside this web page, including keyboard strokes, mouse clicks, hot keys or user interface element usage, are collected, sent to the web server and stored in the database. Based on these collected worker data, personal productivity and quality metrics are calculated asynchronously, e.g. the time spent on the sentence translation (editing/correction) and the amount of insignificant and significant (e.g. terminology) changes made to the translation during the next workflow steps (e.g. changes made by the editor after the translator). A manually defined quality metric can also be attached to samples of each translated project or even to each single document translated by a translator. These manual metrics are based on an error typology approach, in which a Language Quality Assurance (LQA) specialist performs a thorough analysis of a small sample of the text and, for each sentence, records in the database the mistakes found and their type and severity. Both types of metrics are used for assigning a reputation to each of the translators, proofreaders and editors.

For each segment that went through machine translation post-editing we count the following events: (1) significant terminology change (rephrasing), when one word is changed into another one, (2) words reordered, (3) words harmonization (change of the word endings, especially in languages with complex morphology, etc.). Then, for each engine, we calculate the expected amounts of changes as table functions of the segment length and the machine translation self-confidence and/or statistical fluency metric:

Terminology_change(segment_words, MT_metric), Words_reordering(segment_words, MT_metric), Words_harmonization(segment_words, MT_metric).
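One way to read "table function" is a lookup table indexed by a segment-length bucket and an MT-metric bucket, filled with the average number of changes observed in past post-editing. The sketch below follows that reading. The bucket edges, the 0..1 metric scale and the function names are hypothetical and are not taken from the source.

```python
from collections import defaultdict

LENGTH_EDGES = (5, 10, 20, 40)    # hypothetical length bucket edges, in words
METRIC_EDGES = (0.25, 0.5, 0.75)  # hypothetical edges for a metric scaled to 0..1


def bucket(value, edges):
    """Index of the first edge the value does not exceed (len(edges) if none)."""
    for i, edge in enumerate(edges):
        if value <= edge:
            return i
    return len(edges)


def build_table(observations):
    """observations: (segment_words, mt_metric, change_count) tuples
    for one engine and one change type."""
    sums = defaultdict(lambda: [0.0, 0])
    for words, metric, changes in observations:
        cell = sums[(bucket(words, LENGTH_EDGES), bucket(metric, METRIC_EDGES))]
        cell[0] += changes
        cell[1] += 1
    # Each table cell holds the mean number of changes seen in that bucket.
    return {key: total / count for key, (total, count) in sums.items()}


def expected_changes(table, segment_words, mt_metric, default=0.0):
    """Look up the expected number of changes; `default` covers empty cells."""
    key = (bucket(segment_words, LENGTH_EDGES), bucket(mt_metric, METRIC_EDGES))
    return table.get(key, default)
```

One such table would be built per engine and per change type (terminology change, reordering, harmonization).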

For each translator we also have data on all segments she has post-edited, the number of changes of each type and the total time spent post-editing each segment. We can then calculate, for each translator and each machine translation engine, constants t1, t2, t3 which provide the best linear interpolation for the given worker data set:

$$
\text{Time\_spent} = t_1 \cdot \text{Terminology\_change}(\text{segment\_words}, \text{MT\_metric}) + t_2 \cdot \text{Words\_reordering}(\text{segment\_words}, \text{MT\_metric}) + t_3 \cdot \text{Words\_harmonization}(\text{segment\_words}, \text{MT\_metric})
$$
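The constants can be estimated, for example, by ordinary least squares over the translator's post-editing history for one engine. The text does not name a fitting method, so least squares is an assumption here, as are the function name and the row layout.

```python
def fit_time_constants(rows):
    """rows: (terminology, reordering, harmonization, time_spent) tuples.
    Returns (t1, t2, t3) minimising the squared error of
    time_spent = t1*terminology + t2*reordering + t3*harmonization."""
    # Build the normal equations (X^T X) t = X^T y for three unknowns.
    a = [[0.0] * 3 for _ in range(3)]
    b = [0.0] * 3
    for *x, y in rows:
        for i in range(3):
            b[i] += x[i] * y
            for j in range(3):
                a[i][j] += x[i] * x[j]
    # Solve by Gauss-Jordan elimination with partial pivoting.
    m = [a[i] + [b[i]] for i in range(3)]
    for col in range(3):
        pivot = max(range(col, 3), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("not enough varied data to fit three constants")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(3):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [mr - factor * mc for mr, mc in zip(m[r], m[col])]
    return tuple(m[i][3] / m[i][i] for i in range(3))
```

With only a few segments per engine the fit is unstable, so a practical system would likely fall back to per-language or global constants until enough history accumulates.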

[038] Based on the machine translation metrics calculated in block 222, the segment length and the table functions with the expected amount of changes for each engine, we define for each machine translation engine the expected amount of human changes to each machine translation. Then we select the machine translation with the best score; if multiple engines have similar scores, we select the translation with the highest match percentage metric, i.e. the largest part of the segment covered by substrings with translations from the translation memory. These data are then stored in the database and become available for packaging in block 226.
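The selection with its tie-break can be sketched as below. The sketch assumes that the "score" is the expected amount of human changes (lower is better) and that "similar" means within a relative tolerance of the best score. Both are interpretations, and the dictionary keys are hypothetical.

```python
def select_translation(candidates, tolerance=0.05):
    """candidates: dicts with 'engine', 'expected_changes' and 'tm_coverage'.
    Picks the lowest expected post-editing effort; among candidates within
    `tolerance` (relative) of the best, prefers the highest TM coverage."""
    best = min(c["expected_changes"] for c in candidates)
    limit = best * (1 + tolerance) + 1e-9  # small slack for float rounding
    near_best = [c for c in candidates if c["expected_changes"] <= limit]
    return max(near_best, key=lambda c: c["tm_coverage"])
```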

[039] For each parallel segment included in the dataset (3) with a set of resources for language model customization, created in block 214, we have a complete set of workflow data: translator (or machine translation post-editor), editor, corrector. We can also identify the documents containing these segments, the manual Language Quality Assurance (LQA) metrics based on the error description in a sample of the text, the automated metrics on the number of changes of different types made by the editor and corrector (terminology change, words reordering, words harmonization), and the time they have spent editing and correcting every segment. In block 230 we then calculate a weighted LQA metric for every translator/editor/proofreader who participated in the translation workflow of the documents from the dataset (3):

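$$
\text{LQA\_total} = \frac{\sum_m \text{Human\_document\_similarity}_m \cdot \text{words\_total}_m \cdot \text{LQA\_metric}_m}{\sum_m \text{Human\_document\_similarity}_m \cdot \text{words\_total}_m}
$$

$$
\text{Weight\_total} = \sum_m \text{Human\_document\_similarity}_m \cdot \text{words\_total}_m
$$

where LQA_metric_m is the human LQA metric assigned to document m.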

Based on the LQA_total metric we exclude translators/editors/proofreaders whose metric value is lower than an empirically defined threshold. This threshold depends mostly on the project's quality requirements, defined by the project manager during project setup. Then we cluster the LQA_total values for the translators/editors/proofreaders, and inside each cluster we sort the results by Weight_total. Thus in block 230 we define a sorted list of preferred translators/editors/proofreaders for the given source document.
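The weighting, filtering and ordering steps can be illustrated with the sketch below. The clustering method is not specified in the text, so fixed-width LQA bands are used purely as a stand-in. The function names, the tuple layout and the `cluster_width` parameter are hypothetical.

```python
def lqa_scores(documents):
    """documents: (similarity, words_total, lqa_metric) tuples for one worker.
    Returns (LQA_total, Weight_total); LQA_total is None without any weight."""
    weight_total = sum(s * w for s, w, _ in documents)
    if weight_total == 0:
        return None, 0.0
    lqa_total = sum(s * w * q for s, w, q in documents) / weight_total
    return lqa_total, weight_total


def rank_workers(history, threshold, cluster_width=5.0):
    """history: {worker: documents}. Returns worker names, most preferred first."""
    scored = []
    for worker, docs in history.items():
        lqa_total, weight_total = lqa_scores(docs)
        if lqa_total is not None and lqa_total >= threshold:
            scored.append((worker, lqa_total, weight_total))
    # Higher LQA band first; inside a band, larger Weight_total first.
    scored.sort(key=lambda r: (-int(r[1] // cluster_width), -r[2]))
    return [worker for worker, _, _ in scored]
```

Sorting by Weight_total inside a cluster favours workers whose quality score rests on more, and more similar, translated text.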

For each translator we also define an expected post-editing time and effort for the given source document. The post-editing time is calculated from her sets of constants t1, t2, t3 for each machine translation engine, the machine translations selected in block 224 for each segment, and the metrics of these translations.
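As a rough illustration, the document-level estimate can be obtained by summing the per-segment time model over the selected translations. The data layout below (a per-engine tuple of constants and per-segment expected change counts) is an assumption made for the example.

```python
def expected_post_editing_time(segments, constants):
    """segments: dicts with the 'engine' of the selected translation and the
    expected change counts 'terminology', 'reordering', 'harmonization'.
    constants: {engine: (t1, t2, t3)} fitted for one translator."""
    total = 0.0
    for seg in segments:
        t1, t2, t3 = constants[seg["engine"]]
        total += (t1 * seg["terminology"]
                  + t2 * seg["reordering"]
                  + t3 * seg["harmonization"])
    return total
```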

[040] The project manager reviews the translators/editors/proofreaders recommended for the project, the expected post-editing time and effort, and the project's statistics, including translation memory matches, glossary matches, and the size and similarity metrics of the datasets created in block 214. The project manager gets availability data for the recommended translators/editors/proofreaders from the project management subsystem, makes a final decision and sends invitations to the project to the selected responsible parties. This decision can also be made automatically based on the real-time availability status, the projected workload of each person and the project turnaround requirements.

[041] After receiving an invitation, the translator/editor/proofreader confirms or declines her participation in the project. After confirmation, the responsible person in block 232 can sign in to her web-based user interface 106 and start working on the project. There are two kinds of projects: sequential, where the next workflow stage starts only after the project's previous stage is completed, and parallel, where workflow stages are defined at the segment level, i.e. the next workflow stage can be performed on a segment right after it has passed through the previous stage.
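The difference between the two project kinds comes down to the gating rule for starting a stage on a segment, which can be sketched as follows. The stage names, the `done` set and the function signature are hypothetical and serve only as illustration.

```python
STAGES = ("translation", "editing", "proofreading")  # assumed stage order


def can_start(stage, segment_id, done, all_segments, mode):
    """done: set of (stage, segment_id) pairs that are already completed."""
    index = STAGES.index(stage)
    if index == 0:
        return True  # the first stage has no prerequisite
    previous = STAGES[index - 1]
    if mode == "parallel":
        # Segment-level gating: only this segment must have passed the previous stage.
        return (previous, segment_id) in done
    if mode == "sequential":
        # Project-level gating: every segment must have passed the previous stage.
        return all((previous, s) in done for s in all_segments)
    raise ValueError(f"unknown mode: {mode}")
```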

[042] When the document has passed through all the necessary workflow stages and the translation is finalized, it goes to block 128, where the translated document is generated. The translated document 130 has the same file format (text or binary) as the source file. The translated document is generated from the final translations of the segments and the source file metadata produced by the segmentation module 114 during the initial source file processing stage.

[043] The translated document 130 is then delivered to the client through a web-based user interface, from which the customer downloads the file. If the files are stored in an external information system, the translated file is delivered into this system through a web API call from the integration layer 302.

[044] FIG. 3 is a schematic illustration of a platform used in a networked language translation system in accordance with an embodiment of the present invention. Referring to FIG. 3, the platform includes three layers: integration layer 302, translation platform 304 and additional modules 306. The integration layer 302 facilitates the uploading of the source file 102 and the conversion of source files into text form. The integration layer 302 is present in the platform if an information system such as a content management system, document management system or portal is used as a source of files for translation. The translation platform layer 304 includes a web-based user interface, server side processing and a persistent layer. The web-based user interface provides an interface for translators, managers, editors, proofreaders, terminologists and other users. The web-based user interface comprises a separate window for translation project management and the translator's working interface for parallel text editing. The server side processing involves document conversion and validation. The additional module service layer 306 includes a translation memory wherein fuzzy matches are found, an ABBYY Lingvo™ dictionary or other dictionary, a machine translator for post-editing, an aligner that converts the external document into XML format and a spell check unit. Other subsystems can also be integrated, if necessary, through an integration API service bus.

Claims

We claim:
1. A networked language translation system for translating a source file comprising:
a cloud networked server allowing access by a plurality of translators and a plurality of customers connected to the cloud network through the internet;
a user interface to facilitate the uploading of the source files by the plurality of customers to the language translation system and receiving information in the form of suggested translations and glossaries;
a segmentation module to break the source file into a plurality of logical segments and sending the logical segments to the plurality of translators for translating the segments;
a translation memory database to store translated logical segments and search for similar segments;
a morphological dictionaries to store words in different forms and associated metadata, including frequency data and find them in the source and target text of the segments;
a glossaries module to store glossary terms and find them in the source and target text of the segments;
a matching unit to match the segments to the segments in translation memory database, determining similarity and terminology similarity, and suggesting a set of data for machine translation engines customization;
a word-by-word alignment module to align words in the source and target text of the parallel sentence;
a processing module to select the best machine translation on a segment-by-segment level and aggregate these translations;
a module to define a set of recommended human translators/editors/proofreaders based on their workflow history and human quality metrics assigned to the translated documents;
a module receiving human translators inputs and respective metadata on the actions they take through the web-based user interface;
a processing module to assemble the result translation generated by machine translation engines and the plurality of human translators, to generate the translation of source file.
2. The networked language translation system of claim 1 wherein the translation memory database stores indexed logical segments and translations of the logical segments with respective metadata.
3. The networked language translation system of claim 2 wherein the metadata for indexed logical segments and translation comprises the information on time, the translator, source and the quality score of the translated logical segment.
4. The networked language translation system of claim 1 wherein the logical segments are phrases, sentences or idioms.
5. The networked language translation system of claim 1 wherein the logical segments are matched to the translation memory database on the basis of fuzzy strings similarity calculation logic, taking into account different morphological forms of the same words.
6. The networked language translation system of claim 1 wherein the matching unit performs lexical, morphological and syntactic analysis of the logical segments.
7. The networked language translation system of claim 1 wherein the user interface displays real time translations of logical segments produced by other translators.
8. The networked language translation system of claim 1 wherein the translators are human or machine.
9. A method for translating a source file to a target in a cloud network comprising:
receiving a request to translate the source file in the cloud network;
breaking the source file into a plurality of logical segments;
searching for the similar fragments and fragments with the similar terminology in a translation memory database having translated logical segments, dictionary and glossary for the source file language, the said translated logical segments have an associated quality value;
forwarding the translation requests to a plurality of machine translation engines on the cloud network with associated datasets to retrain or customize each engine and similar segments from translation memory database which can be used to assemble a machine translation of the source segment;
creating a set of translation options based on the set of similar segments with source segment substrings replaced with translations from similar segments and missing words substituted from glossaries, dictionaries or phrases generated from the translation memory database;
collecting a translation response from the plurality of machine translation engines for the logical segment;
selecting the best translation variant from the translated options;
selecting the best human translators on the cloud network, based on a reputation which is calculated on a set of previously translated documents similar to the source file to be translated;
forwarding the requests to selected human translators;
providing the plurality of human translators with a web interface for simultaneous realtime work on the document translation;
assembling the translation response from the human translator to generate the target file.
10. The method of claim 9 wherein the source file is a text file or binary file.
11. The method of claim 9 wherein the Language Quality Assurance value is obtained by recording errors and severity of mistakes in the text translated by the human translator or in the text post-edited by the human translator after the machine translation.
12. The method of claim 9 wherein the searching of similar fragment comprises lexical, morphological and syntactic analysis.
13. The method of claim 9 wherein the best variant is selected by using smart translation memory technology and wherein selecting the best translation variant from the translated logical segments and replacing differing parts with a translation from the glossary or the dictionary, or from the phrases generated from the translation memory database, generates a machine translation response for the logical segment.
14. The method of claim 9 wherein the translated segments are stored in the translation memory database.
PCT/IB2013/003079 2013-10-28 2013-10-28 Networked language translation system and method WO2015063536A1 (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/IB2013/003079 WO2015063536A1 (en) 2013-10-28 2013-10-28 Networked language translation system and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/IB2013/003079 WO2015063536A1 (en) 2013-10-28 2013-10-28 Networked language translation system and method

Publications (1)

Publication Number Publication Date
WO2015063536A1 (en) 2015-05-07

Family

ID=53003417

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/IB2013/003079 WO2015063536A1 (en) 2013-10-28 2013-10-28 Networked language translation system and method

Country Status (1)

Country Link
WO (1) WO2015063536A1 (en)

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20070294076A1 (en) * 2005-12-12 2007-12-20 John Shore Language translation using a hybrid network of human and machine translators
US20080059146A1 (en) * 2006-09-04 2008-03-06 Fuji Xerox Co., Ltd. Translation apparatus, translation method and translation program
US20100057437A1 (en) * 2008-08-28 2010-03-04 Electronics And Telecommunications Research Institute Machine-translation apparatus using multi-stage verbal-phrase patterns, methods for applying and extracting multi-stage verbal-phrase patterns
CN102662933A (en) * 2012-03-28 2012-09-12 成都优译信息技术有限公司 Distributive intelligent translation method
CN102707097A (en) * 2011-03-28 2012-10-03 河南省电力公司焦作供电公司 Test wire clamp with magnetic absorption connecting device
US20130173247A1 (en) * 2011-12-28 2013-07-04 Bloomberg Finance L.P. System and Method for Interactive Auromatic Translation

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 13896248

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase in:
32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 20.09.2016)

122 Ep: pct application non-entry in european phase

Ref document number: 13896248

Country of ref document: EP

Kind code of ref document: A1