Text processing quotation method and system
Field of the invention
The present invention relates broadly to an online text processing quotation system, to a method of providing a text processing quotation, and to a computer program containing program code instructing a computer to execute a text processing quotation method either in an online or offline form. The invention is specifically although not exclusively suited to generation of a quotation for translating documents.
Background of the invention
With the emergence and continued growth of globally accessible communications networks such as the Internet, both the opportunity and the need for multilingual text processing portals have been recognised. Such multilingual portals can provide a tool to, for example, enable expansion of the translation business into new translation markets, including business involving consumer cross-cultural communication. The present invention seeks to provide a quotation method and system which third party users can access to obtain quick and reliable quotes for translation and editing of text.
Any discussion of documents, publications, acts, devices, substances, articles, materials or the like which is included in the present specification is included solely to provide a contextual basis for the present invention. Any such discussion is not to be understood as an admission that the subject matter forms part of the prior art base, or part of the common general knowledge in the technical field of the present invention, as it existed at the priority date or dates of the present invention.
Summary of the invention
According to a first aspect of the invention there is provided a method of at least substantially automatically quoting for the translation of a block of text, said method comprising the steps of:
i. receiving the block of text in electronic format in a processor; automatically reviewing the block of text to determine:
a) the language or languages of the received text;
b) the format, or formats, of the received text; and
ii. automatically selecting one or more processing algorithms to be applied to said block of text, said selections being dependent on the language or languages of the received text, and the format, or formats of the received text;
iii. processing the text in accordance with the selected algorithm or algorithms to determine at least the number of words/characters in the received text to be translated; and
iv. generating a quotation for translating the text which is based at least in part on the determined number of words/characters in said text.
The method may include the steps of processing the text to determine the subject matter of the text, and adjusting the quotation for translating the text on the basis of the determined subject matter.
According to a second aspect of the invention there is provided a method of at least substantially automatically quoting for the editing of a block of text, said method including the steps of:
i. receiving the block of text in electronic format in a processor;
ii. automatically reviewing the block of text to determine the format, or formats, of the received text;
iii. automatically selecting one or more processing algorithms to be applied to said block of text, said selection being dependent on the format or formats of the received text;
iv. processing the text in accordance with the selected algorithm or algorithms to determine at least the numbers of words and/or characters in the received text to be edited; and
v. generating a quotation for editing the text which is based at least in part on the determined number of words and/or characters in said text.
According to a third aspect of the invention there is provided an online system for automatically generating a quotation for the translation and/or editing of a block of text, the system comprising:
a processor adapted to be linked to a network for receiving a said block of text;
processing means adapted to review the text to determine the format or formats of the received text;
processing means adapted to establish the number of words and/or characters in the text, and determine the editing or translation task to be performed on the text, and
quotation generation means adapted to generate a quotation for performing the task, said quotation being dependent on the number of words and/or characters in the text, and the task or tasks to be performed on the text, said quotation generation means being arranged to deliver said quotation via said network.
Brief description of the drawings
Preferred embodiments of the present invention will now be described, by way of example only, with reference to the accompanying drawings in which:
Figure 1 shows a flowchart illustrating a translation quotation method and system embodying the present invention;
Figure 2 shows an example pricing and job page-size matrix utilised in the method and system illustrated in Figure 1;
Figure 3 shows a flowchart illustrating a job assignment process flow in an online translation service embodying the present invention;
Figure 4 is a schematic drawing illustrating a web-site architecture of a translation/editing service portal embodying the present invention;
Figure 5 is a schematic representation of extraction of text from various formats in accordance with the present invention; and
Figure 6 is a schematic representation of extraction of text from a Microsoft Word document.
Detailed description of the embodiments
Figure 1 shows a flowchart 100 illustrating a translation quotation method and system embodying the present invention. Text to be processed in accordance with the present invention is text which, at least in part, is to be translated by a translator. At step 102, a new block of text to be processed is submitted. The new job may be submitted via an extranet, an intranet, a web service or a local computer, for example. The new job may be submitted via a direct field input to be processed in plain text format and/or via an attachment field for providing the text as an attached file. At step 104 the system determines whether or not the text to be translated has been provided as plain text. If that is the case, whitespace in the plain text provided is removed at step 106, and a word and character count is conducted at step 108.
The processing that is done on the text in processing steps 108 and 112 depends on the nature of the received text, the nature of the task or tasks to be performed on the text, and the nature of the output required. The system is structured to conduct this analysis of the received text automatically, and one or more of a range of algorithms will be applied to the received text, depending on which tasks need to be completed to accomplish the user's requirements. The term "algorithm" as used herein is intended to be interpreted broadly to encompass any arrangement for applying rules, or analysis tools, to the text to ascertain one or more characteristics of the received text.
Thus, for example, the processor might first determine the language of the received text. This might be achieved by passing the text through a language identification filter. Once the language has been identified, an algorithm, or set of rules, that follows from the fact that the received text is in that particular language will be applied to the text. Similarly, if the received text is in a range of different formats, an algorithm or set of rules will be applied to the text in each of the different formats, to assist in arriving at an accurate word count for each of the formats. If the received text is in a particular subject category, such as law, science, or the like, another algorithm, or set of rules, will be applied to the text, again to determine with a reasonable degree of accuracy the complexity or otherwise of the task to be performed.
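The selection of algorithms according to language and format can be illustrated with a minimal Python sketch. It is not the implementation of the embodiment: the script-based `detect_script` function is only a crude stand-in for a language identification filter, and the names `EXTRACTORS`, `detect_script` and `select_algorithms`, as well as the format labels, are hypothetical.

```python
import unicodedata

# Hypothetical mapping from a format label to the name of an extraction rule set.
EXTRACTORS = {"plain": "strip_whitespace", "html": "strip_markup"}


def detect_script(text):
    """Crude stand-in for a language identification filter (assumption)."""
    counts = {"kana": 0, "han": 0, "latin": 0}
    for ch in text:
        name = unicodedata.name(ch, "")
        if "HIRAGANA" in name or "KATAKANA" in name:
            counts["kana"] += 1
        elif name.startswith("CJK UNIFIED IDEOGRAPH"):
            counts["han"] += 1
        elif "LATIN" in name:
            counts["latin"] += 1
    if counts["kana"] > 0:
        return "japanese"
    if counts["han"] > counts["latin"]:
        return "chinese"
    return "latin-script"


def select_algorithms(text, file_format):
    """Choose a counting unit and an extraction rule set for the text."""
    script = detect_script(text)
    # Character-based languages are counted in characters, others in words.
    unit = "characters" if script in ("japanese", "chinese") else "words"
    # An unknown format yields None, i.e. the job would go to manual quotation.
    return {
        "script": script,
        "count_unit": unit,
        "extractor": EXTRACTORS.get(file_format),
    }
```

A real language identification filter would distinguish far more languages; the sketch only shows how the outcome of the review step can drive the choice of the rules applied afterwards.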
Where there is more than one language present in the text, text that is already translated can be removed from the wordcount. The system may also include a database/glossary of pre-translated terms or phrases.
Terms which are already in the glossary need not be included in the wordcount as such terms need not be translated. The database/glossary may include sub-databases which include particular words or terms which are universal within a technical field, for example some terms in the biological sciences, numerals and trade marks, such that, depending upon the application, the correct sub-database may be invoked for exclusion from the wordcount. Other special characters or symbols which need not be translated may also be included in the database or in particular sub-databases for particular wordcount procedures. The system may include particular sub-databases of special symbols, for example Greek letters, which in some languages and documents have particular meanings; when translating a Greek document, however, such a sub-database would typically not be invoked.
If the text has been provided as an attachment, a determination is made by the system at step 110 as to whether the format of the attached file is supported for automated word and character count. If that is the case, at step 112 the system will open the attached file either via code that controls the application that created the file, or via code that implements the published file specification. Once the file is opened, the system will access all textual information in the file and remove white space and irrelevant formatting. The word and character count is then conducted at step 108. In the alternative, if the format of the received text is not capable of automatic word and/or character counting, the attachment is forwarded to a manual quotation procedure at step 114.
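The exclusion of glossary terms from the wordcount, described above, can be sketched in Python as follows. This is a simplified illustration only: it assumes single-word terms, a whitespace-based split and case-insensitive matching, none of which is specified in the embodiment, and the function name `billable_word_count` is hypothetical.

```python
def billable_word_count(text, sub_databases):
    """Count the words in text that are not found in any invoked sub-database.

    sub_databases is an iterable of term collections, one per invoked
    sub-database (for example trade marks or universal technical terms).
    """
    excluded = set()
    for terms in sub_databases:
        excluded.update(term.casefold() for term in terms)

    count = 0
    for token in text.split():
        # Strip common punctuation so that "Acme," still matches "Acme".
        word = token.strip(".,;:!?()\"'").casefold()
        if word not in excluded:
            count += 1
    return count
```

Invoking no sub-database, as would typically be done with the Greek letter sub-database when translating a Greek document, simply means passing an empty list, in which case every word is counted.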
The character and word count result of step 108 is then converted into a page count at step 116. The page count is calculated based on an algorithm that may be affected by the frequency of certain characters in the text (such as a high proportion of katakana in Japanese) which imply that a different word and/or character to page ratio should be applied. At step 118, the system determines whether the language pair involved in the translation job for which the quotation is to be given exists in a database containing costs information. If that is the case, the quote is calculated at step 120 by multiplying the page count by a page cost stored in the database, before being saved and displayed to the customer at step 122.
If it is determined at step 118 that the language pair is not included in the database, the quotation job is referred for a manual quotation at step 114, together with the results of the page count conducted at step 116 for utilization in the subsequent manual quote procedure. Step 124 of the manual quotation procedure may involve either a manual or an external software facilitated word and character count to obtain a page count for the quotation, the result being multiplied by the relevant language pair page cost. Alternatively, the preparation of the quote is based on the automated page count but involves a custom page cost chosen by a coordinator for the requested language pair. It will be appreciated that once the coordinator has chosen a page cost for a particular language pair, that data can then be added into the database utilized at step 118. This avoids the necessity for the manual quotation procedure for any subsequent request for that particular language pair, where the page count data can be obtained automatically.
As mentioned above, if a plain-text version of the job is supplied, that text may be used for the quote process. If a file or files are uploaded, the file format is examined. Automatic quoting is possible for a specific number of file formats, including Microsoft Word, Microsoft Excel and Microsoft PowerPoint in the example embodiment. If the file type is supported, an automatic quotation, as discussed above, will be performed; otherwise a manual quote by a coordinator may be required. The customer may be asked to confirm that they wish a manual quote to be performed before the job is referred to a coordinator.
If a word and/or character count has been performed, the resulting value is converted into a 'page count'. A page is defined as a certain number of characters (for character-based languages such as Japanese and Chinese) or words (for languages such as English and other European languages). For example, one 'page' could be specified as being equivalent to 200 words or 400 characters. The conversion of the word-count into a page-count is determined by the characteristics of the source text as shown in Figure 2, matrix 210.
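The page count conversion and the quote calculation of steps 116 to 120 can be sketched in Python as follows. The page sizes of 200 words and 400 characters are the example values given above; rounding up to whole pages, the language codes, the page cost values and the function names are assumptions made for illustration, and the adjustments of matrix 210 are not modelled.

```python
import math

WORDS_PER_PAGE = 200        # example value from the description
CHARS_PER_PAGE = 400        # example value from the description
CHARACTER_BASED = {"ja", "zh"}  # hypothetical codes for Japanese and Chinese


def page_count(word_count, char_count, source_language):
    """Convert a word or character count into whole pages (rounding up is assumed)."""
    if source_language in CHARACTER_BASED:
        return math.ceil(char_count / CHARS_PER_PAGE)
    return math.ceil(word_count / WORDS_PER_PAGE)


def quote(pages, language_pair, page_costs):
    """Return pages multiplied by the stored page cost, or None for a manual quote."""
    cost = page_costs.get(language_pair)
    if cost is None:
        return None  # language pair not in the database: manual quotation, step 114
    return pages * cost
```

For instance, with a hypothetical page cost of 40 for the pair ("en", "ja"), a 450-word English text gives 3 pages and a quote of 120, while a pair missing from the table returns None and would be referred to the coordinator.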
This matrix 'normalises' the relationship between the length of the input text and the probable length of the translated text based on the frequency of different character types and predetermined formulas that apply to those frequencies, or according to the subject matter identified by the user (depending on the languages involved).
Different prices would apply to different 'quality levels'. The example pricing matrix 200 shown in Figure 2 has two different quality levels 202, 204, although there could be additional levels depending on service demand in different embodiments. In the example embodiment, the different levels 202, 204 are distinguished by the number of coordinator proofreading rounds and potential resubmissions to a particular translator before the translation is provided to a user of the on-line translation service.
It will be appreciated by a person skilled in the art that counting words (or characters) in a document can be difficult to do reliably. Features such as hyphenation, double spaces after full stops, multiple blank lines, etc. make it difficult for a simple algorithm to get accurate results across all documents. For word-counts, the simplest approach is to treat all whitespace in a document as a 'single space', and then count the number of elements surrounded by 'space'. For character-counts, counting the number of non-whitespace characters (and potentially also excluding certain punctuation, although this must be defined per-language) is the easiest solution. Here, whitespace includes: space characters, tab characters, linefeed characters, nonbreaking space characters.
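The simple counting approach described above can be sketched in Python as follows. The whitespace set follows the list given above; treating the carriage return as whitespace as well is an assumption added for files with Windows line endings, and the function names are hypothetical.

```python
import re

# Space, tab, linefeed and nonbreaking space, plus carriage return (assumption).
WHITESPACE = " \t\n\r\u00a0"
_WHITESPACE_RUN = re.compile(r"[ \t\n\r\u00a0]+")


def count_words(text):
    """Collapse every whitespace run to a single space, then count the elements."""
    collapsed = _WHITESPACE_RUN.sub(" ", text).strip(" ")
    return len(collapsed.split(" ")) if collapsed else 0


def count_characters(text, ignored_punctuation=""):
    """Count non-whitespace characters, optionally skipping per-language punctuation."""
    skipped = set(WHITESPACE) | set(ignored_punctuation)
    return sum(1 for ch in text if ch not in skipped)
```

Double spaces after full stops and multiple blank lines do not inflate the word count, because each run of whitespace is collapsed before counting. Hyphenation is not handled by this sketch.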
The algorithm for calculating whitespace (and therefore counting words and characters) in different embodiments is subject to change as research into the optimum counting method results in more sophisticated tools and calculations. For certain file types, such as Microsoft Office application files, the built-in word-count from that application may be used in preference to a custom count of words and characters at step 108 in flowchart 100 as shown in Figure 1.
The final character count in step 108 may be adjusted by an algorithm where words found in the text are matched with an existing database/glossary of terms with pre-translated equivalents. In the case of a translation job, this reduces the wordcount and hence the workload for the translator, resulting in a less time consuming task, which may result in a reduced quoted price.
Step 108 also involves processing the text to gather statistics about the percentage of the document written in different languages, particularly pre-translated text in the target language for translation jobs, and determining the 'complexity' of the language by examining word lengths and the use of certain types of characters, for example kana in Japanese. These statistics may be used in step 118 with Figure 2 matrix 210 to determine the 'page count' for quoting.
As part of the quotation process in the example embodiment, a deadline is associated with each job and quotation at step 120. When the deadline is calculated, the type of process (for example translation or edit), the size of the task and the languages used are taken into account, as well as other information available in the system, such as the current workload for translators and editors. Thus, at step 122 the quote is displayed to the customer together with a proposed deadline. A plurality of quote/deadline pairs may be displayed.
In the example embodiment, the deadline is based on the word and/or character count, the language pair and quality level and may further take into account capacity data.
At step 126, the customer has the option to accept the quote and deadline, or to select one of the quote/deadline pairs provided. At the same time, the customer may be given the opportunity to request a different deadline. If a different deadline is requested, this information is processed at step 128. The processing at step 128 in the example embodiment comprises a determination whether or not the requested deadline is acceptable based on information available to the system, and may include a revision of the previous quotation, for example to add an additional fee. The outcome of that processing is again displayed to the customer, illustrated as a return to step 122 in the flowchart 100 shown in Figure 1, for confirmation by the customer or further modification.
When a job has been submitted by a User (customer), a coordinator group is responsible for job processing, and will perform the decision points in the flowchart 300 shown in Figure 3. A job does not start this process flow until it has been submitted and paid for by the customer (or, where the customer has credit approval, until the job has been approved for payment on invoice). That is, flowchart 300 is acted upon after the quotation procedure described above with reference to Figures 1 and 2, and more particularly after the customer has accepted the quotation provided.
In flowchart 300, the received new job (step 302) is reviewed by a coordinator at step 304. Based on that review, at step 306 the coordinator will decide whether the job is suitable for a reverse auction system. If not, the coordinator decides which translator or editor is to be assigned, based on the job type, the subject matter, and other criteria. The coordinator updates the system database to indicate whether a particular job will be auctioned or assigned directly.
If the job is suitable for auctioning, it may be transferred into the reverse auction system, indicated at step 308 in Figure 3. The auction bidding at step 308 utilises known on-line implementations which enable qualified translators or editors to bid for a particular job. In the auction bidding at step 308, a deadline and maximum price have been set as determined by the previous processing of the job. Step 310 indicates a determination as to whether or not a winning bidder has been identified. Sometimes there is no winning bid in an auction, for example if there were no bidders at all, or no bidders were below the maximum price set, within a particular bidding deadline. If no winning bidder is determined, the job is referred back to the coordinator group for further handling at step 311.
When a job is won in the reverse auction procedure at step 312, or it is assigned by the coordinator group at step 311, it immediately becomes the responsibility of the winning translator, at step 314, or editor, at step 316, to download the job file and complete the work in the specified time frame. The editing or translating steps 316, 314, respectively, may involve checking recursions indicated at numerals 318 and 320 respectively.
A job may require editing for a number of reasons in the example embodiment, including a) the quality level paid for by the customer implies that editing will be performed, b) the coordinator decides that the translation output requires editing before sending to the customer, or c) a specified subject specialist is required to review the work of a non-specialist translator or editor. In the flowchart 300 shown in Figure 3, it is illustrated that a translating job at step 314 may effectively be resubmitted for editing (as opposed to translating) based upon a decision by the coordinator group at step 322. As a final step, the finished work is sent to the client at step 324.
Figure 4 shows a schematic diagram illustrating the web-site architecture 400 of an online translation/editing service embodying the present invention. The architecture includes a first 'extranet' site 402 for users of the service, providing job quotation and job submission facilities. An 'intranet' site 404 is provided for the internal job handling and processing by personnel of the service portal provider. A second 'extranet' site 406 is provided for translators and editors to obtain jobs for processing by e.g. a reverse auction process, and more generally to interact with the online translation/editing service portal provider.
A third 'web service' 408 is provided to facilitate external computer-program access to the job quotation and submission functions, allowing approved external parties to incorporate the service into their own computer programs.
Figure 5 is a schematic representation 500 of extraction of text from various formats in a system in accordance with the present invention, whereby text to be processed 502 is input into the system at step 504, at which the text is retrieved in a file from a server. In this example, the system 500 first determines whether the document is a Microsoft Office document at step 506. If the file is a Microsoft Office document, the system then determines whether the document is a Microsoft Word document at step 508 and, if it is not, the next test is whether the document is a Microsoft Excel document at step 510. If it is not, in this example the system assumes that the document is a Microsoft PowerPoint document. If the document is not a Microsoft Office document at step 506, the system determines whether the document is an Acrobat PDF document at step 512 and, if it is not, whether the document is an HTML document at step 514.
In the case that the document is a Microsoft Word, Excel or PowerPoint document, as shown by steps 516, 518 and 520, including external text such as OLE-embedded Excel sheets/graphs, PowerPoint slides, charts, etc., the system will utilize the Application Programming Interface (API) provided by Microsoft. This means writing algorithms to detect many different "objects" within the text, and invoking custom tools to "extract" the text from each different object type. It should be noted that extracting text from PowerPoint is different from extracting text from Excel, from a drawing or from an Organization Chart. The text extracted at steps 516, 518, 520, 522, 524 and 526, together with meta data, is passed on at step 534 so as to provide document text and meta data for the word counting performed at step 536.
When counting words and characters in Microsoft files in this example, the following user-specified rules apply:
For all Office applications, the user may choose to ignore the embedded object text elements, for example because their client has instructed them to do so. Alternatively, because it is possible to embed entire documents but only 'see' a small part of that embedded document, the user may choose to ignore all but the visible portion of that document when performing the word count.
For PowerPoint, the user may choose to ignore the 'Presentation Notes' on each slide, because they are not visible when the presentation is viewed.
Other rules may be developed according to typical translation-industry scenarios, or may be user specific on a case by case basis.
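The sequence of format tests in Figure 5 (steps 506 to 514) can be sketched in Python as follows. Detecting the format from the file extension is an assumption made for illustration, as the description does not state how each test is performed; the extension table, the function name `detect_format` and the final fallback to plain text are likewise hypothetical.

```python
from pathlib import Path

# Hypothetical extensions standing in for the Microsoft Office test of step 506.
OFFICE_EXTENSIONS = {".doc", ".xls", ".ppt"}


def detect_format(filename):
    """Walk through the tests of Figure 5 in order and return a format label."""
    ext = Path(filename).suffix.lower()
    if ext in OFFICE_EXTENSIONS:       # step 506: Microsoft Office document?
        if ext == ".doc":              # step 508: Word document?
            return "word"
        if ext == ".xls":              # step 510: Excel document?
            return "excel"
        return "powerpoint"            # assumed by elimination, as in the example
    if ext == ".pdf":                  # step 512: Acrobat PDF document?
        return "pdf"
    if ext in (".htm", ".html"):       # step 514: HTML document?
        return "html"
    return "plain_text"                # fallback chosen for this sketch only
```

The returned label would then select the matching extraction step, for example step 516 for a Word document or step 522 for a PDF document.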
When processing Adobe Acrobat PDF files, published specification rules are used for the PDF format to skip through the document ignoring the formatting codes but extracting the viewable text at step 522.
When processing documents which are HTML files, HTML formatting codes are ignored, which requires a set of rules for parsing the HTML at step 524 and extracting only the visible text. When processing plain text files, standard text and character counting techniques may be used.
Once the words and characters have been counted from the processed file, the file is retrieved from the server at step 538 and either displayed to a user at step 528 or uploaded to a server at step 530. The process may then be repeated or terminated at step 532.
Figure 6 illustrates diagrammatically the extraction of text from a Microsoft Word document, expanding in detail upon step 516 of Figure 5. A Word document is opened using Microsoft Word software at step 602. In this example, a user has selected to extract text from paragraphs, from floating objects, from embedded objects, from headers and footers, from footnotes and from endnotes. Text is extracted from the document sequentially throughout the process. First, text is extracted at step 604 paragraph by paragraph until the last paragraph is reached. Upon completion of step 604, step 606 is performed, whereby text is extracted from a series of floating objects until the last object is reached. Upon completion of step 606, step 608 is performed, whereby text from embedded objects is extracted until the final object of the document is reached. Similarly, text from headers and footers, footnotes and endnotes is extracted sequentially at steps 610, 612 and 614 respectively, until the process is completed and stopped at step 616. Similar algorithms may be used to extract text from other documents such as Excel and PowerPoint documents.
Other features of an example implementation of an online translation/editing service embodying the present invention will now be described.
Customers can:
- submit new jobs (translation or editing)
- view the status of a job
- add information, comments, questions to a job in process, or indicate that a job is similar to past work
- enter a complaint or enquiry about a completed job
- review their job history, and view the source and final text of a job at any time (subject to future archiving requirements)
- build a glossary of terms that may require special translation or are to be ignored. This will be downloadable in an offline format
- receive completed jobs immediately via email, or choose to retrieve them from the website at a later time (following an advisory email)
Additional services such as news and help are available.
A company home page is available to certain customers (created by Administrators only, and not able to be set up by customers themselves). Company users can create multiple internal customers who are all authorized to submit jobs (up to a certain monetary value) which will be paid for on invoice to the company. Alternatively, the company may require each job submission to be centrally authorised before being submitted for processing. This will assist the company in tracking translation service usage and perhaps result in future discounts or additional services.
The User Home ('Control Center') notifies the user of any important information (messages about translations, translations newly completed) and provides links to all relevant functions (adding a job, maintaining their glossary, etc).
The 'extranet' for the translation business automates the translation submission and processing workflow. Both on-site and remotely located translators, editors and account managers will be able to log in to the extranet to access jobs and other work-related information.
Job types are characterized by:
- Customer (may be distributed by hopper or direct/ongoing assignment, depending on the customer, translator and job type)
- translation (large or small)
- edit (large or small)
- Multilingual electronic Customer Relationships Management application (MeCRM) business (outgoing) translation
- MeCRM customer input (incoming) translation
- FAQ translation
- field, label or instruction translation
- user input translation
- reply translation
Jobs have an output quality level, dependent upon the price charged and the customer involved. Low quality jobs may not be edited; high quality jobs will always be edited.
How each individual job is routed is determined by the coordinator group. All job routing is accompanied by emails and/or on-screen warnings (depending on the preferences of the translator/editor and whether the system has detected that they are currently logged in). Extranet users can set as a preference that they want email reminders (note: urgent emails are exceptions to this rule). They can also specify an alternate email for reminders, such as a mobile phone-based email account. Extranet users will also be able to register certain Instant Messaging accounts or install additional software that will 'pop-up' warnings about incoming work, questions or reverse-auction notifications.
The manner in which the text is stored is dependent upon the type of job or process which is to be performed upon the text. Files containing text to be processed are saved to the server in their original form, and again after each step in the workflow such as translation, editing and checking. Each of the files will be archived for future use such as complaint resolution, quality checking and other future reference for statistical purposes. Customer jobs below a certain word limit may be stored in the database.
The Extranet offers additional services to translators and editors, including:
- forums for discussing work-related issues with each other, and with Administrators
- links to external resources (online dictionaries, thesauruses, etc)
- ability to edit their contact information and password
- ability to build a glossary to manage frequently used terms/language. This will be downloadable in some offline format. Translators and editors will also have access to a customer-entered glossary when working on specific customer jobs. The customer-entered glossary may be used to reduce the wordcount used when calculating a price for translation, if the words in the customer-entered glossary have associated translations and appear in the text submitted for processing.
- reports of work completed by them, for calculating payment/commissions and reviewing work completed.
The Extranet user may have a question or questions about a particular job, which may be answered by the coordinator, an Administrator, an Account Manager or even referred to the User themselves. These questions can be submitted on an 'edit job' page. The coordinators will be notified when a question is saved for a job, and the coordinator will be responsible for sending the question to the correct person and entering the response into the system for the Extranet user to view.
On the Reverse Auction page jobs appear with a set of information (deadline, title, subject, word or character count). The Reverse Auction page lists jobs that are not assigned to any particular Extranet user and are available for 'auction'. Jobs will be marked as 'suitable for auction' by coordinators (or they may automatically go into the auction list, according to a set of rules which might include a maximum wordcount, common language pairs or other criteria in different embodiments). Each Extranet user sees a customized view of the Reverse Auction, based on their language skills and other criteria, such as source/target language, subject expertise and quality/reliability scores as entered by Administrators. Jobs can be 'viewed' so as to learn about the size and complexity of the job, and 'bid for' at a per-word rate or flat fee that the Extranet user is willing to work for on that job. When the reverse auction 'closes' (at the predetermined time displayed on the reverse auction list) the job is moved into the lowest bidder's Job Inbox, so they can begin work on it immediately.
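The closing of a reverse auction can be sketched in Python as follows. The sketch assumes flat-fee bids, treats a bid equal to the maximum price as acceptable and breaks ties by bidder name; none of these details is specified in the embodiment, and the function name `pick_winner` is hypothetical.

```python
def pick_winner(bids, max_price):
    """Return the lowest bidder at or below max_price, or None if there is none.

    bids maps a bidder name to the flat fee that bidder is willing to work for.
    """
    eligible = [(price, bidder) for bidder, price in bids.items()
                if price <= max_price]
    if not eligible:
        # No bids, or none within the maximum price: the job goes back
        # to the coordinator group (step 311 of Figure 3).
        return None
    # min() compares the price first and the bidder name second (tie-break).
    return min(eligible)[1]
```

For example, with bids of 120, 95 and 150 against a maximum price of 130, the bidder offering 95 wins and the job would be moved into that bidder's Job Inbox.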
Jobs from the reverse auction will have a strict deadline imposed; if they are not completed promptly, they will be flagged for attention by a coordinator. Extranet users who are not meeting their obligations may have a job removed from them. There is a limit on the number of jobs an Extranet user can submit a bid for, based on the size and deadline of any jobs they have already bid for (and whether they are currently the lowest bidder). If an Extranet user wishes to bid for a translation or edit job, they simply enter the price at which they are prepared to complete that job (by the specified deadline).
Only select authorized users will have access to the administration functions. There are two types of Admin user:
- coordinator, who can access the user lists, job hopper and current jobs
- Administrator, who can access all functions.
The coordinator's role is to assign jobs within the system; the coordinator should not be able to access the reporting and client/user editing functions. The key day-to-day task of the coordinator group is to monitor the incoming job hopper and assign jobs to individual translators and/or editors. This task will be 'computer-aided' as follows:
- jobs from specific customers will be 'pre-filled' with a recommended translator, such as a translator that has done previous work for that customer (set in the Administrator's customer list)
- jobs for a specific content type (e.g. Aeronautical) will be 'pre-filled' with a recommended translator, who has expertise in that subject matter
- as jobs approach their deadline, warning messages appear on the Job List page, to remind the coordinator to follow up on the job's progress.
The above rules also hold for editing work.
Administrators are responsible for maintaining the site itself, including:
- creating new
  - Users
  - Companies who can manage their own Users
  - Extranet users (translators & editors)
- News items and Help text
- Pricing
Multilingual electronic customer relationship management (MeCRM) is a particular implementation example for which systems embodying the present invention may be used.
It may be viewed as a specific translation/editing job that facilitates business to consumer communication across different languages to handle customer enquiries, orders or other electronic communication. One such example implementation will now be described.
Before the MeCRM process can begin, the Client company must register as a User. They can then perform the MeCRM setup, which typically will only happen once, unless they wish to alter their forms or page design at a later date. Once the Client setup is complete, the system will translate the forms into the target languages. Finally, to enable the MeCRM service, the customer creates a link on their website to the MeCRM pages on the system's website.
Once the MeCRM website link is added to the Client website, any Internet user visiting the Client website can click through to the MeCRM form and enter information in any of the supported languages. The Client and the User will receive translated versions of each communication as soon as the translation has been performed via the translation Extranet.
It will be appreciated by the person skilled in the art that numerous modifications and/or variations may be made to the present invention as shown in the specific embodiments without departing from the spirit or scope of the invention as broadly described. The present embodiments are, therefore, to be considered in all respects to be illustrative and not restrictive.
In the claims that follow and in the summary of the invention, except where the context requires otherwise due to express language or necessary implication, the word "comprising" is used in the sense of "including", i.e. the features specified may be associated with further features in various embodiments of the invention.