CN116933803A - Natural language translation model training and configuration - Google Patents

Natural language translation model training and configuration

Info

Publication number
CN116933803A
Authority
CN
China
Prior art keywords
natural language
computer
sentences
text
training
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210336355.2A
Other languages
Chinese (zh)
Inventor
黄广扬
黄进安
钟展超
汤思敏
唐志鸿
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Logistics and Supply Chain Multitech R&D Centre Ltd
Original Assignee
Logistics and Supply Chain Multitech R&D Centre Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Logistics and Supply Chain Multitech R&D Centre Ltd filed Critical Logistics and Supply Chain Multitech R&D Centre Ltd
Priority to CN202210336355.2A
Publication of CN116933803A
Legal status: Pending


Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/40 - Processing or translation of natural language
    • G06F40/42 - Data-driven translation
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/205 - Parsing
    • G06F40/211 - Syntactic parsing, e.g. based on context-free grammar [CFG] or unification grammars
    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 - Handling natural language data
    • G06F40/20 - Natural language analysis
    • G06F40/253 - Grammatical analysis; Style critique

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Machine Translation (AREA)

Abstract

A computer-implemented method for training a natural language translation model. The computer-implemented method includes: processing one or more sets of electronic parallel documents to obtain a plurality of aligned parallel sentences; creating a first training set comprising a subset of the plurality of aligned parallel sentences; and training a natural language translation model in a first stage using a first training set. The computer-implemented method further comprises: modifying the first training set based on translation errors detected after the training of the first stage; creating a second training set based on the modified first training set and at least some aligned parallel sentences not in the first training set; and training the natural language translation model in a second stage using a second training set to improve translation performance of the natural language translation model.

Description

Natural language translation model training and configuration
Technical Field
The invention relates to artificial-intelligence-based training of a natural language translation model, and to the natural language translation model itself.
Background
Natural language processing refers to the processing of language (e.g., translation) by a computer, and it generally involves artificial intelligence based techniques.
Disclosure of Invention
In a first aspect, a computer-implemented method for training a natural language translation model is provided. The computer-implemented method includes: processing one or more sets of electronic parallel documents to obtain a plurality of aligned parallel sentences; creating a first training set comprising a subset of the plurality of aligned parallel sentences; training a natural language translation model in a first stage using a first training set; modifying the first training set based on translation errors detected after the training of the first stage; creating a second training set based on the modified first training set and at least some aligned parallel sentences not in the first training set; and training the natural language translation model in a second stage using a second training set to improve translation performance of the natural language translation model.
An "electronic parallel document" is a parallel document in electronic form, and a "parallel document" is a document of translations of each other, preferably formally and generally accepted. These documents may relate to translations in two languages (one being translations in another language) or translations in more than two languages (each being translations in another language). In one example, a set of parallel documents may include one english document and one correspondingly translated version of spanish document.
Optionally, processing the one or more sets of electronic parallel documents includes: processing one or more sets of electronic parallel documents to obtain a plurality of parallel sentences; and performing an alignment operation to align the plurality of parallel sentences to form a plurality of aligned parallel sentences.
The alignment operation is arranged to align the plurality of parallel sentences such that the sentences in the first natural language are aligned with (i.e., correspond to or match) their respective translations in the second natural language. The alignment operation does not necessarily involve spatial alignment of the text.
Optionally, processing the one or more sets of electronic parallel documents includes: if the one or more sets of electronic parallel documents do not exist in the form of text files, converting the one or more sets of electronic parallel documents into text files, preferably plain text files; and processing the text file or the plain text file to obtain a plurality of parallel sentences.
Optionally, processing the one or more sets of electronic parallel documents includes: a cleaning operation is performed to clean up a plurality of parallel sentences or a plurality of aligned parallel sentences.
Optionally, processing the text file includes: a cleaning operation is performed to clean up a plurality of parallel sentences or a plurality of aligned parallel sentences.
Optionally, the cleaning operation includes deleting some text and/or punctuation from the plurality of parallel sentences or the plurality of aligned parallel sentences.
Optionally, the translation error includes one or more of: grammar mistakes, punctuation mistakes, term use inconsistencies, sentence structure mistakes, word order mistakes, untranslated words, word translation mistakes, improper repetition of words, phrases, sentences or articles, misuse of words, and transliteration that does not conform to the context.
Optionally, the modification of the first training set based on translation errors comprises: some text and/or punctuation in a subset of the plurality of aligned parallel sentences is deleted and/or edited.
Optionally, the computer-implemented method further comprises: modifying the second training set based on translation errors detected after the training of the second stage; creating a third training set based on the modified second training set and at least some of the plurality of aligned parallel sentences not in the first training set and the second training set; and training the natural language translation model in a third stage using a third training set to further improve translation performance of the natural language translation model. In some embodiments, these steps of modifying, creating, and training may be repeated in a third training set and one or more further training sets to further improve the translation performance of the natural language translation model.
Optionally, the natural language translation model comprises a neural machine translation model or a network.
Optionally, the natural language translation model is arranged to translate the first natural language into the second natural language. Optionally, each of the one or more sets of electronic parallel documents is in the first natural language and the second natural language. Optionally, each of the aligned parallel sentences is in the first natural language and the second natural language.
Optionally, the natural language translation model is arranged to translate between the first natural language and the second natural language.
Optionally, the first natural language is English and the second natural language is Spanish.
Optionally, each of the plurality of aligned parallel sentences of the first training set contains 8 to 50 words and each of the plurality of aligned parallel sentences of the second training set contains 8 to 50 words. Optionally, each of the plurality of aligned parallel sentences of the third training set contains 8 to 50 words.
Optionally, the number of aligned parallel sentences used in the training in the second stage is greater than the number of aligned parallel sentences used in the training in the first stage. Optionally, the number of aligned parallel sentences used in the training in the third stage is greater than the number of aligned parallel sentences used in the training in the second stage.
Optionally, the one or more sets of electronic parallel documents comprise electronic documents of the same domain or context, and the natural language translation model is a domain-or context-specific natural language translation model.
Optionally, the domain or context is a legal domain or legal context. The legal domain or legal context may relate to arbitration, mediation, trade, dispute resolution, and the like.
Optionally, the computer-implemented method further comprises: one or more sets of electronic parallel documents are collected, e.g., retrieved, from one or more databases.
In a second aspect, there is provided a natural language translation model trained using or derived from the computer-implemented method of the first aspect. Preferably, the translation model is arranged for translating from a first natural language (e.g., English) to a second natural language (e.g., Spanish), and in particular is arranged for translating text in the legal domain or legal context.
In a third aspect, there is provided a computer program or computer program product comprising the natural language translation model of the second aspect.
In a fourth aspect, a non-transitory computer-readable medium comprising the natural language translation model of the second aspect is provided.
In a fifth aspect, there is provided a computer-implemented natural language translation method comprising: receiving text in a first language (or source language); processing the received text using the natural language translation model of the second aspect; and outputting the text in a second language (or target language) based on the processing, the text in the second language being a translation of the text in the first language.
The computer-implemented method may also include receiving an electronic file containing text in a first language, and extracting text in the first language from the electronic file. The computer-implemented method may also include displaying (e.g., side-by-side displaying) the text in the first language and the text in the second language. The computer-implemented method may further include providing the electronic file containing text in the second language for downloading, transmitting, and/or viewing.
In a sixth aspect, there is provided an electronic system comprising one or more processors configured to: receiving text in a first language (or source language); processing the received text using the natural language translation model of the second aspect; and outputting the text in a second language (or target language) based on the processing, the text in the second language being a translation of the text in the first language.
The electronic system may also include an input device for receiving an electronic file containing text in the first language, and a device (e.g., one or more processors) for extracting text in the first language from the electronic file. The electronic system may also include a display operatively connected to the one or more processors for displaying text in the first language and text in the second language. The electronic system may also include an output device for providing an electronic file containing text in the second language for downloading, transmission, and/or viewing.
Optionally, the electronic system is an online conversational system, such as an online dispute resolution system. In some implementations, the online dispute resolution system is an arbitration or mediation system that facilitates online arbitration or mediation. The system may be implemented on a single computer, or on a cloud computing system or network. The system may be implemented as a web service.
Other features and aspects of the invention will become apparent by consideration of the detailed description and accompanying drawings. Any feature described herein with respect to one aspect or embodiment may be combined with any other feature described herein with respect to any other aspect or embodiment, where appropriate and applicable.
Drawings
Embodiments of the invention will now be described, by way of example, with reference to the accompanying drawings, in which:
FIG. 1 is a flow chart illustrating a computer-implemented method for training a natural language translation model in one embodiment of the invention;
FIG. 2A is a flow chart illustrating a computer-implemented method for processing an electronic document to obtain aligned parallel sentences in one embodiment of the present invention;
FIG. 2B is a flow chart illustrating a computer-implemented method for processing an electronic document to obtain aligned parallel sentences in one embodiment of the present invention;
FIG. 3 is a flow chart illustrating a computer-implemented method for processing an electronic document to obtain aligned parallel sentences in one embodiment of the present invention;
FIG. 4 is a schematic diagram of natural language translation in one embodiment of the invention; and
FIG. 5 is a block diagram of an information handling system configured to perform at least a portion or all of a computer implemented method in one embodiment of the invention.
Detailed Description
FIG. 1 illustrates a method 100 for training a natural language translation model in one embodiment of the invention. In this embodiment, the natural language translation model comprises a neural machine translation model or network and is configured to translate from a source language to a target language.
The method 100 begins at step 102, where an electronic parallel document is processed to obtain aligned parallel sentences. An electronic parallel document is a parallel document in electronic form (i.e., it is a document that is translated to each other, preferably a formally and well-recognized translated document). Electronic parallel documents may be obtained from reliable sources or databases for quality control. The electronic parallel documents are preferably documents in the same aspect, domain or context (e.g., law), such that the natural language translation model may be specifically adapted to translate a particular aspect, domain (law) or context or closely related aspect, domain or context. Processing of the electronic parallel document to obtain aligned parallel sentences may involve manual (requiring user intervention) or automatic (requiring no user intervention) processing. For example, processing of an electronic document may involve manipulating (e.g., editing) text in an electronic parallel document. The aligned parallel sentences include sentences of the first natural language that are aligned with (i.e., correspond to or match) the corresponding translations of sentences of the second natural language. In some examples, parallel sentences aligned over 10000 pairs, over 20000 pairs, or over 50000 pairs are obtained from step 102. Each of the aligned parallel sentences may consist of 8 to 50 words in one example, and other word numbers in other examples.
After the aligned parallel sentences are obtained in step 102, in step 104, some of the aligned parallel sentences (a subset of the aligned parallel sentences) obtained in step 102 are used to form a training set to be used for training a natural language translation model. Then, in step 106, the selected subset or training set of aligned parallel sentences is provided and used to train the natural language translation model.
After training in step 106, the training performance or translation performance of the model is evaluated. The evaluation may involve detecting a translation error. These translation errors may include, for example, grammar errors relative to the target language, punctuation errors, term usage inconsistencies, sentence structure errors, word order errors, untranslated words, word translation errors, improper repetition of words, phrases, sentences or articles, misuse of words, and transliteration that does not conform to context. These errors may be identified manually by the user and/or automatically by the computer program or application.
In step 108, the training set is modified to account for the errors based on the translation errors identified after training in step 106. For example, modification of the training set may include deleting and/or editing some text and/or punctuation in one or more aligned parallel sentences of the training set. These modifications may be identified manually by the user and/or automatically by the computer program or application.
After modifying the training set in step 108, another training set is created in step 110 using the modified training set and some aligned parallel sentences that are not in the previous training set. These aligned parallel sentences not in the previous training set may be sentences obtained from step 102 and not used in step 104. Thus, the other training set has more aligned parallel sentences than the previous training set.
In step 112, the training set obtained from step 110 is provided and used to further train the natural language translation model. Such a processing arrangement is believed to improve the translation performance (e.g., reduce translation errors) of the natural language translation model.
After step 112, in step 114, the method 100 determines whether the translation performance of the natural language translation model trained in step 112 is acceptable. The determination may include evaluating training performance or translation performance of the model, as in step 106. The evaluation may involve detection of a translation error, and if the translation error is below a threshold (or no translation error is detected), the translation performance of the natural language translation model is determined to be acceptable.
If it is determined in step 114 that the translation performance of the natural language translation model trained in step 112 is acceptable, then the method 100 ends.
If it is determined in step 114 that the translation performance of the natural language translation model trained in step 112 is unacceptable, the method 100 loops back to step 108, where the training set (the set used to train the model in step 112) is further modified based on the operations of step 108 described above, and continues through steps 110 through 114. This process repeats until it is determined in step 114 that the translation performance of the natural language translation model is acceptable. With the generation of one or more further training sets, the number of aligned parallel sentences in each further set will be greater than the number of aligned parallel sentences in the previous set.
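By way of illustration only, the staged training loop of method 100 may be sketched as follows. All helper functions, names, and thresholds below are hypothetical stand-ins; the method does not prescribe any particular trainer, error detector, or set sizes:

```python
"""Illustrative sketch of the staged training loop of method 100.

The helpers are trivial stand-ins so the sketch runs end to end; a real
system would plug in an actual NMT trainer and error detector.
"""
import random

def train_model(model, training_set):
    # Stand-in for steps 106/112: a real trainer would update NMT weights here.
    model["trained_on"] = len(training_set)

def detect_translation_errors(model, sample):
    # Stand-in: a real detector would flag grammar, word-order,
    # untranslated-word, and terminology errors, manually or automatically.
    return [pair for pair in sample if random.random() < 0.001]

def fix_pair(pair):
    # Stand-in for step 108: a real fix would delete/edit text or punctuation.
    src, tgt = pair
    return (src.strip(), tgt.strip())

def run_staged_training(pairs, initial_size=20000, growth=5000, max_errors=10):
    model = {}
    training_set = list(pairs[:initial_size])           # step 104: initial subset
    remaining = list(pairs[initial_size:])
    while True:
        train_model(model, training_set)                # steps 106 / 112
        errors = detect_translation_errors(model, training_set)
        if len(errors) <= max_errors or not remaining:  # step 114: acceptance check
            return model
        training_set = [fix_pair(p) for p in training_set]  # step 108: modify set
        training_set += remaining[:growth]              # step 110: larger next set
        remaining = remaining[growth:]
```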
FIG. 2A illustrates a computer-implemented method 200A for processing an electronic document to obtain aligned parallel sentences in one embodiment of the invention. Method 200A may be implemented as step 102 of method 100 in FIG. 1. The method 200A begins with step 202A, where electronic parallel documents are processed to obtain parallel sentences. An electronic parallel document is a parallel document in electronic form (i.e., the documents are translations of one another, preferably formal and generally recognized translations). Electronic parallel documents may be obtained from reliable sources or databases for quality control. The electronic parallel documents are preferably documents in the same aspect, domain, or context (e.g., law), such that the natural language translation model may be specifically adapted to translate a particular aspect, domain (law), or context, or closely related aspects, domains, or contexts. Processing of the electronic parallel documents to obtain aligned parallel sentences may involve manual processing (requiring user intervention) or automatic processing (requiring no user intervention). For example, processing of an electronic document may involve manipulating (e.g., editing) text in the electronic parallel documents.
After the parallel sentences are obtained in step 202A, in step 204A, an alignment operation is performed to align the parallel sentences and obtain aligned parallel sentences. The aligning operation may include matching the sentence in the first natural language with a corresponding translation of the sentence in the second natural language. The alignment operation may not include spatial alignment of text.
The aligned parallel sentences include sentences of the first natural language that are aligned with (i.e., correspond to or match) the corresponding translations of sentences of the second natural language.
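As an illustration of one simple alignment strategy (the method does not mandate any particular aligner; statistical aligners such as Gale-Church are common alternatives), a naive pass for the case where both documents are already sentence-split into equal counts might look like this:

```python
def naive_align(src_sentences, tgt_sentences, max_len_ratio=2.5):
    """Pair sentences by position and drop pairs whose lengths diverge wildly.

    Assumes both documents are already sentence-split and contain the same
    number of sentences; real aligners also handle insertions/deletions.
    """
    pairs = []
    for src, tgt in zip(src_sentences, tgt_sentences):
        ls = max(len(src.split()), 1)
        lt = max(len(tgt.split()), 1)
        if max(ls, lt) / min(ls, lt) <= max_len_ratio:  # crude sanity check
            pairs.append((src, tgt))
    return pairs
```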
Then, in step 206A, a clean-up operation is performed to clean up the aligned parallel sentences. The clean-up operation may include deleting, from the aligned parallel sentences, text and/or punctuation marks that are not needed for translation training. Step 206A may provide more than 10000 pairs, more than 20000 pairs, or more than 50000 pairs of aligned parallel sentences, each of which may consist of 8 to 50 words in one example, and of other numbers of words in other examples.
The aligned parallel sentences obtained from step 206A may be used to form training data or training sets to be used to train a natural language translation model, as described in fig. 1.
FIG. 2B illustrates a computer-implemented method 200B for processing an electronic document to obtain aligned parallel sentences in one embodiment of the invention. Method 200B may be implemented as step 102 of method 100 in fig. 1. The method 200B in fig. 2B is the same as the method 200A in fig. 2A, except that the order of the cleaning operations is interchanged with the order of the alignment operations. In method 200B, the cleanup operation in step 204B is performed on the parallel sentences obtained from step 202B, and the alignment operation in step 206B is performed on the cleaned parallel sentences obtained from step 204B.
The aligned parallel sentences obtained from step 206B may be used to form training data or training sets to be used to train a natural language translation model, as described in fig. 1.
FIG. 3 illustrates a computer-implemented method 300 for processing an electronic document to obtain aligned parallel sentences in one embodiment of the invention. Method 300 may be implemented as step 102 of method 100 in fig. 1.
The method 300 begins at step 302, where electronic parallel documents are received or retrieved, for example, from one or more databases. After step 302, in step 304, the method 300 determines whether the electronic parallel documents exist in the form of text files (e.g., Word files, TXT files). In one example, step 304 determines whether the electronic parallel documents exist in the form of plain text files (e.g., TXT files). The determination may be performed manually or automatically by a computer program.
If it is determined in step 304 that the electronic parallel document does not exist in the form of a text file, the method continues to step 306, where the electronic parallel document is converted to a text file, e.g. manually or using a file format conversion computer program. Then, after the conversion in step 306, the method continues to steps 308 and 310, where the text file is processed to obtain parallel sentences and the alignment and cleaning operations are performed to obtain aligned parallel sentences.
If it is determined in step 304 that the electronic parallel document is in the form of a text file, the method continues to steps 308 and 310, where the text file is processed to obtain parallel sentences and the alignment and cleaning operations are performed to obtain aligned parallel sentences.
The processing in steps 308 and 310 may include steps 202A-206A of method 200A or steps 202B-206B of method 200B. For brevity, details of these steps are not described here in detail.
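A minimal sketch of the format check and conversion in steps 304 to 310 is given below; the converter hook is a hypothetical placeholder, since the method leaves the conversion tool open (e.g., manual conversion or a file format conversion program):

```python
from pathlib import Path

def ensure_plain_text(path):
    """Return UTF-8 text for a parallel document, converting it if needed.

    Step 304: check whether the file is already a (plain) text file.
    Step 306: otherwise convert it first; the converter below is a
    placeholder, since the method leaves the conversion tool open.
    """
    p = Path(path)
    if p.suffix.lower() == ".txt":      # already a plain text file
        return p.read_text(encoding="utf-8")
    converted = p.with_suffix(".txt")
    convert_to_txt(p, converted)        # hypothetical converter hook
    return converted.read_text(encoding="utf-8")

def convert_to_txt(src, dst):
    # Stand-in: a real implementation might call a document converter
    # manually or programmatically; no specific tool is prescribed.
    raise NotImplementedError("plug in a file format converter here")
```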
The aligned parallel sentences obtained from step 310 may be used to form training data or training sets to be used for training a natural language translation model, as described in fig. 1.
FIG. 4 illustrates natural language translation operations in one embodiment of the invention. In operation, text in the source language is provided to a natural language translation model, such as a natural language translation model trained using any of the methods of fig. 1-3. The natural language translation model processes text in a source language and provides translated text in a target language. Text provided in the source language to the natural language translation model may be extracted or obtained from an electronic file containing text in the first language. The text in the target language may be provided as an electronic file containing text in the second language for download, transmission, and/or viewing. In some implementations, a user interface may be provided to facilitate the operation of natural language translation. The user interface may include a display for displaying (e.g., side-by-side displaying) text in a first language and text in a second language. In some implementations, the user interface is provided as part of an electronic system. The electronic system may be an online conversational system, such as an online dispute resolution system (e.g., an arbitration or mediation system that facilitates online arbitration or mediation).
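For illustration, the operation of FIG. 4 can be exercised with any trained English-to-Spanish NMT model. The sketch below uses the publicly available Helsinki-NLP/opus-mt-en-es MarianMT checkpoint from the Hugging Face transformers library purely as a stand-in; it is not the domain-specific model trained by the methods of FIGS. 1-3:

```python
# Illustrative only: a public MarianMT checkpoint stands in for the
# domain-specific model trained by the methods of FIGS. 1-3.
# Requires: pip install transformers sentencepiece torch
from transformers import MarianMTModel, MarianTokenizer

model_name = "Helsinki-NLP/opus-mt-en-es"  # public English->Spanish checkpoint
tokenizer = MarianTokenizer.from_pretrained(model_name)
model = MarianMTModel.from_pretrained(model_name)

source_text = ["The tribunal shall resolve the dispute by arbitration."]
batch = tokenizer(source_text, return_tensors="pt", padding=True)
generated = model.generate(**batch)
print(tokenizer.batch_decode(generated, skip_special_tokens=True))
```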
FIG. 5 is a block diagram of an information handling system 500 in one embodiment of the invention, the system 500 being configured to perform at least a portion of a computer-implemented method. For example, information handling system 500 may be used to perform some or all of the methods and/or operations of fig. 1-4. The information handling system 500 may be a general purpose information handling system or may be a special purpose information handling system.
As shown in FIG. 5, information handling system 500 generally includes the appropriate components necessary to receive, store, and execute the appropriate computer instructions, commands, or code. The main components of information handling system 500 are a processor 502 and a memory (storage) 504. The processor 502 may comprise one or more of: a Central Processing Unit (CPU), a Micro Control Unit (MCU), a controller, a logic circuit, a Raspberry Pi chip, a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA), or any other digital or analog circuit configured to interpret and/or execute program instructions and/or process signals and/or information and/or data. Memory 504 may include one or more volatile memories, such as Random Access Memory (RAM), Dynamic Random Access Memory (DRAM), and Static Random Access Memory (SRAM), and one or more non-volatile memories, such as Read Only Memory (ROM), Programmable Read Only Memory (PROM), Erasable Programmable Read Only Memory (EPROM), Electrically Erasable Programmable Read Only Memory (EEPROM), Ferroelectric Random Access Memory (FRAM), Magnetic Random Access Memory (MRAM), flash memory (FLASH), Solid State Disk (SSD), NAND, and Non-Volatile Dual Inline Memory Module (NVDIMM), or any combination thereof. Suitable computer instructions, commands, code, information, and/or data may be stored in memory 504. One or more natural language processing models and/or one or more natural language translation models may be stored in memory 504.
Optionally, the information handling system 500 also includes one or more input devices 506. Examples of such input devices 506 include one or more of the following: a keyboard, a mouse, a stylus, an image scanner, a microphone, a haptic/touch input device (e.g., a touch sensitive screen), an image/video input device (e.g., a camera), etc. One or more input devices 506 may be configured to receive text in a source language.
Optionally, the information handling system 500 also includes one or more output devices 508. Examples of such output devices 508 include one or more of the following: displays (e.g., screens, projectors, etc.), speakers, disk drives, headphones, earphones, printers, additive manufacturing machines (e.g., 3D printers), and so forth. The display may include a Liquid Crystal Display (LCD), a light emitting diode/organic light emitting diode (LED/OLED) display, or any other suitable display that may or may not be touch sensitive. One or more output devices 508 may be configured to output text in a target language.
The information handling system 500 may also include one or more disk drives 512, which may include one or more of: solid state drives, hard disk drives, optical disk drives, flash drives, tape drives, and the like. A suitable operating system may be installed on information handling system 500, for example, on disk drive 512 or in memory 504. The memory 504 and the disk drive 512 may be operated by the processor 502.
Optionally, the information handling system 500 also includes a communication device 510 for establishing one or more communication links (not shown) with one or more other computing devices, such as a server, personal computer, terminal, tablet, telephone, watch, or wireless or handheld computing device. The communication device 510 may include one or more of the following: a modem, a Network Interface Card (NIC), an integrated network interface, an NFC transceiver, a ZigBee transceiver, a Wi-Fi transceiver, a Bluetooth transceiver, a radio frequency transceiver, an optical port, an infrared port, a Universal Serial Bus (USB) connection, or other wired or wireless communication interface. A transceiver may be implemented by one or more devices (one or more integrated transmitters and receivers, one or more separate transmitters and receivers, etc.). One or more of the communication links may be wired or wireless for conveying commands, instructions, information, and/or data. In some implementations, the processor 502, the memory 504, and optionally the one or more input devices 506, the one or more output devices 508, the communication device 510, and the disk drive 512 are interconnected by a bus, a Peripheral Component Interconnect (PCI) bus such as PCI Express, a Universal Serial Bus (USB), an optical bus, or another similar bus structure. In one embodiment, some of these components may be connected through a network, such as the Internet or a cloud computing network. Information handling system 500 may be implemented on a single device or distributed across multiple devices.
Those skilled in the art will appreciate that the information handling system 500 shown in FIG. 5 is exemplary and that the information handling system 500 may have different configurations (e.g., additional components, fewer components, etc.) in other embodiments.
An exemplary embodiment of the present invention will now be presented. It should be noted that this embodiment is exemplary and included to facilitate understanding of how the invention may be implemented in one possibility.
In this example, the online mediation/arbitration cloud service platform is configured with an online neural machine translation service powered by artificial intelligence. The neural machine translation service allows a user to translate text files from one language to another. The neural machine translation service includes a translation model (a well-defined translation data model) that translates a source language into a target language (and vice versa). The following description in this example relates to the training of translation models used in a particular field (e.g., the legal field).
In this example, the online mediation/arbitration cloud service platform is an online dispute resolution platform that allows network users to mediate, arbitrate, and negotiate. The platform provides an artificial-intelligence-based machine translation service adapted to translate legal documents (e.g., text files) from one language to another. In this example, neural machine translation is used so that the translated text reads more like a human translation. Neural machine translation uses a translation model to translate text files. The translation model contains a large number of paired translated words (similar to a dictionary, in particular a legal term/phrase dictionary) drawn from parallel sentence pairs in the source language and the target language.
Construction of a translation model involves data preprocessing, model training, modulation, etc., which can affect the quality of translation provided by the model.
One aspect of this example relates to acquisition and preprocessing of training materials.
Parallel documents are multilingual documents, such as bilingual documents, in which the text in one language is a formal and accepted translation of the text in another. Preferably, documents that qualify as training material (for training a natural language translation model) have this feature.
In this embodiment, the translation system is used to translate documents in the trade, business, or legal domains, which generally include legal terms and phrases. Thus, the context of the parallel documents or parallel sentences must relate to arbitration, mediation, trade, online dispute resolution, or other legal matters. When selecting documents for use as training material, emphasis is placed in this example on documents related to law, arbitration, mediation, dispute resolution (online dispute resolution), trade, and the like. In this example, preferred documents are those with high-quality translations; accurate and well-structured sentences; accurately and correctly used words, terms, and/or phrases; consistent legal words/terms/phrases; and subject matter related to law, trade, and/or online dispute resolution. Ideally, these parallel documents are obtained from reliable sources. In this example, the parallel documents are obtained from the following data sources: the United Nations, Case Law on UNCITRAL Texts (CLOUT) cases, various Spanish-speaking countries, arbitration institutions in London and Stockholm dealing with international business, the Mexico Arbitration Center, the International Centre for Settlement of Investment Disputes, the Chile Arbitration and Mediation Center, the Hong Kong International Arbitration Centre (Hong Kong, China), and the World Trade Organization. In one example, arbitration documents, arbitration reports, and World Trade Organization (WTO) trade documents may be used. In this example, documents from these different sources are used to train the translation model.
In this example, 25000 to 30000 parallel sentences are obtained and used for training, tuning and testing the translation model.
The parallel documents obtained from each source and their approximate sentence counts are recorded for record keeping. Parallel documents or electronic documents are named according to a predetermined format. In this example, file names end with "_EN.txt" for English text files and "_SP.txt" for Spanish text files.
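Under this naming convention, parallel files can be paired automatically. A minimal sketch (assuming all files sit in a single directory) might be:

```python
from pathlib import Path

def pair_parallel_files(folder):
    """Match *_EN.txt files to their *_SP.txt counterparts by shared stem."""
    folder = Path(folder)
    pairs = []
    for en_file in sorted(folder.glob("*_EN.txt")):
        sp_file = folder / (en_file.name[: -len("_EN.txt")] + "_SP.txt")
        if sp_file.exists():            # keep only complete pairs
            pairs.append((en_file, sp_file))
    return pairs
```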
Another aspect of this example relates to aligned parallel corpus preparation.
In this example, parallel documents suitable for training are identified and obtained (or downloaded). In this example, the system for processing text in parallel documents is set to accept UTF-8 encoded text file (.txt) format. Other systems used in other examples may have other specifications. Thus, if the obtained parallel document has not yet adopted an acceptable format, appropriate file conversion is required.
Then, after obtaining the appropriate file format, the parallel documents are processed to obtain aligned parallel sentences. In one example, an alignment tool (e.g., winMerge) is used to facilitate alignment, comparison, and editing of parallel sentences. The "replace" function in WinMerge may be used to clean up documents, e.g., delete unwanted punctuation marks or symbols.
Various points should be noted when aligning and cleaning up parallel sentences. The performance of the translation system may be sensitive to the nature of the text, for example: complexity (presence of elements other than text), structural distance (the degree to which the translation ranges from literal to free), noise level (text deletions or preprocessing errors), and the typological distance between the languages. Because of the diversity and noisy nature of web corpora, cleaning the documents improves the translation performance of models trained on parallel sentences obtained from them. In this example, alignment refers to creating parallel text by assigning the correct English sentence to the corresponding Spanish sentence.
In this example, the model is set to translate legal text from English to Spanish. One or more cleaning operations may be performed while editing the parallel documents. These cleaning operations include:
delete "(s)" or open brackets.
Delete all titles, short titles, and directories. Only complete sentences are used.
Delete sentences containing too many digits or statistics.
Delete all numbers and bullets, which may confuse the format/sentence arrangement.
Single words, e.g. "and/or" at the end of the sentence after the punctuation is deleted "
Delete all words in brackets. Sometimes only one language version is bracketed. Ensuring that corresponding sentences of other languages are found and adjustments are performed.
Delete the name of the person.
The semicolon (;) is replaced with a period (for English).
Adjust the alignment when this happens: for example, "i like cats, and i like dogs. "(English) vs." I like cat. I like dogs "(other languages). In this case, the neural machine translation model will calculate one sentence of english but will calculate two sentences of the other language, which can lead to misalignment.
Delete all periods that are not at the end of the sentence (e.g., i.e., cap.123).
Delete quotation marks "", brackets (), brackets [ ].
Delete web site.
Delete the space at the beginning of the sentence.
Delete duplicate parallel sentences.
Documents that do not use words that contain too many different languages that are not the target language.
In some cases, company names and personal names may be reserved.
When there is a colon (:) and a semicolon (;) separate sentences and different options are created.
Ensure consistency. For example, if a number is used to represent a number in one language, the number is also applied to represent the same number in another language; if written numbers are used to represent numbers in one language, written numbers are also applied to represent the same numbers in another language. As another example, terms (e.g., legal terms) should be used consistently in different sentences.
In this example, one or more further cleaning operations may be performed while editing the parallel documents. These further cleaning operations are specific to English-Spanish translation, and they include the following (a code sketch implementing a subset of all these cleaning rules appears after this list):
• Retain Latin terms in both languages.
• Retain commas in Spanish.
• Word order in Spanish translations can be quite different even when the translation is correct. If a sentence is correct and appropriate in Spanish/English, keep the order of the translation.
• Some Spanish words have feminine and masculine forms; in some cases, both versions are preserved.
• Retain hyphens in English and in the Spanish translated version.
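Several of the rules listed above lend themselves to simple regular expressions. The sketch below implements an assumed subset (bracketed text, website addresses, quotation marks, the English semicolon rule, whitespace tidying, and duplicate removal); it is illustrative only, as the example above performed these edits with tools such as WinMerge:

```python
import re

def clean_sentence(text, english=False):
    """Apply a subset of the cleanup rules listed above (illustrative only)."""
    text = re.sub(r"\([^)]*\)|\[[^\]]*\]", "", text)   # drop bracketed text
    text = re.sub(r"https?://\S+|www\.\S+", "", text)  # drop website addresses
    text = re.sub(r'[“”"]', "", text)                  # drop quotation marks
    if english:
        text = text.replace(";", ".")                  # semicolon -> period (English rule)
    return re.sub(r"\s+", " ", text).strip()           # tidy whitespace/leading spaces

def clean_pairs(pairs):
    """Clean both sides of each pair and drop duplicate parallel sentences."""
    seen, cleaned = set(), []
    for en, sp in pairs:
        pair = (clean_sentence(en, english=True), clean_sentence(sp))
        if pair not in seen and all(pair):             # skip dupes and emptied pairs
            seen.add(pair)
            cleaned.append(pair)
    return cleaned
```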
Another aspect of this example relates to model construction. In this example, a neural machine translation system may be used to build the model.
In this system, different categories may be selected for training the model. In this example, the first training is performed in two categories of interest (general and legal). This is to better understand how the system performs from English to Spanish and vice versa in these two categories. It should be noted that the purpose of this example is to build a system dedicated to legal translation.
In this example, training material may be uploaded to the system in two ways: (1) with all data merged, or (2) document by document. In one example, the documents are divided into legal cases and mixed documents (United Nations, World Trade Organization, etc.). After all documents are aligned, all training materials are merged into two large documents using WinMerge. Thereafter, the large documents are divided into a number of smaller documents of about 5000 sentences each for processing. The document type may be selected when uploading a document to the system.
Next, training, testing, and adjustment data are defined. In this example, the test and adjustment data are manually selected, but in another example they may be selected by a custom translation function in the system.
Training, testing, and tuning data are recorded (for record keeping).
In this example, the training dataset is the base set for building/training the model, and it can be changed between training runs in order to find the most effective way to improve the quality of the model's translations. The training dataset is the example data from which the system learns. In this example, the training dataset should contain at least 20000 parallel sentences, and it should use the sources and features described above. In this example, parallel sentences are extracted from the training dataset and used in the test set; this makes fine-tuning easier after each training run. In this example, each sentence must contain 8 to 50 words in any dataset (training, tuning, or testing).
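The length constraint and train/tune/test carving described here can be expressed compactly. The sketch below follows the example's figures (8 to 50 words per sentence, up to 2500 tuning and 2500 test sentences); the slicing is only a stand-in for the manual selection described here:

```python
def within_length(pair, lo=8, hi=50):
    """True if both sides of the pair contain 8 to 50 words."""
    return all(lo <= len(side.split()) <= hi for side in pair)

def split_datasets(pairs, n_tune=2500, n_test=2500):
    """Filter by sentence length, then carve out tuning and test sets.

    In the example the tuning/test sentences are hand-picked; the slicing
    here is only a stand-in for that manual selection.
    """
    usable = [p for p in pairs if within_length(p)]
    tune = usable[:n_tune]
    test = usable[n_tune:n_tune + n_test]
    train = usable[n_tune + n_test:]    # the example targets >= 20000 pairs
    return train, tune, test
```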
In this example, the tuning (adjustment) dataset is manually selected. It need not contain more than 2500 parallel sentences, but it should be chosen carefully because it steers the translation system toward translations closest to the sentences it contains, and therefore has a significant impact on translation quality. In this example, the tuning dataset should contain meaningful sentences similar to those the system or model is to translate, each 8 to 50 words long. This is closer to real-life legal documents and better prepares the system for configuration/use.
In this example, the test dataset includes the content and any terminology needed to test the system. The test set is an excerpt of the training set, so that the training dataset can be tested and fine-tuned more easily. The test dataset should include legal, online dispute resolution, and trade terms that are considered important to the system or model. In this example, the test dataset contains no more than 2500 parallel sentences and is manually selected so that improvements between training runs can be compared. The test dataset contains sentences each 8 to 50 words long, well aligned, with clear meaning, good structure, and high-quality translations.
After uploading the document, the system determines whether the document is aligned. If misalignment is detected, a correction may be made.
For a translation model to be configured, it must meet the criteria established by the process. In one example, the model selected for configuration is the one that scores highest among all models.
Another aspect in this example relates to model evaluation.
The training time of the model varies and depends on the amount of data. Training may succeed or fail; it may fail during training or while the data is being processed. Possible reasons for failure include too many training runs executing at the same time, or the computer running the training being shut down during training.
After training succeeds, the system provides a BLEU (bilingual evaluation understudy) score and a baseline BLEU score. In this example, the results of each training run (e.g., sentence count, BLEU score, status, and system test results) are checked. Sentence count refers to the number of sentences used. The data used for training can be downloaded after training; downloading it and comparing it with the original data using tools such as WinMerge makes it possible to find errors and misalignments.
In this example, the training output is checked to identify errors and misalignments. These errors and misalignments can be corrected to improve the translation system and produce a more mature, higher-quality neural machine translation model. In this example, the test dataset may be checked by clicking on "system test results" in the system. The test dataset in each training run is read to identify the most common translation problems. After the most common translation problems are identified, the test data are read to determine how common these problems are and whether they affect translation quality. The observed errors are compared with a baseline (e.g., translations produced by a general-purpose translation engine) to see whether the problems are also present in the baseline. If a problem also occurs in the baseline, it is interpreted as an inherent problem; if not, the problem was introduced in model training. In addition, possible solutions to the errors and misalignments are identified and implemented. These problems often relate to the editing or quantity of parallel sentences, but are sometimes of unknown origin.
The most common translation problems identified in this example include: grammar mistakes, punctuation mistakes, term use inconsistencies, sentence structure mistakes, word order mistakes, untranslated words, word translation mistakes, improper repetition of words, phrases, sentences or articles, misuse of words, and transliteration that does not conform to the context.
In this example, the training data is modified between training runs to improve model training and thus the model itself. In this example, the following is performed until the desired level of English-Spanish translation quality is reached:
• The number of parallel sentences is increased in each training run in order to create a translation system with broader language coverage and fewer literal transliterations.
• The training dataset is edited between training runs. This involves correcting misalignments and translation problems, ensuring consistency of legal terms, Latin idioms, and acronyms, and cleaning up punctuation marks and numbers.
• After each training run, the dataset is adjusted based on the translation problems found.
• The test and tuning datasets are manually selected and kept mostly constant so that training progress can be checked easily and more accurately.
• The test dataset was edited once because some important legal terms and acronyms were missing.
Another aspect of this example relates to model quality assessment, e.g., how to assess the output quality of a machine translation program, criteria and methods for assessing translation quality.
In this example, a scoring scheme is considered. Quality assessment includes five criteria, which together make up 100%; each criterion accounts for 0-20% of the total score. This is a manual (human) machine-translation quality scheme, since BLEU alone cannot capture quality. The first four criteria are evaluated by human experts in both languages:
1. Correctness of sentence structure: the degree to which sentences conform to syntactic standards and rules.
2. Correctness of terms and words used: the translation uses the correct words, terms, and tenses.
3. Accuracy of meaning: the degree to which the sentence is correct or accurate and correctly reflects the meaning.
4. Translation quality.
5. Adjusted BLEU score: (score X of the current training / highest score Y of all trainings) × 20. The purpose of this is to accommodate multiple trained models.
In this example, the BLEU score is an algorithm that provides a basic quality indicator. It measures the correspondence between machine translation output and human translation: in general, the closer a machine translation is to a professional human translation, the better it is. The adjusted BLEU score is intended to cover the whole training process, which requires multiple training runs. This is one criterion for quality assessment, but not the only one; manual quality assessment should additionally or alternatively be performed to obtain a better evaluation. An exemplary scale is used in this example to determine translation quality: 80-100 points = excellent; 60-80 points = good; 50-60 points = acceptable; below 50 points = fail. The criterion for configuring a model is that it must score over 80. The score is obtained based on the criteria described above.
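The adjusted BLEU criterion is a simple rescaling of a standard BLEU score. The sketch below computes a corpus BLEU with NLTK (one possible implementation; the example's platform computes BLEU itself) and applies the (X / Y) × 20 adjustment:

```python
from nltk.translate.bleu_score import corpus_bleu  # pip install nltk

def adjusted_bleu(current_score, best_score):
    """Criterion 5: (score X of current training / highest score Y) x 20."""
    return (current_score / best_score) * 20.0

# Toy data: references are lists of reference token lists per hypothesis.
references = [[["the", "tribunal", "resolves", "the", "dispute"]]]
hypotheses = [["the", "tribunal", "resolves", "the", "dispute"]]
bleu = corpus_bleu(references, hypotheses)          # value in 0.0 - 1.0
print(adjusted_bleu(bleu * 100, best_score=100.0))  # 0-20 points, assuming Y = 100
```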
In this example, after a model is selected and configured, it is tested at internal and external levels. After testing, several problems were identified: (1) the model could not properly handle rare words (which might be copied unchanged or translated into newly created words); (2) lengthy sentences (more than 50 words) sometimes caused translation failures. After these problems were identified, the model was retrained with 5000 additional shorter parallel sentences relevant to online dispute resolution. After this training, significantly fewer untranslated words, rare-word translations, word-order errors, and legal-term selection errors were observed.
Although not required, some embodiments described with reference to the figures may be implemented as an Application Programming Interface (API) or as a series of libraries for use by a developer, or may be contained in another software application, such as a terminal or computer operating system or a portable computing device operating system. In general, because program modules include routines, programs, objects, components, and data files that perform or assist in performing particular functions, those skilled in the art will appreciate that the functionality of a software application may be distributed across a number of routines, objects, and/or components to achieve the same functionality described herein.
It will also be appreciated that any suitable computing system architecture may be used where the methods and systems of the present invention are implemented, in whole or in part, by a computing system. This would include a stand-alone computer, a network computer, a dedicated or non-dedicated hardware device. When the terms "computing system" and "computing device" are used, these terms are intended to encompass, but are not limited to, any suitable arrangement of computer or information processing hardware capable of carrying out the functions described.
Those skilled in the art will appreciate that various changes and/or modifications can be made to the invention as shown in the specific embodiments to provide further embodiments of the invention. The described embodiments of the invention are, therefore, to be considered in all respects as illustrative and not restrictive. One or more features from one embodiment can be selectively combined with one or more features from another embodiment to form one or more new embodiments. For example, the natural language translation model may be, but is not necessarily, a neural machine translation model or a network. The natural language translation model may be configured to translate from one or more natural languages to one or more other natural languages, or to translate between multiple (two or more) natural languages. The source language may be, but is not necessarily, english; the target language may be spanish, but is not necessarily spanish.

Claims (24)

1. A computer-implemented method for training a natural language translation model, the computer-implemented method comprising:
processing one or more sets of electronic parallel documents to obtain a plurality of aligned parallel sentences;
creating a first training set comprising a subset of the plurality of aligned parallel sentences;
training the natural language translation model using the first training set in a first stage;
modifying the first training set based on translation errors detected after training of the first stage;
creating a second training set based on the modified first training set and at least some of the aligned parallel sentences not in the first training set; and
training the natural language translation model in a second stage using the second training set to improve translation performance of the natural language translation model.
2. The computer-implemented method of claim 1, wherein the processing one or more sets of electronic parallel documents comprises:
processing the one or more sets of electronic parallel documents to obtain a plurality of parallel sentences; and
an alignment operation is performed to align the plurality of parallel sentences to form the plurality of aligned parallel sentences.
3. The computer-implemented method of claim 2, wherein the processing the one or more sets of electronic parallel documents comprises:
if the one or more sets of electronic parallel documents do not exist in the form of text files, converting the one or more sets of electronic parallel documents into text files, preferably plain text files; and
processing the text file to obtain the plurality of parallel sentences.
4. The computer-implemented method of claim 2 or 3, wherein processing the one or more sets of electronic parallel documents comprises:
a cleaning operation is performed to clean up the plurality of parallel sentences or the plurality of aligned parallel sentences.
5. The computer-implemented method of claim 4, wherein the cleaning operation comprises deleting some text and/or punctuation marks from the plurality of parallel sentences or the plurality of aligned parallel sentences.
6. The computer-implemented method of any of claims 1 to 5, wherein the translation error comprises one or more of: grammar mistakes, punctuation mistakes, term use inconsistencies, sentence structure mistakes, word order mistakes, untranslated words, word translation mistakes, improper repetition of words, phrases, sentences or articles, misuse of words, and transliteration that does not conform to the context.
7. The computer-implemented method of claim 6, wherein the modifying of the first training set based on translation errors comprises: deleting and/or editing some text and/or punctuation marks in the subset of the plurality of aligned parallel sentences.
8. The computer-implemented method of any one of claims 1 to 7, wherein the computer-implemented method further comprises:
modifying the second training set based on translation errors detected after the training of the second stage;
creating a third training set based on the modified second training set and at least some of the plurality of aligned parallel sentences not in the first training set and the second training set; and
training the natural language translation model in a third stage using the third training set to further improve translation performance of the natural language translation model.
9. The computer-implemented method of any of claims 1 to 8, wherein the natural language translation model comprises a neural machine translation model or a network.
10. The computer-implemented method of any one of claims 1 to 9,
wherein the natural language translation model is configured to translate a first natural language into a second natural language,
wherein each of the one or more sets of electronic parallel documents is in the first natural language and the second natural language, and
wherein each of the aligned parallel sentences is in the first natural language and the second natural language.
11. The computer-implemented method of claim 10, wherein the natural language translation model is configured to translate between the first natural language and the second natural language.
12. The computer-implemented method of claim 10 or 11, wherein the first natural language is English and the second natural language is Spanish.
13. The computer-implemented method of any of claims 1 to 12, wherein each of the plurality of aligned parallel sentences of the first training set contains 8 to 50 words and each of the plurality of aligned parallel sentences of the second training set contains 8 to 50 words.
14. The computer-implemented method of any of claims 1-13, wherein a number of the aligned parallel sentences used in the training in the second stage is greater than a number of the aligned parallel sentences used in the training in the first stage.
15. The computer-implemented method of any one of claims 1 to 14,
wherein the one or more sets of electronic parallel documents comprise electronic documents of the same domain or context; and
wherein the natural language translation model is a domain or context specific natural language translation model.
16. The computer-implemented method of claim 15, wherein the domain or context is a legal domain or legal context.
17. The computer-implemented method of any one of claims 1 to 16, wherein the computer-implemented method further comprises:
the one or more sets of electronic parallel documents are collected, e.g., retrieved, from one or more databases.
18. A natural language translation model trained or derived using the computer-implemented method of any one of claims 1 to 17.
19. A computer program or computer program product, characterized in that the computer program or computer program product comprises a natural language translation model according to claim 18.
20. A non-transitory computer-readable medium, characterized in that the non-transitory computer-readable medium comprises the natural language translation model of claim 18.
21. A computer-implemented natural language translation method, the computer-implemented natural language translation method comprising:
receiving text in a first language;
processing the received text using the natural language translation model of claim 18; and
outputting text in a second language based on the processing, the text in the second language being a translation of the text in the first language.
22. An electronic system, the electronic system comprising:
one or more processors configured to:
receiving text in a first language;
processing the received text using the natural language translation model of claim 18; and
outputting text in a second language based on the processing, the text in the second language being a translation of the text in the first language.
23. The electronic system of claim 22, wherein the electronic system further comprises:
a display operatively connected to the one or more processors for displaying the text in the first language and the text in the second language.
24. The electronic system according to claim 22 or 23, characterized in that the electronic system is an online conversational system, such as an online dispute resolution system.
CN202210336355.2A 2022-03-31 2022-03-31 Natural language translation model training and configuration Pending CN116933803A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210336355.2A CN116933803A (en) 2022-03-31 2022-03-31 Natural language translation model training and configuration


Publications (1)

Publication Number Publication Date
CN116933803A 2023-10-24

Family

ID=88375989

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210336355.2A Pending CN116933803A (en) 2022-03-31 2022-03-31 Natural language translation model training and configuration

Country Status (1)

Country Link
CN (1) CN116933803A (en)


Legal Events

Date Code Title Description
PB01 Publication