WO2024004184A1 - Generation device, generation method, and program - Google Patents
Generation device, generation method, and program
- Publication number
- WO2024004184A1 (PCT/JP2022/026407)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- constraint
- unit
- information
- sequence
- generation
- Prior art date
Classifications
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/40—Processing or translation of natural language
- G06F40/42—Data-driven translation
Definitions
- the present invention relates to the technical field of machine translation.
- Vocabulary-constrained (lexically constrained) machine translation translates a sentence from one domain into another domain (e.g., another language) under constraints that guarantee that all specified words (constraint words) appear in the output. Because it can unify the translation of specific terms, vocabulary-constrained machine translation is particularly important for translating patents, legal documents, technical documents, and other texts that require terminological consistency.
- the present invention has been made in view of the above points, and it is an object of the present invention to provide a technique for accurately performing sequence conversion using constraint information.
- A generation device that generates a second sequence (another information sequence) from constraint information and a first sequence (an information sequence), comprising: an input generation unit that takes a constraint information list as input and outputs each element of a subset of one or more pieces of constraint information included in the list as a vocabulary constraint; a sequence generation unit that generates one or more candidates for the second sequence using the first sequence and the vocabulary constraint; and a reranking unit that calculates, for each of the one or more candidates, a score indicating its suitability as the second sequence.
- FIG. 1 is a diagram showing an example of machine translation with vocabulary constraints.
- FIG. 2 is a diagram showing a configuration example of the generation device 100.
- FIG. 3 is a flowchart explaining the operation of the generation device 100.
- FIG. 4 is a diagram showing a configuration example of the extraction unit 120.
- FIG. 5 is a diagram showing a configuration example of the extraction unit 120.
- FIG. 6 is a diagram showing a configuration example of the generation device 100.
- FIG. 7 is a diagram showing a configuration example of the sequence generation unit 140.
- FIG. 8 is a diagram showing an example of the configuration of a machine translation model.
- FIG. 9 is a diagram showing a configuration example of the sequence generation unit 140.
- FIG. 10 is a diagram showing a display image on the display unit 500.
- FIG. 11 is a diagram showing a configuration example of the generation device 100.
- FIG. 12 is a diagram showing the detailed settings and the base hyperparameters for each setting used in the experiment.
- FIG. 13 is a diagram showing evaluation results.
- FIG. 14 is a diagram showing an example of the hardware configuration of the device.
- The present invention can be applied to machine translation, but it can also be applied to sequence conversion in any field, as long as constraint information is used.
- the present invention can be used for summarization tasks, utterance generation tasks, tasks for adding explanatory text to images, and the like.
- the unit of translation is a sentence, but the unit of translation may be any unit.
- The generation device 100 described below provides certain improvements over prior techniques for performing constrained sequence transformation and represents an improvement in the technical field of constrained sequence transformation. Additionally, the extraction device described below provides certain improvements over the prior art in extracting constraint information and represents an improvement in the technical field of constraint information extraction.
- FIG. 1 shows an example of input and output in machine translation with vocabulary constraints.
- As a conventional technique for machine translation with vocabulary constraints, Non-Patent Document 1 (Chen, G., Chen, Y., and Li, V. O. (2021). "Lexically Constrained Neural Machine Translation with Explicit Alignment Guidance." Proceedings of the AAAI Conference on Artificial Intelligence) discloses a vocabulary-constrained machine translation method for manually created constraint phrases. The method disclosed in Non-Patent Document 1 is also called a soft method; it does not guarantee that the constraint phrases will always be included in the translated sentence.
- Non-Patent Document 2 (Matt Post and David Vilar. 2018. "Fast Lexically Constrained Decoding with Dynamic Beam Allocation for Neural Machine Translation." In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pages 1314-1324, New Orleans, Louisiana. Association for Computational Linguistics) and Reference 1 (Chousa, K. and Morishita, M. (2021). "Input Augmentation Improves Constrained Beam Search for Neural Machine Translation: NTT at WAT 2021." In Proceedings of the 8th Workshop on Asian Translation (WAT), pp. 53-61, Online. Association for Computational Linguistics) disclose another vocabulary-constrained translation method. This method guarantees that the constraint phrases are always included in the translated sentence, and is also called the hard method.
- the extracted constraint phrases include words that become noise. Further, even when constraint words and phrases are extracted manually, noise may be included.
- a source language sentence is input using the input unit 110.
- the extraction unit 120 automatically extracts constraint phrases based on the source language sentence (input sentence) input by the input unit 110 and the bilingual dictionary read from the bilingual dictionary DB 200.
- FIG. 4 is a block diagram of the extraction unit 120.
- the extraction unit 120 includes a filtering unit 121, a division unit 122, and a constraint phrase extraction unit 123.
- The extraction unit 120 also refers to the bilingual dictionary DB 200. Note that the extraction unit 120 may be configured without the filtering unit 121.
- the bilingual dictionary DB 200 stores a set of pairs of two words that are made to correspond when converting sequences. Specifically, in this embodiment, which targets translation, the bilingual dictionary DB 200 stores a set of ⁇ source language phrase, target language phrase> pairs.
- the source language phrase and the target language phrase may each consist of multiple words. In this embodiment, one ⁇ source language word/phrase, target language word/phrase> pair is referred to as a "bilingual translation".
- the source language phrase and the target language phrase may be called a source language translation word and a target language translation word, respectively.
- <Extraction unit 120: Filtering unit 121>
- the filtering unit 121 deletes bilingual translations that fall under (A) to (C) below, or words included in the bilingual translations, from the bilingual dictionary.
- the filtering unit 121 does not necessarily need to implement all of (A) to (C), but may implement at least one of (A) to (C). Further, filtering other than (A) to (C) may be performed. In particular, in Modifications 1 and 2, which will be described later, the process (C) below may be skipped.
- For example, a parallel entry of the form "source language: computer; target language: two alternative translations of computer" falls under (C). Such entries are removed so that there is a one-to-one relationship between the source language phrase and the target language phrase.
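As an illustration, the one-to-one filtering of step (C) can be sketched as follows. This is a minimal sketch under the assumption that the bilingual dictionary maps each source phrase to a list of target phrases; the function name and data layout are ours, not the patent's.

```python
def filter_one_to_one(bilingual_dict):
    """Drop entries whose source phrase has more than one target phrase,
    leaving a one-to-one source/target mapping (filtering step (C))."""
    return {src: targets[0]
            for src, targets in bilingual_dict.items()
            if len(targets) == 1}

# "computer" has two competing target translations, so it is dropped.
entries = {"computer": ["computer", "calculator"], "corn": ["corn"]}
filtered = filter_one_to_one(entries)  # {"corn": "corn"}
```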
- The dividing unit 122 divides the source language sentence into words (morphemes); the source language sentence after processing by the dividing unit 122 is therefore the word-segmented form of the input sentence.
- Constraint phrase extraction unit 123 extracts bilingual translations corresponding to the phrases included in the source language sentence, and creates a constraint phrase list using the extracted bilingual translations.
- a specific example of the constraint phrase extraction method will be described below. Note that the format of the dictionary, the search method, etc. are not limited to the method described below, and other methods may be used as long as the method can extract the constraint phrases corresponding to the words included in the source language sentence.
- Prefix matching and longest matching, together with the word division described here, are examples of means for realizing constraint phrase extraction with less noise and reduced ambiguity. Other means of disambiguation may be used.
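The longest-match strategy mentioned above can be sketched as follows, assuming the source sentence has already been divided into words and the dictionary maps token tuples to target phrases (both the function name and the dictionary layout are our own illustration):

```python
def extract_constraints_longest_match(tokens, dictionary):
    """Scan the word-divided sentence left to right; at each position,
    take the longest dictionary phrase that matches (longest match
    reduces ambiguity when entries overlap)."""
    constraints = []
    max_len = max((len(key) for key in dictionary), default=0)
    i = 0
    while i < len(tokens):
        for length in range(min(max_len, len(tokens) - i), 0, -1):
            phrase = tuple(tokens[i:i + length])
            if phrase in dictionary:
                constraints.append((phrase, dictionary[phrase]))
                i += length
                break
        else:
            i += 1  # no entry starts here; move on
    return constraints

dictionary = {("machine", "translation"): "MT-target", ("machine",): "M-target"}
# The longer entry "machine translation" wins over the shorter "machine".
extract_constraints_longest_match(["machine", "translation", "rocks"], dictionary)
```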
- When performing morphological analysis in the division unit 122, information necessary for disambiguation, such as part of speech, base form, stem, conjugation, and reading (pronunciation), is attached to the divided words, and this attached information is also used during matching.
- In other words, by using not only the character string but also attached information such as its part of speech during matching, the token "in" in a source language sentence can be prevented from matching both the preposition "in" and the noun "inn": without the attached information both entries would match, whereas with it the ambiguity is resolved. Resolving ambiguity during matching is an important element in improving translation accuracy.
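The use of attached information during matching can be sketched as follows; the (surface, part-of-speech) key layout and the POS labels are illustrative assumptions, not the patent's format:

```python
def match_with_pos(surface, pos, dictionary):
    """Look up a bilingual entry by surface form AND part of speech,
    so the preposition "in" and the noun "inn" cannot be confused."""
    return dictionary.get((surface, pos))

bilingual = {
    ("in", "ADP"): "in-target",     # hypothetical target-language entry
    ("inn", "NOUN"): "inn-target",  # hypothetical target-language entry
}
match_with_pos("in", "ADP", bilingual)   # matches only the preposition entry
match_with_pos("in", "NOUN", bilingual)  # no match: "in" is not a noun here
```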
- the generation device 100 may not include the extraction unit 120.
- The configuration of the generation device 100 in this case is shown in FIG. 6.
- the constraint phrase list generated by the extraction device is input to the generation device 100.
- A constraint phrase list other than the one generated by the extraction device (e.g., a constraint phrase list that includes a lot of noise) may also be input to the generation device 100.
- The operations of the input generation unit 130, sequence generation unit 140, and reranking unit 150 in FIG. 6 are the same as the operations of the corresponding units described above.
- The input generation unit 130 receives the constraint phrase list as input and outputs each element of the set of subsets of the phrases included in the constraint phrase list as a vocabulary constraint. Alternatively, only some of those elements may be used as vocabulary constraints.
- For example, suppose {A, B, C} is input to the input generation unit 130 as a constraint phrase list.
- A, B, and C are each constraint phrases.
- The input generation unit 130 generates φ, {A}, {B}, {C}, {A, B}, {A, C}, {B, C}, {A, B, C}, and outputs each of them as a vocabulary constraint.
- {φ, {A}, {B}, {C}, {A, B}, {A, C}, {B, C}, {A, B, C}} is the constraint vocabulary set.
- Each subset {...} is one vocabulary constraint; the empty set φ corresponds to unconstrained translation.
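The subset enumeration performed by the input generation unit 130 amounts to taking the power set of the constraint phrase list. A minimal sketch (the function name is ours, not from the patent):

```python
from itertools import combinations

def vocabulary_constraints(constraint_phrases):
    """Enumerate every subset of the constraint phrase list; each subset
    is one vocabulary constraint, and the empty set corresponds to
    unconstrained translation."""
    subsets = []
    for size in range(len(constraint_phrases) + 1):
        for combo in combinations(constraint_phrases, size):
            subsets.append(set(combo))
    return subsets

vocabulary_constraints(["A", "B", "C"])  # 2**3 = 8 vocabulary constraints
```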
- <Sequence generation unit 140> Next, the sequence generation unit 140 will be explained. The sequence generation unit 140 holds a trained machine translation model read from the model DB 300. Furthermore, the sequence generation unit 140 repeats the following process as many times as there are vocabulary constraints (the number of elements in the constraint vocabulary set). For example, if the constraint vocabulary set is {φ, {A}, {B}, {C}, {A, B}, {A, C}, {B, C}, {A, B, C}}, the process is repeated 8 times.
- the sequence generation unit 140 receives an input sentence (source language sentence) and vocabulary constraints as input.
- the sequence generation unit 140 generates a translated sentence (target language sentence) using a machine translation model by applying an existing method of machine translation with vocabulary constraints.
- a plurality of translated sentences are generated as translated sentence candidates (target language sentence candidates).
- a translation sentence candidate is given a score as a translation sentence.
- LeCA is disclosed in Non-Patent Document 1 and is also called the soft method.
- LeCA+LCD is disclosed in Reference 1 mentioned above, and is also called the hard method.
- The sequence generation unit 140 outputs the plurality of generated translation sentence candidates.
- the sequence generation unit 140 outputs a predetermined number of translation sentence candidates in descending order of scores.
- the "predetermined number" may be one. In other words, only the translated sentence with the highest score may be output.
- 30 translation candidates are output for each vocabulary constraint.
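The candidate generation loop described above can be sketched as follows; `translate_nbest` is a stand-in for the constrained machine translation model, not an actual API:

```python
def generate_candidates(source_sentence, constraint_set, translate_nbest, n=30):
    """Run vocabulary-constrained translation once per vocabulary
    constraint, collecting up to n scored candidates for each."""
    candidates = []
    for constraint in constraint_set:
        candidates.extend(translate_nbest(source_sentence, constraint, n))
    return candidates

# A stub translator returning (score, text) pairs, for illustration only.
def stub_translate(source, constraint, n):
    return [(1.0 / (rank + 1), f"{source}|{sorted(constraint)}|{rank}")
            for rank in range(n)]

generate_candidates("src", [set(), {"A"}], stub_translate, n=2)
# 2 constraints x 2 candidates each = 4 candidates in total
```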
- FIG. 7 shows a configuration example of the sequence generation unit 140.
- The sequence generation unit 140 includes a sequence conversion unit 141 and a search unit 142.
- The sequence conversion unit 141 uses vocabulary constraint information in the soft method; in the hard method, whether the sequence conversion unit 141 uses the vocabulary constraints depends on the type of hard method. For this reason, the arrows for inputting vocabulary constraints to the sequence conversion unit 141 are drawn with dotted lines.
- In the hard method LeCA+LCD mentioned above, vocabulary constraint information is used in the sequence conversion unit 141. The configuration and operation assuming LeCA+LCD will be described below.
- As a machine translation model, the sequence conversion unit 141 can use a general encoder-decoder model having an encoder and a decoder (for example, the Transformer), as shown in FIG. 8.
- The invention can also be implemented using models other than the encoder-decoder model.
- the sequence conversion unit 141 receives the source language sentence and vocabulary constraints as input, first expands the source language sentence using the vocabulary constraints to create an input sequence with information on the vocabulary constraints added, and then machine translates it. Use as input to the model.
- the sequence conversion unit 141 generates a sentence by using the expanded input sequence as input to a machine translation model. More specifically, the probability of each word in a set of words that can constitute an output sequence is output.
- the search unit 142 uses the output probability of the decoder in the machine translation model to search for (an approximate solution of) the output sequence that maximizes the generation probability when the input sequence is given.
- the search unit 142 uses a grid beam search method based on beam search to ensure that the output sequence satisfies all of the constraint vocabulary.
- Searching using grid beam search in the search unit 142 is one example; any processing method may be used as long as it performs a lexically constrained search that includes the constraint words/phrases.
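Grid beam search itself is involved; the acceptance condition it enforces, namely that every constraint phrase appears in the output, can be sketched as follows (a simplification: substring matching over the joined output stands in for proper phrase matching):

```python
def satisfies_constraints(output_tokens, vocabulary_constraint):
    """Check that every constraint phrase occurs in the output sequence.
    Grid beam search organizes hypotheses by how many constraints they
    already cover; this shows only the final acceptance test."""
    text = " ".join(output_tokens)
    return all(phrase in text for phrase in vocabulary_constraint)

candidates = [["the", "corn", "was", "roasted"], ["it", "was", "roasted"]]
[c for c in candidates if satisfies_constraints(c, {"corn"})]
# keeps only the hypothesis containing "corn"
```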
- The reranking unit 150 receives as input the one or more translated sentence candidates generated by the sequence generation unit 140. For example, if the sequence generation unit 140 generates 30 translation candidates per vocabulary constraint and there are 8 vocabulary constraints, the reranking unit 150 receives 8 × 30 = 240 translation candidates as input.
- the reranking unit 150 calculates a score for each translated sentence candidate using the input sentence (source language sentence), and outputs the translated sentence candidate with the highest score as the final translated sentence.
- the output unit 160 can present the translated sentences to the user in a ranking format using the scores.
- Any method may be used as long as it can calculate a score for a translated sentence; for example, the methods in Example 1 and Example 2 below can be used.
- Example 1 The reranking unit 150 uses, as a score, the likelihood of translated sentence candidates output by the machine translation model used for translation in the sequence generation unit 140.
- Example 2 The reranking unit 150 uses a machine translation model trained with the Transformer, an encoder-decoder model, on a Right-to-Left translation task, which generates a translated sentence from the end of the sentence toward the beginning. The likelihood obtained when each translation candidate is forcibly output by this model is used as the score.
- Forcibly outputting a translated sentence candidate may be rephrased as forced decoding using the translated sentence candidate.
- the source language sentence is input to the encoder of the reranking model, and the words of the translation sentence candidate whose score (likelihood) is to be evaluated are sequentially input to the decoder of the reranking model.
- the likelihood output by the machine translation model may be any value as long as it indicates plausibility.
- the likelihood output by the machine translation model may be a probability or a value other than probability.
- the reranking unit 150 may calculate the reranking score using both the likelihood of Example 1 and the likelihood of Example 2. For example, the average of the likelihood of Example 1 and the likelihood of Example 2 may be used as the reranking score.
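Combining the two likelihoods of Example 1 and Example 2 by averaging can be sketched as follows; the score dictionaries stand in for the forward model and the Right-to-Left forced-decoding model, which are not implemented here:

```python
def rerank(candidates, l2r_score, r2l_score):
    """Average the left-to-right likelihood and the Right-to-Left
    forced-decoding likelihood, then return the best candidate."""
    scored = [(0.5 * l2r_score[c] + 0.5 * r2l_score[c], c) for c in candidates]
    scored.sort(reverse=True)
    return scored[0][1]

l2r = {"candidate-1": 0.9, "candidate-2": 0.5}  # hypothetical likelihoods
r2l = {"candidate-1": 0.2, "candidate-2": 0.8}
rerank(["candidate-1", "candidate-2"], l2r, r2l)
# candidate-2 wins: its 0.65 average beats candidate-1's 0.55
```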
- the constraint phrase list generated by the extraction unit 120 can be one in which a plurality of target language phrases correspond to one source language phrase.
- a constraint phrase list may be called a constraint phrase list that allows multiple translations. For example, if the filtering unit of the extraction unit 120 does not perform the step (C), such a constraint phrase list may be generated.
- Suppose that there are A and A' as a plurality of target language phrases for a certain source language phrase, and that the extraction unit 120 generates a constraint phrase list "A, A', B, C" containing these together with B and C, with A and A' as separate elements. For example, if the word in the source language sentence is "computer" and the target language has two alternative translations of "computer" (for example, "computer" and "calculator"), A and A' correspond to those two translations.
- The input generation unit 130 that receives {A, A'}, {B}, {C} from the extraction unit 120 generates not only φ, {A}, {B}, {C}, {A, B}, {A, C}, {B, C}, {A, B, C}, but also {A'}, {A', B}, {A', C}, {A', B, C} as vocabulary constraints.
- the input generation unit 130 inputs each of the plurality of generated vocabulary constraints to the sequence generation unit 140.
- The sequence generation unit 140 performs machine translation with vocabulary constraints once for each of the 12 vocabulary constraints φ, {A}, {B}, {C}, {A, B}, {A, C}, {B, C}, {A, B, C}, {A'}, {A', B}, {A', C}, {A', B, C}, and obtains translation candidates. For example, if one translated sentence candidate is generated per vocabulary constraint, 12 translated sentence candidates are obtained.
- After the machine translation with vocabulary constraints is performed, the reranking unit 150 performs the reranking process using the method described above and outputs, for example, the translated sentence candidate with the highest score as the final translated sentence.
- Modification 2 Next, modification 2 will be explained. Also in the second modification, the constraint phrase list generated by the extraction unit 120 can be one in which a plurality of target language phrases correspond to one source language phrase.
- In the translation search process in the search unit 142 of the sequence generation unit 140, the search may be performed while allowing multiple surface forms of one constraint phrase. In other words, the search may be performed such that one element from each set of constraint word candidates is satisfied. Specifically, the details are as follows.
- For example, {A, B, C} is generated as the constraint word list, together with information indicating that A may be A' or A'', and these are input from the extraction unit 120 to the input generation unit 130. Alternatively, {A, A', A'', B, C} may be generated as the constraint word list, and information indicating that any one of A, A', and A'' is acceptable may be input from the extraction unit 120 to the input generation unit 130.
- For the constraint word list {A, B, C}, if A may also be A', the input generation unit 130 generates the seven vocabulary candidate constraints φ, {{A, A'}}, {B}, {C}, {{A, A'}, B}, {{A, A'}, C}, {{A, A'}, B, C}.
- In Modification 2, because multiple target language words (e.g., A and A') may correspond to a certain source language word, there is ambiguity in the translated word and the vocabulary actually used as a constraint is not yet determined. It is therefore called a "vocabulary candidate constraint" rather than a vocabulary constraint.
- the "vocabulary candidate constraint” is a vocabulary constraint that maintains ambiguity.
- the expression format of the vocabulary candidate constraints described above is an example. As long as it can be expressed that either A or A' is acceptable, expression formats other than those described above may be used as the expression format.
- the sequence generation unit 140 receives the source language sentence and the vocabulary candidate constraints as input.
- The sequence generation unit 140 performs machine translation with vocabulary constraints once for each of the seven vocabulary candidate constraints φ, {{A, A'}}, {B}, {C}, {{A, A'}, B}, {{A, A'}, C}, {{A, A'}, B, C}, and obtains translation candidates. For example, if one translated sentence candidate is generated per vocabulary candidate constraint, seven translated sentence candidates are obtained.
- After the machine translation with vocabulary constraints is performed, the reranking unit 150 performs the reranking process using the method described above and outputs, for example, the translated sentence candidate with the highest score as the final translated sentence.
- When using a vocabulary candidate constraint including {A, A'}, the search unit 142 of the sequence generation unit 140 performs the search assuming that the word A may also be A'. In other words, a search is performed that takes ambiguity into account.
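The ambiguity-aware acceptance condition, that at least one surface form from each vocabulary candidate constraint appears in the output, can be sketched as follows (illustrative names; token-level matching is a simplification):

```python
def satisfies_candidate_constraints(output_tokens, candidate_constraints):
    """Each constraint is a set of interchangeable surface forms, e.g.
    {"computer", "calculator"}; the output must contain at least one
    form from every such set."""
    produced = set(output_tokens)
    return all(produced & alternatives for alternatives in candidate_constraints)

constraints = [{"computer", "calculator"}, {"fast"}]
satisfies_candidate_constraints(["a", "fast", "calculator"], constraints)
# satisfied: "calculator" covers the first set, "fast" the second
```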
- For the search that takes ambiguity into account, the method of Reference 2 (Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2017. "Guided Open Vocabulary Image Captioning with Constrained Beam Search." In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 936-945, Copenhagen, Denmark. Association for Computational Linguistics) can be used.
- Reference 2 describes a language generation method, not a translation technology; there is no prior art that applies this method to the search performed during translation decoding.
- The multiple target language phrases (e.g., A and A') may be synonyms, such as "calculator" and "computer".
- They may also be non-synonymous words or phrases: for example, "trunk" may correspond to a car trunk, an elephant's trunk, a suitcase, or a trunk line. Since the search unit 142 does not take word meaning into account during the search, A and A' may even be completely unrelated words.
- arbitrary criteria can be used to determine which words in the converted series correspond to words in the original series.
- the bilingual dictionary is English-Japanese, and there is a dictionary entry for "corn".
- the source language sentence "We roasted corns over the charcoal.”
- If the extraction unit 120 performs matching on a morpheme basis, "corns" matches the bilingual dictionary entry because "corn" is included as a morpheme.
- On the other hand, if the bilingual dictionary entry is "foot" and the word in the input sentence is "feet", there is no match at the surface level. This problem can be resolved by restoring "feet" to its base form "foot" before matching.
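Restoring a word to its base form before lookup can be sketched as follows; the lemma table stands in for the output of a morphological analyzer, and all names are illustrative:

```python
def lemma_match(token, dictionary, lemma_table):
    """Try a direct dictionary lookup first; if the surface form is
    absent, retry with the token's base form (lemma)."""
    if token in dictionary:
        return dictionary[token]
    return dictionary.get(lemma_table.get(token))

dictionary = {"foot": "foot-target"}  # hypothetical target entry
lemma_table = {"feet": "foot"}        # irregular plural -> base form
lemma_match("feet", dictionary, lemma_table)  # matches via the lemma "foot"
```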
- the display unit 500 displays a plurality of constraint phrases (constraint phrase list) for the input source language sentence.
- the words and phrases displayed here as constraint words are the words and phrases that have been filtered by the filtering unit 121.
- The constraint phrases removed by filtering are displayed with a prompt of the form "Do you want to add this?".
- FIG. 11 shows a configuration example of a generation device 100 for realizing the above display.
- The generation device 100 of this embodiment includes an extraction unit 120, a display information generation unit 170, a modification unit 180, a generation unit 190, a bilingual dictionary DB 200, and a constraint phrase list DB 400.
- the modification unit 180 may be included in the display information generation unit 170.
- the bilingual dictionary DB 200 and the constraint phrase list DB 400 may be provided outside the generation device 100.
- the generation unit 190 may also be provided outside the generation device 100 (eg, another server).
- the generation device 100 may be used for the purpose of displaying a list of constraint words on the display unit 500.
- the generation device 100 may include only the extraction unit 120 and the display information generation unit 170 among the functional units shown in FIG.
- the generation device 100 may also be called an extraction device.
- the functions of each part are as follows.
- the extraction unit 120 is the extraction unit 120 shown in FIG. 4 or 5. It takes the source language sentence as input and outputs a list of constraint words.
- The output constraint phrase list is stored in the constraint phrase list DB 400 and is input to the display information generation unit 170. The extraction unit 120 may also output the filtered-out constraint words as a filter word list. The output filter word list is input to the display information generation unit 170.
- the display information generation unit 170 generates information for displaying the constraint phrase list on the display unit 500 (referred to as constraint phrase list presentation information).
- the constraint word/phrase list presentation information includes a constraint word/phrase list. Further, the information for presenting the constraint word/phrase list may include information on the filter word/phrase list as deleted information, filter candidate words, or additional candidates.
- The constraint word list presentation information is transmitted from the display information generation unit 170 to the display unit 500 and input to the display unit 500. The display information generation unit 170 may also generate display information for displaying the constraint phrases, together with the target language sentence (translated sentence) generated using them, in a format in which the constraint phrases can be modified.
- When the generation device 100 receives added or modified constraint phrases from the display unit 500, the display information generation unit 170 may generate display information for displaying a target language sentence (translated sentence) generated based on the received constraint phrases.
- the display information generation unit 170 may generate “correction support information” for the user to check the constraint phrase list, and transmit it to the display unit 500.
- the modification support information includes at least one of a source language sentence input by the user, an extracted constraint phrase list, and a target language sentence generated based on the extracted constraint phrase list.
- the modification unit 180 receives from the display unit 500 at least one of the additional constraint phrases and the modified constraint phrases as information that the user has modified the presented constraint phrase list.
- the modification unit 180 modifies the information stored in the constraint phrase list DB 400 based on the received information.
- A target language sentence is then generated again by machine translation with vocabulary constraints based on the modified constraint phrase list, and the display information generation unit 170 may generate the modification support information and transmit it to the display unit 500, where it is displayed.
- the display unit 500 is, for example, a computer (terminal) having a display.
- the display unit 500 is connected to the generation device 100 via a network.
- the display unit 500 receives a source language sentence from the user and displays a list of constraint words and the like.
- the display unit 500 also accepts instructions for adding and modifying constraint words and sentences in the source language.
- the display unit 500 can also output a source language sentence, a final target language sentence, and a final constraint phrase list as a set.
- by interactively repeating the process of modifying the constraint phrase list while checking the results of machine translation with vocabulary constraints, the generation device 100 in the above embodiment can generate a target language sentence (translated sentence) that is closer to the user's intended image.
- in Reranker, the score used for reranking translation candidates was the score calculated by the reranking model from the source language sentence and the translation candidates.
- FIG. 13 shows the translation accuracy of each method when using vocabulary constraints automatically extracted with a bilingual dictionary. It can be seen that, with Reranker, which uses scores based on the reranking model, LeCA and LeCA+LCD improve translation accuracy compared to the baseline (Transformer). Moreover, FIG. 13 shows that the translation accuracy is high regardless of the type of dictionary.
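As a rough illustration of this reranking step, translation candidates can be sorted by a score computed from the source language sentence and each candidate. The `toy_score` below is only a stand-in for the likelihood produced by an actual reranking model; all names here are illustrative, not from the embodiment.

```python
def rerank(source, candidates, score_fn):
    """Sort translation candidates by a score computed from the
    source sentence and each candidate (higher is better)."""
    return sorted(candidates, key=lambda cand: score_fn(source, cand), reverse=True)

def toy_score(source, candidate):
    """Toy score: fraction of source tokens also present in the
    candidate (a stand-in for a learned reranking model's likelihood)."""
    keywords = set(source.split())
    words = set(candidate.split())
    return len(keywords & words) / max(len(keywords), 1)

best = rerank("corn starch", ["starch of corn", "maize flour"], toy_score)[0]
```

In the embodiment the scoring function would instead come from the reranking model (or a combination with the generation model's likelihood); only the sorting scheme is shown here.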
- any of the devices described in this embodiment (the generation device 100, the extraction device) can be realized, for example, by causing a computer to execute a program.
- this computer may be a physical computer or a virtual machine in the cloud.
- the device can be realized by using hardware resources such as a CPU and memory built into a computer to execute a program corresponding to the processing performed by the device.
- the above program can be recorded on a computer-readable recording medium (such as a portable memory) and can be stored or distributed. It is also possible to provide the above program through a network such as the Internet or e-mail.
- FIG. 14 is a diagram showing an example of the hardware configuration of the computer.
- the computer in FIG. 14 includes a drive device 1000, an auxiliary storage device 1002, a memory device 1003, a CPU 1004, an interface device 1005, a display device 1006, an input device 1007, an output device 1008, etc., which are interconnected by a bus BS.
- the computer may further include a GPU.
- a program that realizes processing on the computer is provided, for example, on a recording medium 1001 such as a CD-ROM or a memory card.
- the program is installed from the recording medium 1001 to the auxiliary storage device 1002 via the drive device 1000.
- the program does not necessarily need to be installed from the recording medium 1001, and may be downloaded from another computer via a network.
- the auxiliary storage device 1002 stores installed programs as well as necessary files, data, and the like.
- the technology described in this embodiment makes it possible to automatically extract, with little noise, appropriate constraint phrases used in machine translation with vocabulary constraints. Furthermore, the technology described in this embodiment makes it possible to perform highly accurate translation in machine translation with vocabulary constraints.
- (Supplementary Note 1) An extraction device comprising a memory and at least one processor connected to the memory, wherein the processor: divides each of the first information in a dictionary, which is a set of pairs of first information and second information, and a first series into unit information; and extracts from the dictionary the second information corresponding to the first information that matches the unit information of the first series, as constraint information used to generate a second series based on the first series.
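A minimal sketch of this extraction, assuming whitespace tokenization as the division into unit information (a real implementation for Japanese would use morphological analysis): dictionary source-side entries are matched against contiguous token sequences of the input, and the paired target-side entries are collected as constraints. Function and variable names are illustrative.

```python
def extract_constraints(sentence, dictionary):
    """Return target-side entries (second information) whose source-side
    entry (first information) appears as a contiguous token sequence
    in the input sentence (first series)."""
    tokens = sentence.split()
    constraints = []
    for src, tgt in dictionary:
        src_tokens = src.split()
        n = len(src_tokens)
        for i in range(len(tokens) - n + 1):
            if tokens[i:i + n] == src_tokens:
                constraints.append(tgt)
                break
    return constraints

dictionary = [("corn starch", "cornstarch"), ("sweet corn", "sweetcorn")]
print(extract_constraints("add corn starch to the mix", dictionary))  # → ['cornstarch']
```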
- The extraction device according to Supplementary Note 1, wherein the processor deletes pairs that match a predetermined rule from the dictionary, and uses the dictionary that has undergone the deletion process.
- Pairs that match the predetermined rule include pairs containing words other than nouns or noun phrases, pairs consisting of words with a length of 1, and pairs in which the correspondence between the first information and the second information is not unique.
- The extraction device according to any one of Supplementary Notes 1 to 4, wherein the processor generates display information for transmitting the constraint information to the display unit, and receives constraint information that has been added to or modified with respect to the constraint information displayed on the display unit.
- A generation device wherein the processor: receives a first series as input and extracts constraint information based on the first series and a dictionary that is a set of pairs of first information and second information; generates a second series based on the constraint information and the first series; and generates display information for displaying the constraint information together with the second series in a modifiable format.
- The generation device according to Supplementary Note 6, wherein the processor obtains a series generated based on the received constraint information and generates display information for displaying the series.
- An extraction method executed by a computer, comprising: a dividing step of dividing each of the first information in a dictionary, which is a set of pairs of first information and second information, and a first series into unit information; and a constraint information extraction step of extracting from the dictionary the second information corresponding to the first information that matches the unit information of the first series, as constraint information used to generate a second series based on the first series.
- (Supplementary Note 10) A generation method executed by a computer, comprising: an extraction step of receiving a first series as input and extracting constraint information based on the first series and a dictionary that is a set of pairs of first information and second information; a generation step of generating a second series based on the constraint information and the first series; and a display information generation step of generating display information for displaying the constraint information together with the second series in a modifiable format.
- (Supplementary Note 11) A non-transitory storage medium storing a program for causing a computer to function as the extraction device according to any one of Supplementary Notes 1 to 5.
- A generation device wherein the processor: receives a constraint information list as input and outputs each element of subsets of one or more pieces of constraint information included in the constraint information list as a vocabulary constraint; generates one or more candidates for the second series using the first series and the vocabulary constraint; and calculates, for each of the one or more candidates, a score indicating suitability as the second series.
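The behavior claimed above — enumerating subsets of the constraint information list, generating candidates for each subset, and scoring each candidate — can be sketched as follows. `generate` and `score` are hypothetical stand-ins for the sequence generation unit and the scoring/reranking unit; exhaustive subset enumeration is shown only for clarity and would be pruned in practice.

```python
from itertools import combinations

def best_translation(source, constraint_list, generate, score):
    """Try each non-empty subset of the constraint list as a vocabulary
    constraint, generate candidates under it, and keep the candidate
    with the highest suitability score."""
    best, best_score = None, float("-inf")
    for r in range(1, len(constraint_list) + 1):
        for subset in combinations(constraint_list, r):
            for cand in generate(source, list(subset)):
                s = score(source, cand)
                if s > best_score:
                    best, best_score = cand, s
    return best
```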
- The generation device according to Supplementary Note 1, wherein the processor calculates the score based on at least one of a likelihood output by the model used to generate the candidates in the sequence generation unit and a likelihood obtained from the candidates by a reranking model.
- The generation device according to Supplementary Note 2, wherein, when the constraint information list includes constraint information having ambiguity, the processor generates the one or more candidates by performing a beam search with a vocabulary constraint that takes the ambiguity into account.
- The generation device according to any one of Supplementary Notes 1 to 3, wherein at least one piece of constraint information is input to the processor in a format that allows two or more ambiguities, and the processor generates a vocabulary constraint while maintaining the ambiguity.
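One simple way to keep such ambiguity is to represent each constraint as a set of alternative surface forms (for example, conjugated variants of one word), where a candidate satisfies the constraint if any one alternative appears. A minimal sketch, assuming whitespace tokenization; this checks satisfaction after generation rather than inside the beam search itself, and all names are illustrative.

```python
def satisfies(candidate, constraints):
    """Each constraint is a set of alternative surface forms
    (e.g. conjugated variants); any one alternative suffices.
    Returns True only if every constraint is satisfied."""
    words = set(candidate.split())
    return all(bool(alts & words) for alts in constraints)

constraints = [{"run", "runs", "running"}, {"fast"}]
print(satisfies("she runs fast", constraints))  # → True
```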
- A non-transitory storage medium storing a program for causing a computer to function as each unit of the generation device according to any one of Supplementary Notes 1 to 4.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Health & Medical Sciences (AREA)
- Artificial Intelligence (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Computational Linguistics (AREA)
- General Health & Medical Sciences (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Stored Programmes (AREA)
- Debugging And Monitoring (AREA)
Abstract
The invention relates to a generation device for generating a second sequence from constraint information and a first sequence that is a sequence of information, said second sequence being a separate sequence of information, the generation device comprising: an input generation unit that takes a constraint information list as input and outputs each element of one or more subsets of constraint information included in the constraint information list as a vocabulary constraint; a sequence generation unit that generates one or more candidates for the second sequence using the first sequence and the vocabulary constraints; and a reranking unit that calculates, for each of said one or more candidates, a score indicating suitability as the second sequence.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2022/026407 WO2024004184A1 (fr) | 2022-06-30 | 2022-06-30 | Dispositif de génération, procédé de génération et programme |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2022/026407 WO2024004184A1 (fr) | 2022-06-30 | 2022-06-30 | Dispositif de génération, procédé de génération et programme |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2024004184A1 true WO2024004184A1 (fr) | 2024-01-04 |
Family
ID=89382574
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2022/026407 WO2024004184A1 (fr) | 2022-06-30 | 2022-06-30 | Dispositif de génération, procédé de génération et programme |
Country Status (1)
Country | Link |
---|---|
WO (1) | WO2024004184A1 (fr) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0696114A (ja) * | 1992-09-11 | 1994-04-08 | Toshiba Corp | 機械翻訳システム及び文書編集装置 |
JP2016189154A (ja) * | 2015-03-30 | 2016-11-04 | 日本電信電話株式会社 | 翻訳方法、装置、及びプログラム |
- 2022-06-30: WO PCT/JP2022/026407 patent/WO2024004184A1/fr (status unknown)
Patent Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH0696114A (ja) * | 1992-09-11 | 1994-04-08 | Toshiba Corp | 機械翻訳システム及び文書編集装置 |
JP2016189154A (ja) * | 2015-03-30 | 2016-11-04 | 日本電信電話株式会社 | 翻訳方法、装置、及びプログラム |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US5895446A (en) | Pattern-based translation method and system | |
JP3189186B2 (ja) | パターンに基づく翻訳装置 | |
US20080040095A1 (en) | System for Multiligual Machine Translation from English to Hindi and Other Indian Languages Using Pseudo-Interlingua and Hybridized Approach | |
JP2005216126A (ja) | 他言語のテキスト生成方法及びテキスト生成装置 | |
JPS62163173A (ja) | 機械翻訳方法 | |
JP2000353161A (ja) | 自然言語生成における文体制御方法及び装置 | |
CN110678868B (zh) | 翻译支持系统、装置和方法以及计算机可读介质 | |
US20030139920A1 (en) | Multilingual database creation system and method | |
US20030083860A1 (en) | Content conversion method and apparatus | |
US20050273316A1 (en) | Apparatus and method for translating Japanese into Chinese and computer program product | |
Scannell | Statistical models for text normalization and machine translation | |
Dhanani et al. | FAST-MT Participation for the JOKER CLEF-2022 Automatic Pun and Humour Translation Tasks | |
Al-Mannai et al. | Unsupervised word segmentation improves dialectal Arabic to English machine translation | |
Yeong et al. | Using dictionary and lemmatizer to improve low resource English-Malay statistical machine translation system | |
JP2018072979A (ja) | 対訳文抽出装置、対訳文抽出方法およびプログラム | |
WO2024004184A1 (fr) | Dispositif de génération, procédé de génération et programme | |
WO2024004183A1 (fr) | Dispositif d'extraction, dispositif de génération, procédé d'extraction, procédé de génération et programme | |
Ouvrard et al. | Collatinus & Eulexis: Latin & Greek Dictionaries in the Digital Ages. | |
JP2006004366A (ja) | 機械翻訳システム及びそのためのコンピュータプログラム | |
Núñez et al. | Phonetic normalization for machine translation of user generated content | |
Anto et al. | Text to speech synthesis system for English to Malayalam translation | |
JP4829685B2 (ja) | 翻訳フレーズペア生成装置、統計的機械翻訳装置、翻訳フレーズペア生成方法、統計的機械翻訳方法、翻訳フレーズペア生成プログラム、統計的機械翻訳プログラム、および、記憶媒体 | |
Langlais et al. | General-purpose statistical translation engine and domain specific texts: Would it work? | |
Sankaravelayuthan et al. | A Comprehensive Study of Shallow Parsing and Machine Translation in Malaylam | |
JP2006024114A (ja) | 機械翻訳装置および機械翻訳コンピュータプログラム |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 22949457 Country of ref document: EP Kind code of ref document: A1 |