WO2024004184A1

WO2024004184A1 - Generation device, generation method, and program

Info

Publication number: WO2024004184A1
Application number: PCT/JP2022/026407
Authority: WO
Inventors: 克己帖佐; 睦森下; 昌明永田
Original assignee: 日本電信電話株式会社
Priority date: 2022-06-30
Filing date: 2022-06-30
Publication date: 2024-01-04

Abstract

A generation device for generating a second sequence from constraint information and a first sequence which is a sequence of information, said second sequence being a separate sequence of information, and wherein the generation device comprises: an input generation unit that uses a constraint information list as input and outputs each element of one or more constraint information subsets included in the constraint information list as a vocabulary constraint; a sequence generation unit that generates one or more candidates pertaining to the second sequence using the first sequence and the vocabulary constraints; and a re-ranking unit that calculates a score indicating the suitability as the second sequence for each of the one or more candidates.

Description

Generation device, generation method, and program

The present invention relates to the technical field of machine translation.

Vocabulary-constrained machine translation is when a sentence in one domain is translated into another domain (e.g., another language), with constraints imposed to ensure that all specified words (constraint words) are included. Since machine translation with vocabulary constraints can unify the translation of specific words, machine translation with vocabulary constraints is a particularly important technology in the translation of patents, legal documents, technical documents, etc. that require consistency.

In the machine translation method with vocabulary constraints, a translated sentence is generated to include the given constraint phrases. On the other hand, regardless of whether constraint terms are extracted automatically or manually, inappropriate constraint terms (noise) may be included in the plurality of constraint terms.

If a machine translation method with vocabulary constraints is applied with all constrained phrases containing noise as vocabulary constraints, incorrect words will be included in the translated sentence, and translation accuracy is expected to decrease. Note that such a problem is not limited to the field of machine translation, but can occur in any field in which sequence conversion is performed using constraint information.

The present invention has been made in view of the above points, and it is an object of the present invention to provide a technique for accurately performing sequence conversion using constraint information.

According to the disclosed technology, there is provided a generation device for generating a second series that is another information series from constraint information and a first series that is an information series,
an input generation unit that inputs a constraint information list and outputs each element of a subset of one or more constraint information included in the constraint information list as a vocabulary constraint;
a sequence generation unit that generates one or more candidates for the second sequence using the first sequence and the vocabulary constraint;
and a reranking unit that calculates a score indicating suitability as the second series for each of the one or more candidates.

According to the disclosed technology, a technology for accurately performing sequence conversion using constraint information is provided.

FIG. 2 is a diagram showing an example of machine translation with vocabulary constraints. 1 is a diagram showing a configuration example of a generation device 100. FIG. 3 is a flowchart for explaining the operation of the generation device 100. 3 is a diagram showing an example configuration of an extraction unit 120. FIG. 3 is a diagram showing an example configuration of an extraction unit 120. FIG. 1 is a diagram showing a configuration example of a generation device 100. FIG. 3 is a diagram showing an example of the configuration of a sequence generation unit 140. FIG. It is a diagram showing an example of the configuration of a machine translation model. 3 is a diagram showing an example of the configuration of a sequence generation unit 140. FIG. 5 is a diagram showing a display image on a display unit 500. FIG. 1 is a diagram showing a configuration example of a generation device 100. FIG. FIG. 3 is a diagram showing detailed settings and hyperparameters that serve as a base for each setting used in the experiment. It is a figure showing an evaluation result. It is a diagram showing an example of the hardware configuration of the device.

Hereinafter, an embodiment of the present invention (this embodiment) will be described with reference to the drawings. The embodiments described below are merely examples, and embodiments to which the present invention is applied are not limited to the following embodiments.

The embodiment described below shows an example in which the present invention is applied to machine translation, but the present invention can be applied to sequence conversion in any field as long as it uses constraint information. It is. For example, the present invention can be used for summarization tasks, utterance generation tasks, tasks for adding explanatory text to images, and the like.

Furthermore, in the embodiment described below, the unit of translation is a sentence, but the unit of translation may be any unit.

The generation device 100 described below provides certain improvements over prior art techniques such as performing constrained sequence transformations, and represents an improvement in the technical field of constrained sequence transformations. be. Additionally, the extraction apparatus described below provides certain improvements over the prior art in extracting constraint information and represents an improvement in the field of technology related to extracting constraint information.

(About the assignment)
Before explaining the configuration and operation according to this embodiment in detail, first, the prior art and the problems associated with it will be explained. Note that the following explanation of the problem is not a known technique. Further, the problems described below are problems related to the technology of the embodiment.

As already explained, machine translation with lexical constraints imposes constraints on the purpose of including all specified words when converting sentences from one domain to another domain (e.g., another language). It is called. For reference, FIG. 1 shows an example of input and output in machine translation with vocabulary constraints.

In the example in Figure 1, for the source language sentence ``We have developed a geometrical optical theory of standing waves based on ray coincidence.'', machine translation (MT Output), constraint words (Constraints), MT Output) is shown. The underlined portion indicates the constraint phrase.

As a conventional technology for machine translation with vocabulary constraints, see Non-Patent Document 1 "Chen, G., Chen, Y., and Li, V. O. (2021). "Lexically Constrained Neural Machine Translation with Explicit Alignment Guidance." Proceedings of The ``AAAI Conference on Artificial Intelligence'' discloses a machine translation method with vocabulary constraints for manually created constraint phrases. The method disclosed in Non-Patent Document 1 is also called a soft method. The method disclosed in Non-Patent Document 1 does not guarantee that the constraint phrase will always be included in the translated sentence.

Non-patent document 2 “Matt Post and David Vilar. 2018. Fast Lexically Constrained Decoding with Dynamic Beam Allocation for Neural Machine Translation. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Vol. ume 1 (Long Papers), pages 1314-1324, New Orleans, Louisiana. Association for Computational Linguistics” and Reference 1 “Chousa, K. and Morishita, M. (2021). “Input Augmentation Improves Constrained Beam Search for Neural Machine Translation : NTT at WAT 2021." In Proceedings of the 8th Workshop on Asian Translation (WAT), pp. 53--61, Online. Association for Computational Linguistics." A translation method is disclosed. This method guarantees that the constraint phrase will always be included in the translated sentence. This method is also called the hard method.

There are also use cases where constraint terms are created automatically rather than manually. For example, when translating documents in domains that include many proper nouns, such as patents and scientific and technical papers, translation memories and bilingual dictionaries created from past translation results are often used. A possible use case is to perform machine translation with vocabulary constraints.

On the other hand, when automatically extracting constraint phrases, it is possible that the extracted constraint phrases include words that become noise. Further, even when constraint words and phrases are extracted manually, noise may be included.

Conventional machine translation methods with vocabulary constraints, such as those disclosed in Non-Patent Documents 1 and 2, assume that the given constraint phrase is included in the reference translation. Therefore, if a machine translation method with vocabulary constraints is applied using the extracted constraint phrases as vocabulary constraints, incorrect words may be included in the translated sentence, and translation accuracy is expected to decrease.

Based on the above points, below we will discuss techniques for appropriately extracting constraint words by reducing noise, and methods for performing machine translation with vocabulary constraints with high accuracy even when using a set of constraint words that may contain noise. Explain the technology.

(Equipment configuration example, overall operation)
FIG. 2 shows a configuration example of the generation device 100 in this embodiment. As shown in FIG. 2, the generation device 100 includes an input section 110, an extraction section 120, an input generation section 130, a sequence generation section 140, a reranking section 150, and an output section 160.

Additionally, a bilingual dictionary DB 200 and a model DB 300 are provided. The bilingual dictionary DB 200 stores bilingual dictionaries, and the model DB 300 stores trained machine translation models. The bilingual dictionary DB 200 and the model DB 300 may be provided outside the generation device 100 (as in the example in FIG. 2), or may be provided inside the generation device 100.

The overall flow of operations by the generation device 100 will be described with reference to the flowchart in FIG. 3. In S101, a source language sentence is input using the input unit 110. In S102, the extraction unit 120 automatically extracts constraint phrases based on the source language sentence (input sentence) input by the input unit 110 and the bilingual dictionary read from the bilingual dictionary DB 200.

In S103, the input generation unit 130 generates a plurality of inputs (vocabulary constraints) from arbitrary combinations of constraint words. In S104, the series generation unit 140 translates the input sentence using the plurality of inputs generated in S103 and the machine translation model read from the model DB 300. Here, translation results are obtained for each of the plurality of inputs generated in S103. That is, the sequence generation unit 140 uses a certain sequence and vocabulary constraints to generate one or more candidates for another sequence based on a previously learned sequence conversion model.

In S105, the reranking unit 150 predicts a reranking score for each translation result using the input sentence. In S106, the output unit 160 outputs the translation result (target language sentence) with the highest score. The configuration and operation of the main functional units will be explained in detail below.

(Extraction unit 120)
First, the extraction unit 120 will be explained. The extraction unit 120 receives the source language sentence and the bilingual dictionary as input, and outputs the source language sentence and the constraint word list. Note that the source language sentence may not be output.

FIG. 4 is a block diagram of the extraction unit 120. As shown in FIG. 4, the extraction unit 120 includes a filtering unit 121, a division unit 122, and a constraint phrase extraction unit 123. The extraction unit 120 also refers to the bilingual dictionary 200. Note that the extraction section 120 may be configured without the filtering section 121.

The bilingual dictionary DB 200 stores a set of pairs of two words that are made to correspond when converting sequences. Specifically, in this embodiment, which targets translation, the bilingual dictionary DB 200 stores a set of <source language phrase, target language phrase> pairs. The source language phrase and the target language phrase may each consist of multiple words. In this embodiment, one <source language word/phrase, target language word/phrase> pair is referred to as a "bilingual translation". The source language phrase and the target language phrase may be called a source language translation word and a target language translation word, respectively.

Note that when the bilingual dictionary DB 200 is used for tasks other than translation, its contents are not limited to a set of <source language phrase, target language phrase> pairs.

The filtering unit 121 filters the bilingual translations that become noise from the bilingual dictionary. The bilingual dictionary after filtering is stored in the bilingual dictionary DB 200, and the dividing unit 122 and the constraint phrase extraction unit 123 refer to the bilingual dictionary after filtering.

The dividing unit 122 morphologically analyzes the source language sentence and the source language phrases in the bilingual dictionary. That is, the dividing unit 122 divides the source language sentences and the source language phrases in the bilingual dictionary into unit information. The constraint phrase extraction unit 123 extracts bilingual translations corresponding to the phrases (examples of unit information obtained by division) included in the source language sentence and creates a constraint phrase list. The processing of each part will be explained in more detail below.

<Extraction unit 120: Filtering unit 121>
The filtering unit 121 deletes bilingual translations that fall under (A) to (C) below, or words included in the bilingual translations, from the bilingual dictionary. However, the filtering unit 121 does not necessarily need to implement all of (A) to (C), but may implement at least one of (A) to (C). Further, filtering other than (A) to (C) may be performed. In particular, in Modifications 1 and 2, which will be described later, the process (C) below may be skipped.

(A) Parallel translations that include words other than nouns/noun phrases (excluding verbs because they have conjugations)
(B) Parallel translation consisting of words with a length of 1 An example of (B) is a one-character translation such as a unit. For example, the parallel translation of "target language: C, source language: degree" corresponds to (B).

(C) There is no unique correspondence between the source language and the target language (for example, there are multiple translations for one word in the source language)
For parallel translations that fall under (C), the parallel translations will be deleted. Alternatively, from a plurality of translated words, leave one translated word and delete the others, so that the source language phrase and the target language phrase are in a one-to-one relationship. Any method can be used to leave one translated word and delete the others from multiple translated words. For example, keep the first translated word listed, or select the one with the highest frequency of occurrence. You can use a method such as leaving it.

For example, the parallel translation of "source language: computer, target language: computer, computer" corresponds to (C). so that there is a one-to-one relationship between the source language phrase and the target language phrase.

<Extraction unit 120: Division unit 122>
The dividing unit 122 divides (tokenizes) the source language sentence and the source language translation of the bilingual dictionary into morpheme units, and inserts predetermined symbols (eg, spaces, "/") at morpheme boundaries. This division unit may be different from the division unit of division processing performed later when translating.

For example, if the source language sentence is "That is not the case," the source language sentence after processing by the dividing unit 122 will be "As long as it is, it is not."

<Extraction unit 120: Constraint phrase extraction unit 123>
The constraint phrase extraction unit 123 extracts bilingual translations corresponding to the phrases included in the source language sentence, and creates a constraint phrase list using the extracted bilingual translations. A specific example of the constraint phrase extraction method will be described below. Note that the format of the dictionary, the search method, etc. are not limited to the method described below, and other methods may be used as long as the method can extract the constraint phrases corresponding to the words included in the source language sentence.

In this example, we will use a bilingual dictionary that is expressed in units of characters of the source language translation using a data structure called a Trie tree.

The constraint word extraction unit 123 performs a prefix match search from the beginning of the source language sentence to a set of source language translated words in the bilingual dictionary. When a pair of translations containing a source language translation that matches a word in the source language sentence is found, the target translation is extracted as a constraint word. When performing a prefix match search, the parallel translation with the longest length of the source language translation is selected.

For example, assume that the source language sentence is divided by morphological analysis by the dividing unit 122, resulting in three words, ie, "ABC/GHI/XYZ". A, B, C, etc. here are letters. When the constraint phrase extraction unit 123 searches the source language translation of the bilingual dictionary using "ABC/GHI/XYZ", a match is found starting from the beginning (or front) of the sentence of "ABC/GHI/XYZ". .

As a result of the above search, for example, even if the four words "AB", "ABC", "ABCG", and "ABC/GHI" match, "AB" and "ABCG" are the units of morphemes, as will be explained later. Since they do not match, it is possible to prevent them from matching. In this case, from among the remaining "ABC" and "ABC/GHI", the target translated word paired with "ABC/GHI", which has the longest length of the source language translated word, is extracted as a constraint word. Thereafter, similar processing is performed using "XYZ" which is the part after "ABC/GHI".

As in this embodiment, the dividing unit 122 divides the source language sentence and the source language translation of the bilingual dictionary into morphemes (an example of unit information) in advance, and performs a search taking morpheme boundaries into consideration. This can prevent incorrect extraction of words whose units do not match. This is particularly effective when the source language is a language that does not have separate lines, such as Japanese. For example, it is possible to prevent the source language translated word "hana" (flower) from matching the source language sentence "so/as long as/de/ha/nai". In other words, "hana" cannot match "ha/na".

Note that the implementation of prefix match, longest match, and word division as described here are examples of means for realizing constraint phrase extraction with less noise and reduced ambiguity. Other means of disambiguation may be used.

For example, when performing morphological analysis in the segmentation unit 122, information necessary for disambiguation, such as part of speech, prototype, stem, conjugation, and reading (pronunciation), is attached to the divided words and the attached information is also used. Perform matching. In other words, by using not only the character string but also attached information such as its part of speech during matching, for example, the character "in" in the source language sentence can be matched with the preposition in, which is the source language translation, and the noun inn (inn). The ambiguity can be resolved by placing the situation in which both of the above match. Resolving ambiguity during matching is an important element in improving translation accuracy.

<Other examples of the configuration of the extraction unit 120>
Note that the extraction unit 120 may have the configuration shown in FIG. 5 instead of the configuration shown in FIG. 4. In the configuration shown in FIG. 5, the filtering section 121 performs filtering on the constraint phrases extracted by the constraint phrase extraction section 121, instead of filtering the bilingual dictionary.

The filtering process is similar to the process by the filtering unit 121 described above. However, "parallel translation" should be read as "constraint phrase". Specifically, the filtering unit 121 deletes constraint phrases that apply to (A) to (C) below from the extraction results by the constraint phrase extraction unit 123. However, the filtering unit 121 does not necessarily need to implement all of (A) to (C), but may implement at least one of (A) to (C). Further, rules other than (A) to (C) may be used. In particular, when performing Modifications 1 and 2, which will be described later, the following process (C) may be skipped.

(A) Restricted phrases that include words other than nouns/noun phrases (verbs are deleted because they have conjugations)
(B) Constraint phrases consisting of a word with a length of 1. (C) Constraints in which there is no unique correspondence between the source language and the target language (for example, multiple constraint phrases for one word in the source language) exists)
When implementing (C), if there are multiple constraint phrases for one word in the source language, for example, delete the multiple constraint phrases, or select one constraint phrase from the multiple constraint phrases. By leaving the ``words'' and deleting the others, the source language phrases and target language phrases become one-to-one.

(Regarding other configuration examples of the extraction unit 120 and the generation device 100)
The extraction unit 120 may be a single device independent of the generation device 100. This single device may also be referred to as an extraction device. Note that the extraction unit 120 included in the generation device 100 may also be referred to as an extraction device. Further, the generation device 100 having the extraction unit 120 may be referred to as an extraction device. Furthermore, both the extraction unit 120 and the extraction device may include both or one of the display information generation unit 170 and the correction unit 180 in the embodiment described later.

In the case where the extraction unit 120 is configured as a single device independent of the generation device 100, the generation device 100 may not include the extraction unit 120. The configuration of the generation device 100 in this case is shown in FIG. In the configuration of FIG. 6, the constraint phrase list generated by the extraction device is input to the generation device 100. However, in the configuration of FIG. 6, a constraint phrase list that is not the constraint phrase list generated by the extraction device (eg, a constraint phrase list that includes a lot of noise) may be input to the generation device 100.

The operations of the input generation section 130, sequence generation section 140, and reranking section 150 in FIG. 6 are the same as the operations of the input generation section 130, sequence generation section 140, and reranking section 150 in FIG.

(Input generation unit 130)
Next, the input generation unit 130 will be explained. The input generation unit 130 receives the constraint phrase list as input, and sets all elements of a subset of the words included in the constraint phrase list as vocabulary constraints. However, some elements among all the elements may be subjected to vocabulary constraints.

Finally, the input generation unit 130 outputs the above vocabulary constraints as the vocabulary constraints corresponding to the source language sentence input to the extraction unit 120. A specific example is shown below.

Assume that {A, B, C} is input to the input generation unit 130 as a constraint phrase list. Here, A, B, and C are each constraint phrases.

The input generation unit 130 generates {}, {A}, {B}, {C}, {A, B}, {A, C}, {B, C}, {A, B, C} and output each of them as a vocabulary constraint.

Note that {{}, {A}, {B}, {C}, {A, B}, {A, C}, {B, C}, {A, B, C}} is a constraint vocabulary set. , one {...} is one vocabulary constraint.

There are 2 ^|C| vocabulary constraints created from a subset of the list C of constraint phrases, and as will be described later, a plurality of translation sentence candidates are obtained from each of the vocabulary constraints.

(Sequence generation unit 140)
Next, the sequence generation unit 140 will be explained. It is assumed that the series generation unit 140 holds a trained machine translation model read from the model DB 300. Furthermore, the sequence generation unit 140 repeats the following process by the number of lexical constraints (the number of elements in the lexical constraint set). For example, the vocabulary constraint set is {{}, {A}, {B}, {C}, {A, B}, {A, C}, {B, C}, {A, B, C}} Then, repeat 8 times.

The sequence generation unit 140 receives an input sentence (source language sentence) and vocabulary constraints as input. The sequence generation unit 140 generates a translated sentence (target language sentence) using a machine translation model by applying an existing method of machine translation with vocabulary constraints. Here, a plurality of translated sentences are generated as translated sentence candidates (target language sentence candidates). In addition, a translation sentence candidate is given a score as a translation sentence.

Any existing method for machine translation with vocabulary constraints may be used; for example, LeCA or LeCA+LCD may be used. LeCA is disclosed in Non-Patent Document 1 and is also called the soft method. LeCA+LCD is disclosed in Reference 1 mentioned above, and is also called the hard method.

The series generation unit 140 outputs a plurality of generated translation sentence candidates. As an example, the sequence generation unit 140 outputs a predetermined number of translation sentence candidates in descending order of scores. The "predetermined number" may be one. In other words, only the translated sentence with the highest score may be output. Here, for example, 30 translation candidates are output for each vocabulary constraint.

<Configuration example of sequence generation unit 140>
FIG. 7 shows a configuration example of the sequence generation unit 140. As shown in FIG. 7, the sequence generation section 140 includes a sequence conversion section 141 and a search section 142.

Note that when using the soft method to generate translated sentences, the sequence conversion unit 141 uses lexical constraint information, and when using the hard method, the sequence conversion unit 141 uses the lexical constraints depending on the type of hard method. Sometimes it's used, sometimes it's not. Arrows for inputting vocabulary constraints to the series conversion unit 142 are indicated by dotted lines. Among the hard methods, in the aforementioned LeCA+LCD, information on vocabulary constraints is used in the sequence conversion unit 141. The configuration/operation assuming LeCA+LCD will be described below.

The sequence conversion unit 141 uses a general encoder-decoder model (for example, Transformer) as a machine translation model, which has an encoder and a decoder, as shown in FIG. model can be used. However, the invention can be implemented using models other than the encoder-decoder model.

The sequence conversion unit 141 receives the source language sentence and vocabulary constraints as input, first expands the source language sentence using the vocabulary constraints to create an input sequence with information on the vocabulary constraints added, and then machine translates it. Use as input to the model.

More specifically, in the above expansion, the sequence conversion unit 141 converts the source language sentence X, which is the input sequence, and each constraint word C _i into a character string indicating a special delimiter called <sep> as shown below. Create an input sequence with lexical constraints by combining (concatenating) via . <eos> is a character string representing the end of a sentence.

[X, <sep>, C ₁ , <sep>, C ₂ ,..., C _N , <eos>]
The sequence conversion unit 141 generates a sentence by using the expanded input sequence as input to a machine translation model. More specifically, the probability of each word in a set of words that can constitute an output sequence is output.

The search unit 142 uses the output probability of the decoder in the machine translation model to search for (an approximate solution of) the output sequence that maximizes the generation probability when the input sequence is given. The search unit 142 uses a grid beam search method based on beam search to ensure that the output sequence satisfies all of the constraint vocabulary.

Note that the searching unit 142 searching using grid beam search is an example. Any processing method may be used as long as it performs a lexically constrained search so as to include the constraint word/phrase.

(Reranking section 150)
Next, the reranking unit 150 will be explained. The reranking unit 150 receives as input one or more translated sentence candidates generated by the series generation unit 140. For example, if the series generation unit 140 generates 30 translation candidates per lexical constraint and there are 8 lexical constraints, the reranking unit 150 generates 30 translation candidates per lexical constraint. Each time, 30 translation candidates are received as input.

Next, the reranking unit 150 calculates a score for each translated sentence candidate using the input sentence (source language sentence), and outputs the translated sentence candidate with the highest score as the final translated sentence. Here, instead of narrowing down to the one with the highest score, all (or some) of the translated sentences and their scores may be output. Thereby, the output unit 160 can present the translated sentences to the user in a ranking format using the scores.

As for the method of calculating the score by the reranking unit 150, any method may be used as long as it can calculate the score of the translated sentence, but for example, the methods in Example 1 and Example 2 below may be used. be able to.

Example 1:
The reranking unit 150 uses, as a score, the likelihood of translated sentence candidates output by the machine translation model used for translation in the sequence generation unit 140.

Example 2:
The reranking unit 150 uses a machine translation model learned by Transformer, which is an encoder-decoder model, for a Right-to-Left translation task that generates a translated sentence from the end of a sentence to the beginning of a sentence. The likelihood when translation candidates are forced to be output is used as a score. Forcibly outputting translated sentence candidates may be rephrased as forced decoding using translated sentence candidates.

That is, the source language sentence is input to the encoder of the reranking model, and the words of the translation sentence candidate whose score (likelihood) is to be evaluated are sequentially input to the decoder of the reranking model.

Note that in Examples 1 and 2, the likelihood output by the machine translation model may be any value as long as it indicates plausibility. The likelihood output by the machine translation model may be a probability or a value other than probability.

Furthermore, the reranking unit 150 may calculate the reranking score using both the likelihood of Example 1 and the likelihood of Example 2. For example, the average of the likelihood of Example 1 and the likelihood of Example 2 may be used as the reranking score.

(Modification 1)
Next, modification 1 will be explained. In the first modification, the constraint phrase list generated by the extraction unit 120 can be one in which a plurality of target language phrases correspond to one source language phrase. Such a constraint phrase list may be called a constraint phrase list that allows multiple translations. For example, if the filtering unit of the extraction unit 120 does not perform the step (C), such a constraint phrase list may be generated.

For example, there are A and A' as a plurality of target language phrases for a certain source language phrase, and "A, A', B, C" including these and B and C are added to the constraint phrase list by the extraction unit 120. Suppose it is generated as multiple elements. For example, if the word in the source language sentence is "computer" and the word in the target language sentence is "computer", A and A' correspond to "computer" and "computer".

Here, such a constraint phrase list having multiple elements is expressed as {{A, A'}, {B}, {C}}.

The input generation unit 130 that receives {{A, A′}, {B}, {C}} from the extraction unit 120 generates {}, {A}, {B}, {C}, {A, B} , {A, C}, {B, C}, {A, B, C}, as well as {A'}, {A', B}, {A', C}, {A', B, C} is also generated as a vocabulary constraint.

The input generation unit 130 inputs each of the plurality of generated vocabulary constraints to the sequence generation unit 140.

The sequence generation unit 140 generates {}, {A}, {B}, {C}, {A, B}, {A, C}, {B, C}, {A, B, C}, {A' By using each of the 12 vocabulary constraints of }, {A', B}, {A', C}, and {A', B, C} as vocabulary constraints, machine translation with vocabulary constraints is performed 12 times. and obtain translation candidates. For example, if one translated sentence candidate is generated for one vocabulary constraint, 12 translated sentence candidates will be obtained.

After performing the machine translation with vocabulary constraints, the reranking unit 150 performs the reranking process using the method described above, and outputs, for example, the translated sentence candidate with the highest score as the final translated sentence. .

(Modification 2)
Next, modification 2 will be explained. Also in the second modification, the constraint phrase list generated by the extraction unit 120 can be one in which a plurality of target language phrases correspond to one source language phrase.

In the second modification, in the translation search process in the search unit 142 of the sequence generation unit 140, the search may be performed by allowing the existence of a plurality of expression types of one constraint phrase. In other words, the search may be performed such that one element is satisfied from each constraint word candidate. Specifically, the details are as follows.

In Modified Example 2 as well, there are A and A' as a plurality of target language phrases for a certain source language phrase, and "A, A', B, C" including these, B, and C are extracted by the extraction unit 120. Suppose that it is generated as multiple elements of a constraint word list. Here, it is assumed that {A, B, C} is generated as the constraint phrase list, and information indicating that A may be replaced by A' is input from the extraction unit 120 to the input generation unit 130. Alternatively, {A, A', B, C} may be generated as the constraint phrase list, and information indicating that either A or A' is acceptable may be input from the extraction unit 120 to the input generation unit 130. good. Note that although the above example shows an example of a format that allows two ambiguities for one constraint phrase, it is also possible to use a format that allows three or more ambiguities for one constraint phrase. Good too.

For example, if three ambiguities are allowed for A, {A, B, C} is generated as the constraint word list, and information indicating that A may be A' or A'' is generated. It is assumed that the input is input from the extraction unit 120 to the input generation unit 130. Alternatively, {A, A', A'', B, C} is generated as a constraint word list, and information indicating that any of A, A', and A'' is acceptable is sent from the extraction unit 120 to the input generation unit. 130 may be input.

For the constraint word list {A, B, C}, if A may be A', the input generation unit 130 generates {}, {{A, A'}}, {{B}}, {{C}} , {{A, A'}, {B}}, {{A, A'}, {C}}, {{A, A'}, {B}, {C}} seven vocabulary candidate constraints generate. In addition, in Modification 2, there is ambiguity in the translated word because there are cases where multiple target language words (e.g., A, A') correspond to a certain source language word, and the vocabulary used as a constraint is determined. Therefore, we call it ``vocabulary candidate constraint'' instead of lexical constraint. That is, the "vocabulary candidate constraint" is a vocabulary constraint that maintains ambiguity. Note that the expression format of the vocabulary candidate constraints described above is an example. As long as it can be expressed that either A or A' is acceptable, expression formats other than those described above may be used as the expression format.

As shown in FIG. 9, the sequence generation unit 140 receives the source language sentence and the vocabulary candidate constraints as input. The sequence generation unit 140 generates {}, {{A, A′}}, {{B}}, {{C}}, {{A, A′}, {B}}, {{A, A′} , {C}}, {{A, A'}, {B}, {C}}, each of the seven vocabulary candidate constraints was used as a vocabulary candidate constraint, and machine translation with vocabulary constraints was performed seven times. , obtain translation candidates. For example, if one translated sentence candidate is generated for one vocabulary candidate constraint, seven translated sentence candidates will be obtained.

When using a vocabulary candidate constraint including {A, A'}, the search unit 142 of the sequence generation unit 140 performs a search assuming that the word A may be A'. In other words, a search is performed that takes ambiguity into account. For searching, for example, reference 2 “Peter Anderson, Basura Fernando, Mark Johnson, and Stephen Gould. 2017. Guided Open Vocabulary Image Captioning with Constrained Beam Search In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 936-945, Copenhagen, Denmark. Association for Computational Linguistics can be used. This method is an example of a "search considering ambiguity" method.

The method disclosed in Reference 2 performs a beam search with lexical constraints that takes into account the ambiguity of the translated word, which may be either A or A'. In other words, ambiguity between A and A' is resolved during beam search.

Note that the method disclosed in Reference 2 is a language generation method, but is not a translation technology. There is no prior art that applies this method to search during translation decoding.

In addition, in the description of the embodiment so far and the description of Modifications 1 and 2, multiple target language phrases (e.g. A, A') for a certain source language phrase are synonyms such as "calculator" and "computer". In addition to the word ``Trunk'', it may be a word or phrase other than a synonym, such as ``car trunk'', ``elephant's trunk'', trunk, trunk line, etc. Since the meaning of words is not taken into account when searching in the search unit 142, A and A' may be completely unrelated words. Furthermore, when using the technology according to the present invention for tasks other than translation, arbitrary criteria can be used to determine which words in the converted series correspond to words in the original series. Bye.

In addition, in the description of the embodiment so far and the description of Modifications 1 and 2, words that are not the original form are converted to the original form during morphological analysis in consideration of word transformations (plural forms, changes in tense, etc.). You may also do so.

As an example, suppose that the bilingual dictionary is English-Japanese, and there is a dictionary entry for "corn". In this case, it is assumed that the source language sentence "We roasted corns over the charcoal." is input to the generating device 100. In this case, when the extraction unit 120 performs matching on a morpheme basis, corns is matched in the bilingual dictionary because corn is included as a morpheme. However, for example, if the bilingual dictionary entry is feet and the morpheme in the input sentence is foot, there will be no match. Therefore, by returning the foot to its original form and performing matching, this problem can be resolved.

(Example)
Next, as a more specific example, an embodiment using the techniques described above will be described. In this embodiment, each time a constraint phrase is edited (corrected or added) on a display unit 500 (device capable of display and input operations), which will be described later, a target language sentence (translation sentence) for the constraint phrase is created. It is possible to confirm.

<Display image>
First, a display image on the display unit 500 will be described with reference to FIG. 10. In the example shown in FIG. 10, the user inputs "A prototype superconducting single-phase autotransformer having an auxiliary winding only in the shunt winding." is input as the source language sentence, and presses "Send".

The display unit 500 displays a plurality of constraint phrases (constraint phrase list) for the input source language sentence. The words and phrases displayed here as constraint words are the words and phrases that have been filtered by the filtering unit 121. On the right side, the filtered constraint phrases are displayed in the form of "Do you want to add?"

Users can select the constraint phrases they wish to modify (or delete) or add by marking checkboxes, and modify (or delete)/add the selected constraint phrases by pressing the corresponding button. . It is also possible to add constraint phrases created by the user himself.

By pressing "Update" on the display image, you can display the target language sentence using the current constraint phrase.

<Device configuration and operation>
FIG. 11 shows a configuration example of a generation device 100 for realizing the above display. As shown in FIG. 11, the generation device 100 of this embodiment includes an extraction section 120, a display information generation section 170, a modification section 180, a generation section 190, a bilingual dictionary DB 200, and a constraint phrase list DB 400. Note that the modification unit 180 may be included in the display information generation unit 170.

Note that the bilingual dictionary DB 200 and the constraint phrase list DB 400 may be provided outside the generation device 100. Furthermore, the generation unit 190 may also be provided outside the generation device 100 (eg, another server). Further, the generation device 100 may be used for the purpose of displaying a list of constraint words on the display unit 500. In that case, the generation device 100 may include only the extraction unit 120 and the display information generation unit 170 among the functional units shown in FIG. The generation device 100 may also be called an extraction device. The functions of each part are as follows.

The extraction unit 120 is the extraction unit 120 shown in FIG. 4 or 5. It takes the source language sentence as input and outputs a list of constraint words. The output constraint phrase list is stored in the constraint phrase list DB 400 and is input to the display information generation section 170. Further, the extraction unit 120 may output the filtered constraint words as a filter word list. The output filter word list is input to the display information generation section 170.

The display information generation unit 170 generates information for displaying the constraint phrase list on the display unit 500 (referred to as constraint phrase list presentation information). The constraint word/phrase list presentation information includes a constraint word/phrase list. Further, the information for presenting the constraint word/phrase list may include information on the filter word/phrase list as deleted information, filter candidate words, or additional candidates. The constraint word list presentation information is transmitted from the display information generation section 170 to the display section 500 and input to the display section 500. Further, the display information generation unit 170 may generate display information to be displayed together with the target language sentence (translated sentence) generated using the constraint phrase in a format in which the constraint phrase can be modified. Further, when the generation device 100 receives the added or modified constraint phrase from the display unit 500, the display information generation unit 170 generates a target language sentence (translated sentence) based on the received constraint phrase. Display information for displaying the target language sentence (translated sentence) may be generated.

Additionally, the display information generation unit 170 may generate “correction support information” for the user to check the constraint phrase list, and transmit it to the display unit 500. The modification support information includes at least one of a source language sentence input by the user, an extracted constraint phrase list, and a target language sentence generated based on the extracted constraint phrase list.

The modification unit 180 receives from the display unit 500 at least one of the additional constraint phrases and the modified constraint phrases as information that the user has modified the presented constraint phrase list.

The modification unit 180 modifies the information stored in the constraint phrase list DB 400 based on the received information. In addition, when the constraint phrase list is modified, a target language sentence is generated again by machine translation with vocabulary constraints based on the modified constraint phrase list, and the display information generation unit 170 generates a target language sentence. The modification support information may be generated and displayed on the display section 500 by transmitting it to the display section 500.

The generation unit 190 includes an input generation unit 130, a sequence generation unit 140, and a reranking unit 150. As explained above, the generation unit 190 uses these functional units to generate an objective that takes into account vocabulary constraints based on the constraint phrase list read from the constraint phrase list DB 400 and the source language sentence received from the display unit 500. A language sentence (translated sentence) is generated, and the generated target language sentence is input to the display information generation section 170.

The display unit 500 is, for example, a computer (terminal) having a display. The display unit 500 is connected to the generation device 100 via a network. As described with reference to FIG. 10, the display unit 500 receives a source language sentence from the user and displays a list of constraint words and the like. The display unit 500 also accepts instructions for adding and modifying constraint words and sentences in the source language. The display unit 500 can also output a source language sentence, a final target language sentence, and a final constraint phrase list as a set.

The generation device 100 in the above embodiment generates a target language sentence (translated sentence) that is closer to the user's image by interactively repeating the process of modifying the restricted word list while checking the results of machine translation with vocabulary constraints. can do.

(Experimental result)
In the following explanation of the experimental results, "the generation device 100 in this embodiment" will be referred to as the proposed method or the proposed system.

In order to confirm the effectiveness of the machine translation method with lexical constraints based on re-ranking of translation candidates based on the lexical constraints automatically extracted using the proposed method, we conducted a machine translation method with lexical constraints using lexical constraints automatically extracted from a bilingual dictionary for Japanese-English translation. Accuracy evaluation was performed.

<About the bilingual dictionary>
As bilingual dictionaries used to extract vocabulary constraints, we used the EDR Japanese-English bilingual dictionary (EDR-JE), which is a general-purpose dictionary, and the bilingual dictionary from the Japanese-English translation system ALT-J/E.

<Model>
The following translation models were used for evaluation.

・Transformer
・LeCA + {EDR-JE, ALT-J/E}
・LeCA+LCD + {EDR-JE, ALT-J/E}
ASPEC was used as the bilingual corpus used for training and evaluating the translation model. For detailed settings and hyperparameters of each model, those shown in FIG. 12 were used.

For the constraints extracted from the dictionary, a collection of the top 30 sentences generated from each of the 2 ^|C| vocabulary constraints was used. The score used for reranking translation candidates was the score calculated by the reranking model from the source language sentence and translation candidates using Reranker.

For the reranking model, we used a model trained with Transformer (big) for the Right-to-Left translation task, which generates translated sentences from the end of the sentence to the beginning of the sentence. The re-ranking score used was the likelihood when forced decoding was performed on the input translation candidates. BLEU, an automatic evaluation scale for translation accuracy, was used to evaluate each method.

<About experimental results>
FIG. 13 shows the translation accuracy of each method when using vocabulary constraints automatically extracted by a bilingual dictionary. It can be seen that LeCA and LeCA+LCD are able to improve translation accuracy compared to the baseline (Transformer) in Reranker, which uses scores based on the reranking model. Moreover, from FIG. 13, it can be seen that the translation accuracy is high regardless of the type of dictionary.

(Hardware configuration example)
Any of the devices (generating device 100, extracting device) described in this embodiment can be realized, for example, by causing a computer to execute a program. This computer may be a physical computer or a virtual machine on the cloud.

That is, the device can be realized by using hardware resources such as a CPU and memory built into a computer to execute a program corresponding to the processing performed by the device. The above program can be recorded on a computer-readable recording medium (such as a portable memory) and can be stored or distributed. It is also possible to provide the above program through a network such as the Internet or e-mail.

FIG. 14 is a diagram showing an example of the hardware configuration of the computer. The computer in FIG. 14 includes a drive device 1000, an auxiliary storage device 1002, a memory device 1003, a CPU 1004, an interface device 1005, a display device 1006, an input device 1007, an output device 1008, etc., which are interconnected by a bus BS. Note that the computer may further include a GPU.

A program that realizes processing on the computer is provided, for example, on a recording medium 1001 such as a CD-ROM or a memory card. When the recording medium 1001 storing the program is set in the drive device 1000, the program is installed from the recording medium 1001 to the auxiliary storage device 1002 via the drive device 1000. However, the program does not necessarily need to be installed from the recording medium 1001, and may be downloaded from another computer via a network. The auxiliary storage device 1002 stores installed programs as well as necessary files, data, and the like.

The memory device 1003 reads and stores the program from the auxiliary storage device 1002 when there is an instruction to start the program. CPU 1004 implements functions related to light touch maintenance device 100 according to programs stored in memory device 1003. The interface device 1005 is used as an interface for connecting to a network or the like. A display device 1006 displays a GUI (Graphical User Interface) and the like based on a program. The input device 1007 is composed of a keyboard, a mouse, buttons, a touch panel, or the like, and is used to input various operation instructions. An output device 1008 outputs the calculation result.

(Summary of embodiments, effects, etc.)
As described above, the technology described in this embodiment makes it possible to appropriately automatically extract constraint phrases used in machine translation with vocabulary constraints with low noise. Furthermore, with the technology described in this embodiment, it is possible to perform translation with high accuracy in machine translation with vocabulary constraints.

Regarding the above embodiments, the following additional notes 1 and 2 are further disclosed.

<Additional note 1>
(Additional note 1)
memory and
at least one processor connected to the memory;
including;
The processor includes:
dividing each of the first information and the first series in a dictionary which is a set of pairs of first information and second information into unit information;
The second information corresponding to the first information that matches the unit information of the first series is extracted from the dictionary as constraint information used to generate a second series based on the first series. Extraction device.
(Additional note 2)
The extraction device according to supplementary note 1, wherein the processor deletes pairs that match a predetermined rule from the dictionary, and uses the dictionary that has been subjected to the deletion process.
(Additional note 3)
Pairs that match the predetermined rules include words other than nouns, pairs containing words other than noun phrases, pairs consisting of words with a length of 1, and pairs that are unique in the correspondence between the first information and the second information. The extraction device according to Supplementary Note 2, wherein at least one of the pairs has no gender.
(Additional note 4)
The extraction device according to any one of Supplementary Notes 1 to 3, wherein the processor performs matching between the unit information of the first series and the first information so as to resolve ambiguity.
(Additional note 5)
The processor generates display information for transmitting the constraint information to the display unit, and receives constraint information added or modified to the constraint information displayed on the display unit. Additional notes 1 to 4. The extraction device according to any one of these.
(Additional note 6)
memory and
at least one processor connected to the memory;
including;
The processor includes:
extracting constraint information based on the first series and a dictionary that is a set of pairs of first information and second information with the first series as input;
generating a second sequence based on the constraint information and the first sequence;
A generation device that generates display information for displaying the constraint information together with the second series in a modifiable format.
(Supplementary Note 7)
When receiving added or modified constraint information, the processor obtains a series generated based on the received constraint information, and generates display information for displaying the series. According to supplementary note 6. generator.
(Supplementary Note 8)
The generating device according to supplementary note 6 or 7, wherein the processor generates display information for displaying filtered constraint information as additional candidates based on a predetermined rule.
(Supplementary Note 9)
A computer-implemented extraction method, comprising:
a dividing step of dividing each of the first information and the first series into unit information in a dictionary that is a set of pairs of first information and second information;
The second information corresponding to the first information that matches the unit information of the first series is extracted from the dictionary as constraint information used to generate a second series based on the first series. An extraction method comprising a constraint information extraction step.
(Supplementary Note 10)
A generation method executed by a computer,
an extraction step of receiving the first series as input and extracting constraint information based on the first series and a dictionary that is a set of pairs of first information and second information;
a generation step of generating a second sequence based on the constraint information and the first sequence;
a display information generation step of generating display information for displaying the constraint information together with the second series in a modifiable format.
(Supplementary Note 11)
A non-temporary storage medium storing a program for causing a computer to function as the extraction device according to any one of Additional Items 1 to 5.

<Additional note 2>
(Additional note 1)
A generation device for generating constraint information and a second series that is another information series from a first series that is an information series,
memory and
at least one processor connected to the memory;
including;
The processor includes:
inputting a constraint information list, outputting each element of a subset of one or more constraint information included in the constraint information list as a vocabulary constraint;
generating one or more candidates for the second sequence using the first sequence and the vocabulary constraint;
A generation device that calculates a score indicating suitability as the second series for each of the one or more candidates.
(Additional note 2)
The processor calculates the score based on at least one of a likelihood output by a model used to generate the candidate in the sequence generation unit and a likelihood obtained from the candidate by a reranking model. The generating device according to supplementary note 1.
(Additional note 3)
In the case where the constraint information list includes constraint information having ambiguity, the processor generates the one or more candidates by performing a beam search with a vocabulary constraint that takes the ambiguity into account. 2. The generating device according to 2.
(Additional note 4)
At least one piece of constraint information is input to the processor in a format that allows two or more ambiguities, and the processor generates a vocabulary constraint while maintaining the ambiguity. Any one of Additional Notes 1 to 3. The generator described in .
(Additional note 5)
A generation method performed by a computer for generating constraint information and a second series that is another information series from a first series that is an information series, the method comprising:
an input generation step of inputting a constraint information list and outputting each element of a subset of one or more constraint information included in the constraint information list as a vocabulary constraint;
a sequence generation step of generating one or more candidates for the second sequence using the first sequence and the vocabulary constraint;
A generation method comprising: a reranking step of calculating a score indicating suitability as the second series for each of the one or more candidates.
(Additional note 6)
A non-temporary storage medium storing a program for causing a computer to function as each part of the generation device according to any one of Supplementary Notes 1 to 4.

Although the present embodiment has been described above, the present invention is not limited to such specific embodiment, and various modifications and changes can be made within the scope of the gist of the present invention as described in the claims. It is possible.

100 Generation device 110 Input section 120 Extraction section 121 Filtering section 122 Division section 123 Constraint phrase extraction section 130 Input generation section 140 Sequence generation section 141 Sequence conversion section 142 Search section 150 Reranking section 160 Output section 170 Display information generation section 180 Modification section 190 Generation unit 200 Bilingual dictionary DB
300 model DB
400 Constraint phrase list DB
500 Display unit 1000 Drive device 1001 Recording medium 1002 Auxiliary storage device 1003 Memory device 1004 CPU
1005 Interface device 1006 Display device 1007 Input device 1008 Output device

Claims

A generation device for generating constraint information and a second series that is another information series from a first series that is an information series,
an input generation unit that inputs a constraint information list and outputs each element of a subset of one or more constraint information included in the constraint information list as a vocabulary constraint;
a sequence generation unit that generates one or more candidates for the second sequence using the first sequence and the vocabulary constraint;
and a reranking unit that calculates a score indicating suitability as the second series for each of the one or more candidates.
The reranking unit calculates the score based on at least one of the likelihood output by the model used to generate the candidates in the sequence generation unit and the likelihood obtained from the candidates by the reranking model. The generating device according to claim 1.
When the constraint information list includes constraint information having ambiguity, the sequence generation unit generates the one or more candidates by performing a beam search with vocabulary constraints that take ambiguity into consideration. 1. The generating device according to 1.
Generation according to claim 1, wherein at least one piece of constraint information is input to the input generation unit in a format that allows two or more ambiguities, and the input generation unit generates the lexical constraint while maintaining the ambiguity. Device.
A generation method executed by a generation device for generating constraint information and a second series that is another information series from a first series that is an information series, the method comprising:
an input generation step of inputting a constraint information list and outputting each element of a subset of one or more constraint information included in the constraint information list as a vocabulary constraint;
a sequence generation step of generating one or more candidates for the second sequence using the first sequence and the vocabulary constraint;
A generation method comprising: a reranking step of calculating a score indicating suitability as the second series for each of the one or more candidates.
A program for causing a computer to function as each part of the generation device according to any one of claims 1 to 4.