JP2022006237A - Natural language processing system and natural language processing method

Info

Publication number: JP2022006237A
Application number: JP2020108361A
Authority: JP
Inventors: 夢如王; Mengru Wang
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2020-06-24
Filing date: 2020-06-24
Publication date: 2022-01-13

Abstract

PROBLEM TO BE SOLVED: To efficiently improve the accuracy of natural language processing such as machine translation.
SOLUTION: A natural language processing system generates synonymous expressions of a sentence in a monolingual corpus using a synonymous expression generation model, and calculates, for the sentence and for each of its synonymous expressions, an evaluation value of the ease of natural language processing. The system trains the synonymous expression generation model using, as learning data, synonymous expression information that includes the sentence, its synonymous expressions, and the evaluation values, so that the evaluation value becomes high for a synonymous expression that is easier to translate and low for one that is difficult to translate. The system generates an adaptation destination bilingual corpus by collecting (a) pairs of a synonymous expression that is easy to machine-translate with an adaptation source translation model and the translation of that expression, and (b) pairs of a synonymous expression that is difficult to machine-translate with the adaptation source translation model and the reference translation of the sentence in the adaptation source bilingual corpus from which that expression originates. The system then trains an adaptation destination translation model using the generated adaptation destination bilingual corpus.
SELECTED DRAWING: Figure 1

Description

The present invention relates to a natural language processing system and a natural language processing method.

Patent Document 1 describes, for encoder/decoder neural machine translation, strengthening the encoder by using a monolingual corpus of the target language in order to improve the accuracy of the translator as a whole.

Non-Patent Document 1 describes a technique for automatically searching the source sentence for the parts relevant to predicting a target word.

Non-Patent Document 2 describes a technique for training both neural machine translation (NMT) and statistical machine translation (SMT) systems using only monolingual corpora.

Patent Document 1: Japanese Unexamined Patent Publication No. 2019-153023

Non-Patent Document 1: Bahdanau, Dzmitry, Kyunghyun Cho, Yoshua Bengio, "Neural machine translation by jointly learning to align and translate.", arXiv preprint arXiv:1409.0473 (2014), [Online], [retrieved June 1, 2020], Internet <URL: https://arxiv.org/pdf/1409.0473.pdf>
Non-Patent Document 2: Mikel, et al., "An effective approach to unsupervised machine translation.", arXiv preprint arXiv:1902.01313, 2019, [Online], [retrieved June 1, 2020], Internet <URL: https://arxiv.org/pdf/1902.01313.pdf>

In recent years, the advent of NMT has greatly improved translation accuracy compared with rule-based machine translation (RMT) and SMT, and translation with practical accuracy has become possible, particularly in domains where sufficient bilingual corpora are available. However, NMT is difficult to apply when the bilingual corpus needed to train a translation model is insufficient.

When sufficient learning resources cannot be secured in a certain domain (hereinafter, the "adaptation destination domain"), it is known that high translation accuracy can be achieved in that domain by transferring to it the knowledge obtained in another, resource-rich domain (hereinafter, the "adaptation source domain"). For example, when a bilingual corpus of the adaptation destination domain exists, even a small one, a method has been proposed in which a translation model is pre-trained on the bilingual corpus of the adaptation source domain and then fine-tuned in the adaptation destination domain. When only a monolingual corpus of the adaptation destination domain is available, a pseudo bilingual corpus can be generated by back-translation (reverse translation). Back-translation is a technique in which two translation models with opposite processing directions are prepared (for example, an English-to-Japanese model and a Japanese-to-English model) and the target sentences generated by one model are used as source sentences for training the other (see, for example, Patent Document 1 and Non-Patent Document 2).
By repeatedly adding the pseudo bilingual corpus, created by back-translation from the monolingual corpus of the adaptation destination domain, to the bilingual corpus of the adaptation source domain and retraining, NMT can be trained even when only a monolingual corpus is available.
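The back-translation (reverse translation) procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the method of Patent Document 1 or Non-Patent Document 2: `toy_translate_en_to_ja` is a hypothetical word-level lookup standing in for a trained translation model, and all function and variable names are introduced here for illustration only.

```python
from typing import Callable, List, Tuple

# Hypothetical stand-in for a trained translation model: a word-level lookup table.
TOY_EN_TO_JA = {"see": "参照", "figure": "図", "claim": "請求項"}


def toy_translate_en_to_ja(sentence: str) -> str:
    """Translate word by word; words missing from the table are kept as-is."""
    return " ".join(TOY_EN_TO_JA.get(w, w) for w in sentence.lower().split())


def build_pseudo_pairs(
    monolingual: List[str],
    translate: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Pair each machine-generated sentence (new source side) with the
    original monolingual sentence (new target side)."""
    return [(translate(sentence), sentence) for sentence in monolingual]


# Pseudo Japanese-to-English training pairs built from English-only text.
pseudo_pairs = build_pseudo_pairs(["See figure", "See claim"], toy_translate_en_to_ja)

# The pseudo pairs are added to the existing bilingual corpus before retraining.
source_corpus = [("図 参照", "see figure")]
training_data = source_corpus + pseudo_pairs
```

In a real setting the retraining step would be repeated, with the improved model regenerating the pseudo pairs on each round; that loop is omitted here.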

When improving the accuracy of a translation model in the adaptation destination domain, a domain whose field is close to that of the adaptation destination domain is usually selected as the adaptation source domain. However, owing to the nature of the domains, the writing style of sentences in the adaptation source domain often differs from that of sentences in the adaptation destination domain. For example, in the patent field, consider using patent specifications (patent gazettes, published patent applications, etc.) as the adaptation source domain in order to train a translation model for notices of reasons for refusal, an adaptation destination domain for which no sufficient bilingual corpus is available. In this case, a notice of reasons for refusal often uses expressions such as 「…を参考されたい」 ("please refer to ...") to convey the examiner's requests or questions, but such expressions are rarely used in patent specifications. Consequently, a translation model trained with patent specifications as the adaptation source domain does not achieve sufficient translation accuracy when used as is.

The present invention has been made against this background, and its object is to provide a natural language processing system and a natural language processing method capable of efficiently improving the accuracy of natural language processing such as machine translation.

One aspect of the present invention for achieving the above object is a natural language processing system configured using an information processing apparatus, comprising: a storage unit that stores a monolingual corpus and a synonymous expression generation model; a synonymous expression generation unit that generates synonymous expressions of a sentence of the monolingual corpus using the synonymous expression generation model; an evaluation unit that calculates an evaluation value of the ease of natural language processing for the sentence and for each synonymous expression generated for the sentence; and a synonymous expression generation model learning unit that, using as learning data synonymous expression information including the sentence, the synonymous expressions generated from the sentence, and the evaluation values, generates the synonymous expression generation model, which generates from an input sentence a synonymous expression that is easy to process by the natural language processing.

Other problems disclosed in the present application, and the means for solving them, will become apparent from the section on embodiments for carrying out the invention and from the drawings.

According to the present invention, the accuracy of natural language processing such as machine translation can be efficiently improved.

A system flow diagram explaining the functions of the machine translation system.
A configuration example of an information processing device constituting the information processing system.
A flowchart explaining the domain adaptation learning process.
A diagram explaining the preprocessing for the noise removal process.
A flowchart explaining the translation process.
A diagram showing a translation example by a conventional machine translation technique.
A diagram showing a translation example by a conventional machine translation technique.
A diagram showing a translation example by the machine translation system.
A diagram showing a translation example by the machine translation system.
A diagram showing a translation example by the machine translation system.
A diagram showing a translation example by the machine translation system.
A system flow diagram of the search system of the second embodiment.
A flowchart explaining the synonymous expression generation model generation process.
A flowchart explaining the search process.

Embodiments of the present invention are described below with reference to the drawings. The following description and drawings are examples for explaining the present invention, and are omitted or simplified as appropriate to clarify the explanation. The present invention can also be implemented in various other forms. Unless otherwise limited, each component may be singular or plural.

[First Embodiment]
FIG. 1 shows a system flow diagram of the machine translation system 1 described as the first embodiment. The machine translation system 1 is an information processing system that generates a bilingual corpus (hereinafter, the "adaptation destination bilingual corpus") used to train a translation model (hereinafter, the "adaptation destination translation model") in a target domain (hereinafter, the "adaptation destination domain"). As a means of generating the adaptation destination bilingual corpus, the machine translation system 1 acquires translation results for expressions specific to the adaptation destination domain under the constraint that only two resources are available: a translation model already trained in a resource-rich domain (hereinafter, the "adaptation source domain"), referred to as the "adaptation source translation model", and a monolingual corpus of the adaptation destination domain in the language to be translated from (hereinafter, the "source language").

In general, for an expression specific to the adaptation destination domain, a corresponding synonymous expression exists in the adaptation source domain. For example, in the patent field, consider using patent specifications (patent gazettes, published patent applications, etc.) as the adaptation source domain in order to train an adaptation destination translation model that translates notices of reasons for refusal, an adaptation destination domain for which no sufficient bilingual corpus is available. In this case, for the command-style expression 「を参考されたい」 ("please refer to ...") in the adaptation destination domain, the synonymous expression 「を参考」 ("refer to ...") exists in the adaptation source domain. Likewise, for the interrogative expression 「どのようなことを意味しているか」 ("what kind of thing does ... mean") in the adaptation destination domain, the synonymous expression 「何を意味しているか」 ("what does ... mean") exists in the adaptation source domain. Because expressions of the adaptation source domain are familiar to the adaptation source translation model, the model can translate them accurately. Going through the synonymous expressions found in patent specifications of the adaptation source domain is therefore considered effective for accurately translating the sentences of a notice of reasons for refusal in the adaptation destination domain.
From this point of view, the machine translation system 1 of the present embodiment has a function of associating expressions of the adaptation destination domain with expressions of the adaptation source domain (a style conversion function).

The machine translation system 1 acquires bilingual pairs by converting sentences into synonymous expressions according to their ease of translation. It automatically generates the adaptation destination bilingual corpus used to train the adaptation destination translation model by two methods that follow different logic.

In the first method (hereinafter, "easy example generation"), the machine translation system 1 converts an expression of the adaptation destination domain that is difficult to translate (that could not be translated correctly) into a synonymous expression that is easy to translate with the adaptation source translation model. Performing translation on the converted synonymous expression makes a correct translation result more likely. The machine translation system 1 then collects pairs, each consisting of a difficult-to-translate expression of the adaptation destination domain and the translation result obtained via its synonymous expression, and generates the collection as an adaptation destination bilingual corpus (hereinafter, "easy example").

In the second method (hereinafter, "hard example generation"), the machine translation system 1 converts an easy-to-translate expression of the adaptation source domain into a difficult-to-translate expression of the adaptation destination domain. It then collects pairs, each consisting of the difficult-to-translate expression of the adaptation destination domain and the reference translation corresponding to the original sentence of the adaptation source domain, and generates the collection as an adaptation destination bilingual corpus (hereinafter, "hard example"). Training the adaptation destination translation model on the difficult-to-translate expressions and the reference translations obtained by this procedure lets the model learn how difficult-to-translate expressions of the adaptation destination domain should be translated into the language to be translated into (hereinafter, the "target language").
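The two pair-collection methods can be sketched as below. This is only an illustration under stated assumptions: the patent uses learned models for both the synonymous expression conversion and the translation, whereas `to_easy`, `to_hard`, and `toy_translate` here are hypothetical string-replacement and lookup stand-ins, and the sample sentences and their English renderings are made up for the example.

```python
from typing import Callable, List, Tuple

Pair = Tuple[str, str]  # (source-language sentence, target-language sentence)


def make_easy_examples(
    destination_sentences: List[str],
    to_easy_synonym: Callable[[str], str],
    translate: Callable[[str], str],
) -> List[Pair]:
    """Pair each hard destination-domain sentence with the translation
    obtained via its easy-to-translate synonymous expression."""
    return [(s, translate(to_easy_synonym(s))) for s in destination_sentences]


def make_hard_examples(
    source_bilingual_corpus: List[Pair],
    to_hard_synonym: Callable[[str], str],
) -> List[Pair]:
    """Pair a hard-to-translate rewrite of each source-domain sentence
    with that sentence's existing reference translation."""
    return [(to_hard_synonym(src), ref) for src, ref in source_bilingual_corpus]


# Toy stand-ins (assumptions): string replacement instead of learned models.
def to_easy(sentence: str) -> str:
    return sentence.replace("を参考されたい", "を参考")


def to_hard(sentence: str) -> str:
    return sentence.replace("を参考", "を参考されたい")


def toy_translate(sentence: str) -> str:
    return {"図1を参考": "See FIG. 1"}.get(sentence, "<unk>")


easy_examples = make_easy_examples(["図1を参考されたい"], to_easy, toy_translate)
hard_examples = make_hard_examples([("図2を参考", "See FIG. 2")], to_hard)
```

Note that in both cases the source side of the collected pair is the destination-domain (hard) expression; only the origin of the target side differs, being a machine translation for easy examples and a reference translation for hard examples.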

As shown in FIG. 1, the machine translation system 1 has the functions of a domain adaptation learning unit 10, a data expansion unit 20, an adaptation destination translation model learning unit 30, and a translation processing unit 40. The domain adaptation learning unit 10 includes a data input processing unit 11, a data storage unit 12, an adaptation source translation model learning unit 13, an adaptation source translation model storage unit 14, a domain adaptation model learning unit 15, and a domain adaptation model storage unit 16. The data input processing unit 11 includes a user interface 111 and a data processing unit 112. The domain adaptation model learning unit 15 includes a monolingual corpus input unit 151, a synonymous expression generation unit 152, a reward output unit 153, a synonymous expression information storage unit 154, and a synonymous expression generation model learning unit 155. The data expansion unit 20 includes an adaptation destination bilingual corpus generation unit 21 and an adaptation destination bilingual corpus selection unit 22. The adaptation destination translation model learning unit 30 includes an adaptation destination bilingual corpus storage unit 31, an adaptation destination translation model learning unit 32, and an adaptation destination translation model storage unit 33. The translation processing unit 40 includes an input interface 41, a translation unit 42, and an output interface 43. Each of these functions is described in detail later.

FIG. 2 shows an example of the hardware configuration of the information processing device 100 constituting the machine translation system 1. As shown in the figure, the information processing device 100 includes a processor 101, a main storage device 102, a communication device 103, an input device 104, an output device 105, and an auxiliary storage device 106.

The processor 101 is configured using, for example, a CPU (Central Processing Unit), an MPU (Micro Processing Unit), a GPU (Graphics Processing Unit), an AI (Artificial Intelligence) chip, an FPGA (Field Programmable Gate Array), an SoC (System on Chip), an ASIC (Application Specific Integrated Circuit), or the like.

The main storage device 102 is a device that stores programs and data, for example, a ROM (Read Only Memory), a RAM (Random Access Memory), or a non-volatile memory (NVRAM (Non Volatile RAM)).

The communication device 103 communicates with other information processing devices, such as user terminals, via a communication network, a communication cable, or the like. It is a wireless or wired communication module (wireless communication module, communication network adapter, USB module, etc.).

The input device 104 and the output device 105 constitute the user interface unit 3 of the machine translation system 1. The input device 104 is a user interface that accepts user input and data input from outside, for example, a keyboard, a mouse, a touch panel, a card reader, or a voice input device (for example, a microphone). The output device 105 is a user interface that outputs various information to the user, for example, a display device that displays the information (liquid crystal display, organic EL panel, etc.), a voice output device that outputs the information as sound (for example, a speaker), or a printer that prints on paper.

The auxiliary storage device 106 is a device that stores programs and data, for example, an SSD (Solid State Drive), a hard disk drive, an optical storage medium (CD (Compact Disc), DVD (Digital Versatile Disc), etc.), an IC card, or an SD card. The auxiliary storage device 106 stores the programs and data for realizing the functions of the machine translation system 1. Programs and data can be written to and read from the auxiliary storage device 106 via a recording-medium reader or the communication device 103.

Programs and data stored in the auxiliary storage device 106 are read into the main storage device 102 as needed. Each function of the machine translation system 1 is realized by the processor 101 reading and executing a program stored in the main storage device 102. Each storage unit shown in FIG. 1 (12, 14, 154, 16, 31, 33) is realized by storing predetermined data readably in the auxiliary storage device 106.

All or part of the functions of the machine translation system 1 may be realized by other arithmetic devices, for example, hardware such as an FPGA (Field Programmable Gate Array) or an ASIC (Application Specific Integrated Circuit).

The information processing device 100 is, for example, a personal computer (desktop or notebook), a smartphone, a tablet, or a general-purpose computer (mainframe). All or part of the information processing device 100 may be realized using virtual information processing resources, such as a cloud server provided by a cloud system.

FIG. 3 is a flowchart explaining the process performed by the domain adaptation learning unit 10 shown in FIG. 1 (hereinafter, the "domain adaptation learning process S300"). The domain adaptation learning process S300 is described below along FIG. 3, with reference to FIG. 1 as appropriate.

First, the user interface 111 of the data input processing unit 11 in the domain adaptation learning unit 10 accepts, from a user such as an administrator of the machine translation system, location information for the adaptation source bilingual corpus and for the monolingual corpus of the adaptation destination domain (hereinafter, the "adaptation destination monolingual corpus"). The user interface 111 then acquires the adaptation source bilingual corpus and the adaptation destination monolingual corpus from the data storage areas specified by the received location information (s301), and outputs them to the data processing unit 112.

The location information may be information that specifies a storage area of an information processing device constituting the machine translation system 1, such as a folder name or directory name, or information that indicates the location of a database connected to a communication network such as the Internet, such as a URL (Uniform Resource Locator). The user interface 111 is not limited to the input device 104 (keyboard, etc.) of an information processing device constituting the machine translation system 1 and may be, for example, the communication device 103. When the communication device 103 serves as the user interface 111, the user interface 111 receives, via the communication network, location information entered on the user's information processing device (personal computer, etc.).
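As a small illustration of handling the two kinds of location information mentioned above, the following sketch distinguishes a network location from a local folder or directory name. The function name `classify_location` and the set of accepted URL schemes are assumptions made for this example; the patent does not specify how the distinction is made.

```python
from urllib.parse import urlparse

# Assumed set of schemes treated as network locations.
NETWORK_SCHEMES = {"http", "https", "ftp"}


def classify_location(location: str) -> str:
    """Return 'url' for a network location and 'local' for a folder or directory name."""
    scheme = urlparse(location).scheme.lower()
    return "url" if scheme in NETWORK_SCHEMES else "local"


kind_remote = classify_location("https://example.com/corpus/destination.txt")
kind_local = classify_location("/data/corpus/source")
```

Checking the scheme against an explicit set, rather than testing only whether a scheme is present, keeps Windows-style paths such as `C:\data` from being mistaken for URLs.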

Next, the data processing unit 112 reads the input adaptation source bilingual corpus and adaptation destination monolingual corpus, converts the sentences they contain into data in a format that the translation processing unit 40 can machine-translate, and stores the converted data in the data storage unit 12 (s302). The data converted by the data processing unit 112 consists of a symbol sequence such as a word string, written, for example, as $x_1, x_2, \ldots, x_S$. In this symbol sequence, the final symbol $x_S$ is the end-of-sequence symbol (hereinafter, "EOS"), and the sequence contains $S$ symbols including the EOS.
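The symbol-sequence format described above can be illustrated with a minimal sketch. Whitespace splitting and the spelling `<EOS>` are assumptions made for this example; the patent does not specify the tokenizer, and Japanese text would in practice need a word segmenter rather than `str.split`.

```python
from typing import List

EOS = "<EOS>"  # assumed spelling of the end-of-sequence symbol


def to_symbol_sequence(sentence: str) -> List[str]:
    """Split a sentence on whitespace and append the end-of-sequence symbol."""
    return sentence.split() + [EOS]


sequence = to_symbol_sequence("please refer to FIG. 1")
S = len(sequence)  # number of symbols, counting the EOS
```

Here `sequence[-1]` plays the role of the final symbol and `S` counts every symbol including the EOS, matching the convention in the text.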

Next, the adaptation source translation model learning unit 13 generates and trains the adaptation source translation model (s303). The adaptation source translation model learning unit 13 performs encoder/decoder neural machine translation. It operates either in a mode for machine learning (hereinafter, "learning mode") or in a mode for machine translation (hereinafter, "machine translation mode"). In learning mode, it generates adaptation source translation models using the adaptation source bilingual corpus read from the data storage unit 12 as learning data, and stores the most accurate of the generated models in the adaptation source translation model storage unit 14. In machine translation mode, it converts an input sentence written in the source language into the target language using the adaptation source translation model. As described later, in machine translation mode the adaptation source translation model learning unit 13 receives as input data, from the domain adaptation model learning unit 15, an adaptation destination sentence $x_n$ ($n = 1, 2, \ldots, S$), which is a symbol sequence whose final symbol is EOS, and the synonymous expressions $y_n^m$ ($m = 1, 2, \ldots, b$; $n = 1, 2, \ldots, T$) of that sentence, and performs machine translation on the received input data $x_n$ and $y_n^m$.
As the result of machine translation, the adaptation source translation model learning unit 13 feeds back to the domain adaptation model learning unit 15 the output data $y_1, y_2, \ldots, y_T$, a sequence of $T$ symbols whose final symbol is EOS.

Next, the domain adaptation model learning unit 15 generates a learning model that generates synonymous expressions of adaptation destination sentences (hereinafter, the "synonymous expression generation model") so that the adaptation destination sentences in the monolingual corpus can be translated appropriately, and also trains the adaptation source translation model so that those sentences are translated appropriately (s304 and the loop process s305 to s318).

Specifically, in preparation for the loop processing s305 to s318, the monolingual corpus input unit 151 of the domain adaptation model learning unit 15 first reads the adaptation destination monolingual corpus from the data storage unit 12 and outputs it to the synonymous expression generation unit 152. The synonymous expression generation unit 152 then reads the input adaptation destination monolingual corpus and the synonymous expression generation model stored in the domain adaptation model storage unit 16 (s304).

In the loop processing s305 to s318, the synonymous expression generation unit 152 first executes encoder/decoder neural sentence generation (s305). Specifically, in the mode in which a sentence is generated using the trained synonymous expression generation model (hereinafter, "sentence generation mode"), the synonymous expression generation unit 152 encodes one sentence x₁, x₂, …, x_S of the monolingual corpus it has read and then executes beam search decoding to generate b pieces of output data y¹, y², …, y^b. Each of the b pieces of output data y¹, y², …, y^b is a symbol sequence of T words whose final symbol is EOS. For example, y¹ = y₁¹, y₂¹, …, y_T¹. Each of the b symbol sequences y¹, y², …, y^b generated by beam search decoding is a synonymous expression of the sentence x₁, x₂, …, x_S in the monolingual corpus.
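As an illustration of how beam search decoding can yield b candidate sequences that each end in EOS, the following is a minimal sketch. It is not the decoder of the described system: `step_fn`, `toy_step`, and the probability table are hypothetical stand-ins for the trained encoder/decoder model, chosen only so that the example runs on its own.

```python
import math

def beam_search(step_fn, bos, eos, beam_width, max_len):
    """Return up to beam_width finished sequences, best first.

    step_fn(prefix) must return a dict mapping each possible next
    symbol to its probability given the prefix (hypothetical interface).
    """
    beams = [([bos], 0.0)]  # (symbols so far, cumulative log-probability)
    finished = []
    for _ in range(max_len):
        candidates = []
        for seq, score in beams:
            for sym, prob in step_fn(seq).items():
                candidates.append((seq + [sym], score + math.log(prob)))
        # Keep only the beam_width best extensions.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = []
        for seq, score in candidates[:beam_width]:
            if seq[-1] == eos:
                finished.append((seq, score))
            else:
                beams.append((seq, score))
        if not beams:
            break
    finished.sort(key=lambda c: c[1], reverse=True)
    return finished[:beam_width]

def toy_step(prefix):
    """Toy next-symbol distribution that depends only on the last symbol."""
    table = {
        "<BOS>": {"see": 0.6, "refer": 0.4},
        "see": {"docs": 0.7, "<EOS>": 0.3},
        "refer": {"docs": 0.9, "<EOS>": 0.1},
        "docs": {"<EOS>": 1.0},
    }
    return table[prefix[-1]]

paraphrases = beam_search(toy_step, "<BOS>", "<EOS>", beam_width=2, max_len=5)
```

Here `beam_width` plays the role of b, and the cumulative log-probability returned with each sequence is the generator's own score, which is distinct from the translation-model reward discussed later.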

Subsequently, the synonymous expression generation unit 152 removes noise, that is, inappropriate expressions, from the b synonymous expressions y¹, y², …, y^b (s306). The similarity calculation used for noise removal can be based on evaluation metrics such as BLEU and RIBES. Before this similarity calculation, the synonymous expression generation unit 152 performs preprocessing to suppress excessive penalties caused by the paraphrasing of synonyms. In the preprocessing, the synonymous expression generation unit 152 extracts the character strings common to the original adaptation destination sentence contained in the monolingual corpus (hereinafter sometimes called the "original sentence") and each of the b synonymous expressions generated by beam search decoding, and replaces each extracted common string with a variable name. Common strings are extracted using, for example, the longest match as the criterion. After every location that satisfies the criterion has been replaced with a variable name, the part lying between two adjacent variables is regarded as a synonym.

FIG. 4 shows an example of the preprocessing. As shown in the figure, suppose there are four sentences that have the same meaning but different wording: 「明細書の先行技術文献６と７に開示された材料を参照してください。」, 「明細書の先行技術文献６および７に開示されている材料を参照。」, 「明細書の先行技術文献６及び７に開示された材料を参照されたい。」, and 「明細書の先行技術文献６に開示された材料を参照すべき。」 (each a variant of "refer to the materials disclosed in the prior art documents of the specification"). From these sentences, the synonymous expression generation unit 152 extracts 「明細書の先行技術文献6」, 「7に開示され」, and 「材料を参考」 as the common character strings. If 「明細書の先行技術文献6」 is given the variable name "ele.1", 「7に開示され」 "ele.2", 「材料を参考」 "ele.3", and the period 「。」 "ele.4", the synonymous expression generation unit 152 regards 「と」, 「および」, and 「及び」 (all meaning "and"), which lie between the variables, as synonyms. With this preprocessing, the synonyms identified by the procedure of FIG. 4 are not counted as differing character strings in the similarity calculation used for noise removal.
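The preprocessing above can be sketched as follows. This is only an illustration under stated assumptions: it compares the original sentence with one paraphrase at a time, uses Python's `difflib.SequenceMatcher` as a stand-in for the longest-match criterion, and the function name `mask_common`, the `min_len` parameter, and the `<ele.N>` placeholder spelling are all hypothetical.

```python
from difflib import SequenceMatcher

def mask_common(original, paraphrase, min_len=2):
    """Replace substrings shared by both sentences with <ele.N> placeholders.

    Returns the two masked sentences and the list of (original, paraphrase)
    fragments found between two consecutive placeholders, which are treated
    as candidate synonyms.
    """
    matcher = SequenceMatcher(None, original, paraphrase, autojunk=False)
    # Ignore very short matches (and the zero-length sentinel block).
    blocks = [b for b in matcher.get_matching_blocks() if b.size >= min_len]
    masked_a, masked_b, synonyms = [], [], []
    pos_a = pos_b = 0
    for k, blk in enumerate(blocks, start=1):
        gap_a = original[pos_a:blk.a]
        gap_b = paraphrase[pos_b:blk.b]
        # Only text lying between two placeholders counts as a synonym.
        if k > 1 and (gap_a or gap_b):
            synonyms.append((gap_a, gap_b))
        masked_a += [gap_a, f"<ele.{k}>"]
        masked_b += [gap_b, f"<ele.{k}>"]
        pos_a, pos_b = blk.a + blk.size, blk.b + blk.size
    masked_a.append(original[pos_a:])
    masked_b.append(paraphrase[pos_b:])
    return "".join(masked_a), "".join(masked_b), synonyms

masked_x, masked_y, synonym_pairs = mask_common(
    "see documents 6 and 7 for details",
    "see documents 6 & 7 for details",
)
```

Because the matching is character based, the same function can be applied to Japanese sentences such as those of FIG. 4. A later similarity calculation could then treat each pair in `synonym_pairs` as equal rather than as a mismatch.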

After the preprocessing, the synonymous expression generation unit 152 removes noise by calculating the similarity between each synonymous expression and the original sentence, and discarding as noise any synonymous expression whose final similarity under a predetermined metric (for example, the BLEU score) is at or below a predetermined value (for example, BLEU ≤ 0.5). For the synonymous expressions that remain after noise removal, the synonymous expression generation unit 152 sends to the reward output unit 153 pairs each consisting of one adaptation destination sentence and one of the b synonymous expressions obtained by beam search. It is assumed that there are N adaptation destination sentences, that is, N sets of learning data for style conversion.
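The thresholding step can be illustrated with a short sketch. The similarity used here is a clipped character-bigram precision, chosen only because it is self-contained; it is a simplified stand-in and not BLEU or RIBES themselves. The names `ngram_precision` and `remove_noise` are hypothetical.

```python
from collections import Counter

def ngram_precision(candidate, reference, n=2):
    """Clipped character n-gram precision of candidate against reference."""
    cand = Counter(candidate[i:i + n] for i in range(len(candidate) - n + 1))
    ref = Counter(reference[i:i + n] for i in range(len(reference) - n + 1))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    # Each candidate n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total

def remove_noise(original, paraphrases, threshold=0.5):
    """Drop paraphrases whose similarity to the original is at or below threshold."""
    return [p for p in paraphrases if ngram_precision(p, original) > threshold]

kept = remove_noise(
    "refer to documents 6 and 7",
    ["refer to documents 6 & 7", "zzz qqq"],
)
```

In a real pipeline the similarity would be computed on the masked sentences produced by the preprocessing, so that interchangeable synonyms do not lower the score.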

Returning to FIG. 3, the reward output unit 153 next evaluates the ease of machine translation of the original sentence x_n (n = 1, 2, …, s) and of its synonymous expressions y_n^m (m = 1, 2, …, b; n = 1, 2, …, T). First, the reward output unit 153 reads from the adaptation source translation model storage unit 14 the trained adaptation source translation model generated by the adaptation source translation model learning unit 13 (s307). The reward output unit 153 then has the adaptation source translation model learning unit 13 machine-translate, with the model it has read, the adaptation destination sentence x_n and the synonymous expressions y_n^m sent from the synonymous expression generation unit 152, and calculates the ease of machine translation (reward) for the adaptation destination sentence x_n and the synonymous expressions y_n^m (s308).

The ease of processing could be evaluated in terms of sentence length, the dependency structure of the sentence, and so on, but the machine translation system 1 uses the log-likelihood logp(y_n^m) as the index of the ease of machine translation. The log-likelihood logp(y_n^m) is obtained when the synonymous expression in question is translated into the target language with the adaptation source translation model. It represents the confidence of the adaptation source translation model and indicates the accuracy of the translation result: it is large when the translation result is generated correctly and small when the translation result is generated incorrectly. The reward output unit 153 stores in the synonymous expression information storage unit 154 a set of three items (hereinafter, "synonymous expression information") consisting of the adaptation destination sentence x_n, its synonymous expression y_n^m, and the evaluation value logp(y_n^m) of the ease of machine translation obtained by the reward output unit 153 (s309).
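A small sketch of how such a reward and the three-item record might be represented follows. It assumes, as is common for encoder/decoder translation models but not stated in the text, that the sequence log-likelihood is the sum of the log-probabilities of the output tokens. `SynonymInfo`, `sequence_log_likelihood`, and `make_record` are hypothetical names.

```python
import math
from typing import NamedTuple

class SynonymInfo(NamedTuple):
    """One item of synonymous expression information (hypothetical layout)."""
    original: str    # adaptation destination sentence x_n
    paraphrase: str  # synonymous expression y_n^m
    reward: float    # log-likelihood used as the ease-of-translation score

def sequence_log_likelihood(token_probs):
    """Sum of log-probabilities the translation model assigned to its output tokens."""
    return sum(math.log(p) for p in token_probs)

def make_record(original, paraphrase, token_probs):
    """Bundle a sentence, its paraphrase, and the reward into one record."""
    return SynonymInfo(original, paraphrase, sequence_log_likelihood(token_probs))

confident = make_record("x", "y easy", [0.9, 0.9])
unsure = make_record("x", "y hard", [0.2, 0.3])
```

With these toy probabilities `confident.reward` is larger than `unsure.reward`, matching the description that the value is large when the model translates the expression reliably.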

Subsequently, the synonymous expression generation model learning unit 155 generates a synonymous expression generation model (s310). The synonymous expression generation model learning unit 155 performs encoder/decoder neural sentence generation. In the learning mode, it extracts synonymous expression information from the synonymous expression information storage unit 154 and, using this information as learning data, learns to generate the synonymous expressions y_n^m of an adaptation destination sentence x_n from that sentence so that the likelihood of synonymous expressions that are easier to translate is raised and the likelihood of synonymous expressions that are harder to translate is lowered. That is, the "easy example generation" and "hard example generation" methods described above are applied to this learning.

The probability that a synonymous expression y_n^m is generated follows the ease of processing logp(y_n^m) of that synonymous expression. In the learning mode, the synonymous expression generation model learning unit 155 therefore first takes as a baseline the reward evaluation value logp(x_n) calculated from the adaptation destination sentence x_n and, as shown in Formula 1 below, obtains the difference r between that baseline and the reward evaluation value logp(y_n^m) of the synonymous expression y_n^m.
[Formula 1]

The synonymous expression generation model produced by "easy example generation" is a model for converting an adaptation destination sentence x_n into a synonymous expression that the adaptation source translation model can handle. In "easy example generation", the likelihood is therefore raised when the synonymous expression is easier to translate than the original sentence and lowered when it is harder to translate than the original sentence. Accordingly, the synonymous expression generation model learning unit 155 generates the synonymous expression generation model so as to minimize the loss function L_e shown in Formula 2 below.
[Formula 2]

In contrast, the synonymous expression generation model produced by "hard example generation" is a model for converting an adaptation destination sentence x_n so that it becomes harder to translate. When performing "hard example generation", the synonymous expression generation model learning unit 155 therefore generates the synonymous expression generation model so as to minimize the loss function L_h shown in Formula 3 below.
[Formula 3]
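Formulas 1 to 3 are not reproduced in this text, so the following is only one plausible reading of the surrounding description and should not be taken as the formulas themselves. It assumes that r is the paraphrase reward minus the baseline, that the easy-example loss is a policy-gradient style objective weighting the generator's log-probability by r, and that the hard-example loss is the same objective with the sign reversed. All function names are hypothetical.

```python
def reward_difference(logp_paraphrase, logp_original):
    """Assumed form of r: paraphrase reward minus the baseline of the original."""
    return logp_paraphrase - logp_original

def easy_loss(samples):
    """Assumed easy-example loss.

    samples: list of (log_q, logp_y, logp_x), where log_q is the generator's
    log-probability of the paraphrase, and logp_y / logp_x are the
    translation-model rewards of the paraphrase and of the original sentence.
    Minimizing this raises log_q when r > 0 and lowers it when r < 0.
    """
    total = sum(reward_difference(lp_y, lp_x) * log_q for log_q, lp_y, lp_x in samples)
    return -total / len(samples)

def hard_loss(samples):
    """Assumed hard-example loss: the same objective with the opposite sign."""
    return -easy_loss(samples)

# A paraphrase that is easier to translate than the original (r = 3).
likely = easy_loss([(-0.5, -2.0, -5.0)])    # generator already favors it
unlikely = easy_loss([(-1.0, -2.0, -5.0)])  # generator favors it less
```

Under these assumptions `likely` is smaller than `unlikely`, so gradient descent on the easy-example loss would push the generator toward paraphrases with a positive r, while the hard-example loss would push it the other way.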

The synonymous expression generation model learning unit 155 stores the synonymous expression generation model generated as described above in the domain adaptation model storage unit 16 (s311), and updates the synonymous expression generation model each time the learning mode is executed.

The synonymous expression generation model may be generated, for example, by pre-training performed by the synonymous expression generation model learning unit 155. In that case, when the synonymous expression generation unit 152 first reads the synonymous expression generation model in the sentence generation mode, it can use the pre-trained synonymous expression generation model stored in the domain adaptation model storage unit 16. Pre-training by the synonymous expression generation model learning unit 155 is performed, for example, as follows.

First, the synonymous expression generation model learning unit 155 reads the monolingual corpus of the adaptation destination domain from the monolingual corpus input unit 151 and trains an encoder/decoder neural model to reconstruct the adaptation destination sentence x_n. A synonymous expression generation model pre-trained in this way can generate sentences in the same language as the adaptation destination monolingual corpus, so a plurality of synonymous expressions can be generated by beam search decoding. The synonymous expression generation model learning unit 155 stores the pre-trained synonymous expression generation model in the domain adaptation model storage unit 16.

In this way, the synonymous expression generation model is trained at the pre-training stage and at each subsequent learning opportunity. At each learning opportunity of the synonymous expression generation model learning unit 155, the synonymous expression generation unit 152 reads the synonymous expression generation model stored in the domain adaptation model storage unit 16 and reads the monolingual corpus of the adaptation destination domain from the monolingual corpus input unit 151 (s312). Then, as in the synonymous expression generation (s305) and noise removal (s306) described above, the synonymous expression generation unit 152 generates b pieces of output data y¹, y², …, y^b as synonymous expressions in the sentence generation mode (s313), removes noise from the generated synonymous expressions after performing the preprocessing described above (s314), and sends the synonymous expressions remaining after noise removal, together with their original sentences, to the reward output unit 153.

On receiving the synonymous expressions and their original sentences from the synonymous expression generation unit 152, the reward output unit 153, as in the calculation of the ease of translation (reward) described above, has the adaptation source translation model learning unit 13 translate each synonymous expression into the target language and outputs the reward (log-likelihood) based on the translation result. It then sends to the bilingual pair extraction unit 156 sets each consisting of a synonymous expression, its translation result, the log-likelihood, and the original sentence (hereinafter, "synonymous expression related information") (s315).

The bilingual pair extraction unit 156 extracts bilingual pairs based on the synonymous expression related information transmitted by the synonymous expression generation unit 152 (s316). In this processing, the bilingual pair extraction unit 156 first sorts the received synonymous expression related information in descending order of log-likelihood and, according to the number of learning iterations, extracts the sets whose log-likelihood falls within the top r%. The value of r is reduced each time the number of learning iterations increases. By extracting bilingual pairs in this way, the bilingual pair extraction unit 156 reduces the possibility that noisy translation results generated by the reward output unit 153 adversely affect learning. In other words, after pre-training on learning data that still contains much noise, the synonymous expression generation model is fine-tuned with learning data of comparatively high quality. As a result, synonymous expressions that yield higher translation accuracy are generated than when a synonymous expression generation model is given to the synonymous expression generation unit 152 at random.
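The selection of the top r% with a shrinking r can be sketched as below. The text only states that r decreases as learning proceeds, so the geometric schedule in `percent_for_round` and its default values are purely illustrative assumptions, as are the function names and the tuple layout.

```python
import math

def select_top_percent(records, percent):
    """Keep the records whose log-likelihood is in the top `percent` percent.

    records: list of (paraphrase, translation, log_likelihood, original).
    """
    ranked = sorted(records, key=lambda rec: rec[2], reverse=True)
    keep = math.ceil(len(ranked) * percent / 100)
    return ranked[:keep]

def percent_for_round(round_index, start=80.0, decay=0.8, floor=10.0):
    """Hypothetical schedule: shrink r geometrically per round, never below floor."""
    return max(floor, start * decay ** round_index)

records = [
    ("p1", "t1", -1.0, "s1"),
    ("p2", "t2", -3.0, "s1"),
    ("p3", "t3", -2.0, "s2"),
    ("p4", "t4", -4.0, "s2"),
]
selected = select_top_percent(records, 50)
training_pairs = [(original, translation) for _, translation, _, original in selected]
```

The last line mirrors the next step of the procedure, in which the original sentence and the translation result of each retained set are paired as learning data.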

From the extracted synonymous expression related information, the bilingual pair extraction unit 156 extracts the original sentence and the translation result, pairs them to generate learning data, and sends this learning data to the adaptation source translation model learning unit 13. The adaptation source translation model learning unit 13 trains the adaptation source translation model with the learning data sent from the bilingual pair extraction unit 156 (s317) and stores the trained adaptation source translation model in the adaptation source translation model storage unit 14 (s318).

FIG. 5 is a flowchart illustrating the processing performed by the data expansion unit 20, the adaptation destination translation model learning unit 30, and the translation processing unit 40 (hereinafter, "translation processing S500"). The translation processing S500 is described below along FIG. 5, with reference to FIG. 1 as appropriate.

The adaptation destination bilingual corpus generation unit 21 of the data expansion unit 20 generates a pseudo bilingual corpus with which the adaptation source translation model learning unit 13 learns to translate sentences of the adaptation destination domain using the adaptation source translation model. The adaptation destination bilingual corpus generation unit 21 reads the monolingual corpus of the adaptation destination domain from the data storage unit 12, the trained adaptation source translation model from the adaptation source translation model storage unit 14, and the trained synonymous expression generation model from the domain adaptation model storage unit 16 (s501).

To generate the pseudo bilingual corpus, the adaptation destination bilingual corpus generation unit 21 first uses the synonymous expression generation model to generate synonymous expressions for the sentences contained in the monolingual corpus of the adaptation destination domain (s502). It then translates the generated synonymous expressions into the target language with the trained adaptation source translation model, and obtains a collection of sets each consisting of three items: the original sentence of the adaptation destination domain, the translation result obtained via the synonymous expression, and the likelihood output by the translation model (s503).

The adaptation destination bilingual corpus selection unit 22 of the data expansion unit 20 sorts the collection of sets obtained by the adaptation destination bilingual corpus generation unit 21 in order of the likelihood output by the translation model (s504). It then extracts the top r% of the sorted data as a pseudo bilingual corpus (the adaptation destination bilingual corpus) for training a translation model that translates sentences of the adaptation destination domain (the adaptation destination translation model), and stores the extracted adaptation destination bilingual corpus in the adaptation destination bilingual corpus storage unit 31 of the adaptation destination translation model learning unit 30 (s505).
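Steps s502 to s505 can be pictured as a small pipeline. The sketch below is an illustration under assumptions: `paraphrase` and `translate` are hypothetical callables standing in for the trained synonymous expression generation model and the adaptation source translation model, and the stub implementations exist only so that the example runs.

```python
import math

def build_pseudo_parallel_corpus(sentences, paraphrase, translate, percent):
    """Pair each original sentence with translations obtained via its paraphrases.

    paraphrase(sentence) -> list of synonymous expressions (hypothetical).
    translate(text) -> (translation, log_likelihood) (hypothetical).
    """
    triples = []
    for sentence in sentences:
        for candidate in paraphrase(sentence):
            translation, log_likelihood = translate(candidate)
            # The original sentence, not the paraphrase, is kept as the source side.
            triples.append((sentence, translation, log_likelihood))
    # Sort by translation-model likelihood and keep the top `percent` percent.
    triples.sort(key=lambda t: t[2], reverse=True)
    keep = math.ceil(len(triples) * percent / 100)
    return [(src, tgt) for src, tgt, _ in triples[:keep]]

def stub_paraphrase(sentence):
    """Stub generator: two fixed variants per sentence."""
    return [sentence + " a", sentence + " b"]

def stub_translate(text):
    """Stub translator: upper-cases the text and prefers variants ending in 'a'."""
    return text.upper(), (-1.0 if text.endswith("a") else -2.0)

corpus = build_pseudo_parallel_corpus(["x", "y"], stub_paraphrase, stub_translate, 50)
```

The resulting list of (source sentence, target translation) pairs corresponds to the adaptation destination bilingual corpus that is subsequently used to train the adaptation destination translation model.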

The adaptation destination translation model learning unit 32 of the adaptation destination translation model learning unit 30 generates a trained adaptation destination translation model. The adaptation destination translation model learning unit 32 performs encoder/decoder neural machine translation. It reads the adaptation destination bilingual corpus from the adaptation destination bilingual corpus storage unit 31, generates a trained adaptation destination translation model using this corpus as learning data (s506), and stores the model in the adaptation destination translation model storage unit 33 (s507). Other kinds of translation model, such as rule-based or statistical machine translation models, may be used as the adaptation destination translation model.

Subsequently, the translation processing unit 40 translates a sentence of the adaptation destination domain received from the user using the adaptation destination translation model generated by the adaptation destination translation model learning unit 30, and outputs the translation result. Specifically, the input interface 41 first accepts a sentence from the user of the machine translation system 1 (s508). The translation unit 42 then reads the translation model from the adaptation destination translation model storage unit 33 and translates the accepted sentence into the target language (s509). Finally, the output interface 43 outputs the translated sentence to an output device 105 such as a display (s510).

As described above, the machine translation system 1 of the present embodiment exploits the fact that, for an expression peculiar to the adaptation destination domain, a corresponding synonymous expression exists in the adaptation source domain. It can thereby translate sentences of the adaptation destination domain accurately under the constraint that only a source-language monolingual corpus is available for the adaptation destination domain. The machine translation system 1 of the present embodiment can thus efficiently improve the accuracy of natural language processing.

As a concrete example of translation by the machine translation system 1, consider translating Japanese notices of reasons for refusal into English, with Japanese as the source language, English as the target language, patent gazettes as the documents of the adaptation source domain, and notices of reasons for refusal as the documents of the adaptation destination domain. It is assumed that a bilingual corpus of patent gazettes and a monolingual corpus of Japanese notices of reasons for refusal have been stored in the data storage unit 12 via the data input processing unit 11.

FIGS. 6A and 6B each show a translation example obtained with conventional machine translation technology (comparative examples): a sentence from a Japanese notice of reasons for refusal and a sentence from a patent gazette are used as input sentences and translated into English with a translation model trained on the patent gazette domain. In the notice of reasons for refusal shown in FIG. 6A, the passages concerning the examiner's requests, questions, and the like (underlined in the figure) differ from the expressions used in the patent gazette shown in FIG. 6B (underlined in the figure). Thus, merely applying a translation model trained on the patent gazette domain as is to a notice of reasons for refusal does not translate the expressions peculiar to such notices appropriately.

FIGS. 7A to 7D show examples (the input sentences and translated sentences in the figures) in which Japanese (source domain) sentences of a notice of reasons for refusal are translated into English (destination domain) sentences with the adaptation destination translation model of the machine translation system 1 of the present embodiment. The synonymous expressions shown in each figure are examples of the synonymous expressions used to generate the learning data (the adaptation destination bilingual corpus) for training the adaptation destination translation model. As these examples show, the machine translation system 1 of the present embodiment can translate sentences of a notice of reasons for refusal into English accurately.

[Second Embodiment]
FIG. 8 is a system flow diagram illustrating the functions of the information processing system shown as the second embodiment (hereinafter, "search system 50"). The hardware configuration of the search system 50 is the same as that of the information processing apparatus 100 shown in FIG. 2.

As shown in the figure, the search system 50 includes a domain adaptive learning unit 60 and a search processing unit 70. The domain adaptive learning unit 60 includes a data input processing unit 61, a search engine storage unit 62, a storage unit 63, a domain adaptation model learning unit 64, and a domain adaptation model storage unit 65. The domain adaptation model learning unit 64 includes a query input unit 641, a synonymous expression generation unit 642, a reward output unit 643, a synonymous expression information storage unit 644, and a synonymous expression generation model learning unit 645. The search processing unit 70 includes an input interface 71, a domain adaptation determination unit 72, a query synonymous expression generation unit 73, a search engine unit 74, and an output interface 75. The search engine unit 74 includes a query input unit 741, a search content storage unit 742, a similarity calculation unit 743, and a search result output unit. The similarity calculation unit 743 is, for example, a search engine of the kind generally used by information search sites on the Internet.

The search processing unit 70 provides an information search service to users in the same way as an information processing system that constitutes an information search site published on the Internet, a stand-alone database search device, or the like. Specifically, on accepting a query from a user via a communication network such as the Internet, the search processing unit 70 sends content stored in the search content storage unit 742 to the user terminal operated by the user as the search result for the accepted query. When the search processing unit 70 accepts from the user a query composed of a single sentence or a combination of sentences, it provides appropriate content with higher relevance. The domain adaptive learning unit 60 treats the collection of queries that users have input to the search processing unit 70 (hereinafter, "query corpus") as a monolingual corpus and, based on this query corpus, generates a synonymous expression generation model that converts a query accepted from a user into a synonymous expression from which the correct content is more readily retrieved.

FIG. 9 is a flowchart illustrating the process performed by the domain adaptive learning unit 60 (hereinafter referred to as the "synonymous expression generation model generation process S900"). The synonymous expression generation model generation process S900 is described below along FIG. 9, with reference to FIG. 8 as appropriate.

First, when information on the location of the adaptation-destination monolingual corpus is input to the user interface 611 of the data input processing unit 61, for example by a user operation, the user interface 611 outputs the query corpus read from the input location to the data processing unit 612. The data processing unit 612 converts the sentences of the input query corpus into data in a format that the domain adaptation model learning unit 64 can process, and stores the converted data in the storage unit 63. Subsequently, the query input unit 641 reads the query corpus from the storage unit 63 and outputs it to the synonymous expression generation unit 642 (s901). As in the machine translation system 1 of the first embodiment, the data converted by the data processing unit 612 represents each sentence describing a query as a symbol sequence x₁, x₂, …, x_S, such as a word sequence.

Subsequently, the domain adaptive learning unit 60 starts the loop process s901 to s910. First, the synonymous expression generation unit 642 executes encoder/decoder neural sentence generation (s902). The synonymous expression generation unit 642 operates in the same manner as the synonymous expression generation unit 152 of the machine translation system 1 of the first embodiment: in sentence generation mode, it encodes one sentence x₁, x₂, …, x_S of the query corpus read from the storage unit 63 and then executes beam search decoding to generate b output sequences y¹, y², …, y^b, which are synonymous expressions. Each of the b output sequences y¹, y², …, y^b is a symbol sequence of T words whose final symbol is EOS.
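The beam search decoding step can be illustrated with a minimal sketch. The code below is not the patented encoder/decoder; it assumes a hypothetical `next_token_probs` callable that stands in for the decoder and returns next-token probabilities for a prefix, and it returns up to b finished sequences ending in EOS. The names `beam_search`, `toy_decoder`, and the toy vocabulary are illustrative assumptions.

```python
import math
from typing import Callable, Dict, List, Tuple

EOS = "<eos>"
Sequence = Tuple[str, ...]


def beam_search(
    next_token_probs: Callable[[Sequence], Dict[str, float]],
    beam_width: int,
    max_len: int,
) -> List[Tuple[Sequence, float]]:
    """Return up to `beam_width` EOS-terminated sequences, best log-probability first."""
    beams: List[Tuple[Sequence, float]] = [((), 0.0)]
    finished: List[Tuple[Sequence, float]] = []
    for _ in range(max_len):
        candidates = []
        for prefix, score in beams:
            for token, prob in next_token_probs(prefix).items():
                if prob > 0.0:  # skip impossible tokens to avoid log(0)
                    candidates.append((prefix + (token,), score + math.log(prob)))
        candidates.sort(key=lambda item: item[1], reverse=True)
        beams = []
        # Keep only the best `beam_width` expansions; finished ones leave the beam.
        for sequence, score in candidates[:beam_width]:
            (finished if sequence[-1] == EOS else beams).append((sequence, score))
        if not beams:
            break
    finished.sort(key=lambda item: item[1], reverse=True)
    return finished[:beam_width]


def toy_decoder(prefix: Sequence) -> Dict[str, float]:
    """Hypothetical stand-in for the neural decoder (assumption, not from the patent)."""
    table = {
        (): {"find": 0.6, "search": 0.4},
        ("find",): {"hotels": 0.7, "inns": 0.3},
        ("search",): {"hotels": 1.0},
    }
    return table.get(prefix, {EOS: 1.0})


paraphrases = beam_search(toy_decoder, beam_width=2, max_len=4)
```

Here `beam_width` plays the role of b: the two surviving hypotheses correspond to y¹ and y². Hypotheses that have not emitted EOS within `max_len` steps are simply dropped in this sketch.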

Subsequently, the synonymous expression generation unit 642 performs noise removal, which removes inappropriate expressions from the b generated synonymous expressions y¹, y², …, y^b based on their similarity to the query x₁, x₂, …, x_S (s903). In doing so, as in the machine translation system 1 of the first embodiment, the synonymous expression generation unit 642 preprocesses the generated synonymous expressions, computes the similarity between each synonymous expression and its source sentence, and removes as noise any synonymous expression whose similarity on a predetermined index (for example, the BLEU score) is at or below a predetermined value (for example, BLEU ≤ 0.5). The synonymous expression generation unit 642 transmits the synonymous expressions remaining after noise removal, together with the source query, to the reward output unit 643 (s904).
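The noise removal in s903 can be sketched as a similarity filter. The text names BLEU only as an example index, so the scorer below is a simplified BLEU-style score (clipped unigram and bigram precision with a brevity penalty) written for illustration; `simple_bleu` and `remove_noise` are hypothetical names, and a production system would more likely use an established BLEU implementation.

```python
import math
from collections import Counter
from typing import List


def ngrams(tokens: List[str], n: int) -> Counter:
    """Count the n-grams of a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate: List[str], reference: List[str], max_n: int = 2) -> float:
    """Simplified sentence-level BLEU-style score in [0, 1] (illustrative assumption)."""
    if not candidate:
        return 0.0
    log_precision = 0.0
    for n in range(1, max_n + 1):
        cand, ref = ngrams(candidate, n), ngrams(reference, n)
        total = sum(cand.values())
        overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
        if total == 0 or overlap == 0:
            return 0.0  # no smoothing in this sketch
        log_precision += math.log(overlap / total) / max_n
    if len(candidate) >= len(reference):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1.0 - len(reference) / len(candidate))
    return brevity_penalty * math.exp(log_precision)


def remove_noise(query: List[str], paraphrases: List[List[str]],
                 threshold: float = 0.5) -> List[List[str]]:
    """Drop paraphrases whose similarity to the query is at or below the threshold."""
    return [p for p in paraphrases if simple_bleu(p, query) > threshold]


query = "cheap hotels near tokyo station".split()
generated = [
    "cheap hotels close to tokyo station".split(),
    "weather forecast for osaka".split(),
    "cheap hotels near tokyo station".split(),
]
kept = remove_noise(query, generated)
```

With the example threshold of 0.5 from the text, the unrelated second candidate is removed while the two paraphrases that share most of their n-grams with the query are kept.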

Subsequently, the reward output unit 643 reads the existing search engine from the search engine storage unit 62 (s905), converts the query x_n (n = 1, 2, …, S) into its synonymous expressions y_n^m (m = 1, 2, …, b; n = 1, 2, …, T), and calculates the ease of processing when the synonymous expression y_n^m is submitted to the search engine (s906). The reward output unit 643 also obtains the ease of processing for the query x_n itself. In the search system 50 of the second embodiment, the index of ease of processing (reward) is the entropy H(y_n^m) calculated from the degree of relevance (for example, TF/IDF) between the query x_n and all the content stored in the search content storage unit 742 as search targets. The higher the entropy value, the more appropriate the content that the search engine extracts for the query x_n.
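As a rough illustration of the reward computation in s906, the sketch below scores a query against every stored content item with a TF-IDF-style relevance and then takes the Shannon entropy of the normalized scores. The text does not specify how the relevance values are turned into an entropy, so the normalization into a probability distribution, the smoothed IDF, and the names `tfidf_relevance` and `entropy_reward` are all assumptions.

```python
import math
from collections import Counter
from typing import List


def tfidf_relevance(query: List[str], docs: List[List[str]]) -> List[float]:
    """TF-IDF-style relevance of the query to each document (smoothed IDF, assumed)."""
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    scores = []
    for doc in docs:
        term_freq = Counter(doc)
        score = 0.0
        for term in query:
            if doc and term in term_freq:
                idf = math.log((1 + n_docs) / (1 + doc_freq[term])) + 1.0
                score += (term_freq[term] / len(doc)) * idf
        scores.append(score)
    return scores


def entropy_reward(query: List[str], docs: List[List[str]]) -> float:
    """Shannon entropy of the relevance scores normalized to sum to 1 (assumed)."""
    scores = tfidf_relevance(query, docs)
    total = sum(scores)
    if total == 0.0:
        return 0.0  # the query matches nothing
    probs = [s / total for s in scores if s > 0.0]
    return -sum(p * math.log(p) for p in probs)


contents = [
    ["tokyo", "hotel"],
    ["osaka", "hotel"],
    ["kyoto", "temple"],
]
h_single = entropy_reward(["tokyo"], contents)  # matches one content item
h_shared = entropy_reward(["hotel"], contents)  # matches two items equally
```

The same function would be applied to both the original query x_n and each synonymous expression y_n^m, giving the values H(x_n) and H(y_n^m) that are compared later in the learning step.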

Subsequently, for each query x_n, the reward output unit 643 stores synonymous expression information, a triple consisting of the query x_n, a synonymous expression y_n^m, and the ease-of-processing evaluation value H(y_n^m), in the synonymous expression information storage unit 644 (s907). In this example, there are N sets of synonymous expression information. The synonymous expression generation model learning unit 645 uses the synonymous expression information as learning data for generating the synonymous expression generation model.

Subsequently, the synonymous expression generation model learning unit 645 reads the synonymous expression generation model (s908). The synonymous expression generation model learning unit 645 is a function that executes encoder/decoder neural sentence generation; in learning mode, it reads the synonymous expression generation model from the domain adaptation model storage unit 65.

Subsequently, using the synonymous expression information as learning data, the synonymous expression generation model learning unit 645 learns to generate the synonymous expression y_n^m from the query x_n and produces the synonymous expression generation model as the learning result (s909). The probability of generating a synonymous expression y_n^m is learned according to the ease-of-processing (reward) evaluation value H(y_n^m) for that synonymous expression.

Specifically, the synonymous expression generation model learning unit 645 first uses Formula 4 below to take the ease-of-processing evaluation value H(x_n) calculated for the query x_n as a baseline, and obtains the difference between that baseline and the evaluation value H(y_n^m) of the synonymous expression y_n^m.
[Formula 4]

Then, the synonymous expression generation model learning unit 645 learns so as to minimize the loss function shown in Formula 5 below.
[Formula 5]
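Formulas 4 and 5 are not reproduced in this text, so their exact form is unknown here. The sketch below shows one common way to realize what the surrounding sentences describe: a baseline-subtracted reward (the difference between H(y_n^m) and H(x_n)) that weights the negative log-probability of each synonymous expression. This reward-weighted form, and the names `advantage` and `reward_weighted_loss`, are assumptions and should not be read as the patent's actual formulas.

```python
import math
from typing import List, Tuple


def advantage(h_paraphrase: float, h_query: float) -> float:
    """Evaluation value of the paraphrase relative to the query baseline."""
    return h_paraphrase - h_query


def reward_weighted_loss(examples: List[Tuple[float, float, float]]) -> float:
    """Mean of -advantage * log p(y | x) over the training examples.

    Each example is (log_prob, h_paraphrase, h_query), where log_prob is the
    model's log-probability of generating the paraphrase from the query.
    """
    if not examples:
        raise ValueError("at least one training example is required")
    total = 0.0
    for log_prob, h_paraphrase, h_query in examples:
        total += -advantage(h_paraphrase, h_query) * log_prob
    return total / len(examples)


# Hypothetical values: one paraphrase scores above the query baseline, one below.
batch = [
    (math.log(0.5), 1.2, 1.0),   # better than the query: advantage +0.2
    (math.log(0.25), 0.8, 1.0),  # worse than the query: advantage -0.2
]
loss = reward_weighted_loss(batch)
```

Under this assumed objective, minimizing the loss raises the generation probability of synonymous expressions that score above the query baseline and lowers it for those that score below, which matches the statement that the generation probability is learned according to the evaluation value.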

Each time the learning process is completed, the synonymous expression generation model learning unit 645 stores the latest synonymous expression generation model in the domain adaptation model storage unit 65 (s910).

The synonymous expression generation model may be generated, for example, by pre-learning performed by the synonymous expression generation model learning unit 645. In that case, when the synonymous expression generation unit 642 first reads the synonymous expression generation model in sentence generation mode, it can use the pre-learned model stored in the domain adaptation model storage unit 65. The pre-learning is performed, for example, by having the synonymous expression generation model learning unit 645 read the query corpus from the storage unit 63 and train an encoder/decoder neural model to reconstruct the query x_n. The synonymous expression generation model learning unit 645 stores the pre-learned synonymous expression generation model in the domain adaptation model storage unit 65. A model pre-learned in this way can generate sentences in the same language as the query corpus, so beam search decoding can then produce multiple synonymous expressions.

Next, the process performed by the search processing unit 70 (hereinafter referred to as the "search process S1000") is described.

FIG. 10 is a flowchart illustrating the search process S1000. The search process S1000 is described below with reference to that figure.

First, the input interface 71 accepts a query input from the user (s1001) and transmits the accepted query to the domain adaptation determination unit 72 (s1002).

The domain adaptation determination unit 72 determines whether the transmitted query needs to be converted into a synonymous expression (s1003). Specifically, the domain adaptation determination unit 72 reads the search engine from the search engine storage unit 62 and submits the query to the search engine unit 74. The similarity calculation unit 743 of the search engine unit 74 uses the degree of relevance between the query and all the content in the search content storage unit 742 to rank each content item and generate search results, and transmits the ranked search results to the domain adaptation determination unit 72.

The domain adaptation determination unit 72 evaluates the degree of relevance (for example, an evaluation index such as TF/IDF) between the query and the top-ranked content among the transmitted content (s1003). If the degree of relevance is less than a predetermined threshold s (s1003: YES), the process proceeds to s1004. If the degree of relevance is equal to or greater than the threshold s (s1003: NO), the process proceeds to s1006.

In s1004, the query synonymous expression generation unit 73 receives the query transmitted from the domain adaptation determination unit 72. The query synonymous expression generation unit 73 then reads the synonymous expression generation model from the domain adaptation model storage unit 65, generates a synonymous expression of the query, and transmits the generated synonymous expression as a query to the query input unit 741 of the search engine unit 74 (s1005).

In s1006, the query input unit 741 transmits the character string of the query received from the query synonymous expression generation unit 73 or the domain adaptation determination unit 72 to the similarity calculation unit 743. The similarity calculation unit 743 calculates the degree of relevance between the query and all the content in the search content storage unit 742, ranks each content item, and generates search results (s1007).

Subsequently, the output interface 75 outputs the search results (s1008). The output interface 75 outputs the search results in an appropriate format, for example as the top-ranked content or a link to it, or as a list of content items or links to them arranged in ranking order.
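The control flow of the search process S1000 can be summarized in a short sketch. The ranking function and the paraphrasing function are passed in as hypothetical callables (`rank` and `paraphrase`), and the toy word-overlap relevance below merely stands in for the search engine's TF/IDF-style relevance. One simplification is an assumption of this sketch: when the original query already clears the threshold, the first ranking is reused rather than recomputed in s1006 and s1007.

```python
from typing import Callable, Dict, List, Tuple

Ranking = List[Tuple[str, float]]  # (content id, relevance), best first


def answer_query(
    query: str,
    rank: Callable[[str], Ranking],
    paraphrase: Callable[[str], str],
    threshold: float,
) -> Ranking:
    """Rewrite the query only when the top-ranked content is weakly related to it."""
    ranking = rank(query)                      # initial ranking for s1003
    top_relevance = ranking[0][1] if ranking else 0.0
    if top_relevance < threshold:              # s1003: YES
        query = paraphrase(query)              # s1004 to s1005
        ranking = rank(query)                  # s1006 to s1007
    return ranking                             # output in s1008


# Toy stand-ins (assumptions for illustration only).
CONTENTS: Dict[str, str] = {
    "d1": "cheap hotel tokyo",
    "d2": "osaka ramen guide",
}
PARAPHRASES: Dict[str, str] = {"budget inn tokyo": "cheap hotel tokyo"}
paraphrase_calls: List[str] = []


def toy_rank(query: str) -> Ranking:
    """Rank contents by the fraction of query words they contain."""
    words = query.split()
    scored = [
        (doc_id, len(set(words) & set(text.split())) / len(words))
        for doc_id, text in CONTENTS.items()
    ]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def toy_paraphrase(query: str) -> str:
    """Look up a canned paraphrase; stands in for the learned generation model."""
    paraphrase_calls.append(query)
    return PARAPHRASES.get(query, query)


rewritten = answer_query("budget inn tokyo", toy_rank, toy_paraphrase, threshold=0.5)
direct = answer_query("osaka ramen guide", toy_rank, toy_paraphrase, threshold=0.5)
```

In the toy data, the first query is only weakly related to any content, so it is rewritten before the final ranking, while the second query clears the threshold and is answered directly.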

As described above, the search system 50 of the present embodiment uses, as learning data, synonymous expression information that includes each query in the query corpus, its synonymous expressions, and the degree of relevance between each of them (the query and its synonymous expressions) and all the content held by the search system 50. From this data it generates a synonymous expression generation model for converting an input query into a synonymous query from which the correct content is easily extracted, converts each input query into a synonymous expression with that model, and submits the result as a new query. Therefore, correct content can be output even when the input query is written in natural language containing ambiguous expressions.

Although embodiments of the present invention have been described above, the present invention is not limited to these embodiments and includes various modifications. For example, the embodiments above are described in detail to explain the present invention clearly, and the invention is not necessarily limited to configurations that include every element described. Part of the configuration of each embodiment may also have other configurations added to it, be deleted, or be replaced with other configurations.

Some or all of the configurations, functional units, processing units, processing means, and the like described above may be implemented in hardware, for example by designing them as integrated circuits. They may also be implemented in software, with a processor interpreting and executing programs that realize the respective functions. Information such as the programs, tables, and files that realize each function can be placed in a recording device such as a memory, a hard disk, or an SSD (Solid State Drive), or on a recording medium such as an IC card, an SD card, or a DVD.

The arrangement of the functional units, processing units, and databases of each information processing apparatus described above is only an example. The arrangement may be changed to an optimal one from the standpoint of the performance, processing efficiency, communication efficiency, and the like of the hardware and software of these devices.

The configuration (schema and the like) of the databases that store the various data described above may be changed flexibly from the standpoint of efficient resource use and improved processing, access, and search efficiency.

1: machine translation system; 10, 60: domain adaptive learning unit; 11, 61: data input processing unit; 12, 63: storage unit; 13: adaptation source translation model learning unit; 14: adaptation source translation model storage unit; 15, 64: domain adaptation model learning unit; 16, 65: domain adaptation model storage unit; 20: data expansion unit; 21: adaptation destination bilingual corpus generation unit; 22: adaptation destination bilingual corpus selection unit; 30: adaptation destination translation model learning unit; 31: adaptation destination bilingual corpus storage unit; 32: adaptation destination translation model learning unit; 33: adaptation destination translation model storage unit; 40: translation processing unit; 41, 71: input interface; 42: translation unit; 43, 75: output interface; 50: search system; 62: search engine storage unit; 70: search processing unit; 72: domain adaptation determination unit; 73: query synonymous expression generation unit; 111, 611: user interface; 112, 612: data processing unit; 151: monolingual corpus input unit; 152, 642: synonymous expression generation unit; 153, 643: reward output unit; 154, 644: synonymous expression information storage unit; 156: bilingual pair extraction unit; 641: query input unit; 741: query input unit; 742: search content storage unit; 743: similarity calculation unit; 744: search result output unit

Claims

A natural language processing system configured using an information processing device, comprising:
a storage unit that stores a monolingual corpus and a synonymous expression generation model;
a synonymous expression generation unit that generates synonymous expressions of sentences in the monolingual corpus using the synonymous expression generation model;
an evaluation unit that calculates an evaluation value of the ease of natural language processing for each of the sentence and the synonymous expression generated for the sentence; and
a domain adaptation model learning unit that generates, using as learning data synonymous expression information that includes the sentence, the synonymous expression generated from the sentence, and the evaluation value, the synonymous expression generation model that generates, from an input sentence, a synonymous expression that is easy to process in the natural language processing.

The natural language processing system according to claim 1.
The domain adaptive model learning unit executes a pre-learning process to generate an encoder / decoder-type neural model for restoring a sentence of the monolingual corpus as the synonymous expression generation model.
The synonymous expression generation unit uses the synonymous expression generation model generated by the pre-learning process when first generating a synonymous expression of a sentence.
Natural language processing system.

The natural language processing system according to claim 1.
The synonymous expression generation unit generates a plurality of the synonymous expressions by executing beam search decoding after encoding the synonymous expressions generated by using the synonymous expression generation model.
The domain adaptation model learning unit generates, using as learning data the synonymous expression information that includes the sentence, the plurality of synonymous expressions generated from the sentence, and the evaluation value, the synonymous expression generation model that generates, from an input sentence, a synonymous expression on which the natural language processing is easy to perform,
Natural language processing system.

The natural language processing system according to claim 3.
The synonymous expression generation unit executes preprocessing for recognizing synonyms contained in the plurality of generated synonymous expressions as the same character string and,
following the preprocessing, a noise reduction process for removing noise from the plurality of synonymous expressions.
In the preprocessing, common character strings are extracted based on the longest match, and the characters lying between the common character strings extracted from each synonymous expression are treated as synonyms.
In the noise reduction process, the similarity between each synonymous expression and the sentence from which it was derived is obtained based on a predetermined index, and any synonymous expression whose similarity is equal to or less than a predetermined value is regarded as noise and removed,
Natural language processing system.

The natural language processing system according to any one of claims 1 to 4.
The storage unit stores an adaptation source bilingual corpus, which is a bilingual corpus of the adaptation source domain, and an adaptation destination monolingual corpus, which is a monolingual corpus of the adaptation destination domain.
An adaptation source translation model learning unit that generates an adaptation source translation model, which is a translation model that machine-translates a sentence of the adaptation source domain using the adaptation source bilingual corpus as learning data.
The adaptation destination translation model learning unit that generates the adaptation destination translation model that translates the sentence of the adaptation destination domain,
An adaptation destination translation corpus generator that generates an adaptation destination translation corpus, which is a translation corpus of the adaptation destination domain,
A translation unit that translates the input sentence of the adaptation destination domain using the adaptation destination translation model, and
Equipped with
The synonymous expression generation unit generates a synonymous expression of the sentence of the adaptation destination domain of the adaptation destination monolingual corpus.
The evaluation unit calculates the ease of processing when the adaptation source translation model learning unit machine-translates the sentence of the adaptation destination domain and each of the synonymous expressions of the sentence as the evaluation value.
The domain adaptation model learning unit learns the synonymous expression generation model so that the evaluation value of the synonymous expression that is easier to translate becomes high and the evaluation value of the synonymous expression that is difficult to translate becomes low.
The adaptation destination translation corpus generation unit is a pair of a synonymous expression that is easy to machine translate in the adaptation source translation model and a translation of the synonymous expression, and a synonym expression that is difficult to machine translate in the adaptation source translation model. The target translation corpus is generated by collecting each pair of the reference translation for the sentence in the adaptation source translation corpus that is the origin of the synonymous expression.
The adaptation destination translation model learning unit generates the adaptation destination translation model using the adaptation destination translation corpus as learning data.
Natural language processing system.

The natural language processing system according to any one of claims 1 to 4, wherein the natural language processing system is used.
It has a search processing unit that outputs the content corresponding to the query entered by the user by the search engine.
The storage unit stores a query corpus that collects queries for search engines as the monolingual corpus, and stores the contents.
The synonymous expression generation unit generates a synonymous expression of the sentence of the query corpus.
The evaluation unit calculates the evaluation value based on the query included in the query corpus, each of its synonymous expressions, and the degree of relevance to all the contents stored in the storage unit.
The domain adaptation model learning unit generates the synonymous expression generation model so as to minimize a loss function based on the difference between the evaluation value based on the degree of relevance between the query and the content and the evaluation value based on the degree of relevance between the synonymous expression of the query and the content,
The search processing unit generates a synonymous expression from the user-input query by the synonymous expression generation model, and inputs the synonymous expression as a query to the search engine.
Natural language processing system.

A natural language processing method in which an information processing device executes:
a step of storing a monolingual corpus and a synonymous expression generation model;
a step of generating a synonymous expression of a sentence of the monolingual corpus using the synonymous expression generation model;
a step of calculating an evaluation value of the ease of natural language processing for each of the sentence and the synonymous expression generated for the sentence; and
a step of generating, using as learning data synonymous expression information that includes the sentence, the synonymous expression generated from the sentence, and the evaluation value, the synonymous expression generation model that generates, from an input sentence, a synonymous expression that is easy to process in the natural language processing.

The natural language processing method according to claim 7.
The information processing device
A step of executing a pre-learning process to generate an encoder / decoder neural model for restoring a sentence of the monolingual corpus as the synonymous expression generation model.
A step of using the synonymous expression generation model generated by the pre-learning process when first generating a synonymous expression of a sentence,
A natural language processing method that further executes.

The natural language processing method according to claim 7.
The information processing device
A step of generating a plurality of the synonyms by executing beam search decoding after encoding the synonyms generated using the synonym generation model, and
Using the sentence, the plurality of synonymous expressions generated from the sentence, and the synonymous expression information including the evaluation value as learning data, the synonymous expression that generates the synonymous expression that is easy to process in natural language for the input sentence. Steps to generate a generative model,
A natural language processing method that further executes.

The natural language processing method according to claim 9.
The information processing device
A step of executing preprocessing for recognizing synonyms contained in a plurality of generated synonyms as the same character string,
Following the preprocessing, a step of executing a noise reduction process for removing noise from the plurality of synonymous expressions.
Further run,
In the preprocessing, the common character string is extracted based on the longest match, and the characters between the common character strings extracted by each synonym are used as synonyms.
In the noise reduction processing, the similarity between the synonymous expression and the sentence from which the synonymous expression is derived is obtained based on a predetermined index, and the synonymous expression whose similarity is equal to or less than a predetermined value is regarded as noise. To remove,
Natural language processing method.

The natural language processing method according to any one of claims 7 to 10.
The information processing device
A step to memorize the adaptation source bilingual corpus, which is the translation corpus of the adaptation source domain, and the adaptation destination monolingual corpus, which is the monolingual corpus of the adaptation destination domain.
A step of generating an adaptation source translation model, which is a translation model for machine-translating a sentence of the adaptation source domain using the adaptation source bilingual corpus as learning data.
A step of generating an adaptive translation model that translates a sentence in the destination domain,
A step of generating an adaptation destination bilingual corpus, which is a bilingual corpus of the adaptation destination domain,
The step of translating the input sentence of the adaptation destination domain using the adaptation destination translation model,
A step of generating a synonym for a sentence in the destination domain of the destination monolingual corpus,
A step of calculating the ease of processing when machine-translating each of the sentence of the adaptation destination domain and the synonymous expression of the sentence as the evaluation value.
A step of learning the synonymous expression generation model so that the evaluation value of the synonymous expression that is easier to translate becomes higher and the evaluation value of the synonymous expression that is difficult to translate becomes lower.
The pair of synonymous expressions that are easy to machine translate in the adaptation source translation model and the translation of the synonymous expressions, and the synonymous expressions that are difficult to machine translate in the adaptation source translation model and the adaptation that became the origin of the synonymous expressions. A step of collecting each pair of a reference translation for a sentence in the original translation corpus to generate the adaptation destination translation corpus, and
A step of generating the adaptive translation model using the adaptive translation corpus as learning data.
A natural language processing method that further executes.

The natural language processing method according to any one of claims 7 to 10.
The information processing device
A step in which a search engine outputs content that corresponds to a user-entered query,
A step of storing the query corpus that collects queries to search engines as the monolingual corpus and storing the content.
Steps to generate synonyms for the query corpus statement,
A step of calculating the evaluation value based on each of the queries included in the query corpus and their synonymous expressions and the degree of relevance to all the contents to be stored.
The synonymous expression generation model is set so that the loss function of the difference between the evaluation value based on the degree of association between the query and the content and the evaluation value based on the synonymous expression of the query and the content is minimized. Steps to generate and
A step of generating a synonymous expression from a user-input query by the synonymous expression generation model and inputting the synonymous expression into a search engine as a query.
A natural language processing method that further executes.