JPWO2021009972A1

JPWO2021009972A1 - Natural language processing method, natural language processing system, and natural language processing program

Info

Publication number: JPWO2021009972A1
Application number: JP2020518837A
Authority: JP
Inventors: 泰弘梅本
Original assignee: MALLSERVICE INC.
Current assignee: MALLSERVICE INC.
Priority date: 2019-07-18
Filing date: 2020-03-24
Publication date: 2021-09-13
Also published as: WO2021009972A1

Abstract

The problem to be solved is to realize novel natural language processing. Provided are a natural language processing method, and a system and program therefor, that cause a processor of a computer to execute: a reception step of accepting designation of a web page containing a first text in a source language; a translation step of acquiring a second text in a target language based on the first text; a collection step of performing scraping using at least part of the second text as a query to acquire a third text in the target language; and a determination step of determining the third text as the translation of the first text.

Description

The present invention relates to a natural language processing method, a natural language processing system, and a natural language processing program.

Machine translation, which converts text written in one natural language (the source language) into text written in a different natural language (the target language), is "translation by machine": no human takes part as an agent in the translation process.

"Translation by machine" has generalized a translation process that once called for near-specialist skill and has reduced translation costs. At the same time, it has flooded the web with translations that are hard to read, which is, for example, one of the factors that make search engine optimization difficult.

For this reason, recent machine translation is expected not only to improve translation accuracy but also to realize "translation by humans", in which a human takes part as an agent in the translation process and which can produce easy-to-read free translations.

Patent Document 1 reports an invention relating to a machine translation system that includes: a translation candidate generation unit comprising a plurality of machine translation devices, each of which generates a translation in a second language for an input sentence in a first language; a translation improvement unit that takes each of the resulting second-language translations as a starting point and transforms and improves it; and a termination determination unit that selects, from among the improved translations, one that satisfies a predetermined condition as the output sentence for the input sentence.

Patent Document 1: Japanese Patent No. 3919771

The translation improvement unit in the invention described in Patent Document 1 includes a translation selection unit for selecting either an initial candidate translation or a translation read from a translation storage unit, and it can be understood to be capable of generating translations that are less literal and closer to free translation.

However, to reproduce "translation by humans", it is necessary to refer not only to information that has been recorded or improved, but also to the latest text and context produced as human creations. In this respect, the invention described in Patent Document 1 leaves room for improvement.

The problem to be solved by the present invention is to realize novel natural language processing.

To solve the above problem, the present invention is a natural language processing method characterized by causing a processor of a computer to execute: a reception step of accepting designation of a web page containing a first text in a source language; a translation step of acquiring a second text in a target language based on the first text; a collection step of performing scraping using at least part of the second text as a query to acquire a third text in the target language; and a determination step of determining the third text as the translation of the first text.
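The translation, collection, and determination steps can be sketched as follows. This is a minimal illustration only: `determine_translation`, `fake_translate`, and `fake_scrape` are hypothetical names, and the translation engine and scraper are replaced by toy stand-ins so that the example runs without network access.

```python
from typing import Callable


def determine_translation(
    first_text: str,
    translate: Callable[[str], str],
    scrape: Callable[[str], str],
) -> str:
    """Return the third text, which is adopted as the translation of first_text."""
    # Translation step: machine-translate the first text into the target language.
    second_text = translate(first_text)
    # Collection step: use the second text as the scraping query.
    third_text = scrape(second_text)
    # Determination step: adopt the scraped text as the translation.
    return third_text


# Hypothetical stand-in for a translation engine (assumption, not a real API).
def fake_translate(text: str) -> str:
    return {"こんにちは世界": "Hello world"}.get(text, text)


# Hypothetical stand-in for a scraper: a tiny in-memory "web".
def fake_scrape(query: str) -> str:
    corpus = ["Hello, world!", "Goodbye for now."]
    words = query.lower().split()
    # Pick the sentence that contains the most query words.
    return max(corpus, key=lambda s: sum(w in s.lower() for w in words))


result = determine_translation("こんにちは世界", fake_translate, fake_scrape)
print(result)
```

Passing the engine and the scraper in as callables keeps the sketch independent of any particular translation or scraping library.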

In a preferred form of the present invention, the processor is further caused to execute an analysis step of acquiring a morpheme sequence of a text, an evaluation step of acquiring a distributed vector of the morpheme sequence, and a judgment step of determining a similarity between distributed vectors. The translation step acquires a fourth text in the source language based on the third text. The analysis step acquires the morpheme sequence of the first text and the morpheme sequence of the fourth text. The evaluation step acquires the distributed vector of the first text and the distributed vector of the fourth text. The determination step determines the third text as the translation when the similarity between the distributed vectors of the first text and the fourth text, as determined by the judgment step, exceeds a threshold.
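The acceptance test in this preferred form compares the vector of the first text with the vector of the back-translated fourth text. A minimal sketch, assuming cosine similarity as the similarity measure and plain lists of floats as vectors (the function names and the threshold value 0.8 are illustrative assumptions):

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for zero vectors
    return dot / (norm_a * norm_b)


def accept_third_text(v1, v4, threshold):
    """True if the similarity between V1 and V4 exceeds the threshold."""
    return cosine_similarity(v1, v4) > threshold


v1 = [1.0, 0.0, 1.0]        # vector of the first text (toy values)
v4_close = [1.0, 0.1, 0.9]  # back-translation close in meaning
v4_far = [0.0, 1.0, 0.0]    # back-translation unrelated in meaning

print(accept_third_text(v1, v4_close, 0.8))
print(accept_third_text(v1, v4_far, 0.8))
```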

In a preferred form of the present invention, when the similarity between the distributed vector of the first text and the distributed vector of the fourth text does not exceed the threshold, the collection step expands the query based on the web page, performs the scraping, and acquires the third text.

In a preferred form of the present invention, the evaluation step acquires the distributed vector based on a trained model, and the trained model is a neural network model in which part of a hidden layer or the output layer represents the distributed vector.

To solve the above problem, the present invention is a natural language processing system characterized by having: a reception means that accepts designation of a web page containing a first text in a source language; a translation means that acquires a second text in a target language based on the first text; a collection means that performs scraping using at least part of the second text as a query to acquire a third text in the target language; and a determination means that determines the third text as the translation of the first text.

To solve the above problem, the present invention is a natural language processing program characterized by causing a computer to function as: a reception means that accepts designation of a web page containing a first text in a source language; a translation means that acquires a second text in a target language based on the first text; a collection means that performs scraping using at least part of the second text as a query to acquire a third text in the target language; and a determination means that determines the third text as the translation of the first text.

According to the present invention, novel natural language processing can be realized.

FIG. 1 shows a hardware configuration according to one embodiment of the present invention.
FIG. 2 shows a block diagram according to one embodiment of the present invention.
FIG. 3 shows a block diagram of the databases according to one embodiment of the present invention.
FIG. 4 shows a flowchart according to one embodiment of the present invention.

A natural language processing system, a natural language processing method, and a natural language processing program according to one embodiment of the present invention are described below with reference to the drawings. The present invention is not limited to the following embodiment and may adopt various configurations. For example, each means and each step according to the present invention may, as appropriate, perform messaging via e-mail, SMS, or the like, and data input/output via an API, in order to achieve its effects.

The natural language processing system, the natural language processing method, and the natural language processing program achieve the same effects. In addition, the effect of each means is identical to the effect of the step bearing the same name.

The natural language processing program may be stored on a non-transitory recording medium. A non-transitory recording medium storing the natural language processing program is used to install the program on a computer device.

《Hardware configuration》
As shown in FIG. 1, the natural language processing system includes one or more servers 10, one or more cache servers 20, and one or more terminals 30.

The server 10 is a computer and includes at least an arithmetic unit 11, a main storage unit 12, an auxiliary storage unit 13, an input unit 14, a display unit 15, and a communication unit 16. Each unit is used to realize the effects of the means of the server 10.

The cache server 20 is a computer and includes at least an arithmetic unit 21, a main storage unit 22, an auxiliary storage unit 23, an input unit 24, a display unit 25, and a communication unit 26. Each unit is used to realize the effects of the means of the cache server 20.

The terminal 30 is a computer and includes at least an arithmetic unit 31, a main storage unit 32, an auxiliary storage unit 33, an input unit 34, a display unit 35, and a communication unit 36. Each unit is used to realize the effects of the means of the terminal 30.

The server 10, the cache server 20, and the terminal 30 are connected to one another via their respective communication units and a network. The network consists of a public network and/or a private network, and there is no restriction on the communication protocol or the like.

The units of the server 10, the cache server 20, and the terminal 30 are described below.

The arithmetic units 11, 21, and 31 each include a known processor such as a CPU. The main storage units 12, 22, and 32 each include a known volatile device such as RAM. The auxiliary storage units 13, 23, and 33 each include a known non-volatile device such as flash memory and store an OS and programs. The auxiliary storage unit 13 may function as at least part of the databases DB1, DB2, DB3, DB4, DB5, and DB6 described later. The input units 14, 24, and 34 are used for data input to the natural language processing system and the like. The input unit 34 includes an input device such as a keyboard or a touch panel. The display units 15, 25, and 35 are used for data display processing of the natural language processing system and include a display device, a graphics controller, and the like. The communication units 16, 26, and 36 are used for communication processing. At least part of the databases DB1, DB2, DB3, DB4, DB5, and DB6 may be an external database that can communicate with the server 10.

The server 10 and the cache server 20 take a known device configuration such as a workstation and may omit the display unit 15 or 25. The terminal 30 can adopt a known device configuration such as a smartphone or a laptop.

《Block diagram》
As shown in FIGS. 2 and 3, the present invention is realized by an organic combination of the databases DB1, DB2, DB3, DB4, DB5, and DB6, the server 10, the cache server 20, and the terminal 30.

The database DB1 stores text information 1001 representing each of the first text T1, the second text T2, the third text T3, and the fourth text T4. The database DB1 further stores text meta information 1002 indicating the category or natural language of each of the first text T1, the second text T2, the third text T3, and the fourth text T4. The database DB1 further stores morpheme sequence information 1003 representing each of the morpheme sequences P1, P2, P3, and P4. The database DB1 further stores distributed vector information 1004 representing each of the distributed vectors V1, V2, V3, and V4.

The database DB2 stores information on the web page 0 accepted by the reception means 101. The database DB2 further stores link information 2001 corresponding to the web page 0, category information 2002 indicating the category of the web page 0, domain information 2003, which is the country code top-level domain indicated by the link information 2001, and language information 2004 for identifying the natural language of the text contained in the web page 0.
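As an illustration of how the domain information 2003 could be derived from the link information 2001, the sketch below extracts a country code top-level domain from a URL with the Python standard library. The function name `country_code_tld` is hypothetical. The two-letter check relies on the fact that two-letter ASCII top-level domains are reserved for country codes.

```python
from urllib.parse import urlparse


def country_code_tld(url: str):
    """Return the ccTLD of a URL in lowercase, or None if the TLD is not a country code."""
    host = urlparse(url).hostname or ""
    label = host.rsplit(".", 1)[-1]  # last dot-separated label of the host name
    if len(label) == 2 and label.isascii() and label.isalpha():
        return label.lower()
    return None


print(country_code_tld("https://www.example.co.jp/products"))
print(country_code_tld("https://example.com/"))
```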

The database DB3 holds information on the web page 0 accepted by the reception means 101. The database DB3 holds style sheet information 3001 of the web page 0 and script information 3002 representing the script code of the web page 0.

The database DB4 takes the form of a user-defined dictionary and holds word information 4001.

The database DB5 holds vocabulary information 5001 for representing the distributed vector of at least part of a morpheme sequence based on a trained model 1031 described later. The vocabulary information 5001 contains one or more words.

The database DB6 holds query information 6001 representing the query used for scraping and link information 6002 indicating one or more web pages 2 that are scraping targets.

The server 10 has at least a reception means 101, an analysis means 102, an evaluation means 103, a translation means 104, a collection means 105, a judgment means 106, and a determination means 107. Each means of the server 10 may be realized by distributed processing across one or more servers 10. The means of the server 10 may also be divided among the one or more servers 10.

The reception means 101 accepts the designation of the web page 0 by the terminal 30. The reception means 101 stores the link information 2001, the category information 2002, the domain information 2003, and the language information 2004 in the database DB2. In this specification, "web page" refers to at least one of the web pages 0, 1, and 2. The "category" in the category information 2002 refers to, for example, IT, software, ASP, and the like.

The analysis means 102 performs morphological analysis on a text, determines the morpheme sequence of that text, and stores the morpheme sequence in the database DB1. The analysis means 102 uses a morphological analysis engine 1021 stored on the server 10 and/or an external server. The morphological analysis engine 1021 is a known morphological analysis engine. In this specification, "text" refers to at least one of the first text T1, the second text T2, the third text T3, and the fourth text T4, and "morpheme sequence" refers to at least one of the morpheme sequences P1, P2, P3, and P4.

The evaluation means 103 determines a distributed vector based on at least part of a morpheme sequence and stores the distributed vector in the database DB1. The evaluation means 103 uses a trained model 1031 stored on the server 10 and/or an external server. The trained model 1031 is a known neural network model such as a CBoW model, a Skip-Gram model, a DoBW model, or a PV-DM model. The input to the trained model 1031 is a numerical vector based on at least part of the morpheme sequence, namely one or more one-hot vectors based on the vocabulary and on at least part of the morpheme sequence acquired by the analysis means 102. Part of a hidden layer, or the output layer, of the trained model 1031 represents the distributed vector. In this specification, "distributed vector" refers to at least one of the distributed vectors V1, V2, V3, and V4. The distributed vector of a morpheme sequence may be based on the distributed vectors of the individual words that make up the sequence. Training of the trained model 1031 may be accelerated by applying hierarchical softmax, negative sampling, an embedding layer, or the like.
Needless to say, the trained model 1031 uses a word, its surrounding words, or the like as teacher values, as appropriate for the type of neural network model.
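The relationship between one-hot inputs, a hidden layer, and the vector of a morpheme sequence can be sketched as follows. The weight matrix here is a toy stand-in for the input-to-hidden weights of a trained model, and averaging the word vectors is only one possible way of basing the sequence vector on its words. All names and values are illustrative assumptions.

```python
def one_hot(word, vocabulary):
    """One-hot vector of a word over the vocabulary."""
    vec = [0] * len(vocabulary)
    vec[vocabulary.index(word)] = 1
    return vec


def embed(one_hot_vec, weights):
    """Multiply a one-hot row vector by the weight matrix (selects one row)."""
    dim = len(weights[0])
    return [
        sum(one_hot_vec[i] * weights[i][d] for i in range(len(weights)))
        for d in range(dim)
    ]


def sequence_vector(morphemes, vocabulary, weights):
    """Average the word vectors of the morphemes found in the vocabulary."""
    known = [m for m in morphemes if m in vocabulary]
    if not known:
        return [0.0] * len(weights[0])
    vectors = [embed(one_hot(m, vocabulary), weights) for m in known]
    return [sum(column) / len(vectors) for column in zip(*vectors)]


vocabulary = ["translation", "web", "page"]      # toy vocabulary
weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]   # toy hidden-layer weights

# "the" is not in the vocabulary and is ignored.
print(sequence_vector(["web", "page", "the"], vocabulary, weights))
```

In practice the matrix product with a one-hot vector is replaced by a direct row lookup, which is what an embedding layer does.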

The translation means 104 converts source-language text given as input into target-language text as output and stores the target-language text in the database DB1. The translation means 104 uses a translation engine 1041 stored on the server 10 and/or an external server.
The translation engine 1041 is a known translation engine. The translation engine 1041 may be rule-based or corpus-based, and may be based on statistical machine translation or neural machine translation. In this specification, "source language" and "target language" each refer to a known natural language such as Japanese or English. Needless to say, the "source language" is the language translated from, and the "target language" is the language translated into, in the natural language processing according to the present invention.

The collection means 105 performs scraping based on the query information 6001, acquires text, and stores that text in the database DB1. The collection means 105 uses a scraper 1051 stored on the server 10 or an external server. The scraper 1051 is a known scraper. The collection means 105 may determine a query as output based on a text given as input, and may determine a phrase that forms part of the input text as the query. The collection means 105 may also expand the query based on various information on a web page and/or text held by the collection means 105. In that case, the collection means 105 may expand the query based on text on a predetermined web page that corresponds to the category indicated by at least part of the input text. The category corresponding to at least part of the input text and the link indicating the predetermined web page are entered by user operation or the like, stored in one of the databases of the natural language processing system, and referred to by the collection means 105 as appropriate.
In this specification, the collection target of "scraping" is information on known web pages in general, including not only text but also images and the like. Based on the query, the collection means 105 acquires, from among the texts on known web pages scored by a known search engine, a text with a high search score and stores that text in the database DB1. The search score indicates the degree of match between a text or web page and the query; the conventional scores used by known search engines are examples. The collection means 105 may determine the collection target based on the query. The collection means 105 may cooperate with a known image analysis engine to estimate text from an image on a web page. Based on the word information 4001, the collection means 105 may perform, as word alignment, post-processing that replaces predetermined words or phrases contained in text data such as the first text T1 with proper nouns.
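The query handling of the collection means 105 can be sketched as three small operations: deriving a query from part of a text, expanding it with extra terms, and choosing the text with the highest search score. The `SearchHit` type, the function names, and the scores below are hypothetical; a real search engine would supply its own scoring.

```python
from dataclasses import dataclass


@dataclass
class SearchHit:
    url: str
    text: str
    score: float  # degree of match between the text and the query


def build_query(text, max_words=5):
    """Use the leading phrase of the text as the query."""
    return " ".join(text.split()[:max_words])


def expand_query(query, extra_terms):
    """Append terms (e.g. taken from a category page) that the query lacks."""
    terms = query.split()
    for term in extra_terms:
        if term not in terms:
            terms.append(term)
    return " ".join(terms)


def best_text(hits):
    """Return the text with the highest search score, or None if there are no hits."""
    if not hits:
        return None
    return max(hits, key=lambda hit: hit.score).text


hits = [
    SearchHit("https://example.com/a", "Cloud backup made simple.", 0.42),
    SearchHit("https://example.com/b", "Back up your files to the cloud.", 0.87),
]
query = expand_query(build_query("cloud backup service for small teams today"), ["software", "cloud"])
print(query)
print(best_text(hits))
```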

The judgment means 106 takes two different distributed vectors as input and determines a similarity 300 between them as output. The judgment means 106 determines the similarity 300 using a known method of calculating a similarity or distance measure, such as cosine similarity, Pearson's correlation coefficient, deviation pattern similarity, Euclidean distance, standardized Euclidean distance, Mahalanobis distance, Manhattan distance, or Minkowski distance.
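Two of the listed measures are sketched below for illustration: Pearson's correlation coefficient, and the Minkowski distance, which reduces to the Manhattan distance for p = 1 and the Euclidean distance for p = 2. Note that for a distance a smaller value means greater similarity, so a threshold comparison has to be inverted or the distance converted first; how this is handled is left open here.

```python
import math


def pearson_correlation(a, b):
    """Pearson's correlation coefficient of two equal-length vectors."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    cov = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0 or var_b == 0:
        return 0.0  # convention chosen here for constant vectors
    return cov / math.sqrt(var_a * var_b)


def minkowski_distance(a, b, p):
    """Minkowski distance of order p (p=1: Manhattan, p=2: Euclidean)."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1.0 / p)


print(pearson_correlation([1, 2, 3], [2, 4, 6]))
print(minkowski_distance([0, 0], [3, 4], 1))
print(minkowski_distance([0, 0], [3, 4], 2))
```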

The determination means 107 stores a web page 1 containing the target-language text in the cache server 20. In doing so, the determination means 107 determines that text as the translation of the first text T1. The collection means 105 may retain that text at this time. The determination means 107 converts the source-language text on the web page 0 into that text based on at least part of the various information in the databases DB2 and DB3.

《Flowchart》
As shown in FIG. 4, the series of processes according to the present invention includes the following steps. The order of the steps shown in FIG. 4 is an example and may be changed as appropriate unless otherwise specified.

The terminal 30 designates at least part of the URL of the web page 0 containing the first text T1 (designation step S100). At this time, the terminal 30 may be able to designate at least part of the text on the web page 0 as the text to be processed. Next, the reception means 101 accepts the designation, made in designation step S100, of the web page 0 containing the first text T1 (reception step S101).

The analysis means 102 performs morphological analysis on the first text T1 and acquires the morpheme sequence P1 of the first text T1 (analysis step S102). At this time, based on the word information 4001, the analysis means 102 may perform, as word alignment, preprocessing that replaces predetermined words or phrases contained in text data such as the first text T1 with proper nouns. This can be expected to improve the accuracy with which morpheme sequences are determined in the natural language processing according to the present invention.
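A minimal sketch of such dictionary-based preprocessing follows, assuming that the word information 4001 can be represented as a mapping from phrases to proper nouns. The function name and the dictionary entries are hypothetical.

```python
def apply_user_dictionary(text, word_info):
    """Replace each registered phrase in the text with its proper noun."""
    # Replace longer phrases first so that shorter entries do not split them.
    for phrase in sorted(word_info, key=len, reverse=True):
        text = text.replace(phrase, word_info[phrase])
    return text


# Toy user-defined dictionary (illustrative entries only).
word_info = {"mall service": "MALLSERVICE", "asp": "ASP"}

print(apply_user_dictionary("mall service provides an asp", word_info))
```

`str.replace` matches substrings anywhere in the text, so a production version would likely need word-boundary handling appropriate to the language.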

Based on the trained model 1031, the evaluation means 103 acquires the distributed vector V1 as the output corresponding to the morpheme sequence P1 given as input (evaluation step S103). At this time, the evaluation means 103 may determine the distributed vector V1 based only on the nouns, verbs, and adjectives contained in a morpheme sequence such as the morpheme sequence P1. This makes it easier to remove noise from the distributed vector in the natural language processing according to the present invention.
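Restricting a morpheme sequence to nouns, verbs, and adjectives can be sketched as a simple filter over (surface form, part of speech) pairs. The tag names used here are assumptions; an actual morphological analysis engine emits its own tag set.

```python
CONTENT_POS = {"noun", "verb", "adjective"}  # hypothetical tag names


def content_morphemes(tagged_morphemes):
    """Keep only the surface forms tagged as noun, verb, or adjective."""
    return [surface for surface, pos in tagged_morphemes if pos in CONTENT_POS]


tagged = [
    ("the", "determiner"),
    ("fast", "adjective"),
    ("server", "noun"),
    ("responds", "verb"),
    ("quickly", "adverb"),
]
print(content_morphemes(tagged))
```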

The translation means 104 takes the source-language first text T1 as input and acquires, via the translation engine 1041, the target-language second text T2 as output (translation step S104). Next, the collection means 105 determines the query information 6001 based on at least part of the second text T2, scrapes the scraping targets, which include the web page 2 based on the query information 6001, and acquires the target-language third text T3 (collection step S105). Next, the translation means 104 takes the target-language third text T3 as input and acquires, via the translation engine 1041, the source-language fourth text T4 as output (translation step S106).

The analysis means 102 performs morphological analysis on the source-language fourth text T4 determined in translation step S106 and acquires the morpheme sequence P4 of the fourth text T4 (analysis step S107). Next, the evaluation means 103 acquires the distributed vector V4 of the morpheme sequence P4 based on the trained model 1031 (evaluation step S108).

The judgment means 106 acquires the similarity 300 between the distributed vector V1 and the distributed vector V4 (judgment step S109). When the similarity 300 exceeds a threshold 301, the determination means 107 stores the web page 1 containing the third text T3 in the cache server 20 (determination step S110). The threshold 301 may be set to any value. When the similarity 300 does not exceed the threshold 301, the collection means 105 expands the query information 6001 based on the web page 0, performs scraping again, and reacquires the target-language third text T3 (collection step S105X).
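The flow from step S104 to step S110, including the retry in step S105X, can be sketched as a loop. Everything here is a toy stand-in so that the example runs on its own: upper and lower case play the roles of the source and target languages, a word set plays the role of the vector, and Jaccard overlap plays the role of the similarity 300. Bounding the retries by a finite list of expansion terms is also an assumption of this sketch.

```python
def find_translation(first_text, to_target, to_source, scrape,
                     vectorize, similarity, threshold, expansions):
    """Return an accepted third text, or None if every attempt is rejected."""
    v1 = vectorize(first_text)                      # S102-S103
    query = to_target(first_text)                   # S104
    for extra in [None, *expansions]:
        if extra is not None:
            query = f"{query} {extra}"              # S105X: expand the query
        third_text = scrape(query)                  # S105
        fourth_text = to_source(third_text)         # S106
        v4 = vectorize(fourth_text)                 # S107-S108
        if similarity(v1, v4) > threshold:          # S109
            return third_text                       # S110
    return None


def jaccard(a, b):
    """Overlap of two word sets, used here in place of a vector similarity."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy "web": the plain query finds an unrelated page, the expanded one a good page.
pages = {
    "fast cloud backup": "slow tape archive",
    "fast cloud backup software": "fast cloud backup",
}

result = find_translation(
    "FAST CLOUD BACKUP",
    to_target=str.lower,
    to_source=str.upper,
    scrape=lambda q: pages.get(q, ""),
    vectorize=lambda t: set(t.split()),
    similarity=jaccard,
    threshold=0.5,
    expansions=["software"],
)
print(result)
```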

A natural language processing system according to one embodiment of the present invention may take the form of a Web API that accepts the first text T1 and an email address as input (hereinafter referred to as the "mail translation API"). In the mail translation API, for example, the terminal 30 POSTs the first text T1 and the email address. The reception means 101 then accepts the first text T1 and the email address entered from the terminal 30. In the mail translation API, when the determination means 107 of the terminal 30 POSTs the first text T1 and the email address as a request, the server 10 returns the third text T3 as a response. Here, the response is realized by sending, to that email address, an email whose body is the third text T3. An embodiment that takes the form of the mail translation API may adopt, as appropriate, at least a part of the configuration of any other embodiment.
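A minimal sketch of the server side of the mail translation API, using only the Python standard library, is given below. The payload field names `text` and `email`, the sender address, the subject line, and the `find_translation` callable (which stands for the whole T1 to T3 pipeline) are all assumptions made for illustration.

```python
from email.message import EmailMessage
from typing import Callable, Mapping


def handle_mail_translation(
    payload: Mapping[str, str],
    find_translation: Callable[[str], str],  # hypothetical: T1 -> T3
    sender: str = "noreply@example.com",      # placeholder sender address
) -> EmailMessage:
    """Build the response email whose body is the third text T3."""
    t1 = payload["text"]        # first text T1 from the POSTed request
    address = payload["email"]  # destination email address from the request
    t3 = find_translation(t1)

    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = address
    msg["Subject"] = "Translation result"
    msg.set_content(t3)         # T3 becomes the mail body
    return msg
```

The function only builds the message. Actual delivery could be done, for example, by passing the result to `smtplib.SMTP.send_message`, which requires a configured mail server and is therefore left out of the sketch.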

A natural language processing system according to one embodiment of the present invention can be understood as taking the form of a Web API in which the first text T1 and the third text T3 serve as the request and the response, respectively.

According to the present invention, novel natural language processing can be realized.

0 Web page
1 Web page
2 Web page
10 Server
11 Calculation unit
12 Main storage unit
13 Auxiliary storage unit
14 Input unit
15 Display unit
16 Communication unit
20 Cache server
21 Calculation unit
22 Main storage unit
23 Auxiliary storage unit
24 Input unit
25 Display unit
26 Communication unit
30 Terminal
31 Calculation unit
32 Main storage unit
33 Auxiliary storage unit
34 Input unit
35 Display unit
36 Communication unit
101 Reception means
102 Analysis means
103 Evaluation means
104 Translation means
105 Collection means
106 Judgment means
107 Determination means
300 Similarity
301 Threshold
1001 Text information
1002 Text meta information
1003 Morpheme string information
1004 Distributed vector information
1021 Morphological analysis engine
1031 Trained model
1041 Translation engine
1051 Scraper
2001 Link information
2002 Category information
2003 Domain information
2004 Language information
3001 Style sheet information
3002 Script information
4001 Word information
5001 Vocabulary information
6001 Query information
6002 Link information
DB1 Database
DB2 Database
DB3 Database
DB4 Database
DB5 Database
DB6 Database
P1 Morpheme string
P2 Morpheme string
P3 Morpheme string
P4 Morpheme string
S100 Designation step
S101 Reception step
S102 Analysis step
S103 Evaluation step
S104 Translation step
S105 Collection step
S105X Collection step
S106 Translation step
S107 Analysis step
S108 Evaluation step
S109 Judgment step
S110 Determination step
T1 First text
T2 Second text
T3 Third text
T4 Fourth text
V1 Distributed vector
V2 Distributed vector
V3 Distributed vector
V4 Distributed vector

Claims

A natural language processing method that causes a processor of a computer to execute:
a reception step of accepting designation of a web page containing a first text in an original language;
a translation step of obtaining a second text in a target language based on the first text;
a collection step of performing scraping using at least a part of the second text as a query to obtain a third text in the target language; and
a determination step of determining the third text as a translation of the first text.

The natural language processing method according to claim 1, further causing the processor to execute:
an analysis step of acquiring a morpheme string of a text;
an evaluation step of acquiring a distributed vector of the morpheme string; and
a judgment step of judging the similarity between distributed vectors,
wherein the translation step obtains a fourth text in the original language based on the third text,
the analysis step acquires the morpheme string of the first text and the morpheme string of the fourth text,
the evaluation step acquires the distributed vector of the first text and the distributed vector of the fourth text, and
the determination step determines the third text as the translation when the similarity between the distributed vectors of the first text and the fourth text, as judged by the judgment step, exceeds a threshold.

The natural language processing method according to claim 2, wherein, when the similarity between the distributed vector of the first text and the distributed vector of the fourth text does not exceed the threshold, the collection step expands the query based on the web page, performs the scraping, and obtains the third text.

The natural language processing method according to claim 2 or 3, wherein the evaluation step acquires the distributed vector based on a trained model, and
the trained model is a neural network model in which a part of a hidden layer or an output layer represents the distributed vector.

A natural language processing system comprising:
reception means for accepting designation of a web page containing a first text in an original language;
translation means for obtaining a second text in a target language based on the first text;
collection means for performing scraping using at least a part of the second text as a query to obtain a third text in the target language; and
determination means for determining the third text as a translation of the first text.

A natural language processing program that causes a computer to function as:
reception means for accepting designation of a web page containing a first text in an original language;
translation means for obtaining a second text in a target language based on the first text;
collection means for performing scraping using at least a part of the second text as a query to obtain a third text in the target language; and
determination means for determining the third text as a translation of the first text.