JP2020166501A

JP2020166501A - Segmentation model generation system, text segmentation device and segmentation model generation method

Info

Publication number: JP2020166501A
Application number: JP2019065706A
Authority: JP
Inventors: 夢如王; Mengru Wang
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2019-03-29
Filing date: 2019-03-29
Publication date: 2020-10-08

Abstract

To provide a segmentation model generation system and a segmentation model generation method capable of generating a segmentation model for segmenting a sentence, and a text segmentation device capable of generating sentences with a clear meaning.

SOLUTION: A segmentation model generation system 10 comprises an input unit 11 to which first structured document data including a plurality of first sentences and second structured document data including a plurality of second sentences are input, a learning data generation unit 12 that generates, as learning data, a combination of a first sentence and a second sentence that are in a predetermined correspondence relation from among the plurality of first sentences and the plurality of second sentences, a learning data storage unit 13 that stores the generated learning data, and a segmentation model generation unit 14 that learns based on the learning data stored in the learning data storage unit and generates a segmentation model 144 that segments a sentence.

SELECTED DRAWING: Figure 1

Description

The present invention relates to a division model generation system, a text segmentation device, and a division model generation method.

Conventionally, sentence conversion devices are known that convert a given sentence into another sentence so that the user can understand it easily. For example, a machine translation device uses a computer to automatically translate one natural language into another. Because mistranslations and similar errors become more likely as the dependency structure grows more complex, conversion accuracy falls as sentences get longer. Sentence conversion devices are therefore required to split the source sentence into shorter sentences in order to improve conversion accuracy.

In the technique of Patent Document 1, the split position of a sentence is determined so as to optimize a predetermined objective function. The machine translation system calculates the likelihood of each split sentence candidate according to a predetermined likelihood formula, and calculates the similarity between each split sentence candidate and a predetermined language according to a predetermined similarity formula. It then splits the input sentence by selecting the split sentence candidate with the highest score, where the score is defined by the likelihood and the similarity. This makes it possible to determine the split position of a long sentence with high reliability.
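
The candidate selection described for Patent Document 1 can be sketched as follows. This is a minimal illustration, not the formula used in that document: the `SplitCandidate` type, the weighted-sum `score`, and the `weight` parameter are assumptions introduced here, because the text only states that the score is defined by the likelihood and the similarity.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class SplitCandidate:
    """Hypothetical container for one way of splitting an input sentence."""
    parts: Tuple[str, ...]   # the sentences produced by this split
    likelihood: float        # output of some likelihood formula (assumed given)
    similarity: float        # output of some similarity formula (assumed given)


def score(candidate: SplitCandidate, weight: float = 0.5) -> float:
    """Assumed score: a weighted sum of likelihood and similarity."""
    return weight * candidate.likelihood + (1.0 - weight) * candidate.similarity


def select_best(candidates: Sequence[SplitCandidate]) -> SplitCandidate:
    """Return the split candidate with the highest score."""
    if not candidates:
        raise ValueError("at least one split candidate is required")
    return max(candidates, key=score)
```

Any other monotone combination of the two values could replace the weighted sum without changing the selection step.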

In the technique of Patent Document 2, split positions are detected from delimiters in the text, such as punctuation marks, and from the format of the text. The text segmentation device can thereby correctly detect sentence boundaries that delimiters alone cannot express, and cut the text out one sentence at a time.

Patent Document 1: Japanese Unexamined Patent Publication No. 2006-18354
Patent Document 2: Japanese Unexamined Patent Publication No. H06-19962

In Patent Document 1, the input sentence is split based on a score defined by the likelihood and the similarity of the sentence. However, the machine translation system generates one split sentence from the words before the split position and another from the words after the split position. When the source sentence has a nested structure, the resulting split sentences do not make sense.

In Patent Document 2, split positions are detected from the delimiters included in the text and from the format of the text. However, it makes no mention of using the dependency relations of a sentence to detect split positions.

The present invention has been made to solve the above problems. It provides a division model generation system and a division model generation method capable of generating a division model that divides sentences, and a text segmentation device capable of generating sentences with a clear meaning by using the division model generation system.

The division model generation system includes: an input unit to which structured first document data including a plurality of first sentences and structured second document data including a plurality of second sentences are input; a learning data generation unit that generates, as learning data, combinations of a first sentence and a second sentence that are in a predetermined correspondence relationship among the plurality of first sentences and the plurality of second sentences; a learning data storage unit that stores the generated learning data; and a division model generation unit that learns based on the learning data stored in the learning data storage unit and generates a division model that divides sentences.

According to the present invention, sentences with a clear meaning can be generated.

Schematic diagram of the text segmentation device.
Schematic diagram of the text segmentation device in the first embodiment.
Hardware configuration diagram of the text segmentation device.
Flowchart of the processing of the division model generation system.
Explanatory diagram of the input interface.
Explanatory diagram of the learning data selection unit.
Explanatory diagram of the division model generation system.
Explanatory diagram of the learning data division unit.
Explanatory diagram of the evaluation presentation unit 26.
Explanatory diagram of the output interface.
Flowchart of the processing of the editing unit.
Schematic diagram of the text segmentation device in the second embodiment.
Flowchart of the processing of the dependency model.
Flowchart of the processing of the editing unit.

The present embodiment relates to a division model generation system applicable to preprocessing for converting sentences accurately, and to a text segmentation device that converts a predetermined sentence into another sentence.

Hereinafter, one embodiment of the present invention will be described with reference to FIG. 1, but the present invention is not limited to the embodiment shown in FIG. 1. The present embodiment is well suited to structured documents such as patent documents and scientific and technical documents. It can also be applied to structured documents other than patent documents and scientific and technical documents.

FIG. 1 is a schematic diagram of a text segmentation device 1. The text segmentation device 1 includes, for example, a division model generation system 10, an editing unit 20, a conversion processing unit 30, a conversion evaluation processing unit 40, and a language server 50.

In the figures, the word "unit" may be omitted. For example, the editing unit 20 may be abbreviated as "edit" in the figures.

The division model generation system 10 generates a division model 144 that divides sentences. The division model generation system 10 includes, for example, an input unit 11, a learning data generation unit 12, a learning data storage unit 13, and a division model generation unit 14.

The input unit 11 accepts, for example, two corresponding corpora. Of the two corresponding corpora, one consists of a plurality of sentences obtained by organizing the sentences of the other corpus by meaning and converting them. The source corpus is an example of the "first document data" and includes a plurality of first sentences. The converted corpus is an example of the "second document data" and includes a plurality of second sentences obtained by converting the first sentences.

The first sentences and the second sentences each contain a plurality of semantic units. A semantic unit is a group of two or more words that together express one meaning. For example, the sentence "A library published on a network is called a digital library, and its content is enriched as materials are uploaded." contains the semantic unit "A library published on a network is called a digital library." and the semantic unit "The content of a digital library is enriched as materials are uploaded." A semantic unit is represented by a phrase or a clause in the sentence. Hereinafter, a semantic unit may be referred to simply as a semantic expression.

The learning data generation unit 12 extracts the learning data used for learning from the input corpora. The learning data consists of first-sentence data and second-sentence data that are in a predetermined correspondence relationship. The learning data storage unit 13 is a database that stores the extracted learning data. The division model generation unit 14 learns based on the learning data and generates the division model 144.

The division model generation unit 14 includes, for example, a first target data conversion unit 141, a first source data conversion unit 142, an aggregation unit 143, a learning data division unit 144, a second target language conversion unit 145, and a second source language conversion unit 146.

The first target data conversion unit 141 converts the data of the second sentences into one or more pieces of semantic expression data. The first source data conversion unit 142 converts the data of the first sentences into one or more pieces of semantic expression data. The aggregation unit 143 converts the semantic expression data produced by the first target data conversion unit 141 into a single piece of semantic expression data that represents a compound sentence, a complex sentence, or the like.

The learning data division unit 144, which is an example of the "division model 144", divides each of the one or more pieces of semantic expression data of the first sentence and the single piece of semantic expression data of the second sentence into a plurality of pieces of semantic expression data. The second target language conversion unit 145 converts the semantic expression data of the second sentence divided by the learning data division unit 144 into data in document form. The second source language conversion unit 146 converts the semantic expression data of the first sentence divided by the learning data division unit 144 into data in document form.

The editing unit 20 divides a predetermined sentence into a plurality of sentences based on the division model 144. The conversion processing unit 30 converts each of the divided sentences into another sentence. The conversion evaluation processing unit 40 evaluates the conversion accuracy of the conversion processing unit 30. The language server 50 is a database in which word data is stored.

Because the division model generation system 10 of the present embodiment has the input unit 11, the learning data generation unit 12, the learning data storage unit 13, and the division model generation unit 14, it can generate a division model 144 that divides a sentence into its semantic expressions. The division model generation system 10 can therefore be used as preprocessing for converting a predetermined sentence accurately.

Because the text segmentation device 1 has, for example, the division model generation system 10, the editing unit 20, the conversion processing unit 30, the conversion evaluation processing unit 40, and the language server 50, it can convert a predetermined sentence using the division model 144. The text segmentation device 1 can thereby generate sentences with a clear meaning.

An embodiment of the text segmentation device 1 shown in FIG. 1 will now be described with reference to the drawings.

FIG. 2 is a schematic diagram showing the functional configuration of the text segmentation device 1. The text segmentation device 1 divides an input sentence by meaning and converts each divided sentence into another sentence. The text segmentation device 1 includes, for example, the division model generation system 10, the editing unit 20, the conversion processing unit 30, the conversion evaluation processing unit 40, and the language server 50.

The input unit 11 accepts two corresponding corpora. The two corresponding corpora are a source language corpus and a target language corpus obtained by organizing the sentences of the source language corpus by meaning and converting them.

The source language corpus includes first document data 71 (see FIG. 6). The target language corpus includes second document data 72. The first document data 71 is structured and includes a plurality of first sentences. The second document data 72 is structured and includes a plurality of second sentences. The input unit 11 transmits the first document data 71 and the second document data 72 to the learning data generation unit 12. The input unit 11 will be described later with reference to FIG. 5.

Each of the first sentence and the second sentence is either a single sentence containing a plurality of semantic expressions or a plurality of simple sentences. A single sentence containing a plurality of semantic expressions is, for example, a complex sentence or a compound sentence. Hereinafter, a single sentence containing a plurality of semantic expressions may be referred to as a "multi-meaning sentence". A simple sentence is a sentence composed of one semantic expression.

The learning data generation unit 12 generates, as learning data, combinations of a first sentence and a second sentence that are in a predetermined correspondence relationship among the plurality of first sentences and the plurality of second sentences. The learning data generation unit 12 transmits the learning data to the learning data storage unit 13. The learning data generation unit 12 will be described later with reference to FIG. 6.

The predetermined correspondence relationship means that the combination of the first sentence and the second sentence is in at least one of a one-to-many relationship, a many-to-one relationship, and a many-to-many relationship. The first sentence and the second sentence are related when a semantic expression contained in the first sentence and a semantic expression contained in the second sentence have the same meaning. In a one-to-many relationship, the first sentence is one multi-meaning sentence and the second sentence is a plurality of simple sentences. In a many-to-one relationship, the first sentence is a plurality of simple sentences and the second sentence is one multi-meaning sentence. In a many-to-many relationship, the first sentence and the second sentence are each a plurality of simple sentences.
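
The three correspondence relationships above can be illustrated with a small classifier. This is a sketch under stated assumptions: the sentences on each side are assumed to have already been aligned by meaning, so the relationship is decided from the sentence counts alone, and the names `Correspondence` and `classify` are hypothetical.

```python
from enum import Enum
from typing import List, Optional


class Correspondence(Enum):
    ONE_TO_MANY = "one-to-many"    # one multi-meaning sentence -> several simple sentences
    MANY_TO_ONE = "many-to-one"    # several simple sentences -> one multi-meaning sentence
    MANY_TO_MANY = "many-to-many"  # several simple sentences on both sides


def classify(first: List[str], second: List[str]) -> Optional[Correspondence]:
    """Classify an aligned group of first and second sentences by their counts.

    Returns None for groups that match none of the three relationships
    (for example a one-to-one pair or an empty side).
    """
    n_first, n_second = len(first), len(second)
    if n_first == 1 and n_second > 1:
        return Correspondence.ONE_TO_MANY
    if n_first > 1 and n_second == 1:
        return Correspondence.MANY_TO_ONE
    if n_first > 1 and n_second > 1:
        return Correspondence.MANY_TO_MANY
    return None
```

A learning data generator built this way would keep only the groups for which `classify` returns a value other than `None`.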

The learning data storage unit 13 is a database that records the learning data. The learning data storage unit 13 supplies the learning data to the division model generation unit 14.

The division model generation unit 14 learns based on the learning data and generates the division model 144. The division model generation unit 14 divides the first sentence and the second sentence into their semantic expressions, and generates the division model 144 based on the divided first sentence and the divided second sentence. The division model generation unit 14 transmits the division model to a division model storage unit 23, described later. The division model generation unit 14 will be described later with reference to FIG. 7. One example of the division model 144 is the learning data division unit 144 described with reference to FIG. 1. For this reason, the division model is also given the reference numeral 144.

The editing unit 20 divides a predetermined sentence into its semantic expressions based on the division model 144 generated by the division model generation system 10. The editing unit 20 includes, for example, an input unit 21, a division target selection unit 22, a division model storage unit 23, a division unit 24, a division evaluation unit 25, an evaluation presentation unit 26, a division result extraction unit 27, and a conversion data storage unit 28.

The input unit 21 accepts predetermined document data 331 (see FIG. 11). The predetermined document data 331 is structured data that includes a plurality of sentences. The input unit 21 outputs the data to the division target selection unit 22.

The division target selection unit 22 selects the sentences to be divided from the predetermined document data 331. The division target selection unit 22 transmits the data of the sentences to be divided to the division unit 24. The division target selection unit 22 may also be operated by the user to extract, from the predetermined document data 331, the set of sentences to be subjected to the division processing.

The division model storage unit 23 is a database that stores the division model 144. The division model storage unit 23 transmits the division model 144 to the division unit 24. The division unit 24 divides one sentence to be divided into a plurality of sentences based on the division model 144. The division unit 24 transmits the data of the divided sentences to the division evaluation unit 25.

The division evaluation unit 25 evaluates the division results of the division unit 24 based on a predetermined evaluation index. The predetermined evaluation index is, for example, the fluency of the sentences in the division result. The division evaluation unit 25 transmits the evaluation results to the evaluation presentation unit 26.

The evaluation presentation unit 26 has a function of presenting evaluation values to the user and a function of accepting a threshold set by the user. By setting the threshold based on the evaluation values presented by the evaluation presentation unit 26, the user selects, from the division results, the sentences suitable for the conversion processing. The evaluation presentation unit 26 transmits the evaluation values and the threshold to the division result extraction unit 27. The evaluation presentation unit 26 will be described later with reference to FIG. 10.

The division result extraction unit 27 extracts, based on the evaluation by the division evaluation unit 25, the sentences to be converted by the conversion processing unit 30 from among the plurality of sentences divided by the division unit 24. The division result extraction unit 27 transmits the data of the sentences to be converted to the conversion data storage unit 28. The conversion data storage unit 28 is a database that stores the data of the sentences to be converted. The conversion data storage unit 28 transmits the data of the sentences to be converted to a conversion unit 31.
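
The threshold-based extraction described above can be sketched as follows. The text does not specify how the evaluation value is compared with the threshold, so this sketch assumes that a higher value is better and that a sentence is kept when its value is at least the threshold. The names `EvaluatedSentence` and `extract_for_conversion` are hypothetical.

```python
from typing import List, NamedTuple


class EvaluatedSentence(NamedTuple):
    """A divided sentence together with its evaluation value (e.g. fluency)."""
    text: str
    value: float


def extract_for_conversion(
    results: List[EvaluatedSentence], threshold: float
) -> List[str]:
    """Keep the sentences whose evaluation value reaches the user's threshold.

    Assumption: higher values are better and the comparison is inclusive.
    """
    return [r.text for r in results if r.value >= threshold]
```

The returned list stands in for the data passed on to the conversion data storage unit 28.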

The conversion processing unit 30 converts the plurality of divided sentences into other sentences. The conversion processing unit 30 includes, for example, a conversion unit 31, a conversion software storage unit 32, and an output unit 33.

The conversion unit 31 performs conversion processing on the text data stored in the conversion data storage unit 28. The conversion unit 31 converts sentences using the conversion software obtained from the conversion software storage unit 32 and the text data obtained from the conversion data storage unit 28, and outputs the result to the output unit 33. The conversion software storage unit 32 is a database that stores the conversion software used by the conversion unit 31. The conversion software storage unit 32 transmits the conversion software data to the conversion unit 31. The output unit 33 outputs the conversion result. The output unit 33 will be described later with reference to FIG. 11.

The conversion evaluation processing unit 40 evaluates the conversion accuracy of the conversion processing unit 30. The conversion evaluation processing unit 40 includes, for example, a conversion evaluation unit 41 and an output unit 42.

The conversion evaluation unit 41 evaluates the conversion accuracy of the conversion unit 31. The conversion evaluation unit 41 transmits the result of the conversion evaluation to the output unit 42. The output unit 42 outputs the evaluation value of the conversion accuracy. The output unit 42 will be described later with reference to FIG. 11.

The language server 50 is a database that stores the word data used by the division unit 24 and the conversion unit 31. The language server 50 transmits the word data to the division unit 24 and the conversion unit 31.

FIG. 3 is a hardware configuration diagram of the text segmentation device 1. The text segmentation device 1 includes, for example, an input/output device 61, a storage unit 62, a CPU (Central Processing Unit) 63, a memory 64, a communication interface 65 (sometimes shown in the figure as communication I/F (interface)), an input/output circuit 66, and a data transmission path 67 that connects the components 61 to 66 so that they can communicate bidirectionally.

The input/output device 61 has an input device and an output device (neither is shown). The input device is, for example, a keyboard, a pointing device such as a mouse, or a voice input device such as a microphone. The output device is, for example, a display, a printer, or a speech synthesizer. The input device and the output device may also be integrated, as in a tablet or an AR (Augmented Reality) display. The input/output device 61 may be provided on a computer other than the one on which the text segmentation device 1 is provided. For example, a personal computer, a mobile phone (including a so-called smartphone), or a personal digital assistant may be used as the input/output device 61 for the text segmentation device 1.

The storage unit 62 is a non-volatile storage device such as a hard disk or an SSD (Solid State Drive). Any type of storage medium may be used. The storage unit 62 stores the computer programs (hereinafter, programs) that implement the division model generation system 10, the editing unit 20, the conversion processing unit 30, and the conversion evaluation processing unit 40. The storage unit 62 also stores databases such as the language server 50.

The CPU 63 loads each program from the storage unit 62 into the memory 64 and executes it. The memory 64 is a volatile storage device such as a RAM (Random Access Memory).

The communication interface 65 is a device that communicates with external devices via a communication network such as a LAN (Local Area Network) or the Internet.

The input/output circuit 66 is a terminal to which a storage medium 68, such as a USB (Universal Serial Bus) memory or an optical disc, is connected. The input/output circuit 66 is, for example, a USB port or a connector.

The storage medium 68 is a portable storage medium. The storage medium 68 may store a program that implements the division model generation system 10. The user may set up the division model generation system 10 on a computer by installing on that computer the program of the division model generation system 10 stored in the storage medium 68.

The program that implements the division model generation system 10 may also be stored on a predetermined server in the cloud. In this case, the user may set up the division model generation system 10 on a computer by installing the program of the division model generation system 10 from the cloud onto the computer via the communication interface 65.

In the following, the processing of the division model generation system 10 is described using an example in which the first document data 71 is a patent specification written in Japanese and the second document data 72 is a patent specification written in English. The first document data 71 and the second document data 72 are not limited to patent specifications and may be other structured documents such as research papers.

図４は、分割モデル生成システム１０の処理の流れ図である。分割モデル生成システム１０は、入力部１１の処理（Ｓ１１）と、学習データ生成部１２の処理（Ｓ１２，Ｓ１３）と、分割モデル生成部１４の処理（Ｓ１４，Ｓ１５）と、で実行される。 FIG. 4 is a processing flow chart of the split model generation system 10. The division model generation system 10 is executed by the processing of the input unit 11 (S11), the processing of the learning data generation unit 12 (S12, S13), and the processing of the division model generation unit 14 (S14, S15).

分割モデル生成システム１０は、例えば、ユーザの操作によって開始される。分割モデル生成システム１０の処理は、入力部１１の処理（Ｓ１１）に移動する。入力部１１は、入出力装置６１に、ソース言語コーパスと、ターゲット言語コーパスとの入力欄を表示させる。 The split model generation system 10 is started by, for example, a user operation. The process of the split model generation system 10 moves to the process (S11) of the input unit 11. The input unit 11 causes the input / output device 61 to display input fields for the source language corpus and the target language corpus.

図５は、入力部１１の説明図である。入力部１１は、例えば、ソース言語ファイル入力欄１１１と、ターゲット言語ファイル入力欄１１２と、実行ボタン１１３と、キャンセルボタン１１４と、を入出力装置６１に表示させる。 FIG. 5 is an explanatory diagram of the input unit 11. The input unit 11 causes the input / output device 61 to display, for example, the source language file input field 111, the target language file input field 112, the execute button 113, and the cancel button 114.

ソース言語ファイル入力欄１１１は、ソース言語ファイルとしてのソース言語コーパスを受け付ける機能である。ターゲット言語ファイル入力欄１１２は、ターゲット言語ファイルとしてのターゲット言語コーパスを受け付ける機能である。実行ボタン１１３は、分割モデル生成処理（Ｓ１２〜Ｓ１５）を実行させるボタンである。キャンセルボタン１１４は、入力部１１の表示を取り消すボタンである。 The source language file input field 111 is a function of accepting a source language corpus as a source language file. The target language file input field 112 is a function of accepting the target language corpus as the target language file. The execution button 113 is a button for executing the division model generation processing (S12 to S15). The cancel button 114 is a button for canceling the display of the input unit 11.

ユーザは、ソース言語ファイル入力欄１１１にソース言語コーパスのファイルのディレクトリを入力してもよい。ユーザは、ターゲット言語ファイル入力欄１１２にターゲット言語コーパスのファイルのディレクトリを入力してもよい。または、欄１１１，１１２にプルダウンメニューを設けて、プルダウンメニューの中からファイルを指定してもよい。さらには、欄１１１，１１２にファイルのアイコンをドロップすることにより、ファイルを入力してもよい。 The user may enter the directory of the source language corpus file in the source language file input field 111. The user may enter the directory of the target language corpus file in the target language file input field 112. Alternatively, a pull-down menu may be provided in columns 111 and 112, and a file may be specified from the pull-down menu. Further, the file may be input by dropping the file icon in the fields 111 and 112.

図４に戻り、学習データ生成部１２は、入力部１１からソース言語コーパスのデータとターゲット言語コーパスのデータとを受け取る。学習データ生成部１２は、第１の文書データ７１と第２の文書データ７２との中から、学習用データを抽出する（Ｓ１２）。 Returning to FIG. 4, the learning data generation unit 12 receives the data of the source language corpus and the data of the target language corpus from the input unit 11. The learning data generation unit 12 extracts learning data from the first document data 71 and the second document data 72 (S12).

図６は、学習データ生成部１２の説明図である。第１の文書データ７１と、第２の文書データ７２とは、所定の順序で並べられた複数の項目を有する。 FIG. 6 is an explanatory diagram of the learning data generation unit 12. The first document data 71 and the second document data 72 have a plurality of items arranged in a predetermined order.

第１の文書データ７１は、日本語で示される複数の項目ｃ１〜ｃ７を有する。項目ｃ１は、例えば「書類名」を示す。項目ｃ２は、例えば「発明の名称」を示す。項目ｃ３は、例えば「技術分野」を示す。項目ｃ４は、例えば「背景技術」を示す。項目ｃ５は、例えば「先行技術文献」を示す。項目ｃ６は、例えば「発明の概要」を示す。項目ｃ７は、例えば「発明が解決しようとする課題」を示す。 The first document data 71 has a plurality of items c1 to c7 indicated in Japanese. Item c1 indicates, for example, a “document name”. Item c2 indicates, for example, "the name of the invention". Item c3 indicates, for example, "technical field". Item c4 indicates, for example, "background technology". Item c5 indicates, for example, "Prior Art Document". Item c6 indicates, for example, "outline of the invention". Item c7 indicates, for example, "the problem to be solved by the invention".

第２の文書データ７２は、英語で示される複数の項目ｃ８〜ｃ１４を有する。項目ｃ８は、例えば「ＤＥＳＣＲＩＰＴＩＯＮ」を示す。項目ｃ９は、例えば「ＴｉｔｌｅｏｆＩｎｖｅｎｔｉｏｎ」を示す。例えば項目ｃ１０は、「ＴｅｃｈｎｉｃａｌＦｉｅｌｄ」を示す。項目ｃ１１は、例えば「ＢａｃｋｇｒｏｕｎｄＡｒｔ」を示す。項目ｃ１２は、例えば「ＣｉｔａｔｉｏｎＬｉｓｔ」を示す。項目ｃ１３は、例えば「ＳｕｍｍａｒｙｏｆＩｎｖｅｎｔｉｏｎ」を示す。項目ｃ１４は、例えば「ＴｅｃｈｎｉｃａｌＰｒｏｂｌｅｍ」を示す。 The second document data 72 has a plurality of items c8 to c14 written in English. Item c8 indicates, for example, "DESCRIPTION". Item c9 indicates, for example, "Title of Invention". Item c10 indicates, for example, "Technical Field". Item c11 indicates, for example, "Background Art". Item c12 indicates, for example, "Citation List". Item c13 indicates, for example, "Summary of Invention". Item c14 indicates, for example, "Technical Problem".

項目ｃ１は、項目ｃ８と対応する。項目ｃ２は、項目ｃ９と対応する。項目ｃ３は、項目ｃ１０と対応する。項目ｃ４は、項目ｃ１１と対応する。項目ｃ５は、項目ｃ１２と対応する。項目ｃ６は、項目ｃ１３と対応する。項目ｃ７は、項目ｃ１４と対応する。 Item c1 corresponds to item c8. Item c2 corresponds to item c9. Item c3 corresponds to item c10. Item c4 corresponds to item c11. Item c5 corresponds to item c12. Item c6 corresponds to item c13. Item c7 corresponds to item c14.
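The item correspondence above can be sketched as a lookup table that pairs the text of matching sections. This is a minimal illustration only: the table `SECTION_PAIRS`, the function `pair_sections`, and the representation of a document as a dict from heading to section text are assumptions introduced here and are not part of the described system.

```python
# Hypothetical heading table following items c1-c7 and c8-c14 in the text.
SECTION_PAIRS = [
    ("書類名", "DESCRIPTION"),
    ("発明の名称", "Title of Invention"),
    ("技術分野", "Technical Field"),
    ("背景技術", "Background Art"),
    ("先行技術文献", "Citation List"),
    ("発明の概要", "Summary of Invention"),
    ("発明が解決しようとする課題", "Technical Problem"),
]


def pair_sections(jp_doc, en_doc):
    """Return (jp_text, en_text) for each heading pair found in both documents.

    jp_doc and en_doc are assumed to be dicts mapping a heading to its text.
    """
    pairs = []
    for jp_heading, en_heading in SECTION_PAIRS:
        if jp_heading in jp_doc and en_heading in en_doc:
            pairs.append((jp_doc[jp_heading], en_doc[en_heading]))
    return pairs
```

Candidate first and second sentences would then be searched only within each returned pair, for example within the "背景技術" and "Background Art" sections.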

学習データ生成部１２は、第１の文書データ７１の各項目ｃ１〜ｃ７と第２の文書データ７２の各項目ｃ８〜ｃ１４とのうち対応する項目間において、第１文と第２文とを抽出する。例えば、学習データ生成部１２は、第１の文書データ７１の項目ｃ４を選択したとする。この場合、学習データ生成部１２は、項目ｃ４に示される複数の文の中から、第１文としての「ＪＰ：Ｓｅｎ（Ｓｅｎｔｅｎｃｅ）」を抽出する。ここで「ＪＰ：Ｓｅｎ」は、例えば、複数の（例えば三つの）意味表現を含んだ、複数意味文である。なお、学習用データとしての第１文は、三つの意味表現を含む複数意味文に限らず、二つまたは四つ以上の意味表現を含む複数意味文であってもよい。 The learning data generation unit 12 extracts a first sentence and second sentences from corresponding items among the items c1 to c7 of the first document data 71 and the items c8 to c14 of the second document data 72. For example, suppose that the learning data generation unit 12 selects the item c4 of the first document data 71. In this case, the learning data generation unit 12 extracts "JP: Sen (Sentence)" as the first sentence from the plurality of sentences in the item c4. Here, "JP: Sen" is, for example, a multi-meaning sentence containing a plurality of (for example, three) semantic expressions. The first sentence used as learning data is not limited to a multi-meaning sentence containing three semantic expressions and may contain two, or four or more, semantic expressions.

「ＪＰ：Ｓｅｎ」の例としては、「タイヤは、融点が１６０℃以上であるポリプロピレンを含み、前記融点が１６０℃以上であるポリプロピレンの含有率が樹脂組成物全体の６０質量％以下である樹脂組成物を含む。」という文がある。したがって、「ＪＰ：Ｓｅｎ」には、「樹脂組成物を含むタイヤ」という第１の意味表現と、「樹脂組成物は、融点が１６０℃以上であるポリプロピレンを含む」という第２の意味表現と、「樹脂組成物に含まれる融点が１６０℃以上であるポリプロピレンの含有率は、樹脂組成物全体の６０質量％以下である」という第３の意味表現と、の三つの意味表現が含まれる。 An example of "JP: Sen" is the sentence "A tire comprises a resin composition that contains polypropylene having a melting point of 160 °C or higher, wherein the content of the polypropylene having a melting point of 160 °C or higher is 60% by mass or less of the entire resin composition." Accordingly, "JP: Sen" contains three semantic expressions: a first semantic expression, "a tire comprising a resin composition"; a second semantic expression, "the resin composition contains polypropylene having a melting point of 160 °C or higher"; and a third semantic expression, "the content of the polypropylene having a melting point of 160 °C or higher in the resin composition is 60% by mass or less of the entire resin composition".

学習データ生成部１２は、第２の文書データ７２の項目ｃ１１に示される複数の文の中から、第２文としての、「ＥＮ：Ｓｅｇ１」、「ＥＮ：Ｓｅｇ２」および「ＥＮ：Ｓｅｇ３」を抽出する。「ＥＮ：Ｓｅｇ１」、「ＥＮ：Ｓｅｇ２」および「ＥＮ：Ｓｅｇ３」は、英語で記載される単文である。 The learning data generation unit 12 extracts "EN: Seg1", "EN: Seg2", and "EN: Seg3" as the second sentences from the plurality of sentences in the item c11 of the second document data 72. "EN: Seg1", "EN: Seg2", and "EN: Seg3" are simple sentences written in English.

「ＥＮ：Ｓｅｇ１」は、例えば、「Ａｔｉｒｅｗｈｉｃｈｃｏｍｐｒｉｓｅｓａｒｅｓｉｎｃｏｍｐｏｓｉｔｉｏｎ．」という文である。「ＥＮ：Ｓｅｇ２」は、例えば、「Ｔｈｅｒｅｓｉｎｃｏｍｐｏｓｉｔｉｏｎｃｏｎｔａｉｎｓａｐｏｌｙｐｒｏｐｙｌｅｎｅｈａｖｉｎｇａｍｅｌｔｉｎｇｐｏｉｎｔｏｆ１６０°Ｃｏｒｍｏｒｅ．」という文である。「ＥＮ：Ｓｅｇ３」は、例えば、「Ｔｈｅｃｏｎｔｅｎｔｏｆｔｈｅｐｏｌｙｐｒｏｐｙｌｅｎｅｈａｖｉｎｇａｍｅｌｔｉｎｇｐｏｉｎｔｏｆ１６０°Ｃｏｒｍｏｒｅｉｓ６０％ｂｙｍａｓｓｏｒｌｅｓｓｏｆｔｈｅｗｈｏｌｅｍａｓｓｏｆｔｈｅｒｅｓｉｎｃｏｍｐｏｓｉｔｉｏｎ．」という文である。 "EN: Seg1" is, for example, the sentence "A tire which comprises a resin composition." "EN: Seg2" is, for example, the sentence "The resin composition contains a polypropylene having a melting point of 160 °C or more." "EN: Seg3" is, for example, the sentence "The content of the polypropylene having a melting point of 160 °C or more is 60% by mass or less of the whole mass of the resin composition."

「ＪＰ：Ｓｅｎ」と、「ＥＮ：Ｓｅｇ１」、「ＥＮ：Ｓｅｇ２」および「ＥＮ：Ｓｅｇ３」とは、一対多の関係を有する。すなわち、「ＪＰ：Ｓｅｎ」は、「ＥＮ：Ｓｅｇ１」、「ＥＮ：Ｓｅｇ２」および「ＥＮ：Ｓｅｇ３」の三つの文の意味表現を有する。 "JP: Sen" and "EN: Seg1", "EN: Seg2" and "EN: Seg3" have a one-to-many relationship. That is, "JP: Sen" has the semantic expressions of the three sentences "EN: Seg1", "EN: Seg2" and "EN: Seg3".

なお、学習データ生成部１２は、第１文と第２文との所定の対応関係を有する組み合わせの中から、単語数または翻訳精度の少なくともいずれか一方に基づいて、学習用データを抽出してもよい。学習データ生成部１２は、単語数または翻訳精度の少なくともいずれか一方に限らず、他の基準に基づいて、学習用データを抽出してもよい。 The learning data generation unit 12 may extract learning data from the combinations of a first sentence and second sentences that have the predetermined correspondence, based on at least one of the number of words and the translation accuracy. The learning data generation unit 12 is not limited to these two criteria and may extract learning data based on other criteria.

学習データ生成部１２は、例えば、単語数が１０単語以上の第１文と、単語数が８０単語以下の第２文と、の組み合わせを抽出してもよい。なお、学習データ生成部１２は、単語数が１０単語以上の第１文と、単語数が８０単語以下の第２文と、の組み合わせに限らず、他の単語数に基づいて学習用データを抽出してもよい。 The learning data generation unit 12 may extract, for example, a combination of a first sentence having 10 or more words and a second sentence having 80 or fewer words. The learning data generation unit 12 is not limited to this combination and may extract learning data based on other word counts.

学習データ生成部１２は、例えば、第１文と第２文とのアライメント精度が「０．１」以上を満たす、第１文と第２文との組み合わせを抽出する。第１文と第２文とのアライメント精度は、ＢＬＥＵ（ＢｉＬｉｎｇｕａｌＥｖａｌｕａｔｉｏｎＵｎｄｅｒｓｔｕｄｙ）およびＲＩＢＥＳ（ＲａｎｋｂａｓｅｄＩｎｔｕｉｔｉｖｅＢｉｌｉｎｇｕａｌＳｃｏｒｅ）等の指標を用いて計算されてもよい。学習データ生成部１２は、アライメント精度が「０．１」以上に限らず、他の値に基づいて学習用データを抽出してもよい。 The learning data generation unit 12 extracts, for example, a combination of a first sentence and a second sentence whose alignment accuracy is 0.1 or more. The alignment accuracy between the first sentence and the second sentence may be calculated using a metric such as BLEU (Bilingual Evaluation Understudy) or RIBES (Rank-based Intuitive Bilingual Score). The learning data generation unit 12 is not limited to an alignment accuracy of 0.1 or more and may extract learning data based on other values.
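The example thresholds above can be combined into a single filter, sketched below. The text presents the word-count and alignment criteria as alternatives that may each be used; applying all of them together is an assumption made here. The function name `keep_pair` and its parameters are hypothetical, sentences are assumed to be already tokenized into word lists, and the alignment score is assumed to be computed elsewhere (for example with BLEU or RIBES).

```python
def keep_pair(first_tokens, second_token_lists, alignment_score,
              min_first_words=10, max_second_words=80, min_alignment=0.1):
    """Decide whether a (first sentence, second sentences) pair is kept.

    first_tokens: list of words of the first (source-language) sentence.
    second_token_lists: one list of words per second (target-language) sentence.
    alignment_score: precomputed score; how it is computed is outside this sketch.
    """
    # The first sentence must have at least min_first_words words.
    if len(first_tokens) < min_first_words:
        return False
    # Every second sentence must have at most max_second_words words.
    if any(len(tokens) > max_second_words for tokens in second_token_lists):
        return False
    # The alignment score must reach the example threshold of 0.1.
    return alignment_score >= min_alignment
```

The defaults mirror the example values in the text (10 words, 80 words, 0.1); the text notes that other values may be used.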

図４に戻り、学習データ生成部１２は、学習データ記憶部１３に学習用データを記録する（Ｓ１３）。学習データ生成部１２は、例えば、「ＪＰ：Ｓｅｎ」のデータと、「ＥＮ：Ｓｅｇ１」，「ＥＮ：Ｓｅｇ２」および「ＥＮ：Ｓｅｇ３」のデータと、を学習データ記憶部１３に記録する。 Returning to FIG. 4, the learning data generation unit 12 records the learning data in the learning data storage unit 13 (S13). The learning data generation unit 12 records, for example, the data of “JP: Sen” and the data of “EN: Seg1”, “EN: Seg2” and “EN: Seg3” in the learning data storage unit 13.

分割モデル生成部１４は、学習データ記憶部１３から学習用データを取得し、分割モデル１４４を学習する（Ｓ１４）。図７は、分割モデル生成部１４の説明図である。分割モデル生成部１４は、例えば、エンコーダ１４１，１４２と、集約部１４３と、学習データ分割部１４４と、デコーダ１４５，１４６と、を有する。 The division model generation unit 14 acquires learning data from the learning data storage unit 13 and learns the division model 144 (S14). FIG. 7 is an explanatory diagram of the divided model generation unit 14. The division model generation unit 14 includes, for example, encoders 141 and 142, aggregation unit 143, learning data division unit 144, and decoders 145 and 146.

エンコーダ１４１は、図１に示す第１ターゲットデータ変換部１４１に対応する。エンコーダ１４２は、図１に示す第１ソースデータ変換部１４２に対応する。デコーダ１４５は、図１に示す第２ターゲット言語変換部１４５に対応する。デコーダ１４６は、図１に示す第２ソース言語変換部１４６に対応する。 The encoder 141 corresponds to the first target data conversion unit 141 shown in FIG. The encoder 142 corresponds to the first source data conversion unit 142 shown in FIG. The decoder 145 corresponds to the second target language conversion unit 145 shown in FIG. The decoder 146 corresponds to the second source language conversion unit 146 shown in FIG.

エンコーダ１４１は、第２文のデータを、意味表現データに変換する機能である。意味表現データは、文の意味表現を示すデータである。意味表現データは、例えば、ｎ次元のベクトルである。ｎは、所定の定数である。エンコーダ１４１は、例えば、「ＥＮ：Ｓｅｇ１」のデータ，「ＥＮ：Ｓｅｇ２」のデータおよび「ＥＮ：Ｓｅｇ３」のデータを、「ＥＮ：ｈ１」、「ＥＮ：ｈ２」および「ＥＮ：ｈ３」にそれぞれ変換する。「ＥＮ：ｈ１」、「ＥＮ：ｈ２」および「ＥＮ：ｈ３」は、それぞれ一つの意味表現を有する意味表現データである。 The encoder 141 is a function that converts the data of the second sentences into semantic expression data. Semantic expression data is data indicating the semantic expression of a sentence. The semantic expression data is, for example, an n-dimensional vector, where n is a predetermined constant. The encoder 141 converts, for example, the data of "EN: Seg1", "EN: Seg2", and "EN: Seg3" into "EN: h1", "EN: h2", and "EN: h3", respectively. "EN: h1", "EN: h2", and "EN: h3" are semantic expression data each having one semantic expression.

エンコーダ１４２は、第１文のデータを、意味表現データに変換する機能である。エンコーダ１４２は、例えば、「ＪＰ：Ｓｅｎ」のデータを、「ＪＰ：ｈ」に変換する。「ＪＰ：ｈ」は、三つの意味表現を含む意味表現データである。 The encoder 142 is a function of converting the data of the first sentence into semantic expression data. The encoder 142 converts, for example, the data of “JP: Sen” into “JP: h”. "JP: h" is semantic expression data including three semantic expressions.

集約部１４３は、複数の意味表現データを集約することによって、複数意味文としての意味表現データを算出する機能である。集約部１４３は、例えば、第２文の複数の意味表現データを集約する。 The aggregation unit 143 is a function of calculating the semantic expression data as a plurality of semantic sentences by aggregating a plurality of semantic expression data. The aggregation unit 143 aggregates, for example, a plurality of semantic expression data of the second sentence.

集約部１４３は、例えば、「ＥＮ：ｈ１」、「ＥＮ：ｈ２」および「ＥＮ：ｈ３」のデータをエンコーダ１４１から取得する。集約部１４３は、「ＥＮ：ｈ１」、「ＥＮ：ｈ２」および「ＥＮ：ｈ３」のデータを集約し、「ＥＮ：ｈ」を取得する。「ＥＮ：ｈ」は、第２文を複数意味文として示した場合の意味表現データである。 The aggregation unit 143 acquires, for example, the data of “EN: h1”, “EN: h2”, and “EN: h3” from the encoder 141. The aggregation unit 143 aggregates the data of "EN: h1", "EN: h2" and "EN: h3", and acquires "EN: h". “EN: h” is semantic expression data when the second sentence is shown as a plural semantic sentence.

集約部１４３は、複数の意味表現データを足し合わせる方法または、複数の意味表現データ間に時系列な依存関係を設定する方法等によって、複数の意味表現データを集約してもよい。なお、集約部１４３は、複数の意味表現データを足し合わせる方法および複数の意味表現データ間に時系列な依存関係を設定する方法に限らず、他の方法によって複数の意味表現データを集約してもよい。 The aggregation unit 143 may aggregate the plurality of semantic expression data by, for example, adding them together or by setting a time-series dependency among them. The aggregation unit 143 is not limited to these two methods and may aggregate the plurality of semantic expression data by another method.
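The two aggregation options named above can be illustrated with plain Python lists standing in for the n-dimensional vectors. `aggregate_sum` follows the "adding together" option directly. `aggregate_sequential` is only a toy stand-in for a time-series dependency: the decayed running state and the `decay` parameter are assumptions made for illustration, not the method of the described system.

```python
def aggregate_sum(vectors):
    """Element-wise sum of equally sized vectors (order does not matter)."""
    return [sum(column) for column in zip(*vectors)]


def aggregate_sequential(vectors, decay=0.5):
    """Toy order-dependent aggregation: state = decay * state + vector."""
    state = [0.0] * len(vectors[0])
    for vector in vectors:
        # Earlier vectors are attenuated each time a later vector is folded in.
        state = [decay * s + v for s, v in zip(state, vector)]
    return state
```

With the sum, "EN: h" is the same regardless of segment order; with the sequential variant, reordering "EN: h1" to "EN: h3" changes the result, which is the property a time-series dependency is meant to capture.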

学習データ分割部１４４は、学習機能を用いて、一つの意味表現データを複数の意味表現データに分割する機能である。学習データ分割部１４４は、例えば、「ＪＰ：ｈ」と「ＥＮ：ｈ」とを分割する。 The learning data division unit 144 is a function of dividing one semantic expression data into a plurality of semantic expression data by using the learning function. The learning data division unit 144 divides, for example, "JP: h" and "EN: h".

学習データ分割部１４４は、例えば、エンコーダ１４２から「ＪＰ：ｈ」を取得する。学習データ分割部１４４は、例えば、学習した分割方法を用いて「ＪＰ：ｈ」を分割することによって、「ＪＰ：ｈｓ１」と、「ＪＰ：ｈｓ２」と、「ＪＰ：ｈｓ３」と、を算出する。「ＪＰ：ｈｓ１」と、「ＪＰ：ｈｓ２」と、「ＪＰ：ｈｓ３」とは、第１文のそれぞれ一つの意味表現を示す意味表現データである。 The learning data division unit 144 acquires "JP: h" from the encoder 142, for example. The learning data division unit 144 calculates "JP: hs1", "JP: hs2", and "JP: hs3" by, for example, dividing "JP: h" using the learned division method. "JP: hs1", "JP: hs2", and "JP: hs3" are semantic expression data each indicating one semantic expression of the first sentence.

学習データ分割部１４４は、例えば、集約部１４３から「ＥＮ：ｈ」を取得する。学習データ分割部１４４は、例えば、「ＥＮ：ｈ」を分割することによって、「ＥＮ：ｈｓ１」と、「ＥＮ：ｈｓ２」と、「ＥＮ：ｈｓ３」と、を算出する。「ＥＮ：ｈｓ１」と、「ＥＮ：ｈｓ２」と、「ＥＮ：ｈｓ３」とは、第２文のそれぞれ一つの意味表現を示す意味表現データである。 The learning data division unit 144 acquires "EN: h" from, for example, the aggregation unit 143. The learning data division unit 144 calculates "EN: hs1", "EN: hs2", and "EN: hs3" by, for example, dividing "EN: h". “EN: hs1”, “EN: hs2”, and “EN: hs3” are semantic expression data indicating one semantic expression of the second sentence.

学習データ分割部１４４は、例えば、集約後の意味表現データから集約前の複数の意味表現データを復元するように分割方法を学習する。学習データ分割部１４４は、例えば、「ＥＮ：ｈ」から「ＥＮ：ｈ１」、「ＥＮ：ｈ２」および「ＥＮ：ｈ３」を復元するように分割方法を学習する。 The learning data division unit 144 learns the division method so as to restore, for example, the plurality of semantic expression data before aggregation from the semantic expression data after aggregation. The learning data division unit 144 learns the division method so as to restore, for example, "EN: h1", "EN: h2", and "EN: h3" from "EN: h".

デコーダ１４５は、分割された第２文の意味を示すデータを文書形式に変換する機能である。デコーダ１４５は、例えば、「ＥＮ：ｈｓ１」、「ＥＮ：ｈｓ２」および「ＥＮ：ｈｓ３」を、英語で記載される「ＥＮ：Ｓｅｇ１」、「ＥＮ：Ｓｅｇ２」および「ＥＮ：Ｓｅｇ３」にそれぞれ変換する。「ＥＮ：Ｓｅｇ１」は、「ＥＮ：ｈｓ１」の意味を示すデータを、文書形式に変換したデータである。「ＥＮ：Ｓｅｇ２」は、「ＥＮ：ｈｓ２」の意味を示すデータを、文書形式に変換したデータである。「ＥＮ：Ｓｅｇ３」は、「ＥＮ：ｈｓ３」の意味を示すデータを、文書形式に変換したデータである。 The decoder 145 is a function that converts the data indicating the meaning of the divided second sentences into a document format. The decoder 145 converts, for example, "EN: hs1", "EN: hs2", and "EN: hs3" into "EN: Seg1", "EN: Seg2", and "EN: Seg3" written in English, respectively. "EN: Seg1" is data obtained by converting the data indicating the meaning of "EN: hs1" into a document format. "EN: Seg2" is data obtained by converting the data indicating the meaning of "EN: hs2" into a document format. "EN: Seg3" is data obtained by converting the data indicating the meaning of "EN: hs3" into a document format.

デコーダ１４６は、分割された第１文の意味を示すデータを文書形式に変換する機能である。デコーダ１４６は、例えば、「ＪＰ：ｈｓ１」、「ＪＰ：ｈｓ２」および「ＪＰ：ｈｓ３」を、日本語で記載される「ＪＰ：Ｓｅｇ１」、「ＪＰ：Ｓｅｇ２」および「ＪＰ：Ｓｅｇ３」にそれぞれ変換する。「ＪＰ：Ｓｅｇ１」は、「ＪＰ：ｈｓ１」の意味を示すデータを、文書形式に変換したデータである。「ＪＰ：Ｓｅｇ２」は、「ＪＰ：ｈｓ２」の意味を示すデータを、文書形式に変換したデータである。「ＪＰ：Ｓｅｇ３」は、「ＪＰ：ｈｓ３」の意味を示すデータを、文書形式に変換したデータである。 The decoder 146 is a function that converts the data indicating the meaning of the divided first sentence into a document format. The decoder 146 converts, for example, "JP: hs1", "JP: hs2", and "JP: hs3" into "JP: Seg1", "JP: Seg2", and "JP: Seg3" written in Japanese, respectively. "JP: Seg1" is data obtained by converting the data indicating the meaning of "JP: hs1" into a document format. "JP: Seg2" is data obtained by converting the data indicating the meaning of "JP: hs2" into a document format. "JP: Seg3" is data obtained by converting the data indicating the meaning of "JP: hs3" into a document format.

なお、分割モデル生成部１４は、分割された第１文の意味表現データと、分割された第２文の意味表現データとの差分を最小化する手段を備える。最小化する手段は、例えば、差分の二乗を求めるなどが考えられるが、これに限定されない。 The divided model generation unit 14 includes means for minimizing the difference between the semantic expression data of the divided first sentence and the semantic expression data of the divided second sentence. The means for minimizing the difference is, for example, finding the square of the difference, but is not limited to this.
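The squared difference mentioned above can be written out as follows. This is a sketch under the assumption that the divided representations of the first sentence ("JP: hs1" to "JP: hs3") and of the second sentences ("EN: hs1" to "EN: hs3") are compared position by position; the function name `squared_difference` is hypothetical, and the text leaves open that other measures may be used.

```python
def squared_difference(jp_segments, en_segments):
    """Sum of squared element-wise differences over paired segment vectors.

    jp_segments and en_segments are lists of equally sized vectors; the i-th
    vector of one list is compared with the i-th vector of the other.
    """
    total = 0.0
    for jp_vec, en_vec in zip(jp_segments, en_segments):
        total += sum((a - b) ** 2 for a, b in zip(jp_vec, en_vec))
    return total
```

A training procedure would adjust the model parameters to drive this value toward zero, so that the same semantic expression yields similar vectors in both languages.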

エンコーダ１４１，１４２と、学習データ分割部１４４と、デコーダ１４５，１４６とは、例えば、ＬＳＴＭ（ＬｏｎｇＳｈｏｒｔ−ＴｅｒｍＭｅｍｏｒｙ）またはＧＲＵ（ＧａｔｅｄＲｅｃｕｒｒｅｎｔＵｎｉｔ）等のニューラルネットワークでもよい。 The encoders 141 and 142, the learning data division unit 144, and the decoders 145 and 146 may each be, for example, a neural network such as an LSTM (Long Short-Term Memory) or a GRU (Gated Recurrent Unit).

図８は、学習データ分割部１４４の説明図である。学習データ分割部１４４は、例えば、学習用データと分割済みの意味表現データとに基づいて、次に分割する意味表現データを算出する。 FIG. 8 is an explanatory diagram of the learning data division unit 144. The learning data division unit 144 calculates, for example, the semantic expression data to be divided next based on the learning data and the divided semantic expression data.

学習データ分割部１４４は、例えば、第１文のデータまたは第２文のデータのいずれか一方と、所定の値と、に基づいて一つの所定の意味表現データを抽出する。所定の値は、既に分割された意味表現データまたは、既に分割された文の「Ａｔｔｅｎｔｉｏｎｓｃｏｒｅ」を足し合わせたベクトル等を用いてもよい。分割された意味表現データがない場合には、所定の値には、所定の初期値を入力してもよい。 The learning data division unit 144 extracts one predetermined semantic expression data based on, for example, either the data of the first sentence or the data of the second sentence and a predetermined value. As the predetermined value, the already divided semantic expression data, the vector obtained by adding the "Attention score" of the already divided sentence, or the like may be used. If there is no divided semantic expression data, a predetermined initial value may be input as the predetermined value.

既に分割された意味表現データとは、一つ前に分割した意味表現データである。既に分割された文の「Ａｔｔｅｎｔｉｏｎｓｃｏｒｅ」を足し合わせたベクトルとは、一つ前に学習データ分割部１４４が分割した意味表現データをもとに、デコーダ１４６の出力である文書形式のデータに含まれる単語を出力する際に用いられた「Ａｔｔｅｎｔｉｏｎｓｃｏｒｅ」を足し合わせたベクトルである。「Ａｔｔｅｎｔｉｏｎｓｃｏｒｅ」は、生成する単語が入力文のどこに注目するのかを示した値である。 The already divided semantic expression data is the previously divided semantic expression data. The vector to which the "Attention score" of the already divided sentence is added is included in the document format data which is the output of the decoder 146 based on the semantic expression data divided by the learning data dividing unit 144 immediately before. It is a vector obtained by adding "Attention score" used when outputting the word. The "Attention score" is a value indicating where in the input sentence the generated word focuses.

なお、学習データ分割部１４４は、分割された複数の文の間で意味表現が重複することを抑制する手段を備えてもよい。学習データ分割部１４４は、例えば、一つ前に分割された文の「Ａｔｔｅｎｔｉｏｎｓｃｏｒｅ」を足し合わせたベクトルが含まれない所定の値を用いて、所定の意味表現データを抽出する手段を備えてもよい。学習データ分割部１４４は、例えば、分割前の意味表現データの中から既に分割された意味表現データを削除し、残りの意味表現データから一つの所定の意味表現データを抽出する手段を備えてもよい。 The learning data division unit 144 may include means for suppressing duplication of semantic expressions among the plurality of divided sentences. For example, the learning data division unit 144 may include means for extracting the predetermined semantic expression data using a predetermined value that does not include the vector obtained by adding the "Attention score" values of the previously divided sentence. The learning data division unit 144 may also include, for example, means for deleting the already divided semantic expression data from the semantic expression data before division and extracting one predetermined semantic expression data from the remaining semantic expression data.
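The step-by-step division described above can be sketched as two small loops. In the described system the extraction step is a learned network; here it is passed in as a plain function, and every name below (`split_with_state`, `split_with_removal`, `largest_component`) is hypothetical. Modelling "deleting already divided data" as vector subtraction and fixing the number of segments in advance are both assumptions made for illustration.

```python
def split_with_state(h, extract_step, num_segments, initial_state=None):
    """Extract segments one at a time, feeding the previous segment back in."""
    # With nothing divided yet, a predetermined initial value is used.
    state = initial_state if initial_state is not None else [0.0] * len(h)
    segments = []
    for _ in range(num_segments):
        segment = extract_step(h, state)
        segments.append(segment)
        state = segment  # the previously divided data conditions the next step
    return segments


def split_with_removal(h, extract_step, num_segments):
    """Extract segments one at a time, removing each from what remains."""
    remaining = list(h)
    segments = []
    for _ in range(num_segments):
        segment = extract_step(remaining)
        segments.append(segment)
        # Remove the extracted part so later segments cannot repeat it.
        remaining = [r - s for r, s in zip(remaining, segment)]
    return segments


def largest_component(vector):
    """Toy extraction step: keep only the largest-magnitude component."""
    index = max(range(len(vector)), key=lambda i: abs(vector[i]))
    return [v if i == index else 0.0 for i, v in enumerate(vector)]
```

For instance, `split_with_removal([3.0, 1.0, 2.0], largest_component, 3)` peels off one component per step, so no component appears in two segments.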

図４に戻り、分割モデル生成部１４は、生成した分割モデル１４４を分割モデル記憶部２３に保存する（Ｓ１５）。なお、分割モデル生成部１４は、生成した複数の分割モデル１４４の中で、分割精度が最も良い分割モデル１４４を分割モデル記憶部２３に保存してもよい。分割モデル生成システム１０は、処理（Ｓ１５）の後に終了する。 Returning to FIG. 4, the division model generation unit 14 stores the generated division model 144 in the division model storage unit 23 (S15). The division model generation unit 14 may store the division model 144 having the highest division accuracy among the generated plurality of division models 144 in the division model storage unit 23. The split model generation system 10 ends after the process (S15).

図９は、編集部２０の処理の流れ図である。編集部２０は、入力部２１の処理（Ｓ２１）と、分割対象選択部２２の処理（Ｓ２２）と、分割部２４の処理（Ｓ２３）と、分割評価部２５の処理（Ｓ２４）と、評価提示部２６の処理（Ｓ２５）と、分割結果抽出部２７の処理（Ｓ２６〜Ｓ２８）とによって実行される。編集部２０は、ユーザの操作によって開始される。 FIG. 9 is a processing flow chart of the editorial unit 20. The editing unit 20 processes the input unit 21 (S21), the division target selection unit 22 (S22), the division unit 24 (S23), the division evaluation unit 25 (S24), and the evaluation presentation. It is executed by the process of the unit 26 (S25) and the process of the division result extraction unit 27 (S26 to S28). The editorial unit 20 is started by a user operation.

編集部２０の処理は、入力部２１の処理（Ｓ２１）に移動する。入力部２１には、ユーザによって所定の文書データ３３１（図１１参照）が入力される（Ｓ２１）。入力部は、分割対象選択部２２に、所定の文書データ３３１を送信する。 The process of the editorial unit 20 moves to the process (S21) of the input unit 21. Predetermined document data 331 (see FIG. 11) is input to the input unit 21 by the user (S21). The input unit transmits predetermined document data 331 to the division target selection unit 22.

分割対象選択部２２は、所定の文書データ３３１の中から、分割処理の対象になる文を抽出する（Ｓ２２）。分割対象選択部２２は、例えば、一文の単語数に基づいて、所定の文書データ３３１の中から分割対象の文を抽出する。分割対象選択部２２は、例えば、単語数が所定数以上の一文を分割対象の文として抽出する。分割対象選択部２２は、単語数に限らず、他の基準に基づいて、分割対象の文を抽出してもよい。分割対象選択部２２は、抽出した分割対象の文のデータを、分割部２４に送信する。 The division target selection unit 22 extracts a sentence to be divided processing from the predetermined document data 331 (S22). The division target selection unit 22 extracts, for example, a sentence to be divided from the predetermined document data 331 based on the number of words in one sentence. The division target selection unit 22 extracts, for example, one sentence having a predetermined number of words or more as a sentence to be divided. The division target selection unit 22 may extract sentences to be divided based on other criteria, not limited to the number of words. The division target selection unit 22 transmits the data of the extracted sentence to be divided to the division unit 24.

分割部２４は、分割モデル記憶部２３に保存される分割モデル１４４に基づいて、分割対象の文を、意味表現ごとに複数の文に分割する（Ｓ２３）。分割部２４は、言語サーバ５０に保存される単語データを用いて、分割した文を生成する。分割部は、分割結果を分割評価部２５に送信する。 The division unit 24 divides the sentence to be divided into a plurality of sentences for each semantic expression based on the division model 144 stored in the division model storage unit 23 (S23). The division unit 24 generates a divided sentence by using the word data stored in the language server 50. The division unit transmits the division result to the division evaluation unit 25.

分割評価部２５は、分割結果に対して精度評価をする（Ｓ２４）。分割評価部２５は、例えば、分割前から分割後の文における意味表現の保持の度合いと、分割後の文の流暢さと、の観点から精度評価をする。 The division evaluation unit 25 evaluates the accuracy of the division result (S24). The division evaluation unit 25 evaluates the accuracy from the viewpoints of, for example, the degree of retention of the semantic expression in the sentence before and after the division and the fluency of the sentence after the division.

分割評価部２５は、例えば、元の文の意味表現を表すベクトルと、分割後の文の意味表現を結合して得られたベクトルと、の間のｃｏｓｉｎｅ距離等を用いることによって、意味表現の保持について評価をする。分割評価部２５は、例えば、分割後の言語に適する言語モデルを用いることによって、流暢さについて評価をする。分割評価部２５は、評価提示部２６に評価結果を送信する。 The division evaluation unit 25 uses, for example, the cosine distance between the vector representing the semantic expression of the original sentence and the vector obtained by combining the semantic expression of the sentence after division, thereby expressing the semantic expression. Evaluate retention. The division evaluation unit 25 evaluates fluency, for example, by using a language model suitable for the divided language. The divided evaluation unit 25 transmits the evaluation result to the evaluation presentation unit 26.
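The retention check above can be illustrated with a plain-Python cosine distance. How the divided sentences' vectors are combined is not specified in the text, so the element-wise sum in `combine` is an assumption; both function names are hypothetical.

```python
import math


def cosine_distance(u, v):
    """Return 1 - cosine similarity of two equally sized, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("cosine distance is undefined for a zero vector")
    return 1.0 - dot / (norm_u * norm_v)


def combine(vectors):
    """Combine segment vectors by element-wise sum (an assumed choice)."""
    return [sum(column) for column in zip(*vectors)]
```

A distance near 0 between the original sentence's vector and `combine(...)` of the divided sentences' vectors would suggest that the meaning was retained; a distance near 1 would suggest that it was not.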

図１０は、評価提示部２６の説明図である。評価提示部２６は、例えば、評価結果２６１と、閾値入力欄２６２と、実行ボタン２６３と、キャンセルボタン２６４と、を入出力装置６１に表示させる。評価結果２６１は、分割評価部２５の評価結果である。評価提示部２６は、例えば、意味表現の保持を「０．９」と出力し、流暢さを「０．８」と出力する。 FIG. 10 is an explanatory diagram of the evaluation presentation unit 26. The evaluation presentation unit 26 causes the input / output device 61 to display, for example, the evaluation result 261, the threshold value input field 262, the execute button 263, and the cancel button 264. The evaluation result 261 is an evaluation result of the division evaluation unit 25. The evaluation presentation unit 26 outputs, for example, the retention of the semantic expression as "0.9" and the fluency as "0.8".

閾値入力欄２６２は、所定の閾値を受け付ける機能である。所定の閾値は、分割結果抽出部２７が分割結果から変換対象の文のデータを選択する際に使用する値である。ユーザは、分割結果の評価結果２６１に基づいて、所定の閾値を設定する（図９の処理（Ｓ２５））。 The threshold value input field 262 is a function that accepts a predetermined threshold value. The predetermined threshold value is a value used when the division result extraction unit 27 selects the data of the sentence to be converted from the division result. The user sets the predetermined threshold value based on the evaluation result 261 of the division result (process (S25) of FIG. 9).

実行ボタン２６３は、分割結果抽出部２７を開始するボタンである。キャンセルボタン２６４は、編集部２０の処理を終了するボタンである。 The execution button 263 is a button for starting the division result extraction unit 27. The cancel button 264 is a button for ending the process of the editorial unit 20.

図９に戻り、分割結果抽出部２７は、所定の閾値と、各分割された文の評価値と、を比較する（Ｓ２６）。所定の閾値よりも分割された文の評価値が高い場合（Ｓ２６：Ｙｅｓ）には、分割後の文のデータを変換データ記憶部２８に保存する（Ｓ２７）。所定の閾値よりも分割された文の評価値が低い場合（Ｓ２６：Ｎｏ）には、分割前の文のデータを変換データ記憶部２８に保存する（Ｓ２８）。分割結果抽出部２７の処理（Ｓ２７，２８）の後に、編集部２０の処理は、終了する。 Returning to FIG. 9, the division result extraction unit 27 compares the predetermined threshold value with the evaluation value of each divided sentence (S26). When the evaluation value of the divided sentence is higher than the predetermined threshold value (S26: Yes), the data of the divided sentence is stored in the conversion data storage unit 28 (S27). When the evaluation value of the divided sentence is lower than the predetermined threshold value (S26: No), the data of the sentence before the division is stored in the conversion data storage unit 28 (S28). After the processing of the division result extraction unit 27 (S27, 28), the processing of the editorial unit 20 ends.
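The branch at S26 can be sketched as a single function. The text gives two evaluation values (retention of the semantic expression and fluency) but does not say how they are compared with the threshold, so a single score is assumed here. The text also does not specify the case where the score equals the threshold; this sketch falls back to the original sentence in that case. The name `choose_output` is hypothetical.

```python
def choose_output(original_sentence, split_sentences, score, threshold):
    """Return the sentences to store for conversion (S27 or S28)."""
    if score > threshold:
        # S26: Yes -> store the divided sentences (S27).
        return list(split_sentences)
    # S26: No (or equal, by assumption) -> store the undivided sentence (S28).
    return [original_sentence]
```

For example, with a threshold of 0.85 a division scored 0.9 would be kept, while one scored 0.8 would be discarded in favour of the original sentence.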

図１１は、出力部３３，４２の説明図である。出力部３３は、所定の文書データ３３１および、変換結果３３２を入出力装置６１に表示させる。変換結果３３２は、変換部３１にて変換された文書である。 FIG. 11 is an explanatory diagram of the output units 33 and 42. The output unit 33 causes the input / output device 61 to display the predetermined document data 331 and the conversion result 332. The conversion result 332 is a document converted by the conversion unit 31.

変換部３１は、変換データ記憶部２８から文のデータを取得する。変換部３１は、変換ソフト記憶部３２に保存される変換ソフトを用いて、取得した文を変換する。変換部３１は、言語サーバ５０から単語データを取得することによって、変換後の文を生成する。変換部３１は、変換前の文のデータと、変換後の文のデータと、を出力部３３へ送信する。なお、文の変換には、技術文書を対象とした機械翻訳、音声データを介した対話、文の分かりやすさに注目した言い換え生成等が含まれる。 The conversion unit 31 acquires sentence data from the conversion data storage unit 28. The conversion unit 31 converts the acquired sentence by using the conversion software stored in the conversion software storage unit 32. The conversion unit 31 generates the converted sentence by acquiring the word data from the language server 50. The conversion unit 31 transmits the data of the sentence before conversion and the data of the sentence after conversion to the output unit 33. Sentence conversion includes machine translation for technical documents, dialogue via voice data, paraphrase generation focusing on the comprehensibility of sentences, and the like.

なお、変換部３１は、言語サーバ５０に記憶される単語データよりも所定の文書データ３３１に含まれる単語を優先して使用することで、変換後の文を生成してもよい。これにより、変換部３１は、表記の揺れまたは記号の誤り等を抑制することができる。 The conversion unit 31 may generate the converted sentence by preferentially using the words included in the predetermined document data 331 over the word data stored in the language server 50. As a result, the conversion unit 31 can suppress fluctuations in the notation or errors in the symbols.

出力部４２は、変換精度の評価４２１を入出力装置６１に表示させる。出力部４２は、例えば、変換精度の評価４２１として、意味表現の保持を「０．９」と表示させる。出力部４２は、例えば、変換精度の評価４２１として、流暢さを「０．８」と表示させる。 The output unit 42 causes the input / output device 61 to display the evaluation 421 of the conversion accuracy. The output unit 42 displays, for example, the retention of the semantic expression as "0.9" as the evaluation 421 of the conversion accuracy. The output unit 42 displays fluency as "0.8" as, for example, evaluation 421 of conversion accuracy.

変換精度の評価４２１は、変換評価部４１にて算出される。変換評価部４１は、分割部２４から分割後の文のデータを取得し、変換部３１から変換結果３３２を取得する。変換評価部４１は、例えば、分割後の文のデータと、変換結果３３２と、の意味表現の近さを算出することによって、意味表現の保持度を算出する。 The conversion accuracy evaluation 421 is calculated by the conversion evaluation unit 41. The conversion evaluation unit 41 acquires the data of the sentence after the division from the division unit 24, and acquires the conversion result 332 from the conversion unit 31. The conversion evaluation unit 41 calculates the degree of retention of the semantic expression by, for example, calculating the closeness of the semantic expression between the divided sentence data and the conversion result 332.

なお、変換評価部４１は、分割モデル記憶部２３に保存される分割モデル１４４を用いることによって、変換結果を評価してもよい。変換評価部４１は、例えば、分割モデル１４４を用いて、分割後の文のデータと、変換後の文とを意味表現データに変換する。変換評価部４１は、各意味表現データを比較して、分割後の文のデータと変換結果との意味表現の近さを算出する。 The conversion evaluation unit 41 may evaluate the conversion result by using the division model 144 stored in the division model storage unit 23. The conversion evaluation unit 41 uses, for example, the division model 144 to convert the data of the sentence after division and the sentence after conversion into semantic expression data. The conversion evaluation unit 41 compares each semantic expression data and calculates the closeness of the semantic expression between the divided sentence data and the conversion result.

本実施例に示す分割モデル生成システム１０は、入力部１１と、学習データ生成部１２と、学習データ記憶部１３と、分割モデル生成部１４と、を有することによって、文を意味表現ごとに分割する分割モデル１４４を生成することができる。これにより、分割モデル生成システム１０は、変換に適した単位に文を分割することができる。その結果、機械翻訳処理または自然言語処理等の文の品質を向上することができる。 The divided model generation system 10 shown in this embodiment has an input unit 11, a learning data generation unit 12, a learning data storage unit 13, and a division model generation unit 14, so that the sentence is divided for each semantic expression. It is possible to generate a division model 144 to be used. As a result, the division model generation system 10 can divide the sentence into units suitable for conversion. As a result, the quality of sentences such as machine translation processing or natural language processing can be improved.

Sentences in technical documents such as patent specifications sometimes have a nested structure. If a long sentence is simply cut into a plurality of mutually independent sentences, words that originally stood in a dependency relationship (modifier and head) may end up in different sentences. Because the division model generation system 10 generates the division model 144, which divides a sentence by semantic expression, the editing unit 20 can divide a sentence while preserving its dependency relationships.

If a long sentence is simply divided, a later sentence may be completely cut off from the preceding context. The learning data division unit 144 of the division model generation unit 14 refers to data on the sentences already divided when it extracts the semantic expression data to divide next, and can therefore generate sentences whose anaphoric relationship to the preceding sentence is clear.

Expressions suited to translation differ between languages, so it is difficult to infer from the source sentence alone which expressions suit the target language. The division model generation system 10 trains the division model 144 on a corpus written in the source language and the target language, and can therefore divide the source sentence into processing units suitable for translation while taking the characteristics of the target language into account.

Because the learning data generation unit 12 generates, as learning data, combinations of a first sentence and a second sentence that have a predetermined correspondence relationship, the division model generation unit 14 can generate the division model 144 by comparing the first sentence with the second sentence. The division model generation unit 14 can thus generate the division model 144 that divides a sentence by semantic expression.

Because the first document data 71 and the second document data 72 each have a plurality of items arranged by predetermined items, the learning data generation unit 12 can extract learning data item by item. The learning data generation unit 12 can thus extract learning data more efficiently than when no such items are available as a reference.
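
Item-based extraction can be sketched as follows. The sketch assumes that each structured document is a mapping from item name to a list of sentences and that all sentences under a shared item are treated as one corresponding group. It also assumes that one-to-one groups are skipped, on the reading that only one-to-many, many-to-one, and many-to-many groups show how a sentence is divided. None of these details come from the source, and generate_learning_data is a hypothetical name.

```python
from typing import Dict, List

Document = Dict[str, List[str]]  # item name -> sentences (assumed layout)


def generate_learning_data(first_doc: Document, second_doc: Document) -> List[dict]:
    """Pair the sentences of corresponding items in two structured documents."""
    learning_data = []
    for item, first_sentences in first_doc.items():
        second_sentences = second_doc.get(item)
        if not first_sentences or not second_sentences:
            continue  # the item is missing or empty in one of the documents
        if len(first_sentences) == 1 and len(second_sentences) == 1:
            continue  # one-to-one: assumed to carry no division signal
        learning_data.append(
            {
                "item": item,
                "first": list(first_sentences),
                "second": list(second_sentences),
            }
        )
    return learning_data
```

Restricting the comparison to sentences under the same item keeps the search small, which is one way to read the efficiency gain described above.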

The division model generation unit 14 generates the division model by dividing the first sentence and the second sentence by semantic expression. The division model generation unit 14 can thus generate a division model that divides a sentence by semantic expression.

By including the division model generation system 10 and the editing unit 20, the text segmentation device 1 can divide the predetermined document data 331 based on the division model 144. The text segmentation device 1 can thus divide the predetermined document data 331 by semantic expression.

Further, by including the conversion processing unit 30, the text segmentation device 1 can convert the predetermined document data 331 into other sentences. The text segmentation device 1 can thus, for example, translate or proofread the predetermined document data 331.

By including the division evaluation unit 25 and the division result extraction unit 27, the editing unit 20 can select, from the sentences divided by the division unit 24, the sentences suited to conversion. This improves the quality of the sentences converted by the conversion processing unit 30.
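
The selection step can be sketched as below. The sketch assumes that the division unit produces several candidate divisions of the same sentence and that the evaluation index is a function returning a score where higher is better. The toy index used here, which penalizes the longest remaining segment, is purely illustrative and is not the index described in this document; both function names are hypothetical.

```python
from typing import Callable, List, Sequence


def longest_segment_penalty(segments: List[str]) -> float:
    """Toy evaluation index (assumption): shorter longest segment scores higher."""
    return -float(max((len(s) for s in segments), default=0))


def extract_best_division(
    candidates: Sequence[List[str]],
    evaluate: Callable[[List[str]], float],
) -> List[str]:
    """Return the candidate division with the highest evaluation score."""
    if not candidates:
        raise ValueError("at least one candidate division is required")
    return max(candidates, key=evaluate)
```

Passing the index as a parameter keeps the extraction logic independent of how the evaluation is actually computed.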

Further, by including the conversion evaluation unit 40, the text segmentation device 1 can evaluate the conversion result using the division model 144. The conversion evaluation unit 40 can thus evaluate the conversion result using semantic expression data that does not depend on structural differences between languages.

This embodiment is a modification of the first embodiment, so the description focuses on the differences from the first embodiment. FIG. 12 is a schematic diagram of the text segmentation device 1a. The text segmentation device 1a divides one sentence into a plurality of sentences by using a dependency model, described later. The text segmentation device 1a has, for example, a division model generation system 10a, an editing unit 20a, the conversion processing unit 30, the conversion evaluation processing unit 40, and the language server 50.

The division model generation system 10a generates the division model 144, which divides a sentence, and a dependency model, which divides a sentence based on its dependency relationships. The division model generation system 10a has, for example, the input unit 11, the learning data generation unit 12, a learning data storage unit 13a, a division model generation unit 14a, and a dependency model generation unit 15.

The learning data storage unit 13a is a database that records learning data. The learning data storage unit 13a supplies the learning data to the division model generation unit 14a and the dependency model generation unit 15.

The division model generation unit 14a generates, based on the learning data, the division model 144 that divides a document. The division model generation unit 14a divides the first sentence and the second sentence by semantic expression and generates the division model 144 based on the divided first sentence and the divided second sentence. The division model generation unit 14a stores the division model 144 in the division model storage unit 23 and transmits the division model 144 to the dependency model generation unit 15. The division model generation unit 14a may also acquire the dependency model from the dependency model generation unit 15 and generate the division model 144 based on the dependency model.

The dependency model generation unit 15 generates a dependency model that divides a sentence based on the dependency relationships of the sentence. The dependency model generation unit 15 generates the dependency model based on the learning data and the division model 144 generated by the division model generation unit 14a. The dependency model generation unit 15 stores the dependency model in the division model storage unit 23a.

The editing unit 20a divides one predetermined sentence into a plurality of sentences based on at least one of the division model 144 and the dependency model. The editing unit 20a has, for example, the input unit 21, the division target selection unit 22, a division model storage unit 23a, a division unit 24a, the division evaluation unit 25, the evaluation presentation unit 26, the division result extraction unit 27, and the conversion data storage unit 28.

The division model storage unit 23a is a database that stores the division model 144 and the dependency model. The division model storage unit 23a transmits at least one of the division model 144 and the dependency model to the division unit 24a. The division unit 24a divides one division target sentence into a plurality of sentences based on at least one of the division model 144 and the dependency model. The division unit 24a transmits the divided sentence data to the division evaluation unit 25.

FIG. 13 is a flowchart of the processing of the dependency model generation unit 15. The dependency model generation unit 15 acquires learning data from the learning data storage unit 13a (S31). For example, the dependency model generation unit 15 acquires the first sentence "JP: Sen" from the learning data storage unit 13a (see FIG. 6).

The dependency model generation unit 15 acquires the division result of the learning data from the division model generation unit 14a (S32). For example, the dependency model generation unit 15 acquires "JP: Seg1", "JP: Seg2", and "JP: Seg3" (see FIG. 7).

The dependency model generation unit 15 generates the dependency model (S33). For example, the dependency model generation unit 15 compares "JP: Sen" with "JP: Seg1", "JP: Seg2", and "JP: Seg3" and thereby learns the dependency relationships within the single sentence "JP: Sen". The dependency model generation unit 15 generates the learning result as the dependency model. The dependency model generation unit 15 stores the dependency model in the division model storage unit 23a (S34).
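
As a rough illustration of the comparison in S33, the sketch below locates each segment inside the original sentence and returns its character span. It assumes that the segments appear verbatim and in order within the sentence, which the source does not state, and segment_spans is a hypothetical name. An actual dependency model would be trained on material such as these spans rather than consist of them.

```python
from typing import List, Tuple


def segment_spans(sentence: str, segments: List[str]) -> List[Tuple[int, int]]:
    """Return the (start, end) character span of each segment within the sentence."""
    spans = []
    cursor = 0  # search resumes where the previous segment ended
    for segment in segments:
        start = sentence.find(segment, cursor)
        if start < 0:
            raise ValueError(f"segment not found in order: {segment!r}")
        cursor = start + len(segment)
        spans.append((start, cursor))
    return spans
```

For a sentence "abc def ghi" divided into "abc", "def", and "ghi", the spans mark the positions at which the sentence was cut, which is the kind of signal from which the relationship between a sentence and its segments could be learned.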

FIG. 14 is a flowchart of the processing of the editing unit 20a. The processing of the editing unit 20a consists of the processing of the input unit 21 (S21), the processing of the division target selection unit 22 (S22), the processing of the division unit 24a (S41, S42), the processing of the division evaluation unit 25 (S24), the processing of the evaluation presentation unit 26 (S25), and the processing of the division result extraction unit 27 (S26 to S28). The processing of the editing unit 20a is started by a user operation.

The processing of the editing unit 20a proceeds to the processing of the input unit 21 (S21), in which the input unit 21 acquires the predetermined document data 331. The division target selection unit 22 extracts the sentences to be divided from the predetermined document data 331 (S22).

The division unit 24a acquires at least one of the division model 144 and the dependency model from the division model storage unit 23a (S41). The division unit 24a divides the division target data based on at least one of the division model 144 and the dependency model (S42). The division unit 24a transmits the division result to the division evaluation unit 25.

By including the dependency model generation unit 15, the text segmentation device 1a of this embodiment can divide one sentence into a plurality of sentences based on the dependency relationships of the sentence. Accordingly, even in tasks where the language does not change, such as proofreading a document, the editing unit 20a can divide a sentence into lengths suitable for converting the document. As a result, the text segmentation device 1a can improve the quality of sentence conversion.

A computer program for causing a computer to function as the division model generation system implements the following on the computer: an input unit that receives first document data structured to include a plurality of first sentences and second document data structured to include a plurality of second sentences; a learning data generation unit that generates, as learning data, a combination of a first sentence and a second sentence that have a predetermined correspondence relationship among the plurality of first sentences and the plurality of second sentences; a learning data storage unit that stores the generated learning data; and a division model generation unit that learns based on the learning data stored in the learning data storage unit and generates the division model 144 that divides a sentence.

The present invention is not limited to the embodiments described above and includes various modifications. The embodiments have been described in detail to explain the present invention clearly, and the invention is not necessarily limited to configurations that include every element described. Part of the configuration of one embodiment can be replaced with the configuration of another embodiment, and the configuration of another embodiment can be added to the configuration of one embodiment. For part of the configuration of each embodiment, other configurations can also be added, deleted, or substituted.

Some or all of the configurations, functions, processing units, processing means, and the like described above may be implemented in hardware, for example by designing them as an integrated circuit. The configurations, functions, and the like described above may also be implemented in software by having a processor interpret and execute programs that implement the respective functions. Information such as the programs, tables, and files that implement each function can be stored in a recording device such as a memory, a hard disk, or an SSD (Solid State Drive), or on a recording medium such as an IC card, an SD card, or a DVD.

The technical features included in the embodiments described above are not limited to the combinations specified in the claims and can be combined as appropriate.

1 ... text segmentation device, 10 ... division model generation system, 11 ... input unit, 12 ... learning data generation unit, 13 ... learning data storage unit, 14 ... division model generation unit, 141 ... first target data conversion unit, 142 ... first source data conversion unit, 143 ... aggregation unit, 144 ... learning data division unit, 145 ... second target data conversion unit, 146 ... second source data conversion unit, 20 ... editing unit, 30 ... conversion processing unit, 40 ... conversion evaluation processing unit, 50 ... language server

Claims

A division model generation system comprising:
an input unit that receives first document data structured to include a plurality of first sentences and second document data structured to include a plurality of second sentences;
a learning data generation unit that generates, as learning data, a combination of a first sentence and a second sentence that have a predetermined correspondence relationship among the plurality of first sentences and the plurality of second sentences;
a learning data storage unit that stores the generated learning data; and
a division model generation unit that learns based on the learning data stored in the learning data storage unit and generates a division model that divides a sentence.

The division model generation system according to claim 1, wherein the predetermined correspondence relationship is one in which the combination of the first sentence and the second sentence includes at least one of a one-to-many relationship, a many-to-one relationship, and a many-to-many relationship.

The division model generation system according to claim 1, wherein the first document data and the second document data each have a plurality of items arranged in a predetermined order, and
the learning data generation unit extracts the first sentence and the second sentence from corresponding items among the items of the first document data and the items of the second document data.

The division model generation system according to claim 1, wherein the first sentence and the second sentence having the predetermined correspondence relationship include a plurality of sets of semantic expressions, and
the division model generation unit
divides the first sentence and the second sentence according to the sets of semantic expressions, and
generates the division model based on the divided first sentence and the divided second sentence.

The division model generation system according to claim 1, further comprising a dependency learning unit that learns the dependency relationship of sentences based on the learning result of the division model generation unit.

A text segmentation device comprising:
the division model generation system according to claim 1; and
an editing unit that divides one predetermined sentence into a plurality of sentences based on the division model generated by the division model generation system,
wherein the editing unit includes:
an input unit that receives predetermined document data structured to include a plurality of sentences;
a division target sentence selection unit that selects a division target sentence from the predetermined document data; and
a division unit that divides the one division target sentence into a plurality of sentences based on the division model.

The text segmentation device according to claim 6, further comprising a conversion processing unit that converts the plurality of divided sentences into other sentences.

The text segmentation device according to claim 7, wherein the editing unit further includes:
a division evaluation unit that evaluates the division result of the division unit based on a predetermined evaluation index; and
a division result extraction unit that extracts, based on the evaluation by the division evaluation unit, the sentences to be converted by the conversion processing unit from the plurality of sentences divided by the division evaluation unit.

The text segmentation device according to claim 7, further comprising a conversion evaluation processing unit that evaluates the conversion accuracy of the conversion processing unit.

A division model generation method in which a computer generates a division model that divides a document, wherein
the computer
generates, as learning data, a combination of a first sentence and a second sentence that have a predetermined correspondence relationship among a plurality of first sentences included in input first document data and a plurality of second sentences included in input second document data, and
generates, based on the generated learning data, a division model that divides a document.