JP7358132B2

JP7358132B2 - Computer systems and document classification methods

Info

Publication number: JP7358132B2
Application number: JP2019167016A
Authority: JP
Inventors: 祐太是枝; 久雄間瀬; 太亮尾崎; 康充池浦; 光一岡本
Original assignee: Hitachi Ltd
Current assignee: Hitachi Ltd
Priority date: 2019-09-13
Filing date: 2019-09-13
Publication date: 2023-10-10
Anticipated expiration: 2039-09-13
Also published as: JP2021043849A

Description

The present invention relates to a computer system and a method for classifying data.

To support information retrieval and browsing, documents are given tags that indicate their classification (category or class). Known examples of such tags include F-terms and the International Patent Classification for patent specifications, and MeSH terms for medical papers. Because tags have generally been assigned manually by people with specialized knowledge, tagging requires a great deal of labor.

Maintaining the quality of information retrieval presupposes that tags are assigned with high accuracy according to consistent criteria. What is needed, therefore, is not full automation but a method that assists the users who perform the classification.

For example, Non-Patent Document 1 discloses a method of assigning classification results to documents by constructing a graph from documents and words. However, Non-Patent Document 1 does not disclose a method for presenting the basis of the classification together with the classification result. If an automatically assigned classification result is merely presented to the user, the user cannot understand why that result was assigned, and therefore cannot judge whether to accept it.

In this regard, the techniques of Non-Patent Documents 2 and 3 are known. Non-Patent Document 2 discloses a method that improves the explainability of classification by presenting, together with the document's classification result, the locations (words) whose removal would significantly change that result. Non-Patent Document 3 discloses a method that uses a recurrent neural network and an attention mechanism to present a part of the document as a basis location together with the document's classification result.

Non-Patent Document 1: Liang Yao, Chengsheng Mao, and Yuan Luo. 2018. Graph Convolutional Networks for Text Classification. AAAI.
Non-Patent Document 2: Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. KDD.
Non-Patent Document 3: Zichao Yang, Diyi Yang, Chris Dyer, Xiaodong He, Alex Smola, and Eduard Hovy. 2016. Hierarchical Attention Networks for Document Classification. NAACL.

However, because the method described in Non-Patent Document 2 does not modify the document classification method itself, what it presents diverges from the basis a person would consider, and it cannot sufficiently support document classification. In the attention-based basis presentation described in Non-Patent Document 3, the locations that contribute to the classification do not coincide with the locations a person would regard as the basis, so a trade-off arises between the appropriateness of the presented basis and classification accuracy.

The present invention realizes a system and a method that present a part of a document to the user as a basis location, together with the document's classification result, without creating a trade-off between the appropriateness of the presented basis location and classification accuracy.

A representative example of the invention disclosed in this application is as follows. A computer system comprises at least one computer, and the at least one computer has a processor, a memory connected to the processor, and an interface connected to the processor. The computer system comprises: a graph construction unit that receives input of document data and generates a graph whose vertices are the document and the elements of the document; a classification unit that calculates, for each of the plurality of vertices, an index used to classify the document into one of a plurality of classes; and a document reconstruction unit that classifies the document based on the index of at least one of the vertices, identifies a basis location on the document composed of at least one element of the document that contributed to the classification, and presents the classification result and the basis location on the document. The classification unit uses the output of a model that takes the graph as input to calculate, for each class, a value representing the probability of belonging to that class as the index. The document reconstruction unit classifies the document based on the index of the vertex corresponding to the document, and identifies the basis location on the document based on the index, for the class into which the document was classified, of the vertices corresponding to the elements of the document.

According to one aspect of the present invention, a part of a document can be presented to the user as a basis location, together with the document's classification result, without creating a trade-off between the appropriateness of the presented basis location and classification accuracy. Problems, configurations, and effects other than those described above will become clear from the description of the following embodiments.

FIG. 1 is a diagram showing an example of the configuration of the computer system of Embodiment 1.
FIGS. 2 to 4 are flowcharts illustrating an example of processing executed by the graph construction unit of Embodiment 1.
FIGS. 5A and 5B are diagrams illustrating data input/output in the processing executed by the graph construction unit of Embodiment 1.
FIG. 6 is a diagram illustrating a graph constructed by the graph construction unit of Embodiment 1.
The remaining drawings (figure numbers not recoverable from this text) are:
A flowchart illustrating an example of processing executed by the document reconstruction unit of Embodiment 1.
A diagram illustrating the structure of output data in the processing executed by the document reconstruction unit of Embodiment 1.
A diagram illustrating an example of the user interface presented by the display unit of Embodiment 1.
A flowchart illustrating an example of processing executed by the display unit of Embodiment 1.
Two flowcharts illustrating processing executed by the graph construction unit of Embodiment 2.
Two diagrams illustrating data input/output in the processing executed by the graph construction unit of Embodiment 2.
A diagram illustrating the user interface in the processing executed by the display unit of Embodiment 3.
A diagram showing a configuration example of the computer of Embodiment 4.
A diagram illustrating the user interface in the processing executed by the related element display unit of Embodiment 4.
A flowchart illustrating an example of processing executed by the related element display unit of Embodiment 4.
A diagram illustrating the data structure of data used in the processing executed by the related element display unit of Embodiment 4.
Two diagrams illustrating data input/output in the processing executed by the graph construction unit of Embodiment 5.
A flowchart illustrating an example of step S211 executed by the graph construction unit of Embodiment 5.

Embodiments of the present invention are described below with reference to the drawings. However, the present invention should not be construed as being limited to the description of the embodiments below. Those skilled in the art will readily understand that the specific configuration can be changed without departing from the spirit or gist of the present invention.

In the configurations of the invention described below, identical or similar configurations or functions are denoted by the same reference numerals, and redundant descriptions are omitted.

Expressions such as "first," "second," and "third" in this specification are used to identify constituent elements and do not necessarily limit their number or order.

To facilitate understanding of the invention, the position, size, shape, range, and the like of each configuration shown in the drawings may not represent the actual position, size, shape, range, and the like. The present invention is therefore not limited to the position, size, shape, range, and the like disclosed in the drawings.

FIG. 1 is a diagram showing an example of the configuration of the computer system of Embodiment 1.

The computer system presents a part of a document to the user as a basis location, together with the document's classification result. Embodiment 1 assumes a computer system that classifies documents consisting only of character strings. Document classification is defined here as the process of determining which of a plurality of classes a document belongs to.

The computer system is composed of two computers 100-1 and 100-2, which are connected to each other via a network 120. The network 120 is, for example, a LAN (Local Area Network) or a WAN (Wide Area Network). The present invention is not limited to any particular type of network 120, and the connection may be wired or wireless.

In Embodiment 1, the computers 100-1 and 100-2 have the same hardware configuration, although their hardware configurations may differ. In the following description, the computers 100-1 and 100-2 are referred to as the computer 100 when they need not be distinguished.

The computer 100 has a processor 101, a memory 102, and a network interface 103. These hardware components are connected to one another via an internal bus. The computer 100 may also have a storage device such as an HDD (Hard Disk Drive) or an SSD (Solid State Drive), input devices such as a keyboard, a mouse, and a touch panel, and an output device such as a display.

The processor 101 executes programs stored in the memory 102. By executing processing according to a program, the processor 101 operates as a functional unit (module) having a predetermined function. In the following description, when processing is described with a functional unit as the subject, it means that the processor 101 is executing the program that implements that functional unit.

The memory 102 stores the programs executed by the processor 101 and the information those programs require. The memory 102 also includes a work area used temporarily by the programs.

The network interface 103 connects to other devices via a network.

The programs and information stored in the memory 102 of the computers 100-1 and 100-2 are described next.

The memory 102 of the computer 100-1 stores a program that implements a graph construction unit 110, and also holds graph information 111.

The graph construction unit 110 receives input of document data and, as information representing the structure of the document, constructs from that data a graph 600 (see FIG. 6) whose vertices (nodes) are the document and the constituent elements of the document. The constituent elements of a document are, for example, words and paragraphs (sentences). The present invention is not limited to the graph format; any data that represents the structure of the document may be used. Details of the graph 600 generated by the graph construction unit 110 are described with reference to FIG. 6.

The graph information 111 stores data representing the graph 600 constructed by the graph construction unit 110.

The memory 102 of the computer 100-2 stores programs that implement a classification model learning unit 112, a classification unit 114, a document reconstruction unit 115, and a display unit 116, and also holds classification model information 113.

The classification model learning unit 112 acquires the graph information 111 and executes a learning process that calculates the parameters defining the classification model. The classification model of this embodiment calculates, for a vertex of the graph, the probability that the vertex belongs to a given class. This probability is also referred to as a class probability value. For each vertex, as many class probability values are calculated as there are classes.

The classification model information 113 stores information on the classification model learned by the classification model learning unit 112, that is, its parameters.

The classification unit 114 calculates the class probability values of the vertices of the graph based on the classification model stored in the classification model information 113. The classification unit 114 also classifies the document based on the class probability values of the vertices constituting the document. More specifically, as the result of classification, the classification unit 114 calculates a value for assigning a label to at least one vertex constituting the document. A label is a value representing the class of a document.

The document reconstruction unit 115 determines the classification of the document based on the values calculated by the classification unit 114 and reconstructs the document based on the graph 600. Based on the values calculated by the classification unit 114 and the reconstructed document, it then identifies the basis location on the document that contributed to the classification. A basis location is composed of at least one constituent element. The document reconstruction unit 115 generates basis information for presenting the identified basis location, and outputs the classification result and the basis information together with the document to be classified.
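The division of labor described above can be sketched in a few lines of Python. This is only an illustrative sketch: the function name, the list-of-lists layout for class probability values, and the `top_k` cut-off for choosing basis elements are assumptions made here, not details given in the specification.

```python
def classify_and_find_basis(class_probs, doc_vertex, element_vertices, top_k=2):
    """Pick the document's class and the elements that best support it.

    class_probs[v][c] is the class probability value of vertex v for class c.
    top_k is a hypothetical cut-off; the source does not fix a selection rule.
    """
    doc_probs = class_probs[doc_vertex]
    # The document is classified from the document vertex's own values.
    predicted = max(range(len(doc_probs)), key=lambda c: doc_probs[c])
    # Elements are ranked by their value for the predicted class only.
    ranked = sorted(element_vertices,
                    key=lambda v: class_probs[v][predicted],
                    reverse=True)
    return predicted, ranked[:top_k]


# Vertex 0 is the document; vertices 1 to 3 are its elements.
class_probs = [
    [0.2, 0.8],
    [0.6, 0.4],
    [0.1, 0.9],
    [0.3, 0.7],
]
predicted_class, basis_vertices = classify_and_find_basis(class_probs, 0, [1, 2, 3])
```

Because the same per-vertex values drive both the class decision and the choice of basis elements, no separate explanation model is needed in this sketch.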

The display unit 116 outputs data for presenting the classification result and the basis location to the user.

The computer system may include a terminal or the like for inputting data to the computer 100 and acquiring data from the computer 100. The functions of the computers 100-1 and 100-2 may be consolidated into a single computer. The computers 100-1 and 100-2 may also be realized using virtualization technology.

The flow of processing in the computer system, which classifies documents based on a predetermined classification scheme, is as follows.

(Process A1) First, document data 500 (see FIG. 5) of documents to which a label in the classification scheme has been assigned is input to the graph construction unit 110. The graph construction unit 110 constructs a non-hierarchical graph 600 whose vertices are documents, paragraphs composed of two or more words, and words, and stores data representing the constructed graph 600 in the graph information 111.

(Process A2) The classification model learning unit 112 executes the learning process and stores the learned parameters in the classification model information 113. The classification model generated by the learning process is a model for calculating the class probability value of each vertex of the graph 600.

(Process A3) Based on the classification model information 113, the classification unit 114 assigns a label to at least one vertex in the graph 600 that is related to the document to be classified.

(Process A4) Based on the graph 600 and the class probability value of each vertex, the document reconstruction unit 115 outputs the classification result and the basis information together with the document to be classified. The display unit 116 presents the classification result and the basis location to the user together with the document.

This computer system can be used to support the classification of patent specifications. For example, the computer system determines whether a given International Patent Classification should be assigned to the patent specification of an application, and presents to the user the classification to be assigned together with the basis for assigning it.

This computer system can also be used to classify documents into mutually exclusive categories. For example, to make medical papers easier to search, the computer system determines which of the categories "internal medicine," "surgery," "orthopedics," "neurosurgery," "obstetrics and gynecology," "dermatology," "ophthalmology," and "otolaryngology" the abstract of a medical paper belongs to, and presents to the user that category together with the basis for the determination.

This computer system is not limited to documents; it can also be used to classify data composed of elements of differing granularity. For example, the computer system determines from the molecular structure of a compound whether the compound is toxic, and presents to the user the result together with the basis for the determination. In this case, the molecule corresponds to the document, atoms correspond to words, and functional groups correspond to paragraphs.

The processing executed by the graph construction unit 110 is described with reference to FIGS. 2 to 6. FIGS. 2 to 4 are flowcharts illustrating an example of processing executed by the graph construction unit 110 of Embodiment 1. FIGS. 5A and 5B are diagrams illustrating data input/output in the processing executed by the graph construction unit 110 of Embodiment 1. FIG. 6 is a diagram illustrating the graph constructed by the graph construction unit 110 of Embodiment 1.

As an example, consider a use case that supports the decision of whether a given International Patent Classification should be assigned to a patent specification. For simplicity, the patent specification is assumed to consist of two paragraphs. In this case, the data representing the structure of the patent specification is expressed as the graph 600 shown in FIG. 6.

On the computer, the graph 600 is managed as the adjacency matrix 520 shown in FIG. 5. The element in row i and column j of the adjacency matrix 520 represents the presence and weight of an edge from the j-th vertex to the i-th vertex. Specifically, an element value of zero means that no edge connects the vertices, and a non-zero element value means that there is an edge directed from the j-th vertex to the i-th vertex whose weight is the element value.

The graph construction unit 110 generates the adjacency matrix 520 in order to handle documents as the graph 600. Because the adjacency matrix 520 is usually sparse, it is desirable to hold it as a sparse matrix in the memory 102 of the computer 100 as well.
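As a minimal sketch of the convention above, the following Python class stores only the non-zero elements of an adjacency matrix, keyed by (row i, column j), where the element at (i, j) is the weight of the edge from vertex j to vertex i. The class and method names are hypothetical, and a dictionary stands in for whatever sparse format an actual implementation would use.

```python
class SparseAdjacency:
    """Sparse adjacency matrix: element (i, j) is the edge weight from j to i."""

    def __init__(self, num_vertices):
        self.num_vertices = num_vertices
        self.weights = {}  # (i, j) -> non-zero weight; absent keys mean zero

    def add_edge(self, src, dst, weight):
        # An edge from src to dst is stored at row dst, column src.
        if weight != 0:
            self.weights[(dst, src)] = weight

    def element(self, i, j):
        # Zero means that no edge runs from vertex j to vertex i.
        return self.weights.get((i, j), 0.0)


adjacency = SparseAdjacency(num_vertices=3)
adjacency.add_edge(src=2, dst=0, weight=0.5)
```

Here `adjacency.element(0, 2)` holds the weight of the edge from vertex 2 to vertex 0, while `adjacency.element(2, 0)` stays zero because no edge runs in the opposite direction.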

The graph construction unit 110 receives document data 500, each item of which is composed of an index 501, a label 502, and a plurality of paragraph text data 503 (step S201).

The index 501 is a number that uniquely identifies the document data 500. Numbers are assigned sequentially starting from 0.

The label 502 is the classification result of the document. In this embodiment, the label 502 stores a value indicating whether a predetermined International Patent Classification should be assigned to the document data 500: "1" is stored when the classification should be assigned, and "0" is stored when it should not.

The paragraph text data 503 is data corresponding to the text of a paragraph in the document. For example, the text enclosed by one paragraph tag <p> is treated as one item of paragraph text data 503. The unit into which a document is divided may be any unit consisting of one or more words; for example, a document may be divided based on formal paragraphs, sentences, or discourse structure.

In this embodiment, there are N_A items of document data 500 whose label 502 is known (training data) and N_Q items of document data 500 to which a label 502 is to be assigned (evaluation data), for a total of (N_A + N_Q) items of document data 500. The label 502 of the evaluation data stores -1 as a placeholder. The description now returns to FIG. 2.
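The record layout described above might be modeled as follows. This is a sketch under assumptions: the class name `DocumentData`, its field names, and the helper `split_by_label` are hypothetical; only the three fields, the 0-based index, the 0/1 label values, and the -1 placeholder come from the text.

```python
from dataclasses import dataclass
from typing import List

UNLABELED = -1  # placeholder stored in the label of evaluation data


@dataclass
class DocumentData:
    index: int             # unique number, assigned sequentially from 0
    label: int             # 1 = assign the classification, 0 = do not, -1 = unknown
    paragraphs: List[str]  # one string per paragraph text data


def split_by_label(documents):
    """Separate training data (known label) from evaluation data (placeholder)."""
    training = [d for d in documents if d.label != UNLABELED]
    evaluation = [d for d in documents if d.label == UNLABELED]
    return training, evaluation


documents = [
    DocumentData(0, 1, ["first paragraph", "second paragraph"]),
    DocumentData(1, 0, ["another document"]),
    DocumentData(2, UNLABELED, ["document to be classified"]),
]
training, evaluation = split_by_label(documents)
```

In this example N_A is 2 and N_Q is 1, so three items of document data enter graph construction together.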

Next, the graph construction unit 110 starts loop processing over the document data 500 (step S202).

Specifically, the graph construction unit 110 selects one target document data 500 from among the plurality of document data 500.

Next, the graph construction unit 110 executes preprocessing of the document data (step S203). Specifically, the following processing is executed.

The graph construction unit 110 starts loop processing over the paragraph text data 503 of the target document data 500 (step S301).

Specifically, the graph construction unit 110 selects one target paragraph text data 503 from among the plurality of paragraph text data 503 included in the target document data 500.

Next, the graph construction unit 110 decomposes the target paragraph text data 503 into morpheme units, referred to here as words (step S302). The unit of decomposition may instead be an element smaller than a morpheme, such as a character or a byte pair encoding unit, or a phrase composed of a plurality of words. At this point, stop words that occur frequently in text, such as punctuation marks and auxiliary verbs, may be removed, and morphemes may be restored to their base forms.

Next, the graph construction unit 110 extracts word Ngrams from the words of the target paragraph text data 503 and stores them in the target document data 500 as paragraph Ngram data 504 (step S303). Extracting word Ngrams means enumerating the combinations of up to n consecutive words. For example, when n is at most 2, the morpheme sequence 「データ」 (data), 「の」 (of), 「分類」 (classification) yields five word Ngrams: 「データ」, 「の」, 「分類」, 「データの」, and 「の分類」. In this embodiment, n is at most 3, but n can be set to any value.
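The extraction in step S303 can be illustrated with a short Python sketch. The function name and the choice to join words without a separator (natural for Japanese text, while space-separated languages would likely use a space) are assumptions made for this example.

```python
def extract_ngrams(words, max_n=3):
    """Enumerate every run of 1 to max_n consecutive words, shortest first."""
    ngrams = []
    for n in range(1, max_n + 1):
        for start in range(len(words) - n + 1):
            # Join without a separator; an assumption suited to Japanese text.
            ngrams.append("".join(words[start:start + n]))
    return ngrams


example = extract_ngrams(["データ", "の", "分類"], max_n=2)
```

With `max_n=2` the three morphemes produce the five Ngrams listed in the text; raising `max_n` to 3 adds the single trigram 「データの分類」.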

In the following description, a word Ngram is referred to simply as an Ngram.

Next, the graph construction unit 110 determines whether all paragraph text data 503 included in the target document data 500 have been processed (step S304).

If it is determined that at least one unprocessed paragraph text data 503 remains, the graph construction unit 110 returns to step S301 and executes the same processing.

If it is determined that all paragraph text data 503 included in the target document data 500 have been processed, the graph construction unit 110 ends the preprocessing of the document data and proceeds to step S204. The graph construction unit 110 determines whether the preprocessing has been executed for all document data 500 (step S204).

If it is determined that at least one document data 500 has not yet been preprocessed, the graph construction unit 110 returns to step S202 and executes the same processing.

If it is determined that the preprocessing has been executed for all document data 500, the graph construction unit 110 generates an Ngram dictionary 510 (step S205).

Specifically, from all Ngrams extracted from the document data 500, the graph construction unit 110 selects the 20,000 Ngrams that satisfy the following conditions, in descending order of the number of document data 500 containing paragraph Ngram data 504 in which the Ngram appears, and generates the Ngram dictionary 510 from the selected Ngrams. The number of Ngrams registered in the Ngram dictionary 510 is denoted N_W.

(Condition 1) The number of document data 500 containing paragraph Ngram data 504 in which the Ngram appears is 5 or more.
(Condition 2) The proportion of document data 500 containing paragraph Ngram data 504 in which the Ngram appears is less than 100%.

なお、Ｎｇｒａｍ辞書５１０は、Ｎｇｒａｍを格納するフィールドであるＮｇｒａｍ５１１及びＮｇｒａｍの識別情報を格納するフィールドであるインデックス５１２から構成されるエントリを含む。 Note that the Ngram dictionary 510 includes an entry composed of an Ngram 511 that is a field that stores an Ngram, and an index 512 that is a field that stores identification information of the Ngram.
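The dictionary construction of step S205 can be sketched as below. The input layout (a list of documents, each a list of paragraphs, each a list of Ngrams), the function name, and the alphabetical tie-break among Ngrams with equal document frequency are assumptions for illustration; the text does not specify how ties are ordered.

```python
from collections import Counter


def build_ngram_dictionary(docs, max_size=20000, min_docs=5):
    """Map each selected Ngram to its index (hypothetical helper)."""
    doc_freq = Counter()
    for paragraphs in docs:
        seen = set()
        for paragraph in paragraphs:
            seen.update(paragraph)
        # Count each Ngram once per document, however often it occurs.
        doc_freq.update(seen)

    n_docs = len(docs)
    # Condition 1: at least min_docs documents. Condition 2: not every document.
    eligible = [(g, df) for g, df in doc_freq.items() if min_docs <= df < n_docs]
    # Most widespread first; ties broken by the Ngram itself (assumption).
    eligible.sort(key=lambda item: (-item[1], item[0]))
    return {g: index for index, (g, _) in enumerate(eligible[:max_size])}


docs = [[["a", "b"], ["c"]]] * 4 + [[["a", "b"]]] + [[["a"]]]
dictionary = build_ngram_dictionary(docs)
```

In this toy corpus "a" occurs in all six documents and is excluded by condition 2, "c" occurs in only four and is excluded by condition 1, so only "b" is registered.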

Next, the graph construction unit 110 calculates the pointwise mutual information (PMI) between Ngrams and generates the Ngram-Ngram matrix 521 (A^{w-w}) based on equation (1) below (step S206).

Ｎｇｒａｍ間のＰＭＩは二つのＮｇｒａｍの共起頻度を表す指標であり、以下のように算出される。 PMI between Ngrams is an index representing the co-occurrence frequency of two Ngrams, and is calculated as follows.

(Process B1) First, for the Ngrams stored in the Ngram dictionary 510, the graph construction unit 110 initializes to 0 the frequency w_i of the Ngram whose index 512 is i and the co-occurrence frequency w_{i,j} of the Ngrams whose indexes 512 are i and j.

(Process B2) The graph construction unit 110 computes a word Ngram sequence from the word sequence that starts at the r-th word of the paragraph sentence data 503 of all documents and spans a predetermined number of words (20 words in this embodiment). For each Ngram in the word Ngram sequence, if the Ngram is included in the Ngram dictionary 510, the graph construction unit 110 increments the frequency w_i, where i is the value of the index 512 corresponding to that Ngram. For every pair of Ngrams in the word Ngram sequence, if both Ngrams of the pair are included in the Ngram dictionary 510, the graph construction unit 110 increments the co-occurrence frequency w_{i,j}, where i and j are the values of the indexes 512 corresponding to the two Ngrams.

(Process B3) The graph construction unit 110 increments r and repeats the above procedure.

(Process B4) Finally, the graph construction unit 110 calculates the PMI of the Ngrams based on equation (2) below.
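Processes B1 to B4 can be sketched as follows. Equation (2) is not reproduced in this text, so the sketch assumes the standard PMI definition, log(p(i,j) / (p(i) p(j))), with probabilities estimated as window counts divided by the number of windows. Counting each Ngram at most once per window, sliding windows within a single paragraph, and the helper names are further assumptions.

```python
import math
from collections import Counter
from itertools import combinations


def window_counts(paragraphs, dictionary, window=20, max_n=3):
    """Count w_i and w_{i,j} over sliding word windows (hypothetical helper)."""
    single, pair, n_windows = Counter(), Counter(), 0
    for words in paragraphs:
        if not words:
            continue
        last_start = max(len(words) - window, 0)
        for r in range(last_start + 1):          # Process B3: advance r
            span = words[r:r + window]
            present = set()
            for n in range(1, max_n + 1):
                for s in range(len(span) - n + 1):
                    index = dictionary.get(tuple(span[s:s + n]))
                    if index is not None:
                        present.add(index)
            n_windows += 1
            single.update(present)               # w_i
            pair.update(combinations(sorted(present), 2))  # w_{i,j}, i < j
    return single, pair, n_windows


def pmi(i, j, single, pair, n_windows):
    """Standard PMI from window counts; None when the pair never co-occurs."""
    key = (min(i, j), max(i, j))
    if pair[key] == 0:
        return None
    return math.log(pair[key] * n_windows / (single[i] * single[j]))


toy_dictionary = {("a",): 0, ("b",): 1, ("c",): 2}
toy_paragraphs = [["a", "b", "c"], ["a", "b"], ["c"]]
single, pair, n_windows = window_counts(toy_paragraphs, toy_dictionary, window=2, max_n=1)
```

A positive value indicates that two Ngrams share a window more often than independence would predict; how negative or undefined values map to matrix entries is governed by equation (1), which this sketch does not attempt to reproduce.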

The element in row i, column j of the Ngram-Ngram matrix 521 represents the edge 601 from the Ngram whose index 512 is j to the Ngram whose index 512 is i. Instead of PMI, the Ngram-Ngram matrix 521 may be generated using a different index, such as the cosine distance between distributed word representations, the presence or absence of a connection in WordNet, or a combination of these.

次に、グラフ構築部１１０は、文書データ５００のループ処理を開始する（ステップＳ２０７）。 Next, the graph construction unit 110 starts loop processing of the document data 500 (step S207).

具体的には、グラフ構築部１１０は、複数の文書データ５００の中から一つのターゲット文書データ５００を選択する。また、グラフ構築部１１０は、段落番号カウンタｃを０に初期化する。ここでは、ターゲット文書データ５００のインデックス５０１をｄとする。 Specifically, the graph construction unit 110 selects one target document data 500 from among the plurality of document data 500. The graph construction unit 110 also initializes the paragraph number counter c to 0. Here, the index 501 of the target document data 500 is assumed to be d.

次に、グラフ構築部１１０は、文書及び段落の行列更新処理を実行する（ステップＳ２０８）。文書及び段落の行列更新処理では、図４に示すような処理が実行される。 Next, the graph construction unit 110 executes document and paragraph matrix update processing (step S208). In the document and paragraph matrix update process, the process shown in FIG. 4 is executed.

The graph construction unit 110 updates the document-document matrix 522 (A^{d-d}) (step S401).

具体的には、グラフ構築部１１０は、文書の総数と同じ要素数のベクトルを文書－文書行列５２２のｄ行目に追加する。追加されるベクトルは、ｄ番目の要素のみが１であり、他の要素は全て０である。この１は文書ｄから文書ｄへの自己ループに相当する。 Specifically, the graph construction unit 110 adds a vector with the same number of elements as the total number of documents to the d-th row of the document-document matrix 522. In the added vector, only the d-th element is 1, and all other elements are 0. This 1 corresponds to a self-loop from document d to document d.

Next, the graph construction unit 110 updates the document-paragraph matrix 523 (A^{d-p}) (step S402).

Specifically, the graph construction unit 110 adds a vector whose number of elements equals the total number of paragraphs across all documents (N_P) as the d-th row of the document-paragraph matrix 523. In the added vector, the c-th through (c + number of paragraphs in the target document - 1)-th elements are 1, and all other elements are 0. Each 1 represents an edge 602 from a paragraph included in document d to document d.
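Steps S401 and S402 can be illustrated with a short sketch that builds both matrices row by row from the number of paragraphs in each document. The function name and the dense list-of-lists representation are assumptions for the example; a real corpus would call for a sparse format.

```python
def document_rows(paragraph_counts):
    """Build A^{d-d} and A^{d-p} from per-document paragraph counts (hypothetical helper)."""
    n_docs = len(paragraph_counts)
    n_paragraphs = sum(paragraph_counts)
    a_dd, a_dp = [], []
    c = 0  # paragraph number counter: first paragraph of the current document
    for d, count in enumerate(paragraph_counts):
        # Row d of A^{d-d}: a single 1 on the diagonal (self-loop).
        a_dd.append([1 if k == d else 0 for k in range(n_docs)])
        # Row d of A^{d-p}: 1 for paragraphs c .. c + count - 1.
        a_dp.append([1 if c <= k < c + count else 0 for k in range(n_paragraphs)])
        c += count
    return a_dd, a_dp


a_dd, a_dp = document_rows([2, 1])
```

Here the first document owns paragraphs 0 and 1 and the second owns paragraph 2, so the second row of the document-paragraph matrix has its only 1 in the last column.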

Next, the graph construction unit 110 updates the document-Ngram matrix 524 (A^{d-w}) (step S403).

Specifically, the graph construction unit 110 adds the tfidf vector v of the document data 500 as the d-th row of the document-Ngram matrix 524. The i-th element v_i of the added tfidf vector v is calculated based on equations (3) to (5) below.

Here, ᵈw_i denotes the frequency, in the d-th document data 500, of the Ngram whose index 512 is i, and n_i denotes the number of document data 500 that contain the Ngram whose index 512 is i. Each non-zero element of the tfidf vector v represents an edge 603 from an Ngram included in document d to document d.
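Equations (3) to (5) are not reproduced in this text. The sketch below therefore assumes one common tf-idf variant: term frequency as the Ngram's share of all dictionary Ngram occurrences in the document, and inverse document frequency as the logarithm of the number of documents divided by n_i. The exact weighting used in the embodiment may differ, and the function name and sparse dictionary representation are illustrative.

```python
import math


def tfidf_vector(counts, doc_freq, n_docs):
    """Sparse tf-idf row for one document (hypothetical helper).

    counts:   {ngram_index: frequency of that Ngram in this document}
    doc_freq: {ngram_index: number of documents containing that Ngram}
    """
    total = sum(counts.values())
    vector = {}
    for i, frequency in counts.items():
        tf = frequency / total                    # assumed form of the tf term
        idf = math.log(n_docs / doc_freq[i])      # assumed form of the idf term
        vector[i] = tf * idf
    return vector


row = tfidf_vector({0: 3, 1: 1}, {0: 5, 1: 2}, n_docs=10)
```

Because condition 2 keeps out Ngrams that occur in every document, the assumed idf term is strictly positive for every dictionary entry, so each Ngram present in the document produces a non-zero element.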

次に、グラフ構築部１１０は、段落辞書５３０を更新する（ステップＳ４０４）。 Next, the graph construction unit 110 updates the paragraph dictionary 530 (step S404).

Specifically, the graph construction unit 110 adds to the paragraph dictionary 530 an entry whose document index 531 is d and whose paragraph index 532 is (N_A + N_Q + c). The paragraph dictionary 530 indicates the row of the adjacency matrix 520 at which the paragraphs of the d-th document data 500 begin.

次に、グラフ構築部１１０は、ターゲット文書データ５００の段落Ｎｇｒａｍデータ５０４のループ処理を開始する（ステップＳ４０５）。 Next, the graph construction unit 110 starts loop processing of the paragraph Ngram data 504 of the target document data 500 (step S405).

具体的には、グラフ構築部１１０は、ターゲット文書データ５００の複数の段落Ｎｇｒａｍデータ５０４の中から一つのターゲット段落Ｎｇｒａｍデータ５０４を選択する。 Specifically, the graph construction unit 110 selects one target paragraph Ngram data 504 from a plurality of paragraph Ngram data 504 of the target document data 500.

Next, the graph construction unit 110 updates the paragraph-paragraph matrix 525 (A^{p-p}) (step S406).

Specifically, the graph construction unit 110 adds a vector whose number of elements equals the total number of paragraphs across all documents as the c-th row of the paragraph-paragraph matrix 525. In the added vector, only the c-th element is 1, and all other elements are 0. This 1 corresponds to a self-loop from the paragraph to itself.

Next, the graph construction unit 110 updates the paragraph-Ngram matrix 526 (A^{p-w}) (step S407).

Specifically, the graph construction unit 110 adds the tfidf vector v of the target paragraph Ngram data 504 as the c-th row of the paragraph-Ngram matrix 526. The i-th element v_i of the added tfidf vector v can be calculated based on equations (3) to (5) by treating the paragraph in the same way as a document. Each non-zero element of the tfidf vector v represents an edge 604 from a paragraph belonging to the document to an Ngram. The graph construction unit 110 then increments c.

次に、グラフ構築部１１０は、全ての段落Ｎｇｒａｍデータ５０４について処理を実行したか否かを判定する（ステップＳ４０８）。 Next, the graph construction unit 110 determines whether processing has been performed on all paragraph Ngram data 504 (step S408).

少なくとも一つの段落Ｎｇｒａｍデータ５０４について処理が実行していないと判定された場合、グラフ構築部１１０は、ステップＳ４０５に戻り、同様の処理を実行する。 If it is determined that the process has not been executed for at least one paragraph Ngram data 504, the graph construction unit 110 returns to step S405 and executes the same process.

全ての段落Ｎｇｒａｍデータ５０４について処理を実行したと判定された場合、グラフ構築部１１０は文書及び段落の行列更新処理を終了し、ステップＳ２０９に進む。グラフ構築部１１０は、全ての文書データ５００について文書及び段落の行列更新処理を実行したか否かを判定する（ステップＳ２０９）。 If it is determined that the process has been executed for all paragraph Ngram data 504, the graph construction unit 110 ends the document and paragraph matrix update process, and proceeds to step S209. The graph construction unit 110 determines whether the document and paragraph matrix update processing has been executed for all of the document data 500 (step S209).

文書及び段落の行列更新処理を実行していない文書データ５００が少なくとも一つ存在すると判定された場合、グラフ構築部１１０は、ステップＳ２０７に戻り、同様の処理を実行する。 If it is determined that there is at least one piece of document data 500 that has not undergone the document and paragraph matrix update process, the graph construction unit 110 returns to step S207 and executes the same process.

When it is determined that the document and paragraph matrix update processing has been executed for all document data 500, the graph construction unit 110 generates the adjacency matrix 520 (A) using the Ngram-Ngram matrix 521 (A^{w-w}), the document-document matrix 522 (A^{d-d}), the document-paragraph matrix 523 (A^{d-p}), the document-Ngram matrix 524 (A^{d-w}), the paragraph-paragraph matrix 525 (A^{p-p}), and the paragraph-Ngram matrix 526 (A^{p-w}) (step S210).

具体的には、グラフ構築部１１０は、式（６）に基づいて隣接行列５２０（Ａ）を生成する。 Specifically, the graph construction unit 110 generates the adjacency matrix 520(A) based on equation (6).
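Equation (6) is not reproduced in this text. One plausible reading, sketched below, arranges the six matrices as blocks of a symmetric matrix whose vertices are ordered documents, then paragraphs, then Ngrams. The document-then-paragraph order follows from the paragraph index (N_A + N_Q + c); placing Ngrams last and filling the lower-left blocks with transposes are assumptions.

```python
def transpose(matrix):
    """Transpose a dense list-of-lists matrix."""
    return [list(column) for column in zip(*matrix)]


def assemble_adjacency(a_dd, a_dp, a_dw, a_pp, a_pw, a_ww):
    """Stack the six blocks into one square matrix (assumed layout)."""
    top = [d + p + w for d, p, w in zip(a_dd, a_dp, a_dw)]
    middle = [d + p + w for d, p, w in zip(transpose(a_dp), a_pp, a_pw)]
    bottom = [d + p + w for d, p, w in zip(transpose(a_dw), transpose(a_pw), a_ww)]
    return top + middle + bottom


# Toy graph: 1 document, 2 paragraphs, 2 Ngrams.
adjacency = assemble_adjacency(
    a_dd=[[1]],
    a_dp=[[1, 1]],
    a_dw=[[0.5, 0]],
    a_pp=[[1, 0], [0, 1]],
    a_pw=[[0.5, 0], [0, 0.2]],
    a_ww=[[1, 0.3], [0.3, 1]],
)
```

Under this layout the matrix has N = N_A + N_Q + N_P + N_W rows, which matches the length of the correct label vector described next.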

Next, the graph construction unit 110 generates a correct label vector 540 with N elements (step S211). Here, N is the sum of N_A, N_Q, N_P, and N_W. The graph construction unit 110 stores the Ngram dictionary 510, the adjacency matrix 520, the paragraph dictionary 530, and the correct label vector 540 in the graph information 111. The graph construction unit 110 then ends the process.

The d-th element of the correct label vector 540 is the value of the label 502 of the document data 500 whose index 501 is d, and all other elements are 0. Here, d is 0 or more and less than N_A.

本実施例では、学習用の文書データ５００及び分類対象の文書データ５００を同時に入力して、グラフ６００を生成しているが、これに限定されない。グラフ構築部１１０は、学習用の文書データ５００からグラフ６００を生成し、分類対象の文書データ５００が入力された場合、グラフ６００を更新するようにしてもよい。 In this embodiment, the graph 600 is generated by simultaneously inputting the document data 500 for learning and the document data 500 to be classified, but the present invention is not limited thereto. The graph construction unit 110 may generate a graph 600 from the learning document data 500, and update the graph 600 when the document data 500 to be classified is input.

以上が、グラフ構築部１１０が実行する処理の説明である。 The above is an explanation of the processing executed by the graph construction unit 110.

分類モデル学習部１１２は、分類モデルの学習処理を実行する。分類モデル学習部１１２では、以下のような処理が実行される。 The classification model learning unit 112 executes a classification model learning process. The classification model learning unit 112 executes the following processing.

(Process C1) As shown in equation (7), the classification model learning unit 112 applies a graph convolutional network (GCN) to the adjacency matrix A stored in the graph information 111 and the N-dimensional identity matrix I to calculate a vector z of length N. Equation (7) is evaluated using equations (8) to (10). Here, ⁱW is a learnable parameter, σ is the rectified linear function applied to each element of a tensor, and ζ denotes dropout. Instead of the two-layer GCN, z may be calculated using a GCN with any number of layers or an algorithm such as Graph Attention Network (GAT) or GraphSAGE.

(Process C2) For each document data 500, the classification model learning unit 112 calculates the probability value (class probability value) that the label 502 of that document data 500 is "1". Specifically, the classification model learning unit 112 substitutes z calculated from equation (7) into equation (11) to calculate the probability value ŷ_i that the label 502 of the document data 500 whose index 501 is i is 1. Here, i is N_A or less.

(Process C3) The classification model learning unit 112 calculates the parameters to be learned so that the probability value ŷ_i matches the label 502. Specifically, the classification model learning unit 112 calculates the cross entropy L defined by equation (12) and uses the steepest descent method to calculate the parameters ⁰W and ¹W that minimize L. Here, y_i denotes the i-th element of the correct label vector 540.

(Process C4) The classification model learning unit 112 stores the calculated parameters in the classification model information 113. In this embodiment, the parameters are calculated using the steepest descent method, but the method is not limited to this. The parameters may instead be estimated using a quasi-Newton method, evolutionary computation, a Markov chain Monte Carlo method, or the like.
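Equations (7) to (12) are not reproduced in this text. The sketch below assumes the widely used two-layer GCN formulation: symmetric degree normalization of the adjacency matrix, a rectified linear activation after the first layer, a sigmoid for equation (11), and a summed binary cross entropy over the labeled documents for equation (12). Dropout and the gradient step are omitted, and the adjacency matrix is assumed to be symmetric, non-negative, and to contain self-loops. All function names are illustrative.

```python
import math


def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def normalize(adj):
    """Assumed normalization: D^{-1/2} A D^{-1/2}."""
    degree = [sum(row) for row in adj]
    size = len(adj)
    return [[adj[i][j] / math.sqrt(degree[i] * degree[j]) if adj[i][j] else 0.0
             for j in range(size)] for i in range(size)]


def gcn_forward(adj, w0, w1):
    """Two-layer GCN with identity input features, so layer 1 is A_hat @ w0."""
    a_hat = normalize(adj)
    hidden = [[max(0.0, h) for h in row] for row in matmul(a_hat, w0)]
    z = matmul(a_hat, matmul(hidden, w1))
    return [row[0] for row in z]  # w1 has one output column, so z has length N


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def cross_entropy(z, labels, n_labeled):
    """Summed binary cross entropy over the first n_labeled vertices."""
    loss = 0.0
    for i in range(n_labeled):
        p = sigmoid(z[i])
        loss -= labels[i] * math.log(p) + (1 - labels[i]) * math.log(1 - p)
    return loss


z = gcn_forward([[1, 1], [1, 1]], w0=[[1.0, 0.0], [0.0, 1.0]], w1=[[1.0], [1.0]])
loss = cross_entropy(z, labels=[1, 0], n_labeled=1)
```

Because the input feature matrix is the identity, row k of the first-layer weights acts as the learned embedding of vertex k, which is how the document reconstruction described later reads features straight out of ⁰W.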

本実施例では、分類モデル学習部１１２は、学習用の文書データ５００及び分類対象の文書データ５００から生成されたグラフ６００を用いて分類モデルを学習しているがこれに限定されない。分類モデル学習部１１２は、学習用の文書データ５００のみから生成されるグラフ６００を用いて分類モデルを学習してもよい。 In this embodiment, the classification model learning unit 112 learns the classification model using the graph 600 generated from the document data 500 for learning and the document data 500 to be classified, but the invention is not limited thereto. The classification model learning unit 112 may learn a classification model using the graph 600 generated only from the document data 500 for learning.

以上が分類モデル学習部１１２の処理の説明である。 The above is an explanation of the processing of the classification model learning unit 112.

分類部１１４は、グラフ６００の各頂点のクラス確率値を算出する。分類部１１４では、以下のような処理が実行される。 The classification unit 114 calculates the class probability value of each vertex of the graph 600. The classification unit 114 executes the following processing.

(Process D1) The classification unit 114 acquires the parameters from the classification model information 113. The timing of parameter acquisition is not limited. For example, the classification unit 114 may acquire the parameters when they are stored in the classification model information 113.

(Process D2) As shown in equation (7), the classification unit 114 applies the GCN to the adjacency matrix 520 stored in the graph information 111 and the N-dimensional identity matrix I to calculate a vector z of length N.

(Process D3) The classification unit 114 calculates a class probability value for each vertex of the graph 600. In this embodiment, the classification unit 114 substitutes the vector z into equation (11) to calculate the probability value ŷ_i that the label of the i-th vertex is 1.

以上が、分類部１１４が実行する処理の説明である。 The above is an explanation of the processing executed by the classification unit 114.

The document reconstruction unit 115 determines the classification of each document based on the adjacency matrix 520 and the paragraph dictionary 530 stored in the graph information 111 and on the class probability value of each vertex of the graph 600 calculated by the classification unit 114, reconstructs the document, and identifies the basis locations.

図７と図８を用いて、文書再構築部１１５が実行する処理について説明する。図７は、実施例１の文書再構築部１１５が実行する処理の一例を説明するフローチャートである。図８は、実施例１の文書再構築部１１５が実行する処理における出力データの構造を説明する図である。 The processing executed by the document reconstruction unit 115 will be explained using FIGS. 7 and 8. FIG. 7 is a flowchart illustrating an example of processing executed by the document reconstruction unit 115 of the first embodiment. FIG. 8 is a diagram illustrating the structure of output data in the process executed by the document reconstruction unit 115 of the first embodiment.

文書再構築部１１５は、グラフ情報１１１に格納される隣接行列５２０及び段落辞書５３０を取得する（ステップＳ７０１）。 The document reconstruction unit 115 obtains the adjacency matrix 520 and paragraph dictionary 530 stored in the graph information 111 (step S701).

次に、文書再構築部１１５は、文書データ５００のループ処理を開始する（ステップＳ７０２）。 Next, the document reconstruction unit 115 starts loop processing of the document data 500 (step S702).

Specifically, the document reconstruction unit 115 selects one target document data 500 from among the plurality of document data 500. Here, the index 501 of the target document data 500 is denoted d, and the paragraph index 532 of the entry in the paragraph dictionary 530 whose document index 531 is d is denoted P_d.

次に、文書再構築部１１５は、隣接行列５２０及び段落辞書５３０に基づいて再構築文書データ８００を生成する（ステップＳ７０３）。この時点では、インデックス８０１等は空である。このとき、文書再構築部１１５は、再構築文書データ８００のインデックス８０１にｄを格納する。 Next, the document reconstruction unit 115 generates reconstructed document data 800 based on the adjacency matrix 520 and the paragraph dictionary 530 (step S703). At this point, the index 801 etc. are empty. At this time, the document reconstruction unit 115 stores d in the index 801 of the reconstructed document data 800.

次に、文書再構築部１１５は、再構築文書データ８００のスコア８０２を設定する（ステップＳ７０４）。 Next, the document reconstruction unit 115 sets the score 802 of the reconstructed document data 800 (step S704).

Specifically, the document reconstruction unit 115 obtains the class probability value ŷ_d of the vertex corresponding to d and stores it in the score 802. This corresponds to the class probability value of the target document data 500.

次に、文書再構築部１１５は、再構築文書データ８００のラベル８０３を設定する（ステップＳ７０５）。 Next, the document reconstruction unit 115 sets the label 803 of the reconstructed document data 800 (step S705).

Specifically, if a value (label) is set in the label 502 of the target document data 500, the document reconstruction unit 115 copies that value to the label 803. If no value (label) is set in the label 502 of the target document data 500, the document reconstruction unit 115 determines the label by comparing the class probability value ŷ_d of the d-th vertex with a threshold and sets the determined label in the label 803. For example, the document reconstruction unit 115 sets the label "1" when the class probability value is greater than or equal to the threshold and the label "0" when it is smaller than the threshold. The threshold is, for example, 0.5. The label corresponds to the predicted classification of the target document data 500. The threshold of 0.5 is only an example; another fixed value or a variable value may be used.
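The labeling rule of step S705 can be sketched as follows. The function name and the use of `None` to mean "no label set" are assumptions for the example.

```python
def decide_label(existing_label, class_probability, threshold=0.5):
    """Keep a label that is already set; otherwise threshold the probability."""
    if existing_label is not None:
        return existing_label
    return 1 if class_probability >= threshold else 0


predicted = decide_label(None, 0.5)
kept = decide_label(0, 0.9)
```

A probability exactly equal to the threshold is labeled "1", and a document that already carries a label keeps it regardless of the model output.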

The document reconstruction unit 115 sets the P_d-th row vector of ⁰W in the document feature 804 as the first-layer GCN feature of the P_d-th vertex (step S706). This corresponds to the feature of the target document data 500.

次に、文書再構築部１１５はｐを０に初期化する（ステップＳ７０７）。 Next, the document reconstruction unit 115 initializes p to 0 (step S707).

Next, the document reconstruction unit 115 adds to the reconstructed document data 800 paragraph data 805 for the p-th paragraph of the target document data 500, composed of a paragraph index 806, a paragraph score 807, a paragraph feature 808, and a plurality of word data 809 (step S708). Specifically, the following processing is executed.

(Process E1) The document reconstruction unit 115 sets p in the paragraph index 806.

(Process E2) The document reconstruction unit 115 sets the class probability value ŷ_{(P_d+p)} of the vertex corresponding to (P_d + p) in the paragraph score 807. This corresponds to the class probability value of the p-th paragraph of the target document data 500.

Labels are assigned to documents; paragraphs are not originally targets of labeling. In the first embodiment, however, documents and paragraphs are treated as equivalent vertices of the graph 600, so class probability values are also calculated for paragraphs. As steps S403 and S407 show, a paragraph consists of a subset of the Ngram set that makes up its document. A paragraph whose class probability value is close to 1 therefore holds those Ngrams, among the document's Ngram set, that contribute most to the assignment of the classification. A part of a document that contributes strongly to the assignment of a classification can be regarded as a basis location for that assignment.

(Process E3) The document reconstruction unit 115 sets the (P_d + p)-th row vector of ⁰W in the paragraph feature 808 as the first-layer GCN feature of the vertex corresponding to (P_d + p). This corresponds to the feature of the p-th paragraph of the target document data 500.

（処理Ｅ４）文書再構築部１１５は、段落データ８０５を再構築文書データ８００に挿入する。具体的には以下のような処理が実行される。 (Process E4) The document reconstruction unit 115 inserts the paragraph data 805 into the reconstructed document data 800. Specifically, the following processing is executed.

（処理Ｅ４－１）文書再構築部１１５は、ループ処理を開始する。まず、文書再構築部１１５は単語番号ｕを０に初期化する。 (Process E4-1) The document reconstruction unit 115 starts loop processing. First, the document reconstruction unit 115 initializes the word number u to 0.

(Process E4-2) The document reconstruction unit 115 initializes the set S to the empty set. The document reconstruction unit 115 greedily matches the u-th through (u+Δu)-th words of the p-th paragraph sentence data 503 of the target document data 500 against the Ngrams 511 of the Ngram dictionary 510, incrementing Δu from 0 up to the maximum Ngram size. For each Δu, the document reconstruction unit 115 adds the index 512 corresponding to the matched Ngram 511 to the set S.

(Process E4-3) The document reconstruction unit 115 calculates the word score based on equation (13). Here, |S| denotes the number of elements in the set S. The word score is equivalent to the average of the class probability values of the Ngrams that contain the word. The formula for the word score is only an example; a weighted sum, a maximum, or the like may be used instead.

(Process E4-4) The document reconstruction unit 115 calculates the word feature as the vector given by equation (14). Here, ⁰W_k is the k-th row vector of the GCN parameter ⁰W. The word feature is equivalent to the element-wise average of the features of the Ngrams that contain the word. When |S| is 0, the document reconstruction unit 115 uses the special token NULL, which indicates that the word feature cannot be defined, as the word feature. The formula for the word feature is only an example; an element-wise weighted sum, an element-wise maximum, or the like may be used instead.

(Process E4-5) The document reconstruction unit 115 sets u in the word index 810, sets the u-th word of the p-th paragraph sentence data 503 of the target document data 500 in the word text 811, sets the results of equations (13) and (14) in the word score 812 and the word feature 813, respectively, and inserts the resulting word data 809 into the paragraph data 805.

（処理Ｅ４－６）文書再構築部１１５は、ｕをインクリメントする。 (Process E4-6) The document reconstruction unit 115 increments u.

(Process E4-7) If u is smaller than the number of words in the p-th paragraph sentence data 503 of the target document data 500, the document reconstruction unit 115 returns to (Process E4-1). This concludes the description of step S708.
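Processes E4-2 to E4-4 can be sketched as follows, reading equations (13) and (14) as the plain averages that the text describes. The helper names are illustrative, and for simplicity the class probabilities and ⁰W rows are keyed directly by dictionary index; in the graph these would be looked up at the corresponding Ngram vertices. Returning `None` for the score when S is empty is also an assumption, since the text defines NULL only for the word feature.

```python
def match_ngrams(words, u, dictionary, max_n=3):
    """Collect indexes of dictionary Ngrams that start at word u (the set S)."""
    matched = set()
    for delta in range(max_n):
        if u + delta >= len(words):
            break
        index = dictionary.get(tuple(words[u:u + delta + 1]))
        if index is not None:
            matched.add(index)
    return matched


def word_score_and_feature(matched, ngram_prob, w0_rows):
    """Average class probability and element-wise average feature over S."""
    if not matched:
        return None, None  # stands in for the NULL token
    ordered = sorted(matched)
    score = sum(ngram_prob[k] for k in ordered) / len(ordered)
    dim = len(w0_rows[ordered[0]])
    feature = [sum(w0_rows[k][t] for k in ordered) / len(ordered) for t in range(dim)]
    return score, feature


toy_dictionary = {("a",): 0, ("a", "b"): 1, ("b",): 2}
toy_words = ["a", "b", "c"]
toy_prob = {0: 0.2, 1: 0.8, 2: 0.5}
toy_w0 = {0: [1.0, 0.0], 1: [3.0, 2.0], 2: [0.0, 0.0]}

s_first = match_ngrams(toy_words, 0, toy_dictionary)
score_first, feature_first = word_score_and_feature(s_first, toy_prob, toy_w0)
s_last = match_ngrams(toy_words, 2, toy_dictionary)
```

The word "a" matches both the unigram and the bigram starting at its position, so its score is the mean of their two probabilities, while "c" matches nothing and receives the undefined marker.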

次に、文書再構築部１１５は、ｐをインクリメントする（ステップＳ７０９）。 Next, the document reconstruction unit 115 increments p (step S709).

次に、文書再構築部１１５は、全ての段落についてステップＳ７０８の処理を実行したか否かを判定する（ステップＳ７１０）。 Next, the document reconstruction unit 115 determines whether the process of step S708 has been executed for all paragraphs (step S710).

Specifically, where P_d is the value of the paragraph index 532 of the row in the paragraph dictionary 530 whose document index 531 is d, the document reconstruction unit 115 determines whether (P_d + p) is greater than or equal to P_{d+1}. If (P_d + p) is determined to be smaller than P_{d+1}, the document reconstruction unit 115 determines that at least one paragraph remains for which step S708 has not been executed.

ステップＳ７０８の処理を実行していない段落が少なくとも一つ存在すると判定された場合、文書再構築部１１５は、ステップＳ７０８に戻り、同様の処理を実行する。 If it is determined that there is at least one paragraph for which the process of step S708 has not been performed, the document reconstruction unit 115 returns to step S708 and executes the same process.

全ての段落についてステップＳ７０８の処理をしたと判定された場合、文書再構築部１１５は、全ての文書データ５００の再構築文書データ８００を生成したか否かを判定する（ステップＳ７１１）。 If it is determined that all the paragraphs have been processed in step S708, the document reconstruction unit 115 determines whether or not the reconstructed document data 800 of all the document data 500 has been generated (step S711).

再構築文書データ８００を作成していない文書データ５００が少なくとも一つ存在する場合、文書再構築部１１５は、ステップＳ７０２に戻り、同様の処理を実行する。 If there is at least one document data 500 for which the reconstructed document data 800 has not been created, the document reconstruction unit 115 returns to step S702 and executes the same process.

全ての文書データ５００の再構築文書データ８００を生成したと判定された場合、文書再構築部１１５は処理を終了する。 If it is determined that the reconstructed document data 800 of all the document data 500 has been generated, the document reconstruction unit 115 ends the process.

以上が、文書再構築部１１５が実行する処理の説明である。 The above is an explanation of the processing executed by the document reconstruction unit 115.

表示部１１６は、文書再構築部１１５によって生成された再構築文書データ８００に基づいて、分類結果及び根拠箇所をユーザに提示する。 The display unit 116 presents the classification result and the basis location to the user based on the reconstructed document data 800 generated by the document reconstruction unit 115.

図９及び図１０を用いて、表示部１１６が実行する処理について説明する。図９は、実施例１の表示部１１６によって提示されるユーザインタフェース９００の一例を説明する図である。図１０は、実施例１の表示部１１６が実行する処理の一例を説明するフローチャートである。 Processing executed by the display unit 116 will be described using FIGS. 9 and 10. FIG. 9 is a diagram illustrating an example of a user interface 900 presented by the display unit 116 of the first embodiment. FIG. 10 is a flowchart illustrating an example of processing executed by the display unit 116 of the first embodiment.

ユーザインタフェース９００は、ユーザによって選択された文書データ５００の入力を受け付ける入力欄９０１、選択された文書に付与された分類の一覧を表示する表示欄９０２、及び選択された文書を表示する表示欄９０３から構成される。表示欄９０３には、根拠箇所９０４がハイライトにて強調表示される。 The user interface 900 consists of an input field 901 that accepts input of the document data 500 selected by the user, a display field 902 that displays a list of the classifications assigned to the selected document, and a display field 903 that displays the selected document. In the display field 903, the evidence point 904 is highlighted.

図９に示すユーザインタフェース９００は一例であり、異なる構成要素、構成要素の位置関係、根拠箇所９０４の強調方式、表示の媒体、及びインタフェースの媒体を有していてもよい。また、文書全体を縮小表示する縮小表示欄９０５、文書全体の表示部分を示すウィンドウ９０６を含んでもよい。ウィンドウ９０６で指定された箇所が表示欄９０３に拡大表示される。 The user interface 900 shown in FIG. 9 is an example, and may have different components, positional relationships of the components, emphasis method for the evidence point 904, display medium, and interface medium. It may also include a reduced display field 905 that displays the entire document in reduced size, and a window 906 that shows the displayed portion of the entire document. The location specified in the window 906 is enlarged and displayed in the display field 903.

表示部１１６は、文書再構築部１１５によって生成された再構築文書データ８００を取得する（ステップＳ１００１）。 The display unit 116 acquires the reconstructed document data 800 generated by the document reconstruction unit 115 (step S1001).

次に、表示部１１６は、表示対象の文書を指定するユーザ入力を受けつける（ステップＳ１００２）。 Next, the display unit 116 receives a user input specifying a document to be displayed (step S1002).

具体的には、表示部１１６は入力欄９０１への入力を受け付ける。入力欄９０１には再構築文書データ８００のインデックス８０１が入力される。入力欄９０１に設定された値は変数ｄに格納される。ここで、選択された再構築文書データ８００のスコア８０２を^Ｄｑ_ｄとする。 Specifically, the display unit 116 accepts input to the input field 901. An index 801 of the reconstructed document data 800 is entered in the input field 901. The value set in the input field 901 is stored in the variable d. Here, the score 802 of the selected reconstructed document data 800 is denoted ^Dq_d.

なお、入力欄９０１には、公開番号、出願人、又は検索キーワード等を入力してもよい。この場合、表示部１１６は、入力された値に基づいて再構築文書データ８００を検索し、検索結果をユーザに提示する。ユーザは、検索結果に基づいて文書を選択する。 Note that the publication number, applicant, search keyword, etc. may be entered in the input field 901. In this case, the display unit 116 searches the reconstructed document data 800 based on the input value and presents the search results to the user. The user selects documents based on the search results.

次に、表示部１１６は、選択された再構築文書データ８００の段落データ８０５のループ処理を開始する（ステップＳ１００３）。 Next, the display unit 116 starts loop processing of the paragraph data 805 of the selected reconstructed document data 800 (step S1003).

具体的には、表示部１１６は、選択された再構築文書データ８００の段落データ８０５の中から一つのターゲット段落データ８０５を選択する。ここで、ターゲット段落データ８０５の段落インデックス８０６をｐ、ターゲット段落データ８０５の段落スコア８０７を^Ｐｑ_ｄ，ｐとする。 Specifically, the display unit 116 selects one target paragraph data 805 from among the paragraph data 805 of the selected reconstructed document data 800. Here, the paragraph index 806 of the target paragraph data 805 is denoted p, and the paragraph score 807 of the target paragraph data 805 is denoted ^Pq_d,p.

次に、表示部１１６は、ターゲット段落データの背景色を算出する（ステップＳ１００４）。 Next, the display unit 116 calculates the background color of the target paragraph data (step S1004).

具体的には、表示部１１６は、式（１５）及び式（１６）から算出した値ｌに基づいて、色相０度、輝度ｌ×１００％、彩度１００％を背景色に設定する。 Specifically, the display unit 116 sets the background color to a hue of 0 degrees, a brightness of l×100%, and a saturation of 100%, based on the value l calculated from equations (15) and (16).

次に、表示部１１６は、ターゲット段落データ８０５の文章を表示する（ステップＳ１００５）。 Next, the display unit 116 displays the text of the target paragraph data 805 (step S1005).

具体的には、表示部１１６は、ターゲット段落データ８０５に含まれる各単語データ８０９の単語テキスト８１１を単語インデックス８１０に対して昇順となるように前述した背景色で描画する。 Specifically, the display unit 116 draws the word text 811 of each word data 809 included in the target paragraph data 805 in ascending order with respect to the word index 810 in the background color described above.

次に、表示部１１６は、選択された再構築文書データ８００の全ての段落データ８０５について処理を実行したか否かを判定する（ステップＳ１００６）。 Next, the display unit 116 determines whether processing has been performed on all paragraph data 805 of the selected reconstructed document data 800 (step S1006).

処理を実行していない段落データ８０５が少なくとも一つ存在すると判定された場合、表示部１１６は、ステップＳ１００３に戻り、同様の処理を実行する。 If it is determined that there is at least one piece of paragraph data 805 that has not been processed, the display unit 116 returns to step S1003 and executes the same process.

全ての段落データ８０５について処理を実行したと判定された場合、表示部１１６は処理を終了する。 If it is determined that the process has been executed for all paragraph data 805, the display unit 116 ends the process.
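The display loop of steps S1003 to S1006 can be sketched as follows. This is only an illustration under stated assumptions: equations (15) and (16) are not reproduced here, so the value l is taken as an input in the range 0 to 1, and the "brightness" in the text is interpreted as the lightness channel of the HSL color model. The function names and the tuple layout of the paragraph data are hypothetical.

```python
import colorsys


def background_rgb(lightness):
    """Convert hue 0 degrees, saturation 100%, lightness l into an 8-bit RGB triple."""
    l = min(max(lightness, 0.0), 1.0)  # clamp to the valid range (an assumption)
    r, g, b = colorsys.hls_to_rgb(0.0, l, 1.0)  # note the argument order: hue, lightness, saturation
    return tuple(round(c * 255) for c in (r, g, b))


def render_paragraphs(paragraphs):
    """paragraphs: list of (lightness, [(word_index, word_text), ...]) pairs.

    Returns one (background color, text) pair per paragraph, with the words
    joined in ascending order of word index as in step S1005.
    """
    rows = []
    for lightness, words in paragraphs:
        text = "".join(word_text for _, word_text in sorted(words))
        rows.append((background_rgb(lightness), text))
    return rows
```

With this interpretation, l = 1 gives a white background and l = 0.5 gives pure red, so how strongly a paragraph is highlighted depends entirely on how the score is mapped to l.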

実施例１によれば、計算機システムは、文書に対して分類結果を付与するとともに、分類付与の根拠となった段落等の根拠箇所をユーザに提示できる。 According to the first embodiment, the computer system can assign a classification result to a document and present to the user the basis of the classification, such as a paragraph.

グラフ６００では、文書及び構成要素（段落及び単語）はグラフ６００を構成する一つの頂点として扱われるため、従来技術のように、根拠箇所提示の適切さと分類精度との間のトレードオフの関係が生じることなく、分類結果及び根拠を提示できる。 In the graph 600, documents and their constituent elements (paragraphs and words) are each treated as a vertex of the graph 600. The classification result and its basis can therefore be presented without the trade-off between the appropriateness of the presented evidence points and the classification accuracy that arises in the prior art.

なお、段落を陽に頂点として設けない場合でも、後処理として各段落に含まれるＮｇｒａｍのクラス確率値から各段落に対するクラス確率値を求めることも可能である。 Note that even if a paragraph is not explicitly provided as a vertex, it is possible to obtain the class probability value for each paragraph from the class probability value of Ngrams included in each paragraph as a post-processing.
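The post-processing mentioned above could look like the following sketch. The specification does not state how the Ngram class probabilities are combined, so the arithmetic mean used here is purely an assumption, as are the function name and the data layout (a dictionary from Ngram to a list of per-class probabilities).

```python
def paragraph_class_probs(ngram_probs, paragraph_ngrams):
    """Average the class probabilities of the Ngrams contained in each paragraph.

    ngram_probs: dict mapping an Ngram to its list of class probabilities.
    paragraph_ngrams: list of Ngram lists, one per paragraph.
    Ngrams without a known probability are ignored (an assumption).
    """
    num_classes = len(next(iter(ngram_probs.values())))
    result = []
    for ngrams in paragraph_ngrams:
        known = [ngram_probs[g] for g in ngrams if g in ngram_probs]
        if not known:
            result.append([0.0] * num_classes)
            continue
        # Column-wise mean over the Ngrams of this paragraph.
        result.append([sum(col) / len(known) for col in zip(*known)])
    return result
```

Other aggregations, such as a maximum or a weighted mean, would fit the same interface.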

実施例２では、図、表、及び式等の文字列とは異なる構成要素（非テキスト要素）を含む文書の分類を行う計算機システムについて説明する。以下、実施例１との差異を中心に実施例２について説明する。 In the second embodiment, a computer system will be described that classifies documents that include constituent elements (non-text elements) such as figures, tables, and formulas that are different from character strings. The second embodiment will be described below, focusing on the differences from the first embodiment.

実施例２の計算機システムの構成は実施例１と同一である。実施例２の計算機１００－２のハードウェア構成及びソフトウェア構成は実施例１と同一である。また、実施例２の計算機１００－１のハードウェア構成及びソフトウェア構成は実施例１と同一である。 The configuration of the computer system of the second embodiment is the same as that of the first embodiment. The hardware configuration and software configuration of the computer 100-2 in the second embodiment are the same as those in the first embodiment. Further, the hardware configuration and software configuration of the computer 100-1 of the second embodiment are the same as those of the first embodiment.

実施例２では、グラフ構築部１１０は、非テキスト要素を含む段落を頂点とするグラフ６００を生成する点が実施例１と異なる。具体的には、図２のステップＳ２０３及びステップＳ２０８の処理の内容が異なる。 The second embodiment differs from the first embodiment in that the graph construction unit 110 generates a graph 600 whose vertices are paragraphs containing non-text elements. Specifically, the contents of the processes in step S203 and step S208 in FIG. 2 are different.

図１１から図１３を用いて、実施例２のグラフ構築部１１０が実行する処理について説明する。図１１及び図１２は、実施例２のグラフ構築部１１０が実行する処理を説明するフローチャートである。図１３Ａ及び図１３Ｂは、実施例２のグラフ構築部１１０が実行する処理におけるデータの入出力を説明する図である。 Processing executed by the graph construction unit 110 of the second embodiment will be described using FIGS. 11 to 13. FIGS. 11 and 12 are flowcharts illustrating processing executed by the graph construction unit 110 of the second embodiment. FIGS. 13A and 13B are diagrams illustrating data input/output in processing executed by the graph construction unit 110 of the second embodiment.

まず、図１１を用いて実施例２のステップＳ２０３において実行される処理について説明する。 First, the process executed in step S203 of the second embodiment will be described using FIG. 11.

グラフ構築部１１０は、ターゲット文書データ５００の段落データ１３０１のループ処理を開始する（ステップＳ１１０１）。 The graph construction unit 110 starts loop processing of the paragraph data 1301 of the target document data 500 (step S1101).

具体的には、グラフ構築部１１０は、ターゲット文書データ５００の複数の段落データ１３０１の中から一つのターゲット段落データ１３０１を選択する。 Specifically, the graph construction unit 110 selects one target paragraph data 1301 from among the plurality of paragraph data 1301 of the target document data 500.

ターゲット段落データ１３０１が図面である場合、グラフ構築部１１０は、図面に光学文字認識（ＯＣＲ）を適用して、テキストを抽出する（ステップＳ１１０２）。ターゲット段落データ１３０１が図面ではない場合、ステップＳ１１０２の処理は省略される。 If the target paragraph data 1301 is a drawing, the graph construction unit 110 applies optical character recognition (OCR) to the drawing to extract text (step S1102). If the target paragraph data 1301 is not a drawing, the process of step S1102 is omitted.

次に、グラフ構築部１１０は、ターゲット段落データ１３０１のテキストを形態素の単位に分解する（ステップＳ１１０３）。ステップＳ１１０３の処理はステップＳ３０２の処理と同一である。 Next, the graph construction unit 110 decomposes the text of the target paragraph data 1301 into units of morphemes (step S1103). The processing in step S1103 is the same as the processing in step S302.

次に、グラフ構築部１１０は、ターゲット段落データ１３０１の単語から単語Ｎｇｒａｍを抽出し、段落Ｎｇｒａｍデータ５０４としてターゲット文書データ５００に格納する（ステップＳ１１０４）。ステップＳ１１０４の処理は、ステップＳ３０３の処理と同一である。 Next, the graph construction unit 110 extracts the word Ngram from the words of the target paragraph data 1301, and stores it in the target document data 500 as the paragraph Ngram data 504 (step S1104). The processing in step S1104 is the same as the processing in step S303.

次に、グラフ構築部１１０は、ターゲット文書データ５００に含まれる全ての段落データ１３０１について処理を実行したか否かを判定する（ステップＳ１１０５）。 Next, the graph construction unit 110 determines whether processing has been performed on all paragraph data 1301 included in the target document data 500 (step S1105).

処理を実行していない段落データ１３０１が少なくとも一つ存在する場合、グラフ構築部１１０は、ステップＳ１１０１に戻り、同様の処理を実行する。全ての段落データ１３０１について処理を実行したと判定された場合、グラフ構築部１１０は、文書データの前処理を終了し、ステップＳ２０４に進む。 If there is at least one piece of paragraph data 1301 that has not been processed, the graph construction unit 110 returns to step S1101 and executes the same process. If it is determined that all the paragraph data 1301 have been processed, the graph construction unit 110 ends the preprocessing of the document data and proceeds to step S204.
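The preprocessing loop of steps S1101 to S1105 can be sketched as below. The OCR engine and the morphological analyzer are not named in the text, so they are passed in as hypothetical callables (`ocr` and `tokenize`). The dictionary keys and the maximum Ngram length are also assumptions made for illustration.

```python
def extract_ngrams(tokens, max_n=2):
    """Return all word Ngrams of length 1..max_n as tuples, shortest first."""
    return [tuple(tokens[i:i + n])
            for n in range(1, max_n + 1)
            for i in range(len(tokens) - n + 1)]


def preprocess_paragraphs(paragraphs, ocr, tokenize, max_n=2):
    """Produce one Ngram list per paragraph.

    paragraphs: list of dicts with keys "is_drawing", "text" and "image" (assumed layout).
    ocr: callable that turns an image into text; applied only to drawings (step S1102).
    tokenize: callable that splits text into morphemes (step S1103).
    """
    result = []
    for para in paragraphs:
        text = ocr(para["image"]) if para["is_drawing"] else para["text"]
        result.append(extract_ngrams(tokenize(text), max_n))  # step S1104
    return result


# Stand-ins for a real OCR engine and morphological analyzer.
example = preprocess_paragraphs(
    [{"is_drawing": False, "text": "a b c", "image": None},
     {"is_drawing": True, "text": "", "image": "img-1"}],
    ocr=lambda image: "x y",
    tokenize=str.split,
)
```

Because drawings are converted to text before tokenization, the later graph construction steps can treat text paragraphs and drawings uniformly.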

次に、図１２を用いて実施例２のステップＳ２０８において実行される処理について説明する。 Next, the process executed in step S208 of the second embodiment will be described using FIG. 12.

グラフ構築部１１０は、文書－文書行列５２２（Ａ^ｄ－ｄ）を更新する（ステップＳ１２０１）。ステップＳ１２０１の処理はステップＳ４０１の処理と同一である。 The graph construction unit 110 updates the document-document matrix 522 (A^{d-d}) (step S1201). The processing in step S1201 is the same as the processing in step S401.

次に、グラフ構築部１１０は、文書－段落行列５２３（Ａ^ｄ－ｐ）を更新する（ステップＳ１２０２）。ステップＳ１２０２の処理はステップＳ４０２の処理と同一である。 Next, the graph construction unit 110 updates the document-paragraph matrix 523 (A^{d-p}) (step S1202). The processing in step S1202 is the same as the processing in step S402.

次に、グラフ構築部１１０は、文書－Ｎｇｒａｍ行列５２４（Ａ^ｄ－ｗ）を更新する（ステップＳ１２０３）。ステップＳ１２０３の処理はステップＳ４０３の処理と同一である。 Next, the graph construction unit 110 updates the document-Ngram matrix 524 (A^{d-w}) (step S1203). The processing in step S1203 is the same as the processing in step S403.

次に、グラフ構築部１１０は、段落辞書５３０を更新する（ステップＳ１２０４）。ステップＳ１２０４の処理はステップＳ４０４の処理と同一である。 Next, the graph construction unit 110 updates the paragraph dictionary 530 (step S1204). The processing in step S1204 is the same as the processing in step S404.

次に、グラフ構築部１１０は、ターゲット文書データ５００の段落Ｎｇｒａｍデータ５０４のループ処理を開始する（ステップＳ１２０５）。ステップＳ１２０５の処理はステップＳ４０５の処理と同一である。 Next, the graph construction unit 110 starts loop processing of the paragraph Ngram data 504 of the target document data 500 (step S1205). The processing in step S1205 is the same as the processing in step S405.

次に、グラフ構築部１１０は、段落－段落行列５２５（Ａ^ｐ－ｐ）を更新する（ステップＳ１２０６）。具体的には、以下のような処理が実行される。 Next, the graph construction unit 110 updates the paragraph-paragraph matrix 525 (A^{p-p}) (step S1206). Specifically, the following processing is executed.

（処理Ｆ１）グラフ構築部１１０は、各文書の段落の合計値と同じ要素数の零ベクトルを段落－段落行列５２５（Ａ^ｐ－ｐ）のｃ行目に追加する。 (Process F1) The graph construction unit 110 adds, as the c-th row of the paragraph-paragraph matrix 525 (A^{p-p}), a zero vector whose number of elements equals the total number of paragraphs across the documents.

（処理Ｆ２）グラフ構築部１１０は、段落－段落行列５２５（Ａ^ｐ－ｐ）のｃ行ｃ列の要素に１を格納する。この１は段落から段落への自己ループに相当する。 (Process F2) The graph construction unit 110 stores 1 in the element at row c, column c of the paragraph-paragraph matrix 525 (A^{p-p}). This 1 corresponds to a self-loop from the paragraph to itself.

（処理Ｆ３）ターゲット段落Ｎｇｒａｍデータ５０４が画像の段落データ１３０１に由来する場合、グラフ構築部１１０は、ターゲット段落Ｎｇｒａｍデータ５０４に対応する段落データ１３０１の図番号を照応している各段落について、段落カウンタｃに対応する該段落の番号がｃ’であるとき、段落－段落行列５２５（Ａ^ｐ－ｐ）のｃ行ｃ’列の要素とｃ’行ｃ列の要素に１を格納する。ターゲット段落Ｎｇｒａｍデータ５０４に対応する段落データ１３０１の図番号を照応している各段落は、例えば図番号が１である場合は正規表現「図１［＾０－９］」などによって取得することできる。 (Process F3) If the target paragraph Ngram data 504 is derived from paragraph data 1301 that is an image, then for each paragraph that refers to the figure number of the paragraph data 1301 corresponding to the target paragraph Ngram data 504, the graph construction unit 110 stores 1 in the element at row c, column c' and in the element at row c', column c of the paragraph-paragraph matrix 525 (A^{p-p}), where c' is the number of that paragraph in terms of the paragraph counter c. The paragraphs that refer to the figure number can be found with a regular expression; for example, when the figure number is 1, the expression "図１［＾０－９］" ("FIG. 1" followed by a non-digit character) can be used.

（処理Ｆ４）ターゲット段落Ｎｇｒａｍデータ５０４が画像の段落データ１３０１に由来し、かつ、当該段落データ１３０１が幾何学的な情報を有する図等の種別である場合、グラフ構築部１１０は、ターゲット段落Ｎｇｒａｍデータ５０４に対応する段落データ１３０１に類似している各段落データ１３０１について、当該画像に対応する段落の段落カウンタｃに対応する番号がｃ’であるとき、段落－段落行列５２５（Ａ^ｐ－ｐ）のｃ行ｃ’列の要素とｃ’行ｃ列の要素に類似度の値を格納する。類似度の値は画像のＨＯＧ特徴量のＢａｇ－ｏｆ－Ｖｉｓｕａｌ－Ｗｏｒｄｓなどによって算出できる。ある二つの画像が類似しているか否かは、類似度と閾値の比較結果に基づいて判定できる。以上がステップＳ１２０６の処理の説明である。 (Process F4) If the target paragraph Ngram data 504 is derived from paragraph data 1301 that is an image, and that paragraph data 1301 is of a type that carries geometric information, such as a diagram, then for each piece of paragraph data 1301 that is similar to the paragraph data 1301 corresponding to the target paragraph Ngram data 504, the graph construction unit 110 stores the similarity value in the element at row c, column c' and in the element at row c', column c of the paragraph-paragraph matrix 525 (A^{p-p}), where c' is the number, in terms of the paragraph counter c, of the paragraph corresponding to the similar image. The similarity value can be calculated using, for example, a Bag-of-Visual-Words representation of the HOG features of the images. Whether two images are similar can be determined by comparing the similarity with a threshold. The above is the explanation of the processing in step S1206.
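Processes F1 to F4 can be sketched for a single document as follows. This is a simplified illustration, not the procedure of the specification itself: the matrix is allocated in one step instead of row by row, the image similarity is a hypothetical callable (a HOG Bag-of-Visual-Words comparison is not implemented here), and the dictionary keys are assumptions. Full-width digits are normalized with NFKC before matching, which is also an assumption about the input text.

```python
import re
import unicodedata


def build_paragraph_matrix(paragraphs, similarity, threshold):
    """Build a symmetric paragraph-paragraph matrix following processes F1 to F4.

    paragraphs: list of dicts with keys "is_image", "text" and, for images,
                "figure_number" and "is_geometric" (assumed layout).
    similarity: callable(image_paragraph_a, image_paragraph_b) -> float.
    """
    n = len(paragraphs)
    A = [[0.0] * n for _ in range(n)]  # F1: zero rows with one element per paragraph
    for c, para in enumerate(paragraphs):
        A[c][c] = 1.0  # F2: self-loop
        if not para["is_image"]:
            continue
        # F3 pattern: the figure label followed by a non-digit, as in "図1[^0-9]".
        pattern = re.compile("図" + str(para["figure_number"]) + "[^0-9]")
        for c2, other in enumerate(paragraphs):
            if c2 == c:
                continue
            text = unicodedata.normalize("NFKC", other.get("text", ""))
            if pattern.search(text):
                A[c][c2] = A[c2][c] = 1.0  # F3: the paragraph refers to this figure
            elif para.get("is_geometric") and other["is_image"]:
                s = similarity(para, other)
                if s > threshold:
                    A[c][c2] = A[c2][c] = s  # F4: edge weighted by image similarity
    return A
```

Requiring a non-digit after the figure number keeps a reference to FIG. 10 from being linked to FIG. 1, which is the purpose of the `[^0-9]` part of the expression quoted in the text.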

次に、グラフ構築部１１０は、段落－Ｎｇｒａｍ行列５２６（Ａ^ｐ－ｗ）を更新する（ステップＳ１２０７）。ステップＳ１２０７の処理は、ステップＳ４０７の処理と同一である。 Next, the graph construction unit 110 updates the paragraph-Ngram matrix 526 (A^{p-w}) (step S1207). The processing in step S1207 is the same as the processing in step S407.

次に、グラフ構築部１１０は、全ての段落Ｎｇｒａｍデータ５０４について処理を実行したか否かを判定する（ステップＳ１２０８）。 Next, the graph construction unit 110 determines whether processing has been performed on all paragraph Ngram data 504 (step S1208).

処理を実行していない段落Ｎｇｒａｍデータ５０４が少なくとも一つ存在すると判定された場合、グラフ構築部１１０は、ステップＳ１２０５に戻り、同様の処理を実行する。 If it is determined that there is at least one paragraph Ngram data 504 that has not been processed, the graph construction unit 110 returns to step S1205 and executes the same process.

全ての段落Ｎｇｒａｍデータ５０４について処理を実行したと判定された場合、グラフ構築部１１０はステップＳ２０８の処理を終了し、ステップＳ２０９に進む。 If it is determined that the process has been executed for all the paragraph Ngram data 504, the graph construction unit 110 ends the process of step S208 and proceeds to step S209.

実施例２の分類モデル学習部１１２、分類部１１４、文書再構築部１１５、及び表示部１１６が実行する処理は実施例１と同一である。 The processing executed by the classification model learning section 112, the classification section 114, the document reconstruction section 115, and the display section 116 of the second embodiment is the same as that of the first embodiment.

実施例２によれば、計算機システムは、図、表、及び式等の非テキスト要素を含む文書に対して分類結果を付与するとともに、分類の付与の根拠となった段落等の根拠箇所をユーザに提示できる。 According to the second embodiment, the computer system can assign a classification result to a document that includes non-text elements such as figures, tables, and formulas, and can present to the user the evidence points, such as paragraphs, on which the classification was based.

実施例３では、計算機１００－２は、段落とは異なる粒度の根拠箇所をユーザに提示する。以下、実施例１との差異を中心に実施例３について説明する。 In the third embodiment, the computer 100-2 presents the user with evidence points that have a different granularity than paragraphs. The third embodiment will be described below, focusing on the differences from the first embodiment.

実施例３の計算機システムの構成は実施例１と同一である。実施例３の計算機１００－１のハードウェア構成及びソフトウェア構成は実施例１と同一である。実施例３の計算機１００－２のハードウェア構成及びソフトウェア構成は実施例１と同一である。また、実施例３の分類モデル学習部１１２、分類モデル情報１１３、分類部１１４、文書再構築部１１５が実行する処理は実施例１と同一である。 The configuration of the computer system of the third embodiment is the same as that of the first embodiment. The hardware configuration and software configuration of the computer 100-1 in the third embodiment are the same as those in the first embodiment. The hardware configuration and software configuration of the computer 100-2 in the third embodiment are the same as those in the first embodiment. Further, the processing executed by the classification model learning section 112, classification model information 113, classification section 114, and document reconstruction section 115 of the third embodiment is the same as that of the first embodiment.

実施例３のグラフ構築部１１０、分類モデル学習部１１２、分類モデル情報１１３、分類部１１４、文書再構築部１１５が実行する処理は実施例１と同一である。 The processes executed by the graph construction unit 110, classification model learning unit 112, classification model information 113, classification unit 114, and document reconstruction unit 115 of the third embodiment are the same as those of the first embodiment.

実施例３では、表示部１１６が、段落ではなく単語の粒度で根拠箇所をユーザに提示する点が異なる。具体的には、図１０のステップＳ１００４の処理が異なる。 The third embodiment differs in that the display unit 116 presents the evidence to the user in word granularity rather than in paragraphs. Specifically, the processing in step S1004 in FIG. 10 is different.

図１４を用いて、実施例３の表示部１１６が実行する処理について説明する。図１４は、実施例３の表示部１１６が実行する処理におけるユーザインタフェース１４００を説明する図である。 Processing executed by the display unit 116 of the third embodiment will be described using FIG. 14. FIG. 14 is a diagram illustrating a user interface 1400 in processing executed by the display unit 116 of the third embodiment.

実施例３におけるユーザインタフェース１４００は、入力欄９０１、表示欄９０２、及び表示欄９０３から構成される。実施例３では、表示欄９０３の根拠箇所１４０１の粒度が実施例１と異なる。 The user interface 1400 in the third embodiment is composed of an input field 901, a display field 902, and a display field 903. In the third embodiment, the granularity of the evidence point 1401 in the display field 903 is different from that in the first embodiment.

図１４に示すユーザインタフェース１４００は一例であり、異なる構成要素、構成要素の位置関係、根拠箇所１４０１の強調方式、表示の媒体、及びインタフェースの媒体を有していてもよい。また、文書全体を縮小表示する縮小表示欄９０５、文書全体の表示部分を示すウィンドウ９０６を含んでもよい。ウィンドウ９０６で指定された箇所が表示欄９０３に拡大表示される。 The user interface 1400 shown in FIG. 14 is an example, and may have different components, positional relationships of the components, emphasis method for the evidence point 1401, display medium, and interface medium. It may also include a reduced display field 905 that displays the entire document in reduced size, and a window 906 that shows the displayed portion of the entire document. The location specified in the window 906 is enlarged and displayed in the display field 903.

実施例３におけるステップＳ１００４では、以下のような処理が実行される。 In step S1004 in the third embodiment, the following processing is executed.

（処理Ｇ１）表示部１１６は、ターゲット段落データ８０５の単語データ８０９のループ処理を開始する。具体的には、表示部１１６は、ターゲット段落データ８０５の単語データ８０９の中から一つのターゲット単語データ８０９を選択する。ここでは、ターゲット単語データ８０９の単語インデックス８１０をｔ、ターゲット単語データ８０９の単語スコア８１２を^Ｗｑ_{ｄ，ｐ，ｔ}とする。 (Process G1) The display unit 116 starts loop processing of the word data 809 of the target paragraph data 805. Specifically, the display unit 116 selects one target word data 809 from among the word data 809 of the target paragraph data 805. Here, the word index 810 of the target word data 809 is denoted t, and the word score 812 of the target word data 809 is denoted ^Wq_{d,p,t}.

（処理Ｇ２）表示部１１６は、ターゲット単語データ８０９の背景色を算出する。具体的には、式（１７）及び式（１８）から算出した値ｌに基づいて、色相０度、輝度ｌ×１００％、彩度１００％を背景色に設定する。 (Process G2) The display unit 116 calculates the background color of the target word data 809. Specifically, based on the value l calculated from equations (17) and (18), the background color is set to a hue of 0 degrees, a brightness of l×100%, and a saturation of 100%.

（処理Ｇ３）表示部１１６は、ターゲット段落データ８０５の全ての単語データ８０９について背景色を算出したか否かを判定する。少なくとも一つの単語データ８０９について背景色が算出されていない場合、表示部１１６は、処理Ｇ１に戻り、同様の処理を実行する。 (Process G3) The display unit 116 determines whether the background color has been calculated for all word data 809 of the target paragraph data 805. If the background color has not been calculated for at least one word data 809, the display unit 116 returns to process G1 and executes the same process.
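Word-level highlighting as in processes G1 to G3 might be rendered as in the sketch below. The display medium is left open in the text, so emitting HTML with CSS `hsl()` colors is an assumption, as are the function name and the tuple layout. Equations (17) and (18) are not reproduced here, so each word carries a precomputed value l in the range 0 to 1.

```python
from html import escape


def word_spans(words):
    """Render words as HTML spans, one background color per word.

    words: list of (word_index, word_text, lightness) tuples (assumed layout).
    Words are emitted in ascending order of word index.
    """
    spans = []
    for _, text, lightness in sorted(words):
        pct = round(min(max(lightness, 0.0), 1.0) * 100)
        # CSS hsl() takes hue, saturation, lightness in that order.
        spans.append(
            f'<span style="background-color: hsl(0, 100%, {pct}%)">{escape(text)}</span>'
        )
    return "".join(spans)
```

Compared with the paragraph-level display, the only change is that the color is computed once per word rather than once per paragraph.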

実施例３によれば、計算機システムは、分類結果ととともに、単語等、段落とは異なる粒度の根拠箇所をユーザに提示できる。 According to the third embodiment, the computer system can present to the user, along with the classification results, evidence points such as words that have a different granularity than paragraphs.

実施例４では、計算機１００－２は、分類結果及び根拠箇所に加えて、文書又は文書の構成要素に関係性がある文書又は文書の構成要素をユーザに提示する。以下、実施例１との差異を中心に実施例４について説明する。 In the fourth embodiment, the computer 100-2 presents the user with documents or document components that are related to the document or document components, in addition to the classification result and the basis location. The fourth embodiment will be described below, focusing on the differences from the first embodiment.

実施例４の計算機システムの構成は実施例１と同一である。実施例４の計算機１００－１のハードウェア構成及びソフトウェア構成は実施例２と同一である。実施例４の計算機１００－２のハードウェア構成は実施例１と同一である。 The configuration of the computer system of the fourth embodiment is the same as that of the first embodiment. The hardware configuration and software configuration of the computer 100-1 in the fourth embodiment are the same as those in the second embodiment. The hardware configuration of the computer 100-2 in the fourth embodiment is the same as that in the first embodiment.

実施例４では、計算機１００－２のソフトウェア構成が一部異なる。図１５は、実施例４の計算機１００－２の構成例を示す図である。 In the fourth embodiment, the software configuration of the computer 100-2 is partially different. FIG. 15 is a diagram showing an example of the configuration of the computer 100-2 according to the fourth embodiment.

実施例４の計算機１００－２のメモリ１０２には、分類モデル学習部１１２、分類モデル情報、分類部１１４、文書再構築部１１５、及び表示部１１６を実現するプログラムに加えて関連要素表示部１５０１を実現するプログラムを格納する。 The memory 102 of the computer 100-2 of the fourth embodiment stores, in addition to the programs that implement the classification model learning unit 112, the classification model information, the classification unit 114, the document reconstruction unit 115, and the display unit 116, a program that implements the related element display unit 1501.

実施例４の分類モデル学習部１１２、分類モデル情報、分類部１１４、文書再構築部１１５、及び表示部１１６は、実施例１と同一の機能である。また、実施例４の分類モデル情報１１３は実施例１と同一である。 The classification model learning section 112, classification model information, classification section 114, document reconstruction section 115, and display section 116 of the fourth embodiment have the same functions as those of the first embodiment. Furthermore, the classification model information 113 of the fourth embodiment is the same as that of the first embodiment.

関連要素表示部１５０１は、文書及び根拠箇所に対応する文書の構成要素の特徴量に基づいて、選択された文書又は選択された文書の構成要素に関係性がある文書又は文書の構成要素をユーザに提示する。 Based on the feature amounts of the documents and of the document components corresponding to evidence points, the related element display unit 1501 presents to the user documents or document components that are related to the selected document or to a component of the selected document.

図１６から図１８を用いて、実施例４の関連要素表示部１５０１が実行する処理について説明する。図１６は、実施例４の関連要素表示部１５０１が実行する処理におけるユーザインタフェース１６００を説明する図である。図１７は、実施例４の関連要素表示部１５０１が実行する処理の一例を説明するフローチャートである。図１８は、実施例４の関連要素表示部１５０１が実行する処理で使用するデータのデータ構造１８００を説明する図である。 Processing executed by the related element display unit 1501 of the fourth embodiment will be described using FIGS. 16 to 18. FIG. 16 is a diagram illustrating a user interface 1600 in the process executed by the related element display unit 1501 of the fourth embodiment. FIG. 17 is a flowchart illustrating an example of processing executed by the related element display unit 1501 of the fourth embodiment. FIG. 18 is a diagram illustrating a data structure 1800 of data used in processing executed by the related element display unit 1501 of the fourth embodiment.

実施例４のユーザインタフェース１６００は、ユーザによって選択された文書データ５００の入力を受け付ける入力欄１６０１、選択された文書の代表段落を表示する表示欄１６０２、選択された文書のキーワードを表示する表示欄１６０３、選択された文書の代表図面を表示する表示欄１６０４、選択された文書の関連文書を表示する表示欄１６０５から構成される。 The user interface 1600 of the fourth embodiment includes an input field 1601 that accepts input of document data 500 selected by the user, a display field 1602 that displays representative paragraphs of the selected document, and a display field that displays keywords of the selected document. 1603, a display field 1604 that displays representative drawings of the selected document, and a display field 1605 that displays related documents of the selected document.

実施例４のユーザインタフェース１６００は一例であり、異なる構成要素、構成要素の位置関係、表示の媒体、インタフェースの媒体を有していてもよい。 The user interface 1600 of the fourth embodiment is an example, and may have different components, positional relationships of the components, display medium, and interface medium.

実施例４の関連要素表示部１５０１が実行する処理で使用するデータのデータ構造１８００は、関連段落を格納する所定の長さの優先度付きキュー（ＰｒｉｏｒｉｔｙＱｕｅｕｅ）１８０１、関連画像を格納する所定の長さの優先度付きキュー１８０２、関連単語を格納する所定の長さの優先度付きキュー１８０３、関連文書を格納する所定の長さの優先度付きキュー１８０４から構成される。 A data structure 1800 of data used in the process executed by the related element display unit 1501 of the fourth embodiment includes a priority queue 1801 of a predetermined length for storing related paragraphs, and a predetermined priority queue for storing related images. It consists of a priority queue 1802 of length, a priority queue 1803 of a predetermined length for storing related words, and a priority queue 1804 of a predetermined length for storing related documents.
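A priority queue of a predetermined length, as used in the data structure 1800, can be sketched with a heap. The eviction rule (dropping the lowest-priority entry once the queue is full) is an assumption, since the text only states that the length is fixed. The class name is hypothetical.

```python
import heapq


class BoundedPriorityQueue:
    """Keep at most max_len items, retaining those with the highest priority."""

    def __init__(self, max_len):
        self.max_len = max_len  # assumed to be at least 1
        self._heap = []         # min-heap of (priority, item); the root is the weakest entry

    def push(self, priority, item):
        entry = (priority, item)
        if len(self._heap) < self.max_len:
            heapq.heappush(self._heap, entry)
        elif entry > self._heap[0]:
            # The queue is full: replace the weakest entry with the new one.
            heapq.heapreplace(self._heap, entry)

    def items_by_priority(self):
        """Return the stored items, highest priority first."""
        return [item for _, item in sorted(self._heap, reverse=True)]
```

Four such queues (for paragraphs, images, words, and documents) would correspond to the queues 1801 to 1804.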

関連要素表示部１５０１は、文書再構築部１１５によって生成された再構築文書データ８００を取得する（ステップＳ１７０１）。 The related element display unit 1501 acquires the reconstructed document data 800 generated by the document reconstruction unit 115 (step S1701).

次に、関連要素表示部１５０１は、対象となる再構築文書データ８００を指定するユーザ入力を受け付ける（ステップＳ１７０２）。 Next, the related element display unit 1501 receives a user input specifying the target reconstructed document data 800 (step S1702).

具体的には、関連要素表示部１５０１は、入力欄１６０１への再構築文書データ８００のインデックス８０１の入力を受け付ける。このとき、関連要素表示部１５０１は、受け付けたインデックス８０１を変数ｄに設定する。ここで、選択された再構築文書データ８００の文書特徴量８０４を^Ｄｅ_ｄとする。以下の説明では、選択された再構築文書データ８００を選択文書データ８００とも記載する。 Specifically, the related element display unit 1501 accepts input of the index 801 of the reconstructed document data 800 into the input field 1601. At this time, the related element display unit 1501 sets the received index 801 to the variable d. Here, the document feature amount 804 of the selected reconstructed document data 800 is denoted ^De_d. In the following description, the selected reconstructed document data 800 will also be referred to as selected document data 800.

なお、入力欄１６０１には、公開番号、出願人、又は検索キーワード等を入力してもよい。この場合、関連要素表示部１５０１は、入力された値に基づいて再構築文書データ８００を検索し、検索結果をユーザに提示する。ユーザは、検索結果に基づいて文書を選択する。 Note that the publication number, applicant, search keyword, etc. may be entered in the input field 1601. In this case, the related element display unit 1501 searches the reconstructed document data 800 based on the input value and presents the search results to the user. The user selects documents based on the search results.

次に、関連要素表示部１５０１は、選択文書データ８００の段落データ８０５のループ処理を開始する（ステップＳ１７０３）。 Next, the related element display unit 1501 starts loop processing of the paragraph data 805 of the selected document data 800 (step S1703).

具体的には、関連要素表示部１５０１は、選択文書データ８００の段落データ８０５の中から一つのターゲット段落データ８０５を選択する。ここで、ターゲット段落データ８０５の段落インデックス８０６をｐ、ターゲット段落データ８０５の段落特徴量８０８を^Ｐｅ_ｄ,ｐとする。 Specifically, the related element display unit 1501 selects one target paragraph data 805 from among the paragraph data 805 of the selected document data 800. Here, the paragraph index 806 of the target paragraph data 805 is denoted p, and the paragraph feature amount 808 of the target paragraph data 805 is denoted ^Pe_d,p.

次に、関連要素表示部１５０１は、選択文書データ８００及びターゲット段落データ８０５の類似度を算出する（ステップＳ１７０４）。具体的には、以下のような処理が実行される。 Next, the related element display unit 1501 calculates the degree of similarity between the selected document data 800 and the target paragraph data 805 (step S1704). Specifically, the following processing is executed.

関連要素表示部１５０１は、選択文書データ８００の文書特徴量８０４及びターゲット段落データ８０５の段落特徴量８０８を用いて、式（１９）に示すコサイン類似度ｓｉｍ（^Ｄｅ_ｄ，^Ｐｅ_ｄ,ｐ）を類似度として算出する。 The related element display unit 1501 uses the document feature amount 804 of the selected document data 800 and the paragraph feature amount 808 of the target paragraph data 805 to calculate the cosine similarity sim(^De_d, ^Pe_d,p) shown in equation (19) as the degree of similarity.

ターゲット段落データ８０５が画像でなく、かつ、コサイン類似度が所定の閾値より大きい場合、関連要素表示部１５０１は、関連段落を格納する優先度付きキュー１８０１に、コサイン類似度を優先度としてターゲット段落データ８０５の段落インデックス８０６を挿入する。ターゲット段落データ８０５が画像であり、かつ、コサイン類似度が所定の閾値より大きい場合、関連要素表示部１５０１は、関連画像を格納する優先度付きキュー１８０２に、コサイン類似度を優先度としてターゲット段落データ８０５の段落インデックス８０６を挿入する。以上がステップＳ１７０４の処理の説明である。 If the target paragraph data 805 is not an image and the cosine similarity is greater than a predetermined threshold, the related element display unit 1501 stores the target paragraph in the priority queue 1801 that stores related paragraphs with the cosine similarity as the priority. A paragraph index 806 of data 805 is inserted. If the target paragraph data 805 is an image and the cosine similarity is larger than a predetermined threshold, the related element display unit 1501 stores the target paragraph in the priority queue 1802 that stores related images with the cosine similarity as the priority. A paragraph index 806 of data 805 is inserted. The above is the explanation of the process in step S1704.
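Step S1704 can be sketched as follows. Equation (19) is not reproduced here, so the standard definition of cosine similarity (dot product divided by the product of the vector norms) is assumed. Plain lists stand in for the priority queues 1801 and 1802, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Standard cosine similarity; returns 0.0 for a zero vector (an assumption)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def route_paragraph(doc_vec, para_vec, para_index, is_image, threshold,
                    related_paragraphs, related_images):
    """Record the paragraph index, with its similarity as priority, when it passes the threshold."""
    sim = cosine_similarity(doc_vec, para_vec)
    if sim > threshold:
        target = related_images if is_image else related_paragraphs
        target.append((sim, para_index))
    return sim
```

The same similarity function applies unchanged to word and document feature amounts, because only the vectors differ between those steps.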

次に、関連要素表示部１５０１は、ターゲット段落データ８０５の単語データ８０９のループ処理を開始する（ステップＳ１７０５）。 Next, the related element display unit 1501 starts loop processing of the word data 809 of the target paragraph data 805 (step S1705).

具体的には、関連要素表示部１５０１は、ターゲット段落データ８０５の単語データ８０９の中から一つのターゲット単語データ８０９を選択する。ここで、ターゲット単語データ８０９の単語インデックス８１０をｔ、ターゲット単語データ８０９の単語特徴量８１３を^Ｗｅ_ｔとする。 Specifically, the related element display unit 1501 selects one target word data 809 from among the word data 809 of the target paragraph data 805. Here, the word index 810 of the target word data 809 is denoted t, and the word feature amount 813 of the target word data 809 is denoted ^We_t.

次に、関連要素表示部１５０１は、選択文書データ８００及びターゲット単語データ８０９の類似度を算出する（ステップＳ１７０６）。具体的には、以下のような処理が実行される。 Next, the related element display unit 1501 calculates the degree of similarity between the selected document data 800 and the target word data 809 (step S1706). Specifically, the following processing is executed.

関連要素表示部１５０１は、選択文書データ８００の文書特徴量８０４及びターゲット単語データ８０９の単語特徴量８１３を用いて、式（１９）に示すコサイン類似度を類似度として算出する。 The related element display unit 1501 uses the document feature amount 804 of the selected document data 800 and the word feature amount 813 of the target word data 809 to calculate the cosine similarity shown in equation (19) as the similarity.

コサイン類似度が所定の閾値より大きい場合、関連要素表示部１５０１は、関連単語を格納する優先度付きキュー１８０３に、コサイン類似度を優先度としてターゲット単語データ８０９の単語インデックス８１０を挿入する。以上がステップＳ１７０６の処理の説明である。 If the cosine similarity is greater than a predetermined threshold, the related element display unit 1501 inserts the word index 810 of the target word data 809 into the priority queue 1803 that stores related words, with the cosine similarity as the priority. The above is the explanation of the process in step S1706.

次に、関連要素表示部１５０１は、ターゲット段落データ８０５の全ての単語データ８０９について処理を実行したか否かを判定する（ステップＳ１７０７）。 Next, the related element display unit 1501 determines whether processing has been performed on all word data 809 of the target paragraph data 805 (step S1707).

処理を実行していない単語データ８０９が少なくとも一つ存在すると判定された場合、関連要素表示部１５０１は、ステップＳ１７０５に戻り、同様の処理を実行する。 If it is determined that there is at least one piece of word data 809 that has not been processed, the related element display unit 1501 returns to step S1705 and executes the same process.

ターゲット段落データ８０５の全ての単語データ８０９の処理を実行したと判定された場合、関連要素表示部１５０１は、選択文書データ８００の全ての段落データ８０５について処理を実行したか否かを判定する（ステップＳ１７０８）。 If it is determined that all the word data 809 of the target paragraph data 805 have been processed, the related element display unit 1501 determines whether or not all the paragraph data 805 of the selected document data 800 have been processed (step S1708).

処理を実行していない段落データ８０５が少なくとも一つ存在すると判定された場合、関連要素表示部１５０１は、ステップＳ１７０３に戻り、同様の処理を実行する。 If it is determined that there is at least one piece of paragraph data 805 that has not been processed, the related element display unit 1501 returns to step S1703 and executes the same process.

選択文書データ８００の全ての段落データ８０５について処理を実行したと判定された場合、関連要素表示部１５０１は、選択文書データ８００を除く再構築文書データ８００のループ処理を開始する（ステップＳ１７０９）。 If it is determined that the process has been executed for all paragraph data 805 of the selected document data 800, the related element display unit 1501 starts loop processing of the reconstructed document data 800 excluding the selected document data 800 (step S1709).

具体的には、関連要素表示部１５０１は、選択文書データ８００を除く再構築文書データ８００の中から一つのターゲット再構築文書データ８００を選択する。ここで、ターゲット再構築文書データ８００のインデックス８０１をｄ’、ターゲット再構築文書データ８００の文書特徴量８０４を^Ｄｅ_ｄ’とする。 Specifically, the related element display unit 1501 selects one piece of target reconstructed document data 800 from among the reconstructed document data 800 excluding the selected document data 800. Here, the index 801 of the target reconstructed document data 800 is denoted d', and the document feature amount 804 of the target reconstructed document data 800 is denoted ^De_d'.

次に、関連要素表示部１５０１は、選択文書データ８００及びターゲット再構築文書データ８００の類似度を算出する（ステップＳ１７１０）。具体的には、以下のような処理が実行される。 Next, the related element display unit 1501 calculates the degree of similarity between the selected document data 800 and the target reconstructed document data 800 (step S1710). Specifically, the following processing is executed.

関連要素表示部１５０１は、選択文書データ８００の文書特徴量８０４及びターゲット再構築文書データ８００の文書特徴量８０４を用いて、式（１９）に示すコサイン類似度を類似度として算出する。 The related element display unit 1501 uses the document feature amount 804 of the selected document data 800 and the document feature amount 804 of the target reconstructed document data 800 to calculate the cosine similarity shown in equation (19) as the similarity.

コサイン類似度が所定の閾値より大きい場合、関連要素表示部１５０１は、関連文書を格納する優先度付きキュー１８０４に、コサイン類似度を優先度としてターゲット再構築文書データ８００のインデックス８０１を挿入する。以上がステップＳ１７１０の処理の説明である。 If the cosine similarity is greater than a predetermined threshold, the related element display unit 1501 inserts the index 801 of the target reconstructed document data 800 into the priority queue 1804 that stores related documents, with the cosine similarity as the priority. The above is the explanation of the process in step S1710.

次に、関連要素表示部１５０１は、ユーザインタフェース１６００を表示する（ステップＳ１７１２）。 Next, the related element display unit 1501 displays the user interface 1600 (step S1712).

具体的には、関連要素表示部１５０１は、優先度付きキュー１８０１に格納された各段落インデックス８０６に対応する段落文章データ５０３を表示欄１６０２に表示する。関連要素表示部１５０１は、優先度付きキュー１８０２に格納された各段落インデックス８０６に対応する段落データ１３０１を表示欄１６０４に表示する。関連要素表示部１５０１は、優先度付きキュー１８０３に格納された各単語インデックス８１０に対応する単語テキスト８１１を表示欄１６０３に表示する。また、関連要素表示部１５０１は、優先度付きキュー１８０４に格納された各インデックス８０１を表示欄１６０５に表示する。 Specifically, the related element display unit 1501 displays paragraph text data 503 corresponding to each paragraph index 806 stored in the priority queue 1801 in the display column 1602. The related element display unit 1501 displays paragraph data 1301 corresponding to each paragraph index 806 stored in the priority queue 1802 in a display column 1604. The related element display unit 1501 displays word text 811 corresponding to each word index 810 stored in the priority queue 1803 in a display column 1603. Further, the related element display unit 1501 displays each index 801 stored in the priority queue 1804 in a display column 1605.

実施例４では、ループ処理に基づいて文書又は文書の構成要素に関係性がある文書又は文書の構成要素をユーザに提示する方法について説明したが、ループ処理の代わりにＦＬＡＮＮ等の近似Ｋ近傍探索法を用いてもよい。 The fourth embodiment has described a method that uses loop processing to present to the user documents or document components related to a document or its components. Instead of loop processing, an approximate K-nearest-neighbor search method such as FLANN may be used.
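
As a point of comparison for the K-nearest-neighbor alternative, the following Python sketch retrieves the K most similar items by exact brute-force search. It is not FLANN and is not approximate. It only shows the interface that an approximate index would replace: a query vector goes in, and the K best indices come out. The function name `top_k_related` and the dictionary layout are assumptions made for illustration.

```python
import heapq
import math


def top_k_related(query, candidates, k):
    """Exact K nearest neighbours of `query` by cosine similarity.

    candidates maps an index to a feature vector (assumed layout).
    Returns up to k (similarity, index) pairs, most similar first.
    """
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        nu = math.sqrt(sum(a * a for a in u))
        nv = math.sqrt(sum(b * b for b in v))
        return dot / (nu * nv) if nu > 0.0 and nv > 0.0 else 0.0

    scored = ((cos(query, vec), idx) for idx, vec in candidates.items())
    return heapq.nlargest(k, scored)
```

Unlike the threshold test in steps S1706 and S1710, this variant always returns at most K items regardless of their similarity values, so a threshold may still need to be applied afterwards if weakly related items should be hidden.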

実施例４によれば、計算機システムは、選択された文書又は選択された文書の構成要素に関係性がある文書又は文書の構成要素をユーザに提示することができる。 According to the fourth embodiment, the computer system can present to the user documents or document components that are related to the selected document or the components of the selected document.

実施例５では、分類モデル学習部１１２が、文書及び文の構成要素の各々に付与された一つ以上の正解ラベルに基づいて分類モデルを学習する。以下、実施例１との差異を中心に実施例５について説明する。 In the fifth embodiment, the classification model learning unit 112 learns a classification model based on one or more correct labels given to each of a document and the constituent elements of the document. The fifth embodiment will be described below, focusing on the differences from the first embodiment.

実施例５の計算機システムの構成は実施例１と同一である。実施例５の計算機１００－１、１００－２のハードウェア構成及びソフトウェア構成は実施例１と同一である。 The configuration of the computer system of the fifth embodiment is the same as that of the first embodiment. The hardware and software configurations of computers 100-1 and 100-2 in the fifth embodiment are the same as those in the first embodiment.

実施例５の分類モデル学習部１１２、分類部１１４、文書再構築部１１５、及び表示部１１６が実行する処理は実施例１と同一である。 The processing executed by the classification model learning section 112, the classification section 114, the document reconstruction section 115, and the display section 116 of the fifth embodiment is the same as that of the first embodiment.

実施例５のグラフ構築部１１０は、文書に付与された正解ラベルだけではなく、構成要素に付与された一つ以上の正解ラベルも受けつける点が異なる。具体的には、図２のステップＳ２０３及びステップＳ２１１の処理が異なる。 The graph construction unit 110 of the fifth embodiment differs in that it accepts not only the correct label given to the document but also one or more correct labels given to the constituent elements. Specifically, the processes in step S203 and step S211 in FIG. 2 are different.

ここで、構成要素に付与される正解ラベルは根拠箇所か否かを示すラベルに相当する。以下の説明では、文書に付与された正解ラベルを文書ラベルと記載し、構成要素の一つである段落に付与された正解ラベルを段落ラベルと記載する。 Here, the correct label given to a constituent element corresponds to a label indicating whether or not that element is an evidence point. In the following description, a correct label given to a document is referred to as a document label, and a correct label given to a paragraph, which is one of the constituent elements, is referred to as a paragraph label.

図３、図１９Ａ、図１９Ｂ、及び図２０を用いて、実施例５のグラフ構築部１１０が実行する処理について説明する。図１９Ａ及び図１９Ｂは、実施例５のグラフ構築部１１０が実行する処理におけるデータの入出力を説明する図である。図２０は、実施例５のグラフ構築部１１０が実行するステップＳ２１１の一例を説明するフローチャートである。 Processing executed by the graph construction unit 110 of the fifth embodiment will be described using FIG. 3, FIG. 19A, FIG. 19B, and FIG. 20. FIGS. 19A and 19B are diagrams illustrating data input/output in the processing executed by the graph construction unit 110 of the fifth embodiment. FIG. 20 is a flowchart illustrating an example of step S211 executed by the graph construction unit 110 of the fifth embodiment.

実施例５におけるステップＳ２０３では、以下のような処理が実行される。 In step S203 in the fifth embodiment, the following processing is executed.

グラフ構築部１１０は、ターゲット文書データ５００の段落データ１９０１のループ処理を開始する（ステップＳ３０１）。 The graph construction unit 110 starts loop processing of the paragraph data 1901 of the target document data 500 (step S301).

具体的には、グラフ構築部１１０は、ターゲット文書データ５００に含まれる複数の段落データ１９０１の中から一つのターゲット段落データ１９０１を選択する。 Specifically, the graph construction unit 110 selects one target paragraph data 1901 from a plurality of paragraph data 1901 included in the target document data 500.

次に、グラフ構築部１１０は、ターゲット段落データ１９０１のテキストデータ１９０２を形態素の単位に分解する（ステップＳ３０２）。 Next, the graph construction unit 110 decomposes the text data 1902 of the target paragraph data 1901 into units of morphemes (step S302).

なお、分割の単位は文字及びバイト対符号化等、形態素以下の単位、又は複数の単語から構成されるフレーズ等でもよい。このとき、文章に頻出する句読点及び助動詞等のストップワードを除去する処理、並びに、形態素を原型に戻す処理が行われてもよい。 Note that the unit of division may be a unit smaller than a morpheme, such as a character or a byte pair encoding token, or a larger unit such as a phrase composed of a plurality of words. At this time, processing to remove stop words that appear frequently in text, such as punctuation marks and auxiliary verbs, and processing to return morphemes to their base forms may also be performed.

次に、グラフ構築部１１０は、ターゲット段落データ１９０１の単語から単語Ｎｇｒａｍを抽出し、段落Ｎｇｒａｍデータ５０４としてターゲット文書データ５００に格納する（ステップＳ３０３）。 Next, the graph construction unit 110 extracts word Ngrams from the words of the target paragraph data 1901, and stores them in the target document data 500 as the paragraph Ngram data 504 (step S303).

次に、グラフ構築部１１０は、ターゲット文書データ５００に含まれる全ての段落データ１９０１について処理を実行したか否かを判定する（ステップＳ３０４）。 Next, the graph construction unit 110 determines whether processing has been performed on all paragraph data 1901 included in the target document data 500 (step S304).

処理を実行していない段落データ１９０１が少なくとも一つ存在すると判定された場合、グラフ構築部１１０は、ステップＳ３０１に戻り、同様の処理を実行する。 If it is determined that there is at least one piece of paragraph data 1901 that has not been processed, the graph construction unit 110 returns to step S301 and executes the same process.

ターゲット文書データ５００に含まれる全ての段落データ１９０１について処理を実行したと判定された場合、グラフ構築部１１０は、文書データの前処理を終了し、ステップＳ２０４に進む。 If it is determined that all the paragraph data 1901 included in the target document data 500 have been processed, the graph construction unit 110 ends the preprocessing of the document data and proceeds to step S204.
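
The preprocessing loop of steps S301 to S304 can be sketched in Python as follows. This is only an illustration under stated assumptions: whitespace splitting stands in for morphological analysis, the maximum Ngram length `max_n` is a hypothetical parameter that the specification does not fix, and the function names are invented for this sketch.

```python
def extract_word_ngrams(tokens, max_n=2):
    """Return all contiguous word Ngrams of length 1..max_n as tuples."""
    ngrams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(tuple(tokens[i:i + n]))
    return ngrams


def preprocess_document(paragraph_texts, stop_words=frozenset(), max_n=2):
    """Tokenize each paragraph, drop stop words, and extract its word Ngrams."""
    paragraph_ngrams = []
    for text in paragraph_texts:  # loop over paragraph data (S301 to S304)
        # Assumption: whitespace split replaces morphological analysis (S302).
        tokens = [tok for tok in text.split() if tok not in stop_words]
        # Word Ngram extraction for the paragraph (S303).
        paragraph_ngrams.append(extract_word_ngrams(tokens, max_n))
    return paragraph_ngrams
```

For Japanese text, the whitespace split would have to be replaced by a morphological analyzer, since words are not delimited by spaces.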

実施例５におけるステップＳ２１１では、以下のような処理が実行される。 In step S211 in the fifth embodiment, the following processing is executed.

グラフ構築部１１０は、正解ラベルベクトル５４０を初期化する（ステップＳ２００１）。 The graph construction unit 110 initializes the correct label vector 540 (step S2001).

具体的には、グラフ構築部１１０は、要素数がＮかつ全ての要素が０であるベクトルを正解ラベルベクトル５４０として設定する。また、グラフ構築部１１０は、段落番号カウンタｃを０に初期化する。 Specifically, the graph construction unit 110 sets a vector in which the number of elements is N and all elements are 0 as the correct label vector 540. The graph construction unit 110 also initializes the paragraph number counter c to 0.

次に、グラフ構築部１１０は、文書データ５００のループ処理を開始する（ステップＳ２００２）。 Next, the graph construction unit 110 starts loop processing of the document data 500 (step S2002).

具体的には、グラフ構築部１１０は、複数の文書データ５００の中から一つのターゲット文書データ５００を選択する。ここで、ターゲット文書データ５００のインデックス５０１をｄとする。 Specifically, the graph construction unit 110 selects one target document data 500 from among the plurality of document data 500. Here, the index 501 of the target document data 500 is assumed to be d.

次に、グラフ構築部１１０は、正解ラベルベクトル５４０のｄ番目の要素にターゲット文書データ５００のラベル５０２の値を格納する（ステップＳ２００３）。 Next, the graph construction unit 110 stores the value of the label 502 of the target document data 500 in the d-th element of the correct label vector 540 (step S2003).

次に、グラフ構築部１１０は、ターゲット文書データ５００の段落データ１９０１のループ処理を開始する（ステップＳ２００４）。 Next, the graph construction unit 110 starts loop processing of the paragraph data 1901 of the target document data 500 (step S2004).

具体的には、グラフ構築部１１０は、ターゲット文書データ５００の複数の段落データ１９０１の中から一つのターゲット段落データ１９０１を選択する。 Specifically, the graph construction unit 110 selects one target paragraph data 1901 from among the plurality of paragraph data 1901 of the target document data 500.

次に、グラフ構築部１１０は、正解ラベルベクトル５４０の（Ｎ_Ａ+Ｎ_Ｑ＋ｐ）番目の要素にターゲット段落データ１９０１のラベル１９０３の値を格納する（ステップＳ２００５）。 Next, the graph construction unit 110 stores the value of the label 1903 of the target paragraph data 1901 in the (N_A+N_Q+p)-th element of the correct label vector 540 (step S2005).

次に、グラフ構築部１１０は、ｃをインクリメントし、ターゲット文書データ５００の全ての段落データ１９０１について処理を実行したか否かを判定する（ステップＳ２００６）。 Next, the graph construction unit 110 increments c and determines whether the process has been executed for all paragraph data 1901 of the target document data 500 (step S2006).

処理を実行していない段落データ１９０１が少なくとも一つ存在すると判定された場合、グラフ構築部１１０は、ステップＳ２００４に戻り、同様の処理を実行する。 If it is determined that there is at least one piece of paragraph data 1901 that has not been processed, the graph construction unit 110 returns to step S2004 and executes the same process.

全ての段落データ１９０１について処理を実行したと判定された場合、グラフ構築部１１０は、全ての文書データ５００の文書ラベルを格納したか否かを判定する（ステップＳ２００７）。 If it is determined that the process has been executed for all the paragraph data 1901, the graph construction unit 110 determines whether the document labels of all the document data 500 have been stored (step S2007).

文書ラベルを格納していない文書データ５００が少なくとも一つ存在すると判定された場合、グラフ構築部１１０は、ステップＳ２００２に戻り、同様の処理を実行する。 If it is determined that there is at least one piece of document data 500 that does not store a document label, the graph construction unit 110 returns to step S2002 and executes the same process.

全ての文書データ５００の文書ラベルを格納したと判定された場合、グラフ構築部１１０はステップＳ２１１の処理を終了する。 If it is determined that the document labels of all document data 500 have been stored, the graph construction unit 110 ends the process of step S211.
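
Steps S2001 to S2007 can be summarized in a short Python sketch. Several points are assumptions rather than statements of the specification: documents are represented as dictionaries with `label` and `paragraphs` keys, the document index d is taken to be the position in the input list, N_A+N_Q is passed in as a single `paragraph_offset`, and the paragraph position p in step S2005 is assumed to coincide with the running counter c.

```python
def build_label_vector(documents, n_vertices, paragraph_offset):
    """Build the correct label vector holding document and paragraph labels.

    documents: list of {"label": int, "paragraphs": [{"label": int}, ...]}
    n_vertices: N, the total number of elements in the vector
    paragraph_offset: stands in for N_A + N_Q (assumed to be given)
    """
    labels = [0] * n_vertices  # S2001: N elements, all zero
    c = 0                      # S2001: paragraph counter
    for d, document in enumerate(documents):      # S2002
        labels[d] = document["label"]             # S2003: document label
        for paragraph in document["paragraphs"]:  # S2004
            # S2005: paragraph label, assuming p equals the counter c
            labels[paragraph_offset + c] = paragraph["label"]
            c += 1                                # S2006
    return labels
```

Elements that receive neither a document label nor a paragraph label, such as those for other vertex types, keep their initial value of 0 in this sketch.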

実施例５によれば、計算機システムは、文書に付与された正解ラベル及び根拠箇所となる構成要素に付与された正解ラベルに基づいて、分類モデルを学習することができる。これによって、分類精度及び根拠箇所の提示の精度を向上できる。 According to the fifth embodiment, the computer system can learn a classification model based on the correct label given to the document and the correct label given to the component serving as the basis. This makes it possible to improve the accuracy of classification and the accuracy of presentation of evidence points.

実施例１と同様に、グラフ構築部１１０によって生成されるグラフ６００では文書及び構成要素（段落及び段落）はグラフ６００を構成する一つの頂点として扱われるため、従来技術のように、根拠箇所提示の適切さと分類精度との間のトレードオフの関係が生じることなく、分類結果及び根拠を提示できる。 As in the first embodiment, in the graph 600 generated by the graph construction unit 110, the document and each constituent element (for example, a paragraph) are each treated as a single vertex of the graph 600. Therefore, unlike the prior art, the classification result and its basis can be presented without a trade-off arising between the appropriateness of the presented evidence points and the classification accuracy.

また、上記の各構成、機能、処理部、処理手段等は、それらの一部又は全部を、例えば集積回路で設計する等によりハードウェアで実現してもよい。また、本発明は、実施例の機能を実現するソフトウェアのプログラムコードによっても実現できる。この場合、プログラムコードを記録した記憶媒体をコンピュータに提供し、そのコンピュータが備えるプロセッサが記憶媒体に格納されたプログラムコードを読み出す。この場合、記憶媒体から読み出されたプログラムコード自体が前述した実施例の機能を実現することになり、そのプログラムコード自体、及びそれを記憶した記憶媒体は本発明を構成することになる。このようなプログラムコードを供給するための記憶媒体としては、例えば、フレキシブルディスク、ＣＤ－ＲＯＭ、ＤＶＤ－ＲＯＭ、ハードディスク、ＳＳＤ（ＳｏｌｉｄＳｔａｔｅＤｒｉｖｅ）、光ディスク、光磁気ディスク、ＣＤ－Ｒ、磁気テープ、不揮発性のメモリカード、ＲＯＭなどが用いられる。 Further, each of the above-described configurations, functions, processing units, processing means, and the like may be realized partially or entirely in hardware, for example by designing them as an integrated circuit. The present invention can also be realized by program code of software that implements the functions of the embodiments. In this case, a storage medium on which the program code is recorded is provided to a computer, and a processor included in the computer reads the program code stored on the storage medium. The program code itself read from the storage medium then realizes the functions of the embodiments described above, and the program code itself and the storage medium storing it constitute the present invention. Examples of storage media for supplying such program code include flexible disks, CD-ROMs, DVD-ROMs, hard disks, SSDs (Solid State Drives), optical disks, magneto-optical disks, CD-Rs, magnetic tapes, non-volatile memory cards, and ROMs.

また、本実施例に記載の機能を実現するプログラムコードは、例えば、アセンブラ、Ｃ／Ｃ＋＋、ｐｅｒｌ、Ｓｈｅｌｌ、ＰＨＰ、Ｐｙｔｈｏｎ、Ｊａｖａ（登録商標）等の広範囲のプログラム又はスクリプト言語で実装できる。 Further, the program code for realizing the functions described in this embodiment can be implemented in a wide range of program or script languages such as assembler, C/C++, Perl, Shell, PHP, Python, and Java (registered trademark).

さらに、実施例の機能を実現するソフトウェアのプログラムコードを、ネットワークを介して配信することによって、それをコンピュータのハードディスクやメモリ等の記憶手段又はＣＤ－ＲＷ、ＣＤ－Ｒ等の記憶媒体に格納し、コンピュータが備えるプロセッサが当該記憶手段や当該記憶媒体に格納されたプログラムコードを読み出して実行するようにしてもよい。 Furthermore, the program code of the software that realizes the functions of the embodiments may be distributed via a network and stored in storage means such as a hard disk or memory of a computer, or in a storage medium such as a CD-RW or CD-R, and a processor included in the computer may read and execute the program code stored in the storage means or the storage medium.

上述の実施例において、制御線や情報線は、説明上必要と考えられるものを示しており、製品上必ずしも全ての制御線や情報線を示しているとは限らない。全ての構成が相互に接続されていてもよい。 In the above-described embodiments, the control lines and information lines are those considered necessary for explanation, and not all control lines and information lines are necessarily shown in the product. All configurations may be interconnected.

１００計算機 100 Computer
１０１プロセッサ 101 Processor
１０２メモリ 102 Memory
１０３ネットワークインタフェース 103 Network interface
１１０グラフ構築部 110 Graph construction unit
１１１グラフ情報 111 Graph information
１１２分類モデル学習部 112 Classification model learning unit
１１３分類モデル情報 113 Classification model information
１１４分類部 114 Classification unit
１１５文書再構築部 115 Document reconstruction unit
１１６表示部 116 Display unit
１２０ネットワーク 120 Network
５００文書データ 500 Document data
５１０Ｎｇｒａｍ辞書 510 Ngram dictionary
５２０隣接行列 520 Adjacency matrix
５３０段落辞書 530 Paragraph dictionary
５４０正解ラベルベクトル 540 Correct label vector
６００グラフ 600 Graph
８００再構築文書データ 800 Reconstructed document data
９００、１４００、１６００ユーザインタフェース 900, 1400, 1600 User interface
１５０１関連要素表示部 1501 Related element display unit

Claims

A computer system comprising at least one computer,
wherein the at least one computer has a processor, a memory connected to the processor, and an interface connected to the processor,
the computer system comprising:
a graph construction unit that receives input of data of a document and generates a graph having the document and the elements of the document as vertices;
a classification unit that calculates, for each of the plurality of vertices, an index used to classify the document into one of a plurality of classes; and
a document reconstruction unit that classifies the document based on the index of at least one of the vertices, identifies evidence points on the document consisting of at least one element of the document that contributed to the classification, and presents the evidence points on the document,
wherein the element of the document is at least one of a word, a diagram, a table, a formula, and a sentence composed of multiple words,
the classification unit calculates, as the index, a value representing the probability of each of the plurality of classes, using a graph convolution network that receives the graph as input and outputs a vector representing the characteristics of the graph, and a function that takes the vector as a variable and outputs a value representing the probability of each of the plurality of classes, and
the document reconstruction unit:
determines the class to which the document belongs based on the magnitude of the index of the vertex corresponding to the document; and
identifies an element of the document that serves as an evidence point on the document based on the magnitude of the index, for the class to which the document belongs, of the vertex corresponding to the element of the document.

The computer system according to claim 1,
further comprising a classification model learning unit that generates the graph convolution network by machine learning using the graph representing a learning document to which a class has been assigned.

A document classification method executed by a computer system including at least one computer,
wherein the at least one computer has a processor, a memory connected to the processor, and an interface connected to the processor,
the document classification method comprising:
a first step in which the processor receives input of data of a document, generates data representing a graph having the document and the elements of the document as vertices, and stores the data in the memory;
a second step in which the processor calculates, for each of the plurality of vertices, an index used to classify the document into one of a plurality of classes, and stores the index in the memory;
a third step in which the processor classifies the document based on the index of at least one of the vertices and stores the result of the classification in the memory;
a fourth step in which the processor identifies evidence points on the document consisting of at least one element of the document that contributed to the classification, and stores the identified result in the memory; and
a fifth step in which the processor generates display information for presenting the result of the classification and the evidence points on the document,
wherein the element of the document is at least one of a word, a diagram, a table, a formula, and a sentence composed of multiple words,
the second step includes a step in which the processor calculates, as the index, for each of the classes, a value representing the probability of falling into the class, using a graph convolution network that receives the graph as input and outputs a vector representing the characteristics of the graph, and a function that takes the vector as a variable and outputs a value representing the probability of falling into the class,
the third step includes a step in which the processor classifies the document based on the index of the vertex corresponding to the document, and
the fourth step includes a step in which the processor identifies the evidence points on the document based on the magnitude of the index, for the class into which the document is classified, of the vertex corresponding to the element of the document.

The document classification method according to claim 3,
further comprising a step in which the processor generates the graph convolution network by machine learning using the graph representing a learning document to which a class has been assigned.