JP7415495B2 - Document processing program, document processing device, and document processing method

Info

Publication number: JP7415495B2
Application number: JP2019218049A
Authority: JP
Inventors: 修也阿部
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2019-12-02
Filing date: 2019-12-02
Publication date: 2024-01-17
Anticipated expiration: 2039-12-02
Also published as: JP2021089473A

Description

The present invention relates to document processing technology.

In recent years, natural language processing using artificial intelligence (AI) technology has been increasingly used in industry and elsewhere. Natural language processing includes morphological analysis, syntactic analysis, semantic analysis, anaphora resolution, and the like. In natural language processing using AI technology, document analysis is performed, for example, by the following procedure.
(P1) A user manually creates training data for machine learning from a document set.
(P2) The user trains a learning model on the training data by machine learning, thereby adjusting the parameters of the learning model and generating an analysis model, which is a trained model.
(P3) The user analyzes an unknown document using the analysis model and generates an analysis result.

The analysis result includes additional information for the text of the unknown document. For example, in morphological analysis, the boundary positions between morphemes, the part of speech of each morpheme, and the like are generated as additional information. Pairs of input data and output data are used as training data for machine learning. The input data is a document to be analyzed, and the output data is that document with additional information added.

Because creating training data imposes a heavy workload, reuse of training data, in which existing training data is partially changed and repurposed to generate another analysis model, is increasing as a way to reduce that workload. Examples of such reuse include the following.
(a) To improve machine learning performance, the user reviews and changes existing training data to create more accurate training data.
(b) To improve machine learning performance, the user creates large-scale training data by changing existing training data and adding new training data.
(c) The user changes existing training data in order to customize it according to the user's own criteria.

In connection with natural language processing in the biomedical field, a technique for annotating named entities included in a document is known (see, for example, Non-Patent Document 1).

P. Stenetorp et al., “BioNLP Shared Task 2011: Supporting Resources”, Proceedings of BioNLP Shared Task 2011 Workshop, pages 112-120, 2011

Because machine learning uses a considerable amount of training data, when only part of the training data is changed, the learning process treats the change as noise and the change is unlikely to be reflected in the learning result. To ensure that the change is reflected in the learning result, it is therefore desirable to make a large number of similar changes across the entire training data, which increases the user's workload.

This problem arises not only when changing training data used in machine learning for natural language processing but also when changing information included in various kinds of document sets.

In one aspect, an object of the present invention is for a computer to extract change candidates from a document set. As a result, the work of changing the information in the document set becomes more efficient.

In one proposal, a document processing program causes a computer to execute the following process. Based on a change history that indicates that a user has changed additional information of text included in a document set, and that includes change examples indicating change operations performed by the user, the computer estimates a change target of a change that the user will make to the document set. The computer extracts text corresponding to the change target from the document set. The computer generates a plurality of clusters by clustering the text extracted from the document set based on the text that precedes and follows the extracted text. When the user changes the additional information of a text belonging to a specific cluster among the plurality of clusters, the computer reflects that change in the additional information of the other texts belonging to the specific cluster. The computer outputs change candidate information indicating the text extracted from the document set.

According to one aspect, the work of changing the information in a document set can be made more efficient.

FIG. 1 is a diagram showing text in curation in the medical field.
FIG. 2 is a functional configuration diagram of a document processing device.
FIG. 3 is a flowchart of change support processing.
FIG. 4 is a functional configuration diagram showing a specific example of the document processing device.
FIG. 5 is a diagram showing classification results of change examples.
FIG. 6 is a diagram showing evaluation values of paragraphs.
FIG. 7 is a diagram showing a highlighted partial document.
FIG. 8 is a diagram showing word vectors.
FIG. 9 is a flowchart showing a specific example of the change support processing.
FIG. 10 is a flowchart of edit screen generation processing.
FIG. 11 is a flowchart of estimation processing.
FIG. 12 is a hardware configuration diagram of an information processing device.

Hereinafter, embodiments will be described in detail with reference to the drawings.

In the medical field, a person may read papers related to diseases and the like and register the knowledge in the papers in a database. Such work is called curation, and the person who performs it is called a curator.

The technique of Non-Patent Document 1 is used to make curation more efficient. In this technique, natural language processing is used to highlight the portions of a paper in which knowledge is described and to attach annotations to the highlighted portions. The portions in which knowledge is described are named entities representing genes, gene mutations, drugs, diseases, and the like, and the annotations are tags indicating the type of each named entity, relationships between named entities, and the like. By checking the text with a focus on the highlighted portions, a curator can maintain the database.

FIG. 1 shows an example of text displayed on the screen of a curator's terminal device in curation in the medical field. The text in FIG. 1 is from a medical paper written in English, and “p.R122W”, “FNMTC” (Familial Non-Medullary Thyroid Carcinoma), and “thyroid cancer” in the paper have been extracted as named entities related to gene mutations or diseases. The extracted text is highlighted with a marker.

“p.R122W” is a phrase representing a specific gene mutation, and “FNMTC” and “thyroid cancer” are phrases representing specific diseases. “p.R122W” is given the tag “Mutation”, and “FNMTC” and “thyroid cancer” are given the tag “Disease”. In addition, a “Pathogenic” arrow is displayed as the relationship between “p.R122W” and “FNMTC”, and another “Pathogenic” arrow is displayed as the relationship between “p.R122W” and “thyroid cancer”.

From these tags and relationships, the curator recognizes that the gene mutation “p.R122W” causes the diseases “FNMTC” and “thyroid cancer”, and registers that knowledge in the database. The curator can also check, based on the tags and relationships given to each named entity, whether the result of the natural language processing contains errors. Such errors include named entities that were missed, named entities that were extracted incorrectly, incorrect tags or relationships, and the like.

Here, consider generating a natural language processing analysis model that extracts named entities from papers in the medical field and annotates the extracted named entities. Pairs of input data and output data are used as training data for such an analysis model. The input data is a paper to be analyzed, and the output data is that paper with additional information added. The additional information includes the text range of each named entity, the tag for each named entity, and the relationships between named entities.
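The additional information described above (text range, tag, and relationships) can be pictured as a small data structure. The following Python sketch is only an illustration: the class names `Entity`, `Relation`, and `AnnotatedDocument`, the helper `span`, and the sample sentence are assumptions made for this example and do not appear in the patent.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass(frozen=True)
class Entity:
    start: int  # character offset where the named entity begins
    end: int    # character offset just past the named entity
    tag: str    # named-entity type, e.g. "Mutation" or "Disease"


@dataclass(frozen=True)
class Relation:
    head: int   # index of the first entity in AnnotatedDocument.entities
    tail: int   # index of the second entity
    label: str  # relationship type, e.g. "Pathogenic"


@dataclass
class AnnotatedDocument:
    text: str                                           # input data: the raw paper text
    entities: List[Entity] = field(default_factory=list)
    relations: List[Relation] = field(default_factory=list)

    def surface(self, index: int) -> str:
        """Return the text covered by the entity at the given index."""
        entity = self.entities[index]
        return self.text[entity.start:entity.end]


def span(text: str, phrase: str, tag: str) -> Entity:
    """Hypothetical helper: annotate the first occurrence of phrase."""
    start = text.index(phrase)
    return Entity(start, start + len(phrase), tag)


# Made-up sentence that reuses the entity strings of FIG. 1.
text = "The p.R122W mutation was found in patients with FNMTC."
doc = AnnotatedDocument(
    text=text,
    entities=[span(text, "p.R122W", "Mutation"), span(text, "FNMTC", "Disease")],
    relations=[Relation(head=0, tail=1, label="Pathogenic")],
)
print(doc.surface(0), doc.relations[0].label, doc.surface(1))
```

In this picture, the `text` field alone corresponds to the input data, and the whole `AnnotatedDocument` corresponds to the output data of one training pair.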

In curation using an analysis model, when the curator registers knowledge in the database, the curator can also correct errors in the natural language processing, and the corrected result can be adopted as training data for the analysis model. As a result, the accuracy of the natural language processing by the analysis model gradually improves, and the curator's checking work gradually becomes more efficient.

For example, when the tag “disease” is given to “advanced gastric cancer” in the processing result of the analysis model, the curator can change the text range corresponding to the tag from “advanced gastric cancer” to “gastric cancer”. In this case, training data is generated in which the tag “disease” is given to “gastric cancer” rather than to “advanced gastric cancer”.

However, there is a problem that when the curator makes only a few changes at most, the changes are unlikely to be reflected in the learning result. This is because machine learning uses a considerable amount of training data, so even if a few changes are added, the learning process treats them as noise and does not reflect them in the learning result. To have the changes reflected in the learning result, it is therefore desirable to make a large number of similar changes, which increases the curator's workload.

FIG. 2 shows an example of the functional configuration of a document processing device according to the embodiment. The document processing device 201 in FIG. 2 includes a storage unit 211, an estimation unit 212, an extraction unit 213, and an output unit 214. The storage unit 211 stores a change history 221 indicating that a user has changed information included in a document set. The estimation unit 212, the extraction unit 213, and the output unit 214 perform change support processing using the change history 221.

FIG. 3 is a flowchart showing an example of the change support processing performed by the document processing device 201 in FIG. 2. First, the estimation unit 212 estimates, based on the change history 221, the change target of a change that the user will make to the document set (step 301). Next, the extraction unit 213 extracts text corresponding to the change target from the document set (step 302). Then, the output unit 214 outputs change candidate information indicating the text extracted from the document set (step 303).
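Steps 301 to 303 can be sketched as three small functions chained together. This is a toy illustration under stated assumptions, not the device's actual logic: the function names, the tuple representation of an annotation, and the heuristic "same final word and same tag" used as the change target are all assumptions made for the example.

```python
from typing import List, Tuple

Entity = Tuple[str, str]        # (entity text, tag)
Change = Tuple[Entity, Entity]  # (annotation before, annotation after)


def estimate_targets(history: List[Change]) -> List[Entity]:
    """Step 301 (toy heuristic): target = final word of the changed entity plus its tag."""
    targets: List[Entity] = []
    for (before_text, before_tag), _after in history:
        target = (before_text.split()[-1], before_tag)
        if target not in targets:
            targets.append(target)
    return targets


def extract_candidates(
    targets: List[Entity], documents: List[List[Entity]]
) -> List[Tuple[int, Entity]]:
    """Step 302: collect annotated entities whose final word and tag match a target."""
    found = []
    for doc_id, entities in enumerate(documents):
        for text, tag in entities:
            if (text.split()[-1], tag) in targets:
                found.append((doc_id, (text, tag)))
    return found


def output_candidates(candidates: List[Tuple[int, Entity]]) -> str:
    """Step 303: render the change candidate information as plain text."""
    return "\n".join(
        f"doc {doc_id}: {text} [{tag}]" for doc_id, (text, tag) in candidates
    )


# One recorded change: the range of a "disease" entity was shortened.
history = [(("advanced gastric cancer", "disease"), ("gastric cancer", "disease"))]
documents = [
    [("progressive gastric cancer", "disease"), ("nivolumab", "drug")],
    [("colon cancer", "disease")],
]
targets = estimate_targets(history)
candidates = extract_candidates(targets, documents)
report = output_candidates(candidates)
print(report)
```

The sketch assumes every entity text is non-empty; a real implementation would also need document offsets so that each candidate can be highlighted in place.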

According to the document processing device 201 in FIG. 2, the work of changing the information in a document set can be made more efficient.

FIG. 4 shows a specific example of the document processing device 201 in FIG. 2. The document processing device 401 in FIG. 4 includes a storage unit 411, an analysis unit 412, an estimation unit 413, an extraction unit 414, a classification unit 415, a change unit 416, a generation unit 417, and an output unit 418. The storage unit 411, the estimation unit 413, the extraction unit 414, and the output unit 418 correspond to the storage unit 211, the estimation unit 212, the extraction unit 213, and the output unit 214 in FIG. 2, respectively.

The storage unit 411 stores an analysis model 421 and a pre-analysis document set 422. The analysis model 421 is a trained natural language processing model generated by machine learning; it analyzes a document to be analyzed and generates an analysis result that includes additional information. The pre-analysis document set 422 includes a plurality of documents to be analyzed, and each document includes a plurality of partial documents. A partial document corresponds to a chapter, a paragraph, a sentence, or the like.

The analysis unit 412 analyzes each document included in the pre-analysis document set 422 using the analysis model 421, thereby generating documents to which additional information has been added. The analysis unit 412 then generates a post-analysis document set 423 containing the generated documents and stores it in the storage unit 411.

In the case of curation in the medical field, the pre-analysis document set 422 is, for example, a set of medical papers, and the post-analysis document set 423 is, for example, a set of papers to which additional information has been added. In this case, the additional information includes the text range of each named entity, the tag for each named entity, and the relationships between named entities.

The user can edit the documents included in the post-analysis document set 423 via a user interface of the document processing device 401 or via a terminal device capable of communicating with the document processing device 401.

The output unit 418 is a display device or a communication device and outputs a document selected by the user from the post-analysis document set 423. When the output unit 418 is a display device, it displays the selected document on a screen. When the output unit 418 is a communication device, it transmits the selected document to the user's terminal device via a communication network, and the terminal device displays the document received from the document processing device 401 on a screen.

The user refers to the additional information attached to the document displayed on the screen and performs a desired change operation. The user performs a change operation on the additional information by inputting, to the document processing device 401, a change instruction for changing that additional information. The change unit 416 accepts the input change instruction and changes the additional information accordingly. The change unit 416 then generates a change history 424 that includes the content of the change as a change example and stores it in the storage unit 411. In this way, a plurality of change examples indicating the change operations performed by the user accumulate in the change history 424. The change history 424 corresponds to the change history 221 in FIG. 2.
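The behavior of the change unit 416, applying a change instruction and recording it as a change example, can be sketched as follows. The names `ChangeExample` and `AnnotationStore`, the `(text, tag)` representation, and the document identifier are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Annotation = Tuple[str, str]  # (entity text, tag)


@dataclass
class ChangeExample:
    doc_id: str
    before: Optional[Annotation]  # None when an annotation is newly added
    after: Optional[Annotation]   # None when an annotation is deleted


@dataclass
class AnnotationStore:
    by_doc: Dict[str, List[Annotation]] = field(default_factory=dict)
    history: List[ChangeExample] = field(default_factory=list)

    def apply(
        self, doc_id: str, before: Optional[Annotation], after: Optional[Annotation]
    ) -> None:
        """Apply one change instruction and append it to the change history."""
        annotations = self.by_doc.setdefault(doc_id, [])
        if before is not None:
            annotations.remove(before)  # raises ValueError if the annotation is absent
        if after is not None:
            annotations.append(after)
        self.history.append(ChangeExample(doc_id, before, after))


store = AnnotationStore(by_doc={"paper-1": [("advanced gastric cancer", "disease")]})
store.apply(
    "paper-1",
    before=("advanced gastric cancer", "disease"),
    after=("gastric cancer", "disease"),
)
print(store.by_doc["paper-1"], len(store.history))
```

Keeping both the pre-change and post-change annotation in each record is what later allows a change example to be classified by change type.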

The estimation unit 413 classifies each change example included in the change history 424 into one of a plurality of change types and, based on each change example, estimates for each change type the change target of the change operation that the user will perform next on the post-analysis document set 423. The extraction unit 414 searches the post-analysis document set 423 for text corresponding to the change target of each change type and extracts that text as change candidates for that change type.

The change types that can be used include shortening the range of a named entity, extending the range of a named entity, changing the type of a named entity, adding a named entity, deleting a named entity, adding a relationship, deleting a relationship, and changing the type of a relationship. The change target is information that identifies the pre-change text indicated by a change example and also identifies text that contains some of the words of the pre-change text, or their synonyms, and that differs from the pre-change text. Specific examples of a change example of each change type, and of the change target estimated from it, are as follows.
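Classifying a change example into one of these change types can be sketched by comparing the annotation before and after the change. The rules below are an assumption made for illustration (for instance, substring containment is used to detect shortening and extension); the patent text in this section does not state how the classification is computed.

```python
from typing import Optional, Tuple

Entity = Tuple[str, str]  # (entity text, tag)


def classify_entity_change(before: Optional[Entity], after: Optional[Entity]) -> str:
    """Map a before/after pair of entity annotations to a change type."""
    if before is None and after is None:
        raise ValueError("a change example needs a before or an after annotation")
    if before is None:
        return "entity added"
    if after is None:
        return "entity deleted"
    (before_text, before_tag), (after_text, after_tag) = before, after
    if before_text == after_text and before_tag != after_tag:
        return "entity type changed"
    if before_tag == after_tag and before_text != after_text:
        if after_text in before_text:
            return "entity range shortened"
        if before_text in after_text:
            return "entity range extended"
    return "unclassified"


def classify_relation_change(before: Optional[str], after: Optional[str]) -> str:
    """Map a before/after pair of relationship labels to a change type."""
    if before is None and after is not None:
        return "relation added"
    if before is not None and after is None:
        return "relation deleted"
    if before is not None and after is not None and before != after:
        return "relation type changed"
    return "unclassified"


print(classify_entity_change(("advanced gastric cancer", "drug"), ("gastric cancer", "drug")))
print(classify_relation_change("effective", "not effective"))
```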

(C1) Shortening the range of a named entity
The text range of a named entity is shortened. An example is changing “advanced gastric cancer: drug” to “gastric cancer: drug”. “advanced gastric cancer: drug” indicates that the range of the named entity before the change is “advanced gastric cancer” and that the tag given to “advanced gastric cancer” is “drug”. Likewise, “gastric cancer: drug” indicates that the range of the named entity after the change is “gastric cancer” and that the tag given to “gastric cancer” is “drug”.

When the range of a named entity is shortened, a named entity that contains the trailing word or phrase of the post-change named entity, or a synonym of it, and that has the same tag as the post-change named entity can be used as the change target. The trailing word or phrase of the post-change named entity is also part of the pre-change named entity. Synonyms are determined using a thesaurus. In this case, the type of the change candidate is a named entity.

As the change target estimated from the change example that changes “advanced gastric cancer: drug” to “gastric cancer: drug”, “* gastric cancer: drug” can be used, for example. Here, “*” represents an arbitrary character string. Accordingly, named entities such as “advanced gastric cancer” and “progressive gastric cancer” that have the tag “drug” are extracted as change candidates.

“* cancer: drug” can also be used as the change target. In this case, in addition to named entities such as “advanced gastric cancer” and “progressive gastric cancer”, named entities such as “advanced colon cancer” and “progressive colon cancer” are also extracted as change candidates.
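The two wildcard targets for (C1), “* gastric cancer: drug” and “* cancer: drug”, can be illustrated with a short Python sketch. The function names are hypothetical, and the sketch assumes that “*” stands for a non-empty prefix followed by a space; thesaurus-based synonym expansion is omitted.

```python
from typing import List, Tuple

Entity = Tuple[str, str]  # (entity text, tag)


def suffix_targets(after: Entity) -> List[Entity]:
    """Build targets from every trailing word sequence of the post-change entity,
    narrowest first: ("gastric cancer", tag), then ("cancer", tag)."""
    text, tag = after
    words = text.split()
    return [(" ".join(words[i:]), tag) for i in range(len(words))]


def matches(entity: Entity, target: Entity) -> bool:
    """True if the entity has the target tag and its text matches "* <suffix>"."""
    text, tag = entity
    suffix, target_tag = target
    return tag == target_tag and text.endswith(" " + suffix)


narrow, broad = suffix_targets(("gastric cancer", "drug"))

entities = [
    ("advanced gastric cancer", "drug"),
    ("progressive gastric cancer", "drug"),
    ("advanced colon cancer", "drug"),
    ("advanced gastric cancer", "disease"),  # different tag, never a candidate here
]
narrow_hits = [e for e in entities if matches(e, narrow)]
broad_hits = [e for e in entities if matches(e, broad)]
print(narrow_hits)
print(broad_hits)
```

The narrow target keeps the candidate list short and precise, while the broad target trades precision for coverage by also picking up the colon cancer entity.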

(C2) Extending the range of a named entity
The text range of a named entity is extended. An example is changing “gastric cancer: drug” to “advanced gastric cancer: drug”. When the range of a named entity is extended, a named entity that contains the trailing word or phrase of the pre-change named entity, or a synonym of it, and that has the same tag as the pre-change named entity can be used as the change target. In this case, the type of the change candidate is a named entity.

As the change target estimated from the change example that changes “gastric cancer: drug” to “advanced gastric cancer: drug”, “* gastric cancer: drug” or “* cancer: drug” can be used, for example.

(C3) Changing the type of a named entity
The tag of a named entity is changed. An example is changing “gastric cancer: drug” to “gastric cancer: cancer”. When the type of a named entity is changed, a named entity that contains the trailing word or phrase of the named entity before and after the change, or a synonym of it, and that has the same tag as the pre-change named entity can be used as the change target. In this case, the type of the change candidate is a named entity.

As the change target estimated from the change example that changes “gastric cancer: drug” to “gastric cancer: cancer”, “* gastric cancer: drug” or “* cancer: drug” can be used, for example.

(C4) Adding a named entity
The text range and tag of a named entity are added to the additional information. An example is changing “-” to “nivolumab: drug”. “-” indicates that no named entity range was specified before the change. “nivolumab: drug” indicates that the range of the named entity after the change is “nivolumab” and that the tag given to “nivolumab” is “drug”.

When a named entity is added, the trailing word or phrase of the post-change named entity, or a synonym of it, can be used as the change target. The trailing word or phrase of the post-change named entity is also part of the pre-change named entity. In this case, the type of the change candidate is a character string.

As the change target estimated from the change example that changes “-” to “nivolumab: drug”, “nivolumab” or “AAAAAA” can be used, for example. Here, “AAAAAA” stands for a synonym of “nivolumab”. In this case, “nivolumab” or “AAAAAA” is extracted as a change candidate.
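For (C4) the change candidate is a plain character string, so extraction amounts to a text search for the added term and its synonyms. The sketch below assumes a thesaurus represented as a dictionary and skips occurrences that are already annotated; the function name, the dictionary format, and the sample sentence are assumptions, and “AAAAAA” is the placeholder synonym used in the text above.

```python
import re
from typing import Dict, List, Set, Tuple


def find_unannotated(
    text: str,
    term: str,
    thesaurus: Dict[str, List[str]],
    annotated_spans: Set[Tuple[int, int]],
) -> List[Tuple[int, int, str]]:
    """Return (start, end, matched string) for each occurrence of the term or
    one of its synonyms that is not already annotated."""
    terms = [term] + thesaurus.get(term, [])
    pattern = re.compile(r"\b(?:" + "|".join(re.escape(t) for t in terms) + r")\b")
    hits = []
    for match in pattern.finditer(text):
        if (match.start(), match.end()) not in annotated_spans:
            hits.append((match.start(), match.end(), match.group()))
    return hits


# Made-up text; the first "nivolumab" is treated as already annotated.
text = "nivolumab was given; AAAAAA was continued. Later nivolumab was stopped."
first = text.index("nivolumab")
already_annotated = {(first, first + len("nivolumab"))}
thesaurus = {"nivolumab": ["AAAAAA"]}

hits = find_unannotated(text, "nivolumab", thesaurus, already_annotated)
print([matched for _start, _end, matched in hits])
```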

(C5) Deleting a named entity
The text range and tag of a named entity are deleted from the additional information. An example is changing “nivolumab: drug” to “-”. When a named entity is deleted, a named entity that contains the trailing word or phrase of the pre-change named entity, or a synonym of it, and that has the same tag as the pre-change named entity can be used as the change target. In this case, the type of the change candidate is a named entity.

As the change target estimated from the change example that changes “nivolumab: drug” to “-”, “* nivolumab: drug” or “* AAAAAA: drug” can be used, for example. In this case, named entities such as “nivolumab” and “AAAAAA” that have the tag “drug” are extracted as change candidates.

(C6) Adding a relationship
A relationship between named entities is added to the additional information. An example is adding the relationship “effective” between “gefitinib: drug” and “lung cancer: disease”.

When a relationship is added, a pair consisting of a named entity E1 and a named entity E2 that have no relationship between them can be used as the change target. The named entity E1 is a named entity that contains the trailing word or phrase of one of the pre-change named entities, or a synonym of it, and that has the same tag as that named entity. The named entity E2 is a named entity that contains the trailing word or phrase of the other pre-change named entity, or a synonym of it, and that has the same tag as that named entity. In this case, the type of the change candidate is a pair of named entities with no relationship between them.

As the change target estimated from the change example that adds the relationship “effective” between “gefitinib: drug” and “lung cancer: disease”, a pair consisting of “* gefitinib: drug” and “* cancer: disease” appearing in the same sentence can be used, for example. However, only pairs that have no relationship between them are designated as change targets. In this case, the pair of “gefitinib: drug” and “lung cancer: disease”, the pair of “gefitinib: drug” and “gastric cancer: disease”, and the like are extracted as change candidates.
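Extracting candidate pairs for (C6) can be sketched as a search, sentence by sentence, for a drug entity and a disease entity that match the targets and are not yet linked by a relationship. The entity identifiers, the function names, and the set-of-pairs representation of existing relationships are assumptions made for this illustration, and synonym handling is omitted.

```python
from itertools import product
from typing import List, Set, Tuple

Entity = Tuple[str, str, str]  # (entity id, entity text, tag)


def ends_with_word(text: str, word: str) -> bool:
    """True if the final whitespace-separated word of text equals word."""
    return text.split()[-1:] == [word]


def unrelated_pairs(
    sentences: List[List[Entity]],
    related: Set[Tuple[str, str]],
    first: Tuple[str, str],
    second: Tuple[str, str],
) -> List[Tuple[str, str]]:
    """Return (text1, text2) for entity pairs in the same sentence that match the
    two (final word, tag) targets and have no relationship yet."""
    pairs = []
    for entities in sentences:
        firsts = [e for e in entities if e[2] == first[1] and ends_with_word(e[1], first[0])]
        seconds = [e for e in entities if e[2] == second[1] and ends_with_word(e[1], second[0])]
        for a, b in product(firsts, seconds):
            if (a[0], b[0]) not in related:
                pairs.append((a[1], b[1]))
    return pairs


sentences = [
    [("e1", "gefitinib", "drug"), ("e2", "lung cancer", "disease")],     # already related
    [("e3", "gefitinib", "drug"), ("e4", "gastric cancer", "disease")],  # candidate
    [("e5", "gefitinib", "drug")],                                       # no disease in sentence
]
related = {("e1", "e2")}

pairs = unrelated_pairs(sentences, related, ("gefitinib", "drug"), ("cancer", "disease"))
print(pairs)
```

Inverting the membership test (keeping only pairs that are in `related`, optionally filtered by relationship label) would give the corresponding search for the deletion and type-change cases.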

(C7) Deleting a relationship
A relationship between named entities is deleted from the additional information. An example is deleting the relationship “effective” between “gefitinib: drug” and “lung cancer: disease”.

When a relationship is deleted, a pair consisting of a named entity E1 and a named entity E2 that have the pre-change relationship between them can be used as the change target. The named entity E1 is a named entity that contains the trailing word or phrase of one of the pre-change named entities, or a synonym of it, and that has the same tag as that named entity. The named entity E2 is a named entity that contains the trailing word or phrase of the other pre-change named entity, or a synonym of it, and that has the same tag as that named entity. In this case, the type of the change candidate is a pair of named entities with a relationship between them.

As the change target estimated from the change example that deletes the relationship “effective” between “gefitinib: drug” and “lung cancer: disease”, a pair consisting of “* gefitinib: drug” and “* cancer: disease” appearing in the same sentence can be used, for example. However, only pairs that have the relationship “effective” are designated as change targets. In this case, the pair of “gefitinib: drug” and “lung cancer: disease”, the pair of “gefitinib: drug” and “gastric cancer: disease”, and the like that have the relationship “effective” are extracted as change candidates.

（Ｃ８）関係の種類の変更
固有表現間の関係が変更される。変更事例としては、例えば、“gefitinib：薬”と“lung cancer：病気”との間の関係を“効果あり”から“効果なし”に変更する事例が挙げられる。 (C8) Change of relationship type. The relationship between named entities is changed. An example of such a change is changing the relationship between "gefitinib: drug" and "lung cancer: disease" from "effective" to "ineffective".

関係の種類の変更の場合、変更前の関係が付与された固有表現Ｅ１と固有表現Ｅ２との組を、変更対象として用いることができる。固有表現Ｅ１は、変更前の一方の固有表現の末尾の語句又はその同義語を含み、その固有表現と同じタグが付与された固有表現を表す。固有表現Ｅ２は、変更前の他方の固有表現の末尾の語句又はその同義語を含み、その固有表現と同じタグが付与された固有表現を表す。この場合、変更候補の種類は、関係が付与された固有表現の組である。 In the case of changing the type of a relationship, a pair of named entity E1 and named entity E2 to which the pre-change relationship has been assigned can be used as the change target. Named entity E1 is a named entity that contains the last term of one of the pre-change named entities, or a synonym of that term, and that has the same tag as that named entity. Named entity E2 is a named entity that contains the last term of the other pre-change named entity, or a synonym of that term, and that has the same tag as that named entity. In this case, the type of change candidate is a pair of named entities to which a relationship has been assigned.

“gefitinib：薬”と“lung cancer：病気”との間の関係を“効果あり”から“効果なし”に変更する変更事例から推定される変更対象としては、例えば、同じ文に出現する“* gefitinib：薬”と“* cancer：病気”との組を用いることができる。ただし、“効果あり”という関係が付与された組のみが、変更対象として指定される。 As the change target estimated from a change example that changes the relationship between "gefitinib: drug" and "lung cancer: disease" from "effective" to "ineffective", a pair of "* gefitinib: drug" and "* cancer: disease" appearing in the same sentence can be used, for example. However, only pairs to which the relationship "effective" has been assigned are designated as change targets.

（Ｃ１）～（Ｃ８）に示したような変更対象を推定結果として用いることで、変更事例が示す変更前の固有表現のみならず、変更前の固有表現の一部の語句又はその同義語を含む別の固有表現を変更対象に含めることができる。これにより、ユーザが次に行う変更操作を事前に予測して、予測結果に基づく変更候補を解析後文書集合４２３から抽出することが可能になる。 By using change targets such as those shown in (C1) to (C8) as estimation results, the change targets can include not only the pre-change named entity indicated by the change example, but also other named entities that contain some of the terms of the pre-change named entity or their synonyms. This makes it possible to predict in advance the change operation that the user will perform next, and to extract change candidates based on the prediction from the post-analysis document set 423.

変更種類毎の変更候補が抽出された後、推定部４１３は、解析後文書集合４２３における変更種類毎の変更候補の出現頻度を求め、その出現頻度に基づいて、複数の変更種類のうち特定の変更種類を選択する。 After the change candidates for each change type are extracted, the estimation unit 413 calculates the appearance frequency of the change candidates for each change type in the post-analysis document set 423, and based on the appearance frequency, selects a specific change type from among the plurality of change types. Select the change type.

また、推定部４１３は、変更種類毎の変更候補の出現頻度と、各文書中の各部分文書に含まれる変更種類毎の変更候補の個数とに基づいて、各部分文書の評価値を計算し、計算された評価値に基づいて、特定の部分文書を選択する。 Furthermore, the estimation unit 413 calculates an evaluation value for each partial document based on the appearance frequency of change candidates for each change type and on the number of change candidates of each change type contained in each partial document of each document, and selects a specific partial document based on the calculated evaluation values.

図５は、変更履歴４２４に含まれる変更事例の分類結果の例を示している。変更ＩＤは、変更種類の識別情報であり、変更前付加情報は、変更事例が示す変更操作が行われる前の付加情報を表し、変更後付加情報は、変更事例が示す変更操作が行われた後の付加情報を表す。変更対象は、変更事例から推定される変更対象を表し、変更候補の種類は、変更対象に対応するテキストの種類を表し、事例スコア１及び事例スコア２は、変更種類の評価値を表す。 FIG. 5 shows an example of the classification results of the change examples included in the change history 424. The change ID is identification information of the change type. The pre-change additional information represents the additional information before the change operation indicated by the change example is performed, and the post-change additional information represents the additional information after that change operation has been performed. The change target represents the change target estimated from the change example, the change candidate type represents the type of text corresponding to the change target, and case score 1 and case score 2 represent evaluation values of the change type.

変更ＩＤ“１”は、固有表現の範囲の短縮を示し、“progressive gastric cancer：病気”を“gastric cancer：病気”に変更する変更事例が、変更ＩＤ“１”に分類されている。この例では、“* gastric cancer：病気”が変更対象に決定される。 Change ID "1" indicates shortening of the range of a named entity, and a change example in which "progressive gastric cancer: disease" is changed to "gastric cancer: disease" is classified under change ID "1". In this example, "* gastric cancer: disease" is determined as the change target.

変更ＩＤ“２”は、固有表現の種類の変更を示し、“AAAAAA：製品”を“AAAAAA：薬品”に変更する変更事例が、変更ＩＤ“２”に分類されている。この例では、“* AAAAAA：製品”又は“nivolumab：製品”が変更対象に決定される。 Change ID "2" indicates a change in the type of a named entity, and a change example in which "AAAAAA: product" is changed to "AAAAAA: drug" is classified under change ID "2". In this example, "* AAAAAA: product" or "nivolumab: product" is determined as the change target.

変更ＩＤ“３”は、固有表現の追加を示し、“-”を“nivolumab：薬”に変更する変更事例が、変更ＩＤ“３”に分類されている。この例では、“nivolumab”又は“AAAAAA”が変更対象に決定される。 Change ID "3" indicates addition of a named entity, and a change example in which "-" is changed to "nivolumab: drug" is classified under change ID "3". In this example, "nivolumab" or "AAAAAA" is determined as the change target.

変更ＩＤ“４”は、固有表現の削除を示し、“tumor：病気”を“-”に変更する変更事例が、変更ＩＤ“４”に分類されている。この例では、“* tumor：病気”が変更対象に決定される。 Change ID "4" indicates deletion of a named entity, and a change example in which "tumor: disease" is changed to "-" is classified under change ID "4". In this example, "* tumor: disease" is determined as the change target.

変更ＩＤ“５”は、関係の削除を示し、“gefitinib：薬”と“lung cancer：病気”との間の“=>”という関係を削除する変更事例が、変更ＩＤ“５”に分類されている。この例では、“=>”という関係が付与された“* gefitinib：薬”と“* lung cancer：病気”との組が、変更対象に決定される。 Change ID “5” indicates deletion of a relationship, and a change example that deletes the relationship “=>” between “gefitinib: drug” and “lung cancer: disease” is classified as change ID “5”. ing. In this example, the pair of "* gefitinib: drug" and "* lung cancer: disease" to which the relationship "=>" is attached is determined to be the change target.

図６は、部分文書の一例である段落の評価値の例を示している。段落ＩＤは、解析後文書集合４２３に含まれる各文書中の各段落の識別情報であり、段落スコアは、段落の評価値を表す。変更ＩＤは、段落から抽出された変更候補に対応する変更種類の変更ＩＤを表す。図６の例では、簡単のため、段落“１”～段落“４”のみが示されているが、解析後文書集合４２３には、より多くの段落が含まれていてもよい。 FIG. 6 shows an example of evaluation values for a paragraph, which is an example of a partial document. The paragraph ID is identification information of each paragraph in each document included in the post-analysis document set 423, and the paragraph score represents the evaluation value of the paragraph. The change ID represents the change ID of the change type corresponding to the change candidate extracted from the paragraph. In the example of FIG. 6, only paragraphs “1” to “4” are shown for simplicity, but the post-analysis document set 423 may include more paragraphs.

推定部４１３は、各段落から抽出された変更候補毎に、変更候補に対応する変更対象を特定し、特定された変更対象が属する変更種類の変更ＩＤを求める。例えば、段落“１”には、変更ＩＤ“１”、変更ＩＤ“２”、及び変更ＩＤ“４”それぞれに対応する変更候補が１個ずつ含まれている。また、段落“２”には、変更ＩＤ“３”及び変更ＩＤ“４”それぞれに対応する変更候補が１個ずつ含まれている。 For each change candidate extracted from each paragraph, the estimation unit 413 identifies a change target corresponding to the change candidate, and obtains a change ID of the change type to which the identified change target belongs. For example, paragraph “1” includes one change candidate corresponding to each of change ID “1”, change ID “2”, and change ID “4”. Furthermore, paragraph “2” includes one change candidate corresponding to each of change ID “3” and change ID “4”.

推定部４１３は、すべての段落から抽出されたすべての変更候補の変更ＩＤを基に、各変更種類の変更候補の出現頻度を求め、求めた出現頻度を、図５の事例スコア１として記録する。例えば、変更ＩＤ“１”は、段落“１”、段落“３”、及び段落“４”に１個ずつ含まれているため、変更ＩＤ“１”の事例スコア１は“３”となる。また、変更ＩＤ“２”は、段落“１”に１個だけ含まれているため、変更ＩＤ“２”の事例スコア１は“１”となる。 The estimation unit 413 calculates the appearance frequency of the change candidates of each change type based on the change IDs of all change candidates extracted from all paragraphs, and records the calculated appearance frequency as case score 1 in FIG. 5. For example, since change ID "1" appears once in each of paragraph "1", paragraph "3", and paragraph "4", case score 1 of change ID "1" is "3". Likewise, since change ID "2" appears only once, in paragraph "1", case score 1 of change ID "2" is "1".

次に、推定部４１３は、各段落に含まれる変更ＩＤの事例スコア１の合計を、図６の段落スコアとして記録する。例えば、段落“１”の段落スコアは、変更ＩＤ“１”、変更ＩＤ“２”、及び変更ＩＤ“４”の事例スコア１の合計であり、段落スコアは“８”となる。また、段落“２”の段落スコアは、変更ＩＤ“３”及び変更ＩＤ“４”の事例スコア１の合計であり、段落スコアは“７”となる。 Next, the estimating unit 413 records the sum of case scores 1 of change IDs included in each paragraph as the paragraph score in FIG. 6 . For example, the paragraph score of paragraph “1” is the sum of case scores 1 of change ID “1”, change ID “2”, and change ID “4”, and the paragraph score is “8”. Further, the paragraph score of paragraph “2” is the sum of case score 1 of change ID “3” and change ID “4”, and the paragraph score is “7”.

次に、推定部４１３は、各変更ＩＤの変更候補を含む段落の段落スコアの合計を求め、段落スコアの合計に変更ＩＤの事例スコア１を乗算することで、事例スコア２を計算する。 Next, the estimating unit 413 calculates the total paragraph score of the paragraphs including the change candidates of each change ID, and calculates the case score 2 by multiplying the sum of the paragraph scores by the case score 1 of the change ID.

例えば、変更ＩＤ“１”は、段落“１”、段落“３”、及び段落“４”に含まれているため、段落スコアの合計は、８＋１１＋１０＝２９となる。そして、変更ＩＤ“１”の事例スコア１は“３”であるため、変更ＩＤ“１”の事例スコア２は、３＊２９＝８７となる。 For example, since change ID "1" is included in paragraph "1", paragraph "3", and paragraph "4", the total paragraph score is 8+11+10=29. Since the case score 1 of the change ID "1" is "3", the case score 2 of the change ID "1" is 3*29=87.

また、変更ＩＤ“２”は、段落“１”だけに含まれているため、段落スコアの合計は、“８”となる。そして、変更ＩＤ“２”の事例スコア１は“１”であるため、変更ＩＤ“２”の事例スコア２は、１＊８＝８となる。 Furthermore, since the change ID "2" is included only in the paragraph "1", the total paragraph score is "8". Since the case score 1 of the change ID "2" is "1", the case score 2 of the change ID "2" is 1*8=8.
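The scoring steps above can be reproduced with a short sketch. The contents of paragraphs 1 and 2 and the placement of change ID 1 follow the text; the remaining contents of paragraphs 3 and 4 are hypothetical values chosen only so that the paragraph scores come out to the 8, 7, 11, and 10 quoted above, and may differ from the actual FIG. 6. The function names are illustrative.

```python
from collections import Counter

# Change IDs of the change candidates found in each paragraph.
# Paragraphs 3 and 4 are partly hypothetical (see the lead-in).
paragraph_change_ids = {
    1: [1, 2, 4],
    2: [3, 4],
    3: [1, 3, 4, 5],
    4: [1, 3, 4],
}


def case_score_1(paragraphs):
    """Appearance frequency of change candidates per change ID."""
    counts = Counter()
    for ids in paragraphs.values():
        counts.update(ids)
    return dict(counts)


def paragraph_scores(paragraphs, score1):
    """Sum of case score 1 over the change IDs contained in each paragraph."""
    return {pid: sum(score1[cid] for cid in ids) for pid, ids in paragraphs.items()}


def case_score_2(paragraphs, score1, p_scores):
    """Case score 1 times the total score of the paragraphs containing the change ID."""
    result = {}
    for cid, s1 in score1.items():
        total = sum(p_scores[pid] for pid, ids in paragraphs.items() if cid in ids)
        result[cid] = s1 * total
    return result


score1 = case_score_1(paragraph_change_ids)
p_scores = paragraph_scores(paragraph_change_ids, score1)
score2 = case_score_2(paragraph_change_ids, score1, p_scores)
print(score1[1], p_scores[1], score2[1])  # 3 8 87
print(score1[2], score2[2])               # 1 8
```

For change ID 1 this gives 3 * (8 + 11 + 10) = 87, and for change ID 2 it gives 1 * 8 = 8, matching the worked example in the text.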

次に、推定部４１３は、事例スコア２が大きい順に変更種類を選択するとともに、段落スコアが大きい順にＭ件（Ｍは１以上の整数）の段落を、特定の部分文書として選択する。 Next, the estimating unit 413 selects change types in descending order of case score 2, and selects M paragraphs (M is an integer greater than or equal to 1) in descending order of paragraph scores as specific partial documents.
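The selection step can be sketched as two sorts. The function names are hypothetical, and breaking ties by the smaller ID is an assumption, since the text does not specify tie handling. The scores used below are the ones quoted in the worked example.

```python
def rank_change_types(case_score_2):
    """Change IDs in descending order of case score 2 (ties: smaller ID first)."""
    return sorted(case_score_2, key=lambda cid: (-case_score_2[cid], cid))


def select_paragraphs(paragraph_scores, m):
    """The M paragraph IDs with the highest paragraph scores (ties: smaller ID first)."""
    if m < 1:
        raise ValueError("M must be an integer of 1 or more")
    ranked = sorted(paragraph_scores, key=lambda pid: (-paragraph_scores[pid], pid))
    return ranked[:m]


# Scores quoted in the text: case score 2 of change IDs 1 and 2,
# and the paragraph scores of paragraphs 1 to 4.
print(rank_change_types({1: 87, 2: 8}))                    # [1, 2]
print(select_paragraphs({1: 8, 2: 7, 3: 11, 4: 10}, m=2))  # [3, 4]
```

When the user asks for a different candidate, the same ranking can simply be consulted for the paragraph with the next highest score.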

生成部４１７は、選択された特定の部分文書を強調表示する情報を含む変更候補情報を生成し、出力部４１８は、生成された変更候補情報を出力する。 The generation unit 417 generates change candidate information including information for highlighting the selected specific partial document, and the output unit 418 outputs the generated change candidate information.

出力部４１８が表示装置である場合、出力部４１８は、特定の部分文書を含む文書を画面上に表示し、特定の部分文書のテキストを強調表示する。強調表示の形態としては、テキストの表示色の変更、テキストに対するマーカ又は囲み枠の追加等を用いることができる。出力部４１８が通信装置である場合、出力部４１８は、特定の部分文書を含む文書と変更候補情報とを、通信ネットワークを介してユーザの端末装置へ送信する。端末装置は、文書処理装置４０１から受信した文書を画面上に表示し、特定の部分文書のテキストを強調表示する。 When the output unit 418 is a display device, the output unit 418 displays a document including a specific partial document on the screen, and highlights the text of the specific partial document. As a form of highlighting, changing the display color of the text, adding a marker or a surrounding frame to the text, etc. can be used. When the output unit 418 is a communication device, the output unit 418 transmits the document including the specific partial document and the change candidate information to the user's terminal device via the communication network. The terminal device displays the document received from the document processing device 401 on the screen and highlights the text of a specific partial document.

図７は、強調表示された部分文書の例を示している。“BBB mutation”、“gastric cancer”、“celecoxib”、“advanced gastric cancer”、“gefitinib”、“bladder cancer”、及び“lung cancer”は変更候補を表し、段落７０１は、強調表示された部分文書を表す。段落７０１は、表示色の変更、マーカ、囲み枠等により強調表示される。 FIG. 7 shows an example of a highlighted partial document. "BBB mutation", "gastric cancer", "celecoxib", "advanced gastric cancer", "gefitinib", "bladder cancer", and "lung cancer" are change candidates, and paragraph 701 is the highlighted partial document. Paragraph 701 is highlighted by a change of display color, a marker, a surrounding frame, or the like.

例えば、“gastric cancer”及び“advanced gastric cancer”は、“* gastric cancer：病気”という変更対象に対応する変更候補である。この変更対象は、例えば、“progressive gastric cancer：病気”を“gastric cancer：病気”に変更する変更事例から推定される。 For example, "gastric cancer" and "advanced gastric cancer" are change candidates corresponding to the change target "* gastric cancer: disease." This change target is estimated from, for example, a change example in which "progressive gastric cancer: disease" is changed to "gastric cancer: disease."
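How a change target such as "* gastric cancer: disease" selects change candidates can be sketched as a suffix match on the entity text combined with a tag check. This is an assumed reading of the "*" notation (zero or more leading words), which is consistent with both "gastric cancer" and "advanced gastric cancer" being candidates; the function name and sample entities are hypothetical, and synonym handling is omitted.

```python
def matches_change_target(text, tag, target_term, target_tag):
    """True if the entity has the target tag and its text ends with the target term."""
    words = text.lower().split()
    term_words = target_term.lower().split()
    if not term_words or len(words) < len(term_words):
        return False
    return tag == target_tag and words[-len(term_words):] == term_words


# Hypothetical tagged entities from an analyzed document.
entities = [
    ("gastric cancer", "disease"),
    ("advanced gastric cancer", "disease"),
    ("bladder cancer", "disease"),
    ("gastric cancer", "product"),
]
candidates = [
    (text, tag) for text, tag in entities
    if matches_change_target(text, tag, "gastric cancer", "disease")
]
print(candidates)
```

Only the first two entities are kept: "bladder cancer" fails the suffix match and the last entity fails the tag check.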

“gastric cancer”及び“advanced gastric cancer”に付与されたタグ“病気”は、各変更候補のテキストをクリックすることで、画面上に表示される。また、変更候補間の関係は、変更候補間の矢印をクリックすることで、画面上に表示される。 The tag "disease" attached to "gastric cancer" and "advanced gastric cancer" can be displayed on the screen by clicking on the text of each candidate change. Furthermore, the relationship between the change candidates is displayed on the screen by clicking the arrows between the change candidates.

ユーザは、強調表示された段落中の変更候補と、その変更候補に付与されたタグと、変更候補間の関係とを確認し、それらの付加情報に対する所望の変更操作を行う。変更部４１６は、ユーザが入力した変更指示を受け付け、受け付けた変更指示に従って付加情報を変更する。そして、変更部４１６は、変更内容を示す変更事例を変更履歴４２４に追加する。 The user checks the change candidates in the highlighted paragraph, the tags attached to the change candidates, and the relationships between the change candidates, and performs a desired change operation on the additional information. The changing unit 416 receives a change instruction input by the user, and changes the additional information according to the received change instruction. Then, the changing unit 416 adds a change example indicating the content of the change to the change history 424.

このように、推定部４１３及び抽出部４１４を設けることで、ユーザが解析後文書集合４２３に含まれる付加情報を変更した場合、変更事例に関連する変更候補が自動的に抽出されて、ユーザに提示される。提示される変更候補は、解析後文書集合４２３内で変更事例とは異なる位置に出現する、変更事例と同じテキストである場合もあり、変更事例と類似しているが微妙に異なるテキストである場合もある。 In this way, by providing the estimation unit 413 and the extraction unit 414, when the user changes additional information included in the post-analysis document set 423, change candidates related to the change example are automatically extracted and presented to the user. A presented change candidate may be the same text as the change example appearing at a different position in the post-analysis document set 423, or it may be text that is similar to the change example but slightly different.

ユーザは、提示された変更候補を確認して変更するだけで、解析後文書集合４２３に対して、変更事例と同様の変更操作を行うことができる。したがって、新たな変更候補を手作業で検索する必要がなくなり、変更作業が効率化される。この場合、強調表示された部分文書中の変更候補が変更される可能性が高くなり、それ以外の部分文書中の変更候補が変更される可能性は低くなる。 Simply by checking and changing the presented change candidates, the user can perform change operations similar to the change example on the post-analysis document set 423. There is therefore no need to search for new change candidates manually, and the change work becomes more efficient. In this case, the change candidates in the highlighted partial document are more likely to be changed, and the change candidates in other partial documents are less likely to be changed.

例えば、医療分野のキュレーションの場合、ユーザであるキュレータは、過去に変更した付加情報と同様の付加情報を変更する可能性が高い。 For example, in the case of curation in the medical field, a curator who is a user is likely to change additional information similar to additional information changed in the past.

一例として、キュレータが、“病気”というタグが付与された固有表現の範囲を、“advanced gastric cancer”から“gastric cancer”に変更し、さらに、いくつかの病気についても、修飾表現をタグの付与対象から除外した場合を想定する。この場合、そのキュレータは、病気の修飾表現をタグの付与対象から除外する変更操作を繰り返す可能性がある。 As an example, assume that a curator changes the range of a named entity tagged "disease" from "advanced gastric cancer" to "gastric cancer", and then likewise excludes modifying expressions from the tagged range for several other diseases. In this case, the curator is likely to repeat the change operation of excluding modifying expressions of diseases from the tagged range.

別の例として、キュレータが、“病気”というタグが付与された“melancholia”（鬱病）、“postpartum depression”（産後鬱）等の固有表現を、タグの付与対象から除外した場合を想定する。この場合、そのキュレータは、精神病に興味を持っていないという理由により、精神病に関する表現をタグの付与対象から除外する変更操作を繰り返す可能性がある。 As another example, assume that the curator removes the tag from named entities tagged "disease", such as "melancholia" and "postpartum depression". In this case, the curator may repeat the change operation of excluding expressions related to mental illness from tagging, because the curator is not interested in mental illness.

さらに別の例として、キュレータが、“薬”というタグが付与された“gefitinib”に関する固有表現間の関係をいくつか削除した場合を想定する。この場合、ＦＤＡ（Food and Drug Administration）によるgefitinibの認定が取り消されたという理由により、そのキュレータは、“薬”というタグが付与された“gefitinib”に関するあらゆる関係を削除する変更操作を繰り返す可能性がある。 As yet another example, assume that the curator deletes several relationships between named entities involving "gefitinib", which is tagged "drug". In this case, the curator may repeat the change operation of deleting every relationship involving "gefitinib" tagged "drug", for the reason that the FDA (Food and Drug Administration) approval of gefitinib has been revoked.

したがって、ユーザが過去に行った変更操作を示す変更事例に基づいて変更対象を推定することで、ユーザが変更する可能性の高い変更候補を抽出して提示することが可能になる。変更候補をユーザに提示し、ユーザが実際に変更した変更候補を変更事例として変更履歴４２４に追加することで、ユーザによる変更操作の情報が蓄積され、変更対象の推定精度が向上する。 Therefore, by estimating the change target based on change examples indicating change operations performed by the user in the past, it becomes possible to extract and present change candidates that are likely to be changed by the user. By presenting change candidates to the user and adding change candidates actually changed by the user to the change history 424 as change examples, information on change operations by the user is accumulated, and the accuracy of estimating the change target is improved.

変更候補をユーザに提示する際に、各部分文書に含まれる変更候補の個数に基づいて部分文書の評価値を計算し、部分文書の評価値に基づいて、強調表示される部分文書を選択することで、より多くの変更候補を含む部分文書を優先的に提示することができる。したがって、ユーザは、提示された部分文書に対する複数の変更操作を集中的に行うことができ、変更作業がさらに効率化される。 When presenting change candidates to the user, calculate the evaluation value of the partial document based on the number of change candidates included in each partial document, and select the partial document to be highlighted based on the evaluation value of the partial document. This allows partial documents containing more change candidates to be presented preferentially. Therefore, the user can intensively perform multiple modification operations on the presented partial document, and the modification work can be made more efficient.

上述した（Ｃ１）～（Ｃ８）のような変更種類毎に変更対象を推定することで、変更種類の特徴に応じた適切な変更候補を提示することが可能になる。例えば、“advanced gastric cancer：薬”が“gastric cancer：薬”に変更された場合、“advanced gastric cancer：薬”、“advanced colon cancer：薬”、“progressive colon cancer：薬”等が、変更候補として提示される。 By estimating the change target for each change type such as (C1) to (C8) described above, it becomes possible to present appropriate change candidates according to the characteristics of the change type. For example, if "advanced gastric cancer: medicine" is changed to "gastric cancer: medicine", then "advanced gastric cancer: medicine", "advanced colon cancer: medicine", "progressive colon cancer: medicine", and the like are presented as change candidates.

また、“gastric cancer：病気”が“gastric cancer：がん”に変更された場合、“gastric cancer：病気”、“colon cancer：病気”等が、変更候補として提示される。“gefitinib：薬”と“lung cancer：病気”との間の関係が削除された場合、同じ関係が付与された“gefitinib：薬”と“lung cancer：病気”との組、“gefitinib：薬”と“colorectal cancer：病気”との組等が、変更候補として提示される。 Furthermore, when "gastric cancer: disease" is changed to "gastric cancer: cancer", "gastric cancer: disease", "colon cancer: disease", and the like are presented as change candidates. When the relationship between "gefitinib: drug" and "lung cancer: disease" is deleted, pairs to which the same relationship has been assigned, such as the pair of "gefitinib: drug" and "lung cancer: disease" and the pair of "gefitinib: drug" and "colorectal cancer: disease", are presented as change candidates.

なお、強調表示された部分文書に含まれる変更候補は、ユーザが変更する可能性の高い変更候補であるが、必ずしもユーザが希望する変更候補であるとは限らない。強調表示された部分文書の変更候補を変更する必要がない場合、ユーザは、変更操作を行うことなく、文書処理装置４０１に対して別の変更候補の提示を要求する。この場合、文書処理装置４０１は、次に大きな段落スコアを有する部分文書を強調表示する。 Note that the change candidates included in the highlighted partial document are change candidates that are likely to be changed by the user, but are not necessarily the change candidates that the user desires. If there is no need to change the change candidate of the highlighted partial document, the user requests the document processing device 401 to present another change candidate without performing a change operation. In this case, the document processing device 401 highlights the partial document having the next highest paragraph score.

図５の変更事例では、解析後文書集合４２３に含まれる文書のテキストに付加された付加情報が変更されているが、ユーザは、任意の文書集合に含まれる文書のテキスト自体を変更することもできる。ユーザが文書のテキストを変更した場合も、付加情報が変更された場合と同様にして、変更事例に関連する変更候補が自動的に抽出され、ユーザに提示される。 In the change examples of FIG. 5, the additional information attached to the text of documents included in the post-analysis document set 423 is changed, but the user can also change the text itself of a document included in any document set. When the user changes the text of a document, change candidates related to the change example are automatically extracted and presented to the user in the same way as when additional information is changed.

変更候補が自動的に抽出されてユーザに提示されたとしても、ユーザが提示された多数の変更候補を１つずつ確認して変更する場合、ユーザの作業負荷が増加する。したがって、ユーザが同様の変更を数件程度行うだけで、その変更内容が解析後文書集合４２３全体に反映されることが望ましい。 Even if change candidates are automatically extracted and presented to the user, the user's workload increases if the user reviews and changes the many presented change candidates one by one. Therefore, it is desirable that the user only need to make a few similar changes and the changes will be reflected in the entire post-analysis document set 423.

そこで、分類部４１５は、解析後文書集合４２３から抽出された変更候補のテキストを、そのテキストの前後に存在するテキストに基づいてクラスタリングすることで、複数のクラスタを生成する。 Therefore, the classification unit 415 generates a plurality of clusters by clustering the text of the change candidate extracted from the post-analysis document set 423 based on the texts that exist before and after the text.

例えば、分類部４１５は、推定部４１３により選択された変更種類の変更候補を、その変更候補の前後に存在するテキストに基づいてクラスタリングすることで、複数のクラスタを生成する。各クラスタには、１つ以上の変更候補が含まれる。そして、分類部４１５は、生成されたクラスタを、分類結果４２５として記憶部４１１に格納する。 For example, the classification unit 415 generates a plurality of clusters by clustering change candidates of the change type selected by the estimation unit 413 based on text existing before and after the change candidates. Each cluster includes one or more change candidates. The classification unit 415 then stores the generated clusters in the storage unit 411 as classification results 425.

クラスタリングのアルゴリズムとしては、階層型クラスタリング又は非階層型クラスタリングを用いることができる。例えば、非階層型クラスタリングの一例であるｋ－ｍｅａｎｓ法を採用した場合、以下の手順で分類結果４２５を生成することができる。
（Ｐ１１）分類部４１５は、変更候補のテキストの前後のＷ個（Ｗは１以上の整数）の単語を、bag of wordsによりベクトル化することで、変更候補の周辺の文脈を表す単語ベクトルを生成する。
（Ｐ１２）分類部４１５は、ｋ－ｍｅａｎｓ法により、生成された単語ベクトルをＣ個（Ｃは２以上の整数）のクラスタに分類する。ｋ－ｍｅａｎｓ法の距離関数としては、特徴ベクトル間のコサイン距離、ユークリッド距離、マハラノビス距離等を用いることができる。 Hierarchical clustering or non-hierarchical clustering can be used as a clustering algorithm. For example, when the k-means method, which is an example of non-hierarchical clustering, is adopted, the classification result 425 can be generated by the following procedure.
(P11) The classification unit 415 vectorizes the W words (W is an integer of 1 or more) before and after the text of a change candidate using bag of words, thereby generating a word vector that represents the context around the change candidate.
(P12) The classification unit 415 classifies the generated word vectors into C clusters (C is an integer of 2 or more) using the k-means method. As the distance function of the k-means method, a cosine distance, Euclidean distance, Mahalanobis distance, etc. between feature vectors can be used.
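Step (P12) can be illustrated with a minimal k-means sketch in plain Python. It uses the Euclidean distance, one of the distance functions named above. Seeding the centroids with the first C vectors, the iteration limit, the function names, and the sample vectors are all assumptions made for the illustration; a production system would more likely rely on an established clustering library.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def mean_vector(vectors):
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def k_means(vectors, c, max_iter=100):
    """Assign each vector to one of C clusters; returns a list of cluster indices."""
    if not 2 <= c <= len(vectors):
        raise ValueError("C must be between 2 and the number of vectors")
    # Naive deterministic initialization (assumption): first C vectors as centroids.
    centroids = [list(v) for v in vectors[:c]]
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(c), key=lambda k: squared_distance(v, centroids[k]))
            for v in vectors
        ]
        if new_labels == labels:
            break  # assignments are stable, so the algorithm has converged
        labels = new_labels
        for k in range(c):
            members = [v for v, label in zip(vectors, labels) if label == k]
            if members:  # keep the old centroid if a cluster becomes empty
                centroids[k] = mean_vector(members)
    return labels


# Hypothetical context vectors for four change candidates.
vectors = [[2, 1, 0, 0], [0, 0, 1, 2], [2, 2, 0, 0], [0, 0, 2, 2]]
print(k_means(vectors, c=2))  # [0, 1, 0, 1]
```

Swapping `squared_distance` for a cosine or Mahalanobis distance changes only the assignment step; the centroid update stays the same in this sketch.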

図８は、変更候補から生成された単語ベクトルの例を示している。出現位置ＩＤは、解析後文書集合４２３における変更候補の識別情報であり、直前のＷ単語は、文書中で変更候補の直前に出現するＷ個の単語を表し、直後のＷ単語は、文書中で変更候補の直後に出現するＷ個の単語を表す。この例では、Ｗ＝３である。単語ベクトルは、直前のＷ単語及び直後のＷ単語から、bag of wordsにより生成された単語ベクトルを表す。 FIG. 8 shows an example of word vectors generated from change candidates. The appearance position ID is identification information of a change candidate in the post-analysis document set 423. The preceding W words are the W words that appear immediately before the change candidate in the document, and the following W words are the W words that appear immediately after the change candidate in the document. In this example, W=3. The word vector is the vector generated by bag of words from the preceding W words and the following W words.

例えば、出現位置ＩＤ“１”の変更候補の直前には、“ａａ”、“ｂｂ”、及び“ｃｃ”の３個の単語が出現し、直後には、“ｄｄ”、“ｅｅ”、及び“ｆｆ”の３個の単語が出現する。出現位置ＩＤ“２”の変更候補の直前には、“ｄｄ”、“ｅｅ”、及び“ｇｇ”の３個の単語が出現し、直後には、“ａａ”、“ｅｅ”、及び“ｃｃ”の３個の単語が出現する。出現位置ＩＤ“３”の変更候補の直前には、“ａａ”、“ｂｂ”、及び“ｄｄ”の３個の単語が出現し、直後には、“ｅｅ”、“ｆｆ”、及び“ｇｇ”の３個の単語が出現する。 For example, the three words "aa", "bb", and "cc" appear immediately before the change candidate with appearance position ID "1", and the three words "dd", "ee", and "ff" appear immediately after it. The three words "dd", "ee", and "gg" appear immediately before the change candidate with appearance position ID "2", and the three words "aa", "ee", and "cc" appear immediately after it. The three words "aa", "bb", and "dd" appear immediately before the change candidate with appearance position ID "3", and the three words "ee", "ff", and "gg" appear immediately after it.

単語ベクトルの各要素は、［ａａ，ｂｂ，ｃｃ，ｄｄ，ｅｅ，ｆｆ，ｇｇ］の順で、各単語の出現回数を表す。例えば、出現位置ＩＤ“１”の直前のＷ単語及び直後のＷ単語には、“ａａ”、“ｂｂ”、“ｃｃ”、“ｄｄ”、“ｅｅ”、及び“ｆｆ”が１回ずつ出現し、“ｇｇ”が出現していないため、単語ベクトルは［１，１，１，１，１，１，０］となる。 Each element of the word vector represents the number of occurrences of the corresponding word, in the order [aa, bb, cc, dd, ee, ff, gg]. For example, in the preceding W words and the following W words of appearance position ID "1", "aa", "bb", "cc", "dd", "ee", and "ff" each appear once and "gg" does not appear, so the word vector is [1, 1, 1, 1, 1, 1, 0].
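The bag-of-words vectors for the three appearance positions described above can be computed with a few lines of Python. The function name `context_vector` is hypothetical; the vocabulary order and the context words are taken from the text. Only the vector for appearance position ID "1" is stated explicitly in the text; the other two follow from applying the same counting rule to the listed context words.

```python
from collections import Counter

VOCABULARY = ["aa", "bb", "cc", "dd", "ee", "ff", "gg"]


def context_vector(before, after, vocabulary=VOCABULARY):
    """Count how often each vocabulary word occurs in the W words before and after."""
    counts = Counter(before) + Counter(after)
    return [counts[word] for word in vocabulary]


# Context words (W = 3) for appearance position IDs 1 to 3.
contexts = {
    1: (["aa", "bb", "cc"], ["dd", "ee", "ff"]),
    2: (["dd", "ee", "gg"], ["aa", "ee", "cc"]),
    3: (["aa", "bb", "dd"], ["ee", "ff", "gg"]),
}
vectors = {pos: context_vector(b, a) for pos, (b, a) in contexts.items()}
print(vectors[1])  # [1, 1, 1, 1, 1, 1, 0]
print(vectors[2])  # [1, 0, 1, 1, 2, 0, 1]
print(vectors[3])  # [1, 1, 0, 1, 1, 1, 1]
```

Because bag of words ignores order, a word that occurs both before and after the candidate, such as "ee" for position 2, is simply counted twice.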

変更部４１６は、分類結果４２５に含まれるＣ個のクラスタのうち、特定のクラスタに属する変更候補の付加情報をユーザが変更した場合、その変更操作に従って付加情報を変更するとともに、同じクラスタに属する他の変更候補の付加情報も同様に変更する。そして、変更部４１６は、そのクラスタのすべての変更候補に対する変更内容を、変更事例として変更履歴４２４に追加する。これにより、ユーザが行った変更操作が、同じクラスタに属する他の変更候補にも自動的に反映される。 When the user changes the additional information of a change candidate belonging to a specific cluster among the C clusters included in the classification result 425, the changing unit 416 changes that additional information according to the change operation, and also changes the additional information of the other change candidates belonging to the same cluster in the same way. The changing unit 416 then adds the changes made to all change candidates of that cluster to the change history 424 as change examples. As a result, the change operation performed by the user is automatically reflected in the other change candidates belonging to the same cluster.
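The propagation of one user change to the rest of its cluster can be sketched as follows. The data layout is an assumption made for the illustration: `annotations` maps a hypothetical candidate ID to its tag, `clusters` is a list of sets of candidate IDs, and each history entry is a `(candidate_id, old_tag, new_tag)` tuple.

```python
def apply_change_to_cluster(annotations, clusters, changed_id, new_tag, history):
    """Apply the user's tag change to every candidate in the same cluster
    and record each change in the history."""
    for cluster in clusters:
        if changed_id in cluster:
            break
    else:
        raise KeyError(f"candidate {changed_id} is not in any cluster")
    for candidate_id in sorted(cluster):
        old_tag = annotations[candidate_id]
        annotations[candidate_id] = new_tag
        history.append((candidate_id, old_tag, new_tag))
    return sorted(cluster)


# Hypothetical state: five change candidates split into two clusters.
annotations = {10: "product", 11: "product", 12: "product", 13: "product", 14: "product"}
clusters = [{10, 12, 13}, {11, 14}]
history = []

changed = apply_change_to_cluster(annotations, clusters, 12, "drug", history)
print(changed)      # [10, 12, 13]
print(annotations)  # candidates 11 and 14 keep the tag "product"
```

Recording one history entry per changed candidate mirrors the description that the changes to all candidates of the cluster are added to the change history.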

このように、分類部４１５を設けることで、同じ変更種類に属する変更候補であっても、周辺の文脈に応じて各変更候補を異なるクラスタに分類することができる。ユーザは各クラスタに含まれる変更候補のうち、強調表示された部分文書に含まれる代表的な変更候補を変更するだけで、同じクラスタに属する他の変更候補も同時に変更することが可能になる。 In this way, by providing the classification unit 415, even if the change candidates belong to the same change type, each change candidate can be classified into different clusters according to the surrounding context. By simply changing a representative change candidate included in the highlighted partial document among the change candidates included in each cluster, the user can simultaneously change other change candidates belonging to the same cluster.

ところで、クラスタリングには長い時間がかかることが多く、速い場合であっても、Ｎ個の変更候補に対する計算量は、Ｏ（Ｎ＾２）である。変更対象として、変更事例が示す変更前の固有表現のみを用いた場合、ユーザが、前回とは異なる変更候補を対象とする新たな変更操作を行う度に、その変更候補の検索及びクラスタリングが実行される。この場合、新たな変更操作を行う度に、クラスタリングの終了を待ち合わせる待ち時間が発生し、作業効率が低下する。 Clustering often takes a long time; even in fast cases, the computational cost for N change candidates is O(N^2). If only the pre-change named entity indicated by the change example is used as the change target, then each time the user performs a new change operation on a change candidate different from the previous one, the search for and clustering of those change candidates is executed. In this case, every new change operation incurs a waiting time until the clustering finishes, which reduces work efficiency.

例えば、ユーザが“advanced gastric cancer：薬”を“gastric cancer：薬”に変更する第１の変更操作を行った場合、解析後文書集合４２３の他の部分から、“advanced gastric cancer：薬”が検索され、変更候補として抽出される。そして、抽出された変更候補がクラスタリングされて、ユーザに提示される。ユーザが各クラスタに含まれる代表的な変更候補を変更する第２の変更操作を行うと、同じクラスタに属するすべての変更候補がまとめて変更される。 For example, when the user performs a first change operation that changes "advanced gastric cancer: medicine" to "gastric cancer: medicine", the other parts of the post-analysis document set 423 are searched for "advanced gastric cancer: medicine", and the matches are extracted as change candidates. The extracted change candidates are then clustered and presented to the user. When the user performs a second change operation that changes a representative change candidate in each cluster, all change candidates belonging to the same cluster are changed together.

次に、ユーザが“advanced colon cancer：薬”を“colon cancer：薬”に変更する第３の変更操作を行った場合、解析後文書集合４２３の他の部分から、“advanced colon cancer：薬”が検索され、変更候補として抽出される。そして、抽出された変更候補がクラスタリングされて、ユーザに提示される。ユーザが各クラスタに含まれる代表的な変更候補を変更する第４の変更操作を行うと、同じクラスタに属するすべての変更候補がまとめて変更される。この場合、第３の変更操作から第４の変更操作までの間に待ち時間が発生し、作業効率が低下する。 Next, when the user performs a third change operation that changes "advanced colon cancer: medicine" to "colon cancer: medicine", the other parts of the post-analysis document set 423 are searched for "advanced colon cancer: medicine", and the matches are extracted as change candidates. The extracted change candidates are then clustered and presented to the user. When the user performs a fourth change operation that changes a representative change candidate in each cluster, all change candidates belonging to the same cluster are changed together. In this case, a waiting time occurs between the third change operation and the fourth change operation, and work efficiency decreases.

これに対して、上述したように、変更前の固有表現の一部の語句又はその同義語を含む別の固有表現を変更対象に含めることで、変更前の固有表現と類似する固有表現についても、先回りして検索及びクラスタリングを終了しておくことが可能になる。したがって、変更前の固有表現のみを変更対象として用いた場合よりも、作業効率が向上する。 In contrast, as described above, by including in the change target other named entities that contain some of the terms of the pre-change named entity or their synonyms, the search and clustering can be completed ahead of time even for named entities that are similar to the pre-change named entity. Work efficiency is therefore higher than when only the pre-change named entity is used as the change target.

例えば、ユーザが第１の変更操作を行った場合、解析後文書集合４２３の他の部分から、“advanced gastric cancer：薬”とともに“advanced colon cancer：薬”も検索され、変更候補として抽出される。そして、抽出された変更候補がクラスタリングされて、ユーザに提示される。 For example, when the user performs the first change operation, the other parts of the post-analysis document set 423 are searched not only for "advanced gastric cancer: medicine" but also for "advanced colon cancer: medicine", and the matches are extracted as change candidates. The extracted change candidates are then clustered and presented to the user.

この場合、提示される変更候補には、“advanced gastric cancer：薬”及び“advanced colon cancer：薬”が含まれているため、ユーザは、両方の変更候補を変更することができる。これにより、ユーザは第２の変更操作及び第４の変更操作を同時に行うことができ、第３の変更操作から第４の変更操作までの間の待ち時間が発生しない。したがって、“advanced gastric cancer：薬”のみを変更対象として用いた場合よりも、作業効率が向上する。 In this case, the presented change candidates include "advanced gastric cancer: medicine" and "advanced colon cancer: medicine," so the user can change both of the change candidates. Thereby, the user can perform the second change operation and the fourth change operation simultaneously, and there is no waiting time between the third change operation and the fourth change operation. Therefore, work efficiency is improved compared to when only "advanced gastric cancer: medicine" is used as the change target.

The classification unit 415 can cluster the change candidates of each of a plurality of change types selected in descending order of case score 2, and generate a classification result 425 for each change type. The number of clustering processes that can run simultaneously is determined by the performance of the document processing device 401. For example, if the document processing device 401 can run P clustering processes simultaneously (P is an integer of 1 or more), the classification unit 415 runs the clustering process for the P change types selected in descending order of case score 2.

By selecting change types in descending order of case score 2, clustering of the change types that contain more change candidates can be executed preferentially. Clustering of the change candidates that the user is likely to change can therefore finish early, and those change candidates can be presented to the user.
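The selection of P change types and their concurrent clustering can be sketched as follows. This is only an illustration under stated assumptions: case score 2 is taken as an already computed mapping from change type to score, a thread pool stands in for whatever parallelism the device offers, and `select_change_types`, `cluster_selected`, and `cluster_fn` are hypothetical names.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def select_change_types(case_score2: dict[str, float], p: int) -> list[str]:
    """Return up to p change types in descending order of case score 2."""
    if p < 1:
        raise ValueError("p must be an integer of 1 or more")
    return heapq.nlargest(p, case_score2, key=case_score2.get)


def cluster_selected(candidates_by_type, case_score2, p, cluster_fn):
    """Cluster the candidates of the p best-scoring change types concurrently.

    cluster_fn is a caller-supplied clustering function (an assumption);
    the result maps each selected change type to its classification result.
    """
    selected = select_change_types(case_score2, p)
    with ThreadPoolExecutor(max_workers=p) as pool:
        results = pool.map(lambda t: cluster_fn(candidates_by_type[t]), selected)
        return dict(zip(selected, results))
```

Change types outside the top P are simply not clustered in this pass, which mirrors the idea of spending the limited clustering capacity on the change types the user is most likely to act on.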

When the user finishes editing the post-analysis document set 423, the post-analysis document set 423 is used as new learning data for the analysis model 421. Through machine learning, the document processing device 401 trains the analysis model 421 on the post-analysis document set 423, thereby adjusting the parameters of the analysis model 421 and updating it. The editing results for the post-analysis document set 423 can thus be reflected in the analysis model 421.

FIG. 9 is a flowchart showing a specific example of the change support process performed by the document processing device 401 of FIG. 4. First, the changing unit 416 receives a request for an editing screen from the user (step 801), and the document processing device 401 generates an editing screen for the post-analysis document set 423 (step 802). The output unit 418 then outputs the generated editing screen (step 803).

Next, the changing unit 416 receives the user's change instruction for a change candidate included in the editing screen as a change operation on that change candidate, and changes the additional information of the change candidate in accordance with the received change operation (step 804). The changing unit 416 then changes the additional information of the other change candidates belonging to the same cluster in the same way, and adds the changes made to all change candidates in that cluster to the change history 424 as change cases.
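Step 804 can be pictured with the following minimal sketch, which assumes that a change operation replaces only the additional information (label) of a candidate and that one change case is recorded per changed candidate. The `Candidate` and `ChangeCase` classes and their fields are hypothetical; the embodiment does not define the layout of the change history 424 in this passage.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """One change candidate: where it is, its text, and its current label."""
    doc_id: str
    offset: int
    text: str
    label: str


@dataclass(frozen=True)
class ChangeCase:
    """One change-history entry (illustrative field names)."""
    doc_id: str
    offset: int
    text: str
    old_label: str
    new_label: str


def apply_to_cluster(cluster: list[Candidate], new_label: str,
                     history: list[ChangeCase]) -> None:
    """Apply the user's label change to every candidate in the cluster and
    append one change case per candidate to the history."""
    for cand in cluster:
        history.append(
            ChangeCase(cand.doc_id, cand.offset, cand.text, cand.label, new_label)
        )
        cand.label = new_label
```

The user edits one representative candidate, and the same call updates the rest of the cluster, so the history grows by the size of the cluster rather than by one entry.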

FIG. 10 is a flowchart showing an example of the editing screen generation process in step 802 of FIG. 9. First, based on the change cases included in the change history 424, the estimation unit 413 estimates, for each change type, the change target of the change operation that the user will perform next (step 901). The extraction unit 414 then extracts the change candidates corresponding to the change target of each change type from the post-analysis document set 423.

Next, the classification unit 415 clusters the change candidates belonging to a specific change type and generates a classification result 425 containing a plurality of clusters (step 902). The estimation unit 413 then selects a specific paragraph based on the paragraph score of each paragraph included in the post-analysis document set 423 (step 903). The generation unit 417 generates change candidate information that includes information for highlighting the specific paragraph, and the output unit 418 outputs an editing screen containing the document that includes the specific paragraph together with the generated change candidate information (step 904).
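The clustering in step 902 groups change candidates by the text that surrounds them. The sketch below is a simplified stand-in, not the method of the embodiment: it represents each candidate's context as a set of neighboring words and groups candidates greedily by Jaccard similarity, whereas the embodiment refers to word vectors (FIG. 8). The window size, the threshold, and all function names are assumptions made for illustration.

```python
def context_words(tokens: list[str], start: int, end: int,
                  window: int = 3) -> frozenset[str]:
    """Words within `window` tokens before and after the span tokens[start:end]."""
    left = tokens[max(0, start - window):start]
    right = tokens[end:end + window]
    return frozenset(w.lower() for w in left + right)


def jaccard(a: frozenset[str], b: frozenset[str]) -> float:
    """Jaccard similarity of two word sets (1.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cluster_by_context(contexts: list[frozenset[str]],
                       threshold: float = 0.5) -> list[list[int]]:
    """Greedy single-pass clustering: each candidate joins the first cluster
    whose first member has a similar enough context, else starts a new one."""
    clusters: list[list[int]] = []
    for i, ctx in enumerate(contexts):
        for members in clusters:
            if jaccard(contexts[members[0]], ctx) >= threshold:
                members.append(i)
                break
        else:
            clusters.append([i])
    return clusters
```

Each returned cluster is a list of candidate indices, so the first index of a cluster can serve as the representative candidate shown to the user.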

FIG. 11 is a flowchart showing an example of the estimation process in step 901 of FIG. 10. First, the estimation unit 413 selects the K most recent change cases (K is an integer of 1 or more) included in the change history 424 and classifies each selected change case into one of a plurality of change types (step 1001). K may be, for example, an integer in the range of 10 to 100.
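Step 1001 amounts to taking a sliding window over the change history and grouping its entries. A minimal sketch follows, assuming the history is an oldest-first list of dictionaries and that a change type can be keyed by the pair of labels before and after the change. That key, the dictionary fields, and the function name are illustrative assumptions; the passage does not state how change types are defined.

```python
from collections import defaultdict


def recent_cases_by_type(history: list[dict], k: int) -> dict:
    """Take the K most recent change cases (history is oldest-first) and
    group them by an illustrative change-type key."""
    if k < 1:
        raise ValueError("k must be an integer of 1 or more")
    grouped = defaultdict(list)
    for case in history[-k:]:
        # Assumed change-type key: (label before change, label after change).
        grouped[(case["old_label"], case["new_label"])].append(case)
    return dict(grouped)
```

Because only the last K cases are considered, older editing habits drop out of the estimate as the user moves on to other kinds of changes.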

Next, the estimation unit 413 estimates a change target for each change type, and the extraction unit 414 extracts the change candidates corresponding to the change target of each change type from the post-analysis document set 423 (step 1002).

Next, the estimation unit 413 calculates case score 1 of each change type based on the change candidates extracted for that change type (step 1003), and calculates the paragraph score of each paragraph using case score 1 (step 1004). The estimation unit 413 then calculates case score 2 of each change type using case score 1 and the paragraph scores (step 1005), and selects a specific change type using case score 2 (step 1006).
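Steps 1003 and 1004 can be sketched under explicit assumptions. The exact formulas are given in FIG. 5 and FIG. 6 and are not reproduced in this passage, so the sketch assumes that case score 1 is the number of change candidates extracted for a change type, and that a paragraph score is the sum of case score 1 over the candidates in that paragraph. Case score 2 is intentionally left out because its formula cannot be recovered from this text. All function names are hypothetical.

```python
from collections import Counter


def case_score1(candidates: list[tuple[str, str]]) -> Counter:
    """candidates holds one (paragraph_id, change_type) pair per extracted
    change candidate. Assumed score: number of candidates per change type."""
    return Counter(change_type for _, change_type in candidates)


def paragraph_scores(candidates: list[tuple[str, str]],
                     score1: Counter) -> dict[str, int]:
    """Assumed paragraph score: sum of case score 1 over the candidates
    that the paragraph contains."""
    scores: Counter = Counter()
    for paragraph_id, change_type in candidates:
        scores[paragraph_id] += score1[change_type]
    return dict(scores)


def pick_paragraph(scores: dict[str, int]) -> str:
    """Select the paragraph with the highest score for highlighting."""
    return max(scores, key=scores.get)
```

Under these assumptions a paragraph scores highly when it contains many candidates of frequently occurring change types, which is the kind of paragraph the editing screen would highlight.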

The configurations of the document processing device 201 of FIG. 2 and the document processing device 401 of FIG. 4 are merely examples, and some components may be omitted or changed depending on the use or conditions of the document processing device. For example, in the document processing device 401 of FIG. 4, if the post-analysis document set 423 is stored in the storage unit 411 in advance, the analysis unit 412 can be omitted. If the change candidates are not clustered, the classification unit 415 can be omitted.

The flowcharts of FIG. 3 and FIGS. 9 to 11 are merely examples, and some of the processing may be omitted or changed depending on the configuration or conditions of the document processing device. For example, in the document processing device 401 of FIG. 4, if the change candidates are not clustered, the processing of step 902 of FIG. 10 can be omitted.

The texts shown in FIGS. 1 and 8 are merely examples, and documents from various fields may be used as the documents to be edited. The documents to be edited are not limited to learning data used in machine learning and may be other documents. The target of a change operation by the user may be the text of a document or the additional information attached to the text of a document.

Case score 1 and case score 2 shown in FIG. 5 are merely examples, and the evaluation value of a change type may be obtained by another calculation method. The paragraph score shown in FIG. 6 is merely an example, and the evaluation value of a partial document may be obtained by another calculation method. The word vectors shown in FIG. 8 are merely examples, and word vectors may be obtained by another method.

FIG. 12 shows an example of the hardware configuration of an information processing device (computer) used as the document processing device 201 of FIG. 2 and the document processing device 401 of FIG. 4. The information processing device of FIG. 12 includes a CPU (Central Processing Unit) 1101, a memory 1102, an input device 1103, an output device 1104, an auxiliary storage device 1105, a medium drive device 1106, and a network connection device 1107. These components are hardware and are connected to one another by a bus 1108.

The memory 1102 is, for example, a semiconductor memory such as a ROM (Read Only Memory), a RAM (Random Access Memory), or a flash memory, and stores programs and data used for processing. The memory 1102 can be used as the storage unit 211 of FIG. 2 or the storage unit 411 of FIG. 4.

The CPU 1101 (processor) operates as the estimation unit 212 and the extraction unit 213 of FIG. 2, for example, by executing a program using the memory 1102. By executing a program using the memory 1102, the CPU 1101 also operates as the analysis unit 412, the estimation unit 413, the extraction unit 414, the classification unit 415, the changing unit 416, and the generation unit 417 of FIG. 4.

The input device 1103 is, for example, a keyboard or a pointing device, and is used by an operator or user to input instructions or information. The output device 1104 is, for example, a display device, a printer, or a speaker, and is used to output inquiries or instructions to an operator or user, as well as processing results. An instruction from the user may be a change operation, and a processing result may be a highlighted partial document. The output device 1104 can be used as the output unit 214 of FIG. 2 or the output unit 418 of FIG. 4.

The auxiliary storage device 1105 is, for example, a magnetic disk device, an optical disk device, a magneto-optical disk device, or a tape device. The auxiliary storage device 1105 may be a hard disk drive or a flash memory. The information processing device can store programs and data in the auxiliary storage device 1105 and load them into the memory 1102 for use. The auxiliary storage device 1105 can be used as the storage unit 211 of FIG. 2 or the storage unit 411 of FIG. 4.

The medium drive device 1106 drives a portable recording medium 1109 and accesses its recorded contents. The portable recording medium 1109 is a memory device, a flexible disk, an optical disk, a magneto-optical disk, or the like. The portable recording medium 1109 may be a CD-ROM (Compact Disk Read Only Memory), a DVD (Digital Versatile Disk), a USB (Universal Serial Bus) memory, or the like. An operator or user can store programs and data on the portable recording medium 1109 and load them into the memory 1102 for use.

Thus, a computer-readable recording medium that stores the programs and data used for processing is a physical (non-transitory) recording medium such as the memory 1102, the auxiliary storage device 1105, or the portable recording medium 1109.

The network connection device 1107 is a communication interface circuit that is connected to a communication network such as a LAN (Local Area Network) or a WAN (Wide Area Network) and performs the data conversion associated with communication. The information processing device can receive programs and data from an external device via the network connection device 1107 and load them into the memory 1102 for use. The network connection device 1107 can be used as the output unit 214 of FIG. 2 or the output unit 418 of FIG. 4.

Note that the information processing device need not include all the components of FIG. 12, and some components may be omitted depending on the use or conditions. For example, if the portable recording medium 1109 or the communication network is not used, the medium drive device 1106 or the network connection device 1107 may be omitted.

Although the disclosed embodiments and their advantages have been described in detail, those skilled in the art will be able to make various changes, additions, and omissions without departing from the scope of the present invention as clearly set forth in the claims.

Regarding the embodiments described with reference to FIGS. 1 to 12, the following supplementary notes are further disclosed.
(Supplementary note 1)
A document processing program for causing a computer to execute a process comprising:
estimating, based on a change history indicating that a user has changed information included in a document set, a change target of a change that the user makes to the document set;
extracting text corresponding to the change target from the document set; and
outputting change candidate information indicating the text extracted from the document set.
(Supplementary note 2)
The document processing program according to supplementary note 1, wherein
the process of estimating the change target is executed before a change instruction from the user for the text extracted from the document set is input.
(Supplementary note 3)
The document processing program according to supplementary note 1 or 2, wherein
the change history includes a change case indicating a change operation performed by the user, and
the change target is information that specifies the pre-change text indicated by the change case and that also specifies text which includes some of the words of the pre-change text, or synonyms of those words, and which differs from the pre-change text.
(Supplementary note 4)
The document processing program according to supplementary note 3, wherein
the information included in the document set is additional information of text included in the document set, and
the document processing program further causes the computer to execute a process comprising:
generating a plurality of clusters by clustering the text extracted from the document set based on the text that precedes and follows the extracted text; and
when the user changes the additional information of text belonging to a specific cluster among the plurality of clusters, reflecting that change in the additional information of the other text belonging to the specific cluster.
(Supplementary note 5)
The document processing program according to supplementary note 4, wherein
the change history includes a plurality of change cases indicating change operations performed by the user,
the process of estimating the change target is executed for each of a plurality of change types into which the plurality of change cases are classified,
the process of extracting the text includes a process of extracting text corresponding to the change target of each type, and
the clustering is executed on the text corresponding to the change target of a specific type selected based on the appearance frequency, in the document set, of the text corresponding to the change target of each type.
(Supplementary note 6)
The document processing program according to supplementary note 5, wherein
the document set includes a plurality of documents,
each of the plurality of documents includes a plurality of partial documents,
the computer calculates an evaluation value of each of the plurality of partial documents based on the appearance frequency of the text corresponding to the change target of each type and on the number of pieces of text corresponding to the change target of each type included in that partial document,
the computer selects a specific partial document from the plurality of partial documents based on the evaluation value of each of the plurality of partial documents, and
the change candidate information includes information for highlighting the specific partial document.
(Supplementary note 7)
The document processing program according to any one of supplementary notes 4 to 6, wherein
the document set is learning data for machine learning that generates an analysis model, and the analysis model analyzes a document to be analyzed and generates additional information of text included in the document to be analyzed.
(Supplementary note 8)
A document processing device comprising:
a storage unit that stores a change history indicating that a user has changed information included in a document set;
an estimation unit that estimates, based on the change history, a change target of a change that the user makes to the document set;
an extraction unit that extracts text corresponding to the change target from the document set; and
an output unit that outputs change candidate information indicating the text extracted from the document set.
(Supplementary note 9)
The document processing device according to supplementary note 8, wherein
the estimation unit estimates the change target before a change instruction from the user for the text extracted from the document set is input.
(Supplementary note 10)
The document processing device according to supplementary note 8 or 9, wherein
the change history includes a change case indicating a change operation performed by the user, and
the change target is information that specifies the pre-change text indicated by the change case and that also specifies text which includes some of the words of the pre-change text, or synonyms of those words, and which differs from the pre-change text.
(Supplementary note 11)
The document processing device according to supplementary note 10, wherein
the information included in the document set is additional information of text included in the document set, and
the document processing device further comprises:
a classification unit that generates a plurality of clusters by clustering the text extracted from the document set based on the text that precedes and follows the extracted text; and
a changing unit that, when the user changes the additional information of text belonging to a specific cluster among the plurality of clusters, reflects that change in the additional information of the other text belonging to the specific cluster.
(Supplementary note 12)
The document processing device according to supplementary note 11, wherein
the change history includes a plurality of change cases indicating change operations performed by the user,
the estimation unit estimates the change target for each of a plurality of change types into which the plurality of change cases are classified,
the extraction unit extracts text corresponding to the change target of each type, and
the classification unit executes clustering on the text corresponding to the change target of a specific type selected based on the appearance frequency, in the document set, of the text corresponding to the change target of each type.
(Supplementary note 13)
The document processing device according to supplementary note 12, wherein
the document set includes a plurality of documents,
each of the plurality of documents includes a plurality of partial documents,
the estimation unit calculates an evaluation value of each of the plurality of partial documents based on the appearance frequency of the text corresponding to the change target of each type and on the number of pieces of text corresponding to the change target of each type included in that partial document, and selects a specific partial document from the plurality of partial documents based on the evaluation value of each of the plurality of partial documents, and
the change candidate information includes information for highlighting the specific partial document.
(Supplementary note 14)
The document processing device according to any one of supplementary notes 11 to 13, wherein
the document set is learning data for machine learning that generates an analysis model, and the analysis model analyzes a document to be analyzed and generates additional information of text included in the document to be analyzed.
(Supplementary note 15)
A document processing method in which a computer executes a process comprising:
estimating, based on a change history indicating that a user has changed information included in a document set, a change target of a change that the user makes to the document set;
extracting text corresponding to the change target from the document set; and
outputting change candidate information indicating the text extracted from the document set.
(Supplementary note 16)
The document processing method according to supplementary note 15, wherein
the process of estimating the change target is executed before a change instruction from the user for the text extracted from the document set is input.
(Supplementary note 17)
The document processing method according to supplementary note 15 or 16, wherein
the change history includes a change case indicating a change operation performed by the user, and
the change target is information that specifies the pre-change text indicated by the change case and that also specifies text which includes some of the words of the pre-change text, or synonyms of those words, and which differs from the pre-change text.
(Supplementary note 18)
The document processing method according to supplementary note 17, wherein
the information included in the document set is additional information of text included in the document set, and
the computer further:
generates a plurality of clusters by clustering the text extracted from the document set based on the text that precedes and follows the extracted text; and
when the user changes the additional information of text belonging to a specific cluster among the plurality of clusters, reflects that change in the additional information of the other text belonging to the specific cluster.
(Supplementary note 19)
The document processing method according to supplementary note 18, wherein
the change history includes a plurality of change cases indicating change operations performed by the user,
the process of estimating the change target is executed for each of a plurality of change types into which the plurality of change cases are classified,
the process of extracting the text includes a process of extracting text corresponding to the change target of each type, and
the clustering is executed on the text corresponding to the change target of a specific type selected based on the appearance frequency, in the document set, of the text corresponding to the change target of each type.
(Supplementary note 20)
The document processing method according to supplementary note 19, wherein
the document set includes a plurality of documents,
each of the plurality of documents includes a plurality of partial documents,
the computer calculates an evaluation value of each of the plurality of partial documents based on the appearance frequency of the text corresponding to the change target of each type and on the number of pieces of text corresponding to the change target of each type included in that partial document,
the computer selects a specific partial document from the plurality of partial documents based on the evaluation value of each of the plurality of partial documents, and
the change candidate information includes information for highlighting the specific partial document.
(Supplementary note 21)
The document processing method according to any one of supplementary notes 18 to 20, wherein
the document set is learning data for machine learning that generates an analysis model, and the analysis model analyzes a document to be analyzed and generates additional information of text included in the document to be analyzed.

201, 401 Document processing device
211, 411 Storage unit
212, 413 Estimation unit
213, 414 Extraction unit
214, 418 Output unit
221, 424 Change history
412 Analysis unit
415 Classification unit
416 Changing unit
417 Generation unit
421 Analysis model
422 Pre-analysis document set
423 Post-analysis document set
425 Classification result
701 Paragraph
1101 CPU
1102 Memory
1103 Input device
1104 Output device
1105 Auxiliary storage device
1106 Medium drive device
1107 Network connection device
1108 Bus
1109 Portable recording medium

Claims

A document processing program for causing a computer to execute a process comprising:
estimating, based on a change history that indicates that a user has changed additional information of text included in a document set and that includes a change case indicating a change operation performed by the user, a change target of a change that the user makes to the document set;
extracting text corresponding to the change target from the document set;
generating a plurality of clusters by clustering the text extracted from the document set based on the text that precedes and follows the extracted text;
when the user changes the additional information of text belonging to a specific cluster among the plurality of clusters, reflecting that change in the additional information of the other text belonging to the specific cluster; and
outputting change candidate information indicating the text extracted from the document set.

The document processing program according to claim 1, wherein the process of estimating the change target is executed before a change instruction from the user is input for the text extracted from the document set.

The document processing program according to claim 1 or 2, wherein the change target is information that identifies the pre-change text indicated by the change example and also identifies text that differs from the pre-change text and that includes some of the words or phrases in the pre-change text or synonyms of those words or phrases.
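
As a non-limiting illustration, a change target of this kind can be sketched as a matching rule that accepts the pre-change text itself as well as other text containing a phrase taken from it or a synonym of that phrase. The synonym table, the function names, and the sample strings are hypothetical, and plain substring matching is an assumption made only to keep the sketch short.

```python
def change_target_phrases(key_phrase, synonyms):
    """Return the key phrase plus its synonyms (hypothetical synonym table)."""
    return [key_phrase, *synonyms.get(key_phrase, [])]

def matches_change_target(text, before_text, phrases):
    """True for the pre-change text itself, or for other text that
    contains the key phrase or one of its synonyms."""
    return text == before_text or any(p in text for p in phrases)

# Hypothetical data for illustration only.
SYNONYMS = {"branch office": ["local office"]}
before_text = "Tokyo branch office"
phrases = change_target_phrases("branch office", SYNONYMS)
candidates = [
    "Tokyo branch office",
    "Osaka branch office",
    "Nagoya local office",
    "head office",
]
hits = [t for t in candidates if matches_change_target(t, before_text, phrases)]
```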

The document processing program according to any one of claims 1 to 3, wherein:
the change history includes a plurality of change examples indicating change operations performed by the user;
the process of estimating the change target is performed for each of a plurality of change types into which the plurality of change examples are classified;
the process of extracting the text includes a process of extracting text corresponding to the change target of each type; and
the clustering process is performed on text corresponding to the change target of a specific type, the specific type being selected based on the appearance frequency, in the document set, of the text corresponding to the change target of each type.

The document processing program according to claim 4, wherein:
the document set includes a plurality of documents;
each of the plurality of documents includes a plurality of partial documents;
the program further causes the computer to execute:
calculating, as an evaluation value for each of the plurality of partial documents, the total appearance frequency of the text corresponding to the change target of each type that is included in that partial document; and
selecting a specific partial document from the plurality of partial documents based on the evaluation value of each of the plurality of partial documents; and
the change candidate information includes information that highlights the specific partial document.
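
As a non-limiting illustration, the evaluation value can be sketched as the summed occurrence count of the change-target texts within each partial document. The sketch assumes that partial documents are tokenized paragraphs and that the partial document with the highest evaluation value is the one selected; the claim itself only requires that the selection be based on the evaluation values. All names and sample data are hypothetical.

```python
def evaluation_values(partial_docs, targets):
    """Total appearance frequency of all change-target texts per partial document."""
    return [sum(tokens.count(t) for t in targets) for tokens in partial_docs]

def select_partial(partial_docs, targets):
    """Pick the index of the highest-scoring partial document (assumption)."""
    scores = evaluation_values(partial_docs, targets)
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores

# Hypothetical paragraphs and change-target texts (one per change type).
paragraphs = [
    ["bank", "of", "japan"],
    ["bank", "loan", "from", "the", "bank", "branch"],
    ["river"],
]
targets = ["bank", "branch"]
best, scores = select_partial(paragraphs, targets)
```

The selected index could then be used to highlight the corresponding paragraph in the change candidate information.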

The document processing program according to any one of claims 1 to 5, wherein the document set is learning data for machine learning that generates an analysis model, and the analysis model analyzes a document to be analyzed to generate additional information of text included in the document to be analyzed.

A document processing device comprising:
a storage unit that stores a change history indicating that a user has changed additional information of text included in a document set, the change history including a change example indicating a change operation performed by the user;
an estimation unit that estimates, based on the change history, a change target of changes made by the user to the document set;
an extraction unit that extracts text corresponding to the change target from the document set;
a classification unit that generates a plurality of clusters by clustering the text extracted from the document set based on the text that appears before and after the extracted text;
a change unit that, when the user changes the additional information of text belonging to a specific cluster among the plurality of clusters, reflects that change in the additional information of the other text belonging to the specific cluster; and
an output unit that outputs change candidate information indicating the text extracted from the document set.

A document processing method executed by a computer, the method comprising:
estimating a change target of changes made by a user, based on a change history that indicates that the user has changed additional information of text included in a document set and that includes a change example indicating a change operation performed by the user;
extracting text corresponding to the change target from the document set;
generating a plurality of clusters by clustering the text extracted from the document set based on the text that appears before and after the extracted text;
when the user changes the additional information of text belonging to a specific cluster among the plurality of clusters, reflecting that change in the additional information of the other text belonging to the specific cluster; and
outputting change candidate information indicating the text extracted from the document set.