JP2020154448A

JP2020154448A - Information extraction support device and information extraction support method and program

Info

Publication number: JP2020154448A
Application number: JP2019050181A
Authority: JP
Inventors: 昌之岡本; Masayuki Okamoto; 祐一宮村; Yuichi Miyamura
Original assignee: Toshiba Corp
Current assignee: Toshiba Corp
Priority date: 2019-03-18
Filing date: 2019-03-18
Publication date: 2020-09-24
Anticipated expiration: 2039-03-18
Also published as: JP7034977B2

Abstract

PROBLEM TO BE SOLVED: To extract information with high accuracy even when the person extracting the information is not an expert.

SOLUTION: An information support device according to an embodiment comprises an extraction part, an identification part, an assignment part, a learning inference part, and an output part. The extraction part extracts first information about document data from the document data, which contains multiple words. The identification part identifies information related to the first information as clue information. The assignment part assigns a label indicating correct or incorrect to the first information on the basis of a predetermined rule. The learning inference part performs learning and inference of a rule for extracting new first information on the basis of the first information, a feature quantity used to extract the first information, and the label. The output part outputs a method for changing at least one of the feature quantity and the rule on the basis of the label and a correctness judgment, which is the result of judging whether the first information is correct or incorrect.

SELECTED DRAWING: Figure 1

Description

Embodiments of the present invention relate to an information extraction support device, an information extraction support method, and a program.

For example, information extraction techniques that extract attributes, such as a product name and a product price, and relationships between attributes, such as the relationship between a product name and its price, can be used to organize specific information in a document. One example is extracting a list of product specifications from a document and summarizing it in a table. Such extraction is often implemented with machine learning, and in many cases the clues (features) useful for learning, and the way positive examples (correct answers) and negative examples (incorrect answers) are supplied for learning, are improved iteratively while evaluation is carried out. For example, Patent Document 1 proposes a technique that improves learning performance by augmenting the training data: if the result learned from data augmented by, for example, changing the word order of dependencies is more accurate than the result learned from the original correct-answer data, the augmented result is adopted.

Patent Document 1: Japanese Unexamined Patent Publication No. 2006-4399

However, there is a problem in that, when the person extracting the information is not an expert, the effects of the changes that arise during the improvement process are not taken into account.

The information support device of the embodiment includes an extraction unit, an identification unit, an assignment unit, a learning inference unit, and an output unit. The extraction unit extracts first information about document data from the document data, which includes a plurality of words. The identification unit identifies information related to the first information as clue information. The assignment unit assigns a label indicating correct or incorrect to the first information based on a predetermined rule. The learning inference unit performs learning and inference of a rule for extracting new first information based on the first information, the feature quantity used to extract the first information, and the label. The output unit outputs a method for changing at least one of the feature quantity and the rule based on the label and a correctness determination, which is the result of judging whether the first information is correct or incorrect.

FIG. 1 is a diagram showing an example of the functional configuration of the information extraction support device of the first embodiment.
FIG. 2 is a flowchart showing an example of the information extraction support method of the first embodiment.
FIG. 3 is a diagram showing an example of extraction as attribute information.
FIG. 4 is a diagram showing an example of judging whether extracted attributes are correct.
FIG. 5 is a diagram showing an example of change methods based on the correctness determination of attributes.
FIG. 6 is a diagram showing a specific example of correctness determination results.
FIG. 7 is a diagram showing an example of the hardware configuration of a computer used for the information extraction support device of the first embodiment.
FIG. 8 is a diagram showing an example of the device configuration of the information extraction support device of the first embodiment.

For example, when information is extracted using machine learning, changing the features to be used interacts with the rules and criteria for labeling training cases as positive (correct) or negative (incorrect) examples. As a result, it is hard to tell what should be modified between one round of learning and the next. Furthermore, because evaluation values (scores) change over time, the same evaluation may be performed repeatedly on extraction results with similar properties. This embodiment describes a technique that, based on how the information extraction results and their evaluation results change over time, presents ways to change the learning features and the positive/negative labeling of the training data, and provides support that makes it easier to decide what should be evaluated.

Embodiments of the information extraction support device, the information extraction support method, and the program are described in detail below with reference to the accompanying drawings.

(First Embodiment)
First, the information extraction support device of the first embodiment will be described.

[Example of functional configuration]
FIG. 1 is a diagram showing an example of the functional configuration of the information extraction support device 101 of the first embodiment. The information extraction support device 101 of the first embodiment includes an extraction unit 102, an identification unit 103, an assignment unit 104, a learning inference unit 105, and an output unit 106.

The extraction unit 102 acquires one or more target documents 107 to be analyzed. A target document is document data (text data) expressed in natural-language sentences composed of a plurality of words, for example a Web page, or a news article, paper, or patent specification uploaded to the Internet. The target documents may be acquired through input from the user or may be collected automatically. Any document data is acceptable as long as attribute information, including attributes and the relationships between attributes, can be extracted from it. An attribute indicates the type of information the user wants to extract; examples include a product name, a price, a company name, a material name, and a characteristic value. A relationship between attributes is, for example, the relationship between a product name and a product price. The extraction unit 102 extracts candidate attribute information from the acquired documents according to predetermined rules. For example, it extracts words, sentences, and titles in the document, the date and time and the format recorded in the metadata attached to the document, the number of figures and tables, and so on, according to the predetermined rules.

The identification unit 103 identifies information related to the attribute information extracted by the extraction unit 102 as clue information that represents its features. The clue information includes meta information about the source document itself (for example, creation date and time, file format, and language), features of the extracted word or phrase itself (part-of-speech information such as noun or verb, named-entity classes such as personal name or place name, the words and phrases that appear before and after it, n-grams, and so on), and features of the figures and tables contained in the document (the class of a figure, such as photograph, graph, or illustration, the number of rows and columns of a table, and so on).

The assignment unit 104 labels the attribute information extracted by the extraction unit 102 as training case data, using either known information to which a correct answer has been attached in advance for training or a provisional training label assigned according to a predetermined rule. Here, a "provisional label" is a label attached using a predetermined rule. For example, if labels for parts of speech such as nouns and verbs are assigned by natural language processing, the "correct" label can be attached in most cases. On the other hand, when the pronunciation of named entities such as personal names and place names is labeled as correct or incorrect, predetermined rules cannot always cope; even so, the label is assigned according to the predetermined rule regardless of whether it is actually correct.
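The provisional labeling described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the table `KNOWN_PAIRS`, the function `assign_provisional_label`, and the candidate `("Si", "2019")` are hypothetical, and the rule used here (a candidate is labeled correct only if it appears in a table of known pairs) is just one assumed example of a "predetermined rule".

```python
# Hypothetical table of already-known (material, value) pairs; illustrative only.
KNOWN_PAIRS = {("Si", "5,700")}


def assign_provisional_label(candidate, known_pairs=KNOWN_PAIRS):
    """Return True (provisionally correct) or False (provisionally incorrect).

    The label follows the rule mechanically; it is not checked against
    ground truth, which is why it is only a provisional label.
    """
    return candidate in known_pairs


# Two candidate extractions: one known pair and one spurious pair.
candidates = [("Si", "5,700"), ("Si", "2019")]
labels = {c: assign_provisional_label(c) for c in candidates}
```

Because the rule is applied regardless of the true answer, some provisional labels will be wrong; the later correctness determination is what exposes such cases.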

The learning inference unit 105 learns evaluation criteria so as to reproduce the correct/incorrect relationships expressed by the training cases, based on the attribute information extracted by the extraction unit 102, the clue information identified by the identification unit 103, and the labels (training cases) assigned by the assignment unit 104. After learning, it performs inference by calculating, for attribute information extracted by the extraction unit 102, the probability that it is the desired attribute information.

During learning, for example, the priority of using clue information that accompanies attribute information regarded as a positive example (correct answer) in the training cases is set higher. The priority of using clue information that accompanies attribute information judged to be a negative example (incorrect answer) may be set lower. The extracted attribute information, clue information, and training cases may be stored in a storage unit and retrieved for use during learning and inference.
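One way to picture the priority adjustment is the sketch below. The source does not specify a learning algorithm, so the additive update, the function name `update_clue_priorities`, the `step` parameter, and the clue strings are all assumptions made for illustration.

```python
from collections import defaultdict


def update_clue_priorities(training_cases, step=1.0):
    """Raise the priority of clues seen with positive cases, lower it for negative ones.

    training_cases: iterable of (clue_set, is_positive) pairs.
    Returns a dict mapping each clue to its accumulated priority.
    """
    priority = defaultdict(float)
    for clues, is_positive in training_cases:
        for clue in clues:
            priority[clue] += step if is_positive else -step
    return dict(priority)


# Hypothetical training cases: clue sets paired with provisional labels.
cases = [
    ({"between=3", "next=cm2/Vs"}, True),
    ({"between=3"}, True),
    ({"between=7"}, False),
]
priorities = update_clue_priorities(cases)
```

A real system would more likely learn weights with a probabilistic classifier; the additive counter only shows the direction of the adjustment described in the text.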

A reception unit that accepts correctness determinations from the user may be provided so that correctness determinations for the information extraction results can be input.

For the input of correctness determination results, the user may, for example, be presented with the information extraction results to be judged in a form where only correct or incorrect can be checked, and the input accepted in that form; alternatively, only results known to be correct may be accepted. The determination results may also be input by connecting to another system that has a correctness determination function. When input is accepted from the user, check boxes for "correct" and "incorrect" may, for example, be attached to each information extraction result, and the determination accepted when the user checks one of them. At this time it is preferable to present, together with the information extraction result, any or all of the extracted attribute information, the relationships between attributes, the clue information, and metadata such as the bibliographic information of the source document, because this lets the user infer how the information was extracted.

Furthermore, the likelihood or confidence of an extraction result may be shown alongside it as a score, as needed.

The score is obtained by calculating a similarity using, for example, the Levenshtein distance or the Jaro-Winkler distance between characters or words. Any other method may be used as long as it can calculate the similarity between characters, words, and the like. These scores and correctness determination results may be stored in the storage unit.
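As an illustration of a distance-based score, the sketch below computes the Levenshtein distance with the standard dynamic-programming recurrence and converts it to a similarity in [0, 1]. The normalization by the longer string's length and the function names are assumptions; the source only states that a similarity is computed from such a distance.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions, or substitutions."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Similarity score in [0, 1]; 1.0 means the strings are identical."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest
```

For instance, `similarity("Si", "Si")` is 1.0, while strings that share no characters score 0.0.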

Based on the information about the provisional correct/incorrect labels and the label information from the correctness determination of the extracted attribute information, the output unit 106 outputs a method for changing at least one of the feature quantities and the rules used by the learning inference unit 105.

The information extraction processing in the information extraction support device 101, and the changes to clue information extraction and to the labeling rules made through the input of determination results, are now described with reference to the flowchart of FIG. 2.

In step S201, the extraction unit 102 acquires one or more target documents. An example of extracting attribute information from a target document, and of the clue information that is extracted, is described with reference to FIG. 3.

FIG. 3 shows an example in which a material name, a characteristic value of that material called mobility, and the relationship between the two are extracted from text as attribute information.

First, a plurality of words are extracted, using morphological analysis or the like, from the passage "... the mobility of Si is 5,700 cm2/Vs ..." (in the original Japanese, 「…Siの移動度は5,700cm2/Vs…」), which is part of the target document. Next, from the extracted words, the material name "Si" and the mobility value "5,700" are extracted as attribute information using predetermined rules.

Furthermore, "Si" and "5,700" are identified as a pair consisting of a material name and its mobility. In the set A of clue information used to identify the attribute information, rules are set in advance, for example, that the number of words between the "material" and the "mobility" value is 3 and that those words are "no" (の), "mobility" (移動度), and "wa" (は).
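The extraction in this example can be sketched as below, assuming morphological analysis has already produced a token list. The material dictionary `MATERIALS`, the number pattern, the function `extract_pairs`, and the romanized tokens "no" and "wa" (standing in for the Japanese particles の and は) are assumptions for illustration, not the patent's actual rules.

```python
import re

MATERIALS = {"Si", "Ge", "GaAs"}  # hypothetical material-name dictionary
NUMBER = re.compile(r"^\d{1,3}(,\d{3})*(\.\d+)?$")  # e.g. "5,700"


def extract_pairs(tokens, max_between=3):
    """Pair each material name with the next numeric token and record clue information."""
    pairs = []
    for i, tok in enumerate(tokens):
        if tok not in MATERIALS:
            continue
        for j in range(i + 1, len(tokens)):
            if NUMBER.match(tokens[j]):
                between = tokens[i + 1:j]
                if len(between) <= max_between:
                    pairs.append({
                        "material": tok,
                        "value": tokens[j],
                        # Clue information: words between the two attributes.
                        "between_count": len(between),
                        "between_words": tuple(between),
                    })
                break  # only consider the nearest numeric token
    return pairs


# Tokens standing in for the morphological analysis of the example sentence.
tokens = ["Si", "no", "mobility", "wa", "5,700", "cm2/Vs"]
pairs = extract_pairs(tokens)
```

Here the recorded clue information matches set A in the text: three words between the material name and the value.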

Suppose that the set of clue information, such as the number of words, is changed as a result of learning and inference by the learning inference unit 105. For example, suppose that attribute information is extracted from a new target document and used for learning, and that this yields a new clue information set A' to which the clue "the word 'cm2/Vs' appears immediately after the mobility value" has been added.

Based on the new clue information set A', attribute information is extracted from the target documents, and the extraction results, the feature set corresponding to each extraction result, and the scores are calculated. The items to be evaluated are then determined, based on predetermined rules, from the extraction results based on the new feature set A', the past extraction results, and the labels assigned by the assignment unit 104 (step S202). For example, when the assigned label has changed, the extraction result is set as an evaluation target.
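The "label changed" rule for choosing evaluation targets might look like the following sketch. The dictionaries mapping extraction results to labels and the function name `select_evaluation_targets` are hypothetical, and ignoring results that have no previous label is an assumption; the source only gives a changed label as one example criterion.

```python
def select_evaluation_targets(previous, current):
    """Return extraction results whose assigned label differs from the previous run.

    previous, current: dicts mapping an extraction result to its label (True/False).
    Results that did not exist in the previous run are ignored in this sketch.
    """
    return [item for item, label in current.items()
            if item in previous and previous[item] != label]


# Hypothetical labels from two consecutive runs.
previous_labels = {("Si", "5,700"): True, ("Ge", "3,900"): True}
current_labels = {("Si", "5,700"): True, ("Ge", "3,900"): False, ("GaAs", "8,500"): True}
targets = select_evaluation_targets(previous_labels, current_labels)
```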

In step S203, the result of the correctness determination for each evaluation target is accepted.

FIG. 4 shows an example of judging (evaluating) whether extracted attributes are correct. In this example, when the user makes the correctness determination, extraction results are selected according to a predetermined criterion or at random and displayed, and the user marks each one as correct (○) or incorrect (×).

FIG. 5 shows an example of the relationship between sets of clue information and the number of evaluations for which a correctness determination was made. Here it is assumed that, as the results of multiple evaluations, each set of clue information and the relationship between the score and the evaluation result for each trial of the learning inference unit 105 are stored in the storage unit.

For example, set A was evaluated as correct (○) in all three trials. When the clue information set A is obtained, the evaluation result is therefore set to correct (○). In this case, no improvement is proposed as far as the accuracy of information extraction is concerned.

Specifically, as shown in the correctness determination results of FIG. 6, the clue information of set A yields a correct evaluation result even for different target documents and for different sentences within the same document.

For example, in the first evaluation the pair "Si" and "5,700" is judged "correct (○)", and a correctness determination (evaluation) may also be made in the second and subsequent evaluations when the "mobility" takes a different value. This is because, even if the notations of the "material" and the "mobility" are similar, as long as they are not exactly identical, the evaluation can be expected to improve the accuracy of the feature quantities and clue information. Rather than evaluating the same pattern many times, it is preferable to evaluate patterns in which the "material" or the "mobility" differs, as in FIG. 6.

Returning to FIG. 5, in set B correct (○) and incorrect (×) determinations appear in equal numbers, so it cannot be determined whether the current clue information set is suitable for information extraction. That is, clue information may need to be added, or the current set of clue information may be unsuitable for the desired information extraction. In this case, the information extraction support device 101 outputs that additional clue information (feature quantities) is needed for the correctness determination.

In set C of FIG. 5, incorrect (×) determinations outnumber correct ones, so the data used for learning or the labeling rules may be inappropriate. In this case, the information extraction support device 101 outputs that the data used for learning needs to be changed. If the problem lies in the training data itself, a warning that learning with new data is necessary may be output.

While evaluation is still in progress there may also be cases, such as set D of FIG. 5, in which neither correct (○) nor incorrect (×) has been recorded. The information extraction support device 101 may output a message prompting evaluation.
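The four cases of FIG. 5 (sets A to D) can be summarized as a small decision function. This is a sketch under assumptions: the function name `suggest_action` and the returned message strings are hypothetical, and the source does not say what happens when correct determinations outnumber a nonzero count of incorrect ones, so that case is returned as "undetermined".

```python
def suggest_action(num_correct: int, num_incorrect: int) -> str:
    """Map the counts of correct and incorrect determinations for one clue set to a suggestion."""
    if num_correct == 0 and num_incorrect == 0:
        return "prompt evaluation"                              # set D: not yet evaluated
    if num_incorrect == 0:
        return "no change proposed"                             # set A: all correct
    if num_correct == num_incorrect:
        return "additional clue information (features) needed"  # set B: tie
    if num_incorrect > num_correct:
        return "change training data or labeling rule"          # set C: mostly incorrect
    return "undetermined"  # mostly correct but mixed: not specified in the source
```

For example, three correct determinations and none incorrect (set A) yields "no change proposed".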

Returning to FIG. 2, in step S204, proposed changes to feature extraction and to the labeling rules are presented based on changes in the past and current extraction results, the corresponding feature sets, the scores, and the evaluation results.

The change proposals use, for example, the evaluation history stored in the storage unit. The change proposals below are examples for the case of setting rules for distant supervision.

(Change proposal 1) Proposal based on a single evaluation result
- When the feature set of information extraction result e is A and the evaluation result is correct (○):
  if count(feature set A, evaluation result ○) > threshold, and
  count(feature set A, evaluation result ○) / count(feature set A, evaluation result ×) > threshold,
  then the rule "if the feature set is A, the result is correct (○)" is adopted.
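The two threshold conditions of change proposal 1 can be checked as in the sketch below. The source uses the word "threshold" for both conditions without giving values, so treating them as two separate parameters (`min_count`, `min_ratio`), and treating a zero count of incorrect results as satisfying the ratio condition, are assumptions.

```python
def should_adopt_rule(n_correct: int, n_incorrect: int,
                      min_count: int, min_ratio: float) -> bool:
    """Decide whether to adopt the rule "feature set A implies correct".

    n_correct / n_incorrect: counts of correct and incorrect evaluations
    observed for extraction results with feature set A.
    """
    if n_correct <= min_count:
        return False
    if n_incorrect == 0:
        # The ratio is unbounded; treated as passing (an assumption).
        return True
    return n_correct / n_incorrect > min_ratio
```

With hypothetical thresholds `min_count=5` and `min_ratio=3.0`, ten correct and one incorrect evaluation would adopt the rule, while ten correct and five incorrect would not.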

(Change proposal 2) Improvement proposal based on multiple evaluation results (change in features)
- When the feature set of information extraction result e was A with the evaluation result correct (○), and the feature set for e has since become A':
  the rule "if the feature set is A', the result is correct (○)" is adopted.

(Change proposal 3) Improvement proposal based on multiple evaluation results (change in score)
(3a) When the score, for example a probability value, decreases with the same features, the device outputs that "examples of new clue information (features) are needed".
For example, if in the first run the feature set of information extraction result e is A, the probability value is 0.95, and the evaluation result is ○, and in the second run the probability value drops to 0.7, then accuracy has fallen with set A alone, so examples of "new features" may be presented.
(3b) When the features change and the score, for example a probability value, decreases, the device outputs "diff(A, A')" as the factor reducing accuracy.
For example, if in the first run the feature set of information extraction result e is A, the probability value is 0.95, and the evaluation result is ○, and in the second run the feature set changes to A' and the probability value drops to (for example) 0.7, then the change in the feature set is considered to be the cause, so the difference between the feature sets is output.
In addition, to generate more general rules, a correctness determination may be made for the intersection (A∩B) of feature set A and feature set B, and the intersection may be treated as a candidate for a rule.
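Cases (3a) and (3b) and the intersection-based candidate can be sketched with Python sets. The function names, the returned tags, and the clue strings are hypothetical, and interpreting diff(A, A') as the symmetric difference of the two feature sets is an assumption; the source does not define diff.

```python
def propose_from_score_change(first, second):
    """Compare two runs of the same extraction result.

    first, second: dicts with 'features' (a set) and 'score' (a float).
    Returns None when the score did not drop, otherwise a (tag, detail) tuple.
    """
    if second["score"] >= first["score"]:
        return None
    if first["features"] == second["features"]:
        # (3a) same features, lower score: ask for new feature examples.
        return ("need_new_feature_examples", set())
    # (3b) features changed and score dropped: report what changed.
    return ("feature_change_suspected", first["features"] ^ second["features"])


def common_rule_candidate(features_a, features_b):
    """Intersection of two feature sets, a candidate for a more general rule."""
    return features_a & features_b


# Hypothetical runs using the probability values from the text.
run1 = {"features": {"between=3"}, "score": 0.95}
run2_same = {"features": {"between=3"}, "score": 0.7}
run2_changed = {"features": {"between=3", "next=cm2/Vs"}, "score": 0.7}
```

In the second comparison, the reported difference is the single added clue, which is what a user would inspect first when looking for the cause of the drop.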

Returning to FIG. 2, in step S205, one or more of the output change proposals are applied.

As described above, applying the output change proposals and making the correctness determination (evaluation) are repeated multiple times, and the extracted candidates, the clue information sets, the numbers of items labeled correct and incorrect, and the actual evaluation results are recorded. The change in the distribution of correctness determination results for the recorded extraction candidates and clue information sets is calculated, and a method for changing the clue information and the labeling rules is output according to that change. This makes it possible to decide what should be evaluated according to the content of the evaluations and changes, so that accuracy can be improved efficiently.

As described above, the feature quantities or the rules can be changed. Moreover, according to the first embodiment, the targets of evaluation can be identified even when it is not known how the feature quantities or rules should be changed, so information can be extracted with high accuracy even when the person extracting it is not an expert.

Finally, an example of the hardware configuration of a computer used for the information extraction support device 101 of the first embodiment is described.

[Example of hardware configuration]
FIG. 7 is a diagram showing an example of the hardware configuration of a computer used for the information extraction support device 101 of the first embodiment.

The computer used for the information extraction support device 101 includes a control device 501, a main storage device 502, an auxiliary storage device 503, a display device 504, an input device 505, and a communication device 506. The control device 501, the main storage device 502, the auxiliary storage device 503, the display device 504, the input device 505, and the communication device 506 are connected via a bus 510.

The control device 501 executes a program read from the auxiliary storage device 503 into the main storage device 502. The main storage device 502 is memory such as ROM (Read Only Memory) and RAM (Random Access Memory). The auxiliary storage device 503 is an HDD (Hard Disk Drive), an SSD (Solid State Drive), a memory card, or the like.

The display device 504 displays display information. The display device 504 is, for example, a liquid crystal display. The input device 505 is an interface for operating the computer. The input device 505 is, for example, a keyboard or a mouse. When the computer is a smart device such as a smartphone or a tablet terminal, the display device 504 and the input device 505 are, for example, a touch panel. The communication device 506 is an interface for communicating with other devices.

The program executed by the computer is recorded as a file in an installable or executable format on a computer-readable storage medium such as a CD-ROM, a memory card, a CD-R, or a DVD (Digital Versatile Disc), and is provided as a computer program product.

The program executed by the computer may also be stored on a computer connected to a network such as the Internet and provided by being downloaded via the network. The program executed by the computer may also be provided via a network such as the Internet without being downloaded.

The program executed by the computer may also be provided by being incorporated in advance in a ROM or the like.

The program executed by the computer has a module configuration that includes those functional blocks, among the functional configuration (functional blocks) of the information extraction support device 101 described above, that can be implemented by a program. In terms of actual hardware, the control device 501 reads the program from the storage medium and executes it, whereby each of these functional blocks is loaded onto the main storage device 502. That is, each of these functional blocks is generated on the main storage device 502.

Some or all of the functional blocks described above may be implemented by hardware such as an IC (Integrated Circuit) rather than by software.

When the functions are implemented using a plurality of processors, each processor may implement one of the functions, or may implement two or more of the functions.

The computer that implements the information extraction support device 101 may operate in any form. For example, the information extraction support device 101 may be implemented by a single computer. Alternatively, the information extraction support device 101 may be operated as a cloud system on a network.

[Example of device configuration]
FIG. 8 is a diagram showing an example of the device configuration of the information extraction support device 101 of the present embodiment. In the example of FIG. 8, the information extraction support device 101 includes a plurality of client devices 1a to 1z, a network 2, and a server device 3.

When there is no need to distinguish among the client devices 1a to 1z, they are simply referred to as the client device 1. The number of client devices 1 in the information extraction support device 101 may be arbitrary. The client device 1 is, for example, a computer such as a personal computer or a smartphone. The plurality of client devices 1a to 1z and the server device 3 are connected to each other via the network 2. The communication method of the network 2 may be wired, wireless, or a combination of both.

For example, the extraction unit 102, the identification unit 103, the granting unit 104, the learning estimation unit 105, and the output unit 106 of the information extraction support device 101 may be implemented by the server device 3 and operated as a cloud system on the network 2. In that case, for example, the client device 1 may receive the result of a correctness judgment from the user and transmit that result to the server device 3. The server device 3 may then transmit the change method output by the output unit 106 to the client device 1.
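The client/server exchange described above can be sketched roughly as follows. This is only an illustrative sketch, not the implementation of the embodiment: the JSON message format, the Judgment fields, the function names, and the policy used to choose a suggested change are all assumptions introduced here, and the network transport is replaced by plain function calls.

```python
import json
from dataclasses import dataclass, asdict


@dataclass
class Judgment:
    """One correctness judgment sent from a client device (hypothetical schema)."""
    item_id: str             # identifier of an extracted item (assumed)
    label: bool              # label previously assigned by rule on the server side
    user_says_correct: bool  # correctness judgment entered by the user


def suggest_change(judgment: Judgment) -> dict:
    """Toy stand-in for the output unit 106.

    Assumed policy: propose a change only when the rule-based label
    and the user's judgment disagree.
    """
    if judgment.label == judgment.user_says_correct:
        action = "none"
    elif judgment.label:
        # Labeled correct but judged incorrect by the user.
        action = "tighten_rule_or_remove_feature"
    else:
        # Labeled incorrect but judged correct by the user.
        action = "relax_rule_or_add_feature"
    return {"item_id": judgment.item_id, "action": action}


def client_encode(judgment: Judgment) -> str:
    """Client device 1: serialize the judgment for transmission."""
    return json.dumps(asdict(judgment))


def server_handle(request_body: str) -> str:
    """Server device 3: decode the judgment and return a suggested change."""
    judgment = Judgment(**json.loads(request_body))
    return json.dumps(suggest_change(judgment))


if __name__ == "__main__":
    request = client_encode(Judgment("price-001", label=True, user_says_correct=False))
    print(server_handle(request))
```

In an actual deployment, the call to server_handle would be replaced by a request over the network 2, and the suggestion logic would be driven by the learning estimation unit 105 rather than by a fixed comparison.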

Although embodiments of the present invention have been described, they are presented as examples and are not intended to limit the scope of the invention. These novel embodiments can be implemented in various other forms, and various omissions, replacements, and changes can be made without departing from the gist of the invention. These embodiments and their modifications are included in the scope and gist of the invention, and are also included in the invention described in the claims and its equivalents.

1 Client device
2 Network
3 Server device
101 Information extraction support device
102 Extraction unit
103 Identification unit
104 Granting unit
105 Learning estimation unit
106 Output unit

Claims

An information extraction support device comprising:
an extraction unit that extracts first information about document data from the document data, which contains a plurality of words;
an identification unit that identifies, from the first information, related information as clue information;
an assigning unit that assigns, based on a predetermined rule, a label indicating correct or incorrect to the first information;
a learning inference unit that performs learning and inference of a rule for extracting new first information, based on the first information, a feature amount used to extract the first information, and the label; and
an output unit that outputs a method of changing at least one of the feature amount and the rule, based on the label and a correctness judgment that is the result of judging whether the first information is correct.

The information extraction support device according to claim 1, wherein the output unit outputs the change method in combination with at least one of the correctness judgment of the extraction result of the extraction unit, the clue information, and the feature amount.

The information extraction support device according to claim 1, further comprising a reception unit that accepts the correctness judgment from a user,
wherein, when the reception unit receives input indicating that it is not clear whether the first information is correct or incorrect, the clue information collection also outputs another similar result.

An information extraction support method comprising:
a step of extracting first information about document data from the document data, which contains a plurality of words;
a step of identifying, from the first information, related information as clue information;
a step of assigning, based on a predetermined rule, a label indicating correct or incorrect to the first information;
a step of performing learning and inference of a rule for extracting new first information, based on the first information, a feature amount used to extract the first information, and the label; and
a step of outputting a method of changing at least one of the feature amount and the rule, based on the label and a correctness judgment that is the result of judging whether the first information is correct.

An information extraction support program for causing a computer to function as:
an extraction unit that extracts first information about document data from the document data, which contains a plurality of words;
an identification unit that identifies, from the first information, related information as clue information;
an assigning unit that assigns, based on a predetermined rule, a label indicating correct or incorrect to the first information;
a learning inference unit that performs learning and inference of a rule for extracting new first information, based on the first information, a feature amount used to extract the first information, and the label; and
an output unit that outputs a method of changing at least one of the feature amount and the rule, based on the label and a correctness judgment that is the result of judging whether the first information is correct.