JP7237878B2

JP7237878B2 - Domain knowledge utilization support device, program, and domain knowledge utilization support method

Info

Publication number: JP7237878B2
Application number: JP2020049826A
Authority: JP
Inventors: 一則和久井; 博章三沢; 博基古川
Original assignee: Hitachi Industry and Control Solutions Co Ltd
Current assignee: Hitachi Industry and Control Solutions Co Ltd
Priority date: 2020-03-19
Filing date: 2020-03-19
Publication date: 2023-03-13
Anticipated expiration: 2040-03-19
Also published as: JP2021149637A

Description

The present invention relates to a domain knowledge utilization support device, a program, and a domain knowledge utilization support method.

Conventional natural language processing, such as text classification and text retrieval, focuses on the words contained in individual texts. For example, the text classification method described in Patent Document 1 classifies texts using the words they contain and the relationships between those words (word order and dependency relations).

Patent Document 1: Japanese Translation of PCT International Application Publication No. 2015-511733

Conventional techniques use the words in a text and the relationships between them, but they do not use other information contained in the text, which leaves room for improving accuracy in text classification, text retrieval, and similar tasks. The same applies beyond text classification and retrieval to natural language processing in general, including translation based on machine learning over large amounts of data, where accuracy and quality could likewise be improved. One example of such unused information is the category of a word within the domain (industry) to which the text pertains.

The present invention has been made in view of this background, and its object is to provide a domain knowledge utilization support device, a program, and a domain knowledge utilization support method that make it possible to extract domain-related features contained in a text.

In order to solve the above problem, a domain knowledge utilization support device according to the present invention includes: a text dividing unit that divides a text into sentences; a domain determination unit that determines the domain to which each sentence pertains; an implication extraction unit that extracts the implication of the word serving as the predicate of the sentence by referring to an implication information database that stores words, domains, and the implications of the words in association with one another; a domain information extraction unit that extracts words and their categories from the sentence by referring to a domain-specific category information database that stores, for each domain, words in association with their categories; and a structuring unit that converts the text into a tree. In the tree, the root corresponds to the text. The lower-level nodes of the root correspond to the sentences contained in the text and are nodes indicating the implication of each sentence as extracted by the implication extraction unit. The lower-level nodes of a node indicating the implication of a sentence are domain nodes, and the lower-level nodes of a domain node include nodes indicating the categories of the words contained in the sentence as extracted by the domain information extraction unit.

According to the present invention, it is possible to provide a domain knowledge utilization support device, a program, and a domain knowledge utilization support method that make it possible to extract domain-related features contained in a text.

FIG. 1 is a functional block diagram of the domain knowledge utilization support device according to this embodiment.
FIG. 2 is a diagram showing the data structure of the implication information database according to this embodiment.
FIG. 3 is a diagram showing the data structure of the domain-specific category information database according to this embodiment.
FIG. 4 is a diagram showing the data structure of the general information database according to this embodiment.
FIG. 5 is a data structure diagram of the implication image conversion information database according to this embodiment.
FIG. 6 is a data structure diagram of the category image conversion information database according to this embodiment.
FIG. 7 is a data structure diagram of the part-of-speech image conversion information database according to this embodiment.
FIG. 8 is an example of a text used to explain this embodiment.
FIG. 9 is a diagram for explaining the operation of the implication extraction unit, the domain information extraction unit, and the general information extraction unit according to this embodiment.
FIG. 10 is a tree generated by the structuring unit according to this embodiment.
FIG. 11 is an image generated by the imaging unit according to this embodiment.
FIG. 12 is a flowchart of the information extraction processing executed by the domain knowledge utilization support device according to this embodiment.

A domain knowledge utilization support device according to a mode (embodiment) for carrying out the present invention is described below. The device includes an implication information database in which words are associated with implications, a domain-specific category information database in which words are associated with categories for each domain (industry), and a general information database in which general words that do not belong to any specific domain are associated with their parts of speech. The device divides an input text into sentences and extracts implications, words, and word categories from the sentences. It then converts the text into a tree (tree-structured data) or an image containing the extracted implications, words, and word categories, and outputs the result.

The tree contains the implications of the sentences in the text and the domain-specific categories of words, so more information can be extracted than with conventional techniques that extract only words. In addition, because the device derives categories from words according to the domain related to the text, the extracted information reflects that domain. Using this extracted information improves accuracy and quality in text classification, text retrieval, translation, and the like.

≪Configuration of Domain Knowledge Utilization Support Device≫
FIG. 1 is a functional block diagram of a domain knowledge utilization support device 100 according to this embodiment. The domain knowledge utilization support device 100 includes a control unit 110, a storage unit 120, and an input/output unit 180. The input/output unit 180 includes user interfaces such as a display, a keyboard, and a mouse, as well as a communication interface with other devices.

The storage unit 120 stores a program 121, an implication information database 130 (see FIG. 2, described later), a domain-specific category information database 140 (see FIG. 3, described later), a general information database 150 (see FIG. 4, described later), and an image conversion information database 160 (see FIGS. 5 to 7, described later). The program 121 contains the procedure of the information extraction processing (see FIG. 12, described later) executed by the control unit 110.

≪Configuration of Domain Knowledge Utilization Support Device: Implication Information Database≫
FIG. 2 is a diagram showing the data structure of the implication information database 130 according to this embodiment. An implication is the meaning that a sentence carries, namely the meaning of the sentence's predicate. The implication information database 130 is, for example, tabular data in which each row (record) represents one implication and contains an implication 131, a domain 132, and a word 133.

The implication 131 is the implication, in the domain 132, of the word 133 serving as the predicate of a sentence. The domain 132 is a type of industry such as agriculture, construction, manufacturing, or retail. Even the same word can have different implications depending on the domain to which the sentence pertains. In this embodiment, the implication of a sentence is assumed to be determined by the word serving as its predicate and by the industry to which the sentence pertains. When the domain 132 is "general", the implication 131 is common to all domains (industries).
Record 139 indicates that when the predicate of a sentence pertaining to "manufacturing" is "becomes a cause", the implication of the sentence is "factor".
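The lookup described for FIG. 2 can be sketched as follows. This is a minimal illustration, not the device's actual implementation: the table name `IMPLICATION_DB`, the function `lookup_implication`, the English gloss of the predicate in record 139, and the second row are all assumptions made for the example. Falling back to rows whose domain is "general" is likewise an interpretation of the statement that such implications are common to all domains.

```python
from typing import List, Optional, Tuple

# Each row is (implication 131, domain 132, word 133).
# The first row mirrors record 139 with the predicate glossed in English;
# the second row is invented purely for illustration.
IMPLICATION_DB: List[Tuple[str, str, str]] = [
    ("factor", "manufacturing", "becomes a cause"),
    ("explanation", "general", "means"),
]


def lookup_implication(predicate: str, domain: str) -> Optional[str]:
    """Return the implication of a predicate, preferring the sentence's domain.

    Rows whose domain is "general" are tried second (an assumed fallback).
    """
    for wanted_domain in (domain, "general"):
        for implication, row_domain, word in IMPLICATION_DB:
            if word == predicate and row_domain == wanted_domain:
                return implication
    return None


print(lookup_implication("becomes a cause", "manufacturing"))  # factor
print(lookup_implication("means", "retail"))                   # explanation
```

Keying the search on both the predicate and the domain is what allows the same word to yield different implications in different industries.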

≪Configuration of Domain Knowledge Utilization Support Device: Domain-Specific Category Information Database≫
FIG. 3 is a diagram showing the data structure of the domain-specific category information database 140 according to this embodiment. The example shown in FIG. 3 is the domain-specific category information database 140 for the "manufacturing" domain.
The domain-specific category information database 140 stores word categories (types of word meaning) for each domain (industry). It is, for example, tabular data in which each row (record) represents one word and contains a word 141, a category 142, and related words 143.

The word 141 is the headword. The category 142 is the category of the word 141. The related words 143 are words related to the word 141 that belong to the same category 142. Record 149 indicates that, in the "manufacturing" domain, "manufacturing site" belongs to the category "business type" and has related words such as "production site" and "production line". "Production site" and "production line" also belong to the category "business type".
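Because related words share the category of their headword, one way to use this table is to flatten it into a word-to-category index per domain. The sketch below assumes an in-memory dictionary named `DOMAIN_CATEGORY_DB` and a helper `build_category_index`; both names and the storage layout are hypothetical, and only the single record shown comes from the description of record 149.

```python
from typing import Dict, List

# Hypothetical in-memory form of the per-domain table:
# word 141, category 142, related words 143 (record 149 only).
DOMAIN_CATEGORY_DB: Dict[str, List[dict]] = {
    "manufacturing": [
        {
            "word": "manufacturing site",
            "category": "business type",
            "related": ["production site", "production line"],
        },
    ],
}


def build_category_index(domain: str) -> Dict[str, str]:
    """Map every headword and related word of a domain to its category."""
    index: Dict[str, str] = {}
    for record in DOMAIN_CATEGORY_DB.get(domain, []):
        for term in [record["word"], *record["related"]]:
            index[term] = record["category"]
    return index


index = build_category_index("manufacturing")
print(index["production line"])  # business type
```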

≪Configuration of Domain Knowledge Utilization Support Device: General Information Database≫
FIG. 4 is a diagram showing the data structure of the general information database 150 according to this embodiment. The general information database 150 contains words that are not specific to any domain. It is, for example, tabular data in which each row (record) represents one word and contains a word 151 and a part of speech 152.

The part of speech 152 is the part of speech of the word 151, for example a verb or an adjective. Depending on the word 151, the part of speech 152 is given an attribute indicating whether the word is positive (+) or negative (-). Record 159 indicates that the word "delicious" is a positive adjective. A proper noun may be given an attribute indicating what it names, such as a person, country, organization, or car. Likewise, a noun may be given an attribute indicating its type, such as animal, body part, or abstract concept.
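A record of the general information database can be modeled as a part of speech plus optional attributes. The following sketch is illustrative only: the class `GeneralEntry`, its field names, the helper `describe`, and the "Japan" row are assumptions; only the "delicious" entry reflects record 159.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass(frozen=True)
class GeneralEntry:
    part_of_speech: str
    polarity: Optional[str] = None   # "+" (positive) or "-" (negative), if any
    name_kind: Optional[str] = None  # what a proper noun names, if any


GENERAL_DB: Dict[str, GeneralEntry] = {
    "delicious": GeneralEntry("adjective", polarity="+"),       # record 159
    "Japan": GeneralEntry("proper noun", name_kind="country"),  # invented row
}


def describe(word: str) -> Optional[str]:
    """Return a label such as 'adjective(+)', or None for unknown words."""
    entry = GENERAL_DB.get(word)
    if entry is None:
        return None
    attribute = entry.polarity or entry.name_kind
    if attribute:
        return f"{entry.part_of_speech}({attribute})"
    return entry.part_of_speech


print(describe("delicious"))  # adjective(+)
```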

≪Configuration of Domain Knowledge Utilization Support Device: Image Conversion Information Database≫
FIG. 5 is a data structure diagram of the implication image conversion information database 161 according to this embodiment. FIG. 6 is a data structure diagram of the category image conversion information database 166 according to this embodiment. FIG. 7 is a data structure diagram of the part-of-speech image conversion information database 171 according to this embodiment. The image conversion information database 160 (see FIG. 1) comprises the implication image conversion information database 161, the category image conversion information database 166, and the part-of-speech image conversion information database 171.

The implication image conversion information database 161 is, for example, tabular data that stores the color 163 assigned to each implication 162, where the implication 162 corresponds to the implication 131 of the implication information database 130 (see FIG. 2).
The category image conversion information database 166 is, for example, tabular data that stores the color 168 assigned to each category 167, where the category 167 corresponds to the category 142 of the domain-specific category information database 140 (see FIG. 3).
The part-of-speech image conversion information database 171 is, for example, tabular data that stores the color 173 assigned to each part of speech 172, where the part of speech 172 corresponds to the part of speech 152 of the general information database 150 (see FIG. 4).
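The three image conversion tables are all label-to-color mappings, so they can be sketched as one nested dictionary. The names `IMAGE_CONVERSION_DB`, `color_for`, and `FALLBACK_COLOR`, the key format "noun:person", and the fallback behavior are assumptions made for illustration. The three color assignments are the examples the description gives for FIG. 11.

```python
from typing import Dict

# One inner table per database: 161 (implication), 166 (category),
# 171 (part of speech). Entries follow the examples given for FIG. 11.
IMAGE_CONVERSION_DB: Dict[str, Dict[str, str]] = {
    "implication": {"influence": "gray"},
    "category": {"industry": "yellow-green"},
    "part_of_speech": {"noun:person": "orange"},
}
FALLBACK_COLOR = "white"  # assumed default for labels with no assigned color


def color_for(kind: str, label: str) -> str:
    """Look up the color assigned to a label in the table for its kind."""
    return IMAGE_CONVERSION_DB.get(kind, {}).get(label, FALLBACK_COLOR)


print(color_for("implication", "influence"))  # gray
```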

≪Configuration of Domain Knowledge Utilization Support Device: Control Unit≫
Returning to FIG. 1, the control unit 110 is composed of a CPU (Central Processing Unit) and includes a text dividing unit 111, a domain determination unit 112, an implication extraction unit 113, a domain information extraction unit 114, a general information extraction unit 115, a structuring unit 116, and an imaging unit 117.

FIG. 8 is an example of a text 210 used to explain this embodiment. The text 210 consists of two paragraphs and two sentences. Each functional unit of the control unit 110 is described below using the text 210 as an example.
The text dividing unit 111 divides the text input from the input/output unit 180 into paragraphs and sentences. For example, the text dividing unit 111 divides the input text 210 into two sentences.
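The description does not say how the text dividing unit 111 finds sentence boundaries. As a rough sketch under that caveat, the function below (the name `split_text` is hypothetical) splits paragraphs on blank lines and sentences on terminal punctuation, including the Japanese full stop. It is deliberately naive and would, for instance, also split inside a decimal number such as "3.14".

```python
import re
from typing import List

# Split after sentence-ending punctuation (ASCII and Japanese),
# consuming any whitespace that follows it.
_SENTENCE_END = re.compile(r"(?<=[.!?。！？])\s*")


def split_text(text: str) -> List[List[str]]:
    """Return a list of paragraphs, each a list of sentences."""
    paragraphs = []
    for paragraph in re.split(r"\n\s*\n", text):
        sentences = [s.strip() for s in _SENTENCE_END.split(paragraph) if s.strip()]
        if sentences:
            paragraphs.append(sentences)
    return paragraphs


sample = "We make parts. They ship fast.\n\nPrices rose."
print(split_text(sample))
```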

The domain determination unit 112 determines the domain (industry) of a paragraph or sentence. It determines the domain of a sentence from the words in the sentence, the dependency relations between those words, and similar cues.
The implication extraction unit 113 extracts the implication of a sentence. Specifically, it identifies the record in the implication information database 130 (see FIG. 2) whose word 133 contains the predicate of the sentence and whose domain 132 is the domain determined by the domain determination unit 112, and extracts the implication 131 of that record.
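The description leaves the domain determination method open beyond saying that it uses words and dependency relations. The sketch below substitutes a much simpler technique, keyword counting, purely to make the step concrete. The table `DOMAIN_KEYWORDS`, its contents, the function `determine_domain`, and the choice to return "general" when nothing matches are all assumptions.

```python
from typing import Dict, Set

# Invented keyword lists; a real system would derive its cues differently.
DOMAIN_KEYWORDS: Dict[str, Set[str]] = {
    "manufacturing": {"factory", "production line", "manufacturing site"},
    "agriculture": {"harvest", "crop", "farm"},
}


def determine_domain(sentence: str) -> str:
    """Pick the domain whose keywords occur most often in the sentence."""
    lowered = sentence.lower()
    best_domain, best_hits = "general", 0
    for domain, keywords in DOMAIN_KEYWORDS.items():
        hits = sum(1 for keyword in keywords if keyword in lowered)
        if hits > best_hits:
            best_domain, best_hits = domain, hits
    return best_domain


print(determine_domain("The production line at the factory stopped."))  # manufacturing
```

The returned domain would then select both the matching row in the implication information database and the per-domain category table to consult.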

The domain information extraction unit 114 extracts those words in the sentence that are stored in the domain-specific category information database 140 for the sentence's domain. The general information extraction unit 115 extracts those words in the sentence that are stored in the general information database 150 (general words).
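Both extraction units perform the same kind of operation: find lexicon entries that occur in the sentence and return them with their labels. A minimal sketch, assuming simple substring matching (the description does not specify the matching method) and a hypothetical helper named `extract_terms`:

```python
from typing import Dict, List, Tuple


def extract_terms(sentence: str, lexicon: Dict[str, str]) -> List[Tuple[str, str]]:
    """Return (term, label) pairs for lexicon terms found in the sentence.

    Longer terms are checked first so that multi-word entries are listed
    before shorter ones; matching is case-insensitive substring search.
    """
    lowered = sentence.lower()
    found = []
    for term in sorted(lexicon, key=len, reverse=True):
        if term.lower() in lowered:
            found.append((term, lexicon[term]))
    return found


sentence = "Our manufacturing site serves delicious lunches."
domain_lexicon = {"manufacturing site": "business type", "production line": "business type"}
general_lexicon = {"delicious": "adjective(+)"}

print(extract_terms(sentence, domain_lexicon))   # [('manufacturing site', 'business type')]
print(extract_terms(sentence, general_lexicon))  # [('delicious', 'adjective(+)')]
```

Substring search is used here because it also works for languages written without spaces between words; a production system would more likely match against the output of a morphological analyzer.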

FIG. 9 is a diagram for explaining the operation of the implication extraction unit 113, the domain information extraction unit 114, and the general information extraction unit 115 according to this embodiment. The sentence 220 is the first sentence of the text 210 (see FIG. 8).
In the sentence 220, words with a dashed underline were extracted by the implication extraction unit 113, words with a solid underline were extracted by the domain information extraction unit 114, and words with a dotted underline were extracted by the general information extraction unit 115.
The implication of the sentence extracted by the implication extraction unit 113 is shown below the dashed underline. The categories of the words extracted by the domain information extraction unit 114 are shown below the solid underlines.

Returning to FIG. 1, the structuring unit 116 converts the text 210 into a tree (tree-structured data).
FIG. 10 shows a tree 300 generated by the structuring unit 116 according to this embodiment. The root of the tree 300 (root node 301) represents the text 210. The lower-level nodes of the root represent sentences; in the tree 300 they are nodes 311 and 312, each labeled with the implication of the sentence extracted by the implication extraction unit 113. The first sentence 220 of the text 210 corresponds to the node 311, and the second sentence corresponds to the node 312. The predicate of the first sentence 220 is "provides", and its implication is "influence" (see FIG. 2). The subtree rooted at the node 311, which corresponds to the first sentence 220, is described below.

The lower-level nodes of the sentence node 311 are a domain node 321 and a general node 322. Below the domain node 321 are category nodes 331 for the categories that the domain information extraction unit 114 extracted from the sentence corresponding to the node 311. Below each category node 331 are nodes 341 for the words in the sentence that belong to that category. Below the general node 322 are nodes 332 for the words (general words) extracted by the general information extraction unit 115. The domain of the second sentence differs from that of the first sentence 220 (manufacturing), and so do its categories.
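The node layout described for the tree 300 can be sketched with a small recursive data structure. The class `Node`, the helpers `build_sentence_subtree` and `render`, and the node labels and example words are illustrative assumptions; the nesting order (text, implication, domain or general, category, word) follows the description.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Node:
    label: str
    children: List["Node"] = field(default_factory=list)


def build_sentence_subtree(
    implication: str,
    domain: str,
    category_words: Dict[str, List[str]],
    general_words: List[str],
) -> Node:
    """Build the subtree for one sentence (cf. node 311 and its descendants)."""
    domain_node = Node(
        domain,
        [Node(category, [Node(word) for word in words])
         for category, words in category_words.items()],
    )
    general_node = Node("general", [Node(word) for word in general_words])
    return Node(implication, [domain_node, general_node])


def render(node: Node, depth: int = 0) -> List[str]:
    """Return the tree as indented lines, one node per line."""
    lines = ["  " * depth + node.label]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines


root = Node("text", [
    build_sentence_subtree(
        "influence", "manufacturing",
        {"business type": ["manufacturing site"]}, ["delicious"],
    ),
])
print("\n".join(render(root)))
```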

Returning to FIG. 1, the imaging unit 117 converts the text 210 into an image.
FIG. 11 shows an image 400 generated by the imaging unit 117 according to this embodiment. The image 400 includes a region 410 corresponding to the first sentence of the text 210 and a region 450 corresponding to the second sentence. The image 400 also includes rectangles 420 and 460 corresponding to the implications of the sentences in the text, rectangles 430 and 470 corresponding to the categories of the words in each sentence, and rectangles 440 and 480 corresponding to the parts of speech of the general words in each sentence. Sentence implications, word categories, and the parts of speech of general words are each represented by the color of a rectangle.

The imaging unit 117 determines the color of a rectangle corresponding to a sentence implication by referring to the implication image conversion information database 161 (see FIG. 5). For example, the color for the implication "influence" is gray, which is shown in FIG. 11 as a dot pattern.

The imaging unit 117 determines the color of a rectangle corresponding to a word category by referring to the category image conversion information database 166 (see FIG. 6). For example, the color for the category "industry" is yellow-green, which is shown in FIG. 11 as a pattern of horizontal lines.

The imaging unit 117 determines the color of a rectangle corresponding to the part of speech of a general word by referring to the part-of-speech image conversion information database 171 (see FIG. 7). For example, the color for a noun denoting a person is orange, which is shown in FIG. 11 as a checkered pattern.
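To make the rectangle-per-feature idea concrete, the sketch below emits an SVG string with one row of colored squares per sentence. The output format, the function name `render_image`, the square size, and the row layout are all assumptions; the description only states that each feature is drawn as a colored rectangle. The color names used are standard SVG/CSS keywords.

```python
from typing import List


def render_image(sentence_colors: List[List[str]], size: int = 20) -> str:
    """Return an SVG string with one row of colored squares per sentence."""
    rects = []
    for row, colors in enumerate(sentence_colors):
        for col, color in enumerate(colors):
            rects.append(
                f'<rect x="{col * size}" y="{row * size}" '
                f'width="{size}" height="{size}" fill="{color}"/>'
            )
    width = size * max((len(colors) for colors in sentence_colors), default=0)
    height = size * len(sentence_colors)
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
        + "".join(rects)
        + "</svg>"
    )


# First sentence: implication, category, part of speech; second: implication only.
svg = render_image([["gray", "yellowgreen", "orange"], ["gray"]])
print(svg)
```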

≪Information Extraction Processing≫
FIG. 12 is a flowchart of the information extraction processing executed by the domain knowledge utilization support device 100 according to this embodiment.
In step S11, the text dividing unit 111 divides the text input from the input/output unit 180 into sentences.

In step S12, the domain determination unit 112 determines the domain (industry) to which each sentence pertains.
In step S13, the implication extraction unit 113 extracts the predicate of the sentence and refers to the implication information database 130 (see FIG. 2) to extract the implication of the sentence. Next, the domain information extraction unit 114 refers to the domain-specific category information database 140 for the domain determined in step S12 and extracts the categories of the words in the sentence. The general information extraction unit 115 then refers to the general information database 150 and extracts the parts of speech of the words in the sentence.

In step S14, the structuring unit 116 converts the text into a tree (see FIG. 10) and outputs it. The tree may be output to the display of the input/output unit 180 or transmitted to another device.
In step S15, the imaging unit 117 converts the text into an image (see FIG. 11). The image may be output to the display of the input/output unit 180 or transmitted to another device.
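The flow of steps S11 to S13 can be sketched as a driver that wires together interchangeable step functions. The function `run_extraction`, its parameter names, and the shape of the result records are hypothetical; steps S14 and S15 would consume the returned list to build the tree and the image.

```python
from typing import Callable, Dict, List


def run_extraction(
    text: str,
    split: Callable[[str], List[str]],
    determine_domain: Callable[[str], str],
    extract_implication: Callable[[str, str], str],
    extract_categories: Callable[[str, str], Dict[str, str]],
    extract_parts_of_speech: Callable[[str], Dict[str, str]],
) -> List[dict]:
    """Run steps S11 to S13 and return one feature record per sentence."""
    results = []
    for sentence in split(text):                 # S11: divide into sentences
        domain = determine_domain(sentence)      # S12: determine the domain
        results.append({                         # S13: extract the features
            "sentence": sentence,
            "domain": domain,
            "implication": extract_implication(sentence, domain),
            "categories": extract_categories(sentence, domain),
            "parts_of_speech": extract_parts_of_speech(sentence),
        })
    return results


# Stub step functions, for demonstration only.
records = run_extraction(
    "A. B.",
    split=lambda t: [s.strip() + "." for s in t.split(".") if s.strip()],
    determine_domain=lambda s: "manufacturing",
    extract_implication=lambda s, d: "influence",
    extract_categories=lambda s, d: {},
    extract_parts_of_speech=lambda s: {},
)
print(len(records))  # 2
```

Passing the step functions in as parameters keeps the driver independent of how each unit is implemented, which matches the description's note that the databases may live on another device.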

≪Features of Information Extraction Processing≫
For each sentence in a text, the domain knowledge utilization support device 100 obtains the domain and extracts the implication, the word categories, and the parts of speech of words (general words). The extracted information is output in the form of a tree 300 (see FIG. 10) and an image 400 (see FIG. 11).
The output tree and image contain not only information about the words in the text but also the sentence implications and word categories determined by the domain of each sentence, as well as word attributes (positive or not). Even the same predicate or word can have a different implication or category depending on the domain, so the tree and image capture domain-dependent ambiguity in addition to the implications, categories, and attributes themselves. Using this information (the features of the text) is expected to improve accuracy and quality compared with conventional text retrieval, text classification, translation, and the like that rely on words and word relationships (word order and dependency relations).

≪Modification: Tree≫
In the tree 300 (see FIG. 10), the lower-level nodes of the implication node 311 corresponding to a sentence are the domain node 321 and the general node 322. Alternatively, the category nodes 331 may be placed directly below the implication node 311.
The tree may also omit the nodes 341 and 332 for the words in the sentence. Omitting the words reduces the amount of information in the tree, but makes it simpler and easier to read.
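The word-free variant of the tree can be obtained by filtering an existing tree. The sketch below assumes each node is a dictionary carrying a "kind" tag so that word nodes can be recognized; that representation and the function name `drop_word_nodes` are illustrative assumptions.

```python
from typing import Any, Dict


def drop_word_nodes(node: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of the tree without nodes whose kind is 'word'."""
    return {
        "label": node["label"],
        "kind": node["kind"],
        "children": [
            drop_word_nodes(child)
            for child in node["children"]
            if child["kind"] != "word"
        ],
    }


sentence_tree = {
    "label": "influence", "kind": "implication", "children": [
        {"label": "manufacturing", "kind": "domain", "children": [
            {"label": "business type", "kind": "category", "children": [
                {"label": "manufacturing site", "kind": "word", "children": []},
            ]},
        ]},
        {"label": "general", "kind": "general", "children": [
            {"label": "delicious", "kind": "word", "children": []},
        ]},
    ],
}
simplified = drop_word_nodes(sentence_tree)
print(simplified["children"][1])  # {'label': 'general', 'kind': 'general', 'children': []}
```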

≪Modification: Image≫
The image 400 (see FIG. 11) generated by the imaging unit 117 is composed of rectangles, but other shapes such as ellipses or polygons may be used. The rectangles in the image 400 correspond to implications, categories, and parts of speech, and the number of rectangles matches the number of sentences and words (implications, categories, and parts of speech), but the two need not match. For example, if the same category or part of speech occurs many times, the occurrences may be merged into a single rectangle that is larger than the others. Implications, categories, and parts of speech (including parts of speech with attributes) are represented by colors, but they may instead be represented by patterns as shown in FIG. 11, by a combination of color and pattern, or in some other form.
The rectangles corresponding to words may be omitted from the image. Omitting the words reduces the amount of information in the image, but makes it simpler and easier to read.
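Merging repeated categories or parts of speech into one larger rectangle can be sketched by counting occurrences. Scaling the width in proportion to the count is an assumption; the description only says the merged rectangle is larger. The function name `merge_rectangles` and the unit size are also hypothetical.

```python
from collections import Counter
from typing import List, Tuple


def merge_rectangles(labels: List[str], unit: int = 20) -> List[Tuple[str, int]]:
    """Merge repeated labels into one (label, width) pair each.

    The width grows linearly with the number of occurrences, and labels
    keep the order in which they first appear.
    """
    counts = Counter(labels)
    return [(label, unit * count) for label, count in counts.items()]


print(merge_rectangles(["business type", "product", "business type"]))
# [('business type', 40), ('product', 20)]
```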

≪Modification: Relations Between Sentences≫
The embodiment described above extracts the implication of a sentence according to its domain. In addition, a connection relation extraction unit may, for example, focus on conjunctions to extract the relation between sentences (connection relations such as sequential, adversative, and transitional), and the structuring unit 116 and the imaging unit 117 may include these connection relations in the tree and the image. For example, in the tree, a node indicating the connection relation between a sentence and the preceding sentence may be provided between the root node corresponding to the text and the lower-level node corresponding to that sentence. The image conversion information database 160 may include a connection relation image conversion information database that stores the association between connection relations and the colors assigned to them, and in the image 400 (see FIG. 11) a rectangle indicating the relation between a sentence and the preceding sentence may be placed adjacent to the rectangle corresponding to the sentence.
Because the tree and image then contain not only information on the sentences in the text but also the relations between sentences, further improvements in accuracy and quality can be expected in text retrieval, classification, and translation.
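A connection relation extraction unit of the kind suggested here could, in its simplest form, look at the conjunction that opens a sentence. The sketch below is an assumption-laden illustration: the cue table `CONNECTIVES`, its English cue words, the relation names, and the function `connection_relation` are invented for the example and are not taken from the description.

```python
from typing import Dict, Optional

# Invented cue words mapped to the three relation types named in the text.
CONNECTIVES: Dict[str, str] = {
    "therefore": "sequential",
    "however": "adversative",
    "by the way": "transitional",
}


def connection_relation(sentence: str) -> Optional[str]:
    """Return the relation to the preceding sentence, or None if no cue opens it."""
    lowered = sentence.strip().lower()
    for cue, relation in CONNECTIVES.items():
        if lowered.startswith(cue):
            return relation
    return None


print(connection_relation("However, output fell."))  # adversative
```

The returned relation could then label an extra node between the root and the sentence node, or select a color for an adjacent rectangle in the image.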

≪Other Modifications≫
The configuration of the domain knowledge utilization support device 100 illustrated in this embodiment is not limited to the form described above. It may be functionally or physically distributed or integrated in arbitrary units, as long as similar effects and functions are achieved. For example, the implication information database 130, the domain-specific category information database 140, the general information database 150, and so on may be stored in a separate device, and the domain knowledge utilization support device 100 may access that device to execute the information extraction processing.

In the embodiment described above, the domain is a type of industry, but other kinds of domain may be used. For example, if the input text concerns technology, the domains may be measurement, optical devices, power machinery, thermal equipment, metals, organic chemistry, polymers, information processing, digital communication, and so on. More generally, the domains may be any fields, divisions, types, or genres into which the input texts are classified.

Although several embodiments of the present invention have been described above, these embodiments are merely examples and do not limit the technical scope of the present invention. The present invention can take various other embodiments, and various modifications such as omissions and substitutions can be made without departing from the gist of the present invention. These embodiments and their modifications are included in the scope and gist of the invention described in this specification and the like, and are included in the scope of the invention described in the claims and its equivalents.

100 Domain knowledge utilization support device
111 Text dividing unit
112 Domain determination unit
113 Entailment extraction unit
114 Domain information extraction unit
115 General information extraction unit
116 Structuring unit
117 Imaging unit
120 Storage unit
121 Program
130 Entailment information database
140 Domain-specific category information database
150 General information database
160 Image conversion information database
300 Tree (tree-structured data)
400 Image

Claims

A domain knowledge utilization support device comprising:
a text dividing unit that divides a text into sentences;
a domain determination unit that determines the domain to which each sentence pertains;
an entailment extraction unit that extracts the entailment of a word that is the predicate of the sentence by referring to an entailment information database that stores words, domains, and entailments of the words in association with one another;
a domain information extraction unit that extracts words and the categories of the words from the sentence by referring to a domain-specific category information database that stores, for each domain, words in association with categories of the words; and
a structuring unit that converts the text into a tree, wherein
the tree has a root corresponding to the text,
a lower node of the root is a node that corresponds to a sentence included in the text and indicates the entailment of the sentence extracted by the entailment extraction unit,
a lower node of the node indicating the entailment of the sentence is a node of the domain, and
lower nodes of the node of the domain include nodes indicating the categories, extracted by the domain information extraction unit, of the words included in the sentence.
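
The tree recited in the preceding claim can be pictured with a small data-structure sketch. Everything concrete here is an assumption for illustration: the `Node` and `AnalyzedSentence` classes, the `build_tree` and `render` functions, and the sample entailments, domain, and categories are hypothetical and are not taken from the specification. The sketch assumes the extraction units have already produced, for each sentence, an entailment, a domain, and a list of word categories.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Node:
    """A tree node with a label and an ordered list of lower nodes."""
    label: str
    children: List["Node"] = field(default_factory=list)


@dataclass
class AnalyzedSentence:
    """Hypothetical per-sentence output of the extraction units."""
    entailment: str        # entailment of the predicate word
    domain: str            # domain determined for the sentence
    categories: List[str]  # categories of the words found in the sentence


def build_tree(text_id: str, sentences: List[AnalyzedSentence]) -> Node:
    """Build root -> entailment node -> domain node -> category nodes."""
    root = Node(text_id)
    for s in sentences:
        domain_node = Node(s.domain, [Node(c) for c in s.categories])
        root.children.append(Node(s.entailment, [domain_node]))
    return root


def render(node: Node, depth: int = 0) -> List[str]:
    """Return the tree as indented lines, one line per node."""
    lines = ["  " * depth + node.label]
    for child in node.children:
        lines.extend(render(child, depth + 1))
    return lines
```

For example, `build_tree("text-1", [AnalyzedSentence("failure", "manufacturing", ["equipment", "part"])])` yields a root with one entailment node, one domain node beneath it, and two category nodes beneath the domain node.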

The domain knowledge utilization support device according to claim 1, further comprising a connection relation extraction unit that extracts a connection relation between the sentence and the sentence preceding it by referring to a conjunction included in the sentence, wherein
the structuring unit adds, between the root of the tree and the node indicating the entailment of the sentence, which is a lower node of the root, a node indicating the connection relation between the sentence and the preceding sentence.

The domain knowledge utilization support device according to claim 1, further comprising a general information extraction unit that extracts, from the sentence, words stored in a general information database by referring to the general information database, which stores words in association with the parts of speech of the words, wherein
in the tree, the lower nodes of the node indicating the entailment of the sentence include a general node in addition to the node of the domain, and
lower nodes of the general node include nodes indicating the words extracted by the general information extraction unit.

The domain knowledge utilization support device according to claim 1, further comprising an imaging unit that, by referring to an image conversion information database that stores color assignments for the categories and the entailments, generates, for each sentence included in the text, a figure containing a rectangle in the color indicating the entailment of the sentence and, arranged horizontally below that rectangle, rectangles in the colors indicating the categories of the words included in the sentence,
and generates an image in which the figures generated for the respective sentences included in the text are arranged vertically.
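
The layout recited in the preceding claim can be sketched as a computation of rectangle positions. This is a minimal sketch under stated assumptions: the `Rect` type, the `layout_image` function, the cell size, and the `COLOR_SCHEME` table (standing in for the image conversion information database) are all hypothetical, and the choice to stretch the entailment rectangle across the width of the category row is made here only for illustration. No drawing library is used; the result is a list of rectangles that any renderer could paint.

```python
from typing import List, NamedTuple, Tuple


class Rect(NamedTuple):
    x: int
    y: int
    width: int
    height: int
    color: str


CELL_W, CELL_H = 40, 20  # hypothetical cell size in pixels

# Hypothetical color assignments standing in for the image conversion
# information database (entailments and categories share one table here).
COLOR_SCHEME = {
    "failure": "red",
    "inspection": "blue",
    "equipment": "green",
    "part": "orange",
}


def layout_image(sentences: List[Tuple[str, List[str]]]) -> List[Rect]:
    """Lay out one figure per sentence and stack the figures vertically.

    Each sentence is given as (entailment, [category, ...]).
    """
    rects: List[Rect] = []
    y = 0
    for entailment, categories in sentences:
        # Top row: one rectangle in the color of the entailment.
        width = CELL_W * max(1, len(categories))
        rects.append(Rect(0, y, width, CELL_H, COLOR_SCHEME[entailment]))
        # Second row: category rectangles arranged horizontally below it.
        for i, category in enumerate(categories):
            rects.append(
                Rect(i * CELL_W, y + CELL_H, CELL_W, CELL_H, COLOR_SCHEME[category])
            )
        # The next sentence's figure starts below this one.
        y += 2 * CELL_H
    return rects
```

Keeping the layout separate from the rendering means the same rectangle list could be extended with further rows, such as a row for parts of speech, without changing how the image is drawn.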

The domain knowledge utilization support device according to claim 4, further comprising a connection relation extraction unit that extracts a connection relation between the sentence and the sentence preceding it by referring to a conjunction included in the sentence, wherein
the image conversion information database stores color assignments for the connection relations, and
the imaging unit arranges, in the image, adjacent to the rectangle in the color indicating the entailment of the sentence, a rectangle in the color indicating the connection relation between the sentence and the preceding sentence.

The domain knowledge utilization support device according to claim 4, further comprising a general information extraction unit that extracts, from the sentence, words stored in a general information database by referring to the general information database, which stores words in association with the parts of speech of the words, wherein
the image conversion information database stores color assignments for the parts of speech, and
the imaging unit arranges horizontally, in the figure generated for each sentence, below the rectangles in the colors indicating the categories of the words included in the sentence, rectangles in the colors indicating the parts of speech of the words that are included in the sentence and extracted by the general information extraction unit.

The domain knowledge utilization support device according to claim 6, wherein
the parts of speech stored in the general information database are given attributes including plus and minus,
the image conversion information database stores color assignments for the parts of speech to which the attributes are given, and
the color indicating the part of speech of a word that is included in the sentence and extracted by the general information extraction unit is the color assigned to that part of speech with its attribute.

A program for causing a computer to function as the domain knowledge utilization support device according to any one of claims 1 to 7.

A domain knowledge utilization support method for a domain knowledge utilization support device, wherein
the domain knowledge utilization support device includes a storage unit that stores
an entailment information database that stores words, domains, and entailments of the words in association with one another, and
a domain-specific category information database that stores, for each domain, words in association with categories of the words,
the method comprising the steps of:
dividing a text into sentences;
determining the domain to which each sentence pertains;
extracting the entailment of the word that is the predicate of the sentence;
extracting words and the categories of the words from the sentence; and
converting the text into a tree, wherein
the tree has a root corresponding to the text,
a lower node of the root is a node that corresponds to a sentence included in the text and indicates the extracted entailment of the sentence,
a lower node of the node indicating the entailment of the sentence is a node of the domain, and
lower nodes of the node of the domain include nodes indicating the extracted categories of the words included in the sentence.