JP2010250439A - Retrieval system, data generation method, program and recording medium for recording program

Info

Publication number: JP2010250439A
Application number: JP2009097178A
Authority: JP
Inventors: Tomohito Uchida; 智史内田; Masahiro Takezawa; 真弘竹澤
Original assignee: Kanagawa University
Current assignee: Kanagawa University
Priority date: 2009-04-13
Filing date: 2009-04-13
Publication date: 2010-11-04

Abstract

PROBLEM TO BE SOLVED: To provide a retrieval system equipped with a system that constructs a large-scale, structured knowledge base by means of a program.

SOLUTION: A retrieval system 1 comprises: a knowledge base 2 having tree-structured knowledge blocks, each storing a predetermined word included in a book at the top level and information related to that word below the top level; a knowledge base retrieval system 4 for retrieving the knowledge blocks stored in the knowledge base 2 based on a word input from an input part; and a knowledge base construction system 3 for constructing the knowledge blocks stored in the knowledge base 2 based on text data read by a data reading means 30 that reads the text data of the book. The knowledge base construction system 3 generates a knowledge block by applying a natural language processing means 31 and a structured formatting means 32 to the text data read by the data reading means 30, and stores the knowledge block in the knowledge base 2.

COPYRIGHT: (C)2011, JPO&INPIT

Description

The present invention relates to a search system that searches a database storing document data of books for target information based on a search term and presents it, a data generation method for generating the document data to be stored in the database, a program for realizing on a computer the functions of the means involved in data generation, and a recording medium on which the program is recorded.

To accurately find target information (knowledge) in a database that stores a huge volume of document data such as books, two kinds of systems have been developed: search systems equipped with a knowledge base that systematically aggregates information, and question answering systems that answer a questioner's question rather than performing a simple text search.

One known example of such a search system is an interactive help system that answers user questions interactively. It comprises a user interface through which the user inputs a question sentence, an input analysis unit that parses the language input by the user, a knowledge base storing knowledge units, which are explanatory sentences for terms, and a dialogue management unit that matches the question sentence input by the user against the knowledge units. The knowledge base has a structure in which a term such as a proper noun sits at the top level of a tree structure, with at least one knowledge unit below it, either a knowledge unit serving as its definition or another, uncategorized knowledge unit (see Non-Patent Document 1).

In such an interactive help system, as the number of knowledge units stored in the knowledge base increases, the number of candidate answers to user questions also increases and appropriate answers can be presented, so the answer rate for questions rises.

Non-Patent Document 1: Interactive help system of the Kyoto University Media Center and automatic reference service system of the Kyoto University Library. IPSJ Research Report, Natural Language Processing Study Group Report, Vol. 2000, No. 53, p. 92, 2000.

However, in the interactive help system described in Non-Patent Document 1, an operator creates knowledge units by extracting the relevant sentences from text data by hand, so constructing a large-scale knowledge base takes a great deal of time and effort. In addition, because the knowledge units are built manually, the number of knowledge units stored in the knowledge base stays small and the answer rate for questions is low.

The present invention has been made in view of the above problems, and its main object is to provide a search system equipped with a system that constructs a large-scale, structured knowledge base by means of a program.

The search system of the present invention comprises: a knowledge base having tree-structured knowledge blocks, each having a predetermined word contained in a book at the top level and information related to the predetermined word stored below the top level; a knowledge base search system that has a user interface with an input unit and searches the knowledge blocks stored in the knowledge base based on a word input from the input unit; and a knowledge base construction system that has data reading means for reading text data of a book and constructs the knowledge blocks to be stored in the knowledge base based on the text data read by the data reading means. The knowledge base construction system is characterized by having natural language processing means for performing morphological analysis on the text data read by the reading means and performing syntax analysis using the result of the morphological analysis, and structured appearance processing means for extracting sentences based on the syntax analysis of the natural language processing means and classifying the extracted sentences into predetermined categories (claim 1).

This automates the knowledge block generation process, so the number of knowledge blocks stored in the knowledge base can be increased and the answer rate for questions can be improved.
Further, since the knowledge blocks are constructed based on the form and structure of sentences, the sentences can be classified into categories designated in advance, and the search speed can be improved.

The knowledge block preferably includes at least one of the sentences classified into eight predetermined categories (claim 2).

The data generation method for generating the knowledge block from text data preferably includes a natural language processing step of performing morphological analysis on the text data and performing syntax analysis using the result of the morphological analysis, and a structured appearance processing step of extracting sentences based on the result of the natural language processing step and classifying them into predetermined categories (claim 3).

The present invention is also a program characterized by causing a computer to function as the natural language processing means and the structured appearance processing means (claim 4).

The present invention is also a computer-readable recording medium on which the program according to claim 4 is recorded (claim 5).

As described above, according to the present invention, automating the knowledge block generation process makes it possible to increase the number of knowledge blocks stored in the knowledge base and to provide a search system with an improved answer rate.
Moreover, since the knowledge blocks are constructed based on the form and structure of sentences, the sentences can be classified into categories designated in advance, and the amount of work required of the operator can be reduced.

FIG. 1 is a configuration diagram showing the configuration of the search system.
FIG. 2 shows the configuration of the knowledge block: (a) is a configuration diagram showing the configuration of the knowledge block, and (b) is a diagram showing the classification of knowledge block categories.
FIG. 3 is an image diagram showing an example of the user interface.
FIG. 4 is a processing flow diagram for generating a knowledge block from the text data of a book in the knowledge base construction system 3.
FIG. 5 is an image diagram showing an example of the execution result of morphological analysis.
FIG. 6 is an image diagram showing an example of the execution result of syntax analysis.
FIG. 7 is an image diagram showing an example of sentence extraction by keyword.

The search system of the present invention will be described below with reference to the drawings.

As shown in FIG. 1, the search system 1 of the present invention comprises a knowledge base 2 that stores document data of books, a knowledge base search system 4 that searches the document data stored in the knowledge base 2, a knowledge base construction system 3 that constructs the document data to be stored in the knowledge base 2, and a knowledge base management system 5 that manages the knowledge base search system 4 and the knowledge base construction system 3.

The knowledge base 2 stores the document data of books in the form of knowledge blocks 21 having a tree structure as shown in FIGS. 2(a) and 2(b).

The tree structure of a knowledge block 21 has a predetermined word at the top and, below it, tags divided into, for example, eight categories: definition (sentences explaining the meaning or definition of the word), method (sentences about techniques and usage), example (sentences about examples and types), possibility (sentences about what is or is not possible), sign (sentences about signs of a problem, such as "cannot ..."), reason (sentences about reasons), comparison (sentences comparing multiple things), and other (sentences that cannot be classified into any of these categories). Hereinafter, a sentence stored under one of these tags is referred to as knowledge.
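The tree structure described above can be pictured with a small data structure. The following is a minimal sketch in Python, not the implementation described in this document; the names `Category`, `KnowledgeBlock`, and `add` are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Dict, List


class Category(Enum):
    """The eight category tags named in the description."""
    DEFINITION = "definition"
    METHOD = "method"
    EXAMPLE = "example"
    POSSIBILITY = "possibility"
    SIGN = "sign"
    REASON = "reason"
    COMPARISON = "comparison"
    OTHER = "other"


@dataclass
class KnowledgeBlock:
    """One tree: a keyword at the top, category tags below it, sentences as leaves."""
    keyword: str
    knowledge: Dict[Category, List[str]] = field(default_factory=dict)

    def add(self, category: Category, sentence: str) -> None:
        """Store one sentence (one item of "knowledge") under a category tag."""
        self.knowledge.setdefault(category, []).append(sentence)


# Hypothetical example content, not taken from any book.
block = KnowledgeBlock("XML")
block.add(Category.DEFINITION, "XML is a markup language for structured documents.")
block.add(Category.METHOD, "An XML document is read with a parser.")
```

A block only holds the category tags for which at least one sentence was found, which matches the wording that a block has "at least one" of the categorized sentences.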

The knowledge base 2 that stores such tree-structured knowledge blocks 21 is preferably an XML (Extensible Markup Language) database. In particular, it is desirable to use a native XML database, which can store and manipulate XML documents in their original structure and therefore makes the most of the advantages of XML, namely tree structure and meta-information management.
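To illustrate why an XML representation suits this tree, the sketch below serializes one knowledge block with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`knowledgeBlock`, `keyword`, `knowledge`) are assumptions made for this example; the document does not specify a schema, and a native XML database would be used in place of an in-memory tree.

```python
import xml.etree.ElementTree as ET
from typing import Dict, List


def to_xml(keyword: str, knowledge: Dict[str, List[str]]) -> ET.Element:
    """Build <knowledgeBlock keyword="..."> with one child element per category."""
    root = ET.Element("knowledgeBlock", {"keyword": keyword})
    for category, sentences in knowledge.items():
        tag = ET.SubElement(root, category)
        for sentence in sentences:
            ET.SubElement(tag, "knowledge").text = sentence
    return root


# Hypothetical example content.
root = to_xml("XML", {
    "definition": ["XML is a markup language for structured documents."],
    "method": ["An XML document is read with a parser."],
})
xml_text = ET.tostring(root, encoding="unicode")

# A path query retrieves every sentence stored under one category tag.
definitions = [e.text for e in root.findall("./definition/knowledge")]
```

Because the category is an element name, retrieving "all definitions of keyword X" becomes a single path query, which is the kind of lookup the search side needs.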

The knowledge base construction system 3 comprises data reading means 30 for reading book data in a format such as PDF (Portable Document Format) (registered trademark), text conversion processing means 33 for converting the read book data into text by processing such as OCR (Optical Character Reader), natural language processing means 31 for performing morphological analysis on the text-converted book data and performing syntax analysis using the result of the morphological analysis, and structured appearance processing means 32 for extracting sentences containing predetermined terms from the parsed book data and classifying the extracted sentences into the eight categories described above to generate knowledge blocks 21.
Although the search system 1 handles the PDF format as an example of book data, it is not limited to PDF and can also handle book data in other formats.

The knowledge base search system 4 searches the knowledge blocks 21 stored in the knowledge base 2 and returns answers. It comprises a user interface 41 through which the user inputs a question, natural language processing means 42 for performing morphological analysis and syntax analysis on the input search sentence, a search algorithm 43 for searching the knowledge blocks 21, and answer sentence generation processing means 44 for creating the response sentence displayed to the user. The knowledge base search system 4 runs on a server, which is preferably a Java (registered trademark) EE application server specialized for Web access.
As shown in FIG. 3, the user interface 41 has an input unit on the screen into which a search sentence can be entered.

The knowledge base management system 5 is a Web system for managing the operation of the knowledge base construction system 3, and comprises upload processing means 51 for passing book data to the knowledge base construction system 3.

The natural language processing means 31, structured appearance processing means 32, text conversion processing means 33, natural language processing means 42, search algorithm 43, answer sentence generation processing means 44, upload processing means 51, single knowledge addition processing means 53, full-text search means 55, and answer rate and other analysis means 57, described above or later, are generally implemented by causing one or more computers, each having a CPU, ROM, RAM, and the like, to function according to a predetermined program.

The program can be recorded on a computer-readable recording medium and provided to users. Recording media include flexible disks, CD-ROMs, DVD-ROMs, and recordable media such as hard disks and semiconductor memories.

The process of generating knowledge blocks 21 from the text data of a book in the search system 1 configured as above is described next.

The upload processing means 51 of the knowledge base management system 5 reads book data in PDF format and then performs upload processing that passes the book data to the knowledge base construction system 3.

As shown in FIG. 4, when the data reading means 30 of the knowledge base construction system 3 receives PDF data (S101: YES), it saves the PDF data on a server (not shown) (S102).

Next, when the text conversion processing means 33 receives a construction start signal from the knowledge base management system 5 (S103: YES), it reads the PDF data from the server and performs text conversion (sentence extraction) processing (S104).

The natural language processing means 31 then performs natural language processing on the text-converted book data, analyzing structures such as the dependency relations of each word (S105).

Further, based on the result of the natural language processing, the structured appearance processing means 32 creates knowledge blocks 21 by structured appearance processing, which extracts sentences containing predetermined words and classifies them by category (S106), and stores the knowledge blocks 21 in the knowledge base 2 (S107).
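The order of steps S104 to S107 can be sketched as a small pipeline. This is only an illustration of the control flow under simplifying assumptions: the function `construct` and the three stage callables are hypothetical, and the toy stages used in the demonstration stand in for the real text conversion, natural language processing, and structured appearance processing.

```python
from typing import Callable, Dict, List

KnowledgeBase = Dict[str, Dict[str, List[str]]]  # keyword -> category -> sentences


def construct(pdf_data: bytes,
              to_text: Callable[[bytes], str],
              analyze: Callable[[str], List[str]],
              structure: Callable[[List[str]], KnowledgeBase],
              knowledge_base: KnowledgeBase) -> None:
    """Run the construction steps in order and merge the result into the store."""
    text = to_text(pdf_data)        # S104: text conversion
    sentences = analyze(text)       # S105: natural language processing
    blocks = structure(sentences)   # S106: extraction and categorization
    for keyword, categories in blocks.items():  # S107: store knowledge blocks
        target = knowledge_base.setdefault(keyword, {})
        for category, found in categories.items():
            target.setdefault(category, []).extend(found)


# Toy stand-in stages (assumptions for the demo, not the real processing).
knowledge_base: KnowledgeBase = {}
construct(
    b"XML is a markup language. Fonts are unrelated.",
    to_text=lambda data: data.decode("utf-8"),
    analyze=lambda text: [s.strip() for s in text.split(".") if s.strip()],
    structure=lambda sents: {"XML": {"other": [s for s in sents if "XML" in s]}},
    knowledge_base=knowledge_base,
)
```

Passing the stages in as callables keeps each step replaceable, for example when book data in a format other than PDF is handled.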

Regarding the text conversion processing in step S104, simply extracting text from the PDF data leaves noise such as page headers, page numbers, and program listings scattered through the text, which causes problems if the text is used as-is in later analysis. Therefore, after text conversion, it is preferable to have a program tidy up the text document as far as possible, so as to minimize interference with the analysis by natural language processing described later.

Specifically, when the knowledge base construction system 3 reads book data, it deletes the page numbers and page headers that cause noise from the received book data, joins sentences that were split by noise or broken by a line break in mid-sentence, and extracts the text data at each punctuation mark. Because annotations are themselves knowledge that can be used in constructing the knowledge base, annotations are also extracted.
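The cleanup just described can be sketched with a few lines of Python. This is a simplified illustration, assuming that the page header text is known in advance and that a page number is a line containing only digits; the function names are hypothetical, annotation extraction is not covered, and Japanese text would be joined without the space used here.

```python
import re
from typing import List

PAGE_NUMBER = re.compile(r"^\s*\d+\s*$")  # assumption: a line of digits only
SENTENCE = re.compile(r"[^.。]+[.。]")     # text up to and including a full stop


def clean_lines(lines: List[str], page_header: str) -> List[str]:
    """Drop blank lines, repeated page headers, and bare page numbers."""
    kept = []
    for line in lines:
        stripped = line.strip()
        if not stripped or stripped == page_header or PAGE_NUMBER.match(stripped):
            continue
        kept.append(stripped)
    return kept


def split_sentences(lines: List[str]) -> List[str]:
    """Join lines broken in mid-sentence, then cut at sentence-ending marks."""
    text = " ".join(lines)
    return [s.strip() for s in SENTENCE.findall(text)]


# Hypothetical extracted page content.
raw = [
    "Chapter 1 Fonts",
    "Courier is a fixed-width",
    "font.",
    "12",
    "Chapter 1 Fonts",
    "It is often used for",
    "program listings.",
]
sentences = split_sentences(clean_lines(raw, "Chapter 1 Fonts"))
```

Removing the noise lines before joining matters: otherwise the page number and header would end up in the middle of the sentence that spans the page break.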

The natural language processing in step S105 first performs morphological analysis on the text-converted book data, as shown in FIG. 5, analyzing the form (noun, case particle, and so on) of each word used in a sentence. Next, as shown in FIG. 6, it performs syntax analysis on the text data that the morphological analysis has divided at each punctuation mark, showing the dependency relations between phrases.

As the morphological analysis tool, for example, ChaSen (http://chasen.naist.jp/hiki/ChaSen/), developed at the Matsumoto Laboratory of the Nara Institute of Science and Technology, can be used. As the syntax analysis tool, for example, CaboCha, which is distributed under the GNU Lesser General Public License (LGPL), can be used.
To increase processing speed, it is desirable to run the natural language processing with as little complex processing as possible.
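The kind of result these two analyses produce can be pictured as follows. The sketch does not call ChaSen or CaboCha; the analysis of the sample sentence is written by hand, the part-of-speech labels are simplified, and the class names `Morpheme` and `Chunk` are hypothetical. It only illustrates the data shape: words with a part of speech, grouped into phrases that each point to the phrase they depend on.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Morpheme:
    surface: str  # the word as written
    pos: str      # part of speech (simplified English labels)


@dataclass
class Chunk:
    morphemes: List[Morpheme]
    head: Optional[int]  # index of the chunk this one depends on; None for the root

    @property
    def text(self) -> str:
        return "".join(m.surface for m in self.morphemes)


def dependents(chunks: List[Chunk], index: int) -> List[str]:
    """Return the text of every chunk that depends on the chunk at `index`."""
    return [c.text for c in chunks if c.head == index]


# Hand-written analysis of 「XMLは文書を記述する」 (illustrative only).
chunks = [
    Chunk([Morpheme("XML", "noun"), Morpheme("は", "particle")], head=2),
    Chunk([Morpheme("文書", "noun"), Morpheme("を", "particle")], head=2),
    Chunk([Morpheme("記述", "noun"), Morpheme("する", "verb")], head=None),
]
```

Here both "XMLは" and "文書を" depend on the final predicate, so `dependents(chunks, 2)` collects the phrases that a later step could inspect when deciding what a sentence says about a keyword.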

The structured appearance processing in step S106, as shown in FIG. 7, first extracts the sentences containing a predetermined word (keyword) from the text on which the natural language processing has performed morphological analysis, then classifies the extracted sentences into categories and constructs the knowledge block 21 for the keyword.

As the keywords used for extracting sentences, for example, each entry of the index at the end of the book can be used. Some index entries use notation that makes them unsuitable for use as keywords as they stand, so it is desirable to exclude or tidy such entries before use. Specifically, an index entry may have a parenthesized addition, as in "Courier (font name)", so that two terms are written together. If extraction is run with an entry containing multiple terms like this, target sentences that contain only one of the terms may not be extracted, so such parentheses are removed before sentences are extracted.
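The index cleanup and keyword-based extraction can be sketched as follows. This is a minimal illustration that assumes "removing the parentheses" means dropping the parenthesized part of the entry, and that extraction is a plain substring test; the function names are hypothetical.

```python
import re
from typing import List

# Matches a parenthesized addition in half-width or full-width brackets.
PARENTHESIZED = re.compile(r"[（(][^）)]*[）)]")


def normalize_index_entry(entry: str) -> str:
    """Turn an index entry such as "Courier (font name)" into the keyword "Courier"."""
    return PARENTHESIZED.sub("", entry).strip()


def extract_sentences(sentences: List[str], keyword: str) -> List[str]:
    """Keep only the sentences that contain the keyword."""
    return [s for s in sentences if keyword in s]


# Hypothetical book sentences.
sentences = ["Courier is a fixed-width font.", "Page headers are noise."]
keyword = normalize_index_entry("Courier (font name)")
hits = extract_sentences(sentences, keyword)
```

Searching with the raw entry "Courier (font name)" would match neither sentence, which is the failure the description warns about.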

Category classification is performed by referring to a settings file, created in advance by operating the knowledge base management system 5, that summarizes the characteristics of sentences in each category and stores the classification rules.
Specifically, sentences classified into the sign category characteristically contain a negative form such as "cannot ...". Using such characteristics of sentence form, a file defining the relationship between extracted sentences and categories is created in advance, and the extracted sentences are categorized accordingly.
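A rule file of this kind might look like the sketch below. The JSON layout and the function names are assumptions for illustration; the document does not describe the file format. Only the "cannot" pattern for the sign category comes from the description; the other two rules are hypothetical examples.

```python
import json
import re
from typing import List, Pattern, Tuple

# Stand-in for the settings file: one pattern per category, checked in order.
RULES_TEXT = """
[
  {"category": "sign",       "pattern": "できない|cannot"},
  {"category": "reason",     "pattern": "ため|because"},
  {"category": "definition", "pattern": "とは| is a "}
]
"""


def load_rules(text: str) -> List[Tuple[str, Pattern[str]]]:
    """Parse the rule list and compile each pattern once."""
    return [(r["category"], re.compile(r["pattern"])) for r in json.loads(text)]


def classify(sentence: str, rules: List[Tuple[str, Pattern[str]]]) -> str:
    """Return the first category whose pattern matches; fall back to "other"."""
    for category, pattern in rules:
        if pattern.search(sentence):
            return category
    return "other"


rules = load_rules(RULES_TEXT)
```

Keeping the rules in a data file rather than in code means an administrator can adjust them without touching the program, which fits the later description of updating the file used for category classification.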

As described above, automating the generation process of the knowledge blocks 21 makes it possible to increase the number of knowledge blocks 21 stored in the knowledge base 2 and to improve the answer rate.
Further, since the knowledge blocks 21 are constructed based on the form and structure of sentences, the sentences can be classified into categories designated in advance, and the amount of work required of the operator can be reduced.

The search of knowledge blocks 21 in the search system 1 is described next.

When a question sentence is input to the user interface 41 of the knowledge base search system 4, the natural language processing means 42 first analyzes its structure, such as the keyword that is the subject of the question and the main clause. Next, the search algorithm 43 extracts the subject keyword and the category of the question sentence, and retrieves the same category from under that keyword's knowledge block in the knowledge base 2. Then, using the words other than the subject keyword (proper nouns, adjectives, and so on) and the tree structure resulting from syntax analysis, it selects the answers with a high degree of match, and the answer sentence formed for them by the answer sentence generation processing means 44 is presented on the user interface 41.
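The lookup and ranking described above can be sketched as follows. This is a simplified stand-in: it assumes the keyword, category, and remaining words have already been extracted from the question, and it scores candidates only by how many of the remaining words they contain, whereas the description also uses the syntax tree. The function name `search` and the sample data are hypothetical.

```python
from typing import Dict, List

KnowledgeBase = Dict[str, Dict[str, List[str]]]  # keyword -> category -> sentences


def search(knowledge_base: KnowledgeBase, keyword: str, category: str,
           other_words: List[str]) -> List[str]:
    """Return the sentences under keyword/category, best word overlap first."""
    candidates = knowledge_base.get(keyword, {}).get(category, [])
    scored = [(sum(word in s for word in other_words), s) for s in candidates]
    # sorted() is stable, so ties keep their stored order.
    scored = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [sentence for _, sentence in scored]


# Hypothetical knowledge base content.
kb: KnowledgeBase = {
    "Courier": {
        "definition": [
            "Courier is a fixed-width font.",
            "Courier is a font often used for program listings.",
        ]
    }
}
results = search(kb, "Courier", "definition", ["program"])
```

Narrowing by keyword and category first means the match scoring only runs over a handful of sentences rather than the whole book.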

If the knowledge base management system 5 is provided with browsing means 52 for displaying the data stored in the knowledge base 2, single knowledge addition processing means 53 for adding an individual item of knowledge to a knowledge block 21, and storage means 58, the administrator can add knowledge individually, for example by creating a new knowledge block 21 for a user question that could not be answered.

Specifically, the storage means 58 saves a question when it asks about knowledge not stored in the knowledge base 2. The single knowledge addition processing means 53 has natural language processing means 54 and full-text search means 55; it searches the knowledge blocks 21 based on a search sentence input by the administrator and adds the knowledge if it is not already stored. Furthermore, providing construction rule simulation processing means 56 makes it possible to update the file that the structured appearance processing means 32 refers to during category classification, and thereby to remove problems that occurred during extraction.
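The interplay between saving unanswered questions and adding single items of knowledge can be sketched as follows. The function names and the in-memory lists are assumptions for illustration; the question analysis that yields the keyword and category is taken as given.

```python
from typing import Dict, List, Optional

KnowledgeBase = Dict[str, Dict[str, List[str]]]  # keyword -> category -> sentences


def answer_or_save(knowledge_base: KnowledgeBase, unanswered: List[str],
                   question: str, keyword: str, category: str) -> Optional[str]:
    """Answer from the knowledge base, or record the question if nothing is stored."""
    sentences = knowledge_base.get(keyword, {}).get(category)
    if not sentences:
        unanswered.append(question)  # kept for the administrator to review
        return None
    return sentences[0]


def add_knowledge(knowledge_base: KnowledgeBase, keyword: str,
                  category: str, sentence: str) -> None:
    """Add one item of knowledge, creating the block or tag if needed."""
    stored = knowledge_base.setdefault(keyword, {}).setdefault(category, [])
    if sentence not in stored:  # add only knowledge that is not already stored
        stored.append(sentence)


# Hypothetical session.
kb: KnowledgeBase = {"Courier": {"definition": ["Courier is a fixed-width font."]}}
unanswered: List[str] = []
first = answer_or_save(kb, unanswered, "What is Helvetica?", "Helvetica", "definition")
add_knowledge(kb, "Helvetica", "definition", "Helvetica is a sans-serif font.")
second = answer_or_save(kb, unanswered, "What is Helvetica?", "Helvetica", "definition")
```

The saved questions act as a to-do list: each one points the administrator at a keyword and category that the automatic construction did not cover.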

Further, if the knowledge base management system 5 is provided with answer rate and other analysis means 57, which manages the question history and answer history, as means for managing the knowledge base search system 4, the question history can be managed. Furthermore, by having the answer rate and other analysis means 57 collect and tabulate users' ratings of how well they understood the answers, the administrator can add wording to poorly rated knowledge to make it easier to understand.
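The two analyses mentioned here can be sketched in a few lines. The document does not define the answer rate, so this example assumes it is the fraction of questions in the history that received an answer; the rating scale, the threshold, and the function names are likewise hypothetical.

```python
from typing import Dict, List, Optional


def answer_rate(history: List[Dict[str, Optional[str]]]) -> float:
    """Fraction of recorded questions that received an answer (assumed definition)."""
    if not history:
        return 0.0
    answered = sum(1 for entry in history if entry["answer"] is not None)
    return answered / len(history)


def poorly_rated(ratings: Dict[str, List[int]], threshold: float) -> List[str]:
    """Knowledge whose average user rating falls below the threshold."""
    return sorted(
        sentence for sentence, scores in ratings.items()
        if scores and sum(scores) / len(scores) < threshold
    )


# Hypothetical question/answer history and comprehension ratings (1 to 5).
history = [
    {"question": "What is Courier?", "answer": "Courier is a fixed-width font."},
    {"question": "What is XML?", "answer": "XML is a markup language."},
    {"question": "What is Helvetica?", "answer": None},
    {"question": "Why use XML?", "answer": "Because it is structured."},
]
ratings = {
    "Courier is a fixed-width font.": [5, 4],
    "Because it is structured.": [1, 2],
}
rate = answer_rate(history)
needs_rewording = poorly_rated(ratings, threshold=3.0)
```

The list returned by `poorly_rated` corresponds to the knowledge an administrator would revisit to add explanatory wording.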

Needless to say, the search system described in this embodiment can also be used with languages other than Japanese.

DESCRIPTION OF SYMBOLS
1 Search system
2 Knowledge base
3 Knowledge base construction system
4 Knowledge base search system
5 Knowledge base management system
21 Knowledge block
30 Data reading means
31 Natural language processing means
32 Structured appearance processing means
33 Text conversion processing means
41 User interface
42 Natural language processing means
43 Search algorithm
44 Answer sentence generation means
51 Upload means
52 Browsing means
53 Single knowledge addition processing means
54 Natural language processing means
55 Full-text search means
56 Construction rule simulation means
57 Answer rate and other analysis means
58 Storage means

Claims

A search system comprising:
a knowledge base having a tree-structured knowledge block that has a predetermined word included in a book at a top level and in which information related to the predetermined word is stored below the top level;
a knowledge base search system that comprises a user interface having an input unit and searches for a knowledge block stored in the knowledge base based on a word input from the input unit; and
a knowledge base construction system that has data reading means for reading text data of a book and constructs a knowledge block to be stored in the knowledge base based on the text data read by the data reading means,
wherein the knowledge base construction system has natural language processing means for performing morphological analysis on the text data read by the reading means and performing syntax analysis using a result of the morphological analysis, and structured appearance processing means for extracting sentences based on the syntax analysis of the natural language processing means and classifying the extracted sentences into predetermined categories.

The search system according to claim 1, wherein the knowledge block includes at least one of sentences classified into eight predetermined categories.

A data generation method for generating the knowledge block from text data, comprising: a natural language processing step of performing morphological analysis on the text data and performing syntax analysis using a result of the morphological analysis; and a structured appearance processing step of extracting a sentence based on a processing result of the natural language processing step and classifying the sentence into a predetermined category.

A program that causes a computer to function as the natural language processing means and the structured appearance processing means according to claim 1 or 2.

A computer-readable recording medium on which the program according to claim 4 is recorded.