JP6480380B2

JP6480380B2 - Table cell search apparatus, method, and program

Info

Publication number: JP6480380B2
Application number: JP2016098692A
Authority: JP
Inventors: 西田　京介; 松尾　義博; 東中　竜一郎; 貞光　九月
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2016-05-17
Filing date: 2016-05-17
Publication date: 2019-03-06
Anticipated expiration: 2036-05-17
Also published as: JP2017207853A

Description

The present invention relates to a table cell search apparatus, method, and program, and more particularly to a table cell search apparatus, method, and program that index tabular data and answer search queries.

With the development of computer technology, a large amount of table data written in HTML has come to exist on the Web. Many of these tables allow triples of entity, attribute, and attribute value to be extracted. By reading the table structure accurately, it becomes possible to answer directly both search requests that ask for the value of an entity's attribute and search requests that ask for the entities having a specific attribute value.

To extract triples of entity, attribute, and attribute value from table data, it is necessary to understand the type of the table and the type of each row and each column in the table. The methods proposed in Non-Patent Document 1 and Non-Patent Document 2, respectively, can be used for these two tasks.

Non-Patent Document 1: Eric Crestan, Patrick Pantel: Web-scale table census and classification. WSDM 2011: 545-554
Non-Patent Document 2: Jing Fang, Prasenjit Mitra, Zhi Tang, C. Lee Giles: Table Header Detection and Classification. AAAI 2012

However, the information inside a table alone is insufficient as knowledge to be used as a search target. For example, even if the triple (Taro Yamada, score, 123.45) can be extracted from a table listing the scores of the players who took part in Sports Tournament A, a user's search request cannot be answered directly unless the triple is supplemented with the fact that it is knowledge about Sports Tournament A. Appropriately grounding information acquired from inside a table with information from outside the table, and storing the result as searchable knowledge, has not been achieved so far.

The present invention has been made to solve the above problem. Its object is to provide a table cell search apparatus, method, and program that can acquire cell knowledge using knowledge from outside the table and retrieve answers to search queries.

To achieve the above object, a table cell search apparatus according to a first invention extracts and indexes the knowledge contained in tabular data included in HTML documents, ranks by search score the table cells that can directly answer a search query given as a keyword set or a natural sentence, and returns them. The apparatus includes: a tabular data extraction unit that acquires, from a set of the HTML documents, a tabular data set consisting of tabular data described by table tags; a table-related information extraction unit that, for each piece of tabular data, extracts table-related information related to the tabular data from the HTML document containing it; a table type classification unit that, for each piece of tabular data, classifies the table type of the tabular data based on its structure and contents; a row/column type classification unit that, for each piece of tabular data, classifies the type of each row and each column of the table based on the structure and contents of the tabular data; an in-table knowledge extraction unit that, for each piece of tabular data, extracts cell knowledge consisting of a pair of in-table information (including the entity, attribute, and attribute value extracted from the tabular data) and the table-related information, based on the table-related information extracted by the table-related information extraction unit, the classification result of the table type classification unit, and the classification result of the row/column type classification unit, and stores the extracted cell knowledge in a search database; a query interpretation unit that, for a given search query that is a keyword set or a natural sentence, assigns a label to the keyword corresponding to an attribute in the search query; a knowledge search unit that outputs the cell knowledge corresponding to the search query from the search database, based on the search query labeled by the query interpretation unit; and a search result generation unit that returns search results to the user based on the output of the knowledge search unit.

In the table cell search apparatus according to the first invention, the table-related information extraction unit may extract, for each piece of tabular data, the page title, the table caption, the heading of the section containing the table, the text immediately before the table, or the text immediately after the table from the HTML document as the table-related information. The in-table knowledge extraction unit may then, for each piece of tabular data and according to its table type, extract from the tabular data cell knowledge consisting of a pair of in-table information (including the entity, entity type, attribute, and attribute value in the tabular data) and the table-related information, based on the table-related information extracted by the table-related information extraction unit, the classification result of the table type classification unit, and the classification result of the row/column type classification unit, and store the extracted cell knowledge in the search database.

In the table cell search apparatus according to the first invention, the query interpretation unit may further assign labels to the keywords other than the attribute in the search query, and the knowledge search unit may output, from the search database, the cell knowledge for which the keyword corresponding to the attribute of the search query matches the attribute of the in-table information, and the keywords other than the attribute of the search query match the entity or entity type of the in-table information or the table-related information.

A table cell search method according to a second invention is a method in a table cell search apparatus that extracts and indexes the knowledge contained in tabular data included in HTML documents, ranks by search score the table cells that can directly answer a search query given as a keyword set or a natural sentence, and returns them. The method includes: a step in which a tabular data extraction unit acquires, from a set of the HTML documents, a tabular data set consisting of tabular data described by table tags; a step in which a table type classification unit, for each piece of tabular data, classifies the table type of the tabular data based on its structure and contents; a step in which a row/column type classification unit, for each piece of tabular data, classifies the type of each row and each column of the table based on the structure and contents of the tabular data; a step in which a table-related information extraction unit, for each piece of tabular data, extracts table-related information related to the tabular data from the HTML document containing it; a step in which an in-table knowledge extraction unit, for each piece of tabular data, extracts cell knowledge consisting of a pair of in-table information (including the entity, attribute, and attribute value extracted from the tabular data) and the table-related information, based on the table-related information extracted by the table-related information extraction unit, the classification result of the table type classification unit, and the classification result of the row/column type classification unit, and stores the extracted cell knowledge in a search database; a step in which a query interpretation unit, for a given search query that is a keyword set or a natural sentence, assigns a label to the keyword corresponding to an attribute in the search query; a step in which a knowledge search unit outputs the cell knowledge corresponding to the search query from the search database, based on the search query labeled by the query interpretation unit; and a step in which a search result generation unit returns search results to the user based on the output of the knowledge search unit.

In the table cell search method according to the second invention, the step performed by the table-related information extraction unit may extract, for each piece of tabular data, the page title, the table caption, the heading of the section containing the table, the text immediately before the table, or the text immediately after the table from the HTML document as the table-related information. The step performed by the in-table knowledge extraction unit may then, for each piece of tabular data and according to its table type, extract from the tabular data cell knowledge consisting of a pair of in-table information (including the entity, entity type, attribute, and attribute value in the tabular data) and the table-related information, based on the table-related information extracted by the table-related information extraction unit, the classification result of the table type classification unit, and the classification result of the row/column type classification unit, and store the extracted cell knowledge in the search database.

In the table cell search method according to the second invention, the step performed by the query interpretation unit may further assign labels to the keywords other than the attribute in the search query, and the step performed by the knowledge search unit may output, from the search database, the cell knowledge for which the keyword corresponding to the attribute of the search query matches the attribute of the in-table information, and the keywords other than the attribute of the search query match the entity or entity type of the in-table information or the table-related information.

A program according to a third invention is a program for causing a computer to function as each unit of the table cell search apparatus according to the first invention.

According to the table cell search apparatus, method, and program of the present invention, for each piece of tabular data, table-related information is extracted from the HTML document containing the tabular data, the table type is classified, and the type of each row and each column is classified. Based on the table-related information and the classification results, cell knowledge consisting of a pair of in-table information (including the entity, attribute, and attribute value extracted from the tabular data) and the table-related information is extracted and stored in a search database. For a given search query that is a keyword set or a natural sentence, a label is assigned to the keyword corresponding to an attribute in the search query, the cell knowledge corresponding to the search query is output from the search database based on the labeled search query, and search results are returned to the user based on that output. As a result, cell knowledge can be acquired using knowledge from outside the table, and answers to search queries can be retrieved.

A block diagram showing the configuration of the table cell search apparatus according to the embodiment of the present invention.
A flowchart showing the table cell search processing routine in the table cell search apparatus according to the embodiment of the present invention.
A diagram showing an example of the table types classified by the table type classification unit.
A flowchart showing the tabular data extraction processing routine in the table cell search apparatus according to the embodiment of the present invention.
A diagram showing an example of extracting tabular data from a table tag.
A flowchart showing the table-related information extraction processing routine in the table cell search apparatus according to the embodiment of the present invention.
A diagram showing an example of extracting table-related information.
A flowchart showing the in-table knowledge extraction processing routine for the "vertical list" case in the table cell search apparatus according to the embodiment of the present invention.
A diagram showing an example of extracting in-table knowledge when the table type class is "vertical list".
A flowchart showing the in-table knowledge extraction processing routine for the "horizontal list" case in the table cell search apparatus according to the embodiment of the present invention.
A diagram showing an example of extracting in-table knowledge when the table type class is "horizontal list".
A flowchart showing the in-table knowledge extraction processing routine for the "vertical attribute" case in the table cell search apparatus according to the embodiment of the present invention.
A diagram showing an example of extracting in-table knowledge when the table type class is "vertical attribute".
A flowchart showing the in-table knowledge extraction processing routine for the "horizontal attribute" case in the table cell search apparatus according to the embodiment of the present invention.
A diagram showing an example of extracting in-table knowledge when the table type class is "horizontal attribute".
A flowchart showing the in-table knowledge extraction processing routine for the "matrix" case in the table cell search apparatus according to the embodiment of the present invention.
A diagram showing an example of extracting in-table knowledge when the table type class is "matrix".
A diagram showing an example of extracting in-table knowledge when the table type class is "matrix".
A flowchart showing the table cell search processing routine in the table cell search apparatus according to the embodiment of the present invention.
A diagram showing an example of a search result of the table cell search apparatus.

Hereinafter, an embodiment of the present invention will be described in detail with reference to the drawings.

<Configuration of the Table Cell Search Apparatus According to the Embodiment of the Present Invention>

First, the configuration of the table cell search apparatus according to the embodiment of the present invention will be described. The table cell search apparatus of the embodiment extracts and indexes the knowledge contained in tabular data included in HTML documents, ranks by search score the table cells that can directly answer a search query given as a keyword set or a natural sentence, and returns them.

As shown in FIG. 1, the table cell search apparatus 100 according to the embodiment of the present invention can be configured as a computer including a CPU, a RAM, and a ROM that stores programs and various data for executing a search indexing processing routine and a table cell search processing routine, both described later. Functionally, as shown in FIG. 1, the table cell search apparatus 100 includes an input unit 10, a calculation unit 20, and an output unit 50.

The input unit 10 accepts a set of HTML documents from which knowledge is to be extracted. The HTML documents contain tabular data. The input unit 10 also accepts a keyword set or a natural sentence as a search query.

The calculation unit 20 includes a search database 28, a tabular data extraction unit 30, a table type classification unit 32, a row/column type classification unit 34, a table-related information extraction unit 36, an in-table knowledge extraction unit 38, a query interpretation unit 40, a knowledge search unit 42, and a search result generation unit 44. The specific processing of each unit is described later, in the section on the operation of the table cell search apparatus.

The tabular data extraction unit 30 acquires, from the set of HTML documents, a tabular data set consisting of the tabular data described by table tags.

For each piece of tabular data acquired by the tabular data extraction unit 30, the table type classification unit 32 classifies the table type of that tabular data based on its structure and contents.

For each piece of tabular data acquired by the tabular data extraction unit 30, the row/column type classification unit 34 classifies the type of each row and each column of the table based on the structure and contents of that tabular data.

For each piece of tabular data acquired by the tabular data extraction unit 30, the table-related information extraction unit 36 extracts table-related information related to the tabular data from the HTML document containing it. Here, the page title, the table caption, the heading of the section containing the table, the text immediately before the table, or the text immediately after the table is extracted from the HTML document as the table-related information.
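As a rough illustration of this kind of extraction, the following sketch collects the five pieces of context for the first table in a document using only Python's standard `html.parser`. The class name `RelatedInfoParser`, the function `extract_related_info`, the dictionary keys, and the rule that a new heading resets the preceding text are all assumptions made for the example; the source does not specify how the extraction is implemented.

```python
from html.parser import HTMLParser

HEADINGS = {"h1", "h2", "h3", "h4", "h5", "h6"}


class RelatedInfoParser(HTMLParser):
    """Collects context around the first table of a document (illustrative only)."""

    def __init__(self):
        super().__init__()
        self.info = {"title": "", "caption": "", "section_heading": "",
                     "text_before": "", "text_after": ""}
        self._current = None    # innermost open title/caption/heading tag
        self._depth = 0         # nesting depth of table tags
        self._state = "before"  # position relative to the first table

    def handle_starttag(self, tag, attrs):
        if tag == "table" and self._state != "after":
            self._depth += 1
            self._state = "inside"
        elif tag in HEADINGS | {"title", "caption"}:
            self._current = tag
            if tag in HEADINGS and self._state == "before":
                self.info["text_before"] = ""  # assumption: a new section starts

    def handle_endtag(self, tag):
        if tag == "table" and self._state == "inside":
            self._depth -= 1
            if self._depth == 0:
                self._state = "after"
        elif tag == self._current:
            self._current = None

    def handle_data(self, data):
        text = data.strip()
        if not text:
            return
        if self._current == "title":
            self.info["title"] = text
        elif self._state == "before":
            # Keep the most recent heading and the most recent text chunk.
            key = "section_heading" if self._current in HEADINGS else "text_before"
            self.info[key] = text
        elif self._state == "inside" and self._current == "caption":
            self.info["caption"] = text
        elif self._state == "after" and not self.info["text_after"]:
            self.info["text_after"] = text  # first text chunk after the table


def extract_related_info(html):
    parser = RelatedInfoParser()
    parser.feed(html)
    parser.close()
    return parser.info
```

The sketch keeps only a single text chunk on each side of the table and ignores every table after the first, so it is a starting point rather than a faithful implementation.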

For each piece of tabular data, and according to its table type, the in-table knowledge extraction unit 38 extracts from the tabular data cell knowledge consisting of a pair of in-table information (including the entity, entity type, attribute, and attribute value in the tabular data) and the table-related information, based on the table-related information extracted by the table-related information extraction unit 36, the classification result of the table type classification unit 32, and the classification result of the row/column type classification unit 34. It then stores the extracted cell knowledge in the search database 28.

The search database 28 stores the cell knowledge extracted by the in-table knowledge extraction unit 38. The cell knowledge consists of a knowledge record table and a knowledge cell ID table. The knowledge record table is a table in which one record consists of the entity type, entity, entity 2, attribute, attribute value, table type, and related information (title, caption, section heading, text before the table, and text outside the table). The knowledge cell ID table is a table in which one record consists of the entity type row ID, entity type column ID, entity row ID, entity column ID, entity 2 row ID, entity 2 column ID, attribute row ID, attribute column ID, attribute value row ID, and attribute value column ID.
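One possible in-memory shape for a knowledge record is sketched below with Python dataclasses. The class and field names (`RelatedInfo`, `KnowledgeRecord`, `text_before_table`, and so on) are illustrative choices, the sample entity type "player" is invented for the example, and treating entity 2 as optional is an assumption; the source only lists the fields of a record.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class RelatedInfo:
    """Out-of-table context stored alongside each record."""
    title: str = ""
    caption: str = ""
    section_heading: str = ""
    text_before_table: str = ""
    text_outside_table: str = ""


@dataclass
class KnowledgeRecord:
    """One record of the knowledge record table."""
    entity_type: str
    entity: str
    attribute: str
    attribute_value: str
    table_type: str
    entity2: Optional[str] = None  # assumption: absent for most table types
    related: RelatedInfo = field(default_factory=RelatedInfo)


# The triple from the background section, grounded by the page title.
record = KnowledgeRecord(
    entity_type="player",  # hypothetical entity type
    entity="Taro Yamada",
    attribute="score",
    attribute_value="123.45",
    table_type="vertical list",
    related=RelatedInfo(title="Sports Tournament A"),
)
```

Keeping the related information in the same record as the triple is what lets a query mention "Sports Tournament A" even though that string never appears in a table cell.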

For the search query received by the input unit 10, which is a keyword set or a natural sentence, the query interpretation unit 40 assigns a label to the keyword corresponding to an attribute in the search query, and also assigns labels to the keywords other than the attribute.
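A minimal sketch of such labeling for a keyword-set query follows. How the apparatus decides which keyword is an attribute is not described here, so the lookup against a vocabulary of known attribute names, the function name `label_query`, and the label strings "attribute" and "other" are all assumptions.

```python
def label_query(keywords, known_attributes):
    """Label each keyword as an attribute keyword or as another keyword.

    `known_attributes` is a hypothetical vocabulary, for example the set of
    attribute names already present in the search database.
    """
    return [
        (kw, "attribute" if kw in known_attributes else "other")
        for kw in keywords
    ]


labeled = label_query(
    ["Sports Tournament A", "Taro Yamada", "score"],
    known_attributes={"score", "rank"},
)
```

A natural-sentence query would first need to be split into keywords, which this sketch leaves out.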

The knowledge search unit 42 outputs the cell knowledge corresponding to the search query from the search database 28, based on the search query labeled by the query interpretation unit 40. Here, it outputs from the search database 28 the cell knowledge for which the keyword corresponding to the attribute of the search query matches the attribute of the in-table information, and the keywords other than the attribute of the search query match the entity or entity type of the in-table information or the table-related information.

Based on the cell knowledge output from the knowledge search unit 42, the search result generation unit 44 outputs the set of values of the "attribute value" field of the knowledge records of the acquired cell knowledge to the output unit 50 as the search result, thereby returning the search result to the user.
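The matching rule and the result generation can be sketched together as follows. Records are plain dictionaries here, the attribute keyword is compared by exact match, and the other keywords are compared by substring match against the entity, entity type, and related information; those comparison choices, the function names, and the sample data are assumptions. Ranking by search score is not shown, because this section does not define the score.

```python
def search_cells(records, labeled_query):
    """Return records whose attribute and context match the labeled query."""
    attrs = [kw for kw, label in labeled_query if label == "attribute"]
    others = [kw for kw, label in labeled_query if label != "attribute"]
    hits = []
    for rec in records:
        if rec["attribute"] not in attrs:
            continue
        context = [rec["entity"], rec["entity_type"], *rec["related"].values()]
        # Every non-attribute keyword must appear in some context field.
        if all(any(kw in text for text in context) for kw in others):
            hits.append(rec)
    return hits


def build_result(hits):
    """Collect the 'attribute value' field of each matching record."""
    return [rec["attribute_value"] for rec in hits]


records = [
    {"entity_type": "player", "entity": "Taro Yamada", "attribute": "score",
     "attribute_value": "123.45",
     "related": {"title": "Results of Sports Tournament A"}},
    {"entity_type": "player", "entity": "Taro Yamada", "attribute": "score",
     "attribute_value": "98.76",
     "related": {"title": "Results of Sports Tournament B"}},
]
query = [("Sports Tournament A", "other"), ("Taro Yamada", "other"),
         ("score", "attribute")]
answers = build_result(search_cells(records, query))
```

Without the title stored as related information, both records would match the query and the two scores could not be told apart.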

<Operation of the Table Cell Search Apparatus According to the Embodiment of the Present Invention>

Next, the operation of the table cell search apparatus 100 according to the embodiment of the present invention will be described.

The table cell search apparatus 100 executes the search indexing processing routine and the table cell search processing routine described below.

First, the search indexing processing routine will be described.

When the input unit 10 receives a set of HTML documents, the table cell search apparatus 100 executes the search indexing processing routine shown in FIG. 2.

First, in step S100, the tabular data extraction unit 30 acquires, from an HTML document in the HTML document set received by the input unit 10, a tabular data set consisting of the tabular data described by table tags.

Next, in step S102, for the tabular data of the table tag being processed, the table type classification unit 32 takes as input the attribute-attribute value set of the tabular data acquired in step S100 and classifies the table type. For table type classification, methods such as the one described in Non-Patent Document 1, which uses the attribute-attribute value set as features, can be used. The table types in the present embodiment are six classes: vertical list, horizontal list, vertical attribute, horizontal attribute, matrix, and other. If the classification result in step S102 is "other", the process returns to step S100, selects the next table tag, and continues the subsequent processing.

FIG. 3 shows an example of each table type. The processing of step S106 described later assumes the following about each table. Vertical list and vertical attribute tables have a header row in which attributes are described. Horizontal list and horizontal attribute tables have a header column in which attributes are described. Matrix tables have a header row and a header column in which entities are described. Vertical list tables have a column in which entities are described. Horizontal list tables have a row in which entities are described.
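The six classes and the layout assumptions listed above can be written down as a small lookup table. The enum `TableType`, the dictionary `HEADER_LAYOUT`, its keys, and the helper `is_indexable` are names chosen for this sketch; entries the text does not state (for example, where entities sit in an attribute table) are left as `None` rather than guessed.

```python
from enum import Enum


class TableType(Enum):
    VERTICAL_LIST = "vertical list"
    HORIZONTAL_LIST = "horizontal list"
    VERTICAL_ATTRIBUTE = "vertical attribute"
    HORIZONTAL_ATTRIBUTE = "horizontal attribute"
    MATRIX = "matrix"
    OTHER = "other"


# Layout assumed for each indexable table type, as stated in the text.
HEADER_LAYOUT = {
    TableType.VERTICAL_LIST: {
        "header_row": True, "header_column": False,
        "header_holds": "attribute", "entities_in": "column"},
    TableType.HORIZONTAL_LIST: {
        "header_row": False, "header_column": True,
        "header_holds": "attribute", "entities_in": "row"},
    TableType.VERTICAL_ATTRIBUTE: {
        "header_row": True, "header_column": False,
        "header_holds": "attribute", "entities_in": None},
    TableType.HORIZONTAL_ATTRIBUTE: {
        "header_row": False, "header_column": True,
        "header_holds": "attribute", "entities_in": None},
    TableType.MATRIX: {
        "header_row": True, "header_column": True,
        "header_holds": "entity", "entities_in": None},
}


def is_indexable(table_type):
    """Tables classified as 'other' are skipped during indexing."""
    return table_type is not TableType.OTHER
```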

In step S104, for the tabular data of the table tag being processed, the row/column type classification unit 34 takes as input the attribute-attribute value set of the tabular data extracted in step S100 and classifies the type of each row (the set of cells having row number i) and each column (the set of cells having column number j) of the tabular data. For row type and column type classification, methods such as the one described in Non-Patent Document 2, which uses the attribute-attribute value set as features, can be used. The row/column types in the present embodiment are two classes: header and data. If no row or column is judged to be a header, the process returns to step S100, selects the next table tag, and continues the subsequent processing.

In step S106, the table-related information extraction unit 36 analyzes the HTML document containing the tabular data of the table tag being processed and extracts an attribute-attribute value set that constitutes the table-related information for that tabular data.

In step S108, for the tabular data of the table tag being processed, the in-table knowledge extraction unit 38 extracts from the tabular data, according to its table type, cell knowledge consisting of a pair of in-table information (including the entity, entity type, attribute, and attribute value in the tabular data) and the table-related information, based on the table-related information extracted in step S106, the table type classification result of step S102, and the row/column type classification result of step S104. It then stores the extracted cell knowledge in the search database 28.

Steps S100 to S108 above are repeated for each table tag contained in the HTML document.

This per-table-tag loop is in turn repeated for each HTML document in the set of HTML documents.
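The control flow of steps S100 to S108 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `index_tables` and `index_documents`, the callables passed in for each step, and the string label `"header"` are all assumptions introduced here.

```python
HEADER = "header"  # assumed label for the "header" class of step S104


def index_tables(tables, extract_grid, classify_table, classify_lines,
                 extract_related, extract_knowledge):
    """Run the S100-S108 loop over the table tags of one HTML document."""
    records = []
    for table in tables:
        grid = extract_grid(table)                     # step S100
        table_type = classify_table(grid)              # step S102
        row_types, col_types = classify_lines(grid)    # step S104
        if HEADER not in row_types and HEADER not in col_types:
            continue                                   # no header: next table tag
        related = extract_related(table)               # step S106
        records.extend(                                # step S108
            extract_knowledge(grid, table_type, row_types, col_types, related))
    return records


def index_documents(documents, **steps):
    """Repeat the per-table loop for every document (each a list of tables)."""
    records = []
    for tables in documents:
        records.extend(index_tables(tables, **steps))
    return records
```

Passing the individual steps in as callables keeps the sketch independent of any particular classifier or HTML parser.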

The processing of step S100 is executed by the tabular data extraction routine shown in FIG. 4.

In step S200, any table tag that appears as a child element of a table tag is replaced with the set of character strings it contains, and all of its HTML structure is removed.

In step S201, the row number variable i is initialized to 0, and the tabular data is initialized as a two-dimensional array.

Next, the following processing is repeated each time a tr tag appears.

First, in step S202, the row number variable i is incremented by 1 and the column number variable j is set to 0. Then, each time a td tag or th tag (hereinafter called a cell tag) appears as a child element of the tr tag, the processing of step S204 is repeated.

In step S204, an attribute-attribute value set is obtained from the cell tag. The attributes and their values are as follows.

Attribute "row size" (symbol n): the value of the rowspan attribute. If no rowspan value is specified, n = 1.

Attribute "column size" (symbol m): the value of the colspan attribute. If no colspan value is specified, m = 1.

Attribute "character string": the text within the child elements of the cell tag.

Attribute "tag name": the name of the cell tag ('th' or 'td').

Attribute "image flag": 1 if an 'img' tag exists among the child elements of the cell tag, 0 otherwise.

Attribute "numeric flag": 1 if the character string in the cell is a numerical value, 0 otherwise.

Attribute "date flag": 1 if the character string in the cell is a date, 0 otherwise.

Attribute "symbol flag": 1 if the character string in the cell consists only of symbols, 0 otherwise.

Attribute "long character string flag": 1 if the number of characters in the cell's character string is greater than or equal to a threshold, 0 otherwise. The threshold is set to, for example, 50.

Attribute "format flag": 1 if the cell contains a tag that changes how the content looks when browsed, such as a b tag or a strong tag, 0 otherwise.

Other information about the cell tag obtained from the HTML document, such as the background color or whether a text alignment is specified, may also be added as attributes.
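The attribute-attribute value set of step S204 could be built along the following lines. This is a hedged sketch: the function `cell_attributes`, its dictionary keys, the date formats tried, and the extra entries in `FORMAT_TAGS` are assumptions made for illustration; the description itself only names the attributes, the b and strong tags, and the example threshold of 50.

```python
from datetime import datetime

LONG_TEXT_THRESHOLD = 50                 # example threshold given in the text
FORMAT_TAGS = {"b", "strong", "i", "em"}  # b and strong are named; the rest is assumed
DATE_FORMATS = ("%Y-%m-%d", "%Y/%m/%d")   # assumed; the text does not define "date"


def _is_number(text):
    try:
        float(text.replace(",", ""))
        return True
    except ValueError:
        return False


def _is_date(text):
    for fmt in DATE_FORMATS:
        try:
            datetime.strptime(text, fmt)
            return True
        except ValueError:
            pass
    return False


def _is_symbol_only(text):
    # Non-empty and made up solely of non-alphanumeric, non-space characters.
    return bool(text) and all(not ch.isalnum() and not ch.isspace() for ch in text)


def cell_attributes(tag_name, attrs, text, descendant_tags):
    """Build the attribute set for one td/th tag.

    attrs: HTML attributes of the cell tag as a dict of strings.
    descendant_tags: names of the tags found inside the cell.
    """
    text = text.strip()
    return {
        "row_size": int(attrs.get("rowspan", 1)),   # n, defaults to 1
        "col_size": int(attrs.get("colspan", 1)),   # m, defaults to 1
        "text": text,
        "tag_name": tag_name,
        "image_flag": int("img" in descendant_tags),
        "numeric_flag": int(_is_number(text)),
        "date_flag": int(_is_date(text)),
        "symbol_flag": int(_is_symbol_only(text)),
        "long_text_flag": int(len(text) >= LONG_TEXT_THRESHOLD),
        "format_flag": int(any(t in FORMAT_TAGS for t in descendant_tags)),
    }
```

For example, `cell_attributes("td", {"rowspan": "2"}, " 1,234 ", ["b"])` yields a row size of 2, a column size of 1, and both the numeric flag and the format flag set to 1.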

In step S206, the attribute-attribute value set obtained in step S204 is stored in tabular data[i][j].

In the processing of step S206, j is first incremented by 1 at a time until it reaches a column number for which no attribute-attribute value set is yet stored in tabular data[i][j]. Then, for every combination of row number and column number satisfying i ≤ row number ≤ i + n and j ≤ column number ≤ j + m, the attribute-attribute value set obtained in step S204 is stored in tabular data[row number][column number]. In other words, the content of a merged cell is copied to each tabular data[row number][column number] entry that it covers. Tabular data addressed by a row number and a column number is hereinafter called a "cell".

Steps S204 to S206 above are repeated for the cell tags, in order of appearance, until no unprocessed cell tag remains.

This cell tag processing is in turn repeated, in order of appearance, for the tr tags that appear as child elements of the table tag, until no unprocessed tr tag remains.

In step S208, the largest row number and the largest column number at which an attribute-attribute value set is stored in the tabular data are held as the number of rows N and the number of columns M of the tabular data. For every combination of row number and column number satisfying 1 ≤ row number ≤ N and 1 ≤ column number ≤ M, an empty set is added wherever tabular data[row number][column number] has no attribute-attribute value set stored.
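Steps S201 to S208 amount to expanding merged cells into a rectangular grid. The sketch below is one way to do this under stated assumptions: the function name `build_grid`, the dictionary keyed by (row, column), and the input shape (a list of rows of attribute dictionaries) are hypothetical. The text states the copy bounds as i + n and j + m; this sketch copies over rows i to i + n - 1 and columns j to j + m - 1, which is the usual meaning of rowspan and colspan, so treat that bound as an assumption.

```python
def build_grid(rows):
    """Expand rowspan/colspan cells into a 1-based grid.

    rows: list of tr rows, each a list of cell attribute dicts that may
          carry "row_size" (n) and "col_size" (m).
    Returns (grid, N, M) where grid maps (row, col) to an attribute dict.
    """
    grid = {}
    i = 0                                   # S201: row number starts at 0
    for row in rows:
        i += 1                              # S202: next row, reset column
        j = 0
        for cell in row:
            j += 1
            while (i, j) in grid:           # S206: skip already-filled columns
                j += 1
            n = cell.get("row_size", 1)
            m = cell.get("col_size", 1)
            for r in range(i, i + n):       # copy the merged cell to every
                for c in range(j, j + m):   # position it covers
                    grid[(r, c)] = cell
    # S208: table size is the largest stored row and column number.
    n_rows = max((r for r, _ in grid), default=0)
    n_cols = max((c for _, c in grid), default=0)
    for r in range(1, n_rows + 1):
        for c in range(1, n_cols + 1):
            grid.setdefault((r, c), {})     # pad holes with an empty set
    return grid, n_rows, n_cols
```

With a first row holding a cell of row size 2 followed by a cell of column size 2, and a second row holding a single cell, the result is a 2 by 3 grid in which position (2, 1) repeats the first cell and position (2, 3) is an empty set.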

FIG. 5 shows an example of extracted tabular data. As shown in FIG. 5, each entry of the tabular data is assigned a row number and a column number, and has a character string, tag name, numerical value, row size, column size, and origin as its elements.

The processing of step S106 is executed by the table related information extraction routine shown in FIG. 6.

In step S300, the text of the title tag is obtained as the attribute value and stored under the attribute "title". If no title tag exists, the attribute value is set to a null character string.

In step S302, if the child elements of the table tag from which the tabular data was extracted include a caption tag, the text contained in the child elements of the caption tag is obtained as the attribute value and stored under the attribute "caption". If no caption tag exists, a null character string is set.

In step S304, the text contained in the child elements of the p or div tag that appears immediately before the start of the table tag from which the tabular data was extracted is obtained and stored under the attribute "preceding text". If no p or div tag appears immediately before the table tag, a null character string is set.

In step S306, the text contained in the child elements of the p or div tag that appears immediately after the end of the table tag from which the tabular data was extracted is obtained and stored under the attribute "following text". If no p or div tag appears immediately after the table tag, a null character string is set.

In step S308, the heading array (element numbers 1 to 6) is initialized with empty character strings. When an extracted element is an h1 to h6 tag, its heading level (1 for h1, 6 for h6) is obtained, and the entries of the heading array from that heading level onward are reset to empty character strings. For example, for h2, entries 2 to 6 of the heading array are cleared. Steps S310 to S312 below are then repeated while the h1 to h6 tags that appear in the HTML document and the table tag of the tabular data are extracted, in order of appearance, as extracted elements.

In step S310, the character string of the extracted element is obtained and stored in heading array[heading level].

In step S312, if the extracted element is the table tag, the loop is exited, and the character string obtained by joining the elements of the heading array with spaces is stored as the value of the attribute "heading". If no character string is stored in the heading array, a null character string is stored as the value of the attribute "heading".
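The heading tracking of steps S308 to S312 can be illustrated with a short sketch. The function name `section_heading` and the input format (a list of (tag, text) pairs in document order, containing only h1 to h6 tags and the target table tag) are assumptions. Skipping empty entries when joining is also a choice made here so that the result has no stray spaces.

```python
def section_heading(elements):
    """Return the heading path in effect at the target table, or None.

    elements: (tag, text) pairs in order of appearance, where tag is one of
              "h1".."h6" or "table" (the table being processed).
    """
    headings = [""] * 7                 # indices 1..6 are used (S308)
    for tag, text in elements:
        if tag == "table":              # S312: stop at the target table
            break
        level = int(tag[1])             # h1 -> 1, ..., h6 -> 6
        for k in range(level, 7):       # clear this level and deeper levels
            headings[k] = ""
        headings[level] = text          # S310: store the heading text
    joined = " ".join(h for h in headings[1:] if h)
    return joined or None               # None stands in for the null string
```

A deeper heading is discarded as soon as a shallower or equal heading follows it, so the stored string reflects only the section that actually contains the table.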

Through steps S300 to S312 above, the table related information extraction unit 36 extracts, as the table related information, the page title, the table caption, the heading of the section containing the table, the text immediately before the table, or the text immediately after the table.

FIG. 7 shows an example of extracted table related information.

As described below, the processing of step S108 executes the in-table knowledge extraction routine that corresponds to the table type class assigned in step S102: vertical list, horizontal list, vertical attribute, horizontal attribute, or matrix.

First, the case where the table type class is "vertical list" is described. In this case, the in-table knowledge extraction routine shown in FIG. 8 is executed.

In step S400, when the row type classification results for row numbers i are all "header" over the range 1 ≤ i ≤ n, the range 1 ≤ i ≤ n is obtained as the header row number set. The set of row numbers that are not in the header row number set and satisfy 1 ≤ i ≤ N is obtained as the data row number set. If the header row number set is empty, the process returns to step S100 of the search indexing routine, selects the next table tag, and continues from there.
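One reading of step S400 is that the header row number set is the longest run of header rows starting at row 1. The sketch below follows that reading, which is an interpretation rather than something the text spells out; the function name `split_header_data` and the string labels are assumptions.

```python
def split_header_data(line_types):
    """Split 1-based row (or column) numbers into header and data sets.

    line_types: classification per row, e.g. ["header", "data", ...],
                where line_types[0] belongs to row number 1.
    """
    n = 0
    for t in line_types:
        if t != "header":
            break                       # the leading header run ends here
        n += 1
    header = list(range(1, n + 1))
    data = list(range(n + 1, len(line_types) + 1))
    return header, data
```

Under this reading, a row classified as a header after the first data row is treated as part of the data row number set. The same function applies unchanged to column types.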

In step S402, the set of column numbers satisfying Condition 1 below is obtained as the entity column number set.

(Condition 1) K1 ÷ K2 < threshold θ, where K1 is the number of distinct values of the attribute "character string" in the attribute-attribute value sets of tabular data[row number][column number] over those rows, among row numbers 1 to N, whose row type classification result is "data", and K2 is the number of rows whose row type classification result is "data".

Condition 1 is treated as not satisfied when K2 is 0. A specific value such as 0.75 is set for θ. If the entity column number set is empty, the process returns to step S100 of the search indexing routine, selects the next table tag, and continues from there.
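Condition 1 can be checked per column as in the following sketch. The function name `entity_columns` and the grid representation (a dictionary keyed by (row, column) holding attribute dictionaries with a "text" entry) are assumptions. The comparison direction K1 ÷ K2 < θ is copied from Condition 1 as written.

```python
def entity_columns(grid, n_rows, n_cols, row_types, theta=0.75):
    """Return the column numbers that satisfy Condition 1.

    row_types: classification per row; row_types[0] belongs to row 1.
    """
    data_rows = [r for r in range(1, n_rows + 1) if row_types[r - 1] == "data"]
    k2 = len(data_rows)
    if k2 == 0:
        return []                        # Condition 1 is never met when K2 = 0
    result = []
    for c in range(1, n_cols + 1):
        # K1: number of distinct character strings in the data rows of column c.
        k1 = len({grid[(r, c)].get("text", "") for r in data_rows})
        if k1 / k2 < theta:
            result.append(c)
    return result
```

The same computation with rows and columns swapped gives the entity row numbers of Condition 2.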

The following processing is repeated for each column number j in the entity column number set. However, when k = j, the processing moves on to the next column number j.

First, for each attribute column number k from 1 to M, steps S404 and S406 below are repeated for each row number i in the data row number set.

In step S404, in-table knowledge is extracted for the column number j in the entity column number set, the attribute column number k, and the row number i in the data row number set. The in-table knowledge extracted consists of the following entity type, entity, entity 2, attribute, and attribute value.

Entity type: the attribute "character string" of tabular data[h][j] is obtained for each row number h in the header row number set, in ascending order of row number; duplicate character strings are then removed, and the remaining character strings are concatenated with space characters. The entity type row ID is the header row number set, and the entity type column ID is j.

Entity: the attribute "character string" of tabular data[i][j]. The entity row ID is i, and the entity column ID is j.

Entity 2: a null character string. The entity 2 row ID and the entity 2 column ID are each a null character string.

Attribute: the attribute "character string" of tabular data[h][k] is obtained for each row number h in the header row number set, in ascending order of row number; duplicate character strings are then removed, and the remaining character strings are concatenated with space characters. The attribute row ID is the header row number set, and the attribute column ID is k.

Attribute value: the attribute "character string" of tabular data[i][k]. The attribute value row ID is i, and the attribute value column ID is k.

In step S406, the entity type, entity, entity 2, attribute, and attribute value obtained in step S404, the table type obtained in step S102, and the table related information obtained in step S106 (title, caption, section heading, pre-table text, and out-of-table text) are stored as one record in the knowledge record table of the search database 28. In addition, the entity type row ID, entity type column ID, entity row ID, entity column ID, entity 2 row ID, entity 2 column ID, attribute row ID, attribute column ID, attribute value row ID, and attribute value column ID obtained in step S404 are stored as one record in the knowledge cell ID table of the search database 28.

Steps S404 to S406 above are repeated for each row number i in the data row number set.

This per-row loop over i is in turn repeated for each attribute column number k from 1 to M.

This per-column loop over k is in turn repeated for each column number j in the entity column number set.
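The nested loops of steps S404 to S406 for a "vertical list" table might look like the sketch below. The names `join_unique` and `extract_vertical_list`, the record layout, and the use of `None` for the null character string are assumptions. The text says that processing moves on when k = j; this sketch reads that as skipping the attribute column that coincides with the entity column. Rather than writing to a database, the function returns the records it would store.

```python
def join_unique(texts):
    """Join strings with spaces, keeping first occurrences and dropping empties."""
    seen = []
    for t in texts:
        if t and t not in seen:
            seen.append(t)
    return " ".join(seen)


def extract_vertical_list(grid, n_cols, header_rows, data_rows, entity_cols):
    """Return one record per (entity column j, attribute column k, data row i)."""
    records = []
    for j in entity_cols:
        entity_type = join_unique(grid[(h, j)].get("text", "") for h in header_rows)
        for k in range(1, n_cols + 1):
            if k == j:
                continue                 # the entity column is not its own attribute
            attribute = join_unique(grid[(h, k)].get("text", "") for h in header_rows)
            for i in data_rows:
                records.append({
                    "entity_type": entity_type,
                    "entity": grid[(i, j)].get("text", ""),
                    "entity2": None,
                    "attribute": attribute,
                    "attribute_value": grid[(i, k)].get("text", ""),
                    # (row ID, column ID) pairs for the knowledge cell ID table
                    "cell_ids": {
                        "entity_type": (tuple(header_rows), j),
                        "entity": (i, j),
                        "entity2": (None, None),
                        "attribute": (tuple(header_rows), k),
                        "attribute_value": (i, k),
                    },
                })
    return records
```

For a two-column table with header cells "Mountain" and "Height" and data rows ("Fuji", "3776") and ("Kita", "3193"), using column 1 as the entity column yields two records, each with entity type "Mountain" and attribute "Height".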

FIG. 9 shows an example of in-table knowledge extracted when the table type class is "vertical list".

Next, the case where the table type class is "horizontal list" is described. In this case, the in-table knowledge extraction routine shown in FIG. 10 is executed.

In step S500, when the column type classification results for column numbers j are all "header" over the range 1 ≤ j ≤ m, the range 1 ≤ j ≤ m is obtained as the header column number set. The set of column numbers that are not in the header column number set and satisfy 1 ≤ j ≤ M is obtained as the data column number set. If the header column number set is empty, the process returns to step S100 of the search indexing routine, selects the next table tag, and continues from there.

In step S502, the set of row numbers satisfying Condition 2 below is obtained as the entity row number set.

(Condition 2) K1 ÷ K2 < threshold θ, where K1 is the number of distinct values of the attribute "character string" in the attribute-attribute value sets of tabular data[row number][column number] over those columns, among column numbers 1 to M, whose column type classification result is "data", and K2 is the number of columns whose column type classification result is "data".

Condition 2 is treated as not satisfied when K2 is 0. A specific value such as 0.75 is set for θ. If the entity row number set is empty, the process returns to step S100 of the search indexing routine, selects the next table tag, and continues from there.

The following processing is repeated for each row number i in the entity row number set.

First, for each row number k from 1 to N, steps S504 and S506 below are repeated for each column number j in the data column number set.

In step S504, in-table knowledge is extracted for the row number i in the entity row number set, the row number k, and the column number j in the data column number set. The in-table knowledge extracted consists of the following entity type, entity, entity 2, attribute, and attribute value.

Entity type: the attribute "character string" of tabular data[i][h] is obtained for each column number h in the header column number set, in ascending order of column number; duplicate character strings are then removed, and the remaining character strings are concatenated with space characters. The entity type row ID is i, and the entity type column ID is the header column number set.

Attribute: the attribute "character string" of tabular data[k][h] is obtained for each column number h in the header column number set, in ascending order of column number; duplicate character strings are then removed, and the remaining character strings are concatenated with space characters. The attribute row ID is k, and the attribute column ID is the header column number set.

Attribute value: the attribute "character string" of tabular data[k][j]. The attribute value row ID is k, and the attribute value column ID is j.

In step S506, the entity type, entity, entity 2, attribute, and attribute value obtained in step S504, the table type obtained in step S102, and the table related information obtained in step S106 (title, caption, section heading, pre-table text, and out-of-table text) are stored as one record in the knowledge record table of the search database 28. In addition, the entity type row ID, entity type column ID, entity row ID, entity column ID, entity 2 row ID, entity 2 column ID, attribute row ID, attribute column ID, attribute value row ID, and attribute value column ID are stored as one record in the knowledge cell ID table of the search database 28.

Steps S504 to S506 above are repeated for each column number j in the data column number set.

This per-column loop over j is in turn repeated for each row number k from 1 to N.

This per-row loop over k is in turn repeated for each row number i in the entity row number set. However, when k = i, the processing moves on to the next row number i.

FIG. 11 shows an example of in-table knowledge extracted when the table type class is "horizontal list".

Next, the case where the table type class is "vertical attribute" is described. In this case, the in-table knowledge extraction routine shown in FIG. 12 is executed.

In step S600, when the row type classification results for row numbers i are all "header" over the range 1 ≤ i ≤ n, the range 1 ≤ i ≤ n is obtained as the header row number set. The set of row numbers that are not in the header row number set and satisfy 1 ≤ i ≤ N is obtained as the data row number set. If the header row number set is empty, the process returns to step S100 of the search indexing routine, selects the next table tag, and continues from there.

In step S602, when the column type classification results for column numbers j are all "header" over the range 1 ≤ j ≤ m, the range 1 ≤ j ≤ m is obtained as the header column number set. The set of column numbers that are not in the header column number set and satisfy 1 ≤ j ≤ M is obtained as the data column number set.

Then, steps S604 to S606 below are repeated for each column number j in the data column number set.

In step S604, in-table knowledge is extracted for the column number j in the data column number set, and a record is stored in the search database 28. The in-table knowledge extracted consists of the following entity type, entity, entity 2, attribute, and attribute value.

Entity type: a null character string. The entity type row ID and the entity type column ID are each a null character string.

Entity: when the attribute "caption" of the table related information is set to something other than a null character string, that attribute value is used as the entity, and the entity row ID and the entity column ID are each set to a null character string.

When the attribute "caption" is a null character string and the header row number set is not empty, the attribute "character string" of tabular data[k][h] is obtained for each column number h in the header column number set and each row number k from 1 to N, in ascending order of column number and then of row number; duplicate character strings are then removed, and the remaining character strings concatenated with space characters are used as the entity. The entity row ID is [1, 2, ..., N], and the entity column ID is the header column number set.

When the attribute "caption" is a null character string and the header row number set is empty, the attribute "title" of the table related information is used as the entity. The entity row ID and the entity column ID are each set to a null character string.

Attribute: the attribute "character string" of tabular data[h][j] is obtained for each row number h in the header row number set, in ascending order of row number; duplicate character strings are then removed, and the remaining character strings are concatenated with space characters. The attribute row ID is the header row number set, and the attribute column ID is j.

Attribute value: the attribute "character string" of tabular data[d][j] is obtained for each row number d in the data row number set, in ascending order of row number; duplicate character strings are then removed, and the remaining character strings are concatenated with space characters. The attribute value row ID is the data row number set, and the attribute value column ID is j.

In step S606, the entity type, entity, entity 2, attribute, and attribute value obtained in step S604, the table type obtained in step S102, and the table related information obtained in step S106 (title, caption, section heading, pre-table text, and out-of-table text) are stored as one record in the knowledge record table of the search database 28. In addition, the entity type row ID, entity type column ID, entity row ID, entity column ID, entity 2 row ID, entity 2 column ID, attribute row ID, attribute column ID, attribute value row ID, and attribute value column ID are stored as one record in the knowledge cell ID table of the search database 28.
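For a "vertical attribute" table, the entity is chosen by a fallback chain (caption, then header column text, then page title) and each data column contributes one attribute with its joined values. The sketch below illustrates this under assumptions: the names `join_unique`, `choose_entity`, and `extract_vertical_attribute` and the record layout are hypothetical, and `None` stands in for the null character string. The text tests the header row number set before reading the header columns; this sketch tests the header column number set instead, since those are the cells that are read, and that substitution is an assumption.

```python
def join_unique(texts):
    """Join strings with spaces, keeping first occurrences and dropping empties."""
    seen = []
    for t in texts:
        if t and t not in seen:
            seen.append(t)
    return " ".join(seen)


def choose_entity(caption, title, grid, n_rows, header_cols):
    """Caption first, then the text of the header columns, then the page title."""
    if caption:
        return caption
    if header_cols:                      # assumed test, see the note above
        return join_unique(grid[(k, h)].get("text", "")
                           for h in header_cols
                           for k in range(1, n_rows + 1))
    return title


def extract_vertical_attribute(grid, n_rows, header_rows, data_rows,
                               header_cols, data_cols, caption, title):
    """Return one record per data column j (steps S604 to S606)."""
    entity = choose_entity(caption, title, grid, n_rows, header_cols)
    records = []
    for j in data_cols:
        records.append({
            "entity_type": None,
            "entity": entity,
            "entity2": None,
            "attribute": join_unique(grid[(h, j)].get("text", "") for h in header_rows),
            "attribute_value": join_unique(grid[(d, j)].get("text", "") for d in data_rows),
        })
    return records
```

The "horizontal attribute" case follows the same pattern with the roles of rows and columns exchanged.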

FIG. 13 shows an example of in-table knowledge extracted when the table type class is "vertical attribute".

Next, the case where the table type class is "horizontal attribute" is described. In this case, the in-table knowledge extraction routine shown in FIG. 14 is executed.

In step S700, when the row type classification results for row numbers i are all "header" over the range 1 ≤ i ≤ n, the range 1 ≤ i ≤ n is obtained as the header row number set. The set of row numbers that are not in the header row number set and satisfy 1 ≤ i ≤ N is obtained as the data row number set.

In step S702, when the column type classification results for column numbers j are all "header" over the range 1 ≤ j ≤ m, the range 1 ≤ j ≤ m is obtained as the header column number set. The set of column numbers that are not in the header column number set and satisfy 1 ≤ j ≤ M is obtained as the data column number set. If the header column number set is empty, the process returns to step S100 of the search indexing routine, selects the next table tag, and continues from there.

Then, steps S704 to S706 below are repeated for each row number i in the data row number set.

In step S704, in-table knowledge is extracted for the row number i in the data row number set, and a record is stored in the search database 28. The in-table knowledge extracted consists of the following entity type, entity, entity 2, attribute, and attribute value.

Entity: when the attribute "caption" of the table related information is set to something other than a null character string, the value of the attribute "caption" is used as the entity, and the entity row ID and the entity column ID are each set to a null character string.

When the attribute "caption" is a null character string and the header row number set is not empty, the attribute "character string" of tabular data[h][k] is obtained for each row number h in the header row number set and each column number k from 1 to M, in ascending order of row number and then of column number; duplicate character strings are then removed, and the remaining character strings concatenated with space characters are used as the entity. The entity row ID is the header row number set, and the entity column ID is the array [1, 2, ..., M].

Attribute: the attribute "character string" of tabular data[i][h] is obtained for each column number h in the header column number set, in ascending order of column number; duplicate character strings are then removed, and the remaining character strings are concatenated with space characters. The attribute row ID is i, and the attribute column ID is the header column number set.

The attribute value is obtained by acquiring the attribute "character string" of tabular data[i][d] for each column number d included in the data column number set, in ascending order of column number, removing duplicate character strings, and concatenating the remaining character strings with space characters. The attribute value row ID is i, and the attribute value column ID is the data column number set.
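The extraction rules for the "horizontal attribute" case can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names `join_unique` and `extract_horizontal`, the dictionary keys, and the sample table are all assumptions introduced here, and the behaviour when the caption is null and there are no header rows is a placeholder because the passage above does not cover it.

```python
def join_unique(strings):
    """Join strings with single spaces, keeping only the first occurrence of each."""
    return " ".join(dict.fromkeys(strings))


def extract_horizontal(table, caption, header_rows, header_cols, data_cols, i):
    """Return entity, attribute, and attribute value for data row i (1-based)."""
    def cell(r, c):
        return table[r - 1][c - 1]  # the text numbers rows and columns from 1

    m = len(table[0])  # M: number of columns
    if caption:  # caption is not the null string
        entity = caption
    elif header_rows:
        # Header rows in ascending order, then columns 1..M in ascending order.
        entity = join_unique(cell(h, k) for h in header_rows for k in range(1, m + 1))
    else:
        entity = ""  # assumption: this case is not covered by the passage above
    return {
        "entity": entity,
        "attribute": join_unique(cell(i, h) for h in header_cols),
        "attribute_value": join_unique(cell(i, d) for d in data_cols),
    }


# Illustrative data only.
table = [["Birthday", "January 1"], ["Birthplace", "Tokyo"]]
record = extract_horizontal(table, "Yamada", [], [1], [2], 1)
```

Here `dict.fromkeys` is used only because it removes duplicates while preserving the order in which the strings were collected, which is what the ascending-order rule requires.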

In step S706, the entity type, entity, entity 2, attribute, and attribute value acquired in step S704, the table type acquired in step S102, and the table-related information acquired in step S106 (title, caption, section heading, text before the table, and text after the table) are stored as one record in the knowledge record table of the search database 28. In addition, the entity type row ID, entity type column ID, entity row ID, entity column ID, entity 2 row ID, entity 2 column ID, attribute row ID, attribute column ID, attribute value row ID, and attribute value column ID are stored as one record in the knowledge cell ID table of the search database 28.

FIG. 15 shows an example of in-table knowledge extraction when the table type class is "horizontal attribute".

Next, the case where the table type class is "matrix" is described. In the case of "matrix", the in-table knowledge extraction processing routine shown in FIG. 16 is executed.

In step S800, when the row type classification results of the row numbers i are all "header" in the range 1 ≤ i ≤ n, the range 1 ≤ i ≤ n is acquired as the header row number set. The set of row numbers that are not included in the header row number set and that satisfy 1 ≤ i ≤ N is acquired as the data row number set.

In step S802, when the column type classification results of the column numbers j are all "header" in the range 1 ≤ j ≤ m, the range 1 ≤ j ≤ m is acquired as the header column number set. The set of column numbers that are not included in the header column number set and that satisfy 1 ≤ j ≤ M is acquired as the data column number set. When the header column number set is empty, the process returns to step S100 of the search indexing processing routine, selects the next table tag, and continues the subsequent processing.
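Steps S800 and S802 apply the same rule to rows and to columns, which can be sketched with one helper. The sketch assumes that n (or m) is taken as the length of the longest leading run classified as header; the function name `split_header_and_data` and the label strings "header" and "data" are hypothetical.

```python
def split_header_and_data(types, header_label="header"):
    """Return (header_numbers, data_numbers), both 1-based.

    The header set is the leading run 1..n whose classification is
    `header_label`; every remaining number up to len(types) is data.
    """
    n = 0
    for label in types:
        if label != header_label:
            break
        n += 1
    header = list(range(1, n + 1))
    data = list(range(n + 1, len(types) + 1))
    return header, data


# Illustrative classification results for a 3-row, 4-column table.
row_types = ["header", "data", "data"]
col_types = ["header", "header", "data", "data"]
header_rows, data_rows = split_header_and_data(row_types)
header_cols, data_cols = split_header_and_data(col_types)
```

An empty `header_cols` result corresponds to the case in which the routine skips the table and moves on to the next table tag.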

Then, for each row number i included in the data row number set, the following per-column repetition is performed.

For each column number j included in the data column number set, the processing of the following steps S804 to S806 is repeated.

In step S804, in-table knowledge is extracted for the row number i included in the data row number set and the column number j included in the data column number set, and a record is stored in the search database 28. The in-table knowledge to be extracted consists of the following entity type, entity, entity 2, attribute, and attribute value.

The entity is obtained by acquiring the attribute "character string" of tabular data[i][h] for each column number h included in the header column number set, in ascending order of column number, removing duplicate character strings, and concatenating the remaining character strings with space characters. The entity row ID is i, and the entity column ID is the header column number set.

Entity 2 is obtained by acquiring the attribute "character string" of tabular data[h][j] for each row number h included in the header row number set, in ascending order of row number, removing duplicate character strings, and concatenating the remaining character strings with space characters. The entity 2 row ID is the header row number set, and the entity 2 column ID is j.

The attribute is set to whichever of the attribute values of the attributes "caption", "section heading", and "title" of the table-related information is not the null string. When more than one of these attribute values is not the null string, the caption is given the highest priority, followed by the section heading. The attribute row ID and the attribute column ID are each set to the null string.
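The priority rule for the attribute in the "matrix" case amounts to taking the first non-null value in a fixed order. A minimal sketch follows; the function name `choose_attribute` is hypothetical, and returning an empty string when all three values are null is an assumption, since the text does not describe that case.

```python
def choose_attribute(caption, section_heading, title):
    """Pick the attribute: caption first, then section heading, then title."""
    for value in (caption, section_heading, title):
        if value:  # skips None and the empty (null) string
            return value
    return ""  # assumption: nothing usable was found


# Illustrative values only.
attribute = choose_attribute("", "Weekly weather", "Weather site")
```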

The attribute value is the attribute "character string" of tabular data[i][j]. The attribute value row ID is i, and the attribute value column ID is j.

In step S806, the entity type, entity, entity 2, attribute, and attribute value acquired in step S804, the table type acquired in step S102, and the table-related information acquired in step S106 (title, caption, section heading, text before the table, and text after the table) are stored as one record in the knowledge record table of the search database 28. In addition, the entity type row ID, entity type column ID, entity row ID, entity column ID, entity 2 row ID, entity 2 column ID, attribute row ID, attribute column ID, attribute value row ID, and attribute value column ID are stored as one record in the knowledge cell ID table of the search database 28.

Steps S804 to S806 above are repeated for each column number j included in the data column number set.

The per-column repetition described above is in turn repeated for each row number i included in the data row number set.
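The nested repetition for the "matrix" case can be sketched as a double loop that emits one knowledge record and its cell IDs per data cell. This is an illustrative sketch under stated assumptions: the function names, the dictionary layout, the use of `None` for the null string in the cell IDs, and the sample weather table are all introduced here and do not come from the specification.

```python
def join_unique(strings):
    """Join strings with single spaces, keeping only the first occurrence of each."""
    return " ".join(dict.fromkeys(strings))


def extract_matrix(table, header_rows, header_cols, data_rows, data_cols, attribute):
    """Emit one record per data cell (i, j); all numbering is 1-based."""
    def cell(r, c):
        return table[r - 1][c - 1]

    records = []
    for i in data_rows:          # outer repetition over data rows
        for j in data_cols:      # inner repetition over data columns (S804 to S806)
            records.append({
                "entity": join_unique(cell(i, h) for h in header_cols),
                "entity2": join_unique(cell(h, j) for h in header_rows),
                "attribute": attribute,
                "attribute_value": cell(i, j),
                # (row ID, column ID) pairs for the knowledge cell ID table
                "cell_ids": {
                    "entity": (i, tuple(header_cols)),
                    "entity2": (tuple(header_rows), j),
                    "attribute": (None, None),
                    "attribute_value": (i, j),
                },
            })
    return records


# Illustrative data only.
table = [["", "Mon", "Tue"],
         ["Tokyo", "sunny", "rain"],
         ["Osaka", "cloudy", "sunny"]]
records = extract_matrix(table, [1], [1], [2, 3], [2, 3], "Weather")
```

Each data cell yields its own record, so a table with two data rows and two data columns produces four records.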

FIGS. 17 and 18 show examples of in-table knowledge extraction when the table type class is "matrix".

This concludes the description of the search indexing processing routine.

Next, the table cell search processing routine is described.

When the input unit 10 receives a search query in the form of a keyword set or a natural sentence, the table cell search device 100 executes the table cell search processing routine shown in FIG. 19.

In step S900, the query interpretation unit 40 assigns labels to the keywords in the search query received by the input unit 10, whether it is a keyword set or a natural sentence: one label for keywords that correspond to an attribute, and another for all other keywords.

The method of assigning labels differs depending on the content of the search query.

First, when the search query is given as a keyword set that matches a predefined keyword template (a keyword being a character string delimited by whitespace), the label "attribute" or "other" is assigned to each keyword according to the template definition.

Predefined keyword templates that can be used include

"(attribute)of(other)"
"(other)の(attribute)"

and the like. For example, for the keyword set 「山田の誕生日」 ("Yamada's birthday"), the label "other" is assigned to 「山田」 ("Yamada"), which is a keyword other than the attribute, and the label "attribute" is assigned to 「誕生日」 ("birthday"), which is the attribute keyword. Further predefined keyword templates may be added freely.
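Labeling with predefined keyword templates can be sketched with regular expressions. The specification does not say how the templates are matched, so treating each template as a regular expression with one group per label is an assumption, as are the names `KEYWORD_TEMPLATES` and `label_by_keyword_template`.

```python
import re

# Assumed regex forms of the two templates quoted in the text.
KEYWORD_TEMPLATES = [
    re.compile(r"^(?P<attribute>.+?)\s*of\s*(?P<other>.+)$"),  # (attribute)of(other)
    re.compile(r"^(?P<other>.+?)の(?P<attribute>.+)$"),         # (other)の(attribute)
]


def label_by_keyword_template(query):
    """Return [(keyword, label), ...] for the first matching template, else None."""
    for pattern in KEYWORD_TEMPLATES:
        m = pattern.match(query)
        if m:
            return [(m.group("other"), "other"),
                    (m.group("attribute"), "attribute")]
    return None


labeled = label_by_keyword_template("山田の誕生日")
```

A return value of `None` signals that no template matched, in which case a fallback labeling method is needed.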

Next, when the search query is given as a keyword set that does not match the predefined keyword template set, the "attribute" field of the knowledge record table of the search database is searched with each keyword. The keyword with the largest number of search hits is given the label "attribute", and the remaining keywords are given the label "other".
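The hit-count fallback can be sketched as follows. Substring containment stands in for the full-text search of the "attribute" field, and ties are broken in favour of the earlier keyword; both choices are assumptions, as is the function name `label_by_hit_count`.

```python
def label_by_hit_count(keywords, attribute_field_values):
    """Label the keyword with the most hits in the attribute field as "attribute"."""
    if not keywords:
        return []
    # Number of stored attribute strings that contain each keyword.
    hits = {k: sum(k in value for value in attribute_field_values) for k in keywords}
    best = max(keywords, key=lambda k: hits[k])  # first keyword wins on ties
    return [(k, "attribute" if k == best else "other") for k in keywords]


# Illustrative contents of the "attribute" field across knowledge records.
attribute_values = ["birthday", "birthplace", "birthday height"]
labeled = label_by_hit_count(["Yamada", "birthday"], attribute_values)
```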

When the search query is given as a natural sentence that matches a predefined natural-sentence template, a label is assigned to each keyword in the natural sentence according to the template definition. Predefined natural-sentence templates that can be used include

「(.+:other)の(.+:attribute)はいつですか」 ("When is the (attribute) of (other)?")

and the like. Here, each parenthesized group in a template has the form (regular expression pattern:label name).

For example, the natural sentence 「山田の誕生日はいつですか」 ("When is Yamada's birthday?") matches the above template by regular expression matching, so the label "other" can be assigned to 「山田」 ("Yamada") and the label "attribute" to 「誕生日」 ("birthday"). After that, among the morphemes obtained by morphological analysis of the natural sentence, those that did not match the template are given the label "other", which yields a labeled keyword set.
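A template in the (regular expression pattern:label name) notation can be compiled into an ordinary regular expression plus a list of labels. The sketch below assumes that the patterns inside the parentheses contain no parentheses or colons of their own, and it orders the groups so that the result agrees with the worked example (「山田」 labeled "other", 「誕生日」 labeled "attribute"). The function names are hypothetical, and the later morphological-analysis step is not covered.

```python
import re

GROUP = re.compile(r"\(([^:()]+):([^()]+)\)")  # matches "(pattern:label)"


def compile_template(template):
    """Turn a "(pattern:label)" template into (compiled regex, [labels])."""
    parts, labels, pos = [], [], 0
    for m in GROUP.finditer(template):
        parts.append(re.escape(template[pos:m.start()]))  # literal text between groups
        parts.append("(" + m.group(1) + ")")               # the embedded pattern
        labels.append(m.group(2))
        pos = m.end()
    parts.append(re.escape(template[pos:]))
    return re.compile("^" + "".join(parts) + "$"), labels


def label_by_sentence_template(sentence, template):
    """Return [(keyword, label), ...] if the sentence matches, else None."""
    regex, labels = compile_template(template)
    m = regex.match(sentence)
    return list(zip(m.groups(), labels)) if m else None


TEMPLATE = "(.+:other)の(.+:attribute)はいつですか"
labeled = label_by_sentence_template("山田の誕生日はいつですか", TEMPLATE)
```

Keeping the labels in a separate list, rather than as named groups, lets a template use the same label more than once.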

Next, when the search query is given as a natural sentence that does not match any predefined natural-sentence template, the natural sentence is morphologically analyzed and converted into a keyword set, and the processing described above for a keyword set that does not match the predefined keyword template set is then performed to obtain a labeled keyword set.

In addition, for each keyword in the labeled keyword set that matches an entry in a synonym dictionary, its synonyms are given the same label as the matched keyword and added to the labeled keyword set. For example, when the labeled keyword set is "山田: other" and "誕生日: attribute", and 「誕生日」 ("birthday") and 「バースデイ」 (the loanword for "birthday") are registered as synonyms in the synonym dictionary, the labeled keyword set becomes "山田: other", "誕生日: attribute", and "バースデイ: attribute".
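The synonym expansion can be sketched as below. Representing the synonym dictionary as a list of synonym groups is an assumption, as is the function name `expand_with_synonyms`.

```python
def expand_with_synonyms(labeled, synonym_groups):
    """Add every synonym of a labeled keyword, carrying over that keyword's label."""
    expanded = list(labeled)
    seen = {keyword for keyword, _ in labeled}
    for keyword, label in labeled:
        for group in synonym_groups:
            if keyword not in group:
                continue
            for synonym in group:
                if synonym not in seen:  # do not add a keyword twice
                    expanded.append((synonym, label))
                    seen.add(synonym)
    return expanded


labeled = [("山田", "other"), ("誕生日", "attribute")]
synonym_groups = [("誕生日", "バースデイ")]
expanded = expand_with_synonyms(labeled, synonym_groups)
```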

In step S902, the search database 28 is searched using the labeled keyword set obtained in step S900, and knowledge records are acquired. Specifically, the search is performed as follows.

Knowledge records are acquired when any of the keywords labeled "attribute" matches a full-text search of the "attribute" field of the knowledge record table of the search database 28.

In addition, knowledge records are acquired when any of the keywords labeled "other" matches a full-text search of the "entity" field of the knowledge record table.

Knowledge records are also acquired when any of the keywords labeled "other" matches a full-text search of a field other than the "entity" field. Specifically, a knowledge record is acquired when it matches a full-text search of any of the "entity type", "title", "section heading", "caption", "text before table", or "text after table" fields of the knowledge record table. For an acquired knowledge record whose "table type" is "matrix", it is determined whether any of the keywords labeled "other" matches the "entity 2" field in a full-text search; if none matches, the record is removed from the acquired knowledge records.
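The search conditions of step S902 can be sketched over an in-memory list of records. Several points are assumptions rather than statements of the specification: substring containment stands in for full-text search, the attribute condition and the "other" condition are combined with a logical AND (one reading of the description and claims), the "entity 2" check is applied to every acquired record of type "matrix", and all field names and sample records are hypothetical.

```python
CONTEXT_FIELDS = ("entity", "entity_type", "title", "section_heading",
                  "caption", "text_before", "text_after")


def matches(keywords, text):
    """True if any keyword occurs in the text (stand-in for full-text search)."""
    return any(k in (text or "") for k in keywords)


def search(records, labeled):
    """Return the knowledge records that satisfy the labeled query."""
    attrs = [k for k, label in labeled if label == "attribute"]
    others = [k for k, label in labeled if label == "other"]
    hits = []
    for rec in records:
        if not matches(attrs, rec.get("attribute")):
            continue  # an "attribute" keyword must hit the attribute field
        if not any(matches(others, rec.get(f)) for f in CONTEXT_FIELDS):
            continue  # an "other" keyword must hit the entity or a context field
        if rec.get("table_type") == "matrix" and not matches(others, rec.get("entity2")):
            continue  # matrix records additionally need a hit on entity 2
        hits.append(rec)
    return hits


# Illustrative records only.
records = [
    {"table_type": "horizontal_attribute", "entity": "Yamada",
     "attribute": "birthday", "attribute_value": "January 1"},
    {"table_type": "horizontal_attribute", "entity": "Suzuki",
     "attribute": "birthday", "attribute_value": "February 2"},
    {"table_type": "matrix", "entity": "Yamada", "entity2": "2015",
     "attribute": "birthday party", "attribute_value": "held"},
]
hits = search(records, [("Yamada", "other"), ("birthday", "attribute")])
answers = [rec["attribute_value"] for rec in hits]
```

The final list comprehension collects the "attribute value" field of each hit, which is the part of the record that answers the query.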

In step S904, the set of "attribute value" fields of the knowledge records acquired in step S902 is output to the output unit 50 as the search result. As in the example shown in FIG. 20, the records obtained by searching the knowledge cell ID table of the search database 28 for the "knowledge ID" field of each acquired knowledge record may also be included in the search result.

As described above, the table cell search device according to the embodiment of the present invention, for each piece of tabular data, extracts table-related information related to the tabular data from the HTML document containing the tabular data, classifies the table type of the tabular data, and classifies the type of each row and each column. Based on the table-related information and the classification results, it extracts cell knowledge composed of a pair of in-table information, including the entity, attribute, and attribute value extracted from the tabular data, and the table-related information, and stores the extracted cell knowledge in the search database. For a keyword set or natural sentence given as a search query, it assigns labels to the keywords corresponding to attributes in the search query, outputs the cell knowledge corresponding to the search query from the search database based on the labeled search query, and returns a search result to the user based on that output. In this way, cell knowledge can be acquired using knowledge from outside the table, and an answer to the search query can be retrieved.

Furthermore, according to the technique of the embodiment of the present invention, information composed of the entities, attributes, and attribute values contained in a table of tabular data is supplemented with information from outside the table and extracted as knowledge records, which makes it possible to answer search requests directly. The technique can therefore be used for information retrieval, question answering, dialogue systems, and the like.

The present invention is not limited to the embodiment described above, and various modifications and applications are possible without departing from the gist of the invention.

DESCRIPTION OF SYMBOLS
10 Input unit
20 Computation unit
28 Search database
30 Tabular data extraction unit
32 Table type classification unit
34 Matrix type classification unit
36 Table-related information extraction unit
38 In-table knowledge extraction unit
40 Query interpretation unit
42 Knowledge search unit
44 Search result generation unit
50 Output unit
100 Table cell search device

Claims

A table cell search device that extracts and indexes knowledge contained in tabular data included in HTML documents, and returns table cells that can directly answer a search query given as a keyword set or a natural sentence, ranked by search score, the table cell search device including:
a tabular data extraction unit that acquires, from a set of the HTML documents, a tabular data set composed of tabular data described by table tags;
a table-related information extraction unit that, for each of the tabular data, extracts table-related information related to the tabular data from the HTML document containing the tabular data;
a table type classification unit that, for each of the tabular data, classifies the tabular data, based on the structure and contents of the tabular data, into one of the predetermined table types: vertical list, horizontal list, vertical attribute, horizontal attribute, and matrix;
a matrix type classification unit that, for each of the tabular data, classifies the type of each row and each column included in the table of the tabular data, based on the structure and contents of the tabular data;
an in-table knowledge extraction unit that, for each of the tabular data, based on the table-related information extracted by the table-related information extraction unit, the classification result of the table type classification unit, and the classification result of the matrix type classification unit, extracts, in accordance with an extraction rule predetermined according to the table type, cell knowledge composed of a pair of in-table information, including an entity, an attribute, and an attribute value extracted from the tabular data, and the table-related information, and stores the extracted cell knowledge in a search database;
a query interpretation unit that, for a keyword set or a natural sentence given as a search query, assigns a label to a keyword corresponding to an attribute in the search query;
a knowledge search unit that outputs the cell knowledge corresponding to the search query from the search database, based on the search query labeled by the query interpretation unit; and
a search result generation unit that returns a search result to a user based on the output of the knowledge search unit.

The table cell search device according to claim 1, wherein
the table-related information extraction unit extracts, for each of the tabular data, a page title, a caption of the table, a heading of the section in which the table appears, text immediately before the table, or text immediately after the table from the HTML document as the table-related information, and
the in-table knowledge extraction unit, for each of the tabular data, based on the table-related information extracted by the table-related information extraction unit, the classification result of the table type classification unit, and the classification result of the matrix type classification unit, extracts from the tabular data, in accordance with the extraction rule predetermined according to the table type, cell knowledge composed of a pair of in-table information, including an entity, an entity type, an attribute, and an attribute value in the tabular data, and the table-related information, and stores the extracted cell knowledge in the search database.

The table cell search device according to claim 1, wherein
the query interpretation unit further assigns a label to keywords other than the attribute in the search query, and
the knowledge search unit outputs, from the search database, the cell knowledge in which the keyword corresponding to the attribute of the search query matches the attribute of the in-table information, and in which a keyword other than the attribute of the search query matches the entity of the in-table information, the entity type, or the table-related information.

A table cell search method in a table cell search device that extracts and indexes knowledge contained in tabular data included in HTML documents, and returns table cells that can directly answer a search query given as a keyword set or a natural sentence, ranked by search score, the table cell search method including:
a step in which a tabular data extraction unit acquires, from a set of the HTML documents, a tabular data set composed of tabular data described by table tags;
a step in which a table type classification unit, for each of the tabular data, classifies the tabular data, based on the structure and contents of the tabular data, into one of the predetermined table types: vertical list, horizontal list, vertical attribute, horizontal attribute, and matrix;
a step in which a matrix type classification unit, for each of the tabular data, classifies the type of each row and each column included in the table of the tabular data, based on the structure and contents of the tabular data;
a step in which a table-related information extraction unit, for each of the tabular data, extracts table-related information related to the tabular data from the HTML document containing the tabular data;
a step in which an in-table knowledge extraction unit, for each of the tabular data, based on the table-related information extracted by the table-related information extraction unit, the classification result of the table type classification unit, and the classification result of the matrix type classification unit, extracts, in accordance with an extraction rule predetermined according to the table type, cell knowledge composed of a pair of in-table information, including an entity, an attribute, and an attribute value extracted from the tabular data, and the table-related information, and stores the extracted cell knowledge in a search database;
a step in which a query interpretation unit, for a keyword set or a natural sentence given as a search query, assigns a label to a keyword corresponding to an attribute in the search query;
a step in which a knowledge search unit outputs the cell knowledge corresponding to the search query from the search database, based on the search query labeled by the query interpretation unit; and
a step in which a search result generation unit returns a search result to a user based on the output of the knowledge search unit.

The table cell search method according to claim 4, wherein
in the step of extracting by the table-related information extraction unit, for each of the tabular data, a page title, a caption of the table, a heading of the section in which the table appears, text immediately before the table, or text immediately after the table is extracted from the HTML document as the table-related information, and
in the step of extracting by the in-table knowledge extraction unit, for each of the tabular data, based on the table-related information extracted by the table-related information extraction unit, the classification result of the table type classification unit, and the classification result of the matrix type classification unit, cell knowledge composed of a pair of in-table information, including an entity, an entity type, an attribute, and an attribute value in the tabular data, and the table-related information is extracted from the tabular data in accordance with the extraction rule predetermined according to the table type, and the extracted cell knowledge is stored in the search database.

The table cell search method according to claim 4 or 5, wherein
in the step of interpreting by the query interpretation unit, a label is further assigned to keywords other than the attribute in the search query, and
in the step of searching by the knowledge search unit, the cell knowledge in which the keyword corresponding to the attribute of the search query matches the attribute of the in-table information, and in which a keyword other than the attribute of the search query matches the entity of the in-table information, the entity type, or the table-related information, is output from the search database.

A program for causing a computer to function as each unit of the table cell search device according to any one of claims 1 to 3.