JP6998282B2

JP6998282B2 - Information processing equipment, information processing methods, and programs

Info

Publication number: JP6998282B2
Application number: JP2018174410A
Authority: JP
Inventors: 賢太郎西; 雄貴俵; 将平川崎; 拓也門脇; 康之田中
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2018-09-19
Filing date: 2018-09-19
Publication date: 2022-01-18
Anticipated expiration: 2038-09-19
Also published as: JP7434493B2; JP2022191487A; JP2021192232A; JP2020046896A

Description

The present invention relates to an information processing apparatus, an information processing method, and a program.

Conventionally, crawlers are known that collect data (documents, images, and the like) from the web and automatically build a database from the collected data (see Patent Document 1). Such a crawler follows links in web pages and collects data from web pages at various IP addresses. The data collected by the crawler is accumulated in a web information database.

Japanese Unexamined Patent Publication No. 2012-69171

However, the conventional technique described above cannot always acquire useful information efficiently.

The present invention has been made in view of these circumstances, and one of its objects is to provide an information processing apparatus, an information processing method, and a program capable of acquiring useful information more efficiently.

One aspect of the present invention is an information processing apparatus comprising: an acquisition unit that acquires information of a web page; a recognition unit that refers to a knowledge database including a plurality of entities and relationship information between the entities, and recognizes, in the web page acquired by the acquisition unit, a first expression pattern, which is an expression including a first main entity included in the knowledge database and a first dependent entity subordinate to the first main entity; an extraction unit that extracts, from the web page, a second expression pattern that matches the first expression pattern and includes a second main entity, the second main entity being included in the knowledge database but not associated with a second dependent entity of the same type as the first dependent entity that should be associated with it; and a providing unit that provides information based on the second expression pattern to the knowledge database in order to expand the knowledge database.

According to one aspect of the present invention, useful information can be acquired more efficiently.

A diagram showing an example of the functional configuration of the information processing system 1.
A diagram schematically showing part of the knowledge database 42.
A diagram showing an example of a knowledge panel that the knowledge database device 30 provides to the terminal device 10.
A flowchart showing an example of the flow of processing executed by the determination unit 106 of the collection device 100.
A diagram showing an example of the information of a sampling web page selected in S10.
A diagram showing an example of the combinations of entities included in the entity information 134.
A diagram showing an example of the contents of the determination information 136.
A flowchart showing an example of the flow of the unknown-information extraction processing executed by the collection device 100.
A diagram (Part 1) for explaining the processing in which the recognition unit 110 recognizes a description pattern.
A diagram (Part 2) for explaining the processing in which the recognition unit 110 recognizes a description pattern.
A diagram showing an example of an outline of the processing.
A diagram showing an example of the contents of the knowledge database 42 before the update.
A diagram showing an example of the contents of the knowledge database 42 after the update.
A diagram showing an example of the knowledge panel NP1 generated based on the knowledge database 42 before the update.
A diagram showing an example of a knowledge panel generated based on the knowledge database 42 after the update.
A diagram showing an example of the functional configuration of the collection device 100A of the information processing system 1A of the second embodiment.
A diagram showing an example of the contents of the extraction information with reliability 140.
A diagram showing an example of the tendency of the integrated score for each combination of known ratios.

Embodiments of the information processing apparatus, information processing method, and program of the present invention are described below with reference to the drawings.

[Overview (Part 1)]
The information processing apparatus is realized by one or more processors. The information processing apparatus of the embodiment acquires information of a web page, refers to a knowledge database including a plurality of entities and relationship information between the entities, and recognizes, in the acquired web page, a first expression pattern, which is an expression including a first main entity included in the knowledge database and a first dependent entity subordinate to the first main entity. The apparatus then extracts, from the web page, a second expression pattern that matches the first expression pattern and includes a second main entity, that is, an entity that is included in the knowledge database but is not yet associated with a second dependent entity of the same type as the first dependent entity, even though such an entity should be associated with it. To expand the knowledge database, the apparatus provides information based on the second expression pattern to the knowledge database. An "expression pattern" is, for example, a description pattern in the language used to generate the web page.

The knowledge database describes information about entities and information about the semantic relationships between entities. An entity represents the substance or concept of a thing. For example, when a query is entered and the query corresponds to an entity, richer information can be returned to the user than with a simple keyword search.

The things described in the knowledge database are defined by an ontology. An ontology defines the classes and properties of things and collects the constraints that hold between classes and properties.

A class is information indicating an attribute of an entity. In an ontology, a class is a group of things that share the same characteristics. What characteristics a thing has, that is, which class it belongs to, is determined by the properties described below.

For example, a thing that has a beak, is an oviparous vertebrate, and has forelimbs that are wings is classified into the class "bird". Within the class "bird", a thing that cannot fly is classified into a lower class such as "penguin" or "ostrich". In this way, the class system may have a hierarchical structure with higher and lower relationships.

A property is an attribute that describes the nature or characteristics of a thing, or a relationship between classes. For example, a property may be an attribute indicating a characteristic such as "has ... as a component of its body" or "lives in ...", or an attribute indicating a higher-lower relationship between classes, such as "one class is the higher class and another class is the lower class". As with a class name, the property name that identifies a property may or may not itself convey meaning.
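The class hierarchy and properties described above could be modeled roughly as follows. This is only an illustrative sketch: the names `OntologyClass`, `all_properties`, and `is_subclass_of` are hypothetical, and the idea that a lower class inherits the properties of its higher class is an assumption made for the example, not something the specification states.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class OntologyClass:
    """A class groups things that share the same characteristics (hypothetical model)."""
    name: str
    parent: Optional["OntologyClass"] = None
    # Properties that characterize members of this class, e.g. "has a beak".
    properties: set = field(default_factory=set)

    def all_properties(self) -> set:
        """Own properties plus those of every higher class (assumed inheritance)."""
        inherited = self.parent.all_properties() if self.parent else set()
        return inherited | self.properties

    def is_subclass_of(self, other: "OntologyClass") -> bool:
        """True if `other` is this class or one of its higher classes."""
        node = self
        while node is not None:
            if node is other:
                return True
            node = node.parent
        return False


bird = OntologyClass(
    "bird",
    properties={"has a beak", "oviparous vertebrate", "forelimbs are wings"},
)
penguin = OntologyClass("penguin", parent=bird, properties={"cannot fly"})
```

Here `penguin.is_subclass_of(bird)` holds, which mirrors the higher-lower relationship between "bird" and "penguin" in the example above.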

[Overview (Part 2)]
The information processing apparatus acquires information of a web page, refers to the knowledge database, and, based on the degree to which the acquired web page contains main entities and dependent entities subordinate to those main entities, determines whether the web page is to be a target for extracting a fourth dependent entity, of the same type as those dependent entities, that should be associated with a fourth main entity in the knowledge database.

<First Embodiment>
[Configuration]
FIG. 1 is a diagram showing an example of the functional configuration of the information processing system 1. The information processing system 1 includes, for example, a terminal device 10, one or more hosts 20 (20-1 to 20-3 in the figure), a knowledge database device 30, a search device 50, and a collection device 100. The terminal device 10, the hosts 20, and the search device 50 communicate with one another via a network NW. The knowledge database device 30, the search device 50, and the collection device 100 also communicate with one another via the network NW. The network NW includes, for example, a WAN (Wide Area Network), a LAN (Local Area Network), the Internet, a dedicated line, a wireless base station, and a provider.

The terminal device 10 is a device used by a user, such as a desktop terminal, a portable terminal such as a notebook computer, a smartphone, or a tablet terminal. Each host 20 is a web server that provides web pages.

The knowledge database device 30 is, for example, a server that generates the knowledge database 42 based on predetermined data (for example, image or text data) and provides the knowledge panel described later.

The knowledge database 42 is stored in the storage unit 40 of the knowledge database device 30. FIG. 2 is a diagram schematically showing part of the knowledge database 42. As shown in FIG. 2, each entity is associated with entity identification information (for example, "E1" to "E7"), an entity name (for example, "A Aquarium"), a class (for example, "CL01"), and other information related to the entity (not shown). Each edge indicating a relationship between entities is associated with a property. In the example of FIG. 2, the associated properties include the official site, the address, and the business hours.
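One way to picture this structure is as a small labeled graph. The following is a minimal sketch under stated assumptions: the `Entity` and `Edge` types, the `dependents` helper, the class IDs "CL02" and "CL03", and all concrete values are hypothetical and are not taken from FIG. 2.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    entity_id: str  # entity identification information, e.g. "E1"
    name: str       # entity name, e.g. "A Aquarium"
    class_id: str   # class, e.g. "CL01"


@dataclass(frozen=True)
class Edge:
    source_id: str  # entity the edge starts from
    prop: str       # property attached to the edge, e.g. "official site"
    target_id: str  # entity the edge points to


# Illustrative values only (assumptions for this sketch).
entities = {
    "E1": Entity("E1", "A Aquarium", "CL01"),
    "E2": Entity("E2", "https://a-aquarium.example", "CL02"),
    "E3": Entity("E3", "1-2-3 Example Town", "CL03"),
}
edges = [
    Edge("E1", "official site", "E2"),
    Edge("E1", "address", "E3"),
]


def dependents(entity_id: str) -> dict:
    """Map each property of an entity to the name of the entity it points to."""
    return {
        e.prop: entities[e.target_id].name
        for e in edges
        if e.source_id == entity_id
    }
```

With this layout, a property that has no edge for a given entity is simply absent from the result of `dependents`, which is the kind of gap the later extraction processing is meant to fill.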

In the present embodiment, E2 to E7 in FIG. 2 are represented as entities, but they may instead simply be information associated with the entity E1.

FIG. 3 is a diagram showing an example of a knowledge panel that the knowledge database device 30 provides to the terminal device 10. For example, when a user operates the terminal device 10 to enter a query in a search box on a portal site or the like and requests the search device 50 to search for information related to the query, the search device 50 refers to the information to be searched and retrieves information matching the query. The search device 50 also requests the knowledge database device 30 to provide a knowledge panel related to the query.

The knowledge database device 30 refers to the knowledge database 42, acquires information corresponding to the query, generates a knowledge panel based on the acquired information, and provides the generated knowledge panel to the search device 50. The search device 50 generates the source data of a screen image that includes the search results and the knowledge panel, and provides the generated data to the terminal device 10. For example, when the search query is "A Aquarium", an image including a list of web pages about A Aquarium and the knowledge panel NP for A Aquarium is displayed on the display unit of the user's terminal device 10, as shown in FIG. 3.
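The exchange between the search device 50 and the knowledge database device 30 might be sketched as below. This is a simplified illustration, not the implementation in the specification: `KNOWLEDGE_DB`, `build_knowledge_panel`, `build_result_page`, and the sample values are all hypothetical.

```python
from typing import Optional

# Hypothetical in-memory stand-in for the knowledge database 42:
# main entity name -> {property: dependent entity value}.
KNOWLEDGE_DB = {
    "A Aquarium": {
        "official site": "https://a-aquarium.example",
        "business hours": "9:00-17:00",
    },
}


def build_knowledge_panel(query: str) -> Optional[dict]:
    """Return panel data when the query names a known entity, else None."""
    facts = KNOWLEDGE_DB.get(query)
    if facts is None:
        return None  # the query does not correspond to an entity
    return {"title": query, "rows": sorted(facts.items())}


def build_result_page(query: str, web_results: list) -> dict:
    """Combine ordinary search results with the panel for the same query."""
    return {"results": web_results, "panel": build_knowledge_panel(query)}
```

A panel can only show the properties that are present in the database, so a main entity with missing dependent entities yields a sparser panel.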

In the following description, an entity that is the subject of a knowledge panel, such as the entity E1 "A Aquarium", may be referred to as a "main entity", and an entity carrying information that supplements or accompanies the subject (business hours, address, official site, and so on), such as the entities E2 to E7, may be referred to as a "dependent entity".

Returning to FIG. 1, the knowledge database device 30 includes, for example, a communication unit 22, an information management unit 24, an information processing unit 26, and a storage unit 40. The communication unit 22 includes a communication interface such as a network interface card (NIC). The information management unit 24 provides information generated by the knowledge database device 30 to other devices and manages information provided by other devices. The information processing unit 26 generates a knowledge panel in response to a request from the search device 50 and updates the knowledge database 42 using information provided by the collection device 100.

[Collection device]
The collection device 100 includes, for example, a communication unit 102, a collection unit 104, a determination unit 106, a target information acquisition unit 108, a recognition unit 110, an extraction unit 112, an identification unit 114, a providing unit 116, and a storage unit 130. The collection unit 104, the determination unit 106, the target information acquisition unit 108, the recognition unit 110, the extraction unit 112, the identification unit 114, and the providing unit 116 are realized by a hardware processor such as a CPU (Central Processing Unit) executing a program stored in a storage device. These functional units may also be realized by hardware such as an LSI (Large Scale Integration), an ASIC (Application Specific Integrated Circuit), an FPGA (Field-Programmable Gate Array), or a GPU (Graphics Processing Unit), or by cooperation between software and hardware. The program may be stored in the storage device in advance, or may be stored on a removable storage medium such as a DVD or CD-ROM and installed in the storage device when the storage medium is mounted in a drive device of the collection device 100.

The storage unit 130 is realized by, for example, a ROM (Read Only Memory), a flash memory, an SD card, a RAM (Random Access Memory), an HDD (Hard Disk Drive), or a register. Part or all of the storage unit 130 may be a NAS (Network Attached Storage), an external storage server device, or the like. The storage unit 130 stores, for example, collected information 132, entity information 134, determination information 136, and extraction information 138. Details of this information are described later.

The communication unit 102 communicates with the hosts 20, the knowledge database device 30, or the search device 50 via the network NW. The communication unit 102 includes, for example, a communication interface such as a NIC (Network Interface Card).

The collection unit 104 collects information from the hosts 20 according to a predetermined protocol and stores the collected information in the storage unit 130 as the collected information 132. For example, the collection unit 104 collects a small number of pages (hereinafter, sampling web pages) from each host 20 and stores the collected sampling web pages in the storage unit 130 as the collected information 132.

The determination unit 106 refers to the knowledge database 42 and, based on the degree to which a web page acquired by the collection unit 104 (for example, a sampling web page) contains main entities and dependent entities subordinate to those main entities, determines whether the host 20 that provides the web page is to be a target from which uncollected web pages other than the sampling web pages are preferentially collected.

The target information acquisition unit 108 acquires information of a web page from the collected information 132.

The recognition unit 110 refers to the knowledge database 42, which includes a plurality of entities and relationship information between the entities, and recognizes, in the web page acquired by the target information acquisition unit 108, a first expression pattern, which is an expression including a first main entity included in the knowledge database 42 and a first dependent entity subordinate to the first main entity.

The extraction unit 112 extracts, from the web page, a second expression pattern that matches the first expression pattern and includes a second main entity, that is, an entity that is included in the knowledge database 42 but is not associated with a second dependent entity of the same type as the first dependent entity that should be associated with it.

The identification unit 114 identifies the second dependent entity in the second expression pattern extracted by the extraction unit 112, based on the relative relationship between the first main entity and the first dependent entity in the first expression pattern.

The providing unit 116 provides information based on the second expression pattern to the knowledge database device 30 in order to expand the knowledge database 42.

The processing for expanding the information provided in the knowledge panel is described below.

[Flowchart (Part 1)]
FIG. 4 is a flowchart showing an example of the flow of processing executed by the determination unit 106 of the collection device 100. Details of the processing are described with reference to FIGS. 5 to 7, discussed later.

First, the determination unit 106 selects, from the collected information 132, one or more sampling web pages provided by the same host (S10). Next, the determination unit 106 refers to the entity information 134 and selects one of the entity-to-entity combinations that are associated by a property (S12). The entity information 134 is the same information as the knowledge database 42, or part of the information in the knowledge database 42 as shown in FIG. 2.

Next, the determination unit 106 determines whether the selected combination is included in the selected sampling web page (S14). The determination unit 106 then stores the determination result in the storage unit 130 as the determination information 136 (S16).

Next, the determination unit 106 determines whether all the entity-to-entity combinations have been selected (S18). If not all combinations have been selected, the processing returns to step S12.

When all combinations have been selected, the determination unit 106 determines whether all the sampling web pages have been selected (S20). If not all sampling web pages have been selected, the processing returns to step S10.

When all the sampling web pages have been selected, the determination unit 106 determines the hosts 20 to be deep-dig targets based on the determination information 136, which holds the determination results (S22). Next, the collection unit 104 collects, from each host 20 determined to be a deep-dig target, the web pages that have not yet been collected (web pages other than the sampling web pages), and stores the collected information in the storage unit 130 as the collected information 132 (S24). In other words, the collection unit 104 performs deep digging on the useful hosts (deep-dig target hosts) determined from a small number of collection results. The processing of this flowchart then ends.

At a predetermined timing, the collection unit 104 also collects uncollected information held by the hosts 20 that were not chosen as deep-dig targets, and stores the collected information in the storage unit 130 as the collected information 132.
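The split between pages collected preferentially and pages collected at a later timing could be sketched as follows. This is an assumption-laden illustration: the function `plan_collection`, its parameters, and the idea that the page URLs of each host are already known are hypothetical, and actual fetching is omitted.

```python
def plan_collection(site_pages, collected, deep_dig_hosts):
    """Split uncollected pages into a priority batch and a deferred batch.

    site_pages: host ID -> list of page URLs known for that host (assumed known)
    collected: set of URLs already stored as collected information
    deep_dig_hosts: set of host IDs chosen as deep-dig targets (S22)
    """
    priority, deferred = [], []
    for host, pages in site_pages.items():
        uncollected = [p for p in pages if p not in collected]
        # S24: deep-dig hosts are collected first; the others wait for a later timing.
        (priority if host in deep_dig_hosts else deferred).extend(uncollected)
    return priority, deferred
```

For example, with `{"host-1": ["p1", "p2", "p3"], "host-2": ["q1", "q2"]}`, already-collected pages `{"p1", "q1"}`, and `{"host-1"}` as the deep-dig target, the priority batch is `["p2", "p3"]` and the deferred batch is `["q2"]`.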

In the flowchart example described above, the determination unit 106 selects one of the entity-to-entity combinations associated by a property (S12) and determines whether the selected combination is included in the selected sampling web page. Instead, the processing may be performed as follows.
(1) The determination unit 106 lists the entities to be extracted (for example, C Museum in FIGS. 5 and 6, described later).
(2) The determination unit 106 determines whether the sampling web page contains an entity to be extracted.
(3) If an entity to be extracted is contained, the determination unit 106 lists the entities in the knowledge database 42 (for example, the official sites of A Museum and B Museum in FIGS. 5 and 6) that are associated, through the property to be extracted (for example, the official site in FIGS. 5 and 6), with the entities found in the web page (for example, A Museum and B Museum in FIGS. 5 and 6).
(4) The determination unit 106 determines whether the entities associated through the property to be extracted are contained in the web page. Based on the determination result, the determination unit 106 then determines whether the host 20 that provides the web page is to be a deep-dig target.
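Steps (1) to (4) can be sketched as follows. This is a rough illustration under stated assumptions: entity matching is done by plain substring search, the function name `host_is_deep_dig_target` is hypothetical, and the `threshold` parameter is an assumption, since the text only says the decision is made based on the determination result.

```python
def host_is_deep_dig_target(page_text, targets, knowledge_db, prop, threshold=1):
    """Decide whether the host of a sampling page should be a deep-dig target.

    targets: names of entities to be extracted (step 1)
    knowledge_db: entity name -> {property: associated entity}
    prop: the property to be extracted, e.g. "official site"
    threshold: assumed minimum number of known associations found on the page
    """
    # (2) The page must mention at least one entity to be extracted.
    if not any(t in page_text for t in targets):
        return False
    # (3) List the entities associated, via `prop`, with known entities on the page.
    expected = [
        props[prop]
        for name, props in knowledge_db.items()
        if name in page_text and prop in props
    ]
    # (4) Check how many of those associated entities also appear on the page.
    hits = sum(1 for value in expected if value in page_text)
    return hits >= threshold


KNOWLEDGE_DB = {
    "A Museum": {"official site": "URL001"},
    "B Museum": {"official site": "URL002"},
    "C Museum": {},  # the official site is not yet associated
}
PAGE = "A Museum URL001 / B Museum URL002 / C Museum URL003"
```

Calling `host_is_deep_dig_target(PAGE, ["C Museum"], KNOWLEDGE_DB, "official site")` returns `True` here, because the page mentions the target entity and also contains both known name and official-site pairs.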

FIG. 5 is a diagram showing an example of the information of a sampling web page selected in S10. Suppose, for example, that the sampling web page contains the names of tourist spots and their URLs: "A Museum", "URL001", "B Museum", "URL002", "C Museum", and "URL003".

FIG. 6 is a diagram showing an example of the combinations of entities included in the entity information 134. For example, "A Museum" and "URL001" are associated by the property "official site", and "B Museum" and "URL002" are associated by the property "official site". The entity information 134 also includes the entity "C Museum", but "URL003" is not associated with "C Museum". When the facility name "A Museum" (first main entity) is associated with "URL001" (first dependent entity) through the relationship of the facility's URL (property), the facility name "C Museum", which is not associated with "URL003" (second dependent entity) through the same relationship, is an example of a "second main entity".

FIG. 7 is a diagram showing an example of the contents of the determination information 136. The determination information 136 associates, for each host ID, the entity combinations, a score, and information indicating the result of determining whether the host is a deep-dig target. In the processing of S12 to S18 in the flowchart of FIG. 4 described above, the combination of "A Museum" and "URL001" and the combination of "B Museum" and "URL002" are determined to be included in the information of the selected sampling web page. When two combinations are included in the information of the sampling web page in this way, the determination unit 106 sets the score to "2", for example. The determination unit 106 then determines, for example, that a host 20 that provided a sampling web page with a score of "2" or more is a deep-dig target host.
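The scoring in this example can be sketched as a simple count. The code below is an illustrative sketch only: matching by substring, the host IDs "host-1" and "host-2", and the function names are assumptions, while the known pairs and the threshold of 2 follow the example in the text.

```python
def score_page(page_text, known_pairs):
    """Count the known (main, dependent) combinations contained in a page (S12 to S18)."""
    return sum(
        1 for main, dep in known_pairs
        if main in page_text and dep in page_text
    )


def select_deep_dig_hosts(sample_pages, known_pairs, threshold=2):
    """Return hosts that provided at least one sampling page scoring >= threshold (S22)."""
    return {
        host
        for host, pages in sample_pages.items()
        if any(score_page(p, known_pairs) >= threshold for p in pages)
    }


KNOWN_PAIRS = [("A Museum", "URL001"), ("B Museum", "URL002")]
SAMPLE_PAGES = {
    "host-1": ["A Museum URL001 / B Museum URL002 / C Museum URL003"],
    "host-2": ["A Museum opening hours and access"],
}
```

Under these assumptions the page from "host-1" scores 2 and the page from "host-2" scores 0, so only "host-1" is selected as a deep-dig target.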

As described above, the deep-dig target hosts 20 are determined, and deep digging is performed preferentially on those hosts. As a result, the information held by useful hosts 20 is collected preferentially.

[Flowchart (Part 2)]
FIG. 8 is a flowchart showing an example of the flow of the unknown-information extraction process executed by the collection device 100. This process identifies, for a specific entity, the entity that should be associated with it by a predetermined property. A specific entity is an entity (second main entity) with which the entity that should be associated (second dependent entity) is not yet associated. In the example above, C Museum is a specific entity, because the entity "URL***" that should be associated with C Museum is not associated with it. Details of the process are described later with reference to FIGS. 9 to 11.

First, the target information acquisition unit 108 of the collection device 100 acquires a web page included in the collected information 132 (S100).

Next, the recognition unit 110 recognizes, in the acquired web page (hereinafter, the target web page), a first description pattern (first expression pattern) that includes a combination of two entities associated by a property (S102). The combination is, for example, an entity of the same class as the specific entity (for example, a facility) paired with an entity of the class that should be associated with the specific entity (for example, the URL of the facility).

Next, the recognition unit 110 identifies the relative positions of the entities in the combination based on the recognized first description pattern (S104). The recognition unit 110 then extracts, from the target web page acquired in S100, a second description pattern (second expression pattern) that includes the specific entity and matches the recognized first description pattern (S106).

Next, the specifying unit 114 identifies, in the second description pattern, the relative positions that correspond to the relative positions of the first description pattern identified in S104 (S108). From the information associated with the identified relative positions, the specifying unit 114 then extracts the information at the position (second position) that differs from the position where the specific entity is described (first position), and stores the extracted information in the storage unit 130 as the extracted information 138 (S110). The extracted information 138 associates the specific entity with the entity, extracted by this process, that is related to the specific entity by the predetermined property.

Next, the recognition unit 110 determines whether all web pages to be processed have been selected (S112). If not, the process returns to S100. If all of them have been selected, the providing unit 116 transmits the extracted information 138 to the knowledge database device 30 (S114). This completes one routine of the flowchart.

The web pages to be processed may be all the web pages included in the collected information 132, as described above, or may be a configured set of web pages. They may also be web pages acquired from a deep-dig target host 20. In addition, the decision unit 106 may refer to the knowledge database 42 and decide whether to make a web page (or a host 20) a processing target of the extraction unit 112, based on the degree to which the web page acquired by the collection unit 104 contains main entities and the dependent entities subordinate to them.

FIG. 9 is a diagram (part 1) for explaining the process by which the recognition unit 110 recognizes a description pattern. FIG. 10 is a diagram (part 2) for explaining the same process. For example, as shown in FIG. 9, the recognition unit 110 reads the source code of the target web page, such as HTML (HyperText Markup Language). Then, as shown in FIG. 10, the recognition unit 110 recognizes a description pattern A of the source code that includes a combination of entities included in the knowledge database 42.

In the illustrated example, the source code is arranged in the order "dt", "span", "dd", "a"; the entity "A Museum" follows "span", and the entity "URL" is associated with "a". The entity "A Museum" and the entity "URL001" are a combination of entities that are associated in the knowledge database 42. The same applies to the entity "B Museum".

In this case, the recognition unit 110 recognizes that the entity "facility name" follows "span" and that the entity "URL of the facility" is associated with "a". This identifies the relative positions of the entities in the combination within the description pattern. The position given the "facility name" is an example of the "first position", and the position given the "URL of the facility" is an example of the "second position".

The recognition unit 110 extracts description patterns that match description pattern A. A matching description pattern is one in which the source code is arranged in the order "dt", "span", "dd", "a" and a facility entity included in the knowledge database 42 follows "span". For example, the recognition unit 110 recognizes an instance of description pattern A in which the entity "C Museum" follows "span". The specifying unit 114 then determines, based on the relative positions of the entities in the combination, that the entity "URL003 of C Museum" is associated with "a".

The above processing can be summarized as shown in FIG. 11. The collection device 100 recognizes the description pattern "dt", "span", "dd", "a", and recognizes that the entity "facility name" follows "span" and that the entity "URL of the facility" is associated with "a". The collection device 100 then recognizes that the URL of the facility "C Museum", an entity with which no "URL" entity is associated in the knowledge database 42, is the one associated with "a" in description pattern A.

In this way, the collection device 100 can identify the unknown information to be associated with a specific entity, based on a description pattern that is a hierarchical structure of the language and that includes a combination of entities. In other words, the specifying unit 114 can identify a specific path ("dt" → "span" → "dd" → "a") in the hierarchical structure from a predetermined position in the first expression pattern (for example, the first main entity or "dt") to the first dependent entity, and can identify the second dependent entity by following that path in the second expression pattern.
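The path-following idea can be illustrated with a short Python sketch built on the standard-library `html.parser` module. It is an illustration under stated assumptions rather than the method of the specification: the sample markup in `PAGE` is hypothetical (FIG. 9 is not reproduced here), the helper names `TagStream`, `parse`, `learn_path`, and `follow_path` are invented for this example, and the path is taken to start at the tag that holds the main entity ("span") rather than at "dt".

```python
from html.parser import HTMLParser


class TagStream(HTMLParser):
    """Flatten a page into records (tag, text, href) in document order."""

    def __init__(self):
        super().__init__()
        self.records = []

    def handle_starttag(self, tag, attrs):
        self.records.append(
            {"tag": tag, "text": "", "href": dict(attrs).get("href")}
        )

    def handle_data(self, data):
        # Attach non-blank text to the most recently opened tag.
        if self.records and data.strip():
            self.records[-1]["text"] += data.strip()


def parse(html):
    stream = TagStream()
    stream.feed(html)
    stream.close()
    return stream.records


def learn_path(records, main, dependent):
    """Tag sequence from the record holding `main` to the record whose href is `dependent`."""
    for i, rec in enumerate(records):
        if rec["text"] == main:
            for j in range(i + 1, len(records)):
                if records[j]["href"] == dependent:
                    return tuple(r["tag"] for r in records[i:j + 1])
    return None


def follow_path(records, main, path):
    """Follow a learned tag path from `main` and return the href at its end."""
    if not path:
        return None
    for i, rec in enumerate(records):
        window = records[i:i + len(path)]
        if rec["text"] == main and tuple(r["tag"] for r in window) == path:
            return window[-1]["href"]
    return None


# Hypothetical page in the dt/span/dd/a layout described in the text.
PAGE = """
<dl>
  <dt><span>A Museum</span></dt><dd><a href="URL001">Official site</a></dd>
  <dt><span>B Museum</span></dt><dd><a href="URL002">Official site</a></dd>
  <dt><span>C Museum</span></dt><dd><a href="URL003">Official site</a></dd>
</dl>
"""

records = parse(PAGE)
path = learn_path(records, "A Museum", "URL001")  # learned from a known pair
url = follow_path(records, "C Museum", path)      # applied to the specific entity
```

With this sample page, the path learned from the known pair is `("span", "dd", "a")`, and following it from "C Museum" yields "URL003". A real page would need a proper DOM traversal and tolerance for markup variations, which this flat tag stream does not attempt.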

FIG. 12 is a diagram showing an example of the contents of the knowledge database 42 before the update. In the knowledge database 42, the URL of "C Museum" is not associated with the entity "C Museum".

FIG. 13 is a diagram showing an example of the contents of the knowledge database 42 after the update. When the knowledge database device 30 acquires the "URL" of "C Museum" from the collection device 100, it associates the URL transmitted from the collection device 100 with the entity "C Museum".
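The before and after states of FIGS. 12 and 13 can be pictured with a small sketch. The dictionary layout and the function name `apply_extracted` are assumptions made for illustration, and the choice to fill only missing properties (never overwriting an existing value) is a design assumption that the text does not state.

```python
def apply_extracted(knowledge, extracted, prop="official site"):
    """Return a copy of `knowledge` with extracted values attached where `prop` is missing."""
    updated = {name: dict(props) for name, props in knowledge.items()}
    for name, value in extracted.items():
        if name in updated and updated[name].get(prop) is None:
            updated[name][prop] = value
    return updated


# Hypothetical contents before the update (compare FIG. 12).
before = {
    "A Museum": {"official site": "URL001"},
    "B Museum": {"official site": "URL002"},
    "C Museum": {},
}

# Extracted information sent by the collection device (compare FIG. 13).
after = apply_extracted(before, {"C Museum": "URL003"})
```

Returning a copy keeps the pre-update state available for comparison, which mirrors the two figures.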

FIG. 14 is a diagram showing an example of a knowledge panel NP1 generated based on the knowledge database 42 before the update. When a user enters the search query "C Museum", the knowledge database device 30 cannot include a URL in the knowledge panel, because no URL is associated with the entity "C Museum".

In contrast, when the knowledge panel is generated based on the updated knowledge database 42, a URL is associated with the entity "C Museum", so the knowledge database device 30 can include the URL in the knowledge panel NP2, as shown in FIG. 15.

In this way, by using the updated knowledge database 42, the knowledge database device 30 can provide more useful information to the user.

The example above was described as operating on the expression pattern of the source code, but instead of (or in addition to) this, the information that should be associated with the specific entity may be identified based on a pattern in an image. For example, the specifying unit 114 may identify the URL of a facility with which no URL information is associated in the knowledge database 42, based on the position where the facility name is displayed and the position where the URL is displayed in the image.

According to the first embodiment described above, the collection device 100 recognizes, in the target web page, a first expression pattern, which is an expression including a first main entity included in the knowledge database 42 and a first dependent entity subordinate to the first main entity. The collection device 100 then provides the knowledge database device 30, in order to expand the knowledge database 42, with information based on a second expression pattern in the target web page that matches the first expression pattern and includes a second main entity, where the second main entity is included in the knowledge database 42 but is not associated with a second dependent entity of the same kind as the first dependent entity that should be associated with it. Useful information can thereby be acquired more efficiently.

<Second Embodiment>
The second embodiment is described below. In the second embodiment, a collection device 100A derives a reliability for each extracted entity and provides the knowledge database device 30 with the entities whose derived reliability is equal to or higher than a threshold. The description focuses on the differences from the first embodiment.

FIG. 16 is a diagram showing an example of the functional configuration of the collection device 100A of the information processing system 1A of the second embodiment. The collection device 100A includes a reliability derivation unit 115 in addition to the functional configuration of the collection device 100. The collection device 100A also includes a storage unit 130A in place of the storage unit 130. In addition to the information stored in the storage unit 130, the storage unit 130A stores extracted information with reliability 140.

For example, when the same fact is obtained from a plurality of target web pages, the reliability derivation unit 115 derives the reliability of the fact based on the information of those target web pages. A fact here means that a combination of entities of the knowledge database 42 is included in a page. For example, the reliability derivation unit 115 derives an integrated score, which serves as the reliability, based on the proportion of known entity combinations in each web page, and combines the derived integrated score with the extracted information 138 to generate the extracted information with reliability 140. The reliability derivation unit 115 then decides to provide the knowledge database device 30 with the entity combinations whose integrated score is equal to or higher than a threshold.

FIG. 17 is a diagram showing an example of the contents of the extracted information with reliability 140. The extracted information with reliability 140 associates with one another the combinations of entities included in a target web page, information indicating whether each combination is known or unknown in the knowledge database 42, the proportion of entity combinations in the target web page that are known, and the integrated score. For example, the reliability derivation unit 115 determines whether each combination of entities included in the target web page is known or unknown in the knowledge database 42 and, based on the results, derives the proportion of known combinations among all the combinations.
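The per-page "known ratio" can be sketched as below. The function name `known_ratio` and the representation of combinations as tuples are assumptions for illustration; the sketch covers only the ratio and does not attempt to reproduce the integrated-score model.

```python
def known_ratio(page_pairs, knowledge_pairs):
    """Fraction of the entity combinations on a page that are already in the knowledge database."""
    if not page_pairs:
        return 0.0
    known = sum(1 for pair in page_pairs if pair in knowledge_pairs)
    return known / len(page_pairs)


# Hypothetical data following the running example.
knowledge_pairs = {("A Museum", "URL001"), ("B Museum", "URL002")}
page_pairs = [
    ("A Museum", "URL001"),  # known
    ("B Museum", "URL002"),  # known
    ("C Museum", "URL003"),  # unknown
]

ratio = known_ratio(page_pairs, knowledge_pairs)  # two of three combinations are known
```

A page with no recognized combinations is given a ratio of 0.0 here to avoid division by zero; the text does not say how such a page should be treated.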

The reliability derivation unit 115 then applies the known ratio derived for each target web page to a predetermined model to derive the integrated score. The predetermined model is, for example, equation (1). In equation (1), "x" is the known ratio in a first target web page collected from the host 20 with host ID "001", and "y" is the known ratio in a second target web page collected from the host 20 with host ID "002". "α" is a parameter that can be set freely (for example, "0.1").

FIG. 18 is a diagram showing an example of how the integrated score varies with the combination of known ratios. As shown in FIG. 18, equation (1) is a function for which the integrated score tends to be high when the known ratios "x" and "y" are both high, and tends to be low when both are low.

In this way, the reliability derivation unit 115 derives the integrated score from the known ratios of both the first web page and the second web page, which allows the integrated score to be derived more accurately.

According to the second embodiment described above, the collection device 100A determines whether to use a non-dependent entity as information for expanding the knowledge database 42 based on two ratios. The first is the ratio, in the first web page, between (a) expression patterns in which a main entity included in the knowledge database 42 and a dependent entity subordinate to that main entity are expressed so as to have a specific relative relationship and (b) expression patterns in which a main entity included in the knowledge database 42 and a non-dependent entity, which is not subordinate to that main entity in the knowledge database 42, are expressed so as to have the specific relative relationship. The second is the ratio between the same two kinds of expression patterns in the second web page. This makes it possible to sort out, more accurately, the information to be used for expanding the knowledge database.

Although modes for carrying out the present invention have been described above using embodiments, the present invention is in no way limited to these embodiments, and various modifications and substitutions can be made without departing from the gist of the present invention.

1: information processing system; 10: terminal device; 20: host; 30: knowledge database device; 42: knowledge database; 100, 100A: collection device; 102: communication unit; 104: collection unit; 106: decision unit; 108: target information acquisition unit; 110: recognition unit; 112: extraction unit; 114: specifying unit; 115: reliability derivation unit; 116: providing unit; 130, 130A: storage unit; 134: entity information; 136: determination information; 138: extracted information; 140: extracted information with reliability

Claims

An information processing device comprising:
an acquisition unit that acquires information on a web page;
a recognition unit that refers to a knowledge database containing a plurality of entities and relationship information between the entities and recognizes, in the web page acquired by the acquisition unit, a first expression pattern, which is an expression including a first main entity included in the knowledge database and a first dependent entity subordinate to the first main entity;
an extraction unit that extracts, from the web page, a second expression pattern that includes a second main entity, the second main entity being included in the knowledge database and not associated with a second dependent entity that is of the same kind as the first dependent entity and should be associated with the second main entity, the second expression pattern being similar to the first expression pattern with the first main entity replaced by the second main entity; and
a providing unit that provides information based on the second expression pattern to the knowledge database in order to expand the knowledge database.

The information processing device according to claim 1,
wherein the recognition unit recognizes the first expression pattern in one or more web pages provided by the same host.

The information processing device according to claim 1 or 2,
wherein the first expression pattern and the second expression pattern are description patterns of a language used to generate web pages.

The information processing device according to any one of claims 1 to 3,
further comprising a specifying unit that identifies the second dependent entity in the second expression pattern extracted by the extraction unit, based on the relative relationship between the first main entity and the first dependent entity in the first expression pattern.

The information processing device according to claim 4, wherein
a first description pattern of the language used to generate the web page, which is the first expression pattern, and a second description pattern of the language, which is the second expression pattern, are similar, and
the specifying unit identifies a second position in the second description pattern based on the first position of the first main entity in the first description pattern, the second position of the first dependent entity in the first description pattern, and the first position of the second main entity in the second description pattern, and identifies the information described at that second position as the second dependent entity.

The information processing device according to claim 3,
wherein the description pattern is a hierarchical structure of the language.

The information processing device according to claim 6,
further comprising a specifying unit that identifies a specific path in the hierarchical structure leading to the first dependent entity in the first expression pattern, and identifies the second dependent entity by following the specific path from the second main entity in the second expression pattern extracted by the extraction unit.

The information processing device according to any one of claims 1 to 7, wherein
the recognition unit recognizes, in the web page, a third expression pattern, which is an expression including a third main entity included in the knowledge database and a third dependent entity subordinate to the third main entity,
the extraction unit extracts, from the web page, the second expression pattern, which includes the second main entity, the second main entity being included in the knowledge database and not associated with a second dependent entity that is of the same kind as the first dependent entity and the third dependent entity and should be associated with the second main entity, and which is similar to the first expression pattern and the third expression pattern with the first main entity and the third main entity replaced by the second main entity, and
the information processing device further comprises a specifying unit that identifies the second dependent entity in the second expression pattern extracted by the extraction unit, based on the relative relationship between the first main entity and the first dependent entity in the first expression pattern and the relative relationship between the third main entity and the third dependent entity in the third expression pattern.

The information processing device according to any one of claims 1 to 8, wherein
the acquisition unit acquires a first web page and a second web page, and
the information processing device further comprises a determination unit that determines whether to use a non-dependent entity as information for expanding the knowledge database, based on:
the ratio, in the first web page, between expression patterns in which a main entity included in the knowledge database and a dependent entity subordinate to that main entity in the knowledge database are expressed so as to have a specific relative relationship and expression patterns in which a main entity included in the knowledge database and a non-dependent entity, which is not subordinate to that main entity in the knowledge database, are expressed so as to have the specific relative relationship; and
the ratio, in the second web page, between expression patterns in which a main entity included in the knowledge database and a dependent entity subordinate to that main entity in the knowledge database are expressed so as to have the specific relative relationship and expression patterns in which a main entity included in the knowledge database and a non-dependent entity, which is not subordinate to that main entity in the knowledge database, are expressed so as to have the specific relative relationship.

The information processing device according to any one of claims 1 to 9,
further comprising a decision unit that refers to the knowledge database and decides whether to make the web page a processing target of the extraction unit, based on the degree to which main entities and dependent entities subordinate to those main entities are included in the web page acquired by the acquisition unit.

An information processing method in which a computer:
acquires information on a web page;
refers to a knowledge database containing a plurality of entities and relationship information between the entities;
recognizes, in the acquired web page, a first expression pattern, which is an expression including a first main entity included in the knowledge database and a first dependent entity subordinate to the first main entity;
extracts, from the web page, a second expression pattern that includes a second main entity, the second main entity being included in the knowledge database and not associated with a second dependent entity that is of the same kind as the first dependent entity and should be associated with the second main entity, the second expression pattern being similar to the first expression pattern with the first main entity replaced by the second main entity; and
provides information based on the second expression pattern to the knowledge database in order to expand the knowledge database.

A program that causes a computer to:
acquire information on a web page;
refer to a knowledge database containing a plurality of entities and relationship information between the entities;
recognize, in the acquired web page, a first expression pattern, which is an expression including a first main entity included in the knowledge database and a first dependent entity subordinate to the first main entity;
extract, from the web page, a second expression pattern that includes a second main entity, the second main entity being included in the knowledge database and not associated with a second dependent entity that is of the same kind as the first dependent entity and should be associated with the second main entity, the second expression pattern being similar to the first expression pattern with the first main entity replaced by the second main entity; and
provide information based on the second expression pattern to the knowledge database in order to expand the knowledge database.