JP5002631B2

JP5002631B2 - Word information collection device, word information collection method, and word information collection program

Info

Publication number: JP5002631B2
Application number: JP2009204796A
Authority: JP
Inventors: 芳郎山本; 享晴吉田
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2009-09-04
Filing date: 2009-09-04
Publication date: 2012-08-15
Anticipated expiration: 2029-09-04
Also published as: JP2011054102A

Description

The present invention relates to a word information collection device, a word information collection method, and a word information collection program that collect information about words contained in web pages.

Conventionally, a search result page produced by a search engine on a network displays various information in addition to the list of search results for the specified search term. Examples include advertisements related to the search term and information on specific services, such as shopping, related to the search term. Because displaying such varied information alongside the result list gives users useful information, further enrichment of the content of search result pages is in demand.

Here, a search engine stores each word in association with information about the web pages that contain it and, based on this information, creates a search index in which entries are indexed according to the word's frequency of appearance and importance in each web page. Because this search index is consulted at search time, highly weighted information is displayed at the top of the search result page, giving users more useful and meaningful information.

A search engine also crawls the network, acquires information about web pages, and accumulates it in a database. Because web pages are updated daily, promptly reporting these updates provides users with more useful information and is highly convenient for them. One known method of providing such information is a technique that visits homepages periodically at fixed times and, on each visit, detects and analyzes the differences on the homepage (see, for example, Patent Document 1).

JP 2002-73649 A (Japanese Unexamined Patent Application Publication No. 2002-73649)

Users can enter all sorts of search terms and may, for example, specify an unusual word such as a buzzword. A buzzword has very likely spread from the web page on which it first appeared, and some users want information about the web page that started the trend.
However, although the method described in Patent Document 1 can obtain the latest information on each web page, it cannot obtain information about when a specific word first appeared on a web page.

An object of the present invention is to provide a word information collection device, a word information collection method, and a word information collection program that can easily collect information about the web page on which an arbitrary word first appeared and can use the collected information to enrich the content of web pages.

The word information collection device of the present invention collects information about words contained in web pages on a network and uses the collected words to generate a search index for executing an index search for a search key. The device is characterized by comprising: page information acquisition means for crawling the network and acquiring information about a web page together with the update date and time of that web page; page analysis means for analyzing the acquired web page and extracting word candidates; registration status determination means for comparing each extracted word candidate with a search index generated in advance from previously acquired word candidates and determining whether the word candidate is stored in the search index; and first-appearance word registration means for, when the determination finds that the word candidate is not stored in the search index, storing the word candidate and the information about the web page in first-appearance word storage means in association with the update date and time as a first-appearance date and time.

The word information collection device of the present invention collects words contained in web pages on a network and uses the collected words to generate a search index for executing an index search for a search key. To this end, the page information acquisition means crawls the network and acquires information about web pages (this processing is called crawl processing). Here, information about a web page means the web page's URL (Uniform Resource Locator) information, the text data and image data displayed on the web page, and the like. At the same time, the page information acquisition means also acquires the update date and time of the web page. The page analysis means analyzes the text of the acquired web page and extracts word candidates.

The registration status determination means determines whether each extracted word candidate is already registered in the search index. The search index is created, each time crawl processing is performed, from all word candidates acquired up to that point. The first-appearance word registration means stores any word candidate that the registration status determination means has determined to be unregistered in the first-appearance word storage means. At this time, the word candidate is stored in association with the update date and time of the web page acquired by the page information acquisition means, as its first-appearance date and time, and with information about the web page such as its URL information and the text data and image data displayed on it.
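The registration step described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the patented implementation: the function name, the dictionary keys (`url`, `text`, `updated`, `first_seen`), and the use of an in-memory `set` and `dict` in place of the search index and the first-appearance word storage means are all hypothetical.

```python
from datetime import datetime


def register_first_appearances(candidates, page, search_index, first_store):
    """Store each word candidate that is absent from the search index.

    page         -- dict with hypothetical keys "url", "text", "updated"
    search_index -- set of words indexed by earlier crawl passes (assumed form)
    first_store  -- dict standing in for the first-appearance word storage means
    Returns the list of words registered by this call.
    """
    registered = []
    for word in candidates:
        # Registration status determination: is the word already indexed?
        if word in search_index:
            continue
        # First-appearance word registration: the page's update date and time
        # becomes the word's first-appearance date and time.
        first_store[word] = {
            "first_seen": page["updated"],
            "url": page["url"],
            "text": page["text"],
        }
        registered.append(word)
    return registered


page = {
    "url": "http://example.com/blog/1",
    "text": "a page that coins newword",
    "updated": datetime(2009, 9, 4, 12, 0),
}
index = {"a", "page", "that"}
store = {}
new_words = register_first_appearances(["page", "coins", "newword"], page, index, store)
```

In this sketch only the words missing from `index` end up in `store`, each tagged with the update date and time of the page on which it was found.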

In the present invention, first-appearance information about the extracted word candidates is collected while the usual processing of creating a search index for the words collected over the network is carried out. First-appearance information is the date and time at which an arbitrary word first appeared on a web page, together with information about that web page. That is, the registration status determination means and the first-appearance word registration means collect the first-appearance information at the same time, making use of the crawl processing and index creation processing that the word information collection device ordinarily performs.

Because first-appearance information can thus be collected while the index is being created through crawl processing, no processing dedicated solely to obtaining first-appearance information is needed, and the first-appearance information of words can be collected simply and efficiently.
The first-appearance information collected in this way can also be displayed on a web page and provided to users. For example, displaying the first-appearance information together with the list of search results for the search term specified by the user enriches the content of the search result page.

In the word information collection device of the present invention, it is preferable that, when the registration status determination means determines that the word candidate is stored in the search index, the first-appearance word registration means compares the first-appearance date and time stored in association with the word candidate with the acquired update date and time and, if the update date and time is determined to be older than the first-appearance date and time, updates the first-appearance date and time of the word candidate with the update date and time.

In this invention, the first-appearance word registration means updates the word data stored in the first-appearance word storage means. The update processing is performed when the registration status determination means determines that the extracted word candidate is already registered in the search index. Because the search index was created by the previous crawl processing from all word candidates acquired up to that time, every word stored in the first-appearance word storage means is included in the search index. Therefore, determining whether a word candidate is registered in the search index also determines whether it is registered in the first-appearance word storage means.

Each time crawl processing acquires information on a web page containing a given word, the update processing compares the first-appearance date and time of that word stored in the first-appearance word storage means with the update date and time of the web page, which is acquired together with the web page information. If the update date and time is older than the first-appearance date and time, the first-appearance date and time associated with the word in the first-appearance word storage means is updated with the update date and time, and the web page information associated with the word is replaced with the newly acquired web page information. That is, the older the update date and time of a newly acquired web page, the more likely that page is to be retained in the first-appearance word storage means. Repeating this processing makes it possible to collect information about an old web page on which the word appeared on the web.
According to the present invention, each time crawl processing is performed, the word data stored in the first-appearance word storage means is updated to the information of a web page with an older update date and time, so information about the web page with the oldest date and time can be collected automatically and easily. The first-appearance information of words can therefore be collected efficiently using ordinary crawl processing.
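The update rule above amounts to keeping, for each word, the page with the earliest update date and time seen so far. The Python sketch below illustrates this; the function name `update_if_older`, the tuple layout `(first_seen, url)`, and the example URLs are assumptions made for illustration only.

```python
from datetime import datetime


def update_if_older(first_store, word, page_url, page_updated):
    """Replace the stored record when the crawled page is older.

    first_store maps word -> (first_seen, url); returns True when replaced.
    """
    first_seen, _url = first_store[word]
    if page_updated < first_seen:
        # The crawled page predates the stored one: it becomes the new record.
        first_store[word] = (page_updated, page_url)
        return True
    return False


store = {"newword": (datetime(2009, 9, 4), "http://example.com/c")}
crawled = [
    ("http://example.com/d", datetime(2009, 10, 1)),  # newer: ignored
    ("http://example.com/b", datetime(2009, 8, 20)),  # older: replaces
    ("http://example.com/a", datetime(2009, 7, 15)),  # older still: replaces
]
changes = [update_if_older(store, "newword", url, ts) for url, ts in crawled]
```

Whatever order the pages are crawled in, the stored record converges on the page with the oldest update date and time, which is the behavior the text describes.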

In the word information collection device of the present invention, it is preferable to further comprise: search term acquisition means for requesting a terminal device connected via the network to input a search term and acquiring the input search term; data search means for searching the search index for a keyword that matches the acquired search term and acquiring information about the web pages associated with the matching keyword; first-appearance word search means for searching the first-appearance word storage means for a word that matches the acquired search term and acquiring the information about the web page and the first-appearance date and time associated with the matching word; and search result page providing means for creating and distributing a web page that displays the information about the web pages acquired by the data search means together with the information about the web page and the first-appearance date and time acquired by the first-appearance word search means.

In this invention, the first-appearance information collected in the first-appearance word storage means is displayed together with the list of search results for the search term specified by the user. That is, as in a commonly used search engine, the specified search term is acquired, the data for that search term is acquired from the search index, and the results are listed on the search result page. In addition, the first-appearance word search means acquires the data for the search term (the first-appearance date and time and the information about the web page) from the first-appearance word storage means, and the search result page providing means displays that data on the search result page and transmits the page to the terminal device.

According to this invention, the user can obtain information separate from the search results for the specified search term, namely the first-appearance information of the search term. In particular, when a buzzword is specified as the search term, its first-appearance information gives the user information about the origin of the trend, which is useful to the user. In this way, the content of the search result page can be enriched.
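The two lookups performed at search time can be sketched as below. This is a hedged illustration rather than the actual page-generation logic: the function name `build_result_page`, the dictionary shapes, the convention that a higher rank value is better, and the example URLs are all assumptions.

```python
from datetime import datetime


def build_result_page(term, search_index, first_store):
    """Combine the ranked result list with the term's first-appearance record.

    search_index -- dict: word -> list of (url, rank); higher rank is assumed better
    first_store  -- dict: word -> {"first_seen": datetime, "url": str}
    """
    # Data search: pages associated with the matching keyword, best rank first.
    hits = sorted(search_index.get(term, []), key=lambda hit: hit[1], reverse=True)
    # First-appearance word search: None when nothing is stored for the term.
    first = first_store.get(term)
    return {
        "term": term,
        "results": [url for url, _rank in hits],
        "first_appearance": first,
    }


index = {"newword": [("http://example.com/x", 0.4), ("http://example.com/y", 0.9)]}
firsts = {"newword": {"first_seen": datetime(2009, 7, 15), "url": "http://example.com/a"}}
page = build_result_page("newword", index, firsts)
missing = build_result_page("unknown", index, firsts)
```

Returning the first-appearance record as a separate field keeps it distinct from the ordinary result list, so a page template could render it in its own panel.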

In the word information collection device of the present invention, it is preferable to further comprise first-appearance word registration determination means for determining whether a word matching the extracted word candidate is stored in the first-appearance word information storage means, and that, when the registration status determination means determines that the word candidate is not stored in the search index and the first-appearance word registration determination means determines that the word candidate is not stored in the first-appearance word information storage means, the first-appearance word registration means stores the word candidate and the information about the web page in the first-appearance word storage means in association with the update date and time as the first-appearance date and time.

It is also preferable, in the word information collection device of the present invention, to further comprise first-appearance word registration determination means for determining whether a word matching the extracted word candidate is stored in the first-appearance word information storage means, wherein: when the registration status determination means determines that the word candidate is not stored in the search index and the first-appearance word registration determination means determines that the word candidate is not stored in the first-appearance word information storage means, the first-appearance word registration means stores the word candidate and the information about the web page in the first-appearance word storage means in association with the update date and time as the first-appearance date and time; and when the first-appearance word registration determination means determines that the word candidate is stored in the first-appearance word information storage means, the first-appearance word registration means compares the first-appearance date and time stored in association with the word candidate with the acquired update date and time and, if the update date and time is determined to be older than the first-appearance date and time, updates the first-appearance date and time of the word candidate with the update date and time.

In this invention, before the first-appearance word registration means performs registration processing or update processing, the first-appearance word registration determination means determines whether the word candidate in question is registered in the first-appearance word storage means. If the word is already registered in the first-appearance word storage means, update processing is performed; if it is not registered, registration processing is performed.
As a result, registration processing or update processing can be performed reliably even if the words stored in the first-appearance word storage means do not match the words stored in the search index.
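One way to read the combined decision is sketched below in Python. It follows the wording of the preferred aspect: a word found in the first-appearance store is updated when the page is older, and a word absent from both the store and the search index is registered. The case of a word that is indexed but missing from the store is not spelled out there, so leaving the store untouched in that case is an assumption of this sketch, as are the function name and the returned labels.

```python
from datetime import datetime


def process_candidate(word, page_url, page_updated, search_index, first_store):
    """Decide between registration and update for one word candidate."""
    if word in first_store:
        # First-appearance word registration determination: already stored,
        # so only an older page may replace the record.
        first_seen, _url = first_store[word]
        if page_updated < first_seen:
            first_store[word] = (page_updated, page_url)
            return "updated"
        return "kept"
    if word not in search_index:
        # Absent from both the index and the store: register it.
        first_store[word] = (page_updated, page_url)
        return "registered"
    # Indexed but not stored: left untouched here (assumption of this sketch).
    return "skipped"


index = {"old"}  # index from the previous crawl pass; "newword" is not in it yet
store = {}
outcomes = [
    process_candidate("newword", "http://example.com/c", datetime(2009, 9, 4), index, store),
    process_candidate("newword", "http://example.com/a", datetime(2009, 8, 1), index, store),
    process_candidate("newword", "http://example.com/d", datetime(2009, 10, 1), index, store),
    process_candidate("old", "http://example.com/e", datetime(2009, 9, 4), index, store),
]
```

The second call shows why the extra check helps: within one crawl pass the word is still missing from the index, yet the store already holds it, so the older page correctly replaces the record instead of being treated as a fresh registration.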

The word information collection method of the present invention collects information about words contained in web pages on a network and uses the collected words to generate a search index for executing an index search for a search key. The method is characterized by comprising: a page information acquisition step of crawling the network and acquiring information about a web page together with the update date and time of that web page; a page analysis step of analyzing the acquired web page and extracting word candidates; a registration status determination step of comparing each extracted word candidate with a search index generated in advance from previously acquired word candidates and determining whether the word candidate is stored in the search index; and a first-appearance word registration step of, when the determination finds that the word candidate is not stored in the search index, storing the word candidate and the information about the web page in first-appearance word storage means in association with the update date and time as a first-appearance date and time.

In this invention, first-appearance information about the extracted word candidates is collected while the following processing is carried out: crawling the network to acquire information about web pages, analyzing those web pages to extract word candidates, and creating a search index for the words extracted so far. The present invention collects the first-appearance information through the registration status determination step and the first-appearance word registration step, making use of the usual processing of generating a search index through crawl processing.

Specifically, it is determined whether each extracted word candidate is already registered in the search index, and a word candidate determined to be unregistered in the search index is registered in the first-appearance word storage means. On registration, the word candidate is stored in association with the update date and time acquired in the page information acquisition step, as its first-appearance date and time, and with information about the web page containing the word candidate. The search index is created, each time crawl processing is performed, from all word candidates acquired up to that point.

Because first-appearance information can thus be collected while the index is being created through ordinary crawl processing, no processing dedicated solely to obtaining first-appearance information is needed, and the first-appearance information of words can be collected simply and efficiently.
Displaying the first-appearance information collected in this way on a web page also enriches the content of that web page.

In the word information collection method of the present invention, it is preferable that, when the registration status determination step determines that the word candidate is stored in the search index, the first-appearance word registration step compares the first-appearance date and time stored in association with the word candidate with the acquired update date and time and, if the update date and time is determined to be older than the first-appearance date and time, updates the first-appearance date and time of the word candidate with the update date and time.

In this invention, when the registration status determination step determines that the extracted word candidate is already registered in the search index, the data for that word stored in the first-appearance word storage means is updated. Each time crawl processing acquires information on a web page containing a given word, the update processing compares the first-appearance date and time of that word stored in the first-appearance word storage means with the update date and time of the web page, which is acquired together with the web page information. If the update date and time is older than the first-appearance date and time, the first-appearance date and time associated with the word in the first-appearance word storage means is updated with the update date and time, and the web page information associated with the word is replaced with the newly acquired web page information. That is, the older the update date and time of a newly acquired web page, the more likely that page is to be retained in the first-appearance word storage means. Repeating this processing makes it possible to collect information about the web page on which the word is thought to have first appeared.

According to the present invention, each time crawl processing is performed, the word data stored in the first-appearance word storage means is updated to the information of a web page with an older update date and time, so information about the web page with the oldest date and time can be collected automatically and easily. The first-appearance information of words can therefore be collected efficiently using ordinary crawl processing.

The word information collection program of the present invention is characterized by causing a computer to execute the word information collection method described above.
According to this invention, because the computer is made to execute the word information collection method described above, the same effects as described above can be obtained with a simple configuration that requires only installing this word information collection program, which makes it highly useful.

FIG. 1 is a block diagram showing the schematic configuration of a word information collection system according to an embodiment of the present invention.
A flowchart showing the operation of the word information collection device in the embodiment.
A schematic view of a screen on which the search result page provided by the word information collection device in the embodiment is displayed on the terminal device.

Embodiments of the present invention are described below with reference to the drawings. This embodiment is described using, as an example, a word information collection system that has the functions of a search engine.
[1. Configuration of the word information collection system]
As shown in FIG. 1, the word information collection system 1 includes a word information collection device 100 and a terminal device 200 connected to the word information collection device 100 via the Internet 20.

The Internet 20 is an internet based on a general-purpose protocol such as TCP/IP, but is not limited to this. Any configuration capable of transmitting and receiving data can be used: for example, an intranet such as a LAN (Local Area Network); a network such as a communication line network or broadcast network made up of multiple base stations that can exchange information over a wireless medium; or the wireless medium itself serving as a medium for directly receiving data.

The word information collection device 100 has the functions of a search engine and also collects the first-appearance information of words. Here, first-appearance information is information from when an arbitrary word first appeared on a web page, and includes the date and time at that moment (the first-appearance date and time), the URL information of the web page, and the text data and image data displayed on the web page.
A commonly used personal computer (PC) serves as the word information collection device 100. It includes storage means for storing various information, control means such as a CPU for performing various computations, input means such as a keyboard and mouse, and display means for outputting web pages as a screen display.

As shown in FIG. 1, the word information collection device 100 includes, as storage means, an index database 101 serving as the search index and a first-appearance word database 102 serving as the first-appearance word information storage means. Although not illustrated, the word information collection device 100 also includes a database storing the various forms used to create search result pages.

The index database 101 has a table structure in which, for example as shown in Table 1 below, the URL (Uniform Resource Locator) information and rank of each web page containing a word are associated with that word and stored as one record. The items are not limited to those listed here; information that can be displayed as a search result, such as image data related to the word, may be added as appropriate.

ランク付けとは、任意の単語を含む複数のウェブページに対して、単語とウェブページとの関連度を各種アルゴリズムにより算出し、該ウェブページに付与することである。ランク付けの方法として、例えば、該単語を含むウェブページ中で、該単語が該ウェブページの内容に占める頻度が多いほど重要度が高くランク付けされたり、ウェブページのタイトル中に該キーワードが含まれている場合は重要度が高くランク付けされたりする。また、キーワードを含むウェブページにどれだけ多くのリンクが張られているかに応じてランク付けする方法もある。
なお、このランク付けは、クロール処理を行われるたびに新しく収集した単語も含めた検索用インデックスが作成され、インデックスデータベース１０１に再登録される。 Ranking is to calculate the degree of association between a word and a web page with respect to a plurality of web pages including an arbitrary word by using various algorithms, and assign it to the web page. As a ranking method, for example, in a web page including the word, the higher the frequency that the word occupies the content of the web page, the higher the degree of importance, or the web page title includes the keyword. If so, it will be ranked with high importance. There is also a method of ranking according to how many links are made to the web page containing the keyword.
In this ranking, each time a crawl process is performed, a search index including newly collected words is created and re-registered in the index database 101.
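As a rough illustration of the ranking signals mentioned above (frequency of the word in the page, presence of the word in the title, and number of links to the page), the following Python sketch combines them into one score. The function name `rank_score`, the additive formula, and the weights are assumptions; the source leaves the algorithm open ("various algorithms").

```python
def rank_score(word, title, body_tokens, inbound_links,
               title_bonus=1.0, link_weight=0.1):
    """Combine the three signals named in the text into one score.

    The additive form and the default weights are illustrative
    assumptions, not values taken from the source.
    """
    if body_tokens:
        # Share of the page's tokens that are the word itself.
        frequency = body_tokens.count(word) / len(body_tokens)
    else:
        frequency = 0.0
    in_title = title_bonus if word in title else 0.0
    return frequency + in_title + link_weight * inbound_links
```

A page whose title contains the word, or that receives many links, scores higher than a page that only mentions the word in passing.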

初出ワードデータベース１０２は、例えば、以下の表２に示すように、単語ごとに該単語がウェブページ上に最初に登場した日時である初出日時、該単語が含まれるウェブページのＵＲＬ情報およびキャッシュが関連付けられて１つのレコードとして記憶されたテーブル構造となっている。キャッシュとは、ウェブページの内容を保存したものであり、該ウェブページが更新されてしまった場合でも、キャッシュを表示することによって更新前のウェブページを閲覧することができる。なお、項目はここに列挙したものに限られず、検索結果として表示可能な情報、例えば単語に関連するイメージデータ等を適宜追加してもよい。 For example, as shown in Table 2 below, the first appearance word database 102 stores the first appearance date and time, which is the date and time when the word first appeared on the web page for each word, URL information and cache of the web page including the word. The table structure is associated and stored as one record. The cache stores the contents of the web page. Even when the web page has been updated, the web page before the update can be viewed by displaying the cache. Note that the items are not limited to those listed here, and information that can be displayed as search results, for example, image data related to words, may be added as appropriate.

単語情報収集装置１００は、演算処理手段として、ネットワーク上のウェブページから単語情報を収集する単語情報収集手段１１０と、指定された検索語に応じた検索結果を提供するウェブ検索手段１２０と、図示しないが、ネットワークを介して端末装置２００とデータの送受信を行う送受信手段と、を備えている。 The word information collection device 100 includes, as arithmetic processing means, word information collection means 110 that collects word information from web pages on the network, web search means 120 that provides search results corresponding to a designated search word, and, although not shown, transmission/reception means that exchanges data with the terminal device 200 via the network.

単語情報収集手段１１０は、ネットワークから単語情報を収集するものであり、ページ情報取得手段１１１と、ページ解析手段１１２と、登録状況判定手段１１３と、初出ワード登録手段１１４と、検索用インデックス生成手段１１５と、初出ワード登録判定手段１１６と、を備えている。
ページ情報取得手段１１１は、ネットワーク内を巡回し、ネットワーク内に公開されているウェブページのＵＲＬ情報、文章データおよび画像データなどの情報（ウェブページに関する情報）を取得する。この処理は一般的にクロール処理と呼ばれ、前回作成された検索用インデックス、すなわちインデックスデータベース１０１に記憶されたウェブページのＵＲＬ情報に基づいて各ウェブページを巡回する。また、クロール処理の頻度は必要に応じて適宜調整することができる。 The word information collection unit 110 collects word information from the network, and includes page information acquisition unit 111, page analysis unit 112, registration status determination unit 113, first appearance word registration unit 114, and search index generation unit. 115 and first appearance word registration determination means 116.
The page information acquisition unit 111 circulates in the network and acquires information (information on the web page) such as URL information, text data, and image data of the web page published in the network. This process is generally called a crawl process, and each web page is circulated based on the previously created search index, that is, the URL information of the web page stored in the index database 101. In addition, the frequency of the crawl process can be adjusted as necessary.

ページ解析手段１１２は、ページ情報取得手段１１１により取得したウェブページに含まれる文章（テキスト）を抽出し、該文章に対して形態素解析を実施する。形態素解析とは、文章を意味のある単語に区切り、各単語の品詞等を判別する処理である。ページ解析手段１１２は、形態素解析により得られる複数の単語のうち、名詞となり得るものを単語候補として取得する。 The page analysis unit 112 extracts a sentence (text) included in the web page acquired by the page information acquisition unit 111 and performs morphological analysis on the sentence. Morphological analysis is a process of dividing a sentence into meaningful words and determining the part of speech of each word. The page analysis unit 112 acquires words that can be nouns from among a plurality of words obtained by morphological analysis as word candidates.

登録状況判定手段１１３は、ページ解析手段１１２により得られた単語候補が、インデックスデータベース１０１に登録済みであるか否かを判定する。インデックスデータベース１０１には、前回のクロール処理までに取得した単語候補に対して作成したインデックスが記憶されている。インデックスデータベース１０１に登録済みである単語候補は、初出ワードデータベース１０２への更新対象となり、未登録である単語候補は登録対象となる。 The registration status determination unit 113 determines whether the word candidate obtained by the page analysis unit 112 has been registered in the index database 101. The index database 101 stores indexes created for word candidates acquired up to the previous crawl process. Word candidates already registered in the index database 101 are to be updated to the first appearing word database 102, and unregistered word candidates are to be registered.

初出ワード登録手段１１４は、取得した単語候補を初出ワードデータベース１０２に登録または更新の処理を行う。登録処理としては、登録対象となった単語候補に該単語候補が含まれるウェブページのＵＲＬ情報と該ウェブページの更新日時とを関連づけて初出ワードデータベース１０２に記憶させる。また、更新処理としては、更新対象となった初出ワードデータベース１０２内の単語データに対して、記憶されている初出日時と取得した更新日時とを比較し、更新日時が初出日時よりも古い場合は、初出日時を更新日時で更新する。 The first appearance word registration unit 114 registers the acquired word candidate in the first appearance word database 102 or updates its existing entry. In the registration process, a word candidate that is a registration target is stored in the first appearance word database 102 in association with the URL information of the web page containing it and the update date and time of that web page. In the update process, the first appearance date and time stored for the target word in the first appearance word database 102 is compared with the acquired update date and time, and if the update date and time is older than the first appearance date and time, the first appearance date and time is overwritten with the update date and time.
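The registration and update processes can be sketched in Python as follows. This is a minimal sketch under stated assumptions: the store is a plain dictionary keyed by word, and `FirstAppearance` and `register_or_update` are hypothetical names. Replacing the URL and cache along with the date follows the description of step S7 later in the text.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass
class FirstAppearance:
    """First-appearance entry for one word (field names are assumptions)."""
    first_seen: datetime
    url: str
    cache: str


def register_or_update(store, word, updated_at, url, cache):
    """Register `word`, or move its first-appearance date earlier.

    Returns "registered", "updated", or "unchanged".
    """
    entry = store.get(word)
    if entry is None:
        # Registration: the page's update date becomes the first-appearance date.
        store[word] = FirstAppearance(updated_at, url, cache)
        return "registered"
    if updated_at < entry.first_seen:
        # Update: an older page was found, so it replaces the stored entry.
        entry.first_seen = updated_at
        entry.url = url
        entry.cache = cache
        return "updated"
    return "unchanged"
```

Because the entry is only ever replaced by an older page, repeated crawls converge on the oldest page seen for each word.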

検索用インデックス生成手段１１５は、新しく収集した単語候補と、インデックスデータベース１０１に記憶されている単語情報と、に対して検索用インデックスを作成し、作成した検索用インデックスでインデックスデータベース１０１を更新する。
初出ワード登録判定手段１１６は、初出ワード登録手段１１４の登録または更新処理の前に、対象となる単語候補が初出ワードデータベース１０２に登録済みであるか否かを判定する。単語候補が初出ワードデータベース１０２に登録済みであると判定された場合は、該単語候補は更新処理の対象となる。一方、未登録であると判定された場合は、該単語候補は登録処理の対象となる。 The search index generation means 115 creates a search index for newly collected word candidates and word information stored in the index database 101, and updates the index database 101 with the created search index.
The first appearance word registration determination means 116 determines whether or not the target word candidate has been registered in the first appearance word database 102 before the registration or update processing of the first appearance word registration means 114. When it is determined that the word candidate has already been registered in the first appearance word database 102, the word candidate is subjected to update processing. On the other hand, if it is determined that the word is not registered, the word candidate is subjected to registration processing.

ウェブ検索手段１２０は、端末装置２００で指定された検索語に応じた検索結果ページを提供するものであり、検索語取得手段１２１と、インデックス検索手段１２２と、初出ワード検索手段１２３と、検索結果ページ提供手段１２４と、を備えている。
検索語取得手段１２１は、端末装置２００からの要求に応じて、検索ページを端末装置２００に送信する。検索語を入力させるための欄などが表示された検索ページを端末装置２００に表示させることで、ユーザに検索語を入力させる。入力された検索語は、ユーザの要求により端末装置２００から単語情報収集装置１００に送信され、検索語取得手段１２１は、受信した検索語を取得する。 The web search means 120 provides a search result page corresponding to the search term specified by the terminal device 200, and includes a search word acquisition means 121, an index search means 122, a first word search means 123, and a search result. Page providing means 124.
The search word acquisition unit 121 transmits a search page to the terminal device 200 in response to a request from the terminal device 200. By causing the terminal device 200 to display a search page on which a column for inputting a search term is displayed, the user is allowed to input the search term. The input search term is transmitted from the terminal device 200 to the word information collection device 100 in response to a user request, and the search term acquisition unit 121 acquires the received search term.

インデックス検索手段１２２は、取得した検索語をインデックスデータベース１０１から検索し、検索語に相当するキーワードと対応付けられたＵＲＬ情報を取得する。
初出ワード検索手段１２３は、取得した検索語を初出ワードデータベース１０２から検索し、検索語に該当する初出ワードと、該初出ワードに対応付けられた初出日時、ＵＲＬ情報、およびキャッシュ等を取得する。 The index search means 122 searches the acquired search term from the index database 101, and acquires URL information associated with a keyword corresponding to the search term.
The first appearance word search means 123 searches the first appearance word database 102 for the acquired search word, and acquires the first appearance word matching the search word together with the first appearance date and time, URL information, cache, and the like associated with that first appearance word.

検索結果ページ提供手段１２４は、検索結果ページを作成し、端末装置２００に送信する。端末装置２００の表示手段で表示される検索結果ページには、検索結果の一覧のほか、該検索語の初出情報が表示される。初出情報としては、検索語が初出したウェブページのタイトルが表示され、このタイトルにはウェブページへのリンクが張られている。タイトルをクリックするだけで該ウェブページを閲覧することができる。また、初出日時やキャッシュも表示される。キャッシュには、該ウェブページに関する情報を取得したときの内容が保存されているため、仮に該ウェブページが存在しない状況であったとしても、初出した当時のウェブページを閲覧することができる。 The search result page providing unit 124 creates a search result page and transmits it to the terminal device 200. In the search result page displayed by the display means of the terminal device 200, the first appearance information of the search word is displayed in addition to the list of search results. As the first appearance information, the title of the web page where the search term appears for the first time is displayed, and a link to the web page is put on this title. The web page can be browsed simply by clicking on the title. In addition, the first appearance date and cache are also displayed. Since the contents when the information about the web page is acquired are stored in the cache, even if the web page does not exist, the web page at the time of the first appearance can be browsed.

端末装置２００は、図示しないが、演算処理手段として、単語情報収集装置１００に対して検索サービスを要求し、要求した検索サービスのウェブページを受信する端末送受信手段と、ウェブページを画面表示として出力させる出力手段と、文字入力可能なマウスやキーボードなどの入力手段とを備えている。一方、記憶手段としては、各種フォームにかかわるフォームデータを記憶するデータベースなどを備えている。端末装置２００としては特に限定されないが、例えば、携帯電話やノートパソコンなどが挙げられる。 Although not shown, the terminal device 200 includes, as arithmetic processing means, terminal transmission/reception means that requests a search service from the word information collection device 100 and receives the web page of the requested search service, output means that outputs the web page as a screen display, and input means capable of character input, such as a mouse and a keyboard. As storage means, it includes a database that stores form data for various forms. The terminal device 200 is not particularly limited; examples include a mobile phone and a notebook personal computer.

［２．単語情報収集装置１００の動作］
次に、単語情報収集装置１００の動作について説明する。単語情報収集装置１００は、単語情報収集手段１１０による処理と、ウェブ検索手段１２０による処理と、が別々に動作する。 [2. Operation of word information collection device 100]
Next, the operation of the word information collection device 100 will be described. In the word information collection device 100, the processing by the word information collection unit 110 and the processing by the web search unit 120 operate separately.

まず、単語情報収集手段１１０の動作について、図２に基づいて説明する。
ステップＳ１において、ページ情報取得手段１１１は、ネットワークに公開されているウェブページを巡回し、該ウェブページに関する情報と、該ウェブページの更新日時と、を取得する。ここで、ウェブページに関する情報とは、ウェブページのＵＲＬ情報、ウェブページに表示される文章データおよび画像データ等であり、更新日時とは、ウェブページが更新されたときに通常付与される日時のことである。
次に、ステップＳ２において、ページ解析手段１１２は、ページ情報取得手段１１１により取得したウェブページの文章データを抽出し、該文章データに対して形態素解析を実施する。形態素解析により得られる複数の単語のうち、名詞となり得るものを単語候補として取得する。 First, the operation of the word information collecting unit 110 will be described with reference to FIG.
In step S1, the page information acquisition unit 111 crawls the web pages published on the network and acquires information about each web page together with the update date and time of that web page. Here, the information about the web page is the URL information of the web page, the text data and image data displayed on it, and the like, and the update date and time is the date and time normally assigned when the web page is updated.
Next, in step S2, the page analysis unit 112 extracts the text data of the web page acquired by the page information acquisition unit 111, and performs morphological analysis on the text data. Among a plurality of words obtained by morphological analysis, a word that can be a noun is acquired as a word candidate.

このようにしてウェブページから得られた単語候補のそれぞれに対して、以下の処理を実施する。
ステップＳ３において、登録状況判定手段１１３は、インデックスデータベース１０１を参照し、ページ解析手段１１２により得られた単語候補が記憶されているか否かを判定する。単語候補がインデックスデータベース１０１に記憶されている場合（Ｓ３：Ｙｅｓ）は、ステップＳ６へ進む。一方、単語候補がインデックスデータベース１０１に記憶されていない場合（Ｓ３：Ｎｏ）は、ステップＳ４へ進む。 In this way, the following processing is performed for each word candidate obtained from the web page.
In step S3, the registration status determination unit 113 refers to the index database 101 and determines whether the word candidate obtained by the page analysis unit 112 is stored. When the word candidate is stored in the index database 101 (S3: Yes), the process proceeds to step S6. On the other hand, when the word candidate is not stored in the index database 101 (S3: No), the process proceeds to step S4.

ステップＳ４では、初出ワード登録判定手段１１６は、初出ワードデータベース１０２を参照し、ページ解析手段１１２により得られた単語候補が記憶されているか否かを判定する。単語候補が初出ワードデータベース１０２に記憶されている場合（Ｓ４：Ｙｅｓ）は、ステップＳ６へ進む。一方、単語候補が初出ワードデータベース１０２に記憶されていない場合（Ｓ４：Ｎｏ）は、ステップＳ５へ進む。 In step S4, the first appearance word registration determination unit 116 refers to the first appearance word database 102 and determines whether the word candidate obtained by the page analysis unit 112 is stored. If the word candidate is stored in the first appearance word database 102 (S4: Yes), the process proceeds to step S6. On the other hand, when the word candidate is not stored in the first appearance word database 102 (S4: No), the process proceeds to step S5.

ステップＳ５では、初出ワード登録手段１１４は、ページ解析手段１１２により得られた単語候補に、該単語候補が含まれるウェブページの更新日時とＵＲＬ情報とを関連付けて、初出ワードデータベース１０２に記憶させてステップＳ８へ進む。
また、ステップＳ６では、初出ワード登録手段１１４は、ページ解析手段１１２により得られた単語候補と一致する単語を初出ワードデータベース１０２から検索し、該当単語に関連付けられた初出日時と、該単語候補が含まれるウェブページの更新日時と、を比較し、更新日時が初出日時よりも古いか否かを判定する。更新日時が初出日時よりも古い場合（Ｓ６：Ｙｅｓ）は、ステップＳ７へ進む。一方、更新日時が初出日時と同じか初出日時より新しい場合（Ｓ６：Ｎｏ）は、ステップＳ８へ進む。 In step S5, the first appearance word registering means 114 associates the word candidate obtained by the page analysis means 112 with the update date and time of the web page containing the word candidate and the URL information, and stores them in the first appearance word database 102. Proceed to step S8.
In step S6, the first appearance word registration unit 114 searches the first appearance word database 102 for a word that matches the word candidate obtained by the page analysis unit 112, compares the first appearance date and time associated with that word with the update date and time of the web page containing the word candidate, and determines whether the update date and time is older than the first appearance date and time. If the update date and time is older than the first appearance date and time (S6: Yes), the process proceeds to step S7. On the other hand, if the update date and time is the same as or newer than the first appearance date and time (S6: No), the process proceeds to step S8.

ステップＳ７では、初出ワード登録手段１１４は、ページ解析手段１１２により得られた単語候補と一致する単語を初出ワードデータベース１０２から検索し、該当単語に関連付けられた初出日時を、該ウェブページの更新日時で更新し、さらの該当単語に関連付けられたウェブページのＵＲＬ情報およびキャッシュを該ウェブページのＵＲＬ情報およびキャッシュで更新して、ステップＳ８へ進む。
なお、ステップＳ３〜Ｓ７までの処理は、単語候補の数に応じて複数回実施される。 In step S7, the first appearance word registration unit 114 searches the first appearance word database 102 for a word that matches the word candidate obtained by the page analysis unit 112, updates the first appearance date and time associated with that word with the update date and time of the web page, further updates the URL information and cache associated with that word with the URL information and cache of the web page, and proceeds to step S8.
In addition, the process from step S3 to S7 is implemented several times according to the number of word candidates.
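The branching among steps S3 to S8 for a single word candidate can be traced with the short Python sketch below. The function `trace_steps` and its boolean arguments are hypothetical. The sketch also assumes that a word reaching S6 already has a stored first appearance date and time, which the source does not state explicitly for words found only in the index database 101.

```python
from datetime import datetime


def trace_steps(in_index, in_first_db, page_updated, first_seen=None):
    """Return the step labels one word candidate passes through (S3 to S8)."""
    steps = ["S3"]
    if not in_index:                 # S3: No -> check the first-appearance DB
        steps.append("S4")
        if not in_first_db:          # S4: No -> register as a new word
            steps += ["S5", "S8"]
            return steps
    steps.append("S6")               # S3: Yes or S4: Yes -> compare dates
    if first_seen is None:
        # Assumption of this sketch: S6 always finds a stored date.
        raise ValueError("S6 needs a stored first-appearance date")
    if page_updated < first_seen:    # S6: Yes -> overwrite the entry
        steps.append("S7")
    steps.append("S8")
    return steps
```

For example, a brand-new word takes the path S3, S4, S5, S8, while a known word found on an older page takes S3, S6, S7, S8.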

ステップＳ８では、検索用インデックス生成手段１１５は、新しく収集した単語候補と、インデックスデータベース１０１に記憶されている単語情報と、に対して検索用インデックスを生成し、新しく生成した検索用インデックスでインデックスデータベース１０１を更新した後、処理を終了する。 In step S8, the search index generation means 115 generates a search index for the newly collected word candidates and the word information stored in the index database 101, and the index database is generated using the newly generated search index. After updating 101, the process ends.

次に、ウェブ検索手段１２０の動作について説明する。
まず、ユーザは、端末装置２００の入力手段を入力操作し、単語情報収集装置１００が提供する検索ページにアクセスするために、例えば、ウェブブラウザを起動させてアドレスを入力し、検索ページを要求する。
単語情報収集装置１００は、図示しない送受信手段により端末装置２００からの検索ページの要求を受信すると、検索語取得手段１２１は、図示しない記憶手段から検索ページ用のフォームを読み出し、これらの情報に基づいて検索ページを作成し、端末装置に送信する。 Next, the operation of the web search means 120 will be described.
First, the user operates the input means of the terminal device 200 and, in order to access the search page provided by the word information collection device 100, for example starts a web browser, enters an address, and requests the search page.
When the word information collection device 100 receives the search page request from the terminal device 200 via the transmission/reception means (not shown), the search word acquisition unit 121 reads the search page form from storage means (not shown), creates a search page based on that information, and transmits it to the terminal device 200.

端末装置２００では、端末送受信手段により検索ページの情報を受信して、図示しない表示手段（ディスプレイ等）に画面表示させる。
ユーザは、画面表示にしたがって、入力手段を用いて検索したい単語（検索語）を入力し、単語情報収集装置１００へ送信する。
単語情報収集装置１００は、送受信手段で検索語を受信し、検索語取得手段１２１は検索語を取得する。 In the terminal device 200, the information on the search page is received by the terminal transmission / reception means and is displayed on the screen by a display means (display or the like) (not shown).
In accordance with the screen display, the user inputs a word (search word) to be searched using the input means and transmits it to the word information collecting apparatus 100.
The word information collection device 100 receives the search word by the transmission / reception means, and the search word acquisition means 121 acquires the search word.

次に、インデックス検索手段１２２は、取得した検索語に相当する単語をインデックスデータベース１０１から検索し、該当する単語データを抽出する。
また、初出ワード検索手段１２３は、取得した検索語と一致する単語を初出ワードデータベース１０２から検索し、該当する単語データを抽出する。
次に、検索結果ページ提供手段１２４は、図３に示すような検索結果ページを作成し、端末装置２００に送信する。 Next, the index search means 122 searches the index database 101 for a word corresponding to the acquired search word, and extracts the corresponding word data.
The first word search means 123 searches the first word database 102 for a word that matches the acquired search word, and extracts the corresponding word data.
Next, the search result page providing unit 124 creates a search result page as shown in FIG. 3 and transmits it to the terminal device 200.
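The two lookups and the assembly of the result page can be sketched in Python as below. The dictionary shapes and the name `build_result_page` are assumptions for illustration; the source does not specify how the databases are stored or how the page is rendered.

```python
def build_result_page(query, index_db, first_db):
    """Collect the two parts of the result page for `query`.

    index_db: dict mapping word -> list of (url, rank) pairs (assumed shape)
    first_db: dict mapping word -> dict with first_seen, url, cache (assumed shape)
    """
    # Index search: pages for the query, highest rank first.
    hits = sorted(index_db.get(query, []), key=lambda h: h[1], reverse=True)
    return {
        "query": query,
        # First-appearance search: None when the word has no recorded entry.
        "first_appearance": first_db.get(query),
        "results": [url for url, _rank in hits],
    }
```

Keeping the first-appearance entry separate from the ranked list mirrors the page layout in FIG. 3, where the two are shown in different display areas.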

図３において、検索結果ページ５は、検索語入力領域５１と、初出情報表示領域５２と、検索結果一覧表示領域５３を有している。
検索語入力領域５１は、ユーザが入力可能な検索語入力欄５１１と検索ボタン５１２を有する。検索語入力欄５１１にはユーザが入力した検索語が表示され、検索ボタン５１２は再検索の要求を単語情報収集装置１００へ送信するためのボタンである。 In FIG. 3, the search result page 5 has a search word input area 51, a first appearance information display area 52, and a search result list display area 53.
The search word input area 51 has a search word input field 511 and a search button 512 that can be input by the user. A search term input by the user is displayed in the search term input field 511, and a search button 512 is a button for transmitting a request for re-search to the word information collecting apparatus 100.

初出情報表示領域５２は、初出情報であることを示すタイトル欄５２１と、ウェブページのタイトルがテキスト表示されたＵＲＬ情報欄５２２と、初出日時が表示された初出日時欄５２３と、該ウェブページのキャッシュへのリンクが張られたキャッシュ欄５２４と、を有する。タイトル欄５２１には、指定された検索語が最初に登場したときのウェブページ情報を表示していることをユーザに理解させるためのタイトルが表示されればよい。例えば、検索語として「ねこなべ」が指定されている場合には「ねこなべの初出は！」というタイトルを表示することができる。ＵＲＬ情報欄５２２に表示されたテキストには、該ウェブページのＵＲＬへのリンクが張られており、該ＵＲＬ情報欄５２２をクリックするだけで、指定した検索語が初出したウェブページのＵＲＬへ移動しその内容を閲覧することができる。また、キャッシュ欄５２４をクリックすると、初出ワードデータベース１０２に保存した時（初出時）のウェブページの内容を閲覧することができる。 The first appearance information display area 52 has a title field 521 indicating that the content is first appearance information, a URL information field 522 in which the title of the web page is displayed as text, a first appearance date and time field 523 in which the first appearance date and time is displayed, and a cache field 524 with a link to the cache of the web page. The title field 521 need only display a title that lets the user understand that the web page information shown is from when the designated search word first appeared. For example, when "Nekonabe" is designated as the search word, the title "The first appearance of Nekonabe!" can be displayed. The text displayed in the URL information field 522 is linked to the URL of the web page, so simply clicking the URL information field 522 takes the user to the web page on which the designated search word first appeared, where its contents can be browsed. In addition, clicking the cache field 524 displays the contents of the web page as they were when saved in the first appearance word database 102 (at the time of first appearance).

検索結果一覧表示領域５３は、インデックスデータベース１０１から抽出したデータが一覧表示される領域である。ウェブページのタイトルがテキスト表示されるとともに、該テキストにはウェブページのＵＲＬへのリンクが張られている。 The search result list display area 53 is an area in which data extracted from the index database 101 is displayed as a list. The title of the web page is displayed as text, and the text is linked to the URL of the web page.

ユーザは、端末装置２００の表示手段に画面表示された検索結果ページにより、指定した検索語に関連するウェブページの一覧を閲覧することができるだけでなく、指定した検索語が最初に登場したウェブページに関する情報も得ることができる。 The user can not only browse a list of web pages related to the designated search word by using the search result page displayed on the display unit of the terminal device 200, but also the web page on which the designated search word first appears. Information about can also be obtained.

［３．本実施形態の作用効果］
上述した実施形態では、以下に示す作用効果を奏することができる。
単語情報収集手段１１０において、ページ情報取得手段１１１がネットワークを巡回してウェブページに関する情報を取得し、ページ解析手段１１２が取得したウェブページから単語情報を取得し、検索用インデックス生成手段１１５が検索用インデックスを作成するという、いわゆる検索エンジンにおける通常の処理を行うとともに、登録状況判定手段および初出ワード登録手段１１４により取得した単語情報に関する初出情報を収集している。ページ情報取得手段１１１はウェブページに関する情報とともに、該ウェブページの更新日時を取得する。初出ワードデータベース１０２に記憶された単語には初出日時が関連付けられているので、この初出日時と取得した更新日時とを比較し、古いほうの日時を初出日時として再登録する。すなわち、取得するウェブページの更新日時が随時古い日時に更新されるので、結果として最も古いウェブページの情報を効率よく収集することができる。
このように、検索エンジンにおいて通常行われる処理を行いながら、簡単かつ効率よく初出情報を収集することができる。 [3. Effects of this embodiment]
In embodiment mentioned above, there can exist the effect shown below.
In the word information collection unit 110, the page information acquisition unit 111 crawls the network to acquire information about web pages, the page analysis unit 112 acquires word information from the acquired web pages, and the search index generation unit 115 creates a search index. While performing this normal search engine processing, the registration status determination unit 113 and the first appearance word registration unit 114 also collect first appearance information for the acquired word information. The page information acquisition unit 111 acquires the update date and time of each web page together with the information about the web page. Because each word stored in the first appearance word database 102 is associated with a first appearance date and time, that date and time is compared with the acquired update date and time, and the older of the two is re-registered as the first appearance date and time. In other words, the stored date and time is replaced whenever a web page with an older update date and time is found, so information on the oldest web page can be collected efficiently.
As described above, the first appearance information can be collected easily and efficiently while performing the processing normally performed in the search engine.

また、ウェブ検索手段１２０では、ユーザが指定した検索語の検索結果の一覧とともに、収集した初出情報を検索結果ページに表示している。ユーザが指定する検索語としては、一般的な単語のほか、流行語のような単語もある。流行語は、あるウェブページに表示されたことが発端となって流行が広まることも多く、流行の発端となったウェブページに関する情報を得たいと思うユーザも多数いる。上記実施形態では、上述の単語情報収集手段１１０によって収集した初出情報を、ウェブ検索手段１２０が、例えば検出語が初出したウェブページのタイトルと、初出日時と、を表示させ、タイトルには該ウェブページのＵＲＬへのリンクを張った状態で検索結果ページに表示する。
したがって、ユーザは指定した検索語の初出情報を得ることができるとともに、初出したウェブページを閲覧することができる。このように、ユーザが知りたいと思う有益な情報を検索語の検索結果とともに提供することができ、検索結果ページのコンテンツの充実化を図ることができる。 Further, the web search means 120 displays the collected first appearance information on the search result page together with the list of search results for the search word designated by the user. Search words designated by users include not only general words but also buzzwords. A buzzword often spreads after appearing on a particular web page, and many users want information about the web page from which it originated. In the above embodiment, the web search means 120 displays the first appearance information collected by the word information collection means 110 on the search result page, for example as the title of the web page on which the search word first appeared and the first appearance date and time, with the title linked to the URL of that web page.
Therefore, the user can obtain the first appearance information of the designated search term and can browse the first appeared web page. Thus, useful information that the user wants to know can be provided together with the search result of the search term, and the content of the search result page can be enhanced.

さらに、上記実施形態では、検索結果ページの初出情報の一部にキャッシュを表示している。初出情報としてリンクが張られるウェブページは古く、その後更新されていることが多いため、初出時のウェブページを閲覧できない可能性が高い。しかしながら、初出時のウェブページの内容をキャッシュとして初出ワードデータベース１０２に保存し、検索結果ページにキャッシュとして表示させるので、仮に初出時のウェブページが存在しない場合でも、初出時のウェブページを閲覧することができる。したがって、ユーザにとって有益な情報を提供することができる。 Further, in the above embodiment, a cache is displayed as part of the first appearance information on the search result page. Since the web page linked as the first appearance information is old and is often updated thereafter, there is a high possibility that the web page at the first appearance cannot be browsed. However, since the contents of the first appearance web page are stored as a cache in the first appearance word database 102 and displayed as a cache on the search result page, even if the first appearance web page does not exist, the first appearance web page is browsed. be able to. Therefore, information useful for the user can be provided.

［４．変形例］
なお、本発明は、上述した実施形態に限定されるものではなく、本発明の目的を達成できる範囲で、以下に示される変形をも含むものである。
例えば、上記実施形態では、単語情報収集手段１１０の動作において、初出ワード登録判定手段１１６により、検索語が初出ワードデータベース１０２に登録済みであるか否かを判定する処理（Ｓ４）を行ったが、この処理は省略してもよい。これは、ステップＳ３において、登録状況判定手段１１３がインデックスデータベース１０１への登録状況を判定しているため、この判定結果に基づいて初出ワードデータベース１０２への登録の有無を判定することができるからである。これによれば、処理の高速化を図ることができる。 [4. Modified example]
The present invention is not limited to the embodiment described above, and also includes the modifications shown below within a range in which the object of the present invention can be achieved.
For example, in the above embodiment, in the operation of the word information collection unit 110, the first appearance word registration determination unit 116 performs the process of determining whether or not the search word has been registered in the first appearance word database 102 (S4). This process may be omitted. This is because the registration status determination means 113 determines the registration status in the index database 101 in step S3, so that the presence or absence of registration in the first appearance word database 102 can be determined based on this determination result. is there. According to this, the processing speed can be increased.

また、上記実施形態では、ページ解析手段１１２は、形態素解析により文章を単語候補に分解したが、単語候補を抽出する方法はこれに限られない。一般的に用いられる言語処理技術、例えばＮ−ｇｒａｍを用いて解析してもよい。 In the above embodiment, the page analysis unit 112 decomposes sentences into word candidates by morphological analysis, but the method of extracting word candidates is not limited to this. The analysis may instead use a commonly used language processing technique such as N-gram.
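As one possible reading of the N-gram alternative, the Python sketch below splits text into overlapping character n-grams. The function name `char_ngrams` and the default of n = 2 are assumptions; the source does not say whether character or word n-grams are intended, or which value of N is used.

```python
def char_ngrams(text, n=2):
    """Return the overlapping character n-grams of `text`, in order."""
    if n < 1:
        raise ValueError("n must be at least 1")
    # Slide a window of width n over the text; too-short text yields [].
    return [text[i:i + n] for i in range(len(text) - n + 1)]
```

Unlike morphological analysis, this needs no dictionary, at the cost of producing many candidates that are not real words.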

さらに、上記実施形態において、初出ワードデータベース１０２の項目として画像データを追加してもよい。任意の単語が含まれるウェブページから、該単語に関連する画像データを取得し、該単語にこの画像データを関連付けて初出ワードデータベース１０２に記憶させる。したがって、ウェブ検索手段１２０により初出情報を検索結果ページに表示させる際は、初出情報の一部としてこの画像データを表示させることができる。画像データは視覚的なものであるので、ユーザにとっては認識が容易である。すなわち、ユーザにわかりやすい情報提供を行うことができる。 Further, in the above embodiment, image data may be added as an item of the first appearance word database 102. Image data related to the word is acquired from a web page including an arbitrary word, and the image data is associated with the word and stored in the first appearance word database 102. Therefore, when the first appearance information is displayed on the search result page by the web search means 120, the image data can be displayed as a part of the first appearance information. Since the image data is visual, it is easy for the user to recognize. That is, it is possible to provide information that is easy to understand for the user.

本発明は、ネットワーク上のウェブページに含まれる単語情報を収集する単語情報収集装置として検索エンジン等に利用できる。 The present invention can be used for a search engine or the like as a word information collection device that collects word information included in a web page on a network.

１００…単語情報収集装置
１０１…インデックスデータベース
１０２…初出ワードデータベース
１１０…単語情報収集手段
１１１…ページ情報取得手段
１１２…ページ解析手段
１１３…登録状況判定手段
１１４…初出ワード登録手段
１１５…検索用インデックス生成手段
１１６…初出ワード登録判定手段
１２０…ウェブ検索手段
１２１…検索語取得手段
１２２…インデックス検索手段
１２３…初出ワード検索手段
１２４…検索結果ページ提供手段
２００…端末装置
DESCRIPTION OF SYMBOLS
100 ... Word information collection device
101 ... Index database
102 ... First appearance word database
110 ... Word information collection means
111 ... Page information acquisition means
112 ... Page analysis means
113 ... Registration status determination means
114 ... First appearance word registration means
115 ... Search index generation means
116 ... First appearance word registration determination means
120 ... Web search means
121 ... Search word acquisition means
122 ... Index search means
123 ... First appearance word search means
124 ... Search result page providing means
200 ... Terminal device

Claims

A word information collection device that collects information about words included in web pages on a network and generates a search index for performing an index search on a search key using the collected words, the device comprising:
page information acquisition means for crawling the network and acquiring the update date and time of a web page together with information about the web page;
page analysis means for analyzing the acquired web page and extracting word candidates;
registration status determination means for comparing the extracted word candidates with a search index generated in advance from previously acquired word candidates, and determining whether each word candidate is stored in the search index; and
first appearance word registration means for, when the determination finds that the word candidate is not stored in the search index, associating the update date and time, as the first appearance date and time, with the word candidate and the information about the web page, and storing them in first appearance word storage means.

The word information collection device according to claim 1,
wherein the first appearance word registration means,
when the registration status determination means determines that the word candidate is stored in the search index, compares the first appearance date and time stored in association with the word candidate with the acquired update date and time, and, when the update date and time is determined to be older than the first appearance date and time, updates the first appearance date and time of the word candidate with the update date and time.

The word information collection device according to claim 1 or 2, further comprising:
search word acquisition means for requesting a terminal device connected via the network to input a search word, and acquiring the input search word;
data search means for searching the search index for a keyword that matches the acquired search word, and acquiring information about the web page associated with the matching keyword;
first appearance word search means for searching the first appearance word storage means for a word that matches the acquired search word, and acquiring the information about the web page and the first appearance date and time associated with the matching word; and
search result page providing means for creating and distributing a web page that displays the information about the web page acquired by the data search means, and the information about the web page and the first appearance date and time acquired by the first appearance word search means.

The word information collection device according to claim 1, further comprising
first appearance word registration determination means for determining whether a word that matches the extracted word candidate is stored in the first appearance word storage means,
wherein the first appearance word registration means, when the registration status determination means determines that the word candidate is not stored in the search index and the first appearance word registration determination means determines that the word candidate is not stored in the first appearance word storage means, associates the update date and time, as the first appearance date and time, with the word candidate and the information about the web page, and stores them in the first appearance word storage means.

The word information collection device according to claim 2, further comprising:
First appearance word registration determination means for determining whether or not a word that matches the extracted word candidate is stored in the first appearance word storage means,
Wherein the first appearance word registration means:
When the registration status determination means determines that the word candidate is not stored in the search index and the first appearance word registration determination means determines that the word candidate is not stored in the first appearance word storage means, associates the update date and time, as the first appearance date and time, with the word candidate and the information about the web page, and stores them in the first appearance word storage means; and
When the first appearance word registration determination means determines that the word candidate is stored in the first appearance word storage means, compares the first appearance date and time stored in association with the word candidate with the acquired update date and time, and, if the update date and time is determined to be older than the first appearance date and time, updates the first appearance date and time of the word candidate with the update date and time.
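The branching in this claim can be sketched as follows. This is only an illustrative reading of the claim language, not the patented implementation: the function name register_first_appearance, the dictionary-based storage, the returned status strings, and the treatment of page information as a URL are all assumptions.

```python
from datetime import datetime


def register_first_appearance(word, page_url, updated_at: datetime,
                              search_index, store):
    """Apply the two determinations to one word candidate.

    search_index: set of indexed keywords (assumed shape).
    store: dict mapping word -> {"url": ..., "first_seen": ...} (assumed shape).
    Returns a status string, which is an addition for illustration only.
    """
    entry = store.get(word)
    if entry is not None:
        # Already registered: keep the earliest date seen so far.
        if updated_at < entry["first_seen"]:
            entry["first_seen"] = updated_at
            return "updated"
        return "unchanged"
    if word not in search_index:
        # Absent from both the index and the store: register as a new word.
        store[word] = {"url": page_url, "first_seen": updated_at}
        return "stored"
    # In the index but not in the store: nothing to register here.
    return "skipped"
```

Under this reading, the order in which pages are visited does not affect the stored date: a page visited later but updated earlier pulls the first appearance date and time back, so the store converges on the oldest update date observed.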

A word information collection method for collecting information about words contained in web pages on a network and generating, from the collected words, a search index used to perform an index search for a search key, the method comprising:
A page information acquisition step of acquiring the update date and time of a web page together with the information about the web page by visiting the network;
A page analysis step of analyzing the acquired web page and extracting word candidates;
A registration status determination step of comparing the extracted word candidates with a search index generated in advance from previously acquired word candidates and determining whether or not each word candidate is stored in the search index; and
A first appearance word registration step of, when it is determined as a result of the determination that the word candidate is not stored in the search index, associating the update date and time, as the first appearance date and time, with the word candidate and the information about the web page, and storing them in the first appearance word storage means.
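The steps of this method can be sketched end to end in Python. This is a minimal sketch under stated assumptions, not the patented implementation: the Page record stands in for the result of the page information acquisition step, a regular-expression tokenizer stands in for the page analysis (real Japanese text would need morphological analysis), and the names Page, extract_candidates and collect are hypothetical.

```python
import re
from dataclasses import dataclass
from datetime import datetime


@dataclass
class Page:
    # Assumed output of the page information acquisition step.
    url: str              # information about the web page
    text: str             # page body
    updated_at: datetime  # update date and time of the page


def extract_candidates(text):
    """Page analysis step: naive tokenizer used only for illustration."""
    return set(re.findall(r"\w+", text.lower()))


def collect(page, search_index, first_appearance_store):
    """Run the determination and registration steps for one page.

    search_index: set of already indexed keywords (assumed shape).
    first_appearance_store: dict mapping word -> record (assumed shape).
    """
    for word in extract_candidates(page.text):
        # Registration status determination step.
        if word in search_index:
            continue
        # First appearance word registration step: the page's update date
        # and time is recorded as the first appearance date and time.
        first_appearance_store[word] = {
            "url": page.url,
            "first_seen": page.updated_at,
        }
    return first_appearance_store
```

The sketch simply overwrites an existing record when the same unindexed word is seen again, because this claim does not specify how repeated sightings are handled.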

The word information collection method according to claim 6,
Wherein, in the first appearance word registration step:
If it is determined in the registration status determination step that the word candidate is stored in the search index, the first appearance date and time stored in association with the word candidate is compared with the acquired update date and time, and if the update date and time is determined to be older than the first appearance date and time, the first appearance date and time of the word candidate is updated with the update date and time.

A word information collecting program for causing a computer to execute the word information collecting method according to claim 6.