JP4507206B2

JP4507206B2 - Internet information collecting apparatus, program and method

Info

Publication number: JP4507206B2
Application number: JP2006542237A
Authority: JP
Inventors: 雅伸増田
Original assignee: Fujitsu Ltd
Current assignee: Fujitsu Ltd
Priority date: 2004-10-28
Filing date: 2005-04-08
Publication date: 2010-07-21
Anticipated expiration: 2025-04-08
Also published as: JPWO2006046323A1; WO2006046323A1

Description

The present invention relates to an Internet information collecting apparatus, program, and method for collecting link destination web information from a web page rendered on screen, and more particularly to an Internet information collecting apparatus, program, and method that analyze the tag statements of a web page to collect link destination web information.

In recent years, there have been plans to build web library systems that preserve web pages which are published only on the Internet and are updated or deleted after a short time, and that make those pages available to the public. Such a system requires web archiving, a technology for collecting and storing information resources on the Internet.

In conventional web archiving, because Internet websites carry link information to other content as hyperlinks within their web pages, the transition from one page to the next is determined from this link information and the information on related pages is collected.

Web robots are conventionally known as a means of automatically collecting content on the Internet. A web robot collects link information by analyzing the HTML document of a web page, follows web page transitions hierarchically to collect content, and allows users to search and browse web page information published on the Internet in the past.
JP 2002-073609 A
JP 2003-345826 A

In recent Internet content, however, an increasing number of pages generate external links dynamically by embedding scripts in the HTML document, using the HTML extension known as dynamic HTML, which gives web pages interactivity with the user.

For example, a selection menu displayed on a web page presents options "1" to "3" to the user, a URL serving as link destination information is generated that differs according to the option the user selects, and a transition is made to the web page at the generated URL.

However, with the conventional method of collecting link information by analyzing the HTML document of a web page, it is difficult to detect link information generated by user operations on interactive content, and a large amount of link information is missed.

For example, on a web page containing code that generates the link destination URL according to the option the user selects from a selection menu, the destination URL cannot be determined from the code written in the web page alone, so the web information at the destination cannot be collected.

It is of course possible to detect the link information by having an operator press buttons, select menu items, and so on while the web page is open, but because this requires manual operation it takes too much time and effort.

An object of the present invention is to provide an Internet information collecting apparatus, program, and method that automatically collect, without omission, link information generated by user operations.

The present invention provides an Internet information collecting apparatus. The Internet information collecting apparatus of the present invention comprises:
a page browsing unit (browser) that acquires a web page on the Internet and renders it on screen;
a page analysis unit that analyzes the web page rendered by the page browsing unit and extracts event operation tag statements that dynamically generate link information in response to events caused by user operations;
an event generation unit that generates events for the event operation tag statements extracted by the page analysis unit; and
a link information detection unit that detects and stores link destination web information from the page transitions caused by the link information generated by the events from the event generation unit.

Here, the link information detection unit further detects and stores link destination web information from the proxy server accessed by the page browsing unit.

The page analysis unit extracts input statements from the range defined by a form statement among the tag statements that make up the web page, and the event generation unit sequentially generates all events defined for the input statement, so that the valid events among them cause link information to be generated.

The page analysis unit also extracts, from the range defined by a form statement among the tag statements that make up the web page, select statements that present options to the user and input statements that require a user operation, and the event generation unit generates events for the input statement while changing the selected option of the select statement.

More specifically, from the range defined by a form tag among the tag statements that make up the web page, the page analysis unit extracts an input tag that is a child tag of the form tag, a select tag that is a sibling tag of the input tag and creates a selection list, and a plurality of option tags that are child tags of the select tag and indicate the contents of the selection list. The event generation unit generates the events of the input tag while changing among the plurality of option tags in the select tag.

In this case as well, the event generation unit sequentially generates all events defined for the input tag, and the valid events among them cause link information to be generated.

The link information detection unit detects and stores all link information of the web pages reached by page transitions caused by generating events for the event operation tag statements of the currently rendered web page. It then repeats the process of rendering another web page and acquiring and storing the link information of the web pages reached by generating events for that page's event operation tag statements.
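The repeated page-by-page collection described above can be sketched as a simple queue-driven loop. This is only an illustrative sketch: `detect_links` is a hypothetical callback standing in for the whole "render page, generate events, observe transitions" step, and treating the start page as the first entry of the link table is an assumption, not something the text specifies.

```python
from collections import deque


def collect_all_links(start_url, detect_links):
    """Collect link URLs page by page until no new page remains.

    detect_links(url) is a hypothetical callback that returns every link
    URL found on one page (static links plus event-generated ones).
    """
    link_table = [start_url]      # assumption: the start page is recorded too
    seen = {start_url}
    queue = deque([start_url])
    while queue:
        page = queue.popleft()
        # Finish all links of the current page before moving to the next page.
        for url in detect_links(page):
            if url not in seen:
                seen.add(url)
                link_table.append(url)
                queue.append(url)
    return link_table


# Hypothetical in-memory site used in place of real page rendering.
SITE = {
    "index.html": ["a.html", "b.html"],
    "a.html": ["b.html", "c.html"],
    "b.html": [],
    "c.html": ["index.html"],
}

result = collect_all_links("index.html", lambda url: SITE.get(url, []))
print(result)
```

The `seen` set keeps pages that link back to each other from being processed twice, which the repetition described in the text would otherwise allow.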

The link information detection unit detects the link information of the destination web page, without actually performing the page transition, from the communication event information that is notified before communication with the link destination takes place.

The present invention provides an Internet information collecting program. The Internet information collecting program of the present invention causes a computer to execute:
a page browsing step of acquiring a web page on the Internet and rendering it on screen;
a page analysis step of analyzing the web page rendered in the page browsing step and extracting event operation tag statements that dynamically generate link information in response to events caused by user operations;
an event generation step of generating events for the event operation tag statements extracted in the page analysis step; and
a link information detection step of detecting and storing link destination web information from the page transitions caused by the link information generated by the events from the event generation step.

The present invention provides an Internet information collecting method. The Internet information collecting method of the present invention comprises:
a page browsing step of acquiring a web page on the Internet and rendering it on screen;
a page analysis step of analyzing the web page rendered in the page browsing step and extracting event operation tag statements that dynamically generate link information in response to events caused by user operations;
an event generation step of generating events for the event operation tag statements extracted in the page analysis step; and
a link information detection step of detecting and storing link destination web information from the page transitions caused by the link information generated by the events from the event generation step.

The details of the Internet information collecting program and method of the present invention are basically the same as those of the Internet information collecting apparatus of the present invention.

According to the present invention, page transitions to URLs that are dynamically generated by executing script statements and the like, in response to events caused by user operations requiring mouse or keyboard input on the buttons and selection lists of a web page rendered by the page browsing unit, are realized by simulated operations in which an application generates the events. Link information for transitions that depend on user operations, which could not be detected by analyzing the HTML document, can thus be detected, making it possible to collect web information on the Internet without gaps.

For the content at the link destination as well, link information is detected in the same way through simulated operations in which the application generates events, and by repeating this it becomes possible to collect all information published on the Internet.

Furthermore, for events that cannot be generated by a simulated operation, such as those caused by the mouse passing over an element, the link destination URL information is stored on the proxy server accessed by the browser. By acquiring the URLs from the proxy server as link information, web information on the Internet can be collected from the rendered web page without omission.

FIG. 1 is a block diagram of an Internet information collecting apparatus according to the present invention.
FIG. 2 is a block diagram of the hardware environment of a computer in which the Internet information collecting apparatus of FIG. 1 is realized.
FIG. 3 is an explanatory diagram of a web page containing form components targeted for event generation in the present invention.
FIG. 4 is an explanatory diagram of the HTML source statements that build the web page of FIG. 3.
FIG. 5 is an explanatory diagram of the DOM tree obtained by DOM parsing of the HTML source statements of FIG. 3.
FIG. 6 is an explanatory diagram of the list of events that can be generated for the A tag.
FIG. 7 is an explanatory diagram of HTML that describes an event which launches a script.
FIG. 8 is an explanatory diagram of the event management table of FIG. 1.
FIG. 9 is an explanatory diagram of HTML source statements for fireEvent, the event execution method provided by Internet Explorer (R).
FIG. 10 is an explanatory diagram of a web page containing a selection list and an operation button targeted for event generation in the present invention.
FIG. 11 is an explanatory diagram of the HTML source statements that build the web page of FIG. 10.
FIG. 12 is an explanatory diagram of the DOM tree obtained by DOM parsing of the HTML source statements of FIG. 11.
FIG. 13 is an explanatory diagram of BeforeNavigate, the event information containing the link destination URL that Internet Explorer (R) notifies before communication.
FIG. 14 is an explanatory diagram of the link list table of FIG. 1.
FIG. 15 is a flowchart of Internet information collection processing according to the present invention.
FIG. 16 is a flowchart of the link information detection processing of FIG. 15.
FIG. 17 is a flowchart of the link information detection processing, continued from FIG. 16.
FIG. 18 is a block diagram of another embodiment of the Internet information collecting apparatus according to the present invention.
FIG. 19 is an explanatory diagram of URLs that cannot be extracted by the Internet information collecting apparatus of FIG. 1.
FIG. 20 is an explanatory diagram of the processing operation of the embodiment of FIG. 18, which collects URLs from proxy server files.
FIG. 21 is a flowchart of Internet information collection processing in the embodiment of FIG. 18.

FIG. 1 is a block diagram showing an embodiment of the functional configuration of an Internet information collecting apparatus according to the present invention. In FIG. 1, the Internet information collecting apparatus 10 of the present invention is implemented, for example, by a computer and can connect via the Internet 12 to websites 14-1, 14-2, and 14-3 from which information is to be collected.

The Internet information collecting apparatus 10 includes a communication control unit 16 and an application execution environment 18. The communication control unit 16 performs communication control for searching and browsing web pages on the websites 14-1 to 14-3 via the Internet 12.

The application execution environment 18 is realized by a computer executing a program, and includes a browser 20, a page analysis unit 22, an event generation unit 24, a link information detection unit 26, an event management table 28, a link list table 30, and a content acquisition unit 32.

The browser 20 provided in the application execution environment 18 of the Internet information collecting apparatus 10 functions as the page browsing unit. It acquires a web page of a website, for example website 14-1, via the Internet 12 and renders it on screen.

The page analysis unit 22 analyzes the web page rendered by the browser 20, which functions as the page browsing unit, and extracts event operation tag statements that dynamically generate link information in response to events caused by user operations.

An event operation tag statement is a tag statement, placed in the HTML source statements of a web page, that builds a radio button, selection list, or similar control requiring mouse or keyboard operation. Specifically, the form statements delimited by <FORM> tags are extracted.

For the event operation tag statements extracted by the page analysis unit 22, the event generation unit 24 generates events that cause execution of the script statements which dynamically generate the link destination URL in response to user operations. The event management table 28 stores a list of the events generated by the event generation unit 24, keyed by the tag targeted for event generation.

Generating events for an event operation tag statement in this way amounts to a simulated operation, equivalent to the user operating the form components placed on the web page, such as buttons and selection lists, with a mouse or keyboard.

The link information detection unit 26 detects the link destination web page information, that is, the link destination URL, from the page transition produced when a script statement is executed by an event from the event generation unit 24, and stores it in the link list table 30.

When collection of the link destination URLs is complete, the content acquisition unit 32 takes the URLs from the link list table 30 one at a time, connects to the linked website, acquires its web page, and stores it in a database.
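The content acquisition step could be sketched roughly as follows. The text does not say what database or schema is used, so the SQLite table `pages`, its columns, and the injected `fetch` callback (a stand-in for the actual HTTP retrieval) are all assumptions made for illustration.

```python
import sqlite3


def store_contents(link_table, fetch, db):
    """Fetch each URL in the link table and store the page body.

    fetch(url) is a hypothetical callback that returns the page content;
    the table name and columns are illustrative, not from the source.
    """
    db.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, body TEXT)"
    )
    for url in link_table:          # URLs are taken out one at a time
        body = fetch(url)
        db.execute(
            "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)",
            (url, body),
        )
    db.commit()


db = sqlite3.connect(":memory:")
store_contents(["a.html", "b.html"], lambda url: "<html>" + url + "</html>", db)
rows = db.execute("SELECT url, body FROM pages ORDER BY url").fetchall()
print(rows)
```

Passing `fetch` in as a parameter keeps the sketch runnable without network access; a real collector would substitute its own retrieval routine.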

The Internet information collecting apparatus 10 of the present invention in FIG. 1 is realized by computer hardware resources such as those shown in FIG. 2. In the computer of FIG. 2, a RAM 102, a hard disk controller (software) 104, a floppy disk driver (software) 110, a CD-ROM driver (software) 114, a mouse controller 118, a keyboard controller 122, a display controller 126, and a communication board 130 are connected to the bus 101 of the CPU 100.

The hard disk controller 104 is connected to a hard disk drive 106, on which the Internet information collecting program of the present invention is loaded. When the computer starts, the necessary programs are read from the hard disk drive 106, loaded into the RAM 102, and executed by the CPU 100.

A floppy disk drive (hardware) 112 is connected to the floppy disk driver 110, allowing floppy disks (R) to be read and written. A CD drive (hardware) 116 is connected to the CD-ROM driver 114, allowing data and programs stored on a CD to be read.

The mouse controller 118 conveys input operations of the mouse 120 to the CPU 100. The keyboard controller 122 conveys input operations of the keyboard 124 to the CPU 100. The display controller 126 drives the display unit 128. The communication board 130 uses a communication line 132, which may be wireless, to communicate with website servers via a network such as the Internet.

FIG. 3 is an explanatory diagram of a web page containing form components that the present invention targets for event generation. On the web page 34 of FIG. 3, a link URL 36 is placed, with operation buttons 38 and 40 below it.

When the user clicks the link URL 36 on the web page 34 with the mouse, for example, a transition is made to the web page "a.html". When the user presses operation button 38, a transition is made to the web page "b.html", and when the user presses operation button 40, a transition is made to the web page "c.html".

FIG. 4 is an explanatory diagram of the HTML source statements that build the web page 34 of FIG. 3. In the HTML source statements 42 of FIG. 4, the link URL 36 of the web page 34 of FIG. 3 jumps to "a.html" by means of the A tag on line 11. The link destination "a.html" of this A tag on line 11 can be detected directly by analyzing the HTML source statements 42, as in the conventional approach.

The operation buttons 38 and 40 of the web page 34 of FIG. 3 are built by the form statement enclosed by <FORM> tags on lines 12 to 15 of the HTML source statements 42 in FIG. 4. In this form statement, when the user presses operation button 38 on the web page 34 of FIG. 3, for example, the "onclick" event on line 13 of the HTML source statements 42 occurs and the "jump()" function defined there is called.

The jump function, defined in the script statements on lines 3 to 8, builds the link destination URL from the id attribute value of the INPUT tag and performs the page transition by changing the location object.

For tag statements like this, where a script statement dynamically generates the link destination URL when an event occurs in response to a user operation on the form statement, the link destination URLs "b.html" and "c.html" cannot be detected even by analyzing the HTML source statements 42 themselves.
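The limitation of purely static analysis can be illustrated with a short sketch. The page below is a hypothetical reconstruction based only on the prose description of FIG. 4 (the actual figure is not reproduced here), and the collector uses Python's standard `html.parser` rather than any tool named in the source.

```python
from html.parser import HTMLParser

# Hypothetical page modeled on the description of FIG. 4, not the real figure.
PAGE = """
<html><head><script>
function jump(id) { location.href = id + ".html"; }
</script></head><body>
<a href="a.html">link</a>
<form>
<input type="button" id="b" value="B" onclick="jump(this.id)">
<input type="button" id="c" value="C" onclick="jump(this.id)">
</form>
</body></html>
"""


class StaticLinkCollector(HTMLParser):
    """Collects href values of A tags, as a purely static crawler would."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)


def static_links(html):
    collector = StaticLinkCollector()
    collector.feed(html)
    return collector.links


print(static_links(PAGE))  # ['a.html']
```

Only "a.html" is found. The destinations "b.html" and "c.html" exist nowhere in the markup as literal URLs; they come into being only when the script runs in response to a click.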

In the present invention, therefore, the page analysis unit 22 provided in the application execution environment 18 of the Internet information collecting apparatus 10 of FIG. 1 analyzes the HTML source statements 42 shown in FIG. 4 and builds the DOM tree 44 shown in FIG. 5, which can be manipulated by the event generation unit 24 acting as an application. The event generation unit 24 then generates the onclick event directly on the INPUT tags, the script statements are executed, and page transitions to the link destination URLs "b.html" and "c.html" take place. The link destination URLs are detected as the link destination information accompanying these page transitions.

The page analysis unit 22 shown in FIG. 1 includes an SDK (Software Development Kit) for the browser 20. The SDK is a set of tools for building software using application programming interfaces (hereinafter "APIs").

Specifically, it includes a DOM parser that analyzes the HTML source statements 42 of the web page rendered by the browser 20. The DOM parser analyzes the HTML source statements and generates a Document Object Model (DOM) having the DOM tree 44 shown in FIG. 5. The Document Object Model represented by the DOM tree 44 is an API for accessing HTML tag statements as a set of node objects in a tree structure.

Once the Document Object Model has been generated as the DOM tree 44 shown in FIG. 5, the event generation unit 24, acting as a program, can generate the onclick event directly on an INPUT tag inside the form tag, so that the script statement is executed, the link destination URL is generated, and the page transition takes place.
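As a rough illustration of how a tree of node objects makes the INPUT tags inside a FORM tag addressable, the following sketch builds a very small tree with Python's standard `html.parser`. It is not the browser SDK's DOM parser described in the text; the `Node` class, the void-tag set, and the sample markup are assumptions for illustration only.

```python
from html.parser import HTMLParser

VOID_TAGS = {"input", "br", "img", "meta", "link", "hr"}  # tags with no end tag


class Node:
    def __init__(self, tag, attrs, parent=None):
        self.tag = tag
        self.attrs = dict(attrs)
        self.parent = parent
        self.children = []


class TreeBuilder(HTMLParser):
    """Builds a minimal tree of Node objects from HTML start/end tags."""

    def __init__(self):
        super().__init__()
        self.root = Node("#document", [])
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this name and close it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    return builder.root


def find_all(node, tag):
    found = [node] if node.tag == tag else []
    for child in node.children:
        found.extend(find_all(child, tag))
    return found


def event_targets(root):
    """Return the INPUT nodes located inside FORM nodes."""
    targets = []
    for form in find_all(root, "form"):
        targets.extend(find_all(form, "input"))
    return targets


SAMPLE = (
    '<html><body><a href="a.html">A</a><form>'
    '<input type="button" id="b" onclick="jump()">'
    '<input type="button" id="c" onclick="jump()">'
    "</form></body></html>"
)
target_ids = [node.attrs["id"] for node in event_targets(build_tree(SAMPLE))]
print(target_ids)  # ['b', 'c']
```

With the tags available as objects, a program can pick out exactly the nodes on which events should be generated, which is the role the DOM tree plays in the description above.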

That is, the onclick events of the INPUT tags on lines 13 and 14 of the form statement in the HTML source statements 42 of FIG. 4 are originally designed to occur when operation buttons 38 and 40 on the web page 34 of FIG. 3 are pressed, whereupon the JavaScript function in the script statements on lines 3 to 8 is called.

In the present invention, by contrast, the DOM parser (DOM analysis means) in the software development kit (SDK) provided in the page analysis unit 22 is used to build the Document Object Model, an API for accessing the set of node objects with the tree structure shown in the DOM tree 44 of FIG. 5. The event generation unit 24, acting as a program, can then generate the onclick event directly, execute the script statements to generate "b.html" and "c.html", and cause the page transitions. This means that the program performs the user's button press in a simulated manner.

In this embodiment, "onclick" is generated as the valid event for the INPUT tags on lines 13 and 14 of FIG. 4, but the events used in tag statements come in many types corresponding to different user operations.

FIG. 6 is an explanatory diagram of the A tag event list 46, which shows the types of events defined for the A tag used to set the link on line 11 of FIG. 4.

As the A tag event list 46 shows, 17 types of events can be generated for the A tag alone. Almost the same event types are defined for the INPUT tags on lines 13 and 14 of FIG. 4.

Of the events in the A tag event list 46, when a tag is written as in the script-launching HTML source statements 48 of FIG. 7, generating the onclick event launches the script statement, whereas any other event is discarded immediately after it is generated.

The present invention exploits this mechanism, whereby unneeded events in HTML tag statements are discarded automatically. For each tag extracted from a form statement as an event generation target, every event in the list defined for that tag is generated, and only the events actually defined on the tag, as in FIG. 7, are executed.

By generating every event in the event list defined for the target tag and observing which event actually caused a script statement to execute, script statements can be executed through event generation without knowing in advance which specific event is valid.

The valid events identified by generating all events and executing the script statements are registered in the event management table 28 against the tag name, as shown in FIG. 8. The valid events registered against tag names in the event management table 28 can be used as statistical information when generating events for subsequent tags, but as a rule all events corresponding to a tag are generated.
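The "generate every event, keep track of the ones that did something" approach might be sketched like this. The event names are a small subset of standard HTML event attributes chosen for illustration (the full list in FIG. 6 is not reproduced in the text), and `handlers` is a hypothetical stand-in for the browser's script engine.

```python
# Illustrative subset of event names; the actual per-tag list is in FIG. 6.
CANDIDATE_EVENTS = [
    "onclick", "ondblclick", "onfocus", "onblur", "onkeydown", "onmouseover",
]


def fire_all_events(tag_name, attrs, handlers, event_table):
    """Fire every candidate event on one tag and record the valid ones.

    attrs:       the tag's attributes (event attribute name -> script text)
    handlers:    hypothetical script engine, script text -> callable that
                 takes the attributes and returns the generated URL
    event_table: tag name -> set of events that actually ran a script
    """
    urls = []
    for event in CANDIDATE_EVENTS:
        script = attrs.get(event)
        if script is None:
            continue  # no handler on this tag: the event is simply discarded
        urls.append(handlers[script](attrs))
        event_table.setdefault(tag_name, set()).add(event)
    return urls


handlers = {"jump()": lambda attrs: attrs["id"] + ".html"}
table = {}
urls = fire_all_events(
    "input", {"type": "button", "id": "b", "onclick": "jump()"}, handlers, table
)
print(urls, table)  # ['b.html'] {'input': {'onclick'}}
```

The caller never needs to know that onclick is the relevant event for this tag; the table records it as a side effect, in the spirit of the event management table described above.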

When Internet Explorer (R) is used as the browser 20 of FIG. 1, a method called fireEvent is available for generating events directly from a program, as shown in FIG. 9.

As shown in the fireEvent HTML source statements 50 of FIG. 9, the fireEvent method can issue events directly to every tag, for example by applying "onfocus" (setting focus) and "onblur" (removing focus) to all tags as on lines 3 and 4. This causes the link destination URL to be generated by executing the script statements just as if the user had performed the operation, and the page transition can take place.

FIG. 10 is an explanatory diagram of a web page containing a selection list and an operation button that are targets of event generation in the present invention. In FIG. 10, a map display button 54 is placed on the web page 52. A selection list 56 is provided for the map display button 54 and offers three options: "Tokyo", "Kanagawa Prefecture", and "Shizuoka Prefecture".

FIG. 11 is an explanatory diagram of the HTML source 58 that builds the web page 52 of FIG. 10. On the web page 52, the jump destination when the map display button 54 is pressed depends on the item selected in the selection list 56.

That is, pressing the map display button 54 with "Tokyo" selected in the selection list 56 jumps to the link destination "東京都.html". Pressing it with "Kanagawa Prefecture" selected jumps to "神奈川県.html", and pressing it with "Shizuoka Prefecture" selected jumps to "静岡県.html".

In the HTML source 58 of FIG. 11 that builds this web page 52, form parts such as the map display button 54 and the selection list 56 are created by the form statement enclosed in the <FORM> tag on lines 13 to 20. The form statement contains the <SELECT> tag on line 14 and the <INPUT> tag on line 19, and these are child tags of the <FORM> tag.

In this example, at the moment the map display button 54 placed by the <INPUT> tag on line 19 is pressed, the select statement of the sibling <SELECT> tag on lines 14 to 18 can be in any of three selection states.

Therefore, when the <INPUT> tag for the map display button 54 is detected, examining the <OPTION> tags on lines 15 to 17 of the sibling <SELECT> tag shows that there are three selection patterns.

Consequently, to generate a simulated event on the <INPUT> tag of the map display button 54, the processing is repeated three times; each time, the state selected through the <OPTION> tags of the <SELECT> tag is changed and then the event is generated on the <INPUT> tag.

FIG. 12 is an explanatory diagram of the DOM tree 60 obtained when the page analysis unit 22 of FIG. 1 parses the HTML source of FIG. 11 with a DOM parser. The <FORM> tag contains the sibling <INPUT> and <SELECT> tags, and under the <SELECT> tag that builds the selection list 56 there are three <OPTION> tags, one for each of the options "Tokyo", "Kanagawa Prefecture", and "Shizuoka Prefecture".

That is, the processing basically follows this procedure:
・Scan all tags in the HTML source 58 of FIG. 11.
・Identify the <FORM> tag.
・Examine all child tags within the <FORM> tag, such as <INPUT> and <SELECT>, and determine the selection patterns of the sibling tags.
・For each pattern found for <SELECT>, change the state of the sibling tag to that pattern, then issue an event to the current child tag <INPUT>, so that the script statement on lines 3 to 10 runs and generates the link destination URL.
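
The first three steps of this procedure can be sketched with Python's standard `html.parser` module. The markup in `SAMPLE` is a hypothetical stand-in for the HTML source of FIG. 11 (which is not reproduced in the text), and the class name, the option values, and the choice to treat `type="button"` inputs as operation buttons are assumptions of this example.

```python
from html.parser import HTMLParser


class FormPatternParser(HTMLParser):
    """Collect the selection patterns and operation buttons found inside <FORM>."""

    def __init__(self):
        super().__init__()
        self.in_form = False
        self.options = []  # selection patterns offered by the sibling <SELECT>
        self.buttons = []  # names of the <INPUT type="button"> children

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)  # the parser lowercases tag and attribute names
        if tag == "form":
            self.in_form = True
        elif self.in_form and tag == "option":
            self.options.append(attrs.get("value"))
        elif self.in_form and tag == "input" and attrs.get("type") == "button":
            self.buttons.append(attrs.get("name"))

    def handle_endtag(self, tag):
        if tag == "form":
            self.in_form = False


# Hypothetical markup standing in for the HTML source of FIG. 11.
SAMPLE = """
<FORM>
<SELECT name="area">
<OPTION value="tokyo">Tokyo</OPTION>
<OPTION value="kanagawa">Kanagawa</OPTION>
<OPTION value="shizuoka">Shizuoka</OPTION>
</SELECT>
<INPUT type="button" name="map" value="Show map">
</FORM>
"""

parser = FormPatternParser()
parser.feed(SAMPLE)
patterns = parser.options
```

The resulting `patterns` list gives the number of repetitions needed for the button: one event-generation pass per selection pattern.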

FIG. 13 is an explanatory diagram of BeforeNavigate 62, the event information used by the link information detection unit 26 of FIG. 1. When Internet Explorer (R) begins communication access to a web page, this event is notified before the communication and includes the link destination URL.

That is, in Internet Explorer (R), when a web page is browsed by specifying a URL, BeforeNavigate is known as the event that is notified before communication with the website begins.

In BeforeNavigate 62, the link destination URL is set in the argument "url" on the third line, as shown in FIG. 13. The link information detection unit 26 of the present invention detects the link destination URL from this "url" argument in the event information.

If nothing is done after BeforeNavigate 62 is notified, the transition to the linked page proceeds. Because the link destination URL has already been detected at that point, the communication is canceled by setting "Cancel", the final parameter shown on the eighth line of BeforeNavigate 62 in FIG. 13, to "True". In this way only the link destination URL is detected and acquired, without any page transition.
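
The capture-and-cancel pattern can be illustrated with a small simulation. This sketch does not call Internet Explorer: `FakeBrowser`, its `listeners` list, and the `state["cancel"]` flag are hypothetical stand-ins for the browser, its before-navigation notification, and the "Cancel" parameter, and the URL is a placeholder.

```python
class FakeBrowser:
    """Stand-in for a browser that notifies listeners before it navigates."""

    def __init__(self):
        self.listeners = []
        self.fetched = []  # URLs for which communication actually took place

    def navigate(self, url):
        state = {"cancel": False}
        for listener in self.listeners:
            listener(url, state)  # notification happens before any communication
        if not state["cancel"]:
            self.fetched.append(url)


detected = []


def on_before_navigate(url, state):
    detected.append(url)     # take the link destination from the "url" argument
    state["cancel"] = True   # cancel so that no page transition occurs


browser = FakeBrowser()
browser.listeners.append(on_before_navigate)
browser.navigate("http://example.com/tokyo.html")
```

After the call, the URL is in `detected` while `browser.fetched` stays empty, which mirrors acquiring the link destination without leaving the current page.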

FIG. 14 is an explanatory diagram of the link list table 30 of FIG. 1, which stores the link destination URLs detected by the link information detection unit 26.

Link information is collected in the present invention as follows. A web page is opened in the browser 20 from a given URL. For every form part on the page that requires a user operation, the page analysis unit 22, the event generation unit 24, and the link information detection unit 26 simulate the operation by generating events, cause a transition to a web page, and acquire the link destination URL. After that, the link list table is consulted, each newly acquired link destination web page is opened, and the acquisition of link destination URLs by generating events on its form parts is repeated.

That is, when a link destination URL is detected from a page transition caused by an event on a form part of the currently opened web page, the present invention does not immediately open that newly detected page and go on to acquire the next link destination URLs from its form parts, which would be collection in the depth direction of the hierarchy. Instead, it repeats the collection of URLs one link ahead, web page by web page. If link information were collected in the depth direction, the process would have to return to the original level after reaching the last web page, which would complicate the processing.

FIG. 15 is a flowchart of the Internet information collection processing according to the present invention. In FIG. 15, a list of URLs collected by a conventional web robot or the like is acquired in step S1, one URL is selected from it in step S2, and the browser 20 is started to open the web page in step S3.

The browser does not need to render the web page on a screen when opening it; the page is opened as background processing on the computer serving as the Internet information collecting apparatus 10.

Next, in step S4, the page is analyzed with a DOM parser or the like to build a document object model (DOM) with a DOM tree, which provides the API through which the event generation unit 24 can generate events. Then, in step S5, the link information detection processing is executed by simulated operations based on event generation.

In step S6, the URL list read in step S1 is checked for unprocessed URLs; if any remain, the process returns to step S2 and repeats the same processing. When all URLs have been processed in step S6, the process proceeds to step S7 and acquires the list of newly detected link destination URLs, and the link information detection processing from step S2 is repeated until step S8 finds no unprocessed URL.
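
The level-by-level loop of steps S1 to S8 can be sketched as below. The function `collect`, the `detect_links` callback standing in for step S5, and the toy link graph `SITE` are hypothetical; skipping URLs that were already seen is an assumption added here so that the loop terminates on cyclic links.

```python
def collect(seed_urls, detect_links):
    """Process URLs one level at a time, following steps S1 to S8."""
    seen = set(seed_urls)
    current = list(seed_urls)  # S1: initial URL list
    order = []
    while current:             # S8: stop when no unprocessed URL remains
        found = []
        for url in current:    # S2 to S6: handle every URL in the current list
            order.append(url)
            for link in detect_links(url):  # S5: link detection for this page
                if link not in seen:        # assumption: skip known URLs
                    seen.add(link)
                    found.append(link)
        current = found        # S7: newly detected URLs form the next list
    return order


# Toy link graph standing in for real web pages.
SITE = {"a": ["b", "c"], "b": ["d"], "c": ["a"], "d": []}
visited = collect(["a"], lambda url: SITE.get(url, []))
```

Because each list is finished before the next one is started, the loop never has to return to an earlier level of the hierarchy, which is the simplification the description argues for.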

FIGS. 16 and 17 are flowcharts of the link information detection processing according to the present invention, corresponding to step S5 of FIG. 15.

In FIG. 16, the link information detection processing scans the tags in the HTML tag statements in step S1 and checks in step S2 whether the current tag is a non-event-generating tag. Non-event-generating tags include the <A> tag shown on line 11 of FIG. 4, the <IMG> tag, and the <LINK> tag. If the tag is a non-event-generating tag, the process proceeds to step S3, where the link destination URL is detected directly and stored.
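
Direct detection for non-event-generating tags (steps S2 and S3) needs no event simulation, since the URL is written in the tag itself. A minimal sketch using the standard library follows; the attribute mapping `URL_ATTR`, the helper `static_links`, and the sample markup and base URL are assumptions of this example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Attribute that carries the link URL for each non-event-generating tag.
URL_ATTR = {"a": "href", "link": "href", "img": "src"}


class StaticLinkParser(HTMLParser):
    """Collect URLs written directly in <A>, <IMG> and <LINK> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        attr = URL_ATTR.get(tag)
        if attr is None:
            return  # not a non-event-generating tag
        value = dict(attrs).get(attr)
        if value:
            # Resolve relative references against the page URL.
            self.links.append(urljoin(self.base_url, value))


def static_links(html, base_url):
    parser = StaticLinkParser(base_url)
    parser.feed(html)
    return parser.links


links = static_links(
    '<A href="next.html">next</A><IMG src="/img/logo.gif">',
    "http://example.com/dir/index.html",
)
```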

If the tag is not a non-event-generating tag in step S2, the process proceeds to step S4 and determines whether it is a <FORM> tag. If so, the process proceeds to step S5 and checks whether the form part is an operation button.

If it is an operation button, the process proceeds to step S6 and checks whether the tag is an <INPUT> tag. If so, in step S7 one event at a time is selected in order from the prepared list of events to generate and is issued; when the corresponding script statement runs, the link destination URL is generated and a page transition occurs.

Step S8 then checks whether a page transition has occurred; if so, the link destination URL is acquired and stored in step S9. In the case of Internet Explorer (R), the page transition check in step S8 is whether BeforeNavigate 62, the event information obtained before communication as shown in FIG. 13, has been received. If it has, the link destination URL is detected from it and stored.

The processing from step S7 is repeated until step S10 finds that all events have been generated. Of all the events generated, only those actually defined on the <INPUT> tag in the HTML function as valid events and generate a link destination URL by running the script statement.

Next, in step S11 of FIG. 17, it is checked whether the form part is a selection list. If so, the process proceeds to step S12 and scans all child tags within the <FORM> tag, such as <INPUT> and <SELECT>.

In step S13, the selection patterns of the <SELECT> tag that is a sibling of the <INPUT> tag are analyzed. In the case of FIGS. 10 to 12 there are three selection patterns. Then, in step S14, the state of the sibling <SELECT> tag is changed according to a selection pattern.

In step S15, one event is selected and issued to the current child tag <INPUT>, and step S16 checks whether a page transition has occurred. If so, the link destination URL is detected and stored in step S17. Step S18 then checks whether all events have been generated, and the processing from step S15 is repeated until they have.

Step S19 then checks whether all selection patterns have been processed. If not, the process returns to step S14, changes the state of the sibling <SELECT> tag to the next selection pattern, and repeats steps S14 to S18.

When all selection patterns have been processed in step S19, the process proceeds to step S20 and checks whether all tags have been processed. If not, it returns to step S1 of FIG. 16, scans to the next tag, and repeats steps S1 to S20 until all tags have been processed.
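
The nested repetition of steps S14 to S19 (selection patterns on the outside, events on the inside) can be sketched as follows. The function name, the `fire` callback that returns a URL when a page transition occurs and `None` otherwise, and the pattern and event names are all hypothetical; skipping duplicate URLs is an added assumption.

```python
def detect_form_links(patterns, events, fire):
    """For each selection pattern, fire each event on the button (S14 to S19)."""
    found = []
    for pattern in patterns:            # S14/S19: switch the <SELECT> state
        for event in events:            # S15/S18: issue one event at a time
            url = fire(pattern, event)  # S16: None means no page transition
            if url is not None and url not in found:
                found.append(url)       # S17: store the link destination URL
    return found


def fake_fire(pattern, event):
    # Hypothetical page: only onclick runs a script, jumping to "<pattern>.html".
    return f"{pattern}.html" if event == "onclick" else None


urls = detect_form_links(
    ["tokyo", "kanagawa", "shizuoka"],
    ["onfocus", "onclick", "onblur"],
    fake_fire,
)
```

With three patterns and three events the button receives nine events in total, of which only the three `onclick` events produce a link destination.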

The present invention also provides an Internet information collection program executed by the Internet information collecting apparatus 10, which is a computer. The program is built with processing procedures that follow the flowcharts of FIGS. 15, 16, and 17.

FIG. 18 is a block diagram of another embodiment of the Internet information collecting apparatus according to the present invention. In this embodiment, the link information detection unit 26 provided in the application execution environment 18 has the function of the FIG. 1 embodiment, namely detecting and storing web information from page transitions caused by link information generated through events from the event generation unit 24. In addition, it detects and stores link destination web information from the proxy server 64 accessed by the browser 20, which serves as the page browsing unit.

This addresses the problem that some URLs cannot be extracted by the functions of the Internet information collecting apparatus 10 of FIG. 1.

URLs cannot be extracted in the embodiment of FIG. 1 in the following cases:
(1) A static link is updated by a user operation.
(2) A Java program, such as a Java applet, performs HTTP communication on its own.
(3) A custom program, such as an ActiveX component, performs HTTP communication on its own.
(4) The apparatus runs on a platform, such as a Unix (R) environment, whose software development kit (SDK) has no BeforeNavigate function like that shown in FIG. 13.

FIG. 19 is an explanatory diagram of case (1) above, a URL that cannot be extracted in the embodiment of FIG. 1 because a user operation updates a static link. In FIG. 19, the HTML source 65 contains a script statement 66 on lines 3 to 5 and a script statement 67 on lines 6 to 8.

The script statement 66 changes the image file to "over.gif" when the cursor passes over the image through a mouse operation or the like. The "over.gif" of this script statement 66 is fetched from the website only when the user operates the mouse, but because this is not a page transition, no BeforeNavigate event is raised. The embodiment of FIG. 1 therefore cannot detect the website URL whose full path includes the file name "over.gif".

The next script statement 67 returns the image file to "out.gif" when the cursor leaves the image. "out.gif" is likewise fetched from the website only when the user operates the mouse; because this is not a page transition, the BeforeNavigate event of the FIG. 1 embodiment is not raised, and the URL cannot be acquired.

In the present invention, as shown in FIG. 18, whenever the Internet information collecting apparatus 10 uses the browser 20 it accesses the websites 14-1 to 14-3 through the proxy server 64. The solution relies on the fact that the proxy server 64 then saves access information to a file for each HTTP request to a website and each HTTP response from it.

That is, after the link information detection unit 26 has finished all processing that detects and stores link destination URLs from page transitions by means of the BeforeNavigate function, it accesses the proxy server 64, obtains the full-path URLs of the transition destinations from the file information saved there, and stores them in the link list table 30.

FIG. 20 is an explanatory diagram of the processing of the FIG. 18 embodiment, which detects and collects URLs from the file of the proxy server. In FIG. 20, when the Internet information collecting apparatus 10 generates a fire event 68 for, say, cursor movement over the image based on the script statement 66 shown in FIG. 19, an HTTP request 72 is sent from the browser 20 to the website 14 through the proxy server 64.

On receiving this HTTP request 72, the website 14 returns the web page 74 with the file name "over.gif" as an HTTP response 78 to the browser 20 through the proxy server 64.

The proxy server 64 saves access information 76 to the file 85 when it forwards the HTTP request 72 to the website 14, and saves access information 80 to the file 85 when it forwards the HTTP response 78 from the website 14 to the browser 20.

The first line of the access information 76 saved for the HTTP request 72 holds the file name "over.gif", and the third line holds the domain name "domain" of the website 14.

Accordingly, the link information detection unit 26 provided in the Internet information collecting apparatus 10 of FIG. 18 refers to the file 85 of the proxy server 64 and detects "http://domain/over.gif" as the full-path link destination URL 84, which starts with "http://" and ends with the file name "over.gif". It then stores this URL as shown in record 82 of the link list table 30.
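
Rebuilding the full-path URL from the saved access information can be sketched as below. The exact layout of the proxy file is shown only in the figures, so this example assumes the saved request looks like an ordinary HTTP request with the file name on the request line and the domain in a Host header on the third line; the function name `url_from_request` and the sample text `LOGGED` are hypothetical.

```python
def url_from_request(request_text):
    """Rebuild a full URL from a logged HTTP request (request line plus Host header)."""
    lines = request_text.strip().splitlines()
    _method, path, _version = lines[0].split()  # e.g. "GET /over.gif HTTP/1.1"
    host = None
    for line in lines[1:]:
        name, _, value = line.partition(":")
        if name.strip().lower() == "host":
            host = value.strip()
            break
    if host is None:
        raise ValueError("no Host header in logged request")
    return f"http://{host}{path}"


# Assumed shape of the access information saved for the HTTP request.
LOGGED = "GET /over.gif HTTP/1.1\nAccept: */*\nHost: domain\n"
full_url = url_from_request(LOGGED)
```

For the assumed log text this yields `http://domain/over.gif`, the same URL the description gives for record 82.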

FIG. 21 is a flowchart of the Internet information collection processing in the embodiment of FIG. 18. Steps S1 to S8 in FIG. 21 are the same as the processing of the FIG. 1 embodiment shown in FIG. 15. In the embodiment of FIG. 18, after steps S1 to S8 are finished, step S9 acquires the full-path URLs from the proxy server 64 and registers them in the link list table.

In this way, for events that cannot be generated by a single momentary operation, such as those produced with a mouse, URLs are acquired as link information from the proxy server that the browser accesses, so that web information on the Internet can be collected without omission from the web page opened for collection.

The present invention includes appropriate modifications that do not impair its objects and advantages, and is not limited by the numerical values given in the embodiments above.

Claims

An Internet information collecting apparatus comprising:
a page browsing unit that acquires a web page on the Internet and expands it on a screen;
a page analysis unit that analyzes the web page expanded by the page browsing unit and extracts an event operation tag statement that dynamically generates link information in response to an event generated by a user operation;
an event generation unit that generates the event for the event operation tag statement extracted by the page analysis unit; and
a link information detection unit that detects and stores link destination web information from a page transition caused by the link information generated through the event generated by the event generation unit.

2. The Internet information collecting apparatus according to claim 1, wherein the link information detection unit further detects and stores link destination web information from a proxy server accessed by the page browsing unit.

The Internet information collecting apparatus according to claim 1, wherein
the page analysis unit extracts an input statement from the range defined by a form statement in the tag statements that build the web page, and
the event generation unit sequentially generates all events defined for the input statement and generates link information through the valid events among them.

An Internet information collection program causing a computer to execute:
a page browsing step of acquiring a web page on the Internet and expanding it on a screen;
a page analysis step of analyzing the web page expanded in the page browsing step and extracting an event operation tag statement that dynamically generates link information in response to an event generated by a user operation;
an event generation step of generating the event for the event operation tag statement extracted in the page analysis step; and
a link information detection step of detecting and storing link destination web information from a page transition caused by the link information generated through the event generated in the event generation step.

A method for collecting Internet information, comprising:
a page browsing step of acquiring a web page on the Internet and expanding it on a screen;
a page analysis step of analyzing the web page expanded in the page browsing step and extracting an event operation tag statement that dynamically generates link information in response to an event generated by a user operation;
an event generation step of generating the event for the event operation tag statement extracted in the page analysis step; and
a link information detection step of detecting and storing link destination web information from a page transition caused by the link information generated through the event generated in the event generation step.