JP2004086841A

JP2004086841A - Apparatus and method for information processing

Info

Publication number: JP2004086841A
Application number: JP2002345885A
Authority: JP
Inventors: Akihiro Okumura; 奥村　晃弘; Tokuji Ikeno; 池野　篤司
Original assignee: Oki Electric Industry Co Ltd
Current assignee: Oki Electric Industry Co Ltd
Priority date: 2002-06-27
Filing date: 2002-11-28
Publication date: 2004-03-18

Abstract

PROBLEM TO BE SOLVED: To reduce the operation load needed, for example, to acquire a structured document from WWW (World Wide Web) sites.

SOLUTION: An information processing apparatus determines a main area of a predetermined structured document that includes a plurality of areas. A reading unit acquires the respective areas of the structured document and their management information a plurality of times in time series. A storage unit stores the areas and the management information acquired by the reading unit. A comparison/inspection unit compares the areas or the management information acquired by the reading unit and, based on the comparison result, checks whether each area has been updated. An update frequency calculation unit calculates predetermined update frequency correspondence information for each area from the history of those inspection results. A main area determination unit determines the main area from among the plurality of areas of the structured document based on the update frequency correspondence information.

COPYRIGHT: (C)2004,JPO

Description

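[0001]
The present invention relates to an information processing apparatus and method, and is applicable, for example, to acquiring structured documents from WWW (World Wide Web) sites.
[0002]
[Prior Art]
A WWW browser is a tool for acquiring and browsing structured documents held on WWW sites. In general, a structured document allows its page layout, character size, and the like to be specified flexibly. In particular, many structured documents are displayed in the WWW browser with the page divided into several areas (frames), such as a title (area A), links to other structured documents (area B), and a body (area C), as shown in FIG. 1. To obtain needed information from such a structured document with a WWW browser, the user must specify the URL of the target document and, once it is displayed in the browser, either scan it visually while scrolling (a manual search) or use a character-string search function. Suppose, for example, that area C of FIG. 1 is the content the user needs. When there are many such structured documents, it is desirable, in order to reduce manual work, to clip only the information the user needs from the plurality of structured documents automatically and present it to the user as a single document. A WWW information extraction system of this kind is disclosed in Patent Document 1 below.
[0003]
[Patent Document 1]
JP-A-10-187753
[0004]
[Problems to Be Solved by the Invention]
However, the WWW information extraction system described above requires the user to specify in advance, by manual input, the start and end positions of the data the user needs within each structured document. The operation burden on the user is therefore too large for the system to be practical across a large number of structured documents.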
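[0005]
[Means for Solving the Problems]
To solve this problem, the first invention is an information processing apparatus that determines a main area from within a predetermined structured document including a plurality of areas, comprising: (1) a reading unit that acquires each area in the structured document, or management information of each area, a plurality of times in time series; (2) a storage unit that stores the areas or management information acquired by the reading unit; (3) a comparison/inspection unit that compares corresponding areas, or corresponding management information, acquired by the reading unit and checks, based on the comparison result, whether each area has been updated; (4) an update frequency calculation unit that calculates predetermined update frequency correspondence information for each area based on the history of inspection results from the comparison/inspection unit; and (5) a main area determination unit that determines the main area from among the plurality of areas in the structured document based on the update frequency correspondence information.
[0006]
The second invention is an information processing method for determining a main area from within a predetermined structured document including a plurality of areas, wherein: (1) a reading unit acquires each area in the structured document, or management information of each area, a plurality of times in time series; (2) a storage unit stores the areas or management information acquired by the reading unit; (3) a comparison/inspection unit compares corresponding areas, or corresponding management information, acquired by the reading unit and checks, based on the comparison result, whether each area has been updated; (4) an update frequency calculation unit calculates predetermined update frequency correspondence information for each area based on the history of inspection results from the comparison/inspection unit; and (5) a main area determination unit determines the main area from among the plurality of areas in the structured document based on the update frequency correspondence information.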
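[0007]
[Embodiments of the Invention]
(A) Embodiments
Embodiments of the information processing apparatus and method according to the present invention are described below.
[0008]
(A-1) Configuration of the First Embodiment
In this embodiment, the function of the area processing unit 25 (see FIG. 16), which determines and extracts the main area, can be realized by a personal computer or other information processing apparatus having a communication function. It could also be placed on the WWW server side, but the case where it is placed on the communication terminal (client) side is described here as an example.
[0009]
FIG. 15 shows an example of the overall configuration of the communication system 10 according to this embodiment.
[0010]
In FIG. 15, the communication system 10 includes a network 11, a communication terminal 12, and a WWW server 13.
[0011]
The network 11 may be a LAN (local area network) or the like, but is assumed here to be the Internet.
[0012]
The WWW server 13 is a server that, on receiving a request (HTTP request) from the communication terminal 12, returns the files making up a WWW page as the response (HTTP response) to that request. In many cases the WWW server 13 is accompanied by a database (not shown) that stores WWW pages generated in advance and by a database server that directly manages that database. Various network devices such as routers and firewalls, and servers such as DNS servers, are usually arranged around the WWW server 13 and the database server to make up a WWW site.
[0013]
The communication terminal 12 is an information processing apparatus provided with the area processing unit 25 described above; specifically, it may be a personal computer with a network function. In the configuration of this embodiment, the communication terminal 12 must be equipped with a WWW browser B1 (see FIG. 16), a program for browsing WWW pages.
[0014]
FIG. 16 shows an example of the internal configuration of the communication terminal 12.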
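[0015]
(A-1-1) Example of the Internal Configuration of the Communication Terminal
In FIG. 16, the communication terminal 12 includes a communication unit 20, a control unit 21, an operation unit 22, a storage unit 23, a display unit 24, and the area processing unit 25.
[0016]
The communication unit 20 communicates with the WWW server 13 via the network 11.
[0017]
In hardware terms, the control unit 21 corresponds to the central processing unit (CPU) of the communication terminal 12; in software terms, it corresponds to the operating system (OS), the WWW browser B1 described above, and the like.
[0018]
The operation unit 22 is operated by the user U1 of the communication terminal 12 to give instructions to the control unit 21 and has, for example, a keyboard and a pointing device.
[0019]
The display unit 24 has a display screen such as a liquid crystal display. When the user U1 browses a WWW page, the WWW browser B1 interprets and processes the tags of the page, and the resulting content is displayed on the display unit 24 for the user U1 to view. The WWW page displayed at this time may be, for example, DP1 shown in FIG. 1. To display a page made of frames (a frame page) such as DP1, the WWW browser B1 must support frames. A frame is not the content displayed in each of areas A to C in FIG. 1 but the container that holds that content. The frame page of FIG. 1 is generated from, for example, the HTML files of FIGS. 4(A) to 4(D).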
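[0020]
Normally one WWW page consists of a single basic HTML file and, as needed, one or more other files (image files and the like). A frame page such as DP1 has more files than this and a more complex structure.
[0021]
That is, a frame page requires at minimum an HTML file that defines the configuration of the whole WWW page, namely the frame structure (the frame definition file), and a plurality of HTML files placed in the frames as content. Various files (image files and the like) linked from each HTML file are added to these as appropriate.
[0022]
Therefore, even assuming for simplicity that there are no such additional files, the frame page DP1 shown in FIG. 1 consists of four HTML files in total: the HTML file of FIG. 4(A) (the frame definition file), the HTML file of FIG. 4(B) placed in area A, the HTML file of FIG. 4(C) placed in area B, and the HTML file of FIG. 4(D) placed in area C.
[0023]
An ordinary WWW page is structured only within a single HTML file. A frame page is structured not only within each HTML file but also across the plurality of HTML files that make up the page.
[0024]
In FIG. 1, boundary lines L1 and L2 (including scroll bars and the like) are displayed between areas A to C. In actual frame pages, however, such boundary lines are often deliberately hidden for visual effect, so that different areas share exactly the same background color or show a seamless, continuous background pattern. Whether boundary lines are displayed is therefore unrelated to the essence of the area division (frame structure).
[0025]
The frame structure, that is, how many frames one screen is divided into, how the ratio of the side lengths of each frame is set (this ratio corresponds to the area of each frame), and so on (including whether boundary lines are shown or hidden), is determined by the description in the frame definition file (for example, DP11).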
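[0026]
A user U1 who wants to browse the frame page DP1 enters the URL of the frame definition file DP11 (here called URL11) into the WWW browser B1 of the communication terminal 12. An HTTP request for the frame definition file DP11 is therefore sent from the communication terminal 12 to the WWW server 13, and the HTTP response returns the frame definition file DP11 as the entity body together with various HTTP headers (including entity headers).
[0027]
When the return of an entity body, that is, a file such as an HTML file or an image file, is requested, the HTTP request is a GET request using the GET method.
[0028]
The other HTML files DP12 to DP14 are received by the communication terminal 12 after the frame definition file DP11 has been received, as the HTTP responses to the HTTP requests that the WWW browser B1 sends automatically, one after another, based on the description in the frame definition file DP11 (URL12 to URL14).
[0029]
These four HTML files are then processed and formatted, and a screen such as the one shown in FIG. 1 is displayed on the display unit 24.
[0030]
Let URL12 be the URL of the HTML file DP12, URL13 the URL of the HTML file DP13, and URL14 the URL of the HTML file DP14. Then URL12 is "title.html" in line TG12 of FIG. 4(A), URL13 is "menu.html" in line TG13, and URL14 is "main.html" in line TG14.
[0031]
The HTML files that make up one frame page are normally placed on the same WWW server as the frame definition file (here, 13), and in the same folder, so they can be specified by local URLs like these that contain no FQDN (here, URLs consisting of a file name only).
[0032]
If necessary, the HTML files that make up one frame page can be placed on different WWW servers, in which case all or some of URL12 to URL14 become URLs that contain an FQDN. Naturally, URL11, which specifies the frame definition file DP11 and may be entered by the user U1, is a URL that contains an FQDN.
[0033]
URL11 can be entered manually by the user U1 through the operation unit 22. It can also be entered automatically, at dates, times, or intervals set in advance by the user U1, by software such as the autopilot tool described in Patent Document 1.
[0034]
In a frame page with a menu such as that of FIG. 1, if the menu file has content like that of FIG. 4(C), in which every target is "main" (the frame name of area C), then when the user U1 selects a hyperlink in the menu, the file that the hyperlink points to replaces the content of area C.
[0035]
In typical use, therefore, a frame page with a menu can be seen as a layered set of WWW pages in which the content of the other frames (here, areas A and B) stays the same and only the content of the target frame (area C) is replaced.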
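[0036]
The storage unit 23 shown in FIG. 16 consists of, for example, a volatile storage device such as RAM (random access memory) and a nonvolatile storage device such as a hard disk.
[0037]
When the communication terminal 12 receives the files that make up a WWW page (for example, DP11 and DP12) from the WWW server 13, those files are stored temporarily in a cache area reserved on the hard disk of the storage unit 23.
[0038]
The cache area is normally managed by the WWW browser B1, which can access it freely.
[0039]
Files are kept in the cache area for as long as possible, but the capacity of the cache area is limited. When new WWW pages are browsed and new files must be stored beyond this limit, the necessary capacity is secured by deleting stored files, for example in order from the oldest.
[0040]
When the user U1 instructs the terminal through the operation unit 22 to browse a WWW page, for example by entering a URL, and the files for the page specified by that URL are stored in the cache area, the WWW browser B1 obtains the files from the cache area rather than over the Internet 11 and displays their content on the display unit 24. This suppresses increases in communication traffic on the network 11 and in the load on the WWW server 13, and shortens the response time seen by the user U1 (the time from issuing the instruction until the WWW page is displayed).
[0041]
The area processing unit 25, which is connected to the control unit 21, determines and extracts the main area. Its internal configuration is, for example, as shown in FIG. 2.
[0042]
The main area is the one area, among the plurality of areas on a WWW page, that can be presumed to be most important to the user (here, U1). This embodiment basically assumes that the area updated most frequently is the main area. For the frame page shown in FIG. 1, for example, the main area is whichever one of areas A to C can be presumed to be most important to the user U1 (the one updated most frequently). Also, in line with the typical use of a frame page with a menu, a main area may be determined for each of the frame pages in which only the content of the target frame (for example, area C) has been replaced.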
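[0043]
(A-1-2) Example of the Internal Configuration of the Area Processing Unit
In FIG. 2, the area processing unit 25 has a reading unit 101, a buffer unit 102, a division unit 103, a division result storage unit 104, an area content comparison unit 105, an update frequency calculation unit 106, an update frequency storage unit 107, and a determination unit 108.
[0044]
The reading unit 101 reads information N10 on the frame page at a specified URL (for example, URL11) into the buffer unit 102, either periodically or irregularly (for example, when the user U1 instructs a read). The read may be made from the cache area in the storage unit 23 or the like, but is normally made from the WWW server 13 over the Internet 11.
[0045]
To make such reads over the Internet 11 possible, the reading unit 101 must function as part of the WWW browser B1, cooperate with the WWW browser B1, or have the function of an HTTP client independent of the WWW browser B1.
[0046]
The content of the file at the time the reading unit 101 reads it may be, for example, like the composite file SP1 shown in FIG. 5.
[0047]
In FIG. 5, the composite file SP1 combines the contents of the HTML files DP11 to DP14 shown in FIGS. 4(A) to 4(D). The content of the frame definition file DP11, however, is split into four parts PT1 to PT4 and distributed among them.
[0048]
If the WWW server 13 kept track of the fact that files DP11 to DP14 belong to one frame page, and generated one composite file SP1 from the four files DP11 to DP14 when it received an HTTP request for the frame definition file DP11, it could return a file like SP1 as the entity body of a single HTTP response. This is disadvantageous, however, in terms of the load on the WWW server 13 and compatibility with caching systems, so it is preferable that the WWW server 13 not perform such processing.
[0049]
When the WWW server 13 does not perform such processing, browsing the frame page DP1 shown in FIG. 1 requires the communication terminal 12 to send four HTTP requests in total, to which the WWW server 13 returns four HTTP responses in turn. The entity bodies of three of those HTTP responses are files DP12 to DP14.
[0050]
In most cases, therefore, a file like SP1 is obtained only through processing (formatting) performed after the communication terminal 12 has received all of the files DP11 to DP14.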
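[0051]
The buffer unit 102 temporarily stores the content read by the reading unit 101 and the results of processing by the division unit 103. The storage resources that implement the buffer unit 102, the division result storage unit 104, the update frequency storage unit 107, and the like may be reserved separately from the storage unit 23 or within it.
[0052]
The division unit 103 analyzes the frame page read into the buffer unit 102 and divides the content of the buffer unit 102 into areas based on a document structure specified in advance. In a frame page, however, the areas on the page (for example, areas A to C) correspond to different HTML files and are separate files from the start (that is, before the formatting is performed). If a file system capable of ordinary file management is available, the division may therefore be unnecessary. File management is normally the role of the OS, so the division unit 103 need only repeat the process of asking the OS for, for example, the next file in the buffer unit 102 and receiving the file in response.
[0053]
Division is needed, however, when the interface between the WWW browser B1 and the area processing unit 25 (the reading unit 101) does not allow the unformatted files DP11 to DP14 to be received and only the formatted composite file SP1 is available.
[0054]
Division is, for example, the process of removing parts PT1 to PT4 from the composite file SP1 shown in FIG. 5 to obtain the three files DP12, DP13, and DP14 shown in FIG. 6.
[0055]
The division unit 103 supplies the files it has divided (or received from the OS), for example DP12 to DP14, to the division result storage unit 104.
[0056]
The division result storage unit 104 stores the files N11 supplied by the division unit 103 the previous time. These files N11 are stored so that they can be compared with the corresponding files N12 supplied this time and held in the buffer unit 102, so they must of course be stored in a form that shows which previous file corresponds to which current file. For example, the previous HTML file for area C and the current HTML file DP14 for area C must not be confused with other HTML files (such as the HTML file DP13 for area B).
[0057]
This correspondence can be established using, for example, the file name of each HTML file (for example, URL12 to URL14) or a unique file identifier (area number) assigned to each HTML file inside the system.
[0058]
The area content comparison unit 105 compares, file by file (that is, area by area), the files N11 supplied the previous time and stored in the division result storage unit 104 with the files supplied this time and stored in the buffer unit 102, detects whether each has been updated, and outputs the result as update presence information N16. An update means that part or all of the content of a file has been added, deleted, or changed.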
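[0059]
The update frequency calculation unit 106 calculates the current update frequency S from the update presence information N16 and the update frequency up to the previous time, using a predetermined formula, and outputs update frequency information N17 corresponding to S. The update frequency S is calculated for each file (each area). The update frequency could be a value that increases as the file is updated more often, but this embodiment uses a value that decreases as the file is updated more often.
[0060]
Formulas of various forms can be used. This embodiment uses the following formula (1), an exponential average.
[0061]
S = S_0·α + P·(1 − α)   … (1)
Here S_0 is the update frequency up to the previous time, and α is a constant with 0 < α < 1; here α = 0.8. P is a score that takes the value 100 or 0. As described later, P is set to 0 when an update has occurred and to 100 when it has not, so (taking S_0 to be positive) the value of S decreases each time an update occurs.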
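As a minimal sketch of formula (1), the following Python function applies one step of the exponential average. The function and parameter names are illustrative assumptions, not names from the specification; the two sample inputs (S_0 = 73 with no update, S_0 = 46 with an update) are the values used later in the description of FIG. 9.

```python
def update_frequency(previous_s: float, updated: bool, alpha: float = 0.8) -> float:
    """One step of formula (1): S = S_0 * alpha + P * (1 - alpha).

    P is 0 when the area was updated and 100 when it was not,
    so a lower S means the area is updated more often.
    """
    score = 0.0 if updated else 100.0  # the score P
    return previous_s * alpha + score * (1.0 - alpha)


# Area not updated, S_0 = 73: 73 * 0.8 + 100 * 0.2
unchanged = update_frequency(73, updated=False)
# Area updated, S_0 = 46: 46 * 0.8 + 0 * 0.2
changed = update_frequency(46, updated=True)
print(round(unchanged), round(changed))
```

With α = 0.8 the previous value dominates each step, so a single observation moves S by at most 20 points.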
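[0062]
An HTTP response to a GET request returns various HTTP headers in addition to the entity body (the file), so it is also desirable for the update frequency calculation unit 106 to use information in the HTTP headers.
[0063]
For example, it is desirable to use the update date information and the expiry information contained in the entity header, one of the HTTP headers.
[0064]
The update date information indicates the date on which the file was updated on the WWW server 13.
[0065]
The expiry information sets the period for which a file may be cached (in the cache area, a cache server, or the like) and is used to keep the content of the provided file current. It is normally set to suit the WWW server 13 side, for example by the creator of the file (the creator of the WWW page). Once the expiry time has passed, the communication terminal 12 (or a cache server between the communication terminal 12 and the WWW server 13) deletes the file from the cache area, even if there is spare capacity, and obtains the original file from the WWW server (here, 13). A short expiry time therefore has the advantage of making it more likely that the user (here, U1) receives the latest content even when updates are frequent. However, a short expiry time also reduces the benefit of caching and increases the load on the WWW server 13, so the expiry time is normally set as long as possible.
[0066]
For this reason, the expiry information can also be regarded as information (update schedule information) indicating how often the creator of the WWW page plans to update the file, or how often the creator expects it to be updated (the latter applies, for example, to a WWW page whose file is updated by third parties, such as an electronic bulletin board (CGI bulletin board), although bulletin boards are normally set to prohibit caching). The creator of a WWW page should be thoroughly familiar with his or her own plans and with the nature of the page, which the user U1 normally cannot know, so the expiry information can be considered highly reliable as update schedule information.
[0067]
The update frequency calculation unit 106 can use such expiry information in various ways. For example, when calculating the update frequency S, it is desirable to apply a weighting such that the shorter the expiry time indicated by the expiry information, the further S is shifted toward the high-frequency side.
[0068]
The update frequency storage unit 107 receives the update frequency information N17 corresponding to the update frequency S from the update frequency calculation unit 106 and stores it for each area.
[0069]
The determination unit 108 determines, based on the update frequency information N17 stored in the update frequency storage unit 107, that the area whose content changes most frequently is the main area, retrieves the content of that area from the buffer unit 102, and outputs it. The main area can be output to various destinations; one example is the WWW browser B1. If the WWW browser B1 needs it in order to display the main area on the display unit 24, the frame definition file DP11 should be output to the WWW browser B1 together with the main area.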
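[0070]
FIG. 17 shows an example of the internal configuration of the WWW server 13, which returns HTTP responses to the HTTP requests.
[0071]
(A-1-3) Example of the Internal Configuration of the WWW Server
In FIG. 17, the WWW server 13 includes a communication unit 30, a control unit 31, and a storage unit 32.
[0072]
The communication unit 30 corresponds to the communication unit 20, the control unit 31 to the control unit 21, and the storage unit 32 to the storage unit 23, so a detailed description is omitted.
[0073]
The control unit 31, however, carries WWW server software rather than a WWW browser (such as B1).
[0074]
When the WWW server 13 is accompanied by a database that stores WWW pages generated in advance, a DBMS may also be installed in the control unit 31 as needed.
[0075]
To provide the frame page DP1, the storage unit 32 stores at least the HTML files DP11 to DP14.
[0076]
The operation of this embodiment, configured as above, is described below with reference to the flowcharts of FIGS. 3, 7, and 8.
[0077]
FIG. 3 is a flowchart for updating information such as the update frequency information N17 and consists of steps S101 to S104. FIG. 7 is a flowchart for determining and outputting the content of the main area based on the update frequency information N17 and the like, and consists of steps S101, S102, and S105.
[0078]
FIG. 8 is a flowchart of the calculation of the update frequency S and the processing around it, and consists of steps S151 to S160. The flowchart of FIG. 8 shows the detailed operation of step S103 in FIG. 3.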
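[0079]
(A-2) Operation of the First Embodiment
When URL11 is entered into the WWW browser B1, either by the user U1 operating the operation unit 22 or by a function such as the autopilot tool, the WWW browser B1 sends an HTTP request (GET request) corresponding to URL11.
[0080]
If the reading unit 101 is given the function of an HTTP client independent of the WWW browser B1, this input is of course made to the reading unit 101 rather than to the WWW browser B1.
[0081]
On receiving this HTTP request from the communication terminal 12 over the Internet 11, the WWW server 13 (and the server OS) retrieves the frame definition file DP11 specified by URL11 from the storage unit 32 and returns an HTTP response containing the frame definition file DP11 as its entity body.
[0082]
As described above, the other HTML files DP12 to DP14 that make up the frame page DP1 are received by the communication terminal 12 after the frame definition file DP11 has been received, as the entity bodies of the HTTP responses to the HTTP requests that the WWW browser B1 sends automatically, one after another, based on the description in the frame definition file DP11 (URL12 to URL14). The reading unit 101 reads these files.
[0083]
If the WWW server 13 generates the composite file SP1, however, an HTTP response containing the composite file SP1 as its entity body is returned.
[0084]
As shown in FIG. 3, the reading unit 101 in the area processing unit 25 of the communication terminal 12 reads the file (DP11 or SP1), alone or in cooperation with the WWW browser B1, and stores it in the buffer unit 102 (S101). When reading alone, the reading unit 101 reads the entity body of the HTTP response as is. When reading in cooperation with the WWW browser B1, it reads the result of processing by the WWW browser B1.
[0085]
In either case, the file read by the reading unit 101 at this point can be either the frame definition file DP11 shown in FIG. 4(A) or the composite file SP1 shown in FIG. 5.
[0086]
If the file read by the reading unit 101 is the composite file SP1, it is divided (S102). If it is the frame definition file DP11, no division is needed.
[0087]
In the case of the frame definition file DP11, however, the communication terminal 12 must send HTTP requests one after another to obtain the three HTML files DP12 to DP14 and must read the HTML files as they are received.
[0088]
Whether or not the composite file SP1 is used, the HTML files DP12 to DP14 can be identified and matched by the area numbers as well as by the file names.
[0089]
Here, as an example, the HTML file DP12 is assigned area number 1 and called area 1, the HTML file DP13 is assigned area number 2 and called area 2, and the HTML file DP14 is assigned area number 3 and called area 3.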
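[0090]
Next, in step S103, the area content comparison unit 105 compares the files read this time (and divided, if necessary) with the files read the previous time to detect whether each has been updated. Based on that result, the update frequency calculation unit 106 calculates the update frequency S, and the update frequency storage unit 107 stores the calculated S.
[0091]
The content of the files DP12 to DP14 read (divided) this time is then stored in the division result storage unit 104, replacing the files read the previous time, in preparation for the next read (S104).
[0092]
Steps S101 to S104 are repeated a plurality of times as needed. If an update is detected only a few times (for example, once), the file may simply have happened to be updated at that moment. To rule out such coincidences, establish whether the update frequency is truly high, and obtain the intended effect, it is of course preferable to repeat the steps many times and to detect updates over a long period.
[0093]
Using the expiry information (update schedule information) as needed may make it possible to rule out coincidence even when the number of repetitions is relatively small and the detection period relatively short.
[0094]
When the number of repetitions is small and the detection period is short, a file that has been updated recently is more likely to be chosen as the main area. If it can be assumed that the area most important to the user U1 is the one updated most recently, then an important area can be selected even with few repetitions and a short detection period.
[0095]
FIG. 8, which shows step S103 in detail, uses the area numbers to identify the files DP12 to DP14.
[0096]
In FIG. 8, steps S152 to S159 are repeated for each file (each area number), based on the area numbers assigned to the files DP12 to DP14 stored in the buffer unit 102.
[0097]
In step S152, the file read (divided) the previous time and the file read (divided) this time, that is, the file stored in the division result storage unit 104 and the file stored in the buffer unit 102, are matched by area number and compared (S153).
[0098]
If the comparison shows that the content is the same, that is, the file has not been updated, step S153 branches to Yes, the score P in formula (1) is set to 100 (S155), and the update frequency S is calculated from formula (1) (S157). The value of S calculated here replaces the update frequency S_0 that was calculated the previous time and stored in the update frequency storage unit 107.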
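[0099]
If the comparison shows that the content of area 1 is the same as the previous time, and the previous update frequency of area 1 is S_0 = 73, then, as shown in FIG. 9, formula (1) gives a current update frequency S of 78 (≈ 73 × 0.8 + 100 × (1 − 0.8)) for area 1.
[0100]
Likewise, if the content of area 2 is the same as the previous time and the previous update frequency of area 2 is S_0 = 73, the current update frequency S of area 2 is also 78.
[0101]
If, on the other hand, the comparison in step S153 shows that the content is not the same, that is, the file has been updated, step S153 branches to No, and whether the division count is unchanged is checked (S154). The division count is the number of files, other than the frame definition file DP11, that make up the frame page DP1. Step S154 therefore checks whether the frame structure itself (in particular, the number of frames) has changed.
[0102]
Whether the frame structure has changed can also be checked without examining the division count, by directly analyzing the description in the frame definition file DP11 (for example, by counting the occurrences of the character string "<FRAME src").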
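The direct check on the frame definition file can be sketched as follows. This is only an illustration under stated assumptions: the regular expression, the function name `count_frames`, and the sample markup are hypothetical, and the sample is merely modeled on the file names title.html, menu.html, and main.html mentioned for FIG. 4(A), not copied from the figure.

```python
import re

# Matches an opening <frame ...> tag that carries a src attribute.
# The \b after "frame" keeps <frameset ...> from being counted.
FRAME_TAG = re.compile(r"<frame\b[^>]*\bsrc\s*=", re.IGNORECASE)


def count_frames(frame_definition_html: str) -> int:
    """Count the frames declared in a frame definition file."""
    return len(FRAME_TAG.findall(frame_definition_html))


# Hypothetical frame definition file with three frames.
sample = """
<html>
<frameset rows="20%,*">
  <frame src="title.html" name="title">
  <frameset cols="25%,*">
    <frame src="menu.html" name="menu">
    <frame src="main.html" name="main">
  </frameset>
</frameset>
</html>
"""

previous_count = 3
structure_changed = count_frames(sample) != previous_count
print(count_frames(sample), structure_changed)
```

A regular expression is enough for a count like this; a stricter check could parse the markup instead.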
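[0103]
As noted above, the concept of frame structure covers not only the number of frames but also the ratio of the side lengths of each frame and so on, but step S154 is concerned only with the number of frames. Therefore, even if the description in the frame definition file DP11 changes and the ratio of the side lengths of the frames changes, step S154 branches to Yes as long as the number of frames is the same. If necessary, of course, step S154 may be made to branch to No whenever the description in the frame definition file DP11 changes at all.
[0104]
For example, if any of URL12 to URL14 described in lines TG12 to TG14 changes, this means not that the content of the same file was updated but that the file placed in that frame was changed to a different file, so step S154 may be made to branch to Yes.
[0105]
As an example, if a new area such as area 4 shown in FIG. 10 (an area assigned area number 4) appears in the frame page DP1, which until then had only the three areas 1 to 3, step S154 branches to No, the update frequency S of area 4 is set to 0 (S158), and processing proceeds to step S159. This 0 is the initial value of the update frequency: because the frame structure of the frame page DP1 itself has changed, there is considered to be no area left to compare against, and the update frequency is reset to its initial value.
[0106]
If the division count is the same and step S154 branches to Yes, the score P for that area is set to 0 (S156), and processing proceeds to step S157.
[0107]
Here, as shown in FIG. 9, if step S154 branches to Yes in the processing of area 3 and the previous update frequency of area 3 is S_0 = 46, formula (1) gives a current update frequency S of 37 (≈ 46 × 0.8 + 0 × (1 − 0.8)) for area 3.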
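The per-area branching described for FIG. 8 can be sketched as below. This is an illustrative reading, not the specification's implementation: the function name, the dictionaries keyed by area number, and the sample contents are assumptions, and treating an area with no previous copy as "content differs" is one interpretation of the case where a new area appears.

```python
ALPHA = 0.8      # constant alpha of formula (1)
INITIAL_S = 0.0  # initial update frequency used on a structure change


def update_scores(previous, current, scores):
    """Return new update frequencies for every area read this time.

    previous, current: dict of area number -> file content
    scores: dict of area number -> previously stored S_0
    """
    same_division_count = len(previous) == len(current)  # check of S154
    new_scores = {}
    for area, content in current.items():
        if area in previous and previous[area] == content:
            p = 100                        # S153 Yes -> S155
        elif same_division_count:
            p = 0                          # S154 Yes -> S156
        else:
            new_scores[area] = INITIAL_S   # S154 No -> S158 (reset)
            continue
        s0 = scores.get(area, INITIAL_S)
        new_scores[area] = s0 * ALPHA + p * (1 - ALPHA)  # S157
    return new_scores


previous = {1: "title", 2: "menu", 3: "news v1"}
current = {1: "title", 2: "menu", 3: "news v2"}
stored = {1: 73, 2: 73, 3: 46}
result = update_scores(previous, current, stored)
print({area: round(s) for area, s in result.items()})
```

Run on the values from the FIG. 9 example, the unchanged areas 1 and 2 move from 73 toward 100, while the updated area 3 moves from 46 toward 0.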
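[0108]
The update frequencies S of areas 1 to 3 stored in the update frequency storage unit 107 are updated one or more times in this way. Determining the main area from the update frequencies S and outputting it follows a flowchart like that of FIG. 7.
[0109]
In FIG. 7, the steps given the same reference signs as in FIG. 3, S101 and S102, perform the same processing as in FIG. 3.
[0110]
FIG. 7 therefore differs from FIG. 3 only in step S105.
[0111]
In step S105, the determination unit 108 determines the main area of the frame page DP1 based on the update frequencies S stored in the update frequency storage unit 107 at that time and outputs it. For example, if the update frequency S of the HTML file DP14, which corresponds to area 3, has the value that indicates the most frequent updating, the HTML file DP14 is selected as the main area and output.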
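The determination in step S105 can be sketched as a selection of the minimum, since in this embodiment a smaller S means more frequent updating. The function name and the dictionary of scores are illustrative assumptions; returning every tied area is one possible way to handle areas with equal values.

```python
def main_areas(scores):
    """Return the area numbers whose update frequency S is lowest.

    In this embodiment a lower S means more frequent updates, so the
    minimum identifies the main area. More than one area is returned
    when several areas share the same lowest value.
    """
    best = min(scores.values())
    return [area for area, s in scores.items() if s == best]


print(main_areas({1: 78, 2: 78, 3: 37}))
```

A caller that needs exactly one main area would have to break ties by some further criterion of its own.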
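[0112]
If necessary, when several areas have the same update frequency value, one main area may be identified using the expiry information or the like. Also, if necessary, several areas may be output as main areas, or one area may be output as the main area while the user U1 is told that other areas show the same update frequency value as the main area.
[0113]
If a plurality of URLs specifying frame definition files, like URL11, are set in software such as the autopilot tool in addition to the dates, times, and intervals, the same processing can be performed on many frame pages in parallel. This makes it possible to select and output only the main areas from many frame pages, so that the user U1 can efficiently grasp the key points of many frame pages with little effort and time.
[0114]
In that case, however, the division result storage unit 104 and the update frequency storage unit 107 must store their information organized by frame page.
[0115]
In this embodiment, unlike Patent Document 1, the user U1 need not specify the start and end positions of the data in advance by manual input, so the operation burden on the user U1 is reduced.
[0116]
Because this embodiment can determine the main area automatically, it becomes easy to build services and systems such as (i) notification when a specified web page is updated (for example, without notifying updates outside the main area), (ii) search (for example, excluding everything outside the main area from the search target), and (iii) summarization (for example, summarizing only the main area).
[0117]
(A-3) Effects of the First Embodiment
This embodiment reduces the operation burden on the user (U1) in outputting the main area selected from a frame page. It is particularly effective when only the main areas are to be selected and output from many frame pages.
[0118]
This embodiment can basically be executed without natural language processing, so the main area can be determined independently of the language in which the content is written.
[0119]
Furthermore, although this embodiment analyzes the frame page, only the document structure specified in advance needs to be processed, so the amount of processing is smaller than for a full analysis.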
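[0120]
(B) Second Embodiment
Only the points in which this embodiment differs from the first embodiment are described below.
[0121]
In the first embodiment, the file content itself is stored and compared to detect updates. This embodiment uses a checksum of the file content in place of the file content.
[0122]
(B-1) Configuration and Operation of the Second Embodiment
This embodiment differs from the first embodiment only in the internal configuration of the area processing unit 25, so the configurations shown in FIGS. 15, 16, and 17 can be used as they are. The area processing unit of this embodiment is given reference numeral 35 to distinguish it from the area processing unit 25 of the first embodiment.
[0123]
An example of the internal configuration of the area processing unit 35 is shown in FIG. 12.
[0124]
In FIG. 12, the area processing unit 35 has a reading unit 101, a buffer unit 102, a division unit 103, an update frequency calculation unit 106, an update frequency storage unit 107, a determination unit 108, a checksum calculation unit 201, a checksum storage unit 202, and a checksum comparison unit 203.
[0125]
The components and signals (items of information) given the same reference signs as in FIG. 2, namely 101 to 103, 106 to 108, N10 to N12, and N17 to N19, function as in the first embodiment, so a detailed description is omitted.
[0126]
The checksum storage unit 202 corresponds to the division result storage unit 104, and the checksum comparison unit 203 corresponds to the area content comparison unit 105.
[0127]
Instead of the file content (the division result), the checksum storage unit 202 stores checksum information N30 indicating the checksum that the checksum calculation unit 201 has calculated from that file content. A checksum is normally far smaller than the file content, so the checksum storage unit 202 can have a smaller capacity than the division result storage unit 104.
[0128]
The checksum comparison unit 203 therefore detects whether each file has been updated by comparing the checksum of the file read the previous time with the checksum of the file read this time, and outputs update presence information N33 according to the result.
[0129]
The checksum information N31, which indicates the checksum of the file read this time, is also calculated by the checksum calculation unit 201 from the content of the file read this time and stored in the buffer unit 102.
[0130]
Detecting updates from checksums can produce false results in which a file that has actually been updated is judged not to have been updated. The probability of such false results can be reduced by, for example, increasing the size (number of bits) of the checksum.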
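A minimal sketch of digest-based update detection follows. It uses SHA-256 from Python's standard `hashlib` module, which is a hash function rather than a simple checksum; the specification only says that a checksum or a similar code is used, so the choice of algorithm, the function names, and the sample data are all assumptions made for illustration.

```python
import hashlib


def digest(content: bytes) -> str:
    """Return a fixed-size fingerprint of a file's content."""
    return hashlib.sha256(content).hexdigest()


def detect_updates(stored_digests, current_files):
    """Compare stored fingerprints with the files read this time.

    stored_digests: dict of area number -> fingerprint from the previous read
    current_files:  dict of area number -> file content (bytes) read this time
    Returns (updated flags per area, fingerprints to store for next time).
    An area with no stored fingerprint is reported as updated.
    """
    updated = {}
    next_digests = {}
    for area, content in current_files.items():
        value = digest(content)
        updated[area] = stored_digests.get(area) != value
        next_digests[area] = value
    return updated, next_digests


first = {1: b"title", 2: b"menu", 3: b"news v1"}
_, stored = detect_updates({}, first)
second = {1: b"title", 2: b"menu", 3: b"news v2"}
flags, stored = detect_updates(stored, second)
print(flags)
```

Only the 64-character fingerprints are kept between reads, which reflects the storage saving described for the checksum storage unit.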
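[0131]
The operation of this embodiment is shown in the flowcharts of FIGS. 7, 13, and 14. The flowchart of FIG. 7 is the same one used in the first embodiment.
[0132]
The flowchart of FIG. 13 corresponds to that of FIG. 3. The steps given the same reference signs, S101 and S102, perform the same processing as in the first embodiment, so a detailed description is omitted.
[0133]
In FIG. 13, step S201 corresponds to step S103 of FIG. 3, and step S202 corresponds to step S104 of FIG. 3.
[0134]
Step S201 differs from S103 only in that updates are detected from the checksum rather than from the file content.
[0135]
In step S202, the checksum of the file read this time is stored in the checksum storage unit 202 in preparation for the next read.
[0136]
The flowchart of FIG. 14 corresponds to that of FIG. 8. The steps given the same reference signs, S154 to S160, perform the same processing as in the first embodiment, so a detailed description is omitted.
[0137]
Steps S251 to S253 likewise differ from the first embodiment only in that, as already described, the checksum is used in place of the file content.
[0138]
(B-3) Effects of the Second Embodiment
This embodiment provides effects equivalent to those of the first embodiment.
[0139]
In addition, the storage needed by the checksum storage unit (202) in this embodiment is smaller than that of the corresponding part in the first embodiment, so storage resources are saved.
[0140]
Also, if the checksum is smaller than the file, read and write accesses to memory and other storage resources take less time, so the processing time may be shortened.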
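[0141]
(C) Other Embodiments
The first embodiment stores the divided result for comparison with the previous read, but the content may instead be stored as read and divided again before the comparison.
[0142]
The divided areas were matched according to the order in which they are described in the composite file SP1. As already noted, however, information that identifies each area within the frame page (for example, URL12 to URL14) is attached, so that information may be used instead.
[0143]
The second embodiment was described as using a checksum, but a code belonging to an error detection scheme other than a checksum can also be used. A value obtained by transforming the file content with a hash function (a hash value) may also be used in place of the checksum.
[0144]
The first and second embodiments were described only for the case where a GET request (GET method) is used as the HTTP request, but a HEAD request (HEAD method) may also be used.
[0145]
The HTTP response to a HEAD request is the same as the HTTP response to a GET request except that it contains no entity body (file). By sending a HEAD request, therefore, the communication terminal 12 can still obtain from the WWW server 13 the expiry information, update date information, and the like contained in the entity header.
[0146]
In this case, the communication terminal 12 can detect whether each file has been updated from the update date information and the like. A HEAD request also avoids handling the large file body, so the load is small on both the WWW server 13 and the communication terminal 12, and communication traffic is low. The response time is also short, so fast processing is possible.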
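The header-based check can be sketched with Python's standard library. Several points are assumptions: that the "update date information" corresponds to the standard HTTP `Last-Modified` header, the example URL, and the function names. The sketch only builds the HEAD request and interprets a header mapping; actually sending the request (for example with `urllib.request.urlopen`) is left to the caller.

```python
import urllib.request
from datetime import datetime, timezone
from email.utils import parsedate_to_datetime


def build_head_request(url: str) -> urllib.request.Request:
    """Build an HTTP HEAD request, which asks for headers only."""
    return urllib.request.Request(url, method="HEAD")


def updated_since(headers, last_seen: datetime):
    """Return True if Last-Modified is later than last_seen.

    headers: mapping of response header names to values.
    Returns None when the server sent no Last-Modified header,
    in which case this check cannot decide.
    """
    value = headers.get("Last-Modified")
    if value is None:
        return None
    return parsedate_to_datetime(value) > last_seen


# Hypothetical URL and header values, for illustration only.
request = build_head_request("http://example.com/main.html")
sample_headers = {"Last-Modified": "Wed, 27 Nov 2002 10:00:00 GMT"}
previous_check = datetime(2002, 11, 1, tzinfo=timezone.utc)
print(request.get_method(), updated_since(sample_headers, previous_check))
```

Because the result depends entirely on what the server reports, a missing header is returned as `None` rather than guessed.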
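[0147]
Detection of updates from the update date information may also be mixed with detection by comparing the file body (or its checksum or the like) as described in the first (or second) embodiment. For example, when the user U1 enters URL11 to browse the frame page, a GET request may be used and the file bodies compared, and when an autopilot tool or the like detects updates automatically, a HEAD request may be used and updates detected from the update date information.
[0148]
The update date information and other information contained in HTTP headers is written based on management information collected on the WWW server 13 side, so it may be wrong if, for example, files are not managed properly on the WWW server 13. For instance, the update date of a file may be rewritten even though its content has not changed at all, producing update date information that makes it appear that an update occurred.
[0149]
Detecting updates by comparing the content of the file body (or its checksum or the like) is more reliable in that such errors cannot enter.
[0150]
In the example of FIGS. 4(A) to 4(D) the top page is a frame page, but a page other than the top page may of course be the frame page.
[0151]
In a frame page, as shown in FIG. 1, the typical layout places a menu in a narrow frame such as area B and, in a wide frame such as area C, the content that changes according to the menu selection. The present invention can of course also be applied to frame pages without a menu.
[0152]
The present invention is also applicable to WWW pages other than frame pages and to descriptions in languages other than HTML (such as XML and SGML). All that is required is that the document contain a plurality of areas that are logically distinguishable in some sense.
[0153]
For example, for a WWW page that contains a plurality of images (image files), one of the image files may be determined to be the main area based on the update frequency of each image file. The main area can also be determined, based on update frequency, between the basic HTML file and the images (image files) displayed on the WWW page corresponding to that HTML file. In this case, if an image (image file) is selected as the main area, only the image may be displayed, and if the HTML file is selected, only the HTML file without the images may be displayed.
[0154]
Areas may of course also be identified using units other than files.
[0155]
Furthermore, the communication protocol used need not be HTTP.
[0156]
In the first and second embodiments the area processing unit 25 (or 35) is placed on the communication terminal (client) 12 side, but its function can also be placed on the WWW server 13 side or, for example, on a proxy server located between the WWW server 13 and the communication terminal 12.
[0157]
When it is placed on the WWW server 13 side, HTTP communication is not necessarily required, so the file management information maintained by the server OS of the WWW server 13 can be used as is, for example to detect updates.
[0158]
The first and second embodiments assume that the frame page is published on the WWW server 13, but the present invention can also be applied to frame pages and the like obtained from recording media such as CD-ROMs, so the target frame pages and the like need not be obtained over a network.
[0159]
As already stated, the formula used in the present invention need not be limited to formula (1). For example, a simple formula in which the update frequency value is incremented (or decremented) each time an update is detected can also be used.
[0160]
The above description has mainly presented the present invention as implemented in hardware, but it can also be implemented in software.
[0161]
[Effects of the Invention]
According to the present invention, the operation burden on the user needed to determine the main area within a structured document can be reduced.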
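[Brief Description of the Drawings]
[FIG. 1] An example of the configuration of a structured document used in the first and second embodiments.
[FIG. 2] A schematic diagram showing an example of the configuration of the area processing unit used in the first embodiment.
[FIG. 3] A flowchart showing the "information update" operation of the first embodiment.
[FIG. 4] An explanatory diagram showing an example of a structured document used to describe the operation of the first embodiment.
[FIG. 5] An explanatory diagram showing an example of the result of reading the files of the structured document shown in FIG. 4.
[FIG. 6] An explanatory diagram showing an example of the result of dividing the structured document of FIG. 5.
[FIG. 7] A flowchart showing the "main area content output" operation of the first embodiment.
[FIG. 8] A flowchart showing the method of calculating the update frequency S in the first embodiment.
[FIG. 9] A specific example of the calculation of the update frequency S in the first embodiment.
[FIG. 10] A specific example of the calculation of the update frequency S in the first embodiment.
[FIG. 11] A flowchart combining the "information update" operation and the "main area content output" operation of the first embodiment.
[FIG. 12] A schematic diagram showing an example of the configuration of the area processing unit of the second embodiment.
[FIG. 13] A flowchart showing the "information update" operation of the second embodiment.
[FIG. 14] A flowchart showing the method of calculating the update frequency S in the second embodiment.
[FIG. 15] A schematic diagram showing an example of the overall configuration of the communication system of the first and second embodiments.
[FIG. 16] A schematic diagram showing an example of the configuration of the communication terminal used in the first and second embodiments.
[FIG. 17] A schematic diagram showing an example of the configuration of the WWW server used in the first and second embodiments.
[Explanation of Reference Signs]
101: reading unit, 102: buffer unit, 103: division unit, 104: division result storage unit, 105: area content comparison unit, 106: update frequency calculation unit, 107: update frequency storage unit, 108: determination unit, 201: checksum calculation unit, 202: checksum storage unit, 203: checksum comparison unit.
[0001]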
The present invention relates to an information processing apparatus and method, and can be applied to, for example, a case where a structured document is obtained from a WWW (World Wide Web) site.
[0002]
[Prior art]
A WWW browser is a tool for acquiring and browsing structured documents held on WWW sites. In general, a structured document allows its page layout, character size, and the like to be specified flexibly. In particular, many structured documents are displayed in the WWW browser with the page divided into several areas (frames), such as a title (area A), links to other structured documents (area B), and a body (area C), as shown in FIG. 1. To obtain needed information from such a structured document with a WWW browser, the user must specify the URL of the target document and, once it is displayed in the browser, either scan it visually while scrolling (a manual search) or use a character-string search function. Suppose, for example, that area C of FIG. 1 is the content the user needs. When there are many such structured documents, it is desirable, in order to reduce manual work, to clip only the information the user needs from the plurality of structured documents automatically and present it to the user as a single document. A WWW information extraction system of this kind is disclosed in Patent Document 1 below.
[0003]
[Patent Document 1]
JP-A-10-187753
[0004]
[Problems to be solved by the invention]
However, the WWW information extraction system described above requires the user to specify in advance, by manual input, the start and end positions of the data the user needs within each structured document. The operation burden on the user is therefore too large for the system to be practical across a large number of structured documents.
[0005]
[Means for Solving the Problems]
To solve this problem, the first invention is an information processing apparatus that determines a main area from within a predetermined structured document including a plurality of areas, comprising: (1) a reading unit that acquires each area in the structured document, or management information of each area, a plurality of times in time series; (2) a storage unit that stores the areas or management information acquired by the reading unit; (3) a comparison/inspection unit that compares corresponding areas, or corresponding management information, acquired by the reading unit and checks, based on the comparison result, whether each area has been updated; (4) an update frequency calculation unit that calculates predetermined update frequency correspondence information for each area based on the history of inspection results from the comparison/inspection unit; and (5) a main area determination unit that determines the main area from among the plurality of areas in the structured document based on the update frequency correspondence information.
[0006]
The second invention is an information processing method for determining a main area from within a predetermined structured document including a plurality of areas, wherein: (1) a reading unit acquires each area in the structured document, or management information of each area, a plurality of times in time series; (2) a storage unit stores the areas or management information acquired by the reading unit; (3) a comparison/inspection unit compares corresponding areas, or corresponding management information, acquired by the reading unit and checks, based on the comparison result, whether each area has been updated; (4) an update frequency calculation unit calculates predetermined update frequency correspondence information for each area based on the history of inspection results from the comparison/inspection unit; and (5) a main area determination unit determines the main area from among the plurality of areas in the structured document based on the update frequency correspondence information.
[0007]
BEST MODE FOR CARRYING OUT THE INVENTION
(A) Embodiment
Hereinafter, embodiments of an information processing apparatus and method according to the present invention will be described.
[0008]
(A-1) Configuration of First Embodiment
In the present embodiment, the function of the area processing unit 25 (see FIG. 16), which determines and extracts the main area, can be realized by a personal computer or other information processing apparatus having a communication function. It could also be placed on the WWW server side, but the case where it is placed on the communication terminal (client) side is described here as an example.
[0009]
FIG. 15 shows an overall configuration example of the communication system 10 according to the present embodiment.
[0010]
In FIG. 15, the communication system 10 includes a network 11, a communication terminal 12, and a WWW server 13.
[0011]
The network 11 may be a LAN (local area network) or the like, but is assumed to be the Internet in this case.
[0012]
The WWW server 13 is a server having a function of receiving a request (HTTP request) from the communication terminal 12 and returning a file constituting a WWW page as a response (HTTP response) in response to the request. In many cases, the WWW server 13 includes a database (not shown) for storing WWW pages and the like generated in advance, and a database server for directly managing the database. In general, various network devices such as routers and firewalls and servers such as DNS servers are arranged around the WWW server 13 and the database server to constitute a WWW site.
[0013]
The communication terminal 12 is an information processing device provided with the above-described area processing unit 25, and may be, specifically, a personal computer having a network function. In the configuration of the present embodiment, the communication terminal 12 needs to be equipped with a WWW browser B1 (see FIG. 16) which is a program for browsing a WWW page.
[0014]
FIG. 16 shows an example of the internal configuration of the communication terminal 12.
[0015]
(A-1-1) Internal configuration example of communication terminal
In FIG. 16, the communication terminal 12 includes a communication unit 20, a control unit 21, an operation unit 22, a storage unit 23, a display unit 24, and an area processing unit 25.
[0016]
The communication unit 20 has a function of communicating with the WWW server 13 via the network 11.
[0017]
In hardware terms, the control unit 21 corresponds to the central processing unit (CPU) of the communication terminal 12; in software terms, it corresponds to the operating system (OS), the WWW browser B1 described above, and the like.
[0018]
The operation unit 22 is a part that is operated by the user U1 of the communication terminal 12 to transmit an instruction to the control unit 21, and includes, for example, a keyboard and a pointing device.
[0019]
The display unit 24 is a part having a display screen such as a liquid crystal display. When the user U1 browses the WWW page, as a result of the WWW browser B1 interpreting and processing the tag of the WWW page, the content of the WWW page is displayed on the display unit 24 and can be browsed by the user U1. At this time, the WWW page displayed on the screen may be DP1 shown in FIG. 1 as an example. In order to display a frame page (frame page) such as DP1, the WWW browser B1 needs to correspond to the frame. The frame does not refer to the content displayed on the screen in each of the areas A to C in FIG. 1, but refers to a frame accommodating the content. The frame page of FIG. 1 is generated based on, for example, the HTML files of FIGS.
[0020]
Normally, one WWW page is composed of one basic HTML file and, as required, one or more other files (image files and the like). A frame page such as DP1, by contrast, consists of a larger number of files and has a more complicated structure.
[0021]
That is, a frame page requires at least an HTML file (frame definition file) that defines the configuration of the entire WWW page (that is, a frame structure) and a plurality of HTML files that are arranged in each frame as contents. In addition, various files (such as image files) linked to each HTML file are added as appropriate.
[0022]
Therefore, even assuming for simplicity that no such additional files exist, the frame page DP1 shown in FIG. 1 consists of four HTML files: the frame definition file shown in FIG. 4A, the HTML file of FIG. 4B arranged in area A, the HTML file of FIG. 4C arranged in area B, and the HTML file of FIG. 4D arranged in area C.
[0023]
A normal WWW page is structured only inside one HTML file, whereas a frame page is structured not only inside each HTML file but also across the plurality of HTML files contained in the one frame page.
[0024]
In FIG. 1, boundary lines L1 and L2 (including scroll bars) between the areas A to C are displayed. In actual frame pages, however, such boundary lines are in many cases deliberately not displayed, and the background color is identical across areas or a seamless, continuous background pattern is displayed. Whether or not a boundary line is displayed therefore has no bearing on the essence of the area division (frame structure).
[0025]
The frame structure, that is, how many frames one page is divided into, how the ratio of the side lengths of each frame is set (this ratio corresponds to the area of each frame), and whether boundary lines are displayed or hidden, is determined by the description in the frame definition file (for example, DP11).
[0026]
When the user U1 wants to browse the frame page DP1, the user U1 inputs the URL (here, URL11) of the frame definition file DP11 into the WWW browser B1 of the communication terminal 12. An HTTP request requesting the return of the frame definition file DP11 is then transmitted from the communication terminal 12 to the WWW server 13, and an HTTP response is returned that contains various HTTP headers (including the entity header) together with the frame definition file DP11 as the entity body.
[0027]
When requesting a return of an entity body, that is, a file such as an HTML file or an image file, the HTTP request is a GET request using a GET method.
[0028]
After the frame definition file DP11 is received by the communication terminal 12, the WWW browser B1 automatically and sequentially transmits HTTP requests for the other HTML files DP12 to DP14 based on the descriptions (URL12 to URL14) in the frame definition file DP11, and the communication terminal 12 receives the HTTP response corresponding to each of these HTTP requests.
[0029]
Then, as a result of processing and shaping these four HTML files, for example, a screen as shown in FIG. 1 is displayed on the display unit 24.
[0030]
Here, assuming that the URL of the HTML file DP12 is URL12, the URL of the HTML file DP13 is URL13, and the URL of the HTML file DP14 is URL14, URL12 is “title.html” in row TG12 of FIG., URL13 is “menu.html” in row TG13, and URL14 is “main.html” in row TG14.
[0031]
Usually, the plurality of HTML files constituting one frame page are located on the same WWW server (here, 13) as the frame definition file, and in the same folder, so they can be specified by a local URL that does not include an FQDN (here, a URL consisting only of a file name).
[0032]
If necessary, the HTML files constituting one frame page can be placed on another WWW server. In this case, for example, all or some of URL12 to URL14 are changed to URLs that include an FQDN. Note that URL11, which specifies the frame definition file DP11 and can be input by the user U1, is of course a URL that includes an FQDN.
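As a rough illustration of how the URLs of the frame contents could be derived from a frame definition file, the following Python sketch collects the src attribute of each FRAME tag and resolves it against the URL of the frame definition file. The sample markup, the host names, and the names FrameSrcCollector and frame_urls are assumptions made for illustration; they do not appear in the embodiment.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class FrameSrcCollector(HTMLParser):
    """Collects the src attribute of every <frame> tag (hypothetical helper)."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # HTMLParser lower-cases tag and attribute names, so <FRAME SRC=...> also matches.
        if tag == "frame":
            src = dict(attrs).get("src")
            if src:
                self.sources.append(src)


def frame_urls(definition_url, definition_html):
    """Return the absolute URLs of the files placed in each frame."""
    collector = FrameSrcCollector()
    collector.feed(definition_html)
    # A URL without an FQDN is resolved relative to the frame definition file;
    # a URL that already contains an FQDN is returned unchanged by urljoin.
    return [urljoin(definition_url, src) for src in collector.sources]


# Assumed sample resembling a frame definition file such as DP11.
SAMPLE = """
<html><frameset rows="20%,80%">
  <frame src="title.html" name="title">
  <frameset cols="30%,70%">
    <frame src="menu.html" name="menu">
    <frame src="http://other.example.org/main.html" name="main">
  </frameset>
</frameset></html>
"""

urls = frame_urls("http://www.example.com/docs/index.html", SAMPLE)
```

In this sample the first two entries are file-name-only URLs and are resolved against the folder of the frame definition file, while the third already includes an FQDN and is kept as written.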
[0033]
The URL11 can be input manually by the user U1 using the operation unit 22, or it can be input automatically, at a date and time or time interval set in advance by the user U1, using software such as the autopilot tool described in Patent Document 1.
[0034]
In the case of a frame page having a menu as shown in FIG. 1, the menu file has contents such as those shown in FIG. 4C (every target is main, the frame name of area C). When the user U1 selects a hyperlink in the menu, the file to which the hyperlink points is displayed in area C, replacing the previous content.
[0035]
Therefore, in typical usage, a frame page having a menu can be regarded as a plurality of WWW pages layered on one another, in which the contents of the other frames (here, areas A and B) are the same and only the content of the target frame (area C) is replaced.
[0036]
The storage unit 23 illustrated in FIG. 16 includes, for example, a volatile storage device such as a RAM (random access memory) or a nonvolatile storage device such as a hard disk.
[0037]
When the communication terminal 12 receives the files constituting a WWW page (for example, DP11 and DP12) from the WWW server 13, the files are temporarily accumulated in a cache area secured on the hard disk of the storage unit 23.
[0038]
The cache area is usually placed under the control of the WWW browser B1, and can be freely accessed from the WWW browser B1.
[0039]
Files are kept in the cache area for as long as possible, but the storage capacity of the cache area has an upper limit. When browsing a new WWW page would cause newly accumulated files to exceed this upper limit, the necessary storage capacity is secured by, for example, deleting already stored files in order from the oldest.
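The oldest-first deletion described above can be sketched as follows. This is only an illustrative model under assumed names (FileCache, capacity_bytes); how the WWW browser B1 actually manages its cache area is not specified in the embodiment.

```python
from collections import OrderedDict


class FileCache:
    """Size-capped cache that discards the oldest stored files first (illustrative)."""

    def __init__(self, capacity_bytes):
        self.capacity_bytes = capacity_bytes
        self.files = OrderedDict()  # URL -> file contents, oldest entry first
        self.used_bytes = 0

    def store(self, url, content):
        # Replace any existing copy so that its size is not counted twice.
        if url in self.files:
            self.used_bytes -= len(self.files.pop(url))
        # Delete stored files from the oldest until the new file fits.
        while self.files and self.used_bytes + len(content) > self.capacity_bytes:
            _, evicted = self.files.popitem(last=False)
            self.used_bytes -= len(evicted)
        self.files[url] = content
        self.used_bytes += len(content)

    def get(self, url):
        return self.files.get(url)


cache = FileCache(capacity_bytes=10)
cache.store("title.html", b"aaaa")
cache.store("menu.html", b"bbbb")
cache.store("main.html", b"cccc")  # would exceed 10 bytes, so title.html is deleted
```

After the third call, the oldest file (title.html) has been removed and the two newer files remain.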
[0040]
For example, when the user U1 instructs browsing of a WWW page from the operation unit 22 by inputting a URL or the like, if a file related to the WWW page specified by the URL is stored in the cache area, the WWW browser B1 acquires the file not via the Internet but from the cache area, and displays the contents of the file on the screen of the display unit 24. This suppresses increases in communication traffic on the network 11 and in the load on the WWW server 13, and shortens the response time as seen by the user U1 (the time from when an instruction is issued to when the WWW page is displayed).
[0041]
The area processing unit 25 connected to the control unit 21 has a function of determining and extracting a main area. The internal configuration is as shown in FIG. 2, for example.
[0042]
The main area is the one area, among the plurality of areas on a WWW page, that can be estimated to be the most important for the user (here, U1). In the present embodiment, the area that is updated most frequently is basically assumed to be the main area. Therefore, in the case of the frame page shown in FIG. 1, for example, whichever of the areas A to C can be estimated to be the most important to the user U1 (that is, the one updated most frequently) is the main area. Further, in keeping with the typical use of a frame page having a menu, the main area may be determined for each of a plurality of frame pages in which only the content of the target frame (for example, area C) is replaced.
[0043]
(A-1-2) Internal configuration example of area processing unit
In FIG. 2, the area processing unit 25 includes a reading unit 101, a buffer unit 102, a division unit 103, a division result storage unit 104, an area content comparison unit 105, an update frequency calculation unit 106, an update frequency storage unit 107, and a determination unit 108.
[0044]
The reading unit 101 is a unit that reads information N10 relating to a frame page of a specified URL (for example, the URL 11) into the buffer unit 102 regularly or irregularly (such as when the user U1 instructs reading). This reading may be performed from the cache area or the like in the storage unit 23, but is usually performed from the WWW server 13 via the Internet 11.
[0045]
In order to enable such reading via the Internet 11, the reading unit 101 must function as a part of the WWW browser B1, cooperate with the WWW browser B1, or have a function as an HTTP client independent of the WWW browser B1.
[0046]
The content of the file at the time of reading by the reading unit 101 may be, for example, a compound file SP1 shown in FIG.
[0047]
In FIG. 5, the composite file SP1 is a composite of the contents of the HTML files DP11 to DP14 shown in FIGS. However, the contents of the frame definition file DP11 are divided into four parts PT1 to PT4 and arranged.
[0048]
If the WWW server 13 manages the files DP11 to DP14 as files corresponding to one frame page, and generates and returns one composite file SP1 from the four files DP11 to DP14 when it receives an HTTP request for the frame definition file DP11, then a file such as the composite file SP1 can be returned as the entity body of a single HTTP response. However, this is disadvantageous in view of the load on the WWW server 13, compatibility with cache systems, and the like, so it is preferable not to perform such processing on the WWW server 13 side.
[0049]
When such processing is not performed on the WWW server 13 side, browsing the frame page DP1 shown in FIG. 1 requires a total of four HTTP requests to be transmitted from the communication terminal 12. In response, four HTTP responses are returned in sequence from the WWW server 13, and the entity bodies of three of those HTTP responses are the files DP12 to DP14.
[0050]
Therefore, in many cases, a file such as SP1 is obtained only by a process (shaping process) after the communication terminal 12 receives all the files DP11 to DP14.
[0051]
The buffer unit 102 is a unit that temporarily stores the content read by the reading unit 101 and the result processed by the division unit 103. The storage resources that realize storage functions such as the buffer unit 102, the division result storage unit 104, and the update frequency storage unit 107 may be secured separately from the storage unit 23, or may be secured within the storage unit 23.
[0052]
The dividing unit 103 is a unit that analyzes the frame page read into the buffer unit 102 and divides the stored contents of the buffer unit 102 into areas based on a document structure specified in advance. In the case of a frame page, however, each area on the frame page (for example, area A to area C) corresponds to a different HTML file and is a separate file from the beginning (that is, before the shaping process), so division may be omitted if a file system capable of ordinary file management is installed. Since file management is usually the role of the OS, the dividing unit 103 only has to repeat, for example, the process of requesting the OS to deliver the next file in the buffer unit 102 and receiving that file in response.
[0053]
However, if, because of the interface between the WWW browser B1 and the area processing unit 25 (reading unit 101) or the like, the files DP11 to DP14 before the shaping process cannot be received and only the composite file SP1 after the shaping process can be received, division is required.
[0054]
The division is a process of obtaining three files DP12, DP13, and DP14 shown in FIG. 6 by excluding, for example, the parts PT1 to PT4 from the combined file SP1 shown in FIG.
[0055]
The files (for example, DP12 to DP14) divided by the division unit 103 (received from the OS) are supplied to the division result storage unit 104.
[0056]
The division result storage unit 104 is a part that stores the file N11 supplied from the division unit 103 the previous time. The previously supplied file N11 is stored for comparison with the corresponding file N12 received and stored in the buffer unit 102 this time, so it must naturally be stored in a form that makes the correspondence between the previous and current files clear. For example, the previous HTML file of area C must be compared with the current HTML file DP14 of area C, and must not be confused with another HTML file (for example, the HTML file DP13 of area B).
[0057]
For this association, the file name of each HTML file (for example, URL12 to URL14) can be used, or a unique file identifier (area number) assigned to each HTML file within the system can be used.
[0058]
The area content comparison unit 105 is a part that detects the presence or absence of an update by comparing, for each file (that is, for each area), the contents of the previously supplied file N11 stored in the division result storage unit 104 with the contents of the currently supplied file N12 stored in the buffer unit 102, and outputs update presence/absence information N16 as the detection result. An update here means adding, deleting, or changing part or all of the contents of a file.
[0059]
The update frequency calculation unit 106 is a part that calculates the current update frequency S from the content of the update presence/absence information N16 and the previous update frequency, based on a predetermined formula, and outputs update frequency information N17 corresponding to the update frequency S. The update frequency S is calculated for each file (for each area). The update frequency S could be defined so that its value increases as the file is updated more often; in the present embodiment, however, the value of S decreases as the file is updated more often.
[0060]
Further, as the above equation, various types of equations can be used. In the present embodiment, the following equation (1) based on exponential averaging is used.
[0061]
S = S₀ × α + P × (1 − α)   (1)
Here, S₀ is the update frequency up to the previous time, and α is a constant satisfying 0 < α < 1; here α = 0.8. P is a score that takes a value of either 100 or 0. As described later, the score P is set to 0 when an update has been performed and to 100 when no update has been performed, so the update frequency S becomes smaller each time an update is performed.
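Equation (1) can be written as a small function for experimentation. The following Python sketch uses α = 0.8 and the scores 0 and 100 as stated above; the function name next_update_frequency and the sample starting values 73 and 46 are assumptions chosen for illustration.

```python
ALPHA = 0.8  # constant alpha of equation (1), 0 < alpha < 1


def next_update_frequency(previous_s, updated, alpha=ALPHA):
    """Apply equation (1): S = S0 * alpha + P * (1 - alpha)."""
    # Score P is 0 when the file was updated and 100 when it was not.
    p = 0 if updated else 100
    return previous_s * alpha + p * (1 - alpha)


unchanged = next_update_frequency(73, updated=False)  # 73 * 0.8 + 100 * 0.2
changed = next_update_frequency(46, updated=True)     # 46 * 0.8 + 0 * 0.2
```

Rounded to the nearest integer, these give 78 and 37 respectively. Because each step multiplies the previous value by α, the influence of older observations decays exponentially: an area that is updated at every reading drifts toward 0, and an area that is never updated drifts toward 100.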
[0062]
In the HTTP response to the GET request, since various HTTP headers are returned in addition to the entity body (file), it is desirable to use information in the HTTP header in the processing of the update frequency calculation unit 106.
[0063]
For example, it is desirable to use update date information and expiration date information included in an entity header that is one of the HTTP headers.
[0064]
The update date information is information indicating the date when the file was updated on the WWW server 13 side.
[0065]
The expiration date information sets an expiration date for a cached copy of a file (held, for example, in the cache area or on a cache server), and is used to keep the content provided to the user up to date. The expiration date information is usually set at the discretion of the WWW server 13 side, for example by the creator of the file (the creator of the WWW page). Once the expiration date has passed, the communication terminal 12 (or a cache server interposed between the communication terminal 12 and the WWW server 13) deletes the file from the cache area even if storage capacity remains, and obtains the original file from the WWW server (here, 13). Setting a short expiration date therefore has the advantage of making it more likely that the user (here, U1) is provided with the latest content even when the file is updated frequently. On the other hand, it reduces the benefit of the cache and increases the load on the WWW server 13, so the expiration date is usually set as long as possible.
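As one possible way to obtain the expiration period from a response, the sketch below assumes that the expiration date information corresponds to the HTTP Expires header and that the time of the response is given by the Date header; the embodiment itself does not name specific header fields. The function name cache_lifetime_seconds and the sample header values are likewise illustrative assumptions.

```python
from email.utils import parsedate_to_datetime


def cache_lifetime_seconds(headers):
    """Return the seconds between the Date and Expires headers, or None if unavailable.

    headers is assumed to be a plain dict keyed by the exact header names.
    """
    date_value = headers.get("Date")
    expires_value = headers.get("Expires")
    if date_value is None or expires_value is None:
        return None
    try:
        delta = parsedate_to_datetime(expires_value) - parsedate_to_datetime(date_value)
    except (TypeError, ValueError):
        return None  # malformed date string
    # An Expires value in the past means the copy is already stale.
    return max(delta.total_seconds(), 0.0)


sample_headers = {
    "Date": "Thu, 28 Nov 2002 09:00:00 GMT",
    "Last-Modified": "Wed, 27 Nov 2002 21:30:00 GMT",
    "Expires": "Thu, 28 Nov 2002 10:00:00 GMT",
}
lifetime = cache_lifetime_seconds(sample_headers)
```

A short lifetime obtained this way suggests that the creator of the page expects frequent updates, while a missing Expires header simply yields None.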
[0066]
For this reason, the expiration date information can be regarded as information (update schedule information) indicating how often the creator of the WWW page plans to update the file, or how often updates are expected (for example, in the case of a WWW page whose file is updated by third parties, such as an electronic bulletin board (CGI bulletin board); note, however, that electronic bulletin boards are generally set so as not to be cached). Since the creator of the WWW page is normally familiar with their own schedule and with the nature of the WWW page, neither of which can be known by the user U1 or others, the expiration date information can be said to be highly reliable as update schedule information.
[0067]
There are various methods by which the update frequency calculation unit 106 can use such expiration date information. For example, when calculating the update frequency S, it is desirable to apply a weight such that the shorter the expiration period indicated by the expiration date information, the further the update frequency S is shifted toward the high-frequency side.
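The embodiment does not give a concrete weighting formula, so the following Python sketch is purely hypothetical. It assumes the convention of the present embodiment that a smaller S means more frequent updating, and scales S toward 0 when the expiration period is short; the function name weight_by_expiration, the reference period of one day, and the form of the weight are all assumptions.

```python
def weight_by_expiration(s, lifetime_seconds, reference_seconds=86400.0):
    """Scale S toward 0 (the high-frequency side here) for short expiration periods.

    Hypothetical weighting: weight = lifetime / (lifetime + reference),
    which approaches 0 for very short periods and 1 for very long ones.
    """
    if lifetime_seconds is None:
        return s  # no expiration date information: leave S unchanged
    weight = lifetime_seconds / (lifetime_seconds + reference_seconds)
    return s * weight


one_day = weight_by_expiration(78, 86400.0)   # weight 0.5
one_hour = weight_by_expiration(78, 3600.0)   # weight 0.04
```

With this choice, a file that expires after one hour is pushed much further toward the high-frequency side than one that expires after one day. Other monotonic weights would serve the same purpose.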
[0068]
The update frequency storage unit 107 that receives the update frequency information N17 corresponding to the update frequency S from the update frequency calculation unit 106 is a part that stores the update frequency information N17 for each area.
[0069]
The determination unit 108 is a part that determines, based on the update frequency information N17 stored in the update frequency storage unit 107, the area whose contents change most frequently to be the main area, extracts the contents of that area from the buffer unit 102, and outputs them. Various output destinations for the main area are conceivable; one example is the WWW browser B1. If the WWW browser B1 is to display the main area on the screen of the display unit 24, the frame definition file DP11 may be output to the WWW browser B1 together with the main area.
[0070]
On the other hand, FIG. 17 shows an example of the internal configuration of the WWW server 13 that returns an HTTP response in response to the HTTP request.
[0071]
(A-1-3) Example of internal configuration of WWW server
In FIG. 17, the WWW server 13 includes a communication unit 30, a control unit 31, and a storage unit 32.
[0072]
Among them, the communication unit 30 corresponds to the communication unit 20, the control unit 31 corresponds to the control unit 21, and the storage unit 32 corresponds to the storage unit 23.
[0073]
However, the control unit 31 does not include a WWW browser (such as B1) but includes WWW server software.
[0074]
When the WWW server 13 has a database for storing WWW pages or the like generated in advance, a DBMS may be installed in the control unit 31 as necessary.
[0075]
Further, in order to provide the frame page DP1, the storage unit 32 stores at least HTML files DP11 to DP14.
[0076]
Hereinafter, the operation of the present embodiment having the above-described configuration will be described with reference to the flowcharts of FIGS. 3, 7, and 8.
[0077]
FIG. 3 is a flowchart for updating information such as the update frequency information N17, and includes steps S101 to S104. Similarly, FIG. 7 is a flowchart for determining and outputting the contents of the main area based on the update frequency information N17 and the like, and includes steps S101, S102, and S105.
[0078]
FIG. 8 is a flowchart showing a process of calculating the update frequency S or a process in the vicinity thereof, and includes steps S151 to S160. The flowchart in FIG. 8 shows the detailed operation of step S103 in FIG.
[0079]
(A-2) Operation of the first embodiment
When the URL11 is input to the WWW browser B1, either by the user U1 operating the operation unit 22 or by a function such as the autopilot tool, the WWW browser B1 transmits an HTTP request (GET request) corresponding to the URL11.
[0080]
However, when the reading unit 101 has a function as an HTTP client independent of the WWW browser B1, it is natural that this input is performed not to the WWW browser B1 but to the reading unit 101.
[0081]
Upon receiving the HTTP request transmitted from the communication terminal 12 via the Internet 11, the WWW server 13 (together with the server OS) retrieves the frame definition file DP11 specified by the URL11 from the storage unit 32 and returns an HTTP response including the frame definition file DP11 as the entity body.
[0082]
As described above, after the frame definition file DP11 is received by the communication terminal 12, the WWW browser B1 automatically transmits HTTP requests based on the descriptions (URL12 to URL14) in the frame definition file DP11, and the communication terminal 12 receives the other HTML files DP12 to DP14 constituting the frame page DP1 as the entity bodies of the corresponding HTTP responses. The reading unit 101 then reads these files.
[0083]
However, when the composite file SP1 is generated on the WWW server 13, an HTTP response including the composite file SP1 as an entity body is returned.
[0084]
As shown in FIG. 3, the reading unit 101 included in the area processing unit 25 of the communication terminal 12 reads the file (DP11 or SP1), either alone or in cooperation with the WWW browser B1, and stores it in the buffer unit 102 (S101). When reading alone, the reading unit 101 reads the entity body of the HTTP response directly; when reading in cooperation with the WWW browser B1, the reading unit 101 reads the result processed by the WWW browser B1.
[0085]
In any case, the file read by the reading unit 101 at this time can be of two types: a frame definition file DP11 shown in FIG. 4A and a combined file SP1 shown in FIG.
[0086]
At this time, when the file read by the reading unit 101 is the composite file SP1, the file is divided (S102). However, when the file is the frame definition file DP11, there is no need to divide the file.
[0087]
However, in the case of the frame definition file DP11, it is necessary to sequentially transmit HTTP requests from the communication terminal 12 in order to acquire the three HTML files DP12 to DP14 described above, and to sequentially read the received HTML files.
[0088]
Regardless of whether or not the combined file SP1 is used, each of the HTML files DP12 to DP14 can be identified and associated with the area name in addition to the file name.
[0089]
Here, as an example, it is assumed that area number 1 is assigned to the HTML file DP12, making it area 1; area number 2 is assigned to the HTML file DP13, making it area 2; and area number 3 is assigned to the HTML file DP14, making it area 3.
[0090]
Next, in step S103, the area content comparison unit 105 compares the file read this time (divided as necessary) with the file read last time to detect the presence or absence of an update, the update frequency calculation unit 106 calculates the update frequency S based on the detection result, and the calculated update frequency S is stored in the update frequency storage unit 107.
[0091]
Then, the contents of the files DP12 to DP14 read (divided) this time are stored in the division result storage unit 104 in such a manner that they are replaced with the previously read files, and are prepared for the next new reading (S104).
[0092]
The processing of steps S101 to S104 is repeated a plurality of times as necessary. If an update is detected only a small number of times (for example, once), the update may simply have happened to coincide with the reading. To eliminate such coincidences and determine whether the update frequency is truly high, it is preferable that the number of repetitions be large and, naturally, that the detection of the presence or absence of updates be carried out over a long period.
[0093]
If necessary, the expiration date information (update schedule information) can be used, which may make it possible to eliminate such coincidences even when the number of repetitions is relatively small and the period for detecting the presence or absence of updates is relatively short.
[0094]
Further, when the number of repetitions is small and the period for detecting updates is short, a file that has been updated recently is more likely to be selected as the main area. If it can be assumed that the most important area for the user U1 is the one updated most recently, the important area can be selected even with a small number of repetitions and a short detection period.
[0095]
In FIG. 8 showing the details of step S103, the files DP12 to DP14 are identified using the area numbers.
[0096]
In FIG. 8, steps S152 to S159 are repeated for each file (for each area number), based on the area numbers assigned to the files DP12 to DP14 stored in the buffer unit 102.
[0097]
Of these steps, in step S152, the file read last time (divided last time) and the file read this time (divided this time), that is, the file stored in the division result storage unit 104 and the file stored in the buffer unit 102, are compared with each other for each area number (S153).
[0098]
As a result of the comparison, if the contents are the same, that is, if the file has not been updated, step S153 branches to the Yes side, 100 is set as the score P of equation (1) (S155), and the update frequency S is calculated based on equation (1) (S157). The previously calculated update frequency S₀ stored in the update frequency storage unit 107 is then replaced with the update frequency S calculated here.
[0099]
For example, if the comparison shows that the content of area 1 is the same as the previous time and the previous update frequency is S₀ = 73, then, as shown in FIG. 9, the current update frequency S of area 1 is 78 (≒ 73 × 0.8 + 100 × (1-0.8)) according to equation (1).
[0100]
Likewise, if the content of area 2 is the same as the previous time and the previous update frequency is S₀ = 73, the current update frequency S of area 2 is also 78.
[0101]
On the other hand, if the comparison in step S153 shows that the contents are not the same, that is, if the file has been updated, step S153 branches to the No side, and it is checked whether the number of divisions is the same as before (S154). Since the number of divisions is the number of files, other than the frame definition file DP11, that constitute the frame page DP1, step S154 in effect checks whether the frame structure itself (in particular, the number of frames) has changed.
[0102]
The presence or absence of a change in the frame structure can also be inspected without checking the number of divisions, by directly analyzing the description of the frame definition file DP11 (for example, by counting the number of occurrences of the character string “<FRAME src”).
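Counting occurrences of the character string in the frame definition file can be sketched as follows. The regular expression, which ignores case and allows any whitespace between the tag name and src, and the function names are assumptions for illustration; like the literal string in the text, the pattern only matches tags whose first attribute is src.

```python
import re

# Matches "<FRAME src", "<frame src", "<Frame  SRC", and similar spellings.
FRAME_SRC = re.compile(r"<frame\s+src", re.IGNORECASE)


def count_frames(definition_html):
    """Count FRAME tags whose first attribute is src."""
    return len(FRAME_SRC.findall(definition_html))


def frame_count_changed(previous_html, current_html):
    """True when the number of frames differs between two versions of the file."""
    return count_frames(previous_html) != count_frames(current_html)


old_definition = '<FRAMESET><FRAME src="title.html"><FRAME src="main.html"></FRAMESET>'
new_definition = (
    '<FRAMESET><FRAME src="title.html"><FRAME src="menu.html">'
    '<FRAME src="main.html"></FRAMESET>'
)
```

Note that the FRAMESET tag is not counted, because the pattern requires whitespace immediately after the tag name FRAME.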
[0103]
Further, as described above, the concept of the frame structure includes not only the number of frames but also the ratio of the length of the side of each frame, and the like, but in step S154, only the number of frames is considered. Therefore, even if the description of the frame definition file DP11 changes and the ratio of the length of the side of each frame changes, the step S154 branches to Yes if the number of frames is the same. Of course, if necessary, the step S154 may be branched to the No side if the description of the frame definition file DP11 changes a little.
[0104]
For example, when any one of URL12 to URL14 described in the rows TG12 to TG14 has changed, this means not that the content of the same file has been updated but that the file arranged in the frame has been replaced with another file. Therefore, step S154 may be branched to the Yes side.
[0105]
As an example, when a new area such as the area 4 (the area assigned with the area number 4) shown in FIG. 10 appears in the frame page DP1 in which only the three areas of the area 1 to the area 3 existed before. The step S154 branches to No, the update frequency S of the area 4 is set to 0 (S158), and the process proceeds to the step S159. In this case, 0 is the initial value of the update frequency, and the update frequency is reset to the initial value on the assumption that the area to be compared has disappeared because the frame structure itself of the frame page DP1 has changed.
[0106]
If the division number is the same and step S154 branches to the Yes side, the score P is set to 0 in that area (S156), and the process proceeds to step S157.
[0107]
Here, as shown in FIG. 9, if step S154 branches to the Yes side in the processing for area 3 and the previous update frequency is S₀ = 46, the current update frequency S of area 3 is 37 (≒ 46 × 0.8 + 0 × (1-0.8)) according to equation (1).
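The per-area loop described for step S103 can be summarized in a short Python sketch. It is a simplified reading of the flow, not the flowchart itself: files are represented as dictionaries keyed by area number, the check of the number of divisions (S154) is reduced to comparing the number of files, and the names update_frequencies and INITIAL_S are assumptions.

```python
ALPHA = 0.8      # constant alpha of equation (1)
INITIAL_S = 0    # initial value of the update frequency


def update_frequencies(previous_files, current_files, frequencies, alpha=ALPHA):
    """One pass over all areas, returning the new update frequency S per area.

    previous_files / current_files map area number -> file contents;
    frequencies maps area number -> stored update frequency S0.
    """
    same_division_count = len(previous_files) == len(current_files)  # S154
    result = {}
    for area, content in current_files.items():
        if previous_files.get(area) == content:   # S153: contents identical
            p = 100                               # S155: not updated
        elif same_division_count:                 # S154: Yes
            p = 0                                 # S156: updated
        else:                                     # S154: No
            result[area] = INITIAL_S              # S158: reset to initial value
            continue
        s0 = frequencies.get(area, INITIAL_S)
        result[area] = s0 * alpha + p * (1 - alpha)  # S157: equation (1)
    return result


previous = {1: "title", 2: "menu", 3: "news v1"}
current = {1: "title", 2: "menu", 3: "news v2"}
stored = {1: 73, 2: 73, 3: 46}
new_s = update_frequencies(previous, current, stored)

# A fourth area appears, so the number of divisions no longer matches.
with_new_area = update_frequencies(previous, {**current, 4: "banner"}, stored)
```

With the sample values, areas 1 and 2 round to 78 and area 3 rounds to 37, and the newly appearing area 4 receives the initial value 0. After each pass, the current files would be stored in place of the previous ones (step S104) so that the next reading is compared against them.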
[0108]
When the main area is to be determined based on the update frequency S and output, after the update frequency S of each of areas 1 to 3 stored in the update frequency storage unit 107 has been updated one or more times as described above, processing follows the flowchart shown in FIG.
[0109]
In FIG. 7, the processing of each step denoted by the same reference numerals S101 and S102 as in FIG. 3 is the same as in FIG.
[0110]
Therefore, FIG. 7 differs from FIG. 3 only in step S105.
[0111]
In step S105, the determination unit 108 determines the main area of the frame page DP1 based on the update frequency S stored in the update frequency storage unit 107 at that time, and outputs it. For example, when the update frequency S of the HTML file DP14 corresponding to area 3 is the value indicating the most frequent updating (in the present embodiment, the smallest value of S), the HTML file DP14 is selected as the main area and output.
[0112]
If a plurality of areas have the same update frequency, one main area may be specified, if necessary, based on the expiration date information or the like. Alternatively, a plurality of areas may be output as main areas, or one area may be output as the main area while the user U1 is notified that there is another area whose update frequency has the same value as that of the main area.
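The selection in step S105, including the case where several areas share the same value, might look like the following Python sketch. It assumes the convention of the present embodiment that the smallest S indicates the most frequent updating, and it returns all tied areas so that the caller can choose how to handle them; the function name determine_main_areas is an assumption.

```python
def determine_main_areas(frequencies):
    """Return the area numbers whose S indicates the most frequent updating.

    A smaller S means more frequent updates here, so the minimum is selected;
    areas that tie for the minimum are returned together in ascending order.
    """
    if not frequencies:
        return []
    best = min(frequencies.values())
    return sorted(area for area, s in frequencies.items() if s == best)


single = determine_main_areas({1: 78.4, 2: 78.4, 3: 36.8})
tied = determine_main_areas({1: 50, 2: 50, 3: 90})
```

When more than one area is returned, the caller could break the tie with the expiration date information, output all of them, or output one and report the others to the user, as described above.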
[0113]
Also, if a plurality of URLs specifying frame definition files, such as the URL11, are set in software such as the autopilot tool in addition to the date and time or the time interval, similar processing can be performed on a large number of frame pages at the same time. This makes it possible to select and output only the main area from each of many frame pages, so the user U1 can efficiently grasp the main points of many frame pages with little labor and time.
[0114]
In this case, however, the division result storage unit 104 and the update frequency storage unit 107 need to store their information organized by frame page.
[0115]
Further, in the present embodiment, the user U1 does not need to manually specify the start position and the end position of the data in advance as in Patent Literature 1, so that the operation load on the user U1 can be reduced.
[0116]
In the present embodiment, since the main area can be determined automatically, services and systems such as the following can be constructed easily: (i) update notification for a designated web page (for example, not notifying updates outside the main area), (ii) search (for example, excluding everything outside the main area from the search), and (iii) summarization (for example, summarizing only the main area).
[0117]
(A-3) Effects of the first embodiment
According to the present embodiment, it is possible to reduce the operation burden on the user (U1) for outputting the main area selected from the frame page. In particular, this is effective when only the main area is selected and output from many frame pages.
[0118]
In addition, since the present embodiment can be basically executed without using natural language processing, the main area can be determined without depending on the description language.
[0119]
Further, although the frame page is analyzed in the present embodiment, only the document structure specified in advance has to be processed, so the processing amount is smaller than when a full analysis is performed.
[0120]
(B) Second embodiment
Hereinafter, only the points of the present embodiment that are different from the first embodiment will be described.
[0121]
In the first embodiment, the presence or absence of an update is detected by storing and comparing the contents of the file itself. However, in the present embodiment, a checksum of the contents of the file is used instead of the contents of the file.
[0122]
(B-1) Configuration and Operation of Second Embodiment
The present embodiment is different from the first embodiment only in the internal configuration of the area processing unit 25. Therefore, the configurations shown in FIGS. 15, 16 and 17 can be used as they are in this embodiment. Reference numeral 35 is assigned to the area processing unit of the present embodiment to distinguish it from the area processing unit 25 of the first embodiment.
[0123]
An example of the internal configuration of the area processing unit 35 is as shown in FIG. 12.
[0124]
In FIG. 12, the area processing unit 35 includes a reading unit 101, a buffer unit 102, a division unit 103, an update frequency calculation unit 106, an update frequency storage unit 107, a determination unit 108, a checksum calculation unit 201, a checksum storage unit 202, and a checksum comparison unit 203.
[0125]
The functions of the components and signals (information) assigned the same reference numerals 101 to 103, 106 to 108, N10 to N12, and N17 to N19 as in FIG. 2 are the same as those in the first embodiment, and thus detailed description thereof is omitted.
[0126]
The checksum storage unit 202 corresponds to the division result storage unit 104, and the checksum comparison unit 203 corresponds to the area contents comparison unit 105.
[0127]
The checksum storage unit 202 stores checksum information N30 indicating a checksum calculated by the checksum calculation unit 201 based on the contents of the file, instead of the contents of the file (the division result). Normally, the size of the checksum is much smaller than the file contents, so that the storage capacity of the checksum storage unit 202 may be smaller than the storage capacity of the division result storage unit 104.
[0128]
The checksum comparison unit 203 therefore detects whether each file has been updated by comparing the checksum of the previously read file with the checksum of the currently read file, and outputs information N33 indicating whether or not each file has been updated.
[0129]
The checksum calculation unit 201 also calculates checksum information N31 indicating the checksum of the file read this time, based on the contents of the file read this time and stored in the buffer unit 102.
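The interplay of the checksum calculation, storage, and comparison described above can be sketched as follows. The class name `ChecksumUpdateDetector` is hypothetical, and CRC-32 via `zlib.crc32` is an assumed stand-in; the embodiment does not specify a particular checksum algorithm.

```python
import zlib
from typing import Dict, Optional


class ChecksumUpdateDetector:
    def __init__(self) -> None:
        # Plays the role of the checksum storage unit 202.
        self._stored: Dict[str, int] = {}

    @staticmethod
    def checksum(content: bytes) -> int:
        # Plays the role of the checksum calculation unit 201 (CRC-32 assumed).
        return zlib.crc32(content)

    def check(self, file_id: str, content: bytes) -> Optional[bool]:
        """Return True if updated, False if unchanged, None on the first read."""
        current = self.checksum(content)
        previous = self._stored.get(file_id)
        self._stored[file_id] = current  # kept for the next reading
        if previous is None:
            return None
        # Plays the role of the checksum comparison unit 203.
        return current != previous
```

Only the integer checksum is retained between readings, which is why the storage requirement is smaller than when the file contents themselves are stored.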
[0130]
Detecting whether a file has been updated based on the checksum may cause an erroneous detection in which a file that has actually been updated is judged not to have been updated. However, by making the checksum size (number of bits) sufficiently large, the probability of such erroneous detection can be suppressed.
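As a rough guide to how the checksum size affects the chance of a missed update, the following sketch computes the collision probability under an idealized assumption: that checksum values are uniformly distributed, so two different contents share an n-bit checksum with probability 2^-n. Real checksum algorithms may deviate from this, and the function name is hypothetical.

```python
def missed_update_probability(checksum_bits: int) -> float:
    """Chance that two different contents share a checksum, assuming
    uniformly distributed checksum values (an idealization)."""
    if checksum_bits <= 0:
        raise ValueError("checksum_bits must be positive")
    return 2.0 ** -checksum_bits
```

Under this assumption, each additional bit of checksum halves the probability of an erroneous "not updated" result.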
[0131]
The flowcharts showing the operation of the present embodiment are as shown in FIGS. 7, 13, and 14. Of these, the flowchart of FIG. 7 is the same as that used in the first embodiment.
[0132]
Further, the flowchart of FIG. 13 corresponds to the flowchart of FIG. 3, and the processing content of each step to which the same reference numerals S101 and S102 are assigned is the same as that of the first embodiment, and thus the detailed description thereof will be omitted.
[0133]
In FIG. 13, step S201 corresponds to step S103 in FIG. 3, and step S202 corresponds to step S104 in FIG. 3.
[0134]
Step S201 is different from S103 only in that the presence or absence of an update is detected based on a checksum instead of the contents of a file.
[0135]
In step S202, the checksum of the currently read file is stored in the checksum storage unit 202 in preparation for the next reading of a new file.
[0136]
On the other hand, the flowchart of FIG. 14 corresponds to the flowchart of FIG. 8, and the processing contents of the steps denoted by the same reference numerals S154 to S160 are the same as those of the first embodiment, and thus detailed description thereof will be omitted.
[0137]
Also, as described above, steps S251 to S253 differ from the first embodiment only in that the checksum is used instead of the file contents.
[0138]
(B-3) Effects of the second embodiment
In the present embodiment, an effect equivalent to the effect of the first embodiment can be obtained.
[0139]
In addition, in this embodiment, since the storage capacity of a portion corresponding to the checksum storage unit (202) is smaller than that of the first embodiment, storage resources can be saved.
[0140]
If the size of the checksum is smaller than the size of the file, the processing time may be reduced because the time for reading and writing to a storage resource such as a memory is short.
[0141]
(C) Other embodiments
In the first embodiment, the result of division is stored for comparison with the previous time. However, the read content may be stored as it is and divided before comparison.
[0142]
Although the divided areas are associated with each other according to their order of description in the combined file SP1, information for identifying each area within the frame page (for example, the URLs 12 to 14) may be added, and the association may be made using that information.
[0143]
In the second embodiment, the checksum is used. However, a code corresponding to an error detection method other than the checksum can be used. Further, a value (hash value) obtained by converting the content of the file using a hash function may be used instead of the checksum.
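A minimal sketch of the hash-based variant follows. SHA-256 from Python's `hashlib` is an assumed choice, since the text does not name a specific hash function, and the function names are hypothetical.

```python
import hashlib


def content_digest(content: bytes) -> str:
    """Hash value used in place of a checksum (SHA-256 is an assumed choice)."""
    return hashlib.sha256(content).hexdigest()


def content_changed(previous_digest: str, content: bytes) -> bool:
    """Compare the stored hash value with the hash of the current content."""
    return content_digest(content) != previous_digest
```

As with the checksum, only the fixed-size digest needs to be stored between readings.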
[0144]
In the first and second embodiments, only the case where a GET request (GET method) is used as an HTTP request has been described, but a HEAD request (HEAD method) may be used.
[0145]
The HTTP response to the HEAD request is the same as the HTTP response to the GET request except that no entity body (file) is included. Therefore, even if the communication terminal 12 transmits the HEAD request, the communication terminal 12 can acquire the expiration date information and the update date information included in the entity header from the WWW server 13.
[0146]
In this case, the communication terminal 12 can detect whether or not each file has been updated based on the update date information and the like. Further, according to the HEAD request, a large file body does not need to be handled, so that the load is small on the WWW server 13 side and the communication terminal 12 side, and communication traffic is also small. Further, since the response time is short, high-speed processing can be executed.
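The HEAD-based detection can be sketched with the Python standard library as follows. The function names are hypothetical, and mapping the update date information and expiration date information to the `Last-Modified` and `Expires` header fields is an assumption made for illustration.

```python
import urllib.request
from typing import Dict, Optional

# Entity-header fields treated here as "management information" (assumed).
FIELDS = ("Last-Modified", "Expires")


def fetch_management_info(url: str, timeout: float = 10.0) -> Dict[str, Optional[str]]:
    """Send a HEAD request and return the header fields of interest."""
    request = urllib.request.Request(url, method="HEAD")
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return {name: response.headers.get(name) for name in FIELDS}


def management_info_changed(previous: Dict[str, Optional[str]],
                            current: Dict[str, Optional[str]]) -> bool:
    """Report an update when the Last-Modified value differs from last time."""
    return previous.get("Last-Modified") != current.get("Last-Modified")
```

Since no entity body is transferred, each call to `fetch_management_info` moves only the response headers over the network.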
[0147]
Further, detection of the presence or absence of an update based on the update date information may be mixed with detection by comparing the file body (or its checksum) as described in the first (or second) embodiment. For example, when the user U1 inputs the URL 11 in order to browse the frame page, a GET request may be used to compare the file bodies, and when the autopilot tool or the like automatically detects the presence or absence of an update, a HEAD request may be used to detect it based on the update date information.
[0148]
Since the various information included in the HTTP header, such as the update date information, is described based on the management information collected on the WWW server 13, it may be incorrect if file management is not properly performed on the WWW server 13 side. For example, the update date of a file may be rewritten even though the content has not actually changed, producing update date information that makes it appear as if an update had been performed.
[0149]
Detection of the presence or absence of an update by comparing the contents of the file body (or its checksum, etc.) is highly reliable in that it is not affected by such errors.
[0150]
In the examples of FIGS. 4A to 4D, the top page is a frame page, but it is natural that a page other than the top page may be a frame page.
[0151]
In the case of a frame page, as shown in FIG. 1, a typical example is a configuration in which a menu is placed in a narrow frame such as an area B and contents that change in accordance with the selection of the menu are placed in a wide frame such as an area C. However, it goes without saying that the present invention can be applied to a frame page without a menu.
[0152]
Further, the present invention is applicable to WWW pages other than frame pages, and is also applicable to descriptions in languages other than HTML (such as XML and SGML). This is because, in some sense, a plurality of logically identifiable areas may be included.
[0153]
For example, in the case of a WWW page including a plurality of images (image files), one of the image files may be determined as the main area based on the update frequency of each image file. It is also possible to determine the main area between the basic HTML file and the image (image file) displayed on the WWW page corresponding to the HTML file based on the update frequency. In this case, when an image (image file) is selected as the main area, only the image may be displayed, and when the HTML file is selected, only the HTML file containing no image may be displayed.
[0154]
Also, the area may be identified based on a unit other than the file.
[0155]
Further, the communication protocol used does not necessarily have to be HTTP.
[0156]
In the first and second embodiments, the area processing unit 25 (or 35) is arranged on the communication terminal (client) 12 side, but the function of the area processing unit may instead be arranged on the WWW server 13 side or, for example, in a proxy server or the like that can be interposed between the WWW server 13 and the communication terminal 12.
[0157]
When the function is arranged in the WWW server 13, it is not always necessary to perform HTTP communication. Therefore, the file management information managed by the server OS of the WWW server 13 can be used as it is to detect whether or not there is an update.
[0158]
In the first and second embodiments, it is assumed that the frame page is published on the WWW server 13, but the present invention is also applicable to a frame page obtained from a recording medium such as a CD-ROM; the target frame page and the like need not necessarily be obtained via a network.
[0159]
Further, as described above, the formula used in the present invention does not need to be limited to the formula (1). For example, it is also possible to use a simple expression in which the value of the update frequency is incremented (or decremented) each time it is detected that the update has been performed.
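The simple incrementing alternative mentioned above can be sketched as follows. The names are hypothetical, and treating the area with the largest count as the main area is an assumption made for illustration.

```python
from typing import Dict

# Update frequency value per area, here a plain count of detected updates.
update_counts: Dict[str, int] = {}


def record_inspection(results: Dict[str, bool]) -> None:
    """Add 1 to an area's value each time an update is detected for it."""
    for area_id, updated in results.items():
        update_counts[area_id] = update_counts.get(area_id, 0) + (1 if updated else 0)


def most_updated_area() -> str:
    """Return the area with the largest count (assumed to be the main area)."""
    return max(update_counts, key=lambda area_id: update_counts[area_id])
```

Each call to `record_inspection` corresponds to one round of inspection results, mapping an area identifier to whether an update was detected in that round.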
[0160]
In the above description, the present invention is mainly realized by hardware, but the present invention can also be realized by software.
[0161]
[Effect of the Invention]
According to the present invention, it is possible to reduce a user's operation load required to determine a main area from a structured document.
[Brief description of the drawings]
FIG. 1 is a configuration example of a structured document used in first and second embodiments.
FIG. 2 is a schematic diagram illustrating a configuration example of an area processing unit used in the first embodiment.
FIG. 3 is a flowchart illustrating an operation at the time of “information update” according to the first embodiment.
FIG. 4 is an explanatory diagram showing an example of a structured document used for explaining the operation of the first embodiment.
FIG. 5 is an explanatory diagram showing an example of a result of reading each file of the structured document shown in FIG. 4;
FIG. 6 is an explanatory diagram showing an example of a result obtained by dividing the structured document of FIG. 5;
FIG. 7 is a flowchart illustrating an operation at the time of “outputting contents of main area” according to the first embodiment.
FIG. 8 is a flowchart illustrating a method of calculating an update frequency S according to the first embodiment.
FIG. 9 is a specific example of calculation of an update frequency S according to the first embodiment.
FIG. 10 is a specific example of calculation of an update frequency S according to the first embodiment.
FIG. 11 is a flowchart in a case where the operation at the time of “information update” and the operation at the time of “output of contents of main area” of the first embodiment are summarized.
FIG. 12 is a schematic diagram illustrating a configuration example of an area processing unit according to the second embodiment.
FIG. 13 is a flowchart illustrating an operation at the time of “information update” according to the second embodiment.
FIG. 14 is a flowchart illustrating a method of calculating an update frequency S according to the second embodiment.
FIG. 15 is a schematic diagram illustrating an example of the overall configuration of a communication system according to the first and second embodiments.
FIG. 16 is a schematic diagram illustrating a configuration example of a communication terminal used in the first and second embodiments.
FIG. 17 is a schematic diagram illustrating a configuration example of a WWW server used in the first and second embodiments.
[Explanation of symbols]
101: reading unit, 102: buffer unit, 103: dividing unit, 104: division result storing unit, 105: area content comparing unit, 106: update frequency calculating unit, 107: update frequency storing unit, 108: determining unit, 201: checksum calculation unit, 202: checksum storage unit, 203: checksum comparison unit.

Claims

In an information processing apparatus for determining a main area from a predetermined structured document including a plurality of areas,
A reading unit that acquires each area in the structured document, or management information of each area, a plurality of times in time series;
A storage unit that stores the area or management information obtained by the reading unit;
A comparison inspection unit that performs a comparison between corresponding areas or management information obtained by the reading unit, and checks whether or not each area is updated based on the comparison result;
An update frequency calculation unit that calculates predetermined update frequency correspondence information for each region based on a history of the inspection results by the comparison inspection unit;
An information processing apparatus, comprising: a main area determination unit that determines a main area from a plurality of areas in the structured document based on the update frequency correspondence information.

The information processing apparatus according to claim 1,
A boundary division unit that divides each region in the structured document based on boundary information indicating a boundary between regions on a screen display,
The information processing apparatus, wherein the reading unit reads the respective areas or management information by cooperating with the boundary division unit.

The information processing apparatus according to claim 1,
An area management unit that, when each area in the structured document is defined in advance by an inter-area structure indicating a logical structure between the areas, identifies each area using the inter-area structure,
The information processing apparatus, wherein the reading unit reads the respective areas or the management information by cooperating with the area management unit.

The information processing apparatus according to claim 1,
The information processing apparatus, wherein the comparison inspection unit checks the presence or absence of an update by comparing an area acquired this time by the reading unit with the corresponding area acquired last time.

The information processing apparatus according to claim 1,
The information processing apparatus, wherein the comparison inspection unit checks the presence or absence of an update by comparing management information acquired this time by the reading unit with the corresponding management information acquired last time.

The information processing apparatus according to claim 1,
A conversion unit that converts the area or the management information into conversion data of smaller size for error detection,
The information processing apparatus, wherein the conversion data output by the conversion unit is stored in the storage unit.

The information processing apparatus according to claim 1,
The information processing apparatus, wherein the update frequency calculation unit calculates current update frequency correspondence information based on previous update frequency correspondence information and the inspection result output this time by the comparison inspection unit.

In an information processing method for determining a main area from a predetermined structured document including a plurality of areas,
The reading unit acquires each area in the structured document or the management information of each area plural times in chronological order,
The storage unit stores the area or management information acquired by the reading unit,
A comparison inspection unit performs comparison between corresponding areas obtained by the reading unit or between management information, and checks whether each area is updated based on the comparison result,
Based on the history of inspection results by the comparison inspection unit, the update frequency calculation unit calculates predetermined update frequency correspondence information for each region,
An information processing method, wherein a main area determination unit determines a main area from a plurality of areas in the structured document based on the update frequency correspondence information.