JP2004199273A

JP2004199273A - Information automatic classification device

Info

Publication number: JP2004199273A
Application number: JP2002365482A
Authority: JP
Inventors: Kenichi Fujii; 憲一藤井
Original assignee: Canon Inc
Current assignee: Canon Inc
Priority date: 2002-12-17
Filing date: 2002-12-17
Publication date: 2004-07-15

Abstract

PROBLEM TO BE SOLVED: To remove complication of bookmark management generated when using many bookmarks, to perform accurate classification, and to thereby improve serviceability without damaging simplicity of function of the bookmark.

SOLUTION: This information automatic classification device 104 generates learning information of a characteristic of a Web site in each classification item based on directory information provided by a directory service server 103. On the other hand, the information automatic classification device 104 extracts characteristic information of a corresponding Web site based on bookmark information held by an information client 101. The extracted characteristic information is compared respectively with the learning information in each generated classification item, and similarity between the characteristic information and each learning information is calculated respectively. The classification item corresponding to the learning information related to the maximum similarity among each calculated similarity is determined as the classification item to the bookmark information.

COPYRIGHT: (C)2004,JPO&NCIPI

Description

【０００１】
【発明の属する技術分野】
本発明は、情報自動分類装置に関し、特に、複数のＷｅｂサイト情報を所定の分類規則に則って分類してディレクトリ情報を提供するディレクトリサービスサーバ及びクライアント装置を含むネットワーク環境に接続された情報自動分類装置に関する。
【０００２】
【従来の技術】
電子計算機の進歩とインターネットの普及により、大量の情報がＷｅｂサイトから発信されるようになってきた。インターネットで提供される情報は莫大であるため、これを効率的に分類・整理することが求められている。
【０００３】
従来、こうした要請に応える１つの手法としてディレクトリサービスがある。ディレクトリサービスとは、多数のＷｅｂサイトからそれぞれ発信される情報を人手によって認識し、これを適切な分類カテゴリに分類して整理し、提供するようにしたものである。一般の使用者は、ディレクトリサービスを利用して、適切なカテゴリをたどることで目的とする情報に容易にたどり着くことができる。
【０００４】
ただし、このディレクトリサービスでは、経験・知識を有する専門の人員がＷｅｂサイトの情報を吟味し、その内容に基づいて適切なカテゴリを決定するようにしているため、生成された分類情報の分類精度が高いという長所がある反面、多数のＷｅｂサイトの情報を人手によって分類する作業が非常に煩雑であり、また、頻繁に更新されるＷｅｂサイトの情報に対する適切なカテゴリの維持が困難であり、さらに、新しく生成される情報のカテゴリ判定がスムーズに行われにくいという欠点も有している。
【０００５】
上記の要請に応える別の手法として、従来、検索サービスがある。検索サービスとは、Ｗｅｂロボットと呼称される手法を用いて、Ｗｅｂサイトの情報を自動的、定期的に収集し、得られた情報を対象にして検索を実施できるようにしたものである。この検索サービスでは、自動的にＷｅｂサイトの情報を収集するので、ディレクトリサービスにあったようなＷｅｂサイトからの提供情報の更新の高頻度に伴う問題は解消できるが、使用者が所望する情報を得るために適切な検索語を与える必要があり、そのため、特に検索行為に不慣れな一般の使用者が、所望する情報を簡単に得ることが難しいという問題がある。
【０００６】
また、こうしたインターネット上で提供される上記２つのサービスでは、サービスを利用する際にセキュリティリスクの問題があるとともに、サービス停止の問題などがある。
【０００７】
ところで、こうしたインターネット上で提供されるサービスではなく、しかも前述の要請に応える簡便な手法として、従来、ブックマークという手法がある。この手法は、Ｗｅｂサイトを閲覧するソフトウエアに一般に装備されている。このブックマークによれば、Ｗｅｂサイトへのアクセス情報であるＵＲＬ（Uniform Resource Locator）を階層的に分類し、Ｗｅｂサイトに任意の命名を行って保持しておくことができる。
【０００８】
しかし、上記ブックマークという手法では、ブックマークの生成や整理はすべて使用者が行わなければならないため、適切な分類状態の維持や、不要なブックマークの削除といった管理を全て使用者が行わねばならず、使用者にとって煩雑であるという問題があった。
【０００９】
それに対し、例えば特許文献１に記載の技術では、ユーザがＵＲＬを登録しようとしたときに該ＵＲＬのＨＴＭＬソース中のキーワード指定タグを検索し、そのキーワード指定タグによって指定されたキーワードに基づいて該ＵＲＬを分類して登録するものが考案されている。
【００１０】
【特許文献１】
特開平１１−１６７５８０号公報
【００１１】
【発明が解決しようとする課題】
しかしながら、特許文献１に記載の技術では、Ｗｅｂ上のＨＴＭＬソースの作成者が設定したキーワード指定タグに従って分類するので、適切なキーワードが指定されていない場合があり得、その場合には正確な分類がなされないという問題点がある。
【００１２】
本発明はこのような問題点に鑑みてなされたものであって、多数のブックマークを使用する場合に発生するブックマーク管理の煩雑さをなくし、且つ正確な分類を行うようにして、ブックマークという機能のもつ簡便性を損なわずに有用性を向上させた情報自動分類装置を提供することを目的とする。
【００１３】
【課題を解決するための手段】
上記目的を達成するために、本発明によれば、複数のＷｅｂサイト情報を所定の分類規則に則って分類してディレクトリ情報を提供するディレクトリサービスサーバ及びクライアント装置を含むネットワーク環境に接続された情報自動分類装置において、前記ディレクトリサービスサーバが提供するディレクトリ情報を、ネットワークを介して取得するディレクトリ情報取得手段と、前記取得されたディレクトリ情報から、分類項目情報と各分類項目に対応するＷｅｂサイトへのアクセス情報とを抽出する抽出手段と、前記抽出されたアクセス情報を用いて、該アクセス情報が対応するＷｅｂサイトにアクセスして該Ｗｅｂサイトが提供するサイト情報を取得する第１のサイト情報取得手段と、前記第１のサイト情報取得手段によって取得されたサイト情報を解析して、該サイト情報の特徴を抽出する第１の特徴抽出手段と、前記第１の特徴抽出手段によってそれぞれ抽出された複数のサイト情報の特徴情報を分類項目ごとに積算して学習情報を生成する学習手段と、前記クライアント装置が保持するブックマーク情報を、前記ネットワークを介して取得するブックマーク情報取得手段と、前記取得されたブックマーク情報を用いて、該ブックマーク情報が対応するＷｅｂサイトにアクセスして該Ｗｅｂサイトが提供するサイト情報を取得する第２のサイト情報取得手段と、前記第２のサイト情報取得手段によって取得されたサイト情報を解析して、該サイト情報の特徴を抽出する第２の特徴抽出手段と、前記第２の特徴抽出手段によって抽出された特徴情報を、前記学習手段によって生成された分類項目ごとの学習情報とそれぞれ比較し、該特徴情報と各学習情報との類似性をそれぞれ算出する算出手段と、前記算出手段によって算出された各類似性のうち最大の類似性に関連する学習情報に対応する分類項目を、前記第２のサイト情報取得手段が取得したサイト情報に関連するブックマーク情報に対する分類項目と決定する分類項目決定手段とを有することを特徴とする情報自動分類装置が提供される。
【００１４】
【発明の実施の形態】
以下、本発明の実施の形態を、図面を参照して説明する。
【００１５】
［第１の実施の形態］
図１は、本発明の第１の実施の形態に係る情報自動分類装置を含むネットワークシステムの基本構成を示すブロック図である。
【００１６】
同図において、１０１は情報クライアントである。情報クライアント１０１は、使用者からＷｅｂサイトの検索要求を受け付けたり、検索結果に対する使用者からの操作に応じた処理を実施したりするための機能を提供する。
【００１７】
１０２は情報サーバである。ＨＴＴＰ（Hypertext Transfer Protocol）などのプロトコルを用いてＷｅｂサイト情報を提供するものである。
【００１８】
１０３はディレクトリサービスを提供するディレクトリサービスサーバである。ディレクトリサービスサーバ１０３は、多数のＷｅｂサイトからそれぞれ発信される情報を所定の分類カテゴリに分類して整理し、ディレクトリ情報を提供する。このディレクトリ情報は、情報クライアント１０１の使用者が、所望のサイト情報に対応する分類項目を指定して、複数のＷｅｂサイトの中から適切なＷｅｂサイトを探す際に用いられるものである。
【００１９】
１０４は本発明に係る情報自動分類装置である。情報自動分類装置１０４は、ブックマークの適切な管理を行うものであり、詳しくは図３及び図４を参照して後述する。なお、本実施の形態における情報自動分類装置１０４は、情報クライアント１０１と独立して存在するが、これに代わって、情報クライアント１０１内に設けられてもよい。
【００２０】
情報クライアント１０１、情報サーバ１０２、ディレクトリサービスサーバ１０３、及び情報自動分類装置１０４はそれぞれ、電子計算機で構成され、ネットワーク１０５によって互いに接続されている。図１では便宜上、情報クライアント及び情報サーバをそれぞれ１つしか図示していないが、一般に、これらはネットワーク１０３に複数台存在し得る。
【００２１】
図２は、情報クライアント１０１の機能構成を示すブロック図である。
【００２２】
図中２０１は対話部であり、使用者からの検索要求や編集操作を受け付けるものである。対話部２０１は、電子計算機のキーボード、ポインティングデバイス、タッチパネル、ジョイスティック、ペン、タブレット等といった各種入力装置と、ビットマップディスプレイなどの表示装置と、電子計算機上の基本操作システム（オペレーティングシステム）とによって実現され、入力装置を経由して使用者からの指示を受け付け、また表示装置を経由して情報を使用者に提示する。
【００２３】
２０２は制御部である。制御部２０２はＷｅｂブラウザに相当し、Ｗｅｂブラウザは、特定の形式の情報の解釈及び実行を行うためのプログラムである。Ｗｅｂブラウザには、Ｗｅｂサイトへのアクセスに必要な情報であるＵＲＬ（ブックマーク）を保存するブックマーク機能がある。このブックマーク機能では、Ｗｅｂブラウザが現在閲覧しているＷｅｂサイトのＵＲＬを保存することができる。
【００２４】
２０３は通信部である。制御部２０２は、ネットワーク１０５上に存在する情報サーバ１０２、ディレクトリサービスサーバ１０３、及び情報自動分類装置１０４との間で通信を実施する。
【００２５】
図３は、情報自動分類装置１０４の基本構成を示す図である。
【００２６】
３０２はサーバ情報処理部であり、各種のサーバ（情報サーバ１０２、ディレクトリサービスサーバ１０３を含む）がそれぞれ提供する情報を、ネットワーク１０５を介して取得するものである。
【００２７】
３０１はディレクトリ情報処理部であり、ディレクトリ情報処理部３０１は、サーバ情報処理部３０２を用いて、ディレクトリサービスサーバ１０３が提供するディレクトリ情報を取得し、このディレクトリ情報のデータ形式を所定の形式に変換し、これを解析して分類識別情報（分類カテゴリ名）と、各分類カテゴリに対応する参照情報（インターネット上のＷｅｂサイトを識別するＵＲＬ）とを抽出する。また、サーバ情報処理部３０２を用いて、抽出した参照情報（ＵＲＬ）に対応するＷｅｂサイトにアクセスしてサイト情報を取得する。
【００２８】
３０６は特徴情報抽出処理部であり、与えられたＷｅｂサイト情報に対して特徴抽出処理を行い、分類識別情報に対応する特徴情報の生成を行う。具体的には、サイト情報から文書情報だけを抽出し、該文書情報に含有される単語を少なくとも抽出し、該単語を成分としたベクトルを作成し、これを特徴情報とする。
【００２９】
３０３は学習処理部であり、特徴情報抽出処理部３０６によって抽出された複数の特徴情報に対して学習処理を実行する。すなわち、分類項目ごとに多数の特徴情報を積算することによって学習情報を生成する。
【００３０】
３０４は参照情報処理部であり、情報クライアント１０１の制御部２０２（Ｗｅｂブラウザ）が保持するブックマーク情報（ＵＲＬ）を、ネットワーク１０５を介して取得し、サーバ情報処理部３０２に依頼して、ブックマーク情報に対応する情報サーバにアクセスしてＷｅｂサイト情報を取得する。
【００３１】
３０５は分類処理部であり、分類処理部３０５は、ブックマーク情報に関連する特徴情報を、ディレクトリ情報に関連する分類項目ごとの各学習情報と比較し、上記特徴情報に最も類似性の高い学習情報を算出する。この算出された学習情報に関連する分類項目を、上記特徴情報に関連するブックマーク情報に対する分類項目に決定することによってブックマーク情報を分類する。
【００３２】
図４は、情報自動分類装置１０４で実行されるブックマーク情報分類処理の手順を示すフローチャートである。
【００３３】
ステップＳ１からステップＳ５までは学習処理であり、概略すると、ネットワーク１０５上にあるディレクトリサービスサーバ１０３からディレクトリ情報を取得し、これを解析することにより分類識別情報（分類カテゴリ名）と参照情報（ＵＲＬ）とを獲得し、さらに当該参照情報に対応するＷｅｂサイトのサイト情報を取得し、その内容を解析することによって、分類項目ごとの学習情報を生成する。
【００３４】
まずステップＳ１で、ディレクトリ情報処理部３０１がサーバ情報処理部３０２を用いて、ディレクトリサービスサーバ１０３が提供するディレクトリ情報を、ネットワーク１０５を介して取得する。
【００３５】
次にステップＳ２で、ディレクトリ情報処理部３０１は、ディレクトリサービスサーバ１０３から取得したディレクトリ情報を解析し、分類識別情報（分類カテゴリ名）と分類カテゴリごとの参照情報（ＵＲＬ）とを抽出する。
【００３６】
ステップＳ３で、ディレクトリ情報処理部３０１はサーバ情報処理部３０２を用いて、ステップＳ２で抽出された参照情報（ＵＲＬ）に対応するＷｅｂサイトのサイト情報を、ネットワーク１０５を介して取得する。
【００３７】
ステップＳ４では、ステップＳ３で取得したＷｅｂサイト情報を基に、特徴抽出処理部３０６が特徴抽出処理を実施して、Ｗｅｂサイトの特徴を抽出する。次に学習処理部３０３が、分類項目（分類カテゴリ）ごとに多数の特徴情報の積算を行って分類項目に対応する学習情報の生成を行う。
【００３８】
ステップＳ５においては、ステップＳ１〜Ｓ４の処理を継続して実行すべきディレクトリ情報がまだ存在するかを判別し、継続して実行すべき場合はステップＳ１へ戻る。
【００３９】
ステップＳ６からステップＳ９までは分類処理であり、今度は情報クライアント１０１に固有に保存されている参照情報（ブックマークＵＲＬ）を取得し、対応するＷｅｂサイト情報の特徴抽出処理を行う。そして、得られた特徴情報をステップＳ１〜Ｓ４の学習処理で生成された分類項目ごとの学習情報と比較することで、参照情報（ブックマークＵＲＬ）によく対応する分類項目を自動的に決定するものである。
【００４０】
まずステップＳ６で、参照情報処理部３０４が、情報クライアント１０１の制御部２０２（Ｗｅｂブラウザ）が保持するブックマーク情報（参照情報）を取得する。
【００４１】
次にステップＳ７で、参照情報処理部３０４がサーバ情報処理部３０２を用いて、ステップＳ３と同様に、ステップＳ６で取得されたブックマーク情報（参照情報、ＵＲＬ）に対応するＷｅｂサイトのサイト情報を、ネットワーク１０５を介して取得する。
【００４２】
ステップＳ８では先ず、ステップＳ７で取得したＷｅｂサイト情報を基に、特徴抽出処理部３０６が特徴抽出処理を実施し、特徴情報を抽出する。ついで、分類処理部３０５が、このブックマーク情報に関連する特徴情報を、ステップＳ４で生成された分類項目ごとの各学習情報と比較し、該特徴情報と各学習情報との類似性をそれぞれ算出する。そして、算出された各類似性のうち最大の類似性に関連する学習情報に対応する分類項目を、上記ブックマーク情報に対する分類項目と決定する。
【００４３】
ステップＳ９では、ステップＳ６〜Ｓ８の処理を、情報クライアント１０１の制御部２０２（Ｗｅｂブラウザ）が保持するブックマーク情報の全てに対して実行したか否かを判別し、全てに対して実行した場合にはステップＳ１０に進み、まだ実行されていないブックマーク情報がある場合にはステップＳ６に戻る。
【００４４】
ステップＳ１０では、ステップＳ８で得られたブックマーク情報ごとの分類項目を情報クライアント１０１に送って、情報クライアント１０１におけるブックマーク情報の分類管理に供する。
【００４５】
以上のようにして、ブックマークの分類が自動的に行われるので、多数のブックマークを使用する場合に発生するブックマーク管理の煩雑さが無くなり、もってブックマークという機能のもつ簡便性を損なわずに有用性を向上させることが可能となる。
【００４６】
また、ディレクトリサービスを提供しているサーバでの分類の特徴を用いて、ユーザが利用している装置に格納されているブックマークを分類するので、適切で正確な分類がなされる。
【００４７】
［他の実施の形態］
上記実施の形態では、ディレクトリサービスを提供するディレクトリサービスサーバは１つだけ存在したが、複数のディレクトリサービスサーバが存在してもよく、その場合に、複数のディレクトリサービスの内容をそれぞれ個別の学習データとして扱っても、また全てのディレクトリサービスの情報をまとめて仮想的な一つのディレクトリサービスと見なして単一の学習データとして扱ってもよい。
【００４８】
また、上記実施の形態では、ステップＳ８において、最も類似性の高い単一の分類識別情報を算出したが、これに代わって、複数個の分類識別情報を算出し、それぞれに分類してもよい。
【００４９】
また、上記実施の形態では、ステップＳ８において、最も類似性の高い単一の分類識別情報を算出したが、これに代わって、一定の閾値を設け、これを超える値の類似度を得られなかった場合には、どの分類識別情報にも合致しなかったという結果を算出してもよい。
【００５０】
また、本発明は複数の機器（例えばホストコンピュータ、インタフェース機器など）から構成される装置に適用しても、一つの機器からなる装置に適用してもよい。すなわち、情報自動分類装置１０４は、情報クライアント１０１、情報サーバ１０２、ディレクトリサービスサーバ１０３の少なくとも１つと、同一の機器として構成してもかまわない。
【００５１】
また、本発明の目的は、前述した実施形態の機能を実現するソフトウエアのプログラムコードを記録した媒体を、システムまたは装置に供給し、そのシステムまたは装置のコンピュータ（またはＣＰＵやＭＰＵ）が、記憶媒体に格納されたプログラムコードを読みだし実行することによっても達成される。この場合、記憶媒体から読み出されたプログラムコード自体が前述した実施形態の機能を実現することになり、そのプログラムコードを記憶した記憶媒体は本発明を構成することとなる。
【００５２】
なお、プログラムコードを供給するための記憶媒体としては、例えばフロッピー（登録商標）ディスク、ハードディスク、光ディスク、光磁気ディスク、ＣＤ−ＲＯＭ、ＣＤ−Ｒ、磁気テープ、不揮発性のメモリカード、ＲＯＭなどを用いることができる。また、コンピュータが読み出したプログラムコードを実行することにより、前述した実施の形態の機能が実現されるだけではなく、そのプログラムコードの指示に基づき、コンピュータ上で稼働しているＯＳ（オペレーティングシステム）などが実際の処理の一部または全部を行い、その処理によって実施形態の機能が実現される場合も含まれる。
【００５３】
さらに、記憶媒体から読み出されたプログラムコードが、コンピュータに挿入された機能拡張ボードやコンピュータに接続された機能拡張ユニットに備わるメモリに書き込まれたあと、そのプログラムコードの支持に基づき、その機能拡張ボードや機能拡張ユニットに備わるＣＰＵなどが実際の処理の一部または全部を行い、その処理によって前述した実施形態の機能が実現される場合も含まれる。
【００５４】
本発明を上記記憶媒体に適用する場合、その記憶媒体には、先に説明したフローチャートに対応する処理を実行するプログラムコードを格納することになる。
【００５５】
以上のように、本発明の各種の実施の形態を示して説明したが、以下に本発明の実施態様の例を列挙する。
【００５６】
〔実施態様１〕複数のＷｅｂサイト情報を所定の分類規則に則って分類してディレクトリ情報を提供するディレクトリサービスサーバ及びクライアント装置を含むネットワーク環境に接続された情報自動分類装置において、
前記ディレクトリサービスサーバが提供するディレクトリ情報を、ネットワークを介して取得するディレクトリ情報取得手段と、
前記取得されたディレクトリ情報から、分類項目情報と各分類項目に対応するＷｅｂサイトへのアクセス情報とを抽出する抽出手段と、
前記抽出されたアクセス情報を用いて、該アクセス情報が対応するＷｅｂサイトにアクセスして該Ｗｅｂサイトが提供するサイト情報を取得する第１のサイト情報取得手段と、
前記第１のサイト情報取得手段によって取得されたサイト情報を解析して、該サイト情報の特徴を抽出する第１の特徴抽出手段と、
前記第１の特徴抽出手段によってそれぞれ抽出された複数のサイト情報の特徴情報を分類項目ごとに積算して学習情報を生成する学習手段と、
前記クライアント装置が保持するブックマーク情報を、前記ネットワークを介して取得するブックマーク情報取得手段と、
前記取得されたブックマーク情報を用いて、該ブックマーク情報が対応するＷｅｂサイトにアクセスして該Ｗｅｂサイトが提供するサイト情報を取得する第２のサイト情報取得手段と、
前記第２のサイト情報取得手段によって取得されたサイト情報を解析して、該サイト情報の特徴を抽出する第２の特徴抽出手段と、
前記第２の特徴抽出手段によって抽出された特徴情報を、前記学習手段によって生成された分類項目ごとの学習情報とそれぞれ比較し、該特徴情報と各学習情報との類似性をそれぞれ算出する算出手段と、
前記算出手段によって算出された各類似性のうち最大の類似性に関連する学習情報に対応する分類項目を、前記第２のサイト情報取得手段が取得したサイト情報に関連するブックマーク情報に対する分類項目と決定する分類項目決定手段と
を有することを特徴とする情報自動分類装置。
【００５７】
〔実施態様２〕前記ディレクトリサービスサーバが提供するディレクトリ情報は、前記クライアント装置の使用者が、所望のサイト情報に対応する分類項目を指定して、複数のＷｅｂサイトの中から適切なＷｅｂサイトを探す際に用いられるものであり、
前記ディレクトリ情報取得手段は、前記ディレクトリサービスサーバが提供するディレクトリ情報のデータ形式を所定の形式に変換することを特徴とする実施態様１に記載の情報自動分類装置。
【００５８】
〔実施態様３〕前記第１及び第２の特徴抽出手段はそれぞれ、前記第１及び第２のサイト情報取得手段によってそれぞれ取得されたサイト情報から文書情報を抽出し、該文書情報を解析することによって前記サイト情報の特徴を抽出することを特徴とする実施態様１に記載の情報自動分類装置。
【００５９】
〔実施態様４〕前記第１及び第２の特徴抽出手段はそれぞれ、前記抽出された文書情報に含有される単語を少なくとも抽出し、該単語を成分としたベクトルを用いることによって前記解析を行うことを特徴とする実施態様３に記載の情報自動分類装置。
【００６０】
〔実施態様５〕複数のＷｅｂサイト情報を所定の分類規則に則って分類してディレクトリ情報を提供するディレクトリサービスサーバ及びクライアント装置を含むネットワーク環境に接続された情報自動分類装置に適用される情報自動分類方法において、
前記ディレクトリサービスサーバが提供するディレクトリ情報を、ネットワークを介して取得するディレクトリ情報取得ステップと、
前記取得されたディレクトリ情報から、分類項目情報と各分類項目に対応するＷｅｂサイトへのアクセス情報とを抽出する抽出ステップと、
前記抽出されたアクセス情報を用いて、該アクセス情報が対応するＷｅｂサイトにアクセスして該Ｗｅｂサイトが提供するサイト情報を取得する第１のサイト情報取得ステップと、
前記第１のサイト情報取得ステップによって取得されたサイト情報を解析して、該サイト情報の特徴を抽出する第１の特徴抽出ステップと、
前記第１の特徴抽出ステップによってそれぞれ抽出された複数のサイト情報の特徴情報を分類項目ごとに積算して学習情報を生成する学習ステップと、
前記クライアント装置が保持するブックマーク情報を、前記ネットワークを介して取得するブックマーク情報取得ステップと、
前記取得されたブックマーク情報を用いて、該ブックマーク情報が対応するＷｅｂサイトにアクセスして該Ｗｅｂサイトが提供するサイト情報を取得する第２のサイト情報取得ステップと、
前記第２のサイト情報取得ステップによって取得されたサイト情報を解析して、該サイト情報の特徴を抽出する第２の特徴抽出ステップと、
前記第２の特徴抽出ステップによって抽出された特徴情報を、前記学習ステップによって生成された分類項目ごとの学習情報とそれぞれ比較し、該特徴情報と各学習情報との類似性をそれぞれ算出する算出ステップと、
前記算出ステップによって算出された各類似性のうち最大の類似性に関連する学習情報に対応する分類項目を、前記第２のサイト情報取得ステップで取得されたサイト情報に関連するブックマーク情報に対する分類項目と決定する分類項目決定ステップと
を有することを特徴とする情報自動分類方法。
【００６１】
〔実施態様６〕前記ディレクトリサービスサーバが提供するディレクトリ情報は、前記クライアント装置の使用者が、所望のサイト情報に対応する分類項目を指定して、複数のＷｅｂサイトの中から適切なＷｅｂサイトを探す際に用いられるものであり、
前記ディレクトリ情報取得ステップは、前記ディレクトリサービスサーバが提供するディレクトリ情報のデータ形式を所定の形式に変換することを特徴とする実施態様５に記載の情報自動分類方法。
【００６２】
〔実施態様７〕前記第１及び第２の特徴抽出ステップはそれぞれ、前記第１及び第２のサイト情報取得ステップによってそれぞれ取得されたサイト情報から文書情報を抽出し、該文書情報を解析することによって前記サイト情報の特徴を抽出することを特徴とする実施態様５に記載の情報自動分類方法。
【００６３】
〔実施態様８〕前記第１及び第２の特徴抽出ステップはそれぞれ、前記抽出された文書情報に含有される単語を少なくとも抽出し、該単語を成分としたベクトルを用いることによって前記解析を行うことを特徴とする実施態様７に記載の情報自動分類方法。
【００６４】
〔実施態様９〕複数のＷｅｂサイト情報を所定の分類規則に則って分類してディレクトリ情報を提供するディレクトリサービスサーバ及びクライアント装置を含むネットワーク環境に接続された情報自動分類装置に適用される情報自動分類方法を、コンピュータに実行させるためのプログラムにおいて、
前記情報自動分類方法が、
前記ディレクトリサービスサーバが提供するディレクトリ情報を、ネットワークを介して取得するディレクトリ情報取得ステップと、
前記取得されたディレクトリ情報から、分類項目情報と各分類項目に対応するＷｅｂサイトへのアクセス情報とを抽出する抽出ステップと、
前記抽出されたアクセス情報を用いて、該アクセス情報が対応するＷｅｂサイトにアクセスして該Ｗｅｂサイトが提供するサイト情報を取得する第１のサイト情報取得ステップと、
前記第１のサイト情報取得ステップによって取得されたサイト情報を解析して、該サイト情報の特徴を抽出する第１の特徴抽出ステップと、
前記第１の特徴抽出ステップによってそれぞれ抽出された複数のサイト情報の特徴情報を分類項目ごとに積算して学習情報を生成する学習ステップと、
前記クライアント装置が保持するブックマーク情報を、前記ネットワークを介して取得するブックマーク情報取得ステップと、
前記取得されたブックマーク情報を用いて、該ブックマーク情報が対応するＷｅｂサイトにアクセスして該Ｗｅｂサイトが提供するサイト情報を取得する第２のサイト情報取得ステップと、
前記第２のサイト情報取得ステップによって取得されたサイト情報を解析して、該サイト情報の特徴を抽出する第２の特徴抽出ステップと、
前記第２の特徴抽出ステップによって抽出された特徴情報を、前記学習ステップによって生成された分類項目ごとの学習情報とそれぞれ比較し、該特徴情報と各学習情報との類似性をそれぞれ算出する算出ステップと、
前記算出ステップによって算出された各類似性のうち最大の類似性に関連する学習情報に対応する分類項目を、前記第２のサイト情報取得ステップで取得されたサイト情報に関連するブックマーク情報に対する分類項目と決定する分類項目決定ステップと
を有することを特徴とするプログラム。
【００６５】
〔実施態様１０〕前記ディレクトリサービスサーバが提供するディレクトリ情報は、前記クライアント装置の使用者が、所望のサイト情報に対応する分類項目を指定して、複数のＷｅｂサイトの中から適切なＷｅｂサイトを探す際に用いられるものであり、
前記ディレクトリ情報取得ステップは、前記ディレクトリサービスサーバが提供するディレクトリ情報のデータ形式を所定の形式に変換することを特徴とする実施態様９に記載のプログラム。
【００６６】
〔実施態様１１〕前記第１及び第２の特徴抽出ステップはそれぞれ、前記第１及び第２のサイト情報取得ステップによってそれぞれ取得されたサイト情報から文書情報を抽出し、該文書情報を解析することによって前記サイト情報の特徴を抽出することを特徴とする実施態様９に記載のプログラム。
【００６７】
〔実施態様１２〕前記第１及び第２の特徴抽出ステップはそれぞれ、前記抽出された文書情報に含有される単語を少なくとも抽出し、該単語を成分としたベクトルを用いることによって前記解析を行うことを特徴とする実施態様１１に記載のプログラム。
【００６８】
〔実施態様１３〕複数のＷｅｂサイト情報を所定の分類規則に則って分類してディレクトリ情報を提供するディレクトリサービスサーバ及びクライアント装置を含むネットワーク環境に接続された情報自動分類装置に適用される情報自動分類方法をプログラムとして記憶した、コンピュータにより読み出し可能な記憶媒体において、
前記情報自動分類方法が、
前記ディレクトリサービスサーバが提供するディレクトリ情報を、ネットワークを介して取得するディレクトリ情報取得ステップと、
前記取得されたディレクトリ情報から、分類項目情報と各分類項目に対応するＷｅｂサイトへのアクセス情報とを抽出する抽出ステップと、
前記抽出されたアクセス情報を用いて、該アクセス情報が対応するＷｅｂサイトにアクセスして該Ｗｅｂサイトが提供するサイト情報を取得する第１のサイト情報取得ステップと、
前記第１のサイト情報取得ステップによって取得されたサイト情報を解析して、該サイト情報の特徴を抽出する第１の特徴抽出ステップと、
前記第１の特徴抽出ステップによってそれぞれ抽出された複数のサイト情報の特徴情報を分類項目ごとに積算して学習情報を生成する学習ステップと、
前記クライアント装置が保持するブックマーク情報を、前記ネットワークを介して取得するブックマーク情報取得ステップと、
前記取得されたブックマーク情報を用いて、該ブックマーク情報が対応するＷｅｂサイトにアクセスして該Ｗｅｂサイトが提供するサイト情報を取得する第２のサイト情報取得ステップと、
前記第２のサイト情報取得ステップによって取得されたサイト情報を解析して、該サイト情報の特徴を抽出する第２の特徴抽出ステップと、
前記第２の特徴抽出ステップによって抽出された特徴情報を、前記学習ステップによって生成された分類項目ごとの学習情報とそれぞれ比較し、該特徴情報と各学習情報との類似性をそれぞれ算出する算出ステップと、
前記算出ステップによって算出された各類似性のうち最大の類似性に関連する学習情報に対応する分類項目を、前記第２のサイト情報取得ステップで取得されたサイト情報に関連するブックマーク情報に対する分類項目と決定する分類項目決定ステップと
を有することを特徴とする記憶媒体。
【００６９】
〔実施態様１４〕前記ディレクトリサービスサーバが提供するディレクトリ情報は、前記クライアント装置の使用者が、所望のサイト情報に対応する分類項目を指定して、複数のＷｅｂサイトの中から適切なＷｅｂサイトを探す際に用いられるものであり、
前記ディレクトリ情報取得ステップは、前記ディレクトリサービスサーバが提供するディレクトリ情報のデータ形式を所定の形式に変換することを特徴とする実施態様１３に記載の記憶媒体。
【００７０】
〔実施態様１５〕前記第１及び第２の特徴抽出ステップはそれぞれ、前記第１及び第２のサイト情報取得ステップによってそれぞれ取得されたサイト情報から文書情報を抽出し、該文書情報を解析することによって前記サイト情報の特徴を抽出することを特徴とする実施態様１３に記載の記憶媒体。
【００７１】
〔実施態様１６〕前記第１及び第２の特徴抽出ステップはそれぞれ、前記抽出された文書情報に含有される単語を少なくとも抽出し、該単語を成分としたベクトルを用いることによって前記解析を行うことを特徴とする実施態様１５に記載の記憶媒体。
【００７２】
【発明の効果】
以上詳述したように本発明によれば、ディレクトリサービスサーバが提供するディレクトリ情報から、分類項目情報と各分類項目に対応するＷｅｂサイトへのアクセス情報とを抽出し、このアクセス情報を用いてＷｅｂサイトにアクセスして該Ｗｅｂサイトが提供するサイト情報を取得する。この取得されたサイト情報を解析して、該サイト情報の特徴を抽出し、これを繰り返してそれぞれ抽出された複数のサイト情報の特徴情報を分類項目ごとに積算して学習情報を生成する。一方、クライアント装置が保持するブックマーク情報を用いてＷｅｂサイトにアクセスしてサイト情報を取得し、取得されたサイト情報を解析して、該サイト情報の特徴を抽出する。この抽出された特徴情報を、前記生成された分類項目ごとの学習情報とそれぞれ比較し、該特徴情報と各学習情報との類似性をそれぞれ算出する。この算出された各類似性のうち最大の類似性に関連する学習情報に対応する分類項目を、前記ブックマーク情報に対する分類項目と決定する。
【００７３】
かくして、ブックマークの分類が自動的に行われるので、多数のブックマークを使用する場合に発生するブックマーク管理の煩雑さが無くなり、且つ正確な分類がなされるので、ブックマークという機能のもつ簡便性を損なわずに有用性を向上させることが可能となる。
【図面の簡単な説明】
【図１】本発明の第１の実施の形態に係る情報自動分類装置を含むネットワークシステムの基本構成を示すブロック図である。
【図２】情報クライアントの機能構成を示すブロック図である。
【図３】情報自動分類装置の基本構成を示す図である。
【図４】情報自動分類装置で実行されるブックマーク情報分類処理の手順を示すフローチャートである。
【符号の説明】
１０１情報クライアント（クライアント装置）
１０２情報サーバ（Ｗｅｂサイトサーバ）
１０３ディレクトリサービスサーバ
１０４情報自動分類装置
１０５ネットワーク

[0001]
[Technical Field of the Invention]
The present invention relates to an automatic information classification device, and more particularly to an automatic information classification device connected to a network environment that includes a client device and a directory service server, the server providing directory information by classifying information on a plurality of Web sites according to a predetermined classification rule.
[0002]
[Prior art]
With the advance of electronic computers and the spread of the Internet, a large amount of information has come to be published on Web sites. Because the information provided on the Internet is enormous, it needs to be classified and organized efficiently.
[0003]
Conventionally, one method of meeting this demand has been the directory service. In a directory service, people review the information published by a large number of Web sites, classify it into appropriate categories, organize it, and offer the result. An ordinary user can easily reach the desired information by following the appropriate categories in the directory service.
[0004]
In this directory service, however, specialized personnel with experience and knowledge examine the information on each Web site and decide on an appropriate category based on its content. This has the advantage that the resulting classification is highly accurate. On the other hand, manually classifying the information of a large number of Web sites is extremely laborious, it is difficult to keep the categories appropriate for Web sites whose information is updated frequently, and newly created information is not categorized promptly.
[0005]
Another conventional method of meeting the above demand is the search service. A search service automatically and periodically collects information from Web sites using a technique called a Web robot, and allows searches to be run against the collected information. Because the information is collected automatically, the search service avoids the problem that the directory service has with frequently updated Web sites. However, the user must supply appropriate search terms to obtain the desired information, so it is difficult for an ordinary user, especially one unfamiliar with searching, to obtain that information easily.
[0006]
In addition, both of the above services are provided on the Internet, so using them involves security risks, and they are also subject to problems such as service outages.
[0007]
Apart from such services provided on the Internet, bookmarks have conventionally offered a simple way of meeting the above demand. The bookmark function is generally built into software for browsing Web sites. With bookmarks, URLs (Uniform Resource Locators), which are the access information for Web sites, can be classified hierarchically and stored under names that the user assigns freely to each Web site.
[0008]
However, with bookmarks, creation and organization are left entirely to the user. The user must carry out all management tasks, such as keeping the classification appropriate and deleting unnecessary bookmarks, which is burdensome for the user.
[0009]
To address this, the technology described in Patent Document 1, for example, searches the HTML source at a URL for a keyword specification tag when the user attempts to register that URL, and classifies and registers the URL based on the keyword specified by that tag.
[0010]
[Patent Document 1]
JP-A-11-167580
[0011]
[Problems to be solved by the invention]
However, the technology described in Patent Document 1 classifies according to the keyword specification tag set by the author of the HTML source on the Web. An appropriate keyword may not have been specified, and in that case the classification is not accurate.
[0012]
The present invention has been made in view of this problem. Its object is to provide an automatic information classification device that eliminates the complexity of bookmark management arising when a large number of bookmarks are used and that classifies accurately, thereby improving usefulness without impairing the simplicity of the bookmark function.
[0013]
[Means for Solving the Problems]
In order to achieve the above object, the present invention provides an automatic information classification device connected to a network environment including a client device and a directory service server that classifies information on a plurality of Web sites according to a predetermined classification rule and provides directory information, the device comprising:
directory information acquisition means for acquiring, via a network, the directory information provided by the directory service server;
extraction means for extracting, from the acquired directory information, classification item information and access information for the Web sites corresponding to each classification item;
first site information acquisition means for accessing, using the extracted access information, the Web site to which the access information corresponds and acquiring the site information provided by that Web site;
first feature extraction means for analyzing the site information acquired by the first site information acquisition means and extracting features of the site information;
learning means for generating learning information by accumulating, for each classification item, the feature information of the plurality of pieces of site information extracted by the first feature extraction means;
bookmark information acquisition means for acquiring, via the network, bookmark information held by the client device;
second site information acquisition means for accessing, using the acquired bookmark information, the Web site to which the bookmark information corresponds and acquiring the site information provided by that Web site;
second feature extraction means for analyzing the site information acquired by the second site information acquisition means and extracting features of the site information;
calculation means for comparing the feature information extracted by the second feature extraction means with the learning information generated by the learning means for each classification item, and calculating the similarity between the feature information and each piece of learning information; and
classification item determination means for determining, as the classification item for the bookmark information related to the site information acquired by the second site information acquisition means, the classification item corresponding to the learning information associated with the largest of the calculated similarities.
[0014]
[Embodiments of the Invention]
Hereinafter, embodiments of the present invention will be described with reference to the drawings.
[0015]
[First Embodiment]
FIG. 1 is a block diagram showing a basic configuration of a network system including an automatic information classification device according to the first embodiment of the present invention.
[0016]
In FIG. 1, reference numeral 101 denotes an information client. The information client 101 provides functions for receiving a search request for a Web site from a user and performing processing according to an operation performed by the user on search results.
[0017]
Reference numeral 102 denotes an information server, which provides Web site information using a protocol such as HTTP (Hypertext Transfer Protocol).
[0018]
Reference numeral 103 denotes a directory service server that provides a directory service. The directory service server 103 categorizes and sorts information transmitted from a number of Web sites into predetermined classification categories, and provides directory information. This directory information is used when the user of the information client 101 specifies a classification item corresponding to desired site information and searches for an appropriate Web site from a plurality of Web sites.
[0019]
Reference numeral 104 denotes an automatic information classification device according to the present invention. The automatic information classification device 104 manages bookmarks appropriately, and is described later in detail with reference to FIGS. 3 and 4. In the present embodiment the automatic information classification device 104 exists independently of the information client 101, but it may instead be provided inside the information client 101.
[0020]
The information client 101, the information server 102, the directory service server 103, and the automatic information classification device 104 are each configured as a computer and are connected to one another by a network 105. For convenience, FIG. 1 shows only one information client and one information server, but in general a plurality of each can exist on the network 105.
[0021]
FIG. 2 is a block diagram showing a functional configuration of the information client 101.
[0022]
In the figure, reference numeral 201 denotes a dialogue unit, which receives search requests and editing operations from the user. The dialogue unit 201 is realized by various input devices of a computer, such as a keyboard, a pointing device, a touch panel, a joystick, a pen, and a tablet, by a display device such as a bitmap display, and by the basic operating software (operating system) on the computer. It receives instructions from the user via the input devices and presents information to the user via the display device.
[0023]
Reference numeral 202 denotes a control unit. The control unit 202 corresponds to a Web browser, which is a program for interpreting and executing information in a specific format. The Web browser has a bookmark function for storing URLs (bookmarks), which are the information needed to access Web sites. With this bookmark function, the URL of the Web site that the Web browser is currently displaying can be saved.
[0024]
Reference numeral 203 denotes a communication unit. As translated, the control unit 202 communicates with the information server 102, the directory service server 103, and the automatic information classification device 104 on the network 105; this communication is presumably carried out through the communication unit 203.
[0025]
FIG. 3 is a diagram showing a basic configuration of the automatic information classification device 104.
[0026]
Reference numeral 302 denotes a server information processing unit which acquires information provided by various servers (including the information server 102 and the directory service server 103) via the network 105.
[0027]
Reference numeral 301 denotes a directory information processing unit. The directory information processing unit 301 uses the server information processing unit 302 to acquire the directory information provided by the directory service server 103, and converts the data format of the directory information into a predetermined format. It then analyzes the converted information to extract classification identification information (classification category names) and the reference information (URLs identifying Web sites on the Internet) corresponding to each classification category. In addition, using the server information processing unit 302, it accesses the Web sites corresponding to the extracted reference information (URLs) and acquires their site information.
[0028]
Reference numeral 306 denotes a feature information extraction processing unit that performs feature extraction processing on given Web site information and generates feature information corresponding to the classification identification information. Specifically, only the document information is extracted from the site information, at least a word contained in the document information is extracted, and a vector having the word as a component is created, and this is used as feature information.
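The feature extraction described above can be sketched as follows. This is a minimal illustration only, not the implementation of the feature information extraction processing unit 306: the function name `extract_features`, the use of raw term counts as vector components, and the simple alphanumeric tokenizer are all assumptions. Japanese pages would need a morphological analyzer in place of the regular expression.

```python
import re
from collections import Counter
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects document text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            self.chunks.append(data)


def extract_features(site_html):
    """Return a sparse word vector (word -> count) for one page.

    Hypothetical helper: the source only says that words are extracted from
    the document information and used as vector components.
    """
    parser = _TextExtractor()
    parser.feed(site_html)
    parser.close()
    text = " ".join(parser.chunks).lower()
    # Assumed tokenizer: runs of ASCII letters and digits.
    return Counter(re.findall(r"[a-z0-9]+", text))
```

A sparse mapping is used here rather than a fixed-length array so that pages with different vocabularies can be compared without first building a global word list.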
[0029]
Reference numeral 303 denotes a learning processing unit, which performs a learning process on the plurality of pieces of feature information extracted by the feature information extraction processing unit 306. That is, it generates learning information by accumulating a large number of pieces of feature information for each classification item.
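One way to read "accumulating feature information for each classification item" is to sum the word vectors of all sites that belong to the same category. The sketch below assumes exactly that; the function name `learn` and the plain summation (with no weighting or normalization) are illustrative choices that the source does not specify.

```python
from collections import Counter, defaultdict


def learn(labelled_features):
    """Sum word vectors per category.

    labelled_features: iterable of (category_name, Counter) pairs.
    Returns a dict mapping each category name to its accumulated Counter,
    which plays the role of the "learning information" for that category.
    """
    learning_info = defaultdict(Counter)
    for category, features in labelled_features:
        # Counter.update adds the counts to the running totals.
        learning_info[category].update(features)
    return dict(learning_info)
```

Because plain sums grow with the number of sites in a category, a similarity measure that is insensitive to vector length is a natural companion to this kind of accumulation.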
[0030]
Reference numeral 304 denotes a reference information processing unit. It acquires, via the network 105, the bookmark information (URLs) held by the control unit 202 (Web browser) of the information client 101, and requests the server information processing unit 302 to access the information server corresponding to the bookmark information and acquire the Web site information.
[0031]
Reference numeral 305 denotes a classification processing unit. The classification processing unit 305 compares the feature information related to the bookmark information with the learning information for each classification item related to the directory information, and determines the learning information that has the highest similarity to the feature information. The bookmark information is classified by adopting the classification item associated with that learning information as the classification item for the bookmark information related to the feature information.
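The comparison step can be illustrated with a short sketch. The source does not name a similarity measure, so cosine similarity is used here purely as an assumption; `cosine_similarity` and `classify` are hypothetical names, and the word vectors are assumed to be `Counter` objects mapping words to counts.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine of the angle between two sparse word vectors (assumed measure)."""
    dot = sum(count * b.get(word, 0) for word, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def classify(features, learning_info):
    """Return (best_category, scores) for one bookmark's feature vector.

    learning_info maps category name -> accumulated Counter.
    """
    if not learning_info:
        raise ValueError("no learning information available")
    scores = {
        category: cosine_similarity(features, info)
        for category, info in learning_info.items()
    }
    # The category with the largest similarity is chosen.
    best = max(scores, key=scores.get)
    return best, scores
```

Returning the full score table alongside the winner makes it straightforward for a caller to inspect how close the runner-up categories were.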
[0032]
FIG. 4 is a flowchart illustrating the procedure of the bookmark information classification processing executed by the automatic information classification device 104.
[0033]
Steps S1 to S5 constitute the learning process. In outline, directory information is acquired from the directory service server 103 on the network 105 and analyzed to obtain classification identification information (classification category names) and reference information (URLs). The site information of the Web sites corresponding to that reference information is then acquired, and its content is analyzed to generate learning information for each classification item.
[0034]
First, in step S1, the directory information processing unit 301 uses the server information processing unit 302 to acquire directory information provided by the directory service server 103 via the network 105.
[0035]
Next, in step S2, the directory information processing unit 301 analyzes the directory information acquired from the directory service server 103, and extracts classification identification information (classification category name) and reference information (URL) for each classification category.
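Step S2 can be pictured with a small sketch. The source does not describe the "predetermined format" of the converted directory information, so the nested structure below (dictionaries with `name`, `urls`, and `children` keys) and the function name `extract_categories` are entirely hypothetical; the `example.com` URLs are placeholders.

```python
def extract_categories(node, parent_path=""):
    """Flatten a nested directory into (category_path, url) pairs.

    node is assumed to be a dict such as
    {"name": "Arts", "urls": [...], "children": [...]}.
    """
    path = f"{parent_path}/{node['name']}" if parent_path else node["name"]
    pairs = [(path, url) for url in node.get("urls", [])]
    for child in node.get("children", []):
        pairs.extend(extract_categories(child, path))
    return pairs


# Placeholder directory used only for illustration.
directory = {
    "name": "Top",
    "children": [
        {
            "name": "Arts",
            "urls": ["http://example.com/a"],
            "children": [
                {
                    "name": "Photography",
                    "urls": ["http://example.com/p1", "http://example.com/p2"],
                }
            ],
        }
    ],
}
pairs = extract_categories(directory)
```

Using the full path as the category name keeps identically named subcategories under different parents distinct.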
[0036]
In step S3, the directory information processing unit 301 uses the server information processing unit 302 to acquire, via the network 105, site information of a Web site corresponding to the reference information (URL) extracted in step S2.
[0037]
In step S4, the feature information extraction processing unit 306 performs feature extraction processing on the Web site information acquired in step S3 and extracts the features of the Web site. Next, the learning processing unit 303 accumulates the many pieces of feature information for each classification item (classification category) to generate the learning information corresponding to that classification item.
[0038]
In step S5, it is determined whether directory information that should still be processed by steps S1 to S4 remains, and if so, the process returns to step S1.
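The learning loop of steps S1 to S5 might be organized as in the following sketch. Network access and feature extraction are passed in as callables so that the example stays self-contained. The names `run_learning`, `fetch_site`, and `extract_features`, and the choice to skip sites that cannot be fetched, are assumptions not taken from the source.

```python
from collections import Counter, defaultdict


def run_learning(directory_batches, fetch_site, extract_features):
    """Build per-category learning information.

    directory_batches: iterable of lists of (category, url) pairs,
        one list per piece of directory information (S1, S2).
    fetch_site: callable url -> page text, or None if unreachable (S3).
    extract_features: callable page text -> Counter (S4).
    """
    learning_info = defaultdict(Counter)
    for batch in directory_batches:          # S5: repeat while batches remain
        for category, url in batch:
            page = fetch_site(url)           # S3: acquire site information
            if page is None:
                continue                     # assumed policy: skip failures
            # S4: extract features and accumulate them per category
            learning_info[category].update(extract_features(page))
    return dict(learning_info)
```

For a quick check without a network, `fetch_site` can simply be the `get` method of a dictionary that maps URLs to page text.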
[0039]
Steps S6 to S9 constitute the classification process. Here, the reference information (bookmark URLs) stored individually in the information client 101 is acquired, and feature extraction is performed on the corresponding Web site information. The resulting feature information is then compared with the learning information for each classification item generated in the learning process of steps S1 to S4, so that the classification item that best corresponds to the reference information (bookmark URL) is determined automatically.
[0040]
First, in step S6, the reference information processing unit 304 acquires bookmark information (reference information) held by the control unit 202 (Web browser) of the information client 101.
[0041]
Next, in step S7, the reference information processing unit 304 uses the server information processing unit 302 to acquire, via the network 105, the site information of the Web site corresponding to the bookmark information (reference information, URL) acquired in step S6, in the same manner as in step S3.
[0042]
In step S8, the feature extraction processing unit 306 first performs feature extraction processing on the site information acquired in step S7 and extracts feature information. Next, the classification processing unit 305 compares the feature information for the bookmark information with the learning information for each classification item generated in step S4 and calculates the similarity between the feature information and each piece of learning information. The classification item corresponding to the learning information with the maximum similarity among the calculated similarities is then determined as the classification item for the bookmark information.
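The comparison in step S8 can be sketched as below. The specification does not name a similarity measure, so cosine similarity between word-frequency vectors is used here purely as an assumption. The function names and the sample vectors are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(a, b):
    """Cosine similarity of two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)


def classify(features, learning_info):
    """Return the category with the maximum similarity, plus all scores."""
    scores = {
        category: cosine_similarity(features, vector)
        for category, vector in learning_info.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical learning information and bookmark features.
learning_info = {
    "Sports": Counter({"football": 2, "scores": 2, "tennis": 1}),
    "Cooking": Counter({"pasta": 1, "recipes": 1}),
}
bookmark_features = Counter({"football": 1, "scores": 1})
best_category, scores = classify(bookmark_features, learning_info)
```

Cosine similarity is insensitive to page length, which is one reason it is a common choice for this kind of comparison; any other measure returning a comparable score per category would fit the same structure.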
[0043]
In step S9, it is determined whether the processing of steps S6 to S8 has been executed for all of the bookmark information held by the control unit 202 (Web browser) of the information client 101. If it has, the process proceeds to step S10; if any bookmark information remains unprocessed, the process returns to step S6.
[0044]
In step S10, the classification item for each bookmark information obtained in step S8 is sent to the information client 101, and the information client 101 uses the classification item for bookmark information management.
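The control flow of steps S6 to S10 can be outlined in a short sketch. Network access is replaced by an injected `fetch` function so that the example runs offline, and a simple shared-word count stands in for the similarity calculation; both are assumptions, as are all names and URLs below.

```python
import re
from collections import Counter


def features(text):
    """Word-frequency vector of a page (assumed feature form)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(a, b):
    """Shared word mass, a simple stand-in for the similarity measure."""
    return sum(min(a[w], b[w]) for w in a.keys() & b.keys())


def classify_bookmarks(bookmarks, fetch, learning_info):
    """Map each bookmark URL to its best-matching category."""
    results = {}
    for url in bookmarks:            # S6, repeated until S9 finds none left
        page = fetch(url)            # S7: acquire the site information
        vector = features(page)      # S8: feature extraction
        results[url] = max(          # S8: category with maximum similarity
            learning_info, key=lambda c: overlap(vector, learning_info[c])
        )
    return results                   # S10: result returned to the client


# Hypothetical pages and learning information used as sample data.
fake_pages = {
    "http://a.example/": "Football scores tonight",
    "http://b.example/": "Pasta recipes for dinner",
}
learning_info = {
    "Sports": features("football scores tennis scores"),
    "Cooking": features("pasta recipes soup"),
}
result = classify_bookmarks(list(fake_pages), fake_pages.get, learning_info)
```

Passing `fetch` as a parameter keeps the loop independent of how site information is actually retrieved.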
[0045]
As described above, because bookmarks are classified automatically, the complexity of bookmark management that arises when a large number of bookmarks are used is eliminated, and usefulness can be improved without impairing the simplicity of the bookmark function.
[0046]
In addition, since the bookmarks stored in the device used by the user are classified using the classification characteristics of the server providing the directory service, appropriate and accurate classification is performed.
[0047]
[Other embodiments]
In the above embodiment, only one directory service server provides the directory service, but a plurality of directory service servers may exist. In this case, the contents of the respective directory services may be held as separate sets of learning data, or the information of all the directory services may be regarded collectively as one virtual directory service and handled as a single set of learning data.
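The second option, treating several directory services as one virtual directory, can be sketched as a merge of per-service learning data. The sketch assumes that learning data is a mapping from category name to a word-count vector and that categories from different services are matched by identical names; neither point is stated in the specification.

```python
from collections import Counter


def merge_learning_info(*sources):
    """Combine learning data from several directory services into one set."""
    merged = {}
    for source in sources:
        for category, vector in source.items():
            # Categories sharing a name are accumulated into one vector.
            merged.setdefault(category, Counter()).update(vector)
    return merged


# Hypothetical learning data from two directory services.
service_a = {"News": Counter({"election": 2}), "Sports": Counter({"goal": 1})}
service_b = {"News": Counter({"election": 1, "weather": 1})}
virtual_directory = merge_learning_info(service_a, service_b)
```

Keeping the per-service dictionaries unmerged corresponds to the first option, where each service has its own learning data.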
[0048]
Further, in the above-described embodiment, the single piece of classification identification information having the highest similarity is determined in step S8. Instead, a plurality of pieces of classification identification information may be determined, and the bookmark may be classified under each of them.
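One way to realize this variation is to keep the top few categories by similarity. The specification does not say how many categories are selected or by what rule, so the fixed count `k` and the function name `top_categories` are assumptions.

```python
def top_categories(scores, k=2):
    """Return the k category names with the highest similarity scores."""
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [category for category, _ in ranked[:k]]


# Hypothetical similarity scores for one bookmark.
selected = top_categories({"Sports": 0.9, "News": 0.4, "Cooking": 0.1})
```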
[0049]
Further, in the above-described embodiment, the single piece of classification identification information having the highest similarity is determined in step S8. Instead, a threshold value may be provided, and when no similarity exceeding the threshold is obtained, a result indicating that no classification identification information matches may be output.
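The threshold variation can be sketched as follows. The threshold value 0.2 is an arbitrary placeholder, and returning `None` to signal "no matching classification" is an assumed convention.

```python
def classify_with_threshold(scores, threshold=0.2):
    """Return the best category, or None if no score exceeds the threshold."""
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] > threshold else None


# Hypothetical similarity scores for two bookmarks.
matched = classify_with_threshold({"Sports": 0.9, "Cooking": 0.1})
unmatched = classify_with_threshold({"Sports": 0.15, "Cooking": 0.1})
```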
[0050]
In addition, the present invention may be applied to an apparatus including a plurality of devices (for example, a host computer, an interface device, and the like) or an apparatus including a single device. That is, the information automatic classification device 104 may be configured as the same device as at least one of the information client 101, the information server 102, and the directory service server 103.
[0051]
Further, the object of the present invention is also achieved by supplying a storage medium on which the program code of software realizing the functions of the above-described embodiments is recorded to a system or an apparatus, and by having a computer (or CPU or MPU) of the system or apparatus read and execute the program code stored in the storage medium. In this case, the program code itself read from the storage medium realizes the functions of the above-described embodiments, and the storage medium storing the program code constitutes the present invention.
[0052]
As the storage medium for supplying the program code, for example, a floppy (registered trademark) disk, hard disk, optical disk, magneto-optical disk, CD-ROM, CD-R, magnetic tape, nonvolatile memory card, or ROM can be used. The functions of the above-described embodiments are realized not only when the computer executes the read program code, but also when an OS (operating system) running on the computer performs part or all of the actual processing based on the instructions of the program code and that processing realizes the functions of the embodiments.
[0053]
Further, the present invention also includes the case where the program code read from the storage medium is written into a memory provided on a function expansion board inserted into the computer or in a function expansion unit connected to the computer, after which a CPU or the like provided on the function expansion board or in the function expansion unit performs part or all of the actual processing based on the instructions of the program code, and that processing realizes the functions of the above-described embodiments.
[0054]
When the present invention is applied to the storage medium, the storage medium stores program codes for executing processing corresponding to the flowcharts described above.
[0055]
As described above, various embodiments of the present invention have been shown and described. Examples of the embodiments of the present invention are listed below.
[0056]
[Embodiment 1] An information automatic classification device connected to a network environment including a client device and a directory service server that classifies information on a plurality of Web sites according to a predetermined classification rule and provides directory information, the device comprising:
directory information obtaining means for obtaining, via a network, the directory information provided by the directory service server;
extracting means for extracting, from the obtained directory information, classification item information and access information to the Web site corresponding to each classification item;
first site information acquiring means for accessing, using the extracted access information, the Web site corresponding to the access information and acquiring site information provided by the Web site;
first feature extracting means for analyzing the site information acquired by the first site information acquiring means and extracting features of the site information;
learning means for generating learning information by integrating, for each classification item, the feature information of the plurality of pieces of site information extracted by the first feature extracting means;
bookmark information acquiring means for acquiring, via the network, bookmark information held by the client device;
second site information acquiring means for accessing, using the acquired bookmark information, the Web site corresponding to the bookmark information and acquiring site information provided by the Web site;
second feature extracting means for analyzing the site information acquired by the second site information acquiring means and extracting features of the site information;
calculating means for comparing the feature information extracted by the second feature extracting means with the learning information for each classification item generated by the learning means, and calculating the similarity between the feature information and each piece of learning information; and
classification item determining means for determining, as the classification item for the bookmark information related to the site information acquired by the second site information acquiring means, the classification item corresponding to the learning information having the maximum similarity among the similarities calculated by the calculating means.
[0057]
[Embodiment 2] The information automatic classification device according to embodiment 1, wherein the directory information provided by the directory service server is information used when a user of the client device specifies a classification item corresponding to desired site information and searches for an appropriate Web site among a plurality of Web sites, and
the directory information obtaining means converts the data format of the directory information provided by the directory service server into a predetermined format.
[0058]
[Embodiment 3] The information automatic classification device according to embodiment 1, wherein the first and second feature extracting means extract document information from the site information acquired by the first and second site information acquiring means, respectively, and extract the features of the site information by analyzing the document information.
[0059]
[Embodiment 4] The information automatic classification device according to embodiment 3, wherein each of the first and second feature extracting means extracts at least the words contained in the extracted document information and performs the analysis using a vector having the words as components.
[0060]
[Embodiment 5] An information automatic classification method applied to an information automatic classification device connected to a network environment including a client device and a directory service server that classifies information on a plurality of Web sites according to a predetermined classification rule and provides directory information, the method comprising:
a directory information obtaining step of obtaining, via a network, the directory information provided by the directory service server;
an extraction step of extracting, from the obtained directory information, classification item information and access information to the Web site corresponding to each classification item;
a first site information obtaining step of accessing, using the extracted access information, the Web site corresponding to the access information and obtaining site information provided by the Web site;
a first feature extraction step of analyzing the site information obtained in the first site information obtaining step and extracting features of the site information;
a learning step of generating learning information by integrating, for each classification item, the feature information of the plurality of pieces of site information extracted in the first feature extraction step;
a bookmark information acquisition step of acquiring, via the network, bookmark information held by the client device;
a second site information obtaining step of accessing, using the acquired bookmark information, the Web site corresponding to the bookmark information and obtaining site information provided by the Web site;
a second feature extraction step of analyzing the site information obtained in the second site information obtaining step and extracting features of the site information;
a calculating step of comparing the feature information extracted in the second feature extraction step with the learning information for each classification item generated in the learning step, and calculating the similarity between the feature information and each piece of learning information; and
a classification item determination step of determining, as the classification item for the bookmark information related to the site information obtained in the second site information obtaining step, the classification item corresponding to the learning information having the maximum similarity among the similarities calculated in the calculating step.
[0061]
[Embodiment 6] The information automatic classification method according to embodiment 5, wherein the directory information provided by the directory service server is information used when a user of the client device specifies a classification item corresponding to desired site information and searches for an appropriate Web site among a plurality of Web sites, and
the directory information obtaining step converts the data format of the directory information provided by the directory service server into a predetermined format.
[0062]
[Embodiment 7] The information automatic classification method according to embodiment 5, wherein the first and second feature extraction steps extract document information from the site information obtained in the first and second site information obtaining steps, respectively, and extract the features of the site information by analyzing the document information.
[0063]
[Embodiment 8] The information automatic classification method according to embodiment 7, wherein each of the first and second feature extraction steps extracts at least the words contained in the extracted document information and performs the analysis using a vector having the words as components.
[0064]
[Embodiment 9] A program for causing a computer to execute an information automatic classification method applied to an information automatic classification device connected to a network environment including a client device and a directory service server that classifies information on a plurality of Web sites according to a predetermined classification rule and provides directory information, the information automatic classification method comprising:
a directory information obtaining step of obtaining, via a network, the directory information provided by the directory service server;
an extraction step of extracting, from the obtained directory information, classification item information and access information to the Web site corresponding to each classification item;
a first site information obtaining step of accessing, using the extracted access information, the Web site corresponding to the access information and obtaining site information provided by the Web site;
a first feature extraction step of analyzing the site information obtained in the first site information obtaining step and extracting features of the site information;
a learning step of generating learning information by integrating, for each classification item, the feature information of the plurality of pieces of site information extracted in the first feature extraction step;
a bookmark information acquisition step of acquiring, via the network, bookmark information held by the client device;
a second site information obtaining step of accessing, using the acquired bookmark information, the Web site corresponding to the bookmark information and obtaining site information provided by the Web site;
a second feature extraction step of analyzing the site information obtained in the second site information obtaining step and extracting features of the site information;
a calculating step of comparing the feature information extracted in the second feature extraction step with the learning information for each classification item generated in the learning step, and calculating the similarity between the feature information and each piece of learning information; and
a classification item determination step of determining, as the classification item for the bookmark information related to the site information obtained in the second site information obtaining step, the classification item corresponding to the learning information having the maximum similarity among the similarities calculated in the calculating step.
[0065]
[Embodiment 10] The program according to embodiment 9, wherein the directory information provided by the directory service server is information used when a user of the client device specifies a classification item corresponding to desired site information and searches for an appropriate Web site among a plurality of Web sites, and
the directory information obtaining step converts the data format of the directory information provided by the directory service server into a predetermined format.
[0066]
[Embodiment 11] The program according to embodiment 9, wherein the first and second feature extraction steps extract document information from the site information obtained in the first and second site information obtaining steps, respectively, and extract the features of the site information by analyzing the document information.
[0067]
[Embodiment 12] The program according to embodiment 11, wherein each of the first and second feature extraction steps extracts at least the words contained in the extracted document information and performs the analysis using a vector having the words as components.
[0068]
[Embodiment 13] A computer-readable storage medium storing, as a program, an information automatic classification method applied to an information automatic classification device connected to a network environment including a client device and a directory service server that classifies information on a plurality of Web sites according to a predetermined classification rule and provides directory information, the information automatic classification method comprising:
a directory information obtaining step of obtaining, via a network, the directory information provided by the directory service server;
an extraction step of extracting, from the obtained directory information, classification item information and access information to the Web site corresponding to each classification item;
a first site information obtaining step of accessing, using the extracted access information, the Web site corresponding to the access information and obtaining site information provided by the Web site;
a first feature extraction step of analyzing the site information obtained in the first site information obtaining step and extracting features of the site information;
a learning step of generating learning information by integrating, for each classification item, the feature information of the plurality of pieces of site information extracted in the first feature extraction step;
a bookmark information acquisition step of acquiring, via the network, bookmark information held by the client device;
a second site information obtaining step of accessing, using the acquired bookmark information, the Web site corresponding to the bookmark information and obtaining site information provided by the Web site;
a second feature extraction step of analyzing the site information obtained in the second site information obtaining step and extracting features of the site information;
a calculating step of comparing the feature information extracted in the second feature extraction step with the learning information for each classification item generated in the learning step, and calculating the similarity between the feature information and each piece of learning information; and
a classification item determination step of determining, as the classification item for the bookmark information related to the site information obtained in the second site information obtaining step, the classification item corresponding to the learning information having the maximum similarity among the similarities calculated in the calculating step.
[0069]
[Embodiment 14] The storage medium according to embodiment 13, wherein the directory information provided by the directory service server is information used when a user of the client device specifies a classification item corresponding to desired site information and searches for an appropriate Web site among a plurality of Web sites, and
the directory information obtaining step converts the data format of the directory information provided by the directory service server into a predetermined format.
[0070]
[Embodiment 15] The storage medium according to embodiment 13, wherein the first and second feature extraction steps extract document information from the site information obtained in the first and second site information obtaining steps, respectively, and extract the features of the site information by analyzing the document information.
[0071]
[Embodiment 16] The storage medium according to embodiment 15, wherein each of the first and second feature extraction steps extracts at least the words contained in the extracted document information and performs the analysis using a vector having the words as components.
[0072]
[Effects of the Invention]
As described in detail above, according to the present invention, classification item information and access information to the Web site corresponding to each classification item are extracted from the directory information provided by the directory service server, and the Web site is accessed using the access information to acquire the site information it provides. The acquired site information is analyzed to extract its features, and by repeating this, the feature information of the plurality of pieces of site information is integrated for each classification item to generate learning information. Meanwhile, the Web site is accessed using the bookmark information held by the client device to acquire its site information, and the acquired site information is analyzed to extract its features. The extracted feature information is compared with the learning information generated for each classification item, and the similarity between the feature information and each piece of learning information is calculated. The classification item corresponding to the learning information having the maximum similarity among the calculated similarities is determined as the classification item for the bookmark information.
[0073]
Thus, because the bookmarks are classified automatically, the complexity of bookmark management that arises when a large number of bookmarks are used is eliminated, and because the classification is accurate, usefulness can be improved without impairing the simplicity of the bookmark function.
[Brief description of the drawings]
FIG. 1 is a block diagram showing a basic configuration of a network system including an automatic information classification device according to a first embodiment of the present invention.
FIG. 2 is a block diagram illustrating a functional configuration of an information client.
FIG. 3 is a diagram showing a basic configuration of an automatic information classification device.
FIG. 4 is a flowchart illustrating a procedure of bookmark information classification processing executed by the information automatic classification device.
[Explanation of symbols]
101 Information client (client device)
102 Information server (Web site server)
103 Directory service server
104 Information automatic classification device
105 Network

Claims

An information automatic classification device connected to a network environment including a client device and a directory service server that classifies information on a plurality of Web sites according to a predetermined classification rule and provides directory information, the device comprising:
directory information obtaining means for obtaining, via a network, the directory information provided by the directory service server;
extracting means for extracting, from the obtained directory information, classification item information and access information to the Web site corresponding to each classification item;
first site information acquiring means for accessing, using the extracted access information, the Web site corresponding to the access information and acquiring site information provided by the Web site;
first feature extracting means for analyzing the site information acquired by the first site information acquiring means and extracting features of the site information;
learning means for generating learning information by integrating, for each classification item, the feature information of the plurality of pieces of site information extracted by the first feature extracting means;
bookmark information acquiring means for acquiring, via the network, bookmark information held by the client device;
second site information acquiring means for accessing, using the acquired bookmark information, the Web site corresponding to the bookmark information and acquiring site information provided by the Web site;
second feature extracting means for analyzing the site information acquired by the second site information acquiring means and extracting features of the site information;
calculating means for comparing the feature information extracted by the second feature extracting means with the learning information for each classification item generated by the learning means, and calculating the similarity between the feature information and each piece of learning information; and
classification item determining means for determining, as the classification item for the bookmark information related to the site information acquired by the second site information acquiring means, the classification item corresponding to the learning information having the maximum similarity among the similarities calculated by the calculating means.