JP2007080061A

JP2007080061A - Retrieval method of web page and clustering method of web page

Info

Publication number: JP2007080061A
Application number: JP2005268541A
Authority: JP
Inventors: Kazuhiko Kato; 和彦加藤; Mizuki Oka; 瑞起岡; Osamu Nakamura; 理中村
Original assignee: University of Tsukuba NUC
Current assignee: University of Tsukuba NUC
Priority date: 2005-09-15
Filing date: 2005-09-15
Publication date: 2007-03-29

Abstract

PROBLEM TO BE SOLVED: To provide a method for retrieving a desired Web page by clustering according to visual similarity.

SOLUTION: Visual information included in each Web page is vectorized on the basis of the page's thumbnail image. Principal component analysis is then performed on the visual-information vectors to obtain a feature vector for each of the plurality of Web pages. The feature vectors are clustered and the result is displayed. This enables image-based retrieval, which is not possible with clustering based on character information.

COPYRIGHT: (C)2007,JPO&INPIT

Description

The present invention relates to a method for searching for a target Web page among a plurality of Web pages, to a hierarchical clustering method used in carrying out that method, and to a feature vector extraction method.

Japanese Patent Laying-Open No. 2005-122690 (Patent Document 1) discloses a technique that analyzes the text information of a plurality of information items (Web pages) containing text information, obtains a feature vector for each information item, performs hierarchical clustering based on the feature vectors, and visually displays the hierarchical clustering result on a display screen.

Recently, clustering technology has begun to be used as an advanced function of Web search engines. For example, Vivisimo (trademark) [http://www.vivisimo] and Highlight (trademark) [http://highlit.njit.edu] hierarchically cluster ordinary Web search results on the basis of text information, automatically assign a label to each cluster through natural language processing, and display the result to the user. AltaVista (trademark) [http://www.altavista.com] likewise clusters the results of a search query and displays related words so that the user can narrow down the query. The Google (registered trademark) news site [http://news.google.com] automatically collects articles from many news sites, classifies them into appropriate topics, and displays them.

Web search can be classified into two types: browsing (a directory service such as that provided by Yahoo (registered trademark)) and querying (Google or AltaVista). Browsing is effective when the information being sought is vague, and querying is effective when the information being sought is clearly defined. Most searches, however, are thought to fall somewhere between browsing and querying. Clustering the results of querying therefore provides a function intermediate between the two, and helps the user find the desired page more efficiently.

Patent Document 1: JP 2005-122690 A

What Patent Document 1 and the existing clustering services for Web search results have in common is that their processing is based on text. Semantic information such as text is important for clustering. However, in addition to text, a Web page contains many visual features, that is, non-semantic information, such as images, URLs, and layout. If we assume that a user who visits querying results one after another decides almost instantly which pages to stay on and for how long, that decision can be regarded as using not only the text information (character information) but also the impression of the page as an image. Conventional clustering, however, has not taken this impression of the page as an image into account. It has therefore been difficult to obtain a desired Web page on the basis of its image. In addition, with clustering based on character information, the possibility of finding a desired Web page among unexpected Web pages is extremely low.

An object of the present invention is to provide a method for searching for a desired Web page by performing clustering based on visual similarity.

Another object of the present invention is to provide a method for searching for a desired Web page among Web pages by performing clustering using both visual information and character information.

Another object of the present invention is to provide a Web page clustering method that uses visual information and character information.

A further object of the present invention is to provide a method for extracting feature vectors of Web pages using visual information and character information.

Clustering techniques from the information retrieval field are used to help users efficiently find a target page among the results of a Web search engine and to give a bird's-eye view of the search results. Text is often used as the Web page feature for clustering. However, in addition to text (textual information), a Web page contains a great deal of non-semantic information (visual information) such as images, URLs, and layout. When users actually classify Web pages, they are thought to be guided, at least implicitly, by the sense that Web pages with a similar visual impression are related. On this view, users filter pages by visual (image) impression before reading individual Web pages in detail. The present invention is based on this idea.

Accordingly, in a method that performs clustering based on a plurality of feature vectors obtained from a plurality of Web pages and searches for a desired Web page among them based on the clustering result, the present invention clusters the Web pages by visual similarity. To this end, the visual information included in each Web page is first vectorized. Next, principal component analysis is performed on the visual-information vectors to obtain a feature vector for each of the Web pages. Clustering these feature vectors enables retrieval based on the page image, which is not possible with clustering based on character information. It also makes it possible to find a desired Web page among Web pages that would never be encountered with clustering based on character information.

The present invention does not rule out the use of character information. Clustering may of course take both character information and visual information into account. In that case, when clustering is performed based on feature vectors obtained from a plurality of Web pages and a desired Web page is searched for based on the clustering result, the character information included in each Web page is vectorized and the visual information included in each Web page is vectorized. The character-information vector and the visual-information vector are then concatenated into a single concatenated vector, and principal component analysis is performed on the concatenated vectors to obtain a feature vector for each Web page. This increases the chance of seeing Web pages that would not be encountered with conventional clustering based on character information alone. As a result, it becomes more likely that a desired Web page that could not easily be obtained by conventional methods can be found.

The visual information may be obtained in any manner, but it can be extracted more accurately by applying color histogram extraction, edge distribution extraction, a two-dimensional fast Fourier transform (FFT), a two-dimensional wavelet transform, or the like to the Web page. The character information may likewise be obtained in any manner; for example, it can be obtained by acquiring keyword positions and/or a word histogram from the Web page.
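As one illustration of the color histogram option, the following is a minimal sketch and not the method of the specification. The function name `color_histogram`, the choice of 4 bins per channel, and the sample pixels are all assumptions made for the example.

```python
from typing import Iterable, List, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each in 0..255


def color_histogram(pixels: Iterable[Pixel], bins_per_channel: int = 4) -> List[float]:
    """Return a normalized joint RGB histogram with bins_per_channel**3 entries."""
    width = 256 / bins_per_channel

    def bin_of(value: int) -> int:
        # Clamp so that value 255 always falls in the last bin.
        return min(int(value / width), bins_per_channel - 1)

    hist = [0] * (bins_per_channel ** 3)
    count = 0
    for r, g, b in pixels:
        idx = (bin_of(r) * bins_per_channel + bin_of(g)) * bins_per_channel + bin_of(b)
        hist[idx] += 1
        count += 1
    # Normalize so that thumbnails of different sizes are comparable.
    return [h / count for h in hist] if count else [0.0] * len(hist)


# Hypothetical 4-pixel thumbnail: two reddish pixels, one white, one black.
thumb = [(255, 0, 0), (250, 10, 5), (255, 255, 255), (0, 0, 0)]
hist = color_histogram(thumb)
```

The resulting list can serve directly as a visual-information vector, or as one part of a longer vector.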

When the clustering result is displayed on a display screen using a plurality of symbols representing the Web pages, the closeness of the Web pages, as judged from their feature vectors, may be indicated by the closeness of the color shades given to the symbols. This makes it easy to search visually for a desired Web page using the colors of the symbols on the screen as a guide.

Furthermore, when the clustering result is displayed on the display screen using a plurality of symbols representing the Web pages in such a way that the clusters can be distinguished, a representative Web page of each displayed cluster may be shown on the screen as a thumbnail. This allows each cluster to be grasped as an image and makes it easy to search visually for a desired Web page.

If each cluster is represented by a single Web page, displaying only those representative Web pages also makes it possible to present the user with an overall picture of the search results as a bird's-eye view.

According to the present invention, obtaining feature vectors from visual-information vectors and clustering them enables retrieval based on the page image, which is not possible with clustering based on character information. It also makes it possible to find a desired Web page among Web pages that would never be encountered with clustering based on character information. Moreover, if clustering takes both character information and visual information into account, it becomes more likely that a desired Web page that could not easily be obtained by conventional methods can be found.

DESCRIPTION OF THE PREFERRED EMBODIMENTS

Embodiments in which the Web page search method, the Web page clustering method, and the feature vector extraction method of the present invention are carried out on a computer are described in detail below with reference to the drawings. The present invention starts from the observation that users connected to the Internet, when classifying Web pages, are implicitly guided by the sense that Web pages with a similar visual impression are related. It is characterized by extracting features from the images of Web pages and clustering them. Because a full Web page involves a large amount of data, the present embodiment works with thumbnail images, that is, Web pages reduced in size for list display.

To search for a desired Web page among a plurality of Web pages, a search engine available over the Internet is used. In previously proposed Web page search methods that use search engine results, clustering is performed based on feature vectors obtained from the Web pages, and a desired Web page is searched for based on the clustering result. In the present embodiment, the Web pages are clustered specifically by visual similarity. To this end, the visual information included in each Web page is first vectorized. Specifically, the thumbnail image of the Web page is vectorized using a known vectorization technique. Next, principal component analysis is performed on the visual-information vectors of the Web pages to obtain a feature vector for each Web page. The vectorization and the acquisition of feature vectors are all performed on a computer.
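The specification leaves the vectorization technique open. One simple possibility, shown here only as a sketch, is to flatten the RGB pixel values of the thumbnail into a single vector. The function name `thumbnail_to_vector` and the scaling of channel values to [0, 1] are assumptions of this example.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each in 0..255


def thumbnail_to_vector(rows: Sequence[Sequence[Pixel]]) -> List[float]:
    """Flatten an H x W grid of RGB pixels into a vector of M = 3*H*W values in [0, 1]."""
    return [channel / 255.0 for row in rows for pixel in row for channel in pixel]


# Hypothetical 2 x 2 thumbnail.
thumb = [[(255, 0, 0), (0, 255, 0)],
         [(0, 0, 255), (255, 255, 255)]]
x = thumbnail_to_vector(thumb)
```

All thumbnails must share the same height and width so that every vector has the same dimension M.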

Visual information is vectorized as follows. FIG. 1 is a diagram used to explain an example of extracting feature vectors from the visual information of Web pages. The image of a Web page contains all the parts that make up the page, such as frames, text, and advertisement bars. When extracting features from the image, one could focus on the shapes and arrangement of these parts. However, extracting the parts from the image is not easy. In the present embodiment, therefore, the thumbnail image itself is treated as a pattern instead of focusing on such parts, and features are extracted using principal component analysis, a statistical pattern recognition technique. The application of principal component analysis to images was proposed by Turk et al. in the paper "Eigenfaces for Recognition" (Journal of Cognitive Neuroscience, 3(1), pp. 71-86 (1991)). That paper discloses the Eigenface technique, which is now widely used in face image recognition. The technique is applied not only to images but also, more generally, to removing redundancy from information in vector form, and it is a typical feature extraction technique. Principal component analysis takes the statistics of multivariate data, constructs new variables expressed as linear combinations of the original ones, and summarizes the data in mutually uncorrelated "principal components".

FIG. 1 shows the procedure (steps carried out on a computer) for extracting feature vectors from the visual information in the thumbnail images of Web pages using principal component analysis. First, the i-th of N color thumbnail images, expressed in red, green, and blue (RGB) or in HSV [H (hue), S (saturation), V (value)], is represented as an M-dimensional vector $x_i$ in which the pixel values (vector elements) are arranged in order. The mean vector of the N images is

$$\bar{x} = \frac{1}{N} \sum_{i=1}^{N} x_i \tag{1}$$

and the vector obtained by subtracting the mean vector from each image is

$$\tilde{x}_i = x_i - \bar{x}. \tag{2}$$

The set of mean-subtracted images is written as the matrix

$$X = [\tilde{x}_1, \ldots, \tilde{x}_N]. \tag{3}$$

The orthonormal basis $U = [u_1, \ldots, u_L]$ that optimally approximates the image set in the mean-square-error sense is obtained from the covariance matrix of $X$,

$$V = \frac{1}{N} X X^{T}, \tag{4}$$

as the solution of the eigenvalue problem

$$V u = \lambda u, \tag{5}$$

where $\lambda$ is an eigenvalue. $U$ consists of the $L$ eigenvectors taken in descending order of eigenvalue. An eigenvector $u$ obtained in this way is called an eigenvector or an eigen-Webpage (EigenWebpage). The representation $y_i$ of an image $x_i$ in the eigenspace can then be computed as

$$y_i = U^{T} \tilde{x}_i. \tag{6}$$

Such a $y_i$ is called a feature vector. Each component of $y_i$ can be interpreted as the contribution of the corresponding eigen-Webpage to representing the image $x_i$. From the singular value decomposition of the matrix $X$, if we consider the eigenvalue problems

$$\frac{1}{N} X X^{T} u = \lambda u \tag{7}$$

and

$$\frac{1}{N} X^{T} X v = \lambda v, \tag{8}$$

then the following relation holds between $u$ and $v$ (with $v$ of unit length):

$$u = \frac{1}{\sqrt{N \lambda}} X v. \tag{9}$$

Therefore, when the number of images N is smaller than the image size M, it suffices to solve the eigenvalue problem of [Equation 8] and then compute the eigenvectors u from its solutions. In general, the image size M (the number of vector elements after vectorization using RGB, HSV, or the like) is considerably larger than the number N of images used for learning, so solving the N x N eigenvalue problem for v, rather than the M x M eigenvalue problem for the covariance matrix V, reduces the required computation.
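The small-matrix shortcut can be sketched with NumPy as follows. This is an illustrative sketch under stated assumptions, not the implementation of the specification: the function name `eigen_webpages`, the 1/N scaling of the covariance, and the random test data are assumptions. The eigenvectors u are recovered from v by normalizing X v to unit length, which requires the selected eigenvalues to be nonzero.

```python
import numpy as np


def eigen_webpages(images: np.ndarray, num_components: int):
    """images: (N, M) array with one flattened thumbnail per row.

    Returns (mean, U, eigvals): U is (M, L) with orthonormal columns and
    eigvals holds the matching eigenvalues of the M x M covariance matrix.
    """
    mean = images.mean(axis=0)
    X = (images - mean).T                       # (M, N); columns are centered images
    n = X.shape[1]
    # Solve the small N x N problem instead of the M x M one.
    eigvals, V = np.linalg.eigh(X.T @ X / n)    # eigh returns ascending eigenvalues
    order = np.argsort(eigvals)[::-1][:num_components]
    eigvals, V = eigvals[order], V[:, order]
    # Map each v back to an M-dimensional eigenvector u = X v / ||X v||.
    U = X @ V
    U = U / np.linalg.norm(U, axis=0)
    return mean, U, eigvals


# Hypothetical data: N = 5 thumbnails with M = 50 elements each.
rng = np.random.default_rng(0)
images = rng.random((5, 50))
mean, U, eigvals = eigen_webpages(images, num_components=3)
```

Because the centered data matrix has rank at most N - 1, `num_components` should not exceed N - 1.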

In this way, the learning eigenvectors used to obtain feature vectors are obtained. Next, new thumbnails (Web pages) are selected one after another from the Web pages obtained as thumbnail data, and each of these thumbnails is likewise vectorized using RGB, HSV, or the like. The inner products of the resulting vector with the eigenvectors described above are then computed, giving the feature vector of each new thumbnail (Web page) in turn.
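The projection step reduces to one matrix-vector product. The sketch below assumes that the mean of the learning images is subtracted before taking the inner products, as in the eigenspace representation; the function name `feature_vector` and the toy basis are illustrative only.

```python
import numpy as np


def feature_vector(x: np.ndarray, mean: np.ndarray, U: np.ndarray) -> np.ndarray:
    """Project a vectorized thumbnail x (length M) onto the L eigenvectors in U (M x L)."""
    return U.T @ (x - mean)


# Toy example with M = 4 and L = 2; the columns of U are orthonormal.
U = np.array([[1.0, 0.0],
              [0.0, 0.0],
              [0.0, 1.0],
              [0.0, 0.0]])
mean = np.ones(4)
x = np.array([3.0, 1.0, 0.0, 1.0])
y = feature_vector(x, mean, U)
```

Each entry of `y` is the inner product of the centered thumbnail with one eigenvector.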

Next, hierarchical clustering is performed based on the feature vectors obtained from the visual features of the Web pages (thumbnails) as described above. Hierarchical clustering can follow a top-down approach, which starts from a single cluster containing all samples and successively splits clusters, or a bottom-up approach, which starts from N clusters of one sample each and successively merges clusters to form the clusters at each level. With the bottom-up approach, the number of merge operations grows when there are many samples and few clusters are wanted, so the present embodiment uses the top-down approach, as shown in FIG. 2. The k-means method is used for this clustering. The k-means method classifies N samples into k clusters. First, the mean vectors (μ_1, μ_2, ..., μ_k) of the k clusters are initialized by selecting them at random from the samples. Then each of the N samples is assigned to the nearest μ_i, each μ_i is recalculated, and these two steps are repeated until the μ_i no longer change. Euclidean distance was used for the distance calculation. In the present embodiment, k is the number of thumbnail images displayed on one screen at a time, and it was set to k = 9 for ease of viewing.
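The k-means procedure described above can be sketched in plain Python as follows. The function name `kmeans`, the `seed` and `max_iter` parameters, and the sample points are assumptions of this example. The loop stops when the assignments no longer change, which implies that the mean vectors no longer change either.

```python
import math
import random
from typing import List, Sequence

Vector = Sequence[float]


def kmeans(samples: List[Vector], k: int, seed: int = 0, max_iter: int = 100) -> List[int]:
    """Return a cluster index (0..k-1) for every sample."""
    rng = random.Random(seed)
    # Initialize the k mean vectors with randomly chosen samples.
    means = [list(s) for s in rng.sample(samples, k)]
    labels = [-1] * len(samples)
    for _ in range(max_iter):
        # Assignment step: nearest mean by Euclidean distance.
        new_labels = [min(range(k), key=lambda j: math.dist(s, means[j])) for s in samples]
        if new_labels == labels:
            break  # assignments are stable, so the means are stable too
        labels = new_labels
        # Update step: recompute each mean from its current members.
        for j in range(k):
            members = [s for s, label in zip(samples, labels) if label == j]
            if members:
                means[j] = [sum(c) / len(members) for c in zip(*members)]
    return labels


# Two well-separated groups of hypothetical 2-D feature vectors.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
labels = kmeans(points, k=2)
```

For the top-down hierarchy, the same function can be applied again to the members of any one cluster to produce the next level down.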

FIG. 3 shows the flow of an example in which the embodiment of the Web page search method of the present invention was actually carried out. First, a search query (data extraction condition) is entered into an existing search engine such as Google (registered trademark), and thumbnail images of the Web pages returned as search results are collected as learning thumbnail images. This state corresponds to the thumbnail data shown in the left area of FIG. 4 (or the right area of FIG. 1). Next, the visual information in each thumbnail image is vectorized (Z_1, ..., Z_N). Feature vectors are then extracted from the thumbnail images using principal component analysis (learning feature vectors f_1, ..., f_N are obtained for the thumbnail images, as shown in FIG. 4). Next, a test thumbnail image is vectorized (Z_t), and the inner products of this vector Z_t with the previously obtained learning feature vectors f_1, ..., f_N are computed to give the feature vector f_t of that thumbnail image. Feature vectors are obtained in this way for all test thumbnail images.

Next, the k-means method described above is used to build a hierarchical clustering of the feature vectors obtained from the test thumbnail images (k = 9 in this example). When the result is displayed to the user, the page with the highest search-result rank in each of the k clusters is taken as the representative of that cluster, and its thumbnail image is displayed (see the left side of FIG. 5). By clicking the representative thumbnail image of a cluster, the user can view the clusters one level down within that cluster (see the right side of FIG. 5).
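Choosing the representative of each cluster can be sketched as follows, assuming that rank 1 denotes the top search result. The function name `cluster_representatives` and the sample labels and ranks are hypothetical.

```python
from typing import Dict, List


def cluster_representatives(labels: List[int], ranks: List[int]) -> Dict[int, int]:
    """Map each cluster label to the index of its best-ranked page (rank 1 = top result)."""
    best: Dict[int, int] = {}
    for idx, (label, rank) in enumerate(zip(labels, ranks)):
        if label not in best or rank < ranks[best[label]]:
            best[label] = idx
    return best


# Five hypothetical pages: cluster label and search-result rank of each.
labels = [0, 1, 0, 2, 1]
ranks = [3, 1, 2, 5, 4]
reps = cluster_representatives(labels, ranks)
```

The thumbnails of the pages indexed by `reps` are the ones shown on the top-level screen.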

Next, the results of an experiment carrying out the above embodiment on the top 200 Google search results for the query "search engine" are described. The left side of FIG. 5 shows the result at hierarchy level 1. The display screen showed, together with the representative thumbnail image of each cluster, the page title, the URL, and the total number of pages in the cluster. The representative Web pages of the clusters, in descending order of total page count, were as follows.

Search engine site AltaVista (117)
Search engine site Lycos (36)
Webcrawler, a site that returns results from multiple search engines (19)
FreeFind, which provides a site-internal full-text search function (11)
Metasearch engine site Kartoo (10)
Go.com, the Walt Disney Internet Group site (3)
Art search engine site Artcyclopedia (2)
Medical World Search, which provides a medical search engine (1)
Trivia site Coolquiz (1)

The right side of FIG. 5 shows the clusters one level down (hierarchy level 2) within the AltaVista cluster, which was the largest cluster (117 pages in total). The representative Web pages of these clusters, in descending order of total page count, were as follows.

Vivisimo, a metasearch engine site that clusters search results (36)
Search engine site AltaVista (35)
Education World, a site that searches only education-related pages (19)
Hotbot, the Wired Ventures search engine site (9)
Search Engine Showdown, a site that publishes evaluations of search engines (7)
Music search engine site MusicRobot (6)
Metasearch engine site Search.com (3)
Ask Jeeves, a search engine site that accepts queries in question form (3)
Computer Science site of the University of Strathclyde (1)

A detailed analysis of the pages in these clusters confirmed that pages with similar layout and color usage had been grouped into the same cluster. For example, the AltaVista cluster at hierarchy level 2 contained Web pages whose visual impression is very similar, such as the search engine sites Google, AlltheWeb, Mamma Metasearch, and Picsearch.

For comparison, the same query "search engine" was entered into Vivisimo, which clusters on the basis of text, and 222 search results were obtained. The top-level clusters were as follows.

Search Engine Optimization (73)
Submission (39)
Features (14)
Internet Search (16)
Metasearch (15)
Search engines and directories (11)
Business (9)
Yahoo (11)
Techniques (5)
Search Advertising (4)

In the Search Engine Optimization and Metasearch clusters, the pages match their labels. However, search engine sites such as Google, Yahoo, and Lycos, which users are most likely looking for, are classified into the Search Advertising cluster, a result that can hardly be called intuitive. Note, however, that Vivisimo clusters search results from search engines such as MSN, Lycos, Looksmart, Wisenut, Open Directory, and Overture, and does not use search results from Google. It was therefore difficult to make an objective evaluation.

In the above embodiment, the thumbnail information of Web pages is converted into feature vectors through principal component analysis, and these are used to cluster the search results. With query-based search in conventional search engines, the number of results returned is often so large that it is difficult for the user to reach the desired page. In the next embodiment, therefore, both the character information and the visual information of Web pages are used to make searching easier.

In this embodiment, as shown in FIG. 6, the character information included in each Web page is vectorized and the visual information included in each Web page is vectorized. The character-information vector and the visual-information vector are then concatenated into a single concatenated vector, and principal component analysis is performed on the concatenated vectors to obtain a feature vector for each Web page. The visual information may be obtained in any manner. It can be extracted more accurately by applying color histogram extraction, edge distribution extraction, a two-dimensional fast Fourier transform (FFT), a two-dimensional wavelet transform, or the like to the Web page. The character information may likewise be obtained in any manner. For example, it may be obtained by acquiring keyword positions, a word histogram, term frequency-inverse document frequency (tf-idf) values, or the like from the Web page.
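The concatenation step can be sketched as follows, using a word histogram for the character information. The function names, the vocabulary, the whitespace tokenization, and the sample visual vector are assumptions of this example; the specification does not state how the two parts should be scaled relative to each other.

```python
from collections import Counter
from typing import List, Sequence


def word_histogram(text: str, vocabulary: Sequence[str]) -> List[float]:
    """Relative frequency of each vocabulary word in the text (whitespace tokenization)."""
    counts = Counter(text.lower().split())
    total = sum(counts[w] for w in vocabulary)
    return [counts[w] / total if total else 0.0 for w in vocabulary]


def concatenate(text_vec: Sequence[float], visual_vec: Sequence[float]) -> List[float]:
    """Join the character-information and visual-information vectors end to end."""
    return list(text_vec) + list(visual_vec)


vocab = ["search", "engine", "music"]            # hypothetical vocabulary
text_vec = word_histogram("Search engine for music search", vocab)
visual_vec = [0.5, 0.25, 0.25]                   # e.g. a 3-bin color histogram
z = concatenate(text_vec, visual_vec)
```

The concatenated vectors `z` of all pages then take the place of the image vectors as input to principal component analysis.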

The principal component analysis based on the concatenated vectors may be performed in the same manner as in the previous embodiment. Clustering is then performed based on the feature vectors obtained by the principal component analysis. This increases the opportunity to see Web pages that would not be encountered with conventional clustering based on character information alone. As a result, it becomes more likely that a desired Web page that could not easily be found by the conventional method can be retrieved.
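As a rough illustration of this step, the sketch below projects concatenated vectors onto their leading principal axes and then clusters the resulting feature vectors. It is written under several assumptions that do not come from the description: the principal axes are found by power iteration with deflation, the clustering is single-linkage agglomerative clustering stopped at a fixed number of clusters, and the four sample vectors are invented.

```python
import math

def pca_features(vectors, n_components=2, iters=200):
    """Project mean-centered vectors onto the leading principal axes."""
    n, d = len(vectors), len(vectors[0])
    mean = [sum(v[j] for v in vectors) / n for j in range(d)]
    X = [[v[j] - mean[j] for j in range(d)] for v in vectors]
    cov = [[sum(x[a] * x[b] for x in X) / n for b in range(d)] for a in range(d)]
    axes = []
    for k in range(n_components):
        # Fixed start vector; a sketch-level choice that could fail if it
        # happened to be orthogonal to the axis being sought.
        w = [1.0 / (j + 1 + k) for j in range(d)]
        for _ in range(iters):
            w = [sum(cov[a][b] * w[b] for b in range(d)) for a in range(d)]
            norm = math.sqrt(sum(c * c for c in w))
            if norm < 1e-12:          # no variance left in the data
                w = [0.0] * d
                break
            w = [c / norm for c in w]
        lam = sum(w[a] * cov[a][b] * w[b] for a in range(d) for b in range(d))
        # Deflation: remove the axis just found from the covariance matrix.
        cov = [[cov[a][b] - lam * w[a] * w[b] for b in range(d)] for a in range(d)]
        axes.append(w)
    return [[sum(x[j] * w[j] for j in range(d)) for w in axes] for x in X]

def agglomerative(points, n_clusters):
    """Single-linkage agglomerative clustering down to n_clusters groups."""
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        return min(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        pairs = [(i, j) for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        i, j = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        merged = clusters.pop(j)      # j > i, so index i is unaffected
        clusters[i].extend(merged)
    return clusters

# Hypothetical concatenated vectors for four pages forming two groups.
vectors = [[1.0, 0.0, 0.9, 0.1], [0.9, 0.1, 1.0, 0.0],
           [0.0, 1.0, 0.1, 0.9], [0.1, 0.9, 0.0, 1.0]]
features = pca_features(vectors, n_components=2)
clusters = agglomerative(features, n_clusters=2)
print(sorted(sorted(c) for c in clusters))
```

Stopping the merging at a fixed cluster count is only one way to cut the hierarchy. Keeping the full merge history would preserve the hierarchical structure for display.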

FIG. 7 is a diagram showing a schematic example of displaying the clustering result on a display screen. In this example, the clustering result is displayed on the display screen using a plurality of symbols (circular figures) representing a plurality of Web pages. In FIG. 7, differences in color are rendered as shades of gray. The Web pages belonging to each of clusters 1 to m are displayed as a group of symbols. The symbols are positioned in proportion to the distances between clusters. In this example, the closeness in distance between the Web pages, as determined from the feature vectors, is indicated by the closeness in hue of the colors applied to the symbols. Thus, even when two symbols belong to different clusters, similar hues indicate that the pages are close in terms of either the visual information or the character information. This makes it easy to search visually for a desired Web page using the colors of the symbols displayed on the screen as a guide.
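One possible way to assign symbol colors so that nearby pages receive similar hues is sketched below. The description does not specify the mapping, so everything here is an assumption: the hue is derived only from the first component of each feature vector, the hue range is capped at 0.8 so that the two ends of the scale do not both appear red, and the function name and sample feature vectors are hypothetical.

```python
import colorsys

def symbol_colors(features, max_hue=0.8):
    """Map each page's first feature component to a hex color by hue."""
    scores = [f[0] for f in features]
    lo, hi = min(scores), max(scores)
    span = (hi - lo) or 1.0           # avoid division by zero
    colors = []
    for s in scores:
        hue = max_hue * (s - lo) / span
        r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
        colors.append("#{:02x}{:02x}{:02x}".format(
            round(r * 255), round(g * 255), round(b * 255)))
    return colors

# Hypothetical feature vectors: pages 0 and 1 are close, page 2 is far away.
features = [[-0.9, 0.1], [-0.8, -0.1], [0.9, 0.0]]
colors = symbol_colors(features)
print(colors)
```

Because the hue depends on the feature vector rather than on cluster membership, two pages in different clusters can still receive similar colors, which is the effect the display relies on.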

Also in this example, representative Web pages in the plurality of clusters displayed on the screen are shown on the screen as thumbnails. Each cluster can therefore be grasped as an image, which makes it easy to search visually for a desired Web page.
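The description does not say how a representative Web page is chosen for each cluster. As one plausible sketch, the member whose feature vector lies closest to the cluster centroid could be used. The function name, the selection rule, and the sample data below are assumptions made for illustration.

```python
import math

def representative_pages(features, clusters):
    """For each cluster, return the index of the page nearest its centroid."""
    reps = []
    for members in clusters:
        dim = len(features[members[0]])
        centroid = [sum(features[i][d] for i in members) / len(members)
                    for d in range(dim)]
        reps.append(min(members,
                        key=lambda i: math.dist(features[i], centroid)))
    return reps

# Hypothetical feature vectors and cluster assignments for six pages.
features = [[0.0, 0.0], [1.0, 0.0], [0.4, 0.0],
            [5.0, 5.0], [5.0, 6.0], [5.0, 5.4]]
clusters = [[0, 1, 2], [3, 4, 5]]
reps = representative_pages(features, clusters)
print(reps)
```

The thumbnails of the returned pages would then be drawn next to their clusters on the screen.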

FIG. 1 is a diagram used to explain an example of extracting feature vectors from the visual information of Web pages using principal component analysis.
FIG. 2 is a diagram showing the structure of hierarchical clustering.
FIG. 3 shows the flow of an example in which an embodiment of the Web page search method of the present invention was actually carried out.
FIG. 4 is a diagram used to explain the concept of obtaining feature vectors by principal component analysis.
FIG. 5 is a diagram showing an example of a display mode for search results obtained by clustering.
FIG. 6 is a diagram used to explain the concept of another embodiment of the present invention.
FIG. 7 is a diagram showing another example of a display mode for search results obtained by clustering.

Claims

A Web page search method for performing clustering based on a plurality of feature vectors respectively obtained from a plurality of Web pages, and searching for a desired Web page from among the plurality of Web pages based on the result of the clustering, wherein:
visual information contained in the Web pages is vectorized; and
principal component analysis is performed based on the visual information vectors to obtain the plurality of feature vectors for the plurality of Web pages.

A Web page search method for performing clustering based on a plurality of feature vectors respectively obtained from a plurality of Web pages, and searching for a desired Web page from among the plurality of Web pages based on the result of the clustering, wherein:
character information contained in the Web pages is vectorized;
visual information contained in the Web pages is vectorized;
the character information vector and the visual information vector are concatenated into a concatenated vector; and
principal component analysis is performed based on the concatenated vectors to obtain the plurality of feature vectors for the plurality of Web pages.

The Web page search method according to claim 1, wherein the visual information is obtained by performing color histogram extraction, edge distribution extraction, two-dimensional fast Fourier transform, two-dimensional wavelet transform, or the like on the Web page.

The Web page search method according to claim 2, wherein the character information is obtained by acquiring keyword positions and/or a word histogram from the Web page.

The Web page search method according to claim 2, wherein, when the clustering result is displayed on a display screen using a plurality of symbols representing the plurality of Web pages, the closeness in distance between the plurality of Web pages, as determined from the feature vectors, is indicated by the closeness in hue of the colors applied to the symbols.

The Web page search method according to claim 1, wherein, when the clustering result is displayed on a display screen using a plurality of symbols representing the plurality of Web pages so that clusters can be identified, representative Web pages in the plurality of clusters displayed on the screen are displayed as thumbnails on the screen.

A Web page clustering method for performing clustering based on a plurality of feature vectors respectively obtained from a plurality of Web pages, wherein:
character information contained in the Web pages is vectorized;
visual information contained in the Web pages is vectorized;
the character information vector and the visual information vector are concatenated into a concatenated vector; and
principal component analysis is performed based on the concatenated vectors to obtain the plurality of feature vectors for the plurality of Web pages.

A feature vector extraction method for extracting a plurality of feature vectors from a plurality of Web pages, wherein:
character information contained in the Web pages is vectorized;
visual information contained in the Web pages is vectorized;
the character information vector and the visual information vector are concatenated into a concatenated vector; and
principal component analysis is performed based on the concatenated vectors to obtain the plurality of feature vectors for the plurality of Web pages.