JP4915021B2 - Search device and control method of search device

Info

Publication number: JP4915021B2
Application number: JP2008232667A
Authority: JP
Inventors: 毅司増山
Original assignee: Yahoo Japan Corp
Current assignee: Yahoo Japan Corp
Priority date: 2008-09-10
Filing date: 2008-09-10
Publication date: 2012-04-11
Anticipated expiration: 2028-09-10
Also published as: JP2010067005A

Description

The present invention relates to a technique for appropriately reordering the display order of a search result list in a search service for Web pages and the like, according to, for example, the degree of harmfulness of the resources hit by the search.

Today, a wide variety of Web pages are published on the Internet, and users can easily obtain desired information by accessing them from a terminal. On the other hand, many Web pages contain adult or violent content, and the fact that minors and others can easily access such harmful Web pages has been widely discussed as a serious problem in recent years.

To control access to such harmful Web pages, a technique known as “filtering” is available. In filtering, for example, filtering software in which NG words (prohibited words) and URLs have been registered in advance is installed on the terminal. Whether a Web page accessed by the terminal is a harmful Web site is then judged from the result of a matching process that checks whether the page contains a registered NG word or belongs to the domain of a registered URL. In the case of an Internet search server device, as disclosed for example in Patent Document 1, pages judged to belong to harmful Web sites are excluded from the search results.
Patent Document 1: JP 2007-128119 A

However, filtering techniques that rely on registered NG words have the problem that every Web page containing an NG word is removed from the search results. For example, if “suicide” is set as an NG word, pages describing suicide prevention methods or offering counseling are also dropped from the search results, so access to such wholesome Web pages may be blocked as well.

In addition, a configuration that excludes every site suspected of being harmful on the basis of NG words and the like, regardless of its degree of harmfulness, in other words a configuration that classifies Web pages into two classes and keeps or discards them, is particularly ill-suited to an Internet search service. This is because the primary purpose of an Internet search service is to provide users with the widest possible range of desired information. Blocking access to clearly harmful Web pages is meaningful, but if so-called “gray zone” Web pages, whose harmfulness is ambiguous, are excluded as well, that primary purpose is defeated for no good reason.

Moreover, although it is important to control access to harmful Web pages by young children with limited judgment, such a configuration may deprive those who already have a certain degree of judgment of the opportunity to develop the ability to select information for themselves, an ability needed in the modern information society.

To solve the above problems, the present invention provides a search device that has a function for judging the harmfulness and the like of a Web page by “vector comparison”, which is more effective than judgment by simple NG-word matching, and a further function for judging the degree of harmfulness in graded steps using labels that indicate a value index. With these functions, a Web page that appears harmful is, for example, not removed from the search results but moved lower in the search result list, which makes it harder to reach.

Specifically, the search device comprises: a hit word vector generation unit that generates, for each page in the search hit list, a hit word vector, that is, a word vector whose features are words extracted from the character strings contained in that page; a reference word vector holding unit that holds a plurality of reference word vectors, each being a word vector for reference associated with a label indicating a value index; a similarity calculation unit that calculates the similarity between each of the held reference word vectors and the hit word vector generated by the hit word vector generation unit; an arithmetic expression storage unit that stores the arithmetic expression used to calculate the similarity; and a sort display unit that sorts and displays the identification information of the pages for which hit word vectors were generated, based on the value index of the label of the reference word vector found by the calculation to be most similar.

To reduce the processing load of sorting the search result list, a search device is also provided that sorts by degree of harmfulness only, for example, the top 100 search results. Specifically, in addition to the above configuration, the hit word vector generation unit has upper generation means for generating hit word vectors for the pages included in the search hit list up to a predetermined upper rank, and the sort display unit has upper sort display means for displaying at least the search hit list up to that predetermined upper rank according to the display order generated by the upper generation means.
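The top-rank restriction can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `resort_top`, the page IDs, and the label values are hypothetical, and a higher label is assumed to mean a safer page, as in the labels used in Example 1.

```python
def resort_top(hits, labels, top_n=100):
    """Re-sort only the first top_n hits by label; leave the rest untouched.

    hits   -- page IDs in the order produced by the ordinary ranking
    labels -- page ID -> value-index label (assumed: higher means safer)
    """
    # sorted() is stable, so pages with equal labels keep their original order.
    head = sorted(hits[:top_n], key=lambda page_id: -labels[page_id])
    return head + hits[top_n:]

hits = ["W002", "W001", "W003", "W004"]
labels = {"W001": 5, "W002": 2, "W003": 1, "W004": 5}
reordered = resort_top(hits, labels, top_n=3)
```

Because only the first `top_n` entries are examined, labels need to be computed for those pages alone, which is where the reduction in processing load comes from.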

A search device is also provided that uses character strings contained in the URL of a Web page as features of the vector used to judge the degree of harmfulness and the like. Specifically, the word vector generation unit has URL character string utilization means for generating the hit word vector using, as features, character strings contained in the URLs of the pages in the search hit list.
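One possible way to obtain features from a URL is sketched below in Python. The splitting rule (host name and path split on non-alphanumeric characters), the function name `url_features`, and the sample URL are assumptions made for illustration; the specification does not fix how the URL string is divided.

```python
import re
from urllib.parse import urlparse

def url_features(url):
    """Return the set of lower-cased alphanumeric tokens in the host and path."""
    parsed = urlparse(url)
    text = parsed.netloc + parsed.path
    return {token.lower() for token in re.split(r"[^0-9A-Za-z]+", text) if token}

features = url_features("http://example.com/suicide/manual.html")
```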

A search device is also provided that generates the vector by taking words appearing in the Web page as its features and giving the frequency of each word as the feature value. Specifically, in addition to the above configuration, the word vector generation unit has weighting means for generating a hit word vector in which the magnitude along each word axis in the vector space is determined by the frequency with which that word appears in the page.
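The frequency-based weighting can be illustrated with a short Python sketch. It assumes the page has already been split into words; the function name `frequency_vector` and the sample word list are hypothetical.

```python
from collections import Counter

def frequency_vector(words):
    """Map each distinct word (feature) to its occurrence count (feature value)."""
    return dict(Counter(words))

page_words = ["suicide", "prevention", "suicide", "counseling"]
vector = frequency_vector(page_words)
```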

Note that the search target of the search device of the present invention is not limited to Web pages, and may be binary data such as video or still image data, music data, or program data.

With the present invention configured as described above, the degree of harmfulness and the like of a Web page hit by a search can be judged in graded steps, and the order of the search result list can be sorted accordingly. According to a survey by AOL (registered trademark) of the United States, which provides an Internet search service, each site listed at rank 8 or lower in a search result list receives no more than 3% of accesses, and such sites together receive no more than 20%. Thus, merely moving sites that appear harmful to a lower position in the search result list is enough to suppress access to them effectively.

Embodiments of the present invention are described below with reference to the drawings. The present invention is in no way limited to these embodiments and can be implemented in various forms without departing from its gist.

Example 1 mainly describes claims 1 and 5. Example 2 mainly describes claims 2 and 6. Example 3 mainly describes claims 3 and 7. Example 4 mainly describes claims 4 and 8.

≪Example 1≫
<Overview>
FIG. 1(a) is an example of a search hit list returned by an ordinary search server device when the search query “suicide” is entered on a search Web page. As the figure shows, in a search hit list from an ordinary search server device, Web pages that encourage suicide or offend public order and morals, such as “Suicide manual (easy ways to die)” or “Overseas suicide images”, may appear near the top of the list.

This is because an ordinary search server device builds the search hit list by ordering the Web pages that matched the search query according to its display ranking algorithm (for example, reordering by a recursively determined page authority score), and the core of that algorithm is not designed to take the content of the Web page itself into account.

FIG. 1(b), by contrast, is an example of the search hit list returned for the same search query “suicide” by a search server device that incorporates the search device of this example. As the figure shows, in this search hit list the Web pages that appear harmful, such as those mentioned above, are displayed lower in the list.

This is because the search device of this example judges the degree of harmfulness and the like of each Web page hit by the search through a similarity judgment between reference vectors, which serve as the basis for judging that degree, and the vector of the hit Web page, and reorders the search hit list accordingly.

In this way, the search device of this example can automatically reorder the pages in a search hit list according to their degree of harmfulness and the like. Access to Web pages that appear harmful can therefore be suppressed.

<Functional configuration>
FIG. 2 shows an example of the functional blocks of the search device of this example. A “search device” here means a device that accepts a given search query, identifies the resources containing the search query, and presents the identification information and location information of the identified resources to the person who entered the query. The search device of this example may be incorporated into a server device on a network that provides a so-called Internet search service and be realized as a search server device. Alternatively, it may be incorporated into an end user's terminal device to provide a service that searches the resources within that terminal.

The search target in this example is not limited to the Web pages illustrated in the overview, and may be other document data or binary data such as still images, video, audio, or programs.

As shown in the figure, the “search device” (0200) of this example has a “hit word vector generation unit” (0201), a “reference word vector holding unit” (0202), a “similarity calculation unit” (0203), an “arithmetic expression storage unit” (0204), and a “sort display unit” (0205).

The functional blocks of the search device described below can be realized as hardware, as software, or as a combination of both. Specifically, in a computer-based implementation they include hardware components such as a CPU, main memory, a bus, secondary storage (hard disks, non-volatile memory, storage media such as CDs and DVDs together with their read drives, and so on), input devices used for entering information, printers, display devices, and other external peripherals, as well as interfaces for those external peripherals, communication interfaces, driver programs for controlling the hardware, other application programs, and user interface applications.

The CPU performs arithmetic processing according to programs loaded into main memory, whereby data input from input devices or other interfaces and held in memory or on the hard disk is processed and stored, and instructions for controlling the above hardware and software are generated. The present invention can be realized not only as a search device but also as a method. Part of the invention can also be configured as software. Furthermore, a software product used to cause a computer to execute such software, and a recording medium on which that product is fixed, are naturally included in the technical scope of the present invention (the same applies throughout this specification).

The “hit word vector generation unit” (0201) has the function of generating a hit word vector for each page in the search hit list, and can be realized by, for example, a CPU, main memory, and a hit word vector generation program. The “search hit list” is a list of the identification information (for example, resource titles, URLs, or ID numbers consisting of arbitrary character strings) of the resources hit by a search with the entered search query. A “hit word vector” is a word vector whose features are words extracted from the character strings contained in each page of the search hit list.

FIG. 3 illustrates an example of how hit word vectors are generated. As shown in the figure, for example, the CPU interprets the morphological analysis program included in the hit word vector generation program and performs morphological analysis on the text data of a Web page hit by the search query (FIG. 3(a)). Specific methods include splitting sentences into words by pattern matching against a word dictionary and a syntax rule dictionary, and splitting them by scoring with a probabilistic language model such as a hidden Markov model. Depending on how the analysis rules are configured, a word extracted by this process may also be an idiomatic phrase consisting of several words.

The words extracted by the morphological analysis are then used as the features of the vector for that Web page, as shown in FIG. 3(b). For instance, the Web page identified by ID W001 (“The psychology of suicide prevention”) is represented by a vector whose features are words such as “suicide”, “prevention”, “hydrogen sulfide”, “not beautiful”, and “bullying”. Representing the content of a Web page as a vector in this way and comparing it with the reference word vectors described later enables a more effective judgment than simple NG-word registration.

The same processing is applied to the other pages hit by the search. For example, the Web page identified by ID W002 is vectorized with words such as “suicide”, “hanging”, “drugs”, “jumping”, and “easy” as its features, and the Web page identified by ID W003 is vectorized with words such as “suicide”, “gore”, “image”, “fun”, and “click” as its features. The extracted words may also be checked against a dictionary of synonyms (or near-synonyms) so that several synonyms are merged into a single word that serves as one feature.
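The feature extraction with synonym merging can be sketched in Python as follows. Morphological analysis itself is not implemented here: the sketch assumes the page text has already been split into words by some analyzer, and the function name `to_features` and the synonym table are hypothetical.

```python
def to_features(tokens, synonyms):
    """Return the set of features, mapping each token to its canonical synonym."""
    return {synonyms.get(token, token) for token in tokens}

# Hypothetical synonym table: each key is folded into the word on the right.
synonyms = {"deterrence": "prevention", "precaution": "prevention"}

# Hypothetical output of a morphological analyzer for one page.
tokens = ["suicide", "deterrence", "precaution", "bullying"]
features = to_features(tokens, synonyms)
```

Merging synonyms before vectorization keeps pages that use different wording for the same idea on the same axis of the vector space.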

A value reflecting, for example, the degree of harmfulness may also be assigned to each word in advance and used as the feature value of the vector. With this configuration, the similarity judgment by vector comparison described later can take word-specific weights into account. Alternatively, as described later in Example 4, the feature values of the vector may be set according to how often each word appears.

The generation of hit word vectors is not limited to morphological analysis. For example, as described later in Example 2, character strings contained in the URL of a Web page may be used as features of the vector. If the resource hit by the search is audio, the morphemes contained in the audio may be extracted, for example by frequency analysis using pattern matching, to generate a hit word vector for the audio.

Alternatively, if the resource hit by the search is a video or still image, a hit image characteristic vector whose features are image characteristics instead of words may be generated. Specifically, for example, the image data is frequency-transformed and an edge (contour) image is generated from the extracted high-frequency components. This edge image is compared with reference images for pattern matching held in advance, and a vector is generated whose features are the objects found in the image, such as “ax”, “tree”, or “mosaic pattern”. If the resource hit by the search is a program, the source code may be extracted and its processing content identified by pattern matching or the like, so that a hit execution vector whose features are operations such as “execute/copy system files”, “rewrite system area”, or “acquire address book data and send mail to those addresses” is generated in place of the hit word vector.

The “reference word vector holding unit” (0202) has the function of holding a plurality of reference word vectors, and can be realized by various storage devices such as an HDD (hard disk drive), non-volatile memory, or an optical recording medium and its read drive.

A “reference word vector” is a word vector for reference that is associated with a label indicating a value index. For example, as shown in FIG. 4, a vector whose features include “suicide”, “prevention”, and “counseling” is associated with the label “5”, a value index meaning that the page is safe. A vector whose features are “suicide”, “fun”, and “gore” is held in association with the label “1”, a value index meaning that the page is harmful, and a vector whose features are “suicide”, “tell me”, and “easy” is held in association with the label “3”, indicating a borderline degree of harmfulness.

The labels shown in FIG. 4 are value indices indicating the degree of harmfulness of a Web page, but the value index is of course not limited to this. For example, the label may be a value index indicating the degree of expertise, so that the search hit list is reordered by expertise, or a value index indicating the degree of reliability or usefulness, so that the search hit list is reordered by reliability or usefulness.

Even when the resource hit by the search is a video, still image, audio, or program, vector features suited to that resource (object images for video and still images, frequency data of morphemes for audio, fragments of source code for programs, and so on) can be held in association with a label indicating a value index. Specifically, for example, a vector whose features are object image data such as a “mosaic pattern” or a “genital region” is assumed to represent video unsuitable for those under 18 and is held with the label “1”. Likewise, a vector whose features are audio frequency data for words such as “suicide”, “gore”, or terms banned from broadcasting is held in association with the label “1”, and a vector whose features are source code for “execute/copy system files” or “rewrite system area” is held in association with the label “1” because it may be a high-risk program such as a virus.

By holding reference word vectors in association with labels in this way, even when a Web page or other item in the search hit list is an unknown resource, a label can be assigned to it by judging its similarity to the references, and it can be ranked accordingly.

The reference word vectors and their associated labels are preferably created, entered, and registered initially by the service provider or by service users, and used as training examples. After that, their number is preferably expanded automatically through feedback and similar means. That is, a hit word vector to which the search device of this example has assigned a label according to its similarity to the reference word vectors, as described later, is associated with that label and used as a reference word vector from then on. Alternatively, the same processing may be applied to Web pages gathered by a program that periodically and automatically collects resources on the network (a so-called “crawler program”) to increase the number of reference word vectors held. In other words, the reference word vectors serve as the training data that plays the role of the teacher in so-called supervised machine learning. This machine learning improves the accuracy with which labels are assigned to unknown resources by the vector comparison described later.

The “similarity calculation unit” (0203) has the function of calculating the similarity between each of the held reference word vectors and the hit word vector generated by the hit word vector generation unit, and can be realized by, for example, a CPU, main memory, and a similarity calculation program.

FIG. 5 illustrates an example of the similarity calculation between a reference word vector and a hit word vector in the similarity calculation unit. As shown in FIG. 5(a), the hit word vector identified by ID “W001” in FIG. 3(b) is placed in a vector space whose axes are the features. Next, the reference word vector identified by ID “R001” in FIG. 4 is placed in the same vector space. If the angle between the two vectors is θ, then cos θ = (W001 · R001) / (|W001| × |R001|), where W001 · R001 is the inner product of the two vectors and |W001| and |R001| are their magnitudes. The smaller the angle θ between the two vectors, the more similar their features are; that is, the calculated value of cos θ indicates the similarity of the two vectors.
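The cosine calculation above can be sketched in Python as follows. The representation of a word vector as a dictionary from feature to feature value, the function name `cosine_similarity`, and the feature values of 1.0 are assumptions made for illustration.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    # Inner product: only features present in both vectors contribute.
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty vector is treated as similar to nothing
    return dot / (norm_a * norm_b)

# Illustrative vectors; every feature value is assumed to be 1.0.
w001 = {"suicide": 1.0, "prevention": 1.0, "bullying": 1.0}
r001 = {"suicide": 1.0, "prevention": 1.0, "counseling": 1.0}
similarity = cosine_similarity(w001, r001)
```

With these two vectors, two of the three features coincide, so the result is 2/3; identical vectors give a value of 1.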

The similarity (cos θ) between the hit word vector W001 and each of the other reference word vectors in FIG. 4, “R002”, “R003”, and so on, is calculated in the same way. Then, as shown in FIG. 5(b), the top k reference word vectors whose cos θ is closest to 1, that is, those most similar to the hit word vector, are identified (k is a predetermined natural number; k = 2 in the example in the figure), here “R001” and “R004”. The average (for example, rounded up), median, or mode of the label “5” held for “R001” and the label “4” held for “R004” in the reference word vector holding unit is taken as the label of the hit word vector “W001”. In this way, the resource hit by the search is given the label that the sort display unit uses as the basis for reordering the display.
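The label assignment from the k most similar reference word vectors can be sketched in Python as follows. The dictionary-based vectors, the helper names `cosine_similarity`, `knn_label`, and `binary`, and the features given to R004 are assumptions made for illustration (the text gives only the label of R004, not its features); the average is rounded up, which is one of the options the text mentions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(value * b.get(word, 0.0) for word, value in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def knn_label(hit, references, k=2):
    """Average (rounded up) of the labels of the k most similar references."""
    if not references:
        raise ValueError("at least one reference word vector is required")
    ranked = sorted(references.values(),
                    key=lambda ref: cosine_similarity(hit, ref[0]),
                    reverse=True)
    nearest = [label for _, label in ranked[:k]]
    return math.ceil(sum(nearest) / len(nearest))

def binary(*words):
    """Build a vector in which every listed feature has the value 1.0."""
    return dict.fromkeys(words, 1.0)

references = {
    "R001": (binary("suicide", "prevention", "counseling"), 5),
    "R002": (binary("suicide", "fun", "gore"), 1),
    "R003": (binary("suicide", "tell me", "easy"), 3),
    # The features of R004 are hypothetical; only its label is taken from the text.
    "R004": (binary("suicide", "prevention", "bullying", "support"), 4),
}
w001 = binary("suicide", "prevention", "hydrogen sulfide", "not beautiful", "bullying")
label = knn_label(w001, references, k=2)
```

With these sample vectors the two nearest references are R004 and R001, so the labels 4 and 5 average to 4.5, which rounds up to 5.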

The label assignment for hit word vectors described above is a similarity judgment known as the k-nearest neighbor method, but this example is of course not limited to that method. For example, the similarity may be judged using a support vector machine (SVM) or the like.

For the other hit word vectors shown in FIG. 3(b), “W002”, “W003”, and so on, the similarity to each of the reference word vectors “R001”, “R002”, and so on is judged in the same way, and a label is assigned according to that similarity. Then, as shown in FIG. 6, each resource in the search hit list, such as a Web page, is stored in the storage device in association with the label of the reference word vectors it most closely resembles.

A hit word vector that has been newly labeled in this way according to its similarity to the reference word vectors may be stored as-is in the reference word vector holding unit and used from the next search onward. With this configuration, the number of reference word vectors held can be expanded automatically.

The "arithmetic expression storage unit" (0204) has a function of storing an arithmetic expression for calculating the similarity, and can be realized by various storage devices such as an HDD, a nonvolatile memory, or an optical recording medium and its reading drive. Examples of the arithmetic expressions held here include the expression for obtaining cos θ between vectors described for the similarity determination unit, and an expression for executing the k-nearest neighbor method. The stored expressions are not limited to these, however: any expression may be used as long as it determines the similarity between vectors and assigns a label indicating a value index according to that similarity.

For example, if resources are limited to text-based ones, another possible arithmetic expression uses the so-called Levenshtein distance (edit distance), which indicates the degree of difference between two character strings. Specifically, with h_1, h_2, ... as the features (character strings) of the hit word vector and r_1, r_2, ... as the features (character strings) of the reference word vector, the SIM (similarity) of each morpheme is calculated using the arithmetic expression given as Equation 1. The average of these values is then used as the value indicating the similarity between the hit word vector and the reference word vector.

The Levenshtein distance LD(h_a, r_a) may be calculated using, for example, a conventional algorithm based on dynamic programming. Alternatively, instead of calculating SIM for each feature (morpheme) of the two vectors, the hit word vector and the reference word vector may each be regarded as a single character string whose characters are the features, and Equation 1 may be applied to those strings. For example, a vector with the features "suicide", "consultation", and "prevention" is regarded as the string "suicide consultation prevention", which has three characters.
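As a rough sketch of the edit-distance approach, the Python below computes the Levenshtein distance by dynamic programming and averages a per-feature similarity. Equation 1 is not reproduced in this text, so the normalization used here (1 minus the distance divided by the longer length) is an assumption, chosen only because it is a common way to map an edit distance into the range 0 to 1. The function names are likewise hypothetical.

```python
def levenshtein(s, t):
    """Edit distance between two sequences, keeping one DP row at a time."""
    prev = list(range(len(t) + 1))
    for i, item_s in enumerate(s, start=1):
        curr = [i]
        for j, item_t in enumerate(t, start=1):
            cost = 0 if item_s == item_t else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def sim(h, r):
    """Assumed stand-in for Equation 1: 1 - LD / length of the longer input."""
    longest = max(len(h), len(r))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(h, r) / longest


def vector_similarity(hit_features, ref_features):
    """Average SIM over features paired by index, as in LD(h_a, r_a)."""
    pairs = list(zip(hit_features, ref_features))
    if not pairs:
        return 0.0
    return sum(sim(h, r) for h, r in pairs) / len(pairs)
```

Because `levenshtein` accepts any sequence, the alternative described above, in which each feature is treated as one "character", amounts to passing the feature lists themselves, for example `sim(["suicide", "consultation", "prevention"], ["suicide", "prevention"])`.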

The "sort display unit" (0205) has a function of sorting and displaying the identification information of the pages from which the hit word vectors were generated, in order of the value index of the label of the reference word vector found to be most similar by the above calculation. It can be realized by, for example, a CPU, a main memory, and a sort display program.

Specifically, the resources such as Web pages in the search hit list that have been newly labeled as shown in FIG. 6 are sorted in label order, producing a search hit list like the one shown in FIG. 1(b). The sorting of the identification information in this list need not rely on the label alone; it may instead be performed by adding the label value to the normal sort rule for the search hit list. In that case, the label value may be used as a variable in the function that determines the normal sort order.
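The two sorting options can be contrasted with a small Python sketch. Everything in it is hypothetical: the linear form of `combined_score`, the weight, the page IDs, the scores, and the assumption that a larger label should push a page down the list are illustrative choices, not details taken from the text.

```python
def combined_score(base_score, label, weight=0.02):
    """Hypothetical ranking function: relevance score reduced by the label value."""
    return base_score - weight * label


# (page id, base relevance score, label); all values invented for illustration.
hits = [("page-a", 0.90, 5), ("page-b", 0.85, 1), ("page-c", 0.80, 3)]

# Option 1: sort on the label alone.
by_label = [pid for pid, _, _ in sorted(hits, key=lambda h: h[2])]

# Option 2: use the label as one variable in the normal ranking function.
by_combined = [
    pid for pid, _, _ in
    sorted(hits, key=lambda h: combined_score(h[1], h[2]), reverse=True)
]
```

With these numbers the label-only order is page-b, page-c, page-a, while the combined order is page-b, page-a, page-c, because the high relevance score of page-a partly offsets its label.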

In this way, a value index such as the degree of harmfulness can be judged in graded steps for each Web page or other resource hit by the search, and the order of the search result list can be sorted accordingly. The search hit list presented to the search user can therefore include Web pages in the gray zone.

In addition, by using vectors whose features are, for example, words contained in the document of a Web page, labels can also be assigned to unknown resources in the search hit list, and the list can be sorted in label order. The reference word vectors used to label such unknown resources can also be expanded automatically through feedback or automatic collection by a crawler, which improves labeling accuracy with little effort.

<Hardware configuration>
FIG. 7 is a schematic diagram showing an example of the configuration of the search device when the functional components described above are realized as hardware. Using this figure, the role of each hardware component in the display sorting of the search hit list is described.

As shown in this figure, the search device includes a "CPU (central processing unit)" (0701), which serves as the hit word vector generation unit, the similarity calculation unit, and the sort display unit and also executes various other arithmetic processing, and a "main memory" (0702). It also includes an "HDD" (0703), which serves as the reference word vector holding unit and the arithmetic expression storage unit, and a "communication IF (interface)" (0704), which connects over a network to the search terminal that sends the search query. These are connected to one another by a data communication path such as a "system bus" and exchange and process information.

A program is read into the "main memory", and the "CPU" interprets the loaded program and executes various arithmetic processing according to the interpreted procedure. A plurality of addresses is assigned to each of the "main memory" and the "flash memory", and in its arithmetic processing the "CPU" specifies an address and accesses the data stored there, which allows it to perform arithmetic processing using that data.

Here, a search query entered at the search terminal via a search Web page or the like is received by the "communication IF" and stored at address 1 of the "main memory". The search device then, as in normal search processing, uses logical operations of the "CPU" to search the search index, cache data, and the like that were collected in advance by a crawler or similar and stored on the "HDD". The identification information of each Web page containing a character string or the like that matches the search query is stored as a search hit ID at address 2 and subsequent addresses of the "main memory".

Next, the "CPU" interprets the hit word vector generation program and, following it, obtains the HTML document data of the Web page indicated by a search hit ID, for example from the cache data on the "HDD". The "CPU" then refers to a word dictionary (not shown) stored on the "HDD" and performs pattern matching between words in the dictionary and words in the HTML document, for example by the longest-match method. Next, for the extracted words, the "CPU" refers to a grammar dictionary (likewise not shown) and determines, from the part-of-speech inflections and connection relationships given in the grammar dictionary, whether each word has been extracted correctly. For extracted words judged to be incorrect, a different segmentation point is found by repeating the pattern matching, so that the words in the text are extracted in a grammatically correct form. Here the "CPU" may also refer to a synonym dictionary and merge synonyms among the extracted words into a single word. A hit word vector is then generated with the words extracted in this way as the features of the vector for that Web page. The same processing is performed for the Web pages of the other search hit IDs, and the generated hit word vectors are stored at address 3 and subsequent addresses of the "main memory".
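The word extraction described above can be approximated by the following Python sketch. It covers only greedy longest-match segmentation and synonym merging; the grammar-dictionary check and re-segmentation are omitted. The dictionary contents, the synonym table, the use of occurrence counts as feature weights, and the function names are all assumptions made for illustration.

```python
from collections import Counter


def longest_match_tokens(text, dictionary, max_len=12):
    """Greedy longest-match segmentation against a set of known words."""
    tokens = []
    i = 0
    while i < len(text):
        for length in range(min(max_len, len(text) - i), 0, -1):
            candidate = text[i:i + length]
            if candidate in dictionary:
                tokens.append(candidate)
                i += length
                break
        else:
            i += 1  # no dictionary word starts here; skip one character
    return tokens


def hit_word_vector(tokens, synonyms):
    """Count features after mapping each token to its canonical synonym."""
    return Counter(synonyms.get(token, token) for token in tokens)


# Hypothetical dictionaries and an unspaced input, standing in for Japanese text.
dictionary = {"suicide", "prevent", "prevention", "hotline", "helpline"}
synonyms = {"helpline": "hotline"}
tokens = longest_match_tokens("suicidepreventionhelpline", dictionary)
vector = hit_word_vector(tokens, synonyms)
```

Here the longest-match rule selects "prevention" rather than the shorter dictionary entry "prevent", and the synonym table folds "helpline" into the feature "hotline".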

The "CPU" further interprets the similarity calculation program and, according to the result, reads reference word vector 1 (R001) stored on the "HDD" into address 4 of the "main memory" and calculates its similarity to the hit word vector (W001) stored at address 3. Specifically, the vector values are substituted into an arithmetic expression stored on the "HDD", such as "cos θ = (vector W001 × vector R001) / (|vector W001| × |vector R001|)", to calculate the cosine distance (cos θ). For the hit word vector W001, cos θ is likewise calculated against the other reference word vectors R002, R003, and so on, and the top k reference word vectors whose cos θ is closest to 1 are identified. Then, by the k-nearest neighbor method, the label of the Web page from which the hit word vector was generated is determined as, for example, the mode of the labels associated with those top k reference word vectors, and is stored at address 5 of the "main memory" in association with the identification information of that Web page.
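Written out in vector notation, the stored expression would presumably take the following form. Reading the "×" in the numerator as an inner product, and the symbols w (hit word vector) and r (reference word vector) with components w_i and r_i, are interpretive assumptions rather than notation from the text.

```latex
\cos\theta
  = \frac{\mathbf{w}\cdot\mathbf{r}}{\lVert\mathbf{w}\rVert\,\lVert\mathbf{r}\rVert}
  = \frac{\sum_{i} w_i r_i}{\sqrt{\sum_{i} w_i^{2}}\,\sqrt{\sum_{i} r_i^{2}}}
```

As a worked example, for w = (1, 1, 1) and r = (1, 1, 0) this gives 2 / (√3 · √2) ≈ 0.816. Values closer to 1 indicate more similar vectors.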

Labels for the Web pages indicated by the other search hit IDs are determined by the same processing and stored in the "main memory" in association with the identification information of each Web page.

The "CPU" then interprets the sort display program and, according to the result, sorts the associated identification information of the Web pages in order of the value index indicated by their labels, generates a sorted search hit list, and stores it at address 6 of the "main memory". The sorted search hit list is then returned via the "communication IF" to the search terminal that sent the search query.

The label stored at address 5 of the "main memory" may also be associated with the hit word vector generated from that Web page and held on the "HDD" so that it can be used as a reference word vector from the next search onward.

If the resource to be searched is not a Web page but binary data such as a moving image, a hit image characteristic vector may be generated and the similarity determined in the manner described for the functional configuration above.

The above is an example of the hardware configuration when the search device of this embodiment is incorporated in a search server device on a network. When the search device of this embodiment is incorporated in an end user's terminal device, the search query may, for example, be entered through an input device (not shown) rather than the "communication IF". Likewise, the sorted search hit list may be output directly to the terminal's own display rather than through the "communication IF".

<Process flow>
FIG. 8 is a flowchart showing an example of the processing flow in the search device of this embodiment. The steps described below may be steps executed by the hardware components of a computer as described above, or may be processing steps constituting a program that is recorded on a medium and controls a computer.

As shown in this figure, a search query is first accepted (step S0801), and the page titles and the like that hit the accepted search query are listed to obtain a search hit list (step S0802). Next, for each page in the search hit list, morphological analysis is performed, for example, and a hit word vector having the resulting morphemes as features is generated (step S0803). Then, using an arithmetic expression stored in advance, the similarity between the hit word vector and each of the plurality of reference word vectors held is calculated, using for example the cosine distance (step S0804).

The page titles and the like in the search hit list are then sorted and displayed in order of the value index indicated by the label of the reference word vector found to be most similar by the above calculation (step S0805). For a page labeled in this way, the hit word vector generated from the page may also be associated with the label and held on an HDD or the like so that it can be used as a reference word vector from the next search onward.

Of course, this processing flow may be that of a search server device on a network or that of an end user's terminal. In the former case, step S0801 is a step of accepting a search query entered at a terminal and received over the network; in the latter case, it is a step of accepting a search query entered directly into the terminal through an input device or the like. Similarly, in the former case step S0805 is a step of returning the sorted list to the terminal that sent the search query so that the search hit list is displayed on that terminal; in the latter case, it is a step of displaying the search hit list directly on the device's own display. The same applies to the processing flows of the embodiments below.

<Brief description of effect>
As described above, with the search device of this embodiment, a value index such as the degree of harmfulness can be used to judge, in graded steps, each Web page or other resource hit by a search, and the order of the search result list can be sorted accordingly. The search hit list presented to the search user can therefore include Web pages in the gray zone.

≪Example 2≫
<Overview>
This embodiment is based on the above embodiment and is a search device configured to perform the sorted display for only part of the search hit list, for example the portion displayed as one page.

FIG. 9 shows an example of a search hit list sorted by the search device of this embodiment. As shown in FIG. 9(a), of the 60,000 hits for the search query, hits 1 through 10 are displayed as one page of the search hit list, and the calculation for the sorted display described in Example 1 has been performed for those 10 hits. As shown in FIG. 9(b), however, that calculation has not been performed for the following hits 11 through 20. For this second page of the search hit list, the calculation for the sorted display may be performed only after the user clicks "Next" on the first page to display the second page. Alternatively, the calculation for sorting the subsequent pages of the search hit list may be performed as background processing while the user is viewing the first page.

With this configuration, this embodiment reduces or distributes the processing load of sorting the page titles and the like in the search hit list, so that the search hit list can be displayed more quickly.

<Functional configuration>
FIG. 10 shows an example of the functional blocks of the search device of this embodiment. As shown in this figure, the "search device" (1000) of this embodiment is based on Example 1 and has a "hit word vector generation unit" (1001), a "reference word vector holding unit" (1002), a "similarity calculation unit" (1003), an "arithmetic expression storage unit" (1004), and a "sort display unit" (1005). These components have already been described in the above embodiment, so their description is omitted.

The search device of this embodiment is characterized in that the hit word vector generation unit further has "upper generation means" (1006) and the sort display unit further has "upper sort display means" (1007).

The "upper generation means" (1006) has a function of generating hit word vectors for the pages included in the search hit list up to a predetermined upper rank. Specifically, before hit word vectors are generated, a search hit list whose display order has been determined by conventional rules is obtained, and hit word vectors are generated for, say, the top 100 Web pages in that list.

Setting this "predetermined upper rank" appropriately reduces the sort processing load on the search device. In addition, when the search hit list is displayed across multiple pages as described in the overview, the means may be configured to generate hit word vectors for the pages of predetermined upper rank contained in one displayed page of the search hit list.

When the upper generation means is configured in this way, the hit word vector generation unit of this embodiment may further have "post-next-page-operation generation means" which, as described in the overview, upon accepting an operation to display the next page of the search hit list, identifies the Web pages included in that next page and generates their hit word vectors. Alternatively, the hit word vector generation unit may further have "next-page background generation means" which, while the search hit list sorted according to the labels is being displayed, identifies in the background the Web pages included in the next page of the search hit list and generates their hit word vectors.

The "upper sort display means" (1007) has a function of sorting and displaying at least the search hit list up to the predetermined upper rank on the basis of the value index of the label of the reference word vector found to be most similar by the above calculation. It may further have "next-page sort display means" corresponding to the "post-next-page-operation generation means" and "next-page background generation means" described above.
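The page-at-a-time behaviour of these means can be sketched as follows. This is an illustrative Python outline under several assumptions: the function name `sorted_page`, the label cache, the stand-in labeling function, and the choice to show smaller labels first are all hypothetical.

```python
def sorted_page(ranked_ids, page_no, page_size, label_fn, cache):
    """Label and sort only the hits shown on one result page.

    ranked_ids: hit IDs already ordered by the conventional ranking.
    label_fn:   callable that computes a label for one hit ID (the costly step).
    cache:      dict of labels computed so far, shared across calls.
    """
    start = (page_no - 1) * page_size
    page_ids = ranked_ids[start:start + page_size]
    for hit_id in page_ids:
        if hit_id not in cache:
            cache[hit_id] = label_fn(hit_id)
    # Assumption: smaller labels are shown first. sorted() is stable, so hits
    # with equal labels keep their conventional order.
    return sorted(page_ids, key=lambda hit_id: cache[hit_id])


calls = []


def fake_label(hit_id):
    """Stand-in for vector generation plus similarity calculation."""
    calls.append(hit_id)
    return hit_id % 3


ranked_ids = list(range(1, 21))  # 20 hypothetical hits
cache = {}
page1 = sorted_page(ranked_ids, 1, 10, fake_label, cache)
```

Only the ten hits on the first page are labeled by this call. Calling `sorted_page` for page 2 when the user requests it corresponds to the generation after a next-page operation, and making the same call ahead of time from a background worker would correspond to the background variant.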

In this way, this embodiment reduces or distributes the processing load of sorting the page titles and the like in the search hit list, so that the sorted search hit list can be displayed more quickly.

<Process flow>
FIG. 11 is a flowchart showing an example of the processing flow in the search device of this embodiment. The steps described below may be steps executed by the hardware components of a computer as described above, or may be processing steps constituting a program that is recorded on a medium and controls a computer.

As shown in this figure, a search query is first accepted (step S1101), and the page titles and the like that hit the accepted search query are listed to obtain a search hit list (step S1102). Next, the display rank of each page in the search hit list is determined, for example by conventional rules, and the hit pages up to a predetermined upper rank are identified (step S1103). For each of the identified hit pages, morphological analysis is performed, for example, and a hit word vector having the resulting morphemes as features is generated (step S1104). Then, using an arithmetic expression stored in advance, the similarity between the hit word vector and each of the plurality of reference word vectors held is calculated, using for example the cosine distance (step S1105).

Then, at least the pages in the search hit list up to the predetermined upper rank are sorted and displayed in order of the value index indicated by the label of the reference word vector found to be most similar by the above calculation (step S1106).

<Brief description of effect>
As described above, the search device of this embodiment, at least initially, performs vector generation and similarity calculation only for the pages up to a predetermined upper rank in the search hit list. This reduces or distributes the processing load of sorting the page titles and the like in the search hit list, so that the sorted search hit list can be displayed more quickly.

≪Example 3≫
<Overview>
FIG. 12 is a diagram for explaining an example of the hit word vector generation process in the search device of this embodiment. As shown in FIG. 12(a), for a Web page hit by a search, for example, its URL (uniform resource locator) is obtained. A URL is often given a character string related to the content of its Web page. Moreover, pages that share the same domain name are likely to be Web pages with similar content, because they are multiple pages making up one Web site; for example, a "page collecting images from Southeast Asia" and a "page collecting images from Africa" that make up one "site introducing overseas suicide images".

The search device of this embodiment is therefore characterized in that, as shown in FIG. 12(b), it generates a hit word vector with the character strings included in the URL as features and uses it to assign the label for the sorted display.

<Functional configuration>
FIG. 13 shows an example of the functional blocks of the search device of this embodiment. As shown in this figure, the "search device" (1300) of this embodiment is based on Example 1 and has a "hit word vector generation unit" (1301), a "reference word vector holding unit" (1302), a "similarity calculation unit" (1303), an "arithmetic expression storage unit" (1304), and a "sort display unit" (1305). Based on Example 2, it may also have "upper generation means" and "upper sort display means" (not shown). These components have already been described in the above embodiments, so their description is omitted.

The search device of this embodiment is characterized in that the hit word vector generation unit further has "URL character string use means" (1306).

The "URL character string use means" (1306) has a function of generating a hit word vector using, as features, character strings included in the URL of a page in the search hit list. A URL consists of a communication protocol name followed by the name of the host machine holding the resource, a path name indicating the directory structure in which the resource is held on the host machine, and so on. As described above, a URL is often given a character string related to the content of its Web page. In addition, pages that share the same domain name belong to the same Web site and are therefore likely to have similar content. Because the similarity of Web pages can thus be judged by comparing the character strings in their URLs, the URL character string use means generates a hit word vector with the character strings included in the URL as the vector's features.

In a URL, character strings are separated by slashes (/), dots (.), and the like. By referring to these delimiters, the character strings that become vector features can therefore be identified easily, without complex morphological analysis using a word dictionary or grammar dictionary. In this embodiment, character strings that appear in URLs generically regardless of page content, such as protocol names like "http", extensions like ".html", domains like ".com", or path names like "TOP", may be registered in advance and excluded from the vector features.
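A delimiter-based feature extraction of this kind might look like the Python sketch below. The URL, the exact delimiter set, the function name `url_features`, and some of the stop-list entries are assumptions for illustration; only "http", ".html", ".com", and "TOP" are examples named in the text.

```python
import re

# "http", "html", "com" and "top" follow the examples in the text;
# the remaining entries are assumptions added for illustration.
STOP_STRINGS = {"http", "https", "www", "com", "html", "htm", "top", "index"}


def url_features(url, stop=STOP_STRINGS):
    """Split a URL on delimiter characters and drop generic strings."""
    parts = re.split(r"[/.:?&=#]+", url.lower())
    return [part for part in parts if part and part not in stop]


features = url_features("http://www.example.com/picture/movie/top.html")
```

For this hypothetical URL the remaining features are "example", "picture", and "movie". Pages on the same site would share the host-name feature, which is what lets same-domain pages score as similar.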

In this embodiment, the reference word vectors held in the reference word vector holding unit are naturally also configured to have features such as "picture" and "movie", matching the hit word vectors that are generated.

<Process flow>
FIG. 14 is a flowchart showing an example of the processing flow in the search device of this embodiment. The steps described below may be steps executed by the hardware components of a computer as described above, or may be processing steps constituting a program that is recorded on a medium and controls a computer.

As shown in this figure, a search query is first received (step S1401), and page titles and the like that hit the received search query are listed to obtain a search hit list (step S1402). Next, for each page included in the search hit list, the character strings in its URL are separated by referring to the delimiters, and a hit word vector having those character strings as features is generated (step S1403). Then, using an arithmetic expression stored in advance, the similarity between the hit word vector and each of the plurality of held reference word vectors is calculated, for example as a cosine distance (step S1404).

Then the page titles and the like in the search hit list are sorted and displayed in order of the value index indicated by the label of the reference word vector found most similar by the calculation (step S1405).
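Steps S1403 to S1405 can be sketched roughly as below, treating each vector as a set of features (every feature value is 1). The reference vectors, their numeric value indices, the page identifiers, and the choice to show low value indices first are all hypothetical; the text does not fix any of them.

```python
import math


def cosine_binary(a: set, b: set) -> float:
    """Cosine similarity of two vectors whose feature values are all 1."""
    if not a or not b:
        return 0.0
    return len(a & b) / math.sqrt(len(a) * len(b))


# Hypothetical reference word vectors: (features, value index of the label).
REFERENCES = [
    ({"picture", "movie", "adult"}, 3),  # assumed label: high harmfulness
    ({"news", "weather", "sports"}, 1),  # assumed label: low harmfulness
]


def value_index(features: set) -> int:
    """Return the value index of the most similar reference vector (S1404)."""
    best = max(REFERENCES, key=lambda ref: cosine_binary(features, ref[0]))
    return best[1]


# Hypothetical hit word vectors built from URL strings (S1403).
hits = {
    "W001": {"example", "movie", "picture"},
    "W002": {"example", "news", "sports"},
}

# Sort page identifiers by value index, lowest first (S1405, order assumed).
ranked = sorted(hits, key=lambda page: value_index(hits[page]))
print(ranked)
```

In this sketch a page that matches no reference vector at all falls back to the first reference in the list; a real system would need an explicit rule for that case.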

<Brief description of effects>
As described above, the search device of this embodiment can sort the display order using character strings included in the URL of a resource such as a Web page. Similarity can therefore be judged from, for example, whether pages belong to the same Web site.

In addition, features can be determined and the hit word vector generated by "referring to delimiters", a process with a lighter load than morphological analysis.

≪Example 4≫
<Overview>
FIG. 15 is a diagram for explaining an example of hit word vector generation in the search device of this embodiment. As shown in this figure, this embodiment is characterized in that each feature of the hit word vector is given a feature value according to the appearance frequency of the corresponding word. Giving a feature value to each feature in this way expresses the characteristics of the vector more clearly, so similarity can be judged more precisely by comparison with the reference word vectors.

<Functional configuration>
FIG. 16 is a diagram showing an example of the functional blocks in the search device of this embodiment. As shown in this figure, the "search device" (1600) of this embodiment is based on Example 1 and has a "hit word vector generation unit" (1601), a "reference word vector holding unit" (1602), a "similarity calculation unit" (1603), an "arithmetic expression storage unit" (1604), and a "sort display unit" (1605). Based on Examples 2 and 3, it may also have the "upper generation means", "upper sort display means", or "URL character string use means" (not shown). These components have already been described in the above examples, so their description is omitted here.

The search device of this embodiment is characterized in that the hit word vector generation unit further has "weighting means" (1606).

The "weighting means" (1606) has a function of generating a hit word vector in which the magnitude along each word axis in the vector space is determined according to the appearance frequency of that word in the page.

FIG. 17 is a diagram for explaining an example of hit word vector generation by the weighting means and of the similarity judgment against a reference word vector. As shown in FIG. 17(a), for the Web page "W001" included in the search hit list, the appearance frequency of each word obtained by morphological analysis is counted, and that frequency is stored as the feature value in flash memory or the like. For the other Web pages "W002", "W003", and so on, word frequencies are counted in the same way, and hit word vectors are generated and stored with the counts as feature values.
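The counting step can be illustrated with a short sketch. The word list below stands in for the output of a morphological analyzer, which is not shown; the words themselves and the function name `hit_word_vector` are hypothetical.

```python
from collections import Counter


def hit_word_vector(words):
    """Map each word (feature) to its occurrence count (feature value)."""
    return Counter(words)


# Assumed morphological-analysis output for the page "W001".
w001_words = ["movie", "picture", "movie", "free", "movie"]
w001 = hit_word_vector(w001_words)
print(w001)
```

A `Counter` returns 0 for words that never occurred, which matches a vector whose unlisted axes have magnitude zero.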

Meanwhile, as shown in FIG. 17(b), feature values are also assigned as appropriate to the features of the reference word vectors. Of course, if a reference word vector was generated from some Web page or the like, its feature values may be given by the appearance frequencies in that page.

Then, as shown in FIG. 17(c), an index of the similarity between "W001" and "R001" is calculated by an arithmetic expression such as cos θ = (vector W001 · vector R001) / (|vector W001| × |vector R001|), where the numerator is the inner product of the two vectors, so the index value cos θ varies with the feature values above. Similarity can therefore be judged more precisely.
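As a worked illustration of the cosine expression, the sketch below computes cos θ for frequency-weighted vectors stored as dictionaries. The feature values of `w001`, `r001`, and `r002` are made-up numbers, not values taken from FIG. 17.

```python
import math


def cosine(u: dict, v: dict) -> float:
    """cos(theta) = (u . v) / (|u| * |v|) for sparse word vectors."""
    dot = sum(value * v.get(word, 0) for word, value in u.items())
    norm_u = math.sqrt(sum(x * x for x in u.values()))
    norm_v = math.sqrt(sum(x * x for x in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_u * norm_v)


# Hypothetical feature values.
w001 = {"movie": 3, "picture": 1}
r001 = {"movie": 2, "picture": 2}
r002 = {"news": 4, "picture": 1}

print(cosine(w001, r001))  # inner product 8, norms sqrt(10) and sqrt(8)
print(cosine(w001, r002))  # inner product 1, norms sqrt(10) and sqrt(17)
```

With these numbers "W001" is far closer to "R001" than to "R002", so it would take the label of "R001".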

The axis magnitude determined from the appearance frequency may also be corrected by referring to, for example, a correction table set according to part of speech or the content of the word. Specifically, for a word considered harmful, such as "グロ" (gore), the magnitude is determined with the appearance frequency doubled. Likewise, when a combination of words such as "suicide" and "easy method" is present, the appearance frequency is corrected in the same way.
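One possible shape for such a correction table is sketched below. The table contents, the names `WORD_FACTORS` and `COMBO_FACTORS`, and the rule that every word of a matched combination is scaled by the same factor are assumptions; the text only says that the frequency is doubled for a harmful word and "similarly corrected" for combinations.

```python
from collections import Counter

# Hypothetical correction table: per-word multipliers.
WORD_FACTORS = {"gore": 2.0}
# Hypothetical combination rules: applied only if all words co-occur.
COMBO_FACTORS = [({"suicide", "easy method"}, 2.0)]


def corrected_vector(counts):
    """Scale raw word counts according to the correction tables."""
    vector = {word: float(n) for word, n in counts.items()}
    for word, factor in WORD_FACTORS.items():
        if word in vector:
            vector[word] *= factor
    for combo, factor in COMBO_FACTORS:
        # Iterating a dict yields its keys, so this checks co-occurrence.
        if combo.issubset(vector):
            for word in combo:
                vector[word] *= factor
    return vector


page_counts = Counter(["gore", "gore", "movie"])
print(corrected_vector(page_counts))
```

Keeping the multipliers in data rather than code means the table can be tuned without touching the vector generation logic.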

When the search target is something other than a Web page, the weighting means of this embodiment may determine the vector axis magnitudes from the appearance frequency of features other than words, for example predetermined frequencies in the case of audio.

<Process flow>
FIG. 18 is a flowchart showing an example of the processing flow in the search device of this embodiment. The steps shown below may be executed by the hardware components of a computer as described above, or may be processing steps of a program that is recorded on a medium and controls a computer.

As shown in this figure, a search query is first received (step S1801), and page titles and the like that hit the received search query are listed to obtain a search hit list (step S1802). Next, for each page in the search hit list, morphological analysis, for example, is performed and the appearance frequency of each word in the page is counted. A hit word vector is then generated with the morphemes as features and the counts as their feature values (step S1803). Then, using an arithmetic expression stored in advance, the similarity between the hit word vector and each of the plurality of held reference word vectors is calculated, for example as a cosine distance (step S1804).

Then the page titles and the like in the search hit list are sorted and displayed in order of the value index indicated by the label of the reference word vector found most similar by the calculation (step S1805).

<Brief description of effects>
As described above, the search device of this embodiment generates hit word vectors whose feature values are assigned according to the appearance frequency of words. The characteristics of each vector are therefore expressed more clearly, and similarity can be judged more precisely by comparison with the reference word vectors.

A diagram for explaining an example of the sorted display of a search hit list by the search device of Example 1
A diagram showing an example of the functional blocks in the search device of Example 1
A diagram for explaining an example of hit word vector generation in the hit word vector generation unit of the search device of Example 1
A diagram showing an example of the reference word vectors held in the reference word vector holding unit of the search device of Example 1
A diagram for explaining an example of the similarity calculation between a reference word vector and a hit word vector in the similarity calculation unit of the search device of Example 1
A diagram showing an example of the labeling of Web pages hit by a search, based on the similarity judgment of the search device of Example 1
A diagram showing an example of the hardware configuration of the search device of Example 1
A flowchart showing an example of the processing flow in the search device of Example 1
A diagram for explaining an example of the sorted display of a search hit list by the search device of Example 2
A diagram showing an example of the functional blocks in the search device of Example 2
A flowchart showing an example of the processing flow in the search device of Example 2
A diagram for explaining an example of hit word vector generation in the search device of Example 3
A diagram showing an example of the functional blocks in the search device of Example 3
A flowchart showing an example of the processing flow in the search device of Example 3
A diagram for explaining an example of hit word vector generation in the search device of Example 4
A diagram showing an example of the functional blocks in the search device of Example 4
A diagram for explaining an example of hit word vector generation by the weighting means of the search device of Example 4 and of the similarity judgment against a reference word vector
A flowchart showing an example of the processing flow in the search device of Example 4

Explanation of symbols

0200 Search device
0201 Hit word vector generation unit
0202 Reference word vector holding unit
0203 Similarity calculation unit
0204 Arithmetic expression storage unit
0205 Sort display unit

Claims

A search device comprising:
a hit word vector generation unit that generates a hit word vector, which is a word vector having as features words extracted from a character string included in each page of a search hit list;
a reference word vector holding unit that holds a plurality of reference word vectors, each associated with a label indicating a value index;
a similarity calculation unit that calculates the similarity between each of the plurality of held reference word vectors and the hit word vector generated by the hit word vector generation unit;
an arithmetic expression storage unit that stores an arithmetic expression for calculating the similarity; and
a sort display unit that sorts and displays identification information of the pages for which hit word vectors were generated, based on the value index of the label of the reference word vector found most similar by the calculation.

The search device according to claim 1, wherein
the hit word vector generation unit has upper generation means for generating hit word vectors for the pages included in the search hit list up to a predetermined upper rank, and
the sort display unit has upper sort display means for sorting and displaying at least the search hit list up to the predetermined upper rank, based on the value index of the label of the reference word vector found most similar by the calculation.

The search device according to claim 1, wherein the hit word vector generation unit has URL character string use means for generating a hit word vector using, as features, character strings included in the URL of each page in the search hit list.

The search device according to any one of claims 1 to 3, wherein the hit word vector generation unit has weighting means for generating a hit word vector in which the magnitude along each word axis in the vector space is determined according to the appearance frequency of the same word in the page.

A control method for a search device having a reference word vector holding unit that holds a plurality of reference word vectors, each associated with a label indicating a value index, the method causing a computer to execute:
a hit word vector generation step of generating a hit word vector, which is a word vector having as features words extracted from a character string included in each page of a search hit list;
a similarity calculation step of calculating, using an arithmetic expression for calculating similarity stored in advance in an arithmetic expression storage unit, the similarity between each of the plurality of held reference word vectors and the generated hit word vector; and
a sort display step of sorting and displaying identification information of the pages for which hit word vectors were generated, based on the value index of the label of the reference word vector found most similar by the calculation.

The control method for a search device according to claim 5, wherein
the hit word vector generation step includes an upper generation step of generating hit word vectors for the pages included in the search hit list up to a predetermined upper rank, and
the sort display step includes an upper sort display step of sorting and displaying at least the search hit list up to the predetermined upper rank, based on the value index of the label of the reference word vector found most similar by the calculation.

The control method for a search device according to claim 5, wherein the hit word vector generation step includes a URL character string use step of generating a hit word vector using, as features, character strings included in the URL of each page in the search hit list.

The control method for a search device according to any one of claims 5 to 7, wherein the hit word vector generation step includes a weighting step of generating a hit word vector in which the magnitude along each word axis in the vector space is determined according to the appearance frequency of the same word in the page.