JP4079354B2

JP4079354B2 - Evaluation function estimation device, program and storage medium for ranking, ranking device and program

Info

Publication number: JP4079354B2
Application number: JP2002223032A
Authority: JP
Inventors: 秀人賀沢; 努平尾
Original assignee: Nippon Telegraph and Telephone Corp
Current assignee: Nippon Telegraph and Telephone Corp
Priority date: 2002-07-31
Filing date: 2002-07-31
Publication date: 2008-04-23
Anticipated expiration: 2022-07-31
Also published as: JP2004062737A

Description

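[0001]
[Technical Field of the Invention]
The present invention relates to an evaluation function estimation device, program, and storage medium for ranking, and to a ranking device and program, which are used when automatically ranking information to be presented to a user in, for example, an information retrieval system.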
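[0002]
[Prior Art]
With the spread of computers and the Internet, information retrieval systems that find appropriate information in large volumes of information are becoming increasingly important. Because such systems also retrieve large amounts of information, they often rank the results according to some criterion before presenting them. Typically, each item is given a score by an evaluation function and the items are output in score order. One example of such an evaluation function uses, as the score, the number of words that match the search request (keywords) entered by the user.
[0003]
Conventionally, such evaluation functions in information retrieval systems have mostly been designed by relying on the experience and intuition of the system designer. In recent years, however, several techniques have been proposed that automatically estimate the evaluation function from data in which sample items have been given an ideal ranking. An example is Tsutomu Hirao, Eisaku Maeda, and Yuji Matsumoto, "Important Sentence Extraction by Support Vector Machine", Information Processing Society of Japan, Proceedings of the 63rd meeting on Fundamental Informatics (July 2001).
[0004]
The evaluation function estimation methods proposed so far work as follows.
First, the ranked data given as samples is split, at a suitable rank, into items above that rank and items below it. Next, the two sets are regarded as separate categories, and an evaluation function is estimated that takes positive values for the higher-ranked samples and negative values for the lower-ranked samples. This estimation uses the training method of an existing two-category classifier (for example, a neural network or a support vector machine). Finally, the resulting function is used as is as the system's evaluation function.
[0005]
Such automatic estimation has the advantage that, once ranked samples are prepared, an evaluation function (its parameters) adapted to the user is set (estimated) automatically.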
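[0006]
[Problems to Be Solved by the Invention]
However, these conventional estimation methods have the following problems. First, to split the samples into two categories, the rank that serves as the threshold must be decided in advance. In general, though, which rank should be the threshold cannot be decided without examining the accuracy of the evaluation function actually estimated. Finding the best evaluation function therefore requires repeating the classifier estimation, which is a serious problem when each estimation is slow, as with a support vector machine.
[0007]
Second, when the categories are split at a given rank, the estimation accuracy of the evaluation function tends to be poor for data that rarely appears near that rank. This is because, in general, an evaluation function estimated by a classifier is less accurate far from the category boundary. With the conventional methods, ranking based on the evaluation function therefore often fails for ranks far from the threshold rank.
[0008]
Accordingly, the main object of the present invention is to provide an evaluation function estimation device, program, and storage medium for ranking that can estimate an evaluation function capable of appropriately ranking the target data, and a ranking device and program that can rank appropriately using the estimated evaluation function.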
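[0009]
[Means for Solving the Problems]
In view of the above problems, the present inventors conducted intensive research and found that target data can be ranked appropriately by creating categorized data in which above/below each rank of the ranked data is assigned as a category, estimating from this categorized data an evaluation function that serves as the ranking criterion, and using the estimated function. This led to the present invention.
[0010]
That is, the present invention that solves the above problems is an evaluation function estimation method for ranking that estimates, from ranked data, an evaluation function serving as the criterion for ranking target data. The method is characterized by inputting the ranked data, creating categorized data in which above/below each rank of the ranked data is assigned as a category, and estimating the evaluation function from the categorized data.
[0011]
In this configuration, the ranked data carries ranks, and above/below each of those ranks is assigned as a category. As in the embodiment described later, the categories are assigned by giving each ranked item in the sample category labels y[j/i,s] that indicate whether it is above/below rank 1, above/below rank 2, ..., above/below rank n−1 (where n is the lowest rank appearing in the ranked data).
[0012]
In the evaluation function estimation method for ranking, it is preferable, as in the embodiment, to set the parameters of the evaluation function so that, for the categorized data, the value of the evaluation function is larger for higher-ranked items than for lower-ranked ones. It is also preferable, as in the embodiment, to set the parameters so as to satisfy the conditions of expressions (12) to (14) and minimize the value of expression (11). Expression (1) in the claims corresponds to expression (11) in the embodiment; likewise, expressions (2) and (12), (3) and (13), and (4) and (14) are the same expressions. It is further preferable, as in the embodiment, to set the parameters using SMO, a quadratic programming method.
[0013]
The present invention that solves the above problems is also a ranking method that ranks target data using the evaluation function estimated by the above estimation method. The method is characterized by inputting the evaluation function and the target data, calculating a score for each target item from them, and ranking the target data in score order.
[0014]
With this configuration, the evaluation function estimated by the estimation method can be used to rank the target data appropriately.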
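[0015]
The evaluation function estimation device for ranking according to the present invention comprises: ranked data input means for inputting ranked data; categorized data creation means for creating categorized data by assigning above/below each rank of the ranked data as a category; and evaluation function estimation means for estimating, from the categorized data, an evaluation function for ranking target data.
[0016]
In this configuration as well, the ranked data carries ranks, and above/below each rank is assigned as a category. As in the embodiment described later, this is done by giving each ranked item in the sample category labels y[j/i,s] for above/below rank 1, above/below rank 2, ..., above/below rank n−1 (where n is the lowest rank appearing in the ranked data).
[0017]
In the evaluation function estimation device, as in the embodiment, the evaluation function estimation means is characterized by setting the parameters of the evaluation function so that, for the categorized data, the value of the evaluation function is larger for higher-ranked items than for lower-ranked ones.
[0018]
The present invention is also a ranking device that ranks target data using the evaluation function estimated by the above estimation method, comprising: data input means for inputting the evaluation function and the target data; score calculation means for calculating the score of each target item from the evaluation function and the target data; and means for ranking the target data in score order.
[0019]
With this configuration, the evaluation function estimated by the estimation method can be used to rank the target data appropriately.
[0020]
The present invention that solves the above problems (claims 4 to 6) is also an evaluation function estimation program for ranking, a ranking program, and a storage medium storing the evaluation function estimation program for ranking. These cause a computer to execute the steps of the programs so as to estimate the evaluation function for ranking or to rank target data using the evaluation function.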
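[0021]
[Embodiments of the Invention]
Embodiments of the present invention are described in detail below with reference to the drawings.
In FIG. 1, (a) is a functional block diagram of the ranking device according to the embodiment, and (b) is a block diagram of the hardware configuration of the ranking device in (a). FIG. 2 is a flowchart of the evaluation function estimation method and ranking method according to the embodiment.
[0022]
[Ranking device]
As shown in FIG. 1(a), the ranking device A, which executes the evaluation function estimation method and the ranking method, has ranked data input means 11 that reads ranked data, categorized data creation means 12 that creates categorized data in which above/below every rank is assigned as a category, evaluation function estimation means 13 that estimates the parameters of the evaluation function from the categorized data, and parameter storage means 14 that stores the parameters for reuse. It also has target data input means 21 that reads the target data to be ranked, score calculation means 22 that calculates the scores of all target items from the parameters and the target data, data sorting means 23 that sorts the target data by score, and data output means 24 that ranks and outputs the sorted target data. Each means (block) of ranking device A is implemented as a program module; their detailed functions are explained later with reference to the flowcharts.
[0023]
As shown in FIG. 1(b), ranking device A has a main control device 1, a storage device 2, a communication control device 3, and an input/output device 4 connected to a system bus. The main control device 1 includes a CPU (Central Processing Unit), RAM (Random Access Memory), and so on; it estimates the evaluation function (its parameters) for ranking target data, ranks target data using the estimated function, and controls ranking device A as a whole. The storage device 2 includes a hard disk drive and the like, and stores the evaluation function parameters as well as various programs and data. The communication control device 3 includes, for example, a network interface card (NIC) and provides communication (data input/output) over a network using a predetermined communication protocol. The input/output device 4 has a keyboard, a mouse, and the like connected to it.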
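[0024]
[Processing flow of the ranking device]
The processing flow of ranking device A is described below along the flowchart of FIG. 2, with reference to FIG. 1(b). The flowchart comprises two flows: one that estimates, from ranked samples, an evaluation function that needs no preset threshold rank and whose accuracy does not degrade across ranks, and one that performs accurate ranking based on the estimated function.
[0025]
In information filtering, a branch of information retrieval, a user ranks sample documents drawn from a huge document collection (usually far fewer than the whole collection) by usefulness, and that ranking is then used as a guide to rank the entire collection. The present embodiment has information filtering in mind: the ranked data corresponds to the sample documents ranked by the user, the target data to the document collection, and the output data to the ranked document collection.
[0026]
First, the ranked data input means 11 reads the ranked sample data (S11; ranked data input step) and passes it to the categorized data creation means 12. The categorized data creation means 12 assigns above/below every rank of the ranked data as a category (S12; categorized data creation step) and passes the result to the evaluation function estimation means 13. The evaluation function estimation means 13 estimates, from the categorized data, evaluation function parameters that can discriminate above/below for every rank at once (S13; evaluation function estimation step) and passes them to the parameter storage means 14. The parameter storage means 14 stores the parameters so that they can be reused (S14; parameter storage step). In this way the parameters are estimated from the input ranked data and stored for reuse.
[0027]
In this embodiment, "above" a rank includes that rank itself, whereas "below" does not.
[0028]
Next, the target data input means 21 reads the target data to be ranked (S21; target data input step) and passes it to the score calculation means 22. The score calculation means 22 reads the evaluation function parameters from the parameter storage means 14, calculates the score of each target item (S22; score calculation step), and passes the scores together with the target data to the data sorting means 23. The data sorting means 23 sorts the target data by score (S23; data sorting step) and passes the result to the data output means 24. Finally, the data output means 24 outputs the sorted target data (S24; data output step).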
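[0029]
[Specific operation]
The specific operation of the embodiment is now described in more detail with reference to FIG. 3 and other figures. To simplify the explanation, the description below covers the case where all steps of the FIG. 2 flowchart (S11 to S24) run in sequence, as shown in the flowchart of FIG. 3.
[0030]
FIG. 3 arranges the flowchart of FIG. 2 in series and uses the same step numbers. FIG. 4 is a table showing an example of ranked data, FIG. 5 a table showing an example of target data, and FIG. 6 a table showing an example of output data. FIG. 7 plots the ranked data (FIG. 4) and the target data (FIG. 5) on a plane to make the data easier to understand. FIG. 8 is a table showing an example of categorized data.
[0031]
In FIGS. 4 to 8, χ[j/i] denotes ranked data serving as sample documents (i = イ, ロ, ハ; j = 1st, 2nd, 3rd), and A, B, and C are the target data, that is, the document collection to be ranked. The subscript i distinguishes the groups of ranked data, and the subscript j gives the rank within each group. The numbers in parentheses, such as (0.9, 0.8) and (0.4, 0.5), are scores related to the number of words in a sample document (ranked data) or target document (target data) that match the search request (for example, keywords). The ranks in FIG. 4 were determined in advance from these scores. In FIG. 7, χ[j/イ] is drawn as ○, χ[j/ロ] as △, and χ[j/ハ] as ×; the first value in each parenthesis in FIGS. 4 and 5 is plotted on the horizontal axis and the second on the vertical axis (the same applies to FIGS. 13 and 14, described later).
[0032]
A user may also rank samples more than once. The ranked data in FIG. 4 corresponds, for example, to taking three documents from the collection and ranking them, repeated three times. For subscript i in FIG. 4, group イ consists of the items ranked 1st to 3rd; likewise, in this embodiment, groups ロ and ハ each consist of items ranked 1st to 3rd.
[0033]
Following the flowchart of FIG. 3, the specific operation is described below for an example that estimates the evaluation function (its parameters) from the ranked data in FIG. 4, ranks the target data in FIG. 5, and obtains the output data in FIG. 6. This specific operation, and the processing flow described with reference to the flowchart of FIG. 2, correspond to the "evaluation function estimation method for ranking", "ranking method", "evaluation function estimation program for ranking", and "ranking program" in the claims.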
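[0034]
First, to estimate the evaluation function, in the ranked data input step (S11) the ranked data input means 11 reads the ranked data of FIG. 4 and passes it to the categorized data creation means 12.
[0035]
In the categorized data creation step (S12), the categorized data creation means 12 creates categorized data by assigning to the ranked data of FIG. 4 category labels indicating above (below) rank 1 and above (below) rank 2. In other words, it creates categorized data in which above/below every rank of the ranked data is assigned as a category. In this embodiment the category label is written y[j/i,s]: y[j/i,s] = +1 when the ranked item χ[j/i] is at or above rank s, and y[j/i,s] = −1 when it is below rank s (i = イ, ロ, ハ; j = 1st, 2nd, 3rd; s = 1, 2).
[0036]
Under this rule, y[j/i,s] takes the following values (see FIG. 8), which are passed to the evaluation function estimation means 13.
[0037]
χ[1st/イ] is at or above rank 1, so y[1st/イ,1] = +1,
χ[1st/イ] is at or above rank 2, so y[1st/イ,2] = +1,
χ[2nd/イ] is below rank 1, so y[2nd/イ,1] = −1,
χ[2nd/イ] is at or above rank 2, so y[2nd/イ,2] = +1,
χ[3rd/イ] is below rank 1, so y[3rd/イ,1] = −1,
χ[3rd/イ] is below rank 2, so y[3rd/イ,2] = −1
[0038]
χ[1st/ロ] is at or above rank 1, so y[1st/ロ,1] = +1,
χ[1st/ロ] is at or above rank 2, so y[1st/ロ,2] = +1,
χ[2nd/ロ] is below rank 1, so y[2nd/ロ,1] = −1,
χ[2nd/ロ] is at or above rank 2, so y[2nd/ロ,2] = +1,
χ[3rd/ロ] is below rank 1, so y[3rd/ロ,1] = −1,
χ[3rd/ロ] is below rank 2, so y[3rd/ロ,2] = −1
[0039]
χ[1st/ハ] is at or above rank 1, so y[1st/ハ,1] = +1,
χ[1st/ハ] is at or above rank 2, so y[1st/ハ,2] = +1,
χ[2nd/ハ] is below rank 1, so y[2nd/ハ,1] = −1,
χ[2nd/ハ] is at or above rank 2, so y[2nd/ハ,2] = +1,
χ[3rd/ハ] is below rank 1, so y[3rd/ハ,1] = −1,
χ[3rd/ハ] is below rank 2, so y[3rd/ハ,2] = −1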
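The labelling rule can be sketched in a few lines of Python. This is only an illustration of the rule stated above (label +1 when the rank is at or above s, otherwise −1); the function name, the dictionary layout, and the romanized group ids `i`, `ro`, `ha` (standing for イ, ロ, ハ) are assumptions made for the example.

```python
# Sketch: build the category labels y[j/i,s] of FIG. 8 from ranked groups.
# Names and data layout are illustrative, not taken from the patent.

def make_category_labels(groups):
    """groups maps a group id to the list of ranks (1 = best) of its items.

    Returns {(rank, group, s): +1 or -1}. An item is "at or above rank s"
    (label +1) when rank <= s, and "below rank s" (label -1) otherwise.
    """
    lowest = max(r for ranks in groups.values() for r in ranks)  # n
    labels = {}
    for group, ranks in groups.items():
        for rank in ranks:
            for s in range(1, lowest):  # thresholds s = 1 .. n-1
                labels[(rank, group, s)] = 1 if rank <= s else -1
    return labels


# Three groups of three ranked items, as in FIG. 4.
groups = {"i": [1, 2, 3], "ro": [1, 2, 3], "ha": [1, 2, 3]}
labels = make_category_labels(groups)
```

With three groups of three items and two thresholds, the sketch yields the 18 labels listed above.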
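[0040]
In the evaluation function estimation step (S13), the evaluation function estimation means 13 estimates (sets) the type and parameters of the evaluation function so that, for the categorized data labelled as in FIG. 8, the value of the evaluation function f(χ) is larger for data at or above rank 1 than for data below rank 1, and larger for data at or above rank 2 than for data below rank 2.
[0041]
Various ways of estimating (setting) the type and parameters of the evaluation function are conceivable. This embodiment adopts in advance a linear function of the input χ, f(χ) = w·χ (w being a two-dimensional vector), and estimates its parameters by finding the w, b_s, and ξ[j/i,s] that satisfy the following conditions (12) to (14) and minimize the objective function R, that is, expression (11) (i = イ, ロ, ハ; j = 1st, 2nd, 3rd; s = 1, 2).
[0042]
[Equation 2]

[0043]
In expressions (12) and (13), b_s is the threshold separating above from below rank s. Specifically, b₁ is the threshold for rank 1 and b₂ is the threshold for rank 2. Expression (12) requires that, when y[j/i,s] = +1, that is, when χ[j/i] is at or above rank s, the evaluation function value f(χ[j/i]) exceed b_s with a margin of 1 − ξ[j/i,s]. Expression (13) requires that, when y[j/i,s] = −1, that is, when χ[j/i] is below rank s, f(χ[j/i]) fall below b_s with a margin of 1 − ξ[j/i,s].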
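The equation image itself is not reproduced in this text. Based only on the verbal descriptions of expressions (11) to (14) in the surrounding paragraphs, one plausible reconstruction is the following. The weight C on the slack sum is an assumption (C = 1 gives an unweighted sum), and the typography of the original may differ.

```latex
\begin{align}
R(w, b, \xi) &= \tfrac{1}{2}\lVert w \rVert^{2}
  + C \sum_{i}\sum_{j}\sum_{s} \xi_{j/i,s} \tag{11}\\
w \cdot \chi_{j/i} - b_{s} &\ge 1 - \xi_{j/i,s}
  \quad \text{if } y_{j/i,s} = +1 \tag{12}\\
w \cdot \chi_{j/i} - b_{s} &\le -\left(1 - \xi_{j/i,s}\right)
  \quad \text{if } y_{j/i,s} = -1 \tag{13}\\
\xi_{j/i,s} &\ge 0 \tag{14}
\end{align}
```

Read this way, setting every ξ to 0 forces f(χ) ≥ b_s + 1 for items at or above rank s and f(χ) ≤ b_s − 1 for items below it, which is a gap of at least 2 between the two sides.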
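[0044]
In this embodiment each group of ranked sample data has three items, as shown in FIG. 4, so two thresholds b₁ and b₂ are needed. If ten ranked items (sample documents) were given, nine thresholds b₁ to b₉ would be needed, for above/below rank 1, above/below rank 2, ..., above/below rank 9. These thresholds b_s do not have to be set by a person.
[0045]
To explain the meaning of the expressions, suppose parameters are found that satisfy expressions (11) to (14) with every ξ[j/i,s] equal to 0. As expressions (12) and (13) make clear, with those parameters, for each s, the value of f(χ) for data at or above rank s is at least 2 larger than for data below rank s. Such an f(χ) therefore classifies (discriminates) above/below every rank correctly for all χ[j/i].
[0046]
In general, however, the minimization problem with every ξ[j/i,s] fixed at 0 may have no solution. Expressions (12) and (13) therefore relax the conditions by ξ[j/i,s], so that evaluation function parameters are found that classify above/below approximately. Since the relaxation should be as small as possible, the objective function R (expression (11)) includes the sum of the ξ[j/i,s], so that the parameters found keep ξ[j/i,s] as small as possible.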
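The role of the slack variables can be made concrete with a small sketch. For a fixed w and thresholds b_s, the smallest ξ that satisfies (12) or (13) is max(0, 1 − y·(f(χ) − b_s)), and the objective adds the sum of these slacks to (1/2)|w|². The weight `C`, the function names, and the sample values below are illustrative assumptions, not data from the patent.

```python
# Sketch: slack xi[j/i,s] and objective R for a given w and thresholds b_s.
# C (weight on the slack sum) and all sample values are assumptions.

def slack(w, b_s, x, y):
    """Smallest xi >= 0 satisfying (12) when y = +1 or (13) when y = -1."""
    f = sum(wk * xk for wk, xk in zip(w, x))  # f(x) = w . x
    return max(0.0, 1.0 - y * (f - b_s))


def objective(w, b, samples, C=1.0):
    """R = 1/2 |w|^2 + C * (sum of slacks); samples are (x, y, s) triples."""
    reg = 0.5 * sum(wk * wk for wk in w)
    return reg + C * sum(slack(w, b[s], x, y) for x, y, s in samples)


w = (2.0, 0.0)
b = {1: 1.0}
samples = [
    ((1.5, 0.0), +1, 1),  # f = 3.0, clears the threshold by 2 -> slack 0
    ((0.5, 0.0), -1, 1),  # f = 1.0, sits on the threshold     -> slack 1
    ((0.0, 0.0), -1, 1),  # f = 0.0, exactly 1 below threshold -> slack 0
]
R = objective(w, b, samples)
```

In this toy case the regularization term contributes 2.0 and the single violated margin contributes 1.0.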
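[0047]
To avoid estimating an evaluation function that classifies only the samples well (that is, to avoid overfitting), the objective function R in expression (11) also includes a term that limits the magnitude of the parameter w, namely (1/2)|w|². That such a term helps prevent overfitting is described, for example, in Vladimir N. Vapnik, "The Nature of Statistical Learning Theory, Second Edition", Springer, 1999.
[0048]
Finding the parameter w that satisfies expressions (12) to (14) and minimizes expression (11) is a quadratic programming problem, for which various solution methods are known. This embodiment solves the minimization problem with one of them, Sequential Minimal Optimization (SMO). SMO converts expressions (11) to (14) into another form called the dual problem and then minimizes by gradually changing pairs of variables in the direction that decreases the objective function. This kind of minimization is described in detail in Chapter 10 of Bernhard Schölkopf and Alexander J. Smola, "Learning with Kernels", The MIT Press, 2002.
[0049]
In this embodiment, the evaluation function estimation means 13 applies the categorized data of FIG. 8 to expressions (11) to (14) and uses SMO to find the parameter values (the evaluation function parameters) that minimize expression (11). Suppose the result is w = (1.2, 1.3), shown in FIG. 9. This parameter is passed to the parameter storage means 14. Using SMO makes the estimation of the evaluation function parameters easy to carry out on a computer.
[0050]
In the parameter storage step (S14), the parameters of FIG. 9 are stored in the parameter storage means 14 in a reusable form, for example on a magnetic device or in memory; any form that can be read back automatically later may be used. The thresholds b_s are obtained as well and may also be stored.
[0051]
In the target data input step (S21), the data to be ranked (the document collection) is read by the target data input means 21. In this embodiment the data of FIG. 5 (A, B, C) is read and passed to the score calculation means 22, which calculates the scores of the target data.
[0052]
In the score calculation step (S22), the score calculation means 22 reads the evaluation function parameters (FIG. 9) from the parameter storage means 14 and, using the evaluation function f(χ) = w·χ with those parameters substituted, calculates a score for each target item (FIG. 5) passed from the target data input means 21.
[0053]
From FIGS. 5 and 9, the score of A in the score calculation step (S22) is f(A) = 1.2×0.9 + 1.3×0.6 = 1.86, the score of B is f(B) = 1.2×0.6 + 1.3×0.2 = 0.98, and the score of C is f(C) = 1.2×0.3 + 1.3×0.8 = 1.40 (see FIG. 10). That is, the step multiplies the values of each target item in FIG. 5 by the evaluation function parameters of FIG. 9 and sums the products. The calculated scores are passed to the data sorting means 23.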
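The score calculation is a plain inner product, which the following sketch reproduces using the values of w and of A, B, C quoted in the text. The function and variable names are chosen for the example only.

```python
# Sketch: score calculation step (S22) with the example values quoted in
# the text for FIG. 5 (target data) and FIG. 9 (parameter w).

def score(w, x):
    """Linear evaluation function f(x) = w . x."""
    return sum(wk * xk for wk, xk in zip(w, x))


w = (1.2, 1.3)
targets = {"A": (0.9, 0.6), "B": (0.6, 0.2), "C": (0.3, 0.8)}

# Round to two decimals, the precision used in the text.
scores = {name: round(score(w, x), 2) for name, x in targets.items()}
```

Rounding is applied only so the values match the two-decimal figures in the description; sorting would work equally well on the unrounded scores.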
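[0054]
FIG. 9 shows an example of evaluation function parameters, FIG. 10 an example of scores calculated in the score calculation step, and FIG. 11 an example of output data obtained by ranking the target data.
[0055]
In the data sorting step (S23), the data sorting means 23 sorts the target data in descending order of the scores it receives. In this example, as FIG. 10 shows, the scores decrease in the order A, C, B, so the target data is sorted in that order and passed to the data output means 24 (FIG. 11).
[0056]
In the data output step (S24), the data output means 24 outputs the sorted data together with the ranks. In this embodiment, from FIG. 11, the output is 1st: A, 2nd: C, 3rd: B.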
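Steps S23 and S24 amount to a descending sort followed by numbering. A minimal sketch, using the scores of FIG. 10 as quoted in the text (the function name and output format are assumptions):

```python
# Sketch: data sorting (S23) and data output (S24) for the FIG. 10 scores.

def rank_by_score(scores):
    """Return [(rank, name, score), ...] sorted by descending score."""
    ordered = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [(pos, name, s) for pos, (name, s) in enumerate(ordered, start=1)]


scores = {"A": 1.86, "B": 0.98, "C": 1.40}
ranking = rank_by_score(scores)
for pos, name, s in ranking:
    print(f"rank {pos}: {name} (score {s:.2f})")
```

Python's `sorted` is stable, so items with equal scores would keep their input order; the description does not say how ties should be broken.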
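[0057]
The embodiment above used groups of three items for both the ranked data and the target data, but the number of items per group need not be three, nor the same for every group. In that case, s in expressions (11) to (14) runs up to one less than the largest rank in the ranked data, with y[j/i,s] = +1 for data at or above rank s and y[j/i,s] = −1 for data below it, and the same procedure as in this embodiment applies.
[0058]
[Discussion of results]
The following explains that the embodiment above estimated the evaluation function (its parameters) from ranked data (samples) without a preset threshold rank, obtained a function whose accuracy does not degrade across ranks, and ranked with it. FIG. 12 is a table of the categorized data used to estimate the evaluation function in the conventional method, with rank 2 as the threshold. FIG. 13 shows an example of output data obtained by the ranking method of the embodiment. FIG. 14 shows an example of output data obtained by ranking with the conventional method.
[0059]
First, the categorized data creation step (S12) assigns labels for above/below every rank rather than one specific rank, so no particular rank has to be chosen as the threshold. In this embodiment the thresholds b_s are obtained automatically and need not be chosen either.
[0060]
Next, comparison with the conventional method shows that the estimated evaluation function (its parameters) does not lose accuracy across ranks.
[0061]
As the conventional method, following the proposal of Hirao et al., an evaluation function that classifies above/below rank 2 is estimated with a support vector machine (SVM) from the ranked data of this embodiment. That is, categorized data as in FIG. 12 is prepared, with rank 2 as the threshold, and an SVM is applied with items having y[j/i] = +1 as positive examples and items having y[j/i] = −1 as negative examples. As in this embodiment, the evaluation function is a linear function of the input, w'·χ.
[0062]
The SVM then yields w' = (1.1, 0.3). The proposal of Hirao et al. is described in "Important Sentence Extraction by Support Vector Machine", Information Processing Society of Japan, Proceedings of the 63rd meeting on Fundamental Informatics (July 2001).
[0063]
Calculating the scores of the target data of this embodiment (see FIG. 5) with this parameter, w' = (1.1, 0.3), gives 1.1×0.9 + 0.3×0.6 = 1.17 for A, 1.1×0.6 + 0.3×0.2 = 0.72 for B, and 1.1×0.3 + 0.3×0.8 = 0.57 for C. The conventional method therefore ranks A 1st, B 2nd, and C 3rd (A → B → C), which differs in the 2nd and 3rd places from the ranking of the present invention, A 1st, C 2nd, B 3rd (A → C → B).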
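The difference between the two rankings can be checked directly from the two parameter vectors and the target data quoted in the text. The sketch below only recomputes the orderings; it does not train either model, and the helper name is an assumption.

```python
# Sketch: compare the orderings induced by the two parameter vectors
# quoted in the text, on the target data of FIG. 5.

def order(w, targets):
    """Names sorted by descending linear score w . x."""
    def f(x):
        return sum(wk * xk for wk, xk in zip(w, x))
    return sorted(targets, key=lambda name: f(targets[name]), reverse=True)


targets = {"A": (0.9, 0.6), "B": (0.6, 0.2), "C": (0.3, 0.8)}
proposed = order((1.2, 1.3), targets)      # embodiment, FIG. 9
conventional = order((1.1, 0.3), targets)  # single-threshold SVM, w'
```

The smaller second component of w' is what pushes C, whose second value is large, below B.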
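[0064]
FIG. 7, which plots the data, shows that overall the rank number increases (items rank lower) from the upper right toward the lower left. The ranking A, C, B given to the target data by the present invention is therefore more plausible than the conventional ranking A, B, C. This is because the present invention looks at the whole ranking and can estimate an evaluation function that is plausible overall (FIG. 13), whereas the conventional method estimates only from the above/below classification at rank 2 and so obtains a function with large errors for the other classification, above/below rank 1 (FIG. 14). As the number of target items grows, the accuracy of the conventional threshold-based method degrades further the farther a rank is from the threshold.
[0065]
The same problem arises with methods other than SVM as long as the evaluation function is estimated only from the above/below classification at one specific rank.
[0066]
In this embodiment, the categorized data creation step (S12 in FIG. 3) categorizes above/below not one rank but every rank appearing in the samples (ranked data), and the evaluation function estimation step (S13 in FIG. 3) determines evaluation function parameters that can classify above/below for all ranks. This makes it possible to estimate an evaluation function that needs no preset threshold rank and achieves good accuracy across all ranks.
[0067]
The conventional method had the problem that the threshold rank could not be decided without examining the accuracy of the evaluation function actually estimated. The estimation method described above has no such problem, so the evaluation function can be estimated quickly.
[0068]
The present invention is not limited to the embodiment above and can be modified widely. For example, steps S11 (ranked data input) to S14 (parameter storage) of FIG. 3 and steps S21 (target data input) to S24 (data output) may run independently. Steps S11 to S14 may also be run once, after which steps S21 to S24 are run repeatedly to rank different target data. The embodiment represents each kind of data as a table, but any logically equivalent representation may be used. The programs may be transmitted over a network or stored on a storage medium such as a CD-ROM.
[0069]
In the embodiment, above/below categories were assigned for every rank that appears, but this is not required and some ranks may be skipped. That is, "assigning above/below every rank as a category" in the embodiment is one example of "assigning above/below each rank as a category" in the present invention.
[0070]
The embodiment used three groups (i = イ, ロ, ハ) of three ranked samples each (j = 1st, 2nd, 3rd), but the data may consist of, for example, about 5 to 50 groups of 5 to 20 items each. In that case, in expressions (1) to (4), the sum over i runs over all groups of ranked data, the sum over j runs from rank 1 to the largest rank appearing in the ranked data, and the sum over s runs from 1 to one less than that largest rank.
[0071]
The values in parentheses in FIG. 4 and elsewhere, such as (0.9, 0.8) and (0.4, 0.5), were described as scores related to the number of words matching the search request, but any measure usable for ranking will do.
[0072]
[Effects of the Invention]
As described above, by estimating an evaluation function that can classify above/below each rank, the present invention obtains an evaluation function that needs no preset threshold rank and whose accuracy does not degrade across ranks. Using this function, ranking can be performed without loss of accuracy at any rank. The invention also provides an evaluation function estimation device that estimates such an evaluation function for ranking, a ranking device that ranks using the evaluation function, an evaluation function estimation program for ranking, and a storage medium storing the program. In short, the present invention can estimate an evaluation function that ranks target data appropriately and can rank appropriately with the estimated function.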
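[Brief Description of the Drawings]
[FIG. 1] (a) is a functional block diagram of the ranking device according to an embodiment of the present invention, and (b) is a block diagram of the hardware configuration of the ranking device in (a).
[FIG. 2] A flowchart of the evaluation function estimation method for ranking and the ranking method according to the embodiment of the present invention.
[FIG. 3] A flowchart in which the flowchart of FIG. 2 is arranged in series.
[FIG. 4] A table showing an example of ranked data used to estimate the evaluation function.
[FIG. 5] A table showing an example of target data to be ranked.
[FIG. 6] A table showing an example of output data obtained by ranking the target data.
[FIG. 7] A graph plotting the ranked data (FIG. 4) and the target data (FIG. 5) on a plane to make the data easier to understand.
[FIG. 8] A table showing an example of categorized data.
[FIG. 9] A diagram showing an example of evaluation function parameters.
[FIG. 10] A diagram showing an example of scores calculated in the score calculation step.
[FIG. 11] A diagram showing an example of output data obtained by ranking the target data.
[FIG. 12] A table showing the categorized data used to estimate the evaluation function in the conventional method, with rank 2 as the threshold.
[FIG. 13] A diagram showing an example of output data obtained by the ranking method according to the embodiment of the present invention.
[FIG. 14] A diagram showing an example of output data obtained by ranking with the conventional method.
[Explanation of Reference Numerals]
A: ranking device
11: ranked data input means; 12: categorized data creation means; 13: evaluation function estimation means; 14: parameter storage means; 21: data input means; 22: score calculation means; 23: data sorting means; 24: data output means
[0001]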
BACKGROUND OF THE INVENTION
The present invention relates to an evaluation function estimation device, program, and storage medium for ranking, and to a ranking device and program, used when automatically ranking information to be presented to a user in, for example, an information retrieval system.
[0002]
[Prior art]
With the spread of computers and the Internet, the importance of information retrieval systems that find appropriate information from a large amount of information is increasing. In such a system, a large amount of information is found, so that information is often presented after ranking based on some criteria. Usually, this ranking is often performed in the form of scoring each piece of information based on a certain evaluation function and outputting it in the order of the score. As an evaluation function used for such ranking, for example, there is a function that uses the number of words that match a search request (keyword) input by the user as a score.
[0003]
Conventionally, such evaluation functions in information retrieval systems are mostly designed based on the experience and intuition of system designers. However, in recent years, several techniques for automatically estimating an evaluation function from data obtained by ideally ranking information samples have been proposed. For example, Tsutomu Hirao, Eisaku Maeda, Yuji Matsumoto, “Important sentence extraction by Support Vector Machine”, Information Processing Society of Japan, 63rd Informatics Basic Research Proceedings (July 2001), etc.
[0004]
Here, the estimation method of the evaluation function proposed so far will be described.
First, the ranked data given as samples is divided, at a suitable rank, into higher-ranked and lower-ranked items. Next, the two sets are regarded as separate categories, and an evaluation function is estimated that takes positive values for the higher-ranked samples and negative values for the lower-ranked samples. This estimation uses the training method of an existing two-category classifier (for example, a neural network or a support vector machine). Finally, the evaluation function obtained in this way is used directly as the system's evaluation function.
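The splitting step of this conventional approach can be sketched as follows. The sketch covers only the labelling, not the classifier training; the function name is illustrative, and treating the threshold rank itself as part of the upper category is an assumption.

```python
# Sketch: the conventional two-category labelling with one threshold rank.
# Whether the threshold rank itself counts as "upper" is an assumption here.

def binarize(ranks, threshold):
    """+1 for items at or above the threshold rank, -1 for items below it."""
    return [1 if r <= threshold else -1 for r in ranks]


ranks = [1, 2, 3]
labels_t1 = binarize(ranks, threshold=1)
labels_t2 = binarize(ranks, threshold=2)
```

Each candidate threshold produces a different label set, so a separate classifier has to be trained for every threshold that is tried.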
[0005]
As described above, automatically estimating the evaluation function has the advantage that an evaluation function (its parameters) adapted to the user is set (estimated) automatically once ranked samples are prepared.
[0006]
[Problems to be solved by the invention]
However, this conventional way of estimating the evaluation function has the following problems. First, to divide the samples into two categories, the rank that serves as the threshold must be decided in advance. In general, however, which rank should be the threshold cannot be decided without examining the accuracy of the evaluation function actually estimated. Finding the best evaluation function therefore requires repeating the classifier estimation, which is a serious problem when each estimation takes a long time, as with a support vector machine.
[0007]
Next, if the categories are divided in a certain rank, the estimation accuracy of the evaluation function tends to be poor for data that hardly appears in the vicinity of the rank. This is because, in general, when the evaluation function is estimated by the classifier, the accuracy of estimation of the evaluation function is low at a location far from the boundary of the category. Therefore, in the conventional evaluation function estimation method, ranking based on the evaluation function is often not successful with respect to the ranking far from the threshold ranking.
[0008]
Therefore, a main object of the present invention is to provide an evaluation function estimation device, program, and storage medium for ranking that can estimate an evaluation function capable of appropriately ranking the target data to be ranked, and a ranking device and program that can rank appropriately using the estimated evaluation function.
[0009]
[Means for Solving the Problems]
In view of the above problems, the present inventors conducted intensive research and found that target data can be ranked appropriately by creating categorized data in which above/below each rank of the ranked data is assigned as a category, estimating from the categorized data an evaluation function that serves as the ranking criterion, and using the estimated evaluation function. The present invention was completed on this basis.
[0010]
That is, the present invention that solves the above problems is an evaluation function estimation method for ranking that estimates, from ranked data, an evaluation function serving as the criterion for ranking target data. The method inputs the ranked data, creates categorized data in which above/below each rank of the ranked data is assigned as a category, and estimates the evaluation function serving as the ranking criterion from that categorized data.
[0011]
In this configuration, the ranked data carries ranks, and above/below each rank of the ranked data is assigned as a category. For example, as in the embodiment described later, each ranked item in the sample is given category labels (y[j/i,s]) indicating above/below rank 1, above/below rank 2, ..., above/below rank n−1, where n is the lowest rank appearing in the ranked data.
[0012]
In the evaluation function estimation method for ranking, it is preferable, as in the embodiment described later, to set the parameters of the evaluation function so that, for the categorized data, the value of the evaluation function is larger for higher-ranked items than for lower-ranked items. It is also preferable to set the parameters so as to satisfy the conditions of expressions (12) to (14) and to minimize the value of expression (11). Expression (1) in the claims corresponds to expression (11) in the embodiment; likewise, expressions (2) and (12), (3) and (13), and (4) and (14) are the same expressions. It is further preferable to set the parameters using SMO, a quadratic programming method.
[0013]
The present invention is also a ranking method for ranking target data using the evaluation function estimated by the above evaluation function estimation method for ranking. The method inputs the evaluation function and the target data, calculates a score for each target item from them, and ranks the target data in the order of the scores.
[0014]
According to this configuration, it is possible to appropriately rank the target data to be ranked by using the evaluation function estimated by the evaluation function estimation method for ranking.
[0015]
The evaluation function estimation device for ranking according to the present invention comprises ranked data input means for inputting ranked data, categorized data creation means for creating categorized data by assigning above/below each rank of the ranked data as a category, and evaluation function estimation means for estimating, from the categorized data, an evaluation function for ranking the target data to be ranked.
[0016]
In this configuration as well, the ranked data carries ranks, and above/below each rank of the ranked data is assigned as a category. For example, as in the embodiment described later, each ranked item in the sample is given category labels (y[j/i,s]) indicating above/below rank 1, above/below rank 2, ..., above/below rank n−1, where n is the lowest rank appearing in the ranked data.
[0017]
In the evaluation function estimation device for ranking, as in the embodiment described later, the evaluation function estimation means sets the parameters of the evaluation function so that, for the categorized data, the value of the evaluation function is larger for higher-ranked items than for lower-ranked items.
[0018]
The present invention is also a ranking device for ranking target data using the evaluation function estimated by the above evaluation function estimation method for ranking. The device comprises data input means for inputting the evaluation function and the target data, score calculation means for calculating a score for each target item from the evaluation function and the target data, and means for ranking the target data in the order of the scores.
[0019]
According to this configuration, it is possible to appropriately rank the target data to be ranked by using the evaluation function estimated by the evaluation function estimation method for ranking.
[0020]
Further, the present invention that solves the above problems (claims 4 to 6) is an evaluation function estimation program for ranking, a ranking program, and a storage medium storing the evaluation function estimation program for ranking. These cause a computer to execute the steps of the programs so as to estimate the evaluation function for ranking or to rank the target data using the evaluation function.
[0021]
DETAILED DESCRIPTION OF THE INVENTION
Hereinafter, embodiments of the present invention will be described in detail with reference to the drawings.
In FIG. 1, referred to in this embodiment, (a) is a functional block diagram of the ranking device according to the embodiment, and (b) is a block diagram showing the hardware configuration of the ranking device in (a). FIG. 2 is a flowchart showing the evaluation function estimation method for ranking and the ranking method according to the embodiment.
[0022]
[Ranking device]
As shown in FIG. 1(a), the ranking device A, which executes the evaluation function estimation method for ranking and the ranking method, has ranked data input means 11 that reads ranked data, categorized data creation means 12 that creates categorized data in which above/below every rank is assigned as a category, evaluation function estimation means 13 that estimates the parameters of the evaluation function from the categorized data, and parameter storage means 14 that stores the parameters so that they can be reused. It also has target data input means 21 that reads the target data to be ranked, score calculation means 22 that calculates the scores of all target items from the evaluation function parameters and the target data, data sorting means 23 that sorts the target data by score, and data output means 24 that ranks and outputs the sorted target data. Each means (block) of the ranking device A is implemented as a program module. The detailed functions of each means are explained later with reference to the flowcharts.
[0023]
As shown in FIG. 1(b), the ranking device A of this embodiment has a main control device 1, a storage device 2, a communication control device 3, and an input/output device 4 connected to a system bus. The main control device 1 includes a CPU (Central Processing Unit), RAM (Random Access Memory), and the like; it estimates the evaluation function (its parameters) for ranking the target data, ranks the target data using the estimated evaluation function, and controls the ranking device A as a whole. The storage device 2 includes a hard disk device and the like, and stores the evaluation function parameters as well as various programs and data. The communication control device 3 includes, for example, a network interface card (NIC) and provides communication (data input/output) over a network using a predetermined communication protocol. The input/output device 4 has a keyboard, a mouse, and the like connected to it.
[0024]
[Processing flow of ranking device]
The processing flow of the ranking device A configured as described above is explained along the flowchart of FIG. 2, with reference to FIG. 1(b). The flowchart comprises two flows: one that estimates, from ranked samples, an evaluation function that needs no preset threshold rank and whose accuracy does not degrade across ranks, and one that performs accurate ranking based on the estimated evaluation function.
[0025]
In information filtering, one field of information retrieval, a user ranks sample documents taken from a huge document set (usually far fewer than the whole document set) in order of usefulness, and the entire document set is then ranked with reference to that ranking. This embodiment has information filtering in mind: the ranked data corresponds to the sample documents ranked by the user, the target data corresponds to the document set, and the output data corresponds to the ranked document set.
[0026]
First, the ranked data input means 11 reads the ranked sample data (S11; ranked data input step) and passes it to the categorized data creation means 12. The categorized data creation means 12 assigns above/below every rank of the ranked data as a category (S12; categorized data creation step) and passes the result to the evaluation function estimation means 13. The evaluation function estimation means 13 estimates, from the categorized data, evaluation function parameters that can discriminate above/below for every rank at once (S13; evaluation function estimation step) and passes them to the parameter storage means 14. The parameter storage means 14 stores the parameters of the evaluation function so that they can be reused (S14; parameter storage step). As a result, the parameters of the evaluation function are estimated from the input ranked data and stored for reuse.
[0027]
In this embodiment, "above" a rank includes that rank itself, whereas "below" does not.
[0028]
Next, the target data input means 21 reads the target data to be ranked (S21; target data input step) and passes it to the score calculation means 22. The score calculation means 22 reads the parameters of the evaluation function from the parameter storage means 14, calculates the score of each target data based on it (S22; score calculation step), and passes it to the data alignment means 23 together with the target data. The data aligning means 23 aligns the target data in the order of the score (S23; data aligning step), and passes the result to the data output means 24. Finally, the data output means 24 outputs the aligned target data (S24; data output step).
[0029]
[Specific operation]
Next, the specific operation of the present embodiment will be described in more detail with reference to FIG. In the following description, for ease of explanation, a case where all steps (S11 to S24) in the flowchart of FIG. 2 are continuously operated as shown in the flowchart of FIG. 3 will be described.
[0030]
FIG. 3 is a flowchart in which the flowchart of FIG. 2 is arranged in series, with the same step numbers as in FIG. 2. FIG. 4 is a table showing an example of ranked data, FIG. 5 is a table showing an example of target data, and FIG. 6 is a table showing an example of output data. FIG. 7 is a graph in which the ranked data (FIG. 4) and the target data (FIG. 5) are plotted on a plane to make the data easier to understand. FIG. 8 is a table showing an example of category-added data.
[0031]
In FIGS. 4 to 8, χ[j/i] denotes ranked data serving as sample documents (i = I, B, C; j = 1st, 2nd, 3rd), and A, B, and C denote the target data, that is, the set of documents to be ranked. The subscript i of χ[j/i] distinguishes the sets of ranked data, and the subscript j represents the rank within each set; the set labels B and C are distinct from the target data B and C. The numbers in parentheses, such as (0.9, 0.8) and (0.4, 0.5), are scores related to the number of words in the sample document (ranked data) or in the document to be ranked (target data) that match the search request (for example, a keyword). The ranks of the ranked data in FIG. 4 are determined in advance based on these scores. In FIG. 7, χ[j/I] is represented by ◯, χ[j/B] by Δ, and χ[j/C] by ×. In FIG. 7, the first value in each pair of parentheses in FIGS. 4 and 5 is plotted on the horizontal axis and the second value on the vertical axis (the same applies to FIGS. 13 and 14 described later).
[0032]
Ranking by a user may be performed a plurality of times. The ranked data in FIG. 4 corresponds, for example, to a case where three documents are extracted from the document set and ranked, and this is done three times. In FIG. 4, the data from first place to third place with subscript I form one set. Similarly, in the present embodiment, the data from first place to third place form one set for each of set B and set C.
[0033]
Hereinafter, following the flowchart of FIG. 3, the specific operation is described in which an evaluation function (its parameters) is estimated from the ranked data shown in FIG. 4, the target data shown in FIG. 5 are ranked, and the output data shown in FIG. 6 are obtained. The specific processing described here, together with the processing flow described with reference to the flowchart of FIG. 2, corresponds to the "evaluation function estimation method for ranking", the "ranking method", the "evaluation function estimation program for ranking", and the "ranking program" in the claims.
[0034]
First, in order to estimate the evaluation function, in the ranked data input step (S11), the ranked data input means 11 reads the ranked data shown in FIG.
[0035]
In the category-added data creation step (S12), the category-added data creation means 12 creates labeled data by assigning, to the ranked data read in step S11, a category label indicating whether each datum is higher or lower than first place and a category label indicating whether it is higher or lower than second place. That is, category-added data are created in which the higher/lower status with respect to every rank of the ranked data is assigned as a category. In the present embodiment, the category label is denoted y[j/i, s]: y[j/i, s] = +1 when the ranked datum χ[j/i] is higher than (or equal to) the s-th place, and y[j/i, s] = −1 when it is lower (i = I, B, C; j = 1st, 2nd, 3rd; s = 1, 2).
[0036]
According to this determination method, y [j / i, s] is as follows (see FIG. 8), and these results are passed to the evaluation function estimation means 13.
[0037]
χ[1st/I] is at or above 1st place, so y[1st/I, 1] = +1,
χ[1st/I] is at or above 2nd place, so y[1st/I, 2] = +1,
χ[2nd/I] is below 1st place, so y[2nd/I, 1] = −1,
χ[2nd/I] is at or above 2nd place, so y[2nd/I, 2] = +1,
χ[3rd/I] is below 1st place, so y[3rd/I, 1] = −1,
χ[3rd/I] is below 2nd place, so y[3rd/I, 2] = −1
[0038]
χ[1st/B] is at or above 1st place, so y[1st/B, 1] = +1,
χ[1st/B] is at or above 2nd place, so y[1st/B, 2] = +1,
χ[2nd/B] is below 1st place, so y[2nd/B, 1] = −1,
χ[2nd/B] is at or above 2nd place, so y[2nd/B, 2] = +1,
χ[3rd/B] is below 1st place, so y[3rd/B, 1] = −1,
χ[3rd/B] is below 2nd place, so y[3rd/B, 2] = −1
[0039]
χ[1st/C] is at or above 1st place, so y[1st/C, 1] = +1,
χ[1st/C] is at or above 2nd place, so y[1st/C, 2] = +1,
χ[2nd/C] is below 1st place, so y[2nd/C, 1] = −1,
χ[2nd/C] is at or above 2nd place, so y[2nd/C, 2] = +1,
χ[3rd/C] is below 1st place, so y[3rd/C, 1] = −1,
χ[3rd/C] is below 2nd place, so y[3rd/C, 2] = −1
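The labeling rule of step S12 can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `make_category_labels` and the dictionary layout are assumptions introduced here, not part of the embodiment. For a datum of rank j and each threshold rank s from 1 to (largest rank − 1), the label is +1 when j ≤ s (at or above the s-th place) and −1 otherwise.

```python
def make_category_labels(rank, max_rank):
    """Return {s: label} for one datum of the given rank (hypothetical helper).

    The label is +1 if the datum is at or above the s-th place, else -1,
    for every threshold rank s = 1 .. max_rank - 1.
    """
    return {s: (+1 if rank <= s else -1) for s in range(1, max_rank)}


# Labels for one set of three ranked items, as in the embodiment (s = 1, 2).
labels = {rank: make_category_labels(rank, max_rank=3) for rank in (1, 2, 3)}
for rank, by_threshold in labels.items():
    print(rank, by_threshold)
```

Because the rule depends only on the rank, the same six labels result for every set, which agrees with the three identical listings above.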
[0040]
In the evaluation function estimation step (S13), the evaluation function estimation means 13 estimates (sets) the type and parameters of the evaluation function f(χ) so that, for the category-labeled data created in step S12, the value of f(χ) is larger for data higher than first place than for data lower than first place, and larger for data higher than second place than for data lower than second place.
[0041]
Various types of evaluation function and various parameter estimation (setting) methods are conceivable. In this embodiment, a linear function of the input χ, f(χ) = w · χ (w is a two-dimensional vector), is adopted in advance as the type of evaluation function. As the parameter estimation method for this linear function f(χ), the method adopted is to find the parameters w, b_s, and ξ[j/i, s] that satisfy the following conditional expressions (12) to (14) and minimize the value of the objective function R (that is, the following expression (11)), where i = I, B, C, j = 1, 2, 3, and s = 1, 2.
[0042]
[Expression 2]

[0043]
b_s in equations (12) and (13) is a threshold value that separates higher from lower with respect to the s-th place. Specifically, b_1 is the threshold that separates higher/lower for first place, and b_2 is the threshold that separates higher/lower for second place. Equation (12) is a condition requiring that, when y[j/i, s] = +1, that is, when χ[j/i] is higher than the s-th place, the evaluation function value f(χ[j/i]) be larger than b_s (with a margin of 1 − ξ[j/i, s]). Equation (13) is a condition requiring that, when y[j/i, s] = −1, that is, when χ[j/i] is lower than the s-th place, f(χ[j/i]) be smaller than b_s (with a margin of 1 − ξ[j/i, s]).
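The image for [Expression 2] is not reproduced in this text, so the following is only a reconstruction from the surrounding description, under stated assumptions. The slack sum is written with unit weight (any weighting constant in the original expression cannot be recovered from the text), and equation (14) is assumed to be the non-negativity condition on ξ[j/i, s].

```latex
\begin{align}
R &= \tfrac{1}{2}\lvert w\rvert^{2} + \sum_{i}\sum_{j}\sum_{s}\xi[j/i,s] \tag{11}\\
f(\chi[j/i]) - b_{s} &\ge 1 - \xi[j/i,s] \quad \text{if } y[j/i,s]=+1 \tag{12}\\
f(\chi[j/i]) - b_{s} &\le -\bigl(1 - \xi[j/i,s]\bigr) \quad \text{if } y[j/i,s]=-1 \tag{13}\\
\xi[j/i,s] &\ge 0 \tag{14}
\end{align}
```

Written this way, (12) and (13) match the description of a margin of 1 − ξ[j/i, s] on either side of the threshold b_s, and (11) contains the two terms the description attributes to the objective function: the size of w and the sum of the relaxations.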
[0044]
In this embodiment, since each set of ranked sample data contains three items, only the two thresholds b_1 and b_2 are needed. However, if ten ranked data items (sample documents) are given, nine thresholds b_1 to b_9 are required, corresponding to higher/lower than 1st, higher/lower than 2nd, ..., higher/lower than 9th. There is no particular need for a person to set these threshold values b_s.
[0045]
Here, to explain the meaning of each formula, suppose that parameters satisfying formulas (11) to (14) with all ξ[j/i, s] set to 0 have been found. Then, as is clear from equations (12) and (13), with these parameters, for each s, the value of f(χ) for data higher than the s-th place is at least 2 larger than for data lower than the s-th place. It follows that, using such f(χ), the higher/lower status for each rank can be correctly classified (discriminated) for all χ[j/i].
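The gap of at least 2 follows directly from the two conditions as described in the text. A short derivation, assuming all ξ[j/i, s] = 0 and writing χ⁺ for a datum higher than the s-th place and χ⁻ for a datum lower than it (these two symbols are introduced here only for illustration):

```latex
\begin{align*}
f(\chi^{+}) &\ge b_{s} + 1, \qquad f(\chi^{-}) \le b_{s} - 1\\
\Rightarrow\quad f(\chi^{+}) - f(\chi^{-}) &\ge (b_{s}+1) - (b_{s}-1) = 2
\end{align*}
```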
[0046]
In general, the minimization problem may have no solution when all ξ[j/i, s] are 0. Therefore, in expressions (12) and (13), the conditions are relaxed by ξ[j/i, s] so as to obtain evaluation function parameters that can classify higher/lower approximately. However, since the degree of relaxation should be as small as possible, the sum of ξ[j/i, s] is included in the objective function R (that is, equation (11)), so that the evaluation function is determined with ξ[j/i, s] as small as possible.
[0047]
Further, the objective function R in equation (11) includes a term (1/2)|w|² that limits the size of the parameter w, so as not to estimate an evaluation function that classifies only the samples well (that is, to avoid so-called overlearning). The fact that such a term has the effect of preventing overlearning is described in, for example, Vladimir N. Vapnik, "The Nature of Statistical Learning Theory, Second Edition", Springer, 1999.
[0048]
The problem of finding the parameter w that satisfies equations (12) to (14) and minimizes equation (11) is a quadratic programming problem, for which various solution methods are known. In the present embodiment, the minimization problem is solved using Sequential Minimal Optimization (SMO), one such method. SMO transforms equations (11) to (14) into another form called the dual problem and then minimizes by repeatedly changing a pair of variables in the direction that decreases the objective function. This minimization is described in detail in Chapter 10 of Bernhard Schölkopf and Alexander J. Smola, "Learning with Kernels", The MIT Press, 2002.
[0049]
In the present embodiment, the evaluation function estimation means 13 applies the category-added data of FIG. 8 to equations (11) to (14) and obtains, by SMO, the parameter values (evaluation function parameters) that minimize equation (11). As a result, it is assumed that w = (1.2, 1.3), shown in FIG. 9, is obtained as the parameter. This parameter is passed to the parameter storage means 14. Using SMO makes it easy to estimate the parameters of the evaluation function by computer processing.
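To make the estimation step concrete, the following is a minimal sketch, not the embodiment's implementation. It swaps SMO for plain subgradient descent on an equivalent hinge-loss form of the objective ((1/2)|w|² plus the sum of the relaxations, with unit weight assumed), which is simpler to show in a few lines. The sample coordinates are hypothetical (the values of FIG. 4 are not reproduced here), as are the names `train` and `score`, the learning rate, and the iteration count.

```python
# Hypothetical ranked samples: each set lists feature vectors from 1st to 3rd place.
RANKED_SETS = {
    "I": [(0.9, 0.8), (0.6, 0.5), (0.2, 0.3)],
    "B": [(0.8, 0.9), (0.5, 0.6), (0.3, 0.1)],
    "C": [(0.7, 0.7), (0.4, 0.5), (0.1, 0.2)],
}


def score(w, x):
    """Linear evaluation function f(x) = w . x."""
    return sum(wk * xk for wk, xk in zip(w, x))


def train(ranked_sets, lr=0.005, steps=4000):
    """Subgradient descent on 1/2|w|^2 + sum of hinge slacks (unit weight assumed)."""
    max_rank = max(len(items) for items in ranked_sets.values())
    dim = len(next(iter(ranked_sets.values()))[0])
    w = [0.0] * dim
    b = {s: 0.0 for s in range(1, max_rank)}  # one threshold per rank boundary
    for _ in range(steps):
        grad_w = list(w)                      # gradient of 1/2|w|^2
        grad_b = {s: 0.0 for s in b}
        for items in ranked_sets.values():
            for rank, x in enumerate(items, start=1):
                for s in b:
                    y = 1 if rank <= s else -1
                    # The slack is active when the margin y*(f(x) - b_s) is below 1.
                    if y * (score(w, x) - b[s]) < 1:
                        for k in range(dim):
                            grad_w[k] -= y * x[k]
                        grad_b[s] += y
        w = [wk - lr * gk for wk, gk in zip(w, grad_w)]
        b = {s: b[s] - lr * grad_b[s] for s in b}
    return w, b


w, b = train(RANKED_SETS)
print("w =", w, "thresholds =", b)
```

Note that the thresholds come out of the optimization together with w; nothing in the sketch asks the caller to choose a threshold rank.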
[0050]
In the parameter storage step (S14), the parameters shown in FIG. 9 are stored in the parameter storage means 14 in a reusable form. Here, a reusable form means, for example, storage in a magnetic storage device or memory, but any form may be used as long as the parameters can be read automatically later. The threshold values b_s can also be stored.
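One possible "reusable form" is a small file on disk. The sketch below assumes JSON as the storage format and uses the hypothetical helper names `save_parameters` and `load_parameters`. The weight vector is the one given for FIG. 9; the threshold values are placeholders invented for illustration, since the text does not state them.

```python
import json
import os
import tempfile


def save_parameters(path, w, thresholds):
    """Store the evaluation function parameters so they can be read back later."""
    payload = {"w": list(w), "b": {str(s): v for s, v in thresholds.items()}}
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(payload, fh)


def load_parameters(path):
    """Read back the parameters written by save_parameters."""
    with open(path, encoding="utf-8") as fh:
        data = json.load(fh)
    return tuple(data["w"]), {int(s): v for s, v in data["b"].items()}


path = os.path.join(tempfile.mkdtemp(), "evaluation_function.json")
save_parameters(path, (1.2, 1.3), {1: 0.0, 2: 0.0})  # thresholds are placeholders
w_loaded, b_loaded = load_parameters(path)
print(w_loaded, b_loaded)
```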
[0051]
In the target data input step (S21), the target data input means 21 reads the data (document set) to be ranked. In the present embodiment, the data (A, B, C) of FIG. 5 are read and passed to the score calculation means 22, which calculates the scores of the target data.
[0052]
In the score calculation step (S22), the score calculation means 22 reads the parameter (FIG. 9) of the evaluation function stored in the parameter storage means 14, and uses the evaluation function f (χ) = w · χ into which the parameter is substituted. Thus, a score is calculated for each of the target data (FIG. 5) passed from the target data input means 21.
[0053]
From FIGS. 5 and 9, the score of data A in the score calculation step (S22) is f(A) = 1.2 × 0.9 + 1.3 × 0.6 = 1.86, the score of data B is f(B) = 1.2 × 0.6 + 1.3 × 0.2 = 0.98, and the score of data C is f(C) = 1.2 × 0.3 + 1.3 × 0.8 = 1.40 (see FIG. 10). Thus, in the score calculation step (S22), each score is calculated as the inner product of the values of the target data in FIG. 5 and the parameters of the evaluation function. The calculated scores are passed to the data alignment means 23.
[0054]
FIG. 9 is a diagram illustrating an example of parameters of the evaluation function. FIG. 10 is a diagram illustrating an example of scores calculated in the score calculation step. FIG. 11 is a diagram illustrating an example of output data in which target data is ranked.
[0055]
In the data alignment step (S23), the data alignment means 23 rearranges the target data in descending order of the delivered scores. In this example, as shown in FIG. 10, the score decreases in the order of A, C, and B, so the target data is rearranged in that order and passed to the data output means 24 (FIG. 11).
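Steps S22 and S23 amount to an inner product followed by a descending sort. The following minimal sketch uses the parameter w = (1.2, 1.3) and the target data values that appear in the score calculation above; the function names `score` and `rank_targets` are assumptions made for illustration.

```python
W = (1.2, 1.3)  # evaluation function parameters given for FIG. 9
TARGETS = {"A": (0.9, 0.6), "B": (0.6, 0.2), "C": (0.3, 0.8)}  # values used in the text


def score(w, x):
    """Linear evaluation function f(x) = w . x."""
    return sum(wk * xk for wk, xk in zip(w, x))


def rank_targets(w, targets):
    """Return (name, score) pairs sorted by descending score."""
    scored = [(name, score(w, x)) for name, x in targets.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


ranking = rank_targets(W, TARGETS)
for place, (name, value) in enumerate(ranking, start=1):
    print(place, name, round(value, 2))
```

The resulting order is A, C, B, matching the order stated for FIG. 10.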
[0056]
In the data output step (S24), the data output means 24 outputs the rearranged data together with the rank. In the present embodiment, as shown in FIG.
[0057]
In the above embodiment, both the ranked data and the target data are described as sets of three items, but the number of items per set may be other than three. In addition, the number of items need not be the same for all sets. In that case, s in formulas (11) to (14) ranges up to a value one less than the largest rank in the ranked data, and the same procedure as in the present embodiment can be performed by setting y[j/i, s] = +1 for data higher than the s-th place and y[j/i, s] = −1 for data lower than it.
[0058]
[Consideration of results]
In the above embodiment (working example), when the evaluation function (its parameters) is estimated from the ranked data (samples), no threshold rank needs to be set in advance, and an evaluation function whose accuracy does not deteriorate over all ranks is estimated and used for ranking. Here, FIG. 12 is a table showing the category-added data used for estimating the evaluation function in the conventional method, with second place as the threshold. FIG. 13 is a diagram illustrating an example of output data obtained by the ranking method of the embodiment. FIG. 14 is a diagram illustrating an example of output data obtained by ranking according to the conventional method.
[0059]
First, in the category-added data creation step (S12), labels are assigned not as higher/lower with respect to one specific rank but as higher/lower with respect to all ranks, so it is not necessary to determine a specific rank as a threshold. In this embodiment, the threshold values b_s are obtained naturally, and there is no particular need to decide them.
[0060]
Next, the fact that an evaluation function (its parameters) whose accuracy does not deteriorate over all ranks can be estimated is explained by comparison with the following conventional method.
[0061]
First, as the conventional method, following the proposal of Hirao et al., an evaluation function that classifies the ranked data of this embodiment into higher/lower than second place is estimated by a support vector machine (SVM). That is, category-added data as shown in FIG. 12 are prepared, and SVM is applied with the data for which y[j/i] = +1 as positive examples and the data for which y[j/i] = −1 as negative examples. Here, the category-added data use second place as the threshold. The form of the evaluation function used for estimation is a linear function of the input, w′ · χ, as in the present embodiment.
[0062]
The result of the SVM calculation is w′ = (1.1, 0.3). The proposal of Hirao et al. is described in "Important sentence extraction by Support Vector Machine", Information Processing Society of Japan, Proceedings of the 63rd Fundamentals of Informatics Research Meeting (July 2001).
[0063]
When scores are calculated for the target data of the present embodiment (see FIG. 5) using this parameter (w′ = (1.1, 0.3)), the score of A is 1.1 × 0.9 + 0.3 × 0.6 = 1.17, the score of B is 1.1 × 0.6 + 0.3 × 0.2 = 0.72, and the score of C is 1.1 × 0.3 + 0.3 × 0.8 = 0.57. Therefore, the ranking by the conventional method is 1st A, 2nd B, 3rd C (A → B → C), which differs in second and third place from the ranking by the present invention, 1st A, 2nd C, 3rd B (A → C → B).
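The difference between the two rankings can be checked with a short script. It uses only numbers quoted in the text (the target data values and the two weight vectors); the helper name `order_by_score` is an assumption made for illustration.

```python
TARGETS = {"A": (0.9, 0.6), "B": (0.6, 0.2), "C": (0.3, 0.8)}
W_PROPOSED = (1.2, 1.3)      # estimated over all ranks (present embodiment)
W_CONVENTIONAL = (1.1, 0.3)  # SVM with a single threshold at second place


def order_by_score(w, targets):
    """Names of the target data sorted by descending w . x."""
    def f(name):
        return sum(wk * xk for wk, xk in zip(w, targets[name]))
    return sorted(targets, key=f, reverse=True)


proposed = order_by_score(W_PROPOSED, TARGETS)
conventional = order_by_score(W_CONVENTIONAL, TARGETS)
# Places (1-based) at which the two rankings disagree.
differing_places = [
    place
    for place, (p, c) in enumerate(zip(proposed, conventional), start=1)
    if p != c
]
print(proposed, conventional, differing_places)
```

The small second component of w′ is what pushes C, whose second value is large, below B in the conventional ranking.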
[0064]
Looking at FIG. 7, in which the ranked target data (output data) are plotted, it can be seen that, as a whole, the rank number increases (that is, the rank becomes lower) from the upper right toward the lower left. Therefore, the ranking A, C, B of the present invention for the target data can be said to be more plausible than the ranking A, B, C of the conventional method. This is because the present invention was able to estimate an evaluation function that is plausible overall by taking the whole ranking into account (FIG. 13), whereas the conventional method, which looks only at the higher/lower classification for second place, could only estimate an evaluation function with a large error in the higher/lower classification for the other rank, first place (FIG. 14). Moreover, when the number of target data is increased further, the ranking accuracy of the conventional method using a threshold rank deteriorates greatly as the distance from the threshold rank increases.
[0065]
Even with methods that do not use SVM, the same problem arises as long as the evaluation function is estimated based only on the higher/lower classification for one specific rank.
[0066]
In this embodiment, in the category-added data creation step (S12 in FIG. 3), categories are assigned not for a single rank but for each rank, that is, the higher/lower status with respect to every rank appearing in the samples (the ranked data) is assigned. In the evaluation function estimation step (S13 in FIG. 3), the parameters of an evaluation function capable of performing the higher/lower classification for all ranks are determined. This makes it possible to estimate an evaluation function that achieves good accuracy over all ranks without having to set a threshold rank in advance.
[0067]
In addition, the conventional method has the problem that which rank should be used as the threshold cannot be determined without examining the accuracy of the actually estimated evaluation function. The estimation method of the present embodiment has no such problem. For this reason, the evaluation function can be estimated quickly.
[0068]
The present invention described above is not limited to the above embodiment and can be modified widely. For example, the steps from the ranked data input step S11 to the parameter storage step S14 and the steps from the target data input step S21 to the data output step S24 in FIG. 3 may be operated independently. It is also possible to rank different target data by running the steps from the ranked data input step S11 to the parameter storage step S14 only once and then repeatedly running the steps from the target data input step S21 to the data output step S24. In the embodiment, a table format is used to represent each data item. However, the representation is not limited to the table format, and any logically equivalent format may be used. Each program can be transmitted and received over a network or stored in a storage medium such as a CD-ROM.
[0069]
Further, in the embodiment, the higher/lower categories are assigned for all ranks that appear, but it is not always necessary to do so for every rank that appears, and some ranks may be thinned out. In other words, "assigning the higher/lower status for all ranks as categories" in the embodiment is one example of "assigning the higher/lower status for each rank as categories" in the present invention.
[0070]
Further, in the embodiment, the ranked data used as samples consist of three sets (i = I, B, C) of three items each (j = 1, 2, 3), but data of about 5 to 50 items per set and about 5 to 20 sets may also be used. In this case, with respect to formulas (1) to (4), the sum over i is taken over all sets of ranked data. Similarly, the sum over j is taken from first place to the largest rank appearing in the ranked data, and the sum over s is taken from 1 to a value one less than the largest rank appearing in the ranked data.
[0071]
In addition, the values in parentheses such as (0.9, 0.8) and (0.4, 0.5) in FIG. 4 and elsewhere are scores related to the number of words matching the search request, but any measure that can be used for ranking may be used.
[0072]
[Effect of the invention]
As described above, according to the present invention, by estimating an evaluation function that can classify the higher/lower status for each rank, an evaluation function can be obtained without setting a threshold rank in advance and without accuracy deteriorating across ranks. Using this evaluation function, ranking can be performed without accuracy deteriorating at any rank. Further, it is possible to provide an evaluation function estimation device for estimating the evaluation function for ranking, a ranking device for performing ranking using the evaluation function, an evaluation function estimation program for ranking, and a storage medium storing the program. That is, according to the present invention described above, it is possible to estimate an evaluation function that can appropriately rank the target data to be ranked, and to perform ranking appropriately using the estimated evaluation function.
[Brief description of the drawings]
FIG. 1A is a functional block diagram of a ranking apparatus according to an embodiment of the present invention, and FIG. 1B is a block diagram showing a hardware configuration of the ranking apparatus of FIG.
FIG. 2 is a flowchart showing an evaluation function estimation method and a ranking method for ranking according to an embodiment of the present invention.
FIG. 3 is a flowchart in which the flowchart of FIG. 2 is arranged in series.
FIG. 4 is a table showing an example of ranked data for estimating an evaluation function.
FIG. 5 is a table showing an example of target data to be ranked.
FIG. 6 is a table showing an example of output data in which target data is ranked.
FIG. 7 is a graph in which ranked data (FIG. 4) and target data (FIG. 5) are plotted on a plane to facilitate understanding of the data.
FIG. 8 is a table showing an example of data with categories.
FIG. 9 is a diagram illustrating an example of parameters of an evaluation function.
FIG. 10 is a diagram illustrating an example of scores calculated in a score calculation step.
FIG. 11 is a diagram illustrating an example of output data in which target data is ranked.
FIG. 12 is a table showing categorized data used for estimation of an evaluation function of a conventional method, with the second place as a threshold value.
FIG. 13 is a diagram showing an example of output data obtained by the ranking method according to the embodiment of the present invention.
FIG. 14 is a diagram illustrating an example of output data obtained by ranking according to a conventional method.
[Explanation of symbols]
A ... Ranking device
11 ... Ranked data input means, 12 ... Category-added data creation means, 13 ... Evaluation function estimation means, 14 ... Parameter storage means, 21 ... Target data input means, 22 ... Score calculation means, 23 ... Data alignment means, 24 ... Data output means

Claims

An evaluation function estimation device for ranking, comprising:
ranked data input means for inputting ranked data to which ranks have been assigned;
categorized data creating means for creating categorized data by assigning, as categories, the higher/lower status with respect to each rank of the ranked data; and
evaluation function estimating means for estimating, from the categorized data, an evaluation function for ranking target data to be ranked, by setting parameters of the evaluation function so as to satisfy the conditions of the following formulas (2) to (4) and minimize the value of formula (1).

The evaluation function estimation device for ranking according to claim 1, wherein the parameters are set using SMO, which is a quadratic programming method.

A ranking device that ranks target data to be ranked using the evaluation function estimated by the evaluation function estimation device for ranking according to claim 1 or 2, the ranking device comprising:
data input means for inputting the evaluation function and the target data;
score calculating means for calculating a score of each target data item from the evaluation function and the target data; and
means for ranking the target data in the order of the scores.

An evaluation function estimation program for ranking that causes a computer to function as each means constituting the evaluation function estimation device for ranking according to claim 1.

A ranking program for causing a computer to function as each means constituting the ranking device according to claim 3.

A program storage medium characterized by storing the evaluation function estimation program for ranking according to claim 4.