JP5364578B2

JP5364578B2 - Method and system for transductive data classification and data classification method using machine learning technique

Info

Publication number: JP5364578B2
Application number: JP2009519439A
Authority: JP
Inventors: Mauritius A. R. Schmidtler; Christopher K. Harris; Roland Borrey; Anthony Sarah; Nicola Caruso
Original assignee: Kofax, Inc.
Priority date: 2006-07-12
Filing date: 2007-06-07
Publication date: 2013-12-11
Anticipated expiration: 2027-06-07
Also published as: WO2008008142A2; WO2008008142A3; EP1924926A4; EP1924926A2; JP2009543254A

Abstract

A system, method, data processing apparatus, and article of manufacture are provided for classifying data. Data classification methods using machine learning techniques are also disclosed.

Description

The present invention relates generally to methods and apparatus for data classification. More particularly, it provides improved transductive machine learning methods. The invention also relates to novel applications that use machine learning techniques.

Methods for processing data are of growing importance in the information age, and more recently with the rapid proliferation of electronic data in all areas of life, including, among others, scanned documents, web material, search engine data, text data, images, and audio data files.

One area that has only begun to be explored is the non-manual classification of data. In many classification methods, the machine or computer must learn from manually entered and created rule sets and/or manually created training examples. In machine learning that uses training examples, the number of training examples is typically small compared with the number of parameters that must be estimated; that is, many solutions satisfy the constraints imposed by the training examples. The challenge of machine learning is to find a solution that generalizes well despite this lack of constraints. There is therefore a need to overcome these and/or other problems associated with the prior art.

What is further needed are practical applications for machine learning techniques of all kinds.

In a computer-based system, a method for classifying data according to one embodiment of the present invention includes: receiving labeled data points, each having at least one label indicating whether the data point is a training example for data points to be included in a designated category or a training example for data points to be excluded from the designated category; receiving unlabeled data points; receiving at least one predetermined cost factor for the labeled and unlabeled data points; training a transductive classifier using maximum entropy discrimination (MED) through iterative calculation, using the at least one cost factor and the labeled and unlabeled data points as training examples, where for each iteration of the calculation the cost factor of the unlabeled data points is adjusted as a function of an expected label value and the prior probability of a data point label is adjusted according to an estimate of the data point's class membership probability; applying the trained classifier to classify at least one of the unlabeled data points, the labeled data points, and input data points; and outputting the classification of the classified data points, or a derivative thereof, to at least one of a user, another system, and another process.

A method for classifying data according to another embodiment of the present invention includes providing computer-executable program code to be deployed to and executed on a computer. The program code includes instructions for: accessing labeled data points stored in a memory of a computer, each having at least one label indicating whether the data point is a training example for data points to be included in a designated category or a training example for data points to be excluded from the designated category; accessing unlabeled data points from a memory of a computer; accessing at least one predetermined cost factor for the labeled and unlabeled data points from a memory of a computer; training a maximum entropy discrimination (MED) transductive classifier through iterative calculation using the at least one stored cost factor and the stored labeled and unlabeled data points as training examples, where for each iteration of the calculation the cost factor of the unlabeled data points is adjusted as a function of an expected label value and the prior probability of a data point label is adjusted according to an estimate of the data point's class membership probability; applying the trained classifier to classify at least one of the unlabeled data points, the labeled data points, and input data points; and outputting the classification of the classified data points, or a derivative thereof, to at least one of a user, another system, and another process.

A data processing apparatus according to another embodiment of the present invention includes: at least one memory for storing (i) labeled data points, each having at least one label indicating whether the data point is a training example for data points to be included in a designated category or a training example for data points to be excluded from the designated category, (ii) unlabeled data points, and (iii) at least one predetermined cost factor for the labeled and unlabeled data points; and a transductive classifier trainer for iteratively teaching a transductive classifier using transductive maximum entropy discrimination (MED), using the at least one stored cost factor and the stored labeled and unlabeled data points as training examples. At each iteration of the MED calculation, the cost factor of the unlabeled data points is adjusted as a function of an expected label value, and the prior probability of a data point label is adjusted according to an estimate of the data point's class membership probability. A classifier trained by the transductive classifier trainer is used to classify at least one of the unlabeled data points, the labeled data points, and input data points, and the classification of the classified data points, or a derivative thereof, is output to at least one of a user, another system, and another process.

An article of manufacture according to another embodiment of the present invention comprises a computer-readable program storage medium that tangibly embodies one or more programs of computer-executable instructions for performing a method of data classification. The method includes: receiving labeled data points, each having at least one label indicating whether the data point is a training example for data points to be included in a designated category or a training example for data points to be excluded from the designated category; receiving unlabeled data points; receiving at least one predetermined cost factor for the labeled and unlabeled data points; training a transductive classifier by iterative maximum entropy discrimination (MED) calculation using the at least one stored cost factor and the stored labeled and unlabeled data points, where at each iteration of the MED calculation the cost factor of the unlabeled data points is adjusted as a function of an expected label value and the prior probability of a data point is adjusted according to an estimate of the data point's class membership probability; applying the trained classifier to classify at least one of the unlabeled data points, the labeled data points, and input data points; and outputting the classification of the classified data points, or a derivative thereof, to at least one of a user, another system, and another process.

In a computer-based system, a method for classifying unlabeled data according to another embodiment of the present invention includes: receiving labeled data points, each having at least one label indicating whether the data point is a training example for data points to be included in a designated category or a training example for data points to be excluded from the designated category; receiving labeled and unlabeled data points; receiving prior label probabilities for the labeled and unlabeled data points; receiving at least one predetermined cost factor for the labeled and unlabeled data points; determining an expected label for each labeled and unlabeled data point based on the prior probability of the data point's label; and repeating the following sub-steps until the data values substantially converge:
・generating a scaled cost value for each unlabeled data point in proportion to the absolute value of the data point's expected label;
・training a classifier by using the labeled and unlabeled data as training examples according to their expected labels and determining the decision function that minimizes the KL divergence to a prior probability distribution of the decision function parameters, given the included training examples and the excluded training examples;
・determining classification scores for the labeled and unlabeled data points using the trained classifier;
・calibrating the output of the trained classifier to class membership probabilities;
・updating the prior probabilities of the labels of the unlabeled data points according to the determined class membership probabilities;
・determining the probability distributions of the labels and margins using maximum entropy discrimination (MED), with the updated label prior probabilities and the previously determined classification scores;
・computing new expected labels using the previously determined label probability distributions;
・updating the expected label for each data point by combining the new expected label with the expected label obtained in the previous iteration.
The classification of the input data points, or a derivative thereof, is output to at least one of a user, another system, and another process.
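The sub-steps above can be sketched as a loop. This is a minimal illustration under stated assumptions, not the patented MED procedure: a cost-weighted logistic regression stands in for the KL-minimizing decision function, a fixed sigmoid stands in for score calibration, and the MED label/margin distribution step is collapsed into the calibrated probability. All names (`train_weighted_linear`, `transductive_loop`, `mix`, `tol`) are hypothetical.

```python
import math


def sigmoid(z):
    """Numerically clamped logistic function."""
    z = max(-30.0, min(30.0, z))
    return 1.0 / (1.0 + math.exp(-z))


def train_weighted_linear(points, targets, costs, epochs=200, lr=0.1):
    """Stand-in trainer (cost-weighted logistic regression), NOT MED."""
    w = [0.0] * len(points[0])
    b = 0.0
    for _ in range(epochs):
        for x, y, c in zip(points, targets, costs):
            score = sum(wi * xi for wi, xi in zip(w, x)) + b
            # Gradient of c * log(1 + exp(-y * score)) with respect to score.
            g = -y * c * sigmoid(-y * score)
            w = [wi - lr * g * xi for wi, xi in zip(w, x)]
            b -= lr * g
    return w, b


def transductive_loop(labeled, labels, unlabeled,
                      base_cost=1.0, mix=0.5, tol=1e-3, max_iter=50):
    """Iterate: scale costs, train, score, calibrate, update expected labels."""
    expected = [0.0] * len(unlabeled)  # uninformative prior: P(+1) = 0.5
    w, b = [0.0] * len(labeled[0]), 0.0
    for iteration in range(1, max_iter + 1):
        # Cost of each unlabeled point is proportional to |expected label|.
        costs = [base_cost] * len(labeled) + [base_cost * abs(e) for e in expected]
        targets = list(labels) + [1.0 if e >= 0 else -1.0 for e in expected]
        w, b = train_weighted_linear(labeled + unlabeled, targets, costs)
        # Score the unlabeled points and map scores to probabilities.
        probs = [sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
                 for x in unlabeled]
        # New expected label E[y] = 2p - 1, mixed with the previous value.
        updated = [mix * e + (1.0 - mix) * (2.0 * p - 1.0)
                   for e, p in zip(expected, probs)]
        change = max(abs(u - e) for u, e in zip(updated, expected))
        expected = updated
        if change < tol:  # convergence: expected labels stopped changing
            break
    return expected, (w, b), iteration
```

In this sketch an unlabeled point with an expected label near zero contributes almost nothing to training, so uncertain points gain influence only as the classifier becomes more confident about them.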

A method for classifying documents according to another embodiment of the present invention includes: receiving at least one labeled seed document having a known confidence level for its label assignment; receiving unlabeled documents; receiving at least one predetermined cost factor; training a transductive classifier through iterative calculation using the at least one predetermined cost factor, the at least one seed document, and the unlabeled documents, where the cost factor is adjusted as a function of an expected label value for each iteration of the calculation; storing confidence scores for the unlabeled documents after at least some of the iterations; and outputting identifiers of the unlabeled documents having the highest confidence scores to at least one of a user, another system, and another process.
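The storing and outputting steps of this method can be illustrated with a small sketch. It assumes, purely for illustration, that the confidence score of a document is the absolute value of its expected label; the text does not define the score, and the function names are hypothetical.

```python
def record_iteration(history, iteration, expected_labels):
    """Store |expected label| per document as that iteration's confidence score.

    Using |expected label| as the confidence is an assumption of this sketch.
    """
    history[iteration] = {doc_id: abs(e) for doc_id, e in expected_labels.items()}
    return history[iteration]


def top_confident(confidences, k=3):
    """Return the k document identifiers with the highest stored confidence."""
    ranked = sorted(confidences.items(), key=lambda item: item[1], reverse=True)
    return [doc_id for doc_id, _ in ranked[:k]]
```

Keeping one dictionary per iteration makes it possible to output the most confident identifiers after every iteration or only after the last one.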

A method for analyzing documents associated with legal discovery according to another embodiment of the present invention includes: receiving documents associated with a legal matter; performing a document classification technique on the documents; and outputting identifiers of at least some of the documents based on their classification.

A method for cleaning up data according to another embodiment of the present invention includes: receiving a plurality of labeled data items; selecting, for each of a plurality of categories, a subset of the data items; setting the uncertainty for the data items in each subset to about zero; setting the uncertainty for the data items not in the subsets to a predetermined value that is not about zero; training a transductive classifier through iterative calculation using the uncertainties, the data items in the subsets, and the data items not in the subsets as training examples; applying the trained classifier to each of the labeled data items to classify each data item; and outputting the classification of the input data items, or a derivative thereof, to at least one of a user, another system, and another process.
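The subset selection and uncertainty assignment in this method can be sketched as follows. This is only an illustration: random selection is one of several possible selection strategies, the residual value `0.3` is an arbitrary placeholder for the predetermined nonzero uncertainty, and both function names are hypothetical.

```python
import random


def select_trusted_subsets(items, per_category, seed=0):
    """Pick a random subset of item ids for each category.

    items is a list of (item_id, category) pairs.
    """
    rng = random.Random(seed)
    by_category = {}
    for item_id, category in items:
        by_category.setdefault(category, []).append(item_id)
    trusted = set()
    for ids in by_category.values():
        trusted.update(rng.sample(ids, min(per_category, len(ids))))
    return trusted


def assign_uncertainty(items, trusted, residual=0.3):
    """Zero uncertainty for trusted items, a fixed nonzero value for the rest."""
    return {item_id: (0.0 if item_id in trusted else residual)
            for item_id, _ in items}
```

The resulting uncertainties would then be handed to the transductive trainer, which treats the trusted items as firmly labeled and is free to relabel the others.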

A method for verifying the association of an invoice with an entity according to another embodiment of the present invention includes: training a classifier based on an invoice format associated with a first entity; accessing a plurality of invoices labeled as being associated with at least one of the first entity and other entities; performing a document classification technique on the invoices using the classifier; and outputting an identifier of at least one of the invoices that has a high probability of not being associated with the first entity.

A method for managing medical records according to another embodiment of the present invention includes: training a classifier based on a medical diagnosis; accessing a plurality of medical records; performing a document classification technique on the medical records using the classifier; and outputting an identifier of at least one of the medical records that has a low probability of being associated with the medical diagnosis.

A method for face recognition according to another embodiment of the present invention includes: receiving at least one labeled seed image of a face, the seed image having a known confidence level; receiving unlabeled images; receiving at least one predetermined cost factor; training a transductive classifier through iterative calculation using the at least one predetermined cost factor, the at least one seed image, and the unlabeled images, where the cost factor is adjusted as a function of an expected label value for each iteration; storing confidence scores for the unlabeled seed images after at least some of the iterations; and outputting identifiers of the unlabeled images having the highest confidence scores to at least one of a user, another system, and another process.

A method for analyzing prior art documents according to another embodiment of the present invention includes: training a classifier based on a search query; accessing a plurality of prior art documents; performing a document classification technique on at least some of the prior art documents using the classifier; and outputting identifiers of at least some of the prior art documents based on their classification.

A method for adapting a patent classification to a shift in document content according to another embodiment of the present invention includes: receiving at least one labeled seed document; receiving unlabeled documents; training a transductive classifier using the at least one seed document and the unlabeled documents; classifying the unlabeled documents having a confidence level above a predetermined threshold into a plurality of existing categories using the classifier; classifying the unlabeled documents having a confidence level below the predetermined threshold into at least one new category using the classifier; reclassifying at least some of the categorized documents into the existing categories and the at least one new category using the classifier; and outputting identifiers of the categorized documents to at least one of a user, another system, and another process.
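The threshold-based split described in this method can be sketched as a simple routing function. The per-category confidence dictionary, the default threshold of `0.8`, and the name `route_documents` are assumptions made for illustration; how the low-confidence pool is turned into one or more new categories is left open here, as it is in the text.

```python
def route_documents(scores, threshold=0.8):
    """Split documents into existing-category assignments and a low-confidence pool.

    scores maps doc_id -> {category: confidence}.
    """
    assigned, new_pool = {}, []
    for doc_id, by_category in scores.items():
        best_category = max(by_category, key=by_category.get)
        if by_category[best_category] > threshold:
            # Confident enough: keep the document in an existing category.
            assigned[doc_id] = best_category
        else:
            # Not confident: candidate for a new category.
            new_pool.append(doc_id)
    return assigned, new_pool
```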

A method for matching documents to claims according to another embodiment of the present invention includes: training a classifier based on at least one claim of a patent or patent application; accessing a plurality of documents; performing a document classification technique on at least some of the documents using the classifier; and outputting identifiers of at least some of the documents based on their classification.

A method for classifying a patent or patent application according to another embodiment of the present invention includes: training a classifier based on a plurality of documents known to be in a particular patent classification; receiving at least a portion of a patent or patent application; performing a document classification technique on the at least a portion of the patent or patent application using the classifier; and outputting a classification of the patent or patent application, where the document classification technique is a yes/no classification technique.

A method for classifying a patent or patent application according to another embodiment of the present invention includes: performing a document classification technique, which is a yes/no classification technique, on at least a portion of a patent or patent application using a classifier trained on at least one document associated with a particular patent classification; and outputting a classification of the patent or patent application.

A method for adapting to a shift in document content according to another embodiment of the present invention includes: receiving at least one labeled seed document; receiving unlabeled documents; receiving at least one predetermined cost factor; training a transductive classifier using the at least one predetermined cost factor, the at least one seed document, and the unlabeled documents; classifying the unlabeled documents having a confidence level above a predetermined threshold into a plurality of categories using the classifier; reclassifying at least some of the categorized documents into the categories using the classifier; and outputting identifiers of the categorized documents to at least one of a user, another system, and another process.

A method for separating documents according to another embodiment of the present invention includes: receiving labeled data; receiving a sequence of unlabeled documents; adapting probabilistic classification rules using transduction based on the labeled data and the unlabeled documents; updating the weights used for document classification according to the probabilistic classification rules; determining the separation positions in the sequence of documents; outputting indicators of the determined separation positions in the sequence to at least one of a user, another system, and another process; and flagging the documents with codes that correlate to the indicators.
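The step of determining separation positions can be illustrated with a short sketch. It assumes that the adapted classification rules yield, for each page in the sequence, a probability that the page starts a new document; that per-page representation, the `0.5` threshold, and the function names are assumptions of this example and are not stated in the text.

```python
def separation_positions(start_probabilities, threshold=0.5):
    """Return page indices at which a new document is judged to begin."""
    positions = []
    for index, p in enumerate(start_probabilities):
        # The first page always opens a document; later pages need p > threshold.
        if index == 0 or p > threshold:
            positions.append(index)
    return positions


def split_pages(pages, positions):
    """Group a page sequence into documents using the separation positions."""
    bounds = list(positions) + [len(pages)]
    return [pages[a:b] for a, b in zip(bounds, bounds[1:])]
```

The returned positions play the role of the indicators that are output and used to flag the documents.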

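A method for document searching according to another embodiment of the present invention includes: receiving a search query; retrieving documents based on the search query; outputting the documents; receiving, for at least one of the documents, a user-entered label indicating the relevance of the document to the search query; training a classifier based on the search query and the user-entered labels; performing a document classification technique on the documents using the classifier in order to reclassify the documents; and outputting identifiers of at least some of the documents based on their classification.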
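The present invention also provides, for example, the following items.
(Item 1)
A method for classifying data in a computer-based system, the method comprising:
receiving labeled data points, each of the labeled data points having at least one label indicating whether the data point is a training example for data points to be included in a designated category or a training example for data points to be excluded from the designated category;
receiving unlabeled data points;
receiving at least one predetermined cost factor for the labeled data points and unlabeled data points;
training a transductive classifier using maximum entropy discrimination (MED) through iterative calculation using the at least one cost factor and the labeled data points and the unlabeled data points as training examples, wherein for each iteration of the calculation the cost factor of the unlabeled data points is adjusted as a function of an expected label value and the prior probability of a data point label is adjusted according to an estimate of the data point's class membership probability;
applying the trained classifier to classify at least one of the unlabeled data points, the labeled data points, and input data points; and
outputting the classification of the classified data points, or a derivative thereof, to at least one of a user, another system, and another process.
(Item 2)
The method of item 1, wherein the function is the absolute value of the expected label of a data point.
(Item 3)
The method of item 1, further comprising receiving prior probability information for the labeled and unlabeled data points.
(Item 4)
The method of item 3, wherein the transductive classifier learns using the prior probability information of the labeled and unlabeled data.
(Item 5)
The method of item 1, comprising the further step of using the labeled data and the unlabeled data as training examples according to their expected labels to determine the decision function having minimum KL divergence, using a Gaussian prior for the decision function parameters given the included training examples and excluded training examples.
(Item 6)
The method of item 1, comprising the further step of determining the decision function having minimum KL divergence using a multinomial prior for the decision function parameters.
(Item 7)
The method of item 1, wherein the iterative step of training the transductive classifier is repeated until convergence of data values is reached.
(Item 8)
The method of item 7, wherein convergence is reached when the change in the decision function of the transductive classifier falls below a predetermined threshold.
(Item 9)
The method of item 7, wherein convergence is reached when the change in the determined expected label values falls below a predetermined threshold.
(Item 10)
The method of item 1, wherein the labels of the included training examples have a value of +1 and the labels of the excluded training examples have a value of -1.
(Item 11)
The method of item 1, wherein the labels of the included examples are mapped to a first numeric value and the labels of the excluded examples are mapped to a second numeric value.
(Item 12)
The method of item 1, further comprising:
storing the labeled data points in a memory of a computer;
storing the unlabeled data points in a memory of a computer;
storing the input data points in a memory of a computer; and
storing the at least one predetermined cost factor for the labeled data points and unlabeled data points in a memory of a computer.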
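(Item 13)
A method for classifying data, comprising providing computer-executable program code to be deployed to and executed on a computer system,
the program code comprising instructions for:
accessing labeled data points stored in a memory of a computer, each of the labeled data points having at least one label indicating whether the data point is a training example for data points to be included in a designated category or a training example for data points to be excluded from the designated category;
accessing unlabeled data points from a memory of a computer;
accessing at least one predetermined cost factor for the labeled data points and unlabeled data points from a memory of a computer;
training a maximum entropy discrimination (MED) transductive classifier through iterative calculation using the at least one stored cost factor and the stored labeled data points and stored unlabeled data points, wherein for each iteration of the calculation the cost factor of the unlabeled data points is adjusted as a function of an expected label value and the prior probability of a data point label is adjusted according to an estimate of the data point's class membership probability;
applying the trained classifier to classify at least one of the unlabeled data points, the labeled data points, and input data points; and
outputting the classification of the classified data points, or a derivative thereof, to at least one of a user, another system, and another process.
(Item 14)
The method of item 13, wherein the function is the absolute value of the expected label of a data point.
(Item 15)
The method of item 13, further comprising accessing prior probability information for the labeled and unlabeled data points stored in a memory of a computer.
(Item 16)
The method of item 15, wherein for each iteration the prior probability information is adjusted according to an estimate of a data point's class membership probability.
(Item 17)
The method of item 13, further comprising instructions for using the labeled and unlabeled data as training examples according to their expected labels to determine the decision function having minimum KL divergence to a prior distribution of the decision function parameters given the included training examples and excluded training examples.
(Item 18)
The method of item 13, wherein the iterative step of training the transductive classifier is repeated until convergence of data values is reached.
(Item 19)
The method of item 18, wherein convergence is reached when the change in the decision function of the transductive classification falls below a predetermined threshold.
(Item 20)
The method of item 18, wherein convergence is reached when the change in the determined expected label values falls below a predetermined threshold.
(Item 21)
The method of item 13, wherein the labels of the included training examples have a value of +1 and the labels of the excluded training examples have a value of -1.
(Item 22)
The method of item 13, wherein the labels of the included examples are mapped to a first numeric value and the labels of the excluded examples are mapped to a second numeric value.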
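(Item 23)
A data processing apparatus, comprising:
at least one memory for storing (i) labeled data points, each of the labeled data points having at least one label indicating whether the data point is a training example for data points to be included in a designated category or a training example for data points to be excluded from the designated category, (ii) unlabeled data points, and (iii) at least one predetermined cost factor for the labeled and unlabeled data points; and
a transductive classifier trainer for iteratively teaching a transductive classifier using transductive maximum entropy discrimination (MED), using the at least one stored cost factor and the stored labeled data points and stored unlabeled data points as training examples, wherein at each iteration of the MED calculation the cost factor of the unlabeled data points is adjusted as a function of an expected label value and the prior probability of a data point label is adjusted according to an estimate of the data point's class membership probability,
wherein a classifier trained by the transductive classifier trainer is used to classify at least one of the unlabeled data points, the labeled data points, and input data points, and
wherein the classification of the classified data points, or a derivative thereof, is output to at least one of a user, another system, and another process.
(Item 24)
The apparatus of item 23, wherein the function is the absolute value of the expected label of a data point.
(Item 25)
The apparatus of item 23, wherein the memory also stores prior probability information for the labeled and unlabeled data points.
(Item 26)
The apparatus of item 25, wherein at each iteration of the MED calculation the prior probability information is adjusted according to an estimate of a data point's class membership probability.
(Item 27)
The apparatus of item 23, further comprising a processor for using the labeled and unlabeled data as training examples according to their expected labels to determine the decision function having minimum KL divergence to a prior distribution of the decision function given the included training examples and excluded training examples.
(Item 28)
The apparatus of item 23, further comprising means for determining convergence of data values and terminating the calculation upon determining convergence.
(Item 29)
The apparatus of item 28, wherein convergence is reached when the change in the decision function of the transductive classifier calculation falls below a predetermined threshold.
(Item 30)
The apparatus of item 28, wherein convergence is reached when the change in the determined expected label values falls below a predetermined threshold.
(Item 31)
The apparatus of item 23, wherein the labels of the included training examples have a value of +1 and the labels of the excluded training examples have a value of -1.
(Item 32)
The apparatus of item 23, wherein the labels of the included examples are mapped to a first numeric value and the labels of the excluded examples are mapped to a second numeric value.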
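(Item 33)
An article of manufacture comprising a computer-readable program storage medium, the medium tangibly embodying one or more programs of computer-executable instructions for performing a method of data classification, the method comprising:
receiving labeled data points, each of the labeled data points having at least one label indicating whether the data point is a training example for data points to be included in a designated category or a training example for data points to be excluded from the designated category;
receiving unlabeled data points;
receiving at least one predetermined cost factor for the labeled data points and unlabeled data points;
training a transductive classifier by iterative maximum entropy discrimination (MED) calculation using the at least one stored cost factor and the stored labeled data points and stored unlabeled data points as training examples, wherein at each iteration of the MED calculation the cost factor of the unlabeled data points is adjusted as a function of an expected label value and the prior probability of a data point is adjusted according to an estimate of the data point's class membership probability;
applying the trained classifier to classify at least one of the unlabeled data points, the labeled data points, and input data points; and
outputting the classification of the classified data points, or a derivative thereof, to at least one of a user, another system, and another process.
(Item 34)
The article of manufacture of item 33, wherein the function is the absolute value of the expected label of a data point.
(Item 35)
The article of manufacture of item 33, wherein the method further comprises storing prior probability information for the labeled and unlabeled data points in a memory of a computer.
(Item 36)
The article of manufacture of item 35, wherein at each iteration of the MED calculation the prior probability information is adjusted according to an estimate of a data point's class membership probability.
(Item 37)
The article of manufacture of item 33, wherein the method comprises the further step of using the labeled and unlabeled data as training examples according to their expected labels to determine the decision function having minimum KL divergence to a prior distribution of the decision function parameters given the included training examples and excluded training examples.
(Item 38)
The article of manufacture of item 33, wherein the iterative step of training the transductive classifier is repeated until convergence of data values is reached.
(Item 39)
The article of manufacture of item 38, wherein convergence is reached when the change in the decision function of the transductive classification falls below a predetermined threshold.
(Item 40)
The article of manufacture of item 38, wherein convergence is reached when the change in the determined expected label values falls below a predetermined threshold.
(Item 41)
The article of manufacture of item 33, wherein the labels of the included training examples have a value of +1 and the labels of the excluded training examples have a value of -1.
(Item 42)
The article of manufacture of item 33, wherein the labels of the included examples are mapped to a first numeric value and the labels of the excluded examples are mapped to a second numeric value.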
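(Item 43)
A method for classifying unlabeled data in a computer-based system, the method comprising:
receiving labeled data points, each of the labeled data points having at least one label indicating whether the data point is a training example for data points to be included in a designated category or a training example for data points to be excluded from the designated category;
receiving labeled and unlabeled data points;
receiving prior label probability information for the labeled data points and unlabeled data points;
receiving at least one predetermined cost factor for the labeled data points and unlabeled data points;
determining an expected label for each labeled and unlabeled data point according to the prior probability of the data point's label;
repeating the following sub-steps until the data values substantially converge:
・generating a scaled cost value for each unlabeled data point in proportion to the absolute value of the data point's expected label;
・training a classifier by using the labeled and unlabeled data as training examples according to their expected labels and computing the decision function that minimizes the KL divergence to a prior probability distribution of the decision function parameters given the included training examples and excluded training examples;
・determining classification scores for the labeled and unlabeled data points using the trained classifier;
・calibrating the output of the trained classifier to class membership probabilities;
・updating the prior probabilities of the labels of the unlabeled data points according to the determined class membership probabilities;
・determining the probability distributions of the labels and margins using maximum entropy discrimination (MED), with the updated label prior probabilities and the previously determined classification scores;
・computing new expected labels using the previously determined label probability distributions;
・updating the expected label for each data point by interpolating the new expected label with the expected label from the previous iteration; and
outputting the classification of the input data points, or a derivative thereof, to at least one of a user, another system, and another process.
(Item 44)
The method of item 43, wherein convergence is reached when the change in the decision function falls below a predetermined threshold.
(Item 45)
The method of item 43, wherein convergence is reached when the change in the determined expected label values falls below a predetermined threshold.
(Item 46)
The method of item 43, wherein the labels of the included training examples have a value of +1 and the labels of the excluded training examples have a value of -1.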
（項目４７）
ラベル割り当てに関する既知の信頼水準を有する、少なくとも１つのラベル付きシード文書を受信するステップと、
ラベルなし文書を受信するステップと、
少なくとも１つの所定コスト要因を受信するステップと、
該少なくとも１つの所定コスト要因、該少なくとも１つのシード文書、および該ラベルなし文書を用いて、繰り返し計算によってトランスダクティブ分類器を訓練するステップであって、該計算の各繰り返しに対して、該コスト要因が期待ラベル値の関数として調整される、ステップと、
少なくとも一部の該繰り返しの後に、該ラベルなし文書に対する信頼スコアを格納するステップと、
最も高い信頼スコアを有する該ラベルなし文書の識別子を、ユーザ、別のシステム、および別のプロセスの少なくとも１つに出力するステップと、
を包含する、文書を分類する方法。
（項目４８）
前記少なくとも１つのシード文書は、キーワードのリストを有する、項目４７に記載の方法。
（項目４９）
信頼スコアが、前記各繰り返しの後に格納され、各繰り返しの後に、前記最も高い信頼スコアを有する前記ラベルなし文書の識別子が出力される、項目４７に記載の方法。
（項目５０）
前記ラベル付きとラベルなし文書に対するデータ点のラベルの事前確率を受信するステップ、をさらに包含する、項目４７に記載の方法であって、前記計算の各繰り返しに対して、該データ点のラベルの事前確率がデータ点のクラス帰属確率の推定値に従って調整される、方法。
（項目５１）
法的事項と関連する文書を受信するステップと、
該文書に関して文書分類手法を実行するステップと、
該文書の分類に基づいて、該文書の少なくとも一部の識別子を出力するステップと、
を包含する、法的開示手続と関連する文書を分析する方法。
（項目５２）
前記文書分類手法は、トランスダクティブ処理を含む、項目５１に記載の方法。
（項目５３）
少なくとも１つの所定コスト要因、少なくとも１つのシード文書、および法的事項と関連する文書を用いて、繰り返し計算によってトランスダクティブ分類器を訓練するステップであって、該計算の各繰り返しに対して、該コスト要因が期待ラベル値の関数として調整される、ステップをさらに包含する、項目５２に記載の方法。
（項目５４）
前記ラベル付きとラベルなし文書に対するデータ点のラベルの事前確率を受信するステップをさらに包含する、項目５３に記載の方法であって、前記計算の各繰り返しに対して、該データ点のラベルの事前確率がデータ点のクラス帰属確率の推定値に応じて調整される、方法。
（項目５５）
前記文書分類手法は、サポートベクタマシン処理を含む、項目５１に記載の方法。
（項目５６）
前記文書分類手法は、最大エントロピー識別処理を含む、項目５１に記載の方法。
（項目５７）
前記文書間のリンクを表すものを出力するステップをさらに包含する、項目５１に記載の方法。
（項目５８）
複数のラベル付きデータ項目を受信するステップと、
複数のカテゴリの各々に対する該データ項目のサブセットを選択するステップと、
各サブセット内の該データ項目に対する不確実性を、ほぼゼロに設定するステップと、
該サブセット内に存在しない該データ項目に対する不確実性を、ほぼゼロではない所定値に設定するステップと、
該不確実性、該サブセット内のデータ項目、および該サブセット内に存在しない該データ項目を訓練例として用いて、繰り返し計算によってトランスダクティブ分類器を訓練するステップと、
該データ項目の各々を分類するために、該訓練された分類器を該ラベル付きデータ項目の各々に適用するステップと、
該入力データ項目の分類、またはその派生物を、ユーザ、別のシステム、および別のプロセスの少なくとも１つに出力するステップと、
を包含する、データを整理する方法。
（項目５９）
前記サブセットは、無作為に選択される、項目５８に記載の方法。
（項目６０）
前記サブセットは、ユーザによって選択および検証される、項目５８に記載の方法。
（項目６１）
前記分類に基づいて、少なくとも一部の前記データ項目の前記ラベルを変更するステップ、をさらに包含する、項目５８に記載の方法。
（項目６２）
データ項目の分類後に、所定の閾値を下回る信頼水準を有するデータ項目の識別子がユーザに出力される、項目５８に記載の方法。
（項目６３）
第１の実体と関連するインボイスの形式に基づいて分類器を訓練するステップと、
該第１の実体および他の実体のうちの少なくとも１つと関連する旨のラベルが付けられた複数のインボイスにアクセスするステップと、
該分類器を用いて、該インボイスに関して文書分類手法を実行するステップと、
該第１の実体と関連していない高い確率を有する該インボイスのうちの少なくとも１つの識別子を出力するステップと、
を包含する、インボイスと実体との関連性を検証する方法。
（項目６４）
前記文書分類手法は、トランスダクティブ処理を含む、項目６３に記載の方法。
（項目６５）
前記分類器はトランスダクティブ分類器である、項目６４に記載の方法であって、該方法は、少なくとも１つの所定コスト要因、少なくとも１つのシード文書、および前記インボイスを用いて、繰り返し計算によって該トランスダクティブ分類器を訓練するステップであって、該計算の各繰り返しに対して、該コスト要因が期待ラベル値の関数として調整される、ステップと、該訓練された分類器を用いて該インボイスを分類するステップと、を包含する、方法。
（項目６６）
前記シード文書およびインボイスに対するデータ点のラベルの事前確率を受信するステップをさらに包含する、項目６５に記載の方法であって、前記計算の各繰り返しに対して、該データ点のラベルの事前確率がデータ点のクラス帰属確率の推定値に従って調整される、方法。
（項目６７）
前記文書分類手法は、サポートベクタマシン処理を含む、項目６３に記載の方法。
（項目６８）
前記文書分類手法は、最大エントロピー識別処理を含む、項目６３に記載の方法。
（項目６９）
医学的診断に基づいて分類器を訓練するステップと、
複数の医療記録にアクセスするステップと、
該分類器を用いて、該医療記録に関して文書分類手法を実行するステップと、
該医学的診断と関連する低い確率を有する該医療記録のうちの少なくとも１つの識別子を出力するステップと、
を包含する、医療記録を管理する方法。
（項目７０）
前記文書分類手法は、トランスダクティブ処理を含む、項目６９に記載の方法。
（項目７１）
前記分類器はトランスダクティブ分類器である、項目７０に記載の方法であって、少なくとも１つの所定コスト要因、少なくとも１つのシード文書、および前記医療記録を用いて、繰り返し計算によって該トランスダクティブ分類器を訓練するステップであって、該計算の各繰り返しに対して、該コスト要因が期待ラベル値の関数として調整される、ステップと、該医療記録を分類するために該訓練された分類器を使用するステップと、をさらに包含する、方法。
（項目７２）
前記シード文書および医療記録に対するデータ点のラベルの事前確率を受信するステップをさらに包含する、項目７１に記載の方法であって、前記計算の各繰り返しに対して、該データ点のラベルの事前確率がデータ点のクラス帰属確率の推定値に従って調整される、方法。
（項目７３）
前記文書分類手法は、サポートベクタマシン処理を含む、項目６９に記載の方法。
（項目７４）
前記文書分類手法は、最大エントロピー識別処理を含む、項目６９に記載の方法。
（項目７５）
既知の信頼水準を有する、少なくとも１つの顔のラベル付きシード画像を受信するステップと、
ラベルなし画像を受信するステップと、
少なくとも１つの所定コスト要因を受信するステップと、
該少なくとも１つの所定コスト要因、該少なくとも１つのシード画像、および該ラベルなし画像を用いて、繰り返し計算によってトランスダクティブ分類器を訓練するステップであって、該計算の各繰り返しに対して、該コスト要因が期待ラベル値の関数として調整される、ステップと、
少なくとも一部の該繰り返しの後に、ラベルなしシード画像に対する信頼スコアを格納するステップと、
最も高い信頼スコアを有する該ラベルなし画像の識別子を、ユーザ、別のシステム、および別のプロセスの少なくとも１つに出力するステップと、
を包含する、顔認識方法。
（項目７６）
前記少なくとも１つのシード画像は、該画像が指定されたカテゴリに含まれているか否かを示すラベルを有する、項目７５に記載の方法。
（項目７７）
信頼スコアが、各前記繰り返しの後に格納され、各繰り返しの後に前記最も高い信頼スコアを有する前記ラベルなし画像の識別子が出力される、項目７５に記載の方法。
（項目７８）
前記ラベル付きとラベルなし画像に対するデータ点のラベルの事前確率を受信するステップ、をさらに包含する、項目７５に記載の方法であって、前記計算の各繰り返しに対して、該データ点のラベルの事前確率がデータ点のクラス帰属確率の推定値に従って調整される、方法。
（項目７９）
顔の第３のラベルなし画像を受信するステップと、該第３のラベルなし画像を前記最も高い信頼スコアを有する前記画像の少なくとも一部と比較するステップと、該第３のラベルなし画像の顔の信頼度が前記シード画像の前記顔と同一である場合には、該第３のラベルなし画像の識別子を出力するステップと、をさらに包含する、項目７５に記載の方法。
（項目８０）
検索クエリに基づいて分類器を訓練するステップと、
複数の従来技術文書にアクセスするステップと、
該分類器を用いて、該従来技術文書に関して文書分類手法を実行するステップと、
該従来技術文書の分類に基づいて、該従来技術文書の少なくとも一部の識別子を出力するステップと、
を包含する、従来技術文書を分析する方法。
（項目８１）
前記文書分類手法は、トランスダクティブ処理を含む、項目８０に記載の方法。
（項目８２）
前記分類器はトランスダクティブ分類器である、項目８１に記載の方法であって、少なくとも１つの所定コスト要因、少なくとも１つのシード文書、および前記従来技術文書を用いて、繰り返し計算によって該トランスダクティブ分類器を訓練するステップであって、該計算の各繰り返しに対して、該コスト要因が期待ラベル値の関数として調整される、ステップと、該従来技術文書を分類するために該訓練された分類器を用いるステップと、を包含する、方法。
（項目８３）
前記シード文書および従来技術文書に対するデータ点のラベルの事前確率を受信するステップをさらに包含する、項目８２に記載の方法であって、前記計算の各繰り返しに対して、データ点のラベルの事前確率がデータ点のクラス帰属確率の推定値に従って調整される、方法。
（項目８４）
前記検索クエリは、特許開示情報の少なくとも一部を含む、項目８０に記載の方法。
（項目８５）
前記検索クエリは、特許文書または特許出願書類から取り出された項目の少なくとも一部を含む、項目８０に記載の方法。
（項目８６）
前記検索クエリは、特許文書または特許出願書類の要約書の少なくとも一部を含む、項目８０に記載の方法。
（項目８７）
前記検索クエリは、特許文書または特許出願書類から取り出された概要の少なくとも一部を含む、項目８０に記載の方法。
（項目８８）
前記文書分類手法は、サポートベクタマシン処理を含む、項目８０に記載の方法。
（項目８９）
前記文書分類手法は、最大エントロピー識別処理を含む、項目８０に記載の方法。
（項目９０）
前記従来技術文書は、特許庁の公開文書である、項目８０に記載の方法。
（項目９１）
前記文書間のリンクを表すものを出力するステップ、をさらに包含する、項目８０に記載の方法。
（項目９２）
前記従来技術文書の分類に基づいて、該従来技術文書の少なくとも一部の関連性スコアを出力するステップをさらに包含する、項目８０に記載の方法。
（項目９３）
少なくとも１つのラベル付きシード文書を受信するステップと、
ラベルなし文書を受信するステップと、
該少なくとも１つのシード文書および該ラベルなし文書を用いてトランスダクティブ分類器を訓練するステップと、
該分類器を用いて、所定の閾値を上回る信頼水準を有する該ラベルなし文書を複数の既存のカテゴリに分類するステップと、
該分類器を用いて、所定の閾値を下回る信頼水準を有する該ラベルなし文書を少なくとも１つの新たなカテゴリに分類するステップと、
該分類器を用いて、該カテゴライズされた文書の少なくとも一部を、該既存のカテゴリおよび該少なくとも１つの新たなカテゴリに再分類するステップと、
該カテゴライズされた文書の識別子を、ユーザ、別のシステム、および別のプロセスの少なくとも１つに出力するステップと、
を包含する、文書内容のシフトに特許分類を順応させる方法。
（項目９４）
前記分類器はトランスダクティブ分類器である、項目９３に記載の方法であって、少なくとも１つの所定コスト要因、検索クエリ、および前記文書を用いて、繰り返し計算によって該トランスダクティブ分類器を訓練するステップであって、該計算の各繰り返しに対して、該コスト要因が期待ラベル値の関数として調整される、ステップと、該文書を分類するために該訓練された分類器を用いるステップと、をさらに包含する、方法。
（項目９５）
前記検索クエリおよび文書に対するデータ点のラベルの事前確率を受信するステップをさらに包含する、項目９４に記載の方法であって、前記計算の各繰り返しに対して、該データ点のラベルの事前確率がデータ点のクラス帰属確率の推定値に従って調整される、方法。
（項目９６）
前記文書分類手法は、サポートベクタマシン処理を含む、項目９３に記載の方法。
（項目９７）
前記文書分類手法は、最大エントロピー識別処理を含む、項目９３に記載の方法。
（項目９８）
前記ラベルなし文書は特許出願書類である、項目９３に記載の方法。
（項目９９）
前記少なくとも１つのシード文書は、特許文書および特許出願書類からなる群から選択される、項目９３に記載の方法。
（項目１００）
特許文書または特許出願書類の少なくとも１つの項目に基づいて分類器を訓練するステップと、
複数の文書にアクセスするステップと、
該分類器を用いて、該文書の少なくとも一部に関して文書分類手法を実行するステップと、
該文書の分類に基づいて、該文書の少なくとも一部の識別子を出力するステップと、
を包含する、文書を項目にマッチングする方法。
（項目１０１）
前記文書の前記分類に基づいて、該文書の少なくとも一部の関連性スコアを出力するステップ、をさらに包含する、項目１００に記載の方法。
（項目１０２）
前記文書は従来技術文書である、項目１００に記載の方法。
（項目１０３）
前記文書は製品について記載する、項目１００に記載の方法。
（項目１０４）
特定の特許分類に存在することが知られている複数の文書に基づいて、分類器を訓練するステップと、
特許文書または特許出願書類の少なくとも一部を受信するステップと、
該分類器を用いて、該特許文書または特許出願書類の該少なくとも一部に関して、文書分類手法を実行するステップと、
該特許文書または特許出願書類の分類を出力するステップと、
を包含する、特許文書または特許出願書類を分類する方法であって、
該文書分類手法は、はい／いいえ式分類手法である、方法。
（項目１０５）
前記文書は、特許文書および特許出願書類からなる群から選択される、項目１０４に記載の方法。
（項目１０６）
前記特許文書または特許出願書類の前記少なくとも一部は、特許文書または特許出願書類から取り出された項目の少なくとも一部を含む、項目１０５に記載の方法。
（項目１０７）
前記特許文書または特許出願書類の前記少なくとも一部は、特許文書または特許出願書類の要約書の少なくとも一部を含む、項目１０５に記載の方法。
（項目１０８）
前記特許文書または特許出願書類の前記少なくとも一部は、特許文書または特許出願書類から取り出された概要の少なくとも一部を含む、項目１０５に記載の方法。
（項目１０９）
特定の特許分類と関連する少なくとも１つの文書に基づいて訓練された分類器を用いて、特許文書または特許出願書類の少なくとも部分に関して文書分類手法を実行するステップであって、該文書分類手法は、はい／いいえ式分類手法である、ステップと、
該特許文書または特許出願書類の分類を出力するステップと、
を包含する、特許文書または特許出願書類を分類する方法。
（項目１１０）
第２の特許分類に存在することが知られている複数の文書に基づいて訓練された異なる分類器を用いて、前記方法を反復するステップをさらに包含する、項目１０９に記載の方法。
（項目１１１）
前記特許文書または特許出願書類の前記少なくとも一部は、特許文書または特許出願書類から取り出された項目の少なくとも一部を含む、項目１０９に記載の方法。
（項目１１２）
前記特許文書または特許出願書類の前記少なくとも一部は、特許文書または特許出願書類の要約書の少なくとも一部を含む、項目１０９に記載の方法。
（項目１１３）
前記特許文書または特許出願書類の前記少なくとも一部は、特許文書または特許出願書類から取り出された概要の少なくとも一部を含む、項目１０９に記載の方法。
（項目１１４）
少なくとも１つのラベル付きシード文書を受信するステップと、
ラベルなし文書を受信するステップと、
少なくとも１つの所定コスト要因を受信するステップと、
該少なくとも１つの所定コスト要因、該少なくとも１つのシード文書、および該ラベルなし文書を用いて、トランスダクティブ分類器を訓練するステップと、
該分類器を用いて、所定の閾値を上回る信頼水準を有する該ラベルなし文書を複数のカテゴリに分類するステップと、
該カテゴライズされた文書の識別子を、ユーザ、別のシステム、および別のプロセスの少なくとも１つに出力するステップと、
を包含する、文書内容のシフトに順応する方法。
（項目１１５）
前記所定の閾値を下回る信頼水準を有するラベルなし文書を、１つ以上の新たなカテゴリに移すステップ、をさらに包含する、項目１１４に記載の方法。
（項目１１６）
少なくとも１つの所定コスト要因、前記少なくとも１つのシード文書、および前記ラベルなし文書を用いて、繰り返し計算によって前記トランスダクティブ分類器を訓練するステップであって、該計算の各繰り返しに対して、該コスト要因が期待ラベル値の関数として調整される、ステップと、該ラベルなし文書を分類するために該訓練された分類器を用いるステップと、をさらに包含する、項目１１４に記載の方法。
（項目１１７）
前記シード文書およびラベルなし文書に対するデータ点のラベルの事前確率を受信するステップ、をさらに包含する、項目１１６に記載の方法であって、前記計算の各繰り返しに対して、該データ点のラベルの事前確率がデータ点のクラス帰属確率の推定値に従って調整される、方法。
（項目１１８）
前記ラベルなし文書は顧客の苦情である、項目１１４に記載の方法であって、製品の変更を顧客の苦情とリンクするステップをさらに包含する、方法。
（項目１１９）
前記ラベルなし文書はインボイスである、項目１１４に記載の方法。
（項目１２０）
ラベル付きデータを受信するステップと、
ラベルなし文書の連なりを受信するステップと、
該ラベル付きデータおよび該ラベルなし文書に基づいて、トランスダクションを用いて確率的分類規則を順応させるステップと、
該確率的分類規則に従って、文書分離に用いられる重みを更新するステップと、
該文書の連なりにおける分離位置を決定するステップと、
該連なりにおける該決定された該分離位置の標識を、ユーザ、別のシステム、および別のプロセスの少なくとも１つに出力するステップと、
該標識と相関するコードのフラグを、該文書に立てるステップと、
を包含する、文書を分離する方法。
（項目１２１）
検索クエリを受信するステップと、
該検索クエリに基づいて文書を取り出すステップと、
該文書を出力するステップと、
該文書の少なくとも一部に対するユーザ入力ラベルを受信するステップであって、該ラベルは、該文書の該検索クエリとの関連性を示す、ステップと、
該検索クエリおよび該ユーザ入力ラベルに基づいて、分類器を訓練するステップと、
該文書を再分類するために、該分類器を用いて該文書に関して文書分類手法を実行するステップと、
該文書の該分類に基づいて、該文書の少なくとも一部の識別子を出力するステップと、
を包含する、文書検索の方法。
（項目１２２）
前記文書分類手法は、トランスダクティブ処理を含む、項目１２１に記載の方法。
（項目１２３）
前記分類器はトランスダクティブ分類器である、項目１２２に記載の方法であって、少なくとも１つの所定コスト要因、前記検索クエリ、および前記文書を用いて、繰り返し計算によって該トランスダクティブ分類器を訓練するステップであって、該計算の各繰り返しに対して、該コスト要因が期待ラベル値の関数として調整される、ステップと、該文書を分類するために該訓練された分類器を用いるステップと、をさらに包含する、方法。
（項目１２４）
前記検索クエリおよび文書に対するデータ点ラベルの事前確率を受信するステップ、をさらに包含する、項目１２３に記載の方法であって、前記計算の各繰り返しに対して、該データ点のラベルの事前確率がデータ点のクラス帰属確率の推定値に従って調整される、方法。
（項目１２５）
前記文書分類手法は、サポートベクタマシン処理を含む、項目１２１に記載の方法。
（項目１２６）
前記文書分類手法は、最大エントロピー識別処理を含む、項目１２１に記載の方法。
（項目１２７）
前記再分類された文書は出力され、最も高い信頼度を有する文書が最初に出力される、項目１２１に記載の方法。
A document search method according to another embodiment of the present invention includes: receiving a search query; retrieving documents based on the search query; outputting the documents; receiving user-input labels for at least some of the documents, the labels indicating the relevance of the documents to the search query; training a classifier based on the search query and the user-input labels; performing a document classification technique on the documents using the classifier to reclassify the documents; and outputting identifiers of at least some of the documents based on the classification of the documents.
The present invention also provides the following items, for example.
(Item 1)
A method of data classification in a computer-based system, comprising:
Receiving labeled data points, each of the labeled data points having at least one label indicating whether the data point is a training example for data points to be included in a specified category or a training example for data points to be excluded from the specified category;
Receiving unlabeled data points;
Receiving at least one predetermined cost factor for the labeled and unlabeled data points;
Training a transductive classifier using maximum entropy discrimination (MED) by iterative computation, using the at least one cost factor and the labeled and unlabeled data points as training examples, wherein for each iteration of the computation, the cost factor of the unlabeled data points is adjusted as a function of an expected label value, and the prior probability of a data point label is adjusted according to an estimate of the class membership probability of the data point;
Applying the trained classifier to classify at least one of the unlabeled data points, the labeled data points, and input data points; and
Outputting the classification of the classified data points, or a derivative thereof, to at least one of a user, another system, and another process.
(Item 2)
Item 2. The method of item 1, wherein the function is an absolute value of the expected label of a data point.
(Item 3)
The method of item 1, further comprising receiving prior probability information for labeled and unlabeled data points.
(Item 4)
4. The method of item 3, wherein the transductive classifier learns using prior probability information of the labeled and unlabeled data.
(Item 5)
The method of item 1, further comprising the step of determining, using the labeled data and the unlabeled data as learning examples according to their expected labels, a decision function having a minimum KL divergence to a Gaussian prior distribution of the decision function parameters, given the included training examples and the excluded training examples.
(Item 6)
The method of item 1, comprising the further step of determining a decision function having a minimum KL divergence using a polynomial prior for the decision function parameters.
(Item 7)
The method of item 1, wherein the iterative step of training a transductive classifier is repeated until convergence of data values is reached.
(Item 8)
8. The method of item 7, wherein convergence is reached when a change in the decision function of the transductive classifier falls below a predetermined threshold.
(Item 9)
8. The method of item 7, wherein convergence is reached when a change in the determined expected label values falls below a predetermined threshold.
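The two convergence criteria of items 8 and 9 can be illustrated with one small helper. This is a minimal sketch, not the claimed implementation: the function name `has_converged`, the use of the largest absolute change as the measure of "change", and the default threshold are all assumptions made for illustration. The same check applies to decision function parameters (item 8) or to expected label values (item 9).

```python
def has_converged(previous, current, threshold=1e-3):
    """Return True when no value moved by `threshold` or more.

    `previous` and `current` are equal-length sequences of floats, for
    example decision function parameters or expected label values from two
    consecutive iterations. Using the maximum absolute change is an
    assumption of this sketch; the items only require that "a change"
    falls below a predetermined threshold.
    """
    largest_change = max(
        (abs(c - p) for p, c in zip(previous, current)), default=0.0
    )
    return largest_change < threshold
```

A training loop would call this helper once per iteration and stop as soon as it returns True.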
(Item 10)
The method of item 1, wherein the labels of the included training examples have a value of +1 and the labels of the excluded training examples have a value of -1.
(Item 11)
2. The method of item 1, wherein the labels of the included examples are mapped to a first number and the labels of the excluded examples are mapped to a second number.
(Item 12)
The method according to item 1, further comprising:
Storing the labeled data points in a memory of a computer;
Storing the unlabeled data points in a memory of a computer;
Storing the input data points in a memory of a computer; and
Storing at least one predetermined cost factor of the labeled and unlabeled data points in a memory of a computer.
(Item 13)
A method of data classification comprising providing computer-executable program code to be deployed to and executed on a computer system, the program code comprising instructions for:
Accessing labeled data points stored in a memory of a computer, each labeled data point having at least one label indicating whether the data point is a training example for data points to be included in a specified category or a training example for data points to be excluded from the specified category;
Accessing unlabeled data points from the memory of the computer;
Accessing at least one predetermined cost factor of the labeled and unlabeled data points from the memory of the computer;
Training a maximum entropy discrimination (MED) transductive classifier by iterative computation using the at least one stored cost factor and the stored labeled and unlabeled data points, wherein for each iteration of the computation, the cost factor of the unlabeled data points is adjusted as a function of an expected label value, and the prior probability of a data point label is adjusted according to an estimate of the class membership probability of the data point;
Applying the trained classifier to classify at least one of the unlabeled data points, the labeled data points, and input data points; and
Outputting the classification of the classified data points, or a derivative thereof, to at least one of a user, another system, and another process.
(Item 14)
14. The method of item 13, wherein the function is the absolute value of the expected label of a data point.
(Item 15)
14. The method of item 13, further comprising accessing prior probability information of labeled and unlabeled data points stored in a computer memory.
(Item 16)
16. The method of item 15, wherein for each iteration, the prior probability information is adjusted according to an estimate of a data point class membership probability.
(Item 17)
The method of item 13, further comprising instructions for determining, using the labeled and unlabeled data as learning examples according to their expected labels, a decision function having a minimum KL divergence to a prior distribution of the decision function parameters, given the included training examples and the excluded training examples.
(Item 18)
14. The method of item 13, wherein the iterative step of training a transductive classifier is repeated until convergence of data values is reached.
(Item 19)
Item 19. The method of item 18, wherein convergence is reached when the change in the decision function of the transductive classification falls below a predetermined threshold.
(Item 20)
19. The method of item 18, wherein convergence is reached when a change in the determined expected label values falls below a predetermined threshold.
(Item 21)
14. The method of item 13, wherein the labels of the included training examples have a value of +1 and the labels of the excluded training examples have a value of -1.
(Item 22)
14. The method of item 13, wherein the labels of the included examples are mapped to a first number and the labels of the excluded examples are mapped to a second number.
(Item 23)
A data processing apparatus, comprising:
At least one memory for storing (i) labeled data points, each labeled data point having at least one label indicating whether the data point is a training example for data points to be included in a specified category or a training example for data points to be excluded from the specified category, (ii) unlabeled data points, and (iii) at least one predetermined cost factor for the labeled and unlabeled data points; and
A transductive classifier training device for iteratively training a transductive classifier using transductive maximum entropy discrimination (MED), using the at least one stored cost factor and the stored labeled and unlabeled data points as training examples, wherein at each iteration of the MED calculation, the cost factor of the unlabeled data points is adjusted as a function of an expected label value, and the prior probability of a data point label is adjusted according to an estimate of the class membership probability of the data point;
Wherein a classifier trained by the transductive classifier training device is used to classify at least one of the unlabeled data points, the labeled data points, and input data points; and
Wherein the classification of the classified data points, or a derivative thereof, is output to at least one of a user, another system, and another process.
(Item 24)
24. The apparatus of item 23, wherein the function is an absolute value of the expected label of a data point.
(Item 25)
24. The apparatus of item 23, wherein the memory also stores prior probability information for labeled and unlabeled data points.
(Item 26)
26. The apparatus of item 25, wherein in each iteration of the MED calculation, the prior probability information is adjusted according to an estimate of a data point class membership probability.
(Item 27)
The apparatus of item 23, further comprising a processor for determining, using the labeled and unlabeled data as learning examples according to their expected labels, a decision function having a minimum KL divergence to a prior distribution of the decision function parameters, given the included training examples and the excluded training examples.
(Item 28)
Item 24. The apparatus according to Item 23, further comprising means for determining convergence of the data value and terminating the calculation simultaneously with the determination of convergence.
(Item 29)
29. The apparatus of item 28, wherein convergence is reached when a change in a decision function of the transductive classifier calculation falls below a predetermined threshold.
(Item 30)
29. The apparatus of item 28, wherein convergence is reached when a change in the determined expected label values falls below a predetermined threshold.
(Item 31)
24. The apparatus of item 23, wherein the labels of the included training examples have a value of +1 and the labels of the excluded training examples have a value of -1.
(Item 32)
24. The apparatus of item 23, wherein the labels of the included examples are mapped to a first number and the labels of the excluded examples are mapped to a second number.
(Item 33)
A product comprising a computer-readable program storage medium tangibly embodying one or more programs of instructions executable by a computer to perform a data classification method, the method comprising:
Receiving labeled data points, each labeled data point having at least one label indicating whether the data point is a training example for data points to be included in a specified category or a training example for data points to be excluded from the specified category;
Receiving unlabeled data points;
Receiving at least one predetermined cost factor of the labeled and unlabeled data points;
Training a transductive classifier by iterative maximum entropy discrimination (MED) calculation using the at least one stored cost factor and the stored labeled and unlabeled data points as training examples, wherein at each iteration of the MED calculation, the cost factor of the unlabeled data points is adjusted as a function of an expected label value, and the prior probability of a data point is adjusted according to an estimate of the class membership probability of the data point;
Applying the trained classifier to classify at least one of the unlabeled data points, the labeled data points, and input data points; and
Outputting the classification of the classified data points, or a derivative thereof, to at least one of a user, another system, and another process.
(Item 34)
34. A product according to item 33, wherein the function is an absolute value of the expected label of a data point.
(Item 35)
34. The product of item 33, wherein the method further comprises storing prior probability information for labeled and unlabeled data points in a computer memory.
(Item 36)
36. The product of item 35, wherein in each iteration of the MED calculation, the prior probability information is adjusted according to an estimate of a data point class membership probability.
(Item 37)
The product of item 33, wherein the method comprises the further step of determining, using the labeled and unlabeled data as learning examples according to their expected labels, the decision function having a minimum KL divergence to a prior distribution of the decision function parameters, given the included training examples and the excluded training examples.
(Item 38)
The product of item 33, wherein the iterative step of training the transductive classifier is repeated until convergence of the data values is reached.
(Item 39)
The product of item 38, wherein convergence is reached when a change in the decision function of the transductive classification falls below a predetermined threshold.
(Item 40)
The product of item 38, wherein convergence is reached when a change in the determined expected label values falls below a predetermined threshold.
(Item 41)
34. The product of item 33, wherein the labels of the included training examples have a value of +1 and the labels of the excluded training examples have a value of -1.
(Item 42)
34. The product of item 33, wherein the label of the included example is mapped to a first number and the label of the excluded example is mapped to a second number.
(Item 43)
A method for classifying unlabeled data in a computer-based system, comprising:
Receiving labeled data points, each of the labeled data points having at least one label indicating whether the data point is a training example for data points to be included in a specified category or a training example for data points to be excluded from the specified category;
Receiving labeled and unlabeled data points;
Receiving prior label probability information for the labeled and unlabeled data points;
Receiving at least one predetermined cost factor for the labeled and unlabeled data points;
Determining an expected label for each labeled and unlabeled data point according to the prior probability of the label of the data point;
Repeating the following substeps until the data values substantially converge:
・Generating a scaled cost value for each unlabeled data point in proportion to the absolute value of the expected label of the data point;
・Training a classifier by computing the decision function that minimizes the KL divergence to the prior probability distribution of the decision function parameters, given the included training examples and the excluded training examples, using the labeled and unlabeled data as training examples according to their expected labels;
・Determining classification scores for the labeled and unlabeled data points using the trained classifier;
・Calibrating the output of the trained classifier to class membership probabilities;
・Updating the prior probabilities of the labels of the unlabeled data points according to the determined class membership probabilities;
・Determining the probability distributions of the labels and margins by maximum entropy discrimination (MED), using the updated label prior probabilities and the previously determined classification scores;
・Computing new expected labels using the previously determined label probability distributions; and
・Updating the expected label for each data point by interpolating the new expected label with the expected label from the previous iteration; and
Outputting the classification of the input data points, or a derivative thereof, to at least one of a user, another system, and another process.
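The iterative substeps of item 43 can be sketched in a few dozen lines. The following is a simplified illustration under stated assumptions, not the claimed MED computation: the trainer is a cost-weighted ridge regression (its L2 penalty standing in for a Gaussian prior on the decision function parameters), the calibration is a fixed logistic function, and the MED determination of the label and margin distributions is replaced by the shortcut "new expected label = 2p - 1". All function and parameter names (`train_weighted_linear`, `transductive_loop`, `alpha`, `tol`) are hypothetical.

```python
import math

def train_weighted_linear(X, targets, costs, l2=1.0, lr=0.1, epochs=500):
    """Stand-in trainer: cost-weighted ridge regression by gradient descent."""
    n, dim = len(X), len(X[0])
    w, b = [0.0] * dim, 0.0
    for _ in range(epochs):
        grad_w = [l2 * wj / n for wj in w]  # L2 penalty ~ Gaussian prior
        grad_b = 0.0
        for x, t, c in zip(X, targets, costs):
            err = c * (sum(wj * xj for wj, xj in zip(w, x)) + b - t)
            grad_b += err / n
            for j, xj in enumerate(x):
                grad_w[j] += err * xj / n
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b

def transductive_loop(X, labels, base_cost=1.0, alpha=0.5, tol=1e-4, max_iter=50):
    """labels[i] is +1, -1, or None for an unlabeled point."""
    # Initial label priors: certain for labeled points, 0.5 for unlabeled.
    prior = [0.5 if y is None else (1.0 if y > 0 else 0.0) for y in labels]
    expected = [2.0 * p - 1.0 for p in prior]
    iteration = 0
    for iteration in range(1, max_iter + 1):
        # Scale each unlabeled cost by |expected label|.
        costs = [base_cost * (abs(e) if y is None else 1.0)
                 for e, y in zip(expected, labels)]
        # Train on all points, using the sign of the expected label as target.
        targets = [1.0 if e >= 0 else -1.0 for e in expected]
        w, b = train_weighted_linear(X, targets, costs)
        # Classification scores, calibrated with a fixed logistic (assumption).
        scores = [sum(wj * xj for wj, xj in zip(w, x)) + b for x in X]
        probs = [1.0 / (1.0 + math.exp(-s)) for s in scores]
        change = 0.0
        updated = list(expected)
        for i, y in enumerate(labels):
            if y is None:
                prior[i] = probs[i]                # update label prior
                candidate = 2.0 * prior[i] - 1.0   # stand-in for the MED step
                # Interpolate with the previous iteration's expected label.
                updated[i] = (1.0 - alpha) * expected[i] + alpha * candidate
                change = max(change, abs(updated[i] - expected[i]))
        expected = updated
        if change < tol:  # expected labels have substantially converged
            break
    return expected, iteration
```

Because the unlabeled points start with an expected label of 0, their scaled cost is 0 in the first iteration, so the first classifier is trained on the labeled points only; the unlabeled points gain influence as their expected labels move away from 0.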
(Item 44)
44. A method according to item 43, wherein convergence is reached when a change in the decision function falls below a predetermined threshold.
(Item 45)
44. A method according to item 43, wherein convergence is reached when a change in the determined expected label value falls below a predetermined threshold.
(Item 46)
45. The method of item 43, wherein the labels of the included training examples have a value of +1 and the labels of the excluded training examples have a value of -1.
(Item 47)
A method for classifying documents, comprising:
Receiving at least one labeled seed document having a known confidence level for the label assignment;
Receiving unlabeled documents;
Receiving at least one predetermined cost factor;
Training a transductive classifier by iterative computation using the at least one predetermined cost factor, the at least one seed document, and the unlabeled documents, wherein for each iteration of the computation, the cost factor is adjusted as a function of an expected label value;
Storing confidence scores for the unlabeled documents after at least some of the iterations; and
Outputting an identifier of the unlabeled document having the highest confidence score to at least one of a user, another system, and another process.
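The final steps of item 47 (storing confidence scores and outputting the identifier of the unlabeled document with the highest score) can be illustrated as follows. This is a minimal sketch: the dictionary layout, the function name `top_confident`, and the parameter `k` are assumptions for illustration, and the confidence scores themselves would come from the transductive classifier.

```python
def top_confident(confidence_by_doc, k=1):
    """Return the k document identifiers with the highest confidence scores.

    `confidence_by_doc` maps a document identifier to the confidence score
    stored for it after a training iteration (hypothetical layout).
    """
    ranked = sorted(confidence_by_doc, key=confidence_by_doc.get, reverse=True)
    return ranked[:k]
```

Calling the helper after every iteration, rather than only at the end, corresponds to the variant in which identifiers are output after each iteration.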
(Item 48)
48. The method of item 47, wherein the at least one seed document has a list of keywords.
(Item 49)
48. The method of item 47, wherein a confidence score is stored after each iteration and an identifier of the unlabeled document having the highest confidence score is output after each iteration.
(Item 50)
The method of item 47, further comprising receiving prior probabilities of data point labels for the labeled and unlabeled documents, wherein for each iteration of the computation, the prior probability of a data point label is adjusted according to an estimate of the class membership probability of the data point.
(Item 51)
A method for analyzing documents associated with a legal discovery process, comprising:
Receiving documents associated with a legal matter;
Performing a document classification technique on the documents; and
Outputting identifiers of at least some of the documents based on the classification of the documents.
(Item 52)
52. A method according to item 51, wherein the document classification method includes transductive processing.
(Item 53)
The method of item 52, further comprising training a transductive classifier by iterative computation using at least one predetermined cost factor, at least one seed document, and the documents associated with the legal matter, wherein for each iteration of the computation, the cost factor is adjusted as a function of an expected label value.
(Item 54)
The method of item 53, further comprising receiving prior probabilities of data point labels for the labeled and unlabeled documents, wherein for each iteration of the computation, the prior probability of a data point label is adjusted according to an estimate of the class membership probability of the data point.
(Item 55)
52. A method according to item 51, wherein the document classification method includes support vector machine processing.
(Item 56)
52. A method according to item 51, wherein the document classification method includes maximum entropy discrimination processing.
(Item 57)
52. The method of item 51, further comprising outputting a representation representing a link between the documents.
(Item 58)
A method for organizing data, comprising:
Receiving a plurality of labeled data items;
Selecting a subset of the data items for each of a plurality of categories;
Setting the uncertainty for the data items in each subset to approximately zero;
Setting the uncertainty for the data items not in the subsets to a predetermined value that is not approximately zero;
Training a transductive classifier by iterative computation using the uncertainties, the data items in the subsets, and the data items not in the subsets as training examples;
Applying the trained classifier to each of the labeled data items to classify each of the data items; and
Outputting the classification of the input data items, or a derivative thereof, to at least one of a user, another system, and another process.
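The subset selection and uncertainty assignment of item 58 can be sketched as below. This is an illustration under assumptions: the subsets are drawn at random (one of the options the items describe), exactly 0.0 is used for "approximately zero", and the function name `assign_uncertainty`, its parameters, and the default uncertainty of 0.2 are hypothetical.

```python
import random

def assign_uncertainty(items, subset_size, default_uncertainty=0.2, seed=0):
    """Map each item id to a label uncertainty.

    `items` is a list of (item_id, category) pairs. For every category a
    random subset of up to `subset_size` items is treated as trusted and
    gets uncertainty 0.0; all remaining items get `default_uncertainty`.
    """
    rng = random.Random(seed)
    ids_by_category = {}
    for item_id, category in items:
        ids_by_category.setdefault(category, []).append(item_id)
    # Start with every item at the non-zero default uncertainty.
    uncertainty = {item_id: default_uncertainty for item_id, _ in items}
    for ids in ids_by_category.values():
        for item_id in rng.sample(ids, min(subset_size, len(ids))):
            uncertainty[item_id] = 0.0  # trusted subset member
    return uncertainty
```

The resulting uncertainties would then be passed, together with the items and their labels, to the transductive training step, which is free to relabel the items outside the trusted subsets.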
(Item 59)
59. The method of item 58, wherein the subset is selected randomly.
(Item 60)
59. The method of item 58, wherein the subset is selected and verified by a user.
(Item 61)
59. The method of item 58, further comprising changing the label of at least some of the data items based on the classification.
(Item 62)
59. The method of item 58, wherein after classification of the data item, an identifier of the data item having a confidence level below a predetermined threshold is output to the user.
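The review step of item 62 amounts to a threshold filter over the classifier's confidence levels. A minimal sketch, assuming confidences are held in a dictionary keyed by item identifier; the function name `low_confidence_ids` and the ascending sort order are illustrative choices, not part of the items.

```python
def low_confidence_ids(confidences, threshold):
    """Return ids whose confidence is below `threshold`, least confident first."""
    flagged = [item_id for item_id, c in confidences.items() if c < threshold]
    return sorted(flagged, key=confidences.get)
```

Listing the least confident items first lets a reviewer start with the classifications most likely to be wrong.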
(Item 63)
A method for verifying the association of invoices with an entity, comprising:
Training a classifier based on the format of invoices associated with a first entity;
Accessing a plurality of invoices labeled as being associated with at least one of the first entity and other entities;
Performing a document classification technique on the invoices using the classifier; and
Outputting an identifier of at least one of the invoices having a high probability of not being associated with the first entity.
(Item 64)
64. A method according to item 63, wherein the document classification method includes transductive processing.
(Item 65)
The method of item 64, wherein the classifier is a transductive classifier, the method comprising: training the transductive classifier by iterative computation using at least one predetermined cost factor, at least one seed document, and the invoices, wherein for each iteration of the computation, the cost factor is adjusted as a function of an expected label value; and classifying the invoices using the trained classifier.
(Item 66)
The method of item 65, further comprising receiving prior probabilities of data point labels for the seed document and the invoices, wherein for each iteration of the computation, the prior probability of a data point label is adjusted according to an estimate of the class membership probability of the data point.
(Item 67)
64. A method according to item 63, wherein the document classification method includes support vector machine processing.
(Item 68)
64. A method according to item 63, wherein the document classification method includes maximum entropy discrimination processing.
(Item 69)
A method of managing medical records, comprising:
Training a classifier based on a medical diagnosis;
Accessing a plurality of medical records;
Performing a document classification technique on the medical records using the classifier; and
Outputting an identifier of at least one of the medical records having a low probability of being associated with the medical diagnosis.
(Item 70)
70. The method of item 69, wherein the document classification technique includes transductive processing.
(Item 71)
71. The method of item 70, wherein the classifier is a transductive classifier, wherein the transductive classifier is repetitively calculated using at least one predetermined cost factor, at least one seed document, and the medical record. Training a classifier, wherein for each iteration of the calculation, the cost factor is adjusted as a function of an expected label value, and the trained classifier to classify the medical record Using the method.
(Item 72)
72. The method of item 71, further comprising receiving a prior probability of a data point label for the seed document and the medical record, wherein for each iteration of the calculation the prior probability of the data point label is adjusted according to an estimate of the class membership probability of the data point.
(Item 73)
70. A method according to item 69, wherein the document classification technique includes support vector machine processing.
(Item 74)
70. The method of item 69, wherein the document classification technique includes a maximum entropy identification process.
(Item 75)
Receiving at least one face labeled seed image having a known confidence level;
Receiving an unlabeled image;
Receiving at least one predetermined cost factor;
Training a transductive classifier by iterative calculation using the at least one predetermined cost factor, the at least one seed image, and the unlabeled image, wherein for each iteration of the calculation the cost factor is adjusted as a function of an expected label value;
Storing a confidence score for an unlabeled seed image after at least some of the iterations;
Outputting an identifier of the unlabeled image having the highest confidence score to at least one of a user, another system, and another process;
A face recognition method.
(Item 76)
76. The method of item 75, wherein the at least one seed image has a label that indicates whether the image is included in a specified category.
(Item 77)
76. The method of item 75, wherein a confidence score is stored after each of the iterations and an identifier of the unlabeled image having the highest confidence score is output after each iteration.
(Item 78)
78. The method of item 75, further comprising receiving prior probabilities of data point labels for the labeled and unlabeled images, wherein for each iteration of the calculation the prior probabilities of the data point labels are adjusted according to an estimate of the class membership probabilities of the data points.
(Item 79)
79. The method of item 75, further comprising: receiving a third unlabeled image of a face; comparing the third unlabeled image with at least a portion of the image having the highest confidence score; and outputting an identifier of the third unlabeled image if there is confidence that the face in the third unlabeled image is the same as the face in the seed image.
(Item 80)
Training a classifier based on a search query;
Accessing a plurality of prior art documents;
Using the classifier to perform a document classification technique on the prior art document;
Outputting an identifier of at least a portion of the prior art document based on the classification of the prior art document;
A method for analyzing a prior art document including:
(Item 81)
81. A method according to item 80, wherein the document classification technique includes transductive processing.
(Item 82)
82. The method of item 81, wherein the classifier is a transductive classifier, the method further comprising: training the transductive classifier by iterative calculation using at least one predetermined cost factor, at least one seed document, and the prior art document, wherein for each iteration of the calculation the cost factor is adjusted as a function of an expected label value; and classifying the prior art document using the trained classifier.
(Item 83)
83. The method of item 82, further comprising receiving prior probabilities of data point labels for the seed document and the prior art document, wherein for each iteration of the calculation the prior probabilities of the data point labels are adjusted according to an estimate of the class membership probabilities of the data points.
(Item 84)
81. The method of item 80, wherein the search query includes at least a portion of patent disclosure information.
(Item 85)
81. The method of item 80, wherein the search query includes at least a portion of an item retrieved from a patent document or patent application document.
(Item 86)
81. The method of item 80, wherein the search query includes at least a portion of a patent document or a summary of patent application documents.
(Item 87)
81. The method of item 80, wherein the search query includes at least a portion of a summary retrieved from a patent document or patent application document.
(Item 88)
81. A method according to item 80, wherein the document classification method includes support vector machine processing.
(Item 89)
81. The method of item 80, wherein the document classification technique includes a maximum entropy identification process.
(Item 90)
81. A method according to item 80, wherein the prior art document is a patent office public document.
(Item 91)
The method of item 80, further comprising outputting a representation of a link between the documents.
(Item 92)
81. The method of item 80, further comprising outputting a relevance score for at least a portion of the prior art document based on the prior art document classification.
(Item 93)
Receiving at least one labeled seed document;
Receiving an unlabeled document;
Training a transductive classifier using the at least one seed document and the unlabeled document;
Using the classifier to classify the unlabeled documents having a confidence level above a predetermined threshold into a plurality of existing categories;
Using the classifier to classify the unlabeled document having a confidence level below a predetermined threshold into at least one new category;
Reclassifying at least a portion of the categorized document into the existing category and the at least one new category using the classifier;
Outputting an identifier of the categorized document to at least one of a user, another system, and another process;
A method for adapting a patent classification to shifts in document content.
(Item 94)
94. The method of item 93, wherein the classifier is a transductive classifier, the method further comprising: training the transductive classifier by iterative calculation using at least one predetermined cost factor, a search query, and the document, wherein for each iteration of the calculation the cost factor is adjusted as a function of an expected label value; and classifying the document using the trained classifier.
(Item 95)
95. The method of item 94, further comprising receiving a prior probability of a data point label for the search query and the document, wherein for each iteration of the calculation the prior probability of the data point label is adjusted according to an estimate of the class membership probability of the data point.
(Item 96)
94. A method according to item 93, wherein the document classification method includes support vector machine processing.
(Item 97)
94. A method according to item 93, wherein the document classification method includes a maximum entropy identification process.
(Item 98)
94. A method according to item 93, wherein the unlabeled document is a patent application document.
(Item 99)
94. The method of item 93, wherein the at least one seed document is selected from the group consisting of patent documents and patent application documents.
(Item 100)
Training a classifier based on at least one item of a patent document or patent application document;
Accessing multiple documents;
Using the classifier to perform a document classification technique on at least a portion of the document;
Outputting an identifier of at least a portion of the document based on the classification of the document;
A method of matching documents to items, including
(Item 101)
101. The method of item 100, further comprising outputting a relevance score for at least a portion of the document based on the classification of the document.
(Item 102)
101. A method according to item 100, wherein the document is a prior art document.
(Item 103)
101. A method according to item 100, wherein the document describes a product.
(Item 104)
Training a classifier based on a plurality of documents known to exist in a particular patent classification;
Receiving at least a portion of a patent document or patent application document;
Performing a document classification technique on the at least part of the patent document or patent application document using the classifier;
Outputting the classification of the patent document or patent application document;
A method for classifying patent documents or patent application documents, wherein the document classification technique is a yes/no classification technique.
(Item 105)
105. The method of item 104, wherein the document is selected from the group consisting of patent documents and patent application documents.
(Item 106)
106. The method of item 105, wherein the at least part of the patent document or patent application document includes at least a part of an item retrieved from the patent document or patent application document.
(Item 107)
106. The method of item 105, wherein the at least part of the patent document or patent application document includes at least a part of a summary of the patent document or patent application document.
(Item 108)
106. The method of item 105, wherein the at least part of the patent document or patent application document includes at least a part of a summary taken from the patent document or patent application document.
(Item 109)
Performing a document classification technique on at least a portion of a patent document or patent application document using a classifier trained on at least one document associated with a particular patent classification, the document classification technique being a yes/no classification technique;
Outputting the classification of the patent document or patent application document;
A method for classifying patent documents or patent application documents.
(Item 110)
110. The method of item 109, further comprising repeating the method with different classifiers trained based on a plurality of documents known to exist in the second patent classification.
(Item 111)
110. The method of item 109, wherein the at least part of the patent document or patent application document includes at least a part of an item retrieved from the patent document or patent application document.
(Item 112)
110. The method of item 109, wherein the at least part of the patent document or patent application document includes at least a part of a summary of the patent document or patent application document.
(Item 113)
110. The method of item 109, wherein the at least part of the patent document or patent application document includes at least a part of a summary taken from the patent document or patent application document.
(Item 114)
Receiving at least one labeled seed document;
Receiving an unlabeled document;
Receiving at least one predetermined cost factor;
Training a transductive classifier using the at least one predetermined cost factor, the at least one seed document, and the unlabeled document;
Using the classifier to classify the unlabeled documents having a confidence level above a predetermined threshold into a plurality of categories;
Outputting an identifier of the categorized document to at least one of a user, another system, and another process;
A method for adapting to shifts in document content.
(Item 115)
115. The method of item 114, further comprising transferring unlabeled documents having a confidence level below the predetermined threshold to one or more new categories.
(Item 116)
116. The method of item 114, further comprising: training the transductive classifier by iterative calculation using the at least one predetermined cost factor, the at least one seed document, and the unlabeled document, wherein for each iteration of the calculation the cost factor is adjusted as a function of an expected label value; and classifying the unlabeled document using the trained classifier.
(Item 117)
117. The method of item 116, further comprising receiving prior probabilities of data point labels for the seed document and the unlabeled document, wherein for each iteration of the calculation the prior probabilities of the data point labels are adjusted according to an estimate of the class membership probabilities of the data points.
(Item 118)
119. The method of item 114, wherein the unlabeled document is a customer complaint, further comprising linking a product change with a customer complaint.
(Item 119)
115. The method of item 114, wherein the unlabeled document is an invoice.
(Item 120)
Receiving labeled data; and
Receiving a sequence of unlabeled documents;
Adapting probabilistic classification rules using transduction based on the labeled data and the unlabeled document;
Updating weights used for document separation according to the probabilistic classification rules;
Determining separation positions in the sequence of documents;
Outputting the indicator of the determined separation location in the series to at least one of a user, another system, and another process;
Setting a flag in the document that correlates with the indicator;
A method for separating documents, including:
(Item 121)
Receiving a search query;
Retrieving a document based on the search query;
Outputting the document;
Receiving a user input label for at least a portion of the document, the label indicating relevance of the document to the search query;
Training a classifier based on the search query and the user input label;
Performing a document classification technique on the document using the classifier to reclassify the document;
Outputting an identifier of at least a portion of the document based on the classification of the document;
Document retrieval method including
(Item 122)
124. A method according to item 121, wherein the document classification technique includes transductive processing.
(Item 123)
123. The method of item 122, wherein the classifier is a transductive classifier, the method further comprising: training the transductive classifier by iterative calculation using at least one predetermined cost factor, the search query, and the document, wherein for each iteration of the calculation the cost factor is adjusted as a function of an expected label value; and classifying the document using the trained classifier.
(Item 124)
124. The method of item 123, further comprising receiving a prior probability of a data point label for the search query and the document, wherein for each iteration of the calculation the prior probability of the data point label is adjusted according to an estimate of the class membership probability of the data point.
(Item 125)
124. A method according to item 121, wherein the document classification method includes support vector machine processing.
(Item 126)
124. A method according to item 121, wherein the document classification method includes a maximum entropy identification process.
(Item 127)
122. The method of item 121, wherein the reclassified document is output and the document with the highest confidence is output first.

The following description is the best mode presently contemplated for carrying out the invention. It is made to illustrate the general principles of the invention and is not intended to limit the inventive concepts claimed herein. Further, particular features described herein can be used in combination with other described features in each of the various possible combinations and permutations.

Unless otherwise specifically defined herein, all terms are to be given their broadest possible interpretation, including meanings implied by the specification as well as meanings understood by those skilled in the art and as defined in dictionaries, treatises, and the like.

(Text classification)
Interest in, and need for, the classification of text data has been particularly strong, and several classification techniques have been employed. Classification techniques for text data are discussed below.

To improve the usefulness and intelligence of classification, machines such as computers are needed to classify (or recognize) objects across an ever-growing range of content. For example, a computer can use optical character recognition to classify handwritten or scanned numerals and letters, pattern recognition to classify images such as faces, fingerprints, and fighter planes, or speech recognition to classify sounds, voices, and the like.

Machines have also been needed to classify textual information objects, such as computer files or documents consisting of text. The applications of text classification are diverse and important. For example, text classification can be used to organize textual information objects into a hierarchy of predetermined classes or categories. This simplifies finding (or navigating to) textual information objects related to a particular subject. Text classification can be used to route textual information objects to the appropriate people or locations. In this way, the information services industry can deliver textual information objects covering diverse subjects (e.g., business, sports, the stock market, football, a particular company, a particular football team) to people with diverse interests. Text classification can also be used to filter textual information objects so that a person is not bothered by unwanted textual content (such as unwanted, unsolicited e-mail, also called junk mail or "spam"). As these few examples show, text classification has many attractive and important uses.

(Rule-based classification)
In some cases, textual content must be classified with absolute certainty, based on certain accepted logic. A rule-based system can be used to perform this type of classification. Basically, rule-based systems use production rules of the form:
IF (condition), THEN (fact).
Here, the condition may include whether the textual information contains certain words or phrases, has a certain syntax, or has certain attributes. For example, if the textual content contains the word "close", the phrase "Nasdaq", and a number, it is classified as "stock market" text.
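The production rule in the stock market example can be sketched in a few lines of Python. This is only an illustration: the function name `classify_by_rule` is hypothetical, the trigger word is assumed to be the English word "close", and "a number" is assumed to mean any digit in the text.

```python
import re

def classify_by_rule(text):
    """Return "stock market" if the production rule fires, otherwise None."""
    lowered = text.lower()
    has_close = re.search(r"\bclose\b", lowered) is not None   # the word "close"
    has_nasdaq = "nasdaq" in lowered                            # the phrase "Nasdaq"
    has_number = re.search(r"\d", lowered) is not None          # any number
    # IF (condition), THEN (fact): all three conditions must hold.
    if has_close and has_nasdaq and has_number:
        return "stock market"
    return None
```

Because the logic is fixed in advance, the rule either fires or it does not; there is no learned parameter and no confidence value.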

Over the last decade or so, other types of classifiers have been used increasingly. Although these classifiers do not use static, predefined logic as rule-based classifiers do, they have outperformed rule-based classifiers in many applications. Such classifiers typically include a learning element and a performance element. They may include neural networks, Bayesian networks, and support vector machines. Although each of these classifiers is known, each is briefly introduced below for the reader's convenience.

(Classifiers having a learning element and a performance element)
As mentioned at the end of the previous section, classifiers having a learning element and a performance element outperform rule-based classifiers in many applications. To reiterate, these classifiers may include neural networks, Bayesian networks, and support vector machines.

(Neural networks)
A neural network is basically a multilayered, hierarchical arrangement of identical processing elements, also called neurons. Each neuron can have one or more inputs but only one output. Each neuron input is weighted by a coefficient. The output of a neuron is typically a function of the sum of its weighted inputs and a bias value. This function, also called the activation function, is typically a sigmoid function. That is, the activation function may be S-shaped and monotonically increasing, asymptotically approaching fixed values (e.g., +1, 0, −1) as its input(s) approach positive or negative infinity, respectively. The sigmoid function and the individual neural weights and bias values determine the response, or "sensitivity", of the neuron to input signals.
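A single neuron of the kind just described can be sketched as follows. The logistic function is used here as one common sigmoid (it approaches 0 and 1 at the extremes); the function names are illustrative and not taken from the specification.

```python
import math

def sigmoid(z):
    """Logistic sigmoid: S-shaped, monotonically increasing, bounded by 0 and 1."""
    return 1.0 / (1.0 + math.exp(-z))

def neuron_output(inputs, weights, bias):
    """Single output: activation of the weighted input sum plus the bias."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)
```

Larger weights make the output change more steeply with the input, and the bias shifts the point at which the neuron switches from low to high output.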

In the hierarchical arrangement of neurons, the output of a neuron in one layer can be distributed as an input to one or more neurons in the next layer. A typical neural network may include an input layer and two distinct layers of neurons, that is, an input layer, an intermediate neuron layer, and an output neuron layer. Note that the nodes of the input layer are not neurons. Rather, each input-layer node has only one input and basically passes that input, unprocessed, to the inputs of the next layer. For example, if a neural network is used to recognize a numeral in a 20 × 15 pixel array, the input layer may have 300 neurons (i.e., one for each input pixel) and the output layer may have 10 neurons (i.e., one for each of the ten numerals).
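The layered arrangement for the 20 × 15 numeral example might be sketched as below. The size of the intermediate layer (16), the Gaussian spread (0.1), and the random pixel values are arbitrary assumptions made only so that the sketch runs; the network is untrained, so its prediction is meaningless.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def make_layer(n_inputs, n_neurons, rng):
    """A layer is a list of (weights, bias) pairs with Gaussian random values."""
    return [([rng.gauss(0.0, 0.1) for _ in range(n_inputs)], rng.gauss(0.0, 0.1))
            for _ in range(n_neurons)]

def forward(layer, inputs):
    """Each neuron sees every output of the previous layer."""
    return [sigmoid(sum(w * x for w, x in zip(weights, inputs)) + bias)
            for weights, bias in layer]

rng = random.Random(0)
hidden = make_layer(20 * 15, 16, rng)   # intermediate neuron layer (size assumed)
output = make_layer(16, 10, rng)        # one output neuron per numeral

pixels = [rng.random() for _ in range(20 * 15)]   # input nodes pass values through
scores = forward(output, forward(hidden, pixels))
predicted_digit = max(range(10), key=lambda d: scores[d])
```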

Using a neural network generally involves two successive phases. First, the network is initialized and trained on known inputs having known output values (or classifications). Once trained, the neural network can be used to classify unknown inputs. The network may be initialized by setting the weights and biases of the neurons to random values, typically drawn from a Gaussian distribution. The network is then trained using a succession of inputs having known outputs (or classifications). As the training inputs are fed to the network, the weights and biases of the neurons are adjusted (e.g., in accordance with the known back-propagation technique) so that the network's output for each individual training pattern approaches or matches the known output. Basically, gradient descent in weight space is used to minimize the output error. In this way, learning using successive training inputs converges toward a locally optimal solution for the weights and biases. That is, the weights and biases are adjusted to minimize an error.

In practice, the system is usually not trained to the point where it converges to an optimal solution. Otherwise, the system would be "overtrained": it would become too specialized to the training data and might classify poorly any input that differs somewhat from those in the training set. Therefore, at various points during training, the system is tested on a set of validation data. Training is halted when the system's performance on the validation set no longer improves.
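The two training ideas described above, gradient descent on the output error and halting when validation performance stops improving, can be sketched for a single sigmoid neuron. The learning rate, the `patience` of 10 epochs, and the toy OR data are assumptions chosen for illustration; a full network would propagate the same error term backward through its layers.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, bias, xs):
    return sigmoid(sum(w * x for w, x in zip(weights, xs)) + bias)

def mean_squared_error(weights, bias, data):
    return sum((predict(weights, bias, xs) - y) ** 2 for xs, y in data) / len(data)

def train(train_set, validation_set, rate=0.5, max_epochs=500, patience=10, seed=0):
    rng = random.Random(seed)
    weights = [rng.gauss(0.0, 0.1) for _ in train_set[0][0]]   # Gaussian initialization
    bias = rng.gauss(0.0, 0.1)
    best = (mean_squared_error(weights, bias, validation_set), list(weights), bias)
    stale = 0
    for _ in range(max_epochs):
        for xs, y in train_set:
            out = predict(weights, bias, xs)
            # Gradient of 0.5 * (out - y)**2 with respect to the pre-activation.
            delta = (out - y) * out * (1.0 - out)
            weights = [w - rate * delta * x for w, x in zip(weights, xs)]
            bias -= rate * delta
        val_error = mean_squared_error(weights, bias, validation_set)
        if val_error < best[0]:
            best, stale = (val_error, list(weights), bias), 0
        else:
            stale += 1
            if stale >= patience:   # validation performance no longer improves
                break
    return best[1], best[2], best[0]

# Toy data (logical OR), reused as the validation set only to keep the sketch short.
data = [([0.0, 0.0], 0.0), ([0.0, 1.0], 1.0), ([1.0, 0.0], 1.0), ([1.0, 1.0], 1.0)]
weights, bias, val_error = train(data, data)
```

In real use the validation set would be held out from the training examples, which is what makes the stopping criterion a guard against overtraining.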

Once training is complete, the neural network can be used to classify unknown inputs in accordance with the weights and biases determined during training. If the neural network can classify the unknown input with confidence, one of the outputs of the neurons in the output layer will be much higher than the others.

(Bayesian networks)
In general, Bayesian networks use hypotheses as intermediaries between data (e.g., input feature vectors) and predictions (e.g., classifications). The probability of each hypothesis given the data, P(hypothesis | data), can be estimated. A prediction is made from the hypotheses, using the posterior probabilities of the hypotheses to weight the individual predictions of each hypothesis. The probability of a prediction X given data D can be expressed as

\[ P(X \mid D) = \sum_i P(X \mid D, H_i)\, P(H_i \mid D) = \sum_i P(X \mid H_i)\, P(H_i \mid D) \]

where H_i is the i-th hypothesis. The most probable hypothesis, the one that maximizes the probability of H_i given D, P(H_i | D), is called the maximum a posteriori hypothesis (or "H_MAP") and can be expressed as

\[ H_{MAP} = \arg\max_{H_i} P(H_i \mid D) \]

Using Bayes' theorem, the probability of hypothesis H_i given data D can be expressed as

\[ P(H_i \mid D) = \frac{P(D \mid H_i)\, P(H_i)}{P(D)} \]

The probability of the data D remains fixed. Therefore, to find H_MAP, the numerator must be maximized.

The first term of the numerator represents the probability that the data would have been observed given hypothesis i. The second term represents the prior probability assigned to the given hypothesis i.
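As a small worked sketch of the quantities discussed above, the following computes the posterior P(H_i | D) from a likelihood and a prior, picks the maximum a posteriori hypothesis by maximizing the numerator only, and forms a posterior-weighted prediction. The function names and any numbers passed to them are illustrative assumptions.

```python
def posterior(priors, likelihoods):
    """P(H_i | D) = P(D | H_i) * P(H_i) / P(D), for every hypothesis."""
    joint = {h: likelihoods[h] * priors[h] for h in priors}
    evidence = sum(joint.values())   # P(D): identical for all hypotheses
    return {h: joint[h] / evidence for h in joint}

def map_hypothesis(priors, likelihoods):
    """Since P(D) is fixed, maximizing the numerator is sufficient."""
    return max(priors, key=lambda h: likelihoods[h] * priors[h])

def predict(prediction_given_h, post):
    """P(X | D) as the posterior-weighted sum of per-hypothesis predictions."""
    return sum(prediction_given_h[h] * post[h] for h in post)
```

For example, with priors `{"h1": 0.7, "h2": 0.3}` and likelihoods `{"h1": 0.2, "h2": 0.9}`, the numerators are 0.14 and 0.27, so `h2` is the MAP hypothesis even though its prior is smaller.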

A Bayesian network includes variables and directed edges between the variables, thereby defining a directed acyclic graph (or "DAG"). Each variable can assume any of a finite number of mutually exclusive states. For each variable A having parent variables B_1, ..., B_n, there is an attached probability table P(A | B_1, ..., B_n). The structure of the Bayesian network encodes the assumption that each variable is conditionally independent of its non-descendants, given its parent variables.

Assuming that the structure of the Bayesian network is known and the variables are observable, only the set of conditional probability tables needs to be learned. These tables can be estimated directly using statistics from a set of training examples. If the structure is known but some variables are hidden, learning is analogous to the neural network learning discussed above.

An example of a simple Bayesian network is introduced below. A variable "MML" may represent the "moisture of my lawn" and may have the states "wet" and "dry". The MML variable may have the parent variables "rain" and "my sprinkler on", each having the states "yes" and "no". Another variable, "MNL", may represent the "moisture of my neighbor's lawn" and may have the states "wet" and "dry". The MNL variable may share the "rain" parent variable. In this example, the prediction may be whether my lawn is "wet" or "dry". This prediction may depend on hypothesis (i), "if it rains, my lawn will be wet with probability x_1", and hypothesis (ii), "if my sprinkler was on, my lawn will be wet with probability x_2". The probability that it rained, or that my sprinkler was on, may depend on other variables. For example, if my neighbor's lawn is wet and the neighbor has no sprinkler, it is more likely that it rained.
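The lawn example can be made concrete with a small sketch that performs inference by summing the joint distribution over all variable assignments. Every probability value below is invented for illustration; only the network structure (rain and sprinkler as parents of MML, rain as the parent of MNL) follows the example.

```python
from itertools import product

# Hypothetical probability tables (all numbers are assumptions).
P_RAIN = {True: 0.2, False: 0.8}
P_SPRINKLER = {True: 0.3, False: 0.7}
P_MML_WET = {(True, True): 0.99, (True, False): 0.9,    # P(MML = wet | rain, sprinkler)
             (False, True): 0.8, (False, False): 0.05}
P_MNL_WET = {True: 0.9, False: 0.05}                    # P(MNL = wet | rain)

def joint(rain, sprinkler, mml_wet, mnl_wet):
    """Joint probability as the product of each variable's table entry."""
    p_mml = P_MML_WET[(rain, sprinkler)]
    p_mnl = P_MNL_WET[rain]
    return (P_RAIN[rain] * P_SPRINKLER[sprinkler]
            * (p_mml if mml_wet else 1.0 - p_mml)
            * (p_mnl if mnl_wet else 1.0 - p_mnl))

def probability(query, evidence):
    """P(query | evidence) by enumerating every assignment of the four variables."""
    names = ("rain", "sprinkler", "mml_wet", "mnl_wet")
    num = den = 0.0
    for values in product((True, False), repeat=4):
        world = dict(zip(names, values))
        if all(world[k] == v for k, v in evidence.items()):
            p = joint(**world)
            den += p
            if all(world[k] == v for k, v in query.items()):
                num += p
    return num / den
```

With these assumed tables, `probability({"rain": True}, {"mnl_wet": True})` is higher than the prior probability of rain, which matches the observation that a wet neighboring lawn without a sprinkler makes rain more likely.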

As discussed above, the conditional probability tables in a Bayesian network can be trained, as is the case with neural networks. Advantageously, the learning process can be shortened by allowing prior knowledge to be provided. Unfortunately, however, prior probabilities for the conditional probabilities are usually unknown, in which case a uniform prior is used.

One embodiment of the present invention may perform at least one of two basic functions: generating parameters for a classifier, and classifying objects such as textual information objects.

Basically, the parameters are generated for a classifier based on a set of training examples. A set of feature vectors may be generated from the set of training examples. The features of the set of feature vectors may be reduced. The parameters to be generated may include a defined monotonic (e.g., sigmoid) function and a weight vector. The weight vector may be determined by SVM training (or by another known technique). The monotonic (e.g., sigmoid) function may be defined by means of an optimization method.

The text classifier may include the weight vector and the defined monotonic (e.g., sigmoid) function. Basically, the output of the text classifier of the present invention can be expressed as

\[ O_c = \frac{1}{1 + e^{A(\vec{w}_c \cdot \vec{x}) + B}} \qquad (2) \]

where
O_c = the classification output for category c,
w_c = the weight vector parameter associated with category c,
x = the (reduced) feature vector based on the unknown textual information object, and
A and B are adjustable parameters of the monotonic (e.g., sigmoid) function.

Calculating the output from Equation (2) is faster than calculating the output from Equation (1).
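A minimal sketch of such a classifier output, assuming the two-parameter sigmoid form 1/(1 + exp(A(w_c · x) + B)). This specific form, the function name, and the parameter values used below are assumptions for illustration and are not confirmed by the text.

```python
import math


def classifier_output(w_c, x, a, b):
    """Map the linear score w_c . x into (0, 1) with a two-parameter sigmoid.

    The sigmoid form and sign convention are assumptions; a and b stand in
    for the adjustable parameters A and B.
    """
    score = sum(wi * xi for wi, xi in zip(w_c, x))
    return 1.0 / (1.0 + math.exp(a * score + b))
```

With a negative `a`, larger scores map to outputs closer to 1. The cost per category is one dot product and one exponential, which is why an output of this kind is cheap to evaluate.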

Depending on the form of the object to be classified, the classifier may (i) convert a textual information object into a feature vector and (ii) reduce the feature vector to a feature vector having fewer elements.

(Transductive machine learning)
Current state-of-the-art approaches in commercial automatic classification systems are either rule-based or use inductive machine learning, i.e., machine learning from manually labeled training examples. Both approaches typically require a large amount of manual setup effort compared with transductive methods. The solutions provided by rule-based systems or inductive methods are static: they cannot adapt to drifting classification concepts without manual effort.

Inductive machine learning is used to ascribe properties or relations to types based on tokens (i.e., on one or a small number of observations or experiences), or to formulate laws based on limited observations of recurring patterns. Inductive machine learning involves reasoning from observed training cases to create general rules, which are then applied to the test cases.

In particular, preferred embodiments use transductive machine learning techniques. Transductive machine learning is a powerful method that does not suffer from these disadvantages.

Transductive machine learning techniques can learn from a very small set of labeled training examples while automatically adapting to drifting classification concepts and automatically correcting the labeled training examples. These advantages make transductive machine learning an interesting and valuable method for a wide variety of commercial applications.

Transduction learns patterns in data. It extends the concept of inductive learning by learning not only from labeled data but also from unlabeled data. This enables transduction to learn patterns that are not captured, or only partly captured, in the labeled data. As a result, in contrast to rule-based systems or systems based on inductive learning, transduction can adapt to dynamically changing environments. This capability allows transduction to be used for document discovery, data cleanup, and the handling of drifting classification concepts, among other things.

The following describes one embodiment of transductive classification that uses support vector machine (SVM) classification and the maximum entropy discrimination (MED) framework.

(Support vector machines)
The support vector machine (SVM) is one method employed for text classification. It addresses the problem of the large number of possible solutions, and the resulting generalization problem, by placing constraints on the possible solutions using concepts from regularization theory. For example, a binary SVM classifier selects, from all hyperplanes that separate the training data correctly, the hyperplane that maximizes the margin. Maximum-margin regularization under the constraint that the training data is classified correctly addresses the aforementioned learning problem of choosing an appropriate trade-off between generalization and memorization: the constraint on the training data memorizes the data, whereas the regularization ensures appropriate generalization. Inductive classification learns from training examples with known labels, i.e., the class membership of every training example is known. Whereas inductive classification learns from known labels, transductive classification determines the classification rules from labeled as well as unlabeled data. An example of transductive SVM classification is shown in Table 1.

(Principle of transductive SVM classification)

Table 1 shows the principle of transductive classification with support vector machines. The solution is given by the hyperplane that yields the maximum margin over all possible label assignments of the unlabeled data. The number of possible label assignments grows exponentially with the number of unlabeled data points, so practically applicable solutions must approximate the algorithm of Table 1. An example of such an approximation is described in T. Joachims, “Transductive inference for text classification using support vector machines”, Technical report, Universitaet Dortmund, LAS VIII, 1999 (Joachims).
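To make the exhaustive search over label assignments concrete, the following Python sketch enumerates every labeling of the unlabeled points for one-dimensional data and keeps the separable labeling with the largest margin. It is a toy illustration under stated assumptions (1-D inputs, a threshold as the "hyperplane", hard margin) and is neither the algorithm of Table 1 nor the Joachims approximation; the function name is hypothetical.

```python
from itertools import product


def best_transductive_split(labeled, unlabeled):
    """Brute-force transductive max-margin search on 1-D data.

    labeled: list of (x, y) pairs with y in {+1, -1}; unlabeled: list of x.
    Returns (margin, threshold, labels_for_unlabeled) for the separable label
    assignment with the largest margin, or None if none is separable.
    """
    best = None
    # 2 ** len(unlabeled) candidate assignments: exponential in the unlabeled data
    for assignment in product([+1, -1], repeat=len(unlabeled)):
        points = labeled + list(zip(unlabeled, assignment))
        pos = [x for x, y in points if y == +1]
        neg = [x for x, y in points if y == -1]
        if not pos or not neg:
            continue
        # try both orientations of the 1-D separating threshold
        for low, high in ((neg, pos), (pos, neg)):
            gap = min(high) - max(low)
            if gap > 0:
                margin = gap / 2
                threshold = (min(high) + max(low)) / 2
                if best is None or margin > best[0]:
                    best = (margin, threshold, assignment)
    return best
```

Already with a few dozen unlabeled points the loop becomes infeasible, which is why practical methods approximate the search.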

The uniform distribution over label assignments in Table 1 implies that an unlabeled data point has a probability of 1/2 of being a positive example of the class and a probability of 1/2 of being a negative example, i.e., its two possible label assignments y = +1 (positive example) and y = −1 (negative example) are equally likely and the resulting expected label is zero. A label expectation of zero can be obtained from a fixed class prior probability equal to 1/2, or from a class prior probability that is a random variable with a uniform prior distribution, i.e., an unknown class prior probability. Accordingly, in applications with a known class prior probability that is not equal to 1/2, the algorithm could be improved by incorporating this additional information. For example, instead of using a uniform distribution over label assignments as in Table 1, one could prefer some label assignments over others according to the class prior probability. However, the trade-off between a smaller-margin solution with a likely label assignment and a larger-margin solution with a less likely label assignment is difficult, because the probability of a label assignment and the margin are on different scales.
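The link between the label prior and the expected label can be written out as a short worked equation, where p denotes the probability of the positive label (the symbol p is introduced here only for illustration):

```latex
\langle y \rangle = (+1)\,p + (-1)\,(1 - p) = 2p - 1,
\qquad
p = \tfrac{1}{2} \;\Rightarrow\; \langle y \rangle = 0,
\qquad
p = 0.8 \;\Rightarrow\; \langle y \rangle = 0.6 .
```

A class prior other than 1/2 therefore shifts the expected label away from zero, which is the additional information a uniform distribution over label assignments cannot express.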

(Maximum entropy discrimination)
Another classification method, maximum entropy discrimination (MED) (see, e.g., T. Jebara, “Machine Learning: Discriminative and Generative”, Kluwer Academic Publishers) (Jebara), does not encounter the problems associated with SVMs, because the decision function regularization term and the label assignment regularization term are both derived from prior probability distributions over solutions and are therefore on the same probabilistic scale. Accordingly, if the class priors, and thus the label priors, are known, transductive MED classification is superior to transductive SVM classification, since it allows prior label knowledge to be incorporated in a principled way.

Inductive MED classification assumes a prior distribution over the parameters of the decision function, a prior distribution over the bias term, and a prior distribution over the margins. It selects as the final distribution over these parameters the one that is closest to the prior distributions and that yields an expected decision function that classifies the data points correctly.

Formally, for example for a linear classifier, the problem is formulated as follows: find the distribution over hyperplane parameters p(Θ), the bias distribution p(b), and the distribution over data point classification margins p(γ) whose combined probability distribution has minimal Kullback-Leibler divergence KL to the combined respective prior distributions p_0, i.e.,

$$\min_{p(\Theta),\,p(\gamma),\,p(b)} \mathrm{KL}\bigl(p(\Theta)\,p(\gamma)\,p(b)\,\big\|\,p_0(\Theta)\,p_0(\gamma)\,p_0(b)\bigr)$$

subject to the constraint

$$\forall t:\quad \int d\Theta\, d\gamma\, db\;\, p(\Theta)\,p(\gamma)\,p(b)\,\bigl[\,y_t\,(\Theta X_t - b) - \gamma_t\,\bigr] \ge 0,$$

where ΘX_t is the dot product between the weight vector of the separating hyperplane and the feature vector of the t-th data point. Since the label assignments y_t are known and fixed, no prior distribution over the binary label assignments is needed. Hence, a straightforward way to generalize inductive MED classification to transductive MED classification is to treat the binary label assignments as parameters constrained by a prior distribution over possible label assignments. An example of transductive MED is shown in Table 2.

(Transductive MED classification)

For labeled data, the label prior distribution is a δ function, which effectively fixes the label to be either +1 or −1. For unlabeled data, a label prior probability p_0(y) is assumed that assigns to every unlabeled data point the positive label y = +1 with probability p_0(y) and the negative label y = −1 with probability 1 − p_0(y). Assuming a non-informative label prior (p_0(y) = 1/2) yields a transductive MED classification analogous to the transductive SVM classification discussed above.

As in the case of transductive SVM classification, a practical implementation of such an MED algorithm must approximate the search over all possible label assignments. The method described in T. Jaakkola, M. Meila, and T. Jebara, “Maximum entropy discrimination”, Technical Report AITR-1668, Massachusetts Institute of Technology, Artificial Intelligence Laboratory, 1999 (Jaakkola) chooses, as its approximation, to decompose the procedure into two steps, similar to an expectation maximization (EM) formulation. In this formulation there are two problems to be solved. The first, analogous to the M step of EM algorithms, resembles maximizing the margin while classifying all data points correctly according to the current best guess of the label assignments. The second, analogous to the E step, uses the classification results determined in the M step to estimate new values for the class membership of each example. We call this second step label induction. A general description is shown in Table 2.

The specific implementation of the Jaakkola method referenced herein assumes a Gaussian with zero mean and unit variance for the hyperplane parameters, a Gaussian with zero mean and variance σ_b^2 for the bias parameter, a margin prior of the form exp[−c(1−γ)], where γ is the margin of a data point and c is the cost factor, and a binary label prior probability p_0(y) for unlabeled data as discussed above. For the following discussion of the Jaakkola transductive classification algorithm referenced herein, a label prior probability of 1/2 is assumed for simplicity and without loss of generality.

The label induction step determines the label probability distributions given a fixed probability distribution over the hyperplane parameters. Using the margin and label priors introduced above yields the following objective function for the label induction step (see Table 2):

$$J(\lambda) = \sum_t \Bigl[\lambda_t + \log\Bigl(1 - \frac{\lambda_t}{c}\Bigr) - \log\cosh(\lambda_t s_t)\Bigr] \qquad (3)$$

where λ_t is the Lagrange multiplier of the t-th training example, s_t is its classification score determined in the previous M step, and c is the cost factor. The first two terms in the sum over the training examples are derived from the margin prior distribution, whereas the third term is given by the label prior distribution. Maximizing J(λ) determines the Lagrange multipliers and thus the label probability distributions for the unlabeled data. As can be seen from Equation 3, the data points contribute independently to the objective function, so each Lagrange multiplier can be determined irrespective of every other Lagrange multiplier. For example, maximizing the contribution of an unlabeled data point with a high absolute value |s_t| of its classification score requires a small Lagrange multiplier λ_t, whereas the contribution of an unlabeled data point with a small value of |s_t| is maximized by a large Lagrange multiplier. On the other hand, the expected label ⟨y⟩ of an unlabeled data point as a function of its classification score s and its Lagrange multiplier λ is

$$\langle y \rangle = \tanh(\lambda s) \qquad (4)$$
FIG. 1 shows the expected label ⟨y⟩ as a function of the classification score s for cost factors of c = 5 and c = 1.5. The Lagrange multipliers used to generate FIG. 1 were determined by solving Equation 3 with these cost factors. As can be seen from FIG. 1, unlabeled data points outside the margin, i.e., with |s| > 1, have expected labels ⟨y⟩ close to zero; data points close to the margin, i.e., with |s| ≈ 1, yield the highest absolute expected label values; and data points close to the hyperplane, i.e., with |s| < ε, yield |⟨y⟩| < ε. The reason for this unintuitive label assignment of ⟨y⟩ → 0 as |s| → ∞ lies in the chosen discriminative approach, which attempts to stay as close as possible to the prior distribution as long as the classification constraints are fulfilled. It is not an artifact of the approximation chosen by the known method of Table 2: an algorithm that exhaustively searches all possible label assignments, and is therefore guaranteed to find the global optimum, also assigns unlabeled data outside the margin expected labels close or equal to zero. As mentioned above, this is to be expected from a discriminative point of view. Data points outside the margin are not important for separating the examples, and therefore all individual probability distributions of these data points revert to their prior probability distributions.
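The per-point label induction can be sketched numerically. The Python sketch below assumes that each unlabeled point contributes λ + log(1 − λ/c) − log cosh(λs) to the objective and that its expected label is tanh(λs); both forms follow from the margin prior exp[−c(1−γ)] and a label prior of 1/2, but they are reconstructions, and the function name and bisection solver are illustrative choices.

```python
import math


def induce_label(s, c, iters=200):
    """Return (lam, expected_label) for one unlabeled point with score s.

    Maximizes lam + log(1 - lam / c) - log(cosh(lam * s)) over 0 <= lam < c
    by bisection on its derivative (the per-point objective is concave).
    """
    def grad(lam):
        return 1.0 - 1.0 / (c - lam) - s * math.tanh(lam * s)

    if grad(0.0) <= 0.0:          # happens for c <= 1: the maximum is at lam = 0
        return 0.0, 0.0
    lo, hi = 0.0, c * (1.0 - 1e-12)
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if grad(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    lam = 0.5 * (lo + hi)
    return lam, math.tanh(lam * s)
```

Under these assumptions, a point far outside the margin (for example s = 10 with c = 5) receives a small multiplier and an expected label near zero, while a point on the hyperplane (s = 0) receives a large multiplier but an expected label of exactly zero, matching the qualitative shape described for FIG. 1.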

The M step of the Jaakkola transductive classification algorithm referenced herein determines the probability distributions for the hyperplane parameters, the bias term, and the margins of the data points that are closest to the respective prior distributions, under the constraints

$$\forall t:\quad s_t\,\langle y_t\rangle - \langle\gamma_t\rangle \ge 0 \qquad (5)$$

where s_t is the classification score of the t-th data point, ⟨y_t⟩ its expected label, and ⟨γ_t⟩ its expected margin. For labeled data, the expected label is fixed and is either ⟨y⟩ = +1 or ⟨y⟩ = −1. The expected label for unlabeled data lies in the interval (−1, +1) and is estimated in the label induction step. According to Equation 5, unlabeled data have to fulfill tighter classification constraints than labeled data, because the classification score is scaled by the expected label. Furthermore, given the dependence of the expected label on the classification score (see FIG. 1), unlabeled data close to the separating hyperplane have the most stringent classification constraints, because both their scores and the absolute values of their expected labels |⟨y_t⟩| are small. The full objective function of the M step, given the prior distributions mentioned above, is

$$J(\lambda) = -\frac{1}{2}\sum_{t,t'} \langle y_t\rangle\langle y_{t'}\rangle\,\lambda_t\lambda_{t'}\,(X_t \cdot X_{t'}) + \sum_t\Bigl[\lambda_t + \log\Bigl(1-\frac{\lambda_t}{c}\Bigr)\Bigr] - \frac{\sigma_b^2}{2}\Bigl(\sum_t \langle y_t\rangle\,\lambda_t\Bigr)^2 \qquad (6)$$
The first term is derived from the Gaussian hyperplane parameter prior distribution, the second term is the margin prior regularization term, and the last term is the bias prior regularization term derived from a Gaussian prior with zero mean and variance σ_b^2. The prior distribution over the bias term can be interpreted as a prior distribution over the class prior probability. Accordingly, the regularization term corresponding to the bias prior distribution constrains the weight of the positive examples relative to the negative examples. According to Equation 6, the contribution of the bias term is minimized when the collective pull of the positive examples on the hyperplane equals the collective pull of the negative examples. The collective constraint on the Lagrange multipliers owing to the bias prior is weighted by the expected labels of the data points and is therefore less restrictive for unlabeled data than for labeled data. Thus, unlabeled data have the ability to influence the final solution more strongly than labeled data.

In summary, in the M step of the Jaakkola transductive classification algorithm referenced herein, unlabeled data have to fulfill stricter classification constraints than labeled data, and their cumulative weight on the solution is less constrained than that of labeled data. In addition, unlabeled data with an expected label close to zero that lie within the margin of the current M step influence the solution the most. The net effect of formulating the E and M steps this way is illustrated in FIG. 2 by applying the algorithm to a data set. The data set includes two labeled examples, a negative example (×) at x-position −1 and a positive example (+) at +1, and six unlabeled examples (○) between −1 and +1 along the x-axis. The cross (×) denotes the labeled negative example, the plus sign (+) the labeled positive example, and the circles (○) the unlabeled data. The different plots show the separating hyperplanes determined at various iterations of the M step. The final solution chosen by the Jaakkola transductive MED classifier referenced herein misclassifies the positive labeled training example.

FIG. 2 shows several iterations of the M step. At the first iteration of the M step, no unlabeled data are considered and the separating hyperplane is located at x = 0. One unlabeled data point with a negative x-value is closer to this separating hyperplane than any other unlabeled data point. In the following label induction step, it is assigned the smallest |⟨y⟩| and, accordingly, in the next M step it has the most power to push the hyperplane towards the positive labeled example. The specific shape of the expected label ⟨y⟩ as a function of the classification score, which is determined by the chosen cost factor (see FIG. 1), combined with the particular spacing of the unlabeled data points, creates a bridge effect in which the separating hyperplane moves closer and closer to the positive labeled example at each consecutive M step. Intuitively, the M step suffers from a kind of short-sightedness: the unlabeled data point closest to the current separating hyperplane determines the final position of the plane the most, and data points farther away matter little. Finally, because the bias prior term restricts the collective pull of unlabeled data less than the collective pull of labeled data, the separating hyperplane moves beyond the positive labeled example, yielding the final solution, the 15th iteration in FIG. 2, which misclassifies the positive labeled example. A bias variance of σ_b^2 = 1 and a cost factor of c = 10 were used in FIG. 2. With σ_b^2 = 1, any cost factor in the range 9.8 < c < 13 results in a final hyperplane that misclassifies the one positive labeled example. Cost factors outside the interval 9.8 < c < 13 yield separating hyperplanes somewhere between the two labeled examples.

This instability of the algorithm is not restricted to the example shown in FIG. 2; it has also been experienced when applying the Jaakkola method referenced herein to real-world data sets, including the Reuters data set known to those skilled in the art. The inherent instability of the method described in Table 2 is a major shortcoming of this implementation and restricts its general usability, although the Jaakkola method may be implemented in some embodiments of the present invention.

One preferred approach of the present invention employs transductive classification using the framework of maximum entropy discrimination (MED). It should be understood that various embodiments of the present invention, while applicable to classification, are also applicable to other MED learning problems using transduction, including, but not limited to, transductive MED regression and graphical models.

Maximum entropy discrimination constrains and reduces the possible solutions by assuming a prior probability distribution over the parameters. The final solution is the expectation of all possible solutions according to the probability distribution that is closest to the assumed prior probability distribution, under the constraint that the expected solution describes the training data correctly. The prior probability distribution over solutions maps to a regularization term, i.e., by choosing a specific prior distribution one has selected a specific regularization.
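As a short worked example of how a prior maps to a regularization term, consider a Gaussian prior with zero mean and unit variance over an n-dimensional parameter vector Θ (the dimension n is introduced here only for illustration). Its negative logarithm is, up to a constant, the familiar quadratic penalty on the weights:

```latex
p_0(\Theta) = (2\pi)^{-n/2} \exp\!\Bigl(-\tfrac{1}{2}\,\Theta^{T}\Theta\Bigr)
\quad\Longrightarrow\quad
-\log p_0(\Theta) = \tfrac{1}{2}\,\lVert\Theta\rVert^{2} + \tfrac{n}{2}\log(2\pi).
```

Choosing this prior therefore corresponds to the squared-norm regularization used in maximum-margin classifiers; a different prior would induce a different penalty.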

Discriminative estimation as applied by support vector machines is effective in learning from few examples. The method and apparatus of one embodiment of the present invention have this in common with support vector machines: they do not attempt to estimate more parameters than necessary for solving the given problem and consequently yield a sparse solution. This is in contrast to generative model estimation, which attempts to explain the underlying process and in general needs more statistics than discriminative estimation. On the other hand, generative models are more versatile and can be applied to a larger variety of problems. In addition, generative model estimation enables straightforward inclusion of prior knowledge. The method and apparatus of one embodiment of the present invention using maximum entropy discrimination bridge the gap between purely discriminative estimation, e.g., support vector machine learning, and generative model estimation.

The method of one embodiment of the present invention shown in Table 3 is an improved transductive MED classification algorithm that does not have the instability problem of the method discussed in Jaakkola referenced herein. Differences include, but are not limited to, that in one embodiment of the present invention every data point has its own cost factor, proportional to the absolute value of its label expectation |⟨y⟩|. In addition, the label prior probability of each data point is updated after each M step according to the estimated class membership probability as a function of the distance of the data point to the decision function. The method of one embodiment of the present invention is described in Table 3 below.

(Improved transductive MED classification)

Scaling the cost factors of the data points by |⟨y⟩| mitigates the problem that the unlabeled data can have a stronger cumulative pull on the hyperplane than the labeled data, because the cost factors of unlabeled data are now smaller than those of labeled data, i.e., the individual contribution of each unlabeled data point to the final solution is always smaller than the individual contribution of a labeled data point. However, if the amount of unlabeled data is much larger than the amount of labeled data, the unlabeled data can still influence the final solution more than the labeled data. In addition, cost factor scaling combined with updating the label prior probability using the estimated class probability solves the bridge effect problem outlined above. In the first M steps, unlabeled data have small cost factors, which yield an expected label that is a very flat function of the classification score (see FIG. 1); accordingly, to some extent, all unlabeled data are allowed to pull on the hyperplane, albeit only with small weight. Moreover, owing to the updating of the label prior probability, unlabeled data far from the separating hyperplane are not assigned an expected label close to zero but, after several iterations, a label close to either y = +1 or y = −1, and are thus slowly treated like labeled data.
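The two modifications can be sketched in a few lines of Python. The proportionality constant `c`, the logistic mapping from score to class probability, and both function names are assumptions made for illustration; the text states only that the cost factor is proportional to |⟨y⟩| and that the label prior is updated from an estimated class membership probability that depends on the distance to the decision function.

```python
import math


def scaled_cost_factors(expected_labels, c):
    """Per-point cost factor c_t = c * |<y_t>| (labeled points have |<y_t>| = 1)."""
    return [c * abs(y) for y in expected_labels]


def updated_label_priors(scores, slope=1.0):
    """Hypothetical prior update: logistic estimate of P(y = +1) from the signed score."""
    return [1.0 / (1.0 + math.exp(-slope * s)) for s in scores]
```

For expected labels `[1.0, -1.0, 0.2, -0.05]` and `c = 10`, the two labeled points keep the full cost factor of 10 while the unlabeled points receive 2 and 0.5, so each unlabeled point pulls on the hyperplane less than any labeled point.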

In a specific implementation of the method of one embodiment of the present invention, a Gaussian prior with zero mean and unit variance is assumed for the decision function parameters Θ:

$$p_0(\Theta) = \frac{1}{\sqrt{(2\pi)^n}}\; e^{-\frac{1}{2}\Theta^{T}\Theta}$$

where n is the dimension of Θ.

The prior distribution over the decision function parameters incorporates important prior knowledge about the specific classification problem at hand. Other prior distributions over the decision function parameters that are important for classification problems are, for example, a multinomial distribution, a Poisson distribution, a Cauchy distribution (Breit-Wigner), a Maxwell-Boltzmann distribution, or a Bose-Einstein distribution.

The prior distribution over the threshold b of the decision function is a Gaussian distribution with mean μ_b and variance σ_b² (Equation 8).
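The two Gaussian priors described above can be written out explicitly. The following is a sketch of the standard densities implied by the stated means and variances, not a reproduction of the original equation images; the dimension n of Θ and the reading of "unit variance" as an identity covariance matrix are assumptions introduced here.

$$
p_0(\Theta) = (2\pi)^{-n/2} \exp\!\left(-\tfrac{1}{2}\,\Theta^{\top}\Theta\right),
\qquad
p_0(b) = \frac{1}{\sqrt{2\pi\sigma_b^{2}}} \exp\!\left(-\frac{(b-\mu_b)^{2}}{2\sigma_b^{2}}\right)
$$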

The prior distribution chosen for the classification margin γ_i of a data point is given by Equation 9, where c is the cost factor. This prior distribution differs from the one used in Jaakkola, referenced herein, which has the form exp[−c(1−γ)]. The form given by Equation 9 is preferred over the form used in Jaakkola because it yields a positive expected margin even for cost factors smaller than 1, whereas the form exp[−c(1−γ)] yields a negative expected margin for c < 1.
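The claim about the form exp[−c(1−γ)] can be checked numerically. The sketch below assumes that this prior is normalized as c·exp[−c(1−γ)] on the range γ ≤ 1, in which case its mean is 1 − 1/c; the range, the normalization, and the function name are assumptions made for illustration. Equation 9 itself is not evaluated here.

```python
import math

def expected_margin_exponential_prior(c, lower=-100.0, steps=200_000):
    """Midpoint-rule estimate of E[gamma] under the assumed density
    p(gamma) = c * exp(-c * (1 - gamma)) for gamma <= 1."""
    h = (1.0 - lower) / steps
    total = 0.0
    for k in range(steps):
        g = lower + (k + 0.5) * h          # midpoint of the k-th slice
        total += g * c * math.exp(-c * (1.0 - g)) * h
    return total

margin_small_cost = expected_margin_exponential_prior(0.5)  # analytic value: 1 - 1/0.5
margin_large_cost = expected_margin_exponential_prior(2.0)  # analytic value: 1 - 1/2.0
```

Under these assumptions the expected margin is negative for c = 0.5 and positive for c = 2, in line with the statement that this form gives a negative expected margin for c < 1.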

Given these prior distributions, the corresponding partition function Z is straightforward to determine (see, for example, T. M. Cover and J. A. Thomas, "Elements of Information Theory", John Wiley & Sons, Inc.) (Cover), and the objective function follows from it. According to Jaakkola, referenced herein, the objective function of the M stage is given by Equation 11 and the objective function of the E stage is given by Equation 12. Here, s_t is the classification score of the t-th data point determined in the previous M stage, and p_{0,t}(y_t) is the binary label prior probability of the data point. The label prior probability is initialized to p_{0,t}(y_t) = 1 for labeled data; for unlabeled data it is initialized either to the non-informative prior probability p_{0,t}(y_t) = 1/2 or to the class prior probability.
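The initialization of the label prior probabilities can be sketched in a few lines. This is a minimal illustration under stated assumptions: the function names are hypothetical, the prior is stored as the pair (p(y=+1), p(y=−1)), and the expected label shown is the one implied by the prior alone, before any classification score is taken into account.

```python
def initial_label_prior(label=None, class_prior=0.5):
    """Return (p(y=+1), p(y=-1)) for one data point.

    label: +1 or -1 for a labeled point, None for an unlabeled point.
    class_prior: prior probability of the positive class for unlabeled
                 points; 0.5 is the non-informative choice.
    """
    if label is None:
        return (class_prior, 1.0 - class_prior)
    return (1.0, 0.0) if label > 0 else (0.0, 1.0)

def expected_label(prior):
    """Expected label <y> = (+1) * p(+1) + (-1) * p(-1)."""
    p_pos, p_neg = prior
    return p_pos - p_neg
```

A labeled point thus starts with an expected label of +1 or −1, an unlabeled point with the non-informative prior starts at 0, and an unlabeled point with a class prior of, say, 0.8 starts at 0.6.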

The section entitled M Stage herein describes an algorithm for solving the M-stage objective function, and the section entitled E Stage herein describes the E-stage algorithm.

The EstimateClassProbability stage in row 5 of Table 3 uses the training data to determine calibration parameters that convert classification scores into class membership probabilities, that is, the probability p(c|s) of a class given the score. Related methods for estimating the calibration of scores to probabilities are described in J. Platt, "Probabilistic outputs for support vector machines and comparison to regularized likelihood methods", pages 61-74, 2000 (Platt), and in B. Zadrozny and C. Elkan, "Transforming classifier scores into accurate multi-class probability estimates", 2002 (Zadrozny).
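One common way to calibrate scores to probabilities is to fit a sigmoid of the form p(c=+1|s) = 1/(1 + exp(A·s + B)), the parametric form associated with Platt. The sketch below is only an illustration of that idea: it uses plain gradient descent on the log loss rather than the optimizer or target smoothing of the cited work, and the function names, learning rate, and sample data are assumptions.

```python
import math

def class_probability(score, A, B):
    """Calibrated probability p(c=+1 | s) = 1 / (1 + exp(A*s + B))."""
    return 1.0 / (1.0 + math.exp(A * score + B))

def fit_sigmoid_calibration(scores, labels, learning_rate=0.1, epochs=2000):
    """Fit A and B by gradient descent on the average log loss."""
    A, B = 0.0, 0.0
    n = len(scores)
    for _ in range(epochs):
        grad_A = grad_B = 0.0
        for s, y in zip(scores, labels):
            target = 1.0 if y > 0 else 0.0
            p = class_probability(s, A, B)
            # For this parameterization, d(loss)/d(A*s + B) = target - p.
            grad_A += (target - p) * s
            grad_B += (target - p)
        A -= learning_rate * grad_A / n
        B -= learning_rate * grad_B / n
    return A, B

# Hypothetical scores and labels (not linearly separable on purpose).
scores = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
labels = [-1, -1, 1, -1, 1, 1]
A, B = fit_sigmoid_calibration(scores, labels)
```

After fitting, A is negative, so larger scores map to larger probabilities of the positive class.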

With particular reference to FIG. 3, crosses (×) indicate labeled negative examples, plus signs (+) indicate labeled positive examples, and circles (○) indicate unlabeled data. The individual plots show the separating hyperplane determined at various iterations of the M stage. The 20th iteration shows the final solution selected by the improved transductive MED classifier. FIG. 3 thus shows the improved transductive MED classification algorithm applied to the toy data set introduced above. The parameters used are c = 10, σ_b² = 1, and μ_b = 0. Different values of c yield separating hyperplanes located between x ≈ −0.5 and x = 0: for c < 3.5 the hyperplane lies to the right of the one unlabeled data point with x < 0, and for c ≥ 3.5 it lies to the left of this unlabeled data point.

With particular reference to FIG. 4, a control flow diagram illustrating a method for classifying unlabeled data according to one embodiment of the present invention is shown. The method 100 begins at step 102 and, at step 104, accesses stored data 106. The data is stored in a storage area and includes labeled data, unlabeled data, and at least one predetermined cost factor. Data 106 includes data points having assigned labels. An assigned label identifies whether the labeled data point is intended to be included in a particular category or excluded from a particular category.

Once the data has been accessed at step 104, the method of one embodiment of the present invention determines, at step 108, the label prior probabilities of the data points using the label information of the data points. At step 110, the expected labels of the data points are then determined according to the label prior probabilities. Using the expected labels determined at step 110 together with the labeled data, the unlabeled data, and the cost factor, step 112 iteratively trains the transductive MED classifier while scaling the cost factors of the unlabeled data points. In each iteration of the calculation, the cost factors of the data points are scaled. The MED classifier thus learns through repeated iterations of the calculation. The trained classifier then accesses the input data 114 at step 116. The trained classifier can then complete the step of classifying the input data at step 118, and the method ends at step 120.

It should be understood that the unlabeled data of 106 and the input data 114 may be derived from a single source. The input data/unlabeled data can thus be used in the iterative process of 112, the result of which is then used for classification at 118. Furthermore, one embodiment of the present invention contemplates that the input data 114 may include a feedback mechanism that supplies the input data to the data stored at 106, so that the MED classifier of 112 can learn dynamically from newly input data.

With particular reference to FIG. 5, a control flow diagram is shown illustrating another method for classifying unlabeled data according to one embodiment of the present invention, which includes user-defined prior probability information. Method 200 begins at step 202 and, at step 204, accesses stored data 206. Data 206 includes labeled data, unlabeled data, a predetermined cost factor, and prior probability information provided by a user. The labeled data of 206 includes data points having assigned labels. An assigned label identifies whether the labeled data point is intended to be included in a particular category or excluded from a particular category.

At step 208, expected labels are calculated from the data of 206. At step 210, the expected labels are then used, together with the labeled data, the unlabeled data, and the cost factor, to iteratively train the transductive MED classifier. The iterative calculation of 210 scales the cost factors of the unlabeled data at each iteration. The calculation continues until the classifier is properly trained.

The trained classifier then accesses, at 214, the input data 212. The trained classifier can then complete the step of classifying the input data at step 216. As with the process and method described for FIG. 4, the input data and the unlabeled data may be derived from a single source and may be input to the system at both 206 and 212. The input data 212 can thus influence the training at 210, so that the process can change dynamically over time with continuing input data.

In both methods shown in FIGS. 4 and 5, a monitor may determine whether the system has reached convergence. Convergence may be determined when the change in the hyperplane between successive iterations of the MED calculation falls below a predetermined threshold. In an alternative embodiment of the present invention, convergence may be determined when the change in the determined expected labels falls below a predetermined threshold. Once convergence is reached, the iterative training process may end.

With particular reference to FIG. 6, a more detailed control flow diagram of the iterative training process of at least one embodiment of the method of the present invention is shown. Process 300 begins at step 302, and at step 304 data is accessed from data 306. Data 306 may include labeled data, unlabeled data, at least one predetermined cost factor, and prior probability information. Each labeled data point of 306 includes a label identifying whether the data point is a training example for data points to be included in a specified category or a training example for data points to be excluded from the specified category. The prior probability information of 306 includes probability information for the labeled data set and the unlabeled data set.

At step 308, expected labels are determined from the prior probability information of 306. At step 310, the cost factor of each unlabeled data point is scaled in proportion to the absolute value of the expected label of that data point. At step 312, the MED classifier is then trained by determining the decision function that maximizes the margin between the included training examples and the excluded training examples, using the labeled and unlabeled data as training examples according to their expected labels. At step 314, classification scores are determined using the classifier trained at 312. At 316, the classification scores are calibrated to class membership probabilities. At step 318, the label prior probability information is updated based on the class membership probabilities. At step 320, a MED calculation is performed to determine the probability distributions of the labels and margins, using the previously determined classification scores. As a result, new expected labels are calculated at step 322, and the expected labels are updated at step 324 using the results from step 322. At step 326, the method determines whether convergence has been reached. If so, the method ends at step 328. If convergence has not been reached, another iteration of the method is performed, starting at step 310. The iterations are repeated until convergence is reached, so that the MED classifier is trained iteratively. Convergence may be reached when the change in the decision function between successive iterations of the MED calculation falls below a predetermined value. In an alternative embodiment of the present invention, convergence may be reached when the change in the determined expected label values falls below a predetermined threshold.
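The control flow of FIG. 6 can be sketched as a loop skeleton. The sketch below shows only the ordering of the steps: the actual MED training and E-stage calculations are passed in as callables, and every name here (`train_transductive`, `train_and_score`, `update_expected_labels`, the toy stand-ins, and the numeric values) is hypothetical. Convergence is tested on the change in expected labels, which is one of the two criteria mentioned above.

```python
import math

def train_transductive(expected_labels, is_labeled, base_cost,
                       train_and_score, update_expected_labels,
                       threshold=1e-3, max_iterations=50):
    labels = list(expected_labels)
    costs = []
    for iteration in range(1, max_iterations + 1):
        # Step 310: scale the cost factor of each unlabeled point by |<y>|.
        costs = [base_cost if known else base_cost * abs(y)
                 for y, known in zip(labels, is_labeled)]
        # Steps 312-314: train the classifier and compute scores.
        scores = train_and_score(labels, costs)
        # Steps 316-324: calibrate, update priors, recompute expected labels.
        new_labels = update_expected_labels(scores, labels)
        # Step 326: stop when the expected labels no longer change much.
        change = max(abs(new - old) for new, old in zip(new_labels, labels))
        labels = new_labels
        if change < threshold:
            break
    return labels, costs, iteration

# Toy stand-ins so the skeleton can be run; they are not MED calculations.
is_labeled = [True, True, False, False]
fixed_scores = [2.0, -2.0, 0.5, -0.5]

def toy_train_and_score(labels, costs):
    return fixed_scores

def toy_update(scores, labels):
    return [y if known else math.tanh(s)
            for s, y, known in zip(scores, labels, is_labeled)]

final_labels, final_costs, iterations = train_transductive(
    [1.0, -1.0, 0.0, 0.0], is_labeled, 10.0, toy_train_and_score, toy_update)
```

With the toy stand-ins, the unlabeled points start with zero cost, receive nonzero expected labels after the first pass, and the loop stops once a second pass leaves those labels unchanged.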

FIG. 7 illustrates a network architecture 700 according to one embodiment. As shown, a plurality of remote networks 702 is provided, including a first remote network 704 and a second remote network 706. A gateway 707 may be coupled between the remote networks 702 and an adjacent network 708. In the context of the present network architecture 700, the networks 704, 706 may each take any form, including, but not limited to, a LAN, a WAN such as the Internet, a PSTN, an internal telephone network, and the like.

In use, the gateway 707 serves as an entry point from the remote networks 702 to the adjacent network 708. As such, the gateway 707 may function as a router that directs a given data packet arriving at the gateway 707, and as a switch that provides the actual path into and out of the gateway 707 for a given packet.

Further included is at least one data server 714 coupled to the adjacent network 708 and accessible from the remote networks 702 via the gateway 707. It should be noted that the data server(s) 714 may include any type of computing device/groupware. A plurality of user devices 716 is coupled to each data server 714. Such user devices 716 may include a desktop computer, a laptop computer, a handheld computer, a printer, or any other type of logic. It should be noted that, in one embodiment, a user device 717 may also be directly coupled to any of the networks.

A facsimile machine 720 or a series of facsimile machines 720 may be coupled to one or more of the networks 704, 706, 708.

It should be noted that databases and/or additional components may be used with, or integrated into, any type of network element coupled to the networks 704, 706, 708. In the context of the present description, a network element may refer to any component of a network.

FIG. 8 illustrates a representative hardware environment associated with the user device 716 of FIG. 7, according to one embodiment. The figure shows a typical hardware configuration of a workstation having a central processing unit 810, such as a microprocessor, and a number of other units interconnected via a system bus 812.

The workstation shown in FIG. 8 includes a random access memory (RAM) 814; a read only memory (ROM) 816; an I/O adapter 818 for connecting peripheral devices, such as a magnetic disk device 820, to the bus 812; a user interface adapter 822 for connecting a keyboard 824, a mouse 826, a speaker 828, a microphone 832, and/or other interface devices, such as a touch screen and a digital camera (not shown), to the bus 812; a communication adapter 834 for connecting the workstation to a communication network 835 (e.g., a data processing network); and a display adapter 836 for connecting the bus 812 to a display device 838.

With particular reference to FIG. 9, an apparatus 414 of one embodiment of the present invention is shown. One embodiment of the present invention comprises a memory device 814 for storing labeled data 416. Each labeled data point 416 includes a label indicating whether the data point is a training example for data points included in a specified category or a training example for data points excluded from the specified category. The memory 814 also stores unlabeled data 418, prior probability data 420, and cost factor data 422.

The processor 810 accesses the data from the memory 814 and trains a binary classifier using transductive MED calculations so that the classifier can classify unlabeled data. The processor 810 performs the iterative transductive calculation using the cost factor and the training examples from the labeled and unlabeled data, and scales the cost factor as a function of the expected label values. This scaling affects the cost factor data 422, which is then fed back into the processor 810. The cost factor 422 therefore changes with each iteration of the MED classification performed by the processor 810. Once the processor 810 has properly trained the MED classifier, the processor can then build the classifier that classifies the unlabeled data into classified data 424.

Prior art transductive SVM and MED formulations lead to an exponential growth in the number of possible label assignments, so approximations have to be developed for practical use. In an alternative embodiment of the present invention, a different formulation of transductive MED classification is introduced that does not suffer from an exponential growth of possible label assignments and that allows a general closed-form solution. For a linear classifier, the problem is formulated as follows: find the distribution p(Θ) over the hyperplane parameters, the bias distribution p(b), and the distribution p(γ) over the data point classification margins such that their joint probability distribution has minimal Kullback-Leibler divergence KL to the corresponding joint prior distribution p_0, subject to one constraint for the labeled data and another constraint for the unlabeled data. Here, ΘX_t is the dot product of the weight vector of the separating hyperplane and the feature vector of the t-th data point. No prior distribution over labels is needed. The labeled data are constrained to lie on the correct side of the separating hyperplane according to their known labels, whereas the only requirement for the unlabeled data is that the square of their distance to the hyperplane be greater than the margin. In summary, this embodiment of the present invention finds a separating hyperplane that is a compromise between being closest to the chosen prior distribution, separating the labeled data correctly, and having no unlabeled data within the margin. The advantage is that no prior distribution over labels has to be introduced, which avoids the problem of exponentially growing label assignments.
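The two kinds of constraint can be illustrated with a pointwise check for a single fixed hyperplane. This is a simplified sketch, not the formulation itself, which constrains expectations under the distributions p(Θ), p(b), and p(γ). The function name, the sign convention ΘX − b for the decision value, the use of one shared margin, and the use of the unnormalized decision value as the distance are all assumptions made for illustration.

```python
def satisfies_constraints(theta, b, margin, labeled, unlabeled):
    """Check the labeled and unlabeled constraints for one hyperplane.

    labeled:   list of (feature_vector, label) pairs with label +1 or -1
    unlabeled: list of feature vectors
    """
    def decision_value(x):
        return sum(w * xi for w, xi in zip(theta, x)) - b

    # Labeled points must lie on the correct side, at least a margin away.
    labeled_ok = all(y * decision_value(x) >= margin for x, y in labeled)
    # Unlabeled points may lie on either side, but not inside the margin:
    # the squared decision value must be at least the margin.
    unlabeled_ok = all(decision_value(x) ** 2 >= margin for x in unlabeled)
    return labeled_ok and unlabeled_ok

# Hypothetical one-dimensional example.
labeled_points = [([2.0], 1), ([-2.0], -1)]
outside_margin = satisfies_constraints([1.0], 0.0, 1.0, labeled_points, [[1.5], [-3.0]])
inside_margin = satisfies_constraints([1.0], 0.0, 1.0, labeled_points, [[0.2]])
```

Because the unlabeled constraint is quadratic in the decision value, it is satisfied on either side of the hyperplane, which is why no label prior is needed for the unlabeled points.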

In a particular implementation of this alternative embodiment of the present invention, using the prior distributions given in Equation 7, Equation 8, and Equation 9 for the hyperplane parameters, the bias, and the margins, a partition function is obtained in which t is the index of the labeled data and t′ is the index of the unlabeled data. With a more compact notation, Equation 16 can be rewritten, and after integration a partition function is obtained from which the final objective function follows. This objective function can be solved by applying a technique similar to the one described herein in the section entitled M Stage for the case of known labels. The difference is that the matrix G_3^−1 in the quadratic form of the maximum margin term now has off-diagonal terms.

Beyond classification, there are many applications of the method of the present invention that employ the maximum entropy discrimination framework. For example, MED can be applied to general classification of data, to any kind of discriminant function and prior distribution, and to regression and graphical models (T. Jebara, "Machine Learning: Discriminative and Generative", Kluwer Academic Publishers) (Jebara).

Applications of embodiments of the present invention can be formulated as purely inductive learning problems with known labels, and as transductive learning problems with labeled and unlabeled training examples. In the latter case, the improvements to the transductive MED classification algorithm described in Table 3 are equally applicable to general transductive MED classification, transductive MED regression, and transductive MED learning of graphical models. Accordingly, for purposes of this disclosure and the claims, the word "classification" may include regression or graphical models.

(M stage)
According to Equation 11, the Lagrange multipliers λ_t of the M stage are determined by maximizing the objective function J_M. Omitting the redundant constraint λ_t < c, a Lagrangian is formed for this dual problem. The KKT conditions, which are necessary and sufficient for optimality, are expressed in terms of a quantity F_t. At the optimum, the bias equals the expected bias ⟨b⟩.

These equations can be summarized by considering two cases using the constraint δ_t λ_t = 0. The first case is λ_t = 0, and the second case is 0 < λ_t < c. A third case, as described for the SVM algorithm in S. Keerthi, S. Shevade, C. Bhattacharyya, and K. Murthy, "Improvements to Platt's SMO algorithm for SVM classifier design", 1999 (Keerthi), is not needed, because the potential function in this formulation keeps λ_t ≠ c.

Until the optimum is reached, these conditions are violated for some data points t. That is, F_t ≠ −⟨b⟩ when λ_t is nonzero, or F_t⟨y_t⟩ < −⟨b⟩⟨y_t⟩ when λ_t is zero. Unfortunately, ⟨b⟩ cannot be calculated without the optimal λ_t. A good solution to this is borrowed from Keerthi, again referenced herein, by constructing three sets.

Using these sets, the most extreme violations of the optimality conditions can be defined as follows. The elements of I_0 are violators whenever they are not equal to −⟨b⟩; therefore, the largest and the smallest F_t from I_0 are candidates for being violators. The elements of I_1 are violators when F_t < −⟨b⟩, so the smallest element from I_1, if any, is the most extreme violator. Lastly, the elements of I_4 are violators when F_t > −⟨b⟩, which makes the largest element from I_4 a candidate violator. Thus, −⟨b⟩ is bounded by the minimum and maximum over these sets.

Since at the optimum −b_up and −b_low must be equal, namely to −⟨b⟩, reducing the gap between −b_up and −b_low pushes the training algorithm toward convergence. The gap can also be measured as a way of determining numerical convergence.
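The gap between −b_up and −b_low can be computed as sketched below. The definitions of the three sets are not reproduced in this text, so the sketch infers them from the violation conditions stated above: I_0 is taken to hold the points with nonzero λ_t, I_1 the points with λ_t = 0 and a positive expected label, and I_4 the points with λ_t = 0 and a negative expected label. These definitions, the function name, and the sample values are assumptions.

```python
import math

def optimality_gap(F, lam, expected_labels, eps=1e-12):
    """Return (gap, -b_up, -b_low) under the inferred set definitions."""
    up_candidates, low_candidates = [], []
    for f, l, y in zip(F, lam, expected_labels):
        if l > eps:               # assumed I_0: F_t should equal -<b>
            up_candidates.append(f)
            low_candidates.append(f)
        elif y > 0:               # assumed I_1: violation if F_t < -<b>
            up_candidates.append(f)
        elif y < 0:               # assumed I_4: violation if F_t > -<b>
            low_candidates.append(f)
    neg_b_up = min(up_candidates, default=math.inf)
    neg_b_low = max(low_candidates, default=-math.inf)
    # A positive gap means at least one optimality condition is violated.
    return neg_b_low - neg_b_up, neg_b_up, neg_b_low

lam = [0.5, 0.2, 0.0, 0.0]
labels = [1.0, -1.0, 1.0, -1.0]
gap_at_optimum, _, _ = optimality_gap([0.3, 0.3, 0.8, -0.5], lam, labels)
gap_with_violation, _, _ = optimality_gap([0.3, 0.6, 0.1, 0.9], lam, labels)
```

In the first call all conditions hold and the gap is zero; in the second the gap is positive, and its size can serve as the numerical convergence measure mentioned above.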

As described above, the value b = ⟨b⟩ is unknown until convergence. The method of this alternative embodiment differs in that only one example can be optimized at a time. The training heuristic is therefore to alternate, on every other pass, between the examples in I_0 and all examples.

(E stage)
The objective function of the E stage is given by Equation 12, where s_t is the classification score of the t-th data point determined in the previous M stage. The Lagrange multipliers λ_t are determined by maximizing this objective function, for which the KKT conditions are necessary and sufficient for optimality. Because the KKT conditions factorize over the examples, the Lagrange multipliers can be solved for from the KKT conditions in a single pass over the examples.

For the labeled examples, the expected label ⟨y_t⟩ has P_{0,t}(y_t) = 1 and P_{0,t}(−y_t) = 0, which simplifies the KKT conditions and yields an explicit solution for the Lagrange multipliers of the labeled examples. For the unlabeled examples, Equation 35 cannot be solved analytically; instead, the Lagrange multiplier of each unlabeled example that satisfies Equation 35 must be determined numerically, for example by applying a line search.
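A one-dimensional numerical search of this kind can be sketched with simple bisection, which is one basic form of such a search and is used here in place of whatever line search an implementation would actually choose. Equation 35 is not reproduced in this text, so the `residual` callable below is a hypothetical stand-in for it: a function of one multiplier that is zero at the solution and changes sign on the search interval.

```python
import math

def solve_multiplier(residual, lower, upper, tolerance=1e-10, max_steps=200):
    """Find a root of residual(lam) on [lower, upper] by bisection."""
    f_low, f_up = residual(lower), residual(upper)
    if f_low * f_up > 0:
        raise ValueError("residual must change sign on [lower, upper]")
    for _ in range(max_steps):
        mid = 0.5 * (lower + upper)
        f_mid = residual(mid)
        if abs(f_mid) < tolerance or (upper - lower) < tolerance:
            return mid
        if f_low * f_mid <= 0:
            upper = mid                  # root lies in the lower half
        else:
            lower, f_low = mid, f_mid    # root lies in the upper half
    return 0.5 * (lower + upper)

# Toy residual with a known root, standing in for the real condition.
toy_root = solve_multiplier(lambda lam: math.tanh(lam) - 0.5, 0.0, 5.0)
```

Because the conditions factorize over the examples, a search like this could be run independently for each unlabeled example in a single pass.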

The following are several non-limiting examples made possible by the techniques described above, by derivatives or variations thereof, and by other techniques known in the art. Each example includes a preferred operation together with optional operations or parameters that may be implemented within the basic preferred method.

In one embodiment, presented in FIG. 10, labeled data points are received at step 1002, each of the labeled data points having at least one label that indicates whether the data point is a training example for data points to be included in a specified category or a training example for data points to be excluded from the specified category. In addition, unlabeled data points are received at step 1004, together with at least one predetermined cost factor of the labeled data points and unlabeled data points. The data points may contain any medium, e.g., words, images, sounds, etc. Prior probability information for the labeled and unlabeled data points may also be received. Also, the label of an included training example may be mapped to a first numeric value, e.g., +1, and the label of an excluded training example may be mapped to a second numeric value, e.g., −1. In addition, the labeled data points, the unlabeled data points, input data points, and the at least one predetermined cost factor of the labeled data points and unlabeled data points may be stored in a memory of a computer.

Further, at step 1006, a transductive MED classifier is trained through iterative calculation using the at least one cost factor and the labeled and unlabeled data points as training examples. For each iteration of the calculation, the cost factor of each unlabeled data point is adjusted as a function of an expected label value, for example the absolute value of the data point's expected label, and the prior probability of the data point's label is adjusted according to an estimate of the data point's class membership probability, which ensures stability. The transductive classifier may also learn using the prior probability information of the labeled and unlabeled data, which further improves stability. The iterative training step may be repeated until the data values converge, for example until the change in the decision function of the transductive classifier falls below a predetermined threshold or the change in the determined expected label values falls below a predetermined threshold.

Further, at step 1008, the trained classifier is applied to classify at least one of the unlabeled data points, the labeled data points, and input data points. The input data points may be received before or after the classifier is trained, or may not be received at all. In addition, a decision function that minimizes the KL divergence to the prior probability distribution of the decision function parameters, given the included and excluded training examples, may be determined using the labeled and unlabeled data points as training examples according to their expected labels. Alternatively, the decision function may be determined by minimum KL divergence using a multinomial distribution for the decision function parameters.

At step 1010, a classification of the classified data points, or a derivative thereof, is output to at least one of a user, another system, and another process. The system may be remote or local. Examples of a derivative of the classification include, but are not limited to, the classified data points themselves, a representation or identifier of the classified data points, or a host file or document.

In another embodiment, computer-executable program code is deployed to and executed on a computer system. The program code comprises instructions for accessing labeled data points stored in a memory of a computer, each labeled data point having at least one label that indicates whether the data point is a training example for data points to be included in a designated category or a training example for data points excluded from the designated category. The program code also comprises instructions for accessing unlabeled data points from the memory of the computer, and instructions for accessing, from the memory of the computer, at least one predetermined cost factor for the labeled and unlabeled data points. Prior probability information for the labeled and unlabeled data points stored in the memory of the computer may also be accessed. The label of an included training example may be mapped to a first numeric value, for example +1, and the label of an excluded training example may be mapped to a second numeric value, for example -1.

Further, the program code comprises instructions for training a transductive classifier through iterative calculation using the at least one stored cost factor and the stored labeled data points and stored unlabeled data points as training examples. For each iteration of the calculation, the cost factor of each unlabeled data point is adjusted as a function of the data point's expected label value, for example the absolute value of its expected label. Also for each iteration, the prior probability information may be adjusted according to an estimate of the data point's class membership probability. The iterative training step may be repeated until the data values converge, for example until the change in the decision function of the transductive classifier falls below a predetermined threshold or the change in the determined expected label values falls below a predetermined threshold.

Further, the program code comprises instructions for applying the trained classifier to classify at least one of the unlabeled data points, the labeled data points, and input data points, and instructions for outputting a classification of the classified data points, or a derivative thereof, to a user, another system, and another process. In addition, a decision function that minimizes the KL divergence to the prior probability distribution of the decision function parameters, given the included and excluded training examples, may be determined using the labeled and unlabeled data points as training examples according to their expected labels.

In yet another embodiment, a data processing apparatus comprises at least one memory for storing (i) labeled data points, each having at least one label that indicates whether the data point is a training example for data points included in a designated category or a training example for data points excluded from the designated category, (ii) unlabeled data points, and (iii) at least one predetermined cost factor for the labeled and unlabeled data points. The memory may also store prior probability information for the labeled and unlabeled data points. The label of an included training example may be mapped to a first numeric value, for example +1, and the label of an excluded training example may be mapped to a second numeric value, for example -1.

Further, the data processing apparatus comprises a transductive classifier trainer that iteratively teaches a transductive classifier using transductive maximum entropy discrimination (MED), with the at least one stored cost factor and the stored labeled and unlabeled data points as training examples. At each iteration of the MED calculation, the cost factor of each unlabeled data point is adjusted as a function of the data point's expected label value, for example the absolute value of its expected label. Also at each iteration of the MED calculation, the prior probability information may be adjusted according to an estimate of the data point's class membership probability. The apparatus may further comprise means for determining convergence of the data values, for example when the change in the decision function of the transductive classifier calculation falls below a predetermined threshold or when the change in the determined expected label values falls below a predetermined threshold, and means for terminating the calculation upon determining convergence.

Further, at least one of the unlabeled data points, the labeled data points, and input data points is classified using the trained classifier. In addition, a decision function that minimizes the KL divergence to the prior probability distribution of the decision function parameters, given the included and excluded training examples, may be determined by a processor using the labeled and unlabeled data as training examples according to their expected labels. A classification of the classified data points, or a derivative thereof, is output to at least one of a user, another system, and another process.

In a further embodiment, an article of manufacture comprises a computer-readable program storage medium that tangibly embodies one or more programs of instructions executable by a computer to perform a method of data classification. In use, labeled data points are received, each having at least one label that indicates whether the data point is a training example for data points included in a designated category or a training example for data points excluded from the designated category. Further, unlabeled data points are received, together with at least one predetermined cost factor for the labeled and unlabeled data points. Prior probability information for the labeled and unlabeled data points may also be stored in a memory of a computer. The label of an included training example may be mapped to a first numeric value, for example +1, and the label of an excluded training example may be mapped to a second numeric value, for example -1.

Further, a transductive classifier is trained through iterative maximum entropy discrimination (MED) calculation using the at least one stored cost factor and the stored labeled and unlabeled data points as training examples. At each iteration of the MED calculation, the cost factor of each unlabeled data point is adjusted as a function of the data point's expected label value, for example the absolute value of its expected label. Also at each iteration of the MED calculation, the prior probability information may be adjusted according to an estimate of the data point's class membership probability. The iterative training step may be repeated until the data values converge, for example until the change in the decision function of the transductive classifier falls below a predetermined threshold or the change in the determined expected label values falls below a predetermined threshold.

Further, input data points are accessed from the memory of the computer, and the trained classifier is applied to classify at least one of the unlabeled data points, the labeled data points, and the input data points. In addition, a decision function that minimizes the KL divergence to the prior probability distribution of the decision function parameters, given the included and excluded training examples, may be determined using the labeled and unlabeled data as training examples according to their expected labels. Further, a classification of the classified data points, or a derivative thereof, is output to at least one of a user, another system, and another process.

In yet another embodiment, a method for classifying unlabeled data in a computer-based system is presented. In use, labeled data points are received, each having at least one label that indicates whether the data point is a training example for data points to be included in a designated category or a training example for data points excluded from the designated category.

Further, labeled and unlabeled data points are received, as is prior label probability information for the labeled and unlabeled data points. In addition, at least one predetermined cost factor for the labeled and unlabeled data points is received.

Further, an expected label is determined for each labeled and unlabeled data point according to the prior probability of the data point's label. The following substeps are repeated until the data values substantially converge:
・generating a scaled cost value for each unlabeled data point in proportion to the absolute value of the data point's expected label;
・training a maximum entropy discrimination (MED) classifier by determining a decision function that minimizes the KL divergence to the prior probability distribution of the decision function parameters, given the included and excluded training examples, using the labeled and unlabeled data as training examples according to their expected labels;
・determining classification scores for the labeled and unlabeled data points using the trained classifier;
・calibrating the output of the trained classifier to class membership probabilities;
・updating the label prior probabilities of the unlabeled data points according to the determined class membership probabilities;
・determining the label and margin probability distributions by maximum entropy discrimination (MED), using the updated label prior probabilities and the previously determined classification scores;
・computing new expected labels using the previously determined label probability distribution; and
・updating the expected label of each data point by interpolating the new expected label with the expected label of the previous iteration.
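The control flow of the substeps above can be sketched as follows. This is an illustration under loudly stated assumptions, not the disclosed MED computation: a cost-weighted logistic regression (`train_stand_in`) is swapped in for the MED training step, a plain sigmoid stands in for score calibration, and the expected label is taken as 2p - 1 instead of being derived from the MED label and margin distributions. The parameter names `unlabeled_cost`, `mix`, `tol`, and `max_iter` are hypothetical.

```python
import math

def score(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def sigmoid(s):
    s = max(-50.0, min(50.0, s))  # clamp to avoid overflow in exp
    return 1.0 / (1.0 + math.exp(-s))

def train_stand_in(xs, targets, costs, epochs=200, lr=0.5):
    """Cost-weighted logistic regression; a stand-in for MED training, not MED."""
    w, b = [0.0] * len(xs[0]), 0.0
    total = sum(costs) or 1.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * len(w), 0.0
        for x, t, c in zip(xs, targets, costs):
            # d/ds of c * log(1 + exp(-t * s)) is -t * c * sigmoid(-t * s)
            g = -t * c * sigmoid(-t * score(w, b, x))
            grad_w = [gw + g * xi for gw, xi in zip(grad_w, x)]
            grad_b += g
        w = [wi - lr * gw / total for wi, gw in zip(w, grad_w)]
        b -= lr * grad_b / total
    return w, b

def transductive_train(labeled, unlabeled, unlabeled_cost=0.5,
                       mix=0.5, tol=1e-3, max_iter=50):
    xs = [x for x, _ in labeled] + list(unlabeled)
    labels = [y for _, y in labeled]
    prior = [0.5] * len(unlabeled)             # assumed uninformative priors
    expected = [2.0 * p - 1.0 for p in prior]  # expected labels from the priors
    w, b = [0.0] * len(xs[0]), 0.0
    for _ in range(max_iter):
        # scale the cost of each unlabeled point by |expected label|
        costs = [1.0] * len(labeled) + [unlabeled_cost * abs(e) for e in expected]
        # train on all points; unlabeled points take the sign of their expectation
        targets = labels + [1 if e >= 0 else -1 for e in expected]
        w, b = train_stand_in(xs, targets, costs)
        # score the unlabeled points and calibrate the scores to probabilities
        probs = [sigmoid(score(w, b, x)) for x in unlabeled]
        prior = probs                          # update the label priors
        new_expected = [2.0 * p - 1.0 for p in prior]
        # interpolate with the expected labels of the previous iteration
        updated = [mix * n + (1.0 - mix) * o for n, o in zip(new_expected, expected)]
        change = max((abs(u - o) for u, o in zip(updated, expected)), default=0.0)
        expected = updated
        if change < tol:                       # expected labels stopped changing
            break
    return w, b, expected

labeled = [([-1.0], -1), ([1.0], 1)]
unlabeled = [[-0.8], [0.9], [-1.2], [1.1]]
w, b, expected = transductive_train(labeled, unlabeled)
```

Because every unlabeled point starts with an expected label of 0, its scaled cost is 0 in the first pass, so the first pass trains on the labeled points alone and the unlabeled points gain influence only as their expected labels move away from 0.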

In addition, a classification of the input data points, or a derivative thereof, is output to at least one of a user, another system, and another process.

Convergence may be reached when the change in the decision function falls below a predetermined threshold. Convergence may also be reached when the change in the determined expected label values falls below a predetermined threshold. Further, the label of an included training example may have any value, for example +1, and the label of an excluded training example may have any value, for example -1.

A method for classifying documents according to one embodiment of the present invention is presented in FIG. 11. In use, at step 1100, at least one seed document having a known confidence level is received, together with unlabeled documents and at least one predetermined cost factor. The seed document and other items may be received from a memory of a computer, from a user, from a network connection, and so on, and may be received after a request from the system performing the method. The at least one seed document may have a label indicating whether the document is included in a designated category, may contain a list of keywords, or may have any other attribute that can assist in classifying documents. Further, at step 1102, a transductive classifier is trained through iterative calculation using the at least one predetermined cost factor, the at least one seed document, and the unlabeled documents, wherein for each iteration of the calculation the cost factor is adjusted as a function of an expected label value. Prior probabilities of data point labels for the labeled and unlabeled documents may also be received, in which case, for each iteration of the calculation, the prior probability of a data point's label may be adjusted according to an estimate of the data point's class membership probability.

Further, after at least some of the iterations, confidence scores for the unlabeled documents are stored at step 1104, and at step 1106 identifiers of the unlabeled documents having the highest confidence scores are output to at least one of a user, another system, and another process. The identifiers may be electronic copies of the documents themselves, portions of them, their titles, their names, their file names, pointers to the documents, and so on. Confidence scores may also be stored after each iteration, in which case identifiers of the unlabeled documents having the highest confidence scores after each iteration are output.
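Selecting the identifiers to output in step 1106 amounts to a top-k selection over the stored confidence scores. A minimal sketch follows, assuming the scores are kept in a dictionary keyed by document identifier; the function name `top_confident`, the file names, the score values, and the cutoff `k` are hypothetical.

```python
import heapq
from typing import Dict, List

def top_confident(scores: Dict[str, float], k: int = 3) -> List[str]:
    """Return identifiers of the k unlabeled documents with the highest scores."""
    return heapq.nlargest(k, scores, key=scores.get)

# Hypothetical confidence scores stored after one training iteration.
scores = {"doc-17.pdf": 0.42, "doc-03.pdf": 0.91, "doc-88.pdf": 0.77, "doc-51.pdf": 0.12}
best = top_confident(scores, k=2)
```

If scores are stored after every iteration, the same call can be repeated on each iteration's dictionary.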

One embodiment of the present invention can discover patterns that link an initial document to the remaining documents. Discovery tasks are an area in which this pattern discovery is especially valuable. For example, in pretrial legal discovery, a large volume of documents must be examined for possible relevance to the lawsuit at hand. The ultimate goal is to find the "smoking gun." In another example, a routine task for inventors, patent examiners, and patent attorneys is to assess the novelty of a technology through a prior art search. Specifically, the task is to search all published patents and other publications and to find, within this set, documents that may be relevant to the particular technology being examined for novelty.

The task of discovery involves finding a document or a set of documents within a set of data. Given an initial document or concept, a user may wish to find documents related to it. However, the nature of the relationship between the initial document or concept and the target documents, that is, the documents to be discovered, is fully understood only after the discovery has taken place. By learning from labeled data points and unlabeled documents, concepts, and the like, the present invention can learn the patterns and relationships between the initial document or documents and the target documents.

A method for analyzing documents associated with legal discovery according to another embodiment of the present invention is presented in FIG. 12. In use, at step 1200, documents associated with a legal matter are received. Such documents may include electronic copies of the documents themselves, portions of them, their titles, their names, their file names, pointers to the documents, and so on. At step 1202, a document classification technique is performed on the documents. At step 1204, identifiers of at least some of the documents are output based on their classification. Optionally, a representation of links between the documents may also be output.

The document classification technique may include any type of process, for example a transductive process. For example, any of the inductive or transductive techniques described above may be used. In a preferred approach, a transductive classifier is trained through iterative calculation using at least one predetermined cost factor, at least one seed document, and the documents associated with the legal matter. For each iteration of the calculation, the cost factor is preferably adjusted as a function of an expected label value, and the trained classifier is used to classify the received documents. The process may further include receiving prior probabilities of data point labels for the labeled and unlabeled documents, wherein for each iteration of the calculation the prior probability of a data point's label is adjusted according to an estimate of the data point's class membership probability. Further, the document classification technique may include one or more of a support vector machine process and a maximum entropy discrimination process.

In yet another embodiment, a method for analyzing prior art documents is presented in FIG. 13. In use, at step 1300, a classifier is trained based on a search query. At step 1302, a plurality of prior art documents are accessed. Such prior art documents may include any information made public in any form before a given date. Such prior art may also, or alternatively, include any information that has not been made public in any form before a given date. Exemplary prior art documents may be any type of document, for example patent office publications, data retrieved from a database, a collection of prior art, or a portion of a website. At step 1304, a document classification technique is performed on at least some of the prior art documents using the classifier, and at step 1306 identifiers of at least some of the prior art documents are output based on their classification. The document classification technique may include any one or more processes, including a support vector machine process, a maximum entropy discrimination process, or any of the inductive or transductive techniques described above. Also, or alternatively, a representation of links between the documents may be output. In yet another embodiment, relevance scores for at least some of the prior art documents are output based on their classification.

The search query may include at least a portion of a patent disclosure. Exemplary patent disclosures include a disclosure prepared by an inventor to summarize an invention, a provisional patent application, a non-provisional patent application, a foreign patent application, or a patent application, and so on.

In one preferred approach, the search query includes at least a portion of a claim from a patent or patent application. In another approach, the search query includes at least a portion of an abstract of a patent or patent application. In yet another approach, the search query includes at least a portion of a summary from a patent or patent application.

FIG. 27 illustrates a method for matching documents to claims. At step 2700, a classifier is trained based on at least one claim of a patent or patent application. Thus, one or more claims, or portions of them, may be used to train the classifier. At step 2702, a plurality of documents are accessed. Such documents may include prior art documents, documents describing potentially infringing or anticipating products, and so on. At step 2704, a document classification technique is performed on at least some of the documents using the classifier. At step 2706, identifiers of at least some of the documents are output based on their classification. Relevance scores for at least some of the documents may also be output based on their classification.
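The overall flow of steps 2700 through 2706 (build a model from claim text, score a set of documents, output identifiers with relevance scores) can be sketched as below. Note the substitution: a bag-of-words cosine similarity is used purely as a simple stand-in scorer and is not the transductive classifier described in this document. The claim text, document identifiers, and document texts are invented for illustration.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple

def bag(text: str) -> Counter:
    """Lowercased whitespace-token counts (a deliberately crude representation)."""
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank_documents(claim: str, docs: Dict[str, str]) -> List[Tuple[str, float]]:
    """Return (identifier, relevance score) pairs, most relevant first."""
    query = bag(claim)
    scored = [(doc_id, cosine(query, bag(text))) for doc_id, text in docs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

claim = "a method for classifying documents using a transductive classifier"
docs = {
    "D1": "a recipe for baking bread with yeast",
    "D2": "classifying documents with a transductive classifier",
}
ranked = rank_documents(claim, docs)
```

The output is a list of identifiers paired with relevance scores, matching the two kinds of output named in step 2706.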

One embodiment of the present invention may be used to classify patent applications. For example, in the United States, patents and patent applications are currently classified by subject matter using the United States Patent Classification (USPC) system. This task is currently performed manually and is therefore very expensive and time consuming. Such manual classification is also prone to human error. The complexity of the task is compounded by the fact that a patent or patent application may be classified into multiple classes.

FIG. 28 illustrates a method for classifying a patent application according to one embodiment. At step 2800, a classifier is trained based on a plurality of documents known to belong to a particular patent classification. Such documents may typically be patents and patent applications (or portions of them), but may also be summary sheets describing the target subject matter of a particular patent classification. At step 2802, at least a portion of a patent or patent application is received. The portion may include the claims, summary, abstract, specification, title, and so on. At step 2804, a document classification technique is performed on the at least a portion of the patent or patent application using the classifier. At step 2806, a classification of the patent or patent application is output. Optionally, a user may manually verify the classification of some or all of the patent applications.

The document classification technique is preferably a yes/no classification technique. In other words, if the probability that a document is in a particular class is above a threshold, the decision is "yes" and the document belongs to this class. If the probability that the document is in the particular class is below the threshold, the decision is "no" and the document does not belong to this class.
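The yes/no rule reduces to a single comparison. In the sketch below, the function name, the default threshold of 0.5, and the choice to treat a probability exactly equal to the threshold as "no" are assumptions; the text specifies only the cases above and below the threshold.

```python
def belongs_to_class(probability: float, threshold: float = 0.5) -> bool:
    """Yes/no decision: True ("yes") only when the probability exceeds the threshold."""
    return probability > threshold

decision_high = belongs_to_class(0.8)  # above the threshold: "yes"
decision_low = belongs_to_class(0.3)   # below the threshold: "no"
```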

FIG. 29 illustrates yet another method for classifying a patent application. At step 2900, a document classification technique is performed on at least a portion of a patent or patent application using a classifier that was trained based on at least one document associated with a particular patent classification. Again, the classification technique is preferably a yes/no classification technique. At step 2902, a classification of the patent or patent application is output.

In either of the methods shown in FIGS. 28 and 29, the method may be repeated using different classifiers, each trained on a plurality of documents known to fall within a different patent classification.
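The yes/no decision and its repetition across several patent classifications can be sketched as follows. This is a minimal illustration, not the patented implementation: the function names, the default threshold of 0.5, and the choice to treat a probability exactly equal to the threshold as "no" are all assumptions, and the per-class probabilities are taken as given rather than produced by a trained classifier.

```python
def yes_no_decision(probability: float, threshold: float = 0.5) -> bool:
    """Return True ("yes") when the class probability is above the threshold."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must lie in [0, 1]")
    # Assumption: a probability exactly at the threshold counts as "no".
    return probability > threshold


def classify_against_classes(class_probabilities: dict[str, float],
                             threshold: float = 0.5) -> dict[str, bool]:
    """Apply one independent yes/no decision per patent classification.

    class_probabilities maps a (hypothetical) class name to the probability
    reported by the classifier trained for that class.
    """
    return {name: yes_no_decision(p, threshold)
            for name, p in class_probabilities.items()}
```

Because each class has its own classifier and its own decision, a document may receive "yes" for several classifications or for none.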

Officially, the classification of a patent should be based on its claims. However, it may also be desirable to perform matching between (any IP-related content) and (any IP-related content). As one example, one approach trains on the specifications of patents and classifies an application based on its claims. Another approach trains on the specification and claims and classifies based on the abstract. In a particularly preferred approach, whatever portion of a patent or patent application is used for training, the same type of content is used at classification time; that is, if the system is trained on claims, classification is performed on claims. The document classification technique may include any type of process, for example a transductive process. For example, any of the inductive or transductive techniques described above may be used. In a preferred approach, the classifier may be a transductive classifier, and the transductive classifier may be trained through iterative calculations using at least one predetermined cost factor, at least one seed document, and prior art documents, where for each iteration of the calculation the cost factor is adjusted as a function of an expected label value, and the trained classifier may be used to classify the prior art documents. Prior probabilities of the data point labels for the seed document and the prior art documents may also be received, where for each iteration of the calculation the prior probabilities of the data point labels may be adjusted according to an estimate of the class membership probabilities of the data points. The seed document may be any document, for example a patent office publication, data retrieved from a database, a collection of prior art, a website, a patent disclosure, and so on.

FIG. 14 illustrates one embodiment of the invention in one approach. In step 1401, a set of data is read in. The goal is to discover the documents in this set that are relevant to the user. In step 1402, an initial seed document or several seed documents are labeled. The documents may be of any type, for example patent office publications, data retrieved from a database, a collection of prior art, a website, and so on. It is also possible to seed the transduction process with different sets of keywords or with documents supplied by the user. In step 1406, a transductive classifier is trained using the labeled data and the given set of unlabeled data. At each label induction step during the iterative transduction process, the confidence scores determined during label induction are stored. Once training is complete, the documents that achieved a high confidence score in the label induction steps are displayed to the user in step 1408. These high-confidence documents represent the documents that are relevant to the user for the purpose of discovery. They may be displayed in the chronological order of the label induction steps, starting with the initial seed documents and ending with the final set of documents discovered in the last label induction step.
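The display order described for step 1408 can be sketched as follows. This is an illustrative sketch under stated assumptions: the function name, the `min_confidence` cutoff of 0.9, and the rule that documents within one induction step are shown most-confident first are all hypothetical, and the stored confidence scores are supplied as plain dictionaries rather than produced by a transductive classifier.

```python
def discovered_documents(seed_ids, induction_steps, min_confidence=0.9):
    """Order relevant documents by the label induction step that first found them.

    seed_ids: identifiers of the initially labeled seed documents.
    induction_steps: list of {doc_id: confidence} dicts, one per label
        induction step, in chronological order.
    """
    ordered = list(seed_ids)          # the display starts with the seed documents
    seen = set(ordered)
    for scores in induction_steps:
        # Assumption: within a step, list the most confident documents first.
        for doc_id, confidence in sorted(scores.items(), key=lambda kv: -kv[1]):
            if confidence >= min_confidence and doc_id not in seen:
                ordered.append(doc_id)
                seen.add(doc_id)
    return ordered
```

A document that only reaches high confidence in a later step appears later in the list, which mirrors the chronological presentation from seed documents to the final induction step.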

Another embodiment of the invention involves the cleanup and accurate classification of data, for example in conjunction with business process automation. The cleanup and classification technique may include any type of process, for example a transductive process. For example, any of the inductive or transductive techniques described above may be used. In a preferred approach, the keys of the entries in a database are used as labels associated with some confidence level that depends on the expected cleanliness of the database. The labels together with their associated confidence levels, that is, the expected labels, are then used to train a transductive classifier, which corrects the labels (keys) in order to achieve a more consistent organization of the data in the database. For example, to enable automatic data extraction, such as determining the total amount, purchase order number, product quantity, or shipping address, invoices must first be classified according to the company or person that issued them. Training examples are normally needed to set up an automatic classification system. However, the training examples supplied by customers often contain misclassified documents or other noise, such as fax cover sheets, which must be identified and removed before the automatic classification system is trained if accurate classification is to be obtained. In another example, in the field of patient records, the technique is useful for detecting inconsistencies between the report written by a physician and the diagnosis.

In another example, patent offices are known to carry out a continuous reclassification process in which they (1) evaluate existing branches of their taxonomy for confusion, (2) restructure the taxonomy to evenly distribute overly congested nodes, and (3) reclassify existing patents into the new structure. The transductive learning methods presented herein may be used by patent offices, and by the companies to which this work is outsourced, to re-evaluate the taxonomy and to assist them in (1) building a new taxonomy for a given main classification and (2) reclassifying existing patents.

Transduction learns from both labeled and unlabeled data, so the transition from labeled to unlabeled data is fluid. At one end of the spectrum is labeled data with perfect prior knowledge, that is, the given labels are correct without exception. At the other end is unlabeled data, for which no prior knowledge is given. Organized data that contains some level of noise constitutes mislabeled data and lies somewhere on the spectrum between these two extremes. The labels given by the organization of the data can be trusted to be correct to a certain extent, but not completely. Transduction can therefore be used to clean up an existing organization of data by assuming a certain level of error within the given organization and interpreting that error as uncertainty in the prior knowledge about the label assignments.

A method for cleaning up data according to one embodiment is presented in FIG. 15. In use, a plurality of labeled data items is received in step 1500, and a subset of the data items is selected for each of a plurality of categories in step 1502. In step 1504, the uncertainty for the data items in each subset is set to approximately zero, and in step 1506 the uncertainty for the data items not in the subsets is set to a predetermined value that is not approximately zero. In step 1508, a transductive classifier is trained through iterative calculations using the uncertainties, the data items in the subsets, and the data items not in the subsets as training examples, and in step 1510 the trained classifier is applied to each of the labeled data items in order to classify each data item. Finally, in step 1512, a classification of the input data items, or a derivative thereof, is output to at least one of a user, another system, and another process.
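Steps 1502 through 1506 can be sketched as follows, using random selection of the trusted subset. This is a minimal sketch under stated assumptions: the function name, the subset size `per_category`, the value `noise_level` for the non-zero uncertainty, and the use of exactly 0.0 to stand for "approximately zero" are hypothetical choices, and item identifiers are assumed to be unique. Training the transductive classifier itself (step 1508) is not shown.

```python
import random


def assign_uncertainties(labeled_items, per_category=2, noise_level=0.2, seed=0):
    """Return {item_id: uncertainty} for a list of (item_id, category) pairs.

    For each category, a random subset of items is trusted (uncertainty 0.0);
    every other item receives the predetermined non-zero uncertainty.
    """
    rng = random.Random(seed)         # fixed seed so the selection is repeatable
    by_category = {}
    for item_id, category in labeled_items:
        by_category.setdefault(category, []).append(item_id)

    uncertainty = {}
    for ids in by_category.values():
        # Step 1502: select a subset of the items in this category.
        trusted = set(rng.sample(ids, min(per_category, len(ids))))
        for item_id in ids:
            # Steps 1504 and 1506: zero for the subset, noise_level otherwise.
            uncertainty[item_id] = 0.0 if item_id in trusted else noise_level
    return uncertainty
```

The resulting uncertainties would then accompany the existing labels as training input, so that items outside the trusted subsets can have their labels revised.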

Further, the subsets may be selected at random, or may be selected and verified by a user. The labels of at least some of the data items may be changed based on the classification. In addition, after the data items have been classified, identifiers of the data items whose confidence level is below a predetermined threshold may be output to the user. The identifiers may be electronic copies of the documents themselves, portions of them, their titles, their names, their file names, pointers to the documents, and so on.

In one embodiment of the invention, as shown in FIG. 16, the user is presented in step 1600 with two options for starting the cleanup process. One option is fully automatic cleanup in step 1602, in which a specified number of documents is selected at random for each concept or category and assumed to be correctly organized. Alternatively, in step 1604, a number of documents may be flagged for manual review and verification that one or more label assignments for each concept or category are correctly organized. In step 1606, an estimate of the noise level in the data is received. In step 1610, a transductive classifier is trained using the verified data (manually verified or randomly selected) and the unverified data of step 1608. Once training is complete, the documents are reorganized according to the new labels. In step 1612, documents whose label assignment has a confidence level below a specified threshold are displayed to the user for manual review. In step 1614, documents whose label assignment has a confidence level above a specified threshold are corrected automatically according to the transductive label assignment.
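The routing in steps 1612 and 1614 can be sketched as follows. This is an illustrative sketch only: the function name, the single shared threshold of 0.8, and the decision to treat a confidence exactly equal to the threshold as "above" are assumptions, and the predicted labels and confidences are supplied directly instead of coming from a trained transductive classifier.

```python
def apply_cleanup(current_labels, predictions, threshold=0.8):
    """Split classifier output into automatic corrections and a review queue.

    current_labels: {doc_id: label} as found in the existing organization.
    predictions: {doc_id: (predicted_label, confidence)}.
    Returns (corrected_labels, review_queue).
    """
    corrected = dict(current_labels)  # copy so the input mapping is left untouched
    review_queue = []
    for doc_id, (label, confidence) in predictions.items():
        if confidence < threshold:
            # Step 1612: low confidence, keep the old label and ask a person.
            review_queue.append(doc_id)
        else:
            # Step 1614: high confidence, adopt the transductive label.
            corrected[doc_id] = label
    return corrected, review_queue
```

Keeping the original label for low-confidence documents until a reviewer decides is a design choice of this sketch; the source only states that such documents are displayed for manual review.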

A method for managing medical records according to another embodiment is shown in FIG. 17. In use, a classifier is trained based on medical diagnoses in step 1700, and a plurality of medical records is accessed in step 1702. In step 1704, a document classification technique is performed on the medical records using the classifier, and in step 1706 an identifier is output for at least one medical record that has a low probability of being associated with the medical diagnosis. The document classification technique may include any type of process, for example a transductive process, and may include one or more of any of the inductive or transductive techniques described above, including support vector machine processing, maximum entropy discrimination processing, and so on.
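The output of step 1706 can be sketched as a filter over scored records. This is a rough sketch under loud assumptions: `keyword_overlap` is a toy stand-in for the trained classifier (it is not a support vector machine, a maximum entropy discrimination method, or anything described in the source), and the record layout, function names, and threshold are hypothetical.

```python
def keyword_overlap(text, diagnosis_terms):
    """Toy stand-in for a classifier: fraction of diagnosis terms found in the text."""
    words = set(text.lower().split())
    terms = [t.lower() for t in diagnosis_terms]
    return sum(t in words for t in terms) / len(terms) if terms else 0.0


def flag_inconsistent_records(records, score=keyword_overlap, threshold=0.5):
    """Return identifiers of records with a low probability of matching their diagnosis.

    records: iterable of (record_id, report_text, diagnosis_terms).
    score: callable returning a value in [0, 1]; assumed to behave like the
        probability that the report is associated with the diagnosis.
    """
    return [record_id
            for record_id, text, terms in records
            if score(text, terms) < threshold]
```

In practice the `score` argument would be replaced by the probability output of whatever classifier was trained in step 1700.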

In one embodiment, the classifier may be a transductive classifier, and the transductive classifier may be trained through iterative calculations using at least one predetermined cost factor, at least one seed document, and the medical records, where for each iteration of the calculation the cost factor is adjusted as a function of an expected label value, after which the trained classifier may be used to classify the medical records. Prior probabilities of the data point labels for the seed document and the medical records may also be received, where for each iteration of the calculation the prior probabilities of the data point labels may be adjusted according to an estimate of the class membership probabilities of the data points.

Another embodiment of the invention accounts for dynamic, shifting classification concepts. For example, in forms processing applications, documents are classified using their layout information and/or content information in order to sort them for further processing. In many applications the documents are not static but evolve over time. For example, the content and/or layout of a document may change because of new legislation. Transductive classification adapts to these changes automatically and delivers the same or comparable classification accuracy despite the drifting classification concepts. This is in contrast to rule-based systems or inductive classification methods, which, without manual adjustment, begin to lose classification accuracy as the concepts drift. One example is invoice processing, which has traditionally relied on inductive learning or on rule-based systems that exploit the layout of the invoice. Under these traditional systems, when the layout changes the system must be reconfigured manually, either by labeling new training data or by defining new rules. The use of transduction, however, makes manual reconfiguration unnecessary by adapting automatically to small changes in the layout of the invoices. In another example, transductive classification may be applied to the analysis of customer complaints in order to monitor how the nature of such complaints changes. For example, a company could automatically link product changes with customer complaints.

Transduction may also be used to classify news articles. For example, news articles on the war on terror, starting with articles about the terrorist attacks of September 11, 2001, continuing through the war in Afghanistan, and extending to news coverage of the present-day situation in Iraq, may be identified automatically using transduction.

In yet another example, the classification of organisms (alpha taxonomy) may change over time through evolution, as new species of organisms arise and other species become extinct. These principles of a classification scheme or taxonomy may be dynamic, with classification concepts that shift or change over time.

By using the input data that is to be classified as unlabeled data, transduction can recognize shifting classification concepts and thus adapt dynamically to an evolving classification scheme. For example, FIG. 18 shows an embodiment of the invention that uses transduction in the presence of drifting classification concepts. As shown in step 1802, a document set D_i enters the system at time t_i. In step 1804, a transductive classifier C_i is trained using the labeled and unlabeled data accumulated so far, and in step 1806 the documents in set D_i are classified. If manual mode is used, documents whose confidence level is below a user-specified threshold, as determined in step 1808, are presented to the user for manual review in step 1810. As shown in step 1812, in automatic mode a document with a given confidence level triggers the creation of a new category that is added to the system, and the document is then assigned to that new category. Documents with a confidence level above the selected threshold are classified into the current categories 1 through N in steps 1820A-1820B. In step 1822, all documents in the current categories that were classified into them before time t_i are reclassified by classifier C_i, and all documents that are no longer classified into their previously assigned categories are moved to the new categories in steps 1824 and 1826.
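The routing of one incoming document set D_i in FIG. 18 can be sketched as follows. This is an illustrative sketch, not the patented procedure: the function and parameter names are hypothetical, and the sketch assumes that the documents which trigger a new category in automatic mode are those whose confidence falls below the threshold, which the text above does not state explicitly. Training classifier C_i and the reclassification of earlier documents (step 1822) are not shown.

```python
def route_batch(predictions, threshold, mode="auto", new_category="new-1"):
    """Route the documents of one batch D_i after classification by C_i.

    predictions: {doc_id: (category, confidence)} for the documents in D_i.
    mode: "manual" queues low-confidence documents for review;
          "auto" assigns them to a newly created category.
    Returns (assigned, review_queue).
    """
    if mode not in ("auto", "manual"):
        raise ValueError("mode must be 'auto' or 'manual'")
    assigned, review_queue = {}, []
    for doc_id, (category, confidence) in predictions.items():
        if confidence >= threshold:
            assigned[doc_id] = category        # existing categories 1..N
        elif mode == "manual":
            review_queue.append(doc_id)        # presented to the user
        else:
            assigned[doc_id] = new_category    # assumption: low confidence opens a new category
    return assigned, review_queue
```

Calling this once per time step t_i, with a classifier retrained on all data accumulated so far, would give the adaptive loop the paragraph describes.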

A method for adapting to shifts in document content according to yet another embodiment is presented in FIG. 19. Document content may include, but is not limited to, graphical content, textual content, layout, numbering, and so on. Examples of shifts include temporal shifts, style shifts (where two or more people work on one or more documents), shifts in the process applied, layout shifts, and so on. In step 1900, at least one labeled seed document is received together with unlabeled documents and at least one predetermined cost factor. The documents may include, but are not limited to, customer complaints, invoices, form documents, receipts, and so on. In step 1902, a transductive classifier is trained using the at least one predetermined cost factor, the at least one seed document, and the unlabeled documents. In step 1904, the unlabeled documents having a confidence level above a predetermined threshold are classified into a plurality of categories using the classifier, and in step 1906 at least some of the categorized documents are reclassified into the categories using the classifier. In step 1908, identifiers of the categorized documents are output to at least one of a user, another system, and another process. The identifiers may be electronic copies of the documents themselves, portions of them, their titles, their names, their file names, pointers to the documents, and so on. In addition, product changes may be linked with customer complaints and the like.

Further, unlabeled documents having a confidence level below a predetermined threshold may be moved into one or more new categories. The transductive classifier may also be trained through iterative calculations using the at least one predetermined cost factor, the at least one seed document, and the unlabeled documents, where for each iteration of the calculation the cost factor may be adjusted as a function of an expected label value, and the trained classifier may be used to classify the unlabeled documents. In addition, prior probabilities of the data point labels for the seed document and the unlabeled documents may be received, where for each iteration of the calculation the prior probabilities of the data point labels may be adjusted according to an estimate of the class membership probabilities of the data points.

A method for adapting patent classification to shifts in document content according to another embodiment is presented in FIG. 20. In step 2000, at least one labeled seed document is received together with unlabeled documents. The unlabeled documents may include any type of document, for example patent applications, court filings, information disclosure forms, amendments, and so on. The seed document(s) may include patent(s), patent application(s), and so on. In step 2002, a transductive classifier is trained using the at least one seed document and the unlabeled documents, and the unlabeled documents having a confidence level above a predetermined threshold are classified into a plurality of existing categories using the classifier. The classifier may be any type of classifier, for example a transductive classifier, and the document classification technique may be any technique, for example support vector machine processing, maximum entropy discrimination processing, and so on. For example, any of the inductive or transductive techniques described above may be used.

In step 2004, the unlabeled documents having a confidence level below the predetermined threshold are classified into at least one new category using the classifier, and in step 2006 at least some of the categorized documents are reclassified into the existing categories and the at least one new category using the classifier. In step 2008, identifiers of the categorized documents are output to at least one of a user, another system, and another process. The transductive classifier may also be trained through iterative calculations using at least one predetermined cost factor, a search query, and the documents, where for each iteration of the calculation the cost factor may be adjusted as a function of an expected label value, and the trained classifier may be used to classify the documents. In addition, prior probabilities of the data point labels for the search query and the documents may be received, and for each iteration of the calculation the prior probabilities of the data point labels are adjusted according to an estimate of the class membership probabilities of the data points.

Yet another embodiment of the invention accounts for document drift in the field of document separation. One practical use of document separation is the processing of mortgage documents. A loan folder consisting of a sequence of different loan documents, for example loan applications, loan approvals, loan requests, loan amounts, and so on, is scanned, and the different documents within the resulting sequence of images must be identified before further processing. The documents used are not static and can change over time. For example, the tax forms used in a loan folder may change over time because of changes in the law.

Document separation solves the problem of finding the boundaries of documents or subdocuments within a sequence of images. Common sources of such image sequences are digital scanners and multifunction peripherals (MFPs). As in the case of classification, transduction may be used in document separation to handle the drift of documents and their boundaries over time. Static separation systems, such as rule-based systems or systems based on inductive learning solutions, cannot adapt automatically to drifting separation concepts. The performance of these static separation systems degrades over time whenever drift occurs. To keep performance at its initial level, one must either adapt the rules manually (in the case of a rule-based system) or manually label new documents and retrain the system (in the case of an inductive learning solution). Either approach is time-consuming and expensive. Applying transduction to document separation makes it possible to develop a system that adapts automatically to drift in the separation concepts.

A method for document separation according to one embodiment is presented in FIG. 21. Labeled data is received in step 2100, and a sequence of unlabeled documents is received in step 2102. Such data and documents may include legal discovery documents, office actions, web page data, attorney-client correspondence, and so on. In step 2104, probabilistic classification rules are adapted using transduction based on the labeled data and the unlabeled documents, and in step 2106 the weights used for document separation are updated according to the probabilistic classification rules. In step 2108, the separation positions within the sequence of documents are determined, and in step 2110 an indicator of the determined separation positions is output to at least one of a user, another system, and another process. The indicator may be electronic copies of the documents themselves, portions of them, their titles, their names, their file names, pointers to the documents, and so on. Further, in step 2112, the documents are flagged with a code that correlates to the indicator.
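Step 2108 can be sketched as a thresholding pass over per-page boundary probabilities. This is a simplified sketch under stated assumptions: the source describes adapted probabilistic classification rules and updated weights, whereas this example takes the resulting probability that each page starts a new document as given, and the function names, the 0.5 threshold, and the rule that the first page always starts a document are hypothetical.

```python
def separation_positions(boundary_probabilities, threshold=0.5):
    """Return the page indices at which a new document is taken to start.

    boundary_probabilities[i] is the estimated probability that page i
    begins a new document. Page 0 always starts the first document.
    """
    return [i for i, p in enumerate(boundary_probabilities)
            if i == 0 or p > threshold]


def split_pages(pages, positions):
    """Cut the page sequence into documents at the given start positions."""
    bounds = list(positions) + [len(pages)]
    return [pages[start:end] for start, end in zip(bounds, bounds[1:])]
```

For example, probabilities `[0.1, 0.2, 0.9, 0.3, 0.8]` yield start positions `[0, 2, 4]`, which split five pages into documents of two, two, and one page.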

FIG. 22 shows an implementation of the classification method and apparatus of the invention used in conjunction with document separation. Automatic document separation is used to reduce the manual effort involved in separating and identifying documents after digital scanning. One such method separates a sequence of pages automatically by using the classification methods described herein together with an inference algorithm that combines the classification rules and derives the most likely separation from all of the available information. In the embodiment shown in FIG. 22, the transductive MED classification method of the invention is employed for document separation. More specifically, document pages 2200 are inserted into a digital scanner 2202 or MFP and converted into a sequence of digital images 2204. The document pages may be pages from any type of document, for example patent office publications, data retrieved from a database, a collection of prior art, a website, and so on. In step 2206, the sequence of digital images is input and the probabilistic classification rules are dynamically adapted using transduction. Step 2206 uses the sequence of images 2204 as unlabeled data, together with labeled data 2208. In step 2210, the weights in the probabilistic network are updated and used for automatic document separation according to the dynamically adapted classification rules. Output step 2212 is the dynamically adapted automatic insertion of separator images: separator sheet images are inserted automatically into the sequence of images, so that the sequence of digitized pages 2214 is interleaved with automatically generated images of separator sheets 2216. In one embodiment of the invention, the software-generated separator page 2216 may also indicate the type of document that immediately follows or precedes it. The system described here adapts automatically to the drifting separation concepts that arise in documents over time, and it does not suffer the decline in separation accuracy seen in static systems such as rule-based or inductive machine learning solutions. A common example of drifting separation or classification concepts in forms processing applications is, as mentioned earlier, the revision of documents following new legislation.

Furthermore, the system shown in FIG. 22 can be modified into the system shown in FIG. 23. In the system shown in FIG. 23, pages 2300 are inserted into a digital scanner 2302 or MFP and converted into a sequence of digital images 2304. At step 2306, the sequence of digital images is input, and transduction is used to dynamically adapt the probabilistic classification rules. Step 2306 uses the sequence of images 2304 as unlabeled data, together with labeled data 2308. Step 2310 updates the weights in the probabilistic network that are used for automatic document separation according to the dynamically adapted classification rules employed. Rather than inserting separator sheet images as described with reference to FIG. 18, step 2312 dynamically adapts the automatic insertion of separation information and flags the document images 2314 with a coded description. In this way, the document page images can be entered into a visualized database 2316, and the documents can be accessed by software identifiers.
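The FIG. 23 variant flags page images with separation information instead of inserting separator sheets. A minimal sketch of such flagging follows, assuming per-page probabilities of starting a new document are already available. The function name `flag_pages`, the record fields, and the threshold are hypothetical; the actual coded description is not specified in the text.

```python
from typing import Dict, List, Sequence, Union


def flag_pages(pages: Sequence[str],
               p_new_document: Sequence[float],
               threshold: float = 0.5) -> List[Dict[str, Union[str, int]]]:
    """Attach a document index to every page instead of inserting separators."""
    if len(pages) != len(p_new_document):
        raise ValueError("one probability per page is required")
    records: List[Dict[str, Union[str, int]]] = []
    document_id = -1
    for index, (page, prob) in enumerate(zip(pages, p_new_document)):
        # Assumption: the first page always starts document 0.
        if index == 0 or prob >= threshold:
            document_id += 1
        records.append({"page": page, "document_id": document_id})
    return records
```

Records of this shape could then be stored in a database and looked up by document index, which is one plausible reading of access "by software identifiers".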

Yet another embodiment of the present invention can perform face recognition using transduction. As noted above, the use of transduction has many advantages, for example, that relatively few training examples are required and that unlabeled examples can be used for training. By exploiting these advantages, transductive face recognition can be implemented for criminal apprehension.

For example, the Department of Homeland Security must ensure that terrorists are not permitted to board commercial aircraft. Part of the airport screening process may be to photograph each passenger at an airport checkpoint and attempt to recognize that person. The system may first be trained using a small number of examples from the limited photographs available of suspected terrorists. More unlabeled photographs of the same terrorists, which can likewise be used for training, may also be available in the databases of other law enforcement agencies. Thus, a transductive trainer not only exploits the initially sparse data to produce a functional face recognition system, but also uses unlabeled examples from other sources to improve performance. After processing the photographs taken at the airport checkpoint, the transductive system can recognize the person in question more accurately than a comparable inductive system.

In yet another embodiment, a face recognition method is presented in FIG. 24. At step 2400, at least one labeled seed image of a face, having a known confidence level, is received. The at least one seed image may have a label indicating whether the image is included in a specified category. Also at step 2400, unlabeled images are received, for example from the police, a government agency, a missing children database, airport security, or any other source, and at least one predetermined cost factor is received. At step 2402, a transductive classifier is trained by iterative calculation using the at least one predetermined cost factor, the at least one seed image, and the unlabeled images, wherein for each iteration of the calculation the cost factor is adjusted as a function of the expected label value. After at least some of the iterations, at step 2404, confidence scores for the unlabeled seed images are stored.

Further, at step 2406, identifiers of the unlabeled documents having the highest confidence scores are output to at least one of a user, another system, and another process. The identifiers may be electronic copies of the documents themselves, portions thereof, their titles, their names, their file names, pointers to the documents, and so on. Confidence scores may also be stored after each iteration, in which case identifiers of the unlabeled images having the highest confidence scores after each iteration are output. In addition, prior probabilities of the data point labels for the labeled and unlabeled images may be received, and for each iteration of the calculation the prior probabilities of the data point labels may be adjusted according to an estimate of the class membership probabilities of the data points. Furthermore, a third unlabeled image of a face may be received, for example from the airport checkpoint example above. This third unlabeled image may be compared with at least some of the images having the highest confidence scores, and an identifier of the third unlabeled image may be output if there is confidence that the face in the third unlabeled image is the same as the face in the seed image.
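Outputting the identifiers with the highest confidence scores (step 2406) amounts to a top-k selection over the stored scores. The sketch below shows one way to do this with the Python standard library. The function name `top_identifiers`, the dictionary layout, and the choice of `k` are assumptions made for illustration.

```python
import heapq
from typing import Dict, List


def top_identifiers(confidence: Dict[str, float], k: int = 3) -> List[str]:
    """Return the k identifiers with the highest stored confidence scores."""
    best = heapq.nlargest(k, confidence.items(), key=lambda item: item[1])
    return [identifier for identifier, _ in best]
```

Calling this after every training iteration, on the scores stored for that iteration, would correspond to the per-iteration output described above.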

Yet another embodiment of the present invention allows users to improve their search results by providing feedback to a document discovery system. For example, when performing a search on an Internet search engine, such as a search for patent documents or patent applications, a user may obtain a large number of results in response to a search query. One embodiment of the present invention allows the user to review the results suggested by the search engine and to report to the engine the relevance of one or more of the retrieved results, for example "close to what I wanted, but not quite" or "completely wrong". Each time the user provides feedback to the engine, better results are prioritized for the user's review.

In one embodiment, a document search method is presented in FIG. 25. At step 2500, a search query is received. The search query may be any type of query, including a case-sensitive query, a Boolean query, an approximate match query, a structured query, and so on. At step 2502, documents are retrieved based on the search query. At step 2504, the documents are output, and at step 2506, user-entered labels indicating the relevance of the documents to the search query are received for at least some of the documents. For example, the user may indicate whether a particular result returned for the query is relevant. At step 2508, a classifier is trained based on the search query and the user-entered labels, and at step 2510, a document classification technique is performed on the documents using the classifier to reclassify the documents. At step 2512, identifiers of at least some of the documents are output based on their classification. The identifiers may be electronic copies of the documents themselves, portions thereof, their titles, their names, their file names, pointers to the documents, and so on. The reclassified documents may also be output, with the documents having the highest confidence being output first.

The document classification technique may include any type of process, for example, a transductive process, a support vector machine process, a maximum entropy discrimination process, and so on. Any inductive or transductive technique described above may be used. In a preferred approach, the classifier may be a transductive classifier trained by iterative calculation using at least one predetermined cost factor, the search query, and the documents, wherein for each iteration of the calculation the cost factor may be adjusted as a function of the expected label value, and the trained classifier may be used to classify the documents. Further, prior probabilities of the data point labels for the search query and the documents may be received, and for each iteration of the calculation the prior probabilities of the data point labels may be adjusted according to an estimate of the class membership probabilities of the data points.
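The feedback loop of FIG. 25 (retrieve, collect relevance labels, train, reclassify, output) can be sketched with a deliberately simple classifier. The example below is an assumption-laden illustration: it uses per-term weights learned from the query and the user labels instead of a transductive, SVM, or MED classifier, and the names `rerank_with_feedback`, `documents`, and `feedback` are hypothetical.

```python
from collections import Counter
from typing import Dict, List


def rerank_with_feedback(query: str,
                         documents: Dict[str, str],
                         feedback: Dict[str, bool]) -> List[str]:
    """Re-rank document identifiers using the query and user relevance labels.

    documents maps identifier -> text; feedback maps identifier -> True
    (relevant) or False (not relevant) for the documents the user labeled.
    """
    weights: Counter = Counter()
    # The query itself is treated as a positive example.
    for term in query.lower().split():
        weights[term] += 1.0
    # Each labeled document pushes its terms up or down.
    for doc_id, relevant in feedback.items():
        sign = 1.0 if relevant else -1.0
        for term in set(documents[doc_id].lower().split()):
            weights[term] += sign

    def score(doc_id: str) -> float:
        return sum(weights[term] for term in set(documents[doc_id].lower().split()))

    # Highest score first; ties are broken by identifier for determinism.
    return sorted(documents, key=lambda doc_id: (-score(doc_id), doc_id))
```

Each further round of feedback would simply extend `feedback` and call the function again, so that better-matching documents move toward the front of the list.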

Further embodiments of the present invention may be used to improve ICR/OCR and speech recognition. For example, many embodiments of speech recognition programs and systems require the operator to repeat a number of words in order to train the system. The present invention may first monitor the user's voice for a preset period, for example by listening to telephone conversations, to collect "unclassified" content. As a result, when the user begins training the recognition system, the system uses transductive learning to exploit the monitored speech and assist in building a memory model.

In yet another embodiment, a method for verifying the association of an invoice with an entity is presented in FIG. 26. At step 2600, a classifier is trained based on the format of invoices associated with a first entity. The invoice format may refer to either or both of the physical layout of markings on the invoice and features on the invoice such as keywords, invoice number, customer name, and the like. At step 2602, a plurality of invoices labeled as being associated with at least one of the first entity and other entities are accessed, and at step 2604, a document classification technique is performed on the invoices using the classifier. For example, any inductive or transductive technique described above may be used as the document classification technique; the technique may include a transductive process, a support vector machine process, a maximum entropy discrimination process, and so on. At step 2606, an identifier of at least one of the invoices having a high probability of not being associated with the first entity is output.

Further, the classifier may be any type of classifier, for example a transductive classifier. The transductive classifier may be trained by iterative calculation using at least one predetermined cost factor, at least one document classification, and the invoices, wherein for each iteration of the calculation the cost factor is adjusted as a function of the expected label value, and the trained classifier is used to classify the invoices. Prior probabilities of the data point labels for the seed document and the invoices may also be received, and for each iteration of the calculation the prior probabilities of the data point labels are adjusted according to an estimate of the class membership probabilities of the data points.
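The invoice check of FIG. 26 can be illustrated with a toy format profile. In the sketch below, each invoice is reduced to a set of format features (layout tags, keywords, and so on), the "classifier" is just the set of features seen on invoices known to belong to the first entity, and the mismatch score is an uncalibrated stand-in for the probability of not being associated with that entity. All names, feature strings, and the threshold are assumptions for illustration.

```python
from typing import Dict, Iterable, List, Set


def train_format_profile(known_invoices: Iterable[Set[str]]) -> Set[str]:
    """Collect every format feature seen on invoices of the first entity."""
    profile: Set[str] = set()
    for features in known_invoices:
        profile |= features
    return profile


def mismatch_score(profile: Set[str], features: Set[str]) -> float:
    """Fraction of an invoice's features that the profile has never seen."""
    if not features:
        return 1.0  # nothing to match on: treat as maximally suspicious
    return 1.0 - len(features & profile) / len(features)


def flag_invoices(profile: Set[str],
                  invoices: Dict[str, Set[str]],
                  threshold: float = 0.5) -> List[str]:
    """Return identifiers of invoices that likely do not match the entity."""
    return sorted(invoice_id for invoice_id, features in invoices.items()
                  if mismatch_score(profile, features) >= threshold)
```

A real deployment would replace the set-overlap score with the output of one of the classification techniques named above; the surrounding control flow (train, score labeled invoices, output the outliers) would stay the same.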

One advantage provided by the embodiments described herein is the stability of the transductive algorithm. This stability is achieved by scaling the cost factors and adjusting the prior probabilities of the labels. For example, in one embodiment, a transductive classifier is trained by iterative classification using at least one cost factor, labeled data points, and unlabeled data points as training examples. For each iteration of the calculation, the cost factor of the unlabeled data points is adjusted as a function of the expected label value. In addition, for each iteration of the calculation, the prior probabilities of the data point labels are adjusted according to an estimate of the class membership probabilities of the data points.
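The two stabilizing adjustments can be seen in a toy loop. The sketch below is not the MED computation: it substitutes a one-dimensional, cost-weighted centroid classifier with a logistic link, purely to show where the cost factor of each unlabeled point is scaled by the absolute value of its expected label and where the label prior is re-estimated from class membership probabilities. The function name, parameters, and default values are assumptions.

```python
import math
from typing import List, Sequence, Tuple


def train_toy_transductive(
    labeled: Sequence[Tuple[float, int]],  # (feature, label), label in {+1, -1}
    unlabeled: Sequence[float],
    base_cost: float = 1.0,   # assumed base cost factor for unlabeled points
    prior_pos: float = 0.5,   # assumed initial prior probability of label +1
    max_iter: int = 50,
    tol: float = 1e-6,
) -> Tuple[List[float], float]:
    """Return (expected labels of the unlabeled points, adjusted prior of +1)."""
    pos = [x for x, y in labeled if y > 0]
    neg = [x for x, y in labeled if y < 0]
    if not pos or not neg:
        raise ValueError("need at least one labeled example of each class")

    expected = [0.0] * len(unlabeled)  # expected labels start undecided
    for _ in range(max_iter):
        # Cost factor of each unlabeled point, scaled by |expected label|:
        # undecided points (expected label near 0) barely influence training.
        costs = [base_cost * abs(e) for e in expected]

        # Stand-in classifier: cost-weighted class centroids.
        pos_sum, pos_w = sum(pos), float(len(pos))
        neg_sum, neg_w = sum(neg), float(len(neg))
        for x, e, c in zip(unlabeled, expected, costs):
            if e > 0:
                pos_sum, pos_w = pos_sum + c * x, pos_w + c
            elif e < 0:
                neg_sum, neg_w = neg_sum + c * x, neg_w + c
        mu_pos, mu_neg = pos_sum / pos_w, neg_sum / neg_w

        # Linear score between the centroids, shifted by the log prior odds.
        bias = math.log(prior_pos / (1.0 - prior_pos))
        scores = [(mu_pos - mu_neg) * (x - (mu_pos + mu_neg) / 2.0) + bias
                  for x in unlabeled]
        # Expected label 2*p(+1|x) - 1 under a logistic link is tanh(score/2).
        new_expected = [math.tanh(s / 2.0) for s in scores]

        # Adjust the label prior from estimated class membership probabilities.
        probs = [(1.0 + e) / 2.0 for e in new_expected]
        probs += [1.0 if y > 0 else 0.0 for _, y in labeled]
        prior_pos = min(max(sum(probs) / len(probs), 1e-6), 1.0 - 1e-6)

        change = max((abs(a - b) for a, b in zip(new_expected, expected)),
                     default=0.0)
        expected = new_expected
        if change < tol:  # converged: expected labels stopped changing
            break
    return expected, prior_pos
```

Because points with an expected label near zero receive a cost factor near zero, an early wrong guess about an ambiguous point cannot dominate later iterations, which is the intuition behind the stability claim.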

The workstation may have resident thereon an operating system such as the Microsoft Windows® operating system (OS), Mac OS, or the UNIX® operating system. It will be appreciated that a preferred embodiment may also be implemented on platforms and operating systems other than those mentioned. A preferred embodiment may be written using the JAVA®, XML, C, and/or C++ languages, or other programming languages, along with an object-oriented programming methodology. Object-oriented programming (OOP), which is increasingly used to develop complex applications, may be used.

The application described above uses transductive learning to overcome the problem of extremely sparse data that poses difficulties for inductive face recognition systems. This aspect of transductive learning is not limited to this application and may be used to solve other machine learning problems arising from sparse data.

Those skilled in the art may devise variations that are within the scope and spirit of the various embodiments of the invention disclosed herein. Further, the various features of the embodiments disclosed herein may be used alone or in various combinations with one another and are not intended to be limited to the specific combinations described herein. Accordingly, the claims are not limited by the illustrated embodiments.

FIG. 1 depicts a chart plotting the expected label as a function of the classification score, as obtained by employing MED discriminative learning applied to label induction.
FIGS. 2A to 2H depict a series of plots showing the iterations of the decision function calculation obtained by transductive MED learning.
FIGS. 3A to 3H depict a series of plots showing the iterations of the decision function calculation obtained by the improved transductive MED learning of one embodiment of the present invention.
FIG. 4 shows a control flow diagram for the classification of unlabeled data according to one embodiment of the present invention using scaled cost factors.
FIG. 5 shows a control flow diagram for the classification of unlabeled data according to one embodiment of the present invention using user-defined prior probability information.
FIG. 6 shows a detailed control flow diagram for the classification of unlabeled data according to one embodiment of the present invention using maximum entropy discrimination with scaled cost factors and prior probability information.
FIG. 7 is a network diagram illustrating a network architecture in which the various embodiments described herein may be implemented.
FIG. 8 is a system diagram of a representative hardware environment associated with a user device.
FIG. 9 shows a block diagram of an apparatus of one embodiment of the present invention.
FIG. 10 is a flowchart showing a classification process performed according to one embodiment.
FIG. 11 is a flowchart showing a classification process performed according to one embodiment.
FIG. 12 is a flowchart showing a classification process performed according to one embodiment.
FIG. 13 is a flowchart showing a classification process performed according to one embodiment.
FIG. 14 is a flowchart showing a classification process performed according to one embodiment.
FIG. 15 is a flowchart showing a classification process performed according to one embodiment.
FIG. 16 is a flowchart showing a classification process performed according to one embodiment.
FIG. 17 is a flowchart showing a classification process performed according to one embodiment.
FIG. 18 is a flowchart showing a classification process performed according to one embodiment.
FIG. 19 is a flowchart showing a classification process performed according to one embodiment.
FIG. 20 is a flowchart showing a classification process performed according to one embodiment.
FIG. 21 is a flowchart showing a classification process performed according to one embodiment.
FIG. 22 shows a control flow diagram illustrating the method of one embodiment of the present invention as applied to a first document separation system.
FIG. 23 shows a control flow diagram illustrating the method of one embodiment of the present invention as applied to a second separation system.
FIG. 24 is a flowchart showing a classification process performed according to one embodiment.
FIG. 25 is a flowchart showing a classification process performed according to one embodiment.
FIG. 26 is a flowchart showing a classification process performed according to one embodiment.
FIG. 27 is a flowchart showing a classification process performed according to one embodiment.
FIG. 28 is a flowchart showing a classification process performed according to one embodiment.
FIG. 29 is a flowchart showing a classification process performed according to one embodiment.

Claims

A method of data classification in a computer-based system, comprising:
Receiving labeled data points, each of the labeled data points having at least one label indicating whether the data point is a training example for data points to be included in a specified category or a training example for data points to be excluded from the specified category;
Receiving unlabeled data points;
Receiving at least one predetermined cost factor for the labeled and unlabeled data points;
Training a transductive classifier using maximum entropy discrimination (MED) by iterative calculation using the at least one cost factor and the labeled and unlabeled data points as training examples, wherein for each iteration of the calculation, the cost factor of the unlabeled data points is adjusted as a function of the expected label value, and the prior probability of the data point label is adjusted according to an estimate of the class membership probability of the data point;
Applying the trained classifier to classify at least one of the unlabeled data points, the labeled data points, and input data points; and
Outputting a classification of the classified data points, or a derivative thereof, to at least one of a user, another system, and another process.

The method of claim 1, wherein the function is an absolute value of the expected label of a data point.

The method of claim 1, further comprising receiving prior probability information for the labeled and unlabeled data points.

The method of claim 3, wherein the transductive classifier learns using prior probability information of the labeled and unlabeled data.

The method of claim 1, comprising the further step of using the labeled and unlabeled data as training examples according to their expected labels to determine a decision function having a minimum KL divergence to a Gaussian prior distribution over the decision function parameters, given the included training examples and the excluded training examples.

The method of claim 1, comprising the further step of determining a decision function having a minimum KL divergence using a multinomial prior for the decision function parameters.

The method of claim 1, wherein the iterative step of training a transductive classifier is repeated until convergence of data values is reached.

The method of claim 7, wherein convergence is reached when a change in a decision function of the transductive classifier falls below a predetermined threshold.

The method of claim 7, wherein convergence is reached when a change in the determined expected label value falls below a predetermined threshold.

The method of claim 1, wherein the labels of the included training examples have a value of +1 and the labels of the excluded training examples have a value of −1.

The method of claim 1, wherein the label of the included example is mapped to a first number and the label of the excluded example is mapped to a second number.

The method of claim 1, further comprising:
Storing the labeled data points in a memory of a computer;
Storing the unlabeled data points in a memory of a computer;
Storing the input data points in a memory of a computer; and
Storing the at least one predetermined cost factor of the labeled and unlabeled data points in a memory of a computer.

A method of data classification comprising providing computer-executable program code that is deployed to and executed on a computer system, the program code comprising:
Instructions for accessing labeled data points stored in a memory of a computer, each labeled data point having at least one label indicating whether the data point is a training example for data points to be included in a specified category or a training example for data points to be excluded from the specified category;
Instructions for accessing unlabeled data points from the memory of the computer;
Instructions for accessing at least one predetermined cost factor of the labeled and unlabeled data points from the memory of the computer;
Instructions for training a maximum entropy discrimination (MED) transductive classifier by iterative calculation using the at least one stored cost factor, the stored labeled data points, and the stored unlabeled data points, wherein for each iteration of the calculation, the cost factor of the unlabeled data points is adjusted as a function of the expected label value, and the prior probability of the data points is adjusted according to an estimate of the class membership probability of the data points;
Instructions for applying the trained classifier to classify at least one of the unlabeled data points, the labeled data points, and input data points; and
Instructions for outputting a classification of the classified data points, or a derivative thereof, to at least one of a user, another system, and another process.

The method of claim 13, wherein the function is an absolute value of the expected label of a data point.

The method of claim 13, further comprising accessing prior probability information of the labeled and unlabeled data points stored in the memory of a computer.

The method of claim 15, wherein for each iteration, the prior probability information is adjusted according to an estimate of a class membership probability of a data point.

The method of claim 13, further comprising instructions for determining a decision function having a minimum KL divergence to a prior distribution of the decision function parameters given the included training examples and the excluded training examples, utilizing the labeled and unlabeled data as learning examples according to their expected labels.

The method of claim 13, wherein the iterative step of training a transductive classifier is repeated until convergence of data values is reached.

The method of claim 18, wherein convergence is reached when the change in the decision function of the transductive classification falls below a predetermined threshold.

The method of claim 18, wherein convergence is reached when a change in the determined expected label value falls below a predetermined threshold.

The method of claim 13, wherein the labels of the included training examples have a value of +1 and the labels of the excluded training examples have a value of −1.

The method of claim 13, wherein the label of the included example is mapped to a first number and the label of the excluded example is mapped to a second number.

A data processing apparatus, comprising:
At least one memory for storing (i) labeled data points, each labeled data point having at least one label indicating whether the data point is a training example for data points to be included in a specified category or a training example for data points to be excluded from the specified category; (ii) unlabeled data points; and (iii) at least one predetermined cost factor for the labeled and unlabeled data points; and
A transductive classifier training device for iteratively teaching a transductive classifier using transductive maximum entropy discrimination (MED), using the at least one stored cost factor, the stored labeled data points, and the stored unlabeled data points as training examples, wherein at each iteration of the MED calculation, the cost factor of the unlabeled data points is adjusted as a function of the expected label value, and the prior probability of the data point label is adjusted according to an estimate of the class membership probability of the data point;
Wherein a classifier trained by the transductive classifier training device is used to classify at least one of the unlabeled data points, the labeled data points, and input data points; and
Wherein a classification of the classified data points, or a derivative thereof, is output to at least one of a user, another data processing apparatus, and another process.

The apparatus of claim 23, wherein the function is an absolute value of the expected label of a data point.

The apparatus of claim 23, wherein the memory also stores prior probability information for the labeled and unlabeled data points.

The apparatus of claim 25, wherein in each iteration of the MED calculation, the prior probability information is adjusted according to an estimate of the class membership probability of a data point.

The apparatus of claim 23, further comprising a processor for determining a decision function having a minimum KL divergence to a prior distribution of the decision function parameters given the included training examples and the excluded training examples, using the labeled and unlabeled data as learning examples according to their expected labels.

The apparatus of claim 23, further comprising means for determining convergence of the data values and terminating the calculation upon determination of convergence.

The apparatus of claim 28, wherein convergence is reached when a change in a decision function of the transductive classifier calculation falls below a predetermined threshold.

The apparatus of claim 28, wherein convergence is reached when a change in the determined expected label value falls below a predetermined threshold.

The apparatus of claim 23, wherein the labels of the included training examples have a value of +1 and the labels of the excluded training examples have a value of −1.

The apparatus of claim 23, wherein the labels of the included examples are mapped to a first number and the labels of the excluded examples are mapped to a second number.

A computer-readable program storage medium tangibly embodying one or more programs of instructions executable by a computer to perform a method of data classification, the method comprising:
Receiving labeled data points, each labeled data point having at least one label indicating whether the data point is a training example for data points to be included in a specified category or a training example for data points to be excluded from the specified category;
Receiving unlabeled data points;
Receiving at least one predetermined cost factor of the labeled and unlabeled data points;
Training a transductive classifier by iterative maximum entropy discrimination (MED) calculation using the at least one stored cost factor, the stored labeled data points, and the stored unlabeled data points as training examples, wherein at each iteration of the MED calculation, the cost factor of the unlabeled data points is adjusted as a function of the expected label value, and the prior probability of the data points is adjusted according to an estimate of the class membership probability of the data points;
Applying the trained classifier to classify at least one of the unlabeled data points, the labeled data points, and input data points; and
Outputting a classification of the classified data points, or a derivative thereof, to at least one of a user, another program storage medium, and another process.

The program storage medium according to claim 33, wherein the function is an absolute value of the expected label of a data point.
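When the function is the absolute value of the expected label, the adjusted cost factor of an unlabeled data point shrinks toward zero as its expected label becomes uncertain. The sketch below assumes a simple proportional scaling; the name `scaled_costs` and the `base_cost` parameter are illustrative only.

```python
import numpy as np

def scaled_costs(base_cost, expected_labels):
    """Scale a base cost factor by the absolute value of each expected label.

    An expected label near +1 or -1 keeps almost the full cost, while an
    expected label near 0 (an uncertain point) contributes almost nothing.
    """
    expected_labels = np.asarray(expected_labels, dtype=float)
    return base_cost * np.abs(expected_labels)
```

For example, with a base cost of 2.0 and expected labels of 1, -0.5, and 0, the scaled costs are 2.0, 1.0, and 0.0.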

34. The program storage medium of claim 33, wherein the method further comprises storing prior probability information for labeled and unlabeled data points in a computer memory.

36. The program storage medium of claim 35, wherein in each iteration of the MED calculation, the prior probability information is adjusted according to an estimate of a data point class membership probability.

The program storage medium of claim 33, wherein the method further comprises determining, using the labeled and unlabeled data as learning examples according to their expected labels, a decision function that has a minimum KL divergence to the prior distribution of the decision function parameters given the included training examples and the excluded training examples.
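To make the KL divergence criterion concrete, consider the special case in which both the prior and the learned distribution over the decision function parameters are Gaussian with identity covariance. That special case is an assumption made here for illustration; the claim does not fix the form of the distributions. In that case the KL divergence reduces to half the squared distance between the two means, so minimizing it favors parameter vectors close to the prior mean.

```python
import numpy as np

def kl_gaussian_identity(mean, prior_mean=None):
    """KL( N(mean, I) || N(prior_mean, I) ) = 0.5 * ||mean - prior_mean||^2.

    Assumes both distributions are Gaussian with identity covariance.
    A zero prior mean is used when none is given.
    """
    mean = np.asarray(mean, dtype=float)
    if prior_mean is None:
        prior_mean = np.zeros_like(mean)
    else:
        prior_mean = np.asarray(prior_mean, dtype=float)
    return 0.5 * float(np.sum((mean - prior_mean) ** 2))
```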

34. The program storage medium of claim 33, wherein the iterative step of training a transductive classifier is repeated until convergence of data values is reached.

40. The program storage medium of claim 38, wherein convergence is reached when a change in the transductive classification decision function falls below a predetermined threshold.

39. The program storage medium according to claim 38, wherein convergence is reached when a change in the determined expected label value falls below a predetermined threshold.

34. The program storage medium of claim 33, wherein the labels of the included training examples have a value of +1 and the labels of the excluded training examples have a value of -1.

34. The program storage medium of claim 33, wherein the label of the included example is mapped to a first numeric value and the label of the excluded example is mapped to a second numeric value.

A method for classifying unlabeled data in a computer-based system, comprising:
Receiving labeled data points, each of the labeled data points having at least one label indicating whether the data point is a training example for data points to be included in a specified category or a training example for data points to be excluded from the specified category;
Receiving labeled and unlabeled data points;
Receiving prior label probability information for the labeled and unlabeled data points;
Receiving at least one predetermined cost factor for the labeled and unlabeled data points;
Determining an expected label for each labeled and unlabeled data point according to the prior probability of the label of the data point;
Performing the following substeps until the data values converge: generating a scaled cost value for each unlabeled data point in proportion to the absolute value of the expected label of the data point;
Training a classifier by determining, using the labeled and unlabeled data as training examples according to their expected labels, a decision function that minimizes the KL divergence to the prior probability distribution of the decision function parameters given the included training examples and the excluded training examples;
Using the trained classifier to determine a classification score for the labeled and unlabeled data points;
Calibrating the output of the trained classifier to class membership probabilities;
Updating the prior probability of the label of the unlabeled data points according to the determined class membership probabilities;
Determining the probability distribution of the label and margin by maximum entropy discrimination (MED) using the updated prior probability of the label and the previously determined classification score;
Using the previously determined probability distribution of the labels to calculate a new expected label;
Updating the expected label for each data point by interpolating the new expected label with the expected label from the previous iteration;
Repeating the substeps until the data values converge; and
Outputting a classification of the input data points, or a derivative thereof, to at least one of a user, another system, and another process.
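The iterative procedure of this claim can be sketched as a loop. The code below is a rough illustration under stated assumptions and is not the claimed MED calculation: the classifier is a cost-weighted, L2-regularized logistic regression used as a stand-in for the minimum KL divergence decision function, a plain sigmoid stands in for calibration, and the MED step that determines the label and margin distributions is replaced by taking the new expected label directly as 2p - 1 from the calibrated probability p. All function names, parameters, and default values are hypothetical.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_weighted_logistic(X, y, costs, l2=1.0, lr=0.1, steps=500):
    """Stand-in trainer (NOT MED): cost-weighted, L2-regularized logistic
    regression fitted by plain gradient descent."""
    w = np.zeros(X.shape[1])
    b = 0.0
    n = len(y)
    for _ in range(steps):
        margin = y * (X @ w + b)
        # Gradient of the cost-weighted log-loss with respect to each score.
        g = -costs * y * sigmoid(-margin)
        w -= lr * (X.T @ g / n + l2 * w / n)
        b -= lr * g.mean()
    return w, b

def transductive_sketch(X_lab, y_lab, X_unl, prior_unl, cost_lab=1.0,
                        cost_unl=1.0, mix=0.5, tol=1e-3, max_iter=50):
    """Iterate: scale costs, train, score, calibrate, update priors,
    recompute expected labels, interpolate, and test for convergence."""
    X = np.vstack([X_lab, X_unl])
    n_lab = len(y_lab)
    prior = np.asarray(prior_unl, dtype=float).copy()
    # Expected label of a +1/-1 variable with P(+1) = p is 2p - 1.
    expected = np.concatenate([np.asarray(y_lab, dtype=float),
                               2.0 * prior - 1.0])
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(max_iter):
        # Scaled cost of each unlabeled point, proportional to |expected label|.
        costs = np.concatenate([np.full(n_lab, cost_lab),
                                cost_unl * np.abs(expected[n_lab:])])
        targets = np.where(expected >= 0, 1.0, -1.0)
        w, b = train_weighted_logistic(X, targets, costs)
        scores = X @ w + b
        # Calibration stand-in: sigmoid of the score as membership probability.
        prior = sigmoid(scores[n_lab:])
        new_expected = expected.copy()
        new_expected[n_lab:] = 2.0 * prior - 1.0  # simplified; skips the MED step
        updated = (1.0 - mix) * expected + mix * new_expected
        change = np.max(np.abs(updated - expected))
        expected = updated
        if change < tol:
            break
    return expected[n_lab:], (w, b)
```

A usage example on toy two-dimensional data, with both unlabeled points starting from an uninformative prior of 0.5:

```python
X_lab = np.array([[2.0, 0.0], [-2.0, 0.0]])
y_lab = np.array([1.0, -1.0])
X_unl = np.array([[1.5, 0.2], [-1.5, -0.1]])
expected_unl, _ = transductive_sketch(X_lab, y_lab, X_unl, prior_unl=[0.5, 0.5])
```

In the first iteration the unlabeled points have an expected label of 0 and therefore zero cost, so only the labeled points shape the classifier; their influence grows as their expected labels move away from zero.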

44. The method of claim 43, wherein convergence is reached when a change in the decision function falls below a predetermined threshold.

44. The method of claim 43, wherein convergence is reached when a change in the determined expected label value falls below a predetermined threshold.

44. The method of claim 43, wherein the labels of the included training examples have a value of +1 and the labels of the excluded training examples have a value of -1.

A method for classifying documents, comprising:
Receiving at least one labeled document having a known confidence level of label assignment;
Receiving unlabeled documents;
Receiving at least one predetermined cost factor;
Training a transductive classifier by iterative calculation using the at least one predetermined cost factor, the at least one labeled document, and the unlabeled documents, wherein for each iteration of the calculation the cost factor is adjusted as a function of the expected label value;
Storing a confidence score for the unlabeled documents after at least some of the iterations; and
Outputting an identifier of the unlabeled document having the highest confidence score to at least one of a user, another system, and another process,
Wherein the identifier of the unlabeled document includes a copy of at least a part of the unlabeled document, the file location ID of the unlabeled document, the name of the unlabeled document, or the title of the unlabeled document.

48. The method of claim 47, wherein the at least one labeled document has a list of keywords.

48. The method of claim 47, wherein a confidence score is stored after each iteration, and an identifier for the unlabeled document having the highest confidence score is output after each iteration.
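Selecting the unlabeled document to output after an iteration amounts to taking the maximum over the stored confidence scores. A minimal sketch follows; the function name `best_unlabeled` and the use of file names as identifiers are assumptions made for illustration.

```python
def best_unlabeled(identifiers, confidence_scores):
    """Return the identifier and score of the highest-confidence unlabeled item.

    `identifiers` could be file names, file location IDs, or titles; the two
    sequences must be non-empty and of equal length.
    """
    if not identifiers or len(identifiers) != len(confidence_scores):
        raise ValueError("identifiers and scores must be non-empty and aligned")
    best = max(range(len(identifiers)), key=lambda i: confidence_scores[i])
    return identifiers[best], confidence_scores[best]
```

Calling this once per iteration on the scores stored for that iteration yields the per-iteration output described above.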

The method of claim 47, further comprising receiving a prior probability of the label of the data points for the labeled and unlabeled documents, wherein for each iteration of the calculation the prior probability of the label of the data points is adjusted according to an estimate of the class membership probability of the data points.

A face recognition method, comprising:
Receiving at least one labeled image of a face having a known confidence level;
Receiving unlabeled images;
Receiving at least one predetermined cost factor;
Training a transductive classifier by iterative calculation using the at least one predetermined cost factor, the at least one labeled image, and the unlabeled images, wherein for each iteration of the calculation the cost factor is adjusted as a function of the expected label value;
Storing a confidence score for the unlabeled images after at least some of the iterations; and
Outputting an identifier of the unlabeled image having the highest confidence score to at least one of a user, another system, and another process,
Wherein the identifier of the unlabeled document includes a copy of at least a part of the unlabeled document, the file location ID of the unlabeled document, the name of the unlabeled document, or the title of the unlabeled document.

52. The method of claim 51, wherein the at least one labeled image has a label that indicates whether the image is included in a specified category.

52. The method of claim 51, wherein a confidence score is stored after each of the iterations and an identifier of the unlabeled image having the highest confidence score is output after each iteration.

The method of claim 51, further comprising receiving a prior probability of the label of the data points for the labeled and unlabeled images, wherein for each iteration of the calculation the prior probability of the label of the data points is adjusted according to an estimate of the class membership probability of the data points.

The method of claim 51, further comprising: receiving a third unlabeled image of a face; comparing the third unlabeled image with at least some of the images having the highest confidence score; and outputting an identifier of the third unlabeled image when there is confidence that the face in the third unlabeled image is identical to the face in the labeled image.