JP2006501545A

JP2006501545A - Method and apparatus for automatically determining salient features for object classification

Info

Publication number: JP2006501545A
Application number: JP2004539741A
Authority: JP
Inventors: ピー．ルリッチダニエル; ジー．ギラクファルジン
Original assignee: Microsoft Corp
Current assignee: Microsoft Corp
Priority date: 2002-09-25
Filing date: 2002-09-25
Publication date: 2006-01-12
Also published as: BR0215899A; AU2002334669A1; EP1543437A4; MXPA05003249A; WO2004029826A1; CN1669023A; EP1543437A1; CN100378713C; CA2500264A1

Abstract

A method and apparatus are provided for automatically determining salient features (308) for object classification. According to one embodiment, one or more unique features are extracted from a first content group of objects to form a first feature list, and one or more unique features are extracted from a second anti-content group of objects to form a second feature list. A ranked list of features is then created by applying statistical discrimination between the unique features of the first feature list and the unique features of the second feature list. A set of salient features (308) is identified from the resulting ranked list of features.

Description

The present invention relates to the field of data processing. More particularly, the invention relates to the automatic selection of object features for use in classifying objects into groups.

The World Wide Web is an important information resource, with an estimated enormous number of pages available online for browsing and downloading. To make effective use of this information, however, a practical method is needed for navigating this vast expanse of data.

In the early days of Internet surfing, two basic methods were developed to assist with web searches. In the first approach, an indexed database is created from the content of web pages gathered by automated search engines that "crawl" the web looking for new and unique pages. This database can be searched using various query techniques, and the results are often ranked by their similarity to the form of the query. In the second approach, web pages are grouped into a category hierarchy, typically presented in tree form. The user makes a series of selections while descending the hierarchy, with two or more choices at each level that represent salient differences between the subtrees below the decision point, and finally reaches leaf nodes containing pages of text and/or multimedia content.

For example, FIG. 1 shows a typical prior art subject hierarchy 102. Here, a plurality of decision nodes (hereinafter "nodes") 130-136 are arranged hierarchically as parent and/or child nodes, each of which is associated with its own subject category. For example, node 130 is the parent node of nodes 131 and 132, and nodes 131 and 132 are child nodes of node 130. Because nodes 131 and 132 are both child nodes of the same parent node (node 130), they are said to be siblings of each other. Further sets of siblings in subject hierarchy 102 include nodes 133 and 134 as well as nodes 135 and 136. As FIG. 1 shows, node 130 forms a first level 137 of subject hierarchy 102, nodes 131-132 form a second level 138, and nodes 133-136 form a third level 139. In addition, node 130 is called the root node of subject hierarchy 102 because it is not a child of any other node.

The process of creating a hierarchical organization for web pages presents a number of challenges. First, the nature of the hierarchy must be defined. This is typically done by experts in a particular subject area, in a manner similar to the creation of categories in the Dewey Decimal System used by libraries. The categories are given descriptive labels so that users and categorizers can make appropriate decisions while navigating the hierarchy. Content, for example in the form of individual electronic documents, is then placed into the categories by manually searching through the hierarchy.

In recent years, attention has turned to automating the various stages of this process. Systems exist for automatically classifying documents from a corpus of documents. For example, some systems use keywords associated with the documents to cluster or group similar documents automatically. Such clusters can be grouped iteratively into super-clusters, creating a hierarchical structure. However, because these systems require keywords to be inserted manually, they produce a hierarchy with no systematic structure. If such a hierarchy is to be used for manual searching, its nodes must be labeled by manually examining the sub-nodes or leaf documents to identify common features.

Many classification systems use a list of key words to classify documents. Typically, the key words are either predefined or selected from the documents being processed so as to characterize those documents more accurately. These key word lists are usually created by counting the frequency of occurrence of every word used in each document of a set. Words are then removed from the list according to one or more criteria. In many cases, words that appear only rarely in the document corpus are removed, because such words occur too infrequently to characterize a category reliably. Conversely, words that appear too frequently are also removed, because such words are assumed to occur commonly in all documents across categories.

In addition, "stop words" and word stems are often removed from the feature list to facilitate the determination of salient features. Stop words are words common to a language, such as "a", "the", "his", and "and", that are considered to carry no meaningful content. Word stems here refer to suffixes such as "-ing", "-end", "-is", and "-able". Unfortunately, generating lists of stop words and word stems is a language-specific task that requires expert knowledge of syntax, grammar, and usage, all of which may change over time. A more flexible method of determining salient features is therefore desired.

The present invention is described by way of exemplary embodiments illustrated in the accompanying drawings, but is not limited to them. In the drawings, like reference numbers denote like elements.

In the following description, various aspects of the present invention are described. It will be apparent to those skilled in the art, however, that the invention may be practiced with only some or with all of these aspects. For purposes of explanation, specific numbers, materials, and configurations are set forth in order to provide a thorough understanding of the invention, but it will also be apparent to those skilled in the art that the invention may be practiced without these specific details. In other instances, well-known features are omitted or simplified so as not to obscure the invention.

Parts of the following description are presented in terms of operations performed by a processor-based device, using terms such as data, storing, selecting, determining, and calculating, consistent with the manner commonly employed by those skilled in the art to convey the substance of their work to others skilled in the art. As is well understood by those skilled in the art, the quantities involved take the form of electrical, magnetic, or optical signals that can be stored, transferred, combined, and otherwise manipulated through the mechanical and electrical components of the processor-based device. The term processor includes microprocessors, microcontrollers, digital signal processors, and the like, whether standalone, adjunct, or embedded.

Various operations are described as multiple discrete steps in turn, in the manner most helpful to understanding the present invention. The order of description, however, should not be construed to imply that these operations are necessarily order dependent. In particular, the operations need not be performed in the order presented. Furthermore, the phrase "in one embodiment" is used repeatedly in the following description. The phrase ordinarily does not refer to the same embodiment, although it may.

In one embodiment of the invention, one or more unique features are extracted from a first group of objects to form a first feature set, and one or more unique features are extracted from a second group of objects to form a second feature set. Statistical discrimination is then applied between the unique features of the first feature set and the unique features of the second feature set to create a ranked list of features. A set of salient features is then identified from the resulting ranked list of features.

In one embodiment, salient features are determined to facilitate efficient classification and categorization of data objects. The data objects to be categorized include, but are not limited to, text files, image files, audio sequences, and video sequences, in both proprietary and generic formats, within very large hierarchical classification trees as well as within non-hierarchical data structures such as flat files. In a text file, for example, a feature can take the form of a word, where the term "word" is ordinarily understood to denote a group of characters that carries some semantic meaning in a language. More generally, a feature can be an N-token gram, where a token is one atomic unit of a language. This includes, for example, N-character grams and N-word grams in English and N-ideogram grams in Asian languages. In an audio sequence, for example, notes, intonation, tempo, duration, pitch, volume, and the like can be used as features for classifying the audio. In video sequences and still images, various pixel attributes such as chrominance and luminance levels can be used as features.

In one embodiment of the invention, once a group of features has been identified from a group of data objects such as electronic documents, a subset of those features is determined to be salient for classifying a given group of data objects. The term "electronic document" is used broadly herein to describe a family of data objects such as those described above that contain one or more structural features. An electronic document can contain text, and can likewise contain audio and/or video content in place of or in addition to text.
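
The notion of an N-token gram can be made concrete with a short Python sketch. It is a minimal illustration only: the function names `char_ngrams` and `word_ngrams`, the whitespace tokenization, and the lowercasing are assumptions chosen for the example, not details taken from the specification.

```python
def char_ngrams(text, n):
    """Return the set of unique N-character grams found in text."""
    # Slide a window of width n over the string; short inputs yield an empty set.
    return {text[i:i + n] for i in range(len(text) - n + 1)}


def word_ngrams(text, n):
    """Return the set of unique N-word grams found in text."""
    # Assumption: words are separated by whitespace and compared case-insensitively.
    words = text.lower().split()
    return {" ".join(words[i:i + n]) for i in range(len(words) - n + 1)}


if __name__ == "__main__":
    print(sorted(char_ngrams("salient", 3)))
    print(sorted(word_ngrams("The quick brown fox", 2)))
```

Returning a set rather than a list reflects the fact that only the presence of a unique feature in a data object matters for the statistics described later, not how often it repeats within that object.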

Once the feature selection criteria have been determined (that is, which of the various text, audio, and video attributes are to be used as determining features within the set of data objects), the process of determining salient features according to the invention is performed. To begin the process, the data objects are divided into two groups. An equation expressing the "odds of relevance" is applied to these two groups of data objects (see, for example, equation (1)). In the equation, O(d) represents the odds that a given data object is a member of the first group of data objects, P(R|d) represents the probability that the data object is a member of the first group, and P(R'|d) represents the probability that the data object is a member of the second group.

$$O(d) = \frac{P(R \mid d)}{P(R' \mid d)} \tag{1}$$

Because a manual grouping of data objects does not provide the probabilities needed to calculate the odds of relevance, equation (1) can be maximized in order to approximate this value. To that end, a logarithm can be applied to both sides of equation (1) in combination with Bayes' formula, which yields equation (2).

$$\log O(d) = \log P(d \mid R) - \log P(d \mid R') + \log P(R) - \log P(R') \tag{2}$$

If a data object is assumed to consist of a set of features {F_i}, and X_i is 1 when a given feature f_i is present in the data object and 0 when f_i is absent, then, with the features treated as independent of one another,

$$\log O(d) = \sum_i \left[ \log P(X_i \mid R) - \log P(X_i \mid R') \right] + \log P(R) - \log P(R') \tag{3}$$

Because log P(R) and log P(R') are constants that do not depend on which features of the data object are selected as salient, a new quantity g(d) is defined as follows.

$$g(d) = \sum_i \left[ \log P(X_i \mid R) - \log P(X_i \mid R') \right] \tag{4}$$

Let p_i = P(X_i = 1 | R) denote the probability that a given feature f_i occurs in a data object of the first group of data objects, and let q_i = P(X_i = 1 | R') denote the probability that f_i occurs in a data object of the second group of data objects. Substituting these quantities and simplifying yields equation (5).

$$g(d) = \sum_i X_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} + \sum_i \log \frac{1 - p_i}{1 - q_i} \tag{5}$$

The second summation does not depend on which features occur in the data object, so it can be dropped, which gives equation (6).

$$g(d) = \sum_i X_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} \tag{6}$$

Because the log function is monotonic, maximizing the ratio

$$\frac{p_i (1 - q_i)}{q_i (1 - p_i)} \tag{7}$$

is sufficient to maximize the corresponding log value. According to one embodiment of the invention, equation (7) is applied to each feature in the combined feature list for the two groups of data objects in order to facilitate the identification of salient features. To do so, p_i is estimated as the number of data objects in the first group that contain feature f_i at least once, divided by the total number of data objects in the first group. Similarly, q_i is estimated as the number of data objects in the second group that contain feature f_i at least once, divided by the total number of data objects in the second group.
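
The estimation described above can be sketched in Python as follows. This is a hedged illustration rather than the implementation of the specification: the ratio of equation (7) is taken here to be p(1 - q) / (q(1 - p)), each data object is assumed to be represented as a set of features, both groups are assumed to be non-empty, and the function names as well as the treatment of a zero denominator (infinity, or a neutral 1.0 for the undefined 0/0 case) are choices made for the example.

```python
import math


def document_frequency(group):
    """Count, for each feature, how many data objects in the group contain it."""
    counts = {}
    for features in group:
        for feature in set(features):  # a feature counts once per data object
            counts[feature] = counts.get(feature, 0) + 1
    return counts


def odds_ratio(p, q):
    """Ratio assumed for equation (7): p(1 - q) / (q(1 - p))."""
    numerator = p * (1 - q)
    denominator = q * (1 - p)
    if denominator == 0:
        # Assumption: an undefined 0/0 is treated as neutral, otherwise as infinite.
        return math.inf if numerator > 0 else 1.0
    return numerator / denominator


def score_features(first_group, second_group):
    """Score every feature of the combined feature list of two non-empty groups."""
    n_first, n_second = len(first_group), len(second_group)
    df_first = document_frequency(first_group)
    df_second = document_frequency(second_group)
    scores = {}
    for feature in set(df_first) | set(df_second):
        p = df_first.get(feature, 0) / n_first     # estimate of p_i
        q = df_second.get(feature, 0) / n_second   # estimate of q_i
        scores[feature] = odds_ratio(p, q)
    return scores
```

A feature that occurs in the first group but never in the second receives an infinite score under these assumptions, which is one reason such features may be handled separately when the ranked list is built.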

FIGS. 2A-C show the operational flow of the salient feature determination function according to one embodiment of the invention. First, a first set of data objects is examined to create a feature list consisting of the unique features found in one or more data objects of at least the first set (block 210). For each unique feature identified, equation (7) is applied to generate a ranked list of features (block 220). At least a subset of the ranked feature list is then selected as the salient features (block 230). The salient feature list can comprise one or more contiguous or non-contiguous groups of elements selected from the ranked feature list. In one embodiment, the first N elements of the ranked list of features are selected as salient, where N may vary according to the requirements of the system. In an alternative embodiment, the last M elements of the ranked list of features are selected as salient, where M may likewise vary according to the requirements of the system.
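
Blocks 220 and 230 can be illustrated with a small Python sketch that orders features by score and then takes the first N and/or last M elements. The function names, the descending sort order, the alphabetical tie-break, and the example scores are all assumptions made for illustration.

```python
import math


def rank_features(scores):
    """Return features ordered from highest to lowest score (ties broken by name)."""
    return sorted(scores, key=lambda feature: (-scores[feature], feature))


def select_salient(ranked, first_n=0, last_m=0):
    """Select the first N and/or the last M elements of a ranked feature list."""
    head = ranked[:first_n]
    tail_start = max(0, len(ranked) - last_m)
    # Skip anything already selected so that overlapping ranges do not duplicate.
    tail = [feature for feature in ranked[tail_start:] if feature not in head]
    return head + tail


if __name__ == "__main__":
    # Hypothetical scores, for example as produced by a per-feature ratio.
    scores = {"jazz": math.inf, "music": 6.0, "news": 1 / 6, "sports": 0.0}
    ranked = rank_features(scores)
    print(select_salient(ranked, first_n=2))
    print(select_salient(ranked, last_m=1))
```

Selecting from both ends of the list corresponds to the non-contiguous case mentioned above: the head holds features characteristic of the first set, while the tail holds features characteristic of the second.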

According to one embodiment of the invention, while the feature list is being created in block 210, the total number of data objects contained in each group of data objects is determined (block 212), and, for each unique feature identified in at least the first group of data objects, the total number of data objects containing that unique feature is also determined (block 214). In addition, the list of unique features may be filtered as desired on the basis of various criteria (block 216). For example, the list of unique features may be pruned by removing features that are not found in at least some minimum number of data objects, features that are shorter than some established minimum length, and/or features that occur fewer than a threshold number of times.
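
The filtering of block 216 might look like the following Python sketch. The three thresholds and their default values are hypothetical, as are the function name and the two dictionaries used as input (one mapping each feature to the number of data objects containing it, one mapping each feature to its total number of occurrences).

```python
def filter_features(doc_counts, occurrence_counts,
                    min_docs=2, min_length=3, min_occurrences=2):
    """Return the set of features that survive all three pruning criteria."""
    kept = set()
    for feature, n_docs in doc_counts.items():
        if n_docs < min_docs:
            continue  # found in too few data objects
        if len(feature) < min_length:
            continue  # shorter than the minimum feature length
        if occurrence_counts.get(feature, 0) < min_occurrences:
            continue  # occurs too few times overall
        kept.add(feature)
    return kept


if __name__ == "__main__":
    doc_counts = {"jazz": 3, "of": 5, "saxophone": 1, "blues": 2}
    occurrence_counts = {"jazz": 7, "of": 40, "saxophone": 4, "blues": 2}
    print(sorted(filter_features(doc_counts, occurrence_counts)))
```

Because the criteria are purely statistical, a short, very common token such as "of" is dropped by the length threshold without any language-specific stop word list.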

According to one embodiment of the invention, applying statistical discrimination to obtain the ranked list of features, as described in block 220 of FIG. 2A, further includes the process shown in FIG. 2C. That is, in applying statistical discrimination (as expressed by equation (7)), a determination is made as to which unique features identified in the first set of data objects are also present in the second set of data objects (block 221), and which unique features identified in the first set of data objects are not present in the second set of data objects (block 222). In the illustrated embodiment, features determined to be present in one set of data objects but not in the other are assigned a relatively high rank in the ranked list of features (block 223), whereas features determined to be present in both sets of data objects are assigned a relatively low rank in the ranked list of features, as determined through statistical discrimination, that is, equation (7) (block 224). Optionally, such features may be ranked further within the ranked feature list on the basis of the total number of data objects that contain each individual feature.
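
The ordering described for blocks 221 to 224 can be sketched in Python as below. The sketch makes several assumptions that are not stated in the specification: exclusive features are ordered among themselves by the number of data objects that contain them, common features are ordered by the ratio p(1 - q) / (q(1 - p)), ties are broken alphabetically, and all names are hypothetical.

```python
import math


def rank_with_exclusives(first_counts, second_counts, n_first, n_second):
    """Rank features of the first set, placing exclusive features above common ones.

    first_counts and second_counts map each feature to the number of data
    objects containing it; n_first and n_second are the totals for each set.
    """
    exclusive = [f for f in first_counts if f not in second_counts]
    common = [f for f in first_counts if f in second_counts]

    def ratio(feature):
        p = first_counts[feature] / n_first
        q = second_counts[feature] / n_second
        numerator, denominator = p * (1 - q), q * (1 - p)
        if denominator == 0:
            # Assumption: 0/0 is treated as neutral, otherwise as infinite.
            return math.inf if numerator > 0 else 1.0
        return numerator / denominator

    # Exclusive features: most widely present first (assumed secondary ordering).
    exclusive.sort(key=lambda f: (-first_counts[f], f))
    # Common features: highest ratio first.
    common.sort(key=lambda f: (-ratio(f), f))
    return exclusive + common
```

With these inputs, a feature that appears only in the first set never competes numerically with the common features, which avoids comparing an infinite ratio against finite ones.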

Example Application
Referring now to FIG. 3, an example application of the salient feature determination means according to one embodiment of the invention is shown. As illustrated, a classifier 300 is provided to classify and categorize data objects such as electronic documents efficiently. Electronic documents include, but are not limited to, text files, image files, and audio and video sequences in proprietary and generic formats, within various data structures including very large hierarchical classification trees and flat file formats. Classifier 300 includes a classifier training service 305, which trains classifier 300 to categorize new data objects on the basis of classification rules extracted from a pre-categorized data hierarchy. Classifier 300 likewise includes a classifier categorization service 315, which categorizes new data objects input to classifier 300.

Classifier training service 305 includes an aggregation function 306, the salient feature determination function 308 of the present invention, and a node characterization function 309. In accordance with the illustrated embodiment, content from the pre-categorized data hierarchy is aggregated at each node of the hierarchy, for example through aggregation function 306, to form both content groups and anti-content groups of data. Features are then extracted from each of these groups of data, and salient feature determination function 308 is used to determine which subset of those features is salient. Node characterization function 309 is used to characterize each node of the pre-categorized data hierarchy on the basis of the salient features, and these hierarchy characterizations are stored, for example in data store 310, for further use by classifier categorization service 315.
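
One possible reading of the aggregation step is sketched below in Python. The text does not define precisely how content and anti-content groups are formed, so the sketch assumes that the content group of a node is every data object in that node's subtree and that its anti-content group is every data object in the subtrees of its sibling nodes. The `Node` class and the function names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    """A node of a pre-categorized hierarchy holding data objects and child nodes."""
    name: str
    documents: list = field(default_factory=list)
    children: list = field(default_factory=list)


def subtree_documents(node):
    """Collect the data objects of a node together with those of all descendants."""
    collected = list(node.documents)
    for child in node.children:
        collected.extend(subtree_documents(child))
    return collected


def content_and_anti_content(parent):
    """Map each child name to its (content group, anti-content group) pair.

    Assumption: anti-content is the aggregated content of the sibling subtrees.
    """
    groups = {}
    for child in parent.children:
        content = subtree_documents(child)
        anti_content = [
            document
            for sibling in parent.children
            if sibling is not child
            for document in subtree_documents(sibling)
        ]
        groups[child.name] = (content, anti_content)
    return groups
```

Under this reading, the two groups produced for each node are exactly the first and second groups of data objects between which features are discriminated, so a feature is salient for a node when it separates that node from its siblings.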

Additional information regarding classifier 300, including classifier training service 305 and classifier categorization service 315, is described in US Patent Application No. 51026.P004, entitled "Very-Large-Scale Automatic Categorizer For Web Content", which is filed concurrently with the present application and handled by the same representative, and whose disclosure is incorporated herein by reference.

Classifier Training Service
FIG. 4 shows a functional block diagram of classifier training service 305 of FIG. 3 according to one embodiment of the invention. As shown in FIG. 4, a pre-categorized data hierarchy 402 is provided as input to classifier training service 305 of classifier 300. Pre-categorized data hierarchy 402 represents a set of data objects, such as audio, video, and/or text objects, that have previously been classified and categorized into a subject hierarchy (typically through manual entry by individuals). Pre-categorized data hierarchy 402 may, for example, represent one or more sets of electronic documents previously categorized by a web portal or search engine.

In the illustrated example, aggregation function 406 aggregates content from pre-categorized data hierarchy 402 into content groups and anti-content groups of data in order to increase the discrimination between sibling nodes at each level of the hierarchy. Salient feature determination function 408 operates to extract features from the content groups and anti-content groups of data and to determine which of the extracted features (409) are to be considered salient (409').

In addition, in the illustrated example, node characterization function 309 of FIG. 3 operates to characterize the content groups and anti-content groups of data. In one embodiment, the content groups and anti-content groups of data are characterized on the basis of the salient features that have been determined. In one embodiment, the characterizations are stored in data store 310. Data store 310 can be implemented in the form of any number of data structures, such as a database, a directory structure, or a simple lookup table. In one embodiment of the invention, the classifier parameters for each node are stored in a hierarchical categorization tree whose file structure mimics the pre-categorized data hierarchy.

Example Computer System
FIG. 5 shows an example of a computer system suitable for use in determining salient features according to one embodiment of the invention. As shown, computer system 500 includes one or more processors 502 and system memory 504. In addition, computer system 500 includes mass storage devices 506 (such as diskette, hard drive, and CD-ROM), input/output devices 508 (such as keyboard and cursor control), and communication interfaces 510 (such as network interface cards and modems). The elements are coupled to one another via system bus 512, which represents one or more buses. Where system bus 512 represents multiple buses, they are bridged by one or more bus bridges (not shown).

Each of these elements performs its conventional functions known in the art. In particular, system memory 504 and mass storage 506 are used to store a working copy and a permanent copy of the programming instructions that implement the categorization system of the present invention. The permanent copy of the programming instructions may be loaded into mass storage 506 in the factory or in the field, as described earlier, through a distribution medium (not shown) or through communication interface 510 (from a distribution server (not shown)). The constitution of these elements 502-512 is known, and they are therefore not described further.

It can thus be seen from the above description that a novel method and apparatus for automatically determining salient features for object classification has been described. Although the invention has been described in terms of the above embodiments, those skilled in the art will recognize that the invention is not limited to the embodiments described. The invention can be practiced with modification and alteration within the spirit and scope of the appended claims. The description is therefore to be regarded as illustrative rather than restrictive of the invention.

FIG. 1 illustrates a typical prior art subject hierarchy including multiple decision nodes.
FIG. 2A is a flowchart showing the operational flow of the salient feature determination function according to one embodiment of the invention.
FIG. 2B is a flowchart showing the operational flow of the salient feature determination function according to one embodiment of the invention.
FIG. 2C is a flowchart showing the operational flow of the salient feature determination function according to one embodiment of the invention.
FIG. 3 illustrates the means for determining salient features according to one embodiment of the invention.
FIG. 4 is a functional block diagram illustrating the classifier training service of FIG. 3 according to one embodiment of the invention.
FIG. 5 illustrates a computer system suitable for use in determining salient features according to one embodiment of the invention.

Claims

A method comprising:
Extracting one or more unique features from a first content group of data objects to form a first feature list;
Extracting one or more unique features from a second anti-content group of data objects to form a second feature list;
Creating a ranked list of features by applying statistical discrimination between the unique features of the first feature list and the unique features of the second feature list; and
Identifying a set of salient features from the ranked list of features.
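
The two extraction steps recited above can be illustrated with a minimal sketch. It assumes that data objects are plain-text documents and that a unique feature is a lower-cased alphanumeric token; the function name `extract_unique_features` and the sample documents are hypothetical and are not taken from the specification.

```python
import re

def extract_unique_features(documents):
    """Collect the distinct alphanumeric tokens that occur in a group of documents."""
    features = set()
    for text in documents:
        # Lower-casing and the token pattern are illustrative assumptions only.
        features.update(re.findall(r"[a-z0-9]+", text.lower()))
    return features

# Hypothetical toy groups: documents on a topic versus documents off the topic.
content_group = ["Python tutorial for beginners", "Advanced Python tips"]
anti_content_group = ["Java tutorial for beginners", "Cooking tips"]

first_feature_list = extract_unique_features(content_group)
second_feature_list = extract_unique_features(anti_content_group)
```

In this toy data, "python" appears only in the first feature list, while "tutorial" appears in both lists.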

The method of claim 1, wherein each of the first content group of the data object and the second anti-content group of the data object includes one or more electronic documents.

The method of claim 1, further comprising:
Determining a total number of first data objects comprising the first content group of data objects; and
Determining a total number of second data objects comprising the second anti-content group of data objects.

The method of claim 3, further comprising:
Determining, for each of the one or more unique features forming the first feature list, a number of first data objects in the first content group of data objects that contain at least one instance of the respective unique feature of the first feature list; and
Determining, for each of the one or more unique features forming the second feature list, a number of second data objects in the second anti-content group of data objects that contain at least one instance of the respective unique feature of the second feature list.
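
The counting steps of the two preceding claims (a group total plus, for each feature, the number of objects containing it at least once) might be sketched as follows. The helper names `tokenize` and `document_counts` are hypothetical, and tokenizing on alphanumeric runs is an assumption made for illustration.

```python
import re
from collections import Counter

def tokenize(text):
    """Return the set of distinct tokens in one document (assumed tokenization)."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def document_counts(documents):
    """Return (total number of documents, per-feature document counts)."""
    counts = Counter()
    for text in documents:
        # Updating with a set counts each document at most once per feature,
        # which matches "contains at least one instance".
        counts.update(tokenize(text))
    return len(documents), counts

total, counts = document_counts(["a b a", "b c"])
```

Here `total` is 2, feature "a" is counted once even though it occurs twice in the first document, and feature "b" is counted twice.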

The method of claim 4, wherein creating the ranked list comprises:
Identifying unique features of the first feature list that are not present in the second feature list as exclusive features;
Identifying unique features of the first feature list that are also present in the second feature list as common features; and
Ordering the ranked list so that the exclusive features are ranked higher than the common features in the ranked list.

The method of claim 5, further comprising:
Applying a probability function to each of the common features to obtain a result vector, the probability function including a ratio of a quotient of the number of first data objects divided by the total number of first data objects to a quotient of the number of second data objects divided by the total number of second data objects; and
Ordering the common features in the ranked list based at least in part on the result vector of the probability function.

The method of claim 5, wherein the exclusive features are further ranked based on the number of the first data objects.
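
One possible reading of the ranking described in the three preceding claims is sketched below. Exclusive features are placed first, ordered by their document count in the first group. Common features follow, ordered by the ratio (n1 / N1) / (n2 / N2), where n1 and n2 are the per-feature document counts and N1 and N2 are the group totals. The descending sort direction, the alphabetical tie-break, and the name `rank_features` are assumptions made for illustration.

```python
def rank_features(n1, total1, n2, total2):
    """Rank features of the first list: exclusive features first, then common ones.

    n1, n2: dicts mapping feature -> number of documents containing it.
    total1, total2: total number of documents in each group.
    """
    exclusive = [f for f in n1 if f not in n2]
    common = [f for f in n1 if f in n2]

    # Assumption: exclusive features seen in more first-group documents rank higher.
    exclusive.sort(key=lambda f: (-n1[f], f))

    def ratio(f):
        # Quotient for the first group divided by the quotient for the second group.
        return (n1[f] / total1) / (n2[f] / total2)

    # Assumption: a larger ratio ranks higher; ties are broken alphabetically.
    common.sort(key=lambda f: (-ratio(f), f))
    return exclusive + common

ranked = rank_features(
    {"python": 3, "guide": 1, "tutorial": 2, "for": 3}, 3,
    {"tutorial": 1, "for": 4, "java": 2}, 4,
)
```

With these hypothetical counts, "python" and "guide" are exclusive, and "tutorial" (ratio of about 2.67) precedes "for" (ratio 1.0) among the common features.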

The method of claim 1, wherein identifying the set of salient features from the ranked list of features includes selecting the first N adjacent features from the ranked list of features.

The method of claim 1, wherein identifying the set of salient features from the ranked list of features includes selecting the last M adjacent features from the ranked list of features.
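
Selecting the first N or the last M adjacent features from a ranked list amounts to taking a slice from one end of the list. The sketch below assumes the ranked list is an ordinary Python list; the function names are hypothetical.

```python
def first_n(ranked, n):
    """Return the first n adjacent features of the ranked list."""
    return ranked[:n]

def last_m(ranked, m):
    """Return the last m adjacent features of the ranked list."""
    # ranked[-0:] would return the whole list, so m == 0 is handled explicitly.
    return ranked[-m:] if m > 0 else []

ranked_example = ["python", "guide", "tutorial", "for"]
top_two = first_n(ranked_example, 2)
bottom_one = last_m(ranked_example, 1)
```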

The method of claim 1, wherein each of the unique features includes one or more alphanumeric character groups.

The method of claim 1, further comprising classifying a new data object as most relevant to one of the first content group of data objects and the second anti-content group of data objects, based at least in part on the set of salient features.
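
The claim above does not state how the salient features are applied to a new data object. As a stand-in only, the following sketch assigns a document to the content group when it contains at least a threshold number of salient features, and to the anti-content group otherwise. The threshold rule, the parameter `min_hits`, and the function name `classify` are all assumptions and should not be read as the classifier of the specification.

```python
import re

def classify(text, salient_features, min_hits=1):
    """Label a document 'content' or 'anti-content' by counting salient features.

    This overlap rule is an illustrative assumption, not the patented classifier.
    """
    tokens = set(re.findall(r"[a-z0-9]+", text.lower()))
    hits = len(tokens & set(salient_features))
    return "content" if hits >= min_hits else "anti-content"

salient = {"python", "guide"}
label_a = classify("A short Python guide", salient)
label_b = classify("Cooking with Java beans", salient)
```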

The first content group of data objects includes data objects corresponding to selected nodes of the subject hierarchy having a plurality of nodes and some related subnodes of the selected nodes;
The method of claim 1, wherein the second anti-content group of data objects includes data objects corresponding to some related sibling nodes of the selected node and any related subnodes of the sibling node.
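
How the two groups could be assembled from a subject hierarchy is sketched below. It assumes a simple tree in which each node holds its own documents and child nodes, and it gathers all subnodes of the selected node and of each sibling. The `Node` class and the function names are hypothetical.

```python
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    documents: list = field(default_factory=list)
    children: list = field(default_factory=list)

def collect(node):
    """Gather the documents of a node and of all its subnodes."""
    docs = list(node.documents)
    for child in node.children:
        docs.extend(collect(child))
    return docs

def build_groups(parent, selected):
    """Return (content group, anti-content group) for a selected child of parent."""
    content = collect(selected)
    anti_content = []
    for sibling in parent.children:
        if sibling is not selected:  # identity check: skip the selected node itself
            anti_content.extend(collect(sibling))
    return content, anti_content

sports = Node("Sports", ["s1"], [Node("Soccer", ["s2"])])
arts = Node("Arts", ["a1"])
science = Node("Science", ["c1"], [Node("Physics", ["c2"])])
root = Node("Root", [], [sports, arts, science])

content_docs, anti_docs = build_groups(root, sports)
```

Selecting the "Sports" node yields its documents and those of "Soccer" as the content group, and the documents under "Arts" and "Science" as the anti-content group.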

A method for identifying salient features, comprising:
Identifying one or more unique features that are members of the first data class;
Examining a second data class to identify the one or more unique features that are also members of the second data class and the one or more unique features that are not members of the second data class;
Generating a ranked list of unique features having an order based on membership of each of the one or more unique features in the second data class; and
Identifying one or more of the ranked list of unique features as prominent.

The method of claim 13, further comprising determining, for each unique feature in the ranked list of unique features, the number of objects in the first data class that include the respective unique feature.

The method of claim 14, wherein generating the ranked list comprises ranking the unique features that are not members of the second data class higher in the ranked list than the unique features that are members of the second data class.

The method of claim 15, wherein generating the ranked list comprises ranking the unique features that belong to a larger number of objects of the first data class higher in the ranked list than the unique features that belong to a smaller number of objects of the first data class.

The method of claim 13, wherein identifying as prominent includes selecting an initial set of N consecutive unique features from the ranked list of unique features.

The method of claim 13, wherein identifying as prominent includes selecting the last M consecutive unique features from the ranked list of unique features.

An apparatus comprising:
A storage medium having stored therein a plurality of program instructions designed to perform a plurality of functions of a category designation service for giving a category name to a data object, including one or more functions for:
Extracting one or more unique features from a first content group of data objects to form a first feature list;
Extracting one or more unique features from a second anti-content group of data objects to form a second feature list;
Creating a ranked list of features by applying statistical discrimination between the unique features of the first feature list and the unique features of the second feature list; and
Identifying a set of salient features from the ranked list of features; and
A processor coupled to the storage medium to execute the program instructions.

The apparatus of claim 19, wherein each of the first content group of the data object and the second anti-content group of the data object includes one or more data objects.

The apparatus of claim 19, wherein the plurality of instructions further include instructions for:
Determining a total number of first data objects comprising the first content group of data objects; and
Determining a total number of second data objects comprising the second anti-content group of data objects.

The apparatus of claim 19, wherein the plurality of instructions further include instructions for:
Determining, for each of the one or more unique features forming the first feature list, a number of first data objects in the first content group of data objects that contain at least one instance of the respective unique feature of the first feature list; and
Determining, for each of the one or more unique features forming the second feature list, a number of second data objects in the second anti-content group of data objects that contain at least one instance of the respective unique feature of the second feature list.

The apparatus of claim 20, wherein the plurality of instructions for creating the ranked list include instructions for:
Identifying unique features of the first feature list that are not present in the second feature list as exclusive features;
Identifying unique features of the first feature list that are also present in the second feature list as common features; and
Ordering the ranked list so that the exclusive features are ranked higher in the ranked list than the common features.

The apparatus of claim 23, wherein the plurality of instructions further include instructions for:
Applying a probability function to each of the common features to obtain a result vector, the probability function including a ratio of a quotient of the number of first data objects divided by the total number of first data objects to a quotient of the number of second data objects divided by the total number of second data objects; and
Ordering the common features in the ranked list based at least in part on the result vector of the probability function.

The apparatus of claim 23, wherein the exclusive features are further ranked based on the number of the first data objects.

The apparatus of claim 19, wherein the plurality of instructions for identifying the set of salient features from the ranked list of features further include an instruction to select the first N adjacent features from the ranked list of features.

The apparatus of claim 19, wherein the plurality of instructions for identifying the set of salient features from the ranked list of features further include an instruction to select the last M adjacent features from the ranked list of features.

The apparatus of claim 19, wherein each of the unique features includes one or more alphanumeric character groups.

The apparatus of claim 19, wherein the plurality of instructions further include instructions for classifying a new data object as most relevant to one of the first content group of data objects and the second anti-content group of data objects, based at least in part on the set of salient features.

The first content group of data objects includes data objects corresponding to selected nodes of the subject hierarchy having a plurality of nodes and some related subnodes of the selected nodes;
The apparatus of claim 19, wherein the second anti-content group of data objects includes data objects corresponding to some related sibling nodes of the selected node and any related subnodes of the sibling nodes.

An apparatus comprising:
A storage medium having stored therein a plurality of program instructions designed to perform a plurality of functions, including one or more functions for:
Identifying one or more unique features that are members of a first data class;
Examining a second data class to identify those of the one or more unique features that are also members of the second data class and those of the one or more unique features that are not members of the second data class;
Generating a ranked list of unique features having an order based on membership of each of the one or more unique features in the second data class; and
Identifying one or more of the ranked list of unique features as prominent; and
A processor coupled to the storage medium to execute the program instructions.

The apparatus of claim 31, wherein the plurality of instructions include instructions for determining, for each unique feature in the ranked list of unique features, a number of objects in the first data class that include the respective unique feature.

The apparatus of claim 32, wherein the plurality of instructions for generating the ranked list include instructions that rank the unique features that are not members of the second data class higher in the ranked list than the unique features that are members of the second data class.

The apparatus of claim 33, wherein the plurality of instructions for generating the ranked list include instructions that rank the unique features that belong to a larger number of objects of the first data class higher in the ranked list than the unique features that belong to a smaller number of objects of the first data class.

The apparatus of claim 31, wherein the plurality of instructions for identifying as prominent include an instruction that selects an initial set of N consecutive unique features from the ranked list of unique features.

The apparatus of claim 31, wherein the plurality of instructions for identifying as prominent include an instruction that selects a last set of M consecutive unique features from the ranked list of unique features.