KR20050096912A

KR20050096912A - Method and apparatus for automatically determining salient features for object classification

Info

Publication number: KR20050096912A
Application number: KR1020057005238A
Authority: KR
Inventors: 다이니엘 피. 루리쯔; 파진 지. 귈아크
Original assignee: Microsoft Corporation
Priority date: 2005-03-25
Filing date: 2002-09-25
Publication date: 2005-10-06

Abstract

A method and apparatus for automatically determining salient features (308) for object classification is provided. In accordance with one embodiment, one or more unique features are extracted from a first content group of objects to form a first feature list, and one or more unique features are extracted from a second anti-content group of objects to form a second feature list. A ranked list of features is then created by applying statistical differentiation between the unique features of the first feature list and the unique features of the second feature list. A set of salient features (308) is then identified from the resulting ranked list of features.

Description

Method and apparatus for automatically determining salient features for object classification {METHOD AND APPARATUS FOR AUTOMATICALLY DETERMINING SALIENT FEATURES FOR OBJECT CLASSIFICATION}

The present invention relates to the field of data processing. More specifically, the present invention relates to the automatic selection of object features used to classify objects into groups.

The World Wide Web provides an important information resource, with an estimated billions of pages of information available for online viewing and downloading. To use this information efficiently, however, a sensible way to navigate this vast data space is essential.

In the early days of Internet surfing, two basic methods were developed to support web searching. The first approach is to create an indexed database based on the content of web pages gathered by automated search engines that "crawl" the web looking for new and unique pages. Such a database can then be searched using various query techniques, and the results are often ranked by their similarity to the form of the query. The second approach is to group web pages into a categorization hierarchy, typically represented as a tree. The user then descends the hierarchy by making a series of selections, with two or more choices at each level representing salient differences between the subtrees below the decision point, ultimately reaching leaf nodes that contain pages of text and/or multimedia content.

For example, FIG. 1 shows an exemplary prior art subject hierarchy 102 in which a plurality of decision nodes (hereinafter "nodes") 130-136 are arranged hierarchically into multiple parent and/or child nodes, each of which is associated with a unique subject category. For example, node 130 is the parent of nodes 131 and 132, and nodes 131 and 132 are children of node 130. Because nodes 131 and 132 are both children of the same node (node 130), they are said to be siblings of each other. Additional sibling pairs in subject hierarchy 102 include nodes 133 and 134 and nodes 135 and 136. It can be seen from FIG. 1 that node 130 forms a first level 137 of subject hierarchy 102, nodes 131 and 132 form a second level 138, and nodes 133-136 form a third level 139. Node 130 is also referred to as the root node of subject hierarchy 102 because it is not a child of any other node.

The process of creating a hierarchical categorization of web pages presents several challenges. First, the nature of the hierarchy must be defined. This is typically done manually by experts within a particular subject area, in a manner similar to the creation of categories in the Dewey Decimal System used in libraries. Descriptive labels are then given to these categories so that users and categorizers can make appropriate decisions while navigating the hierarchy. Content in the form of individual electronic documents is then placed into the categories by manual search through the hierarchy.

Recently, attention has turned to automating the various stages of this process. Systems exist for automatically categorizing documents from a corpus. For example, some systems use keywords associated with documents to automatically cluster, or group, similar documents. Such clusters can be iteratively grouped into super-clusters to create a hierarchical structure; however, these systems require manual insertion of keywords and produce hierarchies that lack a systematic structure. If the hierarchy is to be used for manual search, labels must be attached to its nodes by manually examining the sub-nodes or leaf documents to identify their common feature(s).

Many classification systems use lists of salient words to classify documents. Typically, salient words are either predefined or selected from the documents being processed, so as to describe the documents more accurately. Such salient word lists are generally created by counting the frequency of occurrence of every word in each set of documents. Words are then removed from the lists according to one or more criteria. Words that occur rarely in the corpus are often removed because they are too scarce to discriminate reliably between categories, while words that occur too frequently are removed because they are assumed to occur in all documents across categories.

In addition, "stop words" and word stems are often removed from feature lists to facilitate the determination of salient features. Stop words are common words of a language, such as "a", "the", "his", and "and", that are felt to carry no meaningful content, whereas word stems represent suffixes such as "-ing", "-end", "-is", and "-able". Unfortunately, creating stop word and word stem lists is a language-specific task that requires expert knowledge of syntax, grammar, and usage, all of which change over time. A more flexible way of determining salient features is therefore desirable.

The present invention is described by way of exemplary embodiments, but not limitations, illustrated in the accompanying drawings, in which like reference numerals denote like elements.

FIG. 1 illustrates an exemplary prior art subject hierarchy comprising a plurality of decision nodes.

FIGS. 2A-2C illustrate the operational flow of a salient feature determination function, in accordance with one embodiment of the invention.

FIG. 3 illustrates an exemplary application of the salient feature determination functions of the present invention, in accordance with one embodiment.

FIG. 4 illustrates a functional block diagram of the classifier training service of FIG. 3, in accordance with one embodiment of the invention.

FIG. 5 illustrates an exemplary computer system suitable for use in determining salient features, in accordance with one embodiment of the invention.

In the following description, various aspects of the present invention will be described. However, it will be apparent to those skilled in the art that the invention may be practiced with only some or all of these aspects. For purposes of explanation, specific numbers, materials, and configurations are set forth in order to provide a thorough understanding of the invention. It will also be apparent to those skilled in the art that the invention may be practiced without these specific details. In other instances, well-known features are omitted or simplified so as not to obscure the invention.

Parts of the description will be presented in terms of operations performed by a processor-based device, using terms such as data, storing, selecting, determining, calculating, and the like, consistent with the manner commonly employed by those skilled in the art to convey the substance of their work to others skilled in the art. As is well understood by those skilled in the art, the quantities take the form of electrical, magnetic, or optical signals capable of being stored, transferred, combined, and otherwise manipulated through the mechanical and electrical components of the processor-based device; and the term processor includes microprocessors, microcontrollers, digital signal processors, and the like, whether standalone, adjunct, or embedded.

Various operations will be described as multiple discrete steps in turn, in a manner that is most helpful in understanding the present invention. However, the order of description should not be construed to imply that these operations are necessarily order dependent. In particular, these operations need not be performed in the order of presentation. Furthermore, the description repeatedly uses the phrase "in one embodiment", which ordinarily does not refer to the same embodiment, although it may.

According to one embodiment of the invention, one or more unique features are extracted from a first group of objects to form a first feature set, and one or more unique features are extracted from a second group of objects to form a second feature set. A ranked list of features is then created by applying statistical differentiation between the unique features of the first feature set and the unique features of the second feature set. A set of salient features is then identified from the resulting ranked list of features.

In one embodiment, salient features are determined to facilitate efficient classification and categorization of data objects, including but not limited to text files, image files, audio sequences, and video sequences, in both proprietary and non-proprietary formats, within very-large-scale hierarchical classification trees as well as within non-hierarchical data structures such as flat files. In a text file, for example, features may take the form of words, where the term "word" is commonly understood to represent a group of letters within a given language having some semantic meaning. More generally, a feature may be an N-token gram, where a token is one atomic element of a language; this includes, for example, N-letter grams and N-word grams in English, as well as N-ideogram grams in Asian languages. In audio sequences, for example, musical notes, intonation, tempo, duration, pitch, volume, and the like may be used as features for classifying the audio, whereas in video sequences and still images, various pixel attributes such as chrominance and luminance levels may be used as features. According to one embodiment of the invention, once a group of features has been identified, for example from a group of electronic documents, a subset of those features is then determined to be salient for the purpose of classifying a given group of data objects. The term "electronic document" is used broadly herein to describe a family of data objects, such as those described above, that include one or more constituent features. Although an electronic document may include text, it may equally include audio and/or video content in place of, or in addition to, text.
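The N-token grams described above can be illustrated with a minimal sketch for text. The function names, the whitespace tokenization, and the lowercasing are assumptions made for this example; the specification does not prescribe a particular tokenizer.

```python
from typing import List, Set


def word_ngrams(text: str, n: int) -> List[str]:
    """Return N-word grams: every run of n consecutive whitespace-separated words."""
    words = text.lower().split()  # assumed tokenization, not taken from the patent
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def letter_ngrams(text: str, n: int) -> List[str]:
    """Return N-letter grams: every run of n consecutive characters."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def unique_features(text: str, n: int = 1) -> Set[str]:
    """Unique N-word grams of one data object; presence matters, not frequency."""
    return set(word_ngrams(text, n))
```

For instance, `word_ngrams("The quick brown fox", 2)` yields the three 2-word grams of the sentence, and `unique_features` collapses repeated words into a single feature per data object.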

Once the feature selection criteria have been determined (that is, which of the various text, audio, or video attributes will be used as the determinative features within the set of data objects), the salient feature determination process of the present invention can be performed. To begin the process, the data objects in question are divided into two groups. An equation representing the "odds of relevance" (see, for example, Equation 1) is then applied to the two groups of data objects, where O(d) represents the odds that a given data object is a member of the first group, P(R|d) represents the probability that the data object is a member of the first group, and P(R'|d) represents the probability that the data object is a member of the second group.
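The image of Equation 1 did not survive in this copy. Assuming the usual odds form implied by the definitions of O(d), P(R|d), and P(R'|d) given above, it would read:

```latex
O(d) = \frac{P(R \mid d)}{P(R' \mid d)} \tag{1}
```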

Because manual grouping of data objects does not provide the probabilities needed to compute the odds of relevance, Equation 1 can be maximized to approximate this value. Accordingly, a logarithm function, together with Bayes' formula, can be applied to both sides of Equation 1 to yield Equation 2.
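Equation 2 is likewise missing from this copy. A reconstruction, assuming Bayes' formula in the form P(R|d) = P(d|R)P(R)/P(d) is applied to the numerator and denominator of the odds (so that P(d) cancels) before taking logarithms:

```latex
\log O(d) = \log P(d \mid R) - \log P(d \mid R') + \log P(R) - \log P(R') \tag{2}
```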

Assume that a data object consists of a set of features {F_i}, and let X_i be 1 or 0 according to whether a given feature f_i is present in or absent from the data object. The expression then becomes Equation 3.
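The equation that belongs here was lost in extraction. A plausible reconstruction, under the added assumption (standard for this kind of model, but not stated in the surviving text) that the features occur independently of one another within each group:

```latex
\log O(d) = \sum_i \left[ \log P(X_i \mid R) - \log P(X_i \mid R') \right]
            + \log P(R) - \log P(R') \tag{3}
```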

Because log P(R) and log P(R') are constant and independent of the features selected as salient within a data object, a new quantity g(d) is defined as in Equation 4.
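Equation 4 is absent from this copy. Assuming g(d) is simply the log odds with the constant terms log P(R) and log P(R') removed, and assuming the per-feature factorization of the preceding step, it would take the form:

```latex
g(d) = \sum_i \left[ \log P(X_i \mid R) - \log P(X_i \mid R') \right] \tag{4}
```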

Assuming that p_i = P(X_i = 1|R) represents the probability that a given feature f_i occurs in a data object in the first group of data objects, and that q_i = P(X_i = 1|R') represents the probability that the feature f_i occurs in a data object in the second group of data objects, substitution and simplification yield Equation 5.
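Equation 5 did not survive extraction. As a sketch of the substitution, assume each X_i is a binary variable, so that P(X_i|R) = p_i^{X_i}(1 - p_i)^{1 - X_i} and P(X_i|R') = q_i^{X_i}(1 - q_i)^{1 - X_i}. Collecting the terms that multiply X_i then gives two summations:

```latex
g(d) = \sum_i X_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)}
     + \sum_i \log \frac{1 - p_i}{1 - q_i} \tag{5}
```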

Because the second summation does not depend on the occurrence of features in the data objects, it can be dropped, yielding Equation 6.
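The image of Equation 6 is missing. Assuming the two-summation form that results from modeling each X_i as a binary variable with probabilities p_i and q_i, dropping the summation that does not involve X_i would leave:

```latex
g(d) = \sum_i X_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} \tag{6}
```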

Because the logarithm function is monotonic, maximizing the ratio given by Equation 7 is sufficient to maximize the corresponding log value. According to one embodiment of the invention, Equation 7 is applied to each feature in the combined feature list for the two groups of data objects to facilitate the identification of salient features. To do so, p_i is estimated as the number of data objects in the first group that contain feature f_i at least once, divided by the total number of data objects in the first group. Similarly, q_i is estimated as the number of data objects in the second group that contain feature f_i at least once, divided by the total number of data objects in the second group.
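The estimates of p_i and q_i described above can be sketched in a few lines. The sketch assumes that the ratio of Equation 7, whose image is missing from this copy, is p_i(1 - q_i) / (q_i(1 - p_i)), as the surrounding derivation suggests. The function names and the handling of a zero denominator are illustrative choices, not taken from the specification.

```python
import math


def estimate_probability(objects_with_feature: int, total_objects: int) -> float:
    """Fraction of a group's data objects that contain the feature at least once."""
    return objects_with_feature / total_objects


def salience_ratio(p: float, q: float) -> float:
    """Assumed form of Equation 7: p(1 - q) / (q(1 - p)).

    Convention chosen for this sketch: a zero denominator gives +inf when the
    numerator is positive, and a neutral 1.0 when both are zero.
    """
    numerator = p * (1.0 - q)
    denominator = q * (1.0 - p)
    if denominator == 0.0:
        return math.inf if numerator > 0.0 else 1.0
    return numerator / denominator
```

A feature found in 3 of 4 objects in the first group and 1 of 4 in the second gives p = 0.75, q = 0.25, and a ratio of 9; a feature equally common in both groups gives a ratio of 1 and so carries no discriminating weight.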

FIGS. 2A-2C illustrate the operational flow of a salient feature determination function, in accordance with one embodiment of the invention. To begin, a first set of data objects is examined to generate a feature list consisting of the unique features present in one or more data objects of at least the first set (block 210). Equation 7 is applied to each of the identified unique features to generate a ranked feature list (block 220), and at least a subset of the ranked feature list is selected as salient features (block 230). The salient features may comprise one or more contiguous or non-contiguous groups of elements selected from the ranked feature list. In one embodiment, the first N elements of the ranked feature list are selected as salient, where N may vary depending on the requirements of the system. In an alternative embodiment, the last M elements of the ranked feature list are selected as salient, where M may likewise vary depending on the requirements of the system.
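The selection step of block 230 can be sketched as simple slicing of an already ranked list. The function name `select_salient` and its parameters are hypothetical; allowing both ends to be chosen in one call is merely one way to illustrate the non-contiguous case mentioned above.

```python
from typing import List, Sequence


def select_salient(ranked: Sequence[str], first_n: int = 0, last_m: int = 0) -> List[str]:
    """Pick the first N and/or the last M elements of a ranked feature list."""
    head = list(ranked[:first_n])
    # max(0, ...) guards against last_m exceeding the length of the list.
    tail = list(ranked[max(0, len(ranked) - last_m):]) if last_m > 0 else []
    # Skip tail elements already selected when the two slices overlap.
    return head + [feature for feature in tail if feature not in head]
```

Calling `select_salient(ranked, first_n=2)` corresponds to the first embodiment, and `select_salient(ranked, last_m=2)` to the alternative embodiment.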

According to one embodiment of the invention, in generating the feature list (block 210), the total number of data objects contained within each group of data objects is determined (block 212), and for each unique feature identified within at least the first group of data objects, the total number of data objects containing that unique feature is also determined (block 214). In addition, the list of unique features may be filtered based on various criteria as desired (block 216). For example, the list of unique features may be pruned to remove features that are not found in at least a given minimum number of data objects, features that are shorter than an established minimum length, and/or features that occur fewer than an assigned number of times.
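Blocks 212 through 216 can be sketched as follows, assuming each data object has already been reduced to its set of unique features. The function names and the default thresholds are hypothetical; only two of the example filter criteria are shown.

```python
from collections import Counter
from typing import Dict, Iterable, Set, Tuple


def count_objects_per_feature(group: Iterable[Set[str]]) -> Tuple[int, Counter]:
    """Return the number of data objects in a group (block 212) and, for each
    unique feature, the number of data objects that contain it (block 214)."""
    total = 0
    counts: Counter = Counter()
    for features in group:
        total += 1
        counts.update(features)  # a set contributes each feature once per object
    return total, counts


def filter_features(counts: Dict[str, int], min_objects: int = 2,
                    min_length: int = 3) -> Dict[str, int]:
    """Prune features found in too few objects or shorter than min_length (block 216)."""
    return {feature: n for feature, n in counts.items()
            if n >= min_objects and len(feature) >= min_length}
```

The per-feature object counts and group totals produced here are exactly the quantities needed to estimate the fraction of objects in each group that contain a feature.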

According to one embodiment of the invention, applying statistical differentiation to obtain a ranked feature list, as described with respect to block 220 of FIG. 2A, further includes the processes illustrated in FIG. 2C. That is, in applying the statistical differentiation (as represented by Equation 7), a determination is made as to which of the unique features identified within the first set of data objects are also present within the second set of data objects (block 221), as well as which of the unique features identified within the first set of data objects are not present within the second set of data objects (block 222). In the illustrated embodiment, features determined to be present in one set of data objects but not in the other are assigned a relatively higher rank within the ranked feature list (block 223), whereas features determined to be present in both sets of data objects are assigned a relatively lower rank, as determined through the statistical differentiation of Equation 7 (block 224). Optionally, features may be further ranked within the ranked feature list based on the total number of data objects that contain each respective feature.
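The ranking of blocks 221 through 224 might be sketched as below. The sketch assumes Equation 7 has the form p(1 - q) / (q(1 - p)), with p and q estimated as per-group object fractions; the function names, the tie-breaking order, and the zero-denominator convention are choices made for this example rather than details from the specification.

```python
import math
from typing import Dict, List, Tuple


def _ratio(p: float, q: float) -> float:
    """Assumed Equation 7; +inf (or a neutral 1.0) when the denominator is zero."""
    numerator, denominator = p * (1.0 - q), q * (1.0 - p)
    if denominator == 0.0:
        return math.inf if numerator > 0.0 else 1.0
    return numerator / denominator


def rank_features(first_counts: Dict[str, int], first_total: int,
                  second_counts: Dict[str, int], second_total: int) -> List[str]:
    """Rank first-set features: those absent from the second set come first
    (block 223), then shared features by descending ratio (block 224).
    Ties are broken by the number of objects containing the feature."""
    def sort_key(feature: str) -> Tuple[int, float, int, str]:
        n_first = first_counts[feature]
        n_second = second_counts.get(feature, 0)
        shared = 0 if n_second == 0 else 1
        ratio = _ratio(n_first / first_total, n_second / second_total)
        return (shared, -ratio, -n_first, feature)

    return sorted(first_counts, key=sort_key)
```

With this ordering, a feature such as "saxophone" that never appears in the second set outranks a shared feature such as "jazz", however strongly the latter favors the first set.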

Example Application

Referring to FIG. 3, a diagram illustrating an exemplary application of the salient feature determination functions of the present invention is shown, in accordance with one embodiment. As shown, a classifier 300 is provided to efficiently classify and categorize data objects such as electronic documents, including but not limited to text files, image files, audio sequences, and video sequences, in both proprietary and non-proprietary formats, within a variety of data structures including very-large-scale hierarchical classification trees and flat file formats. Classifier 300 includes a classifier training service 305 for training classifier 300 to categorize new data objects based on classification rules extracted from a previously categorized data hierarchy, as well as a classifier categorization service 315 for categorizing new data objects input into classifier 300.

Classifier training service 305 includes an aggregation function 306, the salient feature determination function 308 of the present invention, and a node characterization function 309. In the illustrated embodiment, content from the previously categorized data hierarchy is aggregated at each node in the hierarchy, for example through aggregation function 306, to form both a content group and an anti-content group of data. Features are then extracted from each of these groups of data, and a subset of those features is determined to be salient by way of salient feature determination function 308. Node characterization function 309 is used to characterize each node of the previously categorized data hierarchy based on the salient features, and to store such hierarchical characterizations, for example in data store 310, for further use by classifier categorization service 315.

Additional information about classifier 300, including classifier training service 305 and classifier categorization service 315, is described in co-pending U.S. Patent Application No. <51026.P004>, entitled "Very-Large-Scale Automatic Categorizer For Web Content", which was filed contemporaneously with the present application, is commonly assigned to the assignee of the present application, and is incorporated herein by reference in its entirety.

Classifier Training Service

FIG. 4 is a functional block diagram of classifier training service 305 of FIG. 3, in accordance with one embodiment of the invention. As shown in FIG. 4, a previously categorized data hierarchy 402 is provided as input to classifier training service 305 of classifier 300. Previously categorized data hierarchy 402 represents a set of data objects, such as audio, video, and/or text objects, that have previously been classified and categorized into a subject hierarchy (typically through individual manual entry). Previously categorized data hierarchy 402 may represent, for example, one or more sets of electronic documents previously categorized by a web portal or search engine.

In the illustrated example, aggregation function 406 aggregates content from previously categorized data hierarchy 402 into content and anti-content groups of data so as to increase the differentiation between sibling nodes at each level of the hierarchy. Salient feature determination function 408 operates to extract features from the content and anti-content groups of data and to determine which of the extracted features (409) are to be considered salient (409').
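The aggregation step can be pictured with a small sketch. The text above does not spell out how the two groups are composed, so the following is only one plausible reading, stated here as an assumption: a node's content group holds its own documents plus those of its descendants, and its anti-content group holds the content groups of its siblings. The `Node` class and function names are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    name: str
    documents: List[str] = field(default_factory=list)
    children: List["Node"] = field(default_factory=list)


def content_group(node: Node) -> List[str]:
    """Assumed definition: documents at the node and at all of its descendants."""
    docs = list(node.documents)
    for child in node.children:
        docs.extend(content_group(child))
    return docs


def sibling_groups(parent: Node) -> Dict[str, Tuple[List[str], List[str]]]:
    """For each child, pair its content group with an anti-content group
    drawn (by assumption) from the content groups of its siblings."""
    groups = {}
    for child in parent.children:
        anti = [doc for sibling in parent.children if sibling is not child
                for doc in content_group(sibling)]
        groups[child.name] = (content_group(child), anti)
    return groups
```

Under this reading, features that separate a node from its siblings are the ones most likely to rank as salient for that node.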

Also in the illustrated example, node characterization function 309 of FIG. 3 operates to characterize the content and anti-content groups of data. In one embodiment, the content and anti-content groups of data are characterized based on the determined salient features. In one embodiment, the characterizations are stored in data store 310, which may be implemented in the form of any number of data structures, such as a database, a directory structure, or a simple lookup table. In one embodiment of the invention, the parameters of the classifiers for each node are stored in a hierarchical categorization tree having a file structure that mimics the previously categorized data hierarchy.

Example Computer System

FIG. 5 illustrates an exemplary computer system suitable for use in determining salient features, in accordance with one embodiment of the invention. As shown, computer system 500 includes one or more processors 502 and system memory 504. Computer system 500 also includes mass storage devices 506 (such as diskettes, hard drives, CDROMs, and so forth), input/output devices 508 (such as a keyboard, cursor control, and so forth), and communication interfaces 510 (such as network interface cards, modems, and so forth). The elements are coupled to each other via system bus 512, which represents one or more buses. Where system bus 512 represents multiple buses, the buses are bridged by one or more bus bridges (not shown).

Each of these components performs its conventional function known in the art. In particular, system memory 504 and mass storage 506 are employed to store a working copy and a permanent copy of the programming instructions implementing the categorization system of the present invention. The permanent copy of the programming instructions may be loaded into mass storage 506 in the factory, or in the field, as described earlier, through a distribution medium (not shown) or through communication interface 510 (from a distribution server (not shown)). The constitution of these components 502-512 is known, and accordingly will not be further described.

Conclusion and Epilogue

Thus, as can be seen from the above description, a novel method and apparatus for automatically determining salient features for object classification has been described. While the present invention has been described in terms of the above-illustrated embodiments, those skilled in the art will recognize that the invention is not limited to the embodiments described. The present invention can be practiced with modification and alteration within the spirit and scope of the appended claims. The description is thus to be regarded as illustrative rather than restrictive of the present invention.

Claims

A method comprising:

extracting one or more unique features from a first content group of data objects to form a first feature list;

extracting one or more unique features from a second anti-content group of data objects to form a second feature list;

generating a ranked list of features by applying statistical differentiation between the unique features of the first feature list and the unique features of the second feature list; and

identifying a set of salient features from the ranked list of features.

The method of claim 1, wherein

each of the first content group of data objects and the second anti-content group of data objects comprises one or more electronic documents.

The method of claim 1, further comprising:

determining a first total number of data objects comprising the first content group of data objects; and

determining a second total number of data objects comprising the second anti-content group of data objects.

The method of claim 3, further comprising:

determining, for each of the one or more unique features forming the first feature list, a first number of data objects of the first content group of data objects that contain at least one instance of the respective unique feature; and

determining, for each of the one or more unique features forming the second feature list, a second number of data objects of the second anti-content group of data objects that contain at least one instance of the respective unique feature.
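By way of illustration only, the counting steps recited above might be sketched as follows. The sketch assumes that each data object is a text string and that a unique feature is a lower-cased run of alphanumeric characters; the function names and sample data are hypothetical.

```python
import re
from collections import Counter

def unique_features(text):
    """Return the set of unique alphanumeric groupings found in one data object."""
    return set(re.findall(r"[A-Za-z0-9]+", text.lower()))

def object_counts(group):
    """Return (total number of data objects in the group,
    Counter mapping each feature to the number of objects containing it)."""
    counts = Counter()
    for text in group:
        # A set is used so that each object contributes at most 1 per feature.
        counts.update(unique_features(text))
    return len(group), counts

# Hypothetical content and anti-content groups.
content = ["jazz guitar lessons", "jazz piano", "guitar tabs"]
anti_content = ["tax forms", "guitar store tax"]

total_1, counts_1 = object_counts(content)
total_2, counts_2 = object_counts(anti_content)
```

The keys of counts_1 and counts_2 serve as the first and second feature lists, and the values are the per-feature object counts.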

The method of claim 4, wherein

generating the ranked list comprises:

identifying unique features of the first feature list that are not present in the second feature list as exclusive features;

identifying unique features of the first feature list that are also present in the second feature list as common features; and

ordering the ranked list such that the exclusive features are ranked higher in the ranked list than the common features.

The method of claim 5, further comprising:

applying a probability function to each of the common features to obtain a result vector, the probability function comprising a ratio of the first number of data objects divided by the first total number of data objects to the second number of data objects divided by the second total number of data objects; and

ordering the common features in the ranked list based at least in part on the result vector of the probability function.
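By way of illustration only, the ranking might be sketched as follows, assuming the probability function is read as the ratio (n1 / N1) / (n2 / N2), where n1 and n2 are the per-feature object counts and N1 and N2 are the group totals. The sort directions (larger values first) and the alphabetical tie-break are assumptions of this sketch, as are all names.

```python
def rank_features(counts_1, total_1, counts_2, total_2):
    """Rank the features of the first list: exclusive features first,
    then common features ordered by the probability ratio."""
    exclusive = [f for f in counts_1 if f not in counts_2]
    common = [f for f in counts_1 if f in counts_2]

    # Exclusive features: larger object count first (assumed direction),
    # ties broken alphabetically.
    exclusive.sort(key=lambda f: (-counts_1[f], f))

    # Common features: ratio of (n1 / N1) to (n2 / N2), larger first (assumed).
    def ratio(f):
        return (counts_1[f] / total_1) / (counts_2[f] / total_2)

    common.sort(key=lambda f: (-ratio(f), f))
    return exclusive + common

# Hypothetical per-feature object counts for groups of 3 and 2 data objects.
counts_1 = {"jazz": 2, "guitar": 2, "piano": 1, "store": 1}
counts_2 = {"guitar": 1, "store": 2, "tax": 2}
ranked = rank_features(counts_1, 3, counts_2, 2)
first_two = ranked[:2]   # first N features with N = 2
last_one = ranked[-1:]   # last M features with M = 1
```

In this example "jazz" and "piano" are exclusive features, while "guitar" (ratio about 1.33) precedes "store" (ratio about 0.33) among the common features. A set of salient features could then be taken as a slice of the ranked list.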

The method of claim 5, wherein

the exclusive features are further ranked based on the first number of data objects.

The method of claim 1, wherein

identifying the set of salient features from the ranked feature list comprises selecting the first N adjacent features of the ranked feature list.

The method of claim 1, wherein

identifying the set of salient features from the ranked feature list comprises selecting the last M adjacent features of the ranked feature list.

The method of claim 1, wherein

each of the unique features comprises a grouping of one or more alphanumeric characters.

The method of claim 1, further comprising:

classifying, based at least in part on the set of salient features, a new data object as most relevant to one of the first content group of data objects and the second anti-content group of data objects.

The method of claim 1, wherein

the first content group of data objects comprises data objects corresponding to a selected node of a topical hierarchy having a plurality of nodes and to any associated sub-nodes of the selected node, and

the second anti-content group of data objects comprises data objects corresponding to any associated sibling nodes of the selected node and any associated sub-nodes of the sibling nodes.
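By way of illustration only, forming the two groups from a hierarchy might be sketched as follows. The dictionary representation of the hierarchy, the node names, and the function names are assumptions of this sketch.

```python
def collect(tree, node):
    """Return the data objects of `node` and of all of its sub-nodes."""
    objects = list(tree[node]["objects"])
    for child in tree[node]["children"]:
        objects.extend(collect(tree, child))
    return objects

def content_and_anti_content(tree, parent, selected):
    """Split data objects into the content group (selected node and its sub-nodes)
    and the anti-content group (sibling nodes and their sub-nodes)."""
    content = collect(tree, selected)
    anti = []
    for sibling in tree[parent]["children"]:
        if sibling != selected:
            anti.extend(collect(tree, sibling))
    return content, anti

# Hypothetical hierarchy: Arts has the children Music and Film; Music has the child Jazz.
tree = {
    "Arts": {"children": ["Music", "Film"], "objects": ["arts-home"]},
    "Music": {"children": ["Jazz"], "objects": ["music-1"]},
    "Jazz": {"children": [], "objects": ["jazz-1", "jazz-2"]},
    "Film": {"children": [], "objects": ["film-1"]},
}
content, anti = content_and_anti_content(tree, "Arts", "Music")
```

Here the content group for Music gathers the objects of Music and Jazz, and the anti-content group gathers the objects of the sibling Film. Objects attached directly to the parent node belong to neither group.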

A method of identifying salient features, comprising:

identifying one or more unique features that are members of a first data class;

examining a second data class to identify which of the one or more unique features are also members of the second data class and which of the one or more unique features are not members of the second data class;

generating a ranked list of the unique features based on the membership of each of the one or more unique features in the second data class; and

identifying one or more unique features of the ranked list as salient.

The method of claim 13, further comprising:

determining, for each unique feature of the ranked list, the number of objects in the first data class that contain the respective unique feature.

The method of claim 14, wherein

generating the ranked list further comprises ranking, in the ranked list, the unique features that are not members of the second data class higher than the unique features that are also members of the second data class.

The method of claim 15, wherein

generating the ranked list further comprises ranking, in the ranked list, the unique features belonging to a larger number of objects of the first data class higher than the unique features belonging to a smaller number of objects of the first data class.
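By way of illustration only, the two ranking criteria recited above might be combined in a single sort key, as sketched below. Applying the object-count criterion within each membership group, the alphabetical tie-break, and the representation of each data object as a set of features are assumptions of this sketch.

```python
def rank_by_membership(first_class, second_class):
    """Rank the unique features of the first data class.
    Each class is a list of feature sets, one set per data object."""
    # Number of objects in the first data class containing each feature.
    count_1 = {}
    for obj in first_class:
        for feature in obj:
            count_1[feature] = count_1.get(feature, 0) + 1

    # All features that are members of the second data class.
    in_second = set().union(*second_class)

    # Sort key: non-members of the second class first (False sorts before True),
    # then larger object count in the first class, then feature name for ties.
    return sorted(count_1, key=lambda f: (f in in_second, -count_1[f], f))

# Hypothetical data classes.
first_class = [{"a", "b"}, {"a", "c"}, {"c", "d"}, {"c"}]
second_class = [{"b", "x"}, {"c"}]
ranked = rank_by_membership(first_class, second_class)
```

In this example "a" and "d" are absent from the second data class and therefore lead the list, with "a" first because it occurs in two objects; "c" and "b" follow, ordered by their object counts of three and one.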

The method of claim 13, wherein

identifying as salient comprises selecting a first set of N consecutive unique features from the ranked list of unique features.

The method of claim 13, wherein

identifying as salient comprises selecting a last set of M consecutive unique features from the ranked list of unique features.

An apparatus comprising:

a storage medium having stored therein a plurality of programming instructions designed to implement a plurality of functions of a category name service for providing category names to data objects, including first one or more functions to extract one or more unique features from a first content group of data objects to form a first feature list, extract one or more unique features from a second anti-content group of data objects to form a second feature list, generate a ranked list of features by applying statistical differentiation between the unique features of the first feature list and the unique features of the second feature list, and identify a set of salient features from the ranked list of features; and

a processor coupled to the storage medium to execute the programming instructions.

The apparatus of claim 19, wherein

each of the first content group of data objects and the second anti-content group of data objects comprises one or more data objects.

The apparatus of claim 19, wherein

the plurality of instructions further comprise instructions to

determine a first total number of data objects comprising the first content group of data objects, and

determine a second total number of data objects comprising the second anti-content group of data objects.

The apparatus of claim 19, wherein

the plurality of instructions further comprise instructions to

determine, for each of the one or more unique features forming the first feature list, a first number of data objects of the first content group of data objects that contain at least one instance of the respective unique feature, and

determine, for each of the one or more unique features forming the second feature list, a second number of data objects of the second anti-content group of data objects that contain at least one instance of the respective unique feature.

The apparatus of claim 20, wherein

the plurality of instructions to generate the ranked list further comprise instructions to

identify unique features of the first feature list that are not present in the second feature list as exclusive features,

identify unique features of the first feature list that are also present in the second feature list as common features, and

order the ranked list such that the exclusive features are ranked higher in the ranked list than the common features.

The apparatus of claim 23, wherein

the plurality of instructions further comprise instructions to

apply a probability function to each of the common features to obtain a result vector, the probability function comprising a ratio of the first number of data objects divided by the first total number of data objects to the second number of data objects divided by the second total number of data objects, and

order the common features in the ranked list based at least in part on the result vector of the probability function.

The apparatus of claim 23, wherein

the exclusive features are further ranked based on the first number of data objects.

The apparatus of claim 19, wherein

the plurality of instructions to identify the set of salient features from the ranked feature list further comprise instructions to select the first N adjacent features of the ranked feature list.

The apparatus of claim 19, wherein

the plurality of instructions to identify the set of salient features from the ranked feature list further comprise instructions to select the last M adjacent features of the ranked feature list.

The apparatus of claim 19, wherein

each of the unique features comprises a grouping of one or more alphanumeric characters.

The apparatus of claim 19, wherein

the plurality of instructions further comprise instructions to

classify, based at least in part on the set of salient features, a new data object as most relevant to one of the first content group of data objects and the second anti-content group of data objects.

The apparatus of claim 19,

An apparatus comprising:

a storage medium having stored therein a plurality of programming instructions designed to implement a plurality of functions, including first one or more functions to identify one or more unique features that are members of a first data class, examine a second data class to identify which of the one or more unique features are also members of the second data class and which of the one or more unique features are not members of the second data class, generate a ranked list of the unique features based on the membership of each of the one or more unique features in the second data class, and identify one or more unique features of the ranked list as salient; and

The apparatus of claim 31, wherein

the plurality of instructions further comprise

instructions to determine, for each unique feature of the ranked list, the number of objects in the first data class that contain the respective unique feature.

The apparatus of claim 32, wherein

the plurality of instructions to generate the ranked list further comprise instructions to rank, in the ranked list, the unique features that are not members of the second data class higher than the unique features that are also members of the second data class.

The apparatus of claim 33, wherein

the plurality of instructions to generate the ranked list further comprise instructions to rank, in the ranked list, the unique features belonging to a larger number of objects of the first data class higher than the unique features belonging to a smaller number of objects of the first data class.

The apparatus of claim 31, wherein

the plurality of instructions to identify as salient further comprise instructions to select a first set of N consecutive unique features from the ranked list of unique features.

The apparatus of claim 31, wherein

the plurality of instructions to identify as salient further comprise instructions to select a last set of M consecutive unique features from the ranked list of unique features.