EP1543437A1 - Method and apparatus for automatically determining salient features for object classification

Method and apparatus for automatically determining salient features for object classification

Info

Publication number
EP1543437A1
EP1543437A1 EP02807873A EP02807873A EP1543437A1 EP 1543437 A1 EP1543437 A1 EP 1543437A1 EP 02807873 A EP02807873 A EP 02807873A EP 02807873 A EP02807873 A EP 02807873A EP 1543437 A1 EP1543437 A1 EP 1543437A1
Authority
EP
European Patent Office
Prior art keywords
features
data objects
unique features
list
ranked list
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP02807873A
Other languages
German (de)
English (en)
Other versions
EP1543437A4 (fr)
Inventor
Daniel P. Lulich
Farzin G. Guilak
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Microsoft Corp
Original Assignee
Microsoft Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Microsoft Corp

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/31 Indexing; Data structures therefor; Storage structures
    • G06F16/313 Selection or weighting of terms for indexing
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes

Definitions

  • Salient words are either predefined or selected from the documents being processed to characterize those documents more accurately.
  • These salient word lists are created by counting the frequency of occurrence of all words for each of a set of documents. Words are then removed from the word lists according to one or more criteria. Often, words that occur too few times within the corpus are eliminated, since such words are too rare to reliably distinguish among categories, whereas words that occur too frequently are eliminated because they are assumed to occur commonly in all documents across categories. Further, "stop words" and word stems are often eliminated from feature lists to facilitate salient feature determination.
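
The frequency-based pruning described above can be sketched roughly as follows. This is a minimal illustration, not the method of any particular system: the stop-word list, the `min_count` and `max_doc_fraction` thresholds, and the function names are all hypothetical, and "occurs too frequently" is approximated here by the fraction of documents that contain a word.

```python
import re
from collections import Counter

# Hypothetical stop-word list; the text names no specific words.
STOP_WORDS = {"the", "a", "an", "and", "of", "in", "to", "is"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def pruned_word_list(documents, min_count=2, max_doc_fraction=0.9):
    """Count words over all documents, then drop stop words, rare words,
    and words that appear in nearly every document.

    The thresholds are illustrative assumptions, not values from the text.
    """
    total = Counter()     # occurrences of each word in the whole corpus
    doc_freq = Counter()  # number of documents that contain each word
    for doc in documents:
        tokens = tokenize(doc)
        total.update(tokens)
        doc_freq.update(set(tokens))

    n_docs = len(documents)
    kept = {}
    for word, count in total.items():
        if word in STOP_WORDS:
            continue  # stop word
        if count < min_count:
            continue  # too rare to distinguish categories reliably
        if doc_freq[word] / n_docs > max_doc_fraction:
            continue  # assumed common to all documents across categories
        kept[word] = count
    return kept
```

With three toy documents `["data cat cat", "data dog", "data cat bird"]`, only `cat` survives under the default thresholds: `data` appears in every document, while `dog` and `bird` occur once each.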

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Image Processing (AREA)

Abstract

The invention relates to a method and an apparatus for automatically determining salient features (308) for object classification. In one embodiment, one or more unique features are extracted from a first content group of objects to form a first feature list, and one or more unique features are extracted from a second anti-content group of objects to form a second feature list. A ranked list of features is then created by applying statistical differentiation between unique features of the first feature list and unique features of the second feature list. A set of salient features (308) is identified from the resulting ranked list of features.
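
The flow summarized in the abstract (two feature lists, a ranked list, then a salient subset) might look roughly like the sketch below. The abstract does not name the statistic used for differentiation, so a smoothed log-ratio of per-group document frequencies stands in for it here purely as an assumption; whitespace tokenization, the `top_n` cutoff, and all function names are likewise hypothetical.

```python
import math
from collections import Counter


def unique_features(objects):
    """Map each unique feature (here: a lowercase word) to the number of
    objects in the group that contain it."""
    counts = Counter()
    for obj in objects:
        counts.update(set(obj.lower().split()))
    return counts


def ranked_features(content, anti_content):
    """Rank features by how strongly they favor the content group.

    The score is a smoothed log-ratio of document frequencies, an
    illustrative stand-in for the unspecified statistical differentiation.
    """
    first = unique_features(content)        # first feature list
    second = unique_features(anti_content)  # second feature list
    scores = {}
    for feature in set(first) | set(second):
        p = (first[feature] + 1) / (len(content) + 2)
        q = (second[feature] + 1) / (len(anti_content) + 2)
        scores[feature] = math.log(p / q)
    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda f: (-scores[f], f))


def salient_features(content, anti_content, top_n=3):
    """Take the top of the ranked list as the salient feature set."""
    return ranked_features(content, anti_content)[:top_n]
```

A feature that occurs equally often in both groups scores zero and sinks to the middle of the ranking, which is the intuition behind contrasting a content group with an anti-content group rather than counting frequencies in one group alone.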
EP02807873A 2002-09-25 2002-09-25 Method and apparatus for automatically determining salient features for object classification Withdrawn EP1543437A4 (fr)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/US2002/030457 WO2004029826A1 (fr) 2002-09-25 2002-09-25 Method and apparatus for automatically determining salient features for object classification

Publications (2)

Publication Number Publication Date
EP1543437A1 (fr) 2005-06-22
EP1543437A4 EP1543437A4 (fr) 2008-05-28

Family

ID=32041246

Family Applications (1)

Application Number Title Priority Date Filing Date
EP02807873A Withdrawn EP1543437A4 (fr) 2002-09-25 2002-09-25 Method and apparatus for automatically determining salient features for object classification

Country Status (8)

Country Link
EP (1) EP1543437A4 (fr)
JP (1) JP2006501545A (fr)
CN (1) CN100378713C (fr)
AU (1) AU2002334669A1 (fr)
BR (1) BR0215899A (fr)
CA (1) CA2500264A1 (fr)
MX (1) MXPA05003249A (fr)
WO (1) WO2004029826A1 (fr)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7576755B2 (en) 2007-02-13 2009-08-18 Microsoft Corporation Picture collage systems and methods
US8024327B2 (en) 2007-06-26 2011-09-20 Endeca Technologies, Inc. System and method for measuring the quality of document sets
US8935249B2 (en) 2007-06-26 2015-01-13 Oracle Otc Subsidiary Llc Visualization of concepts within a collection of information
US9307107B2 (en) * 2013-06-03 2016-04-05 Kodak Alaris Inc. Classification of scanned hardcopy media
US20220309384A1 (en) * 2021-03-25 2022-09-29 International Business Machines Corporation Selecting representative features for machine learning models

Citations (2)

Publication number Priority date Publication date Assignee Title
US6233575B1 (en) * 1997-06-24 2001-05-15 International Business Machines Corporation Multilevel taxonomy based on features derived from training documents classification using fisher values as discrimination values
US20020059219A1 (en) * 2000-07-17 2002-05-16 Neveitt William T. System and methods for web resource discovery

Family Cites Families (4)

Publication number Priority date Publication date Assignee Title
AU6849196A (en) * 1995-08-16 1997-03-19 Syracuse University Multilingual document retrieval system and method using semantic vector matching
US6539115B2 (en) * 1997-02-12 2003-03-25 Fujitsu Limited Pattern recognition device for performing classification using a candidate table and method thereof
US6018733A (en) * 1997-09-12 2000-01-25 Infoseek Corporation Methods for iteratively and interactively performing collection selection in full text searches
US6353825B1 (en) * 1999-07-30 2002-03-05 Verizon Laboratories Inc. Method and device for classification using iterative information retrieval techniques

Non-Patent Citations (2)

Title
GOLDBERG J L ED - INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS: "CDM: an approach to learning in text categorization" PROCEEDINGS OF THE 7TH INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE, HERNDON, VA, US, 5 November 1995 (1995-11-05), - 8 November 1995 (1995-11-08) pages 258-265, XP010153210 IEEE COMPUTER SOC., LOS ALAMITOS, CA, US ISBN: 0-8186-7312-5 *
See also references of WO2004029826A1 *

Also Published As

Publication number Publication date
BR0215899A (pt) 2005-07-26
CA2500264A1 (fr) 2004-04-08
MXPA05003249A (es) 2005-07-05
CN100378713C (zh) 2008-04-02
WO2004029826A1 (fr) 2004-04-08
JP2006501545A (ja) 2006-01-12
CN1669023A (zh) 2005-09-14
EP1543437A4 (fr) 2008-05-28
AU2002334669A1 (en) 2004-04-19

Similar Documents

Publication Publication Date Title
US6938025B1 (en) Method and apparatus for automatically determining salient features for object classification
US6826576B2 (en) Very-large-scale automatic categorizer for web content
US7444356B2 (en) Construction of trainable semantic vectors and clustering, classification, and searching using trainable semantic vectors
US5819258A (en) Method and apparatus for automatically generating hierarchical categories from large document collections
US6912550B2 (en) File classification management system and method used in operating systems
US5943670A (en) System and method for categorizing objects in combined categories
US8473532B1 (en) Method and apparatus for automatic organization for computer files
EP1426882A2 (fr) Information storage and retrieval
Asirvatham et al. Web page classification based on document structure
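CN114461783A (zh) Keyword generation method, apparatus, computer device, storage medium, and product
EP1543437A1 (fr) Method and apparatus for automatically determining salient features for object classification
Asirvatham et al. Web page categorization based on document structure
CN112100330B (zh) Topic search method and system based on artificial intelligence technology
JP2005010848A (ja) Information retrieval device, information retrieval method, information retrieval program, and recording medium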
Chung et al. Developing a specialized directory system by automatically classifying Web documents
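WO2002037328A2 (fr) Integration of search and classification: result and ranking
KR20050096912A (ko) Method and apparatus for automatically determining salient features for object classification
JP2004206571A (ja) Document information presentation method and apparatus, program, and recording medium
CN117972025B (zh) Massive text retrieval and matching method based on semantic analysis
JP2000259658A (ja) Document classification device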
Sutanto et al. Automatic index expansion for concept-based image query
JP2023057658A (ja) Information processing device, computer-executed method for providing information, and program
Chouchoulas A rough set approach to text classification
Patrizi SIBILLA: An implementation of an intelligent library management system
JP2004310199A (ja) Document classification method and document classification program

Legal Events

Date Code Title Description
PUAI Public reference made under Article 153(3) EPC to a published international application that has entered the European phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20050321

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR IE IT LI LU MC NL PT SE SK TR

AX Request for extension of the european patent

Extension state: AL LT LV MK RO SI

DAX Request for extension of the european patent (deleted)
A4 Supplementary search report drawn up and despatched

Effective date: 20080425

17Q First examination report despatched

Effective date: 20080804

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20100401