EP1543437A1 - Method and apparatus for automatically determining salient features for object classification - Google Patents
- Publication number
- EP1543437A1
- Authority
- EP
- European Patent Office
- Prior art keywords
- features
- data objects
- unique features
- list
- ranked list
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/31—Indexing; Data structures therefor; Storage structures
- G06F16/313—Selection or weighting of terms for indexing
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
Definitions
- Salient words are either predefined or selected from the documents being processed to characterize those documents more accurately.
- These salient word lists are created by counting the frequency of occurrence of every word in each document of a set. Words are then removed from the word lists according to one or more criteria. Words that occur too few times within the corpus are often eliminated because they are too rare to reliably distinguish among categories, and words that occur too frequently are eliminated because they are assumed to occur in all documents across categories. Further, "stop words" and word stems are often eliminated from feature lists to facilitate salient feature determination.
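The frequency-based filtering described in the definitions above can be sketched roughly as follows. This is only an illustration, not the method claimed in the application: the function name `salient_word_list`, the thresholds `min_count` and `max_doc_fraction`, the tiny `STOP_WORDS` set, and the regex tokenizer are all assumptions introduced here, and stem handling is omitted.

```python
import re
from collections import Counter

# Hypothetical stop-word list (assumption); real lists are much larger.
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "to", "is"}


def tokenize(text):
    """Split text into lowercase alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def salient_word_list(documents, min_count=2, max_doc_fraction=0.8):
    """Return the words that survive frequency and stop-word filtering.

    min_count and max_doc_fraction are illustrative thresholds (assumptions).
    """
    total = Counter()     # occurrences of each word across the corpus
    doc_freq = Counter()  # number of documents that contain each word
    for doc in documents:
        tokens = tokenize(doc)
        total.update(tokens)
        doc_freq.update(set(tokens))

    n_docs = len(documents)
    kept = []
    for word, count in total.items():
        if word in STOP_WORDS:
            continue  # stop words carry little category information
        if count < min_count:
            continue  # too rare to distinguish categories reliably
        if doc_freq[word] / n_docs > max_doc_fraction:
            continue  # too common across documents
        kept.append(word)
    return sorted(kept)


docs = [
    "the cat sat on the mat",
    "the cat chased the dog",
    "the dog barked at the cat",
    "stock prices fell on the market",
]
words = salient_word_list(docs)
```

Lowering `max_doc_fraction` removes more of the widely shared words, while raising `min_count` removes more of the rare ones; suitable values depend on the corpus.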
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Image Processing (AREA)
Abstract
The invention relates to a method and apparatus for automatically determining salient features (308) for object classification. In one embodiment, one or more unique features are extracted from a first content group of objects to form a first feature list, and one or more unique features are extracted from a second anti-content group of objects to form a second feature list. A ranked list of features is then created by applying statistical differentiation between the unique features of the first feature list and the unique features of the second feature list. A set of salient features (308) is identified from the resulting ranked list of features.
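The abstract does not say which statistic is used for the differentiation. As a minimal sketch under that caveat, the example below scores each unique feature by the difference between the fraction of content objects and the fraction of anti-content objects that contain it. That scoring rule, the word-level features, and the names `unique_features`, `ranked_features`, `salient_features`, and `top_n` are assumptions made for illustration only.

```python
import re


def unique_features(objects):
    """Map each unique feature (here: a word) to the number of objects containing it."""
    counts = {}
    for obj in objects:
        for feature in set(re.findall(r"[a-z]+", obj.lower())):
            counts[feature] = counts.get(feature, 0) + 1
    return counts


def ranked_features(content, anti_content):
    """Rank features by an assumed statistic: difference in object fractions."""
    first = unique_features(content)        # first feature list
    second = unique_features(anti_content)  # second feature list
    scores = {}
    for feature in set(first) | set(second):
        scores[feature] = (first.get(feature, 0) / len(content)
                           - second.get(feature, 0) / len(anti_content))
    # Highest score first; ties are broken alphabetically for determinism.
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


def salient_features(content, anti_content, top_n=3):
    """Take the top_n entries of the ranked list as the salient feature set."""
    return [f for f, _ in ranked_features(content, anti_content)[:top_n]]


content_group = [
    "guitar chords and guitar tabs",
    "learn guitar scales",
    "music and guitar lessons",
]
anti_content_group = [
    "stock market news",
    "learn cooking and baking",
    "news and weather",
]
ranking = ranked_features(content_group, anti_content_group)
top = salient_features(content_group, anti_content_group, top_n=1)
```

Features that appear equally often in both groups score zero and fall to the middle of the ranking, so they drop out without a separate stop-word list. Other statistics, such as chi-square, could be substituted in `ranked_features` without changing the surrounding structure.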
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/US2002/030457 WO2004029826A1 (fr) | 2002-09-25 | 2002-09-25 | Method and apparatus for automatically determining salient features for object classification |
Publications (2)
Publication Number | Publication Date |
---|---|
EP1543437A1 (fr) | 2005-06-22 |
EP1543437A4 (fr) | 2008-05-28 |
Family
ID=32041246
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP02807873A Withdrawn EP1543437A4 (fr) | 2002-09-25 | 2002-09-25 | Method and apparatus for automatically determining salient features for object classification |
Country Status (8)
Country | Link |
---|---|
EP (1) | EP1543437A4 (fr) |
JP (1) | JP2006501545A (fr) |
CN (1) | CN100378713C (fr) |
AU (1) | AU2002334669A1 (fr) |
BR (1) | BR0215899A (fr) |
CA (1) | CA2500264A1 (fr) |
MX (1) | MXPA05003249A (fr) |
WO (1) | WO2004029826A1 (fr) |
Families Citing this family (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7576755B2 (en) | 2007-02-13 | 2009-08-18 | Microsoft Corporation | Picture collage systems and methods |
US8024327B2 (en) | 2007-06-26 | 2011-09-20 | Endeca Technologies, Inc. | System and method for measuring the quality of document sets |
US8935249B2 (en) | 2007-06-26 | 2015-01-13 | Oracle Otc Subsidiary Llc | Visualization of concepts within a collection of information |
US9307107B2 (en) * | 2013-06-03 | 2016-04-05 | Kodak Alaris Inc. | Classification of scanned hardcopy media |
US20220309384A1 (en) * | 2021-03-25 | 2022-09-29 | International Business Machines Corporation | Selecting representative features for machine learning models |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6233575B1 (en) * | 1997-06-24 | 2001-05-15 | International Business Machines Corporation | Multilevel taxonomy based on features derived from training documents classification using fisher values as discrimination values |
US20020059219A1 (en) * | 2000-07-17 | 2002-05-16 | Neveitt William T. | System and methods for web resource discovery |
Family Cites Families (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
AU6849196A (en) * | 1995-08-16 | 1997-03-19 | Syracuse University | Multilingual document retrieval system and method using semantic vector matching |
US6539115B2 (en) * | 1997-02-12 | 2003-03-25 | Fujitsu Limited | Pattern recognition device for performing classification using a candidate table and method thereof |
US6018733A (en) * | 1997-09-12 | 2000-01-25 | Infoseek Corporation | Methods for iteratively and interactively performing collection selection in full text searches |
US6353825B1 (en) * | 1999-07-30 | 2002-03-05 | Verizon Laboratories Inc. | Method and device for classification using iterative information retrieval techniques |
2002
- 2002-09-25 AU AU2002334669A, publication AU2002334669A1, status: Abandoned (not active)
- 2002-09-25 JP JP2004539741A, publication JP2006501545A, status: Pending (active)
- 2002-09-25 CN CNB02829663XA, publication CN100378713C, status: Expired - Fee Related (not active)
- 2002-09-25 CA CA002500264A, publication CA2500264A1, status: Abandoned (not active)
- 2002-09-25 EP EP02807873A, publication EP1543437A4, status: Withdrawn (not active)
- 2002-09-25 MX MXPA05003249A, publication MXPA05003249A, status: unknown
- 2002-09-25 WO PCT/US2002/030457, publication WO2004029826A1, status: Application Filing (active)
- 2002-09-25 BR BR0215899-0A, publication BR0215899A, status: IP Right Cessation (not active)
Non-Patent Citations (2)
Title |
---|
GOLDBERG J L ED - INSTITUTE OF ELECTRICAL AND ELECTRONICS ENGINEERS: "CDM: an approach to learning in text categorization" PROCEEDINGS OF THE 7TH INTERNATIONAL CONFERENCE ON TOOLS WITH ARTIFICIAL INTELLIGENCE, HERNDON, VA, US, 5 November 1995 (1995-11-05), - 8 November 1995 (1995-11-08) pages 258-265, XP010153210 IEEE COMPUTER SOC., LOS ALAMITOS, CA, US ISBN: 0-8186-7312-5 * |
See also references of WO2004029826A1 * |
Also Published As
Publication number | Publication date |
---|---|
BR0215899A (pt) | 2005-07-26 |
CA2500264A1 (fr) | 2004-04-08 |
MXPA05003249A (es) | 2005-07-05 |
CN100378713C (zh) | 2008-04-02 |
WO2004029826A1 (fr) | 2004-04-08 |
JP2006501545A (ja) | 2006-01-12 |
CN1669023A (zh) | 2005-09-14 |
EP1543437A4 (fr) | 2008-05-28 |
AU2002334669A1 (en) | 2004-04-19 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US6938025B1 (en) | Method and apparatus for automatically determining salient features for object classification | |
US6826576B2 (en) | Very-large-scale automatic categorizer for web content | |
US7444356B2 (en) | Construction of trainable semantic vectors and clustering, classification, and searching using trainable semantic vectors | |
US5819258A (en) | Method and apparatus for automatically generating hierarchical categories from large document collections | |
US6912550B2 (en) | File classification management system and method used in operating systems | |
US5943670A (en) | System and method for categorizing objects in combined categories | |
US8473532B1 (en) | Method and apparatus for automatic organization for computer files | |
EP1426882A2 (fr) | Information storage and retrieval | |
Asirvatham et al. | Web page classification based on document structure | |
CN114461783A (zh) | Keyword generation method and apparatus, computer device, storage medium, and product | |
EP1543437A1 (fr) | Method and apparatus for automatically determining salient features for object classification | |
Asirvatham et al. | Web page categorization based on document structure | |
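CN112100330B (zh) | Topic search method based on artificial intelligence technology, and system thereof | |
JP2005010848A (ja) | Information retrieval apparatus, information retrieval method, information retrieval program, and recording medium | |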
Chung et al. | Developing a specialized directory system by automatically classifying Web documents | |
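WO2002037328A2 (fr) | Integration of search and classification: scoring and ranking | |
KR20050096912A (ko) | Method and apparatus for automatically determining salient features for object classification | |
JP2004206571A (ja) | Document information presentation method and apparatus, program, and recording medium | |
CN117972025B (zh) | Massive text retrieval and matching method based on semantic analysis | |
JP2000259658A (ja) | Document classification device | |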
Sutanto et al. | Automatic index expansion for concept-based image query | |
JP2023057658A (ja) | Information processing apparatus, computer-executed method for providing information, and program | |
Chouchoulas | A rough set approach to text classification | |
Patrizi | SIBILLA: An implementation of an intelligent library management system | |
JP2004310199A (ja) | Document classification method and document classification program |
Legal Events
Code | Title | Description |
---|---|---|
PUAI | Public reference made under Article 153(3) EPC to a published international application that has entered the European phase | Free format text: ORIGINAL CODE: 0009012 |
17P | Request for examination filed | Effective date: 20050321 |
AK | Designated contracting states | Kind code of ref document: A1. Designated state(s): AT BE BG CH CY CZ DE DK EE ES FI FR GB GR IE IT LI LU MC NL PT SE SK TR |
AX | Request for extension of the European patent | Extension state: AL LT LV MK RO SI |
DAX | Request for extension of the European patent (deleted) | |
A4 | Supplementary search report drawn up and despatched | Effective date: 20080425 |
17Q | First examination report despatched | Effective date: 20080804 |
STAA | Information on the status of an EP patent application or granted EP patent | Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
18D | Application deemed to be withdrawn | Effective date: 20100401 |