EP1210669A1 - Document classification apparatus - Google Patents

Document classification apparatus

Info

Publication number
EP1210669A1
Authority
EP
European Patent Office
Prior art keywords
document
classification
classifier
features
category
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Withdrawn
Application number
EP99946569A
Other languages
English (en)
French (fr)
Inventor
Ah Hwee Tan
Fon Lin Lai
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Kent Ridge Digital Labs
Original Assignee
Kent Ridge Digital Labs
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Kent Ridge Digital Labs filed Critical Kent Ridge Digital Labs
Publication of EP1210669A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00 Computing arrangements using knowledge-based models
    • G06N5/02 Knowledge representation; Symbolic representation
    • G06N5/022 Knowledge engineering; Knowledge acquisition
    • G06N5/025 Extracting rules from data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/353 Clustering; Classification into predefined classes

Definitions

  • This invention relates to apparatus for classifying documents.
  • US 5,461,488 provides routing based on identification of recipient name.
  • classification strategy either profile-based, having a keyword/character profile for each
  • the systems also use only a single knowledge acquisition strategy, either statistically
  • classifier, using the knowledge base, determines a predicted classification for the
  • the classifier being switchable between the modes under user control.
  • the features are preferably formed into a feature vector for input to the classifier and the
  • features preferably comprise classification-associated words or phrases which may appear
  • the extracting means may be arranged to provide a measure of the
  • the classifier may comprise a supervised ART system, preferably an ARAM system of
  • the apparatus may further be operable in knowledge acquisition mode to process a
  • the apparatus may further be operable in a rule insertion sub-mode of the knowledge
  • the apparatus may further comprise a router arranged to route the document to one of a
  • the described embodiment provides a document classification apparatus which allows
  • the apparatus performs learning of a plurality of cases as a batch. During batch learning, the apparatus learns each case one by one and accumulates the classification information into the
  • Besides learning from training data, the apparatus also allows rules to be
  • the apparatus is furthermore able to determine a confidence that the classification of a
  • This confidence value is
  • Figure 1 is a schematic diagram illustrating the structure of the described embodiments of
  • Figure 2 is a diagram illustrating the document classifier of Figure 1 in a document
  • Figure 3 is a diagram illustrating the modes of operation of the embodiments of the invention.
  • Figure 4 is a diagram of an ARAM system used as a document classifier in an
  • apparatus is operable in a knowledge acquisition mode and a document classification mode.
  • knowledge acquisition mode the apparatus learns from training documents and
  • a document text file for example a text file derived from a scanned and OCR processed
  • the document classifier includes a feature extraction
  • ARAM (Adaptive Resonance Associative Map)
  • This classification is associated with a confidence value which, together
  • threshold input by a system administrator 50. If the value exceeds the threshold, the
  • the document is routed to the system administrator 50 for manual routing via path 60.
  • the destinations 52 can also communicate with the system administrator 50 through path
  • In the knowledge acquisition mode, two sub-modes are used.
  • the first, represented by block 100, is based
  • the second is based on rule insertion, represented by block 110.
  • In rule insertion, a feature vector
  • the system administrator can access the document classifier directly via path 70 to
  • Such switching may be used, for example, if a mis-directed
  • the system administrator may
  • the features are extracted and passed to the classifier 30, so that the mis-directed document
  • tokens are individual words that have been reduced to root form (for example, the root form of “selection” is “select”).
  • Other “filtering” options based on sentence structure such as
  • the keyword-based feature sets can be pre-defined manually or generated automatically
  • the algorithm for automatic keyword selection accepts a list of pre-classified (i.e.
  • Processing involves the extraction of all nouns (in root form) from each document and
  • f_rating, a selection rating
  • Count(x) is the total number of occurrences of the keyword in category x
  • Count(x) is the total number of occurrences of the keyword in document x
  • the keyword-per-document ratio (f_Ratio,i) for a category i is the total keyword count (C_i) for the
  • Relative Keyword Count thus gives an indication of the difference between the keyword-
  • the keyword occurs at least once.
  • a measurement of f_DIR for the D_1st documents in the top category is given by:
  • O_1st is the number of documents in the top category in which the keyword occurs. The overall ranking of each keyword is derived by combining these ratings (a keyword-ranking sketch appears after this list).
  • the following example uses a small training set of two categories with 124 relevant
  • the categories are business newspaper articles in the first category and
  • the algorithm allows for the specification of a minimum number K of non-zero keyword
  • keywords are selected in the manner described above.
  • the feature extraction algorithm parses the
  • Keyword counts are then normalized such that the maximum score is 1 and the minimum is 0 (see the feature-extraction sketch after this list).
  • the second article for Category 2 (sports, music and
  • The classifier: Adaptive Resonance Associative Map (ARAM). A sketch of its basic operations appears after this list.
  • ARAM is a family of neural network models that performs incremental supervised
  • the F1a field (300) serves as the input field containing the input activity vector
  • F2 field (320) contains the activities of categories that are used to encode the patterns.
  • ARAM formulates recognition categories of input
  • ARAM discovers during learning is compatible with IF-THEN rule-based representation.
  • each node in the F2 field (320) represents a recognition category associating
  • node constitute a set of rules that link antecedents to consequents.
  • system architecture can be translated into a compact set
  • the ART modules used in ARAM can be ART 1 [1], which categorizes binary patterns,
  • ART 2-A [2]
  • fuzzy ART [3], which categorize both
  • fuzzy ARAM model that is composed of two overlapping
  • ARAM learns a set of
  • recognition categories or rules by training from pre-labeled document sets.
  • ARAM as input A one at a time together with the associated class label input B.
  • Given an input keyword vector A, ARAM first searches for an F2 recognition category
  • templates of the F2 recognition category are modified to contain the intersection of the
  • ARAM learning is stable in the sense that weight values do not oscillate, as
  • Input vectors: The F1a and F1b input vectors are normalized by complement coding that
  • Complement coding represents both the on-response and
  • the complement-coded F1a input vector A is a 2M-
  • the complement-coded F1b input vector B is a 2N-dimensional vector
  • Each F2 category node j is associated with two adaptive weight templates
  • When a category node is selected for encoding, it becomes committed.
  • Fuzzy ARAM dynamics are determined by the choice parameters α_a > 0 and α_b > 0; the
  • category choice is indexed by J, where T_J = max{T_j : for all F2 nodes j}; the fuzzy AND operation is defined by (p ∧ q)_i ≡ min(p_i, q_i), and the norm is defined by |p| ≡ Σ_i |p_i| for vectors p and q.
  • Resonance occurs if the match functions, m_J^a and m_J^b, meet the
  • mismatch reset occurs in which the value of the choice function T_J is set to 0 for the
  • J is an uncommitted node, and then take β_a < 1 and β_b < 1 after the category node is
  • a rule is typically in the IF-THEN format
  • the rule insertion algorithm creates a keyword frequency vector in which the frequency
  • ARAM first searches
  • a recognition category is created to encode a keyword template consisting of “Stock”, “Share”, and
  • Rule insertion proceeds in two phases (a rule-insertion sketch appears after this list).
  • the first phase translates each rule into a 2M-
  • N is the number of classes.
  • the vector pairs derived from the rules are then used as training patterns to initialize an
  • recognition categories are associated through the map field.
  • the vigilance parameters ρ_a and ρ_b are each set to 1 to ensure
  • a feature extraction module parses the text to derive a
  • vector A is then presented to the F1a field.
  • Given an input keyword vector A, ARAM first searches for an F2 recognition category
  • the output class is predicted to be K if b_K > b_k for
  • output vector y represents a less extreme contrast enhancement of the F1a to F2 input T_j, in
  • the power rule raises the input T_j to the j-th F2 node to a power p and
  • the power rule converges toward the choice rule as p becomes large.
  • K-max rule: In the spirit of the K Nearest Neighbor (KNN) system, the K-max rule picks
  • the F2 activity values y are
  • the output vector B = (b_1, b_2, ..., b_N) is then read directly from x^b (see the K-max classification sketch after this list).
  • the output class is
  • voting scores normalized by the number of voting ARAMs provide a prediction score
  • v_j is the number of votes given to class j and S_j is the normalized prediction score for the
  • the output class j is the selected predicted
  • the system administrator can switch between the classification mode and the knowledge
  • the message can be either LEARN, INSERT, or CLASSIFY (see the mode-dispatch sketch after this list).
  • the document classifier adjusts the input baseline
  • the document classifier receives a document text together with a
  • the feature extraction module derives the normalized keyword
  • the keyword vector is presented as the input vector to
  • the ARAM classifier is then run with ρ_a < 1 (typically
  • the document classifier receives an IF-THEN rule.
  • the ARAM classifier is
  • the document classifier receives a document text.
  • classification label is then read from the F1b field and returned to the user.
  • While the classifier module has been shown implemented using an ARAM structure, this may be
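
The keyword-selection fragments above mention per-category keyword counts, a keyword-per-document ratio (f_Ratio,i), a relative keyword count, a document ratio f_DIR and an overall ranking, but the exact combining formula is not recoverable from these snippets. The following Python sketch is one plausible reading in which the overall rank is the product of a relative-ratio term and f_DIR; the function names, the minimum-document filter and the combination rule are assumptions for illustration, not quotations from the patent.

    from collections import Counter, defaultdict

    def rank_keywords(docs_by_category, min_docs=2):
        """Hypothetical keyword-ranking sketch.

        docs_by_category maps a category name to a list of documents,
        each document being a list of root-form noun tokens.
        Returns candidate keywords, best-ranked first.
        """
        cat_counts = {c: Counter() for c in docs_by_category}   # keyword counts per category
        doc_presence = defaultdict(lambda: defaultdict(int))    # keyword -> category -> docs containing it
        n_docs = {c: len(docs) for c, docs in docs_by_category.items()}

        for c, docs in docs_by_category.items():
            for doc in docs:
                counts = Counter(doc)
                cat_counts[c].update(counts)
                for kw in counts:
                    doc_presence[kw][c] += 1

        vocabulary = {kw for counts in cat_counts.values() for kw in counts}
        rankings = {}
        for kw in vocabulary:
            # Keyword-per-document ratio per category: total count / number of documents.
            ratios = {c: cat_counts[c][kw] / max(n_docs[c], 1) for c in cat_counts}
            top = max(ratios, key=ratios.get)
            others = [r for c, r in ratios.items() if c != top]
            # Relative keyword count: how far the top category stands out from the rest.
            f_relative = ratios[top] - (sum(others) / len(others) if others else 0.0)
            # Document ratio f_DIR: fraction of top-category documents containing the keyword.
            f_dir = doc_presence[kw][top] / max(n_docs[top], 1)
            if doc_presence[kw][top] < min_docs:                # require minimum document support
                continue
            rankings[kw] = f_relative * f_dir                   # assumed combination of the ratings
        return sorted(rankings, key=rankings.get, reverse=True)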
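
The feature-extraction fragments describe parsing a document, counting occurrences of the selected keywords and normalizing the counts so that the maximum score is 1 and the minimum is 0. A minimal sketch assuming straightforward min-max normalization; the function name and example data are illustrative only.

    def keyword_feature_vector(tokens, keywords):
        """Count each selected keyword in the token list and min-max
        normalize the counts so the largest count maps to 1."""
        counts = [tokens.count(kw) for kw in keywords]
        lo, hi = min(counts), max(counts)
        if hi == lo:                                  # no spread: return an all-zero vector
            return [0.0] * len(keywords)
        return [(c - lo) / (hi - lo) for c in counts]

    # Toy example: a short business snippet scored against three keywords
    tokens = ["stock", "share", "market", "stock", "profit"]
    print(keyword_feature_vector(tokens, ["stock", "share", "goal"]))   # [1.0, 0.5, 0.0]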
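
Several fragments reference complement coding, the fuzzy AND and norm operations, a choice function, vigilance-based match testing and template learning. The sketch below implements these standard fuzzy ART/ARAM building blocks in Python; the single-channel form of the choice function and the default parameter values are simplifying assumptions rather than the patent's exact formulation.

    import numpy as np

    def complement_code(a):
        """Map an M-dimensional vector in [0, 1] to the 2M-dimensional
        complement-coded vector (a, 1 - a)."""
        a = np.asarray(a, dtype=float)
        return np.concatenate([a, 1.0 - a])

    def fuzzy_and(p, q):
        """Fuzzy AND: component-wise minimum."""
        return np.minimum(p, q)

    def norm(p):
        """L1 norm |p| = sum_i p_i (inputs are non-negative)."""
        return float(np.sum(p))

    def choice_values(A, weights_a, alpha_a=0.1):
        """Choice function T_j = |A ^ w_j| / (alpha_a + |w_j|) for each
        F2 node j (single-channel form, used here for illustration)."""
        return np.array([norm(fuzzy_and(A, w)) / (alpha_a + norm(w)) for w in weights_a])

    def passes_vigilance(A, w, rho):
        """Match function m = |A ^ w| / |A| tested against vigilance rho."""
        return norm(fuzzy_and(A, w)) / norm(A) >= rho

    def learn(A, w, beta=1.0):
        """Template update toward A ^ w; beta = 1 is fast learning,
        beta < 1 the slower recoding of committed nodes."""
        return (1.0 - beta) * w + beta * fuzzy_and(A, w)

During search, the node J with the largest choice value is tested against the vigilance criterion; on a mismatch its choice value is set to 0 and the search continues, matching the mismatch-reset fragment above.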
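
The rule-insertion fragments describe translating an IF-THEN rule into a 2M-dimensional complement-coded keyword vector paired with a class vector (N being the number of classes), which is then learned with the vigilance parameters set to 1. A sketch under those assumptions; the rule, keyword list and class list shown are illustrative only.

    import numpy as np

    def rule_to_vector_pair(antecedent_keywords, consequent_class, keyword_list, class_list):
        """Translate an IF-THEN rule into one training pattern pair:
        a 2M-dimensional complement-coded keyword vector A and a
        2N-dimensional complement-coded class vector B."""
        a = np.array([1.0 if kw in antecedent_keywords else 0.0 for kw in keyword_list])
        b = np.array([1.0 if c == consequent_class else 0.0 for c in class_list])
        return np.concatenate([a, 1.0 - a]), np.concatenate([b, 1.0 - b])

    # Example rule: IF "stock" AND "share" THEN business (illustrative rule only)
    keywords = ["stock", "share", "goal", "music"]
    classes = ["business", "leisure"]
    A, B = rule_to_vector_pair({"stock", "share"}, "business", keywords, classes)
    # The pair (A, B) would then be learned with vigilance rho_a = rho_b = 1,
    # so that every distinct rule is encoded by its own recognition category.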
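
For classification, the fragments describe the K-max rule (keeping only the K largest F2 activities), reading an output vector B = (b_1, ..., b_N) from the F1b activity x^b, and predicting the class K with the largest b_K. The sketch below follows that outline; composing x^b as an activity-weighted average of the class templates w_j^b is an assumption.

    import numpy as np

    def kmax_predict(T, weights_b, k=3):
        """K-max rule sketch: keep the K largest choice values T_j, zero the
        rest, combine the class templates w_j^b weighted by those activities
        into x_b, and predict the class with the largest component."""
        T = np.asarray(T, dtype=float)
        y = np.zeros_like(T)
        top = np.argsort(T)[-k:]                     # indices of the K largest activities
        y[top] = T[top]
        if y.sum() == 0.0:
            return None, 0.0
        x_b = (y[:, None] * np.asarray(weights_b, dtype=float)).sum(axis=0) / y.sum()
        predicted = int(np.argmax(x_b))              # class K with b_K > b_k for all other k
        confidence = float(np.max(x_b))              # candidate value for the routing threshold test
        return predicted, confidence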
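
The operational fragments describe a message protocol (LEARN, INSERT, CLASSIFY) and confidence-threshold routing in which a document whose prediction confidence falls below the administrator's threshold is passed to the administrator for manual routing. A schematic dispatcher; the classifier interface, destination objects and fallback queue are hypothetical.

    def handle_message(classifier, message, payload, threshold, destinations, admin_queue):
        """Hypothetical dispatcher for the LEARN, INSERT and CLASSIFY messages."""
        if message == "LEARN":
            document_text, class_label = payload      # learn a pre-labelled training document
            classifier.learn(document_text, class_label)
        elif message == "INSERT":
            classifier.insert_rule(payload)           # payload is an IF-THEN rule
        elif message == "CLASSIFY":
            label, confidence = classifier.classify(payload)
            if confidence >= threshold:
                destinations[label].deliver(payload)  # confident enough: route automatically
            else:
                admin_queue.append(payload)           # otherwise route to the administrator
            return label, confidence
        else:
            raise ValueError("unknown message: %s" % message)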

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
EP99946569A 1999-08-25 1999-08-25 Document classification apparatus Withdrawn EP1210669A1 (de)

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/SG1999/000089 WO2001014992A1 (en) 1999-08-25 1999-08-25 Document classification apparatus

Publications (1)

Publication Number Publication Date
EP1210669A1 (de) 2002-06-05

Family

ID=20430235

Family Applications (1)

Application Number Title Priority Date Filing Date
EP99946569A Withdrawn EP1210669A1 (de) Document classification apparatus

Country Status (2)

Country Link
EP (1) EP1210669A1 (de)
WO (1) WO2001014992A1 (de)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9298983B2 (en) 2014-01-20 2016-03-29 Array Technology, LLC System and method for document grouping and user interface

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030221166A1 (en) * 2002-05-17 2003-11-27 Xerox Corporation Systems and methods for authoritativeness grading, estimation and sorting of documents in large heterogeneous document collections
US7516130B2 (en) 2005-05-09 2009-04-07 Trend Micro, Inc. Matching engine with signature generation
CN102298583B (zh) * 2010-06-22 2016-04-27 深圳市世纪光速信息技术有限公司 Method and system for evaluating the quality of electronic bulletin board web pages
US8787681B1 (en) * 2011-03-21 2014-07-22 First American Data Tree Llc System and method for classifying documents
US10235649B1 (en) 2014-03-14 2019-03-19 Walmart Apollo, Llc Customer analytics data model
US10346769B1 (en) 2014-03-14 2019-07-09 Walmart Apollo, Llc System and method for dynamic attribute table
US10235687B1 (en) 2014-03-14 2019-03-19 Walmart Apollo, Llc Shortest distance to store
US10565538B1 (en) 2014-03-14 2020-02-18 Walmart Apollo, Llc Customer attribute exemption
US10733555B1 (en) 2014-03-14 2020-08-04 Walmart Apollo, Llc Workflow coordinator
WO2016048295A1 (en) * 2014-09-24 2016-03-31 Hewlett Packard Enterprise Development Lp Assigning a document to partial membership in communities
CN109614606B (zh) * 2018-10-23 2023-02-03 中山大学 Method and device for classifying and predicting the fine range of long-text cases based on document embedding
WO2022101383A1 (en) * 2020-11-13 2022-05-19 Detectsystem Lab A/S Document uniqueness verification

Family Cites Families (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JPH03290774A (ja) * 1990-04-06 1991-12-20 Fuji Facom Corp Device for extracting text regions from document images
GB2278705A (en) * 1993-06-01 1994-12-07 Vernon John Spencer Facsimile machine
US5566273A (en) * 1993-12-30 1996-10-15 Caterpillar Inc. Supervised training of a neural network
AUPN431595A0 (en) * 1995-07-24 1995-08-17 Co-Operative Research Centre For Sensor Signal And Information Processing Selective attention adaptive resonance theory
US5794236A (en) * 1996-05-29 1998-08-11 Lexis-Nexis Computer-based system for classifying documents into a hierarchy and linking the classifications to the hierarchy
JPH1139313A (ja) * 1997-07-24 1999-02-12 Nippon Telegr & Teleph Corp <Ntt> Automatic document classification system, method for generating a knowledge base for document classification, and recording medium storing the program therefor
JPH1185797A (ja) * 1997-09-01 1999-03-30 Canon Inc Automatic document classification device, learning device, classification device, automatic document classification method, learning method, classification method, and storage medium
JPH1185796A (ja) * 1997-09-01 1999-03-30 Canon Inc Automatic document classification device, learning device, classification device, automatic document classification method, learning method, classification method, and storage medium

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See references of WO0114992A1 *

Also Published As

Publication number Publication date
WO2001014992A1 (en) 2001-03-01

Similar Documents

Publication Publication Date Title
Antonie et al. Text document categorization by term association
Payne et al. Interface agents that learn an investigation of learning issues in a mail agent interface
Drucker et al. Support vector machines for spam categorization
Schapire et al. Boosting and Rocchio applied to text filtering
Nigam Using unlabeled data to improve text classification
Dumais et al. Inductive learning algorithms and representations for text categorization
Del Castillo et al. A multistrategy approach for digital text categorization from imbalanced documents
US6314420B1 (en) Collaborative/adaptive search engine
AU2002350112B8 (en) Systems, methods, and software for classifying documents
Awad et al. Machine Learning methods for E-mail Classification
Godbole et al. Scaling multi-class support vector machines using inter-class confusion
GB2369698A (en) Theme-based system and method for classifying patent documents
US20050086045A1 (en) Question answering system and question answering processing method
WO2001014992A1 (en) Document classification apparatus
Mostafa et al. Automatic classification using supervised learning in a medical document filtering application
Saleh et al. A semantic based Web page classification strategy using multi-layered domain ontology
Moh'd Mesleh et al. Support vector machine text classification system: Using Ant Colony Optimization based feature subset selection
WO2021189583A1 (zh) Interactive personalized search method driven by a restricted Boltzmann machine
Cao et al. An e-mail filtering approach using neural network
Bickel et al. Learning from message pairs for automatic email answering
Li et al. User profile model: a view from artificial intelligence
Li et al. Combining multiple email filters based on multivariate statistical analysis
GB2442286A (en) Categorisation of data e.g. web pages using a model
Méndez et al. Analyzing the impact of corpus preprocessing on anti-spam filtering software
Tan Predictive self-organizing networks for text categorization

Legal Events

Date Code Title Description
PUAI Public reference made under article 153(3) epc to a published international application that has entered the european phase

Free format text: ORIGINAL CODE: 0009012

17P Request for examination filed

Effective date: 20020131

AK Designated contracting states

Kind code of ref document: A1

Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE

RBV Designated contracting states (corrected)

Designated state(s): DE FR GB

17Q First examination report despatched

Effective date: 20061228

STAA Information on the status of an ep patent application or granted ep patent

Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN

18D Application deemed to be withdrawn

Effective date: 20100302