EP1210669A1 - Dokument-Klassifikations-Vorrichtung (Document classification apparatus) - Google Patents
- Publication number
- EP1210669A1 (application EP99946569A)
- Authority
- EP
- European Patent Office
- Prior art keywords
- document
- classification
- classifier
- features
- category
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Withdrawn
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/02—Knowledge representation; Symbolic representation
- G06N5/022—Knowledge engineering; Knowledge acquisition
- G06N5/025—Extracting rules from data
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/35—Clustering; Classification
- G06F16/353—Clustering; Classification into predefined classes
Definitions
- This invention relates to apparatus for classifying documents.
- US 5,461,488 provides routing based on identification of recipient name.
- Prior systems adopt a single classification strategy, either profile-based, having a keyword/character profile for each category, or rule-based.
- The systems also use only a single knowledge acquisition strategy, either statistically derived from training data or manually specified.
- The classifier, using the knowledge base, determines a predicted classification for the document, the classifier being switchable between the modes under user control.
- The features are preferably formed into a feature vector for input to the classifier, and preferably comprise classification-associated words or phrases which may appear in the document.
- The extracting means may be arranged to provide a measure of the frequency of occurrence of each feature.
- The classifier may comprise a supervised ART system, preferably an ARAM system of the type described below.
- The apparatus may further be operable in knowledge acquisition mode to process a batch of training documents.
- The apparatus may further be operable in a rule insertion sub-mode of the knowledge acquisition mode.
- The apparatus may further comprise a router arranged to route the document to one of a plurality of destinations.
- The described embodiment provides a document classification apparatus which allows both learning from examples and direct rule insertion.
- The apparatus performs learning of a plurality of cases as a batch. During batch learning, the apparatus learns each case one by one and accumulates the classification information into the knowledge base.
- Besides learning from training data, the apparatus also allows rules to be inserted directly.
- The apparatus is furthermore able to determine a confidence that the classification of a document is correct.
- This confidence value is compared with a threshold to decide whether the document can be routed automatically.
- Figure 1 is a schematic diagram illustrating the structure of the described embodiments of the invention.
- Figure 2 is a diagram illustrating the document classifier of Figure 1 in a document classification mode.
- Figure 3 is a diagram illustrating the modes of operation of the embodiments of the invention.
- Figure 4 is a diagram of an ARAM system used as a document classifier in an embodiment of the invention.
- The apparatus is operable in a knowledge acquisition mode and a document classification mode.
- In knowledge acquisition mode the apparatus learns from training documents and inserted rules.
- The input is a document text file, for example a text file derived from a scanned and OCR-processed paper document.
- The document classifier includes a feature extraction module and an Adaptive Resonance Associative Map (ARAM) classifier.
- This classification is associated with a confidence value which is compared with a threshold input by a system administrator 50.
- If the confidence value exceeds the threshold, the document is routed automatically to the appropriate destination 52.
- Otherwise the document is routed to the system administrator 50 for manual routing via path 60.
- The destinations 52 can also communicate with the system administrator 50 through a further path.
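The confidence-based routing step described above can be sketched as follows; this is an illustrative reading of the text, and the names (`route_document`, `DESTINATIONS`, `ADMIN`) are hypothetical rather than taken from the patent:

```python
# Hypothetical sketch of the confidence-based routing step: a document whose
# classification confidence exceeds the administrator's threshold is routed
# automatically to its destination; otherwise it goes to the administrator
# for manual routing.

DESTINATIONS = {"finance": "finance-desk", "sports": "sports-desk"}
ADMIN = "system-administrator"

def route_document(predicted_class: str, confidence: float, threshold: float) -> str:
    """Return the destination for a classified document."""
    if confidence > threshold and predicted_class in DESTINATIONS:
        return DESTINATIONS[predicted_class]
    return ADMIN  # low confidence or unknown class: manual routing
```

The same comparison also drives mode switching: a mis-routed, low-confidence document can be re-submitted in knowledge acquisition mode.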
- In the knowledge acquisition mode two sub-modes are used.
- The first, represented by block 100, is based on learning from training documents.
- The second is based on rule insertion, represented by block 110.
- In rule insertion, a feature vector is derived from each rule.
- The system administrator can access the document classifier directly via path 70 to switch between modes.
- Such switching may be used, for example, if a mis-directed document is detected.
- The system administrator may then re-submit the document in knowledge acquisition mode.
- The features are extracted and passed to the classifier 30, so that the mis-directed document is learned correctly.
- Tokens are individual words that have been reduced to their root form (for example, the root form of "selection" is "select").
- Other "filtering" options based on sentence structure, such as retaining only nouns, may also be applied.
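A toy sketch of the tokenization step under the assumptions above; the patent only states that words are reduced to root form, so the simple suffix-stripping rules here are illustrative (a production system would use a proper stemmer):

```python
# Toy tokenizer/stemmer sketch. The text reduces words to root form
# (e.g. "selection" -> "select"); the suffix rules below are illustrative
# only and are not the patent's algorithm.

import re

_SUFFIXES = ["ion", "ing", "s"]

def tokenize(text: str) -> list[str]:
    """Split text into lowercase word tokens and strip simple suffixes."""
    tokens = re.findall(r"[a-z]+", text.lower())
    stemmed = []
    for tok in tokens:
        for suffix in _SUFFIXES:
            # only strip when a reasonable stem remains
            if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
                tok = tok[: -len(suffix)]
                break
        stemmed.append(tok)
    return stemmed
```

For example, `tokenize("Selection of stocks")` yields `["select", "of", "stock"]`.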
- The keyword-based feature sets can be pre-defined manually or generated automatically.
- The algorithm for automatic keyword selection accepts a list of pre-classified (i.e. pre-labeled) documents.
- Processing involves the extraction of all nouns (in root form) from each document and counting their occurrences.
- Each keyword is given a selection rating, f_rating.
- Count(x): total number of occurrences of the keyword in category x.
- count(x): total number of occurrences of the keyword in document x.
- The keyword-per-document ratio f_ratio,i for a category i is the total keyword count C_i for the category divided by the number of documents in that category in which the keyword occurs at least once.
- The Relative Keyword Count thus gives an indication of the difference between the keyword-per-document ratios across categories.
- A measurement of the document incidence ratio f_DIR for the top category is given by the fraction of that category's documents in which the keyword occurs:
- f_DIR = D_1st / (total number of documents in the top category), where D_1st is the number of documents in the top category in which the keyword occurs. The overall ranking of each keyword is then derived by combining the Relative Keyword Count with f_DIR.
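The keyword-rating computation can be sketched as below. The exact combination formula is truncated in the text, so combining the relative keyword count and the document incidence ratio by a product is an assumption, as is the function name:

```python
# Hedged sketch of the keyword-selection rating: a keyword scores highly when
# its per-document frequency in its top category stands out from the other
# categories AND it appears in many of that category's documents. The product
# used to combine the two measures is an assumption, not the patent's formula.

def keyword_rating(counts_per_category: dict[str, list[int]]) -> float:
    """counts_per_category maps category -> per-document counts of one keyword."""
    # keyword-per-document ratio per category: total count / docs containing it
    ratios = {}
    for cat, counts in counts_per_category.items():
        docs_with_kw = sum(1 for c in counts if c > 0)
        ratios[cat] = (sum(counts) / docs_with_kw) if docs_with_kw else 0.0
    top = max(ratios, key=ratios.get)
    others = [r for cat, r in ratios.items() if cat != top]
    # relative keyword count: top ratio minus the best competing ratio
    f_rkc = ratios[top] - (max(others) if others else 0.0)
    # document incidence ratio within the top category
    counts = counts_per_category[top]
    f_dir = sum(1 for c in counts if c > 0) / len(counts)
    return f_rkc * f_dir
```

A keyword concentrated in one category (e.g. appearing in most business articles but few sports articles) thus outranks one spread evenly across categories.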
- The following example uses a small training set of two categories with 124 relevant documents.
- The categories are business newspaper articles in the first category and sports and music articles in the second.
- The algorithm allows for the specification of a minimum number, K, of non-zero keyword features per document.
- keywords are selected in the manner described above.
- The feature extraction algorithm parses the document text and counts occurrences of each selected keyword.
- Keyword counts are then normalized such that the maximum score is 1 and the minimum is 0.
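The normalization step can be sketched directly; min-max scaling is assumed here since the text states only that the maximum becomes 1 and the minimum 0:

```python
# Min-max normalization of keyword counts to [0, 1], matching the stated
# property that the maximum score becomes 1 and the minimum 0.

def normalize_counts(counts: list[float]) -> list[float]:
    """Scale keyword counts so max -> 1.0 and min -> 0.0."""
    lo, hi = min(counts), max(counts)
    if hi == lo:
        # all counts equal: no contrast to preserve
        return [0.0 for _ in counts]
    return [(c - lo) / (hi - lo) for c in counts]
```

The resulting vector of normalized counts is the keyword feature vector presented to the classifier.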
- For example, the second article falls into Category 2 (sports, music and related articles).
- The Classifier: Adaptive Resonance Associative Map (ARAM)
- ARAM is a family of neural network models that performs incremental supervised learning of recognition categories.
- The F1a field (300) serves as the input field containing the input activity vector.
- The F2 field (320) contains the activities of the category nodes that are used to encode the patterns.
- ARAM formulates recognition categories of input patterns, and the knowledge that ARAM discovers during learning is compatible with IF-THEN rule-based representation.
- Each node in the F2 field (320) represents a recognition category; the weight templates of each node constitute a rule that links antecedents to consequents.
- The system architecture can therefore be translated into a compact set of rules.
- The ART modules used in ARAM can be ART 1 [1], which categorizes binary patterns, ART 2-A [2], which categorizes analog patterns, or fuzzy ART [3], which categorizes both binary and analog patterns.
- The described embodiment uses a fuzzy ARAM model that is composed of two overlapping fuzzy ART modules.
- ARAM learns a set of recognition categories or rules by training from pre-labeled document sets.
- Training documents are presented to ARAM as input A, one at a time, together with the associated class label input B.
- Given an input keyword vector A, ARAM first searches for an F2 recognition category that matches both A and B.
- If one is found, the weight templates of the F2 recognition category are modified to contain the intersection of the templates with the input vectors.
- ARAM learning is stable in the sense that weight values do not oscillate, as templates can only shrink over time.
- Input vectors: the F1a and F1b input vectors are normalized by complement coding, which preserves amplitude information.
- Complement coding represents both the on-response and the off-response to an input vector.
- The complement-coded F1a input vector A is a 2M-dimensional vector.
- The complement-coded F1b input vector B is a 2N-dimensional vector.
- Each F2 category node j is associated with two adaptive weight templates, w_j^a and w_j^b.
- Once a category node is selected for encoding, it becomes committed.
- Fuzzy ARAM dynamics are determined by the choice parameters α_a > 0 and α_b > 0; the learning rates β_a, β_b ∈ [0, 1]; the vigilance parameters ρ_a, ρ_b ∈ [0, 1]; and the contribution parameter γ ∈ [0, 1].
- For each F2 node j, the choice function for an input pair (A, B) is defined by
- T_j = γ |A ∧ w_j^a| / (α_a + |w_j^a|) + (1 − γ) |B ∧ w_j^b| / (α_b + |w_j^b|),
- where the fuzzy AND operation ∧ is defined by (p ∧ q)_i ≡ min(p_i, q_i) and the norm by |p| ≡ Σ_i p_i, for vectors p and q.
- The system selects the category node J with T_J = max{T_j : for all F2 nodes j}.
- Resonance occurs if the match functions, m_J^a = |A ∧ w_J^a| / |A| and m_J^b = |B ∧ w_J^b| / |B|, meet the vigilance criteria m_J^a ≥ ρ_a and m_J^b ≥ ρ_b.
- Otherwise a mismatch reset occurs in which the value of the choice function T_J is set to 0 for the duration of the input presentation, and the search continues.
- It is useful to use fast learning (β_a = β_b = 1) while J is an uncommitted node, and then take β_a < 1 and β_b < 1 after the category node is committed.
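A minimal executable sketch of one category-search step, following standard fuzzy ART conventions (fuzzy AND as elementwise minimum, L1 norm); parameter names mirror the text, while the concrete default values are illustrative:

```python
# Sketch of the core fuzzy ARAM operations for a single F2 node: choice
# function, vigilance/match test, and template update. Default parameter
# values (alpha, gamma) are illustrative assumptions.

def fuzzy_and(p, q):
    """Elementwise minimum (the fuzzy AND operation)."""
    return [min(x, y) for x, y in zip(p, q)]

def norm(p):
    """L1 norm |p| = sum of components (components assumed non-negative)."""
    return sum(p)

def choice(A, B, wa, wb, alpha_a=0.001, alpha_b=0.001, gamma=0.5):
    """Choice function T_j for one F2 node with templates wa, wb."""
    return (gamma * norm(fuzzy_and(A, wa)) / (alpha_a + norm(wa))
            + (1 - gamma) * norm(fuzzy_and(B, wb)) / (alpha_b + norm(wb)))

def resonates(A, B, wa, wb, rho_a, rho_b):
    """Resonance: both match functions meet their vigilance criteria."""
    return (norm(fuzzy_and(A, wa)) / norm(A) >= rho_a
            and norm(fuzzy_and(B, wb)) / norm(B) >= rho_b)

def learn(A, wa, beta=1.0):
    """Template update; beta = 1 (fast learning) keeps the intersection."""
    return [(1 - beta) * w + beta * m for w, m in zip(wa, fuzzy_and(A, wa))]
```

With `beta = 1` the template shrinks to the intersection of itself and the input, which is why weight values never oscillate.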
- A rule is typically in the IF-THEN format.
- The rule insertion algorithm creates a keyword frequency vector in which the entries corresponding to the keywords named in the rule antecedent are set.
- ARAM first searches for an existing recognition category that matches the rule.
- If none exists, a recognition category is created to encode a keyword template consisting of, for example, "Stock", "Share", and the other antecedent keywords.
- Rule insertion proceeds in two phases.
- The first phase translates each rule into a 2M-dimensional complement-coded keyword vector and a class vector, where M is the number of keywords and N is the number of classes.
- The vector pairs derived from the rules are then used as training patterns to initialize an ARAM network.
- The recognition categories are associated through the map field.
- The vigilance parameters ρ_a and ρ_b are each set to 1 to ensure that each distinct rule is encoded by its own recognition category.
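The first phase of rule insertion, translating an IF-THEN rule into a vector pair, can be sketched as follows; the keyword list, class list, and function name are illustrative assumptions:

```python
# Hedged sketch of rule-to-vector translation: an IF-THEN rule whose
# antecedent names keywords and whose consequent names a class becomes a
# complement-coded keyword vector A paired with a one-hot class vector B.
# KEYWORDS and CLASSES here are illustrative, not from the patent.

KEYWORDS = ["stock", "share", "market", "goal"]
CLASSES = ["business", "sports"]

def rule_to_vectors(if_keywords: list[str], then_class: str):
    """Translate IF <keywords> THEN <class> into an (A, B) training pair."""
    a = [1.0 if kw in if_keywords else 0.0 for kw in KEYWORDS]
    A = a + [1.0 - x for x in a]          # complement coding -> 2M components
    B = [1.0 if c == then_class else 0.0 for c in CLASSES]
    return A, B
```

Presenting these pairs with both vigilance parameters at 1 forces exact matching, so each distinct rule ends up in its own recognition category.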
- In classification mode, a feature extraction module parses the text to derive a normalized keyword vector A.
- Vector A is then presented to the F1a field.
- Given the input keyword vector A, ARAM searches for the F2 recognition category that best matches A.
- The output class is predicted to be K if b_K > b_k for all k ≠ K.
- Power rule: the output vector y represents a less extreme contrast enhancement of the F1a-to-F2 input T than the winner-take-all choice rule.
- The power rule raises the input T_j to the j-th F2 node to a power p, so the activity y_j is proportional to T_j^p.
- The power rule converges toward the choice rule as p becomes large.
- K-max rule: in the spirit of the K Nearest Neighbor (KNN) system, the K-max rule picks the K F2 nodes with the highest choice values.
- The F2 activity values y of all other nodes are set to zero.
- The output vector B = (b_1, b_2, ..., b_N) is then read directly from the x_b field.
- The output class is the one with the largest output value.
- Voting: several ARAM networks can be trained, and their voting scores, normalized by the number of ARAM voters, provide a prediction score.
- v_j is the number of votes given to class j and s_j is the normalized prediction score for that class.
- The output class j with the highest score is the selected prediction.
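The K-max prediction rule can be sketched as below; the representation of the class evidence (each F2 node contributing its choice value through a one-hot class template) is an illustrative reading of the text:

```python
# Sketch of the K-max prediction rule: keep the activities of the K
# highest-scoring F2 nodes, zero the rest, accumulate class evidence through
# each node's class template, and predict the class with the most evidence.
# The inputs (node scores, one-hot class templates) are illustrative.

def k_max_predict(T: list[float], class_templates: list[list[int]], K: int) -> int:
    """T[j]: choice value of F2 node j; class_templates[j]: one-hot class."""
    top = sorted(range(len(T)), key=lambda j: T[j], reverse=True)[:K]
    n_classes = len(class_templates[0])
    evidence = [0.0] * n_classes
    for j in top:
        for k in range(n_classes):
            evidence[k] += T[j] * class_templates[j][k]
    return max(range(n_classes), key=lambda k: evidence[k])
```

With K = 1 this reduces to the winner-take-all choice rule; larger K averages evidence over several categories, as in KNN.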
- The system administrator can switch between the classification mode and the knowledge acquisition mode by sending a message to the document classifier.
- The message can be either LEARN, INSERT, or CLASSIFY.
- The document classifier adjusts the input baseline vigilance according to the mode.
- In LEARN mode, the document classifier receives a document text together with a classification label.
- The feature extraction module derives the normalized keyword vector.
- The keyword vector is presented as the input vector to the ARAM classifier.
- The ARAM classifier is then run with the baseline vigilance ρ_a set below 1.
- In INSERT mode, the document classifier receives an IF-THEN rule.
- The ARAM classifier is then run with the vigilance parameters set to 1, as described above.
- In CLASSIFY mode, the document classifier receives a document text.
- The classification label is then read from the F1b field and returned to the user.
- Although the classifier module has been shown implemented using an ARAM structure, this may be replaced by other supervised classifiers.
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/SG1999/000089 WO2001014992A1 (en) | 1999-08-25 | 1999-08-25 | Document classification apparatus |
Publications (1)
Publication Number | Publication Date |
---|---|
EP1210669A1 (de) | 2002-06-05 |
Family
ID=20430235
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
EP99946569A Withdrawn EP1210669A1 (de) | 1999-08-25 | 1999-08-25 | Dokument-klassifikations-vorrichtung |
Country Status (2)
Country | Link |
---|---|
EP (1) | EP1210669A1 (de) |
WO (1) | WO2001014992A1 (de) |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9298983B2 (en) | 2014-01-20 | 2016-03-29 | Array Technology, LLC | System and method for document grouping and user interface |
Families Citing this family (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030221166A1 (en) * | 2002-05-17 | 2003-11-27 | Xerox Corporation | Systems and methods for authoritativeness grading, estimation and sorting of documents in large heterogeneous document collections |
US7516130B2 (en) | 2005-05-09 | 2009-04-07 | Trend Micro, Inc. | Matching engine with signature generation |
CN102298583B (zh) * | 2010-06-22 | 2016-04-27 | 深圳市世纪光速信息技术有限公司 | 一种电子公告板网页质量评价方法和系统 |
US8787681B1 (en) * | 2011-03-21 | 2014-07-22 | First American Data Tree Llc | System and method for classifying documents |
US10235649B1 (en) | 2014-03-14 | 2019-03-19 | Walmart Apollo, Llc | Customer analytics data model |
US10346769B1 (en) | 2014-03-14 | 2019-07-09 | Walmart Apollo, Llc | System and method for dynamic attribute table |
US10235687B1 (en) | 2014-03-14 | 2019-03-19 | Walmart Apollo, Llc | Shortest distance to store |
US10565538B1 (en) | 2014-03-14 | 2020-02-18 | Walmart Apollo, Llc | Customer attribute exemption |
US10733555B1 (en) | 2014-03-14 | 2020-08-04 | Walmart Apollo, Llc | Workflow coordinator |
WO2016048295A1 (en) * | 2014-09-24 | 2016-03-31 | Hewlett Packard Enterprise Development Lp | Assigning a document to partial membership in communities |
CN109614606B (zh) * | 2018-10-23 | 2023-02-03 | 中山大学 | 基于文档嵌入的长文本案件罚金范围分类预测方法及装置 |
WO2022101383A1 (en) * | 2020-11-13 | 2022-05-19 | Detectsystem Lab A/S | Document uniqueness verification |
Family Cites Families (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPH03290774A (ja) * | 1990-04-06 | 1991-12-20 | Fuji Facom Corp | 文書画像の文章領域抽出装置 |
GB2278705A (en) * | 1993-06-01 | 1994-12-07 | Vernon John Spencer | Facsimile machine |
US5566273A (en) * | 1993-12-30 | 1996-10-15 | Caterpillar Inc. | Supervised training of a neural network |
AUPN431595A0 (en) * | 1995-07-24 | 1995-08-17 | Co-Operative Research Centre For Sensor Signal And Information Processing | Selective attention adaptive resonance theory |
US5794236A (en) * | 1996-05-29 | 1998-08-11 | Lexis-Nexis | Computer-based system for classifying documents into a hierarchy and linking the classifications to the hierarchy |
JPH1139313A (ja) * | 1997-07-24 | 1999-02-12 | Nippon Telegr & Teleph Corp <Ntt> | 文書自動分類システム、文書分類向け知識ベース生成方法及びそのプログラムを記録した記録媒体 |
JPH1185797A (ja) * | 1997-09-01 | 1999-03-30 | Canon Inc | 文書自動分類装置、学習装置、分類装置、文書自動分類方法、学習方法、分類方法および記憶媒体 |
JPH1185796A (ja) * | 1997-09-01 | 1999-03-30 | Canon Inc | 文書自動分類装置、学習装置、分類装置、文書自動分類方法、学習方法、分類方法および記憶媒体 |
- 1999-08-25: WO application PCT/SG1999/000089 filed (WO2001014992A1, active Application Filing)
- 1999-08-25: EP application EP99946569A filed (EP1210669A1, not active, Withdrawn)
Non-Patent Citations (1)
Title |
---|
See references of WO0114992A1 * |
Cited By (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9298983B2 (en) | 2014-01-20 | 2016-03-29 | Array Technology, LLC | System and method for document grouping and user interface |
Also Published As
Publication number | Publication date |
---|---|
WO2001014992A1 (en) | 2001-03-01 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Antonie et al. | Text document categorization by term association | |
Payne et al. | Interface agents that learn an investigation of learning issues in a mail agent interface | |
Drucker et al. | Support vector machines for spam categorization | |
Schapire et al. | Boosting and Rocchio applied to text filtering | |
Nigam | Using unlabeled data to improve text classification | |
Dumais et al. | Inductive learning algorithms and representations for text categorization | |
Del Castillo et al. | A multistrategy approach for digital text categorization from imbalanced documents | |
US6314420B1 (en) | Collaborative/adaptive search engine | |
AU2002350112B8 (en) | Systems, methods, and software for classifying documents | |
Awad et al. | Machine Learning methods for E-mail Classification | |
Godbole et al. | Scaling multi-class support vector machines using inter-class confusion | |
GB2369698A (en) | Theme-based system and method for classifying patent documents | |
US20050086045A1 (en) | Question answering system and question answering processing method | |
WO2001014992A1 (en) | Document classification apparatus | |
Mostafa et al. | Automatic classification using supervised learning in a medical document filtering application | |
Saleh et al. | A semantic based Web page classification strategy using multi-layered domain ontology | |
Moh'd Mesleh et al. | Support vector machine text classification system: Using Ant Colony Optimization based feature subset selection | |
WO2021189583A1 (zh) | 基于受限玻尔兹曼机驱动的交互式个性化搜索方法 | |
Cao et al. | An e-mail filtering approach using neural network | |
Bickel et al. | Learning from message pairs for automatic email answering | |
Li et al. | User profile model: a view from artificial intelligence | |
Li et al. | Combining multiple email filters based on multivariate statistical analysis | |
GB2442286A (en) | Categorisation of data e.g. web pages using a model | |
Méndez et al. | Analyzing the impact of corpus preprocessing on anti-spam filtering software | |
Tan | Predictive self-organizing networks for text categorization |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| PUAI | Public reference made under article 153(3) EPC to a published international application that has entered the European phase | Free format text: ORIGINAL CODE: 0009012 |
| 17P | Request for examination filed | Effective date: 20020131 |
| AK | Designated contracting states | Kind code of ref document: A1. Designated state(s): AT BE CH CY DE DK ES FI FR GB GR IE IT LI LU MC NL PT SE |
| RBV | Designated contracting states (corrected) | Designated state(s): DE FR GB |
| 17Q | First examination report despatched | Effective date: 20061228 |
| STAA | Information on the status of an EP patent application or granted EP patent | Free format text: STATUS: THE APPLICATION IS DEEMED TO BE WITHDRAWN |
| 18D | Application deemed to be withdrawn | Effective date: 20100302 |