WO2008042264A2 - Distributed method for integrating data mining and text categorization techniques - Google Patents

Info

Publication number
WO2008042264A2
Authority
WO
WIPO (PCT)
Prior art keywords
documents
document
class
terms
text
Prior art date
Application number
PCT/US2007/020938
Other languages
French (fr)
Other versions
WO2008042264A3 (en)
Inventor
Ali Hadjarian
Original Assignee
Inferx Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inferx Corporation
Publication of WO2008042264A2
Publication of WO2008042264A3

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G06F16/355 Class or cluster creation or modification

Definitions

  • a structured table populated by "concept"-based features extracted from unstructured data is used to facilitate data mining across structured and unstructured databases. This is achieved through the use of a distributed mining algorithm described in the following section.
  • FIG. 6 illustrates one basic form of distributed data mining.
  • Distributed mining is accomplished via a synchronized collaboration of agents 10 as well as a mediator component 12.
  • mediator component 12; see Hadjarian A., Baik S., Bala J., Manthorne C. (2001) "InferAgent - A Decision Tree Induction From Distributed Data Algorithm," 5th
  • the mediator component 12 facilitates the communication among agents 10.
  • each agent 10 has access to its own local database 14 and is responsible for mining the data contained by the database 14.
  • Distributed data mining results in a set of rules generated through a tree induction algorithm.
  • the tree induction algorithm determines the feature which is most discriminatory and then it dichotomizes (splits) the data into classes categorized by this feature.
  • the next significant feature of each of the subsets is then used to further partition them, and the process is repeated recursively until each of the subsets contains only one kind of labeled data.
  • the resulting structure is called a decision tree, where nodes stand for feature discrimination tests, while their exit branches stand for those subclasses of labeled examples satisfying the test.
  • a tree is rewritten to a collection of rules, one for each leaf in the tree. Every path from the root of a tree to a leaf gives one initial rule.
  • the left-hand side of the rule contains all the conditions established by the path, and the right-hand side specifies the classes at the leaf.
  • Each such rule is simplified by removing conditions that do not seem helpful for discriminating the nominated class from other classes.
  • tree induction is accomplished through a partial tree generation process and an Agent-Mediator communication mechanism, such as shown in Figure 5, that executes the following steps:
  • the data mining process starts with the mediator 12 issuing a call to all the agents 10 to start the mining process.
  • Each agent 10 then starts the process of mining its own local data by finding the feature (or attribute) that can best split the data into the various training classes (i.e. the attribute with the highest information gain).
  • the selected attribute is then sent as a candidate attribute to the mediator 12 for overall evaluation.
  • a winner agent 10 is then selected (i.e. the agent whose database includes the attribute with the highest information gain).
  • the winner agent 10 will then continue the mining process by splitting the data using the winning attribute and its associated split value. This split results in the formation of two separate clusters of data (i.e. those satisfying the split criteria and those not satisfying it).
  • the associated indices of the data in each cluster are passed to the mediator 12 to be used by all the other agents 10.
  • the other (i.e. non-winner) agents 10 access the index information passed to the mediator 12 by the winner agent 10 and split their data accordingly.
  • the mining process then continues by repeating the process of candidate feature selection by each of the agents 10.
  • the mediator 12 generates the classification rules by tracking the attribute/split information coming from the various mining agents 10. The generated rules can then be passed on to the various agents 10 for presentation to the user through advanced 3D visualization techniques.
  • Customer profiling, or modeling of a customer's interests, can facilitate personalized purchase offers and recommendations.
  • An online bookstore, for example, can make book recommendations based on the purchase history of its customers. To do so, the bookstore must first generate a model of a customer's interests.
  • Customer C has specific interests in modern philosophy and baking. The bookstore's customer database holds a variety of valuable information on previously purchased items, such as the general topic, price, and the year of publication. Missing from this database, however, is the rich information contained in the textual description of each item. Using this often unstructured textual information in conjunction with the structured data contained in the customer database can potentially yield a more accurate picture of a customer's interests.
  • Step 1- Grouping of documents (i.e. book descriptions) into various categories. Examples of these could be general categories such as "of_interest" and "not_of_interest".
  • the historical data stored in the customer database can of course facilitate such a grouping. While the descriptions of the books purchased by Customer C in the past can be grouped into the "of_interest" category, descriptions of the items not purchased by this customer (or a sample of them) can be used to populate the "not_of_interest" category.
  • Step 2- Selecting the most discriminatory terms (i.e. keywords) for differentiating between the "of_interest" and "not_of_interest" categories. This is achieved in an automated fashion with the help of a Feature Selection algorithm that uses statistics-based measures such as Information Gain.
  • the list of selected features for the "of_interest" category could include terms such as: recipe, baking, philosophy, desserts, Sartre, existentialism, French, culinary, German, morality, Nietzsche, and cookbook.
  • Step 3- Re-representing each document in terms of a numeric vector indicating the presence (e.g., as indicated by a 1) or absence (e.g., as indicated by a 0) of each of the selected terms.
  • Document 1 contains the terms recipe and baking and Document 3 the terms philosophy and existentialism.
  • Document 2 <0, 1, 0, 1, 0, 0, ...>
  • Document 3 <0, 0, 1, 0, 0, 1, ...>
  • a rule learning algorithm is used for this purpose. Examples of rules generated for the "of_interest" category could include:
  • Step 5- Re-representing each document, this time in terms of a numeric vector indicating whether the document can be classified as belonging to a given category using the generated rules for that category and if so which concept (i.e. learned rule) is satisfied by that document.
  • the following vectors indicate that Document 2 belongs to the "of_interest" category and satisfies Concept 7 (i.e., has the terms desserts and culinary) and Document 12 belongs to the "not_of_interest" category.
  • Step 6- Populating a structured database with the above concept vector representation of documents and using this database in conjunction with other existing structured customer databases to generate models of Customer C's interests. This is facilitated by a distributed predictive analytics method as shown in Figures 5 and 6.
  • An example of a generated rule-based model for an item to be recommended to Customer C could include the following:
  • the above example is an application of one form of the present method and system. It should be understood that variations of the method are also contemplated as understood by those skilled in the art. Furthermore, it should be understood that the methods described herein may be embodied in a system, such as a computer, network and the like as understood by those skilled in the art.
  • the system may include one or more processing units, hard drives, RAM, ROM, other forms of memory and other associated structure and features as understood by those skilled in the art. It should be understood that multiple processing units may be used in the system such that one processing unit performs certain functions at one data locale, a second processing unit performs certain functions at a second data locale and a third processing unit acts as a mediator.
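
The agent-mediator steps listed above (local candidate selection by each agent, selection of a winner, a split by the winner, and forwarding of the split indices) can be illustrated with a minimal Python sketch of a single round. Everything in it is an assumption made for illustration: the names Agent, best_candidate, split and mediator_round are hypothetical, attributes are taken to be binary, and the class labels are assumed to be visible to every agent. It is not the InferAgent implementation.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    counts = Counter(labels)
    return -sum((n / total) * math.log2(n / total) for n in counts.values()) if total else 0.0


class Agent:
    """Hypothetical mining agent holding some attributes of the shared records."""

    def __init__(self, name, columns, labels):
        self.name = name
        self.columns = columns  # attribute name -> {common key: 0 or 1}
        self.labels = labels    # common key -> class label (assumed visible to all agents)

    def best_candidate(self, keys):
        """Return (gain, attribute) for the best local binary split over `keys`."""
        base = entropy([self.labels[k] for k in keys])
        best = (0.0, None)
        for attr, values in sorted(self.columns.items()):
            yes = [self.labels[k] for k in keys if values[k] == 1]
            no = [self.labels[k] for k in keys if values[k] == 0]
            gain = base - (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(keys)
            if gain > best[0]:
                best = (gain, attr)
        return best

    def split(self, attr, keys):
        """Split `keys` into those satisfying the test on `attr` and those not."""
        yes = [k for k in keys if self.columns[attr][k] == 1]
        no = [k for k in keys if self.columns[attr][k] == 0]
        return yes, no


def mediator_round(agents, keys):
    """One round: collect candidates, pick the winner, return the split indices."""
    candidates = [(agent.best_candidate(keys), agent) for agent in agents]
    (gain, attr), winner = max(candidates, key=lambda c: c[0][0])
    if attr is None:  # no agent found an attribute with positive gain
        return None
    yes_keys, no_keys = winner.split(attr, keys)
    # The mediator would forward these key lists to the non-winner agents.
    return winner.name, attr, yes_keys, no_keys


labels = {1: "buy", 2: "buy", 3: "skip", 4: "skip"}
structured = Agent("A", {"price_low": {1: 1, 2: 0, 3: 1, 4: 0}}, labels)
concepts = Agent("B", {"concept_7": {1: 1, 2: 1, 3: 0, 4: 0}}, labels)
result = mediator_round([structured, concepts], [1, 2, 3, 4])
```

In this toy data the concept attribute held by agent B separates the two classes, so B wins the round; the returned key lists stand in for the index information that the other agents would use to split their own data before the next round.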

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Databases & Information Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

A method for prediction analysis using text categorization is provided. The method includes the steps of: grouping a plurality of text documents into a plurality of classes; selecting a top m most discriminatory terms for each class of documents using statistical based measures; determining for each document the presence or absence of each of the discriminatory terms; learning rule-based models of each class of documents using a rule learning algorithm; determining, for at least a portion of the plurality of documents, if a given learned rule has been satisfied by each respective document; creating a database of the rules associated with documents satisfying the rules; and performing distributed data mining to form a predictive result based on at least a portion of the plurality of documents.

Description

DISTRIBUTED METHOD FOR INTEGRATING DATA MINING AND TEXT CATEGORIZATION TECHNIQUES
FIELD OF THE INVENTION
This invention relates generally to a method for integrating Predictive Analytics and Text Categorization techniques within a distributed machine learning framework.
BACKGROUND
Recent years have seen a significant surge of interest in the application of mining algorithms to unstructured data. This stems from the general realization that the true potential of mining applications can only be realized with the ability to tap into the vast amounts of unstructured data, 85% of all data according to some estimates. Most algorithms designed for the processing of unstructured data are loosely termed text mining algorithms. These include Information Extraction and Text Categorization algorithms, among others. While there is often a well-established link between Information Extraction and data mining, the application of Text Categorization in a data mining context is much less prevalent.
In a typical text mining application, an Information Extraction (IE) algorithm (such as described in Dörre, J., Gerstl, P. and Seiffert, R. (1999), Text mining: finding nuggets in mountains of textual data, in Proceedings of ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (San Diego, CA, 1999), 398-401; Pazienza, Maria Teresa (1999), Information Extraction: Towards Scalable, Adaptable Systems, Springer; and Knight, Kevin (1999), Mining Online Text, Communications of the ACM 42(11): 586) is first used to populate structured data tables with data elements extracted from unstructured data collections. A data mining algorithm is then applied to the structured data in order to find patterns of potential interest to the user. This form of text mining can therefore readily facilitate the integration of structured and unstructured data sources. A popular form of IE is Entity Extraction, which aims to extract such information as the names of people, organizations, and places from the documents.
Text Categorization (TC) (such as described in Sebastiani, Fabrizio (2002), Machine learning in automated text categorization, ACM Computing Surveys, 34(1):1-47; Joachims, T. (1998), Text categorization with Support Vector Machines: Learning with many relevant features, In Machine Learning: ECML-98, Tenth European Conference on Machine Learning, pp. 137-142; Koller, D., Sahami, M. (1997), Hierarchically classifying documents using very few words, Proc. of the 14th International Conference on Machine Learning ICML97, pp. 170-178; Lewis, D., D. Stern and A. Singhal (1999), ATTICS: A Software Platform for Online Text Classification, SIGIR '99; and Hadjarian, Ali, Jerzy W. Bala, Peter Pachowicz (2001), Text Categorization through Multistrategy Learning and Visualization, In Proceedings of Conference on Intelligent Text Processing and Computational Linguistics (CICLing) 2001: 437-443), on the other hand, is generally not intended for explicit discovery of new knowledge from unstructured data (see Hearst, M. (1999), Untangling text data mining, Proceedings of ACL'99: the 37th Annual Meeting of the Association for Computational Linguistics). Instead, it is designed to build classifiers that automatically assign unstructured data (e.g. text documents) to predefined categories. As such, the terms Text Categorization and text classification are often used interchangeably. Since the ultimate aim of such a classifier is simply to assign classes (e.g. topical labels) to various data points, the human comprehensibility of the generated models is generally not of much concern. Consequently, most text classifiers take a black-box approach to modeling: what matters is the input to and the output of the classifier, not the intermediate representations of object classes.
SUMMARY
In one form, a method for prediction analysis using text categorization is provided. The method includes the steps of: grouping a plurality of text documents into a plurality of classes; selecting a top m most discriminatory terms for each class of documents using statistical based measures; determining for each document the presence or absence of each of the discriminatory terms; learning rule-based models of each class of documents using a rule learning algorithm; determining, for at least a portion of the plurality of documents, if a given learned rule has been satisfied by each respective document; creating a database of the rules associated with documents satisfying the rules; and performing distributed data mining to form a predictive result based on at least a portion of the plurality of documents.
According to one form, a method for prediction analysis using text categorization is provided. The method includes the steps of: providing a structured data table having a plurality of class labels; grouping a plurality of text documents into classes based on the class labels; selecting a top m most discriminatory terms having the highest calculated fitness measure for each class of documents; determining for each document the presence or absence of each of the discriminatory terms; determining a concept for each class, the concept being associated with the respective class; determining, for at least a portion of the plurality of documents, if a given concept is associated with each respective document; forming a numeric vector for each document indicating if the document is associated with each respective concept; creating a structured data table of the vectors; and performing distributed data mining on the structured data table to form a predictive result.
In one form, a method for prediction analysis using text categorization is provided. The method includes the steps of: providing a structured data table having a plurality of class labels; grouping a plurality of text documents into classes based on the class labels; selecting a top m most discriminatory terms having the highest calculated fitness measure for each class of documents; determining for each document the presence or absence of each of the discriminatory terms; determining at least one concept for each class, the concept being associated with the respective class; determining, for at least a portion of the plurality of documents, if a given concept is associated with each respective document; creating a database of the concepts and the associated documents; and performing distributed data mining on the database to form a predictive result.
According to one form, the method further includes the step of representing each document in terms of a numeric vector indicating the presence or absence of the discriminatory terms.
In one form, the plurality of text documents are from an unstructured database.
According to one form, the method further includes the step of representing each document in terms of a numeric vector indicating whether a learned rule has been satisfied by the document.
In one form, the step of performing data mining includes utilizing a decision tree to form the predictive result.
According to one form, the step of performing data mining includes the steps of: collecting candidate attributes by a mediator from a plurality of agents; selecting a winning agent; initiating data splitting by the winning agent; forwarding split data index information from the winning agent to the mediator; forwarding the split data index information from the mediator to each of the agents; and initiating data splitting by each of the agents other than the winning agent.
In one form, a system for prediction analysis using text categorization is provided. The system includes at least one memory unit and a plurality of processing units. The plurality of processing units grouping a plurality of text documents into a plurality of classes, selecting a top m most discriminatory terms for each class of documents using statistical based measures, determining for each document the presence or absence of each of the discriminatory terms, learning rule-based models of each class of documents using a rule learning algorithm, determining, for at least a portion of the plurality of documents, if a given learned rule has been satisfied by each respective document, creating a database of the rules associated with documents satisfying the rules, and performing distributed data mining to form a predictive result based on at least a portion of the plurality of documents. Other forms are also contemplated as understood by those skilled in the art.
BRIEF DESCRIPTION OF THE DRAWINGS
For the purpose of facilitating an understanding of the subject matter sought to be protected, there are illustrated in the accompanying drawings embodiments thereof, from an inspection of which, when considered in connection with the following description, the subject matter sought to be protected, its constructions and operation, and many of its advantages should be readily understood and appreciated.
Figure 1 is a diagrammatic representation of one form of a method for text mining;
Figure 2 is a diagrammatic representation of one form of a concept extraction process;
Figure 3 is a diagrammatic representation of one form of a feature selection process;
Figure 4 is a diagrammatic representation of one form of a vector space;
Figure 5 is a diagrammatic representation of one form of an agent-mediator communication mechanism; and
Figure 6 is a diagrammatic representation of one form of a distributed data mining method and system.
DETAILED DESCRIPTION
The methodology presented in this application is concerned with text mining scenarios where data associated with objects are collected at distributed databases. In addition, there is at least one database with structured data and one with unstructured data. It is further assumed that data points can be registered across various databases through common keys. In one form, it may be preferable to mine the data across distributed structured and unstructured databases without the need to bring all the data to one central location.
In one form, the method integrates Text Categorization, typically a stand-alone application, with a predictive analytics process. Additionally, the method includes the distributed aspect of the predictive analytics process itself, in which a novel distributed decision tree learning algorithm is employed to generate models of data dispersed in various locations without the need to bring all that data to a central location.
The methodology presented in this application is concerned with text mining scenarios where data associated with objects are collected at distributed databases. In addition, in one form, there is at least one database with structured and one with unstructured data. Furthermore, in one form, it can be assumed that data points can be registered across various databases through common keys.
Figure 1 depicts a high-level view of one form of a text mining method 20. In this form, there is one database 22 with structured data and one database 24 with unstructured data (i.e. a collection of documents). At the heart of the methodology is a Concept Extraction process/concept extractor 26. This, in essence, is a Text Categorization algorithm that builds models of unstructured data, i.e. document collections, based on the labels assigned to them using the annotations specified by the structured data.
However, the aim here is not simply to use Text Categorization to build a set of classifiers for the unstructured data. Rather, the resulting models are used to extract features from the unstructured data to be used in conjunction with the structured data in the mining process (i.e. building classifiers over both structured and unstructured data). The intended features specify the presence or absence of various "concepts" within each class of documents, hence the term Concept Extraction.
One form of a Concept Extraction process 26 is illustrated in Figure 2. Documents 28 are first grouped into classes 30 assigned to them, using the class labels of the corresponding data points in the structured data table. Again, the documents 28 and data points in the structured database are registered with common keys. A classifier is then learned for each of these document classes. A rule learning algorithm is employed for this purpose. Each learned rule captures some aspect of the document class. In other words, each rule identifies the various "concepts" present in the class. The presence or absence of such concepts in documents can then be used as features to populate a structured database table.
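The grouping step described above can be sketched as follows. This is only a minimal illustration, assuming the structured table and the document collection are both held in memory as dictionaries indexed by the common key; the function name and the data layout are hypothetical and do not come from the application.

```python
from collections import defaultdict


def group_documents_by_class(class_labels, documents):
    """Group document keys by the class label of the matching structured row.

    class_labels: mapping of common key -> class label (from the structured table)
    documents:    mapping of common key -> document text (unstructured data)
    """
    classes = defaultdict(list)
    for key in documents:
        label = class_labels.get(key)
        if label is not None:  # skip documents with no registered data point
            classes[label].append(key)
    return dict(classes)


class_labels = {1: "of_interest", 2: "not_of_interest", 3: "of_interest"}
documents = {1: "a baking recipe book", 2: "a tax law guide",
             3: "an essay on existentialism", 4: "an unregistered leaflet"}
groups = group_documents_by_class(class_labels, documents)
```

Documents whose key does not appear in the structured table are left out of every class, since no label can be assigned to them.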
Documents of course must first be converted to a representation suitable for use by a learning algorithm, in this case the rule learner. A popular form of representation, namely that of vector space, has been utilized for this purpose. Here, each document in a given class is represented in terms of a vector of top m features. The top features (i.e. terms) are those with the highest calculated fitness measure (e.g., Information Gain), as determined by a Feature Selection algorithm 40. This process is depicted in Figure 3. Once the top m features for each document class have been identified, each document is re-represented in terms of a numeric vector indicating the presence or absence of each of the features, such as shown in Figure 4.
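A minimal sketch of feature selection by Information Gain and of the resulting binary vector representation is given below. It assumes each document has already been reduced to a set of terms and, for simplicity, ranks terms once over all classes rather than separately for each class; the function names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def information_gain(docs, labels, term):
    """Entropy reduction obtained by splitting documents on the presence of a term."""
    with_term = [lab for doc, lab in zip(docs, labels) if term in doc]
    without_term = [lab for doc, lab in zip(docs, labels) if term not in doc]
    n = len(labels)
    remainder = (len(with_term) / n) * entropy(with_term) \
        + (len(without_term) / n) * entropy(without_term)
    return entropy(labels) - remainder


def top_m_terms(docs, labels, m):
    """Return the m terms with the highest information gain (ties broken alphabetically)."""
    vocabulary = sorted(set().union(*docs))
    ranked = sorted(vocabulary,
                    key=lambda t: information_gain(docs, labels, t),
                    reverse=True)
    return ranked[:m]


def to_binary_vector(doc, terms):
    """Represent a document as 1/0 flags for the presence or absence of each selected term."""
    return [1 if t in doc else 0 for t in terms]


docs = [{"recipe", "baking"}, {"baking", "desserts"}, {"tax", "law"}, {"law", "court"}]
labels = ["of_interest", "of_interest", "not_of_interest", "not_of_interest"]
selected = top_m_terms(docs, labels, 2)
```

Other fitness measures could be substituted for `information_gain` without changing the rest of the sketch.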
A structured table populated by "concept" based features extracted from unstructured data is used to facilitate data mining across structured and unstructured databases. This is achieved through the use of a distributed mining algorithm described in the following section.
Distributed Data Mining
Figure 6 illustrates one basic form of distributed data mining. Distributed mining is accomplished via a synchronized collaboration of agents 10 as well as a mediator component 12 (see Hadjarian A., Baik S., Bala J., Manthorne C. (2001) "InferAgent - A Decision Tree Induction From Distributed Data Algorithm," 5th World Multiconference on Systemics, Cybernetics and Informatics (SCI 2001) and 7th International Conference on Information Systems Analysis and Synthesis (ISAS 2001), Orlando, Florida). The mediator component 12 facilitates the communication among agents 10. In one form, each agent 10 has access to its own local database 14 and is responsible for mining the data contained by the database 14.
Distributed data mining results in a set of rules generated through a tree induction algorithm. The tree induction algorithm, in an iterative fashion, determines the feature which is most discriminatory and then it dichotomizes (splits) the data into classes categorized by this feature. The next significant feature of each of the subsets is then used to further partition them and the process is repeated recursively until each of the subsets contains only one kind of labeled data. The resulting structure is called a decision tree, where nodes stand for feature discrimination tests, while their exit branches stand for those subclasses of labeled examples satisfying the test. A tree is rewritten to a collection of rules, one for each leaf in the tree. Every path from the root of a tree to a leaf gives one initial rule. The left-hand side of the rule contains all the conditions established by the path, and the right-hand side specifies the classes at the leaf. Each such rule is simplified by removing conditions that do not seem helpful for discriminating the nominated class from other classes.
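The rewriting of a tree into one initial rule per leaf can be sketched as below. The nested-dictionary tree layout, the key names and the example tree are assumptions made for illustration only, and the later simplification of rules is not shown.

```python
def tree_to_rules(node, path=()):
    """Return one (conditions, class_label) rule for every root-to-leaf path.

    A leaf is a dict with a "label" key. An internal node is a dict with a
    "feature" test and "yes"/"no" subtrees (hypothetical layout).
    """
    if "label" in node:  # leaf: the accumulated path becomes the rule's left-hand side
        return [(list(path), node["label"])]
    test = node["feature"]
    rules = []
    rules += tree_to_rules(node["yes"], path + ((test, True),))
    rules += tree_to_rules(node["no"], path + ((test, False),))
    return rules


tree = {
    "feature": "price < 20",
    "yes": {
        "feature": "of_interest = 7",
        "yes": {"label": "recommend"},
        "no": {"label": "skip"},
    },
    "no": {"label": "skip"},
}
rules = tree_to_rules(tree)
```

Each rule pairs the list of tests along the path (with the outcome of each test) with the class found at the leaf.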
In the distributed framework, tree induction is accomplished through a partial tree generation process and an Agent-Mediator communication mechanism, such as shown in Figure 5, that executes the following steps:
1. The data mining process starts with the mediator 12 issuing a call to all the agents 10 to start the mining process.
2. Each agent 10 then starts the process of mining its own local data by finding the feature (or attribute) that can best split the data into the various training classes (i.e. the attribute with the highest information gain).
3. The selected attribute is then sent as a candidate attribute to the mediator 12 for overall evaluation.
4. Once the mediator 12 has collected the candidate attributes of all the agents 10, it can then select the attribute with the highest information gain as the winner.
5. The winner agent 10 (i.e. the agent whose database includes the attribute with the highest information gain) will then continue the mining process by splitting the data using the winning attribute and its associated split value. This split results in the formation of two separate clusters of data (i.e. those satisfying the split criteria and those not satisfying it).
6. The associated indices of the data in each cluster are passed to the mediator 12 to be used by all the other agents 10.
7. The other (i.e. non-winner) agents 10 access the index information passed to the mediator 12 by the winner agent 10 and split their data accordingly. The mining process then continues by repeating the process of candidate feature selection by each of the agents 10.
8. Meanwhile, the mediator 12 is generating the classification rules by tracking the attribute/split information coming from the various mining agents 10. The generated rules can then be passed on to the various agents 10 for the purpose of presenting them to the user through advanced 3D visualization techniques.
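One round of the agent-mediator exchange in steps 2 through 6 can be sketched as follows. This is a simplified single-process illustration, not the implementation described in the cited InferAgent paper: it assumes binary attributes, assumes every agent can see the class labels, and replaces network messages with direct method calls. All class, method and attribute names are hypothetical.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    if total == 0:
        return 0.0
    counts = Counter(labels)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


class Agent:
    def __init__(self, name, columns, labels):
        self.name = name
        self.columns = columns  # attribute -> list of 0/1 values, indexed by record
        self.labels = labels    # assumption: class labels are visible to every agent

    def propose(self, indices):
        """Steps 2-3: return (gain, attribute) for the best local attribute."""
        base = entropy([self.labels[i] for i in indices])
        best = (-1.0, None)
        for attr, values in sorted(self.columns.items()):
            yes = [self.labels[i] for i in indices if values[i] == 1]
            no = [self.labels[i] for i in indices if values[i] == 0]
            remainder = (len(yes) * entropy(yes) + len(no) * entropy(no)) / len(indices)
            gain = base - remainder
            if gain > best[0]:
                best = (gain, attr)
        return best

    def split(self, attr, indices):
        """Step 5: partition record indices by the winning attribute."""
        values = self.columns[attr]
        return ([i for i in indices if values[i] == 1],
                [i for i in indices if values[i] == 0])


def mediator_round(agents, indices):
    """Steps 4-6: pick the winning agent and return the index clusters to share."""
    candidates = [(agent.propose(indices), agent) for agent in agents]
    (gain, attr), winner = max(candidates, key=lambda c: c[0][0])
    satisfied, not_satisfied = winner.split(attr, indices)
    return winner.name, attr, satisfied, not_satisfied


labels = ["yes", "yes", "no", "no"]
agent_a = Agent("A", {"cheap": [1, 0, 1, 0]}, labels)
agent_b = Agent("B", {"concept_7": [1, 1, 0, 0]}, labels)
result = mediator_round([agent_a, agent_b], [0, 1, 2, 3])
```

Only attribute names, gains and record indices pass through the mediator in this sketch; the attribute values themselves stay with the agent that owns them, which is the point of the distributed arrangement.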
One exemplary application of one form of the method could be that of customer profiling for an online store. Customer profiling, or modeling of a customer's interests, can facilitate personalized purchase offers and recommendations. An online bookstore, for example, can make book recommendations based on the purchase history of its customers. To do so, the bookstore must first generate a model of a customer's interests.
Customer C has specific interests in modern philosophy and baking. Obviously the bookstore's customer database holds a variety of valuable information on previously purchased items, such as the general topic, price, and the year of publication. However missing from this database is the rich information contained in the textual description of each item. Using this often unstructured textual information in conjunction with the structured data contained in the customer database can potentially yield a more accurate picture of a customer's interests.
The following is an outline of the steps necessary to generate a profile of Customer C using one form of the method:
Step 1- Grouping of documents (i.e. book descriptions) into various categories. Examples of these could be general categories such as "of_interest" and "not_of_interest". The historical data stored in the customer database can of course facilitate such a grouping. While the descriptions of the books purchased by Customer C in the past can be grouped into the "of_interest" category, descriptions of the items not purchased by this customer (or a sample of them) can be used to populate the "not_of_interest" category.
Step 2- Selecting the most discriminatory terms (i.e. keywords) for differentiating between the "of_interest" and "not_of_interest" categories. This is achieved in an automated fashion with the help of a Feature Selection algorithm that uses statistics-based measures such as Information Gain. For this particular customer, the list of selected features for the "of_interest" category could include terms such as: recipe, baking, philosophy, desserts, Sartre, existentialism, French, culinary, German, morality, Nietzsche, and cookbook.
Step 3- Re-representing each document in terms of a numeric vector indicating the presence (e.g., as indicated by a 1) or absence (e.g., as indicated by a 0) of each of the selected terms. In the below illustration for example, Document 1 contains the terms recipe and baking and Document 3 the terms philosophy and existentialism.
vector of selected terms: <recipe, baking, philosophy, desserts, Sartre, existentialism, ...>
Document 1: <1, 1, 0, 0, 0, 0, ...>
Document 2: <0, 1, 0, 1, 0, 0, ...>
Document 3: <0, 0, 1, 0, 0, 1, ...>
Step 4- Learning rule-based models of each category of documents using the above vector space representation. A rule learning algorithm is used for this purpose. Examples of rules generated for the "of_interest" category could include:
Concept 1: if (recipe = 1) and (baking = 1) then (category = "of_interest")
Concept 2: if (existentialism = 1) then (category = "of_interest")
Concept 7: if (desserts = 1) and (culinary = 1) then (category = "of_interest")
Step 5- Re-representing each document, this time in terms of a numeric vector indicating whether the document can be classified as belonging to a given category using the generated rules for that category and if so which concept (i.e. learned rule) is satisfied by that document. For example the following vectors indicate that Document 2 belongs to the "of_interest" category and satisfies Concept 7 (i.e., has the terms desserts and culinary) and Document 12 belongs to the "not_of_interest" category.
category vector: <of_interest, not_of_interest>
Document 1: <1, 0>
Document 2: <7, 0>
Document 3: <2, 0>
Document 12: <0, 1>
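The concept vectors of Step 5 can be sketched as below. The sketch assumes each learned rule is a conjunction of required terms and records, for each category, the number of the first satisfied concept (or 0 if none is satisfied). The "of_interest" rules mirror Concepts 1, 2 and 7 above; the single "not_of_interest" rule and its term are purely hypothetical, since the example does not list any.

```python
def concept_vector(doc_terms, rules_by_category, categories):
    """Return, per category, the number of the first concept the document satisfies (0 if none)."""
    vector = []
    for category in categories:
        matched = 0
        for concept_id, required_terms in rules_by_category.get(category, []):
            if required_terms <= doc_terms:  # every required term is present
                matched = concept_id
                break
        vector.append(matched)
    return vector


categories = ["of_interest", "not_of_interest"]
rules_by_category = {
    "of_interest": [
        (1, {"recipe", "baking"}),
        (2, {"existentialism"}),
        (7, {"desserts", "culinary"}),
    ],
    # Hypothetical rule, included only so the second slot can be exercised.
    "not_of_interest": [(1, {"tax"})],
}

doc1 = concept_vector({"recipe", "baking"}, rules_by_category, categories)
doc2 = concept_vector({"baking", "desserts", "culinary"}, rules_by_category, categories)
doc3 = concept_vector({"philosophy", "existentialism"}, rules_by_category, categories)
doc12 = concept_vector({"tax", "law"}, rules_by_category, categories)
```

A document that satisfies several concepts of the same category is reported under the first one listed; other conventions, such as one column per concept, would serve equally well.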
Step 6- Populating a structured database with the above concept vector representation of documents and using this database in conjunction with other existing structured customer databases to generate models of Customer C's interests. This is facilitated by a distributed predictive analytics method as shown in Figures 5 and 6. An example of a generated rule-based model for an item to be recommended to Customer C could include the following:
if (years_since_publication < 3) and (price < 20) and (of_interest = 7) then (recommend = yes)
This rule indicates that the user might be interested in books published in the last three years, with a price tag of less than $20 and dealing with the concept of (desserts and culinary).
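Applying such a rule to records drawn from both tables can be sketched as follows. For illustration only, the sketch joins the two tables in memory on the common key, whereas the method described here avoids centralizing the data; the item keys and values are invented.

```python
def join_on_key(structured_rows, concept_rows):
    """Merge two key-indexed tables, keeping only keys present in both."""
    return {
        key: {**structured_rows[key], **concept_rows[key]}
        for key in structured_rows
        if key in concept_rows
    }


def recommend(item):
    """The example rule: recent, inexpensive, and satisfying Concept 7."""
    return (
        item["years_since_publication"] < 3
        and item["price"] < 20
        and item["of_interest"] == 7
    )


structured_rows = {
    "book_1": {"years_since_publication": 1, "price": 15.0},
    "book_2": {"years_since_publication": 5, "price": 12.0},
}
concept_rows = {
    "book_1": {"of_interest": 7, "not_of_interest": 0},
    "book_2": {"of_interest": 7, "not_of_interest": 0},
}
joined = join_on_key(structured_rows, concept_rows)
```

Here `book_1` meets all three conditions, while `book_2` fails the publication-date test even though it satisfies the same concept.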
It should be appreciated that the above example is an application of one form of the present method and system. It should be understood that variations of the method are also contemplated as understood by those skilled in the art. Furthermore, it should be understood that the methods described herein may be embodied in a system, such as a computer, network and the like as understood by those skilled in the art. The system may include one or more processing units, hard drives, RAM, ROM, other forms of memory and other associated structure and features as understood by those skilled in the art. It should be understood that multiple processing units may be used in the system such that one processing unit performs certain functions at one data locale, a second processing unit performs certain functions at a second data locale and a third processing unit acts as a mediator.
The matter set forth in the foregoing description and accompanying drawings is offered by way of illustration only and not as a limitation. While particular embodiments have been shown and described, it will be obvious to those skilled in the art that changes and modifications may be made without departing from the broader aspects of applicants' contribution. The actual scope of the protection sought is intended to be defined in the following claims when viewed in their proper perspective based on the prior art.

Claims

CLAIMS:
1. A method for prediction analysis using text categorization, the method comprising the steps of: grouping a plurality of text documents into a plurality of classes; selecting a top m most discriminatory terms for each class of documents using statistical based measures; determining for each document the presence or absence of each of the discriminatory terms; learning rule-based models of each class of documents using a rule learning algorithm; determining, for at least a portion of the plurality of documents, if a given learned rule has been satisfied by each respective document; creating a database of the rules associated with documents satisfying the rules; and performing distributed data mining to form a predictive result based on at least a portion of the plurality of documents.
2. The method of claim 1 further comprising the step of representing each document in terms of a numeric vector indicating the presence or absence of the discriminatory terms.
3. The method of claim 1 wherein the plurality of text documents are from an unstructured database.
4. The method of claim 1 further comprising the step of representing each document in terms of a numeric vector indicating whether a learned rule has been satisfied by the document.
5. The method of claim 1 wherein the step of performing data mining includes utilizing a decision tree to form the predictive result.
6. The method of claim 1 wherein the step of performing data mining includes the steps of: collecting candidate attributes by a mediator from a plurality of agents; selecting a winning agent; initiating data splitting by the winning agent; forwarding split data index information from the winning agent to the mediator; forwarding the split data index information from the mediator to each of the agents; and initiating data splitting by each of the agents other than the winning agent.
7. A method for prediction analysis using text categorization, the method comprising the steps of: providing a structured data table having a plurality of class labels; grouping a plurality of text documents into classes based on the class labels; selecting a top m most discriminatory terms having the highest calculated fitness measure for each class of documents; determining for each document the presence or absence of each of the discriminatory terms; determining at least one concept for each class, the concept being associated with the respective class; determining, for at least a portion of the plurality of documents, if a given concept is associated with each respective document; forming a numeric vector for each document indicating if the document is associated with each respective concept; creating a structured data table of the vectors; and performing distributed data mining on the structured data table to form a predictive result.
8. The method of claim 7 further comprising the step of representing each document in terms of a numeric vector indicating the presence or absence of the discriminatory terms.
9. The method of claim 7 wherein the plurality of text documents are from an unstructured database.
10. The method of claim 7 wherein the step of performing data mining includes utilizing a decision tree to form the predictive result.
11. The method of claim 7 wherein the step of performing data mining includes the steps of: collecting candidate attributes by a mediator from a plurality of agents; selecting a winning agent; initiating data splitting by the winning agent; forwarding split data index information from the winning agent to the mediator; forwarding the split data index information from the mediator to each of the agents; and initiating data splitting by each of the agents other than the winning agent.
12. A method for prediction analysis using text categorization, the method comprising the steps of: providing a structured data table having a plurality of class labels; grouping a plurality of text documents into classes based on the class labels; selecting a top m most discriminatory terms having the highest calculated fitness measure for each class of documents; determining for each document the presence or absence of each of the discriminatory terms; determining a concept for each class, the concept being associated with the respective class; determining, for at least a portion of the plurality of documents, if a given concept is associated with each respective document; creating a database of the concepts and the associated documents; and performing distributed data mining on the database to form a predictive result.
13. The method of claim 12 further comprising the step of representing each document in terms of a numeric vector indicating the presence or absence of the discriminatory terms.
14. The method of claim 12 wherein the plurality of text documents are from an unstructured database.
15. The method of claim 12 wherein the step of performing data mining includes utilizing a decision tree to form the predictive result.
16. The method of claim 12 wherein the step of performing data mining includes the steps of: collecting candidate attributes by a mediator from a plurality of agents; selecting a winning agent; initiating data splitting by the winning agent; forwarding split data index information from the winning agent to the mediator; forwarding the split data index information from the mediator to each of the agents; and initiating data splitting by each of the agents other than the winning agent.
17. A system for prediction analysis using text categorization comprising: at least one memory unit; and a plurality of processing units, the plurality of processing units grouping a plurality of text documents into a plurality of classes, selecting a top m most discriminatory terms for each class of documents using statistical based measures, determining for each document the presence or absence of each of the discriminatory terms, learning rule-based models of each class of documents using a rule learning algorithm, determining, for at least a portion of the plurality of documents, if a given learned rule has been satisfied by each respective document, creating a database of the rules associated with documents satisfying the rules and performing distributed data mining to form a predictive result based on at least a portion of the plurality of documents.
PCT/US2007/020938 2006-09-29 2007-09-28 Distributed method for integrating data mining and text categorization techniques WO2008042264A2 (en)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
US84809206P 2006-09-29 2006-09-29
US60/848,092 2006-09-29

Publications (2)

Publication Number Publication Date
WO2008042264A2 (en) 2008-04-10
WO2008042264A3 WO2008042264A3 (en) 2008-07-24

Family

ID=39268995

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2007/020938 WO2008042264A2 (en) 2006-09-29 2007-09-28 Distributed method for integrating data mining and text categorization techniques

Country Status (1)

Country Link
WO (1) WO2008042264A2 (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9928294B2 (en) 2015-07-30 2018-03-27 Wipro Limited System and method for improving incident ticket classification
CN112766506A (en) * 2021-01-19 2021-05-07 澜途集思生态科技集团有限公司 Knowledge base construction method based on architecture

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030041042A1 (en) * 2001-08-22 2003-02-27 Insyst Ltd Method and apparatus for knowledge-driven data mining used for predictions
US20050154692A1 (en) * 2004-01-14 2005-07-14 Jacobsen Matthew S. Predictive selection of content transformation in predictive modeling systems
US20060101048A1 (en) * 2004-11-08 2006-05-11 Mazzagatti Jane C KStore data analyzer
US20060190310A1 (en) * 2005-02-24 2006-08-24 Yasu Technologies Pvt. Ltd. System and method for designing effective business policies via business rules analysis

Also Published As

Publication number Publication date
WO2008042264A3 (en) 2008-07-24

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 07838993

Country of ref document: EP

Kind code of ref document: A2

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 07838993

Country of ref document: EP

Kind code of ref document: A2