US7788292B2 - Raising the baseline for high-precision text classifiers - Google Patents
- Publication number
- US7788292B2 (application US 11/955,007; US95500707A)
- Authority
- US
- United States
- Prior art keywords
- document
- term weight
- term
- naïve bayes
- feature
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Expired - Fee Related, expires
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q10/00—Administration; Management
- G06Q10/10—Office automation; Time management
Definitions
- Empowering people to make well-informed decisions has become increasingly important in today's fast paced environment. Providing individuals with relevant and timely information is an essential element in facilitating such well-informed decisions. However, certain information that is noise to some may be valuable to others. Additionally, some information can also be temporally critical and as such there may be significant value associated with timely delivery of such information. Moreover, with the growth of computer and information systems, and related network technologies such as wireless and Internet communications, ever increasing amounts of electronic information are communicated, transferred and subsequently processed by users and/or systems. As an example, web browsers have become a popular application amongst computer users for generating and receiving content.
- There are many applications for automatic classification of items such as email, documents, images, and recordings. To address this need, a plethora of classifiers have been developed based, for example, on probabilistic dependency models learned from training data. Examples of such models can include logistic regression models, decision tree models, support vector machines, neural networks, Naïve Bayes, and the like.
- Naïve Bayes classifiers to date have been among the most widely utilized classifiers ever developed in the text domain, even though the classifier is generally recognized as providing solutions that are just “good enough”. Nevertheless, Naïve Bayes classifiers are utilized by a plethora of classification applications, typically to provide a lower bound for the classification, while the upper classification bounds are generally handled by more arcane and abstruse methodologies. This is so despite the fact that utilization of such techniques in some cases ekes out only marginal gains in terms of cost and time over utilization of the ubiquitous Naïve Bayes classifier, while in other instances the gains accrued can depend on factors such as document representation and precision requirements (e.g., if high precision is not required, many standard versions of Naïve Bayes classifiers can perform adequately).
- the claimed subject matter can provide a link between Naïve Bayes and the logarithmic opinion pooling of the mixture-of-experts framework, which dictates a particular type of document length normalization.
- the claimed subject matter in accordance with an aspect can employ monotonic constraints on document term weighting, which can be an effective method of fine-tuning document representation.
- the claimed subject matter can normalize document representation for use with Naïve Bayes, which can comprise computing the norm of each document by summing the absolute values of term weights (i.e., the L1 norm), dividing each term weight by that norm, training the Naïve Bayes model using the normalized representation, and using the normalized representation when applying the trained Naïve Bayes model to new data.
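As a minimal sketch of this normalization step (in Python, assuming documents are dictionaries mapping terms to non-negative term weights; the function name and sample data are illustrative, not taken from the patent):

```python
def l1_normalize(doc_weights):
    """Divide each term weight by the document's L1 norm (sum of absolute term weights)."""
    norm = sum(abs(w) for w in doc_weights.values())
    if norm == 0:
        return dict(doc_weights)  # empty or all-zero document: nothing to normalize
    return {term: w / norm for term, w in doc_weights.items()}

# The same transformation is applied both when training the Naive Bayes model
# and when scoring previously unseen documents.
print(l1_normalize({"free": 2.0, "offer": 1.0, "meeting": 1.0}))
# {'free': 0.5, 'offer': 0.25, 'meeting': 0.25}
```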
- a multi-stage technique can be employed to adjust term weights for Naïve Bayes, where the process can be primed with an original weight representation, the Naïve Bayes model can then be computed using the given term-weight representation, and the original term weights can be multiplied by the absolute values of the corresponding model weights, the process repeating for a prescribed number of iterations or until convergence occurs.
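A hedged sketch of this multi-stage adjustment (the `train_naive_bayes` trainer and its `term_weight` accessor are hypothetical placeholders for whatever Naïve Bayes implementation is used):

```python
def adjust_term_weights(docs, labels, original_tw, train_naive_bayes, iterations=3):
    """Iteratively fold absolute Naive Bayes model weights back into the original
    term weights (a sketch of the multi-stage technique, not a verbatim implementation)."""
    current_tw = dict(original_tw)
    for _ in range(iterations):  # or loop until the weights stop changing (convergence)
        # Re-weight each document with the current term-weight representation, then train.
        weighted = [{t: f * current_tw.get(t, 0.0) for t, f in d.items()} for d in docs]
        model = train_naive_bayes(weighted, labels)          # hypothetical trainer
        # New weight = original term weight x |corresponding model weight|.
        current_tw = {t: w * abs(model.term_weight(t)) for t, w in original_tw.items()}
    return current_tw
```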
- the subject matter as claimed can inductively perform document-specific feature selection, where a classifier can be trained using the original feature representation, resulting in a first model, model A; then, for each training document, the terms can be ranked according to the absolute values of the model A weights and the top-N weights selected.
- a new classifier, model B, can then be induced using the reduced document representation.
- the top-N terms of a test document can be selected using model A and the resulting reduced document can be classified employing model B.
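A compact sketch of the two-model procedure described in the preceding items (the `train_classifier` helper and its `term_weight`/`score` accessors are hypothetical):

```python
def top_n_terms(doc, model, n):
    """Keep the n terms of a document whose model weights have the largest absolute values."""
    ranked = sorted(doc, key=lambda t: abs(model.term_weight(t)), reverse=True)
    return {t: doc[t] for t in ranked[:n]}

def induce_with_doc_specific_selection(train_docs, labels, n, train_classifier):
    model_a = train_classifier(train_docs, labels)              # model A: full representation
    reduced = [top_n_terms(d, model_a, n) for d in train_docs]  # per-document top-N terms
    model_b = train_classifier(reduced, labels)                 # model B: reduced representation
    return model_a, model_b

def classify(test_doc, model_a, model_b, n):
    # Reduce the test document using model A, then score it with model B.
    return model_b.score(top_n_terms(test_doc, model_a, n))
```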
- the claimed subject matter can optimize non-negative term weights (e.g., different from model weights) under a rank-preserving constraint, where original term weights can be acquired (e.g., assumed to be non-negative), a model learnt and evaluated, and the term weights adjusted in a way that preserves their ranking and improves model performance, for a prescribed number of iterations or until convergence has been achieved.
- FIG. 1 illustrates a machine-implemented system that effectuates and facilitates improving Naïve Bayes performance in text applications demanding high precision in accordance with the claimed subject matter.
- FIG. 2 provides a more detailed depiction of an illustrative analysis component in accordance with an aspect of the claimed subject matter.
- FIG. 3 provides a more detailed depiction of an illustrative analysis component in accordance with a further aspect of the claimed subject matter.
- FIG. 4 provides a further detailed depiction of an analysis component in accordance with an aspect of the claimed subject matter.
- FIG. 5 provides yet another more detailed depiction of an analysis component in accordance with an aspect of the claimed subject matter.
- FIG. 6 illustrates a flow diagram of a machine implemented methodology that effectuates and facilitates normalization of term weights in accordance with an aspect of the subject matter as claimed.
- FIG. 7 depicts a further flow diagram of a machine implemented method that effectuates and facilitates inductive term weight optimization in accordance with an aspect of the claimed subject matter.
- FIG. 8 illustrates another methodology that facilitates and effectuates rank preserving term weight optimization in accordance with an aspect of the claimed subject matter.
- FIG. 9 depicts an illustrative methodology that facilitates and effectuates inductive document-specific feature selection in accordance with an aspect of the claimed subject matter.
- FIG. 10 illustrates a block diagram of a computer operable to execute the disclosed system in accordance with an aspect of the claimed subject matter.
- FIG. 11 illustrates a schematic block diagram of an exemplary computing environment for processing the disclosed architecture in accordance with another aspect.
- the claimed subject matter in accordance with an aspect focuses on the aspects of document representation, and in particular on the impact of document sparsity, term weighting and length normalization in problems demanding high specificity.
- the subject matter as claimed concentrates on Naïve Bayes, which generally is a highly scalable learner and for which a number of recent improvements have been proposed, making it quite competitive with more complex techniques such as Support Vector Machines (SVMs).
- Document length normalization can provide a mechanism for controlling the influence of any particular term on a document-by-document basis. Although it has been widely used with other text classifiers, its use with Naïve Bayes is a recent development and generally not well understood. Nevertheless, certain types of length normalization cast Naïve Bayes into the mixture-of-experts framework, and as utilized in the claimed subject matter, can provide a solid basis for this type of transformation, explain its effectiveness for this classifier, and illustrate that for Naïve Bayes, L1 normalization can be more appropriate than the traditional L2 normalization.
- Naïve Bayes can compete with and even outperform state-of-the-art learners, such as Logistic Regression and Support Vector Machines (SVMs). This is particularly true for data sets with some degree of class noise, which is typical in practical applications of text mining. These improvements in performance of Naïve Bayes typically do not take away its attractiveness in terms of speed of learning and ease of implementation.
- the definition of loss can be application-specific and often is taken to be the error rate.
- the misclassification costs are asymmetric and in some cases the cost of one type of error can be high enough to demand very low, or even near zero, probability of occurrence.
- the top-N results returned for a user query have very high precision even if this significantly restricts the number of potentially relevant responses that can make it to the top-N.
- in spam detection, for instance, users have low tolerance for false positive errors and accept e-mail filtering solutions as long as the chance of losing some important e-mail communications is negligibly low.
- where misclassification cost values and accurate estimates of posterior probabilities are available, optimum decisions can be made by setting the decision threshold in the probability space to minimize the expected misclassification cost. Due to the practical problems in obtaining these, it typically is convenient to work with the Neyman-Pearson criterion by setting a limit on the maximum acceptable false positive rate or, alternatively, on the minimum acceptable precision.
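A sketch of setting such a threshold on validation data under a maximum acceptable false-positive rate (the rate value and function name are illustrative assumptions, not the patent's procedure):

```python
def pick_threshold(scores, labels, max_fpr=0.01):
    """Return a score threshold whose false-positive rate on the validation data
    does not exceed max_fpr; documents scoring above it would be labeled positive."""
    negative_scores = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    if not negative_scores:
        return min(scores)
    allowed = int(max_fpr * len(negative_scores))   # how many negatives may exceed the threshold
    return negative_scores[min(allowed, len(negative_scores) - 1)]
```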
- a classifier returns a score proportional to its “confidence”.
- the score can be computed as:
- the multinomial model can be extended whereby the values fij no longer have to correspond to in-document frequency but to a function thereof; in particular, fij can be mapped to a real-valued tf×idf weight (e.g., a statistical measure employed to evaluate the importance a word has to a document in a collection of documents; the importance a word has in relation to a document typically increases proportionately with the number of times it appears in the document, but this can generally be offset by the frequency that the word has in the collection), and these features can additionally be normalized on a per-document basis so that the L2 norm of each feature vector is one.
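As a rough illustration of the tf×idf mapping mentioned above (using a common textbook formulation; the exact weighting variant used by the patent is not specified here):

```python
import math

def tf_idf(tf_in_doc, docs_containing_term, total_docs):
    """In-document frequency scaled by inverse document frequency."""
    idf = math.log(total_docs / (1 + docs_containing_term))
    return tf_in_doc * idf

# A term appearing 3 times in a document but in only 10 of 1,000 training documents
# receives a much larger weight than one that appears in 900 of them.
print(tf_idf(3, 10, 1000), tf_idf(3, 900, 1000))
```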
- a threshold can be chosen such that decisions with scores exceeding the threshold can be classified as “positive”.
- the classifier's inability to perform well at a low enough false-positive rate can be seen as evidence of its overconfidence, whereby erroneous decisions are made with apparent high confidence. While this behavior can be observed in many learners, it is typically common for Naïve Bayes learners, due to the classifier's assumption of conditional independence of features given the class label. Although Naïve Bayes can generally have a reasonably low error rate, in some cases feature inter-correlations can be compounded, resulting in overconfident predictions. This is generally true for long documents, which typically can be the reason why feature selection can have strong positive effects for this type of classifier.
- document classification can be done fairly accurately by looking at only a small portion of the text.
- document-specific feature selection that employs a small set of “important” words in a document can be shown to improve Naïve Bayes' performance significantly, especially in settings with highly skewed misclassification costs.
- document-specific feature selection that utilizes a small set of “important” words in a document is typically applied post-induction and thus is generally unable to take full advantage of the document-specific feature selection process.
- the technique can be suboptimal for some learners since these learners typically do not get a chance to induce a model over the reduced document representation, although for Na ⁇ ve Bayes this approach can work quite well.
- the claimed subject matter naturally extends the document-specific feature selection process so that it affects classifier induction.
- the process can not only be more suitable for discriminative learners such as Support Vector Machines, but also more effective for Naïve Bayes itself.
- the original and modified document-specific feature selection processes are outlined below.
- Document-specific feature selection generally relies on the choice of a single cut-off parameter for all documents, regardless of their length and content. While this typically can be seen to regularize Na ⁇ ve Bayes, it can be suboptimal for many documents, for example, those containing more numerous strongly relevant terms than suggested by the cut-off threshold.
- One possible way to address this issue can be to consider soft document-specific term weighting instead of hard feature selection, which can be decided by term frequency and a predefined cut-off threshold.
- although “pure” versions of the Naïve Bayes classifier can perform poorly when faced with large volumes of high dimensional data, the improvements and modifications as explicated and utilized by the subject matter as claimed can make Naïve Bayes competitive with state-of-the-art discriminative learners.
- Equation (3) can be expressed as:
- the posterior odds for a document can be a weighted geometric mean of the term-based odds for terms contained in a document.
- This type of formula combines probability distributions in the mixture of experts framework and is known as logarithmic opinion pooling. Under this interpretation, the terms found in a document can be considered as possibly correlated “experts”, whose opinions are pooled or aggregated.
- the term weight z i can correspond to the relative reliability of expert i. If all experts are considered equally reliable, the posterior probability of the classifier can be computed as the geometric mean of the term-wise posteriors.
- the claimed subject matter can employ several term weighting mechanisms including the traditional unsupervised methods such as tf×idf and also supervised methods such as feature weights learned by models in the previous stage. Moreover, the claimed subject matter can utilize approaches that improve these term weightings such as combining the supervised and unsupervised approaches, and employing monotonic term weighting transformation using parameterized softmax functions.
- Term weighting typically requires that each term appearing in a document receive a positive weight in order to emphasize attributes of likely importance and de-emphasize common and irrelevant ones.
- Some techniques can be based upon multiplicative combining of the in-document frequency with the inverse document frequency of the term in the training collection. Note that such a measure of term importance typically does not take into account class information, which can be relevant to categorization tasks.
- supervised term weighting schemes that derive the weight from functions used in ranking features for selection, such as Information Gain or χ2.
- the use of ALO type weights for feature ranking in Naïve Bayes is also related to the asymmetric odds-ratio criterion:
- the score function is typically symmetrical with respect to classifier weights and term weights, for instance,
- Document length normalization can introduce nonlinearity that breaks the symmetry between term weights and classifier weights as exemplified in equation (7). It can also make term weights document-specific, while maintaining their relative relationship (e.g., the ratio of any two weights before and after normalization can remain the same). Joint optimization of classifier weights and term weights however is possible, but is generally difficult due to the size of the parameter space.
- Term weighting typically works with many more choices than feature selection, which itself can be viewed as a hard problem. That said however, the set of potential choices can nonetheless be meaningfully constrained by the initial choice of the term weighting function. The good performance of feature selection functions in supervised term weighting suggests that these functions are useful not only in determining feature ranking but also in determining the features' relative importance.
- any two feature selection functions that rank terms in the same order can behave differently when considered as term weighting functions.
- they can have different steepness as a function of rank, with steeper functions highly emphasizing the strongest terms and being analogous to aggressive document-specific feature selection.
- flatter functions can favor document classification with significant contribution from a larger set of a document's features, which typically can be analogous to mild document-specific feature selection.
- it can be difficult to stipulate a priori which one is more suitable for term weighting in a particular learning method, which suggests that their quality needs to be assessed via classification performance of the resulting classifier.
- the search for optimum feature weightings can be formulated as finding a set of values tw(t1) ≧ … ≧ tw(tN) ≧ 0 such that the performance of a given learning method built over such a document representation is maximized. Nevertheless, even with such monotone constraints, optimizing for both classifier parameters and term weights can be difficult.
- a parameterized monotonic transform of the original term weights f(α, x): x1 ≧ x2 ⇒ f(α, x1) ≧ f(α, x2) can be considered, for which the best parameter settings can be determined using a validation set.
- a parameterized monotonic transformation of x that preserves term ranking, but also allows one to control the steepness of the mapping via parameter ⁇ >0.
- FIG. 1 illustrates a system 100 implemented on a machine that effectuates and facilitates improving Naïve Bayes performance in text applications demanding high precision in accordance with an aspect of the claimed subject matter.
- System 100 can include interface component 102 (hereinafter referred to as “interface 102 ”) that can receive one or more documents (e.g., email, text files, word processing files, spreadsheet files, graphical files, audio/visual files, and the like).
- interface 102 in this illustrative aspect of the claimed subject matter can disseminate a score determined by analysis component 104 , for instance. Utilization of the score ascertained by analysis component 104 and distributed by interface 102 can depend on the application to which the score is applied. For instance, in a classification application the score determined by analysis component 104 can be compared to a threshold in order to make a binary decision. For example, in the case of spam filtering, whether or not received email is spam.
- Interface 102 can provide various adapters, connectors, channels, communication pathways, etc. to integrate the various components included in system 100 into virtually any operating system and/or database system and/or with one another. Additionally, interface 102 can provide various adapters, connectors, channels, communication modalities, etc., that can provide for interaction with various components that can comprise system 100 , and/or any other component (external and/or internal), data, and the like, associated with system 100 .
- Analysis component 104 upon receipt of documents arriving in a document stream from interface 102 , for example, can extract features (e.g., words) from the incoming documents and provide a set of features (e.g., a set of words). The set of words can then be utilized to generate or construct a feature vector (e.g., a vector of numbers corresponding to features that have been used in training). In order to generate the feature vector, analysis component 104 can for each feature look at whether or not the feature has been used in training (e.g., determine whether there is something known about the feature under scrutiny). If the feature has previously been employed in training, the feature can be mapped to a numeric identifier, for example. Where the feature has not previously been used during training the feature can be discarded.
- analysis component 104 can apply a transformation to the generated feature vector, for example, that provides the feature vector in a form acceptable to and implementable by a downstream classification technique.
- analysis component 104 can take the vector, for each feature in the received vector look at the weights of the classifier assigned to each feature, sum up all the weights, and divide the sum of the weights by the number of features present in the supplied vector. The sum of all the weights divided by the number of features present can be utilized as a score of the classifier, which can then be compared with a decision threshold, for instance. Accordingly, where the score exceeds the decision threshold associated features can be classified as “class A”; conversely, where the score falls below the decision threshold, associated features can be categorized as falling within “class B”.
- analysis component 104 can also apply other transformations prior to ascertaining a score. For example, instead of summing up all the weights present in the supplied feature vector, analysis component 104 can provide a weighted sum where each of the weights in the feature vector can be multiplied by a term weight.
- the term weight typically can have two components, the term weight itself and a normalization component.
- the normalization component of the term weight can typically be utilized to make certain that the term weights actually represent a unit vector according to a norm, for example, an L1 norm rather than the Euclidean norm that is typically employed for this purpose.
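Putting the pieces above together, a hedged sketch of the weighted-sum score with L1-normalized term weights (function and variable names are illustrative):

```python
def score_document(features, classifier_weights, term_weights, threshold=0.0):
    """Weighted sum of classifier weights, with per-feature term weights divided by the
    document's L1 term-weight norm so that they form a unit vector under the L1 norm."""
    present = [f for f in features if f in classifier_weights]  # unknown features are discarded
    norm = sum(term_weights.get(f, 1.0) for f in present) or 1.0
    score = sum(classifier_weights[f] * term_weights.get(f, 1.0) / norm for f in present)
    return "class A" if score > threshold else "class B"
```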
- FIG. 2 provides further illustration 200 of analysis component 104 in accordance with an aspect of the claimed subject matter.
- analysis component 104 can undertake L1 normalization of term weights in accordance with the claimed subject matter and can include assignment component 202 that can associate term weights to features.
- Term weights can be pre-specified or alternatively and/or additionally can be derived from training collections (e.g., tf×idf based). Where multi-stage learning is employed, term weights can also result from running previous iterations. It should be noted that term weights are typically assumed to be non-negative, but as will be appreciated by those conversant in this field of endeavor, the claimed subject matter is not so limited.
- analysis component 104 can also include summation component 204 that, for each document received, aggregates the term weights associated with the document to provide the document's norm (e.g., L1 norm), and divisor component 206 that, for each feature in the document, divides the feature's term weight by the document's term norm (e.g., L1 norm), the result becoming the feature's new term weight.
- analysis component 104 can include transformation component 208 that constructs a model based at least in part on the final term weight representation.
- transformation component 208 can apply the same transformation when using the model with previously unseen data.
- FIG. 3 provides yet further illustration 300 of analysis component 104 in accordance with another aspect of the claimed subject matter.
- analysis component 104 can provide inductive term weight optimization and can include training component 302 that can train Naïve Bayes models utilizing a given document representation.
- analysis component 104 can also include derivation component 304 that can derive term importance from the current model and merge component 306 that can merge previous and derived (e.g., new) term importance weights.
- FIG. 4 depicts another aspect 400 of analysis component 104 in accordance with the claimed subject matter.
- analysis component 104 can provide rank preserving term weight optimization and can include initialization component 402 that can set initial performance to 0.
- analysis component 104 can include training component 404 that, like training component 302, can train Naïve Bayes models using a given document representation.
- analysis component 104 can include estimation component 406 that can estimate the performance of the Naïve Bayes model based at least in part on an evaluation set.
- estimation component 406 can direct analysis component 104 to maintain the current document representation, update the current performance and indicate to adjustment component 408 to adjust the term weights such that their order is preserved and so that the expected performance with the new representation is better than the current one.
- FIG. 5 illustrates a further aspect 500 of analysis component 104 in accordance with the subject matter as claimed.
- Analysis component 104 can include training component 502 that, like training components 302 and 404, can train Naïve Bayes models using given document representations.
- Analysis component 104 can also include reduction component 504 that can reduce each document to its top-N features (e.g., reduction component 504 can use the current model to estimate weight relevance).
- analysis component 104 can also include retraining component 506 that uses the reduced representation provided by reduction component 504 to retrain the model and apply the retrained model to evaluation data, and iteration component 508 that permutes the document features, thus providing different choices of N, and maintains solutions that result in the best estimated performances.
- program modules can include routines, programs, objects, data structures, etc. that perform particular tasks or implement particular abstract data types.
- functionality of the program modules may be combined and/or distributed as desired in various aspects.
- FIG. 6 illustrates a machine-implemented methodology 600 that effectuates and facilitates normalization of term weights in accordance with an aspect of the claimed subject matter.
- Method 600 can commence at 602 at which point various and sundry initialization tasks and processes can be initiated upon completion of which the method can proceed to 604 .
- term weights can be assigned to features. Term weights can be pre-specified or can be derived from training collections (e.g., tf×idf based). Additionally, where multi-stage learning is implemented, term weights can result from previous iterations of a constructed model.
- term weights for each document can be summed or aggregated to provide each document's L1 norm, and at 608 each document's feature's term weight can be divided by the document's L1 norm thus providing a new term weight for the feature at issue.
- a model can be constructed using a final term weight representation. The same transformation can be applied when using the model with previously unseen data.
- FIG. 7 depicts a further illustrative methodology 700 that effectuates and facilitates inductive term weight optimization in accordance with an aspect of the claimed subject matter.
- various initialization tasks and processes can be undertaken upon completion of which method 700 can proceed to 704 .
- supplied document representations can be utilized to train a Naïve Bayes model, and at 706 term importance can be derived from the current model.
- previous and new term importance weights can be merged after which the methodology can cycle back to 704 .
- FIG. 8 provides illustration of a further methodology 800 that facilitates and effectuates rank preserving term weight optimization in accordance with an aspect of the claimed subject matter.
- initialization processes can be performed after which method 800 can proceed to 804 .
- initial performance can be set to 0 at which point, at 806, a Naïve Bayes model can be trained using a given document representation.
- the performance of the trained Naïve Bayes model can be estimated through utilization of an evaluation set.
- the current document representation is maintained and the current performance updated.
- term weights can be adjusted such that their order is preserved and so that the expected performance with the new representation is better than the current one.
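A sketch of this evaluation-and-adjust loop; here the order-preserving adjustment is assumed to come from sweeping the steepness parameter of a monotonic transform (all helper callables are placeholders for whatever trainer, evaluator, and transform are used):

```python
def rank_preserving_optimization(train_data, eval_data, base_tw, alphas,
                                 transform, train_nb, evaluate):
    """Keep the rank-preserving re-weighting that performs best on the evaluation set."""
    best_perf = 0.0                           # initial performance set to 0
    best_tw = dict(base_tw)
    for alpha in alphas:
        tw = transform(base_tw, alpha)        # monotonic in base_tw, so term ranking is preserved
        model = train_nb(train_data, tw)      # train Naive Bayes with this representation
        perf = evaluate(model, eval_data, tw)
        if perf > best_perf:                  # keep the representation, update current performance
            best_perf, best_tw = perf, tw
    return best_tw, best_perf
```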
- FIG. 9 depicts an illustrative methodology 900 that facilitates and effectuates inductive document-specific feature selection in accordance with an aspect of the claimed subject matter.
- various initialization tasks can be performed after which methodology 900 can proceed to 904 .
- a Naïve Bayes model can be trained based at least in part on the document representation.
- each document can be reduced to its top-N features (e.g., the current model can be used to help estimate weight relevance).
- the model can be retrained using the reduced representation and the retrained model applied against evaluation data.
- the solution that results in the best estimated performance can be maintained while different choices of N can be made, wherein with each different choice the method returns to 904 to further train the Naïve Bayes model.
- each component of the system can be an object in a software routine or a component within an object.
- Object oriented programming shifts the emphasis of software development away from function decomposition and towards the recognition of units of software called “objects” which encapsulate both data and functions.
- Object Oriented Programming (OOP) objects are software entities comprising data structures and operations on data. Together, these elements enable objects to model virtually any real-world entity in terms of its characteristics, represented by its data elements, and its behavior represented by its data manipulation functions. In this way, objects can model concrete things like people and computers, and they can model abstract concepts like numbers or geometrical concepts.
- a component can be, but is not limited to being, a process running on a processor, a processor, a hard disk drive, multiple storage drives (of optical and/or magnetic storage medium), an object, an executable, a thread of execution, a program, and/or a computer.
- an application running on a server and the server can be a component.
- One or more components can reside within a process and/or thread of execution, and a component can be localized on one computer and/or distributed between two or more computers.
- Artificial intelligence based systems can be employed in connection with performing inference and/or probabilistic determinations and/or statistical-based determinations as in accordance with one or more aspects of the claimed subject matter as described hereinafter.
- the term “inference,” “infer” or variations in form thereof refers generally to the process of reasoning about or inferring states of the system, environment, and/or user from a set of observations as captured via events and/or data. Inference can be employed to identify a specific context or action, or can generate a probability distribution over states, for example. The inference can be probabilistic—that is, the computation of a probability distribution over states of interest based on a consideration of data and events.
- Inference can also refer to techniques employed for composing higher-level events from a set of events and/or data. Such inference results in the construction of new events or actions from a set of observed events and/or stored event data, whether or not the events are correlated in close temporal proximity, and whether the events and data come from one or several event and data sources.
- Various classification schemes and/or systems (e.g., support vector machines, neural networks, expert systems, Bayesian belief networks, fuzzy logic, data fusion engines, and the like) can be employed in accordance with one or more aspects of the claimed subject matter.
- computer readable media can include but are not limited to magnetic storage devices (e.g., hard disk, floppy disk, magnetic strips . . . ), optical disks (e.g., compact disk (CD), digital versatile disk (DVD) . . . ), smart cards, and flash memory devices (e.g., card, stick, key drive . . . ).
- a carrier wave can be employed to carry computer-readable electronic data such as those used in transmitting and receiving electronic mail or in accessing a network such as the Internet or a local area network (LAN).
- FIG. 10 there is illustrated a block diagram of a computer operable to execute the disclosed system.
- FIG. 10 and the following discussion are intended to provide a brief, general description of a suitable computing environment 1000 in which the various aspects of the claimed subject matter can be implemented. While the description above is in the general context of computer-executable instructions that may run on one or more computers, those skilled in the art will recognize that the subject matter as claimed also can be implemented in combination with other program modules and/or as a combination of hardware and software.
- program modules include routines, programs, components, data structures, etc., that perform particular tasks or implement particular abstract data types.
- inventive methods can be practiced with other computer system configurations, including single-processor or multiprocessor computer systems, minicomputers, mainframe computers, as well as personal computers, hand-held computing devices, microprocessor-based or programmable consumer electronics, and the like, each of which can be operatively coupled to one or more associated devices.
- Computer-readable media can be any available media that can be accessed by the computer and includes both volatile and non-volatile media, removable and non-removable media.
- Computer-readable media can comprise computer storage media and communication media.
- Computer storage media includes both volatile and non-volatile, removable and non-removable media implemented in any method or technology for storage of information such as computer-readable instructions, data structures, program modules or other data.
- Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, digital video disk (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by the computer.
- the exemplary environment 1000 for implementing various aspects includes a computer 1002 , the computer 1002 including a processing unit 1004 , a system memory 1006 and a system bus 1008 .
- the system bus 1008 couples system components including, but not limited to, the system memory 1006 to the processing unit 1004 .
- the processing unit 1004 can be any of various commercially available processors. Dual microprocessors and other multi-processor architectures may also be employed as the processing unit 1004 .
- the system bus 1008 can be any of several types of bus structure that may further interconnect to a memory bus (with or without a memory controller), a peripheral bus, and a local bus using any of a variety of commercially available bus architectures.
- the system memory 1006 includes read-only memory (ROM) 1010 and random access memory (RAM) 1012 .
- a basic input/output system (BIOS) is stored in a non-volatile memory 1010 such as ROM, EPROM, EEPROM, which BIOS contains the basic routines that help to transfer information between elements within the computer 1002 , such as during start-up.
- the RAM 1012 can also include a high-speed RAM such as static RAM for caching data.
- the computer 1002 further includes an internal hard disk drive (HDD) 1014 (e.g., EIDE, SATA), which internal hard disk drive 1014 may also be configured for external use in a suitable chassis (not shown), a magnetic floppy disk drive (FDD) 1016 , (e.g., to read from or write to a removable diskette 1018 ) and an optical disk drive 1020 , (e.g., reading a CD-ROM disk 1022 or, to read from or write to other high capacity optical media such as the DVD).
- the hard disk drive 1014 , magnetic disk drive 1016 and optical disk drive 1020 can be connected to the system bus 1008 by a hard disk drive interface 1024 , a magnetic disk drive interface 1026 and an optical drive interface 1028 , respectively.
- the interface 1024 for external drive implementations includes at least one or both of Universal Serial Bus (USB) and IEEE 1394 interface technologies. Other external drive connection technologies are within contemplation of the claimed subject matter.
- the drives and their associated computer-readable media provide nonvolatile storage of data, data structures, computer-executable instructions, and so forth.
- the drives and media accommodate the storage of any data in a suitable digital format.
- computer-readable media refers to a HDD, a removable magnetic diskette, and a removable optical media such as a CD or DVD, it should be appreciated by those skilled in the art that other types of media which are readable by a computer, such as zip drives, magnetic cassettes, flash memory cards, cartridges, and the like, may also be used in the exemplary operating environment, and further, that any such media may contain computer-executable instructions for performing the methods of the disclosed and claimed subject matter.
- a number of program modules can be stored in the drives and RAM 1012 , including an operating system 1030 , one or more application programs 1032 , other program modules 1034 and program data 1036 . All or portions of the operating system, applications, modules, and/or data can also be cached in the RAM 1012 . It is to be appreciated that the claimed subject matter can be implemented with various commercially available operating systems or combinations of operating systems.
- a user can enter commands and information into the computer 1002 through one or more wired/wireless input devices, e.g., a keyboard 1038 and a pointing device, such as a mouse 1040 .
- Other input devices may include a microphone, an IR remote control, a joystick, a game pad, a stylus pen, touch screen, or the like.
- These and other input devices are often connected to the processing unit 1004 through an input device interface 1042 that is coupled to the system bus 1008, but can be connected by other interfaces, such as a parallel port, an IEEE 1394 serial port, a game port, a USB port, an IR interface, etc.
- a monitor 1044 or other type of display device is also connected to the system bus 1008 via an interface, such as a video adapter 1046 .
- a computer typically includes other peripheral output devices (not shown), such as speakers, printers, etc.
- the computer 1002 may operate in a networked environment using logical connections via wired and/or wireless communications to one or more remote computers, such as a remote computer(s) 1048 .
- the remote computer(s) 1048 can be a workstation, a server computer, a router, a personal computer, portable computer, microprocessor-based entertainment appliance, a peer device or other common network node, and typically includes many or all of the elements described relative to the computer 1002 , although, for purposes of brevity, only a memory/storage device 1050 is illustrated.
- the logical connections depicted include wired/wireless connectivity to a local area network (LAN) 1052 and/or larger networks, e.g., a wide area network (WAN) 1054 .
- LAN and WAN networking environments are commonplace in offices and companies, and facilitate enterprise-wide computer networks, such as intranets, all of which may connect to a global communications network, e.g., the Internet.
- the computer 1002 When used in a LAN networking environment, the computer 1002 is connected to the local network 1052 through a wired and/or wireless communication network interface or adapter 1056 .
- the adaptor 1056 may facilitate wired or wireless communication to the LAN 1052 , which may also include a wireless access point disposed thereon for communicating with the wireless adaptor 1056 .
- the computer 1002 can include a modem 1058 , or is connected to a communications server on the WAN 1054 , or has other means for establishing communications over the WAN 1054 , such as by way of the Internet.
- the modem 1058 which can be internal or external and a wired or wireless device, is connected to the system bus 1008 via the serial port interface 1042 .
- program modules depicted relative to the computer 1002 can be stored in the remote memory/storage device 1050 . It will be appreciated that the network connections shown are exemplary and other means of establishing a communications link between the computers can be used.
- the computer 1002 is operable to communicate with any wireless devices or entities operatively disposed in wireless communication, e.g., a printer, scanner, desktop and/or portable computer, portable data assistant, communications satellite, any piece of equipment or location associated with a wirelessly detectable tag (e.g., a kiosk, news stand, restroom), and telephone.
- the communication can be a predefined structure as with a conventional network or simply an ad hoc communication between at least two devices.
- Wi-Fi is a wireless technology similar to that used in a cell phone that enables such devices, e.g., computers, to send and receive data indoors and out; anywhere within the range of a base station.
- Wi-Fi networks use radio technologies called IEEE 802.11x (a, b, g, etc.) to provide secure, reliable, fast wireless connectivity.
- a Wi-Fi network can be used to connect computers to each other, to the Internet, and to wired networks (which use IEEE 802.3 or Ethernet).
- Wi-Fi networks can operate in the unlicensed 2.4 and 5 GHz radio bands.
- IEEE 802.11 applies generally to wireless LANs and provides 1 or 2 Mbps transmission in the 2.4 GHz band using either frequency hopping spread spectrum (FHSS) or direct sequence spread spectrum (DSSS).
- IEEE 802.11a is an extension to IEEE 802.11 that applies to wireless LANs and provides up to 54 Mbps in the 5 GHz band.
- IEEE 802.11a uses an orthogonal frequency division multiplexing (OFDM) encoding scheme rather than FHSS or DSSS.
- IEEE 802.11b (also referred to as 802.11 High Rate DSSS or Wi-Fi) is an extension to 802.11 that applies to wireless LANs and provides 11 Mbps transmission (with a fallback to 5.5, 2 and 1 Mbps) in the 2.4 GHz band.
- IEEE 802.11g applies to wireless LANs and provides 20+ Mbps in the 2.4 GHz band.
- Products can contain more than one band (e.g., dual band), so the networks can provide real-world performance similar to the basic 10BaseT wired Ethernet networks used in many offices.
- the system 1100 includes one or more client(s) 1102 .
- the client(s) 1102 can be hardware and/or software (e.g., threads, processes, computing devices).
- the client(s) 1102 can house cookie(s) and/or associated contextual information by employing the claimed subject matter, for example.
- the system 1100 also includes one or more server(s) 1104 .
- the server(s) 1104 can also be hardware and/or software (e.g., threads, processes, computing devices).
- the servers 1104 can house threads to perform transformations by employing the claimed subject matter, for example.
- One possible communication between a client 1102 and a server 1104 can be in the form of a data packet adapted to be transmitted between two or more computer processes.
- the data packet may include a cookie and/or associated contextual information, for example.
- the system 1100 includes a communication framework 1106 (e.g., a global communication network such as the Internet) that can be employed to facilitate communications between the client(s) 1102 and the server(s) 1104 .
- Communications can be facilitated via a wired (including optical fiber) and/or wireless technology.
- the client(s) 1102 are operatively connected to one or more client data store(s) 1108 that can be employed to store information local to the client(s) 1102 (e.g., cookie(s) and/or associated contextual information).
- the server(s) 1104 are operatively connected to one or more server data store(s) 1110 that can be employed to store information local to the servers 1104 .
Abstract
Description
{(di, yi)}: di ∈ X, yi ∈ {C, C̄}, where the objective is to find a mapping f: X → {C, C̄}
where the constant term captures the effect of class priors (e.g., which can be ignored if the classification threshold is chosen based on a validation set). For the multinomial variant of Naïve Bayes, typically utilized in text applications, a summation in equation (1) can be carried out over the terms present in the document d (e.g., as opposed to all possible terms) and the value of fi corresponds to the frequency of occurrence of term ti in d. The occurrences of terms in d can be assumed to be independent given the class label and the class conditional probabilities P(tj|C) are estimated as
where fij≧0 is the number of occurrences of term tj in document di and V is the vocabulary size. In equation (2) the Laplace technique can be applied to smooth the probability estimates. The multinomial model can be extended whereby the values fij no longer have to correspond to in-document frequency but to a function thereof. In particular, fij can be mapped to a real-valued tf×idf weight (e.g., a statistical measure employed to evaluate the importance a word has to a document in a collection of documents; the importance a word has in relation to a document typically increases proportionately with the number of times it appears in the document, but this can generally be offset by the frequency that the word has in the collection) and these features can additionally be normalized on a per-document basis so that the L2 norm of each feature vector is one.
Document-Specific Feature Selection

| Post-Induction | Full Induction |
|---|---|
| 1. Train classifier | 1. Train classifier |
| 2. Rank feature weights | 2. Rank feature weights |
| 3. Use top-N features per document in evaluation | 3. Retain top-N features per training document |
| | 4. Retrain with the new representation |
| | 5. Use top-N features per document in evaluation |
where fi is the number of occurrences of term ti in d, with Σifi=N. Under the tf×idf weighting and L2 length normalization transform, the formula can typically be changed to:
where for each term ti contained in a document, zi is its normalized tf×idf weight factor, so that zi ≧ 0 and Σi zi² = 1.
denotes the odds of term ti belonging to the target class rather than the anti-target, equation (3) can be expressed as:
and if the document-length normalization is based on the L1 norm instead of L2, then Σizi=1, which can yield:
Thus the posterior odds for a document can be a weighted geometric mean of the term-based odds for terms contained in a document. This type of formula combines probability distributions in the mixture of experts framework and is known as logarithmic opinion pooling. Under this interpretation, the terms found in a document can be considered as possibly correlated “experts”, whose opinions are pooled or aggregated. The term weight zi can correspond to the relative reliability of expert i. If all experts are considered equally reliable, the posterior probability of the classifier can be computed as the geometric mean of the term-wise posteriors. In the log space this typically is equivalent to taking the arithmetic average of log-odds weights rather than just their sum as usually done for Naïve Bayes. Note that unlike Naïve Bayes, in mixtures-of-experts there generally is no assumption of the experts' mutual independence conditioned on the class label. Indeed, much research has been devoted to deriving weight values zi able to take expert inter-correlation into account. Since odds are generally a measure of classifier confidence, taking the mean of individual opinions can be advantageous in that it typically cannot exceed the maximum of component odds (e.g., a document that contains only ambiguous terms generally cannot result in a highly confident decision). This is in contrast to the regular Naïve Bayes where the compounding effect of many weakly positive or negative features can give the appearance of very high overall confidence.
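Numerically, this pooling is just a weighted geometric mean of the per-term odds; a small sketch, assuming the per-term odds O(ti) and L1-normalized weights zi are already available:

```python
import math

def pooled_odds(term_odds, z):
    """Weighted geometric mean of term odds: prod_i O_i ** z_i, with sum(z) == 1."""
    return math.exp(sum(zi * math.log(oi) for oi, zi in zip(term_odds, z)))

# Three equally reliable "experts" with odds 4, 1 and 0.25 pool to odds 1.0;
# the result never exceeds the most confident individual opinion.
print(pooled_odds([4.0, 1.0, 0.25], [1/3, 1/3, 1/3]))
```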
can be shown to outperform Information Gain-based ranking in document-specific feature selection. The use of ALO type weights for feature ranking in Naïve Bayes is also related to the asymmetric odds-ratio criterion:
which typically works well in terms of selecting relevant features for Naïve Bayes in text categorization.
tw(ti) = idf(ti) · ALO(ti)  (5)
Nevertheless, other term weighting functions can be used in place of ALO and idf and, indeed, more than two measures can be incorporated into an aggregate term weighting function that can combine several measures of reliability. For instance, a multiplicative scheme that, given N weighting functions, computes the combined term weight for term i in document d as:
can be utilized, where fj(xi, d)>0 represents the term weight assigned by the j-th function.
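For example, equation (5)'s idf·ALO product, or any larger set of positive reliability measures, could be combined multiplicatively along these lines (a sketch; the component functions are assumed to be supplied by the caller):

```python
def combined_term_weight(term, doc, weighting_functions):
    """Multiplicative combination of N term-weighting functions, e.g. [idf, alo];
    each f(term, doc) is assumed to return a positive value."""
    weight = 1.0
    for f in weighting_functions:
        weight *= f(term, doc)
    return weight
```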
- 1. A mixture-of-experts variant of Naïve Bayes classifier is built using the original feature representation
- 2. The absolute weight values are incorporated into the term weighting function
- 3. A second mixture-of-experts variant of Naïve Bayes classifier is built using the modified document representation
It should be noted without limitation that document-specific feature selection and term weighting can also be performed at the same time. Also, the aforementioned process can continue beyond the first two models, with a compounding effect of importance weights produced by the consecutive classifiers. Nevertheless, it is likely that the weights of the individual Naïve Bayes classifiers will generally be highly correlated, thus providing little rationale for continuing with the procedure beyond the first two models.
Usually the term weights are fixed and the parameter vector, w, optimized. Nevertheless, their roles can be reversed and, for a fixed w, the term weights optimized, especially when the in-document term frequency is not taken to be part of the weighting function. Typically, while term weights can be assumed to be non-negative, this requirement can be dropped. Naïve Bayes in effect can be utilized as a “term weighting” function for the more expensive algorithms, although since the term weights can take negative values this can be considered an “unorthodox” method of applying term weights. As such, these techniques can be evaluated from the standpoint of whether Naïve Bayes term weighting provides a better or worse performance for the target algorithm (e.g., linear Support Vector Machines) when compared to the native document representation or an alternative form of term weighting.
rank(t1) ≦ … ≦ rank(tN)
and further assuming that ranking of features is maintained, the search for optimum feature weightings can be formulated as finding a set of values
tw(t1) ≧ … ≧ tw(tN) ≧ 0
such that the performance of a given learning method built over such document representation is maximized. Nevertheless, even with such monotone constraints, optimizing for both classifier parameters and term weights can be difficult. Therefore, a parameterized monotonic transform of the original term weights
f(α, x): x1 ≧ x2 ⇒ f(α, x1) ≧ f(α, x2)
can be considered, for which the best parameter settings can be determined using a validation set. For a fixed ranking function, for example, one can consider a parameterized monotonic transformation of x that preserves term ranking, but also allows one to control the steepness of the mapping via parameter α>0. For purposes of exposition and not limitation, the claimed subject matter can use a parameterized version of the softmax function. Given a set of values {xi: i=1 . . . N} the softmax function can transform them as:
This normalization can be applied on a per-document basis. For large values of α, equation (8) can approximate the case of classifying with just a single “most important” feature, while low values of the parameter can be considered equivalent to treating all term weights as equal. This generally captures the intent of document-specific term weighting, although as one of ordinary skill will appreciate, other forms of transforming the feature ranking function can be utilized. Also, in the context of Naïve Bayes applied to large amounts of data, where speed and scalability are key, it can seem counterintuitive to employ an optimization process for term weighting that can be more expensive than the process of inducing Naïve Bayes itself. In this sense, evaluating the impact of equation (8) over a small set of α values can be acceptable and similar in complexity to a search for an optimum smoothing parameter for estimating individual probabilities.
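Equation (8) itself is not reproduced in this text; a standard parameterized softmax over a document's term-weight values, which matches the behavior described (large α concentrates the weight on the strongest term, small α flattens the weights toward uniform), can be sketched as:

```python
import math

def softmax_term_weights(values, alpha):
    """Per-document parameterized softmax: exp(alpha * x_i) / sum_j exp(alpha * x_j)."""
    m = max(values)                                   # subtract the max for numerical stability
    exps = [math.exp(alpha * (x - m)) for x in values]
    total = sum(exps)
    return [e / total for e in exps]

print(softmax_term_weights([3.0, 2.0, 1.0], alpha=5.0))   # ~[0.993, 0.007, 0.000]: single dominant term
print(softmax_term_weights([3.0, 2.0, 1.0], alpha=0.01))  # ~[0.337, 0.333, 0.330]: nearly uniform
```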
Claims (20)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US11/955,007 US7788292B2 (en) | 2007-12-12 | 2007-12-12 | Raising the baseline for high-precision text classifiers |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US11/955,007 US7788292B2 (en) | 2007-12-12 | 2007-12-12 | Raising the baseline for high-precision text classifiers |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| US20090157720A1 US20090157720A1 (en) | 2009-06-18 |
| US7788292B2 true US7788292B2 (en) | 2010-08-31 |
Family
ID=40754626
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| US11/955,007 Expired - Fee Related US7788292B2 (en) | Raising the baseline for high-precision text classifiers | 2007-12-12 | 2007-12-12 |
Country Status (1)
| Country | Link |
|---|---|
| US (1) | US7788292B2 (en) |
Cited By (7)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8972328B2 (en) | 2012-06-19 | 2015-03-03 | Microsoft Corporation | Determining document classification probabilistically through classification rule analysis |
| WO2015112989A1 (en) * | 2014-01-27 | 2015-07-30 | Alibaba Group Holding Limited | Obtaining social relationship type of network subjects |
| CN105373808A (en) * | 2015-10-28 | 2016-03-02 | 小米科技有限责任公司 | Information processing method and device |
| US9659214B1 (en) * | 2015-11-30 | 2017-05-23 | Yahoo! Inc. | Locally optimized feature space encoding of digital data and retrieval using such encoding |
| CN107644101A (en) * | 2017-09-30 | 2018-01-30 | 百度在线网络技术(北京)有限公司 | Information classification approach and device, information classification equipment and computer-readable medium |
| US10339407B2 (en) * | 2017-04-18 | 2019-07-02 | Maxim Analytics, Llc | Noise mitigation in vector space representations of item collections |
| US10387564B2 (en) * | 2010-11-12 | 2019-08-20 | International Business Machines Corporation | Automatically assessing document quality for domain-specific documentation |
Families Citing this family (14)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9082080B2 (en) * | 2008-03-05 | 2015-07-14 | Kofax, Inc. | Systems and methods for organizing data sets |
| US8671112B2 (en) * | 2008-06-12 | 2014-03-11 | Athenahealth, Inc. | Methods and apparatus for automated image classification |
| US8140526B1 (en) | 2009-03-16 | 2012-03-20 | Guangsheng Zhang | System and methods for ranking documents based on content characteristics |
| US8407234B1 (en) * | 2009-04-10 | 2013-03-26 | inFRONT Devices & Systems LLC | Ordering a list embodying multiple criteria |
| US8868402B2 (en) * | 2009-12-30 | 2014-10-21 | Google Inc. | Construction of text classifiers |
| US20140123178A1 (en) | 2012-04-27 | 2014-05-01 | Mixaroo, Inc. | Self-learning methods, entity relations, remote control, and other features for real-time processing, storage, indexing, and delivery of segmented video |
| US12323673B2 (en) * | 2012-04-27 | 2025-06-03 | Comcast Cable Communications, Llc | Audiovisual content item transcript search engine |
| US11140115B1 (en) * | 2014-12-09 | 2021-10-05 | Google Llc | Systems and methods of applying semantic features for machine learning of message categories |
| US20170222960A1 (en) * | 2016-02-01 | 2017-08-03 | Linkedin Corporation | Spam processing with continuous model training |
| CN110390094B (en) * | 2018-04-20 | 2023-05-23 | 伊姆西Ip控股有限责任公司 | Method, electronic device and computer program product for classifying documents |
| US11847537B2 (en) * | 2020-08-12 | 2023-12-19 | Bank Of America Corporation | Machine learning based analysis of electronic communications |
| CN113240025B (en) * | 2021-05-19 | 2022-08-12 | 电子科技大学 | An Image Classification Method Based on Bayesian Neural Network Weight Constraints |
| CN114328934B (en) * | 2022-01-18 | 2024-05-28 | 重庆邮电大学 | Attention mechanism-based multi-label text classification method and system |
| US12008054B2 (en) * | 2022-01-31 | 2024-06-11 | Walmart Apollo, Llc | Systems and methods for determining and utilizing search token importance using machine learning architectures |
2007
- 2007-12-12: US application US11/955,007, patent US7788292B2 (en), not active (Expired - Fee Related)
Patent Citations (12)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US4833610A (en) | 1986-12-16 | 1989-05-23 | International Business Machines Corporation | Morphological/phonetic method for ranking word similarities |
| US5297039A (en) | 1991-01-30 | 1994-03-22 | Mitsubishi Denki Kabushiki Kaisha | Text search system for locating on the basis of keyword matching and keyword relationship matching |
| US6990628B1 (en) | 1999-06-14 | 2006-01-24 | Yahoo! Inc. | Method and apparatus for measuring similarity among electronic documents |
| US6785669B1 (en) | 2000-03-08 | 2004-08-31 | International Business Machines Corporation | Methods and apparatus for flexible indexing of text for use in similarity searches |
| US20040024583A1 (en) | 2000-03-20 | 2004-02-05 | Freeman Robert J | Natural-language processing system using a large corpus |
| US6810376B1 (en) | 2000-07-11 | 2004-10-26 | Nusuara Technologies Sdn Bhd | System and methods for determining semantic similarity of sentences |
| US20030172357A1 (en) * | 2002-03-11 | 2003-09-11 | Kao Anne S.W. | Knowledge management using text classification |
| US7260773B2 (en) | 2002-03-28 | 2007-08-21 | Uri Zernik | Device system and method for determining document similarities and differences |
| US20060117228A1 (en) | 2002-11-28 | 2006-06-01 | Wolfgang Theimer | Method and device for determining and outputting the similarity between two data strings |
| US20040181527A1 (en) | 2003-03-11 | 2004-09-16 | Lockheed Martin Corporation | Robust system for interactively learning a string similarity measurement |
| US20050210003A1 (en) | 2004-03-17 | 2005-09-22 | Yih-Kuen Tsay | Sequence based indexing and retrieval method for text documents |
| US7577709B1 (en) * | 2005-02-17 | 2009-08-18 | Aol Llc | Reliability measure for a classifier |
Non-Patent Citations (4)
| Title |
|---|
| Jun-Peng Bao, et al. Quick asymmetric text similarity measures. 0-7803-7865-2/03 IEEE, Proceedings of the Second International Conference on Machine Learning and Cybernetics, Xi'an, Nov. 2-5, 2003. http://ieeexplore.ieee.org/Xplore/defdeny.jsp?url=/iel5/8907/28247/01264505.pdf&htry=1?code=18. Last accessed on Oct. 4, 2007, 6 pages. |
| Rada Mihalcea, et al. Corpus-based and Knowledge-based Measures of Text Semantic Similarity, American Association for Artificial Intelligence, (www.aaai.org) 2006 http://www.cs.unt.edu/~rada/papers/mihalcea.aaai06.pdf. Last accessed Oct. 4, 2007, 6 pages. |
| Vasileios Hatzivassiloglou, et al. Detecting Text Similarity over Short Passages: Exploring Linguistic Feature Combinations via Machine Learning http://acl.ldc.upenn.edu/W/W99/W99-0625.pdf. Last accessed Oct. 4, 2007, 10 pages. |
Cited By (12)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US10387564B2 (en) * | 2010-11-12 | 2019-08-20 | International Business Machines Corporation | Automatically assessing document quality for domain-specific documentation |
| US8972328B2 (en) | 2012-06-19 | 2015-03-03 | Microsoft Corporation | Determining document classification probabilistically through classification rule analysis |
| US9495639B2 (en) | 2012-06-19 | 2016-11-15 | Microsoft Technology Licensing, Llc | Determining document classification probabilistically through classification rule analysis |
| WO2015112989A1 (en) * | 2014-01-27 | 2015-07-30 | Alibaba Group Holding Limited | Obtaining social relationship type of network subjects |
| US10037584B2 (en) | 2014-01-27 | 2018-07-31 | Alibaba Group Holding Limited | Obtaining social relationship type of network subjects |
| CN105373808A (en) * | 2015-10-28 | 2016-03-02 | 小米科技有限责任公司 | Information processing method and device |
| CN105373808B (en) * | 2015-10-28 | 2018-11-20 | 小米科技有限责任公司 | Information processing method and device |
| US9659214B1 (en) * | 2015-11-30 | 2017-05-23 | Yahoo! Inc. | Locally optimized feature space encoding of digital data and retrieval using such encoding |
| US20170154216A1 (en) * | 2015-11-30 | 2017-06-01 | Yahoo! Inc. | Locally optimized feature space encoding of digital data and retrieval using such encoding |
| US10339407B2 (en) * | 2017-04-18 | 2019-07-02 | Maxim Analytics, Llc | Noise mitigation in vector space representations of item collections |
| US20190318191A1 (en) * | 2017-04-18 | 2019-10-17 | Maxim Analytics, Llc | Noise mitigation in vector space representations of item collections |
| CN107644101A (en) * | 2017-09-30 | 2018-01-30 | 百度在线网络技术(北京)有限公司 | Information classification approach and device, information classification equipment and computer-readable medium |
Also Published As
| Publication number | Publication date |
|---|---|
| US20090157720A1 (en) | 2009-06-18 |
Similar Documents
| Publication | Title |
|---|---|
| US7788292B2 (en) | Raising the baseline for high-precision text classifiers |
| Yu et al. | Federated learning with only positive labels | |
| US20240312198A1 (en) | System and method for mitigating bias in classification scores generated by machine learning models | |
| US20240232292A1 (en) | Pattern change discovery between high dimensional data sets | |
| Figueiredo | Adaptive sparseness for supervised learning | |
| US20210383254A1 (en) | Adaptive pointwise-pairwise learning to rank | |
| Yue et al. | Hierarchical exploration for accelerating contextual bandits | |
| JP4813744B2 (en) | User profile classification method based on analysis of web usage | |
| US9536201B2 (en) | Identifying associations in data and performing data analysis using a normalized highest mutual information score | |
| US12182713B2 (en) | Multi-task equidistant embedding | |
| US7809665B2 (en) | Method and system for transitioning from a case-based classifier system to a rule-based classifier system | |
| CN109840833B (en) | Bayesian collaborative filtering recommendation method | |
| US20130097103A1 (en) | Techniques for Generating Balanced and Class-Independent Training Data From Unlabeled Data Set | |
| US20120310864A1 (en) | Adaptive Batch Mode Active Learning for Evolving a Classifier | |
| Du et al. | Probabilistic streaming tensor decomposition | |
| Huang et al. | Spectral clustering via adaptive layer aggregation for multi-layer networks | |
| JP2015526795A (en) | Method and apparatus for estimating user demographic data | |
| US10936964B2 (en) | Method and apparatus for estimating multi-ranking using pairwise comparison data | |
| CN112948683B (en) | A social recommendation method based on dynamic fusion of social information | |
| CN115689673A (en) | A recommendation method, system, medium and equipment based on sorting comparison loss | |
| Khan et al. | A study on relationship between prediction uncertainty and robustness to noisy data | |
| Maximov et al. | Tight risk bounds for multi-class margin classifiers | |
| Houle et al. | Improving k-NN graph accuracy using local intrinsic dimensionality | |
| Huang et al. | Categorizing social multimedia by neighborhood decision using local pairwise label correlation | |
| Steyn et al. | A nearest neighbor open-set classifier based on excesses of distance ratios |
Legal Events

| Date | Code | Title | Description |
|---|---|---|---|
| | AS | Assignment | Owner name: MICROSOFT CORPORATION, WASHINGTON. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; Assignors: KOLCZ, ALEKSANDER; YIH, WEN-TAU; Signing dates: from 20071210 to 20071211; Reel/Frame: 020236/0134 |
| | FEPP | Fee payment procedure | Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); Entity status of patent owner: LARGE ENTITY |
| | REMI | Maintenance fee reminder mailed | |
| | LAPS | Lapse for failure to pay maintenance fees | |
| | STCH | Information on status: patent discontinuation | Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
| | FP | Lapsed due to failure to pay maintenance fee | Effective date: 20140831 |
| AS | Assignment |
Owner name: MICROSOFT TECHNOLOGY LICENSING, LLC, WASHINGTON Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:MICROSOFT CORPORATION;REEL/FRAME:034542/0001 Effective date: 20141014 |