US11514233B2 - Automated nonparametric content analysis for information management and retrieval - Google Patents
Automated nonparametric content analysis for information management and retrieval
- Publication number
- US11514233B2 (application US16/415,065; US201716415065A)
- Authority
- US
- United States
- Prior art keywords
- elements
- feature
- categories
- computationally
- text
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active, expires
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2132—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on discrimination criteria, e.g. discriminant analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
- G06F18/2135—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods based on approximation criteria, e.g. principal component analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/2155—Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the incorporation of unlabelled data, e.g. multiple instance learning [MIL], semi-supervised techniques using expectation-maximisation [EM] or naïve labelling
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
- G06F18/241—Classification techniques relating to the classification model, e.g. parametric or non-parametric approaches
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F40/00—Handling natural language data
- G06F40/10—Text processing
- G06F40/12—Use of codes for handling textual entities
- G06F40/151—Transformation
-
- G06K9/6247—
-
- G06K9/6268—
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N5/00—Computing arrangements using knowledge-based models
- G06N5/01—Dynamic search techniques; Heuristics; Dynamic trees; Branch-and-bound
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06V—IMAGE OR VIDEO RECOGNITION OR UNDERSTANDING
- G06V30/00—Character recognition; Recognising digital ink; Document-oriented image-based pattern recognition
- G06V30/40—Document-oriented image-based pattern recognition
- G06V30/41—Analysis of document content
- G06V30/418—Document matching, e.g. of document images
Definitions
- the field of the invention relates, generally, to systems and methods for content analysis of documents and, more particularly, to content analysis using a nonparametric estimation approach.
- One conventional approach for document quantification utilizes a parametric “classify-and-count” method.
- This approach is highly model-dependent and may rely on a perfect classifier that is unrealistic in real applications and unnecessary for aggregate accuracy if individual-level errors cancel.
- the classify-and-count approach generally involves choosing a classifier by trying to maximize the proportion of individual documents correctly classified; this often yields biased estimates of statistical aggregates. For example, in many applications, a method that classifies 60% of documents correctly would be judged successful, and useful for individual classification (i.e., representing approximately how often a GOOGLE or BING search returned what was desired), but since this means that category percentages can be off by as much as 40 percentage points, the same classifier may be useless for social science purposes.
- the conventional approach begins document quantification and classification by analyzing a small subset of documents with category labels (which generally are hand-coded); it then assumes that a set of unanalyzed, unlabeled elements is drawn from the same population as that of the labeled set for calibrating class probabilities.
- the labeled set may be created in one time period while the unlabeled set may be collected during a subsequent time period and may have a different distribution.
- the document quantification obtained using this approach may be inaccurate.
- Another document-quantification approach utilizes direct estimation of the category proportions to avoid the problems associated with the classify-and-count approach.
- this approach estimates category percentages without resorting to individual classifications as a first step, thereby providing better results in applications such as text analysis for political science.
- the direct-estimation approach also remains imperfect. For example, it suffers when there is a lack of textual discrimination (i.e., the meaning and usage of language is too similar across categories) or when there is a “concept drift” (i.e., the meaning and usage of language is too different between the training and test data sets).
- Embodiments of the present invention involve applying feature extraction and/or matching in combination with nonparametric estimation to estimate the proportion of documents in each of a plurality of labeled categories with improved accuracy and computational performance compared to conventional approaches. More specifically, feature extraction and/or matching may advantageously reduce the effective divergence (such as the concept drift) between two data sets while increasing textual discrimination between different categories and/or different textual features so as to increase the precision and, in most cases, the accuracy of estimation. In one embodiment, feature extraction creates a continuous feature space that effectively discriminates among categories and contains as many non-redundant or independent features as possible.
- feature extraction utilizes a projection matrix that projects a document-feature matrix in the feature space (i.e., the space whose dimensions correspond to document features) onto a lower-dimensional subspace matrix (which may be custom-configured for a particular application or document type).
- the projection may be linear, nonlinear or random.
- the projection matrix may be optimized to maximize an equally weighted sum of a category-discrimination metric and a feature-discrimination metric. This approach may effectively reduce estimation errors resulting from lack of textual discrimination, as in prior approaches.
- matching may be utilized to construct a matched set that closely resembles an unlabeled (or unobserved) data set based on a labeled (or observed) set, thereby improving the resemblance between the distributions of the labeled and unlabeled sets.
- matching first identifies, for each document in the unlabeled set, three nearest neighbors, defined in a Euclidean space, among the documents from the labeled set.
- other documents in the labeled set that are closer than the median nearest neighbor among the three nearest neighbors of all documents in the unobserved set may be identified. Any documents in the labeled set that are not identified in the above manner are pruned out and excluded from the analysis.
- this approach advantageously addresses concept drift and proportion divergence.
- estimations of category proportions may be significantly improved without the need for tuning or using model-dependent classification methods developed in particular fields for their quantities of interest.
- Embodiments of the invention may improve accuracy in applications relating to item categorization and classification, interpretation (e.g., of the content of the items and what this suggests), and/or retrieval of computational and real objects subject to automated analysis.
- the invention pertains to a method of computationally estimating a proportion of data elements in multiple data categories.
- the method includes (a) receiving and electronically storing the first set of elements, each element in the first set being computationally assigned to one of the categories and having one of the feature profiles computationally associated therewith; (b) receiving and electronically storing the second set of elements, each element in the second set having one of the feature profiles computationally associated therewith; (c) computationally defining a continuous feature space having multiple numerical variables representing the feature profiles in the first set, the feature space being configured to discriminate between the categories and the feature profiles; (d) computationally constructing, based at least in part on the first set, a matched set that substantially resembles the second set, each element in the matched set being associated with multiple numerical variables representing multiple feature profiles associated therewith; and (e) estimating a distribution of the elements in the second set over the categories based at least in part on (i) the numerical variables associated with the feature profiles in the matched set and (ii)
- the method may further include computationally creating, with respect to the feature space, an element-feature matrix data structure having rows for at least some of the elements in the first and second sets and columns for the feature profiles associated therewith.
- the method may include computationally creating a projection matrix data structure for projecting the element-feature matrix onto a lower-dimensional subspace matrix data structure; the distribution of the elements in the second set over the categories is estimated based at least in part on the numerical variables in the lower-dimensional subspace matrix data structure.
- the method further includes optimizing the projection matrix by maximizing an equally weighted sum of a category-discrimination metric and a feature-discrimination metric.
- the projection matrix may be optimized using the Hooke-Jeeves algorithm.
- the projection may be linear, nonlinear or random.
- step (d) of the method includes (i) identifying, for each element in the second set, three nearest neighbors among the elements from the first set, and (ii) identifying the elements in the first set that are closer than a median nearest neighbor among the three nearest neighbors of all elements in the second set; the matched set is then constructed by pruning out the elements that are not identified in steps (i) and (ii) in the first set.
- the method may further include filtering the elements in the first set and/or second set so as to retain only information of interest.
- the elements in the first and second sets include text; the method includes (i) converting the text to lowercase and removing punctuation marks, (ii) mapping a word to its stem and/or (iii) summarizing the feature profiles in the first set and/or second set as a set of dichotomous variables.
- the distribution of the elements in the second set over the categories may not be constrained to be the same as the distribution of the elements in the first set over the categories.
- the distribution of the elements in the second set over the categories may be unbiased.
- the distribution of the elements in the second set over the categories may be estimated without assigning the elements in the second set to the categories individually.
- the method includes storing the distribution of the elements in the second set over the categories on a computer storage medium.
- the elements in the first and second sets may include text, audio, and/or video data encapsulated in files, streams, and/or database entries.
- the feature profiles may indicate whether certain words and/or combinations of words occur in the text.
- the text is unstructured.
- the method further includes analyzing at least some of the elements in the first set or the second set to obtain the feature profiles associated with the elements.
- the invention, in another aspect, relates to an apparatus for computationally estimating a proportion of data elements in multiple data categories.
- the apparatus includes a computer memory; a non-transitory storage device for data storage and retrieval; and a computer processor configured to (a) receive and electronically store the first set of elements in the memory, each element in the first set being computationally assigned to one of the categories and having one of the feature profiles computationally associated therewith; (b) receive and electronically store a second set of elements in the memory, each element in the second set having one of the feature profiles computationally associated therewith; (c) computationally define a continuous feature space having multiple numerical variables representing the feature profiles in the first set, the feature space being configured to discriminate between the categories and the feature profiles; (d) computationally construct, based at least in part on the first set, a matched set that substantially resembles the second set, each element in the matched set being associated with multiple numerical variables representing multiple feature profiles associated therewith; and (e) estimate a distribution of the elements in the second set over the categories
- the computer processor may be further configured to computationally create, with respect to the feature space, an element-feature matrix data structure having rows for at least some of the elements in the first and second sets and columns for the feature profiles associated therewith.
- the computer processor may be configured to computationally create a projection matrix data structure for projecting the element-feature matrix onto a lower-dimensional subspace matrix data structure; the distribution of the elements in the second set over the categories is estimated based at least in part on the numerical variables in the lower-dimensional subspace matrix data structure.
- the computer processor is further configured to optimize the projection matrix by maximizing an equally weighted sum of a category-discrimination metric and a feature-discrimination metric.
- the computer processor may be configured to optimize the projection matrix using the Hooke-Jeeves algorithm.
- the projection may be linear, nonlinear or random.
- the computer processor is further configured to (i) identify, for each element in the second set, three nearest neighbors among the elements from the first set, and (ii) identify the elements in the first set that are closer than a median nearest neighbor among the three nearest neighbors of all elements in the second set; the computer processor is then configured to construct the matched set by pruning out the elements that are not identified in steps (i) and (ii) in the first set.
- the computer processor may be configured to filter the elements in the first set and/or second set so as to retain only information of interest.
- the elements in the first and second sets include text; the computer processor is further configured to (i) convert the text to lowercase and remove punctuation marks, (ii) map a word to its stem, and/or (iii) summarize the feature profiles in the first set and/or second set as a set of dichotomous variables.
- the distribution of the elements in the second set over the categories may not be constrained to be the same as a distribution of the elements in the first set over the categories.
- the distribution of the elements in the second set over the categories may be unbiased.
- the computer processor may be configured to estimate the distribution of the elements in the second set over the categories without assigning the elements in the second set to the categories individually.
- the computer processor is configured to store the distribution of the elements in the second set over the categories on the computer memory and/or the storage device.
- the elements in the first and second sets may include text, audio, and/or video data encapsulated in files, streams, and/or database entries.
- the feature profiles may indicate whether certain words and/or combinations of words occur in the text.
- the text is unstructured.
- the computer processor is further configured to analyze at least some of the elements in the first set or the second set to obtain the feature profiles associated with the elements.
- FIG. 1 is a flowchart of an exemplary nonparametric approach for estimating category proportions in a data set in accordance with various embodiments
- FIG. 2 is a flowchart of an exemplary approach for preprocessing data in a data set in accordance with various embodiments
- FIG. 3 depicts an exemplary document-feature matrix F in accordance with various embodiments
- FIGS. 4A-4C illustrate the dependence of mean-square errors on the proportion divergence and category discrimination in accordance with various embodiments
- FIG. 5 is a flowchart of an exemplary feature-extraction approach in accordance with various embodiments.
- FIG. 6A depicts projections of a document-feature matrix when its projection matrix is optimized by maximizing the category discrimination only in accordance with various embodiments
- FIG. 6B depicts projections of a document-feature matrix when its projection matrix is optimized by maximizing the feature discrimination only in accordance with various embodiments
- FIG. 6C depicts projections of a document-feature matrix when its projection matrix is optimized by maximizing the sum of both category discrimination and feature discrimination in accordance with various embodiments
- FIG. 7 is a flowchart of an exemplary matching approach in accordance with various embodiments.
- FIG. 8A depicts a comparison of estimation errors in document quantifications using various estimation approaches
- FIG. 8B depicts a comparison of estimation errors in document quantifications using an unimproved and an improved estimation approach in accordance with various embodiments.
- FIG. 9 is a block diagram illustrating a facility for performing estimations of document quantifications in a data set in accordance with various embodiments.
- Embodiments of the present invention relate to nonparametric estimation of the proportion of documents in each of a plurality of labeled categories without necessarily classifying each individual document (although if desired, the nonparametric approach described herein may be used to improve individual classifications as well). Accuracy of the proportion estimations can be improved by applying feature extraction and/or matching in conjunction with nonparametric estimation as further described below.
- a nonparametric estimation method 100 for document quantifications involves two steps: representing unstructured text in the documents by structured numerical variables (step 102 ) and statistically analyzing the numerical summaries to estimate the category proportions of interest (step 104 ).
- although the description herein refers to performing the nonparametric method 100 for estimating the proportion of textual documents only, it should be understood that the same approach generally applies as well to other information formats (e.g., audio, and/or video data encapsulated in files, streams, database entries, or any other suitable data format) and other sets of objects (e.g., people, deaths, attitudes, buildings, books, etc.) for which the goal is to estimate category proportions.
- the nonparametric method 100 may be applied to any structured, unstructured, and/or partially structured source data. Nonparametric estimation is generally described in U.S. Pat. No. 8,180,
- step 102 includes receiving textual documents from a data source, such as an application, a storage medium, or any suitable device.
- the received documents may be optionally preprocessed in one or more steps to reduce the complexity of the text.
- the documents may be filtered using any suitable filtering technique to retain the information of interest only (step 202 ).
- the filter is designed to retain English documents relating to a specific topic only.
- although filtering is not required in the nonparametric method 100, it may help focus the documents on particular elements of interest, thereby reducing analysis time.
- the text within each document may be converted to lowercase and all punctuation marks may be removed (step 204 ).
- the language complexity may be reduced by mapping a word to its stem.
- the stemming process may reduce “consist,” “consisted,” “consistency,” “consistent,” “consistently,” “consisting,” and “consists” to their stem—i.e., “consist” (step 206 ).
- the preprocessed text may be summarized as a set of dichotomous variables: one type for the presence (coded 1) or absence (coded 0) of each word stem (or “unigram”), a second type for each word pair (or “bigram”), a third type for each word triplet (or “trigram”), and so on to all “n-grams” (step 208 ).
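- By way of illustration, a minimal Python sketch of this preprocessing (steps 204-208) might look as follows; the choice of the Porter stemmer and the restriction to unigrams and bigrams are assumptions of the sketch, not requirements of the method.

```python
import re

from nltk.stem.porter import PorterStemmer  # assumed stemmer; any stemming routine may be substituted

stemmer = PorterStemmer()

def preprocess(text):
    """Lowercase the text, strip punctuation, and map each word to its stem (steps 204-206)."""
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return [stemmer.stem(token) for token in text.split()]

def dichotomous_profile(tokens):
    """Summarize a document as presence/absence indicators for unigrams and bigrams (step 208)."""
    unigrams = set(tokens)
    bigrams = {" ".join(pair) for pair in zip(tokens, tokens[1:])}
    return unigrams | bigrams

profiles = [dichotomous_profile(preprocess(doc)) for doc in
            ["The results were consistent.", "Consistency of the results varied."]]
```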
- a small set of the received documents is selected for labeling with one of a given number of categories (these may be referred to herein as “labeled documents” in a “labeled” set), while the rest of the received documents are unlabeled (and may be referred to herein as “unlabeled documents” in an “unlabeled” set).
- the labeled set may be chosen for having specific qualities or characteristics that differ in dramatic but specific ways from the characteristics of the broader population of the source data.
- the set of labeled documents may be randomly or pseudo-randomly selected.
- a generic index, i denotes a document in either the labeled or unlabeled set
- N denotes a generic description of the size of either set.
- each document in the labeled set is individually classified into one of the categories, C, using a suitable approach.
- the classification may be performed by human reading and hand coding; alternatively, it may be performed using any suitable automated technique known to one of skill in the art.
- the documents in the labeled set have “observed” classifications
- documents in the unlabeled set have “unobserved” classifications.
- the coding exists in the documents themselves.
- the documents may include customers' ratings from 1 to 10 or from one star to five stars. The numbers from 1-10 or star numbers may then be the categories.
- the numbers of documents in category c in the labeled and unlabeled sets are denoted as N c L and N c U , respectively, and N c refers to either set in category c.
- step 102 includes a conventional procedure that maps the textual documents in the labeled and unlabeled sets, in their entirety, into a numerical feature space such that the natural language in the documents is represented as numerical variables (this step is referred to herein as the “text-to-numbers step”).
- the mapping procedure may be designed to optimize quantification of the documents as further described below.
- estimation of the document quantification is simplified to consider only dichotomous stemmed unigram indicator variables (i.e., the presence or absence of each of a list of word stems).
- Each element of this matrix, F iw is a binary indicator (0 or 1) for whether the document i is characterized by a word stem profile w.
- the feature vector of the unlabeled set, S U may also be determined using the same procedure with the same word-stem profiles.
- the conditional feature vectors, X c L and X c U , may be computed from F c , a document-feature matrix representing only the documents in category c, utilizing the same procedure applied to obtain S L , as described above.
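- One way to realize these quantities computationally is sketched below in Python; a common choice, assumed here, is to take S as the mean of each feature over the documents in a set and each column of X L as the corresponding mean within one labeled category. The function names are illustrative only.

```python
import numpy as np

def feature_matrix(profiles, vocabulary):
    """Binary document-feature matrix F: F[i, w] = 1 if document i contains feature w."""
    index = {feature: w for w, feature in enumerate(vocabulary)}
    F = np.zeros((len(profiles), len(vocabulary)))
    for i, profile in enumerate(profiles):
        for feature in profile:
            if feature in index:
                F[i, index[feature]] = 1.0
    return F

def unconditional_feature_vector(F):
    """S: the mean of each feature over all documents in a (labeled or unlabeled) set."""
    return F.mean(axis=0)

def conditional_feature_matrix(F_labeled, labels, categories):
    """X^L: a W x C matrix whose column c is the feature mean over labeled documents in category c."""
    labels = np.asarray(labels)
    return np.column_stack([F_labeled[labels == c].mean(axis=0) for c in categories])
```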
- to estimate the quantities of interest, the unlabeled category proportions π U ≡{π 1 U , . . . , π c U }, using S L , S U , and X L , in various embodiments, an accounting identity (i.e., true by definition) may be implemented.
- ⁇ U is estimated by replacing X U′ and X U with X L′ , and X L , respectively.
- the nonparametric approach described herein may provide estimations of the category proportion of the unlabeled set without classifying each document therein.
- the labeled conditional feature matrix may be expressed as an unbiased and consistent estimator of the unlabeled conditional feature matrix:
- the matrix X L must be assumed to be of full rank, which translates into: (i) feature choices that lead to W>C, and (ii) the lack of perfect collinearity among the columns of X L .
- Assumption (i) may be easy to control by generating a sufficient number of features from the text. Assumption (ii) may be violated if the feature distributions in documents across different categories are identical (which is generally unlikely with a sufficient number of coded documents).
- nonparametric estimation as described herein may also include linear regression with random measurement errors in the explanatory variables. But because nonparametric estimation as described herein is carried out in the converted feature space (as opposed to the original observation space), the estimations may be statistically consistent: as more documents for the labeled set are collected and coded (while keeping W fixed, or at least growing more slowly than n), the estimator described herein converges to the truth:
- the estimator in various embodiments is then the least-square estimator of ⁇ 1 U , and can be written as follows based on four propositions.
- Proposition 1: the two-category estimator in the nonparametric approach herein is:
- ⁇ wc is a random variable with the mean zero and variance inversely proportional to N c .
- Proposition 2: the expected value of the two-category estimator is:
- Proposition 3: the approximate bias of the estimator is:
- the bias of the nonparametric estimation is smallest when the category proportion divergence is smallest between the labeled and unlabeled sets.
- the three factors that affect the performance of nonparametric estimation as described herein can be varied. These are the degree of concept drift (i.e., how the meaning of the text changes between the labeled and unlabeled sets), textual discrimination (i.e., how distinct the language in different categories is), and proportion divergence (i.e., differences between ⁇ U and ⁇ L ).
- proportion divergence can be controlled by drawing ⁇ U and ⁇ L from independent and identically distributed (IID) Dirichlet distributions with concentration parameters set to 2.
- X wc U can be sampled from an IID Normal distribution with the mean of 0 and variance of 1/9.
- 5,000 repeated sample data sets can be generated from each set of parameters, the nonparametric estimator can be applied, and the mean squared error (MSE) can be estimated.
- document-level features can be randomly generated from Normal densities by adding a draw from a standard Normal to each cell value of X wc L and X wc U .
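- A compact simulation along these lines might be sketched as follows; only the Dirichlet and Normal distributions are taken from the description above, while the numbers of categories and features, the labeled-set size, and the way sampling error enters X L are illustrative assumptions of the sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
C, W, N_L, n_reps = 5, 20, 300, 5000  # illustrative sizes; only the distributions follow the text

def simulated_mse():
    errors = []
    for _ in range(n_reps):
        pi_U = rng.dirichlet(np.full(C, 2.0))            # IID Dirichlet(2) draws control proportion divergence
        pi_L = rng.dirichlet(np.full(C, 2.0))
        N_c = np.maximum((pi_L * N_L).round(), 1)        # labeled documents per category
        X_U = rng.normal(0.0, 1.0 / 3.0, size=(W, C))    # mean 0, variance 1/9
        X_L = X_U + rng.normal(size=(W, C)) / np.sqrt(N_c)  # sampling error shrinking with category size
        S_U = X_U @ pi_U
        est, *_ = np.linalg.lstsq(X_L, S_U, rcond=None)
        errors.append(np.mean((est - pi_U) ** 2))
    return float(np.mean(errors))
```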
- FIG. 4A illustrates how the MSE behaves as a function of category discrimination (vertically) and proportion divergence (horizontally). MSE is coded as illustrated from low (black) to high (white). Performance of the nonparametric estimation approach is the best at the top left—i.e., where the proportion divergence is low and category discrimination is high. When the language is clearly distinguishable between different categories, the estimation approach can overcome even very large divergences between the labeled and unlabeled sets. Without good textual discrimination, the estimation approach can become vulnerable to high levels of proportion divergence. Category discrimination and proportion divergence appear to have roughly the same relative importance, as the contour lines in FIG. 4A fall at approximately 45° angles.
- FIG. 4B illustrates how the category discrimination (horizontally) and feature discrimination (vertically) jointly impact the MSE. If the feature discrimination is held fixed, increasing the category discrimination may improve the estimation performance; if the category discrimination is held fixed, a greater feature discrimination may similarly lead to better performance. Of these two factors, the feature discrimination is more predictive of the performance, but both factors may be important.
- FIG. 4C illustrates how the relationship between the feature discrimination (three separate lines in each panel) and the proportion divergence (horizontal axis) is mediated by the presence of the concept drift (difference between the panels). Without the concept drift (left panel), highly discriminatory features greatly reduce the MSE (which can be seen by the wide separation among the lines). In contrast, in the presence of the concept drift (in this case, the mean of E(X L ) can be moved by a quarter of a standard deviation from X U ), more discriminatory features still tend to outperform less discriminatory features; but the difference is less pronounced. With the concept drift, features which are discriminatory in the labeled set may no longer be discriminatory in the unlabeled set.
- performance of the nonparametric estimation may be degraded in the presence of concept drift, a lack of textual discrimination, and proportion divergence.
- concept drift occurs when the meaning of words changes between the labeled and unlabeled sets.
- obvious inferential problems for any text analytic method may occur if “Watergate” refers to a hotel in the labeled set but a scandal in the unlabeled set.
- matching may be applied to address this problem, thereby improving the estimations as further described below.
- proportion divergence interacts with the other two problems, meaning that the category proportions in the labeled set ⁇ L diverge from those in the unlabeled set ⁇ U .
- the nonparametric estimation approach described herein may accurately return the observed proportions in the labeled set, ⁇ L , which is an unbiased estimate of ⁇ U .
- the labeled set is selectively pruned to improve the estimations as further described below.
- a feature-extraction method 500 is applied to improve textual discrimination and, as a result, the estimation results.
- the feature-extraction method 500 may include optimizing the text-to-numbers step 102 described above for direct estimation of the category proportions.
- the discrete feature space created in step 102 is first replaced with a reduced, continuous feature space having document-level “global vectors” (step 502 ).
- the reduced feature space may be established using a global log-bilinear regression model as described in “Glove: Global vectors for word representation,” Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532-1543, the entire disclosure of which is hereby incorporated by reference.
- the reduced feature space includes 50 vector dimensions; statistical values (e.g., the minimum, maximum, and/or mean values) of the vectors in the reduced feature space can then be computed.
- the text-to-numbers summary in the reduced feature space may produce a more informative F matrix.
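- As a hedged illustration, document-level global vectors might be summarized as below; the 50-dimensional embeddings and the min/max/mean summary statistics follow the description above, while the lookup table `word_vectors` (e.g., pretrained GloVe vectors) is an assumed input.

```python
import numpy as np

def document_global_vector(tokens, word_vectors, dim=50):
    """Summarize a document by the element-wise minimum, maximum, and mean of its word vectors."""
    vectors = np.array([word_vectors[t] for t in tokens if t in word_vectors])
    if vectors.size == 0:
        return np.zeros(3 * dim)   # no known words: fall back to a zero summary
    return np.concatenate([vectors.min(axis=0), vectors.max(axis=0), vectors.mean(axis=0)])
```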
- the n×W document-feature matrix F may be projected onto a custom-built n×W′ (where W′<W) lower-dimensional subspace matrix F̄.
- a nonlinear projection or a random projection may be utilized.
- the definition of the conditional feature vector, X, remains the same, but the features that populate the rows of this matrix are now taken from the matrix F̄ instead of F. It should be noted that a “Tied-Hand Principle (THP)” is followed here.
- denote A and B as two data objects, Z as exogenous information, and m* as a mapping that transforms the original inputs by matching or weighting object subsets, transforming the object features, or selecting object features for use in an analysis comparing m(A) and m(B).
- a special case of the THP is also invoked in causal inference, where the matching of treated and control observations is performed without taking into account the response—i.e., the observation weights are calculated explicitly without taking into account the outcome in the alternative treatment class.
- a special case in the context of case control designs is that selecting observations on the outcome, Y i , is permitted provided that the users do not select on X also.
- the projection matrix ⁇ is chosen to maximize the equally weighted sum of the category discrimination and feature discrimination metrics.
- the Hooke-Jeeves algorithm is utilized for the optimization.
- the columns of S are normalized as a form of regularization to keep Γ at a computationally manageable size.
- optimizing projection matrix ⁇ to satisfy both criteria simultaneously is crucial, as optimizing ⁇ for the category discrimination alone may lead to a high variance by allowing collinearity in X, and optimizing ⁇ for the feature discrimination alone may lead to a high bias by lacking category discrimination. Optimizing both discriminations together may reduce mean-square errors overall. This point may be illustrated using an analysis of data comprised of 1,426 emails drawn from the broader Enron Corporation email corpus made public during the Federal Energy Regulatory Commission's investigation into the firm's bankruptcy. These emails are first coded into five broad topics: company business, personal communications, logistics arrangements, employment arrangements, and document editing.
- W′ is set as 2 and ⁇ is chosen by maximizing the category discrimination metric alone (i.e., selecting projections that create maximal contrasts in pairwise columns of X L ).
- FIG. 6A depicts a scatterplot of the resulting projections of F , with different symbols to represent the five categories.
- FIG. 6A reveals that these features discriminate between the categories (which is seen by the separation between different symbols). But, as is also apparent, the two dimensions are highly correlated which, like in the linear regression analysis, may lead to higher variance estimates. In the linear regression, given a fixed sample size, collinearity is an immutable fact of the fixed data set.
- FIG. 6B depicts another scatterplot of the resulting projections of F when ⁇ is optimized by maximizing only the feature discrimination.
- the columns of X L are uncorrelated but unfortunately do not discriminate between categories well (as can be seen by the points with different symbols overlapping).
- FIG. 6C depicts a scatterplot of the resulting projections of F generated by optimizing the sum of both category and feature discrimination.
- this result is well calibrated for estimating category proportions: the dimensions are discriminatory (which is seen by the symbol separation) and thus bias reducing, but still uncorrelated and thus variance reducing.
- the sum of the absolute residuals in estimating ⁇ U is improved from 0.55 to 0.30.
- the feature-extraction approach 500 may advantageously create a feature space that optimally discriminates between categories (i.e., maximizing the category discrimination) and contains as many non-redundant or independent features as possible (i.e., maximizing the feature discrimination).
- the document-feature matrix F (and thereby X L ) is adjusted such that the reliance on the assumption of Eq. (3) is reduced. Both concept drift and proportion divergence can be taken on by noting that if X U can be used in the estimator of the nonparametric approach, the linear regression yields ⁇ U exactly. Accordingly, the goal is to adjust F, and therefore X L , for the purpose of reducing the distance between X U and X L —i.e., ⁇ X U ⁇ X L ⁇ . In one implementation, this goal is achieved using a matching approach as further described below.
- the unlabeled set may contain neologisms (i.e., a token containing a string of characters not represented in the labeled set); no empirical method can address this problem directly.
- Second is the potentially differing empirical frequencies with which different words and patterns of words occur in the labeled and unlabeled sets; this issue can be addressed directly utilizing the approach described herein.
- matching, a technique originally developed to reduce model dependence in parametric causal inference, is employed here to reduce these frequency discrepancies.
- Matching is implemented to improve the “balance” between the labeled and unlabeled set—i.e., improving the degree to which the distributions of the labeled and unlabeled sets resemble each other.
- Matching may operate by constructing a matched set that substantially resembles the unlabeled set based on the labeled set.
- the term “substantially” means that the degree to which the distributions of the labeled and unlabeled sets resemble each other is 90%, 80%, or 70%.
- the distance between X U and X L (i.e., ∥X U −X L ∥), the concept drift, and the proportion divergence may all be reduced; consequently, the bias in estimating π U may be reduced.
- FIG. 7 depicts a flowchart of an exemplary matching method 700 .
- the unlabeled set is fixed such that the quantity of interest, ⁇ U , is not changed.
- for each document in the unlabeled set, three nearest neighbors, defined in Euclidean space, among the documents from the labeled set are identified (step 702 ).
- any documents in the labeled set that are closer than the median nearest neighbor among the three nearest neighbors of all unlabeled documents are captured (step 704 ).
- any labeled set documents that are not matched by these rules are pruned out and not used. This act of pruning is what makes the matching approach work in causal inference and, as applied here, reduces the concept drift and proportion divergence.
- the matched X L can then be computed and, subsequently, the linear regression described above may be applied to estimate π U .
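- One plausible Python rendering of the matching method 700 is sketched below; the use of scipy's Euclidean distances, the reading of the median rule, and the function name are assumptions of the sketch rather than the patented implementation.

```python
import numpy as np
from scipy.spatial.distance import cdist

def match_labeled_set(F_unlabeled, F_labeled, k=3):
    """Return indices of labeled documents retained by the matching rules of method 700."""
    dist = cdist(F_unlabeled, F_labeled)                   # Euclidean distances, unlabeled x labeled
    order = np.argsort(dist, axis=1)
    keep = set(order[:, :k].ravel().tolist())              # step 702: each unlabeled doc's k nearest labeled neighbors
    threshold = np.median(np.sort(dist, axis=1)[:, :k])    # median of those nearest-neighbor distances
    keep |= set(np.where((dist <= threshold).any(axis=0))[0].tolist())  # step 704: labeled docs closer than that median
    return sorted(keep)                                    # everything else is pruned out
```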
- Details of the matching approach are described in “Matching as Nonparametric Preprocessing for Reducing Model Dependence in Parametric Causal Inference,” Political Analysis , vol. 15, pp. 199-236 and “Causal Inference without Balance Checking: Coarsened Exact Matching,” Political Analysis , vol. 20, pp. 1-24, the entire disclosures of which are hereby incorporated by reference.
- the matching method 700 achieves the desired goal: in the 72 real-world data sets, matching reduces the divergence between X L and X U in 99.6% of the cases, and on average by 19.8%. Proportion divergence, which is not observed in real applications but can be measured here because the unlabeled sets are coded for evaluation, is reduced in 83.2% of the cases, and on average by 25%.
- the 72 corpora included the Enron email data set above, a set of 462 newspaper editorials about immigration (with 3,618 word stems and 5 categories), and a set of 1,938 blog posts about candidate Hillary Clinton from the 2008 presidential election (with 3,623 word stems and 7 categories).
- the 72 corpora included 69 separate Twitter data sets, each created by a different political candidate, private company, nonprofit, or government agency for their own business purposes, covering different time frames and categories; these data cover 150-4,200 word stems, 3-12 categories, and 700-4,000 tweets. All documents in each of the 72 corpora were labeled with a time stamp.
- a time point can be randomly selected and the previous 300 documents can be picked as the labeled set and the next 300 documents can be picked as the out-of-sample evaluation set (wrapping in time if necessary). For each corpus, this process can be repeated 100 times; as a result, 7,200 data sets in total can be provided. This procedure keeps the evaluation highly realistic, while also ensuring many types of proportion divergence, textual discrimination, and concept drift.
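- A sketch of this repeated sampling procedure (using the 300-document window and 100 repetitions noted above) might be as follows; the function name and the handling of timestamps are illustrative assumptions.

```python
import numpy as np

def time_based_splits(timestamps, window=300, repeats=100, seed=0):
    """Pick a random time point; the previous `window` documents form the labeled set and the
    next `window` documents form the out-of-sample evaluation set, wrapping in time if needed."""
    order = np.argsort(timestamps)
    n, rng, splits = len(order), np.random.default_rng(seed), []
    for _ in range(repeats):
        t = int(rng.integers(n))
        labeled = [order[(t - window + i) % n] for i in range(window)]
        evaluation = [order[(t + i) % n] for i in range(window)]
        splits.append((labeled, evaluation))
    return splits
```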
- FIG. 8A depicts the results from the ten classify-and-count methods and the unimproved estimation approach. The proportion of data sets with higher errors than the improved estimation approach is plotted vertically, against the proportion divergence (in quantiles) horizontally.
- the improved estimation approach outperforms the best classifier (the regularized multinomial regression) in the continuous feature space in 67% of the data sets and the average classifier in the continuous space in more than 80% of the corpora.
- the improved estimation approach outperforms the best discrete classifier in over 70% of the corpora. Performance is good across different levels of the category proportion divergence between the labeled and unlabeled sets.
- the improved approach's relative performance improves further when the proportion divergence is high, so that there are substantial changes between the labeled and unlabeled sets (which makes sense since the improved estimation approach is the only one that directly addresses concept drift).
- the improved approach achieved better performance on 96% of the sample corpora, with an average corpus-wise improvement of 34% (as depicted in FIG. 8B ).
- FIG. 8B depicts a more detailed analysis of the error in estimating ⁇ U (vertically) using the unimproved estimation approach compared to the improved estimation approach (horizontally, ordered by the size of the improvement).
- the length of each arrow represents the average improvement over the 100 separate analyses of subsets of each of the 72 data sets. In all but three cases, the arrows face downward; this indicates that on average the improved estimation approach almost always outperforms the unimproved one. Overall, a 35.7% average corpus-wide improvement over the unimproved approach is observed.
- estimations of the category proportions may be significantly improved using nonparametric estimation in conjunction with feature extraction and/or matching as described above.
- the improvement is achieved without the need for tuning or using any model-dependent methods of individual classifications.
- the improved estimation approach loosens the key assumptions of the unimproved approach while creating new numerical representations of each of the documents specifically tuned to reduce the mean-square errors of multi-category, nonparametric quantification.
- various approaches described herein may be profitably applied in other domains as well.
- the dimension-reduction approach 502 may be profitably used for data visualization.
- the improved estimation approach may be applied to find the two-dimensional projection that maximally discriminates between Democrats, Republicans and Independents, and simultaneously contains minimal redundancy.
- the relevant clusters may then become more visible, and may be paired with a data-clustering algorithm on the two-dimensional projection for additional visualization or analysis purposes.
- the dimension-reduction approach 502 may also be applied in the study of causality.
- investigators often use nonparametric approaches such as matching, but there is considerable interest in performing this matching in an optimal feature space, such as in the space of predicted values for the outcome under the control intervention (such as in “predictive mean matching”).
- matching may be performed on the features derived in various embodiments of the present invention.
- the resulting causal estimator may have especially good properties, since it takes into account the relationship between the covariates and outcome (leading to low bias) while taking into account several independent sources of information (leading to low variance).
- FIG. 9 illustrates an exemplary embodiment utilizing a suitably programmed general-purpose computer 900 .
- the computer includes a central processing unit (CPU) 902 , system memory 904 , and non-volatile mass storage devices 906 (such as, e.g., one or more hard disks and/or optical storage units).
- the computer 900 further includes a bidirectional system bus 908 over which the CPU 902 , memory 904 , and storage devices 906 communicate with each other and with internal or external input/output devices, such as traditional user interface components 910 (including, e.g., a screen, a keyboard, and a mouse) as well as a remote computer 912 and/or a remote storage device 914 via one or more networks 916 .
- the remote computer 912 and/or storage device 914 may transmit any document format (e.g., text, audio, and/or video data encapsulated in files, streams, database entries, or any other suitable data format) to the computer 900 using the network 916 .
- the system memory 904 contains instructions, conceptually illustrated as a group of modules, that control the operation of CPU 902 and its interaction with the other hardware components.
- An operating system 920 directs the execution of low-level, basic system functions such as memory allocation, file management and operation of mass storage devices 906 .
- one or more service applications provide the computational functionality required for estimating category proportions in a data set. For example, as illustrated, upon receiving a query, via the user interface 910 , from a user, the system may communicate with the storage devices 906 , remote computer 912 and/or remote storage device 914 to receive documents associated with the query. The retrieved data may then be electronically stored in the system memory 904 and/or storage devices 906 .
- a text-to-numbers module 922 then retrieves the stored documents and converts the text therein to numerical variables as described above; the computer 900 may include a database 924 (in the memory 904 and/or storage devices 906 ) relating the numerical variables to the corresponding text.
- the database 924 may be organized as a series of records each of which classifies a numerical variable as a particular text in the received documents, and which may contain pointers to the file or files encoding the numerical variable in a suitable manner, e.g., as an uncompressed binary file.
- the text-to-numbers module 922 cooperates with a filtering module 926 that preprocesses the documents to filter the documents and retain only the information of interest.
- the filtering module 926 may be designed to retain English documents relating to a specific topic only.
- the computer 900 may further include a converting module 928 to convert the text within the filtered and/or unfiltered documents to lowercase (so that “These” and “these” are recognized as the same) and remove all punctuation marks to improve ease of analysis of the documents.
- the computer 900 may include a mapping module 930 that maps a word to its stem to further reduce the language complexity as described above.
- the mapping module 930 may reduce “consist,” “consisted,” “consistency,” “consistent,” “consistently,” “consisting,” and “consists” to their stem—i.e., “consist.”
- the computer 900 may then implement a dichotomous module 932 to summarize the text preprocessed by the filtering module 926 , converting module 928 , and/or mapping module 930 as a set of dichotomous variables.
- the dichotomous variables may then be transmitted to a matrix-creation module 934 to create a document-feature matrix (such as F and F as described above).
- the computer 900 further includes a feature-extraction module 938 to perform feature extraction as described above.
- the feature-extraction module 938 may first replace the discrete feature space with a continuous feature space having document-level “global vectors” as described above, and cooperate with the matrix-creation module 934 to project the document-feature matrix created thereby onto a lower-dimensional subspace matrix using a projection matrix.
- the feature-extraction module 938 may optimize the projection matrix by maximizing both category discrimination and feature discrimination as described above.
- the computer 900 may include a matching module 940 to perform matching as described above.
- the matching module 940 may construct a matched set that closely resembles the unlabeled set based on the labeled set such that the distance ∥X U −X L ∥ is reduced.
- the output from the feature-extraction module 938 and/or matching module 940 may be provided to a computational module 936 to compute various feature vectors and apply the linear regression as described above to estimate the category proportions of the documents retrieved from the storage devices 906 , remote computer 912 and/or remote storage device 914 .
- the proportion estimation may then be provided to the user via the user interface 910 .
- embodiments of the computer 900 implementing the feature-extraction module 938 and/or matching module 940 may advantageously estimate the proportion of retrieved documents in each of a plurality of labeled categories with improved accuracy compared to conventional approaches.
- the network 916 may include a wired or wireless local-area network (LAN), wide-area network (WAN) and/or other types of networks.
- When used in a LAN networking environment, computers may be connected to the LAN through a network interface or adapter.
- When used in a WAN networking environment, computers typically include a modem or other communication mechanism. Modems may be internal or external, and may be connected to the system bus via the user-input interface, or other appropriate mechanism. Computers may be connected over the Internet, an Intranet, Extranet, Ethernet, or any other system that provides communications.
- communications protocols may include TCP/IP, UDP, or OSI, for example.
- communications protocols may include the cellular telecommunications infrastructure, WiFi or other 802.11 protocol, Bluetooth, Zigbee, IrDa or other suitable protocol.
- components of the system may communicate through a combination of wired or wireless paths.
- any suitable programming language may be used to implement without undue experimentation the analytical functions described within.
- the programming language used may include assembly language, Ada, APL, Basic, C, C++, C*, COBOL, dBase, Forth, FORTRAN, Java, Modula-2, Pascal, Prolog, Python, REXX, and/or JavaScript for example.
- any number of different programming languages may be utilized as is necessary or desirable.
- the representative computer 900 may facilitate more accurate performance for applications relating to item categorization and classification, interpretation (e.g., of the content of the items and what this suggests), and/or retrieval of computational and real objects subject to automated analysis.
- documents such as the Enron email data set, newspaper editorials, and/or blog posts, as described above may be first retrieved by the computer 900 ; the computer 900 can then search the documents to filter out information that is not of interest.
- the computer 900 may perform various computations and/or analyses as described above to estimate the proportion of documents in each of the categories. The estimation is particularly useful in social science, where aggregate generalizations about populations of documents are of more interest than individual classifications.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- Evolutionary Computation (AREA)
- Software Systems (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Life Sciences & Earth Sciences (AREA)
- Mathematical Physics (AREA)
- Computing Systems (AREA)
- Computational Linguistics (AREA)
- Medical Informatics (AREA)
- Health & Medical Sciences (AREA)
- General Health & Medical Sciences (AREA)
- Audiology, Speech & Language Pathology (AREA)
- Multimedia (AREA)
- Databases & Information Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
Description
The vector of proportion πU≡{π1 U, . . . , πc U} representing the quantities of interest forms a simplex—i.e., πc U∈[0,1] for all c, and Σc=1 Cπc U=1. The analogous (but observed) category proportions for the labeled set πL can be similarly defined.
S w U=Σc=1 C X wc Uπc U ,∀w Eq. (1)
(or equivalently in a matrix form: SU=XUπU).
Eq. (1) may then be solved for the quantity of interest as in a linear regression: πU=(XU′XU)−1XU′SU. But because the “regressor” XU is unobserved, πU may not be directly computed this way. In various embodiments, πU is estimated by replacing XU′ and XU with XL′ and XL, respectively. The estimate π̂U can then be computed by:
π̂U=(X L′ X L)−1 X L′ S U  Eq. (2)
(or any modified version of this expression so long as it explicitly preserves the simplex constraint). Accordingly, the nonparametric approach described herein may provide estimations of the category proportion of the unlabeled set without classifying each document therein.
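As a minimal sketch (not the patented implementation), Eq. (2) can be computed as an ordinary least-squares fit of SU on the columns of XL; the clipping-and-renormalizing step shown here is only one simple way of preserving the simplex constraint and is an assumption of the sketch.

```python
import numpy as np

def estimate_category_proportions(X_L, S_U):
    """Eq. (2): regress the unlabeled feature vector S^U on the labeled conditional feature matrix X^L."""
    pi_hat, *_ = np.linalg.lstsq(X_L, S_U, rcond=None)   # (X_L' X_L)^(-1) X_L' S_U
    pi_hat = np.clip(pi_hat, 0.0, None)                  # keep each estimated proportion nonnegative
    return pi_hat / pi_hat.sum()                         # renormalize so the estimates sum to one
```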
This assumption, however, may be violated when there is a concept drift—e.g., if the labeled set is coded at one time, and the unlabeled set is collected at another time or in another place where the meanings of certain terms differ from those in the labeled set. Concept drift may be overcome using a matching approach as further described below.
This indicates that, unlike a classic errors-in-variables linear regression model, collecting more data using nonparametric estimation as described herein may reduce the estimation bias and variance. It also indicates that a finite sample bias (rather than consistency results) may be focused on in order to improve the estimation results.
S w U =X w2 U+(X w1 U −X w2 U)π1 U Eq. (5)
If XL=XU, the above expression equals π1 U. However, due to the sampling error, the realized sample value of XL may differ from the unobserved true value XU. By the assumption of Eq. (3), Xwc L satisfies: Xwc L=Xwc U+ϵwc, where ϵwc is a random variable with the mean zero and variance inversely proportional to Nc. This enables us to write the estimator in the nonparametric estimation approach in terms of XU, the true unlabeled set category proportion π1 U, and the sample category size Nc L. Taking the expectation of this quantity yields:
The consistency property of nonparametric estimation can be seen here: as the error in measuring XU with XL goes to zero or NL goes to infinity, the second term in the expectation is 0 (because ϵw2∝1/Nc L→0), while the first term converges to π1 U. In the presence of measurement errors, the bias is a function of the difference in the true category proportions, Xw1 U−Xw2 U, and the combined error variance Var(ϵw1−ϵw2)—both of which are components of the lack of textual discrimination. Further intuition can be obtained by an approximation using a first-order Taylor polynomial:
This expression suggests four insights: first, as the textual discrimination (Xw1 U−Xw2 U)2 increases relative to the variance of the error terms, the bias approaches 0. In other words, nonparametric estimation works better when the language of the documents across categories is distinct. Because the numerical summaries of the text clearly differ across categories, their observed values are not obscured by measurement error. Second, adding more informative numerical representations of the text to increase W (but with a fixed n) has an indeterminate impact on the bias. While more informative numerical summaries of the text can increase the sum in the denominator, they may increase the overall bias if the error variance is high relative to the discriminatory power. In other words, when the discriminatory power of the numerical summaries of the text is low, the bias generated by nonparametric estimation as described herein is dominated by the relationship between the error variances. Third, since the elements of XL are simple means across documents assumed to be independent, the variance of the measurement error terms is simply Var(ϵwc)=σwc 2/Nc L, which declines as the labeled set category sizes increase. Finally, by assuming independence of the measurement errors across categories (i.e., Cov(ϵw1,ϵw2)=0), the bias in the nonparametric estimation is minimized when the following relationship holds between the labeled and unlabeled set category proportions:
$m^* = \arg\max_m f(m, A, Z)$, or
$m^* = \arg\max_m f(m, B, Z)$, but not
$m^* = \arg\max_m f(m, A, B, Z)$.
$$\text{Category Discrimination} \propto \sum_{c<c'} \sum_{w=1}^{W} \left| X_{wc}^L - X_{wc'}^L \right|, \text{ and}$$
$$\text{Feature Discrimination} \propto \sum_{c<c'} \sum_{w'<w} \Big|\, \left| X_{wc}^L - X_{wc'}^L \right| - \left| X_{w'c}^L - X_{w'c'}^L \right| \,\Big|,$$
where c and c′ are two different categories in the set of categories C. In some embodiments, the projection matrix Γ is chosen to maximize the equally weighted sum of the category-discrimination and feature-discrimination metrics. In one embodiment, the Hooke-Jeeves algorithm is utilized for the optimization. In addition, the columns of S are normalized as a form of regularization, to keep Γ at a computationally manageable size.
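As a concrete but purely illustrative sketch of these metrics, the following Python computes the category- and feature-discrimination scores for a candidate projection Γ and improves Γ with a simple random local search. The function names are hypothetical, and the crude search merely stands in for a pattern-search optimizer such as Hooke-Jeeves.

```python
# Illustrative sketch (hypothetical names, simplified optimizer): score a
# candidate projection Gamma by the equally weighted sum of category and
# feature discrimination, then search for a Gamma that increases that sum.
import numpy as np

def discrimination_score(S, labels, Gamma, categories):
    """S: (n_docs, V) document-feature matrix (columns normalized upstream).
    Gamma: (V, W) projection. Returns category + feature discrimination."""
    F = S @ Gamma                                   # projected features, (n_docs, W)
    X = np.stack([F[labels == c].mean(axis=0) for c in categories])  # (C, W)
    C, W = X.shape
    cat_disc, feat_disc = 0.0, 0.0
    for c in range(C):
        for c2 in range(c + 1, C):
            diff = np.abs(X[c] - X[c2])             # per-feature |X_wc - X_wc'|
            cat_disc += diff.sum()
            for w in range(W):
                for w2 in range(w + 1, W):
                    feat_disc += abs(diff[w] - diff[w2])
    return cat_disc + feat_disc                     # equally weighted sum

def search_projection(S, labels, categories, W, iters=200, step=0.1, seed=0):
    """Crude random local search over Gamma; a stand-in for Hooke-Jeeves."""
    rng = np.random.default_rng(seed)
    V = S.shape[1]
    Gamma = rng.normal(size=(V, W))
    best = discrimination_score(S, labels, Gamma, categories)
    for _ in range(iters):
        candidate = Gamma + step * rng.normal(size=(V, W))
        score = discrimination_score(S, labels, candidate, categories)
        if score > best:
            Gamma, best = candidate, score
    return Gamma, best
```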
The empirical behavior of these improvement approaches is illustrated below.
Estimation Results Utilizing the Improvement Approaches
Claims (40)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/415,065 US11514233B2 (en) | 2016-11-22 | 2017-11-16 | Automated nonparametric content analysis for information management and retrieval |
Applications Claiming Priority (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US201662425131P | 2016-11-22 | 2016-11-22 | |
US16/415,065 US11514233B2 (en) | 2016-11-22 | 2017-11-16 | Automated nonparametric content analysis for information management and retrieval |
PCT/US2017/061983 WO2018098009A1 (en) | 2016-11-22 | 2017-11-16 | Improved automated nonparametric content analysis for information management and retrieval |
Publications (2)
Publication Number | Publication Date |
---|---|
US20190377784A1 (en) | 2019-12-12
US11514233B2 (en) | 2022-11-29
Family
ID=62195663
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US16/415,065 Active 2040-02-06 US11514233B2 (en) | 2016-11-22 | 2017-11-16 | Automated nonparametric content analysis for information management and retrieval |
Country Status (2)
Country | Link |
---|---|
US (1) | US11514233B2 (en) |
WO (1) | WO2018098009A1 (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US11087088B2 (en) * | 2018-09-25 | 2021-08-10 | Accenture Global Solutions Limited | Automated and optimal encoding of text data features for machine learning models |
US11941497B2 (en) * | 2020-09-30 | 2024-03-26 | Alteryx, Inc. | System and method of operationalizing automated feature engineering |
Patent Citations (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6490573B1 (en) | 2000-04-11 | 2002-12-03 | Philip Chidi Njemanze | Neural network for modeling ecological and biological systems |
US20020022956A1 (en) | 2000-05-25 | 2002-02-21 | Igor Ukrainczyk | System and method for automatically classifying text |
US7024400B2 (en) | 2001-05-08 | 2006-04-04 | Sunflare Co., Ltd. | Differential LSI space-based probabilistic document classifier |
US20120215784A1 (en) * | 2007-03-20 | 2012-08-23 | Gary King | System for estimating a distribution of message content categories in source data |
US7890438B2 (en) | 2007-12-12 | 2011-02-15 | Xerox Corporation | Stacked generalization learning for document annotation |
US20110085728A1 (en) * | 2009-10-08 | 2011-04-14 | Yuli Gao | Detecting near duplicate images |
US20140012855A1 (en) | 2012-05-25 | 2014-01-09 | Crimson Hexagon, Inc. | Systems and Methods for Calculating Category Proportions |
Non-Patent Citations (1)
Title |
---|
International Search Report and Written Opinion for International Application No. PCT/US2017/061983 dated Feb. 6, 2018, 8 pages. |
Also Published As
Publication number | Publication date |
---|---|
US20190377784A1 (en) | 2019-12-12 |
WO2018098009A1 (en) | 2018-05-31 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11640494B1 (en) | Systems and methods for construction, maintenance, and improvement of knowledge representations | |
US20230351212A1 (en) | Semi-supervised method and apparatus for public opinion text analysis | |
Do et al. | Multiview deep learning for predicting twitter users' location | |
Jerzak et al. | An improved method of automated nonparametric content analysis for social science | |
US8438162B2 (en) | Method and apparatus for selecting clusterings to classify a predetermined data set | |
US11216512B2 (en) | Accessible machine learning backends | |
US10956825B1 (en) | Distributable event prediction and machine learning recognition system | |
CN112559747A (en) | Event classification processing method and device, electronic equipment and storage medium | |
JP2024503036A (en) | Methods and systems for improved deep learning models | |
US20220309292A1 (en) | Growing labels from semi-supervised learning | |
Datta et al. | Regularized Bayesian transfer learning for population-level etiological distributions | |
Mena et al. | An overview of inference methods in probabilistic classifier chains for multilabel classification | |
US11514233B2 (en) | Automated nonparametric content analysis for information management and retrieval | |
US11416712B1 (en) | Tabular data generation with attention for machine learning model training system | |
Isoni | Machine learning for the web | |
CN117546160A (en) | Automated data hierarchy extraction and prediction using machine learning models | |
CN108304568B (en) | Real estate public expectation big data processing method and system | |
US11977952B1 (en) | Apparatus and a method for generating a confidence score associated with a scanned label | |
US11868859B1 (en) | Systems and methods for data structure generation based on outlier clustering | |
US11868313B1 (en) | Apparatus and method for generating an article | |
US11895141B1 (en) | Apparatus and method for analyzing organization digital security | |
Chitta | Kernel-based clustering of big data | |
US20230162518A1 (en) | Systems for Generating Indications of Relationships between Electronic Documents | |
US20230297963A1 (en) | Apparatus and method of opportunity classification | |
CN112463964B (en) | Text classification and model training method, device, equipment and storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
FEPP | Fee payment procedure |
Free format text: ENTITY STATUS SET TO UNDISCOUNTED (ORIGINAL EVENT CODE: BIG.); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY |
|
FEPP | Fee payment procedure |
Free format text: ENTITY STATUS SET TO SMALL (ORIGINAL EVENT CODE: SMAL); ENTITY STATUS OF PATENT OWNER: SMALL ENTITY |
|
AS | Assignment |
Owner name: PRESIDENT AND FELLOWS OF HARVARD COLLEGE, MASSACHUSETTS Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNORS:JERZAK, CONNOR T.;KING, GARY;STREZHNEV, ANTON;REEL/FRAME:050255/0174 Effective date: 20180525 |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NON FINAL ACTION MAILED |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: RESPONSE TO NON-FINAL OFFICE ACTION ENTERED AND FORWARDED TO EXAMINER |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: AWAITING TC RESP., ISSUE FEE NOT PAID |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: NOTICE OF ALLOWANCE MAILED -- APPLICATION RECEIVED IN OFFICE OF PUBLICATIONS |
|
STPP | Information on status: patent application and granting procedure in general |
Free format text: PUBLICATIONS -- ISSUE FEE PAYMENT VERIFIED |
|
STCF | Information on status: patent grant |
Free format text: PATENTED CASE |