WO2011071833A1 - Method and system for accelerated data quality enhancement - Google Patents
- Publication number: WO2011071833A1
- Application: PCT/US2010/059126
- Authority: WIPO (PCT)
- Prior art keywords: data, candidate, functional dependencies, conditional functional, data quality
Classifications
- G06F16/2465: Query processing support for facilitating data mining operations in structured databases
- G06F16/24565: Triggers; Constraints
Definitions
- the invention relates generally to automated data cleansing, and more specifically to automated data quality enhancement through the application of conditional functional dependencies.
- Data quality enhancement is generally an automated process, wherein a computer screens through all of the data in an electronic storage database and automatically flags or deletes data values that appear to be erroneous.
- The critical task in data quality enhancement is the identification of rules that validate, cleanse, and govern poor-quality data.
- A sufficient rule, for example, would be that any entry for a district where money is being spent should also appear in a list of all the legislative districts in the United States.
- Data quality rules can be identified using either manual or automated development. Manual development involves a data or business analyst leveraging the input of a subject matter expert (SME), or utilizing a data profiling tool.
- SMEs are people who understand the characteristics of data sets that encompass information within their field of expertise. For example, a data analyst may leverage an SME in the utilities field to learn that meters have serial numbers that are usually recorded incorrectly, and are connected to transformers with serial numbers that are related to the serial numbers of the meters. The analyst would then be able to take in this information and create a data quality rule that screens for serial numbers in a data set that do not fit the pattern described.
- Data profiling tools are computer programs that examine data of interest to report statistics such as frequency of a value, percentage of overlap between two columns, and other relationships and values inherent in the data. Examples of data profiling tools include TS Discovery, Informatica IDE/IDQ, and Oracle Data Integrator. The information gleaned from a data profiling tool can indicate potential quality problems. Analysts use the information they obtain from the use of a data profiling tool to manually create rules that can enhance the quality of the examined data.
- Some profilers, such as Informatica Data Explorer, can automatically infer basic data quality rules on their own. For example, they can set a rule specifying which columns cannot have null values. However, this is a particularly simple data quality rule. Null-value entries are the easiest type of error to detect because they are clearly indicative of a data entry oversight and they do not have values equivalent to any possible correct entry.
- Other profilers, such as TS Discovery and Informatica Data Quality, provide out-of-the-box rules for name and address validation. These rules are also somewhat rudimentary because addresses are characteristically regimented, are a quintessential element of large commercial databases, and follow tight patterns. Available data profilers do not contain rules that target more complex, or more client-specific, quality problems.
- Conditional functional dependencies (CFDs) are rules that enforce patterns of semantically related constants.
- Figure 1 provides an example of a simple CFD.
- The input data points 101 and 102 have three attributes: a country code (CC), a state (S), and an area code (AC).
- A data set made up of such data points could be part of a database keeping track of the locations of an enterprise's customers.
- CFD 100 checks data based on the fact that if a country code is 01 for the United States, and an area code is 408, then the accompanying state should be California. Applying data input 101 to CFD 100 will result in a passing output value 103, whereas applying data input 102 to CFD 100 will result in a failing output value 104.
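The CFD of Figure 1 can be sketched as a small check. The helper below is illustrative only; the attribute names and pattern values are taken from the example above.

```python
def check_cfd(record, antecedent, consequent):
    """Return 'pass', 'fail', or 'irrelevant' for one record against a CFD."""
    # If the antecedent pattern (CC=01, AC=408) does not apply, the rule is moot.
    if any(record.get(attr) != val for attr, val in antecedent.items()):
        return "irrelevant"
    # The antecedent matched, so the consequent (S=CA) must hold.
    ok = all(record.get(attr) == val for attr, val in consequent.items())
    return "pass" if ok else "fail"

antecedent = {"CC": "01", "AC": "408"}
consequent = {"S": "CA"}

record_101 = {"CC": "01", "AC": "408", "S": "CA"}  # like data input 101
record_102 = {"CC": "01", "AC": "408", "S": "NV"}  # like data input 102

result_101 = check_cfd(record_101, antecedent, consequent)  # "pass"
result_102 = check_cfd(record_102, antecedent, consequent)  # "fail"
```

A record whose country code or area code does not match the pattern tuple is simply irrelevant to this rule rather than a failure.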
- a computer-implemented method for producing data quality rules for a data set is provided.
- A set of candidate conditional functional dependencies is generated from a set of candidate seeds by using an ontology of said data set.
- Each candidate seed comprises a subset of attributes, drawn from the set of all attributes of said data set, that have a predetermined degree of separation in said ontology.
- the candidate conditional functional dependencies are applied individually to the data set to obtain a set of corresponding result values for each of the candidate conditional functional dependencies.
- a relevant set of said candidate conditional functional dependencies are selected to be used as said data quality rules for said data set.
- Figure 1 illustrates a conditional functional dependency operating on input data.
- Figure 2 illustrates a method for producing data quality rules for a data set that is in accordance with the present invention.
- Figure 3 illustrates a system for producing data quality rules for a data set that is in accordance with the present invention.
- Figure 4 illustrates a graphical user interface data input that is in accordance with the present invention.
- Figure 5 illustrates a graphical user interface rule display that is in accordance with the present invention.
- Figure 6 illustrates a fully connected graph for an attribute combination.
- Embodiments of the present invention solve the technical problem of identifying, collecting, and managing rules that improve poor-quality data on enterprise initiatives ranging from data governance to business intelligence. Embodiments of the present invention also significantly reduce the amount of manual effort required to collect data quality rules on enterprise initiatives such as master data management, business intelligence, and others. Moreover, embodiments of the present invention also support other business needs, such as ensuring that one's data conforms to predefined business logic.
- Embodiments of the present invention solve the problems described above by automatically discovering actionable data quality rules and by providing an intuitive rule browser to manage these rules.
- Embodiments of the present invention do not suffer from the computational complexity of prior art methods and are capable of dealing with noisy data.
- Embodiments of the present invention are able to provide data quality enhancement rules targeted at specific client data-cleanliness issues without the need for costly access to, and assimilation of, SME knowledge of data characteristics.
- Figure 2 displays a method for producing data quality rules for a data set that is in accordance with the present invention.
- an ontology of the data set is available which indicates which attributes in the data set are related.
- For example, area code and state would be directly related, whereas a different attribute, such as a customer's first name, may not be related to area code at all.
- a set of candidate CFDs is generated.
- the candidate CFDs are based on a set of candidate seeds that are subsets of all the attributes in the data set.
- a candidate seed could be a combination of the country code and the area code.
- the attributes selected for the candidate seeds would have a certain degree of separation in the ontology of the data set. For example, attributes that are within three links in the ontology could be selected as groups of attributes for the candidate seeds.
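One way to realize this seed-selection step is a breadth-first walk of the ontology graph. The adjacency list and the three-link limit below are assumptions for illustration, not the patent's required representation.

```python
from collections import deque
from itertools import combinations

# Hypothetical ontology: which attributes are related to which.
ontology = {
    "country_code": ["state"],
    "state": ["country_code", "area_code"],
    "area_code": ["state"],
    "first_name": [],
}

def within_separation(start, max_links):
    """Attributes reachable from `start` within `max_links` ontology links."""
    seen, frontier = {start}, deque([(start, 0)])
    while frontier:
        node, dist = frontier.popleft()
        if dist == max_links:
            continue
        for nbr in ontology.get(node, []):
            if nbr not in seen:
                seen.add(nbr)
                frontier.append((nbr, dist + 1))
    return seen

def candidate_seeds(max_links, size):
    """All size-`size` attribute subsets whose members lie within `max_links` links."""
    seeds = set()
    for attr in ontology:
        related = sorted(within_separation(attr, max_links))
        seeds.update(combinations(related, size))
    return seeds

seeds = candidate_seeds(max_links=3, size=2)
# country_code and area_code are within three links, so they form a seed;
# first_name is unrelated in this ontology and never pairs with anything.
```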
- The number of candidate CFDs, the number of conditions in each CFD, and the number of attributes in each CFD are determined by a user prior to beginning to practice the invention.
- The number of conditions in a CFD determines how many constraints are placed on the attributes that form the CFD. In keeping with our example, the rule "if area code is 408; then state must be California" would count as a single condition. All three of these variables would have a direct impact on the time it would take for the automated portion of the method to execute. Selecting a lower number for any of these values would trade off the overall efficiency of the resultant set of data enhancement rules for a faster convergence of the candidate CFDs.
- the candidate CFDs would be applied individually to data in the data set. In a specific embodiment of the invention, this applying would be done in data segments having a predetermined length. For example, the CFDs could be applied to a data segment having a length of one-thousand data points. Embodiments that took this approach would save a significant amount of time because it would take less time to apply the rules to a data segment as compared to the entire data set.
- the size of the data segment would be set by a scan period that was determined by a user.
- the purpose of applying the CFDs to the data would be to obtain a set of corresponding result values for each of the CFDs.
- the set of corresponding result values would generally be equivalent in size to the number of data points said CFDs were applied to.
- the set of result values would indicate if the rule matched the data point, if the rule did not match but did not conflict with the data point, and if the rule conflicted with the data point.
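A sketch of applying one candidate CFD to a data segment and tallying the three kinds of result values described above; the attribute names follow the running example and are illustrative.

```python
def apply_to_segment(records, antecedent, consequent):
    """Tally match / irrelevant / conflict result values for one CFD."""
    counts = {"match": 0, "irrelevant": 0, "conflict": 0}
    for rec in records:
        if any(rec.get(a) != v for a, v in antecedent.items()):
            counts["irrelevant"] += 1   # rule does not apply to this data point
        elif all(rec.get(a) == v for a, v in consequent.items()):
            counts["match"] += 1        # rule matched the data point
        else:
            counts["conflict"] += 1     # rule conflicted with the data point
    return counts

segment = [
    {"CC": "01", "AC": "408", "S": "CA"},
    {"CC": "01", "AC": "212", "S": "NY"},
    {"CC": "01", "AC": "408", "S": "NV"},
]
counts = apply_to_segment(segment, {"CC": "01", "AC": "408"}, {"S": "CA"})
# counts == {"match": 1, "irrelevant": 1, "conflict": 1}
```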
- the candidate CFDs are refined individually if they have a result signature that does not meet a predetermined expectation.
- the result signature would be a list of the result values that came about from applying the individual CFDs to the data. The refining of the individual candidate CFDs would be done such that they would more closely meet the predetermined expectation if reapplied to the data.
- the refining could be achieved through the elimination of a high entropy attribute from the candidate CFD.
- The highest-entropy attribute would be the attribute in the candidate CFD that took on the most values throughout the data set. Selecting this attribute for elimination would be effective in refining the candidate CFD because it would statistically be the best attribute to eliminate in order to make the candidate CFD less restrictive. In the example discussed above with three attributes, this would most likely result in the elimination of the area code attribute in any candidate CFD that did not meet the predetermined expectation.
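The refinement step above can be sketched as follows: compute each attribute's entropy over the data examined so far and drop the attribute with the highest entropy. The records and attribute names are hypothetical.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (in bits) of a list of attribute values."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def drop_highest_entropy(cfd_attrs, records):
    """Refine a candidate CFD by removing its most value-varied attribute."""
    worst = max(cfd_attrs, key=lambda a: entropy([r[a] for r in records]))
    return [a for a in cfd_attrs if a != worst]

records = [
    {"CC": "01", "AC": "408", "S": "CA"},
    {"CC": "01", "AC": "650", "S": "CA"},
    {"CC": "01", "AC": "212", "S": "NY"},
    {"CC": "01", "AC": "702", "S": "NV"},
]
# AC takes four distinct values, so it has the highest entropy and is dropped.
refined = drop_highest_entropy(["CC", "AC", "S"], records)  # ["CC", "S"]
```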
- the predetermined expectation would be set by a coverage estimate and a predetermined error estimate.
- the coverage estimate would be an estimate of how many different data points the candidate CFD would apply to, meaning that the attributes and values on which the candidate CFD operated on were present in the data point.
- a candidate CFD with the condition "if area code is 408; then state must be California" would cover any data point where the area code attribute was 408.
- the error estimate would be an estimate of how many different data points would fail a candidate CFD that expressed a desired relationship in the data. For example, an SME might provide the information that 5% of the area codes in a database were probably incorrect, and that such errors were random.
- the error estimate would be 5%, and a data point with an area code of 408 and a state besides California would count as one data point towards the total error content of the result signature. If there were five errors in a result signature for a one-hundred data point data segment, then the error estimate would match exactly.
- Embodiments that utilize error estimates are able to handle noisy data because they take account of potential errors. Without taking an error estimate into account, a result value indicating that the rule did not fit would not carry any information regarding whether the rule itself was erroneous.
- the coverage estimate and error estimate could be adjusted by a user.
- In step 203, the applying and refining of the candidate CFDs terminates when a candidate CFD has reached a quiescent state.
- A quiescent state is defined as the point when a candidate CFD has been applied without refinement to a series of data points that contain stable data. Data stability can be determined either by reference to the swing of the values of particular attributes relative to a known variance, or it could be set by tolerances obtained from an SME. The number of data points in the aforementioned series could be set by a window period value, and in another specific embodiment of the invention the window period could be adjusted by a user. Since this window period and the data segments in step 201 are of different sizes, there may be a lag time between when step 202 produces a meets-expectation result and when step 203 executes and determines whether a CFD under test has reached quiescence.
- different candidate CFDs could be at different places within Figure 2. Some candidate CFDs could reach quiescence rapidly and be ready to move on to step 204, while others are still circling back through step 201. As mentioned before, this approach would save precious computing time because CFDs that had already converged would not be reapplied to the data.
- a relevant set of said candidate CFDs is selected.
- The relevant set of candidate CFDs will be the data quality rules for the data set. Relevance is determined mainly by the level of coverage of any specific candidate CFD. Coverage, described above, refers to how many data points a candidate CFD applies to.
- Relevance would also be set by a goodness-of-fit statistical analysis of the stable candidate CFDs. The goodness-of-fit analysis for relevance would include a detected error rate and a degree of coverage for the CFDs. The most relevant CFDs under the goodness-of-fit analysis would be those with the highest level of coverage and with detected error rates closest to the estimated error rate.
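A minimal sketch of the relevance ordering described above, assuming coverage and detected error rate are already available per rule; the rule names and numbers are invented.

```python
# Rank by coverage first; break ties by how close the detected error rate
# is to the SME-estimated error rate (closer is better).
def relevance_key(stats, estimated_error):
    return (stats["coverage"], -abs(stats["error"] - estimated_error))

cfds = {
    "rule_a": {"coverage": 0.60, "error": 0.05},
    "rule_b": {"coverage": 0.60, "error": 0.20},
    "rule_c": {"coverage": 0.10, "error": 0.05},
}
ranked = sorted(cfds, key=lambda r: relevance_key(cfds[r], 0.05), reverse=True)
# ranked == ["rule_a", "rule_b", "rule_c"]: rule_a ties rule_b on coverage but
# its detected error rate matches the 5% estimate exactly.
```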
- the data quality rules could be sorted automatically. This would be important because in many complex situations the number of stable candidate CFDs would be very high and their analysis would be time consuming.
- The candidate CFDs in the relevant set could be ranked according to an interestingness factor. The ranking would be done so that a person evaluating the relevant CFDs would be aided in directing their attention.
- the interestingness factor would increase as the portion of a data set containing one of the values on which the candidate CFD was based decreased.
- the data quality rules could be grouped together into subsets of rules that addressed similar data quality problems.
- statistics such as connections between rules, conflicts between rules, and percentage of data covered by each rule could be provided along with the rules.
- the relevant set of candidate CFDs produced by the process would be applied to enhance the data quality of a data set.
- the candidate CFDs which at this point would be the data enhancement rules, would be applied to all of the data in the data set. Data points that did not align with the data enhancement rules would either be flagged for later attention or could be deleted or modified to a best guess of their proper value, thereby enhancing the data quality of the data set.
- the data enhancement rules generated in accordance with the present invention could also be applied in embodiments of the present invention to enhance the data quality of a related group of data sets.
- The rules could be applied to any number of data sets with similar content, meaning that the data in the related data sets had a characteristic similar to that of the original data set on which the method determined the data quality rules. This process could be adapted for data sets that were stored externally by exporting the relevant rules to a data quality product or an external database.
- the data quality products to which the rules could be exported could be TS Discovery, Informatica IDE/IDQ and Oracle Data Integrator.
- Figure 3 displays a computer system for the development of data quality rules that is in accordance with the present invention.
- Rule repository 302 is used for storing data quality rules.
- Rule repository 302 is capable of delivering the rules to a plug-in data exchanger such as plug-in 303.
- Plug-in 303 can be added to the system to allow exporting the data quality rules to another system in a compatible fashion.
- Plug-in 303 could comprise a set of plug-ins, each assuring compatibility with a different external system. Such an embodiment would be desirable because the rules could then be adapted and applied to any number of external systems along data line 304.
- The external systems capable of receiving the data quality rules could be a system running a data quality product, an external database management system, or any other system to which data quality rules may be applied.
- the external system could be one running a data quality product such as TS Discovery, Informatica IDE/IDQ and Oracle Data Integrator.
- Rule repository 302 obtains the data quality rules from data quality rules discovery engine 301.
- the data quality rules discovery engine 301 is capable of receiving a data set, an ontology of the data set, and a set of rule generation parameters from user interface 300.
- User interface 300 is also capable of outputting the data quality rules that are discovered by data quality rules discovery engine 301 for external use.
- Data quality rules discovery engine 301 forms a set of candidate CFDs based on the ontology of the data set and refines those rules iteratively based on observation of how the rules function when applied to the data.
- Data quality rules discovery engine 301 terminates the iterative refining process when the candidate CFDs reach a quiescent state and become data quality rules.
- a user interface such as user interface 300, could further comprise a graphical user interface (GUI).
- GUI graphical user interface
- such a GUI could be capable of receiving rule generation parameters, an address of a data set, an address of related data sets, and an address of an ontology from a user.
- the rule generation parameters could also be adjusted by a user through the use of the GUI.
- the GUI could also be capable of displaying the rules that were generated by the rule discovery engine to the user such that the user could double check and optionally revise the displayed rules.
- the rules could also be displayed by the GUI with information regarding the rules such as the portion of the data that the rule applied to and the detected error rate of the data when applied to the rule.
- Figure 4 displays an example of an input display of a GUI in accordance with the present invention.
- GUI 400 is capable of displaying information to and receiving information from a user.
- Display window 401 contains several selectors.
- The selectors could include:
  - a max-number-of-rules selector 402 capable of accepting and setting the number of candidate CFDs;
  - a max-number-of-conditions selector 403 capable of accepting and setting the maximum number of conditions in each of the candidate CFDs;
  - a max-number-of-seeds selector 404 capable of accepting and setting the maximum number of candidate seeds in each of the candidate CFDs;
  - a coverage selector 405 capable of accepting and setting the desired coverage of any particular CFD as applied to the data set;
  - an error rate selector 406 capable of accepting and setting the expected error rate of any particular CFD as applied to the data set;
  - a frequency selector 407 capable of accepting and setting the scan period for each application of any particular CFD to the data set; and
  - a window size selector 408 capable of accepting and setting the amount of data that needs to be evaluated before the rules will be evaluated for quiescence.
- Values selected by the selectors
- Figure 5 displays an example of an output display of a GUI in accordance with the present invention.
- GUI 500 is capable of displaying information to and receiving information from a user.
- Display window 501 is capable of enabling both business and technical users to understand, modify, and manage discovered rules by reporting key information such as
- Rule display pane 503 is capable of displaying a summary of each rule as well as important statistics of the rule.
- Rule list 502 is capable of displaying the rules in an organized and modifiable format with a summary of the statistics of each rule displayed alongside.
- Details pane 504 is capable of displaying more in-depth information regarding a selected rule.
- CFinder discovers CFDs from a relation of interest through the following steps. CFinder first generates an initial set of candidate CFDs.
- CFinder refines each CFD by removing extraneous (or invalid) conditions, and stops refining a CFD when it becomes stable. Finally, CFinder filters weak (and subsumed) CFDs, and generalizes the remaining ones to increase their applicability.
- Given a relation R, CFinder generates candidate CFDs, i.e. rules of the form (X -> Y, Tp), where X and Y are attributes from R and Tp is a pattern tuple which consists of values from these attributes.
- CFinder first generates all attribute combinations of size N+1 from R, where N is the maximum number of attributes (and hence conditions) allowed in the antecedent X of a CFD. CFinder imposes this restriction because CFDs with a large number of conditions in the antecedent have limited applicability in practice.
- CFinder then generates candidate CFDs from each combination. For each attribute in a combination, CFinder turns that attribute into the consequent of a candidate CFD and the remaining attributes into its antecedent.
- CFinder prunes combinations that are unlikely to produce useful CFDs based on two heuristics.
- CFDs are more likely to be generated from attributes that are strongly related (e.g. Agency and Agency Code).
- CFinder implements this heuristic by treating each combination as a fully connected graph, with attributes as nodes, and by computing the average strength across all edges (and hence how strongly the attributes are related to each other) using the following equation: AvgStrength(c) = (1 / |E(c)|) * Σ_{(A,B) ∈ E(c)} Strength(A, B), where E(c) are all edges in the attribute combination c, (A, B) is an edge between attributes A and B, and Strength(A, B) measures how strongly A is related to B.
- A good measure for Strength(A, B) could be the semantic relatedness between A and B. In one embodiment, CFinder defines Strength(A, B) as the mutual information shared between A and B: Strength(A, B) = Σ_{a ∈ U(A)} Σ_{b ∈ U(B)} P(a, b) * log( P(a, b) / (P(a) * P(b)) )
- U(A) and U(B) are the unique values in A and B respectively; and P is the relative frequency of a value (or value pair) in an attribute (or attribute pair).
- CFinder prunes combinations with low strength, and sets the default strength threshold to 0.5.
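The two quantities above, the mutual-information strength of an attribute pair and the average edge strength of a combination, can be sketched directly; the sample records are invented.

```python
import math
from collections import Counter
from itertools import combinations

def strength(records, a, b):
    """Mutual information (in bits) shared between attributes a and b."""
    n = len(records)
    pa = Counter(r[a] for r in records)
    pb = Counter(r[b] for r in records)
    pab = Counter((r[a], r[b]) for r in records)
    return sum((c / n) * math.log2((c / n) / ((pa[va] / n) * (pb[vb] / n)))
               for (va, vb), c in pab.items())

def avg_strength(records, attrs):
    """Average strength across all edges of the fully connected graph on attrs."""
    edges = list(combinations(attrs, 2))
    return sum(strength(records, a, b) for a, b in edges) / len(edges)

records = [
    {"Agency": "DOE", "Agency Code": "89"},
    {"Agency": "DOE", "Agency Code": "89"},
    {"Agency": "DOD", "Agency Code": "97"},
    {"Agency": "DOD", "Agency Code": "97"},
]
# Perfectly correlated attributes: the mutual information is 1 bit here,
# well above the 0.5 default strength threshold.
s = strength(records, "Agency", "Agency Code")
```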
- Figure 6 shows the fully connected graph for the following attribute combination from Table 1.
- c1 = (Rcpt Category, Rcpt City, Agency, Agency Code)
- The edge labels indicate the strength between these attributes. Since the average strength (i.e. 1.13) is greater than 0.5, in one embodiment, CFinder will keep this combination.
- The second heuristic is that many combinations are variants of one another and can be pruned. These variants often result in the discovery of the same CFDs because, in one embodiment, CFinder refines CFDs by removing extraneous and/or invalid conditions from the antecedent.
- CFinder implements this heuristic by first sorting, in descending order based on strength, the combinations that remain after applying the first heuristic. In one embodiment, CFinder then traverses this list in descending order, and for each combination c it finds all preceding combinations C that differ from c by a small number of attributes.
- CFinder defines this difference as the number of attributes in c that are not in c', where c' ∈ C, and sets the default difference to 1 (i.e. C will contain all combinations that differ from c by one attribute).
- CFinder should prune c if it has significant overlap with C. Because each combination can be treated as a fully connected graph, the overlap between c and any combination in C is their maximum common subgraph. If the non-overlapping edges in c (i.e. edges not found in C) are weak, then it is unlikely that this combination will produce any new, useful CFDs. In one embodiment, CFinder captures this notion formally as: Overlap(c, C) = Σ_{e ∈ E'(c)} Strength(e) / Σ_{e ∈ E(c)} Strength(e), where E(c) are all edges in c and E'(c) are the edges in c that overlap with combinations in C. If this value exceeds the prune threshold H_P, then the combination is pruned.
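A sketch of this overlap test, treating each combination as a set of edges; the edge strengths are invented numbers chosen to mirror the Figure 6 discussion.

```python
from itertools import combinations

def edges(attrs):
    """All undirected edges of the fully connected graph on `attrs`."""
    return {frozenset(e) for e in combinations(attrs, 2)}

def overlap_ratio(c, preceding, edge_strength):
    """Share of c's total edge strength carried by edges it shares with preceding combos."""
    shared = edges(c) & set().union(*(edges(p) for p in preceding))
    total = sum(edge_strength[e] for e in edges(c))
    kept = sum(edge_strength[e] for e in shared)
    return kept / total

strengths = {
    frozenset({"Agency", "Agency Code"}): 2.0,
    frozenset({"Agency", "Rcpt City"}): 1.5,
    frozenset({"Agency Code", "Rcpt City"}): 0.1,   # the weak, non-overlapping edge
}
c = ("Agency", "Agency Code", "Rcpt City")
preceding = [("Agency", "Agency Code"), ("Agency", "Rcpt City")]
ratio = overlap_ratio(c, preceding, strengths)   # (2.0 + 1.5) / 3.6
pruned = ratio > 0.85                            # True: c is mostly redundant
```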
- Figure 6 shows two additional combinations from Table 1 whose strengths rank higher than c1. If H_P is 0.85, then, in one embodiment, CFinder will prune c1 because it has high overlap (shown in bold) with c2 and c3, and the non-overlapping edge in c1 is weak.
- CFinder generates candidate CFDs from the remaining combinations. In one embodiment, CFinder starts with the strongest one and refines these CFDs in the order they are generated.
- CFinder refines each candidate CFD by comparing it with records from the relation of interest. In one embodiment, CFinder randomizes the order in which records are examined. In one embodiment, for each record, CFinder determines whether the record is consistent, inconsistent, or irrelevant to the CFD.
- A record is consistent with a CFD if all values in the pattern tuple of the CFD match the respective values in the record. If so, then, in one embodiment, CFinder increments the consistent record count Rc by 1.
- A record is inconsistent with a CFD if all values in the pattern tuple that correspond to the antecedent of the CFD match the respective values in the record, but values that correspond to the consequent do not. If so, then, in one embodiment, CFinder increments the inconsistent record count Ri by 1.
- Otherwise, CFinder increments the irrelevant record count Rv by 1. In one embodiment, CFinder uses these counts to check whether the CFD is too specific (and hence needs to be refined) and whether inconsistencies encountered for the CFD are real errors in the data or anomalies, which can be ignored. In one embodiment, CFinder performs this check once every M records using the minimum support threshold Hs (i.e. Rc/(Rc + Rv) >= Hs) and the maximum inconsistency threshold Hi (i.e. Ri/(Ri + Rv) <= Hi).
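The two threshold checks can be sketched as one predicate. The exact comparison directions are garbled in the source, so the inequalities below (refine when support is too low or inconsistency too high) are an assumption.

```python
def needs_refinement(rc, ri, rv, h_s, h_i):
    """Check, once every M records, whether a CFD should be refined."""
    support = rc / (rc + rv) if (rc + rv) else 0.0          # Rc / (Rc + Rv)
    inconsistency = ri / (ri + rv) if (ri + rv) else 0.0    # Ri / (Ri + Rv)
    return support < h_s or inconsistency > h_i

# 10 consistent, 2 inconsistent, 88 irrelevant records seen so far:
refine = needs_refinement(rc=10, ri=2, rv=88, h_s=0.25, h_i=0.05)
# True: support is only 10/98, well under the 0.25 minimum, so the CFD
# is too specific and should be refined.
```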
- CFinder refines the CFD by removing extraneous and/or invalid conditions from its antecedent.
- The difference between the observed support (i.e. Rc/(Rc + Rv)) and the expected support (i.e. Hs) may be due to a "sampling" effect within the M records examined. This effect can cause the CFD to be over-refined and become too promiscuous.
- CFinder will refine a CFD only if the difference is significant. The difference is significant if the resulting χ² value exceeds the critical χ² value at a specified confidence level, which CFinder defaults to 99%. In one embodiment, CFinder selects the top K most promising conditions to remove from the antecedent of the CFD. Since the goal is to improve support, in one embodiment, CFinder should remove conditions whose value occurs infrequently and whose corresponding attribute has high uncertainty (i.e. high entropy).
- In one embodiment, CFinder implements this notion formally with a score computed for each condition, where:
- A and B are the attributes of the condition and the consequent respectively;
- Tp(*) is the value of an attribute in the pattern tuple, and P is the relative frequency of the value pair; and
- Entropy(A, B) is the joint entropy between A and B across all records examined so far.
- CFinder selects the K conditions with the highest scores based on the equation above, and for each such condition CFinder removes the condition from the antecedent of the original CFD to generate a new CFD. For example, assume CFinder needs to refine the following CFD by selecting the top two conditions, and that the records in Table 1 are the ones examined so far.
- CFinder will select Program and CFDA No. - whose scores are 1.97 and 1.69 respectively (Agency Code has the lowest score of 0.98) - and remove them from the original CFD to generate the following new CFDs.
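The source's scoring formula is an image that did not survive extraction. The sketch below therefore combines the two stated criteria (a rarely occurring pattern value and high joint entropy with the consequent) into one illustrative score, Entropy(A, B) * (1 - P); treat the exact form, the attribute names, and the sample records as assumptions.

```python
import math
from collections import Counter

def joint_entropy(records, a, b):
    """Joint Shannon entropy (in bits) of attribute pair (a, b)."""
    n = len(records)
    counts = Counter((r[a], r[b]) for r in records)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

def condition_score(records, cond_attr, cons_attr, tp):
    """Higher score = rarer pattern value and more uncertain attribute pair."""
    n = len(records)
    p = sum(1 for r in records
            if r[cond_attr] == tp[cond_attr] and r[cons_attr] == tp[cons_attr]) / n
    return joint_entropy(records, cond_attr, cons_attr) * (1 - p)

def top_k_conditions(records, antecedent_attrs, cons_attr, tp, k):
    """The k antecedent conditions most promising to remove."""
    return sorted(antecedent_attrs,
                  key=lambda a: condition_score(records, a, cons_attr, tp),
                  reverse=True)[:k]

records = [
    {"A": "x", "B": "p", "Y": "1"},
    {"A": "x", "B": "q", "Y": "1"},
    {"A": "x", "B": "r", "Y": "2"},
    {"A": "x", "B": "s", "Y": "2"},
]
tp = {"A": "x", "B": "p", "Y": "1"}
# B's pattern value is rare and (B, Y) is high-entropy, so B is removed first.
selected = top_k_conditions(records, ["A", "B"], "Y", tp, 1)  # ["B"]
```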
- For each new CFD, in one embodiment, CFinder records the CFD to prevent it from being generated again, and recomputes Rc, Ri, and Rv for the CFD. If no conditions remain in the antecedent, then the CFD is discarded.
- Similarly, CFinder determines whether the difference between the observed inconsistency (i.e. Ri/(Ri + Rv)) and the maximum inconsistency threshold Hi is significant.
- CFinder penalizes the CFD by adding Ri to Rv and then resetting Ri to 0. This penalty increases the likelihood that the CFD will fail to meet the minimum support threshold, which will cause the CFD to be refined and eventually discarded (if the inconsistencies persist).
- CFinder repeats the above process until all records have been examined or the CFD becomes stable.
- CFinder addresses these two issues by determining whether a CFD is stable and hence does not need to be refined further.
- a CFD is stable if both the support for the CFD and the certainty of the values that make up the attributes referenced in the CFD are constant over a given period of time.
- CFinder captures this notion by first computing a stability score St for the CFD using the following equation: where Rc and Rv are the consistent and irrelevant record counts for the CFD, respectively (see previous section); X ∪ Y is the set of all attributes referenced in the CFD; and Entropy(A) is the entropy of A across all records examined so far. In one embodiment, CFinder computes this score once every M records, when it checks the minimum support and maximum inconsistency thresholds.
- CFinder then computes the standard deviation SDSt of the past L stability scores; if SDSt is constant according to the following equation, then the CFD is stable.
- AvgSt is the average of the past L stability scores, and HSt is the stability threshold.
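The stability check can be sketched as follows. Both equations are rendered as images in the source, so this is an assumed reading: the CFD is stable when the standard deviation of the last L stability scores is small relative to their average (SDSt ≤ HSt × AvgSt). Function and parameter names are illustrative.

```python
import statistics

def is_stable(stability_scores, h_st, window):
    # stability_scores: the St values computed once every M records.
    # window: L, the number of recent scores considered.
    # h_st: the stability threshold HSt.
    if len(stability_scores) < window:
        return False  # not enough history to judge stability
    recent = stability_scores[-window:]
    # Assumed inequality: SD of recent scores <= HSt * their average.
    return statistics.pstdev(recent) <= h_st * statistics.fmean(recent)
```

A flat score history is judged stable, while a steadily drifting one is not.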
- CFinder uses the measures of support and conviction to filter weak CFDs (i.e. CFDs that do not meet and/or exceed the thresholds specified for these measures).
- Support measures how much evidence there is for a CFD, and can be defined using the consistent and irrelevant record counts.
- Conviction measures how much the antecedent and consequent of a CFD deviate from independence while considering
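Hedged sketches of both filtering measures follow. The support expression reuses the Rc/(Rc + Rv) form used earlier in the text; because the text's own conviction sentence is truncated, the sketch falls back on the standard association-rule definition of conviction, which may differ from the patent's exact formulation.

```python
def support(rc, rv):
    # Observed support: consistent records over consistent + irrelevant
    # records, matching the support expression used earlier in the text.
    return rc / (rc + rv)

def conviction(consequent_support, confidence):
    # Standard conviction measure: how strongly antecedent and consequent
    # deviate from independence.  1.0 means independence; an
    # exception-free rule yields infinity.
    if confidence >= 1.0:
        return float("inf")
    return (1.0 - consequent_support) / (1.0 - confidence)
```

A weak CFD is one whose support or conviction falls below the thresholds specified for these measures.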
- CFinder applies an additional filter to remove subsumed CFDs.
- a CFD F1 can be generalized if there exists another CFD F2 such that F1 and F2 have the same antecedents and consequents - i.e. X1 equals X2 and Y1 equals Y2.
- the pattern tuples of F1 and F2 differ by a single value. If these conditions are met, then, in one embodiment, CFinder generalizes F1 and F2 into a single CFD by replacing the differing value in their pattern tuples with a wildcard (i.e. '_') that can match any arbitrary value. For example, given the following CFDs: (Rcpt Category, agency -> Program,
- CFinder can generalize them into:
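A minimal sketch of this wildcard generalization, assuming pattern tuples are represented as attribute-to-value dicts (the function name and dict representation are illustrative, not from the source):

```python
WILDCARD = "_"

def generalize(pattern1, pattern2):
    # Merge the pattern tuples of two CFDs that share the same antecedent
    # and consequent attributes, when their values differ in exactly one
    # place; the differing value becomes the wildcard '_', which matches
    # any arbitrary value.
    if pattern1.keys() != pattern2.keys():
        return None
    diffs = [a for a in pattern1 if pattern1[a] != pattern2[a]]
    if len(diffs) != 1:
        return None  # only single-value differences are generalized
    merged = dict(pattern1)
    merged[diffs[0]] = WILDCARD
    return merged
```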
- Embodiments of the invention as described above can significantly accelerate data quality efforts on enterprise initiatives, ranging from master data management to business intelligence, by reducing the amount of manual effort required to identify and collect data quality rules.
- The fact that they can be integrated with key data quality vendor solutions ensures that the data quality rules can quickly be made operational for those solutions. It is also important to note that they can effectively detect and validate data quality problems beyond addresses, names, null values, and value ranges.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Databases & Information Systems (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Computational Linguistics (AREA)
- Fuzzy Systems (AREA)
- Mathematical Physics (AREA)
- Probability & Statistics with Applications (AREA)
- Software Systems (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
According to embodiments, the present invention solves the technical problem of identifying, collecting, and managing rules that improve poor-quality data across enterprise initiatives ranging from data governance to business intelligence. In one particular embodiment of the present invention, a method is described for producing data quality rules for a data set. A set of candidate conditional functional dependencies is generated, built from candidate seeds of attributes that fall within a certain degree of relatedness in the data set's ontology. The candidate conditional functional dependencies are then applied to the data and refined until they reach a quiescent state in which they have not been refined even though the data to which they are applied has been stable. The resulting refined candidate conditional functional dependencies are the data enhancement rules for the data set and other related data sets. In another particular embodiment of the present invention, a computer system for data quality rule development is described, comprising a rules repository, a data quality rule discovery engine, and a user interface.
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP10795826A EP2350887A1 (fr) | 2009-12-07 | 2010-12-06 | Procédé et système d'amélioration de qualité de données accélérée |
CA2734599A CA2734599C (fr) | 2009-12-07 | 2010-12-06 | Procede et systeme d'amelioration acceleree de la qualite des donnees |
CN201080002524.4A CN102257496B (zh) | 2009-12-07 | 2010-12-06 | 用于加速的数据质量增强的方法和系统 |
Applications Claiming Priority (6)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US26733809P | 2009-12-07 | 2009-12-07 | |
US61/267,338 | 2009-12-07 | ||
US29723310P | 2010-01-21 | 2010-01-21 | |
US61/297,233 | 2010-01-21 | ||
US12/779,830 | 2010-05-13 | ||
US12/779,830 US8700577B2 (en) | 2009-12-07 | 2010-05-13 | Method and system for accelerated data quality enhancement |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2011071833A1 true WO2011071833A1 (fr) | 2011-06-16 |
Family
ID=44083245
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/US2010/059126 WO2011071833A1 (fr) | 2009-12-07 | 2010-12-06 | Procédé et système d'amélioration de qualité de données accélérée |
Country Status (5)
Country | Link |
---|---|
US (1) | US8700577B2 (fr) |
EP (1) | EP2350887A1 (fr) |
CN (1) | CN102257496B (fr) |
CA (1) | CA2734599C (fr) |
WO (1) | WO2011071833A1 (fr) |
Cited By (40)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8601326B1 (en) | 2013-07-05 | 2013-12-03 | Palantir Technologies, Inc. | Data quality monitors |
US8855999B1 (en) | 2013-03-15 | 2014-10-07 | Palantir Technologies Inc. | Method and system for generating a parser and parsing complex data |
US8930897B2 (en) | 2013-03-15 | 2015-01-06 | Palantir Technologies Inc. | Data integration tool |
US9009827B1 (en) | 2014-02-20 | 2015-04-14 | Palantir Technologies Inc. | Security sharing system |
US9081975B2 (en) | 2012-10-22 | 2015-07-14 | Palantir Technologies, Inc. | Sharing information between nexuses that use different classification schemes for information access control |
US9201920B2 (en) | 2006-11-20 | 2015-12-01 | Palantir Technologies, Inc. | Creating data in a data store using a dynamic ontology |
US9223773B2 (en) | 2013-08-08 | 2015-12-29 | Palatir Technologies Inc. | Template system for custom document generation |
US9229952B1 (en) | 2014-11-05 | 2016-01-05 | Palantir Technologies, Inc. | History preserving data pipeline system and method |
US9338013B2 (en) | 2013-12-30 | 2016-05-10 | Palantir Technologies Inc. | Verifiable redactable audit log |
US9495353B2 (en) | 2013-03-15 | 2016-11-15 | Palantir Technologies Inc. | Method and system for generating a parser and parsing complex data |
US9576015B1 (en) | 2015-09-09 | 2017-02-21 | Palantir Technologies, Inc. | Domain-specific language for dataset transformations |
US9678850B1 (en) | 2016-06-10 | 2017-06-13 | Palantir Technologies Inc. | Data pipeline monitoring |
US9727560B2 (en) | 2015-02-25 | 2017-08-08 | Palantir Technologies Inc. | Systems and methods for organizing and identifying documents via hierarchies and dimensions of tags |
US9740369B2 (en) | 2013-03-15 | 2017-08-22 | Palantir Technologies Inc. | Systems and methods for providing a tagging interface for external content |
US9772934B2 (en) | 2015-09-14 | 2017-09-26 | Palantir Technologies Inc. | Pluggable fault detection tests for data pipelines |
US9898167B2 (en) | 2013-03-15 | 2018-02-20 | Palantir Technologies Inc. | Systems and methods for providing a tagging interface for external content |
US9922108B1 (en) | 2017-01-05 | 2018-03-20 | Palantir Technologies Inc. | Systems and methods for facilitating data transformation |
US9946777B1 (en) | 2016-12-19 | 2018-04-17 | Palantir Technologies Inc. | Systems and methods for facilitating data transformation |
US9996595B2 (en) | 2015-08-03 | 2018-06-12 | Palantir Technologies, Inc. | Providing full data provenance visualization for versioned datasets |
US10007674B2 (en) | 2016-06-13 | 2018-06-26 | Palantir Technologies Inc. | Data revision control in large-scale data analytic systems |
US10102229B2 (en) | 2016-11-09 | 2018-10-16 | Palantir Technologies Inc. | Validating data integrations using a secondary data store |
US10127289B2 (en) | 2015-08-19 | 2018-11-13 | Palantir Technologies Inc. | Systems and methods for automatic clustering and canonical designation of related data in various data structures |
US10133782B2 (en) | 2016-08-01 | 2018-11-20 | Palantir Technologies Inc. | Techniques for data extraction |
US10248722B2 (en) | 2016-02-22 | 2019-04-02 | Palantir Technologies Inc. | Multi-language support for dynamic ontology |
US10311081B2 (en) | 2012-11-05 | 2019-06-04 | Palantir Technologies Inc. | System and method for sharing investigation results |
US10496529B1 (en) | 2018-04-18 | 2019-12-03 | Palantir Technologies Inc. | Data unit test-based data management system |
US10503574B1 (en) | 2017-04-10 | 2019-12-10 | Palantir Technologies Inc. | Systems and methods for validating data |
US10572496B1 (en) | 2014-07-03 | 2020-02-25 | Palantir Technologies Inc. | Distributed workflow system and database with access controls for city resiliency |
US10621314B2 (en) | 2016-08-01 | 2020-04-14 | Palantir Technologies Inc. | Secure deployment of a software package |
US10691729B2 (en) | 2017-07-07 | 2020-06-23 | Palantir Technologies Inc. | Systems and methods for providing an object platform for a relational database |
US10698938B2 (en) | 2016-03-18 | 2020-06-30 | Palantir Technologies Inc. | Systems and methods for organizing and identifying documents via hierarchies and dimensions of tags |
US10754822B1 (en) | 2018-04-18 | 2020-08-25 | Palantir Technologies Inc. | Systems and methods for ontology migration |
US10803106B1 (en) | 2015-02-24 | 2020-10-13 | Palantir Technologies Inc. | System with methodology for dynamic modular ontology |
US10853378B1 (en) | 2015-08-25 | 2020-12-01 | Palantir Technologies Inc. | Electronic note management via a connected entity graph |
US10866792B1 (en) | 2018-04-17 | 2020-12-15 | Palantir Technologies Inc. | System and methods for rules-based cleaning of deployment pipelines |
US10956406B2 (en) | 2017-06-12 | 2021-03-23 | Palantir Technologies Inc. | Propagated deletion of database records and derived data |
US10956508B2 (en) | 2017-11-10 | 2021-03-23 | Palantir Technologies Inc. | Systems and methods for creating and managing a data integration workspace containing automatically updated data models |
USRE48589E1 (en) | 2010-07-15 | 2021-06-08 | Palantir Technologies Inc. | Sharing and deconflicting data changes in a multimaster database system |
US11106692B1 (en) | 2016-08-04 | 2021-08-31 | Palantir Technologies Inc. | Data record resolution and correlation system |
US11461355B1 (en) | 2018-05-15 | 2022-10-04 | Palantir Technologies Inc. | Ontological mapping of data |
Families Citing this family (37)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20120150825A1 (en) * | 2010-12-13 | 2012-06-14 | International Business Machines Corporation | Cleansing a Database System to Improve Data Quality |
US20130117202A1 (en) * | 2011-11-03 | 2013-05-09 | Microsoft Corporation | Knowledge-based data quality solution |
US9152662B2 (en) * | 2012-01-16 | 2015-10-06 | Tata Consultancy Services Limited | Data quality analysis |
GB2508573A (en) * | 2012-02-28 | 2014-06-11 | Qatar Foundation | A computer-implemented method and computer program for detecting a set of inconsistent data records in a database including multiple records |
US8972363B2 (en) * | 2012-05-14 | 2015-03-03 | Nec Corporation | Rule discovery system, method, apparatus and program |
JPWO2013172309A1 (ja) * | 2012-05-14 | 2016-01-12 | 日本電気株式会社 | ルール発見システムと方法と装置並びにプログラム |
WO2013187816A1 (fr) * | 2012-06-15 | 2013-12-19 | Telefonaktiebolaget Lm Ericsson (Publ) | Procédé et vérificateur de cohérence de recherche d'incohérences de données dans un organe d'archivage de données |
US20140012637A1 (en) * | 2012-07-06 | 2014-01-09 | Xerox Corporation | Traffic delay detection by mining ticket validation transactions |
EP2728493A1 (fr) * | 2012-11-01 | 2014-05-07 | Telefonaktiebolaget L M Ericsson (Publ) | Procédé, appareil et programme informatique pour détecter des écarts de référentiels de données |
US10545932B2 (en) * | 2013-02-07 | 2020-01-28 | Qatar Foundation | Methods and systems for data cleaning |
US9558230B2 (en) | 2013-02-12 | 2017-01-31 | International Business Machines Corporation | Data quality assessment |
US10157175B2 (en) | 2013-03-15 | 2018-12-18 | International Business Machines Corporation | Business intelligence data models with concept identification using language-specific clues |
US10318388B2 (en) * | 2013-05-31 | 2019-06-11 | Qatar Foundation | Datasets profiling tools, methods, and systems |
US9146918B2 (en) | 2013-09-13 | 2015-09-29 | International Business Machines Corporation | Compressing data for natural language processing |
CN104252398A (zh) * | 2013-12-04 | 2014-12-31 | 深圳市华傲数据技术有限公司 | 一种数据防火墙系统修复数据方法和系统 |
US10210156B2 (en) | 2014-01-10 | 2019-02-19 | International Business Machines Corporation | Seed selection in corpora compaction for natural language processing |
US10698924B2 (en) | 2014-05-22 | 2020-06-30 | International Business Machines Corporation | Generating partitioned hierarchical groups based on data sets for business intelligence data models |
US20150363437A1 (en) * | 2014-06-17 | 2015-12-17 | Ims Health Incorporated | Data collection and cleaning at source |
US9754208B2 (en) * | 2014-09-02 | 2017-09-05 | Wal-Mart Stores, Inc. | Automatic rule coaching |
GB201417129D0 (en) * | 2014-09-29 | 2014-11-12 | Ibm | A method of processing data errors for a data processing system |
US10002179B2 (en) | 2015-01-30 | 2018-06-19 | International Business Machines Corporation | Detection and creation of appropriate row concept during automated model generation |
CN105045807A (zh) * | 2015-06-04 | 2015-11-11 | 浙江力石科技股份有限公司 | 互联网交易信息的数据清洗算法 |
US9984116B2 (en) | 2015-08-28 | 2018-05-29 | International Business Machines Corporation | Automated management of natural language queries in enterprise business intelligence analytics |
US9852164B2 (en) | 2015-09-10 | 2017-12-26 | International Business Machines Corporation | Task handling in a multisystem environment |
US20170124154A1 (en) | 2015-11-02 | 2017-05-04 | International Business Machines Corporation | Establishing governance rules over data assets |
US10832186B2 (en) | 2016-03-21 | 2020-11-10 | International Business Machines Corporation | Task handling in a master data management system |
CA2989617A1 (fr) | 2016-12-19 | 2018-06-19 | Capital One Services, Llc | Systemes et methodes de fourniture de gestion de la qualite des donnees |
CN108460038A (zh) * | 2017-02-20 | 2018-08-28 | 阿里巴巴集团控股有限公司 | 规则匹配方法及其设备 |
US10528523B2 (en) * | 2017-05-31 | 2020-01-07 | International Business Machines Corporation | Validation of search query in data analysis system |
US11106643B1 (en) * | 2017-08-02 | 2021-08-31 | Synchrony Bank | System and method for integrating systems to implement data quality processing |
CN108536777B (zh) * | 2018-03-28 | 2022-03-25 | 联想(北京)有限公司 | 一种数据处理方法、服务器集群及数据处理装置 |
EP3732628A1 (fr) * | 2018-05-18 | 2020-11-04 | Google LLC | Politiques d'augmentation de données d'apprentissage |
US11461671B2 (en) | 2019-06-03 | 2022-10-04 | Bank Of America Corporation | Data quality tool |
CN112181254A (zh) * | 2020-10-10 | 2021-01-05 | 武汉中科通达高新技术股份有限公司 | 数据质量管理方法及装置 |
US11550813B2 (en) | 2021-02-24 | 2023-01-10 | International Business Machines Corporation | Standardization in the context of data integration |
US11533235B1 (en) | 2021-06-24 | 2022-12-20 | Bank Of America Corporation | Electronic system for dynamic processing of temporal upstream data and downstream data in communication networks |
CN115193026A (zh) * | 2022-09-16 | 2022-10-18 | 成都止观互娱科技有限公司 | 一种高并发全球同服游戏服务器架构及数据访问方法 |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20090006302A1 (en) * | 2007-06-29 | 2009-01-01 | Wenfei Fan | Methods and Apparatus for Capturing and Detecting Inconsistencies in Relational Data Using Conditional Functional Dependencies |
US20090287721A1 (en) * | 2008-03-03 | 2009-11-19 | Lukasz Golab | Generating conditional functional dependencies |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7315825B2 (en) * | 1999-06-23 | 2008-01-01 | Visicu, Inc. | Rules-based patient care system for use in healthcare locations |
KR100922141B1 (ko) * | 2003-09-15 | 2009-10-19 | 아브 이니티오 소프트웨어 엘엘시 | 데이터 프로파일링 방법 및 시스템 |
CN100485612C (zh) * | 2006-09-08 | 2009-05-06 | 中国科学院软件研究所 | 软件需求获取系统 |
US7836004B2 (en) * | 2006-12-11 | 2010-11-16 | International Business Machines Corporation | Using data mining algorithms including association rules and tree classifications to discover data rules |
CN101261705A (zh) * | 2008-03-19 | 2008-09-10 | 北京航空航天大学 | 业务建模驱动的erp软件需求获取方法 |
US20090271358A1 (en) * | 2008-04-28 | 2009-10-29 | Eric Lindahl | Evidential Reasoning Network and Method |
US20100138363A1 (en) * | 2009-06-12 | 2010-06-03 | Microsoft Corporation | Smart grid price response service for dynamically balancing energy supply and demand |
- 2010
- 2010-05-13 US US12/779,830 patent/US8700577B2/en active Active
- 2010-12-06 WO PCT/US2010/059126 patent/WO2011071833A1/fr active Application Filing
- 2010-12-06 CA CA2734599A patent/CA2734599C/fr active Active
- 2010-12-06 CN CN201080002524.4A patent/CN102257496B/zh active Active
- 2010-12-06 EP EP10795826A patent/EP2350887A1/fr not_active Ceased
Non-Patent Citations (5)
Title |
---|
OLIVIER CURÉ ET AL: "Data Quality Enhancement of Databases Using Ontologies and Inductive Reasoning", 25 November 2007, ON THE MOVE TO MEANINGFUL INTERNET SYSTEMS 2007: COOPIS, DOA, ODBASE, GADA, AND IS; [LECTURE NOTES IN COMPUTER SCIENCE], SPRINGER BERLIN HEIDELBERG, BERLIN, HEIDELBERG, PAGE(S) 1117 - 1134, ISBN: 978-3-540-76846-3, XP019083399 * |
OLIVIER CURÉ: "Improving the Data Quality of Relational Databases using OBDA and OWL 2 QL", PROCEEDINGS OF OWLED 2009,, 1 January 2009 (2009-01-01), pages 1 - 4, XP009144031, Retrieved from the Internet <URL:http://www.webont.org/owled/2009/papers/owled2009_submission_23.pdf> * |
PHILIP BOHANNON ET AL: "Conditional Functional Dependencies for Data Cleaning", DATA ENGINEERING, 2007. ICDE 2007. IEEE 23RD INTERNATIONAL CONFERENCE ON, IEEE, PI, 1 April 2007 (2007-04-01), pages 746 - 755, XP031095818, ISBN: 978-1-4244-0802-3 * |
STEFAN BRÜGGEMANN ED - YANCHUN ZHANG ET AL: "Rule Mining for Automatic Ontology Based Data Cleaning", 26 April 2008, PROGRESS IN WWW RESEARCH AND DEVELOPMENT; [LECTURE NOTES IN COMPUTER SCIENCE], SPRINGER BERLIN HEIDELBERG, BERLIN, HEIDELBERG, PAGE(S) 522 - 527, ISBN: 978-3-540-78848-5, XP019088103 * |
WENFEI FAN ET AL: "Discovering Conditional Functional Dependencies", DATA ENGINEERING, 2009. ICDE '09. IEEE 25TH INTERNATIONAL CONFERENCE ON, IEEE, PISCATAWAY, NJ, USA, 29 March 2009 (2009-03-29), pages 1231 - 1234, XP031447812, ISBN: 978-1-4244-3422-0 * |
Cited By (78)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US9589014B2 (en) | 2006-11-20 | 2017-03-07 | Palantir Technologies, Inc. | Creating data in a data store using a dynamic ontology |
US10872067B2 (en) | 2006-11-20 | 2020-12-22 | Palantir Technologies, Inc. | Creating data in a data store using a dynamic ontology |
US9201920B2 (en) | 2006-11-20 | 2015-12-01 | Palantir Technologies, Inc. | Creating data in a data store using a dynamic ontology |
USRE48589E1 (en) | 2010-07-15 | 2021-06-08 | Palantir Technologies Inc. | Sharing and deconflicting data changes in a multimaster database system |
US10891312B2 (en) | 2012-10-22 | 2021-01-12 | Palantir Technologies Inc. | Sharing information between nexuses that use different classification schemes for information access control |
US9081975B2 (en) | 2012-10-22 | 2015-07-14 | Palantir Technologies, Inc. | Sharing information between nexuses that use different classification schemes for information access control |
US9836523B2 (en) | 2012-10-22 | 2017-12-05 | Palantir Technologies Inc. | Sharing information between nexuses that use different classification schemes for information access control |
US10846300B2 (en) | 2012-11-05 | 2020-11-24 | Palantir Technologies Inc. | System and method for sharing investigation results |
US10311081B2 (en) | 2012-11-05 | 2019-06-04 | Palantir Technologies Inc. | System and method for sharing investigation results |
US9898167B2 (en) | 2013-03-15 | 2018-02-20 | Palantir Technologies Inc. | Systems and methods for providing a tagging interface for external content |
US12079456B2 (en) | 2013-03-15 | 2024-09-03 | Palantir Technologies Inc. | Systems and methods for providing a tagging interface for external content |
US9495353B2 (en) | 2013-03-15 | 2016-11-15 | Palantir Technologies Inc. | Method and system for generating a parser and parsing complex data |
US8855999B1 (en) | 2013-03-15 | 2014-10-07 | Palantir Technologies Inc. | Method and system for generating a parser and parsing complex data |
US8930897B2 (en) | 2013-03-15 | 2015-01-06 | Palantir Technologies Inc. | Data integration tool |
US10120857B2 (en) | 2013-03-15 | 2018-11-06 | Palantir Technologies Inc. | Method and system for generating a parser and parsing complex data |
US9740369B2 (en) | 2013-03-15 | 2017-08-22 | Palantir Technologies Inc. | Systems and methods for providing a tagging interface for external content |
US10809888B2 (en) | 2013-03-15 | 2020-10-20 | Palantir Technologies, Inc. | Systems and methods for providing a tagging interface for external content |
US9984152B2 (en) | 2013-03-15 | 2018-05-29 | Palantir Technologies Inc. | Data integration tool |
US9348851B2 (en) | 2013-07-05 | 2016-05-24 | Palantir Technologies Inc. | Data quality monitors |
US8601326B1 (en) | 2013-07-05 | 2013-12-03 | Palantir Technologies, Inc. | Data quality monitors |
US10970261B2 (en) | 2013-07-05 | 2021-04-06 | Palantir Technologies Inc. | System and method for data quality monitors |
US9223773B2 (en) | 2013-08-08 | 2015-12-29 | Palatir Technologies Inc. | Template system for custom document generation |
US10699071B2 (en) | 2013-08-08 | 2020-06-30 | Palantir Technologies Inc. | Systems and methods for template based custom document generation |
US9338013B2 (en) | 2013-12-30 | 2016-05-10 | Palantir Technologies Inc. | Verifiable redactable audit log |
US11032065B2 (en) | 2013-12-30 | 2021-06-08 | Palantir Technologies Inc. | Verifiable redactable audit log |
US10027473B2 (en) | 2013-12-30 | 2018-07-17 | Palantir Technologies Inc. | Verifiable redactable audit log |
US9923925B2 (en) | 2014-02-20 | 2018-03-20 | Palantir Technologies Inc. | Cyber security sharing and identification system |
US9009827B1 (en) | 2014-02-20 | 2015-04-14 | Palantir Technologies Inc. | Security sharing system |
US10873603B2 (en) | 2014-02-20 | 2020-12-22 | Palantir Technologies Inc. | Cyber security sharing and identification system |
US10572496B1 (en) | 2014-07-03 | 2020-02-25 | Palantir Technologies Inc. | Distributed workflow system and database with access controls for city resiliency |
US9483506B2 (en) | 2014-11-05 | 2016-11-01 | Palantir Technologies, Inc. | History preserving data pipeline |
US10191926B2 (en) | 2014-11-05 | 2019-01-29 | Palantir Technologies, Inc. | Universal data pipeline |
US10853338B2 (en) | 2014-11-05 | 2020-12-01 | Palantir Technologies Inc. | Universal data pipeline |
US9229952B1 (en) | 2014-11-05 | 2016-01-05 | Palantir Technologies, Inc. | History preserving data pipeline system and method |
US9946738B2 (en) | 2014-11-05 | 2018-04-17 | Palantir Technologies, Inc. | Universal data pipeline |
US10803106B1 (en) | 2015-02-24 | 2020-10-13 | Palantir Technologies Inc. | System with methodology for dynamic modular ontology |
US9727560B2 (en) | 2015-02-25 | 2017-08-08 | Palantir Technologies Inc. | Systems and methods for organizing and identifying documents via hierarchies and dimensions of tags |
US10474326B2 (en) | 2015-02-25 | 2019-11-12 | Palantir Technologies Inc. | Systems and methods for organizing and identifying documents via hierarchies and dimensions of tags |
US9996595B2 (en) | 2015-08-03 | 2018-06-12 | Palantir Technologies, Inc. | Providing full data provenance visualization for versioned datasets |
US10127289B2 (en) | 2015-08-19 | 2018-11-13 | Palantir Technologies Inc. | Systems and methods for automatic clustering and canonical designation of related data in various data structures |
US12038933B2 (en) | 2015-08-19 | 2024-07-16 | Palantir Technologies Inc. | Systems and methods for automatic clustering and canonical designation of related data in various data structures |
US11392591B2 (en) | 2015-08-19 | 2022-07-19 | Palantir Technologies Inc. | Systems and methods for automatic clustering and canonical designation of related data in various data structures |
US10853378B1 (en) | 2015-08-25 | 2020-12-01 | Palantir Technologies Inc. | Electronic note management via a connected entity graph |
US9965534B2 (en) | 2015-09-09 | 2018-05-08 | Palantir Technologies, Inc. | Domain-specific language for dataset transformations |
US11080296B2 (en) | 2015-09-09 | 2021-08-03 | Palantir Technologies Inc. | Domain-specific language for dataset transformations |
US9576015B1 (en) | 2015-09-09 | 2017-02-21 | Palantir Technologies, Inc. | Domain-specific language for dataset transformations |
US10936479B2 (en) | 2015-09-14 | 2021-03-02 | Palantir Technologies Inc. | Pluggable fault detection tests for data pipelines |
US10417120B2 (en) | 2015-09-14 | 2019-09-17 | Palantir Technologies Inc. | Pluggable fault detection tests for data pipelines |
US9772934B2 (en) | 2015-09-14 | 2017-09-26 | Palantir Technologies Inc. | Pluggable fault detection tests for data pipelines |
US10909159B2 (en) | 2016-02-22 | 2021-02-02 | Palantir Technologies Inc. | Multi-language support for dynamic ontology |
US10248722B2 (en) | 2016-02-22 | 2019-04-02 | Palantir Technologies Inc. | Multi-language support for dynamic ontology |
US10698938B2 (en) | 2016-03-18 | 2020-06-30 | Palantir Technologies Inc. | Systems and methods for organizing and identifying documents via hierarchies and dimensions of tags |
US10318398B2 (en) | 2016-06-10 | 2019-06-11 | Palantir Technologies Inc. | Data pipeline monitoring |
US9678850B1 (en) | 2016-06-10 | 2017-06-13 | Palantir Technologies Inc. | Data pipeline monitoring |
US10007674B2 (en) | 2016-06-13 | 2018-06-26 | Palantir Technologies Inc. | Data revision control in large-scale data analytic systems |
US11106638B2 (en) | 2016-06-13 | 2021-08-31 | Palantir Technologies Inc. | Data revision control in large-scale data analytic systems |
US10621314B2 (en) | 2016-08-01 | 2020-04-14 | Palantir Technologies Inc. | Secure deployment of a software package |
US10133782B2 (en) | 2016-08-01 | 2018-11-20 | Palantir Technologies Inc. | Techniques for data extraction |
US11106692B1 (en) | 2016-08-04 | 2021-08-31 | Palantir Technologies Inc. | Data record resolution and correlation system |
US10102229B2 (en) | 2016-11-09 | 2018-10-16 | Palantir Technologies Inc. | Validating data integrations using a secondary data store |
US11416512B2 (en) | 2016-12-19 | 2022-08-16 | Palantir Technologies Inc. | Systems and methods for facilitating data transformation |
US11768851B2 (en) | 2016-12-19 | 2023-09-26 | Palantir Technologies Inc. | Systems and methods for facilitating data transformation |
US10482099B2 (en) | 2016-12-19 | 2019-11-19 | Palantir Technologies Inc. | Systems and methods for facilitating data transformation |
US9946777B1 (en) | 2016-12-19 | 2018-04-17 | Palantir Technologies Inc. | Systems and methods for facilitating data transformation |
US9922108B1 (en) | 2017-01-05 | 2018-03-20 | Palantir Technologies Inc. | Systems and methods for facilitating data transformation |
US10776382B2 (en) | 2017-01-05 | 2020-09-15 | Palantir Technologies Inc. | Systems and methods for facilitating data transformation |
US10503574B1 (en) | 2017-04-10 | 2019-12-10 | Palantir Technologies Inc. | Systems and methods for validating data |
US11221898B2 (en) | 2017-04-10 | 2022-01-11 | Palantir Technologies Inc. | Systems and methods for validating data |
US10956406B2 (en) | 2017-06-12 | 2021-03-23 | Palantir Technologies Inc. | Propagated deletion of database records and derived data |
US10691729B2 (en) | 2017-07-07 | 2020-06-23 | Palantir Technologies Inc. | Systems and methods for providing an object platform for a relational database |
US11301499B2 (en) | 2017-07-07 | 2022-04-12 | Palantir Technologies Inc. | Systems and methods for providing an object platform for datasets |
US10956508B2 (en) | 2017-11-10 | 2021-03-23 | Palantir Technologies Inc. | Systems and methods for creating and managing a data integration workspace containing automatically updated data models |
US10866792B1 (en) | 2018-04-17 | 2020-12-15 | Palantir Technologies Inc. | System and methods for rules-based cleaning of deployment pipelines |
US11294801B2 (en) | 2018-04-18 | 2022-04-05 | Palantir Technologies Inc. | Data unit test-based data management system |
US12032476B2 (en) | 2018-04-18 | 2024-07-09 | Palantir Technologies Inc. | Data unit test-based data management system |
US10754822B1 (en) | 2018-04-18 | 2020-08-25 | Palantir Technologies Inc. | Systems and methods for ontology migration |
US10496529B1 (en) | 2018-04-18 | 2019-12-03 | Palantir Technologies Inc. | Data unit test-based data management system |
US11461355B1 (en) | 2018-05-15 | 2022-10-04 | Palantir Technologies Inc. | Ontological mapping of data |
Also Published As
Publication number | Publication date |
---|---|
US8700577B2 (en) | 2014-04-15 |
EP2350887A1 (fr) | 2011-08-03 |
CN102257496A (zh) | 2011-11-23 |
CN102257496B (zh) | 2016-09-28 |
US20110138312A1 (en) | 2011-06-09 |
CA2734599A1 (fr) | 2011-06-07 |
CA2734599C (fr) | 2015-01-13 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CA2734599C (fr) | Procede et systeme d'amelioration acceleree de la qualite des donnees | |
US11704325B2 (en) | Systems and methods for automatic clustering and canonical designation of related data in various data structures | |
EP3537325B1 (fr) | Interfaces utilisateurs interactives | |
Hellerstein | Quantitative data cleaning for large databases | |
Zhu et al. | Class noise vs. attribute noise: A quantitative study | |
US8645332B1 (en) | Systems and methods for capturing data refinement actions based on visualized search of information | |
AU2013329525C1 (en) | System and method for recursively traversing the internet and other sources to identify, gather, curate, adjudicate, and qualify business identity and related data | |
Kirsch et al. | An efficient rigorous approach for identifying statistically significant frequent itemsets | |
Hahsler et al. | New probabilistic interest measures for association rules | |
Yeh et al. | An efficient and robust approach for discovering data quality rules | |
US10929531B1 (en) | Automated scoring of intra-sample sections for malware detection | |
CN106776703A (zh) | 一种虚拟化环境下的多元数据清洗技术 | |
CN113626241A (zh) | 应用程序的异常处理方法、装置、设备及存储介质 | |
US20030182136A1 (en) | System and method for ranking objects by likelihood of possessing a property | |
Abuzaid et al. | Macrobase: Prioritizing attention in fast data | |
US11321359B2 (en) | Review and curation of record clustering changes at large scale | |
Uher et al. | Automation of cleaning and ensembles for outliers detection in questionnaire data | |
Miyauchi et al. | What is a network community? A novel quality function and detection algorithms | |
Mezzanzanica et al. | Data quality sensitivity analysis on aggregate indicators | |
Huang et al. | Twain: Two-end association miner with precise frequent exhibition periods | |
US20160063394A1 (en) | Computing Device Classifier Improvement Through N-Dimensional Stratified Input Sampling | |
CN112084262A (zh) | 数据信息筛选方法、装置、计算机设备及存储介质 | |
CN111523921A (zh) | 漏斗分析方法、分析设备、电子设备及可读存储介质 | |
CN112016975A (zh) | 产品筛选方法、装置、计算机设备及可读存储介质 | |
CN113760864A (zh) | 数据模型的生成方法和装置 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
WWE | Wipo information: entry into national phase |
Ref document number: 201080002524.4 Country of ref document: CN |
WWE | Wipo information: entry into national phase |
Ref document number: 874/CHENP/2011 Country of ref document: IN |
WWE | Wipo information: entry into national phase |
Ref document number: 2734599 Country of ref document: CA |
WWE | Wipo information: entry into national phase |
Ref document number: 2010795826 Country of ref document: EP |
NENP | Non-entry into the national phase |
Ref country code: DE |