WO2005029369A2 - Data profiling - Google Patents

Data profiling

Info

Publication number
WO2005029369A2
WO2005029369A2 (PCT/US2004/030144)
Authority
WO
WIPO (PCT)
Prior art keywords
data
field
values
fields
records
Prior art date
Application number
PCT/US2004/030144
Other languages
French (fr)
Other versions
WO2005029369A9 (en)
WO2005029369A3 (en)
Inventor
Joel Gould
Carl Feynman
Paul Bay
Original Assignee
Ab Initio Software Corporation
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority to EP10009234.5A priority Critical patent/EP2261821B1/en
Priority to EP04784113A priority patent/EP1676217B1/en
Priority to JP2006526986A priority patent/JP5328099B2/en
Priority to AT04784113T priority patent/ATE515746T1/en
Priority to AU2004275334A priority patent/AU2004275334B9/en
Priority to CA002538568A priority patent/CA2538568C/en
Application filed by Ab Initio Software Corporation filed Critical Ab Initio Software Corporation
Publication of WO2005029369A2 publication Critical patent/WO2005029369A2/en
Publication of WO2005029369A3 publication Critical patent/WO2005029369A3/en
Publication of WO2005029369A9 publication Critical patent/WO2005029369A9/en
Priority to HK06114200.1A priority patent/HK1093568A1/en
Priority to AU2009200294A priority patent/AU2009200294A1/en
Priority to AU2009200293A priority patent/AU2009200293B2/en

Classifications

    • G06F 17/00 Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F 16/2365 Ensuring data consistency and integrity
    • G06F 16/2465 Query processing support for facilitating data mining operations in structured databases
    • G06F 16/24544 Join order optimisation
    • G06F 16/24554 Unary operations; Data partitioning operations
    • G06F 16/24556 Aggregation; Duplicate elimination
    • G06F 16/252 Integrating or interfacing systems involving database management systems between a Database Management System and a front-end application
    • G06F 16/258 Data format conversion from or to a database
    • G06F 16/285 Clustering or classification
    • G06F 17/40 Data acquisition and logging

Definitions

  • This invention relates to data profiling.
  • Stored data sets often include data for which various characteristics are not known beforehand. For example, ranges of values or typical values for a data set, relationships between different fields within the data set, or functional dependencies among values in different fields, may be unknown.
  • Data profiling can involve examining a source of a data set in order to determine such characteristics.
  • One use of data profiling systems is to collect information about a data set which is then used to design a staging area for loading the data set before further processing. Transformations necessary to map the data set to a desired target format and location can then be performed in the staging area based on the information collected in the data profiling.
  • The invention features a method and corresponding software and a system for processing data.
  • Data from a data source is profiled. This profiling includes reading the data from the data source, computing summary data characterizing the data while reading the data, and storing profile information that is based on the summary data.
  • The data is then processed from the data source. This processing includes accessing the stored profile information and processing the data according to the accessed profile information.
  • The invention features a method for processing data. Data from a data source is profiled.
  • Profiling includes reading the data from the data source, computing summary data characterizing the data while reading the data, and storing profile information that is based on the summary data.
  • Profiling the data includes profiling the data in parallel, including partitioning the data into parts and processing the parts using separate ones of a first set of parallel components.
  • Processing the data from the data source includes reading the data from the data source.
  • Profiling the data is performed without maintaining a copy of the data outside the data source.
  • The data can include records with a variable record structure such as conditional fields and/or variable numbers of fields.
  • Computing summary data while reading the data includes interpreting the variable record structure records while computing summary data characterizing the data.
  • The data source includes a data storage system, such as a database system, or a serial or parallel file system.
  • Computing the summary data includes counting a number of occurrences for each of a set of distinct values for a field.
  • The profile information can include statistics for the field based on the counted number of occurrences for said field.
  • A metadata store that contains metadata related to the data source is maintained. Storing the profile information can include updating the metadata related to the data source.
  • Profiling the data and processing the data can each make use of metadata for the data source.
  • Profiling data from the data source further includes determining a format specification based on the profile information. It can also include determining a validation specification based on the profile information. Invalid records can be identified during the processing of the data based on the format specification and/or the validation specification.
  • Data transformation instructions are specified based on the profile information.
  • Processing the data can then include applying the transformation instructions to the data.
  • Processing the data includes importing the data into a data storage subsystem.
  • the data can be validated prior to importing the data into a data storage subsystem. Such validating of the data can include comparing characteristics of the data to reference characteristics for the data, such as by comparing statistical properties of the data.
  • the profiling of the data can be performed in parallel. This can include partitioning the data into parts and processing the parts using separate ones of a first set of parallel components. Computing the summary data for different fields of the data can include using separate ones of a second set of parallel components. Outputs of the first set of parallel components can be repartitioned to form inputs for the second set of parallel components.
  • The invention features a method and corresponding software and a system for processing data.
  • Information characterizing values of a first field in records of a first data source and information characterizing values of a second field in records of a second data source are accepted.
  • Quantities characterizing a relationship between the first field and the second field are then computed based on the accepted information.
  • Information relating the first field and the second field is presented.
  • Aspects of the invention can include one or more of the following features.
  • The information relating the first field and the second field is presented to a user.
  • The first data source and the second data source are either the same data source, or are separate data sources.
  • Either or both of the data sources can be a database table or a file.
  • The quantities characterizing the relationship include quantities characterizing joint characteristics of the values of the first field and of the second field.
  • The information characterizing the values of the first field (or similarly of the second field) includes information characterizing a distribution of values of that field. Such information may be stored in a data structure, such as a "census" data structure.
  • The information characterizing the distribution of values of the first field can include multiple data records, each associating a different value and a corresponding number of occurrences of that value in the first field in the first data source.
  • Information characterizing the distribution of values of the second field can include multiple records of the same or similar format.
  • The information characterizing the distribution of values of the first field and of the second field is processed to compute quantities related to multiple different categories of co-occurrence of values.
  • The quantities related to the categories of co-occurrence of values include multiple data records, each associated with one of the categories of co-occurrence and including a number of different values in the first and the second fields that are in that category.
  • Information characterizing a distribution of values in a "join" of the first data source and the second data source on the first field and the second field, respectively, is computed. This computation can include computing quantities related to a plurality of categories of co-occurrence of values.
  • Examples of such categories include values that occur at least once in one of the first and the second fields but not in the other of the fields, values that occur exactly once in each of the first and the second fields, values that occur exactly once in one of the first and the second fields and more than once in the other of the fields, and values that occur more than once in each of the first and the second fields.
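  • As an illustration, the following is a minimal Python sketch (with hypothetical names; the patent does not specify an implementation) of how distinct values might be classified into these four co-occurrence categories given per-field value counts:

```python
from collections import Counter

def cooccurrence_categories(counts_a: Counter, counts_b: Counter) -> Counter:
    """Classify each distinct value by how often it occurs in two fields.

    counts_a / counts_b map value -> number of occurrences in field A / field B.
    """
    categories = Counter()
    for value in counts_a.keys() | counts_b.keys():
        a, b = counts_a.get(value, 0), counts_b.get(value, 0)
        if a == 0 or b == 0:
            categories["one_side_only"] += 1    # at least once in one field, absent from the other
        elif a == 1 and b == 1:
            categories["unique_in_both"] += 1   # exactly once in each field
        elif a == 1 or b == 1:
            categories["unique_in_one"] += 1    # once in one field, more than once in the other
        else:
            categories["common_in_both"] += 1   # more than once in each field
    return categories

# Example: values of field A vs. values of field B
a = Counter(["x", "y", "y", "z"])
b = Counter(["y", "z", "z", "w"])
print(cooccurrence_categories(a, b))
```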
  • The steps of accepting information characterizing values and computing quantities characterizing joint characteristics of the values are repeated for multiple different pairs of fields, one field from the first data source and the other field from the second data source.
  • Information relating the fields of one or more of the plurality of pairs of fields can then be presented to the user. Presenting the information relating the fields of one or more of the pairs of fields includes identifying candidate types of relationships of fields.
  • The invention features a method and corresponding software and a system for processing data.
  • A plurality of subsets of fields of data records of a data source are identified. Co-occurrence statistics are determined for each of the plurality of subsets.
  • One or more of the plurality of subsets is identified as having a functional relationship among the fields of the identified subset.
  • Aspects of the invention can include one or more of the following features. At least one of the subsets of fields is a subset of two fields.
  • Identifying one or more of the plurality of subsets as having a functional relationship among the fields of the identified subset includes identifying one or more of the plurality of subsets as having one of a plurality of possible predetermined functional relationships. Determining the co-occurrence statistics includes forming data elements each identifying a pair of fields and identifying a pair of values occurring in the pair of fields in one of the data records.
  • Determining the co-occurrence statistics includes partitioning the data records into parts, the data records having a first field and a second field, determining a quantity based on a distribution of values that occur in the second field of one or more records in a first of the parts, the one or more records having a common value occurring in a first field of the one or more records, and combining the quantity with other quantities from records in other of the parts to generate a total quantity.
  • Identifying one or more of the plurality of subsets as having a functional relationship among the fields of the identified subset includes identifying a functional relationship between the first and second fields based on the total quantity.
  • The parts are based on values of the first field and of the second field.
  • The parts are processed using separate ones of a set of parallel components.
  • Identifying one or more of the plurality of subsets as having a functional relationship among the fields of the identified subset includes determining a degree of match to the functional relationship.
  • The degree of match includes a number of exceptional records that are not consistent with the functional relationship.
  • The functional relationship includes a mapping of at least some of the values of a first field onto at least some of the values of a second field.
  • The mapping can be, for example, a many-to-one mapping, a one-to-many mapping, or a one-to-one mapping.
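  • As a hedged sketch of how such a mapping might be identified (the patent does not prescribe this code), the following Python classifies the relationship between two fields from their co-occurring value pairs, tolerating a configurable number of exceptional values:

```python
from collections import defaultdict

def classify_mapping(pairs, max_exceptions=0):
    """Classify the field1 -> field2 relationship over (v1, v2) co-occurrence pairs.

    Returns 'one-to-one', 'many-to-one', 'one-to-many', or None. Up to
    max_exceptions values may violate the dependency in each direction
    (a simplified stand-in for the patent's degree-of-match idea).
    """
    forward, backward = defaultdict(set), defaultdict(set)
    for v1, v2 in pairs:
        forward[v1].add(v2)
        backward[v2].add(v1)
    fwd_exceptions = sum(1 for vs in forward.values() if len(vs) > 1)
    bwd_exceptions = sum(1 for vs in backward.values() if len(vs) > 1)
    fwd_ok = fwd_exceptions <= max_exceptions   # field1 determines field2
    bwd_ok = bwd_exceptions <= max_exceptions   # field2 determines field1
    if fwd_ok and bwd_ok:
        return "one-to-one"
    if fwd_ok:
        return "many-to-one"
    if bwd_ok:
        return "one-to-many"
    return None

print(classify_mapping([("02114", "Boston"), ("02215", "Boston"), ("12207", "Albany")]))
# -> 'many-to-one' (several zip codes map onto one city)
```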
  • The method further includes filtering the plurality of subsets based on information characterizing values in fields of the plurality of subsets.
  • The data records include records of one or more database tables. Aspects of the invention can include one or more of the following advantages.
  • Aspects of the invention provide advantages in a variety of scenarios.
  • For example, a developer may use an input data set to test an application under development.
  • The output of the application run using the test data set is compared against expected test results, or inspected manually.
  • When the application is run on production data, however, the results are usually too large to be verified by inspection.
  • Data profiling can be used to verify the application behavior. Instead of inspecting every record produced by running the application using production data, a profile of the output is inspected. The data profiling can detect invalid or unexpected values, as well as unexpected patterns or distributions in the output that could signal an application design problem.
  • Data profiling can be used as part of a production process.
  • Input data that is part of a regular production run can be profiled.
  • A processing module can load the profiling results and verify that the input data meets certain quality metrics. If the input data looks bad, the production run can be cancelled and the appropriate people alerted.
  • Another scenario is a periodic audit of a large collection of data, e.g., hundreds of database tables in multiple sets of data.
  • Data profiling can be performed every night on a subset of the data.
  • The data that is profiled can be cycled such that all of the data is profiled, e.g., once a quarter, so that every database table is profiled four times a year.
  • The data profiling can be performed automatically.
  • The data profiling can be performed from a script (e.g., a shell script) and integrated with other forms of processing.
  • Results of the data profiling can be automatically published, e.g., in a form that can be displayed in a web browser, without having to manually post-process the results or run a separate reporting application.
  • Operating on information characterizing values of the records in the data sources rather than necessarily operating directly on the records of the data sources themselves can reduce the amount of computation considerably.
  • Using census data rather than the raw data records reduces the complexity of computing characteristics of a join on two fields: the computation scales with the product of the numbers of unique values in the two data sources rather than with the product of the numbers of data records in the two data sources.
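  • A short sketch of why this holds: with census data (value to count, per side), join characteristics can be computed by iterating over unique values only, since a value appearing m times in one field and n times in the other contributes m*n joined pairs. The names below are illustrative, not from the patent:

```python
from collections import Counter

def join_size_from_census(census_a: Counter, census_b: Counter) -> int:
    """Size of the inner join of two fields computed from census data alone.

    Each shared value v contributes census_a[v] * census_b[v] joined record
    pairs; the loop runs over unique values, not over raw records.
    """
    return sum(n * census_b[v] for v, n in census_a.items() if v in census_b)

a = Counter({"x": 3, "y": 1})       # field A: 'x' occurs 3 times, 'y' once
b = Counter({"x": 2, "z": 5})       # field B: 'x' occurs twice, 'z' 5 times
print(join_size_from_census(a, b))  # 3 * 2 = 6 joined pairs
```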
  • Profiling the data without maintaining a copy of the data outside the data source can avoid potential for errors associated with maintaining duplicate copies and avoids using extra storage space for a copy of the data.
  • The operations may be parallelized according to data value, thereby enabling efficient distributed processing. Quantities characterizing a relationship between fields can provide an indication of which fields may be related by different types of relationships. The user may then be able to examine the data more closely to determine whether the fields truly form that type of relationship.
  • Determining co-occurrence statistics for each of a plurality of subsets of fields of data records of a data source enables efficient identification of potential functional relationships among the fields. Aspects of the invention can be useful in profiling data sets with which the user is not familiar. The information that is automatically determined, or which is determined in cooperation with the user, can be used to populate metadata for the data sources, which can then be used for further processing. Other features and advantages of the invention are apparent from the following description, and from the claims.

Description of Drawings
  • FIG. 1 is a block diagram of a system that includes a data profiling module.
  • FIG. 2 is a block diagram that illustrates the organization of objects in a metadata store used for data profiling.
  • FIG. 3 is a profiling graph for the profiling module.
  • FIG. 4 is a tree diagram of a hierarchy for a type object used to interpret a data format.
  • FIGS. 5A-C are diagrams that illustrate sub-graphs implementing the make census component, analyze census component, and make samples component of the profiling graph.
  • FIG. 6 is a flowchart for a rollup procedure.
  • FIG. 7 is a flowchart for a canonicalize procedure.
  • FIGS. 8A-C are example user interface screen outputs showing profiling results.
  • FIG. 9 is a flowchart of an exemplary profiling procedure.
  • FIG. 10 is a flowchart of an exemplary profiling procedure.
  • FIGS. 11 A-B are two examples of a join operation performed on records from two pairs of fields.
  • FIGS. 12A-B are two examples of a census join operation on census records from two pairs of fields.
  • FIG. 13 is an example of extended records used to perform a single census join operation on two pairs of fields.
  • FIG. 14 is an extend component used to generate extended records.
  • FIGS. 15A-C are graphs used to perform joint-field analysis.
  • FIG. 16 is an example table with fields having a functional dependency relationship.
  • FIG. 17 is a graph used to perform functional dependency analysis.

Description
  • A data processing system 10 includes a profiling and processing subsystem 20, which is used to process data from data sources 30 and to update a metadata store 112 and a data store 124 in a data storage subsystem 40.
  • The stored metadata and data are then accessible to users using an interface subsystem 50.
  • Data sources 30 in general include a variety of individual data sources, each of which may have unique storage formats and interfaces (for example, database tables, spreadsheet files, flat text files, or a native format used by a mainframe 110).
  • The individual data sources may be local to the profiling and processing sub-system 20, for example, being hosted on the same computer system (e.g., file 102), or may be remote to the profiling and processing sub-system 20.
  • Data storage sub-system 40 includes a data store 124 as well as a metadata store 112.
  • Metadata store 112 includes information related to data in data sources 30 as well as information about data in data store 124. Such information can include record formats as well as specifications for determining the validity of field values in those records (validation specifications).
  • The metadata store 112 can be used to store initial information about a data set in data sources 30 to be profiled, as well as information obtained during the profiling process about such a data set and about data sets in data store 124 derived from that data set.
  • The data store 124 can be used to store data that has been read from the data sources 30 and optionally transformed using information derived from data profiling.
  • The profiling and processing subsystem 20 includes a profiling module 100, which reads data directly from a data source in units of discrete work elements, such as individual records, without necessarily landing a complete copy of the data to a storage medium before profiling.
  • A record is associated with a set of data fields, each field having a particular value for each record (including possibly a null value).
  • The records in a data source may have a fixed record structure in which each record includes the same fields.
  • Records may have a variable record structure, for example, including variable length vectors or conditional fields.
  • The records are processed without necessarily storing a "flattened" (i.e., fixed record structure) copy of the data prior to profiling.
  • When first reading data from a data source, the profiling module 100 typically starts with some initial format information about records in that data source. (Note that in some circumstances, even the record structure of the data source may not be known.)
  • This information about records of a data source is specified in a data manipulation language (DML) file that is stored in a metadata store 112.
  • The profiling module 100 can use predefined DML files to automatically interpret data from a variety of common data system formats (e.g., SQL tables, XML files, CSV files) or use a DML file obtained from the metadata store 112 describing a customized data system format. Partial, possibly inaccurate, initial information about records of a data source may be available to the profiling and processing subsystem 20 prior to the profiling module 100's initial reading of the data. For example, a COBOL copy book associated with a data source may be available as stored data 114, or entered by a user 118 through a user interface 116.
  • Such existing information is processed by a metadata import module 115 and stored in the metadata store 112 and/or used to define the DML file used to access the data source.
  • As the profiling module 100 reads records from a data source, it computes statistics and other descriptive information that reflect the contents of the data set.
  • The profiling module 100 then writes those statistics and descriptive information in the form of a "profile" into the metadata store 112, which can then be examined through the user interface 116 or any other module with access to the metadata store 112.
  • The statistics in the profile preferably include a histogram of values in each field, maximum, minimum, and mean values, and samples of the least common and most common values.
  • The statistics obtained by reading from the data source can be put to a variety of uses.
  • The metadata store 112 is able to store validation information associated with each profiled field, for example as a validation specification that encodes the validation information.
  • The validation information can be stored in an external storage location and retrieved by the profiling module 100.
  • The validation information may specify a valid data type for each field. For example, if a field is a person's "title", a default valid value may be any value that is a "string" data type.
  • A user may also supply valid values such as "Mr.", "Mrs." and "Dr." prior to profiling the data source, so that any other value read by the profiling module 100 would be identified as invalid.
  • Information obtained from a profiling run can also be used by a user to specify valid values for a particular field. For example, the user may find that after profiling a data set the values "Ms." and "Msr." appear as common values. The user may add "Ms." as a valid value, and map the value "Msr." to the value "Mrs." as a data cleaning option.
  • The validation information can include valid values and mapping information to permit cleaning of invalid values by mapping them onto valid values.
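  • A minimal sketch of such a validation specification (the values and function names are illustrative assumptions, not the patent's format):

```python
# Valid values for a "title" field, plus a cleaning map from known-bad
# values onto their valid replacements.
VALID_TITLES = {"Mr.", "Mrs.", "Ms.", "Dr."}
CLEANING_MAP = {"Msr.": "Mrs."}

def clean_title(value: str) -> tuple[str, bool]:
    """Return (possibly cleaned value, is_valid flag)."""
    value = CLEANING_MAP.get(value, value)  # map invalid values onto valid ones
    return value, value in VALID_TITLES

print(clean_title("Msr."))  # ('Mrs.', True): cleaned
print(clean_title("Rev."))  # ('Rev.', False): flagged as invalid
```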
  • The profiling of a data source may be undertaken in an iterative manner as more information about the data source is discovered through successive profiling runs.
  • The profiling module 100 can also generate executable code to implement other modules that can access the profiled data systems.
  • A processing module 120 can include code generated by the profiling module 100. An example of such code might map a value "Msr." to "Mrs." as part of the access procedure to the data source.
  • The processing module 120 may run in the same runtime environment as the profiling module 100, and preferably can communicate with the metadata store 112 to access a profile associated with a data set.
  • The processing module 120 can read the same data formats as the profiling module 100 (e.g., by obtaining the same DML file from the metadata store 112).
  • The processing module 120 can use the data set profile to obtain values used to validate or clean incoming records before storing them in a data store 124. Similar to the profiling module 100, the processing module 120 also reads data directly from a data system in units of discrete work elements. This "data flow" of work elements has the benefit of allowing the data profiling to be performed on large data sets without necessarily copying data to local storage (e.g., a disk drive). This data flow model, described in more detail below, also allows complex data transformations to be performed by a processing module without the source data being first copied to a staging area, potentially saving storage space and time.
  • The profiling module 100 uses the metadata store 112 to organize and store various metadata, profiling preferences, and results in data objects.
  • The metadata store 112 may store a group of profile setup objects 201, each for information related to a profiling job, a group of data set objects 207, each for information related to a data set, and a group of DML files 211, each describing a particular data format.
  • A profile setup object contains preferences for a profiling run executed by the profiling module 100.
  • A user 118 can enter information used to create a new profile setup object or select a pre-stored profile setup object 200.
  • The profile setup object 200 contains a reference 204 to a data set object 206.
  • A data set object 206 contains a data set locator 202 which enables the profiling module 100 to locate data to be profiled on one or more data systems accessible within the runtime environment.
  • The data set locator 202 is typically a path/filename, URL, or a list of path/filenames and/or URLs for a data set spread over multiple locations.
  • The data set object 206 can optionally contain a reference 208 to one or more DML files 210.
  • The DML file(s) 210 may be pre-selected based on knowledge about the format of data in a data set, or may be specified at runtime by a user.
  • The profiling module 100 can obtain an initial portion of the data set and present to the user over the user interface 116 an interpretation of the initial portion based on a default DML file. The user may then modify the default DML file specification based on an interactive view of the interpretation. More than one DML file may be referenced if the data set includes data with multiple formats.
  • The data set object 206 contains a reference 212 to a set of field objects 214. There is one field object for each field within the records of the data set to be profiled.
  • A data set profile 216 is contained within the data set object 206 corresponding to the data set that was profiled.
  • The data set profile 216 contains statistics that relate to the data set, such as total number of records and total number of valid/invalid records.
  • A field object 218 can optionally contain validation information 220 that can be used by the profiling module 100 to determine valid values for the corresponding field, and specify rules for cleaning invalid values (i.e., mapping invalid values onto valid values).
  • The field object 218 also contains a field profile 222, stored by the profiling module 100 upon completion of a profiling run, which contains statistics that relate to the corresponding field, such as numbers of distinct values, null values, and valid/invalid values.
  • The field profile 222 can also include sample values such as maximum, minimum, most common, and least common values.
  • A complete "profile" includes the data set profile 216 and field profiles for all of the profiled fields.
  • Other user preferences for a profiler run can be collected and stored in the profile setup object 200, or in the data set object 206.
  • The user can select a filter expression which can be used to limit the fields or number of values profiled, including profiling a random sample of the values (e.g., 1%).
  • The profiling module 100 executes in a runtime environment that allows data from the data source(s) to be read and processed as a flow of discrete work elements.
  • The computations performed by the profiling module 100 and processing module 120 can be expressed in terms of data flow through a directed graph, with components of the computations being associated with the vertices of the graph and data flows between the components corresponding to links (arcs, edges) of the graph.
  • A system that implements such graph-based computations is described in U.S. Patent 5,966,072, EXECUTING COMPUTATIONS EXPRESSED AS GRAPHS.
  • Graphs made in accordance with this system provide methods for getting information into and out of individual processes represented by graph components, for moving information between the processes, and for defining a running order for the processes.
  • This system includes algorithms that choose interprocess communication methods (for example, communication paths according to the links of the graph can use TCP/IP or UNIX domain sockets, or use shared memory to pass data between the processes).
  • The runtime environment also provides for the profiling module 100 to execute as a parallel process. The same type of graphic representation described above may be used to describe parallel processing systems.
  • Parallel processing systems include any configuration of computer systems using multiple central processing units (CPUs), either local (e.g., multiprocessor systems such as SMP computers), or locally distributed (e.g., multiple processors coupled as clusters or MPPs), or remote, or remotely distributed (e.g., multiple processors coupled via LAN or WAN networks), or any combination thereof.
  • The graphs will be composed of components (graph vertices) and flows (graph links). By explicitly or implicitly replicating elements of the graph (components and flows), it is possible to represent parallelism in a system.
  • A flow control mechanism is implemented using input queues for the links entering a component.
  • This flow control mechanism allows data to flow between the components of a graph without being written to non-volatile local storage, such as a disk drive, which is typically large but slow.
  • The input queues can be kept small enough to hold work elements in volatile memory, typically smaller and faster than non-volatile memory. This potential savings in storage space and time exists even for very large data sets.
  • Components can use output buffers instead of, or in addition to, input queues.
  • The upstream component sends work elements to the downstream component as long as the downstream component keeps consuming the work elements. If the downstream component falls behind, the upstream component will fill up the input queue of the downstream component and stop working until the input queue clears out again.
  • Computation graphs can be specified with various levels of abstraction. So a "sub-graph" containing components and links can be represented within another graph as a single component, showing only those links which connect to the rest of the graph.
  • A profiling graph 400 performs computations for the profiling module 100.
  • An input data set component 402 represents data from potentially several types of data systems. The data systems may have different physical media types (e.g., magnetic, optical, magneto-optical) and/or different data format types (e.g., binary, database, spreadsheet, ASCII string, CSV, or XML).
  • The input data set component 402 sends a data flow into a make census component 406.
  • The make census component 406 conducts a "census" of the data set, creating a separate census record for each unique field/value pair in the records that flow into the component.
  • Each census record includes a count of the number of occurrences of the unique field/value pair for that census record.
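  • In essence, the census is a count per unique field/value pair. A minimal Python sketch of this step (a serial stand-in for the parallel graph components described later):

```python
from collections import Counter

def make_census(records):
    """Count occurrences of each unique field/value pair across all records.

    records: iterable of dicts mapping field name -> value.
    Returns a Counter keyed by (field, value).
    """
    census = Counter()
    for record in records:
        for field, value in record.items():
            census[(field, value)] += 1
    return census

records = [
    {"title": "Mr.", "state": "MA"},
    {"title": "Mrs.", "state": "MA"},
    {"title": "Mr.", "state": "NY"},
]
for (field, value), count in sorted(make_census(records).items()):
    print(field, value, count)  # e.g. state MA 2
```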
  • The make census component 406 has a cleaning option which can map a set of invalid values onto valid values according to validation information stored in a corresponding field object.
  • The cleaning option can also store records having fields containing invalid values in a location represented by an invalid records component 408.
  • The invalid records can then be examined, for example, by a user wanting to determine the source of an invalid value.
  • The census records flowing out of the make census component 406 are stored in a file represented by the census file component 410. This intermediate storage of census records may, in some cases, increase efficiency for multiple graph components accessing the census records.
  • Alternatively, the census records can flow directly from the make census component 406 to an analyze census component 412 without being stored in a file.
  • The analyze census component 412 creates a histogram of values for each field and performs other analyses of the data set based on the census records.
  • The field profiles component 414 represents an intermediate storage location for the field profiles.
  • A load metadata store component 416 loads the field profiles and other profiling results into the corresponding objects in the metadata store 112.
  • The user interface 116 allows a user to browse through the analyzed data, for example, to see histograms or common values in fields.
  • A "drill-down" capability is provided, for example, to view specific records that are associated with a bar in a histogram.
  • The user can also update preferences through the user interface 116 based on results of the profiling.
  • The make samples component 418 stores a collection of sample records 420 representing a sampling of records associated with a value shown on the user interface 116 (e.g., associated with a bar in a histogram).
  • The phase break line 422 represents two phases of execution in the graph 400, such that the components on the right side of the line begin execution after all the components on the left side of the line finish execution. Therefore, the make samples component 418 executes after the analyze census component 412 finishes storing results in the field profiles component 414.
  • Sample records can be retrieved from a recorded location in the input data set 402.
  • The profiling module 100 can be initiated by a user 118 or by an automated scheduling program.
  • A master script collects any DML files and parameters to be used by the profiling graph 400 from the metadata store 112. Parameters can be obtained from objects such as the profile setup object 200, the data set object 206, and the field objects 218. If necessary, the master script can create a new DML file based on information supplied about the data set to be profiled. For convenience, the master script can compile the parameters into a job file. The master script may then execute the profiling graph 400 with the appropriate parameters from the job file, and present a progress display keeping track of the time elapsed and estimating time remaining before the profiling graph 400 completes execution. The estimated time remaining is calculated based on data (e.g., work elements) that is written to the metadata store 112 as the profiling graph 400 executes.
  • An import component implements the portion of the profiling module 100 that can interpret the data format of a wide variety of data systems.
  • The import component is configured to directly interpret some data formats without using a DML file.
  • For example, the import component can read data from a data system that uses structured query language (SQL), which is an ANSI standard computer language for accessing and manipulating databases.
  • Other data formats that are handled without use of a DML file are, for example, text files formatted according to an XML standard or using comma-separated values (CSV).
  • Otherwise, the import component uses a DML file specified in the profile setup object 200.
  • A DML file can specify various aspects of interpreting and manipulating data in a data set.
  • A DML file can specify the following for a data set:
  • type object - defines a correspondence between raw data and the values represented by the raw data.
  • key specifier - defines ordering, partitioning, and grouping relationships among records.
  • expression - defines a computation using values from constants, the fields of data records, or the results of other expressions to produce a new value.
  • transform function - defines collections of rules and other logic used to produce one or more output records from zero or more input records.
  • package - provides a convenient way of grouping type objects, transform functions, and variables that can be used by a component to perform various tasks.
  • A type object is the basic mechanism used to read individual work elements (e.g., individual records) from raw data in a data system.
  • The runtime environment provides access to a physical computer-readable storage medium (e.g., a magnetic, optical, or magneto-optical medium) as a string of raw data bits (e.g., mounted in a file system or flowing over a network connection).
  • The import component can access a DML file to determine how to read and interpret the raw data in order to generate a flow of work elements.
  • A type object 502 can be, for example, a base type 504 or a compound type 506.
  • A base type object 504 specifies how to interpret a string of bits (of a given length) as a single value.
  • The base type object 504 includes a length specification indicating the number of raw data bits to be read and parsed.
  • A length specification can indicate a fixed length, such as a specified number of bytes, or a variable length, specifying a delimiter (e.g., a specific character or string) at the end of the data, or a number of (potentially variable length) characters to be read.
  • A void type 514 represents a block of data whose meaning or internal structure is unnecessary to interpret (e.g., compressed data that will not be interpreted until after it is decompressed). The length of a void type 514 is specified in bytes.
  • A number type 516 represents a number and is interpreted differently if the number is designated an integer 524, real 526, or decimal 528, according to various encodings that are standard or native to a particular CPU.
  • A string type 518 is used to interpret text with a specified character set.
  • Date 520 and datetime 522 types are used to interpret a calendar date and/or time with a specified character set and other formatting information.
  • A compound type 506 is an object made up of multiple sub-objects which are themselves either base or compound types.
  • A vector type 508 is an object containing a sequence of objects of the same type (either a base or compound type).
  • A record type 510 is an object containing a sequence of objects, each of which can be a different base or compound type. Each object in the sequence corresponds to a value associated with a named field. Using a record type 510, a component can interpret a block of raw data to extract values for all of the fields of a record.
  • A union type 512 is an object similar to a record type 510 except that objects corresponding to different fields may interpret the same raw data bits as different values.
  • The union type 512 provides a way to have several interpretations of the same raw data.
  • The DML file also enables profiling of data with custom data types.
  • A user can define a custom type object by supplying a type definition in terms of other DML type objects, either base or compound.
  • A custom type object can then be used by the profiling module 100 to interpret data with a non-standard structure.
  • The DML file also enables profiling of data with a conditional structure. Records may only include some fields based on values associated with other fields.
  • A record may only include the field "spouse" if the value of the field "married" is "yes."
  • The DML file includes a rule for determining whether a conditional field exists for a given record. If the conditional field does exist in a record, the value of the field can be interpreted by a DML type object.
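  • The following Python sketch illustrates the idea of base and record type objects with a conditional field rule. The fixed-length layout, field names, and reader functions are assumptions for illustration, not the DML syntax:

```python
def read_string(raw: bytes, offset: int, length: int):
    """Base-type reader: interpret a fixed-length run of bytes as a string."""
    return raw[offset:offset + length].decode("ascii").rstrip(), offset + length

def read_person(raw: bytes):
    """Record-type reader whose "spouse" field is conditional on "married"."""
    record, offset = {}, 0
    record["name"], offset = read_string(raw, offset, 8)
    record["married"], offset = read_string(raw, offset, 3)
    if record["married"] == "yes":            # conditional field rule
        record["spouse"], offset = read_string(raw, offset, 8)
    return record

print(read_person(b"Alice   yesBob     "))  # includes 'spouse'
print(read_person(b"Carol   no "))          # no 'spouse' field
```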
  • The import component can be used by graphs to efficiently handle various types of record structures. The ability of the import component to interpret records with variable record structure such as conditional records or variable length vectors enables graphs to process such data without the need to first flatten such data into fixed length segments.
  • Another type of processing that can be performed by graphs using the import component is discovery of relationships between or among parts of the data (e.g., across different records, tables, or files). Graphs can use a rule within the import component to find a relationship between a foreign key or field in one table and a primary key or field in another table, or to perform functional dependency calculations on parts of the data.

4.2 Statistics
  • A sub-graph 600 implementing one embodiment of the make census component 406 includes a filter component 602 that passes a portion of incoming records based on a filter expression stored in the profile setup object 200.
  • The filter expression may limit the fields or number of values profiled.
  • An example of a filter expression is one that limits profiling to a single field of each incoming record (e.g., "title").
  • Another optional function of the filter component 602 is to implement the cleaning option described above, sending a sample of invalid records to the invalid records component 408. Records flowing out of the filter component 602 flow into a local rollup sequence stats component 604 and a partition by round-robin component 612.
  • The ability of the profiling graph 400 (and other graphs and sub-graphs) to run in parallel on multiple processors and/or computers, and the ability of the profiling graph 400 to read a parallel data set stored across multiple locations, are implicitly represented in the sub-graph 600 by line thicknesses of the components and symbols on the links between components.
  • The thick border of components representing storage locations such as the input data set component 402 indicates that it may optionally be a parallel data set.
  • The thick border of process components such as the filter component 602 indicates that the process may optionally be running in multiple partitions, with each partition running on a different processor or computer.
  • The user can indicate through the user interface 116 whether to run the optionally parallel graph components in parallel or serially.
  • A thin border indicates that a data set or process is serial.
  • The local rollup sequence stats component 604 computes statistics related to the sequential characteristics of the incoming records. For example, the component 604 may count the number of sequential pairs of records that have values for a field that increase, decrease, or increment by 1. In the case of parallel execution, the sequence statistics are calculated for each partition separately.
  • A rollup process involves combining information from multiple input elements (sequence statistics for the rollup process performed by this component 604) and producing a single output element in place of the combined input elements.
  • A gather link symbol 606 represents a combination or "gathering" of the data flows from any multiple partitions of a parallel component into a single data flow for a serial component.
  • FIG. 6 is a flowchart of an example of a process 700 for performing a rollup process, including the rollup processes performed by the local rollup sequence stats component 604 and the global rollup sequence stats component 608.
  • The process 700 begins by receiving 702 an input element.
  • The process 700 updates 704 information being compiled, and determines 706 whether there are any more elements to be compiled. If there are more elements, the process 700 receives 702 the next element and updates 704 the information accordingly.
  • If there are no more elements, the process 700 finalizes 708 the output element based on the compiled rollup information.
  • A rollup process can be used to consolidate a group of elements into a single element, or to determine aggregate properties of a group of elements (such as statistics of values in those elements).
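  • A generic sketch of this receive/update/finalize loop (illustrative names, not the patent's implementation):

```python
def rollup(elements, update, finalize, initial):
    """Fold a group of input elements into a single output element."""
    state = initial
    for element in elements:
        state = update(state, element)  # update the information being compiled
    return finalize(state)              # finalize the single output element

# Example: roll a group of numeric values up into (count, mean).
stats = rollup(
    [3, 5, 7],
    update=lambda s, x: (s[0] + 1, s[1] + x),
    finalize=lambda s: (s[0], s[1] / s[0]),
    initial=(0, 0.0),
)
print(stats)  # (3, 5.0)
```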
  • The partition by round-robin component 612 takes records from the single or multiple partitions of the input data set 402 and re-partitions the records among a number of parallel processors and/or computers (e.g., as selected by the user) in order to balance the work load among the processors and/or computers.
  • A cross-connect link symbol 614 represents the re-partitioning of the data flows (performed by the linked component 612).
  • The canonicalize component 616 takes in a flow of records and sends out a flow of census elements containing a field/value pair representing values for each field in an input record. (An input record with ten fields yields a flow of ten census elements.) Each value is converted into a canonical (i.e., according to a pre-determined format) human readable string representation. Also included in the census element are flags indicating whether the value is valid and whether the value is null (i.e., corresponds to a pre-determined "null" value). The census elements flow into a local rollup field/value component which (for each partition) takes occurrences of the same value for the same field and combines them into one census element including a count of the number of occurrences.
  • FIG. 7 is a flowchart of an example of a process 710 performed by the canonicalize component that can handle conditional records, which may not all have the same fields, to produce a flow of census elements containing field/value pairs.
  • The process 710 performs a nested loop which begins with getting 712 a new record. For each record, the process 710 gets 714 a field in that record and determines 716 whether that field is a conditional field.
  • If the field is conditional, the process 710 determines 718 whether that field exists for that record. If the field does exist, the process 710 canonicalizes 720 the record's value for that field and produces a corresponding output element containing a field/value pair. If the field does not exist, the process 710 proceeds to determine 722 whether there is another field or to determine 724 whether there is another record. If the field is not conditional, the process 710 canonicalizes 720 the record's value for that field (including possibly a null value) and proceeds to the next field or record.
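  • A simplified sketch of this nested loop, assuming absent conditional fields are represented as missing keys (an illustrative convention, not the patent's):

```python
def canonicalize_records(records, conditional_fields=()):
    """Yield (field, canonical_value, is_null) elements for each record.

    Conditional fields that do not exist in a record are skipped; all other
    values are converted to a canonical human-readable string form.
    """
    for record in records:                                      # outer loop: next record
        for field in record.keys() | set(conditional_fields):  # inner loop: next field
            if field in conditional_fields and field not in record:
                continue                                        # conditional field absent: skip
            value = record.get(field)
            yield field, str(value), value is None

records = [{"married": "yes", "spouse": "Bob"}, {"married": "no"}]
for element in canonicalize_records(records, conditional_fields=("spouse",)):
    print(element)
```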
  • The partition by field/value component 624 re-partitions the census elements by field and value so that the rollup process performed in the global rollup field/value component 626 can add the occurrences calculated in different partitions to produce a total occurrence count in a single census element for each unique field/value pair contained within the profiled records.
  • The global rollup field/value component 626 processes these census elements in potentially multiple partitions for a potentially parallel file represented by the census file component 410.
  • FIG. 5B is a diagram that illustrates a sub-graph 630 implementing the analyze census component 412 of the profiling graph 400.
  • A partition by field component 632 reads a flow of census elements from the census file component 410 and re-partitions the census elements according to a hash value based on the field, such that census records with the same field (but different values) are in the same partition.
  • The partition into string, number, date component 634 further partitions the census elements according to the type of the value in the census element. Different statistics are computed (using a rollup process) for values that are strings (in the rollup string component 636), numbers (in the rollup number component 638), or dates/datetimes (in the rollup date component 640). For example, it may be appropriate to calculate average and standard deviation for a number, but not for a string.
  • The results are gathered from all partitions, and the compute histogram/decile info component 642 provides information useful for constructing histograms (e.g., maximum and minimum values for each field) to a compute buckets component 654, and information useful for calculating decile statistics (e.g., the number of values for each field) to a compute deciles component 652.
  • The components of the sub-graph 630 that generate the histograms and decile statistics (below the phase break line 644) execute after the compute histogram/decile info component 642 (above the phase break line 644) finishes execution.
  • The sub-graph 630 constructs a list of values at decile boundaries (e.g., the value larger than 10% of values, 20% of values, etc.) by: sorting the census elements by value within each partition (in the sort component 646), re-partitioning the census elements according to the sorted value (in the partition by values component 648), and merging the elements into a sorted (serial) flow into the compute deciles component 652.
  • The compute deciles component 652 counts the sorted values for each field in groups of one tenth of the total number of values in that field to find the values at the decile boundaries.
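  • A serial sketch of this decile computation over census elements (value, count) in sorted value order, so the raw records never need to be expanded (names are illustrative):

```python
def decile_boundaries(sorted_census, total):
    """Return the values at each tenth of the cumulative occurrence count."""
    boundaries, running, next_decile = [], 0, total // 10
    for value, count in sorted_census:
        running += count
        while running >= next_decile and len(boundaries) < 9:
            boundaries.append(value)   # value at a 10%, 20%, ... boundary
            next_decile += total // 10
        if len(boundaries) == 9:
            break
    return boundaries

census = [(1, 30), (2, 20), (3, 40), (4, 10)]  # (value, occurrences), total 100
print(decile_boundaries(census, total=100))    # [1, 1, 1, 2, 2, 3, 3, 3, 3]
```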
  • The sub-graph 630 constructs histograms for each field by: calculating the values defining each bracket of values (or "bucket"), counting values within each partition falling in the same bucket (in the local rollup histogram component 656), and counting values within each bucket from all the partitions (in the global rollup histogram component 658).
  • A combine field profile parts component 660 then collects all of the information for each field profile, including the histograms, decile statistics, and the sequence statistics from the temporary file 610, into the field profiles component 414.
  • FIG. 5C is a diagram that illustrates a sub-graph 662 implementing the make samples component 418 of the profiling graph 400.
  • A partition by round-robin component 664 takes records from the single or multiple partitions of the input data set 402 and re-partitions the records among a number of parallel processors and/or computers in order to balance the work load among the processors and/or computers.
  • A lookup and select component 666 uses information from the field profiles component 414 to select sample records.
  • For each record, the lookup and select component 666 computes a random selection number that determines whether the record is selected to represent a sample type. For example, for a total of five sample records of a particular sample type, if the selection number is one of the five largest seen so far (of a particular sample type within a single partition) then the corresponding record is passed as an output along with information indicating what value(s) may correspond to drill-down viewing.
  • The next component is a partition by sample type component 668, which re-partitions the records according to sample type so that the sort component 670 can sort by selection number within each sample type.
  • The scan component 672 selects the five records with the largest selection numbers (among all the partitions) for each sample type.
  • The write/link sample component 674 then writes these sample records to a sample records file 420 and links the records to the corresponding values in the field profiles component 414.
  • The load metadata store component 416 loads the data set profile from the temporary file component 620 into a data set profile 216 object in the metadata store 112, and loads each of the field profiles from the field profiles component 414 into a field profile 222 object in the metadata store 112.
  • The user interface 116 can then retrieve the profiling results for a data set and display them to a user 118 on a screen produced by the user interface 116.
  • A user can browse through the profile results to see histograms or common values for fields.
  • A drill-down capability may be provided, for example, to view specific records that are associated with a bar in a histogram.
  • FIGS. 8A-C are example user interface screen outputs showing profiling results.
  • FIG. 8A shows results from a data set profile 216.
  • FIGS. 8B-C show results from an exemplary field profile 222.
  • a selection of values, such as most common values 806, and most common invalid values 808, are displayed in various forms including: the value itself as a human readable string 810, a total count of occurrences of the value 812, occurrences as a percentage of the total number of values 814, and a bar chart 816.
  • a histogram of values 818 is displayed showing a bar for each of multiple buckets spanning the range of values, including buckets with counts of zero. The decile boundaries 820 are also displayed.
  • FIG. 9 shows a flowchart for an example of a procedure 900 for profiling a data set to discover its contents before using it in another process.
  • the procedure 900 can be performed automatically (e.g., by a scheduling script) or manually (e.g., by a user at a terminal).
  • the procedure 900 first identifies 902 a data set to be profiled on one or more data systems accessible within the runtime environment.
  • the procedure 900 may then optionally set a record format 904 and set validation rules 906 based on supplied information or existing metadata. For some types of data, such as a database table, a default record format and validation rules may be used.
  • the procedure 900 then runs 908 a profile on the data set (or a subset of the data set).
  • the procedure 900 can refine 910 the record format, or refine 912 the validation rules based on the results of the initial profile. If any profiling options have changed, the procedure 900 then decides 914 whether to run another profile on the data using the new options, or to process 916 the data set if enough information about the data set has been obtained from the (possibly repeated) profiling. The process would read directly from the one or more data systems using the information obtained from the profiling.
  • FIG. 10 shows a flowchart for an example of a procedure 1000 for profiling a data set to test its quality before transforming and loading it into a data store.
  • the procedure 1000 can be performed automatically or manually. Rules for testing the quality of a data set can come from prior knowledge of the data set, and/or from results of a profiling procedure such as procedure 900 performed on a similar data set (e.g., a data set from the same source as the data set to be tested).
  • This procedure 1000 can be used by a business, for example, to profile a periodic (e.g., monthly) data feed sent from a business partner before importing or processing the data.
  • the procedure 1000 first identifies 1002 a data set to be tested on one or more data systems accessible within the runtime environment. The procedure 1000 then runs 1004 a profile on the data set (or a subset of the data set) and performs 1006 a quality test based on results of the profile.
  • "bad” data e.g., data with a percentage of invalid values higher than a threshold
  • a percentage of occurrences of a particular common value in the data set can be compared with a percentage of occurrences of the common value in a prior data set (based on a prior profiling run), and if the percentages differ from each other by more than 10%, the quality test fails.
  • This quality test could be applied to a value in a series of data sets that is known to occur consistently (within 10%); a sketch of such a check follows.
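The comparison can be sketched as follows, interpreting the 10% tolerance as a difference between the value's share of all values in the two runs; the profile structure, field name, and counts are assumptions:

```python
def common_value_drifts(current, prior, field, value, tolerance=0.10):
    """True when the value's share differs between runs by more than the tolerance."""
    cur_share = current[field][value] / sum(current[field].values())
    old_share = prior[field][value] / sum(prior[field].values())
    return abs(cur_share - old_share) > tolerance

prior = {"state": {"CA": 60, "FL": 30, "TX": 10}}      # prior profiling run
current = {"state": {"CA": 35, "FL": 50, "TX": 15}}    # current profiling run
if common_value_drifts(current, prior, "state", "CA"):
    print("quality test failed: generate flag")        # e.g., log entry or UI prompt
```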
  • the procedure 1000 determines 1008 the results of the quality test, and generates 1010 a flag (e.g., a user interface prompt or an entry in a log file) upon failure.
  • the procedure 1000 then reads directly from the one or more data systems and transforms (possibly using information from the profile) and loads 1012 data from the data set into a data store. The procedure can then repeat by identifying 1002 another data set.
  • the profiling module 100 can generate executable code such as a graph component that can be used to process a flow of records from a data set.
  • the generated component can filter incoming records, allowing only valid records to flow out, similar to the cleaning option of the profiling graph 400, as in the sketch below.
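A minimal sketch of what such a generated clean component might do, assuming validation rules are captured as one predicate per field; the rule format and all names are illustrative, not the generated code itself:

```python
def make_clean_component(rules):
    def clean(records):
        for rec in records:
            if all(rule(rec.get(field)) for field, rule in rules.items()):
                yield rec                        # only valid records flow out
    return clean

# hypothetical validation rule captured during a profiling run
rules = {"zip": lambda v: isinstance(v, str) and len(v) == 5 and v.isdigit()}
clean = make_clean_component(rules)
print(list(clean([{"zip": "90019"}, {"zip": "9001"}])))  # second record is dropped
```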
  • the user can select a profiling option that indicates that a clean component should be generated upon completion of a profiling run.
  • Code for implementing the component is directed to a file location (specified by the user).
  • the generated clean component can then run in the same runtime environment as the profiling module 100 using information stored in the metadata store 112 during the profiling run.
  • the profiling module 100 can optionally perform an analysis of relationships within one or more groups of fields.
  • the profiling module 100 is able to perform an analysis between a pair of fields, which may be in the same or in different data sets.
  • the profiling module is able to perform analysis on a number of pairs of fields, for example analyzing every field in one data set with every field in another data set, or every field in one data set with every other field in the same data set.
  • An analysis of two fields in different data sets is related to the characteristics of a join operation on the two data sets on those fields, as described in more detail below.
  • a join operation is performed on two data sets (e.g., files or tables).
  • the information in the census file can be used to perform the joint-field analysis between fields in two different profiled data sets, or between fields in two different parts of the same profiled data set (or any other data set for which a census file exists).
  • the result of joint-field analysis includes information about potential relationships between the fields. Three types of relationships that are discovered are a "common domain" relationship, a "joins well" relationship, and a "foreign key" relationship.
  • a pair of fields is categorized as having one of these three types of relationships if results of the joint-field analysis meet certain criteria, as described below.
  • the joint-field analysis includes compiling information such as the number of records produced from a join operation performed using the two fields as key fields.
  • FIGS. 11A-B illustrate examples of a join operation performed on records from two database tables. Each of Table A and Table B has two fields labeled "Field 1" and "Field 2," and four records.
  • a join component 1100 compares values from a key field of records from Table A with values from a key field of records from Table B.
  • in the example of FIG. 11A, the key field for Table A is Field 1 and the key field for Table B is also Field 1: the join component 1100 compares values 1102 from Table A, Field 1 (A1) with values 1104 from Table B, Field 1 (B1).
  • the join component 1100 receives flows of input records 1110 from the tables, and, based on the comparison of key-field values, produces a flow of joined records 1112 forming a new joined table, Table C.
  • for each pair of matching key-field values in the input flows, the join component 1100 produces a joined record that is a concatenation of the records with the matching key-field values.
  • the number of joined records with a particular key-field value that exit on the joined output port 1114 is the product of the number of records with that key-field value in each of the inputs, respectively.
  • the input flows of records 1110 are shown labeled by the value of their respective key fields, and the output flow of joined records 1112 are shown labeled by the matched values.
  • the profiling module 100 compiles statistics of joined and rejected values for categorizing the relationship between two fields. The statistics are summarized in an occurrence chart 1118 that categorizes occurrences of values in the two fields. An "occurrence number” represents the number of times a value occurs in a field.
  • the columns of the chart correspond to occurrence numbers 0, 1, and N (where N > 1) for the first field (from Table A in this example), and the rows of the chart correspond to occurrence numbers 0, 1, and N (where N > 1) for the second field (from Table B in this example).
  • the boxes in the chart contain counts associated with the corresponding pattern of occurrence: 'column occurrence number' x 'row occurrence number'. Each box contains two counts: the number of distinct values that have that pattern of occurrence, and the total number of individual joined records for those values.
  • in some cases the values occur in both fields (i.e., having a pattern of occurrence 1x1, 1xN, Nx1, or NxN), and in other cases the values occur in only one field (i.e., having a pattern of occurrence 1x0, 0x1, Nx0, or 0xN).
  • the counts are separated by a comma.
  • the occurrence chart 1118 contains counts corresponding to the joined records 1112 and the rejected record on port 1116A.
  • the value "W" on the rejected output port 1116A corresponds to the "1, 1" counts in the box for the 1x0 pattern of occurrence, indicating a single value and a single record, respectively.
  • FIG. 11B illustrates an example similar to the example of FIG. 11A, but with a different pair of key fields.
  • in the example of FIG. 11B, the key field for Table A is Field 1 and the key field for Table B is Field 2. So the join component compares values 1102 from Table A, Field 1 (A1) with values 1120 from Table B, Field 2 (B2).
  • This example has an occurrence chart 1122 with counts corresponding to the flows of input records 1124 for these fields. Similar to the example in FIG. 11A, there is a single rejected value "Z" that corresponds to the "1, 1" counts in the box for the 0x1 pattern of occurrence. However, in this example, there are two values, "W" and "Y," that both have the 1x1 pattern of occurrence, corresponding to the "2, 2" counts in the box for the 1x1 pattern of occurrence since there are two values, and two joined records. The value "X" corresponds to the "1, 2" counts in the box for the Nx1 pattern of occurrence, indicating a single value and 2 joined records. Various totals are calculated from the numbers in the occurrence chart.
  • Some of these totals include the total number of distinct key-field values occurring in both Table A and Table B, the total number of distinct key-field values occurring in Table A, the total number of distinct key-field values occurring in Table B, and the total number of unique values (i.e., values occurring only in a single record of the key field) in each table.
  • Some statistics based on these totals are used to determine whether a pair of fields has one of the three types of relationships mentioned above.
  • the statistics include the percentages of total records in a field that have distinct or unique values, percentages of total records having a particular pattern of occurrence, and the "relative value overlap" for each field.
  • the relative value overlap is the percentage of distinct values occurring in one field that also occur in the other.
  • the criteria for determining whether a pair of fields has one of the three types of relationships are: foreign key relationship - a first one of the fields has a high relative value overlap (e.g., >99%) and the second field has a high percentage (e.g., >99%) of unique values.
  • In this case, the second field is potentially a primary key and the first field is potentially a foreign key of the primary key.
  • joins well relationship - at least one of the fields has a small percentage (e.g., <10%) of rejected records, and the percentage of individual joined records having a pattern of occurrence of NxN is small (e.g., <1%).
  • common domain relationship - at least one of the fields has a high relative value overlap (e.g., >95%). If a pair of fields has both a foreign key relationship and a joins well or common domain relationship, the foreign key relationship is reported. If a pair of fields has both a joins well relationship and a common domain relationship, but not a foreign key relationship, the joins well relationship is reported. These precedence rules are sketched below.
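These criteria and the reporting precedence can be sketched compactly; the input statistics are assumed to have been compiled from an occurrence chart, and the function signature is an assumption:

```python
def classify_pair(overlap_a, overlap_b, unique_a, unique_b,
                  reject_a, reject_b, nxn_share):
    foreign_key = (overlap_a > 0.99 and unique_b > 0.99) or \
                  (overlap_b > 0.99 and unique_a > 0.99)
    joins_well = min(reject_a, reject_b) < 0.10 and nxn_share < 0.01
    common_domain = max(overlap_a, overlap_b) > 0.95
    if foreign_key:                  # foreign key is reported first,
        return "foreign key"
    if joins_well:                   # then joins well,
        return "joins well"
    if common_domain:                # then common domain
        return "common domain"
    return None

print(classify_pair(0.995, 0.70, 0.50, 0.995, 0.02, 0.30, 0.0))  # -> "foreign key"
```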
  • a census join component 1200 analyzes fields from Table A and Table B and compiles the statistics for an occurrence chart by performing a "census join" operation from census data for the tables.
  • Each census record has a field/value pair and a count of the occurrences of the value in the field. Since each census record has a unique field/value pair, for a given key field, the values in an input flow of the census join component 1200 are unique.
  • the example of FIG. 12A corresponds to the join operation on the pair of key fields A1, B1 (illustrated in FIG. 11A).
  • By comparing census records corresponding to the key fields in the join operation, with filter 1202 selecting "Field 1" (A1) and filter 1204 selecting "Field 1" (B1), the census join component 1200 potentially makes a much smaller number of comparisons than a join component 1100 that compares key fields of individual records from Table A and Table B.
  • FIG. 12B corresponds to the join operation on the pair of key fields A1, B2 (illustrated in FIG. 11B), with filter 1206 selecting "Field 1" (A1) and filter 1208 selecting "Field 2" (B2).
  • the selected census records 1210-1218 are shown labeled by the value for their respective field in the field/value pair, and the count of occurrences for that value.
  • the output record contains the matched value, the corresponding pattern of occurrence based on the two counts, and a total number of records that would be generated in a join operation on the pair of key fields (which is just the product of the two counts). If no match is found for a value, the value is also output with a corresponding pattern of occurrence and a total number of records (which is the single count in the single input record). This information within the output records of the census join component 1200 is sufficient to compile all of the counts in an occurrence chart for the join operation.
  • In the example of FIG. 12A, the value "W" appears at the output with an occurrence pattern of 1x0 and a total of 1
  • the value "X” appears at the output with an occurrence pattern of NxN and a total of 4
  • the value "Y" appears at the output with an occurrence pattern of 1xN and a total of 2.
  • This information corresponds to the information in the occurrence chart 1118 of FIG. 11A.
  • In the example of FIG. 12B, the value "W" appears at the output with an occurrence pattern of 1x1 and a total of 1
  • the value "X" appears at the output with an occurrence pattern of Nx1 and a total of 2
  • the value "Y" appears at the output with an occurrence pattern of 1x1 and a total of 1
  • the value "Z" appears at the output with an occurrence pattern of 0x1 and a total of 1.
  • This information corresponds to the information in the occurrence chart 1122 of FIG. 11B. A sketch of the census join computation follows.
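A minimal sketch of the census join, assuming each side's census data is reduced to a value-to-count map for the key field; the representation is an assumption, and the example reproduces the FIG. 12A flows described above:

```python
def census_join(counts_a, counts_b):
    """Full outer join of two {value: count} maps, yielding
    (value, pattern of occurrence, joined-record total) rows."""
    def bucket(n):
        return "0" if n == 0 else ("1" if n == 1 else "N")
    rows = []
    for value in counts_a.keys() | counts_b.keys():
        ca, cb = counts_a.get(value, 0), counts_b.get(value, 0)
        pattern = f"{bucket(ca)}x{bucket(cb)}"
        # unmatched values keep their single count as the total
        rows.append((value, pattern, max(ca, 1) * max(cb, 1)))
    return rows

# Reproduces the FIG. 12A example: A1 = {W:1, X:2, Y:1}, B1 = {X:2, Y:2}
print(sorted(census_join({"W": 1, "X": 2, "Y": 1}, {"X": 2, "Y": 2})))
# [('W', '1x0', 1), ('X', 'NxN', 4), ('Y', '1xN', 2)]
```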
  • a joint-field analysis for multiple field pairs in a single census join operation includes generating "extended records" based on the census records.
  • the census join component 1200 compares records for a joint-field analysis of both pairs of key fields A1, B1 and A1, B2, combining the joint-field analysis illustrated in FIGS. 12A-B.
  • An extended record is generated from a census record by concatenating a unique identifier for the pair of key fields that are being joined with the value in the census record, and keeping the same count of occurrences as the census record. If a joint-field analysis includes results of a field being joined with multiple other fields, then multiple extended records are generated for each value in that field.
  • the census record 1210 corresponds to two extended records 1301-1302, with the value "W" concatenated with the identifiers "A1B1" and "A1B2," respectively.
  • the census join component 1200 handles the extended record 1301 just as it would handle a census record with the value "WA1B1.”
  • the census record 1211 corresponds to the two extended records 1303-1304
  • census record 1212 corresponds to the two extended records 1305-1306 (the generation of such extended records is sketched below).
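A hedged sketch of extended-record generation, assuming census records are (field, value, count) triples and that join information maps each field to the identifiers of the pairs it participates in; the identifier format is illustrative:

```python
def extend(census_records, join_pairs):
    """census_records: (field, value, count) triples for one source.
    join_pairs: {field: [identifiers of the field pairs it joins in]}."""
    extended = []
    for field, value, count in census_records:
        for pair_id in join_pairs.get(field, []):
            # one extended record per field pair, same occurrence count
            extended.append((pair_id + "/" + value, count))
    return extended

records = [("Field 1", "W", 1), ("Field 1", "X", 2)]
pairs = {"Field 1": ["A1B1", "A1B2"]}          # field A1 joined with B1 and with B2
print(extend(records, pairs))                  # [('A1B1/W', 1), ('A1B2/W', 1), ...]
```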
  • an extend component 1400 processes input census records to generate extended records, based on join information 1401 indicating which fields are being joined with which other fields in the joint-field analysis.
  • the join information 1401 indicates that a field F1 from census data for table T1 (having four census records 1402) is being joined with four other fields: field F1 from census data for table T2 (having two census records 1404), field F2 from census data for table T2 (having two census records 1406), field F1 from census data for table T3 (having two census records 1408), and field F2 from census data for table T3 (having two census records 1410).
  • the extend component 1400 generates four extended records 1413-1416 for the input census record 1412.
  • the census join component 1200 uses unique identifiers for fields including fields in different tables having the same name.
  • the extended record 1413 has a value c(T1,F1,T2,F1,V1) that is a concatenation of the original value V1 with identifiers for the fields being joined, as well as for the table (or file or other data source) from which the census data for the field was generated. Including the identifier for the table enables fields having the same name to be distinguished.
  • the extended record 1415 has a value c(T1,F1,T3,F1,V1) that can be distinguished from the value c(T1,F1,T2,F1,V1) of the extended record 1413, where both tables T2 and T3 have the same field name F1. Alternatively, a unique number can be assigned to each field and used in place of the field name.

6.3 Joint-field analysis graphs
  • FIGS. 15A-C show graphs used by the profiling module 100 to perform an optional joint-field analysis of selected fields in sources (e.g., tables or files) within data sources 30.
  • a user 118 selects options for profiling and for joint-field analysis, including the option of performing profiling without joint-field analysis.
  • the user 118 selects field pairs for joint-field analysis including two specific fields paired with each other, one field paired with every other field, or every field paired with every other field.
  • the user 118 selects an option to allow pairing of fields within the same table or file, or to allow pairing of fields only from different tables or files. These options are stored in the metadata store 112.
  • Referring to FIG. 15A, for each source (e.g., a table or file) of fields specified in the joint-field analysis options, the graph 1500 generates a file with prepared census data 1510 for those specified fields. The graph 1500 executes once for each such source included in the joint-field analysis.
  • a filter 1504 receives records from census data 1502 generated by the make census component 406 and prepares the records for joint-field analysis. The filter 1504 discards records for fields that are not included in the analysis (as determined by user options stored in the metadata store 112). The filter 1504 also discards invalid values, null values, and other values not included in a meaningful analysis of the content of data sources (e.g., known data flags).
  • the values in the census data 1502 have been canonicalized by a canonicalize component 616 within the make census component 406. However, these canonicalized values may have portions that should not be used in a logical comparison of values (e.g., strings with leading or trailing spaces or numbers with leading or trailing zeros).
  • the user 118 can select an option for these values to be compared "literally" or "logically." If the user 118 selects "literal" comparison, then the values in the census records are left in the canonicalized form. If the user 118 selects "logical" comparison, then the filter 1504 converts values in the census records according to rules such as stripping leading and trailing spaces, and stripping leading and trailing zeros for numbers.
  • the partition by value component 1506 re-partitions the records based on the value in the census record. Any census records with the same value are put into the same partition. This allows the joint-field analysis to be run in parallel across any number of partitions. Since the census join component 1200 only produces an output record for input records with matching values, census records (or any extended records generated from them) in different partitions do not need to be compared with one another.
  • a rollup logical values component 1508 combines any census records that have matching field/value pairs due to the conversion performed by the filter 1504. The combined record has a count of occurrences that is the sum of the count for all of the combined records.
  • for example, if two census records for an "amount" field have values that both convert to the same logical value "1," with counts that sum to 8, the rollup logical values component 1508 combines these two converted records into a single record with a field, value, count of "amount, 1, 8" (see the sketch below).
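A sketch of the logical conversion and rollup; the normalization rules shown (stripping spaces and insignificant zeros) are the examples named above, and the input counts of 5 and 3 are hypothetical:

```python
from collections import defaultdict

def to_logical(value):
    v = value.strip()                        # strip leading/trailing spaces
    try:                                     # numbers: strip insignificant zeros
        return str(float(v)).rstrip("0").rstrip(".") or "0"
    except ValueError:
        return v

def rollup_logical(census):
    """census: (field, value, count) triples; counts of records that
    collapse onto the same logical field/value pair are summed."""
    combined = defaultdict(int)
    for field, value, count in census:
        combined[(field, to_logical(value))] += count
    return [(f, v, c) for (f, v), c in combined.items()]

# "01.00" and "1.0" both become "1", so counts 5 and 3 combine to 8
print(rollup_logical([("amount", "01.00", 5), ("amount", "1.0", 3)]))
# [('amount', '1', 8)]
```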
  • the graph 1512 executes using the prepared census data A 1514 and prepared census data B 1516, each prepared by graph 1500.
  • Two extend components 1400 receive records from these sets of prepared census data, along with join information 1515 specifying the specific fields in source A to be compared with specific fields in source B.
  • Extended records flow into a census join component 1200 that generates records containing values, patterns of occurrence, and counts for occurrence charts for the fields being compared.
  • a local rollup join statistics component 1518 compiles the information in these records within each partition.
  • the records in the various partitions are then gathered and compiled by a global rollup join statistics component 1520 that outputs a file 1522 containing all of the joint-field analysis statistics for the fields in all of the pairs of sources that are analyzed.
  • the results of the joint-field analysis, including which of the three types of relationships potentially exist between various fields, are loaded into the metadata store 112 for presentation to the user 118.
  • the user 118 can select a link on the user interface 116 for a pair of fields with a potential relationship and view a page on the user interface 116 with detailed analysis results including counts from an occurrence chart for the pair of fields.
  • the graph 1524 executes using the prepared census data C 1526 prepared by graph 1500.
  • a single extend component 1400 receives records from the set of prepared census data C 1526, along with join information 1528 specifying the specific fields in source C to be compared. Extended records flow into both ports of a census join component 1200 that generates records containing values, patterns of occurrence, and counts for occurrence charts for the fields being compared.
  • In the case of joint-field analysis options indicating that every field in source C (having four fields: F1, F2, F3, F4) is to be compared with every other field in source C, one approach is for the join information 1528 to specify twelve ordered pairs of fields (F1-F2, F1-F3, F1-F4, F2-F1, F2-F3, F2-F4, F3-F1, F3-F2, F3-F4, F4-F1, F4-F2, F4-F3). However, since the same operations are performed for the pairs F1-F3 and F3-F1, some operations are repeated. Accordingly, another approach is for the join information to specify only the six unique pairs F1-F2, F1-F3, F1-F4, F2-F3, F2-F4, F3-F4, as in the sketch below. In this case, the results in the output file 1530 are augmented to include results for the other six field pairs by reversing the order of the fields in the analysis results for the six pairs that were analyzed.
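A sketch of this unique-pair approach: enumerate the six unordered pairs, then mirror each result by transposing its occurrence patterns; the statistics representation is an assumption:

```python
from itertools import combinations

fields = ["F1", "F2", "F3", "F4"]
unique_pairs = list(combinations(fields, 2))        # six pairs instead of twelve

def transpose_pattern(p):                           # "1xN" -> "Nx1", "1x0" -> "0x1", ...
    a, b = p.split("x")
    return f"{b}x{a}"

def mirror(results):
    """results: {(fa, fb): {pattern: count}} for the analyzed pairs;
    returns all ordered pairs by transposing each chart."""
    full = dict(results)
    for (fa, fb), chart in results.items():
        full[(fb, fa)] = {transpose_pattern(p): c for p, c in chart.items()}
    return full

analyzed = {("F1", "F3"): {"1xN": 2, "Nx1": 1, "1x0": 3}}   # hypothetical stats
print(unique_pairs)
print(mirror(analyzed)[("F3", "F1")])               # {'Nx1': 2, '1xN': 1, '0x1': 3}
```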
  • Another type of analysis the profiling module 100 is able to perform is a test for a functional relationship between values of fields.
  • the fields tested can be from a single table that has a set of fields or from a "virtual table” that includes fields from multiple sources that are related (e.g., through a join operation on the fields using a common key field, as described in more detail in section 7.3).
  • One type of functional relationship between a pair of fields is "functional dependency" where the value associated with one field of a record can be uniquely determined by the value associated with another field of the record.
  • the value of the Zip Code field determines the value of the State field (e.g., CA).
  • Each value of the Zip Code field maps onto a unique value of the State field (i.e., a "many-to-one" mapping).
  • a functional dependency relationship can also exist among a subset of fields where the value associated with one field of a record can be uniquely determined by the values associated with other fields of the record.
  • the value of the Zip Code field can be uniquely determined by the values of a City field and a Street field.
  • the functional dependency can also be an "approximate functional dependency" where some but not necessarily all of the values associated with one field map onto a unique value of another field, with a percentage of exceptions that do not map onto the unique value. For example, some records may have an unknown Zip Code that is indicated by a special value 00000. In this case, the value 00000 of the Zip Code field may map onto more than one value of the State field (e.g., CA, FL, and TX). Exceptions can also occur due to records with incorrect values, or other errors. If the percentage of exceptions is smaller than a pre-determined (e.g., as entered by a user) threshold, then a field may still be determined to be functionally dependent on another field.
  • Referring to FIG. 16, a Last Name field has twelve values corresponding to twelve records (rows 1-12). Ten of the values are unique, and two of the records have the same repeated value name_g.
  • a citizenship field has two unique values: US occurring eleven times and CANADA occurring once.
  • a Zip Code field has various values each corresponding to one of three values CA, FL, and TX for a State field. Each value of Zip Code uniquely determines a value of State, except for the Zip Code value 00000 that corresponds to FL in one record (row 10) and to TX in another record (row 12).

7.1 Functional dependency analysis graph
  • FIG. 17 shows an example of a graph 1700 used by the profiling module 100 to perform an optional functional dependency analysis of selected fields in one or more sources (e.g., in a single table or file, or in multiple tables and/or files as described in section 7.3) within data sources 30.
  • a user 118 selects options for profiling and for functional dependency analysis, including the option of performing profiling without functional dependency analysis.
  • the user 118 may choose which pair or pairs of fields are tested for a functional relationship.
  • the user 118 selects particular fields of a data source (e.g., a table or file), and chooses, for example, "all to selected” or “selected to selected” to determine which pairs of fields are tested or chooses "all to all” to test all pairs of fields in the data source.
  • the user may also select a threshold for determining a degree of functional dependency before deciding that a field is or is not functionally dependent on another field. For example, the user may select a threshold for determining how many exceptions to allow (as a percentage of records). These options are stored in the metadata store 112.
  • the graph 1700 determines whether a functional dependency relationship exists, and if so, classifies the relationship between field f1 and field f2 as: "f1 determines f2", "f2 determines f1", "one-to-one" (a one-to-one mapping exists between f1 and f2), or "identical" (f1 has identically the same value as f2 in each of the records).
  • the graph 1700 reads field information 1702 stored by the profiling module 100 to determine unique identifiers for fields to be analyzed.
  • a make pairs component 1704 generates a flow of field pairs (f1, f2) using a pair of unique identifiers for each of the pairs of fields to be tested.
  • the pair (f1, f2) is an ordered pair since the relationship between f1 and f2 is not necessarily symmetric. So both pairs (f1, f2) and (f2, f1) are included in the flow.
  • a select pairs component 1706 limits the field pairs that flow to the rest of the graph 1700 by selecting the field pairs chosen for analysis by the user.
  • a broadcast component 1708 broadcasts the serial flow of field pairs to each of the partitions of a (potentially parallel) attach values component 1718, as represented by a broadcast link symbol 1710.
  • Each partition of the attach values component 1718 takes as input a flow of field pairs (e.g., (LastName, citizenship), (Zip, State), ...).
  • a filter component 1712 extracts records from the input data set 402, and optionally removes a portion of the records based on a filter expression.
  • the records flowing out of the filter component 1712 flow into a partition by round-robin component 1714.
  • the partition by round-robin component 1714 takes records from the partitions of the input data set 402 and re-partitions the records among a number of parallel processors and/or computers in order to balance the work load among the processors and/or computers.
  • the canonicalize component 1716 (similar to the canonicalize component 616 described above) takes in a flow of records and sends out a flow of field/value pairs representing values for each field in an input record. As described above, each value is converted into a canonical human readable string representation.
  • the attach values component 1718 performs a series of join operations to generate a flow of f1/f2/v1/v2 quadruples where f1 and f2 correspond to one of the field pairs received at the input, and v1 and v2 correspond to values that are paired with those fields in a record.
  • when the Last Name field corresponds to f1 and the citizenship field corresponds to f2, the attach values component 1718 generates a flow of twelve f1/f2/v1/v2 quadruples including: (LastName/Citizenship/name_a/Canada), (LastName/Citizenship/name_b/US), ..., (LastName/Citizenship/name_k/US), (LastName/Citizenship/name_g/US).
  • the attach values component 1718 generates a similar series of f1/f2/v1/v2 quadruples for (Zip, State) and any other pairs of fields that are analyzed.
  • the attach values component 1718 outputs the flow of f1/f2/v1/v2 quadruples into a "local rollup f1/f2/v1/v2" component 1720 which (for each partition) accumulates multiple quadruples with the same fields and values f1, f2, v1, v2 and represents them as a single quadruple with a count of the number of occurrences of the quadruple in the input flow.
  • the output flow of the "local rollup fl/f2/vl/v2" component 1720 consists of quadruples with counts (or "accumulated quadruples”). The accumulation that occurs in the "local rollup fl/f2/vl/v2" component 1720 is within each partition.
  • a "partition by fl/f2" component 1721 repartitions the flow of accumulated quadmples such that quadruples with the same fields fl, £2 are in the same partition-
  • a "global rollup fl/f2/vl/v2" component 1722 further accumulates the repartitioned quadruples.
  • the output flow of the "global rollup fl/f2/vl/v2" component 1722 consists of unique accumulated quadruples.
  • a "global rollup fl/f2/vl" component 1724 combines accumulated quadruples that have both fields fl, £2 and the first value vl in common, h producing an output element, this component 1724 examines all values of v2 that go with a value of vl and selects the most frequent v2 to associate with that vl value. The number of quadruples sharing the most frequent v2 are counted as "good” and the rest of the quadruples are counted as "exceptions.” If there is only one value of v2 for a given vl, then the accumulated quadruples having that value are good and there are no exceptions.
  • when the Zip field corresponds to f1 and the State field corresponds to f2, the component 1724 generates: (Zip/State/90019/CA, 4 good), (Zip/State/90212/CA, 2 good), (Zip/State/33102/FL, 3 good), (Zip/State/00000/FL, 1 good, 1 exception), (Zip/State/77010/TX, 1 good).
  • when the State field corresponds to f1 and the Zip field corresponds to f2, the component 1724 generates: (State/Zip/CA/90019, 4 good, 2 exceptions), (State/Zip/FL/33102, 3 good, 1 exception), (State/Zip/TX/77010, 1 good, 1 exception).
  • a "global rollup fl/£2" component 1726 adds the good counts and the exceptions for each unique pair of fields fl, £2. In the example of table 1600, when the Zip field corresponds to fl and the State field coreesponds to £2, the component 1726 generates: (Zip/State, 11 good, 1 exception).
  • a find dependencies component 1728 uses the accumulated co-occurrence statistics (i.e., numbers of good and exceptional records) from the "global rollup f1/f2" component 1726 to determine whether a pair of fields has the relationship "f1 determines f2." If the percentage of exceptions (given by: number of exceptions/(number of good + number of exceptions)) is less than the selected threshold for determining how many exceptions to allow, then the pair of fields has the relationship "f1 determines f2." In the example of table 1600, for a threshold of 10%, when the Zip field corresponds to f1 and the State field corresponds to f2, the percentage of exceptions is 8.3% and the value of the Zip field determines the value of the State field. A sketch of this computation follows.
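A sketch of the rollup-and-decide computation, combining the steps of components 1724, 1726, and 1728 on accumulated (v1, v2, count) triples for one ordered field pair; the data layout is an assumption, and the example reproduces the Zip/State numbers from table 1600:

```python
from collections import defaultdict

def determines(quads, threshold=0.10):
    """quads: (v1, v2, count) triples for one ordered field pair (f1, f2)."""
    by_v1 = defaultdict(lambda: defaultdict(int))
    for v1, v2, count in quads:
        by_v1[v1][v2] += count
    good = exceptions = 0
    for v2_counts in by_v1.values():
        top = max(v2_counts.values())               # most frequent v2 for this v1
        good += top                                  # those quadruples count as good
        exceptions += sum(v2_counts.values()) - top  # the rest are exceptions
    return exceptions / (good + exceptions) < threshold

# Zip -> State from table 1600: 11 good, 1 exception (8.3% < 10%)
zip_state = [("90019", "CA", 4), ("90212", "CA", 2), ("33102", "FL", 3),
             ("00000", "FL", 1), ("00000", "TX", 1), ("77010", "TX", 1)]
print(determines(zip_state))                        # True: Zip determines State
```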
  • when the State field corresponds to f1 and the Zip field corresponds to f2, the percentage of exceptions is 33%, so the relationship between the Zip and State fields is not a one-to-one mapping.
  • a value based on a mathematical property of the accumulated values can also be used to determine whether field f1 determines field f2 (e.g., the conditional entropy of the value of field f2 given the value of field f1, or a standard deviation of a numerical value).
  • a variety of optimizations can be applied to increase the speed of functional dependency analysis, for example, by filtering pairs of fields at the select pairs component 1706, or by filtering records at the filter component 1712. Some optimizations are based on the recognition that some functional dependency relationships that are discovered by the graph 1700 described above may not be as meaningful to a user as others. For a given pair of fields, some of these cases can be detected and filtered out by the select pairs component 1706 based on statistics provided by the profiling module 100, saving computing resources. For example, if all of the values of a first field f1 are unique (each value occurring in only a single record), then the value of that field f1 determines the value of the second field f2 regardless of the values occurring in the field f2.
  • the graph 1700 can use census data obtained during profiling to compute a probability that a first field f1 determines a second field f2 based on a random (e.g., a uniform probability distribution) pairing of values in the fields. If there is a high probability (e.g., >10%) that a random pairing would result in a functional dependency, then the field pair is filtered out by the select pairs component 1706.
  • the select pairs component 1706 filters the pair (LastName, citizenship). Another optimization is based on histograms of values calculated by the profiling module 100 from census data. The select pairs component 1706 filters pairs when it is not possible for field f1 to determine field f2. In the example of table 1600, the most frequent value of State occurs 6 times and the most frequent value of Zip occurs only 4 times. So it is not possible for the value of State to determine the value of Zip since there would be at least 2 out of 6 exceptions for at least half of the values, resulting in at least a 16.7% exception percentage.
  • the select pairs component 1706 filters the pair (State, Zip).
  • the graph 1700 can increase the speed of testing for functional dependency by processing a small sample of the records first to eliminate field pairs that are highly likely not functionally related before processing all of the records.
  • the graph 1700 can use the filter component 1712 to select a portion of the records.
  • the graph 1700 can use the canonicalize component 1716 to select a portion of the field/value pairs.
  • the records or field value pairs can be sampled based on a variety of criteria.
  • the graph 1700 can sample based on statistics provided by the profiling module 100. For example, the graph 1700 can test for functional dependency based on the most frequent value of the first field f1 (the "determiner").
  • the graph 1700 can also test for functional dependency based on a random sample of determiner values. If a sufficient number of quadruples count as good among the sampled values, then the probability of finding a substantial number of exceptions among the other values is assumed to be negligible. Other sampling criteria are possible. Another optional optimization is to test for pre-determined functional relationships between fields based on a library of known functions. This test can be performed on the records or on the values of the quadruples.
  • In one approach for testing for functional dependency across multiple sources (e.g., database tables), the profiling module 100 generates a "virtual table" that includes fields from the multiple sources.
  • the virtual table can be generated, for example, by performing a join operation on the sources using a key field that is common to the sources.
  • a first data source is a database of motor vehicle registration information (a motor vehicle registry (MVR) database) and a second data source is a database of issued traffic citations (a traffic citation (TC) database).
  • the MVR database includes fields such as make, model, and color, and includes a license field that is designated as a "primary key" field. Each record in the MVR database has a unique value of the license field.
  • the TC database includes fields such as name, date, location, violation, vehicle make, vehicle model, vehicle color and includes a vehicle license field that is designated as a "foreign key" field.
  • Each value of the vehicle license field has a corresponding record in the MVR database with that value in the license field.
  • the profiling module 100 joins records from the MVR database and TC database to form a virtual table (e.g., as described above with reference to the join component 1100 shown in FIG. 11A).
  • Each record of the virtual table has each of the fields from the two databases including a single license field that has the matched value from the MVR license field and the TC vehicle license field.
  • a record may, however, have a value of the color field from the MVR database that is different from the value of the vehicle color field from the TC database.
  • the MVR database may use a "BLU" code to indicate the color blue while the TC database may use a "BU" code to indicate the same color.
  • In this case, the color field will still have a "one-to-one" functional relationship with the vehicle color field.
  • a record may have different values for the color field and the vehicle color field if a vehicle has been painted a different color in the time between being registered and receiving a citation.
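An illustrative construction of such a virtual table, joining the MVR and TC sources on the license / vehicle license key pair described above; the record layout and field names are assumptions for the sketch, not the actual databases:

```python
def virtual_table(mvr_rows, tc_rows):
    by_license = {row["license"]: row for row in mvr_rows}    # primary key is unique
    joined = []
    for tc in tc_rows:
        mvr = by_license[tc["vehicle_license"]]               # every foreign key matches
        # keep a single license field with the matched value
        rec = {**mvr, **{k: v for k, v in tc.items() if k != "vehicle_license"}}
        joined.append(rec)
    return joined

mvr = [{"license": "ABC123", "make": "FORD", "color": "BLU"}]
tc = [{"vehicle_license": "ABC123", "violation": "SPEED", "vehicle_color": "BU"}]
print(virtual_table(mvr, tc))   # color "BLU" and vehicle_color "BU" can still map one-to-one
```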
  • the profiling module 100 can discover any of a variety of relationships that may exist between the fields in those data sets.
  • the same or similar dependency analysis as described above can be run on fields in the joined virtual table.
  • the approaches described above can be implemented using software for execution on a computer.
  • the software forms procedures in one or more computer programs that execute on one or more programmed or programmable computer systems (which may be of various architectures, such as distributed, client/server, or grid) each including at least one processor, at least one data storage system (for example, volatile and non-volatile memory and/or storage elements), at least one input device or port, and at least one output device or port.
  • the software may form one or more modules of a larger program, for example, a program that provides other services related to the design and configuration of graphs.
  • the software may be provided on a medium or device readable by a general or special purpose programmable computer or delivered (encoded in a propagated signal) over a network to the computer where it is executed. All of the functions may be performed on a special purpose computer, or using special-purpose hardware, such as coprocessors.
  • the software may be implemented in a distributed manner in which different parts of the computation specified by the software are performed by different computers.
  • Each such computer program is preferably stored on or downloaded to a storage media or device (e.g., solid state memory or media, or magnetic or optical media) readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage media or device is read by the computer system to perform the procedures described herein.
  • the inventive system may also be considered to be implemented as a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer system to operate in a specific and predefined manner to perform the functions described herein.

Abstract

Processing data includes profiling data from a data source, including reading the data from the data source, computing summary data characterizing the data while reading the data, and storing profile information that is based on the summary data. The data is then processed from the data source. This processing includes accessing the stored profile information and processing the data according to the accessed profile information.

Description

Data Profiling

Cross-Reference to Related Applications
This application claims the benefit of U.S. Provisional Applications No. 60/502,908, filed September 15, 2003, No. 60/513,038, filed October 20, 2003, and No. 60/532,956, filed December 22, 2003. The above referenced applications are incorporated herein by reference.

Background
This invention relates to data profiling. Stored data sets often include data for which various characteristics are not known beforehand. For example, ranges of values or typical values for a data set, relationships between different fields within the data set, or functional dependencies among values in different fields, may be unknown. Data profiling can involve examining a source of a data set in order to determine such characteristics. One use of data profiling systems is to collect information about a data set which is then used to design a staging area for loading the data set before further processing. Transformations necessary to map the data set to a desired target format and location can then be performed in the staging area based on the information collected in the data profiling. Such transformations may be necessary, for example, to make third-party data compatible with an existing data store, or to transfer data from a legacy computer system into a new computer system.

Summary

In one aspect, in general, the invention features a method and corresponding software and a system for processing data. Data from a data source is profiled. This profiling includes reading the data from the data source, computing summary data characterizing the data while reading the data, and storing profile information that is based on the summary data. The data is then processed from the data source. This processing includes accessing the stored profile information and processing the data according to the accessed profile information. In another aspect, in general, the invention features a method for processing data. Data from a data source is profiled. This profiling includes reading the data from the data source, computing summary data characterizing the data while reading the data, and storing profile information that is based on the summary data. Profiling the data includes profiling the data in parallel, including partitioning the data into parts and processing the parts using separate ones of a first set of parallel components. Aspects of the invention can include one or more of the following features: Processing the data from the data source includes reading the data from the data source. Profiling the data is performed without maintaining a copy of the data outside the data source. For example, the data can include records with a variable record structure such as conditional fields and/or variable numbers of fields. Computing summary data while reading the data includes interpreting the variable record structure records while computing summary data characterizing the data. The data source includes a data storage system, such as a database system, or a serial or parallel file system. Computing the summary data includes counting a number of occurrences for each of a set of distinct values for a field. The profile information can include statistics for the field based on the counted number of occurrences for said field. A metadata store that contains metadata related to the data source is maintained. Storing the profile information can include updating the metadata related to the data source. Profiling the data and processing the data can each make use of metadata for the data source. Profiling data from the data source further includes determining a format specification based on the profile information. It can also include determining a validation specification based on the profile information.
Invalid records can be identified during the processing of the data based on the format specification and/or the validation specification. Data transformation instructions are specified based on the profile information.
Processing the data can then include applying the transformation instructions to the data. Processing the data includes importing the data into a data storage subsystem. The data can be validated prior to importing the data into a data storage subsystem. Such validating of the data can include comparing characteristics of the data to reference characteristics for the data, such as by comparing statistical properties of the data. The profiling of the data can be performed in parallel. This can include partitioning the data into parts and processing the parts using separate ones of a first set of parallel components. Computing the summary data for different fields of the data can include using separate ones of a second set of parallel components. Outputs of the first set of parallel components can be repartitioned to form inputs for the second set of parallel components. The data can be read from a parallel data source, each part of the parallel data source being processed by a different one of the first set of parallel components. In another aspect, in general, the invention features a method and corresponding software and a system for processing data. Information characterizing values of a first field in records of a first data source and information characterizing values of a second field in records of a second data source are accepted. Quantities characterizing a relationship between the first field and the second field are then computed based on the accepted information. Information relating the first field and the second field is presented. Aspects of the invention can include one or more of the following features. The information relating the first field and the second field is presented to a user. The first data source and the second data source are either the same data source, or are separate data sources. Either or both of the data source or sources can be a database table, or a file. The quantities characterizing the relationship include quantities characterizing joint characteristics of the values of the first field and of the second field. The information characterizing the values of the first field (or similarly of the second field) includes information characterizing a distribution of values of that field. Such information may be stored in a data structure, such as a "census" data structure. The information characterizing the distribution of values of the first field can include multiple data records, each associating a different value and a corresponding number of occurrences of that value in the first field in the first data source. Similarly, information characterizing the distribution of values of the second field can include multiple records of the same or similar format. The information characterizing the distribution of values of the first field and of the second field is processed to compute quantities related to multiple different categories of co-occurrence of values. The quantities related to the categories of co-occurrence of values include multiple data records, each associated with one of the categories of co-occurrence and including a number of different values in the first and the second fields that are in that category. Information characterizing a distribution of values in a "join" of the first data source and the second data source on the first field and the second field, respectively, is computed. This computation can include computing quantities related to a plurality of categories of co-occurrence of values.
Examples of such categories include values that occur at least once in one of the first and the second fields but not in the other of the fields, values that occur exactly once in each of the first and the second fields, values that occur exactly once in one of the first and the second fields and more than once in the other of the fields, and values that occur more than once in each of the first and the second fields. The steps of accepting information characterizing values and computing quantities characterizing joint characteristics of the values are repeated for multiple different pairs of fields, one field from the first data source and the other field from the second data source. Information relating the fields of one or more of the plurality of pairs of fields can then be presented to the user. Presenting the information relating the fields of one or more of the pairs of fields includes identifying candidate types of relationships of fields. Examples of such types of relationships of fields include a primary key and foreign key relationship and a common domain relationship. In another aspect, in general, the invention features a method and corresponding software and a system for processing data. A plurality of subsets of fields of data records of a data source are identified. Co-occurrence statistics are determined for each of the plurality of subsets. One or more of the plurality of subsets is identified as having a functional relationship among the fields of the identified subset. Aspects of the invention can include one or more of the following features. At least one of the subsets of fields is a subset of two fields. Identifying one or more of the plurality of subsets as having a functional relationship among the fields of the identified subset includes identifying one or more of the plurality of subsets as having one of a plurality of possible predetermined functional relationships. Determining the co-occurrence statistics includes forming data elements each identifying a pair of fields and identifying a pair of values occurring in the pair of fields in one of the data records. Determining the co-occurrence statistics includes partitioning the data records into parts, the data records having a first field and a second field, determining a quantity based on a distribution of values that occur in the second field of one or more records in a first of the parts, the one or more records having a common value occurring in a first field of the one or more records, and combining the quantity with other quantities from records in other of the parts to generate a total quantity. Identifying one or more of the plurality of subsets as having a functional relationship among the fields of the identified subset includes identifying a functional relationship between the first and second fields based on the total quantity. The parts are based on values of the first field and of the second field. The parts are processed using separate ones of a set of parallel components. Identifying one or more of the plurality of subsets as having a functional relationship among the fields of the identified subset includes determining a degree of match to the functional relationship. The degree of match includes a number of exceptional records that are not consistent with the functional relationship. The functional relationship includes a mapping of at least some of the values of a first field onto at least some of the values of a second field.
The mapping can be, for example, a many-to-one mapping, a one-to-many mapping, or a one-to-one mapping. The method further includes filtering the plurality of subsets based on information characterizing values in fields of the plurality of subsets. The data records include records of one or more database tables. Aspects of the invention can include one or more of the following advantages. Aspects of the invention provide advantages in a variety of scenarios. For example, in developing an application, a developer may use an input data set to test the application. The output of the application run using the test data set is compared against expected test results, or inspected manually. However, when the application is run using realistic "production data," the results may be too large to be verified by inspection. Data profiling can be used to verify the application behavior. Instead of inspecting every record produced by running the application using production data, a profile of the output is inspected. The data profiling can detect invalid or unexpected values, as well as unexpected patterns or distributions in the output that could signal an application design problem. In another scenario, data profiling can be used as part of a production process. For example, input data that is part of a regular product run can be profiled. After the data profiling has finished, a processing module can load the profiling results and verify that the input data meets certain quality metrics. If the input data looks bad, the product run can be cancelled and the appropriate people alerted. In another scenario, a periodic audit of a large collection of data (e.g., hundreds of database tables in multiple sets of data) can be performed by profiling the data regularly. For example, data profiling can be performed every night on a subset of the data. The data that is profiled can be cycled such that all of the data is profiled, e.g., once a quarter so that every database table will be profiled four times a year. This provides an historic data quality audit on all of the data that can be referred to later, if necessary. The data profiling can be performed automatically. For example, the data profiling can be performed from a script (e.g., a shell script) and integrated with other forms of processing. Results of the data profiling can be automatically published, e.g., in a form that can be displayed in a web browser, without having to manually post-process the results or run a separate reporting application. Operating on information characterizing values of the records in the data sources rather than necessarily operating directly on the records of the data sources themselves can reduce the amount of computation considerably. For example, using census data rather than the raw data records reduces the complexity of computing characteristics of a join on two fields from being of the order of the product of the number of data records in the two data sources to being of the order of the product of the number of unique values in the two data sources. Profiling the data without maintaining a copy of the data outside the data source can avoid potential for errors associated with maintaining duplicate copies and avoids using extra storage space for a copy of the data. The operations may be parallelized according to data value, thereby enabling efficient distributed processing.
Quantities characterizing a relationship between fields can provide an indication of which fields may be related by different types of relationships. The user may then be able to examine the data more closely to determine whether the fields truly form that type of relationship. Determining co-occurrence statistics for each of a plurality of subsets of fields of data records of a data source enables efficient identification of potential functional relationships among the fields. Aspects of the invention can be useful in profiling data sets with which the user is not familiar. The information that is automatically determined, or which is determined in cooperation with the user, can be used to populate metadata for the data sources, which can then be used for further processing. Other features and advantages of the invention are apparent from the following description, and from the claims.

Description of Drawings
FIG. 1 is a block diagram of a system that includes a data profiling module.
FIG. 2 is a block diagram that illustrates the organization of objects in a metadata store used for data profiling.
FIG. 3 is a profiling graph for the profiling module.
FIG. 4 is a tree diagram of a hierarchy for a type object used to interpret a data format.
FIGS. 5A-C are diagrams that illustrate sub-graphs implementing the make census component, analyze census component, and make samples component of the profiling graph.
FIG. 6 is a flowchart for a rollup procedure.
FIG. 7 is a flowchart for a canonicalize procedure.
FIGS. 8A-C are example user interface screen outputs showing profiling results.
FIG. 9 is a flowchart of an exemplary profiling procedure for data discovery.
FIG. 10 is a flowchart of an exemplary profiling procedure for quality testing.
FIGS. 11A-B are two examples of a join operation performed on records from two pairs of fields.
FIGS. 12A-B are two examples of a census join operation on census records from two pairs of fields.
FIG. 13 is an example of extended records used to perform a single census join operation on two pairs of fields.
FIG. 14 is an extend component used to generate extended records.
FIGS. 15A-C are graphs used to perform joint-field analysis.
FIG. 16 is an example table with fields having a functional dependency relationship.
FIG. 17 is a graph used to perform functional dependency analysis.

Description
1 Overview
Referring to FIG. 1, a data processing system 10 includes a profiling and processing subsystem 20, which is used to process data from data sources 30 and to update a metadata store 112 and a data store 124 in a data storage subsystem 40. The stored metadata and data are then accessible to users using an interface subsystem 50.

Data sources 30 in general include a variety of individual data sources, each of which may have unique storage formats and interfaces (for example, database tables, spreadsheet files, flat text files, or a native format used by a mainframe 110). The individual data sources may be local to the profiling and processing subsystem 20, for example, being hosted on the same computer system (e.g., file 102), or may be remote to the profiling and processing subsystem 20, for example, being hosted on a remote computer (e.g., mainframe 110) that is accessed over a local or wide area data network.

Data storage subsystem 40 includes a data store 124 as well as a metadata store 112. Metadata store 112 includes information related to data in data sources 30 as well as information about data in data store 124. Such information can include record formats as well as specifications for determining the validity of field values in those records (validation specifications). The metadata store 112 can be used to store initial information about a data set in data sources 30 to be profiled, as well as information obtained, during the profiling process, about such a data set and about data sets in data store 124 derived from that data set. The data store 124 can be used to store data that has been read from the data sources 30 and optionally transformed using information derived from data profiling.

The profiling and processing subsystem 20 includes a profiling module 100, which reads data directly from a data source, in units of discrete work elements such as individual records, without necessarily landing a complete copy of the data to a storage medium before profiling. Typically, a record is associated with a set of data fields, each field having a particular value for each record (including possibly a null value). The records in a data source may have a fixed record structure in which each record includes the same fields. Alternatively, records may have a variable record structure, for example, including variable length vectors or conditional fields. In the case of variable record structure, the records are processed without necessarily storing a "flattened" (i.e., fixed record structure) copy of the data prior to profiling.

When first reading data from a data source, the profiling module 100 typically starts with some initial format information about records in that data source. (Note that in some circumstances, even the record structure of the data source may not be known.) The initial information about records can include the number of bits that represent a distinct value (e.g., 16 bits (= 2 bytes)), the order of values, including values associated with record fields and values associated with tags or delimiters, and the type of value (e.g., string, signed/unsigned integer) represented by the bits. This information about records of a data source is specified in a data manipulation language (DML) file that is stored in the metadata store 112.
The profiling module 100 can use predefined DML files to automatically interpret data from a variety of common data system formats (e.g., SQL tables, XML files, CSV files) or use a DML file obtained from the metadata store 112 describing a customized data system format.

Partial, possibly inaccurate, initial information about records of a data source may be available to the profiling and processing subsystem 20 before the profiling module 100 first reads the data. For example, a COBOL copy book associated with a data source may be available as stored data 114, or entered by a user 118 through a user interface 116. Such existing information is processed by a metadata import module 115 and stored in the metadata store 112 and/or used to define the DML file used to access the data source.

As the profiling module 100 reads records from a data source, it computes statistics and other descriptive information that reflect the contents of the data set. The profiling module 100 then writes those statistics and descriptive information in the form of a "profile" into the metadata store 112, which can then be examined through the user interface 116 or any other module with access to the metadata store 112. The statistics in the profile preferably include a histogram of values in each field, maximum, minimum, and mean values, and samples of the least common and most common values. The statistics obtained by reading from the data source can serve a variety of purposes.
Such uses can include discovering the contents of unfamiliar data sets, building up a collection of metadata associated with a data set, examining third-party data before purchasing or using it, and implementing a quality control scheme for collected data. Procedures for using the data processing system 10 to perform such tasks are described in detail below.

The metadata store 112 is able to store validation information associated with each profiled field, for example as a validation specification that encodes the validation information. Alternatively, the validation information can be stored in an external storage location and retrieved by the profiling module 100. Before a data set is profiled, the validation information may specify a valid data type for each field. For example, if a field is a person's "title", a default valid value may be any value that is a "string" data type. A user may also supply valid values such as "Mr.", "Mrs." and "Dr." prior to profiling the data source so that any other value read by the profiling module 100 would be identified as invalid. Information obtained from a profiling run can also be used by a user to specify valid values for a particular field. For example, the user may find that after profiling a data set the values "Ms." and "Msr." appear as common values. The user may add "Ms." as a valid value, and map the value "Msr." to the value "Mrs." as a data cleaning option. Thus, the validation information can include valid values and mapping information to permit cleaning of invalid values by mapping them onto valid values. The profiling of a data source may be undertaken in an iterative manner as more information about the data source is discovered through successive profiling runs.

The profiling module 100 can also generate executable code to implement other modules that can access the profiled data systems. For example, a processing module 120 can include code generated by the profiling module 100. An example of such code might map a value "Msr." to "Mrs." as part of the access procedure to the data source. The processing module 120 may run in the same runtime environment as the profiling module 100, and preferably can communicate with the metadata store 112 to access a profile associated with a data set. The processing module 120 can read the same data formats as the profiling module 100 (e.g., by obtaining the same DML file from the metadata store 112). The processing module 120 can use the data set profile to obtain values used to validate or clean incoming records before storing them in a data store 124.

Similar to the profiling module 100, the processing module 120 also reads data directly from a data system in units of discrete work elements. This "data flow" of work elements has the benefit of allowing the data profiling to be performed on large data sets without necessarily copying data to local storage (e.g., a disk drive). This data flow model, described in more detail below, also allows complex data transformations to be performed by a processing module without the source data being first copied to a staging area, potentially saving storage space and time.
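The validation and cleaning option described in this overview can be illustrated with a short sketch, written here in Python. This is a minimal illustration only; the function name and the set/dictionary representation of the validation information are assumptions made for the example, not the format actually used by the metadata store.

    def clean_value(value, valid_values, cleaning_map):
        """Return (value, is_valid) after applying the cleaning option."""
        if value in valid_values:
            return value, True
        if value in cleaning_map:            # e.g., map "Msr." onto "Mrs."
            return cleaning_map[value], True
        return value, False                  # invalid; record may be set aside

    # Example based on the "title" field discussed above.
    valid_titles = {"Mr.", "Mrs.", "Dr.", "Ms."}
    title_cleaning = {"Msr.": "Mrs."}
    print(clean_value("Msr.", valid_titles, title_cleaning))   # ('Mrs.', True)
    print(clean_value("Mx.", valid_titles, title_cleaning))    # ('Mx.', False)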
2 Metadata store organization
The profiling module 100 uses the metadata store 112 to organize and store various metadata, profiling preferences, and results in data objects. Referring to FIG. 2, the metadata store 112 may store a group of profile setup objects 201, each for information related to a profiling job, a group of data set objects 207, each for information related to a data set, and a group of DML files 211, each describing a particular data format.

A profile setup object contains preferences for a profiling run executed by the profiling module 100. A user 118 can enter information used to create a new profile setup object or select a pre-stored profile setup object 200. The profile setup object 200 contains a reference 204 to a data set object 206.

A data set object 206 contains a data set locator 202 which enables the profiling module 100 to locate data to be profiled on one or more data systems accessible within the runtime environment. The data set locator 202 is typically a path/filename, URL, or a list of path/filenames and/or URLs for a data set spread over multiple locations. The data set object 206 can optionally contain a reference 208 to one or more DML files 210. The DML file(s) 210 may be pre-selected based on knowledge about the format of data in a data set, or may be specified at runtime by a user. The profiling module 100 can obtain an initial portion of the data set and present to the user over the user interface 116 an interpretation of the initial portion based on a default DML file. The user may then modify the default DML file specification based on an interactive view of the interpretation. More than one DML file may be referenced if the data set includes data with multiple formats.

The data set object 206 contains a reference 212 to a set of field objects 214. There is one field object for each field within the records of the data set to be profiled. Upon completion of a profiling run performed by the profiling module 100, a data set profile 216 is contained within the data set object 206 corresponding to the data set that was profiled. The data set profile 216 contains statistics that relate to the data set, such as total number of records and total number of valid/invalid records.

A field object 218 can optionally contain validation information 220 that can be used by the profiling module 100 to determine valid values for the corresponding field, and specify rules for cleaning invalid values (i.e., mapping invalid values onto valid values). The field object 218 also contains a field profile 222, stored by the profiling module 100 upon completion of a profiling run, which contains statistics that relate to the corresponding field, such as numbers of distinct values, null values, and valid/invalid values. The field profile 222 can also include sample values such as maximum, minimum, most common, and least common values. A complete "profile" includes the data set profile 216 and field profiles for all of the profiled fields.

Other user preferences for a profiler run can be collected and stored in the profile setup object 200, or in the data set object 206. For example, the user can select a filter expression which can be used to limit the fields or number of values profiled, including profiling a random sample of the values (e.g., 1%).

3 Runtime environment
The profiling module 100 executes in a runtime environment that allows data from the data source(s) to be read and processed as a flow of discrete work elements. The computations performed by the profiling module 100 and processing module 120 can be expressed in terms of data flow through a directed graph, with components of the computations being associated with the vertices of the graph and data flows between the components corresponding to links (arcs, edges) of the graph. A system that implements such graph-based computations is described in U.S. Patent 5,966,072, EXECUTING COMPUTATIONS EXPRESSED AS GRAPHS. Graphs made in accordance with this system provide methods for getting information into and out of individual processes represented by graph components, for moving information between the processes, and for defining a running order for the processes. This system includes algorithms that choose interprocess communication methods (for example, communication paths according to the links of the graph can use TCP/IP or UNIX domain sockets, or use shared memory to pass data between the processes).

The runtime environment also provides for the profiling module 100 to execute as a parallel process. The same type of graphic representation described above may be used to describe parallel processing systems. For purposes of this discussion, parallel processing systems include any configuration of computer systems using multiple central processing units (CPUs), either local (e.g., multiprocessor systems such as SMP computers), or locally distributed (e.g., multiple processors coupled as clusters or MPPs), or remote, or remotely distributed (e.g., multiple processors coupled via LAN or WAN networks), or any combination thereof. Again, the graphs will be composed of components (graph vertices) and flows (graph links). By explicitly or implicitly replicating elements of the graph (components and flows), it is possible to represent parallelism in a system.

A flow control mechanism is implemented using input queues for the links entering a component. This flow control mechanism allows data to flow between the components of a graph without being written to non-volatile local storage, such as a disk drive, which is typically large but slow. The input queues can be kept small enough to hold work elements in volatile memory, typically smaller and faster than non-volatile memory. This potential savings in storage space and time exists even for very large data sets. Components can use output buffers instead of, or in addition to, input queues. When two components are connected by a flow, the upstream component sends work elements to the downstream component as long as the downstream component keeps consuming the work elements. If the downstream component falls behind, the upstream component will fill up the input queue of the downstream component and stop working until the input queue clears out again.

Computation graphs can be specified with various levels of abstraction. So a "sub-graph" containing components and links can be represented within another graph as a single component, showing only those links which connect to the rest of the graph.
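The back-pressure behavior of the input queues can be sketched with ordinary threading primitives. This is an illustrative analogy only, not the implementation described in the referenced patent: a bounded queue blocks the upstream producer whenever the downstream consumer falls behind, so work elements need not be written to non-volatile storage.

    import queue
    import threading

    input_queue = queue.Queue(maxsize=100)   # small enough to stay in memory
    DONE = object()                          # sentinel marking the end of the flow

    def upstream(records):
        for record in records:
            input_queue.put(record)          # blocks when the queue is full
        input_queue.put(DONE)

    def downstream(process):
        while True:
            record = input_queue.get()
            if record is DONE:
                break
            process(record)

    producer = threading.Thread(target=upstream, args=(range(1000),))
    consumer = threading.Thread(target=downstream, args=(print,))
    producer.start(); consumer.start()
    producer.join(); consumer.join()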
4 Profiling graph
Referring to FIG. 3, in a preferred embodiment, a profiling graph 400 performs computations for the profiling module 100. An input data set component 402 represents data from potentially several types of data systems. The data systems may have different physical media types (e.g., magnetic, optical, magneto-optical) and/or different data format types (e.g., binary, database, spreadsheet, ASCII string, CSV, or XML).

The input data set component 402 sends a data flow into a make census component 406. The make census component 406 conducts a "census" of the data set, creating a separate census record for each unique field/value pair in the records that flow into the component. Each census record includes a count of the number of occurrences of the unique field/value pair for that census record. The make census component 406 has a cleaning option which can map a set of invalid values onto valid values according to validation information stored in a corresponding field object. The cleaning option can also store records having fields containing invalid values in a location represented by an invalid records component 408. The invalid records can then be examined, for example, by a user wanting to determine the source of an invalid value.

In the illustrated embodiment, the census records flowing out of the make census component 406 are stored in a file represented by the census file component 410. This intermediate storage of census records may, in some cases, increase efficiency for multiple graph components accessing the census records. Alternatively, the census records can flow directly from the make census component 406 to an analyze census component 412 without being stored in a file. The analyze census component 412 creates a histogram of values for each field and performs other analyses of the data set based on the census records. In the illustrated embodiment, the field profiles component 414 represents an intermediate storage location for the field profiles. A load metadata store component 416 loads the field profiles and other profiling results into the corresponding objects in the metadata store 112.

The user interface 116 allows a user to browse through the analyzed data, for example, to see histograms or common values in fields. A "drill-down" capability is provided, for example, to view specific records that are associated with a bar in a histogram. The user can also update preferences through the user interface 116 based on results of the profiling. The make samples component 418 stores a collection of sample records 420 representing a sampling of records associated with a value shown on the user interface 116 (e.g., associated with a bar in a histogram). The phase break line 422 represents two phases of execution in the graph 400, such that the components on the right side of the line begin execution after all the components on the left side of the line finish execution. Therefore, the make samples component 418 executes after the analyze census component 412 finishes storing results in the field profiles component 414. Alternatively, sample records can be retrieved from a recorded location in the input data set 402.

The profiling module 100 can be initiated by a user 118 or by an automated scheduling program. Upon initiation of the profiling module 100, a master script (not shown) collects any DML files and parameters to be used by the profiling graph 400 from the metadata store 112.
Parameters can be obtained from objects such as the profile setup object 200, the data set object 206, and the field objects 218. If necessary, the master script can create a new DML file based on information supplied about the data set to be profiled. For convenience, the master script can compile the parameters into a job file. The master script may then execute the profiling graph 400 with the appropriate parameters from the job file, and present a progress display keeping track of the time elapsed and estimating time remaining before the profiling graph 400 completes execution. The estimated time remaining is calculated based on data (e.g., work elements) that is written to the metadata store 112 as the profiling graph 400 executes.
4.1 Data format interpretation

An import component implements the portion of the profiling module 100 that can interpret the data format of a wide variety of data systems. The import component is configured to directly interpret some data formats without using a DML file. For example, the import component can read data from a data system that uses structured query language (SQL), which is an ANSI standard computer language for accessing and manipulating databases. Other data formats that are handled without use of a DML file are, for example, text files formatted according to an XML standard or using comma-separated values (CSV). For other data formats the import component uses a DML file specified in the profile setup object 200.

A DML file can specify various aspects of interpreting and manipulating data in a data set. For example, a DML file can specify the following for a data set:

type object - defines a correspondence between raw data and the values represented by the raw data.
key specifier - defines ordering, partitioning, and grouping relationships among records.
expression - defines a computation using values from constants, the fields of data records, or the results of other expressions to produce a new value.
transform function - defines collections of rules and other logic used to produce one or more output records from zero or more input records.
package - provides a convenient way of grouping type objects, transform functions, and variables that can be used by a component to perform various tasks.

A type object is the basic mechanism used to read individual work elements (e.g., individual records) from raw data in a data system. The runtime environment provides access to a physical computer-readable storage medium (e.g., a magnetic, optical, or magneto-optical medium) as a string of raw data bits (e.g., mounted in a file system or flowing over a network connection). The import component can access a DML file to determine how to read and interpret the raw data in order to generate a flow of work elements.

Referring to FIG. 4, a type object 502 can be, for example, a base type 504 or a compound type 506. A base type object 504 specifies how to interpret a string of bits (of a given length) as a single value. The base type object 504 includes a length specification indicating the number of raw data bits to be read and parsed. A length specification can indicate a fixed length, such as a specified number of bytes, or a variable length, specifying a delimiter (e.g., a specific character or string) at the end of the data, or a number of (potentially variable length) characters to be read. A void type 514 represents a block of data whose meaning or internal structure is unnecessary to interpret (e.g., compressed data that will not be interpreted until after it is decompressed). The length of a void type 514 is specified in bytes. A number type 516 represents a number and is interpreted differently if the number is designated an integer 524, real 526, or decimal 528, according to various encodings that are standard or native to a particular CPU. A string type 518 is used to interpret text with a specified character set. Date 520 and datetime 522 types are used to interpret a calendar date and/or time with a specified character set and other formatting information. A compound type 506 is an object made up of multiple sub-objects which are themselves either base or compound types.
A vector type 508 is an object containing a sequence of objects of the same type (either a base or compound type). The number of sub-objects in the vector (i.e., the length of the vector) can be indicated by a constant in the DML file or by a rule (e.g., a delimiter indicating the end of the vector), enabling profiling of vectors with varying lengths. A record type 510 is an object containing a sequence of objects, each of which can be a different base or compound type. Each object in the sequence corresponds to a value associated with a named field. Using a record type 510, a component can interpret a block of raw data to extract values for all of the fields of a record. A union type 512 is an object similar to a record type 510 except that objects corresponding to different fields may interpret the same raw data bits as different values. The union type 512 provides a way to have several interpretations of the same raw data.

The DML file also enables profiling of data with custom data types. A user can define a custom type object by supplying a type definition in terms of other DML type objects, either base or compound. A custom type object can then be used by the profiling module 100 to interpret data with a non-standard structure.

The DML file also enables profiling of data with a conditional structure. Records may only include some fields based on values associated with other fields. For example, a record may only include the field "spouse" if the value of the field "married" is "yes." The DML file includes a rule for determining whether a conditional field exists for a given record. If the conditional field does exist in a record, the value of the field can be interpreted by a DML type object.

The import component can be used by graphs to efficiently handle various types of record structures. The ability of the import component to interpret records with variable record structure such as conditional records or variable length vectors enables graphs to process such data without the need to first flatten such data into fixed length segments. Another type of processing that can be performed by graphs using the import component is discovery of relationships between or among parts of the data (e.g., across different records, tables, or files). Graphs can use a rule within the import component to find a relationship between a foreign key or field in one table and a primary key or field in another table, or to perform functional dependency calculations on parts of the data.

4.2 Statistics
Referring to FIG. 5A, a sub-graph 600 implementing one embodiment of the make census component 406 includes a filter component 602 that passes a portion of incoming records based on a filter expression stored in the profile setup object 200. The filter expression may limit the fields or number of values profiled. An example of a filter expression is one that limits profiling to a single field of each incoming record (e.g., "title"). Another optional function of the filter component 602 is to implement the cleaning option described above, sending a sample of invalid records to the invalid records component 408. Records flowing out of the filter component 602 flow into a local rollup sequence stats component 604 and a partition by round-robin component 612.

The ability of the profiling graph 400 (and other graphs and sub-graphs) to run in parallel on multiple processors and/or computers, and the ability of the profiling graph 400 to read a parallel data set stored across multiple locations, are implicitly represented in the sub-graph 600 by line thicknesses of the components and symbols on the links between components. The thick border of components representing storage locations such as the input data set component 402 indicates that it may optionally be a parallel data set. The thick border of the process components such as the filter component 602 indicates that the process may optionally be running in multiple partitions, with each partition running on a different processor or computer. The user can indicate through the user interface 116 whether to run the optionally parallel graph components in parallel or serially. A thin border indicates that a data set or process is serial.

The local rollup sequence stats component 604 computes statistics related to the sequential characteristics of the incoming records. For example, the component 604 may count the number of sequential pairs of records that have values for a field that increase, decrease, or increment by 1. In the case of parallel execution, the sequence statistics are calculated for each partition separately. A rollup process involves combining information from multiple input elements (sequence statistics for the rollup process performed by this component 604) and producing a single output element in place of the combined input elements. A gather link symbol 606 represents a combination or "gathering" of the data flows from any multiple partitions of a parallel component into a single data flow for a serial component. The global rollup sequence stats component 608 combines the "local" sequence statistics from multiple partitions into a single "global" collection of sequence statistics representing records from all of the partitions. The resulting sequence statistics may be stored in a temporary file 610.

FIG. 6 is a flowchart of an example of a process 700 for performing a rollup process, including the rollup processes performed by the local rollup sequence stats component 604 and the global rollup sequence stats component 608. The process 700 begins by receiving 702 an input element. The process 700 then updates 704 information being compiled, and determines 706 whether there are any more elements to be compiled. If there are more elements, the process 700 receives 702 the next element and updates 704 the information accordingly. When there are no more elements, the process 700 finalizes 708 the output element based on the compiled rollup information.
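The receive/update/finalize structure of process 700 can be sketched as follows, here for a rollup that combines per-partition field/value counts into global counts. The function name and the tuple representation of the elements are assumptions made for the example.

    def rollup(elements):
        """Combine a group of input elements into a single output element.

        Each element is a ((field, value), count) tuple, as produced by a
        local rollup in one partition; the output holds the global counts.
        """
        compiled = {}
        for key, count in elements:                       # receive 702 each element
            compiled[key] = compiled.get(key, 0) + count  # update 704 compiled info
        return compiled                                   # finalize 708 the output

    partition_1 = [(("title", "Mr."), 3), (("title", "Mrs."), 2)]
    partition_2 = [(("title", "Mr."), 4)]
    print(rollup(partition_1 + partition_2))
    # {('title', 'Mr.'): 7, ('title', 'Mrs.'): 2}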
A rollup process can be used to consolidate a group of elements into a single element, or to determine aggregate properties of a group of elements (such as statistics of values in those elements). The partition by round-robin component 612 takes records from the single or multiple partitions of the input data set 402 and re-partitions the records among a number of parallel processors and/or computers (e.g., as selected by the user) in order to balance the work load among the processors and/or computers. A cross-connect link symbol 614 represents the re-partitioning of the data flows (performed by the linked component 612).

The canonicalize component 616 takes in a flow of records and sends out a flow of census elements, each containing a field/value pair representing the value of one field in an input record. (An input record with ten fields yields a flow of ten census elements.) Each value is converted into a canonical (i.e., according to a pre-determined format) human readable string representation. Also included in the census element are flags indicating whether the value is valid and whether the value is null (i.e., corresponds to a pre-determined "null" value). The census elements flow into a local rollup field/value component which (for each partition) takes occurrences of the same value for the same field and combines them into one census element including a count of the number of occurrences. Another output of the canonicalize component 616 is a count of the total number of fields and values, which are gathered for all the partitions and combined in a rollup total counts component 618. The total counts are stored in a temporary file 620 for loading into the data set profile 216.

FIG. 7 is a flowchart of an example of a process 710 performed by the canonicalize component that can handle conditional records, which may not all have the same fields, to produce a flow of census elements containing field/value pairs. The process 710 performs a nested loop which begins with getting 712 a new record. For each record, the process 710 gets 714 a field in that record and determines 716 whether that field is a conditional field. If the field is conditional, the process 710 determines 718 whether that field exists for that record. If the field does exist, the process 710 canonicalizes 720 the record's value for that field and produces a corresponding output element containing a field/value pair. If the field does not exist, the process 710 proceeds to determine 722 whether there is another field or to determine 724 whether there is another record. If the field is not conditional, the process 710 canonicalizes 720 the record's value for that field (including possibly a null value) and proceeds to the next field or record.

The partition by field/value component 624 re-partitions the census elements by field and value so that the rollup process performed in the global rollup field/value component 626 can add the occurrences calculated in different partitions to produce a total occurrences count in a single census element for each unique field/value pair contained within the profiled records. The global rollup field/value component 626 processes these census elements in potentially multiple partitions for a potentially parallel file represented by the census file component 410.

FIG. 5B is a diagram that illustrates a sub-graph 630 implementing the analyze census component 412 of the profiling graph 400.
A partition by field component 632 reads a flow of census elements from the census file component 410 and re-partitions the census elements according to a hash value based on the field, such that census records with the same field (but different values) are in the same partition. The partition into string, number, date component 634 further partitions the census elements according to the type of the value in the census element. Different statistics are computed (using a rollup process) for values that are strings (in the rollup string component 636), numbers (in the rollup number component 638), or dates/datetimes (in the rollup date component 640). For example, it may be appropriate to calculate average and standard deviation for a number, but not for a string. The results are gathered from all partitions, and the compute histogram/decile info component 642 provides information useful for constructing histograms (e.g., maximum and minimum values for each field) to a compute buckets component 654, and information useful for calculating decile statistics (e.g., the number of values for each field) to a compute deciles component 652. The components of the sub-graph 630 that generate the histograms and decile statistics (below the phase break line 644) execute after the compute histogram/decile info component 642 (above the phase break line 644) finishes execution.

The sub-graph 630 constructs a list of values at decile boundaries (e.g., the value larger than 10% of values, 20% of values, etc.) by: sorting the census elements by value within each partition (in the sort component 646), re-partitioning the census elements according to the sorted value (in the partition by values component 648), and merging the elements into a sorted (serial) flow into the compute deciles component 652. The compute deciles component 652 counts the sorted values for each field in groups of one-tenth of the total number of values in that field to find the values at the decile boundaries.

The sub-graph 630 constructs histograms for each field by: calculating the values defining each bracket of values (or "bucket"), counting values within each partition falling in the same bucket (in the local rollup histogram component 656), and counting values within each bucket from all the partitions (in the global rollup histogram component 658). A combine field profile parts component 660 then collects all of the information for each field profile, including the histograms, decile statistics, and the sequence statistics from the temporary file 610, into the field profiles component 414.

FIG. 5C is a diagram that illustrates a sub-graph 662 implementing the make samples component 418 of the profiling graph 400. As in the sub-graph 600, a partition by round-robin component 664 takes records from the single or multiple partitions of the input data set 402 and re-partitions the records among a number of parallel processors and/or computers in order to balance the work load among the processors and/or computers. A lookup and select component 666 uses information from the field profiles component
414 to determine whether a record corresponds to a value shown on the user interface 116 that can be selected by a user for drill-down viewing. Each type of value shown in the user interface 116 corresponds to a different "sample type." If a value in a record corresponds to a sample type, the lookup and select component 666 computes a random selection number that determines whether the record is selected to represent a sample type. For example, for a total of five sample records of a particular sample type, if the selection number is one of the five largest seen so far (of a particular sample type within a single partition), then the corresponding record is passed as an output along with information indicating what value(s) may correspond to drill-down viewing. With this scheme, the first five records of any sample type are automatically passed to the next component, as well as any other records that have one of the five largest selection numbers seen so far. The next component is a partition by sample type component 668 which re-partitions the records according to sample type so that the sort component 670 can sort by selection number within each sample type. The scan component 672 then selects the five records with the largest selection numbers (among all the partitions) for each sample type. The write/link sample component 674 then writes these sample records to a sample records file 420 and links the records to the corresponding values in the field profiles component 414.

The load metadata store component 416 loads the data set profile from the temporary file component 620 into a data set profile 216 object in the metadata store 112, and loads each of the field profiles from the field profiles component 414 into a field profile 222 object in the metadata store 112. The user interface 116 can then retrieve the profiling results for a data set and display them to a user 118 on a screen produced by the user interface 116. A user can browse through the profile results to see histograms or common values for fields. A drill-down capability may be provided, for example, to view specific records that are associated with a bar in a histogram.

FIGS. 8A-C are example user interface screen outputs showing profiling results. FIG. 8A shows results from a data set profile 216. Various totals 802 are shown for the data set as a whole, along with a summary 804 of properties associated with the profiled fields. FIGS. 8B-C show results from an exemplary field profile 222. A selection of values, such as most common values 806 and most common invalid values 808, are displayed in various forms including: the value itself as a human readable string 810, a total count of occurrences of the value 812, occurrences as a percentage of the total number of values 814, and a bar chart 816. A histogram of values 818 is displayed showing a bar for each of multiple buckets spanning the range of values, including buckets with counts of zero. The decile boundaries 820 are also displayed.
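The selection-number scheme used by the lookup and select component 666 amounts to drawing a fixed-size uniform random sample per sample type: tag each record with a random number and keep the records with the k largest tags. The sketch below is a simplified single-partition illustration with assumed names, not the component itself.

    import heapq
    import random

    def select_samples(records, k=5):
        """Keep the k records with the largest random selection numbers."""
        heap = []                    # min-heap of (selection_number, index, record)
        for i, record in enumerate(records):
            entry = (random.random(), i, record)   # index breaks rare ties
            if len(heap) < k:
                heapq.heappush(heap, entry)
            elif entry[0] > heap[0][0]:
                heapq.heapreplace(heap, entry)
        return [record for _, _, record in heap]

    print(select_samples(["record-%d" % n for n in range(100)]))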
5 Examples
5.1 Data discovery
FIG. 9 shows a flowchart for an example of a procedure 900 for profiling a data set to discover its contents before using it in another process. The procedure 900 can be performed automatically (e.g., by a scheduling script) or manually (e.g., by a user at a terminal). The procedure 900 first identifies 902 a data set to be profiled on one or more data systems accessible within the runtime environment. The procedure 900 may then optionally set a record format 904 and set validation rules 906 based on supplied information or existing metadata. For some types of data, such as a database table, a default record format and validation rules may be used. The procedure 900 then runs 908 a profile on the data set (or a subset of the data set). The procedure 900 can refine 910 the record format, or refine 912 the validation rules, based on the results of the initial profile. If any profiling options have changed, the procedure 900 then decides 914 whether to run another profile on the data using the new options, or to process 916 the data set if enough information about the data set has been obtained from the (possibly repeated) profiling. The process would read directly from the one or more data systems using the information obtained from the profiling.
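As a sketch, procedure 900 is an iterate-until-satisfied loop. The helper functions below are hypothetical stand-ins for the steps of FIG. 9, with trivial stub bodies so the skeleton runs; they are not actual interfaces of the profiling module.

    def run_profile(data_set, fmt, rules):      # step 908
        return {"invalid_pct": 0.5}             # stub profile result

    def refine_format(fmt, profile):            # step 910
        return fmt

    def refine_rules(rules, profile):           # step 912
        return rules

    def enough_information(profile):            # decision 914
        return profile["invalid_pct"] < 1.0

    def discover(data_set, fmt=None, rules=None):
        while True:
            profile = run_profile(data_set, fmt, rules)
            fmt = refine_format(fmt, profile)
            rules = refine_rules(rules, profile)
            if enough_information(profile):
                return fmt, rules               # ready for processing 916

    print(discover("monthly_feed.dat"))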
5.2 Quality testing
FIG. 10 shows a flowchart for an example of a procedure 1000 for profiling a data set to test its quality before transforming and loading it into a data store. The procedure 1000 can be performed automatically or manually. Rules for testing the quality of a data set can come from prior knowledge of the data set, and/or from results of a profiling procedure such as procedure 900 performed on a similar data set (e.g., a data set from the same source as the data set to be tested). This procedure 1000 can be used by a business, for example, to profile a periodic (e.g., monthly) data feed sent from a business partner before importing or processing the data. This would enable the business to detect "bad" data (e.g., data with a percentage of invalid values higher than a threshold) so it doesn't "pollute" an existing data store by actions that may be difficult to undo.

The procedure 1000 first identifies 1002 a data set to be tested on one or more data systems accessible within the runtime environment. The procedure 1000 then runs 1004 a profile on the data set (or a subset of the data set) and performs 1006 a quality test based on results of the profile. For example, a percentage of occurrences of a particular common value in the data set can be compared with a percentage of occurrences of the common value in a prior data set (based on a prior profiling run), and if the percentages differ from each other by more than 10%, the quality test fails. This quality test could be applied to a value in a series of data sets that is known to occur consistently (within 10%). The procedure 1000 determines 1008 the results of the quality test, and generates 1010 a flag (e.g., a user interface prompt or an entry in a log file) upon failure. If the quality test is passed, the procedure 1000 then reads directly from the one or more data systems and transforms (possibly using information from the profile) and loads 1012 data from the data set into a data store. The procedure can then repeat by identifying 1002 another data set.
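The example quality test above can be written down directly. The sketch assumes each profile exposes, for a given field, the percentage of records holding a particular common value; the function name and profile representation are assumptions for the example.

    def quality_test(current_profile, prior_profile, field, value, tolerance=10.0):
        """Fail if a common value's share of records shifted by more than
        `tolerance` percentage points between the two profiling runs."""
        current_pct = current_profile[field][value]
        prior_pct = prior_profile[field][value]
        return abs(current_pct - prior_pct) <= tolerance

    prior = {"state": {"CA": 40.0}}
    current = {"state": {"CA": 27.5}}
    if not quality_test(current, prior, "state", "CA"):
        print("quality test failed; flagging the data feed")   # step 1010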
5.3 Code generation
The profiling module 100 can generate executable code such as a graph component that can be used to process a flow of records from a data set. The generated component can filter incoming records, allowing only valid records to flow out, similar to the cleaning option of the profiling graph 400. For example, the user can select a profiling option that indicates that a clean component should be generated upon completion of a profiling run. Code for implementing the component is directed to a file location (specified by the user). The generated clean component can then run in the same runtime environment as the profiling module 100 using information stored in the metadata store 112 during the profiling run.
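Code generation of this kind can be pictured as emitting a small filter from the stored validation information. The sketch below is illustrative only; the emitted component in the system described here is a graph component, not a Python module, and all names are assumptions.

    def generate_clean_component(valid_values, cleaning_map, path):
        """Write source code for a stand-alone clean function to `path`."""
        source = (
            "VALID = {!r}\n"
            "CLEAN = {!r}\n"
            "def clean(value):\n"
            "    value = CLEAN.get(value, value)\n"
            "    return value if value in VALID else None\n"
        ).format(valid_values, cleaning_map)
        with open(path, "w") as f:
            f.write(source)

    generate_clean_component({"Mr.", "Mrs.", "Dr.", "Ms."}, {"Msr.": "Mrs."},
                             "clean_component.py")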
6 Joint-field analysis
The profiling module 100 can optionally perform an analysis of relationships within one or more groups of fields. For example, the profiling module 100 is able to perform an analysis between the two fields of a pair, which may be in the same or in different data sets. Similarly, the profiling module is able to perform analysis on a number of pairs of fields, for example analyzing every field in one data set with every field in another data set, or every field in one data set with every other field in the same data set. An analysis of two fields in different data sets is related to the characteristics of a join operation on the two data sets on those fields, as described in more detail below.

In a first approach to joint-field analysis, a join operation is performed on two data sets (e.g., files or tables). In another approach, described below in section 6.1, after the make census component 406 generates a census file for a data set, the information in the census file can be used to perform the joint-field analysis between fields in two different profiled data sets, or between fields in two different parts of the same profiled data set (or any other data set for which a census file exists). The result of joint-field analysis includes information about potential relationships between the fields. Three types of relationships that are discovered are: a "common domain" relationship, a "joins well" relationship, and a "foreign key" relationship. A pair of fields is categorized as having one of these three types of relationships if results of the joint-field analysis meet certain criteria, as described below.
The joint-field analysis includes compiling information such as the number of records produced from a join operation performed using the two fields as key fields. FIGS. 11A-B illustrate examples of a join operation performed on records from two database tables. Each of Table A and Table B has two fields labeled "Field 1" and "Field 2," and four records. Referring to FIG. 11A, a join component 1100 compares values from a key field of records from Table A with values from a key field of records from Table B. For Table A, the key field is Field 1, and for Table B, the key field is also Field 1. So the join component 1100 compares values 1102 from Table A, Field 1 (A1) with values 1104 from Table B, Field 1 (B1). The join component 1100 receives flows of input records 1110 from the tables, and, based on the comparison of key-field values, produces a flow of joined records 1112 forming a new joined table, Table C. The join component 1100 produces a joined record that is a concatenation of the records with matching key-field values, for each pair of matching key-field values in the input flows. The number of joined records with a particular key-field value that exit on the joined output port 1114 is the Cartesian product of the number of records with that key-field value in each of the inputs, respectively. In the illustrated example, the input flows of records 1110 are shown labeled by the values of their respective key fields, and the output flow of joined records 1112 is shown labeled by the matched values. Since two "X" values appear in each of the two input flows, there are four "X" values in the output flow. Records in one input flow having a key-field value that does not match any record in the other input flow exit on "rejected" output ports 1116A and 1116B, for the Table A and Table B input flows, respectively. In the illustrated example, a "W" value appears on the rejected output port 1116A.

The profiling module 100 compiles statistics of joined and rejected values for categorizing the relationship between two fields. The statistics are summarized in an occurrence chart 1118 that categorizes occurrences of values in the two fields. An "occurrence number" represents the number of times a value occurs in a field. The columns of the chart correspond to occurrence numbers 0, 1, and N (where N > 1) for the first field (from Table A in this example), and the rows of the chart correspond to occurrence numbers 0, 1, and N (where N > 1) for the second field (from Table B in this example). The boxes in the chart contain counts associated with the corresponding pattern of occurrence: 'column occurrence number' x 'row occurrence number'. Each box contains two counts: the number of distinct values that have that pattern of occurrence, and the total number of individual joined records for those values. In some cases the values occur in both fields (i.e., having a pattern of occurrence: 1x1, 1xN, Nx1, or NxN), and in other cases the values occur in only one field (i.e., having a pattern of occurrence: 1x0, 0x1, Nx0, or 0xN). The counts are separated by a comma. The occurrence chart 1118 contains counts corresponding to the joined records 1112 and the rejected record on port 1116A. The value "W" on the rejected output port 1116A corresponds to the "1, 1" counts in the box for the 1x0 pattern of occurrence, indicating a single value and a single record, respectively.
The value "X" corresponds to the "1, 4" counts in the box for the NxN pattern of occurrence since the value "X" occurs twice in each input flow, for a total of four joined records. The value "Y" corresponds to the "1, 2" counts in the box for the 1xN pattern of occurrence since the value "Y" occurs once in the first input flow and twice in the second input flow, for a total of two joined records.

FIG. 11B illustrates an example similar to the example of FIG. 11A, but with a different pair of key fields. For Table A, the key field is Field 1, and for Table B, the key field is Field 2. So the join component compares values 1102 from Table A, Field 1 (A1) with values 1120 from Table B, Field 2 (B2). This example has an occurrence chart 1122 with counts corresponding to the flows of input records 1124 for these fields. Similar to the example in FIG. 11A, there is a single rejected value "Z" that corresponds to the "1, 1" counts in the box for the 0x1 pattern of occurrence. However, in this example, there are two values, "W" and "Y," that both have the 1x1 pattern of occurrence, corresponding to the "2, 2" counts in the box for the 1x1 pattern of occurrence since there are two values, and two joined records. The value "X" corresponds to the "1, 2" counts in the box for the Nx1 pattern of occurrence, indicating a single value and two joined records.

Various totals are calculated from the numbers in the occurrence chart. Some of these totals include the total number of distinct key-field values occurring in both Table A and Table B, the total number of distinct key-field values occurring in Table A, the total number of distinct key-field values occurring in Table B, and the total number of unique values (i.e., values occurring in only a single record of the key field) in each table. Some statistics based on these totals are used to determine whether a pair of fields has one of the three types of relationships mentioned above. The statistics include the percentages of total records in a field that have distinct or unique values, percentages of total records having a particular pattern of occurrence, and the "relative value overlap" for each field. The relative value overlap is the percentage of distinct values occurring in one field that also occur in the other.

The criteria for determining whether a pair of fields has one of the three types of relationships (which are not necessarily mutually exclusive) are:

foreign key relationship - a first one of the fields has a high relative value overlap (e.g., >99%) and the second field has a high percentage (e.g., >99%) of unique values. The second field is potentially a primary key and the first field is potentially a foreign key of the primary key.
joins well relationship - at least one of the fields has a small percentage (e.g., <10%) of rejected records, and the percentage of individual joined records having a pattern of occurrence of NxN is small (e.g., <1%).
common domain relationship - at least one of the fields has a high relative value overlap (e.g., >95%).

If a pair of fields has both a foreign key and a joins well or common domain relationship, the foreign key relationship is reported. If a pair of fields has both a joins well relationship and a common domain relationship, but not a foreign key relationship, the joins well relationship is reported.
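These criteria and the reporting precedence can be sketched as a small classifier over the totals described above. The statistics dictionary is an assumed representation; the thresholds are the example figures given in the text.

    def classify_relationship(stats):
        """Report at most one relationship, honoring the precedence above.

        stats keys (percentages, 0-100): overlap_a, overlap_b (relative value
        overlap of each field), unique_a, unique_b (percent unique values),
        rejected_a, rejected_b (percent rejected records), and nxn_joined
        (percent of joined records with an NxN pattern of occurrence).
        """
        foreign_key = ((stats["overlap_a"] > 99 and stats["unique_b"] > 99) or
                       (stats["overlap_b"] > 99 and stats["unique_a"] > 99))
        joins_well = (min(stats["rejected_a"], stats["rejected_b"]) < 10 and
                      stats["nxn_joined"] < 1)
        common_domain = max(stats["overlap_a"], stats["overlap_b"]) > 95
        if foreign_key:
            return "foreign key"
        if joins_well:
            return "joins well"
        if common_domain:
            return "common domain"
        return None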
6.1 Census join
Referring to FIG. 12A, in an alternative to actually performing a join operation on the tables, a census join component 1200 analyzes fields from Table A and Table B and compiles the statistics for an occurrence chart by performing a "census join" operation from census data for the tables. Each census record has a field/value pair and a count of the occurrences of the value in the field. Since each census record has a unique field/value pair, for a given key field, the values in an input flow of the census join component 1200 are unique.

The example of FIG. 12A corresponds to the join operation on the pair of key fields A1, B1 (illustrated in FIG. 11A). By comparing census records corresponding to the key fields in the join operation, with filter 1202 selecting "Field 1" (A1) and filter 1204 selecting "Field 1" (B1), the census join component 1200 potentially makes a much smaller number of comparisons than a join component 1100 that compares key fields of individual records from Table A and Table B. The example of FIG. 12B corresponds to the join operation on the pair of key fields A1, B2 (illustrated in FIG. 11B), with filter 1206 selecting "Field 1" (A1) and filter 1208 selecting "Field 2" (B2). The selected census records 1210-1218 are shown labeled by the value for their respective field in the field/value pair, and the count of occurrences for that value.

If the census join component 1200 finds a match between the values in two input census records 1210-1218, the output record contains the matched value, the corresponding pattern of occurrence based on the two counts, and the total number of records that would be generated in a join operation on the pair of key fields (which is just the product of the two counts). If no match is found for a value, the value is also output with a corresponding pattern of occurrence and a total number of records (which is the single count in the single input record). This information within the output records of the census join component 1200 is sufficient to compile all of the counts in an occurrence chart for the join operation.

In the example of FIG. 12A, the value "W" appears at the output with an occurrence pattern of 1x0 and a total of 1, the value "X" appears at the output with an occurrence pattern of NxN and a total of 4, and the value "Y" appears at the output with an occurrence pattern of 1xN and a total of 2. This information corresponds to the information in the occurrence chart 1118 of FIG. 11A. In the example of FIG. 12B, the value "W" appears at the output with an occurrence pattern of 1x1 and a total of 1, the value "X" appears at the output with an occurrence pattern of Nx1 and a total of 2, the value "Y" appears at the output with an occurrence pattern of 1x1 and a total of 1, and the value "Z" appears at the output with an occurrence pattern of 0x1 and a total of 1. This information corresponds to the information in the occurrence chart 1122 of FIG. 11B.
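A census join over one pair of key fields reduces to matching two value-to-count maps. The sketch below uses an assumed representation (each field's census as a dictionary from value to occurrence count) and reproduces the outputs described above for FIG. 12A.

    def census_join(census_a, census_b):
        """Yield (value, pattern, joined_total) for each value in either census.

        A missing value has occurrence number 0, counts above 1 are reported
        as "N", and the joined total is the product of the two counts.
        """
        def occ(count):
            return "N" if count > 1 else str(count)
        for value in sorted(set(census_a) | set(census_b)):
            a, b = census_a.get(value, 0), census_b.get(value, 0)
            total = a * b if a and b else max(a, b)
            yield value, "%sx%s" % (occ(a), occ(b)), total

    a1 = {"W": 1, "X": 2, "Y": 1}   # census for key field A1 (FIG. 12A)
    b1 = {"X": 2, "Y": 2}           # census for key field B1
    for row in census_join(a1, b1):
        print(row)   # ('W', '1x0', 1), ('X', 'NxN', 4), ('Y', '1xN', 2)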
6.2 Extended records
A joint-field analysis for multiple field pairs in a single census join operation includes generating "extended records" based on the census records. In the example illustrated in FIG. 13, the census join component 1200 compares records for a joint-field analysis of both pairs of key fields A1, B1 and A1, B2, combining the joint-field analyses illustrated in FIGS. 12A-B. An extended record is generated from a census record by concatenating a unique identifier for the pair of key fields that are being joined with the value in the census record, and keeping the same count of occurrences as the census record. If a joint-field analysis includes results of a field being joined with multiple other fields, then multiple extended records are generated for each value in that field. For example, the census record 1210 corresponds to two extended records 1301-1302, with the value "W" concatenated with an identifier "A1B1" and "A1B2," respectively. The census join component 1200 handles the extended record 1301 just as it would handle a census record with the value "WA1B1." Likewise, the census record 1211 corresponds to the two extended records 1303-1304, and census record 1212 corresponds to the two extended records 1305-1306. In the joint-field analysis of FIG. 13, the field B1 is only joined with one other field (A1), so each census record 1213-1214 corresponds to a single extended record 1307-1308, respectively. Likewise, the field B2 is joined with one other field (A1), so each census record 1215-1218 corresponds to a single extended record 1309-1312. Each extended record includes a value based on the original value concatenated with a unique field identifier.

Referring to FIG. 14, an extend component 1400 processes input census records to generate extended records, based on join information 1401 indicating which fields are being joined with which other fields in the joint-field analysis. In this example, the join information 1401 indicates that a field F1 from census data for table T1 (having four census records 1402) is being joined with four other fields: field F1 from census data for table T2 (having two census records 1404), field F2 from census data for table T2 (having two census records 1406), field F1 from census data for table T3 (having two census records 1408), and field F2 from census data for table T3 (having two census records 1410). A census record 1412 flowing into the extend component 1400 represents one of the four census records 1402 from census data for table T1, having field F1 and value Vi, where i = 1, 2, 3, or 4. The extend component 1400 generates four extended records 1413-1416 for the input census record 1412.

The census join component 1200 uses unique identifiers for fields, including fields in different tables having the same name. The extended record 1413 has a value c(T1,F1,T2,F1,Vi) that is a concatenation of the original value Vi with identifiers for the fields being joined as well as for the table (or file or other data source) from which the census data for the field was generated. Including the identifier for the table enables fields having the same name to be distinguished. The extended record 1415 has a value c(T1,F1,T3,F1,Vi) that can be distinguished from the value c(T1,F1,T2,F1,Vi) of the extended record 1413, where both tables T2 and T3 have the same field name F1. Alternatively, a unique number can be assigned to each field and used in place of the field name.

6.3 Joint-field analysis graphs
6.3 Joint-field analysis graphs

FIGS. 15A-B show graphs used by the profiling module 100 to perform an optional joint-field analysis of selected fields in sources (e.g., tables or files) within data sources 30. A user 118 selects options for profiling and for joint-field analysis, including the option of performing profiling without joint-field analysis. The user 118 selects field pairs for joint-field analysis, including two specific fields paired with each other, one field paired with every other field, or every field paired with every other field. The user 118 also selects an option to allow pairing of fields within the same table or file, or to allow pairing of fields only from different tables or files. These options are stored in the metadata store 112.

Referring to FIG. 15A, for each source (e.g., a table or file) of fields specified in the joint-field analysis options, the graph 1500 generates a file with prepared census data 1510 for those specified fields. The graph 1500 executes once for each such source included in the joint-field analysis. A filter 1504 receives records from census data 1502 generated by the make census component 406 and prepares the records for joint-field analysis. The filter 1504 discards records for fields that are not included in the analysis (as determined by user options stored in the metadata store 112). The filter 1504 also discards invalid values, null values, and other values not included in a meaningful analysis of the content of data sources (e.g., known data flags). The values in the census data 1502 have been canonicalized by a canonicalize component 616 within the make census component 406. However, these canonicalized values may have portions that should not be used in a logical comparison of values (e.g., strings with leading or trailing spaces, or numbers with leading or trailing zeros). The user 118 can select an option for these values to be compared "literally" or "logically." If the user 118 selects "literal" comparison, then the values in the census records are left in the canonicalized form. If the user 118 selects "logical" comparison, then the filter 1504 converts values in the census records according to rules such as stripping leading and trailing spaces, and stripping leading and trailing zeros for numbers. The partition by value component 1506 re-partitions the records based on the value in the census record. Any census records with the same value are put into the same partition. This allows the joint-field analysis to be run in parallel across any number of partitions. Since the census join component 1200 only produces an output record for input records with matching values, census records (or any extended records generated from them) in different partitions do not need to be compared with one another. A rollup logical values component 1508 combines any census records that have matching field/value pairs due to the conversion performed by the filter 1504. The combined record has a count of occurrences that is the sum of the counts for all of the combined records. For example, if a census record with a field, value, count of "amount, 01.00, 5" is converted to "amount, 1, 5" and a census record with a field, value, count of "amount, 1.0, 3" is converted to "amount, 1, 3," then the rollup logical values component 1508 combines these two converted records into a single record with a field, value, count of "amount, 1, 8."
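The "logical" conversion and the rollup of colliding logical values can be sketched as follows (hypothetical Python; to_logical is an illustrative stand-in for the conversion rules applied by the filter 1504):

```python
from collections import Counter

def to_logical(value):
    """Convert a canonicalized literal value to a 'logical' form by
    stripping leading/trailing spaces and insignificant zeros on numbers."""
    v = value.strip()
    try:
        # Numbers compare by numeric value: "01.00" and "1.0" both become "1".
        num = float(v)
        v = str(int(num)) if num == int(num) else str(num)
    except ValueError:
        pass  # non-numeric values keep their (trimmed) literal form
    return v

def rollup_logical(census_records):
    """Combine census records whose field/value pairs collide after
    logical conversion, summing their occurrence counts."""
    combined = Counter()
    for field, value, count in census_records:
        combined[(field, to_logical(value))] += count
    return combined

records = [("amount", "01.00", 5), ("amount", "1.0", 3)]
print(rollup_logical(records))  # Counter({('amount', '1'): 8})
```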
Referring to FIG. 15B, for each pair of sources, source A and source B, having one or more fields to be compared, as specified in the joint-field analysis options, the graph 1512 executes using the prepared census data A 1514 and prepared census data B 1516, each prepared by graph 1500. Two extend components 1400 receive records from these sets of prepared census data, along with join information 1515 specifying the specific fields in source A to be compared with specific fields in source B. Extended records flow into a census join component 1200 that generates records containing values, patterns of occurrence, and counts for occurrence charts for the fields being compared. A local rollup join statistics component 1518 compiles the information in these records within each partition. The records in the various partitions are then gathered and compiled by a global rollup join statistics component 1520 that outputs a file 1522 containing all of the joint-field analysis statistics for the fields in all of the pairs of sources that are analyzed. The results of the joint-field analysis, including which of the three types of relationship potentially exists between various fields, are loaded into the metadata store 112 for presentation to the user 118. For example, the user 118 can select a link on the user interface 116 for a pair of fields with a potential relationship and view a page on the user interface 116 with detailed analysis results, including counts from an occurrence chart for the pair of fields.

Referring to FIG. 15C, when a joint-field analysis is performed for two fields within the same source (source C), the graph 1524 executes using the prepared census data C 1526 prepared by graph 1500. A single extend component 1400 receives records from the set of prepared census data C 1526, along with join information 1528 specifying the specific fields in source C to be compared. Extended records flow into both ports of a census join component 1200 that generates records containing values, patterns of occurrence, and counts for occurrence charts for the fields being compared. In the case of joint-field analysis options indicating that every field in source C is to be compared with every other field in source C (source C having four fields: F1, F2, F3, F4), one approach is for the join information 1528 to specify twelve ordered pairs of fields (F1-F2, F1-F3, F1-F4, F2-F1, F2-F3, F2-F4, F3-F1, F3-F2, F3-F4, F4-F1, F4-F2, F4-F3). However, since the same operations are performed for the pairs F1-F3 and F3-F1, some operations are repeated. Accordingly, another approach is for the join information to specify only the six unique pairs F1-F2, F1-F3, F1-F4, F2-F3, F2-F4, F3-F4. In this case, the results in the output file 1530 are augmented to include results for the other six field pairs by reversing the order of the fields in the analysis results for the six pairs that were analyzed.
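A sketch of the unique-pair optimization for FIG. 15C (hypothetical Python; the assumption that reversing a pair reverses the sides of its occurrence pattern, e.g. "1xN" becoming "Nx1", is an illustration of "reversing the order of the fields in the analysis results"):

```python
from itertools import combinations

def analysis_pairs(fields):
    """Yield only unordered field pairs; (F1, F3) and (F3, F1) would
    repeat the same comparisons, so one of each mirror pair is skipped."""
    return list(combinations(fields, 2))

def mirror_results(results):
    """Fill in results for the reversed pairs by swapping the sides of
    each occurrence pattern, avoiding a second analysis pass."""
    full = dict(results)
    for (f1, f2), pattern in results.items():
        left, right = pattern.split("x")
        full[(f2, f1)] = right + "x" + left
    return full

pairs = analysis_pairs(["F1", "F2", "F3", "F4"])
print(len(pairs))                                   # 6 unique pairs, not 12
print(mirror_results({("F1", "F3"): "1xN"}))
# {('F1', 'F3'): '1xN', ('F3', 'F1'): 'Nx1'}
```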
7 Functional dependency analysis
Another type of analysis that the profiling module 100 is able to perform is a test for a functional relationship between values of fields. The fields tested can be from a single table that has a set of fields or from a "virtual table" that includes fields from multiple sources that are related (e.g., through a join operation on the fields using a common key field, as described in more detail in section 7.3). One type of functional relationship between a pair of fields is "functional dependency," where the value associated with one field of a record can be uniquely determined by the value associated with another field of the record. For example, if a database has a State field and a Zip Code field, the value of the Zip Code field (e.g., 90019) determines the value of the State field (e.g., CA). Each value of the Zip Code field maps onto a unique value of the State field (i.e., a "many-to-one" mapping). A functional dependency relationship can also exist among a subset of fields, where the value associated with one field of a record can be uniquely determined by the values associated with other fields of the record. For example, the value of the Zip Code field can be uniquely determined by the values of a City field and a Street field. The functional dependency can also be an "approximate functional dependency," where some but not necessarily all of the values associated with one field map onto a unique value of another field, with a percentage of exceptions that do not map onto the unique value. For example, some records may have an unknown Zip Code that is indicated by a special value 00000. In this case, the value 00000 of the Zip Code field may map onto more than one value of the State field (e.g., CA, FL, and TX). Exceptions can also occur due to records with incorrect values, or other errors. If the percentage of exceptions is smaller than a pre-determined threshold (e.g., as entered by a user), then a field may still be determined to be functionally dependent on another field.

Referring to FIG. 16, an example table 1600 with records (rows) and fields (columns) to be tested for functional dependency or approximate functional dependency is shown. A Last Name field has twelve values corresponding to twelve records (rows 1-12). Ten of the values are unique, and two of the records have the same repeated value name_g. A Citizenship field has two unique values: US occurring eleven times and CANADA occurring once. A Zip Code field has various values, each corresponding to one of three values CA, FL, and TX for a State field. Each value of Zip Code uniquely determines a value of State, except for the Zip Code value 00000 that corresponds to FL in one record (row 10) and to TX in another record (row 12).
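The core test can be sketched directly (hypothetical Python, assuming records are dictionaries; an illustration of approximate functional dependency, not the patented graph 1700). With the counts from table 1600, the sketch reproduces the document's 8.3% and 33% exception rates:

```python
from collections import Counter, defaultdict

def determines(records, f1, f2, threshold=0.10):
    """Approximate functional dependency: does the value of field f1
    determine the value of field f2, allowing up to `threshold` exceptions?"""
    v2_by_v1 = defaultdict(Counter)
    for rec in records:
        v2_by_v1[rec[f1]][rec[f2]] += 1

    good = exceptions = 0
    for v2_counts in v2_by_v1.values():
        # The most frequent f2 value for each f1 value counts as "good";
        # records carrying any other f2 value are exceptions.
        top = v2_counts.most_common(1)[0][1]
        good += top
        exceptions += sum(v2_counts.values()) - top
    return exceptions / (good + exceptions) <= threshold

# Zip/State counts from table 1600 (12 records).
rows = ([{"zip": "90019", "state": "CA"}] * 4
        + [{"zip": "90212", "state": "CA"}] * 2
        + [{"zip": "33102", "state": "FL"}] * 3
        + [{"zip": "00000", "state": "FL"}]
        + [{"zip": "77010", "state": "TX"}]
        + [{"zip": "00000", "state": "TX"}])
print(determines(rows, "zip", "state"))  # True: 11 good, 1 exception (8.3%)
print(determines(rows, "state", "zip"))  # False: 8 good, 4 exceptions (33%)
```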
7.1 Functional dependency analysis graph

FIG. 17 shows an example of a graph 1700 used by the profiling module 100 to perform an optional functional dependency analysis of selected fields in one or more sources (e.g., in a single table or file, or in multiple tables and/or files as described in section 7.3) within data sources 30. A user 118 selects options for profiling and for functional dependency analysis, including the option of performing profiling without functional dependency analysis. The user 118 may choose which pair or pairs of fields are tested for a functional relationship. The user 118 selects particular fields of a data source (e.g., a table or file), and chooses, for example, "all to selected" or "selected to selected" to determine which pairs of fields are tested, or chooses "all to all" to test all pairs of fields in the data source. The user may also select a threshold for determining a degree of functional dependency before deciding that a field is or is not functionally dependent on another field. For example, the user may select a threshold for determining how many exceptions to allow (as a percentage of records). These options are stored in the metadata store 112. For each pair of fields (f1, f2) that is to be analyzed, the graph 1700 determines whether a functional dependency relationship exists, and if so, classifies the relationship between field f1 and field f2 as: "f1 determines f2", "f2 determines f1", "one-to-one" (a one-to-one mapping exists between f1 and f2), or "identical" (f1 has identically the same value as f2 in each of the records). The graph 1700 reads field information 1702 stored by the profiling module 100 to determine unique identifiers for fields to be analyzed. A make pairs component 1704 generates a flow of field pairs (f1, f2) using a pair of unique identifiers for each of the pairs of fields to be tested. The pair (f1, f2) is an ordered pair since the relationship between f1 and f2 is not necessarily symmetric. So both pairs (f1, f2) and (f2, f1) are included in the flow. A select pairs component 1706 limits the field pairs that flow to the rest of the graph 1700 by selecting the field pairs chosen for analysis by the user.
The select pairs component 1706 further limits the pairs that flow to the rest of the graph based on a variety of optimizations. For example, a field is not paired with itself, since such a pair is classified as "identical" by definition. So the pairs (f1, f1), (f2, f2), etc. are not included in the flow. Other optimizations may remove one or more pairs of fields from the flow, as described in more detail below in section 7.2. A broadcast component 1708 broadcasts the serial flow of field pairs to each of the partitions of a (potentially parallel) attach values component 1718, as represented by a broadcast link symbol 1710. Each partition of the attach values component 1718 takes as input a flow of field pairs (e.g., (LastName, Citizenship), (Zip, State), etc.) and a flow of field/value pairs (e.g., (LastName, name_a), (LastName, name_b), (LastName, name_c), ..., (Citizenship, Canada), (Citizenship, US), (Citizenship, US), etc.). To obtain the flow of field/value pairs, a filter component 1712 extracts records from the input data set 402, and optionally removes a portion of the records based on a filter expression. The records flowing out of the filter component 1712 flow into a partition by round-robin component 1714. The partition by round-robin component 1714 takes records from the partitions of the input data set 402 and re-partitions the records among a number of parallel processors and/or computers in order to balance the work load among the processors and/or computers. The canonicalize component 1716 (similar to the canonicalize component 616 described above) takes in a flow of records and sends out a flow of field/value pairs representing values for each field in an input record. As described above, each value is converted into a canonical human-readable string representation. The attach values component 1718 performs a series of join operations to generate a flow of f1/f2/v1/v2 quadruples, where f1 and f2 correspond to one of the field pairs received at the input, and v1 and v2 correspond to values that are paired with those fields in a record. In the example of table 1600, when the Last Name field corresponds to f1 and the Citizenship field corresponds to f2, the attach values component 1718 generates a flow of twelve f1/f2/v1/v2 quadruples including: (LastName/Citizenship/name_a/Canada), (LastName/Citizenship/name_b/US), ..., (LastName/Citizenship/name_k/US), (LastName/Citizenship/name_g/US). The attach values component 1718 generates similar series of f1/f2/v1/v2 quadruples for (Zip, State) and any other pairs of fields that are analyzed. The attach values component 1718 outputs the flow of f1/f2/v1/v2 quadruples into a "local rollup f1/f2/v1/v2" component 1720, which (for each partition) accumulates multiple quadruples with the same fields and values f1, f2, v1, v2 and represents them as a single quadruple with a count of the number of occurrences of the quadruple in the input flow. The output flow of the "local rollup f1/f2/v1/v2" component 1720 consists of quadruples with counts (or "accumulated quadruples"). The accumulation that occurs in the "local rollup f1/f2/v1/v2" component 1720 is within each partition. So it is possible that some quadruples with the same values of f1, f2, v1, v2 are not accumulated by this component 1720. A "partition by f1/f2" component 1721 repartitions the flow of accumulated quadruples such that quadruples with the same fields f1, f2 are in the same partition. A "global rollup f1/f2/v1/v2" component 1722 further accumulates the repartitioned quadruples.
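The two-stage rollup can be sketched with counters (hypothetical Python; real partitions would run in parallel, and the function names are illustrative):

```python
from collections import Counter

def local_rollup(partition):
    """Within one partition, accumulate duplicate f1/f2/v1/v2 quadruples
    into a single quadruple with a count of occurrences."""
    return Counter(partition)

def global_rollup(repartitioned):
    """After repartitioning by (f1, f2), merge per-partition counts so each
    distinct quadruple appears exactly once with its total count."""
    total = Counter()
    for counts in repartitioned:
        total.update(counts)
    return total

q = ("Zip", "State", "90019", "CA")
p1 = local_rollup([q, q, q])                              # one partition
p2 = local_rollup([q, ("Zip", "State", "00000", "FL")])   # another partition
print(global_rollup([p1, p2]))
# Counter({('Zip', 'State', '90019', 'CA'): 4, ('Zip', 'State', '00000', 'FL'): 1})
```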
The output flow of the "global rollup f1/f2/v1/v2" component 1722 consists of unique accumulated quadruples. In the example of table 1600, when the Zip field corresponds to f1 and the State field corresponds to f2, the combined effect of the components 1720-1722 generates the following six accumulated quadruples: (Zip/State/90019/CA, 4), (Zip/State/90212/CA, 2), (Zip/State/33102/FL, 3), (Zip/State/00000/FL, 1), (Zip/State/77010/TX, 1), (Zip/State/00000/TX, 1). When the State field corresponds to f1 and the Zip field corresponds to f2, the combined effect of the components 1720-1722 generates the following six accumulated quadruples: (State/Zip/CA/90019, 4), (State/Zip/CA/90212, 2), (State/Zip/FL/33102, 3), (State/Zip/FL/00000, 1), (State/Zip/TX/77010, 1), (State/Zip/TX/00000, 1). To prepare to test for a functional dependency relationship between a pair of fields, a "global rollup f1/f2/v1" component 1724 combines accumulated quadruples that have both fields f1, f2 and the first value v1 in common. In producing an output element, this component 1724 examines all values of v2 that go with a value of v1 and selects the most frequent v2 to associate with that v1 value. The quadruples sharing the most frequent v2 are counted as "good" and the rest of the quadruples are counted as "exceptions." If there is only one value of v2 for a given v1, then the accumulated quadruples having that value are good and there are no exceptions. If there is a tie for the most frequent value of v2, then the first value is selected. In the example of table 1600, when the Zip field corresponds to f1 and the State field corresponds to f2, the component 1724 generates: (Zip/State/90019/CA, 4 good), (Zip/State/90212/CA, 2 good), (Zip/State/33102/FL, 3 good), (Zip/State/00000/FL, 1 good, 1 exception), (Zip/State/77010/TX, 1 good). When the State field corresponds to f1 and the Zip field corresponds to f2, the component 1724 generates: (State/Zip/CA/90019, 4 good, 2 exceptions), (State/Zip/FL/33102, 3 good, 1 exception), (State/Zip/TX/77010, 1 good, 1 exception). A "global rollup f1/f2" component 1726 adds the good counts and the exceptions for each unique pair of fields f1, f2. In the example of table 1600, when the Zip field corresponds to f1 and the State field corresponds to f2, the component 1726 generates: (Zip/State, 11 good, 1 exception). When the State field corresponds to f1 and the Zip field corresponds to f2, the component 1726 generates: (State/Zip, 8 good, 4 exceptions). A find dependencies component 1728 uses the accumulated co-occurrence statistics (i.e., numbers of good and exceptional records) from the "global rollup f1/f2" component 1726 to determine whether a pair of fields has the relationship "f1 determines f2." If the percentage of exceptions (given by: number of exceptions/(number of good + number of exceptions)) is less than the selected threshold for determining how many exceptions to allow, then the pair of fields has the relationship "f1 determines f2." In the example of table 1600, for a threshold of 10%, when the Zip field corresponds to f1 and the State field corresponds to f2, the percentage of exceptions is 8.3% and the value of the Zip field determines the value of the State field. When the State field corresponds to f1 and the Zip field corresponds to f2, the percentage of exceptions is 33%, so the value of the State field does not determine the value of the Zip field, and the relationship between the Zip and State fields is not a one-to-one mapping.
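Assuming good/exception totals have been accumulated for both orderings of a field pair, the find dependencies step reduces to threshold comparisons. A sketch (hypothetical Python; the "identical" classification is omitted because it requires comparing the values themselves, not just the counts):

```python
def classify_pair(good_12, exc_12, good_21, exc_21, threshold=0.10):
    """Classify a field pair from accumulated counts for
    'f1 determines f2' (good_12/exc_12) and 'f2 determines f1'."""
    fwd = exc_12 / (good_12 + exc_12) <= threshold
    rev = exc_21 / (good_21 + exc_21) <= threshold
    if fwd and rev:
        return "one-to-one"
    if fwd:
        return "f1 determines f2"
    if rev:
        return "f2 determines f1"
    return "no functional dependency"

# Counts from the table 1600 example: Zip->State has 11 good, 1 exception
# (8.3%); State->Zip has 8 good, 4 exceptions (33%).
print(classify_pair(11, 1, 8, 4))  # 'f1 determines f2'
```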
Alternatively, a value based on a mathematical property of the accumulated values can be used to determine whether field f1 determines field f2 (e.g., the conditional entropy of the value of field f2 given the value of field f1, or a standard deviation of a numerical value).
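As a sketch of the conditional-entropy alternative (hypothetical Python): H(f2 | f1) is zero when the value of f1 exactly determines the value of f2 and small when the dependency is approximate, so a small threshold on this quantity could stand in for the exception-percentage test:

```python
import math
from collections import Counter

def conditional_entropy(records, f1, f2):
    """H(f2 | f1) in bits: zero for an exact functional dependency,
    small for an approximate one."""
    pair_counts = Counter((r[f1], r[f2]) for r in records)
    v1_counts = Counter(r[f1] for r in records)
    n = len(records)
    h = 0.0
    for (v1, _), c in pair_counts.items():
        h -= (c / n) * math.log2(c / v1_counts[v1])  # -p(v1,v2) * log p(v2|v1)
    return h

rows = [{"zip": "90019", "state": "CA"}, {"zip": "90019", "state": "CA"},
        {"zip": "00000", "state": "FL"}, {"zip": "00000", "state": "TX"}]
print(conditional_entropy(rows, "zip", "state"))  # 0.5 bits; 0.0 would be exact
```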
7.2 Field pair selection optimizations
A variety of optimizations can be applied to increase the speed of functional dependency analysis, for example, by filtering pairs of fields at the select pairs component 1706, or by filtering records at the filter component 1712. Some optimizations are based on the recognition that some functional dependency relationships discovered by the graph 1700 described above may not be as meaningful to a user as others. For a given pair of fields, some of these cases can be detected and filtered out by the select pairs component 1706 based on statistics provided by the profiling module 100, saving computing resources. For example, if all of the values of a first field f1 are unique (each value occurring in only a single record), then the value of that field f1 determines the value of the second field f2 regardless of the values occurring in the field f2. The graph 1700 can use census data obtained during profiling to compute a probability that a first field f1 determines a second field f2 based on a random (e.g., a uniform probability distribution) pairing of values in the fields. If there is a high probability (e.g., > 10%) that a random pairing would result in a functional dependency, then the field pair is filtered out by the select pairs component 1706. In the example of table 1600, when the LastName field corresponds to f1 and the Citizenship field corresponds to f2, every random pairing of LastName with Citizenship results in all quadruples being counted as good except when one of the name_g values (in row 7 or row 12) is randomly paired with the value Canada. Even when this random pairing occurs (with a probability of 16.7% (2 out of 12 pairings)), the percentage of exceptions is only 8.3%, which is under the threshold. So in this example, the select pairs component 1706 filters the pair (LastName, Citizenship). Another optimization is based on histograms of values calculated by the profiling module 100 from census data. The select pairs component 1706 filters pairs when it is not possible for field f1 to determine field f2. In the example of table 1600, the most frequent value of State occurs 6 times and the most frequent value of Zip occurs only 4 times. So it is not possible for the value of State to determine the value of Zip, since there would be at least 2 out of 6 exceptions for at least half of the values, resulting in at least a 16.7% exception percentage. So in this example, the select pairs component 1706 filters the pair (State, Zip). For a large number of records, the graph 1700 can increase the speed of testing for functional dependency by processing a small sample of the records first to eliminate field pairs that are highly likely not functionally related before processing all of the records. The graph 1700 can use the filter component 1712 to select a portion of the records. Alternatively, the graph 1700 can use the canonicalize component 1716 to select a portion of the field/value pairs. The records or field/value pairs can be sampled based on a variety of criteria. The graph 1700 can sample based on statistics provided by the profiling module 100. For example, the graph 1700 can test for functional dependency based on the most frequent value of the first field f1 (the "determiner"). If the resulting number of exceptions is higher than the threshold, then there is no need to process the rest of the values of the determiner. The graph 1700 can also test for functional dependency based on a random sample of determiner values.
If a sufficient number of quadruples count as good among the sampled values, then the probability of finding a substantial number of exceptions among the other values is assumed to be negligible. Other sampling criteria are possible. Another optional optimization is to test for pre-determined functional relationships between fields based on a library of known functions. This test can be performed on the records or on the values of the quadruples.
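The histogram-based filter described above can be sketched as a cheap lower bound on exceptions (hypothetical Python; the bound is a simplification of the State/Zip argument, using only the top counts from each field's histogram):

```python
def histogram_filter(top_f1_count, top_f2_count, n_records, threshold=0.10):
    """Cheaply rule out 'f1 determines f2': among the records sharing the
    most frequent f1 value, at most top_f2_count can agree on any single
    f2 value, so at least (top_f1_count - top_f2_count) exceptions occur.
    Returns True if the pair is still worth testing in full."""
    min_exceptions = max(0, top_f1_count - top_f2_count)
    return min_exceptions / n_records <= threshold

# Table 1600: State's top value occurs 6 times, Zip's top only 4, in 12
# records, so at least 2/12 = 16.7% exceptions; (State, Zip) is filtered.
print(histogram_filter(6, 4, 12))  # False
```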
7.3 Functional dependency analysis across multiple sources
In one approach for testing for functional dependency across multiple sources (e.g., database tables), the profiling module 100 generates a "virtual table" that includes fields from the multiple sources. The virtual table can be generated, for example, by performing a join operation on the sources using a key field that is common to the sources. In an example of functional dependency analysis using a virtual table, a first data source is a database of motor vehicle registration information (a motor vehicle registry (MVR) database) and a second data source is a database of issued traffic citations (a traffic citation (TC) database). The MVR database includes fields such as make, model, and color, and includes a license field that is designated as a "primary key" field. Each record in the MVR database has a unique value of the license field. The TC database includes fields such as name, date, location, violation, vehicle make, vehicle model, and vehicle color, and includes a vehicle license field that is designated as a "foreign key" field. Each value of the vehicle license field has a corresponding record in the MVR database with that value in the license field. There may be multiple records in the TC database having the same value of the vehicle license field. The profiling module 100 joins records from the MVR database and the TC database to form a virtual table (e.g., as described above with reference to the join component 1100 shown in FIG. 11A). Each record of the virtual table has each of the fields from the two databases, including a single license field that has the matched value from the MVR license field and the TC vehicle license field. A record may, however, have a value of the color field from the MVR database that is different from the value of the vehicle color field from the TC database. For example, the MVR database may use a "BLU" code to indicate the color blue while the TC database uses a "BU" code to indicate the color blue. In this case, if a vehicle has the same color in both databases, the color field will have a "one-to-one" functional relationship with the vehicle color field. Alternatively, a record may have different values for the color field and the vehicle color field if a vehicle has been painted a different color in the time between being registered and receiving a citation. Since the joined virtual table includes fields from each of multiple data sets, the profiling module 100 can discover any of a variety of relationships that may exist between the fields in those data sets. The same or similar dependency analysis as described above can be run on fields in the joined virtual table.
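For illustration only (hypothetical Python with made-up field names; the patent's join component 1100 is a dataflow component, not this function), a virtual table for the MVR/TC example could be formed as:

```python
def virtual_table(mvr_rows, tc_rows):
    """Join motor-vehicle-registry rows (primary key 'license') with
    traffic-citation rows (foreign key 'vehicle_license') into one virtual
    table whose fields can then be tested for functional dependencies."""
    by_license = {row["license"]: row for row in mvr_rows}
    joined = []
    for citation in tc_rows:
        reg = by_license.get(citation["vehicle_license"])
        if reg is not None:
            merged = {**reg, **citation}
            merged.pop("vehicle_license", None)  # keep a single license field
            joined.append(merged)
    return joined

mvr = [{"license": "ABC123", "color": "BLU"}]
tc = [{"vehicle_license": "ABC123", "vehicle_color": "BU"}]
print(virtual_table(mvr, tc))
# [{'license': 'ABC123', 'color': 'BLU', 'vehicle_color': 'BU'}]
```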
The approaches described above can be implemented using software for execution on a computer. For instance, the software forms procedures in one or more computer programs that execute on one or more programmed or programmable computer systems (which may be of various architectures, such as distributed, client/server, or grid), each including at least one processor, at least one data storage system (for example, volatile and non-volatile memory and/or storage elements), at least one input device or port, and at least one output device or port. The software may form one or more modules of a larger program, for example, a program that provides other services related to the design and configuration of graphs. The software may be provided on a medium or device readable by a general or special purpose programmable computer or delivered (encoded in a propagated signal) over a network to the computer where it is executed. All of the functions may be performed on a special purpose computer, or using special-purpose hardware, such as coprocessors. The software may be implemented in a distributed manner in which different parts of the computation specified by the software are performed by different computers. Each such computer program is preferably stored on or downloaded to a storage media or device (e.g., solid state memory or media, or magnetic or optical media) readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage media or device is read by the computer system to perform the procedures described herein. The inventive system may also be considered to be implemented as a computer-readable storage medium, configured with a computer program, where the storage medium so configured causes a computer system to operate in a specific and predefined manner to perform the functions described herein. It is to be understood that the foregoing description is intended to illustrate and not to limit the scope of the invention, which is defined by the scope of the appended claims. Other embodiments are within the scope of the following claims.

Claims

What is claimed is:
1. A method for processing data including: profiling data from a data source, including reading the data from the data source, computing summary data characterizing the data while reading the data, and storing profile information that is based on the summary data; and processing the data from the data source, including accessing the stored profile information and processing the data according to the accessed profile information.
2. The method of claim 1 wherein processing the data from the data source further includes reading the data from the data source.
3. The method of claim 1 wherein profiling the data is performed without maintaining a copy of the data outside the data source.
4. The method of claim 3 wherein the data includes variable record structure records with at least one of a conditional field and a variable number of fields.
5. The method of claim 4 wherein computing summary data characterizing the data while reading the data includes interpreting the variable record structure records while computing summary data characterizing the data.
6. The method of claim 1 wherein the data source includes a data storage system.
7. The method of claim 6 wherein the data storage system includes a database system.
8. The method of claim 1 wherein computing the summary data includes counting a number of occurrences for each of a set of distinct values for a field.
9. The method of claim 8 wherein storing profile information includes storing statistics for the field based on the counted number of occurrences for said field.
10. The method of claim 1 further including maintaining a metadata store that contains metadata related to the data source.
11. The method of claim 10 wherein storing the profile information includes updating the metadata related to the data source.
12. The method of claim 10 wherein profiling the data and processing the data each make use of metadata for the data source.
13. The method of claim 1 wherein profiling data from the data source further includes determining a format specification based on the profile information.
14. The method of claim 1 wherein profiling data from the data source further includes determining a validation specification based on the profile information.
15. The method of claim 14 wherein processing the data includes identifying invalid records in the data based on the validation specification.
16. The method of claim 1 wherein profiling data from the data source further includes specifying data transformation instructions based on the profile information.
17. The method of claim 16 wherein processing the data includes applying the transformation instructions to the data.
18. The method of claim 1 wherein processing the data includes importing the data into a data storage subsystem.
19. The method of claim 18 wherein processing the data includes validating the data prior to importing the data into a data storage subsystem.
20. The method of claim 19 wherein validating the data includes comparing characteristics of the data to reference characteristics for said data.
21. The method of claim 20 wherein the reference characteristics include statistical properties of the data.
22. The method of claim 1 wherein profiling the data includes profiling said data in parallel, including partitioning the data into parts and processing the parts using separate ones of a first set of parallel components.
23. The method of claim 22 wherein profiling the data in parallel further includes computing the summary data for different fields of the data using separate ones of a second set of parallel components.
24. The method of claim 23 wherein profiling the data in parallel further includes repartitioning outputs of the first set of parallel components to form inputs for the second set of parallel components.
25. The method of claim 22 wherein profiling the data in parallel includes reading the data from a parallel data source, each part of the parallel data source being processed by a different one of the first set of parallel components.
26. A method for processing data including: profiling data from a data source, including reading the data from the data source, computing summary data characterizing the data while reading the data, and storing profile information that is based on the summary data; wherein profiling the data includes profiling said data in parallel, including partitioning the data into parts and processing the parts using separate ones of a first set of parallel components.
27. Software including instructions adapted to perform all the method steps of any of claims 1 through 26 when executed on a data processing system.
28. The software of claim 27 embodied on a computer-readable medium.
29. A data processing system including: a profiling module configured to read data from a data source, to compute summary data characterizing the data while reading the data, and to store profile information that is based on the summary data; and a processing module configured to access the stored profile information and to process the data from the data source according to the accessed profile information.
30. A data processing system including: means for profiling data from a data source, including means for reading the data from the data source, means for computing summary data characterizing the data while reading the data, and means for storing profile information that is based on the summary data; and means for processing the data from the data source, including means for accessing the stored profile information and means for processing the data according to the accessed profile information.
31. A method for processing data including: accepting information characterizing values of a first field in records of a first data source and information characterizing values of a second field in records of a second data source; computing quantities characterizing a relationship between the first field and the second field based on the accepted information; and presenting information relating the first field and the second field.
32. The method of claim 31 wherein presenting the information includes presenting the information to a user.
33. The method of claim 31 wherein the first data source and the second data source are the same data source.
34. The method of claim 31 wherein at least one of the first data source and the second data source includes a database table.
35. The method of claim 31 wherein the quantities characterizing the relationship include quantities characterizing joint characteristics of the values of the first field and of the second field.
36. The method of claim 35 wherein the information characterizing the values of the first field includes information characterizing a distribution of values of said first field.
37. The method of claim 36 wherein the information characterizing the distribution of values of the first field includes a plurality of data records, each associating a different value and a corresponding number of occurrences of said value in the first field in the first data source.
38. The method of claim 36 wherein the information characterizing the values of the second field includes information characterizing a distribution of values of said field.
39. The method of claim 38 wherein computing the quantities characterizing the joint characteristics includes processing the information characterizing the distribution of values of the first field and of the second field to compute quantities related to a plurality of categories of co-occurrence of values.
40. The method of claim 39 wherein the information characterizing the distribution of values of the first field and of the second field includes a plurality of data records, each associating a different value and a corresponding number of occurrences of said value and wherein processing the information characterizing said distributions of values includes computing information characterizing a distribution of values in a join of the first data source and the second data source on the first field and the second field, respectively.
41. The method of claim 39 wherein the quantities related to the plurality of categories of co-occurrence of values includes a plurality of data records, each associated with one of the categories of co-occurrence and including a number of unique values of the first and the second fields in said category.
42. The method of claim 35 wherein computing the quantities characterizing the joint characteristics of the values of the first field and the second field includes computing information characterizing a distribution of values in a join of the first data source and the second data source using the first field and the second field, respectively.
43. The method of claim 35 wherein computing the quantities characterizing the joint characteristics of the values of the first field and the second field includes computing quantities related to a plurality of categories of co-occurrence of values.
44. The method of claim 42 wherein the categories of co-occurrence of values include values that occur at least once in one of the first field and the second field but not in the other of said fields.
45. The method of claim 42 wherein the categories of co-occurrence of values include values that occur exactly once in each of the first field and the second field.
46. The method of claim 42 wherein the categories of co-occurrence of values include values that occur exactly once in one of the first field and the second field and more than once in the other of said fields.
47. The method of claim 42 wherein the categories of co-occurrence of values include values that occur more than once in each of the first field and the second field.
48. The method of claim 35 wherein the steps of accepting information characterizing values and computing quantities characterizing joint characteristics of the values are repeated for a plurality of pairs of first and second fields.
49. The method of claim 48 wherein each of the plurality of pairs of fields has a unique identifier that is included with values in the pairs of fields to compute the quantities characterizing the joint characteristics of the values.
50. The method of claim 48 further including presenting information relating the fields of one or more of the plurality of pairs of fields.
51. The method of claim 50 wherein presenting the information relating the fields of one or more of the plurality of pairs of fields includes identifying fields as candidate fields of one of a plurality of types of relationships of fields.
52. The method of claim 51 wherein the plurality of types of relationships of fields includes a primary key and foreign key relationship.
53. The method of claim 51 wherein the plurality of types of relationships of fields includes a common domain relationship.
54. The method of claim 31 wherein computing the quantities includes computing said quantities based on logical values that are converted from literal values of the first field and of the second field.
55. The method of claim 37 wherein computing the quantities includes computing said quantities in parallel, including partitioning the data records into parts and processing the parts using separate ones of a set of parallel components.
56. The method of claim 55 wherein the parts are based on values of the first field and of the second field.
57. The method of claim 56 wherein data records having the same value are in the same part.
58. Software including instructions adapted to perform all the method steps of any of claims 31 through 57 when executed on a data processing system.
59. The software of claim 58 embodied on a computer-readable medium.
60. A system for processing data including: a values processing module configured to accept information characterizing values of a first field in records of a first data source and information characterizing values of a second field in records of a second data source; a relationship processing module configured to compute quantities characterizing a relationship between the first field and the second field based on the accepted information; an interface configured to present information relating the first field and the second field.
61. A system for processing data including: means for accepting information characterizing values of a first field in records of a first data source and information characterizing values of a second field in records of a second data source; means for computing quantities characterizing a relationship between the first field and the second field based on the accepted information; means for presenting information relating the first field and the second field.
62. A method for processing data including: identifying a plurality of subsets of fields of data records of a data source; determining co-occurrence statistics for each of the plurality of subsets; and identifying one or more of the plurality of subsets as having a functional relationship among the fields of the identified subset.
63. The method of claim 62 wherein at least one of the subsets of fields is a subset of two fields.
64. The method of claim 62 wherein identifying one or more of the plurality of subsets as having a functional relationship among the fields of the identified subset includes identifying one or more of the plurality of subsets as having one of a plurality of possible predetermined functional relationships.
65. The method of claim 62 wherein determining the co-occurrence statistics includes forming data elements each identifying a pair of fields and identifying a pair of values occurring in the pair of fields in one of the data records.
66. The method of claim 62 wherein determining the co-occurrence statistics includes: partitioning the data records into parts, the data records having a first field and a second field; determining a quantity based on a distribution of values that occur in the second field of one or more records in a first of the parts, the one or more records having a common value occurring in a first field of the one or more records; and combining the quantity with other quantities from records in other of the parts to generate a total quantity.
67. The method of claim 66 wherein identifying one or more of the plurality of subsets as having a functional relationship among the fields of the identified subset includes identifying a functional relationship between the first and second fields based on the total quantity.
68. The method of claim 66 wherein the parts are based on values of the first field and of the second field.
69. The method of claim 66 wherein the parts are processed using separate ones of a set of parallel components.
70. The method of claim 62 wherein identifying one or more of the plurality of subsets as having a functional relationship among the fields of the identified subset includes determining a degree of match to said functional relationship.
71. The method of claim 70 wherein the degree of match includes a number of exceptional records that are not consistent with said functional relationship.
72. The method of claim 62 wherein the functional relationship includes a mapping of at least some of the values of a first field onto at least some of the values of a second field.
73. The method of claim 72 wherein the mapping is a many-to-one mapping.
74. The method of claim 72 wherein the mapping is a one-to-many mapping.
75. The method of claim 72 wherein the mapping is a one-to-one mapping.
76. The method of claim 62 further including filtering the plurality of subsets based on information characterizing values in fields of the plurality of subsets.
77. The method of claim 62 wherein the data records include records of a database table.
78. The method of claim 77 wherein the data records include records of a plurality of database tables.
79. Software including instructions adapted to perform all the method steps of any of claims 62 through 78 when executed on a data processing system.
80. The software of claim 79 embodied on a computer-readable medium.
81. A system for processing data including: an identification processing module configured to identify a plurality of subsets of fields of data records of a data source; a statistics processing module configured to determine co-occurrence statistics for each of the plurality of subsets; and a functional relationship processing module configured to identify one or more of the plurality of subsets as having a functional relationship among the fields of the identified subset.
82. A system for processing data including: means for identifying a plurality of subsets of fields of data records of a data source; means for determining co-occurrence statistics for each of the plurality of subsets; and means for identifying one or more of the plurality of subsets as having a functional relationship among the fields of the identified subset.
PCT/US2004/030144 2003-09-15 2004-09-15 Data profiling WO2005029369A2 (en)

Priority Applications (9)

Application Number Priority Date Filing Date Title
EP04784113A EP1676217B1 (en) 2003-09-15 2004-09-15 Data profiling
JP2006526986A JP5328099B2 (en) 2003-09-15 2004-09-15 Data profiling
AT04784113T ATE515746T1 (en) 2003-09-15 2004-09-15 DATA PROFILING
AU2004275334A AU2004275334B9 (en) 2003-09-15 2004-09-15 Data Profiling
CA002538568A CA2538568C (en) 2003-09-15 2004-09-15 Data profiling
EP10009234.5A EP2261821B1 (en) 2003-09-15 2004-09-15 Data profiling
HK06114200.1A HK1093568A1 (en) 2003-09-15 2006-12-28 Data profiling
AU2009200294A AU2009200294A1 (en) 2003-09-15 2009-01-28 Data profiling
AU2009200293A AU2009200293B2 (en) 2003-09-15 2009-01-28 Data profiling

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US50290803P 2003-09-15 2003-09-15
US60/502,908 2003-09-15
US51303803P 2003-10-20 2003-10-20
US60/513,038 2003-10-20
US53295603P 2003-12-22 2003-12-22
US60/532,956 2003-12-22

Publications (3)

Publication Number Publication Date
WO2005029369A2 true WO2005029369A2 (en) 2005-03-31
WO2005029369A3 WO2005029369A3 (en) 2005-08-25
WO2005029369A9 WO2005029369A9 (en) 2006-05-04

Family

ID=34381971

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/US2004/030144 WO2005029369A2 (en) 2003-09-15 2004-09-15 Data profiling

Country Status (10)

Country Link
US (5) US8868580B2 (en)
EP (3) EP2261820A3 (en)
JP (3) JP5328099B2 (en)
KR (4) KR100922141B1 (en)
CN (1) CN102982065B (en)
AT (1) ATE515746T1 (en)
AU (3) AU2004275334B9 (en)
CA (3) CA2538568C (en)
HK (1) HK1093568A1 (en)
WO (1) WO2005029369A2 (en)

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007032846A1 (en) * 2005-09-12 2007-03-22 Microsoft Corporation Remoting redirection layer for graphics device interface
US7852342B2 (en) 2004-10-14 2010-12-14 Microsoft Corporation Remote client graphics rendering
US7889203B2 (en) 2004-10-14 2011-02-15 Microsoft Corporation Encoding for remoting graphics to decoder device
CN103348598A (en) * 2011-01-28 2013-10-09 起元科技有限公司 Generating data pattern information
WO2014065918A1 (en) * 2012-10-22 2014-05-01 Ab Initio Technology Llc Characterizing data sources in a data storage system
US8868580B2 (en) 2003-09-15 2014-10-21 Ab Initio Technology Llc Data profiling
WO2015109047A1 (en) * 2014-01-16 2015-07-23 Ab Initio Technology Llc Database key identification
WO2015134193A1 (en) * 2014-03-07 2015-09-11 Ab Initio Technology Llc Managing data profiling operations related to data type
US9323749B2 (en) 2012-10-22 2016-04-26 Ab Initio Technology Llc Profiling data with location information
US9892026B2 (en) 2013-02-01 2018-02-13 Ab Initio Technology Llc Data records selection
US20210200757A1 (en) * 2017-07-25 2021-07-01 Capital One Services, Llc Systems and methods for expedited large file processing
US11068540B2 (en) 2018-01-25 2021-07-20 Ab Initio Technology Llc Techniques for integrating validation results in data profiling and related systems and methods
US11347715B2 (en) * 2007-09-27 2022-05-31 Experian Information Solutions, Inc. Database system for triggering event notifications based on updates to database records
US11373261B1 (en) 2004-09-22 2022-06-28 Experian Information Solutions, Inc. Automated analysis of data to generate prospect notifications based on trigger events
US11399029B2 (en) 2018-09-05 2022-07-26 Consumerinfo.Com, Inc. Database platform for realtime updating of user data from third party sources
US11410230B1 (en) 2015-11-17 2022-08-09 Consumerinfo.Com, Inc. Realtime access and control of secure regulated data
US11729230B1 (en) 2015-11-24 2023-08-15 Experian Information Solutions, Inc. Real-time event-based notification system
US11861691B1 (en) 2011-04-29 2024-01-02 Consumerinfo.Com, Inc. Exposing reporting cycle information
US11978114B1 (en) 2009-01-06 2024-05-07 Consumerinfo.Com, Inc. Report existence monitoring

Families Citing this family (194)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7653641B2 (en) * 2004-05-04 2010-01-26 Accruent, Inc. Abstraction control solution
US7349898B2 (en) * 2004-06-04 2008-03-25 Oracle International Corporation Approximate and exact unary inclusion dependency discovery
US7647293B2 (en) * 2004-06-10 2010-01-12 International Business Machines Corporation Detecting correlation from data
US7386566B2 (en) * 2004-07-15 2008-06-10 Microsoft Corporation External metadata processing
US7610264B2 (en) * 2005-02-28 2009-10-27 International Business Machines Corporation Method and system for providing a learning optimizer for federated database systems
US8117300B2 (en) * 2005-04-25 2012-02-14 Invensys Systems, Inc Supporting both asynchronous and synchronous data transfers between production event information sources and a production information database
US7836104B2 (en) * 2005-06-03 2010-11-16 Sap Ag Demonstration tool for a business information enterprise system
US7877350B2 (en) * 2005-06-27 2011-01-25 Ab Initio Technology Llc Managing metadata for graph-based computations
US20070006070A1 (en) * 2005-06-30 2007-01-04 International Business Machines Corporation Joining units of work based on complexity metrics
US8788464B1 (en) * 2005-07-25 2014-07-22 Lockheed Martin Corporation Fast ingest, archive and retrieval systems, method and computer programs
US20070033198A1 (en) * 2005-08-02 2007-02-08 Defries Anthony Data representation architecture for media access
US20070073721A1 (en) * 2005-09-23 2007-03-29 Business Objects, S.A. Apparatus and method for serviced data profiling operations
US20070074176A1 (en) * 2005-09-23 2007-03-29 Business Objects, S.A. Apparatus and method for parallel processing of data profiling information
US8996586B2 (en) * 2006-02-16 2015-03-31 Callplex, Inc. Virtual storage of portable media files
US7873628B2 (en) * 2006-03-23 2011-01-18 Oracle International Corporation Discovering functional dependencies by sampling relations
US20070271259A1 (en) * 2006-05-17 2007-11-22 It Interactive Services Inc. System and method for geographically focused crawling
US7526486B2 (en) * 2006-05-22 2009-04-28 Initiate Systems, Inc. Method and system for indexing information about entities with respect to hierarchies
EP2030134A4 (en) 2006-06-02 2010-06-23 Initiate Systems Inc A system and method for automatic weight generation for probabilistic matching
US7711736B2 (en) * 2006-06-21 2010-05-04 Microsoft International Holdings B.V. Detection of attributes in unstructured data
US8356009B2 (en) 2006-09-15 2013-01-15 International Business Machines Corporation Implementation defined segments for relational database systems
US7698268B1 (en) 2006-09-15 2010-04-13 Initiate Systems, Inc. Method and system for filtering false positives
US7685093B1 (en) 2006-09-15 2010-03-23 Initiate Systems, Inc. Method and system for comparing attributes such as business names
WO2008034213A1 (en) * 2006-09-18 2008-03-27 Infobright Inc. A method and system for data compression in a relational database
US8266147B2 (en) * 2006-09-18 2012-09-11 Infobright, Inc. Methods and systems for database organization
US8762834B2 (en) * 2006-09-29 2014-06-24 Altova, Gmbh User interface for defining a text file transformation
US7809747B2 (en) * 2006-10-23 2010-10-05 Donald Martin Monro Fuzzy database matching
US9846739B2 (en) 2006-10-23 2017-12-19 Fotonation Limited Fast database matching
US20080097992A1 (en) * 2006-10-23 2008-04-24 Donald Martin Monro Fast database matching
US7613707B1 (en) * 2006-12-22 2009-11-03 Amazon Technologies, Inc. Traffic migration in a multilayered data service framework
US8150870B1 (en) 2006-12-22 2012-04-03 Amazon Technologies, Inc. Scalable partitioning in a multilayered data service framework
US7774329B1 (en) 2006-12-22 2010-08-10 Amazon Technologies, Inc. Cross-region data access in partitioned framework
CN101226523B (en) * 2007-01-17 2012-09-05 国际商业机器公司 Method and system for analyzing data general condition
US8359339B2 (en) 2007-02-05 2013-01-22 International Business Machines Corporation Graphical user interface for configuration of an algorithm for the matching of data records
US20080195575A1 (en) * 2007-02-12 2008-08-14 Andreas Schiffler Electronic data display management system and method
US8515926B2 (en) * 2007-03-22 2013-08-20 International Business Machines Corporation Processing related data from information sources
WO2008121170A1 (en) 2007-03-29 2008-10-09 Initiate Systems, Inc. Method and system for parsing languages
US8423514B2 (en) 2007-03-29 2013-04-16 International Business Machines Corporation Service provisioning
WO2008121824A1 (en) 2007-03-29 2008-10-09 Initiate Systems, Inc. Method and system for data exchange among data sources
US8370355B2 (en) 2007-03-29 2013-02-05 International Business Machines Corporation Managing entities within a database
US20120164613A1 (en) * 2007-11-07 2012-06-28 Jung Edward K Y Determining a demographic characteristic based on computational user-health testing of a user interaction with advertiser-specified content
US8069129B2 (en) 2007-04-10 2011-11-29 Ab Initio Technology Llc Editing and compiling business rules
US20090254588A1 (en) * 2007-06-19 2009-10-08 Zhong Li Multi-Dimensional Data Merge
US20110010214A1 (en) * 2007-06-29 2011-01-13 Carruth J Scott Method and system for project management
KR101526626B1 (en) * 2007-07-12 2015-06-05 아트멜 코포레이션 Two-dimensional touch panel
US20090055828A1 (en) * 2007-08-22 2009-02-26 Mclaren Iain Douglas Profile engine system and method
AU2008302144B2 (en) * 2007-09-20 2014-09-11 Ab Initio Technology Llc Managing data flows in graph-based computations
AU2008304255B2 (en) 2007-09-28 2013-03-14 International Business Machines Corporation Method and system for associating data records in multiple languages
US8713434B2 (en) 2007-09-28 2014-04-29 International Business Machines Corporation Indexing, relating and managing information about entities
EP2193415A4 (en) 2007-09-28 2013-08-28 Ibm Method and system for analysis of a system for matching data records
US8321914B2 (en) * 2008-01-21 2012-11-27 International Business Machines Corporation System and method for verifying an attribute in records for procurement application
US8224797B2 (en) * 2008-03-04 2012-07-17 International Business Machines Corporation System and method for validating data record
US8046385B2 (en) * 2008-06-20 2011-10-25 Ab Initio Technology Llc Data quality tracking
WO2010002919A1 (en) 2008-06-30 2010-01-07 Ab Initio Software Llc Data logging in graph-based computations
US8239389B2 (en) * 2008-09-29 2012-08-07 International Business Machines Corporation Persisting external index data in a database
KR101789608B1 (en) 2008-10-23 2017-10-25 아브 이니티오 테크놀로지 엘엘시 A method, and a computer-readable record medium storing a computer program for performing a data operation
KR101661532B1 (en) * 2008-12-02 2016-09-30 아브 이니티오 테크놀로지 엘엘시 Mapping instances of a dataset within a data management system
JP5490824B2 (en) * 2009-01-30 2014-05-14 アビニシオ テクノロジー エルエルシー Data processing using vector fields
AU2010213618B9 (en) 2009-02-13 2015-07-30 Ab Initio Technology Llc Managing task execution
US8051060B1 (en) * 2009-02-13 2011-11-01 At&T Intellectual Property I, L.P. Automatic detection of separators for compression
US9846732B2 (en) * 2009-02-13 2017-12-19 Ab Initio Technology Llc Communicating with data storage systems
US10102398B2 (en) 2009-06-01 2018-10-16 Ab Initio Technology Llc Generating obfuscated data
KR101660853B1 (en) * 2009-06-10 2016-09-28 아브 이니티오 테크놀로지 엘엘시 Generating test data
JP2011008560A (en) * 2009-06-26 2011-01-13 Hitachi Ltd Information management system
US8205113B2 (en) 2009-07-14 2012-06-19 Ab Initio Technology Llc Fault tolerant batch processing
JP6121163B2 (en) * 2009-09-16 2017-04-26 アビニシオ テクノロジー エルエルシー Mapping dataset elements
US8683214B2 (en) * 2009-09-17 2014-03-25 Panasonic Corporation Method and device that verifies application program modules
US8700577B2 (en) * 2009-12-07 2014-04-15 Accenture Global Services Limited GmbH Method and system for accelerated data quality enhancement
JP6084037B2 (en) * 2009-12-14 2017-02-22 アビニシオ テクノロジー エルエルシー Specifying user interface elements
US9477369B2 (en) * 2010-03-08 2016-10-25 Salesforce.Com, Inc. System, method and computer program product for displaying a record as part of a selected grouping of data
US8205114B2 (en) * 2010-04-07 2012-06-19 Verizon Patent And Licensing Inc. Method and system for partitioning data files for efficient processing
US8577094B2 (en) 2010-04-09 2013-11-05 Donald Martin Monro Image template masking
US8417727B2 (en) 2010-06-14 2013-04-09 Infobright Inc. System and method for storing data in a relational database
US8521748B2 (en) 2010-06-14 2013-08-27 Infobright Inc. System and method for managing metadata in a relational database
WO2011159759A1 (en) 2010-06-15 2011-12-22 Ab Initio Technology Llc Dynamically loading graph-based computations
JP5826260B2 (en) 2010-06-22 2015-12-02 Ab Initio Technology LLC Processing related datasets
WO2012008951A1 (en) * 2010-07-13 2012-01-19 Hewlett-Packard Development Company, L.P. Methods, apparatus and articles of manufacture to archive data
US8515863B1 (en) * 2010-09-01 2013-08-20 Federal Home Loan Mortgage Corporation Systems and methods for measuring data quality over time
EP2633398B1 (en) 2010-10-25 2020-05-27 Ab Initio Technology LLC Managing data set objects in a dataflow graph that represents a computer program
KR20120061308A (en) * 2010-12-03 2012-06-13 Samsung Electronics Co., Ltd. Apparatus and method for db controlling in portable terminal
CN103329130B (en) 2011-01-14 2016-12-07 Ab Initio Technology LLC Managing changes to collections of data
US9021299B2 (en) 2011-02-18 2015-04-28 Ab Initio Technology Llc Restarting processes
US9116759B2 (en) 2011-02-18 2015-08-25 Ab Initio Technology Llc Restarting data processing systems
WO2012124270A1 (en) * 2011-03-15 2012-09-20 Panasonic Corporation Tamper monitoring system, administration device, protection control module, and sensor module
US20120330880A1 (en) * 2011-06-23 2012-12-27 Microsoft Corporation Synthetic data generation
US8782016B2 (en) * 2011-08-26 2014-07-15 Qatar Foundation Database record repair
US9116934B2 (en) * 2011-08-26 2015-08-25 Qatar Foundation Holistic database record repair
US8863082B2 (en) * 2011-09-07 2014-10-14 Microsoft Corporation Transformational context-aware data source management
US8719271B2 (en) 2011-10-06 2014-05-06 International Business Machines Corporation Accelerating data profiling process
US9438656B2 (en) 2012-01-11 2016-09-06 International Business Machines Corporation Triggering window conditions by streaming features of an operator graph
US9430117B2 (en) * 2012-01-11 2016-08-30 International Business Machines Corporation Triggering window conditions using exception handling
US20130304712A1 (en) * 2012-05-11 2013-11-14 Theplatform For Media, Inc. System and method for validation
US9582553B2 (en) * 2012-06-26 2017-02-28 Sap Se Systems and methods for analyzing existing data models
US9633076B1 (en) * 2012-10-15 2017-04-25 Tableau Software Inc. Blending and visualizing data from multiple data sources
US10489360B2 (en) * 2012-10-17 2019-11-26 Ab Initio Technology Llc Specifying and applying rules to data
US10108521B2 (en) 2012-11-16 2018-10-23 Ab Initio Technology Llc Dynamic component performance monitoring
US9507682B2 (en) 2012-11-16 2016-11-29 Ab Initio Technology Llc Dynamic graph performance monitoring
US9703822B2 (en) 2012-12-10 2017-07-11 Ab Initio Technology Llc System for transform generation
EP2757467A1 (en) * 2013-01-22 2014-07-23 Siemens Aktiengesellschaft Management apparatus and method for managing data elements of a version control system
US9135280B2 (en) * 2013-02-11 2015-09-15 Oracle International Corporation Grouping interdependent fields
US9471545B2 (en) 2013-02-11 2016-10-18 Oracle International Corporation Approximating value densities
US9110949B2 (en) 2013-02-11 2015-08-18 Oracle International Corporation Generating estimates for query optimization
US9811233B2 (en) 2013-02-12 2017-11-07 Ab Initio Technology Llc Building applications for configuring processes
US10332010B2 (en) 2013-02-19 2019-06-25 Business Objects Software Ltd. System and method for automatically suggesting rules for data stored in a table
US9576036B2 (en) 2013-03-15 2017-02-21 International Business Machines Corporation Self-analyzing data processing job to determine data quality issues
KR101444249B1 (en) * 2013-05-13 2014-09-26 Atria Trading Co., Ltd. Method, system and non-transitory computer-readable recording medium for providing information on securities lending and borrowing transaction, short selling or equity swap transaction
WO2014186673A2 (en) * 2013-05-17 2014-11-20 Ab Initio Technology Llc Managing memory and storage space for a data operation
US20150032907A1 (en) * 2013-07-26 2015-01-29 Alcatel-Lucent Canada, Inc. Universal adapter with context-bound translation for application adaptation layer
WO2015027085A1 (en) 2013-08-22 2015-02-26 Genomoncology, Llc Computer-based systems and methods for analyzing genomes based on discrete data structures corresponding to genetic variants therein
EP3049913B1 (en) 2013-09-27 2022-05-11 Ab Initio Technology LLC Evaluating rules applied to data
US20150120224A1 (en) * 2013-10-29 2015-04-30 C3 Energy, Inc. Systems and methods for processing data relating to energy usage
CA3114544A1 (en) 2013-12-05 2015-06-11 Ab Initio Technology Llc Managing interfaces for dataflow composed of sub-graphs
SG11201604364UA (en) 2013-12-18 2016-07-28 Ab Initio Technology Llc Data generation
US9529849B2 (en) 2013-12-31 2016-12-27 Sybase, Inc. Online hash based optimizer statistics gathering in a database
US9984173B2 (en) * 2014-02-24 2018-05-29 International Business Machines Corporation Automated value analysis in legacy data
WO2015138497A2 (en) 2014-03-10 2015-09-17 Interana, Inc. Systems and methods for rapid data analysis
US9846567B2 (en) 2014-06-16 2017-12-19 International Business Machines Corporation Flash optimized columnar data layout and data access algorithms for big data query engines
US9633058B2 (en) 2014-06-16 2017-04-25 International Business Machines Corporation Predictive placement of columns during creation of a large database
US10318283B2 (en) 2014-07-18 2019-06-11 Ab Initio Technology Llc Managing parameter sets
EP3189418B1 (en) * 2014-09-02 2022-02-23 AB Initio Technology LLC Visually specifying subsets of components in graph-based programs through user interactions
US9626393B2 (en) 2014-09-10 2017-04-18 Ab Initio Technology Llc Conditional validation rules
US9880818B2 (en) * 2014-11-05 2018-01-30 Ab Initio Technology Llc Application testing
US10055333B2 (en) 2014-11-05 2018-08-21 Ab Initio Technology Llc Debugging a graph
US10296507B2 (en) 2015-02-12 2019-05-21 Interana, Inc. Methods for enhancing rapid data analysis
US9952808B2 (en) 2015-03-26 2018-04-24 International Business Machines Corporation File system block-level tiering and co-allocation
CN104850590A (en) * 2015-04-24 2015-08-19 Baidu Online Network Technology (Beijing) Co., Ltd. Method and device for generating metadata of structured data
US11068647B2 (en) * 2015-05-28 2021-07-20 International Business Machines Corporation Measuring transitions between visualizations
KR101632073B1 (en) * 2015-06-04 2016-06-20 Jang Won-jung Method, device, system and non-transitory computer-readable recording medium for providing data profiling based on statistical analysis
CA3128629A1 (en) 2015-06-05 2016-07-28 C3.Ai, Inc. Systems and methods for data processing and enterprise ai applications
US9384203B1 (en) * 2015-06-09 2016-07-05 Palantir Technologies Inc. Systems and methods for indexing and aggregating data records
US10409802B2 (en) * 2015-06-12 2019-09-10 Ab Initio Technology Llc Data quality analysis
US10241979B2 (en) * 2015-07-21 2019-03-26 Oracle International Corporation Accelerated detection of matching patterns
US10657134B2 (en) 2015-08-05 2020-05-19 Ab Initio Technology Llc Selecting queries for execution on a stream of real-time data
US10127264B1 (en) 2015-09-17 2018-11-13 Ab Initio Technology Llc Techniques for automated data analysis
US10607139B2 (en) 2015-09-23 2020-03-31 International Business Machines Corporation Candidate visualization techniques for use with genetic algorithms
EP3369013A4 (en) * 2015-10-30 2019-04-10 Acxiom Corporation Automated interpretation for the layout of structured multi-field files
US10140337B2 (en) * 2015-10-30 2018-11-27 Sap Se Fuzzy join key
JP6184641B1 (en) * 2016-02-26 2017-08-23 Hitachi, Ltd. Analysis system and analysis method for executing analysis processing using at least part of time-series data and analysis data as input data
US10685035B2 (en) 2016-06-30 2020-06-16 International Business Machines Corporation Determining a collection of data visualizations
US10423387B2 (en) 2016-08-23 2019-09-24 Interana, Inc. Methods for highly efficient data sharding
US10146835B2 (en) 2016-08-23 2018-12-04 Interana, Inc. Methods for stratified sampling-based query execution
US10956415B2 (en) 2016-09-26 2021-03-23 Splunk Inc. Generating a subquery for an external data system using a configuration file
US11860940B1 (en) 2016-09-26 2024-01-02 Splunk Inc. Identifying buckets for query execution using a catalog of buckets
US11620336B1 (en) 2016-09-26 2023-04-04 Splunk Inc. Managing and storing buckets to a remote shared storage system based on a collective bucket size
US12013895B2 (en) 2016-09-26 2024-06-18 Splunk Inc. Processing data using containerized nodes in a containerized scalable environment
US10353965B2 (en) 2016-09-26 2019-07-16 Splunk Inc. Data fabric service system architecture
US11604795B2 (en) 2016-09-26 2023-03-14 Splunk Inc. Distributing partial results from an external data system between worker nodes
US11093703B2 (en) 2016-09-29 2021-08-17 Google Llc Generating charts from data in a data table
US9720961B1 (en) 2016-09-30 2017-08-01 Semmle Limited Algebraic data types for database query languages
US9633078B1 (en) * 2016-09-30 2017-04-25 Semmle Limited Generating identifiers for tuples of recursively defined relations
US11741091B2 (en) 2016-12-01 2023-08-29 Ab Initio Technology Llc Generating, accessing, and displaying lineage metadata
US10650050B2 (en) 2016-12-06 2020-05-12 Microsoft Technology Licensing, Llc Synthesizing mapping relationships using table corpus
US10936555B2 (en) * 2016-12-22 2021-03-02 Sap Se Automated query compliance analysis
US10565173B2 (en) * 2017-02-10 2020-02-18 Wipro Limited Method and system for assessing quality of incremental heterogeneous data
US10002146B1 (en) * 2017-02-13 2018-06-19 Sas Institute Inc. Distributed data set indexing
US10514993B2 (en) * 2017-02-14 2019-12-24 Google Llc Analyzing large-scale data processing jobs
CN107220283B (en) * 2017-04-21 2019-11-08 Neusoft Corporation Data processing method, device, storage medium and electronic equipment
US11921672B2 (en) 2017-07-31 2024-03-05 Splunk Inc. Query execution at a remote heterogeneous data store of a data fabric service
US11989194B2 (en) * 2017-07-31 2024-05-21 Splunk Inc. Addressing memory limits for partition tracking among worker nodes
US12118009B2 (en) * 2017-07-31 2024-10-15 Splunk Inc. Supporting query languages through distributed execution of query engines
US11423083B2 (en) 2017-10-27 2022-08-23 Ab Initio Technology Llc Transforming a specification into a persistent computer program
US11055074B2 (en) * 2017-11-13 2021-07-06 Ab Initio Technology Llc Key-based logging for processing of structured data items with executable logic
US11509540B2 (en) * 2017-12-14 2022-11-22 Extreme Networks, Inc. Systems and methods for zero-footprint large-scale user-entity behavior modeling
CN111971665B (en) * 2018-01-25 2024-04-19 Ab Initio Technology LLC Techniques for integrating validation results into data archives and related systems and methods
US11334543B1 (en) 2018-04-30 2022-05-17 Splunk Inc. Scalable bucket merging for a data intake and query system
EP3575980A3 (en) 2018-05-29 2020-03-04 Accenture Global Solutions Limited Intelligent data quality
CA3106682A1 (en) * 2018-07-19 2020-01-23 Ab Initio Technology Llc Publishing to a data warehouse
US11080266B2 (en) * 2018-07-30 2021-08-03 Futurewei Technologies, Inc. Graph functional dependency checking
US11227065B2 (en) 2018-11-06 2022-01-18 Microsoft Technology Licensing, Llc Static data masking
US11423009B2 (en) * 2019-05-29 2022-08-23 ThinkData Works, Inc. System and method to prevent formation of dark data
US11704494B2 (en) 2019-05-31 2023-07-18 Ab Initio Technology Llc Discovering a semantic meaning of data fields from profile data of the data fields
US11153400B1 (en) 2019-06-04 2021-10-19 Thomas Layne Bascom Federation broker system and method for coordinating discovery, interoperability, connections and correspondence among networked resources
US11494380B2 (en) 2019-10-18 2022-11-08 Splunk Inc. Management of distributed computing framework components in a data fabric service system
CN111143433B (en) * 2019-12-10 2024-07-09 Ping An Property & Casualty Insurance Company of China, Ltd. Method and device for counting data in data bin
KR102365910B1 (en) * 2019-12-31 2022-02-22 Catholic Kwandong University Industry-Academic Cooperation Foundation Data profiling method and data profiling system using attribute value quality index
FR3105844A1 (en) * 2019-12-31 2021-07-02 Bull Sas Process and system for identifying relevant variables
US11922222B1 (en) 2020-01-30 2024-03-05 Splunk Inc. Generating a modified component for a data intake and query system using an isolated execution environment image
US11200215B2 (en) * 2020-01-30 2021-12-14 International Business Machines Corporation Data quality evaluation
US11321340B1 (en) 2020-03-31 2022-05-03 Wells Fargo Bank, N.A. Metadata extraction from big data sources
US11556563B2 (en) * 2020-06-12 2023-01-17 Oracle International Corporation Data stream processing
US11403268B2 (en) * 2020-08-06 2022-08-02 Sap Se Predicting types of records based on amount values of records
US11704313B1 (en) 2020-10-19 2023-07-18 Splunk Inc. Parallel branch operation using intermediary nodes
KR102265937B1 (en) * 2020-12-21 2021-06-17 Mobigen Co., Ltd. Method for analyzing sequence data and apparatus thereof
US12106026B2 (en) 2021-01-05 2024-10-01 Capital One Services, Llc Extensible agents in agent-based generative models
US11847390B2 (en) 2021-01-05 2023-12-19 Capital One Services, Llc Generation of synthetic data using agent-based simulations
US20220215243A1 (en) * 2021-01-05 2022-07-07 Capital One Services, Llc Risk-Reliability Framework for Evaluating Synthetic Data Models
CA3209125A1 (en) 2021-01-31 2022-08-04 Ab Initio Technology Llc Data processing system with manipulation of logical dataset groups
US11537594B2 (en) 2021-02-05 2022-12-27 Oracle International Corporation Approximate estimation of number of distinct keys in a multiset using a sample
CN112925792B (en) * 2021-03-26 2024-01-05 Beijing Zhongjing Huizhong Technology Co., Ltd. Data storage control method, device, computing equipment and medium
US12072939B1 (en) 2021-07-30 2024-08-27 Splunk Inc. Federated data enrichment objects
CN113656430B (en) * 2021-08-12 2024-02-27 Shanghai 2345 Network Technology Co., Ltd. Control method and device for automatic expansion of batch table data
KR102714923B1 (en) * 2021-12-22 2024-10-08 Sejong University Industry-Academic Cooperation Foundation Method and apparatus for data profiling
KR102437098B1 (en) 2022-04-15 2022-08-25 Lee Chan-young Method and apparatus for determining error data based on artificial intelligence
US12093272B1 (en) 2022-04-29 2024-09-17 Splunk Inc. Retrieving data identifiers from queue for search of external data system
US11907051B1 (en) 2022-09-07 2024-02-20 International Business Machines Corporation Correcting invalid zero value for data monitoring


Family Cites Families (71)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP2760794B2 (en) * 1988-01-29 1998-06-04 Hitachi, Ltd. Database processing method and apparatus
US5179643A (en) * 1988-12-23 1993-01-12 Hitachi, Ltd. Method of multi-dimensional analysis and display for a large volume of record information items and a system therefor
JPH032938A (en) 1989-05-31 1991-01-09 Hitachi Ltd Data base processing method
JPH04152440A (en) * 1990-10-17 1992-05-26 Hitachi Ltd Intelligent inquiry processing method
FR2698465B1 (en) 1992-11-20 1995-01-13 Bull Sa Method for extracting statistics profiles, use of statistics created by the method.
US5742806A (en) * 1994-01-31 1998-04-21 Sun Microsystems, Inc. Apparatus and method for decomposing database queries for database management system including multiprocessor digital data processing system
JP3519126B2 (en) 1994-07-14 2004-04-12 Ricoh Co., Ltd. Automatic layout system
US5842200A (en) 1995-03-31 1998-11-24 International Business Machines Corporation System and method for parallel mining of association rules in databases
US6601048B1 (en) * 1997-09-12 2003-07-29 Mci Communications Corporation System and method for detecting and managing fraud
US5778373A (en) 1996-07-15 1998-07-07 At&T Corp Integration of an information server database schema by generating a translation map from exemplary files
US6138123A (en) * 1996-07-25 2000-10-24 Rathbun; Kyle R. Method for creating and using parallel data structures
JPH1055367A (en) 1996-08-09 1998-02-24 Hitachi Ltd Data utilization system
US5845285A (en) * 1997-01-07 1998-12-01 Klein; Laurence C. Computer system and method of data analysis
US5987453A (en) 1997-04-07 1999-11-16 Informix Software, Inc. Method and apparatus for performing a join query in a database system
US6134560A (en) * 1997-12-16 2000-10-17 Kliebhan; Daniel F. Method and apparatus for merging telephone switching office databases
US6826556B1 (en) * 1998-10-02 2004-11-30 Ncr Corporation Techniques for deploying analytic models in a parallel
US6959300B1 (en) * 1998-12-10 2005-10-25 At&T Corp. Data compression method and apparatus
US6343294B1 (en) 1998-12-15 2002-01-29 International Business Machines Corporation Data file editor for multiple data subsets
JP4037001B2 (en) * 1999-02-23 2008-01-23 Mitsubishi Electric Corporation Database creation device and database search device
US6741995B1 (en) 1999-03-23 2004-05-25 Metaedge Corporation Method for dynamically creating a profile
US6430539B1 (en) * 1999-05-06 2002-08-06 Hnc Software Predictive modeling of consumer financial behavior
US6163774A (en) 1999-05-24 2000-12-19 Platinum Technology Ip, Inc. Method and apparatus for simplified and flexible selection of aggregate and cross product levels for a data warehouse
KR20020041337A (en) 1999-06-18 2002-06-01 Torrent Systems, Inc. Segmentation and processing of continuous data streams using transactional semantics
US6801938B1 (en) * 1999-06-18 2004-10-05 Torrent Systems, Inc. Segmentation and processing of continuous data streams using transactional semantics
JP3318834B2 (en) 1999-07-30 2002-08-26 Mitsubishi Electric Corporation Data file system and data retrieval method
JP3567861B2 (en) 2000-07-07 2004-09-22 Nippon Telegraph and Telephone Corporation Information source location estimation method and apparatus, and storage medium storing information source location estimation program
JP4366845B2 (en) * 2000-07-24 2009-11-18 Sony Corporation Data processing apparatus, data processing method, and program providing medium
US6788302B1 (en) * 2000-08-03 2004-09-07 International Business Machines Corporation Partitioning and load balancing graphical shape data for parallel applications
US20020073138A1 (en) * 2000-12-08 2002-06-13 Gilbert Eric S. De-identification and linkage of data records
US6952693B2 (en) 2001-02-23 2005-10-04 Ran Wolff Distributed mining of association rules
US20020120602A1 (en) * 2001-02-28 2002-08-29 Ross Overbeek System, method and computer program product for simultaneous analysis of multiple genomes
JP2002269114A (en) * 2001-03-14 2002-09-20 Kousaku Ookubo Knowledge database, and method for constructing knowledge database
US20030033138A1 (en) * 2001-07-26 2003-02-13 Srinivas Bangalore Method for partitioning a data set into frequency vectors for clustering
US7130852B2 (en) * 2001-07-27 2006-10-31 Silicon Valley Bank Internal security system for a relational database system
WO2003014867A2 (en) * 2001-08-03 2003-02-20 John Allen Ananian Personalized interactive digital catalog profiling
US6801903B2 (en) 2001-10-12 2004-10-05 Ncr Corporation Collecting statistics in a database system
US20030140027A1 (en) * 2001-12-12 2003-07-24 Jeffrey Huttel Universal Programming Interface to Knowledge Management (UPIKM) database system with integrated XML interface
US7813937B1 (en) * 2002-02-15 2010-10-12 Fair Isaac Corporation Consistency modeling of healthcare claims to detect fraud and abuse
US7031969B2 (en) 2002-02-20 2006-04-18 Lawrence Technologies, Llc System and method for identifying relationships between database records
WO2003081391A2 (en) * 2002-03-19 2003-10-02 Mapinfo Corporation Location based service provider
US20040083199A1 (en) * 2002-08-07 2004-04-29 Govindugari Diwakar R. Method and architecture for data transformation, normalization, profiling, cleansing and validation
US6657568B1 (en) 2002-08-27 2003-12-02 Fmr Corp. Data packing for real-time streaming
US7047230B2 (en) * 2002-09-09 2006-05-16 Lucent Technologies Inc. Distinct sampling system and a method of distinct sampling for optimizing distinct value query estimates
WO2004036461A2 (en) * 2002-10-14 2004-04-29 Battelle Memorial Institute Information reservoir
US7698163B2 (en) * 2002-11-22 2010-04-13 Accenture Global Services Gmbh Multi-dimensional segmentation for use in a customer interaction
US7403942B1 (en) * 2003-02-04 2008-07-22 Seisint, Inc. Method and system for processing data records
US7433861B2 (en) * 2003-03-13 2008-10-07 International Business Machines Corporation Byte-code representations of actual data to reduce network traffic in database transactions
US7117222B2 (en) * 2003-03-13 2006-10-03 International Business Machines Corporation Pre-formatted column-level caching to improve client performance
US20040249810A1 (en) * 2003-06-03 2004-12-09 Microsoft Corporation Small group sampling of data for use in query processing
GB0314591D0 (en) 2003-06-21 2003-07-30 Ibm Profiling data in a data store
US7426520B2 (en) 2003-09-10 2008-09-16 Exeros, Inc. Method and apparatus for semantic discovery and mapping between data sources
KR100922141B1 (en) 2003-09-15 2009-10-19 Ab Initio Software LLC Data profiling method and system
US7587394B2 (en) * 2003-09-23 2009-09-08 International Business Machines Corporation Methods and apparatus for query rewrite with auxiliary attributes in query processing operations
US7149736B2 (en) 2003-09-26 2006-12-12 Microsoft Corporation Maintaining time-sorted aggregation records representing aggregations of values from multiple database records using multiple partitions
WO2005050482A1 (en) 2003-10-21 2005-06-02 Nielsen Media Research, Inc. Methods and apparatus for fusing databases
US7376656B2 (en) * 2004-02-10 2008-05-20 Microsoft Corporation System and method for providing user defined aggregates in a database system
US20050177578 (en) 2004-02-10 2005-08-11 Chen Yao-Ching S. Efficient type annotation of XML schema-validated XML documents without schema validation
US8447743B2 (en) * 2004-08-17 2013-05-21 International Business Machines Corporation Techniques for processing database queries including user-defined functions
US7774346B2 (en) 2005-08-26 2010-08-10 Oracle International Corporation Indexes that are based on bitmap values and that use summary bitmap values
US20070073721A1 (en) * 2005-09-23 2007-03-29 Business Objects, S.A. Apparatus and method for serviced data profiling operations
US8271452B2 (en) 2006-06-12 2012-09-18 Rainstor Limited Method, system, and database archive for enhancing database archiving
US8412713B2 (en) 2007-03-06 2013-04-02 Mcafee, Inc. Set function calculation in a database
US7912867B2 (en) * 2008-02-25 2011-03-22 United Parcel Services Of America, Inc. Systems and methods of profiling data for integration
US9251212B2 (en) * 2009-03-27 2016-02-02 Business Objects Software Ltd. Profiling in a massive parallel processing environment
JP6121163B2 (en) 2009-09-16 2017-04-26 Ab Initio Technology LLC Mapping dataset elements
AU2010319344B2 (en) 2009-11-13 2014-10-09 Ab Initio Technology Llc Managing record format information
US8396873B2 (en) * 2010-03-10 2013-03-12 Emc Corporation Index searching using a bloom filter
US8296274B2 (en) * 2011-01-27 2012-10-23 Leppard Andrew Considering multiple lookups in bloom filter decision making
US9449057B2 (en) * 2011-01-28 2016-09-20 Ab Initio Technology Llc Generating data pattern information
US8615519B2 (en) 2011-06-17 2013-12-24 Sap Ag Method and system for inverted indexing of a dataset
US8762396B2 (en) * 2011-12-22 2014-06-24 Sap Ag Dynamic, hierarchical bloom filters for network data routing

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5966072A (en) 1996-07-02 1999-10-12 Ab Initio Software Corporation Executing computations expressed as graphs
US20020161778A1 (en) 2001-02-24 2002-10-31 Core Integration Partners, Inc. Method and system of data warehousing and building business intelligence using a data storage model

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
"17th. ACM Slgact-Slgmod-Slgart symposium", 1 June 1998, ASSOCIATION FOR COMPUTING MACHINERY, article "An overview of query optimisation in relational systems", pages: 34 - 43
JAEDICKE M ET AL.: "On parallel processing of aggregate and scalar functions in object-relational DBMS", vol. 27, 1 June 1998, SIGMOND RECORD ACM USA, pages: 379 - 389
OLSON C: "Know your data: data profiling solutions for today's hot projects", vol. 10, DM REVIEW FAULKNER & GRAY USA, pages: 1 - 4
RAHM E; DO H-H: "Quarterly Bulletin of the Computer Society of the IEEE Technical Committee on Data Engineering", December 2000, THE COMMITTEE, article "data cleaning: problems and current approaches", pages: 1 - 11

Cited By (42)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8868580B2 (en) 2003-09-15 2014-10-21 Ab Initio Technology Llc Data profiling
US9323802B2 (en) 2003-09-15 2016-04-26 Ab Initio Technology, Llc Data profiling
US11562457B2 (en) 2004-09-22 2023-01-24 Experian Information Solutions, Inc. Automated analysis of data to generate prospect notifications based on trigger events
US11373261B1 (en) 2004-09-22 2022-06-28 Experian Information Solutions, Inc. Automated analysis of data to generate prospect notifications based on trigger events
US11861756B1 (en) 2004-09-22 2024-01-02 Experian Information Solutions, Inc. Automated analysis of data to generate prospect notifications based on trigger events
US7852342B2 (en) 2004-10-14 2010-12-14 Microsoft Corporation Remote client graphics rendering
US7889203B2 (en) 2004-10-14 2011-02-15 Microsoft Corporation Encoding for remoting graphics to decoder device
US8527563B2 (en) 2005-09-12 2013-09-03 Microsoft Corporation Remoting redirection layer for graphics device interface
WO2007032846A1 (en) * 2005-09-12 2007-03-22 Microsoft Corporation Remoting redirection layer for graphics device interface
US11954089B2 (en) 2007-09-27 2024-04-09 Experian Information Solutions, Inc. Database system for triggering event notifications based on updates to database records
US11347715B2 (en) * 2007-09-27 2022-05-31 Experian Information Solutions, Inc. Database system for triggering event notifications based on updates to database records
US11978114B1 (en) 2009-01-06 2024-05-07 Consumerinfo.Com, Inc. Report existence monitoring
US9449057B2 (en) 2011-01-28 2016-09-20 Ab Initio Technology Llc Generating data pattern information
US9652513B2 (en) 2011-01-28 2017-05-16 Ab Initio Technology, Llc Generating data pattern information
CN103348598A (en) * 2011-01-28 2013-10-09 Ab Initio Technology LLC Generating data pattern information
US11861691B1 (en) 2011-04-29 2024-01-02 Consumerinfo.Com, Inc. Exposing reporting cycle information
US9569434B2 (en) 2012-10-22 2017-02-14 Ab Initio Technology Llc Profiling data with source tracking
US9323748B2 (en) 2012-10-22 2016-04-26 Ab Initio Technology Llc Profiling data with location information
US9990362B2 (en) 2012-10-22 2018-06-05 Ab Initio Technology Llc Profiling data with location information
US9323749B2 (en) 2012-10-22 2016-04-26 Ab Initio Technology Llc Profiling data with location information
WO2014065918A1 (en) * 2012-10-22 2014-05-01 Ab Initio Technology Llc Characterizing data sources in a data storage system
US10719511B2 (en) 2012-10-22 2020-07-21 Ab Initio Technology Llc Profiling data with source tracking
US11163670B2 (en) 2013-02-01 2021-11-02 Ab Initio Technology Llc Data records selection
US9892026B2 (en) 2013-02-01 2018-02-13 Ab Initio Technology Llc Data records selection
US10241900B2 (en) 2013-02-01 2019-03-26 Ab Initio Technology Llc Data records selection
WO2015109047A1 (en) * 2014-01-16 2015-07-23 Ab Initio Technology Llc Database key identification
US11487732B2 (en) 2014-01-16 2022-11-01 Ab Initio Technology Llc Database key identification
AU2015206487B2 (en) * 2014-01-16 2020-07-02 Ab Initio Technology Llc Database key identification
EP3594821A1 (en) * 2014-03-07 2020-01-15 AB Initio Technology LLC Managing data profiling operations related to data type
US9971798B2 (en) 2014-03-07 2018-05-15 Ab Initio Technology Llc Managing data profiling operations related to data type
KR102361153B1 (en) 2014-03-07 2022-02-09 Ab Initio Technology LLC Managing data profiling operations related to data type
WO2015134193A1 (en) * 2014-03-07 2015-09-11 Ab Initio Technology Llc Managing data profiling operations related to data type
KR20160130256A (en) * 2014-03-07 2016-11-10 Ab Initio Technology LLC Managing data profiling operations related to data type
US11410230B1 (en) 2015-11-17 2022-08-09 Consumerinfo.Com, Inc. Realtime access and control of secure regulated data
US11893635B1 (en) 2015-11-17 2024-02-06 Consumerinfo.Com, Inc. Realtime access and control of secure regulated data
US11729230B1 (en) 2015-11-24 2023-08-15 Experian Information Solutions, Inc. Real-time event-based notification system
US20210200757A1 (en) * 2017-07-25 2021-07-01 Capital One Services, Llc Systems and methods for expedited large file processing
US11625408B2 (en) * 2017-07-25 2023-04-11 Capital One Services, Llc Systems and methods for expedited large file processing
US12111838B2 (en) 2017-07-25 2024-10-08 Capital One Services, Llc Systems and methods for expedited large file processing
US11068540B2 (en) 2018-01-25 2021-07-20 Ab Initio Technology Llc Techniques for integrating validation results in data profiling and related systems and methods
US11399029B2 (en) 2018-09-05 2022-07-26 Consumerinfo.Com, Inc. Database platform for realtime updating of user data from third party sources
US12074876B2 (en) 2018-09-05 2024-08-27 Consumerinfo.Com, Inc. Authenticated access and aggregation database platform

Also Published As

Publication number Publication date
CN102982065A (en) 2013-03-20
HK1093568A1 (en) 2007-03-02
EP2261820A2 (en) 2010-12-15
EP2261821A3 (en) 2010-12-29
US20160239532A1 (en) 2016-08-18
JP2010267288A (en) 2010-11-25
US7849075B2 (en) 2010-12-07
JP2010267289A (en) 2010-11-25
CA2538568A1 (en) 2005-03-31
KR100922141B1 (en) 2009-10-19
KR101033179B1 (en) 2011-05-11
JP2007506191A (en) 2007-03-15
KR20070106574A (en) 2007-11-01
EP1676217A2 (en) 2006-07-05
WO2005029369A9 (en) 2006-05-04
JP5328099B2 (en) 2013-10-30
AU2004275334A1 (en) 2005-03-31
KR20090039803A (en) 2009-04-22
JP5372851B2 (en) 2013-12-18
AU2004275334B2 (en) 2011-02-10
ATE515746T1 (en) 2011-07-15
EP1676217B1 (en) 2011-07-06
AU2009200293A1 (en) 2009-02-19
US7756873B2 (en) 2010-07-13
CA2538568C (en) 2009-05-19
JP5372850B2 (en) 2013-12-18
WO2005029369A3 (en) 2005-08-25
AU2009200293B2 (en) 2011-07-07
CA2655735C (en) 2011-01-18
EP2261821B1 (en) 2022-12-07
CA2655731C (en) 2012-04-10
EP2261821A2 (en) 2010-12-15
KR100899850B1 (en) 2009-05-27
KR20080016532A (en) 2008-02-21
US8868580B2 (en) 2014-10-21
CA2655731A1 (en) 2005-03-31
EP2261820A3 (en) 2010-12-29
AU2009200294A1 (en) 2009-02-19
AU2004275334B9 (en) 2011-06-16
US20050102325A1 (en) 2005-05-12
US20050114368A1 (en) 2005-05-26
US9323802B2 (en) 2016-04-26
US20050114369A1 (en) 2005-05-26
CA2655735A1 (en) 2005-03-31
CN102982065B (en) 2016-09-21
KR20060080588A (en) 2006-07-10
US20150106341A1 (en) 2015-04-16

Similar Documents

Publication Publication Date Title
US9323802B2 (en) Data profiling
EP3657348B1 (en) Profiling data with source tracking
CN101271471B (en) Data processing method, software and data processing system
AU2013200067B2 (en) Data profiling
JP2017532652A (en) Managing calculations for hierarchical entities

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 200480026429.2

Country of ref document: CN

AK Designated states

Kind code of ref document: A2

Designated state(s): AE AG AL AM AT AU AZ BA BB BG BW BY BZ CA CH CN CO CR CU CZ DK DM DZ EC EE EG ES FI GB GD GE GM HR HU ID IL IN IS JP KE KG KP KZ LC LK LR LS LT LU LV MA MD MK MN MW MX MZ NA NI NO NZ PG PH PL PT RO RU SC SD SE SG SK SY TJ TM TN TR TT TZ UA UG US UZ VN YU ZA ZM

AL Designated countries for regional patents

Kind code of ref document: A2

Designated state(s): BW GH GM KE LS MW MZ NA SD SZ TZ UG ZM ZW AM AZ BY KG MD RU TJ TM AT BE BG CH CY DE DK EE ES FI FR GB GR HU IE IT MC NL PL PT RO SE SI SK TR BF CF CG CI CM GA GN GQ GW ML MR SN TD TG

121 Ep: the epo has been informed by wipo that ep was designated in this application
WWE Wipo information: entry into national phase

Ref document number: 2004275334

Country of ref document: AU

WWE Wipo information: entry into national phase

Ref document number: 2004784113

Country of ref document: EP

Ref document number: 1167/DELNP/2006

Country of ref document: IN

WWE Wipo information: entry into national phase

Ref document number: 2538568

Country of ref document: CA

WWE Wipo information: entry into national phase

Ref document number: 1020067005255

Country of ref document: KR

Ref document number: 2006526986

Country of ref document: JP

ENP Entry into the national phase

Ref document number: 2004275334

Country of ref document: AU

Date of ref document: 20040915

Kind code of ref document: A

WWP Wipo information: published in national office

Ref document number: 2004275334

Country of ref document: AU

COP Corrected version of pamphlet

Free format text: PAGES 1/23-23/23, DRAWINGS, ADDED

DPEN Request for preliminary examination filed prior to expiration of 19th month from priority date (pct application filed from 20040101)
WWP Wipo information: published in national office

Ref document number: 2004784113

Country of ref document: EP

WWP Wipo information: published in national office

Ref document number: 1020067005255

Country of ref document: KR