CN101271471B - Data processing method, software and data processing system - Google Patents

Data processing method, software and data processing system

Info

Publication number
CN101271471B
CN101271471B (application CN200810093033XA)
Authority
CN
China
Prior art keywords
field
value
data
order
information
Prior art date
Legal status
Expired - Lifetime
Application number
CN200810093033XA
Other languages
Chinese (zh)
Other versions
CN101271471A
Inventor
Joel Gould
Carl Feynman
Paul Bay
Current Assignee
Archie Taco Ltd
Qiyuan Software Co ltd
Ab Initio Technology LLC
Original Assignee
Ab Initio Software LLC
Priority date
Filing date
Publication date
Application filed by Ab Initio Software LLC
Publication of CN101271471A
Application granted
Publication of CN101271471B
Anticipated expiration
Legal status: Expired - Lifetime

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This invention provides a method, software, and system for processing data. Data from a data source is profiled: the data is read from the data source, summary data characterizing the data is computed while the data is being read, and profile information based on the summary data is stored. Data from the data source is then processed, which includes accessing the stored profile information and processing the data according to that profile information. A related method includes: accepting information characterizing values of a first field in records of a first data source and information characterizing values of a second field in records of a second data source; computing, based on the accepted information, quantities characterizing a relationship between the first field and the second field; and presenting information relating the first field and the second field.

Description

Data processing method, software and data processing system
This application is a divisional of Chinese invention patent application No. 200480026429.2, filed September 15, 2004, and entitled "Data Profiling".
Cross-Reference to Related Applications
This application claims the benefit of U.S. Provisional Application No. 60/502,908, filed September 15, 2003; U.S. Provisional Application No. 60/513,038, filed October 20, 2003; and U.S. Provisional Application No. 60/532,956, filed December 22, 2003. Each of the above applications is incorporated herein by reference.
Technical field
The present invention relates to data profiling.
Background
Stored data sets often include data whose various characteristics are not known in advance. For example, the ranges or typical values of a data set, the relationships between different fields within the data set, or the functional dependencies among values in different fields may be unknown. Data profiling involves examining the sources of a data set in order to determine such characteristics. One use of a data profiling system is to collect information about an unfamiliar data set, which is then used to design a staging area for loading the data set before further processing. Based on the information collected during data profiling, the transformations needed to map the data set into a desired format and location can then be performed in the staging area. Such transformations may be necessary, for example, to make third-party data compatible with an existing data store, or to transfer data from a legacy computer system to a new computer system.
Summary of the invention
In general, in one aspect, the invention features a method, and corresponding software and data processing system. Data from a data source is profiled. The profiling includes reading the data from the data source, computing summary data characterizing the data while the data is being read, and storing profile information based on the summary data. Data from the data source is then processed; this processing includes accessing the stored profile information and processing the data according to the accessed profile information.
In general, in another aspect, the invention features a data processing method. Data from a data source is profiled. The profiling includes reading the data from the data source, computing summary data characterizing the data while the data is being read, and storing profile information based on the summary data. Profiling the data includes profiling the data in parallel, which includes partitioning the data into parts and processing the parts using separate ones of a first set of parallel components.
Aspects of the invention can include one or more of the following features.
Processing the data from the data source includes reading the data from the data source.
A copy of the data is not maintained outside the data source while the data is being profiled. For example, the data can include records with a variable record structure (for example, conditional fields or a variable number of fields). Computing the summary data can include interpreting the variable record structure of the records as the data is read.
The data source can include a data storage system, for example a database system or a serial or parallel file system.
Computing the summary data can include counting the number of occurrences of each of a set of distinct values of a field. The profile information can include statistics for the field based on the counted occurrences.
A metadata store containing metadata related to the data source is maintained. Storing the profile information can include updating the metadata related to the data source. Both profiling the data and processing the data can make use of the metadata for the data source.
Profiling the data from the data source can also include determining a format specification based on the profile information, and can include determining a validity specification based on the profile information. During processing, invalid records can be identified based on the format specification and/or the validity specification.
Transformation instructions for the data are determined based on the profile information. Processing the data can then include applying the transformation instructions to the data.
Processing the data can include importing the data into a data storage subsystem. Before the data is imported into the data storage subsystem, the data can be validated, for example by comparing characteristics of the data with reference characteristics for the data, such as by comparing statistical properties.
Profiling the data can be performed in parallel. This can include partitioning the data into parts and processing the parts using separate ones of a first set of parallel components. Computing summary data for different data fields can use separate ones of a second set of parallel components. The outputs of the first set of parallel components can be repartitioned to form the inputs of the second set of parallel components. The data can be read from a parallel data source, with each part of the parallel data source processed by a different one of the first set of parallel components.
In general, in another aspect, the invention features a method, and corresponding software and data processing system. Information characterizing values of a first field in records of a first data source and information characterizing values of a second field in records of a second data source are accepted. Based on the accepted information, quantities characterizing a relationship between the first field and the second field are computed. Information relating the first field and the second field is then presented.
Aspects of the invention can include one or more of the following features.
The information relating the first field and the second field is presented to a user.
The first data source and the second data source can be the same data source or separate data sources. Either or both of the data sources can be database tables or files.
The quantities characterizing the relationship include quantities characterizing joint characteristics of the values of the first field and the values of the second field.
The information characterizing the values of the first field (and, likewise, of the second field) includes information characterizing the distribution of values in that field. This information can be stored in a data structure, for example a "census" data structure. The information characterizing the distribution of values of the first field can include a plurality of data records, each associating a distinct value of the first field of the first data source with a corresponding number of occurrences of that value. The information characterizing the distribution of values of the second field can similarly include a plurality of records in the same or a similar format.
The information characterizing the distributions of values of the first field and the second field is processed to compute quantities related to a plurality of different categories of co-occurrence of those values.
The quantities related to the categories of co-occurring values include a plurality of data records, each associated with one of the co-occurrence categories and including the number of distinct values of the first and second fields in that category.
Information characterizing the distribution of values in a "join" of the first data source and the second data source on the first field and the second field, respectively, is computed. This computation can include computing quantities related to categories of co-occurring values. Examples of such categories include: values that occur at least once in one of the first and second fields but not at all in the other; values that occur exactly once in each of the first and second fields; values that occur exactly once in one of the fields and more than once in the other; and values that occur more than once in each of the first and second fields.
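The co-occurrence categories above can be computed directly from the two value/count censuses rather than from the raw records. The following Python sketch (with made-up field values; it is an illustration under those assumptions, not the patent's implementation) classifies distinct values into these categories:

```python
from collections import Counter

def join_categories(census_a, census_b):
    """Classify distinct values by how they co-occur in two fields.

    census_a, census_b: dicts mapping a field value to its number of
    occurrences in the respective data source (i.e., census records).
    Returns counts of distinct values in each co-occurrence category.
    """
    categories = Counter()
    for value in set(census_a) | set(census_b):
        a, b = census_a.get(value, 0), census_b.get(value, 0)
        if a == 0 or b == 0:
            categories["in one field only"] += 1
        elif a == 1 and b == 1:
            categories["exactly once in each"] += 1
        elif a == 1 or b == 1:
            categories["once in one, many in other"] += 1
        else:
            categories["many in both"] += 1
    return categories

# Example: a likely primary key (left) joined against a foreign key (right)
print(join_categories({"A": 1, "B": 1, "C": 1}, {"A": 4, "B": 2, "X": 1}))
```

A pattern dominated by "exactly once in each" or "once in one, many in other" entries suggests a primary key / foreign key relationship between the pair of fields, as described below.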
The two steps of accepting information characterizing the values and computing quantities characterizing their joint characteristics are repeated for different pairs of fields, for example a field from the first data source and another field from the second data source. Information relating one or more of the pairs of fields can then be presented to the user.
Presenting the information relating one or more of the pairs of fields includes identifying candidate types of field relationships, for example a primary key relationship, a foreign key relationship, or a common domain relationship.
In general, in another aspect, the invention features a method, and corresponding software and data processing system. A plurality of subsets of fields of data records of a data source are identified. Co-occurrence statistics are determined for each of the plurality of subsets. One or more of the plurality of subsets are identified as having a functional relationship among the fields of the identified subset.
Aspects of the invention can include one or more of the following features.
At least one of the subsets of fields is a subset of two fields.
Identifying one or more of the subsets as having a functional relationship among their fields includes identifying one or more of the subsets as having one of a plurality of possible predetermined functional relationships.
Determining the co-occurrence statistics includes forming a plurality of data elements, each identifying a pair of fields and a pair of values occurring in that pair of fields in a data record.
Determining the co-occurrence statistics includes: partitioning the data records, which have a first field and a second field, into parts; determining a quantity based on the distribution of values occurring in the second field of one or more records in a first part, where those one or more records have a common value occurring in their first field; and combining that quantity with other quantities obtained from records in the other parts to produce an overall quantity.
Identifying one or more of the subsets as having a functional relationship among their fields includes identifying a functional relationship between the first and second fields based on the overall quantity.
The parts are based on the values of the first field and the values of the second field.
The parts are processed using separate ones of a set of parallel components.
Identifying one or more of the subsets as having a functional relationship among their fields includes determining a degree of match with the functional relationship.
The degree of match can include a number of exceptional records that are inconsistent with the functional relationship.
The functional relationship can include a mapping of at least some of the values of the first field to at least some of the values of the second field.
The mapping can be, for example, a many-to-one mapping, a one-to-many mapping, or a one-to-one mapping.
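One way to test a candidate "first field determines second field" relationship and its degree of match is to count, for each value of the first field, how often the most common co-occurring value of the second field appears; records outside that majority are exceptions. A minimal Python sketch under these assumptions (the field names, threshold, and function are illustrative, not taken from the patent):

```python
from collections import Counter, defaultdict

def functional_dependency(records, f1, f2, max_exception_rate=0.01):
    """Test whether field f1 functionally determines field f2.

    records: iterable of dicts (one per data record).
    Returns (holds, exception_count): the dependency is reported to hold
    if the number of records inconsistent with an f1 -> f2 mapping stays
    within the allowed exception rate.
    """
    co_counts = defaultdict(Counter)   # value of f1 -> Counter of f2 values
    total = 0
    for rec in records:
        co_counts[rec[f1]][rec[f2]] += 1
        total += 1
    # For each f1 value, keep the dominant f2 value; everything else is an exception.
    exceptions = sum(sum(c.values()) - max(c.values()) for c in co_counts.values())
    return exceptions <= max_exception_rate * total, exceptions

recs = [{"state": "MA", "zip": "02139"}, {"state": "MA", "zip": "02139"},
        {"state": "NY", "zip": "10001"}, {"state": "NY", "zip": "10013"}]
print(functional_dependency(recs, "zip", "state"))   # (True, 0): zip determines state
print(functional_dependency(recs, "state", "zip"))   # (False, 1): one exceptional record
```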
The method can also include filtering the plurality of subsets based on information characterizing the values in the fields of the subsets.
The data records can include records of one or more database tables.
Aspects of the invention can include one or more of the following advantages.
Aspects of the invention offer advantages in a variety of scenarios. For example, when developing an application, a developer may test the application using a test input data set. The output of the application run on the test data can be compared with expected test results, or inspected manually. However, when the application is run on real "production data", the result is usually too large to verify by inspection. Data profiling can be used to verify the behavior of the application: instead of inspecting each record produced by running the application on production data, a profile of the output can be inspected. The profile can detect invalid or unexpected values, as well as unexpected patterns or distributions in the output that would indicate a problem in the application.
In another scenario, data profiling can be used as part of a production process. For example, input data can be profiled as part of a regular production run. After the profiling completes, a processing module can load the profiling results and verify that the input data meets certain quality criteria. If the input data appears to be defective, the production run can be cancelled and the appropriate people alerted.
In another scenario, periodic data profiling can be used to audit a large collection of data, for example hundreds of database tables spread across multiple data sets. For example, a subset of the data can be profiled each night, cycling through all of the data over time, say once a quarter, so that each database table is profiled four times a year. This provides a historical data-quality audit of all of the data that can be referred to later if necessary.
Data profiling can run automatically. For example, profiling can be run from a script (for example, a shell script) and combined with other forms of processing. The results of profiling can be published automatically, for example in a form that can be viewed in a web browser, without manual post-processing or running a separate reporting application.
Operating on information characterizing the values of the records in a data source, rather than directly on the records themselves, can reduce the amount of computation considerably. For example, by using census data instead of the original data records, the complexity of computing the joint characteristics of two fields can be reduced from the order of the product of the numbers of data records in the two data sources to the order of the product of the numbers of distinct values in the two data sources.
Not keeping a copy of the data outside the data source while profiling avoids the potential errors associated with maintaining a duplicate copy, and avoids using external storage space to hold a copy of the data.
Processing these operations in parallel according to the data values allows the processing to be distributed efficiently.
Quantities characterizing relationships between fields can give an indication of which fields may be related by different types of relationships. The user can then examine the data for those fields more closely to determine whether the fields actually have a relationship of that type.
Determining co-occurrence statistics for each of a plurality of subsets of the fields of the data records in a data source can efficiently identify potential functional relationships between the fields.
Aspects of the invention are especially useful when profiling data sets with which the user is unfamiliar. Information that is determined automatically, or in cooperation with the user, can be used to populate the metadata for the data source for use in later processing.
Other features and advantages of the invention are apparent from the following description and from the claims.
Description of the Drawings
Fig. 1 is a block diagram of a system including a data profiling module.
Fig. 2 is a block diagram of the organization of objects in the metadata store used for data profiling.
Fig. 3 is a profiling graph of the profiling module.
Fig. 4 is a tree diagram illustrating a type hierarchy of data format objects.
Figs. 5A-C are diagrams of the subgraphs implementing the make census component, the analyze census component, and the sampling component of the profiling graph.
Fig. 6 is a flowchart of a rollup procedure.
Fig. 7 is a flowchart of a normalize procedure.
Figs. 8A-C are illustrations of user interface screens presenting profiling results.
Fig. 9 is a flowchart of an exemplary profiling process.
Fig. 10 is a flowchart of an exemplary profiling process.
Figs. 11A-B are two examples of join operations performed on records for two pairs of fields.
Figs. 12A-B are two examples of census join operations performed on census records for two pairs of fields.
Fig. 13 is an example of extended records used to perform a single census join operation for two pairs of fields.
Fig. 14 is an extend component used to generate extended records.
Figs. 15A-C are graphs used for the analysis of joined fields.
Fig. 16 is an example of a table with fields having a functional dependency relationship.
Fig. 17 is a graph used to perform functional dependency analysis.
Detailed Description
1 Overview
Referring to Fig. 1, a data processing system 10 includes a profiling and processing subsystem 20, which is used to process data from a data source 30 and to update a metadata store 112 and a data store 124 in a data storage subsystem 40. The stored metadata and data are then accessible to users through an interface subsystem 50.
In general, the data source 30 includes a variety of individual data sources, each of which may have its own storage format and interface (for example, database tables, spreadsheet files, flat text files, or a native format used by a mainframe 110). The individual data sources may be local to the profiling and processing subsystem 20, for example hosted on the same computer system (for example, a file 102), or remote from the profiling and processing subsystem 20, for example hosted on a remote computer (for example, the mainframe 110) accessed over a local or wide area data network.
The data storage subsystem 40 includes the data store 124 and the metadata store 112. The metadata store 112 holds information about the data in the data source 30 as well as information about the data in the data store 124. This information can include record formats as well as specifications for determining the validity of field values in those records (validity specifications).
The metadata store 112 can be used to store initial information about a data set in the data source 30 that is to be profiled, information about that data set obtained during the profiling process, and information about data sets in the data store 124 derived from that data set. The data store 124 can be used to store data read from the data source 30, optionally transformed using information obtained from data profiling.
The profiling and processing subsystem 20 includes a profiling module 100, which reads data directly from a data source as a flow of discrete work elements, for example individual records, without having to land a complete copy of the data onto a storage medium before profiling. Typically, a record is associated with a set of data fields, and each field has a particular value for each record (possibly including a null value). The records in a data source may have a fixed record structure, in which every record includes the same fields. Alternatively, the records may have a variable record structure, for example including variable-length vectors or conditional fields. With a variable record structure, the records are processed without first storing a "flattened" (that is, fixed record structure) copy of the data before profiling.
When first reading data from a data source, the profiling module 100 generally starts with some initial format information about the records in that data source. (In some cases, even the record structure of the data source may not be known.) The initial information about the records can include: the number of bits that represent a distinct value (for example, 16 bits (= 2 bytes)); the order of the values, including values associated with record fields and values associated with tags or delimiters; and the type of value represented by the bits (for example, string, signed/unsigned integer). This information about the records of a data source is specified in a data manipulation language (DML) file stored in the metadata store 112. The profiling module 100 can use predefined DML files to interpret data automatically from a variety of common data system formats (for example, SQL tables, XML files, CSV files), or to interpret data in a customized data system format described by a DML file obtained from the metadata store 112.
Partial, possibly inaccurate, initial information about the records of a data source may be available to the profiling and processing subsystem 20 before the profiling module 100 first reads the data. For example, a copybook associated with the data source may be available as stored data 114, or may be entered by a user 118 through a user interface 116. This existing information is processed by a metadata input module 115 and stored in the metadata store 112 and/or used to define a DML file that is used when accessing the data source.
As the profiling module 100 reads records from a data source, it computes statistics and other descriptive information that reflect the contents of the data set. The profiling module 100 then writes these statistics and descriptive information in the form of a "profile" into the metadata store 112, where they can be examined through the user interface 116 or by any other module that can access the metadata store 112. The statistics in the profile preferably include a histogram of the values in each field, the maximum, minimum, and mean values, and samples of the least common and most common values.
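As an illustration only (the field values, thresholds, and numeric handling below are made up and simplified), the per-field statistics described above could be accumulated in a single pass over the records roughly as follows:

```python
from collections import Counter

def profile_field(values):
    """Accumulate simple profile statistics for one field in a single pass."""
    counts = Counter()
    total = nulls = 0
    numeric_sum, numeric_n = 0.0, 0
    for v in values:
        total += 1
        if v is None or v == "":
            nulls += 1
            continue
        counts[v] += 1
        try:                       # a mean only makes sense for numeric values
            numeric_sum += float(v)
            numeric_n += 1
        except (TypeError, ValueError):
            pass
    return {
        "total": total,
        "nulls": nulls,
        "distinct": len(counts),
        "min": min(counts) if counts else None,
        "max": max(counts) if counts else None,
        "mean": numeric_sum / numeric_n if numeric_n else None,
        "most_common": counts.most_common(5),
        "least_common": counts.most_common()[-5:] if counts else [],
    }

print(profile_field(["2", "3", "3", None, "7"]))
```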
The statistics obtained by reading the data source can be used for a variety of purposes. These purposes include discovering the contents of an unfamiliar data set, building a collection of metadata associated with a data set, examining third-party data before purchasing or using it, and implementing a quality control scheme for collected data. Procedures for carrying out these tasks with the data processing system 10 are described in detail below.
The metadata store 112 can store validity information associated with each profiled field, for example encoded as a validity specification. Alternatively, validity information can be stored in an external location and retrieved by the profiling module 100. Before a data set is profiled, the validity information can specify a valid data type for each field. For example, if a field holds a person's "title", the default valid values may be any value of the "string" data type. The user can also supply valid values such as "Mr.", "Mrs.", and "Dr." before the data source is profiled, so that any other value read by the profiling module 100 is identified as invalid. The user can also use information obtained from a profiling run to specify the valid values for a particular field. For example, after a data set has been profiled, the user may find that "Ms." and "Msr." appear as common values. The user can add "Ms." as a valid value, and map the value "Msr." to "Mrs." as a data cleansing option. The validity information can therefore include both valid values and mapping information, allowing invalid values to be cleaned by mapping them onto valid values. A data source can be profiled iteratively as more is learned about it through successive profiling runs.
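A minimal sketch of such a validity specification, with made-up field names, rules, and encoding (not the patent's representation), pairs a set of valid values with a cleansing map:

```python
VALIDITY = {
    "title": {
        "valid": {"Mr.", "Mrs.", "Ms.", "Dr."},
        "cleanse": {"Msr.": "Mrs.", "Mister": "Mr."},  # invalid -> valid mapping
    }
}

def apply_validity(field, value, spec=VALIDITY):
    """Return (cleaned_value, is_valid) for one field value."""
    rules = spec.get(field)
    if rules is None:
        return value, True                 # no rules for this field: accept as-is
    value = rules["cleanse"].get(value, value)
    return value, value in rules["valid"]

print(apply_validity("title", "Msr."))     # ('Mrs.', True)
print(apply_validity("title", "Sensei"))   # ('Sensei', False)
```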
The profiling module 100 can also generate executable code to implement other modules that access the profiled data systems. For example, a processing module 120 can include code generated by the profiling module 100. An example of such code might map the value "Msr." to "Mrs." as part of accessing the data source. The processing module 120 can run in the same runtime environment as the profiling module 100, and preferably can communicate with the metadata store 112 to access the profile associated with a data set. The processing module 120 can read the same data formats as the profiling module 100 (for example, by obtaining the same DML files from the metadata store 112). The processing module 120 can use the data set profile to validate or clean incoming records before storing them in the data store 124.
Like the profiling module 100, the processing module 120 also reads data directly from a data system as a flow of discrete work elements. Such a "data flow" of work elements makes it possible to profile large data sets without first copying the data onto local storage (for example, a disk drive). As described in more detail below, this data flow model also allows the processing module to perform complex transformations without first copying the source data to a staging area, potentially saving both storage space and time.
2 Metadata store organization
The profiling module 100 uses the metadata store 112 to organize and store various kinds of metadata, profiling preferences, and results in the form of data objects. Referring to Fig. 2, the metadata store 112 can store a set of profile setup objects 201 (each holding information related to a profiling job), a set of dataset objects 207 (each holding information related to a data set), and a set of DML files 211 (each describing a particular data format). A profile setup object holds preferences for a profiling run executed by the profiling module 100. The user 118 can enter information used to create a new profile setup object, or select a pre-stored profile setup object 200.
The profile setup object 200 contains a reference 204 to a dataset object 206. The dataset object 206 contains a dataset locator 202, which enables the profiling module 100 to locate the data to be profiled in one or more data systems accessible within the runtime environment. The dataset locator 202 is typically a path/filename, a URL, or, for a data set spread across multiple locations, a list of paths/filenames and/or URLs. The dataset object 206 can optionally contain references 208 to one or more DML files 210.
The one or more DML files 210 can be pre-selected based on knowledge of the format of the data in the data set, or can be specified at runtime by the user. The profiling module 100 can obtain an initial portion of the data set and present to the user, through the user interface 116, an interpretation of that portion based on a default DML file. The user can then modify the default DML specification based on an interactive view of the interpretation. More than one DML file may be referenced if the data set includes data in more than one format.
The dataset object 206 contains references 212 to a set of field objects 214, one for each field of the records of the data set to be profiled. Once a profiling run performed by the profiling module 100 has completed, the dataset object 206 also contains a dataset profile 216 corresponding to the profiled data set. The dataset profile 216 contains statistics that relate to the data set as a whole, such as the total number of records and the totals of valid and invalid records.
A field object 218 can optionally contain validity information 220, which the profiling module 100 can use to determine the valid values for the corresponding field, and which can specify rules for cleaning invalid values (that is, mapping them onto valid values). The field object 218 also contains a field profile 222, stored by the profiling module 100 when a profiling run completes, which contains statistics related to the corresponding field, such as the numbers of distinct values, null values, and valid and invalid values. The field profile 222 can also contain sample values, such as the maximum, minimum, most common, and least common values. A complete "profile" consists of the dataset profile 216 and the field profiles for all of the profiled fields.
Other user preferences for a profiling run can be collected and stored in the profile setup object 200 or the dataset object 206. For example, the user can select a filter expression, which can be used to limit the fields or the values included in the profile, including profiling only a random sample of the values (for example, 1%).
3 Runtime environment
The profiling module 100 executes in a runtime environment that allows data from the data source to be read and processed as a flow of discrete work elements. The computations performed by the profiling module 100 and the processing module 120 are expressed as data flows through directed graphs, with the components of the computations associated with the vertices of the graph and the data flows between the components corresponding to the links (arcs, edges) of the graph. A system that implements such graph-based computations is described in U.S. Patent 5,966,072, "Executing Computations Expressed as Graphs". Graphs made in accordance with this system provide methods for getting information into and out of the individual processes represented by the graph components, for moving information between the processes, and for defining the running order of the processes. The system includes algorithms for choosing interprocess communication methods (for example, communication paths corresponding to the links of the graph can use TCP/IP or UNIX domain sockets, or use shared memory to pass data between the processes).
The runtime environment also allows the profiling module 100 to execute as parallel processes. The same type of graphical representation described above can be used to describe parallel processing systems. For the purposes of this discussion, parallel processing systems include any configuration of computer systems using multiple central processing units (CPUs), whether local (for example, multiprocessor systems such as SMP computers), locally distributed (for example, multiple processors connected as clusters or MPPs), remote, or remotely distributed (for example, multiple processors connected via a LAN or WAN network), or any combination thereof. Again, the graphs are composed of components (graph vertices) and flows (graph links). By explicitly or implicitly replicating elements of the graph (components and flows), parallelism in a system can be represented.
A flow control mechanism is implemented using input queues on the links feeding each component. This flow control mechanism allows data to flow between the components of a graph without being written to non-volatile local storage, such as a disk drive, which is typically large but slow. The input queues can be kept small enough that the work elements are held in volatile memory, which is smaller and faster than non-volatile memory. This potential saving in storage space and time applies even to very large data sets. Components can use output buffers instead of, or in addition to, input queues.
When two components are connected by a flow, the upstream component sends work elements to the downstream component as long as the downstream component keeps consuming them. If the downstream component falls behind, the upstream component fills up the downstream component's input queue and stops working until the input queue clears out again.
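The back-pressure behavior described above is essentially a bounded producer/consumer queue. A small, generic Python sketch (not the runtime environment's actual mechanism) shows the effect: the producer blocks once the queue is full and resumes as the consumer drains it:

```python
import queue
import threading
import time

link = queue.Queue(maxsize=4)          # small in-memory input queue on the link

def upstream():
    for i in range(20):
        link.put(i)                    # blocks while the downstream queue is full

def downstream():
    while True:
        item = link.get()
        time.sleep(0.01)               # a slow consumer causes back-pressure
        link.task_done()

threading.Thread(target=downstream, daemon=True).start()
upstream()
link.join()                            # wait until all work elements are consumed
```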
Computation graphs can be specified with various levels of abstraction. For example, a "subgraph" containing components and links can itself be represented in another graph as a single component, showing only the links that connect it to the rest of the graph.
4 Profiling graph
Referring to Fig. 3, in a preferred embodiment, a profiling graph 400 performs the computations of the profiling module 100. An input dataset component 402 represents data from potentially several types of data systems. The data systems may have different physical media types (for example, magnetic, optical, magneto-optical) and/or different data format types (for example, binary, database, spreadsheet, ASCII strings, CSV, or XML). The input dataset component 402 sends a flow of data into a make census component 406. The make census component 406 performs a "census" of the data set, creating a separate census record for each unique field/value pair found in the records flowing into the component. Each census record includes a count of the number of occurrences of the unique field/value pair for that census record.
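A minimal, non-parallel sketch of such a census pass in Python (the record representation is assumed, not specified by the patent) produces one (field, value, count) census record per unique pair:

```python
from collections import Counter

def make_census(records):
    """Produce one census record per unique (field, value) pair.

    records: iterable of dicts, one per input record.
    Returns a list of (field, value, count) tuples.
    """
    census = Counter()
    for rec in records:
        for field, value in rec.items():
            census[(field, str(value))] += 1   # canonicalize the value to a string
    return [(f, v, n) for (f, v), n in census.items()]

recs = [{"title": "Mr.", "state": "MA"},
        {"title": "Mrs.", "state": "MA"},
        {"title": "Mr.", "state": "NY"}]
for row in make_census(recs):
    print(row)    # e.g. ('title', 'Mr.', 2), ('state', 'MA', 2), ...
```

In the parallel version described below, each partition computes partial counts like these, which are then repartitioned by field/value and rolled up into total counts.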
The make census component 406 has a cleansing option that maps a set of invalid values onto valid values according to the validity information stored in the corresponding field object. The cleansing option can also store records that have fields containing invalid values in a location represented by an invalid records component 408. A user who wants to determine the source of the invalid values, for example, can then examine those invalid records.
In the illustrated embodiment, the census records flowing out of the make census component 406 are stored in a file represented by a census file component 410. In some cases, this intermediate storage of census records can increase the efficiency with which multiple graph components access the census records. Alternatively, the census records can flow directly from the make census component 406 into an analyze census component 412 without being stored in a file.
The analyze census component 412 creates a histogram of the values of each field and performs other analyses of the data set based on the census records. In the illustrated embodiment, a field profiles component 414 represents an intermediate storage location for the field profiles. A load metadata store component 416 loads the field profiles and other profiling results into the corresponding objects in the metadata store 112.
The user interface 116 allows the user to browse the analyzed data, for example, to view histograms or common values of the fields. A "drill-down" capability is provided, for example, to view specific records associated with a bar of a histogram. The user can also update preferences through the user interface 116 based on the profiling results.
A sampling component 418 stores a collection of sample records 420 representing a sampling of the values associated with the records displayed in the user interface 116 (for example, associated with a bar of a histogram). A phase boundary, indicated by the dashed line 422, divides the graph 400 into two phases: the components to the right of the line begin running only after all of the components to the left of the line have finished. The sampling component 418 therefore runs after the analyze census component 412 has finished storing its results in the field profiles component 414. Alternatively, sample records can be retrieved from their locations in the input dataset 402.
The profiling module 100 can be started by the user 118 or automatically by a scheduler. Once the profiling module 100 is started, a master script (not shown) collects from the metadata store 112 any DML files and parameters to be used by the profiling graph 400. Parameters can be obtained, for example, from the profile setup object 200, the dataset object 206, and the field objects 218. If necessary, the master script can create new DML files based on the information supplied about the data set to be profiled. For convenience, the master script can compile the parameters into a job file. The master script then runs the profiling graph 400 with the appropriate parameters from the job file, presenting a progress display while the profiling graph 400 runs, tracking elapsed time and estimating the time remaining. The estimated remaining time is computed based on data (for example, work elements) written to the metadata store 112 as the profiling graph 400 runs.
4.1 Data format interpretation
The ability of the profiling module 100 to interpret the data formats of a wide variety of data systems is implemented using an input component. The input component is configured to interpret some data formats directly, without using a DML file. For example, the input component can read data from data systems based on the structured query language (SQL), an ANSI-standard computer language for accessing and manipulating databases. Other data formats that are processed without a DML file include, for example, text formatted according to the XML standard or using comma-separated values (CSV).
For other data formats, the input component uses a DML file specified in the profile setup object 200. A DML file can specify various aspects of interpreting and manipulating the data in the data set. For example, a DML file can specify the following for a data set:
Type objects — define the correspondence between raw data and the values represented by the raw data.
Key specifiers — define ordering, partitioning, and grouping relationships among records.
Expressions — define computations, performed on values from constants, fields of data records, or the results of other expressions, that produce new values.
Transform functions — define collections of rules and other logic used to produce zero or more output records from one or more input records.
Packages — provide a convenient way to group type objects, transform functions, and variables that can be used by components to accomplish various tasks.
Type objects are the basic mechanism used to read individual work elements (for example, individual records) from the raw data of a data system. The runtime environment allows a string of raw data bits (for example, mounted in a file system or flowing over a network connection) to be accessed from a physical computer-readable storage medium (for example, magnetic, optical, or magneto-optical media). The input component consults a DML file to determine how to read and interpret the raw data in order to produce the flow of work elements.
Referring to Fig. 4, a type object 502 can be, for example, a base type 504 or a compound type 506. A base type object 504 specifies how to interpret a string of bits (of a given length) as a single value. The base type object 504 includes a length specification, which indicates how many raw data bits to read and parse. The length specification can indicate a fixed length, for example a specified number of bytes, or a variable length, for example by specifying a delimiter (such as a particular character or string) at the end of the data, or a (potentially variable) number of characters to be read.
A void type 514 represents a block of data whose meaning or internal structure does not need to be interpreted (for example, compressed data that is not interpreted until after it is decompressed). The length of a void type 514 is specified in bytes. A number type 516 represents a number, which can be interpreted differently depending on whether it is designated an integer 524, a real number 526, or a decimal 528, according to various standard or CPU-native encodings. A string type 518 is used to interpret text using a specified character set. A date type 520 and a datetime type 522 are used to interpret calendar dates and/or times using a specified character set and other formatting information.
A compound type 506 is an object made up of multiple sub-objects, which can themselves be base or compound types. A vector type 508 is an object containing a sequence of objects of a single type (base or compound). The number of sub-objects in the vector (that is, the vector length) can be indicated by a constant in the DML file or by a rule (for example, a delimiter indicating the end of the vector) that allows vectors of variable length to be profiled. A record type 510 is an object containing a sequence of objects, each of which can be a different base or compound type. Each object in the sequence corresponds to a value associated with a named field. Using a record type 510, a component can interpret a block of raw data to extract the values of all of the fields of a record. A union type 512 is an object similar to a record type 510, except that objects corresponding to different fields interpret the same raw data bits as different values. The union type 512 provides a way to make multiple interpretations of the same raw data.
DML files also allow data profiling to use custom data types. A user can define a custom type object in terms of other DML type objects (base or compound) by supplying a type definition. The profiling module 100 can then use the custom type object to interpret data having the corresponding structure.
DML files also allow data profiling to use conditional structures. A record may include certain fields only when values associated with other fields meet certain conditions. For example, a record may include a "spouse" field only if the value of a "married" field is "yes". A DML file includes rules for determining whether a conditional field exists for a given record. If a conditional field does exist in a record, its value can be interpreted using a DML type object.
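To make the conditional-field idea concrete, here is a small, purely illustrative Python sketch of a record specification in which one field's presence depends on another field's value (the rule format is invented for the example, not DML syntax):

```python
# Each field spec: (name, parse_function, presence_rule or None).
# A presence rule is a function of the partially parsed record.
RECORD_SPEC = [
    ("name",    str,  None),
    ("married", str,  None),
    ("spouse",  str,  lambda rec: rec["married"] == "yes"),  # conditional field
]

def parse_record(tokens, spec=RECORD_SPEC):
    """Parse a list of raw tokens into a record, skipping absent conditional fields."""
    rec, it = {}, iter(tokens)
    for name, parse, present in spec:
        if present is not None and not present(rec):
            continue                       # the field does not exist in this record
        rec[name] = parse(next(it))
    return rec

print(parse_record(["Ana", "yes", "Lee"]))   # {'name': 'Ana', 'married': 'yes', 'spouse': 'Lee'}
print(parse_record(["Bo", "no"]))            # {'name': 'Bo', 'married': 'no'}
```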
Graphs can use the input component to handle a variety of record structures efficiently. Because the input component can interpret records with variable record structures, such as conditional records or variable-length vectors, the sections of a graph can process such data without first flattening it into fixed-length segments. Another type of processing that graphs can perform using the input component is discovering relationships among parts of the data (for example, data in different records, tables, or files). Graphs can use rules in the input component to look for foreign key or primary key relationships between a field in one table and a field in another table, or to perform functional dependency calculations across parts of the data.
4.2 Statistics
Referring to Fig. 5A, a subgraph 600 implementing one embodiment of the make census component 406 includes a filter component 602, which passes a portion of the input records based on a filter expression stored in the profile setup object 200. The filter expression can limit the number of fields or values that are profiled. An example of a filter expression is one that restricts profiling to a single field of each input record (for example, "title"). Another optional function of the filter component 602 is to apply the cleansing option described above and send samples of the invalid records to the invalid records component 408. The flow of records leaving the filter component 602 feeds a partial rollup sequence statistics component 604 and a round-robin partition component 612.
In the subgraph 600, the ability of the profiling graph 400 (and other graphs and subgraphs) to run in parallel on multiple processors and/or computers, and to read parallel data sets stored across multiple locations, is indicated by the border thickness of the components and by the symbols on the links between them. A component that represents a storage location, such as the input dataset component 402, is drawn with a bold border to show that it can optionally be a parallel data set. A processing component, such as the filter component 602, is drawn with a bold border to show that the process can optionally run as multiple partitions, each partition running on a different processor or computer. The user can optionally indicate, through the user interface 116, whether graph components should run in parallel or serially. A thin border indicates that a data set or process is serial.
The partial rollup sequence statistics component 604 computes statistics related to the sequential characteristics of the input records. For example, the component 604 can count the number of records whose field values are sequential (increasing, decreasing, or incrementing by 1). When running in parallel, sequence statistics are computed separately for each partition. A rollup procedure combines information from multiple input elements (for component 604, the sequence statistics of the partitions) to produce a single output element that typically replaces the combined input elements. A gather link symbol 606 represents the merging, or "gathering", of the data flows from any number of partitions of a parallel component into a single data flow for a serial component. A global rollup sequence statistics component 608 combines the "partial" sequence statistics from the partitions into a single "global" set of sequence statistics representing the records of all partitions. The resulting sequence statistics can be stored in a temporary file 610.
Fig. 6 is a flowchart of an example of a procedure 700 for performing a rollup, including the rollups performed by the partial rollup sequence statistics component 604 and the global rollup sequence statistics component 608. The procedure 700 begins by receiving an input element (step 702). The procedure 700 then updates the information being compiled (step 704) and determines whether there are more elements to compile (step 706). If there are more elements, the procedure 700 receives the next element (step 702) and updates the information (step 704). If there are no more elements, the procedure 700 outputs an element based on the compiled information (step 708). A rollup procedure can be used to combine a group of elements into a single element, or to determine aggregate properties of a group of elements (for example, statistics of the values in those elements).
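The rollup pattern of procedure 700 is essentially a fold over a flow of elements. As an illustration (with invented element and accumulator shapes), partial counts per partition can be rolled up into global counts like this:

```python
def rollup(elements, update, finalize, state):
    """Generic rollup: fold each input element into state, then emit one output."""
    for element in elements:            # steps 702/704/706 of procedure 700
        state = update(state, element)
    return finalize(state)              # step 708

def count_pair(state, field_value):
    """Partial rollup update: count one (field, value) occurrence."""
    state[field_value] = state.get(field_value, 0) + 1
    return state

def merge_counts(state, partial):
    """Global rollup update: add one partition's partial counts into the total."""
    for key, n in partial.items():
        state[key] = state.get(key, 0) + n
    return state

# One partition's census counts, then the global combination of two partitions.
partial_1 = rollup([("title", "Mr."), ("title", "Mr."), ("title", "Ms.")],
                   count_pair, lambda s: s, {})
partial_2 = {("title", "Mr."): 1}
total = rollup([partial_1, partial_2], merge_counts, lambda s: s, {})
print(total)    # {('title', 'Mr.'): 3, ('title', 'Ms.'): 1}
```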
The round-robin partition component 612 takes records from the partition or partitions of the input dataset 402 and repartitions the records among a number of parallel processors and/or computers (for example, selected by the user) to balance the workload among those processors and/or computers. A cross-connect link symbol 614 represents this repartitioning of the data flow, performed by the component 612 over the link.
A normalize component 616 takes the flow of records and emits a flow of census elements containing field/value pairs, one element for each field of each input record. (An input record with 10 fields produces 10 census elements.) Each value is converted into a canonical (that is, according to a predetermined format), human-readable string representation. A census element also includes flags indicating whether the value is valid and whether the value is null (that is, corresponds to a predetermined "null" value). These census element flows feed a partial rollup field/value component, which (within each partition) accumulates the occurrences of identical values of the same field, combining them into census elements that include a count of the number of occurrences. Another output of the normalize component 616 is a count of the total number of fields and values; these counts from all partitions are gathered and combined in a global rollup totals component 618. The totals are stored in a temporary file 620 for loading into the dataset profile 216.
Fig. 7 is a flowchart of an example of a procedure 710 that the normalize component 616 can perform to handle conditional records, which may not all have the same fields, and produce the flow of field/value census elements. The procedure 710 runs a nested loop that begins by getting a new record (step 712). For each record, the procedure 710 gets a field of that record (step 714) and determines whether the field is conditional (step 716). If the field is conditional, the procedure 710 determines whether the field is present in that record (step 718). If the field is present, the procedure 710 normalizes the record's value for that field (step 720) and produces an output element containing the corresponding field/value pair. If the field is not present, the procedure 710 continues by determining whether there is another field (step 722) or another record (step 724). If the field is not conditional, the procedure 710 normalizes the record's value for that field (including a possible null value) at step 720 and proceeds to the next field or record.
A field/value partition component 624 repartitions the census elements by field and value, so that the rollup performed by a global rollup field/value component 626 can add together occurrence counts computed in different partitions, producing a total occurrence count in a single census element for each unique field/value pair contained in the profiled records. The global rollup field/value component 626 processes the census elements in the (potentially multiple) partitions of a potentially parallel file represented by the census file component 410.
Fig. 5B is a diagram of a subgraph 630 implementing the analyze census component 412 of the profiling graph 400. A partition-by-field component 632 reads the flow of census elements from the census file component 410 and repartitions the census elements according to a hash value based on the field, so that census records having the same field (but different values) end up in the same partition. A string/number/date partition component 634 further separates the census elements according to the type of value in each census element. Different statistics are computed (using rollup procedures) for values treated as strings (in a rollup string component 636), as numbers (in a rollup number component 638), or as dates/datetimes (in a rollup date component 640). For example, computing a mean and a standard deviation is appropriate for numbers but not for strings.
These results are gathered from all partitions, and a histogram/decile information component 642 provides the information needed to build histograms (for example, the maximum and minimum values of each field) to a bucket computation component 654, and the information needed to compute decile statistics (for example, the number of values in each field) to a decile computation component 652. After the histogram/decile information component 642 (above the phase-boundary dashed line 644) finishes running, the components of the subgraph 630 that generate the histograms and decile statistics (below the phase-boundary dashed line 644) run.
Subgraph table 630 makes up the tabulation of the value (for example, greater than value of 10%, 20% of value or the like) on decile border by following steps: by value these are investigated elements and sort out in each subregion (sorting out in the assembly 646); Again cut apart these investigation elements according to the value of being sorted out (cutting apart in the assembly 648) in value; And these investigation elements are merged into sort out (serial) stream, and flow into decile computation module 652.In 1/10th the group of the sum of the value of decile computation module 652 in each field, count the classification value of this field, to seek value on the decile border.
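As a rough illustration of the decile computation, the following sketch (hypothetical Python over already-sorted value/count pairs; the exact boundary convention is an assumption) reports the value found at each 10% position of the total number of occurrences:

```python
def decile_boundaries(value_counts):
    """value_counts: list of (value, count) pairs, already sorted by value."""
    total = sum(count for _, count in value_counts)
    boundaries = []
    next_decile, seen = 1, 0
    for value, count in value_counts:
        seen += count
        # emit this value for every 10% boundary it covers
        while next_decile <= 9 and seen >= next_decile * total / 10.0:
            boundaries.append(value)
            next_decile += 1
    return boundaries

print(decile_boundaries([("A", 3), ("B", 5), ("C", 2)]))
# ['A', 'A', 'A', 'B', 'B', 'B', 'B', 'B', 'C']
```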
Subgraph table 630 makes up the histogram of each field by following steps: the value of calculating the parantheses (or " playing up the zone ") of each value of definition; Count (rolling histogram assembly 656 partially) to falling into the identical value of playing up each subregion in zone; Each value of playing up in the zone from all subregions is counted (rolling on the whole in the histogram assembly 658).Then, mergefield filing part assembly 660 will be used for all information that each field is filed from temporary file 610, comprise histogram, decile statistical figure and sequence statistical figure, collect in the field filing assembly 414.Fig. 5 C is the synoptic diagram of the subgraph table 662 of the sampling component 418 of enforcement filing chart 400.The same with subgraph table 600, the single or multiple subregions extraction records of assembly 664 from input data set 402 are cut apart in circulation, and between a plurality of parallel processors and/or computing machine cutting recording again, with balance working load between these processors and/or computing machine.
Search and select the information of assembly 666 uses from field filing assembly 414, whether definite record is corresponding to the value that shows on the user interface 116, and this value can be used for checking of " down deep drilling " by user's selection.Every type the value that shows in the user interface 116 is corresponding to different " sampling type ".If the value in the record corresponding to the sampling type, is searched and selected assembly 666 to calculate and select number at random, this selects number to determine whether this record is selected for expression sampling type at random.
For example, to keep a total of five sample records for a particular sample type, if the selection number is one of the five largest selection numbers seen so far for that sample type in a single subregion, the corresponding record is passed as output, along with browsing information indicating which value it corresponds to for "down deep drilling". With this scheme, the first five records of any sampling type, like any other record with one of the five largest selection numbers seen so far, are automatically passed on to the next assembly.
The next assembly is the sampling type splitting assembly 668, which splits the records again by sampling type so that the sorting assembly 670 can sort them by selection number within each sampling type. Then a search assembly 672 selects, for each sampling type, the five records with the largest selection numbers (across all subregions). Then the record/link sampling assembly 674 writes these sample records to the sample record file 420 and links these records to the corresponding values in the field filing assembly 414.
Metadata load libraries assembly 416 from temporary file assembly 620 is loaded into data set archive 216 objects the metadata repository 112, and is filed each field filing data set archive in 222 objects from the field that field filing assembly 414 is loaded into the metadata repository 112.Then, the filing result that user interface 116 can the retrieve data collection, and on the screen that user interface 116 produces, be shown to user 118.The user can browse the filing result, with histogram or the common value of watching field.For example, can provide the ability of " down deep drilling ", to watch the specific record that is associated with bar in the histogram.
Fig. 8 A-C illustration show filing result's user interface screen output.Fig. 8 A illustrates the result from data set archive 216.Various total 802 of data set illustrates as a whole, also shows and the general introduction 804 of filing the field associated attributes.Fig. 8 B-C illustrates the result from exemplary field filing 222.Show that with various forms to for example selection of the value of maximum common value 806 and maximum public invalid value 808, these forms comprise: himself is as the occurrence rate 814 and the bar shaped statistical graph 816 of the number percent of the sum of the total 812 of the occurrence rate of the value 810 of the readable string of personnel, value, conduct value.Scope a plurality of that the histogram 818 of shown value illustrates spanning value play up that each plays up the clauses and subclauses in zone in the zone, comprise that counting is 0 the zone of playing up.Also shown decile border 820.
5 examples
5.1 data discovery
Fig. 9 is a flowchart of an example of a program 900 for filing a data set so that its contents are discovered before it is used by another program. Program 900 can be executed automatically (for example, by a scheduled script) or manually (for example, by a user at a terminal). First, in step 902, program 900 identifies a data set to be filed on one or more data systems accessible in the runtime environment. Then, based on supplied information or existing metadata, program 900 can optionally set the record format in step 904 and set the validity rules in step 906. For some types of data, for example database tables, a default record format and validity rules can be used. Then, in step 908, program 900 runs the filing on the data set (or a subset of the data set). Based on the results of this initial filing, program 900 can refine the record format in step 910 or refine the validity rules in step 912. If either filing option changed, program 900 decides in step 914 whether to run another filing on the data set with the new options; if enough information about the data set has been obtained from the (possibly repeated) archiving process, the data set is processed in step 916. This processing may read directly from the one or more data systems, using the information obtained from the archiving process.
5.2 quality test
Figure 10 is a flowchart of an example of a program 1000 for filing a data set to test its quality before converting it and loading it into a data store. Program 1000 can be executed automatically or manually. The rules for testing the quality of a data set can come from prior knowledge of the data set, and/or from the results of a filing program 900 run, for example, on a similar data set (for example, a data set from the same source as the data set to be tested). This program 1000 can be used commercially, for example, to file regular (e.g., monthly) data feeds sent by a business partner before the data is input or processed. This allows the business to detect "bad" data (for example, data in which the ratio of invalid values is above a threshold), so that an existing data store is not "polluted" by an action that is difficult to undo.
First, in step 1002, program 1000 identifies a data set to be tested on one or more data systems accessible in the runtime environment. Then, in step 1004, program 1000 runs the filing on the data set (or a subset of the data set), and in step 1006 performs quality tests based on the filing results. For example, the ratio of occurrence of a particular common value in this data set can be compared with the ratio of occurrence of that common value in a past data set (based on a previous filing run); if the two ratios differ by more than 10%, the quality test fails. This quality test can be applied to a series of well-known values that always appear in the data set (within 10%). In step 1008, program 1000 determines the result of the quality tests and, on failure, raises a flag in step 1010 (for example, a user interface prompt or an entry in a log file). If the quality tests pass, program 1000 reads the data directly from the one or more data systems, converts the data set in step 1012 (possibly using information from the filing), and loads it into the data store. The program can then repeat, for example by identifying another data set in step 1002.
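A simple sketch of the comparison in step 1006 might look as follows (hypothetical Python; the profile structure, the use of an absolute difference, and the interpretation of the 10% threshold are assumptions):

```python
def quality_test(current_profile, past_profile, common_values, threshold=0.10):
    """profiles map a value -> its occurrence ratio (fraction of records)."""
    failures = []
    for value in common_values:
        old = past_profile.get(value, 0.0)
        new = current_profile.get(value, 0.0)
        if abs(new - old) > threshold:          # ratio drifted too far: flag it
            failures.append((value, old, new))
    return failures                             # empty list means the test passed

past    = {"US": 0.90, "CANADA": 0.08}
current = {"US": 0.55, "CANADA": 0.40}
print(quality_test(current, past, ["US", "CANADA"]))
# [('US', 0.9, 0.55), ('CANADA', 0.08, 0.4)]
```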
5.3 code generation
The filing module 100 can produce executable code, for example chart assemblies that can be used to process a recorded stream from the data set. A produced assembly can filter incoming records, letting only valid records pass, similar to the cleaning option of the filing chart 400. For example, the user can select a filing option indicating that a cleaning assembly should be produced when the filing run finishes. The code used to implement this assembly refers to a file location (specified by the user). The produced cleaning assembly can then run in the same runtime environment as the filing module 100 (it uses the information stored in the metadata repository 112 during the filing run).
6 converge-field analysis
The filing module 100 can optionally analyze the relationships among one or more groups of fields. For example, the filing module 100 can analyze the relationship between the two fields of a pair of fields, which may be in the same data set or in different data sets. Similarly, the filing module can analyze many pairs of fields, for example each field of one data set paired with each field of another data set, or each field of a data set paired with each other field of the same data set. Analyzing a pair of fields in two different data sets relates to the characteristics of a converge operation performed on the two data sets over those fields, as discussed in more detail below.
In a first way of performing converge-field analysis, a converge operation is carried out on the two data sets (for example, files or forms). In another way (described in section 6.1 below), after the investigation assembly 406 has produced the survey document for a data set, the information in that survey document can be used to perform converge-field analysis between fields in two different filed data sets, or between fields in two different parts of the same filed data set (or any other data set having a survey document). The result of the converge-field analysis includes information about potential relationships between these fields.
Three types of relationship are found: a "generic domain" relationship, a "converges well (joins well)" relationship, and an "external key" relationship. If the results of the converge-field analysis meet certain standards described below, a pair of fields is classified as having one of these three types of relationship.
Converge-field analysis comprises compiling information, this information for example is to use these two fields to converge the record number that operation produces as what key field was carried out.Two examples converging operation of Figure 11 A-B for the record from two database tables is carried out.Form A and form B all have two fields and four records that are labeled as " field 1 " and " field 2 ".
With reference to Figure 11 A, converge assembly 1100 and will compare with value from the value of the key field of the record of form A from the key field of the record of form B.For form A, key field is a field 1; For form B, key field is a field 2.Therefore, converge assembly 1100 will from the value 1102 of form A, field 1 (A1) with compare from the value 1104 of form B, field 1 (B1).Converge assembly 1100 and receive input recorded stream 1110 from these forms, and produce the new form that converges in order to form based on the comparative result of key-field value, promptly form C's converges recorded stream 1112.Converge the assembly 1100 key-field value for every pair of coupling of these inlet flows, producing one converges record, and this converges, and to write down be the series connection of record with key-field value of coupling.
The number of converge records with a particular key-field value at the converge output port 1114 is the Cartesian product of the numbers of records with that key-field value in each input. In the example shown, the input recorded streams 1110 are marked by the values of their respective key fields, and the output stream 1112 of converge records is marked by the matching values. Since two "X" values appear in each of the two inlet streams, four "X" values appear in the output stream. For the inlet streams of form A and form B, any record in one inlet stream whose key-field value is unmatched by any record in the other inlet stream is output at the "refusal" output port 1116A or 1116B, respectively. In the example shown, the "W" value appears at the "refusal" output port 1116A.
The filing module 100 compiles statistics of the converged and refused values and uses them to classify the relationship between the two fields. These statistics are summarized in a distribution table 1118, which classifies the distribution of values in the two fields. The "occurrence number" indicates the number of times a value appears in a field. The columns of the table correspond to occurrence numbers 0, 1 and N (where N>1) for the first field (from form A in this example), and the rows of the table correspond to occurrence numbers 0, 1 and N (where N>1) for the second field (from form B in this example). Each box of the table corresponds to a distribution pattern "column occurrence number" * "row occurrence number" and contains two counts: the number of different values having that distribution pattern, and the total number of converge records for those values. In some cases a value appears in both fields (that is, has distribution pattern 1 * 1, 1 * N, N * 1 or N * N); in other cases a value appears in only one field (that is, has distribution pattern 1 * 0, 0 * 1, N * 0 or 0 * N). The two counts are separated by a comma.
Distribution table 1118 comprises corresponding to the counting that converges record 1112 and the record of the refusal on port one 116A.Value " W " on " refusal " output port 1116A is represented single value and single record respectively corresponding to the counting in the frame of distribution pattern 1 * 0 " 1,1 ".Because value " X " occurs twice at each inlet flow, converge record for total four, value " X " is corresponding to the counting in the frame of distribution pattern N * N " 1,4 ".Because value " Y " occurs once and occurs twice at second inlet flow at first inlet flow, converge record for total two, value " Y " is corresponding to the counting in the frame of distribution pattern 1 * N " 1,2 ".
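The distribution table of this example can be reproduced with a short sketch (hypothetical Python working directly on lists of key values rather than on investigation elements; the structures and names are illustrative):

```python
from collections import Counter

def distribution_table(values_a, values_b):
    # Classify every distinct key value by its occurrence pattern (0, 1 or N)
    # in the two inputs; per pattern, accumulate the number of different values
    # and the total number of converge (or refused) records for those values.
    count_a, count_b = Counter(values_a), Counter(values_b)
    bucket = lambda n: 0 if n == 0 else (1 if n == 1 else "N")
    table = {}
    for value in set(count_a) | set(count_b):
        ca, cb = count_a[value], count_b[value]
        records = ca * cb if ca and cb else ca + cb   # Cartesian product or reject count
        pattern = (bucket(ca), bucket(cb))
        distinct, total = table.get(pattern, (0, 0))
        table[pattern] = (distinct + 1, total + records)
    return table

# Key-field values of Figure 11A: form A field 1 vs. form B field 1.
print(distribution_table(["W", "X", "X", "Y"], ["X", "X", "Y", "Y"]))
# {(1, 0): (1, 1), ('N', 'N'): (1, 4), (1, 'N'): (1, 2)}  (ordering may vary)
```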
Figure 11 B is and the example similar paradigm of Figure 11 A that it is right still to have different key fields.For form A, key field is a field 1; For form B, key field is a field 2.Therefore, converge assembly will from the value 1102 of form A, field 1 (A1) with compare from the value 1120 of form B, field 2 (B2).This example has distribution table 1122, and the counting of this distribution table 1122 is corresponding to the input recorded stream 1124 of these fields.Similar with the example among Figure 11 A, there is single refusal value " Z ", it is corresponding to the counting in the frame of distribution pattern 0 * 1 " 1,1 ".Yet, in this example, there are two values " W " and " Y ", owing to have two values and two to converge record, they all have distribution pattern 1 * 1, corresponding to the counting in the frame of distribution pattern 1 * 1 " 2,2 ".
Value " X " indicates single value and 2 and converges record corresponding to the counting in the frame of distribution pattern N * 1 " 1,2 ".
Various totals can be computed from the numbers in the distribution table. Some of these totals include: the total number of different key-field values that appear in both form A and form B, the total number of different key-field values that appear in form A, the total number of different key-field values that appear in form B, and the total number of values unique to each form (that is, values that appear only once, in a single record, for that key field). Some statistics based on these totals are used to determine whether a pair of fields has one of the aforementioned three types of relationship. These statistics include: the proportion of records in each field whose value is unique, the proportion of records having a particular distribution pattern, and the "relative value overlap" of each field. The relative value overlap is the proportion of the different values appearing in one field that also appear in the other field. The standards for determining whether a pair of fields has one of the three types of relationship (these relationships are not necessarily mutually exclusive) are as follows:
External key relationship - the first of the fields has a high relative value overlap (for example, >99%), and the second field has a high proportion of unique values (for example, >99%). The second field is a potential major key, and the first field is an external key of that potential major key.
Converges well relationship - at least one of the fields has a small ratio of refused records (for example, <10%), and the ratio of converge records with distribution pattern N * N is small (for example, <1%).
Generic domain relationship - at least one of the fields has a high relative value overlap (for example, >95%).
If a pair of fields has the external key relationship together with the converges well or generic domain relationship, the external key relationship is reported. If a pair of fields has the converges well and generic domain relationships but not the external key relationship, the converges well relationship is reported.
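The classification rules above can be expressed compactly; the sketch below (hypothetical Python; the statistics dictionary and the reporting precedence are simplified assumptions, using the example thresholds quoted in the text) illustrates the idea:

```python
def classify(stats):
    """stats: dict with keys
       overlap_a, overlap_b - relative value overlap of each field (0..1)
       unique_b             - proportion of unique values in the second field
       reject_ratio         - proportion of refused (unmatched) records
       nxn_ratio            - proportion of converge records with pattern N * N
    """
    relations = set()
    if stats["overlap_a"] > 0.99 and stats["unique_b"] > 0.99:
        relations.add("external key")
    if stats["reject_ratio"] < 0.10 and stats["nxn_ratio"] < 0.01:
        relations.add("converges well")
    if max(stats["overlap_a"], stats["overlap_b"]) > 0.95:
        relations.add("generic domain")
    # Reporting precedence described in the text.
    if "external key" in relations:
        return "external key"
    if "converges well" in relations:
        return "converges well"
    return "generic domain" if relations else "none"

print(classify({"overlap_a": 0.999, "overlap_b": 0.60,
                "unique_b": 1.0, "reject_ratio": 0.02, "nxn_ratio": 0.0}))
# external key
```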
6.1 investigation converges
With reference to Figure 12 A, in reality these forms execution are converged in the alternative of operation, investigation converges the field that assembly 1200 is analyzed from form A and form B, and by the enquiry data of these forms is carried out " investigation converges " operation, compiles the statistical figure of distribution table.Each investigation records have field/value to and the occurrence count of this value in this field.Because for given key field, it is right that each investigation records has exclusive field/value, the value of then investigating the inlet flow that converges assembly 1200 is exclusive.The example of Figure 12 A is corresponding to the converge operation of key field to A1, B1 (shown in Figure 11 A).By relatively and this converge the corresponding investigation records of key field in the operation, select " field 1 " (A1) and by filter assemblies 1204 to select " field 1 " (B1) by filter assemblies 1202, investigation converges assembly 1200 and carries out potentially than converging the assembly 1100 much smaller comparison of comparison number of (it compares the key field from each record of form A and form B).The example of Figure 12 B selects " field 1 " (A1) and by filter assemblies 1208 to select " field 2 " (B2) corresponding to the operation that converges of key field to A1, B2 (shown in Figure 11 B) by filter assemblies 1206.The field value of shown selected investigation records 1210-1218 each comfortable field/value centering by them and the occurrence count of this value mark.
If converging assembly 1200, investigation finds two couplings between the value among the input investigation records 1210-1218, output record comprises: matching value, based on the corresponding distribution pattern of these two countings, and this record sum that will produce in the operation converging of key field (it is exactly the product of these two countings).If do not find the coupling of this value, also export this value and corresponding distribution pattern and record sum (it is the single counting in single input record).This information in investigation converges the output record of assembly 1200 is enough to be compiled in all countings in the distribution table that converges operation.
In the example of Figure 12 A, the distribution pattern of the value " W " that occurs at output terminal is 1 * 0, is 1 altogether; The distribution pattern of the value " X " that occurs at output terminal is N * N, is 4 altogether; And the distribution pattern of the value " Y " that occurs at output terminal is 1 * N, is 2 altogether.This information is corresponding to the information in the distribution table 1118 of Figure 11 A.In the example of Figure 12 B, the distribution pattern of the value " W " that occurs at output terminal is 1 * 1, is 1 altogether; The distribution pattern of the value " X " that occurs at output terminal is N * 1, is 2 altogether; The distribution pattern of the value " Y " that occurs at output terminal is 1 * 1, and value is 1; And the distribution pattern of the value " Z " that occurs at output terminal is 0 * 1, and value is 1.This information is corresponding to the information in the distribution table 1122 of Figure 11 B.
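The following sketch mirrors the investigation converge of Figure 12A (hypothetical Python over dictionaries of value counts standing in for investigation records; names are illustrative): for each value it reports the distribution pattern and the number of records the full converge operation would have produced.

```python
def census_join(census_a, census_b):
    # census_a/census_b: dicts mapping a key-field value to its occurrence count.
    pattern = lambda n: 0 if n == 0 else (1 if n == 1 else "N")
    out = []
    for value in sorted(set(census_a) | set(census_b)):
        ca, cb = census_a.get(value, 0), census_b.get(value, 0)
        # product of the two counts for matches; the single count otherwise
        total = ca * cb if ca and cb else ca + cb
        out.append((value, (pattern(ca), pattern(cb)), total))
    return out

# Key fields A1 and B1 of Figures 11A/12A.
print(census_join({"W": 1, "X": 2, "Y": 1}, {"X": 2, "Y": 2}))
# [('W', (1, 0), 1), ('X', ('N', 'N'), 4), ('Y', (1, 'N'), 2)]
```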
6.2 extension record
Be used for comprising based on investigation records and produce " extension record " in right the converging of a plurality of fields-field analysis that single investigation converges operation.In example shown in Figure 13, investigation converges the record that assembly 1200 relatively is used for the converging of two couples of key field A1, B1 and A1, B2-field analysis, and in conjunction with converging-field analysis shown in Figure 12 A-B.Exclusive identifier by connecting the pair of keys field converged and the value in investigation records, and keep identical occurrence count as this investigation records, produce extension record from this investigation records.
If converge-field analysis comprises the result of a field of converging with a plurality of other fields, then is that each value of this field produces a plurality of extension records.For example, investigation records 1210 is corresponding to two extension record 1301-1302, and value " W " is connected with " A1B2 " with identifier " A1B1 " respectively.It is identical with the mode of investigation records that it has processing value " WA1B1 " that investigation converges mode that assembly 1200 handles extension records 1301.Equally, investigation records 1211 is corresponding to two extension record 1303-1304, and investigation records 1212 is corresponding to two extension record 1305-1306.
In the converging of Figure 13-field analysis, field B1 only converges with other field (A1), so each investigation records 1213-1214 corresponds respectively to single extension record 1307-1308.Equally, field B2 only converges with other field (A1), so each investigation records 1215-1218 corresponds respectively to single extension record 1309-1312.Each extension record comprises the value based on the original value that is connected with exclusive field specifier.
With reference to Figure 14, the extension element 1400 processes input investigation records to produce extension records based on converge information 1401, which indicates which fields in the converge-field analysis are converged with which other fields. In this example, the converge information 1401 indicates that field F1 from the enquiry data of form T1 (having four investigation records 1402) is converged with four other fields: field F1 from the enquiry data of form T2 (having two investigation records 1404), field F2 from the enquiry data of form T2 (having two investigation records 1406), field F1 from the enquiry data of form T3 (having two investigation records 1408), and field F2 from the enquiry data of form T3 (having two investigation records 1410). The investigation record 1412 flowing into the extension element 1400 (having field F1 from form T1 and value Vi, where i=1, 2, 3 or 4) represents one of the four investigation records 1402 of the enquiry data. For the input investigation record 1412, the extension element 1400 produces four extension records 1413-1416.
The investigation converge assembly 1200 uses unique identifiers for the fields (including fields that have the same name in different forms). Extension record 1413 has the value c(T1, F1, T2, F1, Vi), where the value c is the concatenation of the original value Vi with identifiers of the field being converged and of the form (or file or other data source) from which the enquiry data for that field was produced. Including the form identifier distinguishes fields with the same name. Extension record 1415 has a value c(T1, F1, T3, F1, Vi) that differs from the value c(T1, F1, T2, F1, Vi) of extension record 1413, because the two forms T2 and T3 have the same field name F1. Alternatively, instead of field names, each field can be assigned a unique number.
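The behaviour of the extension element can be sketched as follows (hypothetical Python; the identifier format used for the concatenation is an assumption, since the text only requires that it be unique per converged pair):

```python
def extend(census_record, pairings):
    """census_record: (source, field, value, count);
       pairings: list of (other_source, other_field) this field is converged with."""
    source, field, value, count = census_record
    for other_source, other_field in pairings:
        # concatenate the pair identifiers with the original value,
        # keeping the occurrence count unchanged
        key = f"{source}.{field}|{other_source}.{other_field}|{value}"
        yield (key, count)

record = ("T1", "F1", "Vi", 3)
pairs = [("T2", "F1"), ("T2", "F2"), ("T3", "F1"), ("T3", "F2")]
for ext in extend(record, pairs):
    print(ext)
# ('T1.F1|T2.F1|Vi', 3)
# ('T1.F1|T2.F2|Vi', 3)
# ('T1.F1|T3.F1|Vi', 3)
# ('T1.F1|T3.F2|Vi', 3)
```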
6.3 converge-the field analysis chart
The chart that Figure 15 A-B uses for filing module 100 is to converge arbitrarily-field analysis selected field in the source in data source 30 (for example, form or file).User 118 selects to be used to file and converge-option of field analysis, comprise that execution does not converge-option of the filing of field analysis.User 118 selects to be used to converge-field of field analysis is right, comprising: two paired mutually specific fields, with a paired field or each field paired of each other field with each other field.User 118 selects an option, with allow identical form or the field in the file in pairs, or only allow different forms or the field in the file in pairs.These options are stored in the metadata repository 112.
With reference to Figure 15 A, for converge-the field analysis option in each source (for example, form or file) of field of appointment, chart 1500 is utilized as the enquiry data 1510 that these specific fields prepare and produces a file.1500 pairs in chart converges-and contained each such source operation is once in the field analysis.Filter assemblies 1504 carries out enquiry data 1502 receiving records that assembly 406 produces from investigation, and prepares to be used to converge-record of field analysis.Filter assemblies 1504 is lost the record (being determined by the user option that is stored in metadata repository 112) of field not to be covered in this analysis.Filter assemblies 1504 is also lost invalid value, null value and to other value not to be covered (for example well-known Data Labels) in the significant analysis of the content of data source.
The values in the enquiry data 1502 have been standardized by the standardization assembly 616 in the investigation assembly 406. However, these standardized values may contain portions that should not be used in a logical comparison (for example, strings with leading or trailing spaces, or numbers with leading or trailing zeros). The user 118 can select an option to compare these values "literally" or "logically". If the user 118 selects a "literal" comparison, the values in the investigation records are kept in their standardized form. If the user 118 selects a "logical" comparison, the filter assembly 1504 converts the values in the investigation records according to rules (for example, removing leading and trailing spaces from strings and removing leading and trailing zeros from numbers).
Value is cut apart assembly 1506 based on the value in investigation records cutting recording again.Any investigation records with identical value is placed into identical subregion.This allows to converge-and field analysis strides across the subregion parallel running of any number.Because for the input record with matching value, investigation converges 1200 of assemblies and produces output record, the investigation records in different subregions (or any extension record of their generations) needn't be compared mutually.
On roll logical value assembly 1508 and merge the right any investigation records of field/value that the conversion carried out because of filter assemblies 1504 has coupling.Record after the merging has occurrence count, this occurrence count be to the counting of all merge records and.For example, if field, value, counting are " quantity, 01.00; 5 " investigation records be converted into " quantity, 1,5 ", and field, value, counting are that the investigation records of " quantity; 1.0,3 " is converted into " quantity, 1; 3 " roll the record of logical value assembly 1508 after then and be merged into field, value, counting single record for " quantity, 1,8 " with these two conversions.
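A sketch of this "logical" normalization and roll-up (hypothetical Python; the exact canonical form chosen for numbers is an assumption) reproduces the example above:

```python
from collections import defaultdict

def logical_form(value):
    # Strip spaces and leading/trailing zeros by going through a numeric form.
    v = value.strip()
    try:
        f = float(v)
        return str(int(f)) if f == int(f) else repr(f)
    except ValueError:
        return v

def rollup_logical(census):
    # census: iterable of (field, value, count) investigation records.
    merged = defaultdict(int)
    for field, value, count in census:
        merged[(field, logical_form(value))] += count   # sum counts of merged records
    return dict(merged)

print(rollup_logical([("quantity", "01.00", 5), ("quantity", "1.0", 3)]))
# {('quantity', '1'): 8}
```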
With reference to Figure 15 B, for as converge-the specified every pair of source of field analysis option with one or more fields to be compared, that is, source A and source B, chart 1512 use enquiry data A 1514 for preparing and the enquiry data B1516 for preparing (preparing by chart 1500) operation.Two extension element 1400 are from the enquiry data group of received record of these preparations and converge information 1515, and this converges the specific field that information 1515 is specified among the source A that will compare with the specific field among the B of source.Extension record flows into investigation and converges assembly 1200, and this investigation converges the record that assembly 1200 produces the counting that comprises the field that is comparing in value, distribution pattern and the distribution table.Roll on the part and converge statistical figure assembly 1518 and be compiled in information in these records in each subregion.Then, record in various subregions converges 1520 collections of statistical figure assembly and compiling by rolling on the whole, this rolls on the whole and converges statistical figure assembly 1520 output files 1522, its comprise the field of all analyzed data source centerings all converge-the field analysis statistical figure.Converge-result of field analysis (be included in exist potentially between the various field in this relation of three types which) and be loaded in the metadata repository 112, be used to present to user 118.For example, user 118 can select to be used to have the link of a pair of field of potential relation on user interface 116, and watches the page on user interface 116, and this page has detail analysis result (it comprises this counting to field from distribution table).
With reference to Figure 15 C, when two fields in the identical sources (source C) are converged-during field analysis, enquiry data C 1526 operations that chart 1524 uses charts 1500 to prepare.Enquiry data C 1526 receiving records that single extension element 1400 is prepared from this group and converge information 1528, this converges the specific field that information 1528 is specified among the source C to be compared.Extension record flows into two ports that investigation converges assembly 1200, and this investigation converges assembly 1200 and produces the record that comprises the counting of field to be compared in value, distribution pattern and the distribution table.
In the case where the converge-field analysis option indicates that each field in source C is to be compared with each other field in source C (source C having four fields: F1, F2, F3, F4), one way is for the converge information 1528 to specify 12 pairs of fields (F1-F2, F1-F3, F1-F4, F2-F1, F2-F3, F2-F4, F3-F1, F3-F2, F3-F4, F4-F1, F4-F2, F4-F3). However, since the same operations would be performed for field pairs such as F1-F3 and F3-F1, some operations would be repeated. Another way is therefore for the converge information to specify only the six unique pairs F1-F2, F1-F3, F1-F4, F2-F3, F2-F4, F3-F4. In that case, by swapping the order of the fields in the analysis results of the six analyzed pairs, the output file 1530 also includes the results for the other six field pairs.
7 functional dependence analyses
Another type of analysis that the filing module 100 can perform is to test the functional relationships between the values of fields. The fields being tested can come from a single form having a group of fields, or from a "virtual" form that comprises fields from multiple related sources (for example, sources related by a converge operation on a shared key field, as described in detail in section 7.3). One type of functional relationship between a pair of fields is "functional dependence": the value associated with one field of a record is uniquely determined by the value associated with another field of that record. For example, if a database has a state (State) field and a postcode (Zip Code) field, the value of the postcode field (for example, 90019) determines the value of the state field (for example, CA). Each value of the postcode field is mapped to a unique value of the state field (that is, a "many-to-one" mapping). A functional dependence relationship may also exist between subsets of fields, where the values associated with one set of fields of a record are uniquely determined by the values associated with another set of fields of the record. For example, the value of the postcode field may be uniquely determined by the values of a city field and a street field.
A functional dependence can also be an "approximate functional dependence", in which some (but not necessarily all) of the values associated with one field map to a unique value of the other field, while a certain proportion of exceptional values do not map to a unique value. For example, some records may have an unknown postcode represented by the particular value 00000. In that case, the value 00000 of the postcode field may map to more than one value of the state field (for example CA, FL and TX). Exceptional values may also appear because of records with incorrect values or other errors. If the proportion of exceptional values is smaller than a predetermined (for example, user-supplied) threshold, the field can still be determined to have a functional dependence with the other field.
With reference to Figure 16, an example form 1600 is shown, having records (rows) and fields (columns) on which functional dependence or approximate functional dependence will be tested. The surname (LastName) field has 12 values corresponding to the 12 records (rows 1-12). Ten of these values are unique, and two of the records have the same repeated value name_g. The nationality (Citizenship) field has two unique values: US, which occurs eleven times, and CANADA, which occurs once. The postcode field has various values, each corresponding to one of the three values CA, FL and TX of the state field. Except for the postcode value 00000, which corresponds to FL in one record (row 10) and to TX in another record (row 12), each value of the postcode field uniquely determines the value of the state field.
7.1 functional dependence analysis diagram
Figure 17 is the example of the chart 1700 of filing module 100 uses, with (for example to the one or more sources in data source 30, in single form or file, or as 7.3 the joint described in a plurality of forms and/or file) in selected field carry out functional dependence analysis arbitrarily.User 118 selects to be used to file the option with the functional dependence analysis, comprises that execution do not carry out the option of the filing of functional dependence analysis.User 118 can select then which to field or which field to carrying out the test of funtcional relationship.User 118 selects data source, and (for example, form or file) specific fields, and selection for example " all to selection " or " choosing selection " determines which field of test is right, or selection " all to whole ", comes all fields in the test data source right.Determination field with or do not have functional dependence with another field before, the user can select to determine the threshold value of functional dependence degree.For example, the user can select to determine to allow the threshold value of how many exceptional values (as the ratio of record).These options are stored in the metadata repository 112.
For each pair of fields (f1, f2) to be analyzed, the chart 1700 determines whether a functional dependence relationship exists and, if it does, classifies the relationship between field f1 and field f2 as: "f1 determines f2", "f2 determines f1", "one to one" (there is a one-to-one mapping between f1 and f2), or "identical" (f1 has substantially the same value as f2 in each record). The chart 1700 reads field information 1702 stored by the filing module 100 to determine the unique identifiers of the fields to be analyzed. A pairing assembly 1704 uses the pair of unique identifiers of each pair of fields to be tested to produce a stream of field pairs (f1, f2). Since the relationship between f1 and f2 need not be symmetric, the field pairs (f1, f2) are ordered pairs; the stream therefore includes both pairs (f1, f2) and (f2, f1).
Field is to selecting assembly 1706 by selecting the user right for analyzing the field of selecting, and the field that limits the remainder that flows to chart 1700 is right.Field is to selecting assembly 1706 also based on various optimizations, and the field that further limits the remainder that flows to chart 1700 is right.For example, field and himself are not paired, because such to be classified into " being equal to " by definition.Therefore, do not comprise in this stream field to (f1, f1), (f2, f2) ... Deng.Other optimization can remove one or more fields from this stream right, and this will describe in detail in 7.2 joints below.
Broadcasting assembly 1708 sequence flows that field is right is broadcast to each subregion of (walking abreast potentially) added value assembly 1718, by 1710 expressions of broadcasting link symbol.Each subregion of added value assembly 1718 adopt field to (for example, (and LastName, Citizenship), (Zip, State) ... Deng) stream and field/value to (for example, (LastName, name_a), (LastName, name_b), (LastName, name_c) ..., (Citizenship, Canada) (Citizenship, US), (Citizenship, US) ... Deng) stream as input.
For obtaining the right stream of field/value, filter assemblies 1712 extracts record from input data set 402, and optionally removes partial record based on filtering expression formula.Cut apart assembly 1714 from the record inflow circulation that filter assemblies 1712 flows out.Circulation is cut apart assembly 1714 and is extracted record from the subregion of input data set 402, and between a plurality of parallel processors and/or computing machine cutting recording again, with balance working load between these processors and/or computing machine.Standardization assembly 1716 (similar with aforementioned standardization assembly 616) obtains recorded stream, and sends out the right stream of field/value of the value of each field in the expression input record.As mentioned above, each value is converted into the readable string list of normalized personnel and shows.
Added value assembly 1718 is carried out a sequence and is converged operation, to produce f1/f2/v1/v2 quadruple (quadruples) stream, wherein f1 and f2 corresponding to the field that receives at input end to one of, v1 and v2 corresponding to record in the paired value of these fields.In the example of form 1600, when the surname field corresponding to f1 and nationality's field during corresponding to f2, added value assembly 1718 produces 12 f1/f2/v1/v2 quadruple streams, comprise: (LastName/Citizenship/name_a/Canada), (LastName/Citizenship/name_b/US) ... (LastName/Citizenship/name_k/US), (LastName/Citizenship/name_g/US).For (Zip, State) right with any other field of having analyzed, added value assembly 1718 produces similar sequences, i.e. f1/f2/v1/v2 quadruple stream.
Added value assembly 1718 outputs to " rolling f1/f2/v1/v2 on the part " assembly 1720 with f1/f2/v1/v2 quadruple stream, the assembly 1720 (for each subregion) that " rolls f1/f2/v1/v2 on the part " uses identical field and value f1, f2, v1, v2 a plurality of quadruples that add up to flow, and they are expressed as single quadruple stream, it has the counting of the occurrence number of quadruple stream in the inlet flow.The output stream of the assembly 1720 that " rolls f1/f2/v1/v2 on the part " is made up of the quadruple stream with counting (" the quadruple stream that adds up ").
Adding up of taking place in " rolling f1/f2/v1/v2 on the part " assembly 1720 is positioned at each subregion.Therefore some quadruples with identical f1, f2, v1, v2 value flow and can not added up by this assembly 1720." f1/f2 is cut apart " assembly 1721 is cut apart the quadruple stream that adds up again, thereby the quadruple with same field f1, f2 flows in identical subregion.The assembly 1722 that " rolls f1/f2/v1/v2 on the whole " the quadruple stream after cutting apart again that further adds up.The output stream of the assembly 1722 that " rolls f1/f2/v1/v2 on the whole " is made up of unique quadruple stream that adds up.In the example of form 1600, when the postcode field corresponding to f1 and state field during corresponding to f2, the combined effect of assembly 1720-1722 produces following six quadruple streams that add up: (Zip/State/90019/CA, 4), (Zip/State/90212/CA, 2), (Zip/State/33102/FL, 3), (Zip/State/00000/FL, 1), (Zip/State/77010/TX, 1), (Zip/State/00000/TX, 1).
When the state field corresponding to f1 and postcode field during corresponding to f2, the combined effect of assembly 1720-1722 produces following six quadruple streams that add up: (State/Zip/CA/90019,4), (State/Zip/CA/90212,2), (State/Zip/FL/33102,3), (State/Zip/FL/00000,1), (State/Zip/TX/77010,1), (State/Zip/TX/00000,1).
To test the functional dependence relationship between a pair of fields, the "roll f1/f2/v1 on the whole" assembly 1724 merges the accumulated quadruples that share the two fields f1, f2 and the first value v1. In generating an output element, this assembly 1724 examines all of the v2 values that accompany that v1 value and selects the most frequent v2 to be associated with the v1 value. The number of quadruples sharing the most frequent v2 is counted as "good", and the remaining quadruples are counted as "exceptions". If there is only one v2 value for a given v1, all of the accumulated quadruples having that value are good and there are no exceptional values. If there is a "tie" for the most frequent v2 value, the first value is selected. In the example of form 1600, when the postcode field corresponds to f1 and the state field corresponds to f2, assembly 1724 produces: (Zip/State/90019/CA, 4 good), (Zip/State/90212/CA, 2 good), (Zip/State/33102/FL, 3 good), (Zip/State/00000/FL, 1 good, 1 exception), (Zip/State/77010/TX, 1 good). When the state field corresponds to f1 and the postcode field corresponds to f2, assembly 1724 produces: (State/Zip/CA/90019, 4 good, 2 exceptions), (State/Zip/FL/33102, 3 good, 1 exception), (State/Zip/TX/77010, 1 good, 1 exception).
The "roll f1/f2 on the whole" assembly 1726 adds up the good and exception counts for each unique pair of fields f1, f2. In the example of form 1600, when the postcode field corresponds to f1 and the state field corresponds to f2, assembly 1726 produces: (Zip/State, 11 good, 1 exception). When the state field corresponds to f1 and the postcode field corresponds to f2, assembly 1726 produces: (State/Zip, 8 good, 4 exceptions).
A dependence finding assembly 1728 uses the accumulated statistics from the "roll f1/f2 on the whole" assembly 1726 (that is, the numbers of good and exception records) to determine whether a pair of fields has the relationship "f1 determines f2". If the ratio of exceptional values (number of exceptions / (number of good + number of exceptions)) is smaller than the threshold selected for determining how many exceptional values are allowed, the pair of fields has the relationship "f1 determines f2". In the example of form 1600, with a threshold of 10%, when the postcode field corresponds to f1 and the state field corresponds to f2, the ratio of exceptional values is 8.3%, so the value of the postcode field determines the value of the state field. When the state field corresponds to f1 and the postcode field corresponds to f2, the ratio of exceptional values is 33%, so the relationship between the postcode field and the state field is not a one-to-one mapping.
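The whole chain from quadruples to the dependence decision can be condensed into a short sketch (hypothetical Python, collapsing the partitioned roll-ups of assemblies 1720-1728 into in-memory counters; the 10% threshold is the example value used in the text):

```python
from collections import Counter, defaultdict

def depends(records, f1, f2, threshold=0.10):
    """records: list of dicts; returns True if f1 (approximately) determines f2."""
    quads = Counter((r[f1], r[f2]) for r in records)        # accumulated quadruples
    by_v1 = defaultdict(list)
    for (v1, v2), count in quads.items():
        by_v1[v1].append(count)
    good = sum(max(counts) for counts in by_v1.values())    # most frequent v2 is "good"
    total = sum(quads.values())
    return (total - good) / total <= threshold

rows = [{"Zip": "90019", "State": "CA"}] * 4 + \
       [{"Zip": "90212", "State": "CA"}] * 2 + \
       [{"Zip": "33102", "State": "FL"}] * 3 + \
       [{"Zip": "00000", "State": "FL"}, {"Zip": "77010", "State": "TX"},
        {"Zip": "00000", "State": "TX"}]
print(depends(rows, "Zip", "State"))    # True  (1 exception in 12, ~8.3%)
print(depends(rows, "State", "Zip"))    # False (4 exceptions in 12, ~33%)
```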
Replacedly, the value based on the mathematical properties of the value that adds up can be used to determine whether to be that field f1 determines field f2 (for example, the conditional entropy of the value of the field f2 that the value of field f1 is given, or the standard deviation of numerical value).
7.2 optimizations for selecting field pairs
Various optimizations can be adopted to speed up the functional dependence analysis, for example filtering field pairs at the field pair selection assembly 1706, or filtering records at the filter assembly 1712. Some optimizations are based on the recognition that some functional dependence relationships found by the chart 1700 described above are less meaningful to a user than others. For a given pair of fields, certain of these cases can be detected by the field pair selection assembly 1706 based on the statistics provided by the filing module 100 and filtered out, saving computational resources. For example, if all of the values of the first field f1 are unique (each value occurs only once, in a single record), then the values of field f1 determine the values of the second field f2 regardless of what values occur in field f2.
The chart 1700 can use the enquiry data obtained in the archiving process to calculate the probability that the first field f1 would determine the second field f2 under a random pairing of the values in these fields (for example, assuming a uniform probability distribution). If a random pairing would give a high probability of functional dependence (for example, >10%), the field pair is filtered out by the field pair selection assembly 1706. In the example of form 1600, when the surname field corresponds to f1 and the nationality field corresponds to f2, every random pairing of surname and nationality values causes all quadruples to be counted as good, except when the value Canada is randomly paired with one of the two name_g values (in row 7 or row 12). Even when that random pairing occurs (with probability 16.7%, i.e. 2 of the 12 pairings), the exception ratio is only 8.3%, which is below the threshold. Therefore, in this example, the field pair selection assembly 1706 filters out the pair (LastName, Citizenship).
Another optimization is based on the histograms of values that the filing module 100 calculates from the enquiry data. The field pair selection assembly 1706 can filter out pairs in which field f1 cannot possibly determine field f2. In the example of form 1600, the most frequent value of the state field occurs 6 times, and the most frequent value of the postcode field occurs only 4 times. Therefore, the values of the state field cannot determine the values of the postcode field: among the 6 records sharing the most frequent state value there would be at least 2 exceptions, giving an exception ratio of at least 16.7%. Therefore, in this example, the field pair selection assembly 1706 filters out the pair (State, Zip).
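Both pre-filters rely only on statistics that the filing run has already produced. A rough sketch follows (hypothetical Python; the uniqueness test and the lower bound on exceptions are conservative simplifications of the criteria described above):

```python
def all_unique(histogram, record_count):
    # If every value of f1 is unique, f1 trivially determines f2: skip the pair.
    return len(histogram) == record_count

def cannot_determine(histogram_f1, histogram_f2, threshold=0.10):
    # If the most frequent value of f1 covers more records than the most frequent
    # value of f2 could match, the exception ratio has a lower bound: skip the pair.
    top_f1, top_f2 = max(histogram_f1.values()), max(histogram_f2.values())
    record_count = sum(histogram_f1.values())
    min_exceptions = top_f1 - top_f2
    return min_exceptions / record_count > threshold

state = {"CA": 6, "FL": 4, "TX": 2}
zipc  = {"90019": 4, "90212": 2, "33102": 3, "00000": 2, "77010": 1}
print(all_unique({"id_1": 1, "id_2": 1, "id_3": 1}, 3))   # True: skip this pair
print(cannot_determine(state, zipc))                       # True: skip (State, Zip)
```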
For the record of One's name is legion, chart 1700 can be handled the sampling of fraction record by elder generation before handling all records, and is right to get rid of some fields that probably do not have functional dependence, accelerates the dependent speed of trial function.Chart 1700 can use filter assemblies 1712 selection portion member records.Replacedly, chart 1700 can select the part field/value right by operating specification assembly 1716.
Can based on various standards to the record or field/value to the sampling.Chart 1700 can be sampled based on the statistical figure that filing module 100 provides.For example, chart 1700 can be based on the most frequent value trial function dependence of the first field f1 (" determinative (determiner) ").If the exceptional value number that obtains is higher than threshold value, then needn't handle its residual value of this determinative.Chart 1700 also can be based on the stochastic sampling trial function dependence of determinative value.If the quadruple stream of enough numbers is counted as between these sampled values, then supposition can be ignored the probability of finding the exceptional value of actual number between other value.Other standard for manual sampling also can use.
Another optional optimization is based on predetermined funtcional relationship between the known function library test field.This test can be carried out the value of record or quadruple stream.
7.3 functional dependence analysis across multiple sources
(for example, a plurality of database tables in the mode of) functional dependence property testing, file module 100 and produce " virtual tables (the virtual table) " that comprises from the field of described multiple source at a kind of multi-source that strides across.This virtual tables can for example be converged operation to the shared key field in these sources to these sources by utilization and produce.
Carry out in the example of functional dependence analysis in the use virtual tables, first data source is the database of motor vehicles register information (motor vehicles registration (MVR) database), and second data source is the database of the traffic ticket (traffic ticket (TC) database) that sends.The MVR database comprises field for example manufacturer, model, color, and comprises the license field that is designated as " major key " field.Each record of MVR database has exclusive license field value.The TC database comprises field for example name, date, position, violation record, vehicle manufacturers, motor vehicle model, motor vehicle color, and comprises the motor vehicle license field that is designated as " external key " field.Each value of motor vehicle license field have with the MVR database in the corresponding record of value of license field.The TC database can have the identical a plurality of records of motor vehicle license field value.
Filing module 100 is converged record from MVR database and TC database, to form virtual tables (for example, shown in earlier in respect of figures 11A converge assembly 1100 described).Each record of this virtual tables has each field from two databases, comprises single license field, its have with from the matching value of MVR license field and TC motor vehicle license field.Yet, record can have with from the motor vehicle color field value of TC database different, from the color field value of MVR database.For example, the MVR database can use " BLU " code to represent blueness, and the TC database uses " BU " code to represent blueness.In the case, if motor vehicle has identical color in these two databases, then color field and motor vehicle color field will have the funtcional relationship of " one to one ".Replacedly, if the time motor vehicle between registration and reception citation (citation) has been coated with different colors, for color field and motor vehicle color field, record can have different values.
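A minimal sketch of constructing such a virtual table follows (hypothetical Python; the field names follow the MVR/TC example, but the record structures and the handling of unmatched licenses are assumptions):

```python
def virtual_table(mvr_rows, tc_rows):
    # mvr_rows: records keyed by the 'license' major-key field;
    # tc_rows: records carrying a 'vehicle_license' external-key field.
    mvr_by_license = {row["license"]: row for row in mvr_rows}
    for tc in tc_rows:
        mvr = mvr_by_license.get(tc["vehicle_license"])
        if mvr is None:
            continue                    # unmatched external key: no virtual record
        merged = dict(mvr)
        merged.update(tc)               # one record with fields from both sources
        merged.pop("vehicle_license")   # keep a single license field
        yield merged

mvr = [{"license": "ABC123", "color": "BLU"}]
tc  = [{"vehicle_license": "ABC123", "vehicle_color": "BU", "violation": "speeding"}]
print(list(virtual_table(mvr, tc)))
# [{'license': 'ABC123', 'color': 'BLU', 'vehicle_color': 'BU', 'violation': 'speeding'}]
```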
Comprise field owing to converge virtual tables, any relation in the various relations that filing module 100 can be found may exist between the field of these data centralizations from a plurality of each data set of data centralization.Aforesaid identical or similar dependency analysis can be to converging the field operation in the virtual tables.
The approach described above can be implemented using software running on a computer. For example, the software can form procedures in one or more computer programs that run on one or more programmed or programmable computer systems (which may be of various architectures, for example distributed, client/server, or grid), each computer system comprising at least one processor, at least one data storage system (for example, volatile and non-volatile memory and/or storage elements), at least one input device or port, and at least one output device or port. The software may, for example, form one or more modules of a larger program that provides other services related to the design and configuration of charts.
The software can be provided on a medium or device readable by a general or special purpose programmable computer, or delivered over a network (encoded in a propagated signal) to the computer where it is executed. All of the functions can be performed on a special purpose computer or using special purpose hardware such as a coprocessor. The software can be implemented in a distributed manner, in which different parts of the computation specified by the software are carried out by different computers. Each such computer program is preferably stored on or downloaded to a storage medium or device (for example, solid-state memory or media, or magnetic or optical media) readable by a general or special purpose programmable computer, for configuring and operating the computer when the storage medium or device is read by the computer system to perform the procedures described herein. The system can also be considered to be implemented as a computer-readable storage medium configured with a computer program, where the storage medium so configured causes a computer system to operate in a specific and predefined manner to perform the functions described herein.
The description that is to be understood that the front is intended to exemplary explanation but not limits the scope of the invention, and scope of the present invention is limited by the scope of appending claims.Other embodiment drop in the scope of appending claims.

Claims (27)

1. A data processing method, comprising the steps of:
accepting information characterizing values of a first field in records of a first data source and information characterizing values of a second field in records of a second data source, wherein the information characterizing the values of the first field comprises distribution characteristic information characterizing the values of said first field;
computing, based on the accepted information, quantities characterizing a relationship between the first field and the second field; and
presenting information relating the first field and the second field,
wherein the distribution characteristic information characterizing the values of the first field comprises a plurality of data records, each data record associating a different value in the first field of the first data source with a corresponding number of occurrences of said different value.
2. The method of claim 1, wherein the step of presenting information comprises the step of: presenting the information to a user.
3. The method of claim 1, wherein the first data source and the second data source are the same data source.
4. The method of claim 1, wherein at least one of the first data source and the second data source comprises a database table.
5. The method of claim 1, wherein the quantities characterizing the relationship comprise quantities characterizing a converging of the values of the first field and the values of the second field.
6. The method as claimed in claim 5, wherein the information characterizing the values of the second field comprises distribution characteristic information characterizing the values of said field.
7. method as claimed in claim 6, wherein, the step that calculating converges the parameter of characteristic in order to description comprises the steps: to handle the distribution character information in order to the value of the value of describing first field and second field, to calculate with multiple with the relevant parameter of value that shows classification.
8. method as claimed in claim 7, wherein, distribution character information in order to the value of the value of describing first field and second field comprises a plurality of data recording, the corresponding occurrence number of each data recording and different values and a described different value is associated, wherein handling step in order to the described distribution character information of description value comprises the steps: respectively to calculate in order to be described in the distribution character information of converging the value in the data source of first data source and second data source on first field and second field.
9. method as claimed in claim 7, wherein, the parameter relevant with the value of multiple same classification now comprises a plurality of data recording, each data recording and a kind of same quantity that classification is associated and comprises the exclusive value of first and second fields in described classification that shows.
10. method as claimed in claim 5, wherein, the step of calculating in order to the value of describing first field and the parameter that converges characteristic of the value of second field comprises the steps: to use respectively first field and the calculating of second field in order to be described in the distribution character information of converging the value in the data source of first data source and second data source.
11. The method of claim 5, wherein the step of computing the quantities characterizing the join characteristics of the values of the first field and the values of the second field includes computing quantities related to values in a plurality of co-occurrence categories.
12. The method of claim 11, wherein the plurality of co-occurrence categories includes values that occur at least once in one of the first field and the second field but do not occur in the other field.
13. The method of claim 11, wherein the plurality of co-occurrence categories includes values that occur exactly once in both the first field and the second field.
14. The method of claim 11, wherein the plurality of co-occurrence categories includes values that occur exactly once in one of the first field and the second field and more than once in the other field.
15. The method of claim 11, wherein the plurality of co-occurrence categories includes values that occur more than once in both the first field and the second field.
16. The method of claim 5, further comprising repeating, for a plurality of pairs of first and second fields, the steps of accepting the information characterizing the values and computing the quantities characterizing the join characteristics of the values.
17. The method of claim 16, wherein each pair of the plurality of pairs of fields has a unique identifier, and the identifier is included with the values of that pair of fields for computing the quantities characterizing the join characteristics of the values.
18. The method of claim 16, further comprising presenting information relating to one or more of the plurality of pairs of fields.
19. The method of claim 18, wherein the step of presenting information relating to one or more of the plurality of pairs of fields includes identifying a pair of fields as candidates for one of a plurality of types of field relationships.
20. The method of claim 19, wherein the plurality of types of field relationships includes a primary key relationship and a foreign key relationship.
21. The method of claim 19, wherein the plurality of types of field relationships includes a common domain relationship.
22. The method of claim 1, wherein the step of computing the quantities includes computing the quantities based on logical values, the logical values being converted from literal values of the first field and literal values of the second field.
23. The method of claim 1, wherein the step of computing the quantities includes computing the quantities in parallel, including partitioning the data records into a plurality of parts and processing the parts using separate components of a set of parallel components.
24. The method of claim 23, wherein the parts are based on the values of the first field and the values of the second field.
25. The method of claim 24, wherein data records having the same value are in the same part.
26. A data processing system, comprising:
a value processing module configured to accept information characterizing values of a first field in records of a first data source and information characterizing values of a second field in records of a second data source, wherein the information characterizing the values of the first field includes distribution information characterizing the values of said first field;
a relationship processing module configured to compute, based on the accepted information, quantities characterizing a relationship between the first field and the second field; and
an interface configured to present information relating the first field and the second field,
wherein the distribution information characterizing the values of the first field includes a plurality of data records, each data record associating a distinct value in the first field of the first data source with a corresponding number of occurrences of that distinct value.
27. A data processing system, comprising:
receiving means for accepting information characterizing values of a first field in records of a first data source and information characterizing values of a second field in records of a second data source, wherein the information characterizing the values of the first field includes distribution information characterizing the values of said first field;
computing means for computing, based on the accepted information, quantities characterizing a relationship between the first field and the second field; and
presenting means for presenting information relating the first field and the second field,
wherein the distribution information characterizing the values of the first field includes a plurality of data records, each data record associating a distinct value in the first field of the first data source with a corresponding number of occurrences of that distinct value.
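As a purely illustrative companion to claims 7-21 above (hypothetical names and heuristics, not the claimed implementation), the sketch below shows how per-field censuses of the kind built earlier might be combined into co-occurrence categories and a rough candidate classification of the field relationship:

```python
from typing import Dict, List, Tuple

Census = List[Tuple[str, int]]  # (distinct value, occurrence count), as in the earlier sketch

def cooccurrence_categories(census_a: Census, census_b: Census) -> Dict[str, int]:
    """Count distinct values falling into each co-occurrence category for a
    pair of fields, using only their per-field censuses."""
    counts_a, counts_b = dict(census_a), dict(census_b)
    categories = {
        "only_in_one": 0,    # occurs in one field but not the other
        "once_in_both": 0,   # occurs exactly once in each field
        "once_and_many": 0,  # exactly once in one field, more than once in the other
        "many_in_both": 0,   # more than once in both fields
    }
    for value in set(counts_a) | set(counts_b):
        a, b = counts_a.get(value, 0), counts_b.get(value, 0)
        if a == 0 or b == 0:
            categories["only_in_one"] += 1
        elif a == 1 and b == 1:
            categories["once_in_both"] += 1
        elif a > 1 and b > 1:
            categories["many_in_both"] += 1
        else:
            categories["once_and_many"] += 1
    return categories

def classify_relationship(census_a: Census, census_b: Census) -> str:
    """Rough heuristic labelling of a candidate field relationship; a real
    profiler would weigh the full co-occurrence statistics."""
    values_a = {v for v, _ in census_a}
    values_b = {v for v, _ in census_b}
    first_is_unique = all(n == 1 for _, n in census_a)
    if first_is_unique and values_b <= values_a:
        return "first field: candidate primary key; second field: candidate foreign key"
    if values_a & values_b:
        return "fields may share a common domain"
    return "no apparent relationship"

# Hypothetical usage with the censuses from the earlier sketch:
census_a = [("c03", 3), ("c02", 2), ("c01", 1)]
census_b = [("c02", 1), ("c03", 1), ("c04", 1)]
print(cooccurrence_categories(census_a, census_b))
print(classify_relationship(census_a, census_b))
```

Because every record carrying a given field value contributes to exactly one census entry, work of this kind also parallelizes naturally by partitioning the records on the field value, so that records with the same value land in the same partition, which is the kind of value-based partitioning referred to in claims 23 to 25.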
CN200810093033XA 2003-09-15 2004-09-15 Data processing method, software and data processing system Expired - Lifetime CN101271471B (en)

Applications Claiming Priority (6)

Application Number Priority Date Filing Date Title
US50290803P 2003-09-15 2003-09-15
US60/502,908 2003-09-15
US51303803P 2003-10-20 2003-10-20
US60/513,038 2003-10-20
US53295603P 2003-12-22 2003-12-22
US60/532,956 2003-12-22

Related Parent Applications (1)

Application Number Title Priority Date Filing Date
CN 200480026429 Division CN1853181A (en) 2003-09-15 2004-09-15 Data profiling

Publications (2)

Publication Number Publication Date
CN101271471A CN101271471A (en) 2008-09-24
CN101271471B true CN101271471B (en) 2011-08-17

Family

ID=37134186

Family Applications (3)

Application Number Title Priority Date Filing Date
CN2008100930344A Expired - Lifetime CN101271472B (en) 2003-09-15 2004-09-15 Data processing method and data processing system
CN 200480026429 Pending CN1853181A (en) 2003-09-15 2004-09-15 Data profiling
CN200810093033XA Expired - Lifetime CN101271471B (en) 2003-09-15 2004-09-15 Data processing method, software and data processing system

Family Applications Before (2)

Application Number Title Priority Date Filing Date
CN2008100930344A Expired - Lifetime CN101271472B (en) 2003-09-15 2004-09-15 Data processing method and data processing system
CN 200480026429 Pending CN1853181A (en) 2003-09-15 2004-09-15 Data profiling

Country Status (1)

Country Link
CN (3) CN101271472B (en)

Families Citing this family (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7769729B2 (en) * 2007-05-21 2010-08-03 Sap Ag Block compression of tables with repeated values
US8205113B2 (en) * 2009-07-14 2012-06-19 Ab Initio Technology Llc Fault tolerant batch processing
WO2011033773A1 (en) * 2009-09-17 2011-03-24 パナソニック株式会社 Information processing device, administration device, invalid-module detection system, invalid-module detection method, recording medium having an invalid-module detection program recorded thereon, administration method, recording medium having an administration program recorded thereon, and integrated circuit
KR102074026B1 (en) * 2012-10-22 2020-02-05 아브 이니티오 테크놀로지 엘엘시 Profiling data with location information
CA2887661C (en) * 2012-10-22 2022-08-02 Ab Initio Technology Llc Characterizing data sources in a data storage system
US9892026B2 (en) * 2013-02-01 2018-02-13 Ab Initio Technology Llc Data records selection
US11487732B2 (en) 2014-01-16 2022-11-01 Ab Initio Technology Llc Database key identification
WO2015134193A1 (en) * 2014-03-07 2015-09-11 Ab Initio Technology Llc Managing data profiling operations related to data type
US10409802B2 (en) * 2015-06-12 2019-09-10 Ab Initio Technology Llc Data quality analysis
CN107783950B (en) * 2017-04-11 2021-05-14 平安医疗健康管理股份有限公司 Method and device for processing drug instruction
US11068540B2 (en) 2018-01-25 2021-07-20 Ab Initio Technology Llc Techniques for integrating validation results in data profiling and related systems and methods
EP3770889B1 (en) * 2018-03-19 2023-08-23 Nippon Telegraph And Telephone Corporation Parameter setting apparatus, calculation apparatus, methods therefor, program, and recording medium
KR102686924B1 (en) * 2018-11-12 2024-07-19 삼성전자주식회사 Method of operating storage device, storage device performing the same and storage system including the same
CN110716895B (en) * 2019-09-17 2023-05-26 平安科技(深圳)有限公司 Target data archiving method, device, computer equipment and medium
US11556563B2 (en) * 2020-06-12 2023-01-17 Oracle International Corporation Data stream processing

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US5842200A (en) * 1995-03-31 1998-11-24 International Business Machines Corporation System and method for parallel mining of association rules in databases

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Erhard Rahm, Hong Hai Do. Data Cleaning: Problems and Current Approaches. University of Leipzig, Germany, 2000, pp. 1-7, Figs. 1-3, Tables 1-3. *
Heikki Mannila. Theoretical Frameworks for Data Mining. 2000, entire document. *
Tamraparni Dasu, Theodore Johnson, S. Muthukrishnan, Vladislav Shkapenyuk. Mining Database Structure; Or, How to Build a Data Quality Browser. 2002, entire document. *

Also Published As

Publication number Publication date
CN1853181A (en) 2006-10-25
CN101271471A (en) 2008-09-24
CN101271472A (en) 2008-09-24
CN101271472B (en) 2011-04-13

Similar Documents

Publication Publication Date Title
CN102982065A (en) Data processing method, data processing apparatus, and computer readable storage medium
CN101271471B (en) Data processing method, software and data processing system
EP2909747B1 (en) Characterizing data sources in a data storage system
CN103080932B (en) Process associated data set
AU2013200067B2 (en) Data profiling

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
ASS Succession or assignment of patent right

Owner name: AB INITIO SOFTWARE CORP.

Free format text: FORMER OWNER: ARCHITEKTEN CO., LTD.

Effective date: 20100412

Owner name: AB INITIO SOFTWARE CORP.

Free format text: FORMER OWNER: AB KAIYUAN SOFTWARE CO., LTD.

Effective date: 20100412

Owner name: ARCHITEKTEN CO., LTD.

Free format text: FORMER OWNER: AB INITIO SOFTWARE CORP.

Effective date: 20100412

C41 Transfer of patent application or patent right or utility model
TA01 Transfer of patent application right

Effective date of registration: 20100412

Address after: Massachusetts, USA

Applicant after: AB INITIO TECHNOLOGY LLC

Address before: Massachusetts, USA

Applicant before: Archie Taco Ltd.

Effective date of registration: 20100412

Address after: Massachusetts, USA

Applicant after: Archie Taco Ltd.

Address before: Massachusetts, USA

Applicant before: Qiyuan Software Co.,Ltd.

Effective date of registration: 20100412

Address after: Massachusetts, USA

Applicant after: Qiyuan Software Co.,Ltd.

Address before: Massachusetts, USA

Applicant before: Ab Initio Technology LLC

C14 Grant of patent or utility model
GR01 Patent grant
CX01 Expiry of patent term

Granted publication date: 20110817