CN108829750A - Data quality determination system and method - Google Patents
- Publication number
- CN108829750A (application CN201810511444.XA)
- Authority
- CN
- China
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/22—Matching criteria, e.g. proximity measures
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
Abstract
This application provides a data quality determination system and method. The system includes: a data acquisition module for obtaining data to be determined; an index value determination module for determining index values of the data to be determined under preset data quality indexes; and a quality determination module for determining a quality determination result for the data to be determined based on the index values under the preset data quality indexes. When determining data quality, the system can determine the quality of business data more objectively and accurately. Because no human participation is needed in the determination process, the possibility of business data being leaked by people is reduced and the security of the business data during evaluation is increased.
Description
Technical field
This application relates to the technical field of data assessment, and in particular to a data quality determination system and method.
Background
With the rapid development of digital information, data has an increasingly strong influence on enterprises, and more and more enterprises need to "let the data speak". Intangible assets account for a growing share of an enterprise's assets. Besides intellectual property such as patents, software copyrights and trademarks, the importance of business data as an intangible asset should not be underestimated. The value of business data sometimes directly determines the value of the enterprise.
The value of business data is usually assessed on the basis of the business data itself, and the quality of the business data largely influences the assessment result. Therefore, before the value of business data is assessed, its quality usually needs to be determined. The prior art provides business data assessment services for determining the quality of business data. These services are mainly provided by asset assessment institutions. To have business data assessed, the party requesting the determination contacts an asset assessment institution and the two sides negotiate the assessment terms face to face. Once the terms are settled, the requesting party provides the business data to the institution, and the institution's asset assessment experts assess the business data according to a certain assessment procedure. With this approach, the assessment is heavily influenced by subjective human factors, so the result is not sufficiently objective or accurate.
Summary of the invention
In view of this, the purpose of the embodiments of the present application is to provide a data quality determination system and method that can determine the quality of business data more objectively and accurately. No human participation is needed while the quality of the business data is being determined, which reduces the possibility of the business data being leaked by people and increases the security of the business data during evaluation.
In a first aspect, an embodiment of the present application provides a data quality determination system, including:
a data acquisition module for obtaining data to be determined;
an index value determination module for determining index values of the data to be determined under preset data quality indexes;
a quality determination module for determining a quality determination result for the data to be determined based on the index values under the preset data quality indexes.
With reference to the first aspect, an embodiment of the present application provides a first possible implementation of the first aspect, wherein the data quality indexes include one or more of: a data consistency index, a data integrity index, a data timeliness index, a data redundancy index, a data scarcity index and a data volume index.
With reference to the first aspect, an embodiment of the present application provides a second possible implementation of the first aspect, wherein, for the case where the data quality indexes include the data consistency index, the data to be determined include data content and description information corresponding to the data to be determined;
the index value determination module is specifically configured to determine the degree of consistency between the data content of the data to be determined and the corresponding description information, and to determine the index value of the data to be determined under the data consistency index based on that degree of consistency. The higher the degree of consistency, the higher the index value of the data to be determined under the data consistency index.
With reference to the first aspect, an embodiment of the present application provides a third possible implementation of the first aspect, wherein:
the index value determination module is specifically configured to determine the degree of consistency between one or more of the following items of data content and the corresponding description information, where a higher degree of consistency for any item indicates a higher index value of the data consistency index of the data to be determined:
the data volume contained in the data to be determined and the data volume stated in its description information;
the size of the data to be determined and the size stated in its description information;
the data format of the data to be determined and the data format stated in its description information.
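The three comparisons above can be sketched as follows. This is only a minimal illustration, not the scoring defined by the application: the function names, the dictionary keys (`data_volume`, `size_bytes`, `format`), the relative-difference score for numeric fields and the plain average across fields are all assumptions introduced here.

```python
def field_consistency(actual, described):
    """Score in [0, 1]; 1.0 means the actual value matches the description.

    Assumed scoring: exact (case-insensitive) match for text fields,
    relative difference for non-negative numeric fields.
    """
    if isinstance(actual, str) or isinstance(described, str):
        return 1.0 if str(actual).lower() == str(described).lower() else 0.0
    if actual == described:
        return 1.0
    larger = max(abs(actual), abs(described))
    return max(0.0, 1.0 - abs(actual - described) / larger)


def consistency_index(actual, described):
    """Average the per-field scores over the fields present in both inputs."""
    keys = [k for k in described if k in actual]
    if not keys:
        return 0.0
    return sum(field_consistency(actual[k], described[k]) for k in keys) / len(keys)


# Hypothetical example: 850 data elements found, 900 stated in the description.
actual = {"data_volume": 850, "size_bytes": 2048, "format": "csv"}
described = {"data_volume": 900, "size_bytes": 2048, "format": "CSV"}
score = consistency_index(actual, described)
```

A higher `score` corresponds to a higher degree of consistency, in line with the direction stated above.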
With reference to the first aspect, an embodiment of the present application provides a fourth possible implementation of the first aspect, wherein:
for the case where the data quality indexes include the data integrity index, the index value determination module is specifically configured to determine the proportion of null values in the data entries of the data to be determined, and to determine the index value of the data to be determined under the data integrity index based on that proportion. The lower the proportion of null values, the higher the data integrity of the data to be determined;
for the case where the data quality indexes include the data timeliness index, the index value determination module is specifically configured to determine the time interval spanned between the generation start time and the generation end time of the data to be determined, and the time difference between the generation start time and the time at which the data to be determined are provided, and to determine the index value of the data to be determined under the data timeliness index based on the time interval and the time difference. The larger the time interval, the higher the index value of the data timeliness index; the smaller the time difference, the higher the index value of the data timeliness index;
for the case where the data quality indexes include the data redundancy index, the index value determination module is specifically configured to determine the proportion of repeated entries among the data entries of the data to be determined, and to determine the index value of the data to be determined under the data redundancy index based on that proportion. The lower the proportion of repeated entries, the lower the data redundancy of the data to be determined;
for the case where the data quality indexes include the data volume index, the index value determination module is specifically configured to determine the data volume contained in the data to be determined, and to determine the index value of the data to be determined under the data volume index based on that data volume. The larger the data volume, the higher the index value of the data volume index.
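As a rough illustration of the integrity and redundancy cases above, the null-value proportion and the repeated-entry proportion can be computed as below. The sketch assumes that each data entry is a tuple of data elements, that `None` and the empty string count as null values, and that the null proportion is taken over all data elements; none of these details is fixed by the text, and the function names are hypothetical.

```python
def null_ratio(entries):
    """Proportion of empty data elements across all entries (assumed definition)."""
    cells = [value for entry in entries for value in entry]
    if not cells:
        return 0.0
    empty = sum(1 for value in cells if value is None or value == "")
    return empty / len(cells)


def duplicate_ratio(entries):
    """Proportion of entries that repeat an identical earlier entry."""
    if not entries:
        return 0.0
    unique = set(tuple(entry) for entry in entries)
    return (len(entries) - len(unique)) / len(entries)


# Hypothetical commodity entries: (name, producer, production date).
entries = [
    ("apple juice", "Acme", "2018-01-01"),
    ("apple juice", "Acme", "2018-01-01"),  # exact repeat of the first entry
    ("milk", None, "2018-02-01"),           # one null element
    ("bread", "", "2018-02-03"),            # one empty element
]
completeness = 1.0 - null_ratio(entries)  # lower null ratio, higher integrity
redundancy = duplicate_ratio(entries)     # lower ratio, lower redundancy
```

How the redundancy proportion is mapped onto the final index value (for example, by inverting it so that less redundancy scores higher) is left open here, as it is in the text.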
With reference to the first aspect, an embodiment of the present application provides a fifth possible implementation of the first aspect, wherein the system further includes a similar data determination module;
the data acquisition module is further configured to crawl multiple data sets from a preset platform;
the similar data determination module is configured to parse the data to be determined and each of the multiple data sets to determine their lexical features; to perform text similarity matching between the lexical features of the data to be determined and the lexical features of each data set; and to determine each data set whose text similarity reaches a preset similarity threshold as similar data of the data to be determined.
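One way to picture the similarity matching described above is the sketch below. The application does not say which lexical features or which similarity measure are used, so the word-set features, the Jaccard similarity, the threshold of 0.5 and all names here are assumptions chosen for illustration. Chinese text would additionally need a word segmentation step, which is omitted.

```python
import re


def lexical_features(text):
    """Assumed feature extraction: the set of lowercase word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets, used here as the text similarity."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_similar(target_text, crawled, threshold=0.5):
    """Return names of crawled data sets whose similarity reaches the threshold."""
    target = lexical_features(target_text)
    return [
        name
        for name, text in crawled.items()
        if jaccard(target, lexical_features(text)) >= threshold
    ]


# Hypothetical crawled data sets, represented by their textual content.
crawled = {
    "set_a": "commodity name price origin shelf life",
    "set_b": "weather temperature humidity",
}
similar = find_similar("commodity name price origin", crawled)
```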
With reference to the first aspect, an embodiment of the present application provides a sixth possible implementation of the first aspect, wherein, for the case where the data quality indexes include the data scarcity index,
the index value determination module is specifically configured to determine the number of occurrences, on the preset platform, of the data to be determined and of similar data that resemble the data to be determined, and to determine the index value of the data to be determined under the data scarcity index based on that number of occurrences. The fewer the occurrences, the higher the scarcity of the data to be determined.
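The text only fixes the direction of the relationship (fewer occurrences, higher scarcity). A minimal sketch of one mapping that has this property is shown below; the formula 1 / (1 + n) and the function name are assumptions, not something the application specifies.

```python
def scarcity_index(occurrences):
    """Map an occurrence count to (0, 1]; fewer occurrences give a higher value.

    The 1 / (1 + n) form is an assumed, illustrative choice.
    """
    if occurrences < 0:
        raise ValueError("occurrences must be non-negative")
    return 1.0 / (1.0 + occurrences)


never_seen = scarcity_index(0)    # data and similar data not found on the platform
seen_thrice = scarcity_index(3)   # found three times
```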
With reference to the first aspect, an embodiment of the present application provides a seventh possible implementation of the first aspect, wherein the quality determination module is specifically configured to compute a weighted sum of the index values of the data to be determined under the preset data quality indexes, using the weight coefficients of the preset data quality indexes, to obtain the quality determination result for the data to be determined.
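The weighted summation can be written in a few lines. In this sketch the index names, the index values and the weight coefficients are made-up numbers for illustration only; the application does not state concrete weights.

```python
def quality_score(index_values, weights):
    """Weighted sum of index values; every index must have a weight coefficient."""
    missing = set(index_values) - set(weights)
    if missing:
        raise KeyError(f"no weight coefficient for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in index_values.items())


# Hypothetical index values in [0, 1] and weight coefficients summing to 1.
index_values = {"consistency": 0.9, "completeness": 0.8, "scarcity": 0.5}
weights = {"consistency": 0.5, "completeness": 0.3, "scarcity": 0.2}
score = quality_score(index_values, weights)
```

Keeping the weights normalised to sum to 1 keeps the result on the same scale as the individual index values, which makes results comparable across data sets.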
With reference to the first aspect, an embodiment of the present application provides an eighth possible implementation of the first aspect, wherein the system further includes a data quality determination model training module;
the data quality determination model training module is configured to construct a data quality determination model with the preset data quality indexes as independent variables and the data quality grade as the dependent variable;
the data acquisition module is further configured to obtain training data;
the index value determination module is further configured to determine the index values of the training data under the preset data quality indexes and the quality determination results of the training data;
the data quality determination model training module is further configured to substitute the index values determined for the training data, as independent variable values, and the quality determination results of the corresponding training data, as dependent variable values, into the data quality determination model, so as to train the data quality determination model;
the quality determination module is specifically configured to substitute the index values of the data to be determined under the preset data quality indexes, as independent variables, into the trained data quality determination model to obtain the quality determination result for the data to be determined.
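The training and prediction steps above can be illustrated with a very small model. The application does not name a model type, so the linear model fitted by batch gradient descent below is purely an assumption, as are the function names, the learning rate, the epoch count and the toy training data.

```python
def train_linear_model(samples, targets, lr=0.1, epochs=5000):
    """Fit target ~ bias + weights . sample by batch gradient descent on MSE."""
    n_features = len(samples[0])
    weights = [0.0] * n_features
    bias = 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w = [0.0] * n_features
        grad_b = 0.0
        for x, y in zip(samples, targets):
            error = bias + sum(w * xi for w, xi in zip(weights, x)) - y
            for j, xi in enumerate(x):
                grad_w[j] += 2 * error * xi / n
            grad_b += 2 * error / n
        weights = [w - lr * g for w, g in zip(weights, grad_w)]
        bias -= lr * grad_b
    return weights, bias


def predict(model, x):
    """Apply the trained model to the index values of data to be determined."""
    weights, bias = model
    return bias + sum(w * xi for w, xi in zip(weights, x))


# Toy training data: two index values per sample (independent variables) and
# a quality result (dependent variable) generated as 0.6 * x1 + 0.4 * x2.
samples = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (1.0, 1.0), (0.5, 0.5)]
targets = [0.0, 0.6, 0.4, 1.0, 0.5]
model = train_linear_model(samples, targets)
result = predict(model, (0.8, 0.2))
```

If the quality grade is a discrete level rather than a continuous score, a classifier would be the more natural choice; the overall flow of training on index values and then substituting the index values of new data stays the same.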
In a second aspect, an embodiment of the present application provides a data quality determination method, including:
obtaining data to be determined;
determining index values of the data to be determined under preset data quality indexes;
determining a quality determination result for the data to be determined based on the index values under the preset data quality indexes.
With reference to the second aspect, an embodiment of the present application provides a first possible implementation of the second aspect, wherein the data quality indexes include one or more of: a data consistency index, a data integrity index, a data timeliness index, a data redundancy index, a data scarcity index and a data volume index.
With reference to the second aspect, an embodiment of the present application provides a second possible implementation of the second aspect, wherein, for the case where the data quality indexes include the data consistency index, the data to be determined include data content and description information corresponding to the data to be determined;
determining the index values of the data to be determined under the preset data quality indexes specifically includes: determining the degree of consistency between the data content of the data to be determined and the corresponding description information; and determining the index value of the data to be determined under the data consistency index based on that degree of consistency. The higher the degree of consistency, the higher the index value of the data to be determined under the data consistency index.
With reference to the second aspect, an embodiment of the present application provides a third possible implementation of the second aspect, wherein determining the degree of consistency between the data content of the data to be determined and the corresponding description information specifically includes: determining the degree of consistency between one or more of the following items of data content and the corresponding description information, where a higher degree of consistency for any item indicates a higher index value of the data consistency index of the data to be determined:
the data volume contained in the data to be determined and the data volume stated in its description information;
the size of the data to be determined and the size stated in its description information;
the data format of the data to be determined and the data format stated in its description information.
With reference to the second aspect, an embodiment of the present application provides a fourth possible implementation of the second aspect, wherein, for the case where the data quality indexes include the data integrity index,
determining the index values of the data to be determined under the preset data quality indexes specifically includes:
determining the proportion of null values in the data entries of the data to be determined; and determining the index value of the data to be determined under the data integrity index based on that proportion. The lower the proportion of null values, the higher the data integrity of the data to be determined.
With reference to the second aspect, an embodiment of the present application provides a fifth possible implementation of the second aspect, wherein, for the case where the data quality indexes include the data timeliness index,
determining the index values of the data to be determined under the preset data quality indexes specifically includes: determining the time interval spanned between the generation start time and the generation end time of the data to be determined, and the time difference between the generation start time and the time at which the data to be determined are provided; and determining the index value of the data to be determined under the data timeliness index based on the time interval and the time difference;
wherein the larger the time interval, the higher the index value of the data timeliness index of the data to be determined; and the smaller the time difference, the higher the index value of the data timeliness index of the data to be determined.
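The two timeliness quantities can be sketched as below. The text states only the direction in which each quantity moves the index, so the normalisation caps (`max_span_days`, `max_age_days`), the equal weighting of the two parts and the function name are assumptions made for this illustration.

```python
from datetime import date


def timeliness_index(start, end, provided, max_span_days=3650, max_age_days=3650):
    """Combine generation span and age into a value in [0, 1] (assumed formula).

    span: days between generation start and generation end (larger is better).
    age:  days between generation start and the provision time (smaller is better).
    """
    span_days = (end - start).days
    age_days = (provided - start).days
    span_score = min(max(span_days, 0) / max_span_days, 1.0)
    age_score = 1.0 - min(max(age_days, 0) / max_age_days, 1.0)
    return 0.5 * span_score + 0.5 * age_score


# Hypothetical data generated during 2017 and provided on 2018-01-01.
full_year = timeliness_index(date(2017, 1, 1), date(2017, 12, 31), date(2018, 1, 1))
half_year = timeliness_index(date(2017, 1, 1), date(2017, 6, 30), date(2018, 1, 1))
provided_late = timeliness_index(date(2017, 1, 1), date(2017, 12, 31), date(2019, 1, 1))
```

With these inputs, `full_year` exceeds both `half_year` (shorter span) and `provided_late` (larger time difference), matching the two directions stated above.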
With reference to the second aspect, an embodiment of the present application provides a sixth possible implementation of the second aspect, wherein, for the case where the data quality indexes include the data redundancy index,
determining the index values of the data to be determined under the preset data quality indexes specifically includes: determining the proportion of repeated entries among the data entries of the data to be determined; and determining the index value of the data to be determined under the data redundancy index based on that proportion. The lower the proportion of repeated entries, the lower the data redundancy of the data to be determined.
With reference to the second aspect, an embodiment of the present application provides a seventh possible implementation of the second aspect, wherein the method further includes: crawling multiple data sets from a preset platform; parsing the data to be determined and each of the multiple data sets to determine their lexical features; performing text similarity matching between the lexical features of the data to be determined and the lexical features of each data set; and determining each data set whose text similarity reaches a preset similarity threshold as similar data of the data to be determined.
With reference to the second aspect, an embodiment of the present application provides an eighth possible implementation of the second aspect, wherein, for the case where the data quality indexes include the data scarcity index,
determining the index values of the data to be determined under the preset data quality indexes specifically includes: determining the number of occurrences, on the preset platform, of the data to be determined and of similar data that resemble the data to be determined; and determining the index value of the data to be determined under the data scarcity index based on that number of occurrences. The fewer the occurrences, the higher the scarcity of the data to be determined.
With reference to the second aspect, an embodiment of the present application provides a ninth possible implementation of the second aspect, wherein, for the case where the data quality indexes include the data volume index,
determining the index values of the data to be determined under the preset data quality indexes specifically includes: determining the data volume contained in the data to be determined; and determining the index value of the data to be determined under the data volume index based on that data volume. The larger the data volume, the higher the index value of the data volume index of the data to be determined.
With reference to the second aspect, an embodiment of the present application provides a tenth possible implementation of the second aspect, wherein determining the quality determination result for the data to be determined based on the index values under the preset data quality indexes specifically includes: computing a weighted sum of the index values of the data to be determined under the preset data quality indexes, using the weight coefficients of the preset data quality indexes, to obtain the quality determination result for the data to be determined.
With reference to the second aspect, an embodiment of the present application provides an eleventh possible implementation of the second aspect, wherein the method further includes:
constructing a data quality determination model with the preset data quality indexes as independent variables and the data quality grade as the dependent variable;
obtaining training data;
determining the index values of the training data under the preset data quality indexes and the quality determination results of the training data;
substituting the index values determined for the training data, as independent variable values, and the quality determination results of the corresponding training data, as dependent variable values, into the data quality determination model, so as to train the data quality determination model;
substituting the index values of the data to be determined under the preset data quality indexes, as independent variables, into the trained data quality determination model to obtain the quality determination result for the data to be determined.
In the data quality determination system provided by the embodiments of the present application, after the data acquisition module obtains the data to be determined, the index value determination module determines the index values of the data to be determined under the preset data quality indexes, and the quality determination module then determines the quality determination result for the data to be determined based on those index values. The whole process requires no human intervention, so the quality of business data can be determined more objectively and accurately.
To make the above objects, features and advantages of the present application clearer and easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly introduced below. It should be understood that the following drawings show only some embodiments of the present application and should therefore not be regarded as limiting its scope. Those of ordinary skill in the art can obtain other related drawings from these drawings without creative effort.
Fig. 1 shows a structural schematic diagram of a data quality determination system provided by an embodiment of the present application;
Fig. 2 shows a structural schematic diagram of another data quality determination system provided by an embodiment of the present application;
Fig. 3 shows a flow chart of a data quality determination method provided by an embodiment of the present application;
Fig. 4 shows a structural schematic diagram of a computer device provided by an embodiment of the present application.
Detailed description
To make the purposes, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some of the embodiments of the present application, not all of them. The components of the embodiments of the present application, as generally described and illustrated in the drawings herein, can be arranged and designed in a variety of different configurations. Therefore, the following detailed description of the embodiments provided in the drawings is not intended to limit the claimed scope of the present application, but merely represents selected embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without creative effort fall within the protection scope of the present application.
Unlike the prior art, when determining the quality of business data, the embodiments of the present application obtain the business data (the data to be determined in the embodiments of the present application) through the data acquisition module, determine the index values of the business data under at least one preset data quality index through the index value determination module, and then determine the quality result of the business data through the quality determination module based on the index values under the preset data quality indexes. The whole process requires no human intervention, so the quality of the business data can be determined more objectively and accurately. Precisely because no human intervention is needed, the business data are less likely to come into contact with people, which reduces the possibility of the business data being leaked by people and increases the security of the business data during evaluation.
To facilitate understanding of this embodiment, the data quality determination system disclosed in the embodiments of the present application is first described in detail. It should be noted that, besides business data, the data quality determination system can also determine the quality of other data, such as test data and home data. The technical solution of the present application is described below using business data as the data to be determined.
As shown in Fig. 1, the data quality determination system provided by the embodiments of the present application includes: a data acquisition module 10, an index value determination module 20 and a quality determination module 30.
The data acquisition module 10 is configured to obtain data to be determined.
In a specific implementation, the data to be determined are the business data whose quality is to be determined. The data to be determined can be obtained in several ways, for example by crawling business data from preset platforms, where the preset platforms include enterprise websites, statistics bureaus, data trading platforms, button platform and the like, or by receiving the data to be determined sent from a data source.
The index value determination module 20 is configured to determine the index values of the data to be determined under the preset data quality indexes.
In a specific implementation, the data quality indexes include one or more of: a data consistency index, a data integrity index, a data timeliness index, a data redundancy index, a data scarcity index and a data volume index.
Preferably, the object processed in each run of the embodiments of the present application can be one category of data. If that category of data includes multiple data sets, the object of the data quality determination in the embodiments of the present application can be a single data set.
In the embodiments of the present application, the index value determination module 20 is specifically configured to determine the index value of the data to be determined under each data quality index by methods 1 to 6 below. Specifically:
1. For the case where the data quality indexes include the data consistency index, the data to be determined include data content and description information corresponding to the data to be determined;
the index value determination module 20 is specifically configured to determine the degree of consistency between the data content of the data to be determined and the corresponding description information, and to determine the index value of the data consistency index of the data to be determined based on that degree of consistency. The higher the degree of consistency, the higher the index value of the data consistency index of the data to be determined.
It, can be by determining between following one or more data contents and corresponding description information when specific implementation
Consistency journey, to characterize the data content of data to be determined and the degree of consistency of description information, wherein in any item data
Hold and the index of the data consistency index of the higher characterization data to be determined of the degree of consistency between corresponding description information
It is worth higher.
First: the data volume included in the data to be determined, compared with the data volume described in the description information of the data to be determined.
Here, the data content of the data to be determined is carried in a file of a certain format. The data to be determined may consist of multiple data entries, and each data entry consists of multiple data elements, where a data element is the most basic data unit constituting the data to be determined.
For example, when the data to be determined are commodity price data, the data elements included in one data entry are, in order: commodity name, commodity manufacturer, place of origin, production time, shelf life, net content, nutritional ingredients, product batch number, and on-sale date.
That is, the data to be determined are preferably in the form of data entries. Where the data requiring evaluation are text data, a key-information extraction operation may be performed on the text data before the evaluation to generate data in data-entry form. For example, if the data requiring evaluation are a buyer's guide text, data entries may be extracted before the evaluation according to keywords such as commodity name, commodity manufacturer, place of origin, and production time, and the extracted data entries are used as the data to be determined.
The data volume included in the data to be determined is the volume of valid data elements that the data to be determined include. In the example above, a complete data entry should include nine data elements, so the data volume corresponding to each data entry is 9; if the data to be determined include 100 data entries, the data volume they should have is 900, that is, the data volume described in the description information is 900. In practice, however, some data elements may be empty; an empty data element has no actual content, so the actual data volume of the data to be determined is smaller than the described data volume.
Taking the number of data entries as an example, the number of data entries included in the data to be determined may also be compared with the number of data entries described in the description information of the data to be determined.
Therefore, the degree of consistency between the data content of the data to be determined and the description information can be characterized by determining the degree of consistency between the data volume included in the data to be determined and the data volume described in the description information of the data to be determined.
Second: the size of the data to be determined, compared with the size described in the description information of the data to be determined.
Here, the size of the data to be determined can be regarded as the file size of the file carrying the data to be determined. For example, if data elements of some data entries are missing (that is, the data elements are empty), the actual size of the file carrying the data to be determined will also differ from the size described in the description information.
Therefore, the degree of consistency between the data content of the data to be determined and the description information can be characterized by determining the degree of consistency between the size of the data to be determined and the size described in the description information of the data to be determined.
Third: the data format of the data to be determined, compared with the data format described in the description information of the data to be determined.
Here, the data format of the data to be determined may be the file format of the file carrying the data to be determined. The file format of the file carrying the data to be determined may differ from the file format described in the description information.
Therefore, the degree of consistency between the data content of the data to be determined and the description information can be characterized by determining the degree of consistency between the data format of the data to be determined and the data format described in the description information of the data to be determined.
It should be noted that the data content included in the data to be determined may be, but is not limited to, the data volume, the size, the data format, and so on. The description information corresponding to the data to be determined is generally used to describe the data to be determined, and likewise includes content such as the data volume, the size, and the data format.
Specifically, the embodiment of the present application provides a specific method for determining the index value of the data to be determined under the data consistency index based on the degrees of consistency of the data volume, the data size, and the data format:
Calculate a first absolute difference between the data volume included in the data to be determined and the data volume described in the description information of the data to be determined; calculate a second absolute difference between the size of the data to be determined and the size described in the description information of the data to be determined; if the data format of the data to be determined is consistent with the data format described in the description information of the data to be determined, determine that the consistency degree P of the data to be determined is a first preset value, and otherwise that it is a second preset value; then calculate the index value of the data consistency index from the first absolute difference, the second absolute difference, and the consistency degree.
Here, the first preset value may be set to 0 and the second preset value to 1. Optionally, the first preset value and the second preset value may be set to other values, provided that the second preset value is greater than the first preset value.
Specifically, the first absolute difference $L_1$ satisfies: $L_1 = |L_a - L_m|$;
where $L_a$ is the data volume included in the data to be determined, and $L_m$ is the data volume described in the description information of the data to be determined.
The second absolute difference $L_2$ satisfies: $L_2 = |S_a - S_m|$;
where $S_a$ is the size of the data to be determined, and $S_m$ is the size described in the description information of the data to be determined.
The index value ω1 of the data to be determined under the data consistency index then satisfies:
α is a calculation coefficient that may take a value between 0 and 1, for example 1/3, 1/4, or 1/2.
The value range of ω1 is generally [0, 1]; the larger the value of ω1, the higher the degree of consistency of the data to be determined.
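As a rough illustration of the inputs to the data consistency index, the following Python sketch computes the first absolute difference L1, the second absolute difference L2, and the consistency degree P. The `Description` class, the function name, and the case-insensitive format comparison are assumptions made for this example; the sketch stops at these three components and does not attempt to compute ω1 itself.

```python
from dataclasses import dataclass


@dataclass
class Description:
    """Values stated in the description information (hypothetical container)."""
    data_volume: int   # Lm: described data volume
    size: int          # Sm: described size, e.g. in bytes
    data_format: str   # described file format, e.g. "csv"


def consistency_components(actual_volume, actual_size, actual_format, desc,
                           first_preset=0, second_preset=1):
    """Return (L1, L2, P) for the data to be determined."""
    l1 = abs(actual_volume - desc.data_volume)   # L1 = |La - Lm|
    l2 = abs(actual_size - desc.size)            # L2 = |Sa - Sm|
    # Assumption: formats are compared case-insensitively.
    same_format = actual_format.lower() == desc.data_format.lower()
    # P is the first preset value when the formats agree, else the second.
    p = first_preset if same_format else second_preset
    return l1, l2, p


desc = Description(data_volume=900, size=43008, data_format="CSV")
print(consistency_components(870, 40960, "csv", desc))  # (30, 2048, 0)
```

With the default preset values, P is 0 when the formats agree and 1 when they differ, as in the text above.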
2. For the case where the data quality indexes include the data integrity index,
The index value determining module 20 is specifically configured to determine the null-value ratio in the data entries included in the data to be determined, and to determine the index value of the data to be determined under the data integrity index based on the null-value ratio; the lower the null-value ratio, the higher the data integrity of the data to be determined.
In a specific implementation, data elements of the data to be determined may be missing. In that case, the more data elements are missing, the poorer the integrity of the data to be determined.
When determining the null-value ratio in the data entries included in the data to be determined, the index value determining module 20: checks in turn whether each data element in each data entry of the data to be determined is empty; assigns an integrity value to each data element according to the check result, where the integrity value is 0 if the data element is empty and 1 if it is not empty; and takes the ratio of the sum of the integrity values of all data elements to the number of data elements as the null-value ratio. Under this assignment the resulting ratio equals the share of non-empty data elements.
The null-value ratio so obtained may be used directly as the index value of the data to be determined under the data integrity index, for example:
The index value ω2 of the data to be determined under the data integrity index is calculated using the following formula:
$$\omega_2 = \frac{1}{N}\sum_{i=1}^{N} a_i$$
where $a_i$ is the integrity value of the i-th data element in the data to be determined, and $N$ is the total number of data elements in the data to be determined.
The value range of ω2 is [0, 1]; the larger the value of ω2, the better the data integrity of the data to be determined.
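The integrity assignment described above (0 for an empty data element, 1 otherwise, averaged over all data elements) can be sketched in Python as follows. The function names and the rule that `None` and blank strings count as empty are assumptions for illustration only.

```python
def is_empty(element):
    """Assumption: None and blank strings count as empty data elements."""
    return element is None or (isinstance(element, str) and element.strip() == "")


def integrity_index(entries):
    """Mean integrity value over all data elements (0 = empty, 1 = not empty)."""
    values = [0 if is_empty(e) else 1 for entry in entries for e in entry]
    if not values:
        raise ValueError("the data to be determined contain no data elements")
    return sum(values) / len(values)


entries = [
    ["apple", "ACME", "2018-03-15"],
    ["pear", "", "2018-04-17"],   # one empty data element
]
print(integrity_index(entries))  # 5 of the 6 data elements are non-empty
```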
The index value of the data to be determined under the data integrity index may also be determined from the null-value ratio on the basis of the positive correlation between that index value and the null-value ratio.
In addition, when determining the null-value ratio in the data entries included in the data to be determined, the index value determining module 20 may also use the following steps: count the total number of empty data elements across all data entries in the data to be determined; and take the ratio of that total to the total number of all data elements in the data to be determined as the null-value ratio.
Further, the null-value ratio may also be the proportion of invalid data entries in the total number of data entries of the data to be determined. A data entry containing a preset number of empty data elements may be determined to be an invalid data entry. In this case ω2 is the quotient of the number of invalid data entries and the total number of data entries.
3. For the case where the data quality indexes include the data timeliness index,
The index value determining module 20 is specifically configured to determine the time interval spanned from the start generation time to the end generation time of the data to be determined, and the time difference between the start generation time of the data to be determined and the time at which the data to be determined are provided; and to determine the index value of the data to be determined under the data timeliness index based on the time interval and the time difference;
where the larger the span of the time interval, the higher the index value of the data timeliness index of the data to be determined; and the smaller the time difference, the higher the index value of the data timeliness index of the data to be determined.
In a specific implementation, the time interval spanned by the generation time of the data to be determined is the interval from the start generation time to the end generation time of the data to be determined. The unit of the time interval is set according to the length of the time interval.
In particular, when the start generation time and the end generation time of the data to be determined cannot be determined, they may be determined from the description information of the data to be determined; the generation time may be the start time or the end time of the time interval spanned by the data to be determined, or an average time, and is preferably the start time.
For example, if the length of the time interval is 1 day, the unit of the time interval is set to minutes; if the length of the time interval is 2 months, the unit is set to days; if the length of the time interval is 3 years, the unit may be set to weeks. It should be noted that these settings of the unit of the time interval are only examples provided by the embodiment of the present application and are not to be regarded as limiting the technical solution of the present application.
The data provision time refers to the time at which the data acquisition module 10 of the data quality determination system obtains the data to be determined. It should be noted that, because the data to be determined have a certain data volume, the data acquisition module cannot in practice obtain all of the data to be determined at a single point in time; the data provision time may therefore be the start time at which the data acquisition module 10 obtains the data to be determined, or the end time at which it does so. In addition, after obtaining the data to be determined, the data acquisition module 10 transfers them to the index value determining module 20 for processing within a very short time, so the difference between the start or end time at which the data acquisition module 10 obtains the data to be determined and the current time at which the index value determining module 20 determines their index value under the timeliness index is very small; the current time at which the index value determining module 20 determines the index value of the data to be determined under the timeliness index may therefore also be used as the data provision time.
For example, the data to be determined include 100 data entries. Among these 100 data entries, the generation time of the earliest generated data entry (that is, the start generation time of the data to be determined) is March 15, 2018, and the generation time of the latest generated data entry (that is, the end generation time of the data to be determined) is April 17, 2018; the time interval spanned by the generation time of the data to be determined is then 33 days. If the provision time of the data to be determined is May 10, 2018, the time difference between the generation time and the provision time of the data to be determined is the time difference between March 15, 2018 and May 10, 2018, that is, 56 days.
When determining the index value of the data to be determined under the data timeliness index based on the time interval and the time difference, the ratio of the time interval to the time difference may be used as the index value of the data to be determined under the timeliness index.
For example, the index value ω3 of the data to be determined under the timeliness index may be calculated using the following formula:
$$\omega_3 = \frac{T_f - T_s}{T_n - T_s}$$
where $T_f$ is the end generation time of the data to be determined; if the end generation time cannot be determined from the data to be determined, the end time given in the corresponding description information is used. $T_s$ is the start generation time of the data to be determined; if the start generation time cannot be determined from the data to be determined, the start generation time given in the corresponding description information is used. $T_n$ is the provision time of the data to be determined.
The value range of ω3 is [0, 1]; the larger the value of ω3, the stronger the timeliness of the data to be determined.
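The ratio of the time interval to the time difference can be sketched in Python with the dates from the example above. The function name and the input checks are assumptions added for illustration.

```python
from datetime import date


def timeliness_index(start, end, provided):
    """Ratio of the generation interval (end - start) to the time difference
    between the start generation time and the provision time."""
    if not (start <= end <= provided):
        raise ValueError("expected start <= end <= provided")
    if provided == start:
        raise ValueError("provision time equals start generation time")
    # Dividing one timedelta by another yields a float.
    return (end - start) / (provided - start)


start = date(2018, 3, 15)     # start generation time
end = date(2018, 4, 17)       # end generation time
provided = date(2018, 5, 10)  # data provision time
print((end - start).days, (provided - start).days)  # 33 56
print(timeliness_index(start, end, provided))       # 33 / 56
```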
4. For the case where the data quality indexes include the data redundancy index,
The index value determining module 20 is specifically configured to determine the proportion of repeated entries among the data entries included in the data to be determined, and to determine the index value of the data redundancy index of the data to be determined based on that proportion; the lower the proportion of repeated entries, the lower the data redundancy of the data to be determined.
In a specific implementation, the data redundancy is calculated as the ratio at which repeated data occur. Within a data set, repeated data constitute data redundancy; the higher the information redundancy, the lower the data quality.
Specifically, the index value determining module 20 may determine the index value of the data to be determined under the data redundancy index in either of the following ways:
First: according to the data elements included in each data entry, count the number of times data entries in the data to be determined are repeated; from the number of repetitions over all data entries in the data to be determined and the total number of data entries, determine the ratio at which data entries are repeated, that is, the proportion of repeated entries among the data entries included in the data to be determined. Based on the ratio at which data entries are repeated, calculate the quality determination value of the data to be determined under the information redundancy index, where the quality determination value of the data to be determined under the information redundancy index is negatively correlated with the ratio at which data entries are repeated.
Here, when counting the number of times data entries in the data to be determined are repeated, each data entry is checked in turn, in the order in which the data entries are arranged, for whether it has appeared before. Two data entries are regarded as identical if the contents of their data elements are completely the same, or if the number of data elements whose contents are the same or similar reaches a preset threshold. Suppose that, when the i-th data entry is checked, that data entry appears for the first time; the count then remains unchanged. If the i-th data entry does not appear for the first time, the count is incremented by 1.
Second: the index value determining module 20 checks in turn whether each data entry in the data to be determined is a repeated data entry, and assigns a repeatability value to each data entry according to the check result. If a data entry is a repeated data entry, that is, another data entry identical to the current data entry has already been checked before the current data entry is checked, the corresponding repeatability value is 1; if a data entry is not a repeated data entry, that is, no other data entry identical to the current data entry has been checked before the current data entry is checked, the corresponding repeatability value is 0. The ratio of the sum of the repeatability values of all data entries to the number of data entries is taken as the proportion of repeated entries among the data entries included in the data to be determined.
For example, the index value ω4 of the data to be determined under the data redundancy index may be calculated using the following formula:
$$\omega_4 = 1 - \frac{1}{N}\sum_{i=1}^{N} b_i$$
where $b_i$ is the repeatability value of the i-th data entry in the data to be determined, and $N$ is the total number of data entries in the data to be determined.
The value range of ω4 is [0, 1]; the larger the value of ω4, the smaller the data redundancy of the data to be determined and the higher the corresponding data value.
For example, the data to be determined include 5 data entries, a, b, c, d, and e, where a, b, and e are identical and c and d are identical. Each data entry from a to e is checked in turn for whether it is a repeated data entry: a appears for the first time, so its repeatability value is 0; b is identical to a and is a repeated data entry, so its repeatability value is 1; c appears for the first time, so its repeatability value is 0; d is identical to c and is a repeated data entry, so its repeatability value is 1; e is identical to a and is a repeated data entry, so its repeatability value is 1. The proportion of repeated entries among the data entries included in the data to be determined is therefore 0.6. According to the formula above, the index value ω4 of the data to be determined under the data redundancy index is 0.4.
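The second way of counting can be sketched in Python using the five-entry example above. The sketch covers only exact duplicates (entries whose data elements are completely the same); the variant based on a threshold of similar data elements is not modeled, and the function names are assumptions.

```python
def repeatability_values(entries):
    """Assign 1 to an entry identical to one already checked, else 0."""
    seen = set()
    values = []
    for entry in entries:
        key = tuple(entry)          # exact match on all data elements
        values.append(1 if key in seen else 0)
        seen.add(key)
    return values


def redundancy_index(entries):
    """One minus the proportion of repeated entries."""
    values = repeatability_values(entries)
    if not values:
        raise ValueError("the data to be determined contain no data entries")
    return 1 - sum(values) / len(values)


x = ("apple", "ACME")
y = ("pear", "Orchard")
entries = [x, x, y, y, x]             # a, b, c, d, e: a = b = e and c = d
print(repeatability_values(entries))  # [0, 1, 0, 1, 1]
print(redundancy_index(entries))      # proportion 0.6 repeated, index 0.4
```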
5. For the case where the data quality indexes include the data scarcity index,
The index value determining module 20 is specifically configured to determine the number of occurrences, on the preset platforms, of the data to be determined and of similar data that resemble the data to be determined; and to determine the index value of the data to be determined under the data scarcity index based on the number of occurrences; the fewer the occurrences, the higher the scarcity of the data to be determined.
In a specific implementation, scarcity refers to the degree of scarcity of the data, calculated from the extent to which data of the same class are offered according to the preset platforms and the data information obtained. The more data of the same class there are, the lower the scarcity; the fewer there are, the higher the scarcity. The higher the scarcity of the data to be determined, the higher their quality and value correspondingly.
In a specific implementation, in order to obtain similar data that resemble the data to be determined, another embodiment of the present application further includes a similar data determining module 40.
The data acquisition module 10 in the embodiment of the present application is further configured to crawl multiple data sets from the preset platforms.
Here, the preset platform may be a data trading platform or another data platform. Taking a data trading platform as an example, each data transaction corresponds to at least one class of traded business data. When data sets are crawled from the preset platform, one data set is crawled for each data transaction; each data set includes multiple data entries.
When crawling data, the data sets may be crawled by means of crawlers, crawling tools, and similar technologies; the present application does not limit this.
The similar data determining module 40 is configured to parse the data to be determined and the multiple data sets respectively, and determine the lexical features of the data to be determined and of each data set; to match the lexical features of the data to be determined against the lexical features of each data set for text similarity; and to determine the data sets whose text similarity reaches a preset similarity threshold to be similar data of the data to be determined.
In a specific implementation, the similar data determining module 40 may determine the lexical features of the data to be determined and of the data sets by the following steps:
Perform word segmentation on each obtained data set to obtain first lexical data; sort the first lexical data by their frequency of occurrence in the corresponding data set from high to low, and select the top preset number of first lexical data; for each piece of data in the data set, determine the lexical features of that piece of data according to the frequency with which each selected first lexical datum occurs in the data set.
Perform word segmentation on the data to be determined to obtain second lexical data; sort the second lexical data by their frequency of occurrence in the data to be determined from high to low, and select the top preset number of second lexical data; for each piece of data in the data to be determined, determine the lexical features of that piece of data according to the frequency with which each selected second lexical datum occurs in the data to be determined.
For each lexical feature in each data set, calculate the text similarity between that lexical feature and the lexical features in the data to be determined. A data set whose text similarity is greater than or equal to the preset similarity threshold is determined to be similar data of the data to be determined.
Further, where multiple feature words are determined for the data to be determined and for a data set, each feature word of the data to be determined may be compared for text similarity with each feature word of the data set; a feature word whose similarity reaches a first preset similarity threshold is determined to be a similar word of that feature word, and when the number of similar words reaches a second preset threshold, the data to be determined and the data set are determined to be similar data.
Further, where the data to be determined and the data sets carry annotated industry labels, the industry labels may be used directly as the feature words of the corresponding data, and the feature words are compared for similarity directly.
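The lexical-feature matching described above can be sketched in Python as follows. The text does not name a similarity measure or a segmenter, so the cosine similarity over term frequencies, the regular-expression tokenizer standing in for word segmentation, the threshold of 0.5, and all function names are assumptions chosen only to make the steps concrete.

```python
import math
import re
from collections import Counter


def lexical_feature(text, top_k=10):
    """Term frequencies of the top_k most frequent words (stand-in for
    word segmentation followed by frequency-based selection)."""
    tokens = re.findall(r"\w+", text.lower())
    return dict(Counter(tokens).most_common(top_k))


def cosine_similarity(f1, f2):
    """Cosine similarity between two term-frequency features (an assumption)."""
    dot = sum(v * f2.get(k, 0) for k, v in f1.items())
    n1 = math.sqrt(sum(v * v for v in f1.values()))
    n2 = math.sqrt(sum(v * v for v in f2.values()))
    if n1 == 0 or n2 == 0:
        return 0.0
    return dot / (n1 * n2)


def similar_datasets(target_text, datasets, threshold=0.5, top_k=10):
    """Names of crawled data sets whose similarity reaches the threshold."""
    target = lexical_feature(target_text, top_k)
    return [name for name, text in datasets.items()
            if cosine_similarity(target, lexical_feature(text, top_k)) >= threshold]


crawled = {
    "fruit": "apple price origin apple price",
    "weather": "rain wind temperature",
}
print(similar_datasets("apple price apple origin", crawled))  # ['fruit']
```

The length of the returned list would then serve as the count of similar data sets used by the scarcity index.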
After the similar data of the data to be determined have been determined among the multiple crawled data sets, the index value of the data to be determined under the data scarcity index may be determined according to the number of times the similar data occur on the preset platforms.
Specifically, the quality determination value of the data to be determined under the scarcity index may be calculated by the following steps:
Determine the number of data sets that are similar data of the data to be determined;
Based on the total number of crawled data sets and the number of data sets that are similar data of the data to be determined, calculate the index value of the data to be determined under the scarcity index;
For example, the index value ω5 of the data to be determined under the data scarcity index is calculated using the following formula:
where x is the number of occurrences on the preset platforms of the data to be determined and their similar data, and y is the total number of crawled data sets.
The value range of ω5 is [0, 1]. The closer ω5 is to 1, the more often similar data of the data to be determined occur and the lower the scarcity of the data to be determined; the closer ω5 is to 0, the less often similar data of the data to be determined occur and the higher the scarcity of the data to be determined.
In addition, the index value ω5 of the data to be determined under the data scarcity index may also be calculated using the following formula:
$$\omega_5 = 1 - e^{-x/y}$$
where x is the number of occurrences on the preset platforms of the data to be determined and their similar data, and y is the total number of preset platforms.
The value range of ω5 is [0, 1]. When ω5 is close to 1, every preset platform has similar data and the scarcity of the data to be determined is lower; when ω5 equals 0, no preset platform has similar data and the scarcity of the data to be determined is higher.
6. For the case where the data quality indexes include the data volume index,
The index value determining module 20 is specifically configured to determine the data volume included in the data to be determined, and to determine the index value of the data to be determined under the data volume index based on the data volume; the larger the data volume, the higher the index value of the data volume index of the data to be determined.
In a specific implementation, the index value of the data to be determined under the data volume index may be determined by either of the following two methods:
First, the calculated ratio of the data volume of the data to be determined to the total data volume of the data of the preset platforms may be used as the index value of the data volume index, or the data volume of the data to be determined may be used directly as the index value of the data volume index, as determined by the actual situation.
For example, when the ratio of the data volume of the data to be determined to the total data volume of the data of the preset platforms is used as the index value of the data volume index, the index value ω6 of the data volume index may be calculated using the following formula:
$$\omega_6 = \frac{N}{P}$$
where $N$ is the data volume of the data to be determined, and $P$ is the total data volume of the data of the preset platforms.
The value range of ω6 is [0, 1]; a value of ω6 close to 0 indicates that the data volume of the data to be determined is small, and a larger value indicates a larger data volume.
Second, the index value of the data to be determined under the data volume index is calculated based on: the committed data volume carried in the description information of the data to be determined and the data volume described in the description information; the data volume included in the data to be determined; and the volume of similar data, resembling the data to be determined, obtained by performing data acquisition on the data of the preset platforms.
Here, the committed data volume refers to the data volume that the user, when providing the data to be determined, expects to provide.
The data volume included in the data to be determined is the volume of valid data elements that the data to be determined include.
The volume of similar data resembling the data to be determined, obtained by performing data acquisition on the data of the preset platforms, is obtained by a process similar to that used to obtain similar data when determining the index value of the data to be determined under the data scarcity index. The specific process is:
The data acquisition module 10 crawls multiple data sets from the preset platforms; the similar data determining module 40 parses the data to be determined and the multiple data sets respectively, and determines the lexical features of the data to be determined and of each data set; matches the lexical features of the data to be determined against the lexical features of each data set for text similarity; determines the data sets whose text similarity reaches the preset similarity threshold to be similar data of the data to be determined; and performs a data volume determination on the determined similar data to obtain the volume of similar data resembling the data to be determined.
Specifically, the index value of the data to be determined under the data volume index may be calculated using the following formula:
where m denotes the data volume included in the data to be determined; N1 denotes the volume of similar data, resembling the data to be determined, obtained by performing data acquisition on the data of the preset platforms; N2 denotes the data volume described in the description information; and N3 denotes the committed data volume.
Quality determination module 30, configured to determine the quality determination result of the data to be determined based on the index values under the preset data quality indexes.
In a specific implementation, the quality determination module 30 may determine the quality determination result of the data to be determined using either of the following schemes:
First: according to preset weight coefficients of the data quality indexes, perform a weighted summation of the index values of the data to be determined under the preset data quality indexes to obtain the quality determination result of the data to be determined.
Here, the weighted summation of the index values of the data to be determined under the preset data quality indexes is in effect a process of determining the quality determination result of the data to be determined according to the differing degrees to which the different data quality indexes influence the quality of the data to be determined.
The weight coefficients corresponding to different types of data to be determined may be the same or different.
For example, where the data quality indexes include the data consistency index, the data integrity index, the data timeliness index, the data redundancy index, the data scarcity index, and the data volume index, the quality determination result M of the data to be determined may be calculated according to the following formula:
$$M = a_1\omega_1 + a_2\omega_2 + a_3\omega_3 + a_4\omega_4 + a_5\omega_5 + a_6\omega_6$$
Wherein, a1To a6It is followed successively by data consistency index, data integrity index, data age index, data redundancy
Spend index, data scarcity index and the corresponding weight coefficient of data figureofmerit.ω1To ω6It is consistent to be followed successively by data
Property index, data integrity index, data age index, data redudancy index, data scarcity index and data volume
Index respectively corresponds index value.
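The weighted summation above can be illustrated with a minimal Python sketch. The index names, the weight coefficients, and the index values below are hypothetical and chosen only for illustration; the application does not specify concrete weights.

```python
def weighted_quality_score(index_values, weights):
    """Return M, the sum over all indexes of weight coefficient times index value."""
    if set(index_values) != set(weights):
        raise ValueError("index values and weights must cover the same indexes")
    return sum(weights[name] * index_values[name] for name in index_values)


# Hypothetical weight coefficients a1..a6; they sum to 1 so that M stays in [0, 1]
# whenever every index value lies in [0, 1].
weights = {
    "consistency": 0.20,
    "integrity": 0.20,
    "age": 0.15,
    "redundancy": 0.15,
    "scarcity": 0.15,
    "volume": 0.15,
}

# Hypothetical index values w1..w6 for one data set to be determined.
index_values = {
    "consistency": 1.0,
    "integrity": 0.9,
    "age": 0.8,
    "redundancy": 0.7,
    "scarcity": 0.5,
    "volume": 0.6,
}

score = weighted_quality_score(index_values, weights)
```

Keying both dictionaries by index name, rather than by position, makes it harder to pair a weight with the wrong index when different data types use different weight sets.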
Second: the data quality determination system further includes a data quality determination model training module 50. The data quality determination model training module 50 is configured to construct a data quality determination model with the preset data quality indexes as independent variables and the data quality grade as the dependent variable;
The data acquisition module 10 is further configured to obtain training data;
The index value determining module 20 is further configured to determine the index values of the training data under the preset data quality indexes and the quality determination results of the training data;
The data quality determination model training module 50 is further configured to take the index values determined for the training data as independent variable values and the quality determination results of the corresponding training data as dependent variable values, substitute them into the data quality determination model, and train the data quality determination model;
The quality determination module 30 is specifically configured to substitute the index values of the data to be determined under the preset data quality indexes, as independent variables, into the trained data quality determination model to obtain the quality determination result of the data to be determined.
In a specific implementation, when constructing the data quality determination model, the data quality determination model training module needs to determine the explanatory variables and the explained variable in the model, and to determine the relationship between the explanatory variables and the explained variable through the model training process described below. Several factors influence the quality of the data to be determined; these factors are taken as the corresponding data quality indexes, the data quality indexes are used as the independent variables, and the quality determination result of the data to be determined is used as the dependent variable to construct the model.
In the embodiments of the present application, the constructed model includes but is not limited to: an autoregressive model, a moving average model, an autoregressive moving average model, an autoregressive integrated moving average model, and a GARCH model.
After the data quality determination model is constructed, the model is trained. The training data used in training may be obtained by the data acquisition module. It should be noted that the acquired training data may be data whose quality has already been determined, or data whose quality has not yet been determined.
For data whose quality has already been determined, the index value determining module does not need to perform quality determination on it again. For data whose quality has not been determined, the index value determining module needs to perform quality determination on it, obtaining the index values of the data under the preset data quality indexes and the quality determination result of the training data.
Here, the quality determination result of the training data may be a grade of the data quality or a score of the data quality, and can be set according to actual needs.
Specifically, when the quality determination result of the data to be determined is determined by the quality determination method provided by the embodiments of the present application, if the quality determination result is a score, the result of the weighted summation of the index values of the data to be determined under the preset data quality indexes may be used directly as the score, in which case the value range of the score is [0,1]. Alternatively, the weighted summation result may be processed further and the processed result used as the score; for example, the value obtained by multiplying the weighted summation result by 100 may be used as the score of the quality of the data to be determined. If the quality determination result is a grade, the result of the weighted summation of the index values of the data to be determined under the preset data quality indexes may be converted into the corresponding grade based on a preset conversion rule.
For example, five grades A, B, C, D, and E are set, where the quality of the data to be determined corresponding to grade A is lower than that corresponding to grade E. The smaller the result of the weighted summation of the index values of the data to be determined under the preset data quality indexes, the lower the grade. The value ranges of the weighted summation result corresponding to grades A to E are, in order: [0, 0.2), [0.2, 0.4), [0.4, 0.6), [0.6, 0.8), and [0.8, 1]. Based on these value ranges, the weighted summation result can be converted into the grade of the corresponding data to be determined.
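The conversion from a weighted summation result to a grade or a percentage score can be sketched as follows. The function names are hypothetical; the five value ranges are those given in the example above. Explicit boundary values are used with a binary search instead of dividing by 0.2, because floating-point division would place a score of exactly 0.6 in the wrong band.

```python
from bisect import bisect_right

# Lower bounds of grades B, C, D, E; grade A covers [0, 0.2).
GRADE_BOUNDARIES = [0.2, 0.4, 0.6, 0.8]
GRADES = "ABCDE"


def score_to_grade(score):
    """Map a weighted summation result in [0, 1] to a grade from A (lowest) to E."""
    if not 0.0 <= score <= 1.0:
        raise ValueError("score must lie in [0, 1]")
    # bisect_right counts how many lower bounds are <= score, which is the grade position.
    return GRADES[bisect_right(GRADE_BOUNDARIES, score)]


def score_to_percent(score):
    """Alternative post-processing: scale the weighted summation result by 100."""
    return score * 100
```

For instance, a weighted summation result of 0.77 falls in [0.6, 0.8) and is therefore converted to grade D.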
The process of training the model using the training data is the process of continuously adjusting the parameters of the model according to the index values of the training data and the corresponding quality determination results, so that when the model calculates a quality determination result from the index values of each item of training data under the preset data quality indexes, the calculated quality determination result is consistent with the quality determination result corresponding to that training data.
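The idea of repeatedly adjusting model parameters until the computed results agree with the known quality determination results can be sketched with a deliberately simple stand-in. The sketch below fits a plain linear model by gradient descent; this is not one of the model families listed in the application, and the function names, the learning rate, the epoch count, and the two-index training data are all assumptions made for illustration.

```python
def predict(row, weights, bias):
    """Quality result computed by the model from one row of index values."""
    return bias + sum(w * v for w, v in zip(weights, row))


def train_linear_model(index_rows, quality_results, learning_rate=0.5, epochs=2000):
    """Adjust weights and bias so predictions approach the known quality results."""
    n_samples = len(index_rows)
    weights = [0.0] * len(index_rows[0])
    bias = 0.0
    for _ in range(epochs):
        grad_w = [0.0] * len(weights)
        grad_b = 0.0
        for row, target in zip(index_rows, quality_results):
            error = predict(row, weights, bias) - target
            for j, value in enumerate(row):
                grad_w[j] += error * value
            grad_b += error
        # Step against the mean gradient of the squared error.
        weights = [w - learning_rate * g / n_samples for w, g in zip(weights, grad_w)]
        bias -= learning_rate * grad_b / n_samples
    return weights, bias


# Hypothetical training data: two index values per item (independent variables)
# and an already determined quality score (dependent variable).
index_rows = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]]
quality_results = [0.0, 0.6, 0.4, 1.0, 0.5]

weights, bias = train_linear_model(index_rows, quality_results)
# Substitute the index values of new data into the trained model.
new_score = predict([0.8, 0.5], weights, bias)
```

In this toy data the scores were generated as 0.6 times the first index plus 0.4 times the second, so training recovers weights close to those values.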
With the data quality determination system provided by the embodiments of the present application, after the data acquisition module obtains the data to be determined, the index value determining module determines the index values of the data to be determined under the preset quality determination indexes, and the quality determination module then determines the quality determination result of the data to be determined based on the index values under the preset data quality indexes. The whole process requires no human intervention, so the quality of business data can be determined more objectively and accurately. Precisely because no human intervention is needed, the chance of people coming into contact with the business data is reduced, which lowers the likelihood that the business data is leaked by people and improves the security of the business data during evaluation.
Based on the same inventive concept, the embodiments of the present application further provide a data quality determination method corresponding to the data quality determination system. Since the principle by which the method in the embodiments of the present application solves the problem is similar to that of the above data quality determination system, the implementation of the method may refer to the implementation of the system, and repeated details are not described again.
Referring to Fig. 3, the data quality determination method provided by the embodiments of the present application includes:
S301: Obtain data to be determined;
S302: Determine the index values of the data to be determined under the preset quality determination indexes;
S303: Based on the index values under the preset data quality indexes, determine the quality determination result of the data to be determined.
After the data to be determined is obtained, the embodiments of the present application determine the index values of the data to be determined under the preset quality determination indexes, and then determine the quality determination result of the data to be determined based on the index values under the preset data quality indexes. The entire quality determination process requires no human intervention, so the quality of business data can be determined more objectively and accurately. Precisely because no human intervention is needed, the chance of people coming into contact with the business data is reduced, which lowers the likelihood that the business data is leaked by people and improves the security of the business data during evaluation.
Optionally, the data quality indexes include one or more of: the data consistency index, the data integrity index, the data age index, the data redundancy index, the data scarcity index, and the data volume index.
Optionally, where the quality determination indexes include the data consistency index, the data to be determined includes: data content and description information corresponding to the data to be determined;
Determining the index values of the data to be determined under the preset quality determination indexes specifically includes: determining the degree of consistency between the data content included in the data to be determined and the description information corresponding to the data to be determined; and determining the index value of the data to be determined under the data consistency index based on the degree of consistency, where a higher degree of consistency indicates a higher index value of the data to be determined under the data consistency index.
Optionally, determining the degree of consistency between the data content included in the data to be determined and the description information corresponding to the data to be determined specifically includes: determining the degree of consistency between one or more of the following items of data content and the corresponding description information, where a higher degree of consistency between any item of data content and the corresponding description information indicates a higher index value of the data consistency index of the data to be determined:
The data volume included in the data to be determined and the data volume described in the description information of the data to be determined;
The size of the data to be determined and the size described in the description information of the data to be determined;
The data format of the data to be determined and the data format described in the description information of the data to be determined.
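The comparison of the three items above against the description information can be sketched as follows. The dictionary keys and the scoring rule (the fraction of compared items that match exactly) are assumptions for illustration; the text here only requires that higher consistency yield a higher index value.

```python
# Hypothetical names for the three compared items: data volume, size, and format.
COMPARED_ITEMS = ("record_count", "size_bytes", "data_format")


def consistency_index(actual, described):
    """Fraction of compared items whose actual value equals the described value."""
    compared = [key for key in COMPARED_ITEMS if key in actual and key in described]
    if not compared:
        raise ValueError("no comparable items between data and description")
    matches = sum(1 for key in compared if actual[key] == described[key])
    return matches / len(compared)


# The data set really holds 1000 records in a 2048-byte CSV file,
# while its description claims a 4096-byte file.
actual = {"record_count": 1000, "size_bytes": 2048, "data_format": "csv"}
described = {"record_count": 1000, "size_bytes": 4096, "data_format": "csv"}

consistency = consistency_index(actual, described)
```

Exact matching is the strictest possible rule; a tolerance on the numeric items would be an equally plausible design.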
Optionally, where the data quality indexes include the data integrity index,
determining the index values of the data to be determined under the preset quality determination indexes specifically includes:
determining the proportion of null values in the data entries included in the data to be determined; and determining the index value of the data to be determined under the data integrity index based on the null-value proportion, where a lower null-value proportion indicates higher data integrity of the data to be determined.
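A minimal sketch of the null-value proportion follows. Treating both `None` and the empty string as null, and taking the index value as one minus the null proportion, are assumptions; the text only fixes the direction of the relationship.

```python
def integrity_index(entries):
    """Return 1 minus the share of null fields across all data entries."""
    total_fields = 0
    null_fields = 0
    for entry in entries:
        for value in entry.values():
            total_fields += 1
            # Assumption: None and the empty string both count as null values.
            if value is None or value == "":
                null_fields += 1
    if total_fields == 0:
        raise ValueError("no fields to inspect")
    return 1.0 - null_fields / total_fields


entries = [
    {"id": 1, "name": "a"},
    {"id": 2, "name": None},
    {"id": 3, "name": ""},
    {"id": 4, "name": "d"},
]
integrity = integrity_index(entries)
```

Here two of the eight fields are null, so the null-value proportion is 0.25.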
Optionally, where the quality determination indexes include the data age index,
determining the index values of the data to be determined under the preset quality determination indexes specifically includes: determining the time interval spanned between the generation start time and the generation end time of the data to be determined, and the time difference between the generation start time of the data to be determined and the time at which the data to be determined is provided; and determining the index value of the data to be determined under the data age index based on the time interval and the time difference;
where a larger time interval span indicates a higher index value of the data age index of the data to be determined, and a smaller time difference indicates a higher index value of the data age index of the data to be determined.
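One way to combine the two quantities so that the stated directions hold is sketched below. The combination rule (an equal-weight average of two capped terms) and the cap values are purely assumptions; this passage does not give the formula, only that a larger span and a smaller time difference both raise the index value.

```python
from datetime import datetime

SECONDS_PER_DAY = 86400


def age_index(start, end, provided, max_span_days=365.0, max_lag_days=365.0):
    """Combine generation span and provision lag into one value in [0, 1]."""
    span_days = (end - start).total_seconds() / SECONDS_PER_DAY
    lag_days = (provided - start).total_seconds() / SECONDS_PER_DAY
    if span_days < 0 or lag_days < 0:
        raise ValueError("end and provided times must not precede the start time")
    span_term = min(span_days / max_span_days, 1.0)        # larger span, higher term
    lag_term = 1.0 - min(lag_days / max_lag_days, 1.0)     # smaller lag, higher term
    return 0.5 * span_term + 0.5 * lag_term


start = datetime(2018, 1, 1)
end = datetime(2018, 4, 11)        # generation spans 100 days
provided = datetime(2018, 5, 1)    # provided 120 days after generation started

base = age_index(start, end, provided)
wider_span = age_index(start, datetime(2018, 4, 21), provided)
smaller_lag = age_index(start, end, datetime(2018, 4, 21))
```

With these inputs both `wider_span` and `smaller_lag` exceed `base`, which matches the two monotonic relationships described above.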
Optionally, where the data quality indexes include the data redundancy index,
determining the index values of the data to be determined under the preset quality determination indexes specifically includes: determining the proportion of duplicate entries in the data entries included in the data to be determined; and determining the index value of the data to be determined under the data redundancy index based on the proportion of duplicate entries, where a lower proportion of duplicate entries indicates lower data redundancy of the data to be determined.
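The proportion of duplicate entries can be sketched as follows. Counting every repeat after the first occurrence as a duplicate is an assumption, as is returning the raw proportion; whether this value is inverted before entering a weighted sum is not stated here.

```python
def duplicate_ratio(entries):
    """Share of entries that repeat an entry already seen earlier in the data."""
    if not entries:
        raise ValueError("no entries to inspect")
    seen = set()
    duplicates = 0
    for entry in entries:
        # Sort the fields so that key order does not affect equality.
        key = tuple(sorted(entry.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(entries)


entries = [
    {"id": 1, "value": "a"},
    {"id": 1, "value": "a"},
    {"id": 2, "value": "b"},
    {"value": "a", "id": 1},
]
redundancy = duplicate_ratio(entries)
```

The sketch requires hashable field values; nested structures would need to be serialized before comparison.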
Optionally, the method further includes: crawling multiple data sets from the preset platform; parsing the data to be determined and the multiple data sets respectively, and determining the lexical features of the data to be determined and of each data set; performing text similarity matching between the lexical features of the data to be determined and the lexical features of each data set; and determining each data set whose text similarity reaches a preset similarity threshold as similar data of the data to be determined.
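The matching step can be sketched with a simple stand-in. The application does not say which lexical features or which similarity measure are used; the sketch below assumes lowercase word sets as lexical features and Jaccard similarity as the text similarity, and the function names and threshold value are hypothetical.

```python
import re


def lexical_features(text):
    """Assumed lexical features: the set of lowercase word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard_similarity(features_a, features_b):
    """Size of the intersection divided by the size of the union."""
    if not features_a and not features_b:
        return 0.0
    return len(features_a & features_b) / len(features_a | features_b)


def find_similar_datasets(target_text, datasets, threshold=0.5):
    """Return the names of crawled data sets whose similarity reaches the threshold."""
    target = lexical_features(target_text)
    return [
        name
        for name, text in datasets.items()
        if jaccard_similarity(target, lexical_features(text)) >= threshold
    ]


crawled = {
    "set_a": "retail sales records 2018 daily",
    "set_b": "weather station temperature logs",
}
similar = find_similar_datasets("Daily retail sales records 2018", crawled)
```

Only `set_a` shares vocabulary with the target text, so it alone is treated as similar data.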
Optionally, where the quality determination indexes include the data scarcity index,
determining the index values of the data to be determined under the preset quality determination indexes specifically includes: determining the number of occurrences, on the preset platform, of the data to be determined and of similar data that is similar to the data to be determined; and determining the index value of the data to be determined under the data scarcity index based on the number of occurrences, where a smaller number of occurrences indicates higher scarcity of the data to be determined.
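A minimal sketch of a scarcity value that decreases as the number of occurrences grows is given below. The reciprocal form is an assumption chosen only because it satisfies the stated direction and stays within (0, 1]; the passage does not give the formula.

```python
def scarcity_index(occurrence_count):
    """Map an occurrence count on the preset platform to a value in (0, 1]."""
    if occurrence_count < 0:
        raise ValueError("occurrence count must be non-negative")
    # Assumed rule: fewer occurrences give a value closer to 1.
    return 1.0 / (1.0 + occurrence_count)
```

Under this rule, data that never appears on the platform scores 1.0, and data with three occurrences scores 0.25.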
Optionally, where the quality determination indexes include the data volume index,
determining the index values of the data to be determined under the preset quality determination indexes specifically includes: determining the data volume included in the data to be determined; and determining the index value of the data to be determined under the data volume index based on the data volume, where a larger data volume indicates a higher index value of the data volume index of the data to be determined.
Optionally, determining the quality determination result of the data to be determined based on the index values under the preset data quality indexes specifically includes: according to the weight coefficients of the preset data quality indexes, performing weighted summation on the index values of the data to be determined under the preset data quality indexes, to obtain the quality determination result of the data to be determined.
Optionally, the method further includes: constructing a data quality determination model with the preset data quality indexes as independent variables and the data quality grade as the dependent variable;
obtaining training data;
determining the index values of the training data under the preset data quality indexes and the quality determination results of the training data;
taking the index values determined for the training data as independent variable values and the quality determination results of the corresponding training data as dependent variable values, substituting them into the data quality determination model, and training the data quality determination model;
substituting the index values of the data to be determined under the preset data quality indexes, as independent variables, into the trained data quality determination model to obtain the quality determination result of the data to be determined.
As shown in Fig. 4, an embodiment of the present application provides a computer device including a processor 41, a memory 42, and a bus 43. The memory 42 stores execution instructions; when the device runs, the processor 41 communicates with the memory 42 through the bus 43, and the processor 41 executes the execution instructions so that the device performs the above data quality determination method.
Corresponding to the data quality determination method in Fig. 3, an embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when run by a processor, performs the steps of the above data quality determination method.
The computer program product of the data quality determination system and method provided by the embodiments of the present application includes a computer-readable storage medium storing program code. The instructions included in the program code can be used to perform the method described in the foregoing method embodiments; for the specific implementation, refer to the method embodiments, which are not repeated here.
Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the system and device described above may refer to the corresponding processes in the foregoing method embodiments and are not repeated here.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive within the technical scope disclosed by the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (10)
1. A data quality determination system, characterized by comprising:
a data acquisition module, configured to obtain data to be determined;
an index value determining module, configured to determine index values of the data to be determined under preset quality determination indexes;
a quality determination module, configured to determine a quality determination result of the data to be determined based on the index values under the preset data quality indexes.
2. The system according to claim 1, characterized in that the data quality indexes include one or more of: a data consistency index, a data integrity index, a data age index, a data redundancy index, a data scarcity index, and a data volume index.
3. The system according to claim 2, characterized in that, where the quality determination indexes include the data consistency index, the data to be determined includes: data content and description information corresponding to the data to be determined;
the index value determining module is specifically configured to determine the degree of consistency between the data content included in the data to be determined and the description information corresponding to the data to be determined; and to determine the index value of the data to be determined under the data consistency index based on the degree of consistency, where a higher degree of consistency indicates a higher index value of the data to be determined under the data consistency index.
4. The system according to claim 3, characterized in that the index value determining module is specifically configured to determine the degree of consistency between one or more of the following items of data content and the corresponding description information, where a higher degree of consistency between any item of data content and the corresponding description information indicates a higher index value of the data consistency index of the data to be determined:
the data volume included in the data to be determined and the data volume described in the description information of the data to be determined;
the size of the data to be determined and the size described in the description information of the data to be determined;
the data format of the data to be determined and the data format described in the description information of the data to be determined.
5. The system according to claim 2, characterized in that, where the data quality indexes include the data integrity index, the index value determining module is specifically configured to determine the proportion of null values in the data entries included in the data to be determined; and to determine the index value of the data to be determined under the data integrity index based on the null-value proportion, where a lower null-value proportion indicates higher data integrity of the data to be determined;
where the quality determination indexes include the data age index, the index value determining module is specifically configured to determine the time interval spanned between the generation start time and the generation end time of the data to be determined, and the time difference between the generation start time of the data to be determined and the time at which the data to be determined is provided; and to determine the index value of the data to be determined under the data age index based on the time interval and the time difference; where a larger time interval span indicates a higher index value of the data age index of the data to be determined, and a smaller time difference indicates a higher index value of the data age index of the data to be determined;
where the data quality indexes include the data redundancy index, the index value determining module is specifically configured to determine the proportion of duplicate entries in the data entries included in the data to be determined; and to determine the index value of the data to be determined under the data redundancy index based on the proportion of duplicate entries, where a lower proportion of duplicate entries indicates lower data redundancy of the data to be determined;
where the quality determination indexes include the data volume index, the index value determining module is specifically configured to determine the data volume included in the data to be determined; and to determine the index value of the data to be determined under the data volume index based on the data volume, where a larger data volume indicates a higher index value of the data volume index of the data to be determined.
6. The system according to claim 2, characterized by further comprising: a similar data determining module;
the data acquisition module is further configured to crawl multiple data sets from the preset platform;
the similar data determining module is configured to parse the data to be determined and the multiple data sets respectively, and determine the lexical features of the data to be determined and of each data set; to perform text similarity matching between the lexical features of the data to be determined and the lexical features of each data set; and to determine each data set whose text similarity reaches a preset similarity threshold as similar data of the data to be determined.
7. The system according to claim 6, characterized in that, where the quality determination indexes include the data scarcity index,
the index value determining module is specifically configured to determine the number of occurrences, on the preset platform, of the data to be determined and of similar data that is similar to the data to be determined; and to determine the index value of the data to be determined under the data scarcity index based on the number of occurrences, where a smaller number of occurrences indicates higher scarcity of the data to be determined.
8. The system according to claim 1, characterized in that the quality determination module is specifically configured to perform, according to the weight coefficients of the preset data quality indexes, weighted summation on the index values of the data to be determined under the preset data quality indexes, to obtain the quality determination result of the data to be determined.
9. The system according to claim 1, characterized by further comprising: a data quality determination model training module;
the data quality determination model training module is configured to construct a data quality determination model with the preset data quality indexes as independent variables and the data quality grade as the dependent variable;
the data acquisition module is further configured to obtain training data;
the index value determining module is further configured to determine the index values of the training data under the preset data quality indexes and the quality determination results of the training data;
the data quality determination model training module is further configured to take the index values determined for the training data as independent variable values and the quality determination results of the corresponding training data as dependent variable values, substitute them into the data quality determination model, and train the data quality determination model;
the quality determination module is specifically configured to substitute the index values of the data to be determined under the preset data quality indexes, as independent variables, into the trained data quality determination model to obtain the quality determination result of the data to be determined.
10. A data quality determination method, characterized in that the method comprises:
obtaining data to be determined;
determining index values of the data to be determined under preset quality determination indexes;
determining a quality determination result of the data to be determined based on the index values under the preset data quality indexes.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201810511444.XA CN108829750A (en) | 2018-05-24 | 2018-05-24 | A kind of quality of data determines system and method |
Publications (1)
Publication Number | Publication Date |
---|---|
CN108829750A true CN108829750A (en) | 2018-11-16 |
Family
ID=64145374
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201810511444.XA Pending CN108829750A (en) | 2018-05-24 | 2018-05-24 | A kind of quality of data determines system and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN108829750A (en) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112183952A (en) * | 2020-09-08 | 2021-01-05 | 支付宝(杭州)信息技术有限公司 | Index quality supervision processing method and device and electronic equipment |
CN117273552A (en) * | 2023-11-22 | 2023-12-22 | 山东顺国电子科技有限公司 | Big data intelligent treatment decision-making method and system based on machine learning |
CN117273552B (en) * | 2023-11-22 | 2024-02-13 | 山东顺国电子科技有限公司 | Big data intelligent treatment decision-making method and system based on machine learning |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN108764705A (en) | A kind of data quality assessment platform and method | |
CN108734405A (en) | A kind of data value Evaluation Platform and method | |
CN108764707A (en) | A kind of data assessment system and method | |
US6834266B2 (en) | Methods for estimating the seasonality of groups of similar items of commerce data sets based on historical sales data values and associated error information | |
CN108763277B (en) | Data analysis method, computer readable storage medium and terminal device | |
CN109711955B (en) | Poor evaluation early warning method and system based on current order and blacklist base establishment method | |
CN110766428A (en) | Data value evaluation system and method | |
CN106469395A (en) | A kind of data commodity dynamic comprehensive appraisal procedure and system | |
CN110659926A (en) | Data value evaluation system and method | |
CN110874787A (en) | Recommendation model effect evaluation method and related device | |
CN108734587A (en) | The recommendation method and terminal device of financial product | |
CN109543940B (en) | Activity evaluation method, activity evaluation device, electronic equipment and storage medium | |
CN109872026A (en) | Evaluation result generation method, device, equipment and computer readable storage medium | |
CN107767152A (en) | Product purchase intention analysis method and server | |
CN108829750A (en) | A kind of quality of data determines system and method | |
CN108764995A (en) | A kind of data value determines system and method | |
CN115203496A (en) | Project intelligent prediction and evaluation method and system based on big data and readable storage medium | |
CN109214634A (en) | A kind of information processing method, device and information processing readable medium | |
CN106776757A (en) | User completes the indicating means and device of Net silver operation | |
CN104867032A (en) | Electronic commerce client evaluation identification system | |
CN111291567A (en) | Evaluation method and device for manual labeling quality, electronic equipment and storage medium | |
CN115905558A (en) | Knowledge graph-based XAI model evaluation method, device, equipment and medium | |
CN109345301A (en) | A kind of data price-determining system and determining method | |
CN112307307B (en) | Insurance product recommendation method and apparatus | |
CN114971240A (en) | Reading behavior risk assessment processing method and device | |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
CB02 | Change of applicant information | Address after: 100070, No. 101-8, building 1, 31, zone 188, South Fourth Ring Road, Beijing, Fengtai District; Applicant after: Guoxin Youyi Data Co.,Ltd.; Address before: 100070, No. 188, building 31, headquarters square, South Fourth Ring Road West, Fengtai District, Beijing; Applicant before: SIC YOUE DATA Co.,Ltd. |
RJ01 | Rejection of invention patent application after publication | Application publication date: 20181116 |