CN108829750A - Data quality determination system and method - Google Patents

Publication number
CN108829750A
Authority
CN
China
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810511444.XA
Other languages
Chinese (zh)
Inventor
王肃
庞钰宁
范月涛
段立新
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guoxin Youe Data Co Ltd
Original Assignee
Guoxin Youe Data Co Ltd
Application filed by Guoxin Youe Data Co Ltd filed Critical Guoxin Youe Data Co Ltd
Priority to CN201810511444.XA priority Critical patent/CN108829750A/en
Publication of CN108829750A publication Critical patent/CN108829750A/en
Pending legal-status Critical Current

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/21 Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
    • G06F18/214 Generating training patterns; Bootstrap methods, e.g. bagging or boosting

Landscapes

  • Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

This application provides a data quality determination system and method. The system includes: a data acquisition module for obtaining data to be determined; an index value determination module for determining index values of the data to be determined under preset data quality indices; and a quality determination module for determining a quality determination result for the data to be determined based on the index values under the preset data quality indices. When determining data quality, the system can determine the quality of business data more objectively and accurately. Because no human participation is needed in the determination process, the likelihood that business data is leaked by people is reduced and the security of the business data during evaluation is increased.

Description

Data quality determination system and method
Technical field
This application relates to the technical field of data assessment, and in particular to a data quality determination system and method.
Background art
With the rapid development of digital information, data has an ever greater influence on enterprises, and more and more enterprises need to "let the data speak". For an enterprise, intangible assets account for a growing share of its assets. Besides intellectual property such as patents, software copyrights and trademarks, the importance of business data as an intangible asset should not be underestimated. The value of business data sometimes directly determines the value of the enterprise.
The value of business data is usually assessed on the basis of the business data itself, and the quality of the business data largely influences the result of the value assessment. Therefore, before the value of business data is assessed, its quality usually needs to be determined. The prior art provides business data assessment services for determining the quality of business data. The providers of such services are mainly asset assessment organizations. To have business data assessed, the party requesting the determination needs to contact an asset assessment organization, and the two sides negotiate the assessment terms face to face. Once the terms are agreed, the requesting party supplies the business data to the asset assessment organization, whose asset assessment experts then assess the business data according to a certain assessment procedure. With this approach, the assessment is heavily influenced by subjective human factors, so the assessment result is not sufficiently objective or accurate.
Summary of the invention
In view of this, the purpose of the embodiments of the present application is to provide a data quality determination system and method that can determine the quality of business data more objectively and accurately and that need no human participation in the determination process, which reduces the likelihood that business data is leaked by people and increases the security of the business data during evaluation.
In a first aspect, an embodiment of the present application provides a data quality determination system, including:
a data acquisition module, configured to obtain data to be determined;
an index value determination module, configured to determine index values of the data to be determined under preset data quality indices;
a quality determination module, configured to determine a quality determination result for the data to be determined based on the index values under the preset data quality indices.
With reference to the first aspect, an embodiment of the present application provides a first possible implementation of the first aspect, wherein the data quality indices include one or more of: a data consistency index, a data integrity index, a data timeliness index, a data redundancy index, a data scarcity index, and a data volume index.
With reference to the first aspect, an embodiment of the present application provides a second possible implementation of the first aspect, wherein, where the data quality indices include the data consistency index, the data to be determined includes data content and description information corresponding to the data to be determined;
the index value determination module is specifically configured to determine the degree of consistency between the data content included in the data to be determined and the description information corresponding to the data to be determined, and to determine, based on the degree of consistency, the index value of the data to be determined under the data consistency index; the higher the degree of consistency, the higher the index value of the data to be determined under the data consistency index.
With reference to the first aspect, an embodiment of the present application provides a third possible implementation of the first aspect, wherein:
the index value determination module is specifically configured to determine the degree of consistency between one or more of the following items of data content and the corresponding description information; the higher the degree of consistency between any item of data content and its corresponding description information, the higher the index value of the data consistency index of the data to be determined:
the data volume that the data to be determined contains and the data volume described in the description information of the data to be determined;
the size of the data to be determined and the size described in the description information of the data to be determined;
the data format of the data to be determined and the data format described in the description information of the data to be determined.
With reference to the first aspect, an embodiment of the present application provides a fourth possible implementation of the first aspect, wherein:
where the data quality indices include the data integrity index, the index value determination module is specifically configured to determine the proportion of null values in the data entries included in the data to be determined, and to determine, based on the proportion of null values, the index value of the data to be determined under the data integrity index; the lower the proportion of null values, the higher the data integrity of the data to be determined;
where the data quality indices include the data timeliness index, the index value determination module is specifically configured to determine the time interval spanned between the time at which generation of the data to be determined started and the time at which it ended, and the time difference between the time at which generation of the data to be determined started and the time at which the data to be determined is provided; and to determine, based on the time interval and the time difference, the index value of the data to be determined under the data timeliness index; the larger the span of the time interval, the higher the index value of the data timeliness index of the data to be determined, and the smaller the time difference, the higher the index value of the data timeliness index of the data to be determined;
where the data quality indices include the data redundancy index, the index value determination module is specifically configured to determine the proportion of duplicate entries among the data entries included in the data to be determined, and to determine, based on the proportion of duplicate entries, the index value of the data to be determined under the data redundancy index; the lower the proportion of duplicate entries, the lower the data redundancy of the data to be determined;
where the data quality indices include the data volume index, the index value determination module is specifically configured to determine the data volume that the data to be determined contains, and to determine, based on the data volume, the index value of the data to be determined under the data volume index; the larger the data volume, the higher the index value of the data volume index of the data to be determined.
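The raw quantities named in this implementation can be sketched as follows. This is only an illustration under stated assumptions: the function names are hypothetical, entries are assumed to be tuples of data elements, and an element counts as null when it is None or an empty string. The text does not specify how these quantities are mapped to index values, so the sketch stops at the quantities themselves.

```python
from datetime import datetime


def null_ratio(entries):
    """Proportion of empty data elements over all entries (lower means more complete)."""
    total = sum(len(entry) for entry in entries)
    if total == 0:
        return 0.0
    empty = sum(1 for entry in entries for value in entry
                if value is None or value == "")
    return empty / total


def duplicate_ratio(entries):
    """Proportion of entries that repeat another entry (lower means less redundant)."""
    if not entries:
        return 0.0
    unique = set(tuple(entry) for entry in entries)
    return (len(entries) - len(unique)) / len(entries)


def timeliness_inputs(start, end, provided):
    """Return (span_days, gap_days).

    span_days: time spanned between start and end of data generation.
    gap_days:  time between start of generation and the time the data is provided.
    """
    return (end - start).days, (provided - start).days


# Hypothetical sample: three entries of three data elements each.
entries = [
    ("apple", "A", None),
    ("apple", "A", None),
    ("pear", "", "x"),
]
span, gap = timeliness_inputs(
    datetime(2018, 1, 1), datetime(2018, 3, 1), datetime(2018, 5, 1)
)
```

For this sample, three of the nine data elements are empty and one of the three entries is a repeat, so both ratios are one third; the data volume in the sense of the last case would simply be the count of non-empty elements.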
With reference to the first aspect, an embodiment of the present application provides a fifth possible implementation of the first aspect, wherein the system further includes a similar data determination module;
the data acquisition module is further configured to crawl multiple data sets from a preset platform;
the similar data determination module is configured to parse the data to be determined and each of the multiple data sets to determine the lexical features of the data to be determined and of each data set; to perform text similarity matching between the lexical features of the data to be determined and the lexical features of each data set; and to determine any data set whose text similarity reaches a preset similarity threshold to be similar data of the data to be determined.
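The matching step described above can be sketched as follows. The text does not say which lexical features or which similarity measure are used; this sketch assumes lowercase word tokens as the lexical features and Jaccard similarity as the text similarity, and all names and the threshold value are hypothetical.

```python
import re


def lexical_features(text):
    """Assumed lexical features: the set of lowercase word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets (an assumed similarity measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_similar(target_text, crawled, threshold=0.5):
    """Return names of crawled data sets whose similarity reaches the threshold."""
    target = lexical_features(target_text)
    return [
        name
        for name, text in crawled.items()
        if jaccard(target, lexical_features(text)) >= threshold
    ]


# Hypothetical crawled data sets, keyed by name.
crawled = {
    "a": "commodity price data 2017",
    "b": "weather station readings",
}
similar = find_similar("commodity price data 2018", crawled)
```

Here data set "a" shares three of five distinct tokens with the target (similarity 0.6) and is kept, while "b" shares none and is dropped.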
With reference to the first aspect, an embodiment of the present application provides a sixth possible implementation of the first aspect, wherein, where the data quality indices include the data scarcity index,
the index value determination module is specifically configured to determine the number of times that the data to be determined, and similar data that is similar to the data to be determined, appear on the preset platform, and to determine, based on the number of occurrences, the index value of the data to be determined under the data scarcity index; the fewer the occurrences, the higher the scarcity of the data to be determined.
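The only constraint stated here is that fewer occurrences mean higher scarcity. As a minimal sketch, one mapping that satisfies this is 1 / (1 + n); this formula and the function name are assumptions for illustration, not something the text specifies.

```python
def scarcity_index(occurrences):
    """Map an occurrence count to a value in (0, 1]; fewer occurrences give a higher value.

    The 1 / (1 + n) form is an illustrative assumption.
    """
    if occurrences < 0:
        raise ValueError("occurrences must be non-negative")
    return 1.0 / (1.0 + occurrences)


never_seen = scarcity_index(0)
seen_three_times = scarcity_index(3)
```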
With reference to the first aspect, an embodiment of the present application provides a seventh possible implementation of the first aspect, wherein the quality determination module is specifically configured to perform a weighted summation of the index values of the data to be determined under the preset data quality indices, according to the weight coefficients of the preset data quality indices, to obtain the quality determination result of the data to be determined.
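The weighted summation can be sketched as below. The index names, index values and weight coefficients are hypothetical; the text does not give concrete weights or say whether they must sum to 1.

```python
def weighted_quality(index_values, weights):
    """Weighted sum of index values; every index needs a weight coefficient."""
    missing = set(index_values) - set(weights)
    if missing:
        raise KeyError(f"no weight for: {sorted(missing)}")
    return sum(weights[name] * value for name, value in index_values.items())


# Hypothetical index values and weight coefficients.
index_values = {"consistency": 0.9, "integrity": 0.8}
weights = {"consistency": 0.5, "integrity": 0.5}
score = weighted_quality(index_values, weights)
```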
With reference to the first aspect, an embodiment of the present application provides an eighth possible implementation of the first aspect, wherein the system further includes a data quality determination model training module;
the data quality determination model training module is configured to build a data quality determination model with the preset data quality indices as independent variables and the data quality grade as the dependent variable;
the data acquisition module is further configured to obtain training data;
the index value determination module is further configured to determine the index values of the training data under the preset data quality indices and the quality determination results of the training data;
the data quality determination model training module is further configured to substitute the index values determined for the training data, as independent variable values, and the quality determination results of the corresponding training data, as dependent variable values, into the data quality determination model, so as to train the data quality determination model;
the quality determination module is specifically configured to substitute the index values of the data to be determined under the preset data quality indices, as independent variables, into the trained data quality determination model to obtain the quality determination result of the data to be determined.
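The text describes the model only as having the index values as independent variables and the quality grade as the dependent variable. As one possible reading, the sketch below assumes a linear model fitted by ordinary least squares (normal equations solved by Gaussian elimination); the model form, function names and training values are all assumptions for illustration.

```python
def fit_linear(X, y):
    """Fit y = c0 + c1*x1 + ... by ordinary least squares (assumed model form)."""
    rows = [[1.0] + list(x) for x in X]          # prepend intercept column
    n = len(rows[0])
    # Normal equations: (R^T R) coef = R^T y, stored as an augmented matrix.
    M = [
        [sum(r[i] * r[j] for r in rows) for j in range(n)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(n)
    ]
    # Gaussian elimination with partial pivoting.
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("training data does not determine a unique model")
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    # Back substitution.
    coef = [0.0] * n
    for i in range(n - 1, -1, -1):
        rest = sum(M[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = (M[i][n] - rest) / M[i][i]
    return coef


def predict(coef, x):
    """Quality determination result for one vector of index values."""
    return coef[0] + sum(c * v for c, v in zip(coef[1:], x))


# Hypothetical training data: two index values per data set, and quality
# results generated from a known linear rule so the fit can be checked.
X = [[0.9, 0.8], [0.5, 0.4], [0.2, 0.9], [0.7, 0.1]]
y = [0.6 * a + 0.4 * b for a, b in X]
coef = fit_linear(X, y)
result = predict(coef, [1.0, 0.5])
```

Any other regression or classification model could take the place of the linear form; the train-then-substitute flow stays the same.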
In a second aspect, an embodiment of the present application provides a data quality determination method, including:
obtaining data to be determined;
determining index values of the data to be determined under preset data quality indices;
determining a quality determination result for the data to be determined based on the index values under the preset data quality indices.
With reference to the second aspect, an embodiment of the present application provides a first possible implementation of the second aspect, wherein the data quality indices include one or more of: a data consistency index, a data integrity index, a data timeliness index, a data redundancy index, a data scarcity index, and a data volume index.
With reference to the second aspect, an embodiment of the present application provides a second possible implementation of the second aspect, wherein, where the data quality indices include the data consistency index, the data to be determined includes data content and description information corresponding to the data to be determined;
determining the index values of the data to be determined under the preset data quality indices specifically includes: determining the degree of consistency between the data content included in the data to be determined and the description information corresponding to the data to be determined; and determining, based on the degree of consistency, the index value of the data to be determined under the data consistency index, where the higher the degree of consistency, the higher the index value of the data to be determined under the data consistency index.
With reference to the second aspect, an embodiment of the present application provides a third possible implementation of the second aspect, wherein determining the degree of consistency between the data content included in the data to be determined and the description information corresponding to the data to be determined specifically includes: determining the degree of consistency between one or more of the following items of data content and the corresponding description information, where the higher the degree of consistency between any item of data content and its corresponding description information, the higher the index value of the data consistency index of the data to be determined:
the data volume that the data to be determined contains and the data volume described in the description information of the data to be determined;
the size of the data to be determined and the size described in the description information of the data to be determined;
the data format of the data to be determined and the data format described in the description information of the data to be determined.
With reference to the second aspect, an embodiment of the present application provides a fourth possible implementation of the second aspect, wherein, where the data quality indices include the data integrity index,
determining the index values of the data to be determined under the preset data quality indices specifically includes:
determining the proportion of null values in the data entries included in the data to be determined; and determining, based on the proportion of null values, the index value of the data to be determined under the data integrity index, where the lower the proportion of null values, the higher the data integrity of the data to be determined.
With reference to the second aspect, an embodiment of the present application provides a fifth possible implementation of the second aspect, wherein, where the data quality indices include the data timeliness index,
determining the index values of the data to be determined under the preset data quality indices specifically includes: determining the time interval spanned between the time at which generation of the data to be determined started and the time at which it ended, and the time difference between the time at which generation of the data to be determined started and the time at which the data to be determined is provided; and determining, based on the time interval and the time difference, the index value of the data to be determined under the data timeliness index;
where the larger the span of the time interval, the higher the index value of the data timeliness index of the data to be determined, and the smaller the time difference, the higher the index value of the data timeliness index of the data to be determined.
With reference to the second aspect, an embodiment of the present application provides a sixth possible implementation of the second aspect, wherein, where the data quality indices include the data redundancy index,
determining the index values of the data to be determined under the preset data quality indices specifically includes: determining the proportion of duplicate entries among the data entries included in the data to be determined; and determining, based on the proportion of duplicate entries, the index value of the data to be determined under the data redundancy index, where the lower the proportion of duplicate entries, the lower the data redundancy of the data to be determined.
With reference to the second aspect, an embodiment of the present application provides a seventh possible implementation of the second aspect, wherein the method further includes: crawling multiple data sets from a preset platform; parsing the data to be determined and each of the multiple data sets to determine the lexical features of the data to be determined and of each data set; performing text similarity matching between the lexical features of the data to be determined and the lexical features of each data set; and determining any data set whose text similarity reaches a preset similarity threshold to be similar data of the data to be determined.
With reference to the second aspect, an embodiment of the present application provides an eighth possible implementation of the second aspect, wherein, where the data quality indices include the data scarcity index,
determining the index values of the data to be determined under the preset data quality indices specifically includes: determining the number of times that the data to be determined, and similar data that is similar to the data to be determined, appear on the preset platform; and determining, based on the number of occurrences, the index value of the data to be determined under the data scarcity index, where the fewer the occurrences, the higher the scarcity of the data to be determined.
With reference to the second aspect, an embodiment of the present application provides a ninth possible implementation of the second aspect, wherein, where the data quality indices include the data volume index,
determining the index values of the data to be determined under the preset data quality indices specifically includes: determining the data volume that the data to be determined contains; and determining, based on the data volume, the index value of the data to be determined under the data volume index, where the larger the data volume, the higher the index value of the data volume index of the data to be determined.
With reference to the second aspect, an embodiment of the present application provides a tenth possible implementation of the second aspect, wherein determining the quality determination result of the data to be determined based on the index values under the preset data quality indices specifically includes: performing a weighted summation of the index values of the data to be determined under the preset data quality indices, according to the weight coefficients of the preset data quality indices, to obtain the quality determination result of the data to be determined.
With reference to the second aspect, an embodiment of the present application provides an eleventh possible implementation of the second aspect, wherein the method further includes: building a data quality determination model with the preset data quality indices as independent variables and the data quality grade as the dependent variable;
obtaining training data;
determining the index values of the training data under the preset data quality indices and the quality determination results of the training data;
substituting the index values determined for the training data, as independent variable values, and the quality determination results of the corresponding training data, as dependent variable values, into the data quality determination model, so as to train the data quality determination model;
substituting the index values of the data to be determined under the preset data quality indices, as independent variables, into the trained data quality determination model to obtain the quality determination result of the data to be determined.
In the data quality determination system provided by the embodiments of the present application, after the data acquisition module obtains the data to be determined, the index value determination module determines the index values of the data to be determined under the preset data quality indices, and the quality determination module then determines the quality determination result of the data to be determined based on those index values. The whole process requires no human intervention, so the quality of business data can be determined more objectively and accurately.
To make the above objects, features and advantages of the present application clearer and easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.
Brief description of the drawings
To illustrate the technical solutions of the embodiments of the present application more clearly, the drawings needed in the embodiments are briefly introduced below. It should be understood that the following drawings show only some embodiments of the present application and should therefore not be regarded as limiting its scope. A person of ordinary skill in the art can derive other related drawings from these drawings without creative effort.
Fig. 1 shows a structural schematic diagram of a data quality determination system provided by an embodiment of the present application;
Fig. 2 shows a structural schematic diagram of another data quality determination system provided by an embodiment of the present application;
Fig. 3 shows a flow chart of a data quality determination method provided by an embodiment of the present application;
Fig. 4 shows a structural schematic diagram of a computer device provided by an embodiment of the present application.
Detailed description of the embodiments
To make the purposes, technical solutions and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments are described clearly and completely below with reference to the drawings in the embodiments. Obviously, the described embodiments are only some, not all, of the embodiments of the present application. The components of the embodiments generally described and illustrated in the drawings herein can be arranged and designed in a variety of different configurations. Therefore, the following detailed description of the embodiments provided in the drawings is not intended to limit the claimed scope of the application but merely represents selected embodiments. All other embodiments obtained by those skilled in the art on the basis of the embodiments of the present application without creative effort shall fall within the scope of protection of the present application.
Unlike the prior art, when determining the quality of business data, the embodiments of the present application obtain the business data (the data to be determined in the embodiments) through the data acquisition module, determine the index value of the business data under at least one preset data quality index through the index value determination module, and then determine the quality result of the business data through the quality determination module based on the index values under the preset data quality indices. The whole process needs no human intervention, so the quality of the business data can be determined more objectively and accurately. Precisely because no human intervention is needed, the chance that people come into contact with the business data is reduced, which lowers the likelihood that business data is leaked by people and increases the security of the business data during evaluation.
To facilitate understanding of this embodiment, a data quality determination system disclosed in the embodiments of the present application is first described in detail. It should be noted that, besides the quality of business data, the data quality determination system can also determine the quality of other data, such as test data and home data. The technical solution of the present application is illustrated below taking business data as the data to be determined.
Referring to Fig. 1, the data quality determination system provided by the embodiments of the present application includes a data acquisition module 10, an index value determination module 20 and a quality determination module 30.
The data acquisition module 10 is configured to obtain data to be determined.
In a specific implementation, the data to be determined is the business data whose quality is to be determined. The data to be determined can be obtained in several ways, for example by crawling business data from a preset platform, where preset platforms include enterprise websites, statistics bureaus, data trading platforms, button platform, and the like; or by receiving data to be determined sent from a data source.
The index value determination module 20 is configured to determine the index values of the data to be determined under the preset data quality indices.
In a specific implementation, the data quality indices include one or more of: a data consistency index, a data integrity index, a data timeliness index, a data redundancy index, a data scarcity index, and a data volume index.
Preferably, the object of each application of the embodiment may be one class of data; if that class of data includes multiple data sets, the object of data quality determination in the embodiment of the present application may be a single data set.
In the embodiments of the present application, the index value determination module 20 is specifically configured to determine the index value of the data to be determined under each data quality index by methods 1 to 6 below. Specifically:
1. Where the data quality indices include the data consistency index, the data to be determined includes data content and description information corresponding to the data to be determined;
the index value determination module 20 is specifically configured to determine the degree of consistency between the data content included in the data to be determined and the description information corresponding to the data to be determined, and to determine, based on the degree of consistency, the index value of the data to be determined under the data consistency index; the higher the degree of consistency, the higher the index value of the data consistency index of the data to be determined.
In a specific implementation, the degree of consistency between the data content of the data to be determined and the description information can be characterized by determining the degree of consistency between one or more of the following items of data content and the corresponding description information, where the higher the degree of consistency between any item of data content and its corresponding description information, the higher the index value of the data consistency index of the data to be determined.
First: the data volume that the data to be determined contains and the data volume described in the description information of the data to be determined.
Here, the data content of the data to be determined is carried in a file of a certain format. The data to be determined may consist of multiple data entries, and each data entry consists of multiple data elements, where a data element is the most basic data unit making up the data to be determined.
For example, when the data to be determined is commodity price data, the data elements included in one data entry are, in order: commodity name, manufacturer, place of origin, production time, shelf life, net content, nutritional ingredients, product batch number, and on-sale date.
That is, the data to be determined is preferably in the form of data entries. Where the data to be evaluated is text data, a key information extraction operation can be performed on the text data before evaluation to generate data in data entry form. For example, if the data to be evaluated is a buyer's guide text, data entries can be extracted before evaluation according to keywords such as commodity name, manufacturer, place of origin and production time, and the extracted data entries are used as the data to be determined.
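The keyword-based extraction of data entries from text can be sketched as follows. The text does not describe the extraction technique; this sketch assumes the source text labels each field as "keyword: value" separated by semicolons or line breaks, and the field list, function name and sample text are hypothetical.

```python
import re

# Assumed keywords; a real field list would follow the data's own schema.
FIELDS = ("commodity name", "manufacturer", "place of origin", "production time")


def extract_entry(text):
    """Build one data entry from text; fields not found are left as None."""
    entry = {}
    for field in FIELDS:
        match = re.search(
            rf"{re.escape(field)}\s*:\s*([^;\n]+)", text, flags=re.IGNORECASE
        )
        entry[field] = match.group(1).strip() if match else None
    return entry


text = "Commodity name: Oat milk; Manufacturer: Acme Foods; Production time: 2018-05-01"
entry = extract_entry(text)
```

In this sample the place of origin is absent from the text, so the corresponding data element is empty in the extracted entry.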
The data volume that data to be determined are included, the data volume for the valid data member that data as to be determined include, for example, In the examples described above, the quantity for the data element that a complete data include should be nine, then every data entry is corresponding Data volume is 9;If data to be determined include 100 data entries, the data volume that should have should be 900, that is, Data volume described in description information is 900;But in practice, it is understood that there may be certain data elements are sky, are not had for empty data element There is actual content, causes the actual amount of data of data to be determined less than description data volume.
By taking the quantity of data entry as an example, here can also the data more to be determined data entry quantity that includes with it is described Data entry quantity described in the description information of data to be determined.
Therefore it can be retouched by the description information for the data volume and the data to be determined that determination data to be determined include The degree of consistency for the data volume stated characterizes the data content of data to be determined and the degree of consistency of description information.
Second: the size of the data to be determined and the size stated in the description information of the data to be determined.
Here, the size of the data to be determined can be regarded as the file size of the file carrying the data to be determined. For example, if data elements of some data entries are missing (that is, empty), the actual size of the file carrying the data to be determined will also differ from the size stated in the description information.
Therefore, the degree of consistency between the data content of the data to be determined and the description information can be characterized by determining the degree of consistency between the size of the data to be determined and the size stated in its description information.
Third: the data format of the data to be determined and the data format stated in the description information of the data to be determined.
Here, the data format of the data to be determined can be the file format of the file carrying the data to be determined. The file format of the file carrying the data to be determined may differ from the file format stated in the description information.
Therefore, the degree of consistency between the data content of the data to be determined and the description information can be characterized by determining the degree of consistency between the data format of the data to be determined and the data format stated in its description information.
It should be noted that the data content of the data to be determined may include, but is not limited to, data volume, size, and data format. The description information corresponding to the data to be determined is generally used to describe the data to be determined, and likewise includes content such as data volume, size, and data format.
Specifically, the embodiment of the present application provides a specific method for determining the index value of the data to be determined under the data consistency index based on the degree of consistency of data volume, data size, and data format:
Calculate a first absolute difference between the data volume contained in the data to be determined and the data volume stated in its description information. Calculate a second absolute difference between the size of the data to be determined and the size stated in its description information. If the data format of the data to be determined is consistent with the data format stated in its description information, the consistency degree P of the data to be determined is set to a first preset value; otherwise it is set to a second preset value. The index value of the data consistency index is then calculated from the first absolute difference, the second absolute difference, and the consistency degree.
Here, the first preset value can be set to 0 and the second preset value to 1. Optionally, the first and second preset values can be set to other values, provided that the second preset value is greater than the first preset value.
Specifically, the first absolute difference $L_1$ satisfies: $L_1 = |L_a - L_m|$;
where $L_a$ is the data volume contained in the data to be determined, and $L_m$ is the data volume stated in the description information of the data to be determined.
The second absolute difference $L_2$ satisfies: $L_2 = |S_a - S_m|$;
where $S_a$ is the size of the data to be determined, and $S_m$ is the size stated in the description information of the data to be determined.
The index value $\omega_1$ of the data to be determined under the data consistency index then satisfies:
Here, $\alpha$ is a calculation coefficient that can take a value between 0 and 1, for example 1/3, 1/4, or 1/2.
The value range of $\omega_1$ is generally [0, 1]. The larger the value of $\omega_1$, the higher the degree of consistency of the data to be determined.
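The three inputs to the consistency index (the first absolute difference, the second absolute difference, and the consistency degree P) can be sketched as follows. This is a minimal Python sketch rather than the applicant's implementation: the `Description` record, the function name `consistency_inputs`, and the case-insensitive format comparison are assumptions made for illustration. The sketch stops at the three inputs and does not combine them into the index value.

```python
from dataclasses import dataclass


@dataclass
class Description:
    """Values stated in the description information (hypothetical record)."""
    data_volume: int   # L_m: stated number of data elements
    size_bytes: int    # S_m: stated file size
    data_format: str   # stated file format, e.g. "csv"


def consistency_inputs(actual_volume, actual_size, actual_format, desc,
                       first_preset=0, second_preset=1):
    """Return (L1, L2, P) for the data consistency index."""
    l1 = abs(actual_volume - desc.data_volume)   # first absolute difference
    l2 = abs(actual_size - desc.size_bytes)      # second absolute difference
    # Assumption: formats are compared case-insensitively.
    same_format = actual_format.lower() == desc.data_format.lower()
    p = first_preset if same_format else second_preset
    return l1, l2, p


desc = Description(data_volume=900, size_bytes=20480, data_format="csv")
print(consistency_inputs(874, 19968, "CSV", desc))
```

With the preset values 0 and 1, P acts as a flag that is raised only when the file format differs from the stated one.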
2. For the case where the data quality index includes a data integrity index:
The index value determining module 20 is specifically configured to determine the null-value proportion in the data entries contained in the data to be determined, and to determine the index value of the data to be determined under the data integrity index based on the null-value proportion. The lower the null-value proportion, the higher the data integrity of the data to be determined.
In a specific implementation, data elements of the data to be determined may be missing. In this case, the more data elements are missing, the poorer the integrity of the data to be determined.
When determining the null-value proportion in the data entries contained in the data to be determined, the index value determining module 20 detects in turn whether each data element in each data entry of the data to be determined is empty, and assigns an integrity value to each data element according to the detection result: if the data element is empty, its integrity value is 0; if it is not empty, its integrity value is 1. The ratio of the sum of the integrity values of all data elements to the number of data elements is then calculated. Because non-empty data elements are assigned 1, this ratio is the proportion of non-empty data elements, that is, one minus the null-value proportion.
This ratio can be used directly as the index value of the data to be determined under the data integrity index. For example:
The index value $\omega_2$ of the data to be determined under the data integrity index is calculated using the following formula:
$\omega_2 = \frac{1}{N}\sum_{i=1}^{N} a_i$
where $a_i$ is the integrity value of the i-th data element in the data to be determined, and N is the total number of data elements in the data to be determined.
The value range of $\omega_2$ is [0, 1]. The larger the value of $\omega_2$, the better the data integrity of the data to be determined.
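The integrity-value assignment described above can be sketched in Python as follows, assuming the index value is the mean of the per-element integrity values. The helper `is_empty` and its notion of "empty" (`None` or a blank string) are assumptions for illustration; the text does not define how emptiness is detected.

```python
def is_empty(element):
    """Assumption: an element is empty if it is None or a blank string."""
    return element is None or str(element).strip() == ""


def integrity_index(entries):
    """Mean integrity value over all data elements (1 = non-empty, 0 = empty)."""
    values = [0 if is_empty(e) else 1 for entry in entries for e in entry]
    if not values:
        raise ValueError("no data elements to evaluate")
    return sum(values) / len(values)


entries = [
    ["Milk", "Acme", ""],           # one empty data element
    ["Tea", None, "2018-03-15"],    # one empty data element
]
print(integrity_index(entries))
```

Here four of the six data elements are non-empty, so the index value is 4/6.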
The index value of the data to be determined under the data integrity index can also be determined from the null-value proportion based on the correlation between the two, with a lower null-value proportion giving a higher index value.
In addition, when determining the null-value proportion in the data entries contained in the data to be determined, the index value determining module 20 can also use the following steps: count the total number of empty data elements across all data entries in the data to be determined; take the ratio of this total to the total number of all data elements in the data to be determined as the null-value proportion.
Further, the null-value proportion can also be the proportion of invalid data entries among all data entries in the data to be determined. A data entry containing a preset number of empty data elements can be determined to be an invalid data entry. In this case, $\omega_2$ is the quotient of the number of invalid data entries and the total number of data entries.
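The two alternative proportions can be sketched as follows. This is an illustrative Python sketch: `is_empty`, `null_ratio`, `invalid_entry_ratio`, and the parameter `min_empty` are hypothetical names, and reading "a preset number of empty data elements" as "at least that many" is an assumption.

```python
def is_empty(element):
    """Assumption: an element is empty if it is None or a blank string."""
    return element is None or str(element).strip() == ""


def null_ratio(entries):
    """Empty data elements divided by all data elements."""
    total = sum(len(entry) for entry in entries)
    empties = sum(1 for entry in entries for e in entry if is_empty(e))
    return empties / total


def invalid_entry_ratio(entries, min_empty):
    """Entries with at least `min_empty` empty elements, divided by all entries."""
    invalid = sum(
        1 for entry in entries
        if sum(1 for e in entry if is_empty(e)) >= min_empty
    )
    return invalid / len(entries)


entries = [
    ["a", "", ""],
    ["b", "c", "d"],
    ["", "e", "f"],
    ["g", "h", "i"],
]
print(null_ratio(entries), invalid_entry_ratio(entries, min_empty=2))
```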
3. For the case where the quality determination index includes a data timeliness index:
The index value determining module 20 is specifically configured to determine the time interval spanned between the start generation time and the end generation time of the data to be determined, and the time difference between the start generation time of the data to be determined and the time at which the data to be determined is provided; and to determine the index value of the data to be determined under the data timeliness index based on the time interval and the time difference;
where the larger the span of the time interval, the higher the index value of the data timeliness index of the data to be determined; and the smaller the time difference, the higher the index value of the data timeliness index of the data to be determined.
In a specific implementation, the time interval spanned by the generation time of the data to be determined runs from the start generation time to the end generation time of the data to be determined. The unit of the time interval is set according to the length of the interval.
In particular, when the start generation time and end generation time of the data to be determined cannot be determined, they can be determined from the description information of the data to be determined. The generation time can be the start time or the end time of the time interval spanned by the data to be determined, or the average time, and is preferably the start time.
For example, if the length of the time interval is 1 day, the unit of the time interval is set to minutes; if the length is 2 months, the unit is set to days; if the length is 3 years, the unit can be set to weeks. It should be noted that these unit settings are only examples provided by the embodiment of the present application and should not be regarded as limiting the technical solution of the present application.
The data provision time refers to the time at which the data acquisition module 10 of the data quality determination system obtains the data to be determined. It should be noted that, because the data to be determined has a certain data volume, the data acquisition module cannot actually obtain all of the data to be determined at a single point in time. Therefore, the data provision time can be the start time at which the data acquisition module 10 obtains the data to be determined, or the end time at which it does so. In addition, after obtaining the data to be determined, the data acquisition module 10 transfers it to the index value determining module 20 for processing within a very short time, so the difference between the start or end time of acquisition and the current time at which the index value determining module 20 determines the index value under the timeliness index is very small. The index value determining module 20 can therefore also use the current time at which it determines the index value of the data to be determined under the timeliness index as the data provision time.
For example, the data to be determined contains 100 data entries. Among them, the generation time of the earliest generated data entry (that is, the start generation time of the data to be determined) is March 15, 2018, and the generation time of the latest generated data entry (that is, the end generation time of the data to be determined) is April 17, 2018. The time interval spanned by the generation time of the data to be determined is therefore 33 days. If the provision time of the data to be determined is May 10, 2018, the time difference between the generation time and the provision time is the time from March 15, 2018 to May 10, 2018, that is, 56 days.
When determining the index value of the data to be determined under the data timeliness index based on the time interval and the time difference, the ratio of the time interval to the time difference can be used as the index value of the data to be determined under the timeliness index.
For example, the index value $\omega_3$ of the data to be determined under the timeliness index can be calculated using the following formula:
$\omega_3 = \frac{T_f - T_s}{T_n - T_s}$
where $T_f$ is the end generation time of the data to be determined; if the end time cannot be determined from the data to be determined, the end time in the corresponding description information is used. $T_s$ is the start generation time of the data to be determined; if the start generation time cannot be determined from the data to be determined, the start generation time in the corresponding description information is used. $T_n$ is the provision time of the data to be determined.
The value range of $\omega_3$ is [0, 1]. The larger the value of $\omega_3$, the stronger the timeliness of the data to be determined.
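Taking the index value as the ratio of the time interval to the time difference, the dates from the example above can be worked through in a short Python sketch. The function name `timeliness_index` and the choice of whole days as the unit are assumptions for illustration.

```python
from datetime import date


def timeliness_index(start, end, provided):
    """Ratio of the generation interval to the start-to-provision difference."""
    interval = (end - start).days          # T_f - T_s
    difference = (provided - start).days   # T_n - T_s
    if difference <= 0:
        raise ValueError("provision time must be after the start generation time")
    return interval / difference


start = date(2018, 3, 15)     # start generation time
end = date(2018, 4, 17)       # end generation time
provided = date(2018, 5, 10)  # data provision time

print((end - start).days, (provided - start).days)
print(timeliness_index(start, end, provided))
```

The interval is 33 days and the difference is 56 days, so the index value is 33/56. Data provided immediately after its last entry was generated would score close to 1.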
4. For the case where the data quality index includes a data redundancy index:
The index value determining module 20 is specifically configured to determine the proportion of repeated entries among the data entries contained in the data to be determined, and to determine the index value of the data redundancy index of the data to be determined based on that proportion. The lower the proportion of repeated entries, the lower the data redundancy of the data to be determined.
In a specific implementation, data redundancy is calculated as the proportion of repeated data. In a data set, repeated data constitutes data redundancy; the higher the redundancy, the lower the data quality.
Specifically, the index value determining module 20 can determine the index value of the data to be determined under the data redundancy index in either of the following ways:
First: according to the data elements contained in each data entry, count the number of times data entries in the data to be determined repeat. From the number of repetitions across all data entries and the total number of data entries, determine the proportion of repeated data entries, that is, the proportion of repeated entries among the data entries contained in the data to be determined. Based on this proportion, calculate the quality determination value of the data to be determined under the data redundancy index, where the quality determination value is negatively correlated with the proportion of repeated data entries.
Here, when counting the number of times data entries in the data to be determined repeat, each data entry is checked in turn, in the order in which the data entries are arranged, to see whether it has appeared earlier. Two data entries are regarded as identical if the contents of their data elements are completely identical, or if the number of data elements whose contents are identical or similar reaches a preset threshold. When the i-th data entry is checked, if it appears for the first time, the count is unchanged; if it does not appear for the first time, the count is increased by 1.
Second: the index value determining module 20 checks in turn whether each data entry in the data to be determined is a repeated data entry, and assigns a repeatability value to each data entry according to the detection result. If a data entry is a repeated data entry, that is, another data entry identical to the current one has already been checked before the current data entry is checked, its repeatability value is 1. If it is not a repeated data entry, that is, no identical data entry has been checked before it, its repeatability value is 0. The ratio of the sum of the repeatability values of all data entries to the number of data entries is taken as the proportion of repeated entries among the data entries contained in the data to be determined.
For example, the index value $\omega_4$ of the data to be determined under the data redundancy index can be calculated using the following formula:
$\omega_4 = 1 - \frac{1}{N}\sum_{i=1}^{N} b_i$
where $b_i$ is the repeatability value of the i-th data entry in the data to be determined, and N is the total number of data entries in the data to be determined.
The value range of $\omega_4$ is [0, 1]. The larger the value of $\omega_4$, the smaller the data redundancy of the data to be determined and the higher the corresponding data value.
For example, the data to be determined contains 5 data entries a, b, c, d, and e, where a, b, and e are identical and c and d are identical. Each entry from a to e is checked in turn to see whether it is a repeated data entry. Entry a appears for the first time, so its repeatability value is 0; b is identical to a and is a repeated data entry, so its repeatability value is 1; c appears for the first time, so its repeatability value is 0; d is identical to c and is a repeated data entry, so its repeatability value is 1; e is identical to a and is a repeated data entry, so its repeatability value is 1. The resulting proportion of repeated entries among the data entries contained in the data to be determined is 3/5 = 0.6. According to the above formula, the index value $\omega_4$ of the data to be determined under the data redundancy index is 0.4.
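The five-entry example can be reproduced with a short Python sketch, assuming the index value is one minus the proportion of repeated entries (which matches the 0.6 and 0.4 figures in the example). The function name `redundancy_index` is hypothetical, and only exact matches are treated as repeats; the similarity-threshold variant is not covered.

```python
def redundancy_index(entries):
    """1 minus the share of entries that repeat an earlier, identical entry."""
    seen = set()
    repeats = 0
    for entry in entries:
        key = tuple(entry)      # exact match on every data element
        if key in seen:
            repeats += 1        # repeatability value 1
        else:
            seen.add(key)       # first occurrence: repeatability value 0
    return 1 - repeats / len(entries)


a = ("milk", "acme")
c = ("tea", "zen")
entries = [a, a, c, c, a]       # a, b, c, d, e with b = e = a and d = c
print(redundancy_index(entries))
```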
5. For the case where the quality determination index includes a data scarcity index:
The index value determining module 20 is specifically configured to determine the number of occurrences, on preset platforms, of the data to be determined and of similar data that resembles the data to be determined, and to determine the index value of the data to be determined under the data scarcity index based on the number of occurrences. The fewer the occurrences, the higher the scarcity of the data to be determined.
In a specific implementation, scarcity means that the degree of scarcity of the data is calculated according to how the preset platforms and the acquired data information provide data of the same kind. The more data of the same kind there is, the lower the scarcity; the less data of the same kind there is, the higher the scarcity. The higher the scarcity of the data to be determined, the higher its quality and value.
In a specific implementation, in order to obtain similar data that resembles the data to be determined, another embodiment of the present application further includes a similar data determining module 40.
The data acquisition module 10 in the embodiment of the present application is also configured to crawl multiple data sets from the preset platforms.
Here, a preset platform can be a data trading platform or another data platform. Taking a data trading platform as an example, each data transaction corresponds to at least one kind of traded business data. When crawling data sets from the preset platform, one data set is crawled for each data transaction; each data set contains multiple data entries.
Data sets can be crawled using techniques such as web crawlers and crawling tools, which the present application does not limit.
The similar data determining module 40 is configured to parse the data to be determined and the multiple data sets respectively to determine the lexical features of the data to be determined and of each data set; to match the lexical features of the data to be determined against the lexical features of each data set for text similarity; and to determine each data set whose text similarity reaches a preset similarity threshold as similar data of the data to be determined.
In a specific implementation, the similar data determining module 40 can determine the lexical features of the data to be determined and of the data sets through the following steps:
Perform word segmentation on each acquired data set to obtain the first lexical data. Sort the first lexical data by frequency of occurrence in the corresponding data set from high to low, and select the top preset number of first lexical data. For each item of data in the data set, determine the lexical features of that data according to the frequency with which each selected first lexical data item occurs in the data set.
Perform word segmentation on the data to be determined to obtain the second lexical data. Sort the second lexical data by frequency of occurrence in the data to be determined from high to low, and select the top preset number of second lexical data. For each item of data in the data to be determined, determine the lexical features of that data according to the frequency with which each selected second lexical data item occurs in the data to be determined.
For each lexical feature in each data set, calculate the text similarity between that lexical feature and the lexical features in the data to be determined. Data sets whose text similarity is greater than or equal to the preset similarity threshold are determined as similar data of the data to be determined.
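The feature extraction and matching steps can be sketched as follows. This is a minimal Python sketch under stated assumptions: whitespace splitting stands in for word segmentation, cosine similarity over term frequencies stands in for the text-similarity measure (the text does not name one), and the function names, `top_k`, and the 0.5 threshold are all hypothetical.

```python
import math
from collections import Counter


def lexical_features(text, top_k=5):
    """Frequencies of the top_k most frequent words (whitespace segmentation)."""
    counts = Counter(text.lower().split())
    return dict(counts.most_common(top_k))


def cosine_similarity(f1, f2):
    """Cosine similarity between two word-frequency dictionaries."""
    dot = sum(f1[w] * f2[w] for w in f1 if w in f2)
    n1 = math.sqrt(sum(v * v for v in f1.values()))
    n2 = math.sqrt(sum(v * v for v in f2.values()))
    if n1 == 0 or n2 == 0:
        return 0.0
    return dot / (n1 * n2)


def similar_data_sets(target_text, data_sets, threshold=0.5, top_k=5):
    """Names of data sets whose similarity to the target reaches the threshold."""
    target = lexical_features(target_text, top_k)
    return [
        name for name, text in data_sets.items()
        if cosine_similarity(target, lexical_features(text, top_k)) >= threshold
    ]


crawled = {
    "dairy": "milk price milk shelf",
    "weather": "rain wind cloud",
}
print(similar_data_sets("milk price milk origin", crawled))
```

For Chinese-language data a dedicated word segmenter would be needed in place of `split()`.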
Further, where multiple feature words are determined for the data to be determined and for a data set, each feature word of the data to be determined can be compared for text similarity against each feature word of the data set. A feature word of the data set whose similarity reaches a first preset similarity threshold is determined to be a similar word of that feature word. When the number of similar words reaches a second preset threshold, the data to be determined and the data set are determined to be similar data.
Further, where the data to be determined and the data sets carry industry labels, the industry labels can be used directly as the feature words of the corresponding data, and the feature words compared directly for similarity.
After the similar data of the data to be determined has been determined among the multiple crawled data sets, the index value of the data to be determined under the data scarcity index can be determined according to the number of times the similar data occurs on the preset platforms.
Specifically, the quality determination value of the data to be determined under the scarcity index can be calculated using the following steps:
Determine the number of data sets that are similar data of the data to be determined;
Based on the total number of crawled data sets and the number of data sets that are similar data of the data to be determined, calculate the index value of the data to be determined under the scarcity index;
For example, the index value $\omega_5$ of the data to be determined under the data scarcity index is calculated using the following formula:
$\omega_5 = \frac{x}{y}$
where x is the number of occurrences of the data to be determined and its similar data on the preset platforms, and y is the total number of crawled data sets.
The value range of $\omega_5$ is [0, 1]. The closer $\omega_5$ is to 1, the more often similar data of the data to be determined occurs and the lower the scarcity of the data to be determined; the closer $\omega_5$ is to 0, the less often similar data occurs and the higher the scarcity of the data to be determined.
In addition, the index value $\omega_5$ of the data to be determined under the data scarcity index can also be calculated using the following formula:
$\omega_5 = 1 - e^{-x/y}$
where x is the number of occurrences of the data to be determined and its similar data on the preset platforms, and y is the total number of preset platforms.
The value range of $\omega_5$ is [0, 1]. When $\omega_5$ is close to 1, each preset platform has similar data and the scarcity of the data to be determined is lower; when $\omega_5$ equals 0, no preset platform has similar data and the scarcity of the data to be determined is higher.
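Both variants of the scarcity calculation can be sketched in Python. The first function assumes the first formula is the plain ratio of similar data sets to crawled data sets, which is how the surrounding description reads; the second implements the exponential form $1 - e^{-x/y}$. Function and parameter names are hypothetical.

```python
import math


def scarcity_ratio(similar_count, crawled_total):
    """Share of crawled data sets that are similar to the data to be determined."""
    return similar_count / crawled_total


def scarcity_exponential(occurrences, platform_total):
    """1 - exp(-x / y), with x occurrences across y preset platforms."""
    return 1 - math.exp(-occurrences / platform_total)


print(scarcity_ratio(3, 12))
print(scarcity_exponential(0, 4))
print(scarcity_exponential(4, 4))
```

With the exponential form, zero occurrences gives exactly 0, and the value rises toward 1 as the number of occurrences grows relative to the number of platforms.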
6. For the case where the quality determination index includes a data volume index:
The index value determining module 20 is specifically configured to determine the data volume contained in the data to be determined, and to determine the index value of the data to be determined under the data volume index based on that data volume. The larger the data volume, the higher the index value of the data volume index of the data to be determined.
In a specific implementation, the index value of the data to be determined under the data volume index can be determined using either of the following two methods:
First, the ratio of the data volume of the data to be determined to the total data volume of the data on the preset platforms can be used as the index value of the data volume index, or the data volume of the data to be determined can be used directly as the index value of the data volume index, as determined by actual conditions.
For example, when the ratio of the data volume of the data to be determined to the total data volume of the data on the preset platforms is used as the index value of the data volume index, the index value $\omega_6$ of the data volume index can be calculated using the following formula:
$\omega_6 = \frac{N}{P}$
where N is the data volume of the data to be determined, and P is the total data volume of the data on the preset platforms.
The value range of $\omega_6$ is [0, 1]. A value of $\omega_6$ near 0 indicates that the data volume of the data to be determined is small; a larger value indicates a larger data volume.
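The first method can be sketched in Python as the ratio of the valid data elements in the data to be determined to the total data volume on the preset platforms. The function names and the emptiness test (`None` or a blank string) are assumptions for illustration.

```python
def valid_element_count(entries):
    """Data volume: number of non-empty data elements (assumed emptiness test)."""
    return sum(
        1 for entry in entries for e in entry
        if e is not None and str(e).strip() != ""
    )


def data_volume_index(entries, platform_total_volume):
    """Data volume of the data to be determined divided by the platform total."""
    if platform_total_volume <= 0:
        raise ValueError("platform total volume must be positive")
    return valid_element_count(entries) / platform_total_volume


entries = [["milk", "acme", ""], ["tea", "zen", "2018"]]
print(data_volume_index(entries, platform_total_volume=20))
```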
Second, the index value of the data to be determined under the data volume index is calculated based on the committed data volume carried in the description information of the data to be determined, the data volume described in the description information, the data volume contained in the data to be determined, and the volume of similar data resembling the data to be determined that is obtained by collecting data from the preset platforms.
Here, the committed data volume is the data volume that the user expects to provide when providing the data to be determined.
The data volume contained in the data to be determined is the number of valid data elements it contains.
The volume of similar data resembling the data to be determined, obtained by collecting data from the preset platforms, is acquired in a way similar to the acquisition of similar data when determining the index value of the data to be determined under the data scarcity index. The specific process is:
The data acquisition module 10 crawls multiple data sets from the preset platforms. The similar data determining module 40 parses the data to be determined and the multiple data sets respectively to determine the lexical features of the data to be determined and of each data set; matches the lexical features of the data to be determined against the lexical features of each data set for text similarity; determines each data set whose text similarity reaches the preset similarity threshold as similar data of the data to be determined; and then determines the data volume of the similar data so identified, to obtain the volume of similar data resembling the data to be determined.
Specifically, the index value of the data to be determined under the data volume index can be calculated using the following formula:
where m is the data volume contained in the data to be determined; $N_1$ is the volume of similar data resembling the data to be determined, obtained by collecting data from the preset platforms; $N_2$ is the data volume described in the description information; and $N_3$ is the committed data volume.
The quality determination module 30 is configured to determine the quality determination result of the data to be determined based on the index values under the preset data quality indexes.
In a specific implementation, the quality determination module 30 can determine the quality determination result of the data to be determined using either of the following schemes:
First: according to the weight coefficients of the preset data quality indexes, perform a weighted summation of the index values of the data to be determined under the preset data quality indexes to obtain the quality determination result of the data to be determined.
Here, the weighted summation of the index values of the data to be determined under the preset data quality indexes is in effect a process of determining the quality determination result according to the differing degrees to which the data quality indexes influence the quality of the data to be determined.
The weight coefficients corresponding to different types of data to be determined may be the same or different.
For example, where the quality determination indexes include the data consistency index, data integrity index, data timeliness index, data redundancy index, data scarcity index, and data volume index, the quality determination result M of the data to be determined can be calculated according to the following formula:
$M = a_1\omega_1 + a_2\omega_2 + a_3\omega_3 + a_4\omega_4 + a_5\omega_5 + a_6\omega_6$
where $a_1$ to $a_6$ are, in order, the weight coefficients corresponding to the data consistency index, data integrity index, data timeliness index, data redundancy index, data scarcity index, and data volume index, and $\omega_1$ to $\omega_6$ are, in order, the index values corresponding to the data consistency index, data integrity index, data timeliness index, data redundancy index, data scarcity index, and data volume index.
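The weighted summation can be sketched in Python as follows. The weights and index values below are made-up illustrations, not values from the application, and choosing weights that sum to 1 is an assumption that keeps M within [0, 1] when every index value lies in [0, 1].

```python
def quality_score(index_values, weights):
    """Weighted sum M = a1*w1 + ... + a6*w6 over the six index values."""
    if len(index_values) != len(weights):
        raise ValueError("one weight is required per index value")
    return sum(a * w for a, w in zip(weights, index_values))


# Order: consistency, integrity, timeliness, redundancy, scarcity, volume.
index_values = [1.0, 0.5, 0.5, 0.4, 0.25, 0.25]   # hypothetical index values
weights = [0.2, 0.2, 0.2, 0.2, 0.1, 0.1]          # hypothetical, sum to 1

print(quality_score(index_values, weights))
```

Different types of data can be given different weight lists while reusing the same function.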
Second, the data quality determination system further includes a data quality determination model training module 50. The data quality determination model training module 50 is configured to build a data quality determination model with the preset data quality indexes as independent variables and the data quality grade as the dependent variable;
The data acquisition module 10 is also configured to obtain training data;
The index value determining module 20 is also configured to determine the index values of the training data under the preset data quality indexes and the quality determination results of the training data;
The data quality determination model training module 50 is also configured to substitute the index values determined for the training data, as independent variable values, and the quality determination results of the corresponding training data, as dependent variable values, into the data quality determination model, and to train the data quality determination model;
The quality determination module 30 is specifically configured to substitute the index values of the data to be determined under the preset data quality indexes, as independent variables, into the trained data quality determination model to obtain the quality determination result of the data to be determined.
In a specific implementation, when the data quality determination model training module builds the data quality determination model, it needs to determine the explanatory variables and the explained variable of the model, and to determine the relationship between them through the model training process described below. Several factors influence the quality of the data to be determined; these factors are taken as the corresponding data quality indices. The model is then built with the data quality indices as independent variables and the quality determination result of the data to be determined as the dependent variable.
In the embodiments of the present application, the constructed model includes, but is not limited to: an autoregressive model, a moving average model, an autoregressive moving average model, an autoregressive integrated moving average model, and a GARCH model.
After the data quality determination model is built, it is trained. The training data used in training can be obtained by the data acquisition module. It should be noted that the acquired training data may be data whose quality has already been determined, or data whose quality has not yet been determined.
For data whose quality has already been determined, the index value determining module does not need to determine its quality again. For data whose quality has not yet been determined, the index value determining module needs to determine its quality, obtaining the index values of the data under the preset data quality indices and the quality determination result of the training data.
Here, the quality determination result of the training data may be a data quality grade or a data quality score, and can be set according to actual needs.
Specifically, when the quality determination result of the data to be determined is obtained by the quality determination method provided in the embodiments of the present application, if the result is a score, the result of the weighted summation of the index values of the data to be determined under the preset data quality indices can be used directly as the score, in which case the score lies in the range [0, 1]. Alternatively, the weighted-sum result can be processed further and the processed result used as the score, for example by multiplying the weighted-sum result by 100 and using the resulting value as the quality score of the data to be determined. If the result is a grade, the weighted-sum result of the index values of the data to be determined under the preset data quality indices can be converted into the corresponding grade based on a preset conversion rule.
For example, 5 grades are set, namely A, B, C, D, and E, where the quality of the data to be determined corresponding to A is lower than that corresponding to E. The smaller the weighted-sum result of the index values of the data to be determined under the preset data quality indices, the lower the grade. The value ranges of the weighted-sum result corresponding to grades A to E are, in order: [0, 0.2), [0.2, 0.4), [0.4, 0.6), [0.6, 0.8), [0.8, 1]. Based on these value ranges, the weighted-sum result can be converted into the grade of the corresponding data to be determined.
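As a non-limiting illustration, the weighted summation and the grade conversion described above may be sketched as follows. The function names `weighted_score` and `to_grade` are assumptions made for illustration and do not appear in the present application; the sketch further assumes that the weight coefficients sum to 1 so that the weighted-sum result lies in [0, 1].

```python
from bisect import bisect_right

# Lower bounds of grades B..E, taken from the value ranges in the example above.
GRADE_BOUNDS = [0.2, 0.4, 0.6, 0.8]
GRADES = ["A", "B", "C", "D", "E"]  # A is the lowest grade, E the highest


def weighted_score(index_values, weights):
    """Weighted sum of index values; both sequences use the same index order."""
    if len(index_values) != len(weights):
        raise ValueError("index_values and weights must have the same length")
    return sum(w * v for w, v in zip(weights, index_values))


def to_grade(score):
    """Map a weighted-sum result to a grade A-E using the ranges above."""
    # Clamp to [0, 1] so that tiny floating-point overshoots do not fail.
    score = min(max(score, 0.0), 1.0)
    return GRADES[bisect_right(GRADE_BOUNDS, score)]
```

Multiplying the value returned by `weighted_score` by 100 gives the alternative percentage-style score mentioned above.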
Training the model with the training data is the process of continually adjusting the parameters of the model according to the index values of the training data and the corresponding quality determination results, so that when the model calculates the quality determination result of each item of training data from its index values under the preset data quality indices, the calculated quality determination result is consistent with the quality determination result corresponding to that training data.
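The parameter adjustment described above may be sketched, purely as an assumption-laden illustration, with a simple linear model fitted by batch gradient descent. This is a stand-in chosen for brevity and is not one of the model families listed in the present application; the names `train_linear_model` and `predict`, the learning rate, and the epoch count are all hypothetical.

```python
def predict(weights, index_values):
    """Quality result predicted from index values by a linear model."""
    return sum(w * v for w, v in zip(weights, index_values))


def train_linear_model(samples, targets, lr=0.1, epochs=2000):
    """Adjust weights so that predictions approach the known quality results.

    samples: list of index-value lists (independent-variable values)
    targets: list of quality determination results (dependent-variable values)
    """
    n_features = len(samples[0])
    weights = [0.0] * n_features
    for _ in range(epochs):
        grads = [0.0] * n_features
        for x, y in zip(samples, targets):
            error = predict(weights, x) - y
            for i, v in enumerate(x):
                # Gradient of the mean squared error with respect to weight i.
                grads[i] += 2 * error * v / len(samples)
        weights = [w - lr * g for w, g in zip(weights, grads)]
    return weights
```

After training, `predict` plays the role of substituting the index values of the data to be determined into the trained model.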
In the data quality determination system provided in the embodiments of the present application, after the data acquisition module obtains the data to be determined, the index value determining module determines the index values of the data to be determined under the preset data quality indices, and the quality determination module then determines the quality determination result of the data to be determined based on those index values. The whole process requires no human intervention, so the quality of business data can be determined more objectively and accurately. Because no human intervention is needed, the chance of people coming into contact with the business data is reduced, which lowers the possibility of the business data being leaked by people and increases the security of the business data during evaluation.
Based on the same inventive concept, the embodiments of the present application further provide a data quality determination method corresponding to the data quality determination system. Because the principle by which the method solves the problem is similar to that of the above data quality determination system, the implementation of the method may refer to the implementation of the system, and repeated description is omitted.
Referring to Fig. 3, the data quality determination method provided in the embodiments of the present application includes:
S301: Obtain data to be determined;
S302: Determine the index values of the data to be determined under the preset data quality indices;
S303: Determine the quality determination result of the data to be determined based on the index values under the preset data quality indices.
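Steps S302 and S303 may be sketched as follows, assuming that the data obtained in S301 is available as a list of records and that each data quality index is supplied as a function returning a value in [0, 1]. The name `determine_quality` and the dictionary-based interface are assumptions made for illustration only.

```python
from typing import Callable, Dict, Mapping, Sequence

Record = Mapping[str, object]
IndexFn = Callable[[Sequence[Record]], float]


def determine_quality(
    records: Sequence[Record],
    index_fns: Dict[str, IndexFn],
    weights: Dict[str, float],
) -> float:
    """Compute each index value (S302), then combine them by weighted sum (S303)."""
    # S302: one index value per preset data quality index.
    index_values = {name: fn(records) for name, fn in index_fns.items()}
    # S303: weighted summation of the index values.
    return sum(weights[name] * value for name, value in index_values.items())
```

Passing the index functions in as parameters keeps the sketch independent of how any individual index is computed.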
After the data to be determined is obtained, the embodiments of the present application determine the index values of the data to be determined under the preset data quality indices, and then determine the quality determination result of the data to be determined based on those index values. The whole quality determination process requires no human intervention, so the quality of business data can be determined more objectively and accurately. Because no human intervention is needed, the chance of people coming into contact with the business data is reduced, which lowers the possibility of the business data being leaked by people and increases the security of the business data during evaluation.
Optionally, the data quality indices include one or more of: the data consistency index, data integrity index, data age index, data redundancy index, data scarcity index, and data volume index.
Optionally, for the case where the data quality indices include the data consistency index, the data to be determined includes: data content and description information corresponding to the data to be determined;
Determining the index value of the data to be determined under the preset data quality index specifically includes: determining the degree of consistency between the data content included in the data to be determined and the description information corresponding to the data to be determined; and determining the index value of the data to be determined under the data consistency index based on the degree of consistency, where a higher degree of consistency indicates a higher index value of the data to be determined under the data consistency index.
Optionally, determining the degree of consistency between the data content included in the data to be determined and the description information corresponding to the data to be determined specifically includes: determining the degree of consistency between one or more of the following items of data content and the corresponding description information, where a higher degree of consistency between any item of data content and the corresponding description information indicates a higher index value of the data consistency index of the data to be determined:
The data volume included in the data to be determined and the data volume described in the description information of the data to be determined;
The size of the data to be determined and the size described in the description information of the data to be determined;
The data format of the data to be determined and the data format described in the description information of the data to be determined.
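One possible way to turn the three comparisons above into an index value is sketched below. The present application does not specify a formula, so the fraction-of-matching-attributes rule, the function name `consistency_index`, and the dictionary keys `record_count`, `size_bytes`, and `format` are all assumptions.

```python
def consistency_index(actual, described):
    """Fraction of described attributes that match the actual data.

    `actual` and `described` are dicts using the hypothetical keys
    "record_count", "size_bytes" and "format".
    """
    keys = [k for k in ("record_count", "size_bytes", "format") if k in described]
    if not keys:
        raise ValueError("description contains none of the compared attributes")
    matches = sum(1 for k in keys if actual.get(k) == described[k])
    # More matching attributes give a higher consistency index value.
    return matches / len(keys)
```

A tolerance-based comparison for data volume and size could replace the exact equality used here if small deviations are acceptable.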
Optionally, for the case where the data quality indices include the data integrity index,
Determining the index value of the data to be determined under the preset data quality index specifically includes:
Determining the proportion of null values in the data entries included in the data to be determined; and determining the index value of the data to be determined under the data integrity index based on the proportion of null values, where a lower proportion of null values indicates higher data integrity of the data to be determined.
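The null-value proportion may, for instance, be computed as sketched below. The sketch assumes that each data entry is a dictionary of fields, that `None` and the empty string count as null values, and that the index value is one minus the proportion of null fields; none of these choices is stated in the present application, and `integrity_index` is a hypothetical name.

```python
def integrity_index(entries):
    """Return 1 minus the proportion of null fields over all entries."""
    total = sum(len(entry) for entry in entries)
    if total == 0:
        raise ValueError("no fields to inspect")
    nulls = sum(
        1 for entry in entries for value in entry.values()
        if value is None or value == ""
    )
    # A lower proportion of null values gives a higher integrity index value.
    return 1.0 - nulls / total
```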
Optionally, for the case where the data quality indices include the data age index,
Determining the index value of the data to be determined under the preset data quality index specifically includes: determining the time interval spanned between the generation start time and the generation end time of the data to be determined, and the time difference between the generation start time of the data to be determined and the time at which the data to be determined is provided; and determining the index value of the data to be determined under the data age index based on the time interval and the time difference;
Wherein, the larger the span of the time interval, the higher the index value of the data age index of the data to be determined; and the smaller the time difference, the higher the index value of the data age index of the data to be determined.
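The two monotonic relationships above may be combined as in the following sketch. The normalisation constants `max_span_days` and `max_lag_days`, the equal weighting of the two parts, and the name `age_index` are assumptions made for illustration; the present application does not give a concrete formula.

```python
from datetime import datetime


def age_index(start, end, provided, max_span_days=365.0, max_lag_days=365.0):
    """Combine the generation time span and the provision lag into one value."""
    span_days = (end - start).total_seconds() / 86400
    lag_days = (provided - start).total_seconds() / 86400
    # A larger span gives a higher score, capped at 1.
    span_score = min(max(span_days / max_span_days, 0.0), 1.0)
    # A smaller lag between generation start and provision gives a higher score.
    lag_score = min(max(1.0 - lag_days / max_lag_days, 0.0), 1.0)
    return 0.5 * span_score + 0.5 * lag_score
```

For example, `age_index(datetime(2018, 1, 1), datetime(2018, 4, 11), datetime(2018, 4, 11), 200.0, 200.0)` covers a span of 100 days and a lag of 100 days, so both partial scores are 0.5.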
Optionally, for the case where the data quality indices include the data redundancy index,
Determining the index value of the data to be determined under the preset data quality index specifically includes: determining the proportion of duplicate entries among the data entries included in the data to be determined; and determining the index value of the data to be determined under the data redundancy index based on the proportion of duplicate entries, where a lower proportion of duplicate entries indicates lower data redundancy of the data to be determined.
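The proportion of duplicate entries may be computed as sketched below, under the assumption that each entry is a dictionary with string keys and hashable values and that an entry counts as a duplicate when an identical entry has already been seen. The name `duplicate_ratio` is hypothetical, and how the ratio is mapped to the final index value is left open here, as it is in the text above.

```python
def duplicate_ratio(entries):
    """Proportion of entries that repeat an earlier entry (0.0 means none)."""
    if not entries:
        raise ValueError("no entries to inspect")
    seen = set()
    duplicates = 0
    for entry in entries:
        # Sort the items so that field order does not affect the comparison.
        key = tuple(sorted(entry.items()))
        if key in seen:
            duplicates += 1
        else:
            seen.add(key)
    return duplicates / len(entries)
```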
Optionally, the method further includes: crawling multiple data sets from a preset platform; parsing the data to be determined and the multiple data sets respectively to determine the lexical features of the data to be determined and of each data set; performing text similarity matching between the lexical features of the data to be determined and the lexical features of each data set; and determining the data sets whose text similarity reaches a preset similarity threshold as similar data of the data to be determined.
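The similarity matching above may be illustrated with the following sketch, which assumes that the lexical features are simply the set of lower-cased word tokens and that text similarity is measured by the Jaccard coefficient. Both choices, the threshold of 0.5, and the names `lexical_features`, `jaccard`, and `find_similar` are assumptions; the present application does not name a particular feature extractor or similarity measure, and the crawling step is omitted.

```python
import re


def lexical_features(text):
    """Set of lower-cased word tokens, used here as the lexical features."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def find_similar(target_text, candidates, threshold=0.5):
    """Names of candidate data sets whose similarity reaches the threshold."""
    target = lexical_features(target_text)
    return [
        name for name, text in candidates.items()
        if jaccard(target, lexical_features(text)) >= threshold
    ]
```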
Optionally, for the case where the data quality indices include the data scarcity index,
Determining the index value of the data to be determined under the preset data quality index specifically includes: determining the number of occurrences, on a preset platform, of the data to be determined and of similar data that is similar to the data to be determined; and determining the index value of the data to be determined under the data scarcity index based on the number of occurrences, where fewer occurrences indicate higher scarcity of the data to be determined.
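Any decreasing function of the number of occurrences satisfies the relationship above. The sketch below uses 1 / (1 + n) as one assumed choice; the formula and the name `scarcity_index` are not taken from the present application.

```python
def scarcity_index(occurrences):
    """Map an occurrence count to (0, 1]; fewer occurrences give a higher value."""
    if occurrences < 0:
        raise ValueError("occurrences must be non-negative")
    return 1.0 / (1.0 + occurrences)
```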
Optionally, for the case where the data quality indices include the data volume index,
Determining the index value of the data to be determined under the preset data quality index specifically includes: determining the data volume included in the data to be determined; and determining the index value of the data to be determined under the data volume index based on the data volume, where a larger data volume indicates a higher index value of the data volume index of the data to be determined.
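To keep the data volume index on the same [0, 1] scale as the other index values, the data volume may, for example, be divided by a reference volume and capped at 1, as sketched below. The reference volume and the name `volume_index` are assumptions made for illustration.

```python
def volume_index(record_count, reference_count):
    """Scale a data volume to [0, 1] relative to a reference volume."""
    if record_count < 0 or reference_count <= 0:
        raise ValueError("record_count must be >= 0 and reference_count > 0")
    # A larger data volume gives a higher index value, capped at 1.
    return min(record_count / reference_count, 1.0)
```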
Optionally, determining the quality determination result of the data to be determined based on the index values under the preset data quality indices specifically includes: performing weighted summation of the index values of the data to be determined under the preset data quality indices according to the weight coefficients of the preset data quality indices, to obtain the quality determination result of the data to be determined.
Optionally, the method further includes: building a data quality determination model with the preset data quality indices as independent variables and the data quality grade as the dependent variable;
Obtaining training data;
Determining the index values of the training data under the preset data quality indices and the quality determination results of the training data;
Taking the index values determined for the training data as independent-variable values and the quality determination results of the corresponding training data as dependent-variable values, substituting them into the data quality determination model, and training the data quality determination model;
Substituting the index values of the data to be determined under the preset data quality indices, as independent variables, into the trained data quality determination model to obtain the quality determination result of the data to be determined.
As shown in Fig. 4, an embodiment of the present application provides a computer device that includes a processor 41, a memory 42, and a bus 43. The memory 42 stores execution instructions. When the device runs, the processor 41 and the memory 42 communicate via the bus 43, and the processor 41 executes the execution instructions so that the device performs the above data quality determination method.
Corresponding to the data quality determination method in Fig. 3, an embodiment of the present application further provides a computer-readable storage medium storing a computer program which, when run by a processor, performs the steps of the above data quality determination method.
The computer program product of the data quality determination system and method provided in the embodiments of the present application includes a computer-readable storage medium storing program code. The instructions included in the program code can be used to perform the method described in the foregoing method embodiments; for the specific implementation, refer to the method embodiments, which are not repeated here.
Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the system and device described above may refer to the corresponding processes in the foregoing method embodiments, and are not repeated here.
If the functions are implemented in the form of software functional units and sold or used as independent products, they can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present application in essence, or the part that contributes to the prior art, or a part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to perform all or part of the steps of the methods described in the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The above are only specific embodiments of the present application, but the protection scope of the present application is not limited thereto. Any change or replacement that a person skilled in the art could readily conceive within the technical scope disclosed in the present application shall fall within the protection scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.

Claims (10)

1. A data quality determination system, characterized by comprising:
A data acquisition module, configured to obtain data to be determined;
An index value determining module, configured to determine the index values of the data to be determined under preset data quality indices;
A quality determination module, configured to determine the quality determination result of the data to be determined based on the index values under the preset data quality indices.
2. The system according to claim 1, characterized in that the data quality indices include one or more of: a data consistency index, a data integrity index, a data age index, a data redundancy index, a data scarcity index, and a data volume index.
3. The system according to claim 2, characterized in that, for the case where the data quality indices include the data consistency index, the data to be determined includes: data content and description information corresponding to the data to be determined;
The index value determining module is specifically configured to determine the degree of consistency between the data content included in the data to be determined and the description information corresponding to the data to be determined; and to determine the index value of the data to be determined under the data consistency index based on the degree of consistency, where a higher degree of consistency indicates a higher index value of the data to be determined under the data consistency index.
4. The system according to claim 3, characterized in that the index value determining module is specifically configured to determine the degree of consistency between one or more of the following items of data content and the corresponding description information, where a higher degree of consistency between any item of data content and the corresponding description information indicates a higher index value of the data consistency index of the data to be determined:
The data volume included in the data to be determined and the data volume described in the description information of the data to be determined;
The size of the data to be determined and the size described in the description information of the data to be determined;
The data format of the data to be determined and the data format described in the description information of the data to be determined.
5. The system according to claim 2, characterized in that, for the case where the data quality indices include the data integrity index, the index value determining module is specifically configured to determine the proportion of null values in the data entries included in the data to be determined; and to determine the index value of the data to be determined under the data integrity index based on the proportion of null values, where a lower proportion of null values indicates higher data integrity of the data to be determined;
For the case where the data quality indices include the data age index, the index value determining module is specifically configured to determine the time interval spanned between the generation start time and the generation end time of the data to be determined, and the time difference between the generation start time of the data to be determined and the time at which the data to be determined is provided; and to determine the index value of the data to be determined under the data age index based on the time interval and the time difference; wherein the larger the span of the time interval, the higher the index value of the data age index of the data to be determined, and the smaller the time difference, the higher the index value of the data age index of the data to be determined;
For the case where the data quality indices include the data redundancy index, the index value determining module is specifically configured to determine the proportion of duplicate entries among the data entries included in the data to be determined; and to determine the index value of the data to be determined under the data redundancy index based on the proportion of duplicate entries, where a lower proportion of duplicate entries indicates lower data redundancy of the data to be determined;
For the case where the data quality indices include the data volume index, the index value determining module is specifically configured to determine the data volume included in the data to be determined; and to determine the index value of the data to be determined under the data volume index based on the data volume, where a larger data volume indicates a higher index value of the data volume index of the data to be determined.
6. The system according to claim 2, characterized by further comprising: a similar data determining module;
The data acquisition module is further configured to crawl multiple data sets from a preset platform;
The similar data determining module is configured to parse the data to be determined and the multiple data sets respectively to determine the lexical features of the data to be determined and of each data set; to perform text similarity matching between the lexical features of the data to be determined and the lexical features of each data set; and to determine the data sets whose text similarity reaches a preset similarity threshold as similar data of the data to be determined.
7. The system according to claim 6, characterized in that, for the case where the data quality indices include the data scarcity index,
The index value determining module is specifically configured to determine the number of occurrences, on the preset platform, of the data to be determined and of similar data that is similar to the data to be determined; and to determine the index value of the data to be determined under the data scarcity index based on the number of occurrences, where fewer occurrences indicate higher scarcity of the data to be determined.
8. The system according to claim 1, characterized in that the quality determination module is specifically configured to perform weighted summation of the index values of the data to be determined under the preset data quality indices according to the weight coefficients of the preset data quality indices, to obtain the quality determination result of the data to be determined.
9. The system according to claim 1, characterized by further comprising: a data quality determination model training module;
The data quality determination model training module is configured to build a data quality determination model with the preset data quality indices as independent variables and the data quality grade as the dependent variable;
The data acquisition module is further configured to obtain training data;
The index value determining module is further configured to determine the index values of the training data under the preset data quality indices and the quality determination results of the training data;
The data quality determination model training module is further configured to take the index values determined for the training data as independent-variable values and the quality determination results of the corresponding training data as dependent-variable values, substitute them into the data quality determination model, and train the data quality determination model;
The quality determination module is specifically configured to substitute the index values of the data to be determined under the preset data quality indices, as independent variables, into the trained data quality determination model to obtain the quality determination result of the data to be determined.
10. A data quality determination method, characterized in that the method comprises:
Obtaining data to be determined;
Determining the index values of the data to be determined under preset data quality indices;
Determining the quality determination result of the data to be determined based on the index values under the preset data quality indices.
CN201810511444.XA 2018-05-24 2018-05-24 A kind of quality of data determines system and method Pending CN108829750A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810511444.XA CN108829750A (en) 2018-05-24 2018-05-24 A kind of quality of data determines system and method

Publications (1)

Publication Number Publication Date
CN108829750A true CN108829750A (en) 2018-11-16

Family

ID=64145374

Country Status (1)

Country Link
CN (1) CN108829750A (en)

Citations (17)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101576893A (en) * 2008-05-09 2009-11-11 北京世纪拓远软件科技发展有限公司 Method and system for analyzing data quality
CN101719139A (en) * 2009-11-10 2010-06-02 南京联创科技集团股份有限公司 Method for monitoring data quality based on index set
CN101894319A (en) * 2010-06-28 2010-11-24 中国烟草总公司湖南省公司 Tobacco enterprise data quality management system and method
CN103247008A (en) * 2013-05-07 2013-08-14 国家电网公司 Quality evaluation method of electricity statistical index data
CN103544314A (en) * 2013-11-04 2014-01-29 北京中搜网络技术股份有限公司 Searching data quality statistical method
CN104462744A (en) * 2014-10-09 2015-03-25 广东工业大学 Data quality control method suitable for cardiovascular remote monitoring system
CN105824806A (en) * 2016-06-13 2016-08-03 腾讯科技(深圳)有限公司 Quality evaluation method and device for public accounts
CN106257511A (en) * 2016-04-14 2016-12-28 江苏瑞中数据股份有限公司 A kind of grid faults characteristics quality testing method
CN106355447A (en) * 2016-08-31 2017-01-25 国信优易数据有限公司 Price evaluation method and system for data commodities
CN106469395A (en) * 2016-08-31 2017-03-01 国信优易数据有限公司 A kind of data commodity dynamic comprehensive appraisal procedure and system
CN106503912A (en) * 2016-10-27 2017-03-15 国信优易数据有限公司 A kind of data service system
CN106845846A (en) * 2017-01-23 2017-06-13 重庆邮电大学 Big data asset evaluation method
CN106934493A (en) * 2017-02-28 2017-07-07 北京科技大学 A kind of construction method of power customer appraisal Model
CN107315968A (en) * 2017-06-29 2017-11-03 国信优易数据有限公司 A kind of data processing method and equipment
CN107463532A (en) * 2017-06-28 2017-12-12 国网上海市电力公司 A kind of mass analysis method of electric power statistics
CN107491381A (en) * 2017-07-04 2017-12-19 广西电网有限责任公司电力科学研究院 A kind of equipment condition monitoring quality of data evaluating system
CN107704806A (en) * 2017-09-01 2018-02-16 深圳市唯特视科技有限公司 A kind of method that quality of human face image prediction is carried out based on depth convolutional neural networks

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112183952A (en) * 2020-09-08 2021-01-05 支付宝(杭州)信息技术有限公司 Index quality supervision processing method and device and electronic equipment
CN117273552A (en) * 2023-11-22 2023-12-22 山东顺国电子科技有限公司 Big data intelligent treatment decision-making method and system based on machine learning
CN117273552B (en) * 2023-11-22 2024-02-13 山东顺国电子科技有限公司 Big data intelligent treatment decision-making method and system based on machine learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
CB02 Change of applicant information

Address after: 100070, No. 101-8, building 1, 31, zone 188, South Fourth Ring Road, Beijing, Fengtai District

Applicant after: Guoxin Youyi Data Co.,Ltd.

Address before: 100070, No. 188, building 31, headquarters square, South Fourth Ring Road West, Fengtai District, Beijing

Applicant before: SIC YOUE DATA Co.,Ltd.

RJ01 Rejection of invention patent application after publication

Application publication date: 20181116