CN106845846A - Big data asset evaluation method - Google Patents

Publication number
CN106845846A
Authority
CN
China
Prior art keywords
data
attribute
value
sigma
tables
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710058720.7A
Other languages
Chinese (zh)
Inventor
卓颋
殷荣华
刘洪明
舒夕珂
曹慧英
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing Soft Cloud Technology Co Ltd
Chongqing University of Post and Telecommunications
Original Assignee
Beijing Soft Cloud Technology Co Ltd
Chongqing University of Post and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing Soft Cloud Technology Co Ltd and Chongqing University of Post and Telecommunications
Priority to CN201710058720.7A
Publication of CN106845846A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q10/00 Administration; Management
    • G06Q10/06 Resources, workflows, human or project management; Enterprise or organisation planning; Enterprise or organisation modelling
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q40/00 Finance; Insurance; Tax strategies; Processing of corporate or income taxes

Landscapes

  • Engineering & Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Economics (AREA)
  • Strategic Management (AREA)
  • Theoretical Computer Science (AREA)
  • Human Resources & Organizations (AREA)
  • Development Economics (AREA)
  • Marketing (AREA)
  • Entrepreneurship & Innovation (AREA)
  • Physics & Mathematics (AREA)
  • General Business, Economics & Management (AREA)
  • General Physics & Mathematics (AREA)
  • Finance (AREA)
  • Educational Administration (AREA)
  • Technology Law (AREA)
  • Game Theory and Decision Science (AREA)
  • Accounting & Taxation (AREA)
  • Operations Research (AREA)
  • Quality & Reliability (AREA)
  • Tourism & Hospitality (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention discloses a big data asset evaluation method comprising: first, data quality assessment, where the data quality indicators are accuracy, completeness, consistency and timeliness; second, data scale assessment, where the data scale indicators are the number of data attributes, the number of data tuples and the unit information content; third, data content assessment, where data content includes transaction data, personal information, commodity information, production management data, user evaluation data and social network data; fourth, industry value calculation; and fifth, data asset value calculation. The method provides a specific quantitative standard for evaluating data assets, makes the evaluation process simpler and clearer, eliminates the influence of the evaluator's subjective factors, and brings the evaluation results closer to reality.

Description

Big data asset evaluation method
Technical field
The present invention relates to the technical field of asset evaluation, and in particular to a method for evaluating data assets.
Background
The value of data differs from industry to industry, and tax revenue reflects the size of an industry's business transaction amounts. Therefore, based on sources such as the tax yearbook, data is divided by industry into:
(1) agriculture data
(2) mining industry data
(3) manufacturing industry data
(4) electricity, heat, gas and water production and supply industry data
(5) construction industry data
(6) wholesale and retail industry data
(7) transportation, storage and postal industry data
(8) accommodation and catering industry data
(9) information transmission, software and information technology service industry data
(10) financial industry data
(11) real estate industry data
(12) leasing and business services industry data
(13) education data
(14) health and social work data
(15) culture, sports and entertainment industry data
(16) public administration, social security and social organization data
(17) other industry data
Each data file generally contains several kinds of information, so data can be further divided by content into:
(1) transaction data
(2) personal information
(3) commodity (service) information
(4) production management data
(5) user evaluation data
(6) social network data
Here, personal information includes vendor information and consumer information. Note that each data file contains one or more of the six classes of data above.
In recent years, as the term "big data" has appeared more and more often in daily life, the evaluation of data assets has become a topic of wide public interest. Current research on data assets is still incomplete. Since intangible asset evaluation has already produced certain research results in China, and data assets are a special kind of intangible asset, their valuation can be linked to that of ordinary intangible assets. Sun Rongling et al. first studied the quantification of the value and value realization of intangible assets, but the traditional evaluation methods were relatively coarse. Chen Changyun then proposed the Black-Scholes option pricing model and the EVA method and introduced them into the evaluation of overall enterprise value; the model is more accurate, but it does not consider the differences between enterprises. Through the continued research of experts and scholars, a relatively complete intangible asset evaluation system has since formed, consisting mainly of the income approach, the market approach and the cost approach. However, the evaluation criteria and elements of data assets still conflict with these approaches, so they cannot be applied to data assets as they stand. At the same time, the inconsistent definitions of data asset value, the lack of valuation models or reference models and of a system of valuation dimensions, and the absence of a specific quantitative standard for data asset evaluation make the work of researchers more difficult.
In addition, different categories of data differ in importance for evaluation. Treating them indiscriminately as a single class produces analysis results that are detached from reality and run counter to society's actual demand for data. The evaluation structure for data assets should also take more aspects into account, and past research is structurally deficient in this respect.
Summary of the invention
In view of this, the object of the present invention is to address the shortcomings of earlier methods by proposing a new method for evaluating data assets.
The big data asset evaluation method of the present invention is characterized by comprising:
First, data quality assessment, including:
1. Calculation of data accuracy
A training set, a validation set and an accuracy prediction set are first sampled from the data table. Each time, one predictable attribute f in the training set is set as the class label, a classifier is trained, and its performance is checked on the validation set. The classifier is then used to predict the value of attribute f for each tuple in the prediction set. If the predicted value agrees with the actual value (for a numeric attribute, if their difference does not exceed a given threshold, such as the standard deviation), the attribute value is considered correct, and the proportion of correctly predicted tuples is the accuracy $a_f$ of the data table on that attribute. This process is repeated for each attribute in the data table to obtain the accuracy $a_j$ of each attribute:
$$a_j = \frac{nr_j}{n_t}$$
where j = 1, 2, ..., m, and m is the number of predictable attributes;
$n_t$ is the number of tuples in the prediction set, and $nr_j$ is the number of tuples in the prediction set that are correctly classified. The weighted arithmetic mean of the $a_j$ gives the comprehensive accuracy A of the data table:
$$A = \frac{\sum_{j=1}^{m} wf_j\, a_j}{m}$$
where j is the index of the predicted attribute and $wf_j$ is the weight of attribute j. The weight is determined by the value range and degree of dispersion of attribute j: the larger the range and the higher the dispersion, the lower the prediction accuracy, so the smaller the weight that should be assigned. The weight is calculated as:
$$wf_j = \left(1 - \frac{h_j}{\sum_{j=1}^{m} h_j}\right) \Big/ (m - 1)$$
where $h_j$ is the entropy of attribute j, which reflects the size of the attribute's value range and its degree of dispersion, and is calculated as:
$$h_j = -\sum_{f=1}^{v} p_f \log_2 (p_f)$$
where v is the number of distinct values and $p_f$ is the probability that the attribute takes its f-th value;
Finally, the total accuracy of the whole data set is:
$$A = \frac{\sum_{i=1}^{t} wt_i\, A_i}{t}$$
where $A_i$ is the comprehensive accuracy of table i, $wt_i$ is the weight of table i, and t is the total number of tables in the evaluated data set. The weight is calculated as:
$$wt_i = \frac{nt_i \times nf_i}{\sum_{i=1}^{t} nt \times nf}$$
where $nt_i$ is the number of tuples of table i, $nf_i$ is the number of attributes of table i, nt is the number of tuples of the whole data set, and nf is the number of attributes of the whole data set;
2. Calculation of data completeness I
$$I = \frac{n_{null}}{n_{item}}$$
where $n_{null}$ is the number of data items that are missing or null, and $n_{item}$ is the total number of data items;
3. Calculation of data consistency C
$$C_i = 1 - \frac{n_{name} + n_{code} + n_{form}}{fn_i}$$
$$W_i = \frac{fn_i}{\sum_{i=1}^{L} fn_i}$$
$$C = \sum_{i=1}^{L} W_i C_i$$
These formulas take one database in the data set as the object of examination, where $C_i$ is the consistency of the i-th database of the evaluated data set; $fn_i$ is the total number of attributes in the i-th database; $n_{name}$ is the number of attributes in the i-th database with inconsistent naming conventions; $n_{code}$ is the number of attributes in the i-th database with inconsistent data codes; $n_{form}$ is the number of attributes in the i-th database whose input-field formats are inconsistent; L is the number of databases contained in the evaluated data set; and $W_i$ is the weight of the i-th database;
4. Calculation of data time value T
$$C(t_c, t_p) = e^{-a(t_c - t_p)}, \qquad T = C(t_c, t_p) = e^{-0.1(t_c - t_p)}$$
where $t_p$ is the time at which the information was published, $t_c$ is the current time, $C(t_c, t_p)$ is the influence of the information at time $t_c$, that is, its time value at $t_c$, and a is the aging rate coefficient of the information, which is set to 0.1;
5. The data quality is assessed by the formula
$$Q_i = \frac{1}{4}(A + I + C + T)$$
where $Q_i$ is the data quality factor of the i-th class of data, classified by data content;
Second, data scale assessment, including:
1. Calculation of the number of data attributes
1) Calculating the attribute count of numeric data
(1) The correlation coefficient $r_{A,B}$ of attributes A and B is calculated by the formula
$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\sigma_A\sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\bar{A}\bar{B}}{n\sigma_A\sigma_B}$$
where n is the number of data tuples, $a_i$ and $b_i$ are the values of tuple i on A and B respectively, $\bar{A}$ and $\bar{B}$ are the means of A and B, and $\sigma_A$ and $\sigma_B$ are the standard deviations of A and B;
(2) After the correlation coefficients are obtained, the attribute count of the numeric data is compressed to obtain the sum of the attribute counts of the attributes;
2) Calculating the attribute count of nominal (categorical) data
(1) Correlation is judged by the $\chi^2$ test:
$$\chi^2 = \sum_{i=1}^{c}\sum_{j=1}^{r}\frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$
where $o_{ij}$ is the observed frequency of the joint event $(A_i, B_j)$ and $e_{ij}$ is the expected frequency of $(A_i, B_j)$:
$$e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{n}$$
where n is the number of data tuples, $count(A = a_i)$ is the number of tuples with value $a_i$ on A, and $count(B = b_j)$ is the number of tuples with value $b_j$ on B. The $\chi^2$ statistic tests the hypothesis that A and B are independent, based on a significance level, with $(R-1)\times(C-1)$ degrees of freedom. The $\chi^2$ value is calculated by the above formula and compared with the rejection region of the $\chi^2$ test to judge whether the two attributes are correlated;
Repeated computational checks show that $\chi^2 = n$ when an attribute is compared with itself. Therefore, provided that $\chi^2 > 10.828$, $r_{A,B}$ can be used as the degree of correlation between the two attributes,
where R and C are the numbers of categories of the categorical variables;
(2) After the correlation coefficients are obtained, the attribute count of the data is compressed. The attribute compression steps are:
① A correlation matrix R is built,
where $r_{ij}$ is the degree of correlation between attributes $f_i$ and $f_j$, and $R_i$ is the sum of the correlations between attribute $f_i$ and the other attributes;
② The rows of matrix R are sorted by $R_i$ in descending order;
③ A column $f_0$ is added to represent the initial scale benchmark of a single attribute;
④ The compression matrix is obtained;
⑤ The elements on the diagonal are summed to give the attribute count after compression:
$$fn_c = r'_{11} + r'_{22} + \cdots + r'_{nn}$$
2. The number of data tuples $tn_j$ in the data table is obtained by direct counting;
3. Calculation of unit information content
(1) The information entropy of a discrete attribute is calculated as:
$$H(X) = -\sum_{i} P(x_i)\log_2 P(x_i)$$
where $P(x_i)$ is the probability with which each attribute value occurs;
(2) Calculation of the information entropy of a continuous attribute:
A discretization method is first selected and applied to the attribute, and the entropy is then calculated with the formula for discrete attributes;
(3) After the information entropy of each attribute is obtained, the average information entropy of the attributes is calculated:
$$\bar{H} = \frac{1}{fn}\sum_{j=1}^{fn} H_j$$
where $H_j$ is the information entropy of the j-th attribute and fn is the number of attributes of the single data table before compression;
The scale of a single data table is then calculated by a formula
in which S is the data scale measurement factor of the data table (in bits), $fn_c$ is the number of data attributes of the table after compression, tn is the number of tuples of the table, and $\bar{H}$ is the average information entropy of all attributes;
Third, data content assessment
A comparison matrix $B = (b_{ij})_{n\times n}$ is constructed using the AHP three-scale method, where $b_{ij}$ is the scale value obtained by comparing elements on the same level.
The importance ranking index $r_i$ of each element is then calculated.
Let $r_{max} = \max\{r_i\}$, $r_{min} = \min\{r_i\}$ and $b_m = r_{max}/r_{min}$; the judgment matrix $C = (c_{ij})_{n\times n}$ is then obtained.
After the judgment matrix is obtained, the calculation and check proceed in the following steps:
(1) The weights are calculated by the root method:
$$W_i = \frac{\left(\prod_{j=1}^{n} c_{ij}\right)^{1/n}}{\sum_{k=1}^{n}\left(\prod_{j=1}^{n} c_{kj}\right)^{1/n}}$$
Calculation steps: ① multiply the elements of C row by row to obtain a new vector;
② take the n-th root of each component of the new vector;
③ normalize the resulting vector to obtain the weight vector;
(2) Calculate the consistency index CI:
$$CI = \frac{\lambda_{max} - n}{n - 1}$$
where $\lambda_{max}$ is the largest eigenvalue of the judgment matrix C;
(3) Look up the random consistency index RI;
(4) Calculate the consistency ratio CR:
$$CR = \frac{CI}{RI}$$
When CR < 0.10, the consistency of the judgment matrix is considered acceptable; otherwise the judgment matrix should be adjusted appropriately. The weight of each class of data, classified by content, is thus obtained;
Fourth, industry value calculation
1. The industry with the highest tax revenue is taken, and its value score is set to 100;
2. The tax revenue of each other industry is divided by the tax revenue of the highest industry and multiplied by 100 to obtain the industry value of that industry;
Fifth, data asset value calculation
1. The data quality factor $Q_{ij}$, the data scale factor $S_{ij}$ and the content-classification weight $W_i$ are multiplied together. If the i-th class of data contains multiple data tables, each table is calculated individually and the results of these tables are then summed;
2. Every content class is calculated by the above method, and the results are summed in turn;
3. The accumulated result is multiplied by the calculated industry value to obtain the value score V of the data asset:
$$V = \text{industry value} \times \sum_{i}\sum_{j} W_i\, Q_{ij}\, S_{ij}$$
4. The value of the data asset is assessed by the value score V.
Beneficial effects of the present invention:
The big data asset evaluation method of the present invention provides a specific quantitative standard for evaluating data assets, makes the evaluation process simpler and clearer, eliminates the influence of the evaluator's subjective factors, and brings the evaluation results closer to reality.
Brief description of the drawings
Fig. 1 is the overall structure diagram of data asset value assessment.
Detailed description of the embodiments
The invention is further described below with reference to the accompanying drawing and an embodiment.
The big data asset evaluation method of this embodiment comprises:
First, data quality assessment, including:
1. Calculation of data accuracy. Data accuracy describes whether the data agrees with the characteristics of the objective entity it corresponds to;
A training set, a validation set and an accuracy prediction set are first sampled from the data table. Each time, one predictable attribute f in the training set is set as the class label, a classifier is trained, and its performance is checked on the validation set. The classifier is then used to predict the value of attribute f for each tuple in the prediction set. If the predicted value agrees with the actual value (for a numeric attribute, if their difference does not exceed a given threshold), the attribute value is considered correct, and the proportion of correctly predicted tuples is the accuracy $a_f$ of the data table on that attribute. This process is repeated for each attribute in the data table to obtain the accuracy $a_j$ of each attribute:
$$a_j = \frac{nr_j}{n_t}$$
where j = 1, 2, ..., m, and m is the number of predictable attributes;
$n_t$ is the number of tuples in the prediction set, and $nr_j$ is the number of tuples in the prediction set that are correctly classified. The classification algorithm can be chosen freely (for example, decision tree induction algorithms such as C4.5 and CART);
The weighted arithmetic mean of the $a_j$ gives the comprehensive accuracy A of the data table:
$$A = \frac{\sum_{j=1}^{m} wf_j\, a_j}{m}$$
where j is the index of the predicted attribute and $wf_j$ is the weight of attribute j. The weight is determined by the value range and degree of dispersion of attribute j: the larger the range and the higher the dispersion, the lower the prediction accuracy, so the smaller the weight that should be assigned. The weight is calculated as:
$$wf_j = \left(1 - \frac{h_j}{\sum_{j=1}^{m} h_j}\right) \Big/ (m - 1)$$
where $h_j$ is the entropy of attribute j, which reflects the size of the attribute's value range and its degree of dispersion, and is calculated as:
$$h_j = -\sum_{f=1}^{v} p_f \log_2 (p_f)$$
where v is the number of distinct values and $p_f$ is the probability that the attribute takes its f-th value;
Finally, the total accuracy of the whole data set is:
$$A = \frac{\sum_{i=1}^{t} wt_i\, A_i}{t}$$
where $A_i$ is the comprehensive accuracy of table i, $wt_i$ is the weight of table i, and t is the total number of tables in the evaluated data set. The weight is calculated as:
$$wt_i = \frac{nt_i \times nf_i}{\sum_{i=1}^{t} nt \times nf}$$
where $nt_i$ is the number of tuples of table i, $nf_i$ is the number of attributes of table i, nt is the number of tuples of the whole data set, and nf is the number of attributes of the whole data set;
Predictable attributes: some attributes have a very large value range and a certain randomness. Some of this information often involves personal privacy and generally has to be desensitized in applications involving big data trading and analysis, for example names, telephone numbers and addresses. Other attributes have no physical meaning, for example tuple IDs and certain item codes. Evaluating the accuracy of such attributes is unnecessary, so they are called unpredictable attributes, and all other attributes are called predictable attributes;
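The per-attribute accuracy and the entropy-based attribute weights described above can be sketched in Python as follows. This is a minimal illustration, not the patented implementation: the function names, the `tol` threshold parameter and the sample columns are assumptions, and training the classifier itself is left out.

```python
import math
from collections import Counter

def attribute_accuracy(predicted, actual, tol=None):
    """a_j = nr_j / n_t: share of prediction-set tuples predicted correctly.

    tol is a hypothetical threshold for numeric attributes; when it is None,
    predicted and actual values must match exactly.
    """
    if tol is None:
        correct = sum(p == a for p, a in zip(predicted, actual))
    else:
        correct = sum(abs(p - a) <= tol for p, a in zip(predicted, actual))
    return correct / len(actual)

def attribute_entropy(values):
    """h_j = -sum_f p_f * log2(p_f) over the distinct values of one attribute."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def entropy_weights(entropies):
    """wf_j = (1 - h_j / sum(h)) / (m - 1); requires m >= 2 and sum(h) > 0."""
    m = len(entropies)
    total = sum(entropies)
    return [(1 - h / total) / (m - 1) for h in entropies]

# Hypothetical example: three predictable attributes of one table.
columns = [
    ["a", "b", "a", "b"],  # two equally likely values
    ["w", "x", "y", "z"],  # four equally likely values
    ["m", "m", "n", "n"],  # two equally likely values
]
entropies = [attribute_entropy(col) for col in columns]
weights = entropy_weights(entropies)
```

In this example the second attribute has the highest entropy and therefore receives the smallest weight, and the weights sum to 1.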
2. Calculation of data completeness I. Data completeness describes whether the data has missing records or missing fields:
$$I = \frac{n_{null}}{n_{item}}$$
where $n_{null}$ is the number of data items that are missing or null, and $n_{item}$ is the total number of data items;
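A short Python sketch of this ratio, following the formula I = n_null / n_item as it appears in claim 1. The representation of a table as a list of rows with `None` for missing items, the function name and the sample rows are assumptions. As printed, the ratio is the share of missing items; taking its complement as a score that rises with quality is an assumption of this sketch, not something the text states.

```python
def missing_ratio(rows):
    """n_null / n_item, counting None as a missing or null data item."""
    n_item = sum(len(row) for row in rows)
    n_null = sum(value is None for row in rows for value in row)
    return n_null / n_item

# Hypothetical table with 2 missing items out of 6.
rows = [
    ["Li", 34, None],
    [None, 28, "Beijing"],
]
ratio = missing_ratio(rows)
# ASSUMPTION: complement used as a "higher is better" completeness score.
completeness = 1 - ratio
```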
3. Calculation of data consistency C. Data consistency describes whether the value of the same attribute of the same entity is consistent across different systems or data sets:
$$C_i = 1 - \frac{n_{name} + n_{code} + n_{form}}{fn_i}$$
$$W_i = \frac{fn_i}{\sum_{i=1}^{L} fn_i}$$
$$C = \sum_{i=1}^{L} W_i C_i$$
These formulas take one database in the data set as the object of examination, where $C_i$ is the consistency of the i-th database of the evaluated data set; $fn_i$ is the total number of attributes in the i-th database; $n_{name}$ is the number of attributes in the i-th database with inconsistent naming conventions; $n_{code}$ is the number of attributes in the i-th database with inconsistent data codes; $n_{form}$ is the number of attributes in the i-th database whose input-field formats are inconsistent; L is the number of databases contained in the evaluated data set; and $W_i$ is the weight of the i-th database;
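The consistency calculation can be sketched as below, using the per-database score C_i, the attribute-count weight W_i and the weighted sum C from claim 1. The dictionary layout, function names and the counts in the example are hypothetical.

```python
def database_consistency(fn, n_name, n_code, n_form):
    """C_i = 1 - (n_name + n_code + n_form) / fn_i for one database."""
    return 1 - (n_name + n_code + n_form) / fn

def dataset_consistency(databases):
    """C = sum_i W_i * C_i with W_i = fn_i / sum(fn)."""
    total_fn = sum(db["fn"] for db in databases)
    return sum(
        (db["fn"] / total_fn) * database_consistency(**db) for db in databases
    )

# Hypothetical data set with two databases.
databases = [
    {"fn": 10, "n_name": 1, "n_code": 1, "n_form": 0},  # C_1 = 0.8, W_1 = 0.25
    {"fn": 30, "n_name": 0, "n_code": 3, "n_form": 0},  # C_2 = 0.9, W_2 = 0.75
]
C = dataset_consistency(databases)
```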
4. Calculation of data time value T
$$C(t_c, t_p) = e^{-a(t_c - t_p)}, \qquad T = C(t_c, t_p) = e^{-0.1(t_c - t_p)}$$
where $t_p$ is the time at which the information was published, $t_c$ is the current time, $C(t_c, t_p)$ is the influence of the information at time $t_c$, that is, its time value at $t_c$, and a is the aging rate coefficient of the information, which is set to 0.1;
5. The data quality is assessed by the formula
$$Q_i = \frac{1}{4}(A + I + C + T)$$
where $Q_i$ is the data quality factor of the i-th class of data, classified by data content;
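The time value and the quality factor are simple enough to sketch directly from the formulas in claim 1, T = e^{-a(t_c - t_p)} with a = 0.1 and Q_i = (A + I + C + T) / 4. The function names are assumptions, and the text does not state the time unit, so the caller is assumed to pass both times in the same unit.

```python
import math

def time_value(t_current, t_published, a=0.1):
    """T = exp(-a * (t_c - t_p)); a = 0.1 is the aging rate coefficient."""
    return math.exp(-a * (t_current - t_published))

def quality_factor(accuracy, completeness, consistency, timeliness):
    """Q_i = (A + I + C + T) / 4, the unweighted mean of the four indicators."""
    return (accuracy + completeness + consistency + timeliness) / 4

# Hypothetical indicator values for one class of data.
T = time_value(t_current=10, t_published=0)  # equals exp(-1)
Q = quality_factor(0.9, 0.8, 0.7, 1.0)
```

Data published at the current time has T = 1, and the value decays exponentially as the data ages.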
Second, data scale assessment, including:
1. Calculation of the number of data attributes. A data attribute is also called a field; in most cases the columns of a table are called fields, and each field contains information on one particular topic;
1) Calculating the attribute count of numeric data
(1) The correlation coefficient $r_{A,B}$ of attributes A and B is calculated by the formula
$$r_{A,B} = \frac{\sum_{i=1}^{n}(a_i - \bar{A})(b_i - \bar{B})}{n\sigma_A\sigma_B} = \frac{\sum_{i=1}^{n}(a_i b_i) - n\bar{A}\bar{B}}{n\sigma_A\sigma_B}$$
where n is the number of data tuples, $a_i$ and $b_i$ are the values of tuple i on A and B respectively, $\bar{A}$ and $\bar{B}$ are the means of A and B, and $\sigma_A$ and $\sigma_B$ are the standard deviations of A and B;
(2) After the correlation coefficients are obtained, the attribute count of the numeric data is compressed to obtain the sum of the attribute counts of the attributes;
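A minimal Python sketch of the correlation coefficient for two numeric attributes, written in both forms given in claim 1. The function names and sample values are assumptions. Because the formula divides by n, the standard deviations are taken as population standard deviations (dividing by n rather than n - 1).

```python
import math

def _mean_and_std(values):
    """Mean and population standard deviation (divides by n)."""
    n = len(values)
    mean = sum(values) / n
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / n)
    return mean, std

def pearson(a, b):
    """r = sum((a_i - mean_A) * (b_i - mean_B)) / (n * sigma_A * sigma_B)."""
    n = len(a)
    mean_a, std_a = _mean_and_std(a)
    mean_b, std_b = _mean_and_std(b)
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cross / (n * std_a * std_b)

def pearson_shortcut(a, b):
    """Equivalent form: (sum(a_i * b_i) - n * mean_A * mean_B) / (n * sigma_A * sigma_B)."""
    n = len(a)
    mean_a, std_a = _mean_and_std(a)
    mean_b, std_b = _mean_and_std(b)
    return (sum(x * y for x, y in zip(a, b)) - n * mean_a * mean_b) / (n * std_a * std_b)

# Hypothetical attribute values for four tuples.
A_values = [2, 4, 6, 8]
B_values = [1, 3, 2, 5]
r = pearson(A_values, B_values)
```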
2) Calculating the attribute count of nominal (categorical) data
(1) Correlation is judged by the $\chi^2$ test:
$$\chi^2 = \sum_{i=1}^{c}\sum_{j=1}^{r}\frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$
where $o_{ij}$ is the observed frequency of the joint event $(A_i, B_j)$ and $e_{ij}$ is the expected frequency of $(A_i, B_j)$:
$$e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{n}$$
where n is the number of data tuples, $count(A = a_i)$ is the number of tuples with value $a_i$ on A, and $count(B = b_j)$ is the number of tuples with value $b_j$ on B. The $\chi^2$ statistic tests the hypothesis that A and B are independent, based on a significance level, with $(R-1)\times(C-1)$ degrees of freedom. The $\chi^2$ value is calculated by the above formula and compared with the rejection region of the $\chi^2$ test to judge whether the two attributes are correlated;
Repeated computational checks show that $\chi^2 = n$ when an attribute is compared with itself. Therefore, provided that $\chi^2 > 10.828$, $r_{A,B}$ can be used as the degree of correlation between the two attributes,
where R and C are the numbers of categories of the categorical variables;
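The χ² statistic for two nominal attributes can be sketched as follows, computing the observed and expected frequencies from the raw attribute columns. The function name and the sample data are hypothetical; the 10.828 threshold is the one stated in the text. The conversion of χ² into the correlation degree r_{A,B} is not sketched because its formula is not reproduced here.

```python
from collections import Counter

def chi_square(a_values, b_values):
    """Return (chi2, degrees_of_freedom) for two nominal attribute columns."""
    n = len(a_values)
    count_a = Counter(a_values)
    count_b = Counter(b_values)
    observed = Counter(zip(a_values, b_values))
    chi2 = 0.0
    for a_i, c_a in count_a.items():
        for b_j, c_b in count_b.items():
            expected = c_a * c_b / n  # e_ij
            chi2 += (observed[(a_i, b_j)] - expected) ** 2 / expected
    dof = (len(count_a) - 1) * (len(count_b) - 1)
    return chi2, dof

# Hypothetical 2 x 2 example with n = 1500 tuples.
A_col = ["male"] * 300 + ["female"] * 1200
B_col = ["fiction"] * 250 + ["nonfiction"] * 50 + ["fiction"] * 200 + ["nonfiction"] * 1000
chi2, dof = chi_square(A_col, B_col)
correlated = chi2 > 10.828  # threshold stated in the text

# Comparing a two-category attribute with itself gives chi2 = n.
chi2_self, _ = chi_square(A_col, A_col)
```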
(2) After the correlation coefficients are obtained, the attribute count of the data is compressed. The attribute compression steps are:
① A correlation matrix R is built,
where $r_{ij}$ is the degree of correlation between attributes $f_i$ and $f_j$, and $R_i$ is the sum of the correlations between attribute $f_i$ and the other attributes;
② The rows of matrix R are sorted by $R_i$ in descending order;
③ A column $f_0$ is added to represent the initial scale benchmark of a single attribute, and is set to 1;
④ The attribute scale compression matrix is calculated;
⑤ The elements on the diagonal are summed to give the attribute count after compression:
$$fn_c = r'_{11} + r'_{22} + \cdots + r'_{nn}$$
2. The number of data tuples $tn_j$ in the data table is obtained by direct counting. In a two-dimensional table a tuple is also called a record; each row of the table, that is, each record in the database, is a tuple;
3. Calculation of unit information content. The unit information content indicator measures how many distinct values the same attribute contains in a file;
(1) The information entropy of a discrete attribute is calculated as:
$$H(X) = -\sum_{i} P(x_i)\log_2 P(x_i)$$
where $P(x_i)$ is the probability with which each attribute value occurs;
(2) Calculation of the information entropy of a continuous attribute:
A discretization method is first selected and applied to the attribute, and the entropy is then calculated with the formula for discrete attributes;
(3) After the information entropy of each attribute is obtained, the average information entropy of the attributes is calculated:
$$\bar{H} = \frac{1}{fn}\sum_{j=1}^{fn} H_j$$
where $H_j$ is the information entropy of the j-th attribute and fn is the number of attributes of the single data table before compression;
The scale of a single data table is then calculated by a formula
in which S is the data scale measurement factor of the data table (in bits), $fn_c$ is the number of data attributes of the table after compression, tn is the number of tuples of the table, and $\bar{H}$ is the average information entropy of all attributes;
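The entropy steps above can be sketched in Python as follows. Several points are assumptions of this sketch rather than statements of the text: equal-width binning stands in for the unspecified discretization method, all names are hypothetical, and the table scale is computed as the product S = fn_c × tn × H̄, which is inferred only from the stated unit (bits) because the scale formula itself is not reproduced here.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy in bits of the observed values of one discrete attribute."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def equal_width_bins(values, bins=4):
    """ASSUMPTION: equal-width binning as the discretization method."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]

def average_entropy(columns):
    """Mean entropy over the fn attributes of one table (before compression)."""
    return sum(entropy(col) for col in columns) / len(columns)

def table_scale(fn_c, tn, avg_entropy):
    """ASSUMPTION: S = fn_c * tn * average entropy, in bits."""
    return fn_c * tn * avg_entropy

# Hypothetical table with one discrete and one continuous attribute, 8 tuples.
discrete = ["a", "b", "a", "b", "a", "b", "a", "b"]
continuous = [0, 1, 2, 3, 4, 5, 6, 8]
binned = equal_width_bins(continuous, bins=4)
h_bar = average_entropy([discrete, binned])
S = table_scale(fn_c=1.5, tn=8, avg_entropy=h_bar)  # fn_c value is hypothetical
```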
Third, data content assessment
A comparison matrix $B = (b_{ij})_{n\times n}$ is constructed using the AHP three-scale method, where $b_{ij}$ is the scale value obtained by comparing elements on the same level.
The importance ranking index $r_i$ of each element is then calculated.
Let $r_{max} = \max\{r_i\}$, $r_{min} = \min\{r_i\}$ and $b_m = r_{max}/r_{min}$; the judgment matrix $C = (c_{ij})_{n\times n}$ is then obtained.
After the judgment matrix is obtained, the calculation and check proceed in the following steps:
(1) The weights are calculated by the root method:
$$W_i = \frac{\left(\prod_{j=1}^{n} c_{ij}\right)^{1/n}}{\sum_{k=1}^{n}\left(\prod_{j=1}^{n} c_{kj}\right)^{1/n}}$$
Calculation steps: ① multiply the elements of C row by row to obtain a new vector;
② take the n-th root of each component of the new vector;
③ normalize the resulting vector to obtain the weight vector;
(2) Calculate the consistency index CI:
$$CI = \frac{\lambda_{max} - n}{n - 1}$$
where $\lambda_{max}$ is the largest eigenvalue of the judgment matrix C;
(3) Look up the random consistency index RI;
(4) Calculate the consistency ratio CR:
$$CR = \frac{CI}{RI}$$
When CR < 0.10, the consistency of the judgment matrix is considered acceptable; otherwise the judgment matrix should be adjusted appropriately. The weight of each class of data, classified by content, is thus obtained;
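The root-method weights and the consistency check can be sketched as below, starting from an already built judgment matrix (constructing it from the three-scale comparison matrix is not sketched, since those formulas are not reproduced here). The names are hypothetical. Two assumptions are made: the RI values are the commonly cited Saaty random consistency indices, because the lookup table is not reproduced in the text, and λ_max is estimated from the weight vector as the mean of (Cw)_i / w_i rather than by an eigenvalue solver.

```python
import math

# ASSUMPTION: commonly cited random consistency index values for n = 1..9.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}

def root_method_weights(matrix):
    """Row products, n-th roots, then normalization to a weight vector."""
    n = len(matrix)
    roots = [math.prod(row) ** (1 / n) for row in matrix]
    total = sum(roots)
    return [r / total for r in roots]

def consistency_ratio(matrix, weights):
    """CR = CI / RI with CI = (lambda_max - n) / (n - 1)."""
    n = len(matrix)
    cw = [sum(matrix[i][j] * weights[j] for j in range(n)) for i in range(n)]
    lambda_max = sum(cw[i] / weights[i] for i in range(n)) / n  # approximation
    ci = (lambda_max - n) / (n - 1)
    ri = RANDOM_INDEX[n]
    return ci / ri if ri else 0.0

# Hypothetical, perfectly consistent 3 x 3 judgment matrix.
judgment = [
    [1.0, 2.0, 4.0],
    [0.5, 1.0, 2.0],
    [0.25, 0.5, 1.0],
]
w = root_method_weights(judgment)
cr = consistency_ratio(judgment, w)
acceptable = cr < 0.10
```

For this consistent matrix the weights are proportional to 4 : 2 : 1 and CR is zero up to rounding, so the matrix passes the CR < 0.10 check.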
Fourth, industry value calculation
1. The industry with the highest tax revenue is taken, and its value score is set to 100;
2. The tax revenue of each other industry is divided by the tax revenue of the highest industry and multiplied by 100 to obtain the industry value of that industry;
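The two steps above amount to scaling each industry's tax revenue against the largest one. A minimal sketch, with a hypothetical function name and made-up revenue figures:

```python
def industry_values(tax_revenue):
    """Scale each industry's tax revenue so that the highest industry scores 100."""
    top = max(tax_revenue.values())
    return {industry: revenue / top * 100 for industry, revenue in tax_revenue.items()}

# Hypothetical tax revenues in arbitrary units.
revenues = {"manufacturing": 400, "finance": 200, "education": 10}
scores = industry_values(revenues)
```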
Fifth, data asset value calculation
1. The data quality factor $Q_{ij}$, the data scale factor $S_{ij}$ and the content-classification weight $W_i$ are multiplied together. If the i-th class of data contains multiple data tables, each table is calculated individually and the results of these tables are then summed;
2. Every content class is calculated by the above method, and the results are summed in turn;
3. The accumulated result is multiplied by the calculated industry value to obtain the value score V of the data asset:
$$V = \text{industry value} \times \sum_{i}\sum_{j} W_i\, Q_{ij}\, S_{ij}$$
4. The value of the data asset is assessed by the value score V.
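Steps 1 to 3 combine the earlier factors into a single score. The sketch below follows those steps literally; the data layout (a list of content classes, each with a weight and a list of per-table quality and scale pairs), the function name and all numbers are hypothetical.

```python
def value_score(content_classes, industry_value):
    """V = industry value * sum_i W_i * sum_j (Q_ij * S_ij).

    content_classes: list of (W_i, [(Q_ij, S_ij), ...]) tuples, one per
    content class, with one (quality, scale) pair per data table.
    """
    accumulated = sum(
        weight * sum(q * s for q, s in tables) for weight, tables in content_classes
    )
    return accumulated * industry_value

# Hypothetical data asset with two content classes.
classes = [
    (0.6, [(0.8, 100.0), (0.5, 40.0)]),  # class 1: two tables
    (0.4, [(0.9, 50.0)]),                # class 2: one table
]
V = value_score(classes, industry_value=50.0)
```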
The big data asset evaluation method of this embodiment provides a specific quantitative standard for evaluating data assets, makes the evaluation process simpler and clearer, eliminates the influence of the evaluator's subjective factors, and brings the evaluation results closer to reality.
Finally, it should be noted that the above embodiment is intended only to illustrate the technical solution of the present invention and not to limit it. Although the invention has been described in detail with reference to a preferred embodiment, those of ordinary skill in the art will understand that the technical solution of the invention may be modified or replaced by equivalents without departing from its spirit and scope, and all such modifications fall within the scope of the claims of the present invention.

Claims (1)

1. A big data asset evaluation method, characterized by comprising:
First, data quality assessment, including:
1. Calculation of data accuracy
A training set, a validation set and an accuracy prediction set are first sampled from the data table. Each time, one predictable attribute f in the training set is set as the class label, a classifier is trained, and its performance is checked on the validation set. The classifier is then used to predict the value of attribute f for each tuple in the prediction set. If the predicted value agrees with the actual value (for a numeric attribute, if their difference does not exceed a given threshold), the attribute value is considered correct, and the proportion of correctly predicted tuples is the accuracy $a_f$ of the data table on that attribute. This process is repeated for each attribute in the data table to obtain the accuracy $a_j$ of each attribute,
where j = 1, 2, ..., m, and m is the number of predictable attributes;
$n_t$ is the number of tuples in the prediction set, and $nr_j$ is the number of tuples in the prediction set that are correctly classified. The weighted arithmetic mean of the $a_j$ gives the comprehensive accuracy A of the data table:
$$A = \frac{\sum_{j=1}^{m} wf_j\, a_j}{m}$$
where j is the index of the predicted attribute and $wf_j$ is the weight of attribute j. The weight is determined by the value range and degree of dispersion of attribute j: the larger the range and the higher the dispersion, the lower the prediction accuracy, so the smaller the weight that should be assigned. The weight is calculated as:
$$wf_j = \left(1 - \frac{h_j}{\sum_{j=1}^{m} h_j}\right) \Big/ (m - 1)$$
where $h_j$ is the entropy of attribute j, which reflects the size of the attribute's value range and its degree of dispersion, and is calculated as:
$$h_j = -\sum_{f=1}^{v} p_f \times \log_2 (p_f)$$
where v is the number of distinct values and $p_f$ is the probability that the attribute takes its f-th value;
Finally, the total accuracy of the whole data set is:
$$A = \frac{\sum_{i=1}^{t} wt_i\, A_i}{t}$$
where $wt_i$ is the weight of table i and t is the total number of tables in the evaluated data set. The weight is calculated as:
$$wt_i = \frac{nt_i \times nf_i}{\sum_{i=1}^{t} nt \times nf}$$
where $nt_i$ is the number of tuples of table i, $nf_i$ is the number of attributes of table i, nt is the number of tuples of the whole data set, and nf is the number of attributes of the whole data set;
2. Calculation of data completeness I
$$I = \frac{n_{null}}{n_{item}}$$
where $n_{null}$ is the number of data items that are missing or null, and $n_{item}$ is the total number of data items;
3. Calculation of data consistency C
$$C_i = 1 - \frac{n_{name} + n_{code} + n_{form}}{fn_i}$$
$$W_i = \frac{fn_i}{\sum_{i=1}^{L} fn_i}$$
$$C = \sum_{i=1}^{L} W_i C_i$$
These formulas take one database in the data set as the object of examination, where $C_i$ is the consistency of the i-th database of the evaluated data set; $fn_i$ is the total number of attributes in the i-th database; $n_{name}$ is the number of attributes in the i-th database with inconsistent naming conventions; $n_{code}$ is the number of attributes in the i-th database with inconsistent data codes; $n_{form}$ is the number of attributes in the i-th database whose input-field formats are inconsistent; L is the number of databases contained in the evaluated data set; and $W_i$ is the weight of the i-th database;
4. Calculation of the data time value $T$
$$C(t_c, t_p) = e^{-a (t_c - t_p)}, \quad T = C(t_c, t_p) = e^{-0.1 (t_c - t_p)}$$
where $t_p$ is the time at which the information was published, $t_c$ is the current time, $C(t_c, t_p)$ is the influence of the information at time $t_c$, that is, its time value at time $t_c$, and $a$ is the aging rate coefficient of the information, which is set to 0.1;
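The exponential decay of the time value is easy to check numerically. In the sketch below the function name `time_value` is hypothetical, and the unit of time is an assumption because the text does not state one.

```python
import math

def time_value(t_c, t_p, a=0.1):
    """T = exp(-a * (t_c - t_p)); a is the aging rate coefficient (0.1 in the text)."""
    return math.exp(-a * (t_c - t_p))

# Hypothetical ages in whatever time unit t_c and t_p share.
fresh = time_value(t_c=5, t_p=5)    # just published
aged = time_value(t_c=10, t_p=0)    # ten time units old
```

Freshly published data has a time value of 1, and the value decays toward 0 as the data ages.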
5. The data quality is assessed by the formula
$$Q_i = \frac{1}{4} (A + I + C + T)$$
where $Q_i$ is the quality factor of the $i$-th class of data, the classes being formed according to data content;
Second, data scale assessment, comprising:
1. Calculation of the number of data attributes
1) Calculating the attribute number of numeric data
(1) The correlation coefficient $r_{A,B}$ of attributes $A$ and $B$ is computed by the formula
$$r_{A,B} = \frac{\sum_{i=1}^{n} (a_i - \bar{A})(b_i - \bar{B})}{n \sigma_A \sigma_B} = \frac{\sum_{i=1}^{n} (a_i b_i) - n \bar{A} \bar{B}}{n \sigma_A \sigma_B}$$
where $n$ is the number of data tuples, $a_i$ and $b_i$ are the values of tuple $i$ on $A$ and $B$ respectively, $\bar{A}$ and $\bar{B}$ are the means of $A$ and $B$, and $\sigma_A$ and $\sigma_B$ are the standard deviations of $A$ and $B$;
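As a sketch of the correlation coefficient for numeric attributes, the function below evaluates the second form of the formula. It assumes population standard deviations (dividing by $n$), which is what the $n \sigma_A \sigma_B$ denominator implies; the name `pearson` and the sample values are hypothetical.

```python
import math

def pearson(a, b):
    """r_AB = (sum(a_i * b_i) - n * mean_A * mean_B) / (n * sigma_A * sigma_B)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    # Population standard deviations (divide by n, not n - 1).
    sd_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sd_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    cross = sum(x * y for x, y in zip(a, b)) - n * mean_a * mean_b
    return cross / (n * sd_a * sd_b)

r_same_direction = pearson([1, 2, 3, 4], [2, 4, 6, 8])
r_opposite = pearson([1, 2, 3], [3, 2, 1])
```

A constant attribute has zero standard deviation and would cause a division by zero, so a real implementation would need to handle that case separately.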
(2) After the correlation coefficients are obtained, the attribute number of the numeric data is compressed, and the sum of the attribute numbers of the attributes is obtained;
2) Calculating the attribute number of nominal (categorical) data
(1) Correlation is judged by the $\chi^2$ test;
$$\chi^2 = \sum_{i=1}^{c} \sum_{j=1}^{r} \frac{(o_{ij} - e_{ij})^2}{e_{ij}}$$
where $o_{ij}$ is the observed frequency of the joint event $(A_i, B_j)$ and $e_{ij}$ is the expected frequency of $(A_i, B_j)$;
$$e_{ij} = \frac{count(A = a_i) \times count(B = b_j)}{n}$$
where $n$ is the number of data tuples, $count(A = a_i)$ is the number of tuples whose value on $A$ is $a_i$, and $count(B = b_j)$ is the number of tuples whose value on $B$ is $b_j$. The $\chi^2$ statistic tests the hypothesis that $A$ and $B$ are independent, at a chosen significance level, with $(R-1) \times (C-1)$ degrees of freedom. The $\chi^2$ value computed by the formula above is compared with the rejection region of the $\chi^2$ test to judge whether the two attributes are correlated;
Repeated calculation and testing show that $\chi^2 = n$ in the self-correlated case; therefore, provided that $\chi^2 > 10.828$, $r_{A,B}$ can be taken as the degree of correlation between the two attributes, with the formula:
$$r_{A,B} = \frac{\chi^2}{n \, \min[R - 1, C - 1]}$$
where $R$ and $C$ are the numbers of categories of the two categorical variables;
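The $\chi^2$ statistic and the derived degree of correlation can be sketched as follows. The function names are hypothetical, the formula for $r_{A,B}$ is taken as printed, and returning `None` when $\chi^2$ does not exceed 10.828 is an assumption about how the "not correlated" case should be represented.

```python
from collections import Counter

def chi_square(a, b):
    """Chi-square statistic for two categorical attributes (equal-length lists)."""
    n = len(a)
    count_a, count_b = Counter(a), Counter(b)
    observed = Counter(zip(a, b))              # o_ij: joint frequencies
    chi2 = 0.0
    for a_val, ca in count_a.items():
        for b_val, cb in count_b.items():
            expected = ca * cb / n             # e_ij
            chi2 += (observed[(a_val, b_val)] - expected) ** 2 / expected
    return chi2

def correlation_degree(a, b, threshold=10.828):
    """r_AB = chi2 / (n * min(R - 1, C - 1)) if chi2 > threshold, else None."""
    chi2 = chi_square(a, b)
    if chi2 <= threshold:
        return None                            # assumption: treated as not correlated
    r_cats, c_cats = len(set(a)), len(set(b))
    return chi2 / (len(a) * min(r_cats - 1, c_cats - 1))

# Hypothetical binary attribute compared with itself (the self-correlated case).
a = ["x"] * 20 + ["y"] * 20
chi2_self = chi_square(a, a)
r_self = correlation_degree(a, a)
```

For this binary attribute compared with itself, $\chi^2$ equals the number of tuples and the degree of correlation is 1, in line with the self-correlated case described in the text.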
(2) After the correlation coefficients are obtained, the attribute number of the numeric data is compressed. The attribute compression steps are:
1. Build the correlation matrix
where $r_{ij}$ is the degree of correlation between attributes $f_i$ and $f_j$, and $R_i$ is the sum of the correlations between attribute $f_i$ and the other attributes,
$$R_i = \sum_{j=1}^{n} r_{ij} - 1; \quad i \in \{1, 2, \ldots, n\};$$
2. Sort the rows of the matrix $R$ by $R_i$ in descending order, obtaining
3. Add a column $f_0$ representing the initial scale benchmark of a single attribute
4. Obtain the compressed matrix
5. Add up the elements on the diagonal to obtain the attribute number after compression
$$fn_c = r'_{11} + r'_{22} + \cdots + r'_{nn}$$
2. The number of data tuples in a data table, $tn_j$, is obtained by direct counting;
3. Calculation of the unit information amount
(1) The information entropy of a discrete attribute is computed as:
$$H(X) = -\sum_{i=1}^{n} P(x_i) \log_2 [P(x_i)]$$
where $P(x_i)$ is the probability with which each attribute value occurs;
(2) Calculation of the information entropy of a continuous attribute:
A discretization method is first selected to discretize the attribute, and the entropy is then computed with the formula for the information entropy of a discrete attribute;
(3) After the information entropy of each attribute is obtained, the average information entropy of the attributes is computed:
$fn$ is the number of attributes of a single data table before compression;
The formula for the scale of a single data table is then:
$$S = tn \times fn_c \times \overline{H(A)}$$
where $S$ is the data scale factor of a given data table (in bits), $fn_c$ is the number of data attributes of this table after compression, $tn$ is the number of tuples of this table, and $\overline{H(A)}$ is the average information entropy of all attributes;
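The scale factor of one table can be sketched as below. Two points are assumptions rather than statements from the text: the average information entropy is taken as the arithmetic mean of the per-attribute entropies over the $fn$ attributes before compression, and the compressed attribute number $fn_c$ is supplied as an input rather than derived. The function names and sample columns are hypothetical.

```python
import math
from collections import Counter

def information_entropy(values):
    """H(X) = -sum_i P(x_i) * log2(P(x_i)) for one discrete attribute."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def table_scale(columns, fn_c):
    """S = tn * fn_c * mean entropy; columns are equal-length lists of values."""
    tn = len(columns[0])                       # number of tuples
    fn = len(columns)                          # attributes before compression
    # Assumption: the average entropy is the arithmetic mean over the fn attributes.
    mean_entropy = sum(information_entropy(col) for col in columns) / fn
    return tn * fn_c * mean_entropy

# Hypothetical table with two attributes and four tuples.
columns = [["a", "b", "a", "b"], ["x", "y", "z", "w"]]
S = table_scale(columns, fn_c=1.5)             # fn_c = 1.5 is a made-up value
```

Continuous attributes would need to be discretized before being passed to `information_entropy`, as the text requires.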
Third, data content assessment
A comparison matrix $B = (b_{ij})_{n \times n}$ is constructed using the AHP three-scale method, where $b_{ij}$ is the scale value obtained by comparing elements at the same level, specifically
The importance ranking index of each element is calculated by:
$$r_i = \sum_{j=1}^{n} b_{ij}, \quad i = 1, 2, \ldots, n.$$
Let $r_{max} = \max\{r_i\}$, $r_{min} = \min\{r_i\}$, and $b_m = r_{max} / r_{min}$; the judgment matrix $C = (c_{ij})_{n \times n}$ is then obtained:
$$c_{ij} = \begin{cases} \dfrac{r_i - r_j}{r_{max} - r_{min}} \times (b_m - 1) + 1, & r_i \geq r_j \\ \left[ \dfrac{r_j - r_i}{r_{max} - r_{min}} \times (b_m - 1) + 1 \right]^{-1}, & r_i < r_j \end{cases}$$
thereby obtaining
After the judgment matrix is obtained, it is computed and checked in the following steps:
(1) The weights are calculated by the root method, with the formula:
$$W_i = \frac{\left( \prod_{j=1}^{n} a_{ij} \right)^{\frac{1}{n}}}{\sum_{i=1}^{n} \left( \prod_{j=1}^{n} a_{ij} \right)^{\frac{1}{n}}}, \quad i = 1, 2, 3, \ldots, n.$$
Calculation steps: 1. multiply the elements of the judgment matrix (the $a_{ij}$ of the formula) together row by row to obtain a new vector,
2. take the $n$-th root of each component of the new vector,
3. normalize the resulting vector to obtain the weight vector;
(2) Calculate the consistency index CI
$$CI = \frac{\lambda_{max} - n}{n - 1}$$
where $\lambda_{max}$ is the largest eigenvalue of the judgment matrix $C$;
(3) Look up the consistency index RI
(4) Calculate the consistency ratio CR
$$CR = \frac{CI}{RI}$$
When CR < 0.10, the consistency of the judgment matrix is considered acceptable; otherwise the judgment matrix should be suitably corrected. The weight of each class of data, classified by content, is thereby obtained;
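The chain from comparison matrix to checked weights can be sketched in plain Python. Several things here are assumptions rather than statements from the text: the comparison matrix uses the values 2, 1 and 0 (a common form of the three-scale method), $\lambda_{max}$ is approximated as the mean of $(Cw)_i / w_i$ instead of being computed by an eigenvalue solver, and `RI_TABLE` holds the commonly tabulated random-index values, which the text does not list. All function names are hypothetical.

```python
import math

# Commonly tabulated RI values for AHP (assumed; not given in the text).
RI_TABLE = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12, 6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}

def judgment_matrix(b):
    """Build C = (c_ij) from the comparison matrix B via the ranking index r_i."""
    n = len(b)
    r = [sum(row) for row in b]
    r_max, r_min = max(r), min(r)
    if r_max == r_min:                         # all elements equally important
        return [[1.0] * n for _ in range(n)]
    b_m = r_max / r_min
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            step = abs(r[i] - r[j]) / (r_max - r_min) * (b_m - 1) + 1
            c[i][j] = step if r[i] >= r[j] else 1 / step
    return c

def root_method_weights(c):
    """Row products, n-th roots, then normalization."""
    n = len(c)
    roots = [math.prod(row) ** (1 / n) for row in c]
    total = sum(roots)
    return [x / total for x in roots]

def consistency_ratio(c, w):
    """CR = CI / RI, with lambda_max approximated as the mean of (C w)_i / w_i."""
    n = len(c)
    if n < 3:
        return 0.0                             # 1x1 and 2x2 matrices are always consistent
    lam = sum(sum(c[i][j] * w[j] for j in range(n)) / w[i] for i in range(n)) / n
    ci = (lam - n) / (n - 1)
    return ci / RI_TABLE[n]

# Hypothetical three-scale comparison: class 1 > class 2 > class 3.
B = [[1, 2, 2], [0, 1, 2], [0, 0, 1]]
C = judgment_matrix(B)
w = root_method_weights(C)
cr = consistency_ratio(C, w)
```

For this example the judgment matrix has the entries 1, 3 and 5 above the diagonal, and the resulting consistency ratio falls below the 0.10 threshold, so the weights would be accepted without correction.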
Fourth, industry value calculation
1. Take the industry with the highest tax revenue and set its value score to 100;
2. Divide the tax revenue of each other industry by the tax revenue of the highest industry and multiply by 100 to obtain the industry value of that industry;
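The two steps amount to scaling every industry against the highest tax revenue. The sketch below uses hypothetical industry names and revenue figures.

```python
def industry_values(tax_revenue):
    """Scale each industry's tax revenue so that the highest one scores 100."""
    top = max(tax_revenue.values())
    return {name: revenue / top * 100 for name, revenue in tax_revenue.items()}

# Hypothetical tax revenues in arbitrary units.
tax_revenue = {"finance": 800.0, "retail": 200.0, "logistics": 400.0}
values = industry_values(tax_revenue)
```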
Fifth, data asset value calculation
1. Multiply the quality factor $Q_{ij}$, the data scale factor $S_{ij}$, and the weight $W_i$ of the content class; if the $i$-th class of data contains several data tables, each data table is calculated first and the results of these tables are then added up;
2. Every content class is calculated by the above method, and the results are accumulated in turn;
3. The accumulated result is multiplied by the calculated industry value to obtain the value score $V$ of the data asset;
4. The value of the data asset is assessed by the value score $V$.
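Put together, the four steps reduce to a weighted sum over classes and tables, scaled by the industry value. The sketch below assumes a simple input layout, a list of `(W_i, [(Q_ij, S_ij), ...])` entries; that layout, the function name and all numbers are hypothetical.

```python
def asset_value(classes, industry_value):
    """V = industry_value * sum_i sum_j (Q_ij * S_ij * W_i).

    classes: list of (W_i, tables), where tables is a list of (Q_ij, S_ij) pairs.
    """
    total = 0.0
    for weight, tables in classes:
        # Each table of the class is computed first, then the tables are added up.
        total += sum(q * s * weight for q, s in tables)
    return total * industry_value

# Hypothetical inputs: two content classes, the first with two tables.
classes = [
    (0.6, [(0.8, 100.0), (0.9, 200.0)]),
    (0.4, [(0.5, 50.0)]),
]
V = asset_value(classes, industry_value=50.0)
```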
CN201710058720.7A 2017-01-23 2017-01-23 Big data asset evaluation method Pending CN106845846A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710058720.7A CN106845846A (en) 2017-01-23 2017-01-23 Big data asset evaluation method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710058720.7A CN106845846A (en) 2017-01-23 2017-01-23 Big data asset evaluation method

Publications (1)

Publication Number Publication Date
CN106845846A true CN106845846A (en) 2017-06-13

Family

ID=59122737

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710058720.7A Pending CN106845846A (en) 2017-01-23 2017-01-23 Big data asset evaluation method

Country Status (1)

Country Link
CN (1) CN106845846A (en)

Cited By (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107633257A (en) * 2017-08-15 2018-01-26 上海数据交易中心有限公司 Data Quality Assessment Methodology and device, computer-readable recording medium, terminal
CN107995020A (en) * 2017-10-23 2018-05-04 北京兰云科技有限公司 A kind of asset valuation method and apparatus
CN108804655A (en) * 2018-06-07 2018-11-13 福建江夏学院 A kind of intangible asset method and system based on big data
CN108829750A (en) * 2018-05-24 2018-11-16 国信优易数据有限公司 A kind of quality of data determines system and method
CN109615431A (en) * 2018-12-13 2019-04-12 普元信息技术股份有限公司 The system and method for data assets perception and pricing function are realized under big data background
CN110766429A (en) * 2018-07-26 2020-02-07 国信优易数据有限公司 Data value evaluation system and method
CN111475695A (en) * 2020-03-30 2020-07-31 贵阳大数据交易所有限责任公司 Service data asset pricing method based on metadata
CN111539770A (en) * 2020-04-27 2020-08-14 启迪数华科技有限公司 Intelligent data asset assessment method and system
CN112101447A (en) * 2020-09-10 2020-12-18 北京百度网讯科技有限公司 Data set quality evaluation method, device, equipment and storage medium
CN113673889A (en) * 2021-08-26 2021-11-19 上海罗盘信息科技有限公司 Intelligent data asset identification method
CN113806356A (en) * 2020-06-16 2021-12-17 中国移动通信集团重庆有限公司 Data identification method and device and computing equipment

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107633257B (en) * 2017-08-15 2020-04-17 上海数据交易中心有限公司 Data quality evaluation method and device, computer readable storage medium and terminal
CN107633257A (en) * 2017-08-15 2018-01-26 上海数据交易中心有限公司 Data Quality Assessment Methodology and device, computer-readable recording medium, terminal
CN107995020B (en) * 2017-10-23 2021-05-07 北京兰云科技有限公司 Asset value assessment method and device
CN107995020A (en) * 2017-10-23 2018-05-04 北京兰云科技有限公司 A kind of asset valuation method and apparatus
CN108829750A (en) * 2018-05-24 2018-11-16 国信优易数据有限公司 A kind of quality of data determines system and method
CN108804655A (en) * 2018-06-07 2018-11-13 福建江夏学院 A kind of intangible asset method and system based on big data
CN110766429A (en) * 2018-07-26 2020-02-07 国信优易数据有限公司 Data value evaluation system and method
CN109615431A (en) * 2018-12-13 2019-04-12 普元信息技术股份有限公司 The system and method for data assets perception and pricing function are realized under big data background
CN111475695A (en) * 2020-03-30 2020-07-31 贵阳大数据交易所有限责任公司 Service data asset pricing method based on metadata
CN111539770A (en) * 2020-04-27 2020-08-14 启迪数华科技有限公司 Intelligent data asset assessment method and system
CN111539770B (en) * 2020-04-27 2023-06-16 国云数字科技(重庆)有限公司 Intelligent evaluation method and system for data assets
CN113806356A (en) * 2020-06-16 2021-12-17 中国移动通信集团重庆有限公司 Data identification method and device and computing equipment
CN113806356B (en) * 2020-06-16 2024-03-19 中国移动通信集团重庆有限公司 Data identification method and device and computing equipment
CN112101447A (en) * 2020-09-10 2020-12-18 北京百度网讯科技有限公司 Data set quality evaluation method, device, equipment and storage medium
CN112101447B (en) * 2020-09-10 2024-04-16 北京百度网讯科技有限公司 Quality evaluation method, device, equipment and storage medium for data set
CN113673889A (en) * 2021-08-26 2021-11-19 上海罗盘信息科技有限公司 Intelligent data asset identification method

Similar Documents

Publication Publication Date Title
CN106845846A (en) Big data asset evaluation method
Tang et al. Neural networks analysis in business failure prediction of Chinese importers: A between-countries approach
CN101894316A (en) Method and system for monitoring indexes of international market prosperity conditions
JP2004500642A (en) Methods and systems for assessing cash flow recovery and risk
CN111738843B (en) Quantitative risk evaluation system and method using running water data
Tang et al. Online-purchasing behavior forecasting with a firefly algorithm-based SVM model considering shopping cart use
Xu et al. Novel key indicators selection method of financial fraud prediction model based on machine learning hybrid mode
Kiv et al. Machine learning of emerging markets in pandemic times
CN112966962A (en) Electric business and enterprise evaluation method
CN112419030B (en) Method, system and equipment for evaluating financial fraud risk
CN109146611A (en) A kind of electric business product quality credit index analysis method and system
Hamal et al. A novel integrated AHP and MULTIMOORA method with interval-valued spherical fuzzy sets and single-valued spherical fuzzy sets to prioritize financial ratios for financial accounting fraud detection
Chimonaki et al. Identification of financial statement fraud in Greece by using computational intelligence techniques
Wardana et al. Comparation of SAW method and TOPSIS in assesing the best area using HSE standards
US20140156544A1 (en) Non-Tangible Assets Valuation Tool
Sharma et al. Predicting purchase probability of retail items using an ensemble learning approach and historical data
CN105447117A (en) User clustering method and apparatus
CN113450004A (en) Power credit report generation method and device, electronic equipment and readable storage medium
CN115905319B (en) Automatic identification method and system for abnormal electricity fees of massive users
CN105956012A (en) Database mode abstract method based on graphical partition strategy
CN114626940A (en) Data analysis method and device and electronic equipment
Li et al. A commonsense knowledge-enabled textual analysis approach for financial market surveillance
Shmueli et al. Data mining in excel: Lecture notes and cases
CN111091410B (en) Node embedding and user behavior characteristic combined net point sales prediction method
Kasinadh et al. Building fuzzy OLAP using multi-attribute summarization

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170613