CN106845846A

CN106845846A - Big data asset evaluation method

Info

Publication number: CN106845846A
Application number: CN201710058720.7A
Authority: CN
Inventors: 卓颋; 殷荣华; 刘洪明; 舒夕珂; 曹慧英
Original assignee: Beijing Soft Cloud Technology Co Ltd; Chongqing University of Post and Telecommunications
Current assignee: Beijing Soft Cloud Technology Co Ltd; Chongqing University of Post and Telecommunications
Priority date: 2017-01-23
Filing date: 2017-01-23
Publication date: 2017-06-13

Abstract

The invention discloses a big data asset evaluation method comprising: first, data quality assessment, in which the data quality indicators are accuracy, integrity, consistency and timeliness; second, data scale assessment, in which the data scale indicators are the number of data attributes, the number of data tuples and the unit information amount; third, data content assessment, in which the data content comprises transaction data, personal information, commodity information, production management data, user review data and social network data; fourth, industry value calculation; fifth, data asset value calculation. The method provides a specific quantitative standard for assessing data assets, makes the evaluation process simpler and clearer, eliminates the influence of the evaluator's subjective factors, and brings the evaluation result closer to reality.

Description

Big data asset evaluation method

Technical field

The present invention relates to the technical field of asset evaluation, and in particular to a method for evaluating data assets.

Background technology

The value of data differs between industries, and tax revenue reflects the size of an industry's business transaction volume. Therefore, based on sources such as the tax yearbook, data are divided by industry into:

(1) agricultural data

(2) mining industry data

(3) manufacturing industry data

(4) electric power, heat, gas and water production and supply industry data

(5) construction industry data

(6) wholesale and retail industry data

(7) transportation, storage and postal industry data

(8) accommodation and catering industry data

(9) information transmission, software and information technology service industry data

(10) financial industry data

(11) real estate industry data

(12) leasing and business services industry data

(13) education data

(14) health and social work data

(15) culture, sports and entertainment industry data

(16) public administration, social security and social organization data

(17) other industry data

Each data file generally contains many kinds of information, so data can also be divided by content into:

(1) transaction data

(2) personal information

(3) commodity (service) information

(4) production management data

(5) user review data

(6) social network data

Here, personal information includes merchant information and consumer information. Note that each data file contains one or more of the above six classes of data.

In recent years the term "big data" has appeared in everyday life again and again, and the evaluation of data assets has become a topic of wide public interest. Current research on data assets is still far from complete. Research on intangible asset evaluation has achieved certain results in China, and since data assets are a special kind of intangible asset, their valuation can draw on ordinary intangible asset evaluation. Sun Rongling et al. were the first to study the quantification of the value of intangible assets and of its realization, but the traditional evaluation methods were relatively coarse. Chen Changyun then proposed the Black-Scholes option pricing model and the EVA method and applied them to the valuation of an enterprise as a whole; the model is more accurate but does not consider the differences between enterprises. Through the continued work of experts and scholars, a fairly complete intangible asset evaluation system has since formed, consisting mainly of the income approach, the market approach and the cost approach. These methods, however, still conflict with the evaluation criteria and elements of data assets, so they cannot be applied to data assets directly. At the same time, the definition of data asset value is not uniform, valuation models or reference models for data assets and a system of valuation dimensions are lacking, and the assessment of data assets has no specific quantitative standard, all of which make research more difficult.

In addition, the importance of evaluation differs between categories of data. Treating all data loosely as a single class yields analysis results that are out of touch with reality and run counter to society's actual demand for data. Moreover, the evaluation structure for data assets should cover more aspects, and past research is also structurally deficient.

Summary of the invention

In view of this, the object of the present invention is to address the shortcomings of previous methods by proposing a new method for evaluating data assets.

The big data asset evaluation method of the present invention is characterized by comprising:

I. Data quality assessment, including:

1. Calculation of data accuracy

First, a training set, a validation set and an accuracy prediction set are each sampled from the data table. For one predictable attribute f in the training set at a time, f is set as the class label and a classifier is trained, and its performance is checked on the validation set. The classifier is then used to predict the value of attribute f for each tuple in the prediction set. If the predicted value agrees with the actual value (for a numerical attribute, if the difference does not exceed a certain threshold, such as the standard deviation), the attribute value is considered correct, and the proportion of accurately predicted tuples is the accuracy a_f of the data table on that attribute. This process is repeated for each attribute in the data table to obtain the accuracy a_j of each attribute:

a_{j} = \frac{n_{rj}}{n_{t}}

where j = 1, 2, ..., m, m is the number of predictable attributes, n_t is the number of tuples in the prediction set, and n_rj is the number of tuples correctly classified in the prediction set. The weighted arithmetic mean of these a_j gives the overall accuracy A of the data table, i.e.:

A = \frac{\sum_{j = 1}^{m} wf_{j} a_{j}}{m}

where j is the index of the predicted attribute and wf_j is the weight of attribute j. Its value can be determined from the value range and degree of dispersion of attribute j: the larger the value range and the higher the dispersion of an attribute, the lower the accuracy of its prediction, so the smaller the weight it should be given. The weight is calculated as:

wf_{j} = \frac{1 - \frac{h_{j}}{\sum_{j = 1}^{m} h_{j}}}{m - 1}

where h_j is the entropy of attribute j, which expresses the size of the attribute's value range and its degree of dispersion. It is calculated as:

h_{j} = - \sum_{f = 1}^{v} p_{f} \log_{2} (p_{f})

where v is the number of distinct values and p_f is the probability that the attribute takes its f-th value;

Finally, the total accuracy of the whole data set is:

A = \frac{\sum_{i = 1}^{t} wt_{i} A_{i}}{t}

where wt_i is the weight of table i, and t is the total number of tables in the evaluated data set. The weight is calculated as:

wt_{i} = \frac{nt_{i} \times nf_{i}}{\sum_{i = 1}^{t} nt \times nf}

where nt_i is the number of tuples of table i, nf_i is the number of attributes of table i, nt is the number of tuples of the whole data set, and nf is the number of attributes of the whole data set;
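The entropy-based attribute weights and the table accuracy described above can be illustrated with a minimal Python sketch. The function names, the sample columns and the per-attribute accuracies 0.9 and 0.6 are hypothetical; the division by m follows the formula as it is written in the claims, and the sketch assumes at least two predictable attributes with a non-zero entropy sum.

```python
import math
from collections import Counter


def attribute_entropy(values):
    """Shannon entropy h_j (in bits) of the observed values of one attribute."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def attribute_weights(entropies):
    """wf_j = (1 - h_j / sum(h)) / (m - 1); assumes m >= 2 and sum(h) > 0."""
    m = len(entropies)
    total = sum(entropies)
    return [(1 - h / total) / (m - 1) for h in entropies]


def table_accuracy(accuracies, weights):
    """A = sum_j(wf_j * a_j) / m, following the formula as written in the claims."""
    m = len(accuracies)
    return sum(w * a for w, a in zip(weights, accuracies)) / m


# Hypothetical table with two predictable attributes.
columns = {
    "gender": ["m", "f", "m", "f"],  # two equally likely values
    "region": ["n", "s", "e", "w"],  # four equally likely values
}
entropies = [attribute_entropy(v) for v in columns.values()]
weights = attribute_weights(entropies)
# Hypothetical per-attribute accuracies a_j measured on the prediction set.
accuracy = table_accuracy([0.9, 0.6], weights)
```

The attribute with the wider, more dispersed value range ("region") receives the smaller weight, which is the behaviour the text asks for.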

2. Calculation of data integrity I

I = \frac{n_{null}}{n_{item}}

where n_null is the number of data items that are missing or null, and n_item is the total number of data items;

3. Calculation of data consistency C

C_{i} = 1 - \frac{n_{name} + n_{code} + n_{form}}{fn_{i}}

W_{i} = \frac{fn_{i}}{\sum_{i = 1}^{L} fn_{i}}

C = \sum_{i = 1}^{L} W_{i} C_{i}

These formulas take one database in the data set as the object of examination, where C_i is the consistency of the i-th database of the evaluated data set; fn_i is the total number of attributes in the i-th database; n_name is the number of attributes in the i-th database with inconsistent naming conventions; n_code is the number of attributes in the i-th database with inconsistent data codes; n_form is the number of attributes in the i-th database with inconsistent input field formats; L is the number of databases contained in the evaluated data set; and W_i is the weight of the i-th database;
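As a small illustration of the consistency calculation, the following Python sketch computes C_i for each database and combines the results with the weights W_i. The function names and the attribute counts in the example are hypothetical.

```python
def database_consistency(fn, n_name, n_code, n_form):
    """C_i = 1 - (n_name + n_code + n_form) / fn_i for one database."""
    return 1 - (n_name + n_code + n_form) / fn


def dataset_consistency(databases):
    """C = sum_i W_i * C_i with W_i = fn_i / sum(fn).

    databases: list of (fn, n_name, n_code, n_form) tuples, one per database.
    """
    total_fn = sum(db[0] for db in databases)
    return sum((db[0] / total_fn) * database_consistency(*db) for db in databases)


# Hypothetical data set with two databases of 10 and 30 attributes.
dbs = [(10, 1, 0, 1), (30, 0, 3, 0)]
consistency = dataset_consistency(dbs)
```

Because W_i is proportional to the attribute count, the larger database dominates the combined score.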

4. Calculation of data time value T

C (t_{c}, t_{p}) = e^{- a (t_{c} - t_{p})}, T = C (t_{c}, t_{p}) = e^{- 0.1 (t_{c} - t_{p})}

where t_p is the time at which the information was published, t_c is the current time, C(t_c, t_p) is the influence of the information at time t_c, i.e. its time value at time t_c, and a is the aging rate coefficient of the information, which is set to 0.1;

5. The data quality is assessed by the formula

Q_{i} = \frac{1}{4} (A + I + C + T)

where Q_i is the data quality factor of the i-th class of data as classified by data content;
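The time value and the quality factor can be sketched in a few lines of Python. The source does not state the time unit of t_c and t_p, so the unit used here is an assumption, and the function names and sample scores are hypothetical.

```python
import math


def time_value(t_c, t_p, a=0.1):
    """T = exp(-a * (t_c - t_p)); the time unit is an assumption of this sketch."""
    return math.exp(-a * (t_c - t_p))


def quality_factor(A, I, C, T):
    """Q_i = (A + I + C + T) / 4, the unweighted mean of the four indicators."""
    return (A + I + C + T) / 4


# Hypothetical indicator values for one content class.
T = time_value(t_c=10, t_p=0)  # ten time units after publication
Q = quality_factor(A=0.9, I=0.8, C=0.7, T=1.0)
```

With a = 0.1 the time value falls to 1/e after ten time units, so the choice of unit strongly affects how quickly data is treated as stale.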

II. Data scale assessment, including:

1. Calculation of the number of data attributes

1) Calculating the number of attributes of numerical data

(1) The correlation coefficient r_A,B of attributes A and B is calculated by the formula

r_{A, B} = \frac{\sum_{i = 1}^{n} (a_{i} - \bar{A}) (b_{i} - \bar{B})}{n \sigma_{A} \sigma_{B}} = \frac{\sum_{i = 1}^{n} (a_{i} b_{i}) - n \bar{A} \bar{B}}{n \sigma_{A} \sigma_{B}}

where n is the number of data tuples, a_i and b_i are the values of tuple i on A and B respectively, \bar{A} and \bar{B} are the means of A and B respectively, and σ_A and σ_B are the standard deviations of A and B respectively;

(2) After the correlation coefficients are obtained, the attribute count of the numerical data is compressed, giving the sum of the attribute counts of the individual attributes;
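The correlation coefficient of step (1) can be computed directly from its definition, as in the following Python sketch. The function name and sample values are hypothetical; the population standard deviation (division by n) is used so that the denominator matches n σ_A σ_B, and both attributes are assumed to be non-constant.

```python
import math


def pearson(a, b):
    """r_{A,B} = sum((a_i - mean_A) * (b_i - mean_B)) / (n * sigma_A * sigma_B)."""
    n = len(a)
    mean_a = sum(a) / n
    mean_b = sum(b) / n
    # Population standard deviations, matching the n * sigma_A * sigma_B denominator.
    sigma_a = math.sqrt(sum((x - mean_a) ** 2 for x in a) / n)
    sigma_b = math.sqrt(sum((y - mean_b) ** 2 for y in b) / n)
    cross = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    return cross / (n * sigma_a * sigma_b)


# Hypothetical numerical attributes.
r_same_direction = pearson([1, 2, 3, 4], [2, 4, 6, 8])
r_opposite = pearson([1, 2, 3], [3, 2, 1])
```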

2) Calculating the number of attributes of nominal (categorical) data

(1) Correlation is judged by the χ² test;

\chi^{2} = \sum_{i = 1}^{c} \sum_{j = 1}^{r} \frac{(o_{ij} - e_{ij})^{2}}{e_{ij}}

where o_ij is the observed frequency of the joint event (A_i, B_j), and e_ij is the expected frequency of (A_i, B_j):

e_{ij} = \frac{count(A = a_{i}) \times count(B = b_{j})}{n}

where n is the number of data tuples, count(A = a_i) is the number of tuples whose value on A is a_i, and count(B = b_j) is the number of tuples whose value on B is b_j. The χ² statistic tests the hypothesis that A and B are independent, at a given significance level and with (R-1) × (C-1) degrees of freedom. The χ² value calculated by the above formula is compared with the rejection region of the χ² test to determine whether the two attributes are correlated;

Repeated calculation and testing show that χ² = n in the self-correlated case. Therefore, provided that χ² > 10.828, r_A,B can be used as the degree of correlation between the two attributes, with the formula:

r_{A, B} = \sqrt{\frac{\chi^{2}}{n \min [R - 1, C - 1]}}

where R and C are the numbers of categories of the categorical variables;

(2) After the correlation coefficients are obtained, the attribute count of the numerical data is compressed. The attribute compression steps are:

① Build the correlation matrix

where r_ij is the degree of correlation between attributes f_i and f_j, and R_i is the sum of the correlations of attribute f_i with the other attributes:

R_{i} = \sum_{j = 1}^{n} r_{ij} - 1, i \in \{1, 2, ..., n\}

② Sort the rows of matrix R in descending order of R_i

③ Add a column f₀ representing the initial scale benchmark of a single attribute

④ Obtain the compression matrix

⑤ Add the elements on the diagonal to obtain the compressed attribute count

fn_{c} = r'_{11} + r'_{22} + … + r'_{nn}
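The χ² statistic and the correlation measure r_A,B for nominal attributes described in step (1) above can be sketched as follows. The function names and sample data are hypothetical. The source does not say what value to use when χ² does not exceed 10.828, so returning None in that case is an assumption of this sketch. The matrix compression of steps ① to ⑤ is not reproduced here because its intermediate matrices are not recoverable from the text.

```python
import math
from collections import Counter


def chi_square(a, b):
    """Return (chi2, R, C) for two nominal attributes given as equal-length lists."""
    n = len(a)
    count_a = Counter(a)
    count_b = Counter(b)
    observed = Counter(zip(a, b))  # o_ij; missing pairs count as 0
    chi2 = 0.0
    for va, ca in count_a.items():
        for vb, cb in count_b.items():
            expected = ca * cb / n  # e_ij
            chi2 += (observed[(va, vb)] - expected) ** 2 / expected
    return chi2, len(count_a), len(count_b)


def nominal_correlation(a, b, threshold=10.828):
    """r_{A,B} = sqrt(chi2 / (n * min(R - 1, C - 1))) when chi2 > threshold."""
    chi2, R, C = chi_square(a, b)
    if chi2 <= threshold:
        return None  # assumption: no correlation degree is reported in this case
    return math.sqrt(chi2 / (len(a) * min(R - 1, C - 1)))


# Hypothetical attributes: one compared with itself, and two independent ones.
same = ["x"] * 10 + ["y"] * 10
left = ["x", "x", "y", "y"] * 5
right = ["p", "q", "p", "q"] * 5
chi2_same, _, _ = chi_square(same, same)
r_same = nominal_correlation(same, same)
r_independent = nominal_correlation(left, right)
```

For the two-category attribute compared with itself, χ² equals the number of tuples, matching the self-correlated case mentioned in the text.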

2. The number of data tuples tn_j in the data table is obtained by direct counting;

3. Calculation of the unit information amount

(1) The information entropy of a discrete attribute is calculated as:

H (X) = - \sum_{i = 1}^{n} P (x_{i}) \log_{2} [P (x_{i})]

where P(x_i) is the probability with which each attribute value occurs;

(2) Calculation of the information entropy of a continuous attribute:

A discretization method is first chosen to discretize the attribute, and the entropy is then calculated with the formula for discrete attributes;

(3) After the information entropy of each attribute is obtained, the average information entropy of the attributes is obtained:

\overline{H (A)} = \frac{1}{fn} \sum_{k = 1}^{fn} H (A_{k})

where fn is the number of attributes of a single data table before compression;

The scale of a single data table is then calculated as:

S = tn \times fn_{c} \times \overline{H (A)}

where S is the data scale factor of a given data table (in bits), fn_c is the number of data attributes of this table after compression, tn is the number of tuples of this table, and \overline{H(A)} is the average information entropy of all attributes;
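A minimal Python sketch of the scale factor follows. The source leaves the discretization method open, so the equal-width binning used here is an assumption, as are the function names, the sample table and the compressed attribute count fn_c = 1.5, which is passed in rather than derived.

```python
import math
from collections import Counter


def entropy_bits(values):
    """H(X) = -sum(P(x_i) * log2(P(x_i))) over the observed values."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def equal_width_bins(values, bins):
    """Discretize a continuous attribute into equal-width bins (assumed method)."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins or 1.0  # avoid a zero width for constant attributes
    return [min(int((v - lo) / width), bins - 1) for v in values]


def table_scale(columns, fn_c):
    """S = tn * fn_c * mean attribute entropy, in bits."""
    tn = len(next(iter(columns.values())))
    mean_entropy = sum(entropy_bits(v) for v in columns.values()) / len(columns)
    return tn * fn_c * mean_entropy


# Hypothetical table with one discrete and one continuous attribute.
columns = {
    "gender": ["m", "f", "m", "f"],
    "price": equal_width_bins([0.0, 1.0, 2.0, 3.0], bins=2),
}
S = table_scale(columns, fn_c=1.5)
```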

III. Data content assessment

A comparison matrix B = (b_ij)_n×n is constructed using the AHP three-scale method, where b_ij is the scale value obtained by comparing elements on the same level.

The importance ranking index of each element is calculated with the following formula:

r_{i} = \sum_{j = 1}^{n} b_{ij}, i = 1, 2, ..., n

Let r_max = max{r_i}, r_min = min{r_i} and b_m = r_max / r_min; the judgment matrix C = (c_ij)_n×n is then obtained:

c_{ij} = \begin{cases} \frac{r_{i} - r_{j}}{r_{max} - r_{min}} \times (b_{m} - 1) + 1, & r_{i} \geq r_{j} \\ \left[ \frac{r_{j} - r_{i}}{r_{max} - r_{min}} \times (b_{m} - 1) + 1 \right]^{- 1}, & r_{i} < r_{j} \end{cases}

After the judgment matrix is obtained, it is calculated and checked as follows:

(1) The weights are calculated by the root method, with the formula:

W_{i} = \frac{(\prod_{j = 1}^{n} c_{ij})^{\frac{1}{n}}}{\sum_{i = 1}^{n} (\prod_{j = 1}^{n} c_{ij})^{\frac{1}{n}}}, i = 1, 2, 3, ..., n

Calculation steps: ① multiply the elements of C row by row to obtain a new vector,

② take the n-th root of each component of the new vector,

③ normalize the resulting vector to obtain the weight vector;

(2) Calculate the consistency index CI

CI = \frac{\lambda_{max} - n}{n - 1}

where λ_max is the largest eigenvalue of the judgment matrix C;

(3) Look up the random consistency index RI

(4) Calculate the consistency ratio CR

CR = \frac{CI}{RI}

When CR < 0.10, the consistency of the judgment matrix is considered acceptable; otherwise the judgment matrix should be adjusted appropriately. The weight of each class of data classified by content is thus obtained;
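The weighting procedure above can be sketched in Python under a few stated assumptions. The definition of b_ij is not recoverable from the text, so the sketch assumes the common three-scale convention (2 if element i is more important than j, 1 if equally important, 0 if less important). The RI values are the commonly tabulated Saaty values, λ_max is approximated as the mean of (Cw)_i / w_i rather than computed by an eigenvalue solver, and all names and the sample matrix are hypothetical.

```python
import math

# Commonly tabulated random consistency index values for matrix orders 1 to 9.
RANDOM_INDEX = {1: 0.0, 2: 0.0, 3: 0.58, 4: 0.90, 5: 1.12,
                6: 1.24, 7: 1.32, 8: 1.41, 9: 1.45}


def judgment_matrix(B):
    """Turn a three-scale comparison matrix B into the judgment matrix C."""
    n = len(B)
    r = [sum(row) for row in B]  # importance ranking index r_i
    r_max, r_min = max(r), min(r)
    b_m = r_max / r_min
    C = [[1.0] * n for _ in range(n)]
    if r_max == r_min:  # all elements equally important
        return C
    for i in range(n):
        for j in range(n):
            scaled = abs(r[i] - r[j]) / (r_max - r_min) * (b_m - 1) + 1
            C[i][j] = scaled if r[i] >= r[j] else 1 / scaled
    return C


def root_method_weights(C):
    """Row products, n-th roots, then normalization."""
    n = len(C)
    roots = [math.prod(row) ** (1 / n) for row in C]
    total = sum(roots)
    return [x / total for x in roots]


def consistency_ratio(C, weights):
    """CR = CI / RI, with lambda_max approximated as mean((C w)_i / w_i)."""
    n = len(C)
    if n < 3:  # RI is 0 for orders 1 and 2; such matrices are always consistent
        return 0.0
    lam = sum(sum(C[i][j] * weights[j] for j in range(n)) / weights[i]
              for i in range(n)) / n
    return ((lam - n) / (n - 1)) / RANDOM_INDEX[n]


# Hypothetical comparison of three content classes: class 1 > class 2 > class 3.
B = [[1, 2, 2],
     [0, 1, 2],
     [0, 0, 1]]
C = judgment_matrix(B)
weights = root_method_weights(C)
cr = consistency_ratio(C, weights)
```

For this example the derived judgment matrix has first row 1, 3, 5 and passes the CR < 0.10 check, so the weights would be accepted without adjustment.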

IV. Industry value calculation

1. The industry with the highest tax revenue is taken, and its value score is set to 100;

2. The tax revenue of each other industry is divided by the tax revenue of the highest industry and multiplied by 100 to obtain the industry value of that industry;
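The two steps above amount to scaling every industry's tax revenue against the largest one. A minimal Python sketch, with a hypothetical function name and made-up revenue figures:

```python
def industry_values(tax_revenue):
    """Scale tax revenue so that the highest-revenue industry scores 100."""
    top = max(tax_revenue.values())
    return {name: tax / top * 100 for name, tax in tax_revenue.items()}


# Made-up tax revenue figures, in arbitrary units.
values = industry_values({"manufacturing": 400.0, "finance": 200.0, "education": 10.0})
```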

V. Data asset value calculation

1. The data quality factor Q_ij, the data scale factor S_ij and the content-class weight W_i are multiplied together; if the i-th class of data contains several data tables, each table is calculated first and the results of these tables are then summed;

2. Each content class is calculated by the above method, and the results are summed in turn;

3. The accumulated result is multiplied by the calculated industry value to obtain the value score V of the data asset;

Value score:

V = \text{industry value} \times \sum_{i} W_{i} \sum_{j} Q_{ij} S_{ij}

4. The value of the data asset is assessed by the value score V.
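Steps 1 to 3 can be combined into a single function, as in the following Python sketch. The input layout (one weight and a list of per-table quality and scale pairs for each content class), the function name and all numbers are hypothetical.

```python
def value_score(classes, industry_value):
    """V = industry_value * sum_i(W_i * sum_j(Q_ij * S_ij)).

    classes: list of (W_i, [(Q_ij, S_ij), ...]) with one entry per content class.
    """
    total = sum(w * sum(q * s for q, s in tables) for w, tables in classes)
    return industry_value * total


# Hypothetical data asset with two content classes in an industry valued at 50.
classes = [
    (0.6, [(0.8, 100.0), (0.5, 40.0)]),  # class with two data tables
    (0.4, [(0.9, 10.0)]),                # class with one data table
]
V = value_score(classes, industry_value=50.0)
```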

Beneficial effects of the present invention：

The big data asset evaluation method of the present invention provides a specific quantitative standard for assessing data assets, makes the evaluation process simpler and clearer, eliminates the influence of the evaluator's subjective factors, and brings the evaluation result closer to reality.

Brief description of the drawings

Fig. 1 is an overall structural diagram of data asset value assessment.

Detailed description of the embodiments

The invention is further described below with reference to the accompanying drawing and an embodiment.

The big data asset evaluation method of this embodiment includes:

I. Data quality assessment, including:

1. Calculation of data accuracy. Data accuracy describes whether the data are consistent with the features of the objective entity they correspond to;

First, a training set, a validation set and an accuracy prediction set are each sampled from the data table. For one predictable attribute f in the training set at a time, f is set as the class label and a classifier is trained, and its performance is checked on the validation set. The classifier is then used to predict the value of attribute f for each tuple in the prediction set. If the predicted value agrees with the actual value (for a numerical attribute, if the difference does not exceed a certain threshold), the attribute value is considered correct, and the proportion of accurately predicted tuples is the accuracy a_f of the data table on that attribute. This process is repeated for each attribute in the data table to obtain the accuracy a_j of each attribute:

a_{j} = \frac{n_{rj}}{n_{t}}

where j = 1, 2, ..., m, and m is the number of predictable attributes;

where n_t is the number of tuples in the prediction set and n_rj is the number of tuples correctly classified in the prediction set. The classification algorithm can be chosen freely (for example decision tree induction with C4.5, CART, etc.);

The weighted arithmetic mean of these a_j gives the overall accuracy A of the data table, i.e.:

A = \frac{\sum_{j = 1}^{m} wf_{j} a_{j}}{m}

Finally, the total accuracy of the whole data set is:

A = \frac{\sum_{i = 1}^{t} wt_{i} A_{i}}{t}

where wt_i is the weight of table i, and t is the total number of tables in the evaluated data set. The weight is calculated as:

wt_{i} = \frac{nt_{i} \times nf_{i}}{\sum_{i = 1}^{t} nt \times nf}

Predictable attributes: some attributes have a very large value range and a certain randomness, and some of this information often involves personal privacy and generally has to be desensitized in applications involving big data transactions and analysis, for example names, telephone numbers and addresses; others have no physical meaning, for example tuple IDs and certain item codes. There is no need to evaluate the accuracy of such attributes, which are called unpredictable attributes; all other attributes are called predictable attributes;

2. Calculation of data integrity I. Data integrity describes whether the data contain missing records or missing fields;

3. Calculation of data consistency C. Data consistency describes whether the values of the same attribute of the same entity agree across different systems or data sets;

4. Calculation of data time value T

5. The data quality is assessed by the formula

Q_{i} = \frac{1}{4} (A + I + C + T)

II. Data scale assessment, including:

1. Calculation of the number of data attributes. A data attribute is also called a field; in most cases the columns of a table are called fields, and each field contains information on one particular subject;

1) Calculating the number of attributes of numerical data

(1) The correlation coefficient r_A,B of attributes A and B is calculated by the formula

r_{A, B} = \frac{\sum_{i = 1}^{n} (a_{i} - \bar{A}) (b_{i} - \bar{B})}{n \sigma_{A} \sigma_{B}} = \frac{\sum_{i = 1}^{n} (a_{i} b_{i}) - n \bar{A} \bar{B}}{n \sigma_{A} \sigma_{B}}

2) Calculating the number of attributes of nominal (categorical) data

(1) Correlation is judged by the χ² test;

where R and C are the numbers of categories of the categorical variables;

① Build the correlation matrix

③ Add a column f₀ representing the initial scale benchmark of a single attribute, set to 1;

④ Compute the attribute scale compression matrix by the following procedure

which gives

fn_{c} = r'_{11} + r'_{22} + … + r'_{nn}

2. The number of data tuples tn_j in the data table is obtained by direct counting. In a two-dimensional table a tuple is also called a record: each row of the table, i.e. each record in the database, is one tuple;

3. Calculation of the unit information amount. The unit information amount indicator reflects how many different values the same attribute contains in a file;

(1) The information entropy of a discrete attribute is calculated as:

H (X) = - \sum_{i = 1}^{n} P (x_{i}) \log_{2} [P (x_{i})]

where P(x_i) is the probability with which each attribute value occurs;

(2) Calculation of the information entropy of a continuous attribute:

fn is the number of attributes of a single data table before compression;

The scale of a single data table is then calculated as:

S = tn \times fn_{c} \times \overline{H (A)}

III. Data content assessment

This gives the judgment matrix C.

(1) The weights are calculated by the root method, with the formula:

W_{i} = \frac{(\prod_{j = 1}^{n} c_{ij})^{\frac{1}{n}}}{\sum_{i = 1}^{n} (\prod_{j = 1}^{n} c_{ij})^{\frac{1}{n}}}, i = 1, 2, 3, ..., n

② take the n-th root of each component of the new vector,

③ normalize the resulting vector to obtain the weight vector;

(2) Calculate the consistency index CI

where λ_max is the largest eigenvalue of the judgment matrix C;

(3) Look up the random consistency index RI

(4) Calculate the consistency ratio CR

IV. Industry value calculation

V. Data asset value calculation

Value score V

4. The value of the data asset is assessed by the value score V.

The big data asset evaluation method of this embodiment provides a specific quantitative standard for assessing data assets, makes the evaluation process simpler and clearer, eliminates the influence of the evaluator's subjective factors, and brings the evaluation result closer to reality.

Finally, it should be noted that the above embodiments merely illustrate the technical solution of the present invention and do not limit it. Although the invention has been described in detail with reference to preferred embodiments, those of ordinary skill in the art will understand that the technical solution of the invention may be modified or equivalently substituted without departing from its spirit and scope, and all such modifications and substitutions shall fall within the scope of the claims of the present invention.

Claims

1. A big data asset evaluation method, characterized by comprising:

I. Data quality assessment, including:

1. Calculation of data accuracy

First, a training set, a validation set and an accuracy prediction set are each sampled from the data table. For one predictable attribute f in the training set at a time, f is set as the class label and a classifier is trained, and its performance is checked on the validation set. The classifier is then used to predict the value of attribute f for each tuple in the prediction set. If the predicted value agrees with the actual value (for a numerical attribute, if the difference does not exceed a certain threshold), the attribute value is considered correct, and the proportion of accurately predicted tuples is the accuracy a_f of the data table on that attribute. This process is repeated for each attribute in the data table to obtain the accuracy a_j of each attribute;

where j = 1, 2, ..., m, and m is the number of predictable attributes;

where n_t is the number of tuples in the prediction set and n_rj is the number of tuples correctly classified in the prediction set. The weighted arithmetic mean of these a_j gives the overall accuracy A of the data table, i.e.:

A = \frac{\sum_{j = 1}^{m} wf_{j} a_{j}}{m}

where j is the index of the predicted attribute and wf_j is the weight of attribute j. Its value can be determined from the value range and degree of dispersion of attribute j: the larger the value range and the higher the dispersion of an attribute, the lower the accuracy of its prediction, so the smaller the weight it should be given. The weight is calculated as:

wf_{j} = \frac{1 - \frac{h_{j}}{\sum_{j = 1}^{m} h_{j}}}{m - 1}

where h_j is the entropy of attribute j, which expresses the size of the attribute's value range and its degree of dispersion. It is calculated as:

h_{j} = - \sum_{f = 1}^{v} p_{f} \log_{2} (p_{f})

Finally, the total accuracy of the whole data set is:

A = \frac{\sum_{i = 1}^{t} wt_{i} A_{i}}{t}

wt_{i} = \frac{nt_{i} \times nf_{i}}{\sum_{i = 1}^{t} nt \times nf};

where nt_i is the number of tuples of table i, nf_i is the number of attributes of table i, nt is the number of tuples of the whole data set, and nf is the number of attributes of the whole data set;

2. Calculation of data integrity I

I = \frac{n_{null}}{n_{item}}

3. Calculation of data consistency C

C_{i} = 1 - \frac{n_{name} + n_{code} + n_{form}}{fn_{i}};

W_{i} = \frac{fn_{i}}{\sum_{i = 1}^{L} fn_{i}};

C = \sum_{i = 1}^{L} W_{i} C_{i}

These formulas take one database in the data set as the object of examination, where C_i is the consistency of the i-th database of the evaluated data set; fn_i is the total number of attributes in the i-th database; n_name is the number of attributes in the i-th database with inconsistent naming conventions; n_code is the number of attributes in the i-th database with inconsistent data codes; n_form is the number of attributes in the i-th database with inconsistent input field formats; L is the number of databases contained in the evaluated data set; and W_i is the weight of the i-th database;

4. Calculation of data time value T

C (t_{c}, t_{p}) = e^{- a (t_{c} - t_{p})}, T = C (t_{c}, t_{p}) = e^{- 0.1 (t_{c} - t_{p})}

where t_p is the time at which the information was published, t_c is the current time, C(t_c, t_p) is the influence of the information at time t_c, i.e. its time value at time t_c, and a is the aging rate coefficient of the information, which is set to 0.1;

5. The data quality is assessed by the formula

Q_{i} = \frac{1}{4} (A + I + C + T)

II. Data scale assessment, including:

1. Calculation of the number of data attributes

1) Calculating the number of attributes of numerical data

r_{A, B} = \frac{\sum_{i = 1}^{n} (a_{i} - \bar{A}) (b_{i} - \bar{B})}{n \sigma_{A} \sigma_{B}} = \frac{\sum_{i = 1}^{n} (a_{i} b_{i}) - n \bar{A} \bar{B}}{n \sigma_{A} \sigma_{B}}

where n is the number of data tuples, a_i and b_i are the values of tuple i on A and B respectively, \bar{A} and \bar{B} are the means of A and B respectively, and σ_A and σ_B are the standard deviations of A and B respectively;

(2) After the correlation coefficients are obtained, the attribute count of the numerical data is compressed, giving the sum of the attribute counts of the individual attributes;

2) Calculating the number of attributes of nominal (categorical) data

(1) Correlation is judged by the χ² test;

\chi^{2} = \sum_{i = 1}^{c} \sum_{j = 1}^{r} \frac{(o_{ij} - e_{ij})^{2}}{e_{ij}}

e_{ij} = \frac{count(A = a_{i}) \times count(B = b_{j})}{n}

where n is the number of data tuples, count(A = a_i) is the number of tuples whose value on A is a_i, and count(B = b_j) is the number of tuples whose value on B is b_j. The χ² statistic tests the hypothesis that A and B are independent, at a given significance level and with (R-1) × (C-1) degrees of freedom. The χ² value calculated by the above formula is compared with the rejection region of the χ² test to determine whether the two attributes are correlated;

Repeated calculation and testing show that χ² = n in the self-correlated case. Therefore, provided that χ² > 10.828, r_A,B can be used as the degree of correlation between the two attributes, with the formula:

r_{A, B} = \sqrt{\frac{\chi^{2}}{n \min [R - 1, C - 1]}}

where R and C are the numbers of categories of the categorical variables;

① Build the correlation matrix

R_{i} = \sum_{j = 1}^{n} r_{ij} - 1; i \in \{1, 2, ..., n\};

④ Obtain the compression matrix

fn_{c} = r'_{11} + r'_{22} + … + r'_{nn}

2. The number of data tuples tn_j in the data table is obtained by direct counting;

3. Calculation of the unit information amount

(1) The information entropy of a discrete attribute is calculated as:

H (X) = - \sum_{i = 1}^{n} P (x_{i}) \log_{2} [P (x_{i})]

where P(x_i) is the probability with which each attribute value occurs;

(2) Calculation of the information entropy of a continuous attribute:

A discretization method is first chosen to discretize the attribute, and the entropy is then calculated with the formula for discrete attributes;

fn is the number of attributes of a single data table before compression;

The scale of a single data table is then calculated as:

S = tn \times fn_{c} \times \overline{H (A)}

where S is the data scale factor of a given data table (in bits), fn_c is the number of data attributes of this table after compression, tn is the number of tuples of this table, and \overline{H(A)} is the average information entropy of all attributes;

III. Data content assessment

r_{i} = \sum_{j = 1}^{n} b_{ij}, i = 1, 2, ..., n

c_{ij} = \begin{cases} \frac{r_{i} - r_{j}}{r_{max} - r_{min}} \times (b_{m} - 1) + 1, & r_{i} \geq r_{j} \\ \left[ \frac{r_{j} - r_{i}}{r_{max} - r_{min}} \times (b_{m} - 1) + 1 \right]^{- 1}, & r_{i} < r_{j} \end{cases}

This gives the judgment matrix C.

(1) The weights are calculated by the root method, with the formula:

W_{i} = \frac{(\prod_{j = 1}^{n} c_{ij})^{\frac{1}{n}}}{\sum_{i = 1}^{n} (\prod_{j = 1}^{n} c_{ij})^{\frac{1}{n}}}, i = 1, 2, 3, ..., n

② take the n-th root of each component of the new vector,

③ normalize the resulting vector to obtain the weight vector;

(2) Calculate the consistency index CI

CI = \frac{\lambda_{max} - n}{n - 1}

where λ_max is the largest eigenvalue of the judgment matrix C;

(3) Look up the random consistency index RI

(4) Calculate the consistency ratio CR

CR = \frac{CI}{RI}

When CR < 0.10, the consistency of the judgment matrix is considered acceptable; otherwise the judgment matrix should be adjusted appropriately. The weight of each class of data classified by content is thus obtained;

IV. Industry value calculation

2. The tax revenue of each other industry is divided by the tax revenue of the highest industry and multiplied by 100 to obtain the industry value of that industry;

V. Data asset value calculation

1. The data quality factor Q_ij, the data scale factor S_ij and the content-class weight W_i are multiplied together; if the i-th class of data contains several data tables, each table is calculated first and the results of these tables are then summed;

4. The value of the data asset is assessed by the value score V.