CN109783698B

CN109783698B - Industrial production data entity identification method based on Merkle-tree

Info

Publication number: CN109783698B
Application number: CN201910035568.XA
Authority: CN
Inventors: 王妍; 曾辉; 杨冰清; 李玉诺
Original assignee: Liaoning University
Current assignee: Liaoning University
Priority date: 2019-01-15
Filing date: 2019-01-15
Publication date: 2023-05-26
Anticipated expiration: 2039-01-15
Also published as: CN109783698A

Abstract

The industrial production data entity identification method based on Merkle-tree comprises the following steps: 1) to address the fluctuation of industrial production data, the data are standardized using the properties of matrices and vectors, so that the numerical attribute values of the same entity are identical; 2) the information entropy of each attribute column is calculated to obtain the sensitivity of each attribute, attributes with low sensitivity are removed, and the remaining attributes are sorted in descending order of sensitivity; 3) a chain structure called St-Chain is proposed, progressive hash coding is performed based on St-Chain, and entities with the same hash value are placed in the same block; 4) for the structure obtained in step 3), the hash values of the subsequent attributes in each tuple are calculated, the blocks are repeatedly subdivided according to the differences in hash value, and the entity identification result is finally obtained. The entity identification method provided by the invention has moderate running efficiency and higher identification precision.

Description

Industrial production data entity identification method based on Merkle-tree

Technical Field

The invention relates to an entity identification method, in particular to an industrial production data entity identification method based on Merkle-tree.

Background

With the development of information technology and Internet of Things technology, modern industry has accumulated a large amount of data over time. However, because certain variables in the production process are hard to control, and because the variables influence and depend on one another, the data fluctuate. Industrial big data holds great value, and how to improve its usability has become a research hotspot.

Entity identification is an important method for improving data usability: it divides a data set into several entity sets and thereby resolves the entity identity problem in the data set. Existing methods first partition the data with identity problems into blocks, then calculate the similarity of the data within each block, and finally make a matching decision based on the similarity and a specific matching strategy. Because industrial production data fluctuate, applying traditional entity identification techniques greatly increases the misjudgment rate of the identification result. The present work therefore identifies entities directly on the data with identity problems, which avoids the accumulation of errors and reduces the misjudgment rate, and it takes measures to preserve identification efficiency while improving identification precision.

Disclosure of Invention

To solve the problems in the prior art, the invention provides an industrial production data entity identification algorithm based on Merkle-tree. The method draws on the idea of the Merkle-tree and proposes a Pro M-tree structure. Progressive hash coding based on the Pro M-tree improves entity identification precision while preserving identification efficiency. A chain structure called "St-Chain" is proposed to support the blocking operations in progressive hash coding. To address the fluctuation of industrial production data, a data standardization method based on the properties of matrices and vectors is provided, which ensures that the numerical attribute values of the same entity are consistent and prepares the data for the subsequent hash coding.

To achieve the above purpose, the technical scheme adopted by the invention is as follows: an industrial production data entity identification method based on Merkle-tree, characterized by comprising the following steps:

step 1), to address the fluctuation of industrial production data, standardizing the data using the properties of matrices and vectors, so that the numerical attribute values of the same entity are identical;

step 2), calculating the information entropy of each attribute column to obtain the sensitivity of each attribute, removing attributes with low sensitivity, and sorting the remaining attributes in descending order of sensitivity;

step 3), proposing a chain structure called St-Chain, performing progressive hash coding on the sorted attributes based on St-Chain, and placing entities with the same hash value in the same block;

step 4), for the structure obtained in step 3), continuing to calculate the hash values of the subsequent attributes in each tuple, repeatedly subdividing the blocks according to the differences in hash value, and finally obtaining the entity identification result.

The specific method in the step 1) is as follows:

1) Sampling the original data set by the standard names of the entities; the resulting sample set is called the standard entity sample set S. Calculating $a(i)' \cdot B = \{a(i)' \cdot b(k_1),\ a(i)' \cdot b(k_2),\ \dots,\ a(i)' \cdot b(k_n)\}$,

wherein: $a(i)' \cdot b(key) = \{a(i)' \cdot b(1),\ a(i)' \cdot b(2),\ \dots,\ a(i)' \cdot b(k)\}$, and $a(i)'$ denotes the transpose of $a(i)$;

tuple vector set $A = \{a(i) \mid a(i) \in A,\ i = 1, 2, 3, \dots, n\}$;

standard entity matrix set $B = \{b(key) \mid key: \text{standard entity name}\}$, wherein matrix $b(key) = \{b(j) \mid b(j) \in S,\ j = 1, 2, 3, \dots, k\}$;

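2) According to the vector cosine formula for two vectors $x$ and $y$,

$\cos\theta = \dfrac{x \cdot y}{\lVert x \rVert\,\lVert y \rVert}$

calculating the cosine value of each vector in each $a(i)' \cdot b(key)$ to form a cosine value vector $c(key) = \{\cos\theta_1, \cos\theta_2, \cos\theta_3, \dots, \cos\theta_k\}$;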

3) Summing the elements in each vector c (key) and then calculating the average avg thereof;

4) Calculating max (avg), finding out a standard entity corresponding to the maximum avg, and calculating the average value of each attribute of the standard entity;

5) Taking the average value of each corresponding attribute as the standard value, which completes the data standardization.
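Sub-steps 1) to 5) can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the names `cosine`, `standardize`, `samples`, and the example values are hypothetical, every tuple is assumed to be a purely numeric vector, and each standard entity is assumed to be supplied as a list of sample rows.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)

def standardize(tuples, samples):
    """Map each tuple a(i) to the attribute means of its best-matching standard entity.

    tuples  -- list of numeric vectors a(i)
    samples -- dict: standard entity name (key) -> list of sample rows b(j)
               (hypothetical layout of the sample set S)
    """
    # Column-wise mean of every standard entity's sample rows (sub-step 4).
    means = {key: [sum(col) / len(col) for col in zip(*rows)]
             for key, rows in samples.items()}
    result = []
    for a in tuples:
        # avg(key): mean of the cosine vector c(key) for this tuple (sub-steps 2-3).
        avg = {key: sum(cosine(a, b) for b in rows) / len(rows)
               for key, rows in samples.items()}
        best = max(avg, key=avg.get)       # entity with max(avg)
        result.append((best, means[best])) # standard values replace the raw ones
    return result

# Made-up sample data for illustration only.
samples = {
    "steel_A": [[10.0, 1.0], [10.2, 1.1]],
    "steel_B": [[1.0, 10.0], [1.1, 9.8]],
}
tuples = [[9.9, 1.05], [1.05, 10.1]]
standardized = standardize(tuples, samples)
```

Because the cosine depends only on direction, two tuples whose values drift slightly around the same standard entity still receive identical standardized values, which is what the later hash comparison relies on.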

The specific method in the step 2) is as follows:

The sensitivity of an attribute is judged by its information entropy: the larger the information entropy of an attribute, the more diverse its values and the stronger its ability to distinguish entities, that is, the higher its sensitivity. The information entropy formula is as follows:

$H = -\sum_{i} p_i \log p_i$

where $p_i$ is the probability of occurrence of a given attribute value within the attribute. Attributes with low information entropy are removed, and the remaining attributes are sorted in descending order of information entropy so that the hash values of high-sensitivity attributes are calculated first.
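The entropy-based ranking can be sketched as follows. The function names, the `min_entropy` cut-off, and the sample rows are assumptions made for illustration; the text does not say how "low" entropy is decided, and the base-2 logarithm is likewise an assumed choice (it does not affect the ordering).

```python
import math
from collections import Counter

def column_entropy(values):
    """Shannon entropy (base 2, an assumed choice) of one attribute column."""
    counts = Counter(values)
    total = len(values)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def rank_attributes(rows, names, min_entropy):
    """Drop attributes whose entropy is below min_entropy, sort the rest descending.

    min_entropy is a hypothetical threshold standing in for "low sensitivity".
    """
    columns = list(zip(*rows))
    entropies = {name: column_entropy(col) for name, col in zip(names, columns)}
    kept = [name for name in names if entropies[name] >= min_entropy]
    ranked = sorted(kept, key=lambda name: entropies[name], reverse=True)
    return ranked, entropies

# Made-up rows: (grade, batch, plant)
rows = [("A", 1, "x"), ("A", 2, "x"), ("B", 3, "x"), ("B", 4, "x")]
ranked, entropies = rank_attributes(rows, ["grade", "batch", "plant"], min_entropy=0.5)
```

In this toy data the constant `plant` column has zero entropy and is removed, while `batch`, whose four values are all distinct, is ranked ahead of `grade`.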

The specific method in the step 3) is as follows:

1) According to the ordered attribute sequence, calculating the hash value of a certain attribute in each tuple;

2) Dividing the data set into blocks according to the differences in hash value to form the chain structure St-Chain.
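A first-level blocking pass might look like the sketch below. The internal layout of St-Chain is not spelled out in the text, so the `Block` class, the use of SHA-256, and the reading of `num` as the number of tuples in a block are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass, field

@dataclass
class Block:
    """One node of the chain (hypothetical layout): a hash value and its tuples."""
    hash_value: str
    members: list = field(default_factory=list)

    @property
    def num(self):
        # Assumed meaning of "num": how many tuples share this hash value.
        return len(self.members)

def attr_hash(value):
    """Hash of one attribute value; SHA-256 is an assumed choice of hash function."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()

def build_chain(rows, attr_index):
    """Group tuples by the hash of one attribute, keeping first-seen order."""
    chain = {}
    for row in rows:
        h = attr_hash(row[attr_index])
        if h not in chain:
            chain[h] = Block(h)
        chain[h].members.append(row)
    return list(chain.values())

# Made-up tuples: (steel grade, tapping temperature)
rows = [("Q235", 1520), ("Q235", 1535), ("Q345", 1520)]
chain = build_chain(rows, attr_index=0)
```

Tuples whose first attribute hashes to the same value end up in the same block, so only blocks holding more than one tuple need further inspection.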

The specific method in the step 4) is as follows:

1) For the St-Chain structure obtained in step 3), if num in the structure is greater than 1, continuing to calculate the hash values of the subsequent attributes in the block and continuing to divide the block according to the differences in hash value; if num = 1, the entity in the block has no identity problem, and no subsequent attribute hash value needs to be calculated;

2) Repeating the operation of 1) until the entity identification is completed, and obtaining a final identification result.
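The repeated subdivision can be sketched as a loop over the sorted attributes. As before, this is an illustrative reading rather than the patented code: `identify`, `split`, the SHA-256 hash, and the interpretation of `num` as the block size are assumptions.

```python
import hashlib

def attr_hash(value):
    """Hash of one attribute value; SHA-256 is an assumed choice."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()

def split(block, attr_index):
    """Divide one block into sub-blocks by the hash of the given attribute."""
    groups = {}
    for row in block:
        groups.setdefault(attr_hash(row[attr_index]), []).append(row)
    return list(groups.values())

def identify(rows, attr_order):
    """Progressively refine blocks, attribute by attribute, in the given order."""
    blocks = [list(rows)]
    for attr_index in attr_order:
        next_blocks = []
        for block in blocks:
            if len(block) > 1:
                # num > 1: keep hashing the next attribute and subdivide.
                next_blocks.extend(split(block, attr_index))
            else:
                # num = 1: no identity problem, skip further hashing.
                next_blocks.append(block)
        blocks = next_blocks
    return blocks

# Made-up, already standardized tuples: (grade, temperature, line)
rows = [
    ("Q235", 1520, "L1"),
    ("Q235", 1520, "L1"),
    ("Q235", 1535, "L1"),
    ("Q345", 1520, "L2"),
]
entities = identify(rows, attr_order=[0, 1, 2])
```

Each block that survives all attributes with more than one tuple is treated here as a set of records describing the same entity; singleton blocks drop out of the computation early, which is where the efficiency of the descending-sensitivity order comes from.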

The beneficial effects of the invention are as follows:

Compared with the prior art, the industrial production data entity identification method based on Merkle-tree provided by the invention draws on the idea of the Merkle-tree and proposes a Pro M-tree structure. Progressive hash coding based on the Pro M-tree improves entity identification precision while preserving identification efficiency. The sensitivity of each attribute is obtained from its information entropy, and the attributes are sorted in descending order of information entropy, so that the hash values of high-sensitivity attributes are calculated first, which further supports identification efficiency.

Drawings

FIG. 1 is a flow chart of the method of the present invention.

FIG. 2 shows the F1 values of the IDEM algorithm of the present invention compared with those of the other algorithms on the data set.

Detailed Description

The industrial production data entity identification method based on Merkle-tree comprises the following steps:

Step 1), to address the fluctuation of industrial production data, the data are standardized using the properties of matrices and vectors, so that the numerical attribute values of the same entity are identical.

The specific method comprises the following steps:

tuple vector set $A = \{a(i) \mid a(i) \in A,\ i = 1, 2, 3, \dots, n\}$;

2) According to the vector cosine formula

Step 2), the information entropy of each attribute column is calculated to obtain the sensitivity of each attribute, attributes with low sensitivity are removed, and the remaining attributes are sorted in descending order of sensitivity.

The sensitivity of an attribute is judged here by its information entropy. Information entropy measures the expected information content of a random variable: the larger the entropy of a variable, the more different outcomes it can take, that is, the more information it carries. Accordingly, the greater the information entropy of an attribute, the more diverse its values and the stronger its ability to distinguish entities, that is, the higher its sensitivity. The information entropy formula is as follows:

$H = -\sum_{i} p_i \log p_i$

where $p_i$ is the probability of occurrence of a given attribute value within the attribute. Attributes with low information entropy are removed, and the remaining attributes are sorted in descending order of information entropy so that the hash values of high-sensitivity attributes are calculated as early as possible, which better preserves identification efficiency.

Step 3), a chain structure called St-Chain is proposed; progressive hash coding is performed based on St-Chain, and entities with the same hash value are placed in the same block.

The specific method comprises the following steps:

Example 1:

1) Experimental data set

Because industrial production processes are confidential, the data set used in the experiment is a simulated data set generated with the generating tool datafactor according to the regularities given in a handbook of commonly used charts and data for steelmaking. The IDEM algorithm is compared with the traditional threshold-based entity identification method (Part) and the more general similarity-graph-based entity identification algorithm (Clustering). Table 1 gives some of the properties of the data set used in the experiment.

Table 1 data set used in experiments

2) Experimental results and analysis

The experimental environment is as follows: Intel Pentium 3.0 GHz CPU, 4 GB memory, Windows 7 operating system, and PyCharm as the programming environment.

FIG. 2 and Table 2 show the F1 values and the run times, respectively, of the IDEM algorithm, the Part algorithm, and the Clustering algorithm on the data set.

As shown in FIG. 2, the F1 values of the IDEM algorithm are on average about 15.76% and 24.23% higher than those of the other two entity identification algorithms, respectively, or about 20% higher overall. In terms of F1 value, the IDEM algorithm therefore clearly outperforms both the Part algorithm and the Clustering algorithm.
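The overall figure of about 20% is simply the mean of the two reported gains. The second line below assumes that F1 carries its usual meaning, the harmonic mean of precision $P$ and recall $R$; the text itself does not restate the definition.

```latex
\[
\frac{15.76\% + 24.23\%}{2} = 19.995\% \approx 20\%
\]
\[
F_1 = \frac{2PR}{P + R}
\]
```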

Table 2 run time of each algorithm

As shown in Table 2, the IDEM algorithm has no advantage in identification efficiency, but the gap is small. Its running efficiency is on average 8.5% lower than that of the other two algorithms; in other words, the IDEM algorithm gives up 8.5% in efficiency in exchange for a gain of about 20% in F1 value. The IDEM algorithm is therefore better suited to entity identification on massive industrial production data.

Because the method adopts a hash algorithm, entity identification accuracy on massive industrial production data is improved to a certain extent. Information entropy is used to obtain the sensitivity of each attribute, and the modified Merkle-tree structure, Pro M-tree, is proposed for progressive hash coding, so that accuracy is improved while identification efficiency is preserved.

Claims

1. An industrial production data entity identification method based on Merkle-tree, characterized by comprising the following steps:

1.1 Sampling the original data set by the standard names of the entities; the resulting sample set is called the standard entity sample set S; calculating $a(i)' \cdot B = \{a(i)' \cdot b(k_1),\ a(i)' \cdot b(k_2),\ \dots,\ a(i)' \cdot b(k_n)\}$,

$a(i)' \cdot b(key) = \{a(i)' \cdot b(1),\ a(i)' \cdot b(2),\ \dots,\ a(i)' \cdot b(n)\}$, where $a(i)'$ denotes the transpose of $a(i)$;

tuple vector set $A = \{a(i) \mid a(i) \in A,\ i = 1, 2, 3, \dots, n\}$;

standard entity matrix set $B = \{b(key) \mid key: \text{standard entity name}\}$, wherein matrix $b(key) = \{b(j) \mid b(j) \in S,\ j = 1, 2, 3, \dots, n\}$;

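1.2 According to the vector cosine formula for two vectors $x$ and $y$,

$\cos\theta = \dfrac{x \cdot y}{\lVert x \rVert\,\lVert y \rVert}$

calculating the cosine value of each vector in each $a(i)' \cdot b(key)$ to form a cosine value vector $c(key) = \{\cos\theta_1, \cos\theta_2, \cos\theta_3, \dots, \cos\theta_k\}$;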

1.3 Summing the elements in each vector c (key) and then calculating the average avg thereof;

1.4 Calculating max (avg), finding out a standard entity corresponding to the maximum avg, and calculating the average value of each attribute of the standard entity;

1.5 Taking the average value of each corresponding attribute as the standard value to finish the data standardization processing;

where $p_i$ is the probability of occurrence of a given attribute value within the attribute; removing the attributes with low information entropy, sorting the remaining attributes in descending order of information entropy, and calculating the hash values of the high-sensitivity attributes first;

step 3), a Chain structure called "St-Chain" is proposed; performing progressive hash coding on the ordered attributes based on St-Chain, and dividing entities with the same hash value into the same block;

3.1 According to the ordered attribute sequence, calculating the hash value of a certain attribute in each tuple;

3.2 Dividing the data set into blocks according to the differences in hash value to form the chain structure St-Chain;

step 4), continuously calculating hash values of subsequent attributes in each tuple for the structure obtained in the step 3), repeatedly dividing the structure into blocks according to the difference of the hash values, and finally obtaining an entity identification result;

4.1 For the St-Chain structure obtained in step 3), if num in the structure is greater than 1, continuing to calculate the hash values of the subsequent attributes in the block and continuing to divide the block according to the differences in hash value; if num = 1, the entity in the block has no identity problem, and no subsequent attribute hash value needs to be calculated;

4.2 Repeating the operation of 4.1 until the entity identification is completed, and obtaining the final identification result.