CN109783698B - Industrial production data entity identification method based on Merkle-tree

Publication number
CN109783698B
Authority
CN
China
Prior art keywords
attribute
entity
calculating
sensitivity
vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910035568.XA
Other languages
Chinese (zh)
Other versions
CN109783698A (en)
Inventor
王妍
曾辉
杨冰清
李玉诺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liaoning University
Original Assignee
Liaoning University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liaoning University
Priority to CN201910035568.XA
Publication of CN109783698A
Application granted
Publication of CN109783698B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30: Computing systems specially adapted for manufacturing

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The industrial production data entity identification method based on Merkle-tree comprises the following steps: 1) To address the floatability of industrial production data, the data are standardized using the properties of matrices and vectors, so that the numerical attribute values of the same entity are identical; 2) The information entropy of each attribute column is calculated to obtain the sensitivity of each attribute, attributes with low sensitivity are removed, and the remaining attributes are sorted in descending order of sensitivity; 3) A chain structure called St-Chain is proposed, progressive hash coding is performed on the basis of St-Chain, and entities with the same hash value are placed in the same block; 4) For the structure obtained in step 3), the hash value of the subsequent attribute in each tuple is calculated, the blocks are repeatedly subdivided according to the differences in hash values, and the entity identification result is finally obtained. The proposed entity identification method offers higher identification precision at moderate running efficiency.

Description

Industrial production data entity identification method based on Merkle-tree
Technical Field
The invention relates to an entity identification method, in particular to an industrial production data entity identification method based on Merkle-tree.
Background
With the development of information technology and Internet of Things technology, modern industry has accumulated a large amount of data over time. Because certain variables in the production process are difficult to control, and the variables influence and depend on one another, the recorded values fluctuate; this property is referred to here as floatability. Industrial big data holds great value, and improving its usability has become a research hotspot.
Entity identification is an important method for improving data usability: it divides a data set into a number of entity sets and thereby resolves the entity identity problem in the data set. Existing methods first partition the data with the identity problem into blocks, then calculate the similarity of the data within each block, and finally make a matching decision based on the similarity and a specific matching strategy. Because industrial production data have floatability, applying these traditional entity identification techniques greatly increases the misjudgment rate of the identification result. The present method therefore identifies entities directly on the data with the identity problem, which avoids error superposition and reduces the misjudgment rate, and it takes measures to preserve identification efficiency while improving identification precision.
Disclosure of Invention
To solve the problems in the prior art, the invention provides an industrial production data entity identification algorithm based on Merkle-tree. The method draws on the idea of the Merkle-tree and proposes a Pro M-tree structure. Because the Pro M-tree is based on hash coding, it improves entity identification precision while preserving identification efficiency. A chain structure called "St-Chain" is proposed to support the blocking operations in progressive hash coding. To address the floatability of industrial production data, a data standardization method based on the properties of matrices and vectors is provided, which makes the numerical attribute values of the same entity consistent and prepares the data for the subsequent hash coding.
To achieve the above purpose, the technical scheme adopted by the invention is as follows: the industrial production data entity identification method based on Merkle-tree comprises the following steps:
Step 1), to address the floatability of industrial production data, standardizing the data using the properties of matrices and vectors, so that the numerical attribute values of the same entity are identical;
Step 2), calculating the information entropy of each attribute column to obtain the sensitivity of each attribute, removing attributes with low sensitivity, and sorting the remaining attributes in descending order of sensitivity;
Step 3), proposing a chain structure called St-Chain, performing progressive hash coding on the sorted attributes based on St-Chain, and placing entities with the same hash value in the same block;
Step 4), for the structure obtained in step 3), continuing to calculate the hash value of the subsequent attribute in each tuple, repeatedly subdividing the blocks according to the differences in hash values, and finally obtaining the entity identification result.
The specific method of step 1) is as follows:
1) Sampling the original data set by the standard name of each entity to obtain a sample set, called the standard entity sample set S; calculating
$$a(i)^{\top} \cdot B = \{\, a(i)^{\top} \cdot b(k_1),\ a(i)^{\top} \cdot b(k_2),\ \ldots,\ a(i)^{\top} \cdot b(k_n) \,\},$$
wherein $a(i)^{\top} \cdot b(\mathrm{key}) = \{\, a(i)^{\top} \cdot b(1),\ a(i)^{\top} \cdot b(2),\ \ldots,\ a(i)^{\top} \cdot b(k) \,\}$, and $a(i)^{\top}$ denotes the transpose of $a(i)$;
the tuple vector set is $A = \{\, a(i) \mid a(i) \in A,\ i = 1, 2, 3, \ldots, n \,\}$;
the standard entity matrix set is $B = \{\, b(\mathrm{key}) \mid \mathrm{key}: \text{standard entity name} \,\}$, wherein the matrix $b(\mathrm{key}) = \{\, b(j) \mid b(j) \in S,\ j = 1, 2, 3, \ldots, k \,\}$;
2) According to the vector cosine formula
$$\cos\theta_j = \frac{a(i)^{\top} \cdot b(j)}{\lVert a(i) \rVert \, \lVert b(j) \rVert},$$
calculating the cosine between $a(i)$ and each vector $b(j)$ of $b(\mathrm{key})$ to form the cosine vector $c(\mathrm{key}) = \{\cos\theta_1, \cos\theta_2, \cos\theta_3, \ldots, \cos\theta_k\}$;
3) Summing the elements of each vector $c(\mathrm{key})$ and then calculating their average avg;
4) Calculating max(avg), finding the standard entity corresponding to the maximum avg, and calculating the average value of each attribute of that standard entity;
5) Taking the average value of each corresponding attribute as the standard value, which completes the data standardization.
The specific method of step 2) is as follows:
The sensitivity of an attribute is judged by its information entropy: the larger the information entropy of an attribute, the more diverse its values and the stronger its ability to distinguish entities, that is, the higher its sensitivity. The information entropy formula is as follows:
$$H = -\sum_{i} p_i \log p_i,$$
wherein $p_i$ is the probability of occurrence of the $i$-th value of the attribute. Attributes with low information entropy are removed, the remaining attributes are sorted in descending order of information entropy, and the hash values of highly sensitive attributes are calculated first.
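As a small worked illustration of this ranking, take four tuples and three attribute columns X, Y and Z. The columns and the base-2 logarithm are assumptions made for the example; the text does not fix a logarithm base.

```latex
\begin{gather*}
H = -\sum_i p_i \log_2 p_i \\
X = (u, u, v, v): \quad H(X) = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1 \\
Y = (u, u, u, u): \quad H(Y) = -\left(1 \cdot \log_2 1\right) = 0 \\
Z = (u, v, w, x): \quad H(Z) = -4 \cdot \tfrac{1}{4}\log_2\tfrac{1}{4} = 2
\end{gather*}
```

Under these assumptions Z would be hashed first and X second, while Y cannot distinguish any entities and would be removed.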
The specific method of step 3) is as follows:
1) Following the sorted attribute order, calculating the hash value of one attribute in each tuple;
2) Dividing the data into blocks according to the differences in hash values, forming the chain structure St-Chain.
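Hash-based blocking only works if tuples of the same entity carry exactly the same attribute values, which is why standardization comes first. The following minimal sketch illustrates this; the use of SHA-256 from Python's `hashlib`, the helper name `attr_hash`, and the sample readings are all assumptions for illustration, since the text does not name a hash function.

```python
import hashlib


def attr_hash(value) -> str:
    """Hash the textual form of one attribute value (SHA-256 is an assumed choice)."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


# Two raw readings of the same entity that differ only by fluctuation (hypothetical values).
raw_a, raw_b = 1520.3, 1520.4
# After standardization both readings carry the same standard value.
std_a = std_b = 1520.35

same_before = attr_hash(raw_a) == attr_hash(raw_b)
same_after = attr_hash(std_a) == attr_hash(std_b)
print(same_before, same_after)  # False True
```

Without standardization the two readings would land in different blocks even though they describe the same entity.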
The specific method of step 4) is as follows:
1) In the St-Chain structure obtained in step 3), if num of a block is greater than 1, continuing to calculate the hash value of the subsequent attribute within that block and further dividing the block according to the differences in hash values; if num = 1, the entity in the block has no identity problem, and no subsequent attribute hash value needs to be calculated;
2) Repeating operation 1) until entity identification is complete, yielding the final identification result.
The beneficial effects of the invention are as follows:
compared with the prior art, the Merkle-tree-based industrial production data entity identification method provided by the invention draws on the idea of the Merkle-tree and proposes a Pro M-tree structure. Progressive hash coding based on the Pro M-tree improves entity identification precision while preserving identification efficiency. Information entropy is used to obtain the sensitivity of each attribute, and the attributes are sorted in descending order of information entropy so that the hash values of highly sensitive attributes are calculated first, which further safeguards identification efficiency.
Drawings
FIG. 1 is a flow chart of the method of the present invention.
FIG. 2 compares the F1 values of the IDEM algorithm of the present invention with those of other algorithms on the dataset.
Detailed Description
The industrial production data entity identification method based on Merkle-tree comprises the following steps:
Step 1), to address the floatability of industrial production data, standardizing the data using the properties of matrices and vectors, so that the numerical attribute values of the same entity are identical.
The specific method is as follows:
1) Sampling the original data set by the standard name of each entity to obtain a sample set, called the standard entity sample set S; calculating
$$a(i)^{\top} \cdot B = \{\, a(i)^{\top} \cdot b(k_1),\ a(i)^{\top} \cdot b(k_2),\ \ldots,\ a(i)^{\top} \cdot b(k_n) \,\},$$
wherein $a(i)^{\top} \cdot b(\mathrm{key}) = \{\, a(i)^{\top} \cdot b(1),\ a(i)^{\top} \cdot b(2),\ \ldots,\ a(i)^{\top} \cdot b(k) \,\}$, and $a(i)^{\top}$ denotes the transpose of $a(i)$;
the tuple vector set is $A = \{\, a(i) \mid a(i) \in A,\ i = 1, 2, 3, \ldots, n \,\}$;
the standard entity matrix set is $B = \{\, b(\mathrm{key}) \mid \mathrm{key}: \text{standard entity name} \,\}$, wherein the matrix $b(\mathrm{key}) = \{\, b(j) \mid b(j) \in S,\ j = 1, 2, 3, \ldots, k \,\}$;
2) According to the vector cosine formula
$$\cos\theta_j = \frac{a(i)^{\top} \cdot b(j)}{\lVert a(i) \rVert \, \lVert b(j) \rVert},$$
calculating the cosine between $a(i)$ and each vector $b(j)$ of $b(\mathrm{key})$ to form the cosine vector $c(\mathrm{key}) = \{\cos\theta_1, \cos\theta_2, \cos\theta_3, \ldots, \cos\theta_k\}$;
3) Summing the elements of each vector $c(\mathrm{key})$ and then calculating their average avg;
4) Calculating max(avg), finding the standard entity corresponding to the maximum avg, and calculating the average value of each attribute of that standard entity;
5) Taking the average value of each corresponding attribute as the standard value, which completes the data standardization.
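The five sub-steps above can be sketched as follows. This is one possible reading, not the patented implementation: the function names `cosine` and `standardize`, the dictionary layout of the sample set, and the sample values are hypothetical, and the sketch assumes that every numerical attribute of a tuple is replaced by the attribute-wise mean of its best-matching standard entity.

```python
import math
from statistics import fmean


def cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def standardize(tuples, samples):
    """Map each tuple a(i) to (best standard entity, attribute-wise standard values).

    tuples  -- list of numeric vectors a(i)
    samples -- dict: standard entity name (key) -> list of sample vectors b(j)
    """
    result = []
    for a in tuples:
        # Sub-steps 2-3: average cosine between a(i) and the samples of each standard entity.
        avg = {key: fmean([cosine(a, b) for b in rows]) for key, rows in samples.items()}
        # Sub-step 4: the standard entity with the largest average cosine.
        best = max(avg, key=avg.get)
        # Sub-step 5: the attribute-wise mean of that entity becomes the standard value.
        standard = [fmean(col) for col in zip(*samples[best])]
        result.append((best, standard))
    return result


# Hypothetical sample set S with two standard entities and two fluctuating tuples.
samples = {
    "steel_A": [[10.0, 1.0], [10.2, 0.9]],
    "steel_B": [[1.0, 10.0], [0.9, 10.1]],
}
tuples = [[10.1, 1.1], [1.1, 9.9]]
result = standardize(tuples, samples)
print([name for name, _ in result])  # ['steel_A', 'steel_B']
```

Note that cosine similarity compares direction only, so two standard entities whose attribute vectors are nearly proportional would be hard to tell apart with this measure alone.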
Step 2), calculating the information entropy of each attribute column to obtain the sensitivity of each attribute, removing attributes with low sensitivity, and sorting the remaining attributes in descending order of sensitivity.
Here the sensitivity of an attribute is judged by its information entropy. Information entropy measures the expected amount of information carried by a random variable: the larger the entropy of a variable, the more distinct situations it can take and the more information it contains. Accordingly, the larger the information entropy of an attribute, the more diverse its values and the stronger its ability to distinguish entities, that is, the higher its sensitivity. The information entropy formula is as follows:
$$H = -\sum_{i} p_i \log p_i,$$
wherein $p_i$ is the probability of occurrence of the $i$-th value of the attribute. Attributes with low information entropy are removed, and the remaining attributes are sorted in descending order of information entropy so that the hash values of highly sensitive attributes are calculated as early as possible, which better safeguards identification efficiency.
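A minimal sketch of this attribute ranking is shown below. The function names, the sample rows, the base-2 logarithm, and the cut-off `min_entropy` are assumptions for illustration; the text does not state how low an entropy must be for an attribute to be removed.

```python
import math
from collections import Counter


def entropy(column):
    """Shannon entropy (base 2, an assumed choice) of the values in one attribute column."""
    n = len(column)
    return -sum((c / n) * math.log2(c / n) for c in Counter(column).values())


def rank_attributes(rows, names, min_entropy=0.5):
    """Drop attributes with entropy below min_entropy; sort the rest in descending order."""
    scored = [(name, entropy([row[i] for row in rows])) for i, name in enumerate(names)]
    kept = [(name, h) for name, h in scored if h >= min_entropy]
    return sorted(kept, key=lambda item: item[1], reverse=True)


# Hypothetical tuples: (grade, temperature, production line).
rows = [
    ("Q235", 1520, "L1"),
    ("Q235", 1530, "L1"),
    ("Q345", 1540, "L1"),
    ("Q345", 1550, "L1"),
]
names = ["grade", "temp", "line"]
print(rank_attributes(rows, names))  # [('temp', 2.0), ('grade', 1.0)]
```

The constant `line` column has zero entropy and is dropped, while `temp`, which takes four distinct values, is ranked first.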
Step 3), proposing a chain structure called St-Chain, performing progressive hash coding based on St-Chain, and placing entities with the same hash value in the same block.
The specific method is as follows:
1) Following the sorted attribute order, calculating the hash value of one attribute in each tuple;
2) Dividing the data into blocks according to the differences in hash values, forming the chain structure St-Chain.
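The text does not spell out the layout of St-Chain, so the following is only a sketch of one plausible form: a singly linked chain of blocks, each holding a hash value and the tuples that share it. The class `Block`, the function `build_chain`, the SHA-256 hash, and the reading of `num` as the number of tuples in a block are all assumptions.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Block:
    """One node of a hypothetical St-Chain: a hash value, its tuples, and a link to the next block."""
    hash_value: str
    tuples: list = field(default_factory=list)
    next: Optional["Block"] = None

    @property
    def num(self) -> int:
        # Assumed meaning of "num": how many tuples the block holds.
        return len(self.tuples)


def attr_hash(value) -> str:
    """Hash the textual form of one attribute value (SHA-256 is an assumed choice)."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def build_chain(rows, attr_index):
    """Group rows by the hash of one attribute and link the groups into a chain."""
    blocks = {}
    for row in rows:
        h = attr_hash(row[attr_index])
        blocks.setdefault(h, Block(h)).tuples.append(row)
    head = None
    # Link the blocks back to front so the chain keeps first-seen order.
    for block in reversed(list(blocks.values())):
        block.next = head
        head = block
    return head


rows = [("Q235", 1520), ("Q345", 1540), ("Q235", 1520)]
head = build_chain(rows, 0)
nums = []
node = head
while node is not None:
    nums.append(node.num)
    node = node.next
print(nums)  # [2, 1]
```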
Step 4), for the structure obtained in step 3), continuing to calculate the hash value of the subsequent attribute in each tuple, repeatedly subdividing the blocks according to the differences in hash values, and finally obtaining the entity identification result.
The specific method is as follows:
1) In the St-Chain structure obtained in step 3), if num of a block is greater than 1, continuing to calculate the hash value of the subsequent attribute within that block and further dividing the block according to the differences in hash values; if num = 1, the entity in the block has no identity problem, and no subsequent attribute hash value needs to be calculated;
2) Repeating operation 1) until entity identification is complete, yielding the final identification result.
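The progressive refinement can be sketched as a loop over the sorted attributes. The function `identify_entities`, the flat list-of-blocks representation, and the sample rows are hypothetical; the sketch also assumes that `num` is the number of tuples in a block and that tuples still sharing a block after the last attribute are treated as the same entity.

```python
import hashlib
from collections import defaultdict


def attr_hash(value) -> str:
    """Hash the textual form of one attribute value (SHA-256 is an assumed choice)."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def identify_entities(rows, attr_order):
    """Split rows into blocks attribute by attribute.

    rows       -- list of standardized tuples
    attr_order -- attribute indices, already sorted by descending sensitivity
    """
    blocks = [list(rows)]
    for idx in attr_order:
        refined = []
        for block in blocks:
            if len(block) == 1:
                # num = 1: no identity problem left, skip further hashing.
                refined.append(block)
                continue
            groups = defaultdict(list)
            for row in block:
                groups[attr_hash(row[idx])].append(row)
            refined.extend(groups.values())
        blocks = refined
    return blocks


# Hypothetical standardized tuples: (grade, temperature, production line).
rows = [
    ("Q235", 1520, "L1"),
    ("Q235", 1520, "L1"),
    ("Q345", 1540, "L1"),
    ("Q235", 1600, "L2"),
]
blocks = identify_entities(rows, attr_order=[1, 0])
print([len(b) for b in blocks])  # [2, 1, 1]
```

Hashing the most sensitive attribute first isolates most tuples early, so later attributes are hashed only inside the few blocks that still hold more than one tuple.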
Example 1:
1) Experimental data set
Because industrial production processes are confidential, the data set used in the experiment is a simulated data set produced with the generating tool datafactor, following the regularities given in a handbook of common charts and data for steelmaking. The IDEM algorithm is compared with the traditional threshold-based entity identification method (Part) and the more general similarity-graph-based entity identification algorithm (Clustering). Table 1 lists some of the properties of the data set used in the experiment.
Table 1. Data set used in the experiments
[Table 1 is an image in the original publication and is not reproduced here.]
2) Experimental results and analysis
The experimental environment is as follows: Intel Pentium 3.0 GHz CPU, 4 GB memory, Windows 7 operating system, and PyCharm as the programming environment.
Fig. 2 and Table 2 show the F1 values and the run times, respectively, of the IDEM algorithm, the Part algorithm, and the Clustering algorithm on the dataset.
As shown in Fig. 2, the F1 value of the IDEM algorithm is on average about 15.76% and 24.23% higher than those of the other two entity identification algorithms, respectively, or about 20% higher overall. In terms of F1 value, the IDEM algorithm therefore clearly outperforms the Part algorithm and the Clustering algorithm.
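For reference, the F1 value is the harmonic mean of precision P and recall R, and the overall figure quoted above is consistent with a plain mean of the two per-algorithm gains. The text does not state how precision and recall were counted or how the overall average was formed, so the second line below is an assumption.

```latex
\begin{gather*}
F_1 = \frac{2PR}{P + R} \\
\frac{15.76\% + 24.23\%}{2} = 19.995\% \approx 20\%
\end{gather*}
```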
Table 2. Run time of each algorithm
[Table 2 is an image in the original publication and is not reproduced here.]
As shown in Table 2, the IDEM algorithm has no advantage in identification efficiency, but the gap is small: its running efficiency is on average 8.5% lower than that of the other two algorithms. In other words, the IDEM algorithm trades 8.5% of efficiency for a 20% gain in accuracy, which makes it more suitable for entity identification on massive industrial production data.
Because the method adopts a hash algorithm, it improves entity identification accuracy on massive industrial production data to a certain extent. Information entropy is used to obtain attribute sensitivity, and the modified Merkle-tree structure, Pro M-tree, is used for progressive hash coding, so that accuracy is improved while identification efficiency is preserved.

Claims (1)

1. An industrial production data entity identification method based on Merkle-tree, characterized by comprising the following steps:
step 1), to address the floatability of industrial production data, standardizing the data using the properties of matrices and vectors, so that the numerical attribute values of the same entity are identical;
1.1 Sampling the original data set by the standard name of each entity to obtain a sample set, called the standard entity sample set S; calculating
$$a(i)^{\top} \cdot B = \{\, a(i)^{\top} \cdot b(k_1),\ a(i)^{\top} \cdot b(k_2),\ \ldots,\ a(i)^{\top} \cdot b(k_n) \,\},$$
wherein $a(i)^{\top} \cdot b(\mathrm{key}) = \{\, a(i)^{\top} \cdot b(1),\ a(i)^{\top} \cdot b(2),\ \ldots,\ a(i)^{\top} \cdot b(n) \,\}$, and $a(i)^{\top}$ denotes the transpose of $a(i)$;
the tuple vector set is $A = \{\, a(i) \mid a(i) \in A,\ i = 1, 2, 3, \ldots, n \,\}$;
the standard entity matrix set is $B = \{\, b(\mathrm{key}) \mid \mathrm{key}: \text{standard entity name} \,\}$, wherein the matrix $b(\mathrm{key}) = \{\, b(j) \mid b(j) \in S,\ j = 1, 2, 3, \ldots, n \,\}$;
1.2 According to the vector cosine formula
$$\cos\theta_j = \frac{a(i)^{\top} \cdot b(j)}{\lVert a(i) \rVert \, \lVert b(j) \rVert},$$
calculating the cosine between $a(i)$ and each vector $b(j)$ of $b(\mathrm{key})$ to form the cosine vector $c(\mathrm{key}) = \{\cos\theta_1, \cos\theta_2, \cos\theta_3, \ldots, \cos\theta_k\}$;
1.3 Summing the elements of each vector $c(\mathrm{key})$ and then calculating their average avg;
1.4 Calculating max(avg), finding the standard entity corresponding to the maximum avg, and calculating the average value of each attribute of that standard entity;
1.5 Taking the average value of each corresponding attribute as the standard value, which completes the data standardization;
step 2), calculating the information entropy of each attribute column to obtain the sensitivity of each attribute, removing attributes with low sensitivity, and sorting the remaining attributes in descending order of sensitivity;
judging the sensitivity of an attribute by its information entropy, wherein the larger the information entropy of the attribute, the more diverse its values and the stronger its ability to distinguish entities, that is, the higher its sensitivity; the information entropy formula is as follows:
$$H = -\sum_{i} p_i \log p_i,$$
wherein $p_i$ is the probability of occurrence of the $i$-th value of the attribute; removing the attributes with low information entropy, sorting the remaining attributes in descending order of information entropy, and calculating the hash values of highly sensitive attributes first;
step 3), a Chain structure called "St-Chain" is proposed; performing progressive hash coding on the ordered attributes based on St-Chain, and dividing entities with the same hash value into the same block;
3.1 Following the sorted attribute order, calculating the hash value of one attribute in each tuple;
3.2 Dividing the data into blocks according to the differences in hash values, forming the chain structure St-Chain;
step 4), continuously calculating hash values of subsequent attributes in each tuple for the structure obtained in the step 3), repeatedly dividing the structure into blocks according to the difference of the hash values, and finally obtaining an entity identification result;
4.1 In the St-Chain structure obtained in step 3), if num of a block is greater than 1, continuing to calculate the hash value of the subsequent attribute within that block and further dividing the block according to the differences in hash values; if num = 1, the entity in the block has no identity problem, and no subsequent attribute hash value needs to be calculated;
4.2 Repeating the operation of 4.1 until entity identification is complete, yielding the final identification result.
CN201910035568.XA 2019-01-15 2019-01-15 Industrial production data entity identification method based on Merkle-tree Active CN109783698B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910035568.XA CN109783698B (en) 2019-01-15 2019-01-15 Industrial production data entity identification method based on Merkle-tree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910035568.XA CN109783698B (en) 2019-01-15 2019-01-15 Industrial production data entity identification method based on Merkle-tree

Publications (2)

Publication Number Publication Date
CN109783698A CN109783698A (en) 2019-05-21
CN109783698B CN109783698B (en) 2023-05-26

Family

ID=66500472

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910035568.XA Active CN109783698B (en) 2019-01-15 2019-01-15 Industrial production data entity identification method based on Merkle-tree

Country Status (1)

Country Link
CN (1) CN109783698B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110377605B (en) * 2019-07-24 2023-04-25 贵州大学 Sensitive attribute identification and classification method for structured data
CN110609907A (en) * 2019-09-17 2019-12-24 湖南大学 Medicine field knowledge reasoning method based on random walk

Citations (1)

Publication number Priority date Publication date Assignee Title
CN109165202A (en) * 2018-07-04 2019-01-08 华南理工大学 A kind of preprocess method of multi-source heterogeneous big data

Family Cites Families (3)

Publication number Priority date Publication date Assignee Title
CN101482876B (en) * 2008-12-11 2011-11-09 南京大学 Weight-based link multi-attribute entity recognition method
CN104239553A (en) * 2014-09-24 2014-12-24 江苏名通信息科技有限公司 Entity recognition method based on Map-Reduce framework
CN107480549B (en) * 2017-06-28 2019-08-02 银江股份有限公司 A kind of sensitive information desensitization method and system that data-oriented is shared

Also Published As

Publication number Publication date
CN109783698A (en) 2019-05-21

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant