CN109783698A

CN109783698A - Industrial production data entity recognition method based on Merkle-tree

Info

Publication number: CN109783698A
Application number: CN201910035568.XA
Authority: CN
Inventors: 王妍; 曾辉; 杨冰清; 李玉诺
Original assignee: Liaoning University
Current assignee: Liaoning University
Priority date: 2019-01-15
Filing date: 2019-01-15
Publication date: 2019-05-21
Anticipated expiration: 2039-01-15
Also published as: CN109783698B

Abstract

An industrial production data entity recognition method based on Merkle-tree, with the following steps: 1) to address the fluctuation of industrial production data, standardize the data using properties of matrices and vectors, ensuring that the numeric attribute values of the same entity are identical; 2) compute the information entropy of each attribute column to obtain the sensitivity of each attribute, remove attributes with low sensitivity, and sort the remaining attributes in descending order of sensitivity; 3) propose a chain structure called "St-Chain", perform progressive hash coding based on St-Chain, and place entities with identical hash values in the same block; 4) for the structure obtained in step 3), continue computing the hash values of subsequent attributes in each tuple and repeatedly partition the blocks according to whether the hash values match, finally obtaining the entity recognition result. Through the above method, the invention provides an entity recognition method with moderate running efficiency and higher recognition accuracy.

Description

Industrial production data entity recognition method based on Merkle-tree

Technical field

The invention relates to an entity recognition method, in particular an industrial production data entity recognition method based on Merkle-tree.

Background technique

With the development of information technology and Internet of Things technology, modern industry has accumulated large amounts of data over time. However, because certain variables are hard to control during production, and the variables influence and depend on one another, these data fluctuate. Industrial big data holds great value, and how to improve its usability has become a research hotspot.

Entity recognition, an important way to improve data usability, divides a data set into several entity sets and thereby resolves the entity identity problem in the data set. Existing methods first partition data that has an identity problem into blocks before operating on it, then compute similarity over the data in each block, and finally make a matching decision from the similarity and a specific matching strategy. Industrial production data, however, fluctuate, so the misjudgment rate of traditional entity recognition techniques rises significantly. The present work therefore performs entity recognition directly on data with an identity problem, avoiding error accumulation and reducing the misjudgment rate, and adopts measures to preserve recognition efficiency while improving recognition accuracy.

Summary of the invention

To solve the problems of the prior art, the invention provides an industrial production data entity recognition algorithm based on Merkle-tree. The method draws on the idea of the Merkle-tree and proposes a Pro M-tree structure. Hash coding based on the Pro M-tree improves entity recognition accuracy while preserving recognition efficiency. A chain structure called "St-Chain" is proposed to support the blocking operation during progressive hash coding. To address the fluctuation of industrial production data, a data standardization method based on properties of matrices and vectors is proposed; it ensures that the numeric attribute values of the same entity are consistent, in preparation for the subsequent hash coding.

To achieve the above object, the invention adopts the following technical solution: an industrial production data entity recognition method based on Merkle-tree, characterized by the following steps:

Step 1) To address the fluctuation of industrial production data, standardize the data using properties of matrices and vectors, ensuring that the numeric attribute values of the same entity are identical;

Step 2) Compute the information entropy of each attribute column to obtain the sensitivity of each attribute, remove attributes with low sensitivity, and sort the remaining attributes in descending order of sensitivity;

Step 3) Propose a chain structure called "St-Chain"; perform progressive hash coding on the sorted attributes based on St-Chain, and place entities with identical hash values in the same block;

Step 4) For the structure obtained in step 3), continue computing the hash values of subsequent attributes in each tuple and repeatedly partition the blocks according to whether the hash values match, finally obtaining the entity recognition result.

The specific method of step 1) is:

1) Sample the raw data set by entity name; the resulting sample set is called the standard entity sample set S. Compute a(i)'b = {a(i)'b(k1), a(i)'b(k2), ..., a(i)'b(k3)},

where a(i)'b(key) = {a(i)'b(1), a(i)'b(2), ..., a(i)'b(k)} and a(i)' denotes the transpose of a(i);

the tuple vector set is a = {a(i) | a(i) ∈ A, i = 1, 2, 3, ..., n};

the standard entity matrix set is b = {b(key) | key: standard entity name}, where matrix b(key) = {b(j) | b(j) ∈ S, j = 1, 2, 3, ..., k};

2) Using the vector cosine formula

$$\cos\theta_j = \frac{a(i)' \, b(j)}{\lVert a(i) \rVert \, \lVert b(j) \rVert}$$

compute the cosine value for each vector in each a(i)'b(key), forming a cosine value vector c(key) = {cos θ1, cos θ2, cos θ3, ..., cos θk};

3) Sum the elements of each vector c(key), then compute their average avg;

4) Compute max(avg), find the standard entity corresponding to the largest avg, and compute the average of each of its attributes;

5) Take those attribute averages as the standard values, completing the data standardization.
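The five sub-steps above can be sketched roughly as follows. This is only an illustrative reading of the text, not the patented implementation: the function names `cosine` and `standardize`, the dictionary layout of the standard entity samples, and the choice to return the matched entity name alongside its attribute means are all assumptions made for the example.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def standardize(tuples, samples):
    """Map each tuple to the attribute means of its best-matching standard entity.

    tuples  -- list of numeric vectors a(i)
    samples -- dict: entity name (key) -> list of sample vectors b(j), i.e. b(key)
    Both argument layouts are hypothetical; the source does not fix them.
    """
    # Attribute-wise mean of each standard entity's samples (the "standard value").
    means = {
        key: [sum(col) / len(rows) for col in zip(*rows)]
        for key, rows in samples.items()
    }
    result = []
    for a in tuples:
        # avg(key): mean cosine between a(i) and every sample vector of that entity.
        avg = {
            key: sum(cosine(a, b) for b in rows) / len(rows)
            for key, rows in samples.items()
        }
        best = max(avg, key=avg.get)  # entity with the largest avg
        result.append((best, means[best]))
    return result


if __name__ == "__main__":
    samples = {"A": [[1.0, 0.0], [0.9, 0.1]], "B": [[0.0, 1.0], [0.1, 0.9]]}
    for name, standard in standardize([[0.95, 0.05], [0.05, 1.0]], samples):
        print(name, standard)
```

Replacing a fluctuating tuple with the attribute means of its closest standard entity is what lets later hash comparisons treat two slightly different readings of the same entity as equal.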

The specific method of step 2) is:

Information entropy is used to determine the sensitivity of an attribute. The larger an attribute's information entropy, the more diverse its values and the stronger its ability to distinguish entities, i.e. the higher its sensitivity. The information entropy formula is:

$$H = -\sum_i p_i \log p_i$$

where p_i is the probability that a given attribute value occurs in the attribute. Attributes with low information entropy are removed, the remaining attributes are sorted in descending order of information entropy, and the hash values of the more sensitive attributes are computed first.
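As a rough illustration of this step, the sketch below scores each attribute column by the entropy of its empirical value distribution, drops low-entropy columns, and sorts the rest in descending order. The logarithm base (2), the cut-off parameter `min_entropy`, and the function names are assumptions for the example; the source does not say how "low" entropy is decided.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2, an assumed choice) of a column's value distribution."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())


def rank_attributes(rows, min_entropy=0.5):
    """Return (column index, entropy) pairs for the kept attributes, highest entropy first.

    min_entropy is a hypothetical threshold used here to drop low-sensitivity attributes.
    """
    columns = list(zip(*rows))
    scored = [(i, entropy(col)) for i, col in enumerate(columns)]
    kept = [(i, h) for i, h in scored if h >= min_entropy]
    return sorted(kept, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    rows = [("x", 1, "a"), ("x", 2, "a"), ("x", 3, "b"), ("x", 4, "b")]
    # Column 0 is constant (entropy 0) and is dropped; column 1 is the most diverse.
    print(rank_attributes(rows))
```

A constant column cannot separate any two entities, which is why it is the first to be discarded under this scoring.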

The specific method of step 3) is:

1) Following the sorted attribute order, compute the hash value of a given attribute in each tuple;

2) According to whether the hash values match, partition the data into blocks, forming the chain structure "St-Chain".
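One possible way to picture this blocking step is sketched below. The source does not describe the internal layout of St-Chain, so the list-of-dicts representation, the use of SHA-256, and the reading of `num` as the number of tuples in a block are all assumptions made for illustration.

```python
import hashlib


def attr_hash(value):
    """SHA-256 digest of a value's string form (the hash function is an assumed choice)."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def build_st_chain(rows, attr):
    """Group row indices into blocks that share the hash of attribute `attr`.

    Returns a list of blocks in first-seen order (the "chain"). Each block is a
    dict holding the shared hash, the member row indices, and num, taken here
    to be the number of members.
    """
    blocks = {}
    for idx, row in enumerate(rows):
        blocks.setdefault(attr_hash(row[attr]), []).append(idx)
    return [
        {"hash": h, "rows": members, "num": len(members)}
        for h, members in blocks.items()
    ]


if __name__ == "__main__":
    rows = [("steel", 7.85), ("steel", 7.85), ("iron", 7.87)]
    for block in build_st_chain(rows, attr=0):
        print(block["rows"], block["num"])
```

Because equal values always hash to the same digest, comparing digests is an exact-match test, which is why the standardization in step 1) has to run first.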

The specific method of step 4) is:

1) In the St-Chain structure obtained in step 3), if num in the structure is greater than 1, continue computing the hash values of subsequent attributes in that block and keep partitioning according to whether the hash values match; if num = 1, the entity in that block has no identity problem and no further attribute hash values need to be computed;

2) Repeat operation 1) until entity recognition is complete, yielding the final recognition result.
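The repeated refinement can be sketched as a loop over the sorted attributes. This is a minimal sketch under stated assumptions: `num` is read as the block size, SHA-256 stands in for the unspecified hash function, and the loop stops when the attribute list is exhausted, at which point rows still sharing a block are treated as the same entity. None of these details are confirmed by the source.

```python
import hashlib


def attr_hash(value):
    """SHA-256 digest of a value's string form (the hash function is an assumed choice)."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def progressive_blocks(rows, ordered_attrs):
    """Split rows attribute by attribute; returns a list of blocks of row indices.

    ordered_attrs -- column indices, most sensitive attribute first.
    """
    if not rows:
        return []
    blocks = [list(range(len(rows)))]
    for attr in ordered_attrs:
        next_blocks = []
        for block in blocks:
            if len(block) == 1:
                # num = 1: no identity problem, so skip further hashing.
                next_blocks.append(block)
                continue
            split = {}
            for idx in block:
                split.setdefault(attr_hash(rows[idx][attr]), []).append(idx)
            next_blocks.extend(split.values())
        blocks = next_blocks
    return blocks


if __name__ == "__main__":
    rows = [
        ("steel", 7.85, "A"),
        ("steel", 7.85, "A"),
        ("steel", 7.80, "B"),
        ("iron", 7.87, "C"),
    ]
    print(progressive_blocks(rows, [0, 1, 2]))
```

Skipping singleton blocks is where the efficiency gain comes from: a block that is already unambiguous costs no further hash computations.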

The beneficial effects of the invention are:

Compared with the prior art, the proposed industrial production data entity recognition method based on Merkle-tree draws on the idea of the Merkle-tree and proposes a Pro M-tree structure. Progressive hash coding based on the Pro M-tree improves entity recognition accuracy while preserving recognition efficiency. Information entropy gives the sensitivity of each attribute, and attributes are sorted in descending order of information entropy so that highly sensitive attributes are hashed first, which further safeguards recognition efficiency.

Description of the drawings

Fig. 1 is a flow chart of the method of the invention.

Fig. 2 compares the F1 values of the IDEM algorithm proposed by the invention and other algorithms on the data sets.

Specific embodiment

An industrial production data entity recognition method based on Merkle-tree, characterized by the following steps:

Step 1) To address the fluctuation of industrial production data, standardize the data using properties of matrices and vectors, ensuring that the numeric attribute values of the same entity are identical.

The specific method is:

The tuple vector set is a = {a(i) | a(i) ∈ A, i = 1, 2, 3, ..., n};

Step 2) Compute the information entropy of each attribute column to obtain the sensitivity of each attribute, remove attributes with low sensitivity, and sort the remaining attributes in descending order of sensitivity.

Information entropy is used here to determine the sensitivity of an attribute. Information entropy measures the expected value of the occurrence of a random variable: the larger a variable's information entropy, the more distinct cases it can take, that is, the more information it carries. Here, the larger an attribute's information entropy, the more diverse its values and the stronger its ability to distinguish entities, i.e. the higher its sensitivity. The information entropy formula is:

$$H = -\sum_i p_i \log p_i$$

where p_i is the probability that a given attribute value occurs in the attribute. Attributes with low information entropy are removed and the remaining attributes are sorted in descending order of information entropy, so that the hash values of highly sensitive attributes are computed as early as possible, which better safeguards recognition efficiency.

Step 3) Propose a chain structure called "St-Chain"; perform progressive hash coding based on St-Chain, and place entities with identical hash values in the same block.

The specific method is:

Embodiment 1:

1) Experimental data set

Because industrial production processes are confidential, the data set used in this experiment is a simulated data set generated with the generator tool datafactory according to the rules in the "Handbook of Common Charts and Data for Steelmaking". The IDEM algorithm proposed here is compared with a traditional threshold-based entity recognition method (Part) and a more general entity recognition algorithm based on similarity-graph clustering (Clustering). Table 1 gives some properties of the data sets used in the experiment.

Table 1. Data sets used in the experiment

2) Experimental results and analysis

Experimental environment: Intel Pentium 3.0 GHz CPU, 4 GB memory, Windows 7 operating system; the programming software is PyCharm.

Fig. 2 and Table 2 give, respectively, the F1 values and running times of the IDEM, Part, and Clustering algorithms on the data sets.

As shown in Fig. 2, the F1 value of the proposed IDEM algorithm is on average about 15.76% and 24.23% higher than those of the other two entity recognition algorithms, respectively, or about 20% higher overall. In terms of F1 value, therefore, the proposed IDEM algorithm is far better than the Part and Clustering algorithms.
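The overall figure of 20% appears to be the simple mean of the two per-algorithm improvements. That reading is an assumption, since the text does not state how the overall average was formed, but the arithmetic is consistent with it:

```latex
\frac{15.76\% + 24.23\%}{2} = 19.995\% \approx 20\%
```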

Table 2. Running time of each algorithm

As shown in Table 2, although the IDEM algorithm has no advantage in recognition efficiency, the gap is small. Compared with the other two algorithms, the running efficiency of the IDEM algorithm is 8.5% lower on average; that is, the proposed IDEM algorithm trades 8.5% of efficiency for 20% of accuracy. The proposed IDEM algorithm is therefore better suited to entity recognition on massive industrial production data.

Because a hash algorithm is used, entity recognition accuracy on massive industrial production data improves to some degree. Information entropy gives the sensitivity of each attribute, and the improved Merkle-tree structure "Pro M-tree" is proposed together with progressive hash coding, so that recognition efficiency is preserved while accuracy improves.

Claims

1. An industrial production data entity recognition method based on Merkle-tree, characterized in that the steps are:

Step 1) To address the fluctuation of industrial production data, standardize the data using properties of matrices and vectors, ensuring that the numeric attribute values of the same entity are identical;

Step 2) Compute the information entropy of each attribute column to obtain the sensitivity of each attribute, remove attributes with low sensitivity, and sort the remaining attributes in descending order of sensitivity;

Step 3) Propose a chain structure called "St-Chain"; perform progressive hash coding on the sorted attributes based on St-Chain, and place entities with identical hash values in the same block;

Step 4) For the structure obtained in step 3), continue computing the hash values of subsequent attributes in each tuple and repeatedly partition the blocks according to whether the hash values match, finally obtaining the entity recognition result.

2. The industrial production data entity recognition method based on Merkle-tree according to claim 1, characterized in that the specific method of step 1) is:

1) Sample the raw data set by entity name; the resulting sample set is called the standard entity sample set S. Compute a(i)'b = {a(i)'b(k1), a(i)'b(k2), ..., a(i)'b(k3)},

where a(i)'b(key) = {a(i)'b(1), a(i)'b(2), ..., a(i)'b(k)} and a(i)' denotes the transpose of a(i);

the tuple vector set is a = {a(i) | a(i) ∈ A, i = 1, 2, 3, ..., n};

2) Using the vector cosine formula

$$\cos\theta_j = \frac{a(i)' \, b(j)}{\lVert a(i) \rVert \, \lVert b(j) \rVert}$$

compute the cosine value for each vector in each a(i)'b(key), forming a cosine value vector c(key) = {cos θ1, cos θ2, cos θ3, ..., cos θk};

3. The industrial production data entity recognition method based on Merkle-tree according to claim 1, characterized in that the specific method of step 2) is:

Information entropy is used to determine the sensitivity of an attribute. The larger an attribute's information entropy, the more diverse its values and the stronger its ability to distinguish entities, i.e. the higher its sensitivity. The information entropy formula is:

$$H = -\sum_i p_i \log p_i$$

where p_i is the probability that a given attribute value occurs in the attribute. Attributes with low information entropy are removed, the remaining attributes are sorted in descending order of information entropy, and the hash values of the more sensitive attributes are computed first.

4. The industrial production data entity recognition method based on Merkle-tree according to claim 1, characterized in that the specific method of step 3) is:

5. The industrial production data entity recognition method based on Merkle-tree according to claim 1, characterized in that the specific method of step 4) is:

1) In the St-Chain structure obtained in step 3), if num in the structure is greater than 1, continue computing the hash values of subsequent attributes in that block and keep partitioning according to whether the hash values match; if num = 1, the entity in that block has no identity problem and no further attribute hash values need to be computed;