CN109783698A - Industrial production data entity recognition method based on Merkle-tree - Google Patents

Industrial production data entity recognition method based on Merkle-tree

Publication number
CN109783698A
Authority
CN
China
Prior art keywords
attribute
cryptographic hash
entity
susceptibility
industrial production
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201910035568.XA
Other languages
Chinese (zh)
Other versions
CN109783698B (en)
Inventor
王妍
曾辉
杨冰清
李玉诺
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Liaoning University
Original Assignee
Liaoning University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Liaoning University
Priority to CN201910035568.XA
Publication of CN109783698A
Application granted
Publication of CN109783698B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00: Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30: Computing systems specially adapted for manufacturing

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

An industrial production data entity recognition method based on Merkle-tree, with the following steps: 1) to handle the fluctuation of industrial production data, standardize the data using the properties of matrices and vectors, so that the numeric attribute values of the same entity are identical; 2) compute the information entropy of each attribute column to obtain the sensitivity of each attribute, remove attributes with low sensitivity, and sort the remaining attributes in descending order of sensitivity; 3) propose a chain structure called "St-Chain" and perform progressive hash coding based on St-Chain, placing entities with identical hash values in the same block; 4) for the structure obtained in step 3), continue computing the hash values of subsequent attributes in each tuple and, according to whether the hash values are the same or different, repeat the blocking until the entity recognition result is obtained. Through the above method, the present invention provides an entity recognition method with moderate running efficiency and higher recognition accuracy.

Description

Industrial production data entity recognition method based on Merkle-tree
Technical field
The invention relates to an entity recognition method, in particular to an industrial production data entity recognition method based on Merkle-tree.
Background art
With the development of information technology and the Internet of Things, modern industry has accumulated large amounts of data over time. However, because some variables are hard to control during production, and the variables influence and depend on one another, these data fluctuate. Industrial big data holds great value, and how to improve its availability has become a research hotspot.
Entity recognition, an important way to improve data availability, partitions a data set into several entity sets and thereby resolves the entity identity problem in the data set. Existing methods first block the data that have an identity problem, then compute the similarity of the data within each block, and finally make a matching decision based on the similarity and a specific matching strategy. Industrial production data fluctuate, however, so the misjudgment rate of traditional entity recognition techniques rises significantly. This work therefore performs entity recognition directly on the data that have the identity problem, which avoids error accumulation and lowers the misjudgment rate, and it takes measures to preserve recognition efficiency while improving recognition accuracy.
Summary of the invention
To solve the problems of the prior art, the present invention provides an industrial production data entity recognition algorithm based on Merkle-tree. The method borrows the idea of the Merkle-tree and proposes a Pro M-tree structure. Hash coding based on Pro M-tree preserves recognition efficiency while improving entity recognition accuracy. A chain structure called "St-Chain" is proposed to support the blocking operation during progressive hash coding. To handle the fluctuation of industrial production data, a data standardization method based on the properties of matrices and vectors is proposed; it makes the numeric attribute values of the same entity consistent and prepares the data for the subsequent hash coding.
To achieve the above object, the technical solution adopted by the invention is an industrial production data entity recognition method based on Merkle-tree, characterized in that the steps are:
Step 1): to handle the fluctuation of industrial production data, standardize the data using the properties of matrices and vectors, so that the numeric attribute values of the same entity are identical;
Step 2): compute the information entropy of each attribute column to obtain the sensitivity of each attribute, remove attributes with low sensitivity, and sort the remaining attributes in descending order of sensitivity;
Step 3): propose a chain structure called "St-Chain" and perform progressive hash coding on the sorted attributes based on St-Chain, placing entities with identical hash values in the same block;
Step 4): for the structure obtained in step 3), continue computing the hash values of subsequent attributes in each tuple and, according to whether the hash values are the same or different, repeat the blocking until the entity recognition result is obtained.
The specific method of step 1) is:
1) Sample the raw data set by entity name; the resulting sample set is called the standard entity sample set S. Compute a(i)′b = {a(i)′b(k1), a(i)′b(k2), …, a(i)′b(k3)},
where a(i)′b(key) = {a(i)′b(1), a(i)′b(2), …, a(i)′b(k)} and a(i)′ denotes the transpose of a(i);
the tuple vector set is a = {a(i) | a(i) ∈ A, i = 1, 2, 3, …, n};
the standard entity matrix set is b = {b(key) | key: standard entity name}, where matrix b(key) = {b(j) | b(j) ∈ S, j = 1, 2, 3, …, k};
2) Using the vector cosine formula $\cos\theta_j = \dfrac{a(i)'\,b(j)}{\lVert a(i)\rVert\,\lVert b(j)\rVert}$, compute the cosine value for each vector in each a(i)′b(key), forming a cosine value vector c(key) = {cos θ1, cos θ2, cos θ3, …, cos θk};
3) Sum the elements of each vector c(key) and compute their average avg;
4) Compute max(avg), find the standard entity corresponding to the largest avg, and compute the average of each of its attributes;
5) Take each attribute average as the standard value of that attribute, which completes the data standardization.
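The standardization steps above can be sketched as follows. This is only an illustrative reading of the text, not the patented implementation: the function names `cosine` and `standardize`, the sample entity names, and the numeric readings are all hypothetical, and the sketch assumes every tuple is a purely numeric vector of the same length as the standard samples.

```python
import math

def cosine(x, y):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(p * q for p, q in zip(x, y))
    norm = math.sqrt(sum(p * p for p in x)) * math.sqrt(sum(q * q for q in y))
    return dot / norm if norm else 0.0

def standardize(tuples, samples):
    """Map each tuple to its best-matching standard entity and that entity's attribute means.

    tuples:  list of numeric vectors a(i)
    samples: dict mapping a standard entity name (key) to its sample vectors b(j)
    """
    # Per-attribute mean of every standard entity's samples (the standard values).
    means = {
        key: [sum(col) / len(rows) for col in zip(*rows)]
        for key, rows in samples.items()
    }
    result = []
    for a in tuples:
        # avg(key): mean cosine between a(i) and all samples of that entity.
        avg = {
            key: sum(cosine(a, b) for b in rows) / len(rows)
            for key, rows in samples.items()
        }
        best = max(avg, key=avg.get)  # entity with the largest avg
        result.append((best, means[best]))
    return result

# Hypothetical sample data: two standard entities, three fluctuating readings.
samples = {
    "steel_A": [[1.0, 10.0], [1.2, 9.8]],
    "steel_B": [[10.0, 1.0], [9.8, 1.2]],
}
readings = [[1.1, 10.1], [0.9, 9.9], [10.2, 0.9]]
standardized = standardize(readings, samples)
```

After this pass the first two readings carry identical numeric values, which is what lets a later exact-match hash treat them as the same entity.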
The specific method of step 2) is:
The sensitivity of an attribute is determined by its information entropy. The larger the information entropy of an attribute, the more diverse its values and the stronger its ability to distinguish entities, that is, the higher its sensitivity. The information entropy formula is: $H = -\sum_{i} p_i \log p_i$
where $p_i$ is the probability that a given attribute value occurs in the attribute. Attributes with low information entropy are removed, the remaining attributes are sorted in descending order of information entropy, and the hash values of the most sensitive attributes are computed first.
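As a rough illustration of this ranking step, the sketch below computes the entropy of each column and orders the attributes. The base-2 logarithm, the `min_entropy` cutoff, the helper names, and the sample rows are assumptions made for the example; the text does not state a logarithm base or a removal threshold.

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy (base 2, an assumption) of a column's empirical distribution."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

def rank_attributes(rows, names, min_entropy):
    """Drop attributes whose entropy is below min_entropy; sort the rest descending."""
    columns = list(zip(*rows))
    scored = [(entropy(col), name) for name, col in zip(names, columns)]
    kept = [(h, name) for h, name in scored if h >= min_entropy]
    kept.sort(key=lambda t: t[0], reverse=True)
    return [name for _, name in kept]

# Hypothetical tuples: (grade, density, plant)
rows = [
    ("Q235", 7.85, "plant1"),
    ("Q345", 7.85, "plant1"),
    ("Q235", 7.85, "plant2"),
    ("Q195", 7.85, "plant2"),
]
order = rank_attributes(rows, ["grade", "density", "plant"], min_entropy=0.5)
```

Here `density` takes a single value, so its entropy is zero and it is dropped; `grade` (1.5 bits) is ranked ahead of `plant` (1 bit).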
The specific method of step 3) is:
1) Following the sorted attribute order, compute the hash value of one attribute in each tuple;
2) According to whether the hash values are the same or different, divide the data into blocks, forming the chain structure "St-Chain".
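One possible shape for this first blocking pass is sketched below. The text does not specify the hash function or the exact layout of St-Chain, so SHA-256, the `Block` class, and its fields are assumptions; only the `num` count is taken from the later description of step 4).

```python
import hashlib
from dataclasses import dataclass, field
from typing import Optional

@dataclass
class Block:
    """One chain node: a hash value, the tuples sharing it, and a link to the next node."""
    hash_value: str
    tuples: list = field(default_factory=list)
    next: Optional["Block"] = None

    @property
    def num(self):
        """Number of tuples in this block."""
        return len(self.tuples)

def attr_hash(value):
    """Hash one attribute value (SHA-256 is an assumed choice)."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()

def build_chain(rows, attr_index):
    """Group rows by the hash of one attribute and link the groups into a chain."""
    blocks = {}
    for row in rows:
        h = attr_hash(row[attr_index])
        blocks.setdefault(h, Block(h)).tuples.append(row)
    head = None
    # Link in reverse so the chain keeps first-seen order from head to tail.
    for block in reversed(list(blocks.values())):
        block.next = head
        head = block
    return head

rows = [("Q235", 1), ("Q345", 2), ("Q235", 3)]
head = build_chain(rows, attr_index=0)
```

Walking the chain from `head` visits one block per distinct hash value, each carrying its tuples and count.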
The specific method of step 4) is:
1) In the St-Chain structure obtained in step 3), if num of a block is greater than 1, continue computing the hash values of subsequent attributes within that block and keep splitting the block according to whether the hash values are the same or different; if num = 1, the entity in that block has no identity problem and no further attribute hash values need to be computed;
2) Repeat operation 1) until entity recognition is complete, which gives the final recognition result.
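The repeated refinement can be sketched as a loop over the sorted attributes. This is a simplified illustration that uses plain lists instead of a linked St-Chain; the function names, SHA-256, and the sample rows are assumptions.

```python
import hashlib

def attr_hash(value):
    """Hash one attribute value (SHA-256 is an assumed choice)."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()

def resolve(rows, attr_order):
    """Split rows block by block, hashing one attribute at a time."""
    blocks = [list(rows)]
    for idx in attr_order:
        next_blocks = []
        for block in blocks:
            if len(block) == 1:
                # num = 1: the block holds a single entity, skip further hashing.
                next_blocks.append(block)
                continue
            groups = {}
            for row in block:
                groups.setdefault(attr_hash(row[idx]), []).append(row)
            next_blocks.extend(groups.values())
        blocks = next_blocks
    return blocks

# Hypothetical standardized tuples: (grade, plant, density)
rows = [
    ("Q235", "plant1", 7.85),
    ("Q235", "plant1", 7.85),
    ("Q235", "plant2", 7.85),
    ("Q345", "plant1", 7.85),
]
entities = resolve(rows, attr_order=[0, 1, 2])
```

Because singleton blocks are skipped, the number of hash computations shrinks as the blocks get finer, which is where ordering the attributes by sensitivity pays off.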
The beneficial effects of the invention are:
Compared with the prior art, the industrial production data entity recognition method based on Merkle-tree proposed by the present invention borrows the idea of the Merkle-tree and proposes a Pro M-tree structure. Progressive hash coding based on Pro M-tree preserves recognition efficiency while improving entity recognition accuracy. The sensitivity of each attribute is obtained from its information entropy, and the attributes are sorted in descending order of information entropy so that the hash values of the most sensitive attributes are computed first, which better preserves recognition efficiency.
Description of the drawings
Fig. 1 is a flow chart of the method of the present invention.
Fig. 2 compares the F1 values of the IDEM algorithm proposed by the present invention and of other algorithms on the data sets.
Specific embodiments
An industrial production data entity recognition method based on Merkle-tree, characterized in that the steps are:
Step 1): to handle the fluctuation of industrial production data, standardize the data using the properties of matrices and vectors, so that the numeric attribute values of the same entity are identical.
The specific method is:
1) Sample the raw data set by entity name; the resulting sample set is called the standard entity sample set S. Compute a(i)′b = {a(i)′b(k1), a(i)′b(k2), …, a(i)′b(k3)},
where a(i)′b(key) = {a(i)′b(1), a(i)′b(2), …, a(i)′b(k)} and a(i)′ denotes the transpose of a(i);
the tuple vector set is a = {a(i) | a(i) ∈ A, i = 1, 2, 3, …, n};
the standard entity matrix set is b = {b(key) | key: standard entity name}, where matrix b(key) = {b(j) | b(j) ∈ S, j = 1, 2, 3, …, k};
2) Using the vector cosine formula $\cos\theta_j = \dfrac{a(i)'\,b(j)}{\lVert a(i)\rVert\,\lVert b(j)\rVert}$, compute the cosine value for each vector in each a(i)′b(key), forming a cosine value vector c(key) = {cos θ1, cos θ2, cos θ3, …, cos θk};
3) Sum the elements of each vector c(key) and compute their average avg;
4) Compute max(avg), find the standard entity corresponding to the largest avg, and compute the average of each of its attributes;
5) Take each attribute average as the standard value of that attribute, which completes the data standardization.
Step 2): compute the information entropy of each attribute column to obtain the sensitivity of each attribute, remove attributes with low sensitivity, and sort the remaining attributes in descending order of sensitivity.
The sensitivity of an attribute is determined here by its information entropy. Information entropy measures the expected information of a random variable: the larger the entropy of a variable, the more varied its outcomes and the more information it carries. Here, the larger the information entropy of an attribute, the more diverse its values and the stronger its ability to distinguish entities, that is, the higher its sensitivity. The information entropy formula is: $H = -\sum_{i} p_i \log p_i$
where $p_i$ is the probability that a given attribute value occurs in the attribute. Attributes with low information entropy are removed, and the remaining attributes are sorted in descending order of information entropy so that the hash values of the most sensitive attributes are computed as early as possible, which better preserves recognition efficiency.
Step 3): propose a chain structure called "St-Chain" and perform progressive hash coding based on St-Chain, placing entities with identical hash values in the same block.
The specific method is:
1) Following the sorted attribute order, compute the hash value of one attribute in each tuple;
2) According to whether the hash values are the same or different, divide the data into blocks, forming the chain structure "St-Chain".
Step 4): for the structure obtained in step 3), continue computing the hash values of subsequent attributes in each tuple and, according to whether the hash values are the same or different, repeat the blocking until the entity recognition result is obtained.
The specific method is:
1) In the St-Chain structure obtained in step 3), if num of a block is greater than 1, continue computing the hash values of subsequent attributes within that block and keep splitting the block according to whether the hash values are the same or different; if num = 1, the entity in that block has no identity problem and no further attribute hash values need to be computed;
2) Repeat operation 1) until entity recognition is complete, which gives the final recognition result.
Embodiment 1:
1) Experimental data sets
Because industrial production processes are confidential, the data sets used in this experiment are simulated data sets produced with the generation tool datafactory according to the rules in the "Handbook of Common Chart Data for Steelmaking". The IDEM algorithm proposed here is compared with the traditional threshold-based entity recognition method (Part) and the more general entity recognition algorithm based on similarity-graph clustering (Clustering). Table 1 lists some properties of the data sets used in the experiment.
Table 1: Data sets used in the experiment
2) Experimental results and analysis
Experimental environment: Intel Pentium 3.0 GHz CPU, 4 GB memory, Windows 7 operating system; the programs were developed in PyCharm.
Fig. 2 and Table 2 give the F1 values and the running times, respectively, of the IDEM, Part, and Clustering algorithms on the data sets.
As shown in Fig. 2, the F1 value of the proposed IDEM algorithm is on average about 15.76% and 24.23% higher than those of the other two entity recognition algorithms, respectively, or about 20% higher overall. In terms of F1 value, therefore, IDEM substantially outperforms the Part and Clustering algorithms.
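The text does not say how the F1 value is computed for a recognition result. One common convention in entity resolution is pairwise F1, where every pair of records placed in the same entity counts as a positive; the sketch below assumes that convention, and the function names and sample clusters are hypothetical.

```python
from itertools import combinations

def pairs(clusters):
    """All unordered pairs of record ids that share a cluster."""
    return {frozenset(p) for c in clusters for p in combinations(sorted(c), 2)}

def pairwise_f1(predicted, truth):
    """Harmonic mean of pairwise precision and recall."""
    pred, gold = pairs(predicted), pairs(truth)
    if not pred or not gold:
        return 0.0
    tp = len(pred & gold)  # pairs grouped together in both results
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

truth = [{1, 2, 3}, {4, 5}]          # true entities, by record id
predicted = [{1, 2}, {3}, {4, 5}]    # a result that splits record 3 off
score = pairwise_f1(predicted, truth)
```

In this example every predicted pair is correct (precision 1) but only two of the four true pairs are found (recall 0.5), so the score is 2/3.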
Table 2: Running time of each algorithm
As shown in Table 2, the IDEM algorithm has no advantage in recognition efficiency, but the gap is small. Compared with the other two algorithms, the running efficiency of IDEM is on average 8.5% lower; that is, the proposed IDEM algorithm trades 8.5% of efficiency for 20% of accuracy. IDEM is therefore better suited to entity recognition on massive industrial production data.
Because a hash algorithm is used, entity recognition accuracy on massive industrial production data improves to some extent. Attribute sensitivity is obtained from information entropy, and the improved Merkle-tree structure "Pro M-tree" is proposed and used for progressive hash coding, so recognition efficiency is preserved while accuracy improves.

Claims (5)

1. An industrial production data entity recognition method based on Merkle-tree, characterized in that the steps are:
Step 1): to handle the fluctuation of industrial production data, standardize the data using the properties of matrices and vectors, so that the numeric attribute values of the same entity are identical;
Step 2): compute the information entropy of each attribute column to obtain the sensitivity of each attribute, remove attributes with low sensitivity, and sort the remaining attributes in descending order of sensitivity;
Step 3): propose a chain structure called "St-Chain" and perform progressive hash coding on the sorted attributes based on St-Chain, placing entities with identical hash values in the same block;
Step 4): for the structure obtained in step 3), continue computing the hash values of subsequent attributes in each tuple and, according to whether the hash values are the same or different, repeat the blocking until the entity recognition result is obtained.
2. The industrial production data entity recognition method based on Merkle-tree according to claim 1, characterized in that the specific method of step 1) is:
1) Sample the raw data set by entity name; the resulting sample set is called the standard entity sample set S. Compute a(i)′b = {a(i)′b(k1), a(i)′b(k2), …, a(i)′b(k3)},
where a(i)′b(key) = {a(i)′b(1), a(i)′b(2), …, a(i)′b(k)} and a(i)′ denotes the transpose of a(i);
the tuple vector set is a = {a(i) | a(i) ∈ A, i = 1, 2, 3, …, n};
the standard entity matrix set is b = {b(key) | key: standard entity name}, where matrix b(key) = {b(j) | b(j) ∈ S, j = 1, 2, 3, …, k};
2) Using the vector cosine formula $\cos\theta_j = \dfrac{a(i)'\,b(j)}{\lVert a(i)\rVert\,\lVert b(j)\rVert}$, compute the cosine value for each vector in each a(i)′b(key), forming a cosine value vector c(key) = {cos θ1, cos θ2, cos θ3, …, cos θk};
3) Sum the elements of each vector c(key) and compute their average avg;
4) Compute max(avg), find the standard entity corresponding to the largest avg, and compute the average of each of its attributes;
5) Take each attribute average as the standard value of that attribute, which completes the data standardization.
3. The industrial production data entity recognition method based on Merkle-tree according to claim 1, characterized in that the specific method of step 2) is:
The sensitivity of an attribute is determined by its information entropy. The larger the information entropy of an attribute, the more diverse its values and the stronger its ability to distinguish entities, that is, the higher its sensitivity. The information entropy formula is: $H = -\sum_{i} p_i \log p_i$
where $p_i$ is the probability that a given attribute value occurs in the attribute. Attributes with low information entropy are removed, the remaining attributes are sorted in descending order of information entropy, and the hash values of the most sensitive attributes are computed first.
4. The industrial production data entity recognition method based on Merkle-tree according to claim 1, characterized in that the specific method of step 3) is:
1) Following the sorted attribute order, compute the hash value of one attribute in each tuple;
2) According to whether the hash values are the same or different, divide the data into blocks, forming the chain structure "St-Chain".
5. The industrial production data entity recognition method based on Merkle-tree according to claim 1, characterized in that the specific method of step 4) is:
1) In the St-Chain structure obtained in step 3), if num of a block is greater than 1, continue computing the hash values of subsequent attributes within that block and keep splitting the block according to whether the hash values are the same or different; if num = 1, the entity in that block has no identity problem and no further attribute hash values need to be computed;
2) Repeat operation 1) until entity recognition is complete, which gives the final recognition result.
CN201910035568.XA 2019-01-15 2019-01-15 Industrial production data entity identification method based on Merkle-tree Active CN109783698B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910035568.XA CN109783698B (en) 2019-01-15 2019-01-15 Industrial production data entity identification method based on Merkle-tree

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910035568.XA CN109783698B (en) 2019-01-15 2019-01-15 Industrial production data entity identification method based on Merkle-tree

Publications (2)

Publication Number Publication Date
CN109783698A true CN109783698A (en) 2019-05-21
CN109783698B CN109783698B (en) 2023-05-26

Family

ID=66500472

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910035568.XA Active CN109783698B (en) 2019-01-15 2019-01-15 Industrial production data entity identification method based on Merkle-tree

Country Status (1)

Country Link
CN (1) CN109783698B (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101482876A (en) * 2008-12-11 2009-07-15 南京大学 Weight-based link multi-attribute entity recognition method
CN104239553A (en) * 2014-09-24 2014-12-24 江苏名通信息科技有限公司 Entity recognition method based on Map-Reduce framework
CN107480549A (en) * 2017-06-28 2017-12-15 银江股份有限公司 A kind of shared sensitive information desensitization method of data-oriented and system
CN109165202A (en) * 2018-07-04 2019-01-08 华南理工大学 A kind of preprocess method of multi-source heterogeneous big data

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
SUN CHEN-CHEN等: "Entity resolution oriented clustering algorithm", 《ARCHIVE》 *
杨萌等: "基于随机森林的实体识别方法", 《集成技术》 *

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110377605A (en) * 2019-07-24 2019-10-25 贵州大学 A kind of Sensitive Attributes identification of structural data and classification stage division
CN110377605B (en) * 2019-07-24 2023-04-25 贵州大学 Sensitive attribute identification and classification method for structured data
CN110609907A (en) * 2019-09-17 2019-12-24 湖南大学 Medicine field knowledge reasoning method based on random walk

Also Published As

Publication number Publication date
CN109783698B (en) 2023-05-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant