CN109783698A - Industrial production data entity recognition method based on Merkle-tree - Google Patents
Industrial production data entity recognition method based on Merkle-tree
- Publication number
- CN109783698A
- Authority
- CN
- China
- Prior art keywords
- attribute
- hash value
- entity
- sensitivity
- industrial production
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
An industrial production data entity recognition method based on Merkle-tree, comprising the following steps: 1) to handle the fluctuation ("floatability") of industrial production data, standardize the data using properties of matrices and vectors, ensuring that the numeric attribute values of the same entity are identical; 2) compute the information entropy of each attribute column to obtain the sensitivity of each attribute, remove low-sensitivity attributes, and sort the remaining attributes in descending order of sensitivity; 3) propose a chain structure, called "St-Chain", and perform progressive hash coding based on St-Chain, placing entities with identical hash values into the same block; 4) for the structure obtained in step 3), continue computing the hash values of subsequent attributes in each tuple and repeatedly split the blocks according to whether the hash values match, finally obtaining the entity recognition result. The method provides entity recognition with moderate running efficiency and higher recognition accuracy.
Description
Technical field
The invention relates to an entity recognition method, in particular to an industrial production data entity recognition method based on Merkle-tree.
Background art
With the development of information technology and Internet of Things technology, modern industry has accumulated a large amount of data over time. However, because certain variables are hard to control during production, and the variables influence and depend on one another, these data fluctuate (a property referred to here as "floatability"). Industrial big data holds great value, and how to improve its availability has become a research hotspot.
Entity recognition is an important way to improve data availability: it divides a data set into several entity sets, thereby solving the entity identity problem in the data set. Existing methods, before operating on data with identity problems, first partition the data into blocks, then compute the similarity of the data within each block, and finally make a matching decision based on the similarity and a specific matching strategy. Industrial production data, however, fluctuates, so the misjudgment rate of traditional entity recognition techniques rises markedly. This work therefore performs entity recognition directly on the data with identity problems, which avoids error accumulation and lowers the misjudgment rate, and it takes measures to preserve recognition efficiency while improving recognition accuracy.
Summary of the invention
To solve the problems in the prior art, the present invention provides an industrial production data entity recognition algorithm based on Merkle-tree. Drawing on the idea of the Merkle-tree, the method proposes a Pro M-tree structure and performs hash coding based on Pro M-tree, which preserves recognition efficiency while improving entity recognition accuracy. It proposes a chain structure, called "St-Chain", to support the blocking operation during progressive hash coding. For the fluctuation of industrial production data, it proposes a data standardization method based on properties of matrices and vectors, which ensures that the numeric attribute values of the same entity are consistent and prepares the data for the subsequent hash coding.
To achieve the above object, the invention adopts the following technical solution: an industrial production data entity recognition method based on Merkle-tree, characterized by the following steps:
Step 1): to handle the fluctuation of industrial production data, standardize the data using properties of matrices and vectors, ensuring that the numeric attribute values of the same entity are identical;
Step 2): compute the information entropy of each attribute column to obtain the sensitivity of each attribute, remove low-sensitivity attributes, and sort the remaining attributes in descending order of sensitivity;
Step 3): propose a chain structure, called "St-Chain", and perform progressive hash coding on the sorted attributes based on St-Chain, placing entities with identical hash values into the same block;
Step 4): for the structure obtained in step 3), continue computing the hash values of subsequent attributes in each tuple and repeatedly split the blocks according to whether the hash values match, finally obtaining the entity recognition result.
The specific method of step 1) is:
1) Sample the raw data set by entity name; the resulting sample set is called the standard entity sample set S. Compute a(i)'b = {a(i)'b(k1), a(i)'b(k2), ..., a(i)'b(k3)},
where a(i)'b(key) = {a(i)'b(1), a(i)'b(2), ..., a(i)'b(k)} and a(i)' denotes the transpose of a(i);
the tuple vector set is a = {a(i) | a(i) ∈ A, i = 1, 2, 3, ..., n};
the standard entity matrix set is b = {b(key) | key: standard entity name}, where the matrix b(key) = {b(j) | b(j) ∈ S, j = 1, 2, 3, ..., k};
2) Using the vector cosine formula $\cos\theta = \dfrac{a(i)'\,b(j)}{\lVert a(i)\rVert\,\lVert b(j)\rVert}$, compute the cosine value for each element of each vector a(i)'b(key), forming a cosine vector c(key) = {cos θ1, cos θ2, cos θ3, ..., cos θk};
3) Sum the elements of each vector c(key) and compute their average avg;
4) Compute max(avg), find the standard entity corresponding to the largest avg, and compute the average of each of its attributes;
5) Take these attribute averages as the standard values, completing the data standardization.
The specific method of step 2) is:
Information entropy is used to determine attribute sensitivity. The larger an attribute's information entropy, the more diverse its values and the stronger its ability to distinguish entities, i.e. the higher its sensitivity. The information entropy formula is:
$H = -\sum_{i} p_i \log p_i$
where p_i is the probability that a given attribute value occurs in the attribute. Attributes with low information entropy are removed, the remaining attributes are sorted in descending order of information entropy, and the hash values of the most sensitive attributes are computed first.
The specific method of step 3) is:
1) Following the sorted attribute order, compute the hash value of one attribute in each tuple;
2) According to whether the hash values match, divide the data into blocks, forming the chain structure "St-Chain".
The specific method of step 4) is:
1) In the St-Chain structure obtained in step 3), if num in a block is greater than 1, continue computing the hash values of subsequent attributes within that block and keep splitting according to whether the hash values match; if num = 1, the entities in the block have no identity problem and no further attribute hash values need to be computed;
2) Repeat operation 1) until entity recognition is complete, giving the final recognition result.
The beneficial effects of the invention are:
Compared with the prior art, the industrial production data entity recognition method based on Merkle-tree proposed by the present invention draws on the idea of the Merkle-tree to propose a Pro M-tree structure. Progressive hash coding based on Pro M-tree preserves recognition efficiency while improving entity recognition accuracy. Information entropy is used to obtain the sensitivity of each attribute, and the attributes are sorted in descending order of information entropy so that highly sensitive attributes are hashed first, which further safeguards recognition efficiency.
Description of the drawings
Fig. 1 is a flow chart of the method of the present invention.
Fig. 2 compares the F1 values of the IDEM algorithm proposed by the present invention and other algorithms on the data sets.
Specific embodiments
An industrial production data entity recognition method based on Merkle-tree, characterized by the following steps:
Step 1): to handle the fluctuation of industrial production data, standardize the data using properties of matrices and vectors, ensuring that the numeric attribute values of the same entity are identical.
The specific method is:
1) Sample the raw data set by entity name; the resulting sample set is called the standard entity sample set S. Compute a(i)'b = {a(i)'b(k1), a(i)'b(k2), ..., a(i)'b(k3)},
where a(i)'b(key) = {a(i)'b(1), a(i)'b(2), ..., a(i)'b(k)} and a(i)' denotes the transpose of a(i);
the tuple vector set is a = {a(i) | a(i) ∈ A, i = 1, 2, 3, ..., n};
the standard entity matrix set is b = {b(key) | key: standard entity name}, where the matrix b(key) = {b(j) | b(j) ∈ S, j = 1, 2, 3, ..., k};
2) Using the vector cosine formula $\cos\theta = \dfrac{a(i)'\,b(j)}{\lVert a(i)\rVert\,\lVert b(j)\rVert}$, compute the cosine value for each element of each vector a(i)'b(key), forming a cosine vector c(key) = {cos θ1, cos θ2, cos θ3, ..., cos θk};
3) Sum the elements of each vector c(key) and compute their average avg;
4) Compute max(avg), find the standard entity corresponding to the largest avg, and compute the average of each of its attributes;
5) Take these attribute averages as the standard values, completing the data standardization.
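The standardization in step 1) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `cosine` and `standardize`, the dictionary layout of the standard entity samples, and the toy values are all hypothetical, and a zero vector is arbitrarily given cosine 0.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0  # assumption: zero vectors score 0


def standardize(tuple_vec, standard_samples):
    """Map one tuple vector a(i) to the standard values of its closest entity.

    standard_samples: dict mapping a standard entity name (key) to the list
    of sample vectors b(j) drawn for that entity (the sample set S).
    """
    best_key, best_avg = None, -math.inf
    for key, samples in standard_samples.items():
        # c(key): one cosine value per sample vector; avg is its mean
        avg = sum(cosine(tuple_vec, s) for s in samples) / len(samples)
        if avg > best_avg:
            best_key, best_avg = key, avg
    # Per-attribute average of the winning entity becomes the standard value
    samples = standard_samples[best_key]
    standard_value = [sum(col) / len(samples) for col in zip(*samples)]
    return best_key, standard_value


samples = {
    "steel_A": [[10.0, 1.0], [10.2, 0.9]],
    "steel_B": [[1.0, 10.0], [0.9, 10.1]],
}
key, value = standardize([9.8, 1.1], samples)
```

In this toy run the fluctuating reading `[9.8, 1.1]` is assigned to `steel_A` and replaced by that entity's attribute averages, so later hash values of the same entity can coincide.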
Step 2): compute the information entropy of each attribute column to obtain the sensitivity of each attribute, remove low-sensitivity attributes, and sort the remaining attributes in descending order of sensitivity.
Information entropy is used here to determine attribute sensitivity. Information entropy measures the expected information content of a random variable: the larger a variable's entropy, the more distinct outcomes it takes and the more information it carries. In this method, the larger an attribute's information entropy, the more diverse its values and the stronger its ability to distinguish entities, i.e. the higher its sensitivity. The information entropy formula is:
$H = -\sum_{i} p_i \log p_i$
where p_i is the probability that a given attribute value occurs in the attribute. Attributes with low information entropy are removed and the remaining attributes are sorted in descending order of information entropy, so that the hash values of highly sensitive attributes are computed as early as possible, which better safeguards recognition efficiency.
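A short sketch of the entropy-based attribute ranking follows. The text does not state the logarithm base or the cutoff below which an attribute counts as low-sensitivity, so base 2 and the `min_entropy` parameter are assumptions made for illustration, as are the function names.

```python
import math
from collections import Counter


def entropy(values):
    """Shannon entropy (base 2, an assumption) of one attribute column."""
    counts = Counter(values)
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())


def rank_attributes(rows, min_entropy):
    """Return column indices sorted by descending entropy.

    Columns whose entropy is below min_entropy (a hypothetical cutoff)
    are dropped as low-sensitivity attributes.
    """
    columns = list(zip(*rows))
    scored = [(entropy(col), idx) for idx, col in enumerate(columns)]
    kept = [(h, idx) for h, idx in scored if h >= min_entropy]
    kept.sort(key=lambda pair: -pair[0])  # stable sort keeps ties in column order
    return [idx for _, idx in kept]


rows = [
    ("A", 1, "x"),
    ("A", 2, "x"),
    ("B", 3, "x"),
    ("B", 4, "x"),
]
order = rank_attributes(rows, min_entropy=0.5)
```

Here the constant third column has entropy 0 and is dropped, and the second column (four distinct values) is ranked ahead of the first (two distinct values).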
Step 3): propose a chain structure, called "St-Chain", and perform progressive hash coding based on St-Chain, placing entities with identical hash values into the same block.
The specific method is:
1) Following the sorted attribute order, compute the hash value of one attribute in each tuple;
2) According to whether the hash values match, divide the data into blocks, forming the chain structure "St-Chain".
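The first blocking pass might look like the sketch below. The text does not describe the internal layout of St-Chain, so the `Block` class, the choice of SHA-256, and the reading of `num` as the number of tuples in a block are assumptions made only for illustration.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class Block:
    """One link of a hypothetical St-Chain: a hash value and its tuples."""
    hash_value: str
    rows: list = field(default_factory=list)

    @property
    def num(self):
        # Assumption: num is the number of tuples held by the block
        return len(self.rows)


def attr_hash(value):
    """Hash one attribute value (SHA-256 is an illustrative choice)."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def build_chain(rows, attr_index):
    """Group tuples whose chosen attribute hashes to the same value."""
    blocks = {}
    for row in rows:
        h = attr_hash(row[attr_index])
        blocks.setdefault(h, Block(h)).rows.append(row)
    return list(blocks.values())  # dicts keep insertion order


rows = [("A", 1), ("A", 2), ("B", 3)]
chain = build_chain(rows, attr_index=0)
sizes = [block.num for block in chain]
```

With these toy rows the chain has two blocks: one holding both "A" tuples and one holding the single "B" tuple.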
Step 4): for the structure obtained in step 3), continue computing the hash values of subsequent attributes in each tuple and repeatedly split the blocks according to whether the hash values match, finally obtaining the entity recognition result.
The specific method is:
1) In the St-Chain structure obtained in step 3), if num in a block is greater than 1, continue computing the hash values of subsequent attributes within that block and keep splitting according to whether the hash values match; if num = 1, the entities in the block have no identity problem and no further attribute hash values need to be computed;
2) Repeat operation 1) until entity recognition is complete, giving the final recognition result.
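The repeated splitting in step 4) can be sketched as a loop over the sorted attributes. As before, this is an illustration under assumptions: blocks are plain lists, `num` is read as the block size, SHA-256 stands in for the unspecified hash function, and `resolve` is a hypothetical name.

```python
import hashlib


def attr_hash(value):
    """Hash one attribute value (SHA-256 is an illustrative choice)."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()


def resolve(rows, attr_order):
    """Progressively split blocks attribute by attribute.

    attr_order: column indices, most sensitive attribute first.
    Tuples that remain together after the last attribute are treated
    as records of the same entity.
    """
    blocks = [list(rows)]
    for attr in attr_order:
        next_blocks = []
        for block in blocks:
            if len(block) == 1:
                # num = 1: nothing left to disambiguate, skip hashing
                next_blocks.append(block)
                continue
            split = {}
            for row in block:
                split.setdefault(attr_hash(row[attr]), []).append(row)
            next_blocks.extend(split.values())
        blocks = next_blocks
    return blocks


rows = [
    ("A", 1.0, "x"),
    ("A", 1.0, "x"),
    ("A", 2.0, "x"),
    ("B", 1.0, "y"),
]
entities = resolve(rows, attr_order=[0, 1, 2])
```

The two identical "A" tuples end up in one block, while the third "A" tuple is separated at the second attribute and the "B" tuple is never hashed again after the first pass.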
Embodiment 1:
1) Experimental data sets
Because industrial production processes are confidential, the data sets used in this experiment are simulated data sets generated with the tool datafactory according to the rules in the "Handbook of Commonly Used Charts and Data for Steelmaking". The IDEM algorithm proposed here is compared with the traditional threshold-based entity recognition method (Part) and the more general entity recognition algorithm based on similarity-graph clustering (Clustering). Table 1 gives some properties of the data sets used in the experiment.
Table 1. Data sets used in the experiment
2) Experimental results and analysis
The experimental environment is an Intel Pentium 3.0 GHz CPU, 4 GB of memory, the Windows 7 operating system, and PyCharm as the programming software.
Fig. 2 and Table 2 give, respectively, the F1 values and running times of the IDEM, Part and Clustering algorithms on the data sets.
As shown in Fig. 2, the F1 value of the proposed IDEM algorithm is on average about 15.76% and 24.23% higher than those of the other two entity recognition algorithms, respectively, or about 20% higher overall. In terms of F1 value, the IDEM algorithm is therefore far better than the Part and Clustering algorithms.
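The text reports F1 values without saying how they were computed. Assuming the usual definition (the harmonic mean of precision and recall, counted over matching decisions), a minimal helper would be the following; the function name and the counts in the example are hypothetical and are not taken from the experiment.

```python
def f1_score(tp, fp, fn):
    """F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts: 8 correct matches, 2 wrong matches, 2 missed matches
score = f1_score(tp=8, fp=2, fn=2)
```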
Table 2. Running time of each algorithm
As shown in Table 2, although the IDEM algorithm has no advantage in recognition efficiency, the gap is small. Compared with the other two algorithms, the running efficiency of the IDEM algorithm is on average 8.5% lower; that is, the proposed IDEM algorithm trades 8.5% of efficiency for 20% of accuracy. The IDEM algorithm is therefore better suited to entity recognition on massive industrial production data.
Because a hash algorithm is used, entity recognition accuracy on massive industrial production data is improved to a certain extent. Attribute sensitivity is obtained using information entropy, and an improved Merkle-tree structure, "Pro M-tree", is proposed and used for progressive hash coding, so that recognition efficiency is preserved while accuracy improves.
Claims (5)
1. An industrial production data entity recognition method based on Merkle-tree, characterized by the following steps:
Step 1): to handle the fluctuation of industrial production data, standardize the data using properties of matrices and vectors, ensuring that the numeric attribute values of the same entity are identical;
Step 2): compute the information entropy of each attribute column to obtain the sensitivity of each attribute, remove low-sensitivity attributes, and sort the remaining attributes in descending order of sensitivity;
Step 3): propose a chain structure, called "St-Chain", and perform progressive hash coding on the sorted attributes based on St-Chain, placing entities with identical hash values into the same block;
Step 4): for the structure obtained in step 3), continue computing the hash values of subsequent attributes in each tuple and repeatedly split the blocks according to whether the hash values match, finally obtaining the entity recognition result.
2. The industrial production data entity recognition method based on Merkle-tree according to claim 1, characterized in that the specific method of step 1) is:
1) Sample the raw data set by entity name; the resulting sample set is called the standard entity sample set S. Compute a(i)'b = {a(i)'b(k1), a(i)'b(k2), ..., a(i)'b(k3)},
where a(i)'b(key) = {a(i)'b(1), a(i)'b(2), ..., a(i)'b(k)} and a(i)' denotes the transpose of a(i);
the tuple vector set is a = {a(i) | a(i) ∈ A, i = 1, 2, 3, ..., n};
the standard entity matrix set is b = {b(key) | key: standard entity name}, where the matrix b(key) = {b(j) | b(j) ∈ S, j = 1, 2, 3, ..., k};
2) Using the vector cosine formula $\cos\theta = \dfrac{a(i)'\,b(j)}{\lVert a(i)\rVert\,\lVert b(j)\rVert}$, compute the cosine value for each element of each vector a(i)'b(key), forming a cosine vector c(key) = {cos θ1, cos θ2, cos θ3, ..., cos θk};
3) Sum the elements of each vector c(key) and compute their average avg;
4) Compute max(avg), find the standard entity corresponding to the largest avg, and compute the average of each of its attributes;
5) Take these attribute averages as the standard values, completing the data standardization.
3. The industrial production data entity recognition method based on Merkle-tree according to claim 1, characterized in that the specific method of step 2) is:
Information entropy is used to determine attribute sensitivity. The larger an attribute's information entropy, the more diverse its values and the stronger its ability to distinguish entities, i.e. the higher its sensitivity. The information entropy formula is:
$H = -\sum_{i} p_i \log p_i$
where p_i is the probability that a given attribute value occurs in the attribute. Attributes with low information entropy are removed, the remaining attributes are sorted in descending order of information entropy, and the hash values of the most sensitive attributes are computed first.
4. The industrial production data entity recognition method based on Merkle-tree according to claim 1, characterized in that the specific method of step 3) is:
1) Following the sorted attribute order, compute the hash value of one attribute in each tuple;
2) According to whether the hash values match, divide the data into blocks, forming the chain structure "St-Chain".
5. The industrial production data entity recognition method based on Merkle-tree according to claim 1, characterized in that the specific method of step 4) is:
1) In the St-Chain structure obtained in step 3), if num in a block is greater than 1, continue computing the hash values of subsequent attributes within that block and keep splitting according to whether the hash values match; if num = 1, the entities in the block have no identity problem and no further attribute hash values need to be computed;
2) Repeat operation 1) until entity recognition is complete, giving the final recognition result.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910035568.XA CN109783698B (en) | 2019-01-15 | 2019-01-15 | Industrial production data entity identification method based on Merkle-tree |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910035568.XA CN109783698B (en) | 2019-01-15 | 2019-01-15 | Industrial production data entity identification method based on Merkle-tree |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109783698A (en) | 2019-05-21 |
CN109783698B (en) | 2023-05-26 |
Family
ID=66500472
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910035568.XA Active CN109783698B (en) | 2019-01-15 | 2019-01-15 | Industrial production data entity identification method based on Merkle-tree |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109783698B (en) |
Citations (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN101482876A (en) * | 2008-12-11 | 2009-07-15 | 南京大学 | Weight-based link multi-attribute entity recognition method |
CN104239553A (en) * | 2014-09-24 | 2014-12-24 | 江苏名通信息科技有限公司 | Entity recognition method based on Map-Reduce framework |
CN107480549A (en) * | 2017-06-28 | 2017-12-15 | 银江股份有限公司 | A kind of shared sensitive information desensitization method of data-oriented and system |
CN109165202A (en) * | 2018-07-04 | 2019-01-08 | 华南理工大学 | A kind of preprocess method of multi-source heterogeneous big data |
Non-Patent Citations (2)
Title |
---|
SUN CHEN-CHEN et al.: "Entity resolution oriented clustering algorithm", 《ARCHIVE》 * |
YANG Meng et al.: "Entity recognition method based on random forest", 《Journal of Integration Technology》 * |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110377605A (en) * | 2019-07-24 | 2019-10-25 | 贵州大学 | A kind of Sensitive Attributes identification of structural data and classification stage division |
CN110377605B (en) * | 2019-07-24 | 2023-04-25 | 贵州大学 | Sensitive attribute identification and classification method for structured data |
CN110609907A (en) * | 2019-09-17 | 2019-12-24 | 湖南大学 | Medicine field knowledge reasoning method based on random walk |
Also Published As
Publication number | Publication date |
---|---|
CN109783698B (en) | 2023-05-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109165639B (en) | Finger vein identification method, device and equipment | |
Yue et al. | Hashing based fast palmprint identification for large-scale databases | |
CN111783875A (en) | Abnormal user detection method, device, equipment and medium based on cluster analysis | |
CN107682319A (en) | A kind of method of data flow anomaly detection and multiple-authentication based on enhanced angle Outlier factor | |
CN110826618A (en) | Personal credit risk assessment method based on random forest | |
CN107832456B (en) | Parallel KNN text classification method based on critical value data division | |
CN106685964B (en) | Malicious software detection method and system based on malicious network traffic thesaurus | |
US20200257885A1 (en) | High speed reference point independent database filtering for fingerprint identification | |
Cao et al. | Local information-based fast approximate spectral clustering | |
CN109783698A (en) | Industrial production data entity recognition method based on Merkle-tree | |
CN110347827B (en) | Event Extraction Method for Heterogeneous Text Operation and Maintenance Data | |
Zhang et al. | Data anomaly detection based on isolation forest algorithm | |
CN104598441A (en) | Method for splitting Chinese sentences through computer | |
CN111341390A (en) | Quantitative structure-activity relationship assisted matching molecule pair analysis method | |
CN111737694B (en) | Malicious software homology analysis method based on behavior tree | |
CN116186562B (en) | Encoder-based long text matching method | |
CN114124437B (en) | Encrypted flow identification method based on prototype convolutional network | |
CN115221013B (en) | Method, device and equipment for determining log mode | |
CN111008673A (en) | Method for collecting and extracting malignant data chain in power distribution network information physical system | |
CN116204647A (en) | Method and device for establishing target comparison learning model and text clustering | |
Xu et al. | Extracting trigger-sharing events via an event matrix | |
Schleif et al. | Fast approximated relational and kernel clustering | |
WO2023272855A1 (en) | Virus gene classification method and apparatus, electronic device, and computer-readable storage medium | |
Luong et al. | A Study on the Efficiency of ML-Based IDS with Dimensional Reduction Methods for Industry IoT | |
CN113378881B (en) | Instruction set identification method and device based on information entropy gain SVM model |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||