CN112966027B - Entity association mining method based on dynamic probe - Google Patents


Publication number
CN112966027B
Authority
CN
China
Prior art keywords
attribute
similarity
entities
log
calculating
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110302533.5A
Other languages
Chinese (zh)
Other versions
CN112966027A (en
Inventor
陶冶
郭帅童
丁香乾
侯瑞春
李辉
史操
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Qingdao University of Science and Technology
Original Assignee
Qingdao University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Qingdao University of Science and Technology
Priority to CN202110302533.5A
Publication of CN112966027A
Application granted
Publication of CN112966027B
Legal status: Active
Anticipated expiration

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2458Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465Query processing support for facilitating data mining operations in structured databases
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/288Entity relationship models
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00Pattern recognition
    • G06F18/20Analysing
    • G06F18/22Matching criteria, e.g. proximity measures
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/237Lexical tools
    • G06F40/242Dictionaries
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/253Grammatical analysis; Style critique
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • Artificial Intelligence (AREA)
  • Computational Linguistics (AREA)
  • General Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Health & Medical Sciences (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Computation (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Evolutionary Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an entity association mining method based on a dynamic probe. The method comprises: configuring a probe to monitor the interaction data between an application system and a database; processing the intercepted data to form formatted data of an entity and storing it in a relational database; performing feature fusion between the entity and the existing entities in the relational database by calculating, for each pair of compared entities, the attribute information similarity, the attribute value similarity and the log similarity; and then obtaining the degree of similarity of the two compared entities from these three similarities by a fuzzy logic reasoning method, thereby realizing association matching between entities. The invention uses the dynamic probe to obtain the data exchanged between enterprise business systems and their databases, measures the similarity between different entities through multi-dimensional features, and uses fuzzy logic reasoning to give the best matching result between entities, thereby saving manual matching time.

Description

Entity association mining method based on dynamic probe
Technical Field
The invention belongs to the technical field of data management, and particularly relates to a mining method for realizing relevance between entities in a heterogeneous information system.
Background
With the development of enterprise business and informatization, massive data has accumulated in the databases of various business systems, and this data is generally multi-source, heterogeneous and autonomous. Using an appropriate data fusion technology to integrate the fragmented data dispersed across multiple systems into a comprehensive and accurate enterprise data space helps break down the information islands between systems and provides effective support for deeply mining data association relationships, constructing knowledge graphs and realizing comprehensive and efficient data sharing.
Cross-system entity association is an important link in the data integration process. In traditional data warehouse construction, cross-system entity association matching requires several steps: database administrators acquire the corresponding data according to the requirements set by business personnel, and professionals with the relevant background assist in analyzing and processing the data, so that associations between entities are confirmed and matched manually. For example, the price field of a table in system A and the unit_price field of a table in system B actually describe the price of a certain product and the price of a component of that product; because the component price directly affects the product's price fluctuation, the data in the two fields are closely related. With current technology, however, such associations are typically discovered and matched manually. In a business system of any significant scale there are often thousands of entity attributes, and relying entirely on manual discovery of the associations between data is very time-consuming.
In addition, as enterprise business continues to develop, entity association matching results lag behind and need to be adjusted continuously according to specific conditions. If the associations between entities can be discovered and matched automatically, the manual matching time of the personnel involved can be saved.
Disclosure of Invention
The invention aims to provide an entity association mining method based on a dynamic probe, which can automatically discover and match the association of each attribute between entities.
To solve the above technical problem, the invention adopts the following technical solution:
An entity association mining method based on a dynamic probe, comprising the following processes:
configuring a probe, and intercepting the request information sent by an application system to a database and the corresponding response data;
processing the intercepted data to form formatted data of an entity and storing the formatted data in a relational database;
performing feature fusion between the entity and the existing entities in the relational database, the process comprising:
calculating the attribute information similarity of the two compared entities;
calculating the attribute value similarity of the two compared entities;
calculating the log similarity of the two compared entities;
and obtaining the degree of similarity of the two compared entities by a fuzzy logic reasoning method from the calculated attribute information similarity, attribute value similarity and log similarity.
Compared with the prior art, the invention has the advantages and positive effects that: the invention adopts an entity association mining method based on a dynamic probe to obtain related data stored by enterprise business and a database, measures the similarity between different entities through multi-dimensional characteristics, gives the best matching result between the entities by adopting a fuzzy logic reasoning method, and provides the best matching result for related personnel as reference, thereby saving the manual matching time of the related personnel and improving the working efficiency.
Other features and advantages of the present invention will become more apparent from the detailed description of the embodiments of the present invention when taken in conjunction with the accompanying drawings.
Drawings
FIG. 1 is a general architecture diagram of an embodiment of a dynamic probe-based entity association mining method proposed by the present invention;
FIG. 2 is a data processing flow diagram of one embodiment of a dynamic probe;
FIG. 3 is a flow diagram of one embodiment of an attribute information analysis process;
FIG. 4 is a diagram of one embodiment of a tree semantic hierarchy;
FIG. 5 is a flow diagram of one embodiment of an attribute value analysis process;
FIG. 6 is a flow diagram of one embodiment of a log analysis process;
FIG. 7 is a flow diagram of one embodiment of a comparison entity similarity determination process;
FIG. 8 is a tree diagram illustration of a specific example.
Detailed Description
The following detailed description of embodiments of the invention refers to the accompanying drawings.
As shown in fig. 1, the entity association mining method of this embodiment mainly includes the following processing procedures: data interception, attribute information analysis, attribute value analysis, log analysis, and similarity determination based on fuzzy logic. A dynamic probe is configured to continuously monitor the request information sent by an application system to a database and the corresponding response data, from which the formatted data of an entity is formed and stored in a relational database. The entity data newly stored in the relational database is then fused with the existing entity data in the relational database: the similarity of the two compared entities is analyzed in the three dimensions of attribute information, attribute values and log access, and a fuzzy logic reasoning method is used to give the degree of similarity between the two compared entities, realizing automatic association matching between them.
The data interception and feature fusion processes of the dynamic probe are described in detail below.
The mature service architecture generally adopts a data read-write separation mode to perform service development. The enterprise software architecture is generally divided into a business logic layer, a database access layer and a database storage layer. By means of the database access layer, in the process of requesting to access the database, the business personnel reads the database by using the API interface provided by the database management software without concerning the operation details of the bottom database.
As shown in fig. 2, the embodiment first loads a front probe on the middleware for intercepting interaction data between the application system and the database. Specifically, the bidirectional dynamic probe can be configured to listen to request information and corresponding response data of the service logic layer to the database. And then, cleaning and sorting the data obtained by the bidirectional dynamic probe, solving the problems of data noise, format confusion and the like, and uploading the data serving as the formatted data of an entity to a relational database for feature fusion processing.
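The interception step can be illustrated with a minimal sketch. It assumes the probe is a thin wrapper around a database connection that records each request (SQL text and parameters) together with its response rows; the class names `ProbedConnection` and `ProbeRecord`, and the use of an in-memory SQLite database, are hypothetical choices for illustration and are not taken from the patent.

```python
import sqlite3
from dataclasses import dataclass, field
from typing import Any, List, Sequence, Tuple


@dataclass
class ProbeRecord:
    """One intercepted interaction: the request and its response."""
    sql: str
    params: Tuple[Any, ...]
    rows: List[Tuple[Any, ...]]


@dataclass
class ProbedConnection:
    """Hypothetical two-way probe wrapped around a DB connection."""
    conn: sqlite3.Connection
    records: List[ProbeRecord] = field(default_factory=list)

    def execute(self, sql: str, params: Sequence[Any] = ()) -> List[Tuple[Any, ...]]:
        cursor = self.conn.execute(sql, tuple(params))
        rows = cursor.fetchall()  # response data (empty list for DDL/DML)
        # Light cleaning: strip surrounding whitespace before recording.
        self.records.append(ProbeRecord(sql.strip(), tuple(params), rows))
        return rows


def demo() -> ProbedConnection:
    probe = ProbedConnection(sqlite3.connect(":memory:"))
    probe.execute("CREATE TABLE product (productID INTEGER PRIMARY KEY, price REAL)")
    probe.execute("INSERT INTO product VALUES (?, ?)", (1, 2999.0))
    probe.execute("SELECT productID, price FROM product WHERE productID = ?", (1,))
    return probe
```

In this sketch the recorded SQL text would feed the log analysis, while the returned rows would feed the attribute value analysis; a production probe would sit in the middleware rather than in application code.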
Feature fusion refers to association matching between entities. The attribute information in the formatted data acquired by the dynamic probe is stored in a database, the attribute values are stored in the organization format of the source database, and the log information is stored as files.
First, entities may be pre-classified according to the data types listed in Table 1: entities belonging to the same data type may be similar, whereas entities of different data types are generally not similar.
Data type | Members
Integer | SMALLINT, MEDIUMINT, INT, BIGINT
Floating point | FLOAT, DOUBLE, DECIMAL
Date | YEAR, DATE, TIME, DATETIME
Character | CHAR, VARCHAR, BLOB, TEXT
TABLE 1
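The pre-classification by data type can be sketched as a simple lookup. The category names follow Table 1; the function names `category_of` and `is_candidate_pair`, and the handling of length suffixes such as `VARCHAR(32)`, are illustrative assumptions.

```python
from typing import Optional

# Data type categories as listed in Table 1.
TYPE_CATEGORIES = {
    "integer": {"SMALLINT", "MEDIUMINT", "INT", "BIGINT"},
    "floating point": {"FLOAT", "DOUBLE", "DECIMAL"},
    "date": {"YEAR", "DATE", "TIME", "DATETIME"},
    "character": {"CHAR", "VARCHAR", "BLOB", "TEXT"},
}


def category_of(sql_type: str) -> Optional[str]:
    """Map a column type such as 'varchar(32)' to its Table 1 category."""
    base = sql_type.split("(")[0].strip().upper()
    for category, members in TYPE_CATEGORIES.items():
        if base in members:
            return category
    return None  # type not covered by Table 1


def is_candidate_pair(type_a: str, type_b: str) -> bool:
    """Only attributes in the same category are compared further."""
    cat_a = category_of(type_a)
    return cat_a is not None and cat_a == category_of(type_b)
```

Filtering pairs this way before computing any similarity keeps the number of comparisons down when a system has thousands of attributes.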
Then, the formatted data uploaded by the dynamic probe is analyzed in the three dimensions of attribute information, attribute values and logs to obtain the degree of similarity between entities.
(I) Analysis of attribute information
The attribute information may be divided into an attribute name and an attribute constraint. Attribute constraints typically include data type, whether it can be null, whether it is a primary key, whether it is a foreign key, comments, and the like. Based on this information, the data can be classified and matched, and the matching process preferably includes the following aspects, as shown in fig. 3:
1. Calculating naive text similarity
In this embodiment, the edit distance is preferably used to measure the degree of similarity between two sequences. The edit distance between two words is the minimum number of single-character edit operations required to convert one word into the other. The naive text similarity $S_1$ is calculated from the edit distance as follows:
$$S_1 = 1 - \frac{D}{\mathrm{Max}(l_1, l_2)} \tag{1}$$
where $w_1$ and $w_2$ are the attribute names of the two compared entities; $l_1$ and $l_2$ are the character lengths of $w_1$ and $w_2$, respectively; $D$ is the edit distance between $w_1$ and $w_2$; and Max is the maximum function.
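The following is a minimal sketch of the naive text similarity, assuming formula (1) has the form $S_1 = 1 - D/\mathrm{Max}(l_1, l_2)$, which reproduces the value 0.22 given for productID and companyID in the worked example later in the description. The function names are illustrative, and the edit distance is the standard Levenshtein distance (insertions, deletions and substitutions).

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed row by row."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or keep on match)
            ))
        previous = current
    return previous[-1]


def naive_text_similarity(w1: str, w2: str) -> float:
    """S1 = 1 - D / Max(l1, l2); two empty names count as identical."""
    longest = max(len(w1), len(w2))
    if longest == 0:
        return 1.0
    return 1.0 - edit_distance(w1, w2) / longest
```

For `naive_text_similarity("productID", "companyID")` the edit distance is 7 and both names have length 9, giving 1 - 7/9, approximately 0.22.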
2. Calculating text semantic similarity
Because application scenarios, naming conventions and the like differ, the same entity may be described in different ways. For example, when information about an upstream company is recorded in an enterprise database, the corresponding attribute may be named companyID in one scenario and SupplierID in another. In such cases it is difficult to find the similarity relationship between attribute names by naive text analysis alone. It is therefore preferable to further determine the association between the attribute names of the two entities with an analysis method based on semantic similarity.
Specifically, the lexical dictionary provided by WordNet may be used to establish a tree-like semantic hierarchy for the attribute names, such as the tree diagram shown in fig. 4, and the similarity between attribute names is calculated from their positions in the tree diagram.
The specific calculation formula is as follows:
$$S_2 = \frac{2H}{N_1 + N_2 + 2H} \tag{2}$$
where $N_1$ and $N_2$ are the lengths of the shortest paths from attribute names $w_1$ and $w_2$, respectively, to their nearest common parent node $w$; and $H$ is the length of the shortest path from $w$ to the root node.
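A small sketch of the tree-based semantic similarity follows. It assumes formula (2) is the Wu-Palmer style ratio $2H / (N_1 + N_2 + 2H)$, which matches the worked example (H = 7, N1 = 40, N2 = 46 gives 0.14). The toy taxonomy `PARENT` is entirely hypothetical and stands in for a WordNet hierarchy; path lengths are counted in edges, so the root has depth 0.

```python
from typing import Dict, List, Optional

# Hypothetical taxonomy: child -> parent (None marks the root).
PARENT: Dict[str, Optional[str]] = {
    "entity": None,
    "organization": "entity",
    "artifact": "entity",
    "company": "organization",
    "supplier": "organization",
    "product": "artifact",
}


def path_to_root(node: str) -> List[str]:
    """Return the node followed by its ancestors up to the root."""
    path = [node]
    while PARENT[path[-1]] is not None:
        path.append(PARENT[path[-1]])
    return path


def semantic_similarity_from_paths(n1: int, n2: int, h: int) -> float:
    """S2 = 2H / (N1 + N2 + 2H); defined as 1.0 when every term is zero."""
    denominator = n1 + n2 + 2 * h
    return 2 * h / denominator if denominator else 1.0


def semantic_similarity(w1: str, w2: str) -> float:
    path1, path2 = path_to_root(w1), path_to_root(w2)
    ancestors2 = set(path2)
    # First node on w1's path that is also an ancestor of w2.
    common = next(node for node in path1 if node in ancestors2)
    n1 = path1.index(common)               # edges from w1 to common parent
    n2 = path2.index(common)               # edges from w2 to common parent
    h = len(path_to_root(common)) - 1      # edges from common parent to root
    return semantic_similarity_from_paths(n1, n2, h)
```

With this toy tree, "company" and "supplier" share the parent "organization" (N1 = N2 = 1, H = 1) and score 0.5, whereas "company" and "product" only meet at the root (H = 0) and score 0.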
3. Calculating attribute name similarity
After the naive text similarity $S_1$ and the text semantic similarity $S_2$ have been calculated, the larger of the two may be taken as the attribute name similarity $S_3$, namely:
$$S_3 = \mathrm{Max}(S_1, S_2) \tag{3}$$
where Max is the maximum function.
4. Calculating attribute constraint similarity
When the attribute information is built, the constraint thereof usually follows a certain design principle, for example: data type, whether it is a primary key, whether it is a foreign key, whether it is empty, etc., as shown in table 2.
Candidate constraint | i=1 | i=2 | i=3 | i=4 | i=5
Definition | Data type | Whether it can be null | Whether it is a primary key | Whether it is a foreign key | Comment
TABLE 2
The attribute constraint vectors of the two compared entities are defined as A and B, where $A_i$ and $B_i$ denote the values of the i-th candidate constraint in vector A and vector B, respectively. Let:
$$v_i = \begin{cases} 1, & A_i = B_i \\ 0, & \text{otherwise} \end{cases} \qquad i = 1, \dots, n \tag{4}$$
where n is the number of candidate constraints in vector A and vector B, and "otherwise" covers every case other than $A_i = B_i$. For example, $A_1$ and $B_1$ denote the data types of the two attribute constraints: if $A_1$ and $B_1$ are both integer, then $A_1 = B_1$ and $v_1 = 1$; if $A_1$ is integer and $B_1$ is floating point, then $A_1 \neq B_1$ and $v_1 = 0$. The remaining constraints are handled in the same way.
The attribute constraint similarity $S_4$ is calculated as follows:
$$S_4 = \frac{1}{n} \sum_{i=1}^{n} v_i \tag{5}$$
5. Calculating attribute information similarity
Because the attribute information comprises the attribute name and the attribute constraints, a weighting algorithm can be used to calculate the attribute information similarity $S_5$ of the two compared entities, as follows:
$$S_5 = \alpha \cdot S_3 + \beta \cdot S_4 \tag{6}$$
where $\alpha$ and $\beta$ are weights that may be assigned according to the situation, with $0 \le \alpha \le 1$, $0 \le \beta \le 1$ and $\alpha + \beta = 1$.
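The combination of attribute name and attribute constraint similarity can be sketched as below. The sketch assumes that the attribute constraint similarity $S_4$ is the fraction of candidate constraints on which the two entities agree (the mean of the indicator values $v_i$), which is consistent with $S_4 = 1$ for v = [1,1,1,1,1] in the worked example but is not spelled out in the text. Function names and the sample constraint tuples are illustrative.

```python
from typing import Sequence


def constraint_similarity(a: Sequence[object], b: Sequence[object]) -> float:
    """Assumed form of S4: share of candidate constraints with A_i == B_i."""
    if len(a) != len(b) or not a:
        raise ValueError("constraint vectors must be non-empty and equal in length")
    v = [1 if x == y else 0 for x, y in zip(a, b)]
    return sum(v) / len(v)


def attribute_info_similarity(s1: float, s2: float, s4: float,
                              alpha: float = 0.5, beta: float = 0.5) -> float:
    """S5 = alpha * S3 + beta * S4, with S3 = Max(S1, S2)."""
    if not (0 <= alpha <= 1 and 0 <= beta <= 1) or abs(alpha + beta - 1) > 1e-9:
        raise ValueError("weights must lie in [0, 1] and sum to 1")
    s3 = max(s1, s2)
    return alpha * s3 + beta * s4
```

Using the values from the worked example, `attribute_info_similarity(0.22, 0.14, 1.0)` evaluates to 0.5 × 0.22 + 0.5 × 1 = 0.61.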
(II) Analysis of attribute values
According to different data types, attribute values can be divided into four types: numeric values, characters, enumerations, text. The calculation methods of the similarity of the attribute values of different data types are different, and are described below with reference to fig. 5.
1. Numerical attribute value
When the attribute values of the two compared entities are both numerical, the similarity between the entities may be considered from the point of view of the numerical distribution. In this embodiment, the mean, median, mode, sample standard deviation, maximum and minimum may be selected as the feature vector elements. Of course, only some of these may be selected, or other statistics may be used as feature vector elements; the embodiment is not limited to the above examples.
The feature vectors of the attribute values of two comparison entities are represented by u and v, respectively, and the definition of the feature vector elements is shown in table 3:
Mean | Median | Mode | Sample standard deviation | Maximum | Minimum
$u_1$ | $u_2$ | $u_3$ | $u_4$ | $u_5$ | $u_6$
$v_1$ | $v_2$ | $v_3$ | $v_4$ | $v_5$ | $v_6$
TABLE 3
The statistic corresponding to each feature vector element is substituted into formula (7) to calculate the attribute value similarity $S_6$ of two compared entities whose attribute values are numerical:
Figure BDA0002986889150000071
where m is the number of feature vector elements.
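The feature vector of Table 3 can be built with Python's standard `statistics` module, as in the sketch below. Formula (7) itself is not legible in this text, so the comparison step uses cosine similarity purely as a stand-in assumption; it should not be read as the patent's formula. The sketch assumes Python 3.8 or later, where `statistics.mode` returns the first mode encountered when several values tie.

```python
import math
import statistics
from typing import List, Sequence


def feature_vector(values: Sequence[float]) -> List[float]:
    """Mean, median, mode, sample standard deviation, maximum, minimum."""
    return [
        statistics.mean(values),
        statistics.median(values),
        statistics.mode(values),
        statistics.stdev(values),  # sample (n - 1) standard deviation
        max(values),
        min(values),
    ]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Stand-in comparison; NOT confirmed to be the patent's formula (7)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


# Illustrative data: integer keys 1..100 and 1..56.
u = feature_vector(list(range(1, 101)))
v = feature_vector(list(range(1, 57)))
similarity = cosine_similarity(u, v)
```

For the integers 1 to 100 the vector has mean 50.5, mode 1, sample standard deviation of about 29.01, maximum 100 and minimum 1; for 1 to 56 the mean is 28.5 and the sample standard deviation about 16.31.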
2. Character type attribute value
Character-type attribute values are short text content, for which the term frequency-inverse document frequency (TF-IDF) can be used as the basis for judging similarity.
Specifically, the attribute values of the two compared entities are first merged to form a corpus; then the TF-IDF algorithm is used to calculate the TF-IDF weights corresponding to the attribute values of each entity, forming vectors U and V, respectively; finally, vectors U and V are substituted into formula (8) to calculate the attribute value similarity $S_7$ of the two compared entities:
Figure BDA0002986889150000072
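The TF-IDF step can be sketched in plain Python as follows. Several details are assumptions rather than the patent's specification: each entity's attribute values are concatenated into one document, tokens are split on whitespace, the inverse document frequency is smoothed as ln((1 + N)/(1 + df)) + 1 so that terms shared by both documents keep a non-zero weight, and cosine similarity stands in for formula (8), which is not legible in this text.

```python
import math
from collections import Counter
from typing import List, Sequence, Tuple


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def tfidf_vectors(values_a: Sequence[str],
                  values_b: Sequence[str]) -> Tuple[List[float], List[float]]:
    """Build TF-IDF vectors U and V over the merged two-document corpus."""
    docs = [Counter(tokenize(" ".join(values))) for values in (values_a, values_b)]
    vocabulary = sorted(set(docs[0]) | set(docs[1]))
    n_docs = len(docs)
    vectors = []
    for counts in docs:
        total = sum(counts.values()) or 1
        vector = []
        for term in vocabulary:
            df = sum(1 for d in docs if term in d)
            idf = math.log((1 + n_docs) / (1 + df)) + 1.0  # smoothed IDF (assumption)
            vector.append(counts[term] / total * idf)
        vectors.append(vector)
    return vectors[0], vectors[1]


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    """Stand-in for formula (8); an assumption, not the confirmed formula."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def character_similarity(values_a: Sequence[str], values_b: Sequence[str]) -> float:
    u, v = tfidf_vectors(values_a, values_b)
    return cosine_similarity(u, v)
```

Without the smoothing term, a corpus of only two documents would give every shared term an IDF of zero, and any two attributes would appear unrelated; that is the reason for the smoothed variant here.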
3. Enumerated attribute values
An enumerated attribute contains at least two data items. The attribute values of the two compared entities can be converted into two sets A and B, and the ratio of the size of their intersection to the size of their union is calculated using formula (9) and taken as the enumerated attribute value similarity $S_8$:
$$S_8 = \frac{|A \cap B|}{|A \cup B|} \tag{9}$$
where $\cap$ denotes intersection, $\cup$ denotes union, and $|\cdot|$ denotes the number of elements in a set.
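The intersection-over-union ratio described above is the Jaccard index, and a minimal sketch is given below. The function name and the convention that two empty sets count as identical are illustrative assumptions.

```python
from typing import Hashable, Iterable


def enumerated_similarity(values_a: Iterable[Hashable],
                          values_b: Iterable[Hashable]) -> float:
    """S8 = |A ∩ B| / |A ∪ B| over the distinct values of each attribute."""
    a, b = set(values_a), set(values_b)
    union = a | b
    if not union:
        return 1.0  # both attributes empty: treated as identical (assumption)
    return len(a & b) / len(union)
```

For example, the value lists `["new", "used", "refurbished"]` and `["used", "refurbished", "damaged"]` share two of four distinct values, giving a similarity of 0.5.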
4. Text type attribute value
For attribute values that are long text content, a mathematical model can be established using an autoencoder algorithm from deep learning; the model is trained on part of the data in the attribute values, and the trained model is then used to calculate the attribute value similarity of the two compared entities.
Specifically, the following steps may be included:
(a) Randomly selecting k of the p data items of the attribute value of one entity to form a training set, forming a test set from the remaining p-k data items, and then using the training set to train the mathematical model established with the autoencoder algorithm, yielding the trained mathematical model;
(b) Given a predefined threshold $\omega$, a data item in the test set is considered similar to the training set if the similarity result computed for it by the trained mathematical model is greater than $\omega$; the ratio of the number of test-set data items judged similar to the training set to k is calculated and recorded as $\lambda$;
(c) Forming a test set from all the data in the attribute value of the other entity, repeating step (b), and recording the resulting ratio as $\theta$;
(d) Calculating the attribute value similarity $S_9$ of the two compared entities using formula (10):
Figure BDA0002986889150000081
where Min is the minimum function.
(III) Analysis of logs
A log file is generated during each interaction between the application system and the database. Once the log files have been formatted and stored in the relational database, the SQL commands they contain express equivalence relationships between entities and can serve as a basis for measuring entity similarity. The similarity between entities can be obtained by counting the number of equivalence relationships in the log file, as shown in fig. 6.
Specifically, assuming that a and b are the two compared entities, their log similarity $S_{10}$ is calculated as follows:
$$S_{10} = \frac{N_{ab}}{N_a + N_b} \tag{11}$$
where $N_a$ and $N_b$ are the numbers of times the attribute name and/or attribute value of entity a and of entity b, respectively, appear in the log file; specifically, $N_a$ may be counted as the number of SQL commands in the log that contain the attribute name and/or attribute value of entity a, and $N_b$ as the number of SQL commands in the log that contain the attribute name and/or attribute value of entity b;
$N_{ab}$ is the number of times the attribute names and/or attribute values of entities a and b appear together in the log file; specifically, it may be counted as the number of SQL commands in the log that contain the attribute names and/or attribute values of both entity a and entity b.
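The counting described above can be sketched as follows, assuming the log similarity is $N_{ab} / (N_a + N_b)$; this form reproduces the value 0.39 for the counts 447, 389 and 328 in the worked example later in the description. Matching attribute names with a case-insensitive whole-word regular expression is an illustrative simplification, and the function names are hypothetical.

```python
import re
from typing import Iterable


def mentions(sql: str, name: str) -> bool:
    """True if the SQL command contains the attribute name as a whole word."""
    return re.search(rf"\b{re.escape(name)}\b", sql, flags=re.IGNORECASE) is not None


def log_similarity_from_counts(n_a: int, n_b: int, n_ab: int) -> float:
    """S10 = N_ab / (N_a + N_b); 0.0 when neither attribute appears."""
    total = n_a + n_b
    return n_ab / total if total else 0.0


def log_similarity(sql_log: Iterable[str], name_a: str, name_b: str) -> float:
    n_a = n_b = n_ab = 0
    for sql in sql_log:
        has_a, has_b = mentions(sql, name_a), mentions(sql, name_b)
        n_a += has_a
        n_b += has_b
        n_ab += has_a and has_b
    return log_similarity_from_counts(n_a, n_b, n_ab)


sample_log = [
    "SELECT * FROM product p JOIN company c ON p.productID = c.companyID",
    "SELECT price FROM product WHERE productID = 1",
    "SELECT name FROM company WHERE companyID = 2",
]
```

Under this form of the formula, two attributes that always appear together score 0.5 rather than 1, so the value is best read relative to that ceiling.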
(IV) Determining the similarity of the two compared entities
The calculated attribute information similarity $S_5$, attribute value similarity $S_6$/$S_7$/$S_8$/$S_9$ and log similarity $S_{10}$ are evaluated with a fuzzy logic reasoning method to obtain the degree of similarity of the two compared entities.
Referring to fig. 7, the following process is specifically included:
Firstly, the attribute information similarity, attribute value similarity and log similarity of the two compared entities are each fuzzified with a membership function, and the membership degrees are calculated. The membership function is preferably a triangular membership function whose independent variable and dependent variable both range over [0,1]; the membership levels are {dissimilar, general, similar}.
Secondly, the fuzzy rules are formulated as follows:
If (attribute information and attribute value are similar) or (attribute information and log are similar) or (attribute value and log are similar) or (attribute information, attribute value and log are all similar), the two compared entities are similar;
If (attribute information is similar, and attribute value and log are general) or (attribute value is similar, and attribute information and log are general) or (log is similar, and attribute information and attribute value are general), the two compared entities are similar;
If attribute information, attribute value and log are all general, the two compared entities are similar;
Else the two compared entities are dissimilar;
where If, or and Else are the logical conditions "if", "or" and "otherwise", respectively.
Finally, defuzzification is performed. If the two compared entities are judged dissimilar according to the fuzzy rules, the result is 0; if they are judged similar, the result is the mean of the largest membership value in each of the three membership vectors.
The result is the degree of similarity of the two compared entities. In the relational database, every time a new entity is added, the entity can be automatically associated and matched with other existing entities in the relational database, so that a matching result between the entities is formed and provided for relevant personnel as reference, manual matching time is saved, and working efficiency is improved.
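The fuzzification, rule evaluation and defuzzification steps can be sketched as below. The breakpoints 0.25, 0.5 and 0.75 of the membership functions are not stated in the text; they are an assumption chosen because they reproduce the membership vectors of the worked example (0.61 maps to [0, 0.56, 0.44], 0.99 to [0, 0, 1], 0.39 to [0.44, 0.56, 0]). Each dimension is labeled with its highest-membership level, and the rules are read as: at least two dimensions similar, or one similar and two general, or all three general.

```python
from typing import List, Sequence

LEVELS = ("dissimilar", "general", "similar")


def fuzzify(x: float) -> List[float]:
    """Membership degrees for [dissimilar, general, similar] (assumed breakpoints)."""
    dissimilar = min(1.0, max(0.0, (0.5 - x) / 0.25))  # 1 below 0.25, 0 from 0.5
    general = max(0.0, 1.0 - abs(x - 0.5) / 0.25)      # triangle peaking at 0.5
    similar = min(1.0, max(0.0, (x - 0.5) / 0.25))     # 0 up to 0.5, 1 above 0.75
    return [dissimilar, general, similar]


def label(memberships: Sequence[float]) -> str:
    """Level with the highest membership; ties go to the earlier level."""
    return LEVELS[max(range(len(LEVELS)), key=lambda i: memberships[i])]


def entities_similar(labels: Sequence[str]) -> bool:
    n_similar = list(labels).count("similar")
    n_general = list(labels).count("general")
    if n_similar >= 2:                      # any two, or all three, similar
        return True
    if n_similar == 1 and n_general == 2:   # one similar, the other two general
        return True
    return n_general == 3                   # all three general


def overall_similarity(s_info: float, s_value: float, s_log: float) -> float:
    memberships = [fuzzify(s) for s in (s_info, s_value, s_log)]
    labels = [label(m) for m in memberships]
    if not entities_similar(labels):
        return 0.0
    # Defuzzification: mean of the largest membership in each vector.
    return sum(max(m) for m in memberships) / len(memberships)
```

With the example inputs, `overall_similarity(0.61, 0.99, 0.39)` labels the three dimensions general, similar and general, and returns (0.56 + 1 + 0.56)/3, approximately 0.707.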
The following describes the entity similarity degree calculation method according to the present embodiment by using a specific example.
The feasibility of the solution of the present embodiment was verified by analyzing the data set of the product, and the data information of the two entities is shown in tables 4 and 5.
Figure BDA0002986889150000101
TABLE 4 product Table
Figure BDA0002986889150000102
TABLE 5 company Table
Taking the data in the first row in tables 4 and 5 as an example, the specific implementation process of the association method between the entities is demonstrated:
Step 1: Calculating attribute information similarity
Step 1-1: Calculating the naive text similarity of the attribute names of the two entities
productID and companyID both have a character length of 9, i.e. $l_1 = 9$ and $l_2 = 9$; the edit distance between productID and companyID is $D = 7$. Substituting these values into formula (1) gives the naive text similarity $S_1 = 0.22$.
Step 1-2: calculating text semantic similarity of attribute names of two entities
In combination with the actual situation, the lexical dictionary provided by WordNet is used to establish the associated tree diagram, as shown in fig. 8. From the positions of the attribute names productID and companyID in the tree diagram, $H = 7$, $N_1 = 40$ and $N_2 = 46$ are obtained; substituting them into formula (2) gives the text semantic similarity between productID and companyID:
$$S_2 = \frac{2 \times 7}{40 + 46 + 2 \times 7} = 0.14$$
Step 1-3: generating attribute name similarity of two entities
Substituting $S_1$ and $S_2$ into formula (3) gives $S_3 = \mathrm{Max}(S_1, S_2) = \mathrm{Max}(0.22, 0.14) = 0.22$; that is, the similarity of the attribute names productID and companyID is 0.22.
Step 1-4: calculating similarity of attribute constraints
From the attribute constraint information in tables 4 and 5, $v = [1,1,1,1,1]$. The attribute constraint similarity calculated according to formula (5) is $S_4 = 1$. With the weights assigned as $\alpha = 0.5$ and $\beta = 0.5$, the attribute information similarity according to formula (6) is $S_5 = \alpha \cdot S_3 + \beta \cdot S_4 = 0.5 \times 0.22 + 0.5 \times 1 = 0.61$.
Step 2: Calculating attribute value similarity
Since the attribute values of the two entities are integer data, the similarity of the attribute values of the two entities in this example should be calculated by using a numerical attribute value similarity calculation method.
That is, first, the values of the respective elements are calculated in accordance with the feature vector elements defined in table 3, and feature vectors u and v are formed. Assuming that u = [50.5, 1,29.01,100,1], v = [28.5, 1,16.31,56,1], according to equation (7), the similarity of the attribute values of the two entities can be calculated:
$$S_6 = 0.99$$
and 3, step 3: calculating log similarity
By counting the relevant log information, it is found that the times of occurrence of the productID and the componyID in the log are 447 and 389 respectively, and the times of co-occurrence are 328. According to the formula
Figure BDA0002986889150000121
Log similarity can be obtained
Figure BDA0002986889150000122
Step 4: Fuzzy logic similarity determination
According to the above steps, the three-dimensional vector measuring the similarity of productID and companyID is [0.61, 0.99, 0.39]; after fuzzification with the triangular membership functions, the membership vectors are [0, 0.56, 0.44], [0, 0, 1] and [0.44, 0.56, 0], in that order.
Here, [0, 0.56, 0.44] indicates that, for the attribute information of the two entities, the membership degree of "dissimilar" is 0, of "general" is 0.56, and of "similar" is 0.44;
[0, 0, 1] indicates that, for the attribute values of the two entities, the membership degree of "dissimilar" is 0, of "general" is 0, and of "similar" is 1;
[0.44, 0.56, 0] indicates that, for the logs of the two entities, the membership degree of "dissimilar" is 0.44, of "general" is 0.56, and of "similar" is 0.
Therefore, according to the fuzzy rule, two entities can be judged to be similar.
The similarity of productID and companyID obtained by de-blurring is (0.56 +1+ 0.56)/3 =0.707.
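By way of illustration only, Step 4 can be sketched in Python as follows. The parameters of the membership functions are not stated in this passage. The breakpoints 0.25, 0.5 and 0.75 below are an assumption, chosen because they reproduce the membership vectors quoted in this example. The rule set is one reading of the fuzzy rules of claim 9, and all function names are hypothetical.

```python
LABELS = ("dissimilar", "general", "similar")


def fuzzify(x, lo=0.25, mid=0.5, hi=0.75):
    """Return (dissimilar, general, similar) membership degrees for x in [0, 1]."""
    if x <= lo:
        return (1.0, 0.0, 0.0)
    if x >= hi:
        return (0.0, 0.0, 1.0)
    if x <= mid:
        g = (x - lo) / (mid - lo)
        return (1.0 - g, g, 0.0)
    g = (hi - x) / (hi - mid)
    return (0.0, g, 1.0 - g)


def is_similar(memberships):
    """Apply the rules to the dominant label of each dimension."""
    labels = [LABELS[max(range(3), key=lambda i: m[i])] for m in memberships]
    n_sim = labels.count("similar")
    n_gen = labels.count("general")
    # Similar if: at least two dimensions similar; or one similar and the
    # other two general; or all three general. Otherwise dissimilar.
    return n_sim >= 2 or (n_sim == 1 and n_gen == 2) or n_gen == 3


def defuzzify(memberships):
    """Return 0 if dissimilar, else the mean of the maximum membership degrees."""
    if not is_similar(memberships):
        return 0.0
    return sum(max(m) for m in memberships) / len(memberships)


# Attribute information, attribute value and log similarity from Steps 1 to 3.
memberships = [fuzzify(x) for x in (0.61, 0.99, 0.39)]
score = defuzzify(memberships)
print([tuple(round(d, 2) for d in m) for m in memberships], round(score, 3))
```

With the assumed breakpoints, the sketch yields the membership vectors [0, 0.56, 0.44], [0, 0, 1] and [0.44, 0.56, 0] and the defuzzified similarity 0.707 given above.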
The similarity can be calculated in the same way for the remaining rows of data in Tables 4 and 5, with the results shown in Table 6.
Figure BDA0002986889150000123
TABLE 6
As can be seen from Table 6, there is a strong correlation between the air conditioner price and the compressor price.
Of course, the above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art can make several modifications and refinements without departing from the principle of the present invention, and these modifications and refinements shall also fall within the protection scope of the present invention.

Claims (9)

1. An entity association mining method based on a dynamic probe, characterized by comprising the following steps:
configuring a probe, and intercepting request information of an application system to a database and corresponding response data;
processing the sensed data to form formatted data of an entity and storing the formatted data into a relational database;
and performing feature fusion on the entity and the existing entity in the relational database, wherein the process comprises the following steps:
calculating the similarity of the attribute information of the two compared entities;
calculating the similarity of the attribute values of the two compared entities;
calculating the log similarity of the two compared entities;
obtaining the degree of similarity of the two compared entities by a fuzzy logic reasoning method from the calculated attribute information similarity, attribute value similarity and log similarity;
wherein the process of calculating the similarity of the attribute information of the two compared entities comprises the following steps:
the attribute information comprises an attribute name and an attribute constraint;
the calculation process of the attribute name similarity comprises the following steps:
calculating the plain text similarity S_1;
calculating the text semantic similarity S_2;
selecting the maximum of S_1 and S_2 as the attribute name similarity S_3;
the calculation process of the attribute constraint similarity comprises the following steps:
respectively defining the attribute constraint vectors of the two compared entities as A and B, wherein A_i and B_i respectively represent the values of the i-th candidate constraint in vector A and vector B;
calculating
Figure FDA0003802830580000011
wherein n is the number of candidate constraints in vector A and vector B, and "otherwise" denotes all cases other than A_i = B_i;
calculating the attribute constraint similarity S_4
Figure FDA0003802830580000012
calculating the attribute information similarity of the two compared entities by a weighting algorithm: S_5 = α·S_3 + β·S_4, wherein α and β are weights, α ∈ [0,1], β ∈ [0,1], and α + β = 1.
2. The dynamic probe-based entity association mining method of claim 1, wherein the plain text similarity S_1 is calculated by the formula:
Figure FDA0003802830580000021
wherein w_1 and w_2 are the attribute names of the two compared entities, respectively; l_1 and l_2 are the lengths of the attribute names w_1 and w_2; d is the edit distance between the attribute names w_1 and w_2; and Max is the function that takes the maximum value.
3. The dynamic probe-based entity association mining method according to claim 1 or 2, wherein the text semantic similarity S_2 is calculated as follows:
establishing a tree-shaped semantic hierarchy to form a tree graph;
calculating the similarity between the attribute names w_1 and w_2 of the two compared entities according to the positions of the attribute names in the tree graph
Figure FDA0003802830580000022
wherein N_1 and N_2 respectively represent the shortest paths from the attribute names w_1 and w_2 to their nearest common parent node w, and H denotes the shortest path from w to the root node.
4. The dynamic probe-based entity association mining method according to claim 1, wherein the process of calculating the similarity of the attribute values of two compared entities is:
according to data type, attribute values are divided into four types: numerical, character, enumerated, and text;
for numerical attribute values, selecting several or all of the mean, median, mode, sample standard deviation, maximum and minimum as feature vector elements to form the feature vectors u and v corresponding to the two compared entities, and calculating the similarity of the attribute values of the two compared entities
Figure FDA0003802830580000023
for character attribute values, first combining the attribute values of the two compared entities to form a corpus; then calculating the term frequency-inverse document frequency (TF-IDF) of the attribute values of each entity with the TF-IDF algorithm to form the corresponding vectors U and V; and calculating the similarity of the attribute values of the two compared entities
Figure FDA0003802830580000031
for enumerated attribute values, where the attribute value of each entity comprises at least two data items, converting the attribute values of the two compared entities into two sets A and B, and calculating the similarity of the attribute values of the two compared entities
Figure FDA0003802830580000032
wherein ∩ is the intersection symbol and ∪ is the union symbol;
for text attribute values, establishing a mathematical model with an autoencoder algorithm from deep learning, training the model with data in the attribute values, and calculating the similarity of the attribute values of the two compared entities with the trained model.
5. The dynamic probe-based entity association mining method of claim 4, wherein the process of calculating the similarity of the attribute values of two compared entities for the text-type attribute values is:
randomly selecting k data items from the attribute values of one entity to form a training set, forming a test set from the remaining data, and training the established mathematical model with the training set;
predefining a threshold ω; if the similarity result computed by the trained mathematical model for a data item in the test set is greater than ω, determining that the data item is similar to the training set; calculating the ratio of the number of test-set data items judged similar by the trained mathematical model to k, and recording it as λ;
forming a test set from all data in the attribute values of the other entity, repeating the previous step, and recording the resulting ratio as θ;
calculating the similarity of the attribute values of the two compared entities
Figure FDA0003802830580000033
wherein Min is the function that takes the minimum value.
6. The entity association mining method based on dynamic probe as claimed in claim 1, wherein the process of calculating the log similarity of two compared entities is:
in the system operation process, the fusion feature space stores a log file;
recording the two comparison entities as a and b, the log similarity of the two comparison entities is:
Figure FDA0003802830580000034
wherein N_a and N_b are the numbers of times the attribute names and/or attribute values of entities a and b, respectively, appear in the log file; N_ab is the number of times the attribute names and/or attribute values of entities a and b appear in the log file at the same time.
7. The entity association mining method based on dynamic probe as claimed in claim 6, wherein, when counting the numbers of occurrences and co-occurrences of the attribute names and/or attribute values of entities a and b in the log file, the number of occurrences of SQL commands is used as the count.
8. The dynamic probe-based entity association mining method according to claim 1, wherein the process of using fuzzy logic reasoning method to derive the similarity degree of two comparison entities is:
fuzzifying the attribute information similarity, the attribute value similarity and the log similarity of the two compared entities with triangular membership functions, correspondingly generating membership vectors over the grades dissimilar, general and similar;
judging whether the two compared entities are similar according to specified fuzzy rules;
if the judgment result is dissimilar, the result is 0;
if the judgment result is similar, the result is the average of the maximum membership values in the three membership vectors.
9. The dynamic probe-based entity association mining method of claim 8, wherein the fuzzification rule is:
If the attribute information and the attribute value are similar, or the attribute information and the log are similar, or the attribute value and the log are similar, or the attribute information, the attribute value and the log are all similar, the two compared entities are similar;
If the attribute information is similar and the attribute value and the log are general, or the attribute value is similar and the attribute information and the log are general, or the log is similar and the attribute information and the attribute value are general, the two compared entities are similar;
If the attribute information, the attribute value and the log are all general, the two compared entities are similar;
Else the two compared entities are dissimilar;
wherein If, or and Else denote the logical conditions "if", "or" and "otherwise", respectively.
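
By way of illustration only, two of the simpler measures named in the claims can be sketched in Python. The formula images of the claims are not reproduced in this text, so both formulas below are assumptions: the plain text similarity is taken as S_1 = 1 − d / Max(l_1, l_2) with d the Levenshtein edit distance, and the enumerated attribute value similarity is taken as |A ∩ B| / |A ∪ B|. The function names and sample inputs are hypothetical.

```python
def edit_distance(s, t):
    """Levenshtein distance between strings s and t (row-by-row dynamic programming)."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (cs != ct)))  # substitution
        prev = cur
    return prev[-1]


def plain_text_similarity(w1, w2):
    # Assumption: S_1 = 1 - d / Max(l_1, l_2).
    longest = max(len(w1), len(w2))
    if longest == 0:
        return 1.0  # convention chosen here: two empty names are identical
    return 1.0 - edit_distance(w1, w2) / longest


def enumerated_similarity(values_a, values_b):
    # Assumption: similarity = |A ∩ B| / |A ∪ B| over the sets of attribute values.
    a, b = set(values_a), set(values_b)
    union = a | b
    return len(a & b) / len(union) if union else 1.0


print(plain_text_similarity("productID", "productId"))
print(enumerated_similarity(["red", "green", "blue"], ["green", "blue", "black"]))
```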

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110302533.5A 2021-03-22 2021-03-22 Entity association mining method based on dynamic probe

Publications (2)

Publication Number Publication Date
CN112966027A CN112966027A (en) 2021-06-15
CN112966027B true CN112966027B (en) 2022-10-21

Family

ID=76278144

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant