CN110188148B - Entity identification method and device facing multimode heterogeneous characteristics - Google Patents

Info

Publication number
CN110188148B
Authority
CN
China
Prior art keywords: matrix, entity, data, relationship, feature
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910435097.1A
Other languages
Chinese (zh)
Other versions
CN110188148A (en
Inventor
周小平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Civil Engineering and Architecture
Original Assignee
Beijing University of Civil Engineering and Architecture
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Civil Engineering and Architecture
Priority to CN201910435097.1A
Publication of CN110188148A
Application granted
Publication of CN110188148B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2465 Query processing support for facilitating data mining operations in structured databases
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Fuzzy Systems (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

The invention provides an entity identification method and device for multi-mode heterogeneous features. The method comprises the following steps: fusing the relationships between the first entities in first data with the relationships between the second entities in second data to obtain a relationship feature fusion result; generating a heterogeneous feature matrix of the first data from the first entities of each preset entity type, generating a heterogeneous feature matrix of the second data from the second entities of each preset entity type, and fusing the two heterogeneous feature matrices to obtain a heterogeneous feature fusion result; calculating the similarity between the attribute features of the first entities and those of the second entities to generate an attribute feature similarity matrix; and mining associated entities according to the relationship feature fusion result, the heterogeneous feature fusion result and the attribute feature similarity matrix to determine the associated entities. The embodiment of the invention requires no prior knowledge, generalizes well, and improves the quality of associated entity mining.

Description

Entity identification method and device facing multimode heterogeneous characteristics
Technical Field
The embodiment of the invention belongs to the technical field of data mining, and particularly relates to an entity identification method and device for multimode heterogeneous characteristics.
Background
A large amount of scattered data is generated over the whole life cycle of a building, and data fusion is a means of maximizing the value of this big data. Mining associated entities achieves the organic fusion of dispersed and isolated data, ensures the consistency and integrity of building data, and is an effective way to solve problems such as "information islands" and "information faults" in building engineering and to improve the value of the data. For example, in stage 1, $M_A$ and $M_B$ are data generated by different tools, and an association is established between them; in stage 2, after $M_A$ is upgraded to version 2, establishing the data association between version 2 of $M_A$ and version 1 of $M_B$ prevents data faults and data islands between the data of stage 1 and stage 2.
An associated entity is an entity that refers to the same real-world object in different data. Associated entity mining aims to discover the associated entities between different data accurately, comprehensively, and quickly, without prior conditions. In general, data $M$ can be represented as the combination of a set of entities $E$ and a set of relationships between entities $R$, i.e., $M = (E, R)$. Given two data $M_A = (E_A, R_A)$ and $M_B = (E_B, R_B)$, if $E_{Ai} \in E_A$ and $E_{Bj} \in E_B$ refer to the same object in the real world, then $E_{Ai}$ and $E_{Bj}$ are associated entities, written $E_{Ai} = E_{Bj}$; otherwise $E_{Ai} \neq E_{Bj}$. Associated entity mining means finding all associated entities between $M_A$ and $M_B$ from the data features they contain. In general, it translates into deciding whether $E_{Ai} \in E_A$ and $E_{Bj} \in E_B$ are associated entities, i.e.
Figure BDA0002070269000000011
Associated entity mining aims to discover all associated entities in two data accurately, comprehensively, and quickly. Mining based on the UUID (the unique identifier of an entity) is undoubtedly the simplest and most accurate method; however, different tools maintain different UUIDs, and even different versions of the same tool may produce different UUIDs. Existing methods therefore mostly rely on the entity features contained in the data. At present, most associated entity mining is based on geometric attribute matching, manual labeling, or text attribute modeling. Methods based on geometric attribute matching can detect three-dimensional similarities and differences between two models, but they only identify differences in geometric shape, are difficult to apply to associated entities with complex relationships such as reference and inheritance, and cannot identify entities that have no geometric shape. These methods can therefore mine only a portion of the associated entities. Associated entity mining by manual labeling is mainly used for version change management: a change management system is established to manage changes, evaluate their influence on projects, and so on. Its effectiveness depends on the construction quality of the change relation model and on the accuracy of the manual change labels; the manual workload is heavy and errors are likely.
Associated entity mining is also similar or related to research on language translation in natural language processing, entity alignment in knowledge bases, database record linkage, entity matching, named recognition in information retrieval, associated user mining in social networks, bipartite graph matching, and isomorphic network alignment in biological information. However, these methods have limitations when applied to associated entity mining: (1) many of them require some prior associated entities and then use supervised or semi-supervised learning to mine further associated entities from them, so they are difficult to apply directly when no prior associated entities are available; (2) most of them are designed for specific fields and suit single-mode or isomorphic scenarios, so they cannot be applied directly to associated entity mining under multi-mode heterogeneous features.
Disclosure of Invention
To overcome, or at least partially solve, the problems that existing entity identification methods can identify only part of the entities, are time-consuming, labor-intensive and error-prone, require prior knowledge of the entities, and have a limited range of application, the embodiment of the invention provides an entity identification method and device for multi-mode heterogeneous features.
According to a first aspect of the embodiments of the present invention, there is provided a method for identifying an entity oriented to a multi-modal heterogeneous feature, including:
fusing the relationships between the first entities in the first data and the relationships between the second entities in the second data to obtain a relationship feature fusion result; wherein the first data comprises a plurality of first entities and the relationships between all of the first entities, and the second data comprises a plurality of second entities and the relationships between all of the second entities;
generating a heterogeneous feature matrix of the first data according to any first entity of any preset entity type, generating a heterogeneous feature matrix of the second data according to any second entity of any preset entity type, and fusing the heterogeneous feature matrix of the first data and the heterogeneous feature matrix of the second data to obtain a heterogeneous feature fusion result;
calculating the similarity between the attribute features of any first entity and the attribute features of any second entity, and generating an attribute feature similarity matrix according to all the similarities;
and mining associated entities according to the relationship feature fusion result, the heterogeneous feature fusion result and the attribute feature similarity matrix, and determining the associated first entity and the second entity.
According to a second aspect of the embodiments of the present invention, an entity identification apparatus facing multi-modal heterogeneous features is provided, including:
a first obtaining module, configured to fuse the relationships between the first entities in the first data and the relationships between the second entities in the second data to obtain a relationship feature fusion result; wherein the first data comprises a plurality of first entities and the relationships between all of the first entities, and the second data comprises a plurality of second entities and the relationships between all of the second entities;
a second obtaining module, configured to generate a heterogeneous feature matrix of the first data according to any first entity of any preset entity type, generate a heterogeneous feature matrix of the second data according to any second entity of any preset entity type, and fuse the heterogeneous feature matrix of the first data and the heterogeneous feature matrix of the second data to obtain a heterogeneous feature fusion result;
the third acquisition module is used for calculating the similarity between the attribute characteristics of any first entity and the attribute characteristics of any second entity and generating an attribute characteristic similarity matrix according to all the similarities;
and the determining module is used for mining associated entities according to the relationship feature fusion result, the heterogeneous feature fusion result and the attribute feature similarity matrix, and determining the associated first entity and the associated second entity.
According to a third aspect of the embodiments of the present invention, there is also provided an electronic apparatus, including:
at least one processor, and at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor to invoke the entity identification method for multimodal heterogeneous features provided by any of the various possible implementations of the first aspect.
According to a fourth aspect of embodiments of the present invention, there is also provided a non-transitory computer-readable storage medium storing computer instructions for causing a computer to execute the entity identification method for multimodal heterogeneous characteristics provided in any one of the various possible implementations of the first aspect.
The embodiment of the invention provides an entity identification method and device for multi-mode heterogeneous features. The method first uses entities and relationships to formally describe the first data and the second data; then, for the multi-mode heterogeneous features of the entities, it fuses the attribute features, the multi-mode features and the heterogeneous features of the entities respectively, and mines associated entities according to the fusion results. The embodiment requires no prior knowledge and generalizes well; by jointly considering the attribute features and the multi-mode heterogeneous features of the entities, it improves the quality of associated entity mining, namely recall and accuracy. It also remains applicable to existing single-mode or isomorphic settings and is therefore general. In addition, the method can be applied not only to BIM data but also to data in other fields such as social networks, traffic networks, biological information, and electronic commerce systems.
Drawings
Fig. 1 is a schematic overall flow chart of an entity identification method for multi-modal heterogeneous features according to an embodiment of the present invention;
fig. 2 is a schematic overall structure diagram of an entity identification apparatus for multi-modal heterogeneous features according to an embodiment of the present invention;
fig. 3 is a schematic view of an overall structure of an electronic device according to an embodiment of the present invention.
Detailed Description
The following describes embodiments of the present invention in further detail with reference to the drawings and examples. The following examples are intended to illustrate the examples of the present invention, but are not intended to limit the scope of the examples of the present invention.
In an embodiment of the present invention, an entity identification method for multi-mode heterogeneous features is provided. Fig. 1 is a schematic overall flow chart of this method, which includes: S101, fusing the relationships between the first entities in the first data and the relationships between the second entities in the second data to obtain a relationship feature fusion result; wherein the first data comprises a plurality of first entities and the relationships between all of the first entities, and the second data comprises a plurality of second entities and the relationships between all of the second entities;
The first data and the second data are two different versions of data. They may be BIM (Building Information Model) data, but the embodiment is not limited to such data. The same formal description is applied to both: an entity in the first data is called a first entity, and an entity in the second data is called a second entity. The formal description is explained below taking the first data as an example. The first data consists of entities and entity relationships, i.e., $D = (E, R)$, where $E$ is the set of all first entities in the first data $D$ and $R$ is the set of relationships between all first entities in $D$. Any first entity $E_i$ in the first data includes attribute features; for example, the attribute features of entities in BIM data may follow the IFC (Industry Foundation Classes) data standard. The common attribute features of the first entities and of the second entities may be summarized using literature research and inductive summarization. Any entity relationship $R_{ijk}$ can be described as $R_{ijk} = \{E_i, E_j, C_k\}$, meaning that $E_i$ depends on $E_j$ through the relationship $C_k$. Thus the set
$$E_{ik} = \{E_j \mid R_{ijk} \in R\}$$
describes all entities on which $E_i$ depends through the relationship $C_k$. The present embodiment does not limit the kinds of relationship between the first entities or between the second entities. When the first data $D_A$ contains $n_A$ first entities, the adjacency matrix formed by a given relationship among all the first entities can be represented by an $n_A \times n_A$ matrix $N_A$, whose entries record that relationship between any two first entities. Likewise, when the second data $D_B$ contains $n_B$ second entities, the adjacency matrix formed by a given relationship among all the second entities can be represented by an $n_B \times n_B$ matrix $N_B$, whose entries record that relationship between any two second entities. For example, when $n_A = 3$, the adjacency matrix $N_A$ formed by a given relationship among the three first entities is:
$$N_A = \begin{pmatrix} R_{E_1,E_1} & R_{E_1,E_2} & R_{E_1,E_3} \\ R_{E_2,E_1} & R_{E_2,E_2} & R_{E_2,E_3} \\ R_{E_3,E_1} & R_{E_3,E_2} & R_{E_3,E_3} \end{pmatrix}$$
where $R_{E_i,E_j}$, with $i = 1, 2, 3$ and $j = 1, 2, 3$, is the given relationship between the $i$-th and $j$-th first entities. The relationships between the first entities and the relationships between the second entities are fused by taking the tensor product of $N_A$ and $N_B$, which gives the relationship feature fusion result, namely
$$N_A \otimes N_B$$
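As a minimal sketch of the formal description above, the following Python represents relationships as (E_i, E_j, C_k) triples, collects the entities that a given entity depends on through one relationship, builds an adjacency matrix for that relationship, and takes the Kronecker (tensor) product of two such matrices. The entity names, the relationship label "connects", the 0/1 encoding of matrix entries and all function names are illustrative assumptions, not taken from the patent.

```python
def dependents(relations, entity, kind):
    """Return every E_j such that (entity, E_j, kind) is a relationship triple."""
    return {ej for (ei, ej, ck) in relations if ei == entity and ck == kind}


def adjacency(entities, relations, kind):
    """Build an n x n matrix with 1 where the relationship `kind` holds (assumed 0/1 encoding)."""
    index = {e: i for i, e in enumerate(entities)}
    n = len(entities)
    matrix = [[0] * n for _ in range(n)]
    for ei, ej, ck in relations:
        if ck == kind:
            matrix[index[ei]][index[ej]] = 1
    return matrix


def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [x * y for x in row_a for y in row_b]
        for row_a in a
        for row_b in b
    ]


# Hypothetical toy data: two entities per data set.
entities_a = ["a1", "a2"]
relations_a = [("a1", "a2", "connects")]
entities_b = ["b1", "b2"]
relations_b = [("b1", "b2", "connects"), ("b2", "b1", "connects")]

n_a = adjacency(entities_a, relations_a, "connects")
n_b = adjacency(entities_b, relations_b, "connects")
fused = kron(n_a, n_b)  # (n_A * n_B) x (n_A * n_B) matrix
```

With this ordering, row `i * n_B + k` and column `j * n_B + l` of `fused` hold the product of entry (i, j) of the first matrix and entry (k, l) of the second, so a nonzero entry means both relationships are present.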
S102, generating a heterogeneous feature matrix of the first data according to any first entity of any preset entity type, generating a heterogeneous feature matrix of the second data according to any second entity of any preset entity type, and fusing the heterogeneous feature matrix of the first data and the heterogeneous feature matrix of the second data to obtain a heterogeneous feature fusion result;
Assume that the first data and the second data each have $C$ preset entity types. An $n_A \times C$ matrix $H_A$ then represents the heterogeneous feature matrix of the first data, and an $n_B \times C$ matrix $H_B$ represents the heterogeneous feature matrix of the second data. Fusing the heterogeneous feature matrix of the first data and the heterogeneous feature matrix of the second data means operating on the two matrices to obtain a matrix $H$, the heterogeneous feature fusion result. The present embodiment does not limit the operation performed between the two heterogeneous feature matrices.
S103, calculating the similarity between the attribute features of any first entity and the attribute features of any second entity, and generating an attribute feature similarity matrix according to all the similarities;
There may be one or more attribute features, for example a text attribute feature, and the attribute features differ between application scenarios. When there is one attribute feature, the similarity between any first entity $E_{Ai}$ and any second entity $E_{Bj}$ is calculated directly to generate the attribute feature similarity matrix. When there are several attribute features, a similarity matrix is obtained for each of them, and these matrices are then fused to obtain the final attribute feature similarity matrix.
And S104, mining associated entities according to the relationship feature fusion result, the heterogeneous feature fusion result and the attribute feature similarity matrix, and determining the associated first entity and the second entity.
An entity depends on its surrounding environment, so associated entities can be identified from that environment. The basic idea of associated entity mining in this embodiment is therefore as follows: if a first entity $E_{Ai}$ and a second entity $E_{Bj}$ are associated entities, i.e., $E_{Ai} = E_{Bj}$, then $E_{Ai}$ and $E_{Bj}$ should satisfy the following conditions: a. node consistency: $E_{Ai}$ and $E_{Bj}$ have the same entity type or the same inherited entity type; b. attribute consistency: $E_{Ai}$ and $E_{Bj}$ should have similar text and geometric attribute features; c. environment consistency: $E_{Ai}$ and $E_{Bj}$ have similar environments, i.e., most of the entities in $E_{Aik}$ and $E_{Bjk}$ are also associated entities. Based on this basic assumption, the embodiment performs associated entity mining on the basis of the relationship feature fusion result, the heterogeneous feature fusion result and the attribute feature similarity matrix, and determines the associated first entity and second entity. The present embodiment does not limit the method of associated entity mining.
This embodiment first uses entities and relationships to formally describe the first data and the second data; then, for the multi-mode heterogeneous features of the entities, it fuses the attribute features, the multi-mode features and the heterogeneous features of the entities respectively, and mines associated entities according to the fusion results. The embodiment requires no prior knowledge and generalizes well; by jointly considering the attribute features and the multi-mode heterogeneous features of the entities, it improves the quality of associated entity mining, namely recall and accuracy. It also remains applicable to existing single-mode or isomorphic settings and is therefore general. In addition, the method can be applied not only to BIM data but also to data in other fields such as social networks, traffic networks, biological information, and electronic commerce systems.
On the basis of the above embodiments, the relationship in this embodiment includes a connection relationship and a modal relationship; correspondingly, the step of fusing the relationship between the first entities in the first data and the relationship between the second entities in the second data to obtain the fusion result of the relationship characteristics specifically includes: generating a first adjacency matrix according to the connection relation between any two first entities, and generating a second adjacency matrix according to the connection relation between any two second entities; carrying out tensor product on the first adjacent matrix and the second adjacent matrix to generate a topological relation fusion matrix; for any one modal relationship, generating a first modal relationship matrix according to the modal relationship between any two first entities, and generating a second modal relationship matrix according to the modal relationship between any two second entities; carrying out tensor product on the first modal relationship matrix and the second modal relationship matrix to generate a fusion matrix of the modal relationship; adding the fusion matrixes of all the modal relations to obtain a multi-mode characteristic fusion matrix; and performing dot multiplication on the topological relation fusion matrix and the multi-mode feature fusion matrix to obtain the relation feature fusion result.
The connection relationship is the interconnection between entities, i.e., the topological relationship of the data. Various modal relationships exist between entities, including reference, inclusion, decomposition, connection and inheritance; that is, entities depend on each other in different forms. In this embodiment, after summarizing the types and characteristics of the existing relationship modalities, entity relationship graphs are built for the different modalities, and a data analysis method is then used to analyze, from a large amount of actual engineering data, the structural characteristics and similarities of the entity relationship graph under each modal relationship, including density, degree distribution, radius and the like. Consider first data $D_A$ containing $n_A$ first entities and second data $D_B$ containing $n_B$ second entities. The adjacency matrix formed by the interconnections among all the first entities can be represented by an $n_A \times n_A$ matrix $N_A$, and the adjacency matrix formed by the interconnections among all the second entities can be represented by an $n_B \times n_B$ matrix $N_B$. The topological relation fusion matrix $N$ of the first data and the second data can then be formed using a tensor product, that is
$$N = N_A \otimes N_B$$
where, for tensors of order 2, the operator $\otimes$ is also known as the Kronecker product.
Assuming that there are $L$ different modal relationships in the first data and in the second data, $L$ matrices $M_A^l$ of size $n_A \times n_A$ and $L$ matrices $M_B^l$ of size $n_B \times n_B$ can be used to describe the modal relationship matrices of the first data and the second data, respectively, where $0 \le l < L$. The $n_A \times n_A$ matrix $M_A^l$ describes the relationships among all first entities of $D_A$ in modality $l$, and the $n_A \times n_A$ matrix $N_A \odot M_A^l$ describes $D_A$ in modality $l$ with the adjacency matrix taken into account, where $\odot$ is the dot product of matrices, i.e., the Hadamard product. Without regard to the adjacency matrix, the multi-mode feature fusion matrix $M$ of the first data and the second data can be formed using tensor products, i.e.,
$$M = \sum_{l=0}^{L-1} M_A^l \otimes M_B^l$$
The topological relation fusion matrix and the multi-mode feature fusion matrix are combined by dot multiplication to obtain the relationship feature fusion result matrix $T$, which jointly describes the topological features and the multi-mode features of the two data, namely
$$T = N \odot M$$
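The fusion steps described in this embodiment can be sketched in Python as follows: a Kronecker product of the two adjacency matrices, a sum of Kronecker products over the modal relationship matrices, and an element-wise (Hadamard) product of the two results. This is only an illustration under stated assumptions: the 0/1 toy matrices, the two hypothetical modalities and every function name are invented for the example, and plain nested lists are used so that the block needs no third-party library.

```python
def kron(a, b):
    """Kronecker product of two matrices given as lists of rows."""
    return [
        [x * y for x in row_a for y in row_b]
        for row_a in a
        for row_b in b
    ]


def add(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def hadamard(a, b):
    """Element-wise (Hadamard) product of two equally sized matrices."""
    return [[x * y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def fuse_relationships(n_a, n_b, modal_a, modal_b):
    """Combine topology and modal relationships into one fused matrix."""
    topo = kron(n_a, n_b)                       # topological fusion matrix
    size = len(topo)
    multi = [[0] * size for _ in range(size)]   # multi-mode fusion matrix
    for m_a, m_b in zip(modal_a, modal_b):      # one pair per modality
        multi = add(multi, kron(m_a, m_b))
    return hadamard(topo, multi)


# Hypothetical toy data: two entities per data set, two modalities.
n_a = [[0, 1], [1, 0]]
n_b = [[0, 1], [1, 0]]
modal_a = [[[0, 1], [0, 0]], [[0, 0], [1, 0]]]
modal_b = [[[0, 1], [0, 0]], [[0, 0], [1, 0]]]

t = fuse_relationships(n_a, n_b, modal_a, modal_b)
```

In this toy case the only nonzero entries of `t` are those where the two data sets are connected and share the same modality between the corresponding entity pairs.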
On the basis of the foregoing embodiment, in this embodiment, the step of fusing the heterogeneous feature matrix of the first data and the heterogeneous feature matrix of the second data to obtain a heterogeneous feature fusion result specifically includes: for any preset entity type, acquiring a first column vector formed by the preset entity type in the heterogeneous feature matrix of the first data and a second column vector formed by the preset entity type in the heterogeneous feature matrix of the second data; carrying out tensor product on the first column vector and the second column vector to generate an feature fusion matrix of the preset entity type; and adding the feature fusion matrixes of all the preset entity types to obtain a heterogeneous feature fusion result.
Specifically, let $H_A^c$ denote the column vector formed by the $c$-th dimension feature of the heterogeneous feature matrix $H_A$ of the first data, and let $H_B^c$ denote the column vector formed by the $c$-th dimension feature of the heterogeneous feature matrix $H_B$ of the second data, where $0 \le c < C$. The tensor product of $H_A^c$ and $H_B^c$ forms the feature fusion matrix of the $c$-th dimension, and adding the feature fusion matrices of all dimensions gives the heterogeneous feature fusion result. The calculation formula is as follows:
$$H = \sum_{c=0}^{C-1} H_A^c \otimes H_B^c$$
wherein H is a heterogeneous characteristic fusion result, and C is the number of the preset entity types.
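A small Python sketch of this column-wise fusion is given below. It assumes, purely for illustration, that each row of a heterogeneous feature matrix is a one-hot indicator of the entity's type, and it treats the tensor product of two column vectors as a flat vector of length n_A * n_B; neither choice is stated in the patent, and the function names are hypothetical.

```python
def column(matrix, c):
    """Extract column c of a matrix given as a list of rows."""
    return [row[c] for row in matrix]


def kron_vec(u, v):
    """Kronecker product of two vectors; entry i * len(v) + j equals u[i] * v[j]."""
    return [x * y for x in u for y in v]


def fuse_heterogeneous(h_a, h_b):
    """Sum, over entity types, the Kronecker products of matching columns."""
    n_types = len(h_a[0])
    fused = [0] * (len(h_a) * len(h_b))
    for c in range(n_types):
        term = kron_vec(column(h_a, c), column(h_b, c))
        fused = [f + t for f, t in zip(fused, term)]
    return fused


# Hypothetical one-hot type indicators: 2 first entities, 3 second entities, 2 types.
h_a = [[1, 0], [0, 1]]
h_b = [[1, 0], [1, 0], [0, 1]]

h = fuse_heterogeneous(h_a, h_b)
```

Under the one-hot assumption, entry `i * n_B + j` of `h` is 1 exactly when first entity `i` and second entity `j` share a preset entity type, which matches the node consistency condition.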
On the basis of the above embodiment, in this embodiment, the attribute feature of the first entity includes a text attribute feature and a geometric attribute feature, and the attribute feature of the second entity includes a text attribute feature and a geometric attribute feature; correspondingly, the step of generating the attribute feature similarity matrix according to all the similarities specifically includes: generating an attribute feature similarity matrix according to the similarity between the text attribute features of any first entity and the text attribute features of any second entity; generating a geometric characteristic similarity matrix according to the similarity between the geometric attribute characteristics of any first entity and the geometric attribute characteristics of any second entity; and fusing the attribute characteristic similarity matrix and the geometric characteristic similarity matrix based on a Logit regression model to generate an attribute similarity matrix.
Specifically, the attribute features in the present embodiment include a text attribute feature and a geometric attribute feature; the entities in BIM data, for example, have both. Because the text attributes of an entity are mostly short texts, the similarity of the text attribute features is calculated as follows: a short-text word vector construction and optimization algorithm is analyzed and an entity attribute semantic feature vector model is established; the text attribute semantic feature vector of each first entity and each second entity is obtained from this model; and the similarity between the attribute features of any first entity $E_{Ai}$ and any second entity $E_{Bj}$ is then calculated from their semantic feature vectors by cosine similarity, Euclidean distance or a similar method, forming the $n_A \times n_B$ attribute feature similarity matrix $P_P$. To calculate the similarity of the geometric attribute features, the geometric model of each entity may first be converted into the Brep format and then into a triangular mesh. Similarity is calculated between the triangular mesh of any first entity and that of any second entity, forming the geometric attribute similarity matrix $P_G$ of all entities in the first data and the second data. The present embodiment is not, however, limited to this calculation method.
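To illustrate the cosine similarity option mentioned above, the sketch below builds a similarity matrix from two lists of feature vectors. The vectors are hypothetical stand-ins for text attribute semantic feature vectors; how those vectors are actually produced is not covered here, and the function names are assumptions.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def similarity_matrix(vectors_a, vectors_b):
    """Entry (i, j) is the similarity of first entity i and second entity j."""
    return [[cosine(u, v) for v in vectors_b] for u in vectors_a]


# Hypothetical 2-dimensional feature vectors for two first and two second entities.
vectors_a = [[1.0, 0.0], [1.0, 1.0]]
vectors_b = [[1.0, 0.0], [0.0, 1.0]]

p_text = similarity_matrix(vectors_a, vectors_b)
```

The result has one row per first entity and one column per second entity, the same shape as the similarity matrices described in this embodiment.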
Two entities do not need a large similarity on every attribute feature in order to be associated: when the similarity of the text attribute features and/or the geometric attribute features is large, the two entities already have a certain probability of being associated entities. Therefore, in this embodiment, a Logit regression model is used to fuse the text attribute similarity matrix and the geometric attribute similarity matrix to form the attribute feature similarity matrix P, and the formula is as follows:
Figure BDA0002070269000000101
wherein P is the attribute similarity matrix, α is an adjustment parameter, P_P is the attribute feature similarity matrix, and P_G is the geometric feature similarity matrix.
On the basis of the above embodiment, in this embodiment, the associated entity mining is performed according to the relationship feature fusion result, the heterogeneous feature fusion result, and the attribute feature similarity matrix by using the following formula:
s=βT's+(1-β)(H⊙P)';
wherein β is an adjustment parameter, T' is the normalized matrix of the relationship feature fusion result, H is the heterogeneous feature fusion result, P is the attribute feature similarity matrix, H⊙P denotes the dot product of H and P, (H⊙P)' denotes the normalized result of the dot product of H and P, and s denotes the correlation matrix between all the first entities and all the second entities.
Specifically, in associated entity mining based on a multi-mode heterogeneous network, a pair of entities whose neighbors are all associated entities is itself a pair of associated entities; this is environment consistency. Let the n_A×n_B matrix S be the correlation matrix between all the first entities and all the second entities; the two entities corresponding to an item with a higher value in S are more likely to be associated entities. If s = vec(S) is the vector formed by expanding S by columns, where vec(·) is the vectorization operator, then an associated entity mining model based on the multi-mode features can be constructed as follows:
Figure BDA0002070269000000111
wherein T_A' is the normalized result of T_A, and T_B' is the normalized result of T_B.
On this basis, attribute consistency is further incorporated. If p = vec(P) and p' is the normalized form of p, then:
Figure BDA0002070269000000112
In this embodiment, a linear weighting method is adopted to fuse the attribute similarity model, and the above formula is changed into:
Figure BDA0002070269000000113
wherein beta is an adjusting parameter.
The embodiment further incorporates heterogeneous characteristics, i.e., node consistency. Obviously, if two entities are heterogeneous, their similarity is 0. To this end, a consistency model of attributes and nodes may be constructed as:
Q=H⊙P。
if q is vec (q), q' is a standardized matrix of q, and then a correlation entity mining model meeting node consistency, attribute consistency and environment consistency is established, namely the correlation entity mining model is
Figure BDA0002070269000000114
On the basis of the foregoing embodiment, the mining of associated entities according to the relationship feature fusion result, the heterogeneous feature fusion result, and the attribute feature similarity matrix in this embodiment specifically includes: solving the associated entity mining model s = βT's + (1-β)(H⊙P)' to obtain a solution of s; acquiring the items in the solution of s whose values are greater than or equal to a first preset threshold, and taking the first entity and the second entity corresponding to each such item as potential associated entities; matching the potential associated entities based on a bipartite graph matching method to obtain an optimal matching; and if the value of any matching item in the optimal matching is greater than a second preset threshold, taking the first entity and the second entity corresponding to that matching item as associated entities.
Specifically, the associated entity mining model is a typical Sylvester equation. After s is solved from the formula, the entity pairs corresponding to the items in s whose values are greater than or equal to the first preset threshold are the potential associated entities, that is, entity pairs that are likely to be associated entities. Since s is a normalized matrix, s·1 = I, where 1 is the all-ones vector and I is the identity matrix. Therefore, the associated entity mining model can be converted into:
Figure BDA0002070269000000121
Solving for s is essentially solving for the matrix
Figure BDA0002070269000000122
which can be solved in an iterative manner, i.e.
Figure BDA0002070269000000123
where k is the number of iterations. When s converges, the converged value is the solution of s. The entity pairs corresponding to the items with higher values in s are the potential associated entities in D_A and D_B. Associated entity mining between the first data and the second data is a one-to-one matching, i.e., one first entity in the first data is matched with one second entity in the second data. On the basis of the potential associated entities, the optimal matching is obtained with a bipartite graph matching method. When the value of a matching item in the optimal matching is greater than a second preset threshold, the two entities corresponding to that matching item are regarded as associated entities.
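The solving and matching steps described above can be sketched as follows. This is a hedged illustration under stated assumptions, not the implementation of the embodiment: the exact iteration formula and normalization appear only in the figures, so the sketch assumes the fixed-point update s(k+1) = βT's(k) + (1-β)q', assumes T' is column-normalized and q' sums to 1, and replaces a production bipartite matching algorithm (for example the Hungarian method) with a brute-force search that is only practical for tiny inputs. All function names, the sample matrices and both thresholds are hypothetical.

```python
from itertools import permutations

def normalize_columns(m):
    """Scale each column to sum to 1 (all-zero columns stay zero)."""
    rows, cols = len(m), len(m[0])
    sums = [sum(m[r][c] for r in range(rows)) for c in range(cols)]
    return [[m[r][c] / sums[c] if sums[c] else 0.0 for c in range(cols)]
            for r in range(rows)]

def vec(m):
    """Stack the columns of a matrix into one vector (column-wise expansion)."""
    return [m[r][c] for c in range(len(m[0])) for r in range(len(m))]

def unvec(v, rows, cols):
    """Inverse of vec: rebuild a rows x cols matrix from a column-stacked vector."""
    return [[v[c * rows + r] for c in range(cols)] for r in range(rows)]

def solve(t_norm, q_norm, beta=0.5, tol=1e-12, max_iter=1000):
    """Iterate s <- beta * T' s + (1 - beta) * q' until the update is below tol."""
    s = list(q_norm)
    for _ in range(max_iter):
        nxt = [beta * sum(t * x for t, x in zip(row, s)) + (1 - beta) * q
               for row, q in zip(t_norm, q_norm)]
        if max(abs(a - b) for a, b in zip(nxt, s)) < tol:
            return nxt
        s = nxt
    return s

def best_matching(S, first_threshold, second_threshold):
    """Keep candidates >= first_threshold, pick the one-to-one matching with the
    largest total score (brute force, assumes n_A <= n_B), then keep only the
    matched pairs whose score exceeds second_threshold."""
    n_a, n_b = len(S), len(S[0])
    best, best_score = [], -1.0
    for perm in permutations(range(n_b), n_a):
        pairs = [(i, j) for i, j in enumerate(perm) if S[i][j] >= first_threshold]
        score = sum(S[i][j] for i, j in pairs)
        if score > best_score:
            best, best_score = pairs, score
    return [(i, j) for i, j in best if S[i][j] > second_threshold]

# Hypothetical inputs for two first entities and two second entities.
T = [[0, 0, 0, 1], [0, 0, 1, 0], [0, 1, 0, 0], [1, 0, 0, 0]]  # relationship fusion
H = [[1, 1], [1, 1]]                                          # same entity type
P = [[0.9, 0.1], [0.2, 0.8]]                                  # attribute similarity
Q = [[h * p for h, p in zip(hr, pr)] for hr, pr in zip(H, P)]  # Q = H (.) P
q = vec(Q)
q_norm = [x / sum(q) for x in q]
s = solve(normalize_columns(T), q_norm, beta=0.5)
S = unvec(s, 2, 2)
matches = best_matching(S, first_threshold=0.1, second_threshold=0.3)
```

With a column-normalized T' and β < 1, each update shrinks the difference between successive iterates, which is why a simple convergence test on the largest change is sufficient in this sketch.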
In another embodiment of the embodiments of the present invention, an entity identification apparatus oriented to multi-modal heterogeneous features is provided, and the apparatus is used for implementing the methods in the foregoing embodiments. Therefore, the description and definition in the embodiments of the entity identification method oriented to the multi-mode heterogeneous features may be used for understanding each execution module in the embodiments of the present invention. Fig. 2 is a schematic diagram of an overall structure of an entity identification apparatus facing multi-modal heterogeneous features according to an embodiment of the present invention, where the apparatus includes a first obtaining module 201, a second obtaining module 202, a third obtaining module 203, and a determining module 204; wherein:
the first obtaining module 201 is configured to fuse the relationships between the first entities in the first data with the relationships between the second entities in the second data, and obtain a relationship feature fusion result; wherein the first data comprises a plurality of first entities and the relationships between all of the first entities, and the second data comprises a plurality of second entities and the relationships between all of the second entities; the second obtaining module 202 is configured to generate a heterogeneous feature matrix of the first data according to any first entity of any preset entity type, generate a heterogeneous feature matrix of the second data according to any second entity of any preset entity type, and fuse the heterogeneous feature matrix of the first data and the heterogeneous feature matrix of the second data to obtain a heterogeneous feature fusion result; the third obtaining module 203 is configured to calculate the similarity between the attribute features of any one of the first entities and the attribute features of any one of the second entities, and generate an attribute feature similarity matrix according to all the similarities; the determining module 204 is configured to perform associated entity mining according to the relationship feature fusion result, the heterogeneous feature fusion result, and the attribute feature similarity matrix, and determine the associated first entity and second entity.
In this embodiment, the first data and the second data are formally described in terms of entities and relationships; then, to address the multi-mode heterogeneous features of the entities, the attribute features, multi-mode features and heterogeneous features of the entities are fused separately, and the associated entities are mined according to the fusion results. This embodiment therefore requires no prior knowledge and has strong generalization capability; it comprehensively considers the attribute features and the multi-mode heterogeneous features of the entities, which improves the quality of associated entity mining, namely recall rate and accuracy; and it is also applicable to existing single-mode or homogeneous environments, which gives it generality. In addition, the method can be applied not only to BIM data but also to data in other fields such as social networks, traffic networks, biological information and electronic commerce systems.
On the basis of the above embodiments, the relationship in this embodiment includes a connection relationship and a modal relationship; correspondingly, the first obtaining module is specifically configured to: generating a first adjacency matrix according to the connection relation between any two first entities, and generating a second adjacency matrix according to the connection relation between any two second entities; carrying out tensor product on the first adjacent matrix and the second adjacent matrix to generate a topological relation fusion matrix; for any one modal relationship, generating a first modal relationship matrix according to the modal relationship between any two first entities, and generating a second modal relationship matrix according to the modal relationship between any two second entities; carrying out tensor product on the first modal relationship matrix and the second modal relationship matrix to generate a fusion matrix of the modal relationship; adding the fusion matrixes of all the modal relations to obtain a multi-mode characteristic fusion matrix; and performing dot multiplication on the topological relation fusion matrix and the multi-mode feature fusion matrix to obtain the relation feature fusion result.
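The relationship feature fusion performed by the first obtaining module can be sketched as follows. This is an illustrative sketch only: it takes the "tensor product" of two matrices to be the Kronecker product and the "dot multiplication" to be the element-wise product, and the function names and the small sample matrices are hypothetical.

```python
def kron(a, b):
    """Kronecker (tensor) product of two matrices given as lists of rows."""
    return [
        [a_ij * b_kl for a_ij in row_a for b_kl in row_b]
        for row_a in a
        for row_b in b
    ]

def add(x, y):
    """Element-wise sum of two equally sized matrices."""
    return [[p + q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]

def hadamard(x, y):
    """Element-wise product of two equally sized matrices."""
    return [[p * q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]

def fuse_relationships(adj_a, adj_b, modal_a, modal_b):
    """Fuse topology and modal relationships of the two data sets.

    adj_a, adj_b: adjacency matrices of the first and second data.
    modal_a, modal_b: lists of modal relationship matrices, one per modal
    relationship, in the same order for both data sets.
    """
    topo = kron(adj_a, adj_b)              # topological relation fusion matrix
    multi = None
    for m_a, m_b in zip(modal_a, modal_b):
        fused = kron(m_a, m_b)             # fusion matrix of one modal relationship
        multi = fused if multi is None else add(multi, fused)
    return hadamard(topo, multi)           # relationship feature fusion result

# Hypothetical data: two entities per data set, two directed modal relationships.
adj_a = [[0, 1], [1, 0]]
adj_b = [[0, 1], [1, 0]]
modal_a = [[[0, 1], [0, 0]], [[0, 0], [1, 0]]]
modal_b = [[[0, 1], [0, 0]], [[0, 0], [1, 0]]]
T = fuse_relationships(adj_a, adj_b, modal_a, modal_b)
```

In this sketch a nonzero entry of T means that a connected pair in the first data and a connected pair in the second data share the same modal relationship; pairs that are connected but linked by different modal relationships are zeroed out by the element-wise product.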
On the basis of the foregoing embodiment, the second obtaining module in this embodiment is specifically configured to: for any preset entity type, acquire a first column vector formed by the preset entity type in the heterogeneous feature matrix of the first data and a second column vector formed by the preset entity type in the heterogeneous feature matrix of the second data; perform a tensor product on the first column vector and the second column vector to generate a feature fusion matrix of the preset entity type; and add the feature fusion matrices of all the preset entity types to obtain a heterogeneous feature fusion result.
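The heterogeneous feature fusion can be sketched in the same spirit. The sketch assumes that each heterogeneous feature matrix is a 0/1 matrix with one row per entity and one column per preset entity type, and that the tensor product of two column vectors is taken as their outer product, so that the result H has size n_A×n_B and can be multiplied element-wise with the attribute similarity matrix. The function names and the sample types are hypothetical.

```python
def outer(u, v):
    """Outer product of two vectors, returned as a len(u) x len(v) matrix."""
    return [[a * b for b in v] for a in u]

def fuse_heterogeneous(het_a, het_b):
    """Sum, over all entity types, the outer product of the type columns.

    het_a: n_A x t matrix, het_a[i][c] = 1 if first entity i has type c.
    het_b: n_B x t matrix, het_b[j][c] = 1 if second entity j has type c.
    Returns the n_A x n_B matrix H with H[i][j] = 1 when the types agree.
    """
    n_a, n_b, n_types = len(het_a), len(het_b), len(het_a[0])
    fused = [[0] * n_b for _ in range(n_a)]
    for c in range(n_types):
        col_a = [row[c] for row in het_a]   # first column vector for type c
        col_b = [row[c] for row in het_b]   # second column vector for type c
        block = outer(col_a, col_b)         # feature fusion matrix of type c
        fused = [[f + x for f, x in zip(fr, br)] for fr, br in zip(fused, block)]
    return fused

# Hypothetical types: column 0 = wall, column 1 = door.
het_a = [[1, 0], [0, 1]]            # first data: a wall, a door
het_b = [[0, 1], [1, 0], [1, 0]]    # second data: a door, two walls
H = fuse_heterogeneous(het_a, het_b)
```

Because each entity has exactly one type in this sketch, H acts as a mask: entity pairs of different types receive 0 and are therefore excluded when H is multiplied element-wise with the attribute similarity matrix.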
On the basis of the above embodiment, in this embodiment, the attribute feature of the first entity includes a text attribute feature and a geometric attribute feature, and the attribute feature of the second entity includes a text attribute feature and a geometric attribute feature; correspondingly, the third obtaining module is specifically configured to: generating an attribute feature similarity matrix according to the similarity between the text attribute features of any first entity and the text attribute features of any second entity; generating a geometric characteristic similarity matrix according to the similarity between the geometric attribute characteristics of any first entity and the geometric attribute characteristics of any second entity; and fusing the attribute characteristic similarity matrix and the geometric characteristic similarity matrix based on a Logit regression model to generate an attribute similarity matrix.
On the basis of the foregoing embodiments, the determining module in this embodiment performs associated entity mining according to the relationship feature fusion result, the heterogeneous feature fusion result, and the attribute feature similarity matrix by using the following formula:
s=βT's+(1-β)(H⊙P)';
wherein β is an adjustment parameter, T' is the normalized matrix of the relationship feature fusion result, H is the heterogeneous feature fusion result, P is the attribute feature similarity matrix, H⊙P denotes the dot product of H and P, (H⊙P)' denotes the normalized result of the dot product of H and P, and s denotes the correlation matrix between all the first entities and all the second entities.
On the basis of the foregoing embodiment, the determining module in this embodiment is further configured to: solving the associated entity mining model to obtain a solution of s; acquiring items of which the numerical values in the solution of s are greater than or equal to a first preset threshold value, and taking a first entity and a second entity corresponding to each item as potential associated entities; matching the potential associated entities based on a bipartite graph matching method to obtain optimal matching; and if the value of any matching item in the optimal matching is larger than a second preset threshold value, taking the first entity and the second entity corresponding to the matching item as associated entities.
On the basis of the foregoing embodiment, in this embodiment, the third obtaining module specifically fuses the attribute feature similarity matrix and the geometric feature similarity matrix based on a Logit regression model by using the following formula to generate an attribute similarity matrix:
Figure BDA0002070269000000151
wherein P is the attribute similarity matrix, α is an adjustment parameter, P_P is the attribute feature similarity matrix, and P_G is the geometric feature similarity matrix.
This embodiment provides an electronic device. Fig. 3 is a schematic diagram of the overall structure of the electronic device according to an embodiment of the present invention. The electronic device includes: at least one processor 301, at least one memory 302, and a bus 303; wherein:
the processor 301 and the memory 302 are communicated with each other through a bus 303;
the memory 302 stores program instructions executable by the processor 301, and the processor calls the program instructions to perform the methods provided by the above method embodiments, for example, the method includes: fusing the relationship between the first entities in the first data and the relationship between the second entities in the second data to obtain a relationship characteristic fusion result; generating a heterogeneous feature matrix of first data according to a first entity of a preset entity type, generating a heterogeneous feature matrix of second data according to a second entity of the preset entity type, and fusing the heterogeneous feature matrix of the first data and the heterogeneous feature matrix of the second data to obtain a heterogeneous feature fusion result; calculating the similarity between the attribute characteristics of the first entity and the second entity to generate an attribute characteristic similarity matrix; and mining the associated entities according to the relationship characteristic fusion result, the heterogeneous characteristic fusion result and the attribute characteristic similarity matrix to determine the associated entities.
The present embodiments provide a non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the methods provided by the above method embodiments, for example, including: fusing the relationship between the first entities in the first data and the relationship between the second entities in the second data to obtain a relationship characteristic fusion result; generating a heterogeneous feature matrix of first data according to a first entity of a preset entity type, generating a heterogeneous feature matrix of second data according to a second entity of the preset entity type, and fusing the heterogeneous feature matrix of the first data and the heterogeneous feature matrix of the second data to obtain a heterogeneous feature fusion result; calculating the similarity between the attribute characteristics of the first entity and the second entity to generate an attribute characteristic similarity matrix; and mining the associated entities according to the relationship characteristic fusion result, the heterogeneous characteristic fusion result and the attribute characteristic similarity matrix to determine the associated entities.
Those of ordinary skill in the art will understand that all or part of the steps of the method embodiments may be implemented by hardware under the control of program instructions; the program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method embodiments; and the aforementioned storage medium includes various media that can store program code, such as ROM, RAM, magnetic disks or optical disks.
The above-described embodiments of the electronic device are merely illustrative, and units illustrated as separate components may or may not be physically separate, and components displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium, such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods of the various embodiments or some parts of the embodiments.
Finally, the method of the present application is only a preferred embodiment and is not intended to limit the scope of the embodiments of the present invention. Any modification, equivalent replacement, improvement and the like made within the spirit and principle of the embodiments of the present invention should be included in the protection scope of the embodiments of the present invention.

Claims (8)

1. An entity identification method facing multimode heterogeneous characteristics is characterized by comprising the following steps:
fusing the relationship between the first entities in the first data and the relationship between the second entities in the second data to obtain a relationship characteristic fusion result; wherein the first data comprises a plurality of first entities and the relationships between all of the first entities, and the second data comprises a plurality of second entities and the relationships between all of the second entities; moreover, the first data and the second data are BIM data;
generating a heterogeneous feature matrix of the first data according to any first entity of any preset entity type, generating a heterogeneous feature matrix of the second data according to any second entity of any preset entity type, and fusing the heterogeneous feature matrix of the first data and the heterogeneous feature matrix of the second data to obtain a heterogeneous feature fusion result;
calculating the similarity between the attribute features of any first entity and the attribute features of any second entity, and generating an attribute feature similarity matrix according to all the similarities;
performing associated entity mining according to the relationship feature fusion result, the heterogeneous feature fusion result and the attribute feature similarity matrix, and determining the associated first entity and the second entity;
wherein the relationship comprises a connection relationship and a modal relationship;
correspondingly, the step of fusing the relationship between the first entities in the first data and the relationship between the second entities in the second data to obtain the fusion result of the relationship characteristics specifically includes:
generating a first adjacency matrix according to the connection relation between any two first entities, and generating a second adjacency matrix according to the connection relation between any two second entities;
carrying out tensor product on the first adjacent matrix and the second adjacent matrix to generate a topological relation fusion matrix;
for any one modal relationship, generating a first modal relationship matrix according to the modal relationship between any two first entities, and generating a second modal relationship matrix according to the modal relationship between any two second entities;
carrying out tensor product on the first modal relationship matrix and the second modal relationship matrix to generate a fusion matrix of the modal relationship;
adding the fusion matrixes of all the modal relations to obtain a multi-mode characteristic fusion matrix;
performing dot multiplication on the topological relation fusion matrix and the multi-mode feature fusion matrix to obtain a relation feature fusion result;
the step of fusing the heterogeneous feature matrix of the first data and the heterogeneous feature matrix of the second data to obtain a heterogeneous feature fusion result specifically includes:
for any preset entity type, acquiring a first column vector formed by the preset entity type in the heterogeneous feature matrix of the first data and a second column vector formed by the preset entity type in the heterogeneous feature matrix of the second data;
carrying out tensor product on the first column vector and the second column vector to generate a feature fusion matrix of the preset entity type;
and adding the feature fusion matrixes of all the preset entity types to obtain a heterogeneous feature fusion result.
2. The method of claim 1, wherein the attribute features of the first entity comprise text attribute features and geometric attribute features, and the attribute features of the second entity comprise text attribute features and geometric attribute features;
correspondingly, the step of generating the attribute feature similarity matrix according to all the similarities specifically includes:
generating an attribute feature similarity matrix according to the similarity between the text attribute features of any first entity and the text attribute features of any second entity;
generating a geometric characteristic similarity matrix according to the similarity between the geometric attribute characteristics of any first entity and the geometric attribute characteristics of any second entity;
and fusing the attribute characteristic similarity matrix and the geometric characteristic similarity matrix based on a Logit regression model to generate an attribute similarity matrix.
3. The method according to any one of claims 1-2, wherein the related entity mining is performed according to the relationship feature fusion result, the heterogeneous feature fusion result, and the attribute feature similarity matrix by the following formula:
s=βT's+(1-β)(H⊙P)';
wherein β is an adjustment parameter, T' is the normalized matrix of the relationship feature fusion result, H is the heterogeneous feature fusion result, P is the attribute feature similarity matrix, H⊙P denotes the dot product of H and P, (H⊙P)' denotes the normalized result of the dot product of H and P, and s denotes the correlation matrix between all the first entities and all the second entities.
4. The method according to claim 3, wherein performing association entity mining according to the relationship feature fusion result, the heterogeneous feature fusion result, and the attribute feature similarity matrix, and the step of determining the associated first entity and second entity specifically includes:
solving the associated entity mining model s = βT's + (1-β)(H⊙P)' to obtain a solution of s;
acquiring items of which the numerical values in the solution of s are greater than or equal to a first preset threshold value, and taking a first entity and a second entity corresponding to each item as potential associated entities;
matching the potential associated entities based on a bipartite graph matching method to obtain optimal matching;
and if the value of any matching item in the optimal matching is larger than a second preset threshold value, taking the first entity and the second entity corresponding to the matching item as associated entities.
5. The method according to claim 2, wherein the attribute feature similarity matrix and the geometric feature similarity matrix are fused based on a Logit regression model by the following formula to generate an attribute similarity matrix:
Figure FDA0002806208890000031
wherein P is the attribute similarity matrix, α is an adjustment parameter, P_P is the attribute feature similarity matrix, and P_G is the geometric feature similarity matrix.
6. An entity identification device facing multi-mode heterogeneous features, comprising:
the first acquisition module is used for fusing the relationship between the first entities in the first data and the relationship between the second entities in the second data to acquire a relationship characteristic fusion result; wherein the first data comprises a plurality of first entities and the relationships between all of the first entities, and the second data comprises a plurality of second entities and the relationships between all of the second entities; moreover, the first data and the second data are BIM data;
a second obtaining module, configured to generate a heterogeneous feature matrix of the first data according to any first entity of any preset entity type, generate a heterogeneous feature matrix of the second data according to any second entity of any preset entity type, and fuse the heterogeneous feature matrix of the first data and the heterogeneous feature matrix of the second data to obtain a heterogeneous feature fusion result;
the third acquisition module is used for calculating the similarity between the attribute characteristics of any first entity and the attribute characteristics of any second entity and generating an attribute characteristic similarity matrix according to all the similarities;
the determining module is used for mining associated entities according to the relationship feature fusion result, the heterogeneous feature fusion result and the attribute feature similarity matrix, and determining the associated first entity and the associated second entity;
wherein the relationship comprises a connection relationship and a modal relationship;
correspondingly, the first obtaining module is specifically configured to: generating a first adjacency matrix according to the connection relation between any two first entities, and generating a second adjacency matrix according to the connection relation between any two second entities; carrying out tensor product on the first adjacent matrix and the second adjacent matrix to generate a topological relation fusion matrix; for any one modal relationship, generating a first modal relationship matrix according to the modal relationship between any two first entities, and generating a second modal relationship matrix according to the modal relationship between any two second entities; carrying out tensor product on the first modal relationship matrix and the second modal relationship matrix to generate a fusion matrix of the modal relationship; adding the fusion matrixes of all the modal relations to obtain a multi-mode characteristic fusion matrix; performing dot multiplication on the topological relation fusion matrix and the multi-mode feature fusion matrix to obtain a relation feature fusion result;
the second obtaining module is specifically configured to: for any preset entity type, acquiring a first column vector formed by the preset entity type in the heterogeneous feature matrix of the first data and a second column vector formed by the preset entity type in the heterogeneous feature matrix of the second data; carrying out tensor product on the first column vector and the second column vector to generate a feature fusion matrix of the preset entity type; and adding the feature fusion matrixes of all the preset entity types to obtain a heterogeneous feature fusion result.
7. An electronic device, comprising:
at least one processor, at least one memory, and a bus; wherein,
the processor and the memory complete mutual communication through the bus;
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1 to 5.
8. A non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method of any one of claims 1 to 5.
CN201910435097.1A 2019-05-23 2019-05-23 Entity identification method and device facing multimode heterogeneous characteristics Active CN110188148B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910435097.1A CN110188148B (en) 2019-05-23 2019-05-23 Entity identification method and device facing multimode heterogeneous characteristics

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910435097.1A CN110188148B (en) 2019-05-23 2019-05-23 Entity identification method and device facing multimode heterogeneous characteristics

Publications (2)

Publication Number Publication Date
CN110188148A CN110188148A (en) 2019-08-30
CN110188148B true CN110188148B (en) 2021-02-02

Family

ID=67717581

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910435097.1A Active CN110188148B (en) 2019-05-23 2019-05-23 Entity identification method and device facing multimode heterogeneous characteristics

Country Status (1)

Country Link
CN (1) CN110188148B (en)

Families Citing this family (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110704640A (en) * 2019-09-30 2020-01-17 北京邮电大学 Representation learning method and device of knowledge graph
CN110909424B (en) * 2019-10-31 2023-08-15 武汉科技大学 Planetary gear train isomorphism judging method, system and medium based on adjacency matrix
CN110825827B (en) * 2019-11-13 2022-10-25 北京明略软件系统有限公司 Entity relationship recognition model training method and device and entity relationship recognition method and device
CN111242318B (en) * 2020-01-13 2024-04-26 拉扎斯网络科技(上海)有限公司 Service model training method and device based on heterogeneous feature library
CN111539210B (en) * 2020-04-16 2023-08-11 支付宝(杭州)信息技术有限公司 Cross-network entity identification method and device, electronic equipment and medium
CN111931485B (en) * 2020-08-12 2021-03-23 北京建筑大学 Multi-mode heterogeneous associated entity identification method based on cross-network representation learning
CN114625875B (en) * 2022-03-09 2024-03-29 平安科技(深圳)有限公司 Pattern matching method, device, storage medium and equipment for multiple data source information

Family Cites Families (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103631970B (en) * 2013-12-20 2017-08-18 百度在线网络技术(北京)有限公司 The method and apparatus for excavating attribute and entity associated relation
US9589235B2 (en) * 2014-05-23 2017-03-07 Jiali Ding Method for measuring individual entities' infectivity and susceptibility in contagion
CN107609469B (en) * 2017-07-28 2020-12-04 Beijing University of Civil Engineering and Architecture Social network associated user mining method and system
CN107545046B (en) * 2017-08-17 2021-05-25 Beijing Qianxin Technology Co., Ltd. Fusion method and device for multi-source heterogeneous data

Also Published As

Publication number Publication date
CN110188148A (en) 2019-08-30

Similar Documents

Publication Publication Date Title
CN110188148B (en) Entity identification method and device facing multimode heterogeneous characteristics
CN111522989B (en) Method, computing device, and computer storage medium for image retrieval
Li et al. Multi-source information fusion based heterogeneous network embedding
Yılmaz et al. Multimodal event detection in Twitter hashtag networks
US20170060977A1 (en) Data preparation for data mining
Huang et al. Efficient business process consolidation: combining topic features with structure matching
Movshovitz-Attias et al. KB-LDA: Jointly learning a knowledge base of hierarchy, relations, and facts
CN115827895A (en) Vulnerability knowledge graph processing method, device, equipment and medium
CN111931485B (en) Multi-mode heterogeneous associated entity identification method based on cross-network representation learning
CN112257959A (en) User risk prediction method and device, electronic equipment and storage medium
Wu et al. A Tensor CP decomposition method for clustering heterogeneous information networks via stochastic gradient descent algorithms
CN112800179B (en) Associated database query method and device, storage medium and electronic equipment
CN113761185A (en) Main key extraction method, equipment and storage medium
Du et al. Evaluating structural and topological consistency of complex regions with broad boundaries in multi-resolution spatial databases
CN116244497A (en) Cross-domain paper recommendation method based on heterogeneous data embedding
CN112765183B (en) Multi-source data fusion method and device, storage medium and electronic equipment
CN114443783A (en) Supply chain data analysis and enhancement processing method and device
Wang et al. Matching weak informative ontologies
CN113822112A (en) Method and apparatus for determining label weights
CN112287005A (en) Data processing method, device, server and medium
Cao E-Commerce Big Data Mining and Analytics
Di et al. Pattern match query for spatiotemporal RDF graph
Liu et al. Research on dynamic ontology construction method for knowledge fusion in group corporation
CN112685574B (en) Method and device for determining hierarchical relationship of domain terms
US20230359606A1 (en) Methods and systems for connecting data with non-standardized schemas in connected graph data exchanges

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant