CN111931485B - Multi-mode heterogeneous associated entity identification method based on cross-network representation learning - Google Patents

Multi-mode heterogeneous associated entity identification method based on cross-network representation learning

Info

Publication number
CN111931485B
Authority
CN
China
Prior art keywords
entity
heterogeneous
entities
multimode
network
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202010806775.3A
Other languages
Chinese (zh)
Other versions
CN111931485A (en)
Inventor
周小平
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Civil Engineering and Architecture
Original Assignee
Beijing University of Civil Engineering and Architecture
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Civil Engineering and Architecture
Priority to CN202010806775.3A
Publication of CN111931485A
Application granted
Publication of CN111931485B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/205 Parsing
    • G06F40/216 Parsing using statistical methods
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/35 Clustering; Classification
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/22 Matching criteria, e.g. proximity measures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00 Handling natural language data
    • G06F40/20 Natural language analysis
    • G06F40/279 Recognition of textual entities
    • G06F40/289 Phrasal analysis, e.g. finite state techniques or chunking
    • G06F40/295 Named entity recognition
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Abstract

The invention provides a multimode heterogeneous associated entity identification method based on cross-network representation learning. The method comprises the following steps: given two multimode heterogeneous information networks $G_A = (E_A, R_A, T_A, C_A)$ and $G_B = (E_B, R_B, T_B, C_B)$, where $E_A$ and $E_B$ are entity sets, $R_A$ and $R_B$ are entity relationship sets, $T_A$ and $T_B$ are entity type sets, and $C_A$ and $C_B$ are entity relationship type sets, take two entities $E_{Ai} \in E_A$ and $E_{Bj} \in E_B$; based on the set of random walk paths between $E_{Ai}$ and $E_{Bj}$, establish the multimode relation transition probability $M_{ij}$ between $E_{Ai}$ and $E_{Bj}$ by an iterative method; learn from $M_{ij}$ with an objective function to obtain the multimode heterogeneous feature vectors of $E_{Ai}$ and $E_{Bj}$; when $E_{Ai}$ and $E_{Bj}$ are judged to have multimode heterogeneous consistency, attribute consistency and environment consistency, determine that $E_{Ai}$ and $E_{Bj}$ are associated entities. The invention fully analyzes the multimode heterogeneous characteristics of the multimode heterogeneous information network, and forms a formal description method for multimode heterogeneous information networks together with a multimode heterogeneous associated entity identification model and method based on cross-network representation learning.

Description

Multi-mode heterogeneous associated entity identification method based on cross-network representation learning
Technical Field
The invention relates to the technical field of identification of multimode heterogeneous information network associated entities, in particular to a multimode heterogeneous associated entity identification method based on cross-network representation learning.
Background
The multimode heterogeneous information network (Building Information Model/Modeling) is a digital representation of the physical and functional characteristics of building facilities. It aims to provide reliable, shared knowledge resources for decision-making and collaboration among the different participants across the whole life cycle of a building, and it has become an important part of modernizing the construction industry and building smart cities in China.
Identification of multimode heterogeneous information network associated entities aims to find data entities in different multimode heterogeneous information networks that refer to the same real-world object. Accurate and comprehensive identification of associated entities integrates dispersed and isolated multimode heterogeneous information networks. It is the key to whole-process integrated application of these networks and to whole-life-cycle data sharing in engineering construction projects; it addresses the "information gap" and "information island" problems in the digitization of current construction projects; and it provides reliable and complete infrastructure big data for the whole-cycle management of construction projects and smart cities. At present, most identification methods are based on manual labeling, geometric attribute matching or text attribute modeling. A few studies consider the entity relationships of multimode heterogeneous information networks but ignore the multimode characteristics of those relationships.
Identification of multimode heterogeneous information network associated entities is a cross-domain, cross-discipline task. It is key to the whole-process integrated application and whole-life-cycle data fusion and sharing of multimode heterogeneous information networks, and an important component of domain-oriented theory and methods for building big data value and knowledge discovery. The method enriches theories and methods in data mining such as associated entity identification and network representation learning, promotes the application and innovation of leading-edge computer science theories and methods in building science, opens a new line of basic research on multimode heterogeneous information networks, opens a new research direction at the intersection of computer science, architecture and civil engineering, and has important theoretical value. Its results can support the major national need to modernize, transform and upgrade the construction industry, serve the construction and "full-cycle management" of smart cities, smart infrastructures, smart people and the like, and bring considerable economic and social benefits.
At present, the importance of identifying multimode heterogeneous information network associated entities has attracted wide attention from scholars at home and abroad. Cross-network representation learning, which embeds different networks as continuous low-dimensional vectors in the same space, has been a research hotspot in machine learning in recent years. Many universities and research institutions at home and abroad study multimode heterogeneous information network associated entity identification and cross-network representation learning, and results appear in top journals and conferences in computer science and its interdisciplinary applications. Multimode heterogeneous information network associated entity identification is therefore at the frontier of current interdisciplinary research spanning computer science, architecture and civil engineering.
A multimode heterogeneous information network associated entity is an entity that refers to the same real-world object in different multimode heterogeneous information networks. In general, a multimode heterogeneous information network $G$ can be expressed as $G = (E, R, T, C)$, where $E$ and $R$ are the heterogeneous entity set and the inter-entity multimode relationship set, respectively, and $T$ and $C$ are the type sets of $E$ and $R$, respectively. Given two multimode heterogeneous information networks $G_A = (E_A, R_A, T_A, C_A)$ and $G_B = (E_B, R_B, T_B, C_B)$, if $E_{Ai} \in E_A$ and $E_{Bj} \in E_B$ refer to the same object in the real world, then $E_{Ai}$ and $E_{Bj}$ are called associated entities, written $E_{Ai} = E_{Bj}$; otherwise $E_{Ai} \neq E_{Bj}$. FIG. 1 is a schematic diagram of multimode heterogeneous information network associated entity identification, which uses the data features in $G_A$ and $G_B$ to determine whether $E_{Ai} \in E_A$ and $E_{Bj} \in E_B$ are associated entities, i.e.:
Figure BDA0002629425610000027
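As an illustration of the formal description $G = (E, R, T, C)$ given above, the following Python sketch shows one possible in-memory representation. The class name `HeteroNetwork`, its method names and the example identifiers are hypothetical and are not part of the patent; relations are kept in a list so that several relations of different modes can exist between the same pair of entities.

```python
from dataclasses import dataclass, field


@dataclass
class HeteroNetwork:
    """Hypothetical container for a network G = (E, R, T, C)."""
    # entity id -> entity type (the types form T)
    entity_types: dict[str, str] = field(default_factory=dict)
    # (source id, target id, relation type); the relation types form C
    relations: list[tuple[str, str, str]] = field(default_factory=list)

    def add_entity(self, entity_id: str, entity_type: str) -> None:
        self.entity_types[entity_id] = entity_type

    def add_relation(self, source: str, target: str, relation_type: str) -> None:
        if source not in self.entity_types or target not in self.entity_types:
            raise KeyError("both endpoints must be registered entities")
        # A list (not a set of pairs) allows multimode relations on one pair.
        self.relations.append((source, target, relation_type))

    @property
    def E(self) -> set[str]:
        return set(self.entity_types)

    @property
    def T(self) -> set[str]:
        return set(self.entity_types.values())

    @property
    def C(self) -> set[str]:
        return {relation_type for _, _, relation_type in self.relations}

    def neighbors(self, entity_id: str) -> set[str]:
        """Neighbor set N_i, treating relations as undirected (an assumption)."""
        result = set()
        for source, target, _ in self.relations:
            if source == entity_id:
                result.add(target)
            elif target == entity_id:
                result.add(source)
        return result
```

Two such objects, one per network, would play the roles of $G_A$ and $G_B$ in the identification task.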
IFC (Industry Foundation Classes) is the currently recognized international standard for multimode heterogeneous information networks and is widely used by enterprises across the construction industry. At present, almost all multimode heterogeneous information network software supports the IFC format, and most multimode heterogeneous information network research, for example on building construction, is based on the IFC standard. Under the IFC standard, the multimode heterogeneous information network exhibits multimode heterogeneous characteristics and massive entity characteristics.
① Multimode heterogeneous characteristics
The heterogeneous characteristic means that multimode heterogeneous information network entities are of many types and that entities of different types have different attributes. Currently, IFC defines 653 different entities, and the number continues to grow with practical demand and with each iteration of the IFC version. The attributes of a multimode heterogeneous information network entity can be divided into semi-structured text attributes, which describe basic information about the entity, and unstructured geometric attributes, which describe its three-dimensional shape. In the IFC standard, only entities that inherit from the IFCProduct class can have geometric attributes. The roof objects in FIG. 1 all inherit from the IFCProduct class and contain both geometric and text attributes. The text attributes of entities in different multimode heterogeneous information networks suffer from non-uniform fields, missing values, redundancy, inaccuracy, inconsistency and similar problems. As a result, identification methods based on text attributes have poor identification quality (low recall and accuracy) and cannot meet the requirements of whole-process integrated application of the multimode heterogeneous information network.
The multimode characteristic means that several relationships of potentially different modes may exist between any two multimode heterogeneous information network entities. Currently, IFC defines 19 different relationship types in 5 major classes, including reference, containment, decomposition, connection and inheritance. The varied forms in which multimode relationships are described make the formal description and mathematical expression of the multimode heterogeneous information network challenging. The multimode characteristic also means that entities depend on one another in different forms and show strong dependence. Introducing entity relationships is an effective way to improve the poor identification quality of text-attribute-based identification methods; however, existing methods ignore the multimode characteristic of the relationships.
The multimode heterogeneous characteristics are an important manifestation of the complexity of the multimode heterogeneous information network. Current research starts from the attributes of the entities and rarely explores the multimode characteristics of the network. Exploring these characteristics in depth would make it possible to establish a formal description method for the multimode heterogeneous information network from the perspective of complex networks, promote the application of graph theory, network science, graph learning and big data theories and methods to the network, open a new line of basic applied research, and lay a model foundation for associated entity identification, parallel computing and related tasks.
② Massive entity characteristics
IFC is a highly compressed description file format for multimode heterogeneous information networks: an IFC file at the megabyte scale contains millions or even tens of millions of entities. In general, the multimode heterogeneous information network of an actual engineering project consists of several IFC files from different disciplines. According to statistics, the multimode heterogeneous information network of a three-story building at the design stage can reach 50 GB. The multimode heterogeneous information network therefore contains a vast number of entities.
In the prior art, most research methods target only small multimode heterogeneous information networks. Some scholars have noted the massive entities and big data characteristics of these networks and have studied distributed storage and management frameworks for lightweight visualization and domain-oriented applications. Parallel computing, which distributes computing tasks over several processing units, is an effective way to improve processing capacity and efficiency. A few studies have made initial explorations of parallel computing methods for multimode heterogeneous information networks; however, these methods ignore the imbalance of entity attributes, are difficult to apply to arbitrary networks, and cannot meet the requirements of whole-life-cycle parallel processing. The strong dependence among entities also makes it difficult to apply existing parallel computing frameworks directly. Because the topic spans several disciplines, research on parallel computing methods for multimode heterogeneous information networks is still scarce, which limits the ability of associated entity identification methods to process large networks quickly.
(1) Research status of multimode heterogeneous information network associated entity identification
Identification based on the UUID (Universally Unique Identifier) is the simplest and most accurate method; however, different multimode heterogeneous information network tools maintain different UUIDs, and even different versions of the same tool generate different UUIDs. At present, most identification methods for multimode heterogeneous information network associated entities are based on manual labeling, geometric attribute matching or text attribute modeling.
Identification by manual labeling depends on the quality of the change relation model and on the accuracy of the manual change labels; the manual workload is heavy and error-prone. Identification based on geometric attribute matching can detect three-dimensional similarities and differences between two models, but it only identifies differences in geometric shape, is difficult to apply to associated entities with complex relationships such as reference and inheritance, and cannot identify entities that have no geometric shape. To address the problems of manual labeling, some studies propose associated entity identification models based on the text attributes of entities; however, entities of the same type typically have similar text attributes. For example, in FIG. 1, the text attributes of several window entities of the same type are mostly identical or similar. This similarity among entities of the same type limits the application range of such methods. A few studies convert the reference relationships between entities into an RDF (Resource Description Framework) graph and a reference hierarchy in order to improve the quality of text-attribute-based identification. These methods also ignore the complex relationships and geometric attribute characteristics of the multimode heterogeneous information network.
Using both the attributes and the multimode heterogeneous characteristics of the entities is an effective way to improve identification quality, yet the prior art contains little research in this direction. On the one hand, the field lacks formal description methods oriented to multimode heterogeneous characteristics, so existing identification methods are limited to attribute features such as text. On the other hand, existing data mining theories and methods have difficulty extracting the multimode heterogeneous characteristics of massive entities from different networks into the same feature space.
(2) Cross-network representation learning research status oriented to associated entity identification
Network Representation Learning, also known as Network/Graph Embedding, has been one of the research hotspots and frontiers of machine learning in recent years. Because network representation learning supports representation and inference in vector space, more and more scholars extend it from a single network to multiple networks, exploring cross-network representation learning models and their application to social network associated user identification, knowledge graph alignment and similar tasks. Most studies of social network associated user identification build a homogeneous single-mode network with users as nodes and user relationships as edges, and then establish a cross-network representation learning model and method using graph neural networks, deep active learning and the like. Some scholars have noted the heterogeneous entities in social networks and build a heterogeneous network with heterogeneous entities as nodes and their relationships as edges. Wang et al. extract user interests from user content, build a heterogeneous network with users and interests as nodes, and then propose a cross-network user feature representation learning model. Zhou et al. build a heterogeneous network with entities in a social network such as users, locations, posts and pictures as nodes and the relationships between them as edges, establish a cross-network representation learning model by designing a Meta Path, and complete the identification of associated users. Ye et al. use a graph convolutional network to establish a cross-network edge and node feature representation learning model given a priori associated entities.
Scalability is the mark of a cross-network representation learning method that can handle large amounts of data. Existing cross-network representation learning methods that have been experimentally verified on data sets at the million level and above all borrow the distributed learning capability of the word vector (Word2Vec) model. Heterogeneous network representation learning based on the word vector model often requires carefully designed meta paths [30]; meta path design relies on domain knowledge, and its complexity increases sharply as the number of network entity types and relationship modes grows. As a result, there is little research on cross-network distributed representation learning oriented to multimode heterogeneous characteristics. A domain-independent distributed representation learning model across multimode heterogeneous networks would remove the dependence of existing heterogeneous network distributed representation learning on meta path design; it would also apply to single-mode and/or homogeneous networks in any field, and would therefore be general.
(3) Research status of multimode heterogeneous associated entity identification
Data mining oriented to multimode heterogeneous characteristics has become a research frontier; however, most research focuses on data mining tasks within a single data set, such as network embedding and personalized recommendation. Some studies have made preliminary explorations of associated entity identification in multimode or heterogeneous networks without a priori associated entities. In the field of social networks, Zhang et al. propose an unsupervised heterogeneous network associated entity identification method for two types of heterogeneous entities, users and locations. In the traffic field, Nassar et al. propose an ISORank-based multimode homogeneous network associated entity identification method. In the field of bioinformatics, Gu et al. extend homogeneous network associated entity identification methods to heterogeneous networks using graph coloring methods. In the field of electronic commerce, Zhu et al. use Graph Summarization and other methods to identify heterogeneous entities such as manufacturers and commodities. In the knowledge base field, Shen et al. regard the multimode heterogeneous information network as a domain knowledge base and explore the problem of entity linking between unstructured domain text and the domain knowledge base. Multimode heterogeneous associated entities have attracted the attention of researchers in many fields; however, most existing research is still designed for either multimode or heterogeneous networks. Research on multimode heterogeneous associated entity identification without a priori associated entities is scarce, and many studies ignore the massive entity characteristics of multimode heterogeneous networks.
Multimode heterogeneous associated entity identification is also similar or related to research on language translation in natural language processing, entity alignment in knowledge bases, database record linkage, entity matching, named recognition in information retrieval, social network associated user identification, bipartite graph matching, homogeneous network alignment in bioinformatics, and the like. However, these methods have limitations when applied to the identification of multimode heterogeneous information network associated entities, specifically:
First, modeling of multimode heterogeneous characteristics is absent. Most existing methods design associated entity identification models and methods for single-mode or homogeneous scenarios in specific fields; multimode and/or heterogeneous characteristics are not integrated into them, and their identification quality cannot meet the requirements of multimode heterogeneous information network associated entity identification for whole-process integrated application.
Second, the capacity to compute over massive entities is insufficient. Parallel and distributed algorithms in a big data environment remain a common problem of associated entity identification across fields. Many associated entity identification methods cannot process massive data and therefore cannot be applied directly to multimode heterogeneous information networks with massive entities.
Third, the dependence on a priori associated entities is strong. Most methods rely on a priori associated entities to construct supervised and semi-supervised associated entity identification models and methods, so the identification quality depends on the quality and quantity of the a priori associated entities. Moreover, a priori associated entities are difficult to label and require heavy manual work. This also limits the applicability of such methods to multimode heterogeneous information network associated entity identification.
In summary, associated entity identification research in the prior art focuses mainly on single-mode homogeneous environments, many methods require a priori associated entities, and only a few studies have begun to explore the multimode or heterogeneous characteristics in data. Identification of multimode heterogeneous information network associated entities oriented to multimode heterogeneous characteristics is an important trend in current associated entity identification research. In theory, its results can be generalized to existing single-mode and/or homogeneous environments, which makes the approach more general; in application, its results can be used not only for multimode heterogeneous information networks but also for data in other fields such as social networks, traffic networks, bioinformatics, electronic commerce systems and knowledge graphs.
At present, no multimode heterogeneous information network associated entity identification method oriented to multimode heterogeneous characteristics exists in the prior art.
Disclosure of Invention
The embodiment of the invention provides a multimode heterogeneous information network associated entity identification method oriented to multimode heterogeneous characteristics, which aims to overcome the problems in the prior art.
In order to achieve the purpose, the invention adopts the following technical scheme.
A multi-mode heterogeneous associated entity identification method based on cross-network representation learning comprises the following steps:
given two multimode heterogeneous information networks $G_A = (E_A, R_A, T_A, C_A)$ and $G_B = (E_B, R_B, T_B, C_B)$, where $E_A$ and $E_B$ are entity sets, $R_A$ and $R_B$ are entity relationship sets, $T_A$ and $T_B$ are entity type sets, and $C_A$ and $C_B$ are entity relationship type sets, taking two entities $E_{Ai} \in E_A$ and $E_{Bj} \in E_B$;
establishing, by an iterative method based on the set of random walk paths between entities $E_{Ai}$ and $E_{Bj}$, the multimode relation transition probability $M_{ij}$ between $E_{Ai}$ and $E_{Bj}$, and learning from the multimode relation transition probability $M_{ij}$ with an objective function to obtain the multimode heterogeneous feature vectors of entities $E_{Ai}$ and $E_{Bj}$;
judging, according to the multimode heterogeneous feature vectors of entities $E_{Ai}$ and $E_{Bj}$, whether the two entities have multimode heterogeneous consistency, and further judging whether the two entities have attribute consistency and environment consistency; when the two entities $E_{Ai}$ and $E_{Bj}$ have multimode heterogeneous consistency, attribute consistency and environment consistency, determining that $E_{Ai}$ and $E_{Bj}$ are associated entities.
Preferably, establishing the multimode relation transition probability $M_{ij}$ between entities $E_{Ai}$ and $E_{Bj}$ by an iterative method based on the set of random walk paths between them comprises:
assuming that the multimode heterogeneous information network contains relationships of $|C|$ different modes, the multimode relation transfer matrix is expressed by a $|C| \times |C|$ matrix $M$, where $M_{ij}$ represents the transition probability from relationship type $C_i$ to relationship type $C_j$ in the multimode heterogeneous network;
in a random walk, if the previous node $E_i$ transfers to the current node $E_j$ through relation $C_x$, the probability $p(E_k \mid E_i, E_j, C_x, M)$ of transferring to the next node $E_k$ is calculated as follows:
Figure BDA0002629425610000071
where $W_{ij}$ is the weight between entities $E_i$ and $E_j$, $C_{ij}$ is the type of the relation $(E_i, E_j)$, $N_i$ is the set of neighbor nodes of entity $E_i$, and $W_{ij} = |N_i \cap N_j| / |N_i \cup N_j|$; if $d_{ij}$ is the distance between entities $E_i$ and $E_j$, then:
Figure BDA0002629425610000072
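The weight $W_{ij}$ defined above is the Jaccard coefficient of the two neighbor sets $N_i$ and $N_j$. A minimal Python sketch follows; the function name `jaccard_weight` is hypothetical, and returning 0 when both neighbor sets are empty is an assumption, since the text does not cover that case.

```python
def jaccard_weight(neighbors_i, neighbors_j):
    """W_ij = |N_i ∩ N_j| / |N_i ∪ N_j| for two neighbor collections."""
    n_i, n_j = set(neighbors_i), set(neighbors_j)
    union = n_i | n_j
    if not union:
        # Assumption: two isolated entities share no evidence, so weight 0.
        return 0.0
    return len(n_i & n_j) / len(union)
```

For example, neighbor sets `{"a", "b", "c"}` and `{"b", "c", "d"}` share two of four distinct neighbors, giving a weight of 0.5.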
acquiring a set of random walk paths $P = \{P_1, P_2, P_3, \ldots\}$ by random walks according to formula (2), together with the corresponding multimode transition paths $T = \{T_1, T_2, T_3, \ldots\}$, where
Figure 1
Figure BDA0002629425610000074
a |P|-dimensional vector e_i is used to represent the features of relation type C_i in the random walk set P, where e_ij is the number of occurrences of C_i in P_j;
the similarity between relation types C_i and C_j is calculated according to the Pearson correlation coefficient, i.e.
Figure BDA0002629425610000075
the multi-mode relation transition probability is updated using a Sigmoid function:
Figure BDA0002629425610000076
initially, the matrix M is set to an all-ones matrix or a random matrix; the random walk path set P is acquired according to M using formula (2), and M is updated according to formula (5); this process is iterated until M converges, which completes the construction of the multi-mode relation transition matrix M.
Preferably, learning with an objective function through the multi-mode relation transition matrix M to obtain the multi-mode heterogeneous feature vectors of the entities E_Ai and E_Bj comprises:
taking the entities E_Ai and E_Bj each as a node, and establishing a cross-network distributed representation learning model and algorithm with reference to the Skip-Gram model in Word2Vec, wherein, in cross-network distributed representation learning oriented to multi-mode heterogeneous features, the target optimization function of the Skip-Gram model is set as:
Figure BDA0002629425610000077
where θ is the parameter to be solved and N_t(v) is the set of context nodes of type t among the neighboring nodes of node v; if V_t is the set of nodes of type t in the two networks, then:
Figure BDA0002629425610000081
where X_v is the multi-mode heterogeneous feature vector of node v;
the multi-mode heterogeneous feature vectors X_Ai and X_Bj of the entities E_Ai and E_Bj are obtained by solving formula (10).
Preferably, judging whether the two entities E_Ai and E_Bj have multi-mode heterogeneous consistency according to the multi-mode heterogeneous feature vectors of the entities E_Ai and E_Bj comprises:
judging whether the type T_Ai of entity E_Ai and the type T_Bj of entity E_Bj are the same; if so, the type relation identification degree H_ij between the two entities E_Ai and E_Bj equals 1; otherwise, the type relation identification degree between the two entities equals 0;
\[ H_{ij} = \begin{cases} 1, & T_{Ai} = T_{Bj} \\ 0, & T_{Ai} \neq T_{Bj} \end{cases} \]
when the two entities E_Ai and E_Bj are of the same type, the multi-mode heterogeneous similarity R_ij between them is calculated as:
Figure BDA0002629425610000083
where X_Ai and X_Bj are the multi-mode heterogeneous feature vectors of the two entities E_Ai and E_Bj obtained by the solution, and the values R_ij form the multi-mode heterogeneous feature similarity matrix R between the entity sets E_A and E_B.
Preferably, judging whether the two entities E_Ai and E_Bj have attribute consistency comprises:
the attributes of the entities E_Ai and E_Bj comprise text attributes and geometric attributes, wherein the text attributes are short texts; a semantic feature vector model of the entity attributes is established by analysis with a short-text word vector method, and the text attribute feature similarity between the entities E_Ai and E_Bj is calculated by cosine similarity or Euclidean distance;
the text attribute feature similarity and the geometric attribute feature similarity between the entities E_Ai and E_Bj are fused to form the attribute consistency feature similarity P_ij between them, and all P_ij form the attribute consistency feature similarity matrix P between the entity sets E_A and E_B.
Preferably, judging whether the two entities E_Ai and E_Bj have environment consistency comprises:
if Z is the set of associated entities in
Figure BDA0002629425610000091
and
Figure BDA0002629425610000092
the environment consistency feature similarity Y_ij between the entities E_Ai and E_Bj is calculated as:
Figure BDA0002629425610000093
where I_Ai = N_Ai ∩ Z and I_Bj = N_Bj ∩ Z; in the initial stage,
Figure BDA0002629425610000094
as the iterative process continues, Z contains more and more associated entities; all Y_ij form the environment consistency feature similarity matrix Y between the entity sets E_A and E_B.
Preferably, determining that E_Ai and E_Bj are associated entities when the two entities have multi-mode heterogeneous consistency, attribute consistency and environment consistency comprises:
combining the type relation identification degree H_ij, the multi-mode heterogeneous similarity R_ij, the environment consistency feature similarity Y_ij and the attribute consistency feature similarity P_ij between the entities E_Ai and E_Bj to obtain the similarity value S_ij between them:
S_ij = sim(E_Ai, E_Bj) = H_ij · R_ij · Y_ij · P_ij
the similarity values between all entities in E_A and E_B form the similarity matrix S between E_A and E_B; the unassociated entity pair E_Ai and E_Bj with the largest similarity value in S is selected as associated entities, provided that S_ij > τ, where τ is a set similarity threshold;
when a new associated entity set ΔZ is identified, the associated entity set Z is updated as Z = Z ∪ ΔZ, Y and S are updated, and new associated entities are identified again; when no associated entity meeting the requirement can be identified, the iteration ends and the identified associated entity set Z is output.
As can be seen from the technical solutions provided by the embodiments of the present invention, the embodiments start from the important requirements of full-process integrated application and full-life-cycle data sharing of multi-mode heterogeneous information networks, and take the identification of associated entities in multi-mode heterogeneous information networks with massive entities as the research target. On the basis of a full analysis of the multi-mode heterogeneous characteristics of such networks, the embodiments mainly study a formal description method for complex multi-mode heterogeneous information networks, a domain-independent distributed representation learning model and method across multi-mode heterogeneous networks, a parallel computing method for multi-mode heterogeneous information networks, and an associated entity identification model and algorithm that combines attribute features with multi-mode heterogeneous features, with experimental verification on massive data.
Additional aspects and advantages of the invention will be set forth in part in the description which follows, and in part will be obvious from the description, or may be learned by practice of the invention.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings needed in the description of the embodiments are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a schematic diagram illustrating identification of an associated entity of a multi-mode heterogeneous information network in the prior art;
Fig. 2 is a general implementation framework structure diagram of a multi-mode heterogeneous information network associated entity identification method oriented to multi-mode heterogeneous characteristics according to an embodiment of the present invention;
Fig. 3 is a framework diagram of a cross-network node multi-mode relationship feature representation learning method according to an embodiment of the present invention;
Fig. 4 is a schematic diagram of a random walk that considers multi-mode relationships according to an embodiment of the present invention;
Fig. 5 is a schematic diagram of a geometric attribute similarity calculation process according to an embodiment of the present invention;
Fig. 6 is a flowchart illustrating iterative associated entity identification according to an embodiment of the present invention.
Detailed Description
Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like reference numerals refer to the same or similar elements or elements having the same or similar function throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.
As used herein, the singular forms "a", "an" and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It will be further understood that the terms "comprises" and/or "comprising," when used in this specification, specify the presence of stated features, integers, steps, operations, elements, and/or components, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. It will be understood that when an element is referred to as being "connected" or "coupled" to another element, it can be directly connected or coupled to the other element or intervening elements may also be present. Further, "connected" or "coupled" as used herein may include wirelessly connected or coupled. As used herein, the term "and/or" includes any and all combinations of one or more of the associated listed items.
It will be understood by those skilled in the art that, unless otherwise defined, all terms (including technical and scientific terms) used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this invention belongs. It will be further understood that terms, such as those defined in commonly used dictionaries, should be interpreted as having a meaning that is consistent with their meaning in the context of the prior art and will not be interpreted in an idealized or overly formal sense unless expressly so defined herein.
For the convenience of understanding the embodiments of the present invention, the following description will be further explained by taking several specific embodiments as examples in conjunction with the drawings, and the embodiments are not to be construed as limiting the embodiments of the present invention.
The embodiments of the invention address the construction industry's urgent need for whole-process integrated application of multi-mode heterogeneous information networks and whole-life-cycle data sharing in construction projects. Drawing on the frontier theory of network representation learning, and taking the coupling of key technologies from computer science and from architecture and civil engineering as the means, they establish a model and method for identifying associated entities in multi-mode heterogeneous information networks based on cross-network representation learning.
The invention jointly considers the text and geometric attribute features, the multi-mode heterogeneous features and the massive number of entities of a multi-mode heterogeneous information network, and uses the theory and methods of network representation learning to study such a model and method. The overall implementation framework of the associated entity identification method oriented to multi-mode heterogeneous features is shown in Fig. 2.
The method first studies a formal description method for complex multi-mode heterogeneous information networks, converting the multi-mode heterogeneous information network into a multi-mode heterogeneous network from the perspective of complex networks; this provides the model basis for associated entity identification, parallel computing and related methods. To handle multi-mode heterogeneous features and massive entities, a multi-mode relation transition model, a cross-network random walk model and a word-vector-based cross-network distributed representation learning model are established, so that the multi-mode heterogeneous features of nodes in different networks are embedded as low-dimensional continuous vectors in the same space, laying the foundation for computing multi-mode heterogeneous consistency. For massive entities, a multi-mode heterogeneous consistency model, an environment consistency model and an attribute consistency model are established so that the attribute features and the multi-mode heterogeneous features of the network are considered together; this improves the quality of associated entity identification and ensures the applicability of the identification model. Finally, extensive experimental verification is carried out with actual engineering data to ensure that the results can serve whole-process integrated application and whole-life-cycle data sharing of multi-mode heterogeneous information networks.
(1) Multi-mode heterogeneous feature analysis and formalization description method of multi-mode heterogeneous information network based on IFC
The invention intends to analyze, in combination with the IFC data standard, the multi-mode heterogeneous characteristics of the multi-mode heterogeneous information network from two aspects: entity attribute features and relationship features. For attribute features, literature review and inductive summarization are used to summarize the attribute features shared by entities and those specific to each entity, laying the foundation for the subsequent extraction of attribute feature vectors. For relationship features, entity relationship graphs under different modes are established after summarizing the types and characteristics of existing relationship modes; data analysis is then applied to a large number of actual engineering multi-mode heterogeneous information networks to analyze the structural characteristics and similarities of each mode's relationship graph, including density, degree distribution and radius, providing the necessary support for the theoretical analysis and improvement of subsequent algorithms.
Then, based on these results and drawing on the formal description of social networks in complex network theory, a formal description method for the multi-mode heterogeneous information network is studied. In general, a multi-mode heterogeneous information network
Figure BDA0002629425610000111
may be composed of entities and entity relationships,
Figure 2
where E is the set of entities in
Figure BDA0002629425610000113
R is the entity relationship set, T is the entity type set, and C is the entity relationship type set. Any entity E_i in the multi-mode heterogeneous information network carries the attribute features of that entity; the specific attribute features follow the IFC data standard. Between any two entities E_i and E_j there may be relationships of several different modes, and the invention uses R_ij to denote the set of all relationships between E_i and E_j. Any entity relationship R_ijk ∈ R_ij can be described as R_ijk = {E_i, E_j, C_k}, meaning that E_i depends on E_j under the relationship C_k ∈ C. Thus,
Figure BDA0002629425610000121
can be used to describe all entities on which E_i depends under the relationship C_k. The type of entity E_i is denoted by T_i or T(E_i).
After the formal description, the invention converts the multi-mode heterogeneous information network model into a multi-mode heterogeneous network. Entities are then also called nodes, and relationships are also called edges. The invention uses |·| to denote the number of elements in a set. When |R_ij| ≤ 1, the multi-mode heterogeneous information network degenerates to a heterogeneous information network; when |T| = 1, it degenerates to a homogeneous network. The research content of the invention is therefore more general than homogeneous and/or single-mode information networks. On this basis, the formal description method is further developed to provide a basic mathematical model for the subsequent associated entity identification model, parallel computing algorithms and the like, and to lay a model basis for other research on multi-mode heterogeneous information networks.
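The formal description above can be illustrated with a small data structure. This is only a sketch under stated assumptions: the class name `MultiModeNetwork`, its methods, and the relation names "fills" and "connects" are hypothetical, and the single-mode check reads the degenerate case as "at most one relation between any ordered pair of entities".

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class MultiModeNetwork:
    """Entities with types, plus typed relations R_ijk = (E_i, E_j, C_k)."""
    entity_type: dict = field(default_factory=dict)  # E_i -> T_i
    relations: set = field(default_factory=set)      # {(E_i, E_j, C_k)}

    def add_relation(self, ei, ej, ck):
        """Record that E_i depends on E_j under relation type C_k."""
        self.relations.add((ei, ej, ck))

    def relation_types(self, ei, ej):
        """Relation types of R_ij, the set of all relations from E_i to E_j."""
        return {c for (a, b, c) in self.relations if a == ei and b == ej}

    def dependencies(self, ei, ck):
        """All entities on which E_i depends under relation type C_k."""
        return {b for (a, b, c) in self.relations if a == ei and c == ck}

    def is_single_mode(self):
        """True when no ordered entity pair carries more than one relation."""
        counts = Counter((a, b) for (a, b, _) in self.relations)
        return all(n <= 1 for n in counts.values())


# Hypothetical example: a door related to a wall in two different modes.
net = MultiModeNetwork(entity_type={"wall": "IfcWall", "door": "IfcDoor"})
net.add_relation("door", "wall", "fills")
net.add_relation("door", "wall", "connects")
```

Storing relations as triples keeps the multi-mode case (several relation types per entity pair) and the single-mode case in one structure.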
(2) Domain-independent cross-multimode heterogeneous network distributed representation learning method
Fig. 3 is a framework diagram of a cross-network node multi-mode relationship feature representation learning method according to an embodiment of the present invention. Cross-network representation learning aims to embed the network features of nodes from different networks into the same low-dimensional continuous space, and is one of the effective methods for calculating the similarity of node network structures across networks. The mainstream approach to heterogeneous network distributed representation learning, exemplified by metapath2vec, introduces meta-paths to extend homogeneous network distributed representation learning methods (such as DeepWalk, LINE and node2vec) to heterogeneous networks. However, meta-path-based methods require sufficient domain knowledge to design reasonable meta-paths, so they lack universality; they also consider only heterogeneous nodes and do not fully consider multi-mode relationships. Furthermore, as the number of node types and mode types in the network increases, the design of meta-paths becomes extremely complex.
Some studies explore cross-network distributed representation learning given a certain set of associated nodes; however, such methods often require a certain number of associated nodes and are not applicable to multi-mode heterogeneous networks. Considering the massive number of entities in multi-mode heterogeneous information networks, the invention intends to study a domain-independent cross-multi-mode heterogeneous network distributed representation learning method based on the word vector model, laying the foundation for an associated entity identification model under multi-mode heterogeneous networks. As shown in Fig. 4, the cross-network representation learning model contains three parts: a multi-mode relation transition model, a cross-network random walk model, and a word-vector-based cross-network distributed representation learning model.
① Multi-mode relation transition model
The multi-mode relation transition model aims to establish the multi-mode relation transition probabilities in a multi-mode heterogeneous network, thereby overcoming the dependence on domain knowledge and the lack of universality of existing meta-path-based methods. Given |C| relations of different modes in a multi-mode heterogeneous information network, the multi-mode relation transition matrix can be represented by a |C| × |C| matrix M, where M_ij represents the transition probability from relation type C_i to C_j in the multi-mode heterogeneous network.
Fig. 4 is a schematic diagram of a random walk that considers multi-mode relationships according to an embodiment of the present invention. In a random walk, if the walk moved from the previous node E_i to the current node E_j through the relation C_x (as shown in Fig. 4), the probability of moving to the next node E_k is:
Figure BDA0002629425610000131
where W_ij is the weight between entities E_i and E_j, C_ij is the type of the relation (E_i, E_j), and N_i is the set of neighbor nodes of entity E_i. W_ij can be set according to actual conditions; the invention calculates it as W_ij = |N_i ∩ N_j| / |N_i ∪ N_j|. If d_ij is the distance between entities E_i and E_j, then:
Figure BDA0002629425610000132
Formula (2) considers not only the weight relationship between nodes but also the transition probabilities between multi-mode relationships, which facilitates embedding multi-mode relationship features into low-dimensional continuous vectors.
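As an illustration of how the neighbor-overlap weight and the relation transition matrix can combine in one walk step, the following sketch computes W_ij as the Jaccard index of two neighbor sets and then scores each candidate next node. The exact form of formula (2), including its distance term, is not reproduced here; the unnormalized score W_jk · M[C_x][C_jk] and the uniform fallback are assumptions made for illustration, and the function and variable names are hypothetical.

```python
def neighbor_weight(neighbors, ei, ej):
    """W_ij = |N_i ∩ N_j| / |N_i ∪ N_j| (Jaccard index of neighbor sets)."""
    ni, nj = neighbors[ei], neighbors[ej]
    union = ni | nj
    return len(ni & nj) / len(union) if union else 0.0


def next_step_probs(neighbors, edge_type, M, ej, cx):
    """Probability of each neighbor E_k of E_j, having arrived via relation C_x.

    Assumed score (not the patent's exact formula): W_jk * M[cx][C_jk].
    """
    scores = {
        ek: neighbor_weight(neighbors, ej, ek) * M[cx][edge_type[(ej, ek)]]
        for ek in neighbors[ej]
    }
    total = sum(scores.values())
    if total == 0:  # no informative score: fall back to a uniform choice
        return {ek: 1.0 / len(scores) for ek in scores}
    return {ek: s / total for ek, s in scores.items()}


# Toy network: node "a" has neighbors b, c, d; relation types are indices 0 and 1.
neighbors = {"a": {"b", "c", "d"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
edge_type = {("a", "b"): 0, ("a", "c"): 1, ("a", "d"): 0}
M = [[0.5, 0.5], [0.5, 0.5]]

w_ab = neighbor_weight(neighbors, "a", "b")
probs = next_step_probs(neighbors, edge_type, M, ej="a", cx=0)
```

In this toy network, node "d" shares no neighbors with "a", so its weight is zero and the walk never moves there from "a"; a practical implementation may want a smoothing term to avoid such dead edges.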
Given the matrix M, a set of random walk paths P = {P_1, P_2, P_3, …} and the corresponding multi-mode transition paths T = {T_1, T_2, T_3, …} can be obtained by random walks according to formula (2), where
Figure BDA0002629425610000133
Figure BDA0002629425610000134
A |P|-dimensional vector e_i can then be used to represent the features of relation type C_i in the random walk set P, where e_ij is the number of occurrences of C_i in P_j.
On this basis, the invention intends to calculate the similarity between relation types C_i and C_j according to the Pearson correlation coefficient, i.e.
Figure BDA0002629425610000135
Then the multi-mode relation transition matrix is updated using a Sigmoid function:
Figure BDA0002629425610000136
Initially, the matrix M may be set to an all-ones matrix or a random matrix. The random walk path set P is obtained according to M using formula (2), and M is then updated according to formula (5). This process is iterated until M converges, which completes the construction of the multi-mode relation transition matrix.
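One update of M from a batch of sampled walks might look like the sketch below. It counts how often each relation type occurs in each transition path, takes the Pearson correlation between those count vectors, and passes the result through a sigmoid. The exact form of formula (5) is not reproduced in this text, so applying the sigmoid directly to the correlation, without any further normalization, is an assumption; the function names are hypothetical.

```python
import numpy as np


def relation_counts(transition_paths, n_types):
    """e[i, j] = number of times relation type C_i occurs in path j."""
    e = np.zeros((n_types, len(transition_paths)))
    for j, path in enumerate(transition_paths):
        for c in path:
            e[c, j] += 1
    return e


def update_transition_matrix(transition_paths, n_types):
    """Assumed update: M_ij = sigmoid(Pearson(e_i, e_j))."""
    e = relation_counts(transition_paths, n_types)
    corr = np.corrcoef(e)        # Pearson correlation between rows of e
    corr = np.nan_to_num(corr)   # a constant row yields NaN; treat it as 0
    return 1.0 / (1.0 + np.exp(-corr))


# Four toy transition paths over three relation types (indices 0, 1, 2).
T = [[0, 1, 0], [1, 1, 2], [0, 0, 1], [2, 2, 1]]
M = update_transition_matrix(T, n_types=3)
```

The outer loop described above would alternate between sampling walks with the current M and calling `update_transition_matrix`, stopping once the change in M falls below a chosen tolerance.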
On this basis, the convergence of the iterative process for M is demonstrated theoretically and a corresponding algorithm is formed.
② Cross-network random walk model
The multi-mode relation transition model solves the problem of random walks within a single model when multi-mode relationships are considered. The cross-network random walk model strings the nodes and relations of different networks together on one path, which is the key to mapping the node relationship features of different networks into the same low-dimensional continuous space.
Given two multi-mode heterogeneous information networks
Figure BDA0002629425610000141
and
Figure BDA0002629425610000142
and two entities E_Ai ∈ E_A and E_Bj ∈ E_B in them, the invention intends to define their structural similarity as follows:
Figure BDA0002629425610000143
if |N_Ai| denotes the number of neighbors of E_Ai, and |E_A| and |R_A| denote, respectively, the number of entities and the number of relationships in the model
Figure BDA0002629425610000144
then
Figure BDA0002629425610000145
In the initial state, the cross-network random walk model randomly selects a node E_Ai in one multi-mode heterogeneous network as the initial node of the random walk. A cross-network random walk path is then formed by the following rules:
a. draw a random probability; if it is smaller than a specified threshold ε, continue walking in the current multi-mode heterogeneous network; otherwise, switch to the other multi-mode heterogeneous network model;
b. when the walk stays in the current multi-mode heterogeneous network, select the next node with the probability given by formula (2);
c. when the walk switches to the other multi-mode heterogeneous network, if the current node has a node with a known association, that associated node is the next node of the random walk; otherwise, the probability of walking from E_Ai to the next node E_Bj is:
Figure BDA0002629425610000146
where h(E_Ai, E_Bj) is the attribute similarity between E_Ai and E_Bj, calculated as shown in formula (16).
By the above method, a set of sample paths S can be formed and used for distributed representation learning across network nodes.
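Rules a to c can be sketched as a single step function. The within-network sampler is passed in as a callback standing in for formula (2), and, because the switching probability formula is not reproduced in this text, the sketch simply assumes that an unlinked node jumps to a node of the other network with probability proportional to the attribute similarity h. All names (`cross_network_step`, `step_within`, `known_links`, `nodes_by_side`, `attr_sim`) are hypothetical.

```python
import random


def cross_network_step(current, side, epsilon, step_within, known_links,
                       nodes_by_side, attr_sim, rng):
    """Return (next_node, next_side) for one step of a cross-network walk."""
    # Rule a: with probability epsilon the walk stays in the current network.
    if rng.random() < epsilon:
        # Rule b: delegate to the within-network sampler (formula (2)).
        return step_within(current, side), side
    other = "B" if side == "A" else "A"
    # Rule c: follow a known association when one exists.
    if current in known_links:
        return known_links[current], other
    # Otherwise jump with probability proportional to attribute similarity
    # (an assumption made for this sketch).
    candidates = nodes_by_side[other]
    weights = [attr_sim(current, n) for n in candidates]
    if sum(weights) <= 0:
        return rng.choice(candidates), other
    return rng.choices(candidates, weights=weights, k=1)[0], other


rng = random.Random(0)
nodes = {"A": ["a1", "a2"], "B": ["b1", "b2"]}
links = {"a1": "b1"}


def same_suffix(x, y):
    """Toy attribute similarity: 1 when the numeric suffixes match, else 0."""
    return 1.0 if x[1:] == y[1:] else 0.0


stay = cross_network_step("a1", "A", 1.0, lambda n, s: "a2", links, nodes, same_suffix, rng)
jump = cross_network_step("a1", "A", 0.0, lambda n, s: "a2", links, nodes, same_suffix, rng)
guess = cross_network_step("a2", "A", 0.0, lambda n, s: "a1", links, nodes, same_suffix, rng)
```

Repeating this step for a fixed path length, from many starting nodes, would yield the sample path set used for representation learning.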
③ Distributed representation learning model and algorithm for node multi-mode relationship features
The word vector model (Word2Vec) learns from text to represent the semantic information of words as word vectors; that is, semantically similar words lie close together in the embedding space. Considering the massive number of entities in multi-mode heterogeneous information networks, the invention intends to draw on the Skip-Gram model in Word2Vec to establish a cross-network distributed representation learning model and algorithm. In a single homogeneous network (all nodes are of the same type and all relations are of the same type, i.e., |T| = 1 and |C| = 1), the target optimization function of the Skip-Gram model is:
Figure BDA0002629425610000147
where θ is the parameter to be solved.
Considering the multimode heterogeneous characteristics of the network, the formula (9) can be extended to the learning of the cross-network distributed representation oriented to the multimode heterogeneous characteristics, and the objective optimization function can be converted into:
Figure BDA0002629425610000151
where N_t(v) is the set of context nodes of type t among the neighboring nodes of node v. If V_t is the set of nodes of type t in the two networks, then
Figure BDA0002629425610000152
where X_v is the multi-mode heterogeneous feature vector of node v.
The multi-mode heterogeneous feature vectors X_Ai and X_Bj of entities E_Ai and E_Bj are obtained by solving formula (10) and are used for the subsequent calculation of multi-mode heterogeneous consistency. Formula (10) accounts for the multi-mode characteristics of the network through the multi-mode relation transition matrix M and for its heterogeneous characteristics through T. The feature vectors learned by formula (10) therefore embed the multi-mode heterogeneous features of the network.
Because the network contains a large number of nodes, solving formula (10) is computationally expensive. Negative sampling is adopted to reduce the complexity of model training, so the objective function can be converted into:
Figure BDA0002629425610000153
where σ(·) is the sigmoid function and NEG is the number of negatively sampled edges. X is then trained by stochastic gradient descent to obtain the multi-mode heterogeneous feature vector of each node. Many studies have verified that the negative-sampling-based Skip-Gram model is applicable to node feature representation learning on networks with tens of millions of nodes or more; the method can therefore be used to extract the multi-mode heterogeneous features of the massive entities of multi-mode heterogeneous information networks.
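For readers unfamiliar with negative sampling, the sketch below shows the generic Skip-Gram negative-sampling loss for one (node, context) pair and one stochastic gradient step on it. It is the standard Word2Vec formulation rather than the type-aware objective of this method; in the type-aware case the (node, context) pairs would be drawn per node type from the cross-network walks. The separate context matrix `C` and the function names are assumptions of this sketch.

```python
import numpy as np


def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))


def sgns_loss(X, C, v, pos, negs):
    """Negative-sampling loss for node v, context pos and negative samples negs."""
    return -(np.log(sigmoid(C[pos] @ X[v]))
             + np.sum(np.log(sigmoid(-C[negs] @ X[v]))))


def sgns_step(X, C, v, pos, negs, lr=0.05):
    """One in-place SGD step on sgns_loss (negs are assumed to be distinct)."""
    xv = X[v].copy()
    g_pos = sigmoid(C[pos] @ xv) - 1.0   # d loss / d (C[pos] . X[v])
    g_neg = sigmoid(C[negs] @ xv)        # d loss / d (C[neg] . X[v])
    X[v] -= lr * (g_pos * C[pos] + g_neg @ C[negs])
    C[pos] -= lr * g_pos * xv
    C[negs] -= lr * np.outer(g_neg, xv)


rng = np.random.default_rng(0)
X = rng.normal(scale=0.1, size=(5, 4))   # node vectors X_v
C = rng.normal(scale=0.1, size=(5, 4))   # context vectors
before = sgns_loss(X, C, v=0, pos=1, negs=[2, 3])
sgns_step(X, C, v=0, pos=1, negs=[2, 3])
after = sgns_loss(X, C, v=0, pos=1, negs=[2, 3])
```

Each step touches only the vectors of one node, one context and NEG negatives, which is why the cost per training pair does not grow with the size of the network.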
The invention aims to design a cross-network distributed representation learning algorithm according to the model, and theoretically discuss the complexity of the algorithm, the influence of the hyper-parameter on the model and the like.
(3) Associated entity identification model and method integrating attribute features and relationship features
To improve the quality of associated entity identification without prior knowledge, the invention takes the view that an entity depends on its surrounding "environment" and can be identified from that "environment". The basic idea of associated entity identification is therefore: if E_Ai and E_Bj are associated entities, i.e. E_Ai = E_Bj, then E_Ai and E_Bj should satisfy the following conditions:
a. Multi-mode heterogeneous consistency. E_Ai and E_Bj are entities of the same type or of the same inherited type, and E_Ai and E_Bj have similar multi-mode heterogeneous features;
b. Attribute consistency. E_Ai and E_Bj should have similar text and geometric attribute features;
c. Environment consistency. E_Ai and E_Bj have a similar "environment"; that is, most of the entities in N_Ai and N_Bj are also associated entities.
Let the |E_A| × |E_B| matrix S represent the similarity matrix of the entities of M_A and M_B. When the types of two entities E_Ai and E_Bj are different, the similarity of the two entities is directly set to 0, i.e. S_ij = 0; in this case there is no need to compute the multi-mode heterogeneous consistency, environment consistency and attribute consistency of E_Ai and E_Bj. If the |E_A| × |E_B| matrix H represents the type relation matrix of M_A and M_B, then
\[ H_{ij} = \begin{cases} 1, & T_{Ai} = T_{Bj} \\ 0, & T_{Ai} \neq T_{Bj} \end{cases} \]
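A minimal sketch of building the type relation matrix H with NumPy broadcasting follows. It handles exact type equality only; treating inherited types as matching would need a type hierarchy that is not modeled here. The function name and the example type lists are hypothetical.

```python
import numpy as np


def type_relation_matrix(types_a, types_b):
    """H[i, j] = 1.0 when entity i of network A and entity j of B share a type."""
    a = np.asarray(types_a)[:, None]   # column of A-side types
    b = np.asarray(types_b)[None, :]   # row of B-side types
    return (a == b).astype(float)


H = type_relation_matrix(["IfcWall", "IfcDoor"],
                         ["IfcDoor", "IfcWall", "IfcSlab"])
```

Because H multiplies the other similarity terms, any pair with H[i, j] = 0 can be skipped when the more expensive similarities are computed.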
① Multi-mode heterogeneous consistency model
After the multi-mode heterogeneous features of the nodes of two different multi-mode heterogeneous networks are embedded as low-dimensional continuous vectors in the same space, cosine similarity can be used to calculate the similarity between the feature vectors X_Ai and X_Bj of two nodes E_Ai and E_Bj, forming the multi-mode heterogeneous consistency model. That is,
Figure BDA0002629425610000162
where the |E_A| × |E_B| matrix R is the multi-mode heterogeneous feature similarity matrix of the entities of
Figure BDA0002629425610000163
and
Figure BDA0002629425610000164
Here X_Ai and X_Bj are the multi-mode heterogeneous feature vectors of the two entities E_Ai and E_Bj obtained by the solution described above, and the values R_ij form the multi-mode heterogeneous feature similarity matrix R between the entity sets E_A and E_B.
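The full matrix R can be computed in one pass by normalizing the rows of the two embedding matrices and taking a matrix product. This is a plain cosine-similarity sketch; the function name and the small `eps` guard against zero vectors are additions for illustration.

```python
import numpy as np


def cosine_similarity_matrix(XA, XB, eps=1e-12):
    """R[i, j] = cosine similarity between row i of XA and row j of XB."""
    XA = XA / np.maximum(np.linalg.norm(XA, axis=1, keepdims=True), eps)
    XB = XB / np.maximum(np.linalg.norm(XB, axis=1, keepdims=True), eps)
    return XA @ XB.T


XA = np.array([[1.0, 0.0], [1.0, 1.0]])   # embeddings of entities in network A
XB = np.array([[2.0, 0.0], [0.0, 3.0]])   # embeddings of entities in network B
R = cosine_similarity_matrix(XA, XB)
```

For very large entity sets the product would normally be computed in blocks, or only for pairs whose types match.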
② Environment consistency model
If Z is the set of associated entities in the two multi-mode heterogeneous networks, the environment consistency of two nodes E_Ai and E_Bj can be calculated using the Jaccard similarity, namely:
Figure BDA0002629425610000165
where I_Ai = N_Ai ∩ Z and I_Bj = N_Bj ∩ Z. Without prior associated entities, initially,
Figure BDA0002629425610000166
The invention designs an iterative algorithm to mine the associated entities; thus, as the iterative process continues, Z contains more and more associated entities. All Y_ij form the environment consistency feature similarity matrix Y between the entity sets E_A and E_B.
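One way to read the environment consistency measure is sketched below. Z is held as a set of (A-entity, B-entity) pairs; the already-associated neighbors of E_Ai are mapped into network B through Z and compared, by Jaccard similarity, with the already-associated neighbors of E_Bj. The mapping step and the value returned when neither entity has an associated neighbor (`empty_value`) are assumptions of this sketch, since the corresponding formulas are not reproduced in this text.

```python
def environment_similarity(n_ai, n_bj, Z, empty_value=1.0):
    """Jaccard-style Y_ij over neighbors that are already associated.

    n_ai, n_bj: neighbor sets of E_Ai and E_Bj.
    Z: set of (a, b) pairs already identified as associated entities.
    """
    a_to_b = dict(Z)
    # I_Ai, expressed through the B-side partners of its associated neighbors.
    i_ai = {a_to_b[a] for a in n_ai if a in a_to_b}
    # I_Bj: neighbors of E_Bj that take part in some association.
    i_bj = n_bj & set(a_to_b.values())
    union = i_ai | i_bj
    if not union:
        return empty_value   # assumed initial value when Z gives no evidence
    return len(i_ai & i_bj) / len(union)


Z = {("a1", "b1"), ("a2", "b2")}
y = environment_similarity({"a1", "a2", "a3"}, {"b1", "b3"}, Z)
y_initial = environment_similarity({"a1", "a2"}, {"b1"}, set())
```

As Z grows during the iteration, only the pairs whose neighborhoods contain a newly associated entity need their Y value recomputed.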
③ Attribute consistency model
The attributes of a multi-mode heterogeneous information network take two forms: text attributes and geometric attributes. The method establishes a similarity model for each.
a. Text attribute feature model. The text attributes of multi-mode heterogeneous information network entities are mostly short texts. The method uses a short-text word vector method to analyze and establish a semantic feature vector model of entity attributes; cosine similarity or Euclidean distance is then used to calculate the text attribute similarity between entities E_Ai and E_Bj, forming an n_A × n_B attribute feature similarity matrix P_P.
b. Geometric attribute feature model. IFC supports a number of different geometric model types. Specifically, IFC uses Curve2D, GeometricSet and GeometricCurveSet to describe models composed of basic primitives such as points, lines and surfaces, uses a surface model to describe surfaces, and uses SolidModel to describe solid models; SolidModel can be further subdivided into SweptSolid, Brep, CSG, Clipping, AdvancedSweptSolid and other types. The many types and complex references in IFC geometric descriptions pose a great challenge to computing the geometric attribute similarity of multi-mode heterogeneous information networks.
Fig. 5 is a schematic diagram of the geometric attribute similarity calculation process provided in an embodiment of the present invention. First, the invention intends to make full use of earlier results on lightweight visualization of multi-mode heterogeneous information networks and convert each geometric model type into Brep; Brep is then converted into a Delaunay triangulated network, on which the similarity calculation is performed. For the similarity of triangulated networks, the invention uses shape distribution similarity. On this basis, the similarity matrix P_G of the geometric attributes of all entities in the two multi-mode heterogeneous information networks is finally formed.
In identifying associated entities across multimode heterogeneous information networks, two entities do not need a high similarity on every attribute to be associated; when the similarity of the text attribute and/or the geometric attribute is high, the two entities are associated entities with a certain probability. Therefore, the invention uses a Logit regression model to fuse the text attribute and geometric attribute similarities into the entity attribute similarity matrix P of the multimode heterogeneous information network, as follows:
Figure BDA0002629425610000171
All $P_{ij}$ together compose the attribute consistency feature similarity matrix P between the entity sets $E_A$ and $E_B$.
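A logistic fusion of the two similarity matrices might look like the sketch below. The exact regression formula appears only as an image in the source, so the functional form (a sigmoid of a weighted sum) and the coefficient values `w_text`, `w_geom` and `bias` are assumptions made purely for illustration; in practice the coefficients would be fitted on labeled entity pairs.

```python
import numpy as np

def fuse_attribute_similarity(p_text, p_geom, w_text=4.0, w_geom=4.0, bias=-3.0):
    """Fuse text and geometric similarity matrices with a logistic model.

    p_text, p_geom: arrays of identical shape (n_A, n_B) with values in [0, 1].
    w_text, w_geom, bias: hypothetical coefficients, not values from the patent.
    """
    z = bias + w_text * np.asarray(p_text, dtype=float) + w_geom * np.asarray(p_geom, dtype=float)
    return 1.0 / (1.0 + np.exp(-z))  # element-wise probability of association
```

With positive weights, a high value on either attribute alone already raises the fused score, which matches the observation that associated entities need not be similar on all attributes.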
Associated entity identification method
To improve the accuracy of associated entity identification, the invention comprehensively models the multi-mode heterogeneous characteristics of the multimode heterogeneous information network, considers multi-mode heterogeneous consistency, attribute consistency and environment consistency together, and designs an iterative associated entity identification algorithm. Fig. 6 is a flowchart of the iterative associated entity identification. First, the H, R, Y and P matrices are calculated based on study contents (2) and (3) and on the multi-mode heterogeneous consistency, attribute consistency and environment consistency models. Then, for two entities $E_{Ai}$ and $E_{Bj}$, the similarity is calculated as:
$S_{ij} = \mathrm{sim}(E_{Ai}, E_{Bj}) = H_{ij} \cdot R_{ij} \cdot Y_{ij} \cdot P_{ij}$. (18)
The algorithm selects the unassociated entity pair $E_{Ai}$ and $E_{Bj}$ with the largest similarity value in S as associated entities, subject to $S_{ij} > \tau$, where $\tau$ is a preset similarity threshold. When a new associated entity pair $\Delta Z$ is identified, the associated entity set Z is updated as $Z = Z \cup \Delta Z$. Then Y and S are updated and new associated entities are identified again. When no associated entity meeting the requirement can be identified, the iteration ends and the identified associated entity set Z is output. In each iteration, $\Delta Z$ affects only part of Y; therefore not all Y and S values need to be updated in every iteration, which keeps the associated entity identification method efficient.
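The iteration described above can be sketched as a greedy loop. This is an illustrative outline under several assumptions: the environment consistency matrix Y is supplied by a caller-provided function `compute_y` (hypothetical, since its formula is given only as an image in the source), each entity is assumed to be associated at most once, and Y is recomputed in full at every step for simplicity rather than updated incrementally.

```python
import numpy as np

def identify_associated_entities(h, r, p, compute_y, tau):
    """Greedy iterative identification using S = H * R * Y * P (element-wise).

    h, r, p:   (n_A, n_B) arrays for type agreement, multi-mode heterogeneous
               similarity and attribute similarity.
    compute_y: hypothetical callable mapping the current list of associated
               index pairs Z to an (n_A, n_B) environment consistency array.
    tau:       similarity threshold.
    Returns the list Z of associated (i, j) index pairs.
    """
    n_a, n_b = h.shape
    z = []
    free_a = np.ones(n_a, dtype=bool)  # entities of A not yet associated
    free_b = np.ones(n_b, dtype=bool)  # entities of B not yet associated
    while free_a.any() and free_b.any():
        s = h * r * compute_y(z) * p
        # Mask out pairs that involve an already associated entity.
        s = np.where(free_a[:, None] & free_b[None, :], s, -np.inf)
        i, j = np.unravel_index(np.argmax(s), s.shape)
        if not s[i, j] > tau:
            break  # no remaining pair exceeds the threshold
        z.append((int(i), int(j)))
        free_a[i] = False
        free_b[j] = False
    return z
```

An incremental variant would recompute only the rows and columns of Y and S that involve neighbors of the newly associated pair, which is the efficiency point made in the text.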
On this basis, the invention designs the corresponding algorithm and theoretically discusses the algorithm complexity and the influence of the hyperparameters on the associated entity identification model.
In summary, the embodiments of the present invention start from the important requirements of whole-process integrated application and whole-life-cycle data sharing of multimode heterogeneous information networks, and take the identification of associated entities across such networks at massive entity scale as the research target. On the basis of a full analysis of the multi-mode heterogeneous characteristics of these networks, the work focuses on a formal description method for complex multimode heterogeneous information networks, a domain-independent model and method for learning distributed representations across multimode heterogeneous networks, and an associated entity identification model and algorithm that combines attribute features with multi-mode heterogeneous features, with experimental verification on massive data.
The invention forms a formal description method for multimode heterogeneous information networks and a multimode heterogeneous associated entity identification model and method based on cross-network representation learning. It enriches and improves the theories and methods of network representation learning and associated entity identification in the field of data mining, and of multimode heterogeneous information networks in the field of building informatization, and it promotes the cross-fusion of computer science with the building and civil engineering disciplines, so it has important theoretical value. The research results promote the whole-process integrated application and whole-life-cycle data sharing of multimode heterogeneous information networks, improve the big data application capability and management decision level of the building industry and its enterprises, serve the major national need for modernized transformation and upgrading of the building industry, and support big data construction and "whole-cycle management" for smart cities, smart infrastructures, smart people and the like, so they have great economic and social benefits.
Those of ordinary skill in the art will understand that: the figures are merely schematic representations of one embodiment, and the blocks or flow diagrams in the figures are not necessarily required to practice the present invention.
From the above description of the embodiments, it is clear to those skilled in the art that the present invention can be implemented by software plus a necessary general-purpose hardware platform. Based on this understanding, the technical solutions of the present invention may be embodied in the form of a software product, which may be stored in a storage medium, such as ROM/RAM, a magnetic disk or an optical disk, and which includes instructions for causing a computer device (which may be a personal computer, a server, a network device, etc.) to execute the method according to the embodiments or some parts of the embodiments.
The embodiments in the present specification are described in a progressive manner; for the parts that are the same or similar among the embodiments, the embodiments may be referred to each other, and each embodiment focuses on its differences from the others. In particular, the apparatus or system embodiments are described relatively briefly because they are substantially similar to the method embodiments; for the relevant details, refer to the corresponding parts of the method embodiment descriptions. The above-described apparatus and system embodiments are merely illustrative: the units described as separate parts may or may not be physically separate, and the parts displayed as units may or may not be physical units; they may be located in one place or distributed over a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
The above description is only for the preferred embodiment of the present invention, but the scope of the present invention is not limited thereto, and any changes or substitutions that can be easily conceived by those skilled in the art within the technical scope of the present invention are included in the scope of the present invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims (6)

1. A multi-mode heterogeneous associated entity identification method based on cross-network representation learning is characterized by comprising the following steps:
two multimode heterogeneous information networks:
Figure FDA0002932037570000011
and
Figure FDA0002932037570000012
where $E_A$ and $E_B$ are entity sets, $R_A$ and $R_B$ are entity relationship sets, $T_A$ and $T_B$ are entity type sets, and $C_A$ and $C_B$ are entity relationship type sets; let two entities $E_{Ai} \in E_A$ and $E_{Bj} \in E_B$;
based on the random walk path set between the entities $E_{Ai}$ and $E_{Bj}$, establishing by an iterative method the multi-mode relationship transition probability $M_{ij}$ between the entities $E_{Ai}$ and $E_{Bj}$, and learning with an objective function on the basis of the multi-mode relationship transition probability $M_{ij}$ to obtain the multi-mode heterogeneous feature vectors of the entities $E_{Ai}$ and $E_{Bj}$;
judging, according to the multi-mode heterogeneous feature vectors of the entities $E_{Ai}$ and $E_{Bj}$, whether the two entities $E_{Ai}$ and $E_{Bj}$ have multi-mode heterogeneous consistency, and also judging whether the two entities $E_{Ai}$ and $E_{Bj}$ have attribute consistency and environment consistency; when the two entities $E_{Ai}$ and $E_{Bj}$ have multi-mode heterogeneous consistency, attribute consistency and environment consistency, determining that $E_{Ai}$ and $E_{Bj}$ are associated entities;
the attributes of the entities $E_{Ai}$ and $E_{Bj}$ comprise a text attribute and a geometric attribute, wherein the text attribute is a short text, a semantic feature vector model of the entity attributes is analyzed and established using a short-text word vector method, and the text attribute feature similarity between the entities $E_{Ai}$ and $E_{Bj}$ is calculated by a cosine similarity or Euclidean distance method;
fusing the text attribute feature similarity and the geometric attribute feature similarity between the entities $E_{Ai}$ and $E_{Bj}$ to form the attribute consistency feature similarity $P_{ij}$ between the entities $E_{Ai}$ and $E_{Bj}$, all $P_{ij}$ composing the attribute consistency feature similarity matrix P between the entity sets $E_A$ and $E_B$.
2. The method of claim 1, wherein establishing by an iterative method, based on the random walk path set between the entities $E_{Ai}$ and $E_{Bj}$, the multi-mode relationship transition probability $M_{ij}$ between the entities $E_{Ai}$ and $E_{Bj}$ comprises:
assuming that there are $|C|$ different relationship modes in the multimode heterogeneous information network, the multi-mode relationship transition matrix is expressed as a $|C| \times |C|$ matrix M, wherein $M_{ij}$ represents the transition probability from relationship type $C_i$ to relationship type $C_j$ in the multimode heterogeneous network;
in a random walk, if the previous node $E_i$ transfers to the current node $E_j$ through the relationship $C_x$, the probability $p(E_k \mid E_i, E_j, C_x, M)$ of transferring to the next node $E_k$ is calculated as follows:
Figure FDA0002932037570000013
wherein $W_{ij}$ is the weight between entities $E_i$ and $E_j$, $C_{ij}$ is the type of the relationship $(E_i, E_j)$, $N_i$ is the set of neighbor nodes of entity $E_i$, and $W_{ij} = |N_i \cap N_j| / |N_i \cup N_j|$; if $d_{ij}$ is the distance between entities $E_i$ and $E_j$, then:
Figure FDA0002932037570000021
acquiring, by random walks according to formula (2), a random walk path set $P = \{P_1, P_2, P_3, \ldots\}$ and the corresponding multimode transition paths $T = \{T_1, T_2, T_3, \ldots\}$, wherein
Figure FDA0002932037570000022
Figure FDA0002932037570000023
using a vector $e_i$ of dimension $|P|$ to represent the features of relationship type $C_i$ in the random walk set P, wherein $e_{ij}$ represents the number of occurrences of $C_i$ in $P_j$;
calculating the similarity between relationship types $C_i$ and $C_j$ according to the Pearson correlation coefficient, i.e.
Figure FDA0002932037570000024
updating the multi-mode relationship transition probability using a Sigmoid function:
Figure FDA0002932037570000025
initially, the matrix $M_{ij}$ is set to an all-ones matrix or a random matrix; based on $M_{ij}$, a random walk path set P is acquired using formula (2), and $M_{ij}$ is updated according to formula (5); the above process is iterated continuously until $M_{ij}$ converges, completing the construction of the multi-mode relationship transition matrix $M_{ij}$.
3. The method according to claim 2, wherein learning with an objective function on the basis of the multi-mode relationship transition matrix $M_{ij}$ to obtain the multi-mode heterogeneous feature vectors of the entities $E_{Ai}$ and $E_{Bj}$ comprises:
taking the entities $E_{Ai}$ and $E_{Bj}$ each as a node, establishing a cross-network distributed representation learning model and algorithm using the Skip-Gram model in Word2Vec, and setting the objective optimization function of the Skip-Gram model in cross-network distributed representation learning oriented to multi-mode heterogeneous characteristics as follows:
Figure FDA0002932037570000026
where $\theta$ is the parameter to be solved, and $N_t(v)$ is the set of context nodes of type t among the neighboring nodes of node v; if $V_t$ is the set of nodes of type t in the two networks, then:
Figure FDA0002932037570000027
wherein $X_v$ is the multi-mode heterogeneous feature vector of node v;
obtaining the multi-mode heterogeneous feature vectors $X_{Ai}$ and $X_{Bj}$ of the entities $E_{Ai}$ and $E_{Bj}$ by solving equation (10).
4. The method of claim 3, wherein judging, according to the multi-mode heterogeneous feature vectors of the entities $E_{Ai}$ and $E_{Bj}$, whether the two entities $E_{Ai}$ and $E_{Bj}$ have multi-mode heterogeneous consistency comprises:
judging, according to the multi-mode heterogeneous feature vectors of the entities $E_{Ai}$ and $E_{Bj}$, whether the type $T_{Ai}$ of entity $E_{Ai}$ and the type $T_{Bj}$ of entity $E_{Bj}$ are the same; if so, the type relationship identification degree $H_{ij}$ between the two entities $E_{Ai}$ and $E_{Bj}$ is equal to 1; otherwise, the type relationship identification degree between the two entities $E_{Ai}$ and $E_{Bj}$ is equal to 0;
Figure FDA0002932037570000031
when the types of the two entities $E_{Ai}$ and $E_{Bj}$ are the same, the multi-mode heterogeneous similarity $R_{ij}$ between the entities $E_{Ai}$ and $E_{Bj}$ is calculated as follows:
Figure FDA0002932037570000032
where $X_{Ai}$ and $X_{Bj}$ are the multi-mode heterogeneous feature vectors obtained by solving for the two entities $E_{Ai}$ and $E_{Bj}$, and all $R_{ij}$ compose the multi-mode heterogeneous feature similarity matrix R between the entity sets $E_A$ and $E_B$.
5. The method of claim 3, wherein judging whether the two entities $E_{Ai}$ and $E_{Bj}$ have environment consistency comprises:
if Z is the set of associated entities between
Figure FDA0002932037570000033
and
Figure FDA0002932037570000034
where
Figure FDA0002932037570000035
and
Figure FDA0002932037570000036
are the two multimode heterogeneous information networks;
the environment consistency feature similarity $Y_{ij}$ between the entities $E_{Ai}$ and $E_{Bj}$ is calculated as follows:
Figure FDA0002932037570000037
wherein $I_{Ai} = N_{Ai} \cap Z$ and $I_{Bj} = N_{Bj} \cap Z$; in the initial stage,
Figure FDA0002932037570000038
as the iterative process continues, there will be more and more associated entities in Z; all $Y_{ij}$ compose the environment consistency feature similarity matrix Y between the entity sets $E_A$ and $E_B$.
6. The method of claim 5, wherein determining that $E_{Ai}$ and $E_{Bj}$ are associated entities when the two entities $E_{Ai}$ and $E_{Bj}$ have multi-mode heterogeneous consistency, attribute consistency and environment consistency comprises:
combining the type relationship identification degree $H_{ij}$, the multi-mode heterogeneous similarity $R_{ij}$, the environment consistency feature similarity $Y_{ij}$ and the attribute consistency feature similarity $P_{ij}$ between the entities $E_{Ai}$ and $E_{Bj}$ to obtain the similarity value $S_{ij}$ between the entities $E_{Ai}$ and $E_{Bj}$:
$S_{ij} = \mathrm{sim}(E_{Ai}, E_{Bj}) = H_{ij} \cdot R_{ij} \cdot Y_{ij} \cdot P_{ij}$
forming, from the similarity values between all entities in $E_A$ and $E_B$, the similarity matrix S between $E_A$ and $E_B$, and selecting the unassociated entity pair $E_{Ai}$ and $E_{Bj}$ with the largest similarity value in S as associated entities, subject to $S_{ij} > \tau$, where $\tau$ is a preset similarity threshold;
when a new associated entity pair $\Delta Z$ is identified, updating the associated entity set Z as $Z = Z \cup \Delta Z$, updating Y and S, and identifying new associated entities again; when no associated entity meeting the requirement can be identified, ending the iteration and outputting the identified associated entity set Z.
CN202010806775.3A 2020-08-12 2020-08-12 Multi-mode heterogeneous associated entity identification method based on cross-network representation learning Active CN111931485B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010806775.3A CN111931485B (en) 2020-08-12 2020-08-12 Multi-mode heterogeneous associated entity identification method based on cross-network representation learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202010806775.3A CN111931485B (en) 2020-08-12 2020-08-12 Multi-mode heterogeneous associated entity identification method based on cross-network representation learning

Publications (2)

Publication Number Publication Date
CN111931485A CN111931485A (en) 2020-11-13
CN111931485B true CN111931485B (en) 2021-03-23

Family

ID=73310734

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010806775.3A Active CN111931485B (en) 2020-08-12 2020-08-12 Multi-mode heterogeneous associated entity identification method based on cross-network representation learning

Country Status (1)

Country Link
CN (1) CN111931485B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant