CN104035917B

CN104035917B - Knowledge graph management method and system based on semantic space mapping

Info

Publication number: CN104035917B
Application number: CN201410253673.8A
Authority: CN
Inventors: 王晓平; 肖仰华; 汪卫
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2014-06-10
Filing date: 2014-06-10
Publication date: 2017-07-07
Anticipated expiration: 2034-06-10
Also published as: CN104035917A

Abstract

The invention belongs to the technical field of text semantic processing and the semantic web, and specifically relates to a knowledge graph management method and system based on semantic space mapping. The method comprises semantic vector construction, semantic space mapping, and knowledge graph management; knowledge graph management is further divided into three parts: semantic clustering, semantic deduplication, and semantic annotation. For an edge/node of the knowledge graph, the text units describing it are first projected into the semantic space, and its vector representation in the semantic space is obtained by vector accumulation; on this basis, multiple knowledge graph management tasks are realized. The system comprises three corresponding modules: semantic vector construction, semantic space mapping, and knowledge graph management. The invention overcomes the sensitivity of traditional knowledge graph management methods to factors such as word inflection, synonym substitution, and changes in grammatical form during semantic comparison, and vector accumulation easily handles differences in the number of words, so that further knowledge graph management tasks such as semantic clustering, semantic deduplication, and semantic annotation are easy to realize.

Description

Knowledge graph management method and system based on semantic space mapping

Technical Field

The invention belongs to the technical field of text semantic processing and semantic web, and particularly relates to a knowledge graph management method and system based on semantic space mapping.

Background

Knowledge graph construction is an important undertaking in the big data era: it links unorganized data and arranges it into structured knowledge that can be provided to users. This property gives knowledge graphs important applications in many fields. For example, current search engines retrieve results by keyword matching; once a knowledge graph has been built, entering a keyword can also return related information such as its attributes, its category, and its relationships with other entities, so that the required information is provided to users more accurately and completely. The knowledge graph is a cornerstone of applications such as semantic search, automatic question answering, Internet advertisement recommendation, and personalized electronic reading, and whether it can be managed effectively directly determines how large a role it can play in these fields.

However, what current knowledge graph construction finally extracts is a deterministic relational representation, and such a deterministic description adapts poorly to word inflection, synonym substitution, changes in grammatical form, and the like. For example, two edges with similar semantics may be treated as two completely different edges merely because they are described by different words. This treatment is not only unreasonable but also makes knowledge graph management tasks such as edge/node clustering, edge/node deduplication, and edge/node annotation very difficult, which in turn hinders the effective application of the knowledge graph.

Disclosure of Invention

The invention provides a knowledge graph management method and system based on semantic space mapping, aiming at the defects of the current knowledge graph management technical method.

For the edges/nodes of the knowledge graph (namely the relations between entities and the entities themselves), the text units describing each edge/node are first projected into a semantic space and accumulated, giving the vector representation of the edge/node in the semantic space. On the basis of this text semantic vectorization, several knowledge graph management tasks can then be realized: a clustering method combined with a vector similarity measure makes it convenient to perform semantic clustering of edges/nodes and thus to mine relations/entities with similar semantics; on the basis of semantic clustering, semantic deduplication can be realized by computing a typical edge/typical node to replace each class set; and automatic annotation of relations/entities can be realized according to the semantic distance between a newly added edge/node and the annotated edge/node models.

The invention provides a knowledge graph management method based on semantic space mapping, which comprises the following specific steps: semantic vector construction, semantic space mapping, and knowledge graph management; wherein:

(1) the specific steps of semantic vector construction are as follows:

A semantic vector library is constructed from a corpus so that text units are mapped to vectors in a semantic space. The semantic similarity between text units can then be compared according to the distance between the corresponding vectors in the semantic space. Because the semantic vectors of words with similar meanings lie very close together, this overcomes the effects of word inflection, synonym substitution, and changes in grammatical form that arise when words are compared directly.

The semantic vectors can be computed by various methods, such as the Word2Vec method, the ESA (Explicit Semantic Analysis) method, the LSA (Latent Semantic Analysis) method, or co-occurrence word frequency features; preferably, the Word2Vec method (https:// code.

The training data for constructing the semantic vectors is selected so as to ensure high coverage and domain independence by using a large-scale encyclopedic corpus. Preferably, the Wikipedia knowledge base (http://www.wikipedia.org/) is used as the corpus, the semantic vectors are trained with the Word2Vec method, and the training result is used to build the semantic vector library that the other modules use for semantic mapping.

(2) Semantic space mapping

The texts representing edges/nodes in the knowledge graph are mapped into vectors in the semantic space, with the following specific steps:

(2.1) filtering words in edges/nodes (relationships between entities/entities) in the knowledge graph to remove stop words without semantics;

(2.2) for each word retained after step (2.1), acquiring its projection vector in the semantic space from the constructed semantic vector library, and then accumulating the semantic vectors of these words to obtain the overall semantic vector representing the edge/node.
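Steps (2.1) and (2.2) can be illustrated with a minimal Python sketch. The three-dimensional vectors, the stop-word list, and the function name `map_to_semantic_space` are hypothetical stand-ins for a trained semantic vector library, and skipping words that are missing from the library is an assumption, since the text does not say how such words are handled.

```python
# Toy semantic vector library; real vectors would come from a trained model.
# All words, values, and names below are hypothetical.
SEMANTIC_VECTORS = {
    "large": [0.9, 0.1, 0.0],
    "big": [0.8, 0.2, 0.0],
    "city": [0.1, 0.9, 0.2],
}
STOP_WORDS = {"a", "an", "the", "of", "in"}


def map_to_semantic_space(text, vectors=SEMANTIC_VECTORS, stop_words=STOP_WORDS):
    """Return the accumulated (summed) vector of the non-stop words in text."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for word in text.lower().split():
        if word in stop_words:
            continue  # step (2.1): drop stop words without semantics
        vec = vectors.get(word)
        if vec is None:
            continue  # assumption: skip words absent from the library
        total = [t + v for t, v in zip(total, vec)]  # step (2.2): accumulate
    return total


edge_vector = map_to_semantic_space("the large city")
```

Because the word vectors are summed rather than concatenated, descriptions with different numbers of words all map to vectors of the same dimension.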

(3) Knowledge graph management is divided into three sub-steps: semantic clustering, semantic deduplication, and semantic annotation;

(3.1) Semantic clustering is further semantic mining on top of knowledge graph construction and is very important for managing the knowledge graph. It comprises edge clustering (relation clustering) and node clustering (entity clustering). For edge clustering, edges connecting different node pairs can be clustered to find entity pairs with similar semantic relations; multiple edges of a single node can be clustered to mine the main classes of entities related to that node; and even multiple edges connecting the same pair of nodes can be clustered to mine the main classes of relations between them. For node clustering, entities with similar semantics can be found.

The specific steps of semantic clustering are as follows:

Semantic space mapping is first carried out on the edge/node set to be clustered, based on the constructed semantic vector library, and the resulting semantic vectors are then clustered. Various clustering methods may be used, such as hierarchical clustering or K-means; preferably, hierarchical clustering is used. Various similarity measures may be used, such as Cosine, Cityblock, Euclidean, Mahalanobis, Minkowski, or Chebychev; preferably, Cosine similarity is used.

$$\mathrm{Sim}(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$

where x and y are the two vectors to be compared, and Sim is the resulting Cosine similarity.
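The preferred combination of hierarchical clustering and Cosine similarity can be sketched in plain Python as follows. The linkage criterion is not stated in the text, so average linkage is an assumption here, as are the use of cosine distance (1 - Sim) as the inter-class distance, the stopping parameter `max_distance`, and the function names.

```python
import math


def cosine_similarity(x, y):
    """Sim(x, y) = x.y / (|x| |y|); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def hierarchical_clusters(vectors, max_distance):
    """Average-linkage agglomerative clustering on cosine distance (1 - Sim).

    Merging stops once the closest pair of clusters is farther apart than
    max_distance. Returns a sorted list of clusters, each a sorted index list.
    """
    clusters = [[i] for i in range(len(vectors))]

    def distance(c1, c2):
        # Mean pairwise cosine distance between the members of two clusters.
        pairs = [(i, j) for i in c1 for j in c2]
        return sum(1.0 - cosine_similarity(vectors[i], vectors[j])
                   for i, j in pairs) / len(pairs)

    while len(clusters) > 1:
        # Find the closest pair of clusters (first pair wins on ties).
        dist, a, b = min(
            ((distance(clusters[a], clusters[b]), a, b)
             for a in range(len(clusters))
             for b in range(a + 1, len(clusters))),
            key=lambda item: item[0],
        )
        if dist > max_distance:
            break
        clusters[a] = sorted(clusters[a] + clusters[b])
        del clusters[b]
    return sorted(clusters)


# Hypothetical 2-D vectors: items 0 and 1 point one way, items 2 and 3 another.
toy_vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
groups = hierarchical_clusters(toy_vectors, max_distance=0.2)
```

This quadratic-time loop is only meant to make the procedure concrete; a library implementation of agglomerative clustering would normally be used on real data.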

(3.2) semantic deduplication

Knowledge graphs constructed from big data often exhibit the following situation: although their specific representations (the texts describing the relations/entities) differ, different edges/nodes express semantic content that is very close or even identical, so that growth in the size of the knowledge graph is accompanied by growth in redundant information. From a data cleaning perspective, if such edges/nodes are given a uniform representation and semantic deduplication (edge deduplication and node deduplication) is performed, the number of semantic edges/nodes (namely the number of relations/entities) is reduced and a simplified representation of the knowledge graph is obtained at the same time.

The specific steps of semantic deduplication are as follows:

Given the result of semantic clustering, for each set of edges/nodes clustered into the same class, the redundancy of semantic information is reduced by computing a typical edge/typical node that replaces the original elements of the class set. The typical element is selected as follows:

$$\mathrm{Typical} = \arg\max_{i} \mathrm{Sim}(V_i, V)$$

Here, $V_i$ is the semantic vector of the i-th relation/entity in the set to be merged, $V$ is the accumulated semantic vector of all relations/entities in the set to be merged, and Sim(a, b) denotes the similarity of vector a and vector b. Various similarity measures may be used, such as Cosine, Cityblock, Euclidean, Mahalanobis, Minkowski, or Chebychev; preferably, Cosine similarity is used.

By selecting typical edges/typical nodes by calculation to perform relationship/entity duplication removal, the storage space of the knowledge graph is effectively reduced, and the simplified representation of the knowledge graph is realized without losing representativeness.
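The selection of a typical edge/node for one class set can be sketched as below, using Cosine similarity as the preferred measure. The function name `select_typical` and the toy vectors are hypothetical.

```python
import math


def cosine_similarity(x, y):
    """Sim(x, y) = x.y / (|x| |y|); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def select_typical(vectors):
    """Return the index i that maximizes Sim(V_i, V), with V the sum of all V_i."""
    dim = len(vectors[0])
    total = [sum(v[d] for v in vectors) for d in range(dim)]  # accumulated V
    sims = [cosine_similarity(v, total) for v in vectors]
    return max(range(len(vectors)), key=sims.__getitem__)


# Hypothetical class set of three semantic vectors.
class_set = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
typical_index = select_typical(class_set)
```

The element whose direction is closest to the accumulated vector is kept, and the remaining elements of the class set are replaced by it.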

(3.3) semantic annotation

By comparing the semantic similarity between an input edge/node and the known edge/node models, the matching model is determined, and the corresponding label from the predefined range of known types is then attached to the input edge/node, which facilitates uniform representation and management of the edges/nodes in the knowledge graph. The semantic annotation specifically comprises the following steps:

(3.3.1) edge/node model construction:

For the clustered edges/nodes, an edge/node model (namely a relation/entity model) is constructed from the corresponding set of semantic vectors. The model can be built with various methods, such as a mean vector model, a Gaussian model, an artificial neural network, or a support vector machine; preferably, the mean vector model is used. At the same time, the type label corresponding to each class of relation/entity is assigned manually.

$$\bar{m}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} m_{i,j}$$

where $m_{i,j}$ denotes the j-th vector of class i, $n_i$ is the number of samples in the class, and $\bar{m}_i$ is the mean vector.

After the model is built, it is added to the edge/node model library.
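A mean vector model and its model library can be sketched as follows. Representing the library as a dictionary from the manually assigned type label to the mean vector is an assumption made for illustration, as are the function names and the example values.

```python
def build_mean_vector_model(class_vectors):
    """Return the component-wise mean of the n_i semantic vectors of one class."""
    n = len(class_vectors)
    dim = len(class_vectors[0])
    return [sum(v[d] for v in class_vectors) / n for d in range(dim)]


def add_to_model_library(library, label, class_vectors):
    """Store the mean vector model of a class under its manual type label."""
    library[label] = build_mean_vector_model(class_vectors)
    return library


# Hypothetical class of two semantic vectors with a manually chosen label.
model_library = add_to_model_library({}, "metropolitan area",
                                     [[1.0, 3.0], [3.0, 5.0]])
```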

(3.3.2) edge/node identification

For the edge/node to be queried, its semantic vector representation is first obtained according to the semantic space mapping step, and the vector is then compared in turn with the edge/node models in the edge/node model library. For the mean vector model and the Gaussian model, either the similarity between vectors is compared directly or the probability that the input vector belongs to the model is computed, and after all models have been traversed the category with the highest value is taken as the output; for the artificial neural network and the support vector machine, the corresponding category is output directly.

Taking the mean vector model as an example, the output category Class is:

$$\mathrm{Class} = \arg\max_{i} \mathrm{Sim}(v, \bar{m}_i)$$

where $v$ is the semantic vector to be identified, $\bar{m}_i$ is the mean vector corresponding to the class-i edges/nodes, i ∈ {1, 2, …, N}, N is the number of models in the edge/node model library, and Sim(a, b) denotes the similarity of vector a and vector b. Various similarity measures may be used, such as Cosine, Cityblock, Euclidean, Mahalanobis, Minkowski, or Chebychev; preferably, Cosine similarity is used.

(3.3.3) edge/node semantic annotation

And for the category output in the last step, taking out the corresponding type label which is labeled in advance from the edge/node model library and assigning the label to the input edge/node, thereby completing the semantic labeling process.
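Identification and annotation with mean vector models can be sketched together as below. The model library is assumed to be a dictionary from type label to mean vector, and the labels and vectors are hypothetical. The optional `min_similarity` cutoff mirrors the 0.8 threshold used in the embodiment; returning no label below the cutoff is an assumption, since the text only shows an accepted match.

```python
import math


def cosine_similarity(x, y):
    """Sim(x, y) = x.y / (|x| |y|); returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def annotate(vector, model_library, min_similarity=None):
    """Return (label, similarity) for the best-matching mean vector model.

    model_library maps a manually assigned type label to its mean vector.
    If min_similarity is given and the best score falls below it, the
    returned label is None (assumed behaviour, not stated in the source).
    """
    scores = {label: cosine_similarity(vector, mean)
              for label, mean in model_library.items()}
    best = max(scores, key=scores.get)  # Class = argmax_i Sim(v, mean_i)
    if min_similarity is not None and scores[best] < min_similarity:
        return None, scores[best]
    return best, scores[best]


# Hypothetical library with two mean vector models.
library = {"metropolitan area": [1.0, 0.0], "head of state": [0.0, 1.0]}
label, similarity = annotate([0.9, 0.1], library, min_similarity=0.8)
```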

The invention also provides a knowledge graph management system based on semantic space mapping, corresponding to the method. The system consists of three modules: a semantic vector construction module, a semantic space mapping module, and a knowledge graph management module. The knowledge graph management module comprises three submodules: a semantic clustering submodule, a semantic deduplication submodule, and a semantic annotation submodule.

The specific contents are as follows:

(1) the semantic vector construction module:

The module constructs a semantic vector library from a corpus so that text units are mapped to vectors in a semantic space. Its advantage is that the semantic similarity between text units can be compared according to the distance between the corresponding vectors in the semantic space; because the semantic vectors of words with similar meanings lie very close together, the effects of word inflection, synonym substitution, and changes in grammatical form that arise when words are compared directly are overcome.

(2) The semantic space mapping module specifically comprises the following contents:

The module maps the texts representing edges/nodes in the knowledge graph into vectors in the semantic space.

(3) The knowledge graph management module comprises the following specific contents:

The module is responsible for managing the knowledge graph and comprises three submodules: the semantic clustering submodule, the semantic deduplication submodule, and the semantic annotation submodule, which correspond respectively to the three sub-steps of the knowledge graph management step;

(3.1) semantic clustering submodule

Semantic clustering is further semantic mining on top of knowledge graph construction and is important for managing the knowledge graph. It comprises edge clustering (relation clustering) and node clustering (entity clustering). For edge clustering, edges connecting different node pairs can be clustered to find entity pairs with similar semantic relations; multiple edges of a single node can be clustered to mine the main classes of entities related to that node; and even multiple edges connecting the same pair of nodes can be clustered to mine the main classes of relations between them. For node clustering, entities with similar semantics can be found;

(3.2) semantic deduplication submodule

The specific content of semantic deduplication is as follows:

For each set of edges/nodes clustered into the same class, a typical edge/typical node is selected to replace the original elements of the set:

$$\mathrm{Typical} = \arg\max_{i} \mathrm{Sim}(V_i, V)$$

Here, $V_i$ is the semantic vector of the i-th relation/entity in the set to be merged, $V$ is the accumulated semantic vector of all relations/entities in the set to be merged, and Sim(a, b) denotes the similarity of vector a and vector b. Various similarity measures may be used, such as Cosine, Cityblock, Euclidean, Mahalanobis, Minkowski, or Chebychev; preferably, Cosine similarity is used;

by selecting typical edges/typical nodes by calculation to carry out relationship/entity duplication removal, the storage space of the knowledge graph is effectively reduced, the simplified representation of the knowledge graph is realized, and the representativeness is not lost;

(3.3) semantic annotation submodule

The submodule determines the matching model by comparing the semantic similarity between an input edge/node and the known edge/node models, and then attaches the corresponding label from the predefined range of known types to the input edge/node, which facilitates uniform representation and management of the edges/nodes in the knowledge graph. The specific contents of the submodule are as follows:

(3.3.1) edge/node model construction:

After the model is built, it is added to the edge/node model library.

(3.3.2) edge/node identification

(3.3.3) edge/node semantic annotation

Advantageous Effects of the Invention

By mapping the texts that represent knowledge graph edges/nodes into semantic vectors, the invention overcomes the sensitivity of traditional knowledge graph management methods to factors such as word inflection, synonym substitution, and changes in grammatical form during semantic comparison. Vector accumulation also allows the method to cope easily with differences in the number of words. Further knowledge graph management tasks such as semantic clustering, semantic deduplication, and semantic annotation are therefore easy to realize, and the accuracy of semantic comparison is improved while processing flexibility is enhanced.

Drawings

FIG. 1: a system block diagram.

FIG. 2: hierarchical clustering result graphs (edge clustering). The abscissa is the serial number of the entity pair and the ordinate is the inter-class distance.

FIG. 3: hierarchical clustering result graphs (node clustering). The abscissa is the sequence number of the entity and the ordinate is the inter-class distance.

FIG. 4: semantic deduplication-typical edge selection. The abscissa is the serial number of the entity and the ordinate is the similarity.

Detailed Description

The following examples are provided to illustrate the specific embodiments of the present invention, and the system modules are processed as follows:

(1) semantic vector construction

Word2Vec is trained on the text corpus of the entire English Wikipedia (http://www.wikipedia.org/), and the output vectors have 500 dimensions.

(2) Semantic space mapping

For the words on an edge/node, after the stop words are removed, the corresponding semantic vectors are taken from the trained semantic vector library and accumulated to obtain the semantic vector representation of the edge/node.

(3) Semantic clustering

(3.1) edge semantic clustering

Input example, in the format:

sequence number: {node 1}, {edge}, {node 2}

1：{Shanghai}, {large city}, {China}

2：{ipad}, {product}, {Apple}

3：{Barack Obama}, {president}, {USA}

4：{Kindle}, {manufacture}, {Amazon}

5：{New York}, {metropolis}, {USA}

6：{Dmitry Medvedev}, {Prime Minister}, {Russia}

The hierarchical clustering result graph (edge clustering) is shown in fig. 2.

Taking the threshold value as 0.8, the clustering result is as follows:

Class 1: 2, 4

Class 2: 1, 5

Class 3: 3, 6

The clustering result is correct;

(3.2) node semantic clustering

Inputting 6 nodes:

1：{tuna}

2：{tiger}

3：{leopard}

4：{car}

5：{fish}

6：{train}

the hierarchical clustering result graph (node clustering) is shown in fig. 3.

Taking the threshold value as 0.8, the clustering result is as follows:

Class 1: 1, 5

Class 2: 2, 3

Class 3: 4, 6

The clustering result is correct.

(4) Semantic deduplication

For example, for two nodes in the knowledge graph, {Bill Gates} and {Microsoft}, the following edges between them are clustered into the same class after semantic clustering:

1：{CEO}

2：{executives}

3：{president}

4：{chief executive officer}

5：{current chairman}

6：{chairman}

7：{chair}

Typical edge selection for semantic deduplication is shown in FIG. 4.

The semantic vectors of all the edges are accumulated to obtain a total semantic representation vector, and the similarity between each edge and this total vector is then computed in turn. The edge with the maximum similarity, number 6, namely {chairman}, is selected as the typical edge, so that the 7 edges originally clustered into the same class are replaced by a single typical edge. This simplifies the representation of the knowledge graph and reduces storage space while retaining representativeness.

(5) Semantic annotation

For example, consider the set of edges of one relation class obtained by clustering:

1：{large city}

2：{metropolis}

3：{megacity}

4：{major city}

5：{big cities}

6：{megacities}

7：{mega cities}

A mean vector model is constructed from the corresponding set of semantic vectors, and the type label of the model is set to 'metropolitan area'.

For a newly input edge {big city}, the similarity between its semantic vector and the edge model is calculated:

Sim = 0.8434

With a threshold of 0.8, the input edge is considered to have the same meaning as the edges of this class, so the model type label 'metropolitan area' is assigned to the input edge, thereby completing the semantic annotation process.

References

[1] Tomas Mikolov, et al. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.

[2] Tomas Mikolov, et al. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.

[3] Tomas Mikolov, et al. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013.

Claims

1. A knowledge graph management method based on semantic space mapping, characterized by comprising the following specific steps: semantic vector construction, semantic space mapping, and knowledge graph management; wherein:

(1) the specific steps of semantic vector construction are as follows:

constructing a semantic vector library based on the corpus so that the text units are mapped to vectors on a semantic space;

using the Wikipedia knowledge base as the corpus, training the semantic vectors with the Word2Vec method, and constructing the semantic vector library from the training results;

(2) semantic space mapping

Mapping the texts representing edges/nodes in the knowledge graph into vectors in the semantic space, specifically comprising the following steps:

(2.1) filtering words in edges/nodes in the knowledge graph to remove stop words without semantics;

(2.2) acquiring a projection vector of each word retained after the operation processing in the step (2.1) in a semantic space from a constructed semantic vector library, and then accumulating semantic vectors corresponding to the words to further obtain a total semantic vector representing the edge/node;

(3) knowledge graph management is divided into three sub-steps: semantic clustering, semantic duplication removal and semantic labeling;

(3.1) the specific steps of semantic clustering are as follows:

firstly, semantic space mapping is carried out on the edge/node set to be clustered based on a constructed semantic vector library, and then the obtained semantic vectors are further clustered;

(3.2) the specific steps of semantic deduplication are as follows:

$$\mathrm{Typical} = \arg\max_{k} \mathrm{Sim}(V_k, V)$$

the meaning of the formula is to take the k at which the function $\mathrm{Sim}(V_k, V)$ reaches its maximum value as Typical, wherein Typical refers to the selected typical edge or typical node;

here, $V_k$ is the semantic vector of the k-th relation/entity in the set to be merged, $V$ is the accumulated semantic vector of all relations/entities in the set to be merged, and Sim(a, b) denotes the similarity of vector a and vector b;

(3.3) the semantic annotation specifically comprises the following steps:

(3.3.1) edge/node model construction:

for the clustered edges/nodes, constructing an edge/node model based on the corresponding semantic vector set;

meanwhile, manually marking a type label corresponding to each type of relation/entity;

$$\bar{m}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} m_{i,j}$$

wherein $\bar{m}_i$ is the mean vector corresponding to the class-i edges/nodes, i ∈ {1, 2, …, N}, N being the number of models in the edge/node model library; $m_{i,j}$ denotes the j-th vector of class i, and $n_i$ is the number of samples in the class;

after the model is built, adding the edge/node model into an edge/node model library;

(3.3.2) edge/node identification

For the edge/node to be queried, after its semantic vector representation is obtained according to the semantic space mapping step, the vector is compared in turn with the edge/node models in the edge/node model library, wherein for the mean vector model and the Gaussian model, either the similarity between vectors is compared directly or the probability that the input vector belongs to the model is computed, and after traversal the category with the highest value is taken as the output; for the artificial neural network and the support vector machine, the corresponding category is output directly;

(3.3.3) edge/node semantic annotation

And (3) for the category output in the step (3.3.2), taking out the corresponding type label which is labeled in advance from the edge/node model library and assigning the label to the input edge/node, thereby completing the semantic labeling process.

2. The knowledge-graph management method based on semantic space mapping according to claim 1, characterized in that in step (3.3.2), for the mean vector model, the output categories are:

$$\mathrm{Class} = \arg\max_{i} \mathrm{Sim}(v, \bar{m}_i)$$

the meaning of the formula is to take the i at which the function $\mathrm{Sim}(v, \bar{m}_i)$ reaches its maximum value as Class;

v is a semantic vector to be identified, and Sim (a, b) represents the similarity of the vector a and the vector b.

3. A knowledge graph management system based on semantic space mapping and on the method of claim 1, characterized by comprising the following three modules: a semantic vector construction module for executing step (1), a semantic space mapping module for executing step (2), and a knowledge graph management module for executing step (3), wherein the knowledge graph management module comprises three submodules: a semantic clustering submodule for executing step (3.1), a semantic deduplication submodule for executing step (3.2), and a semantic annotation submodule for executing step (3.3).