CN104035917B

CN104035917B - Knowledge graph management method and system based on semantic space mapping

Info

Publication number: CN104035917B
Application number: CN201410253673.8A
Authority: CN
Inventors: 王晓平; 肖仰华; 汪卫
Original assignee: Fudan University
Current assignee: Fudan University
Priority date: 2014-06-10
Filing date: 2014-06-10
Publication date: 2017-07-07
Anticipated expiration: 2034-06-10
Also published as: CN104035917A

Abstract

The invention belongs to the technical field of text semantic processing and semantic networks, and specifically relates to a knowledge graph management method and system based on semantic space mapping. The method comprises semantic vector construction, semantic space mapping, and knowledge graph management; knowledge graph management in turn comprises three parts: semantic clustering, semantic deduplication, and semantic tagging. For an edge/node of the knowledge graph, the text units describing it are first projected into the semantic space, and its vector representation in the semantic space is obtained by vector accumulation; on this basis, multiple knowledge graph management tasks are realized. The system comprises three corresponding modules: semantic vector construction, semantic space mapping, and knowledge graph management. The invention overcomes the sensitivity of traditional knowledge graph management methods to factors such as word inflection, synonym substitution, and changes in grammatical form when comparing semantics. Vector accumulation also copes easily with differing word counts, which makes knowledge graph management tasks such as semantic clustering, semantic deduplication, and semantic tagging easy to implement.

Description

Knowledge graph management method and system based on semantic space mapping

Technical field

The invention belongs to the technical field of text semantic processing and semantic networks, and specifically relates to a knowledge graph management method and system based on semantic space mapping.

Background art

Building knowledge graphs is an important undertaking of the big data era. A knowledge graph associates disordered data and organizes it into structured knowledge that is provided to users, a property that gives it important applications in many fields. For example, current search engines all search by keyword matching; once a knowledge graph has been established, entering a keyword can return related information such as the keyword's attributes, its categories, and its relations to other entities, so that the required information is provided to the user more accurately and completely. Knowledge graphs are the cornerstone of a series of applications such as semantic search, automatic question answering, Internet advertisement recommendation, and personalized electronic reading, and whether a knowledge graph can be managed effectively directly determines how large a role it plays in these fields.

However, what current knowledge graph construction ultimately extracts is a deterministic relation representation, and such a deterministic description adapts poorly to word inflection, synonym substitution, and changes in grammatical form. For example, two semantically similar edges are treated as two entirely different edges merely because they are described with different words. This treatment is not only unreasonable but also makes knowledge graph management tasks such as edge/node clustering, edge/node deduplication, and edge/node tagging very difficult, thereby impairing the effective application of knowledge graphs.

Summary of the invention

To address the deficiencies of current knowledge graph management techniques, the present invention proposes a knowledge graph management method and system based on semantic space mapping.

For an edge/node of the knowledge graph (that is, an inter-entity relation/entity), the text units describing it are first projected into the semantic space and accumulated, yielding the vector representation of the edge/node in the semantic space. On the basis of this semantic vectorization of text, multiple knowledge graph management tasks can then be realized: a clustering method combined with a vector similarity measure can readily perform semantic clustering of edges/nodes, thereby discovering semantically similar inter-entity relations/entities; on the basis of semantic clustering, semantic deduplication can be realized by computing a typical edge/typical node to replace each cluster; and automatic tagging of relations/entities can be realized according to the semantic distance between a newly added edge/node and the already-tagged edge/node models.

The knowledge graph management method based on semantic space mapping proposed by the present invention comprises the following steps: semantic vector construction, semantic space mapping, and knowledge graph management, wherein:

(1) Semantic vector construction proceeds as follows:

A semantic vector library is built from a corpus so that text units are mapped to vectors in the semantic space. The advantage is that the semantic similarity between text units can be compared by the distance between their corresponding vectors in the semantic space: semantically close words have semantic vectors that are also close in the space, which overcomes the influence of word inflection, synonym substitution, and changes in grammatical form that arises when words are compared directly.

Semantic vectors can be computed by various methods, such as the Word2Vec method, the ESA (explicit semantic analysis) method, the LSA (latent semantic analysis) method, or co-occurrence word frequency features. Preferably, the Word2Vec method is used (https://code.google.com/p/word2vec/, see also references [1, 2, 3]).

The training data for building semantic vectors is chosen to be a large-scale, encyclopedia-type corpus so as to ensure high coverage and domain independence. Preferably, the Wikipedia knowledge base (http://www.wikipedia.org/) is used as the corpus for training semantic vectors with the Word2Vec method, and the semantic vector library is built from the training result for use by the other modules during semantic space mapping.
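As a dependency-free illustration of what a semantic vector library looks like, the following sketch builds vectors from co-occurrence word frequency features, one of the alternatives listed above; it is not the preferred Word2Vec training on Wikipedia. The function name `build_cooccurrence_vectors`, the `window` parameter, and the toy corpus are assumptions made for this example only.

```python
def build_cooccurrence_vectors(sentences, window=2):
    """Return {word: vector}; each vector counts neighbours within `window` words.

    Hypothetical helper: a tiny stand-in for a trained semantic vector library.
    """
    vocab = sorted({word for sentence in sentences for word in sentence})
    index = {word: i for i, word in enumerate(vocab)}
    vectors = {word: [0.0] * len(vocab) for word in vocab}
    for sentence in sentences:
        for pos, word in enumerate(sentence):
            lo = max(0, pos - window)
            hi = min(len(sentence), pos + window + 1)
            for ctx_pos in range(lo, hi):
                if ctx_pos != pos:
                    # Count one co-occurrence of `word` with its context word.
                    vectors[word][index[sentence[ctx_pos]]] += 1.0
    return vectors


# Toy corpus (assumed); a real library would be trained on an encyclopedia-scale corpus.
corpus = [
    ["tiger", "hunts", "prey"],
    ["leopard", "hunts", "prey"],
    ["train", "carries", "passengers"],
]
library = build_cooccurrence_vectors(corpus)
```

In this toy corpus "tiger" and "leopard" share the same contexts and therefore receive identical vectors, while "train" does not, which is the property the semantic space is meant to capture.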

(2) Semantic space mapping

The text representing an edge/node in the knowledge graph is mapped to a vector in the semantic space, as follows:

(2.1) The words in the edge/node (inter-entity relation/entity) of the knowledge graph are filtered to remove stop words that carry no semantic content;

(2.2) For each word retained after the filtering in step (2.1), its projection vector in the semantic space is obtained from the semantic vector library already built; the semantic vectors of these words are then summed to obtain the overall semantic vector characterizing the edge/node.
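Steps (2.1) and (2.2) can be sketched as follows. The stop-word list, the toy three-dimensional vectors, and the function name `map_to_semantic_space` are assumptions for illustration; skipping words that are missing from the library is also an assumption, since the text does not say how out-of-vocabulary words are handled.

```python
STOP_WORDS = {"the", "of", "a", "an", "in"}  # assumed stop-word list


def map_to_semantic_space(text, vector_library, stop_words=STOP_WORDS):
    """Sum the semantic vectors of the non-stop words in `text`.

    Returns None when no word of the text is found in the library.
    """
    # Step (2.1): filter out stop words.
    words = [w for w in text.lower().split() if w not in stop_words]
    # Step (2.2): look up each remaining word and accumulate the vectors.
    known = [vector_library[w] for w in words if w in vector_library]
    if not known:
        return None
    dim = len(known[0])
    return [sum(vec[d] for vec in known) for d in range(dim)]


# Toy library with made-up 3-dimensional vectors.
toy_library = {
    "large": [1.0, 0.0, 0.5],
    "city": [0.0, 2.0, 0.5],
}
edge_vector = map_to_semantic_space("the large city", toy_library)
```

Because the vectors are summed, texts with different numbers of words still map to vectors of the same dimension.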

(3) Knowledge graph management is divided into three sub-steps: semantic clustering, semantic deduplication, and semantic tagging;

(3.1) Semantic clustering is a further semantic mining step on top of knowledge graph construction and is very important for managing the knowledge graph. It comprises edge clustering (relation clustering) and node clustering (entity clustering). For edge clustering, the edges connecting different node pairs can be clustered to discover entity pairs that share similar semantic relations; the multiple edges of a single node can be clustered to mine the main categories of entities related to that node; and even the multiple edges connecting the same pair of nodes can be clustered to mine the main categories of relations between them. Node clustering discovers semantically similar entities.

Semantic clustering proceeds as follows:

For the set of edges/nodes to be clustered, semantic space mapping is first performed using the semantic vector library already built, and the resulting semantic vectors are then clustered. Various clustering methods can be used, such as hierarchical clustering or K-means; preferably, hierarchical clustering is used. Various similarity measures can be used, such as Cosine, Cityblock, Euclidean, Mahalanobis, Minkowski, or Chebychev; preferably, Cosine similarity is used:

$$\mathrm{Sim}(x, y) = \frac{x \cdot y}{\lVert x \rVert \, \lVert y \rVert}$$

where x and y are the two vectors being compared, and Sim is the resulting Cosine similarity.
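The clustering step can be sketched with a small agglomerative (hierarchical) procedure over Cosine distances. This is a minimal sketch: the text does not specify the linkage rule, so average linkage is an assumption here, as are the function names, the `max_distance` cut-off value, and the toy two-dimensional vectors.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def hierarchical_clusters(vectors, max_distance):
    """Average-linkage agglomerative clustering with distance = 1 - cosine similarity.

    Merging stops once the closest pair of clusters is farther apart than
    `max_distance`. Returns clusters as sorted lists of indices into `vectors`.
    """
    clusters = [[i] for i in range(len(vectors))]

    def linkage(c1, c2):
        dists = [1.0 - cosine_similarity(vectors[i], vectors[j])
                 for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        # Find the closest pair of clusters.
        dist, a, b = min(
            (linkage(clusters[a], clusters[b]), a, b)
            for a in range(len(clusters))
            for b in range(a + 1, len(clusters))
        )
        if dist > max_distance:
            break
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return [sorted(c) for c in clusters]


# Toy vectors: items 0 and 1 point in one direction, items 2 and 3 in another.
toy_vectors = [[1.0, 0.1], [0.9, 0.2], [0.1, 1.0], [0.2, 0.9]]
result = hierarchical_clusters(toy_vectors, max_distance=0.2)
```

The same routine applies to edge vectors and node vectors alike, since both are produced by the semantic space mapping step.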

(3.2) Semantic deduplication

Knowledge graphs built from big data commonly exhibit the following situation: many different edges/nodes differ in their concrete representation (the text describing the relation/entity), yet the semantic content they express is very close or even identical. As a result, the amount of redundant information grows along with the scale of the knowledge graph. From a data cleaning perspective, giving these edges/nodes a unified representation, that is, performing semantic deduplication (edge deduplication and node deduplication), reduces the number of semantic edges/nodes (the number of relations/entities) and yields a compact representation of the knowledge graph.

Semantic deduplication proceeds as follows:

For the result of semantic clustering, each set of edges/nodes grouped into the same cluster is replaced by a computed typical edge/typical node in order to reduce the redundancy of semantic information. The selection criterion is:

$$\mathrm{Typical} = \arg\max_{i} \, \mathrm{Sim}(V_i, V)$$

Here, $V_i$ is the semantic vector of the i-th relation/entity in the set to be merged, $V$ is the accumulated semantic vector of all relations/entities in the set to be merged, and Sim(a, b) denotes the similarity of vectors a and b. Various similarity measures can be used here, such as Cosine, Cityblock, Euclidean, Mahalanobis, Minkowski, or Chebychev; preferably, Cosine similarity is used.

Performing relation/entity deduplication with the computed typical edge/typical node effectively reduces the storage space of the knowledge graph and yields a compact representation without loss of representativeness.
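The selection of the typical element, that is, the member of a cluster most similar to the accumulated vector of the whole cluster, can be sketched as follows. The function name `select_typical` and the toy vectors are assumptions for illustration; Cosine similarity is used as the preferred measure.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def select_typical(vectors):
    """Return the index of the vector most similar to the sum of all vectors."""
    dim = len(vectors[0])
    # Accumulated semantic vector V of the whole cluster.
    total = [sum(v[d] for v in vectors) for d in range(dim)]
    return max(range(len(vectors)),
               key=lambda k: cosine_similarity(vectors[k], total))


# Toy cluster of three made-up 2-dimensional semantic vectors.
cluster = [[1.0, 0.0], [0.8, 0.6], [0.6, 0.8]]
typical_index = select_typical(cluster)
```

Only the element at `typical_index` would be kept to represent the cluster; the other members are the redundant entries that deduplication removes.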

(3.3) Semantic tagging

The semantic similarity between an input edge/node and the known edge/node models is compared to determine the model it corresponds to, and it is then given the corresponding label from the predefined set of known types. The benefit is that edges/nodes in the knowledge graph can be represented and managed uniformly. Semantic tagging proceeds as follows:

(3.3.1) Edge/node model construction:

For the clustered edges/nodes, an edge/node model (that is, a relation/entity model) is built from the corresponding set of semantic vectors. Various methods can be used to build the model, such as a mean vector model, a Gaussian model, an artificial neural network, or a support vector machine; preferably, a mean vector model is used. At the same time, the corresponding type label of each class of relation/entity is assigned by hand. The mean vector model is:

$$\bar{m}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} m_{i,j}$$

where $m_{i,j}$ denotes the j-th vector in class i, $n_i$ is the number of samples in that class, and $\bar{m}_i$ is the mean vector.

After the model has been built, it is added to the edge/node model library.
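A mean vector model and a model library can be sketched in a few lines. Representing the library as a dictionary from type label to mean vector is an assumption of this sketch, as are the function name and the toy vectors.

```python
def build_mean_vector_model(vectors):
    """Return the component-wise mean of the semantic vectors of one class."""
    n = len(vectors)
    dim = len(vectors[0])
    return [sum(v[d] for v in vectors) / n for d in range(dim)]


# Hypothetical in-memory model library: hand-assigned type label -> mean vector.
model_library = {}
model_library["metropolitan area"] = build_mean_vector_model(
    [[1.0, 3.0], [3.0, 5.0]]  # made-up semantic vectors of one clustered class
)
```

The label is supplied by hand, as the text describes; only the mean vector is computed.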

(3.3.2) Edge/node identification

For an edge/node to be identified, its semantic vector representation is first obtained by the steps of semantic space mapping, and the vector is then compared in turn with the edge/node models in the model library. For example, for a mean vector model or a Gaussian model, the similarity between vectors can be compared directly, or the probability that the input vector belongs to the model can be computed, and after traversing all models the class with the highest value is output; an artificial neural network or a support vector machine outputs the corresponding class directly.

Taking the mean vector model as an example, the output class Class is:

$$\mathrm{Class} = \arg\max_{i} \, \mathrm{Sim}(V, \bar{m}_i)$$

where $V$ is the semantic vector to be identified, $\bar{m}_i$ is the mean vector of the class-i edges/nodes, i ∈ {1, 2, ..., N}, N is the number of models in the edge/node model library, and Sim(a, b) denotes the similarity of vectors a and b. Various similarity measures can be used here, such as Cosine, Cityblock, Euclidean, Mahalanobis, Minkowski, or Chebychev; preferably, Cosine similarity is used.

(3.3.3) Edge/node semantic tagging

For the class output in the previous step, the corresponding pre-assigned type label is retrieved from the edge/node model library and assigned to the input edge/node, completing the semantic tagging process.
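Identification and tagging with mean vector models can be sketched as a search for the most similar model followed by a label lookup. The dictionary-based model library, the function name `identify_and_label`, the label "head of state", and all vectors below are assumptions made for this example.

```python
import math


def cosine_similarity(x, y):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


def identify_and_label(vector, model_library):
    """Return (label, similarity) of the mean-vector model closest to `vector`."""
    # Traverse all models and keep the label whose mean vector is most similar.
    label = max(model_library,
                key=lambda name: cosine_similarity(vector, model_library[name]))
    return label, cosine_similarity(vector, model_library[label])


# Hypothetical model library: type label -> mean vector (made-up numbers).
models = {
    "metropolitan area": [2.0, 4.0],
    "head of state": [5.0, 1.0],
}
label, score = identify_and_label([1.0, 2.5], models)
```

Returning the similarity alongside the label lets a caller apply an acceptance threshold before assigning the label.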

The present invention also provides a knowledge graph management system based on semantic space mapping corresponding to the above method. The system consists of three modules: a semantic vector construction module, a semantic space mapping module, and a knowledge graph management module. The knowledge graph management module in turn comprises three submodules: a semantic clustering submodule, a semantic deduplication submodule, and a semantic tagging submodule.

The details are as follows:

(1) Semantic vector construction module:

This module builds a semantic vector library from a corpus so that text units are mapped to vectors in the semantic space. The advantage is that the semantic similarity between text units can be compared by the distance between their corresponding vectors in the semantic space: semantically close words have semantic vectors that are also close in the space, which overcomes the influence of word inflection, synonym substitution, and changes in grammatical form that arises when words are compared directly.

(2) Semantic space mapping module:

This module maps the text representing an edge/node in the knowledge graph to a vector in the semantic space, following step (2) of the method.

(3) Knowledge graph management module:

This module is responsible for managing the knowledge graph and comprises three submodules: a semantic clustering submodule, a semantic deduplication submodule, and a semantic tagging submodule, which correspond respectively to the three sub-steps of the knowledge graph management process;

(3.1) Semantic clustering submodule

Semantic clustering is a further semantic mining step on top of knowledge graph construction and is very important for managing the knowledge graph. It comprises edge clustering (relation clustering) and node clustering (entity clustering). For edge clustering, the edges connecting different node pairs can be clustered to discover entity pairs that share similar semantic relations; the multiple edges of a single node can be clustered to mine the main categories of entities related to that node; and even the multiple edges connecting the same pair of nodes can be clustered to mine the main categories of relations between them. Node clustering discovers semantically similar entities;

(3.2) Semantic deduplication submodule

Semantic deduplication selects the typical edge/typical node of each cluster as follows:

$$\mathrm{Typical} = \arg\max_{i} \, \mathrm{Sim}(V_i, V)$$

Here, $V_i$ is the semantic vector of the i-th relation/entity in the set to be merged, $V$ is the accumulated semantic vector of all relations/entities in the set to be merged, and Sim(a, b) denotes the similarity of vectors a and b. Various similarity measures can be used here, such as Cosine, Cityblock, Euclidean, Mahalanobis, Minkowski, or Chebychev; preferably, Cosine similarity is used;

Performing relation/entity deduplication with the computed typical edge/typical node effectively reduces the storage space of the knowledge graph and yields a compact representation without loss of representativeness;

(3.3) Semantic tagging submodule

This submodule compares the semantic similarity between an input edge/node and the known edge/node models to determine the model it corresponds to, and then gives it the corresponding label from the predefined set of known types. The benefit is that edges/nodes in the knowledge graph can be represented and managed uniformly. The submodule comprises the following parts:

(3.3.1) Edge/node model construction

(3.3.2) Edge/node identification

(3.3.3) Edge/node semantic tagging
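The three-module decomposition described above could be organized roughly as in the following sketch. The class names, method names, and toy vectors are all assumptions, and only the deduplication submodule of the management module is shown to keep the example short; it is not a description of the actual system.

```python
import math


def _cosine(x, y):
    dot = sum(a * b for a, b in zip(x, y))
    norm = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return dot / norm if norm else 0.0


class SemanticVectorModule:
    """Module (1): owns the semantic vector library (word -> vector)."""

    def __init__(self, library):
        self.library = library


class SemanticSpaceMappingModule:
    """Module (2): filters stop words and sums the remaining word vectors."""

    def __init__(self, vector_module, stop_words=()):
        self.vectors = vector_module.library
        self.stop_words = set(stop_words)

    def map(self, text):
        words = [w for w in text.lower().split()
                 if w not in self.stop_words and w in self.vectors]
        dim = len(next(iter(self.vectors.values())))
        return [sum(self.vectors[w][d] for w in words) for d in range(dim)]


class KnowledgeGraphManagementModule:
    """Module (3): only the deduplication submodule is sketched here."""

    def __init__(self, mapper):
        self.mapper = mapper

    def deduplicate(self, texts):
        """Return the text most similar to the accumulated vector of `texts`."""
        vecs = [self.mapper.map(t) for t in texts]
        total = [sum(col) for col in zip(*vecs)]
        best = max(range(len(texts)), key=lambda k: _cosine(vecs[k], total))
        return texts[best]


# Made-up 2-dimensional vectors, for illustration only.
vectors = SemanticVectorModule(
    {"chairman": [1.0, 1.0], "chair": [1.0, 0.0], "ceo": [0.0, 1.0]}
)
mapper = SemanticSpaceMappingModule(vectors, stop_words={"the"})
manager = KnowledgeGraphManagementModule(mapper)
typical = manager.deduplicate(["chair", "the chairman", "CEO"])
```

Keeping the vector library behind the mapping module means the management submodules only ever see vectors, which mirrors the separation of steps (1), (2), and (3) in the method.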

Beneficial effects of the invention

By mapping the text representing knowledge graph edges/nodes to semantic vectors, the present invention overcomes the sensitivity of traditional knowledge graph management methods to factors such as word inflection, synonym substitution, and changes in grammatical form when comparing semantics. Vector accumulation copes easily with differing word counts and makes further knowledge graph management tasks such as semantic clustering, semantic deduplication, and semantic tagging easy to implement, which both increases processing flexibility and improves the accuracy of semantic comparison.

Brief description of the drawings

Fig. 1: System module diagram.

Fig. 2: Hierarchical clustering result (edge clustering). The abscissa is the index of the entity pair, and the ordinate is the inter-class distance.

Fig. 3: Hierarchical clustering result (node clustering). The abscissa is the index of the entity, and the ordinate is the inter-class distance.

Fig. 4: Semantic deduplication, selection of the typical edge. The abscissa is the index of the entity, and the ordinate is the similarity.

Detailed description of the embodiments

A specific embodiment of the invention is demonstrated below with examples; the modules of the system perform the following processing in turn:

(1) Semantic vector construction

Word2Vec is trained on the text corpus of the entire English Wikipedia (http://www.wikipedia.org/); the output vectors have 500 dimensions.

(2) Semantic space mapping

For the words on an edge/node, after stop words are removed, the corresponding semantic vectors are retrieved from the trained semantic vector library and summed, giving the semantic vector representation of the edge/node.

(3) Semantic clustering

(3.1) Edge semantic clustering

Example input, in the format:

No.：{node 1}, {edge}, {node 2}

1：{Shanghai}, {large city}, {China}

2：{ipad}, {product}, {Apple}

3：{Barack Obama}, {president}, {USA}

4：{Kindle}, {manufacture}, {Amazon}

5：{New York}, {metropolis}, {USA}

6：{Dmitry Medvedev}, {Prime Minister}, {Russia}

The hierarchical clustering result (edge clustering) is shown in Fig. 2.

With a threshold of 0.8, the clustering result is as follows:

Class 1: 2, 4

Class 2: 1, 5

Class 3: 3, 6

The clustering result is correct;

(3.2) Node semantic clustering

Six input nodes:

1：{tuna}

2：{tiger}

3：{leopard}

4：{car}

5：{fish}

6：{train}

The hierarchical clustering result (node clustering) is shown in Fig. 3.

With a threshold of 0.8, the clustering result is as follows:

Class 1: 1, 5

Class 2: 2, 3

Class 3: 4, 6

The clustering result is correct.

(4) Semantic deduplication

For example, for the two nodes {Bill Gates} and {Microsoft} in the knowledge graph, the following edges between them are grouped into the same cluster after semantic clustering:

1：{CEO}

2：{executives}

3：{president}

4：{chief executive officer}

5：{current chairman}

6：{chairman}

7：{chair}

The selection of the typical edge for semantic deduplication is shown in Fig. 4.

The semantic vectors of all these edges are summed to obtain the overall representative semantic vector; the similarity of each edge to this overall vector is then computed in turn, and the edge with the highest similarity is chosen as the typical edge, here No. 6, {chairman}. In this way, a single typical edge replaces the seven edges originally grouped into the same cluster, achieving a compact representation of the knowledge graph and reduced storage space without loss of representativeness.

(5) Semantic tagging

For example, consider the edge set of one class of relation obtained by clustering:

1：{large city}

2：{metropolis}

3：{megacity}

4：{major city}

5：{big cities}

6：{megacities}

7：{mega cities}

A mean vector model is built from the corresponding set of semantic vectors, and the type label of the model is set to "metropolitan area".

For a newly input edge {big city}, the similarity between its semantic vector and the edge model is computed:

Sim = 0.8434

With a threshold of 0.8, the input edge is considered to have the same semantics as this class, so the model's type label "metropolitan area" is assigned to the input edge, completing the semantic tagging process. The benefit is that, by comparing the similarity between the input edge and the edge models, the input edge is given a label from the predefined set of known types, which makes it easier to represent and manage the edges of the knowledge graph uniformly.
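The threshold decision in this example can be written as a one-line rule. The function name `label_if_similar` is hypothetical, and returning `None` for an input below the threshold is an assumption, since the text does not say what happens in that case.

```python
def label_if_similar(similarity, label, threshold=0.8):
    """Assign `label` only when the similarity reaches the threshold."""
    return label if similarity >= threshold else None


# Values taken from the example: Sim = 0.8434, threshold = 0.8.
assigned = label_if_similar(0.8434, "metropolitan area")
```

An edge whose best similarity falls below the threshold would be left untagged under this rule.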

Bibliography

[1] Tomas Mikolov, et al. Efficient Estimation of Word Representations in Vector Space. In Proceedings of Workshop at ICLR, 2013.

[2] Tomas Mikolov, et al. Distributed Representations of Words and Phrases and their Compositionality. In Proceedings of NIPS, 2013.

[3] Tomas Mikolov, et al. Linguistic Regularities in Continuous Space Word Representations. In Proceedings of NAACL HLT, 2013.

Claims

1. A knowledge graph management method based on semantic space mapping, characterized in that the specific steps are: semantic vector construction, semantic space mapping, and knowledge graph management, wherein:

(1) Semantic vector construction proceeds as follows:

A semantic vector library is built from a corpus so that text units are mapped to vectors in the semantic space;

As training data for building semantic vectors, the Wikipedia knowledge base is used as the corpus for training semantic vectors with the Word2Vec method, and the semantic vector library is built from the training result;

(2) Semantic space mapping

The text representing an edge/node in the knowledge graph is mapped to a vector in the semantic space, as follows:

(2.1) The words in the edge/node of the knowledge graph are filtered to remove stop words that carry no semantic content;

(2.2) For each word retained after the processing of step (2.1), its projection vector in the semantic space is obtained from the semantic vector library already built; the semantic vectors of these words are then summed to obtain the overall semantic vector characterizing the edge/node;

(3) Knowledge graph management is divided into three sub-steps: semantic clustering, semantic deduplication, and semantic tagging;

(3.1) Semantic clustering proceeds as follows:

For the set of edges/nodes to be clustered, semantic space mapping is first performed using the semantic vector library already built, and the resulting semantic vectors are then clustered;

(3.2) Semantic deduplication proceeds as follows:

For the result of semantic clustering, each set of edges/nodes grouped into the same cluster is replaced by a computed typical edge/typical node in order to reduce the redundancy of semantic information, the selection criterion being:

$$\mathrm{Typical} = \arg\max_{k} \, \mathrm{Sim}(V_k, V)$$

The formula means that the k for which the function $\mathrm{Sim}(V_k, V)$ attains its maximum is taken as Typical, where Typical denotes the selected typical edge or typical node;

Here, $V_k$ is the semantic vector of the k-th relation/entity in the set to be merged, $V$ is the accumulated semantic vector of all relations/entities in the set to be merged, and Sim(a, b) denotes the similarity of vectors a and b;

(3.3) Semantic tagging proceeds as follows:

(3.3.1) Edge/node model construction:

For the clustered edges/nodes, an edge/node model is built from the corresponding set of semantic vectors;

At the same time, the corresponding type label of each class of relation/entity is assigned by hand;

$\bar{m}_i$ is the mean vector of the class-i edges/nodes, i ∈ {1, 2, ..., N}, and N is the number of models in the edge/node model library:

$$\bar{m}_i = \frac{1}{n_i} \sum_{j=1}^{n_i} m_{i,j}$$

where $m_{i,j}$ denotes the j-th vector in class i, $n_i$ is the number of samples in that class, and $\bar{m}_i$ is the mean vector;

After the model has been built, the edge/node model is added to the edge/node model library;

(3.3.2) Edge/node identification

For an edge/node to be identified, after its semantic vector representation has been obtained by the steps of semantic space mapping, the vector is compared in turn with the edge/node models in the model library, wherein, for a mean vector model or a Gaussian model, the similarity between vectors can be compared directly, or the probability that the input vector belongs to the model can be computed, and after traversing all models the class with the highest value is output; an artificial neural network or a support vector machine outputs the corresponding class directly;

(3.3.3) Edge/node semantic tagging

For the class output in step (3.3.2), the corresponding pre-assigned type label is retrieved from the edge/node model library and assigned to the input edge/node, completing the semantic tagging process.

2. The knowledge graph management method based on semantic space mapping according to claim 1, characterized in that in step (3.3.2), for the mean vector model, the output class is:

$$\mathrm{Class} = \arg\max_{i} \, \mathrm{Sim}(V, \bar{m}_i)$$

The formula means that the i for which the function $\mathrm{Sim}(V, \bar{m}_i)$ attains its maximum is taken as Class;

$V$ is the semantic vector to be identified, and Sim(a, b) denotes the similarity of vectors a and b.

3. A knowledge graph management system based on semantic space mapping, based on the method of claim 1, characterized in that it consists of the following three modules: a semantic vector construction module for performing step (1), a semantic space mapping module for performing step (2), and a knowledge graph management module for performing step (3), wherein the knowledge graph management module comprises three submodules: a semantic clustering submodule for performing step (3.1), a semantic deduplication submodule for performing step (3.2), and a semantic tagging submodule for performing step (3.3).