CN110795511A - Knowledge graph representation method based on cloud model - Google Patents

Knowledge graph representation method based on cloud model

Info

Publication number
CN110795511A
Authority
CN
China
Prior art keywords
relation
vector
test
triplet
knowledge graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201911045361.7A
Other languages
Chinese (zh)
Inventor
刘学军
周航
蒋军成
李斌
王志荣
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Tech University
Original Assignee
Nanjing Tech University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Tech University
Priority to CN201911045361.7A
Publication of CN110795511A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/288 Entity relationship models
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367 Ontology
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • G06F18/23 Clustering techniques

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • General Physics & Mathematics (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Computational Linguistics (AREA)
  • Animal Behavior & Ethology (AREA)
  • Artificial Intelligence (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Evolutionary Biology (AREA)
  • Evolutionary Computation (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a knowledge graph representation method based on a cloud model, comprising the following steps: acquiring a data set and randomly dividing it into a training set and a test set in proportion; dividing each relation in the training set into a plurality of semantics to obtain a Gaussian mixture model of the relation; calculating, for each relation, the main semantic that best expresses the relation; and calculating, based on the cloud model, the coordinates of the linguistic value of each main semantic and its degree of certainty. Given that a relation vector may carry multiple semantics, the method obtains the vector value that best expresses the semantics of the relation vector, introduces the idea of uncertainty, and incorporates the degree of certainty into a new scoring function, so that the knowledge graph is represented more accurately.

Description

Knowledge graph representation method based on cloud model
Technical Field
The invention relates to the field of natural language processing, in particular to a knowledge graph representation method based on a cloud model.
Background
With the development of the internet, the volume of network data has grown explosively. Internet content is large in scale, heterogeneous, diverse, and loosely organized, which makes it very challenging for people to acquire information and knowledge effectively. The knowledge graph, with its strong semantic processing capability and open organization capability, lays a foundation for knowledge-based organization and intelligent applications in the internet era. The study and application of large-scale knowledge graphs has therefore attracted considerable attention in academia and industry. A knowledge graph is intended to describe the entities that exist in the real world and the relationships between them. The knowledge graph was formally proposed by Google in 2012, with the original purpose of improving the capability of search engines and the search quality and search experience of users. With the development and application of artificial intelligence technology, the knowledge graph (KG) is becoming one of the key technologies, and more and more researchers are dedicated to studying it. Knowledge graphs provide a new mechanism for the effective representation of knowledge and are now widely used in fields such as expert systems, web search, and question answering.
In knowledge representation based on translation models, each piece of knowledge in the knowledge graph is generally represented by a triple (head, relation, tail), where head denotes the head entity of the triple, tail denotes the tail entity, and relation denotes the semantic relationship between the head entity and the tail entity. Although conventional translation-based models have proven effective in many cases, such models assume that one relation corresponds to only one translation vector, and therefore cannot handle relations that carry multiple semantics. For example, for the has_part relation, (Sichuan, has_part, Chengdu) expresses a regional relationship, whereas (house, has_part, door) expresses a compositional relationship. Furthermore, different relations have different degrees of certainty.
Disclosure of Invention
In order to solve the problems, the invention provides a knowledge graph representation method based on a cloud model, which comprises the following steps:
acquiring a data set, and randomly dividing the data set into a training set and a testing set according to a proportion;
dividing each relation in the training set into a plurality of semantics to obtain a Gaussian mixture model of the relation;
calculating, for each relation, the main semantic that best expresses the relation;
and calculating, based on the cloud model, the coordinates of the linguistic value of each main semantic and its degree of certainty.
Further, the dividing each relation in the training set into a plurality of semantics to obtain the gaussian mixture model of the relation specifically includes the following steps:
clustering the triplets in the training set to obtain a plurality of semantics, representing each semantic as a Gaussian distribution following the idea of a Gaussian mixture model, and representing the final relation as a mixture of several Gaussian distributions, with the specific formula as follows:
Figure BDA0002253987310000021
where t denotes the tail entity vector of the triplet, h denotes the head entity vector, r denotes the relation vector, σ² is the variance, N(u_{r,m}, σ²) denotes a Gaussian distribution with mathematical expectation u_{r,m} and variance σ², M denotes the number of semantics contained in a single relation r, u_{r,m} denotes the translation vector of the m-th semantic, and λ_{r,m} denotes the weight of the m-th semantic, obtained by Bayesian statistical screening.
Further, calculating, for each relation, the main semantic that best expresses the relation specifically comprises:
performing Bayesian nonparametric statistics on the training data set to obtain the weight of each semantic in each relation, and thereby the main semantic m* that best expresses the relation, with the specific formulas as follows:
Figure BDA0002253987310000022
Figure BDA0002253987310000023
where the main semantic is represented by its vector,
Figure BDA0002253987310000025
which is used in place of the relation vector r of the triplet; (h, r, t) denotes the vector representation of the triplet, and
Figure BDA0002253987310000026
denotes the Euclidean distance between the head entity vector h and the tail entity vector t.
Further, calculating, based on the cloud model, the coordinates of the linguistic value of each main semantic and its degree of certainty specifically comprises the following steps:
generating cloud droplets by a two-dimensional normal cloud generator for a vector representation of a given triplet
Figure BDA0002253987310000028
The method specifically comprises the following steps:
given an expected value and a standard deviation of
Figure BDA00022539873100000210
generate the two-dimensional normal random entropy
Figure BDA00022539873100000211
then, given an expected value of
Figure BDA00022539873100000212
and a standard deviation, generate the two-dimensional normal random number
Figure BDA00022539873100000214
Then:
Figure BDA0002253987310000027
where
Figure BDA00022539873100000215
is the coordinate of the linguistic value of the main semantic m*, and
Figure BDA00022539873100000216
is the degree of certainty with which
Figure BDA00022539873100000217
belongs to the linguistic value of the main semantic m*.
Thus, the coordinate values of the main semantic m* that best expresses the relation are obtained:
Figure BDA0002253987310000031
Further, the cloud model-based knowledge graph representation method also comprises:
constructing a scoring function, preprocessing the test set to obtain a score ranking of the test triplets, and evaluating the method using the average ranking (MeanRank) and the proportion of rankings not greater than 10 (Hits@10) as evaluation metrics.
Further, constructing the scoring function, preprocessing the test set to obtain the score ranking of the test triplets, and evaluating the method using MeanRank and Hits@10 as evaluation metrics specifically comprises the following steps:
randomly extracting a triplet (h, r, t) from the test set, and randomly replacing the head entity (or the tail entity) of the triplet with an entity in the test set to construct a test triplet (h', r, t');
applying the 'Filter' setting, specifically: before ranking each test triplet, removing the correct triplets already present in the training set and the test set (except the test target (h, r, t));
scoring each test triplet with the scoring function, wherein the scoring function P{(h, r, t)} is given by:
Figure BDA0002253987310000032
where (h, r, t) is a vector representation of the triplet in the test dataset.
Compared with the prior art, the invention has the beneficial effects that:
Given that a relation vector may carry multiple semantics, the invention obtains the vector value that best expresses the semantics of the relation vector and, at the same time, introduces the idea of uncertainty, so that the knowledge graph is represented more accurately.
Drawings
FIG. 1 is a schematic flow diagram of the present invention.
Detailed Description
The present invention is described in further detail below with reference to the attached drawing figures.
In this disclosure, aspects of the present invention are described with reference to the accompanying drawings, in which a number of illustrative embodiments are shown. Embodiments of the present disclosure are not necessarily intended to include all aspects of the invention. It should be appreciated that the various concepts and embodiments described above, as well as those described in greater detail below, may be implemented in any of numerous ways, as the disclosed concepts and embodiments are not limited to any one implementation. In addition, some aspects of the present disclosure may be used alone, or in any suitable combination with other aspects of the present disclosure.
The data set Yoochoose is taken as an embodiment of the present invention and is further described with reference to fig. 1, and the following detailed description is given.
The invention discloses a knowledge graph representation method based on a cloud model. The method aims to obtain the coordinate values and the degree of certainty that best express the semantics of a given relation, and to construct a high-quality fault diagnosis knowledge graph. The method comprises the following steps:
S1: First, acquire a data set of fault diagnosis knowledge and randomly divide it into a training set and a test set in proportion. Specifically:
The experimental data sets are four common benchmark data sets (WN18, FB15k, WN11, FB13) drawn from WordNet and Freebase, and each data set is randomly divided into a training set and a test set at a ratio of 4:1. WordNet is an English dictionary based on cognitive linguistics, designed by psychologists, linguists, and computer engineers at Princeton University. Freebase is a collaboratively authored sharing website similar to Wikipedia (all content is added by users and can be freely reused under a Creative Commons license).
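The random 4:1 division described in S1 can be sketched as follows. This is a minimal illustration only: the function name `split_triples`, the fixed seed, and the toy triplets are assumptions introduced here and do not come from the patent.

```python
import random

def split_triples(triples, train_parts=4, test_parts=1, seed=0):
    """Randomly split triplets into (train, test) at train_parts:test_parts.

    The seed is fixed only so that the sketch is reproducible.
    """
    shuffled = list(triples)
    random.Random(seed).shuffle(shuffled)
    # Index at which the training portion ends (4/5 of the data for 4:1).
    cut = len(shuffled) * train_parts // (train_parts + test_parts)
    return shuffled[:cut], shuffled[cut:]

# Hypothetical toy triplets (head, relation, tail) for demonstration.
toy_triples = [("e%d" % i, "has_part", "e%d" % (i + 1)) for i in range(10)]
train_set, test_set = split_triples(toy_triples)
```

With ten toy triplets, eight land in the training set and two in the test set; every triplet appears in exactly one of the two.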
S2: Divide each relation in the training set into a plurality of semantics to obtain a Gaussian mixture model of the relation, which specifically comprises the following steps:
S21: Cluster the triplets in the training set to obtain a plurality of semantics, represent each semantic as a Gaussian distribution following the idea of a Gaussian mixture model, and represent the final relation as a mixture of several Gaussian distributions, with the specific formula as follows:
Figure BDA0002253987310000041
where t denotes the tail entity vector of the triplet, h denotes the head entity vector, r denotes the relation vector, σ² is the variance, N(u_{r,m}, σ²) denotes a Gaussian distribution with mathematical expectation u_{r,m} and variance σ², M denotes the number of semantics contained in a single relation r, u_{r,m} denotes the translation vector of the m-th semantic, and λ_{r,m} denotes the weight of the m-th semantic, obtained by Bayesian statistical screening.
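To make the mixture idea concrete, the sketch below evaluates a mixture of M isotropic Gaussians N(u_{r,m}, σ²) with weights λ_{r,m} at the difference t − h. The patent's own formula is an image that is not reproduced in this text, so the exact form used here (a weighted sum of component densities over t − h) is an assumption, as are the function names.

```python
import math

def gaussian_density(x, mean, sigma):
    """Density of an isotropic Gaussian N(mean, sigma^2 I) at point x."""
    dim = len(x)
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, mean))
    norm = (2.0 * math.pi * sigma ** 2) ** (dim / 2.0)
    return math.exp(-sq_dist / (2.0 * sigma ** 2)) / norm

def mixture_density(h, t, weights, means, sigma):
    """Assumed form: sum over m of lambda_{r,m} * N(t - h; u_{r,m}, sigma^2)."""
    diff = [b - a for a, b in zip(h, t)]  # t - h
    return sum(w * gaussian_density(diff, u, sigma)
               for w, u in zip(weights, means))

# Hypothetical 2-D example: one relation with two semantics.
density = mixture_density(
    h=[0.0, 0.0], t=[1.0, 0.0],
    weights=[0.7, 0.3], means=[[1.0, 0.0], [0.0, 1.0]], sigma=1.0,
)
```

A triplet whose difference t − h lies close to one of the translation vectors u_{r,m} receives a high density from that component, which is the intuition behind splitting one relation into several semantics.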
S3: Calculate, for each relation, the main semantic that best expresses the relation, which specifically comprises the following steps:
S31: Perform Bayesian nonparametric statistics on the training data set to obtain the weight of each semantic in each relation, and thereby the main semantic m* that best expresses the relation, with the specific formulas as follows:
Figure BDA0002253987310000051
Figure BDA0002253987310000052
where the main semantic is represented by its vector,
Figure BDA0002253987310000055
which is used in place of the relation vector r of the triplet; (h, r, t) denotes the vector representation of the triplet, and
Figure BDA0002253987310000056
denotes the Euclidean distance between the head entity vector h and the tail entity vector t.
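One plausible way to pick a main semantic is sketched below: choose the component m that maximises λ_{r,m} · exp(−‖h + u_{r,m} − t‖² / (2σ²)). This selection rule is an assumption in the style of mixture-based translation models; the patent's own formulas for m* are images that are not reproduced here, and the function name `main_semantic` is hypothetical.

```python
import math

def main_semantic(h, t, weights, means, sigma=1.0):
    """Return the index m maximising lambda_m * exp(-||h + u_m - t||^2 / (2 sigma^2)).

    Scores are compared in log space for numerical stability.
    Components with a non-positive weight are skipped.
    """
    best_index, best_score = -1, float("-inf")
    for m, (lam, u) in enumerate(zip(weights, means)):
        if lam <= 0.0:
            continue
        sq_dist = sum((hi + ui - ti) ** 2 for hi, ui, ti in zip(h, u, t))
        score = math.log(lam) - sq_dist / (2.0 * sigma ** 2)
        if score > best_score:
            best_index, best_score = m, score
    return best_index

# Hypothetical example: the second translation vector fits (h, t) exactly.
m_star = main_semantic(
    h=[0.0, 0.0], t=[0.0, 1.0],
    weights=[0.5, 0.5], means=[[1.0, 0.0], [0.0, 1.0]],
)
```

Once m* is chosen, its translation vector stands in for the relation vector r, as the paragraph above describes.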
S4: Calculate the coordinates of the linguistic value of each main semantic and its degree of certainty, which specifically comprises the following steps:
S41: Generate cloud droplets by a two-dimensional normal cloud generator for the vector representation of a given triplet
Figure BDA0002253987310000057
The method specifically comprises the following steps:
with an expected value of
Figure BDA0002253987310000058
and a standard deviation of
Figure BDA0002253987310000059
generate a two-dimensional normal random entropy;
then, with an expected value of
Figure BDA00022539873100000511
and a standard deviation of
Figure BDA00022539873100000512
generate a two-dimensional normal random number
Figure BDA00022539873100000513
Then:
Figure BDA0002253987310000053
where
Figure BDA00022539873100000514
is the coordinate of the linguistic value of the main semantic m*, and
Figure BDA00022539873100000515
is the degree of certainty with which
Figure BDA00022539873100000516
belongs to the linguistic value of the main semantic m*.
Thus, the coordinate values of the main semantic m* that best expresses the relation are obtained:
Figure BDA00022539873100000517
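The steps above follow the general shape of a forward two-dimensional normal cloud generator, which can be sketched as below. The sketch uses the textbook cloud-model parameters (expectation Ex, entropy En, hyper-entropy He) and the usual certainty formula; how the patent maps triplet vectors onto these parameters is given only in figures that are not reproduced, so the parameter names and the example values are assumptions.

```python
import math
import random

def cloud_drop_2d(ex, en, he, rng):
    """Generate one cloud droplet ((x1, x2), mu) from a 2-D normal cloud.

    ex, en, he are pairs: expectation, entropy, hyper-entropy per dimension.
    """
    # Step 1: random entropy En' ~ N(En, He^2) in each dimension.
    en1 = rng.gauss(en[0], he[0])
    en2 = rng.gauss(en[1], he[1])
    # Step 2: random coordinates x ~ N(Ex, En'^2) in each dimension.
    x1 = rng.gauss(ex[0], abs(en1))
    x2 = rng.gauss(ex[1], abs(en2))
    # Step 3: degree of certainty of the droplet.
    mu = math.exp(-((x1 - ex[0]) ** 2 / (2.0 * en1 ** 2)
                    + (x2 - ex[1]) ** 2 / (2.0 * en2 ** 2)))
    return (x1, x2), mu

# Hypothetical parameters; the seed is fixed only for reproducibility.
rng = random.Random(42)
drops = [cloud_drop_2d((0.0, 0.0), (1.0, 1.0), (0.1, 0.1), rng)
         for _ in range(200)]
```

Each droplet pairs a coordinate with a certainty in (0, 1]; droplets near the expectation have certainty close to 1, and the hyper-entropy controls how much the spread itself varies from droplet to droplet.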
S5: Construct a scoring function, preprocess the test set to obtain a score ranking of the test triplets, and use the average ranking (MeanRank) and the proportion of rankings not greater than 10 (Hits@10) as evaluation metrics of algorithm performance, which specifically comprises the following steps:
S51: A test triplet (h, r, t) is randomly extracted from the test set, and an entity in the test set randomly replaces the head (or tail) entity of the triplet to construct a new test triplet (h', r, t').
S52: if a new test triplet (h ', r, t') exists in the knowledge-graph, i.e. the triplet is actually correct, it is reasonable to rank it before the test triplet. To eliminate the effect of this problem, during the test process, before ranking each test triplet, the correct triplets already in the training set, the validation set, and the test set are rejected (excluding the test target (h, r, t)) and the result without such processing is "Filter", and the evaluation result of "Filter" is no doubt more important than "Raw".
S53: Each test triplet is scored by the scoring function, and the average ranking (MeanRank) and the proportion of rankings not greater than 10 (Hits@10) are used as evaluation metrics of the method's performance. The smaller the MeanRank value and the larger the Hits@10 value, the better the method performs.
The scoring function P{(h, r, t)} is given by:
Figure BDA0002253987310000061
where (h, r, t) is a vector representation of the triplet in the test dataset.
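Given scores for a test target and its candidates, MeanRank and Hits@10 can be computed as sketched below. The sketch assumes that a higher score means a more plausible triplet; the patent's scoring function is a figure not reproduced here, so that direction, the helper names, and the toy scores are assumptions.

```python
def rank_of_target(target, candidates, score):
    """1-based rank of the target among candidates (higher score ranks first)."""
    target_score = score(target)
    better = sum(1 for c in candidates
                 if c != target and score(c) > target_score)
    return 1 + better

def mean_rank(ranks):
    """Average rank of the test targets (smaller is better)."""
    return sum(ranks) / len(ranks)

def hits_at_k(ranks, k=10):
    """Proportion of test targets ranked no lower than k (larger is better)."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical scores for three candidates of one test target.
toy_scores = {("a", "r", "a"): 0.2, ("a", "r", "b"): 0.7, ("a", "r", "d"): 0.9}
target_rank = rank_of_target(("a", "r", "b"), list(toy_scores), toy_scores.get)

# Hypothetical ranks collected over four test triplets.
example_ranks = [1, 5, 20, 11]
example_mean_rank = mean_rank(example_ranks)
example_hits10 = hits_at_k(example_ranks, k=10)
```

In the toy example one candidate outscores the target, so the target ranks second; over the four example ranks, the mean rank is 9.25 and half of the targets fall within the top 10.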
The evaluation results are shown in Table 1.
Figure BDA0002253987310000062
Table 1. Average prediction results of the different methods.
TransE model: embeds the entities and relations of the knowledge graph into the same low-dimensional vector space and judges whether a triplet is plausible by computing the Euclidean distance between the vectors.
TransH model: holds that an entity should have different representations under different relations, and projects the entity onto the hyperplane of the corresponding relation.
TransR model: regards an entity as a complex of multiple attributes, with different relations focusing on different attributes of the entity, and projects entities and relations into different spaces.
TransG model: refines each relation into several semantics and selects the best relation semantic from the refined result.
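For orientation, the distance-based scores of the first two baselines can be sketched as follows, using their commonly cited forms: TransE measures ‖h + r − t‖, and TransH first projects h and t onto a relation-specific hyperplane with unit normal w. The function names and toy vectors are assumptions; a smaller distance indicates a more plausible triplet.

```python
import math

def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))

def transe_distance(h, r, t):
    """TransE: distance between h + r and t in one shared space."""
    return l2_norm([a + b - c for a, b, c in zip(h, r, t)])

def project_to_hyperplane(v, w):
    """Project v onto the hyperplane whose unit normal vector is w."""
    dot = sum(a * b for a, b in zip(v, w))
    return [a - dot * b for a, b in zip(v, w)]

def transh_distance(h, d_r, t, w):
    """TransH: translate by d_r between the projections of h and t."""
    h_proj = project_to_hyperplane(h, w)
    t_proj = project_to_hyperplane(t, w)
    return l2_norm([a + b - c for a, b, c in zip(h_proj, d_r, t_proj)])

# Hypothetical 2-D vectors.
d_transe = transe_distance([0.0, 0.0], [1.0, 0.0], [1.0, 0.0])
d_transh = transh_distance([0.0, 5.0], [1.0, 0.0], [1.0, -3.0], [0.0, 1.0])
```

In the TransH example the two entities differ strongly along the normal direction, yet their projections are related exactly by the translation, which illustrates how the hyperplane lets one entity behave differently under different relations.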
The invention performs less well on the MeanRank metric on the WN18 data set because WN18 contains only a small number of relations, so that the differences between relation types are neglected, and because MeanRank is affected by a few triplets with extremely low rankings. On FB15K, which contains many complex relations, the invention achieves the best results on all metrics.
The invention provides a knowledge graph representation method based on a cloud model. Given that a relation vector may carry multiple semantics, the method obtains the vector value that best expresses the semantics of the relation vector, introduces the idea of uncertainty, and incorporates the degree of certainty into a new scoring function, so that the knowledge graph is represented more accurately.
The above description is only for the purpose of illustrating the preferred embodiments of the present invention and is not to be construed as limiting the invention, and any modifications, equivalents, improvements and the like that fall within the spirit and principle of the present invention are intended to be included therein.

Claims (6)

1. A knowledge graph representation method based on a cloud model is characterized by comprising the following steps:
acquiring a data set, and randomly dividing the data set into a training set and a testing set according to a proportion;
dividing each relation in the training set into a plurality of semantics to obtain a Gaussian mixture model of the relation;
calculating, for each relation, the main semantic that best expresses the relation;
and calculating, based on the cloud model, the coordinates of the linguistic value of each main semantic and its degree of certainty.
2. The cloud model-based knowledge graph representation method of claim 1, wherein the dividing of each relationship in the training set into a plurality of semantics and the obtaining of the gaussian mixture model of the relationship specifically comprises the steps of:
clustering the triplets in the training set to obtain a plurality of semantics, representing each semantic as a Gaussian distribution following the idea of a Gaussian mixture model, and representing the final relation as a mixture of several Gaussian distributions, with the specific formula as follows:
Figure FDA0002253987300000011
where t denotes the tail entity vector of the triplet, h denotes the head entity vector, r denotes the relation vector, σ² is the variance, N(u_{r,m}, σ²) denotes a Gaussian distribution with mathematical expectation u_{r,m} and variance σ², M denotes the number of semantics contained in a single relation r, u_{r,m} denotes the translation vector of the m-th semantic, and λ_{r,m} denotes the weight of the m-th semantic, obtained by Bayesian statistical screening.
3. The cloud model-based knowledge graph representation method according to claim 1, wherein calculating, for each relation, the main semantic that best expresses the relation specifically comprises:
performing Bayesian nonparametric statistics on the training data set to obtain the weight of each semantic in each relation, and thereby the main semantic m* that best expresses the relation, with the specific formulas as follows:
Figure FDA0002253987300000012
Figure FDA0002253987300000013
where
Figure FDA0002253987300000014
denotes the main semantic, whose vector is used in place of the relation vector r of the triplet;
(h, r, t) denotes the vector representation of the triplet, and
Figure FDA0002253987300000016
denotes the Euclidean distance between the head entity vector h and the tail entity vector t.
4. The cloud model-based knowledge graph representation method according to claim 1, wherein the cloud model-based computation of the coordinates of the linguistic value of each primary semantic and the degree of certainty thereof specifically comprises the steps of:
generating cloud droplets by a two-dimensional normal cloud generator for a vector representation of a given triplet
Figure FDA0002253987300000017
The method specifically comprises the following steps:
with an expected value of
Figure FDA0002253987300000021
and a standard deviation of
Figure FDA0002253987300000022
generate the two-dimensional normal random entropy
Figure FDA0002253987300000023
then, with an expected value of
Figure FDA0002253987300000024
and a standard deviation of
Figure FDA0002253987300000025
generate the two-dimensional normal random number
Figure FDA0002253987300000026
Then:
Figure FDA0002253987300000027
where
Figure FDA0002253987300000028
is the coordinate of the linguistic value of the main semantic m*, and
Figure FDA0002253987300000029
is the degree of certainty with which
Figure FDA00022539873000000210
belongs to the linguistic value of the main semantic m*.
Thus, the coordinate values of the main semantic m* that best expresses the relation are obtained:
Figure FDA00022539873000000211
5. The cloud model-based knowledge graph representation method of claim 4, further comprising:
constructing a scoring function, preprocessing the test set to obtain a score ranking of the test triplets, and evaluating the method using the average ranking (MeanRank) and the proportion of rankings not greater than 10 (Hits@10) as evaluation metrics.
6. The cloud model-based knowledge graph representation method according to claim 5, wherein constructing the scoring function, preprocessing the test set to obtain the score ranking of the test triplets, and evaluating the method using MeanRank and Hits@10 as evaluation metrics specifically comprises the following steps:
randomly extracting a triplet (h, r, t) from the test set, and randomly replacing the head entity (or the tail entity) of the triplet with an entity in the test set to construct a test triplet (h', r, t');
applying the 'Filter' setting, specifically: before ranking each test triplet, removing the correct triplets already present in the training set and the test set (except the test target (h, r, t));
scoring each test triplet with the scoring function, wherein the scoring function P{(h, r, t)} is given by:
Figure FDA00022539873000000212
where (h, r, t) is a vector representation of the triplet in the test dataset.
CN201911045361.7A 2019-10-30 2019-10-30 Knowledge graph representation method based on cloud model Pending CN110795511A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911045361.7A CN110795511A (en) 2019-10-30 2019-10-30 Knowledge graph representation method based on cloud model

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911045361.7A CN110795511A (en) 2019-10-30 2019-10-30 Knowledge graph representation method based on cloud model

Publications (1)

Publication Number Publication Date
CN110795511A true CN110795511A (en) 2020-02-14

Family

ID=69442196

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911045361.7A Pending CN110795511A (en) 2019-10-30 2019-10-30 Knowledge graph representation method based on cloud model

Country Status (1)

Country Link
CN (1) CN110795511A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112348190A (en) * 2020-10-26 2021-02-09 福州大学 Uncertain knowledge graph prediction method based on improved embedded model SUKE
CN112463979A (en) * 2020-11-23 2021-03-09 东南大学 Knowledge representation method based on uncertainty ontology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination