CN110795511A - Knowledge graph representation method based on cloud model - Google Patents
- Publication number: CN110795511A
- Application number: CN201911045361.7A
- Authority
- CN
- China
- Prior art keywords
- relation
- vector
- test
- triplet
- knowledge graph
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G06F16/288 — Entity relationship models (G06F16/00 Information retrieval; G06F16/28 Databases characterised by their database models; G06F16/284 Relational databases)
- G06F16/367 — Ontology (G06F16/30 Information retrieval of unstructured textual data; G06F16/36 Creation of semantic tools, e.g. ontology or thesauri)
- G06F18/23 — Clustering techniques (G06F18/00 Pattern recognition; G06F18/20 Analysing)
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- General Physics & Mathematics (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Computational Linguistics (AREA)
- Animal Behavior & Ethology (AREA)
- Artificial Intelligence (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Bioinformatics & Computational Biology (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Machine Translation (AREA)
Abstract
The invention provides a knowledge graph representation method based on a cloud model, comprising the following steps: acquiring a data set and randomly dividing it into a training set and a test set according to a given proportion; dividing each relation in the training set into a plurality of semantics to obtain a Gaussian mixture model of the relation; for each relation, computing the main semantic that best expresses it; and computing, based on the cloud model, the coordinates of the linguistic value of each main semantic and its degree of certainty. On the premise that a relation vector has multiple semantics, the method obtains the vector value that best expresses the semantics of the relation vector, introduces the notion of uncertainty, and incorporates the degree of certainty into a new scoring function, so that the representation of the knowledge graph is more accurate.
Description
Technical Field
The invention relates to the field of natural language processing, and in particular to a knowledge graph representation method based on a cloud model.
Background
With the development of the internet, network data has grown explosively. Internet content is large-scale, heterogeneous, and loosely organized, which poses great challenges for people trying to acquire information and knowledge effectively. The knowledge graph, with its strong semantic processing and open organization capabilities, lays a foundation for knowledge organization and intelligent applications in the internet era. The study and application of large-scale knowledge graphs has therefore attracted considerable attention in both academia and industry. A knowledge graph describes entities that exist in the real world and the relationships between them. The term was formally proposed by Google in 2012, originally to improve search engine capability and the search quality and experience of users. With the development and application of artificial intelligence technology, the knowledge graph (KG) has become one of its key technologies, and an increasing number of researchers are devoted to its study. Knowledge graphs provide a new mechanism for the effective representation of knowledge and are now widely used in expert systems, web search, question answering, and other fields.
In knowledge representation based on translation models, each piece of knowledge in the knowledge graph is represented by a triple (head, relation, tail), where head is the head entity, tail is the tail entity, and relation is the semantic relationship between them. Although conventional translation-based models have proven effective in many cases, they assume that one relation corresponds to only one translation vector and therefore cannot handle relations with multiple semantics. For example, for the has_part relation, (Sichuan, has_part, Chengdu) expresses a regional (containment) relationship, while (house, has_part, door) expresses a composition relationship. Furthermore, different relations have different degrees of certainty.
Disclosure of Invention
In order to solve the above problems, the invention provides a knowledge graph representation method based on a cloud model, comprising the following steps:
acquiring a data set and randomly dividing it into a training set and a test set according to a given proportion;
dividing each relation in the training set into a plurality of semantics to obtain a Gaussian mixture model of the relation;
for each relation, computing the main semantic that best expresses it;
and computing, based on the cloud model, the coordinates of the linguistic value of each main semantic and its degree of certainty.
Further, dividing each relation in the training set into a plurality of semantics to obtain the Gaussian mixture model of the relation specifically comprises the following steps:
clustering the triples in the training set to obtain a plurality of semantics, representing each semantic as a Gaussian distribution following the idea of a Gaussian mixture model, and representing the final relation as a mixture of several Gaussian distributions, with the specific formula as follows:
where t is the tail entity vector of the triple, h is the head entity vector, r is the relation vector, σ² is the variance, N(u_{r,m}, σ²) denotes a Gaussian distribution with mathematical expectation u_{r,m} and variance σ², M is the number of semantics contained in a single relation r, u_{r,m} is the translation vector of the m-th semantic, and λ_{r,m} is the weight of the m-th semantic, obtained by Bayesian nonparametric screening.
Further, computing, for each relation, the main semantic that best expresses it is specifically:
applying Bayesian nonparametric statistics to the training data set to obtain the weight of each semantic within each relation, and selecting the main semantic m* that best expresses the relation, with the specific formula as follows:
where u_{r,m*} denotes the vector of the main semantic, which replaces the relation vector r of the triple; (h, r, t) denotes the vector representation of the triple; and the norm term denotes the Euclidean distance between the head entity vector h and the tail entity vector t.
Further, computing, based on the cloud model, the coordinates of the linguistic value of each main semantic and its degree of certainty specifically comprises the following steps:
for the vector representation of a given triple, generating cloud drops with a two-dimensional normal cloud generator, specifically:
Then:
where the first output is the coordinate of the linguistic value of the main semantic m*, and the second is the degree of certainty with which that coordinate belongs to the linguistic value of m*;
thus the coordinate value of the most expressive main semantic m* is obtained:
further, the knowledge graph representing method based on the cloud model is characterized by further comprising:
and constructing a scoring function, preprocessing the test set to obtain a scoring ranking of the test triple, and evaluating the method by using an average ranking score (MeanRank) and a proportion (Hits @10) of which the ranking is not more than 10 as evaluation indexes.
Further, constructing the scoring function, preprocessing the test set to obtain a score ranking for each test triple, and evaluating the method with MeanRank and Hits@10 as evaluation indexes specifically comprises the following steps:
randomly extracting a triple (h, r, t) from the test set and randomly replacing its head entity (or tail entity) with another entity to construct a corrupted test triple (h', r, t');
applying the 'Filter' setting, specifically: before ranking each test triple, removing any corrupted triple that already appears as a correct triple in the training set or test set (excluding the test target (h, r, t) itself);
scoring each test triple with the scoring function P{(h, r, t)}, whose formula is specifically as follows:
where (h, r, t) is the vector representation of the triple in the test data set.
Compared with the prior art, the invention has the following beneficial effects:
on the premise that a relation vector has multiple semantics, the invention obtains the vector value that best expresses the semantics of the relation vector and introduces the notion of uncertainty, so that the representation of the knowledge graph is more accurate.
Drawings
FIG. 1 is a schematic flow diagram of the present invention.
Detailed Description
The present invention is described in further detail below with reference to the attached drawing figures.
In this disclosure, aspects of the present invention are described with reference to the accompanying drawings, in which a number of illustrative embodiments are shown. Embodiments of the present disclosure are not necessarily intended to include all aspects of the invention. It should be appreciated that the various concepts and embodiments described above, as well as those described in greater detail below, may be implemented in any of numerous ways, as the disclosed concepts and embodiments are not limited to any one implementation. In addition, some aspects of the present disclosure may be used alone, or in any suitable combination with other aspects of the present disclosure.
An embodiment of the present invention is described in further detail below with reference to FIG. 1, using the benchmark data sets introduced in step S1.
The invention discloses a knowledge graph representation method based on a cloud model. The method aims to obtain the coordinate value and degree of certainty that best express the semantics of a given relation, and to construct a high-quality fault diagnosis knowledge graph; it comprises the following steps:
S1: first, acquiring a data set of fault diagnosis knowledge and randomly dividing it into a training set and a test set according to a given proportion; specifically:
the experimental data sets are obtained from four common benchmark data sets of WordNet and Freebase (WN18, FB15k, WN11, FB13), each randomly divided into a training set and a test set at a ratio of 4:1. WordNet is an English lexical database grounded in cognitive linguistics and designed by psychologists, linguists, and computer engineers at Princeton University; Freebase is a collaborative knowledge base similar to Wikipedia (all content is added by users and may be freely cited under a Creative Commons license).
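As an illustration of the 4:1 split in S1, the following minimal Python sketch randomly partitions a list of triples (the entity names and the function name are placeholders for illustration, not taken from the benchmarks or the patent):

```python
import random

def split_triples(triples, train_ratio=0.8, seed=42):
    """Randomly split (head, relation, tail) triples into a training set
    and a test set at the stated 4:1 proportion."""
    rng = random.Random(seed)
    shuffled = triples[:]          # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    cut = int(len(shuffled) * train_ratio)
    return shuffled[:cut], shuffled[cut:]

triples = [("Sichuan", "has_part", "Chengdu"),
           ("house", "has_part", "door"),
           ("Paris", "capital_of", "France"),
           ("dog", "is_a", "animal"),
           ("wheel", "part_of", "car")]
train, test = split_triples(triples)
```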
S2: dividing each relation in the training set into a plurality of semantics to obtain a Gaussian mixture model of the relation, which specifically comprises the following steps:
s11: clustering and expressing triples in a training set to obtain a plurality of semantemes, expressing each semanteme into Gaussian distribution by adopting the thought of a Gaussian mixture model, and expressing the final relation into a mixed form of a plurality of Gaussian distributions, wherein the specific formula is as follows:
wherein t represents the tail entity vector in the triplet, h represents the head entity vector in the triplet, r represents the relationship vector in the triplet, σ is the variance, and N (u)r,m,σ2) Representing a mathematical expectation of ur,mVariance is σ2M denotes the number of semantics contained by a single relation r, ur,mThe translation vector, λ, representing the mth semanticr,mWeight, λ, representing the mth semanticr,mObtained by Bayesian statistical screening.
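The mixture form described above (the formula image is not reproduced in this text) can be sketched as follows, assuming isotropic Gaussians with a fixed variance; the density is evaluated on the difference vector t − h, and all names and the variance value are illustrative assumptions:

```python
import math

def gaussian_pdf(x, mean, var):
    # Isotropic multivariate Gaussian density N(mean, var * I) at point x.
    d = len(x)
    norm = (2 * math.pi * var) ** (-d / 2)
    sq = sum((xi - mi) ** 2 for xi, mi in zip(x, mean))
    return norm * math.exp(-sq / (2 * var))

def mixture_score(h, t, components, var=0.25):
    """Weighted sum of Gaussians, one per semantic of the relation.
    components: list of (lambda_{r,m}, u_{r,m}) pairs — weight and
    translation vector of the m-th semantic."""
    diff = [ti - hi for hi, ti in zip(h, t)]
    return sum(lam * gaussian_pdf(diff, u, var) for lam, u in components)
```

A triple whose difference vector matches one of the semantic translation vectors scores higher than one that matches none, which is the behaviour the mixture is meant to capture.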
S3: calculating the meanings of each relation which can best express the relation, and specifically comprising the following steps:
s31: carrying out statistics on the training data set by using Bayesian non-parameter statistics to obtain the weight of each semantic in each relation and obtain a subject semantic m capable of expressing the relation most*The concrete formula is as follows:
wherein,representing main semantics by vectors of the main semanticsInstead of the relation vector r of the triplet, (h, r, t) represents the vector representation of the triplet, whereRepresenting the euclidean distance between the head entity vector h and the tail entity vector t.
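The patent's selection formula is not reproduced above; under a natural reading (pick the semantic with the largest Bayesian weight, then replace r by its translation vector when measuring distance), a sketch might look like the following — both function names and the translation-style distance are assumptions for illustration:

```python
def main_semantic(weights):
    """Return the index m* of the semantic with the largest weight
    lambda_{r,m}. The weights would come from Bayesian nonparametric
    estimation on the training set; here they are given directly."""
    return max(range(len(weights)), key=lambda m: weights[m])

def translation_distance(h, u, t):
    # Euclidean distance between the translated head h + u and the tail t,
    # a common translation-model reading of the distance in the text.
    return sum((hi + ui - ti) ** 2 for hi, ui, ti in zip(h, u, t)) ** 0.5
```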
S4: calculating the coordinate of the language value of each main semantic and the determination degree thereof, specifically comprising the following steps:
s41: generating cloud droplets by a two-dimensional normal cloud generator for a vector representation of a given tripletThe method specifically comprises the following steps:
Then:
whereinMeaning m as subject*The coordinates of the language value of (a),is composed ofBelonging to a main semantic m*The degree of certainty of the language value of (a);
thus, the most expressible main semantic m is obtained*Coordinate values of (2):
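The patent's two-dimensional generator is not reproduced in this text. As an illustration of the underlying mechanism, a one-dimensional forward normal cloud generator can be sketched as follows; the parameters Ex (expectation), En (entropy), and He (hyper-entropy) are the standard cloud-model ones, not symbols taken from the patent:

```python
import math
import random

def normal_cloud(ex, en, he, n, seed=0):
    """Forward normal cloud generator: returns n cloud drops (x, mu),
    where x is a coordinate of the linguistic value and mu its degree
    of certainty of belonging to that linguistic value."""
    rng = random.Random(seed)
    drops = []
    for _ in range(n):
        en_prime = abs(rng.gauss(en, he)) or 1e-12  # guard against zero
        x = rng.gauss(ex, en_prime)                 # drop coordinate
        mu = math.exp(-(x - ex) ** 2 / (2 * en_prime ** 2))  # certainty
        drops.append((x, mu))
    return drops
```

Each drop's certainty lies in (0, 1], peaking at the expectation Ex, which matches the role the degree of certainty plays in the scoring function of S5.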
S5: constructing a scoring function, preprocessing the test set to obtain a score ranking for each test triple, and taking the mean rank (MeanRank) and the proportion of correct triples ranked in the top 10 (Hits@10) as evaluation indexes of algorithm performance, which specifically comprises the following steps:
S51: randomly extracting a test triple (h, r, t) from the test set and randomly replacing its head entity (or tail entity) with an entity from the test set to construct a new test triple (h', r, t').
S52: if a new test triple (h', r, t') already exists in the knowledge graph, i.e. the triple is actually correct, it may reasonably rank above the original test triple. To eliminate this effect, before ranking each test triple, the correct triples already present in the training set, validation set, and test set are removed (excluding the test target (h, r, t) itself); this setting is called 'Filter', and the setting without such processing is called 'Raw'. The evaluation result under 'Filter' is undoubtedly the more meaningful of the two.
S53: each test triple is scored by the scoring function, with MeanRank and Hits@10 as evaluation indexes of the method's performance. The smaller the MeanRank value and the larger the Hits@10 value, the better the performance.
The formula of the scoring function P{(h, r, t)} is specifically:
where (h, r, t) is the vector representation of the triple in the test data set.
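The two evaluation indexes can be computed directly from the rank of each correct test triple among its corrupted candidates; a minimal sketch (function name is illustrative):

```python
def mean_rank_and_hits10(ranks):
    """MeanRank: average rank of the correct triple.
    Hits@10: fraction of correct triples ranked in the top 10."""
    mean_rank = sum(ranks) / len(ranks)
    hits10 = sum(1 for r in ranks if r <= 10) / len(ranks)
    return mean_rank, hits10
```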
The evaluation results are shown in Table 1 (mean prediction results of the different methods). The compared models are:
TransE model: embeds the entities and relations of the knowledge graph into the same low-dimensional vector space and judges whether a triple is reasonable by computing Euclidean distances between the vectors.
TransH model: holds that different relations should express an entity differently, and projects the entity onto the hyperplane of the corresponding relation.
TransR model: regards an entity as a composite of multiple attributes; different relations focus on different attributes, so entities and relations are projected into different spaces.
TransG model: refines each relation into multiple semantics and selects the optimal relation semantic from the refined result.
The invention does not perform as well on the MeanRank index under the WN18 data set because WN18 contains only a small number of relations, so different relation types are conflated and a few extremely low-ranked triples skew the mean. On FB15k, which contains many complex relations, the invention achieves the best results on all indexes.
In summary, the invention provides a knowledge graph representation method based on a cloud model that, on the premise that a relation vector has multiple semantics, obtains the vector value that best expresses the semantics of the relation vector, introduces the notion of uncertainty, and incorporates the degree of certainty into a new scoring function, so that the representation of the knowledge graph is more accurate.
The above description covers only preferred embodiments of the present invention and is not intended to limit the invention; any modifications, equivalent substitutions, improvements, and the like made within the spirit and principles of the present invention are intended to fall within its scope.
Claims (6)
1. A knowledge graph representation method based on a cloud model, characterized by comprising the following steps:
acquiring a data set and randomly dividing it into a training set and a test set according to a given proportion;
dividing each relation in the training set into a plurality of semantics to obtain a Gaussian mixture model of the relation;
for each relation, computing the main semantic that best expresses it;
and computing, based on the cloud model, the coordinates of the linguistic value of each main semantic and its degree of certainty.
2. The cloud model-based knowledge graph representation method of claim 1, characterized in that dividing each relation in the training set into a plurality of semantics to obtain the Gaussian mixture model of the relation specifically comprises the following steps:
clustering the triples in the training set to obtain a plurality of semantics, representing each semantic as a Gaussian distribution following the idea of a Gaussian mixture model, and representing the final relation as a mixture of several Gaussian distributions, with the specific formula as follows:
where t is the tail entity vector of the triple, h is the head entity vector, r is the relation vector, σ² is the variance, N(u_{r,m}, σ²) denotes a Gaussian distribution with mathematical expectation u_{r,m} and variance σ², M is the number of semantics contained in a single relation r, u_{r,m} is the translation vector of the m-th semantic, and λ_{r,m} is the weight of the m-th semantic, obtained by Bayesian nonparametric screening.
3. The cloud model-based knowledge graph representation method of claim 1, characterized in that computing, for each relation, the main semantic that best expresses it is specifically:
applying Bayesian nonparametric statistics to the training data set to obtain the weight of each semantic within each relation, and selecting the main semantic m* that best expresses the relation, with the specific formula as follows:
where u_{r,m*} denotes the vector of the main semantic, which replaces the relation vector r of the triple.
4. The cloud model-based knowledge graph representation method of claim 1, characterized in that computing, based on the cloud model, the coordinates of the linguistic value of each main semantic and its degree of certainty specifically comprises the following steps:
for the vector representation of a given triple, generating cloud drops with a two-dimensional normal cloud generator, specifically:
Then:
where the first output is the coordinate of the linguistic value of the main semantic m*, and the second is the degree of certainty with which that coordinate belongs to the linguistic value of m*;
thus the coordinate value of the most expressive main semantic m* is obtained:
5. The cloud model-based knowledge graph representation method of claim 4, further comprising:
constructing a scoring function, preprocessing the test set to obtain a score ranking for each test triple, and evaluating the method with the mean rank (MeanRank) and the proportion of correct triples ranked in the top 10 (Hits@10) as evaluation indexes.
6. The cloud model-based knowledge graph representation method of claim 5, characterized in that constructing the scoring function, preprocessing the test set to obtain a score ranking for each test triple, and evaluating the method with MeanRank and Hits@10 as evaluation indexes specifically comprises:
randomly extracting a triple (h, r, t) from the test set and randomly replacing its head entity (or tail entity) with another entity to construct a corrupted test triple (h', r, t');
applying the 'Filter' setting, specifically: before ranking each test triple, removing any corrupted triple that already appears as a correct triple in the training set or test set (excluding the test target (h, r, t) itself);
scoring each test triple with the scoring function P{(h, r, t)}, whose formula is specifically as follows:
where (h, r, t) is the vector representation of the triple in the test data set.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911045361.7A CN110795511A (en) | 2019-10-30 | 2019-10-30 | Knowledge graph representation method based on cloud model |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201911045361.7A CN110795511A (en) | 2019-10-30 | 2019-10-30 | Knowledge graph representation method based on cloud model |
Publications (1)
Publication Number | Publication Date |
---|---|
CN110795511A (en) | 2020-02-14
Family
ID=69442196
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201911045361.7A Pending CN110795511A (en) | 2019-10-30 | 2019-10-30 | Knowledge graph representation method based on cloud model |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN110795511A (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN112348190A (en) * | 2020-10-26 | 2021-02-09 | 福州大学 | Uncertain knowledge graph prediction method based on improved embedded model SUKE |
CN112463979A (en) * | 2020-11-23 | 2021-03-09 | 东南大学 | Knowledge representation method based on uncertainty ontology |
- 2019-10-30: Application filed — CN201911045361.7A, patent CN110795511A (status: Pending)
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN107329995B (en) | A semantic-controlled answer generation method, apparatus and system | |
CN111767405A (en) | Training method, device and equipment of text classification model and storage medium | |
CN110929161B (en) | Large-scale user-oriented personalized teaching resource recommendation method | |
Roelofs | Measuring Generalization and overfitting in Machine learning | |
WO2018196718A1 (en) | Image disambiguation method and device, storage medium, and electronic device | |
US11308367B2 (en) | Learning apparatus, system for generating captured image classification apparatus, apparatus for generating captured image classification apparatus, learning method, and program | |
CN114998602B (en) | Domain adaptive learning method and system based on low confidence sample contrast loss | |
Liu et al. | Efficient combinatorial optimization for word-level adversarial textual attack | |
CN114579833B (en) | Microblog public opinion visual analysis method based on topic mining and emotion analysis | |
Li-guo et al. | A new naive Bayes text classification algorithm | |
CN110795511A (en) | Knowledge graph representation method based on cloud model | |
CN112732914A (en) | Text clustering method, system, storage medium and terminal based on keyword matching | |
CN103268346B (en) | Semisupervised classification method and system | |
Krishna et al. | Revisiting the importance of encoding logic rules in sentiment classification | |
Xu et al. | Large-margin multi-view Gaussian process for image classification | |
CN104537280A (en) | Protein interactive relationship identification method based on text relationship similarity | |
CN112463914B (en) | Entity linking method, device and storage medium for internet service | |
Scott et al. | GAN-SMOTE: A Generative Adversarial Network approach to Synthetic Minority Oversampling. | |
Liu et al. | Noise learning for text classification: A benchmark | |
CN116935057A (en) | Target evaluation method, electronic device, and computer-readable storage medium | |
CN115952908A (en) | Learning path planning method, system, device and storage medium | |
CN114595336A (en) | Multi-relation semantic solution model based on Gaussian mixture model | |
Selvan et al. | Improved cuckoo search optimization algorithm based multi-document summarization model | |
CN106202234B (en) | Interactive information retrieval method based on sample-to-classifier correction | |
Li | Predicting Emotions from Twitter Posts: A Comparative Study of Machine Learning Methods |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||