CN111160564A - Chinese knowledge graph representation learning method based on feature tensor - Google Patents


Info

Publication number
CN111160564A
Authority
CN
China
Prior art keywords
entity
vector
triples
matrix
triple
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911300781.5A
Other languages
Chinese (zh)
Other versions
CN111160564B (en)
Inventor
李巧勤 (Li Qiaoqin)
郑子强 (Zheng Ziqiang)
刘勇国 (Liu Yongguo)
杨尚明 (Yang Shangming)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China filed Critical University of Electronic Science and Technology of China
Priority to CN201911300781.5A priority Critical patent/CN111160564B/en
Publication of CN111160564A publication Critical patent/CN111160564A/en
Application granted granted Critical
Publication of CN111160564B publication Critical patent/CN111160564B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 Machine learning
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/30 Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F 16/36 Creation of semantic tools, e.g. ontology or thesauri
    • G06F 16/367 Ontology

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Data Mining & Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Computing Systems (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Databases & Information Systems (AREA)
  • Machine Translation (AREA)

Abstract

The invention provides a Chinese knowledge graph representation learning method based on feature tensors, comprising the following steps: preparing data, establishing data structures, constructing an entity feature vector matrix, defining the relation vector and distance formula of labeled triples, obtaining a training set, training the knowledge graph representation learning model, updating the model parameters, training iteratively, predicting relations for unlabeled triples with the model, and iterating again until no new unlabeled triples can be learned. The invention forms a feature tensor from Chinese pinyin, character information, word information and description information and converts it into a feature vector, replacing the random initialization of entity vectors used in traditional knowledge representation learning and making full use of the characteristics of Chinese. In addition, a two-layer iteration scheme supplements the training corpus, so that the relation matrix is continuously corrected, improving both the precision and the convergence speed of the knowledge graph representation learning model.

Description

Chinese knowledge graph representation learning method based on feature tensor
Technical Field
The invention relates to the field of knowledge graphs, in particular to a Chinese knowledge graph representation learning method based on feature tensor.
Background
A knowledge graph describes the complex relations between concepts and entities in the objective world in a structured form, and provides the ability to better organize, manage and understand the massive information of the Internet. Knowledge graph technology generally comprises three research areas: knowledge representation, knowledge graph construction and knowledge graph application. Knowledge representation is the foundation of construction and application; it reflects human cognition of the objective world and can express the semantics of the objective world at different levels and granularities. One must first understand how humans represent knowledge and use it to solve problems, and then formalize that knowledge into a representation that a computer can reason over and compute with, so as to build knowledge-based systems and provide intelligent knowledge services. At the same time, knowledge representation must combine a computer's capabilities for symbol representation, processing and computation. The key problems knowledge representation must solve are: 1) what form of representation accurately reflects knowledge of the objective world; 2) what representation provides semantic expressiveness; 3) how the representation supports efficient knowledge reasoning and computation, so that new knowledge can be inferred. Current knowledge representation methods can be divided into knowledge representation based on symbolic logic, open knowledge representation methods for Internet resources, and knowledge-graph-based representation learning.
1) Knowledge representation based on symbolic logic: although symbolic-logic-based techniques describe logical reasoning well, machines are weak at generating rules during reasoning; inference rules require a great deal of manual effort to acquire and place high demands on data quality, so symbolic-logic-based knowledge representation cannot adequately solve the knowledge representation problem in today's era of large-scale data.
2) Knowledge representation of web content: Tim Berners-Lee proposed the concept of the Semantic Web, in which web content should have a definite meaning and be easily understood, acquired and integrated by computers. Web-content knowledge representation includes the tag-based semi-structured markup language XML, the RDF framework for semantic metadata describing web resources, the OWL ontology description language based on description logic, and so on. Industry has also adopted, at large scale, a triple-based knowledge graph representation in which a triple <h, r, t> states that a relation r holds between a head entity h and a tail entity t. These technologies allow us to publish semantic information on the World Wide Web that machines can understand and process. But web content runs to hundreds of trillions of items, which poses a huge challenge for knowledge storage and knowledge representation learning.
3) Representation learning: the goal of representation learning is to express the semantic information of the studied objects as dense low-dimensional vectors via machine learning or deep learning, giving knowledge units of different granularities an implicit vectorized representation that supports fast knowledge computation in big-data environments. The main approaches are tensor reconstruction and potential-energy functions. Tensor reconstruction integrates the information of the whole knowledge base, but in big-data settings the tensor dimensionality is high and reconstruction is computationally expensive. Potential-energy methods treat a relation as a translation from the head entity to the tail entity; the TransE model proposed by Bordes et al. is the representative translation model, but it lacks explicit semantic information. For languages such as Chinese, which carry pinyin, structural and word information, the low-dimensional vectors learned by machine learning or deep learning are merely parameters fitted by the computer and lack interpretability.
In summary, symbolic-logic-based knowledge representation and open knowledge representation methods for Internet resources give knowledge explicit semantic definitions, but suffer from data sparsity and are difficult to apply to large-scale knowledge graphs; deep-learning-based knowledge representation can map knowledge units (entities, relations and rules) into a low-dimensional continuous real-valued space, but lacks explicit semantic definitions.
In addition, there is a large body of research abroad on knowledge graph representation, but it is limited to English knowledge graphs. Owing to language differences, English words carry only simple string and phrase information, so randomly initialized vectors suffice for representation learning; Chinese, by contrast, contains rich semantic information, and existing learning methods do not perform well on Chinese knowledge graphs. Domestic work currently remains at the stage of constructing knowledge graphs, and research on Chinese knowledge graph representation learning is lacking.
Disclosure of Invention
To address these problems, the invention provides a Chinese knowledge graph representation learning method based on feature tensors. Compared with randomly initialized vectors, the invention introduces four features (Chinese pinyin, characters, words and description information) as explicit Chinese semantic information to form a feature tensor, making the learning process of the Chinese knowledge graph representation interpretable, and combines this with deep learning to map the learned knowledge representation into a low-dimensional continuous real-valued space, facilitating the learning of Chinese knowledge and the relations between knowledge items.
The invention provides a Chinese knowledge graph representation learning method based on feature tensors, comprising the following steps:
step 1) data preparation
Data from the open Chinese linked data set zhishi.me form the triple data, which consist of a large number of triples of the form <h, r, t>, where h denotes a head entity, t denotes a tail entity, and r denotes the relation between the head entity h and the tail entity t;
step 2) establishing a data structure
Dividing the triple data into labeled triples and unlabeled triples, and constructing the following data structures: a dictionary, an entity dictionary, a relation dictionary, an entity pinyin matrix, a character embedding matrix, a word embedding matrix and a description matrix;
step 3) constructing the entity feature vector matrix
For each entity in the labeled triples, the entity pinyin vector, character vector, word vector and description vector first form the feature tensor of the entity; the feature tensors of all entities in the labeled triples are then converted into entity feature vectors, and the entity feature vector matrix is constructed following the order of the entity dictionary;
step 4) take a labeled triple T_l = <h, r, t> and obtain the feature vectors h_ft and t_ft of the head entity h and the tail entity t from the entity feature vector matrix. To express that entity h stands in relation r to entity t, i.e. h + r = t, the relation vector of the labeled triple T_l = <h, r, t> can be written as:
r = t_ft - h_ft
To measure the distance between entity h and entity t, the relation between entities is expressed as a vector translation, and the distance of a triple <h, r, t> is defined by the Euclidean distance:
d(h + r, t) = || h_ft + r - t_ft ||_2^2
where the subscript 2 denotes the 2-norm (the Euclidean norm) and the superscript 2 denotes squaring;
step 5) take the labeled triples as the training set; initialize the entity vectors, namely the entity feature vector matrix, initialize the relation vectors and build the relation vector matrix in the same order as the relation dictionary, each relation being computed by the formula r = t_ft - h_ft. If several entity pairs share the same relation, the relation vector is the average of the difference vectors of those entity pairs; after all relation vectors are initialized they are normalized, which improves precision and strengthens convergence;
step 6) randomly select a positive triple <h, r, t> from the training set, draw the negative triples <h', r, t> and <h, r, t'>, and pair each with <h, r, t> to form a training batch T_batch = [(<h, r, t>, <h', r, t>), (<h, r, t>, <h, r, t'>)]. Denote the positive triples by S_p = {<h, r, t>} and the negative triples by S_f = {<h', r, t> | h' ∈ E} ∪ {<h, r, t'> | t' ∈ E}, where E denotes the entity set. T_batch is used as the input of the knowledge graph representation learning model, and the model is trained on T_batch. Combining the distance formula d(h + r, t) = || h_ft + r - t_ft ||_2^2, the loss function of the knowledge graph representation learning model is defined as:
L = Σ_{<h,r,t> ∈ S_p} Σ_{<h',r,t'> ∈ S_f} [ γ + d(h + r, t) - d(h' + r, t') ]_+
where γ is a separation margin greater than 0, a hyperparameter, and [x]_+ denotes the positive-part function, i.e. [x]_+ = x when x > 0 and [x]_+ = 0 when x ≤ 0. This training scheme is called the margin-based ranking criterion; it aims to separate positive triples from negative triples as far as possible and to find the maximum-margin support vectors;
step 7) update the parameters of the knowledge graph representation learning model by stochastic gradient descent (SGD); the gradient update only requires computing the distances d(h + r, t) and d(h' + r, t'). Given |E| entities and |R| relations, with entity vectors of dimension m and relation vectors of dimension n, a total of |E|·m + |R|·n parameters must be updated;
step 8) repeat steps 6) to 7) for iterative training; after the iterative training ends, use the learned model parameters to predict relations for the unlabeled triples as follows: take any unlabeled triple <h, r, t>_unlabel and predict the relation r' between h and t with the model; if r' = r, the prediction is correct. The correctly predicted triples are then taken as positive triples, their head or tail entities are randomly replaced to form negative triples, and the new positive and negative triples are merged into the original labeled triples to form a new labeled set;
step 9) repeat steps 4) to 8) with the new labeled triples for iterative training until no new unlabeled triples can be learned, which indicates that the knowledge graph representation learning model can learn no further Chinese knowledge features from the current training data; the entity vectors and relation vectors output by the model are then the best Chinese knowledge graph representation for the Chinese linked data set zhishi.me.
Addressing the inability of existing knowledge representation learning methods to exploit Chinese character and word information, the invention forms a feature tensor from Chinese pinyin, character information, word information and description information and converts it into a feature vector, replacing the random initialization of entity vectors used in traditional knowledge representation learning and making full use of the characteristics of Chinese. In addition, the invention supplements the training corpus through a two-layer iteration scheme, so that the relation matrix is continuously corrected, improving both the precision and the convergence speed of the knowledge graph representation learning model.
Drawings
FIG. 1 is a process flow diagram of the method of the present invention
FIG. 2 is a flow chart of the processing procedure of the method of the present invention
FIG. 3 illustrates the description matrix: entity description vectors encoded by the BiLSTM
FIG. 4 is a schematic diagram of converting the feature tensor into a feature vector
Detailed Description
The technical solutions in the embodiments of the present invention are described below clearly and completely with reference to the drawings; the described embodiments are obviously only some, not all, of the embodiments of the invention.
As shown in fig. 2, the feature-tensor-based Chinese knowledge graph representation learning method provided by the invention comprises the following steps:
step 1) data preparation
The triple data used by the method come from the open Chinese linked data set zhishi.me and consist of a large number of triples of the form <h, r, t>, where h denotes the head entity, t the tail entity, and r the relation between the head entity h and the tail entity t.
Step 2) establishing a data structure
As shown in fig. 1, the triple data are divided into labeled triples and unlabeled triples, and the following data structures are constructed: a dictionary, an entity dictionary, a relation dictionary, an entity pinyin matrix, a character embedding matrix, a word embedding matrix and a description matrix.
Labeled triples: triples are randomly extracted from the data set zhishi.me to obtain a triple data set, and all triples in it are taken as positive triples. For each positive triple, the head entity or the tail entity is removed and replaced with a different entity selected at random from the entity dictionary, forming a negative triple; only one entity is replaced per triple, so that the pair remains comparable. The triples are then labeled: positive triples as 1, negative triples as 0.
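A minimal sketch of this corruption step, assuming triples are plain (h, r, t) tuples; the entities and relations below are hand-written toy stand-ins for zhishi.me data, not examples from the patent:

```python
import random

# Toy stand-ins for triples drawn from zhishi.me (illustrative only)
positive_triples = [("成都", "位于", "四川"), ("四川", "省会", "成都")]
entity_list = ["成都", "四川", "北京", "中国"]

def corrupt_triple(triple, entities):
    """Build a negative triple by replacing exactly one entity (head or
    tail) of a positive triple with a different, randomly chosen entity."""
    h, r, t = triple
    if random.random() < 0.5:
        h = random.choice([e for e in entities if e != h])   # corrupt the head
    else:
        t = random.choice([e for e in entities if e != t])   # corrupt the tail
    return (h, r, t)

# Label the set: positive triples as 1, their corrupted copies as 0
labeled = [(tr, 1) for tr in positive_triples]
labeled += [(corrupt_triple(tr, entity_list), 0) for tr in positive_triples]
```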
Unlabeled triples: any unlabeled triple in the data set zhishi.me.
Dictionary: built from the data set zhishi.me; it covers all head entities, tail entities and relations, with entries of the form "character: serial number", the serial number being a number increasing from zero.
Entity dictionary: built from the data set zhishi.me; a dictionary of all head entities and tail entities, with entries of the form "entity name: serial number", the serial number being a number increasing from zero.
Relation dictionary: built from the data set zhishi.me; a dictionary of all relations, with entries of the form "relation name: serial number", the serial number being a number increasing from zero.
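The three dictionaries can be built with one helper. A sketch under the assumption that the plain dictionary indexes every character occurring in entities and relations (toy triples again stand in for the data set):

```python
def build_index(items):
    """Assign each distinct item a serial number, increasing from zero."""
    index = {}
    for item in items:
        if item not in index:
            index[item] = len(index)
    return index

triples = [("成都", "位于", "四川"), ("四川", "省会", "成都")]
entity_dict   = build_index([e for h, r, t in triples for e in (h, t)])
relation_dict = build_index([r for h, r, t in triples])
# the character dictionary covers every character of entities and relations
char_dict     = build_index([c for h, r, t in triples for c in h + r + t])
```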
Entity pinyin matrix: to resolve polyphonic characters with different meanings, the Baidu translation API is called to obtain each entity's pinyin and the entity pinyin matrix is constructed; its number of rows equals the number of entities in the entity dictionary, and each row is an entity pinyin vector obtained by one-hot coding.
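One plausible reading of the one-hot coding step, sketched below: each pinyin syllable is one-hot encoded over a syllable vocabulary and the indicators are merged into one fixed-length row per entity. The pinyin strings here are hand-written assumptions; the patent obtains them from the Baidu translation API, which is not reproduced in this sketch:

```python
import numpy as np

# Hypothetical pinyin for two toy entities (tone marked by a digit)
entity_pinyin = {"成都": ["cheng2", "du1"], "四川": ["si4", "chuan1"]}

syllables = sorted({s for ps in entity_pinyin.values() for s in ps})
syl_index = {s: i for i, s in enumerate(syllables)}

def pinyin_vector(entity):
    """One-hot each syllable and merge into one fixed-length row vector."""
    v = np.zeros(len(syllables))
    for s in entity_pinyin[entity]:
        v[syl_index[s]] = 1.0
    return v

# Rows follow the entity dictionary order
pinyin_matrix = np.stack([pinyin_vector(e) for e in entity_pinyin])
```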
Word embedding matrix: the number of lines corresponds to the number of words in the dictionary, and each line uses a word vector derived from word2 vec.
Word embedding matrix: the number of rows equals the number of entities in the entity dictionary, and each row is a word vector obtained from word2vec.
Description matrix: the number of rows equals the number of entities in the entity dictionary. The Baidu Baike API is called to obtain each entity's description, which is input into a bidirectional Long Short-Term Memory network (BiLSTM) for encoding, yielding the entity description vector, as shown in FIG. 3. This vector introduces the entity's description information and helps resolve Chinese synonyms.
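A minimal PyTorch sketch of the description encoder. The bidirectional LSTM is as stated above; the mean pooling over time steps and all dimensions are assumptions of this sketch, since the patent does not fix them:

```python
import torch
import torch.nn as nn

class DescriptionEncoder(nn.Module):
    """Encode a description (a sequence of token embeddings) into one
    entity description vector e_d with a bidirectional LSTM."""
    def __init__(self, emb_dim=100, hidden=50):
        super().__init__()
        self.lstm = nn.LSTM(emb_dim, hidden,
                            bidirectional=True, batch_first=True)

    def forward(self, token_embs):          # (batch, seq_len, emb_dim)
        out, _ = self.lstm(token_embs)      # (batch, seq_len, 2 * hidden)
        return out.mean(dim=1)              # pool over time -> e_d

encoder = DescriptionEncoder()
desc = torch.randn(1, 20, 100)              # one 20-token description
e_d = encoder(desc)                          # entity description vector
```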
Step 3) constructing the entity feature vector matrix
For each entity in the labeled triples, the entity pinyin vector, character vector, word vector and description vector first form the feature tensor of the entity, which serves as the entity's predefined feature tensor in the subsequent steps. The construction proceeds as follows: denote a labeled triple by T_l, the entity set of the knowledge graph by E and the relation set by R. Select an entity e ∈ E from a labeled triple and look up the entity pinyin matrix to obtain the entity pinyin vector e_p. Let the entity name be c_1 c_2 ... c_m, where c_m denotes the m-th character of the name; looking the characters up in the character embedding matrix gives the character vector e_c = c_1 c_2 ... c_m. Looking up the word embedding matrix gives the word vector e_w, and looking up the description matrix gives the description vector e_d. The feature tensor of the entity is then expressed as
FeatureTensor(e) = (e_p, e_c, e_w, e_d)
The entity's feature tensor is then converted into its feature vector to build the entity feature vector matrix. As shown in fig. 4, the different dimensions of the feature tensor are connected by vector concatenation. For example, given vectors A = [x_1, x_2, x_3, ..., x_m] and B = [y_1, y_2, y_3, ..., y_n], concatenation yields C = [x_1, x_2, x_3, ..., x_m, y_1, y_2, y_3, ..., y_n]. Dropout randomly zeroes entries of the vector to prevent the learned knowledge representation from overfitting. Applying this to the pinyin vector e_p, character vector e_c, word vector e_w and description vector e_d of the same entity e yields the entity feature vector e_ft = [e_p; e_c; e_w; e_d].
The feature tensors of all entities in the labeled triples are converted into entity feature vectors in this way, and the entity feature vector matrix is constructed following the order of the entity dictionary.
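A sketch of the tensor-to-vector conversion under these definitions; the dropout rate and the toy feature dimensions are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

def feature_vector(e_p, e_c, e_w, e_d, drop=0.1):
    """Concatenate the four per-entity vectors into e_ft and, during
    training, randomly zero a fraction of entries (dropout)."""
    e_ft = np.concatenate([e_p, e_c, e_w, e_d])
    mask = rng.random(e_ft.shape) >= drop
    return e_ft * mask

# Toy vectors standing in for the pinyin, character, word and
# description features of one entity
e_p, e_c, e_w, e_d = (rng.standard_normal(8) for _ in range(4))
e_ft = feature_vector(e_p, e_c, e_w, e_d)
entity_matrix = np.stack([e_ft])   # rows follow the entity dictionary order
```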
Step 4) take a labeled triple T_l = <h, r, t> and obtain the feature vectors h_ft and t_ft of the head entity h and the tail entity t from the entity feature vector matrix. To express that entity h stands in relation r to entity t, i.e. h + r = t, the relation vector of the labeled triple T_l = <h, r, t> can be written as:
r = t_ft - h_ft    (1)
To measure the distance between entity h and entity t, the relation between entities is expressed as a vector translation, and the distance of a triple <h, r, t> is defined by the Euclidean distance:
d(h + r, t) = || h_ft + r - t_ft ||_2^2    (2)
The subscript 2 in equation (2) denotes the 2-norm (the Euclidean norm), and the superscript 2 denotes squaring.
Step 5) take the labeled triples as the training set; initialize the entity vectors, namely the entity feature vector matrix, initialize the relation vectors and build the relation vector matrix in the same order as the relation dictionary, each relation being computed by formula (1). If several entity pairs share the same relation, the relation vector is the average of the difference vectors of those entity pairs. Normalizing all vectors after initialization improves precision and strengthens convergence.
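A sketch of this initialization, assuming ent_vec maps entity names to the feature vectors e_ft of step 3:

```python
import numpy as np
from collections import defaultdict

def init_relation_vectors(triples, ent_vec):
    """Initialize each relation vector as the mean of t_ft - h_ft over all
    entity pairs sharing that relation, then L2-normalize it."""
    diffs = defaultdict(list)
    for h, r, t in triples:
        diffs[r].append(ent_vec[t] - ent_vec[h])
    rel_vec = {}
    for r, ds in diffs.items():
        v = np.mean(ds, axis=0)
        rel_vec[r] = v / (np.linalg.norm(v) or 1.0)   # guard the zero vector
    return rel_vec

ent_vec = {"成都": np.ones(4), "四川": np.zeros(4)}   # toy entity vectors
rel_vec = init_relation_vectors([("成都", "位于", "四川")], ent_vec)
```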
Step 6) randomly select a positive triple <h, r, t> from the training set, draw the negative triples <h', r, t> and <h, r, t'>, and pair each with <h, r, t> to form a training batch T_batch = [(<h, r, t>, <h', r, t>), (<h, r, t>, <h, r, t'>)]. Denote the positive triples by S_p = {<h, r, t>} and the negative triples by S_f = {<h', r, t> | h' ∈ E} ∪ {<h, r, t'> | t' ∈ E}. T_batch is used as the input of the knowledge graph representation learning model, and the model is trained on T_batch. Combining formula (2), the loss function of the knowledge graph representation learning model is defined as:
L = Σ_{<h,r,t> ∈ S_p} Σ_{<h',r,t'> ∈ S_f} [ γ + d(h + r, t) - d(h' + r, t') ]_+
where γ is a separation margin greater than 0, a hyperparameter, and [x]_+ denotes the positive-part function, i.e. [x]_+ = x when x > 0 and [x]_+ = 0 when x ≤ 0. This training scheme is called the margin-based ranking criterion; it aims to separate positive triples from negative triples as far as possible and to find the maximum-margin support vectors.
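A numpy sketch of the distance and loss just defined, with randomly generated vectors standing in for learned embeddings:

```python
import numpy as np

def distance(h_ft, r, t_ft):
    """d(h + r, t) = || h_ft + r - t_ft ||_2^2 (squared Euclidean norm)."""
    return float(np.sum((h_ft + r - t_ft) ** 2))

def margin_loss(batch, gamma=1.0):
    """Sum of [gamma + d(positive) - d(negative)]_+ over the pairs of
    T_batch; each pair joins a positive triple with a corrupted one."""
    loss = 0.0
    for (h, r, t), (hn, rn, tn) in batch:
        loss += max(0.0, gamma + distance(h, r, t) - distance(hn, rn, tn))
    return loss

rng = np.random.default_rng(0)
h, r, t = rng.standard_normal((3, 8))    # embeddings of one positive triple
hn = rng.standard_normal(8)              # embedding of a corrupted head
print(margin_loss([((h, r, t), (hn, r, t))]))
```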
Step 7) update the parameters of the knowledge graph representation learning model by stochastic gradient descent (SGD); the gradient update only requires computing the distances d(h + r, t) and d(h' + r, t'). Given |E| entities and |R| relations, with entity vectors of dimension m and relation vectors of dimension n, a total of |E|·m + |R|·n parameters must be updated.
Step 8) repeat steps 6) to 7) for iterative training. After the iterative training ends, the knowledge graph representation learning model is used to predict relations for the unlabeled triples: take any unlabeled triple <h, r, t>_unlabel and predict the relation r' between h and t with the model; if r' = r, the prediction is correct. The correctly predicted triples are then taken as positive triples, their head or tail entities are randomly replaced to form negative triples, and the new positive and negative triples are merged into the original labeled triples to form a new labeled set.
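A sketch of one pass of this outer loop. Predicting r' as the relation whose vector lies closest to t_ft - h_ft is an assumption of the sketch, since the patent does not spell out the predictor:

```python
import numpy as np

def predict_relation(h_ft, t_ft, rel_vec):
    """Predict the relation whose vector is nearest to t_ft - h_ft."""
    gap = t_ft - h_ft
    return min(rel_vec, key=lambda r: np.linalg.norm(gap - rel_vec[r]))

def self_train_pass(unlabeled, ent_vec, rel_vec):
    """One outer iteration: keep the correctly predicted unlabeled triples
    as new positives (corrupt them for matching negatives, then retrain)."""
    return [(h, r, t) for h, r, t in unlabeled
            if predict_relation(ent_vec[h], ent_vec[t], rel_vec) == r]
```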
Step 9) repeat steps 4) to 8) with the new labeled triples for iterative training until no new unlabeled triples can be learned, which indicates that the knowledge graph representation learning model can learn no further Chinese knowledge features from the current training data; the entity vectors and relation vectors output by the model are then the best Chinese knowledge graph representation for the data set zhishi.me.
Chinese knowledge graph representation learning generally adopts link prediction as the evaluation task. The evaluation metrics are the mean rank (MR), the mean reciprocal rank (MRR), and the proportion of test cases in which the correct entity ranks within the top ten (Hits@10), the top three (Hits@3) or first (Hits@1); a smaller MR is better, and larger values of MRR, Hits@10, Hits@3 and Hits@1 are better. Some entities or relations of triples in the data set zhishi.me are removed at random, and link prediction consists of predicting the removed entities or relations.
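A sketch of these metrics, given the 1-based rank of the correct entity for each test query:

```python
def rank_metrics(ranks):
    """ranks: 1-based rank of the correct entity in each test query."""
    n = len(ranks)
    hits = lambda k: sum(r <= k for r in ranks) / n
    return {"MR":  sum(ranks) / n,
            "MRR": sum(1.0 / r for r in ranks) / n,
            "Hits@10": hits(10), "Hits@3": hits(3), "Hits@1": hits(1)}

print(rank_metrics([1, 3, 12, 2, 1]))   # toy ranks
```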
In the embodiment of the invention, representation learning and the link prediction task are evaluated on the open-source Chinese data set zhishi.me, and the results are compared with those of two knowledge graph representation learning methods, the TransE model and the TransR model, as shown in Table 1:
TABLE 1 Test results

Model          MR    MRR    Hits@10  Hits@3  Hits@1
TransE         713   0.458  0.812    0.723   0.556
TransR         687   0.519  0.839    0.768   0.646
The invention  611   0.843  0.875    0.801   0.692
The experimental results show that the method of the invention outperforms the TransE and TransR models and reaches a usable level.
Although illustrative embodiments of the present invention have been described above to help those skilled in the art understand it, the scope of the invention is not limited to these specific embodiments. All variations that are obvious and all inventions making use of the concepts of the present invention are intended to be protected.

Claims (4)

1. A Chinese knowledge graph representation learning method based on feature tensor is characterized by comprising the following steps:
step 1) data preparation
Data from the open Chinese linked data set zhishi.me form the triple data, which consist of a large number of triples of the form <h, r, t>, where h denotes a head entity, t denotes a tail entity, and r denotes the relation between the head entity h and the tail entity t;
step 2) establishing a data structure
Dividing the triple data into labeled triples and unlabeled triples, and constructing the following data structures: a dictionary, an entity dictionary, a relation dictionary, an entity pinyin matrix, a character embedding matrix, a word embedding matrix and a description matrix, wherein,
marking a triple: randomly extracting triple data from the Chinese link data set zhishi.me to obtain a triple data set, taking all triples in the triple data set as positive triples, removing a head entity or a tail entity of each positive triplet, randomly selecting an entity different from the triple in an entity dictionary to replace the triple to form a negative triplet, only replacing one entity in the triples each time so that the triples have contrast, marking the triples, and marking the positive triples as 1 and the negative triples as 0;
unlabeled triples: any unlabeled triple in the Chinese linked data set zhishi.me;
a dictionary: built from the Chinese linked data set zhishi.me; it covers all head entities, tail entities and relations, with entries of the form "character: serial number", the serial number being a number increasing from zero;
an entity dictionary: built from the Chinese linked data set zhishi.me, whose entity set is denoted by E; a dictionary formed from all head entities and tail entities, with entries of the form "entity name: serial number", the serial number being a number increasing from zero;
a relation dictionary: built from the Chinese linked data set zhishi.me; a dictionary formed from all relations, with entries of the form "relation name: serial number", the serial number being a number increasing from zero;
an entity pinyin matrix: in order to resolve polyphonic characters with different meanings, the Baidu translation API is called to obtain the entity pinyin and the entity pinyin matrix is constructed; the number of rows of the entity pinyin matrix equals the number of entities in the entity dictionary, and each row is an entity pinyin vector obtained by one-hot coding;
word embedding matrix: the number of rows of the word embedding matrix is consistent with the number of words in the dictionary, and each row of the word embedding matrix uses a word vector obtained by word2 vec;
word embedding matrix: the line number of the word embedding matrix is consistent with the number of entities in the entity dictionary, and each behavior of the word embedding matrix uses a word vector obtained by word2 vec;
describing the matrix: the line number of the description matrix is consistent with the number of entities in the entity dictionary, an encyclopedia API is called to obtain entity description information, the entity description information is input into a bidirectional long short-Term Memory network (Bi-directional Long short-Term Memory, BilSTM) to be coded to obtain an entity description vector, and the entity description vector introduces the entity description information and can solve the problem of Chinese synonym;
step 3) constructing the entity feature vector matrix
For each entity in the labeled triples, the entity pinyin vector, character vector, word vector and entity description vector first form the feature tensor of the entity; the feature tensors of all entities in the labeled triples are then converted into entity feature vectors, and the entity feature vector matrix is constructed following the order of the entity dictionary;
step 4) take a labeled triple T_l = <h, r, t> and obtain the feature vectors h_ft and t_ft of the head entity h and the tail entity t from the entity feature vector matrix. To express that entity h stands in relation r to entity t, i.e. h + r = t, the relation vector of the labeled triple T_l = <h, r, t> can be written as:
r = t_ft - h_ft
To measure the distance between entity h and entity t, the relation between entities is expressed as a vector translation, and the distance of a triple <h, r, t> is defined by the Euclidean distance:
d(h + r, t) = || h_ft + r - t_ft ||_2^2
where the subscript 2 denotes the 2-norm (the Euclidean norm) and the superscript 2 denotes squaring;
step 5) take all the labeled triples as the training set; initialize the entity vectors, namely the entity feature vector matrix, initialize the relation vectors and build the relation vector matrix in the same order as the relation dictionary, each relation being computed by the formula r = t_ft - h_ft; if several entity pairs share the same relation, the relation vector is the average of the difference vectors of those entity pairs, and after all relation vectors are initialized they are normalized, which improves precision and strengthens convergence;
step 6) randomly select a positive triple <h, r, t> from the training set, draw the negative triples <h', r, t> and <h, r, t'>, and pair each with <h, r, t> to form a training batch T_batch = [(<h, r, t>, <h', r, t>), (<h, r, t>, <h, r, t'>)]. Denote the positive triples by S_p = {<h, r, t>} and the negative triples by S_f = {<h', r, t> | h' ∈ E} ∪ {<h, r, t'> | t' ∈ E}. T_batch is used as the input of the knowledge graph representation learning model, and the model is trained on T_batch. Combining the distance formula d(h + r, t) = || h_ft + r - t_ft ||_2^2, the loss function of the knowledge graph representation learning model is defined as:
L = Σ_{<h,r,t> ∈ S_p} Σ_{<h',r,t'> ∈ S_f} [ γ + d(h + r, t) - d(h' + r, t') ]_+
where γ is a separation margin greater than 0, a hyperparameter, and [x]_+ denotes the positive-part function, i.e. [x]_+ = x when x > 0 and [x]_+ = 0 when x ≤ 0; this training scheme is called the margin-based ranking criterion, its aim being to separate positive triples from negative triples as far as possible and to find the maximum-margin support vectors;
step 7) update the parameters of the knowledge graph representation learning model by stochastic gradient descent (SGD); the gradient update only requires computing the distances d(h + r, t) and d(h' + r, t'). Given |E| entities and |R| relations, with entity vectors of dimension m and relation vectors of dimension n, a total of |E|·m + |R|·n parameters must be updated;
step 8) repeat steps 6) to 7) for iterative training; after the iterative training ends, use the knowledge graph representation learning model to predict relations for the unlabeled triples as follows: take any unlabeled triple <h, r, t>_unlabel and predict the relation r' between h and t with the model; if r' = r, the prediction is correct. The correctly predicted triples are then taken as positive triples, their head or tail entities are randomly replaced to form negative triples, and the new positive and negative triples are merged into the original labeled triples to form a new labeled set;
step 9) repeat steps 4) to 8) with the new labeled triples for iterative training until no new unlabeled triples can be learned, which indicates that the knowledge graph representation learning model can learn no further Chinese knowledge features from the current training set; the entity vectors and relation vectors output by the model are then the best Chinese knowledge graph representation for the Chinese linked data set zhishi.me.
2. The feature tensor-based Chinese knowledge graph representation learning method of claim 1, characterized in that the feature tensor of an entity in step 3) is constructed as follows: denote a labeled triple by T_l, the entity set of the knowledge graph by E and the relation set by R; arbitrarily select an entity e ∈ E from a labeled triple and look up the entity pinyin matrix to obtain the pinyin vector e_p of entity e; let the entity name be c_1 c_2 ... c_m, where c_m denotes the m-th character forming the entity name, and look the characters up in the character embedding matrix to obtain the character vector e_c = c_1 c_2 ... c_m of entity e; look up the word embedding matrix to obtain the word vector e_w of entity e; look up the description matrix to obtain the description vector e_d of entity e; the feature tensor of entity e is then expressed as
FeatureTensor(e) = (e_p, e_c, e_w, e_d).
3. The feature tensor-based Chinese knowledge graph representation learning method of claim 2, characterized in that the feature tensor of an entity is converted into the feature vector of the entity in step 3) as follows: the different dimensions of the feature tensor of the entity are connected by vector concatenation, and dropout randomly zeroes entries of the vector to prevent the learned knowledge representation from overfitting; applying this to the pinyin vector e_p, character vector e_c, word vector e_w and description vector e_d of the same entity e yields the feature vector e_ft = [e_p; e_c; e_w; e_d] of entity e.
4. The feature tensor-based Chinese knowledge graph representation learning method of any one of claims 1 to 3, characterized in that link prediction is adopted as the evaluation task, with the evaluation metrics being the mean rank MR, the mean reciprocal rank MRR, and the proportion of test cases in which the correct entity ranks within the top ten (Hits@10), the top three (Hits@3) or first (Hits@1), where a smaller MR is better and larger values of MRR, Hits@10, Hits@3 and Hits@1 are better; some entities or relations of triples in the Chinese linked data set zhishi.me are removed at random, and link prediction refers to predicting the randomly removed entities or relations in the triples.
CN201911300781.5A 2019-12-17 2019-12-17 Chinese knowledge graph representation learning method based on feature tensor Active CN111160564B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911300781.5A CN111160564B (en) 2019-12-17 2019-12-17 Chinese knowledge graph representation learning method based on feature tensor

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911300781.5A CN111160564B (en) 2019-12-17 2019-12-17 Chinese knowledge graph representation learning method based on feature tensor

Publications (2)

Publication Number Publication Date
CN111160564A true CN111160564A (en) 2020-05-15
CN111160564B CN111160564B (en) 2023-05-19

Family

ID=70557605

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911300781.5A Active CN111160564B (en) 2019-12-17 2019-12-17 Chinese knowledge graph representation learning method based on feature tensor

Country Status (1)

Country Link
CN (1) CN111160564B (en)

Patent Citations (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100057762A1 (en) * 2008-09-03 2010-03-04 Hamid Hatami-Hanza System and Method of Ontological Subject Mapping for Knowledge Processing Applications
US20120131008A1 * 2010-11-23 2012-05-24 Microsoft Corporation Identifying referring expressions for concepts
CN106886543A (en) * 2015-12-16 2017-06-23 清华大学 The knowledge mapping of binding entity description represents learning method and system
US20170193390A1 (en) * 2015-12-30 2017-07-06 Facebook, Inc. Identifying Entities Using a Deep-Learning Model
US9705908B1 (en) * 2016-06-12 2017-07-11 Apple Inc. Emoji frequency detection and deep link frequency
JP2018010543A (en) * 2016-07-15 2018-01-18 株式会社トヨタマップマスター Notation fluctuation glossary creation device, retrieval system, methods thereof, computer program thereof and recording medium recording computer program thereof
CN106528610A (en) * 2016-09-28 2017-03-22 厦门理工学院 Knowledge graph representation learning method based on path tensor decomposition
US20190180154A1 (en) * 2017-12-13 2019-06-13 Abbyy Development Llc Text recognition using artificial intelligence
CN108509483A (en) * 2018-01-31 2018-09-07 北京化工大学 The mechanical fault diagnosis construction of knowledge base method of knowledge based collection of illustrative plates
US20190354810A1 (en) * 2018-05-21 2019-11-21 Astound Ai, Inc. Active learning to reduce noise in labels
CN108829865A (en) * 2018-06-22 2018-11-16 海信集团有限公司 Information retrieval method and device
CN109522465A (en) * 2018-10-22 2019-03-26 国家电网公司 The semantic searching method and device of knowledge based map
CN109740168A (en) * 2019-01-09 2019-05-10 北京邮电大学 A kind of classic of TCM ancient Chinese prose interpretation method based on knowledge of TCM map and attention mechanism
CN109933307A (en) * 2019-02-18 2019-06-25 杭州电子科技大学 A kind of intelligent controller machine learning algorithm modular form description and packaging method based on ontology
CN110377755A (en) * 2019-07-03 2019-10-25 江苏省人民医院(南京医科大学第一附属医院) Reasonable medication knowledge map construction method based on medicine specification
CN110334219A (en) * 2019-07-12 2019-10-15 电子科技大学 The knowledge mapping for incorporating text semantic feature based on attention mechanism indicates learning method

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
Feng Lina et al.: "Analysis of Confucian thought in the Yanshi Jiaxun based on word-frequency statistics", Library (《图书馆》) *
Wu Yunbing et al.: "Knowledge graph reasoning algorithm based on path tensor decomposition", Pattern Recognition and Artificial Intelligence (《模式识别与人工智能》) *
Ma Huizhu et al.: "Research directions and keywords of computer-aided project acceptance: the 2012 acceptance situation and notes for 2013", Journal of Electronics & Information Technology (《电子与信息学报》) *

Cited By (15)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112100398B (en) * 2020-08-31 2021-09-14 清华大学 Patent blank prediction method and system
CN112100398A (en) * 2020-08-31 2020-12-18 清华大学 Patent blank prediction method and system
WO2022069958A1 * 2020-09-29 2022-04-07 International Business Machines Corporation Automatic knowledge graph construction
CN112463976A (en) * 2020-09-29 2021-03-09 东南大学 Knowledge graph construction method taking crowd sensing task as center
GB2613999A (en) * 2020-09-29 2023-06-21 Ibm Automatic knowledge graph construction
CN112463976B (en) * 2020-09-29 2024-05-24 东南大学 Knowledge graph construction method taking crowd sensing task as center
CN113051904A (en) * 2021-04-21 2021-06-29 东南大学 Link prediction method for small-scale knowledge graph
CN113742488A (en) * 2021-07-30 2021-12-03 清华大学 Embedded knowledge graph completion method and device based on multitask learning
CN113742488B (en) * 2021-07-30 2022-12-02 清华大学 Embedded knowledge graph completion method and device based on multitask learning
CN113963748A (en) * 2021-09-28 2022-01-21 华东师范大学 Protein knowledge map vectorization method
CN113963748B (en) * 2021-09-28 2023-08-18 华东师范大学 Protein knowledge graph vectorization method
CN114416941A (en) * 2021-12-28 2022-04-29 北京百度网讯科技有限公司 Generation method and device of dialogue knowledge point determination model fusing knowledge graph
CN114416941B (en) * 2021-12-28 2023-09-05 北京百度网讯科技有限公司 Knowledge graph-fused dialogue knowledge point determination model generation method and device
CN114861665A (en) * 2022-04-27 2022-08-05 北京三快在线科技有限公司 Method and device for training reinforcement learning model and determining data relation
CN114861665B (en) * 2022-04-27 2023-01-06 北京三快在线科技有限公司 Method and device for training reinforcement learning model and determining data relation

Also Published As

Publication number Publication date
CN111160564B (en) 2023-05-19

Similar Documents

Publication Publication Date Title
CN111160564B (en) Chinese knowledge graph representation learning method based on feature tensor
CN110309267B (en) Semantic retrieval method and system based on pre-training model
Guu et al. Retrieval augmented language model pre-training
CN109299341A Adversarial cross-modal retrieval method and system based on dictionary learning
CN112182245B (en) Knowledge graph embedded model training method and system and electronic equipment
Wang et al. Facilitating image search with a scalable and compact semantic mapping
Ju et al. An efficient method for document categorization based on word2vec and latent semantic analysis
WO2017193685A1 (en) Method and device for data processing in social network
CN111881292B (en) Text classification method and device
CN109284414B (en) Cross-modal content retrieval method and system based on semantic preservation
CN111274790A (en) Chapter-level event embedding method and device based on syntactic dependency graph
CN111581364B (en) Chinese intelligent question-answer short text similarity calculation method oriented to medical field
CN113515632A (en) Text classification method based on graph path knowledge extraction
Lu et al. Image annotation by semantic sparse recoding of visual content
CN114743029A (en) Image text matching method
Ding et al. A Knowledge-Enriched and Span-Based Network for Joint Entity and Relation Extraction.
CN116720519B Miao medicine named entity recognition method
CN110674293B (en) Text classification method based on semantic migration
Perdana et al. Instance-based deep transfer learning on cross-domain image captioning
Deng Large scale visual recognition
CN116341515A (en) Sentence representation method of dynamic course facing contrast learning
Fan et al. Large margin nearest neighbor embedding for knowledge representation
CN110633363B (en) Text entity recommendation method based on NLP and fuzzy multi-criterion decision
Cui et al. A new Chinese text clustering algorithm based on WRD and improved K-means
CN111881689A (en) Method, system, device and medium for processing polysemous word vector

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant