CN117371534B - Knowledge graph construction method and system based on BERT - Google Patents

Knowledge graph construction method and system based on BERT

Info

Publication number
CN117371534B
Authority
CN
China
Prior art keywords
entity
text
relation
model
pos
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202311669625.2A
Other languages
Chinese (zh)
Other versions
CN117371534A (en)
Inventor
雷雨寒
朱雅都
张迪勇
黄尚强
郑佳勇
方卫洪
魏明欣
席力凡
陈曦
范红霞
李丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Same Side Seville Information Technology Co ltd
Original Assignee
Same Side Seville Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Same Side Seville Information Technology Co ltd
Priority to CN202311669625.2A
Publication of CN117371534A
Application granted
Publication of CN117371534B
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/022Knowledge engineering; Knowledge acquisition
    • G06N5/025Extracting rules from data
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/20Natural language analysis
    • G06F40/279Recognition of textual entities
    • G06F40/284Lexical analysis, e.g. tokenisation or collocates
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F40/00Handling natural language data
    • G06F40/30Semantic analysis
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/04Architecture, e.g. interconnection topology
    • G06N3/045Combinations of networks
    • G06N3/0455Auto-encoder networks; Encoder-decoder networks
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N5/00Computing arrangements using knowledge-based models
    • G06N5/02Knowledge representation; Symbolic representation
    • G06N5/027Frames

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Software Systems (AREA)
  • Health & Medical Sciences (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Mathematical Physics (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • Biomedical Technology (AREA)
  • Biophysics (AREA)
  • Molecular Biology (AREA)
  • Databases & Information Systems (AREA)
  • Animal Behavior & Ethology (AREA)
  • Machine Translation (AREA)

Abstract

The invention discloses a BERT-based knowledge graph construction method and system. The method comprises the following steps: S1, locating and extracting entities from unstructured text data to complete entity recognition; S2, extracting the relations between entities according to the text content; S3, locating entity names and their pronouns, clustering the references of the same entity together, and connecting the multiple relations of the same entity to a single entity node to complete coreference resolution; S4, calculating the similarity between entity names and entities, and mapping each entity name to an entity in the knowledge graph; S5, predicting and completing the missing triples in the knowledge graph, and constructing the knowledge graph from the obtained triples. The invention helps to improve both the construction quality and the completeness of the knowledge graph.

Description

Knowledge graph construction method and system based on BERT
Technical Field
The invention relates to knowledge extraction, knowledge fusion and knowledge graph completion, in particular to a BERT-based knowledge graph construction method and system.
Background
Knowledge extraction, knowledge fusion and knowledge graph completion are the key steps in building a knowledge graph from unstructured data, and they usually require natural language processing. In recent years, large models with huge numbers of parameters, trained on massive amounts of data, have achieved remarkable results in natural language processing and related fields. Constructing a knowledge graph from unstructured text data is a very challenging task, and combining a large model with knowledge graph construction can improve the construction quality. In the knowledge extraction stage, the target entities in a text and the possible relations between them must be identified and extracted accurately; however, because Chinese sentence patterns are highly variable, the same thing is often expressed in many different ways, and the information contained in some sentences is not limited to their surface semantics. Some conventional extraction approaches (e.g., rule-based extraction methods) can no longer extract target entities and relations effectively. A large model in the natural language processing field, pre-trained on a large amount of data and fine-tuned for downstream tasks, can accurately extract and represent the features of words and sentences, and its precision and recall in knowledge extraction far exceed those of other means. The knowledge fusion stage usually involves two main techniques: coreference resolution and entity disambiguation. Coreference resolution is the technique of associating a pronoun with the entity it refers to, and implementing it raises two major difficulties: how to locate the pronoun, and how to point the pronoun to the correct entity. A pronoun and the entity it refers to often share similar features, so by obtaining a feature representation of the sentence with a large model, the pronoun can be located accurately and related to the original entity, i.e., pronouns such as "you", "I" and "he" are associated with their referents. For example, in the sentence "Xiao Ming said: I want to be an astronaut when I grow up", the goal of coreference resolution is to link the pronoun "I" to "Xiao Ming". Entity disambiguation addresses the fact that the same entity name may represent different entities: for example, "Xiaomi" (millet) may refer to a food item or to a mobile phone company that commonly appears in science, technology and electronics contexts, and the contexts in which the two occur are usually different.
The feature representation of the sentence in which each entity appears can be obtained well with a large model, and entities are distinguished by computing the similarity of their contexts. In the knowledge graph completion stage, entity relations that do not yet exist in the knowledge graph must be predicted from the existing ones. For example, from the sentence "A is B's dad, and his wife is called C", the preceding knowledge extraction step can extract two relation pairs (A, B, father-son) and (he, C, husband-wife); the knowledge fusion step then determines that "he" refers to A, giving the two relation pairs (A, B, father-son) and (A, C, husband-wife), but the sentence still implies the relation pair (B, C, mother-son). The relation triples are spliced into text sequences and used as training data to fine-tune the large model, which then predicts the rationality of new triples, so that knowledge graph completion is carried out on the basis of the feature representation of words in context. Compared with traditional space-based knowledge representation methods (such as TransE, RotatE and the like), this can encode more semantic information and better distinguish entities that are close in the spatial representation.
As for traditional feature-based word embedding generation methods, Word2Vec and GloVe can only convert a sentence into representation vectors in which the words are relatively independent of one another: the representation of a word contains no context information, and polysemy is difficult to handle. The later ELMo model can encode some context information, but its performance is limited by the feature extraction capability of LSTM and by its bidirectional concatenation strategy.
Disclosure of Invention
The invention aims to overcome the defects of the prior art and provides a method and a system for constructing a knowledge graph based on BERT, which are beneficial to improving the construction quality of the knowledge graph and the integrity of the knowledge graph.
The aim of the invention is realized by the following technical scheme: a knowledge graph construction method based on BERT comprises the following steps:
s1, positioning and extracting an entity E from unstructured text data to complete entity identification;
the step S1 includes:
S101, modeling the entity E: specifying the entity types to be collected and setting the attributes to be collected for each entity type (e.g., when a person is an entity type, information such as name and sex can be set as attributes to be collected);
S102, preparing a corresponding data set for the entity types, wherein each piece of training data comprises a source text, the entities contained in the text and the positions of the entities in the source text; in each piece of training data, the source text serves as the training sample, and the corresponding entities and their positions in the source text serve as the sample labels;
constructing an entity identification model: the entity recognition model comprises a BERT model and a neural network NER, wherein the input of the BERT model is the input of the entity recognition model, the output of the BERT model is used as the input of the neural network NER, and the output of the neural network NER is used as the output of the entity recognition model;
training the entity recognition model by utilizing the data set, storing the model parameters after training, and directly loading the model parameters when entity recognition is carried out on the text;
In embodiments of the present application, whenever data is fed into the BERT model, a tokenizer is used to convert the input text into an encoding. The tokenizer is a word segmentation tool; [CLS] is the start token inserted by the tokenizer and [SEP] is the separator/terminator token inserted by the tokenizer. Steps S2-S5 handle this in the same way.
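By way of illustration, a minimal sketch of such an entity recognition model in Python (using the HuggingFace transformers library) is given below. The model name bert-base-chinese, the number of entity labels and the BIO-style reading of the output are illustrative assumptions, not requirements of the patent; the sketch only shows the BERT backbone feeding a small "NER" classification layer as described above.

```python
import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizerFast

class EntityRecognitionModel(nn.Module):
    def __init__(self, bert_name="bert-base-chinese", num_entity_labels=9):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)  # shared BERT backbone
        # "neural network NER": a simple per-token classification layer (illustrative)
        self.ner = nn.Linear(self.bert.config.hidden_size, num_entity_labels)

    def forward(self, input_ids, attention_mask):
        hidden = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state
        return self.ner(hidden)  # per-token logits over entity labels

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
model = EntityRecognitionModel()
enc = tokenizer("小明毕业于四川大学", return_tensors="pt")  # tokenizer adds [CLS] and [SEP]
logits = model(enc["input_ids"], enc["attention_mask"])
tags = logits.argmax(dim=-1)  # after fine-tuning, per-token argmax gives BIO-style tags (positions pos)
```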
S103, inputting a text which needs to be subjected to entity recognition, determining whether the text needs to be truncated or not by judging whether the length exceeds a preset maximum length L, if yes, performing step S104, otherwise, directly taking the text content as a text segment which is actually subjected to entity recognition, and entering step S105;
s104, cutting off the text, firstly, cutting off the text from the beginning of the text according to a preset maximum length L, then searching for a first semantic termination symbol from the end of a text result obtained by the primary cutting off, and taking the position where the symbol appears as an actual position Pos of secondary cutting off so as to ensure the integrity of sentence semantics in the primary cutting off process; the text is cut off twice, the length is not more than the maximum length L, the text which is cut off is regarded as the text segment which is actually recognized by the entity, the initial position offset of the current cut-off text in the original text is recorded, and the step S105 is entered;
s105, inputting text fragments which are actually subjected to entity recognition into an entity recognition model to obtain an entity recognition result, wherein the entity recognition result comprises different types of entities and the positions pos of the entities in the current truncated text; according to the initial position offset of the current truncated text in the original text, calculating to obtain the actual position of the entity in the text as pos+offset;
s106, removing all texts from the Pos position to the front, taking the rest texts as input, returning to the step S103 until all entity identification in the source text is finished, collecting entity identification results of all truncated texts, and combining the entity identification results into a complete entity identification result.
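The truncation loop of steps S103-S106 can be sketched as follows; the terminator set, the maximum length value and the recognize_entities() helper are assumptions made only for illustration.

```python
TERMINATORS = "。！？；.!?;"   # assumed set of semantic termination symbols
MAX_LEN = 510                 # preset maximum length L (illustrative)

def truncate_and_recognize(text, recognize_entities):
    """recognize_entities(fragment) -> [(entity, pos_in_fragment), ...] stands in for the
    entity recognition model of steps S101-S102 (an assumed helper)."""
    results, offset = [], 0
    while text:
        if len(text) <= MAX_LEN:
            pos = len(text)                                   # no truncation needed
        else:
            head = text[:MAX_LEN]                             # primary truncation at length L
            pos = max(head.rfind(t) for t in TERMINATORS) + 1  # secondary truncation at Pos
            if pos == 0:
                pos = MAX_LEN                                  # no terminator found: hard cut
        fragment = text[:pos]
        for entity, p in recognize_entities(fragment):
            results.append((entity, p + offset))               # actual position = pos + offset
        text = text[pos:]                                      # remove everything before Pos
        offset += pos
    return results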
S2, extracting a relation R between entities according to the text content, and completing the preliminary construction of the knowledge graph by using a relation extraction result;
S201, obtaining the preset relations R to be extracted, where R is {h, r, t}, h denotes the head entity type, r the relation type and t the tail entity type; then numbering every relation to be extracted to obtain a list L_R representing the one-to-one mapping between relations and their numbers, denoted [R_1, R_2, R_3, ...];
S202, preparing a corresponding data set according to a preset extraction relation, wherein each piece of training data comprises a head entity, a tail entity, a source text and a one-dimensional vector representing a relation type;
in each piece of training data, the source text, head entity and tail entity are used as the training sample, and the label is a one-dimensional vector representing the relation type, where the value of each dimension of the one-dimensional vector is the similarity between the relation of the head and tail entities in this piece of training data and the relation with the corresponding number in the preset relation list L_R;
building a relation extraction model: the relation extraction model comprises a BERT model and a neural network NRE, wherein the input of the BERT model is the input of the relation extraction model, the output of the BERT model is used as the input of the neural network NRE, and the output of the neural network NRE is used as the output of the relation extraction model;
training the relation extraction model by utilizing a data set, storing the model parameters after training, and directly loading the model parameters when extracting the relation;
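A hedged sketch of such a relation extraction model follows (Python, HuggingFace transformers): a shared BERT encoder whose [CLS] representation feeds the "NRE" layer, producing one similarity score per relation of L_R. The sigmoid output and the model name are illustrative choices, not specified by the patent.

```python
import torch
import torch.nn as nn
from transformers import BertModel

class RelationExtractionModel(nn.Module):
    """BERT encoder plus the 'NRE' head: one similarity score per entry of the relation list L_R."""
    def __init__(self, num_relations, bert_name="bert-base-chinese"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        self.nre = nn.Linear(self.bert.config.hidden_size, num_relations)

    def forward(self, input_ids, attention_mask):
        # use the [CLS] position as the sentence-level feature vector
        cls = self.bert(input_ids=input_ids, attention_mask=attention_mask).last_hidden_state[:, 0]
        return torch.sigmoid(self.nre(cls))  # one-dimensional vector of relation similarities
```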
s203, starting from the beginning of an input text, dividing the input text into complete sentences according to semantic terminators to obtain a sentence list, and determining the starting position and the ending position of each sentence in the input text;
sequentially selecting each sentence as the sentence for relation extraction, arranging the entity recognition results in increasing order of the entities' positions in the source text, removing the entities whose positions are smaller than the start position or larger than the end position of the current sentence, and randomly selecting two entities a and b from the remaining entities;
S204, according to the entity types of a and b in the entity recognition result, comparing them with the head entity type h and the tail entity type t of each relation R to determine the candidate relations R, and taking the subscripts of all candidate R in the relation list L_R as sequence numbers to obtain a candidate relation sequence number list L;
S205, splicing the names of the two entities a and b and the sentence T to obtain "[CLS] a [SEP] b [SEP] T [SEP]";
S206, inputting "[CLS] a [SEP] b [SEP] T [SEP]" into the relation extraction model to obtain a one-dimensional vector; taking, from this vector, all values whose indices appear in the candidate relation sequence number list L and taking their maximum; judging whether this maximum exceeds a set threshold lim: if it is larger than the threshold, the corresponding relation is the relation between the two entities a and b and is output directly; if it is smaller than the threshold, the relation between a and b is output as an unknown relation; if the output relation r is not unknown, recording {a, r, b} into the relation extraction result;
s207, returning to the step S203 until the relation extraction of the entities contained in all sentences of the input text is completed, and then executing the step S208;
S208, for each piece of data {a, r, b} in the relation extraction result, preliminarily constructing the knowledge graph by taking a and b as entity nodes and r as an edge, obtaining G_0.
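Steps S205-S206 can then be sketched as the following inference routine, reusing the RelationExtractionModel sketched after step S202; the threshold value, the relation list and the way the tokenizer supplies the [CLS]/[SEP] markers are assumptions made for illustration.

```python
from transformers import BertTokenizerFast

LIM = 0.5  # relation similarity threshold lim (illustrative value)

def extract_relation(model, tokenizer, a, b, sentence, candidate_idx, relations):
    # Build "[CLS] a [SEP] b [SEP] T [SEP]"; the BERT tokenizer adds [CLS] and the trailing [SEP]
    # and normally recognizes the literal [SEP] markers in the text as special tokens.
    enc = tokenizer(f"{a}[SEP]{b}[SEP]{sentence}", return_tensors="pt")
    scores = model(enc["input_ids"], enc["attention_mask"])[0]
    best = max(candidate_idx, key=lambda i: scores[i].item())  # maximum over candidate list L
    return relations[best] if scores[best].item() > LIM else "unknown"

tokenizer = BertTokenizerFast.from_pretrained("bert-base-chinese")
# model = RelationExtractionModel(num_relations=len(relations))
# r = extract_relation(model, tokenizer, a, b, sentence, candidate_idx, relations)
# if r != "unknown": record {a, r, b} into the relation extraction result
```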
S3, locating entity names M and entity pronouns P, clustering the references (M/P) of the same entity together, and connecting the multiple relations of the same entity E to a single entity node N_E in the knowledge graph, completing coreference resolution;
S301, preparing a coreference resolution data set, wherein each piece of data contains a source text and the clustering result of the entity references in the text; the clustering result contains a plurality of clusters in the format [[Pos_11, Pos_12, Pos_13, ...], [Pos_21, Pos_22, Pos_23, ...], [Pos_31, Pos_32, Pos_33, ...], ...];
Constructing a coreference resolution model: the coreference resolution model comprises a BERT model and a neural network CR, wherein the input of the BERT model is the input of the coreference resolution model, the output of the BERT model is used as the input of the neural network CR, and the output of the neural network CR is used as the output of the coreference resolution model;
training the coreference resolution model by utilizing the data set, storing the model parameters after training, and directly loading the model parameters when carrying out coreference resolution;
s302, determining whether the text needs to be truncated by judging whether the length of the input text exceeds a preset maximum text length, if so, executing a step S303, otherwise, taking the input text as a processing target, and executing a step S304;
S303, truncating the input text: if this is the first truncation, the beginning of the text is taken directly as the start position P_s of the forward truncation; if it is not the first truncation, the largest reference start position in each row of the previous coreference resolution result R is selected, the result for the i-th row being denoted Pos_i,max, giving [Pos_1,max, Pos_2,max, Pos_3,max, ...], denoted L_max; the minimum of L_max is then selected as the start position P_s of the current truncated text, ensuring that the new truncation result can contain all entities of the previous resolution result; a fixed-length text is intercepted from the start position, the first terminator is searched for from the end of this text backwards, and the position where it appears is taken as the end position P_e of the interception, to ensure the integrity of the sentence semantics within one truncation; the text T between position P_s and position P_e of the source text is intercepted and taken as the processing target, and step S304 is executed;
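The restart rule of step S303 amounts to the following small helper (an illustrative sketch; the function name is not from the patent):

```python
def next_start_position(previous_clusters):
    """P_s = min over clusters of each cluster's largest reference position (Pos_i,max),
    so every entity of the previous result R reappears in the next fragment."""
    l_max = [max(cluster) for cluster in previous_clusters]  # [Pos_1,max, Pos_2,max, ...] = L_max
    return min(l_max) if l_max else 0                        # start position P_s (0 on first pass)
```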
S304, inputting the processing target into the coreference resolution model to obtain a two-dimensional list in which each row is the list of references belonging to the same entity, i.e. a clustering result in the format [[Pos_11, Pos_12, Pos_13, ...], [Pos_21, Pos_22, Pos_23, ...], [Pos_31, Pos_32, Pos_33, ...], ...]; this result is recorded as the coreference resolution result R;
S305, taking the text remaining after truncation as input and returning to step S302 until no text remains; then merging the coreference resolution results R of all text fragments T and removing references whose positions are repeated;
S306, combining, according to the coreference resolution result R, the vertices of the knowledge graph G_0 constructed in step S2 that belong to the same cluster, obtaining the knowledge graph G_1.
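Steps S305-S306 can be illustrated by the following sketch; the cluster-merging strategy (uniting clusters that share a reference position) and the mention_at helper are assumptions made for the example, not prescribed by the patent.

```python
def merge_clusters(fragment_results):
    """Merge the per-fragment clustering results R (step S305): clusters that share a
    reference position are united, and repeated positions are dropped."""
    merged = []
    for clusters in fragment_results:
        for cluster in clusters:
            target = next((m for m in merged if set(m) & set(cluster)), None)
            if target is None:
                merged.append(list(dict.fromkeys(cluster)))
            else:
                target.extend(p for p in cluster if p not in target)
    return merged

def collapse_graph(triples, mention_at, clusters):
    """Step S306: map every reference in a cluster onto one node name and rewrite the edge
    triples of G_0 accordingly, yielding G_1. mention_at(pos) returns the surface form at a
    position (an assumed helper, not defined in the patent)."""
    canonical = {}
    for cluster in clusters:
        head = mention_at(cluster[0])                 # first reference names the entity node
        for pos in cluster:
            canonical[mention_at(pos)] = head
    return [(canonical.get(a, a), r, canonical.get(b, b)) for a, r, b in triples]
```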
S4, calculating the similarity between the entity name M and the entity E, and corresponding the entity name M to the entity in the knowledge graph;
s401, establishing an empty entity dictionary as a knowledge base KB, taking the name of an entity E as a key, and taking an entity context as a value, wherein the entity context is represented by sentences in which the entity name appears;
s402, preparing a sentence similarity data set, wherein each piece of data contains two sentences and the similarity thereof, taking the two sentences in each piece of data as samples, and taking the similarity as a sample label;
constructing a similarity recognition model: the similarity recognition model comprises a BERT model and a neural network NED, wherein the input of the BERT model is the input of the similarity recognition model, the output of the BERT model is used as the input of the neural network NED, and the output of the neural network NED is used as the output of the similarity recognition model;
training the similarity recognition model using the similarity data set, storing the model parameters after training, and directly loading the model parameters when similarity recognition is performed;
s403, judging whether the current entity dictionary contains entity key values with the same name or not according to the entity name M in the entity identification result, if not, executing the step S404, otherwise, executing the step S405;
S404, adding the entity name M to the knowledge base KB as the key of a new entity E, and taking the sentence in which M appears, i.e. the context in which the entity occurs, as the corresponding value;
S405, sequentially taking every entity name E in the entity dictionary whose key is the same as the current entity name M, splicing the context T_M in which M is located with the context T_E corresponding to the entity name E in the entity dictionary to obtain "[CLS] T_M [SEP] T_E [SEP]", and entering step S406;
S406, inputting "[CLS] T_M [SEP] T_E [SEP]" into the similarity recognition model to obtain the similarity of the two contexts T_M and T_E; if the similarity exceeds the set threshold, adding the entity name E in the entity dictionary to a candidate entity name list L; taking the entity name E_0 with the highest similarity in the candidate entity name list L as the disambiguation result, and merging the entity context T_M corresponding to the current entity name M into the entity context T_E0 of E_0;
if the similarity between the entity contexts of all entity names in L and the entity context of the current entity name does not meet the requirement, appending the suffix "_x" to the current entity name to form a new entity E_x and adding it to the knowledge base KB, where x indicates that the current entity E is the x-th distinct entity with this name;
S407, combining the knowledge graph G_1 constructed in step S3 with the knowledge base KB: every node of G_1 whose name is the current name M is replaced by the name of the entity E corresponding to it in the knowledge base, obtaining the knowledge graph G_2.
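The disambiguation loop of steps S403-S406 can be sketched as follows; the kb structure, the similarity callable and the threshold value are illustrative assumptions standing in for the trained BERT+NED model and the knowledge base described above.

```python
def disambiguate(mention, context, kb, similarity, threshold=0.5):
    """kb maps an entity name to the list of context sentences already collected for it;
    similarity(t1, t2) stands for the BERT+NED model applied to "[CLS] T_M [SEP] T_E [SEP]"."""
    candidates = [name for name in kb
                  if name == mention or name.startswith(mention + "_")]   # same-named keys, incl. E_x
    scored = [(name, max(similarity(context, c) for c in kb[name])) for name in candidates]
    accepted = [(name, s) for name, s in scored if s > threshold]          # candidate list L
    if accepted:
        best = max(accepted, key=lambda x: x[1])[0]   # E_0 with the highest similarity
        kb[best].append(context)                      # merge T_M into T_E0
        return best
    new_name = mention if not candidates else f"{mention}_{len(candidates)}"  # new entity E_x
    kb[new_name] = [context]
    return new_name
```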
S5, predicting and completing the missing triples in the knowledge graph, and constructing the knowledge graph from the obtained triples; according to the prediction object, this is divided into head entity prediction, tail entity prediction and relation prediction.
S501, taking the knowledge graph G_2 constructed in step S4: for every edge of G_2, the two entities linked by the edge form, together with the edge, a triple comprising a head entity, a relation and a tail entity in the format {h, r, t}; the triples of all edges of G_2 form the existing relation triple list L;
S502, sequentially splicing the head entity, relation and tail entity of each triple in the existing relation triple list L to obtain "[CLS] h [SEP] r [SEP] t [SEP]" as a training sample, with its rationality marked as 1 as the sample label, to obtain the training data; the training data corresponding to all triples in the relation triple list L form a rationality data set, and random negative sampling is carried out during training to provide unreasonable samples for training.
Constructing a rationality neural network model: the model comprises a BERT model and a neural network KC, wherein the input of the BERT model is the input of the rationality model, the output of the BERT model is used as the input of the neural network KC, and the output of the neural network KC is used as the output of the rationality model;
training the rationality neural network model using the rationality data set, storing the model parameters after training, and directly loading the model parameters when rationality judgment is carried out;
s503, sequentially selecting two entities in the entity identification result as a head entity h and a tail entity t, determining a candidate relation list according to the entity type, selecting one relation r, entering into step S504, and if the candidate relation list is empty, re-executing step S503;
s504, judging whether the current triplet (h, r, t) is contained in the existing triplet list, if not, taking the current triplet as input, and executing step S505; if so, discarding the triplet, and executing step S503;
S505, sequentially splicing the head entity h, the relation r and the tail entity t to finally obtain "[CLS] h [SEP] r [SEP] t [SEP]";
S506, inputting "[CLS] h [SEP] r [SEP] t [SEP]" into the trained rationality neural network model for processing, wherein the output value is the rationality of the triplet; if the output value indicates a rational triplet, executing step S507, otherwise discarding the triplet;
s507, adding the triples into the existing triples list of the knowledge graph;
S508, after the rationality calculation of all triples is completed, the existing triple list is used to complete the knowledge graph G_2 obtained in step S4, i.e. the missing edges in the knowledge graph are supplemented, obtaining the finally constructed knowledge graph G.
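The completion loop of steps S503-S507 can be sketched as follows; the candidate_relations mapping, the triple_rationality callable and the threshold are illustrative stand-ins for the candidate relation lists and the trained rationality model described above.

```python
def complete_graph(entities, candidate_relations, existing, triple_rationality, threshold=0.5):
    """entities: list of (name, type) pairs from entity recognition; candidate_relations maps
    (head_type, tail_type) to the relations allowed between those types; triple_rationality
    scores the spliced "[CLS] h [SEP] r [SEP] t [SEP]" with the BERT+KC model."""
    completed = set(existing)                          # existing relation triple list of G_2
    for h, h_type in entities:
        for t, t_type in entities:
            if h == t:
                continue
            for r in candidate_relations.get((h_type, t_type), []):
                if (h, r, t) in completed:
                    continue                           # already an edge of the graph
                if triple_rationality(f"[CLS]{h}[SEP]{r}[SEP]{t}[SEP]") > threshold:
                    completed.add((h, r, t))           # supplement the missing edge
    return completed
```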
A BERT-based knowledge graph construction system, comprising:
the entity identification module is used for positioning and extracting the entity from the unstructured text data, completing entity identification and providing an entity source for the relation extraction module; the types of entities to be identified can be adjusted incrementally, and the module has a certain zero-shot capability.
The relation extraction module is used for extracting the relation between the entities, namely predicting the relation between the target entities according to the input text content and the input two target entities;
the coreference resolution module is used for locating entity names and pronouns thereof and clustering references of the same entity together, so as to connect a plurality of relations of the same entity to one entity node;
the entity disambiguation module is used for calculating the similarity between the entity names and the entities and corresponding the entity names to the entities in the knowledge graph;
the knowledge graph completion module is used for predicting and completing the missing triples in the knowledge graph, and is divided into head entity prediction, tail entity prediction and relation prediction according to different prediction objects.
The beneficial effects of the invention are as follows: (1) The invention uses a large natural language processing model as the tool in each step of knowledge graph construction, which offers higher accuracy than traditional models.
(2) The invention provides an entity disambiguation method that uses the entity context (the sentences in which the entity appears) in place of entity description text; the method is completely independent of any external knowledge base, supports automatic construction of the knowledge base from scratch, and is suitable for situations in which no external knowledge base is available for entity disambiguation.
(3) The invention provides a method for determining candidate relations to form candidate triples by utilizing entity types in entity identification results, calculating rationality of the triples by utilizing a BERT model and further completing a knowledge graph. Compared with other methods, the method can encode more semantic information, and has higher accuracy due to the constraint of entity relation types.
(4) In the invention, all knowledge graph construction tasks are integrated on the same basic BERT model; the different tasks differ only in how they process the input text data and how they process the feature vectors produced by the BERT model, so that multiple tasks can be executed without loading multiple basic BERT models at the same time, reducing the performance requirements.
Drawings
FIG. 1 is a general flow chart of knowledge graph construction according to the present invention;
FIG. 2 is a flow chart of entity identification according to the present invention;
FIG. 3 is a flow chart of relationship extraction according to the present invention;
FIG. 4 is a coreference resolution flow chart of the present invention;
FIG. 5 is a flow chart of entity disambiguation in accordance with the present invention;
FIG. 6 is a knowledge graph completion flow chart of the present invention;
fig. 7 is a system architecture diagram of the present invention.
Detailed Description
The technical solution of the present invention will be described in further detail with reference to the accompanying drawings, but the scope of the present invention is not limited to the following description.
As shown in fig. 1, a BERT-based knowledge graph construction method includes the following steps:
s1, positioning and extracting entities from unstructured text data to complete entity identification, wherein the specific process is shown in FIG. 2:
For an input text, it is first determined whether it exceeds the maximum text length allowed for entity recognition; if it does, the text must be truncated. (To preserve semantic integrity, after cutting the text to the maximum length, a semantic terminator is searched for from back to front, and the text is cut at that position as the final truncation point.) The text meeting the length requirement is fed into the tokenizer for encoding, the encoding result is input into the BERT model for processing to obtain feature vectors, the feature vectors are processed by the neural network NER to obtain the positions of the target entities, and finally the entities are extracted according to these positions to obtain the entity recognition result. If the text was truncated, all the remaining text must be processed in the same way, and the entity recognition results are finally merged to obtain the entity recognition result for the whole text.
The BERT model, built on the Transformer, can better extract semantic patterns from the context and can therefore better encode text into feature vectors. BERT (Bidirectional Encoder Representations from Transformers) is a language representation model; the large natural language processing model used in all of the above processes is a BERT model, and different custom neural network layers are added on top of it to serve the different tasks.
S2, extracting the relation among entities according to the text content, wherein the specific process is shown in figure 3;
First, the mapping between sequence numbers and relations is obtained from the existing relation list. Pairs of entities in the same sentence are selected from the entity recognition result, a candidate relation list is determined according to the entity types in the entity recognition result, the entity pair and the sentence in which it occurs are spliced and fed into the tokenizer for encoding, the encoding result is input into the BERT model for processing to obtain a feature vector, and the feature vector is further processed by the neural network NRE to obtain a one-dimensional vector representing the similarity between the relation of the current entity pair and each of the pre-trained relations. The maximum value among the entries contained in the candidate relation list is taken and compared with the set relation similarity threshold: if it exceeds the threshold, the relation with the corresponding sequence number is output; otherwise, an unknown relation is output.
S3, locating entity names and their pronouns, clustering the references of the same entity together, and connecting the multiple relations of the same entity to one entity node to complete coreference resolution, the specific process being shown in FIG. 4;
First, it is determined whether the length of the input text exceeds the maximum text length acceptable for coreference resolution; if it does, the input text must be truncated. (If this is the first truncation, a fixed-length text is cut directly from position 0; if it is not the first truncation, the largest starting position from which every entity cluster of the previous coreference resolution result still has at least one reference is obtained, as in step S303, and the fixed-length text is cut starting from that position.) The text meeting the length requirement is first fed into the tokenizer for encoding, the encoding result is input into the BERT model for processing to obtain a feature vector, and the feature vector is processed by the neural network CR to obtain a clustering result. If any input text remains unprocessed, the text is truncated again from the new position and the above process is repeated. After the whole input text has gone through coreference resolution, the clustering results of the multiple passes are de-duplicated and merged to obtain the final coreference resolution result.
S4, calculating the similarity between the entity names and the entities, and corresponding the entity names to the entities in the knowledge graph, wherein the specific process is shown in FIG. 5;
First, an entity dictionary is established, with the entity name as the key and the sentences in which the entity name appears as the value. For each entity name in the entity recognition result, it is judged whether the current dictionary already contains the same key; if not, the entity name is a new entity name, and the entity name and the sentence in which it appears are inserted directly into the entity dictionary. If one or more corresponding keys exist in the entity dictionary, the corresponding values are taken out of the dictionary, the current sentence is spliced with each value and fed into the tokenizer for encoding, the encoding result is input into the BERT model for processing to obtain feature vectors, and the value obtained by processing the feature vectors with the neural network NED is the similarity of the two sentences. Whether the current entity name points to the same entity as a key in the entity dictionary is judged by whether the similarity is larger than a threshold; if it is the same entity, the sentence containing the current entity name is added to the value of the corresponding key and the entity dictionary is updated; if it is not the same entity, the suffix "_x" is appended to the current entity name before it is inserted into the entity dictionary. (x indicates which same-named entity this is, and the suffix does not affect the judgment of whether entity names are the same, i.e. A = A_x during disambiguation.)
S5, predicting and complementing the missing triples in the knowledge graph, and constructing the knowledge graph by using the obtained triples; according to the difference of the prediction objects, the prediction is divided into head entity prediction, tail entity prediction and relation prediction, and the specific process is shown in fig. 6.
First, a triple list is established from the entity recognition result, the relation extraction result, the coreference resolution result and the entity disambiguation result. Entity pair lists are then constructed pairwise; an entity pair is selected from the entity pair list to obtain a head entity h and a tail entity t, the possible candidate relation list R between h and t is determined according to the types of h and t, and if this list is empty another entity pair is selected. If the list is not empty, one relation r is taken so that r, h and t form a triple; the spliced text of the triple is fed into the tokenizer for encoding, the encoding result is input into the BERT model for processing to obtain a feature vector, the feature vector is processed by the neural network KC to obtain the rationality value of the current triple, and whether the triple is rational is judged by whether this value is larger than the set rationality threshold. If it is, the triple is added to the triple list; otherwise the triple is discarded. All entity pairs are traversed and the above process is repeated.
In the large-model-based knowledge graph construction system of this embodiment, entity recognition is first performed on the input text source data; second, relation extraction is carried out based on the entity recognition result; third, coreference resolution is performed on the source text data to match entities with their corresponding reference words; fourth, entity disambiguation is performed according to the entity recognition result to distinguish different entities with the same name; fifth, the missing triples are completed by the knowledge graph completion service according to the entity recognition result and the relation extraction result; finally, the knowledge graph is constructed from the obtained triples.
As shown in fig. 7, a BERT-based knowledge graph construction system includes:
the entity identification module is used for positioning and extracting the entity from the unstructured text data, completing entity identification and providing an entity source for the relation extraction module;
the relation extraction module is used for extracting the relation between the entities, namely predicting the relation between the target entities according to the input text content and the input two target entities;
the coreference resolution module is used for locating entity names and pronouns thereof and clustering references of the same entity together, so as to connect a plurality of relations of the same entity to one entity node;
the entity disambiguation module is used for calculating the similarity between the entity names and the entities and corresponding the entity names to the entities in the knowledge graph;
the knowledge graph completion module is used for predicting and completing the missing triples in the knowledge graph, and is divided into head entity prediction, tail entity prediction and relation prediction according to different prediction objects.
While the foregoing illustrates and describes a preferred embodiment of the present invention, it is to be understood that the invention is not limited to the form disclosed herein and is not to be construed as excluding other embodiments; it is capable of use in various other combinations, modifications and environments, and of changes within the scope of the inventive concept described herein, whether in light of the above teachings or of the knowledge or skill of the relevant art. Modifications and variations that do not depart from the spirit and scope of the invention are intended to fall within the scope of the appended claims.

Claims (2)

1. A BERT-based knowledge graph construction method, characterized by comprising the following steps:
s1, positioning and extracting an entity E from unstructured text data to complete entity identification;
s2, extracting a relation R between entities according to the text content, and completing the preliminary construction of the knowledge graph by using a relation extraction result;
S3, locating entity names M and entity pronouns P, clustering the references of the same entity together, and connecting the multiple relations of the same entity E to a single entity node N_E in the knowledge graph, completing coreference resolution;
s4, calculating the similarity between the entity name M and the entity E, and corresponding the entity name M to the entity in the knowledge graph;
S5, predicting and completing the missing triples in the knowledge graph, and constructing the knowledge graph from the obtained triples;
the step S1 includes:
s101, modeling an entity E: giving entity types needing to be collected, and setting attributes needing to be collected of each entity type;
s102, preparing a corresponding data set aiming at entity types, wherein each piece of training data comprises a source text, an entity contained in the text and the position of the entity in the source text, and each piece of training data comprises a source text serving as a training sample and a corresponding entity and the position of the entity in the source text serving as sample labels;
constructing an entity identification model: the entity recognition model comprises a BERT model and a neural network NER, wherein the input of the BERT model is the input of the entity recognition model, the output of the BERT model is used as the input of the neural network NER, and the output of the neural network NER is used as the output of the entity recognition model;
training the entity recognition model by utilizing the data set, storing the model parameters after training, and directly loading the model parameters when entity recognition is carried out on the text;
s103, inputting a text which needs to be subjected to entity recognition, determining whether the text needs to be truncated or not by judging whether the length exceeds a preset maximum length L, if yes, performing step S104, otherwise, directly taking the text content as a text segment which is actually subjected to entity recognition, and entering step S105;
s104, cutting off the text, firstly, cutting off the text from the beginning of the text according to a preset maximum length L, then searching for a first semantic termination symbol from the end of a text result obtained by the primary cutting off, and taking the position where the symbol appears as an actual position Pos of secondary cutting off so as to ensure the integrity of sentence semantics in the primary cutting off process; the text is cut off twice, the length is not more than the maximum length L, the text which is cut off is regarded as the text segment which is actually recognized by the entity, the initial position offset of the current cut-off text in the original text is recorded, and the step S105 is entered;
s105, inputting text fragments which are actually subjected to entity recognition into an entity recognition model to obtain an entity recognition result, wherein the entity recognition result comprises different types of entities and the positions pos of the entities in the current truncated text; according to the initial position offset of the current truncated text in the original text, calculating to obtain the actual position of the entity in the text as pos+offset;
s106, removing all texts from the Pos position to the front, taking the rest texts as input, returning to the step S103 until all entity identification in the source text is finished, collecting entity identification results of all truncated texts, and combining the entity identification results into a complete entity identification result;
the step S2 includes:
s201, a preset relation R needing to be extracted is obtained, wherein R is { h, R, t }, h represents a head entity type, R represents a relation type, t represents a tail entity type, and then each relation needing to be extracted is numbered to obtain a list L of one-to-one mapping relation between a characterization relation and the number R Represented by [ R ] 1 , R 2 , R 3 , ......];
S202, preparing a corresponding data set according to a preset extraction relation, wherein each piece of training data comprises a head entity, a tail entity, a source text and a one-dimensional vector representing a relation type;
in each piece of training data, the source text, head entity and tail entity are used as the training sample, and the label is a one-dimensional vector representing the relation type, where the value of each dimension of the one-dimensional vector is the similarity between the relation of the head and tail entities in this piece of training data and the relation with the corresponding number in the preset relation list L_R;
building a relation extraction model: the relation extraction model comprises a BERT model and a neural network NRE, wherein the input of the BERT model is the input of the relation extraction model, the output of the BERT model is used as the input of the neural network NRE, and the output of the neural network NRE is used as the output of the relation extraction model;
training the relation extraction model by utilizing a data set, storing the model parameters after training, and directly loading the model parameters when extracting the relation;
s203, starting from the beginning of an input text, dividing the input text into complete sentences according to semantic terminators to obtain a sentence list, and determining the starting position and the ending position of each sentence in the input text;
sequentially selecting each sentence as the sentence for relation extraction, arranging the entity recognition results in increasing order of the entities' positions in the source text, removing the entities whose positions are smaller than the start position or larger than the end position of the current sentence, and randomly selecting two entities a and b from the remaining entities;
S204, according to the entity types of a and b in the entity recognition result, comparing them with the head entity type h and the tail entity type t of each relation R to determine the candidate relations R, and taking the subscripts of all candidate R in the relation list L_R as sequence numbers to obtain a candidate relation sequence number list L;
S205, splicing the names of the two entities a and b and the sentence T to obtain "[CLS] a [SEP] b [SEP] T [SEP]";
S206, inputting "[CLS] a [SEP] b [SEP] T [SEP]" into the relation extraction model to obtain a one-dimensional vector; taking, from this vector, all values whose indices appear in the candidate relation sequence number list L and taking their maximum; judging whether this maximum exceeds a set threshold lim: if it is larger than the threshold, the corresponding relation is the relation between the two entities a and b and is output directly; if it is smaller than the threshold, the relation between a and b is output as an unknown relation; if the output relation r is not unknown, recording {a, r, b} into the relation extraction result;
s207, returning to the step S203 until the relation extraction of the entities contained in all sentences of the input text is completed, and then executing the step S208;
S208, for each piece of data {a, r, b} in the relation extraction result, preliminarily constructing the knowledge graph by taking a and b as entity nodes and r as an edge, obtaining G_0;
The step S3 includes:
S301, preparing a coreference resolution data set, wherein each piece of data contains a source text and the clustering result of the entity references in the text; the clustering result contains a plurality of clusters in the format [[Pos_11, Pos_12, Pos_13, ...], [Pos_21, Pos_22, Pos_23, ...], [Pos_31, Pos_32, Pos_33, ...], ...];
Constructing a coreference resolution model: the coreference resolution model comprises a BERT model and a neural network CR, wherein the input of the BERT model is the input of the coreference resolution model, the output of the BERT model is used as the input of the neural network CR, and the output of the neural network CR is used as the output of the coreference resolution model;
training the coreference resolution model by utilizing the data set, storing the model parameters after training, and directly loading the model parameters when carrying out coreference resolution;
s302, determining whether the text needs to be truncated by judging whether the length of the input text exceeds a preset maximum text length, if so, executing a step S303, otherwise, taking the input text as a processing target, and executing a step S304;
S303, truncating the input text: if this is the first truncation, the beginning of the text is taken directly as the start position P_s of the forward truncation; if it is not the first truncation, the largest reference start position in each row of the previous coreference resolution result R is selected, the result for the i-th row being denoted Pos_i,max, giving [Pos_1,max, Pos_2,max, Pos_3,max, ...], denoted L_max; the minimum of L_max is then selected as the start position P_s of the current truncated text, ensuring that the new truncation result can contain all entities of the previous resolution result; a fixed-length text is intercepted from the start position, the first terminator is searched for from the end of this text backwards, and the position where it appears is taken as the end position P_e of the interception, to ensure the integrity of the sentence semantics within one truncation; the text T between position P_s and position P_e of the source text is intercepted and taken as the processing target, and step S304 is executed;
S304, inputting the processing target into the coreference resolution model to obtain a two-dimensional list in which each row is the list of references belonging to the same entity, i.e. a clustering result in the format [[Pos_11, Pos_12, Pos_13, ...], [Pos_21, Pos_22, Pos_23, ...], [Pos_31, Pos_32, Pos_33, ...], ...]; this result is recorded as the coreference resolution result R;
S305, taking the text remaining after truncation as input and returning to step S302 until no text remains; then merging the coreference resolution results R of all text fragments T and removing references whose positions are repeated;
S306, combining, according to the coreference resolution result R, the vertices of the knowledge graph G_0 constructed in step S2 that belong to the same cluster, obtaining the knowledge graph G_1;
The step S4 includes:
s401, establishing an empty entity dictionary as a knowledge base KB, taking the name of an entity E as a key, and taking an entity context as a value, wherein the entity context is represented by sentences in which the entity name appears;
s402, preparing a sentence similarity data set, wherein each piece of data contains two sentences and the similarity thereof, taking the two sentences in each piece of data as samples, and taking the similarity as a sample label;
constructing a similarity recognition model: the similarity recognition model comprises a BERT model and a neural network NED, wherein the input of the BERT model is the input of the similarity recognition model, the output of the BERT model is used as the input of the neural network NED, and the output of the neural network NED is used as the output of the similarity recognition model;
training the similarity recognition model using the similarity data set, storing the model parameters after training, and directly loading the model parameters when similarity recognition is performed;
s403, judging whether the current entity dictionary contains entity key values with the same name or not according to the entity name M in the entity identification result, if not, executing the step S404, otherwise, executing the step S405;
s404, adding the entity name M as a new entity E into a knowledge base KB as a key, and taking a sentence in which the entity name M is positioned, namely a context in which the entity appears, as a corresponding value;
S405, sequentially taking every entity name E in the entity dictionary whose key is the same as the current entity name M, splicing the context T_M in which M is located with the context T_E corresponding to the entity name E in the entity dictionary to obtain "[CLS] T_M [SEP] T_E [SEP]", and entering step S406;
S406, inputting "[CLS] T_M [SEP] T_E [SEP]" into the similarity recognition model to obtain the similarity of the two contexts T_M and T_E; if the similarity exceeds the set threshold, adding the entity name E in the entity dictionary to a candidate entity name list L; taking the entity name E_0 with the highest similarity in the candidate entity name list L as the disambiguation result, and merging the entity context T_M corresponding to the current entity name M into the entity context T_E0 of E_0;
if the similarity between the entity contexts of all entity names in L and the entity context of the current entity name does not meet the requirement, appending the suffix "_x" to the current entity name to form a new entity E_x and adding it to the knowledge base KB, where x indicates that the current entity E is the x-th distinct entity with this name;
S407, combining the knowledge graph G_1 constructed in step S3 with the knowledge base KB: every node of G_1 whose name is the current name M is replaced by the name of the entity E corresponding to it in the knowledge base, obtaining the knowledge graph G_2;
The step S5 includes:
s501, taking the knowledge graph G constructed in the step S4 2 For G 2 Taking two entities linked with the three sides to form a triplet together with the three sides, wherein the triplet comprises a head entity, a relation and a tail entity, the format is { h, r, t }, and G is taken 2 Ternary organization of all edgesForming an existing relation triplet list L;
S502, concatenating the head entity, relation and tail entity of each triple in the existing relation triple list L in turn to obtain [CLS] h [SEP] r [SEP] t [SEP] as a training sample, marking its rationality as 1 and using that as the sample label; the training data corresponding to all triples in the relation triple list L form a rationality data set, and random negative sampling is performed during training to supply unreasonable samples;
constructing a rationality neural network model: it consists of a BERT model and a neural network KC, wherein the input of the BERT model is the input of the rationality model, the output of the BERT model serves as the input of the neural network KC, and the output of the neural network KC serves as the output of the rationality model;
training the rationality neural network model with the rationality data set, saving the model parameters after training, and loading them directly whenever rationality judgment is performed (the training-data construction of step S502 is sketched below);
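A minimal sketch of the S502 training-data construction follows, under the assumption that negatives are generated by randomly corrupting the tail entity; the patent only states that random negative sampling supplies unreasonable samples, so the corruption scheme and function names here are illustrative. The rationality model itself can reuse the BERT-plus-linear-head pattern of the S402 sketch, with the head standing in for the network KC.

```python
# Sketch of the S502 training data: positives from the existing triple list L,
# negatives generated by randomly corrupting the tail entity (an assumed scheme).
import random

def build_rationality_samples(triples: list[tuple[str, str, str]],
                              entities: list[str],
                              neg_per_pos: int = 1):
    existing = set(triples)
    samples = []   # (text, label); the tokenizer later wraps the text as [CLS] h [SEP] r [SEP] t [SEP]
    for h, r, t in triples:
        samples.append((f"{h} [SEP] {r} [SEP] {t}", 1.0))        # rationality label 1
        for _ in range(neg_per_pos):                             # random negative sampling
            t_neg = random.choice(entities)
            if (h, r, t_neg) not in existing:
                samples.append((f"{h} [SEP] {r} [SEP] {t_neg}", 0.0))
    return samples
```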
S503, selecting in turn two entities from the entity recognition result as the head entity h and tail entity t, determining a candidate relation list according to their entity types, selecting one relation r and proceeding to step S504; if the candidate relation list is empty, re-executing step S503;
S504, judging whether the current triple (h, r, t) is already contained in the existing triple list; if not, taking the current triple as input and executing step S505; if so, discarding the triple and executing step S503;
S505, concatenating the head entity h, the relation r and the tail entity t in turn to obtain [CLS] h [SEP] r [SEP] t [SEP];
S506, inputting [CLS] h [SEP] r [SEP] t [SEP] into the trained rationality neural network model, whose output value is the rationality of the triple; if the output indicates the triple is rational, executing step S507, otherwise discarding the triple;
S507, adding the triple to the existing triple list of the knowledge graph;
S508, after the rationality calculation of all triples is completed, using the existing triple list to complete the knowledge graph G_2 obtained in step S4, i.e. supplementing the missing edges of the graph, to obtain the finally constructed knowledge graph G (a sketch of this completion loop follows).
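Steps S503 to S508 amount to the enumeration-and-scoring loop sketched below; the mapping from entity-type pairs to candidate relations, the 0.5 rationality threshold and `score_triple` (assumed to wrap the trained rationality model) are illustrative assumptions.

```python
# Sketch of steps S503-S508: enumerate candidate triples and keep those the rationality model accepts.
from itertools import permutations

def complete_graph(entities: dict[str, str],                 # entity name -> entity type
                   existing: set[tuple[str, str, str]],      # existing triple list from S501/S507
                   candidate_relations: dict,                # (head type, tail type) -> list of relations
                   score_triple,                             # wraps the trained rationality model
                   threshold: float = 0.5):
    added = []
    for (h, h_type), (t, t_type) in permutations(entities.items(), 2):   # S503: choose h and t
        for r in candidate_relations.get((h_type, t_type), []):          # S503: candidate relations
            triple = (h, r, t)
            if triple in existing:                           # S504: skip triples already in the graph
                continue
            if score_triple(h, r, t) > threshold:            # S505/S506: score [CLS] h [SEP] r [SEP] t [SEP]
                existing.add(triple)                         # S507: add to the existing triple list
                added.append(triple)
    return added                                             # S508: edges that complete G_2 into G
```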
2. A BERT-based knowledge graph construction system, based on the method of claim 1, characterized in that it comprises:
the entity recognition module, used to locate and extract entities from unstructured text data, completing entity recognition and providing entity sources for the relation extraction module;
the relation extraction module, used to extract relations between entities, i.e. to predict the relation between two target entities according to the input text content and the two input target entities;
the coreference resolution module, used to locate entity names and their pronouns and to cluster mentions of the same entity together, so that the multiple relations of one entity are connected to a single entity node;
the entity disambiguation module, used to compute the similarity between entity mentions and entities and to map each entity mention to an entity in the knowledge graph;
the knowledge graph completion module, used to predict and complete the missing triples of the knowledge graph, divided into head entity prediction, tail entity prediction and relation prediction according to the object being predicted.
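Read as software architecture, the five modules of claim 2 can be wired into one construction pipeline roughly as sketched below; the class and method names are illustrative and not taken from the patent.

```python
# Illustrative wiring of the claim-2 modules into one construction pipeline (names are not from the patent).
class KnowledgeGraphBuilder:
    def __init__(self, ner, relation_extractor, coreference, disambiguator, completer):
        self.ner = ner                     # entity recognition module
        self.rel = relation_extractor      # relation extraction module
        self.coref = coreference           # coreference resolution module
        self.ned = disambiguator           # entity disambiguation module
        self.completer = completer         # knowledge graph completion module

    def build(self, text: str):
        entities = self.ner.extract(text)                # locate and extract entities from the raw text
        triples = self.rel.extract(text, entities)       # predict relations between entity pairs
        graph = self.coref.merge(triples)                # merge mentions of the same entity (G_0 -> G_1)
        graph = self.ned.link(graph, text)               # map mentions to knowledge-base entities (G_1 -> G_2)
        return self.completer.complete(graph, entities)  # predict missing triples (G_2 -> G)
```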
CN202311669625.2A 2023-12-07 2023-12-07 Knowledge graph construction method and system based on BERT Active CN117371534B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202311669625.2A CN117371534B (en) 2023-12-07 2023-12-07 Knowledge graph construction method and system based on BERT

Publications (2)

Publication Number Publication Date
CN117371534A CN117371534A (en) 2024-01-09
CN117371534B (en) 2024-02-27

Family

ID=89406258

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202311669625.2A Active CN117371534B (en) 2023-12-07 2023-12-07 Knowledge graph construction method and system based on BERT

Country Status (1)

Country Link
CN (1) CN117371534B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN118132768A (en) * 2024-05-08 2024-06-04 青岛国创智能家电研究院有限公司 Method for constructing diet knowledge graph, storage medium and program product

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201904167D0 (en) * 2019-03-26 2019-05-08 Benevolentai Tech Limited Name entity recognition with deep learning
CN113204649A (en) * 2021-05-11 2021-08-03 西安交通大学 Legal knowledge graph construction method and equipment based on entity relation joint extraction
CN114611844B (en) * 2022-05-11 2022-08-05 北京科技大学 Method and system for determining alloy addition amount in converter tapping process

Patent Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111882203A (en) * 2020-07-24 2020-11-03 山东管理学院 Traditional Chinese medicine cloud service experimental system
WO2022022045A1 (en) * 2020-07-27 2022-02-03 平安科技(深圳)有限公司 Knowledge graph-based text comparison method and apparatus, device, and storage medium
CN112100324A (en) * 2020-08-28 2020-12-18 广州探迹科技有限公司 Knowledge graph automatic check iteration method based on greedy entity link
CN112989836A (en) * 2021-04-26 2021-06-18 华东理工大学 Text structuring method based on pre-training model and position information
CN114077673A (en) * 2021-06-21 2022-02-22 南京邮电大学 Knowledge graph construction method based on BTBC model
WO2023092985A1 (en) * 2021-11-26 2023-06-01 永中软件股份有限公司 Automatic extraction method for entity knowledge, and computer apparatus and computer-readable medium
CN115237967A (en) * 2022-07-13 2022-10-25 美的集团(上海)有限公司 Scene recommendation method, electronic device, storage medium and product
CN115238101A (en) * 2022-09-23 2022-10-25 中国电子科技集团公司第十研究所 Multi-engine intelligent question-answering system oriented to multi-type knowledge base
CN115510249A (en) * 2022-10-26 2022-12-23 中国电信股份有限公司 Knowledge graph construction method and device, electronic equipment and storage medium
CN116244448A (en) * 2023-02-24 2023-06-09 中国电子科技集团公司第十研究所 Knowledge graph construction method, device and system based on multi-source data information
CN116628172A (en) * 2023-07-24 2023-08-22 北京酷维在线科技有限公司 Dialogue method for multi-strategy fusion in government service field based on knowledge graph

Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Named Entity Resolution in Personal Knowledge Graphs; Mayank Kejriwal; arXiv:2307.12173v1; 1-22 *
KGNER: Improving Chinese Named Entity Recognition by BERT Infused with the Knowledge Graph; Weiwei Hu et al.; Appl. Sci., Vol. 12, No. 15; 1-16 *
Chinese Person Name Recognition Based on Deep Learning and Coreference Resolution; Chen Yu et al.; Frontiers of Data and Computing, Vol. 4, No. 2; 63-73 *
Research on Knowledge Graph Construction Technology for the Cultural Relics Domain; Zhang Min; China Doctoral Dissertations Full-text Database, Philosophy and Humanities; F097-1 *

Similar Documents

Publication Publication Date Title
CN108959242B (en) Target entity identification method and device based on part-of-speech characteristics of Chinese characters
CN109543178B (en) Method and system for constructing judicial text label system
CN110134946B (en) Machine reading understanding method for complex data
CN110019732B (en) Intelligent question answering method and related device
CN106598959B (en) Method and system for determining mutual translation relationship of bilingual sentence pairs
CN112000815B (en) Knowledge graph complementing method and device, electronic equipment and storage medium
CN111966812B (en) Automatic question answering method based on dynamic word vector and storage medium
CN117371534B (en) Knowledge graph construction method and system based on BERT
CN109271524B (en) Entity linking method in knowledge base question-answering system
CN109684928B (en) Chinese document identification method based on internet retrieval
CN111897917B (en) Rail transit industry term extraction method based on multi-modal natural language features
CN111858888A (en) Multi-round dialogue system of check-in scene
CN111858842A (en) Judicial case screening method based on LDA topic model
CN109086274B (en) English social media short text time expression recognition method based on constraint model
CN112329482A (en) Machine translation method, device, electronic equipment and readable storage medium
CN116244445B (en) Aviation text data labeling method and labeling system thereof
CN111680512A (en) Named entity recognition model, telephone exchange switching extension method and system
CN114764566B (en) Knowledge element extraction method for aviation field
CN110502759B (en) Method for processing Chinese-Yue hybrid network neural machine translation out-of-set words fused into classification dictionary
CN115618883A (en) Business semantic recognition method and device
CN110929022A (en) Text abstract generation method and system
CN115270774B (en) Big data keyword dictionary construction method for semi-supervised learning
CN110874408B (en) Model training method, text recognition device and computing equipment
CN115858733A (en) Cross-language entity word retrieval method, device, equipment and storage medium
CN116227486A (en) Emotion analysis method based on retrieval and contrast learning

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant