CN115687651A - Knowledge graph construction method and device, electronic equipment and storage medium - Google Patents

Knowledge graph construction method and device, electronic equipment and storage medium

Info

Publication number
CN115687651A
Authority
CN
China
Prior art keywords
entity
relation
relationship
triplet
triple
Prior art date
Legal status (assumption, not a legal conclusion)
Pending
Application number
CN202211384778.8A
Other languages
Chinese (zh)
Inventor
杨磊
刘权
Current Assignee (the listed assignee may be inaccurate)
iFlytek Co Ltd
Original Assignee
iFlytek Co Ltd
Priority date (assumption, not a legal conclusion)
Filing date
Publication date
Application filed by iFlytek Co Ltd filed Critical iFlytek Co Ltd
Priority to CN202211384778.8A priority Critical patent/CN115687651A/en
Publication of CN115687651A publication Critical patent/CN115687651A/en
Pending legal-status Critical Current

Abstract

The disclosure provides a knowledge graph construction method and device, electronic equipment, and a storage medium, relating to the technical field of natural language processing. The method comprises: parsing text data to generate an abstract semantic representation of the text data; extracting, according to the abstract semantic representation, a first entity and a first entity relationship in the text data, the first entity and the first entity relationship corresponding to each other; constructing a triple from the first entity and the first entity relationship; verifying the triple; and, if the triple passes verification, constructing a target knowledge graph from the triple. According to the embodiments of the disclosure, a highly accurate knowledge graph can be constructed without manually labeled sample data, which improves the efficiency of knowledge graph construction and saves a large amount of labor cost.

Description

Knowledge graph construction method and device, electronic equipment and storage medium
Technical Field
The present disclosure relates to the field of natural language processing technologies, and in particular, to a method and an apparatus for constructing a knowledge graph, an electronic device, and a storage medium.
Background
A knowledge graph is a symbolic language that takes triples as its basic building blocks and structurally describes knowledge of the complex real world, that is, the interrelations between concepts and entities. Through this structured representation of knowledge, computer systems can compute over and process external common knowledge and professional knowledge.
In the related art, knowledge graphs are usually constructed by extracting entities and entity relationships from text with natural language processing techniques. However, because this extraction relies on deep learning models, a large amount of carefully hand-labeled corpus data is required to train those models if the accuracy of the constructed graph is to be guaranteed. This not only limits the efficiency of knowledge graph construction but also consumes substantial labor.
Disclosure of Invention
In view of this, the present disclosure provides a knowledge graph construction method and apparatus, an electronic device, and a storage medium, which can complete the construction of a knowledge graph without manually labeled sample data.
In a first aspect, a method for constructing a knowledge graph is provided, which includes: parsing text data to generate an abstract semantic representation of the text data; extracting a first entity and a first entity relationship in the text data according to the abstract semantic representation, wherein the first entity and the first entity relationship have a corresponding relationship; and constructing a target knowledge graph according to the first entity and the first entity relationship.
In some embodiments, constructing the target knowledge graph from the first entity and the first entity relationship comprises: performing entity disambiguation on the first entity to obtain a second entity; and constructing the target knowledge graph according to the second entity and the first entity relationship.
In some embodiments, performing entity disambiguation on the first entity to obtain the second entity comprises: screening a plurality of third entities from a preset entity library according to the string similarity between the first entity and each preset entity in the preset entity library; calculating the graph similarity between an abstract semantic representation subgraph corresponding to the first entity and a knowledge graph subgraph corresponding to each of the plurality of third entities, wherein the abstract semantic representation subgraph is generated from the adjacency relations of the first entity in the abstract semantic representation, and each knowledge graph subgraph is generated from the adjacency relations of its corresponding third entity in a preset knowledge graph; and linking the first entity with the third entity whose knowledge graph subgraph has the highest graph similarity to the abstract semantic representation subgraph, to obtain the second entity.
In some embodiments, constructing the target knowledge-graph from the second entity and the first entity relationships comprises: mapping the first entity relation into a second entity relation according to the mapping relation between the first entity relation and each entity relation in a preset entity relation library, wherein the second entity relation is an entity relation existing in the preset entity relation library; and constructing a target knowledge graph according to the second entity and the second entity relation.
In some embodiments, the target knowledge graph comprises at least one triple, each triple comprising a head entity, a relationship, and a tail entity, wherein the head entity and the tail entity are constructed from the first entity and the relationship is constructed from the first entity relationship. After the target knowledge graph is constructed according to the first entity and the first entity relationship, the method further comprises: verifying the at least one triple in the target knowledge graph; and removing, from the target knowledge graph, any triple that fails verification.
In some embodiments, verifying the at least one triple in the target knowledge graph comprises: for each triple of the at least one triple, performing mask prediction on the head entity, the relationship, and the tail entity of the triple to obtain the existence probability of the head entity in the triple, the existence probability of the relationship in the triple, and the existence probability of the tail entity in the triple; obtaining the existence probability of the triple from these three probabilities; and checking the triple according to its existence probability.
In some embodiments, for each triple of the at least one triple, performing mask prediction on the head entity, the relationship, and the tail entity of the triple comprises: masking the head entity, the relationship, and the tail entity of the triple respectively to obtain a head-entity-masked triple, a relationship-masked triple, and a tail-entity-masked triple; and inputting the three masked triples into a pre-trained mask prediction model respectively to obtain the existence probability of the head entity in the triple, the existence probability of the relationship in the triple, and the existence probability of the tail entity in the triple.
In some embodiments, checking the triple according to its existence probability comprises: if the existence probability of the triple does not meet a preset check probability condition, determining that the triple fails verification.
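As a sketch of this verification step, the three masked-prediction probabilities can be combined into one existence probability and compared against a check threshold. The combination rule used here (a geometric mean) and the threshold value are illustrative assumptions; the disclosure only states that the triple's existence probability is obtained from the three probabilities, which would come from a pre-trained mask prediction model.

```python
# Hedged sketch: combine the three masked-prediction probabilities
# (head entity, relationship, tail entity) into one existence
# probability, then apply the preset check condition. In practice the
# probabilities would come from a pre-trained mask prediction model;
# here they are plain inputs.

def triple_existence_probability(p_head: float, p_rel: float, p_tail: float) -> float:
    """Geometric mean of the three masked-prediction probabilities
    (an assumed combination rule, not specified by the disclosure)."""
    return (p_head * p_rel * p_tail) ** (1.0 / 3.0)

def passes_check(p_head: float, p_rel: float, p_tail: float,
                 threshold: float = 0.5) -> bool:
    """A triple fails verification when its existence probability
    does not meet the preset check threshold (value assumed)."""
    return triple_existence_probability(p_head, p_rel, p_tail) >= threshold

print(passes_check(0.9, 0.8, 0.7))   # plausible triple
print(passes_check(0.9, 0.05, 0.7))  # implausible relationship
```

Triples that fail the check would be removed from the target knowledge graph, as described above.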
In some embodiments, the textual data is unstructured textual data.
In a second aspect, a knowledge graph construction apparatus is provided, including: a parsing module configured to parse text data and generate an abstract semantic representation of the text data; an extraction module configured to extract a first entity and a first entity relationship in the text data according to the abstract semantic representation, wherein the first entity and the first entity relationship have a corresponding relationship; and a construction module configured to construct a target knowledge graph according to the first entity and the first entity relationship.
In a third aspect, an electronic device is provided, including: a processor; and a memory for storing executable instructions for the processor; wherein the processor is configured to perform the method of the first aspect described above via execution of the executable instructions.
In a fourth aspect, there is provided a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the method of the first aspect described above.
According to the knowledge graph construction method provided by the embodiments of the present disclosure, an abstract semantic representation of the text data is generated and the first entity and the first entity relationship are extracted from the text data, so that entities and entity relationships can be extracted without a deep learning model that must be trained on manually labeled corpus data. This improves the efficiency of knowledge graph construction and saves a large amount of labor cost.
Drawings
Fig. 1 shows a system architecture diagram of a knowledge graph construction method in the embodiment of the present disclosure.
FIG. 2 is a flow chart diagram illustrating a method for constructing a knowledge graph according to an embodiment of the disclosure.
FIG. 3 is a diagram illustrating an abstract semantic representation according to an embodiment of the present disclosure.
FIG. 4 shows a process diagram for entity disambiguation in an embodiment of the disclosure.
FIG. 5 is a flow chart illustrating a method for knowledge graph validation in an embodiment of the present disclosure.
Fig. 6 shows a schematic diagram of a training method for a mask prediction model in an embodiment of the present disclosure.
Fig. 7 shows a schematic structural diagram of a knowledge graph constructing apparatus in an embodiment of the present disclosure.
Fig. 8 shows a schematic structural diagram of an electronic device in an embodiment of the present disclosure.
Detailed Description
Example embodiments will now be described more fully with reference to the accompanying drawings. Example embodiments may, however, be embodied in many different forms and should not be construed as limited to the examples set forth herein; rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the concept of example embodiments to those skilled in the art. The described features, structures, or characteristics may be combined in any suitable manner in one or more embodiments.
Furthermore, the drawings are merely schematic illustrations of the present disclosure and are not necessarily drawn to scale. The same reference numerals in the drawings denote the same or similar parts, and thus their repetitive description will be omitted. Some of the block diagrams shown in the figures are functional entities and do not necessarily correspond to physically or logically separate entities. These functional entities may be implemented in the form of software, or in one or more hardware modules or integrated circuits, or in different networks and/or processor devices and/or microcontroller devices.
A knowledge graph is a symbolic language that takes triples as its basic building blocks and structurally describes knowledge of the complex real world, that is, the interrelations between concepts and entities. Through this structured expression of knowledge, computer systems can compute over and process external common knowledge and professional knowledge, which gives the knowledge graph broad application prospects in fields such as human-computer interaction, healthcare, and the judicial domain.
The sources for constructing a knowledge graph can be divided into structured text (e.g., various types of repositories), semi-structured text, and unstructured text. A knowledge graph built from unstructured text faces more challenges than one built from structured text, since the knowledge contained in freely expressed unstructured text must be completely identified, understood, and formalized. Construction tasks such as entity extraction, relation extraction, entity classification, and entity linking all require deep learning models, so massive amounts of high-quality manually labeled data are indispensable; as a result, the construction cost of knowledge graphs in the related art is extremely high, requiring huge investments of manpower and material resources.
In view of this, the solution provided by the present disclosure generates an abstract semantic representation of the text data, extracts the first entity and the first entity relationship from it, and constructs triples. Triples that pass verification are then used to construct the knowledge graph. On this basis, the embodiment of the present disclosure can construct a knowledge graph in an unsupervised manner; that is, the method can automatically construct a true and accurate knowledge graph from large amounts of text without excessive human input, providing powerful knowledge support for downstream applications.
Fig. 1 shows an exemplary system architecture diagram of a knowledge graph construction method or apparatus that may be applied to embodiments of the present disclosure.
As shown in fig. 1, the system architecture 100 may include a terminal device 101, a building apparatus 102, and a knowledge base 103.
The terminal device 101 may be various electronic devices such as, but not limited to, a smartphone, a tablet, a laptop, a desktop computer, a wearable device, an augmented reality device, a virtual reality device, and the like.
Illustratively, the terminal device 101 is deployed with a client, which may be an application, a web page client, an applet client, etc., but is not limited thereto, that can initiate the knowledge graph construction method provided by the present disclosure. In addition, the specific form of the client may also be different based on different terminal platforms, for example, the client may be a mobile phone client, a PC client, or the like.
The construction apparatus 102 may be an electronic device capable of executing the method for constructing a knowledge graph provided by the present disclosure, such as, but not limited to, a cluster, a server, a cloud platform, and the like. The construction apparatus 102 may be directly or indirectly connected with the terminal device 101 through a wired or wireless communication manner, which is not limited by the present disclosure.
The knowledge base 103 may be a server deployed with a database on which a knowledge graph is stored. It should be noted that the knowledge base 103 may also be a different virtual module configured on the same server as the building apparatus 102, which is not limited by the present disclosure.
Those skilled in the art will appreciate that the number of terminal devices, building devices, and knowledge bases shown in fig. 1 is merely illustrative, and that there may be any number of terminal devices, building devices, and knowledge bases according to actual needs, and the disclosure is not limited thereto.
The present exemplary embodiment will be described in detail below with reference to the drawings and examples.
First, a knowledge graph construction method is provided in the embodiments of the present disclosure, and the method may be executed by any electronic device with computing processing capability.
Fig. 2 is a schematic flow chart of a method for constructing a knowledge graph in an embodiment of the present disclosure, and as shown in fig. 2, the method for constructing a knowledge graph provided in the embodiment of the present disclosure includes the following steps.
S201, analyzing the text data and generating an abstract semantic representation of the text data.
It should be noted that the text data in the embodiment of the present disclosure may be unstructured text data, i.e., data that cannot be expressed in a fixed logical structure — for example, product reviews scattered across forums, microblogs, and other Internet channels.
It should be noted that an Abstract Meaning Representation (AMR) graph is a representation of abstract sentence semantics: the semantics of a sentence are abstracted into a single-rooted directed acyclic graph, where entities in the sentence are abstracted into nodes and the relationships between them into edges, the edges denoting semantic relations between different concepts. An AMR graph may omit function words and the semantics carried by morphological variation (e.g., articles, singular/plural forms, tenses), and may supplement concepts omitted or elided from the sentence, thereby representing the sentence's semantics more completely. The single root guarantees the integrity of the sentence semantics, directedness guarantees the direction of semantic transfer, and acyclicity prevents semantic transfer from entering an endless loop.
Illustratively, fig. 3 is a schematic diagram of an abstract semantic representation in an embodiment of the present disclosure, taking the text data "the boy wants to go to school" as an example, where ARG denotes a core semantic role: ARG0 is the prototypical agent and ARG1 is the prototypical patient.
As shown in fig. 3, the abstract semantic representation includes at least one node and at least one edge, each node refers to a concept to which a word belongs, and the number of nodes in the abstract semantic representation may be equal to the number of entities corresponding to a sentence. Each edge refers to an entity relationship between different entities, and each edge may connect two nodes and may indicate an entity relationship between entities corresponding to the two nodes.
In some embodiments, the abstract semantic representation of the text data may be generated by encoding the text data, predicting edges or nodes in the abstract semantic representation, or predicting actions required to generate the graph.
Illustratively, embodiments of the present disclosure may generate the abstract semantic representation of text data using a sequence-to-sequence (Seq2Seq) model. Because the sequence-to-sequence model extends well to other tasks, knowledge from related tasks such as syntactic analysis and semantic dependency analysis can be incorporated via multi-task learning, providing auxiliary information that improves the generalization ability of the model.
Specifically, the sequence-to-sequence model is composed of an encoder and a decoder. Both are typically built from recurrent neural networks (RNNs), with Long Short-Term Memory (LSTM) networks in common use. When the sequence-to-sequence model is applied to generate the abstract semantic representation of the text data, the encoder and decoder are pre-trained on a large-scale unsupervised corpus with a masked-learning strategy (for example, via a BERT-style model), so as to acquire the prior knowledge contained in the large corpus, such as syntactic structure and entity knowledge.
It should be noted that the data used in the above pre-training of the sequence-to-sequence model is public data, and no additional labeling is needed.
The input to the sequence-to-sequence model may be the text data itself. After the serialized text data is encoded by the encoder, the hidden representation of the text data is passed into the decoder for decoding, yielding a serialized abstract semantic representation graph. For example, inputting "[cls], boy, want, go, school, [sep]" (where [cls] and [sep] mark the beginning and end of the text data, respectively) into the sequence-to-sequence model yields a serialized abstract semantic representation graph "want-00 :ARG0 boy-01 :ARG1 school-02 …", where "00, 01, 02" are node sequence numbers in the abstract semantic graph. ARG denotes a core semantic role; illustratively, ARG0 is the prototypical agent and ARG1 the prototypical patient. Since the specific encoding and decoding processes are well known to those skilled in the art, they are not described in detail here.
S202, extracting the first entity and the first entity relationship in the text data according to the abstract semantic representation.
It should be noted that an entity may be an abstraction of a person, place, or thing, and is generally a noun, such as "student", "Chinese class", or "astronaut". An entity relationship is a relationship between two entities. For example, in "B is an astronaut", "B" and "astronaut" are two entities, and the relationship between them is "occupation" — that is, B's occupation is astronaut.
In some embodiments, the first entity may be an entity represented by a node in the abstract semantic representation, and the first entity relationship may be an entity relationship represented by an edge; the first entity and the first entity relationship have a corresponding relationship.
Illustratively, for the text data "the boy wants to go to school", the first entities "boy" and "school" and the first entity relationships "want" and "go" may be extracted from the serialized abstract semantic representation graph, where "go" corresponds to both "boy" and "school", and "want" corresponds to "boy".
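As a minimal sketch of this extraction step (the graph encoding below is an illustrative assumption, not the patent's exact serialization), the noun-concept nodes of the AMR graph become first entities, and the verb edges attached to them become first entity relationships:

```python
# Toy AMR graph for "the boy wants to go to school".
# nodes: node id -> concept; edges: (source, role, target).
amr = {
    "nodes": {"w": "want", "b": "boy", "g": "go", "s": "school"},
    "edges": [("w", "ARG0", "b"), ("w", "ARG1", "g"),
              ("g", "ARG0", "b"), ("g", "ARG1", "s")],
}

# Assumed: the noun concepts have been marked as entity nodes in advance.
ENTITY_NODES = {"b", "s"}

# First entities: the concepts of the entity nodes.
entities = sorted(amr["nodes"][n] for n in ENTITY_NODES)

# First entity relationships: verb concepts paired with the entity
# concepts that their outgoing edges point to.
relations = [(amr["nodes"][src], amr["nodes"][dst])
             for src, _, dst in amr["edges"] if dst in ENTITY_NODES]

print(entities)   # ['boy', 'school']
print(relations)  # [('want', 'boy'), ('go', 'boy'), ('go', 'school')]
```

This reproduces the correspondences in the text: "go" attaches to "boy" and "school", and "want" to "boy".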
S203, constructing a target knowledge graph according to the first entity and the first entity relationship.
Illustratively, to ensure the accuracy of the constructed knowledge graph, entity disambiguation can be performed on the first entity to obtain a second entity, and the target knowledge graph is then constructed according to the second entity and the first entity relationship. The second entity is the standard entity obtained after disambiguating the first entity.
It should be noted that there are many ways to perform entity disambiguation on the first entity, for example by means of a bag-of-words model or by semantic decomposition. Considering that the number of entities extracted through the abstract semantic representation graph may be huge, the embodiment of the present disclosure provides an entity disambiguation method combining string similarity and graph similarity, so as to balance the efficiency and accuracy of disambiguation.
Illustratively, the entity disambiguation method in the embodiments of the present disclosure may be as follows: screen a plurality of third entities from the preset entity library according to the string similarity between the first entity and each preset entity in the library; then calculate the graph similarity between the abstract semantic representation subgraph corresponding to the first entity and the knowledge graph subgraph corresponding to each third entity; and link the first entity with the third entity whose knowledge graph subgraph has the highest graph similarity to the abstract semantic representation subgraph, obtaining the second entity and completing entity disambiguation.
It should be noted that the preset entity library may include a plurality of preset standard entities, and the third entity is a candidate entity roughly screened from the plurality of preset standard entities and associated with the first entity.
For example, the string similarity may be computed with different string metrics such as Brute-Force (BF), Jaccard, or Knuth-Morris-Pratt (KMP) matching, which is not limited in this disclosure.
In some embodiments, the string similarities obtained in different manners may be fused by weighted summation to select a plurality of third entities that have a certain similarity with the first entity. For example, standard entities whose string similarity with the first entity is greater than a preset threshold may be taken as third entities, or the top several standard entities ranked by string similarity with the first entity may be taken as third entities; the disclosure is not limited in this respect.
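A hedged sketch of this coarse screening step, fusing two string-similarity measures by weighted summation — the specific measures, weights, threshold, and entity library below are illustrative assumptions, not values fixed by the disclosure:

```python
import difflib

def char_jaccard(a: str, b: str) -> float:
    """Jaccard similarity over character sets."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

def fused_similarity(a: str, b: str, w_seq: float = 0.5, w_jac: float = 0.5) -> float:
    """Weighted sum of two string-similarity measures (weights assumed)."""
    seq = difflib.SequenceMatcher(None, a, b).ratio()
    return w_seq * seq + w_jac * char_jaccard(a, b)

def screen_candidates(mention: str, entity_library: list, threshold: float = 0.3) -> list:
    """Keep preset entities whose fused similarity exceeds the threshold."""
    return [e for e in entity_library if fused_similarity(mention, e) > threshold]

library = ["apple (company)", "apple (fruit)", "orange (fruit)", "school"]
print(screen_candidates("apple", library))  # the two "apple" entries survive
```

The surviving candidates play the role of the third entities, to be ranked next by graph similarity.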
For example, for the first entity "boy", entities such as "old boy" and "man" may be screened out of the preset entity library as third entities through string similarity. Afterwards, heuristic rules can be applied to filter out obviously unsuitable candidates among the third entities.
The abstract semantic representation subgraph is generated according to the adjacent relation of the first entity in the abstract semantic representation, and the knowledge graph subgraph is generated according to the adjacent relation of the third entity corresponding to the knowledge graph subgraph in the preset knowledge graph.
Illustratively, the graph similarity in the embodiment of the present disclosure is calculated as shown in equation (1) below:

S(G(e), G(e_i)) = |G(e) ∩ G(e_i)| / |G(e) ∪ G(e_i)|    (1)

wherein e is a first entity extracted from the text data; G(e) is the set of entities adjacent to e in the abstract semantic representation constructed in S201, i.e., the abstract semantic representation subgraph; e_i is each third entity screened from the preset entity library, with i a positive integer; G(e_i) is the set of nodes adjacent to e_i in the preset knowledge graph, i.e., the knowledge graph subgraph; and S(G(e), G(e_i)) denotes the graph similarity between G(e) and G(e_i).
Illustratively, in the set operations on G(e) and G(e_i), string similarity can again be used to judge whether a pair of entities from G(e) and G(e_i) are equal.
Through the calculation of graph similarity, the entity link can be guided by using the context entity information contained in the text data.
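The neighbour-set comparison of equation (1) can be sketched as follows; exact string equality stands in for the string-similarity matching of entity pairs, and the candidate neighbour sets are illustrative assumptions:

```python
# Sketch of equation (1): graph similarity as the overlap between the
# AMR neighbour set G(e) of the first entity and the knowledge-graph
# neighbour set G(e_i) of each candidate third entity.

def graph_similarity(amr_neighbors: set, kg_neighbors: set) -> float:
    union = amr_neighbors | kg_neighbors
    return len(amr_neighbors & kg_neighbors) / len(union) if union else 0.0

# Neighbours of "apple" in the AMR of "apple releases a new mobile phone".
g_e = {"release", "mobile phone"}

# Assumed neighbour sets of the candidates in the preset knowledge graph.
candidates = {
    "apple (fruit)":   {"tree", "vitamin"},
    "apple (company)": {"release", "mobile phone", "computer"},
}

best = max(candidates, key=lambda c: graph_similarity(g_e, candidates[c]))
print(best)  # 'apple (company)'
```

The candidate with the highest graph similarity is the one the first entity is linked to, yielding the second entity.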
For example, for the text data "apple releases a new mobile phone," combining the adjacent nodes "apple" and "mobile phone" can determine that "apple" in the text data refers to "apple (company)".
It should be noted that entity linking can be understood as the process of unambiguously and correctly pointing the extracted first entity to the third entity with the highest graph similarity. That is, through entity linking, the content of the first entity may be replaced with the content of the third entity having the highest graph similarity to the first entity, resulting in the second entity. For example, the entity "apple" in the text data "apple released a new cell phone" can be converted into "apple (company)" through entity linking. In addition, a unique identifier can be assigned to the replaced first entity to distinguish ambiguous words.
Exemplarily, fig. 4 illustrates a process diagram of entity disambiguation in an embodiment of the present disclosure, taking the text data "apple publishes a new handset" as an example. As shown in fig. 4, for a first entity "apple" in the text data "apple publishes a new mobile phone", four third entities "apple (fruit)", "apple (company)", "apple (movie)" and "apple (song)" can be screened from a preset entity library in a manner of character string similarity. And then determining the entity with the highest graph similarity with the "apple" in the text data as "apple (company)" by means of graph similarity and combining the context information, so as to link the "apple" in the text data to the "apple (company)", namely, the "apple (company)" is taken as a second entity.
Since many entity relationships in a sentence are carried by verbs, a large proportion of the relations extracted through the abstract semantic representation are verbs between entities. Therefore, to avoid ambiguity in the constructed knowledge graph and improve the readability of its knowledge, the extracted first entity relationship can be mapped to a second entity relationship in a preset entity relationship library, which contains a plurality of preset standard entity relationships.
For example, from the text data "apple releases a new mobile phone", two corresponding pairs of first entities and first entity relationships, "release-apple" and "release-mobile phone", may be extracted, from which the triple "apple-release-mobile phone" is obtained. Following the example above, entity disambiguation links "apple" to "apple (company)", and querying the mapping in the preset entity relationship library shows that "release" corresponds to "developer"; the triple "apple-release-mobile phone" can thus be converted into the standard triple "apple (company)-developer-mobile phone", from which the knowledge graph is constructed.
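A hedged sketch of this standardization step — the mapping table and entity-link table below are illustrative assumptions standing in for the preset entity relationship library and the entity links produced by disambiguation:

```python
# Hypothetical standardization tables; a real system would query the
# preset entity relationship library and the entity links built earlier.
relation_library = {"release": "developer"}    # first -> second relationship
entity_links = {"apple": "apple (company)"}    # first -> second entity

def standardize(head: str, relation: str, tail: str) -> tuple:
    """Map an extracted triple onto its standard form; unmapped parts
    are kept unchanged."""
    return (entity_links.get(head, head),
            relation_library.get(relation, relation),
            tail)

print(standardize("apple", "release", "mobile phone"))
# ('apple (company)', 'developer', 'mobile phone')
```

The resulting standard triples are the units from which the target knowledge graph is assembled.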
In some embodiments, if the extracted first entity relationship cannot be mapped to any entity relationship in the preset entity relationship library, whether it constitutes a new entity relationship may be determined from its frequency or number of occurrences in the text data. For example, it may be treated as a new entity relationship when it occurs more than once per 1000 words, or more than 5 times in total. In this way, more potential triples can be efficiently mined from the text data.
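The frequency heuristic above can be sketched directly; the two thresholds (more than once per 1000 words, or more than 5 occurrences) are the ones given in the text:

```python
def is_new_relation(count: int, total_words: int) -> bool:
    """Treat an unmapped relation as a new entity relationship when it
    appears more than once per 1000 words, or more than 5 times overall."""
    per_thousand = count * 1000 / total_words if total_words else 0.0
    return per_thousand > 1 or count > 5

print(is_new_relation(3, 2000))    # True: 1.5 occurrences per 1000 words
print(is_new_relation(2, 5000))    # False: rare and infrequent
print(is_new_relation(6, 100000))  # True: more than 5 occurrences overall
```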
It should be noted that a triple is the basic unit of a knowledge graph, and one triple includes a head entity, a relationship, and a tail entity, where the relationship is the entity relationship between the head entity and the tail entity. By connecting multiple triples head to tail, a knowledge graph describing the relationships among things can be formed.
Illustratively, the head entity, relationship and tail entity in a triple may be understood as the subject, predicate and object of a sentence. In the embodiment of the present disclosure, the head entity and the tail entity are constructed according to the first entity, and the relationship is constructed according to the first entity relationship.
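The head-to-tail chaining of triples into a graph can be sketched as a labelled adjacency list. The representation is a minimal assumption; a production system would typically use a graph database.

```python
from collections import defaultdict

# Build a simple knowledge graph: each (head, relation, tail) triple becomes
# an edge labelled by the relation; triples chain where one's tail is
# another's head.
def build_graph(triples):
    graph = defaultdict(list)
    for head, relation, tail in triples:
        graph[head].append((relation, tail))
    return dict(graph)

kg = build_graph([
    ("apple (company)", "developer", "mobile phone"),
    ("mobile phone", "runs", "operating system"),  # chained via "mobile phone"
])
print(kg["apple (company)"])  # [('developer', 'mobile phone')]
```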
It should be noted that the terms "first entity", "second entity" and "third entity" in the embodiments of the present disclosure are only used to distinguish different entities, and should not be construed to explicitly indicate or imply any order, relative importance or quantitative relationship among them. Likewise, the terms "first entity relationship" and "second entity relationship" are used merely to distinguish different entity relationships and should not be construed to indicate or imply any order, relative importance or quantitative relationship among them.
In some embodiments, extraction based on the abstract semantic representation has a lower accuracy rate than a conventional deep learning model. Therefore, in order to ensure the accuracy of the constructed knowledge graph, a verification operation on the target knowledge graph is added after construction, and the triples in the target knowledge graph that fail verification are removed from it, thereby ensuring the accuracy of the knowledge in the target knowledge graph. In this way, a knowledge graph with high accuracy can be constructed without the participation of manually labeled corpus data.
Specifically, fig. 5 shows a flow chart of a knowledge graph checking method in the embodiment of the present disclosure. As shown in fig. 5, the method for checking a knowledge graph in the embodiment of the present disclosure includes the following steps.
S501, at least one triple in the target knowledge graph is checked.
In some embodiments, the verification of the target knowledge graph may be performed by masked prediction. Because the target knowledge graph comprises at least one triple, the knowledge graph can be verified by checking each of the at least one triple by means of mask prediction.
The following will take a triple in the target knowledge graph as an example to describe the verification process for the triple in the embodiment of the present disclosure.
By respectively performing mask prediction on the head entity, the relation and the tail entity in the triple, the existence probability of the head entity in the triple, the existence probability of the relation in the triple and the existence probability of the tail entity in the triple can be obtained. And obtaining the existence probability of the triples according to the existence probability of the head entities in the triples, the existence probability of the relation in the triples and the existence probability of the tail entities in the triples. And checking the triples according to the existence probability of the triples.
Illustratively, the specific manner of mask prediction may be: respectively performing mask processing on a head entity, a relation and a tail entity in at least one triple to obtain a head entity mask triple, a relation mask triple and a tail entity mask triple; and respectively inputting the head entity mask triple, the relation mask triple and the tail entity mask triple into a pre-trained mask prediction model to obtain the existence probability of the head entity in the triple, the existence probability of the relation in the triple and the existence probability of the tail entity in the triple.
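Constructing the three masked variants of a triple can be sketched as follows; the `[MASK]` token and the helper name are assumptions in the style of BERT-like models.

```python
MASK = "[MASK]"

# Produce the head-entity mask triple, relation mask triple and tail-entity
# mask triple from one (head, relation, tail) triple.
def make_mask_triples(triple):
    h, r, t = triple
    return (MASK, r, t), (h, MASK, t), (h, r, MASK)

head_masked, rel_masked, tail_masked = make_mask_triples(
    ("apple (company)", "developer", "mobile phone"))
print(head_masked)  # ('[MASK]', 'developer', 'mobile phone')
```

Each masked variant would then be fed to the pre-trained mask prediction model, which scores how likely the masked position's original value is.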
It should be noted that the existence probability can be understood as a confidence probability, i.e., a probability of reliability. For example, the existence probability of the head entity in the above-mentioned triplet may be understood as the reliability of the existence of the head entity in the triplet, i.e. the probability that the head entity to be masked is the head entity of the triplet in the target knowledge graph. For the existence probability of the relationship in the triple, the existence probability of the tail entity in the triple, and the existence probability of the triple, since the understanding manner is similar to the existence probability of the head entity in the triple, the present disclosure is not repeated herein.
In some embodiments, obtaining the existence probability of the triplet from the existence probability of the head entity, the existence probability of the relationship and the existence probability of the tail entity may be implemented in any suitable mathematical manner. For example, the three probabilities may be multiplied together to give the existence probability of the triplet, or they may be weighted and summed. The embodiments of the present disclosure do not limit this.
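Both combination schemes mentioned above can be sketched directly. The equal weights in the weighted variant are an assumption; the text leaves the weighting open.

```python
# Combine the three per-position existence probabilities into one triple
# existence probability: product form, or weighted-sum form.
def triple_prob_product(p_head, p_rel, p_tail):
    return p_head * p_rel * p_tail

def triple_prob_weighted(p_head, p_rel, p_tail, weights=(1/3, 1/3, 1/3)):
    wh, wr, wt = weights  # equal weights are an illustrative choice
    return wh * p_head + wr * p_rel + wt * p_tail

print(triple_prob_product(0.9, 0.8, 0.7))            # ~0.504
print(round(triple_prob_weighted(0.9, 0.8, 0.7), 4)) # 0.8
```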
It should be noted that the mask prediction may be understood as performing a mask (or masking) process on a part of the features to predict the features of the masked part.
Similarly, in the embodiment of the present disclosure, performing mask prediction on the head entity, the relationship, and the tail entity may be understood as performing mask (or referred to as masking) processing on the head entity, the relationship, and the tail entity randomly to predict the head entity, the relationship, or the tail entity of the masked portion.
It should be noted that the header entity mask triple in the embodiment of the present disclosure may be understood as a triple obtained by performing a masking process on a header entity in a triple. The relationship mask triple may be understood as a triple obtained by performing mask processing on a relationship in a triple. The tail entity mask triple may be understood as a triple obtained by performing mask processing on a tail entity in a triple.
In some embodiments, the mask prediction model may employ a Transformer encoder model.
Exemplarily, fig. 6 shows a schematic diagram of a training method for the mask prediction model. Here, [CLS] and [SEP] indicate the beginning and end of the text data respectively, w_1 represents a word in the text data, e_{s,1} represents a head entity, [MASK] represents a mask, and e_{t,1} represents a tail entity.
As shown in fig. 6, the text data is concatenated with all the triples in the knowledge graph and fed into the model together, so that dependency relationships between pieces of knowledge can be constructed; the text data serves as the global memory of the model. Learning can adopt an unsupervised masked prediction strategy that randomly masks and predicts entities and relationships in the knowledge, so that the model captures the information interaction between entities and entity relationships and achieves joint modeling of entity learning and entity relationship learning.
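The construction of one training input can be sketched as below. The exact token layout (one [SEP] after the text and after each triple) is an assumption inferred from fig. 6, not a confirmed specification.

```python
import random

# Build an unsupervised training sequence: [CLS] text [SEP], followed by the
# triples, with one randomly chosen element of each triple masked for the
# model to predict.
def build_training_input(text_tokens, triples, rng):
    tokens = ["[CLS]"] + list(text_tokens) + ["[SEP]"]
    for triple in triples:
        masked = list(triple)
        masked[rng.randrange(3)] = "[MASK]"  # mask head, relation or tail
        tokens += masked + ["[SEP]"]
    return tokens

rng = random.Random(0)  # seeded for reproducibility
seq = build_training_input(["apple", "publishes", "a", "new", "phone"],
                           [("apple (company)", "developer", "mobile phone")],
                           rng)
print(seq.count("[MASK]"))  # 1
```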
S502, removing the triples which do not pass the verification in at least one triple from the target knowledge graph.
In some embodiments, the triples may be culled by presetting a check probability condition.
Exemplarily, if the existence probability of the triples does not meet the preset check probability condition, it is determined that the triples do not pass the check, and then the triples which do not pass the check are removed from the target knowledge graph.
Illustratively, the predetermined verification probability condition may be a threshold, for example, triples with existence probabilities smaller than the threshold are removed from the target knowledge-graph. Or may be a probability range, for example, triples whose existence probabilities do not fall within the probability range are removed from the target knowledge-graph. The embodiments of the present disclosure do not limit this.
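The culling step under either form of the preset check probability condition can be sketched as a filter; the helper name and the example probabilities are assumptions.

```python
# Keep only triples whose existence probability satisfies the preset check
# condition: a minimum threshold, a probability range, or both.
def cull(triples_with_prob, threshold=None, prob_range=None):
    kept = []
    for triple, p in triples_with_prob:
        if threshold is not None and p < threshold:
            continue  # fails the threshold condition
        if prob_range is not None and not (prob_range[0] <= p <= prob_range[1]):
            continue  # falls outside the allowed probability range
        kept.append(triple)
    return kept

scored = [(("a", "r", "b"), 0.95), (("c", "r", "d"), 0.40)]
print(cull(scored, threshold=0.5))  # [('a', 'r', 'b')]
```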
It is noted that in the embodiment of the present disclosure, the text data and the triples used in the training process of the mask prediction model do not need to be labeled additionally.
Fig. 7 is a schematic structural diagram of a knowledge graph constructing apparatus in an embodiment of the present disclosure, and as shown in fig. 7, the knowledge graph constructing apparatus 700 includes: an analysis module 701, an extraction module 702 and a construction module 703.
Specifically, the parsing module 701 is configured to parse text data to generate an abstract semantic representation of the text data. The extraction module 702 is configured to extract a first entity and a first entity relationship in the text data according to the abstract semantic representation, where the first entity and the first entity relationship have a corresponding relationship. The construction module 703 is configured to construct a target knowledge graph according to the first entity and the first entity relationship.
In some embodiments, the construction module 703 is further configured to perform entity disambiguation on the first entity to obtain a second entity; and construct a target knowledge graph according to the second entity and the first entity relationship.
In some embodiments, the construction module 703 is further configured to screen out a plurality of third entities from a preset entity library according to the string similarity between the first entity and each preset entity in the preset entity library; respectively calculate the graph similarity between an abstract semantic representation subgraph corresponding to the first entity and the knowledge graph subgraph corresponding to each of the third entities, where the abstract semantic representation subgraph is generated according to the adjacency of the first entity in the abstract semantic representation, and each knowledge graph subgraph is generated according to the adjacency of its corresponding third entity in a preset knowledge graph; and link the first entity to the third entity corresponding to the knowledge graph subgraph with the highest graph similarity to the abstract semantic representation subgraph, so as to obtain the second entity.
In some embodiments, the construction module 703 is further configured to map the first entity relationship into a second entity relationship according to a mapping relationship between the first entity relationship and each entity relationship in the preset entity relationship library, where the second entity relationship is an entity relationship existing in the preset entity relationship library; and construct the target knowledge graph according to the second entity and the second entity relationship.
In some embodiments, the knowledge-graph building apparatus 700 further comprises a verification module, and the target knowledge-graph comprises at least one triplet, each triplet of the at least one triplet comprising a head entity, a relationship, and a tail entity, wherein the head entity and the tail entity are built according to the first entity and the relationship is built according to the first entity relationship.
Specifically, the checking module is used for checking at least one triple in the target knowledge graph; and removing the triples which fail to pass the verification in at least one triple from the target knowledge graph.
In some embodiments, the check module is further configured to, for each triple in the at least one triple, perform mask prediction on a head entity, a relationship, and a tail entity in the triple, respectively, to obtain an existence probability of the head entity in the triple, an existence probability of the relationship in the triple, and an existence probability of the tail entity in the triple; obtaining the existence probability of the triples according to the existence probability of the head entity in the triples, the existence probability of the relation in the triples and the existence probability of the tail entity in the triples; and checking the triples according to the existence probability of the triples.
In some embodiments, the check module is further configured to, for each triplet in the at least one triplet, respectively perform mask processing on a head entity, a relationship, and a tail entity in the triplet, to obtain a head entity mask triplet, a relationship mask triplet, and a tail entity mask triplet; and respectively inputting the head entity mask triple, the relation mask triple and the tail entity mask triple into a pre-trained mask prediction model to obtain the existence probability of the head entity in the triple, the existence probability of the relation in the triple and the existence probability of the tail entity in the triple.
In some embodiments, the check module is further configured to determine that the triplet has not been verified if the existence probability of the triplet does not satisfy the preset verification probability condition.
In some embodiments, the text data is unstructured text data.
It should be noted that, when the knowledge graph constructing apparatus provided in the foregoing embodiment is used for knowledge graph construction, only the division of the above functional modules is used as an example, and in practical applications, the above functions may be distributed to different functional modules according to needs, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or part of the above described functions. In addition, the knowledge graph constructing apparatus provided in the above embodiments and the knowledge graph constructing method embodiments belong to the same concept, and specific implementation processes thereof are described in detail in the method embodiments, and are not described herein again.
As will be appreciated by one skilled in the art, aspects of the present disclosure may be embodied as a system, method or program product. Accordingly, various aspects of the present disclosure may be embodied in the form of: an entirely hardware embodiment, an entirely software embodiment (including firmware, microcode, etc.), or an embodiment combining hardware and software aspects, which may all generally be referred to herein as a "circuit," "module," or "system."
An electronic device 800 according to this embodiment of the disclosure is described below with reference to fig. 8. The electronic device 800 shown in fig. 8 is only an example, and should not bring any limitation to the functions and applicable scope of the embodiments of the present disclosure.
As shown in fig. 8, the electronic device 800 is in the form of a general purpose computing device. The components of the electronic device 800 may include, but are not limited to: the at least one processing unit 810, the at least one memory unit 820, and a bus 830 that couples various system components including the memory unit 820 and the processing unit 810.
Where the memory unit stores program code, the program code may be executed by the processing unit 810 to cause the processing unit 810 to perform steps according to various exemplary embodiments of the present disclosure as described in the "exemplary methods" section above in this specification.
In some embodiments, processing unit 810 may perform the following steps of the above-described method embodiments: analyzing the text data to generate an abstract semantic representation of the text data; extracting a first entity and a first entity relation in the text data according to the abstract semantic representation, wherein the first entity and the first entity relation have a corresponding relation; and constructing a target knowledge graph according to the relation between the first entity and the first entity.
The storage unit 820 may include readable media in the form of volatile memory units such as a random access memory unit (RAM) 8201 and/or a cache memory unit 8202, and may further include a read only memory unit (ROM) 8203.
The storage unit 820 may also include a program/utility 8204 having a set (at least one) of program modules 8205, such program modules 8205 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which or some combination thereof may comprise an implementation of a network environment.
Bus 830 may be any of several types of bus structures including a memory unit bus or memory unit controller, a peripheral bus, an accelerated graphics port, a processing unit, or a local bus using any of a variety of bus architectures.
The electronic device 800 may also communicate with one or more external devices 840 (e.g., keyboard, pointing device, bluetooth device, etc.), with one or more devices that enable a user to interact with the electronic device 800, and/or with any devices (e.g., router, modem, etc.) that enable the electronic device 800 to communicate with one or more other computing devices. Such communication may occur over input/output (I/O) interfaces 850. Also, the electronic device 800 may communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network, such as the internet) via the network adapter 860. As shown, the network adapter 860 communicates with the other modules of the electronic device 800 via the bus 830. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 800, including but not limited to: microcode, device drivers, redundant processing units, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.
Through the above description of the embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, or by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a terminal device, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
In an exemplary embodiment of the present disclosure, there is also provided a computer-readable storage medium, which may be a readable signal medium or a readable storage medium. On which a program product capable of implementing the above-described method of the present disclosure is stored. In some possible embodiments, various aspects of the disclosure may also be implemented in the form of a program product comprising program code for causing a terminal device to perform the steps according to various exemplary embodiments of the disclosure as described in the above-mentioned "exemplary methods" section of this specification, when the program product is run on the terminal device.
More specific examples of the computer-readable storage medium in the present disclosure may include, but are not limited to: an electrical connection having one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
In the present disclosure, a computer readable storage medium may include a propagated data signal with readable program code embodied therein, either in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination thereof. A readable signal medium may be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.
Alternatively, program code embodied on a computer readable storage medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.
In particular implementations, program code for carrying out operations of the present disclosure may be written in any combination of one or more programming languages, including an object oriented programming language such as Java, C++, or the like, as well as conventional procedural programming languages, such as the "C" programming language or similar programming languages. The program code may execute entirely on the user's computing device, partly on the user's device, as a stand-alone software package, partly on the user's computing device and partly on a remote computing device, or entirely on the remote computing device or server. In the case of a remote computing device, the remote computing device may be connected to the user computing device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external computing device (e.g., through the internet using an internet service provider).
It should be noted that although in the above detailed description several modules or units of the device for action execution are mentioned, such a division is not mandatory. Indeed, the features and functionality of two or more modules or units described above may be embodied in one module or unit, according to embodiments of the present disclosure. Conversely, the features and functions of one module or unit described above may be further divided into embodiments by a plurality of modules or units.
Moreover, although the steps of the methods of the present disclosure are depicted in the drawings in a particular order, this does not require or imply that the steps must be performed in this particular order, or that all of the depicted steps must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken into multiple step executions, etc.
Through the description of the above embodiments, those skilled in the art will readily understand that the exemplary embodiments described herein may be implemented by software, and may also be implemented by software in combination with necessary hardware. Therefore, the technical solution according to the embodiments of the present disclosure may be embodied in the form of a software product, which may be stored in a non-volatile storage medium (which may be a CD-ROM, a usb disk, a removable hard disk, etc.) or on a network, and includes several instructions to enable a computing device (which may be a personal computer, a server, a mobile terminal, or a network device, etc.) to execute the method according to the embodiments of the present disclosure.
Other embodiments of the disclosure will be apparent to those skilled in the art from consideration of the specification and practice of the disclosure disclosed herein. This disclosure is intended to cover any variations, uses, or adaptations of the disclosure following, in general, the principles of the disclosure and including such departures from the present disclosure as come within known or customary practice within the art to which the disclosure pertains. It is intended that the specification and examples be considered as exemplary only, with a true scope and spirit of the disclosure being indicated by the following claims.

Claims (12)

1. A knowledge graph construction method is characterized by comprising the following steps:
analyzing text data to generate an abstract semantic representation of the text data;
extracting a first entity and a first entity relation in the text data according to the abstract semantic representation, wherein the first entity and the first entity relation have a corresponding relation;
and constructing a target knowledge graph according to the first entity and the first entity relation.
2. The method of claim 1, wherein constructing a target knowledge-graph based on the first entity and the first entity relationship comprises:
carrying out entity disambiguation on the first entity to obtain a second entity;
and constructing the target knowledge graph according to the second entity and the first entity relation.
3. The method of claim 2, wherein the entity disambiguating the first entity to obtain a second entity comprises:
screening a plurality of third entities from a preset entity library according to the similarity of the character strings between the first entity and each preset entity in the preset entity library;
respectively calculating graph similarity between an abstract semantic representation subgraph corresponding to the first entity and a knowledge graph subgraph corresponding to each of the third entities, wherein the abstract semantic representation subgraph is generated according to the adjacency of the first entity in the abstract semantic representation, and each knowledge graph subgraph is generated according to the adjacency of its corresponding third entity in a preset knowledge graph;
and linking the first entity to the third entity corresponding to the knowledge graph subgraph having the highest graph similarity with the abstract semantic representation subgraph, to obtain the second entity.
4. The method of claim 2, wherein the constructing the target knowledge-graph from the second entity and the first entity relationships comprises:
mapping the first entity relationship into a second entity relationship according to the mapping relationship between the first entity relationship and each entity relationship in a preset entity relationship library, wherein the second entity relationship is the entity relationship existing in the preset entity relationship library;
and constructing the target knowledge graph according to the second entity and the second entity relation.
5. The method of claim 1, wherein the target knowledge-graph comprises at least one triplet, each triplet comprising a head entity, a relationship, and a tail entity, wherein the head entity and the tail entity are constructed from the first entity and the relationship is constructed from the first entity relationship;
after the constructing a target knowledge-graph according to the first entity and the first entity relation, further comprising:
verifying the at least one triplet in the target knowledge-graph;
and eliminating the triples which fail to pass the verification in the at least one triple from the target knowledge graph.
6. The method of claim 5, wherein the checking the at least one triplet of the target knowledge-graph comprises:
respectively performing mask prediction on a head entity, a relation and a tail entity in the triples aiming at each triplet in the at least one triplet to obtain the existence probability of the head entity in the triples, the existence probability of the relation in the triples and the existence probability of the tail entity in the triples;
obtaining the existence probability of the triples according to the existence probability of the head entity in the triples, the existence probability of the relation in the triples and the existence probability of the tail entity in the triples;
and checking the triples according to the existence probability of the triples.
7. The method of claim 6, wherein the performing, for each of the at least one triplet, mask prediction on a leading entity, a relationship, and a trailing entity in the triplet to obtain a probability of existence of the leading entity in the triplet, a probability of existence of the relationship in the triplet, and a probability of existence of the trailing entity in the triplet comprises:
respectively performing mask processing on a head entity, a relation and a tail entity in the triples aiming at each triplet in the at least one triplet to obtain a head entity mask triplet, a relation mask triplet and a tail entity mask triplet;
and respectively inputting the head entity mask triple, the relation mask triple and the tail entity mask triple into a pre-trained mask prediction model to obtain the existence probability of the head entity in the triple, the existence probability of the relation in the triple and the existence probability of the tail entity in the triple.
8. The method according to claim 6, wherein the verifying the triplet according to the existence probability of the triplet comprises:
and if the existence probability of the triple does not meet the preset check probability condition, determining that the triple does not pass the check.
9. The method according to any one of claims 1 to 8, wherein the text data is unstructured text data.
10. A knowledge-graph building apparatus, comprising:
the analysis module is used for analyzing the text data and generating an abstract semantic representation of the text data;
the extraction module is used for extracting a first entity and a first entity relation in the text data according to the abstract semantic representation, wherein the first entity and the first entity relation have a corresponding relation;
and the construction module is used for constructing a target knowledge graph according to the first entity and the first entity relation.
11. An electronic device, comprising:
a processor; and
a memory for storing executable instructions of the processor;
wherein the processor is configured to perform the method of any of claims 1 to 9 via execution of the executable instructions.
12. A computer-readable storage medium, on which a computer program is stored which, when being executed by a processor, carries out the method of any one of claims 1 to 9.
CN202211384778.8A 2022-11-07 2022-11-07 Knowledge graph construction method and device, electronic equipment and storage medium Pending CN115687651A (en)

Publications (1)

Publication Number Publication Date
CN115687651A true CN115687651A (en) 2023-02-03



Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117494806A (en) * 2023-12-28 2024-02-02 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Relation extraction method, system and medium based on knowledge graph and large language model
CN117494806B (en) * 2023-12-28 2024-03-08 哈尔滨工业大学(深圳)(哈尔滨工业大学深圳科技创新研究院) Relation extraction method, system and medium based on knowledge graph and large language model

Similar Documents

Publication Publication Date Title
Asyrofi et al. Biasfinder: Metamorphic test generation to uncover bias for sentiment analysis systems
US20230196127A1 (en) Method and device for constructing legal knowledge graph based on joint entity and relation extraction
US10984032B2 (en) Relation extraction using co-training with distant supervision
US11501080B2 (en) Sentence phrase generation
US10902326B2 (en) Relation extraction using co-training with distant supervision
Parvez et al. Building language models for text with named entities
Cunningham A definition and short history of Language Engineering
WO2022218186A1 (en) Method and apparatus for generating personalized knowledge graph, and computer device
US9760626B2 (en) Optimizing parsing outcomes of documents
Rani et al. How to identify class comment types? A multi-language approach for class comment classification
Nazar et al. Feature-based software design pattern detection
US11615241B2 (en) Method and system for determining sentiment of natural language text content
EP4113357A1 (en) Method and apparatus for recognizing entity, electronic device and storage medium
CN113095080B (en) Theme-based semantic recognition method and device, electronic equipment and storage medium
KR20210023452A (en) Apparatus and method for review analysis per attribute
US10558931B2 (en) Determining comprehensiveness of question paper given syllabus
CN111930792B (en) Labeling method and device for data resources, storage medium and electronic equipment
Yadav et al. A comprehensive review on resolving ambiguities in natural language processing
Al-Debagy et al. A microservice decomposition method through using distributed representation of source code
CN114519356B (en) Target word detection method and device, electronic equipment and storage medium
Priyadi et al. The similarity of Elicitation Software Requirements Specification in Student Learning Applications of SMKN7 Baleendah Based on Use Case Diagrams Using Text Mining
Tahvili et al. Artificial Intelligence Methods for Optimization of the Software Testing Process: With Practical Examples and Exercises
CN115687651A (en) Knowledge graph construction method and device, electronic equipment and storage medium
Attanasio et al. ferret: a Framework for Benchmarking Explainers on Transformers
CN116719683A (en) Abnormality detection method, abnormality detection device, electronic apparatus, and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination