CN114372150A

CN114372150A - Knowledge graph construction method, system, device and storage medium

Info

Publication number: CN114372150A
Application number: CN202111505685.1A
Authority: CN
Inventors: 李洁; 龚晟; 杨震
Original assignee: Tianyi IoT Technology Co Ltd
Current assignee: Tianyi IoT Technology Co Ltd
Priority date: 2021-12-10
Filing date: 2021-12-10
Publication date: 2022-04-19
Anticipated expiration: 2041-12-10
Also published as: CN114372150B

Abstract

The invention discloses a knowledge graph construction method, system, device and storage medium, and relates to the technical field of computers. The knowledge graph construction method comprises the following steps: acquiring text data; processing the text data to obtain a plurality of word parameters; determining a difference vector according to the text data and the plurality of word parameters; and updating a relation rule base according to the difference vector. In the method, the information common to the extracted relation rules and the current relation rule base is removed by principal component analysis, and the relation rule base is then updated according to the relation rules when the similarity between the difference vectors of two successive extractions is greater than a preset value, so that newly generated relation rules are screened more effectively, the amount of calculation is reduced, and the accuracy of knowledge graph relation extraction is improved.

Description

Knowledge graph construction method, system, device and storage medium

Technical Field

The invention relates to the technical field of computers, in particular to a knowledge graph construction method, a knowledge graph construction system, a knowledge graph construction device and a storage medium.

Background

In the construction of an industry knowledge graph, extracting the relations between entities in text is a key and difficult problem. Traditional rule-based methods have high labor costs and low recall; supervised learning methods require a large amount of labeled sample data, which is labor-intensive; and although semi-supervised learning methods reduce the manual effort, their accuracy drops rapidly as the number of iterations increases.

Disclosure of Invention

The present invention is directed to solving at least one of the problems of the prior art. Therefore, the invention provides a method, a system, a device and a storage medium for constructing a knowledge graph, which can improve the accuracy of the knowledge graph while improving the construction efficiency of the knowledge graph.

In one aspect, an embodiment of the present invention provides a method for constructing a knowledge graph, including the following steps:

acquiring text data;

processing the text data to obtain a plurality of word parameters;

determining a difference vector from the text data and the plurality of word parameters, wherein the difference vector is determined by:

acquiring a relation rule base and a tuple database;

extracting a relation rule in the text data according to first tuple data in the tuple database;

determining sentence vectors according to the relation rules and the word parameters;

determining principal component feature vectors of the relation rule base through a principal component analysis method;

determining a difference vector in the sentence vector according to the principal component feature vector;

updating a relational rule base according to the difference vector, wherein the relational rule base is updated by:

extracting the text data according to the relation rule base to obtain second tuple data, and updating the tuple database;

determining a new difference vector and a new relation rule according to the text data and the plurality of word parameters based on the new tuple database;

determining the similarity between the new difference vector and the difference vector, and updating the relation rule base according to the new relation rule when the similarity is greater than a preset value.

According to some embodiments of the invention, the word parameters comprise word vectors and tf-idf values, and the processing the text data to obtain a plurality of word parameters comprises:

performing word segmentation processing on the text data to obtain a plurality of words;

determining a word frequency of each of the words in the text data;

determining the tf-idf value of each word according to the word frequency;

determining a word vector for each of the words via a neural network coding model.

According to some embodiments of the invention, the determining a sentence vector according to the relationship rule and the word parameter comprises the steps of:

determining a plurality of the word parameters included in the relationship rule;

determining the sentence vector from a plurality of the word parameters, wherein the sentence vector is determined by the formula:

wherein S represents the sentence vector, n represents the number of words contained in the sentence vector, t_i represents the tf-idf value of the i-th word, and V_i represents the word vector of the i-th word.

According to some embodiments of the invention, the determining a difference vector in the sentence vector according to the principal component feature vector comprises:

determining a first principal component in the principal component feature vector;

and determining the difference between the sentence vector and the projection value of the sentence vector on the first principal component to obtain the difference vector.

According to some embodiments of the invention, the difference vector is determined by the following formula:

S_d = S - u u^T S;

wherein S_d represents the difference vector, S represents the sentence vector, u represents the first principal component, and u^T represents the transpose of the first principal component.

According to some embodiments of the present invention, the word parameter further includes an entity type, and the processing the text data to obtain the plurality of word parameters further includes:

inputting the word vector of the word into an entity recognition model to obtain the entity type of the word;

the step of extracting the text data according to the relation rule base to obtain second tuple data and updating the tuple database comprises the following steps of:

extracting the text data according to the relation rule base to obtain a plurality of second tuple data;

selecting second tuple data which is the same as a preset entity type according to the entity type of the word;

adding second tuple data with the same type as the preset entity into the tuple database to update the tuple database.

According to some embodiments of the invention, the method of knowledge-graph construction comprises the steps of:

repeatedly executing the step of updating the relation rule base according to the difference vector, so as to update the tuple database and the relation rule base, and stopping the updating when the similarity is less than the preset value.

On the other hand, the embodiment of the invention also provides a knowledge graph construction system, which comprises:

a first module for acquiring text data;

the second module is used for processing the text data to obtain a plurality of word parameters;

a third module for determining a difference vector from the text data and the plurality of word parameters, wherein the difference vector is determined by:

acquiring a relation rule base and a tuple database;

a fourth module for updating a relational rule base according to the difference vector, wherein the relational rule base is updated by:

On the other hand, the embodiment of the invention also provides a knowledge graph construction device, which comprises:

at least one processor;

at least one memory for storing at least one program;

the at least one program, when executed by the at least one processor, causes the at least one processor to implement the knowledge graph construction method as previously described.

In another aspect, the embodiment of the present invention further provides a computer-readable storage medium, which stores computer-executable instructions for causing a computer to execute the method for constructing a knowledge graph as described above.

The technical solution of the invention has at least one of the following advantages or beneficial effects. The knowledge graph is constructed by extracting relation rules and second tuple data from the acquired text data, so that the tuple database and the relation rule base are continuously updated. A sentence vector is determined from each extracted relation rule, and a difference vector between the sentence vector and the principal component feature vector of the current relation rule base is determined by principal component analysis, so that relation rules that differ to some degree from the current relation rule base are selected for further analysis. Then, based on the two difference vectors obtained from the tuple database before and after updating, the currently extracted relation rule is added to the relation rule base when the similarity of the two difference vectors is greater than a preset value. In this way, the information common to the extracted relation rules and the current relation rule base is removed by principal component analysis, and the relation rule base is updated only when the similarity between the difference vectors of two successive extractions is greater than the preset value, so that newly generated relation rules are screened more effectively, the amount of calculation is reduced, and the accuracy of knowledge graph relation extraction is improved.

Drawings

FIG. 1 is a flow chart of a knowledge graph construction method provided by an embodiment of the invention;

FIG. 2 is a schematic diagram of a knowledge graph building system provided by an embodiment of the invention;

FIG. 3 is a schematic diagram of a knowledge graph construction device provided by an embodiment of the invention.

Detailed Description

Reference will now be made in detail to embodiments of the present invention, examples of which are illustrated in the accompanying drawings, wherein like or similar reference numerals refer to the same or similar elements or components having the same or similar functions throughout. The embodiments described below with reference to the accompanying drawings are illustrative only for the purpose of explaining the present invention, and are not to be construed as limiting the present invention.

In the description of the present invention, it should be understood that the orientation or positional relationship referred to in the description of the orientation, such as the upper, lower, left, right, etc., is based on the orientation or positional relationship shown in the drawings, and is only for convenience of description and simplicity of description, and does not indicate or imply that the device or element referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus, should not be construed as limiting the present invention.

In the description of the present invention, if there are first, second, etc. described, they are only used for distinguishing technical features, but they are not interpreted as indicating or implying relative importance or implicitly indicating the number of indicated technical features or implicitly indicating the precedence of the indicated technical features.

Referring to fig. 1, the method for constructing a knowledge graph according to an embodiment of the present invention includes, but is not limited to, step S100, step S200, step S300, and step S400.

Step S100, acquiring text data;

step S200, processing the text data to obtain a plurality of word parameters;

step S300, determining a difference vector according to the text data and the plurality of word parameters, wherein the difference vector is determined through the following steps:

step S310, a relation rule base and a tuple database are obtained;

step S320, extracting a relation rule in the text data according to the first tuple data in the tuple database;

step S330, determining sentence vectors according to the relation rules and the word parameters;

step S340, determining principal component characteristic vectors of the relation rule base through a principal component analysis method;

step S350, determining a difference vector in the sentence vector according to the principal component feature vector;

step S400, updating the relation rule base according to the difference vector, wherein the relation rule base is updated through the following steps:

step S410, extracting the text data according to the relation rule base to obtain second tuple data, and updating the tuple database;

step S420, determining a new difference vector and a new relation rule according to the text data and the word parameters based on the new tuple database;

step S430, determining the similarity between the new difference vector and the difference vector, and updating the relation rule base according to the new relation rule when the similarity is greater than a preset value.

Specifically, the text data may be obtained from a network or entered manually. The text data is then processed to obtain a plurality of word parameters; for example, word segmentation, weight calculation, encoding, and entity type identification are performed on the text data to obtain the word parameters of a plurality of words, which may include tf-idf values, word vectors, entity types, and the like. Next, a manually initialized tuple database is obtained, corresponding relation rules are extracted from the text data according to the initial first tuple data in the tuple database, and the relation rules are added to a relation rule base to initialize it. For example, if the initial first tuple data are "apple belongs to fruit" and "fruit belongs to food", the relation rule between "strawberry" and "food" in the text data, namely "strawberry belongs to food", can be identified according to the first tuple data. A sentence vector S_1 is then generated from the word parameters of the words in the extracted relation rule, such as "strawberry" and "food"; the principal component feature vector W_1 of the relation rule base is calculated by principal component analysis; and a difference vector S_d1 is determined from the sentence vector S_1 and the principal component feature vector W_1. The text data is then extracted according to the relation rule base to obtain second tuple data, which is added to the tuple database to update it. Based on the new tuple database, a new difference vector S_d2 and a new relation rule are determined from the text data and the word parameters, and the similarity between the difference vector S_d2 and the difference vector S_d1 is calculated; when the similarity is greater than a preset value, the new relation rule is added to the relation rule base to update it.

Extraction then continues: the text data is extracted according to the relation rule base to obtain second tuple data, which is added to the tuple database to update it. Based on the new tuple database, a new difference vector S_d3 and a new relation rule are determined from the text data and the word parameters, and the similarity between the difference vector S_d3 and the difference vector S_d2 is calculated; when the similarity is greater than the preset value, the new relation rule is added to the relation rule base to update it. The tuple database and the relation rule base are updated in this manner until the relation rule base and the tuple data no longer grow, that is, until the similarity is smaller than the preset value, at which point the construction of the knowledge graph is complete.
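The alternation between rule extraction (step S320) and tuple extraction (step S410) can be illustrated with a deliberately simple sketch. It assumes, purely for illustration, that a relation rule is the literal text found between the two entities of a known tuple and that sentences are plain strings. The source does not specify how rules are represented, and the function names `extract_rules` and `extract_tuples` are hypothetical.

```python
import re

def extract_rules(sentences, tuples):
    """Collect the text between the two entities of each known tuple.

    Treating that text as the relation rule is an assumption of this sketch.
    """
    rules = set()
    for head, tail in tuples:
        pattern = re.compile(re.escape(head) + r"\s+(.+?)\s+" + re.escape(tail))
        for sentence in sentences:
            match = pattern.search(sentence)
            if match:
                rules.add(match.group(1))
    return rules

def extract_tuples(sentences, rules):
    """Apply each rule as a '<word> <rule> <word>' pattern to find entity pairs."""
    found = set()
    for rule in rules:
        pattern = re.compile(r"(\w+)\s+" + re.escape(rule) + r"\s+(\w+)")
        for sentence in sentences:
            for head, tail in pattern.findall(sentence):
                found.add((head, tail))
    return found

sentences = [
    "apple belongs to fruit",
    "strawberry belongs to fruit",
    "fruit belongs to food",
]
seed_tuples = {("apple", "fruit")}  # manually initialized first tuple data

rules = extract_rules(sentences, seed_tuples)   # rules learned from the seeds
new_tuples = extract_tuples(sentences, rules)   # second tuple data found by the rules
```

Starting from a single seed tuple, the learned rule recovers the other entity pairs in the text, which is the growth that the later screening steps are meant to keep under control.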

In this embodiment, the knowledge graph is constructed by extracting relation rules and tuple data from the acquired text data, so that the tuple database and the relation rule base are continuously updated. A sentence vector is determined from each extracted relation rule, and a difference vector between the sentence vector and the principal component feature vector of the current relation rule base is determined by principal component analysis, so that relation rules that differ to some degree from the current relation rule base are selected for further analysis. Then, based on the two difference vectors obtained from the tuple database before and after updating, the currently extracted relation rule is added to the relation rule base when the similarity of the two difference vectors is greater than a preset value. In this way, the information common to the extracted relation rules and the current relation rule base is removed by principal component analysis, and the relation rule base is updated only when the similarity between the difference vectors of two successive extractions is greater than the preset value, so that newly generated relation rules are screened more effectively, the amount of calculation is reduced, and the accuracy of knowledge graph relation extraction is improved.
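The screening condition compares the difference vectors obtained before and after the tuple database is updated. The source does not name a similarity measure, so the sketch below assumes cosine similarity; the threshold of 0.8 and the function names are likewise hypothetical.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)

def should_add_rule(previous_diff, new_diff, preset_value=0.8):
    """Accept the new relation rule only if the two difference vectors are similar enough."""
    return cosine_similarity(previous_diff, new_diff) > preset_value

s_d1 = [0.0, 4.0, 1.0]   # difference vector from the previous round
s_d2 = [0.1, 3.9, 1.2]   # difference vector from the current round

accept = should_add_rule(s_d1, s_d2)               # nearly parallel vectors
reject = should_add_rule(s_d1, [4.0, 0.0, -1.0])   # vectors pointing elsewhere
```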

According to some embodiments of the invention, the word parameters include a word vector and tf-idf values, and step S200 includes, but is not limited to, the following steps:

step S210, performing word segmentation processing on the text data to obtain a plurality of words;

step S220, determining the word frequency of each word in the text data;

step S230, determining the tf-idf value of each word according to the word frequency;

step S240, determining a word vector of each word through a neural network coding model.

Specifically, after word segmentation processing is carried out on text data to obtain a plurality of words, word frequency of each word in the text data is determined, then based on TF-IDF technology, TF-IDF value of each word is determined according to the word frequency, and then word vector of each word is determined through a neural network coding model.

It should be noted that TF-IDF (term frequency-inverse document frequency) is a weighting technique commonly used in information retrieval and data mining. TF stands for term frequency, and IDF stands for inverse document frequency.

It should be noted that the neural network coding model may be a word2vec model or a fasttext model.
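Steps S220 and S230 can be sketched as follows. The sketch assumes that the text data has already been segmented into documents, each a list of words, and uses the plain variant tf = count / document length and idf = log(N / document frequency); the source does not state which tf-idf variant is used, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter

def tf_idf(documents):
    """Return one {word: tf-idf} dict per document; each document is a list of words."""
    n_docs = len(documents)
    doc_freq = Counter()
    for words in documents:
        doc_freq.update(set(words))  # count each word once per document
    scores = []
    for words in documents:
        counts = Counter(words)      # word frequency within this document
        total = len(words)
        scores.append({
            word: (count / total) * math.log(n_docs / doc_freq[word])
            for word, count in counts.items()
        })
    return scores

documents = [
    ["apple", "belongs", "to", "fruit"],
    ["strawberry", "belongs", "to", "fruit"],
    ["fruit", "belongs", "to", "food"],
]
weights = tf_idf(documents)
```

With this unsmoothed variant, a word that occurs in every document (such as "fruit" here) receives a weight of zero, so it contributes nothing to a tf-idf-weighted sentence vector.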

According to some embodiments of the invention, step S330 includes, but is not limited to, the following steps:

step S331, determining a plurality of word parameters contained in the relation rule;

step S332, determining a sentence vector according to the word parameters, wherein the sentence vector is determined by the following formula:
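The formula referenced in step S332 is not reproduced in this text. Given the symbols the disclosure defines for it (S, the sentence vector; n, the number of words; t_i, the tf-idf value of the i-th word; V_i, the word vector of the i-th word), one plausible reading, assumed here rather than confirmed by the source, is the tf-idf-weighted average S = (1/n) * sum of t_i * V_i. A minimal sketch under that assumption:

```python
def sentence_vector(word_vectors, tfidf_values):
    """Assumed form: S = (1/n) * sum_i t_i * V_i (not confirmed by the source)."""
    if not word_vectors or len(word_vectors) != len(tfidf_values):
        raise ValueError("need at least one word and one tf-idf value per word vector")
    n = len(word_vectors)
    dim = len(word_vectors[0])
    s = [0.0] * dim
    for t_i, v_i in zip(tfidf_values, word_vectors):
        for j in range(dim):
            s[j] += t_i * v_i[j]     # accumulate the weighted word vector
    return [x / n for x in s]        # average over the n words

vectors = [[1.0, 0.0], [0.0, 2.0]]   # hypothetical word vectors V_1, V_2
tfidf = [0.5, 0.25]                  # hypothetical tf-idf values t_1, t_2
s = sentence_vector(vectors, tfidf)
```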

According to some embodiments of the invention, step S350 includes, but is not limited to, the following steps:

step S351, determining a first principal component in the principal component feature vector;

step S352, determining the difference between the sentence vector and the projection value of the sentence vector on the first principal component to obtain a difference vector.

Specifically, the disparity vector is determined by the following formula:

S_d = S - u u^T S;

It should be noted that the principal component feature vector generally includes a plurality of principal components: the first principal component represents the most information common to the relation rules in the relation rule base, the second principal component represents the next most, and so on. The embodiment of the present invention is not limited to obtaining the difference vector by calculating the difference between the sentence vector and its projection on the first principal component; the difference vector may also be obtained by calculating the difference between the sentence vector and its projections on all principal components, or on the first several principal components.
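The projection removal S_d = S - u u^T S can be sketched in plain Python. The sketch assumes that the relation rule base is represented by the sentence vectors of its rules and estimates the first principal component u by power iteration on the mean-centered vectors; both the representation and the use of power iteration are assumptions, and the function names are hypothetical.

```python
import math

def first_principal_component(rows, iterations=100):
    """Estimate the first principal component of mean-centered rows by power iteration."""
    dim = len(rows[0])
    mean = [sum(r[j] for r in rows) / len(rows) for j in range(dim)]
    centered = [[r[j] - mean[j] for j in range(dim)] for r in rows]
    u = [1.0 / math.sqrt(dim)] * dim          # arbitrary unit-length starting vector
    for _ in range(iterations):
        # w = X^T (X u), computed without forming the covariance matrix
        proj = [sum(c[j] * u[j] for j in range(dim)) for c in centered]
        w = [sum(proj[i] * centered[i][j] for i in range(len(centered)))
             for j in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0.0:                        # no variance along u: stop early
            break
        u = [x / norm for x in w]
    return u

def difference_vector(s, u):
    """Return S_d = S - u (u^T S) for a unit-length vector u."""
    coeff = sum(a * b for a, b in zip(u, s))   # scalar projection u^T S
    return [a - coeff * b for a, b in zip(s, u)]

# Hypothetical sentence vectors of the rules already in the rule base.
rule_vectors = [[2.0, 0.0], [4.0, 0.0], [-2.0, 0.0], [-4.0, 0.0]]
u = first_principal_component(rule_vectors)
s_d = difference_vector([3.0, 4.0], u)
```

In this toy example all rule vectors vary along the first axis, so the component of the sentence vector along that axis is removed and only the part not shared with the rule base remains. Removing several principal components, as the text permits, would repeat `difference_vector` once per component.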

According to some embodiments of the present invention, the word parameter further includes an entity type, and step S200 further includes, but is not limited to, the following steps:

step S250, inputting the word vector of the word into the entity recognition model to obtain the entity type of the word;

step S410 includes, but is not limited to, the following steps:

step S411, extracting the text data according to a relation rule base to obtain a plurality of second tuple data;

step S412, selecting second tuple data which is the same as the preset entity type according to the entity type of the word;

step S413, adding the second tuple data whose entity type is the same as the preset entity type into the tuple database to update the tuple database.

Specifically, after the text data is extracted according to the relation rule base to obtain a plurality of second tuple data, the second tuple data whose entity type is the same as the preset entity type is selected according to the entity type of the words in the second tuple data. For example, if the preset entity type is a place name, the second tuple data whose entity type is a place name is selected from the plurality of second tuple data. The selected second tuple data is then added to the tuple database to update it, so that the knowledge graph can be constructed according to the required theme, improving the efficiency and accuracy of knowledge graph construction.
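The selection in steps S412 and S413 can be sketched as a simple filter. The sketch assumes that a tuple is a pair of entity words, that the output of the entity recognition model is available as a word-to-type mapping, and that a tuple is kept only when both of its entities have the preset type; the source does not say whether one or both entities must match, and all names and data here are hypothetical.

```python
def filter_by_entity_type(candidate_tuples, entity_types, preset_type):
    """Keep tuples whose head and tail entities both have the preset entity type."""
    return [
        (head, tail)
        for head, tail in candidate_tuples
        if entity_types.get(head) == preset_type
        and entity_types.get(tail) == preset_type
    ]

# Stand-in for the entity recognition model's output (hypothetical data).
entity_types = {
    "Guangzhou": "place",
    "Guangdong": "place",
    "apple": "food",
    "fruit": "food",
}
candidates = [("Guangzhou", "Guangdong"), ("apple", "fruit"), ("apple", "Guangdong")]

tuple_database = {("Shenzhen", "Guangdong")}
selected = filter_by_entity_type(candidates, entity_types, preset_type="place")
tuple_database.update(selected)   # step S413: add the selected tuples to the database
```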

According to some embodiments of the present invention, the method for constructing a knowledge graph further includes, but is not limited to, the following steps:

step S600, repeatedly executing the step of updating the relation rule base according to the difference vector, so as to update the tuple database and the relation rule base, and stopping the updating when the similarity is smaller than the preset value.
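The stopping condition of step S600 can be sketched as a loop over successive extraction rounds. The sketch assumes that each round has already produced a difference vector and a candidate relation rule, that similarity is measured by cosine similarity, and that updating also stops when the similarity equals the preset value; none of these details are stated in the source, and the names and data are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0

def run_updates(rounds, preset_value):
    """Add rules round by round; stop once similarity is no longer above the preset value."""
    rounds = iter(rounds)
    previous_diff, first_rule = next(rounds)   # initial round seeds the rule base
    rule_base = [first_rule]
    for new_diff, new_rule in rounds:
        if cosine(previous_diff, new_diff) <= preset_value:
            break                              # similarity too low: stop updating
        rule_base.append(new_rule)
        previous_diff = new_diff
    return rule_base

# Hypothetical (difference vector, relation rule) pairs from successive rounds.
rounds = [
    ([1.0, 0.0], "belongs to"),
    ([0.9, 0.1], "is a kind of"),
    ([0.8, 0.3], "is a type of"),
    ([0.0, 1.0], "is located in"),
    ([0.0, 0.9], "never reached"),
]
rule_base = run_updates(rounds, preset_value=0.8)
```

The fourth round's difference vector diverges sharply from the third, so the loop stops there and the remaining rounds are never evaluated.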

The embodiment of the present invention further provides a knowledge graph construction system, referring to fig. 2, including:

a first module for acquiring text data;

acquiring a relation rule base and a tuple database;

determining principal component characteristic vectors of a relation rule base through a principal component analysis method;

a fourth module for updating the relational rule base according to the difference vector, wherein the relational rule base is updated by:

and updating the relation rule base according to the new difference vector and the similarity of the difference vector when the similarity is greater than a preset value.

It can be understood that the contents of the embodiment of the knowledge graph construction method all apply to the embodiment of the system; the functions implemented by the embodiment of the system are the same as those of the embodiment of the knowledge graph construction method, and the beneficial effects achieved by the embodiment of the system are also the same as those achieved by the embodiment of the knowledge graph construction method.

Referring to fig. 3, fig. 3 is a schematic diagram of a knowledge graph constructing apparatus according to an embodiment of the present invention. The knowledge graph constructing device of the embodiment of the invention comprises one or more control processors and memories, and one control processor and one memory are taken as an example in fig. 3.

The control processor and the memory may be connected by a bus or other means, as exemplified by the bus connection in fig. 3.

The memory, which is a non-transitory computer readable storage medium, may be used to store non-transitory software programs as well as non-transitory computer executable programs. Further, the memory may include high speed random access memory, and may also include non-transitory memory, such as at least one disk storage device, flash memory device, or other non-transitory solid state storage device. In some embodiments, the memory optionally includes memory remotely located from the control processor, and the remote memory may be connected to the knowledge-graph constructing apparatus via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.

Those skilled in the art will appreciate that the configuration of the apparatus shown in FIG. 3 does not constitute a limitation of the knowledge-graph building apparatus and may include more or fewer components than shown, or some components in combination, or a different arrangement of components.

The non-transitory software programs and instructions required to implement the method of knowledge-graph construction applied to the knowledge-graph constructing apparatus in the above-described embodiments are stored in a memory and, when executed by a control processor, perform the method of knowledge-graph construction applied to the knowledge-graph constructing apparatus in the above-described embodiments.

Furthermore, an embodiment of the present invention also provides a computer-readable storage medium, which stores computer-executable instructions, which are executed by one or more control processors, and can make the one or more control processors execute the method for constructing the knowledge graph in the method embodiment.

One of ordinary skill in the art will appreciate that all or some of the steps, systems, and methods disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. Some or all of the physical components may be implemented as software executed by a processor, such as a central processing unit, digital signal processor, or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). The term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data, as is well known to those of ordinary skill in the art. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism and includes any information delivery media as known to those skilled in the art.

The embodiments of the present invention have been described in detail with reference to the accompanying drawings, but the present invention is not limited to the above embodiments, and various changes can be made within the knowledge of those skilled in the art without departing from the gist of the present invention.

Claims

1. A knowledge graph construction method is characterized by comprising the following steps:

acquiring text data;

processing the text data to obtain a plurality of word parameters;

acquiring a relation rule base and a tuple database;

2. The method of constructing a knowledge graph according to claim 1, wherein the word parameters include word vectors and tf-idf values, and the processing the text data to obtain a plurality of word parameters includes the steps of:

determining a word frequency of each of the words in the text data;

determining the tf-idf value of each word according to the word frequency;

3. The method of knowledge-graph construction according to claim 2, wherein said determining sentence vectors according to said relationship rules and said word parameters comprises the steps of:

4. The method of constructing a knowledge graph according to claim 3, wherein the determining a difference vector in the sentence vector according to the principal component feature vector comprises the steps of:

5. The method of knowledge-graph construction according to claim 4, wherein the disparity vector is determined by the following formula:

S_d = S - u u^T S;

6. The method of constructing a knowledge graph according to claim 2, wherein the word parameters further include entity types, and the processing the text data to obtain a plurality of word parameters further includes the steps of:

7. The method of knowledge-graph construction according to claim 1, comprising the steps of:

8. A knowledge-graph building system, comprising:

a first module for acquiring text data;

acquiring a relation rule base and a tuple database;

9. A knowledge-graph building apparatus, comprising:

at least one processor;

at least one memory for storing at least one program;

the at least one program, when executed by the at least one processor, causes the at least one processor to implement the method of knowledge-graph construction according to any one of claims 1 to 7.

10. A computer-readable storage medium in which a processor-executable program is stored, wherein the processor-executable program, when executed by the processor, is for implementing the method of knowledge-graph construction according to any one of claims 1 to 7.