CN113449119A - Method and device for constructing knowledge graph, electronic equipment and storage medium - Google Patents

Method and device for constructing knowledge graph, electronic equipment and storage medium

Info

Publication number
CN113449119A
CN113449119A
Authority
CN
China
Prior art keywords
word
vocabulary
knowledge graph
words
constructing
Prior art date
Legal status
Pending
Application number
CN202110736340.0A
Other languages
Chinese (zh)
Inventor
潘云嵩 (Pan Yunsong)
Current Assignee
Beijing Kingsoft Office Software Inc
Zhuhai Kingsoft Office Software Co Ltd
Original Assignee
Beijing Kingsoft Office Software Inc
Zhuhai Kingsoft Office Software Co Ltd
Priority date
Filing date
Publication date
Application filed by Beijing Kingsoft Office Software Inc and Zhuhai Kingsoft Office Software Co Ltd
Priority to CN202110736340.0A
Publication of CN113449119A
Legal status: Pending

Classifications

    • G06F 16/367 Ontology (creation of semantic tools, e.g. ontology or thesauri, for information retrieval of unstructured textual data)
    • G06F 16/3329 Natural language query formulation or dialogue systems (querying of unstructured textual data)
    • G06F 40/284 Lexical analysis, e.g. tokenisation or collocates (natural language analysis)
    • G06F 40/289 Phrasal analysis, e.g. finite state techniques or chunking (natural language analysis)

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Linguistics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Mathematical Physics (AREA)
  • Health & Medical Sciences (AREA)
  • Audiology, Speech & Language Pathology (AREA)
  • General Health & Medical Sciences (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Human Computer Interaction (AREA)
  • Machine Translation (AREA)

Abstract

The application provides a method, an apparatus, an electronic device, and a computer storage medium for constructing a knowledge graph. The method for constructing the knowledge graph comprises the following steps: determining nodes of the knowledge graph according to a first vocabulary contained in a first lexicon; obtaining word vectors of the first vocabulary based on a word vector model, the word vector model being obtained by pre-training in an unsupervised manner; and determining, according to the relations among the word vectors of the first vocabulary, the nodes in the knowledge graph between which edges are to be established, and establishing edges for the determined nodes. The method and apparatus can complete the construction of a knowledge graph automatically, without manual participation, with high efficiency, strong generality, and low construction cost.

Description

Method and device for constructing knowledge graph, electronic equipment and storage medium
Technical Field
The present application relates to the field of computer information, and in particular, to a method and an apparatus for constructing a knowledge graph, an electronic device, and a storage medium.
Background
At present, knowledge graphs are increasingly used in fields such as intelligent question answering and search recommendation. A knowledge graph can be viewed as a semantic network used to describe the various concepts that exist in the real world and the connections or relationships between different concepts. It comprises a number of nodes and edges between the nodes, where a node represents a concept/entity in the real world and an edge between nodes represents the relationship between the corresponding concepts/entities.
In the related art, the process of generating the knowledge graph comprises the following steps 101-104:
101. The demand side and domain experts design the basic attributes of the nodes and edges in the knowledge graph;
102. Annotators label the training data according to the designed basic attributes of the nodes/edges;
103. An AI (Artificial Intelligence) model is trained with the annotator-labeled training data using a supervised machine learning method;
104. The trained AI model extracts concepts/entities having the basic attributes from the corpus, establishes nodes from the extracted concepts/entities, extracts the relationships between different concepts/entities, and generates edges from those relationships, thereby completing the construction of the knowledge graph.
However, the cost of domain-expert design is very high, and the basic attributes of nodes/edges designed by the demand side and the domain experts apply only to a particular scene or field; when the scene/field changes, the original design cannot be reused and must be redone. This way of generating knowledge graphs therefore has poor generality and low efficiency. In addition, the cost of manually labeling data is high.
Disclosure of Invention
The following is a summary of the subject matter described in detail in this application. This summary is not intended to limit the scope of the claims.
The application provides a method, a device, electronic equipment and a computer storage medium for constructing a knowledge graph, which can automatically complete the construction of the knowledge graph without manual participation, and have the advantages of high efficiency, strong universality and low construction cost.
In one aspect, an embodiment of the present application provides a method for constructing a knowledge graph, including: determining nodes of the knowledge graph according to a first vocabulary contained in a first word bank; obtaining a word vector of the first vocabulary based on a word vector model; the word vector model is obtained through pre-training in an unsupervised training mode; and determining nodes for establishing edges in the knowledge graph according to the relation among the word vectors of the first vocabulary, and establishing edges for the determined nodes.
Preferably, before determining the nodes of the knowledge graph according to the first vocabulary contained in the first lexicon, the method further comprises: performing word segmentation on the first corpus to obtain a first word sequence; obtaining a plurality of candidate words according to a first word sequence, wherein the candidate words consist of two adjacent words in the first word sequence; and selecting at least one candidate word according to the relevancy of the candidate words and storing the candidate word as a first vocabulary in the first word bank.
Preferably, selecting at least one candidate word according to its relevance and storing it as first vocabulary in the first lexicon includes: calculating the relevance of each candidate word; sorting the candidate words in descending order of relevance; and selecting the top-N candidate words and storing them as first vocabulary words in the first lexicon, where N is a positive integer.
Preferably, the correlation degree is a mutual information value, or the correlation degree is a sum of a left information entropy, a right information entropy and the mutual information value.
Preferably, before the selecting at least one of the candidate words according to the relevance to be stored as the first vocabulary in the first vocabulary bank, the method further includes: and under the condition that the relevancy does not meet the relevancy judgment condition, constructing a new candidate word by using the candidate word and the word adjacent to the candidate word in the first word sequence.
Preferably, the Word vector model is a Word2Vec model, and the method includes: performing word segmentation on the second corpus to obtain a second word sequence; wherein the second sequence of words comprises at least one first word; acquiring first vocabularies contained in a first lexicon, and respectively executing the following operations on each first vocabulary to train a Word2Vec model to obtain Word vectors of the first vocabularies: for each first vocabulary, extracting a preset number of words before and after the first vocabulary is processed currently from the second Word sequence, and inputting the Word2Vec model; and comparing the output value of the Word2Vec model with the currently processed first vocabulary, iterating the Word2Vec model according to the comparison result until the comparison result meets a preset condition, and taking the weight of the Word2Vec model as a Word vector of the first vocabulary.
Preferably, determining the nodes of the knowledge graph according to the first vocabulary contained in the first lexicon includes: establishing a corresponding node in the knowledge graph for each first vocabulary word. Determining the nodes for establishing edges according to the relations between the word vectors of the first vocabulary includes: calculating the feature distance between every two first vocabulary words; and, for two first vocabulary words whose feature distance is smaller than a preset distance threshold, establishing an edge between the corresponding nodes in the knowledge graph.
Preferably, the method comprises: determining the number of the existing edges of the nodes corresponding to the first vocabulary; and under the condition that the number of the existing edges does not exceed the upper limit of the number of the edges, establishing edges between the nodes corresponding to the two first words in the knowledge graph.
In another aspect, an embodiment of the present application provides an apparatus for constructing a knowledge graph, including: the node determining module is used for determining nodes of the knowledge graph according to a first vocabulary contained in the first word bank; the word vector obtaining module is used for obtaining word vectors of the first vocabulary based on a word vector model; the word vector model is obtained through pre-training in an unsupervised training mode; and the relation determining module is used for determining the nodes for establishing the edges in the knowledge graph according to the relation among the word vectors of the first vocabulary, and establishing the edges for the determined nodes.
In another aspect, an embodiment of the present application provides an electronic device, including: a memory and a processor; the memory is used for storing a program for constructing a knowledge graph; the processor is used for reading and executing the program for constructing the knowledge graph in the memory and carrying out the method for constructing the knowledge graph.
In still another aspect, an embodiment of the present application provides a computer storage medium, in which a program for constructing a knowledge graph is stored, and the method for constructing a knowledge graph is implemented when the program for constructing a knowledge graph is executed by a processor.
The present application requires neither professionals to design the knowledge graph nor annotators to label data; the construction of the knowledge graph can be completed automatically without manual participation, with high efficiency and low construction cost, and the approach is applicable to the construction of knowledge graphs in various fields, giving it strong generality.
Additional features and advantages of the application will be set forth in the description which follows, and in part will be obvious from the description, or may be learned by the practice of the application. The objectives and other advantages of the application may be realized and attained by the structure particularly pointed out in the written description and claims hereof as well as the appended drawings.
Drawings
The accompanying drawings are included to provide a further understanding of the claimed subject matter and are incorporated in and constitute a part of this specification, illustrate embodiments of the subject matter and together with the description serve to explain the principles of the subject matter and not to limit the subject matter.
FIG. 1 is a flow chart of a method of constructing a knowledge graph as provided by an embodiment of the present application;
FIG. 2a is a schematic diagram of the structure of CBOW in the Word2Vec model;
FIG. 2b is a schematic diagram of the Skip-gram structure in the Word2Vec model;
FIG. 3 is a diagram illustrating a specific process of constructing a knowledge graph in example 1 of the present application;
FIG. 4 is a diagram illustrating a specific process of mining vocabularies in embodiment 2 of the present application;
FIG. 5 is a schematic diagram of a specific process for training a Word2Vec model in embodiment 3 of the present application;
FIG. 6 is a diagram illustrating a specific process of creating a knowledge graph according to the specialized vocabulary in the first lexicon and the word vectors thereof in embodiment 4 of the present application;
FIG. 7 is a knowledge graph obtained in an example of embodiment 4 of the present application;
FIG. 8 is a knowledge graph obtained in an example of embodiment 5 of the present application;
FIG. 9 is a schematic diagram of an apparatus for constructing a knowledge graph according to an embodiment of the present application;
FIG. 10 is a schematic diagram of an electronic device provided in an embodiment of the present application.
Detailed Description
Hereinafter, embodiments of the present application will be described in detail with reference to the accompanying drawings. It should be noted that the features of the embodiments and examples of the present application may be arbitrarily combined with each other without conflict.
The steps illustrated in the flow charts of the figures may be performed in a computer system such as a set of computer-executable instructions. Also, while a logical order is shown in the flow diagrams, in some cases, the steps shown or described may be performed in an order different than here.
In the present application, ordinal numbers such as "first", "second", "third", and the like are provided to avoid confusion of constituent elements, and are not limited in number. The term "plurality" in this application means two or more than two.
The embodiment of the present application provides a method for constructing a knowledge graph, as shown in fig. 1, including steps S110 to S130:
S110, determining nodes of the knowledge graph according to a first vocabulary contained in a first lexicon;
S120, obtaining word vectors of the first vocabulary based on a word vector model, the word vector model being obtained by pre-training in an unsupervised manner;
S130, determining, according to the relations among the word vectors of the first vocabulary, the nodes in the knowledge graph between which edges are to be established, and establishing edges for the determined nodes.
In an exemplary embodiment, the first vocabulary words contained in the first lexicon correspond one-to-one to the nodes of the knowledge graph; the nodes between which edges are to be established are determined according to the relations between the word vectors of the first vocabulary, and edges are established in the knowledge graph for the determined nodes.
The method for constructing a knowledge graph provided by the embodiments of the application requires neither professionals to design the knowledge graph nor annotators to label data. Construction can be completed automatically without manual participation, with high efficiency and low construction cost, and the method is applicable to the construction of knowledge graphs in various fields, giving it strong generality.
Here, a lexicon (word bank) is a collection of words. The first lexicon may be a pre-designated existing professional lexicon, a lexicon storing newly mined words, or a lexicon storing a set of words pre-selected by the user. The first vocabulary is the vocabulary contained in the first lexicon. A word vector is a mathematical, vectorized representation of a word, and a word vector model is a model that embeds words in symbolic form (such as Chinese words, English words, etc.) into a mathematical space, i.e., converts them into numerical form; word vector models include Word2Vec, BERT, RoBERTa, XLNet, DistilBERT, and the like. The relations between word vectors may include similarity, feature distance, and the like. Unsupervised training, also called unsupervised learning, means that model training (i.e., machine learning) uses unlabeled training samples, so training can be completed without manual labeling.
In this implementation manner, in obtaining the word vector of the first vocabulary based on the word vector model, the first vocabulary may be used as the input of the word vector model, and the corresponding word vector may be the final output of the word vector model.
Preferably, step S110 may be preceded by: performing word segmentation on the first corpus to obtain a first word sequence; obtaining a plurality of candidate words according to a first word sequence, wherein the candidate words consist of two adjacent words in the first word sequence; and selecting at least one candidate word according to the relevancy of the candidate words and storing the candidate word as a first word in the first word bank.
In the preferred embodiment, the first corpus may be a set of documents specified in advance, and the first word sequence refers to a result obtained by segmenting each sentence in the documents contained in the first corpus; the word segmentation means that a sentence is segmented into a plurality of words, and each sentence is converted into a sequence consisting of a plurality of words after being segmented; the word sequence obtained by segmenting each sentence in the first corpus is collectively called as a first word sequence. Relevancy is used to represent the likelihood that multiple words contained in a candidate word can be combined into a complete, independent word.
In an exemplary embodiment, when a knowledge graph of a certain field is constructed, papers, journal articles, and the like from that field can be selected as the first corpus, and the first vocabulary words mined will generally be professional terms of that field. The first corpus can be collected according to the knowledge graph that needs to be established; for example, to build a knowledge graph of the biological field, papers, journal articles, and other documents related to biology can be collected as the first corpus, and the mined first vocabulary words will then, with high probability, be professional terms of the biological field. By selecting different first corpora, professional vocabulary of different fields can be mined and the corresponding field-specific knowledge graphs established; the method can therefore be used conveniently in every field and has good generality.
In the preferred embodiment, the first corpus can be segmented by but not limited to an existing lexicon; wherein, the existing word stock can comprise a general word stock, namely a word stock for receiving and recording common words; the existing word banks may also include a professional word bank, that is, a word bank that includes a proprietary word in a certain field. This existing lexicon is referred to as a second lexicon in order to distinguish it from the first lexicon used for listing the first vocabulary.
In the preferred embodiment, segmenting the first corpus can specifically be done by comparing the vocabulary of the existing lexicon with the first corpus to determine the positions at which lexicon words occur in the first corpus, and adding separators before and after each recognized word. By identifying the separators in the first corpus, the word boundaries in sentences composed of consecutive characters can be recognized. Segmenting the first corpus with the existing lexicon thus means using the words contained in the existing lexicon to delimit the word boundaries of the sentences in the first corpus, so that each sentence is divided into a sequence of words: the words of the existing lexicon appearing in a sentence are recognized, and separators are added before and after them. For example, for the sentence "our country has produced naphthoflavone for many years" (an English rendering of the original Chinese example), the segmentation result is the following sequence, where "/" is the separator identifying word boundaries between adjacent words: our country / has produced / naphthalene / flavone / for / many years. Segmenting every sentence in the first corpus yields many such sequences, which together constitute the first word sequence described above.
In the process of obtaining multiple candidate words according to the first word sequence, multiple sequence segments can be obtained by identifying separators in the first corpus and separating different words in the first word sequence by using the separators, and candidate words can be obtained by using adjacent sequence segment combinations.
In this preferred embodiment, a candidate word is a combination of adjacent words, where "adjacent" means that no other word or punctuation mark lies between them; for example, the word before a comma and the word after it are not adjacent and cannot be combined into a candidate word. To briefly illustrate: in the six-word sequence "our country / has produced / naphthalene / flavone / for / many years", taking every two adjacent words as a candidate word yields 5 candidate words in total: "our country has produced", "has produced naphthalene", "naphthalene flavone" (i.e., naphthoflavone), "flavone for", and "for many years". Similarly, many candidate words can be obtained from the first word sequence produced by segmenting every sentence in the first corpus, as sketched below.
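To make the above concrete, the following is a minimal sketch of the segmentation and candidate-generation steps. The greedy forward-maximum-matching segmenter and all names here are illustrative assumptions, not the specific implementation of the application; a real system could equally use an off-the-shelf segmenter.

```python
def segment(sentence, lexicon):
    """Crude dictionary-based segmentation: greedily match the longest word
    from the existing (second) lexicon; unmatched characters fall back to
    single-character words. Punctuation should be treated as a boundary."""
    max_len = max((len(w) for w in lexicon), default=1)
    words, i = [], 0
    while i < len(sentence):
        for j in range(min(len(sentence), i + max_len), i, -1):
            if j == i + 1 or sentence[i:j] in lexicon:
                words.append(sentence[i:j])
                i = j
                break
    return words

def candidate_words(word_sequence):
    """Every two adjacent words in the first word sequence form a candidate,
    so a six-word sequence yields five candidates, as in the example above."""
    return [(word_sequence[k], word_sequence[k + 1])
            for k in range(len(word_sequence) - 1)]
```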
In an exemplary embodiment of this preferred scheme, selecting at least one candidate word according to its relevance and storing it as first vocabulary in the first lexicon includes: calculating the relevance of each candidate word; sorting the candidate words in descending order of relevance; and selecting the top-N candidate words and storing them as first vocabulary words in the first lexicon, where N is a positive integer.
The greater the relevance, the greater the probability that the candidate word is a complete, independent word. The top-N candidate words are the N candidate words with the greatest relevance. For example, suppose there are candidate words 1 to 5, whose sums of left information entropy, right information entropy, and mutual information value are 0.002168, 0.000056, 0.000603, 0.000123, and 0.000007, respectively. If N is 2, then sorting the sums from largest to smallest, the top-2 candidate words are candidate word 1 (sum 0.002168) and candidate word 3 (sum 0.000603).
In a possible implementation manner, the correlation degree is a mutual information value, or the correlation degree is a sum of a left information entropy, a right information entropy and the mutual information value.
The mutual information value reflects the degree of interdependence between two variables; the entropy is an index used for representing the uncertainty of the random variable, the higher the entropy is, the more diversified the information content is, the higher the uncertainty is, the more difficult the prediction is, and the left information entropy refers to the entropy of the left boundary of the candidate word; the right entropy refers to the entropy of the right boundary of the candidate word.
In this exemplary embodiment, the two words contained in a candidate word are taken as the variables; for example, if the candidate word is "naphthoflavone", the mutual information value between "naphthalene" and "flavone" is calculated. The mutual information value of two words represents their correlation: a higher mutual information value between words w1 and w2 indicates a stronger correlation and a higher possibility that w1 and w2 form a phrase, while a lower mutual information value indicates a weaker correlation between w1 and w2.
In an alternative implementation, the mutual information value MI(w1, w2) is calculated as shown in the following formula (1):

$$MI(w_1, w_2) = \log \frac{p(w_1, w_2)}{p(w_1)\,p(w_2)} \tag{1}$$

In formula (1), w1 and w2 denote the two words combined into the candidate word. p(w1) is the probability of occurrence of the word w1, i.e., the number of occurrences of w1 in the first corpus divided by the total number of words in the first corpus; p(w2) is defined likewise for w2. p(w1, w2) is the probability that w1 and w2 appear together as one word, i.e., the number of occurrences of the candidate word "w1w2" in the first corpus divided by the total number of words in the first corpus.

In general, if X is a random variable taking finitely many values, and the probability that X takes the value x is p(x), then the entropy of X is defined as the following formula (2):

$$H(X) = -\sum_{x} p(x) \log p(x) \tag{2}$$

In an optional implementation, the left information entropy $H_L(w)$ and the right information entropy $H_R(w)$ of a candidate word w are calculated by the following formulas (3) and (4), respectively:

$$H_L(w) = -\sum_{l \in L} p(w_l \mid w) \log p(w_l \mid w) \tag{3}$$

In formula (3), $H_L(w)$ is the left information entropy of the candidate word w, and the conditional probability $p(w_l \mid w)$ is the probability that the word $w_l$ appears to the left of w; L is the set of words that appear to the left of the candidate word w.

$$H_R(w) = -\sum_{r \in R} p(w_r \mid w) \log p(w_r \mid w) \tag{4}$$

In formula (4), $H_R(w)$ is the right information entropy of the candidate word w, and $p(w_r \mid w)$ is the probability that the word $w_r$ appears to the right of w; R is the set of words that appear to the right of w.
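The scoring described by formulas (1)-(4) can be sketched as follows, assuming simple frequency counts over the segmented first corpus; the function names and data layout are illustrative assumptions, not the application's prescribed implementation.

```python
import math
from collections import Counter, defaultdict

def entropy(neighbour_counts):
    """Shannon entropy, formula (2), over a Counter of boundary words."""
    total = sum(neighbour_counts.values())
    if total == 0:
        return 0.0
    return -sum(c / total * math.log(c / total) for c in neighbour_counts.values())

def score_candidates(word_sequences):
    """Score each adjacent-word candidate by H_L + H_R + MI."""
    unigram, bigram = Counter(), Counter()
    left_ctx, right_ctx = defaultdict(Counter), defaultdict(Counter)
    total_words = 0
    for seq in word_sequences:
        total_words += len(seq)
        unigram.update(seq)
        for k in range(len(seq) - 1):
            pair = (seq[k], seq[k + 1])
            bigram[pair] += 1
            if k > 0:                                  # word left of the candidate
                left_ctx[pair][seq[k - 1]] += 1
            if k + 2 < len(seq):                       # word right of the candidate
                right_ctx[pair][seq[k + 2]] += 1
    scores = {}
    for (w1, w2), n in bigram.items():
        mi = math.log((n / total_words) /
                      ((unigram[w1] / total_words) * (unigram[w2] / total_words)))  # formula (1)
        scores[(w1, w2)] = entropy(left_ctx[(w1, w2)]) + entropy(right_ctx[(w1, w2)]) + mi
    return scores
```

The top-N selection described above is then one line, e.g. `sorted(scores, key=scores.get, reverse=True)[:N]`.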
In this exemplary embodiment, candidate words with the same content appearing at different positions in the first corpus are regarded as the same candidate word; for example, if several sentences of the first corpus contain "w1w2", every occurrence of "w1w2" is counted as the same candidate word.
In this exemplary embodiment, selecting candidate words by calculating the left and right information entropies and the mutual information value makes it possible to automatically mine new words formed by combining existing words in the first corpus. On one hand, this avoids the over-fragmentation that results from segmenting with the existing lexicon alone; on the other hand, the nodes of the knowledge graph are obtained in an unsupervised manner, which greatly reduces the construction cost.
In other implementations, the relevance can also be obtained by a weighted sum or another linear operation over the left information entropy, the right information entropy, and the mutual information value, with candidate words selected by sorting the results. Further parameters may also be introduced into the selection, such as the frequency with which the candidate word appears in the first corpus.
In an exemplary implementation, before selecting at least one candidate word according to the relevance to be stored as first vocabulary in the first lexicon, the method further includes: in the case that the relevance of a candidate word does not meet the relevance judgment condition, constructing a new candidate word from that candidate word and the word adjacent to it in the first word sequence.
The relevance judgment condition is used to judge whether words adjacent in the first word sequence need to be recombined into a new, longer candidate word. When a candidate word fails the condition, it is likely only a fragment of a complete word, so it is recombined with an adjacent word; when the condition is met, no recombination is needed.
As an example, a candidate word whose left or right information entropy is 0 can be treated as a fragment: the word adjacent to it on that side in the first word sequence is combined with the fragment to form a new candidate word, and the relevance of the new candidate word is calculated. Specifically, a left information entropy of 0 means that, throughout the first corpus, the same word always appears to the left of the candidate word w, i.e., the conditional probability p(w_l | w) is 1; similarly, a right information entropy of 0 means that the same word always appears to the right of w, i.e., p(w_r | w) is 1. If the left information entropy is 0, the left-adjacent word is appended to obtain a new candidate word; if the right information entropy is 0, the right-adjacent word is appended. If the left or right information entropy of the new candidate word is still 0, it is combined again, until neither boundary entropy is 0. In this way, new words composed of three or more adjacent words in the first corpus can be mined.
Such as "helicobacter pylori agent", is divided into three words in the first word sequence: helicobacter pylori/agent, and two candidate words "helicobacter pylori" and "helicobacter agent" were obtained. If the left entropy of the candidate word "helicobacter agent" is calculated to be 0, indicating that "helicobacter agent" is likely to be a fragment of a word, a new candidate word "helicobacter pylori agent" can be obtained by adding the word "pylorus" adjacent to the left with "helicobacter agent" as a fragment.
Another example is a combination of four common words w1, w2, w3, w4. For the candidate word "w3w4", the left information entropy is 0; after the left-adjacent word "w2" is added to the fragment "w3w4", the new candidate word "w2w3w4" still has a left information entropy of 0. The word "w1" adjacent on the left of "w2w3w4" is then added, giving the new candidate word "w1w2w3w4", whose left and right information entropies are both non-zero, so the candidate word is not changed further. For the candidate word "w2w3", both the left and right information entropies are 0, so w1 is added on the left of the fragment "w2w3" and w4 on the right, again yielding the new candidate word "w1w2w3w4"; its left and right information entropies are not 0, and the candidate word is not changed further.
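This re-combination rule can be sketched as below; `left_entropy` and `right_entropy` are assumed helpers that evaluate formulas (3) and (4) for the span `seq[start:end]` over the whole first corpus, and the names are illustrative.

```python
def grow_candidate(seq, start, end, left_entropy, right_entropy):
    """Grow the candidate seq[start:end] while a boundary entropy is 0,
    i.e. while the same word always adjoins the candidate on that side."""
    while True:
        grew = False
        if start > 0 and left_entropy(seq, start, end) == 0:
            start, grew = start - 1, True      # absorb the left neighbour
        if end < len(seq) and right_entropy(seq, start, end) == 0:
            end, grew = end + 1, True          # absorb the right neighbour
        if not grew:                           # both boundary entropies non-zero
            return seq[start:end]              # e.g. ["pylorus", "helicobacter", "agent"]
```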
Preferably, the method for constructing a knowledge-graph may further comprise: and training the word vector model based on an unsupervised training mode. The unsupervised training mode means that training samples used for training the word vector model are unlabeled, that is, do not need human participation.
Preferably, the Word vector model is Word2Vec, and the method further comprises: performing word segmentation on the second corpus to obtain a second word sequence; wherein the second sequence of words comprises at least one first word; acquiring first vocabularies contained in the first lexicon, and respectively executing the following operations on each first vocabulary to train a Word2Vec model so as to obtain Word vectors of the first vocabularies: for each first vocabulary, extracting a preset number of words before and after the first vocabulary is processed currently from the second Word sequence, and inputting the Word2Vec model; and comparing the output value of the Word2Vec model with the currently processed first vocabulary, iterating the Word2Vec model according to the comparison result until the comparison result meets a preset condition, and taking the weight of the Word2Vec model as a Word vector of the first vocabulary.
Here, the currently processed first vocabulary word is the one currently input into the model. During training, the words extracted from different positions of the second word sequence each serve as a training sample, so a large number of training samples can be obtained from the second corpus to train the Word2Vec model. These training samples need no labeling: the word at the middle position of each sample (namely, the central word) is used directly as the expected output, and the remaining words are used as the model input.
The iterative process can be regarded as adjusting the weights of the Word2Vec model; the comparison result can be regarded as the error between the actual output and the expected output (i.e., the first vocabulary word), and the purpose of iteration is to reduce this error until it is small enough, i.e., until a predetermined condition is met, at which point training is considered complete. The word vector of the first vocabulary word is then the weights of the trained Word2Vec model. In one possible implementation, the predetermined condition is a similarity threshold between the output value of the Word2Vec model and the first vocabulary word: when the similarity is greater than or equal to the threshold, the comparison result meets the predetermined condition and training of the Word2Vec model is complete; otherwise, the condition is not met and iterative training continues. For example, if the similarity between the output value and the first vocabulary word is 85% and the similarity threshold is 90%, the similarity is below the threshold, the predetermined condition is judged not met, and iterative training of the Word2Vec model continues; when the similarity reaches 95%, which exceeds the threshold, the predetermined condition is judged met and training is complete.
The Word2Vec model is a shallow neural network model and comprises an input layer, a hidden layer and an output layer; taking the central Word and the context Word of the central Word as training samples to train the Word2Vec model, wherein the weight in the trained Word2Vec model is the Word vector of the central Word; the central Word is a Word at the central position in the Word sequence as the training sample, and is a Word which is used as Word2Vec model input data (in the case of predicting the context through the central Word) or expected output data (in the case of predicting the central Word through the context).
In the scenario of building a knowledge graph, the quality of word vectors obtained with other models is not obviously superior to that of the Word2Vec model, while their training cost is far higher. Moreover, in terms of structure and training mode, Word2Vec predicts the central word from its context (or the context from the central word), so the co-occurrence relations among words are learned implicitly; this characteristic matches well the requirements of constructing the nodes and edges of the knowledge graph.
In this implementation, the Word2Vec model has two network structures: the Skip-gram structure and the CBOW structure. The CBOW structure predicts the central word w(t) from its context words, while the Skip-gram structure predicts the context words from the central word w(t). Both structures contain an INPUT layer, a PROJECTION layer, and an OUTPUT layer, and differ only in the content of the input and output layers. Specifically, in the CBOW structure shown in FIG. 2a, the input layer takes the context words w(t-2), w(t-1), w(t+1), and w(t+2) of the central word w(t), and the output layer gives the central word w(t). For example, in "I work at Company A", the central word is "Company A" and the context words include "I", "work", and "at". The Skip-gram structure is shown in FIG. 2b: the input layer is the central word w(t), and the output layer is its context w(t-2), w(t-1), w(t+1), and w(t+2).
In an exemplary implementation, the Word2Vec model adopts a CBOW structure; accordingly, the Word2Vec model can be trained in an unsupervised manner as follows: performing word segmentation on the second corpus to obtain a second word sequence; taking each first word in the second word sequence as a central word to perform the following operations: extracting a central word and a central word context word from the second word sequence by adopting a sliding window with a preset length as a training sample; and respectively taking the words except the central Word in each training sample as input, taking the output central Word as a training target, and training the Word2Vec model.
Wherein the context words comprise words adjacent to the central word in the second word sequence, words before the central word and words after the central word; the length of the sliding window represents the number of extracted words, for example, when the length of the sliding window is 3, a central word is extracted, and adjacent words are respectively before and after the central word, so that three words are obtained in total and serve as a training sample. In the extracted training samples, context words are used as input quantity of the Word2Vec model, and the central words are used as expected output targets, so that the Word2Vec model can be trained by extracting a large number of training samples from massive linguistic data, and the output of the Word2Vec model is enabled to approach the central words continuously.
In the training process, the central Word is output as a training target, namely, the iteration target during training is to make the comparison result (namely error) between the output value of Word2Vec and the central Word smaller and smaller, namely, the weight of Word2Vec is adjusted to make the error smaller. And when the error meets a preset condition, stopping iteration and finishing training. The Word vector for the core Word is the weight of the Word2Vec model that completed the training. In other implementations, it is not excluded to use the Word2Vec model of the Skip-gram structure to derive the Word vectors.
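As one concrete possibility, the open-source gensim library implements this unsupervised CBOW training directly; the snippet below is a sketch under the assumption that such a library is acceptable (the application does not prescribe a toolkit). The parameters mirror the embodiments: CBOW structure, 200-dimensional vectors, and a two-word context on each side of the central word.

```python
from gensim.models import Word2Vec

# second_word_sequences: list of token lists obtained by segmenting the
# second corpus so that mined first-lexicon words stay unsplit
model = Word2Vec(second_word_sequences,
                 sg=0,             # 0 selects the CBOW structure
                 vector_size=200,  # dimensionality of the word vectors
                 window=2,         # two context words on each side
                 min_count=1)

vector = model.wv["nucleotide sequence"]  # vector of one first-lexicon word (illustrative key)
```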
Preferably, in step S110, the determining nodes of the knowledge graph according to the first vocabulary contained in the first lexicon includes: for each first vocabulary, respectively establishing a corresponding node in the knowledge graph; determining nodes for establishing edges in the knowledge graph according to the relation between word vectors of the first vocabulary, wherein the step of establishing edges for the determined nodes comprises the following steps: calculating the characteristic distance between every two first vocabularies; and for two first vocabularies with the characteristic distance smaller than a preset distance threshold, determining nodes corresponding to the two first vocabularies as nodes for establishing edges, and establishing edges between the nodes corresponding to the two first vocabularies in the knowledge graph.
In this preferred embodiment, the feature distance may be, but is not limited to, the Euclidean distance; the similarity between two first vocabulary words can be derived from the feature distance to determine whether to establish an edge. The feature distance can also be replaced by cosine similarity: when the cosine similarity between the word vectors of two nodes is greater than a preset similarity threshold, an edge is established between the two nodes.
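Both edge criteria can be sketched in a few lines; the threshold values are illustrative (0.6 is the similarity threshold used later in the embodiments).

```python
import numpy as np

def should_link(v1, v2, use_cosine=True, sim_threshold=0.6, dist_threshold=1.0):
    """Decide whether the nodes of two first-lexicon words get an edge:
    either Euclidean feature distance below a preset threshold, or cosine
    similarity above a preset threshold, as described above."""
    if use_cosine:
        sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
        return sim > sim_threshold
    return np.linalg.norm(v1 - v2) < dist_threshold
```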
In an exemplary implementation, the method further includes: determining the number of the existing edges of the nodes corresponding to the first vocabulary; and under the condition that the number of the existing edges does not exceed the upper limit of the number of the edges, establishing edges between the nodes corresponding to the two first words in the knowledge graph.
The number of the existing edges may be the number of edges which are already established between the current node and other nodes; the upper limit on the number of edges may be the maximum number of edges that can be established for each node. In this implementation, the number of edges of each node in the knowledge graph is set to have an upper limit of the number of edges, so as to avoid confusion caused by too many edges in the knowledge graph.
In other implementation manners, edges may be established in the knowledge graph according to the cosine similarity greater than the preset similarity threshold, and then the number of edges of each node is traversed, and for nodes whose number of edges exceeds the preset upper limit of the number of edges, part of the edges may be deleted, and the edges corresponding to the smaller cosine similarity may be preferentially deleted.
The scheme of the present application is further illustrated below by 5 examples.
Example 1
This embodiment describes the case where a knowledge graph of the biological field is to be constructed. Papers, journal articles, and other documents in the biological field are collected as the first corpus, and the training data of the word vector model serves as the second corpus. Some documents in the first and second corpora may be the same.
In this embodiment, the process of constructing the knowledge graph is shown in fig. 3, and includes the following steps S210 to S230:
and S210, excavating new words according to the first corpus to serve as professional vocabularies of the biological field, and storing the professional vocabularies in a first word bank.
S220, training the Word2Vec model by using the second corpus to obtain a Word vector of each professional vocabulary in the first lexicon.
S230, establishing the knowledge graph of the biological field according to all the professional vocabulary words in the first lexicon and their word vectors. Specifically: for each professional vocabulary word in the first lexicon, a corresponding node is established in the knowledge graph; and whether an edge is established between two corresponding nodes of the knowledge graph is determined according to the relation between the word vectors of each pair of professional vocabulary words.
Example 2
This embodiment describes in detail the process of mining the professional vocabulary from the first corpus, as shown in FIG. 4, comprising steps 401-406:
401. Loading an existing second lexicon; the loaded second lexicon may contain only a general lexicon, or may additionally contain an existing professional lexicon of the biological field.
402. Segmenting the first corpus to obtain a first word sequence, and taking every two adjacent words in the first word sequence as a candidate word.
403. For each candidate word, calculating the left information entropy, the right information entropy, and the mutual information value.
404. For each candidate word, summing its left information entropy, right information entropy, and mutual information value, and taking the result as the score of the candidate word.
405. Sorting all candidate words in descending order of score.
406. Selecting the top-N candidate words (N is a preset positive integer) as the mined professional vocabulary, and storing them in the first lexicon.
In an example of this embodiment, the first corpus consists of papers, journal articles, and similar documents in the biological field, and the professional vocabulary of the biological field obtained in the first lexicon is shown in Table 1.
TABLE 1 Mined professional vocabulary

  Professional vocabulary      Score
  Nucleotide sequence          0.002132401
  Restriction enzymes          0.001811142
  Recognition sites            0.001690015
  Coding region                0.001085844
  Coding sequence              0.00106069
  Nucleic acid sequences       0.00105856
  Gene sequences               0.001013436
  Base pairing                 0.000961465
  Base sequence                0.00095819
  Primer                       0.000957036
  Transcription                0.000603698
Example 3
This embodiment describes in detail the process of training the Word2Vec model with the second corpus, as shown in FIG. 5, comprising the following steps 501-504:
501. Acquiring the first lexicon and the second lexicon.
502. Loading the first lexicon and the second lexicon.
503. Segmenting the second corpus to obtain a second word sequence; in this embodiment, the jieba word segmentation tool is used to obtain the second word sequence, but other word segmentation tools or algorithms may also be used, and the present application is not limited in this respect.
504. Training the Word2Vec model with the second word sequence to obtain the word vector of each professional vocabulary word in the first lexicon.
In this embodiment, a Word2Vec model with the CBOW structure is used for training. In the CBOW-structured Word2Vec model, the neurons of the input layer each receive a different context word of the central word, the output layer gives the softmax probability of each vocabulary word, and the goal of training is to maximize the softmax probability corresponding to the central word; this goal is achieved by adjusting the weights of the Word2Vec model during training.
In an example of this embodiment, the training process takes each professional vocabulary word in Table 1 of embodiment 2 as a central word and performs the following operations for each central word:
the second Word sequence is stroked with a sliding window of length 5, resulting in a sequence [ w1, w2, w3, w4, w5] containing the central Word w3, and the two words w1, w2 located before the central Word w3, and the two words w4, w5 located after the central Word w3, so that the input of the Word2Vec model is [ w1, w2, w4, w5], and the Word2Vec model is trained such that the softmax probability of the Word w3 in the output layer is maximized. The weight of the trained Word2Vec model is the Word vector of the central Word w 3. Alternatively, the sequence [ w1, w2, w3, w4, w5] may be extracted from the first corpus or other corpora, words other than w3 may be input into the trained Word2Vec model, the central Word w3 may be used as the output target, and the weight of the Word2Vec model thus obtained may be used as the Word vector of w 3.
In this example, the weights of the trained Word2Vec model form a 200-dimensional vector, which is used as the word vector.
Example 4
This embodiment describes in detail how, after the first lexicon and the word vectors of its professional vocabulary have been obtained, the knowledge graph is established from all the professional vocabulary words in the first lexicon and their word vectors.
The establishment of the knowledge graph has two main tasks: establishing the nodes of the graph and establishing the connections (i.e., the edges) between nodes. In this embodiment, the specific process of establishing the knowledge graph is shown in FIG. 6 and comprises the following steps 601-605:
601. Acquiring all professional vocabulary words in the first lexicon.
602. Establishing one node in the knowledge graph for each professional vocabulary word.
603. Acquiring the word vector of each professional vocabulary word.
604. Calculating the cosine similarity between the word vectors of every two professional vocabulary words.
605. For two professional vocabulary words whose cosine similarity is greater than a preset similarity threshold (0.6 in this embodiment), establishing a connection, i.e., an edge, between the corresponding nodes of the knowledge graph.
In the above steps, some of the node-establishing and edge-establishing operations may be performed in parallel; for example, steps 601 and 602 may run while steps 603 and 604 run, as long as step 605 is performed after steps 602 and 604.
In an example of this embodiment, the first lexicon shown in Table 1 can be obtained following embodiment 2, and the word vector of each professional vocabulary word following embodiment 3. For each professional vocabulary word in Table 1, a node is generated in the knowledge graph; the cosine similarity between the word vectors of every two professional vocabulary words is calculated, and for pairs whose cosine similarity exceeds the preset similarity threshold of 0.6, edges are established between the corresponding nodes. The resulting knowledge graph of the biological field is shown in FIG. 7.
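Steps 601-605 can be sketched end to end as follows, here using the networkx library as one illustrative way to hold the graph (the application does not name a graph library); `word_vectors` is assumed to map each professional vocabulary word to its 200-dimensional vector.

```python
import networkx as nx
import numpy as np

def build_graph(word_vectors, sim_threshold=0.6):
    g = nx.Graph()
    g.add_nodes_from(word_vectors)               # steps 601-602: one node per word
    words = list(word_vectors)
    for i, w1 in enumerate(words):               # steps 603-605: pairwise similarity
        for w2 in words[i + 1:]:
            v1, v2 = word_vectors[w1], word_vectors[w2]
            sim = np.dot(v1, v2) / (np.linalg.norm(v1) * np.linalg.norm(v2))
            if sim > sim_threshold:
                g.add_edge(w1, w2, similarity=sim)
    return g
```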
Example 5
The difference between this embodiment and embodiment 4 is that an upper limit M on the number of edges (M is a preset positive integer) is imposed on each node of the knowledge graph, i.e., each node can establish edges with at most M other nodes.
In this embodiment, when the knowledge graph is established from the professional vocabulary and word vectors in the first lexicon, the nodes are established as in steps 601 and 602 of embodiment 4, and establishing the edges likewise begins by obtaining the word vector of each professional vocabulary word and calculating the cosine similarity between every two of them, as in steps 603 and 604 of embodiment 4. Only the final operation of establishing edges differs from embodiment 4: two conditions must be considered, namely that the cosine similarity is greater than the preset similarity threshold and that the number of edges of a node does not exceed the upper limit on the number of edges.
In an implementation manner of this embodiment, for two professional vocabularies whose cosine similarity of a word vector is greater than a preset similarity threshold, whether the respective edge numbers of nodes corresponding to the two professional vocabularies reach a predetermined upper limit M of the edge numbers is checked, and an edge is established between the nodes corresponding to the two professional vocabularies when neither of the edge numbers reaches the predetermined upper limit M of the edge numbers.
In this implementation, for the two nodes of an edge to be established, it is first judged whether the preset condition is met (the number of edges of neither node has reached the upper limit M); the edge is established only if the condition is met, and otherwise it is not established.
In another implementation of this embodiment, edges are first established between the nodes of every pair of professional vocabulary words whose word-vector cosine similarity exceeds the preset similarity threshold; then each node in the knowledge graph is traversed, and for any node whose number of edges exceeds the preset upper limit M, only the M edges with the greatest cosine similarity are kept.
In this implementation, after the edges are established, those that do not meet the preset condition are screened out and deleted. As an example, in the established knowledge graph shown in FIG. 7, suppose the predetermined upper limit M on the number of edges is 3. Traversing the nodes of the knowledge graph, the "nucleotide sequence" node is found to have 10 edges, which exceeds the upper limit. The cosine similarities between "nucleotide sequence" and the word vectors of the other 10 professional vocabulary words are then ranked; assuming the cosine similarities between "nucleotide sequence" and "nucleic acid sequence", "base sequence", and "gene sequence" are higher than those with the other 7 words, only the edges between the "nucleotide sequence" node and the "nucleic acid sequence", "base sequence", and "gene sequence" nodes are retained, as shown in FIG. 8.
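This prune-after-building variant can be sketched as follows, with the graph held as a dict mapping each node to a dict of {neighbour: cosine similarity}; the representation is an illustrative assumption. Note that removing a mirrored edge also shrinks the neighbour's edge list, and the isolated-node safeguard discussed next is deliberately not implemented here.

```python
def prune_edges(graph, max_edges):
    """Keep at most max_edges edges per node, preferring larger similarities."""
    for node, nbrs in graph.items():
        if len(nbrs) > max_edges:
            keep = set(sorted(nbrs, key=nbrs.get, reverse=True)[:max_edges])
            for nbr in list(nbrs):
                if nbr not in keep:
                    del nbrs[nbr]
                    graph[nbr].pop(node, None)   # remove the mirrored edge
    return graph
```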
In other embodiments, further preset conditions may be added when selecting which edges to keep. For example, if node a has only one edge, which connects it to node b, and node b has four edges, one edge of node b must be deleted; deleting the edge between a and b would leave node a isolated in the knowledge graph, i.e., connected to no other node, so the edge between node b and node a may be preferentially retained.
An exemplary embodiment of the present application further provides an apparatus for constructing a knowledge graph, as shown in fig. 9, including:
the node determining module 91 is configured to determine a node of the knowledge graph according to a first vocabulary contained in the first lexicon;
a word vector obtaining module 92, configured to obtain a word vector of the first vocabulary based on a word vector model; the word vector model is obtained through pre-training in an unsupervised training mode;
and a relation determining module 93, configured to determine, according to the relations among the word vectors of the first vocabulary, the nodes between which edges are to be established in the knowledge graph, and to establish edges for the determined nodes.
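For illustration only, the three modules correspond to the following end-to-end flow, reusing build_graph_with_cap from the sketch above and assuming (as an editorial choice) that the pre-trained model exposes a gensim-style wv lookup:

    def construct_knowledge_graph(lexicon, w2v_model, threshold=0.8, max_edges=3):
        # Module 91: one node per first vocabulary in the first lexicon.
        # Module 92: word vectors from the pre-trained unsupervised model.
        # Module 93: edges from pairwise word-vector similarity.
        vectors = {w: w2v_model.wv[w] for w in lexicon if w in w2v_model.wv}
        return build_graph_with_cap(vectors, threshold, max_edges)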
An exemplary embodiment of the present application also provides a computer storage medium storing a program for constructing a knowledge graph; when executed by a processor, the program performs the method for constructing a knowledge graph in any one of the above embodiments.
An exemplary embodiment of the present application also provides an electronic device, as shown in fig. 10, including: a memory 1001 and a processor 1002, the memory 1001 being used for storing a program for constructing a knowledge graph; the processor 1002 is configured to read and execute the program for constructing a knowledge graph in the memory 1001 to perform the method for constructing a knowledge graph in any of the embodiments described above.
It will be understood by those of ordinary skill in the art that all or some of the steps of the methods, systems, and functional modules/units in the devices disclosed above may be implemented as software, firmware, hardware, and suitable combinations thereof. In a hardware implementation, the division between functional modules/units mentioned in the above description does not necessarily correspond to the division of physical components; for example, one physical component may have multiple functions, or one function or step may be performed by several physical components in cooperation. Some or all of the components may be implemented as software executed by a processor, such as a digital signal processor or microprocessor, or as hardware, or as an integrated circuit, such as an application specific integrated circuit. Such software may be distributed on computer readable media, which may include computer storage media (or non-transitory media) and communication media (or transitory media). As is well known to those of ordinary skill in the art, the term computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes, but is not limited to, RAM, ROM, EEPROM, flash memory or other memory technology, CD-ROM, Digital Versatile Disks (DVD) or other optical disk storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to store the desired information and which can be accessed by a computer. In addition, communication media typically embodies computer readable instructions, data structures, program modules or other data in a modulated data signal such as a carrier wave or other transport mechanism, and includes any information delivery media as known to those skilled in the art.

Claims (11)

1. A method of constructing a knowledge graph, comprising:
determining nodes of the knowledge graph according to a first vocabulary contained in a first lexicon;
obtaining a word vector of the first vocabulary based on a word vector model; the word vector model is obtained through pre-training in an unsupervised training mode;
and determining, according to the relations among the word vectors of the first vocabulary, the nodes between which edges are to be established in the knowledge graph, and establishing edges for the determined nodes.
2. The method of constructing a knowledge graph of claim 1, wherein prior to determining nodes of the knowledge graph based on the first vocabulary contained in the first lexicon, the method further comprises:
performing word segmentation on a first corpus to obtain a first word sequence;
obtaining a plurality of candidate words according to the first word sequence, wherein each candidate word consists of two adjacent words in the first word sequence;
and selecting at least one of the candidate words according to the relevance of the candidate words and storing it as a first vocabulary in the first lexicon.
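Purely as an editorial illustration of claim 2 (not part of the claims), candidate words built from adjacent words of the segmented sequence could look like:

    def candidate_words(word_sequence):
        # Each candidate word is the concatenation of two adjacent words
        # in the first word sequence obtained by word segmentation.
        return [word_sequence[i] + word_sequence[i + 1]
                for i in range(len(word_sequence) - 1)]

    # candidate_words(["nucleotide", "sequence", "alignment"])
    # -> ["nucleotidesequence", "sequencealignment"]
    # (in Chinese text, concatenating adjacent segments yields natural
    # compound terms such as professional vocabularies)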
3. The method of constructing a knowledge graph as claimed in claim 2, wherein selecting at least one candidate word to be stored as a first vocabulary in the first lexicon based on the relevance of the candidate words comprises:
calculating the relevance of each candidate word;
sorting the candidate words in descending order of relevance;
and selecting the top N candidate words as first vocabularies and storing them in the first lexicon; wherein N is a positive integer.
4. A method of constructing a knowledge graph as defined in claim 2, wherein:
the relevance is a mutual information value, or the relevance is the sum of a left information entropy, a right information entropy, and the mutual information value.
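As an editorial sketch of the claim-4 relevance (mutual information plus left and right information entropy), with probabilities estimated directly from the first word sequence; the function and variable names are the editor's, and the candidate is assumed to actually occur in the sequence:

    import math
    from collections import Counter

    def relevance(seq, a, b):
        # relevance of candidate word a+b over the word sequence seq
        n = len(seq)
        positions = [i for i in range(n - 1) if seq[i] == a and seq[i + 1] == b]
        p_a, p_b = seq.count(a) / n, seq.count(b) / n
        p_ab = len(positions) / (n - 1)
        mi = math.log(p_ab / (p_a * p_b))   # mutual information value

        def entropy(neighbours):
            counts = Counter(neighbours)
            total = sum(counts.values())
            return -sum(c / total * math.log(c / total) for c in counts.values())

        left = entropy([seq[i - 1] for i in positions if i > 0])        # left entropy
        right = entropy([seq[i + 2] for i in positions if i + 2 < n])   # right entropy
        return mi + left + right            # or just mi, per the first alternative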
5. The method of constructing a knowledge graph of claim 2, wherein before selecting at least one of the candidate words to store as a first vocabulary in the first lexicon based on relevance, the method further comprises:
constructing, under the condition that the relevance does not satisfy a relevance judgment condition, a new candidate word from the candidate word and a word adjacent to it in the first word sequence.
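Claim 5's extension step could, as a hypothetical sketch, be a loop that keeps absorbing the adjacent word until the relevance judgment condition is met; score, threshold, and max_len stand in for the claimed relevance computation, judgment condition, and a stopping bound:

    def grow_candidate(seq, start, score, threshold, max_len=4):
        # Start from two adjacent words; while the relevance judgment
        # condition fails, build a new candidate by absorbing the next
        # adjacent word in the first word sequence.
        length = 2
        while start + length <= len(seq) and length <= max_len:
            candidate = seq[start:start + length]
            if score(candidate) >= threshold:
                return "".join(candidate)
            length += 1
        return None   # no qualifying candidate at this position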
6. The method of constructing a knowledge graph according to claim 4 or 5, wherein the word vector model is a Word2Vec model, and the method comprises:
performing word segmentation on a second corpus to obtain a second word sequence, wherein the second word sequence comprises at least one first vocabulary;
acquiring the first vocabularies contained in the first lexicon, and performing the following operations on each first vocabulary to train the Word2Vec model and obtain the word vector of each first vocabulary:
for each first vocabulary, extracting, from the second word sequence, a predetermined number of words before and after the currently processed first vocabulary, and inputting them into the Word2Vec model;
and comparing the output value of the Word2Vec model with the currently processed first vocabulary, iterating the Word2Vec model according to the comparison result until the comparison result satisfies a preset condition, and taking the weights of the Word2Vec model as the word vector of the first vocabulary.
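The training loop of claim 6 matches standard CBOW/skip-gram training; with gensim (an illustrative library choice, not named in the application, with toy data) the equivalent is roughly:

    from gensim.models import Word2Vec

    # second word sequence(s): segmented sentences in which each first
    # vocabulary is kept as a single token
    sentences = [["the", "nucleotide_sequence", "was", "aligned"],
                 ["each", "gene_sequence", "encodes", "a", "protein"]]

    model = Word2Vec(
        sentences,
        vector_size=100,  # dimensionality of the word vectors
        window=5,         # predetermined number of context words per side
        min_count=1,
        sg=0,             # CBOW: predict the current word from its context
    )
    vec = model.wv["nucleotide_sequence"]  # word vector of a first vocabulary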
7. The method of constructing a knowledge graph as claimed in claim 4 or 5, wherein determining nodes of the knowledge graph based on the first vocabulary contained in the first lexicon comprises:
for each first vocabulary, respectively establishing a corresponding node in the knowledge graph;
and the establishing of edges among the nodes in the knowledge graph according to the relations among the word vectors of the first vocabularies comprises:
calculating the characteristic distance between every two first vocabularies;
and for two first vocabularies whose characteristic distance is smaller than a preset distance threshold, establishing an edge between the nodes corresponding to the two first vocabularies in the knowledge graph.
8. The method of constructing a knowledge graph according to claim 7, wherein the method further comprises:
determining the number of existing edges of the nodes corresponding to the two first vocabularies;
and establishing the edge between the nodes corresponding to the two first vocabularies in the knowledge graph under the condition that the number of existing edges does not exceed an upper limit on the number of edges.
9. An apparatus for constructing a knowledge graph, comprising:
the node determining module is used for determining nodes of the knowledge graph according to a first vocabulary contained in a first lexicon;
the word vector obtaining module is used for obtaining word vectors of the first vocabulary based on a word vector model; the word vector model is obtained through pre-training in an unsupervised training mode;
and the relation determining module is used for determining, according to the relations among the word vectors of the first vocabulary, the nodes between which edges are to be established in the knowledge graph, and establishing edges for the determined nodes.
10. An electronic device, comprising: a memory and a processor, wherein,
the memory is used for storing a program for constructing a knowledge graph;
the processor is used for reading and executing the program for constructing the knowledge graph in the memory to perform the method for constructing the knowledge graph according to any one of claims 1-8.
11. A computer storage medium in which a program for constructing a knowledge graph is stored, wherein the program, when executed by a processor, implements the method for constructing a knowledge graph according to any one of claims 1 to 8.
CN202110736340.0A 2021-06-30 2021-06-30 Method and device for constructing knowledge graph, electronic equipment and storage medium Pending CN113449119A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110736340.0A CN113449119A (en) 2021-06-30 2021-06-30 Method and device for constructing knowledge graph, electronic equipment and storage medium

Publications (1)

Publication Number Publication Date
CN113449119A true CN113449119A (en) 2021-09-28

Family

ID=77814702

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110736340.0A Pending CN113449119A (en) 2021-06-30 2021-06-30 Method and device for constructing knowledge graph, electronic equipment and storage medium

Country Status (1)

Country Link
CN (1) CN113449119A (en)


Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107436942A (en) * 2017-07-28 2017-12-05 广州市香港科大霍英东研究院 Word embedding grammar, system, terminal device and storage medium based on social media
CN109753664A (en) * 2019-01-21 2019-05-14 广州大学 A kind of concept extraction method, terminal device and the storage medium of domain-oriented
CN110083825A (en) * 2019-03-21 2019-08-02 昆明理工大学 A kind of Laotian sentiment analysis method based on GRU model
CN110322962A (en) * 2019-07-03 2019-10-11 重庆邮电大学 A kind of method automatically generating diagnostic result, system and computer equipment
WO2021000676A1 (en) * 2019-07-03 2021-01-07 平安科技(深圳)有限公司 Q&a method, q&a device, computer equipment and storage medium
CN112948570A (en) * 2019-12-11 2021-06-11 复旦大学 Unsupervised automatic domain knowledge map construction system
CN112685396A (en) * 2020-12-30 2021-04-20 平安普惠企业管理有限公司 Financial data violation detection method and device, computer equipment and storage medium

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117251685A (en) * 2023-11-20 2023-12-19 中电科大数据研究院有限公司 Knowledge graph-based standardized government affair data construction method and device
CN117251685B (en) * 2023-11-20 2024-01-26 中电科大数据研究院有限公司 Knowledge graph-based standardized government affair data construction method and device

Similar Documents

Publication Publication Date Title
US9195646B2 (en) Training data generation apparatus, characteristic expression extraction system, training data generation method, and computer-readable storage medium
CN110969020B (en) CNN and attention mechanism-based Chinese named entity identification method, system and medium
CN110196982B (en) Method and device for extracting upper-lower relation and computer equipment
CN112364150A (en) Intelligent question and answer method and system combining retrieval and generation
CN113535974B (en) Diagnostic recommendation method and related device, electronic equipment and storage medium
CN112131876A (en) Method and system for determining standard problem based on similarity
CN110347791B (en) Topic recommendation method based on multi-label classification convolutional neural network
JP2007042097A (en) Key character extraction program, key character extraction device, key character extraction method, collective place name recognition program, collective place name recognition device and collective place name recognition method
CN112015898B (en) Model training and text label determining method and device based on label tree
CN112417097A (en) Multi-modal data feature extraction and association method for public opinion analysis
CN114780691A (en) Model pre-training and natural language processing method, device, equipment and storage medium
CN113673223A (en) Keyword extraction method and system based on semantic similarity
CN115713072A (en) Relation category inference system and method based on prompt learning and context awareness
CN111274494B (en) Composite label recommendation method combining deep learning and collaborative filtering technology
CN113449119A (en) Method and device for constructing knowledge graph, electronic equipment and storage medium
JPH11328317A (en) Method and device for correcting japanese character recognition error and recording medium with error correcting program recorded
CN112711944B (en) Word segmentation method and system, and word segmentation device generation method and system
CN111563147B (en) Entity linking method and device in knowledge question-answering system
CN113761151A (en) Synonym mining method, synonym mining device, synonym question answering method, synonym question answering device, computer equipment and storage medium
CN115309899B (en) Method and system for identifying and storing specific content in text
Carbonnel et al. Lexicon organization and string edit distance learning for lexical post-processing in handwriting recognition
CN111428482B (en) Information identification method and device
CN114398482A (en) Dictionary construction method and device, electronic equipment and storage medium
CN114036946B (en) Text feature extraction and auxiliary retrieval system and method
CN110399501B (en) Geological field literature map generation method based on language statistical model

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination