Disclosure of Invention
The invention provides a domain knowledge pushing method based on a knowledge graph.
The technical scheme for realizing the purpose of the invention is as follows: a domain knowledge pushing method based on a knowledge graph comprises the following steps:
step 1, constructing a text knowledge base, wherein the text knowledge base is composed of domain knowledge texts;
step 2, performing semantic analysis and topic modeling on the knowledge base texts;
step 3, obtaining semantic distribution vectors of knowledge graph nodes by performing graph embedding on the domain knowledge graph;
step 4, establishing a task context feature vector according to the user task description and the task topic;
step 5, performing entity alignment between the domain entities in the user task description text and the domain knowledge graph of step 3, performing feature expansion based on graph node paths and graph node semantic distribution features, and performing task-associated knowledge recall;
step 6, performing text similarity calculation between the recalled texts obtained in step 5 and the user task to obtain recalled text scores;
step 7, pushing the ranked texts to the user according to the scores;
step 8, terminating the pushing if the user task is finished; repeating steps 4 to 7 when the user's scene and state change.
Preferably, the construction method of the text knowledge base comprises the following steps: determining the knowledge range according to the domain task requirements and screening the content; segmenting the text into sentences and filtering out stop words; and constructing the final text set into the text knowledge base.
Preferably, the specific method for performing semantic analysis on the knowledge base text is as follows:
segmenting the knowledge texts into words, and training the texts with the unsupervised Word2Vec word embedding algorithm to obtain semantic distribution vectors of the words;
and calculating the semantic vector of a text sentence by a method based on the weighted sum of its word vectors.
Preferably, the specific method for text topic modeling is as follows:
performing word segmentation on the texts in the knowledge base, performing word frequency statistics on the knowledge base sentences according to the word segmentation results, and filtering out words whose frequency is lower than a preset threshold;
performing character processing on the sentences to obtain a bigram dictionary of the knowledge base texts and to construct a mapping table from each text to its corresponding bag-of-words vector;
and acquiring the bag-of-words vectors of the knowledge base texts through the mapping table, and training with the bag-of-words vectors as the input of the LDA algorithm to obtain the topic distribution vectors of the knowledge base texts.
Preferably, the specific method for obtaining the semantic distribution vectors of the knowledge graph nodes is as follows:
step 3.1, constructing a domain knowledge graph, which includes the two tasks of named entity recognition and relation extraction; the domain knowledge entities and the relations between them are obtained by supervised learning with a BERT-based pre-trained model;
and step 3.2, obtaining the semantic distribution vectors of the graph nodes by learning the node topology of the domain knowledge graph with a graph convolutional neural network.
Preferably, the specific method for establishing the task context features is as follows:
step 4.1, performing word segmentation on the user task description text, and vectorizing the task description with the word vectors trained in step 2 to serve as the semantic feature of the user task;
and step 4.2, extracting the entities in the user task topic, and obtaining the entity representation vectors associated with the operation and inspection task from the knowledge graph node semantic distribution vectors trained in step 3, to serve as a classification feature of the user task.
Preferably, the specific steps of aligning the domain entities in the user task description text with the domain knowledge graph of step 3, performing feature expansion based on graph node paths and graph node semantic distribution features, and performing task-associated knowledge recall are as follows:
step 5.1, acquiring the task description and the task-associated system components according to the user task entities, and performing entity alignment on the knowledge graph to obtain the sub-graph corresponding to the task entities;
step 5.2, calculating the embedding vectors of the sub-graph entities obtained in step 5.1, and obtaining the word embedding vectors of the entity nodes on each path within three hops of the sub-graph;
step 5.3, performing key path expansion on the entity nodes of each path of the sub-graph;
and step 5.4, filtering the knowledge base texts by taking the user task context features of step 4, the graph embedding vector of the task entity, and the embedding vector of the combined sub-graph nodes as the primary recall condition, to obtain coarse-precision recall texts of the task-associated node knowledge.
Preferably, the specific method for performing text similarity calculation between the recalled texts obtained in step 5 and the user task to obtain the recalled text scores is as follows:
step 6.1, respectively calculating the topic distribution vectors of the recalled text and the user task according to the topic model of the text knowledge base obtained in step 2;
step 6.2, performing word-level similarity calculation between the recalled text and the task description according to the word mover's distance (WMD) algorithm to obtain a WMD similarity score of the recalled text;
step 6.3, calculating the similarity according to the cosine formula of the vector space to obtain a topic similarity score of the recalled text;
and step 6.4, calculating the final score based on a weighted voting strategy, where the WMD weight and the topic similarity weight are adjusted according to the task.
Compared with the prior art, the invention has the following remarkable advantages:
(1) the method is based on the domain knowledge graph, overcomes the Matthew effect of recommendation systems through rich domain entity association knowledge, and expands the diversity of the pushed knowledge according to the associated knowledge;
(2) the method models the scene and the user task, captures the attributes and characteristics of the task more effectively, enhances the ability to distinguish the knowledge texts associated with a specific task, and improves the accuracy of text knowledge pushing;
(3) the method is based on semantic feature calculation, is highly interpretable, and can flexibly adapt to diverse scenes and tasks by replacing the feature model and the similarity calculation method;
(4) the invention adopts an unsupervised method and achieves good knowledge recommendation performance and accuracy even on large-scale domain knowledge;
(5) the method has good portability and can be extended to other fields with similar scene and task requirements to provide knowledge pushing services.
The present invention is described in further detail below with reference to the attached drawings.
Detailed Description
A domain knowledge pushing method based on a knowledge graph includes the following steps:
step 1, constructing a text knowledge base, wherein the text knowledge base is composed of domain knowledge texts;
specifically, the construction method of the text knowledge base is as follows: the knowledge range is determined according to the domain task requirements and the content is screened. The text is segmented into sentences and stop words are filtered out, where the stop words are mainly provided by domain experts. The final text set is constructed into the text knowledge base.
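The following is a minimal sketch of this preprocessing step, assuming the raw documents and an expert-provided stop-word list are already available; the sentence delimiters and helper names are illustrative only and not part of the claimed method.

import re

def build_text_knowledge_base(raw_documents, stop_words):
    """Split documents into sentences, drop stop words, and collect the cleaned sentences."""
    knowledge_base = []
    for doc in raw_documents:
        # Split on common sentence delimiters (Chinese and English).
        for sent in re.split(r"[。！？.!?\n]+", doc):
            tokens = [t for t in sent.split() if t and t.lower() not in stop_words]
            if tokens:
                knowledge_base.append(" ".join(tokens))
    return knowledge_base

docs = ["The transformer bushing insulation shall be inspected periodically. Record the oil temperature."]
stops = {"the", "shall", "be"}
print(build_text_knowledge_base(docs, stops))
# ['transformer bushing insulation inspected periodically', 'Record oil temperature']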
Step 2, performing semantic analysis and text topic modeling on the knowledge base text;
in one embodiment, the specific method for performing semantic analysis on the knowledge base text is as follows:
the method comprises the steps of segmenting a knowledge text, and training the text by adopting an unsupervised WORD2VEC WORD embedding algorithm to obtain semantic distribution vectors of WORDs, namely WORD vectors. In the aspect of semantic vector representation of text sentences, a method based on word vector weighted sum is adopted for calculation. Specifically, a higher weight is given to a vocabulary having a high degree of matching with the task description text, and a lower weight is given to an irrelevant vocabulary. Here the degree of match is measured in terms of the number of string matches.
In one embodiment, the specific method for text topic modeling is as follows:
the method comprises the steps of segmenting words of texts in a knowledge base, carrying out word frequency statistics on text sentences in the knowledge base according to word segmentation results, and carrying out word filtering on the texts with the word frequency lower than a preset threshold value.
And performing character processing on the sentence to obtain a BIGRAM dictionary of the knowledge base text and using the BIGRAM dictionary to construct a mapping table from the text to the corresponding bag-of-words vector. And finally, acquiring a bag-of-words vector of the knowledge base text through a mapping table, and training the bag-of-words vector as LDA algorithm input to acquire a theme distribution vector of the knowledge base text.
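A minimal sketch of this topic-modelling step is shown below, assuming gensim; the bigram pass, the frequency threshold, and the number of topics are illustrative parameter choices rather than values prescribed by the method.

from gensim.models import Phrases, LdaModel
from gensim.corpora import Dictionary

tokenized = [["transformer", "oil", "temperature", "alarm"],
             ["gas", "relay", "light", "gas", "alarm"],
             ["bushing", "insulation", "test", "report"]]

# Merge frequent word pairs into bigram tokens.
bigram = Phrases(tokenized, min_count=1, threshold=1)
texts = [bigram[doc] for doc in tokenized]

# The dictionary acts as the text -> bag-of-words mapping table; raise no_below to drop rare words.
dictionary = Dictionary(texts)
dictionary.filter_extremes(no_below=1)
corpus_bow = [dictionary.doc2bow(doc) for doc in texts]

lda = LdaModel(corpus=corpus_bow, id2word=dictionary, num_topics=5, passes=10, random_state=0)

# Topic distribution vector of one knowledge-base sentence.
print(lda.get_document_topics(corpus_bow[0], minimum_probability=0.0))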
Step 3, obtaining semantic distribution vectors of knowledge graph nodes by performing graph embedding on the domain knowledge graph;
in one embodiment, the specific method for obtaining the semantic distribution vector of the knowledge graph node is as follows:
and 3.1, constructing a domain knowledge graph, mainly comprising two tasks of named entity identification and relationship extraction, wherein a BERT-based pre-training model is adopted for supervised learning to obtain the relationship between the domain knowledge entity and the entity. The constructed power knowledge graph is mainly stored in a form of a triplet, such as < transformer, component and bushing >, and the construction process is respectively shown in fig. 2, 3 and 4.
Step 3.2, the semantic distribution vectors of the graph nodes are obtained. Graph embedding is a technique for representing knowledge graph nodes as semantic distribution vectors and can be obtained by algorithms such as random walk. This embodiment adopts a GCN-based graph neural network for graph node embedding representation learning. Specifically, the node topology of the domain knowledge graph is learned through a graph convolutional neural network, i.e., the attributes and connection-relation semantics of the graph nodes are mapped to a low-dimensional space through the neural network, yielding the semantic distribution vectors of the nodes. Adding node attribute information during training effectively improves the learning effect on the node classification task.
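A minimal numpy sketch of the GCN-style propagation behind this step is given below; the toy graph, the random (untrained) weights, and the layer sizes are assumptions for illustration, whereas a real run would learn the weights on a node-classification objective using the node attributes.

import numpy as np

def gcn_layer(adj, features, weight):
    """One graph-convolution layer: ReLU(D^-1/2 (A + I) D^-1/2 · X · W)."""
    a_hat = adj + np.eye(adj.shape[0])                  # add self-loops
    d_inv_sqrt = np.diag(1.0 / np.sqrt(a_hat.sum(axis=1)))
    a_norm = d_inv_sqrt @ a_hat @ d_inv_sqrt            # symmetric normalization
    return np.maximum(a_norm @ features @ weight, 0.0)  # ReLU

# Toy graph: transformer -- bushing -- insulation, transformer -- gas relay.
adj = np.array([[0, 1, 0, 1],
                [1, 0, 1, 0],
                [0, 1, 0, 0],
                [1, 0, 0, 0]], dtype=float)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))                   # initial node attribute features
w1, w2 = rng.normal(size=(8, 16)), rng.normal(size=(16, 4))

h = gcn_layer(adj, x, w1)
node_embeddings = gcn_layer(adj, h, w2)       # semantic distribution vectors of the nodes
print(node_embeddings.shape)                  # (4, 4)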
Step 4, establishing a task context feature vector according to the user task description and the task theme;
in one embodiment, the specific method for establishing the task context features includes:
step 4.1, performing word segmentation processing on the user task description text, and performing vectorization representation of task description by using the word vector trained in the step 2 to serve as a semantic feature of the user task;
and step 4.2, extracting the entities in the user task topic, and obtaining the entity representation vectors associated with the operation and inspection task from the knowledge graph node semantic distribution vectors trained in step 3, to serve as a classification feature of the user task.
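A minimal sketch of combining the two features into a task context vector is shown below; concatenating the task description embedding with the mean of the task entity embeddings is an illustrative combination choice, and the vector dimensions are placeholders.

import numpy as np

def task_context_vector(task_semantic_vec, task_entity_embeddings):
    """Concatenate the task description embedding with the mean of its entity embeddings."""
    entity_feature = np.mean(task_entity_embeddings, axis=0)
    return np.concatenate([task_semantic_vec, entity_feature])

# Illustrative shapes: a 100-d sentence vector and two 4-d graph node embeddings.
ctx = task_context_vector(np.zeros(100), [np.ones(4), np.zeros(4)])
print(ctx.shape)   # (104,)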
Step 5, performing entity alignment between the domain entities in the user task description text and the domain knowledge graph of step 3, performing feature expansion based on graph node paths and graph node semantic distribution features, and performing task-associated knowledge recall;
in a further embodiment, the method comprises the following specific steps:
step 5.1, acquiring the task description and the task-associated system components according to the user task entities, and performing entity alignment on the knowledge graph to obtain the sub-graph corresponding to the task entities;
step 5.2, calculating the embedding vectors of the sub-graph entities obtained in step 5.1, and obtaining the word embedding vectors of the entity nodes on each path within three hops of the sub-graph;
step 5.3, performing key path expansion on the entity nodes of each path of the sub-graph, i.e., combining the nodes on the paths within three hops to obtain combined-feature sentence embedding vectors, where the combination uses a sum-average method;
and step 5.4, filtering the knowledge base texts by taking the user task context features of step 4, the graph embedding vector of the task entity, and the embedding vector of the combined sub-graph nodes as the primary recall condition, to obtain coarse-precision recall texts of the task-associated node knowledge.
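A minimal sketch of steps 5.1 to 5.4 is given below, assuming networkx for the knowledge graph, a dictionary of node embeddings from step 3, and precomputed sentence vectors for the knowledge-base texts; the cosine measure and the threshold value are illustrative choices for the coarse filtering.

import numpy as np
import networkx as nx

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def recall_texts(graph, task_entity, node_embeddings, kb_vectors, threshold=0.0):
    # 5.1/5.2: nodes reachable within three hops of the aligned task entity.
    hop_nodes = nx.single_source_shortest_path_length(graph, task_entity, cutoff=3)
    # 5.3: key-path expansion by sum-averaging the node embeddings on the paths.
    path_vec = np.mean([node_embeddings[n] for n in hop_nodes], axis=0)
    # 5.4: coarse filtering of knowledge-base texts against the combined vector.
    return [i for i, v in enumerate(kb_vectors) if cosine(path_vec, v) >= threshold]

G = nx.Graph([("transformer", "gas relay"), ("gas relay", "light gas alarm"),
              ("transformer", "bushing")])
rng = np.random.default_rng(0)
emb = {n: rng.normal(size=8) for n in G.nodes}
kb = [rng.normal(size=8) for _ in range(5)]
print(recall_texts(G, "transformer", emb, kb))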
Step 6, performing text similarity calculation between the recalled texts obtained in step 5 and the user task to obtain the recalled text scores;
in a further embodiment, the method comprises the following specific steps:
step 6.1, respectively calculating the topic distribution vectors of the recalled text and the user task according to the topic model of the text knowledge base obtained in step 2;
step 6.2, performing word-level similarity calculation between the recalled text and the task description according to the word mover's distance (WMD) algorithm to obtain a WMD similarity score of the recalled text;
step 6.3, obtaining the topic similarity between the recalled text and the user task, i.e., calculating the similarity according to the cosine formula of the vector space to obtain a topic similarity score of the recalled text;
and step 6.4, calculating the final score based on a weighted voting strategy, where the WMD weight and the topic similarity weight are adjusted according to the task. The voting result is the candidate document score.
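A minimal sketch of the weighted-voting score is shown below; converting the word mover's distance into a similarity via 1/(1 + d) and the default weights 0.6/0.4 are illustrative assumptions, with the weights meant to be tuned per task as described.

import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def recall_score(wmd_distance, topic_vec_text, topic_vec_task, w_wmd=0.6, w_topic=0.4):
    wmd_similarity = 1.0 / (1.0 + wmd_distance)          # smaller distance -> larger similarity
    topic_similarity = cosine(topic_vec_text, topic_vec_task)
    return w_wmd * wmd_similarity + w_topic * topic_similarity

print(recall_score(0.8, np.array([0.7, 0.2, 0.1]), np.array([0.6, 0.3, 0.1])))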
Step 7, pushing the sorted texts to a user according to the scores;
step 8, if the user task is finished, the pushing is terminated; steps 4 to 7 are repeated when the user's scene and state change.
The invention mainly accomplishes domain knowledge pushing from the following aspects:
1) Domain knowledge mining: in service system maintenance work, the user's work content and scene often need to be recorded and analyzed, but this information is usually scattered, weakly related, and sparse in features. Rich domain knowledge reserves are therefore needed for the problems the user has to solve. A knowledge graph is a structured knowledge representation with rich associations, formed by mining entities and relations from unstructured texts in a vertical domain, and thus meets the requirements of knowledge storage and mining.
2) Task feature modeling: the user's operation and maintenance tasks must follow certain specifications and procedures. Compared with traditional pushing systems, knowledge pushing takes the user's specific task and task scene as the starting point rather than inferring from user preferences and historical operations; it mines knowledge-association features from massive texts and tasks so that knowledge texts with the same semantic connotation as the task are pushed.
3) Text matching calculation: large-scale text recall is the main content of knowledge pushing, and the recall precision influences the final effect of the subsequent text similarity calculation. In addition, the result of knowledge pushing takes the form of short, high-accuracy knowledge texts, which involves natural language processing techniques.
Examples
A domain knowledge pushing method based on a knowledge graph is shown in fig. 1, and its key steps and implementation are as follows:
step 1, collecting knowledge texts of the power equipment to construct a text knowledge base.
The text knowledge base in the power field is a collection of texts covering the knowledge required by system tasks, and it is the source of the auxiliary knowledge pushed for substation operation and inspection tasks. The sources of the knowledge base mainly include authoritative books on power operation and inspection, journal literature related to power operation and inspection, internal documents of electric power research institutes, and a web encyclopedia question-answer knowledge base on power operation and inspection topics.
After the documents are obtained from these knowledge sources, the knowledge range is determined according to the requirements of the power operation and inspection tasks, mainly covering equipment such as transformers, circuit breakers, secondary non-electric-quantity devices, and protection devices, and the content is then screened. The texts are segmented into sentences and stop words are filtered out, where the stop words are mainly provided by experts in the field of power operation and inspection. The final text set is constructed into the power-domain text corpus.
Step 2, semantic analysis and topic modeling are performed on the power equipment knowledge base texts. This step is implemented as follows:
Step 2.1, the equipment-related texts, including equipment descriptions, equipment operation and inspection task descriptions, and equipment defect descriptions, are segmented into words, and the texts are trained with the unsupervised Word2Vec word embedding algorithm to obtain semantic distribution vectors of the words. The semantic vector of a text sentence is computed as a weighted sum of its word vectors. Specifically, a high weight is given to words that closely match the power equipment operation task description text, and a low weight is given to irrelevant words. Here the degree of match is measured by the number of string matches.
Step 2.2, the knowledge base texts are topic-modeled. The texts in the power equipment text base are segmented into words, word frequency statistics are computed for the knowledge base sentences according to the word segmentation results, and words whose frequency is lower than a preset threshold are filtered out. The sentences are then character-processed to obtain a bigram dictionary of the power equipment knowledge base texts, which is used to construct a mapping table from each text to its corresponding bag-of-words vector. Finally, the bag-of-words vectors of the knowledge base texts are acquired through the mapping table and used as the input of the LDA algorithm for training, yielding the topic distribution vectors of the equipment knowledge base texts.
Semantic analysis here includes both word embedding and sentence embedding. It preserves the semantic information of sentences and computes the similarity between texts at the semantic level, which differs from simple surface-form lexical similarity.
Step 3, graph embedding is performed on the power-field knowledge graph to obtain the semantic distribution vectors of the nodes. This step is implemented as follows:
Step 3.1, a domain knowledge graph is constructed, which mainly comprises the two tasks of named entity recognition and relation extraction; a BERT-based pre-trained model is used for supervised learning to obtain the domain knowledge entities and the relations between them. The constructed power knowledge graph is mainly stored in the form of triples, such as <transformer, component, bushing>, and the construction process is shown in figs. 2, 3 and 4 respectively.
Step 3.2, the semantic distribution vectors of the graph nodes are obtained. Graph embedding is a technique for representing knowledge graph nodes as semantic distribution vectors and can be obtained by algorithms such as random walk. This implementation adopts a GCN-based graph neural network for graph node embedding representation learning. Specifically, the node topology of the domain knowledge graph is learned through a graph convolutional neural network, i.e., the attributes and connection-relation semantics of the graph nodes are mapped to a low-dimensional space through the neural network, yielding the semantic distribution vectors of the nodes. Adding node attribute information during training effectively improves the learning effect on the node classification task.
Step 4, task context features are established from the task description according to the user's power equipment operation and inspection task, taking a light gas alarm task as an example. This step is implemented as follows:
step 4.1, performing word segmentation on the light gas alarm task description text, and vectorizing the task description with the word vectors trained in step 2 to serve as the semantic feature of the light gas alarm task;
step 4.2, extracting the entities in the light gas alarm task topic, and obtaining the entity representation vectors associated with the operation and inspection task from the knowledge graph node semantic distribution vectors trained in step 3, to serve as a classification feature of the light gas alarm task;
Step 5, entity alignment is performed between the power equipment entities and the power-field knowledge graph according to the equipment operation and inspection task text, and feature expansion and task-associated knowledge recall are performed based on graph node paths and graph node semantic distribution features, specifically as follows:
step 5.1, acquiring the task description and the task-associated system component, namely the gas relay, according to the light gas alarm task entity, and performing entity alignment on the power-field knowledge graph to obtain the sub-graph corresponding to the task-associated entity;
step 5.2, acquiring the embedding vectors of the sub-graph entities calculated in step 5.1 and the word embedding vectors of the entity nodes on the adjacent relation paths within three hops of the sub-graph nodes;
step 5.3, performing key path expansion on the sub-graph nodes, i.e., combining adjacent nodes and obtaining combined-feature sentence embeddings, where the combination uses a sum-average method;
and step 5.4, filtering the texts by taking the light gas alarm task context feature vector of step 4, the graph embedding vector of the task entity, and the embedding vector of the combined sub-graph nodes as the primary recall condition, to obtain a coarse-precision recall of the task-associated node knowledge.
Step 6, text similarity calculation is performed between the recalled texts obtained in step 5 and the user task to obtain the recalled text scores; the process is shown in fig. 5. This step is implemented as follows:
step 6.1, respectively calculating the topic distribution vectors of the recalled text and the gas operation and inspection task according to the topic model of the text corpus obtained in step 2;
step 6.2, performing word-level similarity calculation between the candidate text and the task description according to the word mover's distance (WMD) algorithm to obtain a WMD similarity score of the recalled text;
The word mover's distance is a method for measuring the distance between two text documents and is used to judge the similarity between two texts: the larger the WMD, the lower the similarity; the smaller the WMD, the higher the similarity.
Step 6.3, the topic similarity between the recalled text and the light gas alarm task is obtained, i.e., the similarity is calculated according to the cosine formula of the vector space to obtain the topic similarity score of the recalled text;
and step 6.4, the final score is calculated based on a weighted voting strategy, where the WMD weight and the topic similarity weight are adjusted according to the task. The voting result is the candidate document score.
Step 7, the ranked texts are pushed to the user according to the scores. This step is implemented as follows:
The recalled documents are sorted in descending order of the scores obtained in step 6, and a certain number of documents are selected and pushed as needed.
Step 8, if the user task is finished, the pushing is terminated; steps 4 to 7 are repeated when the user's scene and state change.