CN112699246B - Domain knowledge pushing method based on knowledge graph

Info

Publication number: CN112699246B
Application number: CN202011522006.7A
Authority: CN
Inventors: 李蔚清; 颜于升
Original assignee: Nanjing University of Science and Technology
Current assignee: Nanjing University of Science and Technology
Priority date: 2020-12-21
Filing date: 2020-12-21
Publication date: 2022-09-27
Anticipated expiration: 2040-12-21
Also published as: CN112699246A

Abstract

The invention discloses a domain knowledge pushing method based on a knowledge graph, comprising the following steps: collecting domain knowledge texts to construct a text knowledge base; performing semantic analysis and topic modeling on the knowledge base texts; applying graph embedding to a domain knowledge graph to obtain semantic distribution vectors of its nodes; establishing task context features from the user's task description and task topic; aligning the domain entities in the task description with the knowledge graph, expanding features based on graph node paths and graph node semantic distribution features, and recalling task-associated knowledge; calculating the text similarity between each recalled text and the user task text to obtain candidate text scores; and pushing the ranked texts to the user according to the scores. By using the node-associated knowledge of the knowledge graph together with graph embedding, the invention improves the text matching degree and the user experience of domain knowledge pushing.

Description

Domain knowledge pushing method based on knowledge graph

Technical Field

The invention belongs to the field of computer application technology, and particularly relates to a domain knowledge pushing method based on a knowledge graph.

Background

As production scales up and service competition intensifies, large-scale complex business systems keep emerging, and enterprises carry out extensive business knowledge management and accumulate a large body of system management knowledge. Large-scale complex systems inevitably develop defects, so standardized system inspection and defect repair are required frequently. At present, however, on-site maintenance generally relies on the skills and accumulated experience of workers to troubleshoot systems, and there is no effective, practical, intelligent support that helps workers operate in a standardized way, quickly obtain knowledge relevant to a system fault, and quickly update the related data.

As service systems continue to develop, their coverage keeps expanding, their number keeps growing, their network architecture keeps being upgraded, and system maintenance becomes ever more complex. Operation and maintenance personnel are therefore required to follow operating standards during maintenance and to use handling methods that meet the requirements. It is thus necessary to use the domain knowledge accumulated by enterprises to build a systematic, operable operation and maintenance workflow and a knowledge pushing system that guides business operations, so as to improve the quality and efficiency of maintenance work as a whole.

Knowledge push is a technology that, according to a certain protocol, automatically selects from a server specific information that is relevant to or of interest to a user and delivers it to the user periodically in a certain way, so as to reduce the user's learning cost. Knowledge pushing mainly comprises three stages: user data acquisition, data processing, and pushing. Its main idea is that the server actively pushes information of interest to the user according to the user's acquired state and intention, which shortens the time the user spends retrieving information; at the same time, the information is screened according to the user's purpose and interests, which helps the user discover valuable information and improves the accuracy and efficiency of information acquisition. At present, various industries are researching and experimenting with knowledge pushing technology in systems of their own fields. However, most still adopt approaches similar to open-world-oriented knowledge recommendation, such as content-based, collaborative-filtering-based, and model-based recommendation. These classical methods generally use the user behavior collected by a system to model user profiles, and recommend through item feature modeling and user collaborative filtering strategies. Because of limitations in these methods, they suffer from problems such as cold start, and narrowing and stagnation of the pushed content caused by the Matthew effect.

Classical recommendation algorithms are usually designed to serve various products and to recommend information in many forms, including pictures, audio, text, videos, and commodities; they are not well suited to pushing professional knowledge in specific industries or fields.

Disclosure of Invention

The invention provides a domain knowledge pushing method based on a knowledge graph.

The technical scheme for realizing the purpose of the invention is as follows: a domain knowledge pushing method based on knowledge graph comprises the following specific steps:

step 1, constructing a text knowledge base composed of domain knowledge texts;

step 2, performing semantic analysis and topic modeling on the knowledge base texts;

step 3, obtaining semantic distribution vectors of knowledge points by applying graph embedding to the domain knowledge graph;

step 4, establishing a task context feature vector according to the user task description and the task topic;

step 5, aligning the domain entities in the user task description text with the domain knowledge graph of step 3, performing feature expansion based on graph node paths and graph node semantic distribution features, and recalling task-associated knowledge;

step 6, calculating the text similarity between the recalled texts obtained in step 5 and the user task to obtain recalled text scores;

step 7, pushing the ranked texts to the user according to the scores;

step 8, terminating the pushing if the user task is finished, and repeating steps 4 to 7 when the user's scene and state change.

Preferably, the text knowledge base is constructed as follows: determining the knowledge range according to the domain task requirements and screening the content; splitting the texts into sentences and filtering out stop words; and building the final text set into the text knowledge base.

Preferably, the specific method for performing semantic analysis on the knowledge base text is as follows:

segmenting the knowledge texts into words, and training on the texts with the unsupervised Word2Vec word embedding algorithm to obtain semantic distribution vectors of the words;

and calculating the semantic vector of the text sentence by adopting a method based on the word vector weighted sum.

Preferably, the specific method for text topic modeling is as follows:

segmenting the texts in the knowledge base into words, counting word frequencies over the text sentences in the knowledge base according to the segmentation results, and filtering out words whose frequency is lower than a preset threshold;

performing character processing on the sentences to obtain a bigram dictionary of the knowledge base texts, and using it to construct a mapping table from each text to its bag-of-words vector;

and obtaining the bag-of-words vectors of the knowledge base texts through the mapping table, and training the LDA algorithm with the bag-of-words vectors as input to obtain the topic distribution vectors of the knowledge base texts.

Preferably, the specific method for obtaining the semantic distribution vector of the knowledge graph nodes is as follows:

step 3.1, constructing a domain knowledge graph, which includes the two tasks of named entity recognition and relation extraction, with supervised learning on a BERT-based pre-trained model used to obtain the domain knowledge entities and the relations between them;

step 3.2, obtaining the graph node semantic distribution vectors: learning the node topology of the domain knowledge graph through a graph convolutional neural network to obtain the semantic distribution vector of each node.

Preferably, the specific method for establishing the task context characteristics is as follows:

step 4.1, performing word segmentation on the user task description text, and performing vectorization representation of task description by using the word vector trained in the step 2 to serve as a semantic feature of the user task;

step 4.2, extracting the entities in the user task topic, and using the knowledge graph node semantic distribution vectors trained in step 3 to obtain the entity representation vectors associated with the operation and inspection task, as a classification feature of the user task.

Preferably, the specific steps of aligning the domain entities in the user task description text with the domain knowledge graph in step 3, performing feature expansion based on graph node paths and graph node semantic distribution features, and performing task-associated knowledge recall include:

step 5.1, obtaining the task description and the system components associated with the task according to the user task entity, and performing entity alignment on the knowledge graph to obtain the subgraph corresponding to the task entity on the graph;

step 5.2, calculating the embedding vectors of the entities of the subgraph in step 5.1, and obtaining the word embedding vectors of the entity nodes on each path within three hops of the subgraph;

step 5.3, performing key path expansion on the entity nodes of each path of the subgraph;

step 5.4, filtering the knowledge base texts using the user task context features of step 4, the graph embedding vector of the task entity, and the embedding vectors of the subgraph node combinations as the preliminary recall condition, to obtain coarse-precision recalled texts for the knowledge of the task-associated nodes.

Preferably, the text similarity calculation is performed between the recall text obtained in step 5 and the user task, and the specific method for obtaining the score of the recall text is as follows:

step 6.1, calculating the topic distribution vectors of the recalled text and of the user task according to the topic model of the text knowledge base obtained in step 2;

step 6.2, performing word-level similarity calculation between the recalled text and the task description according to the Word Mover's Distance (WMD) algorithm to obtain the WMD similarity score of the recalled text;

step 6.3, calculating the similarity according to the cosine formula of the vector space to obtain the topic similarity score of the recalled text;

step 6.4, calculating the score based on a weighted voting strategy, the WMD weight and the topic similarity weight being adjusted according to the task.

Compared with the prior art, the invention has the following remarkable advantages:

(1) the method is based on a domain knowledge graph, overcomes the Matthew effect of recommendation systems through rich knowledge associated with domain entities, and expands the diversity of the pushed knowledge according to the associated knowledge;

(2) according to the method, the modeling is carried out based on the scene and the user task, the attributes and the characteristics of the task are captured more effectively, the distinguishing capability of the specific task associated knowledge text is enhanced, and the accuracy of text knowledge pushing is improved;

(3) the method is based on semantic feature calculation, has strong interpretability, and can flexibly adapt to diversified scenes and tasks by replacing a feature model and a similarity calculation method;

(4) the invention adopts an unsupervised method, and can obtain better performance and accuracy of knowledge recommendation even in large-scale domain knowledge;

(5) the method has good portability, can be popularized to various fields with similar scene and task requirements, and provides knowledge push service.

The present invention is described in further detail below with reference to the attached drawings.

Drawings

Fig. 1 is a flowchart of a domain knowledge push method based on a knowledge graph.

FIG. 2 is a named entity recognition flow diagram.

FIG. 3 is a flow chart of entity relationship extraction.

FIG. 4 is a schematic view of a knowledge-graph structure.

Fig. 5 is a text similarity calculation flowchart.

Detailed Description

A domain knowledge pushing method based on knowledge graph includes the following steps:

Specifically, the text knowledge base is constructed as follows: the knowledge range is determined according to the domain task requirements and the content is screened; the texts are split into sentences and stop words are filtered out, the stop words being provided mainly by domain experts; and the final text set is built into the text knowledge base.
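The construction of the text knowledge base can be illustrated with a minimal sketch. It assumes whitespace-separated English text and a hypothetical stop-word set `STOP_WORDS`; the function names are illustrative and do not come from the patent, and a Chinese corpus would need a proper word segmenter in place of `str.split`.

```python
import re

# Hypothetical stop-word list; the text says domain experts supply it.
STOP_WORDS = {"the", "a", "of", "is", "and"}

def split_sentences(text):
    """Split raw text into sentences on common Latin and CJK end marks."""
    parts = re.split(r"[.!?。！？]+", text)
    return [p.strip() for p in parts if p.strip()]

def filter_stop_words(sentence, stop_words=STOP_WORDS):
    """Whitespace-tokenize a sentence and drop stop words (case-insensitive)."""
    return [tok for tok in sentence.lower().split() if tok not in stop_words]

def build_knowledge_base(documents):
    """Return one token list per sentence, skipping sentences left empty."""
    kb = []
    for doc in documents:
        for sent in split_sentences(doc):
            tokens = filter_stop_words(sent)
            if tokens:
                kb.append(tokens)
    return kb

docs = ["The bushing is a component of the transformer. Check the gas relay!"]
knowledge_base = build_knowledge_base(docs)
```

Each entry of `knowledge_base` is the filtered token list of one sentence, which is the unit the later embedding and topic modeling steps operate on.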

Step 2, performing semantic analysis and text topic modeling on the knowledge base text;

in one embodiment, the specific method for performing semantic analysis on the knowledge base text is as follows:

The knowledge texts are segmented into words, and the texts are trained on with the unsupervised Word2Vec word embedding algorithm to obtain the semantic distribution vectors of the words, namely the word vectors. The semantic vector of a text sentence is calculated as a weighted sum of its word vectors. Specifically, words that match the task description text closely are given a higher weight, and irrelevant words a lower weight. The degree of matching is measured by the number of string matches.
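The weighted sentence vector described above can be sketched as follows. The exact weighting formula is not given in the text, so this sketch assumes a base weight plus the number of exact matches of the word in the task description, and it divides by the total weight so that sentences of different lengths stay comparable. The toy vectors stand in for trained Word2Vec output; all names are illustrative.

```python
def sentence_vector(tokens, word_vectors, task_tokens, base_weight=1.0):
    """Weighted average of word vectors; weight grows with task-text matches."""
    dim = len(next(iter(word_vectors.values())))
    total = [0.0] * dim
    weight_sum = 0.0
    for tok in tokens:
        if tok not in word_vectors:
            continue  # skip out-of-vocabulary words
        # Assumed weighting: base weight plus the number of exact matches
        # of this word in the task description.
        w = base_weight + task_tokens.count(tok)
        weight_sum += w
        for i, x in enumerate(word_vectors[tok]):
            total[i] += w * x
    if weight_sum == 0.0:
        return total  # no known words: return the zero vector
    return [x / weight_sum for x in total]

# Toy 2-dimensional vectors standing in for trained Word2Vec output.
toy_vectors = {"gas": [1.0, 0.0], "relay": [0.0, 1.0], "oil": [1.0, 1.0]}
task = ["gas", "alarm", "gas", "relay"]
vec = sentence_vector(["gas", "oil"], toy_vectors, task)
```

Here "gas" occurs twice in the task description and receives weight 3, while "oil" does not occur and keeps the base weight 1, so the sentence vector is pulled toward the task-relevant word.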

In one embodiment, the specific method for text topic modeling is as follows:

The texts in the knowledge base are segmented into words, word frequencies are counted over the text sentences in the knowledge base according to the segmentation results, and words whose frequency is lower than a preset threshold are filtered out.

Character processing is then performed on the sentences to obtain a bigram dictionary of the knowledge base texts, which is used to construct a mapping table from each text to its bag-of-words vector. Finally, the bag-of-words vectors of the knowledge base texts are obtained through the mapping table and used as input to train the LDA algorithm, yielding the topic distribution vectors of the knowledge base texts.
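A minimal sketch of the bigram dictionary and the text-to-bag-of-words mapping follows. It assumes that bigrams are formed over adjacent segmented tokens (the text does not say whether they are word or character bigrams) and represents each bag-of-words vector sparsely as `(id, count)` pairs. LDA training itself is left to a topic-modeling library and is not shown; all names are illustrative.

```python
from collections import Counter

def bigrams(tokens):
    """Adjacent token pairs of a sentence, joined with an underscore."""
    return [f"{a}_{b}" for a, b in zip(tokens, tokens[1:])]

def build_dictionary(corpus, min_freq=1):
    """Map each bigram that occurs at least min_freq times to an integer id."""
    freq = Counter(bg for tokens in corpus for bg in bigrams(tokens))
    kept = sorted(bg for bg, n in freq.items() if n >= min_freq)
    return {bg: idx for idx, bg in enumerate(kept)}

def to_bow(tokens, dictionary):
    """Sparse bag-of-words vector as sorted (id, count) pairs."""
    counts = Counter(dictionary[bg] for bg in bigrams(tokens) if bg in dictionary)
    return sorted(counts.items())

corpus = [
    ["gas", "relay", "alarm"],
    ["gas", "relay", "fault"],
]
dictionary = build_dictionary(corpus, min_freq=1)
bow = [to_bow(tokens, dictionary) for tokens in corpus]
```

Raising `min_freq` plays the role of the preset frequency threshold: with `min_freq=2` only the bigram shared by both sentences survives.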

in one embodiment, the specific method for obtaining the semantic distribution vector of the knowledge graph node is as follows:

step 3.1, constructing a domain knowledge graph, which mainly includes the two tasks of named entity recognition and relation extraction; supervised learning on a BERT pre-trained model is used to obtain the domain knowledge entities and the relations between them. The constructed power knowledge graph is stored mainly as triples, such as <transformer, component, bushing>, and the construction process is shown in Figs. 2, 3 and 4 respectively.

step 3.2, obtaining the semantic distribution vectors of the graph nodes. Graph embedding is a technique for representing knowledge graph nodes as semantic distribution vectors, which can be obtained by algorithms such as random walks. This embodiment uses a GCN-based graph neural network to learn the embedded representations of the graph nodes. Specifically, the node topology of the domain knowledge graph is learned through a graph convolutional neural network; that is, the attributes and connection semantics of the graph nodes are mapped to a low-dimensional space by the neural network, yielding the semantic distribution vector of each node. Adding node attribute information during training can effectively improve the learning effect on the node classification task.
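How a graph convolution mixes node attributes with graph topology can be shown with a single forward pass. The text does not specify the GCN variant, so this sketch assumes the widely used symmetric-normalized propagation rule with self-loops, ReLU(D^-1/2 (A + I) D^-1/2 H W), on dense Python lists. Training the weight matrix is out of scope, and all names are illustrative.

```python
import math

def matmul(a, b):
    """Dense matrix product of two list-of-lists matrices."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def gcn_layer(adj, features, weight):
    """One graph-convolution forward pass with self-loops and ReLU."""
    n = len(adj)
    # Add self-loops so each node keeps part of its own features.
    a_hat = [[adj[i][j] + (1.0 if i == j else 0.0) for j in range(n)]
             for i in range(n)]
    deg = [sum(row) for row in a_hat]
    # Symmetric normalization: a_hat[i][j] / sqrt(deg[i] * deg[j]).
    norm = [[a_hat[i][j] / math.sqrt(deg[i] * deg[j]) for j in range(n)]
            for i in range(n)]
    out = matmul(matmul(norm, features), weight)
    return [[max(0.0, x) for x in row] for row in out]  # ReLU

# Two connected nodes with one-hot attributes and an identity weight matrix.
adj = [[0.0, 1.0], [1.0, 0.0]]
features = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
embeddings = gcn_layer(adj, features, identity)
```

After one layer each of the two nodes carries an equal mix of its own and its neighbor's attributes, which is the sense in which connected nodes end up close in the embedding space.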

in one embodiment, the specific method for establishing the task context features includes:

step 4.1, performing word segmentation processing on the user task description text, and performing vectorization representation of task description by using the word vector trained in the step 2 to serve as a semantic feature of the user task;

in a further embodiment, the method comprises the following specific steps:

step 5.3, performing key path expansion on the entity nodes of each path of the subgraph, namely combining the nodes on each path within three hops to obtain a sentence embedding vector of the combined features, the combination being done by summing and averaging;
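The three-hop key path expansion can be sketched as a depth-first enumeration of simple paths followed by a sum-and-average of the node vectors on each path. The adjacency-list layout, the restriction to simple paths (no node revisited), and the toy graph are assumptions made for illustration; they are not prescribed by the text.

```python
def paths_within_hops(graph, start, max_hops=3):
    """All simple paths from start with 1..max_hops edges (depth-first)."""
    paths = []

    def walk(path):
        if len(path) > 1:
            paths.append(list(path))
        if len(path) - 1 == max_hops:
            return
        for nxt in graph.get(path[-1], []):
            if nxt not in path:  # keep paths simple (no revisits)
                path.append(nxt)
                walk(path)
                path.pop()

    walk([start])
    return paths

def mean_vector(vectors):
    """Element-wise sum of the vectors divided by their count."""
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]

def path_embeddings(graph, start, node_vectors, max_hops=3):
    """Map each path (as a tuple of nodes) to the mean of its node vectors."""
    return {tuple(p): mean_vector([node_vectors[n] for n in p])
            for p in paths_within_hops(graph, start, max_hops)}

# Toy graph and 2-dimensional node vectors, for illustration only.
toy_graph = {
    "gas_relay": ["transformer"],
    "transformer": ["gas_relay", "bushing"],
    "bushing": ["transformer"],
}
toy_node_vectors = {
    "gas_relay": [1.0, 0.0],
    "transformer": [0.0, 1.0],
    "bushing": [1.0, 1.0],
}
expanded = path_embeddings(toy_graph, "gas_relay", toy_node_vectors)
```

Each key of `expanded` is one path starting at the task entity, and its value is the combined-feature vector used later as part of the recall condition.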

in a further embodiment, the method comprises the following specific steps:

step 6.1, calculating the topic distribution vectors of the recalled text and of the user task according to the topic model of the text knowledge base obtained in step 2;

step 6.3, obtaining the topic similarity between the recalled text and the user task, namely calculating the similarity according to the cosine formula of the vector space to obtain the topic similarity score of the recalled text;

step 6.4, calculating the final score based on a weighted voting strategy, the Word Mover's Distance (WMD) weight and the topic similarity weight being adjusted according to the task. The voting result is the candidate document score.
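The weighted voting of step 6.4 can be sketched as a weighted sum of two similarity scores. The text does not say how a distance is turned into a similarity, so the mapping 1 / (1 + d) below is an assumption, as are the default weights of 0.5 and the function names.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def distance_to_similarity(distance):
    """Assumed mapping of a non-negative distance into (0, 1]."""
    return 1.0 / (1.0 + distance)

def candidate_score(wmd_distance, task_topics, text_topics,
                    w_wmd=0.5, w_topic=0.5):
    """Weighted vote of word-level and topic-level similarity."""
    word_sim = distance_to_similarity(wmd_distance)
    topic_sim = cosine_similarity(task_topics, text_topics)
    return w_wmd * word_sim + w_topic * topic_sim

# Same topic distribution, word-level distance of 1.0.
score = candidate_score(1.0, [1.0, 0.0], [1.0, 0.0])
```

Adjusting `w_wmd` and `w_topic` per task shifts the ranking between texts that share wording with the task and texts that share its topic.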

Step 7, pushing the sorted texts to a user according to the scores;

step 8, if the user task is finished, the pushing is terminated; steps 4 to 7 are repeated when the user changes scene and state.

The invention mainly completes the pushing of the domain knowledge according to the following aspects:

1) Mining domain knowledge: in service system maintenance work, the user's work content and scenes often need to be recorded and analyzed, but this information is usually scattered, weakly related, and sparse in features. Solving the user's problems therefore requires a rich reserve of domain knowledge. A knowledge graph is a structured, richly associated form of knowledge representation, formed by mining entities and relations from unstructured text in a vertical domain, and it meets the requirements of knowledge storage and mining.

2) Task feature modeling: the user's operation and maintenance tasks must be carried out according to certain specifications and procedures. Unlike a traditional pushing system, knowledge pushing takes the user's specific task and task scene as its starting point; rather than inferring from user preferences and historical operations, it mines knowledge association features between the massive texts and the task, so that knowledge texts with the same semantic content as the task are pushed.

3) Text matching calculation: recalling a large number of texts is the main work of knowledge pushing, and recall precision affects the final result of the subsequent text similarity calculation. In addition, the result of knowledge pushing takes the form of short, highly accurate knowledge texts, which involves natural language processing techniques.

Examples

A domain knowledge pushing method based on a knowledge graph, as shown in Fig. 1, has the following key steps and implementation:

step 1, collecting knowledge texts of the power equipment to construct a text knowledge base.

The power domain text knowledge base is a set of texts containing the knowledge required by system tasks, and is the source of the auxiliary knowledge pushed for substation operation and inspection tasks. The sources of the knowledge base mainly include authoritative books on electric power operation and inspection, related journal literature, internal documents of electric power research institutes, and online encyclopedia and question-and-answer knowledge bases on electric power operation and inspection.

After documents are obtained from these knowledge sources, the knowledge range is determined according to the requirements of the electric power operation and inspection task, mainly covering documents on transformers, circuit breakers, secondary non-electric-quantity devices, protection devices and the like, and the content is then screened. The texts are split into sentences, stop words are filtered out, and so on; the stop words are provided mainly by experts in the field of electric power operation and inspection. The final text set is built into the power domain text corpus.

And 2, performing semantic analysis and topic modeling on the knowledge base text of the power equipment. The method is implemented according to the following steps:

step 2.1, segmenting the equipment-related texts, which include equipment descriptions, equipment operation and inspection task descriptions, and equipment defect descriptions, into words, and training on the texts with the unsupervised Word2Vec word embedding algorithm to obtain the semantic distribution vectors of the words. The semantic vector of a text sentence is calculated as a weighted sum of its word vectors. Specifically, words that match the power equipment operation task description text closely are given a high weight, and irrelevant words a low weight. The degree of matching is measured by the number of string matches.

step 2.2, modeling the topics of the knowledge base texts. The texts in the power equipment text base are segmented into words, word frequencies are counted over the text sentences according to the segmentation results, and words whose frequency is lower than a preset threshold are filtered out. Character processing is then performed on the sentences to obtain a bigram dictionary of the power equipment knowledge base texts, which is used to construct a mapping table from each text to its bag-of-words vector. Finally, the bag-of-words vectors of the knowledge base texts are obtained through the mapping table and used as input to train the LDA algorithm, yielding the topic distribution vectors of the equipment knowledge base texts.

Semantic analysis here includes both word embedding and sentence embedding. It preserves the semantic information of sentences and calculates the similarity between texts at the semantic level, as distinct from similarity in surface word forms alone.

Step 3, applying graph embedding to the power domain knowledge graph to obtain the semantic distribution vectors of the nodes. The method is implemented according to the following steps:

step 3.2, obtaining the semantic distribution vectors of the graph nodes. Graph embedding is a technique for representing knowledge graph nodes as semantic distribution vectors, which can be obtained by algorithms such as random walks. This implementation uses a GCN-based graph neural network to learn the embedded representations of the graph nodes. Specifically, the node topology of the domain knowledge graph is learned through a graph convolutional neural network; that is, the attributes and connection semantics of the graph nodes are mapped to a low-dimensional space by the neural network, yielding the semantic distribution vector of each node. Adding node attribute information during training can effectively improve the learning effect on the node classification task.

Step 4, taking a light gas alarm task as an example, establishing the task context features from the task description of the user's power equipment operation and inspection task. The method is implemented according to the following steps:

step 4.1, performing word segmentation on the light gas alarm task description text, and performing vectorization representation of task description by using the word vector trained in the step 2 to serve as a semantic feature of the light gas alarm task;

step 4.2, extracting the entities in the topic of the light gas alarm task, and using the knowledge graph node distribution vectors trained in step 3 to obtain the entity representation vectors associated with the operation and inspection task, as a classification feature of the light gas alarm task;

step 5, carrying out entity alignment on the electric power equipment entity and the electric power field knowledge graph according to the equipment operation and inspection task text, carrying out feature expansion and task associated knowledge recall on the basis of graph node paths and graph node semantic distribution characteristics, and specifically implementing according to the following steps:

step 5.1, acquiring task description and task association system components, namely a gas relay, according to a gas alarm task entity, and performing entity alignment operation on a knowledge graph in the power field to acquire a subgraph corresponding to the task association entity on the graph;

step 5.2, acquiring the embedding vector of the sub-graph entity calculated in the step 5.1 and the word embedding vector of the entity node in the adjacent relation path in the three hops of the sub-graph node;

step 5.3, performing key path expansion on the sub-graph nodes, namely combining adjacent nodes, and simultaneously acquiring sentence embedding of combination characteristics, wherein the combination mode adopts a sum average method;

step 5.4, filtering the texts using the gas alarm task context feature vector of step 4, the graph embedding vector of the task entity, and the embedding vectors of the subgraph node combinations as the preliminary recall condition, to obtain a coarse-precision recall of the knowledge of the task-associated nodes.
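The coarse recall of step 5.4 can be sketched as a similarity filter. The text does not spell out the filter rule, so this sketch makes two assumptions: all query vectors (task context, task entity, subgraph node combinations) and all text vectors live in one shared embedding space, and a text is recalled when its best cosine match against any query vector reaches a hypothetical `threshold`.

```python
import math

def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def coarse_recall(text_vectors, query_vectors, threshold=0.5):
    """Ids of texts whose best match to any query vector passes the threshold."""
    recalled = []
    for text_id, vec in text_vectors.items():
        best = max(cosine(vec, q) for q in query_vectors)
        if best >= threshold:
            recalled.append(text_id)
    return recalled

# Toy vectors: two query vectors and three knowledge base texts.
queries = [[1.0, 0.0], [1.0, 1.0]]
texts = {"t1": [1.0, 0.0], "t2": [0.0, 1.0], "t3": [-1.0, 0.0]}
recalled_ids = coarse_recall(texts, queries, threshold=0.5)
```

A loose threshold keeps recall high at this stage; precision is recovered by the finer similarity scoring of step 6.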

Step 6, calculating the text similarity between the recalled texts obtained in step 5 and the user task to obtain the recalled text scores; the process is shown in Fig. 5. The method is implemented according to the following steps:

step 6.1, calculating the topic distribution vectors of the recalled text and of the gas operation and inspection task according to the topic model of the text corpus obtained in step 2;

step 6.2, performing word-level similarity calculation between the candidate text and the task description according to the Word Mover's Distance (WMD) algorithm to obtain the WMD similarity score of the recalled text;

The Word Mover's Distance is a method for measuring the distance between two text documents and is used to judge the similarity between two texts: the larger the WMD, the lower the similarity, and the smaller the WMD, the higher the similarity.

step 6.3, obtaining the topic similarity between the recalled text and the gas alarm task, namely calculating the similarity according to the cosine formula of the vector space to obtain the topic similarity score of the recalled text;
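The word-level distance of step 6.2 can be approximated without an optimal-transport solver. The sketch below implements the relaxed Word Mover's Distance, a known lower bound on the exact WMD in which every word moves all of its weight to its nearest word in the other text; it is a stand-in for illustration, not the exact algorithm. Euclidean distance between word vectors and the toy vectors are assumptions.

```python
import math
from collections import Counter

def euclidean(u, v):
    """Euclidean distance between two word vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def relaxed_wmd(tokens_a, tokens_b, word_vectors):
    """Relaxed Word Mover's Distance: a lower bound on the exact WMD."""
    a = [t for t in tokens_a if t in word_vectors]
    b = [t for t in tokens_b if t in word_vectors]
    if not a or not b:
        return float("inf")  # no known words to compare

    def one_way(src, dst):
        # Each source word sends its normalized frequency to the
        # nearest destination word.
        dst_words = set(dst)
        return sum(
            (count / len(src)) * min(
                euclidean(word_vectors[w], word_vectors[x]) for x in dst_words)
            for w, count in Counter(src).items())

    # The larger of the two one-way costs is the tighter lower bound.
    return max(one_way(a, b), one_way(b, a))

# Toy 2-dimensional word vectors, for illustration only.
toy_wv = {"gas": [0.0, 0.0], "relay": [3.0, 4.0]}
d_same = relaxed_wmd(["gas", "relay"], ["gas", "relay"], toy_wv)
d_diff = relaxed_wmd(["gas"], ["relay"], toy_wv)
d_part = relaxed_wmd(["gas", "relay"], ["gas"], toy_wv)
```

Identical texts get distance 0, and the distance grows as the texts use words that lie further apart in the embedding space, matching the rule that a larger distance means lower similarity.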

Step 7, pushing the sorted texts to the user according to the scores. The method is implemented according to the following steps:

the recalled documents are sorted in descending order of the scores obtained in step 6, and a certain number of documents are selected and pushed as required.
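The final ranking can be sketched in a few lines. The tie-break on the text id and the parameter `k` for the number of pushed documents are assumptions added so that the output is deterministic.

```python
def top_k_texts(scores, k=3):
    """Texts sorted by descending score; ties broken by text id."""
    ranked = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:k]

# Hypothetical candidate scores keyed by text id.
toy_scores = {"t1": 0.4, "t2": 0.9, "t3": 0.9, "t4": 0.1}
pushed = top_k_texts(toy_scores, k=2)
```

Only the `k` highest-scoring recalled documents are pushed; the rest of the recall set is discarded for this task state.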

Claims

1. A domain knowledge pushing method based on a knowledge graph is characterized by comprising the following specific steps:

step 5, carrying out entity alignment on the domain entities in the user task description text and the domain knowledge graph in the step 3, carrying out feature expansion and task associated knowledge recall based on graph node paths and graph node semantic distribution characteristics, and specifically comprising the following steps:

step 5.1, acquiring task description and task association system components according to a user task entity, and performing entity alignment operation on the knowledge graph to acquire a sub-graph corresponding to the task entity on the graph;

step 5.3, performing key path expansion on the entity nodes in each path of the graph;

step 5.4, filtering the text of the knowledge base by taking the user task context characteristics, the graph embedded vector of the task entity and the embedded vector of the sub-graph node combination in the step 4 as a primary recall condition to obtain a recall text with rough knowledge precision of the task associated nodes;

step 6, performing text similarity calculation on the recall text obtained in the step 5 and the user task to obtain a recall text score, wherein the specific method comprises the following steps:

step 6.4, calculating scores based on a weighted voting strategy, and adjusting the Word Mover's Distance weight and the topic similarity weight according to the task;

step 7, pushing the ranked texts to the user according to the scores;

step 8, if the user task is finished, the pushing is terminated; steps 4 to 7 are repeated when the user's context and status changes.

2. The knowledge-graph-based domain knowledge pushing method according to claim 1, wherein the text knowledge base is constructed by the following method: determining a knowledge range according to the field task requirements, and screening the content; sentence division is carried out on the text, and stop words are filtered; and constructing the final text set into a text knowledge base.

3. The knowledge-graph-based domain knowledge push method according to claim 1, wherein the specific method for semantic analysis of knowledge base text is as follows:

segmenting the knowledge text into words, and training on the text with the unsupervised Word2Vec word embedding algorithm to obtain semantic distribution vectors of the words;

and calculating the semantic vector of the text sentence by adopting a method based on word vector weighted sum.

4. The knowledge-graph-based domain knowledge push method according to claim 1, wherein the text topic modeling is performed by a specific method comprising:

performing word segmentation on texts in a knowledge base, performing word frequency statistics on text sentences in the knowledge base according to word segmentation results, and filtering out words whose frequency is lower than a preset threshold value;

5. The domain knowledge push method based on the knowledge-graph according to claim 1, wherein the specific method for obtaining the semantic distribution vector of the nodes of the knowledge-graph is as follows:

step 3.1, constructing a domain knowledge graph, including two tasks of named entity recognition and relation extraction, and obtaining domain knowledge entities and the relations between the entities by adopting a BERT-based pre-trained model to perform supervised learning;

6. The domain knowledge push method based on the knowledge graph according to claim 1, wherein the specific method for establishing the task context features is as follows: