CN112905746A - System archive knowledge mining processing method based on knowledge graph technology

Info

Publication number: CN112905746A
Application number: CN202110249513.6A
Authority: CN
Inventors: 张洪涛; 陈功娥; 李光华; 唐晓芳; 文辉; 李小龙; 白良俊
Original assignee: Guodian Dadu River Hydropower Development Co Ltd
Current assignee: Guodian Dadu River Hydropower Development Co Ltd
Priority date: 2021-03-08
Filing date: 2021-03-08
Publication date: 2021-06-04

Abstract

The invention discloses a system archive knowledge mining and processing method based on knowledge graph technology, belonging to the technical field of knowledge graph construction. The method comprises the following steps: setting a basic schema for a knowledge base and constructing the knowledge base; labeling the system documents in an original document set on an entity and relation labeling platform, and converting the platform's text labeling data into text sequence labeling data; taking the text sequence labeling data as input, training and testing a deep learning model, and generating an extraction model for entities and relations; running the extraction model on incremental system documents and using the output as a pre-labeling result; performing knowledge fusion on the extracted system entities and on the many related system names that appear in the abstracts; and auditing the fused entities and storing them in the knowledge base. By associating systems with systems, systems with sub-systems, and systems with clauses, the method markedly improves the utilization efficiency and value of the knowledge.

Description

System archive knowledge mining processing method based on knowledge graph technology

Technical Field

The invention belongs to the technical field of knowledge graph construction, and relates to a system archive knowledge mining processing method based on knowledge graph technology.

Background

In the era of artificial intelligence, enterprise system archives (an enterprise's archived rules and regulations) have become an increasingly important intangible asset, and they form the largest and most complete pool of data resources that an enterprise accumulates as it develops. Realizing the knowledge value of system documents is central to managing a new generation of intelligent enterprises.

For a large enterprise, after decades of gradual informatization, the accumulated body of system documents shows the following three characteristics:

First, the volume of system documents is large and highly digitized: a large number of electronic documents exist, and the volume keeps growing as the business expands. Second, the documents carry high knowledge value: not only is each file valuable as a document, but the density of knowledge within a single file is also very high. For example, an enterprise's XX production specification contains many standardized knowledge points about the production flow, yet these points are scattered across documents and lack sufficient association. Third, in search scenarios, documents are closely associated in knowledge (for example, a lower-level document is formulated according to an upper-level document), but conventional knowledge retrieval only finds documents and knowledge that match the query text and cannot retrieve knowledge associated across documents.

In summary, constructing and searching an enterprise document knowledge base poses the following difficulties. System document data is large and keeps growing, so the knowledge base must support large-scale document import and horizontal scaling. The knowledge in system documents must be structured and associated, which calls for a closed-loop construction mode of manual labeling, model training, pre-labeling, and manual auditing to improve construction efficiency. Knowledge retrieval must support association, expansion, and traceability: a search should surface knowledge from business-related documents, expand knowledge on the basis of the knowledge base, and allow tracing from a piece of knowledge back to its source document.

Disclosure of Invention

The invention aims to provide a professional system archive question-answering robot system based on semantic analysis technology, solving the problem that conventional knowledge retrieval can only find documents and knowledge that match the query text and cannot retrieve knowledge associated across documents.

The technical scheme adopted by the invention is as follows:

A system archive knowledge mining processing method based on knowledge graph technology comprises the following steps:

setting a basic schema for a knowledge base and constructing the knowledge base;

labeling the system documents in an original document set on an entity and relation labeling platform, and converting the platform's text labeling data into text sequence labeling data;

taking the text sequence labeling data as input, training and testing a deep learning model, and generating an extraction model for entities and relations;

running the extraction model on incremental system documents and using the output as a pre-labeling result;

performing knowledge fusion on the extracted system entities and on the many related system names that appear in the abstracts;

and auditing the fused entities and storing them in the knowledge base.
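The conversion from platform labeling data to text sequence labeling data in the second step can be sketched as follows. This is a minimal illustration, not the patented implementation: it assumes the labeling platform exports entities as character-offset spans and that a character-level BIO tagging scheme is the target format, neither of which the text specifies. The function name `spans_to_bio` and the label `SYS` are hypothetical.

```python
from typing import List, Tuple

# Assumed export format of the labeling platform:
# (start offset, exclusive end offset, entity type).
Span = Tuple[int, int, str]


def spans_to_bio(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Convert character-offset entity spans into per-character BIO tags."""
    tags = ["O"] * len(text)
    for start, end, label in sorted(spans):
        if not (0 <= start < end <= len(text)):
            raise ValueError(f"span out of range: {(start, end, label)}")
        if any(tag != "O" for tag in tags[start:end]):
            raise ValueError(f"overlapping span: {(start, end, label)}")
        tags[start] = f"B-{label}"          # first character of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"          # remaining characters of the entity
    return list(zip(text, tags))


# Hypothetical example: the first two characters form a system name.
labeled = spans_to_bio("AB cd", [(0, 2, "SYS")])
```

A sequence in this form is what a Bi-LSTM + CRF tagger would typically consume as training data; relation labels would need a separate export, which is not shown here.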

Further, the basic schema of the knowledge base is as follows: the system document is taken as the standard document; the clauses, the sub-systems, and the units of the text are parsed from the system document, and clauses are also parsed from the sub-systems.

Furthermore, the entity extraction algorithm of the extraction model adopts a Bi-LSTM + CRF model for model training, and the relation extraction algorithm adopts a Simple Bert model.

Further, the Bi-LSTM + CRF model masks the host entity and the guest entity in the sentence by using special characters.

Further, knowledge fusion comprises the following steps:

retrieving, by full-text search, a plurality of textually related candidate entities for each extracted entity name;

extracting entity attribute features, entity name text features, and relationship features from each candidate entity and the target entity, inputting these features into a binary classification model, and outputting a fusion probability as the basis for deciding whether to fuse;

wherein the relationship features comprise the first-degree relationship features, first-degree entity features, and second-degree relationship features of the entities.
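The fusion steps above can be sketched in a few lines. This is an illustrative sketch under stated assumptions: string similarity from `difflib` stands in for a real full-text search engine, the entity record layout (`name`, `attrs`, `neighbors`) is hypothetical, only first-degree neighbors are used as relationship features, and a logistic function with untrained placeholder weights stands in for the binary classification model, whose actual form the text does not specify.

```python
import math
from difflib import SequenceMatcher
from typing import Dict, List, Sequence

# Hypothetical entity record: {"name": str, "attrs": dict, "neighbors": set}
Entity = Dict[str, object]

# Placeholder parameters; a real system would learn these from audited data.
WEIGHTS = (4.0, 2.0, 2.0)
BIAS = -4.0


def find_candidates(name: str, knowledge_base: List[Entity], top_k: int = 5) -> List[Entity]:
    """Stand-in for full-text search: rank stored entities by name similarity."""
    scored = [(SequenceMatcher(None, name, e["name"]).ratio(), e) for e in knowledge_base]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [entity for score, entity in scored[:top_k] if score > 0]


def fusion_features(candidate: Entity, target: Entity) -> List[float]:
    """Name-text, attribute, and first-degree relationship features for one pair."""
    name_sim = SequenceMatcher(None, candidate["name"], target["name"]).ratio()
    c_attrs, t_attrs = candidate["attrs"], target["attrs"]
    shared = set(c_attrs) & set(t_attrs)
    attr_sim = sum(c_attrs[k] == t_attrs[k] for k in shared) / len(shared) if shared else 0.0
    c_nb, t_nb = candidate["neighbors"], target["neighbors"]
    union = c_nb | t_nb
    rel_sim = len(c_nb & t_nb) / len(union) if union else 0.0  # Jaccard overlap
    return [name_sim, attr_sim, rel_sim]


def fusion_probability(features: Sequence[float],
                       weights: Sequence[float] = WEIGHTS,
                       bias: float = BIAS) -> float:
    """Logistic score used as the basis for the fuse / do-not-fuse decision."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))
```

A caller would retrieve candidates for each extracted name, score each pair, and fuse when the probability exceeds a chosen threshold; the threshold itself is a tuning decision not given in the text.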

Further, the relationships among systems comprise abolition, basis, mention, and correlation; the abolition and basis relationships are extracted by keyword triggering and pattern matching, with the following specific steps:

defining system relationship patterns;

locating sentences in the system formulation abstract by trigger keywords, performing entity extraction on those sentences, and extracting a plurality of system names;

and extracting system relationship pairs according to the system relationship patterns.

Further, for a query sentence input by the user, the knowledge association retrieval of the system comprises the following steps:

performing entity linking between the user's query sentence and the knowledge base to find the set A of entities hit by the query sentence;

performing first-degree and second-degree exploration in the knowledge base for each entity in set A, and calculating the weights of the entities reached by the first-degree and second-degree exploration;

and ranking all explored candidate entities by weight, filtering out the entities already hit by entity linking, and returning the remaining entities as the associated recommended entity set.
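The retrieval steps above can be sketched on a small in-memory graph. The sketch makes several assumptions that the text does not state: entity linking is reduced to substring matching, the knowledge base is an adjacency dictionary, and the weights (1.0 per first-degree hit, 0.5 per second-degree hit) are arbitrary illustrative values. All names are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Set, Tuple

Graph = Dict[str, Set[str]]  # entity name -> directly related entity names


def link_entities(query: str, graph: Graph) -> Set[str]:
    """Naive entity linking: an entity is hit when its name occurs in the query."""
    return {name for name in graph if name in query}


def associated_entities(query: str, graph: Graph,
                        first_weight: float = 1.0,
                        second_weight: float = 0.5) -> List[Tuple[str, float]]:
    """Explore one and two hops from the hit set A and rank what is found."""
    hits = link_entities(query, graph)              # set A
    scores: Dict[str, float] = defaultdict(float)
    for entity in hits:
        for first in graph.get(entity, set()):      # first-degree exploration
            scores[first] += first_weight
            for second in graph.get(first, set()):  # second-degree exploration
                if second != entity:
                    scores[second] += second_weight
    # Drop entities already hit by entity linking, then sort by weight.
    candidates = [(e, s) for e, s in scores.items() if e not in hits]
    return sorted(candidates, key=lambda pair: (-pair[1], pair[0]))


graph = {
    "Expense Policy": {"Travel Policy"},
    "Travel Policy": {"Expense Policy", "Clause 1"},
    "Clause 1": {"Travel Policy"},
}
recommended = associated_entities("what does the Expense Policy say", graph)
```

Here the directly related "Travel Policy" ranks above "Clause 1", which is only reachable in two hops; a production system would likely weight by relation type as well.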

In summary, due to the adoption of the technical scheme, the invention has the beneficial effects that:

The system archive knowledge mining and processing method based on knowledge graph technology uses a knowledge graph as the knowledge base to model an enterprise's system documents, and associates systems with systems, systems with sub-systems, and systems with clauses, so that the utilization efficiency and value of the knowledge are markedly improved. Using graph exploration, the relationships between systems and the overall context between system clauses can be queried quickly.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are required to be used in the embodiments will be briefly described below, it should be understood that the following drawings only illustrate some embodiments of the present invention and therefore should not be considered as limiting the scope, and that for those skilled in the art, other relevant drawings can be obtained according to the drawings without inventive effort, wherein:

FIG. 1 is a schematic flow diagram of the present invention;

FIG. 2 is a schematic diagram of the basic schema of the knowledge base of the present invention.

Detailed Description

In order to make the objects, technical solutions and advantages of the present invention more apparent, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the detailed description and specific examples, while indicating the preferred embodiment of the invention, are intended for purposes of illustration only and are not intended to limit the scope of the invention. The components of embodiments of the present invention generally described and illustrated in the figures herein may be arranged and designed in a wide variety of different configurations.

Thus, the following detailed description of the embodiments of the present invention, presented in the figures, is not intended to limit the scope of the invention, as claimed, but is merely representative of selected embodiments of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments of the present invention without making any creative effort, shall fall within the protection scope of the present invention.

It is noted that relational terms such as "first" and "second," and the like, may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.

Examples

As shown in fig. 1, a system archive knowledge mining processing method based on the knowledge graph technology according to a preferred embodiment of the present invention includes the following steps:

setting a basic schema for a knowledge base and constructing the knowledge base;

Specifically, as shown in FIG. 2, the basic schema of the knowledge base is as follows: the system document is taken as the standard document; the clauses, the sub-systems, and the units of the text are parsed from the system document, and clauses are also parsed from the sub-systems.

In this embodiment, for example, the "reimbursement management system" is a system entity, and the "business trip reimbursement system" within it may serve as a sub-system. The "clause" entity type is the most fine-grained form of knowledge expression; for example, the "first-tier city reimbursement process" can serve as a clause, and the clause carries attributes such as "applicable personnel" and "applicable standard".
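The schema and the reimbursement example can be expressed as a small data model. This is a sketch only: the class names, field names, and relation labels (`has_unit`, `has_sub_system`, `has_clause`) are hypothetical, and the attribute values are placeholders because the text names the attribute keys but gives no values.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterator, List, Tuple

Triple = Tuple[str, str, str]  # (head entity, relation, tail entity)


@dataclass
class Clause:
    name: str
    attributes: Dict[str, str] = field(default_factory=dict)


@dataclass
class SubSystem:
    name: str
    clauses: List[Clause] = field(default_factory=list)


@dataclass
class SystemDocument:
    name: str
    units: List[str] = field(default_factory=list)
    sub_systems: List[SubSystem] = field(default_factory=list)
    clauses: List[Clause] = field(default_factory=list)

    def triples(self) -> Iterator[Triple]:
        """Flatten the document into graph edges; relation names are illustrative."""
        for unit in self.units:
            yield (self.name, "has_unit", unit)
        for clause in self.clauses:
            yield (self.name, "has_clause", clause.name)
        for sub in self.sub_systems:
            yield (self.name, "has_sub_system", sub.name)
            for clause in sub.clauses:
                yield (sub.name, "has_clause", clause.name)


doc = SystemDocument(
    name="reimbursement management system",
    sub_systems=[
        SubSystem(
            name="business trip reimbursement system",
            clauses=[
                Clause(
                    name="first-tier city reimbursement process",
                    # Keys come from the text; values are placeholders.
                    attributes={
                        "applicable personnel": "<placeholder>",
                        "applicable standard": "<placeholder>",
                    },
                )
            ],
        )
    ],
)
edges = list(doc.triples())
```

The resulting edges are the kind of (head, relation, tail) records that would be loaded into a graph store; which store the invention uses is not stated.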

Preferably, the entity extraction algorithm of the extraction model adopts a Bi-LSTM + CRF model for model training, and the relation extraction algorithm adopts a Simple Bert model; the Bi-LSTM + CRF model masks the host and guest entities in the sentence using special characters.

In this embodiment, the input form of the Simple Bert model is: [CLS] sentence [SEP] host entity [SEP] guest entity [SEP]. To prevent overfitting, the host and guest entities in the sentence are masked with special characters, e.g., [S-PER] for a host entity whose type is person name and [O-LOC] for a guest entity whose type is place name. The whole sequence is encoded by the Simple Bert model; the hidden vector obtained at each position is concatenated with encoding vectors of that position relative to the host entity and the guest entity in the sentence; the concatenated vectors are input into a bidirectional LSTM layer; the hidden state at the last time step in each direction is taken and concatenated; and the result is finally fed to a feed-forward network layer to predict the relation type.
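The input construction described above can be sketched as plain string handling, before any tokenizer or model is involved. The sketch assumes that entities are given as character spans with a type, and that the trailing host and guest segments carry the original entity text; the text does not say whether those segments are masked too. The function name `build_relation_input` is hypothetical.

```python
from typing import Tuple

# (start offset, exclusive end offset, entity type such as "PER" or "LOC")
Span = Tuple[int, int, str]


def build_relation_input(sentence: str, host: Span, guest: Span) -> str:
    """Mask both entities in the sentence and assemble the model input string."""
    (hs, he, host_type), (gs, ge, guest_type) = host, guest
    if hs < ge and gs < he:
        raise ValueError("host and guest spans overlap")
    host_text, guest_text = sentence[hs:he], sentence[gs:ge]
    # Replace the later span first so the earlier offsets stay valid.
    replacements = sorted(
        [(hs, he, f"[S-{host_type}]"), (gs, ge, f"[O-{guest_type}]")],
        reverse=True,
    )
    masked = sentence
    for start, end, token in replacements:
        masked = masked[:start] + token + masked[end:]
    return f"[CLS] {masked} [SEP] {host_text} [SEP] {guest_text} [SEP]"


example = build_relation_input(
    "Alice lives in Paris", host=(0, 5, "PER"), guest=(15, 20, "LOC")
)
```

For the sentence above, `example` is `[CLS] [S-PER] lives in [O-LOC] [SEP] Alice [SEP] Paris [SEP]`. In practice the mask tokens would also have to be registered with the tokenizer so that they are not split into sub-word pieces.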

The extraction model is run on incremental system documents, and its output serves as a pre-labeling result. It should be noted that the pre-labeling results continuously add to the accumulated labeling data, which in turn can be used to train a better-performing extraction model.

specifically, knowledge fusion comprises the following steps:

The relationships among systems comprise abolition, basis, mention, and correlation; the abolition and basis relationships are extracted by keyword triggering and pattern matching, with the following specific steps:

defining system relationship patterns; in practice, the relationship patterns may be set as: [according to | pursuant to] <System A>, formulate <System B>; upon issuing <System A>, the original <System B> is abolished; and so on.
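Keyword triggering and pattern matching of this kind can be sketched with regular expressions. This is an illustration under stated assumptions: the patterns and trigger words below are hypothetical English renderings (the original patterns would be written for Chinese text), the system names are assumed to have been found already by the entity extraction step, and the relation labels `basis` and `abolish` are chosen for the example.

```python
import re
from typing import List, Tuple

# Hypothetical English stand-ins for the relationship patterns.
PATTERNS = [
    ("basis", re.compile(
        r"(?:according to|pursuant to) (?:the )?(?P<a><SYS\d+>).*?(?P<b><SYS\d+>) is formulated",
        re.IGNORECASE)),
    ("abolish", re.compile(
        r"(?P<a><SYS\d+>) is issued.*?(?P<b><SYS\d+>) is abolished",
        re.IGNORECASE)),
]
TRIGGERS = ("according to", "pursuant to", "abolished")


def extract_relations(sentence: str, system_names: List[str]) -> List[Tuple[str, str, str]]:
    """Return (head, relation, tail) triples found in one sentence."""
    lowered = sentence.lower()
    if not any(trigger in lowered for trigger in TRIGGERS):
        return []                                   # no trigger word: skip sentence
    templated = sentence
    # Replace longer names first so a name containing another is not split.
    for idx, name in sorted(enumerate(system_names), key=lambda p: -len(p[1])):
        templated = templated.replace(name, f"<SYS{idx}>")
    relations = []
    for relation, pattern in PATTERNS:
        for match in pattern.finditer(templated):
            a = system_names[int(match.group("a")[4:-1])]
            b = system_names[int(match.group("b")[4:-1])]
            # "basis": B is based on A.  "abolish": A abolishes B.
            relations.append((b, relation, a) if relation == "basis" else (a, relation, b))
    return relations


basis = extract_relations(
    "According to the Finance Policy, the Travel Policy is formulated.",
    ["Finance Policy", "Travel Policy"],
)
abolish = extract_relations(
    "The New Travel Policy is issued and the original Old Travel Policy is abolished.",
    ["New Travel Policy", "Old Travel Policy"],
)
```

Replacing names with placeholders before matching keeps the patterns independent of how system names are written, which is one reason to run entity extraction first.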

In addition, for the query sentence input by the user, the knowledge correlation retrieval step of the system is as follows:

The above description is only for the purpose of illustrating the preferred embodiments of the present invention and should not be taken as limiting the scope of the present invention, and any modifications, equivalents and improvements made by those skilled in the art within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims

1. A system archive knowledge mining processing method based on knowledge graph technology is characterized in that: the method comprises the following steps:

setting a basic schema for a knowledge base and constructing the knowledge base;

2. The system archive knowledge mining processing method based on the knowledge graph technology as claimed in claim 1, characterized in that: the basic schema of the knowledge base is as follows: the system document is taken as the standard document; the clauses, the sub-systems, and the units of the text are parsed from the system document, and clauses are also parsed from the sub-systems.

3. The system archive knowledge mining processing method based on the knowledge graph technology as claimed in claim 1, characterized in that: the entity extraction algorithm of the extraction model adopts a Bi-LSTM + CRF model for model training, and the relation extraction algorithm adopts a Simple Bert model.

4. The system archive knowledge mining processing method based on the knowledge graph technology as claimed in claim 3, wherein: the Bi-LSTM + CRF model masks the host and guest entities in the sentence using special characters.

5. The system archive knowledge mining processing method based on the knowledge graph technology as claimed in claim 1, characterized in that: the knowledge fusion comprises the following steps:

6. The system archive knowledge mining processing method based on the knowledge graph technology as claimed in claim 1, characterized in that: the relationships among systems comprise abolition, basis, mention, and correlation; the abolition and basis relationships are extracted by keyword triggering and pattern matching, with the following specific steps:

defining system relationship patterns;

7. The system archive knowledge mining processing method based on the knowledge graph technology as claimed in claim 1, characterized in that: for the query sentence input by the user, the knowledge association retrieval of the system comprises the following steps: