CN112256884A - Knowledge graph-based data asset library access method and device

Info

Publication number: CN112256884A
Application number: CN202011144033.5A
Authority: CN
Inventors: 乔林; 陈硕; 薄珏; 徐立波; 刘碧琦; 王妍; 齐俊; 郭任; 常将; 李希
Original assignee: Information and Telecommunication Branch of State Grid Liaoning Electric Power Co Ltd
Current assignee: Information and Telecommunication Branch of State Grid Liaoning Electric Power Co Ltd
Priority date: 2020-10-23
Filing date: 2020-10-23
Publication date: 2021-01-22

Abstract

The application provides a knowledge-graph-based data asset library access method and device. The access method comprises: bidirectionally constructing a data association model of the data asset library, on the one hand according to the data entities of a first existing system and the relationships among those data entities, and on the other hand based on the SG-CIM unified information model and the data of the existing business system; and accessing the data in the data asset library by constructing a unified access ontology, wherein the knowledge base of the access ontology is refined based on a knowledge graph. Because a unified access ontology is constructed, data access can be managed uniformly without merging all of the databases.

Description

Knowledge graph-based data asset library access method and device

Technical Field

The present disclosure relates to the field of data processing, and in particular, to a method and an apparatus for accessing a data asset library based on a knowledge graph.

Background

With the continuous growth of massive electric power data and the continuous expansion of electric power business systems, it is particularly important for State Grid enterprises to establish a data asset library for electric power data. The data asset library collects all of an enterprise's data resources that can generate value and provides users with an asset view, so that enterprise assets can be understood quickly, bad assets can be found, managers are given a basis for decisions, and the value of the data assets is improved. However, a prominent problem with power grid enterprise data asset libraries is that the data volume is large and most power grid business application systems have their own data management systems; unified data integration and centralized management are lacking, and unified access is difficult.

Disclosure of Invention

One of the objectives of the present disclosure is to solve the problem of the difficulty in accessing the data asset library uniformly mentioned in the background art by providing a method and apparatus for accessing the data asset library based on a knowledge graph.

To achieve the above object, according to one embodiment of the present disclosure, there is provided a knowledge-graph-based method for accessing a data asset library, including: bidirectionally constructing a data association model of the data asset library, according to the data entities of a first existing system and the relationships among those data entities, and based on the SG-CIM unified information model and the data of the existing business system; and accessing the data in the data asset library by constructing a unified access ontology, wherein the knowledge base of the access ontology is refined based on a knowledge graph.

Optionally, the step of constructing the data association model of the data asset library based on the SG-CIM unified information model and the data of the existing business system includes: acquiring data of the first existing system, and forming unstructured business metadata of that data according to the data of the first existing system, the data of the existing business system, and the association relationship between the two; and constructing an association model of structured data and unstructured data based on the SG-CIM unified information model and the unstructured business metadata.

Optionally, the step of accessing data in the data asset library comprises: sending an access request to a structured data center to acquire basic information of accessed related equipment and entity codes of unstructured data; and sending an access request to the unstructured data management platform according to the unstructured data entity code provided by the structured data center so as to obtain a target document corresponding to the data entity code.

Optionally, the step of accessing the data in the data asset library by constructing a unified access ontology includes: extracting the entities, attributes and relationships of the data assets in the data warehouses of a plurality of business systems to construct the unified access ontology, so that the data in the data asset libraries of the plurality of business systems can be accessed uniformly.

Optionally, the step of refining the knowledge base of the access ontology based on a knowledge graph includes: using knowledge graph technology, discovering potential and missing associated data in the data warehouses of the plurality of business systems, and comparing the similarity of different types of data sets in the data warehouses by means of knowledge in the form of structured triples, so as to obtain association information and refine the unified access ontology.

According to another embodiment of the present disclosure, there is provided a knowledge-graph-based data asset library access device including: a data management model building unit, configured to bidirectionally construct a data association model of the data asset library according to the data entities of a first existing system and the relationships among those data entities, and based on the SG-CIM unified information model and the data of the existing business system; and a data access unit, configured to access the data in the data asset library by constructing a unified access ontology, wherein the knowledge base of the access ontology is refined based on a knowledge graph.

The embodiments of the present disclosure can achieve the following advantageous effects. In the prior art, the data in an enterprise-level data asset library is massive and dispersed, and even if some important data is integrated, the databases cannot be completely merged. To address this, the invention provides a knowledge-graph-based data asset library access method in which a unified access ontology is constructed, so that data access can be managed uniformly without merging all of the databases.

It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not restrictive of the disclosure.

Drawings

Other features, objects and advantages of the present application will become more apparent upon reading of the following detailed description of non-limiting embodiments thereof, made with reference to the accompanying drawings in which:

FIG. 1 is a schematic flow chart diagram of a method for knowledge-graph based access to a data asset library according to an embodiment of the present application;

FIG. 2 is a schematic illustration of a process for applying for access to data in a data asset library according to one embodiment of the present application;

FIG. 3 is a schematic block diagram of a knowledge-graph based data asset library access device provided in accordance with an embodiment of the present application.

The same or similar reference numbers in the drawings identify the same or similar structures.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

The terminology used in the present disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the disclosure. As used in this disclosure and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and includes any and all possible combinations of one or more of the associated listed items.

According to one embodiment of one aspect of the present application, a knowledge-graph-based method for accessing a data asset library is provided. Referring to FIG. 1, the data asset library access method includes:

S101: bidirectionally constructing a data association model of the data asset library, according to the data entities of a first existing system and the relationships among those data entities, and based on the SG-CIM unified information model and the data of the existing business system.

S102: accessing data in the data asset library by constructing a unified access ontology, wherein the knowledge base of the access ontology is refined based on a knowledge graph.

Specifically, for step S101, the data asset library may include, but is not limited to, an electric power data asset library and the data asset libraries of other industries and enterprises. Constructing a data asset library typically involves data collection, dirty data identification, data cleansing and automatic data association, all of which are suitable for use with embodiments of the present application. Taking the data asset library of a power grid enterprise as an example, the technical route for constructing the data asset library comprises the following steps:

First, key business metadata is studied on the basis of the SG-CIM (State Grid Corporation common data model) enterprise information model, and the business systems are sorted out to identify the elements that associate the existing system data with the structured business metadata, thereby forming unstructured business metadata information for the existing system data. The business metadata may include business names, definitions, descriptions and the like, which are used to identify the various attributes in the data warehouses and business systems. A business system data warehouse can be constructed from the business data entities, the relationships among the entities, the entity attributes and other information. Here, an entity refers to a specific thing in a business system that is distinguishable and exists independently.

Second, an association model of structured data and unstructured data is formed based on the SG-CIM unified information model in combination with the key unstructured business metadata.

Specifically, forming the unstructured business metadata information of the existing system data may include the following two sub-steps:

First, sorting out the business system data. The association relationships between the data in the existing systems of the power grid enterprise and the structured data in the business systems are sorted out. By combining automatically generated association rules with manually entered association relationships, information such as the source business system, the associated business data source table, the associated field information, the access logic of the associated business data, and the topic domain (SG-CIM model topic domain) to which the associated data belongs is determined, and an association relationship description specification is formulated according to the CWM specification and the SG-CIM specification.

Second, sorting out the existing system data platform. The unstructured data accessed by the system data platform is sorted out, the scope of unified management of the data assets is determined, and the key metadata elements of the unstructured data in the power grid enterprise data asset library are preliminarily formed.
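The items determined in the first sub-step can be pictured as one record per association. The following is a minimal sketch only: the class name, field names, sample topic domain names and the rule that manually entered records override automatically generated ones are all assumptions made for illustration and are not specified in this application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AssociationRecord:
    """One association between existing-system data and structured business data.

    All field names are hypothetical; they mirror the items listed in the text.
    """
    source_system: str      # source business system
    source_table: str       # associated business data source table
    association_field: str  # associated field information
    access_logic: str       # how the associated business data is accessed
    topic_domain: str       # SG-CIM topic domain of the associated data
    origin: str             # "rule" (automatic) or "manual" (entered by hand)


def merge_records(rule_based, manual):
    """Combine automatic and manual records.

    Assumption: when both describe the same (system, table, field),
    the manually entered record replaces the automatic one.
    """
    merged = {}
    for rec in list(rule_based) + list(manual):
        key = (rec.source_system, rec.source_table, rec.association_field)
        merged[key] = rec
    return list(merged.values())


if __name__ == "__main__":
    auto = [AssociationRecord("marketing", "customer_meter", "meter_id",
                              "join on meter_id", "Customer", "rule")]
    manual = [AssociationRecord("marketing", "customer_meter", "meter_id",
                                "join on meter_id and region", "Customer", "manual")]
    for rec in merge_records(auto, manual):
        print(rec)
```

Keeping the origin of each record makes it possible to audit which associations came from rules and which were entered by hand.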

Forming the association model of structured data and unstructured data may include the following two sub-steps:

First, forming the association model. Based on the key elements of unstructured business metadata obtained by the sorting described above, and in combination with the data structures of the basic metadata and the associated metadata of the unstructured data, an association model of unstructured and structured data is formed that references or follows the CWM (Common Warehouse Metamodel) data warehouse metadata model and follows the SG-CIM model, so that metadata is accessed and stored in a standardized way.

Second, forming a management storage model for the unstructured metadata. Based on a preset existing information storage model, information storage models for the change, management and the like of the unstructured metadata are formed; these models support the operation and maintenance management of the unstructured business metadata.

After describing the technical route of constructing the data asset library in the present application, the following details the process of constructing the data asset library in step S101 by taking the construction of the data asset library of the power grid enterprise as an example.

In one embodiment, the data association model of the data asset library is built by bidirectional modeling. Specifically, the step in S101 of constructing the data association model of the data asset library based on the SG-CIM unified information model and the data of the existing business system may include:

acquiring data of the first existing system, and forming unstructured business metadata of that data according to the data of the first existing system, the data of the existing business system, and the association relationship between the two; and constructing an association model of structured data and unstructured data based on the SG-CIM unified information model and the unstructured business metadata. The first existing system is, for example, an existing information system of a power grid enterprise, and the existing business system includes, for example, the various business systems related to the power grid, such as a marketing business system.

More specifically, on the one hand, starting from the data of the first existing system, the data entities accessed by the first existing system and the relationships among them are sorted out and abstracted, the data topic domains to which the data entities belong are analyzed and merged, and the relationships among the topic domains are analyzed to form an unstructured data association model.

On the other hand, starting from the business requirements and based on the SG-CIM unified information model and the existing business systems, the requirements of each business line for unstructured data are extracted, sorted out and analyzed. Following the business processes, the key entities are extracted, and the topic domains to which the entities belong, the relationships among the entities, and the association relationships between unstructured data entities and structured data entities are analyzed to form a data association model. The association between unstructured data entities and structured data entities can be implemented by adding the codes of the unstructured data entities to the storage structure of the structured data center.
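One way to picture the last point, adding the code of an unstructured data entity to the storage structure of the structured data center, is an extra column on a structured table. The sketch below uses an in-memory SQLite database; the table name, column names and sample values are hypothetical and are not taken from this application.

```python
import sqlite3


def build_demo_store():
    """Create an in-memory structured store with a hypothetical equipment table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE equipment ("
        " equipment_id TEXT PRIMARY KEY,"
        " name TEXT NOT NULL,"
        " unstructured_entity_code TEXT)"  # code of the associated unstructured entity
    )
    conn.execute(
        "INSERT INTO equipment VALUES (?, ?, ?)",
        ("EQ-001", "Transformer A", "DOC-7f3a"),
    )
    conn.commit()
    return conn


def entity_code_for(conn, equipment_id):
    """Return the unstructured entity code stored with a structured record, or None."""
    row = conn.execute(
        "SELECT unstructured_entity_code FROM equipment WHERE equipment_id = ?",
        (equipment_id,),
    ).fetchone()
    return row[0] if row else None


if __name__ == "__main__":
    store = build_demo_store()
    print(entity_code_for(store, "EQ-001"))
```

With this layout the structured record carries only a code, so the document itself can stay in the unstructured data management platform.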

In one embodiment, the step of accessing the data in the data asset library in step S102 includes: sending an access request to a structured data center to acquire basic information of accessed related equipment and entity codes of unstructured data; and sending an access request to the unstructured data management platform according to the unstructured data entity code provided by the structured data center so as to obtain a target document corresponding to the data entity code.

Specifically, as shown in FIG. 2, applying for access to data in the data asset library may be implemented as follows:

(1) The business application calls the service exposed by the database management platform, sends a request to the structured data center, and queries the basic information of the related equipment and the unstructured data entity codes.

(2) The structured data center, according to the request submitted by the business application, returns the basic information of the equipment and the unstructured data entity codes to the business application.

(3) The business application sends a request to the unstructured data management platform according to the unstructured data entity codes provided by the structured data center, and queries information such as the related documents.

(4) The unstructured data management platform, according to the request of the business application, obtains the target document by its data entity code and returns it to the business application.
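The four steps above amount to a two-stage lookup. The following sketch models the two back ends as in-memory Python classes; the class names, method names and sample data are hypothetical stand-ins, and a real deployment would call remote services instead.

```python
class StructuredDataCenter:
    """Hypothetical stand-in for the structured data center."""

    def __init__(self, records):
        # equipment_id -> (basic_info dict, list of unstructured entity codes)
        self._records = records

    def query(self, equipment_id):
        basic_info, codes = self._records[equipment_id]
        return dict(basic_info), list(codes)


class UnstructuredDataPlatform:
    """Hypothetical stand-in for the unstructured data management platform."""

    def __init__(self, documents):
        self._documents = documents  # entity code -> document content

    def get_document(self, entity_code):
        return self._documents[entity_code]


def access_asset(equipment_id, data_center, platform):
    """Run the two-stage access described in steps (1) to (4)."""
    # Steps (1)-(2): basic information and unstructured entity codes.
    basic_info, codes = data_center.query(equipment_id)
    # Steps (3)-(4): fetch each target document by its entity code.
    documents = [platform.get_document(code) for code in codes]
    return basic_info, documents


if __name__ == "__main__":
    center = StructuredDataCenter(
        {"EQ-001": ({"name": "Transformer A"}, ["DOC-7f3a"])}
    )
    docs = UnstructuredDataPlatform({"DOC-7f3a": "maintenance manual"})
    print(access_asset("EQ-001", center, docs))
```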

Optionally, for step S102, a unified access ontology may be constructed by extracting the entities, attributes and relationships of the data assets in the data warehouses of multiple business systems, so that the data in the data asset libraries of the multiple business systems can be accessed uniformly. The access ontology describes the entities, the attributes or identifiers, and the associations of the data assets in the business network system.
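As a rough illustration of extracting entities, attributes and relationships from several warehouses into one access ontology, the sketch below merges per-system schema descriptions into a single dictionary. The input layout, the function names and the assumption that entities with the same name in different systems denote the same concept are simplifications made for this example only.

```python
def build_access_ontology(warehouses):
    """Merge per-system entities, attributes and relations into one ontology.

    `warehouses` maps a system name to
    {"entities": {entity: [attributes]}, "relations": [(head, relation, tail)]}.
    Assumption: identical entity names across systems refer to the same concept.
    """
    ontology = {"entities": {}, "relations": set()}
    for system, schema in warehouses.items():
        for entity, attributes in schema["entities"].items():
            node = ontology["entities"].setdefault(
                entity, {"attributes": set(), "sources": set()}
            )
            node["attributes"].update(attributes)
            node["sources"].add(system)  # remember where the data actually lives
        ontology["relations"].update(schema["relations"])
    return ontology


def locate(ontology, entity):
    """Return the systems that hold data for an entity, without merging databases."""
    return sorted(ontology["entities"][entity]["sources"])


if __name__ == "__main__":
    demo = {
        "marketing": {
            "entities": {"Customer": ["customer_id"], "Meter": ["meter_id"]},
            "relations": [("Customer", "owns", "Meter")],
        },
        "production": {
            "entities": {"Meter": ["meter_id", "model"]},
            "relations": [],
        },
    }
    print(locate(build_access_ontology(demo), "Meter"))
```

The ontology records where each entity lives rather than copying the data, which is the sense in which access can be unified while the databases stay separate.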

Specifically, with a unified access ontology, managing the data asset library does not require storing all of the databases together. Instead, data access is managed uniformly through the ontology: the entities, attributes and relationships of the data assets in the data warehouses of the business systems are extracted, fused and accessed uniformly. Knowledge graph technology is used to discover potential and missing associated data in the data warehouses, and the similarity of different types of data sets in the data warehouses is compared by means of knowledge in the form of structured triples (for example, an entity triple consisting of two entities and the association between them, such as entity x, XX relation, entity y) to obtain association information and refine the unified access ontology. The associated data that is extracted and discovered is used as instances to extend the engineering ontology, and the updated ontology takes part in ontology fusion, so that the unified-access engineering domain ontology over the different databases in the analysis domain is continuously refined. More specifically, if the similarity is lower than a threshold, the entity identifiers of the entities in the entity triple, the association relationships among the entities and the like are added to the knowledge base corresponding to the knowledge graph to supplement it. Of course, this description of the similarity comparison is only an example; in other embodiments, the similarity comparison of the data sets may also be implemented with the prior art.
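The threshold rule in the preceding paragraph can be sketched as follows. This application does not specify the similarity measure, so the example assumes a very simple one, the fraction of matching positions between two triples; the function names, threshold value and sample triples are likewise hypothetical.

```python
def triple_similarity(a, b):
    """Assumed measure: fraction of the three positions (head, relation, tail) that match."""
    return sum(x == y for x, y in zip(a, b)) / 3


def supplement(knowledge_base, candidates, threshold=1.0):
    """Add candidate triples whose best similarity to the knowledge base is below the threshold.

    `knowledge_base` is a set of (head, relation, tail) tuples and is updated in place.
    Returns the list of triples that were added.
    """
    added = []
    for candidate in candidates:
        best = max(
            (triple_similarity(candidate, known) for known in knowledge_base),
            default=0.0,
        )
        if best < threshold:  # not already covered by the knowledge base
            knowledge_base.add(candidate)
            added.append(candidate)
    return added


if __name__ == "__main__":
    kb = {("transformer_A", "located_in", "substation_1")}
    found = [
        ("transformer_A", "located_in", "substation_1"),
        ("transformer_A", "maintained_by", "team_3"),
    ]
    print(supplement(kb, found))
```

With the default threshold of 1.0 only exact duplicates are skipped; a lower threshold would also skip near-duplicates.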

The knowledge graph technology may include existing knowledge graph construction techniques. For example, key data in the data asset library is analyzed by a preset model (including, for example, a convolutional neural network language model) and preprocessed (for example, denoised) to obtain knowledge data in a uniform format, and a knowledge graph corresponding to the data asset library is constructed from the entities, attributes, association relationships and the like of the data assets in the access ontology. The key data includes, for example, collected and monitored electricity consumption data.

Specifically, constructing the unified access ontology proceeds as follows. First, resource selection: after a base ontology is selected, the texts from which domain-related entities will be extracted are determined. Second, concept learning: domain-related concepts are acquired from the selected texts and classification relationships among the concepts are established, for example by processing the concept data with a softmax classifier or another classification method. Third, domain focusing: concepts irrelevant to the domain are removed, leaving only the domain-relevant concept structure from which the target ontology is built. Finally, apart from the relationships inherited from the base ontology, the remaining relationships are extracted from the texts by a learning method. Refining the unified-access engineering domain ontology in this way yields the enterprise data asset library, through which the goal of unified data access is achieved.
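For the concept learning step, the softmax classifier mentioned above turns per-class scores into probabilities and picks the most likely class. The sketch below shows only that final step; how the scores are produced (for example by a trained model over text features) is outside its scope, and the class labels and scores are invented for illustration.

```python
import math


def softmax(scores):
    """Convert raw scores into probabilities that sum to 1 (numerically stable form)."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify_concept(scores, labels):
    """Return the most probable label and its probability.

    `scores` are assumed to come from some upstream model; one score per label.
    """
    probabilities = softmax(scores)
    best = max(range(len(labels)), key=probabilities.__getitem__)
    return labels[best], probabilities[best]


if __name__ == "__main__":
    # Hypothetical parent classes and scores for one extracted concept.
    labels = ["equipment", "document", "organization"]
    print(classify_concept([2.0, 0.1, -1.0], labels))
```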

In addition, regarding refining or completing the knowledge base of the access ontology based on the knowledge graph: when a new data entity is introduced into the data asset library, knowledge base completion can infer which existing entities are related to the new data entity from the existing structured triples, the entity set and the relation set.

For example, suppose the knowledge graph $G$ includes an entity set $E = \{e_1, e_2, \dots, e_M\}$ ($M$ is the number of entities), a relation set $R = \{r_1, r_2, \dots, r_N\}$ ($N$ is the number of relations), and a triple set $\mathcal{T} = \{(e_i, r_k, e_j) \mid e_i, e_j \in E,\ r_k \in R\}$. Since the number of entities and relations in a knowledge graph $G$ is typically limited, there may be entities and relations that are not in $G$. Denote the set of entities not in $G$ by $E^* = \{e_1^*, e_2^*, \dots, e_S^*\}$ ($S$ is the number of such entities) and the set of relations not in $G$ by $R^* = \{r_1^*, r_2^*, \dots, r_T^*\}$ ($T$ is the number of such relations). According to which element of a triple is predicted, knowledge graph completion can be divided into three subtasks: head entity prediction, tail entity prediction, and relation prediction. For head (tail) entity prediction, the tail (head) entity and the relation of a triple are given, and the entity that completes a correct triple is predicted.

Knowledge base completion can be implemented with techniques such as embedding-based knowledge base completion or knowledge base completion based on quantitative-change credibility, which find the triples missing from the knowledge graph. During completion, for a missing tail entity, the vector representation of the head entity in the semantic space is added to the vector representation of the relation to obtain a predicted tail entity vector, and the entity in the entity list closest to the predicted vector is selected as the prediction result. For a missing relation between two entities, the embedding vector of the head entity is subtracted from the embedding vector of the tail entity to obtain a predicted relation vector, this vector is compared with the embedding vector of each candidate relation, and the candidate relation most similar to the predicted relation vector is selected as the prediction result.
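The two predictions described above can be sketched with plain Python lists standing in for embedding vectors. The toy embeddings, the use of Euclidean distance as the closeness measure, and the choice to exclude the head entity from the tail candidates are assumptions made for this example.

```python
import math


def vec_add(u, v):
    return [a + b for a, b in zip(u, v)]


def vec_sub(u, v):
    return [a - b for a, b in zip(u, v)]


def distance(u, v):
    """Euclidean distance, used here as the (assumed) closeness measure."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def predict_tail(head, relation, entity_vecs, relation_vecs):
    """Tail prediction: head vector + relation vector, then nearest entity."""
    target = vec_add(entity_vecs[head], relation_vecs[relation])
    candidates = [e for e in entity_vecs if e != head]  # assumption: skip the head itself
    return min(candidates, key=lambda e: distance(entity_vecs[e], target))


def predict_relation(head, tail, entity_vecs, relation_vecs):
    """Relation prediction: tail vector - head vector, then nearest relation."""
    target = vec_sub(entity_vecs[tail], entity_vecs[head])
    return min(relation_vecs, key=lambda r: distance(relation_vecs[r], target))


if __name__ == "__main__":
    # Toy two-dimensional embeddings, invented for illustration.
    entities = {
        "transformer_A": [0.0, 0.0],
        "substation_1": [1.0, 0.0],
        "team_3": [0.0, 1.0],
    }
    relations = {"located_in": [1.0, 0.0], "maintained_by": [0.0, 1.0]}
    print(predict_tail("transformer_A", "located_in", entities, relations))
    print(predict_relation("transformer_A", "team_3", entities, relations))
```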

Specifically, in embedding-based knowledge base completion, the structured triples and the entities and relations in the knowledge base are represented as low-dimensional vectors. The most classical distributed embedding model is TransE, in which $h$, $r$ and $t$ in a triple $(h, r, t)$ denote the head entity, the relation and the tail entity, respectively. TransE treats the relation vector as a translation from the head entity to the tail entity: for entity vectors $e_h, e_t \in \mathbb{R}^n$ and relation vector $e_r$, the difference between $e_h + e_r$ and $e_t$ is used to score the quality of the translation. After training, every entity in the knowledge base is represented by a vector, the similarity between vectors reflects the similarity between entities, and the sum of an entity vector and a relation vector gives the predicted vector of the object entity when that entity is the subject of the relation. Therefore, when a new entity $e$ is introduced while constructing the data asset library, it can be embedded into the semantic space of the knowledge base, and link prediction over the low-dimensional distributed embeddings can be performed on the relations that may exist between it and other entities, so that new knowledge is mined and the knowledge base is supplemented.
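For reference, the scoring idea described above is commonly written as follows in the TransE literature. This application does not state the norm, the margin or the negative sampling scheme, so $p$, $\gamma$ and the corrupted-triple set $\mathcal{T}'_{(h,r,t)}$ (triples obtained by replacing the head or the tail of a known triple) are the standard choices from that literature rather than details of this application; $\mathcal{T}$ denotes the set of known triples.

```latex
% Scoring function: distance between the translated head and the tail
f(h, r, t) = \lVert e_h + e_r - e_t \rVert_{p}, \qquad p \in \{1, 2\}

% Margin-based ranking loss over known triples and their corrupted versions
\mathcal{L} = \sum_{(h, r, t) \in \mathcal{T}} \;
              \sum_{(h', r, t') \in \mathcal{T}'_{(h, r, t)}}
              \max\bigl(0,\; \gamma + f(h, r, t) - f(h', r, t')\bigr)

% Link prediction for a new entity e taken as the head of relation r
\hat{t} = \arg\min_{e_j \in E} \; f(e, r, e_j)
```

A lower score means a better translation, so training pushes known triples to score at least $\gamma$ lower than corrupted ones.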

Knowledge base completion based on quantitative-change credibility follows the prior art and specifically includes: validity calculation; construction of a training set (which may include quintuples of head entity, relation, tail entity, time slice and valid credibility, and quadruples of head entity, relation, tail entity and valid time period); initialization of the training parameters (the entity set, relation set and time slices are trained in the form of an entity vector set, a relation vector set and a time slice vector set, respectively); calculation of an evaluation function (the evaluation function and a loss function are calculated with a preset rule after the entity vector set, relation vector set and time slice vector set are mapped to a hyperplane); and adjustment of the training parameters (based on the loss function). The training process may include:

1) modeling the quantitative-change credibility, that is, performing duration modeling on meta-fact data covering the various relations to obtain a quantitative-change credibility model of the meta-facts;

2) splitting the valid time period of each quadruple into time slices, calculating the quantitative-change credibility at each time point, and combining each quadruple with its time slices to generate quintuples (head entity, relation, tail entity, time slice, quantitative-change credibility);

3) initializing the training parameters by randomly initializing the vector sets of entities, relations and time slices with preset dimensions;

4) randomly drawing a small training set (batch) from the quintuple set, and generating negative samples from its quintuples;

5) obtaining positive samples, mapping the positive and negative samples to their respective time slices, calculating the evaluation function, and adjusting the model training parameters according to the loss function;

6) repeating steps 4) and 5) to continue training, stopping the adjustment of the training parameters when the number of training iterations reaches the preset number, and outputting the trained model.
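Two pieces of the procedure above lend themselves to a short sketch: step 2), turning a quadruple with a valid period into per-slice quintuples, and the hyperplane mapping used when calculating the evaluation function. This application gives neither the credibility model nor the projection formula, so the linear-decay credibility function, the integer time slices and the projection $v - (w \cdot v)\,w$ onto a hyperplane with unit normal $w$ (as used in hyperplane-based temporal embedding models) are all assumptions.

```python
import math


def linear_decay(t, start, end):
    """Hypothetical credibility model: 1.0 at the start of validity, falling linearly."""
    return 1.0 - (t - start) / (end - start)


def to_quintuples(quadruple, slice_length, credibility=linear_decay):
    """Split (head, relation, tail, (start, end)) into one quintuple per time slice."""
    head, relation, tail, (start, end) = quadruple
    quintuples = []
    t = start
    while t < end:
        quintuples.append((head, relation, tail, t, credibility(t, start, end)))
        t += slice_length
    return quintuples


def project(v, w):
    """Project v onto the hyperplane whose unit normal vector is w."""
    dot = sum(a * b for a, b in zip(w, v))
    return [a - dot * b for a, b in zip(v, w)]


def evaluate(h, r, t, w):
    """Assumed evaluation function: translation distance after projection."""
    ph, pr, pt = project(h, w), project(r, w), project(t, w)
    return math.sqrt(sum((a + b - c) ** 2 for a, b, c in zip(ph, pr, pt)))


if __name__ == "__main__":
    quad = ("transformer_A", "located_in", "substation_1", (2010, 2014))
    for q in to_quintuples(quad, slice_length=1):
        print(q)
    # Toy vectors: w is the unit normal of the hyperplane for one time slice.
    print(evaluate([0.0, 0.0, 5.0], [1.0, 0.0, 2.0], [1.0, 0.0, -3.0], [0.0, 0.0, 1.0]))
```

In a full training loop, steps 4) to 6) would draw batches of such quintuples, corrupt them to obtain negative samples, and adjust the vectors so that positive samples score lower than negative ones.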

It should be noted that while the operations of the disclosed methods are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Rather, the steps depicted in the flowcharts may change the order of execution. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.

According to the general inventive concept of the present application, embodiments of the present application also provide a knowledge-graph-based data asset library access device. The units and modules of the device may be implemented in whole or in part by software, hardware, or a combination thereof. They may be embedded in, or independent of, a processor of a computer device in hardware form, or stored in a memory of the computer device in software form, so that the processor can invoke them and execute the operations corresponding to each module.

Referring to FIG. 3, the access device may include:

a data management model building unit 101, configured to bidirectionally construct a data association model of the data asset library according to the data entities of the first existing system and the relationships among those data entities, and based on the SG-CIM unified information model and the data of the existing business system;

a data access unit 102, configured to access data in the data asset library by constructing a unified access ontology, wherein the knowledge base of the access ontology is refined based on a knowledge graph.

Optionally, the data management model building unit 101 specifically includes:

an unstructured business metadata forming module, configured to acquire data of the first existing system and to form unstructured business metadata of that data according to the data of the first existing system, the data of the existing business system, and the association relationship between the two;

an association model building module, configured to build an association model of structured data and unstructured data based on the SG-CIM unified information model and the unstructured business metadata.

Optionally, the data access unit 102 specifically includes:

a first request sending module, configured to send an access request to a structured data center to obtain basic information of the accessed related equipment and entity codes of unstructured data;

the second request sending module is used for sending an access request to the unstructured data management platform according to the unstructured data entity code provided by the structured data center so as to obtain a corresponding target document based on the data entity code.

Optionally, the data access unit 102 is specifically configured to: extract the entities, attributes and relationships of the data assets in the data warehouses of a plurality of business systems to construct a unified access ontology, so that the data in the data asset libraries of the plurality of business systems can be accessed uniformly.

Optionally, the data access unit 102 is specifically configured to: using knowledge graph technology, discover potential and missing associated data in the data warehouses of the plurality of business systems, and compare the similarity of different types of data sets in the data warehouses by means of knowledge in the form of structured triples, so as to obtain association information and refine the unified access ontology.

The method of the embodiment of the invention corresponds to the device of the embodiment of the invention, and the technical characteristics and the beneficial effects described in the embodiment of the method are all applicable to the embodiment of the device.

The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the invention.

It will be evident to those skilled in the art that the present application is not limited to the details of the foregoing illustrative embodiments, and that the present application may be embodied in other specific forms without departing from the spirit or essential attributes thereof. The present embodiments are therefore to be considered in all respects as illustrative and not restrictive, the scope of the application being indicated by the appended claims rather than by the foregoing description, and all changes which come within the meaning and range of equivalency of the claims are therefore intended to be embraced therein. Any reference sign in a claim should not be construed as limiting the claim concerned.

Claims

1. A method for accessing a data asset library based on a knowledge graph is characterized by comprising the following steps:

constructing, in a bidirectional manner, a data association model of a data asset library according to the relationships among the data entities of a first existing system, and based on the SG-CIM unified information model and the data of an existing business system;

and accessing the data in the data asset library by constructing a unified access ontology, wherein the knowledge base of the access ontology is refined based on a knowledge graph.

2. The method according to claim 1, wherein the step of constructing the data association model of the data asset library based on the SG-CIM unified information model and the data of the existing business system comprises:

acquiring data of the first existing system, and forming unstructured service metadata for the data of the first existing system according to the data of the first existing system and of the existing business system, and the association relationship between the data of the first existing system and the existing business system;

and constructing an association model of the structured data and the unstructured data based on the SG-CIM unified information model and the unstructured service metadata.

3. The data asset library access method of claim 1, wherein said step of accessing data in a data asset library comprises:

sending an access request to a structured data center to acquire basic information of accessed related equipment and entity codes of unstructured data;

and sending an access request to the unstructured data management platform according to the unstructured data entity code provided by the structured data center so as to obtain a target document corresponding to the data entity code.

4. The method according to claim 1, wherein the step of accessing the data in the data asset library by constructing a unified access ontology comprises:

extracting the entities, attributes and relationships of the data assets in the data warehouses of a plurality of business systems to construct the unified access ontology, so as to provide unified access to the data in the data asset libraries of the plurality of business systems.

5. The method of claim 1, wherein the step of refining the knowledge base of the access ontology based on a knowledge graph comprises: based on knowledge graph technology, retrieving potential and missing associated data from the data warehouses of a plurality of business systems, comparing the similarity of different types of data sets in those data warehouses using knowledge expressed as structured triples to obtain association information, and thereby refining the unified access ontology.

6. A data asset repository access device based on a knowledge-graph, comprising:

the data management model building unit is used for constructing, in a bidirectional manner, a data association model of a data asset library according to the relationships among the data entities of a first existing system, and based on the SG-CIM unified information model and the data of an existing business system;

and the data access unit is used for accessing the data in the data asset library by constructing a unified access ontology, wherein the knowledge base of the access ontology is refined based on a knowledge graph.

7. The data asset library access device according to claim 6, wherein the data management model building unit specifically comprises:

the unstructured service metadata forming module is used for acquiring data of the first existing system and forming unstructured service metadata for the data of the first existing system according to the data of the first existing system and of the existing business system, and the association relationship between the data of the first existing system and the existing business system;

and the association model building module is used for building an association model of the structured data and the unstructured data based on the SG-CIM unified information model and the unstructured service metadata.

8. The data asset library access device according to claim 6, wherein the data access unit specifically comprises:

the first request sending module is used for sending an access request to the structured data center so as to obtain the basic information of the accessed related equipment and the entity code of the unstructured data;

and the second request sending module is used for sending an access request to the unstructured data management platform according to the unstructured data entity code provided by the structured data center so as to obtain a target document corresponding to the data entity code.

9. The data asset library access device of claim 6, wherein the data access unit is specifically configured to:

extract the entities, attributes and relationships of the data assets in the data warehouses of a plurality of business systems to construct the unified access ontology, so as to provide unified access to the data in the data asset libraries of the plurality of business systems.

10. The data asset library access device of claim 6, wherein the data access unit is specifically configured to: based on knowledge graph technology, retrieve potential and missing associated data from the data warehouses of a plurality of business systems, compare the similarity of different types of data sets in those data warehouses using knowledge expressed as structured triples to obtain association information, and thereby refine the unified access ontology.