CN116303336A - Data management method based on a data fabric architecture - Google Patents

Data management method based on a data fabric architecture

Info

Publication number
CN116303336A
Authority
CN
China
Prior art keywords
data
metadata
query
architecture
braiding
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202211504450.5A
Other languages
Chinese (zh)
Inventor
陈彬
萧展辉
徐欢
李辉
时燕
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Information Center of Yunnan Power Grid Co Ltd
Original Assignee
Information Center of Yunnan Power Grid Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Information Center of Yunnan Power Grid Co Ltd filed Critical Information Center of Yunnan Power Grid Co Ltd
Priority to CN202211504450.5A priority Critical patent/CN116303336A/en
Publication of CN116303336A publication Critical patent/CN116303336A/en
Pending legal-status Critical Current

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21Design, administration or maintenance of databases
    • G06F16/211Schema design and management
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/25Integrating or interfacing systems involving database management systems
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28Databases characterised by their database models, e.g. relational or object models
    • G06F16/284Relational databases
    • G06F16/285Clustering or classification
    • G06F16/287Visualization; Browsing
    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/30Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
    • G06F16/36Creation of semantic tools, e.g. ontology or thesauri
    • G06F16/367Ontology
    • YGENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02DCLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Animal Behavior & Ethology (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data management method based on a data fabric architecture, which comprises the following steps: constructing an active metadata management tool, forming a metadata knowledge graph, and generating a panoramic data portrait; deep mining metadata based on business experience and a machine learning model to form an intelligent recommendation engine; with active metadata at the core, using artificial intelligence (AI) and machine learning to automatically catalog data and realize an enhanced data service catalog; constructing a data virtualization engine through federated query, dynamic integration and data orchestration technologies to realize full-link self-service data usage; and constructing a DataOps data development and governance system to realize agile, high-quality data delivery. Business users can directly use data analysis results and form predictive capability without repeatedly performing complex data science work, realizing extremely agile data delivery, while active, intelligent and continuous data governance keeps the data architecture continuously healthy, thereby providing more value than traditional data management.

Description

Data management method based on a data fabric architecture
Technical Field
The invention relates to the technical field of data management, and in particular to a data management method based on a data fabric architecture.
Background
With the arrival of the big data age, data has become a new production resource, and its value is increasingly prominent. Organizations' demands on data have become more diverse, and massive amounts of data may reside in multiple application systems in a distributed environment. Especially with the accumulation of large volumes of semi-structured and unstructured data, the growing number of correlated external data sources, and the spread of hybrid multi-cloud environments, organizations face mounting challenges in their data management and application processes:
(1) The proliferation of dark data and data islands
As the volume of enterprise data grows and data demands become more complex, more and more data technologies (e.g., data warehouses, data lakes, NoSQL databases, OLAP databases, real-time data sources, etc.) are introduced, and enterprise data becomes physically fragmented; the adoption of hybrid-cloud and multi-cloud architectures further aggravates this problem.
(2) Increasingly delayed demand delivery
Growing enterprise data, explosive business demands and complex data engineering make self-service data discovery and usage by the business increasingly difficult.
(3) Increasingly severe quality problems
More and more data technologies make it difficult to maintain a "single source of truth" for data.
(4) Ever-expanding security compliance risk
With the growing body of legislation on data security and privacy protection, such as network security, data security and personal information protection laws, GDPR and CCPA, together with rising external security threats, enterprises must meet ever higher standards of compliance and governance while, more difficult still, also preserving the efficiency of business data use.
Disclosure of Invention
This section is intended to outline some aspects of embodiments of the invention and to briefly introduce some preferred embodiments. Some simplifications or omissions may be made in this section, in the abstract and in the title of the application to avoid obscuring their purpose; these simplifications and omissions should not be used to limit the scope of the invention.
The present invention has been made in view of the above-described problems.
Therefore, the technical problems solved by the invention are as follows: existing data management and application methods suffer from scattered data sources, fragmented technology stacks, difficulty in self-service data discovery and usage, and the open question of how to govern and use data securely and efficiently.
In order to solve the above technical problems, the invention provides the following technical scheme: a data management method based on a data fabric architecture, comprising:
constructing an active metadata management tool, forming a metadata knowledge graph, and generating a panoramic data portrait;
deep mining metadata based on business experience and a machine learning model to form an intelligent recommendation engine;
with active metadata at the core, using artificial intelligence (AI) and machine learning to automatically catalog data and realize an enhanced data service catalog;
constructing a data virtualization engine through federated query, dynamic integration and data orchestration technologies to realize full-link self-service data usage;
and constructing a DataOps data development and governance system to realize agile, high-quality data delivery.
In a preferred embodiment of the data management method based on the data fabric architecture of the present invention, the step of constructing an active metadata management tool includes:
establishing connection with a data warehouse and a traditional relational database;
using an automatic pipeline mode to configure the metadata acquisition time, acquisition objects and processing scripts, and acquiring the specified metadata information related to the data on the configured schedule;
constructing a relation mapping between the table and the fields based on the obtained result information to form metadata assets;
processing metadata assets based on the existing business relation dictionary table, establishing association relation among tables, fields and businesses, and manually checking processing results to form metadata lakes;
and using a Neo4j component to display the data in the metadata lake as a graph, where the lineage between metadata items can be refined through manual intervention.
In a preferred embodiment of the data management method based on the data fabric architecture of the present invention, the active metadata comprises two key components: the metadata lake and data flow automation;
the metadata lake serves as the cornerstone of active metadata and comprises technical metadata, business metadata, operational metadata, social metadata, and all data generated on and about the data;
the data flow automation includes: automatically collecting data distribution information, automatically classifying sensitive data and applying business classification to global data, while propagating labels in real time along data lineage to implement classified and graded data governance and compliance management strategies.
In a preferred embodiment of the data management method based on the data fabric architecture of the present invention, the knowledge graph uses nodes and edges to represent data information and the relationships between items of data information; entity linking and quantification of connection relationships are performed through AI/ML algorithms, and the associations between data and data, between data and users, and between data and business semantics are automatically mined and established to form a semantic knowledge graph.
In a preferred embodiment of the data management method based on the data fabric architecture of the present invention, the machine learning model is a random forest algorithm; model training, logistic regression and business fitting are performed on the metadata information in the metadata lake, and the result set formed by the algorithm is matched with data users and sorted by matching degree.
In a preferred embodiment of the data management method based on the data fabric architecture of the present invention, the step of deep mining the metadata includes:
setting the size of a training set T as N, the number of features as M and the size of a random forest as K;
iterating K times, once for each tree in the random forest;
sampling the training set T with replacement N times to form a new sub-training set D;
randomly selecting m features, where m < M;
learning a complete decision tree by using the new training set D and m features;
obtaining a random forest.
In a preferred embodiment of the data management method based on the data fabric architecture of the present invention, the feature number selection further includes: for classification problems, using m = √M features at each split; for regression problems, selecting m = M/3, but not fewer than 5, features.
In a preferred embodiment of the data management method based on the data fabric architecture of the present invention, the federated query refers to realizing cross-database federated queries based on Apache Hive 3 and SQL structured queries; Apache Hive automatically identifies the data sources in the statement to be queried based on the configured JDBC data source connections, realizes intelligent JDBC push-down by means of a cost-based optimizer, automatically groups the data sources in the query statement, and finally forms a result set matching the query statement.
In a preferred embodiment of the data management method based on the data fabric architecture of the present invention, the step of constructing the data virtualization engine comprises:
creating connectors that support different data sources;
acquiring information such as tables and fields in the target data source or database through the connector, and intuitively displaying the data in the data source through a visual foreground;
creating data services for create, read, update and delete operations on the target table, automatically exposed in the form of API (application programming interface) endpoints;
and providing a visual interface in which the data services of several different databases can be combined and orchestrated by drag-and-drop to form business-related data services, which are exposed externally for calling through APIs.
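The connector-and-service flow described above can be sketched minimally as follows; all class and method names here are illustrative assumptions for exposition, not the patent's actual implementation:

```python
class Connector:
    """Hypothetical connector exposing the tables and fields of one data source."""
    def __init__(self, name, tables):
        self.name = name
        self.tables = tables  # {table: [fields]}

    def describe(self):
        # Step 2: expose table/field information for visual display.
        return {t: list(f) for t, f in self.tables.items()}


class DataService:
    """A CRUD-style data service auto-generated for one table (step 3)."""
    def __init__(self, connector, table):
        self.connector, self.table = connector, table
        self.rows = []

    def create(self, row):
        self.rows.append(row)

    def read(self, **filt):
        return [r for r in self.rows
                if all(r.get(k) == v for k, v in filt.items())]


def compose(*services):
    """Step 4: combine services from different databases into one business API."""
    def api(**filt):
        out = []
        for s in services:
            out.extend(s.read(**filt))
        return out
    return api


oracle = Connector("oracle", {"customers": ["id", "name"]})
db2 = Connector("db2", {"orders": ["id", "customer_id"]})

cust_svc = DataService(oracle, "customers")
order_svc = DataService(db2, "orders")
cust_svc.create({"id": 1, "name": "alice"})
order_svc.create({"id": 10, "customer_id": 1})

business_api = compose(cust_svc, order_svc)
```

In a real engine the drag-and-drop orchestration would produce such a composition declaratively and publish it behind an API gateway.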
In a preferred embodiment of the data management method based on the data fabric architecture of the present invention, the data virtualization comprises a data virtualization presentation layer and a data federation: the presentation layer provides query services at a virtual or semantic layer, shielding the underlying database storage; after receiving a query instruction, the data federation mechanism decomposes it into a query part for the Oracle database and a query part for the DB2 database, performs the real data query operations, and returns the query result.
The invention has the beneficial effects that: according to the data management method based on the data fabric architecture provided by the invention, an abstraction layer is placed over the underlying data components, so that business users can directly use data analysis results and form predictive capability without repeatedly performing complex data science work; by optimizing discovery of and access to cross-source heterogeneous data, trusted data is delivered from all data sources to all relevant data consumers in a flexible, business-understandable manner, so that data consumers can serve themselves and collaborate efficiently, realizing extremely agile data delivery, while active, intelligent and continuous data governance keeps the data architecture continuously healthy, thereby providing more value than traditional data management.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings that are needed in the description of the embodiments will be briefly described below, it being obvious that the drawings in the following description are only some embodiments of the present invention, and that other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art. Wherein:
fig. 1 is a general flowchart of a data management method based on a data fabric architecture according to a first embodiment of the present invention;
fig. 2 is a schematic diagram of automatic data cataloging in a data management method based on a data fabric architecture according to a first embodiment of the present invention;
fig. 3 is a schematic diagram of a full-link data service development and operation management system in a data management method based on a data fabric architecture according to a first embodiment of the present invention;
fig. 4 is a general architecture diagram of a data management method based on a data fabric architecture according to a second embodiment of the present invention;
fig. 5 is a diagram of a power grid data digitalized operation system in a data management method based on a data fabric architecture according to a second embodiment of the present invention.
Detailed Description
So that the manner in which the above recited objects, features and advantages of the present invention can be understood in detail, a more particular description of the invention, briefly summarized above, may be had by reference to the embodiments, some of which are illustrated in the appended drawings. All other embodiments, which can be made by one of ordinary skill in the art based on the embodiments of the present invention without making any inventive effort, shall fall within the scope of the present invention.
In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention, but the present invention may be practiced in other ways other than those described herein, and persons skilled in the art will readily appreciate that the present invention is not limited to the specific embodiments disclosed below.
Further, reference herein to "one embodiment" or "an embodiment" means that a particular feature, structure, or characteristic can be included in at least one implementation of the invention. The appearances of the phrase "in one embodiment" in various places in the specification are not necessarily all referring to the same embodiment, nor are separate or alternative embodiments mutually exclusive of other embodiments.
While the embodiments of the present invention have been illustrated and described in detail in the drawings, the cross-sectional view of the device structure is not to scale in the general sense for ease of illustration, and the drawings are merely exemplary and should not be construed as limiting the scope of the invention. In addition, the three-dimensional dimensions of length, width and depth should be included in actual fabrication.
Also in the description of the present invention, it should be noted that the orientation or positional relationship indicated by the terms "upper, lower, inner and outer", etc. are based on the orientation or positional relationship shown in the drawings, are merely for convenience of describing the present invention and simplifying the description, and do not indicate or imply that the apparatus or elements referred to must have a specific orientation, be constructed and operated in a specific orientation, and thus should not be construed as limiting the present invention. Furthermore, the terms "first, second, or third" are used for descriptive purposes only and are not to be construed as indicating or implying relative importance.
The terms "mounted, connected, and coupled" should be construed broadly in this disclosure unless otherwise specifically indicated and defined, such as: can be fixed connection, detachable connection or integral connection; it may also be a mechanical connection, an electrical connection, or a direct connection, or may be indirectly connected through an intermediate medium, or may be a communication between two elements. The specific meaning of the above terms in the present invention will be understood in specific cases by those of ordinary skill in the art.
Example 1
Referring to fig. 1, in one embodiment of the present invention, there is provided a data management method based on a data fabric architecture, including:
s1: constructing an active metadata management tool, forming a metadata knowledge graph, and generating a panoramic data portrait;
further, the step of constructing an active metadata management tool includes:
establishing connections with OLAP (data warehouse) and OLTP (traditional relational database) systems;
configuring the metadata acquisition time, acquisition objects and processing scripts in an automatic pipeline mode, and acquiring, on the configured schedule, all metadata information related to the data, such as tables, reports, models, indicators, data processing scripts, data operations and the like;
constructing a relation mapping between the table and the fields based on the result information obtained in the steps to form metadata assets;
processing metadata assets based on the existing business relation dictionary table, establishing association relation among tables, fields and businesses, and manually checking processing results to form metadata lakes;
and using a Neo4j component to display the data in the metadata lake as a graph, where the lineage between metadata items can be refined through manual intervention.
Furthermore, active metadata is defined as data about data; it also includes all services occurring on the data and the data generated in the course of those services, and mainly comprises two key components: the metadata lake and data flow automation.
The metadata lake serves as the cornerstone of active metadata and comprises technical metadata (such as data types, data models and the like), business metadata (such as business tags, business policies, business relations and the like), operational metadata (such as data operations, data lineage, data performance and the like), social metadata (such as data consumption behavior, sharing behavior, UGC (user-generated content), ratings and the like), and all other data about the data, so that the business semantics of the data can be enriched and both systems and data consumers can be helped to understand the data better.
The data flow automation comprises automatically collecting data distribution information, automatically classifying sensitive data and applying business classification to global data, while propagating labels in real time along data lineage to implement classified and graded data governance and compliance management strategies. Based on real-time collection and analysis of full-link SQL (structured query language) logs, operator-level data lineage can be automatically parsed and generated, and field-level processing logic is extracted and visualized; through a lineage visualization UI (user interface), users can rapidly perform impact analysis, trace sources, track downstream consumers and identify key nodes, so as to carry out efficient data governance analysis.
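As a simplified illustration of lineage extraction from SQL logs and downstream impact analysis, the following sketch handles only toy "INSERT INTO ... SELECT ... FROM/JOIN ..." statements; the regular expressions and table names are assumptions, not the patent's actual parser:

```python
import re
from collections import defaultdict, deque


def extract_lineage(sql):
    """Return (target, [sources]) from one INSERT...SELECT statement."""
    target = re.search(r"insert\s+into\s+(\w+)", sql, re.I).group(1)
    sources = re.findall(r"(?:from|join)\s+(\w+)", sql, re.I)
    return target, sources


def build_graph(logs):
    """Build a source -> downstream-tables lineage graph from SQL logs."""
    down = defaultdict(set)
    for sql in logs:
        tgt, srcs = extract_lineage(sql)
        for s in srcs:
            down[s].add(tgt)
    return down


def impact(down, table):
    """Impact analysis: all tables downstream of `table` (breadth-first)."""
    seen, queue = set(), deque([table])
    while queue:
        for nxt in down[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


logs = [
    "INSERT INTO dws_load SELECT * FROM ods_meter JOIN ods_site",
    "INSERT INTO ads_report SELECT * FROM dws_load",
]
g = build_graph(logs)
```

A production system would instead derive operator-level lineage from the engine's parsed query plans, but the graph traversal for impact analysis is the same idea.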
It should be noted that the knowledge graph on which the data fabric depends can represent all transactions that occur on the organization's data. In general, the integration process for organizational data is quite complex, often involving extraction, transformation, modeling, mapping, etc. between various applications. The custom code required for modeling and mapping is difficult to reuse at scale, which hinders innovation on organizational business data and the deepening of predictive analysis. In contrast, the knowledge graph creates a reusable knowledge network to power organizational services: data of various structures can easily be represented as a graph, semantic understanding of data inside the organization and from third parties can be provided, and effective access to business predictive analysis capability is formed, which is the core capability required by a data fabric.
Further, rather than representing data in rows, columns, tables and keys, the knowledge graph represents data assets and the relationships between those assets as nodes and edges. Fundamentally, this graph data model is simpler than the relational model, yet more expressive and functional, easier to modify, and extensible without limit. The knowledge graph actually lives at the compute layer of the data management stack rather than at the storage layer, meaning it can be modified at any time by adding new nodes and edges, without having to laboriously design, at one point in time, a single shared data model covering all current and future organizational data requirements.
Furthermore, entity linking and quantification of connection relationships are performed through AI (artificial intelligence)/ML (machine learning) algorithms, and the associations between data and data, between data and users, and between data and business semantics are automatically mined and established to form a semantic knowledge graph, enabling a more three-dimensional depiction of the data and supporting more intelligent data usage recommendations.
S2: deep mining metadata based on business experience and a machine learning model to form an intelligent recommendation engine;
still further, the recommendation engine uses rules and machine learning models formed based on expert experience for DataOps, data management, and data preparation and services (e.g., data integration schemes or engine performance optimization), where the recommendation range may cover various phases of the data full lifecycle, such as data asset recommendation, data usage recommendation, data integration scheme recommendation, execution plan recommendation, calculation engine recommendation, data classification recommendation, data aging promotion recommendation, data security wind control recommendation, cost management recommendation, and the like.
Further, the business experience is mainly aimed at power grid services, such as power dispatching, load forecasting, cost and volume analysis, main and distribution network planning and the like.
Furthermore, the machine learning model is a random forest algorithm, performing model training, logistic regression and business fitting on the metadata information in the metadata lake; the result set formed by the algorithm is matched with data users and sorted by matching degree, so that relevant data tables can be recommended to business personnel, reducing the time they spend searching for data.
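The matching-and-sorting step can be illustrated with a deliberately simple scoring function; the tag-overlap score and all table names below are assumptions standing in for the model-based matching degree described above:

```python
def match_degree(table_tags, user_tags):
    """Fraction of the user's interest tags covered by a table's tags."""
    if not user_tags:
        return 0.0
    return len(set(table_tags) & set(user_tags)) / len(set(user_tags))


def recommend(tables, user_tags, top_n=3):
    """Sort candidate tables by matching degree, highest first."""
    scored = sorted(tables.items(),
                    key=lambda kv: match_degree(kv[1], user_tags),
                    reverse=True)
    return [name for name, _ in scored[:top_n]]


# Hypothetical metadata-lake tags for three grid-business tables.
tables = {
    "dws_load_daily": ["load", "forecast", "daily"],
    "dim_substation": ["topology", "planning"],
    "ads_cost_volume": ["cost", "volume", "analysis"],
}
picks = recommend(tables, ["load", "forecast"])
```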
Further, deep mining of metadata includes:
assuming that the size of the training set T is N, the number of features is M, and the size of the random forest is K, the specific steps are as follows:
iterating K times, once for each tree in the random forest;
sampling the training set T with replacement N times to form a new sub-training set D;
randomly selecting m features, where m < M;
learning a complete decision tree using the new training set D and the m features;
obtaining a random forest;
for classification problems, m = √M features are used at each split; for regression problems, m = M/3, but not fewer than 5, features are selected.
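As an illustrative sketch (not part of the claimed method), the sampling procedure above can be expressed in Python; the decision-tree learner itself is stubbed out, and all function names are assumptions:

```python
import math
import random


def feature_count(M, task):
    """m = sqrt(M) for classification; m = M/3 (at least 5) for regression."""
    if task == "classification":
        return max(1, round(math.sqrt(M)))
    return max(5, M // 3)


def random_forest_samples(T, M, K, task, seed=0):
    """Produce K (bootstrap sample, feature subset) pairs as described above.

    A real implementation would then fit one full decision tree per pair.
    """
    rng = random.Random(seed)
    N, m = len(T), feature_count(M, task)
    forest = []
    for _ in range(K):                              # once per tree
        D = [rng.choice(T) for _ in range(N)]       # N draws with replacement
        feats = rng.sample(range(M), min(m, M))     # random subset of m features
        forest.append((D, feats))
    return forest


# Toy training set: 20 rows, 9 features.
T = [{"f%d" % i: i for i in range(9)} for _ in range(20)]
forest = random_forest_samples(T, M=9, K=5, task="classification")
```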
It should be noted that a truly effective data fabric is intelligent, meaning that the system needs to provide automatic suggestions in light of the task load and enterprise data management requirements, depending on the particular use of the data; it must be able to analyze past activity in order to predict the future and form the most appropriate recommendations. The recommendation engine therefore needs good openness in order to support various recommendation schemes and stable service.
Still further, the intelligent recommendation engine functions include intelligent data classification and intelligent SQL (structured query language) association.
The intelligent data classification automatically identifies PII (personally identifiable information) sensitive data based on active metadata and field content sampling, realizing business classification recommendations for data assets; meanwhile, classification labels are propagated in real time along operator-level lineage, completing global data classification at lower cost, with higher timeliness and higher accuracy, and providing basic data support for classified and graded data management.
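A hedged sketch of classification by field-content sampling follows; the two patterns (a mainland-China mobile number and an 18-character ID number) and the 80% threshold are illustrative assumptions, not the patent's actual rule set:

```python
import re

# Hypothetical PII patterns; a real system would maintain a richer,
# configurable library and combine it with active metadata.
PII_PATTERNS = {
    "phone": re.compile(r"^1[3-9]\d{9}$"),
    "id_card": re.compile(r"^\d{17}[\dXx]$"),
}


def classify_field(samples, threshold=0.8):
    """Label a field as a PII type if most sampled values match one pattern."""
    for label, pattern in PII_PATTERNS.items():
        hits = sum(1 for v in samples if pattern.match(str(v)))
        if samples and hits / len(samples) >= threshold:
            return label
    return None


phones = ["13812345678", "15987654321", "13700000000"]
names = ["alice", "bob", "carol"]
```

The returned label would then be attached to the field's metadata and propagated downstream along lineage.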
The intelligent SQL association is based on identifying data asset users and deep mining of data usage behavior, so that when a user writes SQL code, data usage suggestions beyond SQL syntax hints are given, such as commonly joined tables, common filter conditions, and common aggregation dimensions and measure fields, greatly improving the user's SQL writing efficiency and experience.
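Mining commonly joined tables from query history can be sketched as simple co-occurrence counting; the toy regex and table names are assumptions, and a real system would mine the full metadata lake:

```python
import re
from collections import Counter


def join_pairs(sql):
    """Extract (first table, joined table) pairs from one simple statement."""
    tables = re.findall(r"(?:from|join)\s+(\w+)", sql, re.I)
    first = tables[0] if tables else None
    return [(first, t) for t in tables[1:]]


def suggest_joins(history, table, top_n=2):
    """Rank the tables most often joined with `table` in past queries."""
    counts = Counter()
    for sql in history:
        for a, b in join_pairs(sql):
            if table in (a, b):
                counts[b if a == table else a] += 1
    return [t for t, _ in counts.most_common(top_n)]


history = [
    "SELECT * FROM orders JOIN customers ON ...",
    "SELECT * FROM orders JOIN customers ON ...",
    "SELECT * FROM orders JOIN products ON ...",
]
hints = suggest_joins(history, "orders")
```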
S3: with active metadata at the core, using artificial intelligence (AI) and machine learning to automatically catalog data and realize an enhanced data service catalog;
the first problem faced by analysts and business personnel to self-service is how to find data, understand data, and trust data from data that is vast and scattered around the sea. Unlike traditional data dictionaries or data map-like products, enhanced data catalogs are intended to be presented to all users within an enterprise, not just IT (internet technology) personnel, through an intuitive user interface. Unlike IT personnel, analysts and business personnel are unfamiliar with the condition of data assets in enterprises, and when data is searched by self-help through a data dictionary or a data map product, the problems of poor data searching, poor data movement, data dare and the like are often faced.
The enhanced data catalog is intended to let an analyst quickly search for and find data, evaluate which data is the best choice, and then perform data preparation and analysis efficiently and confidently.
Furthermore, the method takes active metadata as a core, uses AI (artificial intelligence) and machine learning for metadata collection, semantic reasoning and classification marking, and automatically catalogs the data, thereby reducing the work of manually maintaining the metadata to the maximum extent.
Still further, AI and machine learning are used for metadata collection, semantic reasoning and classification tagging, and the step of automatically cataloging data is shown in FIG. 2.
Furthermore, the enhanced data service directory realized by the method can provide the following services for business personnel:
(1) Semantic data search
The enhanced data directory provides a powerful search capability that is friendly to business personnel, including supporting searches for keywords, business terms, and natural language, and ordering the search results by relevance and frequency of use to help users find the desired data quickly.
(2) Panoramic data portrait
The enhanced data catalogue helps users evaluate the applicability of data to their analysis requirements by describing the data in full depth, for example with sampled preview data, data quality information, data output timeliness, security sensitivity level, user ratings and reviews, expert user annotations and common usages; this information is generated automatically from active metadata and can significantly improve the efficiency with which users select data.
(3) Visual data lineage analysis
The enhanced data catalogue provides users with a visual lineage analysis tool, so that users can flexibly explore upstream and downstream links, intelligently discover key nodes and key paths, and quickly trace where data comes from and where it goes.
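The upstream and downstream link exploration described above amounts to a graph traversal over lineage edges. A minimal sketch, with hypothetical table names and an in-memory edge list standing in for the metadata graph store (e.g. Neo4j):

```python
# Minimal sketch of upstream/downstream lineage exploration as BFS
# over (source -> derived) edges.  Table names are hypothetical.
from collections import defaultdict, deque

EDGES = [
    ("ods_meter_raw",   "dwd_meter_clean"),
    ("dwd_meter_clean", "dws_daily_load"),
    ("dwd_meter_clean", "dws_customer_usage"),
    ("dws_daily_load",  "ads_load_report"),
]

down, up = defaultdict(set), defaultdict(set)
for s, d in EDGES:
    down[s].add(d)
    up[d].add(s)

def explore(table: str, direction: str = "down") -> set:
    """BFS over the lineage graph; returns all reachable tables."""
    graph = down if direction == "down" else up
    seen, queue = set(), deque([table])
    while queue:
        for nxt in graph[queue.popleft()]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen

# Downstream impact of changing dwd_meter_clean:
print(sorted(explore("dwd_meter_clean")))
```

The same traversal, run upstream, answers "where does this table come from", which is the basis for the key-node and key-path discovery the text mentions.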
(4) Global data search
The catalogue provides the technical capability to directly access global data for interactive federated query, with built-in access protection for sensitive data covering security, privacy and compliance.
S4: constructing a data virtualization engine through federated query, dynamic integration and data orchestration technologies, and realizing full-link self-service data usage;
furthermore, the core of federated query in the method is based on Apache Hive 3 and SQL (Structured Query Language) to realize federated query across databases. Apache Hive automatically identifies the data sources in the statement to be queried based on the configured JDBC data source connections, realizes intelligent JDBC push-down by means of a cost-based optimizer, automatically groups the data sources in the query statement, and finally forms a result set matching the query statement.
Data virtualization is the core of realizing data braiding and carries the key responsibility of letting business personnel complete data integration, preparation and delivery by themselves. It provides a virtual semantic layer for connecting, integrating and consuming data between data sources and data consumers, enabling users to complete data transformation by defining data queries, thereby realizing transparent integration, self-service preparation and high-performance serving of data across sources and across environments (such as multi-cloud, hybrid cloud and SaaS software vendors).
Further, constructing the data virtualization engine includes:
creating connectors that support different data sources, such as JDBC, ODBC, MQTT and AMQP;
acquiring information such as tables and fields in a target data source (or database) through a connector, and intuitively displaying the data in the data source in a foreground visual mode;
creating data services based on create, read, update and delete (CRUD) operations on the object table, automatically exposed as APIs (application programming interfaces);
providing a visual interface in which the data services of several different databases are combined and orchestrated by drag-and-drop to form business-related data services, which are provided externally as API (application programming interface) calls;
the above steps complete the process from the data source to the data service, i.e. complete the data virtualization.
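The connector, metadata discovery and automatic data service steps above can be sketched in a few lines. SQLite stands in here for a JDBC/ODBC connector, and the table and service names are hypothetical:

```python
# Minimal sketch of: connector -> table/field discovery -> auto-generated
# read data service.  SQLite is a stand-in for a JDBC/ODBC source.
import json
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE meter (id INTEGER PRIMARY KEY, kwh REAL)")
conn.execute("INSERT INTO meter VALUES (1, 12.5), (2, 7.3)")

def discover(conn):
    """Step 2: pull table/field metadata from the target data source."""
    tables = {}
    for (name,) in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table'"):
        cols = [c[1] for c in conn.execute(f"PRAGMA table_info({name})")]
        tables[name] = cols
    return tables

def make_read_service(conn, table):
    """Step 3: auto-generate a 'read' data service for one object table."""
    def service(**filters):
        where = " AND ".join(f"{k}=?" for k in filters) or "1=1"
        return conn.execute(
            f"SELECT * FROM {table} WHERE {where}",
            tuple(filters.values())).fetchall()
    return service

catalog = discover(conn)          # would drive the foreground visual display
read_meter = make_read_service(conn, "meter")
print(json.dumps(catalog))
print(read_meter(id=1))
```

Step 4 would then compose several such generated services by drag-and-drop into a business-level service exposed as an API; the sketch stops at the single-source service.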
Still further, data virtualization generally includes two components: the data virtualization presentation layer and data federation. For example, suppose one class of an organization's data is stored in an Oracle database and another class is stored in a DB2 database. The data virtualization presentation layer can provide query services at a virtual or semantic layer, shielding the underlying database storage so that it appears as a single data model; after receiving a query, the underlying data federation mechanism decomposes it into a query part for the Oracle database and a query part for the DB2 database, performs the real data query operations and returns the query result. The whole process not only avoids a large amount of data migration and copying work, but also provides a unified data application view, so that the details of how data is formatted and managed in its original sources are transparent to the data consumer, finally realizing a process in which the consumer defines the form of the returned data and multi-source data is combined according to that form.
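The federation step described above can be sketched concretely. Two in-memory SQLite databases stand in for the Oracle and DB2 sources, and all table and column names are hypothetical; the point is the decomposition into per-source sub-queries and the merge at the virtual layer:

```python
# Minimal sketch of data federation: one logical query is decomposed
# into a sub-query per source, executed there, and merged at the
# virtual layer.  SQLite simulates the Oracle and DB2 sources.
import sqlite3

oracle = sqlite3.connect(":memory:")   # stands in for the Oracle source
db2 = sqlite3.connect(":memory:")      # stands in for the DB2 source
oracle.execute("CREATE TABLE customer (cid INTEGER, name TEXT)")
oracle.execute("INSERT INTO customer VALUES (1, 'Ava'), (2, 'Ben')")
db2.execute("CREATE TABLE usage (cid INTEGER, kwh REAL)")
db2.execute("INSERT INTO usage VALUES (1, 12.5), (2, 7.3)")

def federated_join():
    """Push one sub-query down to each source, then merge the results."""
    names = dict(oracle.execute("SELECT cid, name FROM customer"))
    usage = dict(db2.execute("SELECT cid, kwh FROM usage"))
    # the virtual layer presents one logical row per customer
    return [(cid, names[cid], usage[cid]) for cid in sorted(names)]

print(federated_join())   # -> [(1, 'Ava', 12.5), (2, 'Ben', 7.3)]
```

No data is migrated or copied between the two stores; the consumer sees a single logical model, exactly as the text describes.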
The diversity of data sources, the rapid growth of data volume and the complexity of data demands make it increasingly difficult to design a proper data delivery scheme. The traditional delivery method based on ETL (extract/transform/load) requires users to master a great number of data-technology details: a proper data integration scheme (such as offline batch synchronization or streaming incremental synchronization) must first be selected to collect data into some data store (such as a data warehouse or data lake); ETL tasks must then be written and operated according to the performance characteristics of that store; and finally the scheme best matching the business demand must be selected from various data acceleration options (such as Apache Kylin or ClickHouse) to finish data acceleration and publish the data service. This is almost impossible for non-technical staff to do, and greatly impedes self-service use by business personnel.
The method realizes data virtualization through key technologies such as federated query, dynamic integration, adaptive acceleration and data orchestration. Business personnel can directly explore, prepare and serve global data in a RESTful style based on SQL (Structured Query Language), without concern for data storage locations, data task operation and maintenance, or query performance, thereby truly realizing full-link self-service data usage.
S5: and constructing a DataOps data research and development and management system, and realizing agile and high-quality data delivery.
The core connotation of the DataOps concept is to apply principles similar to those of DevOps, such as agile research and development, continuous integration and continuous deployment, to the data research and development and governance process, so as to realize more agile and higher-quality data delivery.
Further, as shown in fig. 3, based on the DataOps idea, a full-link data service development and operation management and control system is constructed to realize the global data service operation, and the effects are compared as follows:
before data service governance is applied: the data service delivery cycle is long, the risk of data changes is unpredictable, and the compliance and security of data services are difficult to control;
after data service governance is applied: the one-stop data service research and development platform significantly shortens the data service delivery cycle, and service governance is embedded into every link of data service research and development to realize high-quality data service delivery.
Furthermore, the DataOps data operation management and control system has the following characteristics:
(1) One-stop data development: links such as data research and development, testing, publishing, and operation and maintenance are seamlessly connected in series, so that different roles can collaborate efficiently in the same product, friction in each link of the data research and development process is reduced, and agile data delivery becomes possible.
(2) Data change CI (continuous integration)/CD (continuous deployment): continuous integration and continuous deployment are the key to high-frequency delivery of high-quality data. The system has multi-version control capability and supports change-based link integration tests and simulation tests; by comparing link data before and after a change, it discovers the impact of the change on downstream consumers, preventing critical change risks from slipping through or being misjudged. CI/CD greatly decouples the collaboration dependence between data producers and downstream consumers, so that data changes can ultimately be carried out safely.
(3) Embedded governance: the system embeds governance policies into every link of data research and development to ensure that policies for data quality, data security and compliance are enforced before data release, rather than maintaining the health of the data architecture only through after-the-fact governance. After-the-fact governance often incurs costs that are orders of magnitude higher, and its demands conflict with rapidly changing business requirements, making governance work hard to carry out and finally falling into a vicious circle in which the more risk accumulates, the harder governance becomes.
(4) Observable data quality: agile data delivery causes a rapid increase in the frequency of data changes, and to control the risk of data being unavailable (data downtime), observability of data becomes extremely important. The system realizes end-to-end data quality monitoring (such as abnormal fluctuation detection, task timeout prediction, abnormal data volume alarms, source change detection and link change alarms) and, based on ML (machine learning), has the capability to quickly detect data downtime and give coping suggestions.
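The abnormal fluctuation detection mentioned above can be sketched as a simple statistical check on per-day row counts. The counts and the three-sigma threshold are hypothetical; the method's production monitoring would combine this with the ML-based techniques listed:

```python
# Minimal sketch of abnormal-fluctuation detection on daily row counts:
# flag a day whose count deviates from the recent mean by more than
# k standard deviations.  The row counts are hypothetical.
from statistics import mean, stdev

def abnormal(history, today, k=3.0):
    """Return True if today's row count is a k-sigma outlier."""
    mu, sigma = mean(history), stdev(history)
    return abs(today - mu) > k * sigma

daily_rows = [10_020, 9_980, 10_050, 9_940, 10_010]
print(abnormal(daily_rows, 10_035))  # normal day
print(abnormal(daily_rows, 2_450))   # pipeline likely dropped data
```

When such a check fires, the system described here would raise the abnormal-data-volume alarm and trace the affected downstream links via lineage.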
Example 2
Referring to fig. 4-5, for one embodiment of the present invention, a data management method based on a data braiding architecture is provided, and in order to verify the beneficial effects of the present invention, scientific demonstration is performed through economic benefit calculation and simulation experiments.
First, for the method of this embodiment, a data management operation system is constructed by taking as references the data management platform currently used for data governance of the Yunnan power grid and the required types of data analysis.
Referring to fig. 4, for the overall architecture of the method: by constructing an active metadata management tool and metadata knowledge graph, constructing a data virtualization engine, implementing an enhanced data catalogue and constructing a DataOps data research and development and governance system, agile and high-quality data delivery is realized.
Fig. 5 is a digital operation system of the power grid data constructed based on the method.
The enhanced data services directory is as follows:
cloud platform: docker, kubernates, virtual machines and physical machines;
data center: MySQL, MPP, HBase, MongoDB, Dameng, Flink and Kafka;
data integration: dynamic integration, data service scanning, data service generation, data service customization, automatic data service registration configuration and federated query;
Data service operation: service rights management, service rights inheritance, link tracking, service directory, service monitoring and log auditing;
policy type service gateway: service route management, service authentication policy configuration, service fusing policy configuration and service flow limiting policy configuration;
CI/CD: combining service release, service automation test, container resource monitoring, elastic telescopic strategy configuration and service+container full life cycle management;
service orchestration: visual drag-and-drop low-code orchestration, data conversion, protocol conversion capability, automatic scheduling flow, breakpoint resume capability and service aggregation capability;
active metadata management tool: metadata lake, knowledge graph and panoramic data portrait;
intelligent recommendation engine: an intelligent data classification engine, a SQL association engine, a user behavior prediction engine, an intelligent asset recommendation engine and a data relationship depiction engine;
compared with the traditional power data management method, the data governance practice constructed based on the method reduces manually driven data governance tasks by 50%, application integration design time by 30%, system deployment time by 30%, and system maintenance time by 70%, and can reduce data quality and operation costs by 65%, significantly improving data governance efficiency, shortening the time required for data delivery, and helping enterprises reduce costs and increase efficiency.
The benefits brought by data braiding are mainly reflected in the following four aspects:
First, the data user experience is improved and data delivery and supply are accelerated. The establishment of an enterprise-wide global data catalogue and the application of AI technologies such as semantic search, knowledge graphs and NLP enable users to conveniently and rapidly acquire rich, reliable and high-quality data, and to focus more of their time on business scenarios and data analysis instead of searching for and identifying data.
Second, the integration and analysis mode is simplified and the data silo problem is solved. The enterprise organically connects scattered, dynamic and diverse data sources together through virtual data links, breaking down the barriers that prevent data from being integrated and analyzed jointly, so that data can be explored and accessed without developing and deploying ETL jobs. As new data sources are connected rapidly and the whole flow evolves intelligently, the system scale and user experience of data braiding continuously improve, access restrictions caused by data being stored in different environments are avoided, and data silos are effectively eliminated. Meanwhile, the change of integration mode greatly reduces duplicate copies of the same data, lowering the cost of data storage, maintenance and management.
Third, comprehensive data governance is supported and security and privacy protection are strengthened. Data braiding helps enterprises realize comprehensive data governance with unified access control and privacy protection policies, ensures that risk is controllable throughout data analysis and application, meets regulatory requirements, and prevents leakage of private data.
Fourth, user data needs are understood and an intelligent consumption community is built.
By recording users' data access footprints, their data usage, combination preferences and access patterns are obtained. On the one hand, more opportunities for cooperative sharing among business applications can be discovered, promoting deeper use of data; on the other hand, data distribution, flow direction and commonly used consolidated assets can be actively recommended or automatically optimized, reducing large-scale, frequent cross-region and cross-system movement of data, improving user access efficiency and reducing overall hardware resource overhead.
It should be noted that the above embodiments are only for illustrating the technical solution of the present invention and not for limiting the same, and although the present invention has been described in detail with reference to the preferred embodiments, it should be understood by those skilled in the art that the technical solution of the present invention may be modified or substituted without departing from the spirit and scope of the technical solution of the present invention, which is intended to be covered in the scope of the claims of the present invention.

Claims (10)

1. A data management method based on a data braiding architecture, comprising:
constructing an active metadata management tool, forming a metadata knowledge graph, and generating a panoramic data portrait;
based on business experience and a machine learning model, deep mining is carried out on metadata to form an intelligent recommendation engine;
taking active metadata as the core, using AI (artificial intelligence) and machine learning to realize automatic cataloging of data and an enhanced data service catalogue;
constructing a data virtualization engine through federated query, dynamic integration and data orchestration technologies, and realizing full-link self-service data usage;
and constructing a DataOps data research and development and management system, and realizing agile and high-quality data delivery.
2. The data management method based on the data braiding architecture of claim 1, wherein: the step of constructing an active metadata management tool includes:
establishing connection with a data warehouse and a traditional relational database;
using an automatic pipeline mode to configure the metadata acquisition time, acquisition objects and processing scripts, and collecting the specified metadata information of the relevant data on a timed basis according to the configuration;
constructing a relation mapping between the table and the fields based on the obtained result information to form metadata assets;
processing metadata assets based on the existing business relation dictionary table, establishing association relation among tables, fields and businesses, and manually checking processing results to form metadata lakes;
and using a Neo4j component to display the data in the metadata lake as a graph, wherein the lineage between metadata can be optimized through manual intervention.
3. A data management method based on a data braiding architecture as claimed in claim 1 or 2, wherein: the active metadata comprises a metadata lake and data flow automation;
the metadata lake serves as the cornerstone of active metadata and comprises technical metadata, business metadata, operational metadata, social metadata and all events that occur on the data and actions taken on the data;
the data flow automation includes: automatically collecting data distribution information, automatically classifying sensitive data and performing business classification on global data, while diffusing this information in real time based on data lineage, so as to realize classified and graded data governance and compliance management policies.
4. The data management method based on the data braiding architecture of claim 1, wherein: the knowledge graph uses nodes and edges to represent data information and the relations between data information; entity linking and quantification of connection relations are carried out through AI/ML algorithms, and the association relations between data and data, between data and users, and between data and business semantics are automatically mined and established to form a semantic knowledge graph.
5. The data management method based on the data braiding architecture of claim 1, wherein: the machine learning model is a random forest algorithm; model training, logistic regression and business fitting are carried out on the metadata information in the metadata lake, the result set formed by the algorithm is matched with data users, and the results are sorted according to the degree of matching.
6. The data management method based on a data braiding architecture of claim 1 or 5, wherein: the step of deep mining the metadata includes:
setting the size of a training set T as N, the number of features as M and the size of a random forest as K;
traversing the size of the random forest K times;
sampling with replacement from the training set T, forming a new sub-training set D by sampling N times;
randomly selecting M features, where M < M;
learning a complete decision tree by using the new training set D and m features;
obtaining a random forest.
7. The data management method based on the data braiding architecture of claim 6, wherein: the feature number selection further includes: for data classification problems, log₂M features are used at each division, and for regression problems, M/3 features are selected, but not less than 5 features.
8. The data management method based on the data braiding architecture of claim 1, wherein: the federated query refers to realizing federated query across databases based on Apache Hive 3 and SQL (Structured Query Language); Apache Hive automatically identifies the data sources in the statement to be queried based on the configured JDBC data source connections, realizes intelligent JDBC push-down by means of a cost-based optimizer, automatically groups the data sources in the query statement, and finally forms a result set matching the query statement.
9. The data management method based on the data braiding architecture of claim 1, wherein: the build data virtualization engine comprises:
creating connectors that support different data sources;
acquiring information such as tables and fields in a target data source or database through a connector, and intuitively displaying the data in the data source in a foreground visual mode;
creating data services based on create, read, update and delete (CRUD) operations on the object table, automatically exposed as APIs (application programming interfaces);
and providing a visual interface in which the data services of several different databases are combined and orchestrated by drag-and-drop to form business-related data services, which are provided externally as API (application programming interface) calls.
10. The data management method based on the data braiding architecture of claim 1, wherein: the data virtualization comprises a data virtualization presentation layer and data federation; the data virtualization presentation layer provides query services at a virtual or semantic layer and shields the underlying database storage; after receiving a query instruction, the data federation mechanism decomposes it into a query part for the Oracle database and a query part for the DB2 database, performs the real data query operations and returns the query result.
CN202211504450.5A 2022-11-28 2022-11-28 Data management method based on data braiding architecture Pending CN116303336A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211504450.5A CN116303336A (en) 2022-11-28 2022-11-28 Data management method based on data braiding architecture

Publications (1)

Publication Number Publication Date
CN116303336A true CN116303336A (en) 2023-06-23

Family

ID=86792961

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211504450.5A Pending CN116303336A (en) 2022-11-28 2022-11-28 Data management method based on data braiding architecture

Country Status (1)

Country Link
CN (1) CN116303336A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117331926A (en) * 2023-12-01 2024-01-02 太平金融科技服务(上海)有限公司 Data auditing method and device, electronic equipment and storage medium
CN117331926B (en) * 2023-12-01 2024-03-01 太平金融科技服务(上海)有限公司 Data auditing method and device, electronic equipment and storage medium


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination