CN113553381A - Distributed data management system based on novel pipeline scheduling algorithm


Info

Publication number: CN113553381A
Application number: CN202110853719.XA
Authority: CN (China)
Prior art keywords: data, module, management, pipeline, algorithm
Legal status: Pending
Other languages: Chinese (zh)
Inventors: 张绍蓉, 邵炜晨, 邢祎, 冀晓镭
Current assignee: China Building Materials Xinyun Zhilian Technology Co ltd; Cnbm Technology Corp ltd
Original assignee: China Building Materials Xinyun Zhilian Technology Co ltd; Cnbm Technology Corp ltd
Application filed by China Building Materials Xinyun Zhilian Technology Co ltd and Cnbm Technology Corp ltd
Priority to CN202110853719.XA
Publication of CN113553381A

Classifications

    • G06F 16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F 21/602 Providing cryptographic facilities or services
    • G06F 21/6218 Protecting access to data via a platform, e.g. using keys or access control rules, to a system of files or objects, e.g. local or distributed file system or database
    • G06F 9/4881 Scheduling strategies for dispatcher, e.g. round robin, multi-level priority queues
    • G06F 9/5083 Techniques for rebalancing the load in a distributed system
    • G06F 9/544 Buffers; Shared memory; Pipes

Abstract

The invention relates to the technical field of data management, and in particular to a distributed data management system based on a novel pipeline scheduling algorithm. The system comprises an infrastructure unit, a pipeline scheduling unit, a unified service unit and a data management unit. The infrastructure unit manages the network architecture supporting the operation of the system; the pipeline scheduling unit manages the scheduling of inter-process communication pipelines; the unified service unit provides unified, standardized management of data; the data management unit governs the data comprehensively. The design of the invention realizes distributed data management, improves the degree of integration of the system and enhances its extensibility; it improves transmission during data exchange and improves the integrity and security of the data; and by scheduling and managing data centrally, the architecture facilitates node expansion of the system, supports high-concurrency access by a large number of users, enables visual management of the full data life cycle, shortens the data management cycle and reduces wasted cost.

Description

Distributed data management system based on novel pipeline scheduling algorithm
Technical Field
The invention relates to the technical field of data management, in particular to a distributed data management system based on a novel pipeline scheduling algorithm.
Background
For an enterprise, the most important asset is data: the core value of its data can be understood as its core business value. Realizing that value requires analyzing business data and mining it for value, and the value can only be presented well when the data are clean and complete. Most traditional enterprises have no awareness of quality management for their data resources and do not know how to govern data, so data quality is poor (data loss, data silos, data distortion and the like) and the data resources cannot be exploited for further commercial value. A major reason for poor data quality is that data transmission channels are not well established and managed during data exchange, so problems such as data loss and information leakage are likely to occur; subsequent data governance then becomes time-consuming and labor-intensive, and management efficiency is low. At present there is no data management system platform that focuses on managing the data exchange process, so data cannot be well collected and scheduled, which affects the construction quality of enterprise information systems and restricts the development of enterprises.
Disclosure of Invention
The invention aims to provide a distributed data management system based on a novel pipeline scheduling algorithm, so as to solve the problems raised in the background art above.
To solve the above technical problem, one of the objects of the present invention is to provide a distributed data management system based on the novel pipeline scheduling algorithm, comprising:
an infrastructure unit, a pipeline scheduling unit, a unified service unit and a data management unit; the infrastructure unit, the pipeline scheduling unit, the unified service unit and the data management unit are connected in sequence through network communication; the infrastructure unit is used for providing and managing the basic network topology supporting system operation; the pipeline scheduling unit is used for providing and managing different inter-process communication pipelines and performing distribution scheduling; the unified service unit is used for enabling enterprises to manage data in a unified, standardized way and to analyze and utilize it, through a number of unified service functions; the data management unit is used for performing quality control and comprehensive governance of the data;
the infrastructure unit comprises an application platform module, a data source module, a technical support module and an algorithm management module;
the pipeline scheduling unit comprises an interprocess communication module, an anonymous pipeline module, a named pipeline module and a polling scheduling module;
the unified service unit comprises a metadata management module, a service gateway module, a data storage module and an identity authentication module;
the data management unit comprises a data integration module, a data exchange module, a data governance module, a master data management module and a data application module.
As a further improvement of the technical solution, the application platform module, the data source module, the technical support module and the algorithm management module are connected in sequence through network communication and run in parallel; the application platform module is used for constructing a multifunctional data governance application platform on the basis of big data and blockchain technology so as to realize human-computer interaction; the data source module is used for building channels for signal connection and data transmission between the system and each data source and for managing and distributing the data source platforms; the technical support module is used for loading various intelligent technologies to support and perfect the functionality of the system; the algorithm management module is used for packaging various intelligent algorithms supporting system functions.
The intelligent technologies include, but are not limited to, big data technology, big data analysis technology, blockchain technology, front-end/back-end separation technology, and the like.
The intelligent algorithms include, but are not limited to, intelligent scheduling algorithms, encryption algorithms, consensus algorithms, matching algorithms, and the like.
As a further improvement of the technical solution, the signal output of the inter-process communication module is connected with the signal inputs of the anonymous pipeline module and the named pipeline module, the anonymous pipeline module and the named pipeline module run in parallel, and their signal outputs are connected with the signal input of the polling scheduling module; the inter-process communication module is used for opening up a buffer in the kernel, with memory as the medium, to realize communication for inter-process data exchange; the anonymous pipeline module is used for creating an anonymous pipe between related processes (such as a parent process and its child) by calling the pipe function, providing one-way communication; the named pipeline module is used for creating a system-visible named pipe (FIFO) between any two processes by calling the mknod or mkfifo command, so that the two processes can communicate; the polling scheduling module is used for scheduling the pipelines in a round-robin manner to achieve load balancing of pipeline scheduling.
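For illustration, a minimal Python sketch of the two pipe mechanisms described above (the helper functions and the FIFO path are illustrative assumptions; it assumes a Unix-like system where os.fork, os.pipe and os.mkfifo are available):

```python
import os

# Anonymous pipe: only works between related processes (parent/child),
# because the pipe descriptors are inherited across fork().
def anonymous_pipe_demo():
    read_fd, write_fd = os.pipe()          # kernel buffer, one-way channel
    pid = os.fork()
    if pid == 0:                           # child: writes into the pipe
        os.close(read_fd)
        os.write(write_fd, b"hello from child")
        os.close(write_fd)
        os._exit(0)
    os.close(write_fd)                     # parent: reads from the pipe
    data = os.read(read_fd, 1024)
    os.close(read_fd)
    os.waitpid(pid, 0)
    return data

# Named pipe (FIFO): visible in the file system, so any two processes
# that know the path can communicate through it.
def named_pipe_demo(path="/tmp/demo_fifo"):
    if not os.path.exists(path):
        os.mkfifo(path)                    # equivalent to the mkfifo command
    pid = os.fork()
    if pid == 0:
        with open(path, "wb") as w:        # writer process
            w.write(b"hello over the FIFO")
        os._exit(0)
    with open(path, "rb") as r:            # reader process
        data = r.read()
    os.waitpid(pid, 0)
    os.unlink(path)
    return data

if __name__ == "__main__":
    print(anonymous_pipe_demo())
    print(named_pipe_demo())
```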
As a further improvement of the technical solution, in the polling scheduling module, the scheduling algorithms include a polling (round-robin) scheduling algorithm and a weighted polling scheduling algorithm improved on its basis, wherein: the polling scheduling algorithm is suitable when all servers in the server group have the same software and hardware configuration and the average service requests are relatively balanced, while the weighted polling scheduling algorithm is suitable when the configuration, installed service applications and processing capacity of the servers in the group differ; in the weighted polling scheduling algorithm, the weights are allocated according to a calculation expression in which the referenced quantity is the server's load (the expression itself appears in the original filing only as an image and is not reproduced here).
As a further improvement of the technical solution, the metadata management module, the service gateway module, the data storage module and the identity authentication module communicate with each other in sequence over the network and operate in parallel; the metadata management module is used for providing unified management of information-description metadata and information-authorization metadata, metadata descriptions, metadata interfaces and a high-performance metadata access mechanism; the service gateway module is used for providing a unified data service gateway, realizing safe and controllable docking and services between internal and external systems and supporting full life-cycle management of APIs; the data storage module is used for providing a unified distributed data storage service that can store and access various types of data (structured data, unstructured data, files and the like) in a unified way and realize self-description and information authorization through metadata management information; the identity authentication module is used for providing unified user management and identity authentication services and supporting multi-mode access and third-party identity authentication, so as to meet the identity authentication requirements of different application systems.
Unified metadata management has rich built-in collection adapters, realizes end-to-end automatic collection and one-click metadata analysis, quickly inventories data resources, clarifies the lineage of the data, and can build a data map to provide basic support for data standard construction and data quality; the functional items of metadata management include, but are not limited to, metadata collection, metadata retrieval, data maps, lineage (blood-relationship) analysis, impact analysis, and the like.
In the unified identity authentication process, the login methods include, but are not limited to, mobile phone, e-mail, QQ, WeChat, Weibo (microblog) and the like.
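To make metadata management and lineage analysis concrete, a minimal sketch (the record fields and catalog structure are illustrative assumptions, not a schema defined by the patent; it assumes every referenced asset_id is present in the catalog):

```python
from dataclasses import dataclass, field
from typing import Dict, List, Set

@dataclass
class MetadataRecord:
    """Describes one managed data asset and its lineage links."""
    asset_id: str                  # unique identifier of the table/file/API
    name: str
    owner: str
    storage_uri: str               # where the unified storage layer keeps it
    schema: dict                   # column name -> type, for structured data
    upstream: List[str] = field(default_factory=list)    # assets it is derived from
    downstream: List[str] = field(default_factory=list)  # assets derived from it
    tags: List[str] = field(default_factory=list)        # e.g. "sensitive", "master-data"

# Impact analysis is then a graph walk over the downstream links.
def impacted_assets(catalog: Dict[str, MetadataRecord], asset_id: str) -> Set[str]:
    seen, stack = set(), [asset_id]
    while stack:
        current = stack.pop()
        for child in catalog[current].downstream:
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen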
As a further improvement of the technical scheme, the data integration module, the data exchange module, the data governance module, the master data management module and the data application module are connected in sequence through network communication and operate in an interleaved manner; the data integration module is used for loading, cleaning, converting and integrating cross-region data and supports custom scheduling and graphical monitoring, so that unified scheduling and unified monitoring meet the visualization requirements of operation and maintenance; the data exchange module is used for transmitting and sharing data or files among multiple business subsystems, integrating data collection, processing, distribution and exchange transmission; the data governance module is used for governing data in an integrated way from multiple aspects; the master data management module is used for establishing a unified view of, and centralized management over, the data to be shared, so as to provide optimal data for business-system data calls; the data application module is used for providing the governed, high-quality data to users for invocation and application, through deep mining, statistical reports and other forms.
The functional items of data integration include, but are not limited to, real-time collection, cleansing and conversion, encryption and desensitization, and the like.
The functional items of master data management include, but are not limited to, identifier application, master data retrieval, master data storage, shared publishing, master data monitoring, and the like.
As a further improvement of the technical scheme, the data exchange module comprises a data transmission module, a node management module, a pipeline matching module and an exchange approval module; the data transmission module, the node management module, the pipeline matching module and the exchange approval module are connected in sequence in communication; the data transmission module is used for realizing secure data transmission in multiple modes through multiple data exchange components; the node management module is used for visually configuring the data-transmission exchange nodes, controlling the data transmission state of each node, supporting the transmission of data in various formats and shielding the data-type differences existing between systems; the pipeline matching module is used for selecting the inter-process pipeline to be used through a matching algorithm, according to the results of correlation analysis between metadata and business data and between processes; the exchange approval module is used for supporting the design of data exchange task modes and realizing inspection and approval of data exchange tasks through flexible extraction and exchange of data.
In the data exchange process, the data exchange components include, but are not limited to, table exchange, file transmission, SFTP upload and download, HTTP components and the like; the data transmission modes include, but are not limited to, encrypted data transmission, resumable (breakpoint) transmission, desensitization algorithms, dual control of data permissions and function permissions, data partitioning, parallel loading technology and the like.
In the node management process, the data formats include, but are not limited to, mainstream databases, text files, Excel files, API interfaces, WebService services, and the like.
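As one way to picture how an exchange node can shield format differences between systems, a minimal adapter-registry sketch (the formats and reader functions shown are illustrative assumptions, not components defined by the patent):

```python
import csv
import json
from pathlib import Path

# Each exchange node registers a reader per data format, so downstream
# modules see plain Python dicts regardless of the source format.
READERS = {}

def reader(fmt):
    def register(fn):
        READERS[fmt] = fn
        return fn
    return register

@reader("csv")
def read_csv(path):
    with open(path, newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))

@reader("json")
def read_json(path):
    return json.loads(Path(path).read_text(encoding="utf-8"))

def load(path, fmt):
    if fmt not in READERS:
        raise ValueError(f"no adapter registered for format {fmt!r}")
    return READERS[fmt](path)   # normalized output, independent of the source
```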
As a further improvement of the technical solution, in the pipeline matching module, a correlation-based matching algorithm is adopted to judge the correlation between the process relationship and the pipeline type, and the NCC (normalized cross-correlation) coefficient is used to express the degree of correlation between the process relationship and the pipeline type; its calculation expression is:
NCC(X, Y) = E[(X - μ_X)(Y - μ_Y)] / (σ_X σ_Y)
where X and Y are two random variables, μ_X and μ_Y are their respective means, and σ_X and σ_Y are their standard deviations;
the denominator of the above formula is the standard deviation of two random variables, which plays a role of normalization, and the numerator of the above formula is also subtracted by the mean value of the two random variables, which is called centering.
Neither normalization nor centering is the essence of the correlation coefficient, and after the normalization and centering are stripped, the remaining part is the expectation of the product of two random variables, namely the inner product of two vectors, which is the essence of the correlation coefficient.
In particular, it is understood that: sampling the random variable X, Y for multiple times, and putting the sampled samples into two vectors, so that the inner product of the two vectors is the expectation of the product of X, Y; the reverse is understood as: considering the two vectors as a jointly distributed column of two random variables, the inner product of the two vectors is the expectation of the product of the two random variables.
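A small sketch of the normalized cross-correlation computation described above (how process-relationship features and pipeline-type profiles would be encoded as sample sequences is an assumption, not something specified here):

```python
import math

def ncc(xs, ys):
    """Normalized cross-correlation of two equally long sample sequences."""
    if len(xs) != len(ys) or not xs:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # Centering: subtract the means; the numerator is the expectation of the product.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / n
    # Normalization: divide by the standard deviations.
    std_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs) / n)
    std_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys) / n)
    if std_x == 0 or std_y == 0:
        return 0.0
    return cov / (std_x * std_y)

# A matcher could score each candidate pipeline type against the observed
# process-relationship features and pick the best-scoring one.
def match_pipeline(process_features, pipeline_profiles):
    return max(pipeline_profiles, key=lambda name: ncc(process_features, pipeline_profiles[name]))

print(ncc([1, 2, 3, 4], [2, 4, 6, 8]))   # perfectly correlated -> 1.0
```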
As a further improvement of the technical scheme, the data governance module comprises a standard management module, a quality management module, a security management module and a life cycle module; the standard management module, the quality management module, the security management module and the life cycle module are connected in sequence through network communication; the standard management module is used for providing a comprehensive and complete data standard management process and method, determining and establishing a single, accurate and authoritative source of fact, realizing complete, effective, consistent, standardized, open and shared management of data, and providing the standards against which data quality is checked and data security is managed; the quality management module is used for integrating the workflow of quality assessment, quality checking, quality correction, quality reporting and the like, through wizard-based, visual and other operating means, into a complete closed loop of data quality management that takes the data standards as the basis of the checks and the metadata as the object being checked; the security management module is used for providing, throughout data governance, data security measures such as encryption, desensitization, fuzzification and database authorization monitoring for private data, so as to comprehensively guarantee the secure operation of the data; the life cycle module is used for recording the whole flow of data from creation and initial storage to deletion when it becomes obsolete, and for near-line archiving, offline archiving, destruction and full life-cycle monitoring of data during its storage period.
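As an illustration of the desensitization measures mentioned above, a minimal field-masking sketch (the masking rules and field names are generic examples, not rules defined by the patent):

```python
import hashlib
import re

# Field-level desensitization: mask direct identifiers, pseudonymize keys
# so records can still be joined, and pass other fields through unchanged.
def mask_phone(value: str) -> str:
    return value[:3] + "****" + value[-4:] if len(value) >= 7 else "****"

def pseudonymize(value: str, salt: str = "governance-demo") -> str:
    return hashlib.sha256((salt + value).encode()).hexdigest()[:16]

RULES = {
    "phone": mask_phone,
    "id_card": lambda v: re.sub(r"\d(?=\d{4})", "*", v),  # keep only the last 4 digits
    "user_id": pseudonymize,
}

def desensitize(record: dict) -> dict:
    return {k: RULES[k](v) if k in RULES else v for k, v in record.items()}

print(desensitize({"user_id": "u-1001", "phone": "13812345678", "city": "Beijing"}))
```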
The functional items of data standard management include, but are not limited to, standard publishing, standard mapping, standard querying and the like.
The functional items of data quality management include, but are not limited to, rule management, quality reporting, problem rectification, and the like.
The functional items of data security management include, but are not limited to, change tracking, anomaly monitoring, data desensitization, secure transmission, privacy removal, and the like.
The functional items of data life-cycle management include, but are not limited to, data policies, level agreements, classification evaluation, data archiving, data destruction, and the like.
The invention also aims to provide a method for operating the distributed data management system based on the novel pipeline scheduling algorithm, which comprises the following steps:
S1, building the distributed data management system based on blockchain and big data technology, and connecting the application platform with a plurality of data sources;
S2, loading the corresponding creation modes of the inter-process communication pipelines and the polling scheduling algorithm;
S3, managing the data management system in a unified way, presetting the management of the metadata, and monitoring and adjusting the standard quality of the metadata;
S4, when data are exchanged between a data source and the data integration module, selecting and establishing the corresponding inter-process pipeline through the correlation-based matching algorithm, based on the rules of the metadata for the different processes and the results of lineage analysis, so as to realize secure data transmission;
S5, performing security management and life-cycle management on the data, performing unified collection, conversion, cleaning, archiving and other management on the data, and providing high-quality golden data;
S6, applying the governed, high-quality data to front-end services;
S7, a user accesses the system in various ways, logs in to the application platform after unified identity authentication, obtains high-quality data from the system, and calls the relevant data for application.
The invention also provides an apparatus for operating the distributed data management system based on the novel pipeline scheduling algorithm, comprising a processor, a memory, and a computer program stored in the memory and running on the processor, wherein the processor, when executing the computer program, implements any of the above distributed data management systems based on the novel pipeline scheduling algorithm.
It is a further object of the present invention to provide a computer-readable storage medium storing a computer program which, when executed by a processor, implements any of the above distributed data management systems based on the novel pipeline scheduling algorithm.
Compared with the prior art, the invention has the following beneficial effects:
1. The distributed data management system based on the novel pipeline scheduling algorithm is built on big data and blockchain and can construct a system architecture that is wide in coverage, stable and parallel, realizing distributed data management, improving the degree of integration of the system and enhancing its extensibility;
2. The distributed data management system based on the novel pipeline scheduling algorithm manages the creation, selection and scheduling of inter-process communication, which can improve transmission during data exchange, reduce data loss and leakage, and improve the integrity and security of the data;
3. By collecting, scheduling and managing the data centrally, the distributed architecture of the system facilitates node expansion, supports high-concurrency access by a large number of users, enables visual management of the full data life cycle, shortens the data management cycle, reduces wasted cost, and promotes the construction and development of enterprise information systems.
Drawings
FIG. 1 is a block diagram of an exemplary product architecture of the present invention;
FIG. 2 is a block diagram of the overall system apparatus of the present invention;
FIG. 3 is a first partial block diagram of the system apparatus of the present invention;
FIG. 4 is a second partial block diagram of the system apparatus of the present invention;
FIG. 5 is a third partial block diagram of the system apparatus of the present invention;
FIG. 6 is a fourth partial block diagram of the system apparatus of the present invention;
FIG. 7 is a fifth partial block diagram of the system apparatus of the present invention;
FIG. 8 is a sixth partial block diagram of the system apparatus of the present invention;
FIG. 9 is a block diagram of an exemplary electronic computer product device of the present invention.
The various reference numbers in the figures mean:
100. an infrastructure unit; 101. an application platform module; 102. a data source module; 203. a technical support module; 204. an algorithm management module;
200. a pipeline scheduling unit; 201. an inter-process communication module; 202. an anonymous pipeline module; 203. a named pipeline module; 204. a polling scheduling module;
300. unifying the service units; 301. a metadata management module; 302. a service gateway module; 303. a data storage module; 304. an identity authentication module;
400. a data management unit; 401. a data integration module; 402. a data exchange module; 4021. a data transmission module; 4022. a node management module; 4023. a pipeline matching module; 4024. an exchange approval module; 403. a data governance module; 4031. a standard management module; 4032. a quality management module; 4033. a security management module; 4034. a life cycle module; 404. a master data management module; 405. a data application module.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
Example 1
As shown in FIGS. 1-9, the present embodiment provides a distributed data governance system based on a new pipeline scheduling algorithm, comprising
An infrastructure unit 100, a pipe scheduling unit 200, a unified service unit 300, and a data management unit 400; the infrastructure unit 100, the pipeline scheduling unit 200, the unified service unit 300 and the data management unit 400 are connected in sequence through network communication; the infrastructure unit 100 is used to provide and manage the basic network topology supporting the system operation; the pipeline scheduling unit 200 is configured to provide and manage different interprocess communication pipelines and perform distribution scheduling; the unified service unit 300 is used for implementing unified standardized management and analysis utilization of data by enterprises through a plurality of unified service functions; the data management unit 400 is used for quality control and comprehensive treatment of data;
the infrastructure unit 100 comprises an application platform module 101, a data source module 102, a technical support module 203 and an algorithm management module 204;
the pipeline scheduling unit 200 comprises an interprocess communication module 201, an anonymous pipeline module 202, a named pipeline module 203 and a polling scheduling module 204;
the unified service unit 300 comprises a metadata management module 301, a service gateway module 302, a data storage module 303 and an identity authentication module 304;
the data management unit 400 includes a data integration module 401, a data exchange module 402, a data governance module 403, a master data management module 404, and a data application module 405.
In this embodiment, the application platform module 101, the data source module 102, the technical support module 203, and the algorithm management module 204 are connected in sequence through network communication and run in parallel; the application platform module 101 is used for constructing a multifunctional data governance application platform based on big data and a block chain technology to realize human-computer interaction; the data source module 102 is used for building a channel for signal connection and data transmission between the system and each data source and managing and distributing the data source platform; the technical support module 203 is used for loading various intelligent technologies to support and perfect the functionality of the system; the algorithm management module 204 is used to package a variety of intelligent algorithms that support system functions.
The intelligent technology includes, but is not limited to, big data technology, big data analysis technology, block chain technology, front-end and back-end separation technology, and the like.
The intelligent algorithm includes, but is not limited to, an intelligent scheduling algorithm, an encryption algorithm, a consensus algorithm, a matching algorithm, and the like.
In this embodiment, the signal output of the inter-process communication module 201 is connected with the signal inputs of the anonymous pipeline module 202 and the named pipeline module 203, the anonymous pipeline module 202 and the named pipeline module 203 run in parallel, and their signal outputs are connected with the signal input of the polling scheduling module 204; the inter-process communication module 201 is configured to open up a buffer in the kernel, with memory as the medium, to implement communication for inter-process data exchange; the anonymous pipe module 202 is used for creating an anonymous pipe between related processes (such as a parent process and its child) by calling the pipe function, providing one-way communication; the named pipe module 203 is used for creating a system-visible named pipe (FIFO) between any two processes by calling the mknod or mkfifo command, so that the two processes can communicate; the polling scheduling module 204 is configured to schedule the pipes in a round-robin manner to achieve load balancing of the pipe scheduling.
Specifically, in the polling scheduling module 204, the scheduling algorithms include a polling (round-robin) scheduling algorithm and a weighted polling scheduling algorithm improved on its basis, wherein: the polling scheduling algorithm is suitable when all servers in the server group have the same software and hardware configuration and the average service requests are relatively balanced, while the weighted polling scheduling algorithm is suitable when the configuration, installed service applications and processing capacity of the servers in the group differ; in the weighted polling scheduling algorithm, the weights are allocated according to a calculation expression in which the referenced quantity is the server's load (the expression itself appears in the original filing only as an image and is not reproduced here).
In this embodiment, the metadata management module 301, the service gateway module 302, the data storage module 303 and the identity authentication module 304 communicate with each other in sequence over the network and operate in parallel; the metadata management module 301 is configured to provide unified management of information-description metadata and information-authorization metadata, metadata descriptions, metadata interfaces and a high-performance metadata access mechanism; the service gateway module 302 is used for providing a unified data service gateway to realize safe and controllable docking and services between internal and external systems and to support full life-cycle management of APIs; the data storage module 303 is used for providing a unified distributed data storage service that can store and access various types of data (structured data, unstructured data, files and the like) in a unified way and realize self-description and information authorization through metadata management information; the identity authentication module 304 is used for providing unified user management and identity authentication services and supporting multi-mode access and third-party identity authentication, so as to meet the identity authentication requirements of different application systems.
Unified metadata management has rich built-in collection adapters, realizes end-to-end automatic collection and one-click metadata analysis, quickly inventories data resources, clarifies the lineage of the data, and can build a data map to provide basic support for data standard construction and data quality; the functional items of metadata management include, but are not limited to, metadata collection, metadata retrieval, data maps, lineage (blood-relationship) analysis, impact analysis, and the like.
In the unified identity authentication process, the login methods include, but are not limited to, mobile phone, e-mail, QQ, WeChat, Weibo (microblog) and the like.
In this embodiment, the data integration module 401, the data exchange module 402, the data governance module 403, the master data management module 404 and the data application module 405 are connected in sequence through network communication and operate in an interleaved manner; the data integration module 401 is used for loading, cleaning, converting and integrating cross-region data and supports custom scheduling and graphical monitoring, so that unified scheduling and unified monitoring meet the visualization requirements of operation and maintenance; the data exchange module 402 is used for transmitting and sharing data or files among multiple business subsystems, integrating data collection, processing, distribution and exchange transmission; the data governance module 403 is used for governing data in an integrated way from multiple aspects; the master data management module 404 is configured to establish a unified view of, and centralized management over, the data to be shared, so as to provide optimal data for business-system data calls; the data application module 405 is configured to provide the governed, high-quality data to users for invocation and application, through deep mining, statistical reports and other forms.
The functional items of data integration include, but are not limited to, real-time acquisition, cleansing conversion, encryption desensitization, and the like.
The functional items of the master data management include, but are not limited to, identification application, master data retrieval, master data storage, shared publishing, master data monitoring, and the like.
Further, the data exchange module 402 includes a data transmission module 4021, a node management module 4022, a pipeline matching module 4023 and an exchange approval module 4024; the data transmission module 4021, the node management module 4022, the pipeline matching module 4023 and the exchange approval module 4024 are connected in sequence in communication; the data transmission module 4021 is configured to implement secure data transmission in multiple modes through multiple data exchange components; the node management module 4022 is configured to visually configure the data-transmission exchange nodes, control the data transmission state of each node, support the transmission of data in various formats and shield the data-type differences among the systems; the pipeline matching module 4023 is configured to select the inter-process pipeline to be used through a matching algorithm, according to the results of correlation analysis between metadata and business data and between processes; the exchange approval module 4024 is configured to support the design of data exchange task modes and to implement inspection and approval of data exchange tasks through flexible extraction and exchange of data.
In the data exchange process, the data exchange components include but are not limited to table exchange, file transmission, SFTP uploading and downloading, Http components and the like; the data transmission mode includes but is not limited to data encryption transmission, power-off continuous transmission, desensitization algorithm, dual control of data authority and function authority, data partitioning, parallel loading technology and the like.
In the node management process, the data format includes but is not limited to a mainstream database, a text file, an Excel file, an API interface, a WebService service, and the like.
Specifically, in the pipeline matching module 4023, a correlation-based matching algorithm is adopted to judge the correlation between the process relationship and the pipeline type, and the NCC (normalized cross-correlation) coefficient is used to express the degree of correlation between the process relationship and the pipeline type; its calculation expression is:
NCC(X, Y) = E[(X - μ_X)(Y - μ_Y)] / (σ_X σ_Y)
where X and Y are two random variables, μ_X and μ_Y are their respective means, and σ_X and σ_Y are their standard deviations;
the denominator of the above formula is the standard deviation of two random variables, which plays a role of normalization, and the numerator of the above formula is also subtracted by the mean value of the two random variables, which is called centering.
Neither normalization nor centering is the essence of the correlation coefficient, and after the normalization and centering are stripped, the remaining part is the expectation of the product of two random variables, namely the inner product of two vectors, which is the essence of the correlation coefficient.
In particular, it is understood that: sampling the random variable X, Y for multiple times, and putting the sampled samples into two vectors, so that the inner product of the two vectors is the expectation of the product of X, Y; the reverse is understood as: considering the two vectors as a jointly distributed column of two random variables, the inner product of the two vectors is the expectation of the product of the two random variables.
Further, the data governance module 403 includes a standard management module 4031, a quality management module 4032, a security management module 4033 and a life cycle module 4034; the standard management module 4031, the quality management module 4032, the security management module 4033 and the life cycle module 4034 are connected in sequence through network communication; the standard management module 4031 is used for providing a comprehensive and complete data standard management process and method, determining and establishing a single, accurate and authoritative source of fact, realizing complete, effective, consistent, standardized, open and shared management of data, and providing the standards against which data quality is checked and data security is managed; the quality management module 4032 is used for integrating the workflow of quality assessment, quality checking, quality correction, quality reporting and the like, through wizard-based, visual and other operating means, into a complete closed loop of data quality management that takes the data standards as the basis of the checks and the metadata as the object being checked; the security management module 4033 is used for providing, throughout data governance, data security measures such as encryption, desensitization, fuzzification and database authorization monitoring for private data, so as to comprehensively guarantee the secure operation of the data; the life cycle module 4034 is used for recording the whole flow of data from creation and initial storage to deletion when it becomes obsolete, and for near-line archiving, offline archiving, destruction and full life-cycle monitoring of data during its storage period.
The functional items managed by the data standards include, but are not limited to, standard publishing, standard mapping, standard querying and the like.
Data quality management includes, but is not limited to, rule management, quality reporting, problem rectification, and the like.
The functional items of data security management include, but are not limited to, change tracking, anomaly monitoring, data desensitization, secure transmission, privacy removal, and the like.
The functional items of the data lifecycle management include, but are not limited to, data policies, level agreements, classification evaluation, data archiving, data destruction, and the like.
The present embodiment also provides a method for operating the distributed data management system based on the novel pipeline scheduling algorithm, which comprises the following steps:
S1, building the distributed data management system based on blockchain and big data technology, and connecting the application platform with a plurality of data sources;
S2, loading the corresponding creation modes of the inter-process communication pipelines and the polling scheduling algorithm;
S3, managing the data management system in a unified way, presetting the management of the metadata, and monitoring and adjusting the standard quality of the metadata;
S4, when data are exchanged between a data source and the data integration module, selecting and establishing the corresponding inter-process pipeline through the correlation-based matching algorithm, based on the rules of the metadata for the different processes and the results of lineage analysis, so as to realize secure data transmission;
S5, performing security management and life-cycle management on the data, performing unified collection, conversion, cleaning, archiving and other management on the data, and providing high-quality golden data;
S6, applying the governed, high-quality data to front-end services;
S7, a user accesses the system in various ways, logs in to the application platform after unified identity authentication, obtains high-quality data from the system, and calls the relevant data for application.
As shown in FIG. 9, the present embodiment also provides an apparatus for operating the distributed data management system based on the novel pipeline scheduling algorithm; the apparatus comprises a processor, a memory, and a computer program stored in the memory and running on the processor.
The processor comprises one or more processing cores, the processor is connected with the memory through the bus, the memory is used for storing program instructions, and the distributed data management system based on the novel pipeline scheduling algorithm is realized when the processor executes the program instructions in the memory.
Alternatively, the memory may be implemented by any type or combination of volatile or non-volatile memory devices, such as Static Random Access Memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, magnetic or optical disks.
In addition, the invention also provides a computer readable storage medium, which stores a computer program, and the computer program is executed by a processor to realize the distributed data governance system based on the novel pipeline scheduling algorithm.
Optionally, the present invention also provides a computer program product containing instructions which, when run on a computer, cause the computer to execute the distributed data governance system of the novel pipeline-based scheduling algorithm of the above aspects.
It will be understood by those skilled in the art that all or part of the steps for implementing the above embodiments may be implemented by hardware, or may be implemented by a program instructing relevant hardware, and the program may be stored in a computer-readable storage medium, where the above-mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.
The foregoing shows and describes the general principles, essential features and advantages of the invention. It will be understood by those skilled in the art that the present invention is not limited to the embodiments described above; the embodiments and the description above only illustrate preferred embodiments of the invention and are not intended to limit it. The scope of the invention is defined by the appended claims and their equivalents.

Claims (10)

1. A distributed data management system based on a novel pipeline scheduling algorithm, characterized in that it comprises:
an infrastructure unit (100), a pipeline scheduling unit (200), a unified service unit (300) and a data management unit (400); the infrastructure unit (100), the pipeline scheduling unit (200), the unified service unit (300) and the data management unit (400) are connected in sequence through network communication; the infrastructure unit (100) is used for providing and managing the basic network topology supporting system operation; the pipeline scheduling unit (200) is used for providing and managing different inter-process communication pipelines and performing distribution scheduling; the unified service unit (300) is used for enabling enterprises to manage data in a unified, standardized way and to analyze and utilize it, through a number of unified service functions; the data management unit (400) is used for performing quality control and comprehensive governance of the data;
the infrastructure unit (100) comprises an application platform module (101), a data source module (102), a technical support module (203) and an algorithm management module (204);
the pipeline scheduling unit (200) comprises an interprocess communication module (201), an anonymous pipeline module (202), a named pipeline module (203) and a polling scheduling module (204);
the unified service unit (300) comprises a metadata management module (301), a service gateway module (302), a data storage module (303) and an identity authentication module (304);
the data management unit (400) comprises a data integration module (401), a data exchange module (402), a data governance module (403), a main data management module (404) and a data application module (405).
2. The distributed data management system based on the novel pipeline scheduling algorithm of claim 1, wherein: the application platform module (101), the data source module (102), the technical support module (203) and the algorithm management module (204) are connected in sequence through network communication and run in parallel; the application platform module (101) is used for constructing a multifunctional data governance application platform on the basis of big data and blockchain technology so as to realize human-computer interaction; the data source module (102) is used for building channels for signal connection and data transmission between the system and each data source and for managing and distributing the data source platforms; the technical support module (203) is used for loading various intelligent technologies to support and perfect the functionality of the system; the algorithm management module (204) is used for packaging a plurality of intelligent algorithms supporting system functions.
3. The distributed data management system based on the novel pipeline scheduling algorithm of claim 2, wherein: the signal output of the inter-process communication module (201) is connected with the signal inputs of the anonymous pipeline module (202) and the named pipeline module (203), the anonymous pipeline module (202) and the named pipeline module (203) run in parallel, and their signal outputs are connected with the signal input of the polling scheduling module (204); the inter-process communication module (201) is used for opening up a buffer in the kernel, with memory as the medium, to realize communication for inter-process data exchange; the anonymous pipeline module (202) is used for creating an anonymous pipe between related processes (such as a parent process and its child) by calling the pipe function to provide one-way communication; the named pipeline module (203) is used for creating a system-visible named pipe (FIFO) between any two processes by calling the mknod or mkfifo command, so that the two processes can communicate; the polling scheduling module (204) is used for scheduling the pipelines in a round-robin manner to realize load balancing of pipeline scheduling.
4. The distributed data management system based on the novel pipeline scheduling algorithm of claim 3, wherein: in the polling scheduling module (204), the scheduling algorithms include a polling (round-robin) scheduling algorithm and a weighted polling scheduling algorithm improved on its basis, wherein: the polling scheduling algorithm is suitable when all servers in the server group have the same software and hardware configuration and the average service requests are relatively balanced, while the weighted polling scheduling algorithm is suitable when the configuration, installed service applications and processing capacity of the servers in the group differ; in the weighted polling scheduling algorithm, the weights are allocated according to a calculation expression in which the referenced quantity is the server's load (the expression itself appears in the original filing only as an image and is not reproduced here).
5. The distributed data management system based on the novel pipeline scheduling algorithm of claim 4, wherein: the metadata management module (301), the service gateway module (302), the data storage module (303) and the identity authentication module (304) communicate with each other in sequence over the network and operate in parallel; the metadata management module (301) is used for providing unified management of information-description metadata and information-authorization metadata, metadata descriptions, metadata interfaces and a high-performance metadata access mechanism; the service gateway module (302) is used for providing a unified data service gateway to realize safe and controllable docking and services between internal and external systems and to support full life-cycle management of APIs; the data storage module (303) is used for providing a unified distributed data storage service that can store and access various types of data such as structured data, unstructured data and files in a unified way, and can realize self-description and information authorization through metadata management information; the identity authentication module (304) is used for providing unified user management and identity authentication services and supporting multi-mode access and third-party identity authentication, so as to meet the identity authentication requirements of different application systems.
6. The distributed data management system based on the novel pipeline scheduling algorithm of claim 5, wherein: the data integration module (401), the data exchange module (402), the data governance module (403), the master data management module (404) and the data application module (405) are connected in sequence through network communication and operate in an interleaved manner; the data integration module (401) is used for loading, cleaning, converting and integrating cross-region data and supports custom scheduling and graphical monitoring, so that unified scheduling and unified monitoring meet the visualization requirements of operation and maintenance; the data exchange module (402) is used for transmitting and sharing data or files among multiple business subsystems, integrating data collection, processing, distribution and exchange transmission; the data governance module (403) is used for governing data in an integrated way from multiple aspects; the master data management module (404) is used for establishing a unified view of, and centralized management over, the data to be shared, so as to provide optimal data for business-system data calls; the data application module (405) is used for providing the governed, high-quality data to users for invocation and application, through deep mining, statistical reports and other forms.
7. The distributed data management system based on the novel pipeline scheduling algorithm of claim 6, wherein: the data exchange module (402) comprises a data transmission module (4021), a node management module (4022), a pipeline matching module (4023) and an exchange approval module (4024); the data transmission module (4021), the node management module (4022), the pipeline matching module (4023) and the exchange approval module (4024) are connected in sequence in communication; the data transmission module (4021) is used for realizing secure data transmission in multiple modes through multiple data exchange components; the node management module (4022) is used for visually configuring the data-transmission exchange nodes, controlling the data transmission state of each node, supporting the transmission of data in various formats and shielding the data-type differences existing between systems; the pipeline matching module (4023) is used for selecting the inter-process pipeline to be used through a matching algorithm, according to the results of correlation analysis between metadata and business data and between processes; the exchange approval module (4024) is used for supporting the design of data exchange task modes and realizing inspection and approval of data exchange tasks through flexible extraction and exchange of data.
8. The distributed data governance system of the new pipeline-based dispatch algorithm of claim 7, wherein: in the pipeline matching module (4023), a matching algorithm based on correlation is adopted for judging the correlation between the process relation and the pipeline type, and the correlation degree between the process relation and the pipeline type is expressed by adopting an NCC normalized cross correlation coefficient, and the calculation expression is as follows:
NCC(X, Y) = E[(X − μ_X)(Y − μ_Y)] / (σ_X · σ_Y)
wherein X and Y are the two random variables (representing the process relation and the pipeline type), μ_X and μ_Y are the means of the two random variables, and σ_X and σ_Y are their standard deviations;
the denominator of the above formula, the product of the two standard deviations, plays the role of normalization, while subtracting the means of the two random variables in the numerator is referred to as centering.
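As a concrete, non-authoritative illustration of the above formula and of how a pipeline matching module such as (4023) might use it, the following Python sketch computes the NCC coefficient from equally long samples and picks the pipeline type with the highest correlation; the numeric profiles and pipe names are invented for the example.

```python
import math
from typing import Dict, Sequence

def ncc(x: Sequence[float], y: Sequence[float]) -> float:
    """Normalized cross-correlation coefficient of two equally long samples."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Centering: subtract the means, as described for the numerator above.
    cov = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y)) / n
    # Normalization: divide by the product of the standard deviations.
    std_x = math.sqrt(sum((xi - mean_x) ** 2 for xi in x) / n)
    std_y = math.sqrt(sum((yi - mean_y) ** 2 for yi in y) / n)
    if std_x == 0 or std_y == 0:
        return 0.0
    return cov / (std_x * std_y)

def choose_pipe(process_features: Sequence[float],
                pipe_profiles: Dict[str, Sequence[float]]) -> str:
    """Pick the pipeline type whose profile correlates best with the process relation."""
    return max(pipe_profiles, key=lambda name: ncc(process_features, pipe_profiles[name]))

# Hypothetical numeric profiles for three pipe types (illustrative values only).
profiles = {
    "anonymous_pipe": [0.9, 0.1, 0.2],
    "named_pipe":     [0.2, 0.8, 0.7],
    "message_queue":  [0.1, 0.3, 0.9],
}
print(choose_pipe([0.3, 0.7, 0.8], profiles))  # "named_pipe" has the highest NCC here
```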
9. The distributed data governance system based on the novel pipeline scheduling algorithm of claim 8, wherein: the data governance module (403) comprises a standard management module (4031), a quality management module (4032), a security management module (4033) and a life cycle module (4034); the standard management module (4031), the quality management module (4032), the security management module (4033) and the life cycle module (4034) are connected in sequence through network communication; the standard management module (4031) is used for providing a comprehensive and complete data standard management process and method, determining and establishing a single, accurate and authoritative source of truth, realizing complete, effective, consistent, standardized, open and shared management of data, and providing a standard basis for data quality inspection and data security management; the quality management module (4032) is used for integrating, through operation means such as wizards and visualization, the workflow of quality evaluation, quality inspection, quality correction and quality reporting, forming a complete closed loop of data quality management that takes the data standards as the basis for data checks and the metadata as the object of data checks; the security management module (4033) is used for providing various data security measures such as encryption, desensitization, fuzzification and database authorization monitoring for private data throughout the data governance process, so as to comprehensively guarantee the secure operation of the data; the life cycle module (4034) is used for recording the entire flow of data from creation and initial storage until it becomes obsolete and is deleted, and for performing online archiving, offline archiving, destruction and full life cycle monitoring of long-retained data.
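For illustration only, a minimal sketch of the quality-check step of the closed loop described for the quality management module (4032): rules derived from the data standard are applied to records and summarized in a simple report; the rule set and report fields are assumptions, not the module's actual interface.

```python
from typing import Callable, Dict, List

# Hypothetical quality rules keyed by field name: each rule returns True when a
# value conforms to the data standard maintained by the standard management module.
QualityRules = Dict[str, Callable[[object], bool]]

def quality_check(rows: List[dict], rules: QualityRules) -> dict:
    """Evaluate rows against the rules and return a simple quality report."""
    violations = []
    for i, row in enumerate(rows):
        for field_name, rule in rules.items():
            if not rule(row.get(field_name)):
                violations.append({"row": i, "field": field_name, "value": row.get(field_name)})
    total_checks = len(rows) * len(rules)
    pass_rate = 1.0 - len(violations) / total_checks if total_checks else 1.0
    return {"pass_rate": pass_rate, "violations": violations}

rules: QualityRules = {
    "id":     lambda v: isinstance(v, int) and v > 0,
    "amount": lambda v: isinstance(v, (int, float)) and v >= 0,
}
report = quality_check([{"id": 1, "amount": 10.0}, {"id": -2, "amount": None}], rules)
print(report["pass_rate"])  # 0.5 -> feeds quality correction and the next check cycle
```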
10. The distributed data governance system based on the novel pipeline scheduling algorithm of claim 1, wherein: the operation method of the system comprises the following steps:
S1, building the distributed data governance system based on blockchain and big data technology, and connecting the application platform with a plurality of data sources;
S2, loading the corresponding creation mode of the interprocess communication pipeline and the polling scheduling algorithm (a minimal round-robin sketch is given after this step list);
S3, performing unified management of the data governance system, performing preset management of the metadata, and monitoring and adjusting the standard quality of the metadata;
S4, when data are exchanged between a data source and the data integration module, selecting and establishing the corresponding interprocess pipeline through the correlation-based matching algorithm, on the basis of the rules applied by the metadata to the different processes and the results of data-lineage analysis, so as to realize secure data transmission;
S5, performing security management and life cycle management of the data, performing unified acquisition, conversion, cleaning, archiving and other management of the data, and providing high-quality golden data;
S6, applying the governed high-quality data to front-end services;
and S7, a user accesses in various modes, logs in to the application platform of the system after unified identity authentication, obtains high-quality data from the system, and calls the relevant data for application.
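As referenced in step S2, the following is a minimal, purely illustrative sketch of a polling, i.e. round-robin, scheduler over a set of interprocess communication pipes; the class, method and pipe names are invented for the example and are not defined by this application.

```python
from collections import deque
from typing import Deque, List

class PollingPipeScheduler:
    """Round-robin ("polling") scheduler over a set of interprocess pipes.

    Each call to next_pipe() returns the pipe whose turn it is and moves it
    to the back of the rotation, so every pipe is serviced in turn.
    """

    def __init__(self, pipe_names: List[str]) -> None:
        self._queue: Deque[str] = deque(pipe_names)

    def next_pipe(self) -> str:
        pipe = self._queue[0]
        self._queue.rotate(-1)          # move the serviced pipe to the back
        return pipe

    def add_pipe(self, pipe_name: str) -> None:
        self._queue.append(pipe_name)   # newly created pipes join the rotation

scheduler = PollingPipeScheduler(["pipe_a", "pipe_b", "pipe_c"])
print([scheduler.next_pipe() for _ in range(5)])
# ['pipe_a', 'pipe_b', 'pipe_c', 'pipe_a', 'pipe_b']
```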
CN202110853719.XA 2021-07-28 2021-07-28 Distributed data management system based on novel pipeline scheduling algorithm Pending CN113553381A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110853719.XA CN113553381A (en) 2021-07-28 2021-07-28 Distributed data management system based on novel pipeline scheduling algorithm

Publications (1)

Publication Number Publication Date
CN113553381A true CN113553381A (en) 2021-10-26

Family

ID=78104663

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110853719.XA Pending CN113553381A (en) 2021-07-28 2021-07-28 Distributed data management system based on novel pipeline scheduling algorithm

Country Status (1)

Country Link
CN (1) CN113553381A (en)

Cited By (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115249149A (en) * 2022-09-21 2022-10-28 中国电子信息产业集团有限公司 Data circulation system, safety management and control system and safety management and control method thereof
CN117032587A (en) * 2023-09-26 2023-11-10 深圳市智赋新能源有限公司 Optical storage integrated information management system based on distributed architecture
CN117032587B (en) * 2023-09-26 2024-01-09 深圳市智赋新能源有限公司 Optical storage integrated information management system based on distributed architecture


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination