CN112925767A

CN112925767A - Multi-data-source dynamic data synchronization management method and system based on internet supervision

Info

Publication number: CN112925767A
Application number: CN202110234138.8A
Authority: CN
Inventors: 侯居永; 栾丽丽; 张雷; 陈兆亮
Original assignee: Inspur Cloud Information Technology Co Ltd
Current assignee: Inspur Cloud Information Technology Co Ltd
Priority date: 2021-03-03
Filing date: 2021-03-03
Publication date: 2021-06-08

Abstract

The invention discloses a multi-data-source dynamic data synchronization management method and system based on Internet supervision, belonging to the field of "Internet Plus supervision". The technical problem to be solved is how to help a user quickly construct a big data processing and analysis flow and build a data center quickly and at low cost. The technical scheme is as follows: the method fuses structured, semi-structured and unstructured data of various kinds, and provides a one-stop data development environment, visual flow design, rich data types and intelligent task monitoring, so that a user can quickly construct a big data processing and analysis flow and build a data center at low cost. The method comprises the following steps: data source management: managing data connection services; data flow design: defining each data processing flow as a data flow job, and managing data processing flows through data flow jobs; template management: migrating and reusing flows.

Description

Multi-data-source dynamic data synchronization management method and system based on internet supervision

Technical Field

The invention relates to the field of Internet plus supervision, in particular to a method and a system for synchronously managing dynamic data of multiple data sources based on Internet supervision.

Background

Currently, a new generation of information technology is rapidly changing how society produces and lives. Data has become a core asset of organizations and enterprises, the digital economy is driving a new round of global change, and the digital transformation of enterprises has become the trend of the big data era.

The Internet, big data and artificial intelligence are being deeply integrated with the real economy, promoting integrated innovation across industries. In this era, maximizing the value of big data by fully exploiting the association, intersection and fusion of data has become the key to digital transformation in every industry. Against this background, data tends to be fused across fields, industries and regions; multi-source data such as organizational data, Internet of Things data and scientific research data tends to be fused; and hypermedia data comprising structured, semi-structured and unstructured data tends to be fused. Fusing multi-source heterogeneous hypermedia data, whose main characteristics are large scale, heterogeneous sources, cross-domain, cross-media and cross-language content, and dynamic evolution, has become a key problem that urgently needs to be solved when implementing digital transformation strategies in vertical industries and ecosystem enterprises.

In a conventional data warehouse system, data models are defined in advance before data is loaded and stored, and only structured and processed data can be stored in the data warehouse system.

Therefore, how to help users quickly construct a big data processing and analysis flow and build a data center quickly and at low cost is a problem that currently needs to be solved.

Disclosure of Invention

The technical task of the invention is to provide a multi-data-source dynamic data synchronization management method and system based on Internet supervision, so as to solve the problem of how to help users quickly construct a big data processing and analysis flow and build a data center quickly and at low cost.

The technical task of the invention is achieved as follows. The multi-data-source dynamic data synchronization management method based on Internet supervision fuses structured, semi-structured and unstructured data of various kinds, and provides a one-stop data development environment, visual flow design, rich data types and intelligent task monitoring, so that a user can quickly construct a big data processing and analysis flow and build a data center at low cost. The method comprises the following steps:

data source management: managing data connection services;

data management: data flow design, data flow debugging, data flow monitoring, and data flow operation and maintenance, wherein each data processing flow is defined as a data flow job, and data processing flows are managed through data flow jobs;

template management: migrating and reusing flows, and providing functions for uploading, deleting and downloading data flow templates.

Preferably, the data source management is specifically as follows:

the user defines data source connections centrally, so that they can be referenced directly when designing a data processing flow;

data source connections use a connection pool, which avoids occupying a large number of connections on the data source; the data source connection types include the following:

① JDBC connection type, covering JDBC-supported databases such as MySQL, Oracle, MSSQL and DB2;

② FTP connection type;

③ SFTP connection type;

④ HDFS connection type;

⑤ HBase connection type;

⑥ Hive connection type;

⑦ Elasticsearch connection type;

⑧ Kafka connection type;

⑨ Excel, CSV and other connection types.
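The idea of centrally defined, pooled data source connections can be sketched as follows. This is a minimal illustration, not the patented implementation: the names ConnectionPool and DataSourceRegistry are hypothetical, and sqlite3 merely stands in for a JDBC-style database source.

```python
import queue
import sqlite3
from contextlib import contextmanager


class ConnectionPool:
    """Fixed-size pool: connections are created once and then reused."""

    def __init__(self, factory, size=2):
        self._idle = queue.Queue(maxsize=size)
        for _ in range(size):
            self._idle.put(factory())

    @contextmanager
    def connection(self, timeout=5.0):
        conn = self._idle.get(timeout=timeout)  # waits while the pool is exhausted
        try:
            yield conn
        finally:
            self._idle.put(conn)  # hand the connection back instead of closing it


class DataSourceRegistry:
    """Named data source connections that flows refer to by name (hypothetical)."""

    def __init__(self):
        self._pools = {}

    def register(self, name, factory, size=2):
        self._pools[name] = ConnectionPool(factory, size)

    def connection(self, name):
        return self._pools[name].connection()


registry = DataSourceRegistry()
# Assumption: an in-memory SQLite database plays the role of a JDBC source.
registry.register("demo_db", lambda: sqlite3.connect(":memory:"), size=1)

with registry.connection("demo_db") as conn:
    conn.execute("CREATE TABLE t (id INTEGER)")
    conn.execute("INSERT INTO t VALUES (1)")

with registry.connection("demo_db") as conn:
    # The pooled connection is reused, so the in-memory table is still visible.
    row_count = conn.execute("SELECT COUNT(*) FROM t").fetchone()[0]
```

Because the pool size is fixed, the number of connections held against the data source stays bounded no matter how many flow nodes reference the same named source.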

Preferably, the data flow design is specifically as follows:

flow grouping: functions for adding, deleting and modifying groups and for starting and stopping flows are provided; data flow jobs are classified hierarchically by group, which reduces the difficulty of managing, operating and maintaining data processing flows;

flow tree display: all jobs created by the current user are displayed as a tree, and the running state of each job is indicated by the color of its name: green indicates that the job is running normally, red indicates that warning information was raised while the job was running, and black indicates that the job is not running;

visualized workflow design;

data access: a plurality of data access processors are provided for acquiring multi-source heterogeneous data, offering broad data source adaptation, high-performance data acquisition and flexible scheduling modes to meet various data acquisition requirements;

data loading: a plurality of data loading processors are provided for importing data into various data storage services;

data cleaning: a plurality of data cleaning processors are provided for checking and cleaning the acquired data;

data conversion: a plurality of data conversion processors are provided for converting the acquired data;

custom processor: a processor implementing a specific function is written in Java code and loaded into the flow job to implement more complex functions, such as splitting data or checking whether flow data exists in a data table.
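The custom processor idea can be sketched as a small plug-in interface. The patent specifies Java for custom processors; Python is used here only to illustrate the shape of such a plug-in, and the names Processor, SplitProcessor, ExistsInTableProcessor and run_chain are hypothetical.

```python
from abc import ABC, abstractmethod


class Processor(ABC):
    """Hypothetical plug-in interface: one record in, zero or more records out."""

    @abstractmethod
    def process(self, record):
        ...


class SplitProcessor(Processor):
    """Data splitting: one record becomes several, one per delimited value."""

    def __init__(self, field, sep=","):
        self.field, self.sep = field, sep

    def process(self, record):
        parts = str(record[self.field]).split(self.sep)
        return [{**record, self.field: part} for part in parts]


class ExistsInTableProcessor(Processor):
    """Keeps a record only if its key is present in a reference table."""

    def __init__(self, key, table_keys):
        self.key, self.table_keys = key, set(table_keys)

    def process(self, record):
        return [record] if record[self.key] in self.table_keys else []


def run_chain(records, processors):
    """Feed the output of each processor into the next one."""
    for processor in processors:
        records = [out for rec in records for out in processor.process(rec)]
    return records


result = run_chain(
    [{"id": 1, "tag": "a,b"}, {"id": 2, "tag": "c"}],
    [SplitProcessor("tag"), ExistsInTableProcessor("tag", {"a", "c"})],
)
```

Here the reference table is a plain set of keys; in a real flow it would presumably be looked up through one of the configured data source connections.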

Preferably, the visualized workflow design is specifically as follows:

each data flow design owns an independent canvas, on which one or more flow nodes are defined to form one or more data flows;

the canvas toolbar provides a rich set of data processing types, and flow nodes are defined and connected by dragging;

scheduling rules and attributes of flow nodes are configured, flows or nodes are started and stopped, and the running state of flows is debugged and monitored;

auxiliary functions for aligning and highlighting flow nodes are provided;

flow definition, starting and stopping, debugging, monitoring, and operation and maintenance are all performed through the interface, so that flow design is completed visually.
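Behind such a canvas, a flow can be modeled as a directed graph of nodes and connections. The sketch below is an assumption about one possible data model (FlowCanvas and its methods are hypothetical names); it uses the standard library graphlib module to derive an order in which upstream nodes run before downstream ones.

```python
from graphlib import CycleError, TopologicalSorter


class FlowCanvas:
    """One canvas per data flow design: nodes plus the connections between them."""

    def __init__(self, name):
        self.name = name
        self.nodes = {}   # node id -> node attributes
        self.edges = []   # (upstream id, downstream id)

    def add_node(self, node_id, **attributes):
        self.nodes[node_id] = attributes

    def connect(self, upstream, downstream):
        if upstream not in self.nodes or downstream not in self.nodes:
            raise KeyError("both nodes must be defined before connecting them")
        self.edges.append((upstream, downstream))

    def execution_order(self):
        """Return node ids with every upstream node before its downstream nodes.

        Raises graphlib.CycleError if the connections form a loop.
        """
        predecessors = {node_id: set() for node_id in self.nodes}
        for upstream, downstream in self.edges:
            predecessors[downstream].add(upstream)
        return list(TopologicalSorter(predecessors).static_order())


canvas = FlowCanvas("complaints")
for node_id in ("read_mysql", "clean", "convert", "load_hive", "load_es"):
    canvas.add_node(node_id)
canvas.connect("read_mysql", "clean")
canvas.connect("clean", "convert")
canvas.connect("convert", "load_hive")
canvas.connect("convert", "load_es")

order = canvas.execution_order()
```

The two load nodes have no ordering constraint between them, so only their position relative to "convert" is guaranteed.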

Preferably, the data sources supported by the data access include the following:

① collecting data via JDBC from databases that support JDBC, such as MySQL, Oracle and DB2;

② collecting Oracle data through Oracle logs, capturing all INSERT, UPDATE and DELETE operations on the database;

③ collecting MySQL data through MySQL logs, capturing all INSERT, UPDATE and DELETE operations on the database;

④ collecting FTP/SFTP file data;

⑤ collecting HDFS file data;

⑥ collecting HBase data;

⑦ collecting Hive data;

⑧ consuming Kafka data.
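The log-based collection of Oracle and MySQL data described above amounts to replaying captured INSERT, UPDATE and DELETE operations against a target. A minimal sketch follows, assuming a hypothetical event shape with "op", "key" and "row" fields; this is not the actual Oracle or MySQL log format.

```python
def apply_change(table, event):
    """Apply one captured change event to an in-memory target table."""
    op, key = event["op"], event["key"]
    if op == "INSERT":
        table[key] = dict(event["row"])
    elif op == "UPDATE":
        # Only the changed columns are carried by the event.
        table.setdefault(key, {}).update(event["row"])
    elif op == "DELETE":
        table.pop(key, None)
    else:
        raise ValueError(f"unsupported operation: {op}")
    return table


def replay(events):
    """Rebuild the target state by applying events in log order."""
    table = {}
    for event in events:
        apply_change(table, event)
    return table


events = [
    {"op": "INSERT", "key": 1, "row": {"name": "a", "status": "new"}},
    {"op": "INSERT", "key": 2, "row": {"name": "b", "status": "new"}},
    {"op": "UPDATE", "key": 1, "row": {"status": "done"}},
    {"op": "DELETE", "key": 2},
]
target = replay(events)
```

Applying events strictly in log order is what keeps the target consistent with the source; reordering an UPDATE before its INSERT would yield a partial row.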

More preferably, the data storage services include the following:

① importing data into databases that support JDBC, such as MySQL, Oracle and DB2;

② importing data into FTP/SFTP;

③ importing data into HDFS;

④ importing data into HBase;

⑤ importing data into Hive;

⑥ importing data into Elasticsearch;

⑦ importing data into Kafka.

More preferably, the data cleansing types include the following:

① null and non-null checks;

② prefix and suffix checks;

③ data length check;

④ numerical range check;

⑤ enumeration value check;

⑥ regular expression check;
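The cleaning checks listed above can be expressed as small, composable predicates. The sketch below is illustrative only: the rule table, field names and the invalid_fields helper are assumptions, not part of the patent.

```python
import re


def not_null(value):
    return value is not None and value != ""


def has_prefix(prefix):
    return lambda value: isinstance(value, str) and value.startswith(prefix)


def has_suffix(suffix):
    return lambda value: isinstance(value, str) and value.endswith(suffix)


def length_between(low, high):
    return lambda value: isinstance(value, str) and low <= len(value) <= high


def in_range(low, high):
    return lambda value: isinstance(value, (int, float)) and low <= value <= high


def one_of(allowed):
    return lambda value: value in allowed


def matches(pattern):
    compiled = re.compile(pattern)
    return lambda value: isinstance(value, str) and compiled.fullmatch(value) is not None


# Hypothetical rule table: field name -> checks that must all pass.
RULES = {
    "id": [not_null, has_prefix("CN"), has_suffix("A"), length_between(4, 16)],
    "score": [in_range(0, 100)],
    "status": [one_of({"open", "closed"})],
    "phone": [matches(r"\d{11}")],
}


def invalid_fields(record, rules):
    """Return the names of fields that fail at least one of their checks."""
    return [field for field, checks in rules.items()
            if not all(check(record.get(field)) for check in checks)]


good = {"id": "CN112925767A", "score": 87, "status": "open", "phone": "13800000000"}
bad = {"id": "", "score": 120, "status": "open", "phone": "12345"}
```

A record that yields an empty list passes cleaning; what happens to failing records (drop, route to an error flow, repair) is left open here.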

the data conversion types include the following:

① data mapping;

② character set conversion;

③ data format conversion;

④ data splitting;

⑤ data merging;

⑥ date format conversion;

⑦ string replacement;

⑧ null value replacement;

⑨ dictionary value replacement.
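Several of the conversion types above can be sketched as plain functions over a record. This is a minimal illustration under stated assumptions: the function names, the sample record and the GBK source encoding are all hypothetical choices for the example.

```python
from datetime import datetime


def rename_fields(record, mapping):
    """Data mapping: rename source fields to target field names."""
    return {mapping.get(key, key): value for key, value in record.items()}


def decode_text(raw, source_charset):
    """Character set conversion: bytes in a source charset to a Unicode string."""
    return raw.decode(source_charset)


def split_field(record, field, sep, targets):
    """Data splitting: one delimited field becomes several named fields."""
    parts = record[field].split(sep)
    rest = {key: value for key, value in record.items() if key != field}
    return {**rest, **dict(zip(targets, parts))}


def reformat_date(text, source_format, target_format):
    """Date format conversion."""
    return datetime.strptime(text, source_format).strftime(target_format)


def fill_nulls(record, defaults):
    """Null value replacement: substitute a default where the value is None."""
    return {key: defaults.get(key) if value is None else value
            for key, value in record.items()}


def translate_code(value, dictionary):
    """Dictionary value replacement: map a code to its label if one is known."""
    return dictionary.get(value, value)


raw = {"nm": "数据".encode("gbk"), "dt": "2021/03/03", "st": "1",
       "loc": "Shandong|Jinan", "note": None}

record = rename_fields(raw, {"nm": "name", "dt": "filing_date", "st": "status"})
record["name"] = decode_text(record["name"], "gbk")
record = split_field(record, "loc", "|", ["province", "city"])
record["filing_date"] = reformat_date(record["filing_date"], "%Y/%m/%d", "%Y-%m-%d")
record = fill_nulls(record, {"note": "n/a"})
record["status"] = translate_code(record["status"], {"1": "open", "2": "closed"})
```

Each function returns a new value rather than mutating its input, which makes it straightforward to chain conversions in any order a flow requires.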

A multi-data-source dynamic data synchronization management system based on Internet supervision comprises:

a data source management unit, used for managing data connection services;

a data management unit, used for managing data, defining each data processing flow as a data flow job, and managing data processing flows through data flow jobs; the data management unit comprises a data flow design subunit, a data flow debugging subunit, a data flow monitoring subunit, and a data flow operation and maintenance subunit;

a template management unit, used for migrating and reusing flows and providing functions for uploading, deleting and downloading data flow templates.

Preferably, the data flow design subunit comprises:

a flow grouping module, used for providing functions for adding, deleting and modifying groups and for starting and stopping flows, and for classifying data flow jobs hierarchically by group so as to reduce the difficulty of managing, operating and maintaining data processing flows;

a tree display module, used for displaying all jobs created by the current user as a tree, the running state of each job being indicated by the color of its name: green indicates that the job is running normally, red indicates that warning information was raised while the job was running, and black indicates that the job is not running;

a visualized flow design module, used for performing flow definition, starting and stopping, debugging, monitoring, and operation and maintenance through the interface, so that flow design is completed visually;

a data access module, used for acquiring multi-source heterogeneous data, offering broad data source adaptation, high-performance data acquisition and flexible scheduling modes to meet various data acquisition requirements;

a data loading module, used for importing data into various data storage services;

a data cleaning module, used for checking and cleaning the acquired data;

a data conversion module, used for converting the acquired data;

a custom module, used for writing processors with specific functions in Java code.

A computer-readable storage medium stores a computer program that can be executed by a processor to implement the multi-data-source dynamic data synchronization management method based on Internet supervision described above.

The multi-data-source dynamic data synchronization management method and system based on Internet supervision have the following advantages:

the invention fuses structured, semi-structured, unstructured and other data, and provides a one-stop data development environment, visual flow design, rich data types and intelligent task monitoring, helping a user quickly construct a big data processing and analysis flow and build a data center quickly and at low cost; the invention is an easy-to-use, powerful and reliable data processing and distribution system that supports powerful, highly configurable, directed-graph-based data routing, transformation and system mediation logic, supports dynamically pulling data from a variety of data sources, and fully exploits the association, intersection and fusion of data to maximize its value;

drawing on Inspur's years of experience and practice in the "Internet + supervision" industry, the invention provides visual task orchestration, fuses multi-source heterogeneous data and stores it in a big data center, and enables the acquisition, storage and access of multi-source heterogeneous data; it helps customers extract and integrate all relevant data, build a unified data center, break down data silos, achieve data interconnection and interoperability, support data analysis and insight, release the value of data, and complete their big data transformation;

the invention provides users with visual task orchestration; no client program needs to be installed, and task flows can be orchestrated, debugged, started, stopped and monitored in the browser through simple drag-and-drop operations;

the invention has rich built-in data development types, including SQL, Hive, MapReduce, Spark, Streaming, Flink, Kylin, Jar, RestAPI, PySpark, machine learning and deep learning;

the invention provides rich scheduling configuration strategies and large-scale job scheduling capability, and supports multiple scheduling modes, such as periodic scheduling, event-driven scheduling and manual scheduling;

the invention thus offers visual task orchestration, rich data development types and rich scheduling configuration strategies.
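The three scheduling modes named above (time period, event-driven and manual) can be contrasted in a few lines. The Scheduler class below is a hypothetical sketch that only records which job would run and why; it does not reflect how the described system implements scheduling.

```python
from datetime import datetime, timedelta


class Scheduler:
    """Records job triggers for three scheduling modes (illustrative only)."""

    def __init__(self):
        self.periodic = []   # [job name, interval, next due time]
        self.on_event = {}   # event name -> list of job names
        self.log = []        # (job name, trigger) in execution order

    def every(self, job, interval, start):
        self.periodic.append([job, interval, start + interval])

    def when(self, event, job):
        self.on_event.setdefault(event, []).append(job)

    def tick(self, now):
        """Time-period scheduling: run every periodic job that is due."""
        for entry in self.periodic:
            job, interval, due = entry
            if now >= due:
                self.log.append((job, "period"))
                entry[2] = now + interval

    def emit(self, event):
        """Event-driven scheduling: run the jobs subscribed to this event."""
        for job in self.on_event.get(event, []):
            self.log.append((job, "event"))

    def run_now(self, job):
        """Manual scheduling: run on explicit request."""
        self.log.append((job, "manual"))


start = datetime(2021, 3, 3, 0, 0)
scheduler = Scheduler()
scheduler.every("sync_orders", timedelta(minutes=10), start)
scheduler.when("file_arrived", "load_file")

scheduler.tick(start + timedelta(minutes=5))    # not due yet
scheduler.tick(start + timedelta(minutes=10))   # due, runs once
scheduler.emit("file_arrived")
scheduler.run_now("sync_orders")
```

The caller supplies the current time to tick, which keeps the sketch deterministic; a real scheduler would read a clock or use cron-style expressions.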

Drawings

The invention is further described below with reference to the accompanying drawings.

FIG. 1 is a schematic diagram of a multi-data-source dynamic data synchronization management method based on Internet supervision.

Detailed Description

The method and system for synchronously managing the dynamic data of multiple data sources based on internet supervision according to the present invention are described in detail below with reference to the drawings and the specific embodiments of the specification.

Example 1:

As shown in FIG. 1, the multi-data-source dynamic data synchronization management method based on Internet supervision of the invention fuses structured, semi-structured and unstructured data of various kinds, and provides a one-stop data development environment, visual flow design, rich data types and intelligent task monitoring, so that a user can quickly construct a big data processing and analysis flow and build a data center at low cost. The method comprises the following steps:

S1, data source management: managing data connection services;

S2, data management: data flow design, data flow debugging, data flow monitoring, and data flow operation and maintenance, wherein each data processing flow is defined as a data flow job, and data processing flows are managed through data flow jobs;

S3, template management: migrating and reusing flows, and providing functions for uploading, deleting and downloading data flow templates. The data flow templates include a complaint report template, a risk early warning template, a knowledge base template and the like.
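Template management can be pictured as serializing a flow definition so that it can be moved between environments and instantiated again. The sketch below assumes a JSON payload and an in-memory store; TemplateStore, export_flow and instantiate are hypothetical names, and the actual template format is not specified in the source.

```python
import json


class TemplateStore:
    """In-memory stand-in for template upload, download and delete."""

    def __init__(self):
        self._templates = {}

    def upload(self, name, payload):
        self._templates[name] = json.loads(payload)  # rejects malformed JSON

    def download(self, name):
        return json.dumps(self._templates[name], sort_keys=True)

    def delete(self, name):
        del self._templates[name]

    def names(self):
        return sorted(self._templates)


def export_flow(nodes, edges):
    """Serialize a flow definition so it can be migrated elsewhere."""
    return json.dumps({"nodes": nodes, "edges": edges}, sort_keys=True)


def instantiate(payload, flow_name):
    """Create a new, independent flow from a downloaded template."""
    template = json.loads(payload)
    return {"name": flow_name, "nodes": template["nodes"], "edges": template["edges"]}


store = TemplateStore()
store.upload(
    "risk_warning",
    export_flow(["read", "clean", "load"], [["read", "clean"], ["clean", "load"]]),
)
flow = instantiate(store.download("risk_warning"), "risk_warning_2021")
names_before_delete = store.names()
store.delete("risk_warning")
```

Because the instantiated flow is built from a fresh JSON parse, deleting the template afterwards does not affect flows already created from it.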

The data source management in step S1 in this embodiment is specifically as follows:

S101, the user defines data source connections centrally, so that they can be referenced directly when designing a data processing flow;

S102, data source connections use a connection pool, which avoids occupying a large number of connections on the data source;

the data source connection types include the following:

② FTP connection type;

③ SFTP connection type;

④ HDFS connection type;

⑤ HBase connection type;

⑥ Hive connection type;

⑦ Elasticsearch connection type;

⑧ Kafka connection type;

⑨ Excel, CSV and other connection types.

The data flow design of step S2 in this embodiment is specifically as follows:

S201, flow grouping: functions for adding, deleting and modifying groups and for starting and stopping flows are provided; data flow jobs are classified hierarchically by group, which reduces the difficulty of managing, operating and maintaining data processing flows;

S202, flow tree display: all jobs created by the current user are displayed as a tree, and the running state of each job is indicated by the color of its name: green indicates that the job is running normally, red indicates that warning information was raised while the job was running, and black indicates that the job is not running;

S203, visualized workflow design;

S204, data access: a plurality of data access processors are provided for acquiring multi-source heterogeneous data, offering broad data source adaptation, high-performance data acquisition and flexible scheduling modes to meet various data acquisition requirements;

S205, data loading: a plurality of data loading processors are provided for importing data into various data storage services;

S206, data cleaning: a plurality of data cleaning processors are provided for checking and cleaning the acquired data;

S207, data conversion: a plurality of data conversion processors are provided for converting the acquired data;

S208, custom processor: a processor implementing a specific function is written in Java code and loaded into the flow job to implement more complex functions, such as splitting data or checking whether flow data exists in a data table.
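The grouping and tree display of steps S201 and S202 can be sketched as a two-level structure with a status-to-color mapping. The status keys and the text rendering below are assumptions made for illustration; only the three colors and their meanings come from the description.

```python
# Colors follow the description: green = running normally,
# red = warning raised while running, black = not running.
STATUS_COLORS = {"normal": "green", "warning": "red", "not_running": "black"}


def render_tree(groups):
    """Render groups and their jobs as indented lines, tagging each job with its color."""
    lines = []
    for group in sorted(groups):
        lines.append(group)
        for job, status in sorted(groups[group].items()):
            lines.append(f"  {job} [{STATUS_COLORS[status]}]")
    return lines


tree = render_tree({
    "complaints": {"ingest": "normal", "dedupe": "warning"},
    "risk": {"score": "not_running"},
})
```

A real interface would color the job name itself; the bracketed tag is just a text substitute for that.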

The visualized workflow design in step S203 in this embodiment is specifically as follows:

S20301, each data flow design owns an independent canvas, on which one or more flow nodes are defined to form one or more data flows;

S20302, the canvas toolbar provides a rich set of data processing types, and flow nodes are defined and connected by dragging;

S20303, scheduling rules and attributes of flow nodes are configured, flows or nodes are started and stopped, and the running state of flows is debugged and monitored;

S20304, auxiliary functions for aligning and highlighting flow nodes are provided;

S20305, flow definition, starting and stopping, debugging, monitoring, and operation and maintenance are all performed through the interface, so that flow design is completed visually.

In this embodiment, the data sources supported by the data access in step S204 include the following:

④ collecting FTP/SFTP file data;

⑤ collecting HDFS file data;

⑥ collecting HBase data;

⑦ collecting Hive data;

⑧ consuming Kafka data.

The data storage service of step S205 in this embodiment includes the following:

② importing data into FTP/SFTP;

③ importing data into HDFS;

④ importing data into HBase;

⑤ importing data into Hive;

⑥ importing data into Elasticsearch;

⑦ importing data into Kafka.

The data cleansing types of step S206 in this embodiment include the following:

① null and non-null checks;

② prefix and suffix checks;

③ data length check;

④ numerical range check;

⑤ enumeration value check;

⑥ regular expression check.

The data conversion types of step S207 in this embodiment include the following:

① data mapping;

② character set conversion;

③ data format conversion;

④ data splitting;

⑤ data merging;

⑥ date format conversion;

⑦ string replacement;

⑧ null value replacement;

⑨ dictionary value replacement.

Example 2:

The multi-data-source dynamic data synchronization management system based on Internet supervision of the invention comprises:

a data source management unit, used for managing data connection services;

The data flow design subunit in this embodiment comprises:

a data cleaning module, used for checking and cleaning the acquired data;

a data conversion module, used for converting the acquired data;

Example 3:

the embodiment of the invention also provides a computer-readable storage medium, wherein a plurality of instructions are stored, and the instructions are loaded by the processor, so that the processor executes the multi-data-source dynamic data synchronization management method based on internet supervision in any embodiment of the invention. Specifically, a system or an apparatus equipped with a storage medium on which software program codes that realize the functions of any of the above-described embodiments are stored may be provided, and a computer (or a CPU or MPU) of the system or the apparatus is caused to read out and execute the program codes stored in the storage medium.

In this case, the program code itself read from the storage medium can realize the functions of any of the above-described embodiments, and thus the program code and the storage medium storing the program code constitute a part of the present invention.

Examples of the storage medium for supplying the program code include a floppy disk, a hard disk, a magneto-optical disk, an optical disk (e.g., CD-ROM, CD-R, CD-RW, DVD-ROM, DVD-RAM, DVD-RW, DVD+RW), a magnetic tape, a nonvolatile memory card, and a ROM. Alternatively, the program code may be downloaded from a server computer via a communications network.

Further, it should be clear that the functions of any one of the above-described embodiments may be implemented not only by executing the program code read out by the computer, but also by causing an operating system or the like operating on the computer to perform a part or all of the actual operations based on instructions of the program code.

Further, it is to be understood that the program code read out from the storage medium is written to a memory provided in an expansion board inserted into the computer or to a memory provided in an expansion unit connected to the computer, and then causes a CPU or the like mounted on the expansion board or the expansion unit to perform part or all of the actual operations based on instructions of the program code, thereby realizing the functions of any of the above-described embodiments.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A multi-data-source dynamic data synchronization management method based on Internet supervision, characterized in that the method fuses structured, semi-structured and unstructured data of various kinds, and provides a one-stop data development environment, visual flow design, rich data types and intelligent task monitoring, so that a user can quickly construct a big data processing and analysis flow and build a data center quickly and at low cost; the method comprises the following steps:

data source management: managing data connection services;

2. The internet-supervision-based multiple data source dynamic data synchronization management method according to claim 1, wherein the data source management is specifically as follows:

the user defines data source connections centrally;

data source connections use a connection pool; the data source connection types include the following:

① JDBC connection type;

② FTP connection type;

③ SFTP connection type;

④ HDFS connection type;

⑤ HBase connection type;

⑥ Hive connection type;

⑦ Elasticsearch connection type;

⑧ Kafka connection type;

⑨ Excel and CSV connection types.

3. The internet-supervision-based multiple data source dynamic data synchronization management method according to claim 1, wherein the data flow design is specifically as follows:

visualized workflow design;

custom processor: a processor implementing a specific function is written in Java code and loaded into the flow job.

4. The internet-supervision-based multiple data source dynamic data synchronization management method according to claim 3, wherein the visualization workflow design is specifically as follows:

providing auxiliary functions of flow node alignment and highlight display;

5. The internet-supervision-based multiple data source dynamic data synchronization management method according to claim 3, wherein the data sources supported by data access include the following:

① collecting data via JDBC;

④ collecting FTP/SFTP file data;

⑤ collecting HDFS file data;

⑥ collecting HBase data;

⑦ collecting Hive data;

⑧ consuming Kafka data.

6. The internet-supervision-based multiple data source dynamic data synchronization management method according to claim 3, wherein the data storage services include the following:

② importing data into FTP/SFTP;

③ importing data into HDFS;

④ importing data into HBase;

⑤ importing data into Hive;

⑥ importing data into Elasticsearch;

⑦ importing data into Kafka.

7. The internet-supervision-based multiple data source dynamic data synchronization management method according to claim 3, wherein the data cleaning types include the following:

① null and non-null checks;

② prefix and suffix checks;

③ data length check;

④ numerical range check;

⑤ enumeration value check;

⑥ regular expression check;

the data conversion types include the following:

① data mapping;

② character set conversion;

③ data format conversion;

④ data splitting;

⑤ data merging;

⑥ date format conversion;

⑦ string replacement;

⑧ null value replacement;

⑨ dictionary value replacement.

8. A multi-data-source dynamic data synchronization management system based on Internet supervision, characterized by comprising:

a data source management unit, used for managing data connection services;

a data management unit, used for managing data, defining each data processing flow as a data flow job, and managing data processing flows through data flow jobs; the data management unit comprises a data flow design subunit, a data flow debugging subunit, a data flow monitoring subunit, and a data flow operation and maintenance subunit;

a template management unit, used for migrating and reusing flows and providing functions for uploading, deleting and downloading data flow templates.

9. The internet-supervision-based multiple data source dynamic data synchronization management system according to claim 8, wherein the data flow design subunit comprises:

a data cleaning module, used for checking and cleaning the acquired data;

a data conversion module, used for converting the acquired data;

10. A computer-readable storage medium having stored thereon a computer program executable by a processor to implement the internet-supervision-based multiple data source dynamic data synchronization management method as claimed in any one of claims 1 to 7.