CN113641739B - Spark-based intelligent data conversion method - Google Patents

Spark-based intelligent data conversion method

Info

Publication number
CN113641739B
CN113641739B
Authority
CN
China
Prior art keywords
task
data
extraction
spark
execution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110756908.5A
Other languages
Chinese (zh)
Other versions
CN113641739A (en)
Inventor
王仁俊
罗义斌
胡明慧
魏阳
李军
司震
宋炜伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Linkage Information Technology Co ltd
Original Assignee
Nanjing Linkage Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Linkage Information Technology Co ltd
Priority to CN202110756908.5A
Publication of CN113641739A
Application granted
Publication of CN113641739B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G06F 16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G06F 11/3006 Monitoring arrangements where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G06F 11/302 Monitoring arrangements where the monitored component is a software system
    • G06F 11/3093 Monitoring probes: configuration details thereof, e.g. installation, enabling, spatial arrangement
    • G06F 11/327 Alarm or error message display
    • G06F 11/3452 Performance evaluation by statistical analysis
    • G06F 16/21 Design, administration or maintenance of databases
    • G06F 16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F 16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F 9/45512 Command shells
    • G06F 2201/80 Database-specific techniques (indexing scheme relating to error detection, error correction and monitoring)
    • G06F 2201/805 Real-time (indexing scheme relating to error detection, error correction and monitoring)

Abstract

A Spark-based intelligent data conversion method, implemented on a unified data conversion system built on the distributed computing framework Spark, the system comprising a parser, an executor and a scheduler. The information selected on the page is sent to the parser, and each parser translates it into corresponding Spark code or, when the Sqoop execution mode is selected, generates and stores the corresponding Sqoop shell script; the Spark code or Sqoop script generated by the parser is stored in HDFS, and after a task is triggered the executor selects the matching execution engine to run the specific script or code. Following the scheduling dependency graph, the scheduler can set dependent parent tasks (such as tasks B and C) or triggered child tasks (such as tasks D and E) for each task, the individual dependencies finally forming the complete task dependency network required by the user; the scheduler can also set a task failure policy. Finally, the tasks to be executed are scheduled onto nodes with sufficient resources according to the cluster's current resource situation.

Description

Spark-based intelligent data conversion method
Technical Field
The invention relates to a method that uses the Spark distributed computing component to realize high-performance, highly available intelligent conversion of mass data.
Background
With the development of internet services, the storage media used by those services have become increasingly diverse. In the course of business development, different data stores are inevitably used: relational databases such as MySQL, Oracle and SQL Server; document stores such as MongoDB; structured and unstructured stores such as HBase and Elasticsearch; and big data storage components such as Hive and Kudu. A method that unifies extraction and conversion across these components is urgently needed. In addition, data extraction involves many data sources that lack governance; if every extraction task is managed only by scripts, task states are hard to manage and maintain and data quality is hard to govern, so the upstream services carried by the data department may lack accuracy. A platform is therefore needed to manage and maintain data metadata and task metadata.
Spark is a common cluster computing framework that provides efficient in-memory computation by distributing large data set computation tasks over multiple computers. A distributed computing framework must address two issues: how to distribute the data and how to distribute the computation. Hadoop uses HDFS to solve the distributed data problem, and the MapReduce computing paradigm provides distributed computation. Similarly, Spark offers a multi-language functional programming API and, through a distributed data abstraction called Resilient Distributed Datasets (RDDs), provides more operators than just map and reduce. Essentially, an RDD is a programming abstraction representing a read-only collection of objects that can be partitioned across machines.
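As a concrete illustration of the RDD model just described, the following minimal Spark (Scala) sketch distributes a word count across partitions; the input path, application name and local master setting are placeholders, not part of the invention.

import org.apache.spark.sql.SparkSession

object RddWordCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-sketch")
      .master("local[*]")          // local mode for illustration; a real cluster would use YARN or standalone
      .getOrCreate()

    // An RDD is a read-only, partitioned collection distributed across the cluster.
    val lines = spark.sparkContext.textFile("hdfs:///tmp/input.txt")  // illustrative path

    // Operators beyond map and reduce: flatMap, filter, reduceByKey, etc.
    val counts = lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    spark.stop()
  }
}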
Sqoop is an open-source tool mainly used to transfer data between Hadoop and traditional databases (MySQL, PostgreSQL, ...). It can import data from a relational database (such as MySQL, Oracle or Postgres) into Hadoop's HDFS, and can also export data from HDFS into a relational database. HDFS is the Hadoop Distributed File System; it is highly fault tolerant, is designed to be deployed on inexpensive hardware, and provides high-throughput access for applications.
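For orientation, the kind of Sqoop import that such a tool wraps can be sketched as below; the command is assembled from Scala, and the JDBC URL, credentials, table name and target directory are illustrative placeholders.

object SqoopImportSketch {
  // Assembles a representative `sqoop import` command; all connection details are placeholders.
  def buildImportCommand(jdbcUrl: String, user: String, password: String,
                         table: String, targetDir: String, parallelism: Int): Seq[String] =
    Seq(
      "sqoop", "import",
      "--connect", jdbcUrl,          // e.g. jdbc:mysql://host:3306/db
      "--username", user,
      "--password", password,
      "--table", table,
      "--target-dir", targetDir,     // HDFS directory where the imported data lands
      "--num-mappers", parallelism.toString
    )

  def main(args: Array[String]): Unit = {
    val cmd = buildImportCommand("jdbc:mysql://db-host:3306/demo", "etl", "secret",
                                 "orders", "/warehouse/staging/orders", 4)
    // In the described system such a command would be saved as a shell script and launched by the executor.
    println(cmd.mkString(" "))
  }
}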
Disclosure of Invention
With the continuous development of business, multi-source, multi-structure data grows rapidly, and requirements such as data quality monitoring, data lineage management, improved data extraction efficiency and metadata management have become urgent. The invention provides an intelligent data conversion tool, together with data metadata and task metadata governance, to solve big data ETL problems such as unified configuration management of multi-source data extraction, data quality management and data lineage management.
The technical scheme of the invention is a Spark-based intelligent data conversion method running on a unified data conversion system built on the distributed computing framework Spark; the system comprises a parser, an executor and a scheduler. The information selected on the page is sent to the parser, and each parser translates it into corresponding Spark code or, when the Sqoop execution mode is selected, generates and stores the corresponding Sqoop shell script. The Spark code or Sqoop script generated by the parser is stored in HDFS; when a task is triggered, the executor selects the matching execution engine to run the specific script or code. According to the scheduling dependency graph shown in fig. 3, the scheduler on the one hand sets task dependencies: each task can be given dependent parent tasks (such as tasks B and C) and triggered child tasks (such as tasks D and E), and the individual dependencies together form the complete task dependency network required by the user. On the other hand, a task failure policy can be set, selectable as timeout retry or failure retry. During execution the executor records the task's execution state and conditions, such as elapsed time and resource usage. If the detection policy is set to timeout retry, a retry is triggered once the task's execution time reaches the configured threshold, and the files of the previous unfinished run are forcibly deleted before the retry; if set to failure retry, whether to rerun is decided from the task's state when it finishes. Finally, the tasks to be executed are scheduled onto nodes with sufficient resources according to the cluster's current resource situation;
the method comprises the following specific steps:
1) Data source configuration: multiple data sources are uniformly managed and configured in advance, and the base-table metadata is automatically scanned through the page configuration of the data source; after the data source is built, the user chooses to create an extraction task on the page;
2) Executor construction: the extraction task first selects a data source, and the extraction calculation rules are then checked against that data source, the extraction rule being selected as single extraction/incremental extraction/full extraction; whether concurrent execution is needed is chosen by checking, and the split field required for concurrency is checked; when incremental extraction is selected, the increment basis field must be selected; field check rules are set; after the settings succeed, the storage medium written into the data center is set in the next step;
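One possible way to model the rules selected on the page in step 2), the extraction mode, increment basis field, concurrency, field check rules and target storage, is a plain configuration record; the type and field names below are assumptions made for illustration, not the patent's schema.

// Illustrative configuration model; the names and fields are assumptions, not the patent's schema.
sealed trait ExtractionMode
case object SingleExtraction      extends ExtractionMode
case object IncrementalExtraction extends ExtractionMode
case object FullExtraction        extends ExtractionMode

sealed trait TargetStore
case object HiveStore  extends TargetStore
case object HBaseStore extends TargetStore
case object KuduStore  extends TargetStore
case object HdfsStore  extends TargetStore

// A field check rule, for example an enumeration check on the gender column.
final case class FieldCheck(field: String, allowedValues: Set[String])

final case class ExtractionTask(
  taskId: String,
  dataSourceId: String,
  sourceTable: String,
  mode: ExtractionMode,
  incrementField: Option[String],   // required when mode == IncrementalExtraction
  concurrent: Boolean,
  splitByField: Option[String],     // split field used when running concurrently
  fieldChecks: Seq[FieldCheck],
  target: TargetStore
)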
After the data source, the intermediate calculation rules and the output end have been configured, the ETL task is finally added to the timed tasks; the timing rule can be selected on the page, and the calculation rules are stored in the database by the backend. Once the synchronization conditions selected on the page are confirmed, the backend parses the calculation and alarm conditions with the parser and persists them; after the task is later confirmed for going online, the parser translates the corresponding task into Spark or Sqoop code. The extraction task can be test-extracted according to the extraction conditions, with concurrency and data volume limited during the test, and can be scheduled to go online after the test passes. A calculation task generates the corresponding Spark code from its calculation rules; the executor runs the specific code, the test code runs with the configured concurrency and a limited calculation volume, and the extraction and calculation tasks can be selected to go online after the test execution succeeds.
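A parser of the kind described above could turn such a configuration into either a Spark SQL statement or the content of a Sqoop shell script; the sketch below shows only the branching between the two execution modes, with illustrative parameter names, and omits field mappings, filters and alarm rules.

object ParserSketch {
  // Turns the page-selected rules into an executable artifact: either a Spark SQL
  // statement or the content of a Sqoop shell script. Parameter names are illustrative.
  def parse(sourceTable: String,
            incrementField: Option[String],
            parallelism: Int,
            useSqoop: Boolean,
            jdbcUrl: String): String =
    if (useSqoop)
      s"sqoop import --connect $jdbcUrl --table $sourceTable " +
        s"--target-dir /warehouse/staging/$sourceTable --num-mappers $parallelism"
    else {
      // Incremental extraction appends a watermark predicate on the increment basis field.
      val where = incrementField.map(f => s" WHERE $f > :last_watermark").getOrElse("")
      s"SELECT * FROM $sourceTable$where"
    }

  def main(args: Array[String]): Unit = {
    println(parse("user_info", Some("update_time"), 4, useSqoop = false, "jdbc:mysql://db-host:3306/demo"))
    println(parse("user_info", None, 4, useSqoop = true, "jdbc:mysql://db-host:3306/demo"))
  }
}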
The bottom-layer data extraction uses Spark and Sqoop. When a task is triggered, different interfaces are matched to extract the data according to the task conditions; after the data are obtained, Spark SQL statements are spliced or Spark Core is used for data cleaning according to the corresponding rules. The Spark code or Sqoop script generated by the parser is stored in HDFS; when a task is triggered, the executor selects the matching execution engine to start running the specific script or code. During execution the executor pulls the data quality rules that the parser produced for the task and raises real-time alarms by monitoring the field conditions defined by those rules, for example when the gender field is given a (male, female) enumeration check. After the extraction task finishes, the executor also collects statistics for data quality monitoring, such as the number of records extracted this time, the extraction speed, the extraction time and the execution time. Once the statistical indicators have been persisted, the quality monitoring logic runs: the page monitoring rules parsed by the parser are compared with the extraction result, and if a configured threshold is reached, a real-time alarm is raised through the configured alarm channel, for example when the number of records extracted this time is less than 50% of the historical average.
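A field-level quality rule such as the gender enumeration check can be evaluated directly over the extracted data with the Spark DataFrame API; the following sketch counts violating rows and prints an alert, with column names, sample data and the notification channel left as illustrative placeholders.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object QualityCheckSketch {
  // Counts rows whose value is outside the allowed enumeration or is null.
  def enumViolations(df: DataFrame, field: String, allowed: Seq[String]): Long =
    df.filter(!col(field).isin(allowed: _*) || col(field).isNull).count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("quality-check").master("local[*]").getOrCreate()
    import spark.implicits._

    // Illustrative extracted data; in the described system this would be the extraction result.
    val extracted = Seq(("u1", "male"), ("u2", "female"), ("u3", null)).toDF("user_id", "gender")

    val bad = enumViolations(extracted, "gender", Seq("male", "female"))
    if (bad > 0)
      println(s"[ALERT] $bad rows violate the gender enumeration rule")  // a real system would notify by mail/SMS/IM
    spark.stop()
  }
}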
3) Task management process: an extraction threshold is configured in advance for each extraction task, or a particular extraction indicator is configured precisely. When the executor finishes a task, the task management module writes the whole run's execution time, the execution conditions including the number of records and the execution state, and the indicator information tied to the specific thresholds back to the platform service library for record keeping and statistics. Concretely, after each task finishes, the result information is written into a SUCCESS file on HDFS; if the SUCCESS file exists, the task information recorded in the file, such as execution time and resource usage, is read and written into the service library.
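The SUCCESS-file handshake between the executor and the task management module can be sketched with the Hadoop FileSystem API; the directory layout and the idea of storing the run statistics as the file's content are assumptions for illustration.

import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object TaskResultSketch {
  private val conf = new Configuration()

  // Executor side: write a SUCCESS marker containing the run statistics.
  def writeSuccess(taskDir: String, stats: String): Unit = {
    val fs  = FileSystem.get(conf)
    val out = fs.create(new Path(taskDir, "SUCCESS"), true)   // overwrite if already present
    try out.write(stats.getBytes(StandardCharsets.UTF_8)) finally out.close()
  }

  // Task-management side: if the marker exists the run succeeded; read it back so the
  // statistics can be persisted to the service library (persistence itself omitted here).
  def readResult(taskDir: String): Option[String] = {
    val fs   = FileSystem.get(conf)
    val file = new Path(taskDir, "SUCCESS")
    if (!fs.exists(file)) None
    else {
      val in = fs.open(file)
      try Some(scala.io.Source.fromInputStream(in, "UTF-8").mkString) finally in.close()
    }
  }
}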
Meanwhile, alarms can be configured for important tasks; when a task fails, alarm information is sent to the relevant developers through various communication channels so that the data can be verified quickly and the loss is minimized when a task goes wrong.
4) Metadata maintenance process: when a data source is configured it is stored in the service library; when calculation logic is configured, the related data records participating in the calculation are stored; and after the executor has processed the data, the extraction destination is recorded and stored. In this way the whole journey of the data, where it comes from, which calculations it passes through and where it goes, is captured.
The invention mainly addresses the extraction requirements of these three kinds of data; the different data sources are supported through Spark and through the MapReduce layer underlying Sqoop. Data source configuration, customizable extraction rules and the choice of extraction destination determine the whole flow of the data, and data metadata and task metadata run through the entire process.
The beneficial effects are as follows: the distributed data extraction tool is designed by combining mainstream big data computing frameworks' support for multi-source, multi-structure data with metadata-driven governance of data lineage and data quality. It addresses the needs common to data extraction scenarios, such as efficient distribution, simple management and high stability. The kernel uses Spark and Sqoop as a big data ETL offline synchronization tool for heterogeneous data sources: data sources are configured through pages, multiple data sources are adapted, data extraction rules and task alarm thresholds are configured, and the extraction destination is chosen. Data metadata and task metadata management are provided to solve big data ETL problems such as unified configuration management of multi-source data extraction, data quality management and data lineage management.
Drawings
FIG. 1 is the overall architecture diagram of the present invention;
FIG. 2 is the configuration and extraction flow diagram of the present invention;
FIG. 3 is the scheduling dependency graph.
Detailed Description
The structure of the invention is shown in figure 1, and the flow chart is shown in figure 2.
The invention comprises the following steps:
the method comprises the following steps: a user login platform selects a data source with corresponding authority, such as ORACLE service;
step two: select the original table and the target table to be converted, for example a user sub-table in ORACLE;
step three: map the fields of the original table and the target table, for example mapping the unified user identification id stored in Hive to the user sub-table's user_id;
step four: store the metadata according to the cleaning rules of the source and target ends, including the original field information of the ORACLE user sub-table, the mapping relation to the target Hive table, and so on;
step five: select the data conversion filtering conditions, for example filtering out records whose user ID is null;
step six: select the alarm conditions, for example the number field being empty, or the data extracted each time being less than 50% of the average;
step seven: selecting an execution cycle type and setting a specific trigger cycle;
step eight: the flow ends.
The fourth step is specifically as follows:
2.1. Build the parsers: the synchronization conditions, calculation conditions and alarm conditions selected on the page are parsed by each parser into corresponding Spark code, and if the Sqoop execution mode is selected, the corresponding Sqoop shell script is generated;
2.2. Build the executor: the Spark code or Sqoop script generated by the parser is stored in HDFS, and after a task is triggered the executor selects the matching execution engine to run the specific script or code;
2.3. Build the scheduler: on the one hand the scheduler can set task dependencies, each task being able to set dependent parent tasks and triggered child tasks, finally forming the complete task dependency network required by the user; on the other hand a task failure policy can be set, the policy being selectable as timeout retry/failure retry; finally, the tasks to be executed are scheduled onto nodes with sufficient resources according to the cluster's current resource situation.
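The dependency network and failure policy of 2.1 to 2.3 reduce to a small data model plus two decisions, whether a task is ready to run and whether it should be retried; the sketch below illustrates this, with type and field names that are assumptions rather than the patent's implementation.

// Illustrative scheduler model; the names are assumptions, not the patent's implementation.
sealed trait FailurePolicy
final case class TimeoutRetry(maxRuntimeSeconds: Long) extends FailurePolicy  // retry once runtime exceeds the threshold
case object FailureRetry extends FailurePolicy                                // rerun based on the final state

final case class ScheduledTask(
  id: String,
  parents: Set[String],   // dependent parent tasks, e.g. B and C
  children: Set[String],  // triggered child tasks, e.g. D and E
  policy: FailurePolicy
)

object SchedulerSketch {
  // A task may start only once all of its parent tasks have finished successfully.
  def ready(task: ScheduledTask, finished: Set[String]): Boolean =
    task.parents.subsetOf(finished)

  // Decide whether to rerun, given the elapsed runtime and whether the run failed.
  def shouldRetry(task: ScheduledTask, elapsedSeconds: Long, failed: Boolean): Boolean =
    task.policy match {
      case TimeoutRetry(limit) => elapsedSeconds >= limit   // the previous run's partial files would be deleted before retrying
      case FailureRetry        => failed
    }
}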
Platform construction process
The method comprises the following steps: data source configuration
To enable page-based operation, the data sources need to be uniformly managed and configured in advance. The base-table metadata is scanned automatically by configuring the data source on the page. In subsequent operation, the user only needs to check and drag on the page to select the libraries and tables to be synchronized, without writing complex synchronization scripts or code. On the other hand, configurable data permissions are managed through the data source, avoiding the scenario where one user password carries the permissions of all libraries and tables, and guaranteeing data security.
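One way the automatic scan of base-table metadata for a configured relational source could be performed is through the JDBC DatabaseMetaData interface, as sketched below; the connection details are placeholders and schema handling differs between databases.

import java.sql.DriverManager
import scala.collection.mutable.ListBuffer

object MetadataScanSketch {
  final case class ColumnMeta(table: String, column: String, sqlType: String)

  // Scans table and column metadata from a relational source via JDBC DatabaseMetaData.
  // The URL and credentials are placeholders; catalog/schema filtering varies by database.
  def scan(jdbcUrl: String, user: String, password: String): Seq[ColumnMeta] = {
    val conn = DriverManager.getConnection(jdbcUrl, user, password)
    try {
      val meta    = conn.getMetaData
      val out     = ListBuffer.empty[ColumnMeta]
      val columns = meta.getColumns(null, null, "%", "%")   // all tables, all columns
      while (columns.next()) {
        out += ColumnMeta(
          columns.getString("TABLE_NAME"),
          columns.getString("COLUMN_NAME"),
          columns.getString("TYPE_NAME"))
      }
      out.toList
    } finally conn.close()
  }
}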
Step two: platform actuator construction
After the data source has been built, the user can create an extraction task on the page. The extraction task first selects a data source, and the extraction calculation rules are then checked against that data source: the extraction rule is selected as single extraction/incremental extraction/full extraction; whether concurrent execution is needed is chosen by check-box, and the split field used for concurrency is checked; when incremental extraction is selected, an increment basis field must be chosen, and the system automatically lists the candidate fields so the increment basis field can be ticked. After adding, the fields of the table to be synchronized are ticked and field check rules are set, for example adding a (male, female) enumeration-value check to the gender field. After the settings succeed, the storage medium written into the data center is set in the next step, the storage being HBase/Kudu/HDFS/Hive. Finally, after the configuration from data source through intermediate calculation rules to output end is completed, the ETL task is added to the timed tasks; the timing rule can be selected on the page and the calculation rules are stored in the database by the backend. When the timing condition is triggered, the platform's calculation executor extracts the data according to the extraction rules in the library.
The platform executor is built as follows: the bottom-layer data extraction uses Spark and Sqoop. When a task is triggered, different interfaces are matched to extract the data according to the task conditions; after the data are obtained, Spark SQL statements are spliced or Spark Core is used for data cleaning according to the corresponding rules.
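Reading a source table over JDBC and cleaning it with a spliced Spark SQL statement might look like the following sketch; the connection options, table names, filter and target table are illustrative, and writing to a warehouse table assumes a Hive-enabled session.

import org.apache.spark.sql.SparkSession

object ExtractAndCleanSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("extract-clean").master("local[*]").getOrCreate()

    // Pull the source table over JDBC; the driver jar must be on the classpath and the settings are placeholders.
    val raw = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://db-host:3306/demo")
      .option("dbtable", "user_info")
      .option("user", "etl")
      .option("password", "secret")
      .load()

    // "Splice" the cleaning rules selected on the page into a SQL statement.
    raw.createOrReplaceTempView("user_info")
    val cleaned = spark.sql(
      "SELECT user_id, gender, update_time FROM user_info WHERE user_id IS NOT NULL")

    // Land the result in the chosen target; assumes Hive support and an existing dw database.
    cleaned.write.mode("overwrite").saveAsTable("dw.user_info")

    spark.stop()
  }
}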
Step three: task management build
Because the extraction tasks are numerous and the big data platform's calculations and statistics downstream of them carry many upper-layer services, and because traditional task scheduling can only check whether a task executed successfully, not whether the execution result is correct or how efficiently it ran, data quality remains low and must be improved; a task management module is therefore built.
The task management process is as follows: an extraction threshold is configured in advance for each extraction task, for example a day-over-day comparison against yesterday's data volume, or a particular extraction indicator is configured precisely. After the executor finishes a task, the task management module writes the whole run's execution time, the execution conditions including the number of records and the execution state, and the indicator information tied to the specific thresholds back to the platform service library for record keeping and statistics. The tasks are displayed on a task execution page each day, and alarms can be configured for important tasks; when a task fails, alarm information is sent to the relevant developers through various communication channels so that the data can be verified quickly and the loss is minimized when a task goes wrong.
Step four: metadata maintenance
Because the extracted data must be managed efficiently, metadata construction is particularly important. It includes data lineage management, which matters greatly for knowing where the data goes and which tasks it will participate in, and for problem troubleshooting. Metadata is also needed for data quality management: for all services of a big data platform, the accuracy of the extracted data is the most important guarantee. The main metadata maintenance process is as follows: when a data source is configured it is stored in the service library; when calculation logic is configured, the related data records participating in the calculation are stored; and after the executor has processed the data, the extraction destination is recorded and stored. In this way it is known where the data comes from, where it goes and in which calculations it participates.
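A minimal way to keep the record of where the data came from, which calculation it passed through and where it landed is a lineage entry written at each of the three points mentioned above; the record layout below is an assumption for illustration, not the patent's table design.

import java.time.Instant

// Illustrative lineage record; the schema is an assumption, not the patent's table design.
final case class LineageRecord(
  taskId: String,
  sourceSystem: String,    // e.g. the ORACLE business database
  sourceTable: String,
  transformation: String,  // the calculation rule the data passed through
  targetSystem: String,    // e.g. Hive in the data center
  targetTable: String,
  recordedAt: Instant
)

object LineageSketch {
  // In the described flow a record would be stored when the source is configured,
  // when the calculation logic is configured, and again after the executor runs.
  def record(r: LineageRecord): Unit =
    println(s"lineage: ${r.sourceSystem}.${r.sourceTable} -> ${r.transformation} -> ${r.targetSystem}.${r.targetTable}")

  def main(args: Array[String]): Unit =
    record(LineageRecord("task-001", "oracle", "user_split", "field mapping + null filter",
                         "hive", "dw.user_info", Instant.now()))
}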
A shell script, similar to Windows/DOS batch processing, places commands into a program file in advance so that they can be executed in one pass; it mainly makes setup and management convenient for administrators. ETL extracts, transforms and loads data from a source to a destination. The term ETL is most commonly used in the context of data warehouses, but its object is not limited to data warehouses.

Claims (4)

1. An intelligent data conversion method based on Spark is characterized in that,
the unified data conversion system based on the distributed computing framework Spark comprises a parser, an executor and a scheduler; the information selected on the page is sent to the parser, and each parser translates it into corresponding Spark code or, when the Sqoop execution mode is selected, generates and stores the corresponding Sqoop shell script; the Spark code or Sqoop script generated by the parser is stored in HDFS, and after a task is triggered the executor selects the matching execution engine to run the specific script or code; according to the scheduling dependency graph, the scheduler on the one hand sets task dependencies, each task being able to set dependent parent tasks such as tasks B and C, or triggered child tasks such as tasks D and E, the individual dependency relations finally forming the complete task dependency network required by the user;
on the other hand, a task failure policy can be set, the policy being selectable as timeout retry/failure retry, wherein during task execution the executor records the task's execution state and execution conditions, such as elapsed time and resource usage; when the execution detection policy is set to timeout retry, a task retry is triggered once the task's execution time reaches the set threshold, and the files of the last unfinished run are forcibly deleted before the retry; when set to failure retry, whether to rerun is judged from the state when the task finishes;
finally, the tasks to be executed are scheduled onto nodes with sufficient resources according to the cluster's current resource situation;
the method comprises the following specific steps:
1) data source configuration: multiple data sources are uniformly managed and configured in advance, and the base-table metadata is automatically scanned through the page configuration of the data source; after the data source is built, a user chooses to create an extraction task on the page;
2) executor construction: the extraction task first selects a data source, and the extraction calculation rules are then checked against that data source, the extraction rule being selected as single extraction/incremental extraction/full extraction; whether concurrent execution is needed is chosen by checking, and the split field required for concurrency is checked; when incremental extraction is selected, the increment basis field must be selected;
field check rules are set; after the settings succeed, the storage medium written into the data center is set in the next step;
after the data source, the intermediate calculation rules and the output end are configured, the ETL task is finally added to the timed tasks, the timing rule can be selected on the page, and the calculation rules are stored in the database by the backend; after the synchronization conditions selected on the page are confirmed, the backend parses the calculation and alarm conditions with the parser and persists them; after the task is later confirmed for going online, the parser translates the corresponding task into Spark or Sqoop code; the extraction task performs a test extraction according to the extraction conditions, with concurrency and data volume limited during the test, and can be scheduled to go online after the test passes; a calculation task generates the corresponding Spark code according to its calculation rules, the executor runs the specific code, the test code runs with the configured concurrency and a limited calculation volume, and the extraction and calculation tasks can be selected to go online after the test execution succeeds;
the bottom-layer data extraction uses Spark and Sqoop; when a task is triggered, different interfaces are matched to extract the data according to the task conditions, and after the data are obtained, Spark SQL statements are spliced or Spark Core is used for data cleaning according to the corresponding rules; the Spark code or Sqoop script generated by the parser is stored in HDFS, and when a task is triggered the executor selects the matching execution engine to run the specific script or code; during execution the executor pulls the data quality rules produced by the parser for the task and raises real-time alarms by monitoring the field conditions defined by those rules, for example when the gender field is given a (male, female) enumeration check; after the extraction task finishes, the executor counts related data for data quality monitoring, such as the number of records extracted this time, the extraction speed, the extraction time and the execution time; after the statistical indicators are persisted, the quality monitoring logic is executed, the page monitoring rules parsed by the parser are compared with the extraction result, and if a set threshold is reached, a real-time alarm is raised through the set alarm communication mode;
3) in the task management process, an extraction threshold is configured in advance for each extraction task, or a particular extraction indicator is configured precisely; when the executor finishes a task, the task management module writes the whole run's execution time, the execution conditions including the number of records and the execution state, and the indicator information tied to the specific thresholds back to the platform service library for record keeping and statistics; the specific execution process is that, after each task finishes, the result information is written into a SUCCESS file on HDFS; after each task finishes, a statistics process is triggered to judge whether the SUCCESS file exists, the task having executed successfully if it exists and having failed otherwise, and the state is finally recorded into the execution state table; if the SUCCESS file exists, the task information recorded in the file, such as the execution time and resource usage, is read and written into the service library;
meanwhile, an alarm can be configured for an important task, and when the task fails alarm information is sent to the relevant developers through various communication channels, so that the data can be verified quickly and the loss is minimized when the task goes wrong;
4) metadata maintenance process: when a data source is configured it is stored in the service library, when calculation logic is configured the related data records participating in the calculation are stored, and after the executor has processed the data the extraction destination is recorded and stored; in this way, where the data comes from, where it goes and in which calculations it participates are all captured.
2. The Spark-based intelligent data conversion method according to claim 1, wherein:
the method comprises the following steps:
the method comprises the following steps: a user login platform selects a data source with a corresponding authority;
step two: selecting an original table and a target table which need to be transformed;
step three: mapping fields of the original table and the target table;
step four: storing metadata according to the cleaning rules of the source end and the target end;
step five: selecting a data conversion filtering condition;
step six: selecting an alarm condition;
step seven: selecting an execution cycle type and setting a specific trigger cycle;
step eight: the flow ends.
3. The Spark-based intelligent data conversion method as claimed in claim 1, wherein the second step specifically comprises:
2.1) building the parsers: the synchronization conditions, calculation conditions and alarm conditions selected on the page are parsed by each parser into corresponding Spark code, and if the Sqoop execution mode is selected, the corresponding Sqoop shell script is generated;
2.2) building the executor: the Spark code or Sqoop script generated by the parser is stored in HDFS, and after a task is triggered the executor selects the matching execution engine to run the specific script or code;
2.3) building the scheduler: on the one hand the scheduler can set task dependencies, each task being able to set dependent parent tasks and triggered child tasks, finally forming the complete task dependency network required by the user; on the other hand a task failure policy can be set, the policy being selectable as timeout retry/failure retry; finally, the tasks to be executed are scheduled onto nodes with sufficient resources according to the cluster's current resource situation.
4. The Spark-based intelligent data conversion method according to claim 1, wherein the alarm condition is that the number of records extracted this time is less than 50% of the historical average.
CN202110756908.5A 2021-07-05 2021-07-05 Spark-based intelligent data conversion method Active CN113641739B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110756908.5A CN113641739B (en) 2021-07-05 2021-07-05 Spark-based intelligent data conversion method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110756908.5A CN113641739B (en) 2021-07-05 2021-07-05 Spark-based intelligent data conversion method

Publications (2)

Publication Number Publication Date
CN113641739A CN113641739A (en) 2021-11-12
CN113641739B (en) 2022-09-06

Family

ID=78416709

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110756908.5A Active CN113641739B (en) 2021-07-05 2021-07-05 Spark-based intelligent data conversion method

Country Status (1)

Country Link
CN (1) CN113641739B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115858310B (en) * 2023-03-01 2023-07-21 美云智数科技有限公司 Abnormal task identification method, device, computer equipment and storage medium
CN116089518A (en) * 2023-04-07 2023-05-09 广州思迈特软件有限公司 Data model extraction method and system, terminal and medium

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10275278B2 (en) * 2016-09-14 2019-04-30 Salesforce.Com, Inc. Stream processing task deployment using precompiled libraries
CN108427709B (en) * 2018-01-25 2020-10-16 朗新科技集团股份有限公司 Multi-source mass data processing system and method
US10705883B2 (en) * 2018-06-19 2020-07-07 Microsoft Technology Licensing, Llc Dynamic hybrid computing environment
CN109165202A (en) * 2018-07-04 2019-01-08 华南理工大学 A kind of preprocess method of multi-source heterogeneous big data
CN109684053B (en) * 2018-11-05 2023-08-01 广东岭南通股份有限公司 Task scheduling method and system for big data
CN110069335A (en) * 2019-05-07 2019-07-30 江苏满运软件科技有限公司 Task processing system, method, computer equipment and storage medium
CN110457371A (en) * 2019-08-13 2019-11-15 杭州有赞科技有限公司 Data managing method, device, storage medium and system
CN112100265A (en) * 2020-09-17 2020-12-18 博雅正链(北京)科技有限公司 Multi-source data processing method and device for big data architecture and block chain
CN112102111B (en) * 2020-09-27 2021-06-08 华电福新广州能源有限公司 Intelligent processing system for power plant data
CN112596876A (en) * 2020-12-17 2021-04-02 平安普惠企业管理有限公司 Task scheduling method, device and related equipment
CN112632174A (en) * 2020-12-31 2021-04-09 江苏苏宁云计算有限公司 Data inspection method, device and system
CN112925619A (en) * 2021-02-24 2021-06-08 深圳依时货拉拉科技有限公司 Big data real-time computing method and platform

Also Published As

Publication number Publication date
CN113641739A (en) 2021-11-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A Smart Data Conversion Method Based on Spark

Granted publication date: 20220906

Pledgee: Bank of China Limited by Share Ltd. Jiangsu branch

Pledgor: NANJING LINKAGE INFORMATION TECHNOLOGY Co.,Ltd.

Registration number: Y2024980011768