CN113641739B - Spark-based intelligent data conversion method - Google Patents

Spark-based intelligent data conversion method

Info

Publication number
CN113641739B
CN113641739B
Authority
CN
China
Prior art keywords
task
data
extraction
spark
execution
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110756908.5A
Other languages
Chinese (zh)
Other versions
CN113641739A (en)
Inventor
王仁俊
罗义斌
胡明慧
魏阳
李军
司震
宋炜伟
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing Linkage Information Technology Co ltd
Original Assignee
Nanjing Linkage Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing Linkage Information Technology Co ltd
Priority to CN202110756908.5A
Publication of CN113641739A
Application granted
Publication of CN113641739B
Legal status: Active (current)
Anticipated expiration

Classifications

    • G06F 16/254 Extract, transform and load [ETL] procedures, e.g. ETL data flows in data warehouses
    • G06F 11/3006 Monitoring arrangements where the computing system is distributed, e.g. networked systems, clusters, multiprocessor systems
    • G06F 11/302 Monitoring arrangements where the monitored component is a software system
    • G06F 11/3093 Monitoring probes: configuration details thereof, e.g. installation, enabling, spatial arrangement
    • G06F 11/327 Alarm or error message display
    • G06F 11/3452 Performance evaluation by statistical analysis
    • G06F 16/21 Design, administration or maintenance of databases
    • G06F 16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F 16/27 Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F 9/45512 Command shells
    • G06F 2201/80 Database-specific techniques (indexing scheme relating to error detection, error correction and monitoring)
    • G06F 2201/805 Real-time (indexing scheme relating to error detection, error correction and monitoring)

Abstract

A Spark-based intelligent data conversion method, implemented on a unified data conversion system built on the distributed computing framework Spark, the system comprising a parser, an executor and a scheduler. The information selected on the page is sent to the parser, and each parser translates it into corresponding Spark code or, when the Sqoop execution mode is selected, generates and stores the corresponding Sqoop shell script; the Spark code or Sqoop script generated by the parser is stored in HDFS, and after a task is triggered the executor selects the matching execution engine to run the specific script or code. Following the scheduling dependency graph, the scheduler can set dependent parent tasks (such as tasks B and C) or triggered child tasks (such as tasks D and E) for each task, the individual dependencies finally forming the complete task dependency network required by the user; the scheduler can also set a task failure policy. Finally, the tasks to be executed are scheduled onto nodes with sufficient resources according to the cluster's current resource situation.

Description

Spark-based intelligent data conversion method
Technical Field
The invention relates to a method that uses the Spark distributed computing component to realize high-performance, highly available intelligent conversion of mass data.
Background
With the development of internet services, the storage media used by those services have become increasingly diverse. In the course of business development, different data stores are inevitably used: relational databases such as MySQL, Oracle and SQL Server; document stores such as MongoDB; structured and unstructured stores such as HBase and Elasticsearch; and big data storage components such as Hive and Kudu. A method that unifies extraction and conversion across these components is urgently needed. In addition, data extraction involves many data sources that lack governance; if every extraction task is managed only by scripts, task states are hard to manage and maintain and data quality is hard to govern, so the upstream services carried by the data department may lack accuracy. A platform is therefore needed to manage and maintain data metadata and task metadata.
Spark is a common cluster computing framework that provides efficient in-memory computation by distributing large data set computation tasks over multiple computers. A distributed computing framework must address two issues: how to distribute the data and how to distribute the computation. Hadoop uses HDFS to solve the distributed data problem, and the MapReduce computing paradigm provides distributed computation. Similarly, Spark offers a multi-language functional programming API and, through a distributed data abstraction called Resilient Distributed Datasets (RDDs), provides more operators than just map and reduce. Essentially, an RDD is a programming abstraction representing a read-only collection of objects that can be partitioned across machines.
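As a concrete illustration of the RDD model just described, the following minimal Spark (Scala) sketch distributes a word count across partitions; the input path, application name and local master setting are placeholders, not part of the invention.

import org.apache.spark.sql.SparkSession

object RddWordCountSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("rdd-sketch")
      .master("local[*]")          // local mode for illustration; a real cluster would use YARN or standalone
      .getOrCreate()

    // An RDD is a read-only, partitioned collection distributed across the cluster.
    val lines = spark.sparkContext.textFile("hdfs:///tmp/input.txt")  // illustrative path

    // Operators beyond map and reduce: flatMap, filter, reduceByKey, etc.
    val counts = lines
      .flatMap(_.split("\\s+"))
      .filter(_.nonEmpty)
      .map(word => (word, 1))
      .reduceByKey(_ + _)

    counts.take(10).foreach(println)
    spark.stop()
  }
}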
Sqoop is an open-source tool mainly used to transfer data between Hadoop and traditional databases (MySQL, PostgreSQL, ...). It can import data from a relational database (such as MySQL, Oracle or Postgres) into Hadoop's HDFS, and can also export data from HDFS into a relational database. HDFS is the Hadoop Distributed File System; it is highly fault tolerant, is designed to be deployed on inexpensive hardware, and provides high-throughput access for applications.
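For orientation, the kind of Sqoop import that such a tool wraps can be sketched as below; the command is assembled from Scala, and the JDBC URL, credentials, table name and target directory are illustrative placeholders.

object SqoopImportSketch {
  // Assembles a representative `sqoop import` command; all connection details are placeholders.
  def buildImportCommand(jdbcUrl: String, user: String, password: String,
                         table: String, targetDir: String, parallelism: Int): Seq[String] =
    Seq(
      "sqoop", "import",
      "--connect", jdbcUrl,          // e.g. jdbc:mysql://host:3306/db
      "--username", user,
      "--password", password,
      "--table", table,
      "--target-dir", targetDir,     // HDFS directory where the imported data lands
      "--num-mappers", parallelism.toString
    )

  def main(args: Array[String]): Unit = {
    val cmd = buildImportCommand("jdbc:mysql://db-host:3306/demo", "etl", "secret",
                                 "orders", "/warehouse/staging/orders", 4)
    // In the described system such a command would be saved as a shell script and launched by the executor.
    println(cmd.mkString(" "))
  }
}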
Disclosure of Invention
With the continuous development of business, multi-source, multi-structure data grows rapidly, and requirements such as data quality monitoring, data lineage management, improved data extraction efficiency and metadata management have become urgent. The invention provides an intelligent data conversion tool, together with data metadata and task metadata governance, to solve big data ETL problems such as unified configuration management of multi-source data extraction, data quality management and data lineage management.
The technical scheme of the invention is a Spark-based intelligent data conversion method running on a unified data conversion system built on the distributed computing framework Spark; the system comprises a parser, an executor and a scheduler. The information selected on the page is sent to the parser, and each parser translates it into corresponding Spark code or, when the Sqoop execution mode is selected, generates and stores the corresponding Sqoop shell script. The Spark code or Sqoop script generated by the parser is stored in HDFS; when a task is triggered, the executor selects the matching execution engine to run the specific script or code. According to the scheduling dependency graph shown in fig. 3, the scheduler on the one hand sets task dependencies: each task can be given dependent parent tasks (such as tasks B and C) and triggered child tasks (such as tasks D and E), and the individual dependencies together form the complete task dependency network required by the user. On the other hand, a task failure policy can be set, selectable as timeout retry or failure retry. During execution the executor records the task's execution state and conditions, such as elapsed time and resource usage. If the detection policy is set to timeout retry, a retry is triggered once the task's execution time reaches the configured threshold, and the files of the previous unfinished run are forcibly deleted before the retry; if set to failure retry, whether to rerun is decided from the task's state when it finishes. Finally, the tasks to be executed are scheduled onto nodes with sufficient resources according to the cluster's current resource situation;
the method comprises the following specific steps:
1) Data source configuration: multiple data sources are uniformly managed and configured in advance, and the base-table metadata is automatically scanned through the page configuration of the data source; after the data source is built, the user chooses to create an extraction task on the page;
2) Executor construction: the extraction task first selects a data source, and the extraction calculation rules are then checked against that data source, the extraction rule being selected as single extraction/incremental extraction/full extraction; whether concurrent execution is needed is chosen by checking, and the split field required for concurrency is checked; when incremental extraction is selected, the increment basis field must be selected; field check rules are set; after the settings succeed, the storage medium written into the data center is set in the next step;
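One possible way to model the rules selected on the page in step 2), the extraction mode, increment basis field, concurrency, field check rules and target storage, is a plain configuration record; the type and field names below are assumptions made for illustration, not the patent's schema.

// Illustrative configuration model; the names and fields are assumptions, not the patent's schema.
sealed trait ExtractionMode
case object SingleExtraction      extends ExtractionMode
case object IncrementalExtraction extends ExtractionMode
case object FullExtraction        extends ExtractionMode

sealed trait TargetStore
case object HiveStore  extends TargetStore
case object HBaseStore extends TargetStore
case object KuduStore  extends TargetStore
case object HdfsStore  extends TargetStore

// A field check rule, for example an enumeration check on the gender column.
final case class FieldCheck(field: String, allowedValues: Set[String])

final case class ExtractionTask(
  taskId: String,
  dataSourceId: String,
  sourceTable: String,
  mode: ExtractionMode,
  incrementField: Option[String],   // required when mode == IncrementalExtraction
  concurrent: Boolean,
  splitByField: Option[String],     // split field used when running concurrently
  fieldChecks: Seq[FieldCheck],
  target: TargetStore
)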
After the data source, the intermediate calculation rules and the output end have been configured, the ETL task is finally added to the timed tasks; the timing rule can be selected on the page, and the calculation rules are stored in the database by the backend. Once the synchronization conditions selected on the page are confirmed, the backend parses the calculation and alarm conditions with the parser and persists them; after the task is later confirmed for going online, the parser translates the corresponding task into Spark or Sqoop code. The extraction task can be test-extracted according to the extraction conditions, with concurrency and data volume limited during the test, and can be scheduled to go online after the test passes. A calculation task generates the corresponding Spark code from its calculation rules; the executor runs the specific code, the test code runs with the configured concurrency and a limited calculation volume, and the extraction and calculation tasks can be selected to go online after the test execution succeeds.
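A parser of the kind described above could turn such a configuration into either a Spark SQL statement or the content of a Sqoop shell script; the sketch below shows only the branching between the two execution modes, with illustrative parameter names, and omits field mappings, filters and alarm rules.

object ParserSketch {
  // Turns the page-selected rules into an executable artifact: either a Spark SQL
  // statement or the content of a Sqoop shell script. Parameter names are illustrative.
  def parse(sourceTable: String,
            incrementField: Option[String],
            parallelism: Int,
            useSqoop: Boolean,
            jdbcUrl: String): String =
    if (useSqoop)
      s"sqoop import --connect $jdbcUrl --table $sourceTable " +
        s"--target-dir /warehouse/staging/$sourceTable --num-mappers $parallelism"
    else {
      // Incremental extraction appends a watermark predicate on the increment basis field.
      val where = incrementField.map(f => s" WHERE $f > :last_watermark").getOrElse("")
      s"SELECT * FROM $sourceTable$where"
    }

  def main(args: Array[String]): Unit = {
    println(parse("user_info", Some("update_time"), 4, useSqoop = false, "jdbc:mysql://db-host:3306/demo"))
    println(parse("user_info", None, 4, useSqoop = true, "jdbc:mysql://db-host:3306/demo"))
  }
}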
The bottom-layer data extraction uses Spark and Sqoop. When a task is triggered, different interfaces are matched to extract the data according to the task conditions; after the data are obtained, Spark SQL statements are spliced or Spark Core is used for data cleaning according to the corresponding rules. The Spark code or Sqoop script generated by the parser is stored in HDFS; when a task is triggered, the executor selects the matching execution engine to start running the specific script or code. During execution the executor pulls the data quality rules that the parser produced for the task and raises real-time alarms by monitoring the field conditions defined by those rules, for example when the gender field is given a (male, female) enumeration check. After the extraction task finishes, the executor also collects statistics for data quality monitoring, such as the number of records extracted this time, the extraction speed, the extraction time and the execution time. Once the statistical indicators have been persisted, the quality monitoring logic runs: the page monitoring rules parsed by the parser are compared with the extraction result, and if a configured threshold is reached, a real-time alarm is raised through the configured alarm channel, for example when the number of records extracted this time is less than 50% of the historical average.
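A field-level quality rule such as the gender enumeration check can be evaluated directly over the extracted data with the Spark DataFrame API; the following sketch counts violating rows and prints an alert, with column names, sample data and the notification channel left as illustrative placeholders.

import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.col

object QualityCheckSketch {
  // Counts rows whose value is outside the allowed enumeration or is null.
  def enumViolations(df: DataFrame, field: String, allowed: Seq[String]): Long =
    df.filter(!col(field).isin(allowed: _*) || col(field).isNull).count()

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("quality-check").master("local[*]").getOrCreate()
    import spark.implicits._

    // Illustrative extracted data; in the described system this would be the extraction result.
    val extracted = Seq(("u1", "male"), ("u2", "female"), ("u3", null)).toDF("user_id", "gender")

    val bad = enumViolations(extracted, "gender", Seq("male", "female"))
    if (bad > 0)
      println(s"[ALERT] $bad rows violate the gender enumeration rule")  // a real system would notify by mail/SMS/IM
    spark.stop()
  }
}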
3) Task management process: an extraction threshold is configured in advance for each extraction task, or a particular extraction indicator is configured precisely. When the executor finishes a task, the task management module writes the whole run's execution time, the execution conditions including the number of records and the execution state, and the indicator information tied to the specific thresholds back to the platform service library for record keeping and statistics. Concretely, after each task finishes, the result information is written into a SUCCESS file on HDFS; if the SUCCESS file exists, the task information recorded in the file, such as execution time and resource usage, is read and written into the service library.
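The SUCCESS-file handshake between the executor and the task management module can be sketched with the Hadoop FileSystem API; the directory layout and the idea of storing the run statistics as the file's content are assumptions for illustration.

import java.nio.charset.StandardCharsets
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

object TaskResultSketch {
  private val conf = new Configuration()

  // Executor side: write a SUCCESS marker containing the run statistics.
  def writeSuccess(taskDir: String, stats: String): Unit = {
    val fs  = FileSystem.get(conf)
    val out = fs.create(new Path(taskDir, "SUCCESS"), true)   // overwrite if already present
    try out.write(stats.getBytes(StandardCharsets.UTF_8)) finally out.close()
  }

  // Task-management side: if the marker exists the run succeeded; read it back so the
  // statistics can be persisted to the service library (persistence itself omitted here).
  def readResult(taskDir: String): Option[String] = {
    val fs   = FileSystem.get(conf)
    val file = new Path(taskDir, "SUCCESS")
    if (!fs.exists(file)) None
    else {
      val in = fs.open(file)
      try Some(scala.io.Source.fromInputStream(in, "UTF-8").mkString) finally in.close()
    }
  }
}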
Meanwhile, alarms can be configured for important tasks; when a task fails, alarm information is sent to the relevant developers through various communication channels so that the data can be verified quickly and the loss is minimized when a task goes wrong.
4) Metadata maintenance process: when a data source is configured it is stored in the service library; when calculation logic is configured, the related data records participating in the calculation are stored; and after the executor has processed the data, the extraction destination is recorded and stored. In this way the whole journey of the data, where it comes from, which calculations it passes through and where it goes, is captured.
The invention mainly addresses the extraction requirements of these three kinds of data; the different data sources are supported through Spark and through the MapReduce layer underlying Sqoop. Data source configuration, customizable extraction rules and the choice of extraction destination determine the whole flow of the data, and data metadata and task metadata run through the entire process.
The beneficial effects are as follows: the distributed data extraction tool is designed by combining mainstream big data computing frameworks' support for multi-source, multi-structure data with metadata-driven governance of data lineage and data quality. It addresses the needs common to data extraction scenarios, such as efficient distribution, simple management and high stability. The kernel uses Spark and Sqoop as a big data ETL offline synchronization tool for heterogeneous data sources: data sources are configured through pages, multiple data sources are adapted, data extraction rules and task alarm thresholds are configured, and the extraction destination is chosen. Data metadata and task metadata management are provided to solve big data ETL problems such as unified configuration management of multi-source data extraction, data quality management and data lineage management.
Drawings
FIG. 1 is the overall architecture diagram of the present invention;
FIG. 2 is the configuration and extraction flow diagram of the present invention;
FIG. 3 is the scheduling dependency graph.
Detailed Description
The structure of the invention is shown in figure 1, and the flow chart is shown in figure 2.
The invention comprises the following steps:
the method comprises the following steps: a user login platform selects a data source with corresponding authority, such as ORACLE service;
step two: select the original table and the target table to be converted, for example a user sub-table in ORACLE;
step three: map the fields of the original table and the target table, for example mapping the unified user identification id stored in Hive to the user sub-table's user_id;
step four: store the metadata according to the cleaning rules of the source and target ends, including the original field information of the ORACLE user sub-table, the mapping relation to the target Hive table, and so on;
step five: select the data conversion filtering conditions, for example filtering out records whose user ID is null;
step six: select the alarm conditions, for example the number field being empty, or the data extracted each time being less than 50% of the average;
step seven: selecting an execution cycle type and setting a specific trigger cycle;
step eight: the flow ends.
The fourth step is specifically as follows:
2.1. Build the parsers: the synchronization conditions, calculation conditions and alarm conditions selected on the page are parsed by each parser into corresponding Spark code, and if the Sqoop execution mode is selected, the corresponding Sqoop shell script is generated;
2.2. Build the executor: the Spark code or Sqoop script generated by the parser is stored in HDFS, and after a task is triggered the executor selects the matching execution engine to run the specific script or code;
2.3. Build the scheduler: on the one hand the scheduler can set task dependencies, each task being able to set dependent parent tasks and triggered child tasks, finally forming the complete task dependency network required by the user; on the other hand a task failure policy can be set, the policy being selectable as timeout retry/failure retry; finally, the tasks to be executed are scheduled onto nodes with sufficient resources according to the cluster's current resource situation.
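The dependency network and failure policy of 2.1 to 2.3 reduce to a small data model plus two decisions, whether a task is ready to run and whether it should be retried; the sketch below illustrates this, with type and field names that are assumptions rather than the patent's implementation.

// Illustrative scheduler model; the names are assumptions, not the patent's implementation.
sealed trait FailurePolicy
final case class TimeoutRetry(maxRuntimeSeconds: Long) extends FailurePolicy  // retry once runtime exceeds the threshold
case object FailureRetry extends FailurePolicy                                // rerun based on the final state

final case class ScheduledTask(
  id: String,
  parents: Set[String],   // dependent parent tasks, e.g. B and C
  children: Set[String],  // triggered child tasks, e.g. D and E
  policy: FailurePolicy
)

object SchedulerSketch {
  // A task may start only once all of its parent tasks have finished successfully.
  def ready(task: ScheduledTask, finished: Set[String]): Boolean =
    task.parents.subsetOf(finished)

  // Decide whether to rerun, given the elapsed runtime and whether the run failed.
  def shouldRetry(task: ScheduledTask, elapsedSeconds: Long, failed: Boolean): Boolean =
    task.policy match {
      case TimeoutRetry(limit) => elapsedSeconds >= limit   // the previous run's partial files would be deleted before retrying
      case FailureRetry        => failed
    }
}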
Platform construction process
The method comprises the following steps: data source configuration
To enable page-based operation, the data sources need to be uniformly managed and configured in advance. The base-table metadata is scanned automatically by configuring the data source on the page. In subsequent operation, the user only needs to check and drag on the page to select the libraries and tables to be synchronized, without writing complex synchronization scripts or code. On the other hand, configurable data permissions are managed through the data source, avoiding the scenario where one user password carries the permissions of all libraries and tables, and guaranteeing data security.
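One way the automatic scan of base-table metadata for a configured relational source could be performed is through the JDBC DatabaseMetaData interface, as sketched below; the connection details are placeholders and schema handling differs between databases.

import java.sql.DriverManager
import scala.collection.mutable.ListBuffer

object MetadataScanSketch {
  final case class ColumnMeta(table: String, column: String, sqlType: String)

  // Scans table and column metadata from a relational source via JDBC DatabaseMetaData.
  // The URL and credentials are placeholders; catalog/schema filtering varies by database.
  def scan(jdbcUrl: String, user: String, password: String): Seq[ColumnMeta] = {
    val conn = DriverManager.getConnection(jdbcUrl, user, password)
    try {
      val meta    = conn.getMetaData
      val out     = ListBuffer.empty[ColumnMeta]
      val columns = meta.getColumns(null, null, "%", "%")   // all tables, all columns
      while (columns.next()) {
        out += ColumnMeta(
          columns.getString("TABLE_NAME"),
          columns.getString("COLUMN_NAME"),
          columns.getString("TYPE_NAME"))
      }
      out.toList
    } finally conn.close()
  }
}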
Step two: platform actuator construction
After the data source has been built, the user can create an extraction task on the page. The extraction task first selects a data source, and the extraction calculation rules are then checked against that data source: the extraction rule is selected as single extraction/incremental extraction/full extraction; whether concurrent execution is needed is chosen by check-box, and the split field used for concurrency is checked; when incremental extraction is selected, an increment basis field must be chosen, and the system automatically lists the candidate fields so the increment basis field can be ticked. After adding, the fields of the table to be synchronized are ticked and field check rules are set, for example adding a (male, female) enumeration-value check to the gender field. After the settings succeed, the storage medium written into the data center is set in the next step, the storage being HBase/Kudu/HDFS/Hive. Finally, after the configuration from data source through intermediate calculation rules to output end is completed, the ETL task is added to the timed tasks; the timing rule can be selected on the page and the calculation rules are stored in the database by the backend. When the timing condition is triggered, the platform's calculation executor extracts the data according to the extraction rules in the library.
The platform executor is built as follows: the bottom-layer data extraction uses Spark and Sqoop. When a task is triggered, different interfaces are matched to extract the data according to the task conditions; after the data are obtained, Spark SQL statements are spliced or Spark Core is used for data cleaning according to the corresponding rules.
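Reading a source table over JDBC and cleaning it with a spliced Spark SQL statement might look like the following sketch; the connection options, table names, filter and target table are illustrative, and writing to a warehouse table assumes a Hive-enabled session.

import org.apache.spark.sql.SparkSession

object ExtractAndCleanSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("extract-clean").master("local[*]").getOrCreate()

    // Pull the source table over JDBC; the driver jar must be on the classpath and the settings are placeholders.
    val raw = spark.read.format("jdbc")
      .option("url", "jdbc:mysql://db-host:3306/demo")
      .option("dbtable", "user_info")
      .option("user", "etl")
      .option("password", "secret")
      .load()

    // "Splice" the cleaning rules selected on the page into a SQL statement.
    raw.createOrReplaceTempView("user_info")
    val cleaned = spark.sql(
      "SELECT user_id, gender, update_time FROM user_info WHERE user_id IS NOT NULL")

    // Land the result in the chosen target; assumes Hive support and an existing dw database.
    cleaned.write.mode("overwrite").saveAsTable("dw.user_info")

    spark.stop()
  }
}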
Step three: task management build
Because the extraction tasks are numerous and the big data platform's calculations and statistics downstream of them carry many upper-layer services, and because traditional task scheduling can only check whether a task executed successfully, not whether the execution result is correct or how efficiently it ran, data quality remains low and must be improved; a task management module is therefore built.
The task management process is as follows: an extraction threshold is configured in advance for each extraction task, for example a day-over-day comparison against yesterday's data volume, or a particular extraction indicator is configured precisely. After the executor finishes a task, the task management module writes the whole run's execution time, the execution conditions including the number of records and the execution state, and the indicator information tied to the specific thresholds back to the platform service library for record keeping and statistics. The tasks are displayed on a task execution page each day, and alarms can be configured for important tasks; when a task fails, alarm information is sent to the relevant developers through various communication channels so that the data can be verified quickly and the loss is minimized when a task goes wrong.
Step four: metadata maintenance
Because the extracted data must be managed efficiently, metadata construction is particularly important. It includes data lineage management, which matters greatly for knowing where the data goes and which tasks it will participate in, and for problem troubleshooting. Metadata is also needed for data quality management: for all services of a big data platform, the accuracy of the extracted data is the most important guarantee. The main metadata maintenance process is as follows: when a data source is configured it is stored in the service library; when calculation logic is configured, the related data records participating in the calculation are stored; and after the executor has processed the data, the extraction destination is recorded and stored. In this way it is known where the data comes from, where it goes and in which calculations it participates.
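A minimal way to keep the record of where the data came from, which calculation it passed through and where it landed is a lineage entry written at each of the three points mentioned above; the record layout below is an assumption for illustration, not the patent's table design.

import java.time.Instant

// Illustrative lineage record; the schema is an assumption, not the patent's table design.
final case class LineageRecord(
  taskId: String,
  sourceSystem: String,    // e.g. the ORACLE business database
  sourceTable: String,
  transformation: String,  // the calculation rule the data passed through
  targetSystem: String,    // e.g. Hive in the data center
  targetTable: String,
  recordedAt: Instant
)

object LineageSketch {
  // In the described flow a record would be stored when the source is configured,
  // when the calculation logic is configured, and again after the executor runs.
  def record(r: LineageRecord): Unit =
    println(s"lineage: ${r.sourceSystem}.${r.sourceTable} -> ${r.transformation} -> ${r.targetSystem}.${r.targetTable}")

  def main(args: Array[String]): Unit =
    record(LineageRecord("task-001", "oracle", "user_split", "field mapping + null filter",
                         "hive", "dw.user_info", Instant.now()))
}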
A shell script, similar to Windows/DOS batch processing, places commands into a program file in advance so that they can be executed in one pass; it mainly makes setup and management convenient for administrators. ETL extracts, transforms and loads data from a source to a destination. The term ETL is most commonly used in the context of data warehouses, but its object is not limited to data warehouses.

Claims (4)

1. An intelligent data conversion method based on Spark is characterized in that,
the unified data conversion system based on the distributed computing framework Spark comprises a parser, an executor and a scheduler; the information selected on the page is sent to the parser, and each parser translates it into corresponding Spark code or, when the Sqoop execution mode is selected, generates and stores the corresponding Sqoop shell script; the Spark code or Sqoop script generated by the parser is stored in HDFS, and after a task is triggered the executor selects the matching execution engine to run the specific script or code; according to the scheduling dependency graph, the scheduler on the one hand sets task dependencies, each task being able to set dependent parent tasks such as tasks B and C, or triggered child tasks such as tasks D and E, the individual dependency relations finally forming the complete task dependency network required by the user;
on the other hand, a task failure policy can be set, the policy being selectable as timeout retry/failure retry, wherein during task execution the executor records the task's execution state and execution conditions, such as elapsed time and resource usage; when the execution detection policy is set to timeout retry, a task retry is triggered once the task's execution time reaches the set threshold, and the files of the last unfinished run are forcibly deleted before the retry; when set to failure retry, whether to rerun is judged from the state when the task finishes;
finally, the tasks to be executed are scheduled onto nodes with sufficient resources according to the cluster's current resource situation;
the method comprises the following specific steps:
1) data source configuration: multiple data sources are uniformly managed and configured in advance, and the base-table metadata is automatically scanned through the page configuration of the data source; after the data source is built, a user chooses to create an extraction task on the page;
2) executor construction: the extraction task first selects a data source, and the extraction calculation rules are then checked against that data source, the extraction rule being selected as single extraction/incremental extraction/full extraction; whether concurrent execution is needed is chosen by checking, and the split field required for concurrency is checked; when incremental extraction is selected, the increment basis field must be selected;
field check rules are set; after the settings succeed, the storage medium written into the data center is set in the next step;
after the data source, the intermediate calculation rules and the output end are configured, the ETL task is finally added to the timed tasks, the timing rule can be selected on the page, and the calculation rules are stored in the database by the backend; after the synchronization conditions selected on the page are confirmed, the backend parses the calculation and alarm conditions with the parser and persists them; after the task is later confirmed for going online, the parser translates the corresponding task into Spark or Sqoop code; the extraction task performs a test extraction according to the extraction conditions, with concurrency and data volume limited during the test, and can be scheduled to go online after the test passes; a calculation task generates the corresponding Spark code according to its calculation rules, the executor runs the specific code, the test code runs with the configured concurrency and a limited calculation volume, and the extraction and calculation tasks can be selected to go online after the test execution succeeds;
the bottom-layer data extraction uses Spark and Sqoop; when a task is triggered, different interfaces are matched to extract the data according to the task conditions, and after the data are obtained, Spark SQL statements are spliced or Spark Core is used for data cleaning according to the corresponding rules; the Spark code or Sqoop script generated by the parser is stored in HDFS, and when a task is triggered the executor selects the matching execution engine to run the specific script or code; during execution the executor pulls the data quality rules produced by the parser for the task and raises real-time alarms by monitoring the field conditions defined by those rules, for example when the gender field is given a (male, female) enumeration check; after the extraction task finishes, the executor counts related data for data quality monitoring, such as the number of records extracted this time, the extraction speed, the extraction time and the execution time; after the statistical indicators are persisted, the quality monitoring logic is executed, the page monitoring rules parsed by the parser are compared with the extraction result, and if a set threshold is reached, a real-time alarm is raised through the set alarm communication mode;
3) in the task management process, an extraction threshold is configured in advance for each extraction task, or a particular extraction indicator is configured precisely; when the executor finishes a task, the task management module writes the whole run's execution time, the execution conditions including the number of records and the execution state, and the indicator information tied to the specific thresholds back to the platform service library for record keeping and statistics; the specific execution process is that, after each task finishes, the result information is written into a SUCCESS file on HDFS; after each task finishes, a statistics process is triggered to judge whether the SUCCESS file exists, the task having executed successfully if it exists and having failed otherwise, and the state is finally recorded into the execution state table; if the SUCCESS file exists, the task information recorded in the file, such as the execution time and resource usage, is read and written into the service library;
meanwhile, an alarm can be configured for an important task, and when the task fails alarm information is sent to the relevant developers through various communication channels, so that the data can be verified quickly and the loss is minimized when the task goes wrong;
4) metadata maintenance process: when a data source is configured it is stored in the service library, when calculation logic is configured the related data records participating in the calculation are stored, and after the executor has processed the data the extraction destination is recorded and stored; in this way, where the data comes from, where it goes and in which calculations it participates are all captured.
2. The Spark-based intelligent data conversion method according to claim 1, wherein:
the method comprises the following steps:
the method comprises the following steps: a user login platform selects a data source with a corresponding authority;
step two: selecting an original table and a target table which need to be transformed;
step three: mapping fields of the original table and the target table;
step four: storing metadata according to the cleaning rules of the source end and the target end;
step five: selecting a data conversion filtering condition;
step six: selecting an alarm condition;
step seven: selecting an execution cycle type and setting a specific trigger cycle;
step eight: the flow ends.
3. The Spark-based intelligent data conversion method as claimed in claim 1, wherein the second step specifically comprises:
2.1) building the parsers: the synchronization conditions, calculation conditions and alarm conditions selected on the page are parsed by each parser into corresponding Spark code, and if the Sqoop execution mode is selected, the corresponding Sqoop shell script is generated;
2.2) building the executor: the Spark code or Sqoop script generated by the parser is stored in HDFS, and after a task is triggered the executor selects the matching execution engine to run the specific script or code;
2.3) building the scheduler: on the one hand the scheduler can set task dependencies, each task being able to set dependent parent tasks and triggered child tasks, finally forming the complete task dependency network required by the user; on the other hand a task failure policy can be set, the policy being selectable as timeout retry/failure retry; finally, the tasks to be executed are scheduled onto nodes with sufficient resources according to the cluster's current resource situation.
4. The Spark-based intelligent data conversion method according to claim 1, wherein the alarm condition is that the number of records extracted this time is less than 50% of the historical average.
CN202110756908.5A 2021-07-05 2021-07-05 Spark-based intelligent data conversion method Active CN113641739B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110756908.5A CN113641739B (en) 2021-07-05 2021-07-05 Spark-based intelligent data conversion method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202110756908.5A CN113641739B (en) 2021-07-05 2021-07-05 Spark-based intelligent data conversion method

Publications (2)

Publication Number Publication Date
CN113641739A CN113641739A (en) 2021-11-12
CN113641739B (en) 2022-09-06

Family

ID=78416709

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110756908.5A Active CN113641739B (en) 2021-07-05 2021-07-05 Spark-based intelligent data conversion method

Country Status (1)

Country Link
CN (1) CN113641739B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115858310B (en) * 2023-03-01 2023-07-21 美云智数科技有限公司 Abnormal task identification method, device, computer equipment and storage medium
CN116089518A (en) * 2023-04-07 2023-05-09 广州思迈特软件有限公司 Data model extraction method and system, terminal and medium

Family Cites Families (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US10275278B2 (en) * 2016-09-14 2019-04-30 Salesforce.Com, Inc. Stream processing task deployment using precompiled libraries
CN108427709B (en) * 2018-01-25 2020-10-16 朗新科技集团股份有限公司 Multi-source mass data processing system and method
US10705883B2 (en) * 2018-06-19 2020-07-07 Microsoft Technology Licensing, Llc Dynamic hybrid computing environment
CN109165202A (en) * 2018-07-04 2019-01-08 华南理工大学 A kind of preprocess method of multi-source heterogeneous big data
CN109684053B (en) * 2018-11-05 2023-08-01 广东岭南通股份有限公司 Task scheduling method and system for big data
CN110069335A (en) * 2019-05-07 2019-07-30 江苏满运软件科技有限公司 Task processing system, method, computer equipment and storage medium
CN110457371A (en) * 2019-08-13 2019-11-15 杭州有赞科技有限公司 Data managing method, device, storage medium and system
CN112100265A (en) * 2020-09-17 2020-12-18 博雅正链(北京)科技有限公司 Multi-source data processing method and device for big data architecture and block chain
CN112102111B (en) * 2020-09-27 2021-06-08 华电福新广州能源有限公司 Intelligent processing system for power plant data
CN112596876A (en) * 2020-12-17 2021-04-02 平安普惠企业管理有限公司 Task scheduling method, device and related equipment
CN112632174A (en) * 2020-12-31 2021-04-09 江苏苏宁云计算有限公司 Data inspection method, device and system
CN112925619A (en) * 2021-02-24 2021-06-08 深圳依时货拉拉科技有限公司 Big data real-time computing method and platform

Also Published As

Publication number Publication date
CN113641739A (en) 2021-11-12


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
PE01 Entry into force of the registration of the contract for pledge of patent right

Denomination of invention: A Smart Data Conversion Method Based on Spark

Granted publication date: 20220906

Pledgee: Bank of China Limited by Share Ltd. Jiangsu branch

Pledgor: NANJING LINKAGE INFORMATION TECHNOLOGY Co.,Ltd.

Registration number: Y2024980011768