CN114253534A - Real-time data processing method and system - Google Patents


Info

Publication number
CN114253534A
Authority
CN
China
Prior art keywords
data
configuration
task
real
data processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111508187.2A
Other languages
Chinese (zh)
Inventor
谢县东
李洪勋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Weimeng Chuangke Network Technology China Co Ltd
Original Assignee
Weimeng Chuangke Network Technology China Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Weimeng Chuangke Network Technology China Co Ltd
Priority to CN202111508187.2A
Publication of CN114253534A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/30 Creation or generation of source code
    • G06F8/36 Software reuse
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00 Arrangements for software engineering
    • G06F8/70 Software maintenance or management
    • G06F8/71 Version control; Configuration management

Abstract

Embodiments of the invention provide a real-time data processing method and system. The data processing flow of each big data technology component is abstracted into several stages. By specifying, in a task configuration, the data processing engine that corresponds to a big data technology component together with the configuration of each stage, the component can be added to a project to process real-time data without writing code. This lowers the threshold for developers to use big data technology components and improves development efficiency.

Description

Real-time data processing method and system
Technical Field
The invention relates to the field of real-time computing, in particular to a real-time data processing method and system.
Background
Real-time data is a large, fast, continuously arriving sequence of ordered data; a data stream can be viewed as a dynamic data set that grows without bound over time. To detect real-time data and compute efficiently over large volumes of data, big data technology components such as Storm, Spark and Flink are commonly used. Storm is a stream computing framework open-sourced by Twitter and hosted on GitHub. Spark (Apache Spark) is a fast, general-purpose computing engine designed for large-scale data processing. Flink is a distributed processing engine for streaming and batch data. In the prior art, however, real-time data processing technologies are numerous and varied. To use any one of them, a developer must first master its principles and then write the corresponding task code, and the learning curve for coding against big data technology components is relatively steep. This places higher demands on project developers, raises the entry threshold, lengthens development time and increases maintenance cost. Using these components requires writing a large amount of code. In addition, the task metrics collected by different real-time data processing technologies differ and their coverage is incomplete, so tasks cannot be monitored comprehensively. Code is also difficult to reuse across projects, which reduces development efficiency and makes it hard to keep stability consistent from project to project.
In the process of implementing the invention, the applicant finds that at least the following problems exist in the prior art:
developers must learn the coding techniques of big data technology components, code is difficult to reuse across projects, project development efficiency is reduced, and stability cannot be kept consistent across projects.
Disclosure of Invention
Embodiments of the invention provide a real-time data processing method and system that address the following problems: because the coding techniques of big data technology components must be learned, code is difficult to reuse across projects, project development efficiency is reduced, and stability cannot be kept consistent across projects.
To achieve the above object, in one aspect, an embodiment of the present invention provides a real-time data processing method that executes a real-time task according to the following steps:
dynamically creating an instance object of a data processing engine according to information about the data processing engine recorded in a pre-specified task configuration; and
executing the data processing process of the real-time task, through the method interface provided by the instance object, according to the specified running configuration recorded in the pre-specified task configuration.
Further, the information about the data processing engine includes the class name of the data processing engine;
dynamically creating the instance object of the data processing engine according to the information about the data processing engine recorded in the pre-specified task configuration specifically includes:
dynamically creating the instance object of the data processing engine by reflection according to the class name of the data processing engine.
Further, the specified running configuration includes: a data reading configuration, a data processing configuration and a data output configuration;
executing the data processing process of the real-time task, through the method interface provided by the instance object, according to the specified running configuration recorded in the pre-specified task configuration includes:
reading data to be processed from the data source specified in the data reading configuration, and extracting structured data from the data to be processed according to the data structure format specified in the data reading configuration;
processing the structured data according to the data preprocessing operation, data aggregation operation and/or data post-processing operation defined by the data processing configuration to obtain processed data;
writing the processed data to the output storage component specified in the data output configuration, in the output format specified in the data output configuration;
wherein the data preprocessing operation includes: intercepting, replacing, filtering, converting, merging, expanding, deleting and/or splitting the structured data to obtain a data preprocessing result;
the data post-processing operation includes: intercepting, replacing, filtering, converting, merging, expanding, deleting and/or splitting the result of the data aggregation operation to obtain the processed data.
Further, before the real-time task is executed, the method further includes:
displaying a task management page to a user, and receiving, through the task management page, the task configuration filled in and submitted by the user as the pre-specified task configuration;
storing the pre-specified task configuration and, when a read configuration request for the pre-specified task configuration is received, responding by returning the pre-specified task configuration;
during execution of the real-time task, the method further includes:
collecting and storing specified metrics of the real-time task while it runs;
querying the specified metrics according to a specified alarm configuration and, if a specified metric satisfies a specified alarm rule in the alarm configuration, raising an alarm in a specified manner.
Further, the method also includes:
managing the running of the real-time task through running management operations submitted from the task management page; and
when the running management operation is a start operation, reading the pre-specified task configuration and performing a configuration check; if the configuration check passes, querying a registration record; and if the real-time task is not registered, writing the registration information of the real-time task to the registration record and starting the real-time task;
when the running management operation is a restart operation, restarting the real-time task;
when the running management operation is a stop operation, terminating the real-time task and deleting its registration information from the registration record.
In another aspect, an embodiment of the present invention provides a real-time data processing system, including:
the real-time task creating unit is used for dynamically creating an instance object of the data processing engine according to the information of the data processing engine recorded in the pre-specified task configuration;
and the real-time task running unit is used for executing the data processing process of the real-time task according to the specified running configuration recorded in the pre-specified task configuration through the method interface provided by the instance object.
Further, the information of the data processing engine includes a class name of the data processing engine;
the real-time task creating unit is specifically configured to:
dynamically create the instance object of the data processing engine by reflection according to the class name of the data processing engine.
Further, the specified running configuration includes: a data reading configuration, a data processing configuration and a data output configuration;
the real-time task running unit comprises:
a data reading module, configured to read data to be processed from the data source specified in the data reading configuration and to extract structured data from the data to be processed according to the data structure format specified in the data reading configuration;
a data processing module, configured to process the structured data according to the data preprocessing operation, data aggregation operation and/or data post-processing operation defined by the data processing configuration to obtain processed data;
a data output module, configured to write the processed data to the output storage component specified in the data output configuration, in the output format specified in the data output configuration;
wherein the data preprocessing operation includes: intercepting, replacing, filtering, converting, merging, expanding, deleting and/or splitting the structured data to obtain a data preprocessing result;
the data post-processing operation includes: intercepting, replacing, filtering, converting, merging, expanding, deleting and/or splitting the result of the data aggregation operation to obtain the processed data.
Further, the system also includes:
a configuration generating unit, configured to display a task management page to a user and to receive, through the task management page, the task configuration filled in and submitted by the user as the pre-specified task configuration;
a configuration storage unit, configured to store the pre-specified task configuration and, when a read configuration request for the pre-specified task configuration is received, to respond by returning the pre-specified task configuration;
a metric collection and storage unit, configured to collect and store the specified metrics of the real-time task while it runs;
an alarm unit, configured to query the specified metrics according to the specified alarm configuration and, if a specified metric satisfies a specified alarm rule in the alarm configuration, to raise an alarm in a specified manner.
Further, the system also includes a task management unit configured to: manage the running of the real-time task through running management operations submitted from the task management page; when the running management operation is a start operation, read the pre-specified task configuration and perform a configuration check, and if the configuration check passes, query a registration record, and if the real-time task is not registered, write the registration information of the real-time task to the registration record and start the real-time task; when the running management operation is a restart operation, restart the real-time task; when the running management operation is a stop operation, terminate the real-time task and delete its registration information from the registration record.
The above technical solution has the following beneficial effects. The data processing engine to be used is specified in a configuration file, and each data processing engine corresponds one-to-one to a big data technology component. The data processing process of each engine is abstracted into data reading, data processing and data output stages, each with a corresponding configuration in the configuration file. The data processing stage is abstracted into data preprocessing, data aggregation and data post-processing, and data preprocessing and data post-processing are further abstracted into operations such as interception, replacement, filtering, conversion, merging, expansion, deletion and/or splitting. An instance object of the data processing engine is created dynamically from the information about the engine, and the methods of that instance object perform the corresponding operations according to the task configuration. As a result, no coding knowledge of the underlying data processing engine is needed: a developer only has to know which functions the engine can provide and what configuration those functions require in order to integrate the engine into a project and process real-time data. Users can choose freely according to the usage scenario; abstracting the data processing reduces development cost and makes the system convenient to use. The development threshold is lowered, functional components can be reused, project development efficiency improves, projects become more stable, and maintenance cost falls.
Furthermore, by monitoring the data processed by the data processing engine, abnormal data can be discovered in real time and an alarm raised promptly. Furthermore, the real-time task is started only when the registration check finds that it is not yet registered, which prevents the task from being started or submitted repeatedly and helps ensure high availability of the task.
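The start, restart and stop handling described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the `TaskManager` class, its in-memory registry, the returned status strings and the rule that a restart requires an existing registration are all hypothetical.

```python
from typing import Any, Callable, Dict


class TaskManager:
    """Hypothetical task manager that keeps registration records in memory."""

    def __init__(self, validate: Callable[[Dict[str, Any]], bool]) -> None:
        self._validate = validate                     # configuration check
        self._registry: Dict[str, Dict[str, Any]] = {}  # registration record

    def start(self, task_id: str, config: Dict[str, Any]) -> str:
        # Configuration check comes first, then the registration lookup.
        if not self._validate(config):
            return "rejected: invalid configuration"
        if task_id in self._registry:
            return "skipped: already registered"
        self._registry[task_id] = config
        return "started"

    def restart(self, task_id: str) -> str:
        # Assumption: only a registered task can be restarted.
        if task_id not in self._registry:
            return "skipped: not registered"
        return "restarted"

    def stop(self, task_id: str) -> str:
        # Stopping removes the registration information.
        if self._registry.pop(task_id, None) is None:
            return "skipped: not registered"
        return "stopped"


manager = TaskManager(validate=lambda cfg: "engine" in cfg)
first = manager.start("t1", {"engine": "demo"})
second = manager.start("t1", {"engine": "demo"})
invalid = manager.start("t2", {})
stopped = manager.stop("t1")
```

In this sketch the second `start` call for the same task is skipped because the task is already registered, which is the duplicate-start protection the text describes.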
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in describing them are briefly introduced below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
FIG. 1 is a flow chart of a method of real-time data processing according to one embodiment of the invention;
FIG. 2 is a schematic diagram of a real-time data processing stage according to one embodiment of the present invention;
FIG. 3 is a schematic diagram of a distributed implementation of a real-time data processing system in accordance with one embodiment of the present invention;
FIG. 4 is a block diagram of a real-time data processing system according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In one aspect, as shown in fig. 1, a real-time data processing method executes a real-time task according to the following steps:
step S100, dynamically creating an instance object of a data processing engine according to information about the data processing engine recorded in a pre-specified task configuration; and
step S101, executing the data processing process of the real-time task, through the method interface provided by the instance object, according to the specified running configuration recorded in the pre-specified task configuration.
In some embodiments, the real-time task includes at least two parts: step S100 creates the real-time task according to the task configuration, and step S101 executes the data processing process of the real-time task according to the task configuration. Each data processing engine corresponds to a big data processing component, for example, but not limited to, Storm, Spark or Flink. In a specific implementation, the class and methods of a data processing engine may directly use the class and methods of the corresponding big data processing component, or may be obtained by encapsulating them. The pre-specified task configuration may be created, for example, through a web page according to the functional rules of a specific data processing engine, or by writing a configuration file directly. The information that identifies the data processing engine in the pre-specified task configuration includes, but is not limited to, the class name of the class corresponding to the engine, or a predefined number that uniquely identifies the engine. The engine specified in the task configuration may be identified by comparing this information against the known engines one by one, or an instance object of the selected engine may be created directly by reflection from its class name.
The instance object of each data processing engine provides several data processing method interfaces. The data processing functions provided by the big data processing components are abstracted and summarized to form these interfaces, which implement the data processing process of the real-time task. The data processing process comprises three stages: data reading, data processing and/or data output. The operations of the data processing stage are further abstracted to include, but are not limited to, data preprocessing, data aggregation and/or data post-processing operations; data preprocessing and data post-processing can be further abstracted to include, but are not limited to, intercepting, replacing, filtering, converting, merging, expanding, deleting and/or splitting data. The processing stages and operations can be selected according to the needs of a specific project to form the specified running configuration, which is set accordingly in the task configuration. The instance object reads the pre-specified task configuration and executes the data processing process of the real-time task according to the specified running configuration. The pre-specified task configuration may be stored in a file format including, but not limited to, XML or JSON, and the format of the specific configuration information can be defined case by case.
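One way to picture the common method interface described above is an abstract base class with one method per stage. The sketch below is only an illustration under assumptions: the names `DataProcessingEngine`, `InMemoryEngine`, the `"sinkConfig"` key and the `"records"` and `"keep_fields"` settings are hypothetical, and the toy engine stands in for a wrapper around a real component such as Storm, Spark or Flink.

```python
from abc import ABC, abstractmethod
from typing import Any, Dict, List

Record = Dict[str, Any]


class DataProcessingEngine(ABC):
    """Hypothetical common interface; one subclass would wrap each big data component."""

    @abstractmethod
    def read(self, read_config: Dict[str, Any]) -> List[Record]:
        """Data reading stage: return structured records from the configured source."""

    @abstractmethod
    def process(self, records: List[Record], process_config: Dict[str, Any]) -> List[Record]:
        """Data processing stage: apply the configured operations."""

    @abstractmethod
    def write(self, records: List[Record], output_config: Dict[str, Any]) -> None:
        """Data output stage: send records to the configured storage component."""

    def run(self, task_config: Dict[str, Any]) -> None:
        # The three stages run in order, each driven only by configuration.
        records = self.read(task_config["sourceConfig"])
        processed = self.process(records, task_config["processConfig"])
        self.write(processed, task_config["sinkConfig"])


class InMemoryEngine(DataProcessingEngine):
    """Toy engine used only to exercise the interface; no real framework is involved."""

    def __init__(self) -> None:
        self.written: List[Record] = []

    def read(self, read_config: Dict[str, Any]) -> List[Record]:
        return list(read_config.get("records", []))

    def process(self, records: List[Record], process_config: Dict[str, Any]) -> List[Record]:
        keep = process_config.get("keep_fields")
        if keep is None:
            return list(records)
        return [{k: r[k] for k in keep if k in r} for r in records]

    def write(self, records: List[Record], output_config: Dict[str, Any]) -> None:
        self.written.extend(records)


task_config = {
    "sourceConfig": {"records": [{"uid": 1, "code": 200, "noise": "x"}]},
    "processConfig": {"keep_fields": ["uid", "code"]},
    "sinkConfig": {},
}
engine = InMemoryEngine()
engine.run(task_config)
```

Because `run` depends only on the three abstract methods, swapping the engine changes the underlying component without changing the task configuration's shape.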
This embodiment of the invention has the following technical effects. The data processing engine to be used is specified in a configuration file, and each data processing engine corresponds one-to-one to a big data technology component. The data processing process of each engine is abstracted into data reading, data processing and data output stages, each with a corresponding configuration in the configuration file. The data processing stage is abstracted into data preprocessing, data aggregation and data post-processing, and data preprocessing and data post-processing are further abstracted into operations such as interception, replacement, filtering, conversion, merging, expansion, deletion and/or splitting. An instance object of the data processing engine is created dynamically from the information about the engine, and the methods of that instance object perform the corresponding operations according to the task configuration. As a result, no coding knowledge of the underlying data processing engine is needed: a developer only has to know which functions the engine can provide and what configuration those functions require in order to integrate the engine into a project and process real-time data. Users can choose freely according to the usage scenario; abstracting the data processing reduces development cost and makes the system convenient to use. The development threshold is lowered, functional components can be reused, project development efficiency improves, projects become more stable, and maintenance cost falls.
Further, the information of the data processing engine includes a class name of the data processing engine;
dynamically creating the instance object of the data processing engine according to the information about the data processing engine recorded in the pre-specified task configuration specifically includes:
dynamically creating the instance object of the data processing engine by reflection according to the class name of the data processing engine.
In some embodiments, one of several data processing engines may be specified in the pre-specified task configuration; specifically, the class name of the selected data processing engine may be specified. The structure of the configuration information and the description language used in the examples do not limit the embodiments of the present invention; a specific description language and configuration structure can be chosen according to the requirements of a specific project. For example, in the following JSON structure:
[Figure BDA0003405005010000071: JSON task configuration example specifying the data processing engine]
wherein the class name of the data processing engine is specified as:
the above examples are not intended to limit the embodiments of the present invention.
An instance object of the data processing engine class is created dynamically by reflection, for example through the reflection mechanism of the Java language. Developers using the system can therefore select a data processing engine simply by specifying its class name in the configuration file, without knowing the engine's internal implementation or how to code against it. Users can choose freely according to the usage scenario; abstracting the data processing reduces development cost and makes the system convenient to use.
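The text names Java reflection; the same idea of instantiating a class from a configured name can be sketched in Python with `importlib`. This is an analogous illustration, not the patent's code: the `create_engine` helper and the `"engine"` and `"className"` keys are hypothetical, and the standard-library class `collections.OrderedDict` stands in for a real engine class so that the example runs on its own.

```python
import importlib
from typing import Any, Dict


def create_engine(class_path: str) -> Any:
    """Instantiate a class from a dotted 'package.module.ClassName' string."""
    module_name, _, class_name = class_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected a dotted class path, got {class_path!r}")
    module = importlib.import_module(module_name)  # load the module by name
    engine_class = getattr(module, class_name)     # look up the class by name
    return engine_class()                          # call its no-argument constructor


# Hypothetical task configuration; a stdlib class stands in for an engine class.
task_config: Dict[str, Any] = {"engine": {"className": "collections.OrderedDict"}}
engine = create_engine(task_config["engine"]["className"])
```

The caller never imports the engine class directly, so choosing a different engine only requires changing the string in the configuration.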
Further, the specified running configuration includes: a data reading configuration, a data processing configuration and a data output configuration;
executing the data processing process of the real-time task, through the method interface provided by the instance object, according to the specified running configuration recorded in the pre-specified task configuration includes:
reading data to be processed from the data source specified in the data reading configuration, and extracting structured data from the data to be processed according to the data structure format specified in the data reading configuration;
processing the structured data according to the data preprocessing operation, data aggregation operation and/or data post-processing operation defined by the data processing configuration to obtain processed data;
writing the processed data to the output storage component specified in the data output configuration, in the output format specified in the data output configuration;
wherein the data preprocessing operation includes: intercepting, replacing, filtering, converting, merging, expanding, deleting and/or splitting the structured data to obtain a data preprocessing result;
the data post-processing operation includes: intercepting, replacing, filtering, converting, merging, expanding, deleting and/or splitting the result of the data aggregation operation to obtain the processed data.
In some embodiments, as shown in fig. 2, the running of the real-time task is abstracted into a data reading stage, a data processing stage and a data output stage, and a data reading configuration, a data processing configuration and a data output configuration are defined correspondingly in the specified running configuration of the pre-specified task configuration. In the data reading stage, data to be processed is read from the specified data source, and structured data is extracted from it according to the data structure format specified in the data reading configuration; the extraction may include, but is not limited to, format conversion and data cleaning. The structure of the configuration information and the description language used in the example do not limit the embodiments of the present invention; a specific description language and configuration structure can be chosen according to the requirements of a specific project. For example, in the following JSON structure:
[Figure BDA0003405005010000081: JSON data reading configuration example]
"sourceConfig" indicates that this field holds the data reading configuration. The value "source_kafka" of the "format" field indicates that the specified data source is Kafka; Apache Kafka is a distributed message queue suitable for both offline and online message processing. In the data reading stage, the data to be processed is read from Kafka. The "extractor" field specifies that the values of the fields "a", "uid", "code" and "oid" are to be extracted from the data to be processed to obtain structured data. The structured data can be stored as key-value pairs in a file, in memory or in a database, and its format can be defined as required. Specified data sources include, but are not limited to, Kafka, HDFS or databases. Converting the data from the data source into structured data makes subsequent processing more convenient. The operation interfaces provided by the various big data technology components are abstracted so that the operations of the data processing stage fall into data preprocessing, data aggregation and/or data post-processing operations; the required operations are selected in the data processing configuration according to specific needs, instructing the instance object to perform them and thereby complete the processing of the structured data.
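The extraction step can be illustrated with a short sketch. It makes several assumptions that the text does not confirm: the raw messages are JSON strings, the "extractor" setting is a plain list of field names, and malformed messages are simply dropped as a form of data cleaning. No Kafka client is used; a list of strings stands in for messages read from the source, and the function name `extract_structured` is hypothetical.

```python
import json
from typing import Any, Dict, List

# Hypothetical data reading configuration; the real layout is shown only in a figure.
source_config: Dict[str, Any] = {
    "format": "source_kafka",
    "extractor": ["a", "uid", "code", "oid"],  # assumed to be a list of field names
}


def extract_structured(raw_messages: List[str], config: Dict[str, Any]) -> List[Dict[str, Any]]:
    """Turn raw JSON messages into key-value records holding only the configured fields."""
    fields = config["extractor"]
    rows: List[Dict[str, Any]] = []
    for raw in raw_messages:
        try:
            payload = json.loads(raw)
        except json.JSONDecodeError:
            continue  # data cleaning: skip messages that are not valid JSON
        if not isinstance(payload, dict):
            continue  # data cleaning: skip payloads that are not JSON objects
        rows.append({name: payload.get(name) for name in fields})
    return rows


raw_messages = [
    '{"a": "login", "uid": 7, "code": 200, "oid": "o-1", "extra": true}',
    "not json",
]
rows = extract_structured(raw_messages, source_config)
```

Fields that are not listed in the extractor, such as `extra` here, do not appear in the structured records.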
The data operation method interfaces provided by the big data technology components can be further abstracted. The operations involved in data preprocessing are classified as interception, replacement, filtering, conversion, merging, expansion, deletion and/or splitting, and the operations involved in data post-processing are classified in the same way. Method interfaces with aggregation functions provided by the big data technology components, such as, but not limited to, SQL (a database query language) queries, total counts and mean calculations, are classified as data aggregation operations. A data processing configuration in JSON is illustrated below. The structure of the configuration information and the description language used in the example do not limit the embodiments of the present invention; a specific description language and configuration structure can be chosen according to the requirements of a specific project.
Here, "processConfig" denotes the data processing configuration; the specific key-value strings can be user-defined. "pre" indicates that the JSON block records data preprocessing operations, "agg" indicates that it records data aggregation operations, and "pro" indicates that it records data post-processing operations.
[Figures BDA0003405005010000091, BDA0003405005010000101, BDA0003405005010000111 and BDA0003405005010000121: JSON data processing configuration example]
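Since the configuration itself appears only in figures, the following sketch shows one possible shape of a "pre", "agg" and "pro" pipeline. Only those three key names come from the text; the operation names (`filter`, `delete`), their parameters, the count aggregation and all function names are hypothetical.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Record = Dict[str, Any]


def op_filter(records: List[Record], field: str, equals: Any) -> List[Record]:
    """Keep only records whose field equals the given value."""
    return [r for r in records if r.get(field) == equals]


def op_delete(records: List[Record], field: str) -> List[Record]:
    """Remove one field from every record."""
    return [{k: v for k, v in r.items() if k != field} for r in records]


# Registry of the abstracted operations; only two of them are sketched here.
OPS: Dict[str, Callable[..., List[Record]]] = {"filter": op_filter, "delete": op_delete}


def apply_ops(records: List[Record], steps: List[Dict[str, Any]]) -> List[Record]:
    for step in steps:
        params = {k: v for k, v in step.items() if k != "op"}
        records = OPS[step["op"]](records, **params)
    return records


def aggregate_count(records: List[Record], group_by: str, out_field: str) -> List[Record]:
    """Count records per group, a stand-in for the aggregation operations."""
    counts: Dict[Any, int] = defaultdict(int)
    for r in records:
        counts[r[group_by]] += 1
    return [{group_by: key, out_field: n} for key, n in counts.items()]


def run_process(records: List[Record], process_config: Dict[str, Any]) -> List[Record]:
    records = apply_ops(records, process_config.get("pre", []))   # preprocessing
    agg = process_config.get("agg")
    if agg:
        records = aggregate_count(records, agg["group_by"], agg["as"])  # aggregation
    return apply_ops(records, process_config.get("pro", []))      # post-processing


process_config = {
    "pre": [{"op": "filter", "field": "code", "equals": 200}, {"op": "delete", "field": "oid"}],
    "agg": {"group_by": "uid", "as": "cnt"},
    "pro": [{"op": "filter", "field": "cnt", "equals": 2}],
}
records = [
    {"uid": 1, "code": 200, "oid": "a"},
    {"uid": 1, "code": 200, "oid": "b"},
    {"uid": 2, "code": 200, "oid": "c"},
    {"uid": 2, "code": 500, "oid": "d"},
]
result = run_process(records, process_config)
```

The same operation registry serves both the "pre" and "pro" blocks, which mirrors the text's point that preprocessing and post-processing share one set of abstracted operations.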
The processed data is written to the output storage component specified in the data output configuration, in the output format specified in that configuration. The structure of the configuration information and the description language used in the example do not limit the embodiments of the present invention; a specific description language and configuration structure can be chosen according to the requirements of a specific project. The following example shows a data output configuration in JSON:
[Figure BDA0003405005010000122: JSON data output configuration example]
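As a rough illustration of the output stage, the sketch below formats processed records according to a configured format and appends them to a configured target. The `"format"` and `"target"` keys, the two supported formats and the in-memory list used as the storage component are assumptions made for this example only.

```python
import csv
import io
import json
from typing import Any, Dict, List

Record = Dict[str, Any]


def format_records(records: List[Record], output_format: str) -> str:
    """Serialize records as JSON lines or CSV, depending on the configured format."""
    if output_format == "json":
        return "".join(json.dumps(r, sort_keys=True) + "\n" for r in records)
    if output_format == "csv":
        if not records:
            return ""
        buffer = io.StringIO()
        writer = csv.DictWriter(buffer, fieldnames=sorted(records[0]), lineterminator="\n")
        writer.writeheader()
        writer.writerows(records)
        return buffer.getvalue()
    raise ValueError(f"unsupported output format: {output_format!r}")


def write_output(records: List[Record], output_config: Dict[str, Any],
                 sinks: Dict[str, List[str]]) -> None:
    """Write formatted text to the storage component named in the configuration."""
    text = format_records(records, output_config["format"])
    sinks[output_config["target"]].append(text)


memory_sink: List[str] = []
sinks = {"memory": memory_sink}  # a list stands in for a real storage component
write_output([{"uid": 1, "cnt": 2}], {"format": "json", "target": "memory"}, sinks)
```

A real deployment would register writers for actual storage components under the same lookup, so the task configuration alone decides where and how results are written.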
This embodiment of the invention has the following technical effects. Developers using the system do not need to know how to code against each big data technology component. They only need to know, at the application level, which functions each component can provide and which parameters those functions require, and then set the corresponding running configuration in the task configuration according to the data processing stages and the operations defined for each stage. No coding skill is demanded of the developer, which clearly reduces the technical difficulty of using big data technology components in a project. Different projects can reuse a real-time data processing system implemented with this technical solution simply by setting different task configurations. Because different projects use the same system, which has been repeatedly verified as stable and reliable, the stability and reliability of each project are further improved.
Further, before executing the real-time task, the method further includes:
displaying a task management page to a user, and receiving, through the task management page, the task configuration filled in and submitted by the user as the pre-specified task configuration;
storing the pre-specified task configuration, and, when a read configuration request for the pre-specified task configuration is received, responding to the request by returning the pre-specified task configuration;
during the execution of the real-time task, the method further includes:
collecting and storing specified indexes of the real-time task during its running;
querying the specified indexes according to the specified alarm configuration, and, if a specified index meets a specified alarm rule in the alarm configuration, issuing an alarm in a specified manner.
In some embodiments, the pre-specified task configuration may be obtained from a configuration file written directly by a developer. To make it more convenient for developers to produce the configuration, and to further lower the threshold for using the system implemented by the method of this embodiment, a task management page is provided and displayed to the user; the user, that is, any developer calling the system, can set the task configuration and specify running management operations on the task management page. The task configuration set by the user can be stored locally or on a configuration storage server, and can be read by other servers or clients. During execution, the real-time task collects and stores specified indexes (metrics). These may be pre-specified running indexes common to projects, derived from statistics on project requirements, such as but not limited to CPU utilization and/or dynamic allocation quantity, exception count, processing rate and time consumed in each stage; they may also be indexes defined in the task configuration. The collected indexes can be stored in a local persistent storage device or file, in memory, or on other servers, from which they can be read and analyzed. The specified indexes are analyzed according to the alarm rules defined in the specified alarm configuration, and if a specified alarm rule is met, an alarm is issued in the specified manner. The alarm configuration, the alarm rules and the manner of alarming can be predefined in the system implemented by the method of this embodiment, or defined in the task configuration.
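The rule check described above can be sketched as follows. This is a minimal illustration, not the patented implementation: the metric names, the rule fields ("metric", "op", "threshold", "notify") and the function `evaluate_alerts` are all hypothetical, and a plain dictionary stands in for the stored indexes.

```python
import operator

# Comparison operators a hypothetical alarm rule may use.
COMPARATORS = {
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def evaluate_alerts(metrics: dict, rules: list) -> list:
    """Return (metric, notify) pairs for every rule whose condition is met."""
    triggered = []
    for rule in rules:
        value = metrics.get(rule["metric"])
        if value is None:
            continue  # this index has not been collected, so nothing to check
        if COMPARATORS[rule["op"]](value, rule["threshold"]):
            triggered.append((rule["metric"], rule["notify"]))
    return triggered


# Assumed sample of collected indexes and of an alarm configuration.
metrics = {"cpu_utilization": 0.93, "exception_count": 0, "processing_rate": 1200.0}
rules = [
    {"metric": "cpu_utilization", "op": ">", "threshold": 0.9, "notify": "email"},
    {"metric": "exception_count", "op": ">", "threshold": 10, "notify": "phone"},
]

alerts = evaluate_alerts(metrics, rules)
print(alerts)
```

Keeping the rules as data rather than code matches the idea that the alarm configuration can be predefined in the system or supplied in the task configuration.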
The steps in this embodiment may be implemented in one service or program or in several, and may be deployed on one server or on several. As shown in fig. 3, the task configuration is created, and running management operations for the real-time task are set, through the task management page provided by the web server service. The config server service manages saving the task configuration and responding to read requests; it can also respond to read operations from the streaming control service, the yarn job service and the Alert Manager service. The streaming control service implements a distributed scheduling control service that manages the tasks submitted by users and is responsible for starting, stopping and other operations on those tasks; it can be deployed on several nodes at the same time, each node managing part of the tasks. For example, it responds to the submit and restart commands of the Web Server, and after a user submits a task the streaming control service is responsible for managing and submitting it. Real-time tasks are submitted to the yarn cluster to run as yarn jobs. The real-time task writes the collected specified indexes into a time-series database (Metrics DB). The Alert Manager is a monitoring and alarm service: it queries and monitors the specified indexes in the Metrics DB according to the task configuration in the config server, and issues an alarm in the specified manner when a specified index meets a specified alarm rule. The specified manner may take various forms and can be configured as needed, for example sending alarm information to a specified WeChat account, a specified telephone number and/or a specified mailbox.
The above services, such as the web server service, the config server service, the streaming control service, the yarn job service, the time-series database Metrics DB, and the Alert Manager service, may be deployed on one server, in any combination on a plurality of servers, or each on a separate server.
The technical scheme of the invention has the following technical effects: providing the task management page gives developers or users a friendlier way to use the system; storing the task configuration and allowing several requesters to request it lets the system be implemented freely in distributed or single-machine form; and collecting the specified indexes, then detecting and alarming according to them, allows the running state of the system to be monitored dynamically and alarms to be raised in time.
Further, the method also includes:
managing the running of the real-time task through a running management operation submitted via the task management page; and,
when the operation management operation is designated as a starting operation, reading the pre-designated task configuration, performing configuration check, if the configuration check passes, further inquiring a registration record, and if the real-time task is not registered, registering the registration information of the real-time task in the registration record, and starting the real-time task;
restarting the real-time task when the operation management operation is designated as a restart operation;
when the operation management operation is designated as a stop operation, terminating the real-time task and deleting registration information of the real-time task from the registration record.
In some embodiments, the user may manage the running of the real-time task through running management operations set on the task management page; these operations include, but are not limited to, starting, restarting and stopping the real-time task. When the user submits a running management operation on the task management page, it is applied to the corresponding real-time task. To ensure that only one instance of a task runs, starting a task first checks whether the real-time task is registered: if it is, no new start operation is executed; if not, the real-time task is registered first and then started, which solves the problem of repeated task submission. As shown in fig. 3, a real-time task is started through the streaming control service; after it is started, the real-time task first registers itself with the zookeeper service and is submitted to the yarn cluster to run once registration succeeds. If the task is found to be already registered in zookeeper during registration, it exits and is not submitted to the yarn cluster.
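The start, restart and stop handling with a registration check can be sketched as below. This is an illustrative sketch under stated assumptions: `TaskRegistry` is an in-memory stand-in for the registration record (zookeeper in fig. 3), `TaskManager` and its `submitted` list stand in for the scheduling service and the cluster, and the configuration check is a placeholder that only looks for an assumed "engine" key.

```python
class TaskRegistry:
    """In-memory stand-in for the registration record (hypothetical)."""

    def __init__(self):
        self._registered = set()

    def register(self, task_name: str) -> bool:
        """Register the task; return False if it is already registered."""
        if task_name in self._registered:
            return False
        self._registered.add(task_name)
        return True

    def unregister(self, task_name: str) -> None:
        self._registered.discard(task_name)


class TaskManager:
    """Applies start/stop/restart operations to real-time tasks (hypothetical)."""

    def __init__(self, registry: TaskRegistry):
        self.registry = registry
        self.submitted = []  # stand-in for submissions to the cluster

    @staticmethod
    def _check_config(task_config: dict) -> bool:
        # Placeholder check: an assumed "engine" key must be present.
        return "engine" in task_config

    def start(self, task_name: str, task_config: dict) -> bool:
        if not self._check_config(task_config):
            return False
        if not self.registry.register(task_name):
            return False  # already registered: do not submit a second instance
        self.submitted.append(task_name)
        return True

    def stop(self, task_name: str) -> None:
        # Terminate the task (omitted here) and delete its registration.
        self.registry.unregister(task_name)

    def restart(self, task_name: str, task_config: dict) -> bool:
        self.stop(task_name)
        return self.start(task_name, task_config)


manager = TaskManager(TaskRegistry())
first = manager.start("click-stats", {"engine": "spark"})
second = manager.start("click-stats", {"engine": "spark"})
print(first, second, manager.submitted)
```

In a distributed deployment the check and the registration would have to happen as one atomic step, which is what a coordination service is typically used for; the in-memory set here does not provide that guarantee across processes.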
The embodiment of the invention has the following technical effects: the real-time task is started only when the registration check finds that it is not registered, which prevents the task from being started or submitted repeatedly and helps ensure high availability of the task.
In another aspect, as shown in fig. 4, an embodiment of the present invention provides a real-time data processing system, including:
a real-time task creating unit 400, configured to dynamically create an instance object of a data processing engine according to information of the data processing engine recorded in a pre-specified task configuration;
and the real-time task running unit 401 is configured to execute a data processing process of the real-time task according to the specified running configuration recorded in the pre-specified task configuration through the method interface provided by the instance object.
Further, the information of the data processing engine includes a class name of the data processing engine;
the real-time task creating unit 400 is specifically configured to:
and dynamically creating an instance object of the data processing engine in a reflection mode according to the class name of the data processing engine.
Further, the specified running configuration includes: data reading configuration, data processing configuration and data output configuration;
the real-time task running unit 401 includes:
the data reading module is used for reading data to be processed from a data source specified in the data reading configuration and extracting structured data from the data to be processed according to a data structure format specified in the data reading configuration;
the data processing module is used for processing the structured data according to data preprocessing operation, data aggregation operation and/or data post-processing operation defined by the data processing configuration to obtain processed data;
the data output module is used for writing the processed data into an output storage component appointed in the data output configuration according to an output format appointed in the data output configuration;
wherein the data preprocessing operation comprises: intercepting, replacing, filtering, converting, merging, unfolding, deleting and/or splitting the structured data to obtain a data preprocessing result;
the data post-processing operation comprises: and intercepting, replacing, filtering, converting, merging, expanding, deleting and/or splitting the result after the data aggregation operation to obtain processed data.
Further, the system also includes:
the configuration generating unit is used for displaying a task management page to a user and receiving and submitting task configuration filled by the user as the pre-specified task configuration through the task management page;
a configuration storage unit, configured to store the pre-specified task configuration, and configured to respond to a read configuration request in a manner of returning the pre-specified task configuration when the read configuration request for the pre-specified task configuration is received;
the index collection and storage unit is used for collecting and storing the specified indexes during the running period of the real-time task;
and the alarm unit is used for inquiring the specified index according to the specified alarm configuration, and sending out an alarm in a specified mode if the specified index meets the specified alarm rule in the alarm configuration.
Further, the system also includes a task management unit configured to: manage the running of the real-time task through a running management operation submitted via the task management page; when the operation management operation is designated as a starting operation, read the pre-specified task configuration and perform a configuration check, and, if the configuration check passes, query a registration record and, if the real-time task is not registered, register the registration information of the real-time task in the registration record and start the real-time task; restart the real-time task when the operation management operation is designated as a restart operation; and, when the operation management operation is designated as a stop operation, terminate the real-time task and delete the registration information of the real-time task from the registration record.
The real-time data processing system provided in this embodiment of the present invention is a product-type embodiment corresponding one to one to the foregoing real-time data processing method. A person skilled in the art can understand the system from the foregoing method embodiment, so details are not repeated here.
The above technical solutions of the embodiments of the present invention are described in detail below with reference to specific application examples, and reference may be made to the foregoing related descriptions for technical details that are not described in the implementation process.
As shown in fig. 3, the system mainly includes the following parts:
WebServer: the front-end service, where a user writes the task configuration. It is also responsible for task management, that is, creating, starting, stopping, restarting and monitoring applications, and it writes the application configuration to the Config Server;
Config Server: the configuration service, responsible for storing the information related to application configuration, which is written to it by the WebServer;
StreamingControl: the distributed scheduling control service, responsible for responding to the submit and restart commands of the WebServer; after a user submits a task, StreamingControl is responsible for managing and submitting it;
Yarn job: real-time tasks run on the yarn cluster;
Metrics DB: the time-series database into which the task writes the collected indexes;
Alert Manager: the monitoring and alarm service, which queries the relevant indexes according to the alarm configuration and raises an alarm when an abnormality is found;
zookeeper: stores the distribution information and the running information of the application;
Detailed description:
The user first submits the task-related configuration, namely the task configuration, running parameters and the like, at the front-end service (WebServer). After submission, the configuration is stored in the Config Server service.
The user starts and stops the task through the front-end service, and the StreamingControl service can respond to the start and stop operation of the front end to submit or delete the task.
When the StreamingControl starts a task, it first acquires the task configuration from the Config Server and performs a configuration check; it also writes the task name into zookeeper, registering the task to prevent it from being resubmitted.
The StreamingControl then loads the class of the corresponding data processing engine by reflection, according to the data processing engine (including but not limited to spark, flink, storm, etc.) specified by the user in the task configuration. The real-time task is submitted to the yarn cluster, where the data processing is performed. The data processing flow of each real-time task can be divided into three stages, as shown in fig. 2: a data reading stage, a data processing stage and a data output stage.
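Creating an engine instance from a class name recorded in the task configuration can be sketched as follows. The patent does not name an implementation language for this step, so Python's `importlib` is used here purely for illustration; the configuration keys "engine" and "className" and the helper `create_instance` are assumptions, and a standard-library class stands in for a real engine class so that the sketch runs on its own.

```python
import importlib


def create_instance(qualified_class_name: str, *args, **kwargs):
    """Create an object from a 'package.module.ClassName' string by reflection."""
    module_name, _, class_name = qualified_class_name.rpartition(".")
    if not module_name:
        raise ValueError("expected a fully qualified class name")
    module = importlib.import_module(module_name)  # load the module by name
    cls = getattr(module, class_name)              # look up the class by name
    return cls(*args, **kwargs)


# Hypothetical task configuration; collections.Counter is only a stand-in
# for an engine class such as a Spark- or Flink-based implementation.
task_config = {"engine": {"className": "collections.Counter"}}

engine = create_instance(task_config["engine"]["className"])
# The caller then works through the methods the instance exposes.
engine.update(["a", "b", "a"])
print(engine.most_common(1))
```

Because the class is resolved at run time, switching to a different engine only requires changing the class name in the configuration, not the calling code.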
During the running process of the task, statistics information is collected, and relevant indexes are reported to a Metrics DB, so that the running information of the task is conveniently monitored. Meanwhile, the Alert Manager service can query related specified indexes from the Metrics DB according to the alarm rules configured by the user, and trigger an alarm to notify the user when the specified indexes are abnormal.
Three stages of data processing flow:
Data reading stage:
This stage is mainly responsible for reading the data to be processed from the specified data source and extracting it into structured data according to the defined data format; source data that does not satisfy the definition is filtered out during extraction.
Data processing stage:
This stage is mainly responsible for data processing, which can itself be divided into three stages: data preprocessing, data aggregation and data post-processing. One or more of the three can be selected flexibly according to requirements.
Data preprocessing and data post-processing support operations such as intercepting, replacing, filtering, converting, merging, unfolding, deleting and splitting data. The user selects the operations required.
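A few of the preprocessing and post-processing operations listed above can be sketched as configuration-driven transformations on records. The operation names follow the text, but the record layout, the operation fields ("field", "start", "end", "old", "new", "equals", "sep") and the function `apply_operations` are assumptions chosen for illustration; only five of the listed operations are shown.

```python
def apply_operations(records: list, operations: list) -> list:
    """Apply the configured operations, in order, to each record."""
    result = []
    for record in records:
        row = dict(record)  # work on a copy so the input stays unchanged
        keep = True
        for op in operations:
            kind, field = op["op"], op["field"]
            if kind == "intercept":    # keep only a substring of the field
                row[field] = str(row[field])[op["start"]:op["end"]]
            elif kind == "replace":    # replace text inside the field
                row[field] = str(row[field]).replace(op["old"], op["new"])
            elif kind == "filter":     # drop records that do not match
                if row.get(field) != op["equals"]:
                    keep = False
                    break
            elif kind == "delete":     # remove the field from the record
                row.pop(field, None)
            elif kind == "split":      # split the field into a list
                row[field] = str(row[field]).split(op["sep"])
            else:
                raise ValueError(f"unsupported operation: {kind}")
        if keep:
            result.append(row)
    return result


records = [
    {"uid": "u-001", "status": "ok", "tags": "a,b", "debug": "x"},
    {"uid": "u-002", "status": "error", "tags": "c", "debug": "y"},
]
operations = [
    {"op": "filter", "field": "status", "equals": "ok"},
    {"op": "intercept", "field": "uid", "start": 2, "end": 5},
    {"op": "split", "field": "tags", "sep": ","},
    {"op": "delete", "field": "debug"},
]

processed = apply_operations(records, operations)
print(processed)
```

The same function could serve both the preprocessing and the post-processing stage, since the text gives both stages the same set of operations.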
The data aggregation supports spark SQL and flink SQL, and a user can calculate data through SQL.
Data output stage:
This stage mainly writes the processed data into an external system in the format specified by the user, for example kafka, Redis, HDFS, MySQL or clickhouse.
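The three stages described above can be chained as in the following sketch. It is a toy illustration with assumed names: an in-memory list of JSON lines stands in for the data source, a simple count per key stands in for the SQL aggregation, and a Python list stands in for the external output system.

```python
import json
from collections import Counter


def read_stage(raw_lines, required_fields):
    """Data reading: parse each line and drop data that does not fit the format."""
    for line in raw_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # not valid JSON: filtered out during extraction
        if isinstance(record, dict) and all(f in record for f in required_fields):
            yield record


def process_stage(records, group_by):
    """Data processing: count records per key (stand-in for an SQL aggregation)."""
    counts = Counter(record[group_by] for record in records)
    return [{group_by: key, "count": n} for key, n in sorted(counts.items())]


def output_stage(rows, sink):
    """Data output: write each row to the sink as a JSON string."""
    for row in rows:
        sink.append(json.dumps(row, sort_keys=True))


raw = [
    '{"uid": "a", "action": "view"}',
    "not json",
    '{"uid": "b", "action": "click"}',
    '{"action": "view"}',
    '{"uid": "a", "action": "click"}',
]
sink = []
output_stage(process_stage(read_stage(raw, ["uid", "action"]), "uid"), sink)
print(sink)
```

Each stage takes only data and configuration-like arguments, which is the property that lets a configuration file, rather than task-specific code, decide what a real-time task does.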
The embodiment of the invention has the following technical effects. Terms used in the prior art:
Real-time data: a set of sequential, massive, fast and continuously arriving data sequences; a data stream can be viewed as a dynamic collection of data that grows without bound over time.
Storm, spark, flink: big data technology components for processing real-time data, which can perform calculations on large volumes of data efficiently.
The prior art has the following defects:
1. Real-time data processing technologies are many and varied, each with its own strengths, for example storm, spark and flink. To use any one of them, however, a developer must first master its principles, and writing the related task code has a certain threshold, so development time is long and maintenance cost is high.
2. Task submission is not controlled: each submission starts a task on the cluster, so submitting several times by mistake leaves several identical tasks running. If a key node of a task fails, the task fails as well, and high availability cannot be guaranteed.
3. The task indexes collected by different real-time data processing technologies differ and their coverage is incomplete, so a task cannot be monitored comprehensively.
In view of these defects, the embodiments of the present invention provide the following solutions:
1. Multiple real-time data processing technologies are integrated, and the processing is abstracted into three stages (data reading, data processing and data output). To use the system, the user only needs to write a configuration file, such as a json or xml file, that specifies a data processing engine and the configuration of each stage; by specifying different data processing engines, the real-time task can run on different real-time data processing technologies, that is, big data technology components. No code needs to be written and the user does not need to understand the principles of the underlying technology, which improves productivity.
2. When a task is submitted, the system first checks whether it is registered: if not, it is submitted to run; if it is, the submission exits. This solves the problem of repeated task submission. Registered tasks are also monitored and are restarted when they fail, without user intervention.
3. During data processing, the indexes defined for each stage are collected and reported to the database in a unified way, so that users can see the running state of the task. Alarms can also be raised based on the collected indexes.
The following effects are achieved: multiple real-time data processing technologies are integrated and all of their functions remain available, and the user can choose freely among them according to the use scenario; data processing is abstracted, which lowers development cost and makes the system convenient to use; high availability of tasks is ensured and repeated submission is prevented; and tasks can be monitored.
It should be understood that the specific order or hierarchy of steps in the processes disclosed is an example of exemplary approaches. Based upon design preferences, it is understood that the specific order or hierarchy of steps in the processes may be rearranged without departing from the scope of the present disclosure. The accompanying method claims present elements of the various steps in a sample order, and are not intended to be limited to the specific order or hierarchy presented.
In the foregoing detailed description, various features are grouped together in a single embodiment for the purpose of streamlining the disclosure. This method of disclosure is not to be interpreted as reflecting an intention that the claimed embodiments of the subject matter require more features than are expressly recited in each claim. Rather, as the following claims reflect, invention lies in less than all features of a single disclosed embodiment. Thus, the following claims are hereby expressly incorporated into the detailed description, with each claim standing on its own as a separate preferred embodiment of the invention.
The previous description of the disclosed embodiments is provided to enable any person skilled in the art to make or use the present invention. Various modifications to these embodiments will be readily apparent to those skilled in the art, and the generic principles defined herein may be applied to other embodiments without departing from the spirit or scope of the disclosure. Thus, the present disclosure is not intended to be limited to the embodiments shown herein but is to be accorded the widest scope consistent with the principles and novel features disclosed herein.
What has been described above includes examples of one or more embodiments. It is, of course, not possible to describe every conceivable combination of components or methodologies for purposes of describing the aforementioned embodiments, but one of ordinary skill in the art may recognize that many further combinations and permutations of various embodiments are possible. Accordingly, the embodiments described herein are intended to embrace all such alterations, modifications and variations that fall within the scope of the appended claims. Furthermore, to the extent that the term "includes" is used in either the detailed description or the claims, such term is intended to be inclusive in a manner similar to the term "comprising" as "comprising" is interpreted when employed as a transitional word in a claim. Furthermore, any use of the term "or" in the specification or the claims is intended to mean a "non-exclusive or".
Those of skill in the art will further appreciate that the various illustrative logical blocks, units, and steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or combinations of both. To clearly illustrate the interchangeability of hardware and software, various illustrative components, elements, and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design requirements of the overall system. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present embodiments.
The various illustrative logical blocks, or elements, described in connection with the embodiments disclosed herein may be implemented or performed with a general purpose processor, a digital signal processor, an Application Specific Integrated Circuit (ASIC), a field programmable gate array or other programmable logic device, discrete gate or transistor logic, discrete hardware components, or any combination thereof designed to perform the functions described herein. A general-purpose processor may be a microprocessor, but in the alternative, the processor may be any conventional processor, controller, microcontroller, or state machine. A processor may also be implemented as a combination of computing devices, e.g., a digital signal processor and a microprocessor, a plurality of microprocessors, one or more microprocessors in conjunction with a digital signal processor core, or any other similar configuration.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. A software module may be stored in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art. For example, a storage medium may be coupled to the processor such that the processor can read information from, and write information to, the storage medium. In the alternative, the storage medium may be integral to the processor. The processor and the storage medium may reside in an ASIC, which may be located in a user terminal. In the alternative, the processor and the storage medium may reside in different components in a user terminal.
In one or more exemplary designs, the functions described above in connection with the embodiments of the invention may be implemented in hardware, software, firmware, or any combination of the three. If implemented in software, the functions may be stored on, or transmitted over, a computer-readable medium as one or more instructions or code. Computer-readable media include both computer storage media and communication media that facilitate transfer of a computer program from one place to another. Storage media may be any available media that can be accessed by a general-purpose or special-purpose computer. For example, such computer-readable media can include, but are not limited to, RAM, ROM, EEPROM, CD-ROM or other optical disk storage, magnetic disk storage or other magnetic storage devices, or any other medium which can be used to carry or store program code in the form of instructions or data structures and which can be read by a general-purpose or special-purpose computer, or a general-purpose or special-purpose processor. Additionally, any connection is properly termed a computer-readable medium, and is thus included if the software is transmitted from a website, server, or other remote source via a coaxial cable, fiber optic cable, twisted pair, Digital Subscriber Line (DSL), or wirelessly, e.g., by infrared, radio, or microwave. Disk and disc, as used herein, include compact discs, laser discs, optical discs, DVDs, floppy disks and Blu-ray discs, where disks usually reproduce data magnetically, while discs usually reproduce data optically with lasers. Combinations of the above may also be included in the computer-readable medium.
The above-mentioned embodiments are intended to illustrate the objects, technical solutions and advantages of the present invention in further detail, and it should be understood that the above-mentioned embodiments are merely exemplary embodiments of the present invention, and are not intended to limit the scope of the present invention, and any modifications, equivalent substitutions, improvements and the like made within the spirit and principle of the present invention should be included in the scope of the present invention.

Claims (10)

1. A real-time data processing method, characterized by performing a real-time task according to the steps of:
dynamically creating an instance object of the data processing engine according to information of the data processing engine recorded in a pre-specified task configuration;
and executing the data processing process of the real-time task according to the specified running configuration recorded in the pre-specified task configuration through the method interface provided by the instance object.
2. A real-time data processing method according to claim 1, wherein the information of the data processing engine comprises a class name of the data processing engine;
the dynamically creating an instance object of the data processing engine according to the information of the data processing engine recorded in the pre-specified task configuration specifically includes:
and dynamically creating an instance object of the data processing engine in a reflection mode according to the class name of the data processing engine.
3. The real-time data processing method of claim 1, wherein the specified running configuration comprises: data reading configuration, data processing configuration and data output configuration;
the executing the data processing process of the real-time task according to the specified running configuration recorded in the pre-specified task configuration through the method interface provided by the instance object comprises the following steps:
reading data to be processed from a data source appointed in the data reading configuration, and extracting structured data from the data to be processed according to a data structure format appointed in the data reading configuration;
processing the structured data according to data pre-processing operation, data aggregation operation and/or data post-processing operation defined by the data processing configuration to obtain processed data;
writing the processed data into an output storage component appointed in the data output configuration according to an output format appointed in the data output configuration;
wherein the data preprocessing operation comprises: intercepting, replacing, filtering, converting, merging, unfolding, deleting and/or splitting the structured data to obtain a data preprocessing result;
the data post-processing operation comprises: and intercepting, replacing, filtering, converting, merging, expanding, deleting and/or splitting the result after the data aggregation operation to obtain processed data.
4. The real-time data processing method of claim 1, further comprising, prior to executing the real-time task:
displaying a task management page to a user, and receiving and submitting task configuration filled by the user as the pre-specified task configuration through the task management page;
storing the pre-specified task configuration, and responding to a read configuration request in a mode of returning the pre-specified task configuration when the read configuration request aiming at the pre-specified task configuration is received;
during the execution of the real-time task, further comprising:
collecting and storing specified indexes of the real-time task during the running period;
and inquiring the specified index according to the specified alarm configuration, and if the specified index meets the specified alarm rule in the alarm configuration, giving an alarm in a specified mode.
5. The real-time data processing method of claim 1, further comprising:
managing the running of the real-time task through a running management operation submitted via a task management page; and,
when the operation management operation is designated as a starting operation, reading the pre-designated task configuration, performing configuration check, if the configuration check passes, further inquiring a registration record, and if the real-time task is not registered, registering the registration information of the real-time task in the registration record, and starting the real-time task;
restarting the real-time task when the operation management operation is designated as a restart operation;
when the operation management operation is designated as a stop operation, terminating the real-time task and deleting registration information of the real-time task from the registration record.
6. A real-time data processing system, comprising:
a real-time task creating unit, configured to dynamically create an instance object of a data processing engine according to information about the data processing engine recorded in a pre-specified task configuration;
and a real-time task running unit, configured to execute, through a method interface provided by the instance object, the data processing process of the real-time task according to a specified running configuration recorded in the pre-specified task configuration.
7. The real-time data processing system of claim 6, wherein the information for the data processing engine includes a class name for the data processing engine;
the real-time task creating unit is specifically configured to:
dynamically create an instance object of the data processing engine by reflection according to the class name of the data processing engine.
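Creating an engine instance from a class name recorded in configuration can be sketched with Python's standard `importlib` module. The configuration key `engine_class` and the helper `create_engine` are hypothetical, and `collections.Counter` is used purely as a stand-in for an engine class so that the example runs without any third-party engine installed.

```python
import importlib
from typing import Any


def create_engine(class_path: str, **kwargs: Any) -> Any:
    """Instantiate the class named by a fully qualified 'module.ClassName' string."""
    module_name, _, class_name = class_path.rpartition(".")
    if not module_name:
        raise ValueError(f"expected 'module.ClassName', got {class_path!r}")
    module = importlib.import_module(module_name)  # load the module by name
    engine_cls = getattr(module, class_name)       # look the class up by name
    return engine_cls(**kwargs)


# Hypothetical task configuration; only the class name is needed to build the engine.
task_config = {"engine_class": "collections.Counter"}
engine = create_engine(task_config["engine_class"])

# The task then runs through the method interface exposed by the instance object.
engine.update(["a", "b", "a"])
```

Because the class is resolved at run time from a string, switching to another engine only requires changing the task configuration, not the calling code.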
8. The real-time data processing system of claim 6, wherein the specified running configuration comprises: a data reading configuration, a data processing configuration and a data output configuration;
the real-time task running unit comprises:
a data reading module, configured to read data to be processed from a data source specified in the data reading configuration, and to extract structured data from the data to be processed according to a data structure format specified in the data reading configuration;
a data processing module, configured to process the structured data according to a data preprocessing operation, a data aggregation operation and/or a data post-processing operation defined by the data processing configuration to obtain processed data;
a data output module, configured to write the processed data into an output storage component specified in the data output configuration according to an output format specified in the data output configuration;
wherein the data preprocessing operation comprises: intercepting, replacing, filtering, converting, merging, expanding, deleting and/or splitting the structured data to obtain a data preprocessing result;
the data post-processing operation comprises: intercepting, replacing, filtering, converting, merging, expanding, deleting and/or splitting the result of the data aggregation operation to obtain the processed data.
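The three modules can be sketched as one configuration-driven function. Everything concrete here is an assumption made for illustration: the input is taken to be JSON lines, the aggregation is a sum grouped by one field, the output format is CSV, and the configuration keys (`read.fields`, `process.group_by`, `process.value_field`, `process.min_value`) are hypothetical.

```python
import csv
import io
import json
from collections import defaultdict
from typing import Any, Dict, Iterable, List, TextIO


def run_pipeline(lines: Iterable[str], config: Dict[str, Any], sink: TextIO) -> List[Dict[str, Any]]:
    """Read JSON lines, filter, aggregate by key, post-process, and write CSV to `sink`."""
    fields = config["read"]["fields"]
    group_by = config["process"]["group_by"]
    value_field = config["process"]["value_field"]
    min_value = config["process"]["min_value"]

    # Data reading: keep only the fields named in the data structure format.
    records = []
    for line in lines:
        raw = json.loads(line)
        records.append({name: raw.get(name) for name in fields})

    # Preprocessing: filter out records whose value is missing or below the minimum.
    records = [r for r in records
               if r[value_field] is not None and r[value_field] >= min_value]

    # Aggregation: sum the value field per group key.
    totals: Dict[Any, float] = defaultdict(float)
    for r in records:
        totals[r[group_by]] += r[value_field]

    # Post-processing: round the totals and order the rows by key.
    rows = [{group_by: key, "total": round(total, 2)} for key, total in sorted(totals.items())]

    # Data output: write the processed rows in the configured format (CSV here).
    writer = csv.DictWriter(sink, fieldnames=[group_by, "total"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return rows


config = {
    "read": {"fields": ["user", "value"]},
    "process": {"group_by": "user", "value_field": "value", "min_value": 0},
}
lines = [
    '{"user": "a", "value": 1.5, "extra": 1}',
    '{"user": "b", "value": 2}',
    '{"user": "a", "value": 2.5}',
    '{"user": "b", "value": -1}',
]
sink = io.StringIO()  # in-memory stand-in for the output storage component
rows = run_pipeline(lines, config, sink)
```

In a real deployment the list of lines would be replaced by a streaming source and the `StringIO` sink by the configured storage component; the stage order stays the same.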
9. The real-time data processing system of claim 6, further comprising:
a configuration generating unit, configured to display a task management page to a user and to receive, through the task management page, the task configuration filled in and submitted by the user as the pre-specified task configuration;
a configuration storage unit, configured to store the pre-specified task configuration and, when a read-configuration request for the pre-specified task configuration is received, to respond to the request by returning the pre-specified task configuration;
a metric collection and storage unit, configured to collect and store specified metrics of the real-time task during its running period;
and an alarm unit, configured to query the specified metrics according to a specified alarm configuration and to issue an alarm in a specified manner if the specified metrics satisfy a specified alarm rule in the alarm configuration.
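The configuration storage unit's behaviour (store what the page submits, return it on a read request) can be sketched as below. The `ConfigStore` class, its method names, and the shape of the response dictionary are hypothetical; the claim does not specify a storage backend or a wire format.

```python
import copy
import json
from typing import Any, Dict


class ConfigStore:
    """In-memory stand-in for the unit that stores pre-specified task configurations."""

    def __init__(self) -> None:
        self._configs: Dict[str, Dict[str, Any]] = {}

    def submit(self, task_id: str, config: Dict[str, Any]) -> None:
        """Store a configuration submitted from the task management page."""
        # Round-trip through JSON so that only serializable configurations are accepted
        # and the stored copy is detached from the caller's object.
        self._configs[task_id] = json.loads(json.dumps(config))

    def handle_read_request(self, task_id: str) -> Dict[str, Any]:
        """Respond to a read-configuration request by returning the stored configuration."""
        config = self._configs.get(task_id)
        if config is None:
            return {"status": "not_found", "task_id": task_id}
        return {"status": "ok", "task_id": task_id, "config": copy.deepcopy(config)}
```

Returning a copy keeps a running task from accidentally modifying the stored configuration that later start or restart operations will read.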
10. The real-time data processing system of claim 6, further comprising:
a task management unit, configured to:
manage the running of the real-time task through a run management operation submitted via a task management page;
when the run management operation is a start operation, read the pre-specified task configuration and perform a configuration check; if the configuration check passes, query a registration record; and, if the real-time task is not yet registered, write the registration information of the real-time task into the registration record and start the real-time task;
when the run management operation is a restart operation, restart the real-time task;
when the run management operation is a stop operation, terminate the real-time task and delete the registration information of the real-time task from the registration record.
CN202111508187.2A 2021-12-10 2021-12-10 Real-time data processing method and system Pending CN114253534A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111508187.2A CN114253534A (en) 2021-12-10 2021-12-10 Real-time data processing method and system

Publications (1)

Publication Number Publication Date
CN114253534A (en) 2022-03-29

Family

ID=80794616

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111508187.2A Pending CN114253534A (en) 2021-12-10 2021-12-10 Real-time data processing method and system

Country Status (1)

Country Link
CN (1) CN114253534A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN116629805A (en) * 2023-06-07 2023-08-22 浪潮智慧科技有限公司 Water conservancy index service method, equipment and medium for distributed flow batch integration
CN116629805B (en) * 2023-06-07 2023-12-01 浪潮智慧科技有限公司 Water conservancy index service method, equipment and medium for distributed flow batch integration

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination