CN116662325B - Data processing method and system - Google Patents


Info

Publication number
CN116662325B
Authority
CN
China
Prior art keywords
data
processing
layer
state
logic
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202310909725.1A
Other languages
Chinese (zh)
Other versions
CN116662325A (en)
Inventor
袁猛
裴文刚
李勉
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Ningbo Sumscope Information Technology Co ltd
Original Assignee
Ningbo Sumscope Information Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Application filed by Ningbo Sumscope Information Technology Co ltd filed Critical Ningbo Sumscope Information Technology Co ltd
Priority to CN202310909725.1A
Publication of CN116662325A
Application granted
Publication of CN116662325B
Legal status: Active


Classifications

    • G06F 16/215: Improving data quality; data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F 11/1446: Point-in-time backing up or restoration of persistent data
    • G06F 16/2365: Ensuring data consistency and integrity
    • G06F 16/245: Query processing
    • Y02D 10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Abstract

The application provides a data processing method and system. The method includes: performing data processing on each data layer in a data model based on a service logic file, and determining the data corresponding to each data layer; at preset time intervals, persisting the incremental data corresponding to a data layer according to a checkpoint in that layer to obtain a data intermediate state, where the checkpoint represents a signal to persist the incremental data in the data layer; and, in response to the data processing meeting a first condition, performing data recovery on the data according to the data intermediate state. The data processing method provided by the application improves the processing efficiency of market data.

Description

Data processing method and system
Technical Field
The present application relates to the field of computer technologies, and in particular, to a data processing method and system.
Background
Financial market data processing plays an important role in the financial industry, mainly for real-time monitoring, analysis and prediction of financial market dynamics. However, conventional market data processing systems suffer from low efficiency, poor extensibility and similar problems when handling large volumes of real-time data. In addition, how to recover data and maintain processing consistency after a program restart, and how to improve developer productivity, remain challenges for current financial market data processing systems.
Disclosure of Invention
The embodiments of the application provide a data processing method and system that can improve the efficiency and extensibility of data processing in a market data system, improve developer productivity, and recover data while preserving processing consistency after an abnormal restart.
The technical scheme of the embodiment of the application is realized as follows:
in a first aspect, an embodiment of the present application provides a data processing method, including:
carrying out data processing on each data layer in the data model based on a service logic file, and determining data corresponding to each data layer;
performing persistence processing on the incremental data corresponding to the data layer according to a check point in the data layer at preset time intervals to obtain a data intermediate state, wherein the check point is used for representing a signal for performing the persistence processing on the incremental data in the data layer;
and responding to the data processing meeting a first condition, and carrying out data recovery processing on the data according to the data intermediate state.
In the above scheme, the performing persistence processing on the incremental data corresponding to the data layer according to the check point in the data layer at intervals of preset time to obtain a data intermediate state includes:
generating the check point every preset time;
sending the checkpoints to the input operators in the data model, each checkpoint flowing through its data layer along the data processing link;
after the data layer receives the checkpoint, pre-committing the processing results of the incremental data corresponding to the input operator and, in turn, to each downstream operator of the input operator;
and if all operators pre-commit successfully, formally committing the processing of the incremental data corresponding to the input operator and all of its downstream operators, and persisting all the incremental data corresponding to the data layer.
In the above scheme, the service logic file includes at least one of the following:
standard processing logic, data cleansing logic, and data expansion logic.
In the above scheme, the data processing is performed on each data layer in the data model based on the service logic file, and determining the data corresponding to each data layer includes:
determining source data as data of a first data layer in the data model;
processing the data of the first data layer based on the data cleaning logic in the service logic file to obtain the data of the second data layer in the data model; the data of the first data layer corresponds one-to-one with the data of the second data layer;
and processing the data of the second data layer based on the data expansion logic in the service logic file to obtain the data of the third data layer in the data model.
In the above scheme, the performing persistence processing on the incremental data corresponding to the data layer according to the check point in the data layer at intervals of preset time to obtain a data intermediate state includes:
determining a computing state corresponding to data in the data layer, wherein the computing state comprises a first computing state and a second computing state;
determining the data with the numerical value changed in the preset time as the incremental data in the data of the first calculation state;
wherein the first computing state indicates that the computation result of the data depends on the computation result of the data at a first moment; the second computing state indicates that the computation result of the data is independent of the computation result at any moment.
In the above scheme, the standard processing logic comprises:
determining the attribute and the type of an entity in a class file corresponding to the source data according to the field and the type corresponding to each piece of source data;
and determining the logic relationship of each entity according to the dependency relationship between each field.
In the above scheme, the responding to the data processing meeting the first condition, performing data recovery processing on the data according to the data intermediate state, includes:
acquiring a state descriptor of the data intermediate state;
acquiring a data offset and an intermediate result of the data when the data processing meets a first condition according to the state descriptor;
and carrying out data recovery processing on the data according to the data offset and the intermediate result.
In a second aspect, an embodiment of the present application provides a data processing system, including:
the pipeline building assembly is used for carrying out data processing on each data layer in the data model based on the service logic file and determining the data corresponding to each data layer;
the metadata management component is used for carrying out persistence processing on the incremental data corresponding to the data layer according to the check point in the data layer every preset time to obtain a data intermediate state;
and the state management component is used for responding to the data processing meeting the first condition and carrying out data recovery processing on the data according to the data intermediate state.
In the above solution, the system further includes:
and the domain model component is used for determining at least one business logic file of standard processing logic, data cleaning logic and data expansion logic.
According to the data processing method provided by the embodiments of the application, each data layer in the data model is processed based on the service logic file, and the data corresponding to each data layer is determined; at preset time intervals, the incremental data corresponding to a data layer is persisted according to a checkpoint in that layer to obtain a data intermediate state, where the checkpoint represents a signal to persist the incremental data in the data layer; and, in response to the data processing meeting a first condition, data recovery is performed according to the data intermediate state. By processing data through business logic files, the method improves the efficiency and extensibility of data processing and raises development productivity; at the same time, when the data processing meets certain conditions, the data can be recovered through the data intermediate state obtained by persisting the incremental data, ensuring the consistency of results throughout the data processing.
Drawings
The drawings are included to provide a better understanding of the present application and are not to be construed as limiting the application. Wherein:
FIG. 1 is a schematic diagram of an alternative processing flow of a data processing method according to an embodiment of the present application;
FIG. 2 is a schematic diagram of an alternative architecture of a data processing system provided by an embodiment of the present application;
FIG. 3 is a schematic diagram of an alternative structure of the pipeline building component provided by an embodiment of the application.
Detailed Description
The present application will be further described in detail with reference to the accompanying drawings, for the purpose of making the objects, technical solutions and advantages of the present application more apparent, and the described embodiments should not be construed as limiting the present application, and all other embodiments obtained by those skilled in the art without making any inventive effort are within the scope of the present application.
In the following description, reference is made to "some embodiments" which describe a subset of all possible embodiments, but it is to be understood that "some embodiments" can be the same subset or different subsets of all possible embodiments and can be combined with one another without conflict.
In the following description, the terms "first", "second", and the like are merely used to distinguish between similar objects and do not represent a particular ordering of the objects, it being understood that the "first", "second", or the like may be interchanged with one another, if permitted, to enable embodiments of the application described herein to be practiced otherwise than as illustrated or described herein.
Unless defined otherwise, all technical and scientific terms used herein have the same meaning as commonly understood by one of ordinary skill in the art to which this application belongs. The terminology used herein is for the purpose of describing embodiments of the application only and is not intended to be limiting of the application.
Referring to fig. 1, fig. 1 is a schematic diagram of an alternative processing flow of the data processing method according to the embodiment of the present application, and the description will be made below with reference to steps S101 to S103 shown in fig. 1.
Step S101, data processing is carried out on each data layer in the data model based on the service logic file, and data corresponding to each data layer is determined.
In some embodiments, current processing of financial market data faces the following problems: 1. financial market data has many sources with differing extraction methods; 2. the data-link processing is relatively complex: one processing step may involve several conversions and loads, and several data sources may need to be merged into one; 3. most conversion functions involve stateful computation, whose disaster-recovery capability needs attention; 4. the data has strict timeliness requirements; 5. there are many data loading targets. To address these problems, an abstract concept of an "operator" can be designed: the extraction process corresponds to an input (Source) operator, the conversion process to a conversion (Converter) operator, the loading process to an output (Sink) operator, and so on. Each operator has the following characteristics: 1. it has 0 to n upstreams; 2. it has 0 to n downstreams; 3. it has its own data-receiving buffer; 4. it has its own data-sending buffer; 5. it holds a global state manager. Because operators relate upstream and downstream in a many-to-many fashion, complex data processing links can be built.
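The operator abstraction described above can be sketched as follows. This is an illustrative Python sketch, not the patent's implementation; the class and method names (`Operator`, `connect`, `emit`) are assumptions.

```python
from collections import deque

class Operator:
    """Minimal sketch of the 'operator' abstraction: 0..n upstreams,
    0..n downstreams, and per-operator buffers (names are illustrative)."""
    def __init__(self, name):
        self.name = name
        self.upstream = []         # 0..n upstream operators
        self.downstream = []       # 0..n downstream operators
        self.in_buffer = deque()   # data-receiving buffer
        self.out_buffer = deque()  # data-sending buffer (unused in this toy)

    def connect(self, other):
        """Wire this operator upstream of `other` (many-to-many allowed)."""
        self.downstream.append(other)
        other.upstream.append(self)

    def process(self, record):
        """Identity by default; Source/Converter/Sink variants override it."""
        return record

    def emit(self, record):
        out = self.process(record)
        for op in self.downstream:
            op.in_buffer.append(out)           # hand off via the receiving buffer
            op.emit(op.in_buffer.popleft())

# Extraction -> Source, conversion -> Converter, loading -> Sink
source, converter, sink = Operator("source"), Operator("converter"), Operator("sink")
source.connect(converter)
converter.connect(sink)
converter.process = lambda r: r * 2            # a toy conversion
results = []
sink.process = lambda r: results.append(r) or r
source.emit(3)
print(results)  # [6]
```

A real implementation would make the buffers bounded and concurrent; here they only show the data path through the chained operators.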
In some embodiments, the access sources can be abstracted by defining a unified input operator, through which financial market data from different data sources is flexibly processed into source data. Financial market data includes real-time market data and offline market data: real-time market data is all market data updated in real time within the current trading session, typically transmitted over a network socket; offline market data is a batch of pre-market or post-market data, typically read from a database or file system. Key monitoring metrics, such as the total count of source data received on the current day, can be implemented in the input operator.
In some embodiments, after the source data is acquired, the source data may be subjected to data modeling based on the data fields included in the source data, so as to obtain a data model corresponding to the source data. Each data layer in the data model can correspond to a different operator to implement abstract processing of each layer of data in the data model. The first data layer in the data model may receive source data based on the input operator implementation and determine the source data as data for the first data layer in the data model.
In some embodiments, each data layer in the data model may correspond to a different business logic file, which may include at least one of standard processing logic, data cleansing logic, and data expansion logic. Developers can implement a generic interface for the different business logic files based on a business processing (Channel) operator. Different business processing logic corresponds to different business processing operators; by decoupling and reusing these operators according to actual requirements, development efficiency and the extensibility of data processing can be improved.
The standard processing logic can determine the attributes and types of the entities in the class file corresponding to the data according to the field and type of each piece of data in the data layer, and determine the logical relationships of the entities according to the dependencies between fields, such as the lineage formed as data is generated, merged, circulated and retired. By analyzing data lineage, the ownership and origin of the data can be determined, giving the data traceability and a degree of layering. During development, the standard processing logic can generate class files in a specific programming language, such as Java class files, as development requires.
The data cleansing logic may include deleting duplicated redundant data, normalizing the format or values of the data, filling in missing data, or pre-processing text in the data, such as removing special symbols, removing English characters, or normalizing case.
The data expansion logic may perform classification, combination, screening, summarization, analysis, induction, reasoning, comparison, deduction, etc. on the data according to the logical relationship of the data in the data layer. And a developer can develop the data expansion logic according to actual requirements.
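As an illustration of how such decoupled business logic might look, the following sketch models data cleansing and data expansion logic as plain functions that a Channel operator could call. All field names (`symbol`, `price`, `volume`, `turnover`) and the helpers are hypothetical, not from the patent.

```python
# Hypothetical sketch: each business-logic file maps to one reusable
# piece of Channel-operator logic, so cleansing and expansion stay decoupled.

def cleanse(record):
    """Data cleansing logic (illustrative): normalize string formats
    and drop missing fields, without inventing new fields."""
    return {k: v.strip() if isinstance(v, str) else v
            for k, v in record.items() if v is not None}

def expand(record):
    """Data expansion logic (illustrative): derive a new field
    from existing ones via a logical relationship."""
    out = dict(record)
    out["turnover"] = record["price"] * record["volume"]
    return out

raw = {"symbol": " 600000 ", "price": 10.0, "volume": 200, "note": None}
clean = cleanse(raw)
enriched = expand(clean)
print(enriched["symbol"], enriched["turnover"])  # 600000 2000.0
```

Because each function owns one concern, either piece of logic can be replaced or reused independently, which is the decoupling benefit the text describes.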
In some embodiments, the data model of the source data after data modeling may include three data layers, and the data in each data layer may be stored in the form of a two-dimensional table. The first data layer may be a source data layer (ODS), the second data layer may be a normalized detail data layer (DWD), and the third data layer may be a service summary data layer (DWS).
Wherein the source data may be determined as the data of the first data layer, which is the raw, most primitive data.
The data of the second data layer is obtained by processing the data of the first data layer, i.e., the source data, with the data cleansing logic. Because the data cleansing logic only standardizes or reformats the source data and does not change its fields, the data of the first data layer corresponds one-to-one with the data of the second data layer.
The data of the second data layer can be processed based on the data expansion logic to obtain the data of the third data layer. For example, aggregating and summarizing fields of the second data layer by trading volume or yield produces new fields such as total trading volume or total yield.
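The three-layer flow above can be illustrated with a minimal sketch; the records, field names and aggregation rule are invented for illustration only.

```python
# Sketch of the three-layer model: ODS (raw) -> DWD (cleansed, one-to-one)
# -> DWS (aggregated). Field names are illustrative, not from the patent.

ods = [  # first layer: source data as received
    {"symbol": "A", "volume": "100"},
    {"symbol": "A", "volume": "250"},
    {"symbol": "B", "volume": "40"},
]

# DWD: cleansing only fixes formats/values, never fields, so rows map 1:1
dwd = [{"symbol": r["symbol"], "volume": int(r["volume"])} for r in ods]
assert len(dwd) == len(ods)

# DWS: expansion logic aggregates, e.g. total volume per symbol
dws = {}
for r in dwd:
    dws[r["symbol"]] = dws.get(r["symbol"], 0) + r["volume"]
print(dws)  # {'A': 350, 'B': 40}
```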
In some embodiments, a collector operator can cache data in a lock-free cache queue, improving the data processing efficiency of each data layer. Data accessed by an input operator in the data model can be stored in the buffer corresponding to the collector operator, and all operators downstream of that input operator are notified to fetch data from the buffer. If an operator has multiple upstreams, there is one data-receiving buffer per upstream. An operator cyclically takes data from its receiving buffer, processes it, and sends the result on to downstream operators; when the last operator is an output operator, processing ends and the data is not passed further. The buffer also gives the program backpressure capability: if some downstream operator consumes far more slowly than its upstream sends, the buffer quickly fills, the upstream can no longer send, and the pressure propagates along the data link back to the most upstream input operator, which then slows its data acquisition, making the data processing program more stable. Some key metrics can also be cached, such as the maximum, minimum and average processing time over all data of the current day, or the count of data not yet processed. The collector operator has three advantages: 1) when the data volume is excessive, it reduces the data delays caused by network backpressure; 2) it mitigates false sharing in memory through cache-line padding, speeding up data processing; 3) it supports high-performance lock-free multithreaded concurrency.
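The backpressure effect of a bounded buffer can be demonstrated with a small sketch. The patent describes a lock-free queue; this sketch uses Python's locking `queue.Queue` only to show the capacity effect, and the capacity and record counts are illustrative.

```python
from queue import Queue, Full

# Sketch: a bounded buffer between operators gives the pipeline backpressure.
# When a downstream consumer lags, the buffer fills and the upstream producer
# must slow down instead of piling up unbounded data in memory.

buffer = Queue(maxsize=3)   # per-operator receiving buffer (capacity illustrative)

produced, pressured = 0, 0
for record in range(10):    # fast producer; consumer never drains the buffer
    try:
        buffer.put_nowait(record)
        produced += 1
    except Full:
        pressured += 1      # signal the upstream to slow its acquisition rate

print(produced, pressured)  # 3 7: once full, every send pushes pressure upstream
```

The remarks in the text about false sharing and cache-line padding resemble the LMAX Disruptor pattern of lock-free ring buffers; that connection is an inference, not a statement from the patent.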
Step S102, performing persistence processing on the incremental data corresponding to the data layer according to a check point in the data layer every preset time to obtain a data intermediate state, wherein the check point is used for representing a signal for performing the persistence processing on the incremental data in the data layer.
In some embodiments, each data layer in the data model can perform a two-phase commit based on checkpoints to persist the data intermediate state in the layer. A checkpoint is a special data structure containing no payload, which represents a signal to persist the incremental data in the data layer: when an operator receives checkpoint-type data, it gathers the incremental data held by all operators of the current data model within the preset interval and commits the processed data intermediate state of all that incremental data. Checkpoints can be defined by the user according to actual needs. Two-phase commit comprises a pre-commit and a formal commit; because a program contains multiple conversion operators, each holding some of its own state, two-phase commit avoids inconsistent global state when any one operator fails to execute.
As an example, a checkpoint coordinator may first be defined, which can read the information of all operators and send checkpoints to all input operators on a timer or in a user-defined manner. An input operator inserts the checkpoint into its sending buffer and simultaneously opens a state transaction, writes its own state into the transaction to pre-commit the processing result of the incremental data, and notifies the coordinator of a successful pre-commit once the write succeeds. Each downstream operator repeats the same steps after receiving the checkpoint. After all operators report a successful pre-commit of the incremental data, the coordinator notifies all operators to formally commit their transactions; at that moment the persistence of one data snapshot in the data model is complete. Each data layer in the data model can thus persist its corresponding incremental data at preset time intervals.
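The coordinator-driven two-phase commit described above can be sketched as follows; the class names and the in-memory "transaction" are assumptions standing in for real state transactions.

```python
# Sketch of the checkpoint coordinator's two-phase commit: every operator
# pre-commits its state in a transaction; only if all pre-commits succeed
# does the coordinator tell everyone to formally commit. Illustrative only.

class Op:
    def __init__(self, name, fail=False):
        self.name, self.fail = name, fail
        self.committed = False
        self.pending = None

    def pre_commit(self, state):
        if self.fail:
            return False          # write failed: report pre-commit failure
        self.pending = state      # write state into an open transaction
        return True

    def commit(self):
        self.committed = True     # formal commit: persist the transaction

def checkpoint(ops, states):
    if all(op.pre_commit(states[op.name]) for op in ops):
        for op in ops:
            op.commit()           # phase 2 runs only after unanimous phase 1
        return True
    return False                  # any failure leaves no operator committed

ops = [Op("source"), Op("converter"), Op("sink")]
ok = checkpoint(ops, {"source": 10, "converter": 5, "sink": 3})
print(ok, all(op.committed for op in ops))  # True True
```

The key property is that a single failed pre-commit leaves every operator un-committed, which is how the scheme avoids inconsistent global state.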
In some embodiments, because an operator may have multiple upstreams, it must receive the checkpoints of all upstreams before it can pre-commit the state of its buffer. The sending and receiving buffers within a single program can therefore be designed as one shared buffer memory: upstream operators put data into the buffer and downstream operators take data out of it. When an operator with multiple upstreams receives its first checkpoint of the current round, it stops taking data from the buffer, and resumes only after all checkpoints have been received and the pre-commit has been performed.
In some embodiments, when performing data persistence processing on incremental data corresponding to the data layers according to checkpoints in the data layers, information such as a computing state, a storage structure, a state descriptor, a state backend and the like corresponding to the data in each data layer in the data model can be determined, and persistence processing is performed on a data intermediate state obtained after the incremental data is processed.
Wherein the computing states may include a first computing state and a second computing state. The first computing state denotes stateful computing logic, in which the computation result of the current data depends on the result computed at a first moment before the current moment; the value of the first moment can be chosen flexibly as needed. The second computing state denotes stateless computing logic, in which the computation result of the current data is independent of the result at any moment before the current one. Among the data in the first computing state, data whose value has changed within the preset interval is determined to be incremental data and is persisted through the checkpoint mechanism. When the program is restarted after an abnormality, the data of each data layer in the data model can be recovered based on the data intermediate state.
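The distinction between the two computing states can be made concrete with a short sketch; the quote fields and the running-high example are illustrative, not from the patent.

```python
# Stateless ("second state"): the result depends only on the current record.
def stateless_mid(quote):
    return (quote["bid"] + quote["ask"]) / 2

# Stateful ("first state"): the result also depends on an earlier result,
# so its changed values are the incremental data checkpoints must persist.
class RunningHigh:
    def __init__(self):
        self.high = float("-inf")   # state carried between records

    def update(self, price):
        self.high = max(self.high, price)
        return self.high

rh = RunningHigh()
prices = [10.0, 12.5, 11.0]
highs = [rh.update(p) for p in prices]
print(stateless_mid({"bid": 10.0, "ask": 12.0}), highs)  # 11.0 [10.0, 12.5, 12.5]
```

Losing `rh.high` on a restart would corrupt every later result, which is why only the stateful side needs checkpointed recovery.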
The storage structures may include a hash (Map) structure, a list (List) structure, a single-value (Value) structure, and an offset (Offset) structure. The hash structure stores key-value data: for a stock, for example, the stock name is the key and its highest price of the day is the value. The list structure stores multiple time-ordered data, such as every price of a stock during the day. The single-value structure stores a single value, such as the total amount across all stocks. The offset structure stores the input operator's current consumption position, i.e. the current data processing progress. The hash, list and single-value structures can all store intermediate results of data in the first computing state.
The state descriptor may include the type of the data storage structure, the data identification (id), the validity period of the stored data, the back-end name of the data store, whether incremental updates to the back end are supported, and so on. For incremental data, the increment is pushed to the back end every preset interval, with each update carrying the latest processing result of the data.
The back end named in the descriptor, i.e. the state backend, may be a distributed database such as Redis or a local database such as RocksDB, and provides the storage path for intermediate states produced during processing. When the program processes a piece of data, some states may change; the changes happen in memory and are then persisted uniformly to the state backend as incremental data, yielding an updated data intermediate state, so that after an abnormal restart the program can resume from the last point of failure. The data intermediate state contains the data offset information and the intermediate results of processing: the offset information identifies the current processing position, and both are matched one-to-one to the data in the state descriptor through the data identification. The storage path of the data intermediate state identifies the address of the intermediate state within the state backend.
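A possible shape for the state descriptor and its interaction with the state backend is sketched below. The descriptor fields mirror the information listed above, but their names and the dictionary-based backend are assumptions; a real backend would be Redis or RocksDB.

```python
import time

class StateDescriptor:
    """Illustrative descriptor carrying the metadata the text lists."""
    def __init__(self, state_id, structure, ttl_seconds, backend, incremental):
        self.state_id = state_id        # data identification (id)
        self.structure = structure      # "map" | "list" | "value" | "offset"
        self.expires_at = time.time() + ttl_seconds  # validity period
        self.backend = backend          # back-end name of the data store
        self.incremental = incremental  # supports incremental updates?

backend = {}                            # in-memory stand-in for Redis/RocksDB

def persist(desc, delta):
    """Persist only the changed (incremental) intermediate state."""
    slot = backend.setdefault(desc.state_id, {})
    slot.update(delta)                  # the latest processing result wins

desc = StateDescriptor("day_high", "map", 3600, "redis", True)
persist(desc, {"600000": 10.2})
persist(desc, {"600000": 10.8})         # incremental update overwrites old value
print(backend["day_high"])              # {'600000': 10.8}
```

Writing only the delta rather than the full state is what keeps large-batch checkpointing cheap.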
When persisting the data layers, the data intermediate state of each data layer in the current data model can be updated according to the computing state, storage structure, state descriptor and state backend corresponding to that layer's incremental data, yielding the current data intermediate state. In large-batch processing, updating only the intermediate state of the incremental data each time improves data processing efficiency.
Step S103, in response to the data processing meeting a first condition, performing data recovery processing on the data according to the data intermediate state.
In some embodiments, the first condition may be any abnormal condition that requires a restart, such as downtime, an outage, or a crash. During data processing, all data intermediate states can be committed to the designated state backend through the checkpoint mechanism after each piece of data completes, or at specified time intervals. When the data processing restarts after an abnormality, the data intermediate state from before the abnormal start can be obtained, its state descriptor retrieved, and the storage-structure type, data identification, data validity period and other information read from the descriptor. If the restart time falls within the data validity period, the storage path of the intermediate state corresponding to the data identification is obtained from the state backend, and the data offset information and intermediate result are read from that path. The data offset information indicates the processing progress before the restart: if processing had reached the Nth piece of data, the offset is N. Data recovery then continues processing from the data offset on top of the saved intermediate result, realizing a fault-tolerance mechanism after abnormal restart and achieving breakpoint resumption.
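Breakpoint resumption from a persisted offset and intermediate result can be sketched as follows; the records and the snapshot format are illustrative.

```python
# Sketch of recovery after an abnormal restart: read the last persisted
# intermediate state (offset + intermediate result) and resume processing
# from the record after the offset, instead of reprocessing from scratch.

records = [1, 2, 3, 4, 5, 6]

# Pretend the program crashed after processing the first 4 records:
saved = {"offset": 4, "intermediate": sum(records[:4])}   # offset=4, sum=10

def resume(records, snapshot):
    total = snapshot["intermediate"]          # restore the intermediate result
    for r in records[snapshot["offset"]:]:    # continue from the saved offset
        total += r
    return total

print(resume(records, saved))  # 21, identical to processing all records once
```

Because the resumed total equals a clean end-to-end run, the recovery preserves the consistency of the processing result.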
A data processing system according to an embodiment of the present application will now be described with reference to fig. 2.
In FIG. 2, data processing system 200 includes metadata component 201, pipeline construction component 202, state management component 203, and domain model component 204.
The metadata component 201 is configured to store and manage the data model corresponding to the source data, where the data model includes information such as the data fields, business meanings, and field history version numbers in each data layer; and to perform persistence processing on the data corresponding to a data layer according to a checkpoint in that data layer at preset time intervals, so as to obtain a data intermediate state, where the checkpoint represents a signal for performing persistence processing on the incremental data in the data layer;
the pipeline construction component 202 is configured to perform data processing on each data layer in the data model based on the business logic file and determine the data corresponding to each data layer;
a state management component 203, configured to perform data recovery processing on the data according to the data intermediate state in response to the data processing meeting a first condition;
a domain model component 204, configured to determine a business logic file comprising at least one of standard processing logic, data cleansing logic, and data expansion logic.
In some embodiments, the metadata component 201 is further configured to generate a checkpoint at every preset time interval; send the checkpoint to the input operators in the data model, each checkpoint flowing along the data processing link in its data layer; and, after a data layer receives the checkpoint, pre-submit the processing results of the incremental data corresponding, in turn, to the input operator and to each type of downstream operator of the input operator.
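The pre-submit/formal-submit sequence amounts to a two-phase commit across the operators on one data-processing link. A minimal sketch follows, assuming a simplified synchronous pipeline; the `Operator` class and `checkpoint` function are illustrative names, not the patent's API.

```python
class Operator:
    def __init__(self, name, fail=False):
        self.name, self.fail = name, fail
        self.pending, self.committed = None, []

    def pre_commit(self, increment):
        if self.fail:
            return False            # pre-commit rejected by this operator
        self.pending = increment    # staged, not yet durable
        return True

    def commit(self):
        self.committed.append(self.pending)
        self.pending = None

def checkpoint(operators, increment):
    """Two-phase commit along one data-processing link: pre-submit at
    every operator, then formally submit only if all of them agreed."""
    if all(op.pre_commit(increment) for op in operators):
        for op in operators:
            op.commit()
        return True
    return False

link = [Operator("input"), Operator("map"), Operator("sink")]
ok = checkpoint(link, increment={"rows": 100})

# If any operator rejects the pre-commit, nothing is formally committed.
bad_link = [Operator("input"), Operator("sink", fail=True)]
rejected = checkpoint(bad_link, increment={"rows": 1})
```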
In some embodiments, the pipeline construction component 202 may include four interfaces: an access interface, a collector interface, a business processing interface, and an output interface. As shown in fig. 3, the pipeline construction component is divided into four layers: data flows from the access interface into the collector interface, from the collector interface into the business processing interface, and finally to the output interface, forming a directed acyclic graph. The interfaces in each layer can be assembled according to the actual requirements of the business, so that a variety of directed acyclic graphs can be formed.
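Assembled this way, the four interface layers form a simple linear directed acyclic graph. The sketch below models each layer as a generator stage; the function names and the quote fields `bid`/`ask` are illustrative assumptions, not the component's actual interfaces.

```python
def access(source):           # access interface: pull from a data source
    yield from source

def collector(stream):        # collector interface: buffer and forward
    buf = list(stream)        # stands in for the real buffer queue
    yield from buf

def business(stream):         # business processing interface
    for rec in stream:
        yield {**rec, "mid": (rec["bid"] + rec["ask"]) / 2}

def output(stream, sink):     # output interface: deliver downstream
    for rec in stream:
        sink.append(rec)

quotes = [{"bid": 10, "ask": 12}, {"bid": 9, "ask": 11}]
sink = []
# Assemble the four layers into one linear directed acyclic graph.
output(business(collector(access(quotes))), sink)
```

Richer graphs would be formed by fanning a collector out to several business stages, exactly as the assembly description above allows.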
The access interface can support both the real-time quotation data and the offline quotation data provided by different data sources, flexibly process financial data from those sources, and monitor key indicators such as the number of data records received on the current day.
The collector interface may buffer the data received by the access interface based on a lock-free buffer queue, which includes: storing the data accessed by the input operator into the buffer corresponding to the collector operator, and simultaneously notifying all downstream operators associated with that buffer to pull the data. Each operator has its own data access buffer: it acquires data from the buffer for processing and, after processing, sends the data on to downstream operators through their buffers. This reduces data latency in the data processing system, relieves the latency caused by network backpressure when the data volume is excessive, avoids false sharing in memory, accelerates data processing, and supports high-performance lock-free multithreaded concurrent processing. Indicators such as the maximum, minimum, and average of all data processing times of the day, and the amount of currently unprocessed data, can be implemented based on the collector interface.
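A much-simplified, single-threaded sketch of the collector's buffering scheme: one private buffer per downstream operator, filled on publish and drained on pull. The real design described above would use a lock-free multithreaded ring buffer; plain `deque`s stand in for it here, and the class and method names are assumptions.

```python
from collections import deque

class Collector:
    """Simplified collector sketch: one buffer per downstream operator,
    so each consumer pulls at its own pace without contending with the
    others (the lock-free concurrent queue is out of scope here)."""
    def __init__(self):
        self.buffers = {}

    def register(self, operator_name):
        self.buffers[operator_name] = deque()

    def publish(self, record):
        # "Notify" every registered downstream operator by enqueuing the
        # record into that operator's private buffer.
        for buf in self.buffers.values():
            buf.append(record)

    def pull(self, operator_name):
        buf = self.buffers[operator_name]
        return buf.popleft() if buf else None

c = Collector()
c.register("stats")
c.register("persist")
c.publish({"price": 101.5})
```

The per-operator buffer is also the natural place to count unprocessed records, which is how the "amount of data currently unprocessed" indicator mentioned above could be derived.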
The business processing interface implements the processing of specific business. Different business processing flows can correspond to different collector interfaces, and developers can implement different business processing interfaces according to actual requirements, calling the corresponding business logic files in the domain model component 204, and can decouple and reuse the different business processing interfaces as needed. Business processing interfaces can be assembled with one another, achieving decoupling and a task-flow mode, while key components can be monitored through indicators such as the input and output counts of the data and the data processing time. This reduces the developer's workload and lets developers focus on the business logic during business development.
The output interface may output the business processing results of the business processing interface to a downstream target system or target storage device, such as a database, a message queue, or a file system. In actual development, a callback function can be set on the operator corresponding to each output interface, and the developer can decide, according to actual requirements, what happens to the data next after a successful or failed transmission. For example, the output interface may return the key indicator "number of data transmissions" after a successful transmission.
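The callback arrangement can be sketched as below; `send`, `flaky_transport`, and the success/failure handlers are hypothetical names illustrating how a developer might count successful transmissions and queue failures for retry.

```python
def send(record, transport, on_success, on_failure):
    """Output-operator sketch with callbacks: the developer decides what
    happens after a successful or failed transmission."""
    try:
        transport(record)
    except Exception as exc:
        on_failure(record, exc)
    else:
        on_success(record)

sent_count = {"n": 0}        # key indicator: number of data transmissions
retry_queue = []             # one possible failure policy: queue for retry

def flaky_transport(record):
    if record.get("bad"):
        raise ConnectionError("downstream rejected record")

for rec in [{"id": 1}, {"id": 2, "bad": True}]:
    send(rec, flaky_transport,
         on_success=lambda r: sent_count.update(n=sent_count["n"] + 1),
         on_failure=lambda r, e: retry_queue.append(r))
```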
In some embodiments, the state management component 203 provides unified management of the internal computing states and the data consumption sites (consumption offsets) for the data processing system 200.
The computing states include a first computing state and a second computing state. The first computing state characterizes data whose current computing result depends on a computing result at a first time before the current time; the second computing state characterizes data whose current computing result does not depend on the computing result at any earlier time. The incremental data is determined, among the data in the first computing state, as the data whose value changed within the preset time. The data consumption site characterizes the position of the data for which data processing has currently been completed. Developers only need to configure the storage state to achieve unified management of the computing state. The state management component 203 can be combined with the access interface in the pipeline construction component 202 to maintain consistency between the state of the data and the data consumption site during data processing.
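Determining the incremental data for the first computing state then reduces to comparing two snapshots and keeping only the entries whose value changed within the preset interval. A minimal sketch, assuming a hypothetical flat snapshot layout:

```python
def incremental(previous, current):
    """Among first-computing-state data (results that depend on earlier
    results), only entries whose value changed within the interval are
    treated as incremental data to persist at the next checkpoint."""
    return {k: v for k, v in current.items() if previous.get(k) != v}

# Snapshot at the previous checkpoint vs. the current one.
prev_snapshot = {"sum_a": 100, "sum_b": 50, "sum_c": 7}
curr_snapshot = {"sum_a": 130, "sum_b": 50, "sum_c": 9}
delta = incremental(prev_snapshot, curr_snapshot)
```

Only `delta` needs to be persisted, which is the efficiency gain the method claims for large-batch processing.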
If the current data is incremental data in the first computing state, the state management component 203 may save the computing state, storage structure, state descriptor, state backend, and storage path corresponding to the data. During data processing, when the processing of a piece of data is completed, or at every specified time interval, all current incremental data can be committed, in its post-processing data intermediate state, to the specified state backend through the checkpoint mechanism. If the data processing is restarted after an abnormality, the state management component 203 can automatically restore the last successfully saved data intermediate state from the state backend into the state storage structure according to the information in the data's state descriptor. At the same time, the information of the last data consumption site is passed to the corresponding input operator, thereby realizing the fault-tolerance mechanism and achieving resumption from the breakpoint.
In some embodiments, the domain model component is used to build a business model, generating business model files in a particular language, such as class files in the Java language, based on the business logic. Following a rich domain model, the data and the business logic can be encapsulated in the same class file, improving the reusability and maintainability of the code.
The standard processing logic in the domain model component 204 can be used to determine the attributes and types of the entities in the class file corresponding to the data according to the field and type corresponding to each piece of data in the data layer, and to determine the logical relationships of the entities according to the dependency relationships between fields, such as the lineage relationships between entities, thereby enabling tracing of the attribution and source of the data.
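Tracing lineage through field dependencies can be sketched as a recursive walk from a derived field back to its root source fields. The dependency table and field names below are hypothetical illustrations:

```python
# Hypothetical field-dependency table: derived field -> fields it uses.
dependencies = {
    "mid_price": ["bid", "ask"],
    "spread": ["bid", "ask"],
    "spread_pct": ["spread", "mid_price"],
}

def lineage(field, deps):
    """Trace a field back to its root source fields (data lineage):
    fields absent from the table are sources; derived fields recurse."""
    roots = set()
    for parent in deps.get(field, []):
        if parent in deps:
            roots |= lineage(parent, deps)
        else:
            roots.add(parent)
    return roots
```

Here `spread_pct` traces back through `spread` and `mid_price` to the source fields `bid` and `ask`, which is the attribution/source tracing described above.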
The data cleansing logic in the domain model component 204 can be used to delete repeated redundant data, normalize the format or values of the data, fill in missing data, or pre-process text in the data, such as deleting special symbols, deleting English characters, or normalizing letter case.
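A sketch of the cleansing steps just listed, assuming hypothetical record fields (`id`, `name`, `venue`): duplicates are dropped, missing values are filled from defaults, and special symbols are stripped with case normalized.

```python
import re

def cleanse(records, defaults):
    """Illustrative cleansing pass: deduplicate by id, fill missing
    values from defaults, strip special symbols, normalize case."""
    seen, out = set(), []
    for rec in records:
        key = rec.get("id")
        if key in seen:                      # delete repeated redundant data
            continue
        seen.add(key)
        # Supplement missing (None) values from the defaults.
        rec = {**defaults, **{k: v for k, v in rec.items() if v is not None}}
        # Delete special symbols and normalize case in the text field.
        rec["name"] = re.sub(r"[^\w\s]", "", rec["name"]).strip().lower()
        out.append(rec)
    return out

raw = [
    {"id": 1, "name": "  Bond-A! ", "venue": None},
    {"id": 1, "name": "  Bond-A! ", "venue": None},   # duplicate record
    {"id": 2, "name": "Bond B", "venue": "CFETS"},
]
clean = cleanse(raw, defaults={"venue": "UNKNOWN"})
```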
The data expansion logic in the domain model component 204 can be developed according to the actual requirements of the business, such as classifying, combining, filtering, aggregating, analyzing, summarizing, reasoning about, comparing, or deducing from the data. Developers only need to write an implementation class of the business processing interface to express the data expansion logic.
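Data expansion of the classify-and-summarize kind can be sketched as follows; the instrument/price record shape is an assumption for illustration only.

```python
from collections import defaultdict

def expand(records):
    """Expansion sketch: classify quotes by instrument, then summarize
    each group with a count and an average price."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec["instrument"]].append(rec["price"])
    return {
        inst: {"count": len(prices), "avg": sum(prices) / len(prices)}
        for inst, prices in groups.items()
    }

summary = expand([
    {"instrument": "T2303", "price": 100.0},
    {"instrument": "T2303", "price": 102.0},
    {"instrument": "T2306", "price": 98.0},
])
```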
The data processing system 200 improves both data processing efficiency and development efficiency, while facilitating business expansion and system maintenance. For example, during data processing, only the processing logic defined by the data-flow development library is required; the business processing interfaces can be assembled through the pipeline construction component 202, and the state of the data is handed to the state management component 203, which automatically maintains the consistency of state persistence and the recovery of data after a failure. If horizontal scaling is needed, it is only necessary to run multiple quotation-processing instances and set the state persistence backend to an external centralized storage database.
It should be noted that the description of the data processing system in the embodiment of the present application is similar to the description of the method embodiment above and has similar beneficial effects, so a detailed description is omitted. Technical details not described for the data processing system of the embodiment of the present application can be understood from the description of fig. 1.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
It should be understood that, in the various embodiments of the present application, the size of the sequence numbers of the processes does not imply an order of execution; the execution order of the processes should be determined by their functions and internal logic, and the sequence numbers should not constitute any limitation on the implementation of the embodiments of the present application.
The above are merely examples of the present application and are not intended to limit the scope of the present application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present application shall be included in the protection scope of the present application.

Claims (8)

1. A method of data processing, the method comprising:
carrying out data processing on each data layer in the data model based on the service logic file, and determining the data corresponding to each data layer;
generating check points at intervals of preset time; sending the checkpoints to input operators in the data model, each checkpoint flowing in a respective data layer with a data processing link; after the data layer receives the check point, pre-submitting the input operator and the processing results of the incremental data corresponding to each type of downstream operators of the input operator in sequence; if all operators are submitted successfully, formally submitting the processing results of the input operators and the incremental data corresponding to all downstream operators of the input operators, and performing persistence processing on all the incremental data corresponding to the data layer to obtain a data intermediate state; wherein the checkpoints are used to characterize signals that perform the persistence processing on incremental data in the data layer;
and responding to the data processing meeting a first condition, and carrying out data recovery processing on the data according to the data intermediate state.
2. The method of claim 1, wherein the business logic file comprises at least one of:
standard processing logic, data cleansing logic, and data expansion logic.
3. The method according to claim 1, wherein the data processing is performed on each data layer in the data model based on the service logic file, and determining the data corresponding to each data layer includes:
determining source data as data of a first data layer in the data model;
processing the data of the first data layer based on the data cleansing logic in the service logic file to obtain the data of the second data layer in the data model; the data of the first data layer corresponds to the data of the second data layer one by one;
and processing the data of the second data layer based on the data expansion logic in the service logic file to obtain the data of the third data layer in the data model.
4. The method of claim 1, wherein the performing persistence processing on the incremental data corresponding to the data layer according to the check point in the data layer at every preset time to obtain a data intermediate state includes:
determining a computing state corresponding to data in the data layer, wherein the computing state comprises a first computing state and a second computing state;
determining the data with the numerical value changed in the preset time as the incremental data in the data of the first calculation state;
wherein the first computational state characterizes the computational result of the data in dependence on the computational result of the data at a first instant of time; the second computational state characterizes the computational result of the data independent of the computational result of the data at any instant.
5. The method of claim 2, wherein the standard processing logic comprises:
determining the attribute and the type of an entity in a class file corresponding to the source data according to the field and the type corresponding to each piece of source data;
and determining the logic relationship of the entity according to the dependency relationship between each field.
6. The method of claim 1, wherein said performing data recovery processing on said data according to said data intermediate state in response to said data processing satisfying a first condition comprises:
acquiring a state descriptor in the data intermediate state;
acquiring a data offset and an intermediate result of the data when the data processing meets a first condition according to the state descriptor;
and carrying out data recovery processing on the data according to the data offset and the intermediate result.
7. A data processing system, the system comprising:
the pipeline building assembly is used for carrying out data processing on each data layer in the data model based on the service logic file and determining the data corresponding to each data layer;
the metadata management component is used for generating check points at intervals of preset time; sending the checkpoints to input operators in the data model, each checkpoint flowing in a respective data layer with a data processing link; after the data layer receives the check point, pre-submitting the input operator and the processing results of the incremental data corresponding to each type of downstream operators of the input operator in sequence; if all operators are submitted successfully, formally submitting the processing results of the input operators and the incremental data corresponding to all downstream operators of the input operators, and performing persistence processing on all the incremental data corresponding to the data layer to obtain a data intermediate state; wherein the checkpoints are used to characterize signals that perform the persistence processing on incremental data in the data layer;
and the state management component is used for responding to the data processing meeting the first condition and carrying out data recovery processing on the data according to the data intermediate state.
8. The system of claim 7, wherein the system further comprises:
and the domain model component is used for determining at least one business logic file of standard processing logic, data cleaning logic and data expansion logic.
CN202310909725.1A 2023-07-24 2023-07-24 Data processing method and system Active CN116662325B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310909725.1A CN116662325B (en) 2023-07-24 2023-07-24 Data processing method and system


Publications (2)

Publication Number Publication Date
CN116662325A CN116662325A (en) 2023-08-29
CN116662325B true CN116662325B (en) 2023-11-10

Family

ID=87712114

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310909725.1A Active CN116662325B (en) 2023-07-24 2023-07-24 Data processing method and system

Country Status (1)

Country Link
CN (1) CN116662325B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN117112302B (en) * 2023-08-30 2024-03-12 广州经传多赢投资咨询有限公司 Abnormal disaster recovery method, system, equipment and medium for financial data

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102147755A (en) * 2011-04-14 2011-08-10 中国人民解放军国防科学技术大学 Multi-core system fault tolerance method based on memory caching technology
CN106294357A (en) * 2015-05-14 2017-01-04 阿里巴巴集团控股有限公司 Data processing method and stream calculation system
CN109145023A (en) * 2018-08-30 2019-01-04 北京百度网讯科技有限公司 Method and apparatus for handling data
CN111125163A (en) * 2018-10-30 2020-05-08 百度在线网络技术(北京)有限公司 Method and apparatus for processing data
CN112199334A (en) * 2020-10-23 2021-01-08 东北大学 Method and device for storing data stream processing check point file based on message queue
CN112486639A (en) * 2019-09-12 2021-03-12 中兴通讯股份有限公司 Data saving and restoring method and device for task, server and storage medium
CN114692585A (en) * 2022-03-30 2022-07-01 上海幻电信息科技有限公司 Table service processing method and system
CN114896200A (en) * 2022-05-26 2022-08-12 浙江邦盛科技股份有限公司 Queue-based rapid persistence method for check point in bank flow computing service system

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060150010A1 (en) * 2005-01-03 2006-07-06 Stiffler Jack J Memory-controller-embedded apparatus and procedure for achieving system-directed checkpointing without operating-system kernel support
US7860863B2 (en) * 2007-09-05 2010-12-28 International Business Machines Corporation Optimization model for processing hierarchical data in stream systems
US10430298B2 (en) * 2010-10-28 2019-10-01 Microsoft Technology Licensing, Llc Versatile in-memory database recovery using logical log records


Non-Patent Citations (4)

* Cited by examiner, † Cited by third party
Title
Checkpointing and Recovery Mechanism in Grid; Janki Mehta et al.; 2008 16th International Conference on Advanced Computing and Communications; pp. 131-139 *
Co-Designing Multi-Level Checkpoint Restart for MPI Applications; Parasyris, K. et al.; 21st IEEE/ACM International Symposium on Cluster, Cloud and Internet Computing; pp. 103-112 *
Research on Incremental Checkpoint Setting and Rollback Recovery Techniques; Lu Pengfei; China Masters' Theses Full-text Database (Information Science and Technology); pp. I137-8 *
Research and Application of Outlier Data Detection and Failure Recovery in Stream Processing Networks; Zhang Xiaoqian; China Masters' Theses Full-text Database (Information Science and Technology); pp. I138-527 *



Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant