CN117076551A - Data storage method, device and equipment based on distributed data processing architecture - Google Patents

Data storage method, device and equipment based on distributed data processing architecture

Info

Publication number
CN117076551A
CN117076551A
Authority
CN
China
Prior art keywords
data
time window
deduplication
current time
database
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210503082.6A
Other languages
Chinese (zh)
Inventor
青超群
施雯洁
唐辉
刘帆
黄伟康
聂晓楠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Tencent Technology Chengdu Co Ltd
Original Assignee
Tencent Technology Chengdu Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Tencent Technology Chengdu Co Ltd filed Critical Tencent Technology Chengdu Co Ltd
Priority to CN202210503082.6A
Publication of CN117076551A
Legal status: Pending

Classifications

    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval of structured data, e.g. relational data
    • G06F16/27: Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval of structured data, e.g. relational data
    • G06F16/21: Design, administration or maintenance of databases
    • G06F16/215: Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2458: Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2462: Approximate or statistical queries
    • G: PHYSICS
    • G06: COMPUTING; CALCULATING OR COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20: Information retrieval of structured data, e.g. relational data
    • G06F16/24: Querying
    • G06F16/245: Query processing
    • G06F16/2458: Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2477: Temporal data queries

Abstract

The application discloses a data storage method, device and equipment based on a distributed data processing architecture, relating to the technical field of data processing. The method comprises the following steps: acquiring newly added data in a current time window; in a key-value database, comparing and deduplicating the newly added data against historical deduplication data to obtain deduplication data corresponding to the current time window; performing statistics on the deduplication data corresponding to the current time window to obtain a statistical result corresponding to the current time window, the statistical result indicating the statistics of a target index corresponding to the deduplication data; and storing the deduplication data and the statistical result corresponding to the current time window separately. By separating data deduplication from data statistics and storing the deduplication data and the statistical results separately, the application avoids the problem that newly added data cannot be deduplicated after a task is recovered, improving the accuracy of statistical results after task recovery.

Description

Data storage method, device and equipment based on distributed data processing architecture
Technical Field
The embodiment of the application relates to the technical field of data processing, in particular to a data storage method, device and equipment based on a distributed data processing architecture.
Background
Flink is a framework and distributed processing engine for stateful computation of unbounded and bounded stream data.
Currently, flink provides a Savepoint and Checkpoint mechanism to perform recovery of Flink tasks. The method comprises the steps that a snapshot is generated for the transient state of each Operator of the flank task periodically, then persistence storage is automatically carried out, and under the condition that the flank task needs to be restarted or closed due to factors such as task breakdown, server faults, network faults and software faults, the computing state of each Operator can be recovered from the snapshots again to continue to finish the flank task.
However, after the flank task is restored, the data is deduplicated from the newly added data, so that the statistics result is inaccurate.
Disclosure of Invention
The embodiment of the application provides a data storage method, a device and equipment based on a distributed data processing architecture, which can improve the accuracy and the continuity of a statistical result.
According to an aspect of an embodiment of the present application, there is provided a data storage method based on a distributed data processing architecture, the method including:
acquiring newly added data in a current time window;
in a key-value database, comparing and deduplicating the newly added data against historical deduplication data to obtain deduplication data corresponding to the current time window, where the historical deduplication data refers to the deduplication data corresponding to the time window previous to the current time window;
performing statistics on the deduplication data corresponding to the current time window to obtain a statistical result corresponding to the current time window, where the statistical result is used to indicate statistics of a target index corresponding to the deduplication data;
and separately storing the deduplication data and the statistical result corresponding to the current time window.
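The four claimed steps can be illustrated with a minimal, self-contained sketch (all names and data are hypothetical; a Python dict stands in for the key-value database and another for the relational store):

```python
# Illustrative sketch of the claimed four-step method. A dict stands in
# for the key-value database, a second dict for the relational store
# holding cumulative statistics per window.

kv_store = {"dedup:user_ids": set()}   # key-value DB: accumulated dedup data
stats_store = {}                       # relational DB: cumulative count per window

def process_window(window_id, new_records):
    """Dedup new records against history, count them, store separately."""
    history = kv_store["dedup:user_ids"]
    # Step 2: compare with historical dedup data; keep only unseen keys.
    fresh = {r for r in new_records if r not in history}
    history |= fresh                   # persist the merged dedup data
    # Step 3: statistics for this window (here: distinct-user count).
    window_count = len(fresh)
    # Step 4: store the cumulative statistic separately from dedup data.
    prev_total = stats_store.get(window_id - 1, 0)
    stats_store[window_id] = prev_total + window_count
    return window_count

process_window(1, ["u1", "u2", "u2", "u3"])  # 3 distinct new users
process_window(2, ["u2", "u4"])              # only u4 is new
```

Because the deduplication state and the per-window cumulative counts live in separate stores, losing operator state does not lose the deduplication history, which is the core idea of the claim.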
According to an aspect of an embodiment of the present application, there is provided a data storage device based on a distributed data processing architecture, the device comprising:
the newly added data acquisition module is used for acquiring the newly added data in the current time window;
the deduplication data acquisition module is used for comparing and deduplicating the newly added data against the historical deduplication data in the key-value database to obtain the deduplication data corresponding to the current time window; the historical deduplication data refers to the deduplication data corresponding to the time window previous to the current time window;
the statistical result acquisition module is used for performing statistics on the deduplication data corresponding to the current time window to obtain a statistical result corresponding to the current time window, where the statistical result is used to indicate statistics of a target index corresponding to the deduplication data;
and the data result storage module is used for separately storing the deduplication data and the statistical result corresponding to the current time window.
According to an aspect of an embodiment of the present application, there is provided a computer device including a processor and a memory, in which a computer program is stored, the computer program being loaded and executed by the processor to implement the above-described data storage method based on a distributed data processing architecture.
According to an aspect of an embodiment of the present application, there is provided a computer readable storage medium having stored therein a computer program loaded and executed by a processor to implement the above-described data storage method based on a distributed data processing architecture.
According to an aspect of embodiments of the present application, there is provided a computer program product or computer program comprising computer instructions stored in a computer readable storage medium. The processor of the computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions, so that the computer device performs the data storage method based on the distributed data processing architecture.
The technical scheme provided by the embodiment of the application can have the following beneficial effects:
by separating the data deduplication and data statistics tasks (namely, performing data deduplication in the key-value database and data statistics in the operators), and storing the deduplication data and the statistical results separately, the problem that newly added data cannot be deduplicated after a task restart due to the loss of historical deduplication data is avoided. The newly added data can therefore still be deduplicated, the accuracy of the deduplication data is ensured, and the accuracy of the statistical results is further improved.
In addition, during task recovery, each time window before the current business time is triggered in sequence. Because the deduplication data and statistical result corresponding to each historical time window can be obtained from the database, the deduplication data and statistical result of the current time window are guaranteed to be obtainable, the statistical results before the current time window are not lost, and the comprehensiveness of task recovery is improved.
In addition, during task recovery, the deduplication data and statistical results corresponding to all time windows from a specified business time to the current business time can be quickly restored from the database, improving the flexibility of task recovery.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present application, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present application, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
FIG. 1 is a schematic illustration of an implementation environment for an embodiment of the present application;
FIG. 2 is a schematic diagram of a server side of a distributed data processing architecture according to one embodiment of the present application;
FIG. 3 is a schematic diagram of a data storage method based on a distributed data processing architecture according to one embodiment of the present application;
FIG. 4 is a flow chart of a data processing portion provided by one embodiment of the present application;
FIG. 5 is a flow chart of a checkpoint triggering portion provided by one embodiment of the present application;
FIG. 6 is a flow chart of a task restoration portion provided by an embodiment of the present application;
FIG. 7 is a block diagram of a data storage device based on a distributed data processing architecture provided in accordance with one embodiment of the present application;
FIG. 8 is a block diagram of a data storage device based on a distributed data processing architecture provided in accordance with another embodiment of the present application;
FIG. 9 is a block diagram of a computer device provided in one embodiment of the application.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present application more apparent, the embodiments of the present application will be described in further detail with reference to the accompanying drawings.
Referring to fig. 1, a schematic diagram of an implementation environment of an embodiment of the present application is shown. The implementation environment of the scheme can be realized as a distributed data processing architecture. The implementation environment may include: a client 10 and a server 20.
The client 10 is the client corresponding to the distributed data processing architecture, through which a user may submit a task (also referred to as a Job) to the server corresponding to the architecture. The task may be any real-time computing task, such as a real-time cumulative deduplication task, a real-time intelligent recommendation task, a complex event processing task, a real-time fraud detection task, or a streaming-data real-time analysis task. Optionally, the client 10 may be installed in a terminal, which may be an electronic device such as a mobile phone, a tablet computer, a PC (Personal Computer), a wearable device, or a smart robot.
The server 20 is the server side corresponding to the distributed data processing architecture and uses a distributed master-slave architecture. For example, the server 20 includes a master node responsible for task scheduling, computing resource management, checkpoint creation, etc., and slave nodes for executing tasks. The server 20 requires computing resources to execute applications. It integrates with all common cluster resource managers and may also be run as a standalone cluster. When an application is deployed, the server 20 automatically identifies the required resources according to the application's configured parallelism and requests them from the resource manager. Optionally, the server 20 may be deployed on a server, which may be a server cluster, a distributed system formed by multiple physical servers, or a cloud server that provides cloud computing services.
Communication between the client 10 and the server 20 may be via a network 30.
The technical scheme provided by the embodiments of the present application is suitable for any scenario requiring stream data processing, such as financial transaction data processing, internet order data processing, game data processing, location data processing, sensor signal processing, terminal-generated data processing, server log data processing, communication signal data processing, and the like. The technical scheme provided by the embodiments of the present application can improve the accuracy of statistical results after task recovery.
In some embodiments, the data processing procedure based on the distributed data processing architecture can be abstracted into the following three steps: 1. Receive one or more data sources. 2. Apply the transformation operators desired by the user. 3. Output the transformed results. Illustratively, referring to fig. 2, the server 20 may include: a data source layer 201, a task layer 202, a state and result management layer 203, a backup and detection layer 204, and a persistence layer 205.
The data source layer 201 is configured to obtain one or more data sources from an associated resource manager. A data source may refer to stream data composed of events, such as stream data formed by a user's operational events on an application. The stream data may refer to data generated by terminals, server log data, internet order data, and the like, which is not limited by the embodiments of the present application. For example, the stream data may come from an MQ (Message Queue), such as Kafka or RocketMQ.
The task layer 202 is used for processing tasks. Optionally, the task layer 202 may process multiple real-time tasks in parallel. The real-time tasks may refer to real-time cumulative deduplication tasks, real-time intelligent recommendation tasks, complex event processing tasks, real-time fraud detection tasks, streaming-data real-time analysis tasks, and the like. Taking the real-time cumulative deduplication task as an example, it refers to a task that cumulatively deduplicates newly added data in real time against historical data and counts the value of a target index; such a task can run at second-level granularity within the day (i.e., computing and outputting a value every second), e.g., today's real-time visitor count, today's real-time order-placing user count, and today's real-time paying user count. The embodiments of the present application do not limit the target index, which includes, for example but not limited to, the number of visitors, the number of orders, the number of paying users, and the like.
Optionally, the task layer 202 may execute tasks based on time windows. A time window is a time range for real-time computation; when the computation of a time window finishes, the statistical result of that window is output to the downstream computation. For example, with the time window set to 1 minute, a statistical result is output every minute, and the statistical result of the next time window is computed based on that of the previous time window, making it convenient to plot a line chart of the statistical results.
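As a hedged illustration of the tumbling-window mechanics described above (the window size and function names are editorial assumptions, not taken from the patent):

```python
# Map an event timestamp to its tumbling time window; when business time
# crosses a window boundary, that window's statistics would be emitted.
WINDOW_SIZE_MS = 60_000  # illustrative 1-minute window

def window_of(event_ts_ms: int) -> tuple[int, int]:
    """Return the [start, end) bounds of the window containing the timestamp."""
    start = event_ts_ms - (event_ts_ms % WINDOW_SIZE_MS)
    return (start, start + WINDOW_SIZE_MS)

window_of(125_000)  # an event 125 s into the stream falls in the third window
```

Every timestamp in `[120_000, 180_000)` maps to the same window, which is what lets the architecture trigger one deduplication and one statistics pass per window.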
The state and result management layer 203 is configured to manage the state data and the statistics corresponding to the time window. The state data may include states of respective operators corresponding to the distributed data processing architecture, that is, intermediate statistics of the respective operators on newly added data in the time window. Alternatively, the sum of the statistics corresponding to the current time window and the historical statistics may be used as the final statistics under the current time window. For example, taking the day as an example, the final statistics corresponding to each time window may refer to the cumulative statistics from day 0 to the current business time.
Alternatively, the state and result management layer 203 may be configured to send the state data and statistics, respectively, to different databases in the persistence layer 205 for persistence storage.
The backup and detection layer 204 is configured to manage snapshot data corresponding to checkpoints. In embodiments of the present application, the snapshot data may include the consumption state of the data source (i.e., the offset of the consumed data), watermarks, time windows, and the like. This reduces the volume of the snapshot data, avoiding overly large snapshots or storage failures, and in turn avoiding long breakpoint-recovery times or unrecoverable snapshots. Breakpoint recovery refers to recovering, after a running task is restarted or shut down due to factors such as a task crash or server fault, from the last normal time node (i.e., checkpoint) before the failure.
The backup and detection layer 204 is further configured to monitor and manage the task and to alert the user in case of a task failure. For example, when the task fails due to factors such as a task crash, server fault, network fault, or software fault, the backup and detection layer 204 alerts the user until the task layer 202 operates normally again.
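A minimal sketch of the lightweight snapshot described for the backup and detection layer, assuming only the offset, watermark, and window are captured (the field names are hypothetical):

```python
def take_snapshot(offset: int, watermark_ts: int, window_start: int) -> dict:
    """Capture only the data-source offset, watermark, and current window;
    bulky operator state stays in the external key-value database."""
    return {"offset": offset, "watermark": watermark_ts,
            "window_start": window_start}

# A snapshot taken at an illustrative checkpoint:
snap = take_snapshot(offset=1042, watermark_ts=1180, window_start=1200)
```

Keeping operator state out of the snapshot is what keeps checkpoints small and breakpoint recovery fast, as the paragraph above argues.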
The persistence layer 205 is configured to persistently store the deduplication data, the statistical results, the state data, and the snapshot data. In one example, the persistence layer 205 may include an HDFS database, a MySQL database, a Redis database, an MQ database, and a RocksDB database. The Redis database may be an external database of the distributed data processing architecture, that is, not located in the memory corresponding to an operator of the architecture (i.e., not in program memory). The Redis database can be used to store state data and deduplication data; as an in-memory KV database, it offers fast reads and queries and can deduplicate data using its native set structure, improving the convenience of deduplication.
The MySQL database may be used to store the statistical results, i.e., the statistics of the target index that the user wants to obtain. By adopting MySQL, transactions can be guaranteed during data writes, and this part of the data can be combined with a platform for visualization.
The RocksDB database can be used to store snapshot data. As an on-disk KV database, RocksDB can retain data for longer than the Redis database, and in a distributed data processing architecture it is more commonly used and easy to access.
The MQ database may be used to store the data sources. Optionally, HDFS may also be used to store part of the snapshot data.
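The store-per-data-kind mapping described across this section can be summarized as a routing table (the table itself is an editorial summary for illustration, not part of the patent):

```python
# Which persistence store each kind of data goes to, per the text above.
PERSISTENCE_ROUTING = {
    "dedup_data":  "Redis",    # in-memory KV: fast reads, native set dedup
    "state_data":  "Redis",    # operator state, external to program memory
    "statistics":  "MySQL",    # relational: transactional writes, BI-friendly
    "snapshot":    "RocksDB",  # on-disk KV: longer retention, commonly used
    "data_source": "MQ",       # e.g. Kafka / RocketMQ stream
}

def store_for(record_kind: str) -> str:
    """Look up the persistence store for a kind of record."""
    return PERSISTENCE_ROUTING[record_kind]
```

The point of the routing is separation of concerns: losing any one store does not take the others with it, which underpins the recovery guarantees claimed later.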
The data storage method based on the distributed data processing architecture will be described in detail below in connection with the implementation environment of the above-described scheme.
Referring to fig. 3, a flowchart of a data storage method based on a distributed data processing architecture according to an embodiment of the present application is shown, where the execution subject of each step of the method may be the distributed data processing architecture in the implementation environment of the embodiment shown in fig. 1, and the method may include the following steps (301 to 304).
Step 301, acquiring the newly added data in the current time window.
The time window is a time range for real-time computation, and can be set and adjusted according to actual usage requirements; for example, the time window may be set at the second level, minute level, etc. The current time window is the time window corresponding to the current business time, where the current business time is the natural (wall-clock) time. In the embodiments of the present application, the time window is used to execute tasks corresponding to the distributed data processing architecture; for example, a deduplication operation and a statistics operation are performed on the newly added data within each time window. The distributed data processing architecture may be, for example, Flink.
Optionally, each time window corresponds to deduplication data, statistics, and intermediate state data. The deduplication data is data obtained by performing deduplication comparison on newly added data and historical deduplication data. The statistical result is data obtained by statistics of a target index corresponding to the deduplication data. Intermediate state data may refer to state data corresponding to each operator executing a task in the time window, that is, intermediate statistics (i.e., statistics corresponding to a certain time window).
It should be noted that, as described below, the relational database (such as the MySQL database) stores the accumulated statistics, while the key-value database stores the intermediate statistics.
For example, if the time window is set to 1 second and the current business time is t, the data newly added from t-1 to t is taken as the newly added data in the current time window, and at time t the deduplication operation and statistics operation corresponding to the current time window are triggered.
Step 302, comparing and deduplicating newly added data with historical deduplication data in a key value database to obtain deduplication data corresponding to a current time window; the historical deduplication data refers to deduplication data corresponding to a time window previous to the current time window.
A key-value database is a database that stores data in key-value pairs; it can be used for persistent storage of data as well as for fast reads and queries. Optionally, the key-value database may be a database such as a Redis database or a Memcached database.
Optionally, the deduplication data corresponding to each time window in the key-value database refers to accumulated deduplication data, that is, the data obtained after comparing and deduplicating all data from the start of task execution up to the current time window. The key-value database may perform the deduplication operation using its native set structure. For example, the key-value database is provided with an application with a deduplication function, which compares the newly added data with the historical deduplication data to obtain the deduplication data corresponding to the current time window: it judges whether the keys of the newly added data are repeated in the historical deduplication data to determine whether repeated data exist, removes the repeated data from the newly added data to obtain the remaining data, stores the remaining data in the key-value database together with the historical deduplication data, and takes the result as the deduplication data corresponding to the current time window. The repeated data may be data uploaded repeatedly or invalid data repeated within a short time, which is not limited in the embodiments of the present application.
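The set-based deduplication attributed to the key-value database can be sketched with a stand-in that mirrors the semantics of Redis's SADD command (which returns the number of members that were genuinely new); the class and key names here are hypothetical:

```python
# Minimal stand-in for set-based dedup in a key-value database,
# mirroring Redis SADD semantics.

class KVSet:
    def __init__(self):
        self._sets = {}

    def sadd(self, key, *members):
        """Add members to the set at `key`; return how many were new."""
        s = self._sets.setdefault(key, set())
        before = len(s)
        s.update(members)
        return len(s) - before

db = KVSet()
db.sadd("dedup:day1", "u1", "u2")          # both users are new
added = db.sadd("dedup:day1", "u2", "u3")  # only u3 is new
```

The return value directly tells the operator how many records survived deduplication, which is exactly the per-window increment the statistics step needs.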
In one example, before the newly added data in the current window is sent to the key-value database, it can be further pre-processed to reduce its data volume and thus the comparison volume, improving task execution efficiency. This can also mitigate the data skew that may occur in real-time computation. The specific process can be as follows:
pre-aggregate the newly added data corresponding to the current window to obtain pre-aggregated newly added data, whose dimensionality is smaller than that of the original newly added data; pre-deduplicate the pre-aggregated data to obtain adjusted newly added data, whose data volume is smaller than that of the original newly added data. The adjusted newly added data is then compared and deduplicated against the historical deduplication data to obtain the deduplication data corresponding to the current time window.
For example, suppose the newly added data in the current window includes 20 pieces of user data. The newly added data is pre-aggregated to obtain data corresponding to 5 users respectively; pre-deduplication is then performed on those per-user data to obtain the adjusted newly added data; finally, the adjusted newly added data is sent to the key-value database to be compared and deduplicated against the historical deduplication data, yielding the deduplication data corresponding to the current time window. Taking user-count statistics as an example, if a first user does not exist in the historical deduplication data, the first user is counted into the historical cumulative result; if the first user already exists, the data corresponding to the first user can be discarded.
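The pre-aggregation step can be sketched as follows, assuming per-event records keyed by user id (the field names and the aggregation function are illustrative assumptions):

```python
# Collapse per-event records to one record per user before they are sent
# to the key-value database, shrinking the comparison volume.
from collections import defaultdict

def pre_aggregate(events):
    """Group raw (user_id, amount) events; one aggregated record per user."""
    per_user = defaultdict(int)
    for user_id, amount in events:
        per_user[user_id] += amount
    return dict(per_user)

events = [("u1", 5), ("u2", 3), ("u1", 7), ("u3", 1), ("u2", 2)]
aggregated = pre_aggregate(events)  # 5 events collapse to 3 user records
```

Only the 3 aggregated records, rather than all 5 raw events, would then be compared against the historical deduplication data.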
Step 303, counting the deduplication data corresponding to the current time window to obtain a statistics result corresponding to the current time window, where the statistics result is used to indicate statistics of a target index corresponding to the deduplication data.
The embodiments of the present application do not limit the target index, which can be set and adjusted according to actual usage requirements. Illustratively, in an internet order processing scenario, the target index may refer to the number of visitors, the number of orders, the number of paying users, etc.; in a real-time intelligent recommendation task, the target index may refer to the number of click hits, the number of conversions, the dwell duration, and the like; in a game data processing scenario, the target index may refer to the number of online players, the number of virtual character deaths, the number of recharging players, and the like.
In this step, the deduplication data corresponding to the current time window refers to the data after the newly added data in the current time window is deduplicated. The statistical result corresponding to the current time window is the statistical result corresponding to the newly added data in the current time window.
Step 304, separately storing the deduplication data and the statistical result corresponding to the current time window.
Optionally, the deduplication data corresponding to the current time window may be stored in the key-value database; the statistical result corresponding to the current time window is stored as state data in the key-value database; and the sum of the statistical result corresponding to the current time window and the historical statistical result is stored in a relational database. The historical statistical result refers to the statistical result corresponding to the time window previous to the current time window.
The sum of the deduplication data corresponding to the current time window and the historical deduplication data can be stored in the key-value database, so that the deduplication data is persistently stored. The relational database guarantees transactions during data writes, which facilitates visualizing the statistical results in combination with a BI (Business Intelligence) platform. Optionally, the relational database may be a MySQL database, an Oracle database, a DB2 database, or the like.
The state data may refer to the statistics corresponding to each operator for the current time window. Optionally, the state data may further include custom variables, newly added data, and the like corresponding to the operators. Optionally, the state data may be stored in the key-value database. Optionally, each key of the state data and the deduplication data may be constructed from the UID (unique identifier) of the corresponding operator.
Optionally, the key-value database is an external database of the distributed data processing architecture, that is, it is not located in the memory corresponding to an operator of the architecture (i.e., not in program memory). This prevents massive data from occupying most of the program's memory and avoids the impact of massive state data on the storage of snapshot data.
In one example, in response to a checkpoint corresponding to the distributed data processing architecture being triggered, the snapshot data corresponding to the checkpoint is obtained; the snapshot data includes the consumption state, watermark, and time window corresponding to the stream data; the snapshot data corresponding to the checkpoint is then sent to a fast-read database.
A checkpoint may refer to a point in time triggered periodically by the distributed data processing architecture. When a checkpoint is triggered, a snapshot of the transient state of each operator in the task at that business time is obtained; the snapshot may include the snapshot data described above. The stream data includes the newly added data corresponding to the current time window, the consumption situation indicates the offset of the stream data, the watermark corresponds to a timestamp of the stream data, and the time window is the time window corresponding to the checkpoint.
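The snapshot described here can be pictured as a small record like the following (an illustrative shape, not an actual API; since the bulky state lives in the external key-value database, only the source offset, watermark, and window need capturing):

```python
from dataclasses import dataclass

# Illustrative checkpoint snapshot: consumption situation (offset),
# watermark, and the time window the checkpoint falls in.
@dataclass
class CheckpointSnapshot:
    offset: int        # consumption position in the stream data
    watermark: int     # event-time watermark at checkpoint time
    window_start: int  # start of the time window at the checkpoint

snap = CheckpointSnapshot(offset=10, watermark=1700000059,
                          window_start=1700000000)
```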
The fast-read database is bundled with the operators of the distributed data processing architecture, i.e., it is a built-in database, and it may be used to persist snapshot data. Optionally, the fast-read database may be, for example, a RocksDB database, a LevelDB database, an HDFS database, or the like. Optionally, the key for each piece of snapshot data may be constructed from the operator's UID (User Identification).
In one example, in the event that a task corresponding to the distributed data processing architecture fails, a latest checkpoint may be obtained from the fast read database, along with snapshot data corresponding to the latest checkpoint; determining a target time window according to snapshot data corresponding to the latest check point; in the key value database, comparing the newly added data corresponding to the target time window with the de-duplication data corresponding to the last time window of the target time window to obtain de-duplication data corresponding to the target time window; counting the de-duplication data corresponding to the target time window to obtain a counting result corresponding to the target time window; and sequentially acquiring statistical results corresponding to each time window from the target time window to the current service time.
Optionally, the latest checkpoint may refer to the historical checkpoint closest to the point of failure. The consumption situation of the stream data can be determined from the snapshot data corresponding to the latest checkpoint, which gives the offset of the stream data; combined with the watermark and the time window in the snapshot data, the target time window can then be determined. The target time window is the first time window for which statistical results need to be acquired after task recovery. The previous time window of the target time window is the latest time window completed before the point of failure. The state data of that previous time window can then be read from the key-value database, and each operator in the distributed data processing architecture is restored to the computation state corresponding to that state data, providing a basis for the deduplication and statistics operations of the target time window. Finally, combined with the accumulated deduplication data and the accumulated statistical result of the previous time window of the target time window, the statistical result of the target time window is acquired.
For example, suppose time window 1 consumes data with offsets 8-10, the current business time falls in time window 2, and time window 2 currently corresponds to data starting at offset 11; a checkpoint A is taken at the end of time window 1, recording offset 10. If the point of failure occurs in time window 2, then when the task is restarted, time window 1 is determined from the snapshot data of checkpoint A, and the state of each operator in the task is restored from the state data corresponding to time window 1. After the task is restored, upon detecting that the offset corresponding to checkpoint A is 10, the task is executed from the time window whose data starts at offset 11 (namely, time window 2); the accumulated deduplication data and accumulated statistical results of each time window from time window 2 to the current business time are acquired in sequence and stored in the key-value database and the relational database, respectively.
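The offset arithmetic in this example can be sketched as follows (window boundaries and offsets are the illustrative values above, not part of any real API):

```python
# Checkpoint A recorded offset 10, the last offset of time window 1, so
# recovery resumes from offset 11, making time window 2 the target window.
def target_window(checkpoint_offset, window_of_offset):
    return window_of_offset(checkpoint_offset + 1)

# In the example, offsets 8-10 belong to window 1 and offset 11 to window 2.
window_of_offset = lambda off: 1 if off <= 10 else 2
target = target_window(10, window_of_offset)
```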
Because the state data and the statistical results corresponding to the checkpoint are stored outside the distributed data processing architecture, the deduplication and accumulation operations after task recovery are convenient to perform, and the risk of losing state data and statistical results is avoided. At the same time, the distributed data processing architecture uses event time, and the time window is also an event-time window, so recovery starts from the target time window. For example, if the target time window and the current latest window are 10 time windows apart, all 10 time windows are triggered as the task consumes data, and the statistical result of none of them is missed. Therefore, after the task recovers from a failure, the application can still present complete, continuous data, achieving lossless breakpoint recovery.
In summary, in the technical solution provided by the embodiment of the present application, the data deduplication and data statistics tasks are separated (namely, data deduplication is performed with the key-value database, and data statistics is performed with the operators), and the deduplication data and the statistical results are stored separately. This avoids the problem that new data cannot be deduplicated because historical deduplication data is lost after the task is restarted, so that new data can be deduplicated, the accuracy of the deduplication data is guaranteed, and the accuracy of the statistical results is further improved.
In addition, during task recovery, each time window before the current business time is triggered in sequence. Because the deduplication data and statistical result corresponding to each historical time window can be obtained from the databases, the deduplication data and statistical result of the current time window are guaranteed to be obtainable, so the statistical results before the current time window remain continuous and are not lost, improving the completeness of task recovery.
In addition, during task recovery, the deduplication data and statistical results corresponding to all time windows from a specified business time to the current business time can be quickly recovered from the databases, improving the flexibility of task recovery.
In addition, by storing the state data and the snapshot data separately, the snapshot data does not grow larger as the task progresses, and the storage of snapshot data will not fail under the influence of massive data (namely, the state data). This solves the problem of task recovery taking a long time or failing altogether, improving the recovery efficiency and recovery stability of the task.
In an exemplary embodiment, taking Flink as the distributed data processing architecture and a real-time accumulated deduplication task as the task, the data storage method based on a distributed data processing architecture provided by the embodiment of the present application is described. The specific content may be as follows:
referring to fig. 5 to 7, the technical solution provided by the embodiment of the present application may be divided into three parts: a data processing part 400, a checkpoint triggering part 500, and a task recovery part 600.
1. Referring to fig. 5, the contents corresponding to the data processing part 400 may be as follows:
stream data is acquired, and watermark updating is carried out on the stream data based on natural time.
After a time window is triggered, pre-aggregation and pre-deduplication of the newly added data of the time window are first performed inside Flink to obtain adjusted data corresponding to the time window, so that the data volume corresponding to the time window is reduced as much as possible. The adjusted data corresponding to the time window is then compared and deduplicated against the historical deduplication data in the external Redis database to obtain the deduplication data corresponding to the time window, and statistics are performed on this deduplication data to obtain the statistical result corresponding to the time window.
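A minimal sketch of the per-window deduplication step (pre-aggregation is omitted for brevity, and a Python set stands in for the Redis-held historical deduplication data):

```python
# Pre-deduplicate inside the window, then diff against the historical
# dedup set to obtain the window's fresh dedup data and the updated
# accumulated set, mirroring the Flink + Redis flow described above.
def window_dedup(new_records, history):
    pre_deduped = set(new_records)   # pre-deduplication within the window
    fresh = pre_deduped - history    # compare with historical dedup data
    return fresh, history | fresh    # per-window dedup data, accumulated dedup data

fresh, cumulative = window_dedup(["u2", "u3", "u3"], {"u1", "u2"})
```

With an actual Redis deployment, the same diff could be expressed with set commands (e.g. SADD returning the number of newly inserted members), keeping the heavy set outside Flink's own state.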
After the statistical result corresponding to the current time window is obtained, the accumulated deduplication data (i.e., the sum of the deduplication data corresponding to the current time window and the historical deduplication data), the accumulated statistical result (i.e., the sum of the statistical result corresponding to the current time window and the historical statistical result), and the state data (i.e., the statistical result corresponding to the current time window) need to be updated: first the accumulated deduplication data is stored in the Redis database, then the state data is stored in the Redis database, and finally the accumulated statistical result is stored in the MySQL database.
And completing the calculation task of the time window and waiting for triggering of the next time window.
2. Referring to fig. 5, the contents corresponding to the checkpoint triggering section 500 may be as follows:
in the stream computation process of Flink (namely, during data processing), snapshot data corresponding to a checkpoint is obtained in response to the timed trigger of the checkpoint. Since the state data and statistical results are no longer retained in the task, the snapshot data of the computation task is very small, mainly comprising the consumption situation of the data source, the watermark, and the time window. The snapshot data may be stored in the RocksDB database, which is safer than memory and recovers faster than the HDFS database.
RocksDB is a very common persistent key-value store, and it is also built into our internal big-data platform, so switching to it is convenient. The three approaches currently used to save checkpoints (memory, the RocksDB database, the HDFS database) compare as follows:
storage and read speed: memory > RocksDB database > HDFS database;
massive-data support: HDFS database > RocksDB database > memory;
data reliability: HDFS database = RocksDB database > memory.
since the bulky state data and statistical results have already been saved outside Flink in the data processing part 400, the snapshot data corresponding to a checkpoint is small overall and does not grow over time, so we finally decided to use the RocksDB database to save the snapshot data corresponding to checkpoints.
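For reference, a Flink deployment along these lines would typically select the RocksDB state backend in its configuration. The options below are standard Flink settings, though the checkpoint interval and directory here are placeholders:

```yaml
# flink-conf.yaml fragment (interval and path are illustrative)
state.backend: rocksdb
state.checkpoints.dir: file:///tmp/flink-checkpoints
execution.checkpointing.interval: 60s
```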
3. The content corresponding to the task restoring section 600 may be as follows:
with the data processing part 400 and the checkpoint triggering part 500 in place, we can easily complete the task recovery part 600. If the Flink task fails, Flink automatically acquires the snapshot data corresponding to the latest checkpoint from the RocksDB database and recovers the task to the previous time window of the target time window; after task recovery, it automatically acquires the state data and deduplication data from the Redis database and the statistical results from the MySQL database, and recovers the statistical results from the target time window up to the current business time.
If the Flink task has not failed, it simply continues to be monitored.
In summary, in the technical solution provided by the embodiment of the present application, the data deduplication and data statistics tasks are separated (namely, data deduplication is performed with the key-value database, and data statistics is performed with the operators), and the deduplication data and the statistical results are stored separately. This avoids the problem that new data cannot be deduplicated because historical deduplication data is lost after the task is restarted, so that new data can be deduplicated, the accuracy of the deduplication data is guaranteed, and the accuracy of the statistical results is further improved.
In addition, during task recovery, each time window before the current business time is triggered in sequence. Because the deduplication data and statistical result corresponding to each historical time window can be obtained from the databases, the deduplication data and statistical result of the current time window are guaranteed to be obtainable, so the statistical results before the current time window remain continuous and are not lost, improving the completeness of task recovery.
In addition, during task recovery, the deduplication data and statistical results corresponding to all time windows from a specified business time to the current business time can be quickly recovered from the databases, improving the flexibility of task recovery.
The following are examples of the apparatus of the present application that may be used to perform the method embodiments of the present application. For details not disclosed in the embodiments of the apparatus of the present application, please refer to the embodiments of the method of the present application.
Referring to FIG. 7, a block diagram of a data storage device based on a distributed data processing architecture is shown, provided in accordance with one embodiment of the present application. The device has the function of realizing the method example, and the function can be realized by hardware or can be realized by executing corresponding software by hardware. The apparatus may be the computer device described above or may be provided in a computer device. As shown in fig. 7, the apparatus 700 includes: a new data acquisition module 701, a deduplication data acquisition module 702, a statistics acquisition module 703, and a data result storage module 704.
The new data obtaining module 701 is configured to obtain new data in the current time window.
The deduplication data obtaining module 702 is configured to compare the newly added data with the historical deduplication data in the key value database, and obtain deduplication data corresponding to the current time window; the historical deduplication data refers to deduplication data corresponding to a time window previous to the current time window.
The statistics result obtaining module 703 is configured to perform statistics on the deduplication data corresponding to the current time window, and obtain a statistics result corresponding to the current time window, where the statistics result is used to indicate statistics of a target index corresponding to the deduplication data.
And the data result storage module 704 is configured to store the deduplication data and the statistics result corresponding to the current time window separately.
In an exemplary embodiment, the data result storage module 704 is configured to:
storing the duplicate removal data corresponding to the current time window into the key value pair database;
storing the statistical result corresponding to the current time window as state data into the key value pair database;
storing the sum value between the statistical result corresponding to the current time window and the historical statistical result into a relational database; the historical statistical result refers to a statistical result corresponding to a time window above the current time window.
In one exemplary embodiment, the key value database is an external database of the distributed data processing architecture.
In an exemplary embodiment, as shown in fig. 8, the apparatus 700 further includes: the checkpoint triggers the module 705.
And the checkpoint triggering module 705 is configured to, in response to triggering a checkpoint corresponding to the distributed data processing architecture, obtain snapshot data corresponding to the checkpoint, where the snapshot data includes a consumption situation, a watermark, and a time window corresponding to the stream data.
The data result storage module 704 is further configured to store snapshot data corresponding to the checkpoint in a fast reading database.
In an exemplary embodiment, as shown in fig. 8, the apparatus 700 further includes: a data pre-aggregation module 706 and a data pre-deduplication module 707.
The data pre-aggregation module 706 is configured to pre-aggregate the new data corresponding to the current window to obtain pre-aggregated new data, where a dimension of the pre-aggregated new data is smaller than a dimension of the new data.
The data pre-deduplication module 707 is configured to pre-deduplicate the new data after pre-aggregation to obtain adjusted new data, where the data amount of the adjusted new data is smaller than the data amount of the new data.
And comparing the adjusted newly added data with the historical deduplication data to obtain deduplication data corresponding to the current time window.
In an exemplary embodiment, as shown in fig. 8, the apparatus 700 further includes: a data recovery module 708 and a target window determination module 709.
The data recovery module 708 is configured to obtain, in case of a task failure corresponding to the distributed data processing architecture, a latest checkpoint from a fast read database, and snapshot data corresponding to the latest checkpoint.
The target window determining module 709 is configured to determine a target time window according to the snapshot data corresponding to the latest checkpoint.
And the de-duplication data obtaining module 702 is configured to compare, in the key value database, newly added data corresponding to the target time window with de-duplication data corresponding to a time window previous to the target time window, and obtain de-duplication data corresponding to the target time window.
The statistics result obtaining module 703 is further configured to perform statistics on the deduplication data corresponding to the target time window, so as to obtain a statistics result corresponding to the target time window.
The statistics result obtaining module 703 is further configured to sequentially obtain statistics results corresponding to each time window from the target time window to the current service time.
In summary, in the technical solution provided by the embodiment of the present application, the data deduplication and data statistics tasks are separated (namely, data deduplication is performed with the key-value database, and data statistics is performed with the operators), and the deduplication data and the statistical results are stored separately. This avoids the problem that new data cannot be deduplicated because historical deduplication data is lost after the task is restarted, so that new data can be deduplicated, the accuracy of the deduplication data is guaranteed, and the accuracy of the statistical results is further improved.
In addition, during task recovery, each time window before the current business time is triggered in sequence. Because the deduplication data and statistical result corresponding to each historical time window can be obtained from the databases, the deduplication data and statistical result of the current time window are guaranteed to be obtainable, so the statistical results before the current time window remain continuous and are not lost, improving the completeness of task recovery.
In addition, during task recovery, the deduplication data and statistical results corresponding to all time windows from a specified business time to the current business time can be quickly recovered from the databases, improving the flexibility of task recovery.
It should be noted that, in the apparatus provided in the foregoing embodiment, when implementing the functions thereof, only the division of the foregoing functional modules is used as an example, in practical application, the foregoing functional allocation may be implemented by different functional modules, that is, the internal structure of the device is divided into different functional modules, so as to implement all or part of the functions described above. In addition, the apparatus and the method embodiments provided in the foregoing embodiments belong to the same concept, and specific implementation processes of the apparatus and the method embodiments are detailed in the method embodiments and are not repeated herein.
Referring to fig. 9, a block diagram of a computer device according to an embodiment of the present application is shown. The computer device may be used to implement the data storage method based on a distributed data processing architecture provided in the above embodiments. Specifically, the following may be included.
The computer device 900 includes a central processing unit (such as a CPU (Central Processing Unit), a GPU (Graphics Processing Unit), an FPGA (Field Programmable Gate Array), and the like) 901, a system memory 904 including a RAM (Random-Access Memory) 902 and a ROM (Read-Only Memory) 903, and a system bus 905 connecting the system memory 904 and the central processing unit 901. The computer device 900 also includes a basic input/output system (Input Output System, I/O system) 906, which helps to transfer information between the various devices within the server, and a mass storage device 907 for storing an operating system 913, application programs 914, and other program modules 915.
The basic input/output system 906 includes a display 908 for displaying information and an input device 909, such as a mouse, keyboard, or the like, for user input of information. Wherein the display 908 and the input device 909 are connected to the central processing unit 901 via an input output controller 910 connected to the system bus 905. The basic input/output system 906 may also include an input/output controller 910 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus. Similarly, the input-output controller 910 also provides output to a display screen, a printer, or other type of output device.
The mass storage device 907 is connected to the central processing unit 901 through a mass storage controller (not shown) connected to the system bus 905. The mass storage device 907 and its associated computer-readable media provide non-volatile storage for the computer device 900. That is, the mass storage device 907 may include a computer readable medium (not shown) such as a hard disk or CD-ROM (Compact Disc Read-Only Memory) drive.
Without loss of generality, the computer readable medium may include computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media includes RAM, ROM, EPROM (Erasable Programmable Read-Only Memory), EEPROM (Electrically Erasable Programmable Read-Only Memory), flash Memory or other solid state Memory technology, CD-ROM, DVD (Digital Video Disc, high density digital video disc) or other optical storage, magnetic cassettes, magnetic tape, magnetic disk storage or other magnetic storage devices. Of course, those skilled in the art will recognize that the computer storage medium is not limited to the ones described above. The system memory 904 and mass storage device 907 described above may be collectively referred to as memory.
The computer device 900 may also operate in accordance with embodiments of the present application by a remote computer connected to the network through a network, such as the internet. I.e., the computer device 900 may be connected to the network 912 through a network interface unit 911 coupled to the system bus 905, or other types of networks or remote computer systems (not shown) may be coupled using the network interface unit 911.
The memory also includes a computer program stored in the memory and configured to be executed by the one or more processors to implement the data storage method based on the distributed data processing architecture described above.
In one exemplary embodiment, a computer readable storage medium is also provided, in which a computer program is stored which, when being executed by a processor, implements the above described data storage method based on a distributed data processing architecture.
Alternatively, the computer-readable storage medium may include: ROM (Read-Only Memory), RAM (Random-Access Memory), SSD (Solid State Drives, solid State disk), optical disk, or the like. The random access memory may include ReRAM (Resistance Random Access Memory, resistive random access memory) and DRAM (Dynamic Random Access Memory ), among others.
In one exemplary embodiment, a computer program product or computer program is also provided, the computer program product or computer program comprising computer instructions stored in a computer readable storage medium. A processor of a computer device reads the computer instructions from the computer readable storage medium, and the processor executes the computer instructions to cause the computer device to perform the data storage method based on the distributed data processing architecture described above.
It should be noted that, the information (including, but not limited to, object device information, object personal information, etc.), data (including, but not limited to, data for analysis, stored data, presented data, etc.), and signals related to the present application are all authorized by the object or sufficiently authorized by each party, and the collection, use, and processing of the related data is required to comply with the relevant laws and regulations and standards of the relevant country and region. For example, streaming data, MQ, newly added data, etc. involved in the present application are all acquired with sufficient authorization.
It should be understood that references herein to "a plurality" are to two or more. "and/or", describes an association relationship of an association object, and indicates that there may be three relationships, for example, a and/or B, and may indicate: a exists alone, A and B exist together, and B exists alone. The character "/" generally indicates that the context-dependent object is an "or" relationship. In addition, the step numbers described herein are merely exemplary of one possible execution sequence among steps, and in some other embodiments, the steps may be executed out of the order of numbers, such as two differently numbered steps being executed simultaneously, or two differently numbered steps being executed in an order opposite to that shown, which is not limiting.
The foregoing description of the exemplary embodiments of the application is not intended to limit the application to the particular embodiments disclosed, but on the contrary, the intention is to cover all modifications, equivalents, and alternatives falling within the spirit and scope of the application.

Claims (10)

1. A data storage method based on a distributed data processing architecture, the method comprising:
acquiring newly added data in a current time window;
in a key value database, comparing and deduplicating the newly added data with the historical deduplication data to obtain deduplication data corresponding to the current time window; the historical deduplication data refers to deduplication data corresponding to a time window previous to the current time window;
counting the deduplication data corresponding to the current time window to obtain a counting result corresponding to the current time window, wherein the counting result is used for indicating the statistics of a target index corresponding to the deduplication data;
and separating and storing the de-duplication data and the statistical result corresponding to the current time window.
2. The method of claim 1, wherein the separately storing the deduplication data and the statistics corresponding to the current time window comprises:
storing the duplicate removal data corresponding to the current time window into the key value pair database;
storing the statistical result corresponding to the current time window as state data into the key value pair database;
storing the sum value between the statistical result corresponding to the current time window and the historical statistical result into a relational database; the historical statistical result refers to a statistical result corresponding to a time window above the current time window.
3. The method of claim 1, wherein the key-value database is an external database of the distributed data processing architecture.
4. The method of claim 1, wherein after the performing statistics on the deduplication data corresponding to the current time window to obtain the statistics result corresponding to the current time window, further comprises:
responsive to triggering a checkpoint corresponding to the distributed data processing architecture, obtaining snapshot data corresponding to the checkpoint, wherein the snapshot data comprises consumption conditions, watermarks and time windows corresponding to stream data;
and storing the snapshot data corresponding to the check point into a fast reading database.
5. The method of claim 1, wherein after the obtaining the new data in the current time window, further comprising:
pre-aggregating the newly-added data corresponding to the current window to obtain pre-aggregated newly-added data, wherein the dimension of the pre-aggregated newly-added data is smaller than that of the newly-added data;
pre-deduplicating the pre-aggregated new added data to obtain adjusted new added data, wherein the data volume of the adjusted new added data is smaller than that of the new added data;
and comparing the adjusted newly added data with the historical deduplication data to obtain deduplication data corresponding to the current time window.
6. The method according to claim 1, wherein the method further comprises:
under the condition that the task corresponding to the distributed data processing architecture fails, acquiring a latest check point and snapshot data corresponding to the latest check point from a quick reading database;
determining a target time window according to the snapshot data corresponding to the latest check point;
in the key value database, comparing the newly added data corresponding to the target time window with the de-duplication data corresponding to the last time window of the target time window, and de-duplication to obtain de-duplication data corresponding to the target time window;
counting the de-duplication data corresponding to the target time window to obtain a counting result corresponding to the target time window;
and sequentially acquiring statistical results corresponding to each time window from the target time window to the current service time.
7. A data storage device based on a distributed data processing architecture, the device comprising:
a newly added data acquisition module, configured to acquire newly added data in a current time window;
a deduplication data acquisition module, configured to compare the newly added data against historical deduplication data in a key-value database and deduplicate to obtain deduplication data corresponding to the current time window, the historical deduplication data being the deduplication data corresponding to the time window preceding the current time window;
a statistical result acquisition module, configured to count the deduplication data corresponding to the current time window to obtain a statistical result corresponding to the current time window, the statistical result being used to indicate a statistic of a target index corresponding to the deduplication data;
and a data result storage module, configured to separately store the deduplication data and the statistical result corresponding to the current time window.
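The storage module of claim 7 keeps the deduplication data and the statistical results in separate stores. A minimal sketch, with both stores modeled as plain dicts purely for illustration (the patent does not prescribe these backends):

```python
# Illustrative sketch of claim 7's separated storage: per-window
# deduplication data goes to one store (a key-value database in the claim),
# statistical results to another. Both dicts here are stand-ins.

class SeparatedStore:
    def __init__(self):
        self.kv_db = {}        # window -> deduplication data
        self.results_db = {}   # window -> statistical result

    def store(self, window, dedup_data, statistic):
        """Store the two outputs of a window in their separate stores."""
        self.kv_db[window] = dedup_data
        self.results_db[window] = statistic

store = SeparatedStore()
store.store(1, {"a", "b"}, 2)
print(store.kv_db)       # {1: {'a', 'b'}}
print(store.results_db)  # {1: 2}
```

Keeping the two separate lets queries for statistics avoid scanning the much larger deduplication data, and lets the dedup store be compacted or expired independently.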
8. A computer device comprising a processor and a memory, wherein the memory has stored therein a computer program that is loaded and executed by the processor to implement a data storage method based on a distributed data processing architecture as claimed in any one of claims 1 to 6.
9. A computer readable storage medium, characterized in that the computer readable storage medium has stored therein a computer program, which is loaded and executed by a processor to implement a data storage method based on a distributed data processing architecture as claimed in any one of claims 1 to 6.
10. A computer program product comprising computer instructions stored in a computer readable storage medium, the computer instructions being read from the computer readable storage medium and executed by a processor to implement the data storage method based on a distributed data processing architecture as claimed in any one of claims 1 to 6.
CN202210503082.6A 2022-05-09 2022-05-09 Data storage method, device and equipment based on distributed data processing architecture Pending CN117076551A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210503082.6A CN117076551A (en) 2022-05-09 2022-05-09 Data storage method, device and equipment based on distributed data processing architecture

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210503082.6A CN117076551A (en) 2022-05-09 2022-05-09 Data storage method, device and equipment based on distributed data processing architecture

Publications (1)

Publication Number Publication Date
CN117076551A true CN117076551A (en) 2023-11-17

Family

ID=88702949

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210503082.6A Pending CN117076551A (en) 2022-05-09 2022-05-09 Data storage method, device and equipment based on distributed data processing architecture

Country Status (1)

Country Link
CN (1) CN117076551A (en)

Similar Documents

Publication Publication Date Title
CN108139958B (en) System and method for processing events of an event stream
US11023355B2 (en) Dynamic tracing using ranking and rating
US10268695B2 (en) Snapshot creation
Akidau et al. Millwheel: Fault-tolerant stream processing at internet scale
US10909073B2 (en) Automatic snapshot and journal retention systems with large data flushes using machine learning
US20070282921A1 (en) Recovery point data view shift through a direction-agnostic roll algorithm
US20190347343A1 (en) Systems and methods for indexing and searching
CN111339073A (en) Real-time data processing method and device, electronic equipment and readable storage medium
US9037905B2 (en) Data processing failure recovery method, system and program
KR20150118963A (en) Queue monitoring and visualization
CN116701033A (en) Host switching abnormality detection method, device, computer equipment and storage medium
CN113391767B (en) Data consistency checking method and device, electronic equipment and readable storage medium
CN114528127A (en) Data processing method and device, storage medium and electronic equipment
US11934377B2 (en) Consistency checking for distributed analytical database systems
CN110543413A (en) Business system testing method, device, equipment and storage medium
CN114020527A (en) Snapshot recovery method and device, computer equipment and storage medium
US11461186B2 (en) Automatic backup strategy selection
CN117076551A (en) Data storage method, device and equipment based on distributed data processing architecture
CN113641693B (en) Data processing method and device of streaming computing system, electronic equipment and medium
US20130290385A1 (en) Durably recording events for performing file system operations
CN113421109A (en) Service checking method, device, electronic equipment and storage medium
CN114791901A (en) Data processing method, device, equipment and storage medium
CN110658989B (en) System and method for backup storage garbage collection
US20230065833A1 (en) Maintaining ongoing transaction information of a database system in an external data store for processing unsaved transactions in response to designated events
CN115580528A (en) Fault root cause positioning method, device, equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination