CN112084226A

CN112084226A - Data processing method, system, device and computer readable storage medium

Info

Publication number: CN112084226A
Application number: CN201910512000.2A
Authority: CN
Inventors: 徐卓夫
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2019-06-13
Filing date: 2019-06-13
Publication date: 2020-12-15
Anticipated expiration: 2039-06-13
Also published as: CN112084226B

Abstract

Embodiments of the invention provide a data processing method, system, device, and computer-readable storage medium. The data processing method comprises the following steps: collecting first time-series data; caching the first time-series data; down-sampling the first time-series data to obtain second time-series data; and persistently storing the first time-series data and the second time-series data, respectively. Because down-sampling is performed before the data is written into the database, the data does not have to be read back from the database, which reduces the impact of the down-sampling operation on system performance. Because the down-sampling operation is also executed in a dispersed manner rather than all at once, its instantaneous impact on system performance is reduced as well.

Description

Data processing method, system, device and computer readable storage medium

Technical Field

The present invention relates to the field of computer technologies, and in particular, to a data processing method, system, apparatus, and computer-readable storage medium.

Background

In a monitoring system, time series data is continuously generated and is collected, stored, queried, and analyzed. In engineering terms, this type of data is characterized by sustained, highly concurrent writes and, compared with other types of data, by far more writes than reads.

For this type of time series data, queries are generally classified into short-term data queries and long-term data queries. Short-term data queries are typically made by operation and maintenance personnel for the data around the time of a failure, for example only within 1 hour before and after the failure. Long-term data queries are typically used to observe the trend of data over a relatively long period, such as weekly or monthly statistics for the past year. Short-term data queries require relatively high data accuracy, while long-term data queries can tolerate reduced accuracy.

To meet the precision requirements of both short-term and long-term data queries, the industry has proposed a TSDB solution that is now widely applied. Its core idea is to perform down-sampling by periodically querying historical data after the original data has been written into the database, and then to write the result back into the database.

However, this down-sampling requires reading the original-precision data from the database, performing the calculation, and then writing the result back to the database; that is, one read and one write are added for each timeline. The storage pressure on the monitoring system is already considerable, so this solution has a large impact on system performance. For example, in the current Jingdong Cloud monitoring system, the number of requests/transactions per second is 500,000; if a down-sampling operation is performed once every 10 minutes, 30 million pieces of data need to be read and 500,000 pieces of data need to be written, adding 30.5 million read and write operations in total.

Disclosure of Invention

In view of this, embodiments of the present invention provide a data processing method, system, apparatus, and computer readable storage medium, which are used to solve the problem in the prior art that the performance loss of the system due to the down-sampling is large.

In a first aspect, an embodiment of the present invention provides a data processing method, including:

collecting first time-series data;

caching the first time-series data;

down-sampling the first time series data to obtain second time series data; and

persisting the first time-series data and the second time-series data, respectively.

In an alternative embodiment, the down-sampling comprises:

obtaining a plurality of second time-series data through multiple rounds of down-sampling;

storing the second time series data into a database comprises:

storing a plurality of the second time-series data in a database.

In an alternative embodiment, performing the down-sampling multiple times includes:

down-sampling the second time-series data obtained by the previous down-sampling to obtain the current second time-series data.

In an alternative embodiment, the caching includes: organizing and storing the first time-series data by the timestamp of the sampling time point.

In an optional embodiment, the first time series data is organized and stored by using a double-layer hash table, wherein an outer table key of the double-layer hash table is a time stamp of a sampling time point of the second time series data, and an inner table key is a time stamp of a sampling time point of the first time series data.

In an alternative embodiment, the down-sampling is performed after a delay of a set number of time intervals.

In an optional embodiment, the method further comprises: preprocessing the first time-series data before caching the first time-series data.

In an alternative embodiment, the first time series data and the second time series data are distributed to a plurality of databases for storage.

In an optional embodiment, the method further comprises: configuring a sampling time point and a sampling density.

In a second aspect, an embodiment of the present invention provides a data processing system, including:

an acquisition unit configured to collect first time-series data;

a first storage unit configured to cache the first time-series data;

a sampling unit configured to down-sample the first time-series data to obtain second time-series data; and

a second storage unit configured to persistently store the first time-series data and the second time-series data, respectively.

In a third aspect, an embodiment of the present invention provides a computer-readable storage medium, where computer instructions are stored, and when the computer instructions are executed, the data processing method described in any one of the foregoing is implemented.

In a fourth aspect, an embodiment of the present invention provides an apparatus, including:

a memory for storing computer instructions;

a processor coupled to the memory, the processor being configured to perform any of the above data processing methods based on the computer instructions stored in the memory.

Embodiments of the invention have the following advantages or beneficial effects: when the first time-series data is collected, it is stored in the cache unit, and the down-sampling operation then reads the data from the cache unit rather than from the database, which reduces the impact of the down-sampling operation on system performance. The down-sampling operation is also executed in a dispersed manner, which helps reduce its instantaneous impact on system performance.

Drawings

The above and other objects, features and advantages of the present invention will become more apparent by describing embodiments of the present invention with reference to the following drawings, in which:

FIG. 1 is a flow chart illustrating a data processing method according to an embodiment of the present invention;

FIG. 2 is a block diagram of a data processing system according to an embodiment of the present invention;

FIG. 3 is a diagram illustrating an example of a buffering unit for buffering first time-series data according to an embodiment of the present invention;

FIG. 4 is a flow diagram illustrating a sampling unit of a data processing system according to an embodiment of the present invention;

FIG. 5 is a block diagram of an apparatus according to an embodiment of the present invention.

Detailed Description

The present invention will be described below based on examples, but the present invention is not limited to only these examples. In the following detailed description of the present invention, certain specific details are set forth. It will be apparent to one skilled in the art that the present invention may be practiced without these specific details. Well-known methods, procedures, and processes have not been described in detail so as not to obscure the present invention. The figures are not necessarily drawn to scale.

Description of terms:

Time series data: data collected at different points in time, in chronological order, used to describe how a phenomenon varies with time. Such data reflects the state or degree of change of an object or phenomenon over time.

Time series database (TSDB, Time Series DataBase): a class of databases specifically designed and optimized for time series data.

Down-sampling ("sampling" for short): reducing the accuracy (or resolution) of data in order to store long-term data at lower cost.

Golang: a programming language released by Google that uses channel communication instead of the traditional thread-lock model to make concurrent programming easier.

Goroutine: a Golang coroutine, a lightweight unit of concurrent execution.

Golang channel: used for communication between goroutines, avoiding application-level locks.

Fig. 1 is a flow chart illustrating a data processing method according to an embodiment of the present invention. The method specifically comprises the following steps.

In step S101, first time-series data is acquired.

In this step, the first time-series data may be acquired in an active acquisition mode or a passive reception mode. In the active acquisition mode, the system starts an acquisition process, which collects the first time-series data from a specified location in chronological order. In the passive reception mode, another system generates the first time-series data and sends it to this system, which remains in a listening state, ready to receive.

In step S102, the first time-series data is buffered.

Caching in this step means temporarily storing the first time-series data, for example in system memory; such data is typically cleared when the system exits. In addition, if the system runs as a daemon process, outdated data in system memory needs to be deleted or overwritten regularly so that excessive memory overhead does not degrade system performance.
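The regular deletion of outdated cached data can be pictured with a minimal Go sketch. The `Cache` type, its methods, and the retention parameter are hypothetical names chosen for illustration; the embodiment does not prescribe a particular eviction policy.

```go
package main

import "fmt"

// Cache holds raw values in memory, keyed by Unix timestamp in seconds.
// It is a hypothetical stand-in for the temporary storage of step S102.
type Cache struct {
	data map[int64][]float64
}

// NewCache returns an empty in-memory cache.
func NewCache() *Cache { return &Cache{data: make(map[int64][]float64)} }

// Put appends a value under its timestamp.
func (c *Cache) Put(ts int64, v float64) { c.data[ts] = append(c.data[ts], v) }

// Len reports how many distinct timestamps are cached.
func (c *Cache) Len() int { return len(c.data) }

// Evict deletes every entry older than now-retention and returns how many
// timestamps were removed. A daemon would call it on a regular schedule.
func (c *Cache) Evict(now, retention int64) int {
	removed := 0
	for ts := range c.data {
		if ts < now-retention {
			delete(c.data, ts) // deleting while ranging over a map is safe in Go
			removed++
		}
	}
	return removed
}

func main() {
	c := NewCache()
	for ts := int64(0); ts < 10; ts++ {
		c.Put(ts, 1.0)
	}
	// At time 10 with a 5-second retention, timestamps 0 through 4 are dropped.
	fmt.Println(c.Evict(10, 5), c.Len())
}
```

Overwriting in place, which the text also allows, would replace the `delete` call with a reuse of the slot; the sketch shows only the deletion variant.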

In step S103, the first time-series data is down-sampled to obtain second time-series data.

Both up-sampling and down-sampling operations can be performed on time series data. Down-sampling reduces the number of sampling time points and the sampling density, while up-sampling increases them. For example, deriving hourly data from per-minute data is down-sampling, and deriving per-second data from per-minute data is up-sampling. In this step, the second time-series data is obtained by down-sampling the first time-series data, so the second time-series data has fewer sampling points than the first time-series data.
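As an illustration of down-sampling, the following Go sketch reduces per-second points to one averaged point per window. Averaging is only one possible aggregation, and the `Point` type and `Downsample` function are hypothetical names; the input is assumed to be sorted by timestamp, with non-negative timestamps.

```go
package main

import "fmt"

// Point is one sample: a Unix timestamp in seconds and a value.
type Point struct {
	Timestamp int64
	Value     float64
}

// Downsample groups points into windows of `interval` seconds and emits one
// averaged point per window, stamped with the window start.
// The input is assumed to be sorted by timestamp.
func Downsample(points []Point, interval int64) []Point {
	var out []Point
	var sum float64
	var count int
	var bucket int64
	for _, p := range points {
		b := p.Timestamp / interval * interval // start of the window holding p
		if count > 0 && b != bucket {
			out = append(out, Point{Timestamp: bucket, Value: sum / float64(count)})
			sum, count = 0, 0
		}
		bucket = b
		sum += p.Value
		count++
	}
	if count > 0 {
		out = append(out, Point{Timestamp: bucket, Value: sum / float64(count)})
	}
	return out
}

func main() {
	// 120 per-second samples reduce to two per-minute averages.
	var raw []Point
	for i := int64(0); i < 120; i++ {
		raw = append(raw, Point{Timestamp: i, Value: float64(i)})
	}
	for _, p := range Downsample(raw, 60) {
		fmt.Println(p.Timestamp, p.Value)
	}
}
```

The output has fewer sampling points than the input, which is the property the step relies on.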

In step S104, the first time-series data and the second time-series data are respectively persistently stored.

The persistent storage in this step is the counterpart of the caching in step S102. For example, the first time-series data and the second time-series data may be stored in an Oracle relational database, so that the data remains in the database even after the system exits.

According to the technical solution provided by this embodiment, when the first time-series data is collected, it is stored in the cache unit, and the down-sampling then reads the data from the cache unit rather than from the database, which reduces the impact of the down-sampling on system performance. The down-sampling is also executed in a dispersed manner, which reduces its instantaneous impact on system performance.

The present embodiment is described in more detail below using an example. Suppose a data source collects data once per second, so that 3600 sampling time points are stored in one hour, 86400 in one day, 604800 in one week, and so on. According to the prior art, when daily data for 30 days needs to be generated, the data of 86400 × 30 sampling time points needs to be acquired from the data source and aggregated. According to the present embodiment, the system collects the data of 86400 sampling time points each day as usual and stores it in the cache, then performs down-sampling on those 86400 sampling time points to generate one piece of daily data, and so on. The impact of the present embodiment on system performance is therefore much smaller than that of the prior art.

In the present embodiment, the down-sampling is performed once on the first time-series data to obtain the second time-series data, but in practice multiple down-sampling operations may be performed on the first time-series data to obtain a plurality of second time-series data. When the down-sampling operation is performed multiple times, each round may down-sample the second time-series data obtained by the previous round to obtain the current second time-series data. For example, per-minute data is obtained from per-second data, per-hour data is then obtained from per-minute data, and per-day data (that is, 24-hour data) is obtained from per-hour data. Accordingly, the system can configure the sampling time points and sampling density of the multiple down-sampling operations according to actual requirements, and can synchronize the multiple down-sampling operations with one another.
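The chaining of down-sampling rounds can be sketched in Go as follows. The sketch assumes that each stage keeps a sum and a count, so that a later stage can merge the output of the previous stage without returning to the raw data; `Agg` and `Rollup` are hypothetical names, and the embodiment does not specify the aggregate that is carried between stages.

```go
package main

import "fmt"

// Agg is a mergeable aggregate for one window of one series.
type Agg struct {
	Start int64 // window start, Unix seconds
	Sum   float64
	Count int64
}

// Mean returns the average of the window.
func (a Agg) Mean() float64 { return a.Sum / float64(a.Count) }

// Rollup merges aggregates into coarser windows of `interval` seconds.
// The input is assumed to be sorted by Start.
func Rollup(in []Agg, interval int64) []Agg {
	var out []Agg
	for _, a := range in {
		start := a.Start / interval * interval
		if n := len(out); n > 0 && out[n-1].Start == start {
			out[n-1].Sum += a.Sum
			out[n-1].Count += a.Count
			continue
		}
		out = append(out, Agg{Start: start, Sum: a.Sum, Count: a.Count})
	}
	return out
}

func main() {
	// Two hours of per-second samples, each with value 1.
	var seconds []Agg
	for t := int64(0); t < 7200; t++ {
		seconds = append(seconds, Agg{Start: t, Sum: 1, Count: 1})
	}
	minutes := Rollup(seconds, 60) // stage 1: per-second to per-minute
	hours := Rollup(minutes, 3600) // stage 2: fed by the output of stage 1
	fmt.Println(len(minutes), len(hours), hours[0].Mean())
}
```

Carrying the sum and count, rather than an already averaged value, keeps the hourly mean exact even when windows hold different numbers of points.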

FIG. 2 is a block diagram of a data processing system according to an embodiment of the present invention. The data processing system 200 includes an acquisition unit 201, a preprocessing unit 202, a cache unit 203, two sampling units 204, and a storage unit 205.

The acquisition unit 201 is configured to acquire first time-series data. In an alternative embodiment, the acquisition unit 201 may receive data push through the TSDB interface, and data conforming to the interface format definition may be sent to the acquisition unit 201 through the TSDB interface.

The preprocessing unit 202 is configured to preprocess the first time-series data. The preprocessing includes cleaning the data and applying defensive checks to it. This unit is optional, and the system may omit it.

The cache unit 203 is configured to cache the first time-series data. The data in the cache unit is temporarily stored and is generally cleared when the system exits.

The sampling unit 204 down-samples the first time-series data to obtain second time-series data. The present embodiment includes two stages of sampling units 204. Each stage performs the sampling calculation on the time series data according to its configuration. After the sampling unit at one stage sends the second time-series data it has generated to the storage unit 205, that data is also written into the sampling unit at the next stage for further sampling. This is repeated stage by stage.

The storage unit 205 is configured to persistently store the first time-series data and the second time-series data, respectively. In the present embodiment, the storage unit 205 includes a distribution process 2051 and a plurality of databases 2052, so that the first time-series data can be distributed across different databases, and likewise the second time-series data. It should be understood that the plurality of databases 2052 may be storage back ends of various types, such as Elasticsearch, Cassandra, or Redis.
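One way a distribution process could spread data over several databases is sketched below in Go. Routing by a hash of the series identifier is an assumption made for illustration; the embodiment does not state how the distribution process 2051 chooses a database, and `Store` and `Distributor` are hypothetical in-memory stand-ins rather than clients of any real storage back end.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// Store is an in-memory stand-in for one storage back end.
type Store struct {
	Name string
	Rows []string
}

// Distributor routes each series to one of several stores.
type Distributor struct {
	Stores []*Store
}

// pick hashes the series identifier so that the same series always lands in
// the same store (assumed routing rule, not taken from the text).
func (d *Distributor) pick(seriesID string) *Store {
	h := fnv.New32a()
	h.Write([]byte(seriesID))
	return d.Stores[int(h.Sum32()%uint32(len(d.Stores)))]
}

// Write appends a row to the store chosen for the series and returns that store.
func (d *Distributor) Write(seriesID, row string) *Store {
	s := d.pick(seriesID)
	s.Rows = append(s.Rows, row)
	return s
}

func main() {
	d := &Distributor{Stores: []*Store{{Name: "db0"}, {Name: "db1"}, {Name: "db2"}}}
	raw := d.Write("cpu.host1", "raw 10:00:01 0.42")
	agg := d.Write("cpu.host1", "10min 10:00:00 0.40")
	// Raw and down-sampled rows of one series share a store under this rule.
	fmt.Println(raw.Name == agg.Name)
}
```

A deployment could equally route raw and down-sampled data to different back ends; the sketch only shows that the distribution step can be a small, stateless function in front of the databases.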

The technical solution of this embodiment provides two sampling units, so that two-stage down-sampling can be performed. The present invention is of course not limited to this: more sampling units or more stages of down-sampling can be configured according to actual needs, and the sampling time point and sampling density can be adjusted while the system is running.

FIG. 3 illustrates an example of a cache unit 300 for caching the first time-series data according to an embodiment of the present invention. The cache unit organizes and stores the first time-series data in a double-layer hash table (that is, a nested hash table). The outer table key of the double-layer hash table is the timestamp of a sampling time point of the second time-series data, and the inner table key is the timestamp of a sampling time point of the first time-series data. Referring to FIG. 3, 301 is a record in the double-layer hash table, where "2018-11-22 10:00:00" is a down-sampled timestamp (sampled every 10 minutes) serving as the outer table key, and "2018-11-22 10:00:01" is the originally collected timestamp (sampled every 1 second) serving as the inner table key. The inner table key may also be an identifier other than a timestamp. With the double-layer hash table structure shown in FIG. 3, optionally, when the system receives new data, it first calculates the storage location of the data, then finds the data's unique identifier at that location, fetches the data currently in the cache, and performs a sampling calculation, such as summation or averaging, on the cached data and the new data. The storage location of newly arrived data can be calculated by the following formula:

where timestamp represents the timestamp carried by the new data, interval represents the time interval of the outer table key, and int() represents the rounding operation. Ideally, every ten minutes, say at 10:10:00, the data from 10:00:00 to 10:09:59 could be considered fully sampled, with a timestamp of 10:00:00. In the real world, however, time series data does not always arrive as expected: the clocks of the various collection objects are not necessarily synchronized, network transmission may be delayed or even retransmitted, and collectors may cache data and send it in batches to improve performance. Observation shows that the arrival time of time series data and the timestamp it carries have the following characteristics:

1) some data arrives late: it is labeled 10:10:00 but is not actually sent to the system until 10:11:00;

2) some data arrives early: it is labeled 10:10:00 but has actually been generated and sent to the system by 10:09:00;

3) the deviation is relatively small, usually within 1 minute whether the data is late or early.

For example, data labeled 10:09:00 may arrive at the system at 10:11:00. If the ideal-case algorithm is followed, down-sampling of the data for the period 10:00:00 to 10:09:59 is completed at 10:10:00 and the result is sent downstream, so the 10-minute sampled data finally stored in the back end would include two data points with the timestamp 10:00:00. This makes the sampled data inaccurate, and in some TSDB systems that cannot properly handle "multiple time series data with the same timestamp" it may cause query errors. To avoid such errors, the system may optionally employ a delayed write strategy that waits two time intervals, as shown in FIG. 4.
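The nested hash table and the delayed write strategy can be combined in a short Go sketch. Two points are assumptions rather than statements of the text: the storage location is taken to be the timestamp rounded down to a multiple of the interval, which matches record 301, and the delay is counted from the start of the window. Timestamps are written as seconds since midnight for readability, and `Cache`, `BucketKey`, `Add`, and `Flush` are hypothetical names.

```go
package main

import "fmt"

// Cache is a two-level table: the outer key is the window start (the
// down-sampled timestamp) and the inner key is the raw timestamp.
type Cache struct {
	Interval int64
	Buckets  map[int64]map[int64]float64
}

// NewCache returns an empty cache with the given window length in seconds.
func NewCache(interval int64) *Cache {
	return &Cache{Interval: interval, Buckets: make(map[int64]map[int64]float64)}
}

// BucketKey rounds a timestamp down to the start of its window.
// This formula is an assumption consistent with record 301.
func (c *Cache) BucketKey(ts int64) int64 { return ts / c.Interval * c.Interval }

// Add stores a raw value under its window, whenever it happens to arrive.
func (c *Cache) Add(ts int64, v float64) {
	k := c.BucketKey(ts)
	if c.Buckets[k] == nil {
		c.Buckets[k] = make(map[int64]float64)
	}
	c.Buckets[k][ts] = v
}

// Flush returns the average of every window that started at least `delay`
// intervals before now, and removes those windows from the cache.
func (c *Cache) Flush(now, delay int64) map[int64]float64 {
	out := make(map[int64]float64)
	for k, inner := range c.Buckets {
		if now-k < delay*c.Interval {
			continue // still waiting for late arrivals
		}
		var sum float64
		for _, v := range inner {
			sum += v
		}
		out[k] = sum / float64(len(inner))
		delete(c.Buckets, k)
	}
	return out
}

func main() {
	c := NewCache(600) // 10-minute windows
	c.Add(36001, 2)    // labeled 10:00:01
	c.Add(36540, 4)    // labeled 10:09:00, may arrive as late as 10:11:00
	fmt.Println(len(c.Flush(36600, 2))) // at 10:10:00 nothing is written yet
	fmt.Println(c.Flush(37200, 2))      // at 10:20:00 the 10:00:00 window is written once
}
```

Because each window is emitted exactly once, after late data has had time to arrive, the back end does not receive two values with the same 10:00:00 timestamp.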

FIG. 4 is a flow chart illustrating a sampling unit of a data processing system according to an embodiment of the present invention. The time interval of the sampling unit is set to 10 minutes, for example. After the sampling unit starts, it initializes an internal cache unit, which caches the data written within each 10-minute interval. The cache unit may adopt the double-layer hash table shown in FIG. 3, where the outer table key is a timestamp, the inner table key is the unique identifier of the time series data, the corresponding value is the time series data itself, and sampling statistics are computed within each outer-layer timestamp interval. The processing procedure of the cache unit further comprises the following steps.

In step S401, a periodic check goroutine is started. The goroutine runs once every 3 seconds to check whether a set number of time intervals has elapsed (for example, 20 minutes in total). If the set number of time intervals has elapsed, step S402 is performed; otherwise, step S404 is performed.

In steps S402 and S403, the cache unit is traversed and, for the outer timestamp of each record, it is determined whether the first time-series data is complete. If so, the down-sampling has been completed and step S405 is executed; if not, execution continues at step S401.

In step S404, the first time-series data is written into the buffer unit, and downsampling is performed.

In step S405, the first time-series data and the second time-series data are transmitted to the distribution process, and the first time-series data and the second time-series data are stored by the distribution process.

It should be noted that the sampling unit provided in this embodiment performs the down-sampling operation on the basis of the double-layer hash table shown in FIG. 3: each time the sampling unit receives new data, it takes the data out of the hash table and performs the down-sampling calculation together with the new data. Because the arrival time of time series data generally differs only slightly from the timestamp it carries, once the set number of time intervals has elapsed the time series data can be regarded as complete and the down-sampling operation as finished, so both the first time-series data and the second time-series data can then be stored in the database. Of course, the sampling unit may instead perform the down-sampling operation only after all the sampled data for a given sampling time point is complete.
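The flow of FIG. 4 can be approximated by a single goroutine that reads from two channels, as in the Go sketch below. The sketch makes several simplifying assumptions: the periodic check is driven by a channel of "current time" values (in a running system these could come from a ticker that fires every 3 seconds), only the down-sampled result is emitted, and the delay is counted from the window start. `Run`, `Point`, and `Window` are hypothetical names.

```go
package main

import "fmt"

// Point is one raw sample of a named series.
type Point struct {
	Series    string
	Timestamp int64 // Unix seconds
	Value     float64
}

// Window is one finished down-sampled value.
type Window struct {
	Start  int64
	Series string
	Mean   float64
}

type agg struct {
	sum   float64
	count int64
}

// Run folds each arriving point into its window aggregate (as in S404) and,
// on every tick, emits the windows older than `delay` intervals (as in
// S401 to S405). It stops when either input channel is closed.
func Run(in <-chan Point, ticks <-chan int64, out chan<- []Window, interval, delay int64) {
	defer close(out)
	cache := make(map[int64]map[string]*agg) // outer: window start, inner: series id
	for {
		select {
		case p, ok := <-in:
			if !ok {
				return
			}
			start := p.Timestamp / interval * interval
			if cache[start] == nil {
				cache[start] = make(map[string]*agg)
			}
			a := cache[start][p.Series]
			if a == nil {
				a = &agg{}
				cache[start][p.Series] = a
			}
			a.sum += p.Value
			a.count++
		case now, ok := <-ticks:
			if !ok {
				return
			}
			var done []Window
			for start, inner := range cache {
				if now-start < delay*interval {
					continue // the set number of intervals has not elapsed yet
				}
				for id, a := range inner {
					done = append(done, Window{Start: start, Series: id, Mean: a.sum / float64(a.count)})
				}
				delete(cache, start)
			}
			out <- done
		}
	}
}

func main() {
	in, ticks, out := make(chan Point), make(chan int64), make(chan []Window)
	go Run(in, ticks, out, 600, 2)
	in <- Point{Series: "cpu", Timestamp: 1, Value: 10}
	in <- Point{Series: "cpu", Timestamp: 540, Value: 30} // a late arrival still lands in window 0
	ticks <- 600
	fmt.Println(len(<-out)) // nothing is old enough yet
	ticks <- 1200
	fmt.Println(<-out)
	close(in)
}
```

Because the aggregate is updated as each point arrives, the check on each tick only has to decide whether a window is old enough to emit; no lock is needed, since a single goroutine owns the cache and all communication goes through channels.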

In summary, according to the technical solution of this embodiment, the cache avoids a large number of extra read and write operations on the database, which can significantly improve system performance, especially in large-scale time series data storage scenarios. In addition, the multi-stage sampling units further reduce storage cost and provide more flexible storage of time series data.

Furthermore, it should be noted that terms such as "first" and "second" in this document are used only to distinguish between different subjects and do not indicate any difference in priority or rank between them.

Fig. 5 is a block diagram of an apparatus according to an embodiment of the present invention. The apparatus shown in fig. 5 is only an example and should not limit the functionality and scope of use of the embodiments of the present invention in any way.

Referring to FIG. 5, the apparatus includes a processor 501, a memory 502, and an input/output device 503, which are connected by a bus. The memory 502 includes read-only memory (ROM) and random access memory (RAM). The various computer instructions and data required to perform system functions are stored in the memory 502, and the processor 501 reads computer instructions from the memory 502 to perform various appropriate actions and processes. The input/output device includes an input section such as a keyboard and a mouse; an output section including a display, such as a cathode ray tube (CRT) or liquid crystal display (LCD), and a speaker; a storage section including a hard disk and the like; and a communication section including a network interface card such as a LAN card or a modem. The memory 502 also stores the following computer instructions to perform the operations specified by the apparatus of an embodiment of the invention: collecting first time-series data; caching the first time-series data; down-sampling the first time-series data to obtain second time-series data; and persistently storing the first time-series data and the second time-series data, respectively.

Accordingly, embodiments of the present invention provide a computer-readable storage medium storing computer instructions that, when executed, implement the operations specified by the above-described method.

The flowcharts and block diagrams in the figures illustrate the possible architectures, functions, and operations of the systems, methods, and apparatuses according to the embodiments of the present invention. Each block may represent a module, a program segment, or merely a code segment, which is an executable instruction for implementing a specified logical function. It should also be noted that the executable instructions that implement the specified logical functions may be recombined to create new modules and program segments. The blocks of the drawings, and the order of the blocks, are thus provided to better illustrate the processes and steps of the embodiments and should not be taken as limiting the invention itself.

The various modules or units of the system may be implemented in hardware, firmware, or software. The software includes, for example, programs written in various programming languages such as Java, C/C++/C#, SQL, and the like. Although the steps of the embodiments of the present invention, and their order, are presented in the method and the method diagrams, the executable instructions that implement the specified logical functions of the steps may be recombined to create new steps. The order of the steps is not limited to the order shown in the method and the method diagrams and can be modified at any time according to functional requirements, for example by performing some of the steps in parallel or in reverse order.

Systems and methods according to the present invention may be deployed on a single server or on multiple servers. For example, different modules may be deployed on different servers, each forming a dedicated server. Alternatively, the same functional unit, module, or system may be deployed in a distributed fashion across multiple servers to relieve load pressure. The servers include, but are not limited to, PCs, PC servers, blade servers, supercomputers, and the like, located on the same local area network or connected via the Internet.

The above description is only a preferred embodiment of the present invention and is not intended to limit the present invention, and various modifications and changes may be made by those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. A data processing method, comprising:

collecting first time-series data;

caching the first time-series data;

down-sampling the first time-series data to obtain second time-series data; and

persistently storing the first time-series data and the second time-series data, respectively.

2. The data processing method of claim 1, wherein the down-sampling comprises:

obtaining a plurality of second time-series data through multiple rounds of down-sampling;

storing the second time series data into a database comprises:

storing a plurality of the second time-series data in a database.

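3. The data processing method of claim 2, wherein performing the down-sampling multiple times comprises:

down-sampling the second time-series data obtained by the previous down-sampling to obtain the current second time-series data.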

4. The data processing method of claim 1, wherein the caching comprises: organizing and storing the first time-series data by the timestamp of the sampling time point.

5. The data processing method according to claim 4, wherein the first time series data is organized and stored by using a double-layer hash table, wherein an outer table key of the double-layer hash table is a time stamp of a sampling time point of the second time series data, and an inner table key is a time stamp of a sampling time point of the first time series data.

6. The data processing method according to claim 4 or 5, wherein the down-sampling is performed after a delay of a set number of time intervals.

7. The data processing method of claim 1, further comprising: preprocessing the first time-series data before caching the first time-series data.

8. The data processing method according to claim 1, wherein the first time-series data and the second time-series data are distributed to a plurality of databases to be stored.

9. The data processing method of claim 1, further comprising: configuring a sampling time point and a sampling density.

10. A data processing system, comprising:

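an acquisition unit configured to collect first time-series data;

a cache unit configured to cache the first time-series data;

a sampling unit configured to down-sample the first time-series data to obtain second time-series data; and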

a storage unit configured to persistently store the first time-series data and the second time-series data, respectively.

11. A computer-readable storage medium, characterized in that it stores computer instructions which, when executed, implement the data processing method of any one of claims 1 to 9.

12. An apparatus, comprising:

a memory for storing computer instructions;

a processor coupled to the memory, the processor being configured to perform the data processing method of any one of claims 1 to 9 based on the computer instructions stored in the memory.