WO2020078395A1 - Data storage method and apparatus, and storage medium - Google Patents


Info

Publication number
WO2020078395A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
low
aggregated
time
processing unit
Prior art date
Application number
PCT/CN2019/111510
Other languages
French (fr)
Chinese (zh)
Inventor
曾锐
陈国栋
徐乾龙
Original Assignee
杭州海康威视数字技术股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Priority claimed from CN201811204394.7A external-priority patent/CN111061758B/en
Priority claimed from CN201811236196.9A external-priority patent/CN111090705B/en
Application filed by 杭州海康威视数字技术股份有限公司 filed Critical 杭州海康威视数字技术股份有限公司
Publication of WO2020078395A1 publication Critical patent/WO2020078395A1/en

Classifications

    • GPHYSICS
    • G06COMPUTING; CALCULATING OR COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24Querying
    • G06F16/245Query processing
    • G06F16/2453Query optimisation

Definitions

  • Embodiments of the present application relate to the technical field of data processing, and in particular, to a data storage method, device, and storage medium.
  • data storage can be implemented through a data cube, where the data cube is a type of multi-dimensional matrix, that is, data of multiple dimensions can be stored.
  • an implementation manner of storing data through a data cube may include: the storage device obtains data to be stored, and performs aggregate statistical processing on the obtained data to obtain corresponding aggregated data. After that, the obtained aggregated data can be merged with the existing data in the data cube, and the merged data can be stored in the data cube.
  • Embodiments of the present application provide a data storage method, device, and storage medium, which can solve the problem that it takes a relatively long time to query data in the related art.
  • the technical solution is as follows:
  • a data storage method includes:
  • the multiple aggregated data is classified and stored by multiple data processing units, wherein the types of aggregated data stored in each data processing unit are the same.
  • a data storage device comprising:
  • the acquisition module is used to acquire multiple pieces of data from the data source, and each piece of data carries a time stamp;
  • a classification processing module configured to classify the multiple pieces of data according to the time stamp of each piece of data to obtain multiple sets of data;
  • an aggregation statistics module configured to aggregate statistics on each group of the multiple sets of data to obtain multiple aggregated data;
  • a classification storage module is used to classify and store the plurality of aggregated data by a plurality of data processing units, wherein the types of aggregated data stored in each data processing unit are the same.
  • a computer-readable storage medium on which instructions are stored, and when the instructions are executed by a processor, the data storage method according to the first aspect described above is implemented.
  • a computer program product containing instructions, which when executed on a computer, causes the computer to execute the data storage method described in the first aspect above.
  • a storage device includes a processor and a memory, wherein the memory is used to store a computer program, and the processor is used to execute the program stored on the memory to implement the data storage method described in the first aspect above.
  • Fig. 1 is a flowchart of a data storage method according to an exemplary embodiment
  • Fig. 2 is a schematic diagram of a data processing unit according to an exemplary embodiment
  • Fig. 3 is a schematic structural diagram of a data storage device according to an exemplary embodiment
  • Fig. 4 is a schematic structural diagram of a storage device according to an exemplary embodiment.
  • Spark Streaming: a computing engine that can batch process data. Its basic principle is to batch process input data at a certain time interval. When the batch processing interval is shortened to the second level, it can be used to process real-time data streams. It can support obtaining data from multiple data sources.
  • Data sources can include Kafka data sources, Flume data sources, Twitter data sources, ZeroMQ data sources, Kinesis data sources, and TCP (Transmission Control Protocol) socket data sources.
  • Data cube: a kind of multi-dimensional matrix that can be used for data analysis and indexing, and can support real-time indexing of metadata with any number of keywords.
  • the data cube may be composed of memory and disk (distributed database) to implement multi-dimensional data storage based on the memory and disk.
  • the related technical field proposes to store data through data cubes.
  • the data is generally stored in a distributed database of the data cube, for example, the distributed database is HBase.
  • the embodiments of the present application provide a data storage method, which can solve the above-mentioned problems.
  • For details, please refer to the embodiment shown in FIG. 1 below.
  • the data storage method provided by the embodiments of the present application may be executed by a storage device, and the storage device includes multiple data processing units to store data through the multiple data processing units.
  • each data processing unit in the plurality of data processing units is composed of a memory and a disk.
  • the data processing unit may be the aforementioned data cube.
  • the storage device may also include Spark Streaming to obtain data from the data source through the Spark Streaming.
  • Fig. 1 is a flowchart of a data storage method according to an exemplary embodiment.
  • the data storage method is implemented by using the above storage device as an example for illustration.
  • the data storage method may include the following implementation steps:
  • Step 101: Obtain multiple pieces of data from a data source, where each piece of data carries a time stamp.
  • the storage device may obtain the multiple pieces of data from the data source through Spark Streaming.
  • For example, when the data source is a Kafka data source, multiple pieces of data may be read from the Kafka data source through Spark Streaming. Each piece of data carries a time stamp, which can be used to indicate the generation time of that piece of data.
  • Step 102: Classify the multiple pieces of data according to the time stamp of each piece of data to obtain multiple sets of data.
  • the storage device classifies the multiple pieces of data according to the time stamp of each piece of data.
  • the specific implementation may include the following implementation steps:
  • the multiple pieces of data may be classified according to two data types, recent data and old data; that is, the pieces that belong to recent data are classified into one category and the pieces that belong to old data into another. For this purpose, a recent time range needs to be determined.
  • the latest time is obtained from the time stamps of the multiple pieces of data.
  • For example, the multiple pieces of data include first data, second data, third data, and fourth data. The time indicated by the timestamp of the first data is June 25, 2017, the time indicated by the timestamp of the second data is June 29, 2017, the time indicated by the timestamp of the third data is July 2, 2017, and the time indicated by the timestamp of the fourth data is July 5, 2017. The latest time obtained by the storage device is then July 5, 2017.
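The example above can be sketched in a few lines of Python. This is only an illustration: the `(name, timestamp)` record layout and the `latest_time` helper are assumptions, not part of the application.

```python
from datetime import date

# Hypothetical records mirroring the four pieces of data in the example.
records = [
    ("first", date(2017, 6, 25)),
    ("second", date(2017, 6, 29)),
    ("third", date(2017, 7, 2)),
    ("fourth", date(2017, 7, 5)),
]


def latest_time(batch):
    """Return the most recent timestamp among (name, timestamp) pairs."""
    return max(timestamp for _, timestamp in batch)


print(latest_time(records))  # 2017-07-05
```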
  • determining the target time interval that includes the latest time and whose interval length is a preset threshold may include the following possible implementations:
  • First implementation manner: when the latest time is within a pre-stored time interval whose interval length is the preset threshold, the pre-stored time interval is determined as the target time interval.
  • the preset threshold may be set by the user according to actual needs, or may be set by the storage device by default, which is not limited in this embodiment of the present application.
  • the preset threshold may be 30 days.
  • In this case, the pre-stored time interval is still a recent time range relative to the multiple pieces of data acquired in this batch, so it can be used directly as the target time interval; the target time interval is equivalent to the above-mentioned recent time range.
  • Second implementation manner: when the latest time is greater than the right value of the pre-stored time interval whose interval length is the preset threshold, determine the time difference between the latest time and the right value of the time interval, determine the time sum of the left value of the time interval and the time difference, update the right value of the time interval to the latest time, update the left value of the time interval to the time sum, and determine the updated time interval as the target time interval.
  • the pre-stored time interval needs to be updated to re-determine the target time interval.
  • This is equivalent to sliding the time interval to the right by a certain length of time, namely the difference between the latest time and the right value of the time interval. For example, if the pre-stored time interval is [July 1, July 15] and the latest time is July 16, the target time interval can be determined as [July 2, July 16].
  • the pre-stored time interval may be updated to the target time interval.
  • the target time interval may also be determined in other ways. For example, further calculation may be performed on the basis of the time interval determined in the first implementation manner and the result used as the target time interval, such as adding a fixed value to the left and right values of the time interval respectively, where the fixed value can be set according to actual needs. As another example, further calculation can be performed on the basis of the updated time interval determined in the second implementation manner, such as adding a fixed value to the left and right values of the updated time interval respectively to obtain the target time interval. This is not limited in the embodiment of the present application.
  • Before determining the target time interval, the storage device may also query whether the pre-stored time interval exists.
  • When the time interval exists, the target time interval is determined according to the above two implementation manners.
  • When the time interval does not exist, the storage device may generate the target time interval based on the latest time and the interval length. For example, the difference between the latest time and the preset threshold may be determined, and then the latest time is determined as the right value of the target time interval and the determined difference is determined as the left value of the target time interval.
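As a minimal sketch of how the target time interval might be determined, the following Python function covers the first manner (reuse), the second manner (slide right), and the case where no interval has been stored yet. The function name, the `(left, right)` tuple representation, and the use of `None` for a missing interval are all assumptions made for illustration; the year 2017 in the usage lines is also assumed, since the text gives only month and day.

```python
from datetime import date, timedelta
from typing import Optional, Tuple

Interval = Tuple[date, date]  # (left value, right value), both inclusive


def determine_target_interval(
    latest: date, stored: Optional[Interval], threshold: timedelta
) -> Interval:
    """Return the target time interval for a batch whose latest time is `latest`."""
    if stored is None:
        # No pre-stored interval: right value is the latest time,
        # left value is the latest time minus the preset threshold.
        return (latest - threshold, latest)
    left, right = stored
    if latest > right:
        # Second manner: slide both ends right by (latest - right).
        shift = latest - right
        return (left + shift, latest)
    # First manner: the stored interval is reused as-is. The text does not
    # discuss a latest time earlier than the left value; this sketch keeps
    # the stored interval in that case as well.
    return stored


stored = (date(2017, 7, 1), date(2017, 7, 15))
print(determine_target_interval(date(2017, 7, 16), stored, timedelta(days=14)))
# (datetime.date(2017, 7, 2), datetime.date(2017, 7, 16))
```

With a 30-day threshold and no stored interval, a latest time of July 5, 2017 yields the interval [June 5, 2017, July 5, 2017].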
  • 1023: Classify the multiple pieces of data according to the time stamp of each piece of data and the target time interval.
  • the pieces of data are classified according to the time stamp of each piece of data and the determined target time interval.
  • the data whose time indicated by the time stamp is less than the left value of the target time interval is determined as high-level data, and the data whose time indicated by the time stamp is within the target time interval is determined as low-level data.
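  • When the time indicated by the time stamp of a piece of data is less than the left value of the target time interval, the piece of data is data before the target time interval. It can be considered old data, and this type of data is classified here as high-level data.
  • When the time indicated by the time stamp of a piece of data is within the target time interval, the piece of data may be regarded as recent data, and this type of data is classified here as low-level data. In this way, two groups of data are obtained after the data classification processing.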
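The classification rule can be illustrated with a short Python sketch. The dictionary-based record layout and the `classify` helper are hypothetical; only the rule itself (older than the left value means high-level, otherwise low-level) comes from the text.

```python
from datetime import date


def classify(batch, interval):
    """Split records into (high_level, low_level) by timestamp."""
    left, _right = interval
    high_level, low_level = [], []
    for record in batch:
        if record["timestamp"] < left:
            # Older than the target time interval: old data.
            high_level.append(record)
        else:
            # The right value is at least the latest time of the batch,
            # so every remaining record lies within the interval.
            low_level.append(record)
    return high_level, low_level


batch = [
    {"name": "first", "timestamp": date(2017, 6, 25)},
    {"name": "second", "timestamp": date(2017, 6, 29)},
    {"name": "third", "timestamp": date(2017, 7, 2)},
    {"name": "fourth", "timestamp": date(2017, 7, 5)},
]
high, low = classify(batch, (date(2017, 6, 27), date(2017, 7, 5)))
print([r["name"] for r in high])  # ['first']
print([r["name"] for r in low])   # ['second', 'third', 'fourth']
```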
  • Step 103: Perform aggregate statistics on each set of data in the multiple sets of data to obtain multiple aggregated data.
  • the two sets of high-level data and low-level data obtained above need to be aggregated and counted.
  • When the target time interval uses day as the time granularity, aggregate statistics are performed on the high-level data based on the three time granularities of year, month, and day, according to different time levels and data attributes, to obtain multiple first high-aggregated data; and aggregate statistics are performed on the low-level data based on the six time granularities of year, month, day, hour, minute, and second, according to different time levels and data attributes, to obtain multiple second high-aggregated data and multiple first low-aggregated data.
  • different time levels include different dimensions of time granularity.
  • Here, a one-dimensional data attribute is taken as an example.
  • the storage device aggregates statistics according to different time levels and data attributes based on the three time granularities of year, month, and day.
  • the different time levels include a first time level, a second time level, and a third time level.
  • the first time level includes a time granularity of year
  • the second time level includes a time granularity of year and month.
  • the third time level includes three time granularities: year, month, and day.
  • the storage device aggregates statistics on each piece of data according to the first time level and data attributes to obtain the first high-aggregated data corresponding to the first time level; according to the second time level and data attributes, it aggregates statistics on each piece of data to obtain the first high-aggregated data corresponding to the second time level; and according to the third time level and data attributes, it aggregates statistics on each piece of data to obtain the first high-aggregated data corresponding to the third time level.
  • the storage device aggregates statistics according to different time levels and data attributes according to six time granularities of year, month, day, hour, minute, and second.
  • the different time levels include not only the first time level, the second time level, and the third time level, but also the fourth time level, the fifth time level, and the sixth time level.
  • The fourth time level includes four time granularities: year, month, day, and hour. The fifth time level includes five time granularities: year, month, day, hour, and minute. The sixth time level includes six time granularities: year, month, day, hour, minute, and second.
  • the storage device aggregates statistics on each piece of data according to the first time level and data attributes to obtain second high-aggregated data corresponding to the first time level;
  • according to the sixth time level and data attributes, the storage device aggregates statistics on each piece of data to obtain the first low-aggregated data corresponding to the sixth time level.
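The six time levels can be sketched by truncating each timestamp to progressively finer granularity. In the Python illustration below, counting is assumed as the aggregate statistic, and the split in which levels 1 to 3 yield second high-aggregated data while levels 4 to 6 yield first low-aggregated data is an assumption: the text only states this explicitly for the first and sixth levels. All names are hypothetical.

```python
from collections import Counter
from datetime import datetime

# strftime patterns for time levels 1..6 (year down to second).
LEVEL_FORMATS = ["%Y", "%Y%m", "%Y%m%d", "%Y%m%d%H", "%Y%m%d%H%M", "%Y%m%d%H%M%S"]


def aggregate(records, levels):
    """Count records per (time level, truncated time, attribute value)."""
    counts = Counter()
    for timestamp, attribute_value in records:
        for level in levels:
            bucket = timestamp.strftime(LEVEL_FORMATS[level - 1])
            counts[(level, bucket, attribute_value)] += 1
    return counts


def aggregate_low_level(records):
    """Aggregate low-level data on all six levels and split the result."""
    counts = aggregate(records, levels=range(1, 7))
    # Assumed split: coarse levels go to the high-level unit, fine levels
    # stay in the low-level unit.
    second_high = {k: v for k, v in counts.items() if k[0] <= 3}
    first_low = {k: v for k, v in counts.items() if k[0] >= 4}
    return second_high, first_low


low_level_records = [
    (datetime(2017, 7, 5, 10, 30, 0), "age=30"),
    (datetime(2017, 7, 5, 10, 45, 0), "age=30"),
]
second_high, first_low = aggregate_low_level(low_level_records)
print(second_high[(3, "20170705", "age=30")])    # 2
print(first_low[(5, "201707051030", "age=30")])  # 1
```

High-level data would be aggregated the same way with `levels=range(1, 4)`.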
  • The above takes a one-dimensional data attribute as an example for illustration.
  • the storage device combines two dimensions of data attributes for aggregation statistics, and based on the third time level, combines two dimensions of data attributes for aggregation statistics. In this way, 12 first high-aggregated data are obtained.
  • Data attributes and data attribute values are aggregated, for example, when the data attribute is age, the data attribute value may be an age value, etc.
  • Step 104: When the multiple data processing units include a high-level data processing unit and a low-level data processing unit, obtain the row key in each aggregated data. The row key of each aggregated data is generated during aggregation statistics and is used to indicate the time level and data attributes corresponding to that aggregated data.
  • each data processing unit of the plurality of data processing units is composed of a memory and a disk, and the type of aggregated data stored in each data processing unit is the same.
  • For the case where the multiple data processing units include a high-level data processing unit and a low-level data processing unit, please refer to FIG. 2, which is a schematic diagram of a data processing unit according to an exemplary embodiment.
  • the storage device obtains the row keys generated during the aggregation statistics process.
  • When the time level and the data attribute on which the aggregation is based are the same, and the time corresponding to the time level is within the same time range, the row keys generated are also the same.
  • For example, if the first data is aggregated based on July 2017 and a certain data attribute, and the second data is also aggregated based on July 2017 and the same data attribute, the row keys of the two aggregated data obtained after the aggregation statistics are the same.
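One way to picture a row key is as a string built from the truncated time and the data attribute. The format below (`<truncated time>|<attribute>`) is purely hypothetical; the application does not specify the row key layout. The sketch only shows that two pieces of data aggregated on July 2017 and the same attribute end up with the same key.

```python
from datetime import datetime


def make_row_key(level_format: str, timestamp: datetime, data_attribute: str) -> str:
    """Build a hypothetical row key from a time level and a data attribute.

    `level_format` is a strftime pattern such as "%Y%m" for the month level;
    the length of the truncated time implicitly identifies the time level.
    """
    return f"{timestamp.strftime(level_format)}|{data_attribute}"


key_a = make_row_key("%Y%m", datetime(2017, 7, 2), "age")
key_b = make_row_key("%Y%m", datetime(2017, 7, 5), "age")
print(key_a)           # 201707|age
print(key_a == key_b)  # True
```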
  • Step 105: Based on the row key in each first high-aggregated data and each second high-aggregated data, store the plurality of first high-aggregated data and the plurality of second high-aggregated data through the high-level data processing unit.
  • the plurality of first high-aggregated data and the plurality of second high-aggregated data are stored in the high-level data processing unit; that is, the high-aggregated data obtained from the high-level data and the part of the high-aggregated data obtained by aggregate statistics on the low-level data are stored in the same data processing unit.
  • Storing the multiple first high-aggregated data and the multiple second high-aggregated data through the high-level data processing unit may include: merging the high-aggregated data with the same row key among the multiple first high-aggregated data and the multiple second high-aggregated data to obtain multiple third high-aggregated data, and storing the multiple third high-aggregated data in the high-level data processing unit.
  • the data when storing highly aggregated data in the high-level data processing unit, the data is not directly merged with the data in the high-level data processing unit, but merged only when certain conditions are met.
  • the high-aggregated data with the same row key are merged to obtain multiple third high-aggregated data, so that high-aggregated data with the same row key are merged when the multiple third high-aggregated data are stored in the high-level data processing unit. In this way, the user can subsequently query multiple pieces of data at the same time level and within the same time range in a single query, without merging at query time, which improves the efficiency of data query.
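A minimal sketch of merging aggregated data by row key follows. It assumes that each aggregated datum is a `(row_key, count)` pair and that merging means summing the counts; neither the representation nor the merge operation is specified in the text, and the row keys shown are hypothetical.

```python
from collections import defaultdict


def merge_by_row_key(*aggregated_batches):
    """Merge (row_key, count) pairs from several batches, summing equal keys."""
    merged = defaultdict(int)
    for batch in aggregated_batches:
        for row_key, count in batch:
            merged[row_key] += count
    return dict(merged)


first_high = [("2017|age", 3), ("201706|age", 3)]
second_high = [("2017|age", 2), ("201707|age", 2)]
third_high = merge_by_row_key(first_high, second_high)
print(third_high)
# {'2017|age': 5, '201706|age': 3, '201707|age': 2}
```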
  • a specific implementation of storing the plurality of third high-aggregated data in the high-level data processing unit may include: for each third high-aggregated data in the plurality of third high-aggregated data, querying whether the memory of the high-level data processing unit stores data with the same row key as that third high-aggregated data; when it does, merging the queried data with that third high-aggregated data, and storing the merged data in the memory of the high-level data processing unit.
  • the embodiment of the present application first merges the high-aggregated data in the memory; that is, it queries whether the memory of the high-level data processing unit stores data with the same row key as the third high-aggregated data. If such data exists, the high-aggregated data with the same row key are merged directly in memory, and the merged high-aggregated data is stored in memory.
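  • When the memory of the high-level data processing unit does not store data with the same row key as a third high-aggregated data, data with the same row key as that third high-aggregated data is obtained from the disk of the high-level data processing unit, the acquired data is merged with that third high-aggregated data, and the merged data is stored in the memory of the high-level data processing unit.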
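The store path can be sketched as a small class with an in-memory dictionary and a dictionary standing in for the disk. This is an illustration under assumptions: values are counts merged by addition, the fallback source when memory has no matching row key is taken to be the unit's own disk, and the class and method names are invented.

```python
class DataProcessingUnit:
    """Toy data processing unit made of a memory part and a disk part."""

    def __init__(self):
        self.memory = {}  # row key -> aggregated count held in memory
        self.disk = {}    # row key -> aggregated count persisted on disk

    def store(self, row_key, count):
        """Merge `count` into existing data with the same row key, memory first."""
        if row_key in self.memory:
            # Same row key already in memory: merge directly in memory.
            self.memory[row_key] += count
        elif row_key in self.disk:
            # Not in memory: fetch from disk, merge, keep the result in memory.
            self.memory[row_key] = self.disk[row_key] + count
        else:
            # Row key seen for the first time.
            self.memory[row_key] = count


unit = DataProcessingUnit()
unit.disk["2017|age"] = 5
unit.store("2017|age", 2)    # merged with the value read from disk
unit.store("2017|age", 1)    # merged in memory
unit.store("201707|age", 4)  # new row key
print(unit.memory)  # {'2017|age': 8, '201707|age': 4}
```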
  • Step 106: Based on the row key in each first low-aggregated data, store the multiple first low-aggregated data through the low-level data processing unit.
  • the plurality of first low-aggregated data obtained through aggregation statistics are stored in the low-level data processing unit. Further, the storage device stores the plurality of first low-aggregated data through the low-level data processing unit based on the row key in each first low-aggregated data.
  • The specific implementation process may include: merging the first low-aggregated data with the same row key among the plurality of first low-aggregated data to obtain multiple second low-aggregated data, and storing the multiple second low-aggregated data in the low-level data processing unit.
  • For the first low-aggregated data, when the time level and the data attribute on which the aggregation is based are the same, and the time corresponding to the time level is within the same time range, the generated row keys are also the same.
  • the first low-aggregated data having the same row key are merged to obtain multiple second low-aggregated data, so that low-aggregated data with the same row key are merged when the multiple second low-aggregated data are stored in the low-level data processing unit. In this way, the user can subsequently query multiple pieces of data at the same time level and within the same time range in a single query, without merging at query time, which improves the efficiency of data query.
  • the above specific implementation of storing the plurality of second low-aggregated data in the low-level data processing unit may include: for each second low-aggregated data in the plurality of second low-aggregated data, querying whether the memory of the low-level data processing unit stores data with the same row key as that second low-aggregated data; when it does, merging the queried data with that second low-aggregated data, and storing the merged data in the memory of the low-level data processing unit.
  • the embodiment of the present application first merges the low-aggregated data in the memory; that is, it queries whether the memory of the low-level data processing unit stores data with the same row key as the second low-aggregated data. If such data exists, the low-aggregated data with the same row key are merged directly in memory, and the merged data is stored in memory.
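  • When the memory of the low-level data processing unit does not store data with the same row key as a second low-aggregated data, data with the same row key as that second low-aggregated data is obtained from the disk of the low-level data processing unit; the acquired data is merged with that second low-aggregated data, and the merged data is stored in the memory of the low-level data processing unit.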
  • the data in the memory of the high-level data processing unit is stored to the disk of the high-level data processing unit.
  • the data in the memory of the low-level data processing unit is stored to the disk of the low-level data processing unit.
  • the preset number threshold may be set by the user according to actual needs, or may be set by the storage device by default, which is not limited in the embodiment of the present application.
  • the merged high-aggregated data is first stored in the memory of the high-level data processing unit, and the merged low-aggregated data is first stored in the memory of the low-level data processing unit. Only when the data stored in the memory of the high-level data processing unit reaches a certain value, or when the data stored in the memory of the low-level data processing unit reaches a certain value, is the data in the memory written to the disk, which reduces the number of interactions with the disk.
  • When querying high-aggregated data or low-aggregated data, the memory is queried first; only when the data is not found in the memory is the disk queried, which avoids frequent reading and writing of the disk and improves system performance.
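The buffering and lookup behaviour described above can be sketched as follows. The class is hypothetical: it treats the preset number threshold as a count of row keys held in memory, uses a dictionary in place of a real disk, and assumes values passed to `put` have already been merged.

```python
class BufferedUnit:
    """Toy unit that buffers writes in memory and flushes them to disk in bulk."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold  # assumed: number of row keys
        self.memory = {}
        self.disk = {}
        self.disk_writes = 0  # counts bulk writes, to show the reduced disk traffic

    def put(self, row_key, value):
        """Store already-merged data in memory; flush once the threshold is reached."""
        self.memory[row_key] = value
        if len(self.memory) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write everything held in memory to disk in one interaction."""
        self.disk.update(self.memory)
        self.memory.clear()
        self.disk_writes += 1

    def get(self, row_key):
        """Query memory first and fall back to disk only on a miss."""
        if row_key in self.memory:
            return self.memory[row_key]
        return self.disk.get(row_key)


buffered = BufferedUnit(flush_threshold=3)
buffered.put("a", 1)
buffered.put("b", 2)
print(buffered.get("a"), buffered.disk_writes)  # 1 0
buffered.put("c", 3)                            # threshold reached, one bulk write
print(buffered.get("a"), buffered.disk_writes)  # 1 1
```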
  • the use of this storage method can also reduce the disk usage of the high-level data processing unit and of the low-level data processing unit.
  • steps 104 to 106 are used to realize the operation of classifying and storing the multiple aggregated data by multiple data processing units.
  • the storage device can also delete data in the low-level data processing unit that does not belong to the target time interval, so that the storage space of the low-level data processing unit can be saved.
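Pruning the low-level data processing unit might look like the sketch below. It assumes each stored entry keeps the start time of its bucket next to its value, which the text does not specify, and that the target time interval uses day granularity; the row keys and the year are hypothetical.

```python
from datetime import date, datetime


def prune_low_level(entries, interval):
    """Keep only entries whose bucket time falls inside the target time interval.

    `entries` maps a row key to a (bucket_start, value) pair.
    """
    left, right = interval
    return {
        row_key: (bucket_start, value)
        for row_key, (bucket_start, value) in entries.items()
        if left <= bucket_start.date() <= right
    }


entries = {
    "2017070110|age": (datetime(2017, 7, 1, 10), 4),
    "2017070210|age": (datetime(2017, 7, 2, 10), 6),
}
kept = prune_low_level(entries, (date(2017, 7, 2), date(2017, 7, 16)))
print(list(kept))  # ['2017070210|age']
```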
  • the offset of the acquired data may be recorded; the offset is used to indicate the location of the currently acquired data in the data source.
  • the next batch of data can be obtained according to the recorded offset. For example, if the data in the data source is numbered sequentially and 5 pieces of data are acquired this time, the offset is 5; that is, the next batch of data will be acquired starting from the sixth piece of data.
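The offset bookkeeping can be shown with a list standing in for the data source. The `read_batch` helper and the record names are hypothetical; a real Kafka consumer tracks offsets per partition, which this sketch does not model.

```python
def read_batch(source, offset, batch_size):
    """Read up to `batch_size` records starting at `offset`; return them with the new offset."""
    batch = source[offset:offset + batch_size]
    return batch, offset + len(batch)


source = [f"record-{i}" for i in range(1, 11)]  # sequentially numbered data

first_batch, offset = read_batch(source, 0, 5)
print(offset)         # 5
next_batch, offset = read_batch(source, offset, 5)
print(next_batch[0])  # record-6
```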
  • multiple pieces of data carrying a time stamp are obtained from a data source, and the multiple pieces of data are classified and processed according to the time stamp of each piece of data to obtain multiple sets of data.
  • Aggregate statistics are performed on each of the multiple sets of data, and then the multiple aggregated data are classified and stored through multiple data processing units composed of memory and disks, so that the aggregated data stored in each data processing unit are of the same type. In this way, in a subsequent data query, the query can be performed on the corresponding data processing unit based on the time stamp of the data to be queried, which improves the efficiency of data query.
  • Fig. 3 is a schematic structural diagram of a data storage device according to an exemplary embodiment.
  • the data storage device may be implemented by software, hardware, or a combination of both.
  • the data storage device may include:
  • the obtaining module 310 is used to obtain multiple pieces of data from a data source, and each piece of data carries a time stamp;
  • the classification processing module 320 is configured to classify the multiple pieces of data according to the time stamp of each piece of data to obtain multiple sets of data;
  • the aggregation statistics module 330 is configured to aggregate statistics on each group of the multiple groups of data to obtain multiple aggregated data;
  • the classification storage module 340 is used to classify and store the plurality of aggregated data by a plurality of data processing units, wherein each data processing unit of the plurality of data processing units is composed of a memory and a disk, and the type of aggregated data stored in each data processing unit is the same.
  • the classification processing module 320 is used to:
  • the classification processing module 320 is used to:
  • the time interval is determined as the target time interval.
  • the classification processing module 320 is used to:
  • the updated time interval is determined as the target time interval.
  • the classification processing module 320 is used to:
  • the aggregation statistics module 330 is used to:
  • When the target time interval uses day as the time granularity, aggregate statistics are performed on the high-level data based on the three time granularities of year, month, and day, according to different time levels and data attributes, to obtain multiple first high-aggregated data; and aggregate statistics are performed on the low-level data based on the six time granularities of year, month, day, hour, minute, and second, according to different time levels and data attributes, to obtain multiple second high-aggregated data and multiple first low-aggregated data.
  • Different time levels include different dimensions of time granularity.
  • the classification storage module 340 is used to:
  • the row key in each aggregated data is obtained, where the row key of each aggregated data is generated during aggregation statistics and is used to indicate the time level and data attributes corresponding to that aggregated data;
  • the plurality of first high-aggregated data and the plurality of second high-aggregated data are stored by the high-level data processing unit, and, based on the row key in each first low-aggregated data, the plurality of first low-aggregated data are stored through the low-level data processing unit.
  • the classification storage module 340 is used to:
  • the classification storage module 340 is used to:
  • the acquired data is merged with each of the third highest aggregated data, and the merged data is stored in the memory of the high-level data processing unit.
  • the classification storage module 340 is used to:
  • for each second low-aggregated data in the plurality of second low-aggregated data, query whether data with the same row key as that second low-aggregated data is stored in the memory of the low-level data processing unit;
  • the queried data is merged with each second low-aggregated data, and the merged data is stored in the memory of the low-level data processing unit.
  • the classification storage module 340 is used to:
  • the data with the same row key as each second low-aggregated data is obtained;
  • the acquired data is merged with each of the second low-aggregated data, and the merged data is stored in the memory of the low-level data processing unit.
  • the classification storage module 340 is used to:
  • the data in the memory of the low-level data processing unit is stored to the disk of the low-level data processing unit.
  • multiple pieces of data carrying a time stamp are obtained from a data source, and the multiple pieces of data are classified and processed according to the time stamp of each piece of data to obtain multiple sets of data.
  • Aggregate statistics are performed on each of the multiple sets of data, and then the multiple aggregated data are classified and stored through multiple data processing units composed of memory and disks, so that the aggregated data stored in each data processing unit are of the same type. In this way, in a subsequent data query, the query can be performed on the corresponding data processing unit based on the time stamp of the data to be queried, which improves the efficiency of data query.
  • the data storage device provided in the above embodiments is described only by way of example using the division of the above functional modules.
  • In practical applications, the above functions can be allocated to different functional modules as needed; that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above.
  • the data storage device and the data storage method embodiment provided in the above embodiments belong to the same concept. For the specific implementation process, refer to the method embodiments, and details are not described here.
  • Fig. 4 is a schematic structural diagram of a storage device according to an exemplary embodiment. Specifically:
  • the storage device 400 includes a central processing unit (CPU) 401, a system memory 404 including a random access memory (RAM) 402 and a read only memory (ROM) 403, and a system bus 405 connecting the system memory 404 and the central processing unit 401.
  • the storage device 400 also includes a basic input/output system (I/O system) 406 that helps transfer information between various devices in the computer, and a mass storage device 407 for storing the operating system 413, application programs 414, and other program modules 415.
  • The basic input/output system 406 includes a display 408 for displaying information and an input device 409, such as a mouse or a keyboard, for a user to input information.
  • The display 408 and the input device 409 are both connected to the central processing unit 401 through the input/output controller 410, which is connected to the system bus 405.
  • The basic input/output system 406 may also include the input/output controller 410 for receiving and processing input from a number of other devices, such as a keyboard, mouse, or electronic stylus.
  • The input/output controller 410 also provides output to a display screen, printer, or other type of output device.
  • The mass storage device 407 is connected to the central processing unit 401 through a mass storage controller (not shown) connected to the system bus 405.
  • The mass storage device 407 and its associated computer-readable medium provide non-volatile storage for the storage device 400. That is, the mass storage device 407 may include a computer-readable medium (not shown) such as a hard disk or a CD-ROM drive.
  • Computer-readable media may include computer storage media and communication media.
  • Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data.
  • Computer storage media include RAM (random access memory), ROM (read-only memory), EPROM (erasable programmable read-only memory), EEPROM (electrically erasable programmable read-only memory), flash memory or other solid-state storage technologies; CD-ROM, DVD, or other optical storage; and tape cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices.
  • The computer storage medium is not limited to the above.
  • The system memory 404 and the mass storage device 407 described above may be collectively referred to as memory.
  • The storage device 400 may also be operated through a remote computer connected to a network such as the Internet. That is, the storage device 400 may be connected to the network 412 through the network interface unit 411 connected to the system bus 405; the network interface unit 411 may also be used to connect to other types of networks or remote computer systems (not shown).
  • The memory also includes one or more programs.
  • The one or more programs are stored in the memory and configured to be executed by the CPU.
  • The one or more programs include instructions for performing the data storage method provided by the embodiments of the present application.
  • An embodiment of the present application further provides a non-transitory computer-readable storage medium. When the instructions in the storage medium are executed by the processor of a mobile terminal, the mobile terminal can execute the data storage method provided by the embodiment shown in FIG. 1.
  • An embodiment of the present application also provides a computer program product containing instructions, which when run on a computer, causes the computer to execute the data storage method provided by the embodiment shown in FIG. 1 described above.
  • The program may be stored in a computer-readable storage medium.
  • The storage medium mentioned may be a read-only memory, a magnetic disk, an optical disk, or the like.

Abstract

Disclosed are a data storage method and apparatus, and a storage medium, falling within the technical field of data processing. The method comprises: acquiring multiple pieces of data from a data source, wherein each piece of data carries a timestamp; according to the timestamp of each piece of data, classifying the multiple pieces of data to obtain multiple groups of data; performing aggregated counting on each group of data in the multiple groups of data to obtain multiple pieces of aggregate data; and classifying and storing the multiple pieces of aggregate data by means of multiple data processing units, wherein each data processing unit in the multiple data processing units is constituted by an internal memory and a magnetic disk and the types of aggregate data stored in the data processing units are the same. In this case, when data is queried later, querying can be performed in a corresponding data processing unit based on the timestamp of data needing to be queried so that efficiency of data query is improved.

Description

Data storage method, device and storage medium
This application claims priority to Chinese Patent Application No. 201811204394.7, filed on October 16, 2018 and entitled "Data storage method, device and storage medium", and to Chinese Patent Application No. 201811236196.9, filed on October 23, 2018 and entitled "Multi-dimensional data processing method, apparatus, device, and storage medium", the entire contents of which are incorporated herein by reference.
Technical field
Embodiments of the present application relate to the technical field of data processing, and in particular, to a data storage method, device, and storage medium.
Background
With the rapid development of computer technology, the scale of data has expanded dramatically: the amount of data in every field keeps growing, and so does the number of data types. To meet data storage requirements, data can be stored in a data cube, which is a type of multi-dimensional matrix that can store data of multiple dimensions.
In the related art, storing data through a data cube may be implemented as follows: a storage device obtains the data to be stored and performs aggregate statistics on it to obtain corresponding aggregated data. The aggregated data is then merged with the data already in the data cube, and the merged data is stored in the data cube.
However, in the above implementation, if the amount of data stored in the data cube is very large, subsequently querying data from the data cube takes a long time.
Summary of the invention
Embodiments of the present application provide a data storage method, device, and storage medium, which can solve the problem in the related art that querying data takes a long time. The technical solution is as follows:
In a first aspect, a data storage method is provided. The method includes:
obtaining multiple pieces of data from a data source, each piece of data carrying a timestamp;
classifying the multiple pieces of data according to the timestamp of each piece of data to obtain multiple groups of data;
performing aggregate statistics on each group of data in the multiple groups of data to obtain multiple pieces of aggregated data; and
classifying and storing the multiple pieces of aggregated data through multiple data processing units, wherein the aggregated data stored in each data processing unit is of the same type.
In a second aspect, a data storage device is provided. The device includes:
an acquisition module, configured to obtain multiple pieces of data from a data source, each piece of data carrying a timestamp;
a classification processing module, configured to classify the multiple pieces of data according to the timestamp of each piece of data to obtain multiple groups of data;
an aggregation statistics module, configured to perform aggregate statistics on each group of data in the multiple groups of data to obtain multiple pieces of aggregated data; and
a classification storage module, configured to classify and store the multiple pieces of aggregated data through multiple data processing units, wherein the aggregated data stored in each data processing unit is of the same type.
In a third aspect, a computer-readable storage medium is provided. Instructions are stored on the computer-readable storage medium, and when the instructions are executed by a processor, the data storage method according to the first aspect is implemented.
In a fourth aspect, a computer program product containing instructions is provided. When run on a computer, it causes the computer to execute the data storage method according to the first aspect.
In a fifth aspect, a storage device is provided. The storage device includes a processor and a memory, wherein the memory is configured to store a computer program, and the processor is configured to execute the program stored in the memory to implement the data storage method according to the first aspect.
The beneficial effects of the technical solutions provided by the embodiments of the present application are as follows:
Multiple pieces of data carrying timestamps are obtained from a data source and classified according to the timestamp of each piece of data to obtain multiple groups of data. Aggregate statistics are performed on each of the multiple groups of data, and the resulting aggregated data is then classified and stored through multiple data processing units, so that the aggregated data stored in each data processing unit is of the same type. In this way, a subsequent query can be directed to the corresponding data processing unit based on the timestamp of the data to be queried, which improves query efficiency.
Brief description of the drawings
To explain the technical solutions in the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the present application; a person of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a data storage method according to an exemplary embodiment;
Fig. 2 is a schematic diagram of a data processing unit according to an exemplary embodiment;
Fig. 3 is a schematic structural diagram of a data storage device according to an exemplary embodiment;
Fig. 4 is a schematic structural diagram of a storage device according to an exemplary embodiment.
Detailed description
To make the objectives, technical solutions, and advantages of the present application clearer, the embodiments of the present application are described in further detail below with reference to the accompanying drawings.
Before the data storage method provided by the embodiments of the present application is described in detail, the terms, application scenarios, and implementation environment involved in the embodiments are briefly introduced.
First, the terms involved in the embodiments of the present application are briefly introduced.
Spark Streaming: a computing engine that processes data in batches. Its basic principle is to batch-process the input data at a fixed time interval; when the batch interval is shortened to the order of seconds, it can be used to process real-time data streams. It supports obtaining data from multiple kinds of data sources.
Data source: may include a Kafka data source, a Flume data source, a Twitter data source, a ZeroMQ data source, a Kinesis data source, and a TCP (Transmission Control Protocol) sockets data source.
Data cube: a type of multi-dimensional matrix that can be used for data analysis and indexing, and that supports real-time indexing of metadata by any number of keywords. A data cube may be composed of memory and disk (a distributed database), and multi-dimensional data storage is implemented on that memory and disk.
Next, the application scenarios involved in the embodiments of the present application are briefly introduced.
To adapt to the multi-dimensional development of data, the related art proposes storing data in data cubes. However, when the amount of data stored in a data cube is very large, querying data from it takes a long time. Moreover, in the related art, data stored through a data cube is generally written to the distributed database of the data cube, for example HBase. As a result, when the performance of the distributed database reaches a bottleneck, the update time of the data cube increases and the throughput of the system decreases. Frequent reads and writes to the distributed database also degrade its performance. For this reason, the embodiments of the present application provide a data storage method that can solve the above problems; for its specific implementation, see the embodiment shown in FIG. 1 below.
Next, the implementation environment involved in the embodiments of the present application is briefly introduced.
The data storage method provided by the embodiments of the present application may be executed by a storage device. The storage device includes multiple data processing units and stores data through them. Each of the multiple data processing units is composed of memory and disk. In some embodiments, a data processing unit may be the data cube described above. Further, the storage device may also include Spark Streaming, through which it obtains data from the data source.
Having introduced the terms, application scenarios, and implementation environment involved in the embodiments of the present application, the data storage method provided by the embodiments is described in detail below with reference to the accompanying drawings.
Fig. 1 is a flowchart of a data storage method according to an exemplary embodiment. The method is described here as implemented by the storage device described above, and may include the following steps:
Step 101: Obtain multiple pieces of data from a data source, each piece of data carrying a timestamp.
In some embodiments, the storage device may obtain the multiple pieces of data from the data source through Spark Streaming. For example, when the data source is a Kafka data source, multiple pieces of data are read from the Kafka data source through Spark Streaming, and each of them carries a timestamp. The timestamp of each piece of data may indicate the time at which that piece of data was generated.
Step 102: Classify the multiple pieces of data according to the timestamp of each piece of data to obtain multiple groups of data.
To store the multiple pieces of data separately, the storage device classifies them according to the timestamp of each piece of data. In some embodiments, the specific implementation may include the following steps:
1021: Obtain the latest time from the timestamps of the multiple pieces of data.
In some embodiments, the multiple pieces of data may be classified into two data types, recent data and old data. That is, the pieces that are recent data are placed in one class and the pieces that are old data in another. To do this, a recent time range needs to be determined.
To determine this recent time range, the latest time, that is, the most recent time, is obtained from the timestamps of the multiple pieces of data. For example, suppose the multiple pieces of data include first data, second data, third data, and fourth data, whose timestamps indicate June 25, 2017, June 29, 2017, July 2, 2017, and July 5, 2017, respectively. The storage device then obtains July 5, 2017 as the latest time.
1022: Determine a target time interval that contains the latest time and whose length is a preset threshold.
In some embodiments, determining the target time interval that contains the latest time and whose length is the preset threshold may be implemented in the following ways:
First implementation: when the latest time falls within a pre-stored time interval whose length is the preset threshold, that time interval is determined as the target time interval.
The preset threshold may be set by the user according to actual needs or set by default by the storage device; this is not limited in the embodiments of the present application. For example, the preset threshold may be 30 days.
If the latest time falls within the pre-stored time interval, then, relative to the multiple pieces of data obtained in this batch, the pre-stored time interval is the recent time range. In this case, the pre-stored time interval can be used directly as the target time interval, which corresponds to the recent time range described above.
Second implementation: when the latest time is greater than the right value of the pre-stored time interval whose length is the preset threshold, determine the time difference between the latest time and the right value of the time interval, and determine the time sum of the left value of the time interval and that time difference. The right value of the time interval is then updated to the latest time and the left value to the time sum, and the updated time interval is determined as the target time interval.
When the latest time is greater than the right value of the time interval, the pre-stored time interval needs to be updated to re-determine the target time interval. This is equivalent to sliding the time interval to the right by a duration equal to the difference between the latest time and the right value of the time interval. For example, if the pre-stored time interval is [July 1, July 15] and the latest time is July 16, the target time interval is [July 2, July 16].
Further, in this implementation, because the recent time range has been re-determined, the storage device may update the pre-stored time interval to the target time interval after determining it, so that the next batch of data can be processed based on the re-determined recent time range.
It should be noted that the above ways of determining the target time interval are only examples. In other embodiments, the target time interval may be determined in other ways. For example, a further operation may be applied to the time interval determined in the first implementation and the result used as the target time interval, such as adding a fixed value to both the left value and the right value of the time interval, where the fixed value can be set according to actual needs. Likewise, a further operation may be applied to the updated time interval determined in the second implementation, such as adding a fixed value to both its left value and its right value, to obtain the target time interval. This is not limited in the embodiments of the present application.
Further, in the above implementations, before determining the target time interval, the storage device may also check whether a pre-stored time interval exists. If it exists, the target time interval is determined according to the two implementations above. If it does not exist, the storage device may generate the target time interval from the latest time and the interval length. For example, it may determine the difference between the latest time and the preset threshold, and then use the latest time as the right value of the target time interval and the determined difference as its left value.
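The three cases for choosing the target time interval (no pre-stored interval, latest time inside the pre-stored interval, latest time beyond its right value) can be sketched as follows. This is a minimal illustration rather than the patented implementation: the function name `target_interval`, the tuple representation of an interval, and the use of calendar dates are assumptions. The text does not say what happens when the latest time is earlier than the left value; the sketch simply keeps the pre-stored interval in that case.

```python
from datetime import date, timedelta
from typing import Optional, Tuple

# Hypothetical representation: (left value, right value) of a time interval.
Interval = Tuple[date, date]


def target_interval(latest: date, stored: Optional[Interval], length: timedelta) -> Interval:
    """Return the target time interval for a batch whose newest timestamp is `latest`."""
    if stored is None:
        # No pre-stored interval: right value = latest time,
        # left value = latest time minus the preset threshold.
        return (latest - length, latest)
    left, right = stored
    if latest > right:
        # Second implementation: slide the interval right by the time difference.
        shift = latest - right
        return (left + shift, latest)
    # First implementation: the pre-stored interval is used as is.
    # (Assumption: also applied when latest is earlier than the left value.)
    return stored
```

With the pre-stored interval [July 1, July 15] and a latest time of July 16, the function returns [July 2, July 16], matching the example in the text. A caller would typically persist the returned interval as the new pre-stored interval before processing the next batch.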
1023: Classify the multiple pieces of data according to the timestamp of each piece of data and the target time interval.
To store the obtained pieces of data separately, they are classified according to the timestamp of each piece of data and the determined target time interval. In implementation, the pieces of data whose timestamps indicate a time earlier than the left value of the target time interval are determined as high-level data, and the pieces of data whose timestamps indicate a time within the target time interval are determined as low-level data.
When the time indicated by the timestamp of a piece of data is earlier than the left value of the target time interval, that piece of data precedes the target time interval and can be regarded as old data; such data is classified as high-level data. When the time indicated by the timestamp falls within the target time interval, the piece of data can be regarded as recent data; such data is classified as low-level data. Classification therefore yields two groups of data.
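The classification in step 1023 can be sketched as below. The record layout (a dict with a `timestamp` key) and the function name `classify` are hypothetical, chosen only for illustration. Because the right value of the target time interval is never earlier than the latest timestamp in the batch, every record that is not earlier than the left value is treated as falling inside the interval.

```python
from datetime import date


def classify(records, interval):
    """Split records into (high-level, low-level) groups by timestamp."""
    left, _right = interval
    high_level, low_level = [], []
    for rec in records:
        if rec["timestamp"] < left:
            # Earlier than the left value: old data, i.e. high-level data.
            high_level.append(rec)
        else:
            # Inside the target time interval: recent data, i.e. low-level data.
            low_level.append(rec)
    return high_level, low_level


# The four timestamps from the example in step 1021, with an assumed
# target time interval of [July 1, 2017, July 5, 2017].
batch = [
    {"id": 1, "timestamp": date(2017, 6, 25)},
    {"id": 2, "timestamp": date(2017, 6, 29)},
    {"id": 3, "timestamp": date(2017, 7, 2)},
    {"id": 4, "timestamp": date(2017, 7, 5)},
]
high, low = classify(batch, (date(2017, 7, 1), date(2017, 7, 5)))
```

Under the assumed interval, the first and second data are high-level data and the third and fourth data are low-level data.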
Step 103: Perform aggregate statistics on each group of data in the multiple groups of data to obtain multiple pieces of aggregated data.
Here, aggregate statistics are performed on the two groups obtained above, the high-level data and the low-level data. In a possible implementation, when the time indicated by the timestamp of each piece of data includes the six time granularities of year, month, day, hour, minute, and second, and the target time interval uses the day as its time granularity, the high-level data is aggregated based on the three time granularities of year, month, and day, according to different time levels and data attributes, to obtain multiple pieces of first high-aggregated data. The low-level data is aggregated based on the six time granularities of year, month, day, hour, minute, and second, according to different time levels and data attributes, to obtain multiple pieces of second high-aggregated data and multiple pieces of first low-aggregated data. Different time levels include time granularities of different dimensions.
Old data usually does not need fine-grained statistics, so it can be aggregated only on the coarse time granularities of year, month, and day. Recent data generally does need fine-grained statistics, so it can be aggregated on the time granularities of year, month, day, hour, minute, and second. In other words, the two groups of data obtained by classification are aggregated separately, based on different time granularities and according to different time levels and data attributes.
For ease of understanding, a one-dimensional data attribute is used as an example. For the high-level data, the storage device performs aggregate statistics based on the three time granularities of year, month, and day, according to different time levels and data attributes. The different time levels include a first time level, a second time level, and a third time level. The first time level includes the time granularity of year; the second time level includes the two time granularities of year and month; the third time level includes the three time granularities of year, month, and day.
That is, for the pieces of data included in the high-level data, the storage device performs aggregate statistics according to the first time level and the data attribute to obtain the first high-aggregated data corresponding to the first time level; according to the second time level and the data attribute to obtain the first high-aggregated data corresponding to the second time level; and according to the third time level and the data attribute to obtain the first high-aggregated data corresponding to the third time level.
For the low-level data, the storage device performs aggregate statistics based on the six time granularities of year, month, day, hour, minute, and second, according to different time levels and data attributes. In this case, the different time levels include not only the first, second, and third time levels above, but also a fourth, a fifth, and a sixth time level. The fourth time level includes the four time granularities of year, month, day, and hour; the fifth time level includes the five time granularities of year, month, day, hour, and minute; the sixth time level includes the six time granularities of year, month, day, hour, minute, and second.
That is, for the pieces of data included in the low-level data, the storage device performs aggregate statistics according to the first, second, and third time levels and the data attribute to obtain the second high-aggregated data corresponding to the first, second, and third time levels, respectively; and according to the fourth, fifth, and sixth time levels and the data attribute to obtain the first low-aggregated data corresponding to the fourth, fifth, and sixth time levels, respectively.
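As a rough illustration of aggregating by time level with a one-dimensional data attribute, the sketch below counts records per (time bucket, attribute value) pair at each level. Counting is only one possible aggregate statistic; the record layout, the `type` attribute, the function name `aggregate`, and the use of `strftime` patterns to express the six time levels are all assumptions made for this example.

```python
from collections import Counter
from datetime import datetime

# Time levels one to six, written as strftime patterns:
# year, year-month, year-month-day, ... down to the second.
LEVELS = [
    "%Y",
    "%Y-%m",
    "%Y-%m-%d",
    "%Y-%m-%d %H",
    "%Y-%m-%d %H:%M",
    "%Y-%m-%d %H:%M:%S",
]


def aggregate(records, attribute, num_levels):
    """Count records per (time bucket, attribute value) for levels 1..num_levels."""
    result = {}
    for level, pattern in enumerate(LEVELS[:num_levels], start=1):
        counts = Counter()
        for rec in records:
            bucket = rec["timestamp"].strftime(pattern)
            counts[(bucket, rec[attribute])] += 1
        result[level] = counts
    return result


low_level = [
    {"timestamp": datetime(2017, 7, 5, 10, 0, 0), "type": "car"},
    {"timestamp": datetime(2017, 7, 5, 11, 30, 0), "type": "car"},
]
# Low-level data uses all six levels; high-level data would use num_levels=3.
low_result = aggregate(low_level, "type", 6)
```

In this sketch, levels 1 to 3 of `low_result` play the role of the second high-aggregated data and levels 4 to 6 the role of the first low-aggregated data, while calling `aggregate(high_level, "type", 3)` on high-level data would give the first high-aggregated data.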
It should be noted that the description here takes a one-dimensional data attribute as an example. In other embodiments, when the data attribute is multi-dimensional, aggregation statistics also need to be performed in combination with the data attributes of the different dimensions. For example, taking a two-dimensional data attribute and aggregation statistics on the high-level data as an example, aggregation statistics need to be performed according to the first time level alone; according to the first time level and the data attribute of the first dimension; according to the first time level and the data attribute of the second dimension; and according to the first time level, the data attribute of the first dimension, and the data attribute of the second dimension. Similarly, the storage device performs aggregation statistics based on the second time level in combination with the data attributes of the two dimensions, and based on the third time level in combination with the data attributes of the two dimensions. In this way, 12 pieces of first high-aggregated data can be obtained.
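The count of 12 follows from 3 time levels multiplied by the 4 possible subsets of 2 attribute dimensions. A short sketch that enumerates these combinations is given below; the function name `grouping_sets` and the dimension labels are hypothetical and serve only as illustration.

```python
from itertools import combinations


def grouping_sets(time_levels, dimensions):
    """List every (time level, subset of attribute dimensions) combination."""
    result = []
    for level in time_levels:
        # Subsets of size 0 (time level alone) up to all dimensions together.
        for size in range(len(dimensions) + 1):
            for dims in combinations(dimensions, size):
                result.append((level, dims))
    return result


sets_for_high_level = grouping_sets([1, 2, 3], ["dimension_1", "dimension_2"])
```

For three time levels and two dimensions, the list contains 3 x 2^2 = 12 entries, matching the number stated above.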
It should also be noted that the above description takes aggregation statistics performed according to different time levels and data attributes, based on different time granularities, only as an example. In another embodiment, aggregation statistics may also be performed according to different time levels, data attributes, and data attribute values, based on different time granularities. For example, when the data attribute is age, the data attribute value may be an age value.
Step 104: When the multiple data processing units include a high-level data processing unit and a low-level data processing unit, obtain the row key in each piece of aggregated data. The row key of each piece of aggregated data is generated during aggregation statistics and indicates the time level and data attribute corresponding to that piece of aggregated data.
Each of the multiple data processing units consists of memory and a disk, and the aggregated data stored in each data processing unit is of the same type. For example, when the multiple data processing units include a high-level data processing unit and a low-level data processing unit, refer to FIG. 2, which is a schematic diagram of a data processing unit according to an exemplary embodiment.
To store the obtained aggregated data by category in the high-level data processing unit and the low-level data processing unit, the storage device obtains the row keys generated during aggregation statistics. It should be noted that, during aggregation statistics, when the time level and data attribute used are the same and the times corresponding to the time level fall within the same time range (for example, on the same day), the generated row keys are also the same. For example, when first data is aggregated based on July 2017 and a certain data attribute, and second data is also aggregated based on July 2017 and that data attribute, the two pieces of aggregated data obtained have the same row key.
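The text does not specify the row key format. The following sketch assumes a hypothetical layout, `L<level>:<truncated time>:<attribute>`, purely to show how two records from July 2017 with the same attribute map to the same row key at the year-month level.

```python
from datetime import datetime


def make_row_key(ts, level, attribute):
    """Build a row key from a time level, the truncated time, and an attribute.

    The key layout is an assumption; the source only states that the key
    indicates the time level and the data attribute.
    """
    fields = ts.strftime("%Y %m %d %H %M %S").split()
    return f"L{level}:{''.join(fields[:level])}:{attribute}"


key_a = make_row_key(datetime(2017, 7, 3, 8, 0, 0), 2, "vehicle")
key_b = make_row_key(datetime(2017, 7, 28, 9, 30, 0), 2, "vehicle")
key_c = make_row_key(datetime(2017, 8, 1, 0, 0, 0), 2, "vehicle")
```

Here `key_a` and `key_b` are identical because both timestamps fall in July 2017, while `key_c` differs because it falls in August.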
Step 105: Based on the row key in each piece of first high-aggregated data and each piece of second high-aggregated data, store the multiple pieces of first high-aggregated data and the multiple pieces of second high-aggregated data through the high-level data processing unit.
During storage, the multiple pieces of first high-aggregated data and the multiple pieces of second high-aggregated data are stored in the high-level data processing unit. That is, the high-aggregated data obtained by aggregating the high-level data and the part of the high-aggregated data obtained by aggregating the low-level data are stored in the same data processing unit.
In some embodiments, a specific implementation of storing the multiple pieces of first high-aggregated data and the multiple pieces of second high-aggregated data through the high-level data processing unit, based on the row key in each piece of first high-aggregated data and each piece of second high-aggregated data, may include: merging the high-aggregated data having the same row key among the multiple pieces of first high-aggregated data and the multiple pieces of second high-aggregated data to obtain multiple pieces of third high-aggregated data, and storing the multiple pieces of third high-aggregated data in the high-level data processing unit.
That is, when high-aggregated data is stored in the high-level data processing unit, it is not merged directly with the data already in that unit; it is merged only when a certain condition is met. As described above, during aggregation statistics, when the time level and data attribute used are the same and the times corresponding to the time level fall within the same time range, the generated row keys are also the same. In this embodiment of the present application, the high-aggregated data having the same row key is merged to obtain multiple pieces of third high-aggregated data, so that when the third high-aggregated data is stored in the high-level data processing unit, it can be merged with stored high-aggregated data having the same row key. In this way, a user can later retrieve, in a single query, multiple pieces of data that belong to the same time level and fall within the same time range, without merging at query time, which improves data query efficiency.
Further, a specific implementation of storing the multiple pieces of third high-aggregated data in the high-level data processing unit may include: for each piece of third high-aggregated data, querying whether the memory of the high-level data processing unit stores data having the same row key as that piece of third high-aggregated data; and when it does, merging the queried data with that piece of third high-aggregated data and storing the merged data in the memory of the high-level data processing unit.
To avoid frequent disk reads and writes, this embodiment of the present application first merges the high-aggregated data in memory; that is, it queries whether the memory of the high-level data processing unit stores data having the same row key as each piece of third high-aggregated data. If such data exists, the high-aggregated data having the same row key is merged directly in memory, and the merged high-aggregated data is stored in memory.
Further, when the memory of the high-level data processing unit does not store data having the same row key as a piece of third high-aggregated data, data having the same row key is obtained from the disk of the high-level data processing unit, the obtained data is merged with that piece of third high-aggregated data, and the merged data is stored in the memory of the high-level data processing unit.
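The storage procedure of step 105 (merge within the batch by row key, then merge with memory, and fall back to disk only on a memory miss) can be sketched as below. This is an illustrative sketch under stated assumptions: aggregated values are integer counts that merge by addition, the disk is simulated by a dictionary, and a row key found in neither memory nor disk is simply placed in memory, a case the text does not spell out. The class name `DataProcessingUnit` is hypothetical.

```python
class DataProcessingUnit:
    """A data processing unit made of memory plus a disk (both simulated)."""

    def __init__(self):
        self.memory = {}  # row key -> aggregated value held in memory
        self.disk = {}    # stand-in for on-disk storage (assumption)

    def store(self, aggregates):
        """Store (row_key, value) pairs, merging entries that share a row key."""
        # Merge aggregates inside the batch that share a row key
        # (the "third high-aggregated data" in the text).
        merged = {}
        for row_key, value in aggregates:
            merged[row_key] = merged.get(row_key, 0) + value

        for row_key, value in merged.items():
            if row_key in self.memory:
                # Memory hit: merge in memory without touching the disk.
                self.memory[row_key] += value
            elif row_key in self.disk:
                # Memory miss: fetch from disk, merge, keep the result in memory.
                self.memory[row_key] = self.disk[row_key] + value
            else:
                # Assumed behavior for a row key that is new to the unit.
                self.memory[row_key] = value
```

The same class could serve as the low-level data processing unit, since step 106 follows the same pattern with low-aggregated data.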
Step 106: Based on the row key in each piece of first low-aggregated data, store the multiple pieces of first low-aggregated data through the low-level data processing unit.
During data storage, the multiple pieces of first low-aggregated data obtained through aggregation statistics are stored in the low-level data processing unit. Further, a specific implementation in which the storage device stores the multiple pieces of first low-aggregated data through the low-level data processing unit, based on the row key in each piece of first low-aggregated data, may include: merging the first low-aggregated data having the same row key among the multiple pieces of first low-aggregated data to obtain multiple pieces of second low-aggregated data, and storing the multiple pieces of second low-aggregated data in the low-level data processing unit.
Similarly, when the first low-aggregated data is stored in the low-level data processing unit, it is not merged directly with the data already in that unit; it is merged only when a certain condition is met. As described above, during aggregation statistics, when the time level and data attribute used are the same and the times corresponding to the time level fall within the same time range, the generated row keys are also the same. In this embodiment of the present application, the first low-aggregated data having the same row key is merged to obtain multiple pieces of second low-aggregated data, so that when the second low-aggregated data is stored in the low-level data processing unit, it can be merged with stored low-aggregated data having the same row key. In this way, a user can later retrieve, in a single query, multiple pieces of data that belong to the same time level and fall within the same time range, without merging at query time, which improves data query efficiency.
Further, a specific implementation of storing the multiple pieces of second low-aggregated data in the low-level data processing unit may include: for each piece of second low-aggregated data, querying whether the memory of the low-level data processing unit stores data having the same row key as that piece of second low-aggregated data; and when it does, merging the queried data with that piece of second low-aggregated data and storing the merged data in the memory of the low-level data processing unit.
To avoid frequent disk reads and writes, this embodiment of the present application first merges the low-aggregated data in memory; that is, it queries whether the memory of the low-level data processing unit stores data having the same row key as each piece of second low-aggregated data. If such data exists, the low-aggregated data having the same row key is merged directly in memory, and the merged data is stored in memory.
Further, when the memory of the low-level data processing unit does not store data having the same row key as a piece of second low-aggregated data, data having the same row key is obtained from the disk of the low-level data processing unit, the obtained data is merged with that piece of second low-aggregated data, and the merged data is stored in the memory of the low-level data processing unit.
Further, when the amount of data in the memory of the high-level data processing unit reaches a preset quantity threshold, the data in the memory of the high-level data processing unit is stored to the disk of the high-level data processing unit. Alternatively, when the amount of data in the memory of the low-level data processing unit reaches the preset quantity threshold, the data in the memory of the low-level data processing unit is stored to the disk of the low-level data processing unit.
The preset quantity threshold may be customized by the user according to actual needs or set by default by the storage device, which is not limited in this embodiment of the present application.
In this way, the merged high-aggregated data is first stored in the memory of the high-level data processing unit, and the merged low-aggregated data is first stored in the memory of the low-level data processing unit. The data in memory is written to disk only when the data stored in the memory of the high-level data processing unit, or in the memory of the low-level data processing unit, reaches a certain amount, which reduces the number of interactions with the disk. Moreover, a query for high-aggregated or low-aggregated data first searches memory and searches the disk only when the data is not found in memory, which avoids frequent disk reads and writes and improves system performance. In addition, this storage approach can reduce the disk usage of both the high-level data processing unit and the low-level data processing unit.
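The threshold-triggered write to disk and the memory-first query can be sketched as follows. The sketch assumes that the "amount of data" is the number of entries in memory and that memory is cleared after a flush; neither detail is stated in the text. The class name `BufferedUnit` and its `disk_writes` counter are hypothetical, and the disk is again simulated by a dictionary.

```python
class BufferedUnit:
    """Memory-first store that writes to a simulated disk in batches."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold  # preset quantity threshold
        self.memory = {}
        self.disk = {}        # stand-in for on-disk storage (assumption)
        self.disk_writes = 0  # counts batch writes, for illustration only

    def put(self, row_key, value):
        self.memory[row_key] = value
        # Write to disk only once the in-memory entry count hits the threshold.
        if len(self.memory) >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.disk.update(self.memory)
        self.memory.clear()  # assumed: memory is emptied after the write
        self.disk_writes += 1

    def get(self, row_key):
        # Query memory first; consult the disk only on a memory miss.
        if row_key in self.memory:
            return self.memory[row_key]
        return self.disk.get(row_key)
```

With a threshold of 3, three `put` calls cause a single disk write instead of three, which is the reduction in disk interactions that the passage describes.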
It should be noted that steps 105 and 106 above need not be performed in any particular order.
It should also be noted that steps 104 to 106 above implement the operation of storing the multiple pieces of aggregated data by category through the multiple data processing units.
Further, while each batch of data is being processed, when the target time interval is updated, the storage device may also delete the data in the low-level data processing unit that does not fall within the target time interval, which saves storage space in the low-level data processing unit.
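A possible sketch of this cleanup is shown below. It assumes that each low-level entry records the latest timestamp merged into it (a hypothetical `last_seen` field) and that the target time interval is closed on both ends; the text specifies neither.

```python
from datetime import datetime


def evict_outside_interval(entries, left, right):
    """Keep only low-level entries whose `last_seen` lies in [left, right].

    `entries` maps row key -> (last_seen, value); this layout is an assumption.
    """
    return {
        row_key: (last_seen, value)
        for row_key, (last_seen, value) in entries.items()
        if left <= last_seen <= right
    }
```

Entries older than the left bound of the updated interval are dropped, while entries inside the interval are kept unchanged.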
Further, to ensure that data obtained from the data source is not repeated, an offset of the obtained data may be recorded after the obtained batch of data has been processed. The offset indicates the position in the data source of the data obtained so far. The next time data is obtained from the data source, the next batch can be fetched according to the recorded offset. For example, if the data in the data source is numbered sequentially and 5 pieces of data are obtained this time, the offset is 5, and the next fetch starts from the sixth piece of data.
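The offset bookkeeping can be illustrated with a short sketch in which the data source is modeled as an in-memory list; a real data source and its API are outside the scope of the text, and the function name `fetch_batch` is hypothetical.

```python
def fetch_batch(source, offset, batch_size):
    """Return the next batch and the new offset to record after processing."""
    batch = source[offset:offset + batch_size]
    return batch, offset + len(batch)


source = [f"record-{i}" for i in range(1, 13)]  # records numbered 1 to 12
first_batch, offset = fetch_batch(source, 0, 5)
second_batch, offset = fetch_batch(source, offset, 5)
```

After the first call the recorded offset is 5, so the second batch begins with the sixth record, and no record is fetched twice.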
In this embodiment of the present application, multiple pieces of data carrying timestamps are obtained from a data source, and the multiple pieces of data are classified according to the timestamp of each piece of data to obtain multiple groups of data. Aggregation statistics are performed on each of the groups, and the resulting multiple pieces of aggregated data are then stored by category through multiple data processing units, each consisting of memory and a disk, so that the aggregated data stored in each data processing unit is of the same type. In this way, a subsequent query can be served from the corresponding data processing unit based on the timestamp of the data to be queried, which improves data query efficiency.
FIG. 3 is a schematic structural diagram of a data storage apparatus according to an exemplary embodiment. The data storage apparatus may be implemented by software, hardware, or a combination of the two. The data storage apparatus may include:
an obtaining module 310, configured to obtain multiple pieces of data from a data source, where each piece of data carries a timestamp;
a classification processing module 320, configured to classify the multiple pieces of data according to the timestamp of each piece of data to obtain multiple groups of data;
an aggregation statistics module 330, configured to perform aggregation statistics on each of the multiple groups of data to obtain multiple pieces of aggregated data; and
a classification storage module 340, configured to store the multiple pieces of aggregated data by category through multiple data processing units, where each of the multiple data processing units consists of memory and a disk, and the aggregated data stored in each data processing unit is of the same type.
Optionally, the classification processing module 320 is configured to:
obtain the latest time from the timestamps of the multiple pieces of data;
determine a target time interval that contains the latest time and whose length is a preset threshold; and
classify the multiple pieces of data according to the timestamp of each piece of data and the target time interval.
Optionally, the classification processing module 320 is configured to:
when the latest time falls within a pre-stored time interval whose length is the preset threshold, determine that time interval as the target time interval.
Optionally, the classification processing module 320 is configured to:
when the latest time is greater than the right bound of a pre-stored time interval whose length is the preset threshold, determine the time difference between the latest time and the right bound of the time interval;
determine the time sum of the left bound of the time interval and the time difference;
update the right bound of the time interval to the latest time, and update the left bound of the time interval to the time sum; and
determine the updated time interval as the target time interval.
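The interval update above amounts to sliding a fixed-length window so that its right bound equals the latest time. A minimal sketch follows; the function name is hypothetical, and the case where the latest time precedes the left bound is not covered by this passage, so the sketch leaves the interval unchanged there as an assumption.

```python
from datetime import datetime


def update_target_interval(left, right, latest):
    """Return the target time interval for a batch whose newest timestamp is `latest`."""
    if left <= latest <= right:
        # The latest time already falls inside the stored interval.
        return left, right
    if latest > right:
        difference = latest - right      # time difference to the right bound
        return left + difference, latest  # shift both bounds; length is preserved
    # latest < left: not described in this passage (assumed: keep the interval).
    return left, right


new_left, new_right = update_target_interval(
    datetime(2017, 7, 1), datetime(2017, 7, 2), datetime(2017, 7, 2, 6)
)
```

In this example the one-day interval shifts forward by six hours, so its length still equals the preset threshold.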
Optionally, the classification processing module 320 is configured to:
determine, as high-level data, the data among the multiple pieces of data whose timestamp indicates a time earlier than the left bound of the target time interval, and determine, as low-level data, the data among the multiple pieces of data whose timestamp indicates a time within the target time interval.
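This classification rule can be sketched as a simple partition by timestamp. The sketch assumes records are (timestamp, payload) pairs and that the interval is closed on both ends; the function name `classify` is hypothetical. Because the target time interval contains the latest time of the batch, no timestamp is expected to exceed the right bound.

```python
from datetime import datetime


def classify(records, left, right):
    """Split records into high-level (older than `left`) and low-level (in interval)."""
    high_level, low_level = [], []
    for ts, payload in records:
        if ts < left:
            high_level.append((ts, payload))
        elif ts <= right:
            low_level.append((ts, payload))
    return high_level, low_level
```

Older records thus go to coarse, day-level-or-above aggregation, while recent records are additionally aggregated down to the second.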
Optionally, the aggregation statistics module 330 is configured to:
when the time indicated by the timestamp of each piece of data includes the multiple time granularities of year, month, day, hour, minute, and second, and the target time interval uses day as its time granularity, perform aggregation statistics on the high-level data according to different time levels and data attributes, based on the three time granularities of year, month, and day, to obtain multiple pieces of first high-aggregated data; and perform aggregation statistics on the low-level data according to different time levels and data attributes, based on the six time granularities of year, month, day, hour, minute, and second, to obtain multiple pieces of second high-aggregated data and multiple pieces of first low-aggregated data, where different time levels include time granularities of different dimensions.
Optionally, the classification storage module 340 is configured to:
when the multiple data processing units include a high-level data processing unit and a low-level data processing unit, obtain the row key in each piece of aggregated data, where the row key of each piece of aggregated data is generated during aggregation statistics and indicates the time level and data attribute corresponding to that piece of aggregated data; and
based on the row key in each piece of first high-aggregated data and each piece of second high-aggregated data, store the multiple pieces of first high-aggregated data and the multiple pieces of second high-aggregated data through the high-level data processing unit, and based on the row key in each piece of first low-aggregated data, store the multiple pieces of first low-aggregated data through the low-level data processing unit.
Optionally, the classification storage module 340 is configured to:
merge the high-aggregated data having the same row key among the multiple pieces of first high-aggregated data and the multiple pieces of second high-aggregated data to obtain multiple pieces of third high-aggregated data;
for each piece of third high-aggregated data, query whether the memory of the high-level data processing unit stores data having the same row key as that piece of third high-aggregated data; and
when the memory of the high-level data processing unit stores data having the same row key as that piece of third high-aggregated data, merge the queried data with that piece of third high-aggregated data, and store the merged data in the memory of the high-level data processing unit.
Optionally, the classification storage module 340 is configured to:
when the memory of the high-level data processing unit does not store data having the same row key as a piece of third high-aggregated data, obtain data having the same row key from the disk of the high-level data processing unit; and
merge the obtained data with that piece of third high-aggregated data, and store the merged data in the memory of the high-level data processing unit.
Optionally, the classification storage module 340 is configured to:
merge the first low-aggregated data having the same row key among the multiple pieces of first low-aggregated data to obtain multiple pieces of second low-aggregated data;
for each piece of second low-aggregated data, query whether the memory of the low-level data processing unit stores data having the same row key as that piece of second low-aggregated data; and
when the memory of the low-level data processing unit stores data having the same row key as that piece of second low-aggregated data, merge the queried data with that piece of second low-aggregated data, and store the merged data in the memory of the low-level data processing unit.
Optionally, the classification storage module 340 is configured to:
when the memory of the low-level data processing unit does not store data having the same row key as a piece of second low-aggregated data, obtain data having the same row key from the disk of the low-level data processing unit; and
merge the obtained data with that piece of second low-aggregated data, and store the merged data in the memory of the low-level data processing unit.
Optionally, the classification storage module 340 is configured to:
when the amount of data in the memory of the high-level data processing unit reaches a preset quantity threshold, store the data in the memory of the high-level data processing unit to the disk of the high-level data processing unit; or
when the amount of data in the memory of the low-level data processing unit reaches the preset quantity threshold, store the data in the memory of the low-level data processing unit to the disk of the low-level data processing unit.
In this embodiment of the present application, multiple pieces of data carrying timestamps are obtained from a data source, and the multiple pieces of data are classified according to the timestamp of each piece of data to obtain multiple groups of data. Aggregation statistics are performed on each of the groups, and the resulting multiple pieces of aggregated data are then stored by category through multiple data processing units, each consisting of memory and a disk, so that the aggregated data stored in each data processing unit is of the same type. In this way, a subsequent query can be served from the corresponding data processing unit based on the timestamp of the data to be queried, which improves data query efficiency.
It should be noted that, when the data storage apparatus provided in the above embodiment implements the data storage method, the division into the functional modules described above is used only as an example. In practical applications, the functions may be allocated to different functional modules as needed; that is, the internal structure of the device may be divided into different functional modules to complete all or part of the functions described above. In addition, the data storage apparatus provided in the above embodiment and the data storage method embodiment belong to the same concept. For the specific implementation process, refer to the method embodiment; details are not repeated here.
图4是根据一示例性实施例示出的一种存储设备的结构示意图。具体来讲:Fig. 4 is a schematic structural diagram of a storage device according to an exemplary embodiment. Specifically:
存储设备400包括中央处理单元(CPU)401、包括随机存取存储器(RAM)402和只读存储器(ROM)403的系统存储器404,以及连接系统存储器404和中央处理单元401的系统总线405。存储设备400还包括帮助计算机内的各个器件之间传输信息的基本输入/输出系统(I/O系统)406,和用于存储操作系统413、应用程序414和其他程序模块415的大容量存储设备407。The storage device 400 includes a central processing unit (CPU) 401, a system memory 404 including a random access memory (RAM) 402 and a read only memory (ROM) 403, and a system bus 405 connecting the system memory 404 and the central processing unit 401. The storage device 400 also includes a basic input / output system (I / O system) 406 that helps transfer information between various devices in the computer, and a large-capacity storage device for storing the operating system 413, application programs 414, and other program modules 415 407.
基本输入/输出系统406包括有用于显示信息的显示器408和用于用户输入信息的诸如鼠标、键盘之类的输入设备409。其中显示器408和输入设备409都通过连接到系统总线405的输入输出控制器410连接到中央处理单元401。基本输入/输出系统406还可以包括输入输出控制器410以用于接收和处理来自键盘、鼠标、或电子触控笔等多个其他设备的输入。类似地,输入输出控制器410还提供输出到显示屏、打印机或其他类型的输出设备。The basic input / output system 406 includes a display 408 for displaying information and an input device 409 for a user to input information, such as a mouse and a keyboard. The display 408 and the input device 409 are both connected to the central processing unit 401 through the input and output controller 410 connected to the system bus 405. The basic input / output system 406 may also include an input-output controller 410 for receiving and processing input from a number of other devices such as a keyboard, mouse, or electronic stylus. Similarly, the input output controller 410 also provides output to a display screen, printer, or other type of output device.
大容量存储设备407通过连接到系统总线405的大容量存储控制器(未示出)连接到中央处理单元401。大容量存储设备407及其相关联的计算机可读介质为存储设备400提供非易失性存储。也就是说,大容量存储设备407可以包括诸如硬盘或者CD-ROM驱动器之类的计算机可读介质(未示出)。The mass storage device 407 is connected to the central processing unit 401 through a mass storage controller (not shown) connected to the system bus 405. The mass storage device 407 and its associated computer-readable medium provide non-volatile storage for the storage device 400. That is, the mass storage device 407 may include a computer-readable medium (not shown) such as a hard disk or CD-ROM drive.
不失一般性,计算机可读介质可以包括计算机存储介质和通信介质。计算机存储介质包括以用于存储诸如计算机可读指令、数据结构、程序模块或其他数据等信息的任何方法或技术实现的易失性和非易失性、可移动和不可移动介质。计算机存储介质包括RAM、ROM、EPROM、EEPROM、闪存或其他固态存储其技术,CD-ROM、DVD或其他光学存储、磁带盒、磁带、磁盘存储或其他磁性存储设备。当然,本领域技术人员可知计算机存储介质不局限于上述几种。上述的系统存储器404和大容量存储设备407可以统称为存储器。Without loss of generality, computer-readable media may include computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media include RAM, ROM, EPROM, EEPROM, flash memory, or other solid-state storage technologies, CD-ROM, DVD, or other optical storage, tape cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. Of course, those skilled in the art may know that the computer storage medium is not limited to the above. The above-mentioned system memory 404 and mass storage device 407 may be collectively referred to as a memory.
According to various embodiments of the present application, the storage device 400 may also operate by connecting, through a network such as the Internet, to a remote computer on that network. That is, the storage device 400 may be connected to the network 412 through a network interface unit 411 connected to the system bus 405; in other words, the network interface unit 411 may also be used to connect to other types of networks or remote computer systems (not shown).
The memory further includes one or more programs, which are stored in the memory and configured to be executed by the CPU. The one or more programs contain instructions for performing the data storage method provided by the embodiments of the present application.
An embodiment of the present application further provides a non-transitory computer-readable storage medium. When instructions in the storage medium are executed by a processor of a mobile terminal, the mobile terminal is enabled to perform the data storage method provided by the embodiment shown in FIG. 1.
An embodiment of the present application further provides a computer program product containing instructions which, when run on a computer, cause the computer to perform the data storage method provided by the embodiment shown in FIG. 1.
A person of ordinary skill in the art will understand that all or some of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.
The above are merely preferred embodiments of the present application and are not intended to limit it. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present application shall fall within its scope of protection.

Claims (26)

  1. A data storage method, characterized in that the method comprises:
    obtaining multiple pieces of data from a data source, each piece of data carrying a timestamp;
    classifying the multiple pieces of data according to the timestamp of each piece of data to obtain multiple groups of data;
    performing aggregate statistics on each of the multiple groups of data to obtain multiple pieces of aggregated data; and
    storing the multiple pieces of aggregated data by category through multiple data processing units, wherein the aggregated data stored in each data processing unit is of the same type.
  2. The method according to claim 1, characterized in that the classifying the multiple pieces of data according to the timestamp of each piece of data comprises:
    obtaining the latest time from the timestamps of the multiple pieces of data;
    determining a target time interval that contains the latest time and whose length is a preset threshold; and
    classifying the multiple pieces of data according to the timestamp of each piece of data and the target time interval.
  3. The method according to claim 2, characterized in that the determining a target time interval that contains the latest time and whose length is a preset threshold comprises:
    when the latest time falls within a pre-stored time interval whose length is the preset threshold, determining the pre-stored time interval as the target time interval.
  4. The method according to claim 2, characterized in that the determining a target time interval that contains the latest time and whose length is a preset threshold comprises:
    when the latest time is greater than the right value of a pre-stored time interval whose length is the preset threshold, determining the time difference between the latest time and the right value of the time interval;
    determining the time sum of the left value of the time interval and the time difference;
    updating the right value of the time interval to the latest time, and updating the left value of the time interval to the time sum; and
    determining the updated time interval as the target time interval.
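
The interval update of claims 2 to 4 can be sketched as follows. This is a minimal illustration only, not the applicant's implementation: the `TimeInterval` class, the function name `target_interval`, and the use of integer epoch seconds are assumptions introduced here. The claims do not say what happens when the latest time is earlier than the left value, so the sketch simply keeps the stored interval in that case.

```python
from dataclasses import dataclass


@dataclass
class TimeInterval:
    left: int   # left value (lower bound), in epoch seconds (assumed unit)
    right: int  # right value (upper bound), in epoch seconds (assumed unit)


def target_interval(stored, timestamps):
    """Return the target time interval for one batch of timestamps."""
    latest = max(timestamps)  # claim 2: obtain the latest time
    if latest > stored.right:
        # Claim 4: shift both bounds by the same difference, so the
        # interval length (the preset threshold) stays unchanged.
        diff = latest - stored.right
        return TimeInterval(left=stored.left + diff, right=latest)
    # Claim 3: the latest time lies within the stored interval.
    # (Assumption: the interval is also kept when latest < stored.left.)
    return stored
```

For example, with a stored interval of one day, `TimeInterval(0, 86400)`, and a batch whose latest timestamp is 90000, the difference is 3600 and the target interval becomes `TimeInterval(3600, 90000)`, still one day long.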
  5. The method according to any one of claims 2-4, characterized in that the classifying the multiple pieces of data according to the timestamp of each piece of data and the target time interval comprises:
    determining, among the multiple pieces of data, the data whose timestamp indicates a time earlier than the left value of the target time interval as high-level data, and determining the data whose timestamp indicates a time within the target time interval as low-level data.
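
The classification of claim 5 amounts to a single comparison against the bounds of the target time interval. The sketch below assumes records are `(timestamp, payload)` pairs with integer timestamps; the function name `classify` and the record layout are hypothetical. Because the target interval always contains the latest time of the batch, no record can lie to the right of it.

```python
def classify(records, left, right):
    """Split (timestamp, payload) records into high-level and low-level data.

    left and right are the left and right values of the target time interval.
    """
    high_level, low_level = [], []
    for timestamp, payload in records:
        if timestamp < left:
            # Earlier than the left value of the interval: high-level data.
            high_level.append((timestamp, payload))
        elif timestamp <= right:
            # Inside the target time interval: low-level data.
            low_level.append((timestamp, payload))
    return high_level, low_level
```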
  6. The method according to claim 5, characterized in that, when the time indicated by the timestamp of each piece of data includes the multiple time granularities of year, month, day, hour, minute, and second, and the target time interval uses day as its time granularity, the performing aggregate statistics on each of the multiple groups of data to obtain multiple pieces of aggregated data comprises:
    performing aggregate statistics on the high-level data based on the three time granularities of year, month, and day, according to different time levels and data attributes, to obtain multiple pieces of first high aggregated data; and performing aggregate statistics on the low-level data based on the six time granularities of year, month, day, hour, minute, and second, according to different time levels and data attributes, to obtain multiple pieces of second high aggregated data and multiple pieces of first low aggregated data, wherein different time levels include time granularities of different dimensions.
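
One way to picture the aggregation of claim 6 is a count per (time level, time bucket, data attribute). The sketch below rests on several assumptions that the claim does not state: the statistic is a simple count, each record is a `(datetime, attribute)` pair, and the low-level results at the year, month, and day levels are the "second high aggregated data" while those at the hour, minute, and second levels are the "first low aggregated data". All names are hypothetical.

```python
from collections import Counter
from datetime import datetime

# strftime pattern that truncates a timestamp to each time level.
FORMATS = {
    "year": "%Y", "month": "%Y%m", "day": "%Y%m%d",
    "hour": "%Y%m%d%H", "minute": "%Y%m%d%H%M", "second": "%Y%m%d%H%M%S",
}
HIGH_LEVELS = ("year", "month", "day")
ALL_LEVELS = tuple(FORMATS)


def aggregate(records, levels):
    """Count records per (time level, time bucket, attribute)."""
    counts = Counter()
    for timestamp, attribute in records:
        for level in levels:
            bucket = timestamp.strftime(FORMATS[level])
            counts[(level, bucket, attribute)] += 1
    return counts


def aggregate_high_level(records):
    """High-level data: year, month, and day levels only."""
    return aggregate(records, HIGH_LEVELS)


def aggregate_low_level(records):
    """Low-level data: all six levels, split into high and low results."""
    counts = aggregate(records, ALL_LEVELS)
    second_high = {k: v for k, v in counts.items() if k[0] in HIGH_LEVELS}
    first_low = {k: v for k, v in counts.items() if k[0] not in HIGH_LEVELS}
    return second_high, first_low
```

Each tuple key here plays the role of the row key introduced in claim 7: it identifies the time level and the data attribute of one aggregated value.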
  7. The method according to claim 6, characterized in that, when the multiple data processing units include a high-level data processing unit and a low-level data processing unit, the storing the multiple pieces of aggregated data by category through multiple data processing units comprises:
    obtaining the row key of each piece of aggregated data, wherein the row key of each piece of aggregated data is generated during the aggregate statistics and indicates the time level and data attribute corresponding to that piece of aggregated data; and
    storing the multiple pieces of first high aggregated data and the multiple pieces of second high aggregated data through the high-level data processing unit based on the row key of each piece of first high aggregated data and each piece of second high aggregated data, and storing the multiple pieces of first low aggregated data through the low-level data processing unit based on the row key of each piece of first low aggregated data.
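
The routing in claim 7 can be illustrated with a row key that encodes the time level. The claim does not disclose the row key layout, so the `level|bucket|attribute` string format, the helper names, and the choice of year, month, and day as the levels handled by the high-level unit are all assumptions made for this sketch.

```python
HIGH_LEVELS = {"year", "month", "day"}  # assumed levels of the high-level unit


def make_row_key(level, bucket, attribute):
    """Build a hypothetical row key such as 'day|20181016|cam1'."""
    return f"{level}|{bucket}|{attribute}"


def route(aggregates):
    """Split {row_key: value} between the high-level and low-level units."""
    to_high_unit, to_low_unit = {}, {}
    for row_key, value in aggregates.items():
        level = row_key.split("|", 1)[0]  # the time level is the first field
        target = to_high_unit if level in HIGH_LEVELS else to_low_unit
        target[row_key] = value
    return to_high_unit, to_low_unit
```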
  8. The method according to claim 7, characterized in that the storing the multiple pieces of first high aggregated data and the multiple pieces of second high aggregated data through the high-level data processing unit based on the row key of each piece of first high aggregated data and each piece of second high aggregated data comprises:
    merging the high aggregated data having the same row key among the multiple pieces of first high aggregated data and the multiple pieces of second high aggregated data, to obtain multiple pieces of third high aggregated data;
    for each piece of third high aggregated data, querying whether data having the same row key as that piece of third high aggregated data is stored in the memory of the high-level data processing unit; and
    when data having the same row key as the piece of third high aggregated data is stored in the memory of the high-level data processing unit, merging the queried data with the piece of third high aggregated data, and storing the merged data in the memory of the high-level data processing unit.
  9. The method according to claim 8, characterized in that, after the querying whether data having the same row key as each piece of third high aggregated data is stored in the memory of the high-level data processing unit, the method further comprises:
    when no data having the same row key as the piece of third high aggregated data is stored in the memory of the high-level data processing unit, obtaining data having the same row key as the piece of third high aggregated data from the disk of the high-level data processing unit; and
    merging the obtained data with the piece of third high aggregated data, and storing the merged data in the memory of the high-level data processing unit.
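
The memory-first, disk-fallback merge of claims 8 and 9 can be sketched as below. This is an illustration under stated assumptions rather than the claimed implementation: memory and disk are modeled as plain dictionaries, "merging" is taken to mean summing counts, and a row key found in neither place is treated as starting from zero. The class name `ProcessingUnit` is hypothetical.

```python
class ProcessingUnit:
    """Toy model of one data processing unit with a memory and a disk."""

    def __init__(self, disk=None):
        self.memory = {}               # row_key -> aggregated count
        self.disk = dict(disk or {})   # stand-in for on-disk storage

    def store(self, batch):
        """Merge a batch of (row_key, count) pairs into memory."""
        # Step 1: merge entries of the batch that share a row key.
        merged = {}
        for row_key, count in batch:
            merged[row_key] = merged.get(row_key, 0) + count
        # Step 2: merge each result with what the unit already holds.
        for row_key, count in merged.items():
            if row_key in self.memory:
                # Claim 8: same row key found in memory.
                self.memory[row_key] += count
            else:
                # Claim 9: fall back to the copy on disk (0 if absent,
                # which is an assumption of this sketch).
                self.memory[row_key] = self.disk.get(row_key, 0) + count
```

The same shape applies to the low-level data processing unit in claims 10 and 11, with first low aggregated data in place of the high aggregated data.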
  10. The method according to claim 7, characterized in that the storing the multiple pieces of first low aggregated data through the low-level data processing unit based on the row key of each piece of first low aggregated data comprises:
    merging the first low aggregated data having the same row key among the multiple pieces of first low aggregated data, to obtain multiple pieces of second low aggregated data;
    for each piece of second low aggregated data, querying whether data having the same row key as that piece of second low aggregated data is stored in the memory of the low-level data processing unit; and
    when data having the same row key as the piece of second low aggregated data is stored in the memory of the low-level data processing unit, merging the queried data with the piece of second low aggregated data, and storing the merged data in the memory of the low-level data processing unit.
  11. The method according to claim 10, characterized in that, after the querying whether data having the same row key as each piece of second low aggregated data is stored in the memory of the low-level data processing unit, the method further comprises:
    when no data having the same row key as the piece of second low aggregated data is stored in the memory of the low-level data processing unit, obtaining data having the same row key as the piece of second low aggregated data from the disk of the low-level data processing unit; and
    merging the obtained data with the piece of second low aggregated data, and storing the merged data in the memory of the low-level data processing unit.
  12. The method according to claim 7, characterized in that the method further comprises:
    when the amount of data in the memory of the high-level data processing unit reaches a preset quantity threshold, storing the data in the memory of the high-level data processing unit to the disk of the high-level data processing unit; or,
    when the amount of data in the memory of the low-level data processing unit reaches the preset quantity threshold, storing the data in the memory of the low-level data processing unit to the disk of the low-level data processing unit.
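
The threshold-triggered write to disk in claim 12 can be sketched as follows. The claim does not define how the amount of data is measured, so this sketch assumes it is the number of row keys held in memory; the class name `BufferedUnit`, the dictionary stand-in for the disk, and clearing memory after the write are likewise assumptions.

```python
class BufferedUnit:
    """Toy data processing unit that flushes memory to disk at a threshold."""

    def __init__(self, threshold):
        self.threshold = threshold  # preset quantity threshold (entry count)
        self.memory = {}
        self.disk = {}              # stand-in for on-disk storage

    def put(self, row_key, value):
        """Store an already merged value in memory, flushing if needed."""
        self.memory[row_key] = value
        if len(self.memory) >= self.threshold:
            self.flush()

    def flush(self):
        """Write every in-memory entry to disk, then empty the memory."""
        # Memory holds the merged, most recent value for each row key,
        # so it overwrites any older copy on disk.
        self.disk.update(self.memory)
        self.memory.clear()
```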
  13. A data storage apparatus, characterized in that the apparatus comprises:
    an acquisition module, configured to obtain multiple pieces of data from a data source, each piece of data carrying a timestamp;
    a classification processing module, configured to classify the multiple pieces of data according to the timestamp of each piece of data to obtain multiple groups of data;
    an aggregation statistics module, configured to perform aggregate statistics on each of the multiple groups of data to obtain multiple pieces of aggregated data; and
    a classification storage module, configured to store the multiple pieces of aggregated data by category through multiple data processing units, wherein the aggregated data stored in each data processing unit is of the same type.
  14. The apparatus according to claim 13, characterized in that the classification processing module is configured to:
    obtain the latest time from the timestamps of the multiple pieces of data;
    determine a target time interval that contains the latest time and whose length is a preset threshold; and
    classify the multiple pieces of data according to the timestamp of each piece of data and the target time interval.
  15. The apparatus according to claim 14, characterized in that the classification processing module is configured to:
    when the latest time falls within a pre-stored time interval whose length is the preset threshold, determine the pre-stored time interval as the target time interval.
  16. The apparatus according to claim 14, characterized in that the classification processing module is configured to:
    when the latest time is greater than the right value of a pre-stored time interval whose length is the preset threshold, determine the time difference between the latest time and the right value of the time interval;
    determine the time sum of the left value of the time interval and the time difference;
    update the right value of the time interval to the latest time, and update the left value of the time interval to the time sum; and
    determine the updated time interval as the target time interval.
  17. The apparatus according to any one of claims 14-16, characterized in that the classification processing module is configured to:
    determine, among the multiple pieces of data, the data whose timestamp indicates a time earlier than the left value of the target time interval as high-level data, and determine the data whose timestamp indicates a time within the target time interval as low-level data.
  18. The apparatus according to claim 17, characterized in that the aggregation statistics module is configured to:
    when the time indicated by the timestamp of each piece of data includes the multiple time granularities of year, month, day, hour, minute, and second, and the target time interval uses day as its time granularity, perform aggregate statistics on the high-level data based on the three time granularities of year, month, and day, according to different time levels and data attributes, to obtain multiple pieces of first high aggregated data; and perform aggregate statistics on the low-level data based on the six time granularities of year, month, day, hour, minute, and second, according to different time levels and data attributes, to obtain multiple pieces of second high aggregated data and multiple pieces of first low aggregated data, wherein different time levels include time granularities of different dimensions.
  19. The apparatus according to claim 18, characterized in that the classification storage module is configured to:
    when the multiple data processing units include a high-level data processing unit and a low-level data processing unit, obtain the row key of each piece of aggregated data, wherein the row key of each piece of aggregated data is generated during the aggregate statistics and indicates the time level and data attribute corresponding to that piece of aggregated data; and
    store the multiple pieces of first high aggregated data and the multiple pieces of second high aggregated data through the high-level data processing unit based on the row key of each piece of first high aggregated data and each piece of second high aggregated data, and store the multiple pieces of first low aggregated data through the low-level data processing unit based on the row key of each piece of first low aggregated data.
  20. The apparatus according to claim 19, characterized in that the classification storage module is configured to:
    merge the high aggregated data having the same row key among the multiple pieces of first high aggregated data and the multiple pieces of second high aggregated data, to obtain multiple pieces of third high aggregated data;
    for each piece of third high aggregated data, query whether data having the same row key as that piece of third high aggregated data is stored in the memory of the high-level data processing unit; and
    when data having the same row key as the piece of third high aggregated data is stored in the memory of the high-level data processing unit, merge the queried data with the piece of third high aggregated data, and store the merged data in the memory of the high-level data processing unit.
  21. The apparatus according to claim 20, characterized in that the classification storage module is configured to:
    when no data having the same row key as the piece of third high aggregated data is stored in the memory of the high-level data processing unit, obtain data having the same row key as the piece of third high aggregated data from the disk of the high-level data processing unit; and
    merge the obtained data with the piece of third high aggregated data, and store the merged data in the memory of the high-level data processing unit.
  22. The apparatus according to claim 19, characterized in that the classification storage module is configured to:
    merge the first low aggregated data having the same row key among the multiple pieces of first low aggregated data, to obtain multiple pieces of second low aggregated data;
    for each piece of second low aggregated data, query whether data having the same row key as that piece of second low aggregated data is stored in the memory of the low-level data processing unit; and
    when data having the same row key as the piece of second low aggregated data is stored in the memory of the low-level data processing unit, merge the queried data with the piece of second low aggregated data, and store the merged data in the memory of the low-level data processing unit.
  23. The apparatus according to claim 22, characterized in that the classification storage module is configured to:
    when no data having the same row key as the piece of second low aggregated data is stored in the memory of the low-level data processing unit, obtain data having the same row key as the piece of second low aggregated data from the disk of the low-level data processing unit; and
    merge the obtained data with the piece of second low aggregated data, and store the merged data in the memory of the low-level data processing unit.
  24. The apparatus according to claim 19, characterized in that the classification storage module is configured to:
    when the amount of data in the memory of the high-level data processing unit reaches a preset quantity threshold, store the data in the memory of the high-level data processing unit to the disk of the high-level data processing unit; or,
    when the amount of data in the memory of the low-level data processing unit reaches the preset quantity threshold, store the data in the memory of the low-level data processing unit to the disk of the low-level data processing unit.
  25. A computer-readable storage medium having instructions stored thereon, characterized in that the instructions, when executed by a processor, implement the steps of the method according to any one of claims 1-12.
  26. A storage device, characterized in that the storage device comprises a processor and a memory, wherein the memory is configured to store a computer program, and the processor is configured to execute the program stored in the memory to implement the steps of the method according to any one of claims 1-12.
PCT/CN2019/111510 2018-10-16 2019-10-16 Data storage method and apparatus, and storage medium WO2020078395A1 (en)

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN201811204394.7 2018-10-16
CN201811204394.7A CN111061758B (en) 2018-10-16 2018-10-16 Data storage method, device and storage medium
CN201811236196.9A CN111090705B (en) 2018-10-23 2018-10-23 Multidimensional data processing method, device and equipment and storage medium
CN201811236196.9 2018-10-23

Publications (1)

Publication Number Publication Date
WO2020078395A1 true WO2020078395A1 (en) 2020-04-23

Family

ID=70282908

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/111510 WO2020078395A1 (en) 2018-10-16 2019-10-16 Data storage method and apparatus, and storage medium

Country Status (1)

Country Link
WO (1) WO2020078395A1 (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107193839A (en) * 2016-03-15 2017-09-22 阿里巴巴集团控股有限公司 Data aggregation method and device
CN107924345A (en) * 2015-06-26 2018-04-17 亚马逊技术股份有限公司 Data storage area for the polymerization measurement result of measurement
US20180293280A1 (en) * 2017-04-07 2018-10-11 Salesforce.Com, Inc. Time series database search system

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19872785

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19872785

Country of ref document: EP

Kind code of ref document: A1

32PN Ep: public notification in the ep bulletin as address of the adressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 031221)
