WO2020078395A1

WO2020078395A1 - Data storage method and apparatus, and storage medium

Info

Publication number: WO2020078395A1
Application number: PCT/CN2019/111510
Authority: WO
Inventors: 曾锐; 陈国栋; 徐乾龙
Original assignee: 杭州海康威视数字技术股份有限公司
Priority date: 2018-10-16
Filing date: 2019-10-16
Publication date: 2020-04-23

Abstract

Disclosed are a data storage method and apparatus, and a storage medium, falling within the technical field of data processing. The method comprises: acquiring multiple pieces of data from a data source, wherein each piece of data carries a timestamp; according to the timestamp of each piece of data, classifying the multiple pieces of data to obtain multiple groups of data; performing aggregated counting on each group of data in the multiple groups of data to obtain multiple pieces of aggregate data; and classifying and storing the multiple pieces of aggregate data by means of multiple data processing units, wherein each data processing unit in the multiple data processing units is constituted by an internal memory and a magnetic disk and the types of aggregate data stored in the data processing units are the same. In this case, when data is queried later, querying can be performed in a corresponding data processing unit based on the timestamp of data needing to be queried so that efficiency of data query is improved.

Description

Data storage method, device and storage medium

This application claims priority to Chinese Patent Application No. 201811204394.7, filed on October 16, 2018 and entitled "Data storage method, device and storage medium", and to Chinese Patent Application No. 201811236196.9, filed on October 23, 2018 and entitled "A multi-dimensional data processing method, device and equipment, and storage medium", the entire contents of which are incorporated herein by reference.

Technical field

Embodiments of the present application relate to the technical field of data processing, and in particular, to a data storage method, device, and storage medium.

Background art

With the rapid development of computer technology, the scale of data has expanded dramatically, the amount of data in various fields is increasing, and the types of data are also increasing. In order to meet the storage requirements of data, data storage can be implemented through a data cube, where the data cube is a type of multi-dimensional matrix, that is, data of multiple dimensions can be stored.

In the related art, an implementation manner of storing data through a data cube may include: the storage device obtains data to be stored, and performs aggregate statistical processing on the obtained data to obtain corresponding aggregated data. After that, the obtained aggregated data can be merged with the existing data in the data cube, and the merged data can be stored in the data cube.

However, in the above implementation, if the amount of data stored in the data cube is very large, it will take a long time to subsequently query data from the data cube.

Summary of the invention

Embodiments of the present application provide a data storage method, device, and storage medium, which can solve the problem that it takes a relatively long time to query data in the related art. The technical solution is as follows:

In a first aspect, a data storage method is provided. The method includes:

obtaining multiple pieces of data from a data source, where each piece of data carries a time stamp;

classifying the multiple pieces of data according to the time stamp of each piece of data to obtain multiple sets of data;

performing aggregation statistics on each set of data in the multiple sets of data to obtain multiple pieces of aggregated data; and

classifying and storing the multiple pieces of aggregated data by means of multiple data processing units, where the types of aggregated data stored in each data processing unit are the same.

In a second aspect, a data storage device is provided, the device comprising:

an acquisition module, configured to acquire multiple pieces of data from a data source, where each piece of data carries a time stamp;

a classification processing module, configured to classify the multiple pieces of data according to the time stamp of each piece of data to obtain multiple sets of data;

an aggregation statistics module, configured to perform aggregation statistics on each set of data in the multiple sets of data to obtain multiple pieces of aggregated data; and

a classification storage module, configured to classify and store the multiple pieces of aggregated data by means of multiple data processing units, where the types of aggregated data stored in each data processing unit are the same.

In a third aspect, a computer-readable storage medium is provided, on which instructions are stored, and when the instructions are executed by a processor, the data storage method according to the first aspect described above is implemented.

According to a fourth aspect, there is provided a computer program product containing instructions, which when executed on a computer, causes the computer to execute the data storage method described in the first aspect above.

According to a fifth aspect, a storage device is provided. The storage device includes a processor and a memory, where the memory is used to store a computer program, and the processor is used to execute the program stored in the memory to implement the data storage method described in the first aspect above.

The beneficial effects brought by the technical solutions provided by the embodiments of the present application are:

Multiple pieces of data carrying time stamps are obtained from the data source and classified according to the time stamp of each piece of data to obtain multiple sets of data. Aggregation statistics are performed on each of the multiple sets of data, and the resulting pieces of aggregated data are classified and stored by multiple data processing units, so that the aggregated data stored in each data processing unit is of the same type. In this way, a subsequent query can be directed to the corresponding data processing unit based on the time stamp of the data to be queried, which improves the efficiency of data query.

Brief description of the drawings

To explain the technical solutions in the embodiments of the present application more clearly, the drawings required in the description of the embodiments are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application. A person of ordinary skill in the art can obtain other drawings based on these drawings without creative effort.

Fig. 1 is a flowchart of a data storage method according to an exemplary embodiment;

Fig. 2 is a schematic diagram of a data processing unit according to an exemplary embodiment;

Fig. 3 is a schematic structural diagram of a data storage device according to an exemplary embodiment;

Fig. 4 is a schematic structural diagram of a storage device according to an exemplary embodiment.

Detailed description

To make the objectives, technical solutions, and advantages of the present application clearer, the following describes the embodiments of the present application in further detail with reference to the accompanying drawings.

Before the data storage method provided in the embodiments of the present application is described in detail, the terms, application scenarios, and implementation environments involved in the embodiments are briefly introduced.

First, the terms involved in the embodiments of the present application are briefly introduced.

Spark Streaming: a computing engine that processes data in batches. Its basic principle is to batch process input data at a certain time interval. When the batch interval is shortened to the second level, it can be used to process real-time data streams. It supports obtaining data from multiple data sources.

Data sources: these can include Kafka data sources, Flume data sources, Twitter data sources, ZeroMQ data sources, Kinesis data sources, and TCP (Transmission Control Protocol) socket data sources.

Data cube: It is a kind of multi-dimensional matrix, which can be used for data analysis and indexing, and can support real-time indexing of metadata with any number of keywords. The data cube may be composed of memory and disk (distributed database) to implement multi-dimensional data storage based on the memory and disk.

Next, a brief introduction to the application scenarios involved in the embodiments of the present application.

In order to adapt to the multi-dimensional development of data, the related technical field proposes to store data through data cubes. However, when the amount of data stored in the data cube is very large, it takes a long time to query data from the data cube. Moreover, in the related art, when storing data through a data cube, the data is generally stored in a distributed database of the data cube, for example, the distributed database is HBase. In this way, when the performance of the distributed database reaches the bottleneck, it will increase the update time of the data cube and reduce the throughput of the system. Also, frequent reads and writes to the distributed database will affect its performance. For this reason, the embodiments of the present application provide a data storage method, which can solve the above-mentioned problems. For specific implementation, please refer to the embodiment shown in FIG. 1 below.

Next, a brief introduction to the implementation environment involved in the embodiments of the present application.

The data storage method provided by the embodiments of the present application may be executed by a storage device, and the storage device includes multiple data processing units to store data through the multiple data processing units. Wherein, each data processing unit in the plurality of data processing units is composed of a memory and a disk. In some embodiments, the data processing unit may be the aforementioned data cube. Further, the storage device may also include Spark Streaming to obtain data from the data source through the Spark Streaming.

After introducing the terms, application scenarios, and implementation environments involved in the embodiments of the present application, the data storage method provided by the embodiments of the present application will be described in detail in conjunction with the accompanying drawings.

Fig. 1 is a flowchart of a data storage method according to an exemplary embodiment. Here, the method is described using an example in which it is executed by the storage device described above. The data storage method may include the following implementation steps:

Step 101: Obtain multiple pieces of data from a data source, and each piece of data carries a time stamp.

In some embodiments, the storage device may obtain the multiple pieces of data from the data source through Spark Streaming. For example, when the data source is a Kafka data source, multiple pieces of data may be read from the Kafka data source through Spark Streaming. Each of the multiple pieces of data carries a time stamp, which can be used to indicate the time at which that piece of data was generated.

Step 102: Classify the multiple pieces of data according to the time stamp of each piece of data to obtain multiple sets of data.

In order to differentiate and store the multiple pieces of data, the storage device classifies the multiple pieces of data according to the time stamp of each piece of data. In some embodiments, the specific implementation may include the following implementation steps:

1021: Obtain the latest time from the time stamps of the multiple pieces of data.

In some embodiments, the multiple pieces of data may be classified into two data types, recent data and old data. That is, the pieces of data that are recent data are placed in one category and the pieces that are old data are placed in another. To do so, a recent time range needs to be determined.

To determine this recent time range, the latest time, that is, the most recent time, is obtained from the time stamps of the multiple pieces of data. For example, the pieces of data include first data, second data, third data, and fourth data. The time indicated by the time stamp of the first data is June 25, 2017, the time indicated by the time stamp of the second data is June 29, 2017, the time indicated by the time stamp of the third data is July 2, 2017, and the time indicated by the time stamp of the fourth data is July 5, 2017. The latest time obtained by the storage device is then July 5, 2017.

1022: Determine a target time interval that includes the latest time and whose length is a preset threshold.

In some embodiments, determining the target time interval that includes the latest time and whose length is the preset threshold may include the following possible implementations:

The first implementation manner: when the latest time falls within a pre-stored time interval whose length is the preset threshold, the pre-stored time interval is determined as the target time interval.

The preset threshold may be set by the user according to actual needs, or may be set by the storage device by default, which is not limited in this embodiment of the present application. For example, the preset threshold may be 30 days.

If the latest time is within the pre-stored time interval, the pre-stored time interval is a recent time range relative to the multiple pieces of data acquired in this batch. In this case, the pre-stored time interval can be directly determined as the target time interval, where the target time interval is equivalent to the above-mentioned recent time range.

The second implementation manner: when the latest time is greater than the right value of the pre-stored time interval whose length is the preset threshold, determine the time difference between the latest time and the right value of the time interval, and determine the time sum of the left value of the time interval and the time difference. Then update the right value of the time interval to the latest time, update the left value of the time interval to the time sum, and determine the updated time interval as the target time interval.

When the latest time is greater than the right value of the time interval, the pre-stored time interval needs to be updated to re-determine the target time interval. This is equivalent to sliding the time interval to the right by a length of time equal to the difference between the latest time and the right value of the time interval. For example, if the pre-stored time interval is [July 1, July 15] and the latest time is July 16, the time difference is one day, so the target time interval is determined as [July 2, July 16].

Further, in this implementation manner, because the recent time range has been re-determined, the storage device may update the pre-stored time interval to the target time interval after determining it, so that the next batch of data can be processed based on the re-determined recent time range.

It should be noted that the above implementation manners of determining the target time interval are merely exemplary. In another embodiment, the target time interval may also be determined in other ways. For example, a further calculation may be performed on the time interval determined in the first implementation manner, and the result used as the target time interval, such as adding a fixed value to both the left value and the right value of the time interval, where the fixed value can be set according to actual needs. As another example, a further calculation may be performed on the updated time interval determined in the second implementation manner, such as adding a fixed value to both the left value and the right value of the updated time interval to obtain the target time interval. This is not limited in the embodiments of the present application.

Further, in the above implementation manner, before determining the target time interval, the storage device may also query whether the time interval exists. When the time interval exists, the target time interval is determined according to the above two implementation methods. Conversely, if the time interval does not exist, the storage device may generate the target time interval based on the latest time and the length of the interval. For example, the difference between the latest time and a preset threshold may be determined, and then the latest time is determined as the right value of the target time interval, and the determined difference value is determined as the left value of the target time interval.
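The three cases described above (no pre-stored interval, latest time inside the interval, latest time beyond the right value) can be sketched as follows. This is only an illustrative sketch: the function name `determine_target_interval`, the tuple representation of an interval, and the 14-day length used in the usage lines are assumptions and do not appear in the embodiments.

```python
from datetime import date, timedelta
from typing import Optional, Tuple

# Hypothetical representation: (left value, right value), both inclusive.
Interval = Tuple[date, date]


def determine_target_interval(latest: date,
                              stored: Optional[Interval],
                              length: timedelta) -> Interval:
    """Return the target time interval for one batch of data."""
    if stored is None:
        # No pre-stored interval: the latest time becomes the right value,
        # and the latest time minus the preset threshold the left value.
        return (latest - length, latest)
    left, right = stored
    if latest <= right:
        # First implementation: reuse the pre-stored interval.
        # Assumption: a latest time earlier than the left value is not
        # covered by the text; this sketch keeps the stored interval.
        return stored
    # Second implementation: slide the interval right by the time difference.
    shift = latest - right
    return (left + shift, latest)


stored = (date(2017, 7, 1), date(2017, 7, 15))
target = determine_target_interval(date(2017, 7, 16), stored, timedelta(days=14))
print(target)
```

With the example from the text, the pre-stored interval [July 1, July 15] and a latest time of July 16 yield [July 2, July 16].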

1023: Classify the multiple pieces of data according to the time stamp of each piece of data and the target time interval.

To store the obtained pieces of data separately, the pieces of data are classified according to the time stamp of each piece of data and the determined target time interval. In implementation, the data whose time stamp indicates a time earlier than the left value of the target time interval is determined as high-level data, and the data whose time stamp indicates a time within the target time interval is determined as low-level data.

When the time indicated by the time stamp of a piece of data is less than the left value of the target time interval, the piece of data predates the target time interval. Such a piece of data can be regarded as old data, and this type of data is classified here as high-level data. When the time indicated by the time stamp of a piece of data is within the target time interval, the piece of data can be regarded as recent data, and this type of data is classified here as low-level data. In this way, two sets of data are obtained after classification.
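As a minimal sketch of this classification, the four example time stamps from step 1021 can be split against a target time interval. The record layout (`id`, `timestamp`), the function name `classify`, and the 7-day interval [June 28, July 5] are assumptions chosen for illustration.

```python
from datetime import date


def classify(records, left):
    """Split records into (high_level, low_level) around the left value."""
    high_level, low_level = [], []
    for rec in records:
        if rec["timestamp"] < left:
            # Earlier than the target interval: old data, high-level.
            high_level.append(rec)
        else:
            # The right value is at least the latest time stamp, so every
            # remaining record lies inside the target interval: low-level.
            low_level.append(rec)
    return high_level, low_level


records = [
    {"id": 1, "timestamp": date(2017, 6, 25)},
    {"id": 2, "timestamp": date(2017, 6, 29)},
    {"id": 3, "timestamp": date(2017, 7, 2)},
    {"id": 4, "timestamp": date(2017, 7, 5)},
]
# Hypothetical target interval with an assumed 7-day preset threshold.
target_left, target_right = date(2017, 6, 28), date(2017, 7, 5)
high_level, low_level = classify(records, target_left)
```

Under these assumptions only the first data (June 25) becomes high-level data, and the other three pieces become low-level data.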

Step 103: Perform aggregation statistics on each set of data in the multiple sets of data to obtain multiple pieces of aggregated data.

Here, aggregation statistics need to be performed on the two sets obtained above, the high-level data and the low-level data. In a possible implementation, when the time indicated by the time stamp of each piece of data includes the six time granularities of year, month, day, hour, minute, and second, and the target time interval uses day as its time granularity, aggregation statistics are performed on the high-level data based on the three time granularities of year, month, and day, according to different time levels and data attributes, to obtain multiple pieces of first high-aggregated data. Aggregation statistics are performed on the low-level data based on the six time granularities of year, month, day, hour, minute, and second, according to different time levels and data attributes, to obtain multiple pieces of second high-aggregated data and multiple pieces of first low-aggregated data. Each time level covers a different set of time granularities.

In general, old data may not need fine-grained statistics, so it can be aggregated based only on the coarse time granularities of year, month, and day. Recent data generally does need fine-grained statistics, so it can be aggregated based on year, month, day, hour, minute, and second. That is, the two sets of data obtained by classification are aggregated separately, based on different time granularities and according to different time levels and data attributes.

For ease of understanding, a one-dimensional data attribute is taken as an example. For the high-level data, the storage device performs aggregation statistics according to different time levels and data attributes based on the three time granularities of year, month, and day. The different time levels include a first time level, a second time level, and a third time level. The first time level includes the single time granularity of year, the second time level includes the two time granularities of year and month, and the third time level includes the three time granularities of year, month, and day.

In other words, for each piece of data included in the high-level data, the storage device performs aggregation statistics according to the first time level and the data attribute to obtain the first high-aggregated data corresponding to the first time level; according to the second time level and the data attribute to obtain the first high-aggregated data corresponding to the second time level; and according to the third time level and the data attribute to obtain the first high-aggregated data corresponding to the third time level.

In addition, for the low-level data, the storage device performs aggregation statistics according to different time levels and data attributes based on the six time granularities of year, month, day, hour, minute, and second. In this case, the different time levels include not only the first, second, and third time levels, but also a fourth time level, a fifth time level, and a sixth time level. The fourth time level includes the four time granularities of year, month, day, and hour; the fifth time level includes the five time granularities of year, month, day, hour, and minute; and the sixth time level includes the six time granularities of year, month, day, hour, minute, and second.

In other words, for each piece of data included in the low-level data, the storage device performs aggregation statistics according to the first time level and the data attribute to obtain the second high-aggregated data corresponding to the first time level; according to the second time level and the data attribute to obtain the second high-aggregated data corresponding to the second time level; according to the third time level and the data attribute to obtain the second high-aggregated data corresponding to the third time level; according to the fourth time level and the data attribute to obtain the first low-aggregated data corresponding to the fourth time level; according to the fifth time level and the data attribute to obtain the first low-aggregated data corresponding to the fifth time level; and according to the sixth time level and the data attribute to obtain the first low-aggregated data corresponding to the sixth time level.
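The per-level aggregation described above can be sketched as follows, assuming that "aggregation statistics" means counting records and that a result is identified by its time level, the time components of that level, and the attribute. The attribute name `region`, the record layout, and the key tuple are hypothetical and are not taken from the embodiments.

```python
from collections import Counter
from datetime import datetime

TIME_FIELDS = ("year", "month", "day", "hour", "minute", "second")


def time_prefix(ts, level):
    """Time components covered by a level (level 1 = year only)."""
    return tuple(getattr(ts, name) for name in TIME_FIELDS[:level])


def aggregate(records, max_level, attribute):
    """Count records per (level, time prefix, attribute, attribute value)."""
    counts = Counter()
    for rec in records:
        for level in range(1, max_level + 1):
            key = (level, time_prefix(rec["timestamp"], level),
                   attribute, rec[attribute])
            counts[key] += 1
    return counts


high_level = [{"timestamp": datetime(2017, 6, 25, 10, 0, 0), "region": "A"}]
low_level = [
    {"timestamp": datetime(2017, 7, 2, 8, 30, 0), "region": "A"},
    {"timestamp": datetime(2017, 7, 2, 8, 45, 0), "region": "A"},
]

# High-level data: time levels 1 to 3 only.
first_high = aggregate(high_level, 3, "region")
# Low-level data: time levels 1 to 6, then split by level.
low_all = aggregate(low_level, 6, "region")
second_high = {k: v for k, v in low_all.items() if k[0] <= 3}
first_low = {k: v for k, v in low_all.items() if k[0] > 3}
```

In this sketch the two low-level records share one result at each of the year, month, day, and hour levels, and only separate at the minute and second levels.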

It should be noted that the above takes a one-dimensional data attribute as an example. In other embodiments, when the data attribute is multi-dimensional, data attributes of different dimensions need to be combined for aggregation statistics. For example, taking a two-dimensional data attribute and aggregation statistics on high-level data, aggregation statistics need to be performed according to the first time level alone; according to the first time level and the data attribute of the first dimension; according to the first time level and the data attribute of the second dimension; and according to the first time level, the data attribute of the first dimension, and the data attribute of the second dimension. Similarly, the storage device combines the two dimensions of data attributes for aggregation statistics based on the second time level, and again based on the third time level. In this way, 12 kinds of first high-aggregated data can be obtained, that is, four attribute combinations for each of the three time levels.
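The count of 12 follows from enumerating every subset of the two attribute dimensions at each of the three time levels. The sketch below only enumerates the combinations; the names `grouping_sets`, `attr1`, and `attr2` are hypothetical.

```python
from itertools import combinations


def grouping_sets(time_levels, attributes):
    """List every (time level, attribute combination) to aggregate on."""
    sets = []
    for level in time_levels:
        # r = 0 is the time level alone; r = len(attributes) uses them all.
        for r in range(len(attributes) + 1):
            for combo in combinations(attributes, r):
                sets.append((level, combo))
    return sets


sets = grouping_sets((1, 2, 3), ("attr1", "attr2"))
print(len(sets))  # 12
```

More generally, n attribute dimensions give 2^n combinations per time level, so the number of aggregations grows quickly as dimensions are added.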

It should also be noted that the above description takes aggregation statistics according to different time levels and data attributes, based on different time granularities, as an example. In another embodiment, aggregation statistics may also be performed based on different time granularities according to different time levels, data attributes, and data attribute values. For example, when the data attribute is age, the data attribute value may be an age value.

Step 104: When the multiple data processing units include a high-level data processing unit and a low-level data processing unit, obtain the row key in each piece of aggregated data. The row key of each piece of aggregated data is generated during aggregation statistics and is used to indicate the time level and data attributes corresponding to that piece of aggregated data.

Wherein, each data processing unit of the plurality of data processing units is composed of a memory and a disk, and the type of aggregated data stored in each data processing unit is the same. For example, when the multiple data processing units include a high-level data processing unit and a low-level data processing unit, please refer to FIG. 2, which is a schematic diagram of a data processing unit according to an exemplary embodiment.

To classify and store the obtained aggregated data in the high-level data processing unit and the low-level data processing unit, the storage device obtains the row keys generated during the aggregation statistics process. It should be noted that, during aggregation statistics, when the time level and the data attribute are the same and the times corresponding to the time level fall within the same time range (for example, both on the same day), the generated row keys are also the same. For example, when the first data is aggregated based on July 2017 and a certain data attribute, and the second data is also aggregated based on July 2017 and that data attribute, the row keys of the two pieces of aggregated data obtained are the same.
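The embodiments do not specify a row key layout. As one possible sketch, a row key could concatenate the time level, the time components covered by that level, and the attribute name; the format and the function name `make_row_key` below are assumptions made for illustration only.

```python
from datetime import datetime

TIME_FIELDS = ("year", "month", "day", "hour", "minute", "second")


def make_row_key(ts, level, attribute):
    """Build a hypothetical row key from a time level and a data attribute."""
    fields = TIME_FIELDS[:level]
    time_part = "-".join(f"{getattr(ts, name):02d}" for name in fields)
    return f"L{level}|{time_part}|{attribute}"


# Two pieces of data aggregated at the month level (time level 2) in July 2017.
key1 = make_row_key(datetime(2017, 7, 2), 2, "age")
key2 = make_row_key(datetime(2017, 7, 5), 2, "age")
print(key1, key2)
```

Both keys come out as `L2|2017-07|age`, matching the July 2017 example, while the same two time stamps at the day level (time level 3) would produce different keys.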

Step 105: Based on the row key in each piece of first high-aggregated data and each piece of second high-aggregated data, store the multiple pieces of first high-aggregated data and the multiple pieces of second high-aggregated data by means of the high-level data processing unit.

During storage, both the multiple pieces of first high-aggregated data and the multiple pieces of second high-aggregated data are stored in the high-level data processing unit. That is, the high-aggregated data obtained from the high-level data and the portion of high-aggregated data obtained by aggregating the low-level data are stored in the same data processing unit.

In some embodiments, storing the multiple pieces of first high-aggregated data and the multiple pieces of second high-aggregated data by means of the high-level data processing unit, based on the row key in each piece of first high-aggregated data and each piece of second high-aggregated data, may include: merging the high-aggregated data that has the same row key among the multiple pieces of first high-aggregated data and the multiple pieces of second high-aggregated data to obtain multiple pieces of third high-aggregated data, and storing the multiple pieces of third high-aggregated data in the high-level data processing unit.

That is to say, when high-aggregated data is stored in the high-level data processing unit, it is not merged directly with the data already in the high-level data processing unit, but is merged only when certain conditions are met. As described above, during aggregation statistics, when the time level and the data attribute are the same and the times corresponding to the time level fall within the same time range, the generated row keys are also the same. In the embodiments of the present application, the high-aggregated data with the same row key is merged to obtain multiple pieces of third high-aggregated data, so that when the multiple pieces of third high-aggregated data are stored in the high-level data processing unit, high-aggregated data with the same row key can be merged. In this way, the user can later query multiple pieces of data at the same time level and within the same time range in a single query, which avoids merging at query time and improves the efficiency of data query.

Further, storing the multiple pieces of third high-aggregated data in the high-level data processing unit may include: for each piece of third high-aggregated data, querying whether the memory of the high-level data processing unit stores data with the same row key as that piece of third high-aggregated data; and when the memory of the high-level data processing unit stores data with the same row key, merging the queried data with that piece of third high-aggregated data and storing the merged data in the memory of the high-level data processing unit.

To avoid frequent reads and writes to the disk, the embodiments of the present application first merge the high-aggregated data in the memory. That is, the storage device queries whether the memory of the high-level data processing unit stores data with the same row key as the third high-aggregated data. If such data exists, the high-aggregated data with the same row key is merged directly in the memory, and the merged high-aggregated data is stored in the memory.

Further, when the memory of the high-level data processing unit does not store data with the same row key as a piece of third high-aggregated data, the data with the same row key as that piece of third high-aggregated data is obtained from the disk of the high-level data processing unit, the obtained data is merged with that piece of third high-aggregated data, and the merged data is stored in the memory of the high-level data processing unit.
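The memory-first merge with a fallback to the disk can be sketched as follows. This is an illustration under several assumptions: the class name `DataProcessingUnit` is hypothetical, plain dictionaries stand in for the memory and for the disk (which in practice may be a distributed database), merging is taken to mean adding counts, and a row key found in neither place is simply stored as new.

```python
class DataProcessingUnit:
    """Hypothetical stand-in for one data processing unit."""

    def __init__(self):
        self.memory = {}  # row key -> aggregated count held in memory
        self.disk = {}    # row key -> aggregated count persisted on disk

    def store(self, batch):
        """Store (row_key, count) pairs, merging by row key."""
        # Merge entries of the batch that share a row key first.
        merged = {}
        for row_key, count in batch:
            merged[row_key] = merged.get(row_key, 0) + count
        for row_key, count in merged.items():
            if row_key in self.memory:
                # Same row key already in memory: merge in memory.
                self.memory[row_key] += count
            elif row_key in self.disk:
                # Not in memory: fetch from disk, merge, keep in memory.
                self.memory[row_key] = self.disk[row_key] + count
            else:
                # Assumption: an unseen row key is stored as new data.
                self.memory[row_key] = count


unit = DataProcessingUnit()
unit.memory["L2|2017-07|age"] = 1
unit.disk["L2|2017-06|age"] = 5
unit.store([
    ("L2|2017-07|age", 2),
    ("L2|2017-07|age", 3),
    ("L2|2017-06|age", 1),
    ("L1|2017|age", 4),
])
print(unit.memory)
```

The same structure would apply to the low-level data processing unit, since both units are described as a memory plus a disk.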

Step 106: Based on the row key in each piece of first low-aggregated data, store the multiple pieces of first low-aggregated data by means of the low-level data processing unit.

During data storage, the multiple pieces of first low-aggregated data obtained through aggregation statistics are stored in the low-level data processing unit. Further, storing the multiple pieces of first low-aggregated data by means of the low-level data processing unit, based on the row key in each piece of first low-aggregated data, may include: merging the first low-aggregated data that has the same row key among the multiple pieces of first low-aggregated data to obtain multiple pieces of second low-aggregated data, and storing the multiple pieces of second low-aggregated data in the low-level data processing unit.

Similarly, when the first low-aggregated data is stored in the low-level data processing unit, it is not merged directly with the data already in the low-level data processing unit, but is merged only when certain conditions are met. As described above, during aggregation statistics, when the time level and the data attribute are the same and the times corresponding to the time level fall within the same time range, the generated row keys are also the same. In the embodiments of the present application, the first low-aggregated data with the same row key is merged to obtain multiple pieces of second low-aggregated data, so that when the multiple pieces of second low-aggregated data are stored in the low-level data processing unit, low-aggregated data with the same row key can be merged. In this way, the user can later query multiple pieces of data at the same time level and within the same time range in a single query, which avoids merging at query time and improves the efficiency of data query.

Further, the above specific implementation of storing the plurality of second low-aggregated data in the low-level data processing unit may include: for each second low-aggregated data in the plurality of second low-aggregated data, query the low Whether the memory of the hierarchical data processing unit stores the same data as the row health of each second low-aggregated data; when the memory of the low-level data processing unit stores the row health of the second low-aggregated data When the data is the same, the queried data is merged with each second low-aggregated data, and the merged data is stored in the memory of the low-level data processing unit.

In order to avoid frequent reads and writes to the disk, the embodiment of the present application first merges the low-aggregated data in memory, that is, it queries whether data with the same row key as the second low-aggregated data is stored in the memory of the low-level data processing unit. If such data exists, the low-aggregated data with the same row key are merged directly in memory, and the merged data is stored in memory.

Further, when data with the same row key as a second low-aggregated data is not stored in the memory of the low-level data processing unit, the data with the same row key as that second low-aggregated data is acquired from the disk of the low-level data processing unit; the acquired data is merged with that second low-aggregated data, and the merged data is stored in the memory of the low-level data processing unit.
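The merge procedure described above can be sketched as follows. This is a minimal illustration, not the implementation of the embodiment: it assumes each aggregated record is a `(row_key, count)` pair, that "merging" means summing counts, and that plain dictionaries stand in for the memory and the disk of the low-level data processing unit. The function names and the row key strings are hypothetical.

```python
from collections import defaultdict


def combine_by_row_key(records):
    """Combine (row_key, count) pairs of one batch that share a row key."""
    combined = defaultdict(int)
    for row_key, count in records:
        combined[row_key] += count  # assumption: merging means summing counts
    return dict(combined)


def store_batch(records, memory, disk):
    """Merge a batch into memory; on a memory miss, fall back to the disk copy."""
    for row_key, count in combine_by_row_key(records).items():
        if row_key in memory:
            # Same row key already in memory: merge in place.
            memory[row_key] += count
        else:
            # Not in memory: fetch any existing value from disk, then merge
            # and keep the merged result in memory.
            memory[row_key] = disk.get(row_key, 0) + count


# Hypothetical row keys of the form "level|time|attribute".
memory = {"hour|2018-10-16T09|cam1": 3}
disk = {"hour|2018-10-16T08|cam1": 10}
batch = [
    ("hour|2018-10-16T09|cam1", 2),
    ("hour|2018-10-16T08|cam1", 1),
    ("hour|2018-10-16T09|cam1", 4),
]
store_batch(batch, memory, disk)
```

After the call, both row keys live in memory with their merged counts, and the disk was read only for the key that memory did not hold.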

Further, when the amount of data in the memory of the high-level data processing unit reaches a preset number threshold, the data in the memory of the high-level data processing unit is stored to the disk of the high-level data processing unit. Alternatively, when the amount of data in the memory of the low-level data processing unit reaches a preset number threshold, the data in the memory of the low-level data processing unit is stored to the disk of the low-level data processing unit.

The preset number threshold may be set by the user according to actual needs, or may be set by the storage device by default, which is not limited in the embodiment of the present application.

In this way, the merged high-aggregated data is first stored in the memory of the high-level data processing unit, and the merged low-aggregated data is first stored in the memory of the low-level data processing unit. The data in memory is written to the disk only when the amount of data stored in the memory of the high-level data processing unit, or in the memory of the low-level data processing unit, reaches a certain value, which reduces the number of interactions with the disk. In addition, when high-aggregated data or low-aggregated data is queried, the memory is queried first, and the disk is queried only when the data is not found in memory, which avoids frequent disk reads and writes and improves system performance. This storage method can also reduce the disk usage of the high-level data processing unit and of the low-level data processing unit.
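A minimal sketch of this memory-first behaviour is given below. It assumes the "amount of data" is the number of entries held in memory and uses a dictionary as a stand-in for the disk; the class name `ProcessingUnit` and its attributes are hypothetical and only illustrate the threshold-triggered write and the memory-then-disk query order.

```python
class ProcessingUnit:
    """Toy data processing unit made of an in-memory part and a disk part."""

    def __init__(self, flush_threshold):
        self.flush_threshold = flush_threshold  # preset number threshold
        self.memory = {}
        self.disk = {}        # stand-in for on-disk storage
        self.disk_writes = 0  # counts interactions that write to disk

    def put(self, row_key, value):
        self.memory[row_key] = value
        # Write to disk only once the amount of data in memory reaches the threshold.
        if len(self.memory) >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.disk.update(self.memory)
        self.memory.clear()
        self.disk_writes += 1

    def get(self, row_key):
        # Query memory first, and fall back to disk only on a miss.
        if row_key in self.memory:
            return self.memory[row_key]
        return self.disk.get(row_key)


unit = ProcessingUnit(flush_threshold=3)
unit.put("a", 1)
unit.put("b", 2)
writes_before_threshold = unit.disk_writes
unit.put("c", 3)  # third entry reaches the threshold and triggers one disk write
unit.put("d", 4)  # stays in memory until the threshold is reached again
```

Three writes to the unit cost a single disk interaction here, which is the saving the passage describes.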

It should be noted that there is no required execution order between the above steps 105 and 106.

In addition, it should also be noted that the above steps 104 to 106 are used to realize the operation of classifying and storing the multiple aggregated data by multiple data processing units.

Further, in the process of processing each batch of data, when the target time interval is updated, the storage device can also delete the data in the low-level data processing unit that does not belong to the target time interval, so as to save storage space of the low-level data processing unit.
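The cleanup step can be illustrated with the short sketch below. It assumes, purely for illustration, that each entry of the low-level store keeps the timestamp it was aggregated from next to its value, and that the target time interval is closed on both ends; the function name and the store layout are hypothetical.

```python
from datetime import date


def evict_outside_interval(low_level_store, left, right):
    """Delete entries whose time lies outside [left, right]; return how many were removed."""
    stale_keys = [
        row_key
        for row_key, (timestamp, _value) in low_level_store.items()
        if not (left <= timestamp <= right)
    ]
    for row_key in stale_keys:
        del low_level_store[row_key]
    return len(stale_keys)


# Hypothetical store: row key -> (time of the entry, aggregated value).
store = {
    "k1": (date(2018, 10, 9), 5),
    "k2": (date(2018, 10, 12), 7),
    "k3": (date(2018, 10, 18), 2),
}
removed = evict_outside_interval(store, date(2018, 10, 12), date(2018, 10, 18))
```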

Further, in order to ensure that the data obtained from the data source is not repeated, the offset of the acquired data may be recorded after the acquired batch of data has been processed. The offset indicates the location of the currently acquired data in the data source. In this way, when data is next obtained from the data source, the next batch of data can be obtained according to the recorded offset. For example, if the data in the data source is numbered sequentially and 5 pieces of data are acquired this time, the offset is 5, that is, the next acquisition starts from the sixth piece of data.
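The offset bookkeeping in the example above can be sketched as follows, assuming the data source behaves like a sequentially numbered list. The function name `fetch_batch` and the batch size are illustrative only.

```python
def fetch_batch(source, offset, batch_size):
    """Return the next batch starting at `offset`, plus the offset to record afterwards."""
    batch = source[offset:offset + batch_size]
    return batch, offset + len(batch)


source = [f"record-{i}" for i in range(1, 13)]  # records numbered 1..12

first_batch, offset = fetch_batch(source, 0, 5)             # acquires pieces 1 to 5
second_batch, next_offset = fetch_batch(source, offset, 5)  # resumes from the sixth piece
```

Because the recorded offset always points just past the last acquired piece, consecutive batches never overlap.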

In the embodiment of the present application, multiple pieces of data carrying a time stamp are obtained from a data source, and the multiple pieces of data are classified according to the time stamp of each piece of data to obtain multiple sets of data. Aggregation statistics are performed on each of the multiple sets of data to obtain multiple aggregated data, and the multiple aggregated data are then classified and stored through multiple data processing units, each composed of a memory and a disk, so that the aggregated data stored in each data processing unit are of the same type. In this way, in a subsequent data query, the query can be performed in the corresponding data processing unit based on the time stamp of the data to be queried, which improves the efficiency of data query.

Fig. 3 is a schematic structural diagram of a data storage device according to an exemplary embodiment. The data storage device may be implemented by software, hardware, or a combination of both. The data storage device may include:

The obtaining module 310 is used to obtain multiple pieces of data from a data source, and each piece of data carries a time stamp;

The classification processing module 320 is configured to classify the multiple pieces of data according to the time stamp of each piece of data to obtain multiple sets of data;

The aggregation statistics module 330 is configured to aggregate statistics on each group of the multiple groups of data to obtain multiple aggregated data;

The classification storage module 340 is used to classify and store the plurality of aggregated data by a plurality of data processing units, wherein each data processing unit of the plurality of data processing units is composed of a memory and a disk, and the types of aggregated data stored in each data processing unit are the same.

Optionally, the classification processing module 320 is used to:

Obtaining the latest time from the time stamps of the multiple pieces of data;

Determine a target time interval that includes the latest time and the interval length is a preset threshold;

Classify the multiple pieces of data according to the time stamp of each piece of data and the target time interval.

Optionally, the classification processing module 320 is used to:

When the latest time is within a pre-stored time interval whose interval length is the preset threshold, the time interval is determined as the target time interval.

Optionally, the classification processing module 320 is used to:

When the latest time is greater than the right value of the pre-stored time interval whose interval length is the preset threshold, determine the time difference between the latest time and the right value of the time interval;

Determine the time sum of the left value of the time interval and the time difference;

Update the right value of the time interval to the latest time, and update the left value of the time interval to the time sum;

The updated time interval is determined as the target time interval.
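The two interval cases handled by the classification processing module can be sketched as below, using `datetime.date` values for the left and right values of the interval. The function name is hypothetical, and the final branch (latest time earlier than the left value) is not covered by this passage, so leaving the interval unchanged there is only an assumption.

```python
from datetime import date


def update_target_interval(left, right, latest):
    """Return the target time interval (left, right) for a batch whose latest time is `latest`."""
    if left <= latest <= right:
        # Latest time already inside the pre-stored interval: reuse it.
        return left, right
    if latest > right:
        time_difference = latest - right   # how far the latest time overshoots
        time_sum = left + time_difference  # shift the left value by the same amount
        return time_sum, latest            # the interval length is preserved
    # Assumption: a latest time earlier than the left value leaves the interval as is.
    return left, right


left, right = date(2018, 10, 10), date(2018, 10, 16)
same_left, same_right = update_target_interval(left, right, date(2018, 10, 14))
new_left, new_right = update_target_interval(left, right, date(2018, 10, 18))
```

Shifting both ends by the same time difference keeps the interval length equal to the preset threshold while making the latest time its right value.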

Optionally, the classification processing module 320 is used to:

Determine the data, among the multiple pieces of data, whose timestamp indicates a time less than the left value of the target time interval as high-level data, and determine the data, among the multiple pieces of data, whose timestamp indicates a time within the target time interval as low-level data.
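A small sketch of this classification follows. It assumes each piece of data is a dictionary with a `timestamp` field and that the interval is closed on both ends; both are illustrative choices rather than details taken from the embodiment.

```python
from datetime import datetime


def classify_by_interval(records, left, right):
    """Split records into (high_level, low_level) relative to the target time interval."""
    high_level, low_level = [], []
    for record in records:
        timestamp = record["timestamp"]
        if timestamp < left:
            high_level.append(record)  # earlier than the left value of the interval
        elif timestamp <= right:
            low_level.append(record)   # inside the target time interval
        else:
            # Should not occur when the interval contains the latest time of the batch.
            raise ValueError("timestamp lies beyond the target time interval")
    return high_level, low_level


left = datetime(2018, 10, 12)
right = datetime(2018, 10, 18, 23, 59, 59)
records = [
    {"timestamp": datetime(2018, 10, 1, 8, 0, 0), "attribute": "cam1"},
    {"timestamp": datetime(2018, 10, 15, 9, 30, 0), "attribute": "cam1"},
    {"timestamp": datetime(2018, 10, 18, 12, 0, 0), "attribute": "cam2"},
]
high_level, low_level = classify_by_interval(records, left, right)
```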

Optionally, the aggregation statistics module 330 is used to:

When the time indicated by the timestamp of each piece of data includes the multiple time granularities of year, month, day, hour, minute, and second, and the target time interval uses day as its time granularity: based on the three time granularities of year, month, and day, perform aggregation statistics on the high-level data according to different time levels and data attributes to obtain multiple first high-aggregated data; and based on the six time granularities of year, month, day, hour, minute, and second, perform aggregation statistics on the low-level data according to different time levels and data attributes to obtain multiple second high-aggregated data and multiple first low-aggregated data. Different time levels include time granularities of different dimensions.
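The aggregation can be illustrated with the following sketch. Several points are assumptions made for the example: the statistic is a simple count, the row key is the string `level|truncated time|attribute`, and the year, month, and day aggregates of the low-level data are taken to be the second high-aggregated data while the hour, minute, and second aggregates are taken to be the first low-aggregated data.

```python
from collections import Counter
from datetime import datetime

# Hypothetical mapping from time level to the truncated time used in the row key.
LEVEL_FORMATS = {
    "year": "%Y",
    "month": "%Y-%m",
    "day": "%Y-%m-%d",
    "hour": "%Y-%m-%dT%H",
    "minute": "%Y-%m-%dT%H:%M",
    "second": "%Y-%m-%dT%H:%M:%S",
}
HIGH_LEVELS = ("year", "month", "day")
LOW_LEVELS = ("hour", "minute", "second")


def aggregate(records, levels):
    """Count (timestamp, attribute) records per row key for each requested time level."""
    counts = Counter()
    for timestamp, attribute in records:
        for level in levels:
            row_key = f"{level}|{timestamp.strftime(LEVEL_FORMATS[level])}|{attribute}"
            counts[row_key] += 1
    return dict(counts)


high_level_data = [
    (datetime(2018, 10, 1, 8, 0, 0), "cam1"),
    (datetime(2018, 10, 1, 9, 0, 0), "cam1"),
]
low_level_data = [
    (datetime(2018, 10, 16, 9, 30, 0), "cam1"),
    (datetime(2018, 10, 16, 9, 30, 0), "cam1"),
    (datetime(2018, 10, 16, 9, 45, 0), "cam1"),
]

first_high = aggregate(high_level_data, HIGH_LEVELS)   # from high-level data
second_high = aggregate(low_level_data, HIGH_LEVELS)   # coarse levels of low-level data
first_low = aggregate(low_level_data, LOW_LEVELS)      # fine levels of low-level data
```

Records that share a time level, a data attribute, and a time range end up under the same row key, which is what later allows them to be merged by key.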

Optionally, the classification storage module 340 is used to:

When the plurality of data processing units include a high-level data processing unit and a low-level data processing unit, obtain the row key in each aggregated data, where the row key of each aggregated data is generated during aggregation statistics and is used to indicate the time level and data attribute corresponding to that aggregated data;

Based on the row key in each first high-aggregated data and each second high-aggregated data, store the plurality of first high-aggregated data and the plurality of second high-aggregated data through the high-level data processing unit, and based on the row key in each first low-aggregated data, store the plurality of first low-aggregated data through the low-level data processing unit.

Optionally, the classification storage module 340 is used to:

Combine the high-aggregated data that have the same row key among the plurality of first high-aggregated data and the plurality of second high-aggregated data to obtain a plurality of third high-aggregated data;

For each third high-aggregated data in the plurality of third high-aggregated data, query whether data with the same row key as that third high-aggregated data is stored in the memory of the high-level data processing unit;

When data with the same row key as the third high-aggregated data is stored in the memory of the high-level data processing unit, merge the queried data with the third high-aggregated data, and store the merged data in the memory of the high-level data processing unit.

Optionally, the classification storage module 340 is used to:

When data with the same row key as a third high-aggregated data is not stored in the memory of the high-level data processing unit, acquire the data with the same row key as that third high-aggregated data from the disk of the high-level data processing unit;

Merge the acquired data with that third high-aggregated data, and store the merged data in the memory of the high-level data processing unit.

Optionally, the classification storage module 340 is used to:

Combine the first low-aggregated data that have the same row key among the plurality of first low-aggregated data to obtain multiple second low-aggregated data;

For each second low-aggregated data in the plurality of second low-aggregated data, query whether data with the same row key as that second low-aggregated data is stored in the memory of the low-level data processing unit;

When data with the same row key as the second low-aggregated data is stored in the memory of the low-level data processing unit, merge the queried data with the second low-aggregated data, and store the merged data in the memory of the low-level data processing unit.

Optionally, the classification storage module 340 is used to:

When data with the same row key as a second low-aggregated data is not stored in the memory of the low-level data processing unit, acquire the data with the same row key as that second low-aggregated data from the disk of the low-level data processing unit;

Merge the acquired data with that second low-aggregated data, and store the merged data in the memory of the low-level data processing unit.

Optionally, the classification storage module 340 is used to:

When the amount of data in the memory of the high-level data processing unit reaches a preset number threshold, store the data in the memory of the high-level data processing unit to the disk of the high-level data processing unit; or,

When the amount of data in the memory of the low-level data processing unit reaches the preset number threshold, the data in the memory of the low-level data processing unit is stored to the disk of the low-level data processing unit.

It should be noted that, when the data storage device provided in the above embodiments implements the data storage method, the division into the above functional modules is used only as an example. In practical applications, the above functions can be allocated to different functional modules as needed, that is, the internal structure of the device is divided into different functional modules to complete all or part of the functions described above. In addition, the data storage device and the data storage method embodiment provided in the above embodiments belong to the same concept. For the specific implementation process, refer to the method embodiments; details are not repeated here.

Fig. 4 is a schematic structural diagram of a storage device according to an exemplary embodiment. Specifically:

The storage device 400 includes a central processing unit (CPU) 401, a system memory 404 including a random access memory (RAM) 402 and a read-only memory (ROM) 403, and a system bus 405 connecting the system memory 404 and the central processing unit 401. The storage device 400 also includes a basic input/output system (I/O system) 406 that helps transfer information between various devices in the computer, and a mass storage device 407 for storing the operating system 413, application programs 414, and other program modules 415.

The basic input / output system 406 includes a display 408 for displaying information and an input device 409 for a user to input information, such as a mouse and a keyboard. The display 408 and the input device 409 are both connected to the central processing unit 401 through the input and output controller 410 connected to the system bus 405. The basic input / output system 406 may also include an input-output controller 410 for receiving and processing input from a number of other devices such as a keyboard, mouse, or electronic stylus. Similarly, the input output controller 410 also provides output to a display screen, printer, or other type of output device.

The mass storage device 407 is connected to the central processing unit 401 through a mass storage controller (not shown) connected to the system bus 405. The mass storage device 407 and its associated computer-readable medium provide non-volatile storage for the storage device 400. That is, the mass storage device 407 may include a computer-readable medium (not shown) such as a hard disk or CD-ROM drive.

Without loss of generality, computer-readable media may include computer storage media and communication media. Computer storage media includes volatile and nonvolatile, removable and non-removable media implemented in any method or technology for storage of information such as computer readable instructions, data structures, program modules or other data. Computer storage media include RAM, ROM, EPROM, EEPROM, flash memory, or other solid-state storage technologies, CD-ROM, DVD, or other optical storage, tape cassettes, magnetic tape, magnetic disk storage, or other magnetic storage devices. Of course, those skilled in the art may know that the computer storage medium is not limited to the above. The above-mentioned system memory 404 and mass storage device 407 may be collectively referred to as a memory.

According to various embodiments of the present application, the storage device 400 may also be operated by a remote computer connected to the network through a network such as the Internet. That is, the storage device 400 may be connected to the network 412 through the network interface unit 411 connected to the system bus 405, or may be connected to other types of networks or remote computer systems (not shown) using the network interface unit 411.

The above memory also includes one or more programs. The one or more programs are stored in the memory and configured to be executed by the CPU. The one or more programs include instructions for performing the data storage method provided by the embodiments of the present application.

An embodiment of the present application further provides a non-transitory computer-readable storage medium. When the instructions in the storage medium are executed by a processor of a mobile terminal, the mobile terminal is enabled to execute the data storage method provided by the embodiment shown in FIG. 1.

An embodiment of the present application also provides a computer program product containing instructions, which when run on a computer, causes the computer to execute the data storage method provided by the embodiment shown in FIG. 1 described above.

A person of ordinary skill in the art may understand that all or part of the steps to implement the above-described embodiments may be completed by hardware, or may be completed by a program instructing related hardware. The program may be stored in a computer-readable storage medium. The mentioned storage medium may be a read-only memory, a magnetic disk or an optical disk, etc.

The above are only preferred embodiments of this application and are not intended to limit this application. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of this application shall fall within the protection scope of this application.

Claims

A data storage method, characterized in that the method includes:

Obtain multiple pieces of data from the data source, and each piece of data carries a time stamp;

Classify the multiple pieces of data according to the time stamp of each piece of data to obtain multiple sets of data;

Aggregate statistics for each set of data in the multiple sets of data to obtain multiple aggregated data;

The multiple aggregated data is classified and stored by multiple data processing units, wherein the types of aggregated data stored in each data processing unit are the same.
The method according to claim 1, wherein the classifying the multiple pieces of data according to the time stamp of each piece of data includes:

Obtaining the latest time from the time stamps of the multiple pieces of data;

Determine a target time interval that includes the latest time and the interval length is a preset threshold;

Classify the multiple pieces of data according to the time stamp of each piece of data and the target time interval.
The method of claim 2, wherein the determining a target time interval that includes the latest time and the interval length is a preset threshold includes:

When the latest time is within a pre-stored time interval whose interval length is the preset threshold, the time interval is determined as the target time interval.
The method of claim 2, wherein the determining a target time interval that includes the latest time and the interval length is a preset threshold includes:

When the latest time is greater than the right value of the pre-stored time interval whose interval length is the preset threshold, determine the time difference between the latest time and the right value of the time interval;

Determine the time sum of the left value of the time interval and the time difference;

Update the right value of the time interval to the latest time, and update the left value of the time interval to the time sum;

The updated time interval is determined as the target time interval.
The method according to any one of claims 2-4, wherein the classifying the plurality of pieces of data according to the time stamp of each piece of data and the target time interval includes:

Determining the data, among the multiple pieces of data, whose timestamp indicates a time less than the left value of the target time interval as high-level data, and determining the data, among the multiple pieces of data, whose timestamp indicates a time within the target time interval as low-level data.
The method according to claim 5, wherein, when the time indicated by the time stamp of each piece of data includes the multiple time granularities of year, month, day, hour, minute, and second, and the target time interval uses day as its time granularity, the performing aggregation statistics on each group of data in the multiple groups of data to obtain multiple aggregated data includes:

Based on the three time granularities of year, month, and day, performing aggregation statistics on the high-level data according to different time levels and data attributes to obtain multiple first high-aggregated data; and based on the six time granularities of year, month, day, hour, minute, and second, performing aggregation statistics on the low-level data according to different time levels and data attributes to obtain multiple second high-aggregated data and multiple first low-aggregated data, wherein different time levels include time granularities of different dimensions.
The method according to claim 6, wherein, when the plurality of data processing units include a high-level data processing unit and a low-level data processing unit, the classifying and storing the plurality of aggregated data by the plurality of data processing units includes:

Acquiring the row key in each aggregated data, wherein the row key of each aggregated data is generated during aggregation statistics and is used to indicate the time level and data attribute corresponding to that aggregated data;

Based on the row key in each first high-aggregated data and each second high-aggregated data, storing the plurality of first high-aggregated data and the plurality of second high-aggregated data through the high-level data processing unit, and based on the row key in each first low-aggregated data, storing the plurality of first low-aggregated data through the low-level data processing unit.
The method according to claim 7, wherein the storing, through the high-level data processing unit, the plurality of first high-aggregated data and the plurality of second high-aggregated data based on the row key in each first high-aggregated data and each second high-aggregated data includes:

Combining the high-aggregated data that have the same row key among the plurality of first high-aggregated data and the plurality of second high-aggregated data to obtain a plurality of third high-aggregated data;

For each third high-aggregated data in the plurality of third high-aggregated data, querying whether data with the same row key as that third high-aggregated data is stored in the memory of the high-level data processing unit;

When data with the same row key as the third high-aggregated data is stored in the memory of the high-level data processing unit, merging the queried data with the third high-aggregated data, and storing the merged data in the memory of the high-level data processing unit.
The method according to claim 8, wherein after the querying whether data with the same row key as each third high-aggregated data is stored in the memory of the high-level data processing unit, the method further includes:

When data with the same row key as a third high-aggregated data is not stored in the memory of the high-level data processing unit, acquiring the data with the same row key as that third high-aggregated data from the disk of the high-level data processing unit;

Merging the acquired data with that third high-aggregated data, and storing the merged data in the memory of the high-level data processing unit.
The method according to claim 7, wherein the storing the plurality of first low-aggregated data through the low-level data processing unit based on the row key in each first low-aggregated data includes:

Combining the first low-aggregated data that have the same row key among the plurality of first low-aggregated data to obtain multiple second low-aggregated data;

For each second low-aggregated data in the plurality of second low-aggregated data, querying whether data with the same row key as that second low-aggregated data is stored in the memory of the low-level data processing unit;

When data with the same row key as the second low-aggregated data is stored in the memory of the low-level data processing unit, merging the queried data with the second low-aggregated data, and storing the merged data in the memory of the low-level data processing unit.
The method according to claim 10, wherein after the querying whether data with the same row key as each second low-aggregated data is stored in the memory of the low-level data processing unit, the method further includes:

When data with the same row key as a second low-aggregated data is not stored in the memory of the low-level data processing unit, acquiring the data with the same row key as that second low-aggregated data from the disk of the low-level data processing unit;

Merging the acquired data with that second low-aggregated data, and storing the merged data in the memory of the low-level data processing unit.
The method of claim 7, wherein the method further comprises:

When the amount of data in the memory of the high-level data processing unit reaches a preset number threshold, store the data in the memory of the high-level data processing unit to the disk of the high-level data processing unit; or,

When the amount of data in the memory of the low-level data processing unit reaches the preset number threshold, the data in the memory of the low-level data processing unit is stored to the disk of the low-level data processing unit.
A data storage device, characterized in that the device includes:

The acquisition module is used to acquire multiple pieces of data from the data source, and each piece of data carries a time stamp;

A classification processing module, configured to classify the multiple pieces of data according to the time stamp of each piece of data to obtain multiple sets of data;

An aggregation statistics module, configured to aggregate statistics on each group of the multiple sets of data to obtain multiple aggregated data;

A classification storage module is used to classify and store the plurality of aggregated data by a plurality of data processing units, wherein the types of aggregated data stored in each data processing unit are the same.
The apparatus according to claim 13, wherein the classification processing module is used to:

Obtaining the latest time from the time stamps of the multiple pieces of data;

Determine a target time interval that includes the latest time and the interval length is a preset threshold;

Classify the multiple pieces of data according to the time stamp of each piece of data and the target time interval.
The apparatus according to claim 14, wherein the classification processing module is used to:

When the latest time is within a pre-stored time interval whose interval length is the preset threshold, the time interval is determined as the target time interval.
The apparatus according to claim 14, wherein the classification processing module is used to:

When the latest time is greater than the right value of the pre-stored time interval whose interval length is the preset threshold, determine the time difference between the latest time and the right value of the time interval;

Determine the time sum of the left value of the time interval and the time difference;

Update the right value of the time interval to the latest time, and update the left value of the time interval to the time sum;

The updated time interval is determined as the target time interval.
The device according to any one of claims 14-16, wherein the classification processing module is used to:

Determine the data, among the multiple pieces of data, whose timestamp indicates a time less than the left value of the target time interval as high-level data, and determine the data, among the multiple pieces of data, whose timestamp indicates a time within the target time interval as low-level data.
The apparatus of claim 17, wherein the aggregation statistics module is used to:

When the time indicated by the timestamp of each piece of data includes the multiple time granularities of year, month, day, hour, minute, and second, and the target time interval uses day as its time granularity: based on the three time granularities of year, month, and day, perform aggregation statistics on the high-level data according to different time levels and data attributes to obtain multiple first high-aggregated data; and based on the six time granularities of year, month, day, hour, minute, and second, perform aggregation statistics on the low-level data according to different time levels and data attributes to obtain multiple second high-aggregated data and multiple first low-aggregated data, wherein different time levels include time granularities of different dimensions.
The apparatus of claim 18, wherein the classification storage module is used to:

When the plurality of data processing units include a high-level data processing unit and a low-level data processing unit, obtain the row key in each aggregated data, wherein the row key of each aggregated data is generated during aggregation statistics and is used to indicate the time level and data attribute corresponding to that aggregated data;

Based on the row key in each first high-aggregated data and each second high-aggregated data, store the plurality of first high-aggregated data and the plurality of second high-aggregated data through the high-level data processing unit, and based on the row key in each first low-aggregated data, store the plurality of first low-aggregated data through the low-level data processing unit.
The apparatus of claim 19, wherein the classification storage module is used to:

Combine the high-aggregated data that have the same row key among the plurality of first high-aggregated data and the plurality of second high-aggregated data to obtain a plurality of third high-aggregated data;

For each third high-aggregated data in the plurality of third high-aggregated data, query whether data with the same row key as that third high-aggregated data is stored in the memory of the high-level data processing unit;

When data with the same row key as the third high-aggregated data is stored in the memory of the high-level data processing unit, merge the queried data with the third high-aggregated data, and store the merged data in the memory of the high-level data processing unit.
The apparatus of claim 20, wherein the classification storage module is used to:

When data with the same row key as a third high-aggregated data is not stored in the memory of the high-level data processing unit, acquire the data with the same row key as that third high-aggregated data from the disk of the high-level data processing unit;

Merge the acquired data with that third high-aggregated data, and store the merged data in the memory of the high-level data processing unit.
The apparatus of claim 19, wherein the classification storage module is used to:

Combine the first low-aggregated data that have the same row key among the plurality of first low-aggregated data to obtain multiple second low-aggregated data;

For each second low-aggregated data in the plurality of second low-aggregated data, query whether data with the same row key as that second low-aggregated data is stored in the memory of the low-level data processing unit;

When data with the same row key as the second low-aggregated data is stored in the memory of the low-level data processing unit, merge the queried data with the second low-aggregated data, and store the merged data in the memory of the low-level data processing unit.
The apparatus according to claim 22, wherein the classification storage module is used to:

When data with the same row key as a second low-aggregated data is not stored in the memory of the low-level data processing unit, acquire the data with the same row key as that second low-aggregated data from the disk of the low-level data processing unit;

Merge the acquired data with that second low-aggregated data, and store the merged data in the memory of the low-level data processing unit.
The apparatus of claim 19, wherein the classification storage module is used to:

When the amount of data in the memory of the high-level data processing unit reaches a preset number threshold, store the data in the memory of the high-level data processing unit to the disk of the high-level data processing unit; or,

When the amount of data in the memory of the low-level data processing unit reaches the preset number threshold, the data in the memory of the low-level data processing unit is stored in the disk of the low-level data processing unit.
A computer-readable storage medium having instructions stored on the computer-readable storage medium, characterized in that, when the instructions are executed by a processor, the steps of any one of the methods of claims 1-12 are implemented.
A storage device, characterized in that the storage device includes a processor and a memory, wherein the memory is used to store a computer program, and the processor is used to execute the program stored in the memory to implement the method steps of any one of claims 1-12.