CN110109923B - Time sequence data storage method, time sequence data analysis method and time sequence data analysis device - Google Patents


Info

Publication number
CN110109923B
Authority
CN
China
Prior art keywords
time
slice
sequence data
time sequence
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201910270612.5A
Other languages
Chinese (zh)
Other versions
CN110109923A (en
Inventor
刘睿
黄践焜
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing MetarNet Technologies Co Ltd
Original Assignee
Beijing MetarNet Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing MetarNet Technologies Co Ltd
Priority to CN201910270612.5A
Publication of CN110109923A
Application granted
Publication of CN110109923B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/242 Query formulation
    • G06F16/2433 Query languages
    • G06F16/2438 Embedded query languages
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2474 Sequence data queries, e.g. querying versioned data
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G06F16/2477 Temporal data queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • General Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • Mathematical Physics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Probability & Statistics with Applications (AREA)
  • Fuzzy Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The embodiment of the invention provides a storage method, an analysis method and a device of time sequence data, wherein the storage method comprises the following steps: acquiring the time precision of the time sequence data according to the sampling time interval of the time sequence data; aggregating a plurality of time sequence data with the same time stamp and the same time precision into a file slice according to the time stamp and the time precision of the time sequence data; and storing the file slice to an HBase database by taking the time precision and the timestamp as HBase row keys. According to the time sequence data storage method provided by the embodiment of the invention, the time sequence data is sliced and stored in HBase according to the time dimension, so that the information redundancy can be effectively reduced, and the storage space is compressed.

Description

Time sequence data storage method, time sequence data analysis method and time sequence data analysis device
Technical Field
The embodiment of the invention relates to the field of internet technology, and in particular to a method for storing time series data, a method for analyzing time series data, and corresponding devices.
Background
Time series data is a sequence of values of some index recorded in chronological order. In a communication service, for example, the performance data of communication equipment is time series data: a network management system periodically collects data from the equipment, generates data files and reports them. The files describe the measured value of each performance index at each time point and record how the state of the equipment changes at current and historical times. Operations on time series data have the following characteristics: data is read by time range, recent data is read with high probability, the data volume is large, and analysis is multidimensional.
Against a background of massive time series data, the prior art stores the data in HBase when random retrieval and multidimensional analysis of small data volumes are required. HBase is a NoSQL database built on HDFS (Hadoop Distributed File System). Data is organized as Key and Value pairs sorted in lexicographic order of the Key, and random reads and writes of small data volumes perform well even when the background data volume is large. However, HBase supports only data storage and query and cannot perform data analysis.
For this reason, Spark SQL may be used to analyze HBase data. However, when Spark is used to search HBase data, the data must be laid out as a table in the manner of a traditional relational database, with one HBase column allocated to each index column. This granularity is too fine: because the HBase Qualifier (the complete representation of a key in an HFile, including column family, column name, version and row key) is stored with every cell, a large amount of redundant information is added, which increases storage and network overhead. In addition, the query is essentially a full-table scan: the HBase data must be scanned completely into memory before it is processed, so query efficiency is poor.
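The redundancy described above can be made concrete with a little arithmetic. The following is a minimal sketch, assuming the classic HBase KeyValue layout without tags (4-byte key length, 4-byte value length, 2-byte row length, row, 1-byte family length, family, qualifier, 8-byte timestamp, 1-byte type, value). The row key, column family and the 50 index column names are hypothetical and chosen only for illustration.

```python
def cell_size(row: bytes, family: bytes, qualifier: bytes, value_len: int) -> int:
    """Approximate serialized size in bytes of one KeyValue cell (no tags)."""
    key_len = 2 + len(row) + 1 + len(family) + len(qualifier) + 8 + 1
    return 4 + 4 + key_len + value_len

row = b"900|20190404120000|cell-0001"      # hypothetical row key (28 bytes)
family = b"d"                              # hypothetical column family
indicators = [f"kpi_{i:03d}".encode() for i in range(50)]  # hypothetical index columns

# One HBase column per index: each 8-byte measurement repeats the full key.
per_column = sum(cell_size(row, family, q, 8) for q in indicators)

# All 50 measurements packed into a single Value under one key.
packed = cell_size(row, family, b"v", 8 * len(indicators))

print(per_column, packed)  # 3200 450
```

Under these assumptions the 400 bytes of actual measurements occupy 3200 bytes when stored one column per index, but 450 bytes when packed into one Value.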
Disclosure of Invention
Embodiments of the present invention provide a method and an apparatus for storing and analyzing time series data, which overcome the above problems or at least partially solve the above problems.
In a first aspect, an embodiment of the present invention provides a method for storing time series data, including:
acquiring the time precision of the time sequence data according to the sampling time interval of the time sequence data;
aggregating a plurality of time sequence data with the same time stamp and the same time precision into a file slice according to the time stamp and the time precision of the time sequence data;
and storing the file slice to an HBase database by taking the time precision and the timestamp as HBase row keys.
In a second aspect, an embodiment of the present invention provides a method for analyzing time series data, including:
parsing the received SQL statement to obtain a time range for querying time series data and a corresponding execution plan;
scanning the HBase database according to the time range, locating a plurality of time series data slices, reading the Value of each time series data slice, and generating a Spark RDD;
performing matching and filtering operations on the Spark RDD according to the execution plan to obtain a minimal data set (Minimal DataFrame) consistent with the execution plan;
and performing SQL calculation based on the Minimal DataFrame.
In a third aspect, an embodiment of the present invention provides a time series data storage device, including:
the time precision determining module is used for acquiring the time precision of the time sequence data according to the sampling time interval of the time sequence data;
the aggregation module is used for aggregating a plurality of time sequence data with the same time stamp and the same time precision into a file slice according to the time stamp and the time precision of the time sequence data;
and the storage module is used for storing the file slice to an HBase database by taking the time precision and the timestamp as HBase row keys.
In a fourth aspect, an embodiment of the present invention provides an apparatus for analyzing time series data, including:
the parsing module is used for parsing the received SQL statement to obtain a time range for querying the time series data and a corresponding execution plan;
the scanning module is used for scanning the HBase database according to the time range, locating a plurality of time series data slices, reading the Value of each time series data slice and generating a Spark RDD;
the Minimal DataFrame generating module is used for performing matching and filtering operations on the Spark RDD according to the execution plan to obtain a minimal data set (Minimal DataFrame) consistent with the execution plan;
and the SQL execution module is used for performing SQL calculation based on the Minimal DataFrame.
In a fifth aspect, an embodiment of the present invention provides an electronic device, which includes a memory, a processor, and a computer program stored in the memory and executable on the processor, and the processor implements the steps of the method according to the first aspect or the second aspect when executing the program.
In a sixth aspect, embodiments of the present invention provide a non-transitory computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements the steps of the method as provided in the first or second aspect.
The embodiment of the invention provides a time sequence data storage method, an analysis method and a device, which are used for slicing and storing time sequence data into HBase according to time dimension, expanding a Spark SQL execution process based on HBase slice data, and effectively improving response speed and data analysis efficiency when small data volume random retrieval is carried out under the background of massive time sequence data.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and those skilled in the art can also obtain other drawings according to the drawings without creative efforts.
Fig. 1 is a schematic flow chart illustrating a method for storing time series data according to an embodiment of the present invention;
Fig. 2 is a diagram illustrating a time series data set being partitioned into multiple file slices according to an embodiment of the present invention;
Fig. 3 is a schematic diagram of row keys provided in accordance with an embodiment of the present invention;
Fig. 4 is a schematic diagram illustrating the storage of a time series data slice in HBase according to an embodiment of the present invention;
Fig. 5 is a flowchart illustrating a method for analyzing time series data according to an embodiment of the present invention;
Fig. 6 is a schematic diagram of an SQL parsing process according to an embodiment of the present invention;
Fig. 7 is a schematic diagram of an HBase scanning process according to an embodiment of the present invention;
Fig. 8 is a schematic diagram of a Minimal DataFrame creation process according to an embodiment of the present invention;
Fig. 9 is an exemplary diagram of an SQL execution process provided by an embodiment of the invention;
Fig. 10 is a schematic diagram of the Spark SQL query logic as modified for HBase according to an embodiment of the present invention;
Fig. 11 is a schematic structural diagram of a device for storing time series data according to an embodiment of the present invention;
Fig. 12 is a schematic structural diagram of an apparatus for analyzing time series data according to an embodiment of the present invention;
Fig. 13 is a schematic physical structure diagram of an electronic device according to an embodiment of the present invention.
Detailed Description
In order to make the objects, technical solutions and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
For the convenience of understanding, the technical terms and related concepts related to the embodiments of the present invention are explained herein before.
HBase is a NoSQL database built on HDFS. Data is organized as Key (row key) and Value pairs sorted in lexicographic order of the Key, and the resulting HFile files are stored in HDFS. There are two main query modes: a get on one or more Key values returns the corresponding records, and a scan over a range of Key values returns multiple records. Random reads and writes of small data volumes perform well even when the background data volume is large.
Spark SQL: the SQL-based data processing component of Spark, which includes DataFrame, Catalyst and the like, and supports big-data file systems such as HDFS.
RDD: Resilient Distributed Dataset, the most basic data abstraction in Spark. It represents an immutable, partitionable collection whose elements can be computed in parallel. An RDD can only be created by deterministic operations on data in stable physical storage or on other existing RDDs. An RDD allows a user to explicitly cache a working set in memory when executing multiple queries, so that subsequent queries can reuse the working set, which greatly increases query speed.
DataFrame: a data set organized into columns, conceptually equivalent to a table in a relational database or a data frame in R/Python, but with richer optimizations in the query engine. A DataFrame can be constructed from a variety of sources, for example structured data files, tables in Hive, external databases, or existing RDDs.
Region: the basic unit of HBase data storage and management.
As shown in fig. 1, a schematic flow chart of a method for storing time series data according to an embodiment of the present invention includes:
Step 101, acquiring the time precision of the time series data according to the sampling time interval of the time series data;
The time series data comprises a measurement object identifier, a timestamp, an index sequence and the like. The time precision is defined as the time interval of data acquisition: the shorter the interval, the higher the precision.
The timestamps represent sampling instants, and the interval between consecutive timestamps is the sampling time interval. The sampling time interval, and hence the time precision, can therefore be obtained from the timestamps of the time series data.
To facilitate retrieval of time series data, a large time series data set may be divided into small time series data sets according to sampling time and time precision.
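As a minimal sketch of step 101, the time precision can be derived from the timestamps alone. Taking the smallest gap between consecutive distinct timestamps (so that a missing sample does not inflate the result) is an assumption of this example, as are the function name `time_precision` and the sample values.

```python
from datetime import datetime

def time_precision(timestamps):
    """Return the sampling interval in seconds (smallest gap between distinct timestamps)."""
    ts = sorted(set(timestamps))
    if len(ts) < 2:
        raise ValueError("need at least two distinct timestamps")
    return min(int((b - a).total_seconds()) for a, b in zip(ts, ts[1:]))

samples = [
    datetime(2019, 4, 4, 12, 0),
    datetime(2019, 4, 4, 12, 15),
    datetime(2019, 4, 4, 12, 45),   # the 12:30 sample is missing
]
period = time_precision(samples)
print(period)  # 900, i.e. a 15-minute time precision
```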
Step 102, aggregating a plurality of time sequence data with the same time stamp and the same time precision into a file slice according to the time stamp and the time precision of the time sequence data;
Time series data with the same timestamp and the same time precision are aggregated to form one file slice. Fig. 2 is a schematic diagram of slicing a time series data set into a plurality of file slices.
This effectively reduces the redundancy of qualifier information, compresses the storage space, and makes it easy to retrieve time series data by time range.
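The aggregation in step 102 amounts to grouping records by the pair (time precision, timestamp). The sketch below illustrates this under assumed data: the record layout, the object identifiers and the index values are hypothetical.

```python
from collections import defaultdict

# Each record: (measurement object id, timestamp, period in seconds, index values).
records = [
    ("cell-1", "20190404120000", 900, [0.91, 12]),
    ("cell-2", "20190404120000", 900, [0.87, 15]),
    ("cell-1", "20190404121500", 900, [0.93, 11]),
    ("cell-1", "20190404120000", 3600, [0.90, 48]),
]

def build_file_slices(records):
    """Group records sharing the same (period, timestamp) into one file slice."""
    slices = defaultdict(list)
    for obj_id, ts, period, values in records:
        slices[(period, ts)].append((obj_id, values))
    return dict(slices)

file_slices = build_file_slices(records)
for key, members in sorted(file_slices.items()):
    print(key, [obj_id for obj_id, _ in members])
```

Here the two 15-minute records taken at 12:00 land in the same slice, while the hourly record with the same timestamp forms a slice of its own because its time precision differs.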
Step 103, storing the file slice to an HBase database by taking the time precision and the timestamp as the HBase row key.
Specifically, the file slice is stored with the TIME precision PERIOD and the timestamp TIME as the HBase row key.
According to the time sequence data storage method provided by the embodiment of the invention, the time sequence data is sliced and stored in HBase according to the time dimension, so that the information redundancy can be effectively reduced, and the storage space is compressed.
Based on the content of the above embodiment, the step of storing the file slice to the HBase database with the time precision and the timestamp as the HBase row key specifically includes:
acquiring the size of the file slice;
if the size of the file slice is larger than the preset byte length, further segmenting the file slice to generate a plurality of sub-slices;
storing the plurality of sub-slices into an HBase database by taking the time precision, the timestamp and the slice number of each sub-slice as row keys; or,
and if the size of the file slice is smaller than or equal to the preset byte length, setting the slice number corresponding to the file slice to be zero, and storing the file slice into an HBase database by taking the time precision, the timestamp and the slice number of the file slice as row keys.
Specifically, given the storage characteristics of HBase, a single Value is preferably between 100 KB and 10 MB, so a large file slice is sliced further.
First, the size of the file slice is obtained. If it is larger than the preset byte length, the file slice is further segmented to obtain a plurality of sub-slices and the slice number of each sub-slice.
Taking the retrieval characteristics of HBase into account, a row key is constructed as shown in fig. 3: each sub-slice is stored into the HBase database with the time precision PERIOD, the timestamp TIME and the slice number SLICE NUMBER of the sub-slice as the row key (rowkey).
In a specific implementation, the preset byte length is typically set to 10 MB.
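One way such a row key could be encoded is sketched below. The text only fixes the order of the three components (PERIOD, TIME, SLICE NUMBER). The `|` separator, the zero-padded fixed widths and the function name `make_row_key` are assumptions of this example, chosen so that HBase's byte-wise key ordering matches the logical ordering.

```python
def make_row_key(period_s: int, timestamp: str, slice_number: int) -> bytes:
    """Build a row key whose byte order equals (period, time, slice number) order."""
    # Fixed-width zero padding is an assumption: without it, "900" would sort after "3600".
    return f"{period_s:06d}|{timestamp}|{slice_number:04d}".encode("ascii")

keys = [
    make_row_key(900, "20190404121500", 0),
    make_row_key(900, "20190404120000", 2),
    make_row_key(900, "20190404120000", 1),
    make_row_key(3600, "20190404120000", 0),
]
ordered = sorted(keys)   # bytes compare lexicographically, as HBase row keys do
for key in ordered:
    print(key.decode())
```

With this layout all slices of one time precision are contiguous and sorted by time, so a time range maps to a single contiguous key range.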
Wherein, further segmenting the file slice to obtain a plurality of sub-slices further comprises:
firstly, determining the number N of sub-slices for segmenting the file slice according to the preset byte length;
where (Figure BDA0002018248670000061)
$$N = \left\lceil \frac{\text{size of the file slice}}{\text{preset byte length}} \right\rceil$$
That is, the file slice is divided into N sub-slices according to the preset byte length (e.g., 10 MB), and the sub-slices have the same prefix (PERIOD, TIME) as the parent slice.
Then, a hash value is calculated from the measurement object identifier of the time series data, and the remainder of the hash value modulo N, plus one, is the slice number of that time series data,
namely (Figure BDA0002018248670000071)
$$\text{SLICE NUMBER} = \left(\operatorname{hash}(\text{measurement object identifier}) \bmod N\right) + 1$$
Finally, the time series data with the same slice number are merged into one sub-slice, generating N sub-slices.
If the size of the file slice is smaller than or equal to the preset byte length, setting the slice number corresponding to the file slice to be zero, and still using the time precision, the timestamp and the slice number of the file slice as row keys to store the file slice into the HBase database, as shown in FIG. 4, which is a schematic diagram of the storage of the time sequence data slice in the HBase. As can be seen from fig. 4, in the method for storing time series data according to the embodiment of the present invention, after the time series data is sliced according to the time dimension, a larger slice is further sliced into a plurality of sub-slices and stored in the HBase.
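The sub-slicing procedure can be sketched as follows. The text does not name the hash function or the rounding rule, so CRC32 (via `zlib.crc32`) and rounding N up are assumptions of this example, as are the function name `split_slice` and the record layout.

```python
import math
import zlib
from collections import defaultdict

def split_slice(records, slice_size: int, max_bytes: int):
    """records: list of (object_id, payload). Returns {slice_number: [records]}."""
    if slice_size <= max_bytes:
        return {0: list(records)}            # small slice: stored whole, slice number 0
    n = math.ceil(slice_size / max_bytes)    # number of sub-slices (rounding up assumed)
    parts = defaultdict(list)
    for obj_id, payload in records:
        # CRC32 stands in for the unspecified hash; result is in 1..n.
        number = zlib.crc32(obj_id.encode("utf-8")) % n + 1
        parts[number].append((obj_id, payload))
    return dict(parts)

records = [(f"cell-{i}", b"x" * 100) for i in range(1000)]
size = sum(len(payload) for _, payload in records)            # 100000 bytes
sub_slices = split_slice(records, size, max_bytes=30_000)     # N = 4
whole = split_slice(records, size, max_bytes=10 * 1024 * 1024)
print(sorted(sub_slices), list(whole))
```

Because the slice number depends only on the measurement object identifier, all records of one object at a given (PERIOD, TIME) always fall into the same sub-slice.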
As shown in fig. 5, a schematic flow chart of an analysis method of time series data according to an embodiment of the present invention is shown, where the time series data is stored by using the storage method provided in each of the embodiments.
On the basis of the above storage method, the embodiment of the invention extends the logic of Spark SQL, adding support for slice storage and removing the full-table scan that occurs when Spark queries HBase. A query on time series data normally specifies a time range, which is what makes the analysis method workable, and this condition can be obtained simply by lexical analysis of the SQL. Spark SQL then operates only on a small range of data (generally slightly more than the data actually required, with the range of useless data shrinking as the SLICE NUMBER increases), so IO is effectively reduced and query response speed is improved.
As shown in fig. 5, the method for analyzing time series data includes:
Step 501, parsing the received SQL statement to obtain a time range for querying time series data and a corresponding execution plan;
Specifically, the input SQL is lexically analyzed with the Spark SQL Parser and the data to be queried is determined. In the embodiment of the present invention, the Catalyst that ships with Spark SQL completes the SQL parsing process to obtain a Logical Plan, and the query range of the time field and the range of queried columns are obtained from the Logical Plan. Fig. 6 is a schematic diagram of the SQL parsing process. As shown in fig. 6, after the SQL is processed by Catalyst, the tables involved and their columns are obtained in the Project stage and are used in the subsequent slice data clipping and DataFrame creation. The filter conditions of each table are obtained in the Filter stage, giving the filter fields and ranges of each table and hence the time range of the data to be processed.
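To illustrate what step 501 extracts, the sketch below pulls the projected columns, the table and the time range out of a simple query. It is not Catalyst: a pair of regular expressions stands in for the Project and Filter stages and handles only a single-table `SELECT ... FROM ... WHERE` form. The table name `perf`, the field name `time` and the function name `extract_query_info` are hypothetical.

```python
import re

def extract_query_info(sql: str, time_field: str = "time"):
    """Return (columns, table, (start, end)) for a simple single-table query."""
    m = re.search(r"select\s+(.*?)\s+from\s+(\w+)", sql, re.IGNORECASE | re.DOTALL)
    if not m:
        raise ValueError("unsupported SQL")
    columns = [c.strip() for c in m.group(1).split(",")]   # "Project" information
    table = m.group(2)
    # "Filter" information: only the bounds on the time field are of interest here.
    lower = re.search(rf"\b{time_field}\s*>=\s*'([^']+)'", sql, re.IGNORECASE)
    upper = re.search(rf"\b{time_field}\s*<\s*'([^']+)'", sql, re.IGNORECASE)
    if not (lower and upper):
        raise ValueError("query must bound the time field on both sides")
    return columns, table, (lower.group(1), upper.group(1))

sql = ("SELECT obj_id, kpi_a FROM perf "
       "WHERE time >= '20190404120000' AND time < '20190404130000' AND kpi_a > 0.9")
columns, table, time_range = extract_query_info(sql)
print(columns, table, time_range)
```

The column list drives the later column filtering, and the time range drives the HBase scan.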
Step 502, scanning the HBase database according to the time range, positioning to a plurality of time sequence data slices, reading Value of each time sequence data slice, and generating Spark RDD;
In the embodiment of the present invention, the HBase Scan is performed through the new Hadoop API interface provided by Spark (newAPIHadoopRDD), and the start row key is passed in according to the time range to complete the scan. The Scan process for HBase is shown in fig. 7. After the Scan operation is initiated, it is carried out on each HBase Region. Because the data in HBase is stored in row-key order, the required data range can be located quickly from the start row key of the Scan and the data in the range can be read sequentially.
Reading Value of each time sequence data slice to generate Spark RDD, specifically:
reading the Value of each time series data slice in Key-Value form by using newAPIHadoopRDD to generate the Spark RDD:
val slice_data_rdd=sc.newAPIHadoopRDD(……)
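The effect of passing start and stop row keys can be shown without a cluster. In the sketch below a sorted in-memory list stands in for an HBase table and `bisect` stands in for the Scan. The key layout (zero-padded `PERIOD|TIME|SLICE`), the stored values and the function name `scan` are assumptions of this example, not the HBase or Spark API.

```python
import bisect

# Sorted (row_key, value) pairs stand in for an HBase table.
table = sorted([
    (b"000900|20190404114500|0000", b"slice-a"),
    (b"000900|20190404120000|0001", b"slice-b1"),
    (b"000900|20190404120000|0002", b"slice-b2"),
    (b"000900|20190404121500|0000", b"slice-c"),
    (b"000900|20190404130000|0000", b"slice-d"),
    (b"003600|20190404120000|0000", b"slice-e"),
])
row_keys = [key for key, _ in table]

def scan(period_s: int, start_ts: str, stop_ts: str):
    """Return the values whose row key lies in [start_row, stop_row)."""
    start_row = f"{period_s:06d}|{start_ts}".encode("ascii")
    stop_row = f"{period_s:06d}|{stop_ts}".encode("ascii")
    lo = bisect.bisect_left(row_keys, start_row)   # first key >= start_row
    hi = bisect.bisect_left(row_keys, stop_row)    # first key >= stop_row
    return [value for _, value in table[lo:hi]]

hits = scan(900, "20190404120000", "20190404130000")
print(hits)  # [b'slice-b1', b'slice-b2', b'slice-c']
```

Only the three slices inside the requested hour are touched. The earlier slice, the later slice and the hourly-precision slice are never read, which is the property that replaces the full-table scan.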
Step 503, performing matching and filtering operations on the Spark RDD according to the execution plan to obtain a minimal data set (Minimal DataFrame) consistent with the execution plan;
As shown in fig. 8, the specific process of generating the Minimal DataFrame is as follows:
performing Value matching on the Spark RDD according to the execution plan to obtain a first RDD, where raw_data_RDD in fig. 8 is the first RDD;
filtering the first RDD according to the Schema of the execution plan to remove useless fields, that is, unnecessary columns, to obtain a second RDD, where filtered_raw_data_RDD in fig. 8 is the second RDD;
and registering the second RDD as a DataFrame to obtain a Minimal DataFrame consistent with the execution plan, where the Minimal DataFrame contains only the columns necessary for the SQL calculation and the rows required by the time range.
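The three stages above can be imitated with plain Python lists in place of RDDs. This is only a sketch of the data flow: the JSON encoding of slice Values, the column names and the interpretation of "Value matching" as decoding each Value into rows are assumptions of this example.

```python
import json

# Values read from the scanned slices (hypothetical JSON-encoded lists of rows).
slice_values = [
    json.dumps([
        {"obj_id": "cell-1", "time": "20190404120000", "kpi_a": 0.91, "kpi_b": 12},
        {"obj_id": "cell-2", "time": "20190404120000", "kpi_a": 0.87, "kpi_b": 15},
    ]),
    json.dumps([
        {"obj_id": "cell-1", "time": "20190404131500", "kpi_a": 0.93, "kpi_b": 11},
    ]),
]
needed_columns = ["obj_id", "time", "kpi_a"]        # taken from the execution plan
time_range = ("20190404120000", "20190404130000")   # taken from the execution plan

# Stage 1: decode every Value into individual rows (plays the role of the first RDD).
raw_rows = [row for value in slice_values for row in json.loads(value)]
# Stage 2: drop columns the query does not need (plays the role of the second RDD).
filtered_rows = [{c: row[c] for c in needed_columns} for row in raw_rows]
# Stage 3: keep only rows inside the time range, giving the minimal data set.
minimal = [r for r in filtered_rows if time_range[0] <= r["time"] < time_range[1]]

print(minimal)
```

The resulting rows carry no `kpi_b` column and no record outside the requested hour, which is what keeps the later SQL calculation small.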
Step 504, performing SQL calculation based on the Minimal DataFrame.
After the Minimal DataFrame is constructed, control returns to the Spark SQL processing logic, and the SQL calculation is performed on the Minimal DataFrame to obtain the processing result. Fig. 9 is an exemplary diagram of the final SQL execution process.
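The final step can be illustrated with any SQL engine once the minimal data set exists. In the sketch below an in-memory SQLite database (Python's standard `sqlite3` module) stands in for Spark SQL purely for demonstration. The table name, columns and rows are hypothetical.

```python
import sqlite3

# Rows of the minimal data set: only the needed columns and the needed time range.
minimal_rows = [
    ("cell-1", "20190404120000", 0.91),
    ("cell-2", "20190404120000", 0.87),
    ("cell-1", "20190404121500", 0.93),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE perf (obj_id TEXT, time TEXT, kpi_a REAL)")
conn.executemany("INSERT INTO perf VALUES (?, ?, ?)", minimal_rows)

# The user's SQL now runs against the small registered table only.
result = conn.execute(
    "SELECT obj_id, AVG(kpi_a) FROM perf GROUP BY obj_id ORDER BY obj_id"
).fetchall()
conn.close()

for obj_id, avg_kpi in result:
    print(obj_id, round(avg_kpi, 4))
```

The aggregation never sees data outside the minimal set, so its cost depends on the size of the query result range rather than on the size of the whole HBase table.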
Fig. 10 is a schematic diagram of the Spark SQL query logic as modified for HBase according to an embodiment of the present invention. The analysis method provided by the embodiment of the invention exploits the random read performance of HBase while avoiding the full-table scan when Spark queries HBase, enabling rapid data retrieval and analysis against a background of massive time series data.
As shown in fig. 11, a schematic structural diagram of a time series data storage device according to an embodiment of the present invention includes: a time precision determination module 1101, an aggregation module 1102, and a storage module 1103, wherein,
a time precision determining module 1101, configured to obtain time precision of time series data according to a sampling time interval of the time series data;
the aggregating module 1102 is configured to aggregate a plurality of time series data with the same timestamp and the same time precision into a file slice according to the timestamp and the time precision of the time series data;
the storage module 1103 is configured to store the file slice in an HBase database with the time precision and the timestamp as HBase row keys.
The device is used for realizing the time sequence data storage method in the previous method embodiments. Therefore, the description and definition of the storage method of the time series data in the foregoing embodiments may be used for understanding each execution module in the embodiments of the present invention, and are not described herein again.
According to the time sequence data storage device provided by the embodiment of the invention, the time sequence data is sliced according to the time dimension and stored in the HBase, so that the information redundancy can be effectively reduced, and the storage space is compressed.
As shown in fig. 12, a schematic structural diagram of an apparatus for analyzing time series data according to an embodiment of the present invention includes: a parsing module 1201, a scanning module 1202, a Minimal DataFrame generating module 1203, and an SQL execution module 1204, wherein,
the parsing module 1201 is configured to parse the received SQL statement to obtain a time range for querying the time series data and a corresponding execution plan;
the scanning module 1202 is configured to scan the HBase database according to the time range, locate a plurality of time series data slices, read the Value of each time series data slice, and generate a Spark RDD;
the Minimal DataFrame generating module 1203 is configured to perform matching and filtering operations on the Spark RDD according to the execution plan, so as to obtain a minimal data set (Minimal DataFrame) consistent with the execution plan;
and the SQL execution module 1204 is configured to perform SQL calculation based on the Minimal DataFrame.
The device is used for realizing the time sequence data analysis method in the previous method embodiments. Therefore, the description and definition of the analysis method of the time series data in the foregoing embodiments may be used for understanding each execution module in the embodiments of the present invention, and are not described herein again.
The time sequence data analysis device provided by the embodiment of the invention not only exerts the random reading performance of HBase, but also avoids the full-table scanning process when Spark queries HBase, and realizes the rapid retrieval and analysis of data under massive time sequence background data.
Fig. 13 is a schematic entity structure diagram of an electronic device according to an embodiment of the present invention, and as shown in fig. 13, the electronic device may include: a processor (processor)1310, a communication Interface (Communications Interface)1320, a memory (memory)1330 and a communication bus 1340, wherein the processor 1310, the communication Interface 1320 and the memory 1330 communicate with each other via the communication bus 1340. The processor 1310 may call a computer program stored on the memory 1330 and operable on the processor 1310 to perform the method for storing timing data provided by the above method embodiments, for example, including: acquiring the time precision of the time sequence data according to the sampling time interval of the time sequence data; aggregating a plurality of time sequence data with the same time stamp and the same time precision into a file slice according to the time stamp and the time precision of the time sequence data; and storing the file slice to an HBase database by taking the time precision and the timestamp as HBase row keys.
The processor 1310 may also invoke a computer program stored on the memory 1330 and executable on the processor 1310 to perform the method for analyzing the timing data provided by the above method embodiments, including, for example: analyzing the received SQL statement to obtain a time range for inquiring time sequence data and a corresponding execution plan; scanning the HBase database according to the time range, positioning to a plurality of time sequence data slices, reading Value of each time sequence data slice, and generating Spark RDD; performing matching and filtering operation on the Spark RDD according to the execution plan to obtain a minimum data set minimum DataFrame consistent with the execution plan; SQL calculation is performed based on the minimum DataFrame.
In addition, the logic instructions in the memory 1330 may be implemented as software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium. Based on this understanding, the technical solutions of the embodiments of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, an optical disk, and various other media capable of storing program code.
An embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the time series data storage method provided in the foregoing method embodiments, for example: acquiring the time precision of the time series data according to the sampling time interval of the time series data; aggregating a plurality of time series data having the same timestamp and the same time precision into a file slice according to the timestamp and the time precision of the time series data; and storing the file slice in an HBase database with the time precision and the timestamp as the HBase row key.
An embodiment of the present invention further provides a non-transitory computer-readable storage medium on which a computer program is stored, where the computer program, when executed by a processor, implements the time series data analysis method provided in the foregoing method embodiments, for example: parsing the received SQL statement to obtain a time range for querying the time series data and a corresponding execution plan; scanning the HBase database according to the time range to locate a plurality of time series data slices, reading the Value of each time series data slice, and generating a Spark RDD; performing matching and filtering operations on the Spark RDD according to the execution plan to obtain a minimal data set (Minimal DataFrame) consistent with the execution plan; and performing SQL computation based on the Minimal DataFrame.
The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the modules may be selected according to actual needs to achieve the purpose of the solution of the present embodiment. One of ordinary skill in the art can understand and implement it without inventive effort.
Through the above description of the embodiments, those skilled in the art will clearly understand that each embodiment can be implemented by software plus a necessary general hardware platform, and certainly can also be implemented by hardware. With this understanding in mind, the above-described technical solutions may be embodied in the form of a software product, which can be stored in a computer-readable storage medium such as ROM/RAM, magnetic disk, optical disk, etc., and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to execute the methods described in the embodiments or some parts of the embodiments.
Finally, it should be noted that: the above examples are only intended to illustrate the technical solution of the present invention, but not to limit it; although the present invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some technical features may be equivalently replaced; and such modifications or substitutions do not depart from the spirit and scope of the corresponding technical solutions of the embodiments of the present invention.

Claims (9)

1. A method for storing time series data, comprising:
acquiring the time precision of the time sequence data according to the sampling time interval of the time sequence data;
aggregating a plurality of time sequence data with the same time stamp and the same time precision into a file slice according to the time stamp and the time precision of the time sequence data;
storing the file slice to an HBase database by taking the time precision and the timestamp as HBase row keys;
the step of storing the file slice into an HBase database by taking the time precision and the timestamp as HBase row keys specifically comprises the following steps:
acquiring the size of the file slice;
if the size of the file slice is larger than the preset byte length, further segmenting the file slice to generate a plurality of sub-slices;
storing the plurality of sub-slices into an HBase database by taking the time precision, the timestamp and the slice number of the sub-slice as row keys;
and if the size of the file slice is smaller than or equal to the preset byte length, setting the slice number corresponding to the file slice to be zero, and storing the file slice into an HBase database by taking the time precision, the timestamp and the slice number of the file slice as row keys.
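The size check in claim 1 can be illustrated with a short Python sketch. It is a hedged example, not the claimed implementation: the `precision|timestamp|slice` key layout and the helper names are hypothetical, and deriving the number of sub-slices by rounding the size ratio up is an assumption, since the claim only requires that splitting follow the preset byte length. Sub-slices are numbered from 1 so that slice number 0 stays reserved for an unsplit file slice, consistent with the remainder-plus-one rule of claim 2.

```python
import math
from typing import List


def make_row_key(precision: str, timestamp: int, slice_no: int) -> bytes:
    """Row key from precision, timestamp and slice number (assumed layout)."""
    return f"{precision}|{timestamp:010d}|{slice_no}".encode("ascii")


def plan_row_keys(precision: str, timestamp: int,
                  slice_size: int, max_bytes: int) -> List[bytes]:
    """Decide which row keys a file slice of slice_size bytes is stored under."""
    if max_bytes <= 0:
        raise ValueError("preset byte length must be positive")
    if slice_size <= max_bytes:
        # Small enough: store as a single row with slice number zero.
        return [make_row_key(precision, timestamp, 0)]
    # Too large: split into n sub-slices (ceil is an assumption of this sketch).
    n = math.ceil(slice_size / max_bytes)
    return [make_row_key(precision, timestamp, k) for k in range(1, n + 1)]


small = plan_row_keys("300s", 1554336000, 100, 1024)
large = plan_row_keys("300s", 1554336000, 2500, 1024)
```

Here `small` holds one key ending in `|0`, while `large` holds three keys ending in `|1`, `|2` and `|3`.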
2. The method according to claim 1, wherein the step of further slicing the file slice to generate a plurality of sub-slices comprises:
determining the number N of sub-slices for segmenting the file slice according to the preset byte length;
performing a hash calculation on the measurement object identifier of the time sequence data, taking the result of the hash calculation modulo N, and then adding one to obtain the slice number corresponding to the time sequence data;
and merging the time sequence data with the same slice number into one sub-slice to generate N sub-slices.
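The slice-numbering rule of claim 2 can be sketched as follows. The claim does not name a hash function, so the use of CRC32 here is an assumption chosen because it gives the same result across runs (Python's built-in `hash` is salted for strings). The record layout of `(object_id, value)` pairs is likewise hypothetical.

```python
import zlib
from typing import Dict, Iterable, List, Tuple

Sample = Tuple[str, float]  # (measurement object identifier, value), assumed layout


def slice_number(object_id: str, n: int) -> int:
    """Hash the object identifier, take it modulo n, then add one (range 1..n)."""
    if n < 1:
        raise ValueError("number of sub-slices must be at least 1")
    return zlib.crc32(object_id.encode("utf-8")) % n + 1


def split_into_sub_slices(samples: Iterable[Sample], n: int) -> Dict[int, List[Sample]]:
    """Merge samples with the same slice number into one sub-slice.

    All n sub-slices are created up front, so a sub-slice may be empty if
    no identifier happens to hash into it.
    """
    sub_slices: Dict[int, List[Sample]] = {k: [] for k in range(1, n + 1)}
    for object_id, value in samples:
        sub_slices[slice_number(object_id, n)].append((object_id, value))
    return sub_slices


samples = [(f"cell-{i}", float(i)) for i in range(50)]
sub_slices = split_into_sub_slices(samples, 4)
```

Because the slice number depends only on the object identifier, all samples of one measurement object at a given timestamp land in the same sub-slice, and a reader looking for that object can compute which sub-slice to fetch.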
3. A method of analyzing time series data, wherein the time series data is stored by the method of any one of claims 1 to 2, the method comprising:
parsing the received SQL statement to obtain a time range for querying the time sequence data and a corresponding execution plan;
scanning the HBase database according to the time range to locate a plurality of time sequence data slices, reading the Value of each time sequence data slice, and generating a Spark RDD;
performing matching and filtering operations on the Spark RDD according to the execution plan to obtain a minimal data set (Minimal DataFrame) consistent with the execution plan;
and performing SQL computation based on the Minimal DataFrame.
4. The method according to claim 3, wherein the step of reading the Value of each time series data slice and generating the Spark RDD specifically comprises:
reading the Value of each time series data slice in Key-Value form by using the new Hadoop API (newHadoop Api) to generate the Spark RDD.
5. The method according to claim 3, wherein the step of performing matching and filtering operations on the Spark RDD according to the execution plan to obtain a minimal data set (Minimal DataFrame) consistent with the execution plan includes:
performing Value matching on the Spark RDD according to the execution plan to obtain a first RDD;
filtering the first RDD according to the execution plan to remove unneeded fields and obtain a second RDD;
and registering the second RDD as a DataFrame.
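The match-then-prune sequence of claim 5 can be illustrated without a Spark cluster. In the sketch below, plain Python lists of dictionaries stand in for the first and second RDDs, and the field names are invented for the example. In Spark, the two steps would roughly correspond to a `filter` followed by a `map` on the RDD before the result is turned into a DataFrame; that mapping is an interpretation, not something the claim spells out.

```python
from typing import Callable, Dict, List, Sequence

Row = Dict[str, object]


def match_rows(rows: List[Row], predicate: Callable[[Row], bool]) -> List[Row]:
    """Value matching: keep only rows that satisfy the plan's predicate."""
    return [row for row in rows if predicate(row)]


def prune_fields(rows: List[Row], needed: Sequence[str]) -> List[Row]:
    """Field filtering: keep only the columns the plan actually uses."""
    return [{name: row[name] for name in needed} for row in rows]


# Rows decoded from the scanned slices (hypothetical fields).
rows = [
    {"object_id": "cell-1", "timestamp": 1554336000, "traffic": 12.5, "vendor": "A"},
    {"object_id": "cell-2", "timestamp": 1554336000, "traffic": 3.0, "vendor": "B"},
    {"object_id": "cell-1", "timestamp": 1554336300, "traffic": 14.0, "vendor": "A"},
]

first = match_rows(rows, lambda r: r["object_id"] == "cell-1")  # "first RDD"
second = prune_fields(first, ["timestamp", "traffic"])          # "second RDD"
```

Shrinking the data before it is registered as a DataFrame means the SQL engine works on the smallest set of rows and columns that the execution plan needs.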
6. An apparatus for storing time series data, comprising:
the time precision determining module is used for acquiring the time precision of the time sequence data according to the sampling time interval of the time sequence data;
the aggregation module is used for aggregating a plurality of time sequence data with the same time stamp and the same time precision into a file slice according to the time stamp and the time precision of the time sequence data;
the storage module is used for storing the file slices into an HBase database by taking time precision and a timestamp as HBase row keys;
the step of storing the file slice into an HBase database by taking the time precision and the timestamp as HBase row keys specifically comprises the following steps:
acquiring the size of the file slice;
if the size of the file slice is larger than the preset byte length, further segmenting the file slice to generate a plurality of sub-slices;
storing the plurality of sub-slices into an HBase database by taking the time precision, the timestamp and the slice number of the sub-slice as row keys;
and if the size of the file slice is smaller than or equal to the preset byte length, setting the slice number corresponding to the file slice to be zero, and storing the file slice into an HBase database by taking the time precision, the timestamp and the slice number of the file slice as row keys.
7. An apparatus for analyzing time series data, wherein the time series data is stored by the method of any one of claims 1 to 2, the apparatus comprising:
the parsing module is used for parsing the received SQL statement to obtain a time range for querying the time sequence data and a corresponding execution plan;
the scanning module is used for scanning the HBase database according to the time range to locate a plurality of time sequence data slices, reading the Value of each time sequence data slice, and generating a Spark RDD;
the Minimal DataFrame generating module is used for performing matching and filtering operations on the Spark RDD according to the execution plan to obtain a minimal data set (Minimal DataFrame) consistent with the execution plan;
and the SQL execution module is used for performing SQL computation based on the Minimal DataFrame.
8. An electronic device, comprising:
at least one processor; and
at least one memory communicatively coupled to the processor, wherein:
the memory stores program instructions executable by the processor, the processor invoking the program instructions to perform the method of any of claims 1 to 5.
9. A non-transitory computer-readable storage medium storing computer instructions that cause a computer to perform the method of any one of claims 1 to 5.
CN201910270612.5A 2019-04-04 2019-04-04 Time sequence data storage method, time sequence data analysis method and time sequence data analysis device Active CN110109923B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910270612.5A CN110109923B (en) 2019-04-04 2019-04-04 Time sequence data storage method, time sequence data analysis method and time sequence data analysis device

Publications (2)

Publication Number Publication Date
CN110109923A CN110109923A (en) 2019-08-09
CN110109923B true CN110109923B (en) 2021-07-06

Family

ID=67485216

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910270612.5A Active CN110109923B (en) 2019-04-04 2019-04-04 Time sequence data storage method, time sequence data analysis method and time sequence data analysis device

Country Status (1)

Country Link
CN (1) CN110109923B (en)

Families Citing this family (13)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110750536B (en) * 2019-10-11 2020-06-23 清华大学 Vibration noise smoothing method and system for attitude time series data
CN110765154A (en) * 2019-10-16 2020-02-07 华电莱州发电有限公司 Method and device for processing mass real-time generated data of thermal power plant
CN111125074A (en) * 2019-12-12 2020-05-08 深圳供电局有限公司 Power distribution Internet of things data processing method and device
CN111078753B (en) * 2019-12-17 2024-02-27 联想(北京)有限公司 Time sequence data storage method and device based on HBase database
CN111159235A (en) * 2019-12-20 2020-05-15 中国建设银行股份有限公司 Data pre-partition method and device, electronic equipment and readable storage medium
CN111274256B (en) * 2020-01-20 2023-09-12 远景智能国际私人投资有限公司 Resource management and control method, device, equipment and storage medium based on time sequence database
CN111324487A (en) * 2020-01-21 2020-06-23 北京市天元网络技术股份有限公司 Method and device for copying communication guarantee data
CN111400265B (en) * 2020-03-04 2023-04-07 浙江永贵电器股份有限公司 Storage method based on large-redundancy time sequence data
CN111522846B (en) * 2020-04-09 2023-08-22 浙江邦盛科技股份有限公司 Data aggregation method based on time sequence intermediate state data structure
CN111639060A (en) * 2020-06-08 2020-09-08 华润电力技术研究院有限公司 Thermal power plant time sequence data processing method, device, equipment and medium
CN111813782A (en) * 2020-07-14 2020-10-23 杭州海康威视数字技术股份有限公司 Time sequence data storage method and device
CN112612823B (en) * 2020-12-14 2022-07-19 南京铁道职业技术学院 Big data time sequence analysis method based on fusion of Pyspark and Pandas
CN117472915B (en) * 2023-12-27 2024-03-15 中国西安卫星测控中心 Hierarchical storage method of time sequence data oriented to multiple Key values

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103605805A (en) * 2013-12-09 2014-02-26 冶金自动化研究设计院 Storage method of massive time series data
CN106648446A (en) * 2015-10-30 2017-05-10 阿里巴巴集团控股有限公司 Time series data storage method and apparatus, and electronic device
CN107391719A (en) * 2017-07-31 2017-11-24 南京邮电大学 Distributed stream data processing method and system in a cloud environment
CN108337122A (en) * 2018-02-22 2018-07-27 深圳市脉山龙信息技术股份有限公司 Operation management system based on distributed stream computing
CN108921188A (en) * 2018-05-23 2018-11-30 重庆邮电大学 Parallel CRF algorithm based on the Spark big data platform

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8943107B2 (en) * 2012-12-04 2015-01-27 At&T Intellectual Property I, L.P. Generating and using temporal metadata partitions

Also Published As

Publication number Publication date
CN110109923A (en) 2019-08-09

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant