WO2023103626A1 - Data downsampling and data query method, system and storage medium - Google Patents

Data downsampling and data query method, system and storage medium

Info

Publication number
WO2023103626A1
Authority
WO
WIPO (PCT)
Prior art keywords
data
downsampling
downsampled
query
storage medium
Prior art date
Application number
PCT/CN2022/127512
Inventor
朱龙成
刘志鹏
李飞勃
张友东
杨成虎
Original Assignee
阿里巴巴(中国)有限公司
淘宝(中国)软件有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 阿里巴巴(中国)有限公司, 淘宝(中国)软件有限公司
Publication of WO2023103626A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/16 File or folder operations, e.g. details of user interfaces specifically adapted to file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/172 Caching, prefetching or hoarding of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2453 Query optimisation
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries

Definitions

  • The present application relates to the technical field of data processing, and in particular to a data downsampling method, a data query method, a system and a storage medium.
  • Time series data is a series of data continuously generated at a certain frequency.
  • Large amounts of time series data exist in fields such as application performance monitoring (Application Performance Monitor, APM), the Internet of Things and the Industrial Internet.
  • Time series databases are designed to store and query such time series data efficiently.
  • One type of requirement in time series databases is to downsample the original data.
  • Generally, real-time downsampling is performed at data query time.
  • This downsampling method needs to scan the original data from the disk files corresponding to the time series database. For queries with a relatively large time span, a large amount of original data needs to be scanned, so data query efficiency is low.
  • Various aspects of the present application provide a data downsampling method, a data query method, a system and a storage medium to improve data query efficiency.
  • An embodiment of the present application provides a data downsampling method, including: writing the acquired original data into a memory; when the original data in the memory reaches a set data amount, writing the original data in the memory into a first persistent storage medium; in the process of writing the original data into the first persistent storage medium, performing downsampling processing on the target original data written into the first persistent storage medium according to a preset downsampling rule, so as to obtain downsampled data; and writing the downsampled data into a second persistent storage medium.
  • An embodiment of the present application also provides a data query method, including: obtaining a query request, the query request being used for an aggregation query; according to the query request, querying the memory and the persistent storage medium that stores downsampled data; in the case where data satisfying the query request exists in the memory, obtaining first original data and first downsampled data that satisfy the query request from the memory and the persistent storage medium respectively; according to the query request, performing downsampling processing on the first original data to obtain second downsampled data; and determining a query result of the query request based on the first downsampled data and the second downsampled data.
  • An embodiment of the present application also provides a computing system, including: a storage and a processor; the storage includes a memory and a persistent storage medium; the processor is communicatively connected to the memory and the persistent storage medium, and is used to execute the steps in the above data downsampling method and/or the above data query method.
  • An embodiment of the present application also provides a computer-readable storage medium storing computer instructions. When the computer instructions are executed by one or more processors, the one or more processors are caused to perform the steps in the above data downsampling method and/or the above data query method.
  • In the embodiments of the present application, in the process of writing the original data into the persistent storage medium, the target original data written into the persistent storage medium is subjected to downsampling processing, and the downsampled data obtained by the downsampling processing is stored, thereby realizing pre-downsampling of the original data.
  • In a subsequent downsampling query, the pre-downsampling result can be queried directly, without real-time downsampling of the original data during the query, which helps to improve the efficiency of subsequent downsampling queries.
  • FIG. 1A is a schematic flow diagram of a data downsampling method provided in an embodiment of the present application.
  • FIG. 1B is a schematic diagram of the data downsampling process provided by the embodiment of the present application.
  • FIG. 2 is a schematic diagram of the field structure provided by the embodiment of the present application.
  • FIG. 3 is a schematic flow diagram of a data query method provided in an embodiment of the present application.
  • FIG. 4 is a schematic diagram of the data query process provided by the embodiment of the present application.
  • FIG. 5 is a schematic diagram of the downsampling file merging process provided by the embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a computing system provided by an embodiment of the present application.
  • Users often need downsampling queries when querying data.
  • For example, a temperature sensor reports the temperature once per minute, and a query needs the average temperature per hour over the past 7 days.
  • In this case, the per-minute raw temperature data needs to be downsampled to per-hour average temperature data.
  • Generally, real-time downsampling is performed at data query time. This downsampling method needs to scan the original data from the disk file corresponding to the original data. For queries with a relatively large time span, a large amount of original data needs to be scanned, so data query efficiency is low, and querying a large amount of original data consumes a lot of memory resources. The real-time downsampling calculation on the original data also consumes a lot of CPU resources.
  • In the embodiments of the present application, in the process of writing the original data into the persistent storage medium, the target original data written into the persistent storage medium is downsampled, and the downsampled data obtained by the downsampling processing is stored, thereby realizing pre-downsampling of the original data.
  • In a subsequent downsampling query, the pre-downsampling result can be queried directly, without real-time downsampling of the original data during the query, which helps to improve the efficiency of subsequent downsampling queries.
  • FIG. 1A is a schematic flowchart of a data downsampling method 100 provided in an embodiment of the present application. As shown in FIG. 1A , the method 100 includes step 101 to step 104 .
  • In step 101, the acquired original data is written into the memory.
  • In step 102, when the original data in the memory reaches a set data amount, the original data in the memory is written to a first persistent storage medium.
  • In step 103, during the process of writing the original data into the first persistent storage medium, the target original data written into the first persistent storage medium is downsampled according to a preset downsampling rule, so as to obtain downsampled data.
  • In step 104, the downsampled data is written to a second persistent storage medium.
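  • Steps 101 to 104 can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the application: the class name `MemStore`, the parameters `flush_size` and `interval_s`, and the two in-process lists standing in for the first and second persistent storage media are all hypothetical.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

Point = Tuple[int, float]  # (timestamp in seconds, value)


class MemStore:
    """Hypothetical in-memory buffer that flushes once it holds `flush_size` points."""

    def __init__(self, flush_size: int, interval_s: int,
                 aggregator: Callable[[List[float]], float]) -> None:
        self.flush_size = flush_size
        self.interval_s = interval_s
        self.aggregator = aggregator
        self.buffer: List[Point] = []
        # Plain lists stand in for the first and second persistent storage media.
        self.raw_store: List[List[Point]] = []
        self.downsampled_store: List[Dict[int, float]] = []

    def write(self, point: Point) -> None:
        # Step 101: write the acquired original data into memory.
        self.buffer.append(point)
        # Step 102: flush once the set data amount is reached.
        if len(self.buffer) >= self.flush_size:
            self.flush()

    def flush(self) -> None:
        batch, self.buffer = self.buffer, []
        self.raw_store.append(batch)
        # Step 103: downsample the batch while it is being flushed.
        buckets: Dict[int, List[float]] = defaultdict(list)
        for ts, value in batch:
            buckets[ts - ts % self.interval_s].append(value)
        downsampled = {start: self.aggregator(values)
                       for start, values in sorted(buckets.items())}
        # Step 104: persist the downsampled data.
        self.downsampled_store.append(downsampled)
```

  • Because the downsampling runs on the batch already held in memory at flush time, the sketch never reads the original data back from storage, which is the property the method relies on.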
  • The original data may be time series data, that is, a series of data continuously generated at a certain frequency.
  • The original data can be acquired.
  • The physical machine may be a terminal device such as a computer, a single server device, or a cloud-based server array.
  • A physical machine may also refer to other computing devices with corresponding service capabilities, such as a terminal device (for example, a computer) running a service program.
  • The physical machine can provide data management services.
  • For example, the physical machine can provide data storage, data processing and data query services.
  • A physical machine may maintain a database.
  • The database may be a time series database for storing time series data and providing time series data query services.
  • The acquired original data may be written into the memory of the physical machine.
  • The original data can be written to the MemStore space of the memory. Because the storage space of the memory is limited, when the amount of data stored in the memory reaches a set data amount, the data stored in the memory needs to be written to a persistent storage medium for preservation.
  • As shown in step 102 of FIG. 1A and in FIG. 1B, when the original data in the memory reaches a set data amount, the original data in the memory can be written to a persistent storage medium.
  • The persistent storage medium mainly refers to a non-volatile storage medium, such as a magnetic disk, a floppy disk, a hard disk, a digital versatile disc (DVD) or other optical storage, a magnetic tape, or a compact disc read-only memory (CD-ROM).
  • The persistent storage medium may be deployed on the same physical machine as the memory, or on a different physical machine from the memory.
  • Pre-downsampling can be performed on the original data, so that a downsampling query can read the downsampled data directly without downsampling the original data during the query, which can effectively improve data query efficiency.
  • The set downsampling rule is used to downsample the target original data written to the persistent storage medium to obtain the downsampled data.
  • The specific manner of obtaining the downsampling rule is not limited.
  • The downsampling rule may be set independently by the user or the provider of the original data.
  • The storage system may provide an interactive interface for users to access; users (users or providers of the original data, etc.) may set downsampling rules independently through the interactive interface.
  • A downsampling rule generally includes a sampling time interval and an aggregation operator. The sampling time interval mainly refers to the time interval at which the original data is downsampled.
  • The aggregation operator refers to the downsampling method applied to the original data within each sampling time interval.
  • The aggregation operator can be a metric aggregation operator, a bucket aggregation operator, a matrix aggregation operator, or a pipeline aggregation operator.
  • Metric aggregation operators may include: maximum value (max), minimum value (min), sum (sum), average value (avg), value statistics, distinct aggregation, percentage statistics, and percentage ranking aggregation, etc.
  • For example, a downsampling rule may indicate that the original data in the database "db" is summed at sampling time intervals of 5s (5 seconds) and 5m (5 minutes).
  • Step 103 can be implemented as: obtaining the sampling time interval and the aggregation operator from the preset downsampling rule; for the target original data currently being written to the persistent storage medium, obtaining the target original data in each sampling time interval; and aggregating the target original data in each sampling time interval according to the aggregation operator in the downsampling rule, to obtain the downsampled data for that sampling time interval.
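  • The interval-then-aggregate procedure of step 103 can be sketched as below. The names `DownsampleRule`, `AGGREGATORS` and `downsample` are hypothetical, and aligning each sampling time interval to a multiple of the interval length is an assumption made for illustration; the application does not specify how interval boundaries are chosen.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, Iterable, List, Tuple

# Hypothetical registry covering a few of the metric operators named in the text.
AGGREGATORS: Dict[str, Callable[[List[float]], float]] = {
    "max": max,
    "min": min,
    "sum": sum,
    "avg": mean,
}


@dataclass(frozen=True)
class DownsampleRule:
    interval_s: int   # sampling time interval, in seconds
    aggregator: str   # key into AGGREGATORS


def downsample(points: Iterable[Tuple[int, float]],
               rule: DownsampleRule) -> Dict[int, float]:
    """Group (timestamp, value) points by interval start, then aggregate each group."""
    aggregate = AGGREGATORS[rule.aggregator]
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in points:
        # Assumption: intervals are aligned to multiples of interval_s.
        buckets[ts - ts % rule.interval_s].append(value)
    return {start: aggregate(values) for start, values in sorted(buckets.items())}
```

  • For instance, with a 5-second interval and the sum operator, points at 0s and 2s fall into the interval starting at 0, and points at 5s and 9s fall into the interval starting at 5.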
  • A data table may include fields (Field). A field may include a field name and field values, and the field name can be used to index the corresponding field values. In some embodiments, field values with the same field name can be stored in columns or rows; in this way, all field values of a field can be indexed by its field name. For example, as shown in FIG. 2, temperature (Temperature) may be a field name; timestamp (Timestamp) and temperature value (Value) may be field values corresponding to the field name temperature.
  • Original data of the same attribute can be aggregated; original data of different attributes cannot be aggregated.
  • For example, suppose temperature time series data, humidity time series data and an air pollution index are obtained. Since temperature and humidity are attributes of different dimensions, aggregating temperature time series data with humidity time series data is meaningless.
  • When downsampling the target original data written to the persistent storage medium, the target original data may be divided into at least one data unit according to the field names of the target original data.
  • The field values corresponding to the same field name in the target original data can be divided into one data unit, to obtain at least one data unit.
  • One data unit can be one field.
  • The specific number of data units may be determined by the number of field names included in the target original data.
  • Downsampling may be performed on the at least one data unit according to the preset downsampling rule, so as to obtain the downsampled data corresponding to each data unit, and thereby the downsampled data corresponding to the target original data.
  • The sampling time interval and the aggregation operator can be obtained from the preset downsampling rule; for any data unit A, the target original data in each sampling time interval is acquired from the data unit A.
  • The target original data in each sampling time interval may be acquired according to the timestamp information in the data unit A.
  • The target original data in each sampling time interval may then be aggregated according to the aggregation operator, so as to obtain the downsampled data corresponding to the data unit A.
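  • Dividing the target original data into data units by field name, and then downsampling each unit separately, might look like the following sketch. The record layout `(field name, timestamp, value)` and the function names are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

Record = Tuple[str, int, float]  # (field name, timestamp in seconds, field value)


def split_into_data_units(records: Iterable[Record]) -> Dict[str, List[Tuple[int, float]]]:
    """Put all field values that share a field name into one data unit."""
    units: Dict[str, List[Tuple[int, float]]] = defaultdict(list)
    for field_name, ts, value in records:
        units[field_name].append((ts, value))
    return dict(units)


def downsample_units(records: Iterable[Record], interval_s: int,
                     aggregate: Callable[[List[float]], float]) -> Dict[str, Dict[int, float]]:
    """Downsample each data unit on its own, so different attributes are never mixed."""
    result: Dict[str, Dict[int, float]] = {}
    for field_name, points in split_into_data_units(records).items():
        buckets: Dict[int, List[float]] = defaultdict(list)
        for ts, value in points:
            buckets[ts - ts % interval_s].append(value)
        result[field_name] = {start: aggregate(values)
                              for start, values in sorted(buckets.items())}
    return result
```

  • Keeping one bucket map per field name is what prevents, for example, temperature values from being aggregated together with humidity values.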
  • The downsampled data may also be written into a persistent storage medium for storage.
  • The persistent storage medium that stores the original data is defined as the first persistent storage medium; the persistent storage medium that stores the downsampled data is defined as the second persistent storage medium.
  • The first persistent storage medium and the second persistent storage medium may be the same storage medium, or may be different persistent storage media.
  • The first persistent storage medium and the second persistent storage medium can be mounted on the same physical machine, or on different physical machines.
  • There can be one or more first persistent storage media and one or more second persistent storage media.
  • A plurality means two or more. Multiple first persistent storage media can be mounted on the same physical machine, or on different physical machines. Likewise, multiple second persistent storage media may be mounted on the same physical machine or on different physical machines.
  • In this embodiment, in the process of writing the original data into the persistent storage medium, the target original data written to the persistent storage medium is downsampled, and the downsampled data obtained by the downsampling processing is stored, thereby realizing pre-downsampling of the original data.
  • In a subsequent downsampling query, the pre-downsampling result can be queried directly, without real-time downsampling of the original data during the query, which helps to improve the efficiency of subsequent downsampling queries.
  • The data downsampling provided in this embodiment takes place in the memory flush (MemStore Flush) stage, that is, the target original data is downsampled while the data in the memory is being written to the first persistent storage medium. Compared with CQ downsampling, there is no need to query the inverted index and forward index of the original data to obtain the original data, which can reduce memory and CPU resource consumption.
  • In a downsampling query, both the original data in the memory and the stored downsampled data can be queried.
  • The original data in the memory is downsampled in real time, while the downsampled data that satisfies the query request is obtained directly from the stored downsampled data, and the two together give the data query result. Since the original data in the memory is the latest original data, combining it with the query result of the stored downsampled data achieves a full downsampled data query, which overcomes the drawback that CQ downsampling cannot query the latest downsampled data.
  • For the part that queries the downsampled data directly, no downsampling processing is required during the query, which helps to improve data query efficiency compared with a real-time downsampling query.
  • The storage system maintained in the embodiment of the present application can provide both downsampling queries and non-downsampling queries.
  • For non-downsampling query requests, the original data in the memory and the original data in the first persistent storage medium can be queried.
  • This query process is the same as or similar to data queries in existing storage systems and is not the focus of this application. Therefore, the data query method provided by the embodiment of the present application is described below by taking the aggregation query (that is, the downsampling query) as an example.
  • FIG. 3 is a schematic flowchart of a data query method 300 provided by an embodiment of the present application. As shown in FIG. 3 , the data query method 300 includes steps 301 to 305 .
  • In step 301, a query request is obtained; the query request is used for an aggregation query.
  • In step 302, the memory and the second persistent storage medium are queried according to the query request.
  • In step 303, in the case where data satisfying the query request exists in the memory, the first original data and the first downsampled data satisfying the query request are obtained from the memory and the second persistent storage medium respectively.
  • In step 304, according to the query request, downsampling processing is performed on the first original data to obtain second downsampled data.
  • In step 305, a query result of the query request is determined based on the first downsampled data and the second downsampled data.
  • The query request may be a non-aggregation query or an aggregation query.
  • The embodiment of the present application takes the aggregation query as an example to illustrate the data query method.
  • In step 301, a query request can be obtained, and the query request is used for an aggregation query.
  • A query request may contain query conditions.
  • The query conditions may include: the data object to be queried, the aggregation operator, the time range of the query, etc.
  • The original data in the memory is the most recently written data. Since the time ranges and data objects queried by different query requests may differ, the memory may or may not contain data that satisfies a given query request, and the storage system cannot determine this in advance. Therefore, in order to improve the timeliness and accuracy of data queries and prevent the latest data from being missed, as shown in step 302 in FIG. 3 and in FIG. 4, the memory and the second persistent storage medium can both be queried according to the query request.
  • Semantic analysis may be performed on the query request to obtain the query conditions of the query request.
  • The query request can be compiled into an abstract syntax tree (Abstract Syntax Tree, AST), and error detection is performed on the query statement in the process to ensure that the input statement has no grammatical or lexical errors. For example, the check detects whether a keyword is misspelled, whether there is redundant punctuation, whether the whole statement is legal, and so on.
  • The nodes of the abstract syntax tree may then be checked in sequence, metadata of the related tables and attributes attached to the syntax tree, and finally a syntax tree containing semantics (bound AST) generated.
  • The access requirements of the query request can be obtained from the syntax tree containing semantics.
  • An execution plan may be generated according to the query conditions.
  • The optimizer can generate a logical operator tree (LOT) from the semantic syntax tree.
  • The nodes of the semantic syntax tree may be mapped to operator nodes to obtain a logical execution tree.
  • Each node on the logical execution tree is called a logical operator.
  • The physical operator corresponding to each logical operator may be expanded to obtain physical execution trees.
  • The physical execution tree with the least cost can be selected from the physical execution trees as the execution plan. The least cost can be the shortest path, the minimum memory consumption, the minimum amount of calculation, the shortest calculation time, and so on.
  • The memory and the second persistent storage medium can be queried according to the execution plan.
  • In the case where no data satisfying the query request exists in the memory, the downsampled data satisfying the query request may be obtained from the second persistent storage medium, and the query result of the query request is determined based on that downsampled data. With this data query method, the downsampled data that satisfies the query request is obtained directly from the stored downsampled data, without real-time downsampling of the original data during the query, which helps to improve data query efficiency.
  • In the case where data satisfying the query request exists in the memory, the original data satisfying the query request (defined as the first original data) and the downsampled data satisfying the query request can be obtained from the memory and the second persistent storage medium respectively.
  • In step 304, downsampling may be performed on the original data that was obtained from the memory and satisfies the query request, to obtain downsampled data.
  • The downsampled data that is obtained from the second persistent storage medium and satisfies the query request is defined as the first downsampled data;
  • the downsampled data obtained by downsampling the first original data is defined as the second downsampled data.
  • The aggregation operator and the sampling time interval may be obtained from the query request.
  • According to the sampling time interval included in the query request, the original data corresponding to each sampling time interval can be obtained from the original data satisfying the query request;
  • the aggregation operator then aggregates the original data corresponding to each sampling time interval to obtain the second downsampled data.
  • A query result corresponding to the query request may be determined based on the first downsampled data and the second downsampled data.
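  • Steps 303 to 305 can be sketched as follows. The function name `aggregate_query` and its parameters are hypothetical. How the two partial results are combined when a sampling time interval appears in both is not spelled out here, so the sketch assumes a decomposable operator such as max, min or sum, for which re-applying the operator to two partial results is valid; an average would need sums and counts to be kept instead.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple


def aggregate_query(memory_points: List[Tuple[int, float]],
                    persisted_downsampled: Dict[int, float],
                    start: int, end: int, interval_s: int,
                    aggregate: Callable[[List[float]], float]) -> Dict[int, float]:
    """Answer an aggregation query over the half-open time range [start, end)."""
    # Step 303: first downsampled data (persisted) and first original data (in memory).
    first = {ts: v for ts, v in persisted_downsampled.items() if start <= ts < end}
    raw = [(ts, v) for ts, v in memory_points if start <= ts < end]

    # Step 304: downsample the in-memory original data in real time.
    buckets: Dict[int, List[float]] = defaultdict(list)
    for ts, value in raw:
        buckets[ts - ts % interval_s].append(value)
    second = {bucket: aggregate(values) for bucket, values in buckets.items()}

    # Step 305: combine both parts. Assumption: the operator is decomposable,
    # so an interval present in both parts can be re-aggregated.
    result = dict(first)
    for bucket, value in second.items():
        result[bucket] = aggregate([result[bucket], value]) if bucket in result else value
    return dict(sorted(result.items()))
```

  • Only the most recent, unflushed points go through real-time aggregation; everything already flushed is read as pre-computed values.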
  • The data query method provided in this embodiment can query both the original data in the memory and the stored downsampled data.
  • The original data in the memory is downsampled in real time, while the downsampled data that satisfies the query request is obtained directly from the stored downsampled data, and the two together give the data query result. Since the original data in the memory is the latest original data, combining it with the query result of the stored downsampled data achieves a full downsampled data query, which improves the timeliness and accuracy of the data query and overcomes the drawback that CQ downsampling cannot query the latest downsampled data.
  • For the part that queries the downsampled data directly, no downsampling processing is required during the query, which helps to improve data query efficiency compared with a real-time downsampling query.
  • The data query method provided by the embodiment of the present application therefore still has higher data query efficiency.
  • The way data is stored may affect the data query process. Therefore, the specific implementation of the downsampling query (aggregation query) is described below in conjunction with the storage of the downsampled data and the process of writing the downsampled data to the second persistent storage medium.
  • The specific form in which the downsampled data is written into the second persistent storage medium is not limited.
  • The downsampled data stored in the second persistent storage medium is generally obtained by downsampling according to different downsampling rules. To facilitate subsequent queries and improve their efficiency, in the embodiment of this application, for any of the above data units A, a target field name (Field) representing the downsampling rule and the downsampling object can be determined according to the downsampling rule corresponding to data unit A and the field name of data unit A.
  • The specific format of the target field name (Field) is not limited.
  • The format of the target field name may be expressed as: "{raw_field}_{aggregator}_{interval}".
  • "raw_field" indicates the column field name, that is, the field name of the data unit, which represents the downsampling object.
  • "aggregator" indicates the aggregation operator;
  • "interval" indicates the sampling time interval.
  • For example, if the downsampling rule represents "max downsampling at a sampling interval of 30s" and the downsampling object is the CPU field,
  • the target field name can be expressed as "cpu_max_30s".
  • The target field name can be used as the field name and the downsampled data of data unit A as the field value of the target field name; the target field name and the downsampled data corresponding to data unit A are then written into the second persistent storage medium.
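  • Building the target field name in the "{raw_field}_{aggregator}_{interval}" format and storing the downsampled data under it can be sketched as below. The function names and the dictionary standing in for the second persistent storage medium are hypothetical.

```python
from typing import Dict


def target_field_name(raw_field: str, aggregator: str, interval: str) -> str:
    """Compose the target field name from the downsampling object and rule."""
    return f"{raw_field}_{aggregator}_{interval}"


def write_downsampled(store: Dict[str, Dict[int, float]], raw_field: str,
                      aggregator: str, interval: str,
                      downsampled: Dict[int, float]) -> str:
    """Store downsampled values as the field values of the target field name.

    `store` is a plain dict standing in for the second persistent storage medium.
    """
    name = target_field_name(raw_field, aggregator, interval)
    store.setdefault(name, {}).update(downsampled)
    return name
```

  • Because the rule is encoded in the field name, results produced by different rules for the same raw field end up under different field names and can be indexed independently.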
  • In a downsampling query, the target field name that satisfies the query condition can be determined according to the query condition in the downsampling query request; the field values indexed by that target field name are then the downsampled data that satisfies the query condition.
  • The data query can thus be performed by the target field name corresponding to the downsampled data, without querying all the downsampled data, which helps to improve data query efficiency.
  • The query condition corresponding to the query request can be obtained from the query request, and converted into a first field name in the format of the field names of the downsampled data in the persistent storage medium, that is, the format of the above target field name.
  • The data object to be queried, the aggregation operator and the sampling time interval can be obtained from the query condition and, according to the format of the target field name, converted into the first field name in that format.
  • For example, the data object to be queried is the CPU field, the aggregation operator is the max operator, and the sampling interval is 30s.
  • The first field name converted from this query condition is "cpu_max_30s".
  • The second persistent storage medium may be queried according to the first field name to determine the downsampled data corresponding to the first field name.
  • The first downsampled data meeting the query condition may then be acquired from the downsampled data corresponding to the first field name.
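  • Converting a query condition into the first field name and using it to index the stored downsampled data can be sketched as follows. `QueryCondition`, its attributes and the dictionary-based store are assumptions for illustration; the half-open time range is also an assumption.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class QueryCondition:
    field: str        # data object to be queried, e.g. "cpu"
    aggregator: str   # aggregation operator, e.g. "max"
    interval: str     # sampling time interval, e.g. "30s"
    start: int        # query time range, inclusive start (seconds)
    end: int          # query time range, exclusive end (seconds)


def first_field_name(cond: QueryCondition) -> str:
    """Render the query condition in the target field name format."""
    return f"{cond.field}_{cond.aggregator}_{cond.interval}"


def lookup_first_downsampled(store: Dict[str, Dict[int, float]],
                             cond: QueryCondition) -> Dict[int, float]:
    """Index the store by the first field name, then keep values in the time range."""
    column = store.get(first_field_name(cond), {})
    return {ts: v for ts, v in sorted(column.items()) if cond.start <= ts < cond.end}
```

  • The lookup touches only the one field whose name matches the query, which is why the remaining downsampled data does not have to be scanned.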
  • The original data and the downsampled data can be stored in the form of files.
  • A file refers to an encoding of information used for storing that information, and the specific implementation form of the file is not limited.
  • The file may be a data table or the like.
  • The storage file of the original data is defined as the original file; the storage file of the downsampled data is defined as the downsampling file.
  • Each time the original data in the memory reaches the set data amount, an operation of writing the original data in the memory to the first persistent storage medium is started, forming an original file;
  • correspondingly, an operation of downsampling the target original data written into the first persistent storage medium and writing the downsampled data into the second persistent storage medium is started, forming a downsampling file.
  • The downsampling files may be stored in a hierarchical organization structure.
  • Each level is used to store up to a set threshold number of downsampling files.
  • The set threshold corresponding to each level is denoted by M, where M ≥ 2 and M is an integer.
  • The thresholds corresponding to different levels may be the same or different.
  • When the number of downsampling files at a level reaches M, the aggregation operator in the downsampling rule can be used to perform an aggregation operation on the downsampling results corresponding to overlapping time windows, and the M aggregated downsampling files are merged into one downsampling file. The merged downsampling file is then saved to the upper level. Since the downsampled data of overlapping time windows is deduplicated during the merging of the downsampling files, storing the downsampling files in this hierarchical structure can reduce the storage space occupied by the downsampled data.
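  • The level-by-level merge can be sketched as below. The nested list `levels` (index 0 being the lowest level), the single threshold `m` shared by all levels, and the function names are assumptions for illustration. Re-aggregating overlapping time windows with the same operator is assumed to be valid, which holds for operators such as max, min and sum.

```python
from collections import defaultdict
from typing import Callable, Dict, List

DownsampleFile = Dict[int, float]  # time window start -> downsampled value


def merge_downsample_files(files: List[DownsampleFile],
                           aggregate: Callable[[List[float]], float]) -> DownsampleFile:
    """Merge files into one, re-aggregating values whose time windows overlap."""
    merged: Dict[int, List[float]] = defaultdict(list)
    for downsample_file in files:
        for window_start, value in downsample_file.items():
            merged[window_start].append(value)
    return {start: aggregate(values) for start, values in sorted(merged.items())}


def add_downsample_file(levels: List[List[DownsampleFile]], new_file: DownsampleFile,
                        m: int, aggregate: Callable[[List[float]], float]) -> None:
    """Add a file to the lowest level; merge upward whenever a level holds m files."""
    if not levels:
        levels.append([])
    levels[0].append(new_file)
    level = 0
    while len(levels[level]) >= m:
        merged = merge_downsample_files(levels[level], aggregate)
        levels[level] = []
        if level + 1 == len(levels):
            levels.append([])
        levels[level + 1].append(merged)
        level += 1
```

  • After a merge, each overlapping time window is stored once instead of once per source file, which is where the storage saving described above comes from.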
  • The first downsampled data that satisfies the query request and is obtained from the second persistent storage medium may be located in one downsampling file, or in multiple downsampling files.
  • A plurality means two or more.
  • The query result corresponding to the query request may be determined based on the first aggregated downsampled data and the second downsampled data.
  • the deleted original data may be marked to obtain a tombstone (Tombstone) record.
  • the tombstone record is used to record the deleted original data information.
  • the original data recorded in the tombstone record may be original data deleted logically from the first persistent storage medium, or original data actually physically deleted.
  • the downsampled data corresponding to the tombstone record may be determined according to the time information of the data in the tombstone record and the time information of the downsampled data stored in the second persistent storage medium.
  • for embodiments in which the downsampled data is stored in the form of downsampling files, the downsampling file corresponding to the tombstone record may be determined according to the time information of the data in the tombstone record and the time information of the data in the downsampling files stored in the second persistent storage medium.
  • the downsampling data corresponding to the tombstone record may be determined from the downsampling file corresponding to the tombstone record during the merging process of the downsampling file corresponding to the tombstone record.
  • the downsampled data in the downsampling file corresponding to the tombstone record whose time window overlaps that of the data in the tombstone record is the downsampled data corresponding to the tombstone record.
  • the downsampled data corresponding to the tombstone record may be deleted during the merging of the downsampling file corresponding to the tombstone record, so that the downsampled data corresponding to the deleted original data no longer exists in the merged downsampling file; this realizes synchronous deletion of downsampled data and original data, and solves the defect that the above-mentioned CQ downsampling method cannot delete downsampled data synchronously when original data is deleted.
  • when the query result of the query request is determined during the aggregation query, the tombstone record used to mark the deleted original data may be acquired; according to the time information of the data in the tombstone record and the time information of the data in the first downsampled data, it is judged whether the first downsampled data contains the downsampled data corresponding to the tombstone record; if the judgment result is yes, the downsampled data corresponding to the tombstone record can be deleted from the first downsampled data; and the second downsampled data and the first downsampled data after that deletion are determined as the query result of the query request.
  • for the embodiment in which the first downsampled data is located in multiple downsampling files and the downsampled data in those files has overlapping time windows, when the query result corresponding to the query request is determined based on the aggregated first downsampled data and the second downsampled data, it is also possible to judge, according to the time information of the data in the tombstone record and the time information of the data in the aggregated first downsampled data, whether the aggregated first downsampled data contains the downsampled data corresponding to the tombstone record; if the judgment result is yes, the downsampled data corresponding to the tombstone record can be deleted from the aggregated first downsampled data; and the second downsampled data and the aggregated first downsampled data after that deletion are determined as the query result corresponding to the query request.
  • the query result may be returned to the provider of the query request.
  • the reason why an aggregation query can find, in the downsampled data, the downsampled data that meets the aggregation query request is mainly that the downsampling rule corresponding to the downsampled data can be set by the provider of the query request.
  • the provider of the query request can independently set the downsampling rules according to its own query requirements, and pre-store them in the module, apparatus, device or system that executes the data downsampling method provided by the embodiment of the present application.
  • the subject of execution of each step of the method provided in the foregoing embodiments may be the same device, or the method may also be executed by different devices.
  • the execution subject of steps 301 and 302 may be device A; for another example, the execution subject of step 301 may be device A, and the execution subject of step 302 may be device B; and so on.
  • an embodiment of the present application also provides a computer-readable storage medium storing computer instructions; when the computer instructions are executed by one or more processors, the one or more processors are caused to execute the steps in the above data downsampling method and/or data query method.
  • the embodiment of the present application also provides a computer program product, where the computer program product includes a computer program.
  • when the computer program is executed by a processor, the processor is caused to execute the steps in the above data downsampling method and/or data query method.
  • the specific implementation form of the computer program product is not limited.
  • a computer program product may be implemented as a query engine, a data processing system against a database, or an executor in a query engine, among others.
  • FIG. 6 is a schematic structural diagram of a computing system provided by an embodiment of the present application.
  • the computing system includes: a memory 61 and a processor 62 .
  • the memory 61 may include: an internal memory 61a and a persistent storage medium 61b.
  • the memory 61 and the processor 62 may be located on the same physical machine, or may be located on different physical machines.
  • the memory 61a and the persistent storage medium 61b may belong to the same physical machine, or may belong to different physical machines.
  • the memory 61a and the processor 62 belong to the same physical machine.
  • Plural means two or more. Multiple persistent storage media 61b may belong to the same physical machine, or may belong to different physical machines.
  • the memory 61 a and the persistent storage medium 61 b are in communication connection with the processor 62 .
  • the processor 62 can be used to: write the acquired original data into the internal memory 61a; when the original data in the internal memory 61a reaches a set data volume, write the original data in the internal memory 61a to the first persistent storage medium 61b1.
  • the first persistent storage medium 61b1 and the second persistent storage medium 61b2 may be the same storage medium, or may be different storage mediums.
  • when the processor 62 performs downsampling processing on the target original data written to the first persistent storage medium, it is specifically configured to: divide the target original data into at least one data unit according to the field name of the target original data; and, according to a preset downsampling rule, downsample each of the at least one data unit to obtain downsampled data.
  • when the processor 62 performs downsampling processing on the at least one data unit, it is specifically configured to: obtain the sampling time interval and the aggregation operator from the preset downsampling rule; for any data unit, obtain from that data unit the target original data in each sampling time interval; and, according to the aggregation operator, aggregate the target original data in each sampling time interval to obtain the downsampled data corresponding to that data unit.
  • when the processor 62 writes the downsampling processing result into the second persistent storage medium 61b2, it is specifically configured to: for the downsampled data corresponding to any data unit, determine, according to the downsampling rule and the field name of that data unit, the target field name used to represent the downsampling rule and the downsampling object; and, with the target field name as the field name and the downsampled data of that data unit as the field value of the target field name, write the target field name and the downsampled data corresponding to that data unit into the second persistent storage medium 61b2.
  • the processor 62 is further configured to: store the downsampling file corresponding to the downsampling data in a hierarchical organization structure.
  • the processor 62 is also configured to: for any two adjacent levels, when the number of downsampling files in the lower level reaches the threshold M corresponding to the lower level, merge the M downsampling files; and store the merged downsampling file in the level above the lower level; wherein M is a set threshold, M ≥ 2, and M is an integer.
  • when the processor 62 merges the M downsampling files, it is specifically configured to: for the case where the M downsampling files have overlapping time windows, aggregate the downsampling processing results corresponding to the overlapping time windows according to the aggregation operator in the downsampling rule; and merge the aggregated M downsampling files into one downsampling file.
  • the processor 62 is further configured to: in case of data deletion in the first persistent storage medium 61b1, mark the deleted original data to obtain a tombstone record; determine the downsampling file corresponding to the tombstone record according to the time information of the data in the tombstone record and the time information of the data in the downsampling files; in the process of merging the downsampling file corresponding to the tombstone record, determine the downsampled data corresponding to the tombstone record from that downsampling file; and delete the downsampled data corresponding to the tombstone record.
  • the computing system may further include: a communication component 63 .
  • the processor 62 is also used to: obtain a query request through the communication component 63; the query request is used for aggregation query; according to the query request, query the memory 61a and the second persistent storage medium 61b2; for the case where there is data satisfying the query request in the memory 61a , from the internal memory and the second persistent storage medium 61b2 to obtain the first original data and the first downsampled data that meet the query request respectively; according to the query request, perform downsampling processing on the first original data to obtain the second downsampled data ; and, based on the first downsampled data and the second downsampled data, determine a query result of the query request.
  • when the processor 62 determines the query result of the query request, it is specifically configured to: acquire the tombstone record used to mark the deleted original data; judge, according to the time information of the data in the tombstone record and the time information of the data in the first downsampled data, whether the first downsampled data contains the downsampled data corresponding to the tombstone record; if the judgment result is yes, delete the downsampled data corresponding to the tombstone record from the first downsampled data; and determine the second downsampled data and the first downsampled data after that deletion as the query result of the query request.
  • when the processor 62 queries the second persistent storage medium 61b2, it is specifically configured to: obtain the query condition corresponding to the query request from the query request; generate, according to the query condition, a first field name conforming to the field name format of the downsampled data in the second persistent storage medium 61b2; and query the second persistent storage medium 61b2 according to the first field name to determine the downsampled data corresponding to the first field name. Acquiring the first downsampled data that meets the query request from the second persistent storage medium includes: acquiring the first downsampled data that meets the query condition from the downsampled data corresponding to the first field name.
  • the first downsampled data is located in a plurality of downsampled files.
  • when the processor 62 determines the query result of the query request, it is specifically configured to: for the case where the first downsampled data in different downsampling files has overlapping time windows, aggregate the first downsampled data corresponding to the overlapping time windows according to the aggregation operator in the query request to obtain the aggregated first downsampled data; and determine the query result of the query request based on the aggregated first downsampled data and the second downsampled data.
  • the computing system may further include: a power supply component 64 and other components.
  • FIG. 6 only schematically shows some components, which does not mean that the computing system must include all the components shown in FIG. 6 , nor does it mean that the computing system can only include the components shown in FIG. 6 .
  • the components included in the computing system provided in the embodiment of the present application may belong to the same physical machine, or may belong to different physical machines.
  • different physical machines are connected by communication.
  • the processor 62 can control and operate other components through communication between physical machines.
  • the computing system provided in this embodiment, in the process of writing original data from the memory to the persistent storage medium, performs downsampling processing on the target original data written to the persistent storage medium according to the preset downsampling rule, and stores the downsampled data obtained by that processing, which realizes pre-downsampling of the original data.
  • the pre-downsampling result can be queried directly, without real-time downsampling processing on the original data during downsampling query, which helps to improve the efficiency of subsequent downsampling query.
  • the data downsampling provided in the embodiment of the present application is performed during the memory flush (MemStore Flush) stage, that is, the target original data is downsampled while the data in the memory is being written to the first persistent storage medium; compared with CQ downsampling, it is not necessary to query the inverted index and forward index of the original data to obtain the original data, which can reduce memory and CPU resource consumption.
  • the original data and downsampling data in the memory can be queried.
  • the original data in the memory is downsampled in real time, and for the stored downsampled data the downsampled data that meets the query request can be obtained directly, giving the data query result. Since the original data in the memory is the latest original data, combining it with the query result over the stored downsampled data realizes a query over the full set of downsampled data, which solves the disadvantage that CQ downsampling cannot query the latest downsampled data.
  • for the part of the query served directly from the downsampled data, no downsampling processing is required during the data query, which helps to improve the efficiency of data query compared with real-time downsampling query.
  • the memory is used to store computer programs, and may be configured to store other various data to support operations on the device where it is located.
  • the processor can execute the computer program stored in the memory to realize the corresponding control logic.
  • the memory can be realized by any type of volatile or non-volatile storage devices or their combination, such as Static Random Access Memory (SRAM), Electrically Erasable Programmable Read Only Memory (EEPROM), Erasable Programmable Read Only Memory (EPROM), Programmable Read Only Memory (PROM), Read Only Memory (ROM), Magnetic Memory, Flash Memory, Magnetic or Optical Disk.
  • the processor may be any hardware processing device capable of executing the logic of the above method.
  • the processor can be a central processing unit (Central Processing Unit, CPU), a graphics processing unit (Graphics Processing Unit, GPU) or a microcontroller unit (Microcontroller Unit, MCU); it can also be a programmable device such as a field-programmable gate array (Field-Programmable Gate Array, FPGA), programmable array logic (Programmable Array Logic, PAL), generic array logic (Generic Array Logic, GAL) or a complex programmable logic device (Complex Programmable Logic Device, CPLD); or an advanced RISC processor (Advanced RISC Machines, ARM) or a system on chip (System on Chip, SoC), etc., but is not limited thereto.
  • the communication component is configured to facilitate wired or wireless communication between the device where it is located and other devices.
  • the device where the communication component is located can access a wireless network based on communication standards, such as WiFi, 2G or 3G, 4G, 5G or a combination thereof.
  • the communication component receives a broadcast signal or broadcast related information from an external broadcast management system via a broadcast channel.
  • the communication component may also be implemented based on Near Field Communication (NFC) technology, Radio Frequency Identification (RFID) technology, Infrared Data Association (IrDA) technology, Ultra Wideband (UWB) technology, Bluetooth (BT) technology or other technologies.
  • the power supply component is configured to provide power to various components of the device where it is located.
  • a power supply component may include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power to the device in which the power supply component resides.
  • the embodiments of the present application may be provided as methods, systems, or computer program products. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
  • These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to operate in a specific manner, such that the instructions stored in the computer-readable memory produce an article of manufacture comprising an instruction apparatus, and the instruction apparatus realizes the function specified in one or more procedures of the flowchart and/or one or more blocks of the block diagram.
  • a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
  • Memory may include non-permanent storage in computer readable media, in the form of random access memory (RAM) and/or nonvolatile memory such as read-only memory (ROM) or flash RAM. Memory is an example of computer readable media.
  • the storage medium of the computer is a readable storage medium, which may also be referred to as a readable medium.
  • Readable storage media, including both permanent and non-permanent, removable and non-removable media, may be implemented by any method or technology for information storage.
  • Information may be computer readable instructions, data structures, modules of a program, or other data.
  • Examples of computer storage media include, but are not limited to, phase change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read only memory (ROM), Electrically Erasable Programmable Read-Only Memory (EEPROM), Flash memory or other memory technology, Compact Disc Read-Only Memory (CD-ROM), Digital Versatile Disc (DVD) or other optical storage, A magnetic tape cartridge, disk storage or other magnetic storage device or any other non-transmission medium that can be used to store information that can be accessed by a computing device.
  • computer-readable media excludes transitory computer-readable media, such as modulated data signals and carrier waves.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Human Computer Interaction (AREA)
  • Fuzzy Systems (AREA)
  • Mathematical Physics (AREA)
  • Probability & Statistics with Applications (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

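Embodiments of the present application provide a data downsampling method, a data query method, a system, and a storage medium. In the embodiments, while original data is being written from memory to a persistent storage medium, the target original data being written to the persistent storage medium is downsampled according to a preset downsampling rule, and the resulting downsampled data is stored, which realizes pre-downsampling of the original data. A downsampling query can then read the pre-downsampled result directly, without downsampling the original data in real time, which helps improve the efficiency of downsampling queries.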

Description

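Data downsampling and data query method, system, and storage medium
This application claims priority to Chinese patent application No. 202111501316.5, entitled "Data downsampling and data query method, system, and storage medium", filed with the China Patent Office on December 9, 2021, the entire contents of which are incorporated herein by reference.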
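Technical Field
The present application relates to the field of data processing technology, and in particular to a data downsampling method, a data query method, a system, and a storage medium.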
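Background
Time-series data is a series of data generated continuously at some frequency. Large amounts of time-series data exist in fields such as application performance monitoring (Application Performance Monitor, APM), the Internet of Things, and the industrial Internet. Time-series databases are designed to store and query such data efficiently. One class of requirements on a time-series database is downsampling the original data.
In the related art, downsampling is generally performed in real time at query time. This approach has to scan the original data from the disk files of the time-series database; a query covering a long time span has to scan a large amount of original data, so data query efficiency is low.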
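Summary
Aspects of the present application provide a data downsampling method, a data query method, a system, and a storage medium, to improve data query efficiency.
An embodiment of the present application provides a data downsampling method, including: writing acquired original data into memory; when the original data in memory reaches a set data volume, writing the original data in memory to a first persistent storage medium; while the original data is being written to the first persistent storage medium, downsampling the target original data being written to the first persistent storage medium according to a preset downsampling rule to obtain downsampled data; and writing the downsampled data to a second persistent storage medium.
An embodiment of the present application further provides a data query method, including: acquiring a query request, the query request being for an aggregation query; querying, according to the query request, memory and a persistent storage medium that stores downsampled data; when memory contains data satisfying the query request, acquiring first original data and first downsampled data satisfying the query request from memory and the persistent storage medium, respectively; downsampling the first original data according to the query request to obtain second downsampled data; and determining a query result of the query request based on the first downsampled data and the second downsampled data.
An embodiment of the present application further provides a computing system, including a memory and a processor. The memory includes an internal memory and a persistent storage medium. The processor is communicatively connected to the internal memory and the persistent storage medium and is configured to execute the steps of the above data downsampling method and/or the above data query method.
An embodiment of the present application further provides a computer-readable storage medium storing computer instructions; when the computer instructions are executed by one or more processors, the one or more processors are caused to execute the steps of the above data downsampling method and/or the above data query method.
In the embodiments of the present application, while original data is being written from memory to a persistent storage medium, the target original data being written to the persistent storage medium is downsampled according to a preset downsampling rule, and the resulting downsampled data is stored, which realizes pre-downsampling of the original data. A downsampling query can then read the pre-downsampled result directly, without downsampling the original data in real time at query time, which helps improve the efficiency of subsequent downsampling queries.
The above summary is for the purpose of the specification only and is not intended to be limiting in any way. In addition to the illustrative aspects, implementations, and features described above, further aspects, implementations, and features of the present application will be readily apparent by reference to the drawings and the following detailed description.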
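Brief Description of the Drawings
The drawings described here provide a further understanding of the present application and form a part of it. The illustrative embodiments of the present application and their descriptions explain the present application and do not unduly limit it. In the drawings:
FIG. 1A is a schematic flowchart of a data downsampling method provided by an embodiment of the present application;
FIG. 1B is a schematic diagram of a data downsampling process provided by an embodiment of the present application;
FIG. 2 is a schematic diagram of a field structure provided by an embodiment of the present application;
FIG. 3 is a schematic flowchart of a data query method provided by an embodiment of the present application;
FIG. 4 is a schematic diagram of a data query process provided by an embodiment of the present application;
FIG. 5 is a schematic diagram of a downsampling file merging process provided by an embodiment of the present application; and
FIG. 6 is a schematic structural diagram of a computing system provided by an embodiment of the present application.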
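Detailed Description
To make the objectives, technical solutions, and advantages of the present application clearer, the technical solutions are described clearly and completely below with reference to specific embodiments and the corresponding drawings. The described embodiments are only some, not all, of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art from these embodiments without creative effort fall within the scope of protection of the present application.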
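In one embodiment, because the volume of original data is large, users need downsampling queries. For example, a temperature sensor reports the temperature once per minute, and a query asks for the hourly average temperature over the past 7 days. This scenario requires downsampling the per-minute original temperature data into hourly average temperature data. In some solutions, downsampling is performed in real time at query time. This approach has to scan the original data from the corresponding disk files; a query covering a long time span has to scan a large amount of original data, so data query efficiency is low. Moreover, querying a large amount of original data consumes a large amount of memory, and downsampling it in real time consumes a large amount of CPU.
In other solutions, downsampling is performed periodically through continuous queries (Continuous Queries, CQ). This approach has the following defects: (1) High resource consumption. Each CQ downsampling run has to query a large number of indexes, including forward indexes and inverted indexes, consuming a large amount of memory and CPU resources. (2) The latest downsampled data cannot be queried. Because CQ downsampling runs periodically rather than in real time, original data newly written to disk is not downsampled immediately, so a query cannot retrieve the most recent downsampled data. (3) Because original data and downsampled data are stored in different data tables, deleting original data cannot delete the corresponding downsampled data at the same time, leaving the original data and downsampled data out of sync.
To address the low data query efficiency caused by downsampling at query time, in some embodiments of the present application, while original data is being written from memory to a persistent storage medium, the target original data being written is downsampled according to a preset downsampling rule, and the resulting downsampled data is stored, which realizes pre-downsampling of the original data. A downsampling query can then read the pre-downsampled result directly, without downsampling the original data in real time at query time, which helps improve the efficiency of subsequent downsampling queries.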
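The technical solutions provided by the embodiments of the present application are described in detail below with reference to the drawings.
Note that the same reference numeral denotes the same object in the following drawings and embodiments; once an object is defined in one drawing or embodiment, it need not be discussed further in subsequent drawings and embodiments.
FIG. 1A is a schematic flowchart of a data downsampling method 100 provided by an embodiment of the present application. As shown in FIG. 1A, the method 100 includes steps 101 to 104.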
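In step 101, the acquired original data is written into memory.
In step 102, when the original data in memory reaches a set data volume, the original data in memory is written to a first persistent storage medium.
In step 103, while the original data is being written to the first persistent storage medium, the target original data being written to the first persistent storage medium is downsampled according to a preset downsampling rule to obtain downsampled data.
In step 104, the downsampled data is written to a second persistent storage medium.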
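In the embodiments of the present application, the original data may be time-series data, that is, a series of data generated continuously at some frequency. A physical machine may acquire the original data. The physical machine may be a terminal device such as a computer, a single server device, or a cloud-based server array. It may also be another computing device with the corresponding service capability, for example a terminal device such as a computer running a service program.
In this embodiment, the physical machine may provide data management services. In one embodiment, it may provide data storage, data processing, and data query services. In some embodiments, the physical machine maintains a database. In this embodiment, the database may be a time-series database that stores time-series data and provides time-series data query services.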
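In step 101, the acquired original data may be written into the memory of the physical machine. In one embodiment, the original data may be written into the MemStore space of memory. Because memory capacity is limited, when the amount of data stored in memory reaches the set data volume, the data has to be written to a persistent storage medium for safekeeping. Accordingly, as shown in step 102 of FIG. 1A and in FIG. 1B, when the original data in memory reaches the set data volume, it may be written to a persistent storage medium. In the embodiments of the present application, a persistent storage medium mainly refers to a non-volatile storage medium, such as a magnetic disk, floppy disk, hard disk, digital versatile disc (DVD) or other optical storage, magnetic tape, or compact disc read-only memory (CD-ROM).
In the embodiments of the present application, the persistent storage medium may be deployed on the same physical machine as memory, or on a different one. In embodiments where the storage system mounted on the physical machine is a centralized storage system, the persistent storage medium and memory belong to the same physical machine; where it is a distributed storage system, they may belong to the same physical machine or to different physical machines.
In this embodiment, to improve data query efficiency, the original data may be pre-downsampled, so that a downsampling query can read the downsampled data directly instead of downsampling the original data during the query, which effectively improves data query efficiency. Accordingly, to pre-downsample the original data, as shown in step 103 of FIG. 1A and in FIG. 1B, while the original data in memory is being written to the persistent storage medium, the target original data being written may be downsampled according to a preset downsampling rule to obtain downsampled data.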
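Steps 101 to 104 can be pictured with a small in-process model. The sketch below is only an illustration under stated assumptions: the class name `MemStore`, the count-based flush threshold, the fixed `sum` aggregation, and the two Python lists standing in for the first and second persistent storage media are all hypothetical and are not taken from the application.

```python
class MemStore:
    """Toy model: buffer raw points, then flush and downsample in one pass."""

    def __init__(self, flush_threshold, interval):
        self.flush_threshold = flush_threshold  # "set data volume", counted in points (assumption)
        self.interval = interval                # sampling time interval, in seconds
        self.buffer = []                        # original data held in memory
        self.raw_files = []                     # stands in for the first persistent storage medium
        self.downsampled_files = []             # stands in for the second persistent storage medium

    def write(self, timestamp, value):
        # Step 101: write acquired original data into memory.
        self.buffer.append((timestamp, value))
        if len(self.buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Step 102: move the buffered original data to the first medium.
        batch, self.buffer = self.buffer, []
        self.raw_files.append(batch)
        # Step 103: downsample the same batch during the flush (sum per window).
        sums = {}
        for timestamp, value in batch:
            window_start = timestamp - timestamp % self.interval
            sums[window_start] = sums.get(window_start, 0) + value
        # Step 104: write the downsampled data to the second medium.
        self.downsampled_files.append(sums)


store = MemStore(flush_threshold=4, interval=10)
for ts, v in [(1, 1), (5, 2), (12, 3), (18, 4), (21, 5)]:
    store.write(ts, v)
```

After the five writes, one original file and one downsampling file exist, and the fifth point is still only in memory, which is why the query path described later also has to read memory.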
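The embodiments of the present application do not limit how the downsampling rule is obtained. In some embodiments, the downsampling rule may be set by the user or the provider of the original data. In one embodiment, the storage system provides an interactive interface through which the user (the user or provider of the original data) sets the downsampling rule. A downsampling rule generally includes a sampling time interval and an aggregation operator. The sampling time interval is the time interval at which the original data is downsampled. The aggregation operator is the downsampling operation applied to the original data within a sampling time interval. The aggregation operator may be a metric aggregation operator, a bucket aggregation operator, a matrix aggregation operator, a pipeline aggregation operator, and so on. Metric aggregation operators may include maximum (max), minimum (min), sum, average (avg), value statistics, distinct aggregation, percentage statistics, percentage rank aggregation, and so on.
For example, a downsampling rule may be expressed with the following statement: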
Figure PCTCN2022127512-appb-000001
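The above downsampling rule specifies that the original data in database "db" is summed at sampling time intervals of 5s (5 seconds) and 5min (5 minutes), respectively.
Based on the preset downsampling rule, step 103 may be implemented as follows: obtain the sampling time interval and aggregation operator from the preset downsampling rule; for the target original data currently being written to the persistent storage medium, obtain the target original data within each sampling time interval; and aggregate the target original data within each sampling time interval according to the aggregation operator in the rule, to obtain the downsampled data for that sampling time interval.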
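In one embodiment, data is often stored in data tables. A data table may include fields (Field). A field may include a field name and field values. The field name can be used to index the corresponding field values. In some embodiments, field values with the same field name are stored by column or by row, so that the field name indexes all field values of that field. For example, as shown in FIG. 2, temperature (Temperature) may be a field name, and the timestamp (Timestamp) and temperature value (Value) may be the field values corresponding to the field name temperature.
Because different field names correspond to different attributes of the data object, downsampling can aggregate original data of the same attribute, but cannot meaningfully aggregate original data of different attributes. For example, monitoring a physical space may yield a temperature time series, a humidity time series, an air pollution index, and so on. Temperature and humidity are attributes of different dimensions, so aggregating a temperature time series with a humidity time series has no practical meaning. Accordingly, in this embodiment, when the target original data written to the persistent storage medium is downsampled, it may be divided into at least one data unit according to its field names. In one embodiment, the field values corresponding to the same field name in the target original data are grouped into one data unit, yielding at least one data unit. A data unit may thus be one field. The number of data units is determined by the number of field names contained in the target original data.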
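In one embodiment, each of the at least one data unit may be downsampled according to the preset downsampling rule to obtain the downsampled data for each data unit, and thus the downsampled data for the target original data.
In one embodiment, the sampling time interval and aggregation operator are obtained from the preset downsampling rule; for any data unit A, the target original data within each sampling time interval is obtained from data unit A. In one embodiment, this is done according to the timestamp information in data unit A. In one embodiment, the target original data within each sampling time interval is aggregated according to the aggregation operator to obtain the downsampled data corresponding to data unit A.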
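The per-field split and the per-interval aggregation described above can be sketched as follows. This is a minimal illustration, assuming integer timestamps in seconds, windows of the form [k·interval, (k+1)·interval), and a hypothetical `AGGREGATORS` table; none of these names come from the application.

```python
from collections import defaultdict

# Hypothetical operator table; only a few metric aggregation operators are shown.
AGGREGATORS = {
    "sum": sum,
    "max": max,
    "min": min,
    "avg": lambda values: sum(values) / len(values),
}


def split_by_field(points):
    """Group (field_name, timestamp, value) points into one data unit per field name."""
    units = defaultdict(list)
    for field_name, timestamp, value in points:
        units[field_name].append((timestamp, value))
    return dict(units)


def downsample_unit(unit, interval, aggregator):
    """Aggregate one data unit per sampling time interval, keyed by window start."""
    buckets = defaultdict(list)
    for timestamp, value in unit:
        buckets[timestamp - timestamp % interval].append(value)
    aggregate = AGGREGATORS[aggregator]
    return {start: aggregate(values) for start, values in sorted(buckets.items())}


points = [
    ("temperature", 0, 20.0), ("temperature", 3, 22.0), ("temperature", 7, 21.0),
    ("humidity", 1, 40.0), ("humidity", 6, 44.0),
]
units = split_by_field(points)
result = {
    name: downsample_unit(unit, interval=5, aggregator="sum")
    for name, unit in units.items()
}
```

Temperature and humidity values never end up in the same bucket, because each data unit is aggregated on its own.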
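After the downsampled data for the target original data is obtained, in step 104 the downsampled data may also be written to a persistent storage medium for storage. For ease of description and distinction, the persistent storage medium that holds the original data is defined as the first persistent storage medium, and the persistent storage medium that stores the downsampled data is defined as the second persistent storage medium.
The first persistent storage medium and the second persistent storage medium may be the same storage medium or different persistent storage media. When they are different, they may be mounted on the same physical machine or on different physical machines. There may be one or more first persistent storage media and one or more second persistent storage media; multiple means two or more. Multiple first persistent storage media may be mounted on the same physical machine or on different physical machines. Likewise, multiple second persistent storage media may also be mounted on different physical machines.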
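In one embodiment, while original data is being written from memory to a persistent storage medium, the target original data being written is downsampled according to a preset downsampling rule, and the resulting downsampled data is stored, which realizes pre-downsampling of the original data. A downsampling query can then read the pre-downsampled result directly, without downsampling the original data in real time at query time, which helps improve the efficiency of subsequent downsampling queries.
In addition, the data downsampling provided in the embodiments is performed during the memory flush (MemStore Flush) stage, that is, the target original data is downsampled while the data in memory is being written to the first persistent storage medium. Compared with CQ downsampling, there is no need to query the inverted index and forward index of the original data to obtain the original data, which reduces memory and CPU resource consumption.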
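For a downsampling query, the embodiments of the present application may query both the original data in memory and the stored downsampled data. On the one hand, the original data in memory is downsampled in real time, while the downsampled data satisfying the query request is read directly from the stored downsampled data, giving the data query result. Because the original data in memory is the latest original data, combining it with the query result over the stored downsampled data yields a query over the full set of downsampled data, which overcomes the inability of CQ downsampling to return the latest downsampled data. On the other hand, the part of the query served directly from downsampled data needs no downsampling during the query, which helps improve data query efficiency compared with real-time downsampling queries.
The storage system maintained in the embodiments of the present application can serve both downsampling queries and non-downsampling queries. A non-downsampling query request queries the original data in memory and in the first persistent storage medium; this process is the same as or similar to data queries in existing storage systems and is not the focus of the present application. The data query method provided by the embodiments is therefore illustrated below using an aggregation query (that is, a downsampling query) as the example.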
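FIG. 3 is a schematic flowchart of a data query method 300 provided by an embodiment of the present application. As shown in FIG. 3, the data query method 300 includes steps 301 to 305.
In step 301, a query request is acquired; the query request is for an aggregation query.
In step 302, memory and the second persistent storage medium are queried according to the query request.
In step 303, when memory contains data satisfying the query request, first original data and first downsampled data satisfying the query request are acquired from memory and the second persistent storage medium, respectively.
In step 304, the first original data is downsampled according to the query request to obtain second downsampled data.
In step 305, the query result of the query request is determined based on the first downsampled data and the second downsampled data.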
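In the embodiments of the present application, a query request may be a non-aggregation query or an aggregation query. The embodiments focus on the aggregation query as the example. Accordingly, in step 301, a query request for an aggregation query is acquired. The query request may contain query conditions, which may include the data object to be queried, the aggregation operator, the time range of the query, and so on.
The original data in memory is the most recently written. Because different query requests may target different time ranges and data objects, memory may or may not contain data satisfying a given query request. The storage system cannot determine in advance whether memory contains such data. Therefore, to improve the timeliness and accuracy of the query and avoid missing the latest data, as shown in step 302 of FIG. 3 and in FIG. 4, both memory and the second persistent storage medium are queried according to the query request.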
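In one embodiment, the query request is semantically parsed to obtain its query conditions. In one embodiment, the query request is compiled into an abstract syntax tree (Abstract Syntax Tree, AST), and during this process the request statement is checked for errors to ensure that it has no syntactic or lexical errors, for example misspelled keywords, superfluous punctuation, or a statement that is invalid as a whole.
In one embodiment, the nodes of the abstract syntax tree are checked in turn, and the metadata of the relevant tables and attributes is attached to the tree, producing a syntax tree with semantics (bound AST). In one embodiment, the access requirements of the query request are obtained from the bound AST.
In one embodiment, an execution plan is generated from the query conditions. In one embodiment, the optimizer generates a logical operator tree (LOT) from the bound AST by mapping the nodes of the bound AST to operator nodes. Each node of the logical operator tree is called a logical operator. In one embodiment, the physical operators corresponding to each logical operator are expanded to obtain physical execution trees. In one embodiment, the physical execution tree with the lowest cost is selected as the execution plan. Lowest cost may mean the shortest path, the least memory consumption, the least computation, the shortest computation time, and so on.
In one embodiment, memory and the second persistent storage medium are queried according to the execution plan.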
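In this embodiment, when memory contains no data satisfying the query request, the downsampled data satisfying the query request is obtained from the second persistent storage medium, and the query result of the query request is determined from it. Because this query reads the matching downsampled data directly, without downsampling original data in real time during the query, it helps improve data query efficiency.
When memory does contain data satisfying the query request, in step 303 the original data satisfying the query request (defined as the first original data) and the downsampled data satisfying the query request are obtained from memory and the second persistent storage medium, respectively.
In one embodiment, in step 304, the original data obtained from memory that satisfies the query request is downsampled according to the query request to obtain downsampled data. For ease of description and distinction, the downsampled data obtained from the second persistent storage medium that satisfies the query request is defined as the first downsampled data, and the downsampled data obtained by downsampling the original data from memory is defined as the second downsampled data.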
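In one embodiment, the aggregation operator and sampling time interval contained in the query request are obtained from the request. In one embodiment, the original data corresponding to each sampling time interval is obtained from the original data satisfying the query request according to that sampling time interval; in one embodiment, the original data of each sampling time interval is aggregated according to the aggregation operator contained in the query request to obtain the second downsampled data.
Then, in step 305, the query result corresponding to the query request is determined based on the first downsampled data and the second downsampled data.
The data query method provided by this embodiment queries both the original data in memory and the stored downsampled data. On the one hand, the original data in memory is downsampled in real time, while the downsampled data satisfying the query request is read directly from the stored downsampled data, giving the data query result. Because the original data in memory is the latest original data, combining it with the query result over the stored downsampled data yields a query over the full set of downsampled data, which improves the timeliness and accuracy of the query and overcomes the inability of CQ downsampling to return the latest downsampled data. On the other hand, the part of the query served directly from downsampled data needs no downsampling during the query, which helps improve data query efficiency compared with real-time downsampling queries.
Moreover, when memory contains original data satisfying the query request, because memory is small, the amount of original data it holds is far smaller than that held in the first persistent storage medium, so real-time downsampling of the in-memory original data completes quickly. Compared with the existing approach of downsampling the full original data in real time at query time, the data query method provided by the embodiments still has high data query efficiency.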
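Steps 302 to 305 can be sketched as a single function that reads both sources. The sketch below is an assumption-laden illustration: the application does not spell out how the first and second downsampled data are combined, so the code assumes a `sum` operator and adds the two contributions when the same window appears in both; `bucket_sum` and `aggregate_query` are hypothetical names.

```python
def bucket_sum(points, interval):
    """Real-time downsampling of in-memory points: sum per window start."""
    out = {}
    for timestamp, value in points:
        window_start = timestamp - timestamp % interval
        out[window_start] = out.get(window_start, 0) + value
    return out


def aggregate_query(memory_points, persisted, interval, start, end):
    """Answer a sum query over [start, end) from memory plus stored downsampled data.

    persisted maps window start -> pre-computed sum (stands in for the second medium).
    """
    # First downsampled data: read directly, no downsampling needed.
    first = {w: v for w, v in persisted.items() if start <= w < end}
    # First original data: the in-memory points inside the queried range.
    in_range = [(ts, v) for ts, v in memory_points if start <= ts < end]
    # Second downsampled data: downsample the in-memory points in real time.
    second = bucket_sum(in_range, interval)
    # Combine; a window present in both sources is re-aggregated (assumption).
    result = dict(first)
    for window_start, value in second.items():
        result[window_start] = result.get(window_start, 0) + value
    return dict(sorted(result.items()))


persisted = {0: 3, 10: 7}
memory_points = [(19, 2), (21, 5), (25, 1)]
answer = aggregate_query(memory_points, persisted, interval=10, start=0, end=30)
```

Only the three in-memory points are downsampled at query time; the two persisted windows are read as stored.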
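In one embodiment, how the data is stored may affect the data query process. The specific implementation of the downsampling query (aggregation query) is therefore illustrated below together with how downsampled data is stored and written to the second persistent storage medium.
The embodiments of the present application do not limit how downsampled data is written to the second persistent storage medium. Because the downsampled data stored in the second persistent storage medium is generally produced under different downsampling rules, to facilitate and speed up subsequent queries, for the downsampled data of any data unit A, a target field name (Field) representing the downsampling rule and the downsampling object may be determined from the downsampling rule of data unit A and the field name of data unit A. The specific format of the target field name is not limited. In some embodiments, the format of the target field name may be "{raw_field}_{aggregator}_{interval}", where "raw_field" is the column field name, that is, the field name of the data unit, and represents the downsampling object; "aggregator" is the aggregation operator; and "interval" is the sampling time interval. For example, for a rule that applies max downsampling to CPU at a 30s sampling time interval, the downsampling rule is "max downsampling at a 30s sampling time interval" and the downsampling object is the CPU field. Accordingly, the target field name is "cpu_max_30s".
In one embodiment, with the target field name as the field name and the downsampled data of data unit A as the field value of the target field name, the target field name and the downsampled data of data unit A are written to the second persistent storage medium. A downsampling query can then determine, from the query conditions in the request, the target field name that satisfies the query conditions, and use it to index the corresponding field values as the downsampled data satisfying the query conditions. Because the query is driven by the target field name of the downsampled data, it does not have to scan all downsampled data, which helps improve data query efficiency.
In one embodiment, based on the target field name, when the second persistent storage medium is queried according to the query request, the query conditions are obtained from the request, and a first field name conforming to the field name format of the downsampled data in the second persistent storage medium (that is, the target field name format) is generated from the query conditions. In one embodiment, the data object to be queried, the aggregation operator, and the sampling time interval are obtained from the query conditions; in one embodiment, they are converted into the first field name in the target field name format. For example, for a query condition asking for the maximum (max) of CPU in every 30s, the data object to be queried is the CPU field, the aggregation operator is max, and the sampling time interval is 30s. Accordingly, the first field name converted from this query condition is "cpu_max_30s".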
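The naming scheme can be sketched in a few lines. The format string "{raw_field}_{aggregator}_{interval}" and the "cpu_max_30s" example come from the text; the helper names, the dictionary standing in for the second persistent storage medium, and the choice to split from the right (so that a raw field name may itself contain underscores) are assumptions made for illustration.

```python
def target_field_name(raw_field, aggregator, interval):
    """Build a name in the "{raw_field}_{aggregator}_{interval}" format."""
    return f"{raw_field}_{aggregator}_{interval}"


def parse_target_field_name(name):
    """Split a target field name back into its parts, from the right."""
    raw_field, aggregator, interval = name.rsplit("_", 2)
    return raw_field, aggregator, interval


# Write path: the rule "max at 30s" applied to the cpu field.
second_medium = {}  # stands in for the second persistent storage medium
second_medium[target_field_name("cpu", "max", "30s")] = {0: 0.71, 30: 0.93}

# Query path: convert the query conditions into the first field name and look it up.
first_field_name = target_field_name("cpu", "max", "30s")
first_downsampled = second_medium.get(first_field_name)
```

Because the write path and the query path build the name with the same function, a query touches only the field values stored under that one name.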
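In one embodiment, the second persistent storage medium is queried by the first field name to determine the downsampled data corresponding to the first field name. In one embodiment, the first downsampled data satisfying the query conditions is obtained from the downsampled data corresponding to the first field name.
In some embodiments, as shown in FIG. 1B and FIG. 4, original data and downsampled data may be stored as files. In the embodiments of the present application, a file is a way of encoding information for storage, and its specific implementation form is not limited. In some embodiments, a file may be a data table or the like. The storage file of original data is defined as an original file, and the storage file of downsampled data is defined as a downsampling file. Each time the original data in memory reaches the set data volume, an operation writing the original data in memory to the first persistent storage medium is started, forming one original file; each time original data is written to the first persistent storage medium, an operation downsampling the target original data being written and writing the downsampled data to the second persistent storage medium is started, forming one downsampling file.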
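In the embodiments of the present application, to reduce the storage space occupied by downsampling files, they may be stored in a hierarchical organization structure. Each level stores a set threshold number of downsampling files. The set threshold of each level is denoted M, where M ≥ 2 and M is an integer. The thresholds of different levels may be the same or different. As shown in FIG. 5, for any two adjacent levels, when the number of downsampling files in the lower level reaches that level's threshold M, the M downsampling files are merged, and the merged downsampling file is stored in the level above the lower level. For example, the levels of the hierarchical organization structure in FIG. 5 rise from L0 to L5. When the number of downsampling files in level L0 reaches the set threshold M, the M downsampling files in L0 are merged and the merged file is stored in level L1; when the number of downsampling files in level L1 reaches the set threshold N, the N downsampling files in L1 are merged and the merged file is stored in level L2; and so on. Here N ≥ 2 and N is an integer; N and M may be the same or different.
Because the M downsampling files may contain downsampled data with overlapping time windows, to further reduce the storage space occupied by downsampled data, when the M downsampling files have overlapping time windows, the downsampling results corresponding to the overlapping time windows are aggregated according to the aggregation operator in the downsampling rule, and the aggregated M downsampling files are merged into one downsampling file. The merged downsampling file is then stored in the level above. Because the downsampled data of overlapping time windows is deduplicated during merging, storing downsampling files in a hierarchical organization structure reduces the storage space occupied by downsampled data.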
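The level-by-level merge can be sketched as below. This is a hedged illustration: `LeveledStore`, `merge_files`, and the `combine` callback are hypothetical names, each downsampling file is modeled as a dictionary from window start to value, and the example uses a `sum` operator. An operator such as avg is not re-aggregable from its results alone and would need extra state (for example sum and count), which this sketch does not model.

```python
def merge_files(files, combine):
    """Merge downsampling files; windows present in several files are re-aggregated."""
    merged = {}
    for downsampled_file in files:
        for window_start, value in downsampled_file.items():
            if window_start in merged:
                merged[window_start] = combine(merged[window_start], value)
            else:
                merged[window_start] = value
    return dict(sorted(merged.items()))


class LeveledStore:
    def __init__(self, thresholds, combine):
        self.thresholds = thresholds                      # threshold per level, each >= 2
        self.levels = [[] for _ in range(len(thresholds) + 1)]
        self.combine = combine                            # aggregation operator on two values

    def add(self, downsampled_file, level=0):
        self.levels[level].append(downsampled_file)
        # When a level reaches its threshold, merge its files into the level above.
        if level < len(self.thresholds) and len(self.levels[level]) >= self.thresholds[level]:
            files, self.levels[level] = self.levels[level], []
            self.add(merge_files(files, self.combine), level + 1)


store = LeveledStore(thresholds=[2, 2], combine=lambda a, b: a + b)
for f in [{0: 3, 10: 7}, {10: 2, 20: 6}, {20: 1}, {30: 4}]:
    store.add(f)
```

Four L0 files holding six window entries end up as one L2 file with four entries, because the windows starting at 10 and 20 appeared in two files each and were folded together during merging.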
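In embodiments where downsampled data is stored as files, during an aggregation query the first downsampled data obtained from the second persistent storage medium that satisfies the query request may lie in one downsampling file or in multiple downsampling files. Multiple means two or more. When the first downsampled data lies in multiple downsampling files, whether the downsampled data in those files has overlapping time windows is determined from the time information of the downsampled data; if so, the first downsampled data corresponding to the overlapping time windows is aggregated according to the aggregation operator in the query request to obtain aggregated first downsampled data. In one embodiment, the query result corresponding to the query request is determined based on the aggregated first downsampled data and the second downsampled data.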
在一种实施例中,对于写入第一持久性存储介质的原始数据来说,可能存在数据删除的情况,在本申请实施例中,为了实现降采样数据与原始数据同步删除,在第一持久性存储介质的原始数据存在数据删除的情况下,可对删除的原始数据进行标记,得到墓碑(Tombstone)记录。其中,墓碑记录用于记录删除的原始数据信息。其中,墓碑记录中记录的原始数据,可为第一持久性存储介质逻辑意义上删除的原始数据,也可为实际物理上删除的原始数据。
在一种实施例中,可根据墓碑记录中数据的时间信息和第二持久性存储介质存储的降采样数据的时间信息,确定墓碑记录对应的降采样数据。在一种实施例中,对于上述以降采样文件形式存储降采样文件的实施例来说,可根据墓碑记录中数据的时间信息和第二持久性存储介质存储的降采样文件中数据的时间信息,确定墓碑记录对应的降采样文件。为了保持降采样数据与原始数据同步删除,可在墓碑记录对应的降采样文件合并过程中,从墓碑记录对应的降采样文件中确定墓碑记录对应的降采样数据。在一种实施例中,可根据墓碑记录中数据的实现信息和墓碑记录对应的降采样文件中降采样数据的时间信息,确定墓碑记录对应的降采样文件中与墓碑记录中数据的时间窗口重叠的降采样数据,为墓碑记录对应的降采样数据。在一种实施例中,可在墓碑记录对应的降采样文件合并过程中,删除墓碑记录对应的降采样数据,这样合并后的降采样文件不再存在被删除的原始数据对应的降采样数据,实现降采样数据和原始数据同步删除,解决上述CQ降采样方式无法在原 始数据删除时同步删除降采样数据的缺陷。
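Applying tombstone records during a merge can be sketched as follows. The sketch assumes a tombstone record carries a half-open deleted time range, that every downsampled window has a fixed length `interval`, and that the aggregation operator is max. Dropping a whole window on any overlap follows the overlap criterion in the text; `Tombstone`, `overlaps`, and `merge_with_tombstones` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Tombstone:
    start: int  # start of the deleted time range (inclusive)
    end: int    # end of the deleted time range (exclusive)


def overlaps(window_start: int, interval: int, tomb: Tombstone) -> bool:
    # Window [window_start, window_start + interval) intersects [start, end).
    return window_start < tomb.end and tomb.start < window_start + interval


def merge_with_tombstones(files: List[Dict[int, float]], interval: int,
                          tombstones: List[Tombstone]) -> Dict[int, float]:
    merged: Dict[int, float] = {}
    for f in files:
        for ws, value in f.items():
            if any(overlaps(ws, interval, t) for t in tombstones):
                continue  # downsampled data for deleted raw data is not carried over
            merged[ws] = max(merged.get(ws, value), value)  # assumed max operator
    return dict(sorted(merged.items()))
```

After such a merge, the resulting file contains no window that intersects a deleted range, so the downsampled data is removed in step with the raw data.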
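To avoid returning downsampled data that corresponds to deleted raw data and to improve query accuracy, in this embodiment, based on the tombstone record described above, when the query result of a query request is determined during an aggregate query, the tombstone record used to mark deleted raw data can be obtained, and whether the first downsampled data contains downsampled data corresponding to the tombstone record is determined from the time information of the data in the tombstone record and the time information of the data in the first downsampled data. If so, the downsampled data corresponding to the tombstone record is deleted from the first downsampled data, and the second downsampled data together with the first downsampled data remaining after that deletion is determined to be the query result of the query request. This ensures that downsampled data corresponding to deleted raw data marked by a tombstone record is not returned, which helps improve query accuracy and addresses the defect of the CQ downsampling approach described above, namely that it cannot delete downsampled data when the raw data is deleted.
For the embodiment above in which the first downsampled data is located in multiple downsampled files whose downsampled data has overlapping time windows, when the query result is determined from the aggregated first downsampled data and the second downsampled data, whether the aggregated first downsampled data contains downsampled data corresponding to the tombstone record can likewise be determined from the time information of the data in the tombstone record and the time information of the data in the aggregated first downsampled data. If so, the downsampled data corresponding to the tombstone record is deleted from the aggregated first downsampled data, and the second downsampled data together with the aggregated first downsampled data remaining after that deletion is determined to be the query result corresponding to the query request.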
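The aggregate query path described above can be put together in one sketch: aggregate overlapping windows across files, drop windows covered by tombstone records, downsample the in-memory raw data in real time, and combine the two. All names are hypothetical. The sketch supports only max and min, represents tombstones as (start, end) pairs, and combines values when a window appears in both parts, which is an assumption because the source only says the two parts together form the result.

```python
from typing import Callable, Dict, List, Tuple


def _fold(dst: Dict[int, float], ws: int, v: float,
          combine: Callable[[float, float], float]) -> None:
    dst[ws] = combine(dst[ws], v) if ws in dst else v


def aggregate_query(mem_points: List[Tuple[int, float]],
                    files: List[Dict[int, float]],
                    tombstones: List[Tuple[int, int]],
                    interval: int,
                    aggregator: str = "max") -> Dict[int, float]:
    combine = {"max": max, "min": min}[aggregator]

    # First downsampled data: read from files, aggregating overlapping windows.
    first: Dict[int, float] = {}
    for f in files:
        for ws, v in f.items():
            _fold(first, ws, v, combine)

    # Remove windows that overlap a tombstoned (deleted) time range.
    first = {ws: v for ws, v in first.items()
             if not any(ws < end and start < ws + interval
                        for start, end in tombstones)}

    # Second downsampled data: real-time downsampling of raw data in memory.
    second: Dict[int, float] = {}
    for ts, v in mem_points:
        _fold(second, ts - ts % interval, v, combine)

    # Query result: both parts together (combined per window, by assumption).
    result = dict(first)
    for ws, v in second.items():
        _fold(result, ws, v, combine)
    return dict(sorted(result.items()))
```

Only the in-memory points are downsampled at query time; everything already flushed is read in its pre-aggregated form.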
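In one embodiment, the query result is returned to the provider of the query request. In the embodiments of this application, an aggregate query can find downsampled data that satisfies the aggregate query request mainly because the downsampling rule corresponding to the downsampled data can be set by the provider of the query request. The provider can set the downsampling rule according to its own query needs, and the rule is stored in advance in the module, apparatus, device, or system that executes the data downsampling method provided by the embodiments of this application.
It should be noted that the steps of the method provided by the above embodiments can all be executed by the same device, or the method can be executed by different devices. For example, steps 301 and 302 can both be executed by device A; as another example, step 301 can be executed by device A and step 302 by device B; and so on.
In addition, some of the flows described in the above embodiments and drawings contain multiple operations that appear in a specific order. It should be clearly understood that these operations may be executed out of the order in which they appear herein or in parallel. Operation numbers such as 301 and 302 merely distinguish the operations from one another and do not themselves represent any execution order. These flows may also include more or fewer operations, and these operations may be executed sequentially or in parallel.
Accordingly, the embodiments of this application further provide a computer-readable storage medium storing computer instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the data downsampling method and/or the data query method described above.
The embodiments of this application further provide a computer program product that includes a computer program. When the computer program is executed by a processor, it causes the processor to perform the steps of the data downsampling method and/or the data query method described above. The embodiments do not limit the specific form of the computer program product. In some embodiments, the computer program product can be implemented as a query engine, a data processing system for a database, an executor in a query engine, or the like.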
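FIG. 6 is a schematic structural diagram of a computing system provided by an embodiment of this application. As shown in FIG. 6, the computing system includes a storage 61 and a processor 62. The storage 61 can include a memory 61a and a persistent storage medium 61b.
In this embodiment, the storage 61 and the processor 62 can be located on the same physical machine or on different physical machines. The memory 61a and the persistent storage medium 61b can belong to the same physical machine or to different physical machines. In one embodiment, the memory 61a and the processor 62 belong to the same physical machine. There can be one or more persistent storage media 61b, where multiple means two or more. Multiple persistent storage media 61b can belong to the same physical machine or to different physical machines.
In this embodiment, the memory 61a and the persistent storage medium 61b are communicatively connected to the processor 62. The processor 62 can be used to: write acquired raw data to the memory 61a; when the raw data in the memory 61a reaches a set data volume, write the raw data in the memory 61a to the first persistent storage medium 61b1 of the persistent storage medium 61b; while the raw data is being written to the first persistent storage medium 61b1, downsample the target raw data being written to the first persistent storage medium 61b1 according to a preset downsampling rule to obtain downsampled data; and write the downsampled data to the second persistent storage medium 61b2.
In the embodiments of this application, the first persistent storage medium 61b1 and the second persistent storage medium 61b2 can be the same storage medium or different storage media.
In some embodiments, when downsampling the target raw data written to the first persistent storage medium, the processor 62 is specifically used to: divide the target raw data into at least one data unit according to the field names of the target raw data; and downsample each of the at least one data unit according to the preset downsampling rule to obtain the downsampled data.
In one embodiment, when downsampling each of the at least one data unit, the processor 62 is specifically used to: obtain the sampling interval and the aggregation operator from the preset downsampling rule; for any data unit, obtain from that data unit the target raw data within each sampling interval; and aggregate the target raw data within each sampling interval according to the aggregation operator to obtain the downsampled data corresponding to that data unit.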
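The per-field downsampling performed by the processor can be sketched as below. The sketch assumes raw rows are (timestamp in seconds, field name, value) tuples, that windows are aligned to multiples of the sampling interval, and that the output is keyed by a "{raw_field}_{aggregator}_{interval}" style name; `downsample` and `AGGREGATORS` are hypothetical names.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple

AGGREGATORS = {"max": max, "min": min, "sum": sum, "count": len}


def downsample(rows: Iterable[Tuple[int, str, float]],
               interval: int, aggregator: str) -> Dict[str, Dict[int, float]]:
    agg = AGGREGATORS[aggregator]

    # Step 1: split the target raw data into data units by field name.
    units = defaultdict(list)
    for ts, field, value in rows:
        units[field].append((ts, value))

    out: Dict[str, Dict[int, float]] = {}
    for field, points in units.items():
        # Step 2: group each unit's points by sampling interval.
        buckets = defaultdict(list)
        for ts, value in points:
            buckets[ts - ts % interval].append(value)
        # Step 3: aggregate every bucket with the rule's operator.
        out[f"{field}_{aggregator}_{interval}s"] = {
            ws: agg(vals) for ws, vals in sorted(buckets.items())
        }
    return out
```

Called with a 30s interval and the max operator on rows for "cpu" and "mem", the function returns one series per field, named "cpu_max_30s" and "mem_max_30s".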
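In other embodiments, when writing the downsampling result to the second persistent storage medium 61b2, the processor 62 is specifically used to: for the downsampled data corresponding to any data unit, determine, from the downsampling rule and the field name of that data unit, a target field name that represents the downsampling rule and the downsampled object; and, with the target field name as the field name and the downsampled data of that data unit as the field value of the target field name, write the target field name and the downsampled data corresponding to that data unit to the second persistent storage medium 61b2.
In some embodiments, the processor 62 is further used to store the downsampled files corresponding to the downsampled data in a hierarchical organization structure. Accordingly, the processor 62 is further used to: for any two adjacent levels, when the number of downsampled files in the lower level reaches the threshold M of the lower level, merge the M downsampled files; and store the merged downsampled file in the level immediately above the lower level, where M is a set threshold, M ≥ 2, and M is an integer.
In one embodiment, when merging the M downsampled files, the processor 62 is specifically used to: when the M downsampled files have overlapping time windows, aggregate the downsampling results of the overlapping time windows according to the aggregation operator in the downsampling rule; and merge the M aggregated downsampled files into one downsampled file.
In some embodiments, the processor 62 is further used to: when raw data in the first persistent storage medium 61b1 is deleted, mark the deleted raw data to obtain a tombstone record; determine the downsampled file corresponding to the tombstone record from the time information of the data in the tombstone record and the time information of the data in the downsampled files; during the merge of the downsampled file corresponding to the tombstone record, determine from that file the downsampled data corresponding to the tombstone record; and delete the downsampled data corresponding to the tombstone record.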
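In the embodiments of this application, as shown in FIG. 6, the computing system can further include a communication component 63. The processor 62 is further used to: obtain a query request through the communication component 63, the query request being used for an aggregate query; query the memory 61a and the second persistent storage medium 61b2 according to the query request; when the memory 61a contains data that satisfies the query request, obtain first raw data and first downsampled data that satisfy the query request from the memory 61a and the second persistent storage medium 61b2, respectively; downsample the first raw data according to the query request to obtain second downsampled data; and determine the query result of the query request from the first downsampled data and the second downsampled data.
In one embodiment, when determining the query result of the query request, the processor 62 is specifically used to: obtain the tombstone record used to mark deleted raw data; determine, from the time information of the data in the tombstone record and the time information of the data in the first downsampled data, whether the first downsampled data contains downsampled data corresponding to the tombstone record; if so, delete the downsampled data corresponding to the tombstone record from the first downsampled data; and determine the second downsampled data together with the first downsampled data remaining after that deletion to be the query result of the query request.
In one embodiment, when querying the second persistent storage medium 61b2, the processor 62 is specifically used to: obtain the query condition corresponding to the query request from the query request; generate, from the query condition, a first field name that conforms to the field name format of the downsampled data in the second persistent storage medium; and query the second persistent storage medium 61b2 by the first field name to determine the downsampled data corresponding to the first field name. Obtaining the first downsampled data that satisfies the query request from the second persistent storage medium includes: obtaining, from the downsampled data corresponding to the first field name, the first downsampled data that satisfies the query condition.
In some embodiments, the first downsampled data is located in multiple downsampled files. Accordingly, when determining the query result of the query request, the processor 62 is specifically used to: when the first downsampled data in different downsampled files has overlapping time windows, aggregate the first downsampled data of the overlapping time windows according to the aggregation operator in the query request to obtain aggregated first downsampled data; and determine the query result of the query request from the aggregated first downsampled data and the second downsampled data.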
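In some optional implementations, as shown in FIG. 6, the computing system can further include components such as a power supply component 64. FIG. 6 shows only some components schematically; this does not mean that the computing system must include all the components shown in FIG. 6, nor that it can include only the components shown in FIG. 6.
It is worth noting that the components of the computing system provided by the embodiments of this application can belong to the same physical machine or to different physical machines. When the components belong to different physical machines, those machines are communicatively connected, and the processor 62 can control and operate the other components through communication between the physical machines.
In the computing system provided by this embodiment, while raw data is being written from memory to the persistent storage medium, the target raw data being written is downsampled according to a preset downsampling rule, and the resulting downsampled data is stored, which implements pre-downsampling of the raw data. A downsampling query can then read the pre-downsampled results directly instead of downsampling the raw data in real time at query time, which helps improve the efficiency of subsequent downsampling queries.
In addition, the data downsampling provided by the embodiments of this application is performed in the memory flush (MemStore Flush) stage, that is, while the data in memory is being written to the first persistent storage medium, on the target raw data being written to the first persistent storage medium. Compared with CQ downsampling, there is no need to query the inverted data and forward index of the raw data to obtain the raw data, which reduces memory and CPU consumption.
For a downsampling query, the embodiments of this application can query both the raw data in memory and the downsampled data. On one hand, the raw data in memory is downsampled in real time, while downsampled data satisfying the query request is read directly from the stored downsampled data, and the two together give the query result. Because the raw data in memory is the most recent raw data, combining it with the query result from the downsampled data yields a query over the full set of downsampled data, which addresses the shortcoming that CQ downsampling cannot return the most recent downsampled data. On the other hand, the part of the query that reads downsampled data directly requires no downsampling during the query, which helps improve query efficiency compared with real-time downsampling queries.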
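In the embodiments of this application, the storage is used to store a computer program and can be configured to store various other data to support operations on the device where it resides. The processor can execute the computer program stored in the storage to implement the corresponding control logic. The storage can be implemented by any type of volatile or non-volatile storage device or a combination of them, such as static random access memory (SRAM), electrically erasable programmable read-only memory (EEPROM), erasable programmable read-only memory (EPROM), programmable read-only memory (PROM), read-only memory (ROM), magnetic memory, flash memory, a magnetic disk, or an optical disc.
In the embodiments of this application, the processor can be any hardware processing device capable of executing the logic of the above methods. In one embodiment, the processor can be a central processing unit (CPU), a graphics processing unit (GPU), or a microcontroller unit (MCU); it can also be a programmable device such as a field-programmable gate array (FPGA), a programmable array logic device (PAL), a generic array logic device (GAL), or a complex programmable logic device (CPLD); or it can be an Advanced RISC Machines (ARM) processor, a system on chip (SoC), and so on, without being limited to these.
In the embodiments of this application, the communication component is configured to facilitate wired or wireless communication between the device where it resides and other devices. The device where the communication component resides can access a wireless network based on a communication standard, such as WiFi, 2G, 3G, 4G, 5G, or a combination of them. In an exemplary embodiment, the communication component receives a broadcast signal or broadcast-related information from an external broadcast management system via a broadcast channel. In an exemplary embodiment, the communication component can also be implemented based on near field communication (NFC) technology, radio frequency identification (RFID) technology, Infrared Data Association (IrDA) technology, ultra-wideband (UWB) technology, Bluetooth (BT) technology, or other technologies.
In the embodiments of this application, the power supply component is configured to supply power to the components of the device where it resides. The power supply component can include a power management system, one or more power supplies, and other components associated with generating, managing, and distributing power for the device where the power supply component resides.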
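It should be noted that descriptions such as "first" and "second" herein are used to distinguish different messages, devices, modules, and the like; they do not represent a sequence, nor do they require that "first" and "second" be of different types.
Those skilled in the art should understand that the embodiments of this application can be provided as a method, a system, or a computer program product. Therefore, this application can take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, this application can take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, and optical storage) that contain computer-usable program code.
This application is described with reference to flowcharts and/or block diagrams of the method, device (system), and computer program product according to its embodiments. It should be understood that each flow and/or block in the flowcharts and/or block diagrams, and combinations of flows and/or blocks in them, can be implemented by computer program instructions. These computer program instructions can be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce an apparatus for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions can also be stored in a computer-readable memory capable of directing a computer or another programmable data processing device to operate in a specific manner, so that the instructions stored in the computer-readable memory produce an article of manufacture that includes an instruction apparatus, which implements the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.
These computer program instructions can also be loaded onto a computer or another programmable data processing device, so that a series of operational steps is executed on the computer or other programmable device to produce computer-implemented processing. The instructions executed on the computer or other programmable device thereby provide steps for implementing the functions specified in one or more flows of a flowchart and/or one or more blocks of a block diagram.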
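In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
Memory may include non-permanent memory in computer-readable media, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer storage media are readable storage media, also called readable media. Readable storage media include permanent and non-permanent, removable and non-removable media, and can store information by any method or technology. The information can be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technologies, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media (transitory media), such as modulated data signals and carrier waves.
It should also be noted that the terms "include", "comprise", and any other variants are intended to cover non-exclusive inclusion, so that a process, method, product, or device that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, product, or device. Without further limitation, an element qualified by the phrase "includes a ..." does not exclude the presence of additional identical elements in the process, method, product, or device that includes the element.
The above are merely embodiments of this application and are not intended to limit it. Those skilled in the art can make various modifications and changes to this application. Any modification, equivalent replacement, or improvement made within the principles of this application shall fall within the scope of its claims.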

Claims (14)

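  1. A data downsampling method, comprising:
    writing acquired raw data to memory;
    when the raw data in the memory reaches a set data volume, writing the raw data in the memory to a first persistent storage medium;
    while the raw data is being written to the first persistent storage medium, downsampling, according to a preset downsampling rule, target raw data being written to the first persistent storage medium to obtain downsampled data;
    writing the downsampled data to a second persistent storage medium.
  2. The method according to claim 1, wherein downsampling, according to the preset downsampling rule, the target raw data being written to the first persistent storage medium comprises:
    dividing the target raw data into at least one data unit according to field names of the target raw data;
    downsampling each of the at least one data unit according to the preset downsampling rule to obtain the downsampled data.
  3. The method according to claim 2, wherein downsampling each of the at least one data unit according to the preset downsampling rule comprises:
    obtaining a sampling interval and an aggregation operator from the preset downsampling rule;
    for any data unit, obtaining from that data unit the target raw data within each sampling interval;
    aggregating the target raw data within each sampling interval according to the aggregation operator to obtain the downsampled data corresponding to that data unit.
  4. The method according to claim 2, wherein writing the downsampled data to the second persistent storage medium comprises:
    for the downsampled data corresponding to any data unit, determining, from the downsampling rule and the field name of that data unit, a target field name that represents the downsampling rule and the downsampled object;
    with the target field name as the field name and the downsampled data of that data unit as the field value of the target field name, writing the target field name and the downsampled data corresponding to that data unit to the second persistent storage medium.
  5. The method according to any one of claims 1-4, wherein downsampled files corresponding to the downsampled data are stored in a hierarchical organization structure; the method further comprising:
    for any two adjacent levels, when the number of downsampled files in the lower level reaches a threshold M corresponding to the lower level, merging the M downsampled files;
    storing the merged downsampled file in the level immediately above the lower level, wherein M is a set threshold, M ≥ 2, and M is an integer.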
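  6. The method according to claim 5, wherein merging the M downsampled files comprises:
    when the M downsampled files have overlapping time windows, aggregating the downsampling results of the overlapping time windows according to the aggregation operator in the downsampling rule;
    merging the M aggregated downsampled files into one downsampled file.
  7. The method according to claim 5, further comprising:
    when raw data in the first persistent storage medium is deleted, marking the deleted raw data to obtain a tombstone record;
    determining the downsampled file corresponding to the tombstone record from time information of the data in the tombstone record and time information of the data in the downsampled files;
    during the merge of the downsampled file corresponding to the tombstone record, determining from that downsampled file the downsampled data corresponding to the tombstone record;
    deleting the downsampled data corresponding to the tombstone record.
  8. The method according to any one of claims 1-4, further comprising:
    obtaining a query request, the query request being used for an aggregate query;
    querying the memory and the second persistent storage medium according to the query request;
    when the memory contains data that satisfies the query request, obtaining first raw data and first downsampled data that satisfy the query request from the memory and the second persistent storage medium, respectively;
    downsampling the first raw data according to the query request to obtain second downsampled data;
    determining a query result of the query request based on the first downsampled data and the second downsampled data.
  9. The method according to claim 8, wherein determining the query result of the query request based on the first downsampled data and the second downsampled data comprises:
    obtaining a tombstone record used to mark deleted raw data;
    determining, from time information of the data in the tombstone record and time information of the data in the first downsampled data, whether the first downsampled data contains downsampled data corresponding to the tombstone record;
    if so, deleting the downsampled data corresponding to the tombstone record from the first downsampled data;
    determining the second downsampled data together with the first downsampled data remaining after the downsampled data corresponding to the tombstone record is deleted to be the query result of the query request.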
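  10. The method according to claim 8, wherein querying the second persistent storage medium according to the query request comprises:
    obtaining, from the query request, a query condition corresponding to the query request;
    generating, from the query condition, a first field name that conforms to the field name format of the downsampled data in the second persistent storage medium;
    querying the second persistent storage medium by the first field name to determine the downsampled data corresponding to the first field name;
    and wherein obtaining the first downsampled data that satisfies the query request from the second persistent storage medium comprises:
    obtaining, from the downsampled data corresponding to the first field name, the first downsampled data that satisfies the query condition.
  11. The method according to claim 8, wherein the first downsampled data is located in multiple downsampled files, and determining the query result of the query request based on the first downsampled data and the second downsampled data comprises:
    when the first downsampled data in different downsampled files has overlapping time windows, aggregating the first downsampled data of the overlapping time windows according to the aggregation operator in the query request to obtain aggregated first downsampled data;
    determining the query result of the query request based on the aggregated first downsampled data and the second downsampled data.
  12. A data query method, comprising:
    obtaining a query request, the query request being used for an aggregate query;
    querying, according to the query request, a memory and a persistent storage medium that stores downsampled data;
    when the memory contains data that satisfies the query request, obtaining first raw data and first downsampled data that satisfy the query request from the memory and the persistent storage medium, respectively;
    downsampling the first raw data according to the query request to obtain second downsampled data;
    determining a query result of the query request based on the first downsampled data and the second downsampled data.
  13. A computing system, comprising a storage and a processor, the storage comprising a memory and a persistent storage medium;
    the processor being communicatively connected to the memory and the persistent storage medium and being used to perform the steps of the method according to any one of claims 1-12.
  14. A computer-readable storage medium storing computer instructions that, when executed by one or more processors, cause the one or more processors to perform the steps of the method according to any one of claims 1-12.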
PCT/CN2022/127512 2021-12-09 2022-10-26 Data downsampling and data query method, system and storage medium WO2023103626A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202111501316.5A CN114328601A (zh) 2021-12-09 2021-12-09 Data downsampling and data query method, system and storage medium
CN202111501316.5 2021-12-09

Publications (1)

Publication Number Publication Date
WO2023103626A1 true WO2023103626A1 (zh) 2023-06-15

Family

ID=81050415

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2022/127512 WO2023103626A1 (zh) 2021-12-09 2022-10-26 Data downsampling and data query method, system and storage medium

Country Status (2)

Country Link
CN (1) CN114328601A (zh)
WO (1) WO2023103626A1 (zh)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
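CN113761021A (zh) * 2021-08-17 2021-12-07 杭州涂鸦信息技术有限公司 Method, apparatus and computer device for precision-reduction processing of time-series metric data
CN114328601A (zh) * 2021-12-09 2022-04-12 阿里巴巴(中国)有限公司 Data downsampling and data query method, system and storage medium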

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
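KR20200098433A (ko) * 2019-02-12 2020-08-20 한국전자통신연구원 Image encoding/decoding method and apparatus, and recording medium storing a bitstream
CN112231531A (zh) * 2020-09-15 2021-01-15 山东浪潮通软信息科技有限公司 OpenTSDB-based data display method, device and medium
CN113342817A (zh) * 2021-06-23 2021-09-03 蘑菇物联技术(深圳)有限公司 Data downsampling method, apparatus, system and computer-readable storage medium
CN114328601A (zh) * 2021-12-09 2022-04-12 阿里巴巴(中国)有限公司 Data downsampling and data query method, system and storage medium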

Also Published As

Publication number Publication date
CN114328601A (zh) 2022-04-12

Similar Documents

Publication Publication Date Title
US10963456B2 (en) Querying of materialized views for time-series database analytics
US10614050B2 (en) Managing object requests via multiple indexes
WO2023103626A1 (zh) Data downsampling and data query method, system and storage medium
CN104781812B (zh) Policy-driven data placement and information lifecycle management
US20180246950A1 (en) Scalable database system for querying time-series data
US9361342B2 (en) Query to streaming data
US10114826B2 (en) Autonomic regulation of a volatile database table attribute
US9507807B1 (en) Meta file system for big data
US8938430B2 (en) Intelligent data archiving
US10417265B2 (en) High performance parallel indexing for forensics and electronic discovery
EP2849089A1 (en) Virtual table indexing mechanism and method capable of realizing multi-attribute compound condition query
US9390111B2 (en) Database insert with deferred materialization
US20220019589A1 (en) Workload aware data partitioning
CN112084190A (zh) Big-data-based system and method for real-time storage and management of collected data
US9229968B2 (en) Management of searches in a database system
CN107004036B (zh) Method and system for searching logs containing a large number of entries
US8548980B2 (en) Accelerating queries based on exact knowledge of specific rows satisfying local conditions
US9275059B1 (en) Genome big data indexing
Zheng et al. Timo: In‐memory temporal query processing for big temporal data
US20240095246A1 (en) Data query method and apparatus based on doris, storage medium and device
US11657032B2 (en) Compacted table data files validation
Shrinivas et al. Techniques used in time series databases and their internals

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 22903050

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE