CN115168656A - Data processing method and device, computing equipment and storage medium - Google Patents

Publication number
CN115168656A
Authority
CN
China
Prior art keywords
window
data
index
column
row
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202210759002.3A
Other languages
Chinese (zh)
Inventor
谢佳明
苑艺
廖新涛
林亮
李飞飞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Alibaba Cloud Computing Ltd
Original Assignee
Alibaba Cloud Computing Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Alibaba Cloud Computing Ltd
Priority to CN202210759002.3A
Publication of CN115168656A
Legal status: Pending

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9024 Graphs; Linked lists
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/901 Indexing; Data structures therefor; Storage structures
    • G06F16/9014 Indexing; Data structures therefor; Storage structures hash tables

Landscapes

  • Engineering & Computer Science
  • Databases & Information Systems
  • Theoretical Computer Science
  • Software Systems
  • Data Mining & Analysis
  • Physics & Mathematics
  • General Engineering & Computer Science
  • General Physics & Mathematics
  • Information Retrieval, Db Structures And Fs Structures Therefor

Abstract

The application provides a data processing method, a data processing apparatus, a computing device and a storage medium. The window number of each row is determined from the window column data of that row in the data to be processed, and the window column data is compressed and stored according to the window numbers of the rows, which saves memory. According to the window numbers of the rows, the index jump information for rows sharing a window number is stored in a linked list, and the linked list heads corresponding to different window numbers are stored in a linked list header, which avoids the workload of a full sort and shortens the time taken by window division.

Description

Data processing method and device, computing equipment and storage medium
Technical Field
One or more embodiments of the present disclosure relate to the field of computer technologies, and in particular, to a data processing method, an apparatus, a computing device, and a storage medium.
Background
In the big data era, various industries generate large amounts of data, and analyzing that data can further guide production optimization and business direction. The window operator is a commonly used data analysis operator: it divides data into different windows according to data characteristics and then performs further calculation on the data of each window separately. At present, window division of data using a sorting algorithm suffers from high complexity and long running time.
Disclosure of Invention
In view of this, one or more embodiments of the present specification provide a data processing method, apparatus, computing device and medium.
To achieve the above object, one or more embodiments of the present disclosure provide the following technical solutions:
According to a first aspect of one or more embodiments of the present specification, there is provided a data processing method including:
acquiring data to be processed, and determining a window column and a data column in the data to be processed, wherein the window column indicates that the data to be processed is to be divided into windows according to the data in that column;
determining the window number of each row according to the window column data of that row in the data to be processed;
compressing and storing the window column data according to the window numbers of the rows;
and storing the indexes of the data column data of the rows as a linked list according to the window numbers of the rows, and storing the linked list heads corresponding to different window numbers in a linked list header, wherein the indexes in the linked list indicate the index jump information for the same window number.
In an embodiment of the present specification, the determining, according to the window column data of each row in the data to be processed, the window number of each row includes:
determining hash values of window column data of each row in the data to be processed;
and inserting the hash value into a hash table, and determining the window number according to the position of the hash value in the hash table.
In an embodiment of the present specification, inserting the hash value into a hash table and determining the window number according to the position of the hash value in the hash table includes:
in the case where the hash value does not yet exist in the hash table, determining the window number according to the position of the hash value in the hash table, and establishing a mapping relationship between the window number and the hash value;
and in the case where the hash value already exists in the hash table, determining the window number corresponding to the hash value according to the mapping relationship.
In an embodiment of the present specification, storing the indexes of the data column data of the rows as a linked list according to the window numbers of the rows, and storing the linked list heads corresponding to different window numbers in the linked list header, includes performing the following operations for each row:
determining the linked list head corresponding to the window number according to the window number of the row;
storing the index held in that linked list head into the linked list at the position corresponding to the row;
storing the index of the row into that linked list head.
In an embodiment of the present specification, before storing the index of the data column data of each row as a linked list, the method further includes:
constructing an index column, wherein the index column stores the indexes of the data column data of the rows.
In one embodiment of the present description, the method further comprises:
constructing a window index column according to the index in the linked list head and the index jump information in the linked list that indicates the same window number;
acquiring data column data according to indexes in the window index column;
and obtaining data in the same window according to the data column data and the window column data corresponding to the window number.
In an embodiment of this specification, constructing the window index column according to the index in the linked list head and the index jump information in the linked list that indicates the same window number includes:
acquiring a first index from the linked list head, wherein the first index corresponds to a first window number;
acquiring a second index from the linked list according to the row indicated by the first index;
performing an index jump in the linked list based on the second index to obtain a third index that indicates the first window number in the linked list;
and obtaining the window index column corresponding to the first window number according to the first index, the second index and the third index.
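The traversal described above can be sketched as follows. This is a minimal illustration rather than the patented implementation: the names `window_index_column`, `header` and `links`, the 1-based row indexes, and the five-row sample state (window column reading a, b, a, b, a, with made-up data values) are all assumptions chosen for the example. The sentinel -1 follows the flag value used in the detailed description.

```python
from typing import Dict, List

END = -1  # flag meaning "no further row with this window number"


def window_index_column(header: Dict[int, int], links: List[int], window_no: int) -> List[int]:
    """Follow one window chain and return the 1-based row indexes it visits."""
    indexes = []
    current = header[window_no]        # first index, read from the linked list head
    while current != END:
        indexes.append(current)
        current = links[current - 1]   # jump: links[i - 1] holds the next row for row i
    return indexes


# Hypothetical state for five rows whose window column reads a, b, a, b, a.
header = {1: 5, 2: 4}                  # window number -> most recently stored row index
links = [-1, -1, 1, 2, 3]              # per-row jump information
data_column = [10, 20, 30, 40, 50]     # made-up data column values

w1_rows = window_index_column(header, links, 1)     # first, second, third index
w1_data = [data_column[i - 1] for i in w1_rows]     # data column data of window 1
```

For window number 1 the first index is 5, the second is 3 and the third is 1, after which the chain reaches the flag and stops.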
According to a second aspect of one or more embodiments herein, there is provided a data processing apparatus comprising:
an acquisition unit, configured to acquire data to be processed and determine a window column and a data column in the data to be processed, wherein the window column indicates that the data to be processed is to be divided into windows according to the data in that column;
a determining unit, configured to determine the window number of each row according to the window column data of that row in the data to be processed;
a compression storage unit, configured to compress and store the window column data according to the window numbers of the rows;
and a first processing unit, configured to store the indexes of the data column data of the rows as a linked list according to the window numbers of the rows, and to store the linked list heads corresponding to different window numbers in a linked list header, wherein the indexes in the linked list indicate the index jump information for the same window number.
In an embodiment of the present specification, the determining unit is configured to:
determining hash values of window column data of each row in the data to be processed;
and inserting the hash value into a hash table, and determining the window number according to the position of the hash value in the hash table.
In an embodiment of the present specification, when inserting the hash value into the hash table and determining the window number according to the position of the hash value in the hash table, the determining unit is configured to:
in the case where the hash value does not yet exist in the hash table, determine the window number according to the position of the hash value in the hash table, and establish a mapping relationship between the window number and the hash value;
and in the case where the hash value already exists in the hash table, determine the window number corresponding to the hash value according to the mapping relationship.
In one embodiment of the present specification, the first processing unit is configured to perform the following for each row:
determining the linked list head corresponding to the window number according to the window number of the row;
storing the index held in that linked list head into the linked list at the position corresponding to the row;
storing the index of the row into that linked list head.
In an embodiment of the present specification, the apparatus further includes a construction unit, configured to construct, before the indexes of the data column data of the rows are stored as a linked list, an index column that stores the indexes of the data column data of the rows.
In one embodiment of the present description, the apparatus further comprises a second processing unit configured to:
constructing a window index column according to the index in the linked list head and the index jump information in the linked list that indicates the same window number;
acquiring data column data according to indexes in the window index column;
and obtaining data in the same window according to the data column data and the window column data corresponding to the window number.
In an embodiment of the present specification, when constructing the window index column according to the index in the linked list head and the index jump information in the linked list that indicates the same window number, the second processing unit is configured to:
acquire a first index from the linked list head, wherein the first index corresponds to a first window number;
acquire a second index from the linked list according to the row indicated by the first index;
perform an index jump in the linked list based on the second index to obtain a third index that indicates the first window number in the linked list;
and obtain the window index column corresponding to the first window number according to the first index, the second index and the third index.
According to a third aspect of one or more embodiments of the present specification, there is provided a computing device comprising:
a processor;
a memory for storing processor-executable instructions;
the processor executes the executable instructions to implement the operations performed by the data processing method provided by the first aspect or any embodiment of the first aspect.
According to a fourth aspect of one or more embodiments of the present specification, a computer-readable storage medium is provided, on which computer instructions are stored, and when executed by a processor, the computer instructions implement the operations performed by the data processing method provided by the first aspect or any embodiment of the first aspect.
According to a fifth aspect of one or more embodiments of the present specification, a computer program product is provided, which comprises a computer program that, when executed by a processor, performs the operations performed by the data processing method provided by the first aspect or any embodiment of the first aspect.
In the present application, the window number of each row is determined from the window column data of that row in the data to be processed, and the window column data is compressed and stored according to the window numbers of the rows, which saves memory. According to the window numbers of the rows, the index jump information for rows sharing a window number is stored in a linked list, and the linked list heads corresponding to different window numbers are stored in a linked list header, which avoids the workload of a full sort and shortens the time taken by window division.
Drawings
FIG. 1 is a flowchart of a data processing method according to an exemplary embodiment.
FIG. 2 is a schematic diagram of a process for compressing data to be processed according to an exemplary embodiment.
FIG. 3 is a diagram illustrating the construction of an index linked list and a linked list header for data to be processed according to an exemplary embodiment.
FIG. 4 is a diagram illustrating the construction of the index linked list and the corresponding linked list header for the first row according to an exemplary embodiment.
FIG. 5 is a diagram illustrating the construction of the index linked list and the corresponding linked list header for the second row according to an exemplary embodiment.
FIG. 6 is a diagram illustrating the construction of the index linked list and the corresponding linked list header for the third row according to an exemplary embodiment.
FIG. 7 is a diagram illustrating the construction of the index linked list and the corresponding linked list header for the fourth row according to an exemplary embodiment.
FIG. 8 is a diagram illustrating the construction of the index linked list and the corresponding linked list header for the fifth row according to an exemplary embodiment.
FIG. 9 is a diagram of a process for building an index column according to an exemplary embodiment.
FIG. 10a is a flowchart illustrating a process of building a W1 window index column according to an exemplary embodiment.
FIG. 10b is a flowchart illustrating a process of building a W2 window index column according to an exemplary embodiment.
FIG. 11 is a block diagram of a data processing apparatus according to an exemplary embodiment.
FIG. 12 is a schematic block diagram of a computing device according to an exemplary embodiment.
Detailed Description
Reference will now be made in detail to the exemplary embodiments, examples of which are illustrated in the accompanying drawings. When the following description refers to the accompanying drawings, like numbers in different drawings represent the same or similar elements unless otherwise indicated. The implementations described in the following exemplary embodiments do not represent all implementations consistent with one or more embodiments of the specification. Rather, they are merely examples of apparatus and methods consistent with certain aspects of one or more embodiments of the specification, as detailed in the claims which follow.
It should be noted that: in other embodiments, the steps of the corresponding methods are not necessarily performed in the order shown and described in this specification. In some other embodiments, the methods may include more or fewer steps than those described herein. Moreover, a single step described in this specification may be broken down into multiple steps in other embodiments; multiple steps described in this specification may be combined into a single step in other embodiments.
In the big data era, various industries generate a large amount of data, and the data is analyzed to further guide production optimization and business direction, such as the analysis of sales in the past year, the analysis of usage behavior data of users, and the like. Analytical databases often need to process massive data at a time, for example, all transactions in a year need to be summed to obtain annual transaction amount, which means that a single query often needs to occupy a large amount of resources and consumes a large amount of query time. Therefore, performance optimization in analytical databases has a significant impact on user experience. In the analytical database, the window operator is a commonly used analytical operator, and data can be divided into different windows by using window division in the window operator. At present, a window division method needs to collect all data and then uniformly sort the data, and due to the high complexity of a sorting algorithm, a large amount of query time is often occupied, so that the method is easy to become a bottleneck of query. Therefore, the performance of window division by using the sorting algorithm is low, which results in prolonged response time and reduced experience for the user.
In view of this, the present application provides a data processing method that can be applied to scenarios in which data is divided into windows. Window division in the present application may be understood as grouping the rows of the data according to the data of a designated column: all rows with the same data in that column are placed in the same group, and rows with different data in that column are placed in different groups. The designated column is referred to as the window column.
The data processing method provided by the application can be executed by a computing device. The computing device may be a server, such as one server, multiple servers, a server cluster, a cloud computing platform, and the like, and the application does not limit the device type and the device number of the computing device.
The following embodiments will describe a specific implementation process of the present application with reference to the drawings.
Referring to fig. 1, fig. 1 is a flowchart of a data processing method provided in an exemplary embodiment, the method including:
Step 101, acquiring data to be processed, and determining a window column and a data column in the data to be processed, wherein the window column indicates that the data to be processed is to be divided into windows according to the data in that column.
With the data processing method provided by the application, the data to be processed may be only a part of the full data set, which is more flexible than sorting all the data at once. For example, in some scenarios the sales data of a whole year needs to be divided into windows according to a designated window column, and the data to be processed may then be the sales data of individual quarters. Suppose the sales data of the first and second quarters has already been processed; when the sales data of the third quarter is processed, it can be divided into windows on the basis of the first and second quarters without sorting the sales data of all three quarters together, which improves window division performance and shortens the division time.
Step 102, determining the window number of each row according to the window column data of that row in the data to be processed.
The window number of each row is determined from the window column data in that row. Rows in the same window have the same window column data and therefore obtain the same window number.
Step 103, compressing and storing the window column data according to the window numbers of the rows.
A mapping relationship between window numbers and window column data is established, and for each window the window column data corresponding to its window number is stored only once, which reduces memory usage.
Step 104, storing the indexes of the data column data of the rows as a linked list according to the window numbers of the rows, and storing the linked list heads corresponding to different window numbers in the linked list header, wherein the indexes in the linked list indicate the index jump information for the same window number.
The indexes of the data column data of the rows are stored as a linked list as directed by the window numbers. Each index in the linked list points to the position of the next data column data with the same window number. Rows with the same window number form one window, and different window numbers correspond to different windows. The linked list holds the window chains of all windows, and the head of each window chain is stored in the linked list header. N window chains correspond to N different window numbers and N linked list heads, and the jump information of all N window chains is stored in the same linked list, where N is a positive integer.
In the present application, the window number of each row is determined from the window column data of that row in the data to be processed, and the window column data is compressed and stored according to the window numbers of the rows, which saves memory. According to the window numbers of the rows, the index jump information for rows sharing a window number is stored in a linked list, and the linked list heads corresponding to different window numbers are stored in the linked list header, which avoids the workload of a full sort and shortens the time taken by window division.
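The quarter-by-quarter use mentioned under step 101 can be illustrated with a small stateful sketch that combines steps 102 to 104. Everything named here (`WindowPartitioner`, `add_batch`, `rows_of`, the region values) is hypothetical and chosen only for illustration, and a Python `dict` stands in for the hash table. The point it shows is that a later batch extends the existing window chains without revisiting or sorting earlier rows.

```python
from typing import Dict, Hashable, Iterable, List

END = -1  # flag: no further row with this window number


class WindowPartitioner:
    """Keeps window numbers, linked list heads and jump links across batches."""

    def __init__(self) -> None:
        self.numbers: Dict[Hashable, int] = {}  # window column data -> window number
        self.header: Dict[int, int] = {}        # window number -> latest row index
        self.links: List[int] = []              # per-row index of the previous row in its window

    def add_batch(self, window_values: Iterable[Hashable]) -> None:
        for value in window_values:
            number = self.numbers.setdefault(value, len(self.numbers) + 1)
            self.links.append(self.header.get(number, END))
            self.header[number] = len(self.links)  # 1-based index of the row just added

    def rows_of(self, value: Hashable) -> List[int]:
        rows = []
        current = self.header.get(self.numbers.get(value, 0), END)
        while current != END:
            rows.append(current)
            current = self.links[current - 1]
        return rows


partitioner = WindowPartitioner()
partitioner.add_batch(["east", "west", "east"])  # e.g. first and second quarter
partitioner.add_batch(["west", "east"])          # third quarter arrives later
east_rows = partitioner.rows_of("east")
```

Each added row costs one hash lookup and two writes, so the work for the later batch does not depend on how many rows were processed before it.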
The foregoing is a description of the basic implementation of the present application, and a number of alternative implementations of the present application are described below.
In some embodiments, the present application may calculate a hash value for each row of window column data and insert the hash value into a hash table to obtain a window number of each row.
In a possible implementation manner, the process of determining the window number of each row according to the window column data of each row in the data to be processed may include the following steps:
Step 102-1, determining the hash value of the window column data of each row in the data to be processed.
Step 102-2, inserting the hash value into the hash table, and determining the window number according to the position of the hash value in the hash table.
In one possible implementation, inserting the hash value into the hash table and determining the window number according to its position in step 102-2 covers two cases, which are described separately below.
In the case where the hash value does not yet exist in the hash table, the window number is determined according to the position of the hash value in the hash table, and a mapping relationship between the window number and the hash value is established.
In the case where the hash value already exists in the hash table, the window number corresponding to the hash value is determined according to the mapping relationship.
When the hash table is used to determine the window number, a hash table lookup is performed before the hash value corresponding to the window column data is inserted. The lookup quickly determines whether the same hash value has already been inserted, without a traversal, which shortens the query time. The hash table therefore improves window division performance and also provides an important opportunity for memory compression: within one window the window column data is identical, so it needs to be stored only once per window, which compresses memory and avoids redundancy.
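The two cases above can be sketched in a few lines. This is an illustration under assumptions, not the patented implementation: the function name `assign_window_numbers` is hypothetical, and a Python `dict` is used as the hash table. The `dict` is keyed by the window column data itself and hashes it internally, so hash collisions are resolved by equality, whereas the text speaks of inserting the hash value directly. The window number is taken as the insertion position, counting from 1.

```python
from typing import Dict, Hashable, Iterable, List


def assign_window_numbers(window_column: Iterable[Hashable]) -> List[int]:
    """Return the window number of each row, numbering windows in order of first appearance."""
    table: Dict[Hashable, int] = {}  # hash table: window column data -> window number
    numbers = []
    for value in window_column:
        number = table.get(value)    # lookup first, no traversal of earlier rows
        if number is None:
            # Case 1: not in the table yet; its insertion position becomes its window number.
            number = len(table) + 1
            table[value] = number
        # Case 2: already in the table; the stored mapping supplies the window number.
        numbers.append(number)
    return numbers
```

For a window column that reads a, b, a, the function yields window numbers 1, 2, 1.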
For convenience of understanding, the data processing method provided by the present application is described by taking the data to be processed shown in fig. 2 as an example. As shown in fig. 2, the data to be processed is 5 rows and 2 columns of data, where one column is a window column, that is, the data to be processed is divided into different windows according to the data in the column; the other column is a data column.
The window number of the first row is determined to be 1 according to the window column data a in the window column of the first row. Specifically, the hash value of the window column data a is calculated and inserted into the hash table, and the window number 1 corresponding to the window column data a is determined according to the position of that hash value in the hash table; window number 1 may be understood as belonging to the hash value inserted into the hash table first. A mapping relationship between the window column data a and the window number 1 is constructed. Similarly, according to the window column data b in the window column of the second row, the window number of the second row is determined to be 2, and a mapping relationship between the window column data b and the window number 2 is constructed. When the window number of the third row is determined, the hash value corresponding to the window column data a has already been inserted into the hash table, so the window number of the third row can be determined from the mapping relationship between the window column data a and the window number 1. As shown in FIG. 2, after the above processing the window column is compressed from 5 rows to 2 rows, which reduces memory usage. In this example the window numbers are determined by traversing the data row by row. Alternatively, the window column may be processed in batches: after the window number of one row in a window is determined, the window numbers of the other rows in the same window are obtained. The present application does not specifically limit this.
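The compression in this example amounts to keeping one copy of the window column data per window number. The sketch below assumes the window numbers have already been determined; the function name `compress_window_column` is hypothetical, and the values of rows four and five are assumptions, since the text only states the first three rows (a, b, a).

```python
from typing import Dict, Hashable, List


def compress_window_column(
    window_column: List[Hashable], window_numbers: List[int]
) -> Dict[int, Hashable]:
    """Keep a single copy of the window column data for each window number."""
    compressed: Dict[int, Hashable] = {}
    for value, number in zip(window_column, window_numbers):
        compressed.setdefault(number, value)  # later rows of the same window add nothing
    return compressed


window_column = ["a", "b", "a", "b", "a"]  # rows 4 and 5 are assumed values
window_numbers = [1, 2, 1, 2, 1]
compressed = compress_window_column(window_column, window_numbers)

# The original column can still be recovered row by row from the window numbers.
restored = [compressed[number] for number in window_numbers]
```

Five stored values shrink to two, and the per-row window numbers are enough to recover the window column data of any row.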
At present, the window division method of the DuckDB window operator also makes partial use of a hash table, but it still relies mainly on a sorting algorithm, with the hash table only assisting the sort. In DuckDB, all data is first divided into a fixed number of partitions (for example, 1024 partitions) using a hash table, and each partition is then sorted and divided using a sorting algorithm. The division is therefore still essentially sort-based: the hash table only performs a preliminary split of the data, which avoids a full sort and reduces the workload of the sorting algorithm. As a result, the performance improvement DuckDB obtains for window partitioning is limited. In addition, the hash table introduced in DuckDB occupies extra memory, which increases memory consumption. In the present application, memory is compressed while the hash table is used for division, that is, the window column data of a window needs to be stored only once. This memory compression opportunity is unique to division by hash table and reduces the memory usage of the method; when the number of windows is small and each window contains many rows, the method can use less memory than the original sort-based division algorithm. In other words, DuckDB uses the hash table merely to assist sorting and still essentially divides by sorting, whereas the present application performs window division entirely with the hash table and requires no sorting during division, which reduces complexity and improves performance.
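The difference between the two strategies can be made concrete with a generic sketch. Neither function reproduces DuckDB's implementation or the patented one; `partition_by_sort` and `partition_by_hash` are hypothetical names, row positions are 0-based here, and the sort variant additionally assumes the window column values are mutually comparable. Both produce the same grouping of rows, but the first pays for a comparison sort of all rows while the second makes a single pass with one hash lookup per row.

```python
from itertools import groupby
from typing import Dict, Hashable, List, Sequence


def partition_by_sort(window_column: Sequence[Hashable]) -> Dict[Hashable, List[int]]:
    """Sort row positions by window column data, then cut the sorted run into groups."""
    order = sorted(range(len(window_column)), key=lambda i: window_column[i])
    return {
        key: list(group)
        for key, group in groupby(order, key=lambda i: window_column[i])
    }


def partition_by_hash(window_column: Sequence[Hashable]) -> Dict[Hashable, List[int]]:
    """Single pass over the rows: one hash lookup per row, no sorting."""
    groups: Dict[Hashable, List[int]] = {}
    for i, value in enumerate(window_column):
        groups.setdefault(value, []).append(i)
    return groups
```

Python's `sorted` is stable, so within each group both variants list row positions in ascending order.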
It should be noted that the data to be processed shown in FIG. 2 includes one window column and one data column, but the data to be processed in this application may include one or more window columns and/or one or more data columns. Where there are multiple window columns, the window number is determined by treating the window columns of a row as a whole. Where there are multiple data columns, subsequent processing is performed by treating the data columns of a row as a whole.
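Treating several window columns as a whole can be sketched by combining their values into one composite key per row. The function name `window_numbers_for_rows`, the parameter `window_cols` and the sample rows are assumptions for illustration, and a Python `dict` again stands in for the hash table.

```python
from typing import Dict, List, Sequence, Tuple


def window_numbers_for_rows(rows: Sequence[Sequence], window_cols: Sequence[int]) -> List[int]:
    """Number windows using the values of all window columns in a row as one key."""
    table: Dict[Tuple, int] = {}
    numbers = []
    for row in rows:
        key = tuple(row[c] for c in window_cols)  # all window columns as a single key
        numbers.append(table.setdefault(key, len(table) + 1))
    return numbers


rows = [("a", 1, 10), ("a", 2, 20), ("a", 1, 30), ("b", 1, 40)]
numbers = window_numbers_for_rows(rows, window_cols=(0, 1))
```

Rows one and three share the key ("a", 1) and so share a window number, while ("a", 2) and ("b", 1) each open a new window.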
It should be noted that, in the above process, the window number is determined through the hash value corresponding to the window column data. In other embodiments, a mapping relationship between the window column data and the window number may instead be established directly, and the window number is then determined according to that mapping relationship.
In another possible implementation manner, the window number may be determined directly according to the mapping relationship between the window column data and the window number.
Taking the data to be processed with 5 rows and 2 columns as an example, the window column of the data to be processed contains two different window column data, so a mapping relationship between the window column data a and the window number 1 and a mapping relationship between the window column data b and the window number 2 are constructed. The window number of a row is then determined by querying the mapping relationship. This approach suits data to be processed whose window column data is relatively simple; when the window column data is more complex, processing with a hash function is more efficient and performs better.
The above embodiments describe the process of determining the window number. The following embodiments describe the process of storing the indexes of the data column data of the rows as a linked list according to the window numbers of the rows, and storing the linked list heads corresponding to different window numbers in the linked list header.
In some embodiments, the step 104 stores the index of the data column data in each row as a linked list according to the window number of each row, and stores the linked list header corresponding to the different window numbers in the linked list header, including: the following operations are performed for each row:
and step 104-1, determining a chain table head corresponding to the window number according to the window number of the row.
And step 104-2, storing the indexes stored in the head of the chain table into the chain table corresponding to the rows.
And step 104-3, storing the indexes of the rows to the head of the chain table.
In the present application, by creating the linked list, storing only the linked list header, and storing the index of the row into the linked list header, that is, inserting data at the head of the linked list, the memory overhead can be further reduced.
The above process of storing the linked list and the linked list header is described below, taking data to be processed with 5 rows and 2 columns as an example.
As shown in fig. 2, since rows in the same window have the same window number, the linked list headers are constructed according to the number of window numbers. As shown in fig. 3, there are window number 1 and window number 2, so two linked list headers are constructed. Each constructed linked list header holds a flag, which in this example is -1, indicating that no row with the same window number exists.
The linked list header is determined according to the window number, and the row indexes with the same window number form a window chain. The linked list header here refers to the header of the window chain, and is therefore also referred to as the window linked list header in this embodiment. A linked list for storing indexes, also referred to as the index linked list, is also constructed.
According to window number 1 of the first row, the linked list header of the window chain W1 corresponding to window number 1 is determined to be -1. The index -1 stored in the linked list header is stored into the linked list entry corresponding to the first row, and the index 1 of the first row is stored into the linked list header of the window chain W1. After the processing of the first row is finished, the result shown in fig. 4 is obtained.
According to window number 2 of the second row, the linked list header of the window chain W2 corresponding to window number 2 is determined to be -1. The index -1 stored in the linked list header is stored into the linked list entry corresponding to the second row, and the index 2 of the second row is stored into the linked list header of the window chain W2. After the processing of the second row is finished, the result shown in fig. 5 is obtained.
According to window number 1 of the third row, the linked list header of the window chain W1 corresponding to window number 1 is determined to be 1. The index 1 stored in the linked list header is stored into the linked list entry corresponding to the third row, and the index 3 of the third row is stored into the linked list header of the window chain W1. After the processing of the third row is finished, the result shown in fig. 6 is obtained. As can be seen in fig. 6, the entry for the third row in the linked list records the jump information from the third row to the first row.
According to window number 2 of the fourth row, the linked list header of the window chain W2 corresponding to window number 2 is determined to be 2. The index 2 stored in the linked list header is stored into the linked list entry corresponding to the fourth row, and the index 4 of the fourth row is stored into the linked list header of the window chain W2. After the processing of the fourth row is finished, the result shown in fig. 7 is obtained. As can be seen in fig. 7, the linked list records not only the jump information from the third row to the first row but also the jump information from the fourth row to the second row of the window chain W2.
According to window number 2 of the fifth row, the linked list header of the window chain W2 corresponding to window number 2 is determined to be 4. The index 4 stored in the linked list header is stored into the linked list entry corresponding to the fifth row, and the index 5 of the fifth row is stored into the linked list header of the window chain W2. After the processing of the fifth row is finished, the result shown in fig. 8 is obtained.
As can be seen from the schematic diagrams shown in figs. 3 to 8, two arrays, namely the index linked list and the window linked list header, are used to encode the window information of each row, and the indexes of the same window are organized into a linked list: the index linked list records the jump information of the linked list, and the window linked list header records the linked list header of each window. The window information refers to information about the window to which a row belongs. For example, for the third row of the data to be processed in the above example, the window information indicates that the third row belongs to the first window and that the previous row in the first window is the first row. When a row of data is encoded, the current linked list header of its window is first found according to the window number of the row; the row is then inserted at the head of the linked list: if the linked list header is empty, the row becomes the header; otherwise, the new row points to the original linked list header, and the new row is set as the new header. Inserting at the head of the linked list allows a window chain to be represented by a single linked list header, avoids additionally storing the linked list tail, and saves memory. In the example of the present disclosure, two windows W1 and W2 are obtained; that is, the window information of each row is encoded into window linked lists by the index linked list and the window linked list header.
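The head-insertion encoding walked through above can be sketched as follows. This is a minimal sketch under stated assumptions, not the implementation of the present application: the names `encode`, `links` and `heads` are hypothetical, rows are indexed from 1 as in the example, and slot 0 of each array is left unused so that array positions match row and window numbers.

```python
NO_ROW = -1  # flag from the example: no earlier row in this window


def encode(window_numbers, window_count):
    """Build the index linked list and the window linked list header.

    window_numbers[i] is the window number of row i + 1.
    """
    heads = [NO_ROW] * (window_count + 1)         # one header per window
    links = [NO_ROW] * (len(window_numbers) + 1)  # one entry per row
    for row, window in enumerate(window_numbers, start=1):
        links[row] = heads[window]  # step 104-2: old header goes into the row's entry
        heads[window] = row         # step 104-3: the row becomes the new header
    return links, heads


links, heads = encode([1, 2, 1, 2, 2], window_count=2)
print(links[1:])  # [-1, -1, 1, 2, 4]
print(heads[1:])  # [3, 5]
```

Only one header value per window is kept, and each row costs a single integer in the index linked list, which is where the memory saving described above comes from.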
In one possible implementation, an index column is constructed to store the index of the data column data of each row. The index is the row number corresponding to the data column data of each row. After the index column is constructed, the indexes of the data column data of each row are stored as a linked list according to the indexes stored in the index column.
Still taking the data to be processed with 5 rows and 2 columns as an example, as shown in fig. 9, after the window numbers of the rows are obtained, an index column, an index linked list and a window linked list header are constructed. The index column is in effect the row numbers of the data to be processed; for example, the index of the first row is 1, and the index of the second row is 2. The index column is constructed to facilitate subsequent sorting and to assist the subsequent operations of the window operator. The process of constructing the index linked list and the window linked list header has been described in the foregoing embodiments and is not repeated here.
In some embodiments, in the scenario shown in fig. 9, a sixth row of data is newly added, whose window column data is c and whose data column data is F. The window number of the sixth row is determined to be 3 according to its window column data c, and c is added to the (compressed) window column. At this time, since there is no linked list header corresponding to window number 3, the linked list header corresponding to window number 3 is constructed. The subsequent process is the same as the processing of the first to fifth rows and is not repeated here.
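Adding a row whose window column data has not been seen before can be sketched as follows. This is an illustrative sketch only: the class `WindowEncoder` and its attribute names are hypothetical, a Python `dict` stands in for the hash table of the present application, and the data column values A to E for the first five rows are assumed (only the value F of the sixth row is given in the text).

```python
NO_ROW = -1  # flag: no earlier row in this window


class WindowEncoder:
    """Incrementally encodes rows; slot 0 of every list is unused."""

    def __init__(self):
        self.window_of = {}          # window column data -> window number
        self.window_values = [None]  # compressed window column, one value per window
        self.heads = [NO_ROW]        # window linked list header
        self.links = [NO_ROW]        # index linked list
        self.data = [None]           # data column

    def append(self, window_value, data_value):
        window = self.window_of.get(window_value)
        if window is None:
            # New window: store its value once and construct its header.
            window = len(self.window_values)
            self.window_of[window_value] = window
            self.window_values.append(window_value)
            self.heads.append(NO_ROW)
        row = len(self.links)
        self.data.append(data_value)
        self.links.append(self.heads[window])  # the row's entry takes the old header
        self.heads[window] = row               # the row becomes the new header
        return window


enc = WindowEncoder()
for w, d in zip("ababb", "ABCDE"):
    enc.append(w, d)
window = enc.append("c", "F")  # the sixth row
print(window)                  # 3
print(enc.window_values[1:])   # ['a', 'b', 'c']
print(enc.heads[1:])           # [3, 5, 6]
print(enc.links[6])            # -1
```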
The above process is the process of encoding the data to be processed, i.e., the data input process. In the data input stage, the data to be processed is input and divided into windows, and is stored after compression and index encoding. In the data input stage, the division information of all windows is calculated; the compressed and index-encoded data retains the window information of the original data while redundant data is compressed, so that memory occupation can be reduced.
The process of data output is described below. In the data output process, the data is decoded and restored, and the decoded data is output window by window in sequence.
In some embodiments, the data processing method of the present application further comprises the following steps:
Step 201, a window index column is constructed according to the index in the linked list header and the index jump information of the same window number indicated by the linked list.
For each window, since the linked list of the window has already been constructed, the index column of the window can be constructed from the linked list. The process of constructing the window index column is as follows: starting from the linked list header of the window, the linked list is accessed entry by entry according to the index linked list, and each index obtained is added to the window index column. It should be noted that one linked list header corresponds to one window index column.
Step 202, data column data is obtained according to the indexes in the window index column.
Step 203, the data in the same window is obtained according to the data column data and the window column data corresponding to the window number.
In some embodiments, step 201, constructing a window index column according to the index in the linked list header and the index jump information of the same window number indicated by the linked list, includes the following steps:
Step 201-1, a first index in the linked list header is obtained, where the first index corresponds to a first window number.
Step 201-2, a second index is obtained from the linked list according to the row indicated by the first index.
Step 201-3, index jumping is performed on the linked list based on the second index, and a third index indicating the first window number is obtained from the linked list.
Step 201-4, the window index column corresponding to the first window number is obtained according to the first index, the second index and the third index.
Taking the data to be processed with 5 rows and 2 columns as an example, as shown in fig. 9, the first index 3 in the linked list header is obtained, and the first index 3 corresponds to the first window. According to the third row indicated by the first index 3, the second index 1 corresponding to the third row is obtained from the linked list. Then, according to the first row indicated by the second index 1, the third index -1 corresponding to the first row is obtained from the linked list; the third index is the flag -1, which indicates that the first row is the last row in the first window. This processing yields the W1 window index column shown in fig. 10a. The next first index 5 in the linked list header is processed in the same way: index jumping is performed on the linked list starting from the first index 5, all the indexes indicating the second window number are obtained from the linked list, and the W2 window index column shown in fig. 10b is obtained.
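The index jumping described above can be sketched as a simple traversal. This is a minimal sketch with hypothetical names (`window_index_column`, `links`, `heads`); the array contents are the ones produced by the 5-row example, with slot 0 unused. The indexes are listed in traversal order, starting from the header, which is an assumption about ordering made for this sketch.

```python
NO_ROW = -1  # flag: end of the window chain


def window_index_column(links, head):
    """Follow the index linked list from a window's header until the flag."""
    column = []
    row = head
    while row != NO_ROW:
        column.append(row)
        row = links[row]  # jump to the previous row of the same window
    return column


links = [NO_ROW, -1, -1, 1, 2, 4]  # index linked list after five rows
heads = [NO_ROW, 3, 5]             # window linked list header for W1 and W2

print(window_index_column(links, heads[1]))  # [3, 1]
print(window_index_column(links, heads[2]))  # [5, 4, 2]
```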
For each window, after the window index column is obtained, the window data may be further read according to the indexes. For the data column, the data column in the data to be processed is read according to the indexes in the window index column. For the window column, the compressed window column is read according to the window number. As shown in figs. 10a and 10b, the process of constructing the window index column is referred to as constructing a partition index, and after the partition index is constructed, the partition data is read.
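Reading a window back from the encoded form can be sketched as follows. The function name `read_window` is hypothetical, and the data column values A to E are assumed for illustration, since the text does not list them; slot 0 of each list is unused so that positions match row and window numbers.

```python
def read_window(window, index_column, data_column, window_values):
    """Restore the rows of one window as (window column data, data column data).

    The window column value is read once from the compressed window column
    by window number; the data column is read by index.
    """
    window_value = window_values[window]
    return [(window_value, data_column[row]) for row in index_column]


data_column = [None, "A", "B", "C", "D", "E"]  # assumed values, rows 1 to 5
window_values = [None, "a", "b"]               # compressed window column

print(read_window(1, [3, 1], data_column, window_values))
# [('a', 'C'), ('a', 'A')]
print(read_window(2, [5, 4, 2], data_column, window_values))
# [('b', 'E'), ('b', 'D'), ('b', 'B')]
```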
For the usage scenario of the window operator, after the window index column is constructed, the window data is often not read directly, but further operations are performed on the window data, such as sorting within the window, calculating a window function, and the like. The window index column constructed by the method allows data to be accessed before materialization, can conveniently provide a flexible reading mode for the data, and avoids copying the data.
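Sorting within a window before materialization can be sketched by reordering only the window index column. This is an illustrative sketch: the function name `sort_window` and the sort key values are hypothetical, and the data column itself is never copied or moved.

```python
def sort_window(index_column, sort_keys):
    """Return the window's indexes ordered by the sort key of each row."""
    return sorted(index_column, key=lambda row: sort_keys[row])


sort_keys = [None, 30, 10, 20, 50, 40]  # hypothetical key per row, slot 0 unused
w2_index_column = [5, 4, 2]             # window index column of W2

ordered = sort_window(w2_index_column, sort_keys)
print(ordered)  # [2, 5, 4]
```

A window function can then walk `ordered` and read each row by index, so the underlying data is touched only when a value is actually needed.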
Compared with the traditional sorting algorithm, first, window division is performed using the hash table algorithm, which greatly improves the performance of window division. Second, when window division is performed using the hash table algorithm, the window column data of the same window is stored only once, that is, the memory is compressed and memory occupation is reduced; when the number of windows is small, the memory occupation is even lower than that of the sorting algorithm. The present application therefore has advantages in both performance and resource consumption.
In the present application, window division is performed using the hash table, which shortens the time taken by window division. In addition, the window column data is compressed and stored, which reduces the memory occupied by the window column, improves the performance of data processing, reduces the response time to the user, and improves the user's query experience.
Corresponding to the embodiments of the method, the present specification also provides embodiments of an apparatus and a computing device applied by the apparatus.
Referring to fig. 11, fig. 11 is a block diagram of a data processing apparatus according to an exemplary embodiment, the apparatus including:
an obtaining unit 1101, configured to obtain data to be processed, and determine a window column and a data column in the data to be processed, where the window column is used to instruct to perform window division on the data to be processed according to the column of data;
a determining unit 1102, configured to determine window numbers of each row according to window column data of each row in the data to be processed;
a compression storage unit 1103, configured to perform compression storage on the window column data according to the window numbers of the respective rows;
the first processing unit 1104 is configured to store, as a linked list, indexes of data column data in each row according to the window numbers of each row, and store linked list headers corresponding to different window numbers in the linked list header, where the indexes in the linked list are used to indicate index skip information of the same window number.
In an embodiment of the present description, the determining unit 1102 is configured to:
determining hash values of window column data of each row in the data to be processed;
and inserting the hash value into a hash table, and determining the window number according to the position of the hash value in the hash table.
In an embodiment of the present specification, the determining unit 1102, when the hash value is inserted into the hash table and the window number is determined according to the position of the hash value in the hash table, is configured to:
under the condition that a hash value does not exist in a hash table, determining a window number according to the position of the hash value in the hash table, and establishing a mapping relation between the window number and the hash value;
and under the condition that the hash value exists in the hash table, determining the window number corresponding to the hash value according to the mapping relation.
In one embodiment of the present description, the first processing unit 1104 is configured to perform the following for each row:
determining the linked list header corresponding to the window number according to the window number of the row;
storing the index stored in the linked list header into the linked list entry corresponding to the row;
storing the index of the row into the linked list header.
In an embodiment of the present specification, before storing the index of each row data column data as a linked list, the apparatus further includes: the constructing unit 1105 is configured to construct an index column, where an index of each row of data column data is stored in the index column.
In one embodiment of the present description, the apparatus further comprises a second processing unit 1106 to:
constructing a window index column according to the index in the linked list header and the index jump information of the same window number indicated by the linked list;
acquiring data column data according to indexes in the window index column;
and obtaining data in the same window according to the data column data and the window column data corresponding to the window number.
In an embodiment of the present specification, when constructing a window index column according to the index in the linked list header and the index jump information indicating the same window number in the linked list, the second processing unit 1106 is configured to:
acquiring a first index in the linked list header, wherein the first index corresponds to a first window number;
acquiring a second index from the linked list according to the row indicated by the first index;
performing index skipping on the linked list based on the second index to obtain a third index indicating the first window number in the linked list;
and obtaining a window index column corresponding to the first window number according to the first index, the second index and the third index.
For the device embodiments, since they substantially correspond to the method embodiments, reference may be made to the partial description of the method embodiments for relevant points. The above-described embodiments of the apparatus are merely illustrative, and the units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network modules. Some or all of the elements can be selected according to actual needs to achieve the purpose of the solution in the specification. One of ordinary skill in the art can understand and implement without inventive effort.
The application also provides a computing device, and referring to fig. 12, fig. 12 is a schematic block diagram of a computing device provided by an exemplary embodiment. Referring to fig. 12, at the hardware level, the apparatus includes a processor 1202, an internal bus 1204, a network interface 1206, a memory 1208, and a non-volatile memory 1210, although hardware required for implementing other functions may also be included. One or more embodiments of the present description can be implemented in software, such as by the processor 1202 reading corresponding computer programs from the non-volatile memory 1210 into the memory 1208 and then executing. Of course, besides the software implementation, the one or more embodiments in this specification do not exclude other implementations, such as logic devices or combination of software and hardware, and so on, that is, the execution subject of the following processing flow is not limited to each logic unit, and may also be hardware or logic devices.
The present application also provides a computer program product comprising a computer program which, when executed by a processor, implements the data processing method provided in any of the embodiments of the present application.
The apparatuses, modules or units illustrated in the above embodiments may be implemented by a computer chip or an entity, or implemented by a product with certain functions. A typical implementation device is a computer, which may take the form of a personal computer, laptop computer, cellular telephone, camera phone, smart phone, personal digital assistant, media player, navigation device, email messaging device, game console, tablet computer, wearable device, or a combination of any of these devices.
In a typical configuration, a computer includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.
The memory may include volatile memory in a computer-readable medium, random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.
Computer-readable media include permanent and non-permanent, removable and non-removable media, and may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase-change random access memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic disk storage, quantum memory, graphene-based storage media or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.
It should also be noted that the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a … …" does not exclude the presence of another identical element in a process, method, article, or apparatus that comprises the element.
The foregoing description has been directed to specific embodiments of this disclosure. Other embodiments are within the scope of the following claims. In some cases, the actions or steps recited in the claims may be performed in a different order than in the embodiments and still achieve desirable results. In addition, the processes depicted in the accompanying figures do not necessarily require the particular order shown, or sequential order, to achieve desirable results. In some embodiments, multitasking and parallel processing may also be possible or may be advantageous.
The terminology used in the description of the one or more embodiments is for the purpose of describing the particular embodiments only and is not intended to be limiting of the description of the one or more embodiments. As used in one or more embodiments of the present specification and the appended claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. It should also be understood that the term "and/or" as used herein refers to and encompasses any and all possible combinations of one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in one or more embodiments of the present description to describe various information, such information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, first information may also be referred to as second information, and similarly, second information may also be referred to as first information, without departing from the scope of one or more embodiments herein. The word "if," as used herein, may be interpreted as "when … …" or "upon … …" or "in response to a determination," depending on the context.
The above description is only for the purpose of illustrating the preferred embodiments of the one or more embodiments of the present disclosure, and is not intended to limit the scope of the one or more embodiments of the present disclosure, and any modifications, equivalent substitutions, improvements, etc. made within the spirit and principle of the one or more embodiments of the present disclosure should be included in the scope of the one or more embodiments of the present disclosure.

Claims (10)

1. A data processing method, comprising:
acquiring data to be processed, and determining a window column and a data column in the data to be processed, wherein the window column is used for indicating that the data to be processed is subjected to window division according to the column of data;
determining window numbers of all rows according to window column data of all rows in the data to be processed;
compressing and storing the window column data according to the window numbers of the rows;
and storing the indexes of the data column data of each row as linked lists according to the window numbers of each row, and storing linked list headers corresponding to different window numbers in the linked list headers, wherein the indexes in the linked lists are used for indicating the index jump information of the same window number.
2. The method of claim 1, wherein the determining the window number of each row according to the window column data of each row in the data to be processed comprises:
determining hash values of window column data of each row in the data to be processed;
and inserting the hash value into a hash table, and determining the window number according to the position of the hash value in the hash table.
3. The method of claim 2, wherein inserting the hash value into the hash table and determining the window number according to the hash value's position in the hash table comprises:
under the condition that a hash value does not exist in a hash table, determining a window number according to the position of the hash value in the hash table, and establishing a mapping relation between the window number and the hash value;
and under the condition that the hash value exists in the hash table, determining the window number corresponding to the hash value according to the mapping relation.
4. The method of claim 1, wherein storing the index of the data column data in each row as a linked list according to the window number of each row, and storing the linked list header corresponding to the different window numbers in the linked list header comprises:
the following operations are performed for each row:
determining the linked list header corresponding to the window number according to the window number of the row;
storing the index stored in the linked list header into the linked list entry corresponding to the row;
storing the index of the row into the linked list header.
5. The method of claim 1, further comprising, prior to storing the indexes of the data column data of each row as a linked list:
constructing an index column, wherein the index column stores the indexes of the data column data of each row.
6. The method according to any one of claims 1-5, further comprising:
constructing a window index column according to the index in the linked list header and the index jump information of the same window number indicated by the linked list;
acquiring data column data according to indexes in the window index column;
and obtaining data in the same window according to the data column data and the window column data corresponding to the window number.
7. The method of claim 6, wherein constructing a window index column according to the index in the head of the linked list and the index jump information indicating the same window number in the linked list comprises:
acquiring a first index in the linked list header, wherein the first index corresponds to a first window number;
acquiring a second index from the linked list according to the row indicated by the first index;
performing index skipping on the linked list based on the second index to obtain a third index indicating the first window number in the linked list;
and obtaining a window index column corresponding to the first window number according to the first index, the second index and the third index.
8. A data processing apparatus, characterized in that the apparatus comprises:
the acquisition unit is used for acquiring data to be processed and determining a window column and a data column in the data to be processed, wherein the window column is used for indicating that the data to be processed is subjected to window division according to the column of data;
the determining unit is used for determining window numbers of all rows according to the window column data of all rows in the data to be processed;
the compression storage unit is used for compressing and storing the window column data according to the window numbers of the rows;
and the first processing unit is used for storing the indexes of the data column data of each row as a linked list according to the window numbers of each row, and storing linked list headers corresponding to different window numbers in the linked list headers, wherein the indexes in the linked list are used for indicating the index jump information of the same window number.
9. A computing device, comprising:
a processor;
a memory for storing processor-executable instructions;
wherein the processor implements the data processing method of any one of claims 1 to 7 by executing the executable instructions.
10. A computer-readable storage medium on which computer instructions are stored, the instructions, when executed by a processor, implementing a data processing method as claimed in any one of claims 1 to 7.
CN202210759002.3A 2022-06-29 2022-06-29 Data processing method and device, computing equipment and storage medium Pending CN115168656A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210759002.3A CN115168656A (en) 2022-06-29 2022-06-29 Data processing method and device, computing equipment and storage medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210759002.3A CN115168656A (en) 2022-06-29 2022-06-29 Data processing method and device, computing equipment and storage medium

Publications (1)

Publication Number Publication Date
CN115168656A true CN115168656A (en) 2022-10-11

Family

ID=83489704

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210759002.3A Pending CN115168656A (en) 2022-06-29 2022-06-29 Data processing method and device, computing equipment and storage medium

Country Status (1)

Country Link
CN (1) CN115168656A (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination