CN114442940A

CN114442940A - Data processing method, device, medium and electronic equipment

Info

Publication number: CN114442940A
Application number: CN202210001643.2A
Authority: CN
Inventors: 尤夕多; 姚琴; 蒋鸿翔; 余利华
Original assignee: Netease Hangzhou Network Co Ltd
Current assignee: Hangzhou Netease Shuzhifan Technology Co ltd
Priority date: 2022-01-04
Filing date: 2022-01-04
Publication date: 2022-05-06

Abstract

The embodiments of the application provide a data processing method. The method may comprise the following steps: parsing a received SQL statement to obtain execution plans, and screening the execution plans to obtain a target execution plan containing a write operation; adding a data aggregation operation before the write operation to obtain an adjusted target execution plan; optimizing the adjusted target execution plan based on attribute information corresponding to the data table to obtain a final execution plan; and running the final execution plan, so that the data aggregation operation is completed automatically before the write operation for incremental data. In addition, the embodiments of the application provide a data processing apparatus, a medium, and an electronic device.

Description

Data processing method, device, medium and electronic equipment

Technical Field

Embodiments of the present application relate to the field of computer processing, and more particularly, to a data processing method, apparatus, medium, and electronic device.

Background

This section is intended to provide a background or context to the embodiments of the application that are recited in the claims. The description herein is not admitted to be relevant prior art by inclusion in this section.

In the related art for data aggregation operation, a stored data block is read from a data table, then data aggregation operation is performed on the data block, and after the data block after data aggregation is obtained, the data block is written into the data table.

Disclosure of Invention

In the related art, a data aggregation operation must first read data and then write it back, which wastes I/O resources. It must also be triggered manually, for example by setting up an additional scheduled task to trigger data aggregation optimization, which increases the maintenance cost of the task.

Therefore, a data processing method is needed that can complete the data aggregation operation before the write operation for incremental data. This avoids the process of reading a data block that has already been written, performing the data aggregation operation, and then rewriting the data block, and it avoids setting up an additional scheduled task to trigger the data aggregation operation. Data aggregation is thus performed automatically, I/O resource usage is reduced, and the maintenance cost of the task is lowered.

The incremental data refers to newly generated data which needs to be written into the data table. The data table may be maintained in a database.

In this context, embodiments of the present application are intended to provide a data processing method, apparatus, medium, and electronic device.

In a first aspect of embodiments of the present application, there is provided a data processing method applied to a data processing engine, including: parsing a received SQL statement to obtain execution plans, and screening the execution plans to obtain a target execution plan containing a write operation; adding a data aggregation operation before the write operation to obtain an adjusted target execution plan, wherein the data aggregation operation is used for adjusting the data distribution state of the incremental data aggregated in the incremental data block to be written, and the write operation is used for writing the incremental data aggregated in the incremental data block into a data table according to the adjusted data distribution state; optimizing the adjusted target execution plan based on the attribute information corresponding to the data table to obtain a final execution plan; and running the final execution plan.

In a second aspect of embodiments of the present application, there is provided a data processing apparatus applied to a data processing engine, the apparatus including: a screening module, configured to parse a received SQL statement to obtain execution plans, and screen the execution plans to obtain a target execution plan containing a write operation; an adding module, configured to add a data aggregation operation before the write operation to obtain an adjusted target execution plan, wherein the data aggregation operation is used for adjusting the data distribution state of the incremental data aggregated in the incremental data block to be written, and the write operation is used for writing the incremental data aggregated in the incremental data block into a data table according to the adjusted data distribution state; an optimization module, configured to optimize the adjusted target execution plan based on the attribute information corresponding to the data table to obtain a final execution plan; and a running module, configured to run the final execution plan.

In a third aspect of embodiments of the present application, there is provided a computer-readable storage medium storing a computer program for causing a processor to execute a data processing method as shown in any one of the foregoing embodiments.

In a fourth aspect of embodiments herein, there is provided an electronic device comprising: a processor; a memory for storing processor-executable instructions; wherein the processor executes the executable instructions to implement the data processing method as shown in any one of the foregoing embodiments.

In the foregoing technical solution, the final execution plan includes a data aggregation operation before the write operation. The data aggregation operation may be used to adjust the data distribution state of the incremental data aggregated in the incremental data block to be written, and the write operation may be used to write that incremental data into a data table according to the adjusted data distribution state. The data aggregation operation can therefore be completed before the write operation for the incremental data block. This avoids the flow of first reading a written data block, performing the data aggregation operation, and then rewriting the data block, and it avoids setting up an additional scheduled task to trigger the data aggregation operation. Data aggregation is thus performed automatically, IO resource usage is reduced, and the maintenance cost of the task is lowered.

In addition, because the data distribution state of the incremental data aggregated in the incremental data block is adjusted before the write operation, the data distribution within the data block becomes more reasonable. When data meeting a data screening condition is read (that is, during data skipping), the read can benefit from the optimized data distribution: less data is traversed and data reading efficiency is improved.

Drawings

The foregoing and other objects, features and advantages of the exemplary embodiments of this application will be readily understood by reading the following detailed description with reference to the accompanying drawings. Several embodiments of the present application are illustrated by way of example, and not by way of limitation, in the figures of the accompanying drawings and in which:

FIG. 1 is a schematic view of an application scenario of a data processing method according to an embodiment of the present application;

FIG. 2 is a flow chart of a method of data processing according to an embodiment of the present application;

FIG. 3 is a flowchart illustrating a method of data aggregation operations according to an embodiment of the present application;

FIG. 4 is a flowchart illustrating a method of data mapping operations according to an embodiment of the present application;

FIG. 5 is a flowchart illustrating a method for updating an index according to an embodiment of the present application;

FIG. 6 is a flow chart illustrating a method for reading data according to an embodiment of the present application;

FIG. 7 is a flow chart illustrating a method of write operations according to an embodiment of the present application;

FIG. 8 is a flowchart illustrating a method of a read operation according to an embodiment of the present application;

FIG. 9 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present application;

FIG. 10 is a schematic diagram of a program product for a data processing method according to an embodiment of the present application;

FIG. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present application.

In the drawings, the same or corresponding reference numerals indicate the same or corresponding parts.

Detailed Description

The principles and spirit of the present application will be described with reference to a number of exemplary embodiments. It should be understood that these embodiments are given solely for the purpose of enabling those skilled in the art to better understand and to practice the present application, and are not intended to limit the scope of the present application in any way. Rather, these embodiments are provided so that this disclosure will be thorough and complete, and will fully convey the scope of the disclosure to those skilled in the art.

As will be appreciated by one skilled in the art, embodiments of the present application may be embodied as a system, apparatus, device, method, or computer program product. Thus, the present application may be embodied in the form of: entirely hardware, entirely software (including firmware, resident software, micro-code, etc.), or a combination of hardware and software.

Moreover, any number of elements in the drawings are by way of example and not by way of limitation, and any nomenclature is used solely for differentiation and not by way of limitation.

According to an embodiment of the application, a data processing method, a medium, a device and an electronic device are provided.

The principles and spirit of the present application are explained in detail below with reference to several representative embodiments of the present application.

Summary of the Invention

The inventor finds that, after the data processing engine parses a received SQL statement to obtain a corresponding execution plan and optimizes that plan, the engine executes the operations in the order in which they are arranged in the execution plan. The operations may include a write operation for an incremental data block, a read operation for a stored data block, and the like. The data processing engine may be installed in any type of electronic device. The application does not limit the specific type of electronic device.

If a data aggregation operation can be added before a write operation, the data aggregation operation can be completed before the write operation for the incremental data block.

In summary, in the present application, the data processing engine may parse the received SQL statement to obtain execution plans, and screen the execution plans to obtain a target execution plan including a write operation; add a data aggregation operation before the write operation to obtain an adjusted target execution plan; optimize the adjusted target execution plan based on the attribute information corresponding to the data table to obtain a final execution plan; and run the final execution plan.

The final execution plan includes a data aggregation operation before the write operation. The data aggregation operation can be used to adjust the data distribution state of the incremental data aggregated in the incremental data block to be written, and the write operation can be used to write that incremental data into the data table according to the adjusted data distribution state. The data aggregation operation can therefore be completed before the write operation for the incremental data block. This avoids the flow of first reading a written data block, performing the data aggregation operation, and then rewriting the data block, and it avoids setting up an additional scheduled task to trigger the data aggregation operation. Data aggregation is thus performed automatically, IO resource usage is reduced, and the maintenance cost of the task is lowered.

Having described the basic principles of the present application, various non-limiting embodiments of the present application are described in detail below.

Application Scenario Overview

Referring to fig. 1, fig. 1 is a schematic view of an application scenario of a data processing method according to an embodiment of the present application.

As shown in fig. 1, the foregoing application scenario may include client devices 1011, 1012, 1013, and the like, on which database clients are installed, and a server device 102 on which a data processing engine is installed. Types of such devices may include laptop computers, cell phones, personal digital assistants (PDAs), and the like. The application does not limit the specific type of these devices.

The user may send SQL statements for the database 103 to the server device 102 through the client device. The server device 102 may perform data operation on the database 103 based on the received SQL statement through a data processing engine. The data operation may include a write operation, a read operation, and the like. The database 103 may be a distributed database, or may be a local database carried on a server device. The application does not limit the type of database. The database 103 may contain data tables, which may include data files. The incremental data blocks required to be written into the database can be written into the data table in the form of data files.

The data processing engine can parse the received SQL statement to obtain execution plans, and screen the execution plans to obtain a target execution plan containing a write operation; add a data aggregation operation before the write operation to obtain an adjusted target execution plan; optimize the adjusted target execution plan based on the attribute information corresponding to the data table to obtain a final execution plan; and run the final execution plan.

Exemplary method

Referring to fig. 2, fig. 2 is a flowchart illustrating a method of processing data according to an embodiment of the present application.

The data processing method illustrated in fig. 2 may be applied to a data processing engine. In some embodiments, to improve data processing efficiency and reduce engine size, the data processing engine may employ a Spark engine based on distributed iterative computation.

The Spark engine is an efficient, general-purpose computing engine that supports large-scale data processing. It can be used to build large, low-latency data analysis applications, and it supports SQL statement processing. It was developed by the AMPLab at the University of California, Berkeley, has a computing environment similar to that of Apache Hadoop, extends the MapReduce computing model, and efficiently supports more computing modes, including interactive queries and stream processing.

As shown in fig. 2, the data processing method may include S202-S208. The present application does not specifically limit the order of execution of the steps unless specifically stated otherwise.

S202, parsing the received SQL statement to obtain execution plans, and screening the execution plans to obtain a target execution plan containing a write operation.

SQL is a language for operating on a database. The user can perform related operations on the database, such as write operations and read operations, by editing SQL statements.

The execution plan is a set of operations that the data processing engine needs to perform. The operations contained in an SQL statement can be obtained by parsing the statement, and these operations form an execution plan. For example, if an SQL statement is a write statement for an incremental data block, parsing the statement yields an execution plan containing the write operation.

The user can edit the SQL statement through the database client. The client can send the SQL statement to a corresponding server. The data processing engine carried in the server can analyze the SQL statement to obtain a corresponding execution plan.

In this step, the data processing engine may determine whether each of the analyzed execution plans includes a write operation, and determine the execution plan including the write operation as a target execution plan.
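As a non-limiting illustration, the screening in this step can be sketched as follows. The `ExecutionPlan` class and the operator names are hypothetical stand-ins introduced only for this sketch; they are not the plan representation of any particular engine.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class ExecutionPlan:
    """Hypothetical plan: an ordered list of operator names."""
    operators: List[str] = field(default_factory=list)


def contains_write(plan: ExecutionPlan) -> bool:
    # A plan qualifies as a target plan if any of its operators is a write.
    return "write" in plan.operators


def screen_target_plans(plans: List[ExecutionPlan]) -> List[ExecutionPlan]:
    # Keep only the plans that contain a write operation.
    return [p for p in plans if contains_write(p)]


plans = [
    ExecutionPlan(["scan", "filter", "project"]),  # read-only query
    ExecutionPlan(["scan", "project", "write"]),   # query that writes data
]
targets = screen_target_plans(plans)
```

Only the second plan is retained, because it is the only one that contains a write operator.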

S204, before the write operation, adding a data aggregation operation to obtain an adjusted target execution plan.

The data aggregation operation is used to adjust a data distribution state of the aggregated delta data within the delta data block to be written. And the writing operation is used for writing the aggregated incremental data in the incremental data block into a data table according to the adjusted data distribution state.

In this step, a data aggregation operator may be added before a write operator corresponding to the target execution plan, so that when performing a related operation subsequently according to the execution plan, the data aggregation operation may be performed on the incremental data first, and then the write operation is performed.
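A minimal sketch of this adjustment is given below, assuming the plan is modeled as a plain list of operator names. The names `aggregate` and `write` are hypothetical labels chosen for illustration.

```python
from typing import List


def add_aggregation_before_write(operators: List[str],
                                 agg_name: str = "aggregate") -> List[str]:
    """Return a new operator list with an aggregation operator placed
    immediately before every write operator."""
    adjusted = []
    for op in operators:
        if op == "write":
            adjusted.append(agg_name)  # aggregation runs first
        adjusted.append(op)
    return adjusted


adjusted_plan = add_aggregation_before_write(["scan", "project", "write"])
print(adjusted_plan)  # ['scan', 'project', 'aggregate', 'write']
```

Because the engine executes operators in plan order, placing the aggregation operator directly ahead of the write operator is enough to guarantee that the incremental data is reordered before it is written.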

In the database, the data distribution status may indicate the ordering of the incremental data. The incremental data is formed by aggregating subdata corresponding to a plurality of fields. The data aggregation operation refers to a process of re-ordering each incremental data according to the ordering condition of the subdata under the preset field in the plurality of fields of each incremental data. The write operation may write the incremental data to the data file included in the data table in the adjusted order. In some embodiments, the data file is a column store data file.

The preset field can be set according to the service requirement. In some embodiments, a field on which the data aggregation operation depends may be placed in the attribute information of the data table, and subsequently, when the target execution plan is optimized based on the attribute information of the data table, the field carried by the attribute information may be used as the preset field.

The preset field may be one field or two or more fields among the plurality of fields. For example, the plurality of fields includes four fields of name, gender, age, and height. The preset field may be an age field of four fields, assuming that sorting according to age field is required. The preset field may be two fields of age and height, assuming that sorting according to height and age fields is required.
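By way of example only, reordering by one preset field or by several preset fields can be sketched as below. The sample records and the helper name `aggregate` are illustrative assumptions; a plain ascending sort is used here as the simplest possible reordering.

```python
from operator import itemgetter

# Illustrative records with the four fields named in the text.
records = [
    {"name": "A", "gender": "male", "age": 11, "height": 144},
    {"name": "J", "gender": "female", "age": 15, "height": 150},
    {"name": "C", "gender": "female", "age": 14, "height": 145},
    {"name": "N", "gender": "male", "age": 12, "height": 137},
]


def aggregate(rows, preset_fields):
    """Reorder rows in ascending order of the preset field(s).

    preset_fields must contain at least one field name.
    """
    return sorted(rows, key=itemgetter(*preset_fields))


by_age = aggregate(records, ["age"])
by_height_then_age = aggregate(records, ["height", "age"])
```

With `["age"]` the records come out in the order A, N, C, J; with `["height", "age"]` they come out in the order N, A, C, J.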

Referring to Table 1a, Table 1a illustrates an ordering of delta data within a delta data block that has not been processed by a data aggregation operation. The incremental data block structure shown in table 1a is merely illustrative and not limiting to the present application.

Name	Gender	Age	Height
A	Male	11	144
J	Female	15	150
C	Female	14	145
N	Male	12	137
E	Male	14	160
F	Female	11	135
M	Female	10	137
H	Male	16	165
I	Male	13	140
B	Male	13	150
K	Female	12	134
L	Female	11	130
G	Male	10	135
D	Female	14	143
O	Female	15	148
P	Male	13	137

TABLE 1a

As shown in table 1a, the incremental data block may include 16 pieces of incremental data. The incremental data block may include four fields for name, gender, age, and height. Before the data aggregation operation, adjacent data have no regularity, and the data distribution in the incremental data block is poor.

Assuming that the preset field is the age field, the data aggregation operation rearranges the incremental data in ascending order of the subdata under the age field. After the data aggregation operation, as shown in table 2a, adjacent pieces of data are closer in age, and the data distribution is more reasonable from the viewpoint of the age field.

TABLE 2a

And S206, optimizing the adjusted target execution plan based on the attribute information corresponding to the data table to obtain a final execution plan.

The attribute information may be used to optimize an execution plan for operating on the data table. For example, the attribute information may include whether a write operation is turned on, first configuration information, second configuration information, and the like. The first configuration information is used for indicating whether to start data aggregation operation on the incremental data block, and the second configuration information is preset field information on which the data aggregation operation depends.

Assuming that the attribute information indicates that the write operation is turned off for the data table, the write operation of the target execution plan may be disabled in S206, so that data cannot be written into the data table.

By adding the first configuration information and the second configuration information to the attribute information, the data aggregation operation added in the step S204 can be flexibly configured, and the method is suitable for more scenes.

Take the incremental data block illustrated in table 1a as an example. In scenario one, data is read from the data table according to an age screening condition. To meet this requirement, the configuration information included in the attribute information of the data table may be set to turn on the data aggregation operation, with the preset field set to age, so that the data aggregation operation is performed according to the age field. In S206, the preset field on which the data aggregation operation in the target execution plan depends may be set to age according to the attribute information of the data table, so as to meet the requirement of scenario one.

In scenario two, data is read from the data table according to height. To meet this requirement, the configuration information included in the attribute information of the data table may be set to turn on the data aggregation operation, with the preset field set to height. In S206, the preset field on which the data aggregation operation in the target execution plan depends may be set to height according to the attribute information of the data table, so as to meet the requirement of scenario two.

As another example, scenario three, no data aggregation operation needs to be performed on the data in the data table. According to the requirement, the configuration information may be set to close the data aggregation operation. In S206, the data aggregation operation included in the target execution plan may be closed according to the attribute information of the data table, so as to meet the requirement of scenario three.
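The three scenarios can be sketched as follows. This is an illustrative assumption rather than a definitive implementation: the property keys `write_enabled`, `aggregation_enabled`, and `aggregation_fields` are hypothetical names for the write switch, the first configuration information, and the second configuration information, and the plan is modeled as a list of `(operator, parameters)` pairs.

```python
from typing import Any, Dict, List, Tuple

Operator = Tuple[str, Dict[str, Any]]


def optimize_plan(plan: List[Operator], props: Dict[str, Any]) -> List[Operator]:
    """Apply table attribute information to an adjusted target plan."""
    final: List[Operator] = []
    for name, params in plan:
        if name == "write" and not props.get("write_enabled", True):
            continue  # the table has the write operation turned off
        if name == "aggregate":
            if not props.get("aggregation_enabled", False):
                continue  # scenario three: aggregation is turned off
            # Scenarios one and two: take the preset field(s) from the table.
            fields = list(props.get("aggregation_fields", []))
            params = {**params, "fields": fields}
        final.append((name, params))
    return final


adjusted = [("scan", {}), ("aggregate", {}), ("write", {"table": "t"})]
scenario_one = optimize_plan(
    adjusted, {"aggregation_enabled": True, "aggregation_fields": ["age"]})
scenario_two = optimize_plan(
    adjusted, {"aggregation_enabled": True, "aggregation_fields": ["height"]})
scenario_three = optimize_plan(adjusted, {"aggregation_enabled": False})
```

Keeping the switch and the field list in the table's attribute information means the same adjusted plan can be specialized per table without changing the SQL statement.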

And S208, operating the final execution plan.

The final execution plan may include specific operations that the engine needs to perform. By executing the final execution plan, the engine may complete the corresponding data aggregation operation and/or write operation.

According to the schemes described in S202-S208, the target execution plan including the write operation may be modified, a data aggregation operation may be added before the write operation, and a final execution plan may be obtained after the target execution plan is optimized.

The final execution plan includes a data aggregation operation before the write operation. The data aggregation operation can be used to adjust the data distribution state of the incremental data aggregated in the incremental data block to be written, and the write operation can be used to write that incremental data into the data table according to the adjusted data distribution state. Compared with the related art, the data aggregation operation can therefore be completed before the write operation for the incremental data block. This avoids the flow of first reading a written data block, performing the data aggregation operation, and then rewriting the data block, and it avoids setting up an additional scheduled task to trigger the data aggregation operation. Data aggregation is thus performed automatically, IO resource usage is reduced, and the maintenance cost of the task is lowered.

Take the incremental data block illustrated in table 1a as an example. The incremental data blocks that have not undergone data aggregation operation are written into the data table in a column storage manner according to the arrangement order of data in the blocks, so as to generate 4 data files shown in table 1b, which are file 1, file 2, file 3, and file 4, respectively. Wherein each data file stores 4 pieces of data. For convenience of explanation of the embodiment, it is assumed that one data file can store 4 pieces of data.

TABLE 1b

Assuming that data with the age of 10-12 needs to be read currently, since 4 files all contain data with the age of 10-12, the 4 files need to be traversed to find out 6 pieces of data as shown in table 1 c.

Name	Gender	Age	Height
A	Male	11	144
N	Male	12	137
F	Female	11	135
M	Female	10	137
K	Female	12	134
G	Male	10	135

TABLE 1c

Now suppose the incremental data block is processed according to the scheme described in S202-S208, and the data aggregation operation rearranges the incremental data in ascending order of the subdata under the age field.

The incremental data block shown in table 2a may be obtained after performing the data aggregation operation before writing the incremental data. Then, a write operation is performed, and the incremental data block subjected to the data aggregation operation is written into the data table in a column storage manner according to the arrangement order of the data in the block, so that the 4 data files shown in table 2b can be generated, namely, file 5, file 6, file 7, and file 8. Each data file stores 4 pieces of data. From the viewpoint of the age field, adjacent pieces of data in the data files shown in table 2b are closer in age, and the data distribution is more reasonable.

TABLE 2b

Assuming that data of age 10-12 needs to be read currently, only the first 2 files need to be traversed to find the 6 pieces of data shown in Table 1c, since only the first 2 files contain data of age 10-12. It can be seen that because data aggregation operation is performed on the incremental data block, the desired data can be obtained only by traversing the data in 2 files, so that the data traversal amount is reduced, and the data reading efficiency is improved.
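The effect on the number of files traversed can be checked with a short sketch. It uses the 16 rows of table 1a, assumes 4 rows per file as stated above, and assumes the aggregated block is sorted in ascending order of age, the ordering described for table 2a. The helper names are hypothetical, and a file is counted as traversed when it contains at least one row matching the age condition.

```python
# Rows of table 1a: (name, gender, age, height), in their original order.
ROWS = [
    ("A", "male", 11, 144), ("J", "female", 15, 150),
    ("C", "female", 14, 145), ("N", "male", 12, 137),
    ("E", "male", 14, 160), ("F", "female", 11, 135),
    ("M", "female", 10, 137), ("H", "male", 16, 165),
    ("I", "male", 13, 140), ("B", "male", 13, 150),
    ("K", "female", 12, 134), ("L", "female", 11, 130),
    ("G", "male", 10, 135), ("D", "female", 14, 143),
    ("O", "female", 15, 148), ("P", "male", 13, 137),
]
ROWS_PER_FILE = 4  # assumption taken from the example in the text


def split_into_files(rows, size=ROWS_PER_FILE):
    # Write rows to consecutive "files" in their current order.
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def files_with_matches(files, lo, hi):
    # Count the files that hold at least one row with lo <= age <= hi.
    return sum(1 for f in files if any(lo <= row[2] <= hi for row in f))


unsorted_files = split_into_files(ROWS)
sorted_files = split_into_files(sorted(ROWS, key=lambda row: row[2]))

files_before = files_with_matches(unsorted_files, 10, 12)
files_after = files_with_matches(sorted_files, 10, 12)
```

Under these assumptions `files_before` is 4 and `files_after` is 2, which matches the reduction in traversed files described above.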

It can be seen that, according to the scheme described in S202-S208, the data aggregation operation can be completed before the write operation for an incremental data block. On the one hand, this reduces IO resource usage and the maintenance cost of the task. On the other hand, it makes the data distribution in the data block more reasonable, so that reads benefit from the optimized data distribution: less data is traversed and data reading efficiency is improved.

The application provides a data processing method which is applied to a Spark engine. The method may include S202-S208. The descriptions of S202-S208 are not repeated below.

In this embodiment, the preset fields comprise at least two fields of the plurality of fields, and the data aggregation operation comprises a data mapping operation and a data sorting operation.

The data mapping operation may map sub data of at least two fields selected from the sub data of the plurality of fields included in the incremental data into target sub data of one field.

The data sorting operation may reorder the incremental data within the incremental data block according to the sorting result for the target sub-data.

It can be seen that the data aggregation operation is completed according to the subdata of at least two fields. After the data aggregation operation, the ordering of the incremental data in the incremental data block is therefore more reasonable, and so is the ordering of the data in the data files produced by the write operation. When data meeting a data screening condition is read (that is, during data skipping), the read can benefit from the optimized data distribution: less data is traversed and data reading efficiency is improved.

Referring to fig. 3, fig. 3 is a flowchart illustrating a method of data aggregation operation according to an embodiment of the present application. As shown in fig. 3, the method includes S302-S306. The present application does not specifically limit the order of execution of the steps unless specifically stated otherwise.

S302, for each piece of incremental data aggregated in the incremental data block, selecting, through the data mapping operation, the subdata of at least two fields from the subdata of the plurality of fields contained in the incremental data as target subdata, and mapping the target subdata to obtain mapping data corresponding to the incremental data.

In this step, for each piece of incremental data, mapping manners such as sub-data splicing, sub-data weighted summation, and the like may be adopted to map the sub-data of the at least two fields into the target sub-data.
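The two mapping manners named above can be sketched as follows. The fixed padding width and the weights are assumptions made only for this illustration; the text does not prescribe them.

```python
from typing import Sequence


def map_by_splicing(values: Sequence[int], width: int = 3) -> str:
    """Splice the subdata into one key by concatenating zero-padded
    decimal strings, so that string order follows field order."""
    return "".join(f"{v:0{width}d}" for v in values)


def map_by_weighted_sum(values: Sequence[int], weights: Sequence[int]) -> int:
    """Map the subdata to one number as a weighted sum."""
    return sum(w * v for w, v in zip(weights, values))


# Age 11 and height 144, taken from the first row of table 1a.
spliced = map_by_splicing([11, 144])               # '011144'
weighted = map_by_weighted_sum([11, 144], [2, 1])  # 2*11 + 1*144 = 166
```

Either result can serve as the single target value on which the subsequent data sorting operation orders the incremental data.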

In some embodiments, the mapping may be performed with reference to the z-order method. Z-order is a technique for compressing multidimensional data into one-dimensional data while largely preserving the characteristics of the original data. When the mapping is performed with the z-order method, the resulting mapping data retains the information of the subdata of the at least two fields well. Therefore, after the incremental data is reordered based on the sorting result of the mapping data, the distribution of data in the incremental data block is more reasonable, less data is traversed during reads, and data reading efficiency is improved.

Referring to fig. 4, fig. 4 is a flowchart illustrating a method of data mapping operation according to an embodiment of the present application. For each piece of incremental data aggregated in the incremental data block, the steps illustrated in fig. 4 are executed, so that mapping data corresponding to each piece of incremental data can be obtained.

The steps illustrated in fig. 4 are supplementary to S302. As shown in fig. 4, the method includes S402-S408. The present application does not specifically limit the order of execution of the steps unless specifically stated otherwise.

S402, converting the target subdata into corresponding numbers in a preset base.

The preset base can be set according to the service requirement, for example decimal or hexadecimal. For ease of computation, binary may be employed in some embodiments. The following description takes conversion into 8-bit binary numbers as an example. Take the first piece of incremental data in the incremental data block shown in table 1a as an example of the data conversion operation. Assume that the fields included in the attribute information of the data table are height and age, that is, the at least two fields (preset fields) on which the data aggregation operation depends are age and height. Through the data conversion operation, the 8-bit binary number for the age field is 00001011, and the 8-bit binary number for the height field is 10010000.
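The conversion in this example can be reproduced with a short sketch. The helper name `to_fixed_width_binary` is hypothetical, and the range check is an added assumption so that every value fits in the chosen width.

```python
def to_fixed_width_binary(value: int, bits: int = 8) -> str:
    """Convert a non-negative integer to a zero-padded binary string."""
    if not 0 <= value < 2 ** bits:
        raise ValueError(f"{value} does not fit in {bits} bits")
    return format(value, f"0{bits}b")


age_bits = to_fixed_width_binary(11)      # '00001011'
height_bits = to_fixed_width_binary(144)  # '10010000'
```

Using a fixed width keeps the digit positions of different fields aligned, which the digit-by-digit steps that follow rely on.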

After binary conversion is performed on each piece of data, an incremental data block as illustrated in table 3a can be obtained. The subdata under the age and height fields of each data in table 3a is an 8-bit binary number.

TABLE 3a

S404, obtaining the first digit of the preset-base number corresponding to each subdata in the target subdata, and arranging the obtained first digits according to a pre-specified digit arrangement order to obtain a mapping digit sequence.

The digit arrangement order can be set according to the service requirement. For example, it may be the order in which the subdata of the at least two fields is arranged in the incremental data, or the reverse of that order. The following description takes the former as an example.

Take the incremental data block illustrated in table 3a as an example. Assume that S404 is performed on the first piece of data. In S404, the first digit 0 may be taken from the binary number 00001011 corresponding to age, and the first digit 1 may be taken from the binary number 10010000 corresponding to height. The two digits are arranged in the order of age first and height second to obtain the mapping digit sequence 01.

S406, obtaining the second digit of the preset-base number corresponding to each subdata in the target subdata, and appending the obtained second digits, in the digit arrangement order, to the end of the mapping digit sequence already obtained, to obtain an updated mapping digit sequence.

Take the incremental data block illustrated in table 3a as an example. Assume that S406 is performed on the first piece of data. In S406, the second digit 0 may be taken from the binary number 00001011 corresponding to age, and the second digit 0 may be taken from the binary number 10010000 corresponding to height. The two digits are appended to the end of the mapping digit sequence 01 in the order of age first and height second, so as to obtain the mapping digit sequence 0100.

S408, repeating the above steps until the last digit of the preset-base number corresponding to each subdata in the target subdata is obtained, and appending the obtained last digits, in the digit arrangement order, to the end of the mapping digit sequence obtained after the previous arrangement, so as to finally obtain the mapping data corresponding to the incremental data.

Take the incremental data block illustrated in table 3a as an example. Assume that S408 is performed on the first piece of data. In S408, the last digit 1 may be taken from the binary number 00001011 corresponding to age, and the last digit 0 may be taken from the binary number 10010000 corresponding to height. The two digits are appended, in the order of age first and height second, to the end of the mapping digit sequence 01000001100010 obtained after the previous arrangement, yielding the 16-bit mapping data 0100000110001010 corresponding to the incremental data.

After performing S402-S408 for each piece of incremental data corresponding to table 1a, an incremental data block as shown in table 3b can be obtained. The mapping data in table 3b is obtained by mapping subdata corresponding to age and height.

Name	Gender	Age	Height	Mapping data
A	Male	00001011	10010000	0100000110001010
J	Female	00001111	10010110	0100000110111110
C	Female	00001110	10010001	0100000110101001
N	Male	00001100	10001001	0100000011100001
E	Male	00001110	10100000	0100010010101000
F	Female	00001011	10000111	0100000010011111
M	Female	00001010	10001001	0100000011001001
H	Male	00010000	10100101	0100011000010001
I	Male	00001101	10001100	0100000011110010
B	Male	00001101	10010110	0100000110110110
K	Female	00001100	10000110	0100000010110100
L	Female	00001011	10000010	0100000010001110
G	Male	00001010	10000111	0100000010011101
D	Female	00001110	10001111	0100000011111101
O	Female	00001111	10010100	0100000110111010
P	Male	00001101	10001001	0100000011100011

TABLE 3b

Through the scheme described in S402-S408, the subdata of the at least two fields to be mapped in each piece of incremental data can be mapped with reference to the z-order method to obtain sortable mapping data that retains the original information of that subdata. After the incremental data is reordered based on the sorting result of the mapping data, pieces of data whose original values are close to one another are gathered together. The data is therefore distributed more reasonably, which reduces the amount of data traversed during reads and improves data reading efficiency.
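The digit-by-digit procedure of S402-S408 can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `z_order_key` and the `width` parameter are assumptions introduced here, and the base is fixed to binary as in the 8-bit example above.

```python
def z_order_key(values, width=8):
    """Interleave the binary digits of several non-negative integers.

    `values` lists the field values in the agreed digit arrangement order
    (for the running example: age first, height second).
    """
    for value in values:
        if not 0 <= value < (1 << width):
            raise ValueError(f"{value} does not fit in {width} bits")
    # S402: convert every value to a fixed-width binary string.
    bit_strings = [format(value, f"0{width}b") for value in values]
    # S404-S408: take the i-th digit of every field in order, append it to the
    # sequence, and repeat from the first digit to the last.
    return "".join(bits[i] for i in range(width) for bits in bit_strings)


# First row of the example: age 11 (00001011), height 144 (10010000).
print(z_order_key([11, 144]))  # 0100000110001010
```

Because the most significant digits of all fields come first in the result, two rows that are close in every field tend to receive nearby keys, which is what the later sorting step relies on.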

S304, sorting the mapping data set formed by the mapping data corresponding to each incremental data through the data sorting operation to obtain a corresponding data sorting result.

The mapping data set comprises mapping data corresponding to each piece of incremental data. The sorting may refer to sorting the mapping data in the mapping data set according to a preset order. The preset order may be an ascending order or a descending order.

In some embodiments, the mapping data in the mapping data set may be sorted in an ascending order or a descending order according to the size of the mapping data corresponding to each piece of incremental data, so as to obtain a corresponding data sorting result. Therefore, the data in the mapping data set can be sorted, and then the incremental data can be sorted again. The following is an example in ascending order.

S306, adjusting the data distribution state of the aggregated incremental data in the incremental data block based on the data sorting result.

In some embodiments, the arrangement order of the corresponding incremental data in the incremental data block may be adjusted according to the arrangement order of the mapping data in the mapping data set indicated by the data sorting result.

Take the incremental data block illustrated in table 3b as an example. In S304-S306, the mapping data of each piece of incremental data may first be sorted to obtain a data sorting result. The arrangement order of the corresponding incremental data is then adjusted according to the data sorting result, yielding the incremental data block after the data aggregation operation, shown in table 3c. The subdata under the mapping data field in table 3c is intermediate data; it need not participate in the write operation or be written to a data file. The subdata under the age and height fields in table 3c has been restored to the original numbers.

TABLE 3c

Through the steps described in S302-S306, the subdata of at least two fields selected from the subdata of the multiple fields included in the incremental data may be mapped into target subdata of a single field that covers the information of the at least two fields. The incremental data in the incremental data block is then reordered according to the sorting result of the target subdata, which is equivalent to completing the data aggregation operation according to the subdata of the at least two fields. After the data aggregation operation, the incremental data in the incremental data block is ordered more reasonably, and so is the data in the data files produced by the write operation. As a result, when data satisfying a data screening condition is read (i.e., data skipping), the optimized data distribution reduces the amount of data traversed and improves data reading efficiency.
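As a rough sketch of S302-S306 taken together, the snippet below computes the mapping data for each row, sorts in ascending order, and reorders the rows. The decimal values are decoded from the 8-bit numbers in table 3b, with P's height taken as 137 as in table 2c; the names `ROWS`, `z_order_key`, and `aggregate` are illustrative assumptions.

```python
# (name, age, height), in the row order of table 3b.
ROWS = [
    ("A", 11, 144), ("J", 15, 150), ("C", 14, 145), ("N", 12, 137),
    ("E", 14, 160), ("F", 11, 135), ("M", 10, 137), ("H", 16, 165),
    ("I", 13, 140), ("B", 13, 150), ("K", 12, 134), ("L", 11, 130),
    ("G", 10, 135), ("D", 14, 143), ("O", 15, 148), ("P", 13, 137),
]


def z_order_key(age, height, width=8):
    """S302: interleave the binary digits of age and height (age first)."""
    age_bits = format(age, f"0{width}b")
    height_bits = format(height, f"0{width}b")
    return "".join(a + h for a, h in zip(age_bits, height_bits))


def aggregate(rows):
    """S304-S306: sort by mapping data (ascending) and reorder the rows.

    The mapping data is only intermediate data, so it is dropped from the result.
    """
    keyed = [(z_order_key(age, height), (name, age, height))
             for name, age, height in rows]
    keyed.sort(key=lambda pair: pair[0])
    return [row for _, row in keyed]


ordered = aggregate(ROWS)
print("".join(name for name, _, _ in ordered))  # LGFKMNPIDACBOJEH
```

Fixed-width binary strings compare the same way as the integers they encode, so sorting the strings directly is sufficient here.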

For example, taking the incremental data block shown in table 1a as an example, if the data aggregation operation is performed only with the age field as the preset field, and then the write operation is performed, 4 data files shown in table 2b can be obtained.

If data with ages of 11-13 and heights of 135-140 needs to be read from the data files shown in table 2b, the first 3 files all contain such data, so the first 3 files need to be traversed to find the 4 pieces of data shown in table 2c.

Name	Gender	Age	Height
F	Female	11	135
N	Male	12	137
I	Male	13	140
P	Male	13	137

TABLE 2c

If age and height are used as the preset fields and the data aggregation operations described in S302-S306 are performed on the incremental data block shown in table 1a, the 4 data files shown in table 3d can be obtained, namely file 9, file 10, file 11, and file 12, each storing 4 pieces of data. In the data files shown in table 3d, adjacent pieces of data are close in both age and height, so the data distribution is more reasonable than that shown in table 2b.

TABLE 3d

If data with ages of 11-13 and heights of 135-140 needs to be read from the data files shown in table 3d, only the first 2 files contain such data, so only the first 2 files need to be traversed to find the 4 pieces of data shown in table 2c. Because the incremental data block has undergone the data aggregation operation described in S302-S306, which aggregates the data according to the subdata of at least two fields, the desired data can be obtained by traversing 2 files instead of 3. The data traversal amount is reduced, and the data reading efficiency is improved.
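The comparison between the two layouts can be reproduced with a short sketch. It assumes 4 rows per file, the row order of table 3b as the input order, and a stable sort for the age-only case (the exact tie order of table 2b is not reproduced in this section, so that part is an assumption); all names are illustrative.

```python
# (name, age, height), in the row order of table 3b; P's height is 137, as in table 2c.
ROWS = [
    ("A", 11, 144), ("J", 15, 150), ("C", 14, 145), ("N", 12, 137),
    ("E", 14, 160), ("F", 11, 135), ("M", 10, 137), ("H", 16, 165),
    ("I", 13, 140), ("B", 13, 150), ("K", 12, 134), ("L", 11, 130),
    ("G", 10, 135), ("D", 14, 143), ("O", 15, 148), ("P", 13, 137),
]
ROWS_PER_FILE = 4
AGE_RANGE = (11, 13)
HEIGHT_RANGE = (135, 140)


def z_order_key(row, width=8):
    _, age, height = row
    age_bits = format(age, f"0{width}b")
    height_bits = format(height, f"0{width}b")
    return "".join(a + h for a, h in zip(age_bits, height_bits))


def split_into_files(rows, size=ROWS_PER_FILE):
    return [rows[i:i + size] for i in range(0, len(rows), size)]


def matches(row):
    _, age, height = row
    return (AGE_RANGE[0] <= age <= AGE_RANGE[1]
            and HEIGHT_RANGE[0] <= height <= HEIGHT_RANGE[1])


def files_to_traverse(files):
    """Count the files that hold at least one row satisfying the filter."""
    return sum(1 for rows in files if any(matches(row) for row in rows))


by_age = split_into_files(sorted(ROWS, key=lambda row: row[1]))  # age only
by_z_order = split_into_files(sorted(ROWS, key=z_order_key))     # age and height

print(files_to_traverse(by_age), files_to_traverse(by_z_order))  # 3 2
```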

In some embodiments, for flexible configuration of data aggregation operations, the first configuration information and the second configuration information may be added to attribute information corresponding to the data table before collecting the data.

The first configuration information indicates whether the data aggregation operation is enabled for the incremental data block. The second configuration information indicates, for the case where the data aggregation operation is enabled, at least two preset fields among the plurality of fields; based on these preset fields, the subdata of the at least two fields is screened from the subdata of the plurality of fields contained in the incremental data to serve as the target subdata to be mapped.

Therefore, in step S206, flexible configuration of the data aggregation operation can be achieved based on the first configuration information and the second configuration information, so that in S208, the data processing engine can run an execution plan according to the content indicated by the first configuration information and the second configuration information.

In some embodiments, in the process of performing the plan optimization in S206, the first configuration information and the second configuration information may be acquired.

Then, under the condition that the first configuration information indicates that data aggregation operation is started on the incremental data block, at least two preset fields in the multiple fields indicated in the second configuration information are used as at least two fields screened from the subdata of the multiple fields included in the incremental data, and a final execution plan for completing optimization is obtained.

And under the condition that the first configuration information indicates that the data aggregation operation is closed on the incremental data block, closing the data aggregation operation included in the target execution plan to obtain a final execution plan which is optimized.

Accordingly, S208 may then run the final execution plan obtained through the optimization.

In the case that the first configuration information indicates that the data aggregation operation is enabled for the incremental data block, the data processing engine may run the data aggregation operation and the write operation in the final execution plan.

In the case that the first configuration information indicates that the data aggregation operation is disabled for the incremental data block, the data processing engine may run only the write operation in the final execution plan.

Take the incremental data block illustrated in table 1a as an example. Assuming that the first configuration information indicates to start a data aggregation operation, the second configuration information indicates an age and height field. In S206, the data aggregation operation in the target execution plan may be optimized, so that in S208, in the process of performing the data aggregation operation on the incremental data in the incremental data block, mapping operations may be performed on subdata of two fields, namely, age and height, screened from subdata of multiple fields included in the incremental data to obtain mapping data, and the incremental data is reordered according to the ordering result of the mapping data. The incremental data may then be written to the data table based on the result of the reordering.

Assuming that the first configuration information indicates to close the data aggregation operation, the data aggregation operation included in the target execution plan may be deleted in S206, so that the incremental data may be directly written into the data table without performing the data aggregation operation on the incremental data block in S208.
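The effect of the two pieces of configuration information on the plan can be sketched as below. The plan is modeled as a plain list of step dictionaries, and `TableProperties`, `aggregation_enabled`, `aggregation_fields`, and `optimize_plan` are hypothetical names chosen for illustration; they are not Spark APIs.

```python
from dataclasses import dataclass


@dataclass
class TableProperties:
    aggregation_enabled: bool       # stands in for the first configuration information
    aggregation_fields: tuple = ()  # stands in for the second configuration information


def optimize_plan(plan, props):
    """Return the final plan: bind the preset fields or drop the aggregation step."""
    optimized = []
    for step in plan:
        if step["op"] != "aggregate":
            optimized.append(dict(step))
        elif props.aggregation_enabled:
            if len(props.aggregation_fields) < 2:
                raise ValueError("at least two preset fields are required")
            optimized.append({"op": "aggregate",
                              "fields": list(props.aggregation_fields)})
        # Aggregation disabled: the step is left out, so only the write remains.
    return optimized


target_plan = [{"op": "aggregate"}, {"op": "write"}]
enabled = optimize_plan(target_plan, TableProperties(True, ("age", "height")))
disabled = optimize_plan(target_plan, TableProperties(False))
print(enabled)   # [{'op': 'aggregate', 'fields': ['age', 'height']}, {'op': 'write'}]
print(disabled)  # [{'op': 'write'}]
```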

The Spark engine is a general big data processing and analyzing engine, does not provide an indexing function, and therefore needs a third-party framework to construct an index. At present, in the process of constructing indexes, the generated files need to be reread, metadata statistical information is analyzed from the read files according to an index strategy and is used as indexes, so that I/O (input/output) resources are additionally occupied, indexes corresponding to the data files cannot be automatically modified after the data files are modified, and the indexes need to be manually updated. It can be seen that the third party framework has high use and maintenance costs for the index.

The application also provides a data processing method applied to the Spark engine. The data processing protocol adopted by the Spark engine is the Parquet protocol, which addresses the difficulty of building indexes. The method may include S202-S208, whose descriptions are not repeated below.

In the method, based on the Parquet protocol, the index and the data are physically recorded in the same Parquet file. After the Parquet file is modified (for example, incremental data is written into the file or data is deleted from it), the index stored in the file can be automatically updated according to the metadata statistical information corresponding to the aggregated data in the file. The index is therefore easier to maintain, and the index does not become inconsistent with the aggregated data in the file. In addition, the metadata statistical information used as the index can be selected as required from several types of metadata statistical information, which achieves a lightweight index.

In some embodiments, the data processing protocol adopted by the Spark engine is the Parquet protocol. The data table comprises at least one Parquet file; the Parquet file includes column storage spaces corresponding to the plurality of fields, respectively.

During the write operation, the incremental data can be written into the at least one Parquet file according to the adjusted arrangement order, with the subdata of each field contained in the incremental data written into the column storage space corresponding to that field in the Parquet file. This realizes a write operation method that combines the Spark engine with the Parquet protocol, so that the incremental data is stored in the database in columnar form.

Each Parquet file carries a limited amount of data, for example 4 pieces of data. Each Parquet file may include column storage spaces corresponding to the plurality of fields, respectively. Take the data block shown in table 3c as an example. The subdata of the plurality of fields of the incremental data can be written, from top to bottom, into the corresponding column storage spaces of a Parquet file; each time 4 pieces of data have been written, writing continues in the next Parquet file, until all the data in the incremental data block has been written into Parquet files by column. The result of the write operation may be as shown in table 3d.
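The column-wise write with a fixed number of rows per file can be illustrated with plain Python structures. This sketch only models the layout (one list per column storage space); it does not produce real Parquet files, and `write_columnar` and its parameters are hypothetical names. The rows are the first 8 rows of the aggregated block, in ascending order of the mapping data of table 3b.

```python
FIELDS = ("name", "gender", "age", "height")

# First 8 rows of the aggregated block, in ascending order of mapping data.
ROWS = [
    ("L", "Female", 11, 130), ("G", "Male", 10, 135),
    ("F", "Female", 11, 135), ("K", "Female", 12, 134),
    ("M", "Female", 10, 137), ("N", "Male", 12, 137),
    ("P", "Male", 13, 137), ("I", "Male", 13, 140),
]


def write_columnar(rows, fields=FIELDS, rows_per_file=4):
    """Split rows into files of `rows_per_file` rows, storing each field as a column."""
    files = []
    for start in range(0, len(rows), rows_per_file):
        chunk = rows[start:start + rows_per_file]
        # One list per field: the "column storage space" of that field.
        files.append({field: [row[i] for row in chunk]
                      for i, field in enumerate(fields)})
    return files


files = write_columnar(ROWS)
print(len(files))          # 2
print(files[0]["age"])     # [11, 10, 11, 12]
print(files[1]["height"])  # [137, 137, 137, 140]
```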

In some embodiments, after the write operation of the incremental data block is completed, the index corresponding to each column of the Parquet file can be automatically updated based on the Parquet protocol, so that the index is easier to maintain and does not become inconsistent with the aggregated data in the file.

Referring to fig. 5, fig. 5 is a flowchart illustrating a method for updating an index according to an embodiment of the present application. As shown in fig. 5, the method includes S502-S504. The present application does not specifically limit the order of execution of the steps unless specifically stated otherwise.

S502, for each Parquet file, updating the metadata statistical information corresponding to each column storage space according to the data stored in that column storage space, and storing the metadata statistical information in the same Parquet file as the corresponding column storage space.

In the Parquet protocol, the metadata statistics may be automatically updated. The metadata statistical information describes the data in the column storage space, which facilitates data query. In some embodiments, the metadata statistics include at least one of: the minimum value of the data in the column storage space; the maximum value of the data in the column storage space; the set of values remaining after deduplicating the data in the column storage space; a bloom filter.

The metadata statistics thus describe the minimum value, the maximum value, the deduplicated values, and the bloom filter values of the data in the column storage space. Metadata statistics may be used as an index. From these values it can be determined whether the column storage space may contain data that meets the screening requirements.

S504, extracting the metadata statistical information based on a preset metadata type to obtain an index corresponding to each column storage space in the Parquet file.

The preset metadata type can be set according to business requirements. For example, in a scenario where a data range is used as the data filtering condition, the preset metadata type may be the minimum value of the data in the column storage space. The minimum value is then selected from the metadata statistical information corresponding to each column storage space of the Parquet file as the corresponding index. It is understood that the subdata contained in the incremental data may sometimes be non-numerical data such as Chinese characters. In this case, when the metadata statistical information is maintained, the Chinese characters may first be converted into numbers in the preset base (e.g., binary numbers), and the metadata statistical information is then maintained on the converted numbers.

Taking the files illustrated in table 3d as an example, the minimum value of the data stored in each column storage space of files 9 to 12 can be used as the index of that column storage space. The index corresponding to the age storage space in file 9 is 10, and the index corresponding to the height storage space is 130. The index corresponding to the age storage space in file 10 is 10, and the index corresponding to the height storage space is 137. The index corresponding to the age storage space in file 11 is 11, and the index corresponding to the height storage space is 143. The index corresponding to the age storage space in file 12 is 14, and the index corresponding to the height storage space is 148. For simplicity, the indexes corresponding to the name and gender storage spaces are not described in the present application.
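S502-S504 can be sketched as computing per-column statistics and then extracting one preset type as the index. The file contents below are an assumption: they are reconstructed by sorting the rows of table 3b by mapping data and splitting them into files of 4 rows, which reproduces the index values stated above. The bloom filter is omitted for brevity, and all names are illustrative.

```python
# file number -> column storage spaces (age and height only)
FILES = {
    9: {"age": [11, 10, 11, 12], "height": [130, 135, 135, 134]},
    10: {"age": [10, 12, 13, 13], "height": [137, 137, 137, 140]},
    11: {"age": [14, 11, 14, 13], "height": [143, 144, 145, 150]},
    12: {"age": [15, 15, 14, 16], "height": [148, 150, 160, 165]},
}


def column_statistics(values):
    """S502: metadata statistics kept for one column storage space."""
    return {"min": min(values), "max": max(values), "distinct": set(values)}


def build_index(files, metadata_type="min"):
    """S504: extract one preset metadata type per column as the index."""
    return {
        number: {field: column_statistics(values)[metadata_type]
                 for field, values in columns.items()}
        for number, columns in files.items()
    }


index = build_index(FILES)
print(index[9])   # {'age': 10, 'height': 130}
print(index[12])  # {'age': 14, 'height': 148}
```

Passing `metadata_type="max"` instead selects the maximum value, which corresponds to choosing a different preset metadata type.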

In some embodiments, according to the index automatically updated under the Parquet protocol, candidate Parquet files that may contain data satisfying the data screening condition are screened first, and the data satisfying the data screening condition is then read from the candidate Parquet files. It is therefore unnecessary to check every piece of data against the data screening condition, which improves data reading efficiency. In addition, because the data stored in the Parquet files has undergone the data aggregation operation, it is distributed more reasonably; compared with the case where no data aggregation operation is performed, fewer candidate Parquet files are selected through the indexes, which reduces the data traversal amount and improves data reading efficiency.

Referring to fig. 6, fig. 6 is a flowchart illustrating a method for reading data according to an embodiment of the present disclosure. As shown in fig. 6, the method includes S602-S606. The present application does not specifically limit the order of execution of these steps unless specifically stated otherwise.

S602, responding to a data reading request, and acquiring a data screening condition aiming at least one target field included in the data reading request.

The data reading request may be a request sent by the service party to the server through the client when the service party needs to read data. The target field can be set according to business requirements. The data screening condition can be set according to the data type stored in the target field. For example, the target field stores numerical data, and the data filtering condition may be a numerical range. For another example, the target field stores Chinese character type data, and the data filtering condition may be a target Chinese character.

Take the example of reading data from the file shown in table 3 d. The target field may be age and height, the first filtering condition corresponding to age is 11-13, and the second filtering condition corresponding to height is 135-140.

S604, determining candidate Parquet files in the data table according to the data screening conditions and the indexes.

A candidate Parquet file is a file that may contain data meeting the data screening condition.

Take the example of reading data from the file illustrated in table 3 d. The index corresponding to the column memory space is the minimum value stored in the space. The target field may be age and height, the first filtering condition corresponding to age is 11-13, and the second filtering condition corresponding to height is 135-140.

For each file, if the index corresponding to the age storage space of the file is less than or equal to the upper age boundary value 13, the age storage space may contain data satisfying the data filtering condition. If the index is greater than the age boundary value 13, the age storage space cannot contain data satisfying the data filtering condition.

Similarly, if the index corresponding to the height storage space is less than or equal to the upper height boundary value 140, the height storage space may contain data satisfying the data filtering condition. If the index is greater than the height boundary value 140, the height storage space cannot contain data satisfying the data filtering condition.

Based on this principle, a rule for comparing the indexes corresponding to the age storage space and the height storage space with the data screening conditions can be set, so as to complete the screening of the candidate Parquet files. Among the files shown in table 3d, file 9 and file 10 are the candidate Parquet files.

S606, reading the data meeting the data screening condition from the candidate Parquet files.

In this step, each piece of data in the candidate Parquet files may be traversed to obtain the data that satisfies the data screening condition.

Take the example of reading data from the file shown in table 3 d. One piece of data satisfying the data filtering condition can be read from the file 9 and three pieces of data can be read from the file 10.

With the steps described in S602-S606, it is unnecessary to check every piece of data against the data screening condition, which improves data reading efficiency. In addition, because the data stored in the Parquet files has undergone the data aggregation operation, it is distributed more reasonably; compared with the case where no data aggregation operation is performed, fewer candidate Parquet files are selected through the indexes, which reduces the data traversal amount and improves data reading efficiency.
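The read path of S602-S606 can be sketched as a two-stage filter: prune files by their minimum-value index, then scan only the surviving files. As before, the file contents are an assumption reconstructed from sorting table 3b by mapping data, and the names `candidate_files` and `read` are illustrative.

```python
# file number -> rows (name, age, height)
FILES = {
    9: [("L", 11, 130), ("G", 10, 135), ("F", 11, 135), ("K", 12, 134)],
    10: [("M", 10, 137), ("N", 12, 137), ("P", 13, 137), ("I", 13, 140)],
    11: [("D", 14, 143), ("A", 11, 144), ("C", 14, 145), ("B", 13, 150)],
    12: [("O", 15, 148), ("J", 15, 150), ("E", 14, 160), ("H", 16, 165)],
}
COLUMN = {"age": 1, "height": 2}  # position of each target field in a row

# Per-file, per-column minimum value, used as the index.
INDEX = {
    number: {field: min(row[pos] for row in rows) for field, pos in COLUMN.items()}
    for number, rows in FILES.items()
}


def candidate_files(index, conditions):
    """S604: keep files whose column minimum does not exceed the upper bound."""
    return [
        number for number, minima in index.items()
        if all(minima[field] <= upper for field, (_, upper) in conditions.items())
    ]


def read(files, index, conditions):
    """S606: traverse only the candidate files and return the matching rows."""
    result = []
    for number in candidate_files(index, conditions):
        for row in files[number]:
            if all(low <= row[COLUMN[field]] <= high
                   for field, (low, high) in conditions.items()):
                result.append(row)
    return result


conditions = {"age": (11, 13), "height": (135, 140)}  # S602
print(candidate_files(INDEX, conditions))                  # [9, 10]
print([row[0] for row in read(FILES, INDEX, conditions)])  # ['F', 'N', 'P', 'I']
```

With only a minimum-value index, a file can be excluded when its minimum exceeds the upper bound of the range; also keeping the maximum value would allow pruning against the lower bound as well.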

The following description will be made with reference to an application scenario of fig. 1. It should be noted that the foregoing application scenarios are merely illustrative for the convenience of understanding the spirit and principles of the present application and that the embodiments of the present application are not limited in this respect. Rather, embodiments of the present application may be applied to any scenario where applicable. The following description will be given taking a scenario of counting height and age data as an example.

In the scenario shown in fig. 1, the client device may send Spark SQL statements to the server device 102 to perform processing on the database. The server device 102 may be equipped with a Spark data processing engine (hereinafter referred to as the Spark engine). The Spark engine may employ the Parquet protocol. The database 103 may include data table 1 and data table 2. The attribute information corresponding to each data table comprises first configuration information and second configuration information for the data aggregation operation.

The attribute information of data table 1 includes first configuration information with an identifier 0, indicating that the data aggregation operation is enabled, and second configuration information including the age and height fields. The attribute information of data table 2 includes first configuration information with an identifier 1, indicating that the data aggregation operation is disabled.

Referring to fig. 7, fig. 7 is a schematic flow chart illustrating a write operation method according to an embodiment of the present application. The execution subject of the method illustrated in fig. 7 is a Spark engine. As shown in fig. 7, the illustrated method may include S701-S706. The present application does not specifically limit the order of execution of these steps unless specifically stated otherwise.

The description will be given taking as an example that incremental data blocks (hereinafter simply referred to as incremental blocks) shown in table 1a are written in data table 1 and data table 2, respectively. Assume that the statement that writes the incremental block to data table 1 is SQL1 and the statement that writes to data table 2 is SQL 2.

S701, performing first analysis on the received Spark SQL to obtain a first analysis result.

The first parsing may be understood as a coarse-grained syntax parsing of the Spark SQL statement, identifying the operation type of the statement.

In this step, coarse-grained information can be obtained, for example that SQL1 needs to write the incremental block into data table 1 and that SQL2 needs to write the incremental block into data table 2.

And S702, performing second analysis on the basis of the first analysis result to obtain an execution plan corresponding to Spark SQL.

The second parsing may be understood as detailed semantic parsing, and may identify a specific operation corresponding to the operation type.

In this step, it can be analyzed that SQL1 needs to write the sub-data in the four fields of name, gender, age, and height included in the incremental block into the four column storage spaces of name, gender, age, and height included in data table 1, so as to obtain execution plan 1 corresponding to SQL 1.

SQL2 needs to write the sub-data in the four fields of name, gender, age, and height included in the incremental block into the four column storage spaces of name, gender, age, and height included in data table 2, so as to obtain execution plan 2 corresponding to SQL 2.

S703, screening out a target execution plan containing the write operation from the analyzed execution plans, and adding a data aggregation operation before the write operation.

The data aggregation operation comprises a data mapping operation and a data sorting operation. The data mapping operation references the z-order algorithm. The data sorting operation is an ascending sort.

In this step, the execution plan 1 and the execution plan 2 containing the write operation may be obtained as the target execution plan 1 and the target execution plan 2 from the execution plans parsed in S701 to S702, and the data aggregation operation may be added before the write operation of the two target execution plans.
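The screening and rewriting in S703 can be sketched on a toy plan representation. Here a plan is simply a list of step names; the step names and the function `add_aggregation_before_write` are hypothetical and do not correspond to Spark's internal plan classes.

```python
def add_aggregation_before_write(plan):
    """Return an adjusted copy of `plan` with 'aggregate' placed before each 'write'.

    Plans without a write operation are not target execution plans,
    so None is returned for them.
    """
    if "write" not in plan:
        return None
    adjusted = []
    for step in plan:
        if step == "write":
            adjusted.append("aggregate")  # mapping + sorting happen before the write
        adjusted.append(step)
    return adjusted


print(add_aggregation_before_write(["scan", "project", "write"]))
# ['scan', 'project', 'aggregate', 'write']
print(add_aggregation_before_write(["scan", "project"]))
# None
```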

S704, optimizing the target execution plan 1 based on the attribute information of the data table 1, and optimizing the target execution plan 2 based on the attribute information of the data table 2.

In this step, the optimization may be performed according to the first configuration information and the second configuration information included in the attribute information.

For the target execution plan 1, according to the first configuration information and the second configuration information included in the attribute information of data table 1, the data aggregation operation included in the target execution plan 1 may be optimized so that the subdata of the two fields age and height is screened from the subdata of the plurality of fields included in the incremental data to serve as the target subdata for mapping.

For the target execution plan 2, the data aggregation operation included in the target execution plan 2 may be deleted according to the first configuration information included in the attribute information of data table 2.

S705, respectively converting the optimized target execution plan 1 and target execution plan 2 to obtain a final execution plan 1 and a final execution plan 2 executable by the data processing engine.

The execution plan may include a logical execution plan and a physical execution plan. The logic execution plan may be understood as the execution plan in S701-S703, which facilitates editing and optimization. The physical execution plan may be understood as the final execution plan, which is a specific operation step that the engine can execute.

S706, the data aggregation operation and the write operation included in the final execution plan 1 are executed, and the write operation included in the final execution plan 2 is executed.

For the data aggregation operation in final execution plan 1, reference may be made to the descriptions of S302-S306, where S302 may include S402-S408, and based on the increment blocks illustrated in table 1a, the increment blocks illustrated in table 3c may be obtained.

For the write operation in the final execution plan 1, reference may be made to the write operation method combining the Spark engine and the partial protocol and to the index updating method illustrated in S502-S504. Based on the increment block illustrated in table 3c, the partial files 9-12 illustrated in table 3d are obtained. The index corresponding to the age storage space in file 9 is 10, and the index corresponding to the height storage space is 130. The index corresponding to the age storage space in file 10 is 10, and the index corresponding to the height storage space is 137. The index corresponding to the age storage space in file 11 is 11, and the index corresponding to the height storage space is 143. The index corresponding to the age storage space in file 12 is 14, and the index corresponding to the height storage space is 148. For simplicity of the embodiment, the indexes corresponding to name and gender are not described in the present application.

For the write operation in the final execution plan 2, reference may be made to the write operation method combining the Spark engine and the partial protocol and to the index updating method illustrated in S502-S504. Based on the increment block illustrated in table 1a, the partial files 1-4 illustrated in table 1b are obtained. The index corresponding to the age storage space in file 1 is 11, and the index corresponding to the height storage space is 137. The index corresponding to the age storage space in file 2 is 10, and the index corresponding to the height storage space is 135. The index corresponding to the age storage space in file 3 is 11, and the index corresponding to the height storage space is 130. The index corresponding to the age storage space in file 4 is 10, and the index corresponding to the height storage space is 135.

With the write operation method illustrated in S701-S706, first, based on the partial protocol, after the incremental data is written into the partial file (i.e., after the partial file is modified), an index can be generated automatically from the metadata statistical information corresponding to the data stored in the partial file, and multiple types of metadata statistical information can serve as the index. The index is therefore easier to maintain, and the problem of the index being inconsistent with the data in the file does not occur.

Secondly, the data aggregation operation can be completed before the write operation on the incremental data block. On the one hand, this reduces the occupation of IO resources and the maintenance cost of tasks; on the other hand, it makes the data distribution within the data block more reasonable, so that reads benefit from the optimized data distribution: the amount of data traversed is reduced and data reading efficiency is improved.

Thirdly, with reference to the z-order method, data mapping can be performed on the subdata of at least two fields of each piece of incremental data to obtain corresponding mapping data, which preserves the original information of the subdata of the at least two fields well. After the incremental data is reordered based on the sorting result of the mapping data, data whose original information is similar is gathered together, so the ordering of the incremental data in the incremental data block is more reasonable, and so is the ordering of the data in the data file obtained through the write operation. Consequently, when the data meeting the data screening condition is read (i.e., data skipping), the read benefits from the optimized data distribution, the amount of data traversed is reduced, and data reading efficiency is improved.

Fourthly, adding the first configuration information and the second configuration information to the attribute information corresponding to the data table enables flexible configuration of the data aggregation operation and adapts it to various service requirements.

Referring to fig. 8, fig. 8 is a schematic flowchart illustrating a read operation method according to an embodiment of the present application. The execution subject of the method illustrated in fig. 8 is a Spark engine. As shown in fig. 8, the method may include S801-S803. The present application does not specifically limit the order of execution of the steps unless specifically stated otherwise.

The following description takes as an example reading, from data tables 1 and 2, the data with ages of 11-13 and heights of 135-140.

This example omits the processing of the SQL statement for the read operation; for the related method, refer to S701-S705. In this example, the final execution plan for reading the data with ages of 11-13 and heights of 135-140 from the data table 1 is referred to as plan 3, and the final execution plan for reading the data with ages of 11-13 and heights of 135-140 from the data table 2 is referred to as plan 4.

S801, responding to a data reading request, and acquiring a data screening condition aiming at least one target field included in the data reading request.

In this step, the first condition, for age, is an age of 11-13, and the second condition, for height, is a height of 135-140.

S802, determining candidate partial files in the data table according to the data screening conditions and indexes of the partial files in the data table 1 and the data table 2.

In this step, for each of the files 9-12 included in the data table 1, it is determined whether the index corresponding to the age storage space is smaller than the upper age boundary value in the first condition and whether the index corresponding to the height storage space is smaller than the upper height boundary value in the second condition. Partial files 9 and 10 are then selected as candidates, because for each of them the index corresponding to the age storage space is smaller than the upper age boundary value and the index corresponding to the height storage space is smaller than the upper height boundary value.

The screening steps described above may be performed for files 1-4 included in data table 2 to obtain candidate partial files 1-4.
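The screening in S802 can be reproduced with the index values quoted in the write-operation example (files 9-12 for data table 1, files 1-4 for data table 2). The sketch below is illustrative only: the function name `candidate_files` and the dictionary layout are assumptions, and the comparison is the strict "smaller than the upper boundary value" test described in this step.

```python
from typing import Dict, List, Tuple

# file number -> (age index, height index), as quoted in the text
TABLE_1_INDEX: Dict[int, Tuple[int, int]] = {
    9: (10, 130), 10: (10, 137), 11: (11, 143), 12: (14, 148),
}
TABLE_2_INDEX: Dict[int, Tuple[int, int]] = {
    1: (11, 137), 2: (10, 135), 3: (11, 130), 4: (10, 135),
}


def candidate_files(index: Dict[int, Tuple[int, int]],
                    age_upper: int, height_upper: int) -> List[int]:
    """Keep the files whose age and height indexes are both below the upper bounds."""
    return [
        file_no
        for file_no, (age_idx, height_idx) in sorted(index.items())
        if age_idx < age_upper and height_idx < height_upper
    ]


# Screening condition: age 11-13 and height 135-140.
table_1_candidates = candidate_files(TABLE_1_INDEX, age_upper=13, height_upper=140)
table_2_candidates = candidate_files(TABLE_2_INDEX, age_upper=13, height_upper=140)
print(table_1_candidates)  # [9, 10]
print(table_2_candidates)  # [1, 2, 3, 4]
```

Only two of the four files of data table 1 remain candidates, whereas all four files of data table 2, which was written without the data aggregation operation, must be read.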

S803, reading data meeting the data screening condition from the candidate partial files.

In this step, for the read operation in plan 3, data satisfying the data screening condition needs to be read from the candidate partial files 9 and 10. For the read operation in plan 4, data meeting the data screening condition needs to be read from the candidate partial files 1 to 4.

Therefore, after the data aggregation operation based on the z-order mapping method, the ordering of the incremental data in the incremental data blocks is more reasonable, and so is the ordering of the data in the partial files of the data table 1 obtained through the write operation. When the data meeting the data screening condition is read (i.e., data skipping), the read benefits from the optimized data distribution: only two partial files need to be read, the amount of data traversed is reduced, and data reading efficiency is improved.

Exemplary devices

Having described the method of the exemplary embodiments of the present application, a data processing apparatus of the exemplary disclosure of the present application is described next with reference to fig. 9. The data processing apparatus can be applied to a data processing engine to implement the data processing method shown in any one of the foregoing embodiments.

Referring to fig. 9, fig. 9 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present disclosure.

As shown in fig. 9, a data processing apparatus 900 (hereinafter, referred to as apparatus 900) may include:

the screening module 910 is configured to analyze the received SQL statement to obtain an execution plan, and screen the execution plan to obtain a target execution plan including the write operation;

an adding module 920, configured to add a data aggregation operation before the write operation to obtain an adjusted target execution plan; the data aggregation operation is used for adjusting the data distribution state of the aggregated incremental data in the incremental data block to be written, and the write operation is used for writing the aggregated incremental data in the incremental data block into a data table according to the adjusted data distribution state;

an optimizing module 930, configured to optimize the adjusted target execution plan based on the attribute information corresponding to the data table, to obtain a final execution plan;

an operation module 940, configured to run the final execution plan.
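The first two modules can be illustrated with a toy plan representation. This is a sketch under assumptions, not the Spark implementation: a plan is modeled as an ordered list of operation names, SQL parsing is replaced by a hard-coded stand-in, and `select_write_plans` and `add_aggregation` are hypothetical names. Optimization and running are left out.

```python
from typing import List

Plan = List[str]  # a plan is modeled as an ordered list of operation names


def select_write_plans(plans: List[Plan]) -> List[Plan]:
    """Screening: keep only the plans that contain a write operation."""
    return [plan for plan in plans if "write" in plan]


def add_aggregation(plan: Plan) -> Plan:
    """Adding: place a data aggregation step immediately before the write operation."""
    position = plan.index("write")
    return plan[:position] + ["aggregate"] + plan[position:]


# Stand-in for the execution plans obtained by analyzing SQL statements.
parsed_plans = [["scan", "project", "write"], ["scan", "filter"]]
target_plans = select_write_plans(parsed_plans)
adjusted_plans = [add_aggregation(plan) for plan in target_plans]
```

The read-only plan is filtered out, and the remaining plan gains an `aggregate` step directly ahead of `write`.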

In some embodiments, the data processing engine comprises a Spark engine; the data processing protocol adopted by the Spark engine is a queue protocol.

In some embodiments, each piece of incremental data aggregated in the incremental data block is composed of subdata corresponding to a plurality of fields; the data aggregation operation comprises a data mapping operation and a data sorting operation;

adjusting, by the data aggregation operation, the data distribution state of the aggregated incremental data in the incremental data block to be written comprises the following steps:

for each piece of incremental data aggregated in the incremental data block, screening subdata of at least two fields from subdata of a plurality of fields contained in the incremental data through the data mapping operation to serve as target subdata for mapping, and obtaining mapping data corresponding to the incremental data;

sorting, through the data sorting operation, the mapping data set formed by the mapping data corresponding to each piece of incremental data to obtain a corresponding data sorting result;

adjusting the data distribution state of the aggregated incremental data within the incremental data block based on the data sorting result.

In some embodiments, the screening, by the data mapping operation, of subdata of at least two fields from the subdata of the plurality of fields included in the incremental data as target subdata for mapping, so as to obtain mapping data corresponding to each piece of the incremental data, includes:

for each piece of incremental data aggregated in the incremental data block, executing the following steps to obtain mapping data corresponding to each piece of incremental data:

converting each subdata in the target subdata into a corresponding number in a preset base;

acquiring the first digit of the preset-base number corresponding to each subdata in the target subdata, and arranging the acquired first digits according to a preassigned digit arrangement order to obtain a mapping digit sequence;

then acquiring the second digit of the preset-base number corresponding to each subdata in the target subdata, and appending the acquired second digits, in the same digit arrangement order, to the end of the mapping digit sequence to obtain an updated mapping digit sequence;

and repeating the above steps until the last digit of the preset-base number corresponding to each subdata in the target subdata has been acquired and appended, in the digit arrangement order, to the end of the mapping digit sequence obtained after the previous arrangement, so as to finally obtain the mapping data corresponding to the incremental data.
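The digit-by-digit arrangement described above can be sketched as follows. The sketch assumes that the preset base is binary, that every value is padded to a fixed width so that all numbers have the same number of digits, and that the digit arrangement order is simply the order of the fields passed in; the function name `z_order_value` is hypothetical.

```python
from typing import Sequence


def z_order_value(values: Sequence[int], width: int = 8) -> str:
    """Interleave the binary digits of `values`, most significant digit first."""
    if any(v < 0 or v >= (1 << width) for v in values):
        raise ValueError("every value must fit in `width` binary digits")
    # Convert each subdata to a fixed-width binary number (assumed preset base).
    digits = [format(v, f"0{width}b") for v in values]
    # Take digit 1 of every field, then digit 2 of every field, and so on.
    return "".join(d[i] for i in range(width) for d in digits)


# Age 11 is 00001011 and height 137 is 10001001 in 8-bit binary.
mapped = z_order_value([11, 137])
print(mapped)  # 0100000011001011
```

Because the high-order digits of every field come first in the result, records that are close in all mapped fields tend to receive nearby mapping values.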

In some embodiments, the sorting, by the data sorting operation, the mapping data set formed by the mapping data corresponding to each piece of incremental data to obtain a corresponding data sorting result includes:

and according to the size of the mapping data corresponding to each incremental data, performing ascending order or descending order on the mapping data in the mapping data set to obtain a corresponding data sorting result.

In some embodiments, said adjusting the data distribution state of the aggregated incremental data within the incremental data block based on the data sorting result comprises:

and adjusting the arrangement sequence of the corresponding incremental data in the incremental data block according to the arrangement sequence of the mapping data in the mapping data set indicated by the data sorting result.
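Sorting by the mapping data and then rearranging the incremental data accordingly can be sketched as below. The rows are made-up illustrative values, not the contents of any table in this application, and the helper names `interleave` and `reorder`, the 8-bit width and the binary base are assumptions.

```python
from typing import Dict, List


def interleave(age: int, height: int, width: int = 8) -> int:
    """Mapping data for one row: age and height bits interleaved, age digit first."""
    a, h = format(age, f"0{width}b"), format(height, f"0{width}b")
    return int("".join(x + y for x, y in zip(a, h)), 2)


def reorder(rows: List[Dict[str, int]], descending: bool = False) -> List[Dict[str, int]]:
    """Arrange the rows in the order of their mapping data (ascending by default)."""
    return sorted(rows, key=lambda r: interleave(r["age"], r["height"]), reverse=descending)


rows = [
    {"age": 14, "height": 148},
    {"age": 10, "height": 130},
    {"age": 12, "height": 120},
    {"age": 11, "height": 143},
    {"age": 10, "height": 137},
]
ordered = reorder(rows)
```

In this example the row with age 12 and height 120 moves to the front even though its age is not the smallest, because the ordering depends on both fields rather than on age alone.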

In some embodiments, the attribute information corresponding to the data table includes first configuration information and second configuration information;

In some embodiments, the operation module 940 is specifically configured to:

under the condition that the first configuration information indicates that a data aggregation operation is started on the incremental data block, executing the data aggregation operation and a write operation in the final execution plan;

In some embodiments, the data table comprises at least one partial file; the partial file comprises column storage spaces respectively corresponding to the fields;

the write operation is used to write the aggregated incremental data in the incremental data block into a data table according to the adjusted data distribution state, and includes:

writing the incremental data into the at least one partial file according to the adjusted arrangement sequence; and writing the subdata of each field contained in the incremental data into a column storage space corresponding to the field in the partial file.
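A minimal sketch of this column-wise write is given below. It only models the layout in memory: a file is represented as a dictionary from field name to a list (the column storage space), and the function name `write_columnar` and the `rows_per_file` parameter are assumptions introduced for illustration.

```python
from typing import Dict, List

Row = Dict[str, object]
ColumnFile = Dict[str, List[object]]  # field name -> column storage space


def write_columnar(rows: List[Row], rows_per_file: int) -> List[ColumnFile]:
    """Split already-ordered rows into files and store each field as its own column."""
    files: List[ColumnFile] = []
    for start in range(0, len(rows), rows_per_file):
        chunk = rows[start:start + rows_per_file]
        # Row order inside each column follows the adjusted arrangement order.
        files.append({field: [row[field] for row in chunk] for field in chunk[0]})
    return files


ordered_rows = [
    {"age": 10, "height": 130},
    {"age": 10, "height": 137},
    {"age": 11, "height": 143},
]
files = write_columnar(ordered_rows, rows_per_file=2)
```

Each resulting file holds one list per field, so the subdata of a field ends up in the column storage space for that field.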

In some embodiments, the apparatus 900 further comprises:

the updating module is used for, after the write operation on the incremental data block is finished, updating, for each partial file, the metadata statistical information corresponding to each column storage space according to the data stored in that column storage space of the partial file, and storing the metadata statistical information in the same partial file as the corresponding column storage space;

and extracting the metadata statistical information based on a preset metadata type to obtain an index corresponding to each column storage space in the partial file.

In some embodiments, the apparatus 900 further comprises:

the reading module is used for responding to a data reading request and acquiring a data screening condition aiming at least one target field included in the data reading request;

determining candidate partial files in the data table according to the data screening conditions and the indexes;

and reading data meeting the data screening condition in the candidate partial file.

In some embodiments, the metadata statistics include at least one of:

a minimum value in the data within the column storage space;

a maximum value in the data within the column storage space;

a set of the data remaining after duplicates are removed from the data within the column storage space;

a bloom filter.
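The four kinds of metadata statistical information can be computed for one column storage space as sketched below. The bloom filter here is a deliberately small toy (64 bits, three SHA-256-derived positions per value); those parameters and all function names are assumptions, not details from this application.

```python
import hashlib
from typing import Any, Dict, Iterable, List

BLOOM_SIZE = 64    # number of bits in the toy filter (assumed)
BLOOM_HASHES = 3   # bit positions set per value (assumed)


def _bit_positions(value: Any) -> List[int]:
    """Derive BLOOM_HASHES bit positions for a value from seeded SHA-256 digests."""
    return [
        int.from_bytes(hashlib.sha256(f"{seed}:{value}".encode()).digest()[:8], "big") % BLOOM_SIZE
        for seed in range(BLOOM_HASHES)
    ]


def build_bloom(values: Iterable[Any]) -> int:
    """Return the filter as an integer bit mask with one bit set per position."""
    bits = 0
    for value in values:
        for pos in _bit_positions(value):
            bits |= 1 << pos
    return bits


def might_contain(bits: int, value: Any) -> bool:
    """False means the value is definitely absent; True means it may be present."""
    return all((bits >> pos) & 1 for pos in _bit_positions(value))


def column_statistics(values: List[Any]) -> Dict[str, Any]:
    """Compute minimum, maximum, deduplicated set and bloom filter for one column."""
    return {
        "min": min(values),
        "max": max(values),
        "distinct": set(values),
        "bloom": build_bloom(values),
    }


stats = column_statistics([10, 11, 10, 13])
```

Minimum and maximum values support range conditions, while the deduplicated set and the bloom filter support equality tests; a bloom filter can report false positives but never false negatives.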

In the foregoing solution, first, based on a partial protocol, after the incremental data is written into the partial file (that is, after the partial file is modified), an index may be automatically generated according to metadata statistical information corresponding to data aggregated in the partial file, and multiple types of metadata statistical information may be supported as the index, so that the index is easier to maintain, and a problem that the index is inconsistent with the aggregated data in the file does not occur.

Secondly, the data aggregation operation can be completed before the write operation on the incremental data block. On the one hand, this reduces the occupation of IO resources and the maintenance cost of tasks; on the other hand, it makes the data distribution within the data block more reasonable, so that reads benefit from the optimized data distribution: the amount of data traversed is reduced and data reading efficiency is improved.

Thirdly, with reference to the z-order method, data mapping can be performed on the subdata of at least two fields of each piece of incremental data to obtain corresponding mapping data, which preserves the original information of the subdata of the at least two fields well. After the incremental data is reordered based on the sorting result of the mapping data, data whose original information is similar is gathered together, so the ordering of the incremental data in the incremental data block is more reasonable, and so is the ordering of the data in the data file obtained through the write operation. Consequently, when the data meeting the data screening condition is read (i.e., data skipping), the read benefits from the optimized data distribution, the amount of data traversed is reduced, and data reading efficiency is improved.

And fourthly, adding the first configuration information and the second configuration information in the attribute information corresponding to the data table, realizing flexible configuration of the data aggregation operation, and adapting to various service requirements.

Exemplary Medium

Having described the method and apparatus of the exemplary embodiments of the present application, a readable storage medium of the exemplary disclosure of the present application is described next with reference to fig. 10. The storage medium stores a computer program for causing a processor to execute the data processing method as in any one of the preceding embodiments.

Referring to fig. 10, fig. 10 is a diagram illustrating a program product 1000 applied to a data processing method according to an embodiment of the present application.

In some embodiments, the aforementioned data processing method may be implemented by a program product 1000, which may, for example, take the form of a portable compact disc read-only memory (CD-ROM), include program code, and run on a device such as a personal computer. However, the program product of the present application is not limited thereto; in this document, a readable storage medium may be any tangible medium that can contain or store a program for use by or in connection with an instruction execution system, apparatus, or device.

The program product may employ any combination of one or more readable media. The readable medium may be a readable signal medium or a readable storage medium. A readable storage medium may be, for example, but not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination of the foregoing. More specific examples (a non-exhaustive list) of the readable storage medium include: an electrical connection having one or more wires, a portable disk, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.

A computer readable signal medium may include a propagated data signal with readable program code embodied therein, for example, in baseband or as part of a carrier wave. Such a propagated data signal may take many forms, including, but not limited to, electro-magnetic, optical, or any suitable combination of the foregoing. A readable signal medium may also be any readable medium that is not a readable storage medium and that can communicate, propagate, or transport a program for use by or in connection with an instruction execution system, apparatus, or device.

Program code embodied on a readable medium may be transmitted using any appropriate medium, including but not limited to wireless, wireline, optical fiber cable, RF, etc., or any suitable combination of the foregoing.

Program code for carrying out operations of the present application may be written in any combination of one or more programming languages, including object-oriented programming languages such as Java, C++ or the like and conventional procedural programming languages, such as the C language or similar programming languages. The program code may execute entirely on the user electronic device, partly on a remote electronic device, or entirely on the remote electronic device or server. In the case of a remote electronic device, the remote electronic device may be connected to the user electronic device through any kind of network, including a Local Area Network (LAN) or a Wide Area Network (WAN), or may be connected to an external electronic device (for example, through the internet using an internet service provider).

Exemplary electronic device

Having described the methods, apparatus and media of the exemplary embodiments of the present application, an electronic device of the exemplary disclosure of the present application is now described with reference to fig. 11. The electronic device comprises: a processor; and a memory for storing processor-executable instructions; wherein the processor implements the data processing method shown in any one of the foregoing embodiments by executing the executable instructions.

Referring to fig. 11, fig. 11 is a schematic structural diagram of an electronic device according to an embodiment of the present disclosure.

The electronic device 1100 shown in fig. 11 is only an example, and should not bring any limitation to the functions and the scope of use of the embodiments of the present application.

As shown in fig. 11, the electronic device 1100 is represented in the form of a general-purpose electronic device. The components of the electronic device 1100 may include, but are not limited to: the aforementioned at least one processor 1101, the aforementioned at least one memory 1102, and a bus 1103 that connects the various system components (including the processor 1101 and the memory 1102).

The bus 1103 includes a data bus, a control bus, and an address bus.

The memory 1102 may include readable media in the form of volatile memory, such as Random Access Memory (RAM) 11021 and/or cache memory 11022, and may further include readable media in the form of non-volatile memory, such as Read Only Memory (ROM) 11023.

The memory 1102 may also include a program/utility 11025 having a set (at least one) of program modules 11024, such program modules 11024 including, but not limited to: an operating system, one or more application programs, other program modules, and program data, each of which, or some combination thereof, may comprise an implementation of a network environment.

The electronic device 1100 may also communicate with one or more external devices 1104 (e.g., keyboard, pointing device, etc.).

Such communication may occur via input/output (I/O) interfaces 1105. Also, the electronic device 1100 can communicate with one or more networks (e.g., a Local Area Network (LAN), a Wide Area Network (WAN), and/or a public network such as the internet) via a network adapter 1106. As shown in fig. 11, the network adapter 1106 communicates with the other modules of the electronic device 1100 over the bus 1103. It should be understood that although not shown in the figures, other hardware and/or software modules may be used in conjunction with the electronic device 1100, including but not limited to: microcode, device drivers, redundant processors, external disk drive arrays, RAID systems, tape drives, and data backup storage systems, among others.

It should be noted that although several units/modules or sub-units/modules of the data processing apparatus are mentioned in the above detailed description, such a division is merely exemplary and not mandatory. Indeed, according to embodiments of the application, the features and functions of two or more of the units/modules described above may be embodied in one unit/module. Conversely, the features and functions of one unit/module described above may be further divided so as to be embodied by a plurality of units/modules.

Further, while the operations of the methods of the present application are depicted in the drawings in a particular order, this does not require or imply that these operations must be performed in this particular order, or that all of the illustrated operations must be performed, to achieve desirable results. Additionally or alternatively, certain steps may be omitted, multiple steps combined into one step execution, and/or one step broken down into multiple step executions.

While the spirit and principles of the application have been described with reference to several particular embodiments, it is to be understood that the application is not limited to the disclosed embodiments. The division into aspects is for convenience of description only and does not mean that the features in those aspects cannot be combined to advantage. The application is intended to cover various modifications and equivalent arrangements included within the spirit and scope of the appended claims.

Claims

1. A data processing method applied to a data processing engine is characterized by comprising the following steps:

analyzing the received SQL statement to obtain an execution plan, and screening the execution plan to obtain a target execution plan containing write operation;

adding a data aggregation operation before the write operation to obtain an adjusted target execution plan;

wherein the data aggregation operation is used for adjusting the data distribution state of the aggregated incremental data within the incremental data block to be written;

the write operation is used for writing the aggregated incremental data in the incremental data block into a data table according to the adjusted data distribution state;

optimizing the adjusted target execution plan based on the attribute information corresponding to the data table to obtain a final execution plan;

and running the final execution plan.

2. The method of claim 1, wherein the data processing engine comprises a Spark engine; the data processing protocol adopted by the Spark engine is a queue protocol.

3. The method of claim 2, wherein each piece of incremental data aggregated in the incremental data block is aggregated by subdata corresponding to a plurality of fields; the data aggregation operation comprises a data mapping operation and a data sorting operation;

sorting mapping data sets formed by mapping data corresponding to each incremental data through the data sorting operation to obtain corresponding data sorting results;

4. The method of claim 3, wherein, for each piece of incremental data aggregated in the incremental data block, the screening of subdata of at least two fields from the subdata of a plurality of fields included in the incremental data through the data mapping operation to obtain mapping data corresponding to each piece of incremental data includes:

converting each subdata in the target subdata into a corresponding number in a preset base;

5. The method according to claim 3, wherein the sorting the mapping data set formed by the mapping data corresponding to each incremental data through the data sorting operation to obtain a corresponding data sorting result includes:

and according to the size of the mapping data corresponding to each piece of incremental data, performing ascending or descending arrangement on the mapping data in the mapping data set to obtain a corresponding data sorting result.

6. The method of claim 3, wherein adjusting the data distribution state of the aggregated delta data within the delta data block based on the data ordering result comprises:

7. The method of claim 3, wherein the attribute information corresponding to the data table comprises first configuration information and second configuration information;

8. A data processing apparatus, applied to a data processing engine, comprising:

the screening module is used for analyzing the received SQL statement to obtain an execution plan, and screening the execution plan to obtain a target execution plan containing the write operation;

an adding module, configured to add a data aggregation operation before the write operation to obtain an adjusted target execution plan; the data aggregation operation is used for adjusting the data distribution state of the aggregated incremental data in the incremental data block to be written, and the write operation is used for writing the aggregated incremental data in the incremental data block into a data table according to the adjusted data distribution state;

the optimization module is used for optimizing the adjusted target execution plan based on the attribute information corresponding to the data table to obtain a final execution plan;

and the operation module is used for operating the final execution plan.

9. An electronic device, comprising:

a processor;

a memory for storing processor-executable instructions;

wherein the processor implements the data processing method of any one of claims 1-7 by executing the executable instructions.

10. A computer-readable storage medium, characterized in that the storage medium stores a computer program for causing a processor to execute the data processing method of any one of claims 1-7.