WO2024113405A1

WO2024113405A1 - Data processing method and apparatus, device, and storage medium

Info

Publication number: WO2024113405A1
Application number: PCT/CN2022/137867
Authority: WO
Inventors: 陈志标; 黄靖东; 谢锐
Original assignee: 深圳计算科学研究院
Priority date: 2022-11-30
Filing date: 2022-12-09
Publication date: 2024-06-06
Also published as: CN115905236A; CN115905236B

Abstract

The present application provides a data processing method and apparatus, a device, and a storage medium. The method is applied to calculation of column-based storage data, and the method comprises: acquiring a target column data block and corresponding prompt information in the target column data block; according to the prompt information, matching, in a preset algorithm set on the basis of degrees of preferences, a target preferred algorithm capable of satisfying and corresponding to the prompt information; and optimizing an operator according to the target preferred algorithm, and processing target data in the target column data block by means of the optimized operator. For a stored data block, data-specific prompt information thereof is saved, so that an in-memory compute engine can utilize the prompt information and use an optimal processing algorithm, thereby improving the processing efficiency of unit data, and finally reducing the overall computing cost.

Description

A data processing method, device, equipment and storage medium

Technical Field

The present application relates to the field of data processing, and in particular to a data processing method, device, equipment and storage medium.

Background Art

Massive data sets are traditionally processed with distributed computing to address processing efficiency at scale. This generally requires a large amount of computing resources, which often makes big data solutions so costly to implement that enterprises cannot deploy them. As the amount of data to be processed grows and latency requirements tighten, the data processing efficiency per unit of computing resources has therefore become one of the core competitive strengths of big data computing software. In in-memory computing in particular, increasing the number of records that each thread can process per second is generally regarded as a key means of improving competitiveness.

To improve the processing power of a CPU (central processing unit), in addition to raising its clock frequency, single-core processing power is generally improved through instruction pipelining and SIMD parallel instructions. How to make better use of the CPU and achieve a higher IPC (instructions per cycle) has become a hot research topic in the database industry and academia.

Modern CPU architectures generally contain multiple cores. To shorten the data path and raise the clock frequency of each core, designers use multiple groups of registers to separate the parts of the CPU, forming an instruction and data pipeline. Although general-purpose CPUs have built-in branch predictors that predict conditional branches and reduce pipeline disruption, these predictors can only make simple predictions and are not always correct. Even a correct prediction is not free, and when there are many conditions the checks themselves consume a large number of instructions.

Summary of the invention

In view of the above problems, the present application is proposed to provide a data processing method, apparatus, device and storage medium that overcome the above problems or at least partially solve the above problems, including:

A data processing method is applied to calculation of column-type stored data, and the method comprises:

Obtaining a target column data block and corresponding prompt information in the target column data block;

According to the prompt information, matching, in a preset algorithm set and by degree of preference, a target preferred algorithm that satisfies the prompt information;

Optimizing an operator according to the target preferred algorithm, and processing the target data in the target column data block by means of the optimized operator.

Furthermore, before the step of obtaining the target column data block and the prompt information in the target column data block, the step includes:

Acquire initial data, and generate a set of column data blocks to be processed according to the initial data;

Determine a target column data block in the set of column data blocks to be processed;

The prompt information in the target column data block is obtained.

Furthermore, the step of generating a set of column data blocks to be processed based on the initial data includes:

Acquire the data type corresponding to the initial data;

Determining a preprocessing function corresponding to the data type according to the data type;

generating a plurality of initial column data blocks according to the initial data and the preprocessing function;

The to-be-processed column data block set is generated according to a plurality of the initial column data blocks.

Furthermore, the step of determining the target column data block in the set of column data blocks to be processed includes:

Determine a target column in the set of column data blocks to be processed according to a preset query condition;

The target column data block is determined in the set of column data blocks to be processed according to the preset query condition and the target column.

Furthermore, the step of matching, according to the prompt information and by degree of preference, a target preferred algorithm that satisfies the prompt information in a preset algorithm set includes:

Sorting all algorithms in the preset algorithm set according to the degree of preference to generate a preferred algorithm set with a preferred ranking sequence number, wherein the preferred algorithm set includes a first algorithm at the first position, a standard algorithm at the last position, and a plurality of intermediate algorithms with preferred ranking sequences between the first algorithm and the standard algorithm;

The target preferred algorithm is determined among the first algorithm, the standard algorithm and several intermediate algorithms according to the prompt information.

Furthermore, the step of determining the target preferred algorithm among the first algorithm, the standard algorithm and the plurality of intermediate algorithms according to the prompt information includes:

generating a first condition corresponding to the first algorithm and a plurality of sub-conditions corresponding to the plurality of intermediate algorithms according to the first algorithm and the plurality of intermediate algorithms;

Matching the prompt information with the first condition and the plurality of sub-conditions;

When the prompt information does not satisfy the first condition, a second algorithm located after the first algorithm and a second sub-condition of the second algorithm are obtained from the plurality of sub-conditions;

When the prompt information does not satisfy the second sub-condition, determining in sequence whether any of the remaining sub-conditions is a target sub-condition satisfied by the prompt information;

When the prompt information satisfies neither the first condition nor any of the plurality of sub-conditions, the standard algorithm is determined as the target preferred algorithm.

The embodiment of the present invention further discloses a data processing device, which is applied to calculation of column-type stored data, and includes:

A first acquisition module, used for acquiring a target column data block and corresponding prompt information in the target column data block;

A matching module, configured to match, according to the prompt information and by degree of preference, a target preferred algorithm that satisfies the prompt information in a preset algorithm set;

A processing module, configured to optimize an operator according to the target preferred algorithm, and process the target data in the target column data block through the optimized operator.

Furthermore, the device further includes, preceding the first acquisition module:

A first generating module, configured to obtain initial data and generate a set of column data blocks to be processed according to the initial data;

A first determining module, configured to determine a target column data block in the set of column data blocks to be processed;

The second acquisition module is used to acquire the prompt information in the target column data block.

An embodiment of the present invention further discloses a computer device, including a processor, a memory, and a computer program stored in the memory and capable of running on the processor, wherein the computer program implements the steps of a data processing method as described above when executed by the processor.

An embodiment of the present invention further discloses a computer-readable storage medium, on which a computer program is stored. When the computer program is executed by a processor, the steps of the data processing method described above are implemented.

This application has the following advantages:

Compared with the prior art, in which branch prediction can only make simple predictions that are not always correct, and in which even a correct prediction is not free and many conditions require many instructions, the embodiments of the present application save data-specific prompt information for each stored data block so that the in-memory computing engine can use this prompt information to adopt the optimal processing algorithm. Specifically: a target column data block and the corresponding prompt information in the target column data block are obtained; according to the prompt information, a target preferred algorithm that satisfies the prompt information is matched in a preset algorithm set by degree of preference; an operator is optimized according to the target preferred algorithm, and the target data in the target column data block is processed by the optimized operator. This addresses the problems that only simple predictions can be made, that they are not always correct, and that even correct predictions carry a cost that grows with the number of conditions. The principle of data locality is used to save, as prompt information, certain conditions that each data block satisfies as a whole. When a data block is processed, conditional judgments and similar statements can be evaluated once in advance, and the most efficient algorithm matching that data is selected based on the prompt information. Reducing conditional judgments during processing and using the most efficient processing algorithm improves the efficiency of single-core data processing; saving the data-specific prompt information of each stored data block lets the in-memory computing engine use that information to adopt the optimal processing algorithm, which improves the processing efficiency per unit of data and ultimately reduces the overall computing cost.

BRIEF DESCRIPTION OF THE DRAWINGS

In order to illustrate the technical solution of the present application more clearly, the drawings required for describing the present application are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present application. A person of ordinary skill in the art can obtain other drawings from these drawings without creative effort.

FIG. 1 is a flowchart of a data processing method according to an embodiment of the present application;

FIG. 2 is a structural block diagram of a data processing device provided by an embodiment of the present application;

FIG. 3 is a schematic diagram of the structure of a computer device provided by an embodiment of the present invention.

Detailed Description

In order to make the objects, features and advantages of the present application more obvious and understandable, the present application is described in further detail below in conjunction with the accompanying drawings and specific embodiments. Obviously, the described embodiments are some rather than all of the embodiments of the present application. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments in the present application without creative effort fall within the scope of protection of the present application.

Through analysis of the prior art, the inventors found that modern CPU architectures generally include multiple cores. To shorten the data path and raise the clock frequency of each core, designers use multiple groups of registers to separate the parts of the CPU, forming an instruction and data pipeline, and pipeline parallelism improves the average processing latency of instructions. Efficient operation of the pipeline is critical to CPU performance, and instructions such as conditional branch jumps are a major cause of pipeline disruption. To reduce their impact, column-based vectorized computation turns each single calculation into a batch calculation, which reduces function calls and improves cache hit rates. However, for calculations on complex types (types that are not native to the CPU, such as high-precision numeric types), processing a batch of data often still requires making various judgments for each value and then handling the value differently depending on the result. Although general-purpose CPUs have built-in branch predictors that predict conditional branches to reduce pipeline disruption, two main problems remain: 1. only simple predictions can be made, and they are not always correct; 2. even a correct prediction is not free, and when there are many conditions a large number of instructions are consumed.

Existing technical solution 1:

Vectorized computing technology: In 2005, Peter Boncz and other scholars published the paper "MonetDB/X100: Hyper-Pipelining Query Execution" at VLDB (Very Large Data Bases, a well-known international conference in the database field), proposing vectorized computing technology for the first time. This technology is one of the foundations of contemporary in-memory computing. Its main contributions are: 1. pipelined execution of operators; 2. the use of small, contiguous, cache-resident, fixed-type arrays as the unit of computation, with deterministic branches to reduce branch mispredictions during calculation and to make full use of the cache.

Disadvantages of this solution: 1. Branch rewriting can only be applied to simple types and operations, such as integers; complex data types and algorithms generally cannot be rewritten, or the improvement after rewriting is poor. 2. Each type has only one implementation algorithm.
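The "deterministic branches" idea in existing technical solution 1 can be illustrated with a selection primitive that always writes the candidate index and advances the output cursor by the comparison result, instead of branching on each value. The following is a minimal sketch in Python for readability only; the function name select_less_than and its parameters are assumptions made for illustration, and a real engine would write this loop in a compiled language, where the absence of a conditional jump is what matters.

```python
from typing import List, Sequence


def select_less_than(values: Sequence[int], bound: int) -> List[int]:
    """Return the indices i where values[i] < bound, without a per-value `if`."""
    out = [0] * len(values)       # worst case: every index is selected
    count = 0
    for i, v in enumerate(values):
        out[count] = i            # unconditional write of the candidate index
        count += int(v < bound)   # cursor advances only when the predicate holds
    return out[:count]
```

As the disadvantages above note, this kind of rewrite suits simple comparisons on native integers; it does not extend easily to types whose handling depends on several per-value conditions.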

Existing technical solution 2:

Runtime operator specialization technology: In 2018, MemSQL published the paper "Fast Selection and Aggregation on Encoded Data using Operator Specialization" at SIGMOD. Its main contributions are as follows. First, it proposed a variety of filtering and aggregation algorithms and implemented operators on that basis. Then, at runtime, it analyzed parameters of the data to be processed (including the number of groups, the number of aggregates, the number of bits per value, the selectivity, etc.) and dynamically selected different specialized operators to process the data, thereby achieving efficient data processing.

The disadvantage of this solution is that it only uses some implicit information available from data encoding, which is difficult to use for reducing branch prediction during the actual vectorized calculation of operators. The algorithms in that paper therefore mainly target integers or dictionary-encoded data.

Referring to FIG. 1, a flowchart of a data processing method according to an embodiment of the present application is shown;

S110, obtaining a target column data block and corresponding prompt information in the target column data block;

S120, according to the prompt information, matching, in a preset algorithm set and by degree of preference, a target preferred algorithm that satisfies the prompt information;

S130, optimizing the operator according to the target preferred algorithm, and processing the target data in the target column data block by the optimized operator.

Compared with the prior art, in which branch prediction can only make simple predictions that are not always correct, and in which even a correct prediction is not free and many conditions require many instructions, the embodiments of the present application save data-specific prompt information for each stored data block so that the in-memory computing engine can use this prompt information to adopt the optimal processing algorithm. Specifically: a target column data block and the corresponding prompt information in the target column data block are obtained; according to the prompt information, a target preferred algorithm that satisfies the prompt information is matched in a preset algorithm set by degree of preference; an operator is optimized according to the target preferred algorithm, and the target data in the target column data block is processed by the optimized operator. This addresses the problems that only simple predictions can be made, that they are not always correct, and that even correct predictions carry a cost that grows with the number of conditions. The principle of data locality is used to save, as prompt information, certain conditions that each data block satisfies as a whole. When a data block is processed, conditional judgments and similar statements can be evaluated once in advance, and the most efficient algorithm matching that data is selected based on the prompt information. Reducing conditional judgments during processing and using the most efficient processing algorithm improves the efficiency of single-core data processing; saving the data-specific prompt information of each stored data block lets the in-memory computing engine use that information to adopt the optimal processing algorithm, which improves the processing efficiency per unit of data and ultimately reduces the overall computing cost.

Next, a data processing method in this exemplary embodiment will be further described.

As described in step S110, a target column data block and corresponding prompt information in the target column data block are obtained.

In an embodiment of the present invention, the specific process before "obtaining the target column data block and the corresponding prompt information in the target column data block" in step S110 may be further explained in combination with the following description.

As described in the following steps,

S101, obtaining initial data, and generating a set of column data blocks to be processed according to the initial data;

S102, determining a target column data block in the set of column data blocks to be processed;

S103: Acquire the prompt information in the target column data block.

It should be noted that the initial data refers to the data input into the system by the user.

It should be noted that the to-be-processed column data block set refers to a set formed by normalized data, which includes prompt information.

It should be noted that the target column data block refers to the column data block obtained by screening the to-be-processed column data block set according to the query range, i.e., the preset query condition.

It should be noted that generating a set of column data blocks to be processed from initial data is a process of data normalization.

It should be noted that two-dimensional relational table data is first stored in a column storage organization, and all data of one column over a certain number of record rows is called a column data block unit, namely an initial column data block; the to-be-processed column data block set is formed from a number of initial column data blocks.

As an example, data is generally imported or written in units of rows, and multiple rows can be written at one time. Each column of data in each row needs to be written into a column data block corresponding to the column organization.
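The row-to-column write path described above can be sketched as follows. This is an illustrative sketch only: the function name rows_to_column_blocks, the dictionary layout and the block capacity of 4 rows are assumptions, since the application does not fix a block size or an in-memory representation.

```python
from typing import Any, Dict, List, Sequence

BLOCK_ROWS = 4  # assumed block capacity; the application does not fix a value


def rows_to_column_blocks(
    rows: Sequence[Sequence[Any]],
    column_names: Sequence[str],
    block_rows: int = BLOCK_ROWS,
) -> Dict[str, List[List[Any]]]:
    """Split row-oriented input into per-column blocks of at most block_rows values."""
    blocks: Dict[str, List[List[Any]]] = {name: [] for name in column_names}
    for start in range(0, len(rows), block_rows):
        chunk = rows[start:start + block_rows]
        for col_index, name in enumerate(column_names):
            # Each column of each row lands in the block of its own column.
            blocks[name].append([row[col_index] for row in chunk])
    return blocks
```

With three rows and a capacity of two, each column ends up with one full block and one partial block.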

As described in step S101, initial data is acquired, and a set of column data blocks to be processed is generated according to the initial data.

In an embodiment of the present invention, the specific process of "obtaining initial data, and generating a set of column data blocks to be processed according to the initial data" in step S101 can be further explained in combination with the following description.

As described in the following steps,

S1011, obtaining a data type corresponding to the initial data;

S1012, determining a preprocessing function corresponding to the data type according to the data type;

S1013, generating a plurality of initial column data blocks according to the initial data and the preprocessing function;

S1014: Generate the to-be-processed column data block set according to a plurality of the initial column data blocks.

It should be noted that an initial column data block is generally configured with a maximum number of values that can be written. When that maximum is reached (that is, the count exceeds the preset number), or when the upper-level user explicitly requests that the buffer contents be flushed, writing the data to actual storage is triggered. For the initial column data block of each column, the preprocessing function of the corresponding data type is obtained according to the data type defined for the column. The initial data is normalized using the preprocessing function to generate the normalized column data block that ultimately needs to be written, namely the initial column data block; a number of initial column data blocks together form the set of column data blocks to be processed. The initial column data block is written to persistent storage through the storage module.

As an example, when the number of initial data values exceeds the preset number, the preprocessing function corresponding to the data type is determined from the initial data and its data type; the initial data is then normalized according to the preprocessing function to generate the set of column data blocks to be processed.

In a specific implementation, for the initial column data block of each column, a preprocessing function of its type is obtained according to the data type defined by the column. If no preprocessing function exists for this type, the initial column data block is directly written to the persistent storage through the storage module.

In a specific implementation, a plurality of initial column data blocks are generated according to the initial data and the preprocessing function, that is, the initial data is normalized.

In a specific implementation, after a plurality of initial column data blocks are generated according to the initial data and the preprocessing function, the prompt information of the normalized column data blocks, i.e., the initial column data blocks, is collected at the same time.
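Steps S1011 to S1014 can be sketched as a lookup from data type to preprocessing function, with a fallback that stores the values unchanged when no function exists for the type. All names here (ColumnBlock, PREPROCESSORS, preprocess_int, build_block) and the single hint key no_null are hypothetical and chosen only for illustration; the application does not prescribe a data layout.

```python
from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List, Optional, Tuple

Hints = Dict[str, Any]
Preprocess = Callable[[List[Any]], Tuple[List[Any], Hints]]


@dataclass
class ColumnBlock:
    data_type: str
    values: List[Any]
    hints: Hints = field(default_factory=dict)  # may stay empty: hints are optional


def preprocess_int(values: List[Any]) -> Tuple[List[Any], Hints]:
    """Normalize to ints (None stays None) and record whether the block is null-free."""
    normalized = [None if v is None else int(v) for v in values]
    return normalized, {"no_null": all(v is not None for v in normalized)}


# Hypothetical registry: data type name -> preprocessing function.
PREPROCESSORS: Dict[str, Preprocess] = {"int": preprocess_int}


def build_block(data_type: str, raw_values: List[Any]) -> ColumnBlock:
    fn: Optional[Preprocess] = PREPROCESSORS.get(data_type)
    if fn is None:
        # No preprocessing function for this type: keep the values as written.
        return ColumnBlock(data_type, list(raw_values))
    normalized, hints = fn(raw_values)
    return ColumnBlock(data_type, normalized, hints)
```

Collecting the hints inside the preprocessing function reflects the statement above that prompt information is gathered at the same time as normalization, so the data is scanned once at write time.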

As described in step S102, a target column data block in the set of column data blocks to be processed is determined.

In an embodiment of the present invention, the specific process of "determining the target column data block in the set of column data blocks to be processed" in step S102 may be further explained in combination with the following description.

As described in the following steps,

S1021, determining a target column in the set of column data blocks to be processed according to a preset query condition;

S1022: Determine the target column data block in the set of column data blocks to be processed according to the preset query condition and the target column.

It should be noted that the preset query condition refers to a query range defined by a user, through which a target column can be obtained from the set of column data blocks to be processed, and then the target column data block can be determined through the target column and the query range.
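Steps S1021 and S1022 can be sketched as a two-stage filter: pick the column named by the query, then keep only the blocks of that column that overlap the query range. The application does not say what form the query range takes, so this sketch assumes a half-open range of row numbers; StoredBlock, first_row and target_blocks are hypothetical names.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class StoredBlock:
    first_row: int      # row number of the first value in this block (assumed field)
    values: List[int]


def target_blocks(
    blocks_by_column: Dict[str, List[StoredBlock]],
    column: str,
    row_lo: int,
    row_hi: int,
) -> List[StoredBlock]:
    """Return the blocks of `column` that overlap the row range [row_lo, row_hi)."""
    selected = []
    for block in blocks_by_column.get(column, []):   # S1021: choose the target column
        block_end = block.first_row + len(block.values)
        if block.first_row < row_hi and row_lo < block_end:   # S1022: range overlap
            selected.append(block)
    return selected
```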

As described in step S120, according to the prompt information, a target preferred algorithm that satisfies the prompt information is matched in a preset algorithm set by degree of preference.

In one embodiment of the present invention, the specific process of "matching, according to the prompt information and by degree of preference, a target preferred algorithm that satisfies the prompt information in a preset algorithm set" in step S120 can be further explained in combination with the following description.

As described in the following steps,

S1201, sorting all algorithms in the preset algorithm set according to the degree of preference to generate a preferred algorithm set with preferred ranking numbers, wherein the preferred algorithm set includes a first algorithm at the top, a standard algorithm at the bottom, and a plurality of intermediate algorithms with preferred ranking numbers between the first algorithm and the standard algorithm;

S1202: Determine the target preferred algorithm among the first algorithm, the standard algorithm, and the plurality of intermediate algorithms according to the prompt information.

It should be noted that the algorithm types in the preset algorithm set include but are not limited to addition, subtraction and multiplication.

It should be noted that the prompt information includes but is not limited to whether the sign bits of the numerical values are the same, the most significant bit of the numerical value, the result of the null value judgment and the maximum precision value of the numerical value.

As an example, all values in the target column data block are obtained, and from these values it is determined whether the sign bits of all values are the same, what the most significant bit is, and what the null value judgment result is; the maximum precision value is then generated from the most significant bit.

Here, whether the sign bits are the same refers to whether the values are all positive or all negative; the most significant bit refers to bit n-1 of an n-bit binary number, which has the highest weight, 2^(n-1); the null value judgment result refers to whether the values are all null or all non-null; and the maximum precision value refers to the maximum precision calculated when the values are stored using the number of bytes required by the most significant bit.
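The four kinds of prompt information listed above can be computed in a single pass over a block, as in the following sketch. It rests on several assumptions that are not in the application: values are Python integers or None, zero is grouped with the positive values, the most significant bit is reported as a bit count (so the bit index is that count minus one), and the "maximum precision value" is approximated by the number of bytes needed for the largest magnitude. The function name collect_hints and the key names are hypothetical.

```python
from typing import Dict, List, Optional


def collect_hints(values: List[Optional[int]]) -> Dict[str, object]:
    """Collect per-block hints: sign uniformity, highest bit, null status, byte width."""
    present = [v for v in values if v is not None]
    # Same sign: every non-null value is >= 0, or every non-null value is < 0.
    same_sign = all(v >= 0 for v in present) or all(v < 0 for v in present)
    # Number of significant bits of the largest magnitude (0 if no non-zero value).
    significant_bits = max((abs(v).bit_length() for v in present), default=0)
    return {
        "same_sign": same_sign,
        "significant_bits": significant_bits,
        "all_null": not present,
        "no_null": len(present) == len(values),
        # Bytes needed for the largest magnitude, at least one.
        "max_bytes": max(1, (significant_bits + 7) // 8),
    }
```

For the block [3, None, 300], the largest magnitude 300 needs 9 bits, so the sketch reports two bytes, a uniform sign, and neither "all null" nor "no null".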

Because the data type of the column is fixed, the data type stored in each column data block is also fixed. A preprocessing function is defined for each supported data type. The preprocessing function normalizes the initial data to generate an initial column data block, and collects prompt information corresponding to the initial column data block. The prompt information is optional. The prompt information (which may be empty) is stored together with the normalized column data, that is, the initial column data block, which is called a set of column data blocks to be processed.

As an example, whether all values have the same sign is determined by checking whether their sign bits are all positive or all negative; the most significant bit among the values is determined; and the null value judgment result is determined by checking whether the values are all null or all non-null.

In a specific implementation, a normalized column data block, namely an initial column data block, is generated through preprocessing: data is uniformly normalized at a data block granularity, and prompt information of the initial column data block is obtained. Subsequent calculations can use this prompt information to dynamically accelerate the calculation.

The user's computing request is converted into an execution operator tree of the in-memory computing engine. The operator types of a general relational computing engine include the filter operator, projection operator, join operator, aggregation operator, and so on. From the normalized column data blocks (that is, the initial column data blocks) required by the request, the target column data block is determined through the preset query condition. The target column data block and its prompt information are passed to the operator, and the algorithm in the operator, that is, the target preferred algorithm, is executed to process the data. The operator implementation builds on vectorized computing technology and additionally uses the prompt information to further optimize the calculation: the implementation is divided into a normal path and multiple fast paths. When the prompt information meets the requirements of a fast path, the code of that fast path is executed to improve data processing efficiency.
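The split into a normal path and a fast path can be illustrated with a column addition operator. In this sketch the fast path is taken when both input blocks carry a hint that they contain no nulls, so the null check is settled once per block rather than once per value. The function name add_columns and the hint key no_null are assumptions made for illustration only.

```python
from typing import Dict, List, Optional


def add_columns(
    left: List[Optional[int]],
    right: List[Optional[int]],
    left_hints: Dict[str, bool],
    right_hints: Dict[str, bool],
) -> List[Optional[int]]:
    """Add two column blocks element-wise, choosing the path from the block hints."""
    if left_hints.get("no_null") and right_hints.get("no_null"):
        # Fast path: the hints guarantee no nulls, so the loop has no per-value check.
        return [a + b for a, b in zip(left, right)]
    # Normal path: every pair is checked for nulls; a null input gives a null result.
    return [None if a is None or b is None else a + b for a, b in zip(left, right)]
```

A missing hint is treated as "unknown", which sends the block down the normal path; this matches the statement that prompt information is optional and may be empty.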

It should be noted that during calculation the operator uses the prompt information in the target column data block and selects the target preferred algorithm based on it. There are several candidate algorithms, ordered in the preferred algorithm set by degree of preference: the best algorithm, i.e. the first algorithm, comes first, followed by several intermediate algorithms, and finally the most general standard implementation, i.e. the standard algorithm. The algorithm is selected dynamically at the granularity of the target column data block, so the execution framework of the existing computing engine does not need to be modified; only performance-sensitive operators need to be adapted and optimized.

As an example, all algorithms in the preset algorithm set are ranked by priority to generate a preferred algorithm set with preferred ranking numbers, wherein the preferred algorithm set includes a first algorithm at the first position, a standard algorithm at the last position, and several intermediate algorithms with preferred ranking numbers between the first algorithm and the standard algorithm.

As described in step S1202, the target preferred algorithm is determined among the first algorithm, the standard algorithm and the plurality of intermediate algorithms according to the prompt information.

In an embodiment of the present invention, the specific process of “determining the target preferred algorithm among the first algorithm, the standard algorithm, and the plurality of intermediate algorithms according to the prompt information” in step S1202 may be further explained in combination with the following description.

As described in the following steps,

S12021. Generate a first condition corresponding to the first algorithm and a plurality of sub-conditions corresponding to the plurality of intermediate algorithms according to the first algorithm and the plurality of intermediate algorithms;

S12022. Match the prompt information with the first condition and the plurality of sub-conditions;

S12023. When the prompt information does not satisfy the first condition, obtain the second algorithm, which is ranked after the first algorithm, and the second sub-condition of the second algorithm from among the plurality of sub-conditions;

S12024. When the prompt information does not satisfy the second sub-condition, determine in sequence whether any of the remaining sub-conditions is satisfied by the prompt information;

S12025. When neither the first condition nor any of the plurality of sub-conditions is satisfied by the prompt information, determine the standard algorithm as the target preferred algorithm.

It should be noted that the prompt information in the target column data block is obtained, and the target optimization algorithm is obtained by matching the prompt information.

As an example, it is first checked whether the prompt information satisfies the first condition of the specialized implementation X, that is, the first algorithm. If not, it is checked whether the prompt information satisfies the second sub-condition of the specialized implementation Y, that is, the second algorithm. If not, the remaining sub-conditions are checked in sequence. When the prompt information satisfies neither the first condition nor any of the sub-conditions, the standard algorithm is determined as the target preferred algorithm, the target data in the target column data block is processed by the general standard implementation, that is, the standard algorithm, and the result is returned, ending the current round of processing of the operator.

In a specific implementation, if the prompt information satisfies the first condition of the specialized implementation X, i.e., the first algorithm, the target data in the target column data block is processed by the specialized implementation X and the result is returned, ending the current round of processing of the operator.

If the prompt information instead satisfies the second sub-condition of the specialized implementation Y, i.e., the second algorithm, the target data in the target column data block is processed by the specialized implementation Y and the result is returned, ending the current round of processing of the operator.

In summary, the target preferred algorithm is selected from the preferred algorithm set in order of preference: the first algorithm whose condition is satisfied by the prompt information is determined to be the target preferred algorithm. When no preferred algorithm in the set has a condition satisfied by the prompt information, the standard algorithm is used directly to process the target data in the target column data block, and the result is returned, ending the current round of processing of the operator.
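The selection order described above can be sketched as a first-match search over a ranked list with a standard fallback. This is a minimal illustration only: the names `select_algorithm`, `sum_no_null_check`, `sum_standard` and the hint key `all_not_null` are hypothetical and do not come from the application.

```python
from typing import Callable, Dict, List, Tuple

Hints = Dict[str, object]
Condition = Callable[[Hints], bool]
Algorithm = Callable[[list], list]


def select_algorithm(
    hints: Hints,
    ranked: List[Tuple[Condition, Algorithm]],
    standard: Algorithm,
) -> Algorithm:
    """Return the first algorithm, in preference order, whose condition holds."""
    for condition, algorithm in ranked:
        if condition(hints):
            return algorithm
    # No specialized condition matched: fall back to the standard algorithm.
    return standard


# Hypothetical algorithms for a "sum of non-null values" operator.
def sum_no_null_check(values: list) -> list:
    # Fast path: the block is known to contain no NULL, so no per-value check.
    return [sum(values)]


def sum_standard(values: list) -> list:
    # Normal path: every value is checked for NULL (None) before use.
    return [sum(v for v in values if v is not None)]


# Ranked list: most preferred first. Here there is a single fast path.
ranked = [(lambda h: h.get("all_not_null") is True, sum_no_null_check)]

fast = select_algorithm({"all_not_null": True}, ranked, sum_standard)
slow = select_algorithm({"all_not_null": False}, ranked, sum_standard)
```

Because the choice is made once per column data block, the cost of the condition checks is amortized over all values in the block.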

Embodiment 1

Take a table with only one column as an example. The column type is a high-precision numeric type, and each column data block is assumed to hold 5 data entries. Precision generally refers to the number of decimal digits that can be represented, and a high-precision numeric type has a relatively large precision. The example extends readily to multiple columns and other types, and each data block can accommodate more data.

Data preprocessing and writing mainly involve generating the prompt information and converting the data into normalized form. Taking the high-precision numeric type as an example, the preprocessing steps are as follows:

1. Align the input values to a common scale. A positive scale indicates the number of decimal places; a negative scale indicates the number of zeros that follow the integer.

2. Check whether all the values in the data block have the same sign, that is, whether they are all positive or all negative.

3. Determine the most significant bit of the values in the data block.

4. Check whether all the data in the data block are null (no value) or all are not null (every entry has a value).

5. Store each value in the number of bytes required by the most significant bit, and calculate the maximum precision.

6. Store the prompt information derived from the data characteristics of the data block above.

The following prompt information is generated when preprocessing the data block:

1. The input values are normalized according to scale 2, that is, the scale of the values in the data block is 2.

2. The most significant bit of the normalized values is bit 16, which means that each value can be represented in 2 bytes, and the maximum precision is 5.

3. The sign bits are consistent and all positive.

4. All values in the data block are not null, that is, non-empty values.

5. When writing, the data is stored in a fixed length of 2 bytes, and the prompt information of the data block is written.
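The preprocessing steps and the resulting prompt information can be sketched as follows. This is an illustration under stated assumptions, not the implementation of the application: the class `BlockHints`, the function `preprocess` and the five sample values are hypothetical, inputs are assumed to be finite `decimal.Decimal` values or `None` for NULL, and the values are stored as magnitudes with the sign tracked separately.

```python
from dataclasses import dataclass
from decimal import Decimal
from typing import List, Optional, Tuple


@dataclass
class BlockHints:
    scale: int           # common scale after alignment
    max_precision: int   # decimal digits of the largest aligned magnitude
    same_sign: bool      # True if all non-null values share one sign
    all_not_null: bool   # True if the block contains no NULL
    byte_width: int      # bytes needed to store the largest aligned magnitude


def preprocess(values: List[Optional[Decimal]]) -> Tuple[List[Optional[int]], BlockHints]:
    present = [v.as_tuple() for v in values if v is not None]
    # Step 1: the common scale is the largest scale found in the block.
    scale = max((-t.exponent for t in present), default=0)

    def align(v: Optional[Decimal]) -> Optional[int]:
        if v is None:
            return None
        t = v.as_tuple()
        coefficient = int("".join(str(d) for d in t.digits))
        # Pad with zeros so that every magnitude uses the common scale.
        return coefficient * 10 ** (scale + t.exponent)

    aligned = [align(v) for v in values]
    largest = max((a for a in aligned if a is not None), default=0)
    hints = BlockHints(
        scale=scale,
        max_precision=len(str(largest)),
        same_sign=len({t.sign for t in present}) <= 1,
        all_not_null=len(present) == len(values),
        byte_width=max(1, (largest.bit_length() + 7) // 8),
    )
    return aligned, hints


# Hypothetical block of 5 entries.
values = [Decimal(s) for s in ("2.22", "1.1", "33.33", "655.35", "0.5")]
aligned, hints = preprocess(values)
```

For these sample values the hints match the ones listed above: scale 2, a 2-byte fixed width, maximum precision 5, uniform positive sign and no NULL.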

The data block contains the following prompt information: the maximum precision (the largest precision of val over all values in the data block; for example, for 99.9, 999.9 and 9999.9 the maximum precision is 5), whether the scale is the same, whether the sign is the same, whether all data are not null, and so on. The not-null information of a data block is stored separately in a bool array. If all are NULL, the array is empty. Using this information, the judgment logic in the calculation can be reduced and fast paths can be taken to speed up the calculation over the data block.

The following uses the processing logic of addition as an example. When adding two high-precision numbers, it is necessary to judge whether the sign bits differ (which requires different logic), whether the scales need to be adjusted, and whether overflow occurs. These judgments are all overhead. The optimized addition uses the prompt information of the data block to remove these judgments, so that under specific conditions all the data of a data block follows the same calculation logic.

The number structure is generally as follows:

struct Number
{
    u128 val;
    int16_t scale;
    bool Negative;
};

Among them, "u128 val" is a 128-bit unsigned integer that stores the digits of the value as an integer;

"int16_t scale" indicates the number of decimal places when scale is positive, or the number of zeros that follow the integer when scale is negative;

"bool Negative" indicates whether the value is negative.

val is a 16-byte integer used to store the digits of the number. A wider integer can also be used, depending on the maximum precision required. With u128, the maximum precision is 38 decimal digits.

Among them, Number is a numerical value.

For example, the value -2.35 is represented as follows:

struct Number
{
    val(235),
    scale(2),
    Negative(true),
}.

The value 23500 is represented as follows:

struct Number
{
    val(235),
    scale(-2),
    Negative(false),
}.
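The meaning of the three fields can be checked with a small sketch that converts the structure back to a decimal value. The Python class below mirrors the structure in the text; the method `to_decimal` is a hypothetical helper added for illustration, and Python's unbounded `int` stands in for u128.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class Number:
    val: int        # unscaled magnitude (u128 in the text)
    scale: int      # positive: decimal places; negative: trailing zeros
    negative: bool  # True for a negative value

    def to_decimal(self) -> Decimal:
        sign = 1 if self.negative else 0
        digits = tuple(int(d) for d in str(self.val))
        # Decimal accepts a (sign, digits, exponent) tuple; exponent = -scale.
        return Decimal((sign, digits, -self.scale))


minus_2_35 = Number(val=235, scale=2, negative=True).to_decimal()
plus_23500 = Number(val=235, scale=-2, negative=False).to_decimal()
```

Both examples share the same val of 235; only scale and the sign flag differ.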

For convenience of description, the following symbols are defined.

Data block 1: represents the first data block of addition;

Data block 2: represents the second data block of addition;

Data block 3: represents the result data block;

Pmax1: Maximum precision of data block 1;

Pmax2: Maximum precision of data block 2;

scale1: If data block 1 has the same scale, it indicates the scale of data block 1;

scale2: If data block 2 has the same scale, it indicates the scale of data block 2;

negative1: If data block 1 has the same sign bit, it indicates the sign bit of data block 1;

negative2: If data block 2 has the same sign bit, it indicates the sign bit of data block 2;

value1: represents a value of data block 1;

value2: represents a value of data block 2;

value3: represents a value of data block 3.

1. The addition of two data block numbers has the following specializations according to the prompt information of the data blocks.

Specialization 1: scale1 and scale2 are equal, the estimated precision of the result {MAX(Pmax1, Pmax2) + 1} does not exceed 19 digits, data block 1 has a uniform sign bit, data block 2 has a uniform sign bit, negative1 = negative2, and there are no null values in data block 1 or data block 2.

In this case, the val fields of the two data blocks only need to be converted to u64 and added. An integer of up to 19 decimal digits (at most 19 nines) can be represented by u64 (a 64-bit unsigned integer), and because the estimated precision of the result does not exceed 19 digits, the sum cannot exceed the maximum value of u64. The scale of the result is the same as the scale of the original data blocks, and the maximum precision of the result is {MAX(Pmax1, Pmax2) + 1}; this is an estimate and does not need to be exact. The sign bit of the result is the sign bit of the original data blocks.
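The digit limits used here (19 digits for u64, and 38 digits for u128 as stated for the number structure) can be verified with plain integer arithmetic. The helper name `max_with_digits` is hypothetical.

```python
U64_MAX = 2**64 - 1    # 18446744073709551615
U128_MAX = 2**128 - 1


def max_with_digits(n: int) -> int:
    """Largest integer with n decimal digits, i.e. n nines."""
    return 10**n - 1


fits_19_in_u64 = max_with_digits(19) <= U64_MAX
fits_20_in_u64 = max_with_digits(20) <= U64_MAX
fits_38_in_u128 = max_with_digits(38) <= U128_MAX
fits_39_in_u128 = max_with_digits(39) <= U128_MAX
```

Every 19-digit value fits in u64 but not every 20-digit value does, which is why the fast path is gated on an estimated result precision of at most 19.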

Traverse data block 1, data block 2 and data block 3 and take out the values:

for(value1 in data block 1, value2 in data block 2, value3 in data block 3){
    value3.val = (value1.val as u64) + (value2.val as u64);
    value3.scale = value1.scale;
    value3.negative = value1.negative;
}.

For example:

Data block 1 has values (2.22, 1.11, 33.33), the maximum precision of data block 1 is 4, and the scale is 2;

Data block 2 has values (4.01, 999.02, 6.04), the maximum precision of data block 2 is 5, and the scale is 2;

Data block 3 is used to store the results.

The scales of data blocks 1 and 2 are both 2, and the maximum precision of the result is 6, which does not exceed 19. Note that this is an estimated precision; the actual precision may be 5 in some scenarios, for example if 999.02 were 998.02. The scale of the result is 2; both blocks are positive, so the result is also positive.

Traverse data block 1, data block 2 and data block 3.

First iteration:

value1: number{val(222), scale(2), negative(false)} and value2: number{val(401), scale(2), negative(false)};
value3.val = 222 + 401 (u64 addition is used here);
value3.scale = 2;
value3.negative = false;

The result is number{val(623), scale(2), negative(false)}, which is 6.23.

Second iteration:

value1: number{val(111), scale(2), negative(false)} and value2: number{val(99902), scale(2), negative(false)};
value3.val = 111 + 99902;
value3.scale = 2;
value3.negative = false;

The result is number{val(100013), scale(2), negative(false)}, which is 1000.13.

Third iteration:

value1: number{val(3333), scale(2), negative(false)} and value2: number{val(604), scale(2), negative(false)};
value3.val = 3333 + 604;
value3.scale = 2;
value3.negative = false;

The result is number{val(3937), scale(2), negative(false)}, which is 39.37.

The data of data block 3 is (6.23, 1000.13, 39.37), with a maximum precision of 6, a scale of 2, and a positive sign.

This specialization reduces the overhead of large-number addition: most current CPUs are 64-bit and support u64 addition directly in hardware, whereas u128 addition has to be implemented in software.
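The worked example of specialization 1 can be reproduced with the sketch below. The function name `add_block_same_scale_sign` is hypothetical. Python integers are unbounded, so the u64 range is modeled with an assertion rather than a real 64-bit type.

```python
from typing import List, Tuple

U64_MAX = 2**64 - 1


def add_block_same_scale_sign(
    vals1: List[int], vals2: List[int], scale: int, negative: bool
) -> Tuple[List[int], int, bool]:
    """Add two blocks whose hints guarantee equal scale, equal sign and no NULL."""
    out = []
    for a, b in zip(vals1, vals2):
        total = a + b
        # Guaranteed by the precision hint; checked here only for illustration.
        assert total <= U64_MAX
        out.append(total)
    # Scale and sign are block-level constants, not recomputed per value.
    return out, scale, negative


# Data block 1 = (2.22, 1.11, 33.33), data block 2 = (4.01, 999.02, 6.04).
block3, scale3, negative3 = add_block_same_scale_sign(
    [222, 111, 3333], [401, 99902, 604], scale=2, negative=False
)
```

The loop body contains no sign comparison, scale adjustment or NULL check, which is the point of the fast path.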

Specialization 2: scale1 and scale2 are equal, the estimated precision of the result {MAX(Pmax1, Pmax2) + 1} exceeds 19 digits but does not exceed 38 digits, data block 1 has a uniform sign bit, data block 2 has a uniform sign bit, negative1 = negative2, and there are no null values in data block 1 or data block 2. In this case, the val fields of the two data blocks can be added directly as u128, and the result cannot exceed the maximum value of u128. The scale of the result is the same as the scale of the original data blocks, and the maximum precision of the result is {MAX(Pmax1, Pmax2) + 1}; this is an estimate and does not need to be exact. The sign bit of the result is the sign bit of the original data blocks.

for(value1 in data block 1, value2 in data block 2, value3 in data block 3){
    value3.val = value1.val + value2.val;
    value3.scale = value1.scale;
    value3.negative = value1.negative;
}.

For example:

Data block 1 has values (2.22, 1.11, 33.33), as in the example of specialization 1; its maximum precision is 4 and its scale is 2;

Data block 2 has values (4.01, 99999999999999999.02, 6.04), the maximum precision of data block 2 is 19, and the scale is 2;

Data block 3 is used to store the results.

The scale of data block 1 and data block 2 is 2, and the maximum precision of the result is 20 (this is the estimated precision; the actual precision may be 19 in some scenarios), which exceeds 19 but does not exceed 38. The scale of the result is 2; both blocks are positive, so the result is also positive; there is no NULL.

Traverse data block 1, data block 2 and data block 3.

First iteration:

value3.val = 222 + 401 (u128 addition is used here);
value3.scale = 2;
value3.negative = false;

The result is number{val(623), scale(2), negative(false)}, which is 6.23.

Second iteration:

value1: number{val(111), scale(2), negative(false)} and value2: number{val(9999999999999999902), scale(2), negative(false)};
value3.val = 111 + 9999999999999999902;
value3.scale = 2;
value3.negative = false;

The result is number{val(10000000000000000013), scale(2), negative(false)}, which is 100000000000000000.13.

Third iteration:

value3.val = 3333 + 604;
value3.scale = 2;
value3.negative = false;

The result is number{val(3937), scale(2), negative(false)}, which is 39.37.

The data in data block 3 is (6.23, 100000000000000000.13, 39.37), with a maximum precision of 20, a scale of 2, a positive sign, and no NULL value.
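The only difference between specialization 1 and specialization 2 is the integer width chosen from the estimated result precision. A minimal sketch of that choice follows; the function names and the string labels are hypothetical.

```python
from typing import Optional


def estimated_add_precision(pmax1: int, pmax2: int) -> int:
    """Estimated (upper-bound) precision of a sum: MAX(Pmax1, Pmax2) + 1."""
    return max(pmax1, pmax2) + 1


def pick_add_width(pmax1: int, pmax2: int) -> Optional[str]:
    precision = estimated_add_precision(pmax1, pmax2)
    if precision <= 19:
        return "u64"    # specialization 1: hardware 64-bit addition
    if precision <= 38:
        return "u128"   # specialization 2: 128-bit addition
    return None         # neither fast path applies


width_spec1 = pick_add_width(4, 5)    # blocks from the specialization 1 example
width_spec2 = pick_add_width(4, 19)   # blocks from the specialization 2 example
width_none = pick_add_width(38, 4)
```

The estimate is computed once from the block hints, so no per-value overflow check is needed inside the loop.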

Specialization 3: scale1 and scale2 are equal, the estimated precision of the result {MAX(Pmax1, Pmax2) + 1} does not exceed 19 digits, data block 1 has a uniform sign bit, data block 2 has a uniform sign bit, negative1 = negative2, and data block 1 or data block 2 contains null values.

In this case, if either operand is NULL, the result is NULL. For non-null values, the processing logic is the same as in specialization 1.

for(value1 in data block 1, value2 in data block 2, value3 in data block 3){
    if (value1.val is NULL or value2.val is NULL) {
        mark the corresponding entry of the NOT NULL array as NULL;
    }
    else {
        mark the corresponding entry of the NOT NULL array as Not NULL;
        value3.val = (value1.val as u64) + (value2.val as u64);
    }
    value3.scale = value1.scale;
    value3.negative = value1.negative;
}.

For example:

Data block 1 has values (2.22, 1.11, 33.33); the maximum precision of data block 1 is 4, the scale is 2, and there is no NULL;

Data block 2 has values (4.01, 999.02, NULL); the maximum precision of data block 2 is 5, the scale is 2, and the NOT NULL array is (true, true, false), where true means the data is not NULL and false means the data is NULL;

Data block 3 is used to store the results.

The scale of data block 1 and data block 2 is 2, and the maximum precision of the result is 6 (this is the estimated precision; the actual precision may be 5 in some scenarios, for example if 999.02 were 998.02), which does not exceed 19. The scale of the result is 2; both blocks are positive, so the result is also positive. There are NULL values, so during the calculation it is necessary to check whether each value is NULL.

Traverse data block 1, data block 2 and data block 3.

First iteration:

value1: number{val(222), scale(2), negative(false)} and value2: number{val(401), scale(2), negative(false)}; according to the Not NULL arrays of the data blocks, neither number is NULL, so the result is Not NULL;
value3.val = 222 + 401 (u64 addition is used here);
value3.scale = 2;
value3.negative = false;

The result is number{val(623), scale(2), negative(false)}, which is 6.23.

Second iteration:

value1: number{val(111), scale(2), negative(false)} and value2: number{val(99902), scale(2), negative(false)}; according to the Not NULL arrays of the data blocks, neither number is NULL, so the result is Not NULL;
value3.val = 111 + 99902;
value3.scale = 2;
value3.negative = false;

The result is number{val(100013), scale(2), negative(false)}, which is 1000.13.

Third iteration:

value1: number{val(3333), scale(2), negative(false)} and value2: number(NULL); according to the Not NULL arrays of the data blocks, one of the numbers is NULL, so the result is marked as NULL in the Not NULL array.

The data of data block 3 is (6.23, 1000.13, NULL), with a maximum precision of 6, a scale of 2, and a positive sign. The Not NULL array is (true, true, false).
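The NULL-aware variant can be sketched with separate not-null arrays, as described for the data block layout. The function name `add_block_with_nulls` is hypothetical, and storing 0 in the slot of a NULL value is an assumption made for this sketch only.

```python
from typing import List, Tuple


def add_block_with_nulls(
    vals1: List[int], not_null1: List[bool],
    vals2: List[int], not_null2: List[bool],
) -> Tuple[List[int], List[bool]]:
    """Add two blocks of equal scale and sign where either may contain NULL."""
    out_vals: List[int] = []
    out_not_null: List[bool] = []
    for a, ok_a, b, ok_b in zip(vals1, not_null1, vals2, not_null2):
        if ok_a and ok_b:
            out_vals.append(a + b)
            out_not_null.append(True)
        else:
            # NULL in either operand makes the result NULL; 0 is a placeholder.
            out_vals.append(0)
            out_not_null.append(False)
    return out_vals, out_not_null


# Data block 1 = (2.22, 1.11, 33.33), data block 2 = (4.01, 999.02, NULL).
vals3, not_null3 = add_block_with_nulls(
    [222, 111, 3333], [True, True, True],
    [401, 99902, 0], [True, True, False],
)
```

Compared with specialization 1, the only extra work per value is the NULL check; sign and scale are still block-level constants.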

Specialization 4: scale1 and scale2 are equal, the estimated precision of the result {MAX(Pmax1, Pmax2) + 1} exceeds 19 digits but does not exceed 38 digits, data block 1 has a uniform sign bit, data block 2 has a uniform sign bit, negative1 = negative2, and data block 1 or data block 2 contains null values.

In this case, if either operand is NULL, the result is NULL. For non-null values, the processing logic is the same as in specialization 2.

for(value1 in data block 1, value2 in data block 2, value3 in data block 3){
    if (value1.val is NULL or value2.val is NULL) {
        mark the corresponding entry of the NOT NULL array as NULL;
    }
    else {
        mark the corresponding entry of the NOT NULL array as Not NULL;
        value3.val = value1.val + value2.val;
    }
    value3.scale = value1.scale;
    value3.negative = value1.negative;
}.

For example:

Data block 1 has values (2.22, 1.11, 33.33), as in the example of specialization 3; its maximum precision is 4, its scale is 2, and there is no NULL;

Data block 2 has values (4.01, 99999999999999999.02, NULL), the maximum precision of data block 2 is 19, the scale is 2, and the NOT NULL array is (true, true, false), where true means the data is not NULL and false means the data is NULL;

Data block 3 is used to store the results.

The scale of data block 1 and data block 2 is 2, and the maximum precision of the result is 20 (estimated precision), which exceeds 19 but does not exceed 38. The scale of the result is 2; both blocks are positive, so the result is also positive. There are NULL values.

Traverse data block 1, data block 2 and data block 3.

First iteration:

value3.val = 222 + 401 (u128 addition is used here);
value3.scale = 2;
value3.negative = false;

The result is number{val(623), scale(2), negative(false)}, which is 6.23.

Second iteration:

value1: number{val(111), scale(2), negative(false)} and value2: number{val(9999999999999999902), scale(2), negative(false)}; according to the Not NULL arrays of the data blocks, neither number is NULL, so the result is Not NULL;
value3.val = 111 + 9999999999999999902;
value3.scale = 2;
value3.negative = false;

The result is number{val(10000000000000000013), scale(2), negative(false)}, which is 100000000000000000.13.

Third iteration:

value1: number{val(3333), scale(2), negative(false)} and value2: number(NULL); according to the Not NULL arrays of the data blocks, value2 is NULL, so the result is NULL.

The data of data block 3 is (6.23, 100000000000000000.13, NULL), with a maximum precision of 20, a scale of 2, and a positive sign. The Not NULL array is (true, true, false).
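Taken together, the four addition specializations are distinguished only by the block hints. The sketch below maps a pair of hint records to a specialization number and returns 0 when the general process must be used. The class `AddHints` and the function `select_add_specialization` are hypothetical names; a `scale` or `negative` of `None` is used here to mean that the block does not have a uniform scale or sign.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AddHints:
    max_precision: int
    scale: Optional[int]       # None if the block has mixed scales
    negative: Optional[bool]   # None if the block has mixed signs
    all_not_null: bool


def select_add_specialization(h1: AddHints, h2: AddHints) -> int:
    """Return 1 to 4 for the specializations in the text, 0 for the general path."""
    if h1.scale is None or h1.scale != h2.scale:
        return 0
    if h1.negative is None or h1.negative != h2.negative:
        return 0
    precision = max(h1.max_precision, h2.max_precision) + 1
    no_nulls = h1.all_not_null and h2.all_not_null
    if precision <= 19:
        return 1 if no_nulls else 3
    if precision <= 38:
        return 2 if no_nulls else 4
    return 0


block1 = AddHints(max_precision=4, scale=2, negative=False, all_not_null=True)
spec1 = select_add_specialization(block1, AddHints(5, 2, False, True))
spec2 = select_add_specialization(block1, AddHints(19, 2, False, True))
spec3 = select_add_specialization(block1, AddHints(5, 2, False, False))
spec4 = select_add_specialization(block1, AddHints(19, 2, False, False))
general = select_add_specialization(block1, AddHints(5, 1, False, True))
```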

2. Other specializations can be formed by combining whether the data block has NULL values, whether the values in the data block have the same sign, and the range of the precision; they are not described one by one here. The following describes the conventional calculation that must be used when none of the fast paths applies.

General process:

1. Check whether value1.negative equals value2.negative. If not, the addition is changed to a subtraction; the logic of subtraction is not expanded here. Otherwise it remains an addition.

2. If the signs are the same, addition is used. Then check whether scale1 and scale2 are equal. If they are not, the values must be adjusted to the same scale. During the adjustment, precision may have to be given up to avoid overflow.

3. After adding the two values, check whether the result exceeds u128. If so, check whether the result exceeds the maximum value of number. If it does, report an overflow error. Otherwise, adjust the scale, giving up precision so that the data does not overflow.
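The per-value checks of the general process can be sketched as below, to contrast with the fast paths. Several details are assumptions made for illustration: the function name `general_add`, the tuple layout `(val, scale, negative)`, the limit `MIN_SCALE`, the simple magnitude subtraction used when the signs differ (the text does not expand this case), and truncation as the way precision is given up. Python integers are unbounded, so intermediate overflow during scale alignment is not modeled.

```python
from typing import Tuple

U128_MAX = 2**128 - 1
MIN_SCALE = -128  # hypothetical lower bound on scale

Num = Tuple[int, int, bool]  # (val, scale, negative)


def general_add(a: Num, b: Num) -> Num:
    val1, scale1, neg1 = a
    val2, scale2, neg2 = b
    # Bring both operands to the larger scale.
    scale = max(scale1, scale2)
    val1 *= 10 ** (scale - scale1)
    val2 *= 10 ** (scale - scale2)
    if neg1 != neg2:
        # Different signs: subtract the smaller magnitude from the larger.
        if val1 >= val2:
            val, neg = val1 - val2, neg1
        else:
            val, neg = val2 - val1, neg2
    else:
        val, neg = val1 + val2, neg1
    # On overflow, drop low-order digits (lose precision) while scale allows.
    while val > U128_MAX:
        if scale <= MIN_SCALE:
            raise OverflowError("result exceeds the maximum value of number")
        val //= 10
        scale -= 1
    return val, scale, neg and val != 0


same_sign = general_add((235, 2, False), (15, 1, False))   # 2.35 + 1.5
mixed_sign = general_add((111, 2, False), (331, 2, True))  # 1.11 + (-3.31)
overflowed = general_add((U128_MAX, 0, False), (1, 0, False))
```

Every branch in this function is evaluated for every pair of values, which is the overhead the block-level specializations avoid.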

3. The subtraction of two data block numbers has the following specializations based on the statistical information of the data blocks.

Specialization 1: scale1 and scale2 are equal, the estimated precision of the result {MAX(Pmax1, Pmax2)+1} does not exceed 19 digits, data block 1 has a uniform sign bit, data block 2 has a uniform sign bit, negative1 = negative2, and neither data block 1 nor data block 2 has a null value.

In this case, the subtraction of the val fields of the two data blocks only needs to be converted into u64 subtraction. An integer of up to 19 decimal digits (at most 19 nines) can be represented by u64 (a 64-bit unsigned integer), and the magnitude of the difference of two values with the same sign cannot exceed the larger operand, so the result cannot exceed the maximum value of u64. The scale of the result is the same as the scale of the original data blocks, and the maximum precision of the result is {MAX(Pmax1, Pmax2)+1}; this is an estimate and does not need to be exact. The sign bit of the result is determined by the relative size of the two numbers.

Traverse data block 1, data block 2, data block 3;

For example:

Data block 1 has values (4.22, 1.11, 33.33), the maximum precision of data block 1 is 4, and the scale is 2;

Data block 2 has values (2.01, 3.31, 6.11), the maximum precision of data block 2 is 3, and the scale is 2;

Data block 3 is used to store the results.

The scales of data blocks 1 and 2 are both 2, and the estimated maximum precision of the result is 5 (computing the exact precision is expensive, so an estimate is used), which does not exceed 19. The scale of the result is 2, and the signs of the inputs are all positive.

Traverse data block 1, data block 2 and data block 3.

First iteration:

value1: number{val(422), scale(2), negative(false)} and value2: number{val(201), scale(2), negative(false)};
value1.val is greater than value2.val;
value3.val = 422 - 201 (u64 subtraction is used here);
value3.negative = false;
value3.scale = 2;

The result is number{val(221), scale(2), negative(false)}, which is 2.21.

Second iteration:

value1: number{val(111), scale(2), negative(false)} and value2: number{val(331), scale(2), negative(false)};
value1.val is smaller than value2.val;
value3.val = 331 - 111;
value3.negative = true;
value3.scale = 2;

The result is number{val(220), scale(2), negative(true)}, which is -2.20.

Third iteration:

value1: number{val(3333), scale(2), negative(false)} and value2: number{val(611), scale(2), negative(false)};
value1.val is greater than value2.val;
value3.val = 3333 - 611;
value3.negative = false;
value3.scale = 2;

The result is number{val(2722), scale(2), negative(false)}, which is 27.22.

The data of data block 3 is (2.21, -2.20, 27.22), its maximum precision is 5 (estimated value), and the scale is 2.
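The subtraction fast path can be sketched as a magnitude comparison followed by an unsigned subtraction. The function name `sub_block_same_scale_sign` is hypothetical, and normalizing a zero result to a non-negative sign is an assumption of this sketch.

```python
from typing import List, Tuple


def sub_block_same_scale_sign(
    vals1: List[int], vals2: List[int], negative: bool
) -> List[Tuple[int, bool]]:
    """Compute block1 - block2 for blocks sharing one scale and the sign `negative`.

    Returns (val, negative) pairs; the scale is unchanged.
    """
    out = []
    for a, b in zip(vals1, vals2):
        if a >= b:
            diff, neg = a - b, negative
        else:
            # Swap the operands so the unsigned subtraction cannot underflow,
            # and flip the sign of the result.
            diff, neg = b - a, not negative
        out.append((diff, neg and diff != 0))
    return out


# Data block 1 = (4.22, 1.11, 33.33), data block 2 = (2.01, 3.31, 6.11).
block3 = sub_block_same_scale_sign([422, 111, 3333], [201, 331, 611], negative=False)
```

The second pair yields val 220 with a negative sign, that is, -2.20 at scale 2.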

4. The multiplication of two data block numbers has the following specializations according to the statistical information of the data blocks.

Specialization 1: each of the two data blocks has a uniform scale within the block, the scale of the result does not exceed the scale limits (scale generally has a bounded maximum and minimum), the estimated precision of the result (Pmax1+Pmax2) does not exceed 19 digits, and there are no null values in data block 1 or data block 2.

In this case, the multiplication of the val fields of the two data blocks only needs to be converted into u64 multiplication, because an integer of up to 19 decimal digits (at most 19 nines) can be represented by u64 (a 64-bit unsigned integer). The scale of the result is scale1+scale2, and the maximum precision of the result is (Pmax1+Pmax2); this is an estimate and does not need to be exact.

For example:

Data block 1 has values (4.02, 1.11, 3.33), the maximum precision of data block 1 is 3, and the scale is 2;

Data block 2 has values (2.1, 3.1, 6.2), the maximum precision of data block 2 is 2, and the scale is 1;

Data block 3 is used to store the results.

The scale of data block 1 is 2 and the scale of data block 2 is 1. The estimated maximum precision of the result is 5 (computing the exact precision is expensive, so an estimate is used), which does not exceed 19. The scale of the result is 3, and the signs are all positive.

Traverse data block 1, data block 2 and data block 3.

First iteration:

value1: number{val(402), scale(2), negative(false)} and value2: number{val(21), scale(1), negative(false)};
value3.val = 402*21 (u64 multiplication is used here);
value3.negative = false;
value3.scale = 2 + 1;

The result is number{val(8442), scale(3), negative(false)}, which is 8.442.

Second iteration:

value1: number{val(111), scale(2), negative(false)} and value2: number{val(31), scale(1), negative(false)};
value3.val = 111*31;
value3.negative = false;
value3.scale = 2 + 1;

The result is number{val(3441), scale(3), negative(false)}, which is 3.441.

Third iteration:

value1: number{val(333), scale(2), negative(false)} and value2: number{val(62), scale(1), negative(false)};
value3.val = 333*62;
value3.negative = false;
value3.scale = 2 + 1;

The result is number{val(20646), scale(3), negative(false)}, which is 20.646.

The data of data block 3 is (8.442, 3.441, 20.646), its maximum precision is 5 (estimated value), and the scale is 3.
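The multiplication example can be reproduced with the sketch below. The function name `mul_block` is hypothetical, the u64 range is modeled with an assertion because Python integers are unbounded, and sign handling is omitted because all values in the example are positive.

```python
from typing import List, Tuple

U64_MAX = 2**64 - 1


def mul_block(
    vals1: List[int], scale1: int, vals2: List[int], scale2: int
) -> Tuple[List[int], int]:
    """Multiply two blocks, each with a uniform scale and no NULL."""
    out = []
    for a, b in zip(vals1, vals2):
        product = a * b
        # Guaranteed by the hint Pmax1 + Pmax2 <= 19; checked for illustration.
        assert product <= U64_MAX
        out.append(product)
    # The result scale is the sum of the block scales, computed once.
    return out, scale1 + scale2


# Data block 1 = (4.02, 1.11, 3.33) at scale 2, data block 2 = (2.1, 3.1, 6.2) at scale 1.
block3, scale3 = mul_block([402, 111, 333], 2, [21, 31, 62], 1)
```

Unlike addition, the two blocks do not need equal scales, only a uniform scale within each block.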

Advantages of the current technical solution: 1. The optimal algorithm can be selected dynamically to process each data block. 2. The optimization is transparent to users and does not require changing the existing computing framework; only the relevant operators need to be transformed step by step.

As for the device embodiment, since it is basically similar to the method embodiment, the description is relatively simple, and the relevant parts can be referred to the partial description of the method embodiment.

Referring to FIG. 2, a structural block diagram of a data processing device provided by an embodiment of the present application is shown;

A data processing device, the device is applied to calculation of column-type stored data, the device comprising:

A first acquisition module 210 is used to acquire a target column data block and corresponding prompt information in the target column data block;

A matching module 220, configured to match, in a preset algorithm set and in order of preference, a target preferred algorithm that satisfies the prompt information;

The processing module 230 is used to optimize the operator according to the target optimization algorithm, and process the target data in the target column data block through the optimized operator.

In one embodiment of the present invention, before the first acquisition module 210, the device further includes:

A first generating module, used for acquiring initial data, and generating a set of column data blocks to be processed according to the initial data;

In one embodiment of the present invention, the first generating module includes:

A first acquisition submodule, used to acquire a data type corresponding to the initial data;

A first determination submodule, configured to determine a preprocessing function corresponding to the data type according to the data type;

A first generating submodule, used for generating a plurality of initial column data blocks according to the initial data and the preprocessing function;

The second generating submodule is used to generate the to-be-processed column data block set according to a plurality of the initial column data blocks.

In one embodiment of the present invention, the first determining module includes:

A second determination submodule is used to determine a target column in the set of column data blocks to be processed according to a preset query condition;

The third determining submodule is used to determine the target column data block in the set of column data blocks to be processed according to the preset query condition and the target column.

In one embodiment of the present invention, the matching module 220 includes:

A third generation submodule is used to sort all algorithms in the preset algorithm set according to the degree of preference to generate a preferred algorithm set with a preferred ranking sequence number, wherein the preferred algorithm set includes a first algorithm at the first position, a standard algorithm at the last position, and a plurality of intermediate algorithms with preferred ranking sequences between the first algorithm and the standard algorithm;

The fourth determination submodule is used to determine the target preferred algorithm among the first algorithm, the standard algorithm and the plurality of intermediate algorithms according to the prompt information.

In one embodiment of the present invention, the fourth determining submodule includes:

A generating unit, configured to generate a first condition corresponding to the first algorithm and a plurality of sub-conditions corresponding to the plurality of intermediate algorithms according to the first algorithm and the plurality of intermediate algorithms;

A matching unit, used for matching the prompt information with the first condition and the plurality of sub-conditions;

A first determining unit, configured to obtain, from among the plurality of sub-conditions, a second algorithm following the first algorithm and a second sub-condition of the second algorithm when the prompt information does not satisfy the first condition;

A second determining unit, configured to determine in sequence whether the prompt information has a target sub-condition that satisfies the prompt information among several sub-conditions when the prompt information does not satisfy the second sub-condition;

The third determining unit is used to determine the standard algorithm as the target preferred algorithm when there is no target algorithm that satisfies the prompt information in the first condition and the plurality of sub-conditions.

Referring to FIG. 3, a computer device for the data processing method of the present invention is shown, which may specifically include the following:

The computer device 12 is in the form of a general-purpose computing device, and the components of the computer device 12 may include but are not limited to: one or more processors or processing units 16, a system memory 28, and a bus 18 connecting different system components (including the system memory 28 and the processing unit 16).

The bus 18 represents one or more of several types of bus structures, including a memory bus or memory controller, a peripheral bus, an accelerated graphics port, and a processor or local bus using any of a variety of bus architectures. By way of example, these architectures include, but are not limited to, an Industry Standard Architecture (ISA) bus, a Micro Channel Architecture (MCA) bus, an Enhanced ISA bus, a Video Electronics Standards Association (VESA) local bus, and a Peripheral Component Interconnect (PCI) bus.

The computer device 12 typically includes a variety of computer system readable media. These media can be any available media that can be accessed by the computer device 12, including volatile and non-volatile media, removable and non-removable media.

The system memory 28 may include computer system readable media in the form of volatile memory, such as random access memory (RAM) 30 and/or cache memory 32. The computer device 12 may further include other removable/non-removable, volatile/non-volatile computer system storage media. By way of example only, the storage system 34 may be used to read and write non-removable, non-volatile magnetic media (commonly referred to as "hard drives"). Although not shown in Figure 3, a disk drive for reading and writing removable non-volatile disks (such as "floppy disks"), and an optical disk drive for reading and writing removable non-volatile optical disks (such as CD-ROMs, DVD-ROMs or other optical media) may be provided. In these cases, each drive may be connected to the bus 18 via one or more data medium interfaces. The memory may include at least one program product having a set (e.g., at least one) of program modules 42 that are configured to perform the functions of various embodiments of the present invention.

A program/utility 40 having a set (at least one) of program modules 42 may be stored in, for example, the system memory 28. Such program modules 42 include, but are not limited to, an operating system, one or more application programs, other program modules 42, and program data, each of which, or some combination thereof, may include an implementation of a network environment. The program modules 42 generally perform the functions and/or methods of the embodiments described herein.

The computer device 12 may also communicate with one or more external devices 14 (e.g., keyboards, pointing devices, displays 24, cameras, etc.), with one or more devices that enable a user to interact with the computer device 12, and/or with any device (e.g., network cards, modems, etc.) that enables the computer device 12 to communicate with one or more other computing devices. Such communication may be performed via an input/output (I/O) interface 22. Furthermore, the computer device 12 may also communicate with one or more networks (e.g., local area networks (LANs), wide area networks (WANs), and/or public networks such as the Internet) via a network adapter 20. As shown, the network adapter 20 communicates with other modules of the computer device 12 via the bus 18. It should be understood that, although not shown in FIG. 3, other hardware and/or software modules may be used in conjunction with the computer device 12, including, but not limited to, microcode, device drivers, redundant processing units 16, external disk drive arrays, RAID systems, tape drives, and data backup storage systems 34.

The processing unit 16 executes various functional applications and data processing by running programs stored in the system memory 28, such as implementing the data processing method provided by the embodiment of the present invention.

That is, when the processing unit 16 executes the program, the following is implemented: obtaining a target column data block and the corresponding prompt information in the target column data block; matching, according to the prompt information and by degree of preference, a target preferred algorithm in a preset algorithm set whose condition the prompt information satisfies; optimizing the operator according to the target preferred algorithm; and processing the target data in the target column data block by means of the optimized operator.
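The matching and processing steps above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of the present application: the hint keys (`no_nulls`, `all_equal`), the `ColumnBlock` and `Candidate` types, and the three summation routines are hypothetical names chosen for the example. The candidate list is ordered from most preferred to least preferred, with the standard algorithm in the last position as the fallback.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

Values = List[Optional[int]]


@dataclass
class ColumnBlock:
    """A column data block carrying its own prompt information (hints)."""
    values: Values
    hints: Dict[str, bool] = field(default_factory=dict)  # hypothetical hint keys


@dataclass
class Candidate:
    """One algorithm in the preset set, with the condition its hints must meet."""
    name: str
    condition: Callable[[Dict[str, bool]], bool]
    run: Callable[[Values], int]


def sum_constant(values: Values) -> int:
    # Valid only when every value is identical and non-null.
    return values[0] * len(values) if values else 0


def sum_no_nulls(values: Values) -> int:
    # Valid only when the block contains no nulls: no per-value check needed.
    return sum(values)


def sum_standard(values: Values) -> int:
    # Always valid: checks every value for null.
    return sum(v for v in values if v is not None)


# Ordered by degree of preference; the last entry is the standard algorithm.
SUM_ALGORITHMS = [
    Candidate("constant",
              lambda h: h.get("all_equal", False) and h.get("no_nulls", False),
              sum_constant),
    Candidate("no_nulls", lambda h: h.get("no_nulls", False), sum_no_nulls),
    Candidate("standard", lambda h: True, sum_standard),
]


def select_algorithm(hints: Dict[str, bool],
                     candidates: List[Candidate]) -> Candidate:
    """Return the first candidate whose condition the hints satisfy."""
    for candidate in candidates[:-1]:
        if candidate.condition(hints):
            return candidate
    return candidates[-1]  # fall back to the standard algorithm


def sum_block(block: ColumnBlock) -> Tuple[str, int]:
    """Pick the preferred algorithm for this block, then process its data."""
    algorithm = select_algorithm(block.hints, SUM_ALGORITHMS)
    return algorithm.name, algorithm.run(block.values)
```

In this sketch, a block whose hints are empty or unrecognized is still processed correctly, because the standard algorithm makes no assumptions about the data; the hints only allow a cheaper routine to be substituted.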

An embodiment of the present invention further provides a computer-readable storage medium on which a computer program is stored; when the program is executed by a processor, the data processing method provided in the embodiments of the present application is implemented.

That is, when the program is executed by the processor, the following is implemented: obtaining a target column data block and the corresponding prompt information in the target column data block; matching, according to the prompt information and by degree of preference, a target preferred algorithm in a preset algorithm set whose condition the prompt information satisfies; optimizing the operator according to the target preferred algorithm; and processing the target data in the target column data block by means of the optimized operator.
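Obtaining a block together with its prompt information presupposes that the prompt information was recorded when the block was stored. One way this might look is sketched below, purely as an assumption for illustration: the function names `compute_hints` and `build_blocks`, the fixed `block_size`, and the three hint keys are hypothetical and are not taken from the present application.

```python
from typing import Dict, List, Optional

Values = List[Optional[int]]


def compute_hints(values: Values) -> Dict[str, bool]:
    """Derive data-specific hints for one block (hypothetical hint keys)."""
    no_nulls = all(v is not None for v in values)
    return {
        "no_nulls": no_nulls,
        # The remaining hints are only claimed for blocks without nulls.
        "all_equal": no_nulls and len(set(values)) <= 1,
        "sorted": no_nulls and all(a <= b for a, b in zip(values, values[1:])),
    }


def build_blocks(column: Values, block_size: int) -> List[dict]:
    """Split one column into blocks and save each block's hints alongside it."""
    blocks = []
    for start in range(0, len(column), block_size):
        chunk = column[start:start + block_size]
        blocks.append({"values": chunk, "hints": compute_hints(chunk)})
    return blocks
```

Because the hints are computed once at write time and stored with the block, a compute engine reading the block later can consult them without rescanning the data.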

Any combination of one or more computer-readable media may be used. A computer-readable medium may be a computer-readable signal medium or a computer-readable storage medium. A computer-readable storage medium may be, for example, but not limited to, an electrical, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any combination thereof. More specific examples of computer-readable storage media (a non-exhaustive list) include: an electrical connection with one or more wires, a portable computer disk, a hard disk, a random access memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disk read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination thereof. In this document, a computer-readable storage medium may be any tangible medium containing or storing a program that may be used by or in conjunction with an instruction execution system, apparatus, or device.

Computer-readable signal media may include a data signal propagated in baseband or as part of a carrier wave, which carries a computer-readable program code. Such propagated data signals may take a variety of forms, including, but not limited to, electromagnetic signals, optical signals, or any suitable combination of the above. Computer-readable signal media may also be any computer-readable medium other than a computer-readable storage medium, which may send, propagate, or transmit a program for use by or in conjunction with an instruction execution system, apparatus, or device.

The computer program code for performing the operations of the present invention can be written in one or more programming languages or a combination thereof, including object-oriented programming languages such as Java, Smalltalk, and C++, and conventional procedural programming languages such as the "C" language or similar programming languages. The program code can be executed entirely on the user's computer, partially on the user's computer, as an independent software package, partially on the user's computer and partially on a remote computer, or entirely on a remote computer or server. In the case of a remote computer, the remote computer can be connected to the user's computer through any type of network, including a local area network (LAN) or a wide area network (WAN), or can be connected to an external computer (for example, through the Internet using an Internet service provider).

The various embodiments in this specification are described in a progressive manner, and each embodiment focuses on its differences from the other embodiments. For the parts that are the same or similar, the embodiments can be referred to one another.

Although the preferred embodiments of the present application have been described, those skilled in the art may make additional changes and modifications to these embodiments once they have learned the basic creative concept. Therefore, the appended claims are intended to be interpreted as including the preferred embodiments and all changes and modifications that fall within the scope of the embodiments of the present application.

Finally, it should be noted that, in this document, relational terms such as "first" and "second" are only used to distinguish one entity or operation from another entity or operation, and do not necessarily require or imply any such actual relationship or order between these entities or operations. Moreover, the terms "include", "comprise" or any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or terminal device including a series of elements includes not only those elements, but also other elements not explicitly listed, or elements inherent to such process, method, article or terminal device. In the absence of further restrictions, an element defined by the phrase "comprising a ..." does not exclude the existence of other identical elements in the process, method, article or terminal device including that element.

The above is a detailed introduction to the data processing method, device, equipment and storage medium provided by the present application. Specific examples are used herein to illustrate the principles and implementations of the present application, and the description of the above embodiments is only intended to help in understanding the method of the present application and its core idea. At the same time, a person skilled in the art may, following the idea of the present application, make changes to the specific implementations and the scope of application. In summary, the content of this specification should not be understood as a limitation on the present application.

Claims

A data processing method, characterized in that the method is applied to calculation on column-stored data, and the method comprises:

obtaining a target column data block and corresponding prompt information in the target column data block;

matching, according to the prompt information and by degree of preference, a target preferred algorithm in a preset algorithm set whose condition the prompt information satisfies;

optimizing an operator according to the target preferred algorithm, and processing target data in the target column data block by means of the optimized operator.

The method according to claim 1, characterized in that, before the step of obtaining the target column data block and the corresponding prompt information in the target column data block, the method further comprises:

acquiring initial data, and generating a set of column data blocks to be processed according to the initial data;

determining the target column data block in the set of column data blocks to be processed;

acquiring the prompt information in the target column data block.

The method according to claim 2, characterized in that the step of generating a set of column data blocks to be processed according to the initial data comprises:

acquiring the data type corresponding to the initial data;

determining a preprocessing function corresponding to the data type according to the data type;

generating a plurality of initial column data blocks according to the initial data and the preprocessing function;

generating the set of column data blocks to be processed according to the plurality of initial column data blocks.

The method according to claim 2, characterized in that the step of determining the target column data block in the set of column data blocks to be processed comprises:

determining a target column in the set of column data blocks to be processed according to a preset query condition;

determining the target column data block in the set of column data blocks to be processed according to the preset query condition and the target column.

The method according to claim 1, characterized in that the step of matching, according to the prompt information and by degree of preference, the target preferred algorithm in the preset algorithm set comprises:

sorting all algorithms in the preset algorithm set by degree of preference to generate a preferred algorithm set with preference ranking numbers, wherein the preferred algorithm set includes a first algorithm in the first position, a standard algorithm in the last position, and a plurality of intermediate algorithms ranked between the first algorithm and the standard algorithm;

determining the target preferred algorithm among the first algorithm, the standard algorithm and the plurality of intermediate algorithms according to the prompt information.

The method according to claim 5, characterized in that the step of determining the target preferred algorithm among the first algorithm, the standard algorithm and the plurality of intermediate algorithms according to the prompt information comprises:

generating, according to the first algorithm and the plurality of intermediate algorithms, a first condition corresponding to the first algorithm and a plurality of sub-conditions corresponding to the plurality of intermediate algorithms;

matching the prompt information against the first condition and the plurality of sub-conditions;

when the prompt information does not satisfy the first condition, obtaining a second algorithm ranked after the first algorithm and, from the plurality of sub-conditions, a second sub-condition corresponding to the second algorithm;

when the prompt information does not satisfy the second sub-condition, determining in sequence whether the plurality of sub-conditions contains a target sub-condition that the prompt information satisfies;

when the prompt information satisfies neither the first condition nor any of the plurality of sub-conditions, determining the standard algorithm as the target preferred algorithm.

A data processing device, characterized in that the device is applied to calculation on column-stored data, and the device comprises:

a first acquisition module, configured to acquire a target column data block and corresponding prompt information in the target column data block;

a matching module, configured to match, according to the prompt information and by degree of preference, a target preferred algorithm in a preset algorithm set whose condition the prompt information satisfies;

a processing module, configured to optimize an operator according to the target preferred algorithm, and to process target data in the target column data block by means of the optimized operator.

The device according to claim 7, characterized in that the device further comprises, preceding the first acquisition module:

a first generating module, configured to acquire initial data and generate a set of column data blocks to be processed according to the initial data;

a first determining module, configured to determine the target column data block in the set of column data blocks to be processed;

a second acquisition module, configured to acquire the prompt information in the target column data block.

A computer device, characterized in that it comprises a processor, a memory, and a computer program stored in the memory and capable of running on the processor, wherein when the computer program is executed by the processor, the method according to any one of claims 1 to 6 is implemented.

A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when the computer program is executed by a processor, the method according to any one of claims 1 to 6 is implemented.