CN116243869A

CN116243869A - Data processing method and device and electronic equipment

Info

Publication number: CN116243869A
Application number: CN202310247019.5A
Authority: CN
Inventors: 赵卫; 陆刚; 汪盼; 顾超; 张朝辉
Original assignee: Avatr Technology Chongqing Co Ltd
Current assignee: Avatr Technology Chongqing Co Ltd
Priority date: 2023-03-14
Filing date: 2023-03-14
Publication date: 2023-06-09

Abstract

The embodiment of the invention relates to the technical field of data processing, and discloses a data processing method, a device and electronic equipment, wherein the method comprises the following steps: acquiring data to be processed, wherein the data to be processed comprises data of at least one data row to be written; classifying the data to be processed to obtain classification information of the data to be processed; determining, according to the classification information of the data to be processed, a target row group corresponding to the classification information from a plurality of row groups included in a column storage file, wherein different row groups among the plurality of row groups correspond to different classification information; and writing the data to be processed in the target row group in the column storage file. By applying the technical scheme of the invention, the storage space occupied by the data can be reduced, and the data compression ratio can be improved.

Description

Data processing method and device and electronic equipment

Technical Field

The embodiment of the invention relates to the technical field of data processing, in particular to a data processing method, a data processing device and electronic equipment.

Background

With the development of big data technology and the like, more and more data are accessed into a big data storage system, and data support is provided for analyzing the business process and optimizing the service quality.

Currently, data may be stored in a conservative manner, in one data partition for each time period (e.g., each day). However, since much of the information in the data of each time period may be unchanged, the data partitions may contain a large amount of repeated data, which occupies a large amount of storage space and wastes storage resources.

Disclosure of Invention

In view of the above problems, embodiments of the present invention provide a data processing method, apparatus, and electronic device, which are used to solve the problem in the prior art that data occupies more storage space.

According to an aspect of an embodiment of the present invention, there is provided a data processing method, the method including: acquiring data to be processed, wherein the data to be processed comprises data of at least one data row to be written; classifying the data to be processed to obtain classification information of the data to be processed; determining, according to the classification information of the data to be processed, a target row group corresponding to the classification information from a plurality of row groups included in a column storage file, wherein different row groups among the plurality of row groups correspond to different classification information; and writing the data to be processed in the target row group in the column storage file.

In an optional manner, before the classifying the data to be processed to obtain the classification information of the data to be processed, the method further includes: determining coding information of the data of each data row to be written in the data to be processed; determining, in an index file and according to the data generation time of the data row to be written, coding information of the data of at least one stored data row whose data generation time is the same as that of the data row to be written, wherein the index file comprises coding information of the data of a plurality of stored data rows in the column storage file; and if the coding information of the data of the data row to be written is different from the coding information of the data of the stored data row, executing the step of classifying the data to be processed to obtain the classification information of the data to be processed.

In an optional manner, the method further comprises: discarding the data of the data row to be written if the coding information of the data of the data row to be written is the same as the coding information of the data of the stored data row.

In an alternative manner, the index file further includes a plurality of row group identifiers in the column storage file and identifiers of respective stored data rows, and the method further includes: acquiring an instruction to be queried, wherein the instruction to be queried comprises data generation time of data to be queried; determining a row group identifier and a plurality of stored data row identifiers corresponding to the data generation time of the data to be queried in the index file according to the data generation time of the data to be queried; and determining the data to be queried in the column storage file according to the row group identification and the identifications of a plurality of stored data rows.

In an optional manner, before classifying the data to be processed to obtain classification information of the data to be processed, the method further includes: comparing the number of data rows to be written included in the data to be processed with a preset number; and if the number of data rows to be written included in the data to be processed is greater than or equal to the preset number, executing the step of classifying the data to be processed to obtain classification information of the data to be processed.

In an optional manner, the classifying the data to be processed to obtain classification information of the data to be processed includes: classifying the data of the preset number of data rows to be written to obtain classification information of the data of the preset number of data rows to be written, wherein the preset number of data rows to be written are determined according to the data generation time of each data row to be written in the data to be processed.

In an optional manner, the method further comprises: acquiring new data to be processed; and if the sum of the number of unclassified data rows to be written in the data to be processed and the number of new data rows to be written in the new data to be processed is greater than or equal to the preset number, classifying the data of the unclassified data rows to be written and the data of the new data rows to be written to obtain new classification information.

In an alternative manner, the target row group includes a plurality of column groups, each column group in the plurality of column groups includes a data column, and writing the data to be processed in the target row group in the column storage file includes: storing data of corresponding data columns in the data to be processed in a target column group corresponding to the attribute information according to the attribute information of each data column in the data to be processed in the target row group in the column storage file; the plurality of column groups includes the target column group.

According to another aspect of an embodiment of the present invention, there is provided a data processing apparatus including: an acquisition module, configured to acquire data to be processed, wherein the data to be processed comprises data of at least one data row to be written; a classification module, configured to classify the data to be processed to obtain classification information of the data to be processed; a determining module, configured to determine, according to the classification information of the data to be processed, a target row group corresponding to the classification information from a plurality of row groups included in the column storage file, wherein different row groups among the plurality of row groups correspond to different classification information; and a writing module, configured to write the data to be processed in the target row group in the column storage file.

In an optional manner, before the classifying the data to be processed to obtain the classifying information of the data to be processed, the classifying module is further configured to determine the encoding information of the data of each data row to be written in the data to be processed, determine, according to the data generation time of the data row to be written, the encoding information of the data of at least one stored data row identical to the data generation time of the data row to be written in an index file, where the index file includes the encoding information of the data of a plurality of stored data rows in the column storage file, and if the encoding information of the data row to be written in is different from the encoding information of the data of the stored data row, execute the step of classifying the data to be processed to obtain the classifying information of the data to be processed.

In an optional manner, the classification module is further configured to discard the data of the data row to be written if the coding information of the data row to be written is the same as the coding information of the data of the stored data row.

In an optional manner, the index file further includes a plurality of row group identifiers and identifiers of stored data rows in the column storage file, the writing module is further configured to obtain an instruction to be queried, the instruction to be queried includes a data generation time of data to be queried, determine, in the index file, a row group identifier and identifiers of a plurality of stored data rows corresponding to the data generation time of the data to be queried according to the data generation time of the data to be queried, and determine, in the column storage file, the data to be queried according to the row group identifier and the identifiers of a plurality of stored data rows.

In an optional manner, before the classifying the data to be processed to obtain the classification information of the data to be processed, the classifying module is further configured to compare the number of data lines to be written included in the data to be processed with a preset number, and if the number of data lines to be written included in the data to be processed is greater than or equal to the preset number, execute the step of classifying the data to be processed to obtain the classification information of the data to be processed.

In an optional manner, the classification module is configured to classify the data of the preset number of data rows to be written to obtain classification information of the data of the preset number of data rows to be written; the data of the preset number of the data lines to be written is determined according to the data generation time of each data line to be written in the data to be processed.

In an optional manner, the classification module is further configured to obtain new data to be processed, and if the sum of the number of unclassified data lines to be written in the data to be processed and the number of new data lines to be written in the new data to be processed is greater than or equal to the preset number, classify the unclassified data of the data lines to be written and the new data of the data lines to be written, so as to obtain new classification information.

In an optional manner, the target row group includes a plurality of column groups, each column group in the plurality of column groups includes a data column, the writing module is configured to store, in the target row group in the column storage file, data of a corresponding data column in the data to be processed in a target column group corresponding to the attribute information according to attribute information of each data column in the data to be processed, and the plurality of column groups includes the target column group.

According to another aspect of an embodiment of the present invention, there is provided an electronic apparatus including: a processor; a memory for storing at least one executable instruction; the executable instructions cause the processor to perform the operations of the data processing method of any one of the preceding claims.

According to yet another aspect of embodiments of the present invention, there is provided a computer-readable storage medium having stored therein at least one executable instruction that, when executed on an electronic device, causes the electronic device to perform the operations of the data processing method as set forth in any one of the above.

According to the data processing method, the data processing device and the electronic equipment provided by the embodiment of the invention, the data to be processed can be acquired, the data to be processed is classified, the classification information of the data to be processed is obtained, then the target row group corresponding to the classification information is determined in the plurality of row groups included in the column storage file according to the classification information of the data to be processed, and the data to be processed is written in the target row group in the column storage file. By the method, the data to be processed can be written into the column storage file according to the columns, so that the data to be processed is stored in the target row group with similar data characteristics, the compression ratio of the data is improved, and the data storage space is saved; meanwhile, due to the format advantage of the column-type storage file, when the stored data is read, only the data of the corresponding column is required to be read, and each row of data is not required to be read, so that the data reading efficiency can be improved.

The foregoing description is only an overview of the technical solutions of the embodiments of the present invention. So that the technical means of the embodiments of the present invention can be understood more clearly and implemented according to the content of the specification, specific embodiments of the present invention are set forth below.

Drawings

The drawings are only for purposes of illustrating embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:

fig. 1 shows a flowchart of a data processing method provided in the present embodiment;

FIG. 2 shows a schematic classification diagram of a decision tree algorithm provided in this embodiment;

fig. 3 shows an example of data to be processed provided by the present embodiment;

FIG. 4 is a schematic diagram of a column storage file according to the present embodiment;

FIG. 5 is a flowchart showing another data processing method provided in the present embodiment;

FIG. 6 is a flowchart showing still another data processing method provided by the present embodiment;

FIG. 7 is a flowchart showing still another data processing method provided by the present embodiment;

FIG. 8 is a flowchart showing still another data processing method provided by the present embodiment;

FIG. 9 is a flowchart showing still another data processing method provided by the present embodiment;

FIG. 10 is a flowchart showing still another data processing method provided by the present embodiment;

FIG. 11 is a flowchart showing still another data processing method provided by the present embodiment;

FIG. 12 is a schematic view of an index file according to the present embodiment;

fig. 13 is a schematic diagram showing the structure of a data processing apparatus provided in the present embodiment;

fig. 14 shows a schematic structural diagram of an electronic device provided in the present embodiment.

Detailed Description

Exemplary embodiments of the present invention will be described in more detail below with reference to the accompanying drawings. While exemplary embodiments of the present invention are shown in the drawings, it should be understood that the present invention may be embodied in various forms and should not be limited to the embodiments set forth herein.

To facilitate analysis of data generated during a business process, a zipper table may be used to store historical data and record historical changes to the data. Specifically, the zipper table uses start date and end date fields to identify the period during which a record is valid. Thus, a zipper table can record all the changes of an item from its beginning up to its current state. However, when searching for data stored in the zipper table, attention must be paid to the start date and end date of each record, so the search logic is complicated and the data is inefficient to use.

In view of one or more of the foregoing problems, fig. 1 shows a flowchart of a data processing method provided by an embodiment of the present invention. The method may be executed by an electronic device, which may obtain data to be processed, analyze the data to be processed to determine a target row group, and write the data to be processed into the target row group of a column storage file, so as to save data storage space and improve data use efficiency. The electronic device may be a background server or a service cluster of a service provider. As shown in fig. 1, the method may include the following steps:

Step 110: obtaining data to be processed.

The data to be processed may comprise data of at least one data row to be written. A data row to be written refers to one row of data, that is, one record, in the data to be processed.

In the service process, the electronic device can monitor service changes and automatically generate data to be processed that includes one or more data rows to be written; it can also receive a storage request from a user or a device and acquire the data to be processed carried in the storage request; or it can, in response to the storage request, actively acquire the data to be processed sent by the user or device.

Step 120: classifying the data to be processed to obtain classification information of the data to be processed.

The classification information of the data to be processed may be used to represent the category to which the data to be processed belongs. Data to be processed of the same category has higher similarity; where the data comprises a plurality of data rows, the data rows of the same category are similar on the same attributes.

To determine the distribution of the data to be processed, the data to be processed may be classified to divide the data to be processed into one or more categories.

For example, a decision tree algorithm may be used to classify the data to be processed and determine the classification information of the data to be processed. A decision tree is a tree structure in which each internal node represents a test on an attribute, each branch represents a test outcome, and each leaf node represents a class. In this embodiment, the decision tree algorithm may be the Iterative Dichotomiser 3 (ID3) algorithm, the C4.5 algorithm, the Classification And Regression Tree (CART) algorithm, or the like, which is not particularly limited in this embodiment.

For example, fig. 2 shows a schematic classification diagram of a decision tree algorithm provided in this embodiment. As shown in fig. 2, when the decision tree algorithm is used to classify the data to be processed, one attribute feature of the data to be processed, such as attribute feature x, may be tested starting from the root node, and the data to be processed is distributed according to the test result to child nodes such as first-level child node 1, first-level child node 2, and so on, where each child node corresponds to one value of attribute feature x.

Then, another attribute feature of the data to be processed, such as attribute feature y, may be tested, and the data to be processed is distributed to child nodes such as second-level child node 1, second-level child node 2, and so on, where each child node corresponds to one value of attribute feature y. The attribute features of the data to be processed are tested and distributed recursively in this way until a leaf node is reached, and the data to be processed is assigned to the class of that leaf node.
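The traversal described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the tree contents, the attribute names (location, customer_level), the class labels and the "unclassified" fallback are hypothetical and are not taken from fig. 2. The walk is written as a loop, which is equivalent to the recursive descent described in the text.

```python
# Hypothetical decision tree: each internal node tests one attribute,
# each branch is keyed by an attribute value, each leaf carries a class label.
TREE = {
    "attribute": "location",
    "branches": {
        "Chongqing": {
            "attribute": "customer_level",
            "branches": {
                "gold": {"label": "class_0"},
                "silver": {"label": "class_1"},
            },
        },
        "Beijing": {"label": "class_2"},
    },
}


def classify(row, node, default="unclassified"):
    """Walk from the root to a leaf and return the leaf's class label."""
    while "label" not in node:
        value = row.get(node["attribute"])
        child = node["branches"].get(value)
        if child is None:
            # Attribute value has no branch in this (hypothetical) tree.
            return default
        node = child
    return node["label"]


row = {"location": "Chongqing", "customer_level": "silver"}
label = classify(row, TREE)
```

The returned label plays the role of the classification information used in the later steps.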

In an alternative way, the data to be processed may be classified by using a pre-trained decision tree algorithm, so as to obtain classification information of the data to be processed. The pre-trained decision tree algorithm can be generated by training the obtained historical data. The historical data may be data acquired at a historical time that is homologous to the data to be processed.

Taking the ID3 algorithm as an example, the acquired historical data can be divided into a training set and a test set. An attribute is selected according to information entropy: the training set is divided by each candidate attribute, and the attribute for which the sum of the information entropy of the resulting sub-training sets is reduced the most compared with the information entropy of the data set before division is taken as the current optimal attribute. After the optimal attribute is selected, the original training set is divided according to the values of that attribute to obtain sub-data sets, and the optimal attribute is then selected iteratively with each sub-data set treated as a complete data set, until all samples in a data set carry the same classification label, at which point the decision tree generation process ends.
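As a minimal sketch of the attribute selection step, the following Python code computes the entropy reduction (information gain) for each candidate attribute and picks the largest. It assumes the size-weighted sum of subset entropies used in the textbook form of ID3, which the text above does not spell out; the sample rows, attribute names and labels are hypothetical.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(rows, labels, attribute):
    """Entropy before the split minus the weighted entropy after it."""
    subsets = {}
    for row, label in zip(rows, labels):
        subsets.setdefault(row[attribute], []).append(label)
    remainder = sum(len(s) / len(labels) * entropy(s) for s in subsets.values())
    return entropy(labels) - remainder


def best_attribute(rows, labels, attributes):
    """Return the attribute whose split reduces entropy the most."""
    return max(attributes, key=lambda a: information_gain(rows, labels, a))


# Hypothetical training data: "location" separates the labels perfectly,
# "level" does not separate them at all.
rows = [
    {"location": "A", "level": "gold"},
    {"location": "A", "level": "silver"},
    {"location": "B", "level": "gold"},
    {"location": "B", "level": "silver"},
]
labels = ["x", "x", "y", "y"]
chosen = best_attribute(rows, labels, ["location", "level"])
```

A full ID3 implementation would apply best_attribute recursively to each sub-data set until every subset has a single label.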

In addition, in order to prevent the overfitting, pruning treatment can be carried out on the trained decision tree algorithm according to the maximum node number of the decision tree algorithm, so as to generate a pre-trained decision tree algorithm.

By the method for classifying the data to be processed, the data to be processed can be divided into one or more categories, and the distribution condition of the data to be processed is determined, so that the data to be processed can be stored later.

Step 130: determining, according to the classification information of the data to be processed, a target row group corresponding to the classification information from a plurality of row groups included in the column storage file.

Different row groups among the plurality of row groups correspond to different classification information. That is, the classification information corresponding to the same row group is the same, and the data to be processed corresponding to the same row group has higher similarity. The target row group is the row group used for storing the data to be processed.

A column storage file is a storage file that stores data in a column-oriented storage (column-store) manner. Fig. 3 shows an example of data to be processed provided in this embodiment. As shown in fig. 3, the data to be processed includes 6 rows and 5 columns, where the columns are "serial number", "customer name", "age", "location" and "customer level", respectively. Column storage organizes the data of each column together in sequence; for example, the data of each column may be laid out from left to right, such as 21, 32, 31, 41, 35, 50 for the "age" column.

In the column storage, since the data of each column is aggregated and stored, when data of a few attributes such as 'age', 'customer level' is queried, only the data of the corresponding columns, namely 'age' column and 'customer level' column, need to be read, so that the data amount of the read data can be greatly reduced, and since the data types corresponding to each attribute are the same, the column storage mode can design a data compression algorithm in a targeted manner.
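The difference between row-oriented and column-oriented layout can be sketched in Python as follows. The six "age" values are the ones quoted in the text; the customer-level values and the field names are hypothetical placeholders, since the remaining contents of fig. 3 are not reproduced here.

```python
# Row-oriented records. Ages come from the text; other values are hypothetical.
rows = [
    {"serial_number": 1, "age": 21, "customer_level": "A"},
    {"serial_number": 2, "age": 32, "customer_level": "B"},
    {"serial_number": 3, "age": 31, "customer_level": "A"},
    {"serial_number": 4, "age": 41, "customer_level": "C"},
    {"serial_number": 5, "age": 35, "customer_level": "B"},
    {"serial_number": 6, "age": 50, "customer_level": "A"},
]


def to_columns(records):
    """Regroup row records so that the values of each column sit together."""
    columns = {}
    for record in records:
        for name, value in record.items():
            columns.setdefault(name, []).append(value)
    return columns


columns = to_columns(rows)

# A query on a few attributes only touches the columns it needs.
projection = {name: columns[name] for name in ("age", "customer_level")}
```

Because each list in columns holds values of a single type, a type-specific compression scheme can be applied per column.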

In this embodiment, the column storage file may be a Parquet file. Parquet is a newer columnar storage file format in the Hadoop ecosystem, and it supports a nested data model.

Fig. 4 shows a schematic diagram of a column storage file provided in this embodiment. As shown in fig. 4, the head and the tail of the Parquet file each hold a 4-byte Magic Number with the content "PAR1", which identifies the file as a Parquet file. The data block portion includes a plurality of row groups, such as row group 0 and row group 1 shown in fig. 4, and each row group is used for storing at least one row of data. For example, 1000 rows of data may be split by size into two row groups of 500 rows each. Each row group may include a plurality of column blocks; for example, row group 0 may include column blocks such as Column a, and row group 1 may include column blocks such as Column b. A column block gathers the data of one column, and within each column block the data is stored with the page as the minimum unit.

In each column block of the Parquet file, Repetition Levels, Definition Levels and Values (data values) are all required in order to store complete information. The Values are the values of the data stored in the column block.

Specifically, Repetition Levels are mainly used to express the length of an array-type field. They express the length indirectly by recording changes in the nesting level: if the nesting level is unchanged, the current array is still continuing; if the nesting level changes, the previous array has ended. If the nesting level increases from 0 to 1 at a certain value, the Repetition level of that value is 0. If the nesting level is unchanged at a value, the Repetition level of that value is its nesting level. For example, for [["a", "b"], ["c", "d", "e"]], the corresponding Repetition levels are encoded as the values shown in Table 1 below:
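The head and tail markers described above can be checked with a few lines of Python. This is a minimal sketch: it only tests for the 4-byte "PAR1" marker (the marker used by the Parquet format) at both ends of a byte string and does not parse the footer or validate anything else; the function name is hypothetical.

```python
MAGIC = b"PAR1"  # 4-byte marker at the head and tail of the file


def has_parquet_magic(data: bytes) -> bool:
    """Return True if the bytes start and end with the 4-byte marker."""
    if len(data) < 2 * len(MAGIC):
        # Too short to hold both a head marker and a tail marker.
        return False
    return data[: len(MAGIC)] == MAGIC and data[-len(MAGIC):] == MAGIC


# Synthetic byte strings standing in for file contents.
looks_valid = has_parquet_magic(b"PAR1" + b"\x00" * 16 + b"PAR1")
looks_invalid = has_parquet_magic(b"XXXX" + b"\x00" * 16 + b"PAR1")
```

For a file on disk, the same check could be applied to the first and last four bytes read from the file.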

TABLE 1

Value               "a"   "b"   "c"   "d"   "e"
Repetition level     0     2     1     2     2

Because the nesting level of this array is 2 and "a" is the boundary from level 0 to level 2, its Repetition level is 0; "c" is the boundary from level 1 to level 2, so its Repetition level is 1; the nesting levels of the other letters are unchanged, so their Repetition level is 2.
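The rule explained above can be reproduced with a short Python sketch for the two-level case. It is only an illustration of the rule as stated in the text, not the encoder of any particular library; the function name is hypothetical, and empty inner lists are simply skipped because representing them would also require Definition Levels.

```python
def repetition_levels(nested):
    """Flatten a list of lists and assign a repetition level to each value.

    0: the first value of the whole array
    1: the first value of a later inner list (a new inner list starts)
    2: a value that continues the current inner list
    """
    values, levels = [], []
    for inner in nested:
        for position, value in enumerate(inner):
            if not values:
                level = 0
            elif position == 0:
                level = 1
            else:
                level = 2
            values.append(value)
            levels.append(level)
    return values, levels


values, levels = repetition_levels([["a", "b"], ["c", "d", "e"]])
```

Here levels pairs one-to-one with values, so the boundaries of the inner lists can be recovered from the flat value sequence.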

Definition Levels express the defined depth and are mainly used to indicate the positions of null values. Since null is not explicitly stored in the Parquet file, whether a certain value is null is determined from its Definition level.

After the classification information of the data to be processed is obtained, the row group allocation based on the data characteristics of the data to be processed can be realized by searching the target row group matched with the classification information of the data to be processed in the row groups included in the column storage file, so that the data to be processed can be stored conveniently according to the characteristics of the stored data of each row group.

In an alternative manner, when determining the target row group, the row group having the least amount of data may also be determined as the target row group according to the amount of data of each row group in the column storage file. In this way, the amount of data stored by each row group can be balanced.

Step 140: writing the data to be processed in the target row group in the column storage file.

After the target row group corresponding to the classification information of the data to be processed is determined, the data to be processed can be written into the target row group in the column storage file, so that the data of the same class is stored in one row group as much as possible, the compression effect of the data can be improved, the storage space can be saved, the storage cost can be reduced, and the time consumption for reading the stored data can be reduced.

For example, when the target row group is determined to be row group 0 as shown in fig. 4, the data to be processed may be written in row group 0 in the column storage file such that each column of data in the data to be processed is stored in a corresponding column block in row group 0.

According to the data processing method provided by the embodiment, data to be processed can be obtained, the data to be processed is classified to obtain classification information of the data to be processed, then a target row group corresponding to the classification information is determined in a plurality of row groups included in a column storage file according to the classification information of the data to be processed, and the data to be processed is written in the target row group in the column storage file.

By the method, the data to be processed can be written into the column storage file according to the columns, so that the data to be processed is stored in the target row group with similar data characteristics, the compression ratio of the data is improved, and the data storage space is saved; meanwhile, due to the format advantage of the column-type storage file, when the stored data is read, only the data of the corresponding column is required to be read, and each row of data is not required to be read, so that the data reading efficiency can be improved.
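Steps 110 to 140 can be sketched end to end in Python as follows. This is a simplified in-memory model under stated assumptions: ColumnStore is a hypothetical stand-in for the column storage file rather than a real Parquet writer, and classify_by_location is a trivial placeholder for the classifier (for example, a pre-trained decision tree).

```python
from collections import defaultdict


class ColumnStore:
    """In-memory stand-in for a column storage file (hypothetical)."""

    def __init__(self):
        # classification label -> column name -> list of column values
        self.row_groups = defaultdict(lambda: defaultdict(list))

    def write(self, rows, classify):
        for row in rows:
            label = classify(row)              # step 120: classify the row
            target = self.row_groups[label]    # step 130: pick the target row group
            for name, value in row.items():    # step 140: write column by column
                target[name].append(value)


def classify_by_location(row):
    # Placeholder classifier; the source describes a decision tree instead.
    return row["location"]


# Step 110: data to be processed (hypothetical rows).
rows = [
    {"age": 21, "location": "Chongqing"},
    {"age": 32, "location": "Beijing"},
    {"age": 31, "location": "Chongqing"},
]
store = ColumnStore()
store.write(rows, classify_by_location)
```

After the write, rows with the same classification sit in the same row group, and within a row group each column's values are contiguous, which is the property the method relies on for better compression.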

Considering that during actual business, business data may not change in a period of time, for example, a user may watch the same video content multiple times in the same day, if one piece of data recording the name of the user watching the content is generated each time, the generated data in the day contains multiple pieces of repeated data, which affects subsequent business analysis.

Thus, in order to reduce duplicate data in the stored data, fig. 5 shows a flowchart of another data processing method provided in this embodiment. As shown in fig. 5, the method may include the following steps 510-570:

Step 510: obtaining data to be processed.

Wherein the data to be processed may comprise data of at least one data row to be written.

Step 520: determining the coding information of the data of each data row to be written in the data to be processed.

The encoding algorithm may be an algorithm that converts data of each data line to be written in the data to be processed into data capable of uniquely identifying the corresponding data line to be written. By way of example, any one of hash algorithms, such as Message Digest (MD) Algorithm, secure hash Algorithm (Secure Hash Algorithm, SHA), and the like, may be used. The message digest algorithm may include, by version, MD2, MD4, and MD5 algorithms, the secure hash algorithm may include a first generation SHA algorithm standard and a second generation SHA algorithm, namely SHA-1 and SHA-2, and SHA-2 may include SHA-224, SHA-256, SHA-384, SHA-512, and the like.

The coding information refers to the encoded data generated from the data of a data row to be written by using the encoding algorithm, and can be used for uniquely identifying the data of that data row to be written.

For example, any hash algorithm may be used to convert the data of each data line to be written in the data to be processed into a binary string with a fixed length, where the binary string is the coding information of the data of the corresponding data line to be written.
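A minimal Python sketch of this encoding step, using SHA-256 from the standard hashlib module, is shown below. Serializing the row as JSON with sorted keys is an assumption made here so that the same row content always yields the same code; the source does not specify how a row is serialized before hashing, and the function name row_code is hypothetical.

```python
import hashlib
import json


def row_code(row: dict) -> str:
    """Return a fixed-length code (hex form of a SHA-256 digest) for a row."""
    # Assumption: canonical JSON with sorted keys as the serialized form.
    canonical = json.dumps(row, sort_keys=True, ensure_ascii=False,
                           separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


# Hypothetical rows with the same content in a different key order.
code_a = row_code({"customer": "Li", "age": 21})
code_b = row_code({"age": 21, "customer": "Li"})
code_c = row_code({"customer": "Li", "age": 22})
```

The hexadecimal string is a text rendering of the 256-bit digest; the raw binary form is available through digest() instead of hexdigest().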

Step 530: according to the data generation time of the data line to be written, determining the coding information of the data of at least one stored data line which is the same as the data generation time of the data line to be written in the index file.

Wherein the index file may include encoded information of data of a plurality of stored data rows in the column storage file. The data generation time of the data line to be written refers to the time of generating the data of the data line to be written, and the time accuracy degree can be selected according to the time dimension of the stored data. The time dimension of storing data refers to the time dimension of storing data to be processed, and may be one day, one week, one month, or the like.

For example, for more frequently changing traffic data, the time dimension of storing data may be set to a shorter time, while for less frequently changing traffic data, the time dimension of storing data may be set to a longer time.

Taking a time dimension of one day as an example, if the data generation time of the data row to be written is January 1, 2023, the stored data rows of January 1, 2023 may be looked up in the index file, and the coding information of the data of these stored data rows may be obtained.
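The lookup can be sketched as follows, assuming a time dimension of one day and an index held in memory as a dictionary from generation date to the set of stored codes. The layout of `index`, the placeholder codes, and the helper `stored_codes` are hypothetical; the embodiment does not prescribe how the index file is organized.

```python
from datetime import date, datetime

# Hypothetical in-memory index: generation date -> codes of rows already stored.
index = {
    date(2023, 1, 1): {"c1", "c2"},
    date(2023, 1, 2): {"c3"},
}

def stored_codes(index: dict, generated_at: datetime) -> set:
    """Return the codes of stored rows generated on the same day as generated_at."""
    return index.get(generated_at.date(), set())

codes = stored_codes(index, datetime(2023, 1, 1, 9, 30))
```

With a weekly or monthly time dimension, the dictionary key would be derived from the week or month of the generation time instead of the calendar day.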

Step 540: if the coding information of the data of the data row to be written is different from the coding information of the data of the stored data row, performing step 550.

In the data to be processed, if the coding information of the data of a data row to be written is different from the coding information of the data of the stored data rows, no identical data exists within the time dimension of the stored data. The data of the data row to be written can therefore be determined not to be repeated data, and step 550 is performed.

Step 550: classifying the data to be processed to obtain classification information of the data to be processed.

In this step, the data to be processed may be the data remaining after the repeated data (that is, the data of data rows to be written whose coding information is the same as that of the data of a stored data row) has been discarded, or may be the data of all the data rows to be written, with no repeated data discarded.

Step 560: determining, according to the classification information of the data to be processed, a target row group corresponding to the classification information from a plurality of row groups included in the column storage file.

Wherein different row groups in the plurality of row groups correspond to different classification information.

According to the classification information of the data to be processed, a target row group corresponding to the classification information can be determined from a plurality of row groups included in the column storage file.

Step 570: writing the data to be processed into the target row group in the column storage file.

After the target row group is determined, the data to be processed, such as the data remaining after the duplicate data is discarded in the data to be processed, may be written in the target row group in the column storage file.

Through steps 510-570, the data rows to be written whose coding information differs from that of the data of the stored data rows can be identified, which completes the duplicate check for the data of all the data rows to be written in the data to be processed. The data to be processed is then classified based on the check result, the target row group corresponding to the classification information is determined, and the data to be processed is written into the target row group of the column storage file. This reduces the repeated data in the column storage file and the storage space it occupies.

It should be noted that, for the specific implementation of steps 510 and 550-570, reference may be made to the specific implementation of steps 110-140 in the foregoing embodiment, which is not repeated here.

Fig. 6 shows a flowchart of yet another data processing method provided in this embodiment. As shown in fig. 6, the method may include the following steps:

Step 610: obtaining data to be processed.

Step 620: determining the coding information of the data of each data row to be written in the data to be processed.

Step 630: according to the data generation time of the data row to be written, determining, in the index file, the coding information of the data of at least one stored data row whose data generation time is the same as that of the data row to be written.

Step 640: if the coding information of the data of the data row to be written is the same as the coding information of the data of the stored data row, discarding the data of the data row to be written.

In the data to be processed, if the coding information of the data of a certain data line to be written is the same as the coding information of the data of the stored data line, which means that the data of the data line to be written has the same data in the corresponding time dimension of the stored data, the data of the data line to be written can be determined to be the repeated data, so that the data of the data line to be written can be discarded without being stored.

By the method, all data generated on January 1, 2023, for example, can be checked for duplicates according to the time dimension of the stored data, and the repeated data in the data to be processed for that time dimension can be discarded. This reduces the storage space occupied by repeated data and saves computing resources.
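The discard step can be sketched as a filter over incoming rows. In this illustration each row already carries its generation day and its coding information under the hypothetical keys `day` and `code`, and the index is a dictionary from day to a set of codes. Recording the kept rows in the index, which also removes duplicates inside the same batch, is an assumption of the sketch rather than something stated in the embodiment.

```python
def drop_duplicates(rows: list, index: dict) -> list:
    """Keep rows whose code is not yet stored for their day; record the kept ones."""
    kept = []
    for row in rows:
        seen = index.setdefault(row["day"], set())
        if row["code"] in seen:
            continue  # same code within the same time dimension: repeated data
        seen.add(row["code"])
        kept.append(row)
    return kept

index = {"2023-01-01": {"h1"}}
rows = [
    {"day": "2023-01-01", "code": "h1"},  # already stored, discarded
    {"day": "2023-01-01", "code": "h2"},  # new on that day, kept
    {"day": "2023-01-02", "code": "h1"},  # same code on another day, kept
]
kept = drop_duplicates(rows, index)
```

Note that the third row is kept: the comparison is made only against stored rows with the same data generation time, as described in step 630.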

Step 650: classifying the data to be processed to obtain classification information of the data to be processed.

In this step, the data to be processed may be the data remaining after discarding the data of the data line to be written in step 640.

For example, a decision tree algorithm, such as a C4.5 algorithm, may be used to classify the data to be processed to obtain classification information of the data to be processed.
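C4.5 chooses the attribute to split on by the information gain ratio. The sketch below computes only that criterion on a tiny labeled sample and is not a full C4.5 implementation. The attribute names `city` and `tier`, the label `category`, and the assumption that labeled training rows are available are all made up for illustration.

```python
import math
from collections import Counter

def entropy(values: list) -> float:
    """Shannon entropy (in bits) of a list of discrete values."""
    total = len(values)
    return -sum((n / total) * math.log2(n / total) for n in Counter(values).values())

def gain_ratio(rows: list, attr: str, label: str) -> float:
    """Information gain of splitting on attr, divided by the split information."""
    groups = {}
    for row in rows:
        groups.setdefault(row[attr], []).append(row[label])
    remainder = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    split_info = entropy([row[attr] for row in rows])
    gain = entropy([row[label] for row in rows]) - remainder
    return gain / split_info if split_info else 0.0

rows = [
    {"city": "A", "tier": "x", "category": "c1"},
    {"city": "A", "tier": "y", "category": "c1"},
    {"city": "B", "tier": "x", "category": "c2"},
    {"city": "B", "tier": "y", "category": "c2"},
]
best = max(["city", "tier"], key=lambda attr: gain_ratio(rows, attr, "category"))
```

In this sample `city` separates the two categories perfectly while `tier` carries no information about them, so `city` is the attribute selected for the first split.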

Step 660: determining, according to the classification information of the data to be processed, a target row group corresponding to the classification information from a plurality of row groups included in the column storage file.

According to the classification information of the data to be processed, a target row group corresponding to that classification information can be determined from the plurality of row groups in the column storage file.

Step 670: writing the data to be processed into the target row group in the column storage file.

After the target row group is determined, the data to be processed can be written into the target row group in the column storage file, so that the storage of the data to be processed is completed.
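The mapping from classification information to a target row group, followed by the write, might look like the sketch below. The dictionary `column_file` stands in for a real column storage file, and `group_of_class`, the class names, and the `classify` callback are hypothetical names chosen for the example.

```python
from collections import defaultdict

# Hypothetical stand-in for a column storage file: row group id -> stored rows.
column_file = defaultdict(list)
# Hypothetical mapping: each classification corresponds to exactly one row group.
group_of_class = {"class_a": 1, "class_b": 2}

def write_rows(rows: list, classify) -> None:
    """Write each row into the row group that corresponds to its classification."""
    for row in rows:
        target = group_of_class[classify(row)]  # target row group for this class
        column_file[target].append(row)

write_rows(
    [{"kind": "class_b", "v": 1}, {"kind": "class_a", "v": 2}, {"kind": "class_b", "v": 3}],
    classify=lambda row: row["kind"],
)
```

Because different row groups correspond to different classification information, rows of the same class always end up next to each other in storage.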

By the method, the data rows to be written whose coding information differs from that of the data of the stored data rows can be identified, which completes the duplicate check for the data of all the data rows to be written in the data to be processed. The data to be processed is then classified based on the check result, the target row group corresponding to the classification information is determined, and the data to be processed is written into the target row group of the column storage file. This reduces the repeated data in the column storage file and the storage space it occupies.

In addition, the specific implementation manners of steps 610-630 and steps 650-670 may refer to the specific implementation manners of steps 510-530 and steps 550-570 in the foregoing embodiments, and are not repeated herein.

In order to improve the efficiency of data processing, fig. 7 shows a flowchart of yet another data processing method provided in this embodiment, and as shown in fig. 7, the method may include the following steps:

Step 710: obtaining data to be processed.

Step 720: comparing the number of data rows to be written included in the data to be processed with a preset number.

The number of data rows to be written is the row count of the data to be processed. The preset number can be customized according to actual requirements, for example 100 rows or 500 rows.

Step 730: if the number of data rows to be written included in the data to be processed is greater than or equal to the preset number, step 740 is performed.

When the number of data lines to be written included in the data to be processed is greater than or equal to the preset number, it is indicated that the data amount of the data to be processed reaches the data amount threshold of the classification process, so step 740 may be performed.

Step 740: classifying the data to be processed to obtain classification information of the data to be processed.

Step 750: determining, according to the classification information of the data to be processed, a target row group corresponding to the classification information from a plurality of row groups included in the column storage file.

Step 760: writing the data to be processed into the target row group in the column storage file.

After the target row group is determined, the data to be processed may be written in the target row group in the column storage file.

By the method, the data to be processed is classified only when the number of data rows to be written in the data to be processed reaches the preset number; the target row group is then determined, and the data to be processed is written into the target row group in the column storage file. Whether classification needs to be performed is thus decided by the amount of data to be processed, and classification does not have to be performed for the data of each individual row, which saves computing resources and improves data processing efficiency.
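The threshold check in steps 720-730 reduces to a single comparison. In the sketch below, `PRESET_NUMBER` is set to 3 only to keep the example short, and `maybe_classify` and the `classify_batch` callback are hypothetical names.

```python
PRESET_NUMBER = 3  # hypothetical threshold; the text mentions values such as 100 or 500

def maybe_classify(pending: list, classify_batch):
    """Classify only when enough rows have accumulated; otherwise return None."""
    if len(pending) >= PRESET_NUMBER:
        return classify_batch(pending)
    return None

small = maybe_classify([1, 2], classify_batch=lambda rows: "classified")
full = maybe_classify([1, 2, 3], classify_batch=lambda rows: "classified")
```

The first call returns `None` because only two rows are pending, and the second call triggers classification because the preset number has been reached.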

It should be noted that, the specific implementation manner of the steps 710 and 740-760 may refer to the specific implementation manner of the steps 110-140 in the foregoing embodiment, and will not be described herein again.

Fig. 8 shows a flowchart of yet another data processing method provided in this embodiment. As shown in fig. 8, the method may include the following steps:

Step 810: obtaining data to be processed.

Step 820: classifying the data of the preset number of data lines to be written to obtain classification information of the data of the preset number of data lines to be written.

The data of the preset number of data lines to be written is determined according to the data generation time of each data line to be written in the data to be processed. For example, when determining the data of the preset number of data lines to be written, the data of each data line to be written may be arranged according to the sequence from the early to the late of the data generation time of each data line to be written in the data to be processed, so as to screen the data of the first N data lines to be written meeting the preset number from the data of each data line to be written, where the preset number is N.

After the data of the preset number of data rows to be written has been screened from the data to be processed, that data can be classified and its classification information can be determined. Taking data to be processed that includes 1000 data rows to be written and a preset number of 800 as an example, the data of the first 800 data rows to be written, ordered by the data generation time of each data row to be written, can be classified to obtain the classification information of the data of the first 800 data rows to be written.
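Selecting the first N rows by data generation time can be sketched as a sort followed by a slice. The field name `generated_at` and the helper `split_by_time` are assumptions of this example; integers stand in for timestamps.

```python
def split_by_time(rows: list, preset_number: int):
    """Order rows from earliest to latest and split off the first preset_number rows."""
    ordered = sorted(rows, key=lambda row: row["generated_at"])
    return ordered[:preset_number], ordered[preset_number:]

rows = [{"id": i, "generated_at": t} for i, t in enumerate([5, 1, 4, 2, 3])]
selected, remaining = split_by_time(rows, 3)
```

The rows in `selected` are the ones to classify now, and `remaining` holds the unclassified rows that are left for later handling.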

Step 830: determining, according to the classification information of the data of the preset number of data rows to be written, a target row group corresponding to that classification information from a plurality of row groups included in the column storage file.

For example, for the above data to be processed including 1000 data rows to be written, the target row group corresponding to the classification information of the data of the first 800 data rows to be written may be determined according to that classification information.

Step 840: writing the data of the preset number of data rows to be written into the target row group in the column storage file.

After the target row group is determined, data of a preset number of data rows to be written can be written in the target row group in the column storage file. For example, the data of the first 800 data lines to be written may be written into the target line group according to the target line group corresponding to the classification information of the data of the first 800 data lines to be written.

By the method, the data of the preset number of data rows to be written can be classified and stored according to the number of data rows in the data to be processed, which ensures the classification accuracy of the data to be processed while improving its processing efficiency.

It should be noted that, the specific implementation of step 810 may refer to the specific implementation of step 110 in the foregoing embodiment, which is not described herein.

Fig. 9 shows a flowchart of yet another data processing method provided in this embodiment. As shown in fig. 9, the method may include the following steps:

Step 910: obtaining data to be processed.

Step 920: classifying the data of the preset number of data lines to be written to obtain classification information of the data of the preset number of data lines to be written.

The data of the preset number of data lines to be written is determined according to the data generation time of each data line to be written in the data to be processed.

Step 930: determining, according to the classification information of the data of the preset number of data rows to be written, a target row group corresponding to that classification information from a plurality of row groups included in the column storage file.

Step 940: writing the data of the preset number of data rows to be written into the target row group in the column storage file.

Step 950: acquiring new data to be processed.

Wherein the new data to be processed may comprise data of at least one new data row to be written.

In order to store the data of the data rows to be written that has not yet been classified, the electronic device may continuously monitor service changes and acquire new data to be processed; or may continuously receive new storage requests from a user or a device and acquire the new data to be processed carried in a new storage request; or may, in response to a new storage request, actively acquire the new data to be processed sent by the user or the device.

Step 960: if the sum of the number of unclassified data lines to be written in the data to be processed and the number of new data lines to be written in the new data to be processed is greater than or equal to the preset number, classifying the unclassified data of the data lines to be written and the new data of the data lines to be written, such as the data of all unclassified data lines to be written and the data of part of the new data lines to be written, to obtain new classification information.

When new data to be processed is acquired, the sum of the number of unclassified data rows to be written in the data to be processed and the number of new data rows to be written in the new data to be processed, that is, the total number of unclassified and new data rows to be written, can be counted. If the total number is greater than or equal to the preset number, the data of the unclassified data rows to be written and the data of the new data rows to be written can be classified to obtain new classification information.

For example, either all of the data of the unclassified data rows to be written and of the new data rows to be written can be classified to obtain the new classification information, or only the preset number of rows taken from these two kinds of data can be classified to obtain the new classification information.

Step 970: determining, according to the new classification information, a target row group corresponding to the new classification information from a plurality of row groups included in the column storage file.

Step 980: writing the data of the unclassified data rows to be written in the data to be processed and the data of the new data rows to be written in the new data to be processed into the target row group in the column storage file.

The new classification information may be obtained by classifying all of the data of the unclassified data rows to be written and the data of the new data rows to be written, or by classifying only the preset number of rows taken from these two kinds of data. In either case, the target row group corresponds to the new classification information.

Accordingly, either all of the data of the unclassified data rows to be written and of the new data rows to be written may be written into the target row group in the column storage file, or only the preset number of rows taken from these two kinds of data may be written into the target row group in the column storage file, for example the data of all the unclassified data rows to be written and the data of part of the new data rows to be written.

By the method, the unclassified data rows to be written in the data to be processed and the new data rows to be written in the new data to be processed can be counted together and compared with the preset number, and the data of the unclassified data rows to be written and the new data to be processed can be stored based on the counting result.
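Steps 950-980 can be sketched as merging the leftover rows with the newly acquired rows and checking the combined count. The helper `merge_and_select` and its `take_all` flag are hypothetical; the flag switches between the two options described above (classify everything, or classify only the preset number of rows). The figure of 700 new rows is an assumption, while the 200 leftover rows follow from the earlier example of 1000 rows with a preset number of 800.

```python
def merge_and_select(unclassified: list, new_rows: list, preset_number: int, take_all: bool = True):
    """Return (rows to classify now, rows still waiting) for the combined pool."""
    pool = unclassified + new_rows          # unclassified rows come first
    if len(pool) < preset_number:
        return [], pool                     # not enough rows yet; keep waiting
    if take_all:
        return pool, []
    return pool[:preset_number], pool[preset_number:]

unclassified = list(range(200))             # 200 rows left over from the first batch
new_rows = list(range(200, 900))            # 700 newly acquired rows (assumed)
batch, waiting = merge_and_select(unclassified, new_rows, 800, take_all=False)
```

With `take_all=False` the batch contains all 200 unclassified rows plus the first 600 new rows, which corresponds to "all the unclassified data rows and part of the new data rows"; the remaining 100 new rows wait for the next round.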

In order to improve the storage efficiency, in an alternative manner, the data of the data line to be written, which is not classified in the data to be processed, may be directly stored in the target line group corresponding to the classification information of the data of the classified data line to be written in the data to be processed, without performing the classification processing on the data.

For example, assuming that there are a plurality of target row groups corresponding to the classification information of the data of the classified data rows to be written in the data to be processed, namely row group 1 and row group 2, the row group whose stored data rows have the closest data generation time may be determined, according to the data generation time of the unclassified data row, as the target row group of the data of the unclassified data row, and the data of the unclassified data row to be written may be stored in that target row group.
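Choosing the row group with the closest data generation time can be sketched as a minimum over time differences. In this illustration generation times are plain numbers, `group_times` maps a row group identifier to the generation times of its stored rows, and both names are hypothetical.

```python
def nearest_row_group(row_time: float, group_times: dict) -> int:
    """Pick the row group whose stored rows have the generation time closest to row_time."""
    return min(
        group_times,
        key=lambda gid: min(abs(row_time - t) for t in group_times[gid]),
    )

group_times = {1: [10, 12, 15], 2: [40, 42]}
early = nearest_row_group(18, group_times)   # closest stored time is 15, in row group 1
late = nearest_row_group(35, group_times)    # closest stored time is 40, in row group 2
```

This avoids running the classifier on the leftover rows at the cost of relying on generation time as a proxy for similarity.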

In an alternative manner, taking the column storage file as a part file as an example, the target row group may include a plurality of column groups, and each column group in the plurality of column groups may include one data column. Fig. 10 is a flowchart illustrating yet another data processing method provided in this embodiment, and as shown in fig. 10, the method may include the following steps:

Step 1010: obtaining data to be processed.

Step 1020: classifying the data to be processed to obtain classification information of the data to be processed.

Step 1030: determining, according to the classification information of the data to be processed, a target row group corresponding to the classification information from a plurality of row groups included in the column storage file.

Step 1040: in the target row group in the column storage file, storing, according to the attribute information of each data column in the data to be processed, the data of the corresponding data column in the target column group corresponding to that attribute information.

Wherein the plurality of column groups may include a target column group. The attribute information of a data column refers to a data feature of a column of data, and may be a column name, a data type, or the like of a column of data.

For example, for the data to be processed as shown in fig. 3, the data of the column of "client name" may be stored in the target column group corresponding to "client name" and the data of the column of "age" may be stored in the target column group corresponding to "age" according to the attribute information of each data column in the data to be processed, such as "client name" and "age".

By the method, the data to be processed can be stored in the corresponding target column group according to the attribute information of each data column, and column storage of the data to be processed is completed. Since the data in one column group in the column storage file has a similar data structure, the column storage file can achieve a higher compression ratio when data compression is performed.
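Within one row group, the column-wise layout can be sketched by regrouping row dictionaries by column name. The helper `to_column_groups` and the sample values are assumptions for illustration; the column names follow the "client name" and "age" example above.

```python
def to_column_groups(rows: list) -> dict:
    """Regroup row-oriented data so that each column's values are stored together."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns

row_group = to_column_groups([
    {"client name": "Li", "age": 31},
    {"client name": "Wang", "age": 45},
])
```

Each list in `row_group` holds values of a single type and meaning, which is what makes column groups compress well.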

It should be noted that, the specific implementation manner of the steps 1010-1030 may refer to the specific implementation manner of the steps 110-130 in the foregoing embodiment, and will not be described herein.

After the data to be processed is stored in the column storage file, the stored data in the column storage file may be queried. Specifically, in order to quickly acquire the data to be queried, the index file in this embodiment may further include a plurality of row group identifiers in the column storage file and identifiers of the stored data rows. Wherein, the row group identifier is character data for uniquely identifying a row group, and can be composed of numbers, letters, special characters and the like; the identification of the stored data line can be the line number and serial number of the data line, or other identification which can uniquely identify a stored data line.

Thus, fig. 11 shows a flowchart of yet another data processing method provided in this embodiment, and as shown in fig. 11, the method may include the following steps:

Step 1110: obtaining data to be processed.

Step 1120: classifying the data to be processed to obtain classification information of the data to be processed.

Step 1130: determining, according to the classification information of the data to be processed, a target row group corresponding to the classification information from a plurality of row groups included in the column storage file.

Step 1140: writing the data to be processed into the target row group in the column storage file.

Step 1150: obtaining an instruction to be queried.

The to-be-queried instruction may include a data generation time of the to-be-queried data.

For example, the user may select or input the data generation time of the data to be queried through an input box in a data query interface displayed on a terminal device such as a computer. In response to the user's input of the data generation time, the terminal device generates the instruction to be queried and sends it to the electronic device.

Step 1160: determining, in the index file and according to the data generation time of the data to be queried, a row group identifier corresponding to the data generation time of the data to be queried and identifiers of a plurality of stored data rows.

After the instruction to be queried is obtained, the electronic device can parse it to obtain the data generation time of the data to be queried.

Then, the electronic device may search the index file for a row group identifier corresponding to the data generation time of the data to be queried and an identifier of a stored data row corresponding to the row group identifier according to the data generation time of the data to be queried.

In an alternative manner, the identifier of the stored data row corresponding to the data generation time of the data to be queried and the row group identifier corresponding to the identifier of the stored data row may also be searched in the index file according to the data generation time of the data to be queried.

Fig. 12 shows a schematic diagram of an index file provided in this embodiment. As shown in fig. 12, taking a data generation time of Day1 for the data to be queried as an example, the identifiers of the stored data rows corresponding to Day1, namely rows 0-100, rows 101-200, and rows 301-500, may first be looked up in the index file. Then, according to the identifiers of these stored data rows, the row group identifiers corresponding to rows 0-100, rows 101-200, and rows 301-500 of Day1, namely row group identifier 1, row group identifier 2, and row group identifier 4, may be determined from the row group identifier data.

Step 1170: determining the data to be queried in the column storage file according to the row group identifier and the identifiers of the plurality of stored data rows.

After determining the row group identifier corresponding to the data generation time of the data to be queried and the identifiers of the plurality of stored data rows, a row group can be determined in the column storage file according to the row group identifier, and the data of the data row corresponding to the identifier of the corresponding stored data row can be read from the row group.

For example, the row groups in the column storage file may be determined according to row group identifier 1, row group identifier 2, and row group identifier 4; then the data of rows 0-100 is read from the row group corresponding to row group identifier 1, the data of rows 101-200 is read from the row group corresponding to row group identifier 2, and the data of rows 301-500 is read from the row group corresponding to row group identifier 4.
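The Day1 example can be sketched with an in-memory index that maps a generation time to row group identifiers and row ranges. The structures `index` and `row_groups`, the `query` helper, and the placeholder row contents are hypothetical; only the row ranges and row group identifiers are taken from the example above.

```python
# Hypothetical index: generation time -> list of (row group id, first row, last row).
index = {"Day1": [(1, 0, 100), (2, 101, 200), (4, 301, 500)]}
# Hypothetical row groups: row group id -> {row number: row data}.
row_groups = {
    1: {n: f"row-{n}" for n in range(0, 101)},
    2: {n: f"row-{n}" for n in range(101, 201)},
    3: {n: f"row-{n}" for n in range(201, 301)},
    4: {n: f"row-{n}" for n in range(301, 501)},
}

def query(day: str) -> list:
    """Read only the rows that the index lists for the given generation time."""
    result = []
    for group_id, first, last in index.get(day, []):
        group = row_groups[group_id]  # row groups not listed in the index are never touched
        result.extend(group[n] for n in range(first, last + 1))
    return result

day1 = query("Day1")
```

Row group 3 is skipped entirely for this query, which is where the saving in read time comes from.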

By the method, the row group identification of the data to be queried and the identification of the stored data row can be determined according to the index file, so that the data query efficiency can be improved, the reading speed can be improved, and the waste of computing resources can be reduced.

It should be noted that, the specific implementation manner of the steps 1110 to 1140 may refer to the specific implementation manner of the steps 110 to 140 in the foregoing embodiment, which is not described herein again.

In summary, according to the data processing method provided in this embodiment, the classification information of the data to be processed may be determined, a target row group may be determined according to the classification information, and the data to be processed may be written into the target row group in the column storage file, thereby completing data storage. Because the data in the same row group corresponds to the same category, similar data is stored in the same row group as far as possible, which improves the compression ratio of the data and saves data storage space. In addition, owing to the format of the column storage file, only the data in the corresponding rows needs to be read when stored data is read, rather than the data in every row, which improves data reading efficiency.

Fig. 13 is a schematic structural diagram of a data processing apparatus according to an embodiment of the present invention. As shown in fig. 13, the data processing apparatus 1300 may include: an obtaining module 1310, which may be configured to obtain data to be processed, where the data to be processed includes data of at least one data row to be written; a classification module 1320, which may be configured to classify the data to be processed to obtain classification information of the data to be processed; a determining module 1330, which may be configured to determine, according to the classification information of the data to be processed, a target row group corresponding to the classification information from a plurality of row groups included in the column storage file, where different row groups in the plurality of row groups correspond to different classification information; and a writing module 1340, which may be configured to write the data to be processed into the target row group in the column storage file.

In an alternative manner, before classifying the data to be processed to obtain the classification information of the data to be processed, the classification module 1320 may be further configured to determine coding information of data of each data row to be written in the data to be processed, determine, according to the data generation time of the data row to be written, coding information of data of at least one stored data row that is the same as the data generation time of the data row to be written in an index file, where the index file includes coding information of data of a plurality of stored data rows in a column storage file, and execute the step of classifying the data to be processed to obtain the classification information of the data to be processed if the coding information of the data row to be written is different from the coding information of the data of the stored data row.

In an alternative manner, the classification module 1320 may be further configured to discard the data of the data row to be written if the encoding information of the data row to be written is the same as the encoding information of the data of the stored data row.

In an alternative manner, the index file further includes a plurality of row group identifiers and identifiers of the stored data rows in the column storage file, the writing module 1340 may be further configured to obtain an instruction to be queried, where the instruction to be queried includes a data generation time of the data to be queried, determine, according to the data generation time of the data to be queried, the row group identifier and the identifiers of the stored data rows corresponding to the data generation time of the data to be queried in the index file, and determine, according to the row group identifier and the identifiers of the stored data rows, the data to be queried in the column storage file.

In an alternative manner, before classifying the data to be processed to obtain the classification information of the data to be processed, the classification module 1320 may be further configured to compare the number of lines of data to be written included in the data to be processed with the preset number, and if the number of lines of data to be written included in the data to be processed is greater than or equal to the preset number, execute the step of classifying the data to be processed to obtain the classification information of the data to be processed.

In an alternative manner, the classifying module 1320 may be configured to classify the data of the preset number of data lines to be written to obtain classification information of the data of the preset number of data lines to be written to; the data of the preset number of data lines to be written is determined according to the data generation time of each data line to be written in the data to be processed.

In an alternative manner, the classification module 1320 may also be used to obtain new data to be processed; and if the sum of the number of unclassified data lines to be written in the data to be processed and the number of new data lines to be written in the new data to be processed is greater than or equal to the preset number, classifying the unclassified data of the data lines to be written and the new data of the data lines to be written to obtain new classification information.

In an alternative manner, the target row group includes a plurality of column groups, each column group in the plurality of column groups includes a data column, the writing module 1340 may be configured to store, in the target row group in the column storage file, data of a corresponding data column in the data to be processed in the target column group corresponding to the attribute information according to attribute information of each data column in the data to be processed, where the plurality of column groups includes the target column group.

The specific details of each module in the above apparatus are already described in the method section embodiments, and the details of the undisclosed solution may be referred to the method section embodiments, so that they will not be described in detail.

Fig. 14 is a schematic structural diagram of an electronic device according to an embodiment of the present invention; the specific embodiments of the present invention do not limit the specific implementation of the electronic device.

As shown in fig. 14, the electronic device may include: a processor 1402, a communication interface (Communications Interface) 1404, a memory 1406, and a communication bus 1408.

The processor 1402, the communication interface 1404, and the memory 1406 communicate with each other via the communication bus 1408. The communication interface 1404 is used for communicating with network elements of other devices, such as clients or other servers. The processor 1402 is configured to execute the program 1410, and may specifically perform the relevant steps in the embodiments of the data processing method described above.

In particular, program 1410 may include program code comprising computer-executable instructions.

The processor 1402 may be a central processing unit (CPU), an application-specific integrated circuit (ASIC), or one or more integrated circuits configured to implement embodiments of the present invention. The one or more processors included in the electronic device may be of the same type, such as one or more CPUs, or of different types, such as one or more CPUs and one or more ASICs.

The memory 1406 is used for storing the program 1410. The memory 1406 may comprise high-speed RAM, and may also comprise non-volatile memory, such as at least one disk memory.

The program 1410 may be specifically invoked by the processor 1402 to cause the electronic device to perform the operational steps of the data processing method described above.

An embodiment of the present invention provides a computer readable storage medium storing at least one executable instruction that, when executed on an electronic device, causes the electronic device to perform a data processing method according to any of the above-described method embodiments.

The executable instructions may specifically be used to cause an electronic device to perform the operational steps of the data processing method described above.

The algorithms or displays presented herein are not inherently related to any particular computer, virtual system, or other apparatus. In addition, embodiments of the present invention are not directed to any particular programming language.

In the description provided herein, numerous specific details are set forth. It will be appreciated, however, that embodiments of the invention may be practiced without such specific details. Similarly, in the above description of exemplary embodiments of the invention, various features of embodiments of the invention are sometimes grouped together in a single embodiment, figure, or description thereof for the purpose of streamlining the disclosure and aiding in the understanding of one or more of the various inventive aspects. Accordingly, the claims following the detailed description are hereby expressly incorporated into this detailed description, with each claim standing on its own as a separate embodiment of the invention.

Those skilled in the art will appreciate that the modules in the apparatus of the embodiments may be adaptively changed and disposed in one or more apparatuses different from the embodiments. The modules or units or components of the embodiments may be combined into one module or unit or component, and may furthermore be divided into a plurality of sub-modules or sub-units or sub-components, except where at least some of such features and/or processes or elements are mutually exclusive.

It should be noted that the above-mentioned embodiments illustrate rather than limit the invention, and that those skilled in the art will be able to design alternative embodiments without departing from the scope of the appended claims. In the claims, any reference signs placed between parentheses shall not be construed as limiting the claim. The word "comprising" does not exclude the presence of elements or steps not listed in a claim. The word "a" or "an" preceding an element does not exclude the presence of a plurality of such elements. The invention may be implemented by means of hardware comprising several distinct elements, and by means of a suitably programmed computer. In the unit claims enumerating several means, several of these means may be embodied by one and the same item of hardware. The use of the words first, second, third, etc. does not denote any order; these words may be interpreted as names. The steps in the above embodiments should not be construed as limiting the order of execution unless specifically stated.

Claims

1. A method of data processing, the method comprising:

acquiring data to be processed, wherein the data to be processed comprises data of at least one data row to be written;

classifying the data to be processed to obtain classification information of the data to be processed;

determining a target row group corresponding to the classification information from a plurality of row groups included in a column storage file according to the classification information of the data to be processed; the classification information corresponding to different row groups in the plurality of row groups is different;

writing the data to be processed in the target row group in the column storage file.

2. The method according to claim 1, wherein before said classifying the data to be processed to obtain classification information of the data to be processed, the method further comprises:

determining coding information of data of each data row to be written in the data to be processed;

determining, in an index file and according to the data generation time of the data row to be written, coding information of data of at least one stored data row whose data generation time is the same as that of the data row to be written; the index file comprises coding information of data of a plurality of stored data rows in the column storage file;

and if the coding information of the data of the data row to be written is different from the coding information of the data of the stored data row, executing the step of classifying the data to be processed to obtain the classification information of the data to be processed.

3. The method according to claim 2, wherein the method further comprises:

and discarding the data of the data row to be written if the coding information of the data of the data row to be written is the same as the coding information of the data of the stored data row.

4. The method of claim 2, wherein the index file further comprises a plurality of row group identifiers and identifiers of respective stored data rows in the column storage file, the method further comprising:

acquiring a query instruction, wherein the query instruction comprises the data generation time of data to be queried;

determining, in the index file and according to the data generation time of the data to be queried, a row group identifier and a plurality of stored data row identifiers corresponding to the data generation time of the data to be queried;

and determining the data to be queried in the column storage file according to the row group identifier and the plurality of stored data row identifiers.

5. The method according to any one of claims 1-4, wherein before said classifying the data to be processed to obtain classification information of the data to be processed, the method further comprises:

comparing the number of data rows to be written included in the data to be processed with a preset number;

and if the number of data rows to be written included in the data to be processed is greater than or equal to the preset number, executing the step of classifying the data to be processed to obtain classification information of the data to be processed.

6. The method according to claim 5, wherein classifying the data to be processed to obtain classification information of the data to be processed comprises:

classifying the data of the preset number of data rows to be written to obtain classification information of the data of the preset number of data rows to be written;

the data of the preset number of data rows to be written is determined according to the data generation time of each data row to be written in the data to be processed.

7. The method of claim 6, wherein the method further comprises:

acquiring new data to be processed;

and if the sum of the number of unclassified data rows to be written in the data to be processed and the number of new data rows to be written in the new data to be processed is greater than or equal to the preset number, classifying the data of the unclassified data rows to be written and the data of the new data rows to be written to obtain new classification information.

8. The method of any of claims 1-4, wherein the target row group comprises a plurality of column groups, each column group of the plurality of column groups comprising a column of data, the writing the data to be processed in the target row group in the column storage file comprising:

storing, in the target row group in the column storage file and according to the attribute information of each data column in the data to be processed, data of the corresponding data column in the data to be processed in a target column group corresponding to the attribute information; the plurality of column groups includes the target column group.

9. A data processing apparatus, the apparatus comprising:

an acquisition module, configured to acquire data to be processed, wherein the data to be processed comprises data of at least one data row to be written;

a classification module, configured to classify the data to be processed to obtain classification information of the data to be processed;

a determining module, configured to determine a target row group corresponding to the classification information from a plurality of row groups included in a column storage file according to the classification information of the data to be processed; the classification information corresponding to different row groups in the plurality of row groups is different;

and a writing module, configured to write the data to be processed in the target row group in the column storage file.

10. An electronic device, comprising: a processor;

a memory for storing at least one executable instruction;

the executable instructions cause the processor to perform the operations of the data processing method of any one of claims 1-8.