CN111611243A - Data processing method and device - Google Patents

Data processing method and device

Info

Publication number
CN111611243A
CN111611243A (application CN202010406485.XA)
Authority
CN
China
Prior art keywords
data
target field
computing node
field value
computing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202010406485.XA
Other languages
Chinese (zh)
Other versions
CN111611243B (en)
Inventor
焦英翔
石光川
Current Assignee
4Paradigm Beijing Technology Co Ltd
Original Assignee
4Paradigm Beijing Technology Co Ltd
Priority date
Filing date
Publication date
Application filed by 4Paradigm Beijing Technology Co Ltd filed Critical 4Paradigm Beijing Technology Co Ltd
Priority to CN202010406485.XA
Publication of CN111611243A
Application granted
Publication of CN111611243B
Active legal status
Anticipated expiration legal status

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F 16/00 - Information retrieval; Database structures therefor; File system structures therefor
    • G06F 16/20 - Information retrieval of structured data, e.g. relational data
    • G06F 16/22 - Indexing; Data structures therefor; Storage structures
    • G06F 16/2282 - Tablespace storage structures; Management thereof
    • G06F 16/27 - Replication, distribution or synchronisation of data between databases or within a distributed database system; Distributed database system architectures therefor
    • G06F 16/278 - Data partitioning, e.g. horizontal or vertical partitioning
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N 20/00 - Machine learning

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Software Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Artificial Intelligence (AREA)
  • Information Retrieval, Database Structures and File System Structures Therefor (AREA)

Abstract

A data processing method and apparatus are disclosed. The method includes: obtaining split points of a first data table; splitting the first data table at the split points to obtain a first number of blocks; and allocating the first number of blocks to a second number of computing nodes for processing, where the first number is greater than or equal to the second number. This can improve data processing efficiency.

Description

Data processing method and device
Technical Field
The present invention relates generally to the field of machine learning, and more particularly, to a data processing method and apparatus.
Background
In the field of machine learning, the data needed for modeling is typically a collection of records stored by rows. To improve processing efficiency, a distributed system is commonly used so that multiple computing nodes can process the data in parallel.
Data slicing is a key step in using a distributed system to improve data processing efficiency: a poorly chosen slicing can reduce the distributed system's processing efficiency.
Disclosure of Invention
Exemplary embodiments of the present invention are directed to providing a data processing scheme capable of implementing data slicing.
According to a first aspect of the present invention, a data processing method is provided, comprising: obtaining split points of a first data table; splitting the first data table at the split points to obtain a first number of blocks; and allocating the first number of blocks to a second number of computing nodes for processing, wherein the first number is greater than or equal to the second number.
Optionally, obtaining the split points of the first data table includes: for each piece of first data in the first data table, setting an additional value for the target field value in that piece, the additional value and the target field value together forming a new target field value; and computing quantiles of the new target field values with a quantile algorithm, the quantiles serving as the split points.
Optionally, computing the quantiles of the plurality of new target field values with a quantile algorithm comprises: computing the quantiles with a weight-based quantile algorithm, so that the blocks obtained by splitting at the quantiles have the same or substantially the same data volume, where the weight represents the data volume of a single piece of data.
Optionally, the method further comprises: storing second data of a second data table on one or more parameter servers, each parameter server storing at least part of the second data; for at least one piece of first data in a block allocated to a computing node, the computing node obtains the second data whose target field value matches that of the first data from the parameter server storing it, and processes the first data based on the second data and a preset data processing rule.
Optionally, processing the first data based on the second data and a preset data processing rule includes: splicing the field values of one or more second fields of the second data into the first data as new field values of the first data.
Optionally, the method further comprises: storing, on the computing node, the second data obtained from the parameter server.
Optionally, the method further comprises: determining whether second data matching the target field value of the first data already exists on the computing node; if so, processing the first data based on the locally stored second data and the preset data processing rule; and/or, if not, obtaining the second data from the parameter server that stores it.
Optionally, second data with the same target field value is stored on the same parameter server, and the method further comprises: the parameter server processes the second data sharing a target field value according to a preset data processing rule to obtain an intermediate processing result for that target field value; the computing node obtains the intermediate processing result corresponding to the target field value of the first data from the parameter server storing it, stores the result locally, and processes the first data based on it.
Optionally, the method further comprises: determining whether the data volume of the second data table exceeds a first threshold; if it is less than or equal to the first threshold, storing the second data of the second data table on every first computing node, so that each first computing node processes the first data based on the second data and a preset data processing rule.
Optionally, the method further comprises: if the data volume of the second data table exceeds the first threshold, splitting the second data table at the split points of the first data table, and allocating the resulting blocks to the second number of computing nodes for processing; for at least one piece of first data allocated to a computing node, that node obtains the second data whose target field value matches from the computing node storing it, and processes the first data based on the second data and a preset data processing rule.
Optionally, the method further comprises: determining whether the first data table contains a target field value whose frequency is greater than or equal to a second threshold; if so, the data processing method above is executed.
Optionally, the method further comprises: if no target field value has a frequency greater than or equal to the second threshold, allocating the pieces of first data in the first data table to one or more computing nodes, with data sharing a target field value allocated to the same computing node.
Optionally, the method further comprises: allocating the pieces of second data in the second data table to one or more computing nodes, with second data allocated to the same computing node as the first data sharing its target field value.
According to a second aspect of the present invention, there is provided a data processing apparatus comprising: an obtaining module for obtaining split points of a first data table; a splitting module for splitting the first data table at the split points to obtain a first number of blocks; and an allocating module for allocating the first number of blocks to a second number of computing nodes for processing, wherein the first number is greater than or equal to the second number.
Optionally, the obtaining module includes: a setting module for setting, for each piece of first data in the first data table, an additional value for the target field value in that piece, the additional value and the target field value together forming a new target field value; and a calculating module for computing quantiles of the new target field values with a quantile algorithm, the quantiles serving as the split points.
Optionally, the calculating module computes the quantiles of the plurality of new target field values with a weight-based quantile algorithm, so that the blocks obtained by splitting at the quantiles have the same or substantially the same data volume, where the weight represents the data volume of a single piece of data.
Optionally, the allocating module is further configured to store second data of a second data table on one or more parameter servers, each parameter server storing at least part of the second data; for at least one piece of first data in a block allocated to a computing node, the computing node obtains the second data whose target field value matches that of the first data from the parameter server storing it, and processes the first data based on the second data and a preset data processing rule.
Optionally, the computing node splices the field values of one or more second fields of the second data into the first data as new field values of the first data.
Optionally, the computing node also stores the second data obtained from the parameter server.
Optionally, the apparatus further comprises a first judging module for determining whether second data matching the target field value of the first data already exists on the computing node; if so, the computing node processes the first data based on the locally stored second data and the preset data processing rule; and/or, if not, the computing node obtains the second data from the parameter server that stores it.
Optionally, second data with the same target field value is stored on the same parameter server, and the parameter server processes the second data sharing a target field value according to a preset data processing rule to obtain an intermediate processing result for that target field value; the computing node obtains the intermediate processing result corresponding to the target field value of the first data from the parameter server storing it, stores the result locally, and processes the first data based on it.
Optionally, the apparatus further comprises a second judging module for determining whether the data volume of the second data table exceeds a first threshold; the allocating module is further configured to, if the data volume is less than or equal to the first threshold, store the second data of the second data table on every first computing node, so that each first computing node processes the first data based on the second data and a preset data processing rule.
Optionally, if the data volume of the second data table exceeds the first threshold, the splitting module splits the second data table at the split points of the first data table, and the allocating module allocates the resulting blocks to the second number of computing nodes for processing; for at least one piece of first data allocated to a computing node, that node obtains the second data whose target field value matches from the computing node storing it, and processes the first data based on the second data and a preset data processing rule.
Optionally, the apparatus further comprises a third judging module for determining whether the first data table contains a target field value whose frequency is greater than or equal to a second threshold; if so, the obtaining module obtains the split points of the first data table, the splitting module splits the first data table at the split points, and the allocating module allocates the first number of blocks to the second number of computing nodes for processing.
Optionally, if no target field value has a frequency greater than or equal to the second threshold, the allocating module allocates the pieces of first data in the first data table to one or more computing nodes, with data sharing a target field value allocated to the same computing node.
Optionally, the allocating module further allocates the pieces of second data in the second data table to one or more computing nodes, with second data allocated to the same computing node as the first data sharing its target field value.
According to a third aspect of the present invention, there is also provided a system comprising at least one computing device and at least one storage device storing instructions, wherein the instructions, when executed by the at least one computing device, cause the at least one computing device to perform the method according to the first aspect of the present invention.
According to a fourth aspect of the present invention, there is also provided a computer-readable storage medium storing instructions that, when executed by at least one computing device, cause the at least one computing device to perform the method according to the first aspect of the present invention.
In the data processing method and apparatus according to the exemplary embodiments of the present invention, split points of the first data table are obtained, the table is split at those points, and the resulting blocks can be allocated to the corresponding computing nodes. Computing the split points need not depend on a manually set skew threshold, and setting an additional value for the target field value allows the split points to cope with possible skew keys.
Drawings
These and/or other aspects and advantages of the present invention will become more apparent and more readily appreciated from the following detailed description of the embodiments of the invention, taken in conjunction with the accompanying drawings of which:
FIG. 1 illustrates a flow chart of a data processing method according to an exemplary embodiment of the present invention;
FIG. 2 is a flowchart illustrating further steps that the data processing method of FIG. 1 may include;
FIGS. 3A to 3D are schematic diagrams illustrating a data processing flow in a specific application example;
FIG. 4 shows a block diagram of a data processing apparatus according to an exemplary embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, exemplary embodiments thereof will be described in further detail below with reference to the accompanying drawings and detailed description.
Fig. 1 illustrates a flowchart of a data processing method according to an exemplary embodiment of the present invention. The method shown in fig. 1 may be implemented entirely in software via a computer program, and the method shown in fig. 1 may also be executed by a specifically-configured computing device.
Referring to fig. 1, at step S110, a cut point of a first data table is acquired.
The first data table may be regarded as a collection of pieces of first data stored by rows; each piece of first data may consist of one or more field values.
A split point represents a position at which the first data in the first data table is divided. The split points may be pre-computed and obtained from an external source, or computed on the fly, and one or more split points may be obtained.
When a distributed system is used to improve data processing efficiency, a common approach is to divide the data evenly into n parts by rows and have n computing nodes process them simultaneously, raising total throughput roughly n-fold. However, not every operation allows the data to be divided into n arbitrary parts. Take a table of consumption records: to count each user's purchases over the last day, all consumption records of the same user must be processed together, i.e. they must land on the same computing node for the result to be correct. The usual approach is therefore to divide the data into n groups by the value of some column, guaranteeing that rows with the same field value go to the same computing node, and let the nodes compute results in parallel. The drawback is that when the data distribution is uneven - for example, a few special accounts (enterprises, merchants, etc.) have far more consumption records than normal users - the nodes holding them receive more data than the others and slow down the overall computation. This is the phenomenon of data skew, and once it occurs its impact is severe.
To address this, a quantile algorithm can compute quantiles over the target field values in the first data table, and the computed quantiles serve as the split points. That is, m quantiles selected from the target field values divide them into m+1 approximately equal shares. The split points cannot guarantee a perfectly even division, but the division error does not exceed a very small threshold. With a quantile algorithm, one or more split points can be selected at the cost of a single scan.
The target field value is simply the value of a target field, which can be determined by the data processing task on the first data table. For example, if the first data table is a consumption record table and the task is to compute each user's average consumption over the last 24 hours, the target field is the user field and the target field value is the user ID.
If the target field values are not pre-processed, a quantile algorithm selecting one or more quantiles from them merges identical target field values into a single value, and therefore cannot handle a possible skew key in the table - a target field value whose data volume is much larger than the others'.
The present invention therefore proposes that, for each piece of first data in the first data table, an additional value be set for the target field value, the additional value and the target field value together forming a new target field value; quantiles are then computed over the new target field values with a quantile algorithm. This avoids the uneven splits caused by merging identical field values, so the data can be divided evenly even when a skew key is present.
The additional value appended to a target field value may be a random number, with different additional values set for different pieces of first data sharing the same target field value; the additional value and the target field value together form the "new target field value". The additional value is only used to achieve an even division in the splitting stage. In the data processing stage - i.e., when a computing node processes the data - the target field value of each piece of first data remains the original value, and the additional value plays no part in the processing.
For example, suppose there are 100 pieces of data whose target field values comprise 10 occurrences of a, 10 of b, and 80 of c. If the number of split points is set to 10, the quantile algorithm produces only three distinct split points {a, b, c}; splitting on them puts 80% of the data into one group, and the data become skewed when the groups are assigned to different computing nodes.
If instead a random number is appended to the target field value of each piece of data before the quantiles are computed, a split-point set such as {a, b, c, c, c, c, c, c, c} can be obtained, which divides the 100 pieces evenly into 10 parts. A selected split point carries the random number appended during processing, so the stored points may actually look like {..., c.123, c.238, ..., c.732}; when determining the sub-bucket (i.e., block) of a piece of data, the random number generated for that piece places it into one of the buckets whose split points share the prefix c.
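The salting scheme above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the patent's implementation: the helper names, the three-digit salt format, and the exact order-statistic selection in `split_points` (standing in for a one-pass quantile sketch) are all invented here.

```python
import random

random.seed(0)  # fixed seed so the example is deterministic

def salted_keys(records, target_field, salt_range=1000):
    """Append a random additional value to each target field value so that
    identical keys (e.g. the skew key 'c') become distinct for splitting."""
    return [("%s.%03d" % (r[target_field], random.randrange(salt_range)), r)
            for r in records]

def split_points(keys, num_points):
    """Pick num_points evenly spaced order statistics of the sorted salted
    keys as split points (an exact stand-in for a quantile sketch)."""
    ordered = sorted(keys)
    n = len(ordered)
    return [ordered[(i + 1) * n // (num_points + 1)] for i in range(num_points)]

def bucket_of(salted_key, points):
    """Block index of a record: the first split point >= its salted key."""
    for i, p in enumerate(points):
        if salted_key <= p:
            return i
    return len(points)

# 100 records with a skewed key distribution: 10 'a', 10 'b', 80 'c'.
records = [{"user": k} for k in ["a"] * 10 + ["b"] * 10 + ["c"] * 80]
pairs = salted_keys(records, "user")
points = split_points([k for k, _ in pairs], 9)  # 9 split points -> 10 blocks
sizes = [0] * 10
for k, _ in pairs:
    sizes[bucket_of(k, points)] += 1
# Each block now holds roughly 10 records despite the skew key 'c'.
```

Without the salt, the only distinct split points would be {a, b, c}, and one block would receive 80 of the 100 records.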
Considering that different pieces of first data may differ in data volume, and to avoid skew caused by such differences, the present invention may compute the quantiles of the new target field values with a weight-based quantile algorithm, so that the blocks obtained by splitting at the quantiles have the same or substantially the same data volume, where the weight represents the data volume of a single piece of data. The weight-based quantile algorithm may be, but is not limited to, a weighted quantile sketch.
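The weight-based variant can be illustrated with exact weighted quantiles; a production system would use a streaming weighted quantile sketch instead, and `weighted_split_points` is a hypothetical helper showing only the balancing criterion (equal total weight per block, rather than equal row count).

```python
def weighted_split_points(keys_with_weights, num_points):
    """Pick split points so that each resulting block carries roughly the
    same total weight, where a weight is the data volume (e.g. bytes) of
    one record. Exact computation here; a weighted quantile sketch would
    approximate this in a single pass."""
    items = sorted(keys_with_weights)            # (key, weight) pairs
    total = sum(w for _, w in items)
    points, acc = [], 0.0
    next_cut = total / (num_points + 1)
    for key, w in items:
        acc += w
        if acc >= next_cut and len(points) < num_points:
            points.append(key)
            next_cut = total * (len(points) + 1) / (num_points + 1)
    return points

# Rows under key prefix 'x' are 10x larger than those under 'y'.
data = ([("x%02d" % i, 100) for i in range(10)] +
        [("y%02d" % i, 10) for i in range(100)])
points = weighted_split_points(data, 1)  # one split point -> two blocks
# The split lands after the heavy 'x' keys, balancing bytes, not row counts:
# block 1 = 10 heavy rows (weight 1000), block 2 = 100 light rows (weight 1000).
```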
In step S120, the first data table is split at the split points to obtain a first number of blocks.
In step S130, a first number of blocks are allocated to a second number of compute nodes for processing, where the first number is greater than or equal to the second number.
The computing nodes for data processing may be requested in advance, and the number of split points can be determined from the number of computing nodes (i.e., the second number). The number of split points may be set greater than or equal to the number of computing nodes - for example, several times (e.g., 10 times) that number.
In the present invention, the split points obtained in step S110 may be ordered, and the blocks produced by splitting at ordered split points are ordered as well. When the first number of blocks is allocated to the second number of computing nodes, it can be guaranteed that every target field value on computing node i is smaller than those on computing node i+1 - that is, the first data allocated across the computing nodes is itself ordered. After the allocation, each computing node can sort its local data internally, completing a global row-wise sort.
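The ordering property can be sketched as follows (hypothetical helper; contiguous-range assignment of ordered blocks is assumed): every key in block j is no larger than every key in block j+1, so giving each node a contiguous run of blocks and sorting locally completes a global sort.

```python
def assign_blocks(blocks, num_nodes):
    """Give each computing node a contiguous range of the ordered blocks,
    so every key on node i is <= every key on node i+1."""
    per = -(-len(blocks) // num_nodes)  # ceiling division
    return [blocks[i * per:(i + 1) * per] for i in range(num_nodes)]

# Four ordered blocks produced by ordered split points (unsorted inside).
blocks = [[3, 1, 2], [6, 4, 5], [9, 7, 8], [12, 10, 11]]
nodes = assign_blocks(blocks, 2)  # two computing nodes
# Each node sorts only its own local data ...
local = [sorted(sum(node, [])) for node in nodes]
# ... and concatenating the nodes' outputs is already a global row-wise sort.
globally_sorted = sum(local, [])
```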
When splitting the first data table, the method does not depend on a manually set threshold, guarantees that target field values are ordered across groups, and thus effectively achieves a row-wise ordering of the data while splitting it. Moreover, because data is processed repeatedly in machine learning - especially during automated hyperparameter tuning - the quantile computation, though it adds some cost, needs to run only once: the resulting quantiles can be reused, and the efficiency gained in subsequent passes usually far outweighs that one-time cost.
After the first data in the first data table is allocated to the computing nodes in the above manner, each computing node can process its first data according to the specific data processing task. For example, a computing node may sum or subtract two columns of field values, or combine two discrete feature columns into a combined feature; such operations do not cross rows and can be computed in parallel.
When window operations are involved, grouping by target field value is often required to ensure that data with the same target field value lands on the same computing node - for example, when totalling a user's consumption over the last 24 hours, the records of each user are grouped by user id onto one node. The splitting scheme of the present invention cannot guarantee this; in that case, the part of the quantile computation that sets an additional value for the target field value can be switched off, so that data with the same target field value is allocated to the same computing node. Taking the improved quantile sketch as an example, disabling the additional value (e.g., the random number) reduces the algorithm to the original quantile sketch and achieves this goal.
In practice, when window operations are involved, the window size is usually much smaller than the total data size. Hence, even if the splitting scheme of the present invention introduces some data errors at the boundaries between computing nodes, the final model quality is barely affected - if anything, the chance of overfitting is somewhat reduced - while the quantile algorithm still markedly reduces the impact of skewed data. The improved quantile algorithm therefore remains practical even for window operations.
As an example, before the first data table is split, it can be determined whether the table contains skewed data. If it may, the method of FIG. 1 is executed to solve the data skew problem; if not, the first data table is split across one or more computing nodes in the conventional way. Concretely, it can be determined whether the first data table contains a target field value whose frequency is greater than or equal to a second threshold. If such a value exists, the method of FIG. 1 is executed; otherwise, the first data is allocated to one or more computing nodes, with data sharing a target field value placed on the same node. The second threshold is only needed when further efficiency gains are desired; it is not required for solving the skew problem, it need not be the exact boundary between skew keys and normal keys, and even an inaccurately chosen second threshold does not prevent the method of FIG. 1 from resolving data skew.
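The frequency check can be sketched as below; the record layout, field name, and threshold value are illustrative assumptions.

```python
from collections import Counter

def has_skew_key(records, target_field, second_threshold):
    """Return True if some target field value occurs at least
    second_threshold times, i.e. a potential skew key exists."""
    freq = Counter(r[target_field] for r in records)
    return max(freq.values(), default=0) >= second_threshold

records = [{"user": k} for k in ["a"] * 10 + ["b"] * 10 + ["c"] * 80]
# With a threshold of 50, the heavy key 'c' (80 rows) triggers the
# quantile-based split of FIG. 1; otherwise plain grouping by key is used.
use_quantile_split = has_skew_key(records, "user", 50)
```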
The data slicing method proposed by the present invention has been described in detail above with reference to FIG. 1.
In data processing, it is often necessary to combine two data tables.
Take a table-joining operation over two tables as an example: a consumption record table records which goods a user purchased at what time. If, during model training, information such as the user's age and sex should be added to the desired samples, that information must be spliced from the user information table into the consumption record table by user id - this process is the table-joining operation.
A common joining strategy is to group both tables by the target field value (e.g., the user id), so that data with the same target field value in both tables lands on the same computing node, which can then join using local data only. Under the splitting scheme of the present invention, however, data with the same target field value may be spread over multiple computing nodes, so the existing joining strategy does not apply.
For a scene needing to combine two data tables for data processing, the invention provides an implementation scheme based on a parameter server. Fig. 2 shows a flow chart of steps that the data processing method shown in fig. 1 may further include. The method illustrated in fig. 2 may be implemented entirely in software via a computer program, and the method illustrated in fig. 2 may also be executed by a specifically-configured computing device.
Referring to fig. 2, in step S210, the second data in the second data table is stored on one or more parameter servers, each of which stores at least part of the second data.
The parameter server is a node mainly used for storage and has an independent memory space. The parameter server may be an independently existing physical node or a logical virtual node. For example, a parameter server may refer to a block of independent memory space in a computing device. The parameter server and the computing node may be located in the same computing device or in different computing devices.
The parameter server can be viewed as a global hash look-up table. As an example, a hash table may be constructed by using a target field value of the second data in the second data table as a key and using the number of the parameter server as a value; and distributing the second data to each parameter server according to the constructed hash table.
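As an illustrative sketch (not part of the claimed method), the dispersion of second data across parameter servers by hashing the target field value can be modeled as follows; the field name `user_id` and the in-memory representation are assumptions for the example:

```python
# Sketch: distribute second-table rows across parameter servers by hashing
# the target field value, so rows sharing a target field value land on the
# same server. Names here are illustrative, not from the patent itself.

def build_dispatch_table(rows, num_servers, key_field="user_id"):
    """Map each target field value to a parameter-server number and
    group rows accordingly."""
    servers = {i: [] for i in range(num_servers)}
    dispatch = {}  # target field value -> parameter-server number
    for row in rows:
        key = row[key_field]
        server = dispatch.setdefault(key, hash(key) % num_servers)
        servers[server].append(row)
    return dispatch, servers

rows = [
    {"user_id": "000001", "sex": 0, "age": 22},
    {"user_id": "000002", "sex": 1, "age": 35},
    {"user_id": "000001", "sex": 0, "age": 22},
]
dispatch, servers = build_dispatch_table(rows, num_servers=3)
```

Because the server number is a pure function of the target field value, any computing node can later recompute it to know which parameter server to query.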
In step S220, for at least one piece of first data in a block allocated to the computing node, the computing node may acquire second data from a parameter server storing the second data identical to a target field value in the first data, and process the first data based on the second data and a preset data processing rule.
The preset data processing rules may include, but are not limited to, a table-splicing operation, which may include, but is not limited to, left splicing (left join), intersection splicing (inner join), and union splicing (outer join). Taking left splicing as an example, the field values of one or more second fields in the second data may be spliced into the first data as new field values of the first data.
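A minimal sketch of the left-splicing case described above, in which selected fields of the second data (user info) are appended to the first data (consumption records); the field names are assumptions for illustration:

```python
# Left-join style splice: for each first-table row, look up the
# second-table record with the same target field value and copy selected
# fields in; rows with no match keep the new fields as None.

def left_splice(first_rows, second_index, key_field, fields):
    out = []
    for row in first_rows:
        match = second_index.get(row[key_field], {})
        spliced = dict(row)  # do not mutate the input row
        for f in fields:
            spliced[f] = match.get(f)
        out.append(spliced)
    return out

records = [{"user_id": "000001", "price": 10.0},
           {"user_id": "000009", "price": 3.5}]
info = {"000001": {"sex": 0, "age": 22}}
result = left_splice(records, info, "user_id", ["sex", "age"])
```

In the distributed setting, `second_index` stands in for lookups answered by the parameter servers rather than a local dictionary.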
The process of segmenting the second data table and of the computing nodes acquiring the second data does not depend on how the first data table was grouped, so the method is applicable to data distributed based on the method shown in fig. 1. Nor does it impose any requirement on the second data table (for example, that its data volume must not be excessive).
The parameter-server-based approach incurs additional overhead. Taking the table-splicing operation as an example, the second data table is sent over the network once when the hash table is constructed and once again during splicing, and when the first data table has far more rows than the second (right) table, several times that network overhead may be generated. However, this overhead does not affect the practicality of the method: first, the parameter-server-based method greatly accelerates processing under data skew; second, when the second data table is comparable in size to the first data table, the overhead is incurred only once, which is acceptable. When the second data table is small, even if several times the network overhead is generated, the amount transmitted is not large, and the total of all overhead does not exceed the size of the first data table, which is likewise acceptable.
As an example, second data with the same target field value may be stored on the same parameter server, and the parameter server may further process the second data with the same target field value according to the preset data processing rule to obtain an intermediate processing result for that target field value. The computing node may then obtain the intermediate processing result from the parameter server storing the intermediate processing result corresponding to the target field value in the first data, and process the first data based on it. Thus, when splicing would involve multiple data records (i.e., the target field value is not unique in the second data table), the spliced content is an aggregate over the records with the same target field value in the second data table, such as the sum or mean of a numeric column. Because the parameter server can perform this aggregation in advance as the second data table is dispersed, skewed keys are already merged once dispersion is complete, and no single parameter server node comes under excessive pressure.
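The pre-aggregation idea can be sketched as follows (an illustrative toy, assuming the aggregate to splice is a per-key mean of a numeric column): the parameter server folds incoming rows with the same target field value into a constant-size intermediate state, so a skewed key never accumulates raw rows on one server.

```python
# A parameter server that keeps only (sum, count) per target field value,
# rather than the raw second-table rows. The mean can be served to
# computing nodes as the intermediate processing result.

class AggregatingServer:
    def __init__(self):
        self.state = {}  # key -> (running sum, row count)

    def put(self, key, value):
        s, c = self.state.get(key, (0.0, 0))
        self.state[key] = (s + value, c + 1)

    def mean(self, key):
        s, c = self.state[key]
        return s / c

server = AggregatingServer()
for key, price in [("000001", 10.0), ("000001", 30.0), ("000002", 5.0)]:
    server.put(key, price)
```

Even if one key appeared in millions of rows, its footprint on the server would remain two numbers.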
In rare cases, even after merging, the data volume of a single key in the second data table grows linearly, so a large amount of data ends up stored on a certain parameter server, and a computing node that frequently requests data from that parameter server while processing the first data table will become a bottleneck. To this end, the invention proposes that the data obtained from the parameter server (second data or intermediate processing results) can be saved in the computing node. For example, a cache pool of a preset size may be set up in the computing node in advance; each piece of data the computing node requests from the parameter server is stored in the cache pool, and once the pool is full, the data record unused for the longest time is evicted. When the second data table has a skewed key, that key's value is then very likely already in the local cache pool, so network requests to the parameter server need not be sent frequently, which can greatly improve data processing efficiency.
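A minimal sketch of such a cache pool, built on `collections.OrderedDict` with least-recently-used eviction; the `fetch` callable stands in for the network request to a parameter server, and all names are assumptions for illustration:

```python
from collections import OrderedDict

class CachePool:
    """LRU cache pool for values fetched from a parameter server."""

    def __init__(self, capacity, fetch):
        self.capacity = capacity
        self.fetch = fetch      # callable: key -> value (network request)
        self.pool = OrderedDict()
        self.requests = 0       # network round trips actually made

    def get(self, key):
        if key in self.pool:
            self.pool.move_to_end(key)     # mark as most recently used
            return self.pool[key]
        self.requests += 1
        value = self.fetch(key)
        self.pool[key] = value
        if len(self.pool) > self.capacity:
            self.pool.popitem(last=False)  # evict least recently used
        return value

cache = CachePool(capacity=2, fetch=lambda k: {"age": 22})
for k in ["000001", "000001", "000002", "000001"]:
    cache.get(k)
```

A skewed key is requested over the network only on its first occurrence; every later occurrence is served locally.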
Therefore, when the computing node needs to acquire second data, it may first determine whether second data with the same target field value as the first data already exists in the computing node; if the second data exists in the computing node, the first data is processed based on the second data stored in the computing node and the preset data processing rule; and/or, if the second data does not exist in the computing node, the operation of acquiring the second data from the parameter server storing the second data with the same target field value as the first data is performed.
In an embodiment of the present invention, for the second data table that needs to be used when processing the first data table, it may first be determined whether the data amount of the second data table is greater than the first threshold. If the data amount of the second data table is less than or equal to the first threshold, the second data in the second data table may be stored directly in each first computing node, so that the first computing node processes the first data directly based on the locally stored second data and the preset data processing rule.
If the data amount of the second data table is greater than the first threshold, the second data table may be segmented according to the method shown in fig. 1. When data processing needs to be performed by combining two data tables, data with the same target field value in the two data tables needs to be combined for processing, and for example, a table splicing operation is taken as an example, two pieces of data (i.e., two pieces of data with the same target field value in the two data tables) corresponding to the target field can be spliced. For this reason, when the second data table is partitioned, the second data table may be partitioned by using the partitioning point of the first data table, the second data table may also be partitioned into the first number of blocks, and then the blocks may be allocated to the second number of computing nodes for processing. The blocks obtained by splitting the second data table may be distributed according to the distribution manner of the blocks obtained by splitting the first data table. Thus, the first data and the second data having the same target field value can be assigned to the same compute node.
For at least one piece of first data assigned to the computing node, the computing node may acquire second data from the computing node that stores the second data that is the same as a target field value in the first data, and process the first data based on the second data and a preset data processing rule. The second data table is segmented and distributed according to the segmentation and distribution mode of the first data table, so that when the computing node needs to obtain second data which is the same as a target field value in the first data, at least part of the second data can be obtained locally, and network overhead is reduced.
In the present invention, the second data table may be allocated with reference to the allocation manner of the first data table. As described above, in the case where there is no target field value whose frequency is greater than or equal to the second threshold, pieces of first data with the same target field value in the first data table may be allocated to the same computing node. In this case, pieces of second data with the same target field value in the second data table may likewise be allocated following the allocation manner of the first data table, so that second data and first data with the same target field value are allocated to the same computing node.
The data processing method of the present invention will be further described with reference to specific application examples.
Assuming that a consumption record table A and a user information table B exist at present, the data processing task is to count the current consumption average value of the user in the last 24 hours for each consumption of each user, then the age and gender information of the user is spliced to the table A, and finally the table A is used for machine learning model training.
Before computation begins, one or more computing nodes for data processing may be requested from a processing resource (e.g., a distributed cluster), each computing node being allocated a certain amount of memory. One or more parameter servers may also be requested. Parameter servers and computing nodes are logically distinct nodes that have separate memory spaces but may reside on the same computer. As shown in fig. 3A, 3 computing nodes and 3 parameter servers may be requested. Preferably, parameter server 1 and computing node 1 may be located on the same computer 1, parameter server 2 and computing node 2 on the same computer 2, and parameter server 3 and computing node 3 on the same computer 3.
Tables a and B may be stored in the cloud or on some storage medium accessible to the processor before the computation begins. The data formats of table a and table B are as follows.
TABLE A, 30000 rows, 4 columns:

Consumption time      User id   Commodity   Price
2019-10-28 10:00:01   000001    00000001    10.00
TABLE B, 300 rows, 3 columns:

User id   Sex   Age
000001    0     22
The compute nodes are the nodes that mainly run the compute logic, and the parameter server nodes are only used to build the global hash table. As shown in fig. 3B, all of tables a and B may be read into the memory of the compute node at first, and as in a general distributed system, data is stored in each compute node by dividing rows.
For the first operation of the data processing task: for each consumption record of each user, compute that user's average consumption value over the preceding 24 hours. This operation turns table A into a new 30000 × 5 table, which is equivalent to appending a column of fields (i.e., features).
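Operation one can be sketched as follows (a toy, single-node version; second-based timestamps and the field names are assumptions for illustration):

```python
from collections import deque

def trailing_mean(rows, window=24 * 3600):
    """For each row, compute the per-user mean price over the preceding
    24 hours (including the current record). Rows must be sorted by time
    within each user; appends a 'mean_24h' field to each row."""
    windows = {}  # user id -> deque of (time, price) inside the window
    for row in rows:
        q = windows.setdefault(row["user_id"], deque())
        q.append((row["time"], row["price"]))
        while q[0][0] < row["time"] - window:
            q.popleft()               # drop records older than 24 hours
        row["mean_24h"] = sum(p for _, p in q) / len(q)
    return rows

rows = [{"user_id": "000001", "time": 0,      "price": 10.0},
        {"user_id": "000001", "time": 3600,   "price": 30.0},
        {"user_id": "000001", "time": 200000, "price": 5.0}]
trailing_mean(rows)
```

In the distributed setting, the grouping by user id described next ensures that each computing node can run this window computation on its local rows alone.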
Because the consumption of the same user within the last 24 hours must be counted, the data are first grouped by user id, and the user id is taken as the target field value for constructing cut points. There are 3 computing nodes in total, so 29 cut points can be selected to divide the data evenly into 30 parts using a weight-based cut-point algorithm (such as a weighted quantile sketch). This process only requires each computing node to scan its locally stored data, after which the local results are collected and merged into a global result (29 point values). The 29 cut points can be reused as the cut points of the user id of table A, so they can be stored in the cloud at the outset and fetched directly when the data are read, avoiding extra computation.
The data in table A are then bucketed according to the cut points: data whose user id falls in the first 10 buckets are allocated to computing node 1, the middle 10 buckets to computing node 2, and the last 10 buckets to computing node 3. Each computing node then holds approximately 10000 rows of table A, data with the same user id are guaranteed to be on the same computing node, and after grouping locally by id, the window mean is computed in time order.
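The bucketing step can be sketched with a binary search over the sorted cut points; here a toy version with 2 cut points, 3 buckets, and 3 nodes (one bucket per node), where the cut-point values are assumptions for illustration:

```python
import bisect

def assign_node(key, cut_points, buckets_per_node):
    """Find the bucket of a user id among sorted cut points, then map
    consecutive runs of buckets onto computing nodes."""
    bucket = bisect.bisect_right(cut_points, key)
    return bucket // buckets_per_node

cut_points = ["000100", "000200"]   # 2 cut points -> 3 buckets
node = assign_node("000150", cut_points, buckets_per_node=1)
```

With the 29 cut points of the example, `buckets_per_node` would be 10, mapping buckets 0-9, 10-19, and 20-29 to computing nodes 1, 2, and 3.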
When table A exhibits data skew — for example, the consumption records of user 000001 occupy 15000 rows — then after bucketing, about 10000 of user 000001's consumption records are allocated to computing node 1 and 5000 to computing node 2. A small error arises when computing the window mean of the data at the boundary between the two, but this is tolerable for the machine learning algorithm.
Operation two is processed next: the information of table B is spliced into table A, with the following expected result:

Consumption time      User id   Commodity   Price   Sex   Age
2019-10-28 10:00:01   000001    00000001    10.00   0     22
The data of table B may first be dispersed over the parameter servers with the user id as the key. As shown in fig. 3C, a simple dispersion method is to take the remainder of each user id in table B divided by 3 and send that piece of data to the parameter server with the corresponding number, thereby dispersing the data roughly evenly across the parameter servers' memory. In the figure, table B is omitted from the computing nodes in order to show the current operation more clearly, but it may still be used by subsequent operations; table B on the parameter servers is only temporary data for the current operation, and whether either copy persists during and after the operation may be decided as the case requires.
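The modulo dispersion rule above is trivially computable by every node, which is what lets a computing node later find the right parameter server without any lookup. As a sketch:

```python
# The user id (read as an integer) modulo the number of parameter
# servers picks the server holding that id's table B row.

def server_for(user_id, num_servers=3):
    return int(user_id) % num_servers

assignments = {uid: server_for(uid) for uid in ["000001", "000002", "000003"]}
```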
As shown in fig. 3D, the computing node next processes each row of table A: according to the user id of the current row, it sends a network request to the parameter server node where that id resides (computed from the remainder of division by 3); the parameter server returns the sex and age information corresponding to the current id; and the computing node splices this information into the current row and adds the table B information for that id to a local cache.
Before sending the request, the computing node checks whether the current id is already in the local cache; if so, no network request needs to be sent. After processing, if the current id is not in the cache, it is added, and if the preset cache space is insufficient, the one or more pieces of id information unused for the longest time are evicted.
When the preset cache space is large enough (because the data amount of the table B is small), after a period of processing, the data of the table B can be completely stored in the cache of each computing node, and all subsequent processing can extract information from the cache.
Suppose instead that table B is another consumption record table and the maximum value consumed by each user in table B is to be spliced into table A. Then, when table B is uploaded to the parameter servers, only the maximum value per id needs to be maintained, so even if table B is skewed, no skew arises on the parameter servers.
If the content of table B to be spliced is a value that cannot be merged, some skew will exist on the parameter servers; but thanks to the cache, the skewed value is very likely to remain in the cache, so the data is not frequently and repeatedly requested from the parameter server.
The data processing method of the present invention can also be implemented as a data processing apparatus. Fig. 4 shows a block diagram of a data processing apparatus according to an exemplary embodiment of the present invention. The functional units of the data processing apparatus may be implemented by hardware, software, or a combination of hardware and software implementing the principles of the present invention. It will be appreciated by those skilled in the art that the functional units described in fig. 4 may be combined or divided into sub-units to implement the principles of the invention described above. Thus, the description herein may support any possible combination, division, or further definition of the functional units described herein.
In the following, functional units that the data processing apparatus may have and operations that each functional unit may perform are briefly described, and for details related thereto, reference may be made to the above-mentioned related description, which is not described herein again.
Referring to fig. 4, the data processing apparatus 400 includes an acquisition module 410, a cutting module 420, and an allocation module 430. The obtaining module 410 is configured to obtain a cut point of the first data table.
The slicing module 420 is configured to slice the first data table using the slicing point to obtain a first number of blocks. The allocating module 430 is configured to allocate a first number of blocks to a second number of computing nodes for processing, where the first number is greater than or equal to the second number.
The obtaining module 410 may obtain the pre-calculated dividing point of the first data table from the outside, or may calculate the dividing point of the first data table in real time. Optionally, the obtaining module may include a setting module and a calculating module. The setting module is used for setting an additional value for a target field value in each piece of first data in the first data table, and the additional value and the target field value form a new target field value; the calculation module is used for calculating the quantile points of a plurality of new target field values by using a quantile algorithm, and the quantile points are the segmentation points. The calculation module can calculate the quantiles of a plurality of new target field values by using a quantile algorithm based on weight, so that the data size of different blocks obtained after segmentation based on the quantiles is the same or basically the same, wherein the weight is used for representing the data size of single data.
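The setting module's salting step can be sketched as follows (an illustrative toy: using a running counter per key as the additional value is an assumption, not prescribed by the text). Appending an additional value makes identical (skewed) target field values distinct, so a quantile algorithm can split them across nodes:

```python
from collections import defaultdict

def salt_keys(keys):
    """Append a per-key running counter to each target field value,
    forming a new target field value (key, counter)."""
    counters = defaultdict(int)
    salted = []
    for k in keys:
        salted.append((k, counters[k]))  # new target field value
        counters[k] += 1
    return salted

new_keys = salt_keys(["000001", "000001", "000001", "000002"])
```

Because the new keys sort first by the original key, quantile cut points over them still keep each original key's rows in contiguous runs while allowing a skewed key to straddle a cut point.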
The allocation module 430 may further store the second data in the second data table on one or more parameter servers, each parameter server storing at least a part of the second data, for at least one piece of the first data in the block allocated to the computing node, the computing node acquires the second data from the parameter server storing the second data identical to the target field value in the first data, and processes the first data based on the second data and a preset data processing rule. Taking the data processing rule as an example of table-splicing operation, the computing node may splice field values of one or more second fields in the second data into the first data as new field values in the first data.
Second data with the same target field value can be stored in the same parameter server, and the parameter server processes the second data with the same target field value according to a preset data processing rule to obtain an intermediate processing result of the target field value, wherein the computing node acquires the intermediate processing result from the parameter server storing the intermediate processing result corresponding to the target field value in the first data, stores the intermediate processing result in the computing node, and processes the first data based on the intermediate processing result.
The computing node may also store second data obtained from the parameter server. The data processing apparatus 400 may further include a first judgment module. The first judging module is used for judging whether second data which is the same as the target field value in the first data exists in the computing node or not; if the second data exists in the computing node, the computing node processes the first data based on the second data stored in the computing node and a preset data processing rule; and/or if the second data does not exist in the computing node, the computing node acquires the second data from the parameter server which stores the second data with the same target field value as that in the first data.
The data processing apparatus 400 may further include a second determining module, configured to determine whether the data amount of the second data table is greater than the first threshold; the allocation module is further configured to, if the data amount of the second data table is less than or equal to the first threshold, store the second data in the second data table to each of the first computing nodes, so that the first computing nodes process the first data based on the second data and a preset data processing rule.
If the data size of the second data table is larger than the first threshold, the segmentation module may segment the second data table using the segmentation point of the first data table, and the allocation module may allocate the blocks obtained by segmentation to a second number of computing nodes for processing; for at least one piece of first data assigned to the computing node, the computing node acquires second data from the computing node storing the second data identical to a target field value in the first data, and processes the first data based on the second data and a preset data processing rule.
The data processing apparatus 400 may further include a third determining module, configured to determine whether a target field value with a frequency greater than or equal to a second threshold exists in the first data table; if the target field value with the frequency number larger than or equal to the second threshold value exists, the obtaining module obtains the segmentation points of the first data table, the segmentation module uses the segmentation points to segment the first data table, and the distribution module distributes the first number of blocks to the second number of calculation nodes for processing. If the target field value with the frequency number larger than or equal to the second threshold value does not exist, the distribution module distributes the plurality of pieces of first data in the first data table to one or more computing nodes, wherein the data with the same target field value are distributed to the same computing node.
In a case where the allocation module allocates data having the same target field value in the first data table to the same compute node, the allocation module may allocate a plurality of pieces of second data in a second data table to one or more compute nodes in the same allocation manner for the second data table, where the second data having the same target field value and the first data are allocated to the same compute node.
It should be understood that the specific implementation manner of the data processing apparatus 400 according to the exemplary embodiment of the present invention can be implemented by referring to the related description for the data processing method in conjunction with fig. 1 to 3D, and is not described herein again.
The data processing method and apparatus according to the exemplary embodiment of the present invention are described above with reference to fig. 1 to 4. It is to be understood that the above-described method may be implemented by a program recorded on a computer-readable medium, for example, according to an exemplary embodiment of the present invention, there may be provided a computer-readable storage medium storing instructions, wherein a computer program for executing a data processing method of the present invention (for example, shown in fig. 1) is recorded on the computer-readable medium.
The computer program in the computer-readable medium may be executed in an environment deployed in a computer device such as a client, a host, a proxy device, a server, and the like, and it should be noted that the computer program may be used to perform, in addition to the steps shown in fig. 1, additional steps other than the above steps or perform more specific processing when the above steps are performed, and the contents of the additional steps and the further processing are described with reference to fig. 1 and 2, and will not be described again to avoid repetition.
It should be noted that the data processing apparatus according to the exemplary embodiment of the present invention may fully rely on the execution of the computer program to realize the corresponding functions, that is, each apparatus corresponds to each step in the functional architecture of the computer program, so that the whole apparatus is called by a special software package (e.g., lib library) to realize the corresponding functions.
Alternatively, the various means shown in fig. 4 may be implemented by hardware, software, firmware, middleware, microcode, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments to perform the corresponding operations may be stored in a computer-readable medium such as a storage medium, so that a processor may perform the corresponding operations by reading and executing the corresponding program code or code segments.
For example, exemplary embodiments of the present invention may also be implemented as a computing device including a storage component having stored therein a set of computer-executable instructions that, when executed by a processor, perform a data processing method.
In particular, the computing devices may be deployed in servers or clients, as well as on node devices in a distributed network environment. Further, the computing device may be a PC computer, tablet device, personal digital assistant, smart phone, web application, or other device capable of executing the set of instructions described above.
The computing device need not be a single computing device, but can be any device or collection of circuits capable of executing the instructions (or sets of instructions) described above, individually or in combination. The computing device may also be part of an integrated control system or system manager, or may be configured as a portable electronic device that interfaces with local or remote (e.g., via wireless transmission).
In the computing device, the processor may include a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), a programmable logic device, a special purpose processor system, a microcontroller, or a microprocessor. By way of example, and not limitation, processors may also include analog processors, digital processors, microprocessors, multi-core processors, processor arrays, network processors, and the like.
Some of the operations described in the data processing method according to the exemplary embodiment of the present invention may be implemented by software, some of the operations may be implemented by hardware, and further, the operations may be implemented by a combination of hardware and software.
The processor may execute instructions or code stored in one of the memory components, which may also store data. Instructions and data may also be transmitted and received over a network via a network interface device, which may employ any known transmission protocol.
The memory component may be integral to the processor, e.g., having RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, the storage component may comprise a stand-alone device, such as an external disk drive, storage array, or any other storage device usable by a database system. The storage component and the processor may be operatively coupled or may communicate with each other, such as through an I/O port, a network connection, etc., so that the processor can read files stored in the storage component.
Further, the computing device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the computing device may be connected to each other via a bus and/or a network.
Operations involved in a data processing method according to an exemplary embodiment of the present invention may be described as various interconnected or coupled functional blocks or functional diagrams. However, these functional blocks or functional diagrams may equally be integrated into a single logic device or operated with non-exact boundaries.
For example, as described above, a data processing apparatus according to an exemplary embodiment of the present invention may include a storage part and a processor, wherein the storage part stores therein a set of computer-executable instructions that, when executed by the processor, performs the above-mentioned data processing method.
While exemplary embodiments of the invention have been described above, it should be understood that the above description is illustrative only and not exhaustive, and that the invention is not limited to the exemplary embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. Therefore, the protection scope of the present invention should be subject to the scope of the claims.

Claims (10)

1. A method of data processing, comprising:
acquiring a dividing point of a first data table;
segmenting the first data table by using the segmentation point to obtain a first number of blocks;
and allocating the first number of blocks to a second number of computing nodes for processing, wherein the first number is greater than or equal to the second number.
2. The method of claim 1, wherein obtaining the cut point of the first data table comprises:
setting an additional value for a target field value in each piece of first data in the first data table, wherein the additional value and the target field value form a new target field value;
and calculating the quantile points of the new target field values by using a quantile algorithm, wherein the quantile points are the cut points.
3. The method of claim 2, wherein the step of calculating quantiles for a plurality of new target field values using a quantile algorithm comprises:
and calculating quantiles of a plurality of new target field values by using a quantile algorithm based on weight, so that the data size of different blocks obtained after segmentation based on the quantiles is the same or basically the same, wherein the weight is used for representing the data size of single data.
4. The method of claim 1, further comprising:
storing second data of a second data table on one or more parameter servers, each of the parameter servers storing at least a portion of the second data;
and, for at least one piece of first data in a block allocated to a computing node, the computing node acquiring, from the parameter server storing the corresponding second data, second data whose target field value is the same as the target field value in the first data, and processing the first data based on the second data and a preset data processing rule.
5. The method of claim 4, wherein processing the first data based on the second data and the preset data processing rule comprises:
splicing field values of one or more second fields in the second data into the first data as new field values of the first data.
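The splicing of claim 5 is essentially a keyed join that appends selected second-table fields onto each first-table record. A minimal sketch, assuming the second data has already been indexed by target field value (the index shape, field names, and in-place mutation are illustrative choices, not the patented form):

```python
def splice_fields(first_rows, second_index, key, fields):
    """For each first-table record, look up the second-table record with
    the same target field value and splice selected fields into it."""
    for row in first_rows:
        match = second_index.get(row[key])
        if match is not None:
            for f in fields:
                row[f] = match[f]   # new field value spliced into first data
    return first_rows

first = [{"uid": 1, "amount": 10}, {"uid": 2, "amount": 20}]
second = [{"uid": 1, "city": "SH", "age": 30}]
second_index = {r["uid"]: r for r in second}   # index by target field value
out = splice_fields(first, second_index, key="uid", fields=["city"])
```

Records with no matching second data are left unchanged here; the claim itself does not prescribe the unmatched-record behavior.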
6. The method of claim 4, further comprising:
and storing, in the computing node, the second data acquired from the parameter server.
7. The method of claim 6, further comprising:
determining whether second data having the same target field value as the first data exists in the computing node;
and, if the second data exists in the computing node, processing the first data based on the second data stored in the computing node and the preset data processing rule; and/or, if the second data does not exist in the computing node, performing the operation of acquiring, from the parameter server storing the second data, the second data having the same target field value as the first data.
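Claims 6 and 7 together describe a look-aside cache on the computing node: check the local copy first, and only on a miss fetch from the parameter server and retain the result. A hedged sketch of that flow, with a plain dict standing in for the parameter server (class and method names are assumptions for the example):

```python
class ComputeNode:
    """Computing node that checks its local store before falling
    back to the parameter server (claims 6 and 7)."""

    def __init__(self, parameter_server):
        self.parameter_server = parameter_server  # dict-like: key -> record
        self.cache = {}                           # second data kept locally

    def lookup(self, key):
        if key in self.cache:                     # claim 7: local hit
            return self.cache[key]
        record = self.parameter_server.get(key)   # claim 4: remote fetch
        if record is not None:
            self.cache[key] = record              # claim 6: store locally
        return record

node = ComputeNode({"k1": {"v": 1}})
first_hit = node.lookup("k1")    # fetched remotely, then cached
second_hit = node.lookup("k1")   # served from the local cache
```

The payoff is that repeated first-table records with the same target field value cost only one parameter-server round trip.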
8. A data processing apparatus comprising:
an acquisition module, configured to acquire split points of a first data table;
a splitting module, configured to split the first data table at the split points to obtain a first number of blocks;
and an allocation module, configured to allocate the first number of blocks to a second number of computing nodes for processing, wherein the first number is greater than or equal to the second number.
9. A system comprising at least one computing device and at least one storage device storing instructions that, when executed by the at least one computing device, cause the at least one computing device to perform the method of any of claims 1 to 7.
10. A computer-readable storage medium storing instructions that, when executed by at least one computing device, cause the at least one computing device to perform the method of any of claims 1 to 7.
CN202010406485.XA 2020-05-13 2020-05-13 Data processing method and device Active CN111611243B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010406485.XA CN111611243B (en) 2020-05-13 2020-05-13 Data processing method and device

Publications (2)

Publication Number Publication Date
CN111611243A true CN111611243A (en) 2020-09-01
CN111611243B CN111611243B (en) 2023-06-13

Family

ID=72204507

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010406485.XA Active CN111611243B (en) 2020-05-13 2020-05-13 Data processing method and device

Country Status (1)

Country Link
CN (1) CN111611243B (en)


Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201515950D0 (en) * 2015-09-09 2015-10-21 Ibm Method for processing large data tables
CN105095413A (en) * 2015-07-09 2015-11-25 北京京东尚科信息技术有限公司 Method and apparatus for solving data skew
WO2016099578A1 (en) * 2014-12-19 2016-06-23 Medidata Solutions, Inc. Method and system for linking heterogeneous data sources
CN105740063A (en) * 2014-12-08 2016-07-06 杭州华为数字技术有限公司 Data processing method and apparatus
CN109033295A (en) * 2018-07-13 2018-12-18 成都亚信网络安全产业技术研究院有限公司 The merging method and device of super large data set
CN109325026A (en) * 2018-08-14 2019-02-12 中国平安人寿保险股份有限公司 Data processing method, device, equipment and medium based on big data platform
CN110222048A (en) * 2019-05-06 2019-09-10 平安科技(深圳)有限公司 Sequence generating method, device, computer equipment and storage medium
CN110245140A (en) * 2019-06-12 2019-09-17 同盾控股有限公司 Data binning processing method and device, electronic device and computer-readable medium
CN110597879A (en) * 2019-09-17 2019-12-20 第四范式(北京)技术有限公司 Method and device for processing time series data
CN110908999A (en) * 2019-11-18 2020-03-24 北京明略软件系统有限公司 Data acquisition mode determining method and device, storage medium and electronic device


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
ZHAO Fei et al., "Extended Application of the Consistent Hashing Algorithm on Database Clusters", Journal of Chengdu University of Information Technology *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487177A (en) * 2020-12-17 2021-03-12 杭州火石数智科技有限公司 Reverse de-duplication method for self-adaptive bucket separation of massive short texts
CN112487177B (en) * 2020-12-17 2022-05-10 杭州火石数智科技有限公司 Reverse de-duplication method for self-adaptive bucket separation of massive short texts


Similar Documents

Publication Publication Date Title
US20190356474A1 (en) Layout-independent cryptographic stamp of a distributed dataset
US9495197B2 (en) Reliable and scalable image transfer for data centers with low connectivity using redundancy detection
CN109783237B (en) Resource allocation method and device
US9489231B2 (en) Selecting provisioning targets for new virtual machine instances
CN106056529B (en) Method and equipment for training convolutional neural network for picture recognition
CN111274252B (en) Block chain data uplink method and device, storage medium and server
EP4018396A1 (en) Trivalent lattice scheme to identify flag qubit outcomes
CN110597879B (en) Method and device for processing time series data
US8452900B2 (en) Dynamic compression of an I/O data block
CN108833592A (en) Cloud host schedules device optimization method, device, equipment and storage medium
CN106909556A (en) The storage equalization methods and device of main memory cluster
CN111611243B (en) Data processing method and device
US9135749B2 (en) Method and apparatus for processing three-dimensional model data
CN110909085A (en) Data processing method, device, equipment and storage medium
CN112835511A (en) Data writing method, device, equipment and medium of distributed storage cluster
CN113806354B (en) Method and device for realizing time sequence feature extraction
CN113590703A (en) ES data importing method and device, electronic equipment and readable storage medium
US11281935B2 (en) 3D object detection from calibrated 2D images
CN114265664A (en) Edge application deployment method, device, equipment and storage medium
CN112799820A (en) Data processing method, data processing apparatus, electronic device, storage medium, and program product
CN113836238A (en) Batch processing method and device for data commands
CN109617954B (en) Method and device for creating cloud host
CN111813761A (en) Database management method and device and computer storage medium
CN111984637A (en) Missing value processing method and device in data modeling, equipment and storage medium
WO2024066562A1 (en) Method and apparatus for solving an objective function, and computing device cluster

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant