CN111611243B - Data processing method and device

Data processing method and device

Info

Publication number
CN111611243B
Authority
CN
China
Prior art keywords
data, target field, computing node, field value, same
Prior art date
Legal status
Active
Application number
CN202010406485.XA
Other languages
Chinese (zh)
Other versions
CN111611243A (en)
Inventor
焦英翔
石光川
Current Assignee
4Paradigm Beijing Technology Co Ltd
Original Assignee
4Paradigm Beijing Technology Co Ltd
Application filed by 4Paradigm Beijing Technology Co Ltd
Priority to CN202010406485.XA
Publication of CN111611243A
Application granted
Publication of CN111611243B
Status: Active

Classifications

    • G06F 16/2282 Tablespace storage structures; Management thereof
    • G06F 16/278 Data partitioning, e.g. horizontal or vertical partitioning
    • G06N 20/00 Machine learning

Abstract

A data processing method and device are disclosed. The method includes: obtaining split points of a first data table; splitting the first data table using the split points to obtain a first number of blocks; and allocating the first number of blocks to a second number of computing nodes for processing, where the first number is greater than or equal to the second number. Data processing efficiency can thereby be improved.

Description

Data processing method and device
Technical Field
The present invention relates generally to the field of machine learning, and more particularly, to a data processing method and apparatus.
Background
In the field of machine learning, the data required for modeling is typically a data set stored in rows. To increase data processing efficiency, a distributed system with multiple computing nodes is typically used to process the data in parallel.
Data splitting is a key step in improving data processing efficiency with a distributed system. Poorly chosen splitting can reduce the data processing efficiency of the distributed system.
Disclosure of Invention
Exemplary embodiments of the present invention are directed to a data processing scheme capable of splitting data effectively.
According to a first aspect of the present invention, there is provided a data processing method comprising: obtaining split points of a first data table; splitting the first data table using the split points to obtain a first number of blocks; and allocating the first number of blocks to a second number of computing nodes for processing, wherein the first number is greater than or equal to the second number.
Optionally, obtaining the split points of the first data table includes: for each piece of first data in the first data table, setting an additional value for the target field value in the first data, the additional value and the target field value together forming a new target field value; and calculating quantiles of the plurality of new target field values using a quantile algorithm, the quantiles being the split points.
Optionally, calculating the quantiles of the plurality of new target field values using a quantile algorithm comprises: calculating the quantiles using a weight-based quantile algorithm, so that the data sizes of the different blocks obtained by splitting at the quantiles are the same or substantially the same, wherein a weight characterizes the data size of a single piece of data.
Optionally, the method further comprises: storing second data in a second data table on one or more parameter servers, each parameter server storing at least a portion of the second data; and, for at least one piece of first data in a block allocated to a computing node, the computing node obtaining the second data from the parameter server that stores second data with the same target field value as the first data, and processing the first data based on the second data and a preset data processing rule.
Optionally, processing the first data based on the second data and a preset data processing rule includes: splicing the field values of one or more second fields in the second data into the first data as new field values of the first data.
Optionally, the method further comprises: the second data obtained from the parameter server is stored in the computing node.
Optionally, the method further comprises: determining whether second data with the same target field value as the first data exists in the computing node; if the second data exists in the computing node, processing the first data based on the second data stored in the computing node and a preset data processing rule; and/or, if the second data does not exist in the computing node, performing the operation of obtaining the second data from the parameter server that stores second data with the same target field value as the first data.
Optionally, second data with the same target field value are stored on the same parameter server, and the method further comprises: processing, by the parameter server, the second data with the same target field value according to a preset data processing rule to obtain an intermediate processing result for that target field value, wherein the computing node obtains the intermediate processing result from the parameter server storing the intermediate processing result corresponding to the target field value in the first data, saves the intermediate processing result in the computing node, and processes the first data based on the intermediate processing result.
Optionally, the method further comprises: determining whether the data volume of the second data table is greater than a first threshold; and, if the data volume of the second data table is less than or equal to the first threshold, storing the second data in the second data table to each first computing node, so that the first computing node processes the first data based on the second data and a preset data processing rule.
Optionally, the method further comprises: if the data volume of the second data table is greater than the first threshold, splitting the second data table using the split points of the first data table, and allocating the resulting blocks to the second number of computing nodes for processing; and, for at least one piece of first data allocated to a computing node, the computing node obtaining second data from the computing node that stores second data with the same target field value as the first data, and processing the first data based on the second data and a preset data processing rule.
Optionally, the method further comprises: determining whether a target field value whose frequency is greater than or equal to a second threshold exists in the first data table; and, if such a target field value exists, executing the data processing method.
Optionally, the method further comprises: if no target field value whose frequency is greater than or equal to the second threshold exists, allocating the plurality of pieces of first data in the first data table to one or more computing nodes, wherein data with the same target field value are allocated to the same computing node.
Optionally, the method further comprises: allocating the plurality of pieces of second data in the second data table to one or more computing nodes, wherein second data and first data with the same target field value are allocated to the same computing node.
According to a second aspect of the present invention, there is provided a data processing apparatus comprising: an acquisition module for obtaining split points of a first data table; a splitting module for splitting the first data table using the split points to obtain a first number of blocks; and an allocation module for allocating the first number of blocks to a second number of computing nodes for processing, wherein the first number is greater than or equal to the second number.
Optionally, the acquisition module includes: a setting module for setting, for each piece of first data in the first data table, an additional value for the target field value in the first data, the additional value and the target field value together forming a new target field value; and a calculation module for calculating quantiles of the plurality of new target field values using a quantile algorithm, the quantiles being the split points.
Optionally, the calculation module calculates the quantiles of the plurality of new target field values using a weight-based quantile algorithm, so that the data sizes of the different blocks obtained by splitting at the quantiles are the same or substantially the same, wherein a weight characterizes the data size of a single piece of data.
Optionally, the allocation module is further configured to store second data in a second data table on one or more parameter servers, each parameter server storing at least a portion of the second data; for at least one piece of first data in a block allocated to a computing node, the computing node obtains the second data from the parameter server that stores second data with the same target field value as the first data, and processes the first data based on the second data and a preset data processing rule.
Optionally, the computing node splices the field values of one or more second fields in the second data into the first data as new field values of the first data.
Optionally, the computing node also saves the second data obtained from the parameter server.
Optionally, the apparatus further comprises a first judging module for determining whether second data with the same target field value as the first data exists in the computing node; if the second data exists in the computing node, the computing node processes the first data based on the second data stored in the computing node and a preset data processing rule; and/or, if the second data does not exist in the computing node, the computing node obtains the second data from the parameter server that stores second data with the same target field value as the first data.
Optionally, second data with the same target field value are stored on the same parameter server, and the parameter server processes the second data with the same target field value according to a preset data processing rule to obtain an intermediate processing result for that target field value, wherein the computing node obtains the intermediate processing result from the parameter server storing the intermediate processing result corresponding to the target field value in the first data, saves it in the computing node, and processes the first data based on the intermediate processing result.
Optionally, the apparatus further comprises a second judging module for determining whether the data volume of the second data table is greater than a first threshold; the allocation module is further configured to store the second data in the second data table to each first computing node if the data volume of the second data table is less than or equal to the first threshold, so that the first computing node processes the first data based on the second data and a preset data processing rule.
Optionally, if the data volume of the second data table is greater than the first threshold, the splitting module splits the second data table using the split points of the first data table, and the allocation module allocates the resulting blocks to the second number of computing nodes for processing; for at least one piece of first data allocated to a computing node, the computing node obtains second data from the computing node that stores second data with the same target field value as the first data, and processes the first data based on the second data and a preset data processing rule.
Optionally, the apparatus further comprises a third judging module for determining whether a target field value whose frequency is greater than or equal to a second threshold exists in the first data table; if such a target field value exists, the acquisition module obtains the split points of the first data table, the splitting module splits the first data table using the split points, and the allocation module allocates the first number of blocks to the second number of computing nodes for processing.
Optionally, if no target field value whose frequency is greater than or equal to the second threshold exists, the allocation module allocates the plurality of pieces of first data in the first data table to one or more computing nodes, wherein data with the same target field value are allocated to the same computing node.
Optionally, the allocation module further allocates the plurality of pieces of second data in the second data table to one or more computing nodes, wherein second data and first data with the same target field value are allocated to the same computing node.
According to a third aspect of the present invention, there is also provided a system comprising at least one computing device and at least one storage device storing instructions which, when executed by the at least one computing device, cause the at least one computing device to perform the method according to the first aspect of the present invention.
According to a fourth aspect of the present invention, there is also provided a computer-readable storage medium storing instructions which, when executed by at least one computing device, cause the at least one computing device to perform the method according to the first aspect of the present invention.
In the data processing method and apparatus according to the exemplary embodiments of the present invention, the split points of the first data table are obtained, the data table is split using the split points, and the resulting blocks can be allocated to corresponding computing nodes. The computation of the split points need not depend on a manually set skew threshold, and possible skewed keys can be handled by setting an additional value for the target field value when computing the split points.
Drawings
These and/or other aspects and advantages of the present invention will become more apparent and more readily appreciated from the following detailed description of the embodiments of the invention, taken in conjunction with the accompanying drawings, wherein:
FIG. 1 illustrates a flow chart of a data processing method according to an exemplary embodiment of the invention;
FIG. 2 is a schematic diagram showing a flow of steps that may be included in the data processing method shown in FIG. 1;
FIGS. 3A-3D show a data processing flow diagram for a specific application example;
Fig. 4 shows a block diagram of a data processing apparatus according to an exemplary embodiment of the present invention.
Detailed Description
In order that those skilled in the art will better understand the present invention, exemplary embodiments of the present invention will be described in further detail with reference to the accompanying drawings and detailed description.
Fig. 1 shows a flow chart of a data processing method according to an exemplary embodiment of the invention. The method shown in fig. 1 may be implemented entirely in software by a computer program and may also be performed by a specifically configured computing device.
Referring to fig. 1, in step S110, split points of a first data table are obtained.
The first data table may be regarded as a set of pieces of first data stored in rows, each piece of first data consisting of one or more field values.
A split point indicates a position at which the plurality of pieces of first data in the first data table are split. Pre-computed split points of the first data table may be obtained from outside, or the split points may be computed on the fly. One or more split points may be obtained.
When using a distributed system to improve data processing efficiency, the usual practice is to divide the data evenly into n parts by rows and have n computing nodes process them simultaneously, increasing the overall efficiency by a factor of n. However, not every operation allows the data to be divided into n parts arbitrarily. Take consumption record data as an example: to count a user's purchases over the last day, the consumption records of the same user must be processed together, that is, they must be located on the same computing node to produce a correct result. The common practice is therefore to divide the data into n groups by the field value of some column, ensuring that data with the same field value land on the same computing node, and then have the computing nodes compute the results in parallel. The drawback is that if the data are unevenly distributed, for example the number of consumption records of some special accounts (enterprises, merchants, etc.) far exceeds that of ordinary users, the nodes holding them receive more data than the other nodes, which slows down overall computation; this is the phenomenon of data skew. Once data skew occurs, the impact is severe.
In view of this, the present invention may compute quantiles of the plurality of target field values in the first data table using a quantile algorithm and take the computed quantiles as the split points. That is, m quantiles may be selected from the plurality of target field values using a quantile algorithm, and these quantiles split the target field values into m+1 roughly equal shares. The split points cannot guarantee an exactly equal division, but the error does not exceed a small threshold. One or more split points can be selected with a single scan of the data using a quantile algorithm.
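To make this concrete, the following is a minimal sketch (not the patent's approximate quantile algorithm; the function and variable names are illustrative) of selecting m split points as equally spaced ranks of the sorted target field values. It also shows how duplicate values collapse the split points, which is the problem addressed below.

```python
def pick_split_points(target_values, m):
    """Pick m split points that divide the sorted values into m+1 roughly equal shares."""
    s = sorted(target_values)
    n = len(s)
    # the value at each i/(m+1)-th rank becomes a split point
    return [s[(i * n) // (m + 1)] for i in range(1, m + 1)]

# Example: skewed user IDs as target field values
ids = ["u01"] * 10 + ["u02"] * 10 + ["u03"] * 80
print(pick_split_points(ids, 9))   # most split points collapse onto the frequent "u03"
```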
The target field value, i.e. the field value of the target field, may be determined according to the data processing task for the first data table. For example, if the first data table is a consumption record table and the data processing task is to compute each user's average consumption over the last 24 hours based on that table, the target field is the user field and the target field value is the user ID.
If the target field values are not processed beforehand, then when one or more quantiles are selected from the plurality of target field values using a quantile algorithm, identical target field values are merged into a single value, so a possible skewed key in the data table cannot be handled. A skewed key is a target field value whose data volume is much larger than that of the other target field values.
Therefore, the invention further proposes that, for each piece of first data in the first data table, an additional value may be set for the target field value in the first data, the additional value and the target field value forming a new target field value, and the quantile algorithm is then used to compute quantiles of the plurality of new target field values. This avoids the uneven splitting caused by merging identical field values, so the data can be divided evenly even when skewed keys exist.
The additional value set for a target field value may be a random value, and different additional values may be set for pieces of first data that share the same target field value. The additional value may be appended to the target field value to form the "new target field value". In the present invention, the additional value is only used to achieve an even division in the splitting stage; in the data processing stage, i.e. when the computing nodes process the data, the target field value in the first data is still the original value, that is, the additional value does not take part in the data processing.
For example, assume 100 pieces of data whose target field values are only 10 a's, 10 b's, and 80 c's. Even if the number of split points is set to 10, the split-point algorithm produces only three distinct split points {a, b, c}; splitting by these three points puts 80% of the data into one group, so when the groups are allocated to different computing nodes the data are skewed.
If a random number is appended to the target field value of each piece of data and the quantiles are then computed, a set of split points such as {a, b, c, c, c, c, c, c} can be obtained, and the 100 pieces of data can be divided evenly into 10 parts according to this set. Each selected split point also carries the random number appended during processing, so the stored split points may look like {c.123, c.238, c.732}; when determining the bucket (i.e. block) into which a piece of data falls, the random number generated for that piece of data lets it fall randomly among the buckets whose split points share the value c.
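The following is a hedged sketch of this salting idea, under the assumption that the additional value is a fixed-width random suffix appended to the key; the helper names are illustrative and not from the patent.

```python
import bisect
import random

def salt(value, rng):
    # append a 6-digit random suffix so identical keys become distinct
    return f"{value}.{rng.randrange(10**6):06d}"

def salted_split_points(target_values, m, seed=0):
    rng = random.Random(seed)
    salted = sorted(salt(v, rng) for v in target_values)
    n = len(salted)
    return [salted[(i * n) // (m + 1)] for i in range(1, m + 1)]

def bucket_of(value, split_points, rng):
    # re-salt the raw value, then binary-search the ordered split points
    return bisect.bisect_right(split_points, salt(value, rng))

ids = ["a"] * 10 + ["b"] * 10 + ["c"] * 80
points = salted_split_points(ids, 9)
print(points)                                           # one 'b.*' point and several 'c.*' points
rng = random.Random(1)
print([bucket_of("c", points, rng) for _ in range(5)])  # rows with key 'c' spread over several buckets
```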
Considering that different pieces of first data may differ in data size, and to avoid skew caused by such differences, the invention may compute the quantiles of the plurality of new target field values using a weight-based quantile algorithm, so that the data sizes of the different blocks obtained by splitting at the quantiles are the same or substantially the same, where a weight characterizes the data size of a single piece of data. The weight-based quantile algorithm may be, but is not limited to, a weighted quantile sketch.
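Below is a simplified, exact stand-in for the weight-based quantile computation (a real system would use an approximate weighted quantile sketch); the weights stand in for the byte size of each row, and all names are assumptions for illustration.

```python
def weighted_split_points(rows, m):
    """rows: (salted_value, weight) pairs; weight ~ data size of the row."""
    rows = sorted(rows)
    total = sum(w for _, w in rows)
    targets = [total * i / (m + 1) for i in range(1, m + 1)]
    points, acc, t = [], 0.0, 0
    for value, weight in rows:
        acc += weight
        while t < m and acc >= targets[t]:
            points.append(value)        # this value crosses the next equal-weight boundary
            t += 1
    return points

rows = [(f"a.{i:03d}", 1.0) for i in range(10)] + [(f"c.{i:03d}", 5.0) for i in range(10)]
print(weighted_split_points(rows, 3))   # split points shift toward the heavy 'c' rows
```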
In step S120, the first data table is split using the split points to obtain a first number of blocks.
In step S130, the first number of blocks are allocated to a second number of computing nodes for processing, where the first number is greater than or equal to the second number.
The computing nodes used for data processing may be requested in advance, and the number of split points may be determined according to the number of computing nodes (i.e. the second number). The number of split points may be set to be greater than or equal to the number of computing nodes, for example a multiple (e.g. 10 times) of the number of computing nodes.
In the present invention, the split points obtained in step S110 may be ordered, and the blocks obtained by splitting at the ordered split points are then ordered as well. When the first number of blocks are allocated to the second number of computing nodes, it can be guaranteed that the target field values on computing node i are smaller than those on computing node i+1, that is, the first data allocated to the computing nodes are also ordered across nodes. After the first number of blocks are allocated to the second number of computing nodes, each computing node may further sort its local data to complete a global row-sorting operation.
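As an illustration of handing out the ordered buckets, the sketch below assigns contiguous runs of buckets to each computing node so that every target field value on node i sorts before every value on node i+1; the function name and the way leftover buckets are spread are assumptions, not the patent's exact rule.

```python
def assign_buckets(num_buckets, num_nodes):
    """Map each ordered bucket index to a node, keeping contiguous runs in order."""
    base, extra = divmod(num_buckets, num_nodes)
    mapping = []
    for node in range(num_nodes):
        # the first `extra` nodes take one extra bucket each
        mapping.extend([node] * (base + (1 if node < extra else 0)))
    return mapping

print(assign_buckets(10, 3))   # [0, 0, 0, 0, 1, 1, 1, 2, 2, 2]
```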
When splitting the first data table, the invention does not depend on a manually set threshold, and it guarantees that the target field values are ordered across groups, which amounts to row-sorting the data while splitting it. In addition, machine learning processes, especially automatic hyperparameter tuning, process the same data repeatedly; although computing the quantile points with a quantile algorithm brings some extra cost, the points need to be computed only once and can then be reused, so the efficiency gained in subsequent processing far exceeds that cost.
After the first data in the first data table have been distributed to the computing nodes in the above manner, the computing nodes can process the first data according to the specific data processing task. For example, a computing node may sum or subtract two columns of field values, or combine two discrete feature columns into a new feature; such operations require no cross-row access and can be computed in parallel.
When window operations are involved, it is usually necessary to group by the target field value to ensure that data with the same target field value are distributed to the same computing node. For example, when counting a user's total consumption over the last 24 hours, the data need to be grouped by user id so that the consumption records of the same user land on the same computing node. The data splitting manner of the present invention described above cannot guarantee this; in that case, the part of the quantile computation that sets an additional value (such as a random number) for the target field value may be switched off, so that data with the same target field value are allocated to the same computing node. Disabling the additional value degrades the algorithm back to the ordinary quantile computation, which achieves this purpose.
Moreover, from a practical standpoint, when window operations are involved the window size is usually far smaller than the total data size. Based on the data splitting manner of the present invention, even if some data errors occur at the boundaries between computing nodes, the final model quality is hardly affected; to some extent the possibility of overfitting is even reduced, and the quantile algorithm significantly reduces the influence of skewed data, so the improved quantile algorithm remains practical when window operations are involved.
As an example, before the first data table is split, it may be determined whether skewed data exist in the first data table. If the determination indicates that skewed data may exist, the method shown in fig. 1 may be executed to solve the data skew problem; if the determination indicates that no skewed data exist, the first data table may be split across one or more computing nodes in an existing manner. For example, it may be determined whether a target field value whose frequency is greater than or equal to a second threshold exists in the first data table. If such a target field value exists, the method shown in fig. 1 is executed; if not, the plurality of pieces of first data in the first data table may be allocated to one or more computing nodes, with data having the same target field value allocated to the same computing node. The second threshold is a parameter needed only when further efficiency gains are desired, not a parameter required to solve the skew problem; the invention does not strictly require the second threshold to be the boundary between skewed keys and ordinary keys, and even an inaccurately set second threshold does not prevent the method shown in fig. 1 from solving the data skew problem.
The data splitting method proposed by the present invention has been described in detail above with reference to fig. 1.
In the field of data processing, it is often necessary to combine two data tables for processing.
Take the table-join operation on two data tables as an example. A consumption record table records which goods users purchased at what time. If the training samples for a model also need the users' age, sex and other information, this information must be spliced from a user information table into the consumption record table according to user id; this process is the table-join operation.
A common table-join strategy is to group both data tables by the target field value (e.g. user id), so that data with the same target field value in the two tables are distributed to the same computing node, and each computing node can then perform the join on its local data. With the data splitting manner of the present invention, however, data with the same target field value may be spread over multiple computing nodes, so this join strategy is not applicable here.
For scenarios that require combining two data tables for data processing, the invention provides an implementation based on a parameter server. FIG. 2 is a flow chart of steps that may further be included in the data processing method shown in fig. 1. The method shown in fig. 2 may be implemented entirely in software by a computer program, or performed by a specially configured computing device.
Referring to fig. 2, at step S210, second data in a second data table is stored on one or more parameter servers, each of which stores at least a portion of the second data.
A parameter server is a node whose main role is storage and which has its own independent memory space. A parameter server may be a physically independent node or a logical, virtual node; for example, it may refer to a block of independent memory space in a computing device. A parameter server and a computing node may be located in the same computing device or in different computing devices.
The parameter servers may be regarded as a global hash lookup table. As an example, a hash table may be built with the target field value of the second data in the second data table as the key and the number of a parameter server as the value, and the second data are then distributed to the parameter servers according to this hash table.
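A minimal sketch of this hash-based distribution is given below; Python dictionaries stand in for the parameter servers' memory, and the use of a CRC32 hash is an assumption for illustration rather than the patent's choice.

```python
import zlib

def server_of(key, num_servers):
    # a stable hash of the target field value picks the parameter server
    return zlib.crc32(key.encode()) % num_servers

def distribute_second_table(rows, key_field, num_servers):
    servers = [dict() for _ in range(num_servers)]   # one in-memory store per "server"
    for row in rows:
        key = row[key_field]
        servers[server_of(key, num_servers)].setdefault(key, []).append(row)
    return servers

table_b = [
    {"user_id": "000001", "sex": 0, "age": 22},
    {"user_id": "000002", "sex": 1, "age": 30},
]
servers = distribute_second_table(table_b, "user_id", 3)
print([list(s.keys()) for s in servers])
```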
In step S220, for at least one piece of first data in the block allocated to a computing node, the computing node may obtain second data from the parameter server that stores second data with the same target field value as the first data, and process the first data based on the second data and a preset data processing rule.
The preset data processing rule may include, but is not limited to, a table-join operation, which may include, but is not limited to, a left join, an inner join, and an outer join. Taking the left join as an example, the field values of one or more second fields in the second data may be spliced into the first data as new field values of the first data.
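The left-join splice can be pictured with the small sketch below; the field names follow the consumption-record example used later and are otherwise assumptions.

```python
def left_join_row(first_row, second_row, fields):
    # copy the first-table row and append the chosen second-table fields;
    # a left join keeps the row even when there is no matching second record
    out = dict(first_row)
    for f in fields:
        out[f] = second_row.get(f) if second_row is not None else None
    return out

record  = {"time": "2019-10-28 10:00:01", "user_id": "000001", "price": 10.00}
profile = {"user_id": "000001", "sex": 0, "age": 22}
print(left_join_row(record, profile, ["sex", "age"]))
```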
The way the second data table is scattered and the way a computing node obtains second data do not depend on any grouping property of the first data table, so the scheme applies to data distributed by the method shown in fig. 1; nor does it rely on any property of the second data table (for example, that its data volume must not be too large).
The parameter-server-based approach does incur extra overhead. Taking the table-join operation as an example, the second data table is sent over the network once when the hash table is built and again when the join is performed, and when the first data table has far more rows than the right table, several times that network overhead may be incurred. In practice, however, this overhead does not limit the applicability of the method: first, the parameter-server-based approach greatly speeds up processing when the data are skewed; second, when the second data table is about the same size as the first, the network overhead is incurred only once, which is acceptable; and when the second data table is small, even several times its size in network traffic is not much data, and the total overhead never exceeds the size of the first data table, which is also acceptable.
As an example, second data with the same target field value may be stored on the same parameter server, and the parameter server may further process the second data with the same target field value according to a preset data processing rule to obtain an intermediate processing result for that target field value. The computing node then obtains the intermediate processing result from the parameter server storing the result corresponding to the target field value in the first data and processes the first data based on it. Thus, for a one-to-many join (i.e. the target field value is not unique in the second data table) in which the content to be spliced is an aggregate of the records sharing a target field value, such as the sum or mean of a numeric column, the parameter server can aggregate in advance while the second data table is being scattered; the skewed keys are merged as the key values of the second data table are scattered, so no single parameter server node comes under excessive pressure.
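An illustrative sketch of this pre-aggregation follows; the aggregation function (here a maximum) and the dictionary standing in for a parameter server's store are assumptions.

```python
def aggregate_on_server(store, key, value, agg=max):
    # keep one running aggregate per key instead of every record
    store[key] = agg(store[key], value) if key in store else value

store = {}
for uid, price in [("000001", 10.0), ("000001", 99.0), ("000002", 5.0)]:
    aggregate_on_server(store, uid, price)
print(store)   # {'000001': 99.0, '000002': 5.0}
```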
In rare cases, however, the data for a given key in the second data table cannot be merged and grows linearly, so a particular parameter server ends up storing a large amount of data; if the computing nodes then request that data frequently while processing the first data table, the parameter server becomes a bottleneck. For this reason, the invention proposes that the data obtained from the parameter server (second data or intermediate processing results) can be saved on the computing node. For example, a cache pool of a preset size may be set up in the computing node, every piece of data the computing node requests from a parameter server may be stored in the cache pool, and once the pool is full the record unused for the longest time is evicted. When the second data table has a skewed key, that key's value will with high probability stay in the local cache pool, so the computing node does not have to send network requests to the parameter server repeatedly, which greatly improves data processing efficiency.
Therefore, when a computing node needs to obtain second data, it may first be determined whether second data with the same target field value as the first data already exist on the computing node; if so, the first data are processed based on the second data stored on the computing node and the preset data processing rule; and/or, if not, the operation of obtaining the second data from the parameter server that stores second data with the same target field value as the first data is performed.
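The cache-first lookup order can be sketched as below, assuming a bounded least-recently-used cache; the class name, the callable standing in for the network request, and the capacity are illustrative.

```python
from collections import OrderedDict

class CachedLookup:
    """Check a bounded local cache first; fall back to the parameter server on a miss."""

    def __init__(self, fetch_from_server, capacity=1000):
        self.fetch = fetch_from_server      # callable: key -> second data / intermediate result
        self.capacity = capacity
        self.cache = OrderedDict()          # key -> value, kept in least-recently-used order

    def get(self, key):
        if key in self.cache:
            self.cache.move_to_end(key)     # cache hit: mark as recently used
            return self.cache[key]
        value = self.fetch(key)             # cache miss: one network request to the server
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the entry unused for the longest time
        return value

lookup = CachedLookup(lambda key: {"sex": 0, "age": 22}, capacity=2)
print(lookup.get("000001"))                 # fetched once, then cached
print(lookup.get("000001"))                 # served from the local cache
```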
In one embodiment of the present invention, for a second data table needed when processing the first data table, it may first be determined whether the data volume of the second data table is greater than a first threshold. If the data volume of the second data table is less than or equal to the first threshold, the second data in the second data table may simply be stored on each first computing node, so that the first computing node processes the first data directly based on the locally stored second data and the preset data processing rule.
If the data volume of the second data table is greater than the first threshold, the second data table may be split according to the method shown in fig. 1. When two data tables must be combined for processing, it is usually data with the same target field value in the two tables that need to be combined; taking the join operation as an example, only two pieces of data that correspond on the target field (i.e. have the same target field value in the two tables) can be joined. Therefore, when splitting the second data table, the split points of the first data table may be used, so that the second data table is also split into the first number of blocks, which are then allocated to the second number of computing nodes for processing. The blocks obtained from the second data table may be allocated in the same way as the blocks obtained from the first data table, so that first data and second data with the same target field value can be allocated to the same computing node.
For at least one piece of first data allocated to a computing node, the computing node may obtain second data from the computing node that stores second data with the same target field value as the first data, and process the first data based on the second data and the preset data processing rule. Because the second data table is split and allocated in the same way as the first data table, when a computing node needs second data with the same target field value as some first data, at least part of that second data can be obtained locally, which reduces network overhead.
In the present invention, the second data table may also be allocated following the allocation manner of the first data table. As described above, when no target field value has a frequency greater than or equal to the second threshold, first data with the same target field value in the first data table may be allocated to the same computing node. In that case, the second data table may be allocated in the same manner, with second data and first data that share the same target field value allocated to the same computing node.
The data processing method of the present invention is further described below with a specific example.
Assume there are currently a consumption record table A and a user information table B. The data processing task is: for each consumption record of each user, compute that user's average consumption over the last 24 hours, splice the user's age and sex information into table A, and finally use table A to train a machine learning model.
Before the computation starts, one or more computing nodes for data processing may be requested from the processor (e.g. a distributed cluster), each computing node being allocated a certain amount of memory. One or more parameter servers may also be requested. A parameter server and a computing node are logically different nodes with independent memory spaces, but they may reside on the same computer. As shown in fig. 3A, 3 computing nodes and 3 parameter servers may be requested. Preferably, parameter server 1 and computing node 1 may be co-located on computer 1, parameter server 2 and computing node 2 on computer 2, and parameter server 3 and computing node 3 on computer 3.
Before the computation starts, tables A and B may be stored in the cloud or on some storage medium accessible to the processor. The formats of tables A and B are as follows.
Table A, 30000 rows, 4 columns:
Consumption time | User id | Commodity | Price
2019-10-28 10:00:01 | 000001 | 00000001 | 10.00

Table B, 300 rows, 3 columns:
User id | Sex | Age
000001 | 0 | 22
The computing nodes are the nodes that mainly run the computation logic; the parameter server nodes are only used to build the global hash table. As shown in fig. 3B, tables A and B may initially be read entirely into the memory of the computing nodes, with the data divided by rows across the computing nodes as in an ordinary distributed system.
The first operation of the data processing task: for each consumption record of each user, compute that user's average consumption over the last 24 hours. This operation turns table A into a new 30000 x 5 table, which amounts to adding one more column of fields (i.e. features).
The data must first be grouped by user id, since it is the same user's consumption over the last 24 hours that is to be counted, so split points are first built with the user id as the target field value. There are 3 computing nodes in total, so 29 split points may be selected with a weight-based split-point algorithm (such as the weighted quantile sketch algorithm) to divide the data evenly into 30 parts. In this process each computing node only needs to scan its locally stored data, and the local results (29 point values) are then gathered and merged into an overall result. These 29 split points can be reused as the split points of table A's user id, so they may be stored in the cloud at the outset and fetched directly when the data are read, avoiding repeated work.
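One way to picture this two-step computation (local scan, then merge) is sketched below; the fixed-weight sampling used for the local summaries is a simplification assumed for illustration, not the weighted quantile sketch algorithm itself.

```python
def local_summary(local_values, k):
    # each node samples ~k evenly spaced values from its sorted shard;
    # every sample stands for roughly `step` local rows (its weight)
    s = sorted(local_values)
    step = max(len(s) // k, 1)
    return [(s[i], step) for i in range(0, len(s), step)]

def merge_summaries(summaries, m):
    merged = sorted(p for summary in summaries for p in summary)
    total = sum(w for _, w in merged)
    targets = [total * i / (m + 1) for i in range(1, m + 1)]
    points, acc, t = [], 0, 0
    for value, weight in merged:
        acc += weight
        while t < m and acc >= targets[t]:
            points.append(value)
            t += 1
    return points

shards = [[f"u{i:03d}" for i in range(j, 100, 3)] for j in range(3)]  # 3 local shards
print(merge_summaries([local_summary(s, 10) for s in shards], 9))     # 9 points here for brevity
```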
The data in table A are divided into buckets according to the split points: the data in the first 10 buckets by user id are allocated to computing node 1, the middle 10 buckets to computing node 2, and the last 10 buckets to computing node 3. After allocation, each computing node holds roughly 10000 rows of table A; data with the same user id are now guaranteed to sit on the same computing node, and the in-window averages are computed by grouping on id and ordering by time.
When table A exhibits data skew, for example the consumption records of user 000001 occupy 15000 rows, then after bucketing about 10000 of user 000001's rows are allocated to computing node 1 and 5000 to node 2. A small error arises when computing the window mean for the data at the boundary, but this is tolerable for a machine learning algorithm.
Next, processing operation two: splice the information of table B into table A. The desired result is as follows:
Consumption time | User id | Commodity | Price | Sex | Age
2019-10-28 10:00:01 | 000001 | 00000001 | 10.00 | 0 | 22
The data of table B may first be scattered across the parameter servers with user id as the key. As shown in fig. 3C, a simple way to scatter them is to take the remainder of the user id in table B divided by 3 and allocate each row to the parameter server with the corresponding number, thereby scattering the data randomly into the parameter servers' memory. In the figure, table B is not drawn inside the computing nodes so that the current operation is shown clearly, but it may still be used by subsequent operations; table B on the parameter servers is only temporary data for the current operation, and whether it is kept afterwards may be decided as the situation requires.
As shown in fig. 3D, each row of table A is then processed by a computing node: according to the user id of the current row, a network request is sent to the parameter server node holding that id (obtained by taking the id modulo 3); the parameter server returns the sex and age information corresponding to the id; the computing node splices this information into the current row and adds the table B information for that id to a local cache.
Before sending the request, the computing node checks whether the current id is already in the local cache; if so, no network request needs to be sent. After processing, if the current id is not in the cache it is added, and if the preset cache space is insufficient, the one or more id entries unused for the longest time are evicted.
When the preset cache space is large enough (because table B is small), after some processing the data of table B reside in every computing node's cache, and subsequent processing takes the information from the cache.
Suppose instead that table B is another consumption record table and the maximum amount each user spent in table B is to be spliced into table A. Then, when uploading table B to the parameter servers, only the maximum for each id needs to be maintained, so even if table B is skewed, no skew appears on the parameter servers.
Suppose the table B content to be spliced is a value that cannot be aggregated. Some skew then exists on the parameter servers, but thanks to the cache, the skewed value will with high probability be present in the cache, so data are not repeatedly requested from the parameter server.
The data processing method of the present invention may also be implemented as a data processing apparatus. Fig. 4 shows a block diagram of a data processing apparatus according to an exemplary embodiment of the present invention. The functional units of the data processing apparatus may be implemented by hardware, software, or a combination of hardware and software that embodies the principles of the present invention. Those skilled in the art will appreciate that the functional units depicted in fig. 4 may be combined or divided into sub-units to implement the principles described above. Accordingly, the description herein may support any possible combination or division, or even further definition, of the functional units described herein.
The functional units the data processing apparatus may have and the operations each functional unit may perform are briefly described below; for details, refer to the description above, which is not repeated here.
Referring to fig. 4, the data processing apparatus 400 includes an acquisition module 410, a splitting module 420, and an allocation module 430. The acquisition module 410 is configured to obtain the split points of the first data table.
The splitting module 420 is configured to split the first data table using the split points to obtain a first number of blocks. The allocation module 430 is configured to allocate the first number of blocks to a second number of computing nodes for processing, where the first number is greater than or equal to the second number.
The acquisition module 410 may obtain pre-computed split points of the first data table from outside, or may compute the split points of the first data table on the fly. Optionally, the acquisition module may include a setting module and a calculation module. The setting module is configured to set, for each piece of first data in the first data table, an additional value for the target field value in the first data, the additional value and the target field value together forming a new target field value; the calculation module is configured to calculate quantiles of the plurality of new target field values using a quantile algorithm, the quantiles being the split points. The calculation module may calculate the quantiles using a weight-based quantile algorithm, so that the data sizes of the different blocks obtained by splitting at the quantiles are the same or substantially the same, where a weight characterizes the data size of a single piece of data.
The allocation module 430 may further store the second data in the second data table on one or more parameter servers, each parameter server storing at least a portion of the second data; for at least one piece of first data in the block allocated to a computing node, the computing node obtains second data from the parameter server that stores second data with the same target field value as the first data, and processes the first data based on the second data and a preset data processing rule. Taking a table-join operation as the data processing rule, the computing node may splice the field values of one or more second fields in the second data into the first data as new field values of the first data.
Second data with the same target field value may be stored on the same parameter server, and the parameter server processes the second data with the same target field value according to a preset data processing rule to obtain an intermediate processing result for that target field value; the computing node obtains the intermediate processing result from the parameter server storing the result corresponding to the target field value in the first data, saves it, and processes the first data based on it.
The computing node may also save the second data obtained from the parameter server. The data processing apparatus 400 may further include a first judging module configured to determine whether second data with the same target field value as the first data exist on the computing node; if so, the computing node processes the first data based on the locally stored second data and the preset data processing rule; and/or, if not, the computing node obtains the second data from the parameter server that stores second data with the same target field value as the first data.
The data processing apparatus 400 may further include a second judging module configured to determine whether the data volume of the second data table is greater than a first threshold; the allocation module is further configured to store the second data in the second data table to each first computing node if the data volume of the second data table is less than or equal to the first threshold, so that the first computing node processes the first data based on the second data and the preset data processing rule.
If the data volume of the second data table is greater than the first threshold, the splitting module may split the second data table using the split points of the first data table, and the allocation module may allocate the resulting blocks to the second number of computing nodes for processing; for at least one piece of first data allocated to a computing node, the computing node obtains second data from the computing node that stores second data with the same target field value as the first data, and processes the first data based on the second data and the preset data processing rule.
The data processing apparatus 400 may further include a third judging module configured to determine whether a target field value whose frequency is greater than or equal to a second threshold exists in the first data table. If such a target field value exists, the acquisition module obtains the split points of the first data table, the splitting module splits the first data table using the split points, and the allocation module allocates the first number of blocks to the second number of computing nodes for processing. If no such target field value exists, the allocation module allocates the plurality of pieces of first data in the first data table to one or more computing nodes, with data having the same target field value allocated to the same computing node.
Where the allocation module allocates data with the same target field value in the first data table to the same computing node, the allocation module may, for the second data table, allocate the plurality of pieces of second data to one or more computing nodes in the same manner, with second data and first data that share the same target field value allocated to the same computing node.
It should be appreciated that specific implementations of the data processing apparatus 400 according to the exemplary embodiment of the present invention may refer to the related description of the data processing method given in connection with figs. 1 to 3D, and are not repeated here.
A data processing method and apparatus according to an exemplary embodiment of the present invention are described above with reference to fig. 1 to 4. It should be understood that the above-described method may be implemented by a program recorded on a computer-readable medium, for example, a computer-readable storage medium storing instructions may be provided according to an exemplary embodiment of the present invention, in which a computer program for executing the data processing method of the present invention (e.g., shown in fig. 1) is recorded on the computer-readable medium.
The computer program in the above-described computer readable medium may be run in an environment deployed in a computer device such as a client, a host, a proxy device, a server, etc., and it should be noted that the computer program may be used to perform additional steps other than those shown in fig. 1 or more specific processes when performing the above-described steps, and the contents of these additional steps and further processes have been described with reference to fig. 1 and 2, and will not be repeated here.
It should be noted that the data processing apparatus according to the exemplary embodiment of the present invention may rely entirely on the execution of a computer program to implement the corresponding functions, that is, each module corresponds to a step in the functional architecture of the computer program, so that the entire apparatus is invoked through a dedicated software package (e.g., a lib library) to implement the corresponding functions.
On the other hand, each of the modules shown in fig. 4 may also be implemented by hardware, software, firmware, middleware, microcode, or any combination thereof. When implemented in software, firmware, middleware or microcode, the program code or code segments for performing the corresponding operations may be stored in a computer-readable medium, such as a storage medium, so that a processor can perform the corresponding operations by reading and executing the program code or code segments.
For example, exemplary embodiments of the present invention may also be implemented as a computing device including a storage component in which a set of computer-executable instructions is stored; when the instructions are executed by a processor, the data processing method described above is performed.
In particular, the computing device may be deployed in a server or a client, or on a node device in a distributed network environment. Furthermore, the computing device may be a PC, a tablet device, a personal digital assistant, a smart phone, a web application, or any other device capable of executing the above set of instructions.
Here, the computing device need not be a single computing device, but may be any device or collection of circuits capable of executing the above-described instructions (or instruction set), alone or in combination. The computing device may also be part of an integrated control system or a system manager, or may be configured as a portable electronic device that interfaces locally or remotely (e.g., via wireless transmission).
In the computing device, the processor may include a central processing unit (CPU), a graphics processing unit (GPU), a programmable logic device, a special-purpose processor system, a microcontroller, or a microprocessor. By way of example and not limitation, the processor may also include an analog processor, a digital processor, a microprocessor, a multi-core processor, a processor array, a network processor, and the like.
Some of the operations described in the data processing method according to the exemplary embodiment of the present invention may be implemented in software, some may be implemented in hardware, and others may be implemented in a combination of software and hardware.
The processor may execute instructions or code stored in one of the storage components, wherein the storage component may also store data. Instructions and data may also be transmitted and received over a network via a network interface device, which may employ any known transmission protocol.
The storage component may be integrated with the processor, for example with RAM or flash memory disposed within an integrated circuit microprocessor or the like. Further, the storage component may comprise a stand-alone device, such as an external disk drive, a storage array, or any other storage device usable by a database system. The storage component and the processor may be operatively coupled, or may communicate with each other, for example through an I/O port or a network connection, so that the processor can read files stored in the storage component.
In addition, the computing device may also include a video display (such as a liquid crystal display) and a user interaction interface (such as a keyboard, mouse, touch input device, etc.). All components of the computing device may be connected to each other via buses and/or networks.
The operations involved in the data processing method according to exemplary embodiments of the present invention may be described as various interconnected or coupled functional blocks or functional diagrams. However, these functional blocks or functional diagrams may equally be integrated into a single logic device or operated according to non-exact boundaries.
For example, as described above, a data processing apparatus according to an exemplary embodiment of the present invention may include a storage unit and a processor, wherein the storage unit stores therein a set of computer-executable instructions that, when executed by the processor, perform the data processing method described above.
The foregoing description of exemplary embodiments of the present invention is intended to be illustrative only and is not exhaustive; the invention is not limited to the exemplary embodiments disclosed. Many modifications and variations will be apparent to those of ordinary skill in the art without departing from the scope and spirit of the invention. Therefore, the protection scope of the present invention shall be subject to the scope of the claims.

Claims (26)

1. A data processing method, comprising:
acquiring a dividing point of a first data table;
dividing the first data table by using the dividing points to obtain a first number of blocks;
allocating the first number of blocks to a second number of computing nodes for processing, wherein the first number is greater than or equal to the second number,
wherein the step of acquiring the dividing point of the first data table comprises:
for each piece of first data in the first data table, setting an additional value for a target field value in the first data, wherein the additional value and the target field value form a new target field value;
calculating quantiles of the plurality of new target field values by using a quantile algorithm, wherein the quantiles are the dividing points.
2. The method of claim 1, wherein calculating quantiles of the plurality of new target field values by using a quantile algorithm comprises:
calculating the quantiles of the plurality of new target field values by using a weight-based quantile algorithm, so that the data sizes of the different blocks obtained by division based on the quantiles are the same or substantially the same, wherein the weights are used to characterize the data size of a single piece of data.
3. The method of claim 1, further comprising:
storing second data in a second data table on one or more parameter servers, each of said parameter servers storing at least part of said second data;
for at least one piece of first data in a block allocated to a computing node, the computing node acquires second data from a parameter server storing second data having the same target field value as the first data, and processes the first data based on the second data and a preset data processing rule.
4. The method of claim 3, wherein the step of processing the first data based on the second data and a preset data processing rule comprises:
splicing field values of one or more second fields in the second data into the first data as new field values in the first data.
5. The method of claim 3, further comprising:
storing, in the computing node, the second data acquired from the parameter server.
6. The method of claim 5, further comprising:
judging whether second data having the same target field value as the first data exists in the computing node;
if the second data exists in the computing node, processing the first data based on the second data stored in the computing node and a preset data processing rule; and/or, if the second data does not exist in the computing node, performing the operation of acquiring the second data from the parameter server storing second data having the same target field value as the first data.
7. The method of claim 3, wherein second data having the same target field value is stored on the same parameter server, the method further comprising:
processing the second data with the same target field value by the parameter server according to a preset data processing rule to obtain an intermediate processing result of the target field value,
the computing node obtains an intermediate processing result from a parameter server storing the intermediate processing result corresponding to a target field value in the first data, stores the intermediate processing result in the computing node, and processes the first data based on the intermediate processing result.
8. The method of claim 3, further comprising:
judging whether the data amount of the second data table is greater than a first threshold;
if the data amount of the second data table is less than or equal to the first threshold, respectively storing the second data in the second data table on each first computing node, so that the first computing node processes the first data based on the second data and a preset data processing rule.
9. The method of claim 8, further comprising:
if the data amount of the second data table is greater than the first threshold, dividing the second data table by using the dividing point of the first data table, and allocating the blocks obtained by the division to the second number of computing nodes for processing;
for at least one piece of first data allocated to a computing node, the computing node acquires the second data from the computing node storing second data having the same target field value as the first data, and processes the first data based on the second data and a preset data processing rule.
10. The method of claim 1, further comprising:
judging whether a target field value with a frequency greater than or equal to a second threshold exists in the first data table;
performing the method of any one of claims 1 to 9 if a target field value with a frequency greater than or equal to the second threshold exists.
11. The method of claim 10, further comprising:
if no target field value with a frequency greater than or equal to the second threshold exists, allocating a plurality of pieces of first data in the first data table to one or more computing nodes, wherein data having the same target field value are allocated to the same computing node.
12. The method of claim 11, further comprising:
allocating a plurality of pieces of second data in the second data table to one or more computing nodes, wherein second data and first data having the same target field value are allocated to the same computing node.
13. A data processing apparatus comprising:
an acquisition module, configured to acquire a dividing point of a first data table;
a segmentation module, configured to divide the first data table by using the dividing points to obtain a first number of blocks;
an allocation module, configured to allocate the first number of blocks to a second number of computing nodes for processing, wherein the first number is greater than or equal to the second number,
wherein the acquisition module comprises: a setting module, configured to set, for each piece of first data in the first data table, an additional value for a target field value in the first data, wherein the additional value and the target field value form a new target field value; and a calculation module, configured to calculate quantiles of the plurality of new target field values by using a quantile algorithm, wherein the quantiles are the dividing points.
14. The apparatus of claim 13, wherein the calculation module calculates the quantiles of the plurality of new target field values by using a weight-based quantile algorithm, so that the data sizes of the different blocks obtained by division based on the quantiles are the same or substantially the same, wherein the weights are used to characterize the data size of a single piece of data.
15. The apparatus of claim 13, wherein,
the allocation module is further configured to store second data in a second data table on one or more parameter servers, each of the parameter servers storing at least a portion of the second data,
for at least one piece of first data in a block allocated to a computing node, the computing node acquires second data from a parameter server storing second data having the same target field value as the first data, and processes the first data based on the second data and a preset data processing rule.
16. The apparatus of claim 15, wherein the computing node splices field values of one or more second fields in the second data into the first data as new field values in the first data.
17. The apparatus of claim 15, wherein,
the computing node also stores second data obtained from the parameter server.
18. The apparatus of claim 17, further comprising:
a first judging module, configured to judge whether second data having the same target field value as the first data exists in the computing node;
wherein, if the second data exists in the computing node, the computing node processes the first data based on the second data stored in the computing node and a preset data processing rule; and/or, if the second data does not exist in the computing node, the computing node acquires the second data from a parameter server storing second data having the same target field value as the first data.
19. The apparatus of claim 15, wherein the second data having the same target field value is stored in the same parameter server, the parameter server processes the second data having the same target field value according to a preset data processing rule to obtain an intermediate processing result of the target field value,
the computing node obtains an intermediate processing result from a parameter server storing the intermediate processing result corresponding to a target field value in the first data, stores the intermediate processing result in the computing node, and processes the first data based on the intermediate processing result.
20. The apparatus of claim 15, further comprising:
a second judging module, configured to judge whether the data amount of the second data table is greater than a first threshold;
wherein the allocation module is further configured to, if the data amount of the second data table is less than or equal to the first threshold, store the second data in the second data table on each first computing node, so that the first computing node processes the first data based on the second data and a preset data processing rule.
21. The apparatus of claim 20, wherein,
if the data amount of the second data table is greater than the first threshold, the segmentation module divides the second data table by using the dividing points of the first data table, and the allocation module allocates the blocks obtained by the division to the second number of computing nodes for processing;
for at least one piece of first data allocated to a computing node, the computing node acquires the second data from the computing node storing second data having the same target field value as the first data, and processes the first data based on the second data and a preset data processing rule.
22. The apparatus of claim 13, further comprising:
a third judging module, configured to judge whether a target field value with a frequency greater than or equal to a second threshold exists in the first data table;
if a target field value with a frequency greater than or equal to the second threshold exists, the acquisition module acquires the dividing point of the first data table, the segmentation module divides the first data table by using the dividing point, and the allocation module allocates the first number of blocks to the second number of computing nodes for processing.
23. The apparatus of claim 22, wherein,
if no target field value with a frequency greater than or equal to the second threshold exists, the allocation module allocates a plurality of pieces of first data in the first data table to one or more computing nodes, wherein data having the same target field value are allocated to the same computing node.
24. The apparatus of claim 22, wherein,
the allocation module further allocates a plurality of pieces of second data in a second data table to one or more computing nodes, wherein second data and first data having the same target field value are allocated to the same computing node.
25. A system comprising at least one computing device and at least one storage device storing instructions that, when executed by the at least one computing device, cause the at least one computing device to perform the method of any of claims 1-12.
26. A computer readable storage medium storing instructions which, when executed by at least one computing device, cause the at least one computing device to perform the method of any of claims 1 to 12.
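By way of non-limiting illustration of the parameter-server variant in claims 3, 5 and 6, the following Python sketch shows one way a computing node might check its local copy first and, only on a miss, fetch the second data from the parameter server that holds the matching target field value. The class and function names (ParameterServer, ComputingNode, store_second_table), the dict-based rows, and the hash-based routing rule are assumptions introduced for the example, not the claimed implementation.

class ParameterServer:
    """Holds the part of the second data table routed to this server."""

    def __init__(self):
        self.store = {}                     # target field value -> second-data row

    def put(self, row, key="target"):
        self.store[row[key]] = row

    def get(self, value):
        return self.store.get(value)


class ComputingNode:
    """Processes first data; caches second data fetched from the servers."""

    def __init__(self, servers):
        self.servers = servers
        self.cache = {}                     # local copy of already-fetched second data

    def lookup(self, value):
        # Check the local cache first; only on a miss ask the parameter server
        # that owns this target field value (same routing rule as storage).
        if value not in self.cache:
            owner = self.servers[hash(value) % len(self.servers)]
            self.cache[value] = owner.get(value)
        return self.cache[value]

    def process(self, first_row, key="target"):
        second_row = self.lookup(first_row[key])
        enriched = dict(first_row)
        if second_row is not None:
            enriched.update({k: v for k, v in second_row.items() if k != key})
        return enriched


def store_second_table(second_rows, servers, key="target"):
    """Shard the second data table over the parameter servers by target field
    value; hash() is only stable within one Python process, which is enough
    for this sketch because storage and lookup share the rule."""
    for row in second_rows:
        servers[hash(row[key]) % len(servers)].put(row, key)

A caller would first shard the second data table with store_second_table and then hand each ComputingNode the same list of servers, so that storage and lookup agree on which server owns a given target field value.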
CN202010406485.XA 2020-05-13 2020-05-13 Data processing method and device Active CN111611243B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010406485.XA CN111611243B (en) 2020-05-13 2020-05-13 Data processing method and device

Publications (2)

Publication Number Publication Date
CN111611243A CN111611243A (en) 2020-09-01
CN111611243B true CN111611243B (en) 2023-06-13

Family

ID=72204507

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010406485.XA Active CN111611243B (en) 2020-05-13 2020-05-13 Data processing method and device

Country Status (1)

Country Link
CN (1) CN111611243B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112487177B (en) * 2020-12-17 2022-05-10 杭州火石数智科技有限公司 Reverse de-duplication method for self-adaptive bucket separation of massive short texts

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
GB201515950D0 (en) * 2015-09-09 2015-10-21 Ibm Method for processing large data tables
CN105095413A (en) * 2015-07-09 2015-11-25 北京京东尚科信息技术有限公司 Method and apparatus for solving data skew
WO2016099578A1 (en) * 2014-12-19 2016-06-23 Medidata Solutions, Inc. Method and system for linking heterogeneous data sources
CN105740063A (en) * 2014-12-08 2016-07-06 杭州华为数字技术有限公司 Data processing method and apparatus
CN109033295A (en) * 2018-07-13 2018-12-18 成都亚信网络安全产业技术研究院有限公司 The merging method and device of super large data set
CN109325026A (en) * 2018-08-14 2019-02-12 中国平安人寿保险股份有限公司 Data processing method, device, equipment and medium based on big data platform
CN110222048A (en) * 2019-05-06 2019-09-10 平安科技(深圳)有限公司 Sequence generating method, device, computer equipment and storage medium
CN110245140A (en) * 2019-06-12 2019-09-17 同盾控股有限公司 Data branch mailbox processing method and processing device, electronic equipment and computer-readable medium
CN110597879A (en) * 2019-09-17 2019-12-20 第四范式(北京)技术有限公司 Method and device for processing time series data
CN110908999A (en) * 2019-11-18 2020-03-24 北京明略软件系统有限公司 Data acquisition mode determining method and device, storage medium and electronic device

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Extended application of the consistent hashing algorithm on database clusters; Zhao Fei et al.; Journal of Chengdu University of Information Technology (成都信息工程学院学报); 2015-02-15 (No. 01); full text *

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant