CN108234566B - Cluster data processing method and device

Info

Publication number: CN108234566B
Application number: CN201611193097.8A
Authority: CN
Inventors: 李静; 李炉阳
Original assignee: Alibaba Group Holding Ltd
Current assignee: Alibaba Group Holding Ltd
Priority date: 2016-12-21
Filing date: 2016-12-21
Publication date: 2021-04-23
Anticipated expiration: 2036-12-21
Also published as: CN108234566A

Abstract

The document discloses a cluster data processing method and a device; the cluster data processing method comprises the following steps: acquiring attribute information of tasks running on a plurality of clusters within a first preset time; and determining the data to be copied and a target cluster needing to copy the data to be copied according to the acquired attribute information of the task so as to copy the data to be copied to the target cluster.

Description

Cluster data processing method and device

Technical Field

The present invention relates to the field of network communications, and in particular, to a cluster data processing method and apparatus.

Background

With the advent of the big data era, data services are developing vigorously, and storage and computing scale are growing explosively. However, the machine room that houses a cluster of a distributed system can hold only a limited number of physical machines, and that number cannot grow without bound, so cross-region deployments with multiple machine rooms and multiple clusters have emerged. Communication and data reading among these machine rooms and clusters consume a large amount of network bandwidth.

At present, in a cross-region multi-machine-room scenario, when network bandwidth becomes a bottleneck, network operation and maintenance personnel generally either throttle traffic or simply add network bandwidth. However, throttling may delay the computing tasks of the cluster and thereby affect the user experience, while adding network bandwidth increases cost.

Disclosure of Invention

The following is a summary of the subject matter described in detail herein. This summary is not intended to limit the scope of the claims.

The embodiment of the application provides a data processing method and device for a cluster, which can reduce task delay and optimize network traffic of the cluster.

The embodiment of the application provides a cluster data processing method, which comprises the following steps:

acquiring attribute information of tasks running on a plurality of clusters within a first preset time;

and determining data to be copied and a target cluster needing to copy the data to be copied according to the acquired attribute information of the task so as to copy the data to be copied to the target cluster.

After determining the data to be copied and the target cluster in which the data to be copied needs to be copied according to the acquired attribute information of the task, the data processing method may further include:

generating a replication list, wherein the replication list is used for recording the position information of the data to be replicated and a target cluster needing to replicate the data to be replicated;

and writing the replication list into a metadata base so that the related cluster can acquire the replication list.

Wherein, the data processing method may further include:

and instructing the target cluster to copy the data to be copied according to the copy list.

Acquiring the attribute information of the tasks running on the plurality of clusters within the first predetermined time period may include: periodically acquiring the attribute information of the tasks running on the plurality of clusters within the first predetermined time period.

Wherein the attribute information of each task at least comprises: the cluster running the task and the cluster where the data read by the task is located.

The determining, according to the obtained attribute information of the task, data to be copied and a target cluster that needs to copy the data to be copied may include:

and screening data meeting preset conditions from the data read by the task cross-cluster according to the acquired attribute information of the task to serve as data to be copied.

The screening out data meeting the predetermined condition as data to be copied may include:

screening out data of which the first parameter value meets a first condition and the second parameter value meets a second condition as data to be copied;

for data read by each task across clusters, the first parameter value is the number of times that the clusters running the tasks read the data within a second preset time; the second parameter value is the total times or continuous times that the first parameter value meets the first condition within a first preset time length; the second predetermined length of time is less than the first predetermined length of time; the first condition includes: the first parameter value is greater than or equal to a first threshold value; the second condition includes: the second parameter value is greater than or equal to a second threshold value.

An embodiment of the present application further provides a data processing apparatus for a cluster, including:

an acquisition unit, configured to acquire attribute information of tasks running on a plurality of clusters within a first preset time;

and the processing unit is used for determining the data to be copied and a target cluster which needs to copy the data to be copied according to the acquired attribute information of the task so as to copy the data to be copied to the target cluster.

The processing unit may be further configured to generate a replication list after determining, according to the acquired attribute information of the task, data to be replicated and a target cluster that needs to replicate the data to be replicated, and write the replication list into a metadata base, so that a relevant cluster acquires the replication list; the replication list is used for recording the position information of the data to be replicated and a target cluster which needs to replicate the data to be replicated.

Wherein the data processing apparatus may further include: and the indicating unit is used for indicating the target cluster to copy the data to be copied according to the copying list.

The obtaining unit may be configured to periodically obtain attribute information of tasks running on the plurality of clusters within a first predetermined time.

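Wherein, the attribute information of each task at least comprises: the cluster running the task and the cluster where the data read by the task is located.

The processing unit may be configured to determine, according to the obtained attribute information of the task, the data to be copied and the target cluster that needs to copy the data to be copied, by the following means: screening, according to the acquired attribute information of the task, data meeting a predetermined condition from the data read by the task across clusters, to serve as the data to be copied.

The processing unit may be configured to screen out data that meets the predetermined condition as the data to be copied by: screening out data of which the first parameter value meets a first condition and the second parameter value meets a second condition as the data to be copied;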

for data read by each task across clusters, the first parameter value is the number of times that the clusters running the tasks read the data within a second preset time; the second parameter value is the total times or continuous times that the first parameter value meets the first condition in a first preset time length, and the second preset time length is smaller than the first preset time length; the first condition includes: the first parameter value is greater than or equal to a first threshold value; the second condition includes: the second parameter value is greater than or equal to a second threshold value.

An embodiment of the present application further provides a data processing apparatus for a cluster, including: a memory and a processor;

the memory is used for storing a program for processing cluster data; the program for cluster data processing, when read and executed by a processor, performs the following operations:

The embodiment of the present application further provides a computer-readable storage medium, which stores computer-executable instructions, and when the computer-executable instructions are executed by a processor, the data processing method of the cluster is implemented.

In the embodiment of the application, the attribute information of tasks running on a plurality of clusters within a first preset time is obtained; and determining the data to be copied and a target cluster needing to copy the data to be copied according to the acquired attribute information of the task so as to copy the data to be copied to the target cluster. Therefore, by data replication, the access speed of the task running on the target cluster to the data to be replicated can be improved; particularly, after cross-cluster data replication is realized, the data volume needing cross-cluster reading in the task execution process can be reduced, so that the network flow of the cluster is optimized, the task delay is reduced, the user experience is improved, and the cost is reduced.

Of course, it is not necessary for any product to achieve all of the above advantages at the same time for the practice of the present application.

Other aspects will be apparent upon reading and understanding the attached drawings and detailed description.

Drawings

FIG. 1 is a schematic diagram of a cluster deployed across a region;

FIG. 2 is a flowchart of a data processing method of a cluster according to an embodiment of the present application;

FIG. 3 is an optional schematic diagram of a system architecture to which the data processing method of a cluster provided in the embodiment of the present application is applied;

FIG. 4 is an exemplary flow chart of an embodiment of the present application;

FIG. 5 is a schematic diagram of interaction between clusters in an embodiment of the present application;

FIG. 6 is a schematic diagram of a clustered data processing apparatus according to an embodiment of the present application.

Detailed Description

The embodiments of the present application will be described in detail below with reference to the accompanying drawings, and it should be understood that the embodiments described below are only for illustrating and explaining the present application and are not intended to limit the present application.

It should be noted that the terms "first," "second," and the like in the description and claims of this application and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order.

It should be noted that, if not conflicted, the embodiments and the features of the embodiments can be combined with each other and are within the scope of protection of the present application. Additionally, while a logical order is shown in the flow diagrams, in some cases, the steps shown or described may be performed in an order different than here.

In some embodiments, a computing device executing a data processing method of a cluster may include one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in a computer-readable medium, such as Random Access Memory (RAM), and/or non-volatile memory, such as Read Only Memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium. The memory may include module 1, module 2, ……, and module N (N is an integer greater than 2).

Computer readable media include both permanent and non-permanent, removable and non-removable storage media. A storage medium may implement information storage by any method or technology. The information may be computer readable instructions, data structures, modules of a program, or other data. Examples of computer storage media include, but are not limited to, phase change memory (PRAM), Static Random Access Memory (SRAM), Dynamic Random Access Memory (DRAM), other types of Random Access Memory (RAM), Read Only Memory (ROM), Electrically Erasable Programmable Read Only Memory (EEPROM), flash memory or other memory technology, compact disc read only memory (CD-ROM), Digital Versatile Discs (DVD) or other optical storage, magnetic cassettes, magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information that can be accessed by a computing device. As defined herein, computer readable media does not include transitory computer readable media (transitory media), such as modulated data signals and carrier waves.

Method embodiment

The embodiment provides a cluster data processing method. In this application, a cluster refers to a set of devices located in the same room. When interacting with the outside, a cluster can be regarded as an independent device. A distributed system may be built based on a plurality of clusters that communicate with each other, for example, a cloud computing system may be built by a plurality of clusters that communicate with each other, and a distributed system may also be regarded as an independent device when providing services to the outside.

Multiple intercommunicating clusters may be deployed across regions, such as in rooms in different cities. As shown in fig. 1, a cluster C1 is deployed at city a, a cluster C2 is deployed at city B, a cluster C3 is deployed at city C, and a cluster C4 is deployed at city D. I.e. different clusters are deployed in different cities. However, this is not limited in this application. In other implementations, multiple intercommunicating clusters may also be deployed in rooms at different locations in the same city.

As shown in fig. 1, these clusters communicate with each other and read data from each other, which generates network traffic. For example, when multiple tasks on cluster C2 need to read the same piece of data from cluster C1 at the same time, a traffic flood may occur between clusters C1 and C2 in the extreme case. To address this problem, the data that the tasks running on cluster C2 need to read may be copied in advance from the source cluster where the data is located (e.g., cluster C1), so that a single copy serves multiple reads by the tasks running on cluster C2. However, copying all data to every cluster without limit would waste storage resources through extreme redundancy. Therefore, the present embodiment provides a data processing method for a cluster, which determines which data should be copied to which clusters, so that network traffic is reduced without excessive redundancy of storage resources.

As shown in fig. 2, the data processing method for a cluster provided in this embodiment includes the following steps:

step 201: acquiring attribute information of tasks running on a plurality of clusters within a first preset time;

step 202: and determining the data to be copied and a target cluster needing to copy the data to be copied according to the acquired attribute information of the task so as to copy the data to be copied to the target cluster.

Wherein the attribute information of each task may at least include: the cluster running the task, the cluster where the data read by the task is located.

For example, the task may be recorded by using a task identifier Dx, the data may be recorded in the form of a data table, the data may be recorded by using a table identifier Tx, and the cluster may be recorded by using a cluster identifier Cx. Taking task D1 as an example, the attribute information may include: the cluster running the task D1 is C1, the data table read by the task D1 is T1, and the cluster in which the data table T1 is located is C2.
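As a minimal sketch of such a record row, the attribute information of task D1 could be held in a small structure like the following. The class name `TaskRecord` and its field names are illustrative assumptions and do not appear in the application.

```python
from dataclasses import dataclass, asdict

@dataclass(frozen=True)
class TaskRecord:
    """One record row of task attribute information (field names are hypothetical)."""
    task_id: str       # task identifier, e.g. "D1"
    run_cluster: str   # cluster running the task, e.g. "C1"
    table_id: str      # data table read by the task, e.g. "T1"
    data_cluster: str  # cluster where that data table is located, e.g. "C2"

# The example from the text: task D1 runs on C1 and reads table T1 stored on C2.
d1 = TaskRecord(task_id="D1", run_cluster="C1", table_id="T1", data_cluster="C2")
row = asdict(d1)  # plain dict form, e.g. for storing in a cluster's database
```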

In this case, the task running on the cluster reads the relevant data (e.g., reads data from other clusters or the present cluster) and processes the data to obtain output data, and the output data can be subsequently provided to other tasks running on the present cluster or other clusters. Each cluster may record attribute information of each task running on the cluster, for example, stored in a database of the cluster.

The attribute information of one task recorded by a cluster may form one record row, and the cluster may record the attribute information of all tasks running on the cluster to obtain a plurality of record rows, for example, as shown in table 1:

TABLE 1

A system architecture to which the data processing method of the present embodiment is applied may be as shown in fig. 3, including: a plurality of clusters (e.g., clusters C1, C2, C3, and C4). The plurality of clusters can communicate with each other to read data from each other. The data processing method provided by this embodiment may be applied to one of the clusters. As shown in fig. 3, the cluster C1 is used as a control cluster, and the clusters C2, C3 and C4 are used as computing clusters, so that the cluster C1 may be used to execute the data processing method provided in this embodiment. However, this is not limited in this application.

Taking the cluster C1 as an example of a control cluster, after acquiring attribute information of tasks running on multiple clusters within a first predetermined time period, the cluster C1 may determine data to be copied, which needs to be copied, of each target cluster (which may include the clusters C1 to C4), according to the acquired attribute information of the tasks, and notify the target clusters of location information of the data to be copied, so that the target clusters may request a source cluster storing the data to be copied to copy the data to be copied, and the source cluster transmits the data to be copied to the target clusters after confirming the request. Alternatively, after determining that each target cluster needs to copy the data to be copied, the cluster C1 notifies the source cluster storing the data to be copied to send the data to be copied to the corresponding target cluster. However, this is not limited in this application.

As shown in fig. 3, take as an example the case where the data to be copied by the cluster C1 is the data table T1 on the cluster C2. After the cluster C1 finishes copying the data table T1, when a task running on the cluster C1 needs to read the data table T1, it may first be determined whether the cluster C1 itself stores the data table T1. If so, the data table T1 can be read directly from the cluster C1 (e.g., from a storage device of the cluster) and does not need to be read from the cluster C2, which reduces network traffic between clusters.

It should be noted that fig. 3 is only illustrated by taking one data table T1 as an example; however, the present application is not limited thereto. In practical applications, the cluster C1 may copy more data tables from the cluster C2, so that one data copy serves multiple reads by the tasks running on the cluster C1, thereby reducing the network traffic between clusters.
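The local-first read described above can be sketched as follows. This is only an illustration under assumed names: `resolve_read_cluster` and the set of locally stored tables are hypothetical and are not part of the application.

```python
def resolve_read_cluster(table_id, local_cluster, source_cluster, local_tables):
    """Return the cluster a task should read `table_id` from.

    `local_tables` is the set of table identifiers already stored on the
    cluster that runs the task (including previously copied tables).
    """
    if table_id in local_tables:
        # A local copy exists, so no cross-cluster traffic is needed.
        return local_cluster
    # Otherwise fall back to the cluster where the table originally lives.
    return source_cluster

tables_on_c1 = {"T1"}  # state after C1 has copied T1 from C2
read_t1_from = resolve_read_cluster("T1", "C1", "C2", tables_on_c1)
read_t2_from = resolve_read_cluster("T2", "C1", "C2", tables_on_c1)
```

In this sketch a task on C1 reads T1 locally, while T2, which has not been copied, is still read from C2.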

In some implementations, after step 202, the data processing method provided in this embodiment may further include:

generating a replication list, wherein the replication list is used for recording the position information of the data to be copied and the target cluster needing to copy the data to be copied;

and writing the replication list into a metadata base so that the related cluster acquires the replication list.

The metadata base is used for storing metadata, also called intermediary data or relay data, which is data describing data (data about data). Metadata mainly describes the attributes (properties) of data, and may specify the elements or attributes of the data (name, size, data type, etc.), its structure (length, fields, data columns), or related information (where it is located, how to access it, its owner).

For example, in a form in which the data to be copied is a data table, the copy list may include a table identifier of the data table to be copied, a cluster identifier of a source cluster where the data table to be copied is located, and a cluster identifier of a target cluster where the data table needs to be copied. The source clusters in which the data tables included in the replication list are located may be the same or different, and the target clusters may be the same or different. For example, the replication list may be as shown in table 2:

TABLE 2

Data table	Source cluster	Target cluster
T1	C2	C1
T2	C1	C2
……	……	……

In the implementation manner, the replication list is written into the metadata base, so that the related cluster can conveniently obtain the replication list, and further the related cluster can determine the data to be replicated, which needs to be replicated, or determine the data which needs to be replicated to other clusters according to the replication list. However, this is not limited in this application. In some implementations, the replication list can also be synchronized to the relevant clusters after the replication list is generated.
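A minimal sketch of how a cluster might use the replication list, assuming the three columns of table 2. The names `ReplicationEntry`, `tables_to_pull` and `tables_to_push` are hypothetical, and the in-memory list stands in for whatever the metadata base actually returns.

```python
from typing import NamedTuple

class ReplicationEntry(NamedTuple):
    table_id: str        # data table to be copied
    source_cluster: str  # cluster where the table is located
    target_cluster: str  # cluster that needs a copy

# The two rows shown in table 2.
replication_list = [
    ReplicationEntry("T1", "C2", "C1"),
    ReplicationEntry("T2", "C1", "C2"),
]

def tables_to_pull(entries, cluster):
    """Entries for which `cluster` is the target and must obtain a copy."""
    return [e for e in entries if e.target_cluster == cluster]

def tables_to_push(entries, cluster):
    """Entries for which `cluster` is the source and must supply the data."""
    return [e for e in entries if e.source_cluster == cluster]

pull_c1 = [e.table_id for e in tables_to_pull(replication_list, "C1")]
push_c1 = [e.table_id for e in tables_to_push(replication_list, "C1")]
```

Under these assumptions, cluster C1 learns from one list both that it should copy T1 and that it holds T2, which another cluster needs.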

In some implementation manners, the data processing method provided by this embodiment may further include:

and indicating the target cluster to copy the data to be copied according to the copy list.

In this implementation manner, taking the control cluster executing the data processing method of this embodiment as an example, after the control cluster generates the copy list, the control cluster may notify the relevant clusters to copy the data to be copied. However, this is not limited in this application. In other implementations, the control cluster does not need to issue a notification; the relevant clusters automatically copy the data to be copied according to the copy list once they have obtained the copy list through synchronization or have read it from the metadata base.

In some implementations, step 201 can include: the attribute information of the tasks running on the plurality of clusters within a first preset time is periodically acquired. In other words, the data processing method provided by the present embodiment may be periodically executed. For example, step 201 and step 202 may be executed at a fixed time of day, so as to implement periodic update of the data that the cluster needs to replicate. When the data processing method of the present embodiment is executed periodically, when the copy list is generated and written into the metadata base in each period, the copy list generated in the previous period may be replaced with the copy list generated in the present period, so as to reduce the storage space occupied by the copy list.
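The replacement of the previous period's list can be sketched with an in-memory stand-in for the metadata base. `InMemoryMetadataStore` and its method names are assumptions made for illustration; they are not the interface of any real metadata service.

```python
class InMemoryMetadataStore:
    """Hypothetical stand-in for the metadata base; keeps only the latest list."""

    def __init__(self):
        self._replication_list = []

    def write_replication_list(self, entries):
        # Replace, rather than append to, the list written in the previous period,
        # so the stored list never grows across periods.
        self._replication_list = list(entries)

    def read_replication_list(self):
        return list(self._replication_list)

store = InMemoryMetadataStore()
store.write_replication_list([("T1", "C2", "C1")])                      # period 1
store.write_replication_list([("T2", "C1", "C2"), ("T3", "C4", "C2")])  # period 2
latest = store.read_replication_list()
```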

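In some implementations, step 202 may include: screening, according to the acquired attribute information of the task, data meeting a predetermined condition from the data read by the task across clusters, to serve as the data to be copied.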

In some implementations, screening out data that meets a predetermined condition as data to be copied may include: screening out data of which the first parameter value meets a first condition and the second parameter value meets a second condition as data to be copied;

for data read by each task across clusters, the first parameter value is the number of times that the cluster running the task reads the data within a second preset time; the second parameter value is the total times or continuous times that the first parameter value meets the first condition within a first preset time length; the second preset time length is less than the first preset time length; the first condition includes: the first parameter value is greater than or equal to a first threshold value, and the second condition includes: the second parameter value is greater than or equal to a second threshold value.

For data read by a task across clusters, the number of times that a cluster running the task in a second predetermined time reads the data may include: the number of times this data is read by all tasks running on this cluster within a second predetermined length of time. For example, for data table T1 read across clusters for task D1 (task D1 running on cluster C1), the first parameter value may be a total number of times data table T1 is read for all tasks (e.g., including tasks D1, D2, D3, etc.) running on cluster C1 within the second predetermined length of time.
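The first parameter value can be sketched as a per-day count keyed by the reading cluster and the data table. The event tuples below are made-up illustrative data, and the function name is an assumption.

```python
from collections import Counter

# Illustrative read events: (day, task_id, run_cluster, table_id).
reads = [
    (1, "D1", "C1", "T1"),
    (1, "D2", "C1", "T1"),
    (1, "D3", "C1", "T1"),
    (1, "D4", "C3", "T1"),
    (2, "D1", "C1", "T1"),
]

def first_parameter_values(read_events):
    """Count reads of each table by all tasks of each cluster, per day.

    One day plays the role of the second predetermined length of time here.
    """
    counts = Counter()
    for day, _task_id, run_cluster, table_id in read_events:
        counts[(run_cluster, table_id, day)] += 1
    return counts

counts = first_parameter_values(reads)
```

The count is taken over all tasks of a cluster, so the three reads by D1, D2 and D3 on day 1 contribute to a single value for (C1, T1).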

Wherein, a task running on a cluster reads data stored on the cluster and can be considered as consuming no bandwidth traffic. Therefore, in this implementation manner, the task of reading data across the clusters can be screened from the acquired attribute information of the task, and then the data to be copied is screened according to the data that needs to be read across the clusters by the screened task.
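A minimal sketch of this pre-filtering step, assuming each record row is a dictionary with the hypothetical keys shown below:

```python
records = [
    {"task": "D1", "run_cluster": "C1", "table": "T1", "data_cluster": "C2"},
    {"task": "D2", "run_cluster": "C1", "table": "T5", "data_cluster": "C1"},
    {"task": "D3", "run_cluster": "C2", "table": "T2", "data_cluster": "C1"},
]

def cross_cluster_reads(rows):
    """Keep only rows whose data lives on a different cluster than the task."""
    return [r for r in rows if r["run_cluster"] != r["data_cluster"]]

remote = cross_cluster_reads(records)
remote_tasks = [r["task"] for r in remote]
```

Task D2 reads a table stored on its own cluster, so it is dropped before any counting takes place.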

In some implementations, the first predetermined length of time may be greater than or equal to the sum of a plurality of second predetermined lengths of time. For example, suppose the first predetermined length of time is 15 days and the second predetermined length of time is 1 day. If, within the 15 days, there are 10 days on which the tasks running on one cluster read the same piece of data across clusters more times than the first threshold, the second parameter value counted as a total is 10. Alternatively, if there are 8 consecutive days on which the daily number of such reads exceeds the first threshold, the second parameter value counted as a consecutive run is 8. However, the present application does not limit the units of the first predetermined length of time and the second predetermined length of time. In other implementations, both may also be measured in hours.

The first threshold and the second threshold may be preset counts, or may be determined by learning from history information using a machine learning algorithm.

In some implementations, the first condition can include: the first parameter value is greater than or equal to a first predetermined value and less than or equal to a second predetermined value, and the second condition may include: the second parameter value is greater than or equal to a third predetermined value and less than or equal to a fourth predetermined value. That is, the first condition selects first parameter values within a certain range, and the second condition selects second parameter values within a certain range. The first predetermined value, the second predetermined value, the third predetermined value and the fourth predetermined value may be preset as needed, or determined by learning from history information using a machine learning algorithm.
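The two ways of counting the second parameter value (total or consecutive) can be sketched as below. The daily counts are made-up numbers chosen to reproduce the 10-day and 8-day figures of the example, and the comparison uses "greater than or equal to" as in the stated first condition; the function names are assumptions.

```python
def total_days_meeting(daily_counts, first_threshold):
    """Total number of days on which the daily read count meets the first condition."""
    return sum(1 for c in daily_counts if c >= first_threshold)

def longest_consecutive_days_meeting(daily_counts, first_threshold):
    """Longest run of consecutive days on which the first condition is met."""
    best = run = 0
    for c in daily_counts:
        run = run + 1 if c >= first_threshold else 0
        best = max(best, run)
    return best

# 15 days of cross-cluster read counts for one (cluster, data) pair.
daily = [5, 6, 4, 7, 5, 5, 9, 6, 0, 1, 5, 0, 2, 4, 1]

total = total_days_meeting(daily, first_threshold=3)
consecutive = longest_consecutive_days_meeting(daily, first_threshold=3)
```

With a first threshold of 3, these counts give a total of 10 qualifying days and a longest consecutive run of 8 days.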
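The range form of the two conditions can be sketched as a pair of closed-interval checks. The function names and the concrete bounds are illustrative assumptions only.

```python
def in_range(value, low, high):
    """True when low <= value <= high (both bounds inclusive)."""
    return low <= value <= high

def meets_range_conditions(first_value, second_value, first_range, second_range):
    """Apply the first condition to the first parameter value and the second
    condition to the second parameter value; both must hold."""
    return in_range(first_value, *first_range) and in_range(second_value, *second_range)

# Hypothetical bounds: (first, second) and (third, fourth) predetermined values.
selected = meets_range_conditions(5, 10, first_range=(3, 100), second_range=(8, 15))
rejected = meets_range_conditions(200, 10, first_range=(3, 100), second_range=(8, 15))
```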

It should be noted that, when step 202 is executed, a first parameter value and a second parameter value may be calculated for data that needs to be read across clusters and is related to one cluster, and then it is determined whether the first parameter value meets a first condition and whether the second parameter value meets a second condition, and then data that needs to be read across clusters and is related to another cluster is processed, so as to determine data to be copied of all clusters; or, first, for data related to all clusters and needing to be read across the clusters, a first parameter value is calculated, whether the first parameter value meets a first condition is judged, after all data of which the first parameter value meets the first condition are screened out, a second parameter value is calculated for the screened data, whether the second parameter value meets a second condition is judged, and data of which the second parameter value meets the second condition is screened out from the data of which the first parameter value meets the first condition, so that data to be copied of all the clusters are determined.

In some implementation manners, the number of times each task reads each piece of data, the cluster in which the data read by the task is located, and the cluster in which the task is operated in the first predetermined time period may be determined according to attribute information of the tasks operated in the plurality of clusters in the first predetermined time period, and the information is recorded in the replication list; when data replication is carried out according to the replication list, data meeting preset conditions are screened out from the replication list to be used as data to be replicated, and a source cluster and a target cluster of the data to be replicated are determined. The process of data screening may refer to the implementation process of step 202 in the previous implementation manner, and therefore, the description thereof is omitted here.

In some implementations, the multiple clusters of this embodiment belong to an ODPS (Open Data Processing Service) cluster, and the metadata base may be an OTS (Open Table Service) metadata base. The ODPS is a distributed computing framework system similar to Hadoop; an OTS is a data storage container, or database.

Referring to fig. 4, a data processing method of the cluster in this embodiment is illustrated.

Taking the ODPS cluster as an example, the data is recorded in the form of data tables, and the attribute information that the ODPS cluster records for each task running on it may include: the identifier of the cluster running the task, the table identifier of the data table read by the task, the identifier of the cluster where the data table is located, and the last modification time of the data table.

As shown in fig. 4, the data processing method of the present embodiment may include the following steps:

step 401: according to the attribute information of each task run by the ODPS cluster in the last N days (i.e., the aforementioned first predetermined time period; for example, N is 15 days), screening out the tasks that need to read a data table across clusters, filtering out the tasks that read data tables on their own cluster, and determining the data tables read across clusters; reading a data table on the same cluster can be regarded as consuming no bandwidth traffic, so this screening identifies the tasks that consume bandwidth traffic.

Step 402: for each data table determined for each cluster, calculating the number of times (i.e. the aforementioned first parameter value) that all tasks running on this cluster (e.g. cluster C1) within M days (e.g. the aforementioned second predetermined time period, M is less than N, for example, M is 1 day) read the same data table;

step 403: screening out the data tables whose read count is greater than the first threshold; for example, the screening may determine that the number of times the cluster C1 reads the data table T1 is greater than the first threshold;

step 404: counting the attribute information of the tasks running on the cluster on each day of the last N days, and determining how the tasks running on the cluster read the data table across clusters during the N days;

step 405: if there are X days (X is less than or equal to N) on which the total number of times that all the tasks running on the cluster read the same data table in the day is greater than the first threshold, and X is greater than or equal to the second threshold, determining that the data table needs to be copied to the cluster, that is, writing the information of the cluster where the data table is located and the information of the cluster that needs to copy the data table into the copy list.
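Steps 401 to 405 can be sketched end to end as follows. This is an illustration under assumptions rather than the method as implemented: the event layout, the function name `build_replication_list`, the use of total (not consecutive) days, and the "greater than or equal to" comparisons taken from the stated first and second conditions are all choices made for the sketch.

```python
from collections import Counter, defaultdict

def build_replication_list(read_events, first_threshold, second_threshold):
    """read_events: iterable of (day, run_cluster, table_id, data_cluster)."""
    # Step 401: keep only reads that cross clusters.
    remote = [e for e in read_events if e[1] != e[3]]

    # Step 402: daily read count per (reading cluster, table, source cluster).
    daily = Counter()
    for day, run_cluster, table_id, data_cluster in remote:
        daily[(run_cluster, table_id, data_cluster, day)] += 1

    # Steps 403-404: count the days on which the daily count meets the first condition.
    qualifying_days = defaultdict(int)
    for (run_cluster, table_id, data_cluster, _day), n in daily.items():
        if n >= first_threshold:
            qualifying_days[(table_id, data_cluster, run_cluster)] += 1

    # Step 405: keep (table, source cluster, target cluster) meeting the second condition.
    return sorted(k for k, x in qualifying_days.items() if x >= second_threshold)

# Made-up events over 15 days.
events = []
for day in range(1, 16):
    if day <= 10:
        events += [(day, "C2", "t1", "C1")] * 5  # C2 reads t1 on C1 five times a day
    events += [(day, "C3", "t1", "C1")] * 2      # C3 reads t1 too rarely
    events += [(day, "C1", "t9", "C1")] * 8      # local reads, ignored

result = build_replication_list(events, first_threshold=3, second_threshold=10)
```

With these inputs only the pair (t1, source C1, target C2) qualifies: cluster C3 never reaches the first threshold on any day, and the reads of t9 stay within cluster C1.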

For example, as shown in fig. 5, suppose the tasks running on cluster C2 read data table t1 on cluster C1 5 times (the first parameter value) in one day (the second predetermined time period). Because the first parameter value (5) is greater than or equal to the first threshold (e.g., 3), copying data table t1 to cluster C2 should be considered. Then the data of the last 15 days (the first predetermined time period) is counted. If the tasks running on cluster C2 read data table t1 at least 5 times per day on 10 or more days (the second parameter value), and the second threshold is, for example, a value less than or equal to 10, the second parameter value is greater than or equal to the second threshold, and it is determined that data table t1 needs to be copied to cluster C2. For example, the copy list may have written therein: a data table to be copied t1, a source cluster C1 and a target cluster C2. After data replication is performed according to the copy list, when a task running on cluster C2 needs to read data table t1, it may be determined whether data table t1 exists on cluster C2. Since data table t1 has been copied to cluster C2, the task can read it directly without reading data across clusters, which reduces the network traffic across clusters.

Similarly, the above calculation may be performed periodically for each cluster, and each data table determined to need copying is written into the copy list, so that data copying is subsequently performed according to the copy list.

In summary, this embodiment determines, according to the attribute information of the tasks running on the clusters, which data needs to be copied to which clusters, thereby reducing network traffic without excessive redundant storage. Compared with the related art, this embodiment does not cause task delay, does not affect user experience, and does not increase bandwidth cost.

Device embodiment

This embodiment provides a cluster data processing apparatus, as shown in fig. 6, including: an acquisition unit 601 and a processing unit 602; the acquisition unit 601 is configured to acquire attribute information of tasks running on a plurality of clusters within a first predetermined time period; the processing unit 602 is configured to determine, according to the acquired attribute information of the tasks, data to be copied and a target cluster to which the data to be copied needs to be copied, so as to copy the data to be copied to the target cluster.

Wherein, the attribute information of each task at least comprises: the cluster running the task and the cluster where the data read by the task is located.

In this embodiment, the acquisition unit 601 is the part of the above apparatus responsible for information acquisition, and may be software, hardware, or a combination of the two.

In this embodiment, the processing unit 602 is a part of the above apparatus responsible for data processing, and may be software, hardware, or a combination of the two.

In some implementations, the processing unit 602 may be further configured to generate a copy list after determining, according to the acquired attribute information of the tasks, the data to be copied and the target cluster to which the data to be copied needs to be copied, and to write the copy list into the metadata base so that the relevant clusters can obtain the copy list; the copy list is used for recording the position information of the data to be copied and the target cluster to which the data to be copied needs to be copied.
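As an illustration of what such a copy list might look like, the following Python sketch models one entry and a stand-in for the metadata base. Everything here is an assumption made for illustration: the names `CopyEntry` and `MetadataStore`, the field names, the JSON encoding, and the in-memory dictionary that takes the place of a real metadata base.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class CopyEntry:
    """One copy-list record: where the data lives and where it must be copied."""
    table: str
    source_cluster: str  # position information of the data to be copied
    target_cluster: str  # cluster that needs a local copy


class MetadataStore:
    """In-memory stand-in for the metadata base (hypothetical)."""

    def __init__(self):
        self._blobs = {}

    def write_copy_list(self, entries):
        # Serialize so that any cluster can read the list back later.
        self._blobs["copy_list"] = json.dumps([asdict(e) for e in entries])

    def copy_list_for(self, cluster):
        # Each cluster only needs the entries that name it as the target.
        raw = json.loads(self._blobs.get("copy_list", "[]"))
        return [CopyEntry(**item) for item in raw if item["target_cluster"] == cluster]


store = MetadataStore()
store.write_copy_list([CopyEntry("t1", "C1", "C2")])
print(store.copy_list_for("C2"))
```

Filtering by target cluster on read is one possible design; it lets each cluster poll the shared list and act only on its own entries.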

In some implementations, the data processing apparatus of this embodiment may further include: an instruction unit, configured to instruct the target cluster to copy the data to be copied according to the copy list.

In some implementations, the acquisition unit 601 may be configured to periodically acquire attribute information of tasks running on a plurality of clusters within a first predetermined time period.

In some implementations, the processing unit 602 may be configured to determine, according to the obtained attribute information of the task, data to be copied and a target cluster where the data to be copied needs to be copied, by:

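In some implementations, the processing unit 602 may be configured to screen out data meeting a predetermined condition as the data to be copied by: screening out data whose first parameter value meets a first condition and whose second parameter value meets a second condition as the data to be copied;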

for data read by each task across clusters, the first parameter value is the number of times that the cluster running the task reads the data within a second preset time; the second parameter value is the total times or continuous times that the first parameter value meets the first condition within a first preset time length; the second preset time length is less than the first preset time length; the first condition includes: the first parameter value is greater than or equal to a first threshold value; the second condition includes: the second parameter value is greater than or equal to a second threshold value.
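The difference between counting the "total times" and the "continuous times" for the second parameter value can be sketched as follows. This is a hedged illustration: the function names are hypothetical, and "continuous times" is interpreted here as the longest run of consecutive second-period windows (for example, days) in which the first condition holds, which is an assumption about the intended meaning.

```python
def second_parameter(window_counts, first_threshold, consecutive=False):
    """Compute the second parameter value from per-window read counts.

    window_counts: the first parameter value for each second predetermined
    time period (e.g. each day) inside the first predetermined time period.
    """
    hits = [n >= first_threshold for n in window_counts]  # first condition
    if not consecutive:
        return sum(hits)  # total number of windows meeting the first condition
    longest = current = 0
    for hit in hits:
        current = current + 1 if hit else 0  # a miss resets the run
        longest = max(longest, current)
    return longest


def should_copy(window_counts, first_threshold, second_threshold, consecutive=False):
    """Second condition: the second parameter value reaches the second threshold."""
    value = second_parameter(window_counts, first_threshold, consecutive)
    return value >= second_threshold


counts = [5, 6, 0, 7, 8, 9]
print(second_parameter(counts, 3))                    # prints 5
print(second_parameter(counts, 3, consecutive=True))  # prints 3
```

With these counts and a second threshold of 4, the total-count variant selects the data for copying while the consecutive variant does not, so the choice between the two affects how tolerant the rule is of an occasional quiet day.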

In some implementations, the plurality of clusters of the present embodiment belong to an ODPS cluster.

For other details of the operations performed by the units in the data processing apparatus of this embodiment, reference may be made to embodiment one, and therefore, the description thereof is not repeated herein.

In addition, an embodiment of the present application further provides a cluster data processing apparatus, including: a memory and a processor; the memory is used for storing a program for cluster data processing; the program for cluster data processing, when read and executed by the processor, performs the following operations:

acquiring attribute information of tasks running on a plurality of clusters within a first predetermined time period;

and determining the data to be copied and a target cluster to which the data to be copied needs to be copied according to the acquired attribute information of the tasks, so as to copy the data to be copied to the target cluster.

In some implementations, after the data to be copied and the target cluster to which the data to be copied needs to be copied are determined according to the acquired attribute information of the tasks, a copy list is generated and written into a metadata base so that the relevant clusters can obtain the copy list; the copy list is used for recording the position information of the data to be copied and the target cluster to which the data to be copied needs to be copied.

In some implementations, after the copy list is generated, the target cluster is instructed to copy the data to be copied according to the copy list.
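A minimal sketch of how a copy list might be applied, and of how a later read could be served locally, is given below. The catalog structure (a mapping from cluster name to the set of tables stored there) and the names `apply_copy_list` and `resolve_read` are hypothetical; the actual data transfer between clusters is outside the scope of the sketch and is represented only by updating the catalog.

```python
def apply_copy_list(catalog, copy_list):
    """Record that each listed table is now also present on its target cluster.

    catalog: dict mapping cluster name -> set of table names (assumed layout).
    copy_list: iterable of (table, source_cluster, target_cluster).
    """
    for table, source, target in copy_list:
        if table in catalog.get(source, set()):  # skip entries whose source is stale
            catalog.setdefault(target, set()).add(table)


def resolve_read(catalog, run_cluster, table, home_cluster):
    """Return the cluster a task should read the table from."""
    if table in catalog.get(run_cluster, set()):
        return run_cluster  # a local copy exists, so no cross-cluster read is needed
    return home_cluster


catalog = {"C1": {"t1"}, "C2": set()}
print(resolve_read(catalog, "C2", "t1", "C1"))  # prints C1
apply_copy_list(catalog, [("t1", "C1", "C2")])
print(resolve_read(catalog, "C2", "t1", "C1"))  # prints C2
```

Adding to a set makes the operation safe to repeat, which suits a copy list that is regenerated periodically.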

In some implementations, attribute information of tasks running on the plurality of clusters within the first predetermined length of time is periodically obtained.

In some implementation manners, the data to be copied and the target cluster which needs to copy the data to be copied are determined according to the acquired attribute information of the task in the following manners:

In some implementations, the data meeting the predetermined condition may be screened out as the data to be copied by: screening out data of which the first parameter value meets a first condition and the second parameter value meets a second condition as data to be copied;

for each data read by the task across the clusters, the first parameter value is the number of times that the cluster running the task reads the data within a second preset time; the second parameter value is the total times or continuous times that the first parameter value meets the first condition within a first preset time length; the second preset time length is less than the first preset time length; the first condition includes: the first parameter value is greater than or equal to a first threshold value; the second condition includes: the second parameter value is greater than or equal to a second threshold value.

In some implementations, the plurality of clusters belong to ODPS clusters.

In this embodiment, when a program for performing cluster data processing is read and executed by a processor, the executed operations correspond to step 201 and step 202 in the first embodiment; for further details of the operations performed by the program, reference may be made to the first embodiment, and therefore, the description thereof is omitted.

In addition, an embodiment of the present application further provides a computer-readable storage medium storing computer-executable instructions which, when executed by a processor, implement the cluster data processing method described above.

It will be understood by those skilled in the art that all or part of the steps of the above methods may be implemented by a program instructing associated hardware (e.g., a processor) to perform the steps, and the program may be stored in a computer readable storage medium, such as a read only memory, a magnetic or optical disk, and the like. Alternatively, all or part of the steps of the above embodiments may be implemented using one or more integrated circuits. Accordingly, the modules/units in the above embodiments may be implemented in hardware, for example, by an integrated circuit, or may be implemented in software, for example, by a processor executing programs/instructions stored in a memory to implement the corresponding functions. The present application is not limited to any specific form of hardware or software combination.

The foregoing shows and describes the general principles and features of the present application, together with its advantages. The present application is not limited to the above-described embodiments; the embodiments and the description in the specification and drawings only illustrate the principles of the application. Various changes and modifications may be made without departing from the spirit and scope of the application, and such changes and modifications fall within the scope of the application as claimed.

Claims

1. A method for cluster data processing, comprising:

acquiring attribute information of tasks running on a plurality of clusters within a first predetermined time period;

determining data to be copied and a target cluster needing to copy the data to be copied according to the acquired attribute information of the task so as to copy the data to be copied to the target cluster;

the determining, according to the obtained attribute information of the task, data to be copied and a target cluster that needs to copy the data to be copied includes:

screening data with a first parameter value meeting a first condition and a second parameter value meeting a second condition from data read by the task cross-cluster according to the acquired attribute information of the task, wherein the data is used as data to be copied;

for data read by each task across clusters, the first parameter value is the number of times that the clusters running the tasks read the data within a second preset time; the second parameter value is the total times or continuous times that the first parameter value meets the first condition within a first preset time length; the second predetermined length of time is less than the first predetermined length of time;

the attribute information of each task at least includes: running the cluster of the task and the cluster where the data read by the task are located;

the first condition includes: the first parameter value is greater than or equal to a first threshold value; the second condition includes: the second parameter value is greater than or equal to a second threshold value.

2. The cluster data processing method according to claim 1, wherein after determining the data to be copied and the target cluster that needs to copy the data to be copied according to the acquired attribute information of the task, the data processing method further comprises: generating a copy list, and writing the copy list into a metadata base so that a relevant cluster acquires the copy list; the copy list is used for recording the position information of the data to be copied and the target cluster which needs to copy the data to be copied.

3. The cluster data processing method of claim 2, wherein the data processing method further comprises: instructing the target cluster to copy the data to be copied according to the copy list.

4. The method according to claim 1, wherein the acquiring attribute information of the tasks running on the plurality of clusters within the first predetermined time period comprises: periodically acquiring the attribute information of the tasks running on the plurality of clusters within the first predetermined time period.

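5. A cluster data processing apparatus, comprising:

an acquisition unit, configured to acquire attribute information of tasks running on a plurality of clusters within a first predetermined time period;

a processing unit, configured to determine data to be copied and a target cluster needing to copy the data to be copied according to the acquired attribute information of the task, so as to copy the data to be copied to the target cluster;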

the processing unit is configured to determine, according to the acquired attribute information of the task, to-be-copied data and a target cluster to which the to-be-copied data needs to be copied, in the following manner:

6. The apparatus according to claim 5, wherein the processing unit is further configured to generate a replication list after determining data to be replicated and a target cluster that needs to replicate the data to be replicated according to the acquired attribute information of the task, and write the replication list into a metadata base, so that an associated cluster acquires the replication list; the replication list is used for recording the position information of the data to be replicated and a target cluster which needs to replicate the data to be replicated.

7. A clustered data processing apparatus, comprising: a memory and a processor;