CN111258724B - Data processing method, device, equipment and storage medium of distributed system

Data processing method, device, equipment and storage medium of distributed system

Info

Publication number: CN111258724B
Application number: CN202010037400.5A
Authority: CN (China)
Prior art keywords: data, processing, processed, nodes, time length
Legal status: Active (granted)
Other languages: Chinese (zh)
Other versions: CN111258724A
Inventor: 赵善亮
Current Assignee: Ping An Bank Co Ltd
Original Assignee: Ping An Bank Co Ltd
Application filed by Ping An Bank Co Ltd
Priority to CN202010037400.5A
Publication of CN111258724A
Application granted
Publication of CN111258724B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F 9/00 Arrangements for program control, e.g. control units
    • G06F 9/06 Arrangements for program control using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F 9/46 Multiprogramming arrangements
    • G06F 9/54 Interprogram communication
    • G06F 9/544 Buffers; Shared memory; Pipes
    • G06F 9/465 Distributed object oriented systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data processing method of a distributed system, which comprises the following steps: receiving a data processing signal and determining the data quantity of data to be processed; determining, from the distributed system, internal processing nodes for processing the data to be processed, and counting the internal processing nodes to obtain the number of internal processing nodes; acquiring an expected processing time length; acquiring historical parameters of historical data; judging, according to the historical parameters, the data quantity of the data to be processed, the number of internal processing nodes and the expected processing time length, whether the internal processing nodes can process the data to be processed within the expected processing time length; and when it is judged that the internal processing nodes cannot process the data to be processed within the expected processing time length, selecting one or more external nodes to add to the distributed system, so that the internal processing nodes and the external nodes jointly process the data to be processed. The invention also discloses a data processing device of the distributed system, a computer device and a computer-readable storage medium.

Description

Data processing method, device, equipment and storage medium of distributed system
Technical Field
The present invention relates to the field of data processing technologies, and in particular, to a data processing method and apparatus for a distributed system, a computer device, and a computer readable storage medium.
Background
In view of the openness, concurrency, scalability and high extensibility of distributed systems, enterprises typically use a distributed system to process core service data so as to increase the data processing speed.
However, in the course of developing the present invention, the inventors found at least the following drawback in the prior art: a project usually has multiple batches of data waiting to be processed, and each batch corresponds to an expected processing time. Because the prior art does not predict the processing time of a batch before processing it, or predicts it inaccurately, or merely predicts it without acting on the prediction, other batches may wait indefinitely, which severely drags down project progress.
Disclosure of Invention
The present invention aims to provide a data processing method, device, computer equipment and computer readable storage medium of a distributed system, which can solve the above-mentioned defects in the prior art.
One aspect of the present invention provides a data processing method of a distributed system, the method including: receiving a data processing signal and determining the data quantity of data to be processed; determining, from the distributed system, internal processing nodes for processing the data to be processed, and counting the internal processing nodes to obtain the number of internal processing nodes; acquiring an expected processing time length, wherein the expected processing time length is the time length expected to be spent processing the data to be processed; obtaining history parameters of history data, wherein the history parameters comprise: the data amount of the history data, the number of history nodes in the distributed system that processed the history data, and the time length spent processing the history data; judging, according to the history parameters, the data quantity of the data to be processed, the number of internal processing nodes and the expected processing time length, whether the internal processing nodes can process the data to be processed within the expected processing time length; and when it is judged that the internal processing nodes cannot process the data to be processed within the expected processing time length, selecting one or more external nodes to add to the distributed system, so that the internal processing nodes and the external nodes jointly process the data to be processed.
Optionally, the step of obtaining the history parameters of the history data includes: acquiring N preset tags, wherein each preset tag is associated with one or more batches of the historical data, the time length spent for processing each batch of the historical data is within the time length range represented by the preset tag associated with the historical data, and N is a positive integer greater than or equal to 1; determining a target label from N preset labels, wherein the expected processing time length is within the time length range represented by the target label; and acquiring the history parameters of the history data associated with the target tag.
Optionally, the step of obtaining the history parameter of the history data associated with the target tag includes: judging whether a plurality of batches of historical data related to the target label exist or not; when a plurality of batches of the historical data associated with the target tag exist, acquiring the time length spent for processing each batch of the historical data as a historical processing time length; determining the history processing time length with the smallest time interval with the expected processing time length from all the acquired history processing time lengths as a target processing time length; and acquiring the historical parameters of the historical data corresponding to the target processing time length in all batches of the historical data.
Optionally, the step of determining whether the internal processing node can process the data to be processed within the expected processing time period according to the history parameter, the data amount of the data to be processed, the number of the internal processing nodes and the expected processing time period includes: calculating the data size of each history node for processing the history data in the distributed system in unit time, wherein the data size is used as the processing data size in unit time; comparing the product of the amount of data processed in the unit time, the number of internal processing nodes and the expected processing time with the amount of data of the data to be processed to judge whether the internal processing nodes can process the data to be processed within the expected processing time; or comparing the quotient of the data amount of the data to be processed, the unit time processing data amount and the number of the internal processing nodes with the expected processing time length to judge whether the internal processing nodes can process the data to be processed within the expected processing time length; or comparing the quotient of the data volume of the data to be processed, the unit time processing data volume and the expected processing time length with the number of the internal processing nodes to judge whether the internal processing nodes can process the data to be processed within the expected processing time length.
Optionally, the step of selecting one or more external nodes to add to the distributed system comprises: determining the quotient of the data quantity of the data to be processed, the unit-time processing data quantity and the expected processing time length as the total number of nodes required to process the data to be processed within the expected processing time length; calculating the difference between the total number of nodes and the number of internal processing nodes; and selecting the difference number of external nodes and adding them to the distributed system.
Optionally, the data to be processed is split into a plurality of data slices, and the method further includes: monitoring to acquire that an abnormal slice exists in the process of jointly processing the data to be processed by the internal processing node and the external node, and determining the node where the abnormal slice is located, wherein the data slice obtained by segmentation comprises the abnormal slice; the node where the abnormal fragments are located is controlled to send the abnormal fragments to a preset cache queue; and monitoring that all normal fragments in the data fragments obtained by segmentation are processed, and controlling the internal processing nodes and the external nodes to reprocess all the abnormal fragments stored in the preset cache queue.
Another aspect of the present invention provides a data processing apparatus for a distributed system, the apparatus comprising: the first determining module is used for receiving the data processing signal and determining the data quantity of the data to be processed; the second determining module is used for determining the internal processing nodes for processing the data to be processed from the distributed system, counting the number of the internal processing nodes and taking the number of the internal processing nodes as the number of the internal processing nodes; the first acquisition module is used for acquiring expected processing time length, wherein the expected processing time length is the time length spent for expected processing the data to be processed; the second obtaining module is configured to obtain a history parameter of the history data, where the history parameter includes: the data amount of the history data, the number of history nodes in the distributed system that process the history data, and the time period spent processing the history data; the judging module is used for judging whether the internal processing node can process the data to be processed within the expected processing time according to the history parameter, the data quantity of the data to be processed, the number of the internal processing nodes and the expected processing time; and the first processing module is used for selecting one or more external nodes to be added to the distributed system when the internal processing node is judged to be incapable of processing the data to be processed within the expected processing time, and enabling the internal processing node and the external nodes to jointly process the data to be processed.
Optionally, the data to be processed is split into a plurality of data slices, and the apparatus further includes: a third determining module, configured to monitor and learn that an abnormal slice exists in a process that the internal processing node and the external node jointly process the data to be processed, and determine a node where the abnormal slice is located, where the data slice obtained by segmentation includes the abnormal slice; the control module is used for controlling the node where the abnormal fragments are located to send the abnormal fragments to a preset cache queue; and the second processing module is used for monitoring and knowing that all normal fragments in the data fragments obtained by segmentation are processed, and controlling the internal processing node and the external node to reprocess all the abnormal fragments stored in the preset cache queue.
Yet another aspect of the present invention provides a computer device, including a memory, a processor, and a computer program stored in the memory and executable on the processor, where the processor, when executing the computer program, implements the data processing method of the distributed system according to any one of the embodiments described above.
A further aspect of the present invention provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements a data processing method of a distributed system according to any of the embodiments described above.
Before processing the data to be processed, the historical parameters of the historical data are consulted, and whether all the internal processing nodes can process the data to be processed within the expected processing time length is estimated according to those parameters. If not, the number of nodes processing the data to be processed is dynamically increased, for example by selecting one or more external nodes to add to the distributed system and having all the internal processing nodes and the selected external nodes jointly process the data to be processed. By dynamically increasing the number of nodes processing the data to be processed, online regulation is achieved and the time needed to process the data is shortened, so that the data to be processed is processed within the expected processing time length as far as possible. This overcomes the defects in the prior art, improves the processing capacity of the distributed system, reduces the processing pressure and resource occupation time of a single node, and accelerates project progress.
Drawings
Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for purposes of illustrating the preferred embodiments and are not to be construed as limiting the invention. Also, like reference numerals are used to designate like parts throughout the figures. In the drawings:
FIG. 1 schematically illustrates a flow chart of a data processing method of a distributed system according to an embodiment of the invention;
FIG. 2 schematically illustrates a schematic diagram of a data processing scheme of a distributed system according to an embodiment of the present invention;
FIG. 3 schematically illustrates a block diagram of a data processing apparatus of a distributed system according to an embodiment of the present invention;
FIG. 4 schematically shows a block diagram of a computer device adapted to implement a data processing method of a distributed system according to an embodiment of the invention.
Detailed Description
The present invention will be described in further detail with reference to the drawings and examples, in order to make the objects, technical solutions and advantages of the present invention more apparent. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
It should be noted that, in this document, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or apparatus that comprises the element.
The embodiment of the invention provides a data processing method of a distributed system, which can be applied to the following business scenes: the distributed system comprises a master node and a plurality of slave nodes, the master node can manage the slave nodes, and the master node can dynamically regulate and control the number of nodes in the distributed system by executing the data processing method of the distributed system provided by the embodiment, so that the aim of processing data to be processed within expected processing time can be fulfilled to the greatest extent. In particular, FIG. 1 schematically shows a flow chart of a data processing method of a distributed system according to an embodiment of the invention. As shown in fig. 1, the data processing method of the distributed system may include steps S1 to S6, where:
Step S1, a data processing signal is received, and the data quantity of data to be processed is determined.
The data to be processed may be any business data such as web page access data, user billing data, or user purchase records, etc.
Step S2, determining internal processing nodes for processing the data to be processed from the distributed system, and counting the internal processing nodes to obtain the number of internal processing nodes.
After receiving the data processing signal, the master node may first determine which slave nodes in an idle state will be allocated the task of processing the data to be processed, take those slave nodes as the internal processing nodes, and count them to obtain the number of internal processing nodes. When determining the internal processing nodes, the master node judges whether other data to be processed currently exists. If not, the master node determines that the task of processing the data to be processed will be distributed to all slave nodes in an idle state; otherwise, the master node may determine the number of internal processing nodes according to the data volume of the data to be processed and the data volume of the other data to be processed. For example, if the data volume of the data to be processed is 10 GB, the data volume of the other data to be processed is 5 GB, and the number of idle slave nodes is 6, the master node may determine, according to the ratio of the two data volumes, that the number of internal processing nodes is 4 and that the number of idle slave nodes for processing the other data to be processed is 2.
For example, if the master node determines that the task of processing the data to be processed will be allocated to 3 slave nodes in an idle state, the internal processing nodes in this embodiment are the 3 slave nodes, and the number of internal processing nodes is 3.
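The proportional split described above can be illustrated with a short sketch. The following Python snippet is only an illustrative reconstruction of how idle slave nodes might be divided between the current batch and other pending data; the function and variable names are assumptions and do not appear in the patent.

    def allocate_internal_nodes(idle_node_ids, data_volume_gb, other_volume_gb):
        # Split the idle slave nodes between the current batch and other pending
        # data in proportion to their data volumes (hypothetical helper).
        if other_volume_gb <= 0:
            # No other pending data: every idle slave node processes the current batch.
            return list(idle_node_ids)
        share = round(len(idle_node_ids) * data_volume_gb / (data_volume_gb + other_volume_gb))
        share = max(1, min(share, len(idle_node_ids) - 1))  # keep at least one node per batch
        return list(idle_node_ids)[:share]

    # Example from the description: 10 GB to process, 5 GB of other data, 6 idle slave nodes.
    internal_nodes = allocate_internal_nodes(["n1", "n2", "n3", "n4", "n5", "n6"], 10, 5)
    print(len(internal_nodes))  # 4 -> the number of internal processing nodes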
It should be noted that, at this time, the master node has not actually allocated the task of processing the data to be processed to the internal processing node, and the internal processing node has not started to execute the task of processing the data to be processed either, where the purpose of determining the internal processing node is to estimate whether the internal processing node can process the data to be processed within the expected processing time period.
Step S3, acquiring an expected processing time length, wherein the expected processing time length is the time length expected to be spent processing the data to be processed.
For each task in each project, the staff estimates the time length required to complete the task, and the tasks and their corresponding estimated time lengths are stored in a task table. The expected processing time length may therefore be obtained from this task table.
Step S4, acquiring history parameters of history data, wherein the history parameters comprise: the amount of data of the history data, the number of history nodes in the distributed system that process the history data, and the length of time it takes to process the history data.
In this embodiment, after a batch of history data is processed, the history parameters of the history data may be recorded, and the history parameters may be stored in a corresponding data table for subsequent query.
In addition, several preset tags may be configured in advance, where each preset tag represents a duration range, for example 1 min-30 min, 31 min-60 min, or 61 min-90 min. For each batch of historical data, it is determined which preset tag represents the duration range that covers the time spent processing that batch, and the determined preset tag is then associated with the batch of historical data.
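As a minimal sketch, the preset tags and the recorded history parameters might be kept in structures like the following; the dictionary schema and names here are assumptions for illustration only, not part of the patent.

    # Preset tags, each representing a duration range in minutes (ranges taken
    # from the example in the description).
    PRESET_TAGS = {
        "tag_1": (1, 30),
        "tag_2": (31, 60),
        "tag_3": (61, 90),
    }

    history_table = []  # one record of history parameters per processed batch

    def record_history(data_volume_gb, node_count, minutes_spent):
        # Store the history parameters of a finished batch and associate the preset
        # tag whose duration range covers the time spent (hypothetical schema).
        tag = next((name for name, (lo, hi) in PRESET_TAGS.items()
                    if lo <= minutes_spent <= hi), None)
        history_table.append({
            "data_volume_gb": data_volume_gb,
            "node_count": node_count,
            "minutes_spent": minutes_spent,
            "tag": tag,
        })

    record_history(100, 4, 45)  # associated with "tag_2" (31-60 min)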
Optionally, step S4 may include steps S41 to S43, wherein:
Step S41, acquiring N preset tags, wherein each preset tag is associated with one or more batches of historical data, the time length spent processing each batch of historical data is within the duration range represented by the preset tag associated with that historical data, and N is a positive integer greater than or equal to 1;
step S42, determining a target label from N preset labels, wherein the expected processing time length is within the time length range represented by the target label;
step S43, acquiring history parameters of history data associated with the target tag.
In this embodiment, each preset tag may be associated with one or more batches of history data, and the target tag determined from the N preset tags may also be associated with one or more batches of history data. When the target tag is associated with only one batch of history data, the history parameters of that batch can be obtained directly; when the target tag is associated with multiple batches of history data, one batch is selected from the batches associated with the target tag, and the history parameters of the selected batch are acquired.
Specifically, step S43 may include steps S431 to S434, wherein:
Step S431, judging whether a plurality of batches of historical data associated with the target tag exist;
step S432, when there are multiple batches of history data associated with the target tag, acquiring the time length spent for processing each batch of history data as a history processing time length;
step S433, determining a historical processing time length with the smallest time interval with the expected processing time length from all the obtained historical processing time lengths as a target processing time length;
Step S434, acquiring the history parameters of the history data corresponding to the target processing duration among all batches of the history data.
The time length spent processing each batch of history data is referred to as a history processing time length. When multiple batches of historical data are associated with the target tag, the time interval between each history processing time length and the expected processing time length is calculated, the history processing time length with the smallest interval is taken as the target processing time length, the historical data corresponding to the target processing time length is then determined from the multiple batches of historical data, and the history parameters of the determined historical data are acquired.
For example, there are three batches of history data: the time spent processing the first batch is 28 min, the second batch 45 min, and the third batch 52 min. Suppose the first preset tag corresponds to 1 min-30 min and the second preset tag corresponds to 31 min-60 min; the first preset tag is then associated with the first batch of historical data, and the second preset tag is associated with both the second and third batches. If the expected processing time length is 30 min, the history parameters of the first batch, associated with the first preset tag, may be obtained. If the expected processing time length is 40 min, the second preset tag is selected; however, two batches of history data are associated with the second preset tag. Since the interval between the time spent processing the second batch and the expected processing time length is 5 min, while that of the third batch is 12 min, the time spent processing the second batch is taken as the target processing time length, and the history parameters of the second batch of history data are obtained.
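The selection of the target tag and of the closest history batch can be sketched as follows; this reproduces the worked example above (28 min, 45 min and 52 min batches, expected time 40 min) and uses hypothetical names.

    PRESET_TAGS = {"tag_1": (1, 30), "tag_2": (31, 60)}
    history_table = [
        {"batch": 1, "minutes_spent": 28, "tag": "tag_1"},
        {"batch": 2, "minutes_spent": 45, "tag": "tag_2"},
        {"batch": 3, "minutes_spent": 52, "tag": "tag_2"},
    ]

    def pick_history_batch(expected_minutes):
        # Determine the target tag whose duration range covers the expected time, then
        # pick the associated batch whose history processing time is closest to it.
        target_tag = next(name for name, (lo, hi) in PRESET_TAGS.items()
                          if lo <= expected_minutes <= hi)
        candidates = [h for h in history_table if h["tag"] == target_tag]
        return min(candidates, key=lambda h: abs(h["minutes_spent"] - expected_minutes))

    print(pick_history_batch(40)["batch"])  # 2 -> the 45 min batch (interval 5 min < 12 min)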
Step S5, judging whether the internal processing nodes can process the data to be processed within the expected processing time length according to the historical parameters, the data quantity of the data to be processed, the number of internal processing nodes and the expected processing time length.
In this embodiment, the history parameters are used as a reference, that is, as a standard against which it is determined whether the internal processing nodes can process the data to be processed within the expected processing time length.
In this embodiment, the judgment can be made in several ways; for example, step S5 may include:
calculating the data size of each historical node for processing the historical data in the distributed system in unit time, and taking the data size as the processing data size in unit time;
comparing the product of the unit time processing data quantity, the number of the internal processing nodes and the expected processing time length with the data quantity of the data to be processed to judge whether the internal processing nodes can process the data to be processed within the expected processing time length; or alternatively
Comparing the quotient of the data quantity of the data to be processed, the data quantity processed in unit time and the number of the internal processing nodes with the expected processing time length to judge whether the internal processing nodes can process the data to be processed within the expected processing time length; or alternatively
And comparing the quotient of the data quantity of the data to be processed, the unit time processing data quantity and the expected processing time length with the quantity of the internal processing nodes so as to judge whether the internal processing nodes can process the data to be processed within the expected processing time length.
Specifically, the calculation method of the processing data amount per unit time is as follows: dividing the data amount of the history data by the number of history nodes for processing the history data to obtain a quotient, dividing the quotient by the time length spent for processing the history data, and taking the obtained result as the processing data amount in unit time.
In the first scheme of step S5, the product of the amount of data processed per unit time, the number of internal processing nodes and the expected processing time length is calculated and compared with the amount of data to be processed. If the product is greater than or equal to the amount of data to be processed, the internal processing nodes can process the data to be processed within the expected processing time length; if the product is less than the amount of data to be processed, they cannot.
In the second scheme of step S5, the amount of data to be processed is divided by the amount of data processed per unit time, the resulting quotient is divided by the number of internal processing nodes, and the final quotient (the estimated processing time) is compared with the expected processing time length. If the final quotient is less than or equal to the expected processing time length, the internal processing nodes can process the data to be processed within the expected processing time length; if the final quotient is greater than the expected processing time length, they cannot.
In the third scheme of step S5, the amount of data to be processed is divided by the amount of data processed per unit time, the resulting quotient is divided by the expected processing time length, and the final quotient (the number of nodes required) is compared with the number of internal processing nodes. If the final quotient is less than or equal to the number of internal processing nodes, the internal processing nodes can process the data to be processed within the expected processing time length; if the final quotient is greater than the number of internal processing nodes, they cannot.
Optionally, step S5 may further include a fourth scheme: and comparing the quotient of the data quantity of the data to be processed, the number of the internal processing nodes and the expected processing time with the data quantity processed in unit time so as to judge whether the internal processing nodes can process the data to be processed within the expected processing time.
Specifically, if the internal processing nodes are to process the data to be processed within the expected processing time length, the quotient of the data volume of the data to be processed, the number of internal processing nodes and the expected processing time length is the minimum amount of data that each node processing the data to be processed must handle per unit time. The calculation is as follows: divide the data volume of the data to be processed by the number of internal processing nodes, then divide the resulting quotient by the expected processing time length, and take the result as the final quotient, which is the amount of data each node must process per unit time. The final quotient is then compared with the amount of data processed per unit time: if the final quotient is less than or equal to the amount of data processed per unit time, the internal processing nodes can process the data to be processed within the expected processing time length; if the final quotient is greater than the amount of data processed per unit time, they cannot.
For example, continuing the above example, the acquired history parameters of the second batch of history data are: the data amount of the history data is 100 GB, the number of history nodes that processed it is 4, and the time spent processing it is 45 min, so each history node processed about 0.56 GB of history data per minute. If the data volume of the data to be processed is 200 GB, the number of internal processing nodes determined is 6, and the expected processing time length is 40 min, then the 6 internal processing nodes are estimated to need about 60 min to process the 200 GB of data to be processed; that is, the internal processing nodes cannot process the data to be processed within the expected processing time length.
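The first judgment scheme, applied to the numbers of this example, could look like the sketch below; the per-node throughput is derived from the history parameters, and the function name is an assumption.

    def can_finish_in_time(history_volume_gb, history_nodes, history_minutes,
                           pending_volume_gb, internal_nodes, expected_minutes):
        # First scheme: compare (per-node unit-time throughput x node count x
        # expected time) with the amount of data to be processed.
        per_node_rate = history_volume_gb / history_nodes / history_minutes  # GB per minute
        processable = per_node_rate * internal_nodes * expected_minutes
        return processable >= pending_volume_gb

    # 100 GB history batch, 4 nodes, 45 min -> about 0.56 GB/min per node.
    # 200 GB on 6 internal nodes in 40 min would need about 60 min, so the check fails.
    print(can_finish_in_time(100, 4, 45, 200, 6, 40))  # False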
Step S6, when it is judged that the internal processing nodes cannot process the data to be processed within the expected processing time length, selecting one or more external nodes to add to the distributed system, and having the internal processing nodes and the external nodes jointly process the data to be processed.
In this embodiment, each external node is a node that is in a standby state and has been preconfigured to be capable of starting up to work at any time, and the master node may select one or more external nodes to be added to the distributed system, and make the internal processing node and the selected external nodes jointly process the data to be processed.
It should be noted that, the processing described in this embodiment may be any form of processing, for example, if the data to be processed is web page access data, the processing may be determining behavior data of the user according to the web page access data, determining a popular web page according to the web page access data, and so on. For another example, the data to be processed is a user purchase record, and the processing may be to determine a commodity with higher sales according to the user purchase record, and so on.
Optionally, the step of selecting one or more external nodes to add to the distributed system in step S6 may include steps S61 to S63, wherein:
step S61, determining the quotient of the data quantity of the data to be processed, the data quantity processed in unit time and the expected processing time length as the total node quantity required for processing the data to be processed in the expected processing time length;
step S62, calculating the difference value between the total node number and the internal processing node number;
Step S63, selecting the difference number of external nodes and adding them to the distributed system.
In this embodiment, the amount of data processed per unit time is taken as the amount of data each node processes per unit time; therefore, by calculating the quotient of the amount of data to be processed, the amount of data processed per unit time and the expected processing time length, the total number of nodes required to process the data to be processed within the expected processing time length is obtained. The difference between this total number of nodes and the number of internal processing nodes is then calculated, and that number of external nodes is selected and added to the distributed system. For example, continuing the above example, the expected processing time length is 40 min, and the total number of nodes required to process 200 GB of data to be processed within 40 min is 200 ÷ 40 ÷ 0.56 ≈ 9, i.e. 9 nodes are required; the difference is 9 − 6 = 3, so 3 external nodes can be selected and added to the distributed system.
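A sketch of this node-count calculation, using the same figures, might look as follows; rounding the required node count up to a whole number is an assumption made for illustration, consistent with the approximately 9 nodes in the example.

    import math

    def external_nodes_needed(pending_volume_gb, per_node_rate_gb_per_min,
                              expected_minutes, internal_node_count):
        # Total nodes needed = data volume / (per-node rate x expected time),
        # rounded up; the shortfall is covered by external standby nodes.
        total_nodes = math.ceil(pending_volume_gb /
                                (per_node_rate_gb_per_min * expected_minutes))
        return max(0, total_nodes - internal_node_count)

    # 200 GB to process, about 0.56 GB/min per node, 40 min expected, 6 internal nodes.
    print(external_nodes_needed(200, 0.56, 40, 6))  # 3 external nodes (9 total - 6 internal)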
In this embodiment, the number of external nodes to be added can be accurately determined, which ensures as far as possible that the time spent processing the data to be processed is close to or consistent with the expected processing time length, further improves the processing capacity of the distributed system, reduces the processing pressure and resource occupation time of a single node, and accelerates project progress.
Optionally, the data to be processed is split into a plurality of data fragments, and the method further comprises steps A1 to A3, wherein:
Step A1, monitoring and learning that an abnormal fragment exists while the internal processing nodes and the external nodes jointly process the data to be processed, and determining the node where the abnormal fragment is located, wherein the data fragments obtained by splitting include the abnormal fragment;
Step A2, controlling the node where the abnormal fragment is located to send the abnormal fragment to a preset cache queue;
Step A3, upon monitoring that all normal fragments among the data fragments obtained by splitting have been processed, controlling the internal processing nodes and the external nodes to reprocess all abnormal fragments stored in the preset cache queue.
In this embodiment, before processing the data to be processed, the master node splits the data to be processed into a plurality of data fragments, generates a processing task for each data fragment, and issues the processing tasks in turn to the internal processing nodes and the selected one or more external nodes, each of which may be referred to as a working node. For example, one data fragment is first allocated to each working node; then, according to each working node's processing speed, more data fragments are allocated to faster working nodes and fewer to slower ones. As shown in FIG. 2, the master node (which may be the execution plan generation engine in FIG. 2) may generate one task from each data fragment and then assign it to a worker node (an executor in FIG. 2) in the distributed system (the task execution cluster in FIG. 2).
Each working node can actively report its task processing status; for example, when a task has not been completed within a preset time length, the working node reports that the data fragment currently being processed is an abnormal fragment. The master node can locate the working node where the abnormal fragment occurs and instruct it to temporarily store the abnormal fragment in a preset cache queue. After all normal fragments of the data to be processed have been processed, the master node instructs all working nodes to reprocess the abnormal fragments stored in the preset cache queue.
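The abnormal-fragment handling of steps A1 to A3 can be sketched, in a deliberately simplified single-process form, as follows; the queue, the round-robin assignment and the function names are illustrative assumptions rather than the patent's implementation.

    from collections import deque

    def process_with_retry(fragments, worker_nodes, process_fn):
        # Process the data fragments on the worker nodes; fragments that fail are
        # parked in a cache queue and reprocessed only after all normal fragments
        # have been handled (simplified sketch of steps A1-A3).
        abnormal_queue = deque()
        for i, fragment in enumerate(fragments):
            node = worker_nodes[i % len(worker_nodes)]  # simple round-robin assignment
            try:
                process_fn(node, fragment)
            except Exception:
                abnormal_queue.append(fragment)         # step A2: park the abnormal fragment
        # Step A3: all normal fragments are done, reprocess the parked abnormal ones.
        while abnormal_queue:
            process_fn(worker_nodes[0], abnormal_queue.popleft())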
In this embodiment, when an individual abnormal fragment occurs, the whole processing flow does not have to be interrupted: the abnormal fragments that occur are stored and finally processed together, which further saves processing time.
The embodiment of the present invention further provides a data processing device of a distributed system, where the data processing device of the distributed system corresponds to the data processing method of the distributed system provided in the foregoing embodiment, and corresponding technical features and technical effects are not described in detail in this embodiment, and reference may be made to the embodiments for relevant points. In particular, FIG. 3 schematically shows a block diagram of a data processing apparatus of a distributed system according to an embodiment of the invention. As shown in fig. 3, the data processing apparatus 300 of the distributed system may include a first determining module 301, a second determining module 302, a first obtaining module 303, a second obtaining module 304, a judging module 305, and a first processing module 306, where:
A first determining module 301, configured to receive a data processing signal, and determine a data amount of data to be processed;
a second determining module 302, configured to determine, from the distributed system, internal processing nodes for processing data to be processed, and count the number of internal processing nodes, as the number of internal processing nodes;
a first obtaining module 303, configured to obtain a desired processing duration, where the desired processing duration is a duration spent for desired processing of data to be processed;
a second obtaining module 304, configured to obtain history parameters of the history data, where the history parameters include: the amount of data of the historical data, the number of historical nodes in the distributed system that process the historical data, and the length of time spent processing the historical data;
the judging module 305 is configured to judge whether the internal processing node can process the data to be processed within the expected processing duration according to the history parameter, the data amount of the data to be processed, the number of processing nodes, and the expected processing duration;
the first processing module 306 is configured to select one or more external nodes to add to the distributed system when it is determined that the internal processing node cannot process the data to be processed within the desired processing duration, and cause the internal processing node and the external nodes to jointly process the data to be processed.
Optionally, the second acquisition module is further configured to: acquiring N preset tags, wherein each preset tag is associated with one or more batches of historical data, the time length spent for processing each batch of historical data is within the time length range represented by the preset tag associated with the historical data, and N is a positive integer greater than or equal to 1; determining a target label from N preset labels, wherein the expected processing time length is within the time length range represented by the target label; and acquiring the history parameters of the history data associated with the target tag.
Optionally, the second obtaining module, when obtaining the history parameter of the history data associated with the target tag, is further configured to: judging whether a plurality of batches of historical data associated with the target tag exist or not; when a plurality of batches of historical data associated with the target tag exist, acquiring the time length spent for processing each batch of historical data as one historical processing time length; determining a historical processing time length with the smallest time interval with the expected processing time length from all the acquired historical processing time lengths as a target processing time length; and acquiring historical parameters of historical data corresponding to the target processing time length in all batches of historical data.
Optionally, the judging module is further configured to: calculating the data size of each historical node for processing the historical data in the distributed system in unit time, and taking the data size as the processing data size in unit time; comparing the product of the unit time processing data quantity, the number of the internal processing nodes and the expected processing time length with the data quantity of the data to be processed to judge whether the internal processing nodes can process the data to be processed within the expected processing time length; or comparing the quotient of the data quantity of the data to be processed, the data quantity processed in unit time and the number of the internal processing nodes with the expected processing time length to judge whether the internal processing nodes can process the data to be processed within the expected processing time length; or comparing the quotient of the data quantity of the data to be processed, the unit time processing data quantity and the expected processing time length with the quantity of the internal processing nodes so as to judge whether the internal processing nodes can process the data to be processed within the expected processing time length.
Optionally, the first processing module, when selecting one or more external nodes to add to the distributed system, is further configured to: determining the quotient of the data quantity of the data to be processed, the data quantity processed in unit time and the expected processing time length as the total node quantity required by the data to be processed in the expected processing time length; calculating the difference between the total node number and the internal processing node number; the difference value is selected and added to the distributed system.
Optionally, the data to be processed is split into a plurality of data slices, and the apparatus may further include: the third determining module is used for monitoring and knowing that an abnormal slice exists in the process of jointly processing the data to be processed by the internal processing node and the external node, and determining the node where the abnormal slice is located, wherein the data slice obtained by cutting comprises the abnormal slice; the control module is used for controlling the node where the abnormal slice is located to send the abnormal slice to a preset cache queue; the second processing module is used for monitoring and knowing that all normal fragments in the data fragments obtained by segmentation are processed, and controlling the internal processing nodes and the external nodes to reprocess all abnormal fragments stored in the preset cache queue.
FIG. 4 schematically shows a block diagram of a computer device adapted to implement a data processing method of a distributed system according to an embodiment of the invention. In this embodiment, the computer device 400 may be a smart phone, a tablet computer, a notebook computer, a desktop computer, a rack server, a blade server, a tower server, or a cabinet server (including a stand-alone server or a server cluster formed by a plurality of servers) capable of executing programs. As shown in FIG. 4, the computer device 400 of this embodiment includes at least, but is not limited to: a memory 401, a processor 402, and a network interface 403, which may be communicatively connected to each other through a system bus. It should be noted that FIG. 4 only shows the computer device 400 with components 401-403, but it should be understood that not all of the illustrated components are required to be implemented and that more or fewer components may be implemented instead.
In this embodiment, the memory 401 includes at least one type of computer-readable storage medium, such as flash memory, a hard disk, a multimedia card, a card-type memory (e.g., SD or DX memory), random access memory (RAM), static random access memory (SRAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), programmable read-only memory (PROM), magnetic memory, a magnetic disk, or an optical disk. In some embodiments, the memory 401 may be an internal storage unit of the computer device 400, such as a hard disk or memory of the computer device 400. In other embodiments, the memory 401 may also be an external storage device of the computer device 400, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a Flash Card provided on the computer device 400. Of course, the memory 401 may also include both the internal storage unit of the computer device 400 and an external storage device. In this embodiment, the memory 401 is typically used to store the operating system installed on the computer device 400 and various types of application software, such as the program code of the data processing method of the distributed system. In addition, the memory 401 can also be used to temporarily store various types of data that have been output or are to be output.
The processor 402 may, in some embodiments, be a central processing unit (CPU), a controller, a microcontroller, a microprocessor, or another data processing chip. The processor 402 is generally used to control the overall operation of the computer device 400, for example to perform control and processing related to data interaction or communication with the computer device 400. In this embodiment, the processor 402 is used to run the program code stored in the memory 401 or to process data, for example the program code of the data processing method of the distributed system.
In this embodiment, the data processing method of the distributed system stored in the memory 401 may also be divided into one or more program modules and executed by one or more processors (the processor 402 in this embodiment) to complete the present invention.
The network interface 403 may include a wireless network interface or a wired network interface, the network interface 403 typically being used to establish a communication link between the computer device 400 and other computer devices. For example, the network interface 403 is used to connect the computer device 400 to an external terminal through a network, establish a data transmission channel and a communication link between the computer device 400 and the external terminal, and the like. The network may be a wireless or wired network such as an Intranet (Intranet), the Internet (Internet), a global system for mobile communications (Global System of Mobile communication, abbreviated as GSM), wideband code division multiple access (Wideband Code Division Multiple Access, abbreviated as WCDMA), a 4G network, a 5G network, bluetooth (Bluetooth), wi-Fi, etc.
The present embodiment also provides a computer-readable storage medium including a flash memory, a hard disk, a multimedia card, a card memory (e.g., SD or DX memory, etc.), a Random Access Memory (RAM), a Static Random Access Memory (SRAM), a read-only memory (ROM), an electrically erasable programmable read-only memory (EEPROM), a programmable read-only memory (PROM), a magnetic memory, a magnetic disk, an optical disk, a server, an App application store, etc., on which a computer program is stored, which when executed by a processor, implements a data processing method of a distributed system.
It will be apparent to those skilled in the art that the modules or steps of the embodiments of the invention described above may be implemented in a general purpose computing device, they may be concentrated on a single computing device, or distributed across a network of computing devices, they may alternatively be implemented in program code executable by computing devices, so that they may be stored in a storage device for execution by computing devices, and in some cases, the steps shown or described may be performed in a different order than what is shown or described, or they may be separately fabricated into individual integrated circuit modules, or a plurality of modules or steps in them may be fabricated into a single integrated circuit module. Thus, embodiments of the invention are not limited to any specific combination of hardware and software.
From the above description of the embodiments, it will be clear to those skilled in the art that the above-described embodiment method may be implemented by means of software plus a necessary general hardware platform, but of course may also be implemented by means of hardware, but in many cases the former is a preferred embodiment.
The foregoing description is only of the preferred embodiments of the present invention, and is not intended to limit the scope of the invention, but rather is intended to cover any equivalents of the structures or equivalent processes disclosed herein or in the alternative, which may be employed directly or indirectly in other related arts.

Claims (7)

1. A method of data processing for a distributed system, the method comprising:
receiving a data processing signal and determining the data quantity of data to be processed;
determining internal processing nodes for processing the data to be processed from a distributed system, and counting the number of the internal processing nodes to be used as the number of the internal processing nodes;
acquiring expected processing time length, wherein the expected processing time length is the time length spent for expected processing the data to be processed;
obtaining historical parameters of historical data, wherein the historical parameters comprise: the amount of data of the historical data, the number of historical nodes in the distributed system that process the historical data, and the length of time it takes to process the historical data;
Judging whether the internal processing node can process the data to be processed within the expected processing time according to the history parameters, the data quantity of the data to be processed, the number of the internal processing nodes and the expected processing time;
when the internal processing node is judged to be incapable of processing the data to be processed within the expected processing time length, selecting one or more external nodes to be added to the distributed system, and enabling the internal processing node and the external nodes to jointly process the data to be processed;
the step of acquiring the history parameters of the history data includes: acquiring N preset tags, wherein each preset tag is associated with one or more batches of historical data, the time length spent for processing each batch of historical data is within the time length range represented by the preset tag associated with the historical data, and N is a positive integer greater than or equal to 1; determining a target tag from N preset tags, wherein the expected processing time length is within the time length range represented by the target tag; judging whether a plurality of batches of historical data associated with the target tag exist or not; when a plurality of batches of the historical data associated with the target tag exist, acquiring the time length spent for processing each batch of the historical data as a historical processing time length; determining the historical processing time length with the smallest time interval with the expected processing time length from all the acquired historical processing time lengths, and taking the historical processing time length as a target processing time length; acquiring the history parameters of the history data corresponding to the target processing time length in all batches of the history data;
The data to be processed is sliced into a plurality of data slices, the method further comprising: monitoring to acquire that an abnormal fragment exists in the process of jointly processing the data to be processed by the internal processing node and the external node, and determining the node where the abnormal fragment is located, wherein the data fragment obtained by segmentation comprises the abnormal fragment; controlling the node where the abnormal fragments are located to send the abnormal fragments to a preset cache queue; and monitoring and knowing that all normal fragments in the data fragments obtained by segmentation are processed, and controlling the internal processing node and the external node to reprocess all the abnormal fragments stored in the preset cache queue.
2. The method of claim 1, wherein the step of judging, according to the historical parameters, the data amount of the data to be processed, the number of internal processing nodes and the expected processing time length, whether the internal processing nodes can process the data to be processed within the expected processing time length comprises:
calculating the amount of data that each historical node in the distributed system processed per unit time when processing the historical data, as the unit-time processing data amount;
comparing the product of the unit-time processing data amount, the number of internal processing nodes and the expected processing time length with the data amount of the data to be processed, to judge whether the internal processing nodes can process the data to be processed within the expected processing time length; or
comparing the quotient of the data amount of the data to be processed divided by the product of the unit-time processing data amount and the number of internal processing nodes with the expected processing time length, to judge whether the internal processing nodes can process the data to be processed within the expected processing time length; or
comparing the quotient of the data amount of the data to be processed divided by the product of the unit-time processing data amount and the expected processing time length with the number of internal processing nodes, to judge whether the internal processing nodes can process the data to be processed within the expected processing time length (the sketch after this claim illustrates all three comparisons).
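As a plain-arithmetic illustration of claim 2, the following Python sketch spells out the three comparisons; for positive rates and times they yield the same verdict. The function names (per_node_rate, can_finish_by_*) are hypothetical, not taken from the patent.

```python
def per_node_rate(hist_amount: float, hist_nodes: int, hist_time: float) -> float:
    """Unit-time processing data amount: data each historical node processed per unit time."""
    return hist_amount / (hist_nodes * hist_time)

def can_finish_by_capacity(rate: float, nodes: int, expected_time: float, data_amount: float) -> bool:
    # (1) product of unit-time rate, internal node count and expected time vs. the data amount
    return rate * nodes * expected_time >= data_amount

def can_finish_by_time(rate: float, nodes: int, expected_time: float, data_amount: float) -> bool:
    # (2) time the internal nodes would need vs. the expected processing time length
    return data_amount / (rate * nodes) <= expected_time

def can_finish_by_nodes(rate: float, nodes: int, expected_time: float, data_amount: float) -> bool:
    # (3) node count needed to finish within the expected time vs. the internal node count
    return data_amount / (rate * expected_time) <= nodes
```

For example, with a rate of 10 units per second, 4 internal nodes and an expected time of 60 seconds, the internal nodes can cover at most 2400 units, so a 3000-unit workload would fail all three checks and trigger the external-node path.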
3. The method of claim 2, wherein the step of selecting one or more external nodes to add to the distributed system comprises:
determining the quotient of the data amount of the data to be processed divided by the product of the unit-time processing data amount and the expected processing time length as the total number of nodes required to process the data to be processed within the expected processing time length;
calculating the difference between the total number of nodes and the number of internal processing nodes; and
selecting a number of external nodes equal to the difference and adding them to the distributed system (a sketch of this arithmetic follows).
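A minimal sketch of claim 3's arithmetic. Rounding the quotient up to a whole node count and clamping the difference at zero are assumptions added for illustration (the claim only names the quotient and the difference); external_nodes_needed is a hypothetical name.

```python
import math

def external_nodes_needed(data_amount: float, rate: float,
                          expected_time: float, internal_nodes: int) -> int:
    # Total number of nodes required to process the data within the expected time length
    total_nodes = math.ceil(data_amount / (rate * expected_time))
    # Difference between the total and the internal processing nodes already in the system
    return max(total_nodes - internal_nodes, 0)
```

Continuing the earlier example (3000 units, rate 10, 60 seconds, 4 internal nodes): the total requirement is 5 nodes, so 1 external node is selected and added.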
4. A data processing apparatus of a distributed system, for implementing the method of any one of claims 1 to 3, the apparatus comprising:
the first determining module is used for receiving a data processing signal and determining the data amount of data to be processed;
the second determining module is used for determining, from the distributed system, internal processing nodes for processing the data to be processed, and counting them to obtain the number of internal processing nodes;
the first acquisition module is used for acquiring an expected processing time length, wherein the expected processing time length is the time expected to be spent processing the data to be processed;
the second acquisition module is used for acquiring historical parameters of historical data, wherein the historical parameters comprise: the data amount of the historical data, the number of historical nodes in the distributed system that processed the historical data, and the time spent processing the historical data;
the judging module is used for judging, according to the historical parameters, the data amount of the data to be processed, the number of internal processing nodes and the expected processing time length, whether the internal processing nodes can process the data to be processed within the expected processing time length;
and the first processing module is used for selecting one or more external nodes to add to the distributed system when the internal processing nodes cannot process the data to be processed within the expected processing time length, and for having the internal processing nodes and the external nodes jointly process the data to be processed.
5. The apparatus of claim 4, wherein the data to be processed is segmented into a plurality of data fragments, the apparatus further comprising:
a third determining module, configured to monitor that an abnormal fragment exists while the internal processing nodes and the external nodes jointly process the data to be processed, and to determine the node where the abnormal fragment is located, wherein the data fragments obtained by segmentation comprise the abnormal fragment;
the control module is used for controlling the node where the abnormal fragment is located to send the abnormal fragment to a preset cache queue;
and the second processing module is used for controlling, upon monitoring that all normal fragments among the data fragments obtained by segmentation have been processed, the internal processing nodes and the external nodes to reprocess all abnormal fragments stored in the preset cache queue (a single-process sketch of this flow follows).
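Below is a minimal, single-process Python sketch of the cache-queue flow described in claims 1 and 5: abnormal fragments are parked in a queue and reprocessed only after every normal fragment has been handled. In the patent the fragments are distributed across internal and external nodes; the names process_fragments, handle and is_abnormal are hypothetical.

```python
from collections import deque

def process_fragments(fragments, handle, is_abnormal):
    """Handle normal fragments first; retry abnormal ones from a preset cache queue afterwards."""
    cache_queue = deque()                 # the "preset cache queue"
    for fragment in fragments:
        if is_abnormal(fragment):
            cache_queue.append(fragment)  # the owning node sends the abnormal fragment here
        else:
            handle(fragment)
    # Once all normal fragments are processed, reprocess every cached abnormal fragment
    while cache_queue:
        handle(cache_queue.popleft())
```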
6. A computer device, comprising: a memory, a processor, and a computer program stored on the memory and executable on the processor, characterized in that the processor, when executing the computer program, implements the steps of the method according to any one of claims 1 to 3.
7. A computer readable storage medium, on which a computer program is stored, characterized in that the computer program, when executed by a processor, implements the steps of the method according to any one of claims 1 to 3.
CN202010037400.5A 2020-01-14 2020-01-14 Data processing method, device, equipment and storage medium of distributed system Active CN111258724B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202010037400.5A CN111258724B (en) 2020-01-14 2020-01-14 Data processing method, device, equipment and storage medium of distributed system

Publications (2)

Publication Number Publication Date
CN111258724A CN111258724A (en) 2020-06-09
CN111258724B (en) 2024-02-06

Family

ID=70954084

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202010037400.5A Active CN111258724B (en) 2020-01-14 2020-01-14 Data processing method, device, equipment and storage medium of distributed system

Country Status (1)

Country Link
CN (1) CN111258724B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114581223B (en) * 2022-05-05 2022-07-29 支付宝(杭州)信息技术有限公司 Distribution task processing method, equipment, distributed computing system and storage medium

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN106528189A (en) * 2015-09-10 2017-03-22 阿里巴巴集团控股有限公司 Backup task starting method and device and electronic equipment
WO2018121738A1 (en) * 2016-12-30 2018-07-05 北京奇虎科技有限公司 Method and apparatus for processing streaming data task
CN109710407A (en) * 2018-12-21 2019-05-03 浪潮电子信息产业股份有限公司 Distributed system real-time task scheduling method, device, equipment and storage medium
CN110445828A (en) * 2019-06-14 2019-11-12 平安科技(深圳)有限公司 A kind of data distribution formula processing method and its relevant device based on Redis

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
JP6233413B2 (en) * 2013-07-03 2017-11-22 日本電気株式会社 Task assignment determination device, control method, and program

Similar Documents

Publication Publication Date Title
CN110297711B (en) Batch data processing method, device, computer equipment and storage medium
CN110119306B (en) Method, device and equipment for balancing automatic scheduling of jobs and storage medium
US9292336B1 (en) Systems and methods providing optimization data
CN111277640B (en) User request processing method, device, system, computer equipment and storage medium
CN113568740B (en) Model aggregation method, system, equipment and medium based on federal learning
CN108536356A (en) Agent information processing method and device and computer readable storage medium
CN112286664B (en) Task scheduling method, device, computer equipment and readable storage medium
CN114490078A (en) Dynamic capacity reduction and expansion method, device and equipment for micro-service
CN112416568A (en) Duration estimation method and duration estimation device for audio and video transcoding task
CN111258724B (en) Data processing method, device, equipment and storage medium of distributed system
CN116661960A (en) Batch task processing method, device, equipment and storage medium
CN111580948A (en) Task scheduling method and device and computer equipment
CN113312239B (en) Data detection method, device, electronic equipment and medium
CN110825466A (en) Program jamming processing method and jamming processing device
CN117331668A (en) Job scheduling method, device, equipment and storage medium
CN111258854B (en) Model training method, alarm method based on prediction model and related device
CN112001116A (en) Cloud resource capacity prediction method and device
CN116860344A (en) Flow management method, system, equipment and medium
CN116011677A (en) Time sequence data prediction method and device, electronic equipment and storage medium
CN113127289B (en) Resource management method, computer equipment and storage medium based on YARN cluster
CN111858542B (en) Data processing method, device, equipment and computer readable storage medium
CN110597682B (en) Application deployment method and device, computer equipment and storage medium
CN114564149A (en) Data storage method, device, equipment and storage medium
CN113392131A (en) Data processing method and device and computer equipment
CN110955728A (en) Power consumption data transmission method, server and storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant