CN107508901B

CN107508901B - Distributed data processing method, device, server and system

Info

Publication number: CN107508901B
Application number: CN201710783415.4A
Authority: CN
Inventors: 黄世清
Original assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Current assignee: Beijing Jingdong Century Trading Co Ltd; Beijing Jingdong Shangke Information Technology Co Ltd
Priority date: 2017-09-04
Filing date: 2017-09-04
Publication date: 2020-12-22
Anticipated expiration: 2037-09-04
Also published as: CN107508901A

Abstract

The application provides a distributed data processing method, a distributed data processing device, a server and a distributed data processing system, and relates to the technical field of data processing. The distributed data processing method comprises the following steps: the task allocation server fragments the data to be processed according to the number of the data processing servers to obtain fragment information of the data to be processed; the task allocation server sends each piece information to the corresponding data processing server so that the data processing server can obtain and process the data pieces to be processed according to the piece information; and the task allocation server determines a data processing result according to the feedback result of each data processing server. By the method, the data can be fragmented based on the number of the servers capable of processing the data and distributed to each server for processing, so that extra operations except data processing in distributed data processing can be reduced, the time occupied by the extra operations is shortened, and the efficiency of the distributed data processing is improved.

Description

Distributed data processing method, device, server and system

Technical Field

The present application relates to the field of data processing technologies, and in particular, to a distributed data processing method, apparatus, server, and system.

Background

In a common software development process, a single computer can meet the requirement for computing a small amount of data, and a special big data service is used for processing large-scale data computing.

However, when the data volume is of a medium scale, a single computer cannot process the data rapidly, the resource shortage is caused by too many users of the large data cluster service, the task processing is slow, and even the waiting is needed, and the waiting time occupies too large proportion in the whole data processing process, so that the data processing efficiency is reduced.

Disclosure of Invention

The inventor finds that the big data cluster service causes low data processing efficiency due to the early preparation work, the cluster resource overall planning and the like, and the large data cluster service is more obvious when the data volume is smaller.

It is an object of the present application to improve the efficiency of distributed data processing.

According to an aspect of the present application, a distributed data processing method is provided, including: the task allocation server averagely fragments the data to be processed according to the number of the data processing servers to obtain fragment information of the data to be processed; the task allocation server sends each piece information to the corresponding data processing server so that the data processing server can obtain and process the data pieces to be processed according to the piece information; and the task allocation server determines a data processing result according to the feedback result of each data processing server.

Optionally, the task allocation server averagely segments the data to be processed according to the number of the data processing servers, and acquiring segment information of the data to be processed includes: the task allocation server evenly divides the data to be processed into pieces according to the number of the data processing servers, and information of data processing of a single server is obtained; and processing the data fragmentation of the single server according to the preset thread number of the single data processing server to obtain fragmentation information.

Optionally, the task allocation server evenly segments the data to be processed according to the number of the data processing servers, and the obtaining information of the data processed by the single server includes: distributing data for each data processing server through a Hash algorithm, and acquiring information of data processed by a primary distribution single server; and processing the data processed by the primary distribution single server through a data balance algorithm so as to enable the data to be distributed in a balanced manner, and acquiring the information of the data processed by the single server.

Optionally, the sending, by the task allocation server, each piece of fragment information to the corresponding data processing server includes: and the task allocation server stores the fragment information into the database according to a preset strategy so that the data processing server acquires the fragment information according to the update information conforming to the preset strategy when monitoring the database.

Optionally, multiple threads of the data processing server monitor the database simultaneously; when the plurality of threads acquire the fragment information, the thread which acquires the to-be-processed data fragment corresponding to the fragment information firstly processes the to-be-processed data fragment, and other threads continue to monitor the database; the method comprises the steps that a thread for obtaining data fragmentation to be processed continues to monitor a database after completing data fragmentation to be processed; and circularly executing the processes until all the data to be processed distributed to the data processing server is processed.

Optionally, the method further comprises: the task allocation server sends algorithms or algorithm identifications used for processing the data to be processed to each data processing server in advance, so that each data processing server processes the data to be processed by adopting the corresponding algorithm.

Optionally, the fragmentation information includes one or more of source information, data table information, fragmentation field information, and filtering condition information of the data to be processed, and address information of the destination data processing server.

Optionally, the data to be processed includes one or more of data stored in a database, data from an external device, and data acquired through a network.

By the method, the data can be fragmented based on the number of the servers capable of processing the data and distributed to each server for processing, so that the proportion of operations except the data processing in the distributed data processing can be reduced, and the efficiency of the distributed data processing can be improved.

According to another aspect of the application, a distributed computing device is presented, comprising: the data fragmentation unit is used for averagely fragmenting the data to be processed according to the number of the data processing servers to acquire fragmentation information of the data to be processed; the fragment information distribution unit is used for sending each fragment information to the corresponding data processing server so that the data processing server can acquire and process the to-be-processed data fragments according to the fragment information; and the result acquisition unit is used for determining the data processing result according to the feedback information of each data processing server.

Optionally, the data slicing unit includes: the first fragmentation subunit is used for averagely fragmenting the data to be processed according to the number of the data processing servers to acquire the information of the data processed by the single server; and the second fragmentation subunit is used for fragmenting the data processed by the single server according to the preset thread number of the single data processing server to acquire fragmentation information.

Optionally, the first fragmentation sub-unit is configured to: distributing data for each data processing server through a Hash algorithm, and acquiring information of data processed by a primary distribution single server; and processing the data processed by the primarily distributed single server through a data balancing algorithm to acquire the information of the data processed by the single server.

Optionally, the fragmentation information distribution unit is configured to: and storing the fragment information into a database according to a preset strategy so that the data processing server acquires the fragment information according to the update information conforming to the preset strategy when monitoring the database.

Optionally, the method further comprises: and the data storage unit is used for storing the data to be processed into the database.

Optionally, the method further comprises: and the algorithm specifying unit is used for sending the algorithm or the algorithm identification used for processing the data to be processed to each data processing server so that each data processing server adopts the corresponding algorithm to process the data to be processed.

Optionally, the method further comprises: the data acquisition unit is used for acquiring the fragment information and acquiring the to-be-processed data fragments according to the fragment information; and the data processing unit is used for processing the data fragments to be processed and feeding back a processing result.

Optionally, the data acquisition unit monitors the database by using a plurality of threads which are not processing the to-be-processed data fragments; the data processing unit is used for processing the data fragments to be processed by adopting the thread which firstly acquires the data fragments to be processed corresponding to the fragment information when the fragment information is acquired by a plurality of threads; and after the thread for acquiring the to-be-processed data fragments finishes the to-be-processed data fragments, continuously monitoring the database.

The device can divide the data into pieces based on the number of the servers capable of processing the data and distribute the data to each server for processing, thereby reducing the proportion of operations except the data processing in the distributed data processing and improving the efficiency of the distributed data processing.

According to yet another aspect of the present application, a distributed data processing apparatus is provided, comprising: a memory; and a processor coupled to the memory, the processor configured to perform any of the above-mentioned distributed data processing methods based on instructions stored in the memory.

According to yet another aspect of the application, a computer-readable storage medium is proposed, on which computer program instructions are stored, which instructions, when executed by a processor, implement the steps of any of the above-mentioned distributed data processing methods.

According to another aspect of the present application, a server is proposed, which comprises means for performing any one of the above-mentioned distributed data processing methods.

Such a server can divide data into pieces based on the number of servers capable of processing data and distribute the data to each processing server for processing, thereby reducing the weight of operations other than data processing in distributed data processing and improving the efficiency of distributed data processing.

Further, according to an aspect of the present application, a distributed data processing system is proposed, comprising a plurality of servers as above.

In such a distributed data processing system, the servers can divide the data into pieces based on the number of servers capable of processing the data and distribute the data to each server for processing, so that the proportion of operations other than data processing in the distributed data processing can be reduced, and the efficiency of the distributed data processing can be improved.

Drawings

The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the application and together with the description serve to explain the application and not to limit the application. In the drawings:

fig. 1 is a flowchart of an embodiment of a distributed data processing method of the present application.

Fig. 2 is a flowchart of another embodiment of a distributed data processing method of the present application.

Fig. 3 is a schematic diagram of an embodiment of a distributed data processing apparatus of the present application.

Fig. 4 is a schematic diagram of another embodiment of a distributed data processing apparatus of the present application.

Fig. 5 is a schematic diagram of another embodiment of a distributed data processing apparatus of the present application.

Fig. 6 is a schematic diagram of still another embodiment of a distributed data processing apparatus according to the present application.

FIG. 7 is a schematic diagram of one embodiment of a distributed data processing system of the present application.

Detailed Description

The technical solution of the present application is further described in detail by the accompanying drawings and examples.

There are several platforms in the prior art for large-scale data processing, such as MapReduce. MapReduce is a computation model, framework and platform oriented to parallel processing of big data, and allows a distribution and parallel computation cluster containing dozens, hundreds or thousands of nodes to be formed by common commercial servers in the market.

The MapReduce computing framework is very powerful and is generally accepted in practical application of processing large-scale data. However, MapReduce has the following problems in the practical application process:

in terms of maintainability, to utilize the computing service provided by MapReduce, a Hadoop environment has to be installed, because MapReduce is not a separate framework and must rely on the HDFS file system to be executed. This increases the cost and effort of maintenance.

From the viewpoint of easy use, writing the MapReduce task requires learning its API, although a certain amount of work is done, it can be learned basically within a week, but then the maintenance work of the Hadoop cluster service is very staggering and requires a strong professional knowledge.

In terms of time cost, MapReduce needs to go through the processes of task loading, task monitoring, data slicing, slice computing, data shuffling, data merging and the like in the process of computing, one-time computing is basically in hours, and the method can be tolerated by large-scale data, but the time for completing the framework function is too long for medium-small-scale data.

Therefore, in the development process using the MapReduce framework, when small and medium-scale data sets are processed, system resources and labor and time costs are greatly wasted.

A flow diagram of one embodiment of a distributed data processing method of the present application is shown in fig. 1.

In step 101, the task allocation server averagely fragments the data to be processed according to the number of the data processing servers, and obtains fragment information of the data to be processed. For example, the task allocation server discovers that 10 data processing servers can be used for processing the to-be-processed data, and therefore, fragments the to-be-processed data, allocates at least one to-be-processed data fragment to each data processing server, and generates fragment information of each to-be-processed data fragment.

In one embodiment, the fragmentation information includes one or more of source information, data table information, fragmentation field information, and filtering condition information of the data to be processed, and address information of a destination data processing server receiving the fragmentation information.

In step 102, the task allocation server sends the fragment information to the corresponding data processing server. In one embodiment, the task allocation server may store the fragmentation information into the database according to a predetermined policy, for example, into a data table monitored by the destination data processing server, and when the data processing server determines that the data table monitored by the data processing server has data update, obtain the fragmentation information from the database. The database may be MySQL or NoSql, etc.

And the data processing server acquires the corresponding to-be-processed data fragment according to the acquired fragment information and processes the acquired data. In one embodiment, the data to be processed may include one or more of data stored in a database, data from an external device, and data acquired over a network. The data processing server can determine the source of the data according to the fragment information and perform data extraction.

In step 103, the task allocation server determines a data processing result according to the feedback result of each data processing server. In one embodiment, the data processing servers may store the feedback result in a predetermined position or a predetermined table or field of the database, and the task allocation server obtains the feedback result of each data processing server by reading the database, thereby obtaining the data processing result.

By the lightweight distributed computing method, the data can be fragmented based on the number of servers capable of processing the data and distributed to each server for processing, so that the proportion of operations except for data processing in distributed data processing can be reduced, the efficiency of distributed data processing is improved, and the improvement of efficiency is particularly remarkable for data processing with medium data volume. In addition, the method does not need to install a specific environment, and does not need to spend a large amount of cost and energy to maintain the specific environment, so that the maintenance cost is reduced, and the user experience is improved.

In one embodiment, the task distribution server may further write the fragment record information to the database, where the fragment record information may include one or more of the following information:

the number of messages: the total number of the fragments generated by the calculation is the number of the messages.

Data set: each shard requires a set of computed primary data fields.

A source host: and identifying the task allocation server.

The calculation host: and the fragment corresponds to the identification of the destination data processing server.

Start time: and the time of receiving the fragment information by the data processing server.

End time: and the data processing server completes the processing time of the corresponding to-be-processed data fragment.

The state is as follows: the current slice is computed in a processing state, which may include: not calculating, calculating completion, calculating failure and the like;

version number: for use in concurrent data transmission, default is 0.

By the method, the data processing process can be effectively monitored, and the controllability and reliability of data processing are improved.

In one embodiment, each data processing server can adopt a multi-thread parallel processing mode to improve the efficiency of data processing. The number of threads can be determined by manual adjustment and configuration. When the task allocation server performs data allocation, the data can be divided into n parts according to the number n of the data processing servers to obtain information of data processed by a single server; and then the data distributed to the single server is sliced according to the preset thread number of each data processing server, and the information of the data to be processed which is processed by the single thread, namely the slicing information, is obtained.

By the method, each data processing server can process data in a mode of parallel processing of a plurality of threads, so that the data processing efficiency is improved, and the utilization rate of server resources is also improved.

In one embodiment, data can be distributed to each data processing server through a hash algorithm to obtain information of data processed by the primary distribution single server, and then data balance calculation is performed on the data processed by the primary distribution single server to balance the data quantity distributed to each server as much as possible. In one embodiment, an averaging and balancing algorithm may be used, i.e., an average of the total amount is calculated, and fragmentation of data to be processed that exceeds the average will distribute data to servers that are less than the average. And a maximum value algorithm can be adopted, the maximum value of the data which can be processed by each server is set, the balanced distribution is carried out only when the data to be processed is sliced to exceed the maximum value, and the average value is calculated if the maximum value is smaller than the average value.

The method enables the resource utilization conditions of the servers to be balanced, and shortens the time for the task allocation server to determine the data processing result.

A flow chart of another embodiment of the distributed data processing method of the present application is shown in fig. 2.

In step 201, the task allocation server evenly segments the data to be processed according to the number of the data processing servers, and obtains the information of the data processed by the single server. In one embodiment, the information of the data processed by the single server may be obtained by performing the equalization processing after the initial allocation mentioned above, so as to ensure the equalization of the data allocated to each server.

In step 202, the single server processes the data fragments according to the number of the predetermined threads of the single data processing server, and fragment information is obtained. In one embodiment, the time period allowed for each slice of data to be processed to be executed may be set, such as half an hour for each task, and the time required for calculation of each piece of data is estimated (i.e., the execution time of each piece of data, which may be obtained through practical experience), according to the formula:

the number of data per slice is equal to the total time allowed to be calculated/execution time of data per slice

Total number of pieces of data/number of pieces of data

And determining the total number of the fragments, and further acquiring the fragment information of each to-be-processed data fragment.

In step 203, multiple threads of the data processing server listen to the database simultaneously. In one embodiment, the number of threads per data server may be set, for example, by default 5 threads are turned on, the interval time is 3 seconds, and the like.

In step 204, to avoid repeated computation due to multiple threads taking a piece of data at the same time, an optimistic lock may be used to avoid repeated taking of data. The optimistic lock uses the version number as the identification field, when a plurality of threads take the same fragment information, the thread which updates the database firstly processes the to-be-processed data fragment, and other threads do not acquire and process the to-be-processed data fragment any more. After the data processing is finished, the data processing server stores the result into the database for the task allocation server to read. After the thread processing the data fragment to be processed completes the data processing, the database can be monitored continuously to obtain the next fragment information.

In step 205, the task allocation server determines a data processing result according to the feedback result of each data processing server.

By the method, a plurality of threads of the data processing server can be prevented from taking and processing the same task, the stability of data processing is improved, and the efficiency of data processing is ensured.

In one embodiment, the data processing server may acquire the fragment information in real time and perform data processing while the task allocation server writes the fragment information into the database. By the method, the data fragmentation and the data processing can be executed in parallel, so that the data fragmentation process is prevented from occupying a large amount of time, and the data processing efficiency is further improved.

In one embodiment, in the server cluster, any one server can be used as a task allocation server, one or more other servers and the task allocation server are used as data processing servers, and a user can start a task from any one server in the server cluster, so that the utilization rate of the server is improved, and the user experience is improved.

In one embodiment, the task server may store the data to be processed in the database in advance before performing data fragmentation, so that the data processing server may perform data extraction according to the fragmentation information. Especially for file type data, the data to be calculated and its related data need to be loaded into the database in advance. The database may comprise a relational database or a non-relational database, etc.

In one embodiment, each server may be configured with at least one algorithm, and the task server may specify the algorithm currently used for processing the data to be processed for the data processing server, thereby improving the flexibility of data processing.

A schematic diagram of one embodiment of a distributed data processing apparatus of the present application is shown in fig. 3. The data fragmenting unit 301 can fragment the data to be processed according to the number of the data processing servers, and obtain the fragment information of the data to be processed. For example, the task allocation server discovers that 10 data processing servers are available for processing the to-be-processed data, and therefore the data slicing unit 301 slices the to-be-processed data, allocates at least one to-be-processed data slice to each data processing server, and generates slicing information for each to-be-processed data slice.

The fragment information distribution unit 302 can transmit the fragment information to the corresponding data processing server. In one embodiment, the fragmentation information distribution unit 302 may store the fragmentation information into a database according to a predetermined policy, for example, into a data table monitored by the destination data processing server, and obtain the fragmentation information from the database when the data processing server determines that the data table monitored by the data processing server has a data update. The database may be MySQL or NoSql, etc.

The result acquisition unit 303 can determine a data processing result from the feedback result of each data processing server. In one embodiment, the data processing servers may store the feedback result in a predetermined position or a predetermined table or field of the database, and the result obtaining unit 303 obtains the feedback result of each data processing server by reading the database, so as to obtain the data processing result.

The device can divide the data into pieces based on the number of the servers capable of processing the data and distribute the data to each server for processing, thereby reducing the proportion of operations except the data processing in the distributed data processing, improving the efficiency of the distributed data processing, and particularly showing the improvement of the efficiency for the data processing with medium data amount.

In one embodiment, data-slicing unit 301 may include a first slicing subunit and a second slicing subunit. The first fragmentation subunit may first divide the data into n parts according to the number n of the data processing servers (n is a positive integer not less than 1), so as to obtain information of processing the data by the single server; the second fragmentation subunit fragments the data distributed to the single server according to the preset thread number of each data processing server, and acquires the information of the data to be processed fragmented processed by the single thread, namely the fragmentation information.

The device can process data in a mode of parallel processing of a plurality of threads in each data processing server, thereby improving the efficiency of data processing and improving the utilization rate of server resources.

In one embodiment, the first fragmentation sub-unit may first allocate data to each data processing server through a hash algorithm to obtain information of data processed by the initially allocated single server, and further perform data equalization calculation on the data processed by the initially allocated single server, so that the data amount allocated to each server is equalized as much as possible. The method ensures that the application and resource utilization conditions of each server are balanced as much as possible, and shortens the time for the task allocation server to determine the data processing result.

In an embodiment, the distributed data processing apparatus may further include a data storage unit, which is capable of storing the data to be processed into the database in advance before the data fragmentation unit performs data fragmentation, so that the data processing server may perform data extraction according to the fragmentation information.

In one embodiment, the distributed data processing apparatus may further include an algorithm specifying unit capable of specifying an algorithm currently used for processing the data to be processed for the data processing server, thereby improving the flexibility of data processing.

In one embodiment a distributed data processing apparatus may comprise a data acquisition unit and a data processing unit.

The data acquisition unit can acquire the fragment information and acquire the to-be-processed data fragments according to the fragment information. In one embodiment, the data acquisition unit may employ multiple threads to simultaneously listen to the database.

The data processing unit can process the data fragments to be processed and feed back the processing result. In one embodiment, when the plurality of threads all acquire the fragment information, the data processing unit processes the to-be-processed data fragment by using the thread that acquires the to-be-processed data fragment corresponding to the fragment information first, and the other threads stop the operation of acquiring the to-be-processed data fragment.

The device can prevent a plurality of threads of the data processing server from taking and processing the same task, improves the stability of data processing, and ensures the efficiency of data processing.

A schematic diagram of another embodiment of the distributed data processing apparatus of the present application is shown in fig. 4. The structure and function of the data fragmentation unit 401, the fragmentation information distribution unit 402 and the result acquisition unit 403 are similar to those in the embodiment shown in fig. 3, and are used for executing the steps executed in the task allocation server in the above distributed data processing method. The distributed data processing apparatus further comprises a data acquisition unit 404 and a data processing unit 405 for performing the steps performed at the data processing server in the above distributed data processing method.

The distributed data processing device can enable any one server to serve as a task distribution server in a server cluster, one or more other servers and the task distribution server serve as a data processing server, and a user can start a task from any one server in the server cluster, so that the utilization rate of the server is improved, and the user experience is improved.

In one embodiment, the functions of the respective units may be implemented by the configured interfaces, such as: the data acquisition unit 404 includes 3 interfaces, defining interface class names: com.jd.ipc.simulate.frame.jddatacollectservice, 3 interfaces are:

interface 1 method: public Map geSplitData (DataContext context) threads Exception

Interface 1 method description: the interface method is used for acquiring fragmentation information. The method returns the data set with the value of the fragment information, and the exception can be thrown out when an error occurs.

Interface 2 method: public Map getCalcData (DataContext context) threads Exception

Interface 2 method description: the interface method is used for acquiring the data fragments to be processed and can transmit necessary parameters, such as main table and auxiliary table names, configuration files or filtering conditions of data and the like. The return value of the method is a data set of the calculation data, and the abnormity can be thrown out when an error occurs.

Interface 3 method: public Map getOutsideData (DataContext context) threads Exception

Interface 3 method description: the interface method is used for acquiring external data and can transmit necessary parameters, such as configuration files or filter conditions of the external data. The method returns a data set with external data, and exceptions can be thrown out when errors occur.

The data processing unit may include an interface defining an interface class name com.jd.ipc.simulate.module.jddatacalcservice, the interface being: the interface method is used for calculating data, and specific calculation logic is realized by a user and can provide data to be processed and related configuration data. The return value of the method is a Boolean value and is used for identifying the success or failure of processing, and an exception can be thrown when an error occurs.

A schematic structural diagram of an embodiment of a distributed data processing apparatus according to the present application is shown in fig. 5. The distributed data processing apparatus includes a memory 510 and a processor 520. Wherein: the memory 510 may be a magnetic disk, flash memory, or any other non-volatile storage medium. The memory is for storing instructions in the corresponding embodiments of the distributed data processing method above. Processor 520 is coupled to memory 510 and may be implemented as one or more integrated circuits, such as a microprocessor or microcontroller. The processor 520 is configured to execute instructions stored in the memory, and can implement distributed data processing and improve efficiency of distributed data processing.

In one embodiment, as also shown in FIG. 6, a distributed data processing apparatus 600 includes a memory 610 and a processor 620. Processor 620 is coupled to memory 610 through a BUS 630. The distributed data processing apparatus 600 may also be connected to external storage 650 via storage interface 640 for the purpose of retrieving external data, and may also be connected to a network or another computer system (not shown) via network interface 660. The specific procedures for the transfer and processing of data will not be described in detail herein.

In this embodiment, the memory stores the data instruction, and the processor processes the instruction, so that distributed data processing can be realized, and the efficiency of distributed data processing can be improved.

In another embodiment, the present application proposes a computer-readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the method in the corresponding embodiment of the distributed data processing method. As will be appreciated by one skilled in the art, embodiments of the present application may be provided as a method, apparatus, or computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable non-transitory storage media (including, but not limited to, disk storage, CD-ROM, optical storage, and the like) having computer-usable program code embodied therein.

In one embodiment, the present application further proposes a server configured with a device capable of executing any one of the above-mentioned distributed data processing methods, so that data can be fragmented and distributed to each server for processing based on the number of servers capable of processing data, and thus the proportion of operations other than data processing in distributed data processing can be reduced, the efficiency of distributed data processing can be improved, and the improvement in efficiency is particularly prominent for data processing with medium data volume.

In one embodiment, after the application for implementing the above distributed data processing method is downloaded and saved to the lib directory of the J2EE server, the server is started, and an interface is called, where the main table to be calculated, the fragment field, the filter condition, the IP address of the server participating in the calculation, and the like are specified in the interface parameters. And acquiring the data fragments to be processed according to the main table, the fragment fields and the filtering conditions specified in the interface, and distributing the acquired fragment information to the IP of each server so that the servers can acquire the corresponding data fragments to be processed. Each server monitors task distribution data, and when the data of the server exists, the data is obtained. In order to improve the computing efficiency, each server uses a multi-thread mode to process data, and each thread processes data of one fragment independently. And multithreading calls a task calculation model to complete the calculation of the data.

The server can execute the above-mentioned distributed data processing method after downloading and storing the application, and does not need to configure a specific dependency environment or maintain the specific dependency environment, thereby reducing the workload of the user and improving the user experience.

A schematic diagram of one embodiment of a distributed data processing system of the present application is shown in fig. 7. The distributed processing system comprises a plurality of servers, such as servers 701-705. Each server may be a server as mentioned above, configured with means capable of performing any one of the distributed data processing methods mentioned above. The servers are respectively connected with the databases 710, and data interaction can be performed through the databases.

The distributed data processing system can divide data into pieces based on the number of servers capable of processing the data and distribute the data to each server for processing, thereby reducing the proportion of operations except for data processing in the distributed data processing and improving the efficiency of the distributed data processing.

In one embodiment, some of the servers in the distributed data processing system may be capable of performing the steps performed at the task allocation server in the above distributed data processing method, and some of the servers may be capable of performing the steps performed at the data processing server in the above distributed data processing method.

In an embodiment, each server in the distributed data processing system can execute both the steps executed by the task allocation server in the distributed data processing method and the steps executed by the data processing servers in the distributed data processing method, so that the task can be started from any one server in the server cluster, the utilization rate of the server is improved, and the user experience is improved.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flow diagrams and/or block diagrams, and combinations of flows and/or blocks in the flow diagrams and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.

These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The present application has been described in detail so far. Some details well known in the art have not been described in order to avoid obscuring the concepts of the present application. It will be fully apparent to those skilled in the art from the foregoing description how to practice the presently disclosed embodiments.

The methods and apparatus of the present application may be implemented in a number of ways. For example, the methods and apparatus of the present application may be implemented by software, hardware, firmware, or any combination of software, hardware, and firmware. The above-described order for the steps of the method is for illustration only, and the steps of the method of the present application are not limited to the order specifically described above unless specifically stated otherwise. Further, in some embodiments, the present application may also be embodied as a program recorded in a recording medium, the program including machine-readable instructions for implementing a method according to the present application. Thus, the present application also covers a recording medium storing a program for executing the method according to the present application.

Finally, it should be noted that: the above examples are only intended to illustrate the technical solutions of the present application and not to limit them; although the present application has been described in detail with reference to preferred embodiments, those of ordinary skill in the art will understand that: modifications to the specific embodiments of the application or equivalent replacements of some of the technical features may still be made; all of which are intended to be encompassed within the scope of the claims appended hereto without departing from the spirit and scope of the present disclosure.

Claims

1. A distributed data processing method, comprising:

the task allocation server averagely fragments the data to be processed according to the number of the data processing servers to acquire fragment information of the data to be processed, and the method comprises the following steps: the task allocation server evenly segments the data to be processed according to the number of the data processing servers to obtain the information of the data processed by the single server; dividing the data processed by the single server into pieces according to the preset thread number of the single data processing server to obtain the piece information;

the task allocation server sends each piece of fragment information to a corresponding data processing server, and the method comprises the following steps: the task allocation server stores the fragment information into a database according to a preset strategy so that a plurality of threads of the data processing server monitor the database at the same time, acquires the fragment information according to the update information conforming to the preset strategy, acquires and processes the data fragments to be processed according to the fragment information, and continuously monitors the database after the threads acquiring the data fragments to be processed complete the data fragments to be processed;

and the task allocation server determines a data processing result according to the feedback result of each data processing server.

2. The method of claim 1, wherein the task allocation server evenly segments the data to be processed according to the number of data processing servers, and acquiring information of data processed by a single server comprises:

distributing data for each data processing server through a Hash algorithm, and acquiring information of data processed by a primary distribution single server;

and processing the data processed by the primary distribution single server through a data balance algorithm to enable the data to be distributed in a balanced manner, and acquiring the information of the data processed by the single server.

3. The method of claim 1, wherein,

a plurality of threads of the data processing server monitor the database simultaneously;

when the plurality of threads acquire the fragment information, the thread which acquires the to-be-processed data fragment corresponding to the fragment information firstly processes the to-be-processed data fragment, and other threads continue to monitor the database;

after the thread for acquiring the data fragment to be processed finishes the data fragment to be processed, the thread continues to monitor the database;

and circularly executing the processes until all the data to be processed distributed to the data processing server is processed.

4. The method of claim 1, further comprising:

and the task allocation server sends an algorithm or an algorithm identifier for processing the data to be processed to each data processing server in advance, so that each data processing server processes the data to be processed by adopting a corresponding algorithm.

5. The method of claim 1, wherein,

the fragment information comprises one or more of source information, data table information, fragment field information and filtering condition information of the data to be processed, and address information of a target data processing server;

the data to be processed includes one or more of data stored in a database, data from an external device, and data acquired through a network.

6. A distributed computing device, comprising:

the data fragmentation unit is used for averagely fragmenting the data to be processed according to the number of the data processing servers to acquire fragmentation information of the data to be processed, and comprises the following steps:

the first fragmentation subunit is used for averagely fragmenting the data to be processed according to the number of the data processing servers to acquire the information of the data processed by the single server;

the second fragmentation subunit is used for fragmenting the data processed by the single server according to the number of the preset threads of the single data processing server to acquire the fragmentation information;

the fragment information distribution unit is used for sending each piece of fragment information to a corresponding data processing server, and comprises: storing the fragment information into a database according to a preset strategy so that when a plurality of threads of the data processing server monitor the database simultaneously, acquiring the fragment information according to the update information conforming to the preset strategy, acquiring and processing the data fragments to be processed according to the fragment information, and continuously monitoring the database after the threads acquiring the data fragments to be processed complete the data fragments to be processed;

and the result acquisition unit is used for determining a data processing result according to the feedback information of each data processing server.

7. The apparatus of claim 6, wherein the first tile subunit is to:

and processing the data processed by the initially distributed single server through a data balancing algorithm to acquire the information of the data processed by the single server.

8. The apparatus of claim 6, further comprising:

and the algorithm specifying unit is used for sending an algorithm or an algorithm identifier for processing the data to be processed to each data processing server so that each data processing server processes the data to be processed by adopting a corresponding algorithm.

9. The apparatus of claim 6, further comprising:

the data acquisition unit is used for acquiring the fragment information and acquiring the to-be-processed data fragments according to the fragment information;

and the data processing unit is used for processing the to-be-processed data fragments and feeding back a processing result.

10. The apparatus of claim 9, wherein,

the data acquisition unit monitors the database by adopting a plurality of threads which are not processing the data fragments to be processed;

the data processing unit is used for processing the to-be-processed data fragments by adopting a thread which firstly acquires the to-be-processed data fragments corresponding to the fragment information when the plurality of threads acquire the fragment information;

and after the thread for acquiring the data fragment to be processed finishes the data fragment to be processed, the thread continues to monitor the database.

11. The apparatus of claim 6, wherein,

12. A distributed data processing apparatus comprising:

a memory; and

a processor coupled to the memory, the processor configured to perform the method of any of claims 1-5 based on instructions stored in the memory.

13. A computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the steps of the method of any one of claims 1 to 5.

14. A server comprising means for performing the distributed data processing method of any of claims 1 to 5.

15. A distributed data processing system comprising a plurality of servers as claimed in claim 14.