CN102385536A - Method and system for realization of parallel computing - Google Patents

Publication number
CN102385536A
CN102385536A CN2010102693321A CN201010269332A CN102385536A CN 102385536 A CN102385536 A CN 102385536A CN 2010102693321 A CN2010102693321 A CN 2010102693321A CN 201010269332 A CN201010269332 A CN 201010269332A CN 102385536 A CN102385536 A CN 102385536A
Authority
CN
China
Prior art keywords
node
information
worker node
log information
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN2010102693321A
Other languages
Chinese (zh)
Other versions
CN102385536B (en)
Inventor
周扬
胡媛
张艺夕
李桂萍
黄翔
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NANTONG JINGHAISHEN AQUATIC PRODUCT CO., LTD.
Original Assignee
ZTE Corp
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by ZTE Corp
Priority to CN201010269332.1A (CN102385536B)
Priority to PCT/CN2011/072818 (WO2012024937A1)
Publication of CN102385536A
Application granted
Publication of CN102385536B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1479 Generic software techniques for error detection or fault masking
    • G06F11/1482 Generic software techniques for error detection or fault masking by means of middleware or OS functionality
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/14 Error detection or correction of the data by redundancy in operation
    • G06F11/1402 Saving, restoring, recovering or retrying
    • G06F11/1415 Saving, restoring, recovering or retrying at system level
    • G06F11/1438 Restarting or rejuvenating
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/16 Error detection or correction of the data by redundancy in hardware
    • G06F11/20 Error detection or correction of the data by redundancy in hardware using active fault-masking, e.g. by switching out faulty elements or by switching in spare elements

Abstract

The invention discloses a method for realizing parallel computing. The method comprises: after an overall task is started, recording log information of the Worker nodes and the Master node that execute the task; when a Worker node executing the task fails, a new Worker node obtains the recorded log information of the faulty Worker node and, according to the log information, continues to process the service flow of the faulty Worker node from the breakpoint at the moment the fault occurred; and/or, when the Master node executing the task fails, a new Master node, after starting, obtains the recorded log information of the faulty Master node and, according to the log information, continues to process the service flow of the faulty Master node from the breakpoint at the moment the fault occurred. The invention also discloses a system for realizing parallel computing. With the method and the system, when a node fails, the task can continue to be executed from the breakpoint at the moment the fault occurred.

Description

Method and system for realizing parallel computing
Technical field
The present invention relates to the field of cloud computing, and in particular to a method and system for realizing parallel computing.
Background
MapReduce was first proposed by engineers at Google. It is a system architecture for processing massive amounts of data in parallel. A MapReduce system works as follows: it automatically decomposes a task into multiple subtasks, executes these subtasks in parallel, and aggregates the results after all subtasks have finished.
Fig. 1 is a schematic diagram of the architecture of an existing MapReduce system. As Fig. 1 shows, MapReduce divides data processing into two phases: the map (Map) phase and the reduce (Reduce) phase. A MapReduce system mainly comprises a client (Client), a master (Master) node, and worker (Worker) nodes. The Client submits the MapReduce task. The Master node automatically decomposes the MapReduce task into Map tasks and Reduce tasks and then schedules these tasks onto Worker nodes for execution. A Worker node, after receiving a Map or Reduce task request sent by the Master, executes the task in the request. A MapReduce system provides parallel processing, data distribution, fault tolerance, automatic load balancing, and similar functions.
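The decompose, execute in parallel, and aggregate principle described above can be illustrated with a minimal single-process sketch. It is not the MapReduce system of Fig. 1: the word-count job, the function names `map_task`, `reduce_task` and `run_job`, and the use of a thread pool in place of Worker nodes are all assumptions chosen for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def map_task(chunk):
    """Map phase: count the words in one chunk of the input."""
    return Counter(chunk.split())


def reduce_task(partials):
    """Reduce phase: merge the per-chunk counts into one result."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


def run_job(chunks):
    # Each chunk is an independent subtask; the pool stands in for Worker nodes.
    with ThreadPoolExecutor() as pool:
        partials = list(pool.map(map_task, chunks))
    # Results are aggregated only after every subtask has finished.
    return reduce_task(partials)


result = run_job(["a b a", "b c"])
```

Here `run_job` plays the role of the Master node in a very reduced form: it hands out the subtasks and gathers their results.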
In an existing MapReduce system, when a Worker node fails while executing a task, the Master node reassigns the task that the faulty Worker node was responsible for to another Worker node, and that Worker node, after receiving the task, re-executes it from the beginning. When the Master node fails during execution of the overall task, the entire task must be re-executed from the beginning. This reduces data processing efficiency and in turn degrades the user experience.
Summary of the invention
In view of this, the main purpose of the present invention is to provide a method and system for realizing parallel computing, so that when a node fails, task execution can continue from the breakpoint at the moment the fault occurred.
To achieve the above purpose, the technical scheme of the present invention is realized as follows:
The invention provides a method for realizing parallel computing, the method comprising:
After an overall task is started, recording log information of the Worker nodes and the Master node that execute the task;
When a Worker node executing the task fails, a new Worker node obtains the recorded log information of the faulty Worker node and, according to the log information, continues to process the service flow of the faulty Worker node from the breakpoint at the moment the fault occurred; and/or, when the Master node executing the task fails, a new Master node, after starting, obtains the recorded log information of the faulty Master node and, according to the log information, continues to process the service flow of the faulty Master node from the breakpoint at the moment the fault occurred.
In the above scheme, the new Worker node obtains the log information of the faulty Worker node as follows:
The Master node sends task execution information to the new Worker node;
After receiving the information, the new Worker node sends a query request to the global information monitoring function entity;
After receiving the query request, the global information monitoring function entity looks up, according to the query request, the log information of the faulty Worker node that it has saved, and returns the log information of the faulty Worker node to the new Worker node.
In the above scheme, the new Master node obtains the log information of the faulty Master node as follows:
The new Master node sends a query request to the global information monitoring function entity;
After receiving the query request, the global information monitoring function entity looks up, according to the query request, the log information of the faulty Master node that it has saved, and returns the log information of the faulty Master node to the new Master node.
In the above scheme, before the log information of the Master node and the Worker nodes is recorded, the method further comprises:
After the User Program starts the overall task by calling the client library, it selects a node as the Master node and then sends the input data source to be processed to the Master node;
After receiving the input data source to be processed, the Master node splits the input data source;
The Master node selects the Worker nodes that will execute the task and distributes to each of them the task it needs to execute;
Each Worker node executing the task reads its split data block and executes the distributed task.
In the above scheme, the log information of the Worker nodes and the Master node that execute the task is recorded as follows:
After the overall task is started, the Worker nodes and the Master node that execute the task upload their own log information to the global information monitoring function entity in real time;
The global information monitoring function entity saves the log information of the Worker nodes and the Master node that execute the task.
In the above scheme, before the global information monitoring function entity saves the log information of the Worker nodes and the Master node that execute the task, the method further comprises:
After receiving log information uploaded by a Worker node, the global information monitoring function entity judges whether the node identification information carried in the log information of the Worker node is consistent with the saved identification information of the Worker nodes; if consistent, it saves the log information of the Worker node; if inconsistent, it discards the log information of the Worker node.
The present invention also provides a method for obtaining log information, the method comprising:
After an overall task is started, saving in real time the log information of the Master node and the Worker nodes that execute the task;
When a Worker node executing the task has failed and a query request sent by a new Worker node is received, looking up the saved log information of the faulty Worker node according to the query request, and returning the log information of the faulty Worker node to the new Worker node; and/or, when the Master node executing the task has failed and a query request sent by a new Master node is received, looking up the saved log information of the faulty Master node according to the query request, and returning the log information of the faulty Master node to the new Master node.
In the above scheme, before the log information of the Master node and the Worker nodes that execute the task is saved in real time, the method further comprises:
Judging whether the node identification information carried in the log information of a Worker node is consistent with the saved identification information of the Worker nodes; if consistent, saving the log information of the Worker node; if inconsistent, discarding the log information of the Worker node.
The present invention also provides a global information monitoring entity for obtaining log information, the global information monitoring entity comprising a storage module and a query module; wherein,
The storage module is used to save in real time, after an overall task is started, the log information uploaded by the Master node and the Worker nodes that execute the task;
The query module is used, when a Worker node executing the task has failed and a query request sent by a new Worker node is received, to look up according to the query request the log information of the faulty Worker node saved by the storage module, and to return the log information of the faulty Worker node to the new Worker node; and/or, when the Master node executing the task has failed and a query request sent by a new Master node is received, to look up according to the query request the log information of the faulty Master node saved by the storage module, and to return the log information of the faulty Master node to the new Master node.
In the above scheme, the global information monitoring entity further comprises a judging module, used, when a Worker node uploads log information, to judge whether the identification information of that node carried in the log information is consistent with the saved identification information of the Worker nodes; when they are consistent, the log information of the Worker node is saved; otherwise, the log information of the Worker node is discarded.
In the above scheme, the storage module is also used to save the identification information of the Worker nodes.
The present invention also provides a system for realizing parallel computing, the system comprising: a global information monitoring function entity, a first Worker node, and a first Master node; wherein,
The global information monitoring function entity is used to record, after an overall task is started, the log information of the Worker nodes and the Master node that execute the task;
The first Worker node is used, when a Worker node executing the task fails, to obtain the log information of the faulty Worker node from the global information monitoring function entity and, according to the log information, to continue processing the service flow of the faulty Worker node from the breakpoint at the moment the fault occurred; and/or,
The first Master node is used, when the Master node executing the task fails, to obtain, after starting itself, the log information of the faulty Master node from the global information monitoring function entity and, according to the log information, to continue processing the service flow of the faulty Master node from the breakpoint at the moment the fault occurred.
In the above scheme, the system further comprises: a User Program unit, a second Master node, and a second Worker node; wherein,
The User Program unit is used to select a node as the Master node after starting the overall task by calling the client library, and then to send the input data source to be processed to the second Master node;
The second Master node is used, after receiving the input data source to be processed that the User Program unit sends, to split the input data source, then to select the Worker nodes that will execute the task, and to distribute to each of them the task it needs to execute;
The second Worker node is used to execute the distributed task after receiving the task that the second Master node distributes.
In the above scheme, the second Master node is also used to send task execution information to the first Worker node when the second Worker node fails;
The first Worker node is specifically used to: after receiving the information that the second Master node sends, send a query request to the global information monitoring function entity, and receive the log information of the second Worker node that the global information monitoring function entity returns;
The global information monitoring function entity is also used, after receiving the query request that the first Worker node sends, to look up according to the query request the log information of the second Worker node that it has saved, and to return the log information of the second Worker node to the first Worker node.
In the above scheme, the first Master node is specifically used to: when the second Master node fails, send a query request to the global information monitoring function entity, and receive the log information of the second Master node that the global information monitoring function entity returns;
The global information monitoring function entity is also used, after receiving the query request that the first Master node sends, to look up according to the query request the log information of the second Master node that it has saved, and to return the log information of the second Master node to the first Master node.
In the above scheme, the second Worker node is also used to upload its own log information to the global information monitoring function entity in real time after the overall task is started;
The second Master node is also used to upload its own log information to the global information monitoring function entity in real time after the overall task is started;
The global information monitoring function entity is also used to save the log information of the second Worker node and the second Master node.
In the above scheme, the global information monitoring function entity is also used, before saving the log information of the second Worker node and the second Master node, to judge whether the node identification information carried in the log information of the second Worker node is consistent with the saved identification information of the Worker nodes; if consistent, it saves the log information of the second Worker node; if inconsistent, it discards the log information of the second Worker node.
With the method and system for realizing parallel computing provided by the invention, a new Worker node obtains the recorded log information of the faulty Worker node and, according to the log information, continues to process the service flow of the faulty Worker node from the breakpoint at the moment the fault occurred; and/or a new Master node obtains the recorded log information of the faulty Master node and, according to the log information, continues to process the service flow of the faulty Master node from the breakpoint at the moment the fault occurred. In this way, when a node fails, task execution can continue from the breakpoint at the moment the fault occurred, which improves data processing efficiency, saves system resources, and improves the user experience.
Description of drawings
Fig. 1 is a schematic diagram of the architecture of an existing MapReduce system;
Fig. 2 is a schematic flowchart of the method for realizing parallel computing according to the present invention;
Fig. 3 is a schematic flowchart of the steps performed before the log information of the Master node and the Worker nodes is recorded;
Fig. 4 is a schematic structural diagram of the system for realizing parallel computing according to the present invention.
Embodiments
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.
As shown in Fig. 2, the method for realizing parallel computing according to the present invention comprises the following steps:
Step 201: after an overall task is started, record the log information of the Worker nodes and the Master node that execute the task;
Here, before the log information of the Master node and the Worker nodes is recorded, the method may further comprise the following steps, as shown in Fig. 3:
Step 301: after the User Program starts the overall task by calling the client library, it selects a node as the Master node and then sends the input data source to be processed to the Master node.
Step 302: after receiving the input data source to be processed, the Master node splits the input data source, and then step 303 is executed;
Here, the Master node can call the split function in the User Program to split the input data source. The User Program can tell the Master node the call parameters in advance, or can send the way of calling the function to the Master node in advance through a message.
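As a rough illustration of step 302, the sketch below shows a Master that does not hard-code the splitting logic but calls a split function supplied by the User Program, with parameters passed in advance. The names `split_input` and `master_split` and the `chunk_size` parameter are hypothetical; the patent does not specify the function's signature.

```python
def split_input(records, chunk_size):
    """Hypothetical split function supplied by the User Program."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    return [records[i:i + chunk_size] for i in range(0, len(records), chunk_size)]


def master_split(records, split_fn, **params):
    # The Master calls the User Program's function with the parameters
    # it was told in advance, instead of owning the splitting logic.
    return split_fn(records, **params)


blocks = master_split(list(range(10)), split_input, chunk_size=4)
```

Each element of `blocks` corresponds to one data block that a Worker node would later read in step 304.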
Step 303: the Master node selects the Worker nodes that will execute the task and distributes to each of them the task it needs to execute.
Step 304: each Worker node executing the task reads its split data block and executes the distributed task;
Wherein, steps 301 to 304 are identical to the existing processing procedure and are not described again here;
The log information comprises: status information on node operation, and the state and key data of the service flow. The status information on node operation can be: network condition, CPU, memory, disk space, the execution state of the Map task or Reduce task, and so on. The state and key data of the service flow depend on the specific service flow being processed. For example, for a service flow that uses MapReduce to send a weather forecast short message to 100,000 mobile phone users in parallel, the state and key data of the service flow include the telephone number information of the mobile phone users;
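One possible shape for such a log record is sketched below, using the short-message example. The field names (`node_status`, `flow_state`, `key_data`, `sent_numbers`) and the sample phone numbers are made up for illustration; the patent only states which kinds of information the log contains, not its layout.

```python
from dataclasses import dataclass, field


@dataclass
class LogRecord:
    node_id: str                                     # identity of the uploading node
    task_id: str                                     # overall task the node works on
    node_status: dict = field(default_factory=dict)  # network, CPU, memory, disk, task state
    flow_state: str = "RUNNING"                      # state of the service flow
    key_data: dict = field(default_factory=dict)     # service-specific breakpoint data


def remaining_numbers(assigned, record):
    """Numbers still to be served, given the last uploaded log record."""
    done = set(record.key_data.get("sent_numbers", []))
    return [n for n in assigned if n not in done]


record = LogRecord(
    node_id="worker-1",
    task_id="sms-job",
    node_status={"cpu": 0.4, "map_task": "RUNNING"},
    key_data={"sent_numbers": ["13800000001", "13800000002"]},
)
todo = remaining_numbers(["13800000001", "13800000002", "13800000003"], record)
```

Because the key data records which numbers were already served, a replacement node can skip them instead of resending every message.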
In practical application, a global information monitoring function entity can be set up in the MapReduce system, and this entity records the log information of the Master node and the Worker nodes. The identification information of the global information monitoring function entity is configured in advance on all nodes in the MapReduce system; it can be an Internet Protocol (IP) address, an identity number (ID), or any other information that can identify the global information monitoring function entity. All nodes in the MapReduce system can upload their own log information to the global information monitoring function entity according to this identification information. After the overall task is started, the Master node and the Worker nodes upload their own log information to the global information monitoring function entity in real time;
To make the whole log recording process reliable, after the overall task is started, the Master node sends to the global information monitoring function entity the identification information of the Worker nodes to which it has assigned the overall task, and the global information monitoring function entity receives and saves the identification information of these Worker nodes. When a Worker node uploads log information, the global information monitoring function entity decides, according to the saved identification information of the Worker nodes, whether to save the log information of this Worker node. Specifically, when the identification information of the node carried in the log information is consistent with the saved identification information of the Worker nodes, the log information of this Worker node is saved; otherwise, it is discarded. The identification information of a Worker node is information that can identify the Worker node, such as an IP address, a machine name, or an ID;
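The identity check described above can be sketched as a small in-memory class. The class name `GlobalMonitor`, its method names, and the sample IP addresses are assumptions for illustration; a real global information monitoring function entity could equally be a log database or a group of nodes.

```python
class GlobalMonitor:
    """Toy stand-in for the global information monitoring function entity."""

    def __init__(self):
        self.allowed_workers = set()   # Worker identities sent by the Master
        self.logs = {}                 # node_id -> list of saved log entries

    def register_workers(self, worker_ids):
        # Called when the Master reports which Workers run the overall task.
        self.allowed_workers.update(worker_ids)

    def upload_log(self, node_id, entry):
        # Save the entry only if the carried identity matches a saved one.
        if node_id not in self.allowed_workers:
            return False               # unknown identity: discard the log
        self.logs.setdefault(node_id, []).append(entry)
        return True


monitor = GlobalMonitor()
monitor.register_workers(["10.0.0.11", "10.0.0.12"])
accepted = monitor.upload_log("10.0.0.11", {"state": "RUNNING"})
rejected = monitor.upload_log("10.0.0.99", {"state": "RUNNING"})
```

The second upload is dropped because `10.0.0.99` was never reported by the Master as executing this task.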
The global information monitoring function entity can take the concrete form of a log database, or of an aggregate made up of one or more nodes;
The Worker node here refers to the set of all Worker nodes that execute this task.
Step 202: when a Worker node executing the task fails, a new Worker node obtains the recorded log information of the faulty Worker node and, according to the log information, continues to process the service flow of the faulty Worker node from the breakpoint at the moment the fault occurred; and/or, when the Master node executing the task fails, a new Master node, after starting, obtains the recorded log information of the faulty Master node and, according to the log information, continues to process the service flow of the faulty Master node from the breakpoint at the moment the fault occurred;
Here, the Master node can learn through heartbeat detection between itself and the Worker nodes that a Worker node executing the task has failed. After a Worker node executing the task fails, the Master node can select a node as the new Worker node according to the load of the other nodes in the MapReduce system, that is, through the automatic load balancing of the existing MapReduce system. The new Worker node can be a healthy Worker node that is already executing this task, or a healthy Worker node that is not executing this task;
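Heartbeat detection followed by load-based selection might look like the sketch below. The timeout value, the class `HeartbeatTracker`, the function `pick_replacement`, and the "lowest load wins" rule are all assumptions; the patent relies on the existing MapReduce load balancing and does not define these details. Timestamps are passed in explicitly so the example is deterministic.

```python
class HeartbeatTracker:
    def __init__(self, timeout):
        self.timeout = timeout
        self.last_seen = {}            # worker_id -> time of last heartbeat

    def beat(self, worker_id, now):
        self.last_seen[worker_id] = now

    def failed_workers(self, now):
        # A Worker is considered faulty if it has been silent too long.
        return sorted(w for w, t in self.last_seen.items() if now - t > self.timeout)


def pick_replacement(loads, failed):
    """Choose the least-loaded node that is not marked as faulty."""
    healthy = {n: load for n, load in loads.items() if n not in failed}
    if not healthy:
        raise RuntimeError("no healthy node available")
    return min(healthy, key=healthy.get)


tracker = HeartbeatTracker(timeout=5.0)
tracker.beat("w1", now=0.0)
tracker.beat("w2", now=0.0)
tracker.beat("w1", now=8.0)                  # w2 stays silent after t=0
failed = tracker.failed_workers(now=10.0)
new_worker = pick_replacement({"w1": 0.7, "w2": 0.1, "w3": 0.2}, failed)
```

Note that `w2` has the lowest load but is excluded because it is the faulty node, so the lightly loaded `w3` is chosen.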
After the task is started, the User Program of the MapReduce system can start a timer. If the timer expires and the task execution result returned by the Master node has still not been received, the Master node is considered to have failed, and a new node needs to be selected as the Master node. The selection can be made according to the load of the other nodes in the MapReduce system, that is, through the automatic load balancing of the existing MapReduce system. The new Master node can be a Master node that is executing this task, or another Master node that is not executing this task;
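The timer-based detection of a Master fault can be sketched with a blocking wait that gives up after a timeout. The function name `wait_for_result`, the use of a `queue.Queue` as the result channel, and the timeout values are illustrative assumptions, not part of the patent.

```python
import queue
import threading


def wait_for_result(results, timeout):
    """Return (result, master_failed) after waiting at most `timeout` seconds."""
    try:
        return results.get(timeout=timeout), False
    except queue.Empty:
        # Timer expired without a result: treat the Master as faulty.
        return None, True


# A healthy Master that answers shortly after the task starts.
ok_queue = queue.Queue()
threading.Timer(0.01, ok_queue.put, args=("done",)).start()
ok_result, ok_failed = wait_for_result(ok_queue, timeout=1.0)

# A Master that never answers.
silent_queue = queue.Queue()
silent_result, silent_failed = wait_for_result(silent_queue, timeout=0.05)
```

When `master_failed` is true, the User Program would go on to select a new Master node as described above.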
The new Worker node obtains the log information of the faulty Worker node specifically as follows:
The Master node sends task execution information to the new Worker node;
After receiving the information, the new Worker node sends a query request to the global information monitoring function entity;
After receiving the query request, the global information monitoring function entity looks up, according to the query request, the log information of the faulty Worker node that it has saved, and returns the log information of the faulty Worker node to the new Worker node;
Wherein, the task execution information comprises the task data source, the task ID, the identification information of the faulty Worker node, and so on;
The query request comprises the task ID, the node identification information of the faulty Worker node, and so on. The node identification information of the faulty Worker node can be an IP address, a machine name, an ID, or any other information that can identify the faulty Worker node;
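The three-message exchange above (task execution information, query request, returned log) can be sketched as follows. The dictionary keys, the `DONE` state marker, and the in-memory `store` keyed by task ID and node ID are hypothetical; the patent names the fields each message carries but not their encoding.

```python
def build_query(task_info):
    # The query carries the task ID and the faulty Worker's identity.
    return {"task_id": task_info["task_id"], "node_id": task_info["failed_worker_id"]}


def lookup_log(store, query):
    """Monitoring-entity side: find the saved log of the faulty node."""
    return store.get((query["task_id"], query["node_id"]), [])


def resume(task_info, log_entries):
    """New Worker side: skip the items the faulty Worker already handled."""
    done = {e["item"] for e in log_entries if e["state"] == "DONE"}
    return [item for item in task_info["data_source"] if item not in done]


store = {("job-1", "w2"): [{"item": "a", "state": "DONE"},
                           {"item": "b", "state": "DONE"}]}
task_info = {"task_id": "job-1", "failed_worker_id": "w2",
             "data_source": ["a", "b", "c", "d"]}
pending = resume(task_info, lookup_log(store, build_query(task_info)))
```

The new Worker therefore starts at item `c`, the breakpoint, rather than at the beginning of the data source.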
The new Master node obtains the log information of the faulty Master node specifically as follows:
The new Master node sends a query request to the global information monitoring function entity;
After receiving the query request, the global information monitoring function entity looks up, according to the query request, the log information of the faulty Master node that it has saved, and returns the log information of the faulty Master node to the new Master node;
Wherein, the query request comprises the identification information of the faulty Master node, task ID information, or other information that can identify the log information recorded for the faulty Master node. The identification information of the faulty Master node can be an IP address, a machine name, an ID, or any other information that can identify the faulty Master node.
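One way a new Master could continue from the breakpoint is to replay the returned log and rebuild its scheduling table. This is only a sketch under strong assumptions: the patent does not describe the contents of the Master's log, so the `ASSIGNED` and `COMPLETED` events and the function `rebuild_master_state` are invented for illustration.

```python
def rebuild_master_state(log_entries):
    """Replay the faulty Master's log to recover which subtasks are unfinished."""
    assignments = {}   # subtask -> worker it was assigned to
    completed = set()  # subtasks reported as finished
    for entry in log_entries:
        if entry["event"] == "ASSIGNED":
            assignments[entry["subtask"]] = entry["worker"]
        elif entry["event"] == "COMPLETED":
            completed.add(entry["subtask"])
    pending = {s: w for s, w in assignments.items() if s not in completed}
    return pending, completed


master_log = [
    {"event": "ASSIGNED", "subtask": "map-0", "worker": "w1"},
    {"event": "ASSIGNED", "subtask": "map-1", "worker": "w2"},
    {"event": "COMPLETED", "subtask": "map-0"},
]
pending, completed = rebuild_master_state(master_log)
```

With this state the new Master only needs to keep tracking `map-1`; it does not have to redistribute the whole task.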
When each Worker node finishes its task, it can call an external interface to upload its own log information to the global information monitoring function entity and at the same time notify the Master node that the task it is responsible for has been processed. After being notified, the Master node marks that Worker node's task as completed. After receiving the completion notifications sent by all Worker nodes, the Master node ends the overall task.
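The completion bookkeeping on the Master side can be sketched as below. The class name `TaskTracker`, the status strings, and the return value of `notify_done` are assumptions made for the example.

```python
class TaskTracker:
    """Master-side record of which Workers have finished their tasks."""

    def __init__(self, worker_ids):
        self.status = {w: "RUNNING" for w in worker_ids}

    def notify_done(self, worker_id):
        # Mark this Worker's task as completed, then report whether the
        # overall task can now be ended.
        self.status[worker_id] = "COMPLETED"
        return self.all_done()

    def all_done(self):
        return all(s == "COMPLETED" for s in self.status.values())


tracker = TaskTracker(["w1", "w2"])
first = tracker.notify_done("w1")    # one Worker still running
second = tracker.notify_done("w2")   # last notification arrives
```

The overall task is ended only when the last notification arrives, which is when `notify_done` first returns true.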
To realize the above method, the present invention also provides a global information monitoring entity for obtaining log information, the global information monitoring entity comprising a storage module and a query module; wherein,
The storage module is used to save in real time, after an overall task is started, the log information uploaded by the Master node and the Worker nodes that execute the task;
The query module is used, when a Worker node executing the task has failed and a query request sent by a new Worker node is received, to look up according to the query request the log information of the faulty Worker node saved by the storage module, and to return the log information of the faulty Worker node to the new Worker node; and/or, when the Master node executing the task has failed and a query request sent by a new Master node is received, to look up according to the query request the log information of the faulty Master node saved by the storage module, and to return the log information of the faulty Master node to the new Master node.
Wherein, the global information monitoring entity may further comprise a judging module, used, when a Worker node uploads log information, to judge whether the identification information of that node carried in the log information is consistent with the saved identification information of the Worker nodes; when they are consistent, the log information of the Worker node is saved; otherwise, the log information of the Worker node is discarded.
The storage module is also used to save the identification information of the Worker nodes.
The present invention further provides a system for realizing parallel computing. As shown in Fig. 4, the system comprises: a global information monitoring function entity 41, a first Worker node 42, and a first Master node 43; wherein,
The global information monitoring function entity 41 is used to record, after an overall task is started, the log information of the Worker nodes and the Master node that execute the task;
The first Worker node 42 is used, when a Worker node executing the task fails, to obtain the log information of the faulty Worker node from the global information monitoring function entity 41 and, according to the log information, to continue processing the service flow of the faulty Worker node from the breakpoint at the moment the fault occurred; and/or,
The first Master node 43 is used, when the Master node executing the task fails, to obtain, after starting itself, the log information of the faulty Master node from the global information monitoring function entity 41 and, according to the log information, to continue processing the service flow of the faulty Master node from the breakpoint at the moment the fault occurred.
Here, it should be noted that the first Worker node 42 can be a healthy Worker node that is already executing this task, or a healthy Worker node that is not executing this task; the first Master node 43 can be a Master node that is executing this task, or another Master node that is not executing this task.
Wherein, the system may further comprise a User Program unit, a second Master node, and a second Worker node; wherein,
The User Program unit is used to select a node as the Master node after starting the overall task by calling the client library, and then to send the input data source to be processed to the second Master node;
The second Master node is used, after receiving the input data source to be processed that the User Program unit sends, to split the input data source, then to select the Worker nodes that will execute the task, and to distribute to each of them the task it needs to execute;
The second Worker node is used to execute the distributed task after receiving the task that the second Master node distributes.
Here, it should be noted that the second Worker node can be a set of one or more Worker nodes that execute the task.
The second Master node is further configured to send task-execution information to the first Worker node 42 when the second Worker node fails;
The first Worker node 42 is specifically configured to, after receiving the information sent by the second Master node, send query request information to the global information monitoring function entity 41 and receive the log information of the second Worker node returned by the global information monitoring function entity 41;
The global information monitoring function entity 41 is further configured to, after receiving the query request information sent by the first Worker node 42, look up the log information of the second Worker node that it has stored according to the query request information, and return the log information of the second Worker node to the first Worker node 42.
The first Master node 43 is specifically configured to, when the second Master node fails, send query request information to the global information monitoring function entity 41 and receive the log information of the second Master node returned by the global information monitoring function entity 41;
The global information monitoring function entity 41 is further configured to, after receiving the query request information sent by the first Master node 43, look up the log information of the second Master node that it has stored according to the query request information, and return the log information of the second Master node to the first Master node 43.
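The query exchange with the global information monitoring function entity can be sketched as a simple in-memory store. The class name `GlobalMonitor`, the `failed_node_id` request field, and the use of plain strings as log entries are assumptions made for illustration; the patent does not define the format of the query request or of the log information.

```python
from collections import defaultdict


class GlobalMonitor:
    """Toy stand-in for the global information monitoring function entity."""

    def __init__(self):
        # Maps a node identifier to the log entries uploaded by that node.
        self._logs = defaultdict(list)

    def upload(self, node_id, entry):
        """Store one log entry uploaded by a running node."""
        self._logs[node_id].append(entry)

    def query(self, request):
        """Return a copy of the stored log of the failed node named in the request."""
        return list(self._logs.get(request["failed_node_id"], []))


monitor = GlobalMonitor()
monitor.upload("worker-2", "block 0 done")
monitor.upload("worker-2", "block 1 done")

# A replacement node asks for the failed node's log before resuming its work.
recovered = monitor.query({"failed_node_id": "worker-2"})
unknown = monitor.query({"failed_node_id": "worker-9"})
```

The same lookup would serve both the replacement Worker node and the replacement Master node, since in the description each sends a query request and receives the stored log of the failed node in return.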
The second Worker node is further configured to upload its own log information to the global information monitoring function entity 41 in real time after the overall task is started;
The second Master node is further configured to upload its own log information to the global information monitoring function entity 41 in real time after the overall task is started;
The global information monitoring function entity 41 is further configured to save the log information of the second Worker node and the second Master node.
The global information monitoring function entity 41 is further configured to, before saving the log information of the second Worker node and the second Master node, determine whether the node identification information carried in the log information of the second Worker node is consistent with the stored identification information of the Worker node; if it is consistent, the log information of the second Worker node is saved; if it is inconsistent, the log information of the second Worker node is discarded.
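The identification check performed before saving can be sketched as follows. The sketch assumes that the stored identification information is a set of registered Worker identifiers and that each uploaded log entry carries a `node_id` field; both are illustrative assumptions, and `ValidatingMonitor` is a hypothetical name.

```python
class ValidatingMonitor:
    """Toy monitor that saves a Worker log entry only if its identifier is known."""

    def __init__(self, registered_worker_ids):
        # Identification information of Worker nodes stored in advance (assumed form).
        self._registered = set(registered_worker_ids)
        self._logs = {}

    def save_worker_log(self, entry):
        """Save the entry and return True if its node_id is registered, else discard it."""
        node_id = entry.get("node_id")
        if node_id not in self._registered:
            return False  # inconsistent identification: the log entry is discarded
        self._logs.setdefault(node_id, []).append(entry)
        return True

    def logs_for(self, node_id):
        """Return a copy of the entries saved for the given node."""
        return list(self._logs.get(node_id, []))


monitor = ValidatingMonitor(["worker-1", "worker-2"])
accepted = monitor.save_worker_log({"node_id": "worker-1", "step": 0})
rejected = monitor.save_worker_log({"node_id": "intruder", "step": 0})
```

Discarding entries from unrecognised nodes keeps the stored logs limited to Worker nodes the entity already knows about, so a replacement node never resumes from a log written by an unknown source.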
The above are merely preferred embodiments of the present invention and are not intended to limit its scope of protection; any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (17)

1. A method for implementing parallel computing, characterized in that the method comprises:
after an overall task is started, recording log information of the worker (Worker) nodes and the master (Master) node that execute tasks;
when a Worker node executing a task fails, a new Worker node obtains the recorded log information of the failed Worker node and, based on the breakpoint recorded in the log information at the time of the failure, continues processing the service flow of the failed Worker node; and/or, when the Master node executing a task fails, a new Master node, after starting, obtains the recorded log information of the failed Master node and, based on the breakpoint recorded in the log information at the time of the failure, continues processing the service flow of the failed Master node.
2. The method according to claim 1, characterized in that the new Worker node obtaining the log information of the failed Worker node comprises:
the Master node sends task-execution information to the new Worker node;
after receiving the information, the new Worker node sends query request information to a global information monitoring function entity;
after receiving the query request information, the global information monitoring function entity looks up the log information of the failed Worker node that it has stored according to the query request information, and returns the log information of the failed Worker node to the new Worker node.
3. The method according to claim 1, characterized in that the new Master node obtaining the log information of the failed Master node comprises:
the new Master node sends query request information to a global information monitoring function entity;
after receiving the query request information, the global information monitoring function entity looks up the log information of the failed Master node that it has stored according to the query request information, and returns the log information of the failed Master node to the new Master node.
4. The method according to claim 1, 2 or 3, characterized in that, before recording the log information of the Master node and the Worker nodes, the method further comprises:
the User Program, after starting the overall task by calling the client program library, selects a node as the Master node and then sends the input data source to be processed to the Master node;
after receiving the input data source to be processed, the Master node splits the input data source;
the Master node selects the Worker nodes that will execute tasks and distributes to each of them the task it is to execute;
the Worker nodes executing tasks read the split data blocks and execute the assigned tasks.
5. The method according to claim 4, characterized in that recording the log information of the Worker nodes and the Master node that execute tasks comprises:
after the overall task is started, the Worker nodes and the Master node that execute tasks upload their own log information to the global information monitoring function entity in real time;
the global information monitoring function entity saves the log information of the Worker nodes and the Master node that execute tasks.
6. The method according to claim 5, characterized in that, before the global information monitoring function entity saves the log information of the Worker nodes and the Master node that execute tasks, the method further comprises:
after receiving the log information uploaded by a Worker node, the global information monitoring function entity determines whether the node identification information carried in the log information of the Worker node is consistent with the stored identification information of the Worker node; if it is consistent, the log information of the Worker node is saved; if it is inconsistent, the log information of the Worker node is discarded.
7. A method for obtaining log information, characterized in that the method comprises:
after an overall task is started, saving in real time the log information of the Master node and the Worker nodes that execute tasks;
when a Worker node executing a task fails, and after query request information sent by a new Worker node is received, looking up the stored log information of the failed Worker node according to the query request information, and returning the log information of the failed Worker node to the new Worker node; and/or, when the Master node executing a task fails, and after query request information sent by a new Master node is received, looking up the stored log information of the failed Master node according to the query request information, and returning the log information of the failed Master node to the new Master node.
8. The method according to claim 7, characterized in that, before saving in real time the log information of the Master node and the Worker nodes that execute tasks, the method further comprises:
determining whether the node identification information carried in the log information of a Worker node is consistent with the stored identification information of the Worker node; if it is consistent, saving the log information of the Worker node; if it is inconsistent, discarding the log information of the Worker node.
9. A global information monitoring entity for obtaining log information, characterized in that the global information monitoring entity comprises a storage module and a query module, wherein:
the storage module is configured to, after an overall task is started, save the log information uploaded in real time by the Master node and the Worker nodes that execute tasks;
the query module is configured to, when a Worker node executing a task fails, and after query request information sent by a new Worker node is received, look up the log information of the failed Worker node saved by the storage module according to the query request information, and return the log information of the failed Worker node to the new Worker node; and/or, when the Master node executing a task fails, and after query request information sent by a new Master node is received, look up the log information of the failed Master node saved by the storage module according to the query request information, and return the log information of the failed Master node to the new Master node.
10. The global information monitoring entity according to claim 9, characterized in that the global information monitoring entity further comprises a judgment module configured to, when a Worker node uploads log information, determine whether the identification information of that node carried in the log information of the Worker node is consistent with the stored identification information of the Worker node; if it is consistent, the log information of the Worker node is saved; otherwise, the log information of the Worker node is discarded.
11. The global information monitoring entity according to claim 9 or 10, characterized in that the storage module is further configured to save the identification information of the Worker nodes.
12. A system for implementing parallel computing, characterized in that the system comprises a global information monitoring function entity, a first Worker node, and a first Master node, wherein:
the global information monitoring function entity is configured to, after an overall task is started, record the log information of the Worker nodes and the Master node that execute tasks;
the first Worker node is configured to, when a Worker node executing a task fails, obtain the log information of the failed Worker node from the global information monitoring function entity and, based on the breakpoint recorded in the log information at the time of the failure, continue processing the service flow of the failed Worker node; and/or,
the first Master node is configured to, when the Master node executing a task fails, after starting itself, obtain the log information of the failed Master node from the global information monitoring function entity and, based on the breakpoint recorded in the log information at the time of the failure, continue processing the service flow of the failed Master node.
13. The system according to claim 12, characterized in that the system further comprises a User Program unit, a second Master node, and a second Worker node, wherein:
the User Program unit is configured to, after starting the overall task by calling the client program library, select a node as the Master node and then send the input data source to be processed to the second Master node;
the second Master node is configured to, after receiving the input data source to be processed from the User Program unit, split the input data source, then select the Worker nodes that will execute tasks and distribute to each of them the task it is to execute;
the second Worker node is configured to execute the assigned task after receiving the task distributed by the second Master node.
14. The system according to claim 13, characterized in that:
the second Master node is further configured to send task-execution information to the first Worker node when the second Worker node fails;
the first Worker node is specifically configured to, after receiving the information sent by the second Master node, send query request information to the global information monitoring function entity and receive the log information of the second Worker node returned by the global information monitoring function entity;
the global information monitoring function entity is further configured to, after receiving the query request information sent by the first Worker node, look up the log information of the second Worker node that it has stored according to the query request information, and return the log information of the second Worker node to the first Worker node.
15. The system according to claim 13, characterized in that:
the first Master node is specifically configured to, when the second Master node fails, send query request information to the global information monitoring function entity and receive the log information of the second Master node returned by the global information monitoring function entity;
the global information monitoring function entity is further configured to, after receiving the query request information sent by the first Master node, look up the log information of the second Master node that it has stored according to the query request information, and return the log information of the second Master node to the first Master node.
16. The system according to claim 13, 14 or 15, characterized in that:
the second Worker node is further configured to upload its own log information to the global information monitoring function entity in real time after the overall task is started;
the second Master node is further configured to upload its own log information to the global information monitoring function entity in real time after the overall task is started;
the global information monitoring function entity is further configured to save the log information of the second Worker node and the second Master node.
17. The system according to claim 16, characterized in that:
the global information monitoring function entity is further configured to, before saving the log information of the second Worker node and the second Master node, determine whether the node identification information carried in the log information of the second Worker node is consistent with the stored identification information of the Worker node; if it is consistent, the log information of the second Worker node is saved; if it is inconsistent, the log information of the second Worker node is discarded.
CN201010269332.1A 2010-08-27 2010-08-27 Method and system for realization of parallel computing Active CN102385536B (en)

Priority Applications (2)

Application Number Priority Date Filing Date Title
CN201010269332.1A CN102385536B (en) 2010-08-27 2010-08-27 Method and system for realization of parallel computing
PCT/CN2011/072818 WO2012024937A1 (en) 2010-08-27 2011-04-14 Method and system for realizing parallel computing

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201010269332.1A CN102385536B (en) 2010-08-27 2010-08-27 Method and system for realization of parallel computing

Publications (2)

Publication Number Publication Date
CN102385536A true CN102385536A (en) 2012-03-21
CN102385536B CN102385536B (en) 2014-06-11

Family

ID=45722853

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201010269332.1A Active CN102385536B (en) 2010-08-27 2010-08-27 Method and system for realization of parallel computing

Country Status (2)

Country Link
CN (1) CN102385536B (en)
WO (1) WO2012024937A1 (en)

Cited By (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103136363A (en) * 2013-03-14 2013-06-05 曙光信息产业(北京)有限公司 Inquiry processing method and cluster data base system
CN104461752A (en) * 2014-11-21 2015-03-25 浙江宇视科技有限公司 Two-level fault-tolerant multimedia distributed task processing method
CN106789141A (en) * 2015-11-24 2017-05-31 阿里巴巴集团控股有限公司 A kind of gateway device failure processing method and processing device
CN107644382A (en) * 2016-07-22 2018-01-30 平安科技(深圳)有限公司 Policy information statistical method and device
CN108600008A (en) * 2018-04-24 2018-09-28 成都致云科技有限公司 Server management method, server managing device and distributed system
CN108959063A (en) * 2017-05-25 2018-12-07 北京京东尚科信息技术有限公司 A kind of method and apparatus that program executes
CN110673936A (en) * 2019-09-18 2020-01-10 平安科技(深圳)有限公司 Breakpoint continuous operation method and device for arranging service, storage medium and electronic equipment
CN113596148A (en) * 2021-07-27 2021-11-02 上海商汤科技开发有限公司 Data transmission method, system, device, computing equipment and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060085792A1 (en) * 2004-10-15 2006-04-20 Microsoft Corporation Systems and methods for a disaster recovery system utilizing virtual machines running on at least two host computers in physically different locations
CN1987804A (en) * 2005-12-22 2007-06-27 国际商业机器公司 Method and system for securing redundancy in parallel computing sytem
WO2009059377A1 (en) * 2007-11-09 2009-05-14 Manjrosoft Pty Ltd Software platform and system for grid computing
CN101764835A (en) * 2008-12-25 2010-06-30 华为技术有限公司 Task allocation method and device based on MapReduce programming framework
CN101770402A (en) * 2008-12-29 2010-07-07 中国移动通信集团公司 Map task scheduling method, equipment and system in MapReduce system

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101145946B (en) * 2007-09-17 2010-09-01 中兴通讯股份有限公司 A fault tolerance cluster system and method based on message log

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20060085792A1 (en) * 2004-10-15 2006-04-20 Microsoft Corporation Systems and methods for a disaster recovery system utilizing virtual machines running on at least two host computers in physically different locations
CN1987804A (en) * 2005-12-22 2007-06-27 国际商业机器公司 Method and system for securing redundancy in parallel computing sytem
WO2009059377A1 (en) * 2007-11-09 2009-05-14 Manjrosoft Pty Ltd Software platform and system for grid computing
CN101764835A (en) * 2008-12-25 2010-06-30 华为技术有限公司 Task allocation method and device based on MapReduce programming framework
CN101770402A (en) * 2008-12-29 2010-07-07 中国移动通信集团公司 Map task scheduling method, equipment and system in MapReduce system

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
You Jinguo et al.: "Parallel Data Warehouse Architecture Based on PC Clusters", Computer Engineering (《计算机工程》), vol. 35, no. 20, 31 October 2009 (2009-10-31) *

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103136363A (en) * 2013-03-14 2013-06-05 曙光信息产业(北京)有限公司 Inquiry processing method and cluster data base system
CN104461752A (en) * 2014-11-21 2015-03-25 浙江宇视科技有限公司 Two-level fault-tolerant multimedia distributed task processing method
CN104461752B (en) * 2014-11-21 2018-09-18 浙江宇视科技有限公司 A kind of multimedia distributed task processing method of two-stage failure tolerant
CN106789141A (en) * 2015-11-24 2017-05-31 阿里巴巴集团控股有限公司 A kind of gateway device failure processing method and processing device
US10831622B2 (en) 2015-11-24 2020-11-10 Alibaba Group Holding Limited Method and apparatus for processing gateway device fault
CN106789141B (en) * 2015-11-24 2020-12-11 阿里巴巴集团控股有限公司 Gateway equipment fault processing method and device
CN107644382A (en) * 2016-07-22 2018-01-30 平安科技(深圳)有限公司 Policy information statistical method and device
CN108959063A (en) * 2017-05-25 2018-12-07 北京京东尚科信息技术有限公司 A kind of method and apparatus that program executes
CN108600008A (en) * 2018-04-24 2018-09-28 成都致云科技有限公司 Server management method, server managing device and distributed system
CN108600008B (en) * 2018-04-24 2021-12-17 致云科技有限公司 Server management method, server management device and distributed system
CN110673936A (en) * 2019-09-18 2020-01-10 平安科技(深圳)有限公司 Breakpoint continuous operation method and device for arranging service, storage medium and electronic equipment
CN113596148A (en) * 2021-07-27 2021-11-02 上海商汤科技开发有限公司 Data transmission method, system, device, computing equipment and storage medium

Also Published As

Publication number Publication date
CN102385536B (en) 2014-06-11
WO2012024937A1 (en) 2012-03-01

Similar Documents

Publication Publication Date Title
CN102385536B (en) Method and system for realization of parallel computing
CN107688496B (en) Task distributed processing method and device, storage medium and server
CN108737270B (en) Resource management method and device for server cluster
CN110995513B (en) Data sending and receiving method in Internet of things system, internet of things equipment and platform
US11119911B2 (en) Garbage collection method and device
CN111176803B (en) Service processing method, device, server and storage medium
CN105516086B (en) Method for processing business and device
CN103412786A (en) High performance server architecture system and data processing method thereof
CN111176941B (en) Data processing method, device and storage medium
CN102333130A (en) Method and system for accessing cache server and intelligent cache scheduler
CN109033814B (en) Intelligent contract triggering method, device, equipment and storage medium
CN106034113A (en) Data processing method and data processing device
CN105373563B (en) Database switching method and device
CN106156210B (en) Method and device for determining application identifier matching list
CN108154343B (en) Emergency processing method and system for enterprise-level information system
CN113055493B (en) Data packet processing method, device, system, scheduling device and storage medium
CN111163117B (en) Zookeeper-based peer-to-peer scheduling method and device
US20160006635A1 (en) Monitoring method and monitoring system
CN105760215A (en) Map-reduce model based job running method for distributed file system
CN113407629A (en) Data synchronization method and device, electronic equipment and storage medium
CN113301136B (en) Service request processing method and device
CN108255820B (en) Method and device for data storage in distributed system and electronic equipment
CN110764882A (en) Distributed management method, distributed management system and device
CN113132143B (en) Service call tracing method and related product
CN117331686A (en) Request processing method and device, storage medium and electronic device

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right

Effective date of registration: 20170531

Address after: 226200 No. 46 Industrial Park, Nanyang Town, Nantong, Jiangsu, Qidong

Patentee after: Qidong planting valve factory

Address before: 518057 Nanshan District Guangdong high tech Industrial Park, South Road, science and technology, ZTE building, Ministry of Justice

Patentee before: ZTE Corporation

TR01 Transfer of patent right

Effective date of registration: 20170724

Address after: 226200 Yuan Xiang Village, Nanyang Town, Qidong, Jiangsu

Patentee after: NANTONG JINGHAISHEN AQUATIC PRODUCT CO., LTD.

Address before: 226200 No. 46 Industrial Park, Nanyang Town, Nantong, Jiangsu, Qidong

Patentee before: Qidong planting valve factory