US20130174170A1 - Parallel computer, and job information acquisition method for parallel computer - Google Patents
- Publication number
- US20130174170A1 (application number US13/778,494)
- Authority
- US
- United States
- Prior art keywords
- job information
- calculation
- retention
- identification number
- management node
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3466—Performance evaluation by tracing or monitoring
- G06F11/3495—Performance evaluation by tracing or monitoring for systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3404—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for parallel or distributed programming
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3466—Performance evaluation by tracing or monitoring
- G06F11/3476—Data logging
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
- G06F2201/84—Using snapshots, i.e. a logical point-in-time copy of the data
Definitions
- the present invention relates to a parallel computer, and a job information acquisition method for a parallel computer.
- a parallel computer can process large-scale calculation, for example, by connecting a plurality of computers (hereinafter, referred to as calculation nodes) through a network, distributing a calculation job among separate calculation nodes, and executing the calculation job in parallel. Accordingly, the demand for the parallel computer is increasing rapidly.
- a parallel computer includes a node managing a calculation node group (hereinafter, simply referred to as a management node) including a plurality of calculation nodes.
- the parallel computer may require technology that enables a management node side to recognize information such as usage of respective resources such as a CPU, a memory and a file used in each calculation node by a currently-executed calculation job, and the number of commands executed by the calculation job (hereinafter, simply referred to as job information).
- FIG. 14 is an illustration diagram illustrating a snapshot acquisition method for a parallel computer.
- a management node 112 managing a plurality of calculation nodes 111 manages the current time, and requests each calculation node 111 to acquire job information when the current time arrives at a predetermined time (step S 211 ).
- each calculation node 111 acquires its own job information (step S 211 having been requested by the management node 112; step S 212 ).
- each calculation node 111 transmits the acquired job information to the management node 112 (step S 213 ).
- the management node 112 of the parallel computer 110 illustrated in FIG. 14 can acquire job information of the same time (timing) of each calculation node 111 , that is, a snapshot.
- FIG. 15 is an illustration diagram illustrating another snapshot acquisition method for a parallel computer 120 .
- each calculation node 121 manages the current time.
- each calculation node 121 acquires its own job information (step S 221 ).
- each calculation node 121 transmits the acquired job information to a management node 122 (step S 222 ).
- the management node 122 of the parallel computer 120 illustrated in FIG. 15 can acquire job information of the same time (timing) of each calculation node 121 , that is, a snapshot.
- Patent Document 1 Japanese Laid-open Patent Publication No. 8-44680
- Patent Document 2 Japanese Laid-open Patent Publication No. 63-136176
- in the parallel computer 120 illustrated in FIG. 15, since the job information is transmitted asynchronously from the respective calculation nodes 121, the job information of the same time (same timing) transmitted from the respective calculation nodes 121 may not all be received at the management node 122 until the next job information acquisition time. As a result, job information of different times may be received in a mixed manner. That is, in the parallel computer 120, since the job information of the same timing of the respective calculation nodes 121 cannot be identified, an accurate snapshot is difficult to acquire.
- a parallel computer includes a plurality of calculation nodes that execute a calculation job distributively in parallel, and a management node that manages the plurality of calculation nodes.
- the calculation node includes an acquisition unit, a retention control unit, and an information transmission unit. The acquisition unit acquires job information about a calculation job handled by the calculation node according to a period timing common to the calculation nodes. The retention control unit retains the job information in a retention unit of the calculation node in association with an identification number identifying the period timing at which the job information is acquired by the acquisition unit, and clears all the job information retained in the retention unit when a clear request is received from the management node. When a transmission request for the job information about a designated identification number is received, the information transmission unit transmits the job information about the designated identification number to the management node in a case where that job information exists in the retention unit, and transmits job information about an identification number just before the designated identification number to the management node in a case where the job information about the designated identification number does not exist in the retention unit.
- the management node includes a retention control unit and a clear request unit. The retention control unit retains the job information in a retention unit of the management node when the job information is received from each of the calculation nodes according to the transmission request, retains, as a snapshot, job information of the same identification number in a case where the job information of the same identification number about the calculation nodes is detected in the retention unit, and clears job information other than the job information of the same identification number retained in the retention unit of the management node. The clear request unit transmits the clear request to each of the calculation nodes when the job information of the same identification number is retained as a snapshot.
- the retention unit of the calculation node includes a retention region enabling retention of job information corresponding to a plurality of periods
- the retention unit of the management node includes a retention region enabling retention of the job information corresponding to the plurality of periods with respect to each of the calculation nodes.
- FIG. 1 is a block diagram illustrating a parallel computer according to a first embodiment
- FIG. 2 is a block diagram illustrating a parallel computer according to a second embodiment
- FIG. 3 is an illustration diagram of a parallel computer
- FIG. 4 is an illustration diagram of a job information acquisition period (time belt)
- FIG. 5 is an illustration diagram illustrating the reason for setting a calculation side retention unit corresponding to two generations
- FIG. 6 is an illustration diagram illustrating an example of an operation transition for snapshot acquisition of a parallel computer
- FIG. 7 is an illustration diagram illustrating an example of an operation transition for snapshot acquisition of a parallel computer
- FIG. 8 is an illustration diagram illustrating an example of an operation transition for snapshot acquisition of a parallel computer
- FIG. 9 is a flow chart illustrating an internal processing operation of a representative node for job acquisition processing of a representative node side
- FIG. 10 is a flow chart illustrating an internal processing operation of a calculation node for job acquisition processing of a calculation node side
- FIG. 11 is a flow chart illustrating an internal processing operation of a management node for job acquisition processing of a management node side
- FIG. 12 is an illustration diagram illustrating a parallel computer according to another embodiment
- FIG. 13 is an illustration diagram illustrating a computer executing a job information acquisition program of a parallel computer
- FIG. 14 is an illustration diagram illustrating a snapshot acquisition method for a parallel computer.
- FIG. 15 is an illustration diagram illustrating another snapshot acquisition method for a parallel computer.
- FIG. 1 is a block diagram illustrating a parallel computer according to a first embodiment.
- a parallel computer 1 A illustrated in FIG. 1 includes a plurality of calculation nodes 50 executing a calculation job distributively in parallel, and a management node 60 managing the plurality of calculation nodes 50 .
- the calculation node 50 includes an acquisition unit 51 , a retention unit 52 , a retention control unit 53 , and an information transmission unit 54 . According to the period timing common to the calculation nodes, the acquisition unit 51 acquires job information about a calculation job handled by the relevant calculation node 50 .
- the retention control unit 53 retains the job information in the retention unit 52 of the calculation node 50 in association with an identification number identifying the period timing at which the acquisition unit 51 acquires the job information. Also, when receiving a clear request from the management node 60 , the retention control unit 53 clears all of the job information retained in the retention unit 52 .
- the retention unit 52 includes a retention region retaining its own job information corresponding to a predetermined number of periods, for example, two periods (generations).
- when receiving a job information transmission request for a designated identification number from the management node 60, the information transmission unit 54 transmits the job information of the designated identification number to the management node 60 if that job information exists in the retention unit 52. When the job information of the designated identification number does not exist in the retention unit 52 and job information of the identification number just before the designated identification number exists therein, the information transmission unit 54 transmits the job information of that preceding identification number to the management node 60. The identification number just before the designated identification number corresponds to, for example, the identification number of one generation ago.
- the management node 60 includes a retention unit 61 , a retention control unit 62 , and a clear request unit 63 .
- the retention unit 61 includes a retention region capable of retaining job information corresponding to a predetermined number of periods for each calculation node 50 .
- the retention control unit 62 retains the received job information in the retention unit 61 of the management node 60 . Also, when detecting job information of the same identification number about all the calculation nodes 50 in the retention unit 61 , the retention control unit 62 retains the job information of the same identification number as a snapshot.
- the retention control unit 62 clears job information other than the job information of the same identification number retained in the retention unit 61 of the management node 60 .
- the clear request unit 63 transmits a clear request to each calculation node 50 .
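- as a minimal sketch of the management node behavior described above, the following Python class retains incoming job information per identification number and promotes it to a snapshot once every calculation node has reported the same identification number. The class name `ManagementRetention`, its attributes, and the node names are hypothetical and do not appear in the source; late reports for identification numbers older than the current snapshot are not handled here.

```python
from typing import Dict, List, Optional


class ManagementRetention:
    """Hypothetical stand-in for the retention unit 61 and retention control unit 62."""

    def __init__(self, node_ids: List[str]) -> None:
        self.node_ids = list(node_ids)
        # pending[identification number][node id] -> job information
        self.pending: Dict[int, Dict[str, dict]] = {}
        self.snapshot_id: Optional[int] = None
        self.snapshot: Dict[str, dict] = {}

    def retain(self, node_id: str, ident: int, job_info: dict) -> bool:
        """Retain one report; return True when a clear request should be sent."""
        self.pending.setdefault(ident, {})[node_id] = job_info
        reports = self.pending[ident]
        if all(node in reports for node in self.node_ids):
            # Every node reported the same identification number: keep it as the snapshot
            self.snapshot_id = ident
            self.snapshot = dict(reports)
            # Clear job information other than the snapshot
            self.pending.clear()
            return True
        return False


mgmt = ManagementRetention(["n0", "n1"])
results = [
    mgmt.retain("n0", 1, {"cpu_seconds": 4.0}),
    mgmt.retain("n0", 2, {"cpu_seconds": 9.0}),
    mgmt.retain("n1", 1, {"cpu_seconds": 3.5}),
]
print(results, mgmt.snapshot_id)  # [False, False, True] 1
```

The boolean return value stands in for the clear request unit 63: a caller would send the clear request to each calculation node whenever `retain` returns `True`.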
- the calculation node 50 acquires the job information according to the period timing common to the calculation nodes, and retains the acquired job information in the retention unit 52 of the calculation node 50 in association with the identification number identifying the period timing at which the job information is acquired.
- when receiving the job information from each calculation node 50 according to the transmission request, the management node 60 retains the received job information in the retention unit 61 of the management node 60.
- when detecting job information of the same identification number about the calculation nodes in the retention unit 61, the management node 60 retains the job information of the same identification number as a snapshot.
- when the job information of the same identification number is retained as a snapshot, the job information other than the job information of the same identification number retained in the retention unit 61 of the management node 60 is cleared, and all of the job information retained in the retention unit 52 of the calculation node 50 is cleared.
- since the job information is managed in association with the identification number of the period timing at which the job information is acquired, an accurate snapshot of the job information between the calculation nodes 50 can be secured.
- the retention unit 52 of the calculation node 50 includes a retention region capable of retaining job information corresponding to a predetermined number of periods
- the retention unit 61 of the management node 60 includes a retention region capable of retaining job information corresponding to a predetermined number of periods for each calculation node 50 .
- FIG. 2 is a block diagram illustrating a parallel computer according to a second embodiment
- FIG. 3 is an illustration diagram of the parallel computer.
- a parallel computer 1 illustrated in FIG. 2 includes a plurality of calculation nodes 3 connected to a network 2 , and a management node 4 managing the plurality of calculation nodes 3 .
- the parallel computer 1 distributes a calculation job among the respective calculation nodes 3 and executes the calculation in parallel. Although four calculation nodes 3 ( 3 A to 3 D) are illustrated for convenience of description, the number of calculation nodes 3 is not limited thereto.
- the calculation node 3 corresponds to, for example, a computer, and executes a calculation job.
- the calculation node 3 includes a calculation processing unit 11 , a job information processing control unit 12 , a calculation side communication unit 13 , and a calculation side retention unit 14 .
- the calculation processing unit 11 executes its own calculation job among the distributed calculation jobs.
- the calculation side communication unit 13 communicates with the management node 4 through the network 2 .
- the calculation side retention unit 14 corresponds to, for example, a buffer or the like.
- the calculation side retention unit 14 includes a first retention region 14 A and a second retention region 14 B retaining job information corresponding to two generations, that is, two time belts.
- the job information processing control unit 12 includes a timing detection unit 21 , an acquisition processing unit 22 , a calculation side retention control unit 23 , and an information transmission unit 24 .
- the timing detection unit 21 detects the timing of acquiring its own job information.
- the timing detection unit 21 starts a timer operation according to a job start command common to the calculation nodes 3 .
- FIG. 4 is an illustration diagram of a job information acquisition period (time belt).
- the timing detection unit 21 detects the job information acquisition timing by using the period timing common to the calculation nodes, that is, the time belt of FIG. 4 .
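- the timing detection described above can be sketched as a mapping from the time elapsed since the job start command to a time belt number. The function name, the fixed period length, and the numbering convention (belt n is reached n periods after the job start command, 0 meaning no timing reached yet) are assumptions for illustration; the source only states that the timer starts at a job start command common to the calculation nodes.

```python
def time_belt_number(elapsed_seconds: float, period_seconds: float) -> int:
    """Return the number of the most recently reached time belt.

    Assumption: belts are numbered 1, 2, 3, ... and belt n is reached
    n * period_seconds after the job start command; 0 means none reached yet.
    """
    if elapsed_seconds < 0 or period_seconds <= 0:
        raise ValueError("elapsed time must be >= 0 and period must be > 0")
    return int(elapsed_seconds // period_seconds)


# Hypothetical numbers: a 60-second period, observed 130 seconds after
# the first node received the job start command.
PERIOD = 60.0
on_time_node = time_belt_number(130.0, PERIOD)          # started at t = 0
delayed_node = time_belt_number(130.0 - 60.0, PERIOD)   # start command arrived 60 s late
print(on_time_node, delayed_node)  # 2 1
```

Under these assumptions, a node whose job start command arrives one period late is one time belt behind the others at the same instant, which is the kind of gap the two-generation retention region is meant to absorb.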
- the acquisition processing unit 22 acquires its own job information.
- the calculation side retention control unit 23 controls the retention by the calculation side retention unit 14, and retains the job information acquired by the acquisition processing unit 22 in the calculation side retention unit 14.
- the job information includes job information contents, information existence/nonexistence, node information, time belt number, information acquisition date and time, and the like.
- the job information contents include a job ID identifying a job, usage of resources such as a CPU, a memory and a file used by the job, and the number of commands executed by the job.
- the information existence/nonexistence is information indicating the existence/nonexistence of job information.
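- when the information existence/nonexistence indicates existence, the job information corresponds to job information having job information contents.
- when the information existence/nonexistence indicates nonexistence, the job information corresponds to error information that will be described below.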
- the node information corresponds to a node ID identifying the calculation node 3 that is the source of the job information.
- the time belt number corresponds to a number identifying the period timing common to the calculation nodes 3 having acquired the job information.
- the information acquisition date and time correspond to the date and time of acquisition of the job information.
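- the fields listed above could be modeled as a record such as the following Python sketch. The class names, field names, types, and units are assumptions chosen for illustration; the source names the fields but does not specify their encoding.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class JobInfoContents:
    job_id: str              # identifies the job
    cpu_usage: float         # assumed unit: CPU seconds
    memory_usage: int        # assumed unit: bytes
    file_usage: int          # assumed unit: bytes
    executed_commands: int   # number of commands executed by the job


@dataclass(frozen=True)
class JobInfoRecord:
    contents: Optional[JobInfoContents]  # job information contents, if any
    has_information: bool                # information existence/nonexistence
    node_id: str                         # calculation node that produced the record
    time_belt_number: int                # period timing common to the calculation nodes
    acquired_at: datetime                # information acquisition date and time


example = JobInfoRecord(
    contents=JobInfoContents("job-1", 12.5, 2048, 4096, 100000),
    has_information=True,
    node_id="3A",
    time_belt_number=1,
    acquired_at=datetime(2013, 1, 1, 0, 0, 0),
)
print(example.node_id, example.time_belt_number)  # 3A 1
```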
- the calculation side retention control unit 23 determines whether an empty space exists in the retention region of the calculation side retention unit 14 . When an empty space exists in the retention region of the calculation side retention unit 14 , the calculation side retention control unit 23 retains the job information in the calculation side retention unit 14 . Also, when an empty space does not exist in the retention region of the calculation side retention unit 14 , the calculation side retention control unit 23 prohibits the retention of the job information.
- when receiving a transmission request for a designated time belt number from the management node 4, the calculation side retention control unit 23 determines whether job information of the designated time belt number exists in the calculation side retention unit 14.
- when the job information of the designated time belt number exists, the calculation side retention control unit 23 transmits the job information of the designated time belt number to the management node 4 through the calculation side communication unit 13.
- when the job information of the designated time belt number does not exist, the calculation side retention control unit 23 determines whether job information one generation before the designated time belt number exists in the calculation side retention unit 14.
- when job information one generation before the designated time belt number exists, the calculation side retention control unit 23 transmits the job information one generation before the designated time belt number to the management node 4 through the calculation side communication unit 13. When job information one generation before the designated time belt number does not exist in the calculation side retention unit 14 either, the calculation side retention control unit 23 transmits error information to the management node 4 through the calculation side communication unit 13. Also, according to a clear request from the management node 4, which will be described below, the calculation side retention control unit 23 clears all of the job information retained in the calculation side retention unit 14.
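- the retention, transmission, and clear behavior of the calculation side retention control unit 23 can be sketched as follows. The class `CalculationSideRetention`, its method names, and the tuple-shaped replies are hypothetical; the sketch also assumes time belt numbers are consecutive integers, so that "one generation before" is the designated number minus one.

```python
from typing import Dict, Optional, Tuple

Reply = Tuple[str, Optional[int], Optional[dict]]


class CalculationSideRetention:
    """Hypothetical model of the calculation side retention unit 14 (two regions)."""

    def __init__(self, capacity: int = 2) -> None:
        self.capacity = capacity
        self.regions: Dict[int, dict] = {}  # time belt number -> job information

    def retain(self, time_belt: int, job_info: dict) -> bool:
        """Retain job information only if an empty region exists."""
        if len(self.regions) >= self.capacity:
            return False  # no empty space: retention is prohibited
        self.regions[time_belt] = job_info
        return True

    def respond(self, designated: int) -> Reply:
        """Answer a transmission request for the designated time belt number."""
        # Try the designated time belt first, then one generation before it
        for candidate in (designated, designated - 1):
            if candidate in self.regions:
                return ("job_info", candidate, self.regions[candidate])
        return ("error", None, None)

    def clear(self) -> None:
        """Handle a clear request from the management node."""
        self.regions.clear()


unit = CalculationSideRetention()
stored = [unit.retain(t, {"belt": t}) for t in (1, 2, 3)]
exact = unit.respond(2)
fallback = unit.respond(3)
missing = unit.respond(5)
unit.clear()
after_clear = unit.respond(2)
print(stored, exact[1], fallback[1], missing[0], after_clear[0])
# [True, True, False] 2 2 error error
```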
- one calculation node 3 A among the four calculation nodes 3 ( 3 A to 3 D) is referred to as a representative node.
- the representative node has substantially the same internal configuration as the calculation node 3 , but is characterized by having a function described below.
- the job information processing control unit 12 of the representative node acquires job information according to the period timing common to the calculation nodes 3 , and retains the job information in the calculation side retention unit 14 .
- the job information processing control unit 12 notifies the management node 4, through the calculation side communication unit 13, of the time belt number of the job information as a transmission request target.
- the management node 4 corresponds to, for example, a computer.
- the management node 4 connects with each calculation node 3 through the network 2 to manage each calculation node 3 .
- the management node 4 includes a management side processing unit 31 , a snapshot processing control unit 32 , a management side communication unit 33 , and a management side retention unit 34 .
- the management side processing unit 31 manages the distributed calculation nodes 3 .
- the management side communication unit 33 communicates with each calculation node 3 through the network 2 .
- the management side retention unit 34 corresponds to, for example, a buffer or the like.
- the management side retention unit 34 includes a first retention region 34 A, a second retention region 34 B and a third retention region 34 C retaining job information corresponding to three generations, that is, three time belts.
- the first retention region 34 A retains job information about a snapshot
- the second retention region 34 B and the third retention region 34 C are used to temporarily retain job information in order to acquire a snapshot.
- before a snapshot has been acquired, the first retention region 34 A is also used to temporarily retain job information, like the second retention region 34 B and the third retention region 34 C.
- the snapshot processing control unit 32 includes a transmission request unit 41 , a received information identification unit 42 , a retention region monitoring unit 43 , a clear request unit 44 , and a management side retention control unit 45 .
- the transmission request unit 41 requests the transmission of job information about the time belt number from each calculation node 3 through the management side communication unit 33 .
- the received information identification unit 42 identifies the received information of each calculation node 3 received according to a transmission request of a designated time belt number to each calculation node 3 .
- the received information is received from the calculation node 3, and includes, for example, job information of a designated time belt number, job information of a time belt number one generation before the designated time belt number, error information, and the like.
- the retention region monitoring unit 43 monitors the job information of the respective calculation nodes 3 retained in the first retention region 34 A, the second retention region 34 B and the third retention region 34 C. In addition, based on the job information monitoring result, the retention region monitoring unit 43 determines whether there is a new time belt number for which the job information of all the calculation nodes 3 has been retained. When there is such a new time belt number, the management side retention control unit 45 determines that a snapshot of the same time belt number is newly acquired, and updates/registers the job information of all the calculation nodes 3 of the same time belt number in the first retention region 34 A.
- when a new snapshot is acquired, the management side retention control unit 45 clears all the job information of the respective calculation nodes 3 retained in the second retention region 34 B and the third retention region 34 C. Also, when a new snapshot is acquired, the clear request unit 44 requests, through the management side communication unit 33, the clearing of all the job information retained in the calculation side retention unit 14 of all the calculation nodes 3.
- when detecting a snapshot presentation request from a user terminal, the management node 4 presents the user terminal with the job information of the same time belt number of all the calculation nodes 3 retained in the first retention region 34 A of the management side retention unit 34, as a snapshot. That is, the user can know the job information of each calculation node 3 with respect to the job that is being executed.
- FIG. 5 is an illustration diagram illustrating the reason for setting the calculation side retention unit 14 corresponding to two generations.
- the calculation node 3 B clears all the job information up to the time belt number T 2 retained in the calculation side retention unit 14 .
- the job information to be acquired next is the job information of a time belt number T 3 .
- the calculation node 3 C clears all the job information up to the time belt number T 3 retained in the calculation side retention unit 14 .
- the job information to be acquired next is the job information of a time belt number T 4 .
- the calculation side retention unit 14 of each calculation node 3 is provided with the first retention region 14 A and the second retention region 14 B as retention regions for retaining the job information corresponding to two time belts in order to absorb a gap corresponding to one time belt.
- the reason for setting the management side retention unit 34 to include a region retaining the job information corresponding to three generations, that is, three time belts will be described.
- when the job information of the same time belt number T 1 of all calculation nodes 3 is retained, that is, when a snapshot of the time belt number T 1 is acquired, the job information of the relevant time belt number is retained in the first retention region 34 A.
- the second retention region 34 B and the third retention region 34 C are used until the job information of the next time belt number of all the calculation nodes 3 is retained.
- the job information transmitted to the management node 4 from each calculation node 3 can therefore also be off by one generation.
- the management side retention unit 34 uses the first retention region 34 A to retain the job information of a snapshot, and is provided with the second retention region 34 B and the third retention region 34 C as retention regions for retaining the job information corresponding to two time belts in order to absorb a gap corresponding to one time belt.
- FIGS. 6 to 8 are illustration diagrams illustrating an example of an operation transition for snapshot acquisition of the parallel computer 1.
- in this example, there are four calculation nodes 3 ( 3 A to 3 D), and the calculation node 3 A is set to be the representative node.
- each of the calculation nodes 3 A, 3 C and 3 D acquires job information according to the timing of the time belt number T 1 from the job start command, and retains the job information in the calculation side retention unit 14 .
- the job information of the time belt number T 1 is retained in the first retention region 14 A of the calculation nodes 3 A, 3 C and 3 D. Because reception of the job start command was delayed for some reason, the calculation node 3 B cannot acquire the job information of the time belt number T 1, so no information is retained in its first retention region 14 A.
- the calculation node 3 A is the representative node. Therefore, when retaining the job information of the time belt number T 1 in the calculation side retention unit 14 , the calculation node 3 A notifies the time belt number T 1 to the management node 4 (step S 11 ). When receiving the time belt number T 1 of the calculation node 3 A, the management node 4 requests the transmission of the job information of the time belt number T 1 from all the calculation nodes 3 (step S 12 ).
- each calculation node 3 determines whether the job information of the time belt number T 1 exists in the calculation side retention unit 14 .
- each of the calculation nodes 3 A, 3 C and 3 D transmits the job information of the time belt number T 1 to the management node 4 (step S 13 ).
- the calculation node 3 B transmits error information to the management node 4 (step S 13 A).
- when receiving the job information of the time belt number T 1 of the calculation nodes 3 A, 3 C and 3 D, the management node 4 retains the job information of the time belt number T 1 in the first retention region 34 A corresponding to the calculation nodes 3 A, 3 C and 3 D. Also, when receiving the error information of the calculation node 3 B, the management node 4 does not retain information in the first retention region 34 A corresponding to the calculation node 3 B.
- each of the calculation nodes 3 A, 3 C and 3 D acquires job information of the time belt number T 2 according to the timing of the time belt number T 2 , and retains the job information in the second retention region 14 B of the calculation side retention unit 14 . Also, the calculation node 3 B acquires job information of the time belt number T 1 according to the timing of the time belt number T 1 , and retains the job information in the first retention region 14 A of the calculation side retention unit 14 .
- the calculation node 3 A is the representative node. Therefore, when retaining the job information of the time belt number T 2 in the calculation side retention unit 14 , the calculation node 3 A notifies the time belt number T 2 to the management node 4 (step S 14 ). When receiving the time belt number T 2 , the management node 4 requests the transmission of the job information of the time belt number T 2 from all the calculation nodes 3 (step S 15 ).
- each calculation node 3 determines whether the job information of the time belt number T 2 exists in the calculation side retention unit 14 .
- each of the calculation nodes 3 A, 3 C and 3 D transmits the job information of the time belt number T 2 to the management node 4 (step S 16 ).
- the calculation node 3 B notifies the job information of the time belt number T 1 to the management node 4 (step S 16 A).
- When receiving the job information of the time belt number T 2 from the calculation nodes 3 A, 3 C and 3 D, the management node 4 retains it in the second retention region 34 B corresponding to the calculation nodes 3 A, 3 C and 3 D. When receiving the job information of the time belt number T 1 from the calculation node 3 B, the management node 4 retains it in the first retention region 34 A corresponding to the calculation node 3 B. As a result, the job information of the time belt number T 1 of all the calculation nodes 3 is retained in the first retention region 34 A. That is, a snapshot of the time belt number T 1 is acquired.
- When acquiring the snapshot of the time belt number T 1 , the management node 4 requests all the calculation nodes 3 to clear all the job information retained in their calculation side retention units 14 (step S 17 ). In addition, while keeping the job information of the time belt number T 1 in the first retention region 34 A, the management node 4 clears all the job information retained in the second retention region 34 B and the third retention region 34 C (step S 18 ).
- each calculation node 3 clears all the job information retained in the first retention region 14 A and the second retention region 14 B (step S 19 ).
- each of the calculation nodes 3 A, 3 C and 3 D acquires job information according to the timing of the time belt number T 4 , and retains the job information of the time belt number T 4 in the first retention region 14 A.
- the calculation node 3 B acquires job information according to the timing of the time belt number T 3 , and retains the job information in the first retention region 14 A.
- the calculation node 3 A is the representative node. Therefore, when retaining the job information of the time belt number T 4 in the calculation side retention unit 14 , the calculation node 3 A notifies the time belt number T 4 to the management node 4 (step S 20 ). When receiving the time belt number T 4 of the calculation node 3 A, the management node 4 requests the transmission of the job information of the time belt number T 4 from all the calculation nodes 3 (step S 21 ).
- When receiving the transmission request for the job information of the time belt number T 4 , each calculation node 3 determines whether the job information of the time belt number T 4 exists in the calculation side retention unit 14 . Because they hold the job information of the time belt number T 4 in the calculation side retention unit 14 , the calculation nodes 3 A, 3 C and 3 D each transmit the job information of the time belt number T 4 to the management node 4 (step S 22 ).
- The calculation node 3 B does not hold the job information of the time belt number T 4 in the calculation side retention unit 14 but does hold the one-generation-ago job information, that is, the job information of the time belt number T 3 . The calculation node 3 B therefore transmits the job information of the time belt number T 3 to the management node 4 (step S 22 A).
- When receiving the job information of the time belt number T 4 from the calculation nodes 3 A, 3 C and 3 D, the management node 4 retains it in the second retention region 34 B corresponding to the calculation nodes 3 A, 3 C and 3 D. When receiving the job information of the time belt number T 3 from the calculation node 3 B, the management node 4 retains it in the second retention region 34 B corresponding to the calculation node 3 B. Meanwhile, the job information of the time belt number T 1 of all the calculation nodes 3 remains stored as a snapshot in the first retention region 34 A.
- each of the calculation nodes 3 A, 3 C and 3 D acquires job information according to the timing of the time belt number T 5 , and retains the job information of the time belt number T 5 in the second retention region 14 B.
- the calculation node 3 B acquires job information according to the timing of the time belt number T 4 , and retains the job information of the time belt number T 4 in the second retention region 14 B.
- the calculation node 3 A is the representative node. Therefore, when retaining the job information of the time belt number T 5 in the calculation side retention unit 14 , the calculation node 3 A notifies the time belt number T 5 to the management node 4 (step S 23 ). When receiving the time belt number T 5 of the calculation node 3 A, the management node 4 requests the transmission of the job information of the time belt number T 5 from all the calculation nodes 3 (step S 24 ).
- each calculation node 3 determines whether the job information of the time belt number T 5 exists in the calculation side retention unit 14 .
- each of the calculation nodes 3 A, 3 C and 3 D transmits the job information of the time belt number T 5 to the management node 4 (step S 25 ).
- the calculation node 3 B notifies the job information of the time belt number T 4 to the management node 4 (step S 25 A).
- When receiving the job information of the time belt number T 5 from the calculation nodes 3 A, 3 C and 3 D, the management node 4 retains it in the third retention region 34 C corresponding to the calculation nodes 3 A, 3 C and 3 D. When receiving the job information of the time belt number T 4 from the calculation node 3 B, the management node 4 retains it in the third retention region 34 C corresponding to the calculation node 3 B.
- As a result, the management node 4 holds the job information of the time belt number T 4 for the calculation nodes 3 A, 3 C and 3 D in the second retention region 34 B and for the calculation node 3 B in the third retention region 34 C, so the job information of the time belt number T 4 is retained for all the calculation nodes 3 . That is, the snapshot of the time belt number T 4 is acquired.
- When acquiring the snapshot of the time belt number T 4 , the management node 4 requests all the calculation nodes 3 to clear all the job information retained in their calculation side retention units 14 (step S 26 ). The management node 4 overwrites the job information of the time belt number T 1 in the first retention region 34 A with the job information of the time belt number T 4 , and clears all the job information retained in the second retention region 34 B and the third retention region 34 C (step S 27 ).
- each calculation node 3 clears all the job information retained in the first retention region 14 A and the second retention region 14 B (step S 28 ).
- the latest snapshot can be retained in the first retention region 34 A of the management node 4 .
- the management node 4 can present the latest snapshot retained in the first retention region 34 A.
- FIG. 9 is a flow chart illustrating a processing operation of the calculation node 3 A for job acquisition processing of the representative node side.
- the timing detection unit 21 in the job information processing control unit 12 of the calculation node 3 A determines whether the job information acquisition timing is detected (step S 51 ).
- the acquisition processing unit 22 in the job information processing control unit 12 executes job information acquisition processing (step S 52 A) and determines whether its own job information could be acquired (step S 52 ).
- the calculation side retention control unit 23 in the job information processing control unit 12 determines whether an empty space exists in the calculation side retention unit 14 (step S 53 ).
- the calculation side retention control unit 23 retains the job information of the time belt number in the calculation side retention unit 14 (step S 54 ).
- the information transmission unit 24 in the job information processing control unit 12 notifies the time belt number as a time belt number of a transmission request target to the management node 4 (step S 55 ).
- the calculation side retention control unit 23 determines whether the transmission request for the job information designating the time belt number of the transmission request target is received from the management node 4 (step S 56 ).
- the calculation side retention control unit 23 transmits the job information about the time belt number of the transmission request retained in the calculation side retention unit 14 to the management node 4 (step S 57 ).
- the calculation side retention control unit 23 determines whether the clear request is received from the management node 4 (step S 58 ). When the clear request is received from the management node 4 (Yes in step S 58 ), the calculation side retention control unit 23 clears all the job information retained in the calculation side retention unit 14 (step S 59 ), and proceeds to step S 51 to determine whether the job information acquisition timing is detected.
- When the clear request is not received (No in step S 58 ), the calculation side retention control unit 23 determines whether the job information acquisition timing is detected (step S 60 ).
- When the job information acquisition timing is not detected (No in step S 60 ), the calculation side retention control unit 23 proceeds to step S 58 to determine whether the clear request is received.
- When the job information acquisition timing is detected (Yes in step S 60 ), the calculation side retention control unit 23 proceeds to step S 52 A to execute job information acquisition processing.
- When the job information acquisition timing is not detected (No in step S 51 ), the timing detection unit 21 returns to step S 51 to continue to monitor the job information acquisition timing. Also, when the job information could not be acquired (No in step S 52 ), the acquisition processing unit 22 proceeds to step S 51 to detect the job information acquisition timing.
- When an empty space does not exist in the calculation side retention unit 14 (No in step S 53 ), the calculation side retention control unit 23 does not retain the job information of the time belt number in the calculation side retention unit 14 (step S 61 ), and proceeds to step S 51 to detect the job information acquisition timing.
- When the transmission request is not received (No in step S 56 ), the calculation side retention control unit 23 returns to step S 56 to continue to monitor for the job information transmission request. Because step S 56 is executed by the representative node, which has already notified the management node 4 of the time belt number that prompts the transmission request, the transmission request is, in the normal case, always received from the management node 4 .
- the representative node determines whether an empty space exists in the calculation side retention unit 14 .
- the job information is retained in the calculation side retention unit 14 in association with the time belt number identifying the acquisition timing.
- the representative node can retain the job information corresponding to two generations in association with the time belt number.
- the representative node can report the time belt number of the job information of the transmission request target to the management node 4 .
- the representative node can transmit the job information of the transmission request target to the management node 4 .
- In the job acquisition processing of the representative node side, when the clear request is received from the management node 4 , all the job information retained in the calculation side retention unit 14 is cleared. As a result, the representative node can retain new job information in the calculation side retention unit 14 so that the latest snapshot is acquired by the management node 4 .
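- The retain, notify and clear behavior of the representative node described for FIG. 9 can be sketched roughly as follows. This is a minimal illustration only: the class `CalcSideRetention`, the function `representative_tick` and the `notify` callback are hypothetical names that do not appear in the specification, and the two-region capacity is modeled simply as a dictionary with at most two entries.

```python
class CalcSideRetention:
    """Hypothetical model of the calculation side retention unit 14 (two regions)."""

    def __init__(self, generations=2):
        self.generations = generations
        self.regions = {}  # time belt number -> job information

    def retain(self, time_belt, job_info):
        """Steps S53, S54 and S61: retain only when an empty region exists."""
        if len(self.regions) >= self.generations:
            return False  # no empty space, so the job information is not retained
        self.regions[time_belt] = job_info
        return True

    def clear(self):
        """Step S59: clear all retained job information on a clear request."""
        self.regions.clear()


def representative_tick(unit, time_belt, job_info, notify):
    """Retain job information and, on success, notify the time belt number (step S55)."""
    if unit.retain(time_belt, job_info):
        notify(time_belt)


unit = CalcSideRetention()
notified = []  # stands in for notifications sent to the management node 4
for tb in (1, 2, 3):
    representative_tick(unit, tb, f"job@T{tb}", notified.append)
unit.clear()  # a clear request arrives from the management node
representative_tick(unit, 4, "job@T4", notified.append)
print(notified, unit.regions)
```

- In this sketch the third acquisition is dropped because both regions are occupied, and retention resumes only after the clear request, which mirrors the No branch of step S 53 .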
- FIG. 10 is a flow chart illustrating a processing operation of the calculation node 3 for job acquisition processing of the calculation node side.
- the timing detection unit 21 in the job information processing control unit 12 of the calculation node 3 determines whether the job information acquisition timing is detected (step S 71 ).
- the acquisition processing unit 22 executes job information acquisition processing (step S 72 ) and determines whether its own job information could be acquired (step S 73 ).
- the calculation side retention control unit 23 determines whether an empty space exists in the calculation side retention unit 14 (step S 74 ). When an empty space exists in the calculation side retention unit 14 (Yes in step S 74 ), the calculation side retention control unit 23 retains the job information of the time belt number in the calculation side retention unit 14 (step S 75 ).
- the calculation side retention control unit 23 determines whether the transmission request for the job information designating the time belt number of the transmission request target is received from the management node 4 (step S 76 ). When the transmission request for the job information is received (Yes in step S 76 ), the calculation side retention control unit 23 determines whether the job information of the time belt number of the transmission request exists in the calculation side retention unit 14 (step S 77 ).
- the information transmission unit 24 transmits the job information of the time belt number of the transmission request to the management node 4 (step S 78 ).
- the calculation side retention control unit 23 determines whether the clear request is received from the management node 4 (step S 79 ). When the clear request is received (Yes in step S 79 ), the calculation side retention control unit 23 clears all the job information retained in the calculation side retention unit 14 (step S 80 ), and proceeds to step S 71 to determine whether the job information acquisition timing is detected.
- When the clear request is not received (No in step S 79 ), the calculation side retention control unit 23 determines whether the job information acquisition timing is detected (step S 81 ).
- When the job information acquisition timing is not detected (No in step S 81 ), the calculation side retention control unit 23 proceeds to step S 79 to determine whether the clear request is received.
- When the job information acquisition timing is detected (Yes in step S 81 ), the calculation side retention control unit 23 proceeds to step S 72 to execute job information acquisition processing.
- When the job information acquisition timing is not detected (No in step S 71 ), the timing detection unit 21 returns to step S 71 to continue to monitor the job information acquisition timing. Also, when the job information could not be acquired (No in step S 73 ), the acquisition processing unit 22 proceeds to step S 71 to detect the job information acquisition timing.
- When an empty space does not exist in the calculation side retention unit 14 (No in step S 74 ), the calculation side retention control unit 23 does not retain the job information of the time belt number in the calculation side retention unit 14 (step S 82 ), and proceeds to step S 71 to detect the job information acquisition timing.
- When the transmission request for the job information is not received (No in step S 76 ), the calculation side retention control unit 23 proceeds to step S 79 to determine whether the clear request is received.
- When the job information of the time belt number of the transmission request does not exist in the calculation side retention unit 14 (No in step S 77 ), the calculation side retention control unit 23 determines whether job information one generation before the time belt number exists in the calculation side retention unit 14 (step S 83 ). For example, when the time belt number of the transmission request is T 3 , the one-generation-ago job information corresponds to the job information of the time belt number T 2 . When the job information one generation before the time belt number exists in the calculation side retention unit 14 (Yes in step S 83 ), the calculation side retention control unit 23 transmits the one-generation-ago job information to the management node 4 (step S 84 ), and proceeds to step S 79 to determine whether the clear request is received.
- When the one-generation-ago job information does not exist in the calculation side retention unit 14 (No in step S 83 ), the calculation side retention control unit 23 transmits error information to the management node 4 (step S 85 ), and proceeds to step S 71 to determine whether the job information acquisition timing is detected.
- the calculation node 3 determines whether an empty space exists in the calculation side retention unit 14 .
- the job information is retained in the calculation side retention unit 14 in association with the time belt number identifying the acquisition timing.
- the calculation node 3 can retain the job information corresponding to two generations in association with the time belt number.
- the calculation node 3 can transmit the job information of the designated time belt number of the transmission request to the management node 4 .
- the calculation node 3 can also transmit the one-generation-ago job information to the management node 4 in order to absorb a gap between the calculation nodes 3 caused by, for example, the transmission delay of the clear request.
- the calculation node 3 can retain new job information in the calculation side retention unit 14 so that the latest snapshot is acquired by the management node 4 .
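- The reply selection of steps S 77 , S 78 and S 83 to S 85 can be illustrated with a short sketch. The function name `answer_transmission_request` and the tuple-shaped reply are assumptions made for illustration; the sketch also assumes that time belt numbers are consecutive integers, so that "one generation before" T 3 is T 2 as in the example in the text.

```python
def answer_transmission_request(retained, requested):
    """Return the reply a calculation node sends for a requested time belt number.

    retained maps time belt number -> job information (at most two generations).
    The reply is a hypothetical (kind, time_belt, job_info) tuple.
    """
    if requested in retained:  # Yes in step S77 -> step S78
        return ("job", requested, retained[requested])
    previous = requested - 1  # one generation before the requested number (step S83)
    if previous in retained:  # Yes in step S83 -> step S84
        return ("job", previous, retained[previous])
    return ("error", requested, None)  # No in step S83 -> step S85


# Node 3B lags one generation behind and is asked for T4 while holding only T3.
print(answer_transmission_request({3: "job@T3"}, 4))
# A node holding nothing relevant answers with error information.
print(answer_transmission_request({}, 1))
```

- Falling back to the preceding generation is what lets the management node eventually assemble a complete set for a single time belt number even when one node lags behind the others.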
- FIG. 11 is a flow chart illustrating a processing operation of the management node 4 for job acquisition processing of the management node side.
- the snapshot processing control unit 32 of the management node 4 determines whether the time belt number of the transmission request target is received from the representative calculation node 3 A (step S 91 ).
- the transmission request unit 41 of the snapshot processing control unit 32 requests the transmission of job information about the time belt number of the transmission request target from all the calculation nodes 3 (step S 92 ).
- the received information identification unit 42 in the snapshot processing control unit 32 determines whether the information received from each calculation node 3 is error information (step S 93 ). When the information received from each calculation node 3 is not error information (No in step S 93 ), the received information identification unit 42 determines whether the received information is job information (step S 94 ). When the received information is job information (Yes in step S 94 ), the management side retention control unit 45 in the snapshot processing control unit 32 retains the job information in the management side retention unit 34 corresponding to the relevant calculation node 3 (step S 95 ). The received information identification unit 42 determines whether the reception of information from all the calculation nodes 3 receiving the transmission request is completed (step S 96 ).
- When the reception of information from all the calculation nodes 3 is not completed (No in step S 96 ), the received information identification unit 42 determines that there is received information that is not yet identified, and returns to step S 93 to determine whether the received information is error information.
- When the reception of information from all the calculation nodes 3 is completed (Yes in step S 96 ), the retention region monitoring unit 43 in the snapshot processing control unit 32 determines, based on the information retained in the management side retention unit 34 , whether there is a new time belt number for which the job information of all the calculation nodes 3 is retained (step S 97 ).
- When there is such a new time belt number (Yes in step S 97 ), the retention region monitoring unit 43 determines that the snapshot of the same time belt number is newly acquired.
- When the snapshot of the same time belt number is determined to be newly acquired, the transmission request unit 41 requests all the calculation nodes 3 to clear the job information retained in their calculation side retention units 14 (step S 98 ).
- the management side retention control unit 45 updates/registers the job information of the same time belt number of all the calculation nodes 3 , which could be newly retained, as a new snapshot in the first retention region 34 A (step S 99 ). In addition, the management side retention control unit 45 clears all the job information of the respective calculation nodes 3 retained in the second retention region 34 B and the third retention region 34 C (step S 100 ), and ends the processing operation of FIG. 11 .
- the snapshot processing control unit 32 ends the processing operation of FIG. 11 . Also, when the received information is error information (Yes in step S 93 ), the received information identification unit 42 identifies the received information from the calculation node 3 , and proceeds to step S 96 to determine whether the identification of the received information from all the calculation nodes 3 is completed.
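- When there is no new time belt number for which the job information of all the calculation nodes 3 is retained (No in step S 97 ), the retention region monitoring unit 43 ends the processing operation of FIG. 11 .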
- When receiving the time belt number of the transmission request target from the representative node, the management node 4 requests each calculation node 3 to transmit the job information of that time belt number.
- the management node 4 can request the transmission of the job information about the designated time belt number from each calculation node 3 .
- the management node 4 determines whether the received information from each calculation node 3 with respect to the transmission request is job information.
- When the received information is job information, the information is determined to be job information of the designated time belt number or of the one-generation-ago time belt number, and the job information is retained in association with the relevant calculation node 3 in the management side retention unit 34 .
- the management node 4 can retain the job information of each calculation node 3 corresponding to three generations in the management side retention unit 34 .
- When job information of the same time belt number is retained for all the calculation nodes 3 , the management node 4 determines that the snapshot of the same time belt number is newly acquired, and requests all the calculation nodes 3 to clear the job information retained in their calculation side retention units 14 .
- the management node 4 updates/registers the job information of the same time belt number of all the calculation nodes 3 , which could be newly retained, as a new snapshot in the first retention region 34 A, and clears the job information of each calculation node 3 retained in the second retention region 34 B and the third retention region 34 C.
- the management node 4 can present the latest snapshot to the user.
- the management node 4 can use the second retention region 34 B and the third retention region 34 C as a temporary job information retention region.
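- The management-side handling described for FIG. 11 (steps S 93 to S 100 ) can be sketched as below. This is an illustrative model under stated assumptions, not the claimed implementation: `ManagementSideRetention`, `store_replies` and `try_snapshot` are hypothetical names, replies are assumed to be (kind, time_belt, job_info) tuples, and the temporary retention regions are abstracted into one dictionary keyed by time belt number rather than modeled as the physical regions 34 A to 34 C.

```python
class ManagementSideRetention:
    """Hypothetical model of the management side retention unit 34."""

    def __init__(self, nodes):
        self.nodes = set(nodes)
        self.snapshot = None  # latest snapshot: (time_belt, {node: job_info})
        self.pending = {}     # temporary retention: time belt -> {node: job_info}

    def store_replies(self, replies):
        """Steps S93 to S96: skip error information, retain job information per node."""
        for node, (kind, time_belt, info) in replies.items():
            if kind == "job":
                self.pending.setdefault(time_belt, {})[node] = info

    def try_snapshot(self):
        """Steps S97 to S100: register a complete time belt as the snapshot.

        Returns True when a clear request should be sent to all calculation nodes.
        """
        complete = [tb for tb, held in self.pending.items() if set(held) == self.nodes]
        if not complete:
            return False
        newest = max(complete)
        self.snapshot = (newest, self.pending[newest])
        self.pending = {}  # clear the temporarily retained job information
        return True


mgmt = ManagementSideRetention(["3A", "3B", "3C", "3D"])
# Request for T1: node 3B has nothing yet and answers with error information.
mgmt.store_replies({"3A": ("job", 1, "a1"), "3B": ("error", 1, None),
                    "3C": ("job", 1, "c1"), "3D": ("job", 1, "d1")})
first = mgmt.try_snapshot()
# Request for T2: node 3B answers with its one-generation-ago job information (T1).
mgmt.store_replies({"3A": ("job", 2, "a2"), "3B": ("job", 1, "b1"),
                    "3C": ("job", 2, "c2"), "3D": ("job", 2, "d2")})
second = mgmt.try_snapshot()
print(first, second, mgmt.snapshot[0])
```

- Replaying the T 1 and T 2 exchange from the earlier walkthrough, the first round yields no snapshot and the second round completes the T 1 set, at which point the sketch signals that a clear request is due.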
- the calculation node 3 acquires the job information according to the period timing common to the calculation nodes, and retains the acquired job information in the calculation side retention unit 14 in association with the time belt number identifying the period timing at which the job information is acquired.
- When receiving the job information from each calculation node 3 according to the transmission request, the management node 4 retains the received job information in the management side retention unit 34 .
- When detecting job information of the same time belt number for all the calculation nodes 3 in the management side retention unit 34 , the management node 4 retains the job information of the same time belt number as a snapshot.
- When the job information of the same time belt number is retained as a snapshot, the job information other than the job information of the same time belt number retained in the management side retention unit 34 is cleared, and all of the job information retained in the calculation side retention unit 14 is cleared.
- the calculation side retention unit 14 is provided with a retention region capable of retaining job information corresponding to two generations
- the management side retention unit 34 is provided with a retention region capable of retaining job information corresponding to three generations.
- one of the plurality of calculation nodes 3 is used as a representative node, and the management node 4 starts the transmission request for the job information associated with the time belt information when the time belt number of the transmission request target is notified from the representative node to the management node 4 .
- the management node 4 starts the transmission request for the job information associated with the time belt information when the time belt number of the transmission request target is notified from the representative node to the management node 4 .
- Although the calculation nodes 3 A to 3 D are provided in the second embodiment, the number of calculation nodes 3 is not limited thereto. Also, although one of the plurality of calculation nodes 3 is used as the representative node in the second embodiment, the number of calculation nodes used as the representative node is not limited thereto, and each calculation node 3 may be used as the representative node.
- the calculation side retention unit 14 is provided with a retention region retaining job information corresponding to two generations, and the management side retention unit 34 is provided with a retention region retaining job information corresponding to three generations.
- the calculation side retention unit 14 may be provided with a retention region retaining job information corresponding to three generations, and the management side retention unit 34 may be provided with a retention region retaining job information corresponding to four generations.
- the time in each calculation node 3 taken until the execution of job information clearing after arrival of the clear request from the management node 4 at each calculation node 3 is measured, and the maximum gap time between the calculation nodes 3 is calculated based on the measurement result.
- the maximum gap time is assumed to be sufficiently shorter than the time belt interval time, and the calculation side retention unit 14 is provided with a retention region retaining job information corresponding to two generations.
- the calculation side retention unit 14 is provided with a retention region retaining job information corresponding to (n+1) generations.
- the management side retention unit 34 is provided with a retention region retaining job information corresponding to (n+3) generations.
- the management side retention unit 34 is provided with a retention region retaining job information corresponding to four generations.
- the management side retention unit 34 is provided with a retention region retaining job information corresponding to five generations.
- FIG. 12 is an illustration diagram illustrating a three-stage parallel computer.
- a parallel computer 1 B illustrated in FIG. 12 includes 12 calculation nodes 3 A to 3 L, three sub management nodes 4 B to 4 D, and one management node 4 A.
- the sub management node 4 B relays and manages four calculation nodes 3 A to 3 D.
- the sub management node 4 C relays and manages four calculation nodes 3 E to 3 H.
- the sub management node 4 D relays and manages four calculation nodes 3 I to 3 L.
- the management node 4 A manages three sub management nodes 4 B to 4 D.
- the calculation side retention unit 14 of each of the calculation nodes 3 A to 3 L includes a first retention region 14 A and a second retention region 14 B.
- Each of the sub management nodes 4 B to 4 D includes a first retention region 34 D, a second retention region 34 E and a third retention region 34 F that retain job information of four calculation nodes corresponding to three generations.
- the management side retention unit 34 of the management node 4 A includes a first retention region 34 A, a second retention region 34 B and a third retention region 34 C that retain job information of the same time belt number of 12 calculation nodes 3 A to 3 L corresponding to three generations.
- Each of the calculation nodes 3 A to 3 L acquires job information of the common period timing from a job start command, and retains the job information in the calculation side retention unit 14 .
- the sub management nodes 4 B, 4 C and 4 D collect and summarize the job information from the calculation nodes they manage ( 3 A to 3 D, 3 E to 3 H and 3 I to 3 L, respectively).
- Each of the sub management nodes 4 B, 4 C and 4 D collects the job information and retains the collected job information.
- the sub management nodes 4 B, 4 C and 4 D each summarize the job information of the calculation nodes they manage ( 3 A to 3 D, 3 E to 3 H and 3 I to 3 L, respectively) and transmit it to the management node 4 A.
- the management node 4 A does not communicate with the calculation nodes 3 A to 3 L individually, but collects the job information of the calculation nodes 3 A to 3 L through communication with the sub management nodes 4 B, 4 C and 4 D. As a result, the number of times of communication and the communication load on the management node 4 A can be reduced.
- FIG. 12 illustrates a three-layer structure of the management node 4 A, the sub management nodes 4 B to 4 D and the calculation nodes 3 A to 3 L
- the present invention is not limited to the three-layer structure but may include a hierarchical structure of four or more layers.
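- The reduction in communication obtained with the three-layer structure of FIG. 12 can be illustrated with a small sketch. The `summarize` function, the message layout and the placeholder job information strings are hypothetical; only the node names and the grouping of four calculation nodes per sub management node come from the description.

```python
def summarize(sub_node, calc_nodes, job_info):
    """A sub management node bundles the job information of the nodes it manages."""
    return {"from": sub_node, "jobs": {n: job_info[n] for n in calc_nodes}}


calc_names = [f"3{c}" for c in "ABCDEFGHIJKL"]  # calculation nodes 3A to 3L
topology = {
    "4B": calc_names[0:4],   # 3A to 3D
    "4C": calc_names[4:8],   # 3E to 3H
    "4D": calc_names[8:12],  # 3I to 3L
}
job_info = {name: f"job of {name}" for name in calc_names}  # placeholder payloads

# The management node 4A receives one summarized message per sub management node.
messages_to_4a = [summarize(sub, managed, job_info) for sub, managed in topology.items()]
collected = {n: info for msg in messages_to_4a for n, info in msg["jobs"].items()}
print(len(messages_to_4a), len(collected))
```

- Under these assumptions the management node 4 A handles three messages per collection round instead of twelve while still obtaining the job information of all twelve calculation nodes.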
- the respective elements of the respective units illustrated do not necessarily require a physical configuration as illustrated. That is, the details of distribution/integration of the respective units are not limited to the illustrated embodiments, and all or some of the respective units may be functionally or physically distributed/integrated in arbitrary units according to various loads or use conditions.
- all or some of various processing functions performed by the respective devices may be executed on a CPU (Central Processing Unit) (or microcomputer such as MPU (Micro Processing Unit) or MCU (Micro Controller Unit)). Also, needless to say, all or some of the various processing functions may be executed on a program interpreted and executed by a CPU (or microcomputer such as MPU or MCU), or on hardware based on wired logic.
- FIG. 13 is an illustration diagram illustrating a computer executing a job information acquisition program of a parallel computer.
- a computer 200 illustrated in FIG. 13 includes a HDD (Hard Disk Drive) 210 , a RAM (Random Access Memory) 220 , a ROM (Read Only Memory) 230 , and a CPU 240 that are connected through a bus 250 .
- A job information acquisition program of the calculation node side performing the same function as the above embodiment is pre-stored in the ROM 230.
- The job information acquisition program of the calculation node side includes an acquisition program 231, a retention program 232, an information transmission program 233, and a clear program 234.
- The programs 231 to 234 may be appropriately integrated or distributed.
- The CPU 240 reads the programs 231 to 234 from the ROM 230 and executes them. As illustrated in FIG. 13, the respective programs 231 to 234 function as an acquisition process 241, a retention process 242, an information transmission process 243, and a clear process 244.
- A computer 200A includes an HDD 210A, a RAM 220A, a ROM 230A, and a CPU 240A that are connected through a bus 250A.
- A job information acquisition program of the management node side performing the same function as the above embodiment is pre-stored in the ROM 230A.
- The job information acquisition program of the management node side includes a retention program 231A, a snapshot retention program 232A, a clear program 233A, and a clear request program 234A.
- The programs 231A to 234A may be appropriately integrated or distributed.
- The CPU 240A reads the programs 231A to 234A from the ROM 230A and executes them. As illustrated in FIG. 13, the respective programs 231A to 234A function as a retention process 241A, a snapshot retention process 242A, a clear process 243A, and a clear request process 244A.
- The CPU 240 acquires job information about a calculation job handled by the calculation node.
- The CPU 240 retains the job information in the retention unit of the RAM 220, which enables the retention of job information corresponding to a predetermined number of periods, in association with the identification number identifying the period timing at which the job information is acquired.
- The CPU 240 transmits the job information about the designated identification number to the management node when the job information about the designated identification number exists in the retention unit. Also, when the job information about the designated identification number does not exist in the retention unit and the job information of the identification number just before the designated identification number exists therein, the CPU 240 transmits the job information of that preceding identification number to the management node.
- When receiving the job information from each calculation node according to the transmission request, the CPU 240A retains the received job information in the retention unit of the RAM 220A, which enables the retention of job information corresponding to a predetermined number of periods with respect to each calculation node. In addition, when detecting the job information of the same identification number about the calculation nodes in the retention unit, the CPU 240A retains the received job information of the same identification number as a snapshot. In addition, when the job information of the same identification number is retained as a snapshot, the CPU 240A clears job information other than the job information of the same identification number retained in the retention unit of the RAM 220A, and transmits a clear request to each calculation node.
- When receiving the clear request from the management node, the CPU 240 clears all the job information retained in the retention unit of the RAM 220. As a result, since the job information is managed in association with the identification number of the period timing at which the job information is acquired, an accurate snapshot of the job information between the calculation nodes can be secured. Also, a situation in which the management node cannot collect the job information of each calculation node because the job information clear timing differs between the calculation nodes due to, for example, a transmission delay of the clear request from the management node can be avoided, and the acquisition of a snapshot can be secured.
- The job information of the same timing about a job that is being executed in each calculation node of the parallel computer can be acquired.
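The calculation node side behavior described above (retain per period, answer a transmission request with the designated or the immediately preceding identification number, clear on request) can be sketched as follows. This is a minimal illustration and not the claimed implementation: the class name `CalcNodeRetention`, its method names, the use of consecutive integers as identification numbers, and the default capacity of two periods are all assumptions made for the example. Refusing to retain when no region is free follows the second embodiment.

```python
class CalcNodeRetention:
    """Hypothetical model of a calculation node's retention unit."""

    def __init__(self, capacity=2):
        # Number of periods whose job information can be retained (assumed: 2).
        self.capacity = capacity
        # Maps identification number (assumed: consecutive integers) to job information.
        self.slots = {}

    def retain(self, ident, job_info):
        """Retain job information for one period; refuse when no region is free."""
        if len(self.slots) >= self.capacity:
            return False
        self.slots[ident] = job_info
        return True

    def respond(self, designated):
        """Answer a transmission request for a designated identification number."""
        if designated in self.slots:
            return designated, self.slots[designated]
        if designated - 1 in self.slots:
            # Fall back to the identification number just before the designated one.
            return designated - 1, self.slots[designated - 1]
        # Nothing to send; the second embodiment transmits error information here.
        return None

    def clear(self):
        """Clear all retained job information on a clear request."""
        self.slots.clear()
```

Returning the identification number together with the job information lets the receiver tell whether it got the designated period or the preceding one.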
Abstract
Description
- This application is a continuation of International Application No. PCT/JP2010/064639, filed on Aug. 27, 2010 and designating the U.S., the entire contents of which are incorporated herein by reference.
- The present invention relates to a parallel computer, and a job information acquisition method for parallel computer.
- A parallel computer can process large-scale calculation, for example, by connecting a plurality of computers (hereinafter, referred to as calculation nodes) through a network, distributing a calculation job among separate calculation nodes, and executing the calculation job in parallel. Accordingly, the demand for the parallel computer is increasing rapidly.
- In general, a parallel computer includes a node managing a calculation node group (hereinafter, simply referred to as a management node) including a plurality of calculation nodes. The parallel computer may require technology that enables a management node side to recognize information such as usage of respective resources such as a CPU, a memory and a file used in each calculation node by a currently-executed calculation job, and the number of commands executed by the calculation job (hereinafter, simply referred to as job information).
- Thus, each calculation node executing a calculation job may need to acquire job information of the same time, that is, a snapshot.
- FIG. 14 is an illustration diagram illustrating a snapshot acquisition method for a parallel computer. In a parallel computer 110 illustrated in FIG. 14, a management node 112 managing a plurality of calculation nodes 111 manages the current time, and requests each calculation node 111 to acquire job information when the current time arrives at a predetermined time (step S211). In response to the job information acquisition request, each calculation node 111 acquires its own job information (step S212). When acquiring the job information, each calculation node 111 transmits the acquired job information to the management node 112 (step S213). As a result, the management node 112 of the parallel computer 110 illustrated in FIG. 14 can acquire job information of the same time (timing) of each calculation node 111, that is, a snapshot.
- FIG. 15 is an illustration diagram illustrating another snapshot acquisition method for a parallel computer 120. In the parallel computer 120 illustrated in FIG. 15, each calculation node 121 manages the current time. When the current time arrives at a predetermined time, each calculation node 121 acquires its own job information (step S221). When acquiring its own job information, each calculation node 121 transmits the acquired job information to a management node 122 (step S222). As a result, the management node 122 of the parallel computer 120 illustrated in FIG. 15 can acquire job information of the same time (timing) of each calculation node 121, that is, a snapshot.
- Patent Document 1: Japanese Laid-open Patent Publication No. 8-44680
- Patent Document 2: Japanese Laid-open Patent Publication No. 63-136176
- In the parallel computer 110 illustrated in FIG. 14, when a gap occurs in the timing until the arrival of the job information acquisition request from the management node 112 at the respective calculation nodes 111, the job information acquisition timing is not synchronized between the calculation nodes 111, so that an accurate snapshot is difficult to acquire.
- Also, in the parallel computer 120 illustrated in FIG. 15, since the job information is asynchronously transmitted from the respective calculation nodes 121, there may be a case where the job information of the same time (same timing) transmitted from the respective calculation nodes 121 is not received at the management node 122 until the next job information acquisition time. As a result, job information of different times may be received in a mixed manner. That is, in the parallel computer 120, since the job information of the same timing of the respective calculation nodes 121 is not known, an accurate snapshot is difficult to acquire.
- According to an aspect of an embodiment of the invention, a parallel computer includes a plurality of calculation nodes that execute a calculation job distributively in parallel, and a management node that manages the plurality of calculation nodes. The calculation node includes an acquisition unit that acquires job information about a calculation job handled by the calculation node according to a period timing common to the calculation nodes, a retention control unit that retains the job information in a retention unit of the calculation node in association with an identification number identifying the period timing at which the job information is acquired by the acquisition unit, and clears all the job information retained in the retention unit when a clear request is received from the management node, and an information transmission unit that, when a transmission request for the job information about a designated identification number is received, transmits the job information about the designated identification number to the management node in a case where the job information about the designated identification number exists in the retention unit, and transmits job information about an identification number just before the designated identification number to the management node in a case where the job information about the designated identification number does not exist in the retention unit and the job information about the identification number just before the designated identification number exists in the retention unit. The management node includes a retention control unit that retains the job information in a retention unit of the management node when the job information is received from each of the calculation nodes according to the transmission request, retains, as a snapshot, job information of the same identification number in a case where the job information of the same identification number about the calculation nodes is detected in the retention unit, and clears job information other than the job information of the same identification number retained in the retention unit of the management node, and a clear request unit that transmits the clear request to each of the calculation nodes when the job information of the same identification number is retained as a snapshot. The retention unit of the calculation node includes a retention region enabling retention of job information corresponding to a plurality of periods, and the retention unit of the management node includes a retention region enabling retention of the job information corresponding to the plurality of periods with respect to each of the calculation nodes.
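The management node side of this aspect (retain each reply, detect when every calculation node has reported the same identification number, keep that set as the snapshot, and drop the rest) can be sketched as follows. This is a rough illustration under stated assumptions: the class name `ManagementRetention`, its fields, and the return convention are hypothetical, and the per-node limit on the number of retained periods is not modelled.

```python
class ManagementRetention:
    """Hypothetical model of the management node's retention and snapshot logic."""

    def __init__(self, node_ids):
        self.node_ids = set(node_ids)
        # Maps identification number to {node id: job information}.
        self.received = {}
        # Latest snapshot as (identification number, {node id: job information}).
        self.snapshot = None

    def receive(self, node_id, ident, job_info):
        """Retain one reply; return True when the reply completes a snapshot."""
        self.received.setdefault(ident, {})[node_id] = job_info
        if set(self.received[ident]) != self.node_ids:
            return False
        # Every node has reported this identification number: keep it as the snapshot.
        self.snapshot = (ident, dict(self.received[ident]))
        # Clear job information of all other identification numbers.
        self.received.clear()
        # The caller is expected to send the clear request to every calculation node.
        return True
```

A `True` return value marks the point at which the clear request unit would transmit the clear request to each calculation node.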
- The object and advantages of the embodiment will be realized and attained by means of the elements and combinations particularly pointed out in the claims.
- It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory and are not restrictive of the embodiment, as claimed.
- FIG. 1 is a block diagram illustrating a parallel computer according to a first embodiment;
- FIG. 2 is a block diagram illustrating a parallel computer according to a second embodiment;
- FIG. 3 is an illustration diagram of a parallel computer;
- FIG. 4 is an illustration diagram of a job information acquisition period (time belt);
- FIG. 5 is an illustration diagram illustrating the reason for setting a calculation side retention unit corresponding to two generations;
- FIG. 6 is an illustration diagram illustrating an example of an operation transition for snapshot acquisition of a parallel computer;
- FIG. 7 is an illustration diagram illustrating an example of an operation transition for snapshot acquisition of a parallel computer;
- FIG. 8 is an illustration diagram illustrating an example of an operation transition for snapshot acquisition of a parallel computer;
- FIG. 9 is a flow chart illustrating an internal processing operation of a representative node for job acquisition processing of a representative node side;
- FIG. 10 is a flow chart illustrating an internal processing operation of a calculation node for job acquisition processing of a calculation node side;
- FIG. 11 is a flow chart illustrating an internal processing operation of a management node for job acquisition processing of a management node side;
- FIG. 12 is an illustration diagram illustrating a parallel computer according to another embodiment;
- FIG. 13 is an illustration diagram illustrating a computer executing a job information acquisition program of a parallel computer;
- FIG. 14 is an illustration diagram illustrating a snapshot acquisition method for a parallel computer; and
- FIG. 15 is an illustration diagram illustrating another snapshot acquisition method for a parallel computer.
- Preferred embodiments of the present invention will be explained with reference to accompanying drawings. In addition, the embodiments do not limit the technique disclosed herein.
- FIG. 1 is a block diagram illustrating a parallel computer according to a first embodiment. A parallel computer 1A illustrated in FIG. 1 includes a plurality of calculation nodes 50 executing a calculation job distributively in parallel, and a management node 60 managing the plurality of calculation nodes 50. The calculation node 50 includes an acquisition unit 51, a retention unit 52, a retention control unit 53, and an information transmission unit 54. According to the period timing common to the calculation nodes, the acquisition unit 51 acquires job information about a calculation job handled by the relevant calculation node 50.
- The retention control unit 53 retains the job information in the retention unit 52 of the calculation node 50 in association with an identification number identifying the period timing at which the acquisition unit 51 acquires the job information. Also, when receiving a clear request from the management node 60, the retention control unit 53 clears all of the job information retained in the retention unit 52. The retention unit 52 includes a retention region retaining its own job information corresponding to a predetermined number of periods, for example, two periods (generations).
- Also, when a job information transmission request for a designated identification number is received from the management node 60 and job information of the designated identification number exists in the retention unit 52, the information transmission unit 54 transmits the job information of the designated identification number to the management node 60. Also, when the job information of the designated identification number does not exist in the retention unit 52 and job information of an identification number just before the designated identification number exists therein, the information transmission unit 54 transmits the job information of that preceding identification number to the management node 60. The identification number just before the designated identification number corresponds to, for example, a one-generation-ago identification number.
- The management node 60 includes a retention unit 61, a retention control unit 62, and a clear request unit 63. The retention unit 61 includes a retention region capable of retaining job information corresponding to a predetermined number of periods for each calculation node 50. When receiving the job information from each calculation node 50 according to the transmission request, the retention control unit 62 retains the received job information in the retention unit 61 of the management node 60. Also, when detecting job information of the same identification number about all the calculation nodes 50 in the retention unit 61, the retention control unit 62 retains the job information of the same identification number as a snapshot. When the job information of the same identification number is retained as a snapshot, the retention control unit 62 clears job information other than the job information of the same identification number retained in the retention unit 61 of the management node 60. When the job information of the same identification number is retained as a snapshot, the clear request unit 63 transmits a clear request to each calculation node 50.
- In the first embodiment, the calculation node 50 acquires the job information according to the period timing common to the calculation nodes, and retains the acquired job information in the retention unit 52 of the calculation node 50 in association with the identification number identifying the period timing at which the job information is acquired. In addition, in the first embodiment, when receiving the job information from each calculation node 50 according to the transmission request, the management node 60 retains the received job information in the retention unit 61 of the management node 60. In the first embodiment, when detecting job information of the same identification number about the calculation nodes in the retention unit 61, the management node 60 retains the job information of the same identification number as a snapshot. In addition, in the first embodiment, when the job information of the same identification number is retained as a snapshot, the job information other than the job information of the same identification number retained in the retention unit 61 of the management node 60 is cleared and all of the job information retained in the retention unit 52 of the calculation node 50 is cleared. As a result, since the job information is managed in association with the identification number of the period timing at which the job information is acquired, an accurate snapshot of the job information between the calculation nodes 50 can be secured.
- In the first embodiment, the retention unit 52 of the calculation node 50 includes a retention region capable of retaining job information corresponding to a predetermined number of periods, and the retention unit 61 of the management node 60 includes a retention region capable of retaining job information corresponding to a predetermined number of periods for each calculation node 50. As a result, even when the job information clear timing differs between the calculation nodes 50 because of, for example, a delay in transmission of the clear request from the management node 60, the management node 60 can still collect the job information of each calculation node 50, and an accurate snapshot of the calculation job executed in the parallel computer 1A can be secured.
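The text does not state how an identification number is derived from the common period timing. One possible derivation, shown purely as an assumption, is for each node to start a timer on a common job start command and number the periods consecutively from 1. The function name `period_number` and its parameters are hypothetical.

```python
def period_number(elapsed_seconds, period_seconds):
    """Return the 1-based number of the period containing the elapsed time.

    Assumption: elapsed_seconds is measured from a job start command that is
    common to all calculation nodes, and every period has the same length.
    """
    if elapsed_seconds < 0 or period_seconds <= 0:
        raise ValueError("elapsed time must be >= 0 and period must be > 0")
    return int(elapsed_seconds // period_seconds) + 1
```

Under this assumption, two nodes that sample at the same elapsed time obtain the same number without exchanging messages, which is what allows job information from different nodes to be matched later.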
- FIG. 2 is a block diagram illustrating a parallel computer according to a second embodiment, and FIG. 3 is an illustration diagram of the parallel computer. A parallel computer 1 illustrated in FIG. 2 includes a plurality of calculation nodes 3 connected to a network 2, and a management node 4 managing the plurality of calculation nodes 3. The parallel computer 1 distributes a calculation job among the respective calculation nodes 3 and executes calculation in parallel. Also, although four calculation nodes 3 (3A to 3D) are illustrated for the convenience of description, the number of calculation nodes 3 is not limited thereto.
- The calculation node 3 corresponds to, for example, a computer, and executes a calculation job. The calculation node 3 includes a calculation processing unit 11, a job information processing control unit 12, a calculation side communication unit 13, and a calculation side retention unit 14. The calculation processing unit 11 executes its own calculation job among the distributed calculation jobs. The calculation side communication unit 13 communicates with the management node 4 through the network 2. The calculation side retention unit 14 corresponds to, for example, a buffer or the like. The calculation side retention unit 14 includes a first retention region 14A and a second retention region 14B retaining job information corresponding to two generations, that is, two time belts.
- The job information processing control unit 12 includes a timing detection unit 21, an acquisition processing unit 22, a calculation side retention control unit 23, and an information transmission unit 24. The timing detection unit 21 detects the timing of acquiring the node's own job information. The timing detection unit 21 starts a timer operation according to a job start command common to the calculation nodes 3. Also, FIG. 4 is an illustration diagram of a job information acquisition period (time belt). The timing detection unit 21 detects the job information acquisition timing by using the period timing common to the calculation nodes, that is, the time belt of FIG. 4. When the timing detection unit 21 detects the job information acquisition timing, the acquisition processing unit 22 acquires the node's own job information.
- The calculation side retention control unit 23 controls the retention by the calculation side retention unit 14, and retains the job information acquired by the acquisition processing unit 22 in the calculation side retention unit 14. Also, the job information includes job information contents, information existence/nonexistence, node information, time belt number, information acquisition date and time, and the like. The job information contents include a job ID identifying a job, usage of resources such as a CPU, a memory and a file used by the job, and the number of commands executed by the job. The information existence/nonexistence is information indicating the existence/nonexistence of job information. In the case of information existence, the job information corresponds to job information having job information contents, and in the case of information nonexistence, the job information corresponds to error information that will be described below. The node information corresponds to a node ID identifying the calculation node 3 that is the source of the job information. The time belt number corresponds to a number identifying the period timing, common to the calculation nodes 3, at which the job information was acquired. The information acquisition date and time correspond to the date and time of acquisition of the job information.
- When acquiring its own job information according to the job information acquisition timing, the calculation side retention control unit 23 determines whether an empty space exists in the retention region of the calculation side retention unit 14. When an empty space exists in the retention region of the calculation side retention unit 14, the calculation side retention control unit 23 retains the job information in the calculation side retention unit 14. Also, when an empty space does not exist in the retention region of the calculation side retention unit 14, the calculation side retention control unit 23 prohibits the retention of the job information.
- According to a designated time belt number transmission request from the management node 4, which will be described below, the calculation side retention control unit 23 determines whether job information of the designated time belt number exists in the calculation side retention unit 14. When job information of the designated time belt number exists in the calculation side retention unit 14, the calculation side retention control unit 23 transmits the job information of the designated time belt number to the management node 4 through the calculation side communication unit 13. Also, when job information of the designated time belt number does not exist in the calculation side retention unit 14, the calculation side retention control unit 23 determines whether job information one generation before the designated time belt number exists in the calculation side retention unit 14. When job information one generation before the designated time belt number exists in the calculation side retention unit 14, the calculation side retention control unit 23 transmits the job information one generation before the designated time belt number to the management node 4 through the calculation side communication unit 13. When job information one generation before the designated time belt number does not exist in the calculation side retention unit 14, the calculation side retention control unit 23 transmits error information to the management node 4 through the calculation side communication unit 13. Also, according to a clear request from the management node 4, which will be described below, the calculation side retention control unit 23 clears all of the job information retained in the calculation side retention unit 14.
- Also, for the convenience of description, one calculation node 3A among the four calculation nodes 3 (3A to 3D) is referred to as a representative node. The representative node has substantially the same internal configuration as the calculation node 3, but is characterized by having a function described below. The job information processing control unit 12 of the representative node acquires job information according to the period timing common to the calculation nodes 3, and retains the job information in the calculation side retention unit 14. In addition, when retaining job information in the calculation side retention unit 14, the job information processing control unit 12 notifies the management node 4 of the time belt number of the job information, as a transmission request target, through the calculation side communication unit 13.
- The management node 4 corresponds to, for example, a computer. The management node 4 connects with each calculation node 3 through the network 2 to manage each calculation node 3. The management node 4 includes a management side processing unit 31, a snapshot processing control unit 32, a management side communication unit 33, and a management side retention unit 34. The management side processing unit 31 manages the distributed calculation nodes 3. The management side communication unit 33 communicates with each calculation node 3 through the network 2. The management side retention unit 34 corresponds to, for example, a buffer or the like. The management side retention unit 34 includes a first retention region 34A, a second retention region 34B and a third retention region 34C retaining job information corresponding to three generations, that is, three time belts. Also, the first retention region 34A retains job information about a snapshot, and the second retention region 34B and the third retention region 34C are used to temporarily retain job information in order to acquire a snapshot. Also, when not retaining job information of a snapshot, the first retention region 34A is used to temporarily retain job information, like the second retention region 34B and the third retention region 34C.
- The snapshot processing control unit 32 includes a transmission request unit 41, a received information identification unit 42, a retention region monitoring unit 43, a clear request unit 44, and a management side retention control unit 45. When receiving a time belt number of a transmission request target from the representative node, the transmission request unit 41 requests the transmission of job information about the time belt number from each calculation node 3 through the management side communication unit 33. The received information identification unit 42 identifies the received information of each calculation node 3 received according to a transmission request of a designated time belt number to each calculation node 3. Also, the received information is received from the calculation node 3, and includes, for example, job information of the designated time belt number, job information of a time belt number one generation before the designated time belt number, error information, and the like.
- The retention region monitoring unit 43 monitors the job information of the respective calculation nodes 3 retained in the first retention region 34A, the second retention region 34B and the third retention region 34C. In addition, based on the job information monitoring result, the retention region monitoring unit 43 determines whether there is a new time belt number for which the job information of all the calculation nodes 3 has been retained. When there is such a new time belt number, the management side retention control unit 45 determines that a snapshot of the same time belt number is newly acquired, and updates/registers the job information of all the calculation nodes 3 of the same time belt number in the first retention region 34A. In addition, the management side retention control unit 45 clears all the job information of the respective calculation nodes 3 retained in the second retention region 34B and the third retention region 34C. Also, when a new snapshot is acquired, the clear request unit 44 requests, through the management side communication unit 33, clearing of all the job information retained in the calculation side retention unit 14 of all the calculation nodes 3.
- Also, for example, when detecting a snapshot presentation request from a user terminal, the management node 4 presents the user terminal with the job information of the same time belt number of all the calculation nodes 3 retained in the first retention region 34A of the management side retention unit 34, as a snapshot. That is, the user can know the job information of each calculation node 3 with respect to the job that is being executed.
- Next, the reason for setting the calculation
side retention unit 14 to include a region retaining the job information corresponding to two generations, that is, two time belts will be described. FIG. 5 is an illustration diagram illustrating the reason for setting the calculation side retention unit 14 corresponding to two generations. When job information of a new snapshot is retained, the management node 4 issues a clear request to each calculation node 3.
- In FIG. 5, when the timing of arrival of the clear request from the management node 4 is the timing of acquiring the job information of a time belt number T2, the calculation node 3B clears all the job information up to the time belt number T2 retained in the calculation side retention unit 14. As a result, in the calculation node 3B, the job information to be acquired next is the job information of a time belt number T3.
- Also, when the timing of arrival of the clear request from the management node 4 is the timing of acquiring the job information of a time belt number T3, the calculation node 3C clears all the job information up to the time belt number T3 retained in the calculation side retention unit 14. As a result, in the calculation node 3C, the job information to be acquired next is the job information of a time belt number T4.
- That is, since the timing of arrival of the clear request is different between the calculation nodes 3, the job information to be acquired may be shifted by one generation, that is, one time belt. Accordingly, the calculation side retention unit 14 of each calculation node 3 is provided with the first retention region 14A and the second retention region 14B as retention regions for retaining the job information corresponding to two time belts in order to absorb a gap corresponding to one time belt.
- In addition, the reason for setting the management side retention unit 34 to include a region retaining the job information corresponding to three generations, that is, three time belts will be described. For example, when the job information of the same time belt number T1 of all calculation nodes 3 is retained, that is, when a snapshot of the time belt number T1 is acquired, the job information of the relevant time belt number is retained in the first retention region 34A. The second retention region 34B and the third retention region 34C are used until the job information of the next time belt number of all the calculation nodes 3 is retained. However, as described above, when a gap between the calculation nodes 3 with respect to the clear request corresponds to one generation, the job information transmitted to the management node 4 from each calculation node 3 is also shifted by one generation. Accordingly, the management side retention unit 34 uses the first retention region 34A to retain the job information of a snapshot, and is provided with the second retention region 34B and the third retention region 34C as retention regions for retaining the job information corresponding to two time belts in order to absorb a gap corresponding to one time belt.
- Next, an operation of the
parallel computer 1A according to the second embodiment will be described.FIGS. 6 to 8 are illustration diagrams illustrating an example of an operation transition for snapshot acquisition of theparallel computer 1A. Also, for the convenience of description, four calculation nodes 3 (3A to 3D) are provided, and thecalculation node 3A is set to be a representative node. InFIG. 6 , each of thecalculation nodes side retention unit 14. Also, the job information of the time belt number T1 is retained in thefirst retention region 14A of thecalculation nodes calculation node 3B cannot acquire the job information of the time belt number T1 due to the delay of reception of the job start command because of a certain factor, information is not retained in thefirst retention region 14A. - The
calculation node 3A is the representative node. Therefore, when retaining the job information of the time belt number T1 in the calculation side retention unit 14, the calculation node 3A notifies the management node 4 of the time belt number T1 (step S11). When receiving the time belt number T1 from the calculation node 3A, the management node 4 requests all the calculation nodes 3 to transmit the job information of the time belt number T1 (step S12).
- When receiving the transmission request for the job information of the time belt number T1, each
calculation node 3 determines whether the job information of the time belt number T1 exists in the calculation side retention unit 14. Because the calculation side retention unit 14 of each of the calculation nodes 3A, 3C, and 3D holds the job information of the time belt number T1, each of these nodes transmits that job information to the management node 4. In contrast, because its calculation side retention unit 14 holds neither the job information of the time belt number T1 nor one-generation-ago job information, the calculation node 3B transmits error information to the management node 4 (step S13A).
- When receiving the job information of the time belt number T1 of the calculation nodes 3A, 3C, and 3D, the management node 4 retains the job information of the time belt number T1 in the first retention region 34A corresponding to the calculation nodes 3A, 3C, and 3D. On the other hand, when receiving the error information of the calculation node 3B, the management node 4 does not retain information in the first retention region 34A corresponding to the calculation node 3B.
- Next, each of the
calculation nodes 3A, 3C, and 3D acquires job information according to the timing of the time belt number T2, and retains the job information in the second retention region 14B of the calculation side retention unit 14. Meanwhile, the calculation node 3B acquires job information of the time belt number T1 according to the timing of the time belt number T1, and retains the job information in the first retention region 14A of the calculation side retention unit 14.
- The
calculation node 3A is the representative node. Therefore, when retaining the job information of the time belt number T2 in the calculation side retention unit 14, the calculation node 3A notifies the management node 4 of the time belt number T2 (step S14). When receiving the time belt number T2, the management node 4 requests all the calculation nodes 3 to transmit the job information of the time belt number T2 (step S15).
- In
FIG. 7, when receiving the transmission request for the job information of the time belt number T2, each calculation node 3 determines whether the job information of the time belt number T2 exists in the calculation side retention unit 14. Because the calculation side retention unit 14 of each of the calculation nodes 3A, 3C, and 3D holds the job information of the time belt number T2, each of these nodes transmits that job information to the management node 4. In contrast, because its calculation side retention unit 14 does not hold the job information of the time belt number T2 but holds one-generation-ago job information, that is, the job information of the time belt number T1, the calculation node 3B transmits the job information of the time belt number T1 to the management node 4 (step S16A).
- When receiving the job information of the time belt number T2 of the calculation nodes 3A, 3C, and 3D, the management node 4 retains the job information of the time belt number T2 in the second retention region 34B corresponding to the calculation nodes 3A, 3C, and 3D. Also, when receiving the job information of the time belt number T1 of the calculation node 3B, the management node 4 retains the job information of the time belt number T1 in the first retention region 34A corresponding to the calculation node 3B. As a result, the job information of the time belt number T1 of all the calculation nodes 3 is retained in the first retention region 34A. That is, a snapshot of the time belt number T1 is acquired.
- When acquiring the snapshot of the time belt number T1, the
management node 4 requests all the calculation nodes 3 to clear all the job information retained in their calculation side retention units 14 (step S17). In addition, while keeping the job information of the time belt number T1 in the first retention region 34A, the management node 4 clears all the job information retained in the second retention region 34B and the third retention region 34C (step S18).
- In addition, when receiving the clear request from the management node 4, each calculation node 3 clears all the job information retained in the first retention region 14A and the second retention region 14B (step S19).
- Next, each of the
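calculation nodes 3A, 3C, and 3D acquires job information according to the timing of the time belt number T4, and retains the job information in the first retention region 14A. Likewise, the calculation node 3B acquires job information according to the timing of the time belt number T3, and retains the job information in the first retention region 14A.
- The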
calculation node 3A is the representative node. Therefore, when retaining the job information of the time belt number T4 in the calculation side retention unit 14, the calculation node 3A notifies the management node 4 of the time belt number T4 (step S20). When receiving the time belt number T4 from the calculation node 3A, the management node 4 requests all the calculation nodes 3 to transmit the job information of the time belt number T4 (step S21).
- In
FIG. 8, when receiving the transmission request for the job information of the time belt number T4, each calculation node 3 determines whether the job information of the time belt number T4 exists in the calculation side retention unit 14. Because the calculation side retention unit 14 of each of the calculation nodes 3A, 3C, and 3D holds the job information of the time belt number T4, each of these nodes transmits that job information to the management node 4. In contrast, because its calculation side retention unit 14 does not hold the job information of the time belt number T4 but holds one-generation-ago job information, that is, the job information of the time belt number T3, the calculation node 3B transmits the job information of the time belt number T3 to the management node 4 (step S22A).
- When receiving the job information of the time belt number T4 of the calculation nodes 3A, 3C, and 3D, the management node 4 retains the job information of the time belt number T4 in the second retention region 34B corresponding to the calculation nodes 3A, 3C, and 3D. Also, when receiving the job information of the time belt number T3 of the calculation node 3B, the management node 4 retains the job information of the time belt number T3 in the second retention region 34B corresponding to the calculation node 3B. Meanwhile, the job information of the time belt number T1 of all the calculation nodes 3 remains stored as a snapshot in the first retention region 34A.
- Next, each of the
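calculation nodes 3A, 3C, and 3D acquires job information according to the timing of the time belt number T5, and retains the job information of the time belt number T5 in the second retention region 14B. Likewise, the calculation node 3B acquires job information according to the timing of the time belt number T4, and retains the job information of the time belt number T4 in the second retention region 14B.
- The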
calculation node 3A is the representative node. Therefore, when retaining the job information of the time belt number T5 in the calculation side retention unit 14, the calculation node 3A notifies the management node 4 of the time belt number T5 (step S23). When receiving the time belt number T5 from the calculation node 3A, the management node 4 requests all the calculation nodes 3 to transmit the job information of the time belt number T5 (step S24).
- When receiving the transmission request for the job information of the time belt number T5, each
calculation node 3 determines whether the job information of the time belt number T5 exists in the calculation side retention unit 14. Because the calculation side retention unit 14 of each of the calculation nodes 3A, 3C, and 3D holds the job information of the time belt number T5, each of these nodes transmits that job information to the management node 4. In contrast, because its calculation side retention unit 14 does not hold the job information of the time belt number T5 but holds one-generation-ago job information, that is, the job information of the time belt number T4, the calculation node 3B transmits the job information of the time belt number T4 to the management node 4 (step S25A).
- When receiving the job information of the time belt number T5 of the calculation nodes 3A, 3C, and 3D, the management node 4 retains the job information of the time belt number T5 in the third retention region 34C corresponding to the calculation nodes 3A, 3C, and 3D. Also, when receiving the job information of the time belt number T4 of the calculation node 3B, the management node 4 retains the job information of the time belt number T4 in the third retention region 34C corresponding to the calculation node 3B. As a result, the job information of the time belt number T4 is retained in the second retention region 34B for the calculation nodes 3A, 3C, and 3D and in the third retention region 34C for the calculation node 3B, so that the job information of the time belt number T4 is retained for all the calculation nodes 3. That is, the snapshot of the time belt number T4 is acquired.
- When acquiring the snapshot of the time belt number T4, the
management node 4 requests all the calculation nodes 3 to clear all the job information retained in their calculation side retention units 14 (step S26). The management node 4 overwrites the job information of the time belt number T1 in the first retention region 34A with the job information of the time belt number T4, and clears all the job information retained in the second retention region 34B and the third retention region 34C (step S27).
- In addition, when receiving the clear request from the management node 4, each calculation node 3 clears all the job information retained in the first retention region 14A and the second retention region 14B (step S28). By repeating these processing operations, the latest snapshot can be retained in the first retention region 34A of the management node 4. As a result, when detecting a snapshot presentation request from the user terminal, the management node 4 can present the latest snapshot retained in the first retention region 34A.
- Next, the job acquisition processing of the
calculation node 3A being the representative node will be described. FIG. 9 is a flow chart illustrating a processing operation of the calculation node 3A for job acquisition processing of the representative node side. In FIG. 9, the timing detection unit 21 in the job information processing control unit 12 of the calculation node 3A determines whether the job information acquisition timing is detected (step S51). When the job information acquisition timing is detected (Yes in step S51), the acquisition processing unit 22 in the job information processing control unit 12 executes job information acquisition processing (step S52A) and determines whether its own job information could be acquired (step S52).
- When its own job information could be acquired (Yes in step S52), the calculation side retention control unit 23 in the job information processing control unit 12 determines whether an empty space exists in the calculation side retention unit 14 (step S53). When an empty space exists in the calculation side retention unit 14 (Yes in step S53), the calculation side retention control unit 23 retains the job information of the time belt number in the calculation side retention unit 14 (step S54).
- When the job information of the time belt number is retained in the calculation side retention unit 14, the information transmission unit 24 in the job information processing control unit 12 notifies the management node 4 of the time belt number as the time belt number of a transmission request target (step S55). The calculation side retention control unit 23 determines whether the transmission request for the job information designating the time belt number of the transmission request target is received from the management node 4 (step S56). When the transmission request for the job information is received (Yes in step S56), the calculation side retention control unit 23 transmits the job information of the requested time belt number retained in the calculation side retention unit 14 to the management node 4 (step S57).
- The calculation side
retention control unit 23 determines whether the clear request is received from the management node 4 (step S58). When the clear request is received from the management node 4 (Yes in step S58), the calculation side retention control unit 23 clears all the job information retained in the calculation side retention unit 14 (step S59), and returns to step S51 to determine whether the job information acquisition timing is detected.
- Also, when the clear request is not received (No in step S58), the calculation side retention control unit 23 determines whether the job information acquisition timing is detected (step S60). When the job information acquisition timing is not detected (No in step S60), the calculation side retention control unit 23 returns to step S58 to determine whether the clear request is received. When the job information acquisition timing is detected (Yes in step S60), the calculation side retention control unit 23 proceeds to step S52A to execute job information acquisition processing.
- Also, when the job information acquisition timing is not detected (No in step S51), the
timing detection unit 21 returns to step S51 to continue to monitor the job information acquisition timing. Also, when the job information could not be acquired (No in step S52), the acquisition processing unit 22 returns to step S51 to detect the job information acquisition timing.
- Also, when an empty space does not exist in the calculation side retention unit 14 (No in step S53), the calculation side retention control unit 23 does not retain the job information of the time belt number in the calculation side retention unit 14 (step S61), and returns to step S51 to detect the job information acquisition timing.
- Also, when the job information transmission request is not received (No in step S56), the calculation side retention control unit 23 returns to step S56 to continue to monitor for the job information transmission request. Note that, because step S56 is executed by the representative node, it follows the notification of the time belt number of the transmission request target, which prompts the management node 4 to issue the transmission request. Therefore, in the normal case, the transmission request is always received from the management node 4.
- In the job acquisition processing of the representative node side illustrated in
FIG. 9, when acquiring the job information according to the acquisition timing common to the calculation nodes, the representative node determines whether an empty space exists in the calculation side retention unit 14. When an empty space exists in the calculation side retention unit 14, the job information is retained in the calculation side retention unit 14 in association with the time belt number identifying the acquisition timing. As a result, the representative node can retain the job information corresponding to two generations in association with the time belt number.
- In the job acquisition processing of the representative node side, when the job information is retained in the calculation side retention unit 14 in association with the time belt number, the management node 4 is notified of the time belt number as the transmission request target. As a result, the representative node can report the time belt number of the job information of the transmission request target to the management node 4.
- In the job acquisition processing of the representative node side, in response to the transmission request for job information of a designated time belt number from the management node 4, the job information of the designated time belt number is transmitted to the management node 4. As a result, the representative node can transmit the job information of the transmission request target to the management node 4.
- In the job acquisition processing of the representative node side, when the clear request is received from the management node 4, all the job information retained in the calculation side retention unit 14 is cleared. As a result, the representative node can retain new job information in the calculation side retention unit 14 so that the latest snapshot is acquired by the management node 4.
- Next, the job acquisition processing of the
calculation nodes 3 other than the representative node will be described. FIG. 10 is a flow chart illustrating a processing operation of the calculation node 3 for job acquisition processing of the calculation node side. In FIG. 10, the timing detection unit 21 in the job information processing control unit 12 of the calculation node 3 determines whether the job information acquisition timing is detected (step S71). When the job information acquisition timing is detected (Yes in step S71), the acquisition processing unit 22 executes job information acquisition processing (step S72) and determines whether its own job information could be acquired (step S73).
- When its own job information could be acquired (Yes in step S73), the calculation side retention control unit 23 determines whether an empty space exists in the calculation side retention unit 14 (step S74). When an empty space exists in the calculation side retention unit 14 (Yes in step S74), the calculation side retention control unit 23 retains the job information of the time belt number in the calculation side retention unit 14 (step S75).
- The calculation side
retention control unit 23 determines whether the transmission request for the job information designating the time belt number of the transmission request target is received from the management node 4 (step S76). When the transmission request for the job information is received (Yes in step S76), the calculation side retention control unit 23 determines whether the job information of the requested time belt number exists in the calculation side retention unit 14 (step S77).
- When the job information of the requested time belt number exists in the calculation side retention unit 14 (Yes in step S77), the information transmission unit 24 transmits the job information of the requested time belt number to the management node 4 (step S78). The calculation side retention control unit 23 determines whether the clear request is received from the management node 4 (step S79). When the clear request is received (Yes in step S79), the calculation side retention control unit 23 clears all the job information retained in the calculation side retention unit 14 (step S80), and returns to step S71 to determine whether the job information acquisition timing is detected.
- Also, when the clear request is not received (No in step S79), the calculation side retention control unit 23 determines whether the job information acquisition timing is detected (step S81). When the job information acquisition timing is not detected (No in step S81), the calculation side retention control unit 23 returns to step S79 to determine whether the clear request is received. When the job information acquisition timing is detected (Yes in step S81), the calculation side retention control unit 23 proceeds to step S72 to execute job information acquisition processing.
- Also, when the job information acquisition timing is not detected (No in step S71), the
timing detection unit 21 returns to step S71 to continue to monitor the job information acquisition timing. Also, when the job information could not be acquired (No in step S73), the acquisition processing unit 22 returns to step S71 to detect the job information acquisition timing.
- Also, when an empty space does not exist in the calculation side retention unit 14 (No in step S74), the calculation side retention control unit 23 does not retain the job information of the time belt number in the calculation side retention unit 14 (step S82), and returns to step S71 to detect the job information acquisition timing.
- Also, when the job information transmission request is not received (No in step S76), the calculation side retention control unit 23 proceeds to step S79 to determine whether the clear request is received.
- Also, when the job information of the time belt number of the transmission request does not exist in the calculation side retention unit 14 (No in step S77), the calculation side
retention control unit 23 determines whether job information one generation before the requested time belt number exists in the calculation side retention unit 14 (step S83). For example, when the time belt number of the transmission request is T3, the one-generation-ago job information is the job information of the time belt number T2. When the job information one generation before the requested time belt number exists in the calculation side retention unit 14 (Yes in step S83), the calculation side retention control unit 23 transmits the one-generation-ago job information to the management node 4 (step S84), and proceeds to step S79 to determine whether the clear request is received.
- Also, when the job information one generation before the time belt number does not exist in the calculation side retention unit 14 (No in step S83), the calculation side retention control unit 23 transmits error information to the management node 4 (step S85), and returns to step S71 to determine whether the job information acquisition timing is detected.
- In the job acquisition processing of the calculation node side illustrated in
FIG. 10, when acquiring the job information according to the acquisition timing common to the calculation nodes, the calculation node 3 determines whether an empty space exists in the calculation side retention unit 14. When an empty space exists in the calculation side retention unit 14, the job information is retained in the calculation side retention unit 14 in association with the time belt number identifying the acquisition timing. As a result, the calculation node 3 can retain the job information corresponding to two generations in association with the time belt number.
- In the job acquisition processing of the calculation node side, in response to the transmission request for job information of a designated time belt number from the management node 4, it is determined whether the job information of the designated time belt number exists in the calculation side retention unit 14. When the job information of the designated time belt number exists in the calculation side retention unit 14, the job information of that time belt number is transmitted to the management node 4. As a result, the calculation node 3 can transmit the job information of the designated time belt number to the management node 4.
- In the job acquisition processing of the calculation node side, when the job information of the designated time belt number does not exist in the calculation side retention unit 14, it is determined whether one-generation-ago job information exists in the calculation side retention unit 14. When one-generation-ago job information exists in the calculation side retention unit 14, the one-generation-ago job information is transmitted to the management node 4. As a result, the calculation node 3 can also transmit the one-generation-ago job information to the management node 4 in order to absorb a gap between the calculation nodes 3 caused by, for example, a transmission delay of the clear request.
- In the job acquisition processing of the calculation node side, when one-generation-ago job information does not exist in the calculation side retention unit 14, error information is transmitted to the management node 4. As a result, the calculation node 3 can report to the management node 4 that no transmittable job information exists.
- In the job acquisition processing of the calculation node side, when the clear request is received from the management node 4, all the job information retained in the calculation side retention unit 14 is cleared. As a result, the calculation node 3 can retain new job information in the calculation side retention unit 14 so that the latest snapshot is acquired by the management node 4.
- Next, an operation of the
management node 4 will be described. FIG. 11 is a flow chart illustrating a processing operation of the management node 4 for snapshot acquisition processing of the management node side. In FIG. 11, the snapshot processing control unit 32 of the management node 4 determines whether the time belt number of the transmission request target is received from the representative calculation node 3A (step S91). When the time belt number of the transmission request target is received (Yes in step S91), the transmission request unit 41 of the snapshot processing control unit 32 requests all the calculation nodes 3 to transmit the job information of the time belt number of the transmission request target (step S92).
- The received information identification unit 42 in the snapshot processing control unit 32 determines whether the information received from each calculation node 3 is error information (step S93). When the information received from a calculation node 3 is not error information (No in step S93), the received information identification unit 42 determines whether the received information is job information (step S94). When the received information is job information (Yes in step S94), the management side retention control unit 45 in the snapshot processing control unit 32 retains the job information in the management side retention unit 34 in association with the relevant calculation node 3 (step S95). The received information identification unit 42 determines whether the reception of information from all the calculation nodes 3 that received the transmission request is completed (step S96).
- When the reception of information from all the calculation nodes 3 is not completed (No in step S96), the received information identification unit 42 determines that there is received information not yet identified, and returns to step S93 to determine whether the received information is error information. When the reception of information from all the calculation nodes 3 is completed (Yes in step S96), the retention region monitoring unit 43 in the snapshot processing control unit 32 determines, based on the information retained in the management side retention unit 34, whether there is a new time belt number for which the job information of all the calculation nodes 3 has been retained (step S97).
- When there is a new time belt number for which the job information of all the calculation nodes 3 has been retained (Yes in step S97), the retention
region monitoring unit 43 determines that a snapshot of that time belt number is newly acquired. In response, the transmission request unit 41 requests all the calculation nodes 3 to clear the job information retained in the calculation side retention unit 14 (step S98).
- The management side retention control unit 45 registers the newly retained job information of the same time belt number of all the calculation nodes 3 as a new snapshot in the first retention region 34A, replacing the previous snapshot (step S99). In addition, the management side retention control unit 45 clears all the job information of the respective calculation nodes 3 retained in the second retention region 34B and the third retention region 34C (step S100), and ends the processing operation of FIG. 11.
- When the time belt number of the transmission request target is not received (No in step S91), the snapshot
processing control unit 32 ends the processing operation of FIG. 11. Also, when the received information is error information (Yes in step S93), the received information identification unit 42 identifies the received information from the calculation node 3 as error information, and proceeds to step S96 to determine whether the identification of the received information from all the calculation nodes 3 is completed.
- When there is no new time belt number for which the job information of all the calculation nodes 3 has been retained (No in step S97), the retention region monitoring unit 43 ends the processing operation of FIG. 11.
- In the snapshot acquisition processing of the management node side illustrated in
FIG. 11, when receiving the time belt number of the transmission request target from the representative node, the management node 4 requests each calculation node 3 to transmit the job information of that time belt number. As a result, the management node 4 can request the job information of the time belt number designated by the representative node from each calculation node 3.
- In the snapshot acquisition processing of the management node side, the management node 4 determines whether the information received from each calculation node 3 in response to the transmission request is job information. When the received information is job information, it is treated as job information of either the designated time belt number or the one-generation-ago time belt number, and is retained in the management side retention unit 34 in association with the relevant calculation node 3. As a result, the management node 4 can retain the job information of each calculation node 3 corresponding to three generations in the management side retention unit 34.
- In the snapshot acquisition processing of the management node side, when a new time belt number for which the job information of all the calculation nodes 3 has been retained exists in the management side retention unit 34, the management node 4 determines that a snapshot of that time belt number is newly acquired. In response, the management node 4 requests all the calculation nodes 3 to clear the job information retained in the calculation side retention unit 14. The management node 4 registers the newly retained job information of the same time belt number of all the calculation nodes 3 as a new snapshot in the first retention region 34A, and clears the job information of each calculation node 3 retained in the second retention region 34B and the third retention region 34C.
- As a result, since the snapshot of the job information of the same time belt number is retained in the first retention region 34A, the management node 4 can present the latest snapshot to the user. In addition, by clearing the job information of the second retention region 34B and the third retention region 34C, the management node 4 can use the second retention region 34B and the third retention region 34C as temporary job information retention regions.
- In the second embodiment, the
calculation node 3 acquires the job information according to the periodic timing common to the calculation nodes, and retains the acquired job information in the calculation side retention unit 14 in association with the time belt number identifying the periodic timing at which the job information is acquired. When receiving the job information from each calculation node 3 in response to the transmission request, the management node 4 retains the received job information in the management side retention unit 34. When detecting job information of the same time belt number for all the calculation nodes 3 in the management side retention unit 34, the management node 4 retains the job information of that time belt number as a snapshot. When the job information of the same time belt number is retained as a snapshot, the job information of other time belt numbers retained in the management side retention unit 34 is cleared, and all of the job information retained in the calculation side retention unit 14 is cleared. As a result, since the job information is managed in association with the time belt number of the periodic timing at which it is acquired, an accurate snapshot of the job information across the calculation nodes 3 can be secured.
- In the second embodiment, the calculation
side retention unit 14 is provided with a retention region capable of retaining job information corresponding to two generations, and the managementside retention unit 34 is provided with a retention region capable of retaining job information corresponding to three generations. As a result, for example, the job information clear timing caused by the delay of transmission of the clear request from themanagement node 4 is different in eachcalculation node 3. Accordingly, the impossibility of collecting the job information of eachcalculation node 3 by themanagement node 4 can be avoided, and the acquisition of a snapshot can be secured. - In the second embodiment, one of the plurality of
calculation nodes 3 is used as a representative node, and the management node 4 starts the transmission request for the job information associated with the time belt number when the time belt number of the transmission request target is notified from the representative node to the management node 4. As a result, since one representative node is sufficient, the communication load in the acquisition of a snapshot can be reduced. - Also, although four
calculation nodes 3 are provided in the second embodiment, the number of calculation nodes 3 is not limited thereto. Also, although one of the plurality of calculation nodes 3 is used as the representative node in the second embodiment, the number of calculation nodes used as the representative node is not limited thereto, and each calculation node 3 may be used as the representative node. - Also, in the second embodiment, the calculation
side retention unit 14 is provided with a retention region retaining job information corresponding to two generations, and the management side retention unit 34 is provided with a retention region retaining job information corresponding to three generations. However, the calculation side retention unit 14 may be provided with a retention region retaining job information corresponding to three generations, and the management side retention unit 34 may be provided with a retention region retaining job information corresponding to four generations. - Also, in the second embodiment, the time in each
calculation node 3 taken until the execution of job information clearing after arrival of the clear request from the management node 4 at each calculation node 3 is measured, and the maximum gap time between the calculation nodes 3 is calculated based on the measurement result. The maximum gap time is assumed to be sufficiently shorter than the time belt interval time, and the calculation side retention unit 14 is provided with a retention region retaining job information corresponding to two generations. - On the other hand, in the case where the maximum gap time is longer than the time belt interval time, when the condition of “(n times the time belt interval time) < (maximum gap time) ≤ ((n+1) times the time belt interval time)” is satisfied, the calculation
side retention unit 14 is provided with a retention region retaining job information corresponding to (n+2) generations. In addition, the management side retention unit 34 is provided with a retention region retaining job information corresponding to (n+3) generations. For example, when n=1, the calculation side retention unit 14 is provided with a retention region retaining job information corresponding to three generations, and the management side retention unit 34 is provided with a retention region retaining job information corresponding to four generations. Also, when n=2, the calculation side retention unit 14 is provided with a retention region retaining job information corresponding to four generations, and the management side retention unit 34 is provided with a retention region retaining job information corresponding to five generations. - Also, although a two-stage
parallel computer 1 made up of the management node 4 and the calculation nodes 3 is provided in the second embodiment, a multi-stage parallel computer with further stages between the calculation nodes 3 and the management node 4 may be provided. FIG. 12 is a diagram illustrating a three-stage parallel computer. - A
parallel computer 1B illustrated in FIG. 12 includes 12 calculation nodes 3A to 3L, three sub management nodes 4B to 4D, and one management node 4A. The sub management node 4B relays and manages four calculation nodes 3A to 3D. In addition, the sub management node 4C relays and manages four calculation nodes 3E to 3H. In addition, the sub management node 4D relays and manages four calculation nodes 3I to 3L. In addition, the management node 4A manages three sub management nodes 4B to 4D. - The calculation
side retention unit 14 of each of the calculation nodes 3A to 3L includes a first retention region 14A and a second retention region 14B. Each of the sub management nodes 4B to 4D includes a first retention region 34D, a second retention region 34E and a third retention region 34F that retain job information of four calculation nodes corresponding to three generations. - In addition, the management
side retention unit 34 of the management node 4A includes a first retention region 34A, a second retention region 34B and a third retention region 34C that retain job information of the same time belt number of 12 calculation nodes 3A to 3L corresponding to three generations. - Each of the
calculation nodes 3A to 3L acquires job information of the common period timing from a job start command, and retains the job information in the calculation side retention unit 14. The sub management nodes 4B to 4D collect the job information of the calculation nodes 3A to 3D (3E to 3H and 3I to 3L) that they manage. Each of the sub management nodes 4B to 4D transmits the collected job information of the calculation nodes 3A to 3D (3E to 3H and 3I to 3L) to the management node 4A. - That is, the
management node 4A does not separately communicate with the calculation nodes 3A to 3L, but collects the job information of the calculation nodes 3A to 3L through communication with the sub management nodes 4B to 4D. - Although the example of
FIG. 12 illustrates a three-layer structure of the management node 4A, the sub management nodes 4B to 4D and the calculation nodes 3A to 3L, the present invention is not limited to the three-layer structure but may include a hierarchical structure of four or more layers. - Also, the respective elements of the respective units illustrated do not necessarily require a physical configuration as illustrated. That is, the details of distribution/integration of the respective units are not limited to the illustrated embodiments, and all or some of the respective units may be functionally or physically distributed/integrated in arbitrary units according to various loads or use conditions.
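The three-layer collection of FIG. 12 described above can be sketched in Python as follows. This is only an illustrative sketch under stated assumptions, not the patented implementation: the class names `SubManagementNode` and `TopManagementNode`, the method names, and the dictionary layout are hypothetical, and network communication is replaced by direct method calls.

```python
from collections import defaultdict


class SubManagementNode:
    """Hypothetical relay (4B to 4D) that retains job information of its calculation nodes."""

    def __init__(self, name, node_ids):
        self.name = name
        self.node_ids = list(node_ids)
        # time belt number -> {calculation node id: job information}
        self.retained = defaultdict(dict)

    def receive(self, node_id, belt, job_info):
        """Retain job information reported by one managed calculation node."""
        self.retained[belt][node_id] = job_info

    def complete(self, belt):
        """Return the job information for `belt`, or None until every managed node has reported."""
        entries = self.retained.get(belt, {})
        if all(node_id in entries for node_id in self.node_ids):
            return dict(entries)
        return None


class TopManagementNode:
    """Hypothetical management node (4A) that talks only to sub management nodes."""

    def __init__(self, subs):
        self.subs = list(subs)

    def collect(self, belt):
        """Merge the per-relay results for `belt`; None if any relay is still incomplete."""
        snapshot = {}
        for sub in self.subs:
            part = sub.complete(belt)
            if part is None:
                return None
            snapshot.update(part)
        return snapshot


def build_fig12():
    """Build 12 calculation node ids (3A to 3L) under three relays (4B to 4D)."""
    names = [f"3{c}" for c in "ABCDEFGHIJKL"]
    subs = [
        SubManagementNode(f"4{s}", names[i * 4:(i + 1) * 4])
        for i, s in enumerate("BCD")
    ]
    return names, subs, TopManagementNode(subs)
```

In this sketch the management node exchanges messages with three relays rather than twelve calculation nodes, which is the point of the hierarchical arrangement; adding a fourth layer would amount to letting a relay manage other relays.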
- In addition, all or some of various processing functions performed by the respective devices may be executed on a CPU (Central Processing Unit) (or microcomputer such as MPU (Micro Processing Unit) or MCU (Micro Controller Unit)). Also, needless to say, all or some of the various processing functions may be executed on a program interpreted and executed by a CPU (or microcomputer such as MPU or MCU), or on hardware based on wired logic.
 - The various processing described in the present embodiment can be implemented by executing a prepared program on a computer. Therefore, an example of a computer executing a program having the same function as the above embodiment will be described below with reference to
FIG. 13. FIG. 13 is a diagram illustrating a computer executing a job information acquisition program of a parallel computer. - A
computer 200 illustrated in FIG. 13 includes an HDD (Hard Disk Drive) 210, a RAM (Random Access Memory) 220, a ROM (Read Only Memory) 230, and a CPU 240 that are connected through a bus 250. - A job information acquisition program of the calculation node side performing the same function as the above embodiment is pre-stored in the
ROM 230. As illustrated in FIG. 13, the job information acquisition program of the calculation node side includes an acquisition program 231, a retention program 232, an information transmission program 233, and a clear program 234. Also, like the respective elements of the calculation node 50 illustrated in FIG. 1, the programs 231 to 234 may be appropriately integrated or distributed. - The
CPU 240 reads the programs 231 to 234 from the ROM 230 and executes the same. As illustrated in FIG. 13, the respective programs 231 to 234 function as an acquisition process 241, a retention process 242, an information transmission process 243, and a clear process 244. - Also, a
computer 200A includes an HDD 210A, a RAM 220A, a ROM 230A, and a CPU 240A that are connected through a bus 250A. - A job information acquisition program of the management node side performing the same function as the above embodiment is pre-stored in the
ROM 230A. As illustrated in FIG. 13, the job information acquisition program of the management node side includes a retention program 231A, a snapshot retention program 232A, a clear program 233A, and a clear request program 234A. Also, like the respective elements of the management node 60 illustrated in FIG. 1, the programs 231A to 234A may be appropriately integrated or distributed. - The
CPU 240A reads the programs 231A to 234A from the ROM 230A and executes the same. As illustrated in FIG. 13, the respective programs 231A to 234A function as a retention process 241A, a snapshot retention process 242A, a clear process 243A, and a clear request process 244A. - According to the period timing common to the calculation nodes, the
CPU 240 acquires job information about a calculation job handled by the calculation node. In addition, the CPU 240 retains the job information in the retention unit of the RAM 220, which enables the retention of job information corresponding to a predetermined number of periods, in association with the identification number identifying the period timing at which the job information is acquired. In addition, when receiving the transmission request for the job information about the designated identification number from the management node, the CPU 240 transmits the job information about the designated identification number to the management node when the job information about the designated identification number exists in the retention unit. Also, when the job information about the designated identification number does not exist in the retention unit and the job information of an identification number just before the designated identification number exists therein, the CPU 240 transmits the job information of that preceding identification number to the management node. - Also, when receiving the job information from each calculation node according to the transmission request, the
CPU 240A retains the received job information in the RAM 220A that enables the retention of job information corresponding to a predetermined number of periods with respect to each calculation node. In addition, when detecting the job information of the same identification number about the calculation node in the retention unit, the CPU 240A retains the received job information of the same identification number as a snapshot. In addition, when the job information of the same identification number is retained as a snapshot, the CPU 240A clears job information other than the job information of the same identification number retained in the retention unit of the RAM 220A. In addition, when the job information of the same identification number is retained as a snapshot, the CPU 240A transmits a clear request to each calculation node. - When receiving the clear request from the management node, the
CPU 240 clears all the job information retained in the retention unit of the RAM 220. As a result, since the job information is managed in association with the identification number of the period timing at which the job information is acquired, an accurate snapshot of the job information between the calculation nodes can be secured. Also, even when the job information clear timing differs among the calculation nodes due to, for example, a transmission delay of the clear request from the management node, a situation in which the management node cannot collect the job information of each calculation node can be avoided, and the acquisition of a snapshot can be secured. - In one aspect, the job information of the same timing about a job that is being executed in each calculation node of the parallel computer can be acquired.
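The exchange between the calculation node side and the management node side described above can be sketched in Python as follows. This is a minimal sketch under stated assumptions rather than the programs 231 to 234A themselves: the class and method names are hypothetical, messages are modeled as direct method calls, and dropping the oldest generation when a retention unit is full is an assumption made only to keep the example bounded.

```python
from collections import OrderedDict


class CalculationNode:
    """Hypothetical calculation node retaining job information per identification number."""

    def __init__(self, name, generations=2):
        self.name = name
        self.generations = generations
        self.retained = OrderedDict()  # identification number -> job information

    def acquire(self, ident, job_info):
        """Retain job information under the identification number of its period timing."""
        self.retained[ident] = job_info
        while len(self.retained) > self.generations:
            self.retained.popitem(last=False)  # assumption: overwrite the oldest generation

    def on_transmission_request(self, ident):
        """Reply with the designated number if retained, else with the number just before it."""
        for candidate in (ident, ident - 1):
            if candidate in self.retained:
                return candidate, self.retained[candidate]
        return None

    def on_clear_request(self):
        """Clear all retained job information."""
        self.retained.clear()


class ManagementNode:
    """Hypothetical management node that forms a snapshot from matching identification numbers."""

    def __init__(self, nodes, generations=3):
        self.nodes = list(nodes)
        self.generations = generations
        self.retained = {}  # identification number -> {node name: job information}
        self.snapshot = None

    def request(self, ident):
        """Request `ident` from every node; return a snapshot once one number covers all nodes."""
        for node in self.nodes:
            reply = node.on_transmission_request(ident)
            if reply is not None:
                got, info = reply
                self.retained.setdefault(got, {})[node.name] = info
        # assumption: keep only the newest `generations` identification numbers
        for old in sorted(self.retained)[:-self.generations]:
            del self.retained[old]
        for got in sorted(self.retained, reverse=True):
            if len(self.retained[got]) == len(self.nodes):
                self.snapshot = (got, dict(self.retained[got]))
                self.retained = {got: self.retained[got]}  # clear the other numbers
                for node in self.nodes:
                    node.on_clear_request()
                return self.snapshot
        return None
```

A possible usage: if three of four nodes already hold number 2 while the fourth holds only number 1, `request(2)` returns `None` because no single number covers all four nodes; once the fourth node acquires number 2, repeating the request yields a snapshot for number 2 and clears every calculation node.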
- All examples and conditional language recited herein are intended for pedagogical purposes of aiding the reader in understanding the invention and the concepts contributed by the inventor to further the art, and are not to be construed as limitations to such specifically recited examples and conditions, nor does the organization of such examples in the specification relate to a showing of the superiority and inferiority of the invention. Although the embodiments of the present invention have been described in detail, it should be understood that the various changes, substitutions, and alterations could be made hereto without departing from the spirit and scope of the invention.
Claims (8)
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2010/064639 WO2012026041A1 (en) | 2010-08-27 | 2010-08-27 | Parallel computer, job information acquisition program for parallel computer, job information acquisition method for parallel computer, computation device and computation management device |
Related Parent Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2010/064639 Continuation WO2012026041A1 (en) | 2010-08-27 | 2010-08-27 | Parallel computer, job information acquisition program for parallel computer, job information acquisition method for parallel computer, computation device and computation management device |
Publications (2)
Publication Number | Publication Date |
---|---|
US20130174170A1 (en) | 2013-07-04 |
US9336044B2 (en) | 2016-05-10 |
Family
ID=45723068
Family Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/778,494 Expired - Fee Related US9336044B2 (en) | 2010-08-27 | 2013-02-27 | Parallel computer, and job information acquisition method for parallel computer |
Country Status (4)
Country | Link |
---|---|
US (1) | US9336044B2 (en) |
EP (1) | EP2610752B1 (en) |
JP (1) | JP5464276B2 (en) |
WO (1) | WO2012026041A1 (en) |
Families Citing this family (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP6145089B2 (en) * | 2011-06-07 | 2017-06-07 | メソブラスト インターナショナル エスエイアールエル | Method for repairing tissue damage using protease resistant mutants of stromal cell-derived factor-1 |
WO2014010047A1 (en) * | 2012-07-11 | 2014-01-16 | 株式会社日立製作所 | Management system and information acquisition method |
EP2829975B1 (en) * | 2013-07-23 | 2019-04-24 | Fujitsu Limited | A fault-tolerant monitoring apparatus, method and system |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US6279001B1 (en) * | 1998-05-29 | 2001-08-21 | Webspective Software, Inc. | Web service |
US20090241145A1 (en) * | 2008-03-24 | 2009-09-24 | Verizon Data Services Llc | System and method for providing an interactive program guide having date and time toolbars |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS63136176A (en) | 1986-11-27 | 1988-06-08 | Casio Comput Co Ltd | Data processor |
JP2940403B2 (en) | 1994-08-03 | 1999-08-25 | 株式会社日立製作所 | Monitor Data Collection Method for Parallel Computer System |
EP0790559B1 (en) * | 1996-02-14 | 2002-05-15 | Hitachi, Ltd. | Method of monitoring a computer system, featuring performance data distribution to plural monitoring processes |
JP2002324014A (en) * | 2001-04-26 | 2002-11-08 | Meidensha Corp | Monitor and control system |
US8037264B2 (en) * | 2003-01-21 | 2011-10-11 | Dell Products, L.P. | Distributed snapshot process |
DE10327155B4 (en) * | 2003-06-13 | 2006-12-07 | Sap Ag | Backup procedure with adaptation to computer landscape |
JP2007128122A (en) * | 2005-11-01 | 2007-05-24 | Hitachi Ltd | Method for determining collection start time of operation performance data |
-
2010
- 2010-08-27 JP JP2012530498A patent/JP5464276B2/en not_active Expired - Fee Related
- 2010-08-27 EP EP10856443.6A patent/EP2610752B1/en active Active
- 2010-08-27 WO PCT/JP2010/064639 patent/WO2012026041A1/en active Application Filing
-
2013
- 2013-02-27 US US13/778,494 patent/US9336044B2/en not_active Expired - Fee Related
Also Published As
Publication number | Publication date |
---|---|
EP2610752A1 (en) | 2013-07-03 |
US9336044B2 (en) | 2016-05-10 |
JP5464276B2 (en) | 2014-04-09 |
EP2610752A4 (en) | 2015-11-04 |
JPWO2012026041A1 (en) | 2013-10-28 |
EP2610752B1 (en) | 2017-09-27 |
WO2012026041A1 (en) | 2012-03-01 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
| AS | Assignment | Owner name: FUJITSU LIMITED, JAPAN. Free format text: ASSIGNMENT OF ASSIGNORS INTEREST;ASSIGNOR:TAKESHITA, HIROTO;REEL/FRAME:031179/0292. Effective date: 20130327 |
| ZAAA | Notice of allowance and fees due | Free format text: ORIGINAL CODE: NOA |
| ZAAB | Notice of allowance mailed | Free format text: ORIGINAL CODE: MN/=. |
| STCF | Information on status: patent grant | Free format text: PATENTED CASE |
| MAFP | Maintenance fee payment | Free format text: PAYMENT OF MAINTENANCE FEE, 4TH YEAR, LARGE ENTITY (ORIGINAL EVENT CODE: M1551); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY. Year of fee payment: 4 |
| FEPP | Fee payment procedure | Free format text: MAINTENANCE FEE REMINDER MAILED (ORIGINAL EVENT CODE: REM.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| LAPS | Lapse for failure to pay maintenance fees | Free format text: PATENT EXPIRED FOR FAILURE TO PAY MAINTENANCE FEES (ORIGINAL EVENT CODE: EXP.); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
| STCH | Information on status: patent discontinuation | Free format text: PATENT EXPIRED DUE TO NONPAYMENT OF MAINTENANCE FEES UNDER 37 CFR 1.362 |
| FP | Lapsed due to failure to pay maintenance fee | Effective date: 20240510 |