WO2012026041A1 - Parallel computer, job information acquisition program for parallel computer, job information acquisition method for parallel computer, calculation device, and calculation management device - Google Patents
- Publication number
- WO2012026041A1 (PCT/JP2010/064639)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- job information
- calculation
- node
- identification number
- holding
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/46—Multiprogramming arrangements
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3466—Performance evaluation by tracing or monitoring
- G06F11/3495—Performance evaluation by tracing or monitoring for systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3404—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment for parallel or distributed programming
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/34—Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation ; Recording or statistical evaluation of user activity, e.g. usability assessment
- G06F11/3466—Performance evaluation by tracing or monitoring
- G06F11/3476—Data logging
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
- G06F2201/84—Using snapshots, i.e. a logical point-in-time copy of the data
Definitions
- The present invention relates to a parallel computer, a job information acquisition program for a parallel computer, a job information acquisition method for a parallel computer, a calculation device, and a calculation management device.
- A parallel computer connects, for example, a large number of computers (hereinafter simply referred to as calculation nodes) via a network, distributes calculation jobs to the individual calculation nodes, and executes them in parallel, so that the jobs can be processed at high speed. Demand for such parallel computers is therefore increasing rapidly.
- A parallel computer has a node (hereinafter simply referred to as a management node) that manages a group of calculation nodes composed of a plurality of calculation nodes.
- The management node side requires information such as the amount of each resource (CPU, memory, files, and so on) that each calculation node is currently using for the calculation job and the number of instructions executed by the calculation job (hereinafter simply referred to as job information).
- FIG. 14 is an explanatory diagram showing a snapshot acquisition method for a parallel computer.
- The current time is managed by the management node 112, which manages the plurality of calculation nodes 111. When the current time reaches a predetermined time, the management node 112 requests each calculation node 111 to acquire job information (step S211).
- Each calculation node 111 acquires the job information that it is in charge of in response to the job information acquisition request (step S212).
- Each calculation node 111 then transfers this job information to the management node 112 (step S213).
- The management node 112 of the parallel computer 110 shown in FIG. 14 can thus acquire the job information of each calculation node 111 at the same time (timing), that is, a snapshot.
- FIG. 15 is an explanatory diagram showing another snapshot acquisition method for the parallel computer 120.
- In this method, each calculation node 121 manages the current time.
- Each calculation node 121 acquires the job information that it is in charge of (step S221).
- The calculation node 121 transfers the job information to the management node 122 (step S222).
- The management node 122 of the parallel computer 120 shown in FIG. 15 can acquire the job information of each calculation node 121 at the same time (timing), that is, a snapshot.
- However, because job information is sent asynchronously from each calculation node 121, the job information transmitted from the calculation nodes 121 at the same time (timing) does not necessarily all reach the management node 122 by the time the next job information is acquired. As a result, job information from other times may arrive mixed in. That is, since the parallel computer 120 cannot grasp the job information of each calculation node 121 at the same timing, it cannot acquire an accurate snapshot.
- One aspect is to provide a parallel computer or the like that can acquire job information, taken at the same timing, about a job being executed on each calculation node of the parallel computer.
- The parallel computer disclosed in the present application includes a plurality of calculation nodes that distribute and execute calculation jobs in parallel, and a management node that manages the plurality of calculation nodes.
- In each calculation node, the acquisition unit acquires job information related to the calculation job handled by the calculation node itself according to a cycle timing common to the calculation nodes.
- The holding control unit on the calculation node side holds the job information in the holding unit on the calculation node side in association with an identification number that identifies the cycle timing at which the acquisition unit acquired the job information. When a deletion request is received from the management node, it deletes all the job information held in the holding unit.
- When the information transmission unit receives from the management node a transmission request for the job information related to a specified identification number, and the job information related to the specified identification number is in the holding unit, it transmits that job information to the management node.
- When the job information related to the specified identification number is not in the holding unit but the job information related to the immediately preceding identification number is, the information transmission unit transmits the job information related to the immediately preceding identification number to the management node. Further, when job information is received from each calculation node in response to the transmission request, the holding control unit on the management node side holds the received job information in the holding unit on the management node side. In addition, when the holding control unit on the management node side detects in the holding unit job information with the same identification number for all the calculation nodes, it holds the job information with that identification number as a snapshot.
- When holding the job information with the same identification number as a snapshot, the holding control unit on the management node side deletes from the holding unit on the management node side the job information other than the job information with that identification number.
- The deletion request unit transmits the deletion request to each calculation node.
- The holding unit on the calculation node side includes a holding area that can hold job information for a predetermined plurality of cycles, and the holding unit on the management node side includes, for each calculation node, a holding area that can hold job information for the predetermined plurality of cycles.
- FIG. 1 is a block diagram illustrating the parallel computer according to the first embodiment.
- FIG. 2 is a block diagram illustrating the parallel computer according to the second embodiment.
- FIG. 3 is an explanatory diagram of a parallel computer.
- FIG. 4 is an explanatory diagram of a job information acquisition cycle (time belt).
- FIG. 5 is an explanatory diagram showing the reason why the calculation side holding unit is divided into two generations.
- FIG. 6 is an explanatory diagram showing an example of operation transition related to snapshot acquisition of a parallel computer.
- FIG. 7 is an explanatory diagram showing an example of operation transition related to snapshot acquisition of a parallel computer.
- FIG. 8 is an explanatory diagram showing an example of operation transition related to snapshot acquisition of a parallel computer.
- FIG. 9 is a flowchart showing the processing operation in the representative node related to the job acquisition process on the representative node side.
- FIG. 10 is a flowchart showing the processing operation inside the computation node related to the computation node side job acquisition processing.
- FIG. 11 is a flowchart showing processing operations inside the management node related to the management node side snapshot acquisition processing.
- FIG. 12 is an explanatory diagram of the parallel computer according to the third embodiment.
- FIG. 13 is an explanatory diagram of a computer that executes a job information acquisition program of a parallel computer.
- FIG. 14 is an explanatory diagram showing a snapshot acquisition method for a parallel computer.
- FIG. 15 is an explanatory diagram showing another snapshot acquisition method for a parallel computer.
- FIG. 1 is a block diagram illustrating a parallel computer according to the first embodiment.
- The parallel computer 1A shown in FIG. 1 includes a plurality of calculation nodes 50 that distribute and execute calculation jobs in parallel, and a management node 60 that manages the plurality of calculation nodes 50.
- The calculation node 50 includes an acquisition unit 51, a holding unit 52, a holding control unit 53, and an information transmission unit 54.
- The acquisition unit 51 acquires job information related to the calculation job handled by the calculation node 50 according to the cycle timing common to the calculation nodes.
- The holding control unit 53 holds the job information in the holding unit 52 on the calculation node 50 side in association with the identification number that identifies the cycle timing at which the acquisition unit 51 acquired the job information. Further, when receiving a deletion request from the management node 60, the holding control unit 53 deletes all the job information held in the holding unit 52.
- The holding unit 52 includes a holding area that holds the node's own job information for a predetermined plurality of cycles, for example, two cycles (generations).
- When the information transmission unit 54 receives from the management node 60 a transmission request for the job information related to a specified identification number, and the job information related to the specified identification number is in the holding unit 52, it transmits that job information to the management node 60.
- When the job information related to the specified identification number is not in the holding unit 52 but the job information related to the immediately preceding identification number is, the information transmission unit 54 transmits the job information related to the immediately preceding identification number to the management node 60.
- The identification number immediately before the specified identification number corresponds to, for example, the identification number one generation before.
- The management node 60 includes a holding unit 61, a holding control unit 62, and a deletion request unit 63.
- The holding unit 61 includes, for each calculation node 50, a holding area that can hold job information for a predetermined plurality of cycles.
- When the holding control unit 62 receives job information from each calculation node 50 in response to the transmission request, it holds the received job information in the holding unit 61 on the management node 60 side. Further, when the holding control unit 62 detects in the holding unit 61 job information with the same identification number for all the calculation nodes 50, it holds the job information with that identification number as a snapshot.
- When holding the job information with the same identification number as a snapshot, the holding control unit 62 deletes from the holding unit 61 on the management node 60 side the job information other than the job information with that identification number.
- The deletion request unit 63 transmits a deletion request to each calculation node 50.
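- The management-node behaviour described above can be sketched roughly as follows. This is only an illustration under stated assumptions: the class name `ManagementHolder`, its method names, and the dictionary layout are hypothetical, job information is modelled as a plain `dict`, and the limit on how many cycles the holding unit 61 can hold is not modelled.

```python
from typing import Dict, Iterable, Optional


class ManagementHolder:
    """Hypothetical sketch of holding unit 61 plus holding control unit 62."""

    def __init__(self, node_ids: Iterable[str]) -> None:
        self.node_ids = frozenset(node_ids)
        # identification number -> node id -> job information (layout is an assumption)
        self.pending: Dict[int, Dict[str, dict]] = {}
        self.snapshot_id: Optional[int] = None
        self.snapshot: Dict[str, dict] = {}

    def receive(self, node_id: str, ident: int, job_info: dict) -> bool:
        """Hold received job information.

        Returns True when job information with the same identification
        number is now held for all calculation nodes, i.e. a snapshot was
        taken; the caller would then send the deletion request.
        """
        if node_id not in self.node_ids:
            raise KeyError(node_id)
        self.pending.setdefault(ident, {})[node_id] = job_info
        if set(self.pending[ident]) != self.node_ids:
            return False
        # Keep the complete set as the snapshot and delete everything else.
        self.snapshot_id = ident
        self.snapshot = self.pending[ident]
        self.pending = {}
        return True
```

- In this sketch the return value of `receive` plays the role of the trigger for the deletion request unit 63; how the request is actually delivered to the calculation nodes is left out.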
- In the first embodiment, the calculation node 50 acquires job information according to the cycle timing common to the calculation nodes, and holds the job information in the holding unit 52 in association with the identification number that identifies the cycle timing at which it was acquired. Further, when the management node 60 receives job information from each calculation node 50 in response to the transmission request, it holds the received job information in the holding unit 61 on the management node 60 side. When the management node 60 detects in the holding unit 61 job information with the same identification number for all the calculation nodes 50, it holds the job information with that identification number as a snapshot.
- Further, in the first embodiment, the holding unit 52 on the calculation node 50 side includes a holding area that can hold job information for a predetermined plurality of cycles, and the holding unit 61 on the management node 60 side includes, for each calculation node 50, a holding area that can hold job information for the predetermined plurality of cycles.
- Consequently, even when the timing at which job information is deleted differs from one calculation node 50 to another because of a transmission delay of the deletion request from the management node 60, it is possible to avoid a situation in which the job information of each calculation node 50 cannot be collected on the management node 60 side, and to guarantee an accurate snapshot of the calculation job being executed on the parallel computer 1A.
- FIG. 2 is a block diagram showing the parallel computer of the second embodiment, and FIG. 3 is an explanatory diagram of the parallel computer.
- The parallel computer 1 shown in FIG. 2 has a plurality of calculation nodes 3 connected to a network 2 and a management node 4 that manages the plurality of calculation nodes 3, and distributes calculation jobs to the individual calculation nodes 3 for execution in parallel.
- In this example, four calculation nodes 3 (3A to 3D) are used, but the number is not limited to this.
- The calculation node 3 corresponds to, for example, a computer and executes a calculation job.
- The calculation node 3 includes a calculation processing unit 11, a job information processing control unit 12, a calculation side communication unit 13, and a calculation side holding unit 14.
- The calculation processing unit 11 executes, among the distributed calculation jobs, the calculation job that its own node is in charge of.
- The calculation side communication unit 13 communicates with the management node 4 via the network 2.
- The calculation side holding unit 14 corresponds to, for example, a buffer, and includes a first holding area 14A and a second holding area 14B that hold job information for two generations, that is, two time belts.
- The job information processing control unit 12 includes a timing detection unit 21, an acquisition processing unit 22, a calculation side holding control unit 23, and an information transmission unit 24.
- The timing detection unit 21 detects the timing for acquiring the job information that its own node is in charge of.
- The timing detection unit 21 starts a timer operation in response to a job start command common to the calculation nodes 3.
- FIG. 4 is an explanatory diagram of a job information acquisition cycle (time belt).
- The timing detection unit 21 detects the job information acquisition timing using the cycle timing common to the calculation nodes 3, that is, the time belts of FIG. 4.
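- As a rough illustration of how a timer started at the common job start command could be mapped to a time belt, consider the sketch below. It assumes fixed-length time belts numbered from 1 (T1, T2, ...); the function name `time_belt_number` and the parameter `belt_seconds` are hypothetical, and the text does not state the belt length or whether acquisition happens at the start or the end of a belt.

```python
def time_belt_number(elapsed_seconds: float, belt_seconds: float) -> int:
    """Return the 1-based time belt number that contains elapsed_seconds.

    elapsed_seconds is the timer value measured from the job start command.
    belt_seconds is the (assumed fixed) length of one time belt.
    """
    if belt_seconds <= 0:
        raise ValueError("belt_seconds must be positive")
    if elapsed_seconds < 0:
        raise ValueError("elapsed_seconds must be non-negative")
    return int(elapsed_seconds // belt_seconds) + 1
```

- Because every calculation node derives the number from its own timer, a node whose job start command arrives late is still in an earlier time belt while the others have moved on, which is the kind of deviation the later description deals with.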
- The acquisition processing unit 22 acquires the job information that its own node is in charge of.
- The calculation side holding control unit 23 controls the calculation side holding unit 14 and holds in it the job information acquired by the acquisition processing unit 22.
- The job information includes the job information content, information presence/absence, node information, a time belt number, the information acquisition date and time, and the like.
- The job information content includes a job ID for identifying the job, the usage amount of each resource such as CPU, memory, and files used by the job that the node is in charge of, the number of instructions executed by the job, and the like.
- The information presence/absence indicates whether the job information content contains information. When it is "present", the job information content holds actual job information.
- When it is "absent", the job information corresponds to the error information described later.
- The node information corresponds to a node ID that identifies the calculation node 3 that is the source of the job information.
- The time belt number corresponds to a number that identifies the cycle timing, common to the calculation nodes 3, at which the job information was acquired.
- The information acquisition date and time corresponds to the date and time when the job information was acquired.
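- The fields listed above could be laid out as a record like the following. This is a minimal sketch, not the format used by the application: all class and field names are hypothetical, the resource units are unspecified in the text, and deriving the presence/absence flag from whether content exists is an assumption.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass(frozen=True)
class JobInfoContent:
    """Job information content (field names and units are assumptions)."""
    job_id: str
    cpu_usage: float
    memory_usage: int
    file_usage: int
    instructions_executed: int


@dataclass(frozen=True)
class JobInfoRecord:
    """One piece of job information held for a single time belt."""
    node_id: str                 # node information: source calculation node
    time_belt_number: int        # identifies the common cycle timing
    acquired_at: datetime        # information acquisition date and time
    content: Optional[JobInfoContent] = None

    @property
    def has_information(self) -> bool:
        """Presence/absence of information; False models the 'absent' case."""
        return self.content is not None
```

- For example, `JobInfoRecord("3B", 1, datetime(2010, 8, 27))` would be a record whose presence/absence flag reads as absent.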
- When the calculation side holding control unit 23 acquires the job information that its node is in charge of according to the job information acquisition timing, it determines whether or not the calculation side holding unit 14 has an empty holding area.
- The calculation side holding control unit 23 holds the job information in the calculation side holding unit 14 when there is a vacancy, and prohibits holding of the job information when there is none.
- The calculation side holding control unit 23 determines whether or not the job information of the specified time belt number is in the calculation side holding unit 14 in response to a transmission request for the specified time belt number, described later, from the management node 4. When the job information of the specified time belt number is in the calculation side holding unit 14, the calculation side holding control unit 23 transmits it to the management node 4 via the calculation side communication unit 13. When it is not, the calculation side holding control unit 23 determines whether or not job information one generation before the specified time belt number is in the calculation side holding unit 14.
- When job information one generation before the specified time belt number is present, the calculation side holding control unit 23 transmits that job information to the management node 4 via the calculation side communication unit 13. When there is no job information one generation before the specified time belt number either, the calculation side holding control unit 23 transmits error information to the management node 4 via the calculation side communication unit 13. Further, the calculation side holding control unit 23 deletes all the job information held in the calculation side holding unit 14 in response to a clear request, described later, from the management node 4.
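- The holding, transmission, and clear rules of the calculation side holding control unit 23 can be sketched as below. The class name `ComputeSideHolder`, the `ERROR` marker, and the use of a `dict` keyed by time belt number are hypothetical; the sketch also assumes that time belt numbers are consecutive integers, so that "one generation before" is simply the number minus one.

```python
from typing import Dict, Optional, Tuple

ERROR = "error"  # hypothetical stand-in for the error information


class ComputeSideHolder:
    """Hypothetical sketch of holding unit 14 and holding control unit 23."""

    def __init__(self, areas: int = 2) -> None:
        self.areas = areas                   # two holding areas (14A, 14B)
        self.held: Dict[int, dict] = {}      # time belt number -> job information

    def hold(self, belt: int, job_info: dict) -> bool:
        """Hold job information only when a holding area is vacant."""
        if len(self.held) >= self.areas:
            return False                     # no vacancy: holding is prohibited
        self.held[belt] = job_info
        return True

    def on_transmission_request(self, belt: int) -> Tuple[Optional[int], object]:
        """Return (time belt number, job information) to send, or (None, ERROR)."""
        if belt in self.held:
            return belt, self.held[belt]
        if belt - 1 in self.held:            # fall back to one generation before
            return belt - 1, self.held[belt - 1]
        return None, ERROR

    def on_clear_request(self) -> None:
        """Delete all held job information."""
        self.held.clear()
```

- Returning the time belt number alongside the job information lets the receiver tell a requested reply from a one-generation-old reply, which corresponds to the time belt number carried in the job information itself.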
- One of the calculation nodes, the calculation node 3A, is a representative node.
- The representative node has substantially the same internal configuration as the other calculation nodes 3, and additionally has the function described below.
- The job information processing control unit 12 of the representative node acquires job information according to the cycle timing common to the calculation nodes 3 and holds it in the calculation side holding unit 14. Further, the job information processing control unit 12 has a function of notifying the management node 4, via the calculation side communication unit 13, of the time belt number of the job information as the target of a transmission request when it holds the job information in the calculation side holding unit 14.
- The management node 4 corresponds to, for example, a computer, and is connected to each calculation node 3 via the network 2 to manage each calculation node 3.
- The management node 4 includes a management side processing unit 31, a snapshot processing control unit 32, a management side communication unit 33, and a management side holding unit 34.
- The management side processing unit 31 manages the distributed calculation nodes 3.
- The management side communication unit 33 communicates with each calculation node 3 via the network 2.
- The management side holding unit 34 corresponds to, for example, a buffer, and has, for each calculation node 3, a first holding area 34A, a second holding area 34B, and a third holding area 34C that hold job information for three generations, that is, three time belts.
- The first holding area 34A holds the job information of a snapshot, while the second holding area 34B and the third holding area 34C are used to temporarily hold job information until a snapshot is acquired.
- While it holds no snapshot job information, the first holding area 34A is used to temporarily hold job information in the same manner as the second holding area 34B and the third holding area 34C.
- The snapshot processing control unit 32 includes a transmission request unit 41, a reception information identification unit 42, a holding area monitoring unit 43, a clear request unit 44, and a management side holding control unit 45.
- When the transmission request unit 41 receives from the representative node the time belt number that is the target of a transmission request, it requests each calculation node 3, via the management side communication unit 33, to transmit the job information related to that time belt number.
- The reception information identification unit 42 identifies the information received from each calculation node 3 in response to the transmission request for the specified time belt number.
- The received information is, for example, job information of the specified time belt number, job information of the time belt number one generation before the specified time belt number, or error information received from the calculation node 3.
- The holding area monitoring unit 43 monitors the job information of each calculation node 3 held in the first holding area 34A, the second holding area 34B, and the third holding area 34C. Based on the monitoring result, the holding area monitoring unit 43 determines whether or not there is a new time belt number for which the job information of all the calculation nodes 3 is held. If there is, the management side holding control unit 45 determines that a new snapshot for that time belt number has been acquired, and updates the first holding area 34A with the job information of all the calculation nodes 3 for that time belt number.
- When a new snapshot is acquired, the management side holding control unit 45 deletes all the job information of each calculation node 3 held in the second holding area 34B and the third holding area 34C. Further, the clear request unit 44 requests, via the management side communication unit 33, all the calculation nodes 3 to clear all the job information held in their calculation side holding units 14.
- When the management node 4 detects a snapshot presentation request from a user terminal, it presents to the user terminal, as a snapshot, the job information of all the calculation nodes 3 with the same time belt number held in the first holding area 34A of the management side holding unit 34. The user can thus grasp the job information of each calculation node 3 for the calculation job currently being executed.
- FIG. 5 is an explanatory diagram showing the reason why the calculation side holding unit 14 is divided into two generations.
- If the clear request from the management node 4 arrives at the calculation node 3B while it is acquiring the job information of the time belt number T2, all the job information up to the time belt number T2 held in the calculation side holding unit 14 is deleted. As a result, the next job information that the calculation node 3B acquires is that of the time belt number T3.
- If the clear request from the management node 4 arrives at the calculation node 3C while it is acquiring the job information of the time belt number T3, all the job information up to the time belt number T3 held in the calculation side holding unit 14 is deleted. As a result, the next job information that the calculation node 3C acquires is that of the time belt number T4.
- The arrival timing of the clear request can thus produce a deviation of one time belt between calculation nodes 3. To absorb this deviation, the calculation side holding unit 14 of each calculation node 3 is provided with the first holding area 14A and the second holding area 14B as holding areas for job information for two time belts.
- The management side holding unit 34, in turn, is provided with areas for holding job information for three generations, that is, three time belts. For example, when the job information of all the calculation nodes 3 for the same time belt number T1 is held, that is, when a snapshot of the time belt number T1 is acquired, the job information of that time belt number is held in the first holding area 34A, and the second holding area 34B and the third holding area 34C are used until the job information of all the calculation nodes 3 for the next time belt number is held. As described above, when the deviation between the calculation nodes 3 with respect to the clear request is one generation, the job information sent from each calculation node 3 to the management node 4 also deviates by one generation. Accordingly, the management side holding unit 34 uses the first holding area 34A to hold the snapshot job information, and is provided with the second holding area 34B and the third holding area 34C as areas for holding job information for two time belts in order to absorb the deviation of one time belt.
- FIGS. 6 to 8 are explanatory diagrams showing an example of operation transitions related to snapshot acquisition of the parallel computer 1A.
- Four calculation nodes 3 (3A to 3D) are used, and the calculation node 3A is the representative node.
- Each of the calculation nodes 3A, 3C, and 3D acquires job information according to the timing of the time belt number T1, measured from the job start command, and holds the job information in the calculation side holding unit 14.
- As a result, the job information of the time belt number T1 is held in the first holding area 14A of each of the calculation nodes 3A, 3C, and 3D.
- The calculation node 3B, in contrast, received the job start command late for some reason and could not acquire the job information of the time belt number T1, so nothing is held in its first holding area 14A.
- Since the calculation node 3A is the representative node, it notifies the management node 4 of the time belt number T1 when the job information of the time belt number T1 is held in its calculation side holding unit 14 (step S11). When receiving the time belt number T1 from the calculation node 3A, the management node 4 requests all the calculation nodes 3 to transmit the job information of the time belt number T1 (step S12).
- When each calculation node 3 receives the transmission request for the job information of the time belt number T1, it determines whether or not the job information of the time belt number T1 is in its calculation side holding unit 14.
- The calculation nodes 3A, 3C, and 3D, which have the job information of the time belt number T1 in the calculation side holding unit 14, transmit it to the management node 4 (step S13).
- The calculation node 3B, which has neither the job information of the time belt number T1 nor job information one generation before in the calculation side holding unit 14, transmits error information to the management node 4 (step S13A).
- When the management node 4 receives the job information of the time belt number T1 from the calculation nodes 3A, 3C, and 3D, it holds the job information of the time belt number T1 in the first holding area 34A corresponding to the calculation nodes 3A, 3C, and 3D. When it receives the error information from the calculation node 3B, the management node 4 holds nothing in the first holding area 34A corresponding to the calculation node 3B.
- Each of the calculation nodes 3A, 3C, and 3D then acquires the job information of the time belt number T2 according to the timing of the time belt number T2, and holds it in the second holding area 14B of the calculation side holding unit 14.
- The calculation node 3B acquires the job information of the time belt number T1 according to the timing of the time belt number T1, and holds it in the first holding area 14A of the calculation side holding unit 14.
- Since the calculation node 3A is the representative node, it notifies the management node 4 of the time belt number T2 (step S14).
- The management node 4 requests all the calculation nodes 3 to transmit the job information of the time belt number T2 (step S15).
- Each calculation node 3 determines whether or not the job information of the time belt number T2 is in its calculation side holding unit 14.
- Each of the calculation nodes 3A, 3C, and 3D, which have the job information of the time belt number T2 in the calculation side holding unit 14, transmits it to the management node 4 (step S16). The calculation node 3B, which does not have the job information of the time belt number T2 but does have the job information one generation before, that is, of the time belt number T1, in the calculation side holding unit 14, transmits the job information of the time belt number T1 to the management node 4 (step S16A).
- When the management node 4 receives the job information of the time belt number T2 from the calculation nodes 3A, 3C, and 3D, it holds the job information of the time belt number T2 in the second holding area 34B corresponding to the calculation nodes 3A, 3C, and 3D. When it receives the job information of the time belt number T1 from the calculation node 3B, the management node 4 holds it in the first holding area 34A corresponding to the calculation node 3B. As a result, the job information of all the calculation nodes 3 for the time belt number T1 is held in the first holding area 34A, that is, the snapshot of the time belt number T1 is acquired.
- The management node 4 requests all the calculation nodes 3 to clear all the job information held in their calculation side holding units 14 (step S17). Further, the management node 4 deletes all the job information held in the second holding area 34B and the third holding area 34C while keeping the job information of the time belt number T1 in the first holding area 34A (step S18).
- When each calculation node 3 receives the clear request from the management node 4, it erases all the job information held in the first holding area 14A and the second holding area 14B (step S19).
- each of the calculation nodes 3A, 3C, and 3D acquires job information according to the timing of the time belt number T4, and holds the job information of the time belt number T4 in the first holding area 14A.
- the calculation node 3B acquires job information according to the timing of the time belt number T3 and holds the job information in the first holding area 14A.
- Because the calculation node 3A is the representative node, it notifies the management node 4 of the time belt number T4 (step S20).
- the management node 4 requests all the calculation nodes 3 to transmit the job information of the time belt number T4 (step S21).
- Each calculation node 3 determines whether or not the job information of time belt number T4 is in the calculation side holding unit 14.
- The calculation nodes 3A, 3C, and 3D, which have the job information of time belt number T4 in the calculation side holding unit 14, transmit that job information to the management node 4 (step S22).
- When the management node 4 receives the job information of time belt number T4 from the calculation nodes 3A, 3C, and 3D, it holds that job information in the second holding area 34B in association with the calculation nodes 3A, 3C, and 3D. Further, upon receiving the job information of time belt number T3 from the calculation node 3B, the management node 4 holds it in the second holding area 34B in association with the calculation node 3B. The first holding area 34A continues to hold the job information of all the calculation nodes 3 for time belt number T1 as the snapshot.
- each of the calculation nodes 3A, 3C, and 3D acquires job information according to the timing of the time belt number T5, and holds the job information of the time belt number T5 in the second holding area 14B.
- the calculation node 3B acquires job information according to the timing of the time belt number T4, and holds the job information of the time belt number T4 in the second holding area 14B.
- Because the calculation node 3A is the representative node, it notifies the management node 4 of the time belt number T5 (step S23).
- the management node 4 requests all the calculation nodes 3 to transmit the job information of the time belt number T5 (step S24).
- When each calculation node 3 receives the transmission request for the job information of time belt number T5, it determines whether or not that job information is in the calculation side holding unit 14.
- The calculation nodes 3A, 3C, and 3D, which have the job information of time belt number T5 in the calculation side holding unit 14, transmit that job information to the management node 4 (step S25).
- The calculation node 3B, which does not have the job information of time belt number T5 but holds the job information of the previous generation, that is, time belt number T4, transmits the job information of time belt number T4 to the management node 4 (step S25A).
- When the management node 4 receives the job information of time belt number T5 from the calculation nodes 3A, 3C, and 3D, it holds that job information in the third holding area 34C in association with the calculation nodes 3A, 3C, and 3D. Further, upon receiving the job information of time belt number T4 from the calculation node 3B, the management node 4 holds it in the third holding area 34C in association with the calculation node 3B.
- As a result, the job information of time belt number T4 for the calculation nodes 3A, 3C, and 3D in the second holding area 34B, together with the job information of time belt number T4 for the calculation node 3B in the third holding area 34C, makes up the job information of all the calculation nodes 3 at time belt number T4. That is, the snapshot of time belt number T4 is acquired.
- The management node 4 requests all the calculation nodes 3 to clear all the job information held in their calculation side holding units 14 (step S26).
- The management node 4 overwrites the job information of time belt number T1 in the first holding area 34A with the job information of time belt number T4, and erases all the job information held in the second holding area 34B and the third holding area 34C (step S27).
- Each calculation node 3 receives the clear request from the management node 4 and erases all the job information held in the first holding area 14A and the second holding area 14B (step S28). By repeating this series of processing operations, the latest snapshot is always held in the first holding area 34A of the management node 4. As a result, when the management node 4 detects a snapshot presentation request from the user terminal, it can present the snapshot held in the first holding area 34A as the latest snapshot.
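The T4 and T5 rounds described above can be replayed as a small worked example. This is only a sketch: the function name `complete_time_belts` and the dictionary layout (node name mapped to the time belt number it reported) are illustrative assumptions, not part of the specification.

```python
from collections import defaultdict

NODES = {"3A", "3B", "3C", "3D"}

def complete_time_belts(areas, nodes=NODES):
    """Return the time belt numbers for which every node has an entry in some area."""
    holders = defaultdict(set)
    for area in areas:
        for node, belt in area.items():
            holders[belt].add(node)
    return sorted(belt for belt, who in holders.items() if who == set(nodes))

# After the T4 round (step S22): node 3B is one generation behind and reports T3.
area_34b = {"3A": 4, "3C": 4, "3D": 4, "3B": 3}
area_34c = {}
after_t4 = complete_time_belts([area_34b, area_34c])

# After the T5 round (steps S25 and S25A): node 3B supplies T4 as its previous generation.
area_34c = {"3A": 5, "3C": 5, "3D": 5, "3B": 4}
after_t5 = complete_time_belts([area_34b, area_34c])

print(after_t4, after_t5)  # prints: [] [4]
```

No time belt number is complete after the T4 round, while T4 becomes complete once node 3B's delayed entry arrives in the T5 round, which matches the point at which the text says the T4 snapshot is acquired.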
- FIG. 9 is a flowchart showing the processing operation of the computation node 3A related to the representative node side job acquisition processing.
- the timing detection unit 21 in the job information processing control unit 12 of the computation node 3A determines whether or not the job information acquisition timing has been detected (step S51).
- When the acquisition timing of the job information is detected (Yes at step S51), the acquisition processing unit 22 in the job information processing control unit 12 executes the job information acquisition process (step S52A) and determines whether or not the job information that the node is in charge of has been acquired (step S52).
- When the job information that the node is in charge of has been acquired (Yes at step S52), the calculation-side holding control unit 23 in the job information processing control unit 12 determines whether or not there is a vacancy in the calculation-side holding unit 14 (step S53). If there is a vacancy in the calculation side holding unit 14 (Yes at step S53), the calculation side holding control unit 23 holds the job information in the calculation side holding unit 14 in association with its time belt number (step S54).
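- After the management node 4 has been notified of the time belt number as the transmission request target (step S55), the calculation-side holding control unit 23 determines whether or not a job information transmission request specifying the time belt number to be transmitted has been received from the management node 4 (step S56).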
- When the transmission request has been received (Yes at step S56), the calculation side holding control unit 23 transmits the job information for the requested time belt number, held in the calculation side holding unit 14, to the management node 4.
- The calculation side holding control unit 23 determines whether or not a clear request has been received from the management node 4 (step S58). When the clear request has been received (Yes at step S58), the calculation-side holding control unit 23 deletes all the job information held in the calculation-side holding unit 14 (step S59) and returns to step S51 to determine whether the job information acquisition timing has been detected.
- When no clear request has been received (No at step S58), the calculation side holding control unit 23 determines whether or not the acquisition timing of the job information has been detected (step S60). If the acquisition timing has not been detected (No at step S60), the calculation-side holding control unit 23 returns to step S58 to determine whether or not a clear request has been received. If the acquisition timing has been detected (Yes at step S60), the process proceeds to step S52A to execute the job information acquisition process.
- If the timing for acquiring job information is not detected (No at step S51), the timing detection unit 21 returns to step S51 to continue monitoring the acquisition timing. If the job information cannot be acquired (No at step S52), the acquisition processing unit 22 returns to step S51 to detect the next acquisition timing.
- When there is no vacancy in the calculation side holding unit 14 (No at step S53), the calculation side holding control unit 23 does not hold the job information of that time belt number in the calculation side holding unit 14 (step S61), and the process returns to step S51 to detect the next acquisition timing.
- Because step S56 is executed by the representative node, which itself notifies the management node 4 of the time belt number that prompts the transmission request, the transmission request from the management node 4 is always received under normal operation.
- In the representative node side job acquisition process shown in FIG. 9, when the representative node acquires job information at the acquisition timing common to the calculation nodes, it determines whether or not there is a vacancy in the calculation side holding unit 14. If there is a vacancy, the job information is held in the calculation side holding unit 14 in association with the time belt number that identifies the acquisition timing. As a result, the representative node can hold up to two generations of job information in association with their time belt numbers.
- In the representative node side job acquisition process, when job information is held in the calculation side holding unit 14 in association with a time belt number, the management node 4 is notified of that time belt number as the transmission request target. As a result, the representative node can notify the management node 4 of the time belt number of the job information to be transmitted.
- In the representative node side job acquisition process, in response to a transmission request from the management node 4 for the job information of a specified time belt number, the job information of the specified time belt number is transmitted to the management node 4. As a result, the representative node can transmit the requested job information to the management node 4.
- In the representative node side job acquisition process, when a clear request is received from the management node 4, all the job information held in the calculation side holding unit 14 is deleted. As a result, the representative node can hold new job information in the calculation side holding unit 14 so that the latest snapshot can be acquired on the management node 4 side.
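The hold-then-notify behavior of the representative node can be sketched as follows. The function `hold_and_notify`, the `notify` callback, and the use of a plain dictionary as the holding unit are hypothetical names chosen for illustration; only the two-generation capacity and the order of operations come from the description.

```python
def hold_and_notify(held, belt, job_info, notify, capacity=2):
    """Hold job information if there is a vacancy, then notify the time belt number.

    held     -- dict standing in for the calculation side holding unit 14
    notify   -- callback standing in for the notification to the management node 4
    capacity -- number of generations the holding unit can keep (two in the text)
    """
    if len(held) >= capacity:
        return False          # no vacancy: the job information is not held
    held[belt] = job_info     # hold in association with the time belt number
    notify(belt)              # report the time belt number as the transmission request target
    return True

held = {}
notified = []
results = [hold_and_notify(held, belt, f"job@T{belt}", notified.append)
           for belt in (4, 5, 6)]
print(results, notified)  # prints: [True, True, False] [4, 5]
```

The third acquisition is dropped because both generations are occupied until a clear request empties the holding unit.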
- FIG. 10 is a flowchart showing the processing operation of the computation node 3 related to the computation node side job acquisition processing.
- The timing detection unit 21 in the job information processing control unit 12 of the calculation node 3 determines whether or not the job information acquisition timing has been detected (step S71). When the acquisition timing is detected (Yes at step S71), the acquisition processing unit 22 executes the job information acquisition process (step S72) and determines whether or not the job information that the node is in charge of has been acquired (step S73).
- When the job information that the node is in charge of has been acquired (Yes at step S73), the calculation side holding control unit 23 determines whether or not there is a vacancy in the calculation side holding unit 14 (step S74). If there is a vacancy (Yes at step S74), the calculation side holding control unit 23 holds the job information in the calculation side holding unit 14 in association with its time belt number (step S75).
- The calculation side holding control unit 23 determines whether or not a job information transmission request specifying the time belt number to be transmitted has been received from the management node 4 (step S76). When the transmission request has been received (Yes at step S76), the calculation side holding control unit 23 determines whether or not the job information of the requested time belt number is in the calculation side holding unit 14 (step S77).
- When the job information of the requested time belt number is in the calculation side holding unit 14 (Yes at step S77), the information transmission unit 24 transmits that job information to the management node 4 (step S78).
- the calculation side holding control unit 23 determines whether or not a clear request is received from the management node 4 (step S79).
- When the clear request has been received (Yes at step S79), the calculation-side holding control unit 23 deletes all the job information held in the calculation-side holding unit 14 (step S80) and returns to step S71 to determine whether the job information acquisition timing has been detected.
- When no clear request has been received (No at step S79), the calculation-side holding control unit 23 determines whether the job information acquisition timing has been detected (step S81). If the acquisition timing has not been detected (No at step S81), the calculation-side holding control unit 23 returns to step S79 to determine whether or not a clear request has been received. If the acquisition timing has been detected (Yes at step S81), the process proceeds to step S72 to execute the job information acquisition process.
- If the timing detection unit 21 does not detect the acquisition timing of the job information (No at step S71), it returns to step S71 to continue monitoring the acquisition timing. Further, when the job information cannot be acquired (No at step S73), the acquisition processing unit 22 returns to step S71 to detect the next acquisition timing.
- When there is no vacancy in the calculation side holding unit 14 (No at step S74), the calculation side holding control unit 23 does not hold the job information of that time belt number in the calculation side holding unit 14 (step S82), and the process returns to step S71 to detect the next acquisition timing.
- When no job information transmission request has been received (No at step S76), the calculation-side holding control unit 23 proceeds to step S79 to determine whether or not a clear request has been received.
- When the job information of the requested time belt number is not in the calculation side holding unit 14 (No at step S77), the calculation side holding control unit 23 determines whether or not the job information of the previous generation of that time belt number is in the calculation side holding unit 14 (step S83). If the time belt number of the transmission request is T3, for example, the job information of the previous generation is the job information of time belt number T2.
- When the job information of the previous generation is in the calculation side holding unit 14 (Yes at step S83), the calculation side holding control unit 23 transmits the job information of the previous generation to the management node 4 (step S84) and proceeds to step S79 to determine whether or not a clear request has been received.
- When the job information of the previous generation is not in the calculation side holding unit 14 (No at step S83), the calculation side holding control unit 23 transmits error information to the management node 4 (step S85) and returns to step S71 to determine whether the job information acquisition timing has been detected.
- In the calculation node side job acquisition process shown in FIG. 10, when the calculation node 3 acquires job information at the acquisition timing common to the calculation nodes, it determines whether or not there is a vacancy in the calculation side holding unit 14. If there is a vacancy, the job information is held in the calculation side holding unit 14 in association with the time belt number that identifies the acquisition timing. As a result, the calculation node 3 can hold up to two generations of job information in association with their time belt numbers.
- In the calculation node side job acquisition process, in response to a transmission request from the management node 4 for the job information of a specified time belt number, it is determined whether or not the job information of the specified time belt number is in the calculation side holding unit 14. When that job information is in the calculation side holding unit 14, it is transmitted to the management node 4. As a result, the calculation node 3 can transmit the job information of the specified time belt number to the management node 4 in response to the transmission request.
- the calculation node 3 can also transmit the job information of the previous generation to the management node 4 in order to absorb the deviation between the calculation nodes 3 due to the transmission delay of the clear request, for example.
- In the calculation node side job acquisition process, error information is transmitted to the management node 4 when the job information of the previous generation is not in the calculation side holding unit 14 either. As a result, the calculation node 3 can notify the management node 4 that there is no job information that can be transmitted.
- In the calculation node side job acquisition process, when a clear request is received from the management node 4, all the job information held in the calculation side holding unit 14 is deleted. As a result, the calculation node 3 can hold new job information in the calculation side holding unit 14 so that the latest snapshot can be acquired on the management node 4 side.
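The calculation node side behavior of FIG. 10 can be summarized in a short sketch. The class `ComputeNode`, its method names, and the tuple-shaped replies are assumptions made for illustration; time belt numbers are modeled as integers so that the previous generation is simply `belt - 1`.

```python
class ComputeNode:
    CAPACITY = 2  # two generations, corresponding to holding areas 14A and 14B

    def __init__(self, name):
        self.name = name
        self.held = {}  # time belt number -> job information

    def on_acquisition(self, belt, job_info):
        """Steps S74, S75, S82: hold the job information only if there is a vacancy."""
        if len(self.held) >= self.CAPACITY:
            return False
        self.held[belt] = job_info
        return True

    def on_transmission_request(self, belt):
        """Steps S77, S78, S83 to S85: requested generation, else previous, else error."""
        if belt in self.held:
            return ("job", belt, self.held[belt])
        if belt - 1 in self.held:
            return ("job", belt - 1, self.held[belt - 1])
        return ("error", belt, None)

    def on_clear_request(self):
        """Step S80: erase everything held."""
        self.held.clear()

# Node 3B lags one time belt behind the others, as in the walkthrough.
node_3b = ComputeNode("3B")
node_3b.on_acquisition(3, "B@T3")
node_3b.on_acquisition(4, "B@T4")
lagging_reply = node_3b.on_transmission_request(5)   # falls back to T4
node_3b.on_clear_request()
empty_reply = node_3b.on_transmission_request(5)     # nothing left to send
print(lagging_reply, empty_reply)
```

Returning the time belt number alongside the job information lets the receiver tell a requested-generation reply from a previous-generation one without extra state.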
- FIG. 11 is a flowchart showing the processing operation of the management node 4 related to the management node side snapshot processing.
- the snapshot processing control unit 32 in the management node 4 determines whether or not the time belt number to be transmitted is received from the representative calculation node 3A (step S91).
- When the time belt number to be transmitted has been received (Yes at step S91), the transmission request unit 41 of the snapshot processing control unit 32 requests all the calculation nodes 3 to transmit the job information of that time belt number (step S92).
- the reception information identification unit 42 in the snapshot processing control unit 32 determines whether or not the information received from each calculation node 3 is error information (step S93). If the received information is not error information (No at Step S93), the received information identifying unit 42 determines whether the received information is job information (Step S94). When the received information is job information (Yes in step S94), the management side holding control unit 45 in the snapshot processing control unit 32 holds the job information in the management side holding unit 34 corresponding to the calculation node 3 ( Step S95). Then, the reception information identification unit 42 determines whether or not the information reception from all the computation nodes 3 requested to transmit has been completed (step S96).
- When information reception from all the calculation nodes 3 has not been completed (No at step S96), the reception information identification unit 42 determines that there is still unidentified received information and returns to step S93 to determine whether the received information is error information.
- When information reception from all the calculation nodes 3 has been completed (Yes at step S96), the holding area monitoring unit 43 in the snapshot processing control unit 32 determines, based on the contents of the management side holding unit 34, whether or not there is a new time belt number for which the job information of all the calculation nodes 3 is held (step S97).
- When there is a new time belt number for which the job information of all the calculation nodes 3 is held (Yes at step S97), the holding area monitoring unit 43 determines that a snapshot of that time belt number has been newly acquired. Further, because the snapshot has been newly acquired, the transmission request unit 41 requests all the calculation nodes 3 to clear the job information held in their calculation side holding units 14 (step S98).
- The management side holding control unit 45 registers the job information of all the calculation nodes 3 having the same time belt number in the first holding area 34A as the new snapshot, replacing the previous one (step S99). Further, the management-side holding control unit 45 deletes all the job information of each calculation node 3 held in the second holding area 34B and the third holding area 34C (step S100), and ends the processing operation of FIG. 11.
- When the snapshot processing control unit 32 does not receive the time belt number to be transmitted (No at step S91), it ends the processing operation of FIG. 11. Further, when the received information is error information (Yes at step S93), the reception information identifying unit 42 proceeds to step S96 to determine whether identification of the received information from all the calculation nodes 3 has been completed.
- the holding area monitoring unit 43 ends the processing operation of FIG. 11 when there is no new time belt number that can hold the job information of all the calculation nodes 3 (No at Step S97).
- In the management node side snapshot processing shown in FIG. 11, when the management node 4 receives the time belt number subject to the transmission request from the representative node, it requests each calculation node 3 to transmit the job information of that time belt number. As a result, the management node 4 can issue a job information transmission request for the designated time belt number to each calculation node 3 in accordance with the time belt number notified by the representative node.
- In the management node side snapshot processing, the management node 4 determines whether the information received from each calculation node 3 in response to the transmission request is job information. If it is, the job information is determined to belong to either the specified time belt number or the previous generation time belt number, and is held in the management-side holding unit 34 in association with the calculation node 3. As a result, the management node 4 can hold three generations of job information for each calculation node 3 in the management side holding unit 34.
- In the management node side snapshot processing, when the management-side holding unit 34 contains a new time belt number for which the job information of all the calculation nodes 3 is held, the management node 4 determines that a snapshot of that time belt number has been newly acquired. The management node 4 then requests all the calculation nodes 3 to clear the job information held in their calculation side holding units 14. The management node 4 registers the job information of all the calculation nodes 3 having the same time belt number in the first holding area 34A as the new snapshot, and deletes the job information of each calculation node 3 held in the second holding area 34B and the third holding area 34C.
- As a result, because the management node 4 holds the snapshot consisting of the job information of the same time belt number in the first holding area 34A, it can present the latest snapshot to the user. Furthermore, by deleting the job information in the second holding area 34B and the third holding area 34C, the management node 4 can use the second holding area 34B and the third holding area 34C as temporary holding areas for job information.
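The management node side flow of FIG. 11 (steps S93 to S100) can be sketched as below. The class `ManagementNode`, the `collect` method, and the reply tuples are hypothetical. As a simplifying assumption, each collection round fills one temporary area in turn; the specification does not state how an area is chosen for a given reply.

```python
class ManagementNode:
    def __init__(self, nodes):
        self.nodes = set(nodes)
        self.snapshot = None   # first holding area 34A: (time belt number, {node: info})
        self.temp = [{}, {}]   # second and third holding areas 34B and 34C
        self.round = 0

    def collect(self, replies):
        """Process one round of replies; return True when a clear request should be sent."""
        area = self.temp[self.round % len(self.temp)]
        area.clear()
        self.round += 1
        for node, (kind, belt, info) in replies.items():
            if kind == "job":                 # S93/S94: error information is skipped
                area[node] = (belt, info)     # S95: hold per calculation node
        by_belt = {}                          # S97: group held entries by time belt number
        for held in self.temp:
            for node, (belt, info) in held.items():
                by_belt.setdefault(belt, {})[node] = info
        complete = [b for b, entries in by_belt.items() if set(entries) == self.nodes]
        if not complete:
            return False
        newest = max(complete)
        self.snapshot = (newest, by_belt[newest])   # S99: register the new snapshot
        self.temp = [{}, {}]                        # S100: erase the temporary areas
        self.round = 0
        return True                                 # S98: clear request to all nodes

manager = ManagementNode(["3A", "3B", "3C", "3D"])
t4_round = {"3A": ("job", 4, "A4"), "3C": ("job", 4, "C4"),
            "3D": ("job", 4, "D4"), "3B": ("job", 3, "B3")}
t5_round = {"3A": ("job", 5, "A5"), "3C": ("job", 5, "C5"),
            "3D": ("job", 5, "D5"), "3B": ("job", 4, "B4")}
clear_after_t4 = manager.collect(t4_round)
clear_after_t5 = manager.collect(t5_round)
print(clear_after_t4, clear_after_t5, manager.snapshot[0])  # prints: False True 4
```

Keeping the snapshot separate from the temporary areas means a presentation request can always be answered from `snapshot`, even while a new round is only partly collected.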
- In the second embodiment, the calculation node 3 acquires job information at the cycle timing common to the calculation nodes, and holds the job information in the calculation side holding unit 14 in association with the time belt number that identifies the cycle timing at which it was acquired. Furthermore, in the second embodiment, when the management node 4 receives job information from each calculation node 3 in response to the transmission request, it holds the received job information in the management side holding unit 34. In the second embodiment, when the management node 4 detects in the management side holding unit 34 job information of the calculation nodes 3 having the same time belt number, it holds the job information having that time belt number as a snapshot.
- In the second embodiment, the calculation-side holding unit 14 has a holding area that can hold two generations of job information, and the management-side holding unit 34 has a holding area that can hold three generations of job information for each calculation node 3.
- As a result, even when the timing at which job information is erased differs among the calculation nodes 3 because of the transmission delay of the clear request from the management node 4, the situation in which the job information of each calculation node 3 cannot be collected on the management node 4 side is avoided, and snapshot acquisition can be guaranteed.
- In the second embodiment, one of the plurality of calculation nodes 3 serves as the representative node, and when the representative node notifies the management node 4 of the time belt number to be transmitted, the management node 4 requests the job information using that time belt number as a key. As a result, only the single representative node needs to send the notification that triggers the job information transmission request, so the communication load for acquiring the snapshot can be reduced.
- In the second embodiment, the number of calculation nodes 3 is four, but the number is not limited to this. Moreover, in the said Example 2, although 1 unit
- In the second embodiment, the calculation-side holding unit 14 has a holding area for holding two generations of job information, and the management-side holding unit 34 has a holding area for holding three generations of job information. However, the calculation side holding unit 14 may instead be provided with a holding area for three generations, and the management side holding unit 34 with a holding area for four generations.
- The time required for the clear request from the management node 4 to reach each calculation node 3 and for the job information to be erased is measured for each calculation node 3, and the maximum deviation time between the calculation nodes 3 is calculated from the measurement results. When the maximum deviation time is sufficiently shorter than the time belt interval, a holding area for holding two generations of job information is prepared in the calculation side holding unit 14.
- When n = 1, a holding area for holding job information for three generations is prepared in the calculation-side holding unit 14, and a holding area for holding job information for four generations is prepared in the management-side holding unit 34.
- FIG. 12 is an explanatory diagram showing a parallel computer having a three-stage configuration.
- the parallel computer 1B shown in FIG. 12 has twelve calculation nodes 3A to 3L, three sub management nodes 4B to 4D, and one management node 4A.
- the sub management node 4B relays and manages the four calculation nodes 3A to 3D.
- the sub management node 4C relays and manages the four calculation nodes 3E to 3H.
- the sub management node 4D relays and manages the four calculation nodes 3I to 3L.
- the management node 4A manages the three sub management nodes 4B to 4D.
- the calculation side holding unit 14 of each calculation node 3A to 3L has a first holding area 14A and a second holding area 14B.
- Each of the sub-management nodes 4B to 4D has a first holding area 34D, a second holding area 34E, and a third holding area 34F that hold job information of four calculation nodes for three generations.
- The management-side holding unit 34 of the management node 4A has a first holding area 34A, a second holding area 34B, and a third holding area 34C, which hold three generations of job information of the same time belt number for the 12 calculation nodes 3A to 3L.
- Each of the calculation nodes 3A to 3L acquires the job information at the common cycle timing from the job start command, and holds the job information in the calculation side holding unit 14.
- Each of the sub management nodes 4B, 4C, and 4D collects job information from the calculation nodes it manages: 3A to 3D, 3E to 3H, and 3I to 3L, respectively.
- the sub management nodes 4B, 4C, and 4D hold the collected job information. Further, the sub management nodes 4B, 4C and 4D collectively transmit the job information of the calculation nodes 3A to 3D (3E to 3H and 3I to 3L) to the management node 4A.
- the management node 4A does not communicate with each of the calculation nodes 3A to 3L, but collects job information of the calculation nodes 3A to 3L through communication with the sub management nodes 4B, 4C, and 4D.
- the management node 4A communicates with the sub-management nodes 4B, 4C, and 4D to collect job information of the calculation nodes 3A to 3L, so that the communication frequency can be reduced and the communication load can be reduced.
- The three-layer structure of the management node 4A, the sub-management nodes 4B to 4D, and the calculation nodes 3A to 3L has been described above. However, the structure is not limited to three layers, and the hierarchy may have four or more layers.
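The reduction in communication achieved by the sub management nodes can be illustrated with a short sketch. The function `relay_job_info` and the nested dictionary layout are assumptions for illustration; the grouping of twelve calculation nodes under three sub management nodes follows FIG. 12 as described.

```python
def relay_job_info(groups, belt):
    """Collect one time belt's job information through the sub management nodes.

    groups -- {sub management node: {calculation node: {time belt number: job info}}}
    Returns the merged job information and the number of batches sent to the top node.
    """
    merged = {}
    batches_sent = 0
    for sub_manager, nodes in groups.items():
        # Each sub management node gathers its own calculation nodes' job information...
        batch = {name: held[belt] for name, held in nodes.items() if belt in held}
        # ...and forwards it to the management node 4A as a single batch.
        merged.update(batch)
        batches_sent += 1
    return merged, batches_sent

names = ["3" + chr(ord("A") + i) for i in range(12)]   # 3A to 3L
sub_managers = ["4B", "4C", "4D"]
groups = {
    sub: {name: {4: f"job-{name}@T4"} for name in names[i * 4:(i + 1) * 4]}
    for i, sub in enumerate(sub_managers)
}
merged, batches_sent = relay_job_info(groups, belt=4)
print(len(merged), batches_sent)  # prints: 12 3
```

In this arrangement the management node 4A handles three batches per round instead of twelve individual replies.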
- each component of each part illustrated does not necessarily need to be physically configured as illustrated.
- The specific form of distribution and integration of each part is not limited to the one shown in the figures, and all or a part thereof may be functionally or physically distributed or integrated in arbitrary units according to various loads and usage conditions.
- Furthermore, all or any part of the various processing functions performed in each device may be executed on a CPU (Central Processing Unit) (or a microcomputer such as an MPU (Micro Processing Unit) or MCU (Micro Controller Unit)).
- Needless to say, all or any part of the various processing functions may also be executed by a program that is analyzed and executed by a CPU (or a microcomputer such as an MPU or MCU), or by hardware based on wired logic.
- FIG. 13 is an explanatory diagram of a computer that executes a job information acquisition program of a parallel computer.
- HDD hard disk drive
- RAM random access memory
- ROM read only memory
- the ROM 230 stores in advance a job information acquisition program on the calculation node side that performs the same function as in the above embodiment.
- The job information acquisition program on the calculation node side includes an acquisition program 231, a holding program 232, an information transmission program 233, and an erasure program 234.
- the programs 231 to 234 may be appropriately integrated or distributed in the same manner as each component of the calculation node 50 shown in FIG.
- the CPU 240 reads these programs 231 to 234 from the ROM 230 and executes them.
- the programs 231 to 234 function as an acquisition process 241, a holding process 242, an information transmission process 243, and an erasing process 244.
- the computer 200A is configured by connecting an HDD 210A, a RAM 220A, a ROM 230A, and a CPU 240A via a bus 250A.
- the ROM 230A stores in advance a job information acquisition program on the management node side that performs the same function as in the above-described embodiment.
- the management node side job information acquisition program includes a holding program 231A, a snapshot holding program 232A, an erasing program 233A, and an erasing request program 234A.
- the programs 231A to 234A may be appropriately integrated or distributed in the same manner as each component of the management node 60 shown in FIG.
- the CPU 240A reads these programs 231A to 234A from the ROM 230A and executes them.
- the programs 231A to 234A function as a holding process 241A, a snapshot holding process 242A, an erasing process 243A, and an erasing request process 244A.
- The CPU 240 acquires job information related to the calculation job handled by the calculation node itself, according to the cycle timing common to the calculation nodes. The CPU 240 then holds the job information in a holding unit in the RAM 220, which can hold job information for a predetermined plurality of cycles, in association with an identification number identifying the cycle timing at which the job information was acquired. When the CPU 240 receives from the management node a transmission request for job information related to a specified identification number, and that job information is in the holding unit, the CPU 240 transmits it to the management node. When the holding unit has no job information related to the specified identification number but does have job information related to the immediately preceding identification number, the CPU 240 transmits the job information related to that preceding identification number to the management node.
- On receiving job information from each calculation node in response to the transmission request, the CPU 240A holds the received job information in a holding unit in the RAM 220A, which can hold job information for a predetermined plurality of cycles for each calculation node. When the CPU 240A detects in the holding unit job information of the calculation nodes having the same identification number, it holds that job information as a snapshot. Once the snapshot is held, the CPU 240A erases from the holding unit in the RAM 220A all job information other than the job information having that identification number, and transmits an erasure request to each calculation node.
- On receiving the erasure request, the CPU 240 erases all the job information held in the holding unit in the RAM 220.
- Because the job information is managed using, as a key, the identification number of the cycle timing at which it was acquired, an accurate snapshot of the job information across the calculation nodes can be secured.
- Even when the job information erasure timing differs among the calculation nodes because of transmission delay of the clear request from the management node, the situation in which the management node cannot collect the job information of each calculation node is avoided, and acquisition of the snapshot is guaranteed.
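The exchange between the CPU 240 (calculation node) and the CPU 240A (management node) described above can be illustrated with a small Python sketch. It is only a simplified model under stated assumptions, not the implementation of the embodiment: the class and method names (`ComputeNode`, `ManagementNode`, `collect`, and so on) are hypothetical, messages are modeled as direct method calls, and the management node is assumed to pick the newest identification number that every node has reported.

```python
from collections import OrderedDict


class ComputeNode:
    """Hypothetical calculation node holding the last `capacity` cycles."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.held = OrderedDict()  # identification number -> job information

    def acquire(self, cycle_id, job_info):
        # Hold the new sample and drop the oldest once capacity is exceeded.
        self.held[cycle_id] = job_info
        while len(self.held) > self.capacity:
            self.held.popitem(last=False)

    def handle_send_request(self, cycle_id):
        # Prefer the specified number; otherwise fall back to the one
        # immediately before it. Returns None if neither is held.
        for candidate in (cycle_id, cycle_id - 1):
            if candidate in self.held:
                return candidate, self.held[candidate]
        return None

    def handle_erase_request(self):
        # Erase everything currently held.
        self.held.clear()


class ManagementNode:
    """Hypothetical management node that assembles a snapshot."""

    def __init__(self, nodes):
        self.nodes = nodes  # node name -> ComputeNode
        self.held = {name: {} for name in nodes}
        self.snapshot = None

    def collect(self, cycle_id):
        # Ask every node for the specified cycle and hold whatever comes back.
        for name, node in self.nodes.items():
            reply = node.handle_send_request(cycle_id)
            if reply is not None:
                got_id, info = reply
                self.held[name][got_id] = info
        return self._try_snapshot()

    def _try_snapshot(self):
        if not self.held:
            return None
        # Identification numbers for which every node has reported.
        common = set.intersection(*(set(h) for h in self.held.values()))
        if not common:
            return None
        chosen = max(common)  # assumption: use the newest common number
        self.snapshot = (chosen, {n: h[chosen] for n, h in self.held.items()})
        # Keep only the snapshot entries, then ask every node to erase.
        for name in self.held:
            self.held[name] = {chosen: self.held[name][chosen]}
        for node in self.nodes.values():
            node.handle_erase_request()
        return self.snapshot
```

In this model, a node that has not yet sampled the specified cycle answers with the preceding one, so the management node can still find an identification number common to all nodes on a later request.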
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- General Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- Computer Hardware Design (AREA)
- Quality & Reliability (AREA)
- Software Systems (AREA)
- Debugging And Monitoring (AREA)
Abstract
Description
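1A Parallel computer
3 Calculation node
4 Management node
14 Calculation-side holding unit
14A First holding area
14B Second holding area
22 Acquisition processing unit
23 Calculation-side holding control unit
24 Information transmission unit
34 Management-side holding unit
34A First holding area
34B Second holding area
34C Third holding area
41 Transmission request unit
44 Clear request unit
45 Management-side holding control unit
50 Calculation node
51 Acquisition unit
52 Holding unit
53 Holding control unit
54 Information transmission unit
60 Management node
61 Holding unit
62 Holding control unit
63 Erasure request unit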
Claims (10)
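- A parallel computer comprising a plurality of calculation nodes that execute calculation jobs in a distributed, parallel manner, and a management node that manages the plurality of calculation nodes,
wherein each calculation node includes:
an acquisition unit that acquires, according to a cycle timing common to the calculation nodes, job information related to the calculation job handled by the calculation node itself;
a calculation-node-side holding control unit that holds the job information in a calculation-node-side holding unit in association with an identification number identifying the cycle timing at which the acquisition unit acquired the job information, and that, on receiving an erasure request from the management node, erases all job information held in the holding unit; and
an information transmission unit that, on receiving from the management node a transmission request for job information related to a specified identification number, transmits the job information related to the specified identification number to the management node when that job information is in the holding unit, and, when the job information related to the specified identification number is not in the holding unit but job information related to the identification number immediately preceding the specified identification number is, transmits the job information related to that identification number to the management node,
wherein the management node includes:
a management-node-side holding control unit that, on receiving the job information from each calculation node in response to the transmission request, holds the received job information in a management-node-side holding unit, that, on detecting in the holding unit job information of the calculation nodes having the same identification number, holds the job information having the same identification number as a snapshot, and that, when the job information having the same identification number has been held as a snapshot, erases the job information held in the management-node-side holding unit other than the job information having the same identification number; and
an erasure request unit that transmits the erasure request to each calculation node when the job information having the same identification number has been held as a snapshot,
wherein the calculation-node-side holding unit
has a holding area capable of holding job information for a predetermined plurality of cycles, and
wherein the management-node-side holding unit
has a holding area capable of holding, for each calculation node, job information for the predetermined plurality of cycles.
- The parallel computer according to claim 1, wherein the time required for the erasure request from the management node to reach each calculation node and for the job information to be erased is measured for each calculation node; a maximum deviation time between the calculation nodes is calculated from the measurement results; and, when n times the cycle timing interval < the maximum deviation time ≤ (n+1) times the cycle timing interval holds, the management-node-side holding unit has a holding area that holds job information for (n+3) cycles and the calculation-node-side holding unit has a holding area that holds job information for (n+2) cycles.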
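- The parallel computer according to claim 1 or 2, wherein the acquisition unit
starts a timer in accordance with the execution start timing of the calculation job and detects the cycle timing based on the time measured by the timer.
- The parallel computer according to claim 1 or 2, wherein the management node
includes a transmission request unit that, in response to a predetermined signal, requests each calculation node to transmit job information related to a specified identification number.
- The parallel computer according to claim 4, wherein one of the plurality of calculation nodes serves as a representative node, and
the predetermined signal is a signal by which the representative node, when the acquisition unit in the representative node acquires job information, notifies the management node of the identification number of that job information.
- The parallel computer according to claim 4, wherein
the predetermined signal is a signal by which a calculation node, when the acquisition unit in the calculation node acquires job information, notifies the management node of the identification information of that job information.
- A job information acquisition program for a parallel computer having a plurality of calculation nodes that execute calculation jobs in a distributed, parallel manner and a management node that manages the plurality of calculation nodes, the job information acquisition program including: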
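a calculation-node-side acquisition procedure of acquiring, according to a cycle timing common to the calculation nodes, job information related to the calculation job handled by the calculation node itself;
a calculation-node-side holding procedure of holding the job information, in association with an identification number identifying the cycle timing at which the acquisition procedure acquired the job information, in a calculation-node-side holding unit capable of holding job information for a predetermined plurality of cycles;
a calculation-node-side information transmission procedure of, on receiving from the management node a transmission request for job information related to a specified identification number, transmitting the job information related to the specified identification number to the management node when that job information is in the holding unit, and, when the job information related to the specified identification number is not in the holding unit but job information related to the identification number immediately preceding the specified identification number is, transmitting the job information related to that identification number to the management node;
a management-node-side holding procedure of, on receiving the job information from each calculation node in response to the transmission request, holding the received job information in a management-node-side holding unit capable of holding, for each calculation node, job information for the predetermined plurality of cycles;
a management-node-side snapshot holding procedure of, on detecting in the holding unit job information of the calculation nodes having the same identification number, holding the job information having the same identification number as a snapshot;
a management-node-side erasure procedure of, when the job information having the same identification number has been held as a snapshot, erasing the job information held in the management-node-side holding unit other than the job information having the same identification number;
a management-node-side erasure request procedure of, when the job information having the same identification number has been held as a snapshot, transmitting an erasure request to each calculation node; and
a calculation-node-side erasure procedure of, on receiving the erasure request from the management node, erasing all job information held in the calculation-node-side holding unit,
wherein the job information acquisition program causes a computer to execute a program including these procedures.
- A job information acquisition method for a parallel computer having a plurality of calculation nodes that execute calculation jobs in a distributed, parallel manner and a management node that manages the plurality of calculation nodes, the method comprising: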
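a calculation-node-side acquisition step of acquiring, according to a cycle timing common to the calculation nodes, job information related to the calculation job handled by the calculation node itself;
a calculation-node-side holding step of holding the job information, in association with an identification number identifying the cycle timing at which the acquisition step acquired the job information, in a calculation-node-side holding unit capable of holding job information for a predetermined plurality of cycles;
a calculation-node-side information transmission step of, on receiving from the management node a transmission request for job information related to a specified identification number, transmitting the job information related to the specified identification number to the management node when that job information is in the holding unit, and, when the job information related to the specified identification number is not in the holding unit but job information related to the identification number immediately preceding the specified identification number is, transmitting the job information related to that identification number to the management node;
a management-node-side holding step of, on receiving the job information from each calculation node in response to the transmission request, holding the received job information in a management-node-side holding unit capable of holding, for each calculation node, job information for the predetermined plurality of cycles;
a management-node-side snapshot holding step of, on detecting in the holding unit job information of the calculation nodes having the same identification number, holding the job information having the same identification number as a snapshot;
a management-node-side erasure step of, when the job information having the same identification number has been held as a snapshot, erasing the job information held in the management-node-side holding unit other than the job information having the same identification number;
a management-node-side erasure request step of, when the job information having the same identification number has been held as a snapshot, transmitting an erasure request to each calculation node; and
a calculation-node-side erasure step of, on receiving the erasure request from the management node, erasing all job information held in the calculation-node-side holding unit.
- A calculation device comprising: a calculation processing unit that executes calculation jobs in a distributed, parallel manner;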
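an acquisition unit that acquires, according to a cycle timing common to the calculation devices, job information related to the calculation job handled by the calculation device itself;
a holding control unit that holds the job information in a holding unit of the calculation device in association with an identification number identifying the cycle timing at which the acquisition unit acquired the job information, and that, on receiving an erasure request from a calculation management device, erases all job information held in the holding unit; and
an information transmission unit that, on receiving from the calculation management device a transmission request for job information related to a specified identification number, transmits the job information related to the specified identification number to the calculation management device when that job information is in the holding unit, and, when the job information related to the specified identification number is not in the holding unit but job information related to the identification number immediately preceding the specified identification number is, transmits the job information related to that identification number to the calculation management device,
wherein the holding unit
has a holding area capable of holding job information for a predetermined plurality of cycles.
- A calculation management device comprising: a management-side processing unit that manages a plurality of calculation devices;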
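a holding control unit that, on receiving job information from each calculation device in response to a transmission request to the calculation devices for job information related to a specified identification number, holds the received job information in a holding unit of the calculation management device, that, on detecting in the holding unit job information of the calculation devices having the same identification number, holds the job information having the same identification number as a snapshot, and that, when the job information having the same identification number has been held as a snapshot, erases the job information held in the holding unit of the calculation management device other than the job information having the same identification number; and
an erasure request unit that, when the job information having the same identification number has been held as a snapshot, transmits an erasure request for erasing the job information held in each calculation device,
wherein the holding unit
has a holding area capable of holding, for each calculation device, job information for a predetermined plurality of cycles.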
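The sizing rule in the dependent claim on holding areas (n × interval < maximum deviation ≤ (n+1) × interval gives (n+2) cycles on the calculation node side and (n+3) cycles on the management node side) can be worked through with a short Python sketch. The function name `holding_cycles` and its parameters are hypothetical. The sketch assumes that the maximum deviation is the difference between the largest and smallest per-node erase delays, that times are integers in a common unit (for example microseconds), and that a deviation of zero is treated as n = 0; none of these details are stated in the claims.

```python
def holding_cycles(erase_delays, interval):
    """Return (calculation_node_cycles, management_node_cycles).

    erase_delays: measured time per calculation node from issuing the erasure
                  request until that node has erased its job information
                  (integers, same unit as `interval`).
    interval:     time between consecutive cycle timings (positive integer).
    """
    if interval <= 0:
        raise ValueError("interval must be positive")
    if not erase_delays:
        raise ValueError("at least one measurement is required")

    # Assumption: maximum deviation between nodes = spread of the delays.
    max_deviation = max(erase_delays) - min(erase_delays)

    # Find n with n * interval < max_deviation <= (n + 1) * interval.
    # Integer arithmetic avoids floating-point rounding at the boundaries.
    n = max(0, (max_deviation - 1) // interval)

    return n + 2, n + 3
```

For example, with delays of 100, 250 and 400 and an interval of 100, the deviation is 300, so n = 2 (200 < 300 ≤ 300) and the function returns 4 cycles for each calculation node and 5 cycles for the management node.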
Priority Applications (4)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
EP10856443.6A EP2610752B1 (en) | 2010-08-27 | 2010-08-27 | Parallel computer, job information acquisition program of parallel computer, and job information acquisition method for parallel computer |
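JP2012530498A JP5464276B2 (ja) | 2010-08-27 | 2010-08-27 | Parallel computer, job information acquisition program of parallel computer, job information acquisition method for parallel computer, calculation device, and calculation management device |
PCT/JP2010/064639 WO2012026041A1 (ja) | 2010-08-27 | 2010-08-27 | Parallel computer, job information acquisition program of parallel computer, job information acquisition method for parallel computer, calculation device, and calculation management device |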
US13/778,494 US9336044B2 (en) | 2010-08-27 | 2013-02-27 | Parallel computer, and job information acquisition method for parallel computer |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
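PCT/JP2010/064639 WO2012026041A1 (ja) | 2010-08-27 | 2010-08-27 | Parallel computer, job information acquisition program of parallel computer, job information acquisition method for parallel computer, calculation device, and calculation management device |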
Related Child Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
US13/778,494 Continuation US9336044B2 (en) | 2010-08-27 | 2013-02-27 | Parallel computer, and job information acquisition method for parallel computer |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2012026041A1 true WO2012026041A1 (ja) | 2012-03-01 |
Family
ID=45723068
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
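PCT/JP2010/064639 WO2012026041A1 (ja) | Parallel computer, job information acquisition program of parallel computer, job information acquisition method for parallel computer, calculation device, and calculation management device | 2010-08-27 | 2010-08-27 |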
Country Status (4)
Country | Link |
---|---|
US (1) | US9336044B2 (ja) |
EP (1) | EP2610752B1 (ja) |
JP (1) | JP5464276B2 (ja) |
WO (1) | WO2012026041A1 (ja) |
Cited By (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2014010047A1 (ja) * | 2012-07-11 | 2014-01-16 | Hitachi, Ltd. | Management system and information acquisition method |
JP2015022755A (ja) * | 2013-07-23 | 2015-02-02 | Fujitsu Limited | Fault-tolerant monitoring apparatus, method and system |
US10662234B2 (en) * | 2011-06-07 | 2020-05-26 | Mesoblast International Sàrl | Methods for repairing tissue damage using protease-resistant mutants of stromal cell derived factor-1 |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002324014A (ja) * | 2001-04-26 | 2002-11-08 | Meidensha Corp | Monitoring control system |
JP2007128122A (ja) * | 2005-11-01 | 2007-05-24 | Hitachi Ltd | Method for determining collection start time of operating performance data |
Family Cites Families (7)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JPS63136176A (ja) | 1986-11-27 | 1988-06-08 | Casio Comput Co Ltd | Data processing device |
JP2940403B2 (ja) | 1994-08-03 | 1999-08-25 | Hitachi, Ltd. | Monitor data collection method in parallel computer system |
EP0790559B1 (en) * | 1996-02-14 | 2002-05-15 | Hitachi, Ltd. | Method of monitoring a computer system, featuring performance data distribution to plural monitoring processes |
US6279001B1 (en) * | 1998-05-29 | 2001-08-21 | Webspective Software, Inc. | Web service |
US8037264B2 (en) * | 2003-01-21 | 2011-10-11 | Dell Products, L.P. | Distributed snapshot process |
DE10327155B4 (de) * | 2003-06-13 | 2006-12-07 | Sap Ag | Backup-Verfahren mit Anpassung an Computer-Landschaft |
US8769572B2 (en) * | 2008-03-24 | 2014-07-01 | Verizon Patent And Licensing Inc. | System and method for providing an interactive program guide having date and time toolbars |
2010
- 2010-08-27 JP JP2012530498A patent/JP5464276B2/ja active Active
- 2010-08-27 WO PCT/JP2010/064639 patent/WO2012026041A1/ja active Application Filing
- 2010-08-27 EP EP10856443.6A patent/EP2610752B1/en active Active
2013
- 2013-02-27 US US13/778,494 patent/US9336044B2/en not_active Expired - Fee Related
Non-Patent Citations (1)
Title |
---|
See also references of EP2610752A4 * |
Cited By (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US10662234B2 (en) * | 2011-06-07 | 2020-05-26 | Mesoblast International Sàrl | Methods for repairing tissue damage using protease-resistant mutants of stromal cell derived factor-1 |
WO2014010047A1 (ja) * | 2012-07-11 | 2014-01-16 | Hitachi, Ltd. | Management system and information acquisition method |
US9130880B2 (en) | 2012-07-11 | 2015-09-08 | Hitachi, Ltd. | Management system and information acquisition method |
JP2015022755A (ja) * | 2013-07-23 | 2015-02-02 | Fujitsu Limited | Fault-tolerant monitoring apparatus, method and system |
US10069698B2 (en) | 2013-07-23 | 2018-09-04 | Fujitsu Limited | Fault-tolerant monitoring apparatus, method and system |
Also Published As
Publication number | Publication date |
---|---|
JP5464276B2 (ja) | 2014-04-09 |
US9336044B2 (en) | 2016-05-10 |
EP2610752B1 (en) | 2017-09-27 |
EP2610752A1 (en) | 2013-07-03 |
JPWO2012026041A1 (ja) | 2013-10-28 |
EP2610752A4 (en) | 2015-11-04 |
US20130174170A1 (en) | 2013-07-04 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 10856443; Country of ref document: EP; Kind code of ref document: A1 |
 | ENP | Entry into the national phase | Ref document number: 2012530498; Country of ref document: JP; Kind code of ref document: A |
 | NENP | Non-entry into the national phase | Ref country code: DE |
 | REEP | Request for entry into the european phase | Ref document number: 2010856443; Country of ref document: EP |
 | WWE | Wipo information: entry into national phase | Ref document number: 2010856443; Country of ref document: EP |