WO2013035243A1 - Cloud service recovery time prediction system, method, and program - Google Patents
- Publication number
- WO2013035243A1 (PCT/JP2012/004906)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- recovery
- resource
- schedule
- service
- user
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/008—Reliability or availability analysis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
-
- H—ELECTRICITY
- H04—ELECTRIC COMMUNICATION TECHNIQUE
- H04L—TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
- H04L69/00—Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
- H04L69/40—Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass for recovering from a failure of a protocol instance or entity, e.g. service redundancy protocols, protocol state redundancy or protocol service redirection
Definitions
- The present invention relates to a cloud service recovery time prediction system, a cloud service recovery time prediction method, and a cloud service recovery time prediction program for predicting a service recovery time for each service user when a failure involving a plurality of computer resources occurs in a system.
- Patent Document 1 describes an example of an embodiment of such a cloud service.
- Patent Document 2 describes a method of generating and publishing a Web page for publishing failure information based on failure information received from a user. By quickly disclosing information on the Web when a failure occurs, the stress of users affected by the failure can be reduced.
- Patent Document 3 describes a failure notification method for notifying a user of network failure information.
- The failure notification method described in Patent Document 3 notifies each user of failure information by mail or the like, based on the user's application reservation status, when a failure occurs in a system involving communication such as a video conference application.
- The influence on the user is determined from the time overlap between the user's application reservation period and the period from the occurrence of the failure to the recovery, and the failure information is notified accordingly.
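The time-overlap test described for Patent Document 3 can be illustrated with a short sketch. The function name `is_user_affected` and the sample times are assumptions made for illustration only; the cited document does not give an implementation.

```python
from datetime import datetime


def is_user_affected(reservation_start, reservation_end, failure_start, recovery_time):
    """Return True if the reservation period overlaps the failure period."""
    # Two half-open intervals [a, b) and [c, d) overlap when a < d and c < b.
    return reservation_start < recovery_time and failure_start < reservation_end


# Hypothetical failure lasting from 10:00 to 14:00.
failure = (datetime(2012, 8, 1, 10, 0), datetime(2012, 8, 1, 14, 0))

# A reservation from 13:00 to 15:00 overlaps the failure period.
print(is_user_affected(datetime(2012, 8, 1, 13, 0), datetime(2012, 8, 1, 15, 0), *failure))
# A reservation from 15:00 to 16:00 starts after the recovery.
print(is_user_affected(datetime(2012, 8, 1, 15, 0), datetime(2012, 8, 1, 16, 0), *failure))
```

A check of this kind only works when a reservation exists, which is the limitation the description raises against this approach.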
- Patent Document 4 describes a method for notifying a user of necessary information when a network service cannot be used.
- The server machine manages the identification information of each client machine that uses a resource provided by the network service, the type of program running on the client machine that uses the resource, and the type of user of that program.
- When the network service cannot be provided because a predetermined problem has occurred in a resource of the server machine, network service information based on the resource provided by the network service is notified to the client machine.
- Non-Patent Document 1 discloses the current state of the cloud service provided by Google (registered trademark) AppEngine.
- Non-Patent Document 2 discloses the current state of the cloud service provided by Amazon (registered trademark) EC2.
- Patent Document 5 describes a method for monitoring a data recovery possible time in a storage system that performs asynchronous remote copy between a plurality of storage apparatuses.
- In this method, the latest data staying in the buffer of the first storage apparatus, or data equivalent to it, is accumulated together with time information at predetermined time intervals. Then, based on the accumulated information and on at least one of the oldest data (or data equivalent to it) at a predetermined time and the number of remaining data items, the recoverable time of the data stored in the second storage apparatus at that predetermined time is calculated.
- The information disclosed in Non-Patent Document 1 and Non-Patent Document 2 relates to the recovery time of the entire service. In this case, even if part of the service is restored and some service users are able to use the service again, each service user cannot grasp that situation. That is, when the cloud service fails, the service user must wait until all services are restored, in accordance with the estimated recovery time disclosed by the service provider.
- The service provided by the cloud service is not necessarily a service that requires a reservation in advance. Therefore, when a failure occurs in a service that a user uses without making a reservation, the method described in Patent Document 3 cannot cope with the failure.
- Patent Document 4 does not describe how to calculate the expected time until the service is restored, and the specific prediction method is unknown.
- An object of the present invention is to provide a cloud service recovery time prediction system, a cloud service recovery time prediction method, and a cloud service recovery time prediction program that can predict, for each user, the time at which a cloud service will be restored when the service requested by the user becomes unavailable due to a failure of the cloud service provided using a plurality of types of computer resources.
- The cloud service recovery time prediction system according to the present invention includes: recovery schedule storage means that stores, for each type of computer resource or for each provided service, a recovery schedule defining the schedule for restoring the computer resources or the provided service when a failure occurs in a plurality of types of computer resources provided by a cloud service or in a provided service, which is a service provided by the cloud service; resource usage profile storage means that stores, for each user, a resource usage profile defining the computer resources used when that user uses the cloud service; recovery time prediction means that identifies, from the resource usage profile, the computer resources or provided services used when the user uses the cloud service, and predicts the recovery time of the cloud service used by the user by predicting, based on the recovery schedule, the time at which all the identified computer resources or provided services are restored; and recovery time presentation means that presents the predicted service recovery time to the user.
- The cloud service recovery time prediction method according to the present invention predicts a recovery time when a failure occurs in a plurality of types of computer resources provided by a cloud service or in a provided service, which is a service provided by the cloud service. The method refers to resource usage profile storage means that stores, for each user, a resource usage profile defining the computer resources used when that user uses the cloud service, and identifies, from the stored resource usage profile, the computer resources or provided services used when the user uses the cloud service. It then extracts a recovery schedule, which defines the schedule for restoring the computer resources or provided services when a failure occurs in them, from recovery schedule storage means that stores the recovery schedule for each type of computer resource or for each provided service. Based on the extracted recovery schedule, the method predicts the time at which all the identified computer resources or provided services are restored, thereby predicting the recovery time of the cloud service used by the user, and presents the predicted service recovery time to the user.
- The cloud service recovery time prediction program according to the present invention is applied to a computer that predicts a recovery time when a failure occurs in a plurality of types of computer resources provided by a cloud service or in a provided service, which is a service provided by the cloud service. The program causes the computer to execute a recovery time prediction process and a recovery time presentation process. The recovery time prediction process identifies the computer resources or provided services used when the user uses the cloud service from the resource usage profile stored in resource usage profile storage means, which stores, for each user, a resource usage profile defining the computer resources used when that user uses the cloud service. It then extracts a recovery schedule, which defines the schedule for restoring the computer resources or provided services, from recovery schedule storage means that stores the recovery schedule for each type of computer resource or for each provided service, and predicts the recovery time of the cloud service used by the user by predicting, based on the extracted recovery schedule, the time at which all the identified computer resources or provided services are restored. The recovery time presentation process presents the predicted service recovery time to the user.
- According to the present invention, when a service requested by a user cannot be used due to a failure of a cloud service provided using a plurality of types of computer resources, it is possible to predict, for each user, the time at which the cloud service will recover.
- FIG. 1 is an explanatory diagram illustrating an example of the entire configuration of a cloud service including the cloud service recovery time prediction system according to the first embodiment of this invention.
- The cloud service illustrated in FIG. 1 includes a cloud service providing unit 1, a recovery time prediction system 2, and a cloud service client 3.
- The recovery time prediction system 2 illustrated in FIG. 1 corresponds to the cloud service recovery time prediction system according to the first embodiment of this invention.
- The cloud service providing unit 1, the recovery time prediction system 2, and the cloud service client 3 are connected to each other via a communication network (not shown).
- The cloud service providing unit 1 includes a virtual machine 101, a storage 102, and a service providing unit 103. Various services are provided to the user using the virtual machine 101, the storage 102, and the service providing unit 103.
- The virtual machine 101 and the storage 102 included in the cloud service providing unit 1 may be referred to as computer resources.
- In this example, the cloud service providing unit 1 includes six each of the virtual machines 101, the storages 102, and the service providing units 103.
- However, the number of virtual machines 101, storages 102, and service providing units 103 is not limited to six.
- The number of virtual machines 101, storages 102, and service providing units 103 may be one each, two to five each, or seven or more each. Further, the numbers of virtual machines 101, storages 102, and service providing units 103 need not be the same.
- The cloud service providing unit 1 may include other similar computer resources in order to provide various services to the user.
- The cloud service providing unit 1 includes a dedicated interface (not shown) for providing computer resources such as the virtual machine 101 and the storage 102 to the user.
- Examples of the dedicated interface include an interface for creating and deleting the virtual machine 101 and an interface for adding storage.
- The user of the cloud service uses the function of the cloud service client 3 to access the dedicated interface and use the virtual machine 101 and the storage 102 in the cloud service.
- The service providing unit 103 provides services that can be used by users in addition to computer resources.
- Examples of services provided by the service providing unit 103 include a VPN (Virtual Private Network) service that provides private network access, a load balancing service that distributes traffic load, a monitoring service that monitors the state and performance of a specific virtual machine in the cloud service, a scale-up service that increases or decreases the number of virtual machine instances in accordance with an increase or decrease in traffic, and an authentication / authorization service that restricts access to resources.
- The user of the cloud service uses the services provided by the service providing unit 103 through the function of the cloud service client 3 to construct an application system (not shown) in the cloud service providing unit 1.
- A user of a cloud service registers the type and amount of computer resources to be used and the services to be used in the cloud service, and pays a service fee to the service provider according to the usage status.
- The cloud service provider stores information on the computer resources and services used by each user in the resource usage profile storage unit 206 described later.
- The recovery time prediction system 2 includes a failure status investigation unit 201, a failure status storage unit 202, a recovery schedule generation unit 203, a resource recovery schedule storage unit 204, a recovery time prediction unit 205, a resource usage profile storage unit 206, and a recovery time presentation unit 207.
- The failure status investigation unit 201 investigates the failure status of the computer resources in the cloud service providing unit 1 and of the service providing unit 103. Specifically, the failure status investigation unit 201 investigates the failure status of each computer resource or service when a cloud service failure involving the loss of a plurality of types of computer resources occurs. Then, the failure status investigation unit 201 stores the investigation result in the failure status storage unit 202.
- The failure status storage unit 202 stores the failure status in the cloud service providing unit 1.
- The failure status is stored in the failure status storage unit 202 by the failure status investigation unit 201 as needed.
- The recovery schedule generation unit 203 generates a recovery schedule for each computer resource and provided service according to the failure status stored in the failure status storage unit 202. Then, the recovery schedule generation unit 203 stores the generated recovery schedule in the resource recovery schedule storage unit 204.
- The recovery schedule is a schedule for restoring computer resources or provided services when a failure occurs in a plurality of types of computer resources provided by the cloud service providing unit 1 or in a service provided by the service providing unit 103 (hereinafter also referred to as a provided service).
- The recovery schedule is created using a generally known method. For example, the time required for recovery for each failure of the computer resources and the recovery order for each combination of failures may be set in advance.
- In that case, at the timing when the failure status is registered in the failure status storage unit 202, the recovery schedule generation unit 203 may identify the failed computer resources from the failure status and generate the recovery schedule from the preset recovery time and recovery order for the identified computer resources. Further, for example, when creating a storage recovery schedule, the recovery schedule generation unit 203 may generate a recovery schedule using the method described in Patent Document 5.
- The method by which the recovery schedule generation unit 203 generates the recovery schedule is not limited to the above methods.
- The recovery schedule generation unit 203 may also receive an input recovery procedure and use it as the recovery schedule.
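As a minimal sketch of the preset-based approach mentioned above, the following example builds a recovery schedule from preset recovery durations and a preset recovery order. The table contents, the names `RECOVERY_DURATION`, `RECOVERY_ORDER`, and `generate_recovery_schedule`, and the assumption that resource types are recovered strictly one after another are all hypothetical; the description leaves the concrete method open.

```python
from datetime import datetime, timedelta

# Hypothetical preset values: time needed to recover each type of failed resource.
RECOVERY_DURATION = {
    "virtual_machine": timedelta(hours=2),
    "storage": timedelta(hours=5),
    "vpn_service": timedelta(hours=1),
}
# Hypothetical preset recovery order (earlier entries are restored first).
RECOVERY_ORDER = ["storage", "virtual_machine", "vpn_service"]


def generate_recovery_schedule(failed_resource_types, start):
    """Return {resource type: scheduled recovery time}.

    Assumes the failed types are recovered sequentially in RECOVERY_ORDER,
    starting at `start`.
    """
    schedule = {}
    current = start
    for resource_type in RECOVERY_ORDER:
        if resource_type in failed_resource_types:
            current += RECOVERY_DURATION[resource_type]
            schedule[resource_type] = current
    return schedule


schedule = generate_recovery_schedule(
    {"virtual_machine", "storage"}, datetime(2012, 8, 1, 10, 0)
)
for resource_type, recovered_at in schedule.items():
    print(resource_type, recovered_at.isoformat())
```

A real schedule would also depend on damage status and available personnel, as the description notes later; this sketch covers only the lookup-table case.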
- The resource recovery schedule storage unit 204 stores a recovery schedule for each type of computer resource and provided service.
- The resource usage profile storage unit 206 stores a resource usage profile that defines the computer resources (specifically, the type and amount of computer resources) used when each user uses the cloud service.
- The resource usage profile is stored in advance in the resource usage profile storage unit 206 by an administrator or the like.
- The amount of computer resources includes the number of virtual machines 101 and the capacity allocated in the storage 102.
- The recovery time prediction unit 205 predicts the service recovery time for each user based on the recovery schedule for each computer resource or provided service and the resource usage profile stored in the resource usage profile storage unit 206. Specifically, the recovery time prediction unit 205 identifies, from the resource usage profile, the computer resources or provided services used when the user uses the cloud service. Then, the recovery time prediction unit 205 predicts the recovery time of all the identified computer resources or provided services based on the recovery schedule.
- The recovery time prediction unit 205 may predict, as the recovery time of the cloud service used by the user, the latest of the predicted times at which each computer resource and each provided service is restored.
- The recovery time presentation unit 207 presents the predicted service recovery time to the user. Examples of the presentation method include announcements on the Web and notifications to users by e-mail, instant message, and the like. Note that presenting the recovery time includes indirectly transmitting the recovery time to another device.
- The failure status investigation unit 201, the recovery schedule generation unit 203, the recovery time prediction unit 205, and the recovery time presentation unit 207 are realized by a CPU of a computer that operates according to a program (the cloud service recovery time prediction program).
- For example, the program may be stored in a storage unit (not shown) in the recovery time prediction system 2, and the CPU may read the program and, according to the program, operate as the failure status investigation unit 201, the recovery schedule generation unit 203, the recovery time prediction unit 205, and the recovery time presentation unit 207.
- Alternatively, the failure status investigation unit 201, the recovery schedule generation unit 203, the recovery time prediction unit 205, and the recovery time presentation unit 207 may each be realized by dedicated hardware.
- The failure status storage unit 202, the resource recovery schedule storage unit 204, and the resource usage profile storage unit 206 are realized by, for example, a magnetic disk.
- Next, the operation of the recovery time prediction system 2 of this embodiment will be described. When a disaster or power outage occurs in a data center operating a cloud service, failures occur in a plurality of computer resources and provided services depending on the extent of the disaster. For example, when power is lost in one section of the data center, the servers operating in that section and the virtual machines running on those servers stop. In this case, the storage devices and the various provided services in that section also stop. Because of this failure, users who were using those virtual machines, storage devices, or provided services can no longer use the cloud service. When the cloud service provider detects the occurrence of a failure, it starts the recovery process of the cloud service. In the course of the cloud service recovery process, the recovery time prediction system 2 predicts the recovery time for each user affected by the failure.
- FIG. 2 is a flowchart showing an example of processing for investigating the failure status of each computer resource and provided service and generating a recovery schedule.
- The failure status investigation unit 201 identifies the physical servers, virtual servers, storage devices, and various services that have stopped due to a failure in the cloud service providing unit 1 and investigates the damage status (step S1000).
- Examples of the damage status of each computer resource include a state that requires replacement due to physical damage, a state in which there is no physical damage but a logical inconsistency may have occurred, a state in which some data may have been lost, and a state in which some functions of a service are not provided.
- The failure status investigation unit 201 stores the investigation result in the failure status storage unit 202 (step S1001).
- The failure status investigation unit 201 may, for example, aggregate alert messages generated due to failures and automatically collect the investigation results.
- The administrator of the cloud service may investigate the failure status by checking the on-site damage or log data. Further, the failure status investigation unit 201 may periodically monitor the status of computer resources and provided services and determine that a failure has occurred when an abnormality is detected. It is assumed that the failure status storage unit 202 is referenced and updated repeatedly and continuously during the recovery process and stores the latest failure status at each point in time.
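The alert aggregation mentioned above could look roughly like the following sketch, which keeps only the most recent status reported for each resource so that the stored failure status always reflects the latest state. The `Alert` fields, the status strings, and the function name `aggregate_alerts` are assumptions made for illustration; the description does not define an alert format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Alert:
    timestamp: int     # hypothetical: seconds since the incident started
    resource_id: str   # hypothetical identifier of a server, volume, or service
    status: str        # e.g. "physically_damaged", "possibly_inconsistent", "ok"


def aggregate_alerts(alerts):
    """Return {resource_id: latest status}, i.e. the current failure status."""
    latest = {}
    # Process alerts in time order so that later reports overwrite earlier ones.
    for alert in sorted(alerts, key=lambda a: a.timestamp):
        latest[alert.resource_id] = alert.status
    return latest


alerts = [
    Alert(30, "vm-01", "possibly_inconsistent"),
    Alert(10, "vm-01", "stopped"),
    Alert(20, "vol-07", "physically_damaged"),
]
failure_status = aggregate_alerts(alerts)
print(failure_status)
```

Calling `aggregate_alerts` again as new alerts arrive corresponds to the repeated update of the failure status during the recovery process.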
- The recovery schedule generation unit 203 refers to the failure status and generates a recovery schedule for each computer resource and provided service (step S1002). Then, the recovery schedule generation unit 203 updates the recovery schedule in the resource recovery schedule storage unit 204 (step S1003).
- The recovery procedure (recovery schedule) varies depending on the type of resource, the state of damage, the number of personnel required for recovery, and the stock of spare resources.
- The recovery time and procedure for each of these assumed failure scenarios may be set in advance, and the recovery schedule generation unit 203 may create a recovery schedule based on this information.
- The recovery schedule generation unit 203 may also store an input recovery schedule in the resource recovery schedule storage unit 204.
- For example, the recovery schedule for virtual machine resources is created by a specialized administrator group that manages the server cluster providing the virtual machines.
- Likewise, the storage recovery schedule is created by an administrator group specialized in storage management. The recovery schedule generation unit 203 may replace the generated recovery schedule with a recovery schedule created by these administrators.
- FIG. 3 is a flowchart showing an example of a procedure for predicting and presenting service restoration time for each user based on a restoration schedule and a resource usage profile.
- The recovery time prediction unit 205 first acquires a list of all users of the cloud service from the resource usage profile storage unit 206 (step S2000). Note that the recovery time prediction unit 205 may instead acquire a list of only the users affected by the failure.
- The recovery time prediction unit 205 selects one user at a time from the acquired user list (step S2001) and checks that user's predicted recovery time. Specifically, the recovery time prediction unit 205 refers to the resource usage profile storage unit 206 and acquires the resource usage profile of the selected user Ui (step S2002).
- This resource usage profile includes a list of the computer resources requested by the user, and the type of each resource (hereinafter referred to as resource type Rj) is specified from this resource list. The recovery time prediction unit 205 therefore selects a resource type Rj from the resource usage profile (step S2003).
- The resource type Rj indicates, for example, a virtual machine, storage, or one of the various provided services.
- In addition to the above, the resource type Rj may include information indicating whether a given computer resource is a shared resource.
- The recovery time prediction unit 205 refers to the resource recovery schedule storage unit 204 and looks up the resource recovery schedule of the selected resource type Rj (step S2004).
- This resource recovery schedule describes the resources to be recovered at each point in time, the amount of those resources, information indicating the recovered portion of the resources, and so on. From it, the recovery time prediction unit 205 predicts the scheduled recovery time Tj at which the resource requested by the user, as described in the resource usage profile, is recovered and becomes usable (step S2005).
- The recovery time prediction unit 205 may record the prediction result in a memory (not shown) or the like.
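One possible reading of step S2005 is sketched below: the schedule for a resource type is a list of (time, recovered amount) entries, and Tj is the earliest time at which the cumulative recovered amount covers what the user requested. This is an assumption for illustration only. The description does not fix the schedule format, and the sketch ignores which portion of the resource is recovered and any competition between users. Times are given as hours from the start of recovery.

```python
def predict_scheduled_recovery_time(schedule_entries, required_amount):
    """Return the earliest time at which `required_amount` has been recovered.

    schedule_entries: iterable of (time, recovered_amount) pairs for one
    resource type Rj. Returns None if the schedule never recovers enough.
    """
    recovered = 0
    for time, amount in sorted(schedule_entries):
        recovered += amount
        if recovered >= required_amount:
            return time
    return None


# Hypothetical schedule for virtual machines: 3 recovered at hour 2,
# 4 more at hour 5, and 10 more at hour 9.
vm_schedule = [(2, 3), (5, 4), (9, 10)]

tj = predict_scheduled_recovery_time(vm_schedule, required_amount=5)
print(tj)
```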
- The recovery time prediction unit 205 determines whether or not the scheduled recovery time Tj has been predicted for all resource types Rj described in the resource usage profile (step S2006). When the scheduled recovery time Tj has not yet been predicted for all resource types Rj (NO in step S2006), the recovery time prediction unit 205 repeats the processing from step S2003 to step S2006.
- When the scheduled recovery time Tj has been predicted for all resource types Rj (YES in step S2006), the recovery time prediction unit 205 obtains the maximum value of the scheduled recovery times Tj and sets it as the scheduled recovery time of the user Ui (step S2007). Note that the recovery time prediction unit 205 may record the scheduled recovery time in a memory (not shown) or the like.
- The recovery time prediction unit 205 determines whether or not the scheduled recovery time has been predicted for all users included in the user list (step S2008). When the scheduled recovery time has not yet been predicted for all users (NO in step S2008), the recovery time prediction unit 205 repeats the processing from step S2001 to step S2008. On the other hand, when the scheduled recovery time has been predicted for all users (YES in step S2008), the recovery time presentation unit 207 presents the predicted recovery time to each user (step S2009).
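The flow of steps S2000 to S2009 can be sketched as two nested loops. The dictionary layouts, the function names, the use of plain hour counts as times, and the rule that a resource type missing from the schedule is treated as unaffected are assumptions made for this illustration.

```python
def predict_user_recovery_times(resource_usage_profiles, resource_recovery_schedule):
    """Predict a scheduled recovery time for every user (steps S2000 to S2008).

    resource_usage_profiles: {user: [resource type Rj, ...]}
    resource_recovery_schedule: {resource type Rj: scheduled recovery time Tj}
    """
    predictions = {}
    for user, resource_types in resource_usage_profiles.items():   # S2001, S2002
        scheduled_times = [
            resource_recovery_schedule[rj]                           # S2004, S2005
            for rj in resource_types                                 # S2003, S2006
            if rj in resource_recovery_schedule  # assumption: absent means not failed
        ]
        # S2007: the user's service is back once the slowest resource is back.
        predictions[user] = max(scheduled_times, default=None)
    return predictions


def present(predictions):
    """Stand-in for step S2009 (the real system might use the Web or e-mail)."""
    for user, hours in predictions.items():
        if hours is None:
            print(f"{user}: not affected")
        else:
            print(f"{user}: service expected back in {hours} h")


profiles = {
    "alice": ["virtual_machine", "storage"],
    "bob": ["virtual_machine"],
    "carol": ["monitoring_service"],
}
schedule = {"virtual_machine": 3, "storage": 8}

predictions = predict_user_recovery_times(profiles, schedule)
present(predictions)
```

In this example two users of the same failed cloud service receive different predictions because their profiles list different resources, which is the per-user behavior this embodiment aims for.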
- As described above, the recovery time prediction unit 205 identifies, from the resource usage profile, the computer resources or provided services used when the user uses the cloud service. Further, the recovery time prediction unit 205 predicts the recovery time of all the identified computer resources or provided services based on the recovery schedule. The recovery time prediction unit 205 thereby predicts the recovery time of the cloud service used by the user. Then, the recovery time presentation unit 207 presents the predicted service recovery time to the user. Therefore, when a service requested by a user cannot be used due to a failure in a cloud service provided using a plurality of types of computer resources, the time at which the cloud service will recover can be predicted for each user.
- The recovery time prediction unit 205 predicts the service recovery time for each user by referring to the resource usage profile of the user and the recovery schedule of each computer resource or provided service. Therefore, a different service recovery time can be presented to each user when a failure occurs in the cloud service.
- The failure status investigation unit 201 may investigate the failure status of each computer resource or provided service and store it in the failure status storage unit 202. Then, the recovery schedule generation unit 203 may generate a recovery schedule based on the failure status stored in the failure status storage unit 202 and store the recovery schedule in the resource recovery schedule storage unit 204. Automatically creating a recovery schedule at the timing when a failure occurs in this way makes it possible to deal with the failure more quickly.
- FIG. 4 is an explanatory diagram illustrating an example of a cloud service recovery time prediction system according to the second embodiment of this invention.
- Components that are the same as in the first embodiment are given the same reference symbols as in FIG. 1, and their description is omitted.
- The recovery time prediction system 2 in this embodiment includes a resource reservation information storage unit 208 in addition to the configuration of the recovery time prediction system 2 of the first embodiment.
- The computer resources of the cloud service providing unit 1 are shared by a plurality of users, but some types of resources are used exclusively, that is, by only one user at a time.
- An example of such a type of resource is a shared virtual machine that can be used by any user.
- Computer resources that are shared but used exclusively cannot be used by all users at the same time. Therefore, a reservation function that permits use only by a specific user is necessary.
- The recovery time prediction system 2 of the present embodiment predicts the recovery time with reference to the reservation information used to realize this reservation function.
- The resource reservation information storage unit 208 stores reservation information related to the use of each computer resource.
- The reservation information associates the reservation start time of a computer resource that is shared by a plurality of users and used exclusively among them with the user of that computer resource. That is, the reservation information includes information indicating the time at which the user's reservation of a computer resource starts (reservation start time).
- The resource reservation information storage unit 208 is realized by, for example, a magnetic disk.
- The recovery time prediction unit 205 identifies the computer resources or provided services used by the user from the resource usage profile of the user. Based on the reservation information, the recovery time prediction unit 205 determines whether or not a computer resource that is used exclusively among a plurality of users can be reserved. For a computer resource that can be reserved, the recovery time prediction unit 205 stores, in the resource reservation information storage unit 208, reservation information whose reservation start time is the recovery time of that computer resource according to the recovery schedule. Then, the recovery time prediction unit 205 identifies the time at which the user can reserve the computer resource from the recovery schedule and the reservation information.
- Specifically, the recovery time prediction unit 205 identifies, from the recovery schedule, the computer resource or provided service used by the user that is expected to be recovered earliest, together with its recovery time. The recovery time prediction unit 205 then refers to the reservation information of the computer resource or provided service corresponding to the identified time. When the computer resource or provided service can be reserved, the recovery time prediction unit 205 registers reservation information with the recovery time as the reservation start time in the resource reservation information storage unit 208. On the other hand, if the computer resource or provided service to be recovered cannot be reserved, the recovery time prediction unit 205 repeats the above processing for the computer resource or provided service with the next earliest recovery time. The recovery time prediction unit 205 may determine whether or not the computer resource or provided service can be reserved based on whether or not reservation information for the target computer resource is already registered.
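The reservation procedure above can be sketched as a scan over the recovering units in order of recovery time, reserving the first one that has no reservation registered. The unit identifiers, the data layout, and the function name `reserve_earliest_unit` are hypothetical, and times are plain hour counts for brevity.

```python
def reserve_earliest_unit(user, recovering_units, reservations):
    """Reserve the earliest-recovering unit that nobody has reserved yet.

    recovering_units: list of (recovery_time, unit_id) taken from the recovery schedule
    reservations: {unit_id: (user, reservation_start_time)}, updated in place
    Returns the reservation start time, or None if every unit is already reserved.
    """
    for recovery_time, unit_id in sorted(recovering_units):
        # A unit can be reserved when no reservation information is registered for it.
        if unit_id not in reservations:
            reservations[unit_id] = (user, recovery_time)
            return recovery_time
    return None


# Hypothetical shared virtual machines and the hour at which each one recovers.
units = [(4, "vm-b"), (2, "vm-a"), (7, "vm-c")]
reservations = {}

times = [reserve_earliest_unit(u, units, reservations) for u in ["u1", "u2", "u3", "u4"]]
print(times)
print(reservations)
```

Because each call takes the earliest free unit, the order in which users are processed decides who gets the earliest recovery time.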
- The recovery time prediction unit 205 predicts the recovery time of the cloud service used by the user based on the recovery schedule and the reservation information. Specifically, the recovery time prediction unit 205 may predict, as the recovery time of the cloud service used by the user, the latest time among the predicted recovery times of the computer resources and provided services used by the user and the times for which those computer resources and provided services are reserved. In other words, the recovery time prediction unit 205 treats a computer resource or provided service for which reservation information is registered as being recovered at the reservation start time included in the reservation information, and predicts the service recovery time accordingly.
- FIG. 5 is a flowchart showing another example of a procedure for predicting and presenting a service recovery time for each user.
- the recovery time predicting means 205 first acquires a list of all users of the cloud service from the resource usage profile storage unit 206 (step S3000).
- the recovery time predicting unit 205 sorts the user list based on the priorities in order to reserve resources and predict the recovery time in order from the user with the highest priority (step S3001).
- the priority of the user is determined according to the service contract form, usage frequency, period, etc. of the user.
- the recovery time predicting means 205 selects the user Ui with the highest priority from the sorted user list (step S3002), and acquires the resource usage profile of the user (step S3003).
- the recovery time prediction unit 205 selects the resource type Rj from the resource usage profile (step S3004). Then, the recovery time predicting unit 205 refers to the resource recovery schedule storage unit 204 and refers to the resource recovery schedule of the selected resource type Rj (step S3005). Then, the recovery time prediction unit 205 determines whether or not the resource type Rj is a shared resource (step S3006).
- When the resource type Rj is not a shared resource, the recovery time prediction unit 205 refers to the resource recovery schedule corresponding to the resource type Rj and predicts the scheduled recovery time Tj at which the resource requested by the user will be recovered and usable.
- the recovery time predicting unit 205 may record the prediction result in a memory (not shown) or the like (step S3007).
- the case where the resource type Rj does not indicate a shared resource is, for example, a case where the resource type Rj indicates a resource dedicated to the user Ui or a case where the resource type Rj indicates a shared resource that does not require exclusive control.
- a storage volume in which user data is recorded corresponds to this resource type Rj.
- a monitoring function shared by a plurality of users, a service such as a load distribution function, and the like also correspond to this resource type Rj.
- When the resource type Rj is a shared resource, the recovery time prediction unit 205 first refers to the resource reservation information storage unit 208 and acquires reservation information of the resource type Rj (step S3008).
- the case where the resource type Rj indicates a shared resource is a case where the resource type Rj indicates a resource of a type shared by a plurality of users and used exclusively.
- the recovery time predicting unit 205 refers to the resource recovery schedule of the resource type Rj, and predicts, as the scheduled recovery time Tj, the earliest time after the resource recovery at which the resource requested by the user can be reserved. This is because the recovered resource is expected to be usable once the resource requested by the user can be reserved. At this time, the recovery time prediction unit 205 treats a resource already reserved for use by another user as not reservable.
- the recovery time prediction unit 205 may record the prediction result in a memory (not shown) or the like (step S3009).
- the recovery time predicting unit 205 creates reservation information according to the type and amount of the resource requested by the user Ui, and stores the reservation information in the resource reservation information storage unit 208 (step S3010).
- the recovery time predicting unit 205 may reserve the computer resource corresponding to the user Ui by storing the user Ui in association with the requested computer resource, for example.
- the recovery time predicting means 205 determines whether or not the expected recovery time Tj has been predicted for all resource types Rj described in the resource usage profile (step S3011). When the recovery scheduled time Tj of all resource types Rj is not predicted (NO in step S3011), the recovery time predicting unit 205 repeats the processing from step S3004 to step S3011.
- the recovery time predicting means 205 obtains the maximum value of the expected recovery time Tj.
- the recovery time predicting unit 205 sets the maximum value of the scheduled recovery time Tj as the scheduled recovery time of the user Ui (step S3012). Note that the recovery time prediction unit 205 may record the scheduled recovery time in a memory (not shown) or the like.
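- As a compact restatement of step S3012, the scheduled recovery time of a user can be written as the latest of the per-resource scheduled recovery times. The symbols $T_{U_i}$ (the scheduled recovery time of user Ui) and $P(U_i)$ (the set of resource types listed in the resource usage profile of user Ui) are notation introduced here for illustration only; the text itself names only Ui, Rj, and Tj.

```latex
T_{U_i} = \max_{R_j \in P(U_i)} T_j
```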
- the restoration time prediction means 205 determines whether or not the restoration scheduled time has been predicted for all the users included in the user list (step S3013). When the scheduled recovery time is not predicted for all users (NO in step S3013), the recovery time predicting means 205 repeats the processing from step S3002 to step S3013.
- the recovery time presenting means 207 presents the predicted recovery time to the user (step S2009). That is, the recovery time prediction unit 205 predicts the recovery time of the computer resources used by the users in order based on the sorted user list, and when the recovery time prediction is completed for all users, the recovery time presentation unit 207 presents the prediction results for each user.
- the recovery time prediction unit 205 stores the reservation information in the resource reservation information storage unit 208. Specifically, the recovery time predicting unit 205 determines whether or not a computer resource used by each user can be reserved based on reservation information stored in the resource reservation information storage unit 208. In addition, the recovery time prediction unit 205 stores reservation information in the resource reservation information storage unit 208 with the recovery time of the reservable computer resource as the reservation start time based on the recovery schedule. Then, the recovery time prediction unit 205 predicts the recovery time of the cloud service used by the user based on the recovery schedule and reservation information.
- the recovery time predicting unit 205 records the resource reservation information to be recovered in the resource reservation information storage unit 208 for computer resources that are shared and used exclusively by a plurality of users. Therefore, the time when the type and amount of the resource requested by each user can be reliably used is predicted as the service restoration time, and the prediction result can be presented to the user.
- the recovery time predicting unit 205 predicts the scheduled recovery time in consideration that the computer resource reserved by another user cannot be used even if it is recovered. Therefore, it is possible to avoid the problem that the user cannot resume the use of the service because the other user has used it first after the scheduled recovery time.
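- The procedure of FIG. 5 described above can be sketched in Python as follows. This is only a minimal illustration under stated assumptions, not the implementation of the recovery time predicting unit 205: the names `User`, `predict_recovery_times`, `dedicated`, and `shared` are hypothetical, times are expressed as minutes since midnight, a larger `priority` value is assumed to mean a higher priority, and a shared resource is modeled as a number of units that become available at each scheduled recovery time.

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str
    priority: int   # assumption: larger value means higher priority
    needs: dict     # resource type -> requested amount (resource usage profile)


def predict_recovery_times(users, dedicated, shared):
    """Predict one recovery time per user, reserving shared resources in priority order.

    dedicated: resource type -> recovery time (no exclusive control needed)
    shared:    resource type -> list of (recovery time, units recovered at that time)
    """
    # Units of each shared resource that are still unreserved, earliest slot first.
    free = {r: [[t, n] for t, n in sorted(slots)] for r, slots in shared.items()}
    reservations = []   # (user, resource type, reservation start time, units)
    predicted = {}
    # Steps S3001-S3002: handle users from the highest priority downwards.
    for user in sorted(users, key=lambda u: u.priority, reverse=True):
        times = []
        for rtype, amount in user.needs.items():        # step S3004
            if amount <= 0:
                continue
            if rtype in dedicated:                      # not a shared resource (S3007)
                times.append(dedicated[rtype])
                continue
            remaining = amount                          # shared resource (S3008-S3010)
            t_j = None
            for slot in free[rtype]:
                take = min(slot[1], remaining)
                if take > 0:                            # units reserved by others are skipped
                    slot[1] -= take
                    remaining -= take
                    t_j = slot[0]
                    reservations.append((user.name, rtype, slot[0], take))
                if remaining == 0:
                    break
            if remaining > 0:
                raise ValueError(f"not enough {rtype} in the recovery schedule")
            times.append(t_j)
        # Step S3012: the latest per-resource time is the user's recovery time.
        predicted[user.name] = max(times, default=None)
    return predicted, reservations
```

- Because reservations are consumed in priority order, a lower-priority user is pushed to a later recovery slot once the earlier units are taken, which is the effect the second embodiment relies on.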
- FIG. 6 is an explanatory diagram illustrating an example of a cloud service recovery time prediction system according to the third embodiment of this invention.
- the recovery time prediction system 2 in the present embodiment includes a recovery schedule optimization unit 209 and a recovery schedule constraint information storage unit 210 in addition to the configuration of the recovery time prediction system 2 of the first embodiment.
- the recovery schedule constraint information storage unit 210 stores requests for resource recovery schedules and constraint information. Specifically, the recovery schedule constraint information storage unit 210 stores recovery schedule constraint information that defines a constraint condition of a resource recovery schedule based on a dependency relationship between computer resources or a resource recovery request by a user. Examples of the recovery schedule constraint information include the deadline and priority of the recovery time of each user. The recovery schedule constraint information is stored in advance in the recovery schedule constraint information storage unit 210 by an administrator or the like.
- the recovery schedule optimization unit 209 generates a recovery schedule that optimizes the recovery schedule of each computer resource or service based on the recovery schedule constraint information.
- the recovery schedule optimization unit 209 refers to the information stored in the resource use profile storage unit 206 and the resource recovery schedule storage unit 204 together with the recovery schedule constraint information when generating the resource recovery schedule.
- the recovery schedule optimizing unit 209 searches for recovery schedule candidates (combinations) that maximize or minimize an index to be optimized (hereinafter also referred to as a target index) under the constraints indicated by the recovery schedule constraint information.
- Whether to maximize or minimize the target index depends on the nature of the target index. For example, if the objective index is “the number of service restoration users within a predetermined period”, it can be said that the optimization is to maximize the value. Further, for example, if the objective index is “recovery time”, it can be said that the optimization is to minimize the value.
- the recovery schedule optimization unit 209 updates the corresponding recovery schedule stored in the resource recovery schedule storage unit 204 with the searched recovery schedule. Then, the recovery time predicting unit 205 predicts the recovery time of the service used by the user based on the updated recovery schedule.
- Examples of the target index include the average recovery time for all users, the worst value of the recovery time for a specific user group, and the cost of the service provider for recovery. This index is determined in advance by an administrator or the like based on input from the service provider or information set in advance.
- the recovery schedule optimizing means 209 identifies the computer resource used by the user based on the resource usage profile. Then, the recovery schedule optimization unit 209 specifies a schedule (recovery order) for recovering the specified computer resources based on the recovery schedule. For example, the recovery schedule optimizing unit 209 uses a combination in which the order of recovering the computer resources in the recovery schedule is changed as a candidate for the recovery schedule. The recovery schedule optimization unit 209 determines whether the recovery schedule candidate satisfies the constraint indicated by the recovery schedule constraint information. When there are a plurality of recovery schedule candidates, the recovery schedule optimization unit 209 selects an optimal candidate from the candidates, and updates the recovery schedule with the selected candidate.
- the recovery schedule optimizing means 209 executes the process of optimizing the recovery schedule according to the needs of the service provider, either after the failure status has been investigated and each resource recovery schedule has been generated, or during the failure recovery process.
- the failure status investigation unit 201, the recovery schedule generation unit 203, the recovery time prediction unit 205, the recovery time presentation unit 207, and the recovery schedule optimization unit 209 are realized by a CPU of a computer that operates according to a program (cloud service recovery time prediction program). Alternatively, the failure status investigation unit 201, the recovery schedule generation unit 203, the recovery time prediction unit 205, the recovery time presentation unit 207, and the recovery schedule optimization unit 209 may each be realized by dedicated hardware.
- FIG. 7 is a flowchart illustrating an example of processing for creating a recovery schedule.
- the recovery schedule optimizing means 209 determines an objective index for optimization based on input from the service provider or information set in advance (step S4000).
- the average recovery time of all service users is used as an objective index.
- a method for determining a recovery schedule that minimizes the average recovery time will be described as an example.
- the recovery schedule generation unit 203 refers to the failure status and generates a recovery schedule for each computer resource and provided service (step S4001).
- the recovery schedule may be created by an administrator or the like. Thereafter, the recovery schedule generating unit 203 stores the generated recovery schedule in the resource recovery schedule storage unit 204.
- the recovery schedule optimization unit 209 acquires information necessary for generating a recovery schedule from the resource usage profile storage unit 206 and the recovery schedule constraint information storage unit 210. Specifically, the recovery schedule optimization unit 209 acquires the resource usage profile from the resource usage profile storage unit 206 (step S4002), and acquires the recovery schedule constraint information from the recovery schedule constraint information storage unit 210 (step S4003).
- the recovery schedule optimizing means 209 searches for a recovery schedule that can be realized within the range of the given recovery schedule constraint information, and searches for a combination of recovery schedules that optimize the objective index (step S4004).
- the recovery schedule optimization means 209 searches for a combination of recovery schedules using a method generally used as a solution to the optimization problem.
- a case where sequential search is used will be described as an example of the simplest search method.
- FIG. 8 is a flowchart showing an example of the sequential search process.
- the computer resource recovery schedule is not limited to one.
- Various recovery schedules are conceivable, such as combinations in which the order of recovering computer resources is changed. Therefore, the recovery schedule optimization means 209 first lists the recovery schedule combination candidates that can be realized based on the failure status (step S5000 in FIG. 8).
- the recovery schedule combination candidates listed here become a search range (search space) when performing optimization.
- the recovery schedule optimization means 209 selects a recovery schedule candidate Sj (step S5001). Then, the recovery time predicting means 205 predicts the service recovery time for all users (step S5002). Note that the method for predicting the service recovery time is the same as the method described in the first embodiment.
- the restoration schedule optimization means 209 calculates the value Vi of the objective index based on the prediction result (step S5003).
- the average recovery time is used as an objective index. Therefore, the recovery schedule optimizing means 209 calculates the average value of the recovery time predicted for each user and calculates the value Vi of the objective index.
- the recovery schedule optimization unit 209 determines whether the candidate Sj satisfies all the recovery schedule constraint information stored in the recovery schedule constraint information storage unit 210 (step S5004).
- the recovery schedule optimization unit 209 determines whether the predicted recovery time satisfies this constraint.
- When the candidate Sj satisfies all the recovery schedule constraint information stored in the recovery schedule constraint information storage unit 210, the recovery schedule optimization unit 209 sets the candidate Sj as a recovery schedule candidate (hereinafter also referred to as an optimal recovery schedule candidate). Note that the recovery schedule optimization unit 209 may record the optimal recovery schedule candidate in a memory (not shown) or the like (step S5005).
- the recovery schedule optimizing unit 209 determines whether or not to end the search for recovery schedule candidates (step S5006). For example, the recovery schedule optimization unit 209 may make this determination by checking whether a search end condition is satisfied. If it is determined not to end the search (NO in step S5006), the processing from step S5001 to step S5006 is repeated. On the other hand, when it is determined that the search is to be ended (YES in step S5006), the optimization process is ended.
- In this way, the processing from step S5001 to step S5005 is repeated for different candidates Sj, and when the search end condition is reached, the search for recovery schedule candidates is completed.
- Examples of the search end condition include that all candidates have been searched and that a certain number of searches have been completed.
- the recovery schedule optimization means 209 determines the best recovery schedule from the recovery schedule candidates obtained as a result of the search, and updates the recovery schedule stored in the resource recovery schedule storage unit 204 (step S4005 in FIG. 7).
- the average recovery time is used as an optimization objective index. Therefore, the recovery schedule optimizing unit 209 determines, as a best recovery schedule, a recovery schedule that minimizes the average recovery time from among the recovery schedules listed as candidates as a result of the search.
- the recovery schedule optimization unit 209 searches for a recovery schedule candidate that maximizes or minimizes the target index under the constraints indicated by the recovery schedule constraint information. Further, the recovery schedule optimization unit 209 updates the corresponding recovery schedule stored in the resource recovery schedule storage unit 204 with the recovery schedule. Then, the recovery time prediction unit 205 predicts the recovery time of the cloud service used by the user based on the updated recovery schedule.
- the recovery schedule optimization unit 209 updates the best resource recovery schedule as necessary. Therefore, it is possible to predict a recovery time that satisfies the user's requirements and restrictions for service recovery.
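- The sequential search of FIG. 8 can be sketched in Python as follows, using the average recovery time as the objective index. This is a hedged illustration rather than the method of the recovery schedule optimization unit 209: the function name `sequential_search` and its parameters are hypothetical, the candidates are assumed to be all orderings of the resources, the resources are assumed to be recovered one after another with a known work duration each, and the only constraint modeled is a per-user deadline.

```python
from itertools import permutations


def sequential_search(resources, durations, start, usage, deadlines=None,
                      max_candidates=None):
    """Search recovery orders and return the one with the smallest average recovery time.

    resources: resource names to recover (assumption: recovered serially)
    durations: resource -> minutes of recovery work
    start:     time at which recovery work starts (minutes since midnight)
    usage:     user -> set of resources the user needs
    deadlines: user -> latest acceptable recovery time (constraint), optional
    """
    deadlines = deadlines or {}
    best_order, best_value = None, None
    # Steps S5000-S5001: enumerate the candidates and pick them one at a time.
    for count, order in enumerate(permutations(resources)):
        # Search end condition: stop after a fixed number of candidates, if given.
        if max_candidates is not None and count >= max_candidates:
            break
        finish, clock = {}, start
        for r in order:
            clock += durations[r]
            finish[r] = clock
        # Step S5002: a user's service recovers when the last needed resource does.
        per_user = {u: max(finish[r] for r in needs) for u, needs in usage.items()}
        # Step S5003: value of the objective index (average recovery time).
        value = sum(per_user.values()) / len(per_user)
        # Step S5004: discard candidates that violate a constraint.
        if any(per_user[u] > limit for u, limit in deadlines.items()):
            continue
        # Steps S5005 and S4005: keep the best feasible candidate seen so far.
        if best_value is None or value < best_value:
            best_order, best_value = order, value
    return best_order, best_value
```

- Exhaustive enumeration grows factorially with the number of resources, so the `max_candidates` cutoff plays the role of the "certain number of searches" end condition; any general optimization method could be substituted for the loop.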
- FIG. 9 is an explanatory diagram illustrating an example of computer resources and services provided by the cloud service providing unit 1 according to the present embodiment.
- the cloud service providing unit 1 of this embodiment provides users with n virtual machines (virtual machine 1 to virtual machine n), 2 * m storage volumes (volume 11 to volume 2m), a monitoring service, a VPN service, and a load balancing service.
- n and m are positive integers.
- a service provided by the cloud service providing unit 1 may be referred to as an additional service.
- FIG. 10 is an explanatory diagram showing a part of a resource profile of a cloud service user. It is assumed that the resource profile of the cloud service user illustrated in FIG. 10 is stored in the resource use profile storage unit 206 when the failure occurs.
- the resource profile includes the type and number of virtual machines required by each user, storage space, and additional services.
- the user A uses one standard type virtual machine and the volume 11 of the storage volume, and further uses the monitoring service.
- virtual machines with different specifications and functions may be prepared. Therefore, the required virtual machine type may be included in the resource usage profile.
- all virtual machines are standard type virtual machines. It is also assumed that the virtual machines used by user A to user F are affected by the failure.
- the failure state investigation means 201 investigates the damage state of the computer resource due to the failure, and records the investigated failure state in the failure state storage unit 202.
- FIG. 11 is an explanatory diagram showing an example of failure status data.
- the failure status data includes information indicating the failure location of the computer resource used in the cloud service providing unit 1. Further, the failure status data includes information on the cause of the failure and the recovery procedure as necessary. These failure status data are used to create a resource recovery schedule.
- FIG. 12 is an explanatory diagram showing an example of a resource recovery schedule.
- the resource recovery schedule is designed by a person in charge of the recovery operation in consideration of the damage status of the computer resources and the man-hours required for the recovery operation.
- the recovery schedule generation unit 203 stores the created resource recovery schedule in the resource recovery schedule storage unit 204.
- the time at which each computer resource and additional service is scheduled to be restored is stored in the resource restoration schedule storage unit 204.
- the resource recovery schedule illustrated in FIG. 12 shows that the recovery work starts at 12:00, that four standard type virtual machines become available at 12:30, and that four more virtual machines become available at 13:00.
- the recovery time predicting means 205 predicts the service recovery time for each user with reference to the recovery schedule and resource usage profile.
- the service recovery time is predicted according to the procedure of the flowchart illustrated in FIG.
- the recovery time predicting means 205 selects the user A and refers to the resource usage profile of the user A. From the resource usage profile illustrated in FIG. 10, it can be determined that the user A is using the virtual machine, the storage, and the monitoring service.
- the recovery time predicting means 205 first refers to the recovery schedule of the virtual machine.
- the recovery schedule illustrated in FIG. 12 indicates that four standard-type virtual machines can be used at 12:30. Therefore, the recovery time predicting unit 205 predicts the scheduled recovery time of the virtual machine used by the user A as 12:30.
- the recovery time predicting means 205 refers to the storage recovery schedule. From the recovery schedule illustrated in FIG. 12, it can be seen that the volume 21 used by the user A is scheduled to be recovered at 12:20. Therefore, the recovery time predicting unit 205 predicts the scheduled recovery time of the storage used by the user A as 12:20.
- the recovery time predicting means 205 refers to the recovery schedule of the monitoring service. From the recovery schedule illustrated in FIG. 12, it can be seen that the monitoring service is scheduled to be recovered at 12:30. Therefore, the recovery time predicting unit 205 predicts the scheduled recovery time of the monitoring service used by the user A as 12:30.
- the recovery time prediction unit 205 obtains the maximum value from the recovery time of each computer resource or service from the above result.
- the recovery time predicting means 205 can determine that the service recovery scheduled time for the user A is 12:30.
- the recovery time predicting means 205 similarly obtains the scheduled recovery time of the service for other users based on the resource usage profile of each user.
- FIG. 13 is an explanatory diagram illustrating an example of a result of obtaining a scheduled service recovery time for each user.
- the recovery time presenting means 207 presents the predicted service recovery time to the user. By performing such processing, the object of the present invention can be achieved.
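- The calculation for user A in this example can be reproduced with a few lines of Python. The per-resource times (12:30 for the virtual machine, 12:20 for the storage, 12:30 for the monitoring service) are taken from the text above; the helper names `to_minutes`, `to_hhmm`, and `service_recovery_time` are hypothetical and exist only for this sketch.

```python
def to_minutes(hhmm):
    """Convert an 'HH:MM' string to minutes since midnight."""
    hours, minutes = hhmm.split(":")
    return int(hours) * 60 + int(minutes)


def to_hhmm(minutes):
    """Convert minutes since midnight back to an 'HH:MM' string."""
    return f"{minutes // 60:02d}:{minutes % 60:02d}"


def service_recovery_time(resource_times):
    """The service recovers when the last resource the user needs has recovered."""
    return to_hhmm(max(to_minutes(t) for t in resource_times.values()))


# Scheduled recovery times of the resources used by user A, as stated in the text.
user_a = {
    "virtual machine": "12:30",
    "storage volume": "12:20",
    "monitoring service": "12:30",
}

print(service_recovery_time(user_a))  # prints 12:30
```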
- In the first embodiment, it is assumed that users A to F use standard type virtual machines.
- this standard type virtual machine is a resource of a type that is shared by a plurality of users and used exclusively. Therefore, in the second embodiment, the recovery time is predicted in consideration of virtual machine reservations, which makes it possible to predict the recovery time more accurately.
- the second example corresponds to the second embodiment.
- FIG. 14 is an explanatory diagram showing an example of virtual machine reservation information.
- the service recovery time is predicted according to the procedure of the flowchart illustrated in FIG.
- reservations for using computer resources are made in the order from user A to user F.
- FIG. 15 is an explanatory diagram showing another example of the result of obtaining the service recovery scheduled time for each user.
- FIG. 16 is a block diagram showing an example of the minimum configuration of the cloud service recovery time prediction system according to the present invention.
- the cloud service recovery time prediction system according to the present invention includes: a recovery schedule storage unit 81 (for example, the resource recovery schedule storage unit 204) that stores, for each type of computer resource or provided service, a recovery schedule that defines a schedule for restoring the computer resource or the provided service when a failure occurs in a plurality of types of computer resources (for example, virtual machine 101, storage 102) provided by a cloud service (for example, cloud service providing unit 1) or in a service provided by the cloud service (for example, the service providing unit 103); a resource use profile storage unit 82 (for example, the resource use profile storage unit 206) that stores, for each user, a resource use profile that defines computer resources to be used when each user uses the cloud service; recovery time predicting means 83 (for example, the recovery time predicting means 205) that specifies, from the resource use profile, the computer resources or provided services to be used when the user uses the cloud service, and predicts the recovery time of the cloud service used by the user by predicting, based on the recovery schedule, the recovery time of all the specified computer resources or provided services; and recovery time presenting means 84 (for example, the recovery time presenting means 207) that presents the predicted service recovery time to the user.
- a recovery schedule storage means for storing a schedule for each type of computer resource or provided service; a resource usage profile storage means for storing, for each user, a resource usage profile that defines the computer resources used when each user uses the cloud service; a recovery time prediction means for specifying, from the resource usage profile, the computer resources or provided services to be used when a user uses the cloud service, and for predicting the recovery time of the cloud service used by the user by predicting when all the specified computer resources or provided services are restored; and a recovery time presentation means for presenting the predicted service recovery time to the user.
- (Supplementary Note 2) The cloud service recovery time prediction system according to appendix 1, further comprising: fault status storage means for storing the fault status of each computer resource or provided service; fault status investigation means for investigating the fault status and storing it in the fault status storage means; and a recovery schedule generating unit that generates a recovery schedule based on the fault status stored in the fault status storage means and stores the recovery schedule in the recovery schedule storage unit.
- the cloud service recovery time prediction system according to appendix 1 or appendix 2, further comprising: resource reservation information storage means for storing reservation information, which is information that associates the reservation start time of a computer resource shared by a plurality of users and used exclusively among the users with the user of the computer resource; and resource reservation information registration means, wherein the resource reservation information registration means determines, based on the reservation information stored in the resource reservation information storage means, whether the computer resource used by each user can be reserved, and stores, in the resource reservation information storage means, reservation information whose reservation start time is the recovery time, given by the recovery schedule, of the reservable computer resource, and wherein the recovery time prediction means predicts the recovery time of the cloud service used by the user based on the recovery schedule and the reservation information.
- the cloud service recovery time prediction system according to any one of appendix 1 to appendix 3, further comprising: recovery schedule constraint information storage means for storing recovery schedule constraint information that defines a constraint condition of a recovery schedule based on a dependency relationship between computer resources or a resource recovery request by a user; and a recovery schedule optimization unit that searches for a recovery schedule candidate that maximizes or minimizes the target index under the constraint indicated by the recovery schedule constraint information and updates the corresponding recovery schedule stored in the resource recovery schedule storage unit with that recovery schedule, wherein the recovery time prediction means predicts the recovery time of the cloud service used by the user based on the updated recovery schedule.
- a cloud service recovery time prediction method for predicting recovery time when a failure occurs in a plurality of types of computer resources provided by a cloud service or a service provided by the cloud service, the method comprising: referring to resource usage profile storage means that stores, for each user, a resource usage profile that defines computer resources to be used when each user uses the cloud service, and specifying, from the information stored in the resource usage profile storage means, a computer resource or a provided service to be used when a user uses the cloud service; extracting a recovery schedule, which defines a schedule for restoring the computer resource or the provided service when a failure occurs in the computer resource or the provided service, from recovery schedule storage means that stores the recovery schedule for each type of computer resource or each provided service; predicting the recovery time of the cloud service used by the user by predicting the recovery time of all the specified computer resources or provided services based on the extracted recovery schedule; and presenting the predicted service recovery time to the user.
- Resource reservation information is stored as reservation information, which is information that associates a reservation start time of a computer resource shared by a plurality of users and exclusively used among the users, with the user of the computer resource.
- reservation information is information that associates a reservation start time of a computer resource shared by a plurality of users and exclusively used among the users, with the user of the computer resource.
- a cloud service recovery time prediction program applied to a computer for predicting recovery time when a failure occurs in a plurality of types of computer resources provided by a cloud service or a service provided by the cloud service, the computer including a resource use profile storage unit that stores, for each user, a resource use profile that defines a computer resource to be used when each user uses the cloud service, the program causing the computer to execute: a recovery time prediction process for specifying, from the resource usage profile stored in the resource usage profile storage means, a computer resource or a provided service to be used when a user uses the cloud service, extracting a recovery schedule, which defines a schedule for restoring the computer resource or the provided service when a failure occurs in the computer resource or the provided service, from a recovery schedule storage unit that stores the recovery schedule for each type of computer resource or each provided service, and predicting the recovery time of the cloud service used by the user; and a recovery time presentation process for presenting the predicted service recovery time to the user.
- the cloud service recovery time prediction program according to appendix 10, which causes the computer to execute: a failure status investigation process for investigating the failure status of each computer resource or provided service and storing it in the failure status storage means; and a recovery schedule generation process for generating a recovery schedule based on the failure status stored in the failure status storage means and storing the recovery schedule in a recovery schedule storage unit.
- The cloud service recovery time prediction program according to supplementary note 9 or supplementary note 10, which causes the computer to execute a resource reservation information registration process. This process stores, in the resource reservation information storage means, reservation information whose reservation start time is the recovery time of a computer resource that can be reserved, where reservation information is information that associates the reservation start time of a computer resource shared by a plurality of users and used exclusively among those users with the user of that computer resource. In the recovery time prediction process, the recovery time of the cloud service used by the user is predicted based on the recovery schedule and the reservation information stored in the resource reservation information storage means.
- the present invention is suitably applied to a cloud service recovery time prediction system that predicts a service recovery time for each service user in the event of a system failure involving a plurality of computer resource failures.
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Security & Cryptography (AREA)
- Computer Networks & Wireless Communication (AREA)
- Signal Processing (AREA)
- Debugging And Monitoring (AREA)
- Management, Administration, Business Operations System, And Electronic Commerce (AREA)
Abstract
Description
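FIG. 1 is an explanatory diagram showing an example of the overall configuration of a cloud service that includes the cloud service recovery time prediction system of the first embodiment of the present invention. The cloud service illustrated in FIG. 1 includes a cloud service providing unit 1, a recovery time prediction system 2, and a cloud service client 3. The recovery time prediction system 2 illustrated in FIG. 1 corresponds to the cloud service recovery time prediction system of the first embodiment of the present invention. The cloud service providing unit 1, the recovery time prediction system 2, and the cloud service client 3 are connected to one another via a communication network (not shown).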
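Next, the cloud service recovery time prediction system of the second embodiment of the present invention is described. The cloud service recovery time prediction system of this embodiment is likewise assumed to be part of a configuration similar to the cloud service illustrated in FIG. 1. FIG. 4 is an explanatory diagram showing an example of the cloud service recovery time prediction system of the second embodiment of the present invention. Components that are the same as in the first embodiment are given the same reference signs as in FIG. 1, and their description is omitted. The recovery time prediction system 2 of this embodiment includes a resource reservation information storage unit 208 in addition to the configuration of the recovery time prediction system 2 of the first embodiment.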
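As an illustration only, the way reservation information for an exclusively used shared resource could enter the prediction (compare claim 3) can be sketched as follows. The function names, the dictionary layout, and the rule of one reservation per resource are assumptions made for this sketch; the text here does not state how later reservation slots would be allocated, so a user without a usable reservation simply receives no prediction.

```python
from datetime import datetime

def register_reservation(reservations, schedule, resource, user):
    """Reserve `resource` for `user` starting at its recovery time, if still free.

    Assumption: one reservation per resource. Returns True if `user`
    holds the reservation after the call, False otherwise.
    """
    if resource in reservations:
        # Already reserved: only the existing holder counts as successful.
        return reservations[resource][0] == user
    # Reservation start time = recovery time taken from the recovery schedule.
    reservations[resource] = (user, schedule[resource])
    return True

def predict_with_reservations(user, needed, exclusive, schedule, reservations):
    """Latest time among recovered resources and the user's reserved start times."""
    times = []
    for resource in needed:
        if resource in exclusive:
            holder = reservations.get(resource)
            if holder is None or holder[0] != user:
                return None  # no usable reservation, so no prediction is made
            times.append(holder[1])
        else:
            times.append(schedule[resource])
    return max(times)

# Hypothetical data: a virtual machine and an exclusively used shared resource.
schedule = {"vm": datetime(2012, 8, 2, 10, 0), "tape": datetime(2012, 8, 2, 12, 0)}
reservations = {}
ok_a = register_reservation(reservations, schedule, "tape", "user_a")
ok_b = register_reservation(reservations, schedule, "tape", "user_b")
time_a = predict_with_reservations("user_a", ["vm", "tape"], {"tape"}, schedule, reservations)
time_b = predict_with_reservations("user_b", ["vm", "tape"], {"tape"}, schedule, reservations)
```

In this sketch the first requester obtains the reservation at the resource's recovery time, and the prediction for that user is the latest of the relevant times.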
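Next, the cloud service recovery time prediction system of the third embodiment of the present invention is described. The cloud service recovery time prediction system of this embodiment is likewise assumed to be part of a configuration similar to the cloud service illustrated in FIG. 1. FIG. 6 is an explanatory diagram showing an example of the cloud service recovery time prediction system of the third embodiment of the present invention. Components that are the same as in the first embodiment are given the same reference signs as in FIG. 1, and their description is omitted. The recovery time prediction system 2 of this embodiment includes recovery schedule optimization means 209 and a recovery schedule constraint information storage unit 210 in addition to the configuration of the recovery time prediction system 2 of the first embodiment.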
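A minimal sketch of what a search by the recovery schedule optimization means 209 might look like is given below (compare claim 4). Everything specific in it is an assumption for illustration: resources are repaired one at a time, repair durations are known in hours, constraints are "must recover before" pairs, the metric to minimize is the sum of all users' service recovery times, and the search is exhaustive. The source does not prescribe any of these choices.

```python
from itertools import permutations

def optimize_recovery_order(durations, must_precede, profiles):
    """Exhaustively search recovery orders that satisfy the dependency constraints.

    durations:    resource -> repair time in hours (repairs assumed sequential)
    must_precede: set of (earlier, later) resource pairs
    profiles:     user -> list of resources that user needs
    Returns (schedule, cost), where schedule maps resource -> completion hour.
    """
    best_schedule, best_cost = None, None
    for order in permutations(durations):
        position = {r: i for i, r in enumerate(order)}
        if any(position[a] > position[b] for a, b in must_precede):
            continue  # this order violates a dependency constraint
        elapsed, schedule = 0, {}
        for r in order:
            elapsed += durations[r]
            schedule[r] = elapsed
        # Assumed metric: each user's recovery time is the latest completion
        # among the resources they need; the cost is the sum over users.
        cost = sum(max(schedule[r] for r in needs) for needs in profiles.values())
        if best_cost is None or cost < best_cost:
            best_schedule, best_cost = schedule, cost
    return best_schedule, best_cost

# Hypothetical example data.
durations = {"network": 1, "storage": 4, "vm": 2}
must_precede = {("network", "vm")}
profiles = {"user_a": ["network", "vm"], "user_b": ["storage"], "user_c": ["vm"]}
best_schedule, best_cost = optimize_recovery_order(durations, must_precede, profiles)
```

Exhaustive search is only practical for a handful of resources; a real system would presumably use a heuristic or a solver, which this section does not specify.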
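101 Virtual machine
102 Storage
103 Service providing unit
2 Recovery time prediction system
201 Failure status investigation means
202 Failure status storage unit
203 Recovery schedule generation means
204 Resource recovery schedule storage unit
205 Recovery time prediction means
206 Resource usage profile storage unit
207 Recovery time presentation means
208 Resource reservation information storage unit
209 Recovery schedule optimization means
210 Recovery schedule constraint information storage unit
3 Cloud service client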
Claims (10)
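- A cloud service recovery time prediction system comprising:
recovery schedule storage means for storing, for each type of computer resource or each provided service, a recovery schedule that defines a schedule for recovering the computer resource or the provided service when a failure occurs in any of a plurality of types of computer resources provided by a cloud service or in a provided service, which is a service provided by the cloud service;
resource usage profile storage means for storing, for each user, a resource usage profile that defines the computer resources that the user uses when using the cloud service;
recovery time prediction means for identifying, from the resource usage profile, the computer resources or provided services that a user uses when using the cloud service, and predicting, based on the recovery schedule, the time at which all of the identified computer resources or provided services recover, thereby predicting the recovery time of the cloud service used by that user; and
recovery time presentation means for presenting the predicted service recovery time to the user.
- The cloud service recovery time prediction system according to claim 1, further comprising:
failure status storage means for storing the failure status of each computer resource or provided service;
failure status investigation means for investigating the failure status and storing it in the failure status storage means; and
recovery schedule generation means for generating a recovery schedule based on the failure status stored in the failure status storage means and storing the recovery schedule in the recovery schedule storage means.
- The cloud service recovery time prediction system according to claim 1 or claim 2, further comprising resource reservation information registration means for storing, in resource reservation information storage means, reservation information, which is information that associates the reservation start time of a computer resource shared by a plurality of users and used exclusively among those users with the user of that computer resource,
wherein the resource reservation information registration means determines, based on the reservation information stored in the resource reservation information storage means, whether the computer resource used by each user can be reserved, and, based on the recovery schedule, stores in the resource reservation information storage means reservation information whose reservation start time is the recovery time of the reservable computer resource, and
the recovery time prediction means predicts the recovery time of the cloud service used by the user based on the recovery schedule and the reservation information.
- The cloud service recovery time prediction system according to any one of claims 1 to 3, further comprising:
recovery schedule constraint information storage means for storing recovery schedule constraint information that defines constraints on the recovery schedule based on dependencies between computer resources or on resource recovery requests from users; and
recovery schedule optimization means for searching, under the constraints indicated by the recovery schedule constraint information, for a candidate recovery schedule that maximizes or minimizes a target metric, and updating the corresponding recovery schedule stored in the resource recovery schedule storage means with that recovery schedule,
wherein the recovery time prediction means predicts the recovery time of the cloud service used by the user based on the updated recovery schedule.
- A cloud service recovery time prediction method for predicting the recovery time when a failure occurs in any of a plurality of types of computer resources provided by a cloud service or in a provided service, which is a service provided by the cloud service, the method comprising:
referring to resource usage profile storage means that stores, for each user, a resource usage profile defining the computer resources that the user uses when using the cloud service, and identifying, from the resource usage profile stored in the resource usage profile storage means, the computer resources or provided services that a user uses when using the cloud service;
extracting a recovery schedule, which defines a schedule for recovering the computer resource or the provided service when a failure occurs in it, from recovery schedule storage means that stores the recovery schedule for each type of computer resource or each provided service;
predicting, based on the extracted recovery schedule, the time at which all of the identified computer resources or provided services recover, thereby predicting the recovery time of the cloud service used by that user; and
presenting the predicted service recovery time to the user.
- The cloud service recovery time prediction method according to claim 5, further comprising:
investigating the failure status of each computer resource or provided service and storing it in failure status storage means;
generating a recovery schedule based on the failure status stored in the failure status storage means; and
storing the recovery schedule in the recovery schedule storage means.
- The cloud service recovery time prediction method according to claim 5 or claim 6, further comprising:
when storing, in resource reservation information storage means, reservation information, which is information that associates the reservation start time of a computer resource shared by a plurality of users and used exclusively among those users with the user of that computer resource, determining, based on the reservation information stored in the resource reservation information storage means, whether the computer resource used by each user can be reserved, and storing, based on the recovery schedule, reservation information whose reservation start time is the recovery time of the reservable computer resource in the resource reservation information storage means; and
predicting the recovery time of the cloud service used by the user based on the recovery schedule and the reservation information stored in the resource reservation information storage means.
- The cloud service recovery time prediction method according to any one of claims 5 to 7, further comprising:
searching, under the constraints indicated by recovery schedule constraint information that defines constraints on the recovery schedule based on dependencies between computer resources or on resource recovery requests from users, for a candidate recovery schedule that maximizes or minimizes a target metric;
updating the corresponding recovery schedule stored in the resource recovery schedule storage means with that recovery schedule; and
predicting the recovery time of the cloud service used by the user based on the updated recovery schedule.
- A cloud service recovery time prediction program applied to a computer that predicts the recovery time when a failure occurs in any of a plurality of types of computer resources provided by a cloud service or in a provided service, which is a service provided by the cloud service,
the program causing the computer to execute:
a recovery time prediction process of referring to resource usage profile storage means that stores, for each user, a resource usage profile defining the computer resources that the user uses when using the cloud service, identifying, from the resource usage profile stored in the resource usage profile storage means, the computer resources or provided services that a user uses when using the cloud service, extracting a recovery schedule, which defines a schedule for recovering the computer resource or the provided service when a failure occurs in it, from recovery schedule storage means that stores the recovery schedule for each type of computer resource or each provided service, and predicting, based on the extracted recovery schedule, the time at which all of the identified computer resources or provided services recover, thereby predicting the recovery time of the cloud service used by that user; and
a recovery time presentation process of presenting the predicted service recovery time to the user.
- The cloud service recovery time prediction program according to claim 9, causing the computer to further execute:
a failure status investigation process of investigating the failure status of each computer resource or provided service and storing it in failure status storage means; and
a recovery schedule generation process of generating a recovery schedule based on the failure status stored in the failure status storage means and storing the recovery schedule in the recovery schedule storage means.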
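The core prediction recited in claims 1, 5 and 9 can be sketched as follows. This is an illustrative sketch only: the resource names, user names, times, and function names are hypothetical, and the treatment of resources absent from the schedule (assumed here to be unaffected by the failure) is an assumption. The essential step, taken from the claims, is that the service recovery time for a user is the time at which all resources in that user's resource usage profile have recovered.

```python
from datetime import datetime

# Hypothetical recovery schedule: resource or provided service -> planned recovery time.
recovery_schedule = {
    "virtual_machine": datetime(2012, 8, 2, 10, 0),
    "storage": datetime(2012, 8, 2, 13, 30),
    "mail_service": datetime(2012, 8, 2, 11, 15),
}

# Hypothetical resource usage profiles: user -> resources or services that user needs.
resource_usage_profiles = {
    "user_a": ["virtual_machine", "storage"],
    "user_b": ["virtual_machine", "mail_service"],
    "user_c": ["dns_service"],
}

def predict_service_recovery_time(user, profiles, schedule):
    """Return the time at which every failed resource the user needs has recovered.

    Resources missing from the schedule are assumed not to have failed.
    Returns None when none of the user's resources is affected.
    """
    times = [schedule[r] for r in profiles[user] if r in schedule]
    # The service is usable only once the last needed resource is back.
    return max(times) if times else None

def present_recovery_time(user, profiles, schedule):
    """Format the prediction as a message for the user."""
    t = predict_service_recovery_time(user, profiles, schedule)
    if t is None:
        return f"{user}: no affected resources"
    return f"{user}: service expected to recover at {t:%Y-%m-%d %H:%M}"
```

Because the two users depend on different resources, they receive different predictions even though the failure is the same, which is the per-user prediction that the claims describe.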
Priority Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US13/981,249 US8904242B2 (en) | 2011-09-08 | 2012-08-02 | Cloud service recovery time prediction system, method and program |
JP2013529882A JP5370624B2 (ja) | 2011-09-08 | 2012-08-02 | Cloud service recovery time prediction system, method and program |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
JP2011-196064 | 2011-09-08 | ||
JP2011196064 | 2011-09-08 |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2013035243A1 (ja) | 2013-03-14 |
Family
ID=47831721
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2012/004906 WO2013035243A1 (ja) | 2011-09-08 | 2012-08-02 | Cloud service recovery time prediction system, method and program |
Country Status (3)
Country | Link |
---|---|
US (1) | US8904242B2 (ja) |
JP (1) | JP5370624B2 (ja) |
WO (1) | WO2013035243A1 (ja) |
Cited By (4)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2015230522A (ja) * | 2014-06-03 | 2015-12-21 | Pfuテクニカルコミュニケーションズ株式会社 | 情報処理装置、診断順序決定方法及び制御プログラム |
JP2017045079A (ja) * | 2015-08-24 | 2017-03-02 | 株式会社日立製作所 | クラウド管理方法及びクラウド管理システム |
CN107003926A (zh) * | 2014-12-25 | 2017-08-01 | 歌乐株式会社 | 故障信息提供服务器、故障信息提供方法 |
JP2021086604A (ja) * | 2019-11-29 | 2021-06-03 | ベイジン バイドゥ ネットコム サイエンス アンド テクノロジー カンパニー リミテッド | 異常サーバのサービス処理方法および装置 |
Families Citing this family (55)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US8028090B2 (en) | 2008-11-17 | 2011-09-27 | Amazon Technologies, Inc. | Request routing utilizing client location information |
US7991910B2 (en) | 2008-11-17 | 2011-08-02 | Amazon Technologies, Inc. | Updating routing information based on client location |
US8447831B1 (en) | 2008-03-31 | 2013-05-21 | Amazon Technologies, Inc. | Incentive driven content delivery |
US8606996B2 (en) | 2008-03-31 | 2013-12-10 | Amazon Technologies, Inc. | Cache optimization |
US8321568B2 (en) | 2008-03-31 | 2012-11-27 | Amazon Technologies, Inc. | Content management |
US7970820B1 (en) | 2008-03-31 | 2011-06-28 | Amazon Technologies, Inc. | Locality based content distribution |
US7962597B2 (en) | 2008-03-31 | 2011-06-14 | Amazon Technologies, Inc. | Request routing based on class |
US9407681B1 (en) | 2010-09-28 | 2016-08-02 | Amazon Technologies, Inc. | Latency measurement in resource requests |
US8782236B1 (en) | 2009-06-16 | 2014-07-15 | Amazon Technologies, Inc. | Managing resources using resource expiration data |
US8397073B1 (en) | 2009-09-04 | 2013-03-12 | Amazon Technologies, Inc. | Managing secure content in a content delivery network |
US9495338B1 (en) | 2010-01-28 | 2016-11-15 | Amazon Technologies, Inc. | Content distribution network |
US9712484B1 (en) | 2010-09-28 | 2017-07-18 | Amazon Technologies, Inc. | Managing request routing information utilizing client identifiers |
US8468247B1 (en) | 2010-09-28 | 2013-06-18 | Amazon Technologies, Inc. | Point of presence management in request routing |
US9003035B1 (en) | 2010-09-28 | 2015-04-07 | Amazon Technologies, Inc. | Point of presence management in request routing |
US10958501B1 (en) | 2010-09-28 | 2021-03-23 | Amazon Technologies, Inc. | Request routing information based on client IP groupings |
US8452874B2 (en) | 2010-11-22 | 2013-05-28 | Amazon Technologies, Inc. | Request routing processing |
US10467042B1 (en) | 2011-04-27 | 2019-11-05 | Amazon Technologies, Inc. | Optimized deployment based upon customer locality |
US10623408B1 (en) | 2012-04-02 | 2020-04-14 | Amazon Technologies, Inc. | Context sensitive object management |
US9154551B1 (en) | 2012-06-11 | 2015-10-06 | Amazon Technologies, Inc. | Processing DNS queries to identify pre-processing information |
US9323577B2 (en) * | 2012-09-20 | 2016-04-26 | Amazon Technologies, Inc. | Automated profiling of resource usage |
US10205698B1 (en) | 2012-12-19 | 2019-02-12 | Amazon Technologies, Inc. | Source-dependent address resolution |
US9251115B2 (en) * | 2013-03-07 | 2016-02-02 | Citrix Systems, Inc. | Dynamic configuration in cloud computing environments |
US9183034B2 (en) * | 2013-05-16 | 2015-11-10 | Vmware, Inc. | Managing availability of virtual machines in cloud computing services |
US9223672B1 (en) * | 2013-09-24 | 2015-12-29 | Intuit Inc. | Method and system for providing error repair status data to an application user |
US20170010941A1 (en) * | 2014-05-30 | 2017-01-12 | Hitachi, Ltd. | Method for adjusting backup schedule for virtual computer |
JP6369235B2 (ja) * | 2014-09-02 | 2018-08-08 | 富士通株式会社 | ストレージ制御装置およびストレージ制御プログラム |
JP2016057795A (ja) * | 2014-09-09 | 2016-04-21 | 富士通株式会社 | ストレージ制御装置,ストレージシステム及びストレージ制御プログラム |
US10009248B2 (en) * | 2014-12-12 | 2018-06-26 | International Business Machines Corporation | System with on-demand state for applications |
US10097448B1 (en) | 2014-12-18 | 2018-10-09 | Amazon Technologies, Inc. | Routing mode and point-of-presence selection service |
US10225326B1 (en) | 2015-03-23 | 2019-03-05 | Amazon Technologies, Inc. | Point of presence based data uploading |
US9832141B1 (en) | 2015-05-13 | 2017-11-28 | Amazon Technologies, Inc. | Routing based request correlation |
US10270878B1 (en) | 2015-11-10 | 2019-04-23 | Amazon Technologies, Inc. | Routing for origin-facing points of presence |
US10075551B1 (en) | 2016-06-06 | 2018-09-11 | Amazon Technologies, Inc. | Request management for hierarchical cache |
US10110694B1 (en) | 2016-06-29 | 2018-10-23 | Amazon Technologies, Inc. | Adaptive transfer rate for retrieving content from a server |
US10061652B2 (en) * | 2016-07-26 | 2018-08-28 | Microsoft Technology Licensing, Llc | Fault recovery management in a cloud computing environment |
US10469513B2 (en) | 2016-10-05 | 2019-11-05 | Amazon Technologies, Inc. | Encrypted network addresses |
US10831549B1 (en) | 2016-12-27 | 2020-11-10 | Amazon Technologies, Inc. | Multi-region request-driven code execution system |
US10938884B1 (en) | 2017-01-30 | 2021-03-02 | Amazon Technologies, Inc. | Origin server cloaking using virtual private cloud network environments |
US10379898B2 (en) | 2017-03-24 | 2019-08-13 | International Business Machines Corporation | Virtual machine consolidation |
US11075987B1 (en) | 2017-06-12 | 2021-07-27 | Amazon Technologies, Inc. | Load estimating content delivery network |
US10447648B2 (en) | 2017-06-19 | 2019-10-15 | Amazon Technologies, Inc. | Assignment of a POP to a DNS resolver based on volume of communications over a link between client devices and the POP |
US10409664B2 (en) | 2017-07-27 | 2019-09-10 | International Business Machines Corporation | Optimized incident management using hierarchical clusters of metrics |
US10742593B1 (en) | 2017-09-25 | 2020-08-11 | Amazon Technologies, Inc. | Hybrid content request routing system |
US10558533B2 (en) * | 2017-12-07 | 2020-02-11 | Red Hat, Inc. | Reducing service disruptions in a micro-service environment |
US10592578B1 (en) | 2018-03-07 | 2020-03-17 | Amazon Technologies, Inc. | Predictive content push-enabled content delivery network |
US10931674B2 (en) * | 2018-04-30 | 2021-02-23 | Paypal, Inc. | Detecting whether to implement one or more security measures on a shared resource |
US10909003B2 (en) * | 2018-08-30 | 2021-02-02 | Sap Se | Decommissioning disaster recovery for a cloud based application |
US20200151010A1 (en) * | 2018-11-10 | 2020-05-14 | Nutanix, Inc. | Scheduling of fixed number of non-sharable resources |
US10862852B1 (en) | 2018-11-16 | 2020-12-08 | Amazon Technologies, Inc. | Resolution of domain name requests in heterogeneous network environments |
US11025747B1 (en) | 2018-12-12 | 2021-06-01 | Amazon Technologies, Inc. | Content request pattern-based routing system |
US11500733B2 (en) | 2021-03-19 | 2022-11-15 | International Business Machines Corporation | Volatile database caching in a database accelerator |
US11797570B2 (en) * | 2021-03-19 | 2023-10-24 | International Business Machines Corporation | Asynchronous persistency of replicated data changes in a database accelerator |
US20230325280A1 (en) * | 2022-04-12 | 2023-10-12 | Citrix Systems, Inc. | System and method to predict session failure in virtual applications and desktops deployment |
WO2023198276A1 (en) * | 2022-04-12 | 2023-10-19 | Telefonaktiebolaget Lm Ericsson (Publ) | Handling failure of an application instance |
US11977459B2 (en) * | 2022-06-02 | 2024-05-07 | Rubrik, Inc. | Techniques for accelerated data recovery |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2002108728A (ja) * | 2000-10-02 | 2002-04-12 | Ntt Docomo Inc | 障害情報の掲載方法およびプロバイダ設備 |
JP2003179614A (ja) * | 2002-07-29 | 2003-06-27 | Matsushita Electric Ind Co Ltd | 通信制御装置及び通信制御方法 |
JP2006313399A (ja) * | 2005-05-06 | 2006-11-16 | Fujitsu Ltd | 保守業務支援プログラム |
JP2007041646A (ja) * | 2005-07-29 | 2007-02-15 | Fujitsu Ltd | クライアント−サーバ型システム、並びに、その管理方法および管理プログラム |
JP2009211618A (ja) * | 2008-03-06 | 2009-09-17 | Nec Corp | 障害自動復旧装置 |
Family Cites Families (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP3958710B2 (ja) | 2003-06-06 | 2007-08-15 | 日本電信電話株式会社 | 障害通知方式及び障害通知方法 |
GB2409297A (en) * | 2003-12-16 | 2005-06-22 | Ibm | Method of assessing the impact of the failure of a component on the temporal activity of the services supported by the component |
JP4717923B2 (ja) | 2008-12-17 | 2011-07-06 | 株式会社日立製作所 | ストレージシステム、データ復旧可能時刻の推定値の算出方法、および、管理計算機 |
US7962797B2 (en) * | 2009-03-20 | 2011-06-14 | Microsoft Corporation | Automated health model generation and refinement |
US8887006B2 (en) * | 2011-04-04 | 2014-11-11 | Microsoft Corporation | Proactive failure handling in database services |
US9152487B2 (en) * | 2011-09-30 | 2015-10-06 | Microsoft Technology Licensing, Llc | Service outage details in an error message |
2012
- 2012-08-02 US US13/981,249 patent/US8904242B2/en active Active
- 2012-08-02 WO PCT/JP2012/004906 patent/WO2013035243A1/ja active Application Filing
- 2012-08-02 JP JP2013529882A patent/JP5370624B2/ja not_active Expired - Fee Related
Non-Patent Citations (1)
Title |
---|
"Fujitsu ga Un'yo Kanri Soft no Shin Seihin de 'Saidai 70 % Un'yo Fuka o Keigen suru'", 4 November 2004 (2004-11-04) * |
Cited By (6)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
JP2015230522A (ja) * | 2014-06-03 | 2015-12-21 | Pfuテクニカルコミュニケーションズ株式会社 | 情報処理装置、診断順序決定方法及び制御プログラム |
CN107003926A (zh) * | 2014-12-25 | 2017-08-01 | 歌乐株式会社 | 故障信息提供服务器、故障信息提供方法 |
JP2017045079A (ja) * | 2015-08-24 | 2017-03-02 | 株式会社日立製作所 | クラウド管理方法及びクラウド管理システム |
JP2021086604A (ja) * | 2019-11-29 | 2021-06-03 | ベイジン バイドゥ ネットコム サイエンス アンド テクノロジー カンパニー リミテッド | 異常サーバのサービス処理方法および装置 |
JP7039652B2 (ja) | 2019-11-29 | 2022-03-22 | ベイジン バイドゥ ネットコム サイエンス テクノロジー カンパニー リミテッド | 異常サーバのサービス処理方法および装置 |
US11734057B2 (en) | 2019-11-29 | 2023-08-22 | Beijing Baidu Netcom Science And Technology Co., Ltd. | Method and apparatus for processing a service of an abnormal server |
Also Published As
Publication number | Publication date |
---|---|
JP5370624B2 (ja) | 2013-12-18 |
US20130305083A1 (en) | 2013-11-14 |
JPWO2013035243A1 (ja) | 2015-03-23 |
US8904242B2 (en) | 2014-12-02 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
JP5370624B2 (ja) | Cloud service recovery time prediction system, method and program | |
JP5948257B2 (ja) | 情報処理システム監視装置、監視方法、及び監視プログラム | |
US8296419B1 (en) | Dynamically modifying a cluster of computing nodes used for distributed execution of a program | |
US9135076B2 (en) | Automated capacity aware provisioning | |
EP2561444B1 (en) | Automated recovery and escalation in complex distributed applications | |
US9965262B2 (en) | Application bundle pulling | |
US10389850B2 (en) | Managing redundancy among application bundles | |
US10523518B2 (en) | Application bundle preloading | |
US10152516B2 (en) | Managing staleness latency among application bundles | |
CN102799485B (zh) | 历史数据的迁移方法及装置 | |
US9692654B2 (en) | Systems and methods for correlating derived metrics for system activity | |
US10389794B2 (en) | Managing redundancy among application bundles | |
CN104216763A (zh) | 用于解决在受管基础架构中发生的事件的方法和系统 | |
KR102373144B1 (ko) | 디바이스 관리 서버 및 방법 | |
JP2015103149A (ja) | 管理システムおよび管理システムの制御方法 | |
Xia et al. | A Markov decision process approach for optimal data backup scheduling | |
Abderrahim et al. | The three-dimensional model for dependability integration in cloud computing | |
Birje et al. | Cloud monitoring system: a review | |
Bratosin et al. | A reference model for grid architectures and its analysis | |
JP6502783B2 (ja) | 一括管理システム、一括管理方法およびプログラム | |
Leong et al. | A case study-cost of preemption for urgent computing on supermuc | |
JP2018190205A (ja) | 事業者間一括サービス管理装置および事業者間一括サービス管理方法 | |
US20230214681A1 (en) | Model Decisions Based On Speculative Execution | |
JP4941439B2 (ja) | クラスタシステムにおける性能低下の原因箇所の特定方法、クラスタシステム | |
JP7073766B2 (ja) | 情報処理プログラム、情報処理方法及び情報処理装置 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 12830558 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2013529882 Country of ref document: JP Kind code of ref document: A |
|
WWE | Wipo information: entry into national phase |
Ref document number: 13981249 Country of ref document: US |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 12830558 Country of ref document: EP Kind code of ref document: A1 |