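WO2022168269A1 - Information processing device, information processing method, and information processing program - Google Patents
Information processing device, information processing method, and information processing program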
- Publication number
- WO2022168269A1 (application PCT/JP2021/004347)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- recovery
- monitoring data
- data
- information processing
- monitoring
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/14—Error detection or correction of the data by redundancy in operation
- G06F11/1402—Saving, restoring, recovering or retrying
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/079—Root cause analysis, i.e. error or fault diagnosis
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0793—Remedial or corrective actions
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
- G06F11/302—Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a software system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F2201/00—Indexing scheme relating to error detection, to error correction, and to monitoring
- G06F2201/865—Monitoring of software
Definitions
- the present invention relates to an information processing device, an information processing method, and an information processing program.
- there are microservices, which provide a service by having predetermined components, among multiple hierarchically distributed components, cooperate in a prescribed order.
- in microservices, health checks are executed for each component; based on the health check results, each component is monitored as normal or abnormal, and recovery work is performed on abnormal components that return unexpected results.
- the recovery work of microservices is fluid, for example in that the components and resources used can be changed, so various recovery methods are conceivable.
- recovery work must be defined in advance, but it is difficult to define recovery work appropriately, with its many options, while covering the various abnormality patterns. Therefore, it is necessary to accumulate know-how by actually performing recovery work during service provision, and to mature the knowledge about error patterns and recovery work.
- as described in Non-Patent Document 1, there is a technology called chaos engineering that randomly generates the various failures expected in an actual service and continuously performs recovery work for each failure.
- the present invention has been made in view of the above circumstances, and an object of the present invention is to provide a technology that turns the state transition from an abnormal state to a normal state caused by recovery work into explicit knowledge (know-how), without requiring much effort from maintenance personnel, so that a recovery policy can be formulated when a failure occurs.
- An information processing apparatus includes a learning unit that recognizes the pattern of the recovery work content of each of a plurality of recovery tasks for a service of an application program, groups the plurality of recovery tasks by recovery work content pattern to form a plurality of recovery work groups, recognizes the pattern of the monitoring content of each of a plurality of monitoring data relating to the service monitored immediately before and immediately after each of the plurality of recovery tasks, and groups the plurality of monitoring data by monitoring content pattern to form a plurality of monitoring data groups.
- An information processing method is an information processing method performed by an information processing apparatus, in which the pattern of the recovery work content of each of a plurality of recovery tasks for a service of an application program is recognized, the plurality of recovery tasks are grouped by recovery work content pattern to form a plurality of recovery work groups, the pattern of the monitoring content of each of a plurality of monitoring data relating to the service monitored immediately before and immediately after each of the plurality of recovery tasks is recognized, and the plurality of monitoring data are grouped by monitoring content pattern to form a plurality of monitoring data groups.
- An information processing program causes a computer to function as the information processing apparatus.
- according to the present invention, the state transition from an abnormal state to a normal state caused by recovery work can be made explicit knowledge (know-how) without requiring much effort from maintenance personnel, and a technique can be provided that can formulate a recovery policy in the event of a failure or the like.
- FIG. 1 is a diagram showing the configuration of a service providing system.
- FIG. 2 is a diagram showing the functional block configuration of the service recovery method planning device.
- FIG. 3 is a diagram showing a process of saving recovery action data.
- FIG. 4 is a diagram showing a specific example of recovery action data.
- FIG. 5 is a diagram showing a saving process of monitoring data.
- FIG. 6 is a diagram showing learning processing of monitoring data and recovery action data.
- FIG. 7 is a diagram showing specific examples of monitoring data and recovery action data.
- FIG. 8 is a diagram showing an example of grouping of monitoring data and recovery action data.
- FIG. 9 is a diagram showing a specific example of learning result data.
- FIG. 10 is a diagram showing the recovery method determination process.
- FIG. 11 is a diagram showing an example of determining a recovery method.
- FIG. 12 is a diagram showing a display example of the recovery method.
- FIG. 13 is a diagram showing hardware functions of the service recovery method planning device.
- the present invention continuously accumulates cases of service failures and recovery work. When a sufficient number of cases have accumulated, each service failure and each recovery work is pattern-recognized and grouped, and the service failures are associated with each other through the recovery work and learned in advance. Then, using the learning result, the recovery work suitable for a service failure that has occurred, that is, the recovery action corresponding to the failure pattern, is presented to the maintenance person.
- pattern recognition is performed for each of a plurality of recovery operations and monitoring data immediately before and after each recovery operation, and each pattern is grouped. Therefore, it is possible to grasp normal and abnormal state transitions between the grouped monitoring data groups. As a result, the state transition from the abnormal state to the normal state by the recovery work can be formalized as know-how, and a recovery policy can be formulated in the event of a failure, etc., without requiring a great deal of effort by the maintenance person.
- the present invention learns by associating the monitoring data groups of a plurality of monitoring data groups via the recovery work group based on the fact that immediately preceding monitoring data transitions to immediately following monitoring data due to recovery work. State transitions from an abnormal state to a normal state can be clearly expressed as know-how, and a recovery policy can be quickly formulated in the event of a failure or the like.
- FIG. 1 is a diagram showing the configuration of a service providing system 1.
- the service providing system 1 includes a development device 11 , an execution unit 12 , a monitoring unit 13 , a distribution unit 14 , an analysis unit 15 , a service recovery method formulation device 16 and a management unit 17 .
- the development device 11 is a development environment device for a program developer to develop an application program.
- the development device 11 transmits an application program created by a program developer, some function programs, service update information, and the like to the execution unit 12 and the analysis unit 15 .
- the execution unit 12 is a functional unit that executes an application program installed in itself and provides the user with services executed by the application program.
- a service is, for example, a microservice that provides a service by causing predetermined components among a plurality of hierarchized and distributed components to cooperate in a specified order.
- the monitoring unit 13 is a functional unit that performs application monitoring, which periodically monitors the operation of the application program being executed by the execution unit 12, and stores the service operation information of the application program obtained by the application monitoring as monitoring data.
- the monitoring unit 13 is also a functional unit that performs resource monitoring, which periodically monitors the resources (physical server, virtual server, container, host, CPU, disk, memory, etc.) of the execution unit 12, and saves the metrics information (CPU usage, memory usage, etc.) obtained by the resource monitoring as monitoring data.
- the distribution unit 14 is a functional unit that acquires monitoring data from the monitoring unit 13 and transmits the monitoring data to the analysis unit 15 and the service recovery method formulation device 16 .
- the analysis unit 15 is a functional unit that uses an existing method, together with the function programs, service update information, etc. transmitted from the development device 11, to analyze whether the monitoring data transmitted from the distribution unit 14 is normal or abnormal, and transmits the normal or abnormal analysis result data to the service recovery method formulation device 16 and the maintenance person.
- the service recovery method formulation device (information processing device) 16 is a device that learns by associating the monitoring data transmitted from the distribution unit 14, and the normal or abnormal analysis result data for that monitoring data transmitted from the analysis unit 15, with the failure case and coping method data, acquired from the management unit 17, that describe the measures taken in the past for abnormal monitoring data.
- the service recovery method formulation device 16 is also a device that uses the resulting learning result data to present the maintenance person with a recovery method for dealing with service failures that occur in the future.
- the management unit 17 is a functional unit that saves the recovery work entered by the maintenance person when a service failure occurs and the time stamps of the start and completion of the recovery work as failure case and coping method data.
- FIG. 2 is a diagram showing the functional block configuration of the service recovery method formulation device 16.
- the service recovery method formulation device 16 includes a recovery work data extraction unit 161, a recovery work data time-series storage unit 162, a monitoring data reception unit 163, a monitoring data time-series storage unit 164, a recovery method learning unit 165, a recovery method determination unit 166, and a recovery method output unit 167.
- the recovery work data extraction unit 161 is a functional unit that acquires failure case/coping method data from the management unit 17 and extracts, from the failure case/coping method data, an expression that characterizes the content of the recovery work (hereinafter referred to as a recovery action).
- the recovery work data time-series storage unit 162 is a functional unit that saves a plurality of pieces of recovery action data in chronological order based on the time stamps of the work start and work completion of the recovery work.
- the monitoring data receiving unit 163 is a functional unit that receives monitoring data from the distribution unit 14 and receives normal or abnormal analysis results for the monitoring data from the analysis unit 15 .
- the monitoring data time-series storage unit 164 is a functional unit that stores a plurality of pieces of monitoring data in time series based on the time stamps of the monitoring data.
- a recovery method learning unit (learning unit) 165 is a functional unit that, once sufficient recovery action data and monitoring data have accumulated, acquires a plurality of pieces of recovery action data from the recovery work data time-series storage unit 162, acquires a plurality of pieces of monitoring data from the monitoring data time-series storage unit 164, learns by associating the monitoring data with the recovery action data, and saves the resulting learning result data.
- specifically, the recovery method learning unit 165 recognizes the pattern of the content of each of a plurality of recovery actions for the service of the application program and groups the plurality of recovery actions by recovery action content pattern to form a plurality of recovery action groups. It also recognizes the pattern of the monitoring content of each of the plurality of monitoring data related to the service monitored immediately before and immediately after each recovery action, and groups the plurality of monitoring data by monitoring content pattern to form a plurality of monitoring data groups.
- the recovery method learning unit 165 also has a function to generate and save learning result data in which the monitoring data groups are associated with each other via the recovery action groups, such that the immediately preceding monitoring data transitions to the immediately following monitoring data through the recovery action.
- a recovery method determination unit (determination unit) 166 is a functional unit that receives from the analysis unit 15 the analysis result indicating whether the monitoring data is normal or abnormal, acquires from the monitoring data receiving unit 163 the abnormal monitoring data whose analysis result is abnormal, and uses the learning result data of the recovery method learning unit 165 to determine, as the recovery method, the recovery action data corresponding to the service failure related to the abnormal monitoring data.
- specifically, for abnormal monitoring data analyzed to be in an abnormal state, the recovery method determination unit 166 searches the learning result data for the monitoring data group that matches the abnormal monitoring data, searches for one or more routes that transition from that monitoring data group to a monitoring data group in which normal monitoring data is grouped, and determines the recovery actions of the recovery action groups on the routes found as the recovery method.
- the recovery method output unit 167 is a functional unit that outputs the recovery method determined by the recovery method determination unit 166 to the display of the maintenance person's terminal device, a printer, or the like.
- FIG. 3 is a diagram showing a process of saving recovery action data.
- Step S101 First, the recovery work data extraction unit 161 acquires failure case/coping method data from the management unit 17.
- Step S102 Next, the recovery work data extraction unit 161 extracts recovery action data that characterizes the content of the recovery work from the acquired failure case/coping method data. Since the failure case/coping method data stored in the management unit 17 may have been entered in various formats, this step absorbs the format differences between the failure case/coping method data and extracts only the necessary recovery action data.
- Recovery action data is, for example, an action name that indicates the type of recovery action and a variable that indicates the target of the recovery action.
- a specific example of recovery action data is shown in FIG.
- Action names include, for example, (1) migration, which moves a container or virtual machine to another host; (2) scale-out, which adds containers, virtual machines, hosts, etc.; (3) scale-up, which increases the resources of a container, virtual machine, host, etc.; (4) load balancing, which reallocates processing from overloaded containers, virtual machines, or hosts to less loaded ones; and (5) restart, which restarts containers, virtual machines, hosts, etc.
- for migration, for example, the variables are the migration target (type (container or virtual machine), component name, IP address), the location before migration (IP address, resource name), and the location after migration.
- Each variable related to other action names is as shown in FIG.
- Step S103 Next, the recovery work data extraction unit 161 passes the extracted recovery action data (action name, variable) to the recovery work data time-series storage unit 162 .
- Step S104 Finally, the recovery work data time-series storage unit 162 stores the passed recovery action data in chronological order, together with the work time, based on the time stamps of the work start and work completion of the recovery action.
- a plurality of pieces of recovery action data (action name, variable, work time) are thus stored in chronological order in the recovery work data time-series storage unit 162.
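Steps S101 to S104 can be illustrated with a minimal Python sketch. The record layout, the input keys (`action`, `action_name`, `variables`, `start`, `end`), and the class and function names are all assumptions made for illustration; the source does not specify a concrete data format.

```python
import bisect
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class RecoveryAction:
    """Hypothetical recovery action record: action name, variables, timestamps."""
    action_name: str       # e.g. "migration", "scale_out", "restart"
    variables: dict        # target of the action, e.g. {"target": "web"}
    started_at: datetime
    completed_at: datetime

    @property
    def work_time(self) -> timedelta:
        # Work time derived from the start and completion time stamps
        return self.completed_at - self.started_at


def extract_recovery_action(case: dict) -> RecoveryAction:
    """Step S102 (sketch): absorb format differences and keep only action data."""
    # Assumed format difference: the action may be stored under either key
    name = case.get("action_name") or case["action"]
    return RecoveryAction(
        action_name=name.strip().lower().replace("-", "_").replace(" ", "_"),
        variables=dict(case.get("variables", {})),
        started_at=datetime.fromisoformat(case["start"]),
        completed_at=datetime.fromisoformat(case["end"]),
    )


class RecoveryActionStore:
    """Step S104 (sketch): keep records in chronological order of work start."""

    def __init__(self):
        self._records = []

    def save(self, record: RecoveryAction) -> None:
        keys = [r.started_at for r in self._records]
        self._records.insert(bisect.bisect_right(keys, record.started_at), record)

    def all(self) -> list:
        return list(self._records)
```

Records saved out of order are still returned sorted by work start time, and `work_time` gives the duration later used as a cost.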
- FIG. 5 is a diagram showing a saving process of monitoring data.
- Step S201 First, the monitoring data receiving unit 163 receives monitoring data (service operation information, metrics information) from the distribution unit 14 .
- Step S202 Next, the monitoring data receiving unit 163 receives the normal or abnormal analysis result of the monitoring data from the analysis unit 15 .
- Step S203 the monitoring data receiving unit 163 adds normal or abnormal labeling information to the monitoring data received from the distribution unit 14 based on the received normality or abnormality analysis result.
- Step S204 Next, the monitoring data receiving unit 163 passes the monitoring data to which the normal or abnormal labeling information is added to the monitoring data time-series storage unit 164 .
- Step S205 Finally, the monitoring data time-series storage unit 164 stores the transferred monitoring data in time series based on the time stamp of the monitoring data.
- the monitoring data time-series storage unit 164 stores a plurality of pieces of monitoring data (service operation information, metrics information, normal or abnormal labeling information) in time series.
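The labeling and time-series storage of steps S201 to S205 might look like the following sketch. The `MonitoringData` fields and the lookup helpers `latest_before` and `earliest_after` are hypothetical names introduced here; the helpers illustrate how the monitoring data "immediately before" and "immediately after" a recovery action could later be retrieved.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class MonitoringData:
    """Hypothetical monitoring record (service operation info plus metrics)."""
    timestamp: datetime
    service_info: dict
    metrics: dict
    label: Optional[str] = None  # "normal" or "abnormal", set in step S203


def label_monitoring_data(sample: MonitoringData, is_abnormal: bool) -> MonitoringData:
    """Step S203 (sketch): attach the analysis result as labeling information."""
    sample.label = "abnormal" if is_abnormal else "normal"
    return sample


class MonitoringDataStore:
    """Step S205 (sketch): keep labeled samples ordered by time stamp."""

    def __init__(self):
        self._samples = []

    def save(self, sample: MonitoringData) -> None:
        self._samples.append(sample)
        self._samples.sort(key=lambda s: s.timestamp)

    def all(self) -> list:
        return list(self._samples)

    def latest_before(self, t: datetime) -> Optional[MonitoringData]:
        # Most recent sample at or before time t
        candidates = [s for s in self._samples if s.timestamp <= t]
        return candidates[-1] if candidates else None

    def earliest_after(self, t: datetime) -> Optional[MonitoringData]:
        # First sample at or after time t
        candidates = [s for s in self._samples if s.timestamp >= t]
        return candidates[0] if candidates else None
```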
- FIG. 6 is a diagram showing learning processing of monitoring data and recovery action data.
- Recovery method learning unit 165 executes the following processes after sufficient recovery action data and monitoring data are accumulated.
- Step S301 First, the recovery method learning unit 165 acquires time-series data of a plurality of pieces of recovery action data (action name, variable, work time) from the recovery work data time-series storage unit 162 .
- the learning method will be described below.
- the recovery method learning unit 165 saves, as "result data", each combination of recovery action data, the monitoring data immediately before the recovery action of that recovery action data occurred, and the monitoring data immediately after that recovery action was completed.
- the recovery method learning unit 165 performs pattern recognition to identify the recovery action patterns (migration, scale-out, scale-up, load balancing, restart, etc.), and performs grouping to classify the plurality of recovery action data by recovery action pattern.
- grouping is shown in FIG.
- the recovery method learning unit 165 performs pattern recognition to identify the monitoring data patterns (content of the service operation information, content of the metrics information (CPU usage rate, etc.), normal or abnormal labeling information, etc.), and performs grouping to classify the plurality of monitoring data by monitoring data pattern.
- a general clustering method etc. is used according to the format of recovery action data and monitoring data.
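As a rough illustration of the grouping step, the sketch below groups recovery actions by action name and groups monitoring data by a simple pattern key. The source only says that a general clustering method is used; the bucketing rule here (label plus CPU usage rounded down to a 25-point bucket) is a deliberately simple stand-in, and the dictionary keys `action_name`, `label`, and `cpu` are assumed for illustration.

```python
from collections import defaultdict


def group_recovery_actions(actions):
    """Group recovery action records by their action name (the assumed pattern)."""
    groups = defaultdict(list)
    for action in actions:
        groups[action["action_name"]].append(action)
    return dict(groups)


def monitoring_pattern(sample, cpu_bucket=25):
    """Hypothetical pattern key: label plus CPU usage rounded down to a bucket."""
    return (sample["label"], int(sample["cpu"] // cpu_bucket) * cpu_bucket)


def group_monitoring_data(samples):
    """Group monitoring samples that share the same pattern key."""
    groups = defaultdict(list)
    for sample in samples:
        groups[monitoring_pattern(sample)].append(sample)
    return dict(groups)
```

In practice the pattern key would be replaced by the cluster assignment of whatever clustering method suits the data format.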
- the recovery method learning unit 165 grasps, from the "result data", the transition relation from the immediately preceding monitoring data to the immediately following monitoring data caused by each recovery action and, based on the transition relations between the grouped monitoring data groups, connects the monitoring data groups with arrow lines so that the transition source and the transition destination can be identified.
- for example, the monitoring data included in monitoring data group 1 transitions to the monitoring data included in monitoring data group 4 via recovery action group 2.
- in this way, a directed graph is generated in which each "recovery action group" is a transition arc and each "monitoring data group" is a node.
- Step S304 Finally, the recovery method learning unit 165 saves the generated directed graph as learning result data.
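The construction of the directed graph from the "result data" can be sketched as follows. The tuple layout `(before_group, action_group, after_group, work_minutes)` and the adjacency-list representation are assumptions for illustration; averaging the work times per arc is one possible way to derive the arc weight, not necessarily the one the source intends.

```python
from collections import defaultdict
from statistics import mean


def build_transition_graph(results):
    """Build {source_group: [(destination_group, action_group, weight), ...]}.

    results: iterable of (before_group, action_group, after_group, work_minutes),
    one entry per observed recovery action (hypothetical layout).
    """
    durations = defaultdict(list)
    for before, action, after, minutes in results:
        durations[(before, action, after)].append(minutes)

    graph = defaultdict(list)
    for (before, action, after), minutes in durations.items():
        # Arc weight: average work time of the actions observed on this arc
        graph[before].append((after, action, mean(minutes)))
    return dict(graph)
```

Monitoring data groups become nodes and recovery action groups become the labeled, weighted arcs between them.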
- the learning result data is used when determining a recovery method for service failures that will occur in the future.
- the learning result data is a generalized directed graph of actual know-how for determining a recovery method according to a service failure that will occur in the future. Determining a recovery method (that is, determining a path) becomes a problem of finding a path from an abnormal state monitoring data group to a normal state monitoring data group.
- the transition arcs that form a path are always linked to recovery action groups. The work time of the recovery actions included in a recovery action group, the total number of recovery action groups, etc. are defined as the cost (weight), and this cost is used to calculate the cost of the entire route.
- V is the set of nodes, that is, the set of monitoring data groups u.
- E is the set of transition arcs.
- each transition arc in E always has a recovery action group.
- the cost of a recovery action group is expressed by giving the transition arc a weight w, such as the work time.
- a recovery method is determined by searching for a path from a starting monitoring data group u1 ∈ V to a monitoring data group u2 ∈ V in a normal state, and the weights w of all transition arcs forming the found path are added up to estimate the cost of the entire route.
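The route cost described above can be written compactly. As a sketch, assume a candidate route P is a sequence of transition arcs from u1 to u2; the symbols P, e_i, and k are notation introduced here for illustration, while V, E, u, and w follow the text.

```latex
\[
P = (e_1, e_2, \dots, e_k), \quad e_i \in E, \qquad
\operatorname{cost}(P) = \sum_{i=1}^{k} w(e_i)
\]
```

With work time as the weight, a route consisting of a single arc with w = 30 minutes has cost 30, and a route whose two arcs total 35 minutes has cost 35.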
- the recovery method determination process will be described below.
- FIG. 10 is a diagram showing the recovery method determination process.
- Step S401 First, the distribution unit 14 transmits the monitoring data acquired from the monitoring unit 13 to the analysis unit 15 and the monitoring data receiving unit 163 .
- Step S402 Next, the analysis unit 15 analyzes whether the transmitted monitoring data is normal or abnormal.
- Step S403 the analyzing unit 15 transmits the analyzed normal or abnormal analysis result data to the monitoring data receiving unit 163 and the recovery method determining unit 166 .
- the monitoring data receiving unit 163 stores the monitoring data, with the normal or abnormal labeling information attached, in the monitoring data time-series storage unit 164, and the recovery method learning unit 165 uses the monitoring data and the past recovery actions for that monitoring data to generate (update) the learning result data. The method of generating the learning result data is as already explained.
- Step S405 Next, the recovery method determination unit 166 acquires learning result data from the recovery method learning unit 165 .
- Step S406 Next, the recovery method determination unit 166 uses the acquired learning result data to determine, as a recovery method, the recovery action corresponding to the service failure related to the acquired abnormal monitoring data.
- a method for determining the recovery method will be described.
- the appropriate recovery action is determined to recover from the service failure.
- the pre-generated learning result data and the monitoring data at the time of service failure are collated to evaluate the cost on the route and derive a recovery action plan.
- the recovery method determination unit 166 performs pattern recognition to grasp the monitoring data pattern of the acquired abnormal monitoring data, and finds which of the plurality of monitoring data groups in the learning result data the abnormal monitoring data best fits. In the example of FIG. 11, monitoring data group 1 is found as the monitoring data group most similar to the abnormal monitoring data.
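One simple way to find the best-fitting monitoring data group is a nearest-centroid lookup, sketched below. The source does not say how similarity is measured; the Euclidean distance, the per-group centroids, and the metric names `cpu` and `mem` are all assumptions chosen for illustration.

```python
import math


def nearest_group(sample, centroids):
    """Return the name of the group whose centroid is closest to the sample.

    sample:    {metric_name: value} for the abnormal monitoring data
    centroids: {group_name: {metric_name: value}} (hypothetical summary of each group)
    """
    def distance(centroid):
        # Euclidean distance over the metrics defined for the centroid
        return math.sqrt(sum((sample[k] - centroid[k]) ** 2 for k in centroid))

    return min(centroids, key=lambda name: distance(centroids[name]))
```

For example, a sample with high CPU and memory usage would be matched to a group whose centroid also has high CPU and memory usage.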
- the recovery method determination unit 166 searches for all routes from the found monitoring data group to a monitoring data group in the normal state.
- two routes from monitoring data group 1 in the abnormal state to monitoring data group 4 in the normal state are found: route 1 via recovery action group 2, and route 2 via recovery action group 1 and recovery action group 3.
- the recovery method determination unit 166 sorts all the retrieved routes in ascending order of cost (work time) and determines the recovery method. For example, if the work time of the recovery actions included in recovery action group 2 on route 1 is 30 minutes, and the total work time of the recovery actions included in recovery action group 1 and recovery action group 3 on route 2 is 35 minutes, the routes are sorted in the order route 1, route 2. One path corresponds to one recovery method, and the sequence of all recovery action groups included in a path is its recovery procedure.
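The route search and cost-based sorting can be sketched with a depth-first enumeration of simple paths. The graph representation and the group names (`M1`, `M2`, `M4`, `A1` to `A3`) are hypothetical; the 30-minute and 35-minute totals follow the example in the text, while the split of route 2 into 20 and 15 minutes is assumed.

```python
def find_recovery_routes(graph, start, normal_groups):
    """Enumerate all simple routes from start to any normal group, cheapest first.

    graph: {source_group: [(destination_group, action_group, weight), ...]}
    Returns a list of (total_cost, [action_group, ...]) sorted by ascending cost.
    """
    routes = []

    def dfs(node, visited, actions, cost):
        if node in normal_groups:
            routes.append((cost, actions))
            return
        for nxt, action, weight in graph.get(node, []):
            if nxt not in visited:  # avoid cycles
                dfs(nxt, visited | {nxt}, actions + [action], cost + weight)

    dfs(start, {start}, [], 0)
    return sorted(routes, key=lambda route: route[0])


# Example modeled on the two routes described above
graph = {
    "M1": [("M4", "A2", 30), ("M2", "A1", 20)],
    "M2": [("M4", "A3", 15)],
}
ranked = find_recovery_routes(graph, "M1", {"M4"})
```

Here `ranked` lists the single-action route (30 minutes) before the two-action route (35 minutes), matching the order route 1, route 2.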
- Step S407 the recovery method determination unit 166 passes recovery method data including all retrieved recovery methods (one or more paths) to the recovery method output unit 167 .
- Step S408 Finally, the recovery method output unit 167 displays each recovery method included in the passed recovery method data, together with its recovery procedure, on the display of the maintenance person's terminal device in ascending order of cost (work time).
- for example, route 1 shown in FIG. 11 is displayed as the first recovery method, with "only the recovery actions of recovery action group 2" as its recovery procedure.
- route 2, which has a higher cost than route 1, is displayed as the second recovery method, with "recovery action of recovery action group 1 → recovery action of recovery action group 3" as its recovery procedure.
- the estimated work completion time is, for example, the average work time of all the recovery actions included in the recovery action group.
- the past recovery work corresponding to each recovery action is provided as link destination information pointing to the detailed recovery work record.
- in step S406, the work time is used as the cost, and the display order of the recovery methods is determined by the length of the work time.
- the total number of recovery action groups on the route may be used as the cost, and the order of display may be determined based on the size of the total number of recovery action groups.
- the total number of recovery action groups for route 1 is one
- the total number of recovery action groups for route 2 is two.
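The choice between the two cost definitions can be sketched as a small ranking helper. The route dictionaries, the key names, and the `cost` parameter values are hypothetical; the example routes are made up so that the two cost definitions produce different orders.

```python
def rank_routes(routes, cost="work_time"):
    """Sort routes ascending by the chosen cost.

    routes: list of {"actions": [action_group, ...], "work_time": minutes}
    cost:   "work_time" (total work time) or "group_count" (number of action groups)
    """
    if cost == "work_time":
        key = lambda route: route["work_time"]
    elif cost == "group_count":
        key = lambda route: len(route["actions"])
    else:
        raise ValueError(f"unknown cost: {cost}")
    return sorted(routes, key=key)


# Hypothetical routes where the two cost definitions disagree
route_a = {"actions": ["A2"], "work_time": 40}
route_b = {"actions": ["A1", "A3"], "work_time": 35}
by_time = rank_routes([route_a, route_b], cost="work_time")
by_count = rank_routes([route_a, route_b], cost="group_count")
```

Ranking by work time puts `route_b` first, while ranking by the number of recovery action groups puts `route_a` first, so the choice of cost changes which recovery method is presented at the top.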
- the recovery method learning unit 165 recognizes the pattern of each recovery action content for a plurality of recovery actions for the service of the application program, and groups the plurality of recovery actions for each recovery action content pattern.
- a plurality of recovery action groups are formed by using a plurality of recovery actions, and the plurality of monitoring data related to the above services monitored immediately before and after each of the plurality of recovery actions are respectively performed. Since a plurality of monitoring data groups are formed by grouping for each monitoring content pattern, it is possible to grasp normal and abnormal state transitions between the grouped monitoring data groups. Also, the state transition from abnormal state to normal state by recovery work can be formalized as know-how, and a recovery policy can be formulated in the event of a failure.
- the recovery method learning unit 165 restores the monitoring data groups of the plurality of monitoring data groups via the recovery action group so that the immediately preceding monitoring data transitions to the immediately following monitoring data by the recovery action. Since the learning result data associated with each other is generated, the state transition from the abnormal state to the normal state can be clearly expressed as know-how, and a recovery policy can be quickly formulated in the event of a failure or the like.
- the service recovery method formulation device 16 of the present embodiment described above can be realized using a general-purpose computer system that includes a CPU 901, a memory 902, a storage 903, a communication device 904, an input device 905, and an output device 906. The memory 902 and the storage 903 are storage devices.
- each function of the service recovery method formulation device 16 is realized by the CPU 901 executing a predetermined program loaded on the memory 902.
- the service recovery method formulation device 16 may be implemented by one computer.
- the service recovery method formulation device 16 may be implemented with multiple computers.
- the service recovery method formulation device 16 may be a virtual machine implemented on a computer.
- a program for the service recovery method formulation device 16 can be stored in computer-readable recording media such as HDD, SSD, USB memory, CD, and DVD.
- the program for the service recovery method formulation device 16 can also be distributed via a communication network.
- Reference signs: 1: service providing system; 11: development device; 12: execution unit; 13: monitoring unit; 14: distribution unit; 15: analysis unit; 16: service recovery method formulation device; 17: management unit; 161: recovery work data extraction unit; 162: recovery work data time-series storage unit; 163: monitoring data reception unit; 164: monitoring data time-series storage unit; 165: recovery method learning unit; 166: recovery method determination unit; 167: recovery method output unit; 901: CPU; 902: memory; 903: storage; 904: communication device; 905: input device; 906: output device
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Quality & Reliability (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Debugging And Monitoring (AREA)
Abstract
Description
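The present invention continuously accumulates cases of service failures and recovery work. Once enough cases have accumulated, it recognizes the patterns of the service failures and of the recovery work, groups each by pattern, and learns in advance by associating service failures with one another via recovery work. It then uses the learning result to present to the maintainer the recovery work suited to a service failure that has occurred, that is, the recovery action corresponding to the failure pattern.
FIG. 1 is a diagram showing the configuration of the service providing system 1. The service providing system 1 includes a development device 11, an execution unit 12, a monitoring unit 13, a distribution unit 14, an analysis unit 15, a service recovery method formulation device 16, and a management unit 17.
FIG. 2 is a diagram showing the functional block configuration of the service recovery method formulation device 16. The service recovery method formulation device 16 includes a recovery work data extraction unit 161, a recovery work data time-series storage unit 162, a monitoring data reception unit 163, a monitoring data time-series storage unit 164, a recovery method learning unit 165, a recovery method determination unit 166, and a recovery method output unit 167.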
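[Recovery action data storage process]
FIG. 3 is a diagram showing the recovery action data storage process.
First, the recovery work data extraction unit 161 acquires failure case and coping method data from the management unit 17.
Next, the recovery work data extraction unit 161 extracts, from the acquired failure case and coping method data, recovery action data that characterizes the content of the recovery work. Because the failure case and coping method data stored in the management unit 17 is likely to have been entered in various formats, this step absorbs the format differences among the data and extracts only the necessary recovery action data.
Next, the recovery work data extraction unit 161 passes the extracted recovery action data (action name, variables) to the recovery work data time-series storage unit 162.
Finally, the recovery work data time-series storage unit 162 stores the received recovery action data in time series, together with the work time, based on the work-start and work-completion timestamps of the recovery action.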
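The storage step above can be illustrated with a minimal Python sketch. The class names `RecoveryAction` and `RecoveryActionStore`, the field names, and the sample action names are hypothetical and do not come from the patent; the sketch only assumes that each recovery action carries an action name, variables, and work-start and work-completion timestamps, and that the work time is their difference.

```python
from bisect import insort
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass(order=True)
class RecoveryAction:
    # Only `started` takes part in comparisons, so sorting is by start time.
    started: datetime
    finished: datetime = field(compare=False)
    name: str = field(compare=False)
    variables: dict = field(compare=False, default_factory=dict)

    @property
    def work_time(self) -> timedelta:
        # Work time derived from the start and completion timestamps.
        return self.finished - self.started


class RecoveryActionStore:
    """Hypothetical stand-in for the time-series storage unit 162."""

    def __init__(self) -> None:
        self._records: list[RecoveryAction] = []

    def save(self, action: RecoveryAction) -> None:
        # insort keeps the list ordered by start time on every insert.
        insort(self._records, action)

    def all(self) -> list[RecoveryAction]:
        return list(self._records)
```

Keeping the list sorted at insert time means later steps can read the records as a time series without sorting again; an actual implementation might equally rely on a time-series database.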
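FIG. 5 is a diagram showing the monitoring data storage process.
First, the monitoring data reception unit 163 receives monitoring data (service operation information, metrics information) from the distribution unit 14.
Next, the monitoring data reception unit 163 receives from the analysis unit 15 the analysis result, normal or abnormal, for that monitoring data.
Next, based on the received analysis result, the monitoring data reception unit 163 attaches normal or abnormal labeling information to the monitoring data received from the distribution unit 14.
Next, the monitoring data reception unit 163 passes the labeled monitoring data to the monitoring data time-series storage unit 164.
Finally, the monitoring data time-series storage unit 164 stores the received monitoring data in time series, based on the timestamp of the monitoring data.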
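As a rough illustration of the labeling and storage steps, the following Python sketch attaches a normal or abnormal label to a monitoring record and keeps the records in timestamp order. The `MonitoringRecord` type, its field names, and both helper functions are assumptions made for this example, not names used in the patent.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional


@dataclass
class MonitoringRecord:
    timestamp: datetime
    service_info: dict            # service operation information
    metrics: dict                 # metrics information
    label: Optional[str] = None   # "normal" or "abnormal" once analyzed


def attach_label(record: MonitoringRecord, is_abnormal: bool) -> MonitoringRecord:
    # The analysis result is assumed to arrive as a boolean for this record.
    record.label = "abnormal" if is_abnormal else "normal"
    return record


def save_in_time_order(store: list, record: MonitoringRecord) -> None:
    # Append, then re-sort by timestamp so the store stays a time series.
    store.append(record)
    store.sort(key=lambda r: r.timestamp)
```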
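FIG. 6 is a diagram showing the learning process for monitoring data and recovery action data. The recovery method learning unit 165 executes the following processing after sufficient recovery action data and monitoring data have accumulated.
First, the recovery method learning unit 165 acquires time-series data of a plurality of recovery action data (action name, variables, work time) from the recovery work data time-series storage unit 162.
Next, the recovery method learning unit 165 acquires time-series data of a plurality of monitoring data (service operation information, metrics information, normal or abnormal labeling information) from the monitoring data time-series storage unit 164.
Next, the recovery method learning unit 165 uses the acquired time-series data of the recovery action data and of the monitoring data to learn by associating the plurality of monitoring data with the plurality of recovery action data. The learning method is described below.
Finally, the recovery method learning unit 165 stores the generated directed graph as learning result data. The learning result data is used when determining a recovery method for a service failure that occurs in the future.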
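The details of the learning method are not reproduced in this text, so the following Python sketch is only one possible reading of the overall idea: monitoring data groups become nodes, and each recovery action group becomes a directed edge from the group observed immediately before the action to the group observed immediately after it. The function `pattern_of` is a deliberately crude, hypothetical stand-in for pattern recognition (label plus a CPU threshold), the action name is used as the recovery action group, and the dictionary keys (`t`, `start`, `end`, `cpu`) are assumptions for illustration.

```python
from collections import defaultdict


def pattern_of(record):
    """Hypothetical pattern recognition: label plus a coarse CPU bucket."""
    label, cpu = record["label"], record["metrics"]["cpu"]
    return (label, "high_cpu" if cpu >= 80 else "low_cpu")


def learn_graph(actions, records):
    """Return {before_group: {(action_group, after_group): [work_time, ...]}}."""
    records = sorted(records, key=lambda r: r["t"])
    graph = defaultdict(lambda: defaultdict(list))
    for a in actions:
        # Monitoring data observed up to the start and from the end of the action.
        before = [r for r in records if r["t"] <= a["start"]]
        after = [r for r in records if r["t"] >= a["end"]]
        if not before or not after:
            continue  # no monitoring data on one side; skip this action
        src, dst = pattern_of(before[-1]), pattern_of(after[0])
        # The action name serves as the recovery action group in this sketch.
        graph[src][(a["name"], dst)].append(a["end"] - a["start"])
    return graph
```

Storing the observed work times on each edge keeps the information needed later to rank routes by cost.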
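Next, the method of determining a recovery method for a service failure that occurs in the future is described.
First, the distribution unit 14 transmits the monitoring data acquired from the monitoring unit 13 to the analysis unit 15 and the monitoring data reception unit 163.
Next, the analysis unit 15 analyzes whether the transmitted monitoring data is normal or abnormal.
Next, the analysis unit 15 transmits the resulting normal or abnormal analysis result data to the monitoring data reception unit 163 and the recovery method determination unit 166. The monitoring data reception unit 163 then stores the monitoring data, with normal or abnormal labeling information attached, in the monitoring data time-series storage unit 164, and the recovery method learning unit 165 generates (updates) the learning result data using the monitoring data and the past recovery actions for that monitoring data. The learning result data is generated as already described.
Next, the recovery method determination unit 166 acquires, from the monitoring data reception unit 163, the abnormal monitoring data, that is, the monitoring data whose analysis result in the transmitted analysis result data is abnormal.
Next, the recovery method determination unit 166 acquires the learning result data from the recovery method learning unit 165.
Next, the recovery method determination unit 166 uses the acquired learning result data to determine, as the recovery method, the recovery work corresponding to the service failure associated with the acquired abnormal monitoring data. The method of determining the recovery method is described below.
Next, the recovery method determination unit 166 passes recovery method data containing all the recovery methods found by the search (one or more routes) to the recovery method output unit 167.
Finally, the recovery method output unit 167 displays each recovery method contained in the received recovery method data, together with its recovery procedure, on the display of the maintainer's terminal device, from the top in ascending order of cost (work time).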
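The route search and the ordering by work time can be sketched as follows. This is an illustrative depth-first enumeration under assumed data shapes, not the procedure defined in the patent: the graph layout, the function names `find_routes` and `rank_by_work_time`, and the `is_normal` predicate are all hypothetical.

```python
def find_routes(graph, start, is_normal, path=None, visited=None):
    """Enumerate simple routes from `start` to any group accepted by is_normal.

    graph: {group: [(action_group, work_time, next_group), ...]}
    Returns a list of routes; each route is a list of (action_group, work_time).
    """
    path = path or []
    visited = (visited or set()) | {start}
    if is_normal(start):
        return [path]
    routes = []
    for action, cost, nxt in graph.get(start, []):
        if nxt not in visited:  # avoid cycling between abnormal groups
            routes += find_routes(graph, nxt, is_normal, path + [(action, cost)], visited)
    return routes


def rank_by_work_time(routes):
    # Cost of a route is the total work time of its recovery action groups.
    return sorted(routes, key=lambda route: sum(cost for _, cost in route))
```

Enumerating every simple route matches the idea of presenting all candidates to the maintainer; for a large graph, a shortest-path search would be a cheaper alternative if only the best route were needed.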
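In step S406, the work time was used as the cost, and the display order of the recovery methods was determined by the magnitude of the work time. Instead of the work time, the total number of recovery action groups on a route may be used as the cost, and the display order may be determined by that total. In the example of FIG. 11, route 1 has one recovery action group in total and route 2 has two, so route 1 and then route 2 are displayed from the top.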
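The alternative cost can be shown with a short Python sketch that mirrors the FIG. 11 example, where route 1 has one recovery action group and route 2 has two. The function name and the group names X, Y, and Z are placeholders invented for this example.

```python
def rank_by_group_count(routes):
    """Sort (route_name, action_groups) pairs by the number of action groups."""
    return sorted(routes, key=lambda item: len(item[1]))


routes = [
    ("route 2", ["action group X", "action group Y"]),
    ("route 1", ["action group Z"]),
]

# Route 1 (one group) is listed before route 2 (two groups).
for name, groups in rank_by_group_count(routes):
    print(name, len(groups))
```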
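According to this embodiment, the recovery method learning unit 165 recognizes, for a plurality of recovery actions performed on a service of an application program, the pattern of the content of each recovery action, and groups the recovery actions by content pattern to form a plurality of recovery action groups. For a plurality of monitoring data related to the service, monitored immediately before and immediately after each recovery action, it likewise recognizes the pattern of each monitoring content and groups the monitoring data by monitoring content pattern to form a plurality of monitoring data groups. Normal and abnormal state transitions between the grouped monitoring data groups can therefore be grasped, so the state transition from an abnormal state to a normal state through recovery work can be formalized as explicit know-how without great effort from the maintainer, and a recovery policy can be formulated when a failure or the like occurs.
The present invention is not limited to the above embodiment. Numerous modifications are possible within the scope of the gist of the present invention.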
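11: Development device
12: Execution unit
13: Monitoring unit
14: Distribution unit
15: Analysis unit
16: Service recovery method formulation device
17: Management unit
161: Recovery work data extraction unit
162: Recovery work data time-series storage unit
163: Monitoring data reception unit
164: Monitoring data time-series storage unit
165: Recovery method learning unit
166: Recovery method determination unit
167: Recovery method output unit
901: CPU
902: Memory
903: Storage
904: Communication device
905: Input device
906: Output device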
Claims (5)
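- An information processing device comprising a learning unit that, for a plurality of recovery work items performed on a service of an application program, recognizes the pattern of the content of each recovery work item and groups the plurality of recovery work items by recovery work content pattern to form a plurality of recovery work groups, and that, for a plurality of monitoring data related to the service and monitored immediately before and immediately after each of the plurality of recovery work items, recognizes the pattern of each monitoring content and groups the plurality of monitoring data by monitoring content pattern to form a plurality of monitoring data groups.
- The information processing device according to claim 1, wherein the learning unit generates learning result data in which monitoring data groups among the plurality of monitoring data groups are associated with one another via the recovery work groups, such that the immediately preceding monitoring data transitions to the immediately following monitoring data through recovery work.
- The information processing device according to claim 2, further comprising a determination unit that, for abnormal monitoring data analyzed as being in an abnormal state, searches the learning result data for a monitoring data group matching the abnormal monitoring data, searches for one or more routes that transition from the determined monitoring data group to a monitoring data group in which normal monitoring data is grouped, and determines, as a recovery method, the recovery work of the recovery work groups on a selected route.
- An information processing method performed by an information processing device, the method comprising: a step of, for a plurality of recovery work items performed on a service of an application program, recognizing the pattern of the content of each recovery work item and grouping the plurality of recovery work items by recovery work content pattern to form a plurality of recovery work groups; and a step of, for a plurality of monitoring data related to the service and monitored immediately before and immediately after each of the plurality of recovery work items, recognizing the pattern of each monitoring content and grouping the plurality of monitoring data by monitoring content pattern to form a plurality of monitoring data groups.
- An information processing program that causes a computer to function as the information processing device according to any one of claims 1 to 3.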
Priority Applications (3)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
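JP2022579267A JP7513921B2 (ja) | 2021-02-05 | 2021-02-05 | Information processing device, information processing method, and information processing program |
PCT/JP2021/004347 WO2022168269A1 (ja) | 2021-02-05 | 2021-02-05 | Information processing device, information processing method, and information processing program |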
US18/274,328 US20240152427A1 (en) | 2021-02-05 | 2021-02-05 | Information processing apparatus, information processing method and program |
Applications Claiming Priority (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
PCT/JP2021/004347 WO2022168269A1 (ja) | 2021-02-05 | 2021-02-05 | Information processing device, information processing method, and information processing program |
Publications (1)
Publication Number | Publication Date |
---|---|
WO2022168269A1 (ja) | 2022-08-11 |
Family
ID=82742129
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
PCT/JP2021/004347 WO2022168269A1 (ja) | Information processing device, information processing method, and information processing program | 2021-02-05 | 2021-02-05 |
Country Status (3)
Country | Link |
---|---|
US (1) | US20240152427A1 (ja) |
JP (1) | JP7513921B2 (ja) |
WO (1) | WO2022168269A1 (ja) |
Citations (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
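WO2009144825A1 (ja) * | 2008-05-30 | 2009-12-03 | Fujitsu Limited | Recovery method management program, recovery method management device, and recovery method management method |
WO2014171047A1 (ja) * | 2013-04-17 | 2014-10-23 | NEC Corporation | Failure recovery procedure generation device, failure recovery procedure generation method, and failure recovery procedure generation program |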
Family Cites Families (3)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7194445B2 (en) * | 2002-09-20 | 2007-03-20 | Lenovo (Singapore) Pte. Ltd. | Adaptive problem determination and recovery in a computer system |
US10963355B2 (en) * | 2018-11-28 | 2021-03-30 | International Business Machines Corporation | Automated and dynamic virtual machine grouping based on application requirement |
US20240345911A1 (en) * | 2023-04-14 | 2024-10-17 | Microsoft Technology Licensing, Llc | Machine learning aided diagnosis and prognosis of large scale distributed systems |
-
2021
- 2021-02-05 WO PCT/JP2021/004347 patent/WO2022168269A1/ja active Application Filing
- 2021-02-05 US US18/274,328 patent/US20240152427A1/en active Pending
- 2021-02-05 JP JP2022579267A patent/JP7513921B2/ja active Active
Also Published As
Publication number | Publication date |
---|---|
US20240152427A1 (en) | 2024-05-09 |
JPWO2022168269A1 (ja) | 2022-08-11 |
JP7513921B2 (ja) | 2024-07-10 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
US11334602B2 (en) | | Methods and systems for alerting based on event classification and for automatic event classification |
US9612898B2 (en) | | Fault analysis apparatus, fault analysis method, and recording medium |
JP4872945B2 (ja) | | Operation management device, operation management system, information processing method, and operation management program |
Yuan et al. | | Automated known problem diagnosis with event traces |
JP4318643B2 (ja) | | Operation management method, operation management device, and operation management program |
EP2683117B1 (en) | | A method and system for network transaction monitoring using transaction flow signatures |
US20110307742A1 (en) | | Method and apparatus for cause analysis involving configuration changes |
US20140281760A1 (en) | | Management server, management system, and management method |
CN111539493B (zh) | | Alarm prediction method and apparatus, electronic device, and storage medium |
US10180872B2 (en) | | Methods and systems that identify problems in applications |
JP2010182015A (ja) | | Quality control system, quality control device, and quality control program |
US10346450B2 (en) | | Automatic datacenter state summarization |
JP2006190138A (ja) | | Alarm management device, alarm management method, and program |
Makanju et al. | | Investigating event log analysis with minimum apriori information |
WO2022168269A1 (ja) | | Information processing device, information processing method, and information processing program |
Wittkopp et al. | | A taxonomy of anomalies in log data |
Kubacki et al. | | Holistic processing and exploring event logs |
JP5141789B2 (ja) | | Operation management device, operation management system, information processing method, and operation management program |
CN112860496A (zh) | | Fault repair operation recommendation method, apparatus, and storage medium |
JP2007257581A (ja) | | Failure analysis device |
US12001271B2 (en) | | Network monitoring apparatus, method, and program |
CN115186001A (zh) | | Patch processing method and apparatus |
US20220035359A1 (en) | | System and method for determining manufacturing plant topology and fault propagation information |
Straube et al. | | Model-driven resilience assessment of modifications to HPC infrastructures |
JP7506329B2 (ja) | | Maintenance system, data processing device, maintenance method, and program |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
121 | Ep: the epo has been informed by wipo that ep was designated in this application |
Ref document number: 21924664 Country of ref document: EP Kind code of ref document: A1 |
|
ENP | Entry into the national phase |
Ref document number: 2022579267 Country of ref document: JP Kind code of ref document: A |
|
WWE | Wipo information: entry into national phase |
Ref document number: 18274328 Country of ref document: US |
|
NENP | Non-entry into the national phase |
Ref country code: DE |
|
122 | Ep: pct application non-entry in european phase |
Ref document number: 21924664 Country of ref document: EP Kind code of ref document: A1 |