Method and Apparatus for Implementing An Expandable Network Based Expert System
CLAIM OF BENEFIT TO PROVISIONAL APPLICATION
This patent application claims the benefit of the earlier-filed U.S. Provisional Patent Application No. 60/152,676 filed 9/7/1999.
FIELD OF THE INVENTION
The present invention relates to the field of expert systems. In particular, the present invention discloses an expandable graphical user interface (GUI) based expert system for computer system diagnoses.
BACKGROUND OF THE INVENTION
Human experts are used to maintain complex systems such as human bodies, computer networks, and telephone networks. A human expert maintains a complex system by first diagnosing the system status by collecting a set of vital statistic measurements and symptoms. For example, medical doctors (human body experts) measure vital statistics such as body temperature, blood pressure, blood sample characteristics and symptoms such as headaches, body pain, coughs, etc.
After collecting the information about the system, the expert applies his/her "expert" knowledge that was gained through schooling and experience. Using the expert knowledge, the expert determines if anything is wrong with the system. The process may be iterative. For example, a certain statistic may indicate a problem but further tests may be required to narrow down the cause of the problem. Furthermore, the expert may need to consult textbooks, technical journals, or other publications to diagnose difficult or obscure problems. If the expert determines that a specific problem exists, the expert again applies his/her knowledge to select a suggested treatment or repair to address the specific problem.
Computer system experts operate by manually probing the subject computer system to obtain various vital statistics such as CPU usage, cache hit percentage, etc. The expert then uses his/her expert knowledge to analyze the vital statistics to determine a particular problem. After having identified the problem, the expert suggests and/or implements remedial measures to address the identified problem. This process is currently time-consuming and difficult. Since computer experts are expensive, the time spent on this process increases the total cost of ownership of any complex computer system. It would therefore be desirable to have a more efficient system for diagnosing computer system problems.
SUMMARY OF THE INVENTION
An expandable network-based expert system for diagnosing network-based systems is disclosed. The network-based expert system collects measurements from a network-based system through a computer network. The network-based expert system then stores the collected information in a database. The network-based expert system diagnoses the status of the network-based system using a set of defined diagnostics. New diagnostics may be added to the network-based expert system at any time to test for new types of conditions.
Other objects, features, and advantages of the present invention will be apparent from the accompanying drawings and from the following detailed description.
BRIEF DESCRIPTION OF THE DRAWINGS
The objects, features, and advantages of the present invention will be apparent to one skilled in the art in view of the following detailed description in which:
Figure 1 illustrates how a human expert is used to solve problems.
Figure 2 illustrates a block diagram of a network computer system being diagnosed by an expert system.
Figure 3 illustrates an architectural block diagram of a generic expandable network-based expert system for computer system diagnoses.
Figure 4 illustrates a block diagram of a network-based expert system implementation for diagnosing SAP R/3 systems.
Figure 5 illustrates a user interface screen for requesting an operating mode.
Figure 6 illustrates a user interface screen requesting SAP logon information.
Figure 7 illustrates a user interface screen displaying a workload summary overview for a SAP system.
Figure 8 illustrates a user interface screen displaying a workload summary for a specific instance of an application server in a SAP system.
Figure 9 illustrates a user interface screen requesting information to run a diagnostic group.
Figure 10 illustrates a user interface screen displaying result information after running a diagnostic group.
Figure 11 illustrates a user interface screen displaying detailed result information for a particular diagnostic.
Figure 12 illustrates a user interface screen displaying source measurement data for a particular diagnostic.
Figure 13 illustrates a user interface screen displaying a formal report after running a diagnostic group.
Figure 14 illustrates a user interface screen displaying diagnostic configuration information.
Figure 15 illustrates a user interface screen requesting information needed to create a new diagnostic.
Figure 16 illustrates a user interface screen of a diagnostic builder used to create SQL queries and expressions.
Figure 17 illustrates a user interface screen containing a three-dimensional view of monthly dialog step counts (transaction volume) of all application servers.
Figure 18 illustrates a user interface screen containing a three-dimensional view of monthly response time of all SAP application servers.
Figure 19 illustrates a user interface screen displaying the current SAP R/3 throughput and response time of all application servers.
Figure 20 illustrates a trending view of monthly dialog response time vs. transaction volume with best-fit projection lines.
Figure 21 illustrates a user interface screen displaying weekly response time vs. transaction volume.
Figure 22 illustrates a user interface screen displaying the daily response time vs. transaction volume.
Figure 23 illustrates a user interface screen containing a three-dimensional view of the ratio of response time among CPU, database, and R/3 for all the application servers.
DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT
A method and apparatus for an expandable expert system for computer system diagnoses is disclosed. In the following description, for purposes of explanation, specific nomenclature is set forth to provide a thorough understanding of the present invention. However, it will be apparent to one skilled in the art that these specific details are not required in order to practice the present invention. For example, the present invention has been described with reference to one implementation for diagnosing SAP R/3 systems. However, the same techniques can easily be applied to other types of computer systems such as web servers, email servers, accounting systems, database systems, etc.
Expert Diagnoses and Treatment
Figure 1 illustrates a conceptual diagram of how an expert is used to diagnose and repair a complex system 110 such as a human, a computer network, or an automobile. The human expert 140 first collects various symptoms and measurements of the complex system 110. The expert 140 decides which measurements to take based on his/her expert knowledge.
After collecting the symptoms and measurements of the complex system 110, the expert 140 uses his/her "expert" knowledge gained through schooling and experience to determine the status of the system 110. Using the expert knowledge, the expert 140 determines if anything is wrong with the system 110. The expert 140 may consult textbooks, technical journals, or other publications 150 to diagnose difficult or obscure problems.
If the expert 140 determines that the complex system 110 has a specific problem, the expert 140 uses his/her expert knowledge to select a suggested treatment or repair to address the specific problem. By using human experts, complex systems can be maintained. However, human experts are very expensive.
Diagnosing Computer Network Problems
Computer system experts diagnose computer system networks by manually collecting data from the computer network. For example, an expert may use an RMON program to obtain various network statistics such as network usage, packet types, network protocol, etc. The expert then uses his/her expert knowledge to analyze the collected statistics to determine a particular problem. After having identified the problem, the expert may suggest or implement remedial measures to address the identified network problem.
The current methods used by computer network experts can be time consuming and difficult. Since computer network experts are expensive, it would be desirable to simplify the task of diagnosing computer network system problems.
Expandable Network Based Expert System
To simplify the diagnoses of computer network-based systems, the present invention introduces an expandable network-based expert system. The expandable network-based expert system runs on a computer system that is coupled to a computer network. The expandable network-based expert system collects measurements needed to diagnose a particular computer network-based system problem. The expandable network-based expert system then analyzes the collected data to determine the status of the computer network-based system. The expandable network-based expert system may be configured to diagnose any type of computer network, network application, server, or other computer network-based system that can have its vital statistics obtained across a network connection.
Figure 2 illustrates a network environment that may use the teachings of the present invention. Referring to Figure 2, an expandable network based expert system 241 runs on a computer system 240 that is coupled to a network 270. The expandable network based expert system 241 is configured to collect specific statistics to diagnose a particular computer network based system. For example, the expandable network based expert system 241 may be configured to diagnose the actual network 270. Alternatively, the expandable network based expert system 241 may be configured to diagnose an application server 250.
Network-Based Expert System Architecture
Figure 3 illustrates a block diagram of the general architecture of the expandable network based expert system. Referring to Figure 3, a user 315 interacts with the expandable network based expert system 300 through a graphical user interface 310. The graphical user interface 310 controls the main component of the expandable network based expert system 300, the expert system engine 350. The expert system engine 350 is responsible for collection of the required measurements, storing the measurements, and performing diagnostics on the collected measurements.
To collect measurements, the network-based expert system 300 uses a data collection unit 330. The data collection unit 330 accesses a computer network 380. Through the computer network 380, the data collection unit 330 collects measurements from any resource on the computer network 380. For example, the data collection unit 330 may collect information from packets on the computer network 380, servers (385 and 387) connected to the computer network 380, and network accessible databases (388). The data collection unit 330 may use packet sniffing, remote procedure calls (RPCs), remote function calls (RFCs), ICMP Pings, SQL database queries or any other means of obtaining information through a network interface.
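By way of illustration only, the role of the data collection unit 330 may be sketched as a set of pluggable probes, one per measurement. The class name, method names, and probe names below are hypothetical and do not appear in the described embodiment; real probes would wrap RPCs, RFCs, pings, SQL queries, or the like.

```python
from typing import Callable, Dict, Optional

# A probe is any callable that fetches one measurement over the network.
Probe = Callable[[], float]


class DataCollectionUnit:
    """Hypothetical sketch of a data collection unit built from named probes."""

    def __init__(self) -> None:
        self._probes: Dict[str, Probe] = {}

    def add_probe(self, name: str, probe: Probe) -> None:
        """Register a new measurement source under the given name."""
        self._probes[name] = probe

    def collect(self) -> Dict[str, Optional[float]]:
        """Run every probe; a network failure yields None for that probe."""
        results: Dict[str, Optional[float]] = {}
        for name, probe in self._probes.items():
            try:
                results[name] = probe()
            except OSError:
                # Assumption: an unreachable resource should not abort collection.
                results[name] = None
        return results


def unreachable_server() -> float:
    """Stand-in probe that simulates a server that cannot be reached."""
    raise ConnectionError("server unreachable")


unit = DataCollectionUnit()
unit.add_probe("cpu_pct", lambda: 42.0)         # stand-in for a real probe
unit.add_probe("ping_ms", unreachable_server)   # simulated network failure
measurements = unit.collect()
```

In this sketch, adding support for a new measurement amounts to registering one more probe.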
The network-based expert system 300 stores the collected measurements into a measurement database 325 using a database interface 320. The database interface 320 provides an interface layer that may be modified such that various different database systems may be used for the measurement database 325. The measurement database 325 should be organized in a logical manner that allows simple access, searching, and input of data. In one embodiment, the database is a Microsoft Access type database.
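As a minimal sketch of such an interface layer, the following example defines a small storage interface and one interchangeable backend. SQLite is used here purely as a stand-in for the measurement database 325; the class names, table name, and column names are assumptions made for illustration.

```python
import sqlite3
from typing import List, Protocol


class MeasurementStore(Protocol):
    """Hypothetical interface playing the role of the database interface 320."""

    def save(self, name: str, timestamp: str, value: float) -> None: ...

    def load(self, name: str) -> List[float]: ...


class SqliteMeasurementStore:
    """One possible backend; another database could be swapped in behind
    the same two methods without changing the calling code."""

    def __init__(self, path: str = ":memory:") -> None:
        self._conn = sqlite3.connect(path)
        self._conn.execute(
            "CREATE TABLE IF NOT EXISTS measurements "
            "(name TEXT, timestamp TEXT, value REAL)"
        )

    def save(self, name: str, timestamp: str, value: float) -> None:
        self._conn.execute(
            "INSERT INTO measurements VALUES (?, ?, ?)",
            (name, timestamp, value),
        )
        self._conn.commit()

    def load(self, name: str) -> List[float]:
        rows = self._conn.execute(
            "SELECT value FROM measurements WHERE name = ? ORDER BY timestamp",
            (name,),
        )
        return [row[0] for row in rows]


store: MeasurementStore = SqliteMeasurementStore()
store.save("response_time_ms", "1999-09-07T10:00", 700.0)
store.save("response_time_ms", "1999-09-07T11:00", 1200.0)
```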
The network-based expert system 300 analyzes the collected measurements using the expert system engine 350. The expert system engine 350 includes a large number of individual diagnostics (361, 362, . . ., 36n and 371, 372, . . . , 37n) that are used to test the collected measurements. The individual diagnostics are organized into diagnostic groups. For example, diagnostics 361, 362, . . . , 36n belong to diagnostic group 360. Using the graphical user interface 310, the user 315 may select individual diagnostics or groups of diagnostics to run.
Expandability
The network-based expert system 300 is expandable in many different ways. When in use to diagnose a particular application, the network-based expert system 300 may be expanded by adding new individual diagnostics or diagnostic groups. If a new diagnostic needs new measurement data, the measurements database 325 may be modified to store the new measurement. Furthermore, the data collection unit 330 would be correspondingly modified to be able to fetch the new measurement through the computer network 380.
The network-based expert system 300 may also be easily expanded to diagnose different network-based systems. To diagnose a new network-based system, the expandable network-based expert system 300 is modified in three areas: the measurements database 325, the data collection unit 330, and the individual and grouped diagnostics. The measurements database 325 is first modified to store the new measurements that will be taken to diagnose the new network-based system. Next, the data collection unit 330 is modified to fetch the new measurements listed in the modified measurements database 325. Finally, the expert system engine 350 is modified to store new sets of individual diagnostics organized into new diagnostic groups.
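The grouping and expandability of diagnostics described above may be sketched as follows. This is an illustrative data structure only; the names Diagnostic, ExpertSystemEngine, add_diagnostic, and run_group are hypothetical, and each diagnostic is reduced to a pass/fail check on collected measurements.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Collected measurements, keyed by measurement name.
Measurements = Dict[str, List[float]]


@dataclass
class Diagnostic:
    name: str
    check: Callable[[Measurements], bool]  # True means the diagnostic passes


@dataclass
class ExpertSystemEngine:
    """Hypothetical engine that holds diagnostics organized into groups."""

    groups: Dict[str, List[Diagnostic]] = field(default_factory=dict)

    def add_diagnostic(self, group: str, diagnostic: Diagnostic) -> None:
        """Add a diagnostic at any time, creating the group if needed."""
        self.groups.setdefault(group, []).append(diagnostic)

    def run_group(self, group: str, data: Measurements) -> Dict[str, bool]:
        """Run every diagnostic in one group against the collected data."""
        return {d.name: d.check(data) for d in self.groups[group]}


engine = ExpertSystemEngine()
engine.add_diagnostic(
    "hardware", Diagnostic("cpu_below_90", lambda m: max(m["cpu_pct"]) < 90)
)
# A new diagnostic can be added later without changing the engine itself.
engine.add_diagnostic(
    "hardware", Diagnostic("memory_above_64", lambda m: min(m["free_mb"]) > 64)
)
results = engine.run_group(
    "hardware", {"cpu_pct": [35.0, 60.0], "free_mb": [48.0, 128.0]}
)
```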
Diagnostics
To facilitate the analysis, one embodiment allows each measurement to be assigned a "baseline" reference value and a baseline target percentage. The baseline reference value represents a desired value that the measured value should exceed (or be lower than, depending on the performance metric). Examples of baseline values include desired response time, available memory, etc. The baseline target percentage is a value that specifies the percentage of samples that should exceed (or be lower than) the baseline reference value to receive a satisfactory rating for the diagnostic. For example, consider a response time measurement with a 1000 millisecond baseline reference value and a 70% baseline target percentage. If there are five samples with response time measurements of 700, 800, 645, 1200, and 900 milliseconds then the system will pass the diagnostic since 80% of the samples were below the desired 1000 millisecond baseline value. If the five samples were 700, 1245, 865, 940, and 1320, then the system would fail the diagnostic since only 60% of the samples had better performance than the 1000 millisecond baseline value.
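The baseline test in the example above can be sketched in a few lines. The function name and parameters are hypothetical. Two details are assumptions not stated in the description: the comparison against the baseline is strict, and a diagnostic passes when the percentage of good samples is at least the target percentage.

```python
from typing import Sequence


def passes_baseline(
    samples: Sequence[float],
    baseline: float,
    target_pct: float,
    lower_is_better: bool = True,
) -> bool:
    """Return True when enough samples beat the baseline reference value."""
    if not samples:
        raise ValueError("at least one sample is required")
    if lower_is_better:
        good = sum(1 for s in samples if s < baseline)  # e.g. response time
    else:
        good = sum(1 for s in samples if s > baseline)  # e.g. available memory
    return 100.0 * good / len(samples) >= target_pct


# The two response-time examples from the text (milliseconds).
first = passes_baseline([700, 800, 645, 1200, 900], baseline=1000, target_pct=70)
second = passes_baseline([700, 1245, 865, 940, 1320], baseline=1000, target_pct=70)
print(first, second)  # True False
```

The first sample set has 4 of 5 samples (80%) under the baseline and passes; the second has 3 of 5 (60%) and fails.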
SAP Based Expert System
To provide a detailed example of one embodiment of the expandable network-based expert system, an embodiment for the SAP R/3 enterprise software will be disclosed. SAP R/3 is a well-known enterprise resource planning program that is used by thousands of businesses. SAP R/3 is a complex network-based client-server software system. Human experts are normally required to install and maintain a SAP R/3 system. However, with the expert system of the present invention, an untrained person may be empowered to keep an SAP R/3 system running.
Figure 4 illustrates one embodiment of an expert system 400 configured for a SAP R/3 system. One specific embodiment has been implemented on the Microsoft Windows NT operating system. In the embodiment of Figure 4, the SAP expert system 400 is divided into two components: a graphical user interface (GUI) program 410 and a server program 417. The graphical user interface (GUI) program 410 registers with the Windows registry 405. By registering with the Windows registry, the GUI program 410 can maintain context information such as window position, the last diagnostic run, etc. The graphical user interface (GUI) program 410 interacts with a server program 417 that performs the majority of the work.
The GUI program 410 interacts with a transaction dispatcher 450 in the server program 417. The transaction dispatcher 450 is a centralized facility for accepting user requests and issuing transaction requests to various subsystems in the server program 417. The subsystems include a logon transaction unit 461, a workload summary transaction unit 463, a diagnostic transaction unit 465, and a trend transaction unit 467. Each subsystem will be described individually.
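One way to picture the transaction dispatcher 450 is as a table that routes each user request to the subsystem registered for it. The sketch below is illustrative only: the class name, the request format, and the handler functions are hypothetical, and the handlers merely return strings where real subsystems would perform work.

```python
from typing import Callable, Dict

# A transaction unit is modeled as a callable taking a request dictionary.
Handler = Callable[[dict], str]


class TransactionDispatcher:
    """Hypothetical central facility that routes requests to subsystems."""

    def __init__(self) -> None:
        self._units: Dict[str, Handler] = {}

    def register(self, kind: str, handler: Handler) -> None:
        """Associate a transaction type with the unit that handles it."""
        self._units[kind] = handler

    def dispatch(self, kind: str, request: dict) -> str:
        """Forward the request to the matching unit, or reject it."""
        try:
            handler = self._units[kind]
        except KeyError:
            raise ValueError(f"unknown transaction type: {kind}") from None
        return handler(request)


dispatcher = TransactionDispatcher()
dispatcher.register("logon", lambda req: f"logon for {req['user']}")
dispatcher.register("workload", lambda req: f"workload of {req['server']}")
dispatcher.register("diagnostic", lambda req: f"run group {req['group']}")
dispatcher.register("trend", lambda req: f"trend by {req['period']}")

reply = dispatcher.dispatch("diagnostic", {"group": "memory"})
```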
Logon Transaction Unit
The logon transaction unit 461 is responsible for starting the server program 417 and accessing the desired SAP resources. When the SAP expert system 400 begins, the logon transaction unit 461 may ask the user if he/she desires to work online, offline, or in a demo mode as illustrated in Figure 5. The online mode enables the SAP expert system 400 to access SAP resources on the network 480. The offline mode allows the user to examine previously collected measurement data. The demo mode uses a demonstration data set. The demo mode is used in conjunction with a tutorial to teach new users how to use the SAP expert system 400.
When a user selects online mode, the logon transaction unit 461 requests information about an SAP application server to log into. Next, the logon transaction unit 461 requests a client number, user name, password, and language information needed to log into the application server as illustrated in Figure 6. The logon transaction unit 461 then proceeds to log onto a SAP server (such as SAP host 493) to access SAP resources. As illustrated in Figure 4, the logon transaction unit 461 communicates with a SAP host through a SAP communication interface 430.
Once logged in, the logon transaction unit 461 collects detailed information from the application server that it is logged into. Furthermore, the logon transaction unit 461 collects a list of other application servers. The logon transaction unit 461 stores the collected information in server database 425 through the database interface 420.
After a successful logon, the logon transaction unit 461 will be in online mode. The online logon transaction unit 461 then instructs the workload summary transaction unit 463 to display a workload overview.
Workload Summary Unit
The workload summary transaction unit 463 is responsible for displaying a high-level snapshot overview of the SAP system. The workload summary transaction unit 463 generates a workload summary by obtaining performance information that has been collected by the SAP R/3 system since midnight. For example, the workload summary transaction unit 463 may access historical performance information stored in the SAP database 495 and transfer such data into database 425. The workload summary transaction unit 463 then displays a formatted workload summary on the display screen. Figure 7 illustrates one embodiment of a formatted workload summary. The historical performance information stored into database 425 may be later used for trend analysis.
To obtain more detailed information, a user may "drill down" into the data by requesting the display of additional information about a particular server. The data is fetched from the database 425 and formatted for screen display. Figure 8 illustrates a workload summary for a particular application server.
The workload summary transaction unit 463 may be configured to periodically update the workload information. In this manner, the user can view how the workload changes during the day.
To run a diagnostic on a particular application server, the user first selects a host system or instance from the list presented in Figure 7. The user then selects the diagnostic pull-down menu. This action activates the diagnostic transaction unit 465. In an alternate embodiment, a pop-up menu may be used to initiate a diagnostic on a selected host system.
Diagnostic Transaction Unit
The diagnostic transaction unit 465 is the main unit responsible for performing the expert system diagnostics. The user enters a diagnostic request through the GUI program 410 to have the diagnostic transaction unit 465 execute a group of diagnostics. Figure 9 illustrates one embodiment of a screen display that may be used to perform a specific diagnostic.
Referring to Figure 9, a user first specifies a diagnostic group to run and a sample time period. After making these selections, the user instructs the diagnostic transaction unit 465 to collect the measurement data necessary for the diagnostic group. The diagnostic transaction unit 465 then collects the data.
In one embodiment, the data is collected in four stages. A first stage collects real-time data such as CPU workload, queue length, memory usage, disk usage, LAN status, and information about the processes running on the application server. Next, a second stage collects pseudo-static information that does not change rapidly. Specifically, the second stage collects SAP instance parameters. Examples of pseudo-static measurements include SAP buffer parameters, SAP memory information, SAP buffers, tables, etc. A third stage collects historical measurement statistics. The historical measurement statistics include CPU workload, memory usage, disk usage, and LAN status information collected during the last 24 hours for the application server. In one embodiment, there are twenty-four hourly averaged values for the last twenty-four hours. The fourth stage collects information and statistics about the transactions that have occurred. The transaction information is used to determine what happened during the data collection interval. For example, the expert system will be able to determine the number of transactions that occurred, the response time of the transactions, the CPU time, the database access time, etc.
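The hourly averaged values mentioned for the third stage could be derived from raw timestamped samples as sketched below. The description does not say how the averages are produced, so this is only an assumed approach, and the function name is hypothetical.

```python
from collections import defaultdict
from datetime import datetime
from typing import Dict, Iterable, List, Tuple


def hourly_averages(
    samples: Iterable[Tuple[datetime, float]],
) -> Dict[datetime, float]:
    """Average samples per clock hour, keyed by the start of each hour."""
    buckets: Dict[datetime, List[float]] = defaultdict(list)
    for timestamp, value in samples:
        # Truncate the timestamp to the start of its hour.
        hour = timestamp.replace(minute=0, second=0, microsecond=0)
        buckets[hour].append(value)
    return {hour: sum(vals) / len(vals) for hour, vals in sorted(buckets.items())}


cpu_samples = [
    (datetime(1999, 9, 7, 10, 5), 40.0),
    (datetime(1999, 9, 7, 10, 35), 60.0),
    (datetime(1999, 9, 7, 11, 10), 30.0),
]
averages = hourly_averages(cpu_samples)
```

Applied to a full day of samples, this yields at most twenty-four values, one per hour.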
After collecting measurement data, the expert system can then analyze the data. To analyze the data, the user may select a baseline value to be used as illustrated in Figure 9. Different baselines that may be selected by the user include typical weekday baseline values, typical weekend day baseline values, typical end-of-month day baseline values, optimal baseline values, etc. The user may create new baseline values as desired.
Finally, after analyzing the collected data, the user may view the results of the various diagnostics. Figure 10 illustrates the results from running a particular diagnostic group on a particular server instance. Again, the user can drill down to obtain additional information. Figure 11 illustrates the detailed information from one particular diagnostic run in the diagnostic group. For even more detail, the user can drill down to view the actual data collected as illustrated in Figure 12. For reporting purposes, the user may print or publish to a web site a report of the diagnostic group results as illustrated in Figure 13.
The diagnostic transaction unit 465 is expandable in that new diagnostics and diagnostic groups may be added at any time. Users may create their own specific diagnostics. Diagnostics may be created and shared by importing and exporting diagnostic definitions.
Figure 14 illustrates a diagnostic configuration menu. A user may select and edit an existing diagnostic using the edit pull-down menu. Furthermore, a user may create a new diagnostic from the diagnostic configuration menu of Figure 14.
Figure 15 illustrates one embodiment of a display screen that may be used to create new diagnostics. As illustrated in Figure 15, the user specifies a name for the diagnostic. The user further specifies a data object to examine and the data table that includes the data object. For comparison, the user defines a default baseline value and a default baseline target percentage. The condition field is used to specify the logical expression of the diagnostic. As illustrated in Figure 15, in the expression "Response Time < #1#" the #1# placeholder is replaced with the Default Baseline Value such that the result of comparing the response time with the baseline value is returned. The Exception textbox is used to specify exceptions. For example, the analysis could be performed only on certain objects. The Report Field List is used to specify a list of fields that will be displayed when the user drills down during the analysis. The user may operate the "test" button to test all the queries defined by the diagnostic including the Condition, Exception, and Report Field List.
To facilitate the creation of condition expressions, a diagnostic builder window may be used as illustrated in Figure 16. The Fields Name list in the diagnostic builder window displays every field that is available for creating this diagnostic. The list of fields is based on the data table selected. Again, the user may activate the test button to test the diagnostic. The test button builds and executes a full SQL query. If the query executes without error, the diagnostic is ready to run.
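The placeholder substitution and query construction described above may be sketched as follows. The actual query format of the embodiment is not disclosed, so the SQL shape here is an assumption. SQLite stands in for the database, and the table name Transactions and column name ResponseTime are hypothetical (a column name without spaces is used to avoid quoting rules that differ between databases).

```python
import sqlite3

PLACEHOLDER = "#1#"


def expand_condition(condition: str, baseline_value: float) -> str:
    """Substitute the baseline value for the #1# placeholder."""
    return condition.replace(PLACEHOLDER, repr(float(baseline_value)))


def percent_satisfying(
    conn: sqlite3.Connection, table: str, condition: str, baseline_value: float
) -> float:
    """Build and run the queries, returning the share of rows that satisfy
    the condition. The condition text is trusted input in this sketch."""
    where = expand_condition(condition, baseline_value)
    total = conn.execute(f'SELECT COUNT(*) FROM "{table}"').fetchone()[0]
    if total == 0:
        return 0.0
    hits = conn.execute(
        f'SELECT COUNT(*) FROM "{table}" WHERE {where}'
    ).fetchone()[0]
    return 100.0 * hits / total


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE Transactions (ResponseTime REAL)")
conn.executemany(
    "INSERT INTO Transactions VALUES (?)",
    [(700,), (1245,), (865,), (940,), (1320,)],
)
pct = percent_satisfying(conn, "Transactions", "ResponseTime < #1#", 1000)
```

The resulting percentage can then be compared against the baseline target percentage to rate the diagnostic.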
Trend Transaction Unit
To allow a user to see long-term trends in SAP system performance, the SAP expert system embodiment contains a trend transaction unit 467. The trend transaction unit 467 constructs well-formatted graphical displays that describe the history of a SAP system. The historical display allows a user to understand the "normal" performance of a SAP system.
In one embodiment, the trend transaction unit 467 creates both a workload history and a hardware history. The workload history displays SAP workload information such as the volume of transactions and the transaction response time per month, week, or day. The hardware history displays hardware usage information such as CPU busy, memory usage per day, file system usage, etc.
To display historical information, the trend transaction unit 467 may access information that was collected earlier and stored into the database 425. In the SAP environment, the trend transaction unit 467 may also request historical information from the SAP hosts. In the default SAP configuration, an SAP system collects 2 days, 2 weeks, and 2 months of data (which really gives 3 months, 3 weeks, and 3 days of historical data).
The trend transaction unit 467 gets the list of all workload information available from the MONI Table with the SAPWL_WORKLOAD_GET_DIRECTORY remote function call (RFC) in the SAP system. The directory table contains information about each SAP Application Server and also a special Application Server called TOTAL, which is the sum of all Application Servers. This information is stored in the Env_Workload_Directory table in database 425. The trend transaction unit 467 furthermore reads the OSMON data to get the OS historical data. This historical OS information is stored in the Env_OS_History table in the database 425.
After collecting the historical information, the trend transaction unit 467 allows the user to view the collected historical data in a number of different formats. Figure 17 illustrates a three-dimensional view of monthly dialog step counts (transaction volume) of all application servers. Figure 18 illustrates a three-dimensional view of monthly response time of all SAP application servers. Figure 19 illustrates a view of the current SAP R/3 throughput and response time of all application servers. Figure 20 illustrates a trending view of monthly dialog response time vs. transaction volume with best-fit projection lines. From the view of Figure 20, a user can drill down by a single click of the graphical bars to get weekly response time vs. transaction volume as illustrated in Figure 21. From the view of Figure 21, a user can drill further down to display the daily response time vs. transaction volume as illustrated in Figure 22. Finally, Figure 23 illustrates a three-dimensional view of the ratio of response time among CPU, database, and R/3 for all the application servers.
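The description does not state how the best-fit projection lines of Figure 20 are computed. One common choice, assumed here purely for illustration, is an ordinary least-squares line through the monthly values, which can then be extended to project a future month. The function name and the sample data are hypothetical.

```python
from typing import Sequence, Tuple


def best_fit_line(xs: Sequence[float], ys: Sequence[float]) -> Tuple[float, float]:
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    if n < 2 or n != len(ys):
        raise ValueError("need at least two points with matching lengths")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("x values must not all be identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Hypothetical monthly average dialog response times in milliseconds.
months = [1, 2, 3]
response_ms = [800.0, 900.0, 1000.0]
slope, intercept = best_fit_line(months, response_ms)
projected_month_4 = slope * 4 + intercept
```

With these illustrative values the fitted line rises by 100 milliseconds per month, so the projection for month 4 is 1100 milliseconds.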
The foregoing has described an expandable network-based expert system. It is contemplated that changes and modifications may be made by one of ordinary skill in the art, to the materials and arrangements of elements of the present invention without departing from the scope of the invention.