TWI292091B - Computer performance evaluator and application method thereof - Google Patents

Computer performance evaluator and application method thereof

Info

Publication number
TWI292091B
Authority
TW
Taiwan
Prior art keywords
state
time
controlled node
module
memory
Prior art date
Application number
TW94105691A
Other languages
Chinese (zh)
Other versions
TW200630794A (en)
Inventor
Fan Tien Cheng
Shanglun Wu
Yunta Chung
Original Assignee
Univ Nat Cheng Kung
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Univ Nat Cheng Kung
Priority to TW94105691A
Publication of TW200630794A
Application granted
Publication of TWI292091B

Landscapes

  • Debugging And Monitoring (AREA)
  • Hardware Redundancy (AREA)

Description

IX. Description of the Invention

[Technical Field of the Invention]

The present invention relates to a computer performance evaluator (PEV) and an application method thereof, and more particularly to a computer performance evaluator capable of detecting abnormal computer performance and predicting the time to failure, and to its application method.

[Prior Art]

With the advance of technology, information application systems have become critical to both enterprises and individual users; when such a system is unreliable, its users can suffer serious losses. In recent years the phenomenon of software aging has been identified: an information application system suffers performance degradation and failure because key resources are exhausted.

In research on predicting failures caused by software aging, several scholars have proposed measurement-based methods, because an information system behaves differently depending on its execution environment and applications. These methods collect system parameters and use the trend of the parameter changes to predict how long it will take for the system's resources to be exhausted. For example, Trivedi et al. ("A Measurement-based Model for Estimation of Resource Exhaustion in Operational Software Systems," Proc. of ISSRE 1999, Boca Raton, FL, Nov. 1999) use a nonparametric method to compute the resource consumption per unit time and use the estimated consumption rate to predict the time to failure. However, this technique does not divide the performance of the controlled computer into distinct states, so the system failure time must be computed continuously, and its communication is built on SNMP (Simple Network Management Protocol).

The greatest advantage of SNMP is its simplicity: vendors can install it easily, and network management platforms and administrators find it easy to use. Its greatest drawback is also its simplicity: it is too simple to serve large, complex networks adequately or efficiently, and its security provisions are so minimal that an enterprise's network components and equipment are left unprotected. The newer version, SNMPv2, makes network management more efficient and, above all, focuses on remedying SNMP's security deficiencies. However, SNMPv2 changed so much and became so complicated that many network equipment vendors could not adopt it fully and quickly, so that to date only a few network devices and components support all SNMPv2 functions.

Among existing products, the Solaris operating system of Sun Corporation offers Predictive Self-Healing (PSH), a self-healing technology built into the operating system. Its architecture contains two components: the Solaris Fault Manager and the Solaris Service Manager. The technology can restart application services automatically, and it implements predictive self-healing for the CPU, memory, and I/O bus components. However, it supports only the Sun Solaris operating system, and because the detector and the object under test reside in the same environment, it raises the concern of a "player who is also the referee."

Among existing patents, the "Retargetable computer design system" patent (U.S. Patent No. 6,772,106), which is applied to computer performance simulation, proposes a way of predicting computer performance that incorporates simulation. It collects the relevant performance parameters of the computer to be evaluated and feeds them to a simulator in order to learn how the computer currently performs.

The "Generic service management system" patent (Republic of China Patent Publication No. 573266) discloses a Service Management Scheme (SMS) built on the Jini environment and using Design by Contract technology. The SMS contains a Generic Evaluator (GEV) that detects whether the execution of a Service has become abnormal. Besides detecting whether a Service has crashed, it detects anomalies such as transmitting a wrong message or sending a message to the wrong node, as well as degradation of a Service's performance. The generic evaluator also has a backup function: it backs up all execution states and parameters of a Service to a database, and when the Service itself malfunctions, the Client retrieves the good backup data and sends it to a selected backup Service, which continues the unfinished task. However, the GEV uses only a simple 3σ test to judge a single kind of abnormal state. Because the parameters needed to diagnose the health of a system are not a single parameter but multi-dimensional vectors, the system performance evaluation capability of this generic evaluator is insufficient.

The "Preventive maintenance time prediction method" patent for semiconductor process equipment (Republic of China Patent Publication No. 533352) proposes a method for predicting the preventive maintenance time of a tool. The method first receives the historical data of the semiconductor process equipment and extracts from it the data sets to be analyzed, each consisting of one parameter and one time. It then obtains a regression line for each data set by least-squares approximation, so that different parameter sets yield different regression lines, and finally predicts the preventive maintenance time from these regression lines. Although the method uses several parameters, it analyzes each parameter individually and does not go on to apply multivariate analysis to combine the parameters into one unified output.

The "Semiconductor fab performance index prediction system" patent (Republic of China Patent Publication No. 459269), applied to performance analysis of semiconductor wafer fabs, is based on queuing theory and comprises a System Analyzer Module and a Predictor Module. The system analyzes the wafer arrival pattern and service pattern of each tool group in the fab and accurately predicts performance indices, including product cycle time. Its characteristic feature is that tools are divided into six classes according to their operating characteristics, each class is modeled with a suitable queuing model, and tool-unavailability events and differing wafer priorities are taken into account. The system analyzer module applies empirical distribution analysis and goodness-of-fit tests. However, this patent cannot predict the failure of a computer caused by resource exhaustion.

In addition, the fault-tolerant Microsoft Cluster Service (MSCS) architecture proposed by Microsoft is a solution for comprehensive fault-tolerance management of platform resources. It manages the fault tolerance not only of applications but also of disk drives, printers, and other Microsoft software systems such as SQL Server 2000 and Exchange Server 2000. When the MSCS node detection mechanism is applied in a more complex cluster environment, all of its nodes periodically and simultaneously send heartbeats to tell the other nodes "I am alive!!", which places a heavy load on the network.

Refer to Fig. 5A, which shows the three-tier system architecture of a conventional Application Cluster Service (APCS). The three tiers are the client tier, the application tier, and the database tier. As shown in Fig. 5A, the conventional cluster service divides the controlled nodes 70-74 of the cluster environment into two roles: a master node (e.g. 70) and slave nodes (e.g. 71-74). The master node 70 uses a heartbeat mechanism to detect whether the slave nodes 71-74 have failed, and if the master node 70 fails, the slave nodes 71-74 automatically negotiate among themselves to elect a new master node to replace it. This conventional cluster service reduces the network load only by having the master node alone send heartbeat messages to the slave nodes. However, neither the Microsoft Cluster Service nor the conventional application cluster service can detect a performance abnormality of a controlled node caused by resource exhaustion, and neither can predict the time to failure.

There is therefore an urgent need for a computer performance evaluator and an application method thereof that detect the abnormal performance state a computer enters when its resources run out, and that predict the time remaining between the abnormal state and the failed state. By providing information such as node abnormality detection and the predicted time to failure, such an evaluator helps the cluster service software decide the appropriate timing and strategy for node failover, and thereby improves the reliability of the software system.

[Summary of the Invention]

One object of the present invention is to provide a computer performance evaluator and an application method thereof that offer a new architecture for predicting computer failure caused by resource exhaustion.

Another object of the present invention is to provide a computer performance evaluator and an application method thereof that supply node abnormality detection and predicted computer failure time, so as to help the cluster service software decide the appropriate timing and strategy for node failover and thereby improve the reliability of the software system.

In accordance with the above objects, a computer performance evaluator is provided for detecting abnormal performance, caused by resource exhaustion, of a computer located at at least one controlled node.

According to a preferred embodiment of the present invention, the computer performance evaluator comprises at least one data collection module, a detection module, and a prediction module. The data collection module is coupled to the controlled node and collects a plurality of system resource parameters of the controlled node. The detection module evaluates the performance state of the controlled node from these system resource parameters and generates a state index for each controlled node. When the detection module determines from the state index that the performance state of one of the controlled nodes is abnormal, the prediction module is started to perform a prediction step that predicts how long it will take for that controlled node to fail from resource exhaustion. The prediction step is based on, for example, software aging with a monotonic-decay characteristic, and uses a Mean Time to Failure (MTTF) prediction method.

In accordance with the above objects, an application method of the computer performance evaluator is also provided, suitable for a group in a cluster environment that contains a plurality of controlled nodes.

According to a preferred embodiment of the present invention, the application method proceeds as follows. First, a computer performance evaluator is provided that comprises a plurality of data collection modules, a detection module, and a prediction module. Next, the data collection modules are installed on the controlled nodes. The data collection modules then collect the system resource parameters of their respective controlled nodes and send them to the detection module for analysis and judgment, and the detection module generates a state index for each controlled node. When the detection module determines from the state index that the performance state of one of the controlled nodes is abnormal, the prediction module is started to perform a prediction step that predicts how long it will take for that controlled node to fail from resource exhaustion, so that the node can be replaced in time.

Applying the present invention therefore makes it possible to predict computer failure caused by resource exhaustion effectively and in time, and to provide the ability to predict the computer's failure time, which helps the cluster service software decide the appropriate timing and strategy for node failover and thereby improves the reliability of the software system.

[Embodiment]

The problem addressed by the computer performance evaluator of the present invention is how to foresee that a controlled node (a computer) will run inefficiently or even crash because its resources are exhausted. To achieve this foresight, the present invention proposes an architecture that combines a detection mode and a prediction mode. The architecture collects the system resource parameters of the controlled nodes in real time, evaluates the performance state of each node, and further predicts the remaining life of the controlled node, that is, its time to failure.

Refer to Fig. 1, which shows the system architecture of the computer performance evaluator of the present invention. The computer performance evaluator 10 comprises a plurality of data collection modules 21 and 22, a detection module 30, and a prediction module 50. The data collection modules 21 and 22 are installed on the controlled nodes 71 and 72 respectively. They call the operating system application programming interfaces (OS API) 81 and 82 to collect the system resource parameters of their nodes and send the parameters to the detection module 30 through the CORBA communication infrastructures 91 and 92.

One feature of the present invention is that the process by which a controlled node fails is defined as a series of states. Refer to Fig. 2, which shows the states of a controlled node. A controlled node has five states: the initial state, the active state, the inactive state, the abnormal (Sick) state, and the failed (Dead) state. A controlled node enters the initial state when it has just been started. Once the application on the node has been launched, the node enters the active state, which is where it normally stays. When the application is suspended, the node enters the inactive state, and when the application is restarted, it returns to the active state. When the available resources of the node become abnormal, for example through software aging, the node moves from the active state to the abnormal state, and when its available resources return to normal, it can return to the active state. When the available resources are exhausted, the node enters the failed state (that is, it crashes). The main function of the detection module is therefore to detect whether a controlled node has entered the abnormal state.

Refer to Fig. 3, which shows a block diagram of the detection module. The detection module is a fuzzy logic detection module: it derives a state index from a few key resource performance parameters and uses this index to judge the current state of the controlled node. The fuzzy logic detection module comprises at least a screening device (not shown), a fuzzifier 33, a fuzzy rule base 37, an inference engine 35, and a defuzzifier 39.

The screening device groups and screens the system resource parameters collected by the data collection modules and selects a plurality of key parameters. First, with reference to the Microsoft Windows 2000 technical documentation, the system resource parameters that reflect computer performance are examined; these cover evaluating memory and cache usage, analyzing processor activity, auditing and tuning disk performance, and monitoring network performance. To reduce the complexity of the performance monitoring problem, the present invention may consider only the system resource parameters related to software aging, although it may also consider other performance-related system resource parameters and is not limited in this respect. Software aging is the phenomenon in which the exhaustion of key resources causes a node's performance to degrade or the node to fail, and the main factors related to software aging are the processor and memory. Following the Microsoft technical report, the system resource parameters related to the processor, memory, and virtual memory are extracted, namely Processor Time, Privileged Time, Pool Nonpaged Bytes, Available Mbytes, Working Set, Pages/sec, Pool Nonpaged Allocations, Pool Paged Bytes, Pages Input/sec, and Pages Output/sec, and the factor-naming method of multivariate analysis is used to group and screen them. Finally, the screening device selects four key parameters as the input data of the fuzzifier 33: Processor Time, Privileged Time, Pool Nonpaged Bytes, and Available Mbytes. Among these inputs, Processor Time and Privileged Time are mainly used to check whether processor performance has reached a bottleneck; the nonpaged pool usage (Pool Nonpaged Bytes) is used to judge whether there is a serious memory leak; and Available Mbytes is used to judge whether the remaining main memory is insufficient.

The fuzzifier 33 uses the fuzzy rule base 37 to assign a weight to each of the key parameters 31 (X_Processor_Time, X_Privileged_Time, X_Pool_Nonpaged_Bytes, X_Available_Mbytes), producing a set of input data (μ(Processor_Time), μ(Privileged_Time), μ(Pool_Nonpaged_Bytes), μ(Available_Mbytes)) that is passed to the inference engine 35. From this input data the inference engine 35 infers whether the key parameters are all close to their thresholds and produces an output value (μ(State_Index)) that is passed to the defuzzifier 39. The defuzzifier 39 then converts the inferred output value into the state index 32 (Y_State_Index), which is used to judge the state to which the controlled node currently belongs. The state index 32 normally lies between 0 and 1: a value between 0.3 and 1 indicates the normal state, and a value between 0 and 0.3 indicates the abnormal state.

As shown in Fig. 1, when the detection module 30 determines that a controlled node is in the abnormal state, it starts the prediction module 50 to predict how long it will take for that controlled node to fail from resource exhaustion. If the software aging is strictly decreasing, the mean time to failure (MTTF) is predicted as follows. Refer to Fig. 4, which shows the evaluation curves for predicting the MTTF. In a software aging system that is strictly decreasing, three curves are relevant to predicting the MTTF; the x and y coordinates of the curves represent time and the remaining system resource, respectively. If there are enough samples (x_i, y_i), i = 1, 2, ..., n, and simple linear regression can be used to describe the relationship between y and x, then a simple linear model can be established, whose estimate is given by Equation (1).
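Before turning to the regression estimate, the node states of Fig. 2 and the detection stage described above can be illustrated with a minimal sketch. It is only an illustration under stated assumptions: the event names, the linear membership ramp, the per-parameter limits, and the use of a single "minimum" rule in place of the fuzzy rule base are all hypothetical, since the text does not give the actual membership functions or rules. Only the five states, the four key parameters, and the 0.3 threshold come from the description.

```python
from enum import Enum


class NodeState(Enum):
    INITIAL = "initial"
    ACTIVE = "active"
    INACTIVE = "inactive"
    SICK = "sick"   # the abnormal state
    DEAD = "dead"   # the failed state


# Transitions as described for Fig. 2. Event names are hypothetical.
# Assumption: the dead state is reached from the sick state; the text only
# says it is entered when the available resources are exhausted.
TRANSITIONS = {
    (NodeState.INITIAL, "app_started"): NodeState.ACTIVE,
    (NodeState.ACTIVE, "app_paused"): NodeState.INACTIVE,
    (NodeState.INACTIVE, "app_restarted"): NodeState.ACTIVE,
    (NodeState.ACTIVE, "resources_abnormal"): NodeState.SICK,
    (NodeState.SICK, "resources_recovered"): NodeState.ACTIVE,
    (NodeState.SICK, "resources_exhausted"): NodeState.DEAD,
}


def next_state(state, event):
    """Return the next state; an event with no transition leaves the state unchanged."""
    return TRANSITIONS.get((state, event), state)


def healthy_degree(value, limit, worse_when_higher=True):
    """Fuzzify one parameter into [0, 1]: 1 = far from its limit, 0 = at or past it.

    The linear ramp is an assumed membership function, not the patent's.
    """
    ratio = value / limit if worse_when_higher else limit / max(value, 1e-9)
    return max(0.0, min(1.0, 1.0 - ratio))


def state_index(params, limits):
    """Combine the four key parameters into a crisp state index in [0, 1].

    Assumption: one conjunctive rule (the minimum of the degrees) stands in
    for the rule base, so a single nearly exhausted resource lowers the index.
    """
    degrees = [
        healthy_degree(params["processor_time"], limits["processor_time"]),
        healthy_degree(params["privileged_time"], limits["privileged_time"]),
        healthy_degree(params["pool_nonpaged_bytes"], limits["pool_nonpaged_bytes"]),
        healthy_degree(params["available_mbytes"], limits["available_mbytes"],
                       worse_when_higher=False),
    ]
    return min(degrees)


ABNORMAL_THRESHOLD = 0.3  # from the text: 0.3 to 1 is normal, 0 to 0.3 is abnormal


def is_abnormal(index):
    """Treat an index below 0.3 as abnormal (the boundary itself counts as normal here)."""
    return index < ABNORMAL_THRESHOLD


# Hypothetical limits and one hypothetical sample with a nearly full nonpaged pool.
limits = {"processor_time": 100.0, "privileged_time": 100.0,
          "pool_nonpaged_bytes": 256e6, "available_mbytes": 64.0}
sample = {"processor_time": 35.0, "privileged_time": 10.0,
          "pool_nonpaged_bytes": 230e6, "available_mbytes": 512.0}

idx = state_index(sample, limits)
state = NodeState.ACTIVE
if is_abnormal(idx):
    state = next_state(state, "resources_abnormal")
```

In this sample the nonpaged pool is the limiting resource, so the index falls below 0.3 and the node moves from the active state to the abnormal state, which is the condition under which the prediction module would be started.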

The estimated regression line is

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 x_i, \quad i = 1, 2, 3, \ldots, n \tag{1}$$

where

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}}{\sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2}, \qquad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}, \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \qquad \bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i .$$

The present invention applies the concept of a prediction interval. In Fig. 4, the second and third curves, $\hat{y}_{LB}$ and $\hat{y}_{UB}$, represent the lower and upper bounds of the estimate $\hat{y}$. Suppose the detection module detects that a node enters the abnormal state at time $t_s$; the mean time to failure (MTTF) can then be estimated as follows. Let $y_d$ denote the remaining resource at the moment the node enters the failed state. From Fig. 4, by applying Equation (1) and the inverse regression method used in calibration, the times $\hat{x}_d$, $x_{LB}$, and $x_{UB}$ are derived from the intersections of the line $y = y_d$ with $\hat{y}$, $\hat{y}_{LB}$, and $\hat{y}_{UB}$, respectively. The most likely MTTF value, the lower bound of the MTTF, and the upper bound of the MTTF can then be computed from these three intersection times and the detection time $t_s$.

In addition, the present invention provides an application method of the computer performance evaluator that integrates it with the application cluster service and is suitable for a group in a cluster environment containing a plurality of controlled nodes. The method combines the computer performance evaluator with the application cluster service and uses the computer performance evaluator to take over the function of the master node shown in Fig. 5A, so that nodes are no longer divided into master and slave roles.
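The MTTF estimate described above (the fitted line of Equation (1), its prediction bounds, and their intersections with the failed-state resource level) can be sketched as follows. This is a sketch under assumptions rather than the patent's exact procedure: the standard prediction-interval formula for a new observation is assumed for the two bounding curves, the Student-t critical value is passed in by the caller (2.306 is the two-sided 95% value for 8 degrees of freedom), the MTTF is assumed to be measured from the detection time, and the sample data are invented for illustration.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = b0 + b1 * x, as in Equation (1)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum(x * x for x in xs) - n * x_bar * x_bar
    sxy = sum(x * y for x, y in zip(xs, ys)) - n * x_bar * y_bar
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    # Residual standard error with n - 2 degrees of freedom.
    sse = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
    s = math.sqrt(sse / (n - 2))
    return b0, b1, s, x_bar, sxx, n


def crossing_time(curve, y_dead, start, max_doublings=60):
    """Find a time x >= start at which a falling curve reaches y_dead (bisection)."""
    lo, step = start, 1.0
    if curve(lo) <= y_dead:
        return lo
    hi = lo + step
    for _ in range(max_doublings):
        if curve(hi) <= y_dead:
            break
        step *= 2.0
        hi = lo + step
    else:
        raise ValueError("curve never reaches the failed-state level")
    # Invariant: curve(lo) > y_dead >= curve(hi).
    for _ in range(200):
        mid = (lo + hi) / 2.0
        if curve(mid) > y_dead:
            lo = mid
        else:
            hi = mid
    return hi


def predict_mttf(xs, ys, t_sick, y_dead, t_crit):
    """Return (most likely MTTF, lower bound, upper bound), measured from t_sick."""
    b0, b1, s, x_bar, sxx, n = fit_line(xs, ys)

    def center(x):
        return b0 + b1 * x

    def half_width(x):
        # Assumed prediction interval for a new observation at x.
        return t_crit * s * math.sqrt(1.0 + 1.0 / n + (x - x_bar) ** 2 / sxx)

    def lower(x):
        return center(x) - half_width(x)

    def upper(x):
        return center(x) + half_width(x)

    x_hat = crossing_time(center, y_dead, t_sick)
    x_lb = crossing_time(lower, y_dead, t_sick)
    x_ub = crossing_time(upper, y_dead, t_sick)
    return x_hat - t_sick, x_lb - t_sick, x_ub - t_sick


# Invented samples: remaining resource falling by about 50 units per time step.
xs = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
ys = [1000.0, 952.0, 899.0, 851.0, 800.0, 748.0, 702.0, 649.0, 601.0, 550.0]

mttf, mttf_lb, mttf_ub = predict_mttf(xs, ys, t_sick=9.0, y_dead=100.0, t_crit=2.306)
```

With these samples the fitted line reaches the failed-state level roughly nine time units after detection, and the lower curve reaches it slightly earlier than the upper curve, which is what gives the lower and upper MTTF bounds.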

Once every node in the cluster environment has installed and runs the data collection module of the computer performance evaluator, the heartbeat mechanism that the conventional application cluster service uses to detect node failure is no longer needed.

Referring to Fig. 5B, which is a schematic diagram of the three-tier system architecture integrating the computer performance evaluator of the present invention with the application cluster service, the computer performance evaluator 10 is installed on one of the controlled nodes 70-74 in the cluster environment (for example, node 70), and the data collection modules 21-24 are installed on the other controlled nodes (for example, nodes 71-74), respectively. The controlled nodes (for example, 71-74) send their system resource parameters through the data collection modules 21-24 to the computer performance evaluator 10 for analysis and judgment. The evaluator can thus determine whether the performance of any controlled node has become abnormal so that the node must be replaced, and the node can be replaced before it actually fails, which keeps the service uninterrupted.

The computer performance evaluator of the present invention is general-purpose and can be applied to the information systems of various high-tech industries, such as the semiconductor and TFT-LCD industries. For example, applying the integration of the computer performance evaluator and the application cluster service to the manufacturing execution system (MES) of a semiconductor fab can improve the availability and reliability of the manufacturing execution system.
The present invention is described below using the Equipment Manager (EM) and WIP Tracking (WT) modules of the manufacturing execution system of a semiconductor fab as an example.

Referring to Fig. 6, which is a schematic diagram of the architecture of a fault-tolerant manufacturing execution system that applies the computer performance evaluator of the present invention, the manufacturing execution system contains two application modules, the Equipment Manager (EM) and WIP Tracking (WT). The EM and WT modules are deployed together on two application servers 83 and 84: the EM1 module and the WT2 module reside on application server 83, and the EM2 module and the WT1 module reside on application server 84. The EM1 module is in service, and the EM2 module is its backup; the WT1 module is in service, and the WT2 module is its backup. The modules exchange information using the CORBA ORB as the underlying communication infrastructure.

The EM module manages several pieces of equipment 86 at the same time. Its main function is to monitor the processing behavior and status of the equipment 86, and its capabilities include equipment setup and remote operation, equipment dispatching, recipe management, collection of equipment status and data, and equipment error alarms. The main task of the WT module is to track and manage the work in process (WIP) of the whole fab: it must record how much WIP is currently being processed in the fab and whether each WIP lot is in the Track In or Track Out state.

Application server 83 also has the cluster service software APCS1 installed to monitor the state of the EM1 and WT2 modules; application server 84 has the cluster service software APCS2 installed to monitor the state of the EM2 and WT1 modules.
Because the reliability of the computer performance evaluator itself must approach 100%, the computer performance evaluator in this application example adopts a mirror architecture to improve reliability: two computer performance evaluators 12 and 14, which mirror each other, monitor the application servers 83 and 84. As shown in the dotted block of Fig. 6, application server 83 and application server 84 periodically report their computer performance measurements to the computer performance evaluator 12 or 14 (step 100), and the evaluator then determines from these measurements whether any application server has entered the abnormal state. If an application server is judged to have entered the abnormal state, the evaluator immediately notifies that server to execute the application failover mechanism (step 110). On receiving the notification from the computer performance evaluator 12 or 14, the abnormal application server shuts down its running application (for example, EM1 or WT1) and notifies the other application server to start the backup application (for example, EM2 or WT2).

The operation of the system shown in Fig. 6, which combines the computer performance evaluator (PEV) with the application cluster service (APCS), is described further below.

Under normal conditions, the EM1 module and the WT1 module run (that is, are in service) on application server 83 (node 1) and application server 84 (node 2), respectively. If APCS2 detects that the WT1 module has failed, APCS2 notifies APCS1 to invoke the WT2 module and so restore the service provided by the WT1 module.
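The report-and-failover exchange of steps 100 and 110 can be sketched as follows. This is a simplified illustration, not the system's actual implementation: the class `AppServer`, the function `handle_report`, the numeric state index, and the `threshold` value are hypothetical names and simplifications introduced here, and a plain threshold stands in for the evaluator's detection logic.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class AppServer:
    name: str
    active: set = field(default_factory=set)    # modules currently in service
    standby: set = field(default_factory=set)   # backup modules held ready
    peer: Optional["AppServer"] = field(default=None, repr=False, compare=False)

    def start_backup(self, module):
        """Start the standby module of the same kind (EM or WT) as `module`."""
        kind = module[:2]
        for candidate in sorted(self.standby):
            if candidate.startswith(kind):
                self.standby.remove(candidate)
                self.active.add(candidate)
                return candidate
        raise LookupError(f"{self.name} holds no backup for {module}")

    def fail_over(self):
        """Step 110: shut down local modules and have the peer start their backups."""
        moved = sorted(self.active)
        for module in moved:
            self.peer.start_backup(module)
        self.active.clear()
        return moved


def handle_report(server, state_index, threshold=0.8):
    """Step 100: evaluate a reported measurement; trigger failover if abnormal.

    `state_index` and `threshold` are placeholders for the evaluator's
    detection result (assumed here for illustration).
    """
    if state_index >= threshold:
        return server.fail_over()
    return []
```

With server 83 holding EM1 in service and WT2 on standby, and server 84 holding WT1 in service and EM2 on standby, an abnormal report from server 83 moves the EM service to EM2 on server 84 while WT1 keeps running there.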
In addition, if the computer performance evaluator 12 or 14 detects that node 1 is in the abnormal state and predicts the failure time of node 1, the evaluator notifies APCS2 so that the EM2 module takes over the service originally provided by node 1 (that is, the EM1 module) before node 1 fails completely. In this way, the Equipment Manager (EM) service remains continuously available.

As the preferred embodiment above shows, the present invention can detect, effectively and in time, computer failure caused by resource exhaustion, and it can predict the failure time of a computer. This helps the cluster service software decide on an appropriate timing and strategy for node failover, which in turn improves the reliability of the software system.

Although the present invention has been disclosed above by way of a preferred embodiment, the embodiment is not intended to limit the invention. Anyone skilled in the art may make various changes and modifications without departing from the spirit and scope of the invention, and the scope of protection of the invention is therefore defined by the appended claims.

[Brief Description of the Drawings]

For a more complete understanding of the present invention and its advantages, refer to the description above together with the following drawings, in which:

Fig. 1 is a schematic diagram of the system architecture of the computer performance evaluator of the present invention.
Fig. 2 is a schematic diagram of the states of a controlled node of the present invention.
Fig. 3 is a block diagram of the detection module of the present invention.
Fig. 4 illustrates the estimation of the predicted mean time to failure (MTTF) of the present invention.
Fig. 5A is a schematic diagram of the three-tier system architecture of a conventional application cluster service.
Fig. 5B is a schematic diagram of the three-tier system architecture integrating the computer performance evaluator of the present invention with the application cluster service.
Fig. 6 is a schematic diagram of the architecture of a fault-tolerant manufacturing execution system that applies the computer performance evaluator of the present invention.

[Description of Main Reference Numerals]

10 computer performance evaluator
21, 22, 23, 24 data collection module
30 detection module
31 key parameters
32 state index
33 fuzzifier
35 inference engine
37 fuzzy rule base
39 defuzzifier
50 prediction module
70, 71, 72, 73, 74 controlled node
81, 82 operating system application programming interface
83, 84 application server
91, 92 CORBA communication infrastructure
100 report computer performance measurements
110 execute the application failover mechanism

Claims (1)

X. Claims

1. A computer performance evaluator (Performance Evaluator; PEV) for detecting a performance abnormality, caused by resource exhaustion, of a computer located at at least one controlled node, the computer performance evaluator comprising at least:
at least one data collection module, connected to the at least one controlled node, for collecting a plurality of system resource parameters of the at least one controlled node;
a detection module, which evaluates the performance state of the at least one controlled node according to the system resource parameters and generates a state index for each of the at least one controlled node; and
a prediction module, wherein when the detection module determines, based on the state index, that the performance state of one of the at least one controlled node is an abnormal state, the prediction module is activated to perform a prediction step that predicts how long it will be before said one of the at least one controlled node fails due to resource exhaustion.

2. The computer performance evaluator of claim 1, wherein the performance state of each of the at least one controlled node comprises at least:
an initialization state, which each of the at least one controlled node enters when it has just started;
an active state, which is entered after an application of each of the at least one controlled node is launched;
an inactive state, which is entered when the application of each of the at least one controlled node is suspended, the active state being resumed when the application is restarted;
the abnormal state, which is entered from the active state when the available resources of each of the at least one controlled node become abnormal, the active state being resumed after the available resources of each of the at least one controlled node recover; and
a failed state, which is entered when the available resources of each of the at least one controlled node are exhausted.

3. The computer performance evaluator of claim 1, wherein the at least one data collection module is installed in the at least one controlled node, and the at least one data collection module collects the system resource parameters through at least one operating system application programming interface (Operating System API; OS API) of the at least one controlled node, respectively.

4. The computer performance evaluator of claim 1, wherein the system resource parameters are parameters related to the processor, memory, and virtual memory.

5. The computer performance evaluator of claim 1, wherein the detection module uses a fuzzy logic detection module to generate the state index, the fuzzy logic detection module comprising at least:
a screening device for grouping and screening the system resource parameters to select a plurality of key parameters;
a fuzzifier for assigning a weight value to each of the key parameters according to a fuzzy rule base, thereby producing a set of input data;
an inference engine, which infers from the set of input data whether the key parameters are all close to their thresholds, thereby producing an output value; and
a defuzzifier, which converts the inferred output value into the state index.

6. The computer performance evaluator of claim 5, wherein the key parameters are a processor time (Processor Time), a privileged time (Privileged Time), pool nonpaged bytes (Pool Nonpaged Bytes), and available memory (Available MBytes); the processor time and the privileged time are used to check whether processor performance has reached a bottleneck; the pool nonpaged bytes are used to determine whether there is a serious memory leak; and the available memory is used to determine whether the remaining main memory is insufficient.

7. The computer performance evaluator of claim 1, wherein the prediction step is based on a software aging phenomenon with a strictly decreasing (monotonic-decay) characteristic and adopts a mean time to failure (MTTF) prediction method, in which $t$ and $y$ on the corresponding curve denote time and the remaining system resources, respectively; when there are sufficient samples $(t_i, y_i)$, $i = 1, 2, \ldots, n$, and simple linear regression can be used to determine the relationship between $y$ and $t$, a simple linear model can be established, estimated as follows:

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 t_i, \quad i = 1, 2, 3, \ldots, n \qquad (1)$$

where

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(t_i - \bar{t})(y_i - \bar{y})}{\sum_{i=1}^{n}(t_i - \bar{t})^2}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{t}, \quad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \quad \text{and} \quad \bar{t} = \frac{1}{n}\sum_{i=1}^{n} t_i;$$

wherein when the detection module detects that said one of the at least one controlled node enters the abnormal state at time $t_s$, the most likely MTTF value $T_{MTTF}$, the lower bound of the MTTF $T_{LB}$, and the upper bound of the MTTF $T_{UB}$ can be derived from the following formulas:

$$T_{MTTF} = t_{D\_MTTF} - t_s, \quad T_{LB} = t_{D\_LB} - t_s, \quad T_{UB} = t_{D\_UB} - t_s$$

where $t_{D\_MTTF}$, $t_{D\_LB}$, and $t_{D\_UB}$ can be derived from the intersections of $y_D$ with $\hat{y}$, of $y_D$ with $\hat{y}_{LB}$, and of $y_D$ with $\hat{y}_{UB}$, respectively; $\hat{y}_{LB}$ and $\hat{y}_{UB}$ represent the lower bound and the upper bound of the estimate $\hat{y}$, respectively; and $y_D$ denotes the remaining resources on entering the failed state.

8. An application method of a computer performance evaluator for integrating an application cluster service (Application Cluster Service), applicable to a group of a cluster environment, wherein the cluster environment has a plurality of controlled nodes, the application method comprising at least:
providing a computer performance evaluator, wherein the computer performance evaluator comprises at least a plurality of data collection modules, a detection module, and a prediction module;
installing the data collection modules in the controlled nodes, respectively;
collecting, by the data collection modules, a plurality of system resource parameters of the controlled nodes, respectively, and transmitting the system resource parameters to the detection module for analysis and judgment, a state index being generated for each of the controlled nodes; and
activating the prediction module to perform a prediction step, wherein when the detection module determines, based on the state index, that the performance state of one of the controlled nodes is an abnormal state, the prediction module is activated to perform the prediction step that predicts how long it will be before said one of the controlled nodes fails due to resource exhaustion, so that node replacement can be carried out in time.

9. The application method of the computer performance evaluator of claim 8, wherein the performance state of each of the controlled nodes comprises at least:
an initialization state, which each of the controlled nodes enters when it has just started;
an active state, which is entered after an application of each of the controlled nodes is launched;
an inactive state, which is entered when the application of each of the controlled nodes is suspended, the active state being resumed when the application is restarted;
the abnormal state, which is entered from the active state when the available resources of each of the controlled nodes become abnormal, the active state being resumed after the available resources of each of the controlled nodes recover; and
a failed state, which is entered when the available resources of each of the controlled nodes are exhausted.

10. The application method of the computer performance evaluator of claim 8, wherein the system resource parameters are parameters related to the processor, memory, and virtual memory.

11. The application method of the computer performance evaluator of claim 8, wherein the detection module uses a fuzzy logic detection module to generate the state index, the fuzzy logic detection module comprising at least:
a screening device for grouping and screening the system resource parameters to select a plurality of key parameters;
a fuzzifier for assigning a weight value to each of the key parameters according to a fuzzy rule base, thereby producing a set of input data;
an inference engine, which infers from the set of input data whether the key parameters are all close to their thresholds, thereby producing an output value; and
a defuzzifier, which converts the inferred output value into the state index.

12. The application method of the computer performance evaluator of claim 11, wherein the key parameters are a processor time (Processor Time), a privileged time (Privileged Time), pool nonpaged bytes (Pool Nonpaged Bytes), and available memory (Available MBytes); the processor time and the privileged time are used to check whether processor performance has reached a bottleneck; the pool nonpaged bytes are used to determine whether there is a serious memory leak; and the available memory is used to determine whether the remaining main memory is insufficient.

13. The application method of the computer performance evaluator of claim 8, wherein the prediction step is based on a software aging phenomenon with a strictly decreasing characteristic and adopts a mean time to failure (MTTF) prediction method, in which $t$ and $y$ on the corresponding curve denote time and the remaining system resources, respectively; when there are sufficient samples $(t_i, y_i)$, $i = 1, 2, \ldots, n$, and simple linear regression can be used to determine the relationship between $y$ and $t$, a simple linear model can be established, estimated as follows:

$$\hat{y}_i = \hat{\beta}_0 + \hat{\beta}_1 t_i, \quad i = 1, 2, 3, \ldots, n \qquad (1)$$

where

$$\hat{\beta}_1 = \frac{\sum_{i=1}^{n}(t_i - \bar{t})(y_i - \bar{y})}{\sum_{i=1}^{n}(t_i - \bar{t})^2}, \quad \hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{t}, \quad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i, \quad \text{and} \quad \bar{t} = \frac{1}{n}\sum_{i=1}^{n} t_i;$$

wherein when the detection module detects that said one of the controlled nodes enters the abnormal state at time $t_s$, the most likely MTTF value $T_{MTTF}$, the lower bound of the MTTF $T_{LB}$, and the upper bound of the MTTF $T_{UB}$ can be derived from the following formulas:

$$T_{MTTF} = t_{D\_MTTF} - t_s, \quad T_{LB} = t_{D\_LB} - t_s, \quad T_{UB} = t_{D\_UB} - t_s$$

where $t_{D\_MTTF}$, $t_{D\_LB}$, and $t_{D\_UB}$ can be derived from the intersections of $y_D$ with $\hat{y}$, of $y_D$ with $\hat{y}_{LB}$, and of $y_D$ with $\hat{y}_{UB}$, respectively; $\hat{y}_{LB}$ and $\hat{y}_{UB}$ represent the lower bound and the upper bound of the estimate $\hat{y}$, respectively; and $y_D$ denotes the remaining resources on entering the failed state.
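The five node states recited in the claims (initialization, active, inactive, abnormal, failed) form a small state machine, which can be sketched as a transition table. This is an illustrative reading only: the event names are hypothetical, and the sketch assumes the failed state is reached from the abnormal state, which the claim wording does not state explicitly.

```python
from enum import Enum


class NodeState(Enum):
    INITIALIZATION = "initialization"
    ACTIVE = "active"
    INACTIVE = "inactive"
    ABNORMAL = "abnormal"
    FAILED = "failed"


# (current state, event) -> next state; event names are hypothetical labels.
TRANSITIONS = {
    (NodeState.INITIALIZATION, "app_started"): NodeState.ACTIVE,
    (NodeState.ACTIVE, "app_suspended"): NodeState.INACTIVE,
    (NodeState.INACTIVE, "app_restarted"): NodeState.ACTIVE,
    (NodeState.ACTIVE, "resources_abnormal"): NodeState.ABNORMAL,
    (NodeState.ABNORMAL, "resources_recovered"): NodeState.ACTIVE,
    # Assumption: exhaustion is only modeled from the abnormal state.
    (NodeState.ABNORMAL, "resources_exhausted"): NodeState.FAILED,
}


def next_state(state, event):
    """Return the state after `event`; events with no entry leave the state unchanged."""
    return TRANSITIONS.get((state, event), state)
```

A node that starts, degrades, and runs out of resources would step through `INITIALIZATION`, `ACTIVE`, `ABNORMAL`, and `FAILED`; the prediction step is triggered on entry to `ABNORMAL`.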
TW94105691A 2005-02-24 2005-02-24 Computer performance evaluator and application method thereof TWI292091B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
TW94105691A TWI292091B (en) 2005-02-24 2005-02-24 Computer performance evaluator and application method thereof

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
TW94105691A TWI292091B (en) 2005-02-24 2005-02-24 Computer performance evaluator and application method thereof

Publications (2)

Publication Number Publication Date
TW200630794A TW200630794A (en) 2006-09-01
TWI292091B true TWI292091B (en) 2008-01-01

Family

ID=45067413

Family Applications (1)

Application Number Title Priority Date Filing Date
TW94105691A TWI292091B (en) 2005-02-24 2005-02-24 Computer performance evaluator and application method thereof

Country Status (1)

Country Link
TW (1) TWI292091B (en)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI421720B (en) * 2010-09-29 2014-01-01 Univ Nat Sun Yat Sen A performance evaluation apparatus for an electromagnetic stirrer and a method thereof

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI476549B (en) * 2013-07-15 2015-03-11 China Steel Corp Method and apparatus for identifying a process model
TWI497287B (en) * 2014-01-22 2015-08-21 Chunghwa Telecom Co Ltd Monitoring Method and Design Method of Joint Information System
TWI669606B (en) * 2017-11-20 2019-08-21 財團法人資訊工業策進會 Diagnostic method for machine and diagnostic system thereof

Also Published As

Publication number Publication date
TW200630794A (en) 2006-09-01

Similar Documents

Publication Publication Date Title
Di Martino et al. Measuring and understanding extreme-scale application resilience: A field study of 5,000,000 HPC application runs
CN109933452B (en) Micro-service intelligent monitoring method facing abnormal propagation
Vaidyanathan et al. A comprehensive model for software rejuvenation
Sahoo et al. Failure data analysis of a large-scale heterogeneous server environment
JP5186211B2 (en) Health monitoring technology and application server control
CN102693177B (en) Fault diagnosing and processing methods of virtual machine as well as device and system thereof
US9600394B2 (en) Stateful detection of anomalous events in virtual machines
Castelli et al. Proactive management of software aging
US8364460B2 (en) Systems and methods for analyzing performance of virtual environments
US7194445B2 (en) Adaptive problem determination and recovery in a computer system
US9720823B2 (en) Free memory trending for detecting out-of-memory events in virtual machines
US10248561B2 (en) Stateless detection of out-of-memory events in virtual machines
US10489232B1 (en) Data center diagnostic information
US8656009B2 (en) Indicating an impact of a change in state of a node
US20130226526A1 (en) Automated Performance Data Management and Collection
CN106445781A (en) Message-transmission based detection system for automatic monitoring of HPC large-scale concurrent program exception and hardware-hardware cause judgment
US20050049901A1 (en) Methods and systems for model-based management using abstract models
CN102662788A (en) Computer system fault diagnosis decision and processing method
US10474509B1 (en) Computing resource monitoring and alerting system
Zhou et al. Logsayer: Log pattern-driven cloud component anomaly diagnosis with machine learning
Khalid et al. Survey of frameworks, architectures and techniques in autonomic computing
TWI292091B (en) Computer performance evaluator and application method thereof
Liu et al. A large-scale study of failures on petascale supercomputers
Li et al. Constructing large-scale real-world benchmark datasets for aiops
Di Martino et al. Measuring the resiliency of extreme-scale computing environments