TW201933142A - Frequent time-gap sequential pattern mining system and method - Google Patents

Frequent time-gap sequential pattern mining system and method

Publication number
TW201933142A
TW201933142A TW107102041A TW107102041A TW201933142A TW 201933142 A TW201933142 A TW 201933142A TW 107102041 A TW107102041 A TW 107102041A TW 107102041 A TW107102041 A TW 107102041A TW 201933142 A TW201933142 A TW 201933142A
Authority
TW
Taiwan
Prior art keywords
style
candidate
pattern
frequent
time interval
Prior art date
Application number
TW107102041A
Other languages
Chinese (zh)
Other versions
TWI668580B (en
Inventor
楊富丞
呂栢頤
Original Assignee
中華電信股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 中華電信股份有限公司
Priority to TW107102041A
Application granted
Publication of TWI668580B
Publication of TW201933142A

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

A frequent time-gap sequential pattern mining system and method are disclosed. The system includes a time-gap sequence database, a pseudo projection database construction module, a candidate pattern generation module and a candidate pattern pruning module. The time-gap sequence database stores time-gap sequences. The pseudo projection database construction module builds a pseudo projection database of candidate patterns from the time-gap sequence database and calculates the support of the candidate patterns. The candidate pattern generation module finds all candidate patterns of length L in the pseudo projection database, derives from them all candidate patterns of length L+1, and updates the support of the length-(L+1) candidate patterns, which are gathered into a set. The candidate pattern pruning module applies a pruning algorithm to prune or eliminate length-(L+1) candidate patterns in the set, thereby improving mining efficiency.

Description

Frequent time-gap sequential pattern mining system and method

The present invention relates to time-gap sequential pattern mining, and in particular to a frequent time-gap sequential pattern mining system and method.

Much real-world data is ordered in time, so the analysis of sequential data is an important topic with broad scientific and commercial applications. The items in a sequence represent different things depending on the application or context, and are usually phenomena of interest. Each occurrence of an item comes with a time point, and the records of multiple events, arranged in order of occurrence, form a sequence.

For example, if each purchase made by a customer on an e-commerce website is treated as an item, one may learn that a customer first bought product A and product B together and then bought product C and product D on two later days. This sequence can be written as {(A,B),C,D}. If the time gap between the events is also recorded, the result is called a time-gap sequence.

Continuing the example, restoring the time-gap information of the sequence {(A,B),C,D} yields a new sequence {A,0 B,10 C,30 D}: the customer bought products A and B on the same day, bought product C ten days later, and bought product D a month after that. The numbers may be in any time unit.

Therefore, if many customers share this time-gap sequence, it reveals a temporal correlation among products A, B and C, which can guide when to recommend product C to customers who have already bought products A and B. Mining frequent time-gap sequential patterns from a time-gap sequence database thus helps many application domains explore the temporal relationships between items and gives users finer-grained information for effective decisions.

However, mining all frequent patterns may reduce mining efficiency and slow the interpretation of business meaning. The data mining field therefore commonly uses maximal frequent patterns and closed frequent patterns to avoid generating too many candidate or frequent patterns. A pattern with no frequent super-pattern (superset) is called a maximal frequent pattern, and a pattern with no super-pattern of the same support is called a closed frequent pattern. Time-gap sequential patterns, however, have a distinctive property: a pattern with shorter time gaps carries fuller and more detailed business meaning than one with longer time gaps. Conventional pattern pruning methods are therefore unsuitable.

In summary, most conventional techniques consider only the order of items in a sequence and apply it to data from different domains to derive different analysis methods, while ignoring time-gap information and its properties. Conventional techniques also require more memory, mine less efficiently, and may lose important fine-grained information.

How to solve these problems of the conventional techniques has therefore become a major issue for those skilled in the art.

The present invention provides a frequent time-gap sequential pattern mining system and method that can solve one or more of the above problems of the conventional techniques.

The frequent time-gap sequential pattern mining system of the present invention includes a time-gap sequence database, a pseudo projection database construction module, a candidate pattern generation module and a candidate pattern pruning module. The time-gap sequence database stores time-gap sequences. The pseudo projection database construction module builds a pseudo projection database of candidate patterns from the time-gap sequence database and calculates the support of the candidate patterns. The candidate pattern generation module finds all candidate patterns of length L in the pseudo projection database, derives from them all candidate patterns of length L+1, updates the support of the length-(L+1) candidate patterns, and gathers them into a set, where L is a positive integer equal to or greater than 1. The candidate pattern pruning module applies a pruning algorithm to prune or delete length-(L+1) candidate patterns in the set.

The frequent time-gap sequential pattern mining method of the present invention includes: searching the time-gap sequence database for all frequent patterns of length L and gathering them into a set, where L is a positive integer equal to or greater than 1; building a pseudo projection database for every frequent pattern in the set; searching the pseudo projection databases for all candidate patterns of length L+1 and calculating their support; updating the support of the length-(L+1) candidate patterns to update the set; and applying a pruning algorithm to prune or delete length-(L+1) candidate patterns in the set.

To make the above features and advantages of the present invention more readily understood, embodiments are described in detail below with reference to the accompanying drawings. Additional features and advantages of the invention are set forth in part in the description that follows, and in part will be apparent from the description or may be learned by practice of the invention. The features and advantages of the invention are realized and attained by means of the elements and combinations particularly pointed out in the claims. It is to be understood that both the foregoing general description and the following detailed description are exemplary and explanatory only and are not intended to restrict the scope of the invention as claimed.

1: frequent time-gap sequential pattern mining system

10: time-gap sequence database

11: time-gap sequence

20: pseudo projection database construction module

21: pseudo projection database

30: candidate pattern generation module

40: candidate pattern pruning module

41: first pruning algorithm

42: second pruning algorithm

50: frequent sequence pattern

S1 to S6: steps

Fig. 1 is a schematic architecture diagram of the frequent time-gap sequential pattern mining system of the present invention.
Fig. 2 is a schematic flowchart of the frequent time-gap sequential pattern mining method of the present invention.
Fig. 3A is a schematic diagram of the time-gap sequence database.
Fig. 3B shows the support of the items in the set U_{L=1}.
Fig. 3C shows the pseudo projection databases with the prefix pattern 〈A〉 and their support.
Fig. 3D shows the 3-pattern pseudo projection databases with the prefix pattern 〈A〉 and their support.
Fig. 3E shows the pseudo projection databases of the frequent 2-patterns and their support.
Fig. 3F shows the pseudo projection databases of the frequent 3-patterns and their support.
Fig. 3G shows the pseudo projection databases of the frequent 4-patterns and their support.
Fig. 4A shows the pseudocode of step S3 in Fig. 2.
Fig. 4B shows the pseudocode of step S4 in Fig. 2.

The following describes implementations of the present invention by way of specific embodiments. Those skilled in the art can readily understand other advantages and effects of the invention from this specification, and the invention can also be practiced or applied through other embodiments.

Unlike the maximal or closed frequent patterns of the conventional techniques, a pattern is called a minimal frequent time-gap sequential pattern if no other pattern has the same item sequence with any time gap smaller. The present invention proposes a mining algorithm for such minimal frequent time-gap sequential patterns. The invention also proposes a generalized notation for time-gap sequences that records both the order of the items and the length of the time gaps, together with an efficient mining algorithm that finds all frequent time-gap sequential patterns in a time-gap sequence database.

In the following, a time-gap sequence is simply called a pattern. The basic definitions come first. $E = \{e_1, e_2, \ldots, e_z\}$ is the set of item types, with $z$ item types in total.

Definition 1: Sequence database

A time-gap sequence database is $TSDB = \{s_1, s_2, \ldots, s_N\}$. A time-gap sequence is $s = \langle e_1,\ t_{1,2}\, e_2,\ \ldots,\ t_{j-1,j}\, e_j \rangle$, where $t_{i-1,i}$ is the time gap between item $e_{i-1}$ and item $e_i$, $1 \le i-1 < i \le j$, and $e_i \in E$. For example, s = 〈A, 2 B, 3 C〉 means that B occurs two time units after A, and C occurs three time units after B.

Definition 2: Pattern length

A time-gap sequence s that contains j items has length L = j and is called a j-pattern. The sequence s in the example of Definition 1 has length 3 and is a 3-pattern.
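Definitions 1 and 2 can be made concrete with a small sketch. It assumes a representation that is not in the source: a time-gap sequence is a list of (gap, item) pairs, and the first item is given a gap of 0. The function names are hypothetical.

```python
from itertools import accumulate

# 〈A, 2 B, 3 C〉 as (gap, item) pairs; giving the first item gap 0 is an assumption.
s = [(0, "A"), (2, "B"), (3, "C")]

def pattern_length(seq):
    """Length L of a pattern: the number of items it contains (Definition 2)."""
    return len(seq)

def occurrence_times(seq):
    """Turn gaps between consecutive items into times measured from the first item."""
    times = accumulate(gap for gap, _ in seq)
    return [(t, item) for t, (_, item) in zip(times, seq)]
```

With this representation, `pattern_length(s)` is 3, and `occurrence_times(s)` places B at time 2 and C at time 5.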

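Definition 3: Support and suffix

Given two time-gap sequences $s_1 = \langle e'_1,\ t'_{1,2}\, e'_2,\ \ldots,\ t'_{f-1,f}\, e'_f \rangle$ and $s_2 = \langle e_1,\ t_{1,2}\, e_2,\ \ldots,\ t_{g-1,g}\, e_g \rangle$, if there are indices $1 \le m_1 < m_2 < \cdots < m_f \le g$ such that every item of $s_1$ equals the item of $s_2$ at the corresponding index and every time gap of $s_1$ is no smaller than the accumulated time gap between the matched items of $s_2$, then $s_1$ supports $s_2$. The suffix is the part of $s_2$ after position $m_f$, $\langle t_{m_f,m_f+1}\, e_{m_f+1},\ \ldots,\ t_{g-1,g}\, e_g \rangle$; when $m_f = g$ the suffix is empty. $s_1$ is the prefix of $s_2$. For example, given the prefix $s_1$ = 〈A, 5 C〉 and $s_2$ = 〈A, 2 B, 2 C, 1 D, 0 E〉, $s_1$ supports $s_2$ and the suffix is 〈1 D, 0 E〉.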
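The support relation of Definition 3 can be sketched as a small search. This is an illustration under assumptions, not the patent's own procedure: sequences are lists of (gap, item) pairs with gap 0 on the first item, a pattern gap is read as "at most" the accumulated gap between the matched items, and the function name `supports` is hypothetical.

```python
from itertools import accumulate

def supports(pattern, seq):
    """Return the suffix of seq after the first match of pattern, or None if unsupported."""
    times = list(accumulate(gap for gap, _ in seq))  # time of each item from the start
    items = [item for _, item in seq]

    def extend(k, prev):
        # k: next pattern position to match; prev: index in seq of the previous match
        if k == len(pattern):
            return prev
        max_gap, wanted = pattern[k]
        for i in range(prev + 1, len(seq)):
            if items[i] == wanted and times[i] - times[prev] <= max_gap:
                end = extend(k + 1, i)
                if end is not None:
                    return end
        return None

    for start, item in enumerate(items):
        if item == pattern[0][1]:
            end = extend(1, start)
            if end is not None:
                return seq[end + 1:]
    return None
```

For the example in Definition 3, `supports([(0, "A"), (5, "C")], [(0, "A"), (2, "B"), (2, "C"), (1, "D"), (0, "E")])` yields the suffix 〈1 D, 0 E〉, because the accumulated gap from A to C is 4, which is within 5. An empty list means the pattern matched up to the last item, so callers should test the result against `None` rather than for truthiness.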

Definition 4: Suffix gap accumulation pattern and space

Continuing Definition 3, the suffix gap accumulation sequence pattern is $\langle t_{m_f,m_f+1}\, e_{m_f+1},\ at_{m_f,m_f+2}\, e_{m_f+2},\ \ldots,\ at_{m_f,g}\, e_g \rangle$, where $at_{m_f,k}$ is the time gap accumulated from item $e_{m_f}$ to item $e_k$. Its space is the set of all (time gap, item) pairs in that pattern. Continuing the example above, the suffix 〈1 D, 0 E〉 gives the suffix gap accumulation pattern 〈1 D, 1 E〉, whose space is {1 D, 1 E}.
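The accumulation in Definition 4 is a running sum over the suffix gaps. The sketch below assumes the suffix is a list of (gap, item) pairs; the function names are hypothetical.

```python
from itertools import accumulate

def accumulated_suffix(suffix):
    """Replace each gap by the gap accumulated from the last matched prefix item."""
    gaps = accumulate(gap for gap, _ in suffix)
    return [(total, item) for total, (_, item) in zip(gaps, suffix)]

def accumulation_space(suffix):
    """The space: the set of all (accumulated gap, item) pairs."""
    return set(accumulated_suffix(suffix))
```

Applied to the suffix 〈1 D, 0 E〉 from Definition 3, `accumulated_suffix` gives 〈1 D, 1 E〉, since E occurs 1 + 0 time units after the last matched item.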

Definition 5: Pseudo projection database

The pseudo projection database of a pattern s is pdb(s) = {{SID_1, Index_1}, ..., {SID_M, Index_M}}. In each pair, SID is the number of a sequence in the database that is supported by pattern s, and Index is the number of the last item in that sequence that is supported by s. The first item of a sequence is numbered 1, the second item 2, and so on. For example, for s = 〈A, 1 C〉, the pseudo projection database is pdb(s) = {{2,2},{4,4}}; see Fig. 3D.
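A minimal sketch of Definition 5 for a 1-pattern follows. The toy database is invented for illustration (it is not the database of Fig. 3A), the pseudo projection database is held as a dict from SID to Index, and recording the first occurrence of the item in each sequence is an assumption.

```python
def build_pdb_for_item(tsdb, item):
    """pdb of the 1-pattern <item>: {SID: Index}, both numbered from 1."""
    pdb = {}
    for sid, seq in enumerate(tsdb, start=1):
        for index, (_, current) in enumerate(seq, start=1):
            if current == item:
                pdb[sid] = index  # keep the first occurrence only (assumption)
                break
    return pdb

# Hypothetical toy database of (gap, item) sequences.
toy_tsdb = [
    [(0, "A"), (2, "C"), (1, "B")],
    [(0, "B"), (1, "A"), (1, "C")],
    [(0, "C"), (3, "B")],
]
```

Only positions are stored, not copies of the suffixes, which is what keeps a pseudo projection small. The number of entries in the dict is the support of the pattern in the sense of Definition 6.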

Definition 6: Pattern support

The support of a pattern s is the number of sequences in the database that are supported by s, written sup(s).

Definition 7: Frequent pattern

The support of a frequent pattern s is not lower than the minimum support threshold (min_sup).

Fig. 1 is a schematic architecture diagram of the frequent time-gap sequential pattern mining system 1 of the present invention. As shown, the system 1 includes a time-gap sequence database 10, a pseudo projection database construction module 20, a candidate pattern generation module 30 and a candidate pattern pruning module 40.

The time-gap sequence database 10 stores time-gap sequences 11. The pseudo projection database construction module 20 builds the pseudo projection database 21 of candidate patterns from the time-gap sequence database 10 and calculates the support of the candidate patterns. The candidate pattern generation module 30 finds all candidate patterns of length L in the pseudo projection database 21, derives from them all candidate patterns of length L+1, updates the support of the length-(L+1) candidate patterns, and gathers them into a set, where L is a positive integer equal to or greater than 1. The candidate pattern pruning module 40 prunes or deletes length-(L+1) candidate patterns in the set using, for example, a first pruning algorithm 41 or a second pruning algorithm 42. The pseudo projection database 21 can store the position indices of the candidate patterns.

For example, the pseudo projection database construction module 20 can search the time-gap sequence database 10 for all frequent patterns of length L (for example L = 1, the 1-patterns), gather them into a set (called U_{L=1}), and build a pseudo projection database 21 for each frequent pattern of length L = 1 in U_{L=1}. The candidate pattern generation module 30 can search the pseudo projection database 21 for all candidate patterns of length L = 2 (the 2-patterns), calculate and then update their support, and gather them into a set (for example U_{L=2}). The candidate pattern pruning module 40 prunes or deletes the candidate patterns of length L = 2 in U_{L=2} using the first pruning algorithm 41 or the second pruning algorithm 42.

The first pruning algorithm 41 can be, for example: if the set contains both a first pattern P and a second pattern P′, the pseudo projection database pdb(P) of the first pattern equals the pseudo projection database pdb(P′) of the second pattern, and P supports P′, then P is deleted. The second pruning algorithm 42 can be, for example: if the set contains both a first pattern P and a second pattern P′, P supports P′, and sup(P) − sup(P′) plus the maximum support in the pair set SP_{P′} is below the minimum support threshold (min_sup), then P is deleted.

The candidate pattern pruning module 40 can delete non-minimal frequent time-gap sequential patterns according to the properties of time gaps. Among all sequences with the same items in the same order, a minimal frequent time-gap sequential pattern is a sequence whose time gap between two items is smaller than the time gap between the same two items in the other sequences.

The candidate pattern pruning module 40 can determine whether frequent patterns of length L+1 exist in the set. If so, a pseudo projection database 21 is built recursively for every frequent pattern in the set, and the candidate pattern generation module 30 then updates the support of the length-(L+1) candidate patterns to update the set.

For example, the candidate patterns of length L = 2 in U_{L=2} whose support is not less than the minimum support threshold are the frequent patterns of length L = 2. When U_{L=2} contains frequent patterns of length L = 2, the candidate pattern pruning module 40 recursively builds a pseudo projection database 21 for each frequent pattern of length L+1 in U_{L+1}, and outputs as frequent sequence patterns 50 all candidate sequence patterns in U_{L+1} whose support is not less than the minimum support threshold.

Fig. 2 is a schematic flowchart of the frequent time-gap sequential pattern mining method of the present invention; it is described together with Fig. 1 above.

As shown in Fig. 2, in step S1, the time-gap sequence database 10 is searched for all frequent patterns of length L (for example L = 1), which are gathered into a set (for example U_{L=1}), where L is a positive integer equal to or greater than 1 (L ≥ 1).

In step S2, a pseudo projection database 21 is built for every frequent pattern in the set. The pseudo projection database 21 can store the position indices of the candidate patterns.

In step S3, the pseudo projection database 21 is searched for all candidate patterns of length L+1, and their support is calculated.

In step S4, the support of the length-(L+1) candidate patterns is updated to update the set (for example U_{L+1}).

In step S5, the length-(L+1) candidate patterns in the set (for example U_{L+1}) are pruned or deleted using the first pruning algorithm 41 or the second pruning algorithm 42.

For example, the first pruning algorithm 41 is: if the set contains both pattern P and pattern P′, pdb(P) equals pdb(P′), and P supports P′, then P is deleted. The second pruning algorithm 42 is: if the set contains both pattern P and pattern P′, P supports P′, and sup(P) − sup(P′) plus the maximum support in the pair set SP_{P′} is below the minimum support threshold, then P is deleted.
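The first pruning rule can be sketched as a filter over a set of patterns. This is an illustration under assumptions, not the patent's implementation: patterns are tuples of (gap, item) pairs, each maps to its pseudo projection database as a dict, and the support check is restricted to patterns with the same items in the same order, where "P supports P′" reduces to every gap of P′ being no larger than the matching gap of P. All names are hypothetical.

```python
def supports_same_items(p, q):
    """True if p supports q, for two distinct patterns with identical item order."""
    same_items = [item for _, item in p] == [item for _, item in q]
    return p != q and same_items and all(gq <= gp for (gp, _), (gq, _) in zip(p, q))

def prune_first(patterns):
    """Drop P when some P' in the set has pdb(P') == pdb(P) and P supports P'."""
    return {
        p: pdb
        for p, pdb in patterns.items()
        if not any(pdb == other_pdb and supports_same_items(p, q)
                   for q, other_pdb in patterns.items())
    }

# Hypothetical input: <A,3 C> and <A,2 C> project onto the same positions.
example = {
    ((0, "A"), (3, "C")): {1: 2, 2: 2},
    ((0, "A"), (2, "C")): {1: 2, 2: 2},
    ((0, "A"), (5, "B")): {1: 3},
}
```

On this input the filter removes 〈A, 3 C〉 and keeps 〈A, 2 C〉 and 〈A, 5 B〉: the pattern with the larger gap projects onto exactly the same positions, so growing it further cannot yield anything new.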

In step S6, it is determined whether frequent patterns of length L+1 exist in the set (for example U_{L+1}). If so, the method returns recursively to step S2; if not, it ends. In other words, if the set (for example U_{L+1}) is non-empty, steps S2 to S6 are repeated until the set is empty. The recursion can include building a pseudo projection database 21 for every frequent pattern in the set and then updating the support of the length-(L+1) candidate patterns to update the set.

In addition, the frequent time-gap sequential pattern mining method of the present invention can delete non-minimal frequent time-gap sequential patterns according to the properties of time gaps. Among all sequences with the same items in the same order, a minimal frequent time-gap sequential pattern is a sequence whose time gap between two items is smaller than the time gap between the same two items in the other sequences.

The method can remove from the set the candidate patterns whose support is below the minimum support threshold, and keeps generating all frequent patterns of length L+1 from the set of frequent patterns of length L until no longer frequent pattern is produced.
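The level-wise growth described above can be sketched end to end. This is a simplified illustration, not the pseudocode of Figs. 4A and 4B, which is not reproduced in the text: it omits both pruning algorithms, uses an iterative loop in place of recursion, treats sequences as lists of (gap, item) pairs, keeps only the first matching position per sequence, and treats a pattern as frequent when its support is at least `min_sup`. All names are hypothetical.

```python
from itertools import accumulate

def mine_frequent_patterns(tsdb, min_sup):
    """Level-wise sketch of steps S1 to S6, without the pruning algorithms."""
    # S1: every 1-pattern with its pdb {SID: Index} (first occurrence per sequence).
    level = {}
    for sid, seq in enumerate(tsdb, start=1):
        for index, (_, item) in enumerate(seq, start=1):
            level.setdefault(((0, item),), {}).setdefault(sid, index)
    level = {p: pdb for p, pdb in level.items() if len(pdb) >= min_sup}

    frequent = {}
    while level:  # S6: stop when no frequent pattern of the current length exists
        frequent.update({p: len(pdb) for p, pdb in level.items()})
        next_level = {}
        for prefix, pdb in level.items():
            # S2/S3: scan each projected suffix and accumulate its time gaps.
            cands = {}
            for sid, index in pdb.items():
                suffix = tsdb[sid - 1][index:]
                gaps = accumulate(gap for gap, _ in suffix)
                for pos, (gap, (_, item)) in enumerate(zip(gaps, suffix), start=index + 1):
                    cands.setdefault((gap, item), {}).setdefault(sid, pos)
            # S4: a gap means "at most", so fold in the pdbs of smaller gaps.
            for (gap, item), own in cands.items():
                merged = dict(own)
                for (gap2, item2), other in cands.items():
                    if item2 == item and gap2 < gap:
                        for sid, pos in other.items():
                            if sid not in merged or pos < merged[sid]:
                                merged[sid] = pos
                # S5 (support filter only): keep candidates that meet min_sup.
                if len(merged) >= min_sup:
                    next_level[prefix + ((gap, item),)] = merged
        level = next_level
    return frequent
```

As a usage example, with the two hypothetical sequences 〈A, 1 C, 2 D〉 and 〈A, 2 C, 2 D〉 and `min_sup = 2`, the sketch reports 〈A, 2 C〉 as frequent but not 〈A, 1 C〉, because only the "within two units" reading covers both sequences.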

Fig. 3A shows the time-gap sequence database. Fig. 3B shows the support of the items in the set U_{L=1}. Fig. 3C shows the pseudo projection databases with the prefix pattern 〈A〉 and their support. Fig. 3D shows the 3-pattern pseudo projection databases with the prefix pattern 〈A〉 and their support. Figs. 3E, 3F and 3G show the pseudo projection databases of the frequent 2-patterns, 3-patterns and 4-patterns, respectively, and their support. Fig. 4A shows the pseudocode of step S3 in Fig. 2, and Fig. 4B shows the pseudocode of step S4 in Fig. 2.

In detail, in step S1 of Fig. 2, all 1-patterns in the time-gap sequence database are listed first, and the support of each item, that is, the number of sequences in which the item occurs, is calculated according to Definition 3 above. Each 1-pattern whose support exceeds the minimum support threshold is added to the set U_{L=1}. Given the time-gap sequence database of Fig. 3A and assuming a minimum support threshold of 3 (that is, a support greater than 3), U_{L=1} contains items A, B, C and E, as shown in Fig. 3B.
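Step S1 amounts to counting, for each item, the number of sequences that contain it. The sketch below uses an invented toy database rather than the one in Fig. 3A, and it follows Definition 7 in treating a support that is not lower than `min_sup` as frequent; the function name is hypothetical.

```python
from collections import Counter

def frequent_items(tsdb, min_sup):
    """Support of each 1-pattern: the number of sequences containing the item."""
    counts = Counter()
    for seq in tsdb:
        # A set makes each sequence count once, even if the item repeats in it.
        counts.update({item for _, item in seq})
    return {item: sup for item, sup in counts.items() if sup >= min_sup}

# Hypothetical toy database of (gap, item) sequences.
toy_tsdb = [
    [(0, "A"), (2, "B"), (0, "A")],
    [(0, "A"), (1, "C")],
    [(0, "B"), (4, "A")],
    [(0, "D")],
]
```

Here item A occurs in three sequences and item B in two, so with `min_sup = 2` only A and B remain.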

In step S2 of Fig. 2, a pseudo projection database is built for each 1-pattern in U_{L=1} according to Definition 5. Taking the prefix 〈A〉 as an example, the result is shown in column 2 of Fig. 3C.

In step S3 of Fig. 2, the suffix gap accumulation space is searched under the pseudo projection database obtained, as shown in column 3 of Fig. 3C (the detailed steps are given in the pseudocode of Fig. 4A). The prefix is combined with each suffix in its gap accumulation space to form multiple candidate patterns of length L+1, whose pseudo projection databases are built and added to the set, as shown in columns 3 and 4 of Fig. 3C.
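Step S3 can be sketched as follows. This is an illustration under assumptions and not the pseudocode of Fig. 4A: the database is a list of (gap, item) sequences, a pseudo projection database is a dict from SID to Index (both numbered from 1), and when the same extension occurs more than once in a sequence the first position is kept. The function name is hypothetical.

```python
from itertools import accumulate

def extend_prefix(tsdb, prefix_pdb):
    """Map each (accumulated gap, item) extension of a prefix to its pdb {SID: Index}."""
    candidates = {}
    for sid, index in prefix_pdb.items():
        suffix = tsdb[sid - 1][index:]  # items after the last matched item
        gaps = accumulate(gap for gap, _ in suffix)
        for pos, (gap, (_, item)) in enumerate(zip(gaps, suffix), start=index + 1):
            candidates.setdefault((gap, item), {}).setdefault(sid, pos)
    return candidates

# Hypothetical toy database and the pdb of the prefix <A>.
toy_tsdb = [
    [(0, "A"), (2, "B"), (1, "C")],
    [(0, "A"), (1, "C")],
]
pdb_A = {1: 1, 2: 1}
```

Each key of the result, appended to the prefix, is one candidate of length L+1. At this stage a candidate's pdb holds only the sequences where that exact accumulated gap occurs.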

The pseudo projection databases obtained in step S3 of Fig. 2 only consider suffixes exactly as they appear in the database (both the time gap and the item in the suffix are identical). However, every time gap in a pattern means "less than or equal to", so step S4 of Fig. 2 updates the pseudo projection databases to account for the smaller gaps; see the pseudocode of Fig. 4B.

For example, as shown in column 4 of Fig. 3C, the pseudo projection database of 〈A, 5 B〉 is {{1,5}}: only sequence 1 contains the same order with exactly this time gap. But 〈A, 5 B〉 means that B occurs within five time units after A, which includes the pattern in which B occurs within three time units after A, so pdb(〈A, 3 B〉) should be added to pdb(〈A, 5 B〉).

Step S4 of Fig. 2 therefore handles each S in the suffix gap accumulation space as follows: if the space contains an S′ with the same item as S and a smaller time gap, the pseudo projection database of S′ is merged into that of S. For example, pdb(〈A, 3 C〉) should contain pdb(〈A, 2 C〉) and pdb(〈A, 1 C〉), and pdb(〈A, 2 C〉) should contain pdb(〈A, 1 C〉).

Specifically, for each pair {SID′, Index′} in each pdb(prefix + S′): if the sequence number SID′ does not appear in pdb(prefix + S), the pair {SID′, Index′} is added to pdb(prefix + S). If SID′ already appears in pdb(prefix + S) and its Index′ is smaller than the Index of the same SID in pdb(prefix + S), Index is updated to Index′. For example, as shown in column 5 of Fig. 3C, none of the SIDs in pdb(〈A, 1 C〉) appears in pdb(〈A, 2 C〉), so all of pdb(〈A, 1 C〉) is added to pdb(〈A, 2 C〉). The pairs with SID = 2, 3 and 4 in pdb(〈A, 2 C〉) do not appear in pdb(〈A, 3 C〉), so they are all added to pdb(〈A, 3 C〉); the pair with SID = 1 in pdb(〈A, 2 C〉) has the smaller Index, 2, so the Index of SID = 1 in pdb(〈A, 3 C〉) is updated to 2.
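The merge rule of step S4 can be sketched as below. It is an illustration, not the pseudocode of Fig. 4B. Pseudo projection databases are dicts from SID to Index, and candidates are keyed by (gap, item). The pdb of 〈A, 1 C〉 and the SID = 1 entry of 〈A, 2 C〉 follow the text; the Index values for SID = 3 in 〈A, 2 C〉 and for SID = 1 in 〈A, 3 C〉 before the update are invented, since the figure is not reproduced here.

```python
def merge_pdb(target, source):
    """Add missing SIDs from source; for shared SIDs keep the smaller Index."""
    merged = dict(target)
    for sid, index in source.items():
        if sid not in merged or index < merged[sid]:
            merged[sid] = index
    return merged

def update_with_smaller_gaps(candidates):
    """For each (gap, item), fold in the pdbs of the same item with a smaller gap."""
    updated = {}
    for (gap, item), pdb in candidates.items():
        merged = dict(pdb)
        for (gap2, item2), other in candidates.items():
            if item2 == item and gap2 < gap:
                merged = merge_pdb(merged, other)
        updated[(gap, item)] = merged
    return updated

# Extensions of the prefix <A>; values marked "assumed" are not from the source.
before = {
    (1, "C"): {2: 2, 4: 4},
    (2, "C"): {1: 2, 3: 3},   # Index for SID 3 is assumed
    (3, "C"): {1: 4},         # Index before the update is assumed
}
```

After the update, the entry for 〈A, 3 C〉 covers SIDs 1 to 4 with the Index of SID 1 lowered to 2, matching the worked example above. Merging from the original, not yet updated, pdbs of all smaller gaps gives the same result as chaining the merges one gap at a time.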

In step S5 of Fig. 2, the support of each candidate pattern of length L+1 is calculated from the updated pseudo projection databases. A length-(L+1) candidate pattern whose support exceeds the minimum support threshold is regarded as a frequent pattern of length L+1 and added to the set U_{L+1}; one whose support is below the minimum support threshold is deleted, as marked with × in column 3 of Fig. 3C.

For example, assuming a minimum support threshold of 2, the only frequent patterns in FIG. 3C are 〈A,3 B〉, 〈A,5 B〉, 〈A,1 C〉, 〈A,2 C〉 and 〈A,3 C〉; the rest are deleted. Because many frequent patterns still remain, the present invention proposes pruning algorithms that delete further combinations of frequent patterns. If frequent patterns remain in the set after pruning, the process returns to step S2 and starts again, until no frequent pattern is left in the set; the whole loop is executed recursively. Taking the prefix 〈A,2 C〉 of FIG. 3D as an example, no candidate pattern has a support exceeding the minimum support threshold, so mining of that prefix ends.
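The support filter of step S5 can be illustrated with a short Python sketch. It assumes that the support of a candidate is the number of distinct sequences (SIDs) in its pseudo projection database and that a candidate is frequent when its support is at least min_sup; both are assumptions made for illustration. The function names are hypothetical, the SID and index values are invented, and the candidate 〈A,4 E〉 is a made-up infrequent pattern added only to show a deletion.

```python
def support(pdb):
    """Support of a pattern: number of sequences in its pseudo projection database."""
    return len(pdb)


def split_by_support(candidates, min_sup):
    """Separate candidates into frequent patterns and deleted patterns."""
    frequent, deleted = {}, []
    for pattern, pdb in candidates.items():
        if support(pdb) >= min_sup:  # assumed reading of the threshold test
            frequent[pattern] = pdb
        else:
            deleted.append(pattern)
    return frequent, deleted


# Invented {SID: index} data; only the support counts matter here.
candidates = {
    "<A,3 B>": {1: 3, 4: 5},
    "<A,5 B>": {1: 3, 2: 6, 4: 5},
    "<A,1 C>": {2: 3, 4: 2},
    "<A,2 C>": {1: 2, 2: 3, 3: 3, 4: 2},
    "<A,3 C>": {1: 2, 2: 3, 3: 3, 4: 2},
    "<A,4 E>": {3: 5},  # hypothetical infrequent candidate
}
frequent, deleted = split_by_support(candidates, min_sup=2)
print(sorted(frequent), deleted)
```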

Step S5 of FIG. 2 uses a pruning algorithm, which deletes patterns from the set to reduce the number of patterns to be explored and thereby improves mining efficiency. The principle is as follows: given patterns P and P' where P supports P', if it can be determined that every frequent pattern grown from P as a prefix is the same as one grown from P', then P can be deleted. For example, if all frequent patterns grown from the prefixes 〈A,3 C〉 and 〈A,2 C〉 are the same, the pattern with the larger time gap is deleted, because it carries no additional information.

From the above definitions and observations, the following lemmas are obtained.

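Lemma 1: Given patterns P and P', if pdb(P) = pdb(P'), then the patterns grown from P as a prefix are the same as those grown from P' as a prefix (the consequent formula is lost in the source text and is restated here from the way the first pruning algorithm applies this lemma).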

Lemma 2: Given patterns P and P', if P supports P', then sup(P) ≥ sup(P').

Lemma 3: Given patterns P and P' where P supports P', if sup(P) - sup(P') + sup(P' + {a_t, e}) < min_sup, then every g ∈ SP_P with sup(g) ≥ min_sup also satisfies g ∈ SP_P'.

Based on these three lemmas, the present invention proposes the following two pruning algorithms.

First pruning algorithm 41 (see FIG. 1): if there are two patterns P and P' such that pdb(P) = pdb(P') and P supports P', delete P. By Lemma 1, every pattern grown from P as a prefix also appears among the patterns grown from P' as a prefix. In FIG. 3C, the interval accumulation space of prefix 〈A〉 contains two groups: 〈A,5 B〉 supports 〈A,3 B〉, and 〈A,3 C〉 supports 〈A,2 C〉. Because 〈A,3 C〉 and 〈A,2 C〉 have the same pdb, 〈A,3 C〉 is deleted; as FIG. 3D shows, 〈A,2 C〉 and 〈A,3 C〉 grow exactly the same patterns. Patterns deleted by the first pruning algorithm are marked with a dedicated symbol in the figures.
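A small Python sketch of the first pruning rule follows. It assumes that candidates under one prefix can be keyed by (time gap, item), that a pdb is a dictionary from SID to index, and that "P supports P'" means the two candidates share the same item and P has the larger time gap. The function name `prune_equal_pdb` is hypothetical and the pdb contents are invented.

```python
def prune_equal_pdb(candidates):
    """Drop every candidate whose pdb equals that of a same-item candidate
    with a smaller time gap (sketch of the first pruning rule)."""
    kept = {}
    for (gap, item), pdb in candidates.items():
        redundant = any(
            other_item == item and other_gap < gap and other_pdb == pdb
            for (other_gap, other_item), other_pdb in candidates.items()
        )
        if not redundant:
            kept[(gap, item)] = pdb
    return kept


# Invented pdbs under prefix <A>; <A,2 C> and <A,3 C> share one pdb.
shared = {1: 2, 2: 3, 3: 3, 4: 2}
candidates = {
    (3, "B"): {1: 3, 4: 5},
    (5, "B"): {1: 3, 2: 6, 4: 5},
    (1, "C"): {2: 3, 4: 2},
    (2, "C"): dict(shared),
    (3, "C"): dict(shared),
}
kept = prune_equal_pdb(candidates)
print(sorted(kept))  # (3, "C") is no longer present
```

The pairwise scan is quadratic in the number of candidates, which keeps the sketch short; grouping candidates by item first would avoid comparing unrelated patterns.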

Second pruning algorithm 42 (see FIG. 1): if there are patterns P and P' such that P supports P' and sup(P) - sup(P') + (the maximum support in SP_P') < min_sup, delete P. By Lemma 2, sup(P) - sup(P') ≥ 0. By Lemma 3, the frequent patterns obtained from P as a prefix are the same as those obtained from P' as a prefix, so P need not be grown any further.

For example, assume min_sup=2. In FIG. 3C, 〈A,5 B〉 supports 〈A,3 B〉, and sup(〈A,5 B〉) - sup(〈A,3 B〉) + argmax(sup(SP_〈A,3 B〉)) = 3 - 2 + 1 < 3, so 〈A,5 B〉, which has the larger time gap, is deleted; argmax(sup(SP_〈A,3 B〉)) is 1, as shown in row 1 of FIG. 4. Patterns deleted by the second pruning algorithm are marked with a dedicated symbol in the figures.
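The quantity tested by the second pruning rule can be computed as in the following Python sketch. The function names are hypothetical. The comparison value is passed in explicitly as `threshold`, and the usage line sets it to 3 only to mirror the comparison written in the example above; the list of suffix supports is an invented stand-in for SP_〈A,3 B〉, of which the text gives only the maximum, 1.

```python
def pruning_bound(sup_p, sup_p_prime, suffix_supports):
    """sup(P) - sup(P') + the largest support found in the suffix space of P'."""
    return sup_p - sup_p_prime + max(suffix_supports, default=0)


def should_prune(sup_p, sup_p_prime, suffix_supports, threshold):
    """True when the bound falls below the comparison value."""
    return pruning_bound(sup_p, sup_p_prime, suffix_supports) < threshold


# Numbers from the example: sup(<A,5 B>) = 3, sup(<A,3 B>) = 2, maximum suffix support = 1.
bound = pruning_bound(3, 2, [1])
print(bound, should_prune(3, 2, [1], threshold=3))
```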

In summary, an embodiment is described using the example database of FIG. 3A. In step S1 of FIG. 2, all frequent patterns of length L=1 are found in the time-gap sequence database. Assuming min_sup=2, the frequent patterns of length L=1 are 〈A〉, 〈B〉, 〈C〉 and 〈E〉, as shown in FIG. 3B.

Next, every frequent pattern of length L=1 is processed recursively through steps S2 to S5 of FIG. 2. Assume the order of execution is 〈A〉, 〈B〉, 〈C〉, 〈E〉. Starting with 〈A〉 as the prefix, steps S2 to S5 are repeated until the stop condition is met, and only then does processing move on to 〈B〉. The system and method stop once all frequent patterns of length L=1 have been processed.

In detail, with 〈A〉 as the prefix, step S2 of FIG. 2 yields its pseudo projection database (row 1, column 2 of FIG. 3E). Step S3 uses that database to find the suffix interval accumulation space (row 1, column 3 of FIG. 3E) and the corresponding pseudo projection databases (row 1, column 4 of FIG. 3E). Step S4 then updates these pseudo projection databases and computes the true supports. Finally, step S5 applies the first pruning algorithm, which deletes 〈A,3 C〉, and the second pruning algorithm, which deletes 〈A,5 B〉.

Then, with 〈B〉 as the prefix there is no frequent pattern of length L=2, so processing stops and moves on to prefix 〈C〉. The only candidate of length L=2 with prefix 〈C〉 is 〈C,2 B〉. Because its support exceeds the minimum support threshold, 〈C,2 B〉 is immediately used as the prefix to find candidates of length L=3, as shown in row 1 of FIG. 3F; because none of these candidates is frequent, processing stops and moves on to prefix 〈E〉. With 〈E〉 as the prefix, the frequent patterns of length L=2 are 〈E,2 A〉 and 〈E,4 C〉, so 〈E,2 A〉 is used as the prefix to find candidates of length L=3, as shown in row 2 of FIG. 3F. The support of 〈E,2 A,2 C〉 exceeds the minimum support threshold, so 〈E,2 A,2 C〉 is used as the prefix to find candidates of length L=4. As shown in FIG. 3G, no frequent pattern of length L=4 exists, so the whole mining process stops. The minimum frequent time-gap sequential patterns are therefore 〈A〉, 〈B〉, 〈C〉, 〈E〉, 〈A,3 B〉, 〈A,1 C〉, 〈A,2 C〉, 〈A,2 C,2 B〉, 〈C,2 B〉, 〈E,2 A〉, 〈E,4 C〉 and 〈E,2 A,2 C〉.
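The depth-first order of this walkthrough can be sketched in Python. This is only an illustration of the recursion: the `GROWTH` table stands in for steps S2 to S5 and is transcribed from the final pattern list above, with each longer pattern grouped under its prefix (the grouping is an assumption), and the names `GROWTH` and `mine` are hypothetical.

```python
# Frequent extensions that survive steps S2 to S5 for each prefix (assumed grouping).
GROWTH = {
    "<A>": ["<A,3 B>", "<A,1 C>", "<A,2 C>"],
    "<A,2 C>": ["<A,2 C,2 B>"],
    "<C>": ["<C,2 B>"],
    "<E>": ["<E,2 A>", "<E,4 C>"],
    "<E,2 A>": ["<E,2 A,2 C>"],
}


def mine(prefixes, grow):
    """Visit each prefix, then recurse into the patterns grown from it."""
    found = []
    for prefix in prefixes:
        found.append(prefix)
        # A prefix with no frequent extension ends its branch of the recursion.
        found.extend(mine(grow(prefix), grow))
    return found


patterns = mine(["<A>", "<B>", "<C>", "<E>"], lambda p: GROWTH.get(p, []))
print(len(patterns), patterns)
```

Each prefix is exhausted before the next one of the same length is started, which matches the order 〈A〉, 〈B〉, 〈C〉, 〈E〉 described above.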

In summary, the frequent time-gap sequential pattern mining system and method of the present invention can mine minimum frequent time-gap sequential patterns, which is a novel way of mining patterns. In addition, storing data in pseudo projection databases greatly reduces the memory required. The present invention also proposes two pruning algorithms based on the characteristics of time gaps, which shrink the search space of candidate patterns and increase mining efficiency.

Further, the advantages or technical effects of the present invention include at least: (1) the pseudo projection database data structure avoids the performance bottleneck that prior techniques incur through extensive data copying and updating; (2) pruning algorithms derived from the observed characteristics of time gaps and from the recursive architecture greatly reduce the pattern growth space and improve mining efficiency; and (3) finer-grained time-gap information is presented.

The above embodiments merely illustrate the principles, features and effects of the present invention and are not intended to limit its practicable scope. Anyone skilled in the art may modify and change the above embodiments without departing from the spirit and scope of the present invention. Any equivalent changes and modifications made using the disclosure of the present invention shall still be covered by the claims. The scope of protection of the present invention shall therefore be as set forth in the claims.

Claims (15)

1. A frequent time-gap sequential pattern mining system, comprising: a time-gap sequence database for storing time-gap sequences; a pseudo projection database construction module that constructs a pseudo projection database of candidate patterns according to the time-gap sequence database and calculates the support of the candidate patterns; a candidate pattern generation module that searches the pseudo projection database for all candidate patterns of length L, calculates all candidate patterns of length L+1 from the candidate patterns of length L, and updates the support of the candidate patterns of length L+1 so as to gather them into a set, wherein L is a positive integer equal to or greater than 1; and a candidate pattern pruning module that applies a pruning algorithm to prune or delete the candidate patterns of length L+1 in the set.

2. The system of claim 1, wherein the pseudo projection database stores position indexes of the candidate patterns.

3. The system of claim 1, wherein the pruning algorithm is: if a first pattern and a second pattern both exist in the set, the pseudo projection database of the first pattern equals the pseudo projection database of the second pattern, and the first pattern supports the second pattern, then the first pattern is deleted.

4. The system of claim 1, wherein the pruning algorithm is: if a first pattern and a second pattern both exist in the set, the first pattern supports the second pattern, and the support of the first pattern minus the support of the second pattern plus the maximum support in the paired set is lower than the minimum support threshold, then the first pattern is deleted.

5. The system of claim 1, wherein the candidate pattern pruning module deletes non-minimum frequent time-gap sequential patterns according to the characteristics of the time gaps.

6. The system of claim 5, wherein the minimum frequent time-gap sequential pattern is, among all sequences having the same items and order, the sequence in which the time gap between two items is smaller than the time gap between the same two items in the other sequences.

7. The system of claim 1, wherein the candidate pattern pruning module further determines whether a frequent pattern of length L+1 exists in the set and, if so, recursively constructs the pseudo projection database for all frequent patterns in the set, after which the candidate pattern generation module updates the support of the candidate patterns of length L+1 to update the set.

8. A frequent time-gap sequential pattern mining method, comprising: searching a time-gap database for all frequent patterns of length L and gathering them into a set, wherein L is a positive integer equal to or greater than 1; constructing a pseudo projection database for all frequent patterns in the set; searching the pseudo projection database for all candidate patterns of length L+1 to calculate the support of the candidate patterns of length L+1; updating the support of the candidate patterns of length L+1 to update the set; and applying a pruning algorithm to prune or delete the candidate patterns of length L+1 in the set.

9. The method of claim 8, wherein the pseudo projection database stores position indexes of the candidate patterns.

10. The system of claim 8, wherein the pruning algorithm is: if a first pattern and a second pattern both exist in the set, the pseudo projection database of the first pattern equals the pseudo projection database of the second pattern, and the first pattern supports the second pattern, then the first pattern is deleted.

11. The method of claim 8, wherein the pruning algorithm is: if a first pattern and a second pattern both exist in the set, the first pattern supports the second pattern, and the support of the first pattern minus the support of the second pattern plus the maximum support in the paired set is lower than the minimum support threshold, then the first pattern is deleted.

12. The method of claim 8, further comprising deleting non-minimum frequent time-gap sequential patterns according to the characteristics of the time gaps.

13. The method of claim 12, wherein the minimum frequent time-gap sequential pattern is, among all sequences having the same items and order, the sequence in which the time gap between two items is smaller than the time gap between the same two items in the other sequences.

14. The method of claim 8, further comprising determining whether a frequent pattern of length L+1 exists in the set and, if so, recursively constructing the pseudo projection database for all frequent patterns in the set, and then updating the support of the candidate patterns of length L+1 to update the set.

15. The method of claim 8, further comprising removing from the set the candidate patterns whose support is lower than the minimum support threshold, and continuing to generate all frequent patterns of length L+1 from the set of frequent patterns of length L until no longer frequent pattern is produced.
TW107102041A 2018-01-19 2018-01-19 Frequent time-gap sequential pattern mining system and method TWI668580B (en)

Publications (2)

Publication Number Publication Date
TWI668580B TWI668580B (en) 2019-08-11
TW201933142A true TW201933142A (en) 2019-08-16
