CN115964415A - Pre-HUSPM-based database sequence insertion processing method
- Publication number: CN115964415A (application CN202310250759.4A)
- Authority: CN (China)
- Prior art keywords: sequence, database, utility, weighted, original
- Legal status: Granted (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Description
Technical Field
The invention belongs to the field of data mining, and in particular relates to a Pre-HUSPM-based database sequence insertion processing method.
Background Art
High-utility sequential pattern mining (HUSPM) algorithms can be used to analyze users' shopping habits; HUSPM takes into account the weight, unit profit and similar attributes of each item. When the utility of a sequence set is greater than the minimum utility threshold set by the user, the sequence set is a high-utility sequential pattern. HUSPM algorithms usually run on a static database, but in practice new data is added almost every day, which may invalidate previously discovered high-utility sequential patterns or cause new patterns to appear after the database is updated. In traditional dynamic data mining, the original database therefore has to be rescanned every time a small amount of data arrives, and rescanning consumes a great deal of time and resources. In particular, when only a small amount of data is inserted, the insertion has essentially no effect on the database as a whole, so updating the database wastes resources and increases maintenance costs. Efficiently maintaining and updating the mined high-utility sequential patterns is therefore particularly important.
Summary of the Invention
To solve the above problems, the present invention proposes a Pre-HUSPM-based database sequence insertion processing method. It combines the pre-large concept with the projection-based mining algorithm P-HUSPM to build the incremental algorithm Pre-HUSPM, which mines high-utility sequential patterns efficiently and reduces the number of rescans of the original database.
The technical solution of the present invention is as follows:
A Pre-HUSPM-based database sequence insertion processing method builds the incremental algorithm Pre-HUSPM to mine high-utility sequential patterns efficiently, and specifically includes the following steps:
Step 1: Insert the database to be inserted into the original database;
Step 2: Calculate the safety value from the information of the original database;
Step 3: Scan the database to be inserted, and calculate the total utility of each sequence in it and the total utility of the database to be inserted as a whole;
Step 4: Add the total utility value of the new transactions received since the original database was last rescanned to the maximum sequence-weighted utility of a single item in the database to be inserted, compare this sum with the safety value, and act according to the result of the comparison;
Step 5: For each sequence in the set of large sequence-weighted-utility sequences of the new database, determine whether its utility ratio is greater than or equal to the upper utility threshold. If so, the sequence is a high-utility sequential pattern and is added to the set of high-utility sequential patterns and output; otherwise, no operation is needed. Finally, the updated new database and its set of high-utility sequential patterns are output.
Further, in step 1, the original database is a set of sequences, each identified by its sequence number. The database is defined over a finite set of items, and each itemset in a sequence is a set of distinct items drawn from that set.
Further, in step 2, the safety value is calculated by formula (1):
[Formula (1)]
where the quantities involved are the upper utility threshold, the lower utility threshold, and the total utility of the original database; the values of the two thresholds are preset.
The total utility of the original database is the sum of the total utilities of its sequences, given by formula (2):
[Formula (2)]
where the total utility of a sequence in the original database is calculated by formula (3):
[Formula (3)]
where the quantity being accumulated is the utility of an item in an itemset of the sequence.
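Further, in step 3, the total utility of the database to be inserted is calculated in the same way as formulas (2) and (3), together with the total utility of each of its sequences, substituting the data of the database to be inserted.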
Further, the specific decision criterion in step 4 is as follows. Take the total utility value of the new transactions received since the original database was last rescanned. When the sum formed in step 4 is less than the safety value, steps 4.1 and 4.2 are performed; when the sum exceeds the safety value, step 4.3 is performed;
Step 4.1: Scan the database to be inserted to generate the 1-candidate set, and set to 1 the counter that records the number of items currently being processed in the sequence sets;
Step 4.2: Scan the 1-candidate set and update the sequence utility and sequence-weighted utility of the stored information; then generate the 2-candidate set and continue updating the sequence utility and sequence-weighted utility, and so on until no further candidate set is generated; at the same time, the accumulated total utility value of new transactions is updated;
Step 4.3: When the safety value is exceeded, the new database is generated and the original database must be rescanned. The accumulated total utility value of new transactions is reset to 0, and the new database takes the place of the original database.
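The branch taken in step 4 can be sketched as below. This is a minimal illustration under stated assumptions, not the patent's implementation: the names `accumulated_utility`, `max_item_swu` and `safety_value` are hypothetical, the numbers are illustrative, and because the comparison operators are not legible in this text, treating equality as "no rescan needed" is an assumption.

```python
def choose_update_path(accumulated_utility, max_item_swu, safety_value):
    """Return which branch of step 4 to run (hypothetical helper)."""
    # Sum that step 4 compares against the safety value.
    total = accumulated_utility + max_item_swu
    # Assumption: a sum equal to the safety value is still treated as safe.
    if total <= safety_value:
        return "incremental"  # steps 4.1 and 4.2: update stored utilities only
    return "rescan"           # step 4.3: merge and rescan the original database


print(choose_update_path(0, 17, 21))   # incremental
print(choose_update_path(10, 17, 21))  # rescan
```

Keeping the accumulated value outside the function makes the caller responsible for adding to it after an incremental update and resetting it to 0 after a rescan, which mirrors the bookkeeping described in steps 4.2 and 4.3.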
Further, the specific process of step 4.2 is as follows:
Step 4.2.1: Calculate the total utility of the new database as the sum of the total utility of the original database and the total utility of the database to be inserted, by formula (4):
[Formula (4)]
For each candidate in the candidate set, calculate its sequence-weighted utility and its sequence utility in the database to be inserted by formulas (5) and (6):
[Formula (5)]
[Formula (6)]
where formula (5) uses the total utility value of the whole row of a sequence, and formula (6) uses the utility of a subsequence within a sequence, which is the maximum of the utilities of all occurrences of the subsequence in that sequence, defined by formula (7):
[Formula (7)]
where the maximum internal utility of an item in a sequence is the largest utility value that the item takes in that sequence, defined by formula (8):
[Formula (8)]
where the internal utility of an item in an itemset of a sequence is defined by formula (9):
[Formula (9)]
where the two factors are the quantity of the item in the itemset of the sequence and the unit profit of the item;
Step 4.2.2: For each sequence in the set of large sequence-weighted-utility sequences of the original database, perform sub-steps 4.2.2.1 to 4.2.2.3;
Step 4.2.3: For each sequence in the set of pre-large sequence-weighted-utility sequences of the original database, likewise perform sub-steps 4.2.2.1 to 4.2.2.3 of step 4.2.2;
If the large and pre-large sequence-weighted-utility sequence sets of the original database contain a sequence of the database to be inserted, the sequence utility and sequence-weighted utility stored for that sequence in the two sets are updated, and the sequence is placed in the 1-candidate set for use in generating the 2-candidate set. If the two sets do not contain the sequence, no update is needed and the sequence is removed from the 1-candidate set;
Step 4.2.4: Generate the next candidate set, whose candidates contain one more item, from the current candidate set; increase the number of items being processed by 1, and repeat steps 4.2.1 to 4.2.4 until no updated large or pre-large sequence-weighted-utility sequence set is found.
Further, the sub-steps of step 4.2.2 are as follows:
Sub-step 4.2.2.1: Update the sequence-weighted utility of the sequence in the new database as the sum of its sequence-weighted utilities in the original database and in the database to be inserted, by formula (10):
[Formula (10)]
where the first term is the sequence-weighted utility of the sequence in the original database, which is stored with the sequence, and the second term is the sequence-weighted utility of the sequence in the database to be inserted;
Sub-step 4.2.2.2: Update the sequence utility of the sequence in the new database as the sum of its sequence utilities in the original database and in the database to be inserted, by formula (11):
[Formula (11)]
where the first term is the sequence utility of the sequence in the original database, which is stored with the sequence, and the second term is the sequence utility of the sequence in the database to be inserted;
Sub-step 4.2.2.3: If the updated sequence-weighted utility ratio of the sequence meets the upper utility threshold, the sequence is placed in the set of large sequence-weighted-utility sequences of the new database; if the ratio falls between the lower and upper utility thresholds, the sequence is placed in the set of pre-large sequence-weighted-utility sequences of the new database; otherwise, the sequence is discarded.
Further, the specific process of step 4.3 is as follows:
Step 4.3.1: Merge the database to be inserted with the original database to generate the new database;
Step 4.3.2: For each sequence, calculate its sequence-weighted utility in the new database in the same way as formula (5), and then calculate the total utility of the new database in the same way as formula (2);
Step 4.3.3: Compute the weighted utility ratio of each sequence. If the ratio meets the upper utility threshold, the sequence is placed in the set of large sequence-weighted-utility sequences of the new database; if the ratio falls between the lower and upper utility thresholds, the sequence is placed in the set of pre-large sequence-weighted-utility sequences of the new database; otherwise, the sequence is discarded;
Step 4.3.4: Execute the recursive mining algorithm. The algorithm generates the projection databases of multi-item sequences and from them the large and pre-large sequence sets of multi-item sequences, until no further large or pre-large sequence sets are found. Mining starts from the 1-sequence sets, continues with the 2-sequence sets, and so on until the last sequence set is empty. The mining process then stops and outputs the large and pre-large sequence-weighted-utility sequence sets of the new database, both of which are kept for use at the next data insertion.
Further, in step 4.3.4, the specific process of the recursive mining algorithm is as follows:
Step 4.3.4.1: Traverse the large and pre-large sequence sets and, for each sequence belonging to either set, build its projection database;
Step 4.3.4.2: Calculate the sequence-weighted utility value of each extended sequence, that is, of each sequence obtained by extending the current sequence with a further itemset. If the extended sequence qualifies as large, calculate its sequence utility and place it in the large set of sequences that are one item longer; if it qualifies as pre-large, calculate its sequence utility and place it in the pre-large set of sequences that are one item longer; if neither condition holds, no processing is performed;
Step 4.3.4.3: Pass in the current parameters and call the mining procedure recursively until both the large and pre-large sets of sequences that are one item longer are empty, then stop. These are, respectively, the sets of large and pre-large sequence-weighted-utility sequences of the next length in the new database.
The beneficial technical effects of the present invention are as follows.
A new sequential pattern mining algorithm, Pre-HUSPM, is proposed to handle sequence insertion. When a small amount of data is inserted, the whole database does not need to be updated, which avoids wasting resources.
P-HUSPM, the high-utility sequential pattern mining algorithm based on matrix projection, reduces the number of candidates in sequence mining and thereby shortens the time needed to mine high-utility sequence sets. Because the database no longer has to be rescanned frequently, running time is greatly reduced.
A new concept is proposed and used as a safety threshold for deciding whether the database needs to be rescanned, which reduces the number of database rescans and lowers maintenance costs.
Brief Description of the Drawings
FIG. 1 is a flow chart of the Pre-HUSPM-based database sequence insertion processing method of the present invention.
FIG. 2 compares the running times of the three algorithms on the SIGN dataset in the experiments of the present invention for different lower utility thresholds, with the upper utility threshold set to 15%.
FIG. 3 compares the running times of the three algorithms on the LEVIATHAN dataset in the experiments of the present invention for different lower utility thresholds, with the upper utility threshold set to 18%.
FIG. 4 compares the running times of the three algorithms on the FIFA dataset in the experiments of the present invention for different lower utility thresholds, with the upper utility threshold set to 21%.
FIG. 5 compares the running times of the three algorithms on the BIBLE dataset in the experiments of the present invention for different lower utility thresholds, with the upper utility threshold set to 16%.
FIG. 6 compares the running times of the three algorithms on the Kosarak10k dataset in the experiments of the present invention for different lower utility thresholds, with the upper utility threshold set to 14%.
FIG. 7 compares the running times of the three algorithms on the BMS dataset in the experiments of the present invention for different lower utility thresholds, with the upper utility threshold set to 4.5%.
Detailed Description
The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments:
The database referred to in the present invention is a sequence database, which contains large sequences, pre-large sequences and small sequences. A sequence is a large sequence when its support is greater than the upper support threshold; it is a pre-large sequence when its support is less than the upper support threshold and greater than the lower support threshold; and it is a small sequence when its support is less than the lower support threshold. A pre-large sequence is likely to become a large sequence in the future.
The present invention combines the pre-large concept with the projection-based mining algorithm P-HUSPM and proposes the Pre-HUSPM algorithm. The algorithm uses a threshold as the condition for deciding whether the database needs to be rescanned, so that the database sequences are maintained and updated effectively and the number of database rescans is reduced. The threshold is the maximum sequence-weighted utility of a single item in the database to be inserted.
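Nine cases can arise when a new sequence database is added to the original sequence database:
Case 1: a large sequence of the new sequence database is inserted into the large sequences of the original sequence database;
Case 2: a pre-large sequence of the new sequence database is inserted into the large sequences of the original sequence database;
Case 3: a small sequence of the new sequence database is inserted into the large sequences of the original sequence database;
Case 4: a large sequence of the new sequence database is inserted into the pre-large sequences of the original sequence database;
Case 5: a pre-large sequence of the new sequence database is inserted into the pre-large sequences of the original sequence database;
Case 6: a small sequence of the new sequence database is inserted into the pre-large sequences of the original sequence database;
Case 7: a large sequence of the new sequence database is inserted into the small sequences of the original sequence database;
Case 8: a pre-large sequence of the new sequence database is inserted into the small sequences of the original sequence database;
Case 9: a small sequence of the new sequence database is inserted into the small sequences of the original sequence database.
Cases 1, 5, 6, 8 and 9 are count-based weighted averages and do not affect the final large sequence sets. Cases 2 and 3 may remove some existing large sequence sets, while cases 4 and 7 may add new large sequence sets. When both the large and the pre-large sequence sets are retained, cases 2, 3 and 4 can be handled well.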
Case 7 is the main focus of the present invention. When case 7 occurs and the amount of inserted data is small, the database does not actually need to be updated, yet the prior art updates it anyway and wastes resources.
To address this problem, the present invention proposes a Pre-HUSPM-based database sequence insertion processing method, which relies on the following theorem; a proof of the theorem is given.
Theorem. Let the lower and upper utility thresholds and the total utility of the original database be given, and consider the maximum sequence-weighted utility of a single item in the database to be inserted. If this maximum is less than the safety value, then a sequence in case 7 has no chance of becoming a high sequence-weighted-utility sequence in the updated database as a whole.
Proof: Starting from the condition of the theorem, a chain of five derivation steps is obtained.
[Derivation steps not reproduced in this text]
For a sequence in case 7, the sequence-weighted utility is small in the original database, so it lies below the bound that defines a pre-large sequence there.
If the sequence has a large sequence-weighted utility in the database to be inserted, then its sequence-weighted utility in that database must be greater than or equal to the bound that defines a large sequence there, but less than or equal to the total utility of the database to be inserted.
In sequence mining, the ratio of an updated sequence in the new database formed after the insertion is calculated as:
[Formula: ratio of the updated sequence in the new database]
where the formula involves the sequence-weighted utility of the sequence in the new database and its sequence-weighted utility in the original database. Therefore, when the quantity tested in the theorem is less than the safety value, the original database does not need to be rescanned.
According to this theorem, the sequences in case 7 can be handled efficiently.
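The argument behind the theorem can be sketched as follows. All notation here is assumed rather than taken from the source, whose formulas are not legible in this text; in particular the closed form used for the safety value is the standard pre-large bound, chosen because it is consistent with the numbers given later in the embodiment.

```latex
% Assumed notation (not from the source text):
%   S_l, S_u   : lower and upper utility thresholds, 0 < S_l < S_u < 1
%   TU^D, TU^d : total utilities of the original and the inserted database
%   m          : maximum sequence-weighted utility of a single item in the
%                inserted database, so swu^d(s) <= m <= TU^d for every s
%   f = (S_u - S_l) TU^D / (1 - S_u) : assumed form of the safety value
% For a case-7 sequence s (small in the original database):
\begin{align*}
swu^{U}(s) &= swu^{D}(s) + swu^{d}(s) < S_l\,TU^{D} + m, \\
\frac{swu^{U}(s)}{TU^{D} + TU^{d}} &< \frac{S_l\,TU^{D} + m}{TU^{D} + m}
  \quad\text{since } m \le TU^{d}, \\
\frac{S_l\,TU^{D} + m}{TU^{D} + m} \le S_u
  &\iff m\,(1 - S_u) \le (S_u - S_l)\,TU^{D}
  \iff m \le f .
\end{align*}
```

Under these assumptions, whenever m stays within f the updated ratio of a case-7 sequence remains below the upper threshold, which is why no rescan of the original database is needed.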
A Pre-HUSPM-based database sequence insertion processing method specifically includes the following steps:
Step 1: Insert the database to be inserted into the original database.
In this embodiment, the original database is a transaction data database, and the database to be inserted is a new transaction data database.
Both the original transaction data database and the new transaction data database are databases containing a set of sequences. Each sequence has a sequence number and a unique identifier. The database is defined over a finite set of items, and each itemset in a sequence is a set of distinct items drawn from that set.
The original transaction data database contains five sequences and five items; each sequence has its own collection of itemsets [contents not reproduced]. The unit profits of the five items are 3, 2, 4, 2 and 1 respectively. They are stored in the database as a table, the item profit table.
The database to be inserted contains two sequences, each with its own collection of itemsets [contents not reproduced].
Step 2: Calculate the safety value from the information of the original database.
The safety value is calculated by formula (1):
[Formula (1)]
where the quantities involved are the upper utility threshold, the lower utility threshold, and the total utility of the original database; the values of the two thresholds are preset.
The total utility of the original database is the sum of the total utilities of its sequences, given by formula (2):
[Formula (2)]
where the total utility of a sequence in the original database is calculated by formula (3):
[Formula (3)]
where the quantity being accumulated is the utility of an item in an itemset of the sequence.
In this embodiment, the upper utility threshold is preset to 0.35, which is the same as the high-utility sequential pattern threshold, and the lower utility threshold is set to 0.25. The calculation gives total utilities of 36, 26, 28, 23 and 28 for the five sequences, a total utility of 141 for the original database, and a safety value of 21.
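The numbers in this embodiment can be checked with a short sketch. The closed form used for the safety value is an assumption: the formula itself is not legible in this text, and the standard pre-large bound, truncated to an integer, is used here because it reproduces the stated value of 21. The function name `safety_value` is hypothetical.

```python
import math


def safety_value(upper, lower, total_utility):
    """Assumed pre-large style bound, truncated to an integer."""
    return math.floor((upper - lower) * total_utility / (1 - upper))


# Total utilities of the five sequences, as stated in the embodiment.
sequence_utilities = [36, 26, 28, 23, 28]

# The total utility of the original database is the sum over its sequences.
total_utility = sum(sequence_utilities)

print(total_utility)                            # 141
print(safety_value(0.35, 0.25, total_utility))  # 21
```

Without the truncation the same expression gives roughly 21.69, so the integer result in the text suggests, but does not confirm, that the value is rounded down.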
Step 3: Scan the database to be inserted, and calculate the total utility of each sequence in it and the total utility of the database to be inserted as a whole.
The total utility of the database to be inserted is calculated in the same way as formulas (2) and (3), together with the total utility of each of its sequences, substituting the data of the database to be inserted;
In this embodiment, the two sequences of the database to be inserted have total utilities of 10 and 7, so the total utility of the database to be inserted is 17.
Step 4: Add the total utility value of the new transactions received since the original database was last rescanned to the maximum sequence-weighted utility of a single item in the database to be inserted, compare this sum with the safety value, and act according to the result of the comparison. The specific criterion is as follows: when the sum is less than the safety value, steps 4.1 and 4.2 are performed; when the sum exceeds the safety value, step 4.3 is performed.
Step 4.1: Scan the database to be inserted to generate the 1-candidate set, and set to 1 the counter that records the number of items currently being processed in the sequence sets.
In this embodiment, the generated 1-candidate set is: [contents not reproduced].
Step 4.2: Scan the 1-candidate set and update the sequence utility and sequence-weighted utility of the stored information; then generate the 2-candidate set and continue updating the sequence utility and sequence-weighted utility, and so on until no further candidate set is generated. At the same time, the accumulated total utility value of new transactions is updated. The specific process is as follows:
Step 4.2.1: Calculate the total utility of the new database as the sum of the total utility of the original database and the total utility of the database to be inserted, by formula (4):
[Formula (4)]
In this embodiment, the total utility of the new database is 141 + 17 = 158.
For each candidate in the candidate set, calculate its sequence-weighted utility and its sequence utility in the database to be inserted by formulas (5) and (6):
[Formula (5)]
[Formula (6)]
where formula (5) uses the total utility value of the whole row of a sequence, and formula (6) uses the utility of a subsequence within a sequence, which is the maximum of the utilities of all occurrences of the subsequence in that sequence, defined by formula (7):
[Formula (7)]
where the maximum internal utility of an item in a sequence is the largest utility value that the item takes in that sequence, defined by formula (8):
[Formula (8)]
where the internal utility of an item in an itemset of a sequence is defined by formula (9):
[Formula (9)]
where the two factors are the quantity of the item in the itemset of the sequence and the unit profit of the item.
For example, in this embodiment two of the computed values are 10 and 8 [the corresponding symbols are not reproduced].
For example, consider a sequence in which the same item appears in two different itemsets. The internal utilities of the item in the two itemsets are 3×3 = 9 and 2×3 = 6 respectively.
The item appears twice in the sequence, so its maximum utility in the sequence is 9.
A subsequence likewise appears twice in the sequence, and the utilities of the two occurrences are (3×3)+(4×2) = 17 and (3×2)+(4×2) = 14. The utility of the subsequence is therefore 17.
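The arithmetic in this example can be reproduced with a small sketch. The unit profits 3, 2, 4, 2 and 1 come from the embodiment, but the item names `a` to `e`, the example sequence, and the helper names are all assumptions made for illustration; the sketch also handles only patterns whose elements are single items.

```python
from itertools import combinations

# Unit profits from the embodiment; the item names a-e are assumed.
PROFIT = {"a": 3, "b": 2, "c": 4, "d": 2, "e": 1}


def internal_utility(item, quantity):
    """Internal utility of an item: quantity times unit profit."""
    return quantity * PROFIT[item]


def occurrence_utilities(pattern, qsequence):
    """Utility of every occurrence of a pattern of single items in a sequence."""
    utilities = []
    # combinations() yields strictly increasing itemset positions.
    for positions in combinations(range(len(qsequence)), len(pattern)):
        pairs = list(zip(pattern, positions))
        if all(item in qsequence[pos] for item, pos in pairs):
            utilities.append(
                sum(internal_utility(item, qsequence[pos][item]) for item, pos in pairs)
            )
    return utilities


# Hypothetical sequence (item -> quantity per itemset) chosen to match the text.
qsequence = [{"a": 3}, {"a": 2}, {"c": 2}]

utils = occurrence_utilities(["a", "c"], qsequence)
print(utils)       # [17, 14]
print(max(utils))  # 17
```

Taking the maximum over occurrences, rather than the sum, is what keeps the utility of a subsequence from being counted more than once within a single sequence.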
Step 4.2.2: For each sequence in the set of large sequence-weighted-utility sequences of the original database, perform the following sub-steps:
Sub-step 4.2.2.1: Update the sequence-weighted utility of the sequence in the new database as the sum of its sequence-weighted utilities in the original database and in the database to be inserted, by formula (10):
[Formula (10)]
where the first term is the sequence-weighted utility of the sequence in the original database, which is stored with the sequence, and the second term is the sequence-weighted utility of the sequence in the database to be inserted.
For the example sequence in this embodiment, the updated sequence-weighted utility is 76 + 7 = 83.
Sub-step 4.2.2.2: Update the sequence utility of the sequence in the new database as the sum of its sequence utilities in the original database and in the database to be inserted, by formula (11):
[Formula (11)]
where the first term is the sequence utility of the sequence in the original database, which is stored with the sequence, and the second term is the sequence utility of the sequence in the database to be inserted.
For the example sequence in this embodiment, the updated sequence utility is 30 + 3 = 33.
Sub-step 4.2.2.3: If the updated sequence-weighted utility ratio of the sequence meets the upper utility threshold, the sequence is placed in the set of large sequence-weighted-utility sequences of the new database; if the ratio falls between the lower and upper utility thresholds, the sequence is placed in the set of pre-large sequence-weighted-utility sequences of the new database; otherwise, the sequence is discarded, because it remains small after the database update.
In this embodiment, the ratio is 52.5%, which is greater than 35%, so the sequence stays in the set of large sequence-weighted-utility sequences.
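Sub-steps 4.2.2.1 and 4.2.2.3 can be traced with the numbers of this embodiment. The sketch below is illustrative only: the function name `classify` and its parameters are hypothetical, and since the exact comparison operators are not legible in this text, treating a ratio equal to a threshold as meeting it is an assumption.

```python
def classify(swu_original, swu_inserted, total_original, total_inserted, upper, lower):
    """Update a stored sequence-weighted utility and classify the sequence."""
    # Sub-step 4.2.2.1: the updated value is the sum of the two parts.
    swu_new = swu_original + swu_inserted
    # Step 4.2.1: total utility of the new database.
    total_new = total_original + total_inserted
    ratio = swu_new / total_new
    # Sub-step 4.2.2.3 (assumed inclusive comparisons).
    if ratio >= upper:
        label = "large"
    elif ratio >= lower:
        label = "pre-large"
    else:
        label = "discard"
    return swu_new, total_new, ratio, label


swu_new, total_new, ratio, label = classify(76, 7, 141, 17, 0.35, 0.25)
print(swu_new, total_new, round(ratio, 3), label)  # 83 158 0.525 large
```

Only the stored value and the utilities from the database to be inserted are needed here, which is the point of the incremental branch: the original database is never read.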
Step 4.2.3: For each sequence in the set of pre-large sequence-weighted-utility sequences of the original database, likewise perform sub-steps 4.2.2.1 to 4.2.2.3 of step 4.2.2.
If the large and pre-large sequence-weighted-utility sequence sets of the original database contain a sequence of the database to be inserted, the sequence utility and sequence-weighted utility stored for that sequence in the two sets are updated, and the sequence is placed in the 1-candidate set for use in generating the 2-candidate set. If the two sets do not contain the sequence, no update is needed and the sequence is removed from the 1-candidate set.
For example, in this embodiment, candidates that are found in the stored sets are added to the 1-candidate set, and candidates that are not found are removed. From the 1-candidate set, a 2-candidate set of four candidates can be generated, and their sequence-weighted utilities and sequence utilities are mined from the database to be inserted; if a candidate does not occur there, its value is 0. The process continues in the same way until the candidate set is empty.
Step 4.2.4: Generate the next candidate set, whose candidates contain one more item, from the current candidate set; increase the number of items being processed by 1, and repeat steps 4.2.1 to 4.2.4 until no updated large or pre-large sequence-weighted-utility sequence set is found.
Step 4.3: When the safety value is exceeded, the new database is generated and the original database must be rescanned. The accumulated total utility value of new transactions is reset to 0, and the new database takes the place of the original database. The specific process is as follows:
Step 4.3.1: Merge the database to be inserted with the original database D to generate the new database U;
Step 4.3.2: For each sequence, calculate its sequence-weighted utility in the new database in the same way as formula (5), and then calculate the total utility of the new database in the same way as formula (2);
Step 4.3.3: Compute the weighted utility ratio of each sequence. If the ratio meets the upper utility threshold, the sequence is placed in the set of large sequence-weighted-utility sequences of the new database; if the ratio falls between the lower and upper utility thresholds, the sequence is placed in the set of pre-large sequence-weighted-utility sequences of the new database; otherwise, the sequence is discarded, because it remains small after the database update.
Step 4.3.4: Execute the recursive mining algorithm. The algorithm generates the projection databases of multi-item sequences and from them the large and pre-large sequence sets of multi-item sequences, until no further large or pre-large sequence sets are found. Mining starts from the 1-sequence sets, continues with the 2-sequence sets, and so on until the last sequence set is empty. The mining process then stops and outputs the large and pre-large sequence-weighted-utility sequence sets of the new database, both of which are kept for use at the next data insertion.
The specific process is as follows:
Step 4.3.4.1: Traverse the large and pre-large sets and build the projected database of every sequence that belongs to them, which reduces the number of candidates and improves running speed; the length parameter used here denotes the number of items currently being processed in the sequence set. A projected database is built by finding every sequence that has the given item as its sequence prefix; a sequence that does not contain the item is not retained.
定义:设有两个序列和,其中。如果(1)该序列有前缀,(2)其中该序列是以为前缀的的子序列,并且该序列是不再有超序列,那么序列的子序列称为的投影序列,这个关系记为。因此,序列在新数据库中的投影数据库是序列对应的数据库中每个序列的所有投影序列的集合,记为。Definition: Suppose there are two sequences and ,in If (1) the sequence has a prefix , (2) where the sequence is Prefix , and the sequence no longer has a supersequence, then the sequence A subsequence of The projection sequence of Therefore, the sequence In the new database The projection database in is the sequence The set of all projection sequences for each sequence in the corresponding database is denoted as .
例如,根据上述定义,对序列构建投影数据库,找到以为作为序列前缀的每一个序列,如果一个序列中没有项目,就不保留,例如中没有项目,在序列的投影数据库就没有。因此,序列的投影数据库中只包含、、、四个序列,具体内容为:序列的项目集合为,序列的总效用为36;序列的项目集合为,序列的总效用为9;序列的项目集合为 ,序列的总效用为9;序列的项目集合为 ,序列的总效用为22。For example, according to the above definition, for the sequence Build a projection database and find For each sequence that is a prefix of a sequence, if there are no items in a sequence , it is not retained, for example No Project, in sequence The projection database does not have Therefore, the sequence The projection database only contains , , , Four sequences, the specific contents are: The set of items in the sequence is , The total utility of the sequence is 36; The set of items in the sequence is , The total utility of the sequence is 9; The set of items in the sequence is , The total utility of the sequence is 9; The set of items in the sequence is , The total utility of the sequence is 22.
Step 4.3.4.2: Compute the sequence weighted utility of each extended sequence, formed by adding an extension itemset to the current sequence. If its ratio reaches the upper utility threshold, compute its sequence utility and put the extended sequence into the large set; if its ratio lies between the lower and upper utility thresholds, compute its sequence utility and put the extended sequence into the pre-large set; if neither condition holds, do nothing.
Step 4.3.4.3: Pass in the current parameters and call the mining procedure recursively until both the large set and the pre-large set are empty, then stop.
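The projection rule of step 4.3.4.1 can be illustrated with a minimal Python sketch. It rests on simplifying assumptions that are not taken from the patent: a sequence is modelled as a flat, ordered list of (item, utility) pairs instead of a list of itemsets, the prefix is a single item, and the names `QSequence`, `project`, and `toy_db` are hypothetical.

```python
from typing import Dict, List, Tuple

# Simplifying assumption: a sequence is a flat, ordered list of
# (item, utility) pairs rather than a list of itemsets.
QSequence = List[Tuple[str, int]]


def project(database: Dict[str, QSequence], prefix_item: str) -> Dict[str, QSequence]:
    """Hypothetical helper: projected database of a one-item prefix."""
    projected: Dict[str, QSequence] = {}
    for seq_id, seq in database.items():
        for pos, (item, _utility) in enumerate(seq):
            if item == prefix_item:
                # Keep the part of the sequence that starts at the first
                # occurrence of the prefix item.
                projected[seq_id] = seq[pos:]
                break
        # A sequence that never contains the item is not retained.
    return projected


# Toy data invented for illustration only.
toy_db = {
    "s1": [("a", 2), ("b", 3), ("c", 1)],
    "s2": [("b", 4), ("c", 2)],
    "s3": [("c", 5), ("a", 1)],
}
projected_a = project(toy_db, "a")
```

In this toy database the sequence `s2` contains no item `a`, so it does not appear in `projected_a`, mirroring the rule that a sequence without the prefix item is not retained.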
The pseudocode of the recursive mining algorithm is as follows:
1: for 每一个序列 do;1: for each sequence do;
2: 构建序列的投影数据库 ;2: Build sequence Projection database ;
3: end for;3: end for;
4: for 每一个 ,其中是 在投影数据库 的超集 do;4: for each ,in yes In the projection database A superset of do;
5: 计算;5: Calculation ;
6: if then;6: if then;
7: 计算;7: Calculation ;
8: 将序列放入集合中;8: Sequence Put in In the collection;
9: else if ;9: else if ;
10: 计算;10: Calculation ;
11: 将序列放入集合中;11: Sequence Put in In the collection;
12: end if;12: end if;
13: end for;13: end for;
14: Mining(,,,,,);14: Mining( , , , , , );
Step 5: For each sequence S in the large sequence weighted utility sequence set of the new database, determine whether its utility ratio is greater than or equal to the upper utility threshold. If so, S is a high-utility sequence pattern and is added to the high-utility sequence pattern set and output; otherwise no operation is needed. Finally, output the updated new database and its high-utility sequence pattern set.
In this embodiment of the present invention, one sequence has a utility ratio of 35.4%, which exceeds 35%, so it is a high-utility sequence and is added to the high-utility sequence pattern set.
The large and pre-large sets finally obtained are as follows:
The large sequence weighted utility sequence set contains four sequence sets. In the order listed, their sequence weighted utility and sequence utility are 83 and 22; 95 and 56; 77 and 20; and 77 and 16.
The pre-large sequence weighted utility sequence set contains five sequence sets. In the order listed, their sequence weighted utility and sequence utility are 53 and 18; 46 and 17; 43 and 17; 52 and 32; and 54 and 38.
The high-utility sequence pattern set of the updated new database contains only one sequence set, whose sequence weighted utility is 95, sequence utility is 56, and utility ratio is 35.4%.
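The final check of step 5 can be sketched in a few lines of Python. The (sequence weighted utility, sequence utility) pairs are the four large sets reported in the embodiment; everything else is an assumption made for illustration: the labels `p1` to `p4` stand in for the sequence names, the utility ratio is taken to be sequence utility divided by total utility, and the total utility of 158 is a hypothetical value chosen only because 56 / 158 reproduces the reported 35.4%.

```python
from typing import Dict, List, Tuple


def high_utility_patterns(
    large_sets: Dict[str, Tuple[int, int]],
    total_utility: float,
    upper_threshold: float,
) -> List[str]:
    """Keep the large sets whose utility ratio reaches the upper threshold."""
    return [
        name
        for name, (_swu, su) in large_sets.items()
        if su / total_utility >= upper_threshold
    ]


# (sequence weighted utility, sequence utility) of the four large sets;
# the labels p1..p4 are placeholders, not names from the patent.
large_sets = {"p1": (83, 22), "p2": (95, 56), "p3": (77, 20), "p4": (77, 16)}

ASSUMED_TOTAL_UTILITY = 158  # hypothetical, consistent with 56 / 158 = 35.4%
UPPER_THRESHOLD = 0.35

patterns = high_utility_patterns(large_sets, ASSUMED_TOTAL_UTILITY, UPPER_THRESHOLD)
ratio_p2 = round(100 * 56 / ASSUMED_TOTAL_UTILITY, 1)
```

Under these assumptions only the set with sequence utility 56 passes the 35% threshold, which agrees with the single pattern reported for the updated database.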
In the present invention, the pseudocode of the Pre-HUSPM algorithm is as follows:
Input: an item profit table; the original database; the upper utility threshold (the same as the minimum high sequence utility threshold); the lower utility threshold; the total utility of the original database; the set of large sequence weighted utility sequences and the set of pre-large sequence weighted utility sequences, together with their sequence weighted utility values and the actual utility values found for them; the safety transaction utility buffer, which holds the total utility of the most recently processed sequences; and the database to be inserted.
Output: the set of high-utility sequence patterns of the new database.
1: 计算安全序列效用界限 ;1: Calculate the safety sequence utility bound ;
2: for each do;2: for each do;
3: 扫描数据库 ,计算 ;3: Scan the database ,calculate ;
4: end for;4: end for;
5: 计算和;5: Calculation and ;
6: 如果then;6: If then;
7: 计算总效用 ;7: Calculate total utility ;
8: 设置 =1;8: Settings =1;
9: 生成1-项候选集,;9: Generate 1-item candidate set , ;
10: while null do;10: while null do;
11: for each do;11: for each do;
12: 计算 ;12: Calculation ;
13: 计算 ;13: Calculation ;
14: end for;14: end for;
15: for each do;15: for each do;
16: 调用效用求和算法;16: Call the utility summation algorithm;
17: end for;17: end for;
18: for each do;18: for each do;
19: 调用效用求和算法;19: Call the utility summation algorithm;
20: end for;20: end for;
21: 从( ∪ ) 生成 ( + 1)-候选集;21: From ( ∪ ) Generate( + 1)-Candidate set ;
22: 设置 =+1;22: Settings = +1;
23: end while;23: end while;
24: else;24: else;
25: 合并待插入数据库和原始数据库,生成新数据库;25: Merge the database to be inserted and the original database , generate a new database ;
26: for each do;26: for each do;
27: 计算;27: Calculation ;
28: end for;28: end for;
29: 计算 ;29: Calculation ;
30: 设置 =1;30: Settings =1;
31: for each do;31: for each do;
32: if ;32: if ;
33: 将加入到集合 当中;33: Will Add to collection among;
34: else if ;34: else if ;
35: 将 加入到集合 当中;35: Will Add to collection among;
36: end if;36: end if;
37: end for;37: end for;
38: 如果 不在和当中,就将从新数据库中移除,当作新的数据库;38: If Not Available and Among them, From new database Remove it and treat it as a new database ;
39: Mining(,,,, , );39: Mining( , , , , , );
40: end if;40: end if;
41: for each do;41: for each do;
42: if ;42: if ;
43: 将序列放入 集合中;43: Sequence Put in In the collection;
44: end if;44: end if;
45: end for;45: end for;
46: if then;46: if then;
47: 设置 and = 0;47: Settings and = 0;
48: else;48: else;
49: 设置 ;49: Settings ;
50: end if;50: end if;
51: 设置 and ;51: Settings and ;
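The branching between the incremental path and the rescan path, and the buffer update at the end of the pseudocode, can be sketched as follows. This is only an illustration under stated assumptions: the exact test and the formula of the safety bound are not recoverable from the text, so the sketch takes the bound as a given number and assumes the usual pre-large style comparison of accumulated inserted utility against that bound. The function names are hypothetical.

```python
def choose_path(buffer: float, inserted_utility: float, safety_bound: float) -> str:
    """Assumed test: stay incremental while the accumulated utility of
    inserted data does not exceed the safety bound."""
    if buffer + inserted_utility <= safety_bound:
        return "incremental"
    return "rescan"


def next_buffer(path: str, buffer: float, inserted_utility: float) -> float:
    """After a rescan the buffer restarts at 0; otherwise it accumulates."""
    return 0.0 if path == "rescan" else buffer + inserted_utility


# Three hypothetical insertions of utility 40 each against a bound of 100.
buffer = 0.0
history = []
for inserted in (40.0, 40.0, 40.0):
    path = choose_path(buffer, inserted, safety_bound=100.0)
    buffer = next_buffer(path, buffer, inserted)
    history.append(path)
```

With these invented numbers the first two insertions are handled incrementally and the third triggers a rescan, after which the buffer starts again from 0.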
The pseudocode of the utility summation algorithm used in the pseudocode above is as follows:
1:;1: ;
2: ;2: ;
3: if then;3: if then;
4: 将序列放入集合中;4: Sequence Put in In the collection;
5: else if then;5: else if then;
6: 将序列放入集合中;6: Sequence Put in In the collection;
7: end if;7: end if;
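A minimal Python sketch of the utility summation step follows. It assumes, in line with the statement that a value missing from the inserted database is taken as 0, that the stored values and the values mined from the inserted database are simply added, and that the updated sequence weighted utility is then compared with the two thresholds relative to the updated total utility. The function name, argument names, and sample numbers are hypothetical.

```python
from typing import Dict, Tuple

Values = Tuple[int, int]  # (sequence weighted utility, sequence utility)


def utility_sum(
    seq: str,
    stored: Dict[str, Values],
    inserted: Dict[str, Values],
    updated_total_utility: float,
    upper: float,
    lower: float,
) -> Tuple[str, Values]:
    """Add stored and newly mined values, then classify the sequence."""
    old_swu, old_su = stored.get(seq, (0, 0))
    new_swu, new_su = inserted.get(seq, (0, 0))  # absent in the insert: 0
    swu, su = old_swu + new_swu, old_su + new_su
    ratio = swu / updated_total_utility
    if ratio >= upper:
        label = "large"
    elif ratio >= lower:
        label = "pre-large"
    else:
        label = "small"
    return label, (swu, su)


# Invented numbers for illustration only.
stored = {"x": (50, 20), "y": (80, 30)}
inserted = {"x": (10, 5)}
result_x = utility_sum("x", stored, inserted, 200, upper=0.35, lower=0.25)
result_y = utility_sum("y", stored, inserted, 200, upper=0.35, lower=0.25)
```

Here `x` ends with a ratio of 60 / 200 = 0.30 and is classified as pre-large, while `y`, which does not occur in the inserted data, keeps its stored values and stays large at 80 / 200 = 0.40.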
To demonstrate the superiority and feasibility of the algorithm of the present invention, a comparative experiment was carried out. The Pre-HUSPM algorithm proposed in the present invention was compared with the P-HUSPM algorithm and the Pre-HUSPM-TSU algorithm. The experiment used six real datasets of different scales and characteristics: SIGN, LEVIATHAN, FIFA, BIBLE, Kosarak10k, and BMS, all obtained from the SPMF website. SIGN is a dense dataset containing many very long sequences; LEVIATHAN and FIFA are medium-density datasets containing many long sequences; BIBLE is a medium-density dataset containing many medium-length sequences; BMS and Kosarak10k are sparse datasets with only a few long sequences. All datasets satisfy a Gaussian distribution. In the experiment, each dataset was divided into one original dataset and 100 new datasets. The characteristics of the datasets are as follows:
Dataset | Number of sequences | Number of distinct items | Average sequence length | Maximum sequence length | Sequences in original database | Sequences in each database to be inserted |
---|---|---|---|---|---|---|
SIGN | 730 | 267 | 52 | 94 | 230 | 5 |
LEVIATHAN | 5834 | 9025 | 33.8 | 100 | 2834 | 30 |
FIFA | 20450 | 2990 | 36.2 | 100 | 10450 | 100 |
BIBLE | 36369 | 13905 | 21.6 | 100 | 21369 | 150 |
Kosarak10k | 10000 | 10094 | 8.1 | 608 | 1000 | 90 |
BMS | 59601 | 497 | 2.5 | 267 | 39601 | 200 |
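The split into one original dataset and 100 inserted datasets can be checked against the reported sizes with a short Python sketch. The figures are those given in the description; the variable and function names are hypothetical.

```python
# (total sequences, sequences in original database, sequences per inserted database)
DATASETS = {
    "SIGN": (730, 230, 5),
    "LEVIATHAN": (5834, 2834, 30),
    "FIFA": (20450, 10450, 100),
    "BIBLE": (36369, 21369, 150),
    "Kosarak10k": (10000, 1000, 90),
    "BMS": (59601, 39601, 200),
}
NUM_INSERTIONS = 100  # each dataset is split into one original and 100 new datasets


def is_consistent(total: int, original: int, per_insert: int) -> bool:
    """True when the original part plus all inserted parts gives the total."""
    return original + NUM_INSERTIONS * per_insert == total


checks = {name: is_consistent(*sizes) for name, sizes in DATASETS.items()}
```

Every entry of `checks` is true: for SIGN, for example, 230 + 100 × 5 = 730.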
In the experiments of the present invention, the upper utility threshold was held fixed on the six datasets while different lower utility thresholds were selected for comparison; the results are shown in Figures 2 to 7. The experiments show that the Pre-HUSPM-TSU algorithm has a shorter running time than the HUSPM algorithm. The optimized algorithm proposed in the present invention replaces the measure used in the Pre-HUSPM-TSU algorithm with its own measure, and the resulting variant is the Pre-HUSPM algorithm referred to in the present invention. Its running time is considerably shorter than that of both HUSPM and Pre-HUSPM-TSU. The Pre-HUSPM algorithm is therefore faster on larger, non-dense datasets and performs well in terms of running time.
Varying the lower utility threshold shows that if it is set too small, rescanning the database becomes slower because too many pre-large sequence sets are generated. If it is set too close to the upper utility threshold, the safety value becomes too small, so the database may have to be rescanned whenever new data is added, which also slows the operation. In practical applications the two thresholds need to be set sensibly. Both range from 0 to 1, the lower threshold must be kept below the upper threshold, and the specific values are chosen according to the user's needs.
Of course, the above description does not limit the present invention, and the present invention is not limited to the above examples. Changes, modifications, additions, or substitutions made by those skilled in the art within the essential scope of the present invention also fall within the protection scope of the present invention.
Claims (9)
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310250759.4A CN115964415B (en) | 2023-03-16 | 2023-03-16 | Pre-HUSPM-based database sequence insertion processing method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115964415A | 2023-04-14 |
CN115964415B | 2023-05-26 |
Family
ID=85894768
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310250759.4A (CN115964415B, Expired - Fee Related) | Pre-HUSPM-based database sequence insertion processing method | 2023-03-16 | 2023-03-16 |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115964415B (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030217055A1 (en) * | 2002-05-20 | 2003-11-20 | Chang-Huang Lee | Efficient incremental method for data mining of a database |
CN105590237A (en) * | 2015-12-18 | 2016-05-18 | 齐鲁工业大学 | Application of high utility sequential pattern with negative-profit items in electronic commerce business decision making |
CN106777182A (en) * | 2016-12-23 | 2017-05-31 | 陕西理工学院 | A kind of data flow effective item set mining algorithm for reducing candidate |
CN108733705A (en) * | 2017-04-20 | 2018-11-02 | 哈尔滨工业大学深圳研究生院 | A kind of effective sequential mode mining method and device |
CN109101530A (en) * | 2018-06-22 | 2018-12-28 | 哈尔滨工业大学(深圳) | Effective sequence of events pattern mining algorithm |
CN109408563A (en) * | 2018-11-07 | 2019-03-01 | 哈尔滨工业大学(深圳) | High average utility item set mining method, apparatus and computer equipment |
CN111475551A (en) * | 2020-06-15 | 2020-07-31 | 河北工业大学 | A high-average utility sequential pattern mining method under non-overlapping conditions |
CN111930803A (en) * | 2020-08-07 | 2020-11-13 | 河北工业大学 | Non-overlapping self-adaptive frequent sequence pattern mining method |
CN112434031A (en) * | 2020-11-16 | 2021-03-02 | 宁波财经学院 | Uncertain high-utility mode mining method based on information entropy |
US20220058716A1 (en) * | 2020-08-18 | 2022-02-24 | Qilu University Of Technology | Commodity recommendation system based on actionable high utility negative sequential rules mining and its working method |
CN114971794A (en) * | 2022-05-27 | 2022-08-30 | 齐鲁工业大学 | Time period-based high-utility sequence mode analysis method and system in group purchase |
Non-Patent Citations (1)
Title |
---|
慕欢欢;柴玉梅;王黎明;: "面向数据流的一个高效用项集挖掘算法", 计算机应用与软件 * |
Also Published As
Publication number | Publication date |
---|---|
CN115964415B (en) | 2023-05-26 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
Lin et al. | Efficient updating of discovered high-utility itemsets for transaction deletion in dynamic databases | |
Plantevit et al. | Mining multidimensional and multilevel sequential patterns | |
US20030217055A1 (en) | Efficient incremental method for data mining of a database | |
Wang et al. | On incremental high utility sequential pattern mining | |
Liu et al. | Effective sanitization approaches to protect sensitive knowledge in high-utility itemset mining | |
CN111930797A (en) | Uncertain periodic frequent item set mining method and device | |
Wu et al. | Incrementally updating the discovered high average-utility patterns with the pre-large concept | |
CN107038026A (en) | The automatic machine update method and system of a kind of increment type | |
Tatti et al. | Finding robust itemsets under subsampling | |
Gan et al. | ProUM: High utility sequential pattern mining | |
CN111309786B (en) | Parallel frequent item set mining method based on MapReduce | |
Lin et al. | A fast maintenance algorithm of the discovered high-utility itemsets with transaction deletion | |
CN111026862A (en) | An Incremental Entity Summarization Method Based on Formal Concept Analysis Technology | |
Kiran et al. | Efficient discovery of weighted frequent itemsets in very large transactional databases: A re-visit | |
Truong et al. | EHUSM: mining high utility sequences with a pessimistic utility model | |
CN115964415B (en) | Pre-HUSPM-based database sequence insertion processing method | |
CN108319728A (en) | A kind of frequent community search method and system based on k-star | |
CN110309179B (en) | Maximum fault-tolerant frequent item set mining method based on parallel PSO | |
Sun et al. | Applying prefixed-itemset and compression matrix to optimize the MapReduce-based Apriori algorithm on Hadoop | |
Hong et al. | Hiding sensitive itemsets by inserting dummy transactions | |
Tin et al. | Hupsmt: An efficient algorithm for mining high utility-probability sequences in uncertain databases with multiple minimum utility thresholds | |
Ou et al. | Efficient algorithms for incremental Web log mining with dynamic thresholds | |
Zhou et al. | Incremental association rule mining based on matrix compression for edge computing | |
CN112231438B (en) | Method and device for mining closed term set and generation sub | |
CN108197272A (en) | A kind of update method and device of distributed association rules increment |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
 | PB01 | Publication | |
 | SE01 | Entry into force of request for substantive examination | |
 | GR01 | Patent grant | |
 | CF01 | Termination of patent right due to non-payment of annual fee | Granted publication date: 20230526 |