CN115964415A - Pre-HUSPM-based database sequence insertion processing method

Info

Publication number: CN115964415A
Application number: CN202310250759.4A
Authority: CN
Inventors: 吴明泰; 李凤洋; 潘正祥; 陈建铭; 吴祖扬
Original assignee: Shandong University of Science and Technology
Current assignee: Shandong University of Science and Technology
Priority date: 2023-03-16
Filing date: 2023-03-16
Publication date: 2023-04-14
Anticipated expiration: 2043-03-16
Also published as: CN115964415B

Abstract

The invention discloses a Pre-HUSPM-based method for processing sequence insertions into a database, in the field of data mining. The method comprises: inserting a database to be inserted into an original database; calculating a safety value from the information of the original database; scanning the database to be inserted and calculating the total utility of each of its sequences and its overall total utility; taking the sum of the total utility of the new transactions added since the original database was last rescanned and the maximum sequence-weighted utility of a single item in the database to be inserted, comparing this sum with the safety value, and acting according to the result; comparing the utility ratio of each sequence in the large sequence-weighted-utility sequence set of the new database with the upper utility threshold; and finally outputting the updated new database together with its high-utility sequential pattern set. The invention reduces the number of database rescans and lowers maintenance cost.

Description

A database sequence insertion processing method based on Pre-HUSPM

Technical Field

The invention belongs to the field of data mining and specifically relates to a Pre-HUSPM-based method for processing sequence insertions into a database.

Background Art

High-utility sequential pattern mining (HUSPM) algorithms can be used to analyze users' shopping habits; HUSPM takes into account the weight, unit profit, and similar attributes of each item. A sequence set whose utility exceeds the user-specified minimum utility threshold is a high-utility sequential pattern. HUSPM algorithms usually run on a static database, but in practice new data arrives almost every day, which can invalidate previously discovered high-utility sequential patterns or give rise to new information once the database is updated. In traditional dynamic data mining, the original database therefore has to be rescanned every time a small amount of data arrives, which consumes considerable resources and time. When only a small amount of data is inserted, the effect on the database as a whole is negligible, so updating the database at that point wastes resources and raises maintenance cost. Maintaining and updating the mined high-utility sequential patterns efficiently is therefore particularly important.

Summary of the Invention

To solve the above problems, the present invention proposes a Pre-HUSPM-based database sequence insertion processing method. It combines the pre-large concept with the projection-based mining algorithm P-HUSPM to build the incremental algorithm Pre-HUSPM, which mines high-utility sequential patterns efficiently and reduces the number of rescans of the original database.

The technical solution of the present invention is as follows:

A Pre-HUSPM-based database sequence insertion processing method, which builds the incremental algorithm Pre-HUSPM to mine high-utility sequential patterns efficiently, specifically comprises the following steps:

Step 1: Insert the database to be inserted into the original database.

Step 2: Calculate the safety value from the information of the original database.

Step 3: Scan the database to be inserted, and calculate the total utility of each of its sequences and the total utility of the database to be inserted as a whole.

Step 4: Take the sum of the total utility of the new transactions added since the original database was last rescanned and the maximum sequence-weighted utility of a single item in the database to be inserted, compare this sum with the safety value, and act according to the result.

Step 5: For each sequence in the large sequence-weighted-utility sequence set of the new database, determine whether its utility ratio is greater than or equal to the upper utility threshold. If so, the sequence is a high-utility sequential pattern and is added to the high-utility sequential pattern set and output; otherwise, no action is needed. Finally, output the updated new database and its high-utility sequential pattern set.

Further, in step 1, the original database consists of a number of sequences, each identified by its sequence index. Each sequence is a collection of itemsets, and each itemset is a set of distinct items, so that an individual item is addressed by its sequence, its itemset, and its position within the itemset.

Further, in step 2, the safety value is calculated by formula (1):

(1)

where the inputs are the upper utility threshold, the lower utility threshold, and the total utility of the original database; the values of the two thresholds are preset.

The total utility of the original database is calculated by formula (2):

(2)

where each term is the total utility of one sequence in the original database, calculated by formula (3):

(3)

where each term is the utility of one item in one itemset of the sequence.
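The formula images for (1) to (3) are not present in this text. One reconstruction that agrees with the definitions above and with the worked numbers of the embodiment (thresholds 0.35 and 0.25, total utility 141, safety value 21) is sketched below. The symbol names $f$, $S_u$, $S_l$, $TU^{D}$, $su(S_q)$, and $u(i_k, I_j, S_q)$ are assumptions, and so are the denominator $1 - S_u$ and the floor, which are borrowed from the usual pre-large safety bound because they reproduce the value 21.

```latex
\begin{align}
f &= \left\lfloor \frac{(S_u - S_l)\, TU^{D}}{1 - S_u} \right\rfloor \tag{1}\\
TU^{D} &= \sum_{S_q \in D} su(S_q) \tag{2}\\
su(S_q) &= \sum_{I_j \in S_q} \; \sum_{i_k \in I_j} u(i_k, I_j, S_q) \tag{3}
\end{align}
```

With the embodiment's values, $(0.35 - 0.25) \times 141 / 0.65 \approx 21.69$, which floors to 21.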

Further, in step 3, the total utility of the database to be inserted is calculated in the same way as in formulas (2) and (3), together with the total utility of each of its sequences, substituting the data of the database to be inserted.

Further, the decision rule in step 4 is as follows. Let the accumulated value be the total utility of the new transactions added since the original database was last rescanned. If the sum compared in step 4 stays within the safety value, perform steps 4.1 and 4.2; if it exceeds the safety value, perform step 4.3.

Step 4.1: Scan the database to be inserted to generate the 1-candidate set, and set the item count to 1; the item count is the number of items in the sequence sets currently being processed.

Step 4.2: Scan the 1-candidate set and update the stored sequence utility and sequence-weighted utility; then generate the 2-candidate set and update the stored values again, and so on until no further candidate set is generated. At the same time, update the accumulated value.

Step 4.3: When the safety value is exceeded, generate the new database; in this case the original database must be rescanned. Reset the accumulated value to 0, and assign the new database to the original database.
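The branch in step 4 can be sketched as a small function. This is a minimal illustration, not the patented implementation: the names `decide_step4`, `accumulated`, `max_item_swu`, and `safety_value` are hypothetical, and two details are assumptions, namely that equality counts as staying within the safety value and that the incremental branch adds the inserted database's total utility to the accumulated value.

```python
from dataclasses import dataclass


@dataclass
class Step4Result:
    branch: str          # "incremental" (steps 4.1-4.2) or "rescan" (step 4.3)
    accumulated: float   # accumulated utility of new transactions afterwards


def decide_step4(accumulated: float, max_item_swu: float,
                 inserted_total_utility: float, safety_value: float) -> Step4Result:
    """Choose between incremental maintenance and a full rescan.

    Assumption: the tested quantity is accumulated + max_item_swu, and
    equality with the safety value still counts as safe.
    """
    if accumulated + max_item_swu <= safety_value:
        # Steps 4.1-4.2: update the stored patterns from the inserted data only.
        # Assumption: the accumulated value grows by the inserted total utility.
        return Step4Result("incremental", accumulated + inserted_total_utility)
    # Step 4.3: rescan the merged database; the accumulated value is reset to 0.
    return Step4Result("rescan", 0.0)
```

For instance, with a safety value of 21, an empty accumulator, and a hypothetical maximum item utility of 10, the first insertion of total utility 17 is handled incrementally, while a second identical insertion would trigger a rescan.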

Further, the specific process of step 4.2 is as follows:

Step 4.2.1: Calculate the total utility of the new database by formula (4):

(4)

For each candidate in the candidate set, calculate the sequence-weighted utility and the sequence utility of the candidate sequence in the database to be inserted by formulas (5) and (6):

(5)

(6)

Formula (5) uses the total utility of each whole sequence (row) that contains the candidate. Formula (6) uses the utility of the candidate as a subsequence of a sequence, which is the maximum utility among all of its occurrences in that sequence, defined by formula (7):

(7)

The maximum internal utility of an item in a sequence is the largest utility value that the item takes in that sequence, defined by formula (8):

(8)

The internal utility of an item in an itemset of a sequence is defined by formula (9):

(9)

where the two factors are the quantity of the item in that itemset of the sequence and the unit profit of the item.
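The images of formulas (4) to (9) are likewise absent here. A reconstruction that follows the verbal definitions and the embodiment's arithmetic ($141 + 17 = 158$ for (4), $3 \times 3 = 9$ for (9)) is sketched below. All symbol names are assumptions: $d$ for the database to be inserted, $U$ for the new database, $s \sqsubseteq S_q$ for "$s$ is a subsequence of $S_q$", $q(\cdot)$ for quantity, and $p(\cdot)$ for unit profit.

```latex
\begin{align}
TU^{U} &= TU^{D} + TU^{d} \tag{4}\\
SWU^{d}(s) &= \sum_{S_q \in d,\; s \sqsubseteq S_q} su(S_q) \tag{5}\\
SU^{d}(s) &= \sum_{S_q \in d,\; s \sqsubseteq S_q} u(s, S_q) \tag{6}\\
u(s, S_q) &= \max_{o \,\in\, \mathrm{occ}(s, S_q)} \; \sum_{(i, I_j) \in o} iu(i, I_j, S_q) \tag{7}\\
miu(i, S_q) &= \max_{I_j \in S_q,\; i \in I_j} iu(i, I_j, S_q) \tag{8}\\
iu(i, I_j, S_q) &= q(i, I_j, S_q) \times p(i) \tag{9}
\end{align}
```

Here $\mathrm{occ}(s, S_q)$ stands for the set of occurrences of $s$ in $S_q$, each occurrence being a choice of one itemset position per item of $s$.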

Step 4.2.2: For each sequence in the large sequence-weighted-utility sequence set stored for the original database, perform sub-steps 4.2.2.1 to 4.2.2.3.

Step 4.2.3: For each sequence in the pre-large sequence-weighted-utility sequence set stored for the original database, likewise perform sub-steps 4.2.2.1 to 4.2.2.3 of step 4.2.2.

If the large or the pre-large sequence-weighted-utility sequence set of the original database contains a sequence that occurs in the database to be inserted, update the sequence utility and the sequence-weighted utility stored for that sequence in those sets, and put the sequence into the 1-candidate set, which is used to generate the 2-candidate set. If neither set contains the sequence, no update is needed and the sequence is removed from the 1-candidate set.

Step 4.2.4: Generate the candidate set of the next item count from the current candidate set, increase the item count by 1, and repeat steps 4.2.1 to 4.2.4 until no updated large or pre-large sequence-weighted-utility sequence set is found.

Further, the sub-steps of step 4.2.2 are as follows:

Sub-step 4.2.2.1: Update the sequence-weighted utility of the sequence in the new database by formula (10):

(10)

where the first term is the sequence-weighted utility of the sequence in the original database, which is read from the stored sequence set, and the second term is the sequence-weighted utility of the sequence in the database to be inserted.

Sub-step 4.2.2.2: Update the sequence utility of the sequence in the new database by formula (11):

(11)

where the first term is the sequence utility of the sequence in the original database, which is read from the stored sequence set, and the second term is the sequence utility of the sequence in the database to be inserted.

Sub-step 4.2.2.3: If the updated weighted utility ratio of the sequence reaches the upper utility threshold, put the sequence into the large sequence-weighted-utility sequence set of the new database for the current item count. If the ratio lies between the lower and the upper utility threshold, put the sequence into the pre-large sequence-weighted-utility sequence set of the new database for the current item count. Otherwise, discard the sequence.

Further, the specific process of step 4.3 is as follows:

Step 4.3.1: Merge the database to be inserted with the original database to generate the new database.

Step 4.3.2: For each sequence, calculate its sequence-weighted utility in the new database in the same way as in formula (5), and then calculate the total utility of the new database in the same way as in formula (2).

Step 4.3.3: Compute the weighted utility ratio of each sequence. If the ratio reaches the upper utility threshold, put the sequence into the large sequence-weighted-utility sequence set of the new database for the current item count; if the ratio lies between the lower and the upper utility threshold, put the sequence into the pre-large sequence-weighted-utility sequence set of the new database for the current item count; otherwise, discard the sequence.

Step 4.3.4: Run the recursive mining algorithm, which builds the projected databases of the multi-item sets and generates their large and pre-large sequence-weighted-utility sequence sets until no further such sets are found. Mining starts from the 1-sequence sets, continues with the 2-sequence sets, and stops when the last sequence set is empty. The large and the pre-large sequence-weighted-utility sequence sets of the new database are then output and kept for use at the next data insertion.

Further, in step 4.3.4, the recursive mining algorithm proceeds as follows:

Step 4.3.4.1: Traverse the large and the pre-large sequence-weighted-utility sequence sets, and build the projected database of every sequence that belongs to them.

Step 4.3.4.2: Calculate the sequence-weighted utility of each extended sequence, formed by appending an extension itemset to the sequence. If the weighted utility ratio of the extended sequence reaches the upper utility threshold, calculate its sequence utility and put it into the large set of the next item count; if the ratio lies between the lower and the upper utility threshold, calculate its sequence utility and put it into the pre-large set of the next item count; if neither condition holds, do nothing.

Step 4.3.4.3: Pass in the current parameters and call the mining procedure recursively until both the large and the pre-large sets of the next item count are empty, then stop. These two sets are, respectively, the large and the pre-large sequence-weighted-utility sequence sets of the new database with one more item than the current sets.

Beneficial technical effects of the present invention:

A new sequential pattern mining algorithm, Pre-HUSPM, is proposed for handling sequence insertion. When only a small amount of data is inserted, the whole database does not need to be updated, which avoids wasting resources.

The matrix-projection-based high-utility sequential pattern mining algorithm (P-HUSPM) reduces the number of candidates generated during sequence mining and so shortens the time needed to mine high-utility sequence sets. Because the database no longer has to be rescanned frequently, running time is reduced substantially.

A new concept is introduced and used as a safety threshold for deciding whether the database needs to be rescanned, which reduces the number of database rescans and lowers maintenance cost.

Brief Description of the Drawings

FIG. 1 is a flow chart of the Pre-HUSPM-based database sequence insertion processing method of the present invention.

FIG. 2 compares the running time of the three algorithms on the SIGN dataset under different lower utility thresholds, with the upper utility threshold set to 15%.

FIG. 3 compares the running time of the three algorithms on the LEVIATHAN dataset under different lower utility thresholds, with the upper utility threshold set to 18%.

FIG. 4 compares the running time of the three algorithms on the FIFA dataset under different lower utility thresholds, with the upper utility threshold set to 21%.

FIG. 5 compares the running time of the three algorithms on the BIBLE dataset under different lower utility thresholds, with the upper utility threshold set to 16%.

FIG. 6 compares the running time of the three algorithms on the Kosarak10k dataset under different lower utility thresholds, with the upper utility threshold set to 14%.

FIG. 7 compares the running time of the three algorithms on the BMS dataset under different lower utility thresholds, with the upper utility threshold set to 4.5%.

Detailed Description

The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

The databases referred to in the present invention are sequence databases, whose sequences are classified as large, pre-large, or small. A sequence whose support is greater than the upper support threshold is a large sequence; a sequence whose support is less than the upper support threshold and greater than the lower support threshold is a pre-large sequence; a sequence whose support is less than the lower support threshold is a small sequence. A pre-large sequence is likely to become a large sequence in the future.

The present invention combines the pre-large concept with the projection-based mining algorithm P-HUSPM to obtain the Pre-HUSPM algorithm. Its main idea is to set a threshold that serves as the condition for deciding whether the database must be rescanned, so that the database sequences are maintained and updated effectively and the number of rescans is reduced. The threshold is the maximum sequence-weighted utility of a single item in the database to be inserted.

Adding a new sequence database to the original sequence database gives rise to nine cases, according to the class of a sequence in each database. In cases 1, 2, and 3, a sequence that is large in the original database is, respectively, large, pre-large, or small in the new sequence database. In cases 4, 5, and 6, a sequence that is pre-large in the original database is, respectively, large, pre-large, or small in the new sequence database. In cases 7, 8, and 9, a sequence that is small in the original database is, respectively, large, pre-large, or small in the new sequence database.

Cases 1, 5, 6, 8, and 9 amount to weighted averages of the counts and do not affect the final large sequence set. Cases 2 and 3 may remove some existing large sequence sets, while cases 4 and 7 may add new ones. Cases 2, 3, and 4 are handled well when both the large and the pre-large sequence sets are retained.
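The nine cases form a 3-by-3 grid, which a short lookup can make concrete. The sketch below is only an illustration of the numbering described in the text; the names `LEVELS`, `EFFECT`, and `case_number` are hypothetical.

```python
LEVELS = ("large", "pre-large", "small")

# Effect of each case on the final large sequence set, as stated in the text.
EFFECT = {
    1: "no change", 5: "no change", 6: "no change",
    8: "no change", 9: "no change",
    2: "may remove existing large sequences",
    3: "may remove existing large sequences",
    4: "may add new large sequences",
    7: "may add new large sequences",
}


def case_number(original_level: str, inserted_level: str) -> int:
    """Map (class in the original database, class in the inserted database) to 1..9."""
    row = LEVELS.index(original_level)   # 0 = large, 1 = pre-large, 2 = small
    col = LEVELS.index(inserted_level)
    return 3 * row + col + 1
```

A sequence that is small in the original database but large in the inserted one maps to case 7, the case that cannot be resolved from the stored large and pre-large sets alone.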

Case 7 is the main focus of the present invention. When case 7 arises and the amount of inserted data is not large, the database does not actually need to be updated, yet the prior art updates it anyway, which wastes resources.

To address this problem, the present invention proposes a Pre-HUSPM-based database sequence insertion processing method, which relies on the following theorem; a proof is given.

Theorem. Consider the lower and the upper utility thresholds, the total utility of the original database, and the maximum sequence-weighted utility of a single item in the database to be inserted. If the safety condition holds, then a sequence of case 7 cannot become a high sequence-weighted-utility sequence in the updated database as a whole.

Proof: The condition of the theorem leads to a derivation in five steps.

For a sequence in case 7, the sequence-weighted utility of the sequence is small in the original database, so it is below the bound that separates small sequences from pre-large ones there.

If the sequence has a large sequence-weighted utility in the database to be inserted, that sequence-weighted utility must be greater than or equal to the bound for large sequences in that database, but less than or equal to the total utility of the database to be inserted, so it is bounded on both sides.

In sequence mining, the ratio of the updated sequence in the new database formed by the insertion is calculated from the sequence-weighted utility of the sequence in the new database, which is obtained from its sequence-weighted utility in the original database. Therefore, when the tested quantity is less than the safety value, the original database does not need to be rescanned.

According to this theorem, sequences in case 7 can be handled efficiently.
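The derivation lines of the proof are not legible in this text. The standard pre-large argument that the proof appears to follow can be sketched as below, under assumptions stated here rather than taken from the patent: $t$ is the utility added since the last rescan with $t \le f$; $f = (S_u - S_l)\,TU^{D}/(1 - S_u)$ with $S_u < 1$ (taking a floor only makes $f$ smaller); a case-7 sequence $s$ satisfies $SWU^{D}(s) < S_l\,TU^{D}$; and its weighted utility in the inserted data is at most $t$.

```latex
\begin{align}
t \le \frac{(S_u - S_l)\,TU^{D}}{1 - S_u}
  &\;\Longrightarrow\; (1 - S_u)\,t \le (S_u - S_l)\,TU^{D} \\
  &\;\Longrightarrow\; S_l\,TU^{D} + t \le S_u\,(TU^{D} + t) \\
\frac{SWU^{U}(s)}{TU^{U}}
  = \frac{SWU^{D}(s) + SWU^{d}(s)}{TU^{D} + t}
  &< \frac{S_l\,TU^{D} + t}{TU^{D} + t} \le S_u
\end{align}
```

Under these assumptions, a sequence that was small at the last rescan stays below the upper threshold in the updated database, which is why no rescan is needed while $t \le f$.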

A Pre-HUSPM-based database sequence insertion processing method specifically comprises the following steps:

Step 1: Insert the database to be inserted into the original database.

In this embodiment, the original database is a transaction database, and the database to be inserted is a new transaction database.

Both the original transaction database and the new transaction database contain a set of sequences. The original database consists of a number of sequences, each identified by its sequence index and carrying a unique identifier. Each sequence is a collection of itemsets, and each itemset is a set of distinct items, so that an individual item is addressed by its sequence, its itemset, and its position within the itemset.

The original transaction database contains five sequences over five items, and each sequence has its own collection of itemsets. The unit profits of the five items are 3, 2, 4, 2, and 1, respectively; they are stored in the database in tabular form as an item profit table.

The database to be inserted contains two sequences, each with its own collection of itemsets.

Step 2: Calculate the safety value from the information of the original database.

The safety value is calculated by formula (1):

(1)

where the inputs are the upper utility threshold, the lower utility threshold, and the total utility of the original database; the values of the two thresholds are preset.

The total utility of the original database is calculated by formula (2):

(2)

where each term is the total utility of one sequence in the original database, calculated by formula (3):

(3)

where each term is the utility of one item in one itemset of the sequence.

In this embodiment, the upper utility threshold is preset to 0.35, which is the same as the high-utility sequential pattern threshold, and the lower utility threshold is set to 0.25. The total utilities of the five sequences of the original database are calculated as 36, 26, 28, 23, and 28; the total utility of the original database is 141; and the safety value is 21.
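The embodiment's numbers can be checked with a few lines of Python. This is a sketch under an assumption: the safety value is taken to be the floor of (upper − lower) × total utility / (1 − upper), the usual pre-large form, chosen because it reproduces the reported value of 21; the function name `safety_value` is hypothetical.

```python
import math


def safety_value(upper: float, lower: float, total_utility: float) -> int:
    """Safety value under the assumed form floor((Su - Sl) * TU / (1 - Su))."""
    if not 0.0 <= lower < upper < 1.0:
        raise ValueError("expected 0 <= lower < upper < 1")
    return math.floor((upper - lower) * total_utility / (1.0 - upper))


sequence_utilities = [36, 26, 28, 23, 28]   # values reported in the embodiment
total = sum(sequence_utilities)              # total utility of the original database
f = safety_value(0.35, 0.25, total)          # safety value for the embodiment
```

The sum of the five sequence utilities is 141, and the unrounded bound is about 21.69, so the floor gives 21.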

Step 3: Scan the database to be inserted, and calculate the total utility of each of its sequences and the total utility of the database to be inserted as a whole.

The total utility of the database to be inserted is calculated in the same way as in formulas (2) and (3), together with the total utility of each of its sequences, substituting the data of the database to be inserted.

In this embodiment, the total utilities of the two sequences of the database to be inserted are 10 and 7, and the total utility of the database to be inserted is 17.

Step 4: Take the sum of the total utility of the new transactions added since the original database was last rescanned and the maximum sequence-weighted utility of a single item in the database to be inserted, compare this sum with the safety value, and act according to the result. The decision rule is as follows. Let the accumulated value be the total utility of the new transactions added since the original database was last rescanned. If the sum stays within the safety value, perform steps 4.1 and 4.2; if it exceeds the safety value, perform step 4.3.

Step 4.1: Scan the database to be inserted to generate the 1-candidate set, and set the item count to 1; the item count is the number of items in the sequence sets currently being processed.

In this embodiment, the 1-candidate set is generated from the two sequences of the database to be inserted.

Step 4.2: Scan the 1-candidate set and update the stored sequence utility and sequence-weighted utility; then generate the 2-candidate set and update the stored values again, and so on until no further candidate set is generated. At the same time, update the accumulated value. The specific process is as follows:

Step 4.2.1: Calculate the total utility of the new database by formula (4):

(4)

In this embodiment, the total utility of the new database is 141 + 17 = 158.

For each candidate in the candidate set, calculate the sequence-weighted utility and the sequence utility of the candidate sequence in the database to be inserted by formulas (5) and (6):

(5)

(6)

Formula (5) uses the total utility of each whole sequence (row) that contains the candidate. Formula (6) uses the utility of the candidate as a subsequence of a sequence, which is the maximum utility among all of its occurrences in that sequence, defined by formula (7):

(7)

The maximum internal utility of an item in a sequence is defined by formula (8):

(8)

The internal utility of an item in an itemset of a sequence is defined by formula (9):

(9)

where the two factors are the quantity of the item in that itemset of the sequence and the unit profit of the item.

For example, in this embodiment the calculation yields the values 10 and 8.

As a further example, a sequence can be written as an ordered list of itemsets. An item that appears in two of these itemsets has one internal utility per occurrence; here the two internal utilities are 3×3=9 and 2×3=6.

Because the item appears twice in the sequence, its maximum utility in the sequence is the larger of the two values, 9.

A subsequence likewise appears twice in the sequence, with utilities (3×3)+(4×2)=17 and (3×2)+(4×2)=14, so its utility in the sequence is 17.
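The arithmetic of this example can be reproduced as follows. The sketch assumes a reading of the numbers that the text does not spell out: the first item has unit profit 3 and occurs with quantities 3 and 2, and the second item has unit profit 4 and quantity 2 in both occurrences. The item names `a` and `c` and the function names are hypothetical.

```python
def internal_utility(quantity: int, unit_profit: int) -> int:
    """Utility of one occurrence of an item: quantity times unit profit."""
    return quantity * unit_profit


def max_utility(occurrence_utilities: list[int]) -> int:
    """Utility of an item or subsequence in a sequence: its best occurrence."""
    return max(occurrence_utilities)


profit = {"a": 3, "c": 4}   # hypothetical item names, profits from the example

# Item "a" occurs twice, with quantities 3 and 2.
a_occurrences = [internal_utility(3, profit["a"]),
                 internal_utility(2, profit["a"])]
a_best = max_utility(a_occurrences)

# Subsequence <a, c> occurs twice; "c" has quantity 2 in both occurrences.
c_utility = internal_utility(2, profit["c"])
ac_best = max_utility([u + c_utility for u in a_occurrences])
```

Taking the maximum over occurrences, rather than the sum, is what keeps a repeated item from being counted more than once per sequence.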

Step 4.2.2: For each sequence in the large sequence-weighted-utility sequence set stored for the original database, perform the following sub-steps:

Sub-step 4.2.2.1: Update the sequence-weighted utility of the sequence in the new database by formula (10):

(10)

where the first term is the sequence-weighted utility of the sequence in the original database, which is read from the stored sequence set, and the second term is the sequence-weighted utility of the sequence in the database to be inserted.

In this embodiment, for one sequence of the stored set, the updated sequence-weighted utility is 76 + 7 = 83.

Sub-step 4.2.2.2: Update the sequence utility of the sequence in the new database by formula (11):

(11)

where the first term is the sequence utility of the sequence in the original database, which is read from the stored sequence set, and the second term is the sequence utility of the sequence in the database to be inserted.

In this embodiment, for one sequence of the stored set, the updated sequence utility is 30 + 3 = 33.

Sub-step 4.2.2.3: If the ratio of the updated sequence weighted utility of the sequence S to the total utility of the new database U is at least the utility threshold upper limit, put S into the large sequence weighted utility sequence set of the new database U; if the ratio lies between the utility threshold lower limit and the utility threshold upper limit, put S into the pre-large sequence weighted utility sequence set of the new database U; otherwise, discard S, because it is still small after the database update.

In the embodiment of the present invention, the ratio is 52.5% > 35%, so the sequence remains in the large sequence weighted utility sequence set.
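The three-way decision can be sketched in Python as follows. This is a minimal illustration rather than the patented procedure. The function name `classify`, the total utility of 158 (chosen so that 83/158 is about 52.5%), the lower limit of 0.30, and the use of "greater than or equal" at both boundaries are all assumptions.

```python
def classify(swu_new, total_utility_new, upper, lower):
    """Classify a sequence by its weighted utility ratio in the new database."""
    ratio = swu_new / total_utility_new
    if ratio >= upper:       # assumed boundary: at least the upper limit
        return "large"
    if ratio >= lower:       # assumed boundary: at least the lower limit
        return "pre-large"
    return "discard"


# 83 / 158 is about 0.525, which exceeds the upper limit of 0.35.
print(classify(83, 158, upper=0.35, lower=0.30))  # large
print(classify(50, 158, upper=0.35, lower=0.30))  # pre-large
print(classify(10, 158, upper=0.35, lower=0.30))  # discard
```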

Step 4.2.3: For each sequence in the pre-large sequence weighted utility sequence set of the original database, likewise perform sub-steps 4.2.2.1 to 4.2.2.3 of step 4.2.2.

If the large sequence weighted utility sequence set or the pre-large sequence weighted utility sequence set of the original database contains a sequence from the database to be inserted, the sequence utility and sequence weighted utility values of the itemsets in these sets are updated, and the sequence is put into the 1-candidate set, which is used to generate the 2-candidate set; if neither set contains the sequence, no update is needed and the sequence is removed from the 1-candidate set.

For example, in the embodiment of the present invention, the items that are contained in these sets are added to the 1-candidate set, and those that are not are removed. From the 1-candidate set the 2-candidate sets can be generated, and their sequence utility and sequence weighted utility values are mined from the database to be inserted; if a candidate does not exist there, the value is 0. The process continues in the same way until the candidate set is empty.

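Step 4.2.4: From the current candidate set, generate the candidate set whose sequences contain one more item; then increase the number of items being processed by 1 and repeat steps 4.2.1 to 4.2.4 until no updated large or pre-large sequence weighted utility sequence set is found.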

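Step 4.3: When the condition for carrying out steps 4.1 and 4.2 is not satisfied, the new database is generated, and the original database needs to be rescanned. The buffered total utility of the sequences inserted since the last rescan is set to 0, and the newly computed value is assigned to the corresponding stored value. The specific process is as follows: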

Step 4.3.1: Merge the database to be inserted with the original database D to generate the new database U;

Step 4.3.2: For each sequence, calculate its sequence weighted utility in the new database U in the same way as formula (5), and then calculate the total utility of the new database U in the same way as formula (2);

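Step 4.3.3: Let the weighted utility ratio of a sequence be the ratio of its sequence weighted utility to the total utility of the new database U. If the ratio is at least the utility threshold upper limit, put the sequence into the large sequence weighted utility sequence set of the new database; if the ratio lies between the utility threshold lower limit and the utility threshold upper limit, put the sequence into the pre-large sequence weighted utility sequence set of the new database; otherwise, discard the sequence, because it is still small after the database update.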

Step 4.3.4: Execute the recursive mining algorithm, which builds projected databases and generates the large and pre-large sequence weighted utility sequence sets until no further such sets are found. Mining starts from the 1-sequence sets and continues with the 2-sequence sets, and so on; when the last sequence set is empty, the mining process stops and outputs the large sequence weighted utility sequence set and the pre-large sequence weighted utility sequence set of the new database U. Both sets are kept for the next data insertion.

The specific process is as follows:

Step 4.3.4.1: Traverse the large and pre-large sequence weighted utility sequence sets and, for each sequence belonging to them, build its projected database. This reduces the number of candidates and improves the running speed. The size index of a sequence set denotes the number of items currently being processed. The projected database is built as follows: find every sequence that has the item as a sequence prefix; a sequence that does not contain the item is not retained.

Definition: Consider two sequences, one of which is a prefix contained in the other. A subsequence of the containing sequence is called the projected sequence with respect to the prefix if (1) the subsequence has that prefix, and (2) among the subsequences of the containing sequence that have that prefix, it has no further supersequence, that is, it is the maximal one. The projected database of a sequence in the new database U is therefore the set of the projected sequences, with respect to that sequence, of every sequence in the database.
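The projection for a single-item prefix can be sketched in Python. The sketch makes simplifying assumptions that are not in the text: each sequence is a plain list of single items without quantities or itemsets, and the identifiers `s1` to `s3`, the items `a` to `c`, and the function names are hypothetical.

```python
def project(sequence, item):
    """Return the maximal subsequence of `sequence` that has `item` as prefix.

    For a list of single items this is the part of the sequence starting at
    the first occurrence of `item`; None means the item does not occur.
    """
    if item not in sequence:
        return None
    return sequence[sequence.index(item):]


def projected_database(database, item):
    """Keep only the sequences that contain `item`, each replaced by its projection."""
    projections = {}
    for sid, sequence in database.items():
        projected = project(sequence, item)
        if projected is not None:   # sequences without the item are not retained
            projections[sid] = projected
    return projections


database = {"s1": ["a", "b", "c"], "s2": ["b", "c"], "s3": ["c", "a", "b"]}
print(projected_database(database, "a"))
# {'s1': ['a', 'b', 'c'], 's3': ['a', 'b']}
```

Sequence `s2` is dropped because it does not contain the prefix item, which mirrors the rule that such sequences are not retained in the projected database.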

For example, according to the above definition, to build the projected database of a sequence, every sequence that has it as a sequence prefix is found. A sequence that does not contain the item is not retained, so such a sequence does not appear in the projected database. In the embodiment, the projected database therefore contains only four sequences. For each of them it records the set of items of the sequence and the total utility of the sequence; the total utilities of the four sequences are 36, 9, 9, and 22, respectively.

Step 4.3.4.2: Calculate the sequence weighted utility of each extension of the sequence, where an extension is an extended itemset of the sequence. If the extension satisfies the condition for the large set, calculate its sequence utility and put it into the large sequence weighted utility sequence set; if it satisfies the condition for the pre-large set, calculate its sequence utility and put it into the pre-large sequence weighted utility sequence set; if neither condition is satisfied, nothing is done.

Step 4.3.4.3: Pass in the current parameters and recursively call the mining procedure until both the large and the pre-large sequence sets are empty, then stop.

The pseudocode of the recursive mining algorithm is as follows:

for each sequence S in the large and pre-large sequence sets do
    build the projected database of S
end for
for each extension of S in the projected database do
    calculate the sequence weighted utility of the extension
    if the extension satisfies the condition for the large set then
        calculate the sequence utility of the extension
        put the extension into the large sequence weighted utility sequence set
    else if the extension satisfies the condition for the pre-large set then
        calculate the sequence utility of the extension
        put the extension into the pre-large sequence weighted utility sequence set
    end if
end for
call Mining recursively with the current parameters

Step 5: For each sequence S in the large sequence weighted utility sequence set of the new database U, determine whether its utility ratio is greater than or equal to the utility threshold upper limit. If so, S is a high-utility sequence pattern, and S is added to the set of high-utility sequence patterns; otherwise, no operation is needed. Finally, output the updated new database and its high-utility sequence pattern set.

In the embodiment of the present invention, the utility ratio of the sequence is 35.4% > 35%, so it is a high-utility sequence and is added to the set of high-utility sequence patterns.

The large and pre-large sequence weighted utility sequence sets finally obtained are as follows:

The large sequence weighted utility sequence set contains four sequence sets, whose sequence weighted utility and sequence utility are, in order: 83 and 22; 95 and 56; 77 and 20; 77 and 16.

The pre-large sequence weighted utility sequence set contains five sequence sets, whose sequence weighted utility and sequence utility are, in order: 53 and 18; 46 and 17; 43 and 17; 52 and 32; 54 and 38.

The high-utility sequence pattern set of the updated new database contains only one sequence set, whose sequence weighted utility is 95, sequence utility is 56, and utility ratio is 35.4%.
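The final selection can be reproduced with a short Python sketch. The labels `p1` to `p4` and the function name are hypothetical, and the total utility of 158 is an assumption inferred from the stated ratio (56/158 is about 35.4%); it is not given in the text. The pairs are the sequence weighted utility and sequence utility values listed for the large set.

```python
def high_utility_patterns(large_set, total_utility, upper):
    """Return the labels whose sequence utility ratio reaches the upper limit.

    large_set -- {label: (sequence weighted utility, sequence utility)}
    """
    return [label
            for label, (_swu, su) in large_set.items()
            if su / total_utility >= upper]


large_set = {"p1": (83, 22), "p2": (95, 56), "p3": (77, 20), "p4": (77, 16)}
print(high_utility_patterns(large_set, total_utility=158, upper=0.35))  # ['p2']
```

Only the actual sequence utility is compared with the threshold here; the sequence weighted utility is an upper bound that was used earlier to decide which sequences stay in the large set.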

In the present invention, the pseudocode of the Pre-HUSPM algorithm is as follows:

Input: an item profit table; the original database; the utility threshold upper limit (the same as the minimum high sequence utility threshold); the utility threshold lower limit; the total utility of the original database; the set of large sequence weighted utility sequences and the set of pre-large sequence weighted utility sequences, together with their sequence weighted utility values and the actual utility values found in the original database; the safety utility buffer, which holds the total utility of the sequences processed since the last rescan; and the database to be inserted.

Output: the set of high-utility sequence patterns of the new database.

compute the safety sequence utility bound
for each sequence in the database to be inserted do
    scan the database to be inserted and compute the utility of the sequence
end for
compute the total utility of the database to be inserted and the maximum sequence weighted utility of a single item in it
if the condition for steps 4.1 and 4.2 holds then
    compute the total utility of the new database
    set the number of items being processed to 1
    generate the 1-item candidate set
    while the current candidate set is not null do
        for each candidate in the current candidate set do
            compute its sequence weighted utility in the database to be inserted
            compute its sequence utility in the database to be inserted
        end for
        for each sequence in the large sequence weighted utility sequence set of the original database do
            call the utility summation algorithm
        end for
        for each sequence in the pre-large sequence weighted utility sequence set of the original database do
            call the utility summation algorithm
        end for
        generate the next candidate set from the union of the large and pre-large sequence sets
        increase the number of items being processed by 1
    end while
else
    merge the database to be inserted and the original database to generate the new database
    for each sequence do
        compute its sequence weighted utility in the new database
    end for
    compute the total utility of the new database
    set the number of items being processed to 1
    for each sequence do
        if it satisfies the condition for the large set then
            add it to the large sequence weighted utility sequence set
        else if it satisfies the condition for the pre-large set then
            add it to the pre-large sequence weighted utility sequence set
        end if
    end for
    remove from the new database every item that is in neither the large nor the pre-large set, and treat the result as the new database
    call Mining with the current parameters
end if
for each sequence in the large sequence weighted utility sequence set of the new database do
    if its utility ratio is greater than or equal to the utility threshold upper limit then
        put the sequence into the set of high-utility sequence patterns
    end if
end for
if the original database was rescanned then
    set the safety utility buffer to 0 and carry out the assignment of step 4.3
else
    update the safety utility buffer
end if
keep the updated sequence sets and values for the next insertion

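The pseudocode of the utility summation algorithm used in the above pseudocode is as follows:

update the sequence weighted utility of the sequence S in the new database as in formula (10)
update the sequence utility of the sequence S in the new database as in formula (11)
if S satisfies the condition for the large set then
    put S into the large sequence weighted utility sequence set
else if S satisfies the condition for the pre-large set then
    put S into the pre-large sequence weighted utility sequence set
end if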

To demonstrate the superiority and feasibility of the algorithm of the present invention, a comparative experiment was carried out. The Pre-HUSPM algorithm proposed in the present invention was compared with the P-HUSPM algorithm and the Pre-HUSPM-TSU algorithm. The experiment used six real data sets of different sizes and characteristics: SIGN, LEVIATHAN, FIFA, BIBLE, Kosarak10k, and BMS, all taken from the SPMF website. SIGN is a dense data set containing many very long sequences; LEVIATHAN and FIFA are medium-density data sets containing many long sequences; BIBLE is a medium-density data set containing many medium-length sequences; BMS and Kosarak10k are sparse data sets with only a few long sequences. All data sets follow a Gaussian distribution. In the experiment, each data set was divided into one original data set and 100 new data sets. The characteristics of the data sets are as follows:

SIGN: 730 sequences, 267 distinct items, average sequence length 52, maximum sequence length 94, 230 sequences in the original database, 5 sequences in the database to be inserted;
LEVIATHAN: 5834 sequences, 9025 distinct items, average sequence length 33.8, maximum sequence length 100, 2834 sequences in the original database, 30 sequences in the database to be inserted;
FIFA: 20450 sequences, 2990 distinct items, average sequence length 36.2, maximum sequence length 100, 10450 sequences in the original database, 100 sequences in the database to be inserted;
BIBLE: 36369 sequences, 13905 distinct items, average sequence length 21.6, maximum sequence length 100, 21369 sequences in the original database, 150 sequences in the database to be inserted;
Kosarak10k: 10000 sequences, 10094 distinct items, average sequence length 8.1, maximum sequence length 608, 1000 sequences in the original database, 90 sequences in the database to be inserted;
BMS: 59601 sequences, 497 distinct items, average sequence length 2.5, maximum sequence length 267, 39601 sequences in the original database, 200 sequences in the database to be inserted.
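The stated split (one original data set plus 100 new data sets) can be checked against the listed sizes with a few lines of Python. The tuple layout and the function name are illustrative; the numbers are those given above.

```python
NEW_DATASETS = 100  # each data set is split into one original and 100 new data sets

# name: (total sequences, sequences in original database, sequences per insertion)
datasets = {
    "SIGN": (730, 230, 5),
    "LEVIATHAN": (5834, 2834, 30),
    "FIFA": (20450, 10450, 100),
    "BIBLE": (36369, 21369, 150),
    "Kosarak10k": (10000, 1000, 90),
    "BMS": (59601, 39601, 200),
}


def split_is_consistent(total, original, per_insertion, batches=NEW_DATASETS):
    """True if the original part plus all inserted batches equals the total."""
    return original + batches * per_insertion == total


for name, sizes in datasets.items():
    print(name, split_is_consistent(*sizes))
```

For all six data sets the original database plus 100 insertions of the stated size adds up exactly to the total number of sequences.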

In the experiments, the utility threshold upper limit is held fixed on each of the six data sets, and different utility threshold lower limits are selected for comparison; the results are shown in Figs. 2 to 7. The experiments show that the Pre-HUSPM-TSU algorithm has a shorter running time than the HUSPM algorithm. The optimized algorithm proposed by the present invention replaces the quantity used in the Pre-HUSPM-TSU algorithm with a different one; the resulting variant is the Pre-HUSPM algorithm referred to in the present invention, and its running time is considerably shorter than that of HUSPM and Pre-HUSPM-TSU. The proposed algorithm therefore runs faster on larger, non-dense data sets and performs well in terms of running time.

By selecting different utility threshold lower limits, it is found that if the lower limit is set too small, rescanning the database becomes slower, because too many pre-large sequence sets are generated. If the lower limit is set too close to the upper limit, the safety value becomes too small, so the database may have to be rescanned every time new data is added, which also slows down operation. In practical applications, the lower and upper limits therefore need to be set reasonably. Both range from 0 to 1, and the lower limit must be set below the upper limit; the specific values are set according to the needs of the user.

Of course, the above description does not limit the present invention, and the present invention is not restricted to the above examples. Changes, modifications, additions, or substitutions made by those skilled in the art within the essential scope of the present invention also fall within the protection scope of the present invention.

Claims

1. A database sequence insertion processing method based on Pre-HUSPM, characterized in that an incremental algorithm Pre-HUSPM is constructed to efficiently mine high-utility sequence patterns, the method specifically comprising the following steps:

step 1, inserting the database to be inserted into the original database;

step 2, calculating a safety value according to the information of the original database;

step 3, scanning the database to be inserted, and calculating the total utility of each sequence in the database to be inserted and the total utility of the database to be inserted;

step 4, comparing the sum of the total utility value of the new transactions since the original database was last rescanned and the maximum sequence weighted utility of a single item in the database to be inserted with the safety value, and performing the corresponding operation according to the comparison result;

step 5, judging whether the utility ratio of each sequence in the large sequence weighted utility sequence set of the new database is greater than or equal to the utility threshold upper limit; if so, the sequence is a high-utility sequence pattern and is added to the high-utility sequence pattern set and output; otherwise, no operation is needed; and finally outputting the new database after the database update and its high-utility sequence pattern set.

2. The Pre-HUSPM-based database sequence insertion processing method according to claim 1, wherein in step 1, a primary database is provided

，

Is the total number of sequences, is based on>

Is a serial number of the sequence, is asserted>

Is shown as

Or a sequence, is>

Set an item>

，

Is the total number of items, the item->

Is->

A collection of different items, represented as

，

Indicates that the item is pick>

Is greater than or equal to>

And (4) items.

3. The Pre-HUSPM-based database sequence insertion processing method according to claim 2, characterized in that in step 2 the safety value is calculated by formula (1) from the utility threshold upper limit, the utility threshold lower limit, and the total utility of the original database, the upper and lower limits being preset values;

the total utility of the original database is calculated by formula (2) from the utility of each sequence in the original database;

the utility of a sequence is calculated by formula (3) from the utility of each item in the sequence.

4. The Pre-HUSPM-based database sequence insertion processing method according to claim 3, wherein in the step 3, the database to be inserted is calculated in the same manner as the formulas (2) and (3)

Total utility->

At the same time counting>

The database to be inserted is included in the calculation time>

The correlation data of (a).

5. The Pre-HUSPM-based database sequence insertion processing method according to claim 4, wherein the specific judgment criteria in step 4 are as follows: a buffer value is defined as the total utility value of the new transactions since the original database was last rescanned; when the sum compared in step 4 does not exceed the safety value, step 4.1 and step 4.2 are carried out; otherwise, step 4.3 is carried out;

step 4.1, scanning the database to be inserted to generate the 1-candidate set, and setting the number of items being processed in the sequence set to 1;

step 4.2, scanning the 1-candidate set, updating the sequence utility and the sequence weighted utility of the original information, then generating the 2-candidate set, and continuing to update the sequence utility and the sequence weighted utility of the original information until no candidate set is generated; at the same time, updating the buffer value;

step 4.3, generating the new database and rescanning the original database; setting the buffer value to 0 and assigning the newly computed value to the corresponding stored value.

6. The Pre-HUSPM-based database sequence insertion processing method according to claim 5, wherein the specific process of step 4.2 is as follows:

step 4.2.1, calculating the total utility of the new database by formula (4); for each candidate in the candidate set, calculating its sequence weighted utility and its sequence utility in the database to be inserted by formulas (5) and (6); in these formulas, one term represents the total utility value of a sequence (row), and another represents the utility of a subsequence in a sequence, which is the maximum utility over all occurrences of the subsequence in the sequence, as defined by formula (7); the maximum internal utility of an item in a sequence is the maximum utility value of the item in the sequence, as defined by formula (8); the internal utility of an item in a sequence is defined by formula (9) in terms of the quantity of the item in the sequence and the unit profit of the item;

step 4.2.2, for each sequence in the large sequence weighted utility sequence set of the original database, performing sub-step 4.2.2.1 to sub-step 4.2.2.3;

step 4.2.3, for each sequence in the pre-large sequence weighted utility sequence set of the original database, likewise performing sub-step 4.2.2.1 to sub-step 4.2.2.3 of step 4.2.2;

if the large sequence weighted utility sequence set or the pre-large sequence weighted utility sequence set of the original database contains a sequence from the database to be inserted, updating the sequence utility and sequence weighted utility values of the itemsets in these sets and putting the sequence into the 1-candidate set, which is used to generate the 2-candidate set; if neither set contains the sequence, removing the sequence from the 1-candidate set;

step 4.2.4, from the current candidate set, generating the candidate set whose sequences contain one more item; increasing the number of items being processed by 1, and repeating steps 4.2.1 to 4.2.4 until no updated large or pre-large sequence weighted utility sequence set is found.

7. The Pre-HUSPM-based database sequence insertion processing method according to claim 6, characterized in that the sub-steps of step 4.2.2 are as follows:

sub-step 4.2.2.1, updating the sequence weighted utility of the sequence in the new database by formula (10), as the sum of the sequence weighted utility of the sequence in the original database, which is stored together with the sequence in the sequence set of the original database, and the sequence weighted utility of the sequence in the database to be inserted;

sub-step 4.2.2.2, updating the sequence utility of the sequence in the new database by formula (11), as the sum of the sequence utility of the sequence in the original database, which is stored together with the sequence in the sequence set of the original database, and the sequence utility of the sequence in the database to be inserted;

sub-step 4.2.2.3, if the sequence satisfies the condition for the large set, putting the sequence into the large sequence weighted utility sequence set of the new database; if the sequence satisfies the condition for the pre-large set, putting the sequence into the pre-large sequence weighted utility sequence set of the new database; otherwise, discarding the sequence.

8. The Pre-HUSPM-based database sequence insertion processing method according to claim 7, wherein the specific process of step 4.3 is as follows:

step 4.3.1, merging the database to be inserted and the original database to generate the new database;

step 4.3.2, for each sequence, calculating its sequence weighted utility in the new database in the same way as formula (5), and then calculating the total utility of the new database in the same way as formula (2);

step 4.3.3, determining the weighted utility ratio of the sequence; if the ratio satisfies the condition for the large set, putting the sequence into the large sequence weighted utility sequence set of the new database; if the ratio satisfies the condition for the pre-large set, putting the sequence into the pre-large sequence weighted utility sequence set of the new database; otherwise, discarding the sequence;

step 4.3.4, executing a recursive mining algorithm, which builds projected databases and generates the large and pre-large sequence weighted utility sequence sets until no further such sets are found; when the mining process is executed, mining starts from the 1-sequence sets and continues with the 2-sequence sets; when the last sequence set is empty, the mining process stops and outputs the large sequence weighted utility sequence set and the pre-large sequence weighted utility sequence set of the new database, both of which are used for the next data insertion.

9. The Pre-HUSPM-based database sequence insertion processing method according to claim 8, wherein in the step 4.3.4, the specific process of the recursive mining algorithm is as follows:

step 4.3.4.1, traverse