CN115964415A - Pre-HUSPM-based database sequence insertion processing method - Google Patents


Info

Publication number
CN115964415A
Authority
CN
China
Prior art keywords
sequence
database
utility
weighted
original
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310250759.4A
Other languages
Chinese (zh)
Other versions
CN115964415B (en
Inventor
吴明泰
李凤洋
潘正祥
陈建铭
吴祖扬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University of Science and Technology
Original Assignee
Shandong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University of Science and Technology filed Critical Shandong University of Science and Technology
Priority to CN202310250759.4A priority Critical patent/CN115964415B/en
Publication of CN115964415A publication Critical patent/CN115964415A/en
Application granted granted Critical
Publication of CN115964415B publication Critical patent/CN115964415B/en
Expired - Fee Related legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a database sequence insertion processing method based on Pre-HUSPM, belonging to the field of data mining. The method comprises the following steps: inserting a database to be inserted into an original database; computing a safety value from the information of the original database; scanning the database to be inserted and computing the total utility of each of its sequences and the total utility of the database to be inserted as a whole; comparing the sum of the total utility of the new transactions inserted since the original database was last rescanned and the maximum sequence-weighted utility of a single item in the database to be inserted against the safety value, and acting according to the result; checking, for each sequence in the large sequence-weighted-utility sequence set of the new database, whether its utility ratio reaches the upper utility threshold; and finally outputting the updated new database and its set of high-utility sequential patterns. The invention reduces the number of database rescans and lowers the maintenance cost.

Description

A database sequence insertion processing method based on Pre-HUSPM

Technical Field

The invention belongs to the field of data mining, and in particular relates to a database sequence insertion processing method based on Pre-HUSPM.

Background Art

High-utility sequential pattern mining (HUSPM) algorithms can be used to analyze the shopping habits of users. HUSPM takes into account the weight, unit profit and similar attributes of each item. When the utility of a sequence is greater than the minimum utility threshold set by the user, the sequence is a high-utility sequential pattern. HUSPM algorithms usually run on a static database, but in practical applications new data is added almost every day, which may invalidate previously discovered high-utility sequential patterns or cause new patterns to appear once the database is updated. In traditional dynamic data mining, the original database therefore has to be rescanned every time a small amount of data arrives, and rescanning consumes a great deal of resources and time. In particular, inserting a small amount of data has essentially no effect on the database as a whole, so updating the database in that case wastes resources and raises maintenance costs. Maintaining and updating the mined high-utility sequential patterns efficiently is therefore particularly important.

Summary of the Invention

To solve the above problems, the present invention proposes a database sequence insertion processing method based on Pre-HUSPM. It fuses the pre-large concept with the projection-based mining algorithm P-HUSPM to build the incremental algorithm Pre-HUSPM, which mines high-utility sequential patterns efficiently and reduces the number of rescans of the original database.

The technical solution of the present invention is as follows:

A database sequence insertion processing method based on Pre-HUSPM, which builds the incremental algorithm Pre-HUSPM to mine high-utility sequential patterns efficiently, specifically comprising the following steps:

Step 1: Insert the database to be inserted [Figure SMS_2] into the original database [Figure SMS_1];

Step 2: Compute the safety value [Figure SMS_4] from the information of the original database [Figure SMS_3];

Step 3: Scan the database to be inserted [Figure SMS_5], and compute the total utility [Figure SMS_7] of each sequence in the database to be inserted [Figure SMS_6] and the total utility [Figure SMS_9] of [Figure SMS_8];

Step 4: Compare the sum of the total utility of the new transactions inserted since the original database was last rescanned and the maximum sequence-weighted utility [Figure SMS_11] of a single item in the database to be inserted [Figure SMS_10] against the safety value [Figure SMS_12], and act according to the result of the comparison;

Step 5: For each sequence [Figure SMS_17] in the large sequence-weighted-utility sequence set [Figure SMS_15] of the new database [Figure SMS_14], check whether its utility ratio is greater than or equal to the upper utility threshold [Figure SMS_18]. If so, sequence [Figure SMS_19] is a high-utility sequential pattern, and sequence [Figure SMS_20] is added to the set of high-utility sequential patterns [Figure SMS_21] and output; otherwise, nothing needs to be done. Finally, output the updated new database [Figure SMS_13] and its set of high-utility sequential patterns [Figure SMS_16].
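The check in step 5 can be sketched in Python as follows. This is only an illustration under stated assumptions: the utility ratio is taken to be the sequence utility divided by the total utility of the updated database, and the names `select_high_utility`, `large_swu_set`, `total_utility` and `upper_threshold` are hypothetical and do not come from the patent.

```python
def select_high_utility(large_swu_set, total_utility, upper_threshold):
    """Return the sequences whose utility ratio reaches the upper threshold.

    large_swu_set maps a sequence (represented here as a tuple of item
    names, an assumption) to its sequence utility in the updated database.
    The ratio su / total_utility is an assumed reading of "utility ratio".
    """
    if total_utility <= 0:
        raise ValueError("total_utility must be positive")
    return {
        seq: su
        for seq, su in large_swu_set.items()
        if su / total_utility >= upper_threshold
    }


# Hypothetical values, for illustration only.
large = {("a",): 40.0, ("a", "b"): 12.0, ("c",): 30.0}
patterns = select_high_utility(large, total_utility=100.0, upper_threshold=0.25)
print(sorted(patterns))
```

Sequences that fail the check are simply left out, which matches the "otherwise, nothing needs to be done" branch of the step.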

Further, in step 1, let the original database be [Figure SMS_23], where [Figure SMS_24] is the total number of sequences, [Figure SMS_25] is the index of a sequence, and [Figure SMS_26] denotes the [Figure SMS_27]-th sequence; [Figure SMS_28] is the collection of itemsets [Figure SMS_29], where [Figure SMS_22] is the total number of itemsets; an itemset [Figure SMS_30] is a set of [Figure SMS_31] distinct items, written [Figure SMS_32], where [Figure SMS_33] denotes the [Figure SMS_35]-th item of itemset [Figure SMS_34].

Further, in step 2, the safety value [Figure SMS_36] is computed as:

[Figure SMS_37] (1);

where [Figure SMS_38] denotes the upper utility threshold, [Figure SMS_39] denotes the lower utility threshold, and [Figure SMS_40] denotes the total utility of the original database [Figure SMS_41]; the values of [Figure SMS_42] and [Figure SMS_43] are preset;

[Figure SMS_44] is computed as:

[Figure SMS_45] (2);

where [Figure SMS_46] denotes the total utility of sequence [Figure SMS_48] in the original database [Figure SMS_47], computed as:

[Figure SMS_49] (3);

where [Figure SMS_50] denotes the utility of item [Figure SMS_53] in itemset [Figure SMS_52] of sequence [Figure SMS_51].
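The two sums described for formulas (2) and (3) can be sketched in Python as below. The data layout is an assumption made for illustration: a sequence is a list of itemsets, and each itemset is a dict mapping an item name to its utility. The function names are hypothetical. The safety value of formula (1) is not computed here, since only its inputs are described in the text.

```python
def sequence_utility(sequence):
    """Total utility of one sequence: the sum of the utilities of all
    items in all of its itemsets (the sum described for formula (3))."""
    return sum(utility for itemset in sequence for utility in itemset.values())


def database_utility(database):
    """Total utility of a database: the sum of the total utilities of
    all of its sequences (the sum described for formula (2))."""
    return sum(sequence_utility(sequence) for sequence in database)


# Hypothetical two-sequence database, for illustration only.
original_db = [
    [{"a": 4, "b": 6}, {"c": 5}],
    [{"a": 2}, {"b": 3, "c": 10}],
]
print(sequence_utility(original_db[0]), database_utility(original_db))
```

The same two functions can be applied unchanged to the database to be inserted, which is how step 3 reuses formulas (2) and (3).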

Further, in step 3, the total utility [Figure SMS_55] of the database to be inserted [Figure SMS_54] is computed in the same way as formulas (2) and (3), and [Figure SMS_56] is computed at the same time, using the data of the database to be inserted [Figure SMS_57].

Further, the decision rule in step 4 is as follows: let [Figure SMS_58] be the total utility of the new transactions inserted since the original database was last rescanned; when [Figure SMS_59], perform step 4.1 and step 4.2; when [Figure SMS_60], perform step 4.3;

Step 4.1: Scan the database to be inserted [Figure SMS_61] to generate the 1-candidate set, and set [Figure SMS_62] = 1, where [Figure SMS_63] denotes the number of items in the sequence sets currently being processed;

Step 4.2: Scan the 1-candidate set and update the sequence utility and sequence-weighted utility of the stored information; then generate the 2-candidate set and update the sequence utility and sequence-weighted utility again, continuing in this way until no further candidate set is generated; at the same time, set [Figure SMS_64];

Step 4.3: When [Figure SMS_65], generate the new database; in this case the original database must be rescanned. Set [Figure SMS_66] to 0 and assign [Figure SMS_67] to [Figure SMS_68].
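The branch taken in step 4 can be sketched as a small Python function. The inequalities themselves are given only as formula images, so the sketch assumes that the incremental path (steps 4.1 and 4.2) is taken when the sum does not exceed the safety value and that the rescan path (step 4.3) is taken otherwise; whether the boundary case belongs to one branch or the other is an assumption. All names are hypothetical.

```python
def choose_action(accumulated_utility, max_item_swu, safety_value):
    """Decide between incremental maintenance and a full rescan.

    accumulated_utility: total utility of the transactions inserted
        since the original database was last rescanned.
    max_item_swu: maximum sequence-weighted utility of a single item
        in the database to be inserted.
    safety_value: the safety value computed in step 2.

    The use of "<=" for the incremental branch is an assumption.
    """
    if accumulated_utility + max_item_swu <= safety_value:
        return "incremental"  # steps 4.1 and 4.2
    return "rescan"           # step 4.3


# Hypothetical numbers, for illustration only.
print(choose_action(accumulated_utility=20.0, max_item_swu=15.0, safety_value=50.0))
print(choose_action(accumulated_utility=40.0, max_item_swu=15.0, safety_value=50.0))
```

Keeping the decision in one pure function makes it easy to swap the comparison operator if the original formula uses a strict inequality.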

Further, the specific process of step 4.2 is as follows:

Step 4.2.1: Compute the total utility [Figure SMS_70] of the new database [Figure SMS_69] as:

[Figure SMS_71] (4);

For each candidate in the candidate set [Figure SMS_72], compute the sequence-weighted utility [Figure SMS_75] and the sequence utility [Figure SMS_76] of sequence [Figure SMS_74] in the database to be inserted [Figure SMS_73] as:

[Figure SMS_77] (5);

[Figure SMS_78] (6);

where [Figure SMS_79] denotes the total utility of the row of sequence [Figure SMS_80]; [Figure SMS_81] denotes the utility of subsequence [Figure SMS_83] in sequence [Figure SMS_82], which is the maximum utility over all occurrences of [Figure SMS_84] in the sequence, defined as:

[Figure SMS_85] (7);

where [Figure SMS_86] denotes the maximum internal utility of an item in a sequence, that is, the largest utility value the item takes in that sequence, defined as:

[Figure SMS_87] (8);

where [Figure SMS_88] denotes the internal utility of item [Figure SMS_91] in itemset [Figure SMS_90] of sequence [Figure SMS_89], defined as:

[Figure SMS_92] (9);

where [Figure SMS_93] denotes the quantity of item [Figure SMS_96] in itemset [Figure SMS_95] of sequence [Figure SMS_94], and [Figure SMS_97] denotes the unit profit of item [Figure SMS_98];
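The quantities described for formulas (5) to (9) can be illustrated for the simplest case, a candidate consisting of a single item. The sketch below assumes a sequence is a list of itemsets, each a dict mapping an item to its purchased quantity, and a separate `profit` table of unit profits; these layouts and all function names are hypothetical. Longer candidates would need a search over all occurrences of the subsequence, which is not attempted here.

```python
def internal_utility(quantity, unit_profit):
    """Internal utility of an item occurrence: quantity times unit profit."""
    return quantity * unit_profit


def row_utility(sequence, profit):
    """Total utility of one sequence (one row of the database)."""
    return sum(
        internal_utility(qty, profit[item])
        for itemset in sequence
        for item, qty in itemset.items()
    )


def max_item_utility(sequence, item, profit):
    """Largest utility the item takes in the sequence; 0 if it is absent."""
    utilities = [
        internal_utility(itemset[item], profit[item])
        for itemset in sequence
        if item in itemset
    ]
    return max(utilities, default=0)


def item_swu(database, item, profit):
    """Sequence-weighted utility of a single item: the sum of the row
    utilities of every sequence that contains the item."""
    return sum(
        row_utility(seq, profit)
        for seq in database
        if any(item in itemset for itemset in seq)
    )


def item_su(database, item, profit):
    """Sequence utility of a single item: the sum over sequences of its
    maximum utility in each sequence."""
    return sum(max_item_utility(seq, item, profit) for seq in database)


# Hypothetical data, for illustration only.
profit = {"a": 2, "b": 1, "c": 5}
inserted_db = [
    [{"a": 1, "b": 2}, {"a": 3}],
    [{"c": 1}, {"b": 4}],
]
print(item_swu(inserted_db, "a", profit), item_su(inserted_db, "a", profit))
```

The sequence-weighted utility is an upper bound on the sequence utility of any pattern containing the item, which is why it is the quantity used to prune candidates.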

Step 4.2.2: For each large sequence-weighted-utility sequence in the set [Figure SMS_99] of the original database, perform sub-steps 4.2.2.1 to 4.2.2.3;

Step 4.2.3: For each pre-large sequence-weighted-utility sequence in the set [Figure SMS_100] of the original database, likewise perform sub-steps 4.2.2.1 to 4.2.2.3 of step 4.2.2;

If the large sequence-weighted sequence set [Figure SMS_105] of the original database [Figure SMS_103] and the pre-large sequence-weighted sequence set [Figure SMS_108] of the original database [Figure SMS_106] contain the sequence [Figure SMS_111] of the database to be inserted [Figure SMS_109], update the sequence utility [Figure SMS_104] and the sequence-weighted utility [Figure SMS_107] of the itemsets in [Figure SMS_112] and [Figure SMS_102], and put sequence [Figure SMS_110] into the 1-candidate set, which is used to generate the 2-candidate set; if [Figure SMS_113] and [Figure SMS_114] do not contain the sequence [Figure SMS_116] of the new database [Figure SMS_115], no update is needed and [Figure SMS_101] is removed from the 1-candidate set;

Step 4.2.4: Generate the ([Figure SMS_118]+1)-candidate set [Figure SMS_119] from the [Figure SMS_117]-candidate set; set [Figure SMS_120] = [Figure SMS_121] + 1 and repeat steps 4.2.1 to 4.2.4 until no further large or pre-large sequence-weighted-utility sequence sets are found.

Further, the sub-steps of step 4.2.2 are as follows:

Sub-step 4.2.2.1: Update the sequence-weighted utility [Figure SMS_124] of sequence [Figure SMS_123] in the new database [Figure SMS_122] as:

[Figure SMS_125] (10);

where [Figure SMS_127] is the sequence-weighted utility of sequence [Figure SMS_129] in the original database [Figure SMS_128], [Figure SMS_131] stores the [Figure SMS_133] of sequence [Figure SMS_132], and [Figure SMS_134] is the sequence-weighted utility of sequence [Figure SMS_130] in the database to be inserted [Figure SMS_126];

Sub-step 4.2.2.2: Update the sequence utility [Figure SMS_137] of the whole sequence set [Figure SMS_136] in the new database [Figure SMS_135]:

[Figure SMS_138] (11);

where [Figure SMS_141] denotes the sequence utility of sequence [Figure SMS_142] in the original database [Figure SMS_143], [Figure SMS_144] stores the [Figure SMS_146] of sequence [Figure SMS_145], and [Figure SMS_147] is the sequence utility of sequence [Figure SMS_140] in the database to be inserted [Figure SMS_139];

Sub-step 4.2.2.3: If [Figure SMS_152], put sequence [Figure SMS_153] into [Figure SMS_154], where [Figure SMS_155] is the set of large sequence-weighted-utility [Figure SMS_157]-sequences in the new database [Figure SMS_156]; if [Figure SMS_158], put sequence [Figure SMS_148] into [Figure SMS_149], where [Figure SMS_150] is the set of pre-large sequence-weighted-utility [Figure SMS_159]-sequences in the new database [Figure SMS_151]; otherwise, discard sequence [Figure SMS_160].
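Sub-steps 4.2.2.1 to 4.2.2.3 can be sketched together in Python. The update is taken to be additive, as the descriptions of formulas (10) and (11) suggest. Two further assumptions are made because the conditions appear only as images: the total utility of the new database is the sum of the two databases' total utilities, and a sequence is large when its sequence-weighted-utility ratio is at least the upper threshold and pre-large when the ratio lies between the lower and upper thresholds. All names are hypothetical.

```python
def update_and_classify(swu_original, swu_inserted,
                        total_original, total_inserted,
                        upper, lower):
    """Update a sequence's sequence-weighted utility and classify it.

    Returns (updated_swu, label) where label is "large", "pre-large"
    or "discard". The threshold comparisons are assumed, not quoted.
    """
    updated_swu = swu_original + swu_inserted        # additive update
    total_updated = total_original + total_inserted  # assumed total of the new database
    ratio = updated_swu / total_updated
    if ratio >= upper:
        return updated_swu, "large"
    if ratio >= lower:
        return updated_swu, "pre-large"
    return updated_swu, "discard"


# Hypothetical values, for illustration only.
print(update_and_classify(30, 10, 90, 10, upper=0.3, lower=0.1))
print(update_and_classify(10, 5, 90, 10, upper=0.3, lower=0.1))
print(update_and_classify(2, 3, 90, 10, upper=0.3, lower=0.1))
```

The sequence utility of sub-step 4.2.2.2 would be updated with the same additive pattern; it is omitted to keep the sketch short.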

Further, the specific process of step 4.3 is as follows:

Step 4.3.1: Merge the database to be inserted [Figure SMS_161] and the original database [Figure SMS_162] to generate the new database [Figure SMS_163];

Step 4.3.2: For each [Figure SMS_164], compute the sequence-weighted utility [Figure SMS_166] in the new database [Figure SMS_165] in the same way as formula (5), and then compute the total utility [Figure SMS_168] of the new database [Figure SMS_167] in the same way as formula (2);

Step 4.3.3: Let the weighted utility ratio of a sequence be [Figure SMS_170]. If [Figure SMS_171], put sequence [Figure SMS_173] into [Figure SMS_175]; if [Figure SMS_177], put sequence [Figure SMS_178] into [Figure SMS_179]; otherwise, discard sequence [Figure SMS_169]. [Figure SMS_172] is the set of large sequence-weighted-utility [Figure SMS_176]-sequences in the new database [Figure SMS_174]; [Figure SMS_180] is the set of pre-large sequence-weighted-utility [Figure SMS_182]-sequences in the new database [Figure SMS_181];

Step 4.3.4: Run the recursive mining algorithm to generate the projected databases of the multi-item sequences and their [Figure SMS_183] and [Figure SMS_185] sequence sets, until no further [Figure SMS_187] and [Figure SMS_188] sequence sets are found. Mining starts from the 1-sequence sets and proceeds to the 2-sequence sets and onward until the last sequence set is empty, at which point mining stops and the large sequence-weighted-utility sequence set [Figure SMS_190] and the pre-large sequence-weighted-utility sequence set [Figure SMS_191] of the new database [Figure SMS_189] are output; [Figure SMS_184] and [Figure SMS_186] are kept for use at the next data insertion.

Further, in step 4.3.4, the specific process of the recursive mining algorithm is as follows:

Step 4.3.4.1: Traverse [Figure SMS_192] and [Figure SMS_193], and for each sequence [Figure SMS_196] belonging to [Figure SMS_194] and [Figure SMS_195], build its projected database [Figure SMS_197];

Step 4.3.4.2: Compute the sequence-weighted utility [Figure SMS_202] of [Figure SMS_200], where [Figure SMS_203] is an extension of [Figure SMS_205]. If [Figure SMS_206], compute the sequence utility [Figure SMS_208] and put [Figure SMS_209] into the set [Figure SMS_198]; if [Figure SMS_199], compute [Figure SMS_201] and put [Figure SMS_204] into the set [Figure SMS_207]; if neither condition holds, do nothing;

Step 4.3.4.3: Pass in the current parameters and call the mining procedure recursively until the sets [Figure SMS_211] and [Figure SMS_212] are both empty, then stop. [Figure SMS_213] is the set of large sequence-weighted-utility ([Figure SMS_215]+1)-sequences in the new database [Figure SMS_214]; [Figure SMS_216] is the set of pre-large sequence-weighted-utility ([Figure SMS_210]+1)-sequences in the new database [Figure SMS_217].
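The idea of a projected database used in step 4.3.4.1 can be illustrated with a generic PrefixSpan-style suffix projection. This is not the matrix projection of P-HUSPM, whose details are not given in the text; it is a simplified stand-in, and the function name and data layout (a sequence as a list of sets of item names) are assumptions.

```python
def project(database, item):
    """Build a simple projected database for a one-item prefix.

    For every sequence that contains the item, keep the itemsets that
    follow the first itemset in which the item occurs. Sequences that
    do not contain the item, or contain it only in their last itemset,
    contribute nothing.
    """
    projected = []
    for sequence in database:
        for position, itemset in enumerate(sequence):
            if item in itemset:
                suffix = sequence[position + 1:]
                if suffix:
                    projected.append(suffix)
                break  # only the first occurrence is used
    return projected


# Hypothetical database, for illustration only.
db = [
    [{"a"}, {"b"}, {"c"}],
    [{"b"}, {"a"}],
    [{"c"}],
]
print(project(db, "a"))
```

A recursive miner would call such a projection on each surviving prefix and then repeat the classification of step 4.3.4.2 on the extensions found in the projected database.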

The beneficial technical effects of the present invention are as follows.

A new sequential pattern mining algorithm, Pre-HUSPM, is proposed to handle sequence insertion. When a small amount of data is inserted, the entire database does not need to be updated, which avoids wasting resources.

The matrix-projection-based high-utility sequential pattern mining algorithm (P-HUSPM) reduces the number of candidates in sequence mining and thus shortens the time needed to mine high-utility sequences; and because the database no longer has to be rescanned frequently, the running time is reduced considerably.

A new concept, [Figure SMS_218], is proposed and used as a safety threshold to decide whether the database needs to be rescanned, which reduces the number of database rescans and lowers the maintenance cost.

BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1 is a flow chart of the database sequence insertion processing method based on Pre-HUSPM of the present invention.

FIG. 2 compares the running times of the three algorithms on the SIGN dataset under different lower utility thresholds [Figure SMS_220] when the upper utility threshold [Figure SMS_219] is 15%.

FIG. 3 compares the running times of the three algorithms on the LEVIATHAN dataset under different lower utility thresholds [Figure SMS_222] when the upper utility threshold [Figure SMS_221] is 18%.

FIG. 4 compares the running times of the three algorithms on the FIFA dataset under different lower utility thresholds [Figure SMS_224] when the upper utility threshold [Figure SMS_223] is 21%.

FIG. 5 compares the running times of the three algorithms on the BIBLE dataset under different lower utility thresholds [Figure SMS_226] when the upper utility threshold [Figure SMS_225] is 16%.

FIG. 6 compares the running times of the three algorithms on the Kosarak10k dataset under different lower utility thresholds [Figure SMS_228] when the upper utility threshold [Figure SMS_227] is 14%.

FIG. 7 compares the running times of the three algorithms on the BMS dataset under different lower utility thresholds [Figure SMS_230] when the upper utility threshold [Figure SMS_229] is 4.5%.

DETAILED DESCRIPTION

The present invention is described in further detail below with reference to the accompanying drawings and specific embodiments:

The databases referred to in the present invention are sequence databases, which contain large sequences, pre-large sequences and small sequences. A sequence is a large sequence when its support is greater than the upper support threshold; it is a pre-large sequence when its support is less than the upper support threshold and greater than the lower support threshold; and it is a small sequence when its support is less than the lower support threshold. A pre-large sequence is likely to become a large sequence in the future.

The present invention fuses the pre-large concept with the projection-based mining algorithm P-HUSPM and proposes the Pre-HUSPM algorithm, which uses the threshold [Figure SMS_231] as the condition for deciding whether the database must be rescanned, so that the database sequences are maintained and updated effectively and the number of database rescans is reduced. [Figure SMS_232] denotes the maximum sequence-weighted utility of a single item in the database to be inserted.

Nine cases arise when a new sequence database is added to the original sequence database. Cases 1, 2 and 3 insert a large, a pre-large and a small sequence of the new sequence database, respectively, into a large sequence of the original sequence database. Cases 4, 5 and 6 insert a large, a pre-large and a small sequence of the new sequence database, respectively, into a pre-large sequence of the original sequence database. Cases 7, 8 and 9 insert a large, a pre-large and a small sequence of the new sequence database, respectively, into a small sequence of the original sequence database.

Cases 1, 5, 6, 8 and 9 are weighted averages based on counts and do not affect the final set of large sequences. Cases 2 and 3 may remove some existing large sequences, while cases 4 and 7 may add new large sequences. When both the large sequence set and the pre-large sequence set are retained, cases 2, 3 and 4 can be handled well.
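The nine cases and their stated effects can be written down as a small lookup in Python. The numbering follows the text (the class in the original database selects the group of three, the class in the new database selects the position within it); the names `LEVELS`, `case_number` and `effect` are hypothetical.

```python
LEVELS = ("large", "pre-large", "small")


def case_number(original_level, new_level):
    """Map the class of a sequence in the original database and in the
    newly inserted database to the case number 1..9 used in the text."""
    return 3 * LEVELS.index(original_level) + LEVELS.index(new_level) + 1


def effect(case):
    """Effect of a case on the final set of large sequences, as stated."""
    if case in (1, 5, 6, 8, 9):
        return "no effect on the large sequence set"
    if case in (2, 3):
        return "may remove existing large sequences"
    if case in (4, 7):
        return "may add new large sequences"
    raise ValueError("case must be between 1 and 9")


# Case 7: small in the original database, large in the inserted one.
c = case_number("small", "large")
print(c, effect(c))
```

Case 7 is the only one of the two "may add" cases for which the original database holds no stored information about the sequence, which is why it normally forces a rescan.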

Case 7 is the main focus of the present invention. When case 7 occurs and the amount of inserted data is not large, the database does not actually need to be updated; the prior art nevertheless updates it, which wastes resources.

To address this problem, the present invention proposes a database sequence insertion processing method based on Pre-HUSPM, which relies on the following theorem; a proof of the theorem is given.

定理.设

Figure SMS_233
Figure SMS_234
分别为效用阈值下限和效用阈值上限,
Figure SMS_235
为原始数据库
Figure SMS_236
的总效用。
Figure SMS_237
是待插入数据库
Figure SMS_238
中单个项目的序列加权效用最大值。如果
Figure SMS_239
,则情况7中序列集的序列加权效用在整个更新数据库中没有希望成为高效用加权序列项集。Theorem. Assume
Figure SMS_233
and
Figure SMS_234
are the lower and upper utility thresholds, respectively.
Figure SMS_235
For the original database
Figure SMS_236
total utility.
Figure SMS_237
To be inserted into the database
Figure SMS_238
The maximum value of the sequence weighted utility of a single item in . If
Figure SMS_239
, then the sequence weighted utility of the sequence set in case 7 has no hope of becoming a high-utility weighted sequence item set in the entire updated database.

证明:从

Figure SMS_240
,可获得以下推导式:Proof: From
Figure SMS_240
, we can get the following derivation:

Figure SMS_241
Figure SMS_241
;

Figure SMS_242
Figure SMS_242
;

Figure SMS_243
Figure SMS_243
;

Figure SMS_244
Figure SMS_244
;

Figure SMS_245
Figure SMS_245
;

对于情况7中的序列,如果序列

Figure SMS_246
的序列加权效用在原始数据库
Figure SMS_247
中很小,则
Figure SMS_248
。For the sequence in case 7, if the sequence
Figure SMS_246
The sequence weighted utility of
Figure SMS_247
If the
Figure SMS_248
.

If the sequence [Figure SMS_250] has a large sequence-weighted utility in the database to be inserted [Figure SMS_251], then its sequence-weighted utility [Figure SMS_253] in [Figure SMS_252] must be greater than or equal to [Figure SMS_254] and less than or equal to the total utility [Figure SMS_256] of the database to be inserted [Figure SMS_255]. Therefore, [Figure SMS_249].

In sequence mining, the ratio of the updated sequence [Figure SMS_259] in the new database [Figure SMS_258], formed after inserting [Figure SMS_257], is calculated as:

[Figure SMS_260];

where [Figure SMS_261] is the sequence-weighted utility of sequence [Figure SMS_265] in the new database [Figure SMS_263], and [Figure SMS_266] is the sequence-weighted utility of sequence [Figure SMS_268] in the original database [Figure SMS_267]. Therefore, when [Figure SMS_269] is less than the safety value [Figure SMS_262] ([Figure SMS_264]), the original database does not need to be rescanned.

According to this theorem, the sequences in case 7 can be handled efficiently.
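The argument behind the theorem can be sketched as a short derivation. Because the patent's formulas survive only as image placeholders, the notation here is an assumption: $S_l$ and $S_u$ are the lower and upper utility thresholds, $TU^D$ and $TU^d$ are the total utilities of the original and inserted databases, $SWU^D(S)$ and $SWU^d(S)$ are the sequence-weighted utilities of a sequence $S$ in each, and the safety value is assumed to take the standard pre-large form $f = \lfloor (S_u - S_l)\,TU^D / (1 - S_u) \rfloor$, which reproduces the value 21 in the embodiment.

```latex
% Case 7: S is small in D, so SWU^D(S) < S_l * TU^D, and SWU^d(S) <= TU^d.
\begin{align*}
\frac{SWU^D(S) + SWU^d(S)}{TU^D + TU^d}
  &< \frac{S_l \, TU^D + TU^d}{TU^D + TU^d}. \\
\intertext{The right-hand side is at most $S_u$ exactly when}
S_l \, TU^D + TU^d &\le S_u \, TU^D + S_u \, TU^d \\
\iff\quad TU^d \,(1 - S_u) &\le (S_u - S_l)\, TU^D \\
\iff\quad TU^d &\le \frac{(S_u - S_l)\, TU^D}{1 - S_u}.
\end{align*}
% Hence TU^d <= f implies the updated ratio of S stays below S_u,
% so S cannot become a large sequence and no rescan of D is needed.
```

With the embodiment's numbers ($S_l = 0.25$, $S_u = 0.35$, $TU^D = 141$) and an insertion of total utility 21, the bound on the updated ratio is $(35.25 + 21)/162 \approx 0.347 < 0.35$.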

A Pre-HUSPM-based database sequence insertion processing method comprises the following steps:

Step 1. Insert the database to be inserted [Figure SMS_271] into the original database [Figure SMS_270].

In this embodiment, the original database [Figure SMS_272] is a transaction database, and the database to be inserted [Figure SMS_273] is a new transaction database.

Both the original and the new transaction databases contain a set of sequences. Let the original database be [Figure SMS_275], where [Figure SMS_276] is the total number of sequences, [Figure SMS_277] is the sequence index, and [Figure SMS_278] denotes the [Figure SMS_279]-th sequence; each [Figure SMS_280] has a unique identifier. [Figure SMS_281] is the set of itemsets [Figure SMS_274], where [Figure SMS_282] is the total number of itemsets. An itemset [Figure SMS_283] is a set of [Figure SMS_284] distinct items, denoted [Figure SMS_285], where [Figure SMS_286] denotes the [Figure SMS_288]-th item of itemset [Figure SMS_287].

The original transaction database contains five sequences, [Figure SMS_289], [Figure SMS_291], [Figure SMS_293], [Figure SMS_295], and [Figure SMS_297], and five items, [Figure SMS_298], [Figure SMS_300], [Figure SMS_301], [Figure SMS_303], and [Figure SMS_305]. The itemsets of sequence [Figure SMS_307] are [Figure SMS_309], where [Figure SMS_311] denotes one item; the itemsets of sequence [Figure SMS_312] are [Figure SMS_314]; the itemsets of sequence [Figure SMS_290] are [Figure SMS_292]; the itemsets of sequence [Figure SMS_294] are [Figure SMS_296]; the itemsets of sequence [Figure SMS_299] are [Figure SMS_302]. The profits of the five items [Figure SMS_304], [Figure SMS_306], [Figure SMS_308], [Figure SMS_310], and [Figure SMS_313] are 3, 2, 4, 2, and 1, respectively; they are stored in the database as an item profit table [Figure SMS_315].

The database to be inserted [Figure SMS_316] contains two sequences, [Figure SMS_317] and [Figure SMS_318]; the itemsets of sequence [Figure SMS_319] are [Figure SMS_320], and the itemsets of sequence [Figure SMS_321] are [Figure SMS_322].

Step 2. Compute the safety value [Figure SMS_324] from the information in the original database [Figure SMS_323].

The safety value [Figure SMS_325] is computed as:

[Figure SMS_326] (1);

where [Figure SMS_327] is the upper utility threshold, [Figure SMS_328] is the lower utility threshold, and [Figure SMS_329] is the total utility of the original database [Figure SMS_330]; the values of [Figure SMS_331] and [Figure SMS_332] are preset.

[Figure SMS_333] is computed as:

[Figure SMS_334] (2);

where [Figure SMS_335] is the total utility of sequence [Figure SMS_337] in the original database [Figure SMS_336], computed as:

[Figure SMS_338] (3);

where [Figure SMS_339] is the utility of item [Figure SMS_342] in itemset [Figure SMS_341] of sequence [Figure SMS_340].

In this embodiment, the upper utility threshold [Figure SMS_344] is preset to 0.35, equal to the high-utility sequential pattern threshold, and the lower utility threshold [Figure SMS_345] is set to 0.25. The computed values are [Figure SMS_346] = 36, [Figure SMS_347] = 26, [Figure SMS_348] = 28, [Figure SMS_349] = 23, [Figure SMS_350] = 28; [Figure SMS_343] = 141; [Figure SMS_351] = 21.
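The numbers in this embodiment can be reproduced with a minimal sketch. Formula (1) survives only as an image, so the form used here is an assumption: the standard pre-large safety bound, floor((S_u - S_l) × TU^D / (1 - S_u)), which yields the stated value 21. The function and variable names are illustrative, not taken from the patent.

```python
import math


def safety_bound(s_upper: float, s_lower: float, total_utility: float) -> int:
    """Assumed form of formula (1): floor((S_u - S_l) * TU_D / (1 - S_u))."""
    return math.floor((s_upper - s_lower) * total_utility / (1.0 - s_upper))


# Total utilities of the five sequences of the original database (from the text).
sequence_utilities = [36, 26, 28, 23, 28]

# Formula (2): the database total utility is the sum of the sequence utilities.
tu_d = sum(sequence_utilities)

f = safety_bound(s_upper=0.35, s_lower=0.25, total_utility=tu_d)
print(tu_d, f)  # 141 21
```

The unfloored value is about 21.69; rounding down keeps the bound conservative, so an insertion accepted without a rescan never exceeds the exact limit.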

Step 3. Scan the database to be inserted [Figure SMS_352], and compute the total utility [Figure SMS_354] of each sequence in [Figure SMS_353] and the total utility [Figure SMS_356] of [Figure SMS_355].

The total utility [Figure SMS_358] of the database to be inserted [Figure SMS_357] is computed in the same way as formulas (2) and (3), and [Figure SMS_359] is computed at the same time, substituting the data of the database to be inserted [Figure SMS_360];

In this embodiment, [Figure SMS_361] = 10, [Figure SMS_362] = 7, and [Figure SMS_363] = 17.

Step 4. Compare the sum of [Figure SMS_364] and the total utility of the new transactions accumulated since the original database was last rescanned against the safety value [Figure SMS_365], and act on the result. Specifically, let [Figure SMS_366] be the total utility of the new transactions since the last rescan of the original database: when [Figure SMS_367], perform steps 4.1 and 4.2; when [Figure SMS_368], perform step 4.3.
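The decision in step 4 can be sketched as follows. The two conditions are images in the source, so this sketch assumes the usual pre-large rule: if the buffered utility plus the utility of the new insertion does not exceed the safety value, update incrementally and accumulate the buffer; otherwise rescan and reset the buffer to 0. Whether the boundary case is inclusive is an assumption, as are all names.

```python
def handle_insertion(buffer_utility: int, inserted_utility: int, f: int):
    """Return the action to take and the new buffer value.

    Assumption: incremental update when buffer + inserted <= f,
    full rescan (buffer reset to 0) otherwise.
    """
    accumulated = buffer_utility + inserted_utility
    if accumulated <= f:
        # Steps 4.1 and 4.2: update from the inserted data only.
        return "incremental", accumulated
    # Step 4.3: merge the databases and rescan; the buffer starts over.
    return "rescan", 0


# Embodiment: empty buffer, inserted total utility 17, safety value 21.
print(handle_insertion(0, 17, 21))   # ('incremental', 17)

# A hypothetical second insertion of total utility 10 would exceed the bound.
print(handle_insertion(17, 10, 21))  # ('rescan', 0)
```

Keeping the buffer across insertions is what lets several small batches share one safety budget before a rescan becomes necessary.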

Step 4.1. Scan the database to be inserted [Figure SMS_369] to generate the 1-candidate set, and set [Figure SMS_370] = 1, where [Figure SMS_371] is the number of items in the sequence sets currently being processed.

In this embodiment, the generated 1-candidate set is: [Figure SMS_372].

Step 4.2. Scan the 1-candidate set and update the sequence utility and sequence-weighted utility of the existing information; then generate the 2-candidate set and continue updating in the same way until no further candidate set is generated. At the same time, set [Figure SMS_373]. The specific process is as follows:

Step 4.2.1. Compute the total utility [Figure SMS_375] of the new database [Figure SMS_374] as:

[Figure SMS_376] (4);

In this embodiment, [Figure SMS_377] = 141 + 17 = 158.

For each candidate in the candidate set [Figure SMS_378], compute the sequence-weighted utility [Figure SMS_381] and the sequence utility [Figure SMS_382] of sequence [Figure SMS_380] in the database to be inserted [Figure SMS_379] as:

[Figure SMS_383] (5);

[Figure SMS_384] (6);

where [Figure SMS_385] is the total utility of the row of sequence [Figure SMS_386], and [Figure SMS_387] is the utility of subsequence [Figure SMS_389] in sequence [Figure SMS_388], defined as the maximum utility over all occurrences of [Figure SMS_390] in the sequence:

[Figure SMS_391] (7);

where [Figure SMS_392], the maximum internal utility of an item in a sequence, is the largest utility value of that item in the sequence, defined as:

[Figure SMS_393] (8);

where [Figure SMS_394] is the internal utility of item [Figure SMS_397] in itemset [Figure SMS_396] of sequence [Figure SMS_395], defined as:

[Figure SMS_398] (9);

where [Figure SMS_399] is the quantity of item [Figure SMS_402] in itemset [Figure SMS_401] of sequence [Figure SMS_400], and [Figure SMS_403] is the unit profit of item [Figure SMS_404].

For example, in this embodiment, [Figure SMS_405] = 10 and [Figure SMS_406] = 8.

For example, [Figure SMS_408] can be written as [Figure SMS_409], where [Figure SMS_411], [Figure SMS_412], and [Figure SMS_413]. The internal utilities of [Figure SMS_414] in [Figure SMS_415] and [Figure SMS_407] are [Figure SMS_410] = 3 × 3 = 9 and [Figure SMS_416] = 2 × 3 = 6, respectively.

[Figure SMS_418] occurs twice in [Figure SMS_417], so the maximum utility of [Figure SMS_419] in [Figure SMS_420] is [Figure SMS_421] = 9.

The subsequence [Figure SMS_422] occurs twice in [Figure SMS_423], with utilities (3 × 3) + (4 × 2) = 17 and (3 × 2) + (4 × 2) = 14. Therefore, [Figure SMS_424] = 17.
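The definitions in formulas (7) to (9) can be illustrated with a small sketch: an internal utility is quantity times unit profit, and the utility of a subsequence in a sequence is the maximum over all of its occurrences. The actual sequence in the example is an image, so the sequence and item names below are hypothetical, chosen only so that the arithmetic matches the values 9, 6, 17, and 14 in the text. For brevity the pattern is restricted to single items matched in distinct, ordered itemsets.

```python
from itertools import combinations

# Hypothetical unit profits (item "a" earns 3, item "c" earns 4).
PROFIT = {"a": 3, "c": 4}

# Hypothetical sequence: an ordered list of itemsets mapping item -> quantity.
SEQ = [{"a": 3}, {"a": 2}, {"c": 2}]


def internal_utility(item, quantity, profit):
    """Formula (9) as described in the text: quantity times unit profit."""
    return quantity * profit[item]


def sequence_utility(pattern, sequence, profit):
    """Maximum utility over all occurrences of `pattern` in `sequence`.

    `pattern` is a list of single items; each must be matched in a
    distinct itemset, in order (a simplification of the general case).
    """
    best = 0
    for positions in combinations(range(len(sequence)), len(pattern)):
        pairs = list(zip(pattern, positions))
        if all(item in sequence[pos] for item, pos in pairs):
            utility = sum(
                internal_utility(item, sequence[pos][item], profit)
                for item, pos in pairs
            )
            best = max(best, utility)
    return best


print(sequence_utility(["a"], SEQ, PROFIT))       # 9  (occurrences: 9 and 6)
print(sequence_utility(["a", "c"], SEQ, PROFIT))  # 17 (occurrences: 17 and 14)
```

Enumerating position combinations is exponential in the pattern length; it is used here only because it makes the "maximum over occurrences" rule explicit.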

Step 4.2.2. For each sequence in the set [Figure SMS_425] of large sequence-weighted-utility sequences of the original database, perform the following sub-steps:

Sub-step 4.2.2.1. Update the sequence-weighted utility [Figure SMS_428] of sequence [Figure SMS_427] in the new database [Figure SMS_426] as:

[Figure SMS_429] (10);

where [Figure SMS_430] is the sequence-weighted utility of sequence [Figure SMS_433] in the original database [Figure SMS_432] ([Figure SMS_435] stores the [Figure SMS_437] of sequence [Figure SMS_436]), and [Figure SMS_438] is the sequence-weighted utility of sequence [Figure SMS_434] in the database to be inserted [Figure SMS_431].

In this embodiment, for the sequence [Figure SMS_440] in [Figure SMS_439], [Figure SMS_441] = 76 + 7 = 83.

Sub-step 4.2.2.2. Update the sequence utility [Figure SMS_444] of the sequence set [Figure SMS_443] in the new database [Figure SMS_442]:

[Figure SMS_445] (11);

where [Figure SMS_447] is the sequence utility of sequence [Figure SMS_448] in the original database [Figure SMS_450] ([Figure SMS_451] stores the [Figure SMS_453] of sequence [Figure SMS_452]), and [Figure SMS_454] is the sequence utility of sequence [Figure SMS_449] in the database to be inserted [Figure SMS_446].

In this embodiment, for the sequence [Figure SMS_456] in [Figure SMS_455], [Figure SMS_457] = 30 + 3 = 33.

Sub-step 4.2.2.3. If [Figure SMS_459], put sequence [Figure SMS_460] into [Figure SMS_462], where [Figure SMS_464] is the set of large sequence-weighted-utility [Figure SMS_468]-sequence sets in the new database [Figure SMS_466]; if [Figure SMS_470], put sequence [Figure SMS_458] into [Figure SMS_461], where [Figure SMS_463] is the set of pre-large sequence-weighted-utility [Figure SMS_467]-sequence sets in the new database [Figure SMS_465]; otherwise, discard sequence [Figure SMS_469], because it remains small after the database update.

In this embodiment, [Figure SMS_471] = 52.5% > 35%, so sequence [Figure SMS_472] remains in the set [Figure SMS_473].
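Sub-steps 4.2.2.1 to 4.2.2.3 amount to adding the inserted-database contribution to the stored value and re-classifying the sequence by its ratio to the new total utility. The sketch below uses the embodiment's numbers (76 + 7 = 83 and 141 + 17 = 158); the exact conditions are images in the source, so the threshold comparisons (at or above the upper threshold for "large", between the two thresholds for "pre-large") are an assumption, and all names are illustrative.

```python
def classify(swu: float, total_utility: float,
             s_lower: float, s_upper: float) -> str:
    """Classify a sequence by its sequence-weighted utility ratio."""
    ratio = swu / total_utility
    if ratio >= s_upper:
        return "large"
    if ratio >= s_lower:
        return "pre-large"
    return "discard"


# Formula (10): stored value in the original database plus the value
# mined from the inserted database.
swu_new = 76 + 7

# Formula (4): total utility of the updated database.
tu_new = 141 + 17

print(swu_new, round(swu_new / tu_new, 3))     # 83 0.525
print(classify(swu_new, tu_new, 0.25, 0.35))   # large
```

Only the inserted database is scanned here; the value 76 comes from the stored large set, which is the point of keeping the large and pre-large sets between insertions.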

Step 4.2.3. For each pre-large sequence-weighted-utility sequence set [Figure SMS_474] of the original database, likewise perform sub-steps 4.2.2.1 to 4.2.2.3 of step 4.2.2.

If the large sequence-weighted sequence set [Figure SMS_479] of the original database [Figure SMS_477] or the pre-large sequence-weighted sequence set [Figure SMS_482] of the original database [Figure SMS_480] contains a sequence [Figure SMS_486] from the database to be inserted [Figure SMS_484], update the sequence utility [Figure SMS_478] and the sequence-weighted utility [Figure SMS_481] of that sequence set in [Figure SMS_488] and [Figure SMS_475], and put sequence [Figure SMS_483] into the 1-candidate set, which is used to generate the 2-candidate set. If neither [Figure SMS_485] nor [Figure SMS_487] contains sequence [Figure SMS_490] from the new database [Figure SMS_489], no update is needed and [Figure SMS_476] is removed from the 1-candidate set.

For example, in this embodiment, [Figure SMS_492] and [Figure SMS_493] are in [Figure SMS_494], so [Figure SMS_496] is added to the 1-candidate set; otherwise it would be removed. From the 1-candidate set, the 2-candidate sets [Figure SMS_498], [Figure SMS_500], [Figure SMS_501], and [Figure SMS_491] can be generated, and their [Figure SMS_497] and [Figure SMS_499] are mined from the database to be inserted [Figure SMS_495]; a value that does not exist is taken as 0. The process continues in this way until the candidate set is empty.

Step 4.2.4. Generate the ([Figure SMS_503] + 1)-candidate set [Figure SMS_504] from the [Figure SMS_502]-candidate set; set [Figure SMS_505] = [Figure SMS_506] + 1, and repeat steps 4.2.1 to 4.2.4 until no updated large or pre-large sequence-weighted-utility sequence set is found.

Step 4.3. When [Figure SMS_507], a new database is generated and the original database must be rescanned. Set [Figure SMS_508] to 0 and assign [Figure SMS_509] to [Figure SMS_510]. The specific process is as follows:

Step 4.3.1. Merge the database to be inserted [Figure SMS_511] with the original database D to generate the new database U;

Step 4.3.2. For each [Figure SMS_512], compute the sequence-weighted utility [Figure SMS_514] in the new database [Figure SMS_513] in the same way as formula (5), and then compute the total utility [Figure SMS_516] of the new database [Figure SMS_515] in the same way as formula (2);

Step 4.3.3. Let the weighted utility ratio of a sequence be [Figure SMS_518]. If [Figure SMS_519], put sequence [Figure SMS_520] into [Figure SMS_521]; if [Figure SMS_522], put sequence [Figure SMS_523] into [Figure SMS_524]; otherwise, discard sequence [Figure SMS_517], because it remains small after the database update.

Step 4.3.4. Run the recursive mining algorithm to generate the projected databases of the multi-item sets and their [Figure SMS_526] and [Figure SMS_527] sequence sets, until no further [Figure SMS_529] or [Figure SMS_530] sequence set is found. Mining starts from the 1-sequence sets, continues with the 2-sequence sets, and stops when the last sequence set is empty. The output is the large sequence-weighted-utility sequence set [Figure SMS_532] and the pre-large sequence-weighted-utility sequence set [Figure SMS_533] of the new database [Figure SMS_531]; [Figure SMS_525] and [Figure SMS_528] are kept for use at the next data insertion.

The specific process is as follows:

Step 4.3.4.1. Traverse [Figure SMS_535] and [Figure SMS_536], and build the projected database [Figure SMS_540] of each sequence [Figure SMS_539] belonging to [Figure SMS_537] or [Figure SMS_538]; this reduces the number of candidate sets and speeds up the run. Here [Figure SMS_541] is the number of items in the sequence sets currently being processed. The projected database is built by finding every sequence that has item [Figure SMS_534] as its prefix; a sequence that does not contain item [Figure SMS_542] is not retained.

Definition: Consider two sequences [Figure SMS_543] and [Figure SMS_544], where [Figure SMS_545]. A subsequence of [Figure SMS_549] is called the projected sequence of [Figure SMS_550] if (1) it has the prefix [Figure SMS_546], and (2) it is a subsequence of [Figure SMS_548] with prefix [Figure SMS_547] and has no further supersequence with this property. This relation is denoted [Figure SMS_551]. Accordingly, the projected database of sequence [Figure SMS_552] in the new database [Figure SMS_553] is the set of all projected sequences, with respect to [Figure SMS_554], of every sequence in the database, denoted [Figure SMS_555].

For example, following the definition above, to build the projected database of sequence [Figure SMS_556], find every sequence that has [Figure SMS_558] as its prefix; a sequence that does not contain item [Figure SMS_560] is not retained. For instance, [Figure SMS_562] does not contain item [Figure SMS_564], so the projected database of sequence [Figure SMS_566] does not include [Figure SMS_567]. The projected database of sequence [Figure SMS_569] therefore contains only four sequences, [Figure SMS_571], [Figure SMS_573], [Figure SMS_575], and [Figure SMS_576]: the itemsets of sequence [Figure SMS_577] are [Figure SMS_578], and the total utility of sequence [Figure SMS_579] is 36; the itemsets of sequence [Figure SMS_557] are [Figure SMS_559], and the total utility of sequence [Figure SMS_561] is 9; the itemsets of sequence [Figure SMS_563] are [Figure SMS_565], and the total utility of sequence [Figure SMS_568] is 9; the itemsets of sequence [Figure SMS_570] are [Figure SMS_572], and the total utility of sequence [Figure SMS_574] is 22.
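The filtering behaviour described above (drop every sequence that lacks the prefix item, keep the rest in projected form) can be sketched as follows. The patent's example data is only available as images, so the database below is hypothetical, and the choice to keep the suffix starting at the itemset of the first occurrence is an assumption about how the projected sequence is cut, not a statement of the patent's exact rule.

```python
def project(database: dict, item: str) -> dict:
    """Build a single-item projected database.

    `database` maps a sequence id to an ordered list of itemsets
    (each itemset maps item -> quantity). Sequences without `item`
    are dropped; the others keep the suffix that starts at the
    itemset where `item` first occurs (an assumed convention).
    """
    projected = {}
    for seq_id, sequence in database.items():
        for position, itemset in enumerate(sequence):
            if item in itemset:
                projected[seq_id] = sequence[position:]
                break
    return projected


# Hypothetical database of three sequences.
DB = {
    "s1": [{"a": 1}, {"b": 2}],
    "s2": [{"b": 1}, {"c": 1}],
    "s3": [{"c": 2}, {"a": 1, "b": 1}],
}

result = project(DB, "a")
print(sorted(result))  # ['s1', 's3']
print(result["s3"])    # [{'a': 1, 'b': 1}]
```

Because later extensions of the prefix are searched only inside the projected sequences, each recursion level works on a smaller database, which is the speed-up the step refers to.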

Step 4.3.4.2. Compute the sequence-weighted utility [Figure SMS_586] of [Figure SMS_581], where [Figure SMS_587] is an extension of [Figure SMS_588]. If [Figure SMS_589], compute the sequence utility [Figure SMS_590] and put [Figure SMS_591] into the set [Figure SMS_580]; if [Figure SMS_582], compute [Figure SMS_583] and put [Figure SMS_584] into the set [Figure SMS_585]; if neither condition holds, do nothing.

Step 4.3.4.3. Pass the current parameters to a recursive call of the mining procedure, and stop when both the [Figure SMS_592] and [Figure SMS_593] sets are empty.

The pseudocode of the recursive mining algorithm is as follows:

1: for each sequence [Figure SMS_594] do

2:   build the projected database [Figure SMS_596] of sequence [Figure SMS_595];

3: end for

4: for each [Figure SMS_597], where [Figure SMS_598] is a superset of [Figure SMS_599] in the projected database [Figure SMS_600], do

5:   compute [Figure SMS_601];

6:   if [Figure SMS_602] then

7:     compute [Figure SMS_603];

8:     put sequence [Figure SMS_604] into the set [Figure SMS_605];

9:   else if [Figure SMS_606] then

10:    compute [Figure SMS_607];

11:    put sequence [Figure SMS_608] into the set [Figure SMS_609];

12:  end if

13: end for

14: Mining([Figure SMS_610], [Figure SMS_611], [Figure SMS_612], [Figure SMS_613], [Figure SMS_614], [Figure SMS_615]);

Step 5. For each sequence [Figure SMS_620] in the large sequence-weighted-utility sequence set [Figure SMS_619] of the new database [Figure SMS_618], determine whether its utility ratio is greater than or equal to the upper utility threshold [Figure SMS_621], that is, whether [Figure SMS_622]. If so, sequence [Figure SMS_623] is a high-utility sequential pattern, and sequence S is added to the high-utility sequential pattern set [Figure SMS_624] and output; otherwise, no operation is needed. Finally, output the updated new database [Figure SMS_616] and its high-utility sequential pattern set [Figure SMS_617].

In this embodiment, [Figure SMS_625] = [Figure SMS_626] = 35.4% > 35%, so [Figure SMS_627] is a high-utility sequence and is added to the high-utility sequential pattern set [Figure SMS_628].

The final [Figure SMS_629] and [Figure SMS_630] are as follows:

The large sequence-weighted-utility sequence set [Figure SMS_632] contains the sequence sets [Figure SMS_633], [Figure SMS_635], [Figure SMS_636], and [Figure SMS_637]. Sequence set [Figure SMS_638] has a sequence-weighted utility of 83 and a sequence utility of 22; sequence set [Figure SMS_639] has a sequence-weighted utility of 95 and a sequence utility of 56; sequence set [Figure SMS_631] has a sequence-weighted utility of 77 and a sequence utility of 20; sequence set [Figure SMS_634] has a sequence-weighted utility of 77 and a sequence utility of 16;

The pre-large sequence-weighted-utility sequence set [Figure SMS_640] contains the sequence sets [Figure SMS_641], [Figure SMS_643], [Figure SMS_644], [Figure SMS_646], and [Figure SMS_648]. Sequence set [Figure SMS_649] has a sequence-weighted utility of 53 and a sequence utility of 18; sequence set [Figure SMS_642] has a sequence-weighted utility of 46 and a sequence utility of 17; sequence set [Figure SMS_645] has a sequence-weighted utility of 43 and a sequence utility of 17; sequence set [Figure SMS_647] has a sequence-weighted utility of 52 and a sequence utility of 32; sequence set [Figure SMS_650] has a sequence-weighted utility of 54 and a sequence utility of 38.

The high-utility sequential pattern set [Figure SMS_652] of the updated new database [Figure SMS_651] contains only the sequence set [Figure SMS_653], whose sequence-weighted utility is 95, sequence utility is 56, and utility ratio is 35.4%.
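The final filter of step 5 can be checked against the listed results with a short sketch. The sequence names are images in the source, so the labels "A" to "D" below are placeholders for the four members of the large set; the (sequence-weighted utility, sequence utility) pairs and the total utility 158 are taken from the text.

```python
# Members of the large set: label -> (sequence-weighted utility, sequence utility).
# The labels are placeholders; the real sequence names are not recoverable.
LARGE_SET = {
    "A": (83, 22),
    "B": (95, 56),
    "C": (77, 20),
    "D": (77, 16),
}

TOTAL_UTILITY = 158   # total utility of the updated database
S_UPPER = 0.35        # upper utility threshold


def high_utility_patterns(patterns, total_utility, s_upper):
    """Keep the sequences whose actual utility ratio reaches the threshold."""
    return {
        label: utility / total_utility
        for label, (_swu, utility) in patterns.items()
        if utility / total_utility >= s_upper
    }


result = high_utility_patterns(LARGE_SET, TOTAL_UTILITY, S_UPPER)
print({label: round(ratio, 3) for label, ratio in result.items()})  # {'B': 0.354}
```

All four sequences pass the sequence-weighted utility test (each ratio is at least 77/158, about 48.7%), but only one passes on actual utility, which shows why the weighted value serves as an upper-bound filter rather than as the final criterion.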

The pseudocode of the Pre-HUSPM algorithm of the present invention is as follows:

Input: an item profit table [Figure SMS_656]; the original database [Figure SMS_657]; the upper utility threshold [Figure SMS_659] (equal to the minimum high sequence-utility threshold); the lower utility threshold [Figure SMS_661]; the total utility [Figure SMS_664] of [Figure SMS_663]; the set of large sequence-weighted-utility sequences [Figure SMS_665] and the set of pre-large sequence-weighted-utility sequences [Figure SMS_655], together with their sequence-weighted utility values and the actual utility values found in [Figure SMS_658]; the safety utility buffer [Figure SMS_660], which stores the total utility of the most recently processed sequences; and the database to be inserted [Figure SMS_662].

Output: the set of high-utility sequential patterns ([Figure SMS_668]) of the new database [Figure SMS_666] ([Figure SMS_667]).

1: compute the safety sequence utility bound [Figure SMS_669];

2: for each [Figure SMS_670] do

3:   scan the database [Figure SMS_671] and compute [Figure SMS_672];

4: end for

5: compute [Figure SMS_673] and [Figure SMS_674];

6: if [Figure SMS_675] then

7:   compute the total utility [Figure SMS_676];

8:   set [Figure SMS_677] = 1;

9:   generate the 1-item candidate sets [Figure SMS_678], [Figure SMS_679];

10:  while [Figure SMS_680] null do

11:    for each [Figure SMS_681] do

12:      compute [Figure SMS_682];

13:      compute [Figure SMS_683];

14:    end for

15:    for each [Figure SMS_684] do

16:      call the utility summation algorithm;

17:    end for

18:    for each [Figure SMS_685] do

19:      call the utility summation algorithm;

20:    end for

21:    generate the ([Figure SMS_688] + 1)-candidate set [Figure SMS_689] from ([Figure SMS_686] [Figure SMS_687]);

22:    set [Figure SMS_690] = [Figure SMS_691] + 1;

23:  end while

24: else

25:   merge the database to be inserted [Figure SMS_692] with the original database [Figure SMS_693] to generate the new database [Figure SMS_694];

26:   for each [Figure SMS_695] do

27:     compute [Figure SMS_696];

28:   end for

29:   compute [Figure SMS_697];

30:   set [Figure SMS_698] = 1;

31:   for each [Figure SMS_699] do

32:     if [Figure SMS_700] then

33:       add [Figure SMS_701] to the set [Figure SMS_702];

34:     else if [Figure SMS_703] then

35:       add [Figure SMS_704] to the set [Figure SMS_705];

36:     end if

37:   end for

38:   if [Figure SMS_706] is in neither [Figure SMS_707] nor [Figure SMS_708], remove [Figure SMS_709] from the new database [Figure SMS_710] and treat the result as the new database [Figure SMS_711];

39: Mining(

Figure SMS_712
Figure SMS_713
Figure SMS_714
Figure SMS_715
Figure SMS_716
Figure SMS_717
);39: Mining(
Figure SMS_712
,
Figure SMS_713
,
Figure SMS_714
,
Figure SMS_715
,
Figure SMS_716
,
Figure SMS_717
);

40: end if;40: end if;

41: for each

Figure SMS_718
do;41: for each
Figure SMS_718
do;

42: if

Figure SMS_719
;42: if
Figure SMS_719
;

43: 将序列

Figure SMS_720
放入
Figure SMS_721
集合中;43: Sequence
Figure SMS_720
Put in
Figure SMS_721
In the collection;

44: end if;44: end if;

45: end for;45: end for;

46: if

Figure SMS_722
then;46: if
Figure SMS_722
then;

47: 设置

Figure SMS_723
and
Figure SMS_724
= 0;47: Settings
Figure SMS_723
and
Figure SMS_724
= 0;

48: else;48: else;

49: 设置

Figure SMS_725
;49: Settings
Figure SMS_725
;

50: end if;50: end if;

51: 设置

Figure SMS_726
and
Figure SMS_727
;51: Settings
Figure SMS_726
and
Figure SMS_727
;

The pseudocode of the utility summation algorithm called in the pseudocode above is as follows:

1: [Figure SMS_728];
2: [Figure SMS_729];
3: if [Figure SMS_730] then;
4:     put the sequence [Figure SMS_731] into the set [Figure SMS_732];
5: else if [Figure SMS_733] then;
6:     put the sequence [Figure SMS_734] into the set [Figure SMS_735];
7: end if;
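A minimal Python sketch of the utility summation step, assuming (as described in the claims) that the updated sequence-weighted utility and sequence utility are the sums of the values stored for the original database and those computed on the inserted database. The function name, the dictionary-based storage, and the exact threshold comparisons are assumptions, since the conditions in lines 3 and 5 are available only as formula images.

```python
from typing import Dict, Hashable, Tuple


def utility_sum(
    seq: Hashable,
    swu_orig: Dict[Hashable, float],
    su_orig: Dict[Hashable, float],
    swu_ins: Dict[Hashable, float],
    su_ins: Dict[Hashable, float],
    total_utility_new: float,
    upper: float,
    lower: float,
    large: Dict[Hashable, Tuple[float, float]],
    pre_large: Dict[Hashable, Tuple[float, float]],
) -> Tuple[float, float]:
    """Update one sequence and file it as large, pre-large, or neither."""
    # Lines 1-2: add the stored values and the values from the inserted data.
    swu_new = swu_orig.get(seq, 0.0) + swu_ins.get(seq, 0.0)
    su_new = su_orig.get(seq, 0.0) + su_ins.get(seq, 0.0)
    # Lines 3-7: classify by the ratio to the new database's total utility
    # (assumed comparison: ratio >= upper is large, ratio >= lower is pre-large).
    ratio = swu_new / total_utility_new
    if ratio >= upper:
        large[seq] = (swu_new, su_new)
    elif ratio >= lower:
        pre_large[seq] = (swu_new, su_new)
    return swu_new, su_new
```

Sequences that fall below the lower threshold are simply not stored, which corresponds to the "discard" case in the claims.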

To demonstrate the advantages and feasibility of the algorithm of the present invention, a comparative experiment was carried out in which the proposed Pre-HUSPM algorithm was compared with the P-HUSPM algorithm and the Pre-HUSPM-TSU algorithm. The experiment used six real datasets of different sizes and characteristics, namely SIGN, LEVIATHAN, FIFA, BIBLE, Kosarak10k and BMS, all taken from the SPMF website. SIGN is a dense dataset containing many very long sequences; LEVIATHAN and FIFA are medium-density datasets containing many long sequences; BIBLE is a medium-density dataset containing many medium-length sequences; BMS and Kosarak10k are sparse datasets with only a few long sequences. All datasets satisfy a Gaussian distribution. In the experiment, each dataset was divided into one original dataset and 100 new datasets.
The characteristics of the datasets are as follows:
SIGN: 730 sequences, 267 distinct items, average sequence length 52, maximum sequence length 94, 230 sequences in the original database, 5 sequences in the database to be inserted;
LEVIATHAN: 5834 sequences, 9025 distinct items, average sequence length 33.8, maximum sequence length 100, 2834 sequences in the original database, 30 sequences in the database to be inserted;
FIFA: 20450 sequences, 2990 distinct items, average sequence length 36.2, maximum sequence length 100, 10450 sequences in the original database, 100 sequences in the database to be inserted;
BIBLE: 36369 sequences, 13905 distinct items, average sequence length 21.6, maximum sequence length 100, 21369 sequences in the original database, 150 sequences in the database to be inserted;
Kosarak10k: 10000 sequences, 10094 distinct items, average sequence length 8.1, maximum sequence length 608, 1000 sequences in the original database, 90 sequences in the database to be inserted;
BMS: 59601 sequences, 497 distinct items, average sequence length 2.5, maximum sequence length 267, 39601 sequences in the original database, 200 sequences in the database to be inserted.
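The dataset sizes above are consistent with 100 insertion batches of fixed size: for SIGN, 230 + 100 × 5 = 730, and the same arithmetic holds for the other five datasets. A minimal sketch of such a split is given below, assuming the sequences are taken in file order; the source does not state how the split was actually made, and `split_for_insertion` is a hypothetical helper.

```python
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_for_insertion(
    sequences: Sequence[T], original_size: int, batch_size: int
) -> Tuple[List[T], List[List[T]]]:
    """Split a dataset into an original database and fixed-size insertion batches."""
    original = list(sequences[:original_size])
    rest = sequences[original_size:]
    # Consecutive, non-overlapping slices of the remaining sequences.
    batches = [
        list(rest[i:i + batch_size]) for i in range(0, len(rest), batch_size)
    ]
    return original, batches
```

For a 730-sequence dataset, `split_for_insertion(data, 230, 5)` gives one original database of 230 sequences and 100 batches of 5 sequences each.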

In the experiments, the upper utility threshold [Figure SMS_737] was held constant as the control variable on the six datasets while different lower utility thresholds [Figure SMS_738] were compared; the results are shown in Figures 2 to 7. The experiments show that the Pre-HUSPM-TSU algorithm has a shorter running time than the HUSPM algorithm. The optimized algorithm proposed in the present invention uses [Figure SMS_739] in place of the [Figure SMS_740] used in the Pre-HUSPM-TSU algorithm, which in effect yields the Pre-HUSPM-[Figure SMS_741] algorithm (Pre-HUSPM-[Figure SMS_742] is the Pre-HUSPM algorithm referred to in the present invention). The running time of the Pre-HUSPM-[Figure SMS_743] algorithm is considerably better than that of HUSPM and Pre-HUSPM-TSU. Pre-HUSPM-[Figure SMS_736] therefore runs faster on larger, non-dense datasets and performs well in terms of running time.

By trying different values of [Figure SMS_745], it was found that if [Figure SMS_746] is set too small, rescanning the database becomes slower because too many pre-large sequence sets are generated. If [Figure SMS_748] is set too close to the value of [Figure SMS_749], the safety value becomes too small, so the database may have to be rescanned whenever new data are added, which also slows operation. In practical applications, [Figure SMS_750] and [Figure SMS_751] therefore need to be set reasonably. Both [Figure SMS_752] and [Figure SMS_744] range from 0 to 1 and must satisfy [Figure SMS_747]; the specific values are chosen according to the user's needs.
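The trade-off described above can be illustrated numerically. The sketch below assumes the safety value takes the form commonly used with the pre-large concept for insertions, f = (S_u − S_l) × TU_D / (1 − S_u); formula (1) of this document is available only as an image, so this form, the function name `safety_bound`, and the constraint 0 < S_l < S_u < 1 are assumptions made for illustration.

```python
def safety_bound(upper: float, lower: float, total_utility: float) -> float:
    """Assumed pre-large safety bound: (upper - lower) * TU / (1 - upper)."""
    if not 0.0 < lower < upper < 1.0:
        raise ValueError("thresholds must satisfy 0 < lower < upper < 1")
    return (upper - lower) * total_utility / (1.0 - upper)


if __name__ == "__main__":
    # The bound shrinks as the lower threshold approaches the upper one,
    # so rescans are triggered more often.
    for lower in (0.25, 0.35, 0.45):
        print(lower, safety_bound(0.5, lower, 1000.0))
```

Under this assumed form, a lower threshold far below the upper one gives a large safety value but keeps more pre-large sequences in memory, while a lower threshold close to the upper one keeps few pre-large sequences but forces frequent rescans.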

Of course, the above description does not limit the present invention, and the present invention is not restricted to the above examples. Changes, modifications, additions or substitutions made by those skilled in the art within the essential scope of the present invention shall also fall within the protection scope of the present invention.

Claims (9)

1. A Pre-HUSPM-based database sequence insertion processing method, characterized in that an incremental algorithm, Pre-HUSPM, is constructed to efficiently mine high-utility sequential patterns, the method comprising the following steps:
step 1, inserting the database to be inserted [Figure QLYQS_2] into the original database [Figure QLYQS_1];
step 2, calculating the safety value [Figure QLYQS_4] of the original database [Figure QLYQS_3];
step 3, scanning the database to be inserted [Figure QLYQS_5], and calculating the total utility [Figure QLYQS_7] of each sequence in the database to be inserted [Figure QLYQS_6] and the [Figure QLYQS_9] of [Figure QLYQS_8];
step 4, comparing the sum of the total utility value of the new transactions since the original database was last rescanned and the maximum sequence-weighted utility [Figure QLYQS_11] of a single item in the database to be inserted [Figure QLYQS_10] with the safety value [Figure QLYQS_12], and performing the corresponding operation according to the comparison result;
step 5, judging whether the utility ratio of each sequence [Figure QLYQS_17] in the set of large sequence-weighted utility sequences [Figure QLYQS_15] of the new database [Figure QLYQS_14] is greater than or equal to the upper utility threshold [Figure QLYQS_18]; if so, the sequence [Figure QLYQS_19] is a high-utility sequential pattern, and the sequence [Figure QLYQS_20] is added to the set of high-utility sequential patterns [Figure QLYQS_21] and output; otherwise, no operation is needed; and finally outputting the updated new database [Figure QLYQS_13] and its set of high-utility sequential patterns [Figure QLYQS_16].
2. The Pre-HUSPM-based database sequence insertion processing method according to claim 1, characterized in that in step 1, the original database is [Figure QLYQS_22], where [Figure QLYQS_23] is the total number of sequences, [Figure QLYQS_24] is the index of a sequence, and [Figure QLYQS_25] denotes the [Figure QLYQS_26]-th sequence; the item set is given by [Figure QLYQS_28] [Figure QLYQS_29], where [Figure QLYQS_27] is the total number of items; the item set [Figure QLYQS_30] is a set of [Figure QLYQS_31] distinct items, denoted [Figure QLYQS_32]; and [Figure QLYQS_33] denotes the [Figure QLYQS_35]-th item of the item set [Figure QLYQS_34].
3. The Pre-HUSPM-based database sequence insertion processing method according to claim 2, characterized in that in step 2, the safety value [Figure QLYQS_36] is calculated as follows:
[Figure QLYQS_37] (1);
where [Figure QLYQS_38] denotes the upper utility threshold, [Figure QLYQS_39] denotes the lower utility threshold, and [Figure QLYQS_40] denotes the total utility of the original database [Figure QLYQS_41]; the values of [Figure QLYQS_42] and [Figure QLYQS_43] are preset;
[Figure QLYQS_44] is calculated as follows:
[Figure QLYQS_45] (2);
where [Figure QLYQS_46] denotes the utility of the sequence [Figure QLYQS_48] in the original database [Figure QLYQS_47], calculated as follows:
[Figure QLYQS_49] (3);
where [Figure QLYQS_50] denotes the utility of the [Figure QLYQS_53]-th item of [Figure QLYQS_52] in the sequence [Figure QLYQS_51].
4. The Pre-HUSPM-based database sequence insertion processing method according to claim 3, characterized in that in step 3, the total utility [Figure QLYQS_55] of the database to be inserted [Figure QLYQS_54] is calculated in the same manner as formulas (2) and (3), and when [Figure QLYQS_56] is calculated, the relevant data of the database to be inserted [Figure QLYQS_57] are included in the calculation.
5. The Pre-HUSPM-based database sequence insertion processing method according to claim 4, characterized in that the specific judgment criteria in step 4 are: when [Figure QLYQS_58] ≥ the total utility value of the new transactions since the original database was last rescanned [Figure QLYQS_59], step 4.1 and step 4.2 are carried out; when [Figure QLYQS_60], step 4.3 is carried out;
step 4.1, scanning the database to be inserted [Figure QLYQS_61] to generate the 1-candidate set, and setting [Figure QLYQS_62] = 1, where [Figure QLYQS_63] denotes the number of items being processed in the sequence set;
step 4.2, scanning the 1-candidate set and updating the sequence utility and the sequence-weighted utility of the original information, then generating the 2-candidate set and continuing to update the sequence utility and the sequence-weighted utility of the original information, and so on until no further candidate set is generated; at the same time, setting [Figure QLYQS_64];
step 4.3, when [Figure QLYQS_65], generating a new database and rescanning the original database; setting [Figure QLYQS_66] to 0 and assigning [Figure QLYQS_67] to [Figure QLYQS_68].
6. The Pre-HUSPM-based database sequence insertion processing method according to claim 5, characterized in that the specific process of step 4.2 is as follows:
step 4.2.1, calculating the [Figure QLYQS_70] of the new database [Figure QLYQS_69] as follows:
[Figure QLYQS_71] (4);
for each candidate in the candidate set [Figure QLYQS_72], calculating the sequence-weighted utility [Figure QLYQS_75] and the sequence utility [Figure QLYQS_76] of the sequence [Figure QLYQS_74] in the database to be inserted [Figure QLYQS_73] as follows:
[Figure QLYQS_77] (5);
[Figure QLYQS_78] (6);
where [Figure QLYQS_79] denotes the total utility value of the row of the sequence [Figure QLYQS_80]; [Figure QLYQS_81] denotes the utility of the subsequence [Figure QLYQS_83] of the sequence [Figure QLYQS_82], which is the maximum utility over all of its occurrences in the sequence [Figure QLYQS_84], defined as follows:
[Figure QLYQS_85] (7);
where [Figure QLYQS_86] denotes the maximum internal utility of an item in a sequence, that is, the maximum utility value of the item in the sequence, defined as follows:
[Figure QLYQS_87] (8);
where [Figure QLYQS_88] denotes the internal utility of the [Figure QLYQS_91]-th item of [Figure QLYQS_90] in the sequence [Figure QLYQS_89], defined as follows:
[Figure QLYQS_92] (9);
where [Figure QLYQS_93] denotes the quantity of the [Figure QLYQS_96]-th item of [Figure QLYQS_95] in the sequence [Figure QLYQS_94], and [Figure QLYQS_97] denotes the unit profit of the item [Figure QLYQS_98];
step 4.2.2, performing sub-steps 4.2.2.1 to 4.2.2.3 for each large sequence-weighted utility sequence set in the large sequence-weighted utility sequences [Figure QLYQS_99] of the original database;
step 4.2.3, likewise performing sub-steps 4.2.2.1 to 4.2.2.3 of step 4.2.2 for each pre-large sequence-weighted utility sequence set in [Figure QLYQS_100] of the original database;
if the set of large sequence-weighted utility sequences [Figure QLYQS_105] of the original database [Figure QLYQS_103] and the set of pre-large sequence-weighted utility sequences [Figure QLYQS_108] of the original database [Figure QLYQS_107] contain the sequence [Figure QLYQS_112] of the database to be inserted [Figure QLYQS_110], the sequence utility [Figure QLYQS_104] and the sequence-weighted utility [Figure QLYQS_106] of the item sets in [Figure QLYQS_113] and [Figure QLYQS_101] are updated, and the sequence [Figure QLYQS_109] is put into the 1-candidate set for generating the 2-candidate set; if [Figure QLYQS_111] and [Figure QLYQS_114] do not contain the sequence [Figure QLYQS_116] of the new database [Figure QLYQS_115], [Figure QLYQS_102] is removed from the 1-candidate set;
step 4.2.4, generating the ([Figure QLYQS_118] + 1)-candidate set [Figure QLYQS_119] from the [Figure QLYQS_117]-candidate set; setting [Figure QLYQS_120] = [Figure QLYQS_121] + 1, and repeating steps 4.2.1 to 4.2.4 until no updated large or pre-large sequence-weighted utility sequence set is found.
7. The Pre-HUSPM-based database sequence insertion processing method according to claim 6, characterized in that the sub-steps of step 4.2.2 are as follows:
sub-step 4.2.2.1, updating the sequence-weighted utility [Figure QLYQS_124] of the sequence [Figure QLYQS_123] in the new database [Figure QLYQS_122] as follows:
[Figure QLYQS_125] (10);
where [Figure QLYQS_127] is the sequence-weighted utility of the sequence [Figure QLYQS_129] in the original database [Figure QLYQS_128], [Figure QLYQS_131] stores the [Figure QLYQS_133] of the sequence [Figure QLYQS_132], and [Figure QLYQS_134] is the sequence-weighted utility of the sequence [Figure QLYQS_130] in the database to be inserted [Figure QLYQS_126];
sub-step 4.2.2.2, updating the sequence utility [Figure QLYQS_137] of [Figure QLYQS_136] over the entire sequence set of the new database [Figure QLYQS_135]:
[Figure QLYQS_138] (11);
where [Figure QLYQS_140] denotes the sequence utility of the sequence [Figure QLYQS_141] in the original database [Figure QLYQS_143], [Figure QLYQS_144] stores the [Figure QLYQS_146] of the sequence [Figure QLYQS_145], and [Figure QLYQS_147] is the sequence utility of the sequence [Figure QLYQS_142] in the database to be inserted [Figure QLYQS_139];
sub-step 4.2.2.3, if [Figure QLYQS_148], the sequence [Figure QLYQS_151] is put into [Figure QLYQS_153], where [Figure QLYQS_154] is the set of large sequence-weighted utility [Figure QLYQS_158]-sequences of the new database [Figure QLYQS_156]; if [Figure QLYQS_159], the sequence [Figure QLYQS_149] is put into [Figure QLYQS_150], where [Figure QLYQS_152] is the set of pre-large sequence-weighted utility [Figure QLYQS_157]-sequences of the new database [Figure QLYQS_155]; otherwise, the sequence [Figure QLYQS_160] is discarded.
8. The Pre-HUSPM-based database sequence insertion processing method according to claim 7, characterized in that the specific process of step 4.3 is as follows:
step 4.3.1, merging the database to be inserted [Figure QLYQS_161] and the original database [Figure QLYQS_162] to generate the new database [Figure QLYQS_163];
step 4.3.2, for each [Figure QLYQS_164], calculating the sequence-weighted utility [Figure QLYQS_166] of the new database [Figure QLYQS_165] in the same manner as formula (5), and then calculating the [Figure QLYQS_168] of the new database [Figure QLYQS_167] in the same manner as formula (2);
step 4.3.3, letting the sequence-weighted utility ratio be [Figure QLYQS_170]; if [Figure QLYQS_171], the sequence [Figure QLYQS_173] is put into [Figure QLYQS_174]; if [Figure QLYQS_176], the sequence [Figure QLYQS_178] is put into [Figure QLYQS_180]; otherwise, the sequence [Figure QLYQS_169] is discarded; [Figure QLYQS_172] is the set of large sequence-weighted utility [Figure QLYQS_177]-sequences of the new database [Figure QLYQS_175]; [Figure QLYQS_179] is the set of pre-large sequence-weighted utility [Figure QLYQS_182]-sequences of the new database [Figure QLYQS_181];
step 4.3.4, executing a recursive mining algorithm, which generates projection databases of multiple sets and generates the [Figure QLYQS_184] and [Figure QLYQS_186] sequence sets until no further [Figure QLYQS_187] and [Figure QLYQS_188] sequence sets are found; the mining process starts from the 1-sequence set, proceeds to the 2-sequence set and so on, stops when the last sequence set is empty, and outputs the set of large sequence-weighted utility sequences [Figure QLYQS_190] and the set of pre-large sequence-weighted utility sequences [Figure QLYQS_191] of the new database [Figure QLYQS_189]; [Figure QLYQS_183] and [Figure QLYQS_185] are used for the next data insertion.
9. The Pre-HUSPM-based database sequence insertion processing method according to claim 8, characterized in that in step 4.3.4, the specific process of the recursive mining algorithm is as follows:
step 4.3.4.1, traversing [Figure QLYQS_192] and [Figure QLYQS_193], and constructing, for each sequence [Figure QLYQS_196] belonging to [Figure QLYQS_194] and [Figure QLYQS_195], its projection database [Figure QLYQS_197];
step 4.3.4.2, calculating the sequence-weighted utility value [Figure QLYQS_201] of [Figure QLYQS_199], where [Figure QLYQS_202] is a set of extension items of [Figure QLYQS_204]; if [Figure QLYQS_205], calculating the sequence utility [Figure QLYQS_207] and putting [Figure QLYQS_209] into the set [Figure QLYQS_198]; if [Figure QLYQS_200], calculating [Figure QLYQS_203] and putting [Figure QLYQS_206] into the set [Figure QLYQS_208]; otherwise, performing no processing;
step 4.3.4.3, passing in the current parameters and calling the mining algorithm recursively until the input sets [Figure QLYQS_211] and [Figure QLYQS_212] are both empty, whereupon the operation stops; [Figure QLYQS_213] is the set of large sequence-weighted utility ([Figure QLYQS_215] + 1)-sequences of the new database [Figure QLYQS_214]; [Figure QLYQS_216] is the set of pre-large sequence-weighted utility ([Figure QLYQS_210] + 1)-sequences of the new database [Figure QLYQS_217].
CN202310250759.4A 2023-03-16 2023-03-16 Pre-HUSPM-based database sequence insertion processing method Expired - Fee Related CN115964415B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310250759.4A CN115964415B (en) 2023-03-16 2023-03-16 Pre-HUSPM-based database sequence insertion processing method


Publications (2)

Publication Number Publication Date
CN115964415A true CN115964415A (en) 2023-04-14
CN115964415B CN115964415B (en) 2023-05-26

Family

ID=85894768

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310250759.4A Expired - Fee Related CN115964415B (en) 2023-03-16 2023-03-16 Pre-HUSPM-based database sequence insertion processing method

Country Status (1)

Country Link
CN (1) CN115964415B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030217055A1 (en) * 2002-05-20 2003-11-20 Chang-Huang Lee Efficient incremental method for data mining of a database
CN105590237A (en) * 2015-12-18 2016-05-18 齐鲁工业大学 Application of high utility sequential pattern with negative-profit items in electronic commerce business decision making
CN106777182A (en) * 2016-12-23 2017-05-31 陕西理工学院 A kind of data flow effective item set mining algorithm for reducing candidate
CN108733705A (en) * 2017-04-20 2018-11-02 哈尔滨工业大学深圳研究生院 A kind of effective sequential mode mining method and device
CN109101530A (en) * 2018-06-22 2018-12-28 哈尔滨工业大学(深圳) Effective sequence of events pattern mining algorithm
CN109408563A (en) * 2018-11-07 2019-03-01 哈尔滨工业大学(深圳) High average utility item set mining method, apparatus and computer equipment
CN111475551A (en) * 2020-06-15 2020-07-31 河北工业大学 A high-average utility sequential pattern mining method under non-overlapping conditions
CN111930803A (en) * 2020-08-07 2020-11-13 河北工业大学 Non-overlapping self-adaptive frequent sequence pattern mining method
CN112434031A (en) * 2020-11-16 2021-03-02 宁波财经学院 Uncertain high-utility mode mining method based on information entropy
US20220058716A1 (en) * 2020-08-18 2022-02-24 Qilu University Of Technology Commodity recommendation system based on actionable high utility negative sequential rules mining and its working method
CN114971794A (en) * 2022-05-27 2022-08-30 齐鲁工业大学 Time period-based high-utility sequence mode analysis method and system in group purchase


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
慕欢欢; 柴玉梅; 王黎明: "An efficient high-utility itemset mining algorithm for data streams", 计算机应用与软件 (Computer Applications and Software) *

Also Published As

Publication number Publication date
CN115964415B (en) 2023-05-26


Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20230526