CN115964415A - Pre-HUSPM-based database sequence insertion processing method

Info

Publication number
CN115964415A
Authority
CN
China
Prior art keywords
sequence
database
utility
weighted
original
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN202310250759.4A
Other languages
Chinese (zh)
Other versions
CN115964415B (en)
Inventor
吴明泰
李凤洋
潘正祥
陈建铭
吴祖扬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Shandong University of Science and Technology
Original Assignee
Shandong University of Science and Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Shandong University of Science and Technology
Priority to CN202310250759.4A
Publication of CN115964415A
Application granted
Publication of CN115964415B
Legal status: Active
Anticipated expiration

Classifications

    • Y: GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02: TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02D: CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
    • Y02D10/00: Energy efficient computing, e.g. low power processors, power management or thermal management

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a Pre-HUSPM-based database sequence insertion processing method, belonging to the field of data mining. The method comprises the following steps: inserting a database to be inserted into an original database; calculating a safety value from the information of the original database; scanning the database to be inserted, and calculating the total utility of each of its sequences and the total utility of the database to be inserted as a whole; comparing the sum of the total utility value of the new transactions since the original database was last rescanned and the maximum sequence-weighted utility of a single item in the database to be inserted with the safety value, and performing the corresponding operation according to the comparison result; judging whether the utility ratio of each sequence in the set of large sequence-weighted-utility sequences of the new database reaches the upper utility threshold; and finally outputting the updated new database together with its set of high-utility sequential patterns. The invention reduces the number of database rescans and lowers the maintenance cost.

Description

Pre-HUSPM-based database sequence insertion processing method
Technical Field
The invention belongs to the field of data mining, and particularly relates to a database sequence insertion processing method based on Pre-HUSPM.
Background
High Utility Sequential Pattern Mining (HUSPM) algorithms can be used to analyze users' shopping habits, taking into account factors such as the weight and unit profit of each item. A sequence set whose utility is greater than the minimum utility threshold set by the user is a high-utility sequential pattern. HUSPM algorithms generally run on a static database, but in practical applications new data is added almost every day; after the database is updated, previously discovered high-utility sequential patterns may become invalid and new ones may appear. Conventional dynamic data mining therefore rescans the original database every time a small amount of data arrives, which consumes a great deal of resources and time. When only a small amount of data is inserted, the database as a whole is essentially unaffected, so updating it wastes resources and raises maintenance cost. Efficiently maintaining and updating the mined high-utility sequential patterns has therefore become important.
Disclosure of Invention
In order to solve the above problems, the invention provides a Pre-HUSPM-based database sequence insertion processing method. It combines the pre-large concept with the projection-based mining algorithm P-HUSPM to construct an incremental algorithm, Pre-HUSPM, which mines high-utility sequential patterns efficiently and reduces the number of rescans of the original database.
The technical scheme of the invention is as follows:
A database sequence insertion processing method based on Pre-HUSPM constructs the incremental algorithm Pre-HUSPM to mine high-utility sequential patterns efficiently, and specifically comprises the following steps:
Step 1, insert the database to be inserted (Figure SMS_2) into the original database (Figure SMS_1);
Step 2, calculate the safety value (Figure SMS_4) from the information of the original database (Figure SMS_3);
Step 3, scan the database to be inserted (Figure SMS_5), and calculate the total utility (Figure SMS_7) of each sequence in the database to be inserted (Figure SMS_6) and the total utility (Figure SMS_9) of the database to be inserted (Figure SMS_8);
Step 4, compare the sum of the total utility value of the new transactions since the original database was last rescanned and the maximum sequence-weighted utility (Figure SMS_11) of a single item in the database to be inserted (Figure SMS_10) with the safety value (Figure SMS_12), and perform the corresponding operation according to the comparison result;
Step 5, judge whether the utility ratio of each sequence (Figure SMS_17) in the set of large sequence-weighted-utility sequences (Figure SMS_15) of the new database (Figure SMS_14) is greater than or equal to the upper utility threshold (Figure SMS_18); if so, the sequence (Figure SMS_19) is a high-utility sequential pattern, and the sequence (Figure SMS_20) is added to the set of high-utility sequential patterns (Figure SMS_21) and output; otherwise, no operation is needed; finally, output the updated new database (Figure SMS_13) and its set of high-utility sequential patterns (Figure SMS_16).
Further, in step 1, let the original database be (Figure SMS_23), where (Figure SMS_24) is the total number of sequences, (Figure SMS_25) is the index of a sequence, and (Figure SMS_26) denotes the (Figure SMS_27)-th sequence (Figure SMS_28); let the set of items be (Figure SMS_29), where (Figure SMS_22) is the total number of items; an itemset (Figure SMS_30) is a set of (Figure SMS_31) distinct items, denoted (Figure SMS_32), where (Figure SMS_33) denotes the (Figure SMS_35)-th item of the itemset (Figure SMS_34).
Further, in step 2, the safety value (Figure SMS_36) is calculated as follows:
Figure SMS_37 (1);
where (Figure SMS_38) denotes the upper utility threshold, (Figure SMS_39) denotes the lower utility threshold, and (Figure SMS_40) denotes the total utility of the original database (Figure SMS_41); the values of (Figure SMS_42) and (Figure SMS_43) are preset;
(Figure SMS_44) is calculated as follows:
Figure SMS_45 (2);
where (Figure SMS_46) denotes the total utility of the sequence (Figure SMS_48) in the original database (Figure SMS_47), calculated as follows:
Figure SMS_49 (3);
where (Figure SMS_50) denotes the utility of the (Figure SMS_53)-th item of the itemset (Figure SMS_52) in the sequence (Figure SMS_51).
Further, in step 3, the total utility (Figure SMS_55) of the database to be inserted (Figure SMS_54) is calculated in the same way as formulas (2) and (3); at the same time, (Figure SMS_56) is calculated, and this calculation includes the relevant data of the database to be inserted (Figure SMS_57).
Further, the specific judgment criterion in step 4 is: let (Figure SMS_58); when the total utility value of the new transactions since the last rescan of the original database satisfies (Figure SMS_59), steps 4.1 and 4.2 are carried out; when (Figure SMS_60), step 4.3 is carried out;
Step 4.1, scan the database to be inserted (Figure SMS_61) to generate the 1-candidate set, and set (Figure SMS_62) = 1, where (Figure SMS_63) denotes the number of items being processed in the sequence set;
Step 4.2, scan the 1-candidate set and update the sequence utility and the sequence-weighted utility of the original information; generate the 2-candidate set in turn, and keep updating the sequence utility and the sequence-weighted utility of the original information until no further candidate set is generated; at the same time, set (Figure SMS_64);
Step 4.3, when (Figure SMS_65), generate the new database and rescan the original database; set (Figure SMS_66) to 0, and assign (Figure SMS_67) to (Figure SMS_68).
Further, the specific process of step 4.2 is as follows:
Step 4.2.1, calculate the total utility (Figure SMS_70) of the new database (Figure SMS_69) as follows:
Figure SMS_71 (4);
for each candidate in the candidate set (Figure SMS_72), calculate the sequence-weighted utility (Figure SMS_75) and the sequence utility (Figure SMS_76) of the sequence (Figure SMS_74) in the database to be inserted (Figure SMS_73) as follows:
Figure SMS_77 (5);
Figure SMS_78 (6);
where (Figure SMS_79) denotes the total utility value of the row of the sequence (Figure SMS_80); (Figure SMS_81) denotes the utility of the subsequence (Figure SMS_83) of the sequence (Figure SMS_82), which is the maximum utility over all of its occurrences in the sequence (Figure SMS_84) and is defined as follows:
Figure SMS_85 (7);
where (Figure SMS_86) denotes the maximum internal utility of an item in a sequence, that is, the largest utility value of the item in the sequence, defined as follows:
Figure SMS_87 (8);
where (Figure SMS_88) denotes the internal utility of the (Figure SMS_91)-th item of the itemset (Figure SMS_90) in the sequence (Figure SMS_89), defined as follows:
Figure SMS_92 (9);
where (Figure SMS_93) denotes the quantity of the (Figure SMS_96)-th item of the itemset (Figure SMS_95) in the sequence (Figure SMS_94), and (Figure SMS_97) denotes the unit profit of the item (Figure SMS_98);
Step 4.2.2, for each large sequence-weighted-utility (Figure SMS_99) sequence set in the original database, perform substeps 4.2.2.1 to 4.2.2.3;
Step 4.2.3, for each pre-large sequence-weighted-utility (Figure SMS_100) sequence set in the original database, likewise perform substeps 4.2.2.1 to 4.2.2.3 of step 4.2.2;
if the large sequence-weighted-utility sequence set (Figure SMS_105) of the original database (Figure SMS_103) and the pre-large sequence-weighted-utility sequence set (Figure SMS_108) of the original database (Figure SMS_106) contain a sequence (Figure SMS_111) of the database to be inserted (Figure SMS_109), update the sequence utility (Figure SMS_104) and the sequence-weighted utility (Figure SMS_107) of the itemsets in (Figure SMS_112) and (Figure SMS_102), and put the sequence (Figure SMS_110) into the 1-candidate set to generate the 2-candidate set; if (Figure SMS_113) and (Figure SMS_114) do not contain the sequence (Figure SMS_116) of the new database (Figure SMS_115), remove (Figure SMS_101) from the 1-candidate set;
Step 4.2.4, generate the ((Figure SMS_118) + 1)-candidate set (Figure SMS_119) from the (Figure SMS_117)-candidate set; set (Figure SMS_120) = (Figure SMS_121) + 1, and repeat steps 4.2.1 through 4.2.4 until no updated large or pre-large sequence-weighted-utility sequence set is found.
Further, the substeps of step 4.2.2 are as follows:
Substep 4.2.2.1, update the sequence-weighted utility (Figure SMS_124) of the sequence (Figure SMS_123) in the new database (Figure SMS_122) as follows:
Figure SMS_125 (10);
where (Figure SMS_127) is the sequence-weighted utility of the sequence (Figure SMS_129) in the original database (Figure SMS_128), (Figure SMS_131) stores the (Figure SMS_133) of the sequence (Figure SMS_132), and (Figure SMS_134) is the sequence-weighted utility of the sequence (Figure SMS_130) in the database to be inserted (Figure SMS_126);
Substep 4.2.2.2, update the sequence utility (Figure SMS_137) of the sequence set (Figure SMS_136) in the entire new database (Figure SMS_135):
Figure SMS_138 (11);
where (Figure SMS_141) denotes the sequence utility of the sequence (Figure SMS_142) in the original database (Figure SMS_143), (Figure SMS_144) stores the (Figure SMS_146) of the sequence (Figure SMS_145), and (Figure SMS_147) is the sequence utility of the sequence (Figure SMS_140) in the database to be inserted (Figure SMS_139);
Substep 4.2.2.3, if (Figure SMS_152), put the sequence (Figure SMS_153) into (Figure SMS_154), where (Figure SMS_155) is the large sequence-weighted-utility (Figure SMS_157) sequence set of the new database (Figure SMS_156); if (Figure SMS_158), put the sequence (Figure SMS_148) into (Figure SMS_149), where (Figure SMS_150) is the pre-large sequence-weighted-utility (Figure SMS_159) sequence set of the new database (Figure SMS_151); otherwise, discard the sequence (Figure SMS_160).
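The update-then-classify logic of substeps 4.2.2.1 to 4.2.2.3 can be sketched as follows. This is only an illustration under stated assumptions: formulas (10) and (11) and the two conditions are available here only as figure placeholders, so the sketch assumes that the updated sequence-weighted utility is the sum of the values from the original and inserted databases, and that a sequence is classified by comparing its utility ratio with the upper and lower utility thresholds. The function name, parameter names and sample numbers are hypothetical.

```python
def update_and_classify(swu_original, swu_inserted, total_utility_new, lower, upper):
    """Update a sequence's weighted utility and decide which set it goes to.

    Assumption: the updated value is the sum of the value stored for the
    original database and the value measured in the inserted database.
    """
    swu_new = swu_original + swu_inserted
    ratio = swu_new / total_utility_new  # utility ratio in the new database
    if ratio >= upper:
        label = "large"       # goes to the large sequence set
    elif ratio >= lower:
        label = "pre-large"   # goes to the pre-large sequence set
    else:
        label = "discard"     # neither large nor pre-large
    return swu_new, label


# Illustrative numbers only: thresholds 0.25 / 0.35 and a new total utility of 158.
print(update_and_classify(50, 10, 158, 0.25, 0.35))
print(update_and_classify(40, 5, 158, 0.25, 0.35))
print(update_and_classify(10, 5, 158, 0.25, 0.35))
```

Keeping the pre-large set alongside the large set is what allows later insertions to be handled from stored values instead of a rescan.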
Further, the specific process of step 4.3 is as follows:
Step 4.3.1, merge the database to be inserted (Figure SMS_161) and the original database (Figure SMS_162) to generate the new database (Figure SMS_163);
Step 4.3.2, for each (Figure SMS_164), calculate its sequence-weighted utility (Figure SMS_166) in the new database (Figure SMS_165) in the same way as formula (5), and then calculate the total utility (Figure SMS_168) of the new database (Figure SMS_167) in the same way as formula (2);
Step 4.3.3, let the sequence-weighted utility ratio be (Figure SMS_170); if (Figure SMS_171), put the sequence (Figure SMS_173) into (Figure SMS_175); if (Figure SMS_177), put the sequence (Figure SMS_178) into (Figure SMS_179); otherwise, discard the sequence (Figure SMS_169); (Figure SMS_172) is the large sequence-weighted-utility (Figure SMS_176) sequence set of the new database (Figure SMS_174); (Figure SMS_180) is the pre-large sequence-weighted-utility (Figure SMS_182) sequence set of the new database (Figure SMS_181);
Step 4.3.4, execute the recursive mining algorithm, using it to generate projected databases for a number of sets and to generate the (Figure SMS_183) and (Figure SMS_185) sequence sets until no further (Figure SMS_187) and (Figure SMS_188) sequence sets are found; mining starts from the 1-sequence set, continues with the 2-sequence set, and proceeds until the last sequence set is empty, at which point mining stops and the large sequence-weighted-utility sequence set (Figure SMS_190) and the pre-large sequence-weighted-utility sequence set (Figure SMS_191) of the new database (Figure SMS_189) are output; (Figure SMS_184) and (Figure SMS_186) are used for the next data insertion.
Further, in step 4.3.4, the specific process of the recursive mining algorithm is as follows:
Step 4.3.4.1, traverse (Figure SMS_192) and (Figure SMS_193), and for each sequence (Figure SMS_196) belonging to (Figure SMS_194) and (Figure SMS_195), construct its projected database (Figure SMS_197);
Step 4.3.4.2, calculate the sequence-weighted utility value (Figure SMS_202) of (Figure SMS_200), where (Figure SMS_203) is the set of extension items of (Figure SMS_205); if (Figure SMS_206), calculate the sequence utility (Figure SMS_208) and put (Figure SMS_209) into the set (Figure SMS_198); if (Figure SMS_199), calculate (Figure SMS_201) and put (Figure SMS_204) into the set (Figure SMS_207); otherwise, no processing is carried out;
Step 4.3.4.3, pass in the current parameters and call the mining procedure recursively until the sets (Figure SMS_211) and (Figure SMS_212) are both empty, then stop; (Figure SMS_213) is the large sequence-weighted-utility ((Figure SMS_215) + 1) sequence set of the new database (Figure SMS_214); (Figure SMS_216) is the pre-large sequence-weighted-utility ((Figure SMS_210) + 1) sequence set of the new database (Figure SMS_217).
The invention has the following beneficial technical effects.
A new sequential pattern mining algorithm, Pre-HUSPM, is provided for handling sequence insertion; when a small amount of data is inserted, the whole database does not need to be updated, which avoids wasting resources.
The matrix-projection-based high-utility sequential pattern mining algorithm (P-HUSPM) reduces the number of candidate sets in sequence mining and thereby shortens the time needed to mine the high-utility sequence sets; because the database does not need to be rescanned frequently, run time is reduced to a large extent.
A new concept (Figure SMS_218) is proposed and used as a safety threshold for judging whether the database needs to be rescanned, which reduces the number of database rescans and the maintenance cost.
Drawings
FIG. 1 is a flow chart of the Pre-HUSPM-based database sequence insertion processing method of the present invention.
FIG. 2 compares the run time of the three algorithms on the SIGN data set in the experiment of the present invention under different lower utility thresholds (Figure SMS_220), with the upper utility threshold (Figure SMS_219) at 15%.
FIG. 3 compares the run time of the three algorithms on the Leviathan data set in the experiment of the present invention under different lower utility thresholds (Figure SMS_222), with the upper utility threshold (Figure SMS_221) at 18%.
FIG. 4 compares the run time of the three algorithms on the FIFA data set in the experiment of the present invention under different lower utility thresholds (Figure SMS_224), with the upper utility threshold (Figure SMS_223) at 21%.
FIG. 5 compares the run time of the three algorithms on the BIBLE data set in the experiment of the present invention under different lower utility thresholds (Figure SMS_226), with the upper utility threshold (Figure SMS_225) at 16%.
FIG. 6 compares the run time of the three algorithms on the Kosarak10k data set in the experiment of the present invention under different lower utility thresholds (Figure SMS_228), with the upper utility threshold (Figure SMS_227) at 14%.
FIG. 7 compares the run time of the three algorithms on the BMS data set in the experiment of the present invention under different lower utility thresholds (Figure SMS_230), with the upper utility threshold (Figure SMS_229) at 4.5%.
Detailed Description
The invention is described in further detail below with reference to the following figures and detailed description:
The database referred to in the invention is a sequence database, whose sequences are divided into large sequences, pre-large sequences and small sequences. A sequence is a large sequence when its support is greater than the upper support threshold; it is a pre-large sequence when its support is smaller than the upper support threshold and greater than the lower support threshold; and it is a small sequence when its support is less than the lower support threshold. Pre-large sequences are those likely to become large sequences in the future.
The invention combines the pre-large concept with the projection-based mining algorithm P-HUSPM to obtain the Pre-HUSPM algorithm. Its main idea is to set a threshold (Figure SMS_231) as the condition for deciding whether the database needs to be rescanned, so that the database sequences are maintained and updated efficiently and the number of database rescans is reduced. (Figure SMS_232) denotes the maximum sequence-weighted utility of a single item in the database to be inserted.
Nine cases may occur when a new sequence database is added to an original sequence database, depending on how a sequence is classified in each of the two databases: in case 1 the sequence is large in the new database and large in the original database; in case 2 it is pre-large in the new database and large in the original database; in case 3 it is small in the new database and large in the original database; in case 4 it is large in the new database and pre-large in the original database; in case 5 it is pre-large in both databases; in case 6 it is small in the new database and pre-large in the original database; in case 7 it is large in the new database and small in the original database; in case 8 it is pre-large in the new database and small in the original database; in case 9 it is small in both databases.
By the weighted average of the counts, cases 1, 5, 6, 8 and 9 do not affect the final set of large sequences. Cases 2 and 3 may remove some existing large sequences, while cases 4 and 7 may add new large sequences. Cases 2, 3 and 4 can be handled well when both the large sequence set and the pre-large sequence set are retained.
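The nine cases and their effect on the final large sequence set can be tabulated in a few lines. The sketch below is illustrative only; the function names are hypothetical, and the treatment of a value exactly equal to a threshold is an assumption, since the text above only states strict comparisons.

```python
CATEGORIES = ("large", "pre-large", "small")


def classify(support, lower, upper):
    """Label a sequence by its support; tie handling at a threshold is assumed."""
    if support >= upper:
        return "large"
    if support >= lower:
        return "pre-large"
    return "small"


def case_number(in_original, in_inserted):
    """Map (category in original DB, category in inserted DB) to case 1..9."""
    return 3 * CATEGORIES.index(in_original) + CATEGORIES.index(in_inserted) + 1


def effect_on_large_set(case):
    """Effect of each case on the final large sequence set, as listed above."""
    if case in (2, 3):
        return "may remove"
    if case in (4, 7):
        return "may add"
    return "unchanged"


# A sequence that is small in the original database but large in the inserted one.
case = case_number("small", "large")
print(case, effect_on_large_set(case))
```

Case 7 is the only one that both can add a new large sequence and has no stored information in the large or pre-large sets, which is why it needs the safety value.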
Case 7 above is the main focus of the invention. When case 7 occurs and the amount of inserted data is small, the database essentially does not need to be updated; the prior art nevertheless updates the database at this point, which wastes resources.
To address this problem, the invention provides a Pre-HUSPM-based database sequence insertion processing method, which relies on the following theorem; the theorem is proved below.
Theorem: let (Figure SMS_233) and (Figure SMS_234) be the lower utility threshold and the upper utility threshold respectively, and let (Figure SMS_235) be the total utility of the original database (Figure SMS_236). (Figure SMS_237) is the maximum sequence-weighted utility of a single item in the database to be inserted (Figure SMS_238). If (Figure SMS_239), then a sequence set in case 7 will not be a high sequence-weighted-utility sequence set in the entire updated database.
Proof: from (Figure SMS_240), the following derivation is obtained:
Figure SMS_241
Figure SMS_242
Figure SMS_243
Figure SMS_244;
Figure SMS_245
For a sequence in case 7, if the sequence-weighted utility of the sequence (Figure SMS_246) in the original database (Figure SMS_247) is small, then (Figure SMS_248). If the sequence (Figure SMS_250) has a large sequence-weighted utility in the database to be inserted (Figure SMS_251), then its sequence-weighted utility (Figure SMS_253) in the database to be inserted (Figure SMS_252) must be greater than or equal to (Figure SMS_254) but less than or equal to the (Figure SMS_256) of the database to be inserted (Figure SMS_255). Accordingly, (Figure SMS_249). In sequence mining, after the database (Figure SMS_257) is inserted, the updated (Figure SMS_259) in the resulting new database (Figure SMS_258) is calculated as:
Figure SMS_260
where (Figure SMS_261) is the sequence-weighted utility of the sequence (Figure SMS_265) in the new database (Figure SMS_263), and (Figure SMS_266) is the sequence-weighted utility of the sequence (Figure SMS_268) in the original database (Figure SMS_267). Therefore, when (Figure SMS_269) is less than the safety value (Figure SMS_262) ((Figure SMS_264)), there is no need to rescan the original database.
According to this theorem, sequences in case 7 can be handled efficiently.
A database sequence insertion processing method based on Pre-HUSPM specifically comprises the following steps:
Step 1, insert the database to be inserted (Figure SMS_271) into the original database (Figure SMS_270).
In the embodiment of the invention, the original database (Figure SMS_272) is a transaction database, and the database to be inserted (Figure SMS_273) is a new transaction database.
Both the original transaction database and the new transaction database contain a group of sequences. Let the original database be (Figure SMS_275), where (Figure SMS_276) is the total number of sequences, (Figure SMS_277) is the index of a sequence, and (Figure SMS_278) denotes the (Figure SMS_279)-th sequence; each sequence (Figure SMS_280) has a unique identifier (Figure SMS_281). Let the set of items be (Figure SMS_274), where (Figure SMS_282) is the total number of items; an itemset (Figure SMS_283) is a set of (Figure SMS_284) distinct items, denoted (Figure SMS_285), where (Figure SMS_286) denotes the (Figure SMS_288)-th item of the itemset (Figure SMS_287).
The original transaction database contains five sequences, (Figure SMS_289), (Figure SMS_291), (Figure SMS_293), (Figure SMS_295) and (Figure SMS_297), and five items, (Figure SMS_298), (Figure SMS_300), (Figure SMS_301), (Figure SMS_303) and (Figure SMS_305). The itemsets of the sequence (Figure SMS_307) are (Figure SMS_309), where (Figure SMS_311) denotes an item; the itemsets of the sequence (Figure SMS_312) are (Figure SMS_314); the itemsets of the sequence (Figure SMS_290) are (Figure SMS_292); the itemsets of the sequence (Figure SMS_294) are (Figure SMS_296); the itemsets of the sequence (Figure SMS_299) are (Figure SMS_302). The unit profits of the five items (Figure SMS_304), (Figure SMS_306), (Figure SMS_308), (Figure SMS_310) and (Figure SMS_313) are 3, 2, 4, 2 and 1 respectively; they are stored in the database in tabular form as the item profit table (Figure SMS_315).
The database to be inserted (Figure SMS_316) contains two sequences, (Figure SMS_317) and (Figure SMS_318); the itemsets of the sequence (Figure SMS_319) are (Figure SMS_320), and the itemsets of the sequence (Figure SMS_321) are (Figure SMS_322).
Step 2, calculate the safety value (Figure SMS_324) from the information of the original database (Figure SMS_323).
The safety value (Figure SMS_325) is calculated as follows:
Figure SMS_326 (1);
where (Figure SMS_327) denotes the upper utility threshold, (Figure SMS_328) denotes the lower utility threshold, and (Figure SMS_329) denotes the total utility of the original database (Figure SMS_330); the values of (Figure SMS_331) and (Figure SMS_332) are preset.
(Figure SMS_333) is calculated as follows:
Figure SMS_334 (2);
where (Figure SMS_335) denotes the total utility of the sequence (Figure SMS_337) in the original database (Figure SMS_336), calculated as follows:
Figure SMS_338 (3);
where (Figure SMS_339) denotes the utility of the (Figure SMS_342)-th item of the itemset (Figure SMS_341) in the sequence (Figure SMS_340).
In the embodiment of the invention, the upper utility threshold (Figure SMS_344) is preset to 0.35, the upper utility threshold being the same as the high-utility sequential pattern threshold, and the lower utility threshold (Figure SMS_345) is set to 0.25. The calculation gives (Figure SMS_346) = 36, (Figure SMS_347) = 26, (Figure SMS_348) = 28, (Figure SMS_349) = 23, (Figure SMS_350) = 28; (Figure SMS_343) = 141; (Figure SMS_351) = 21.
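The numbers of this embodiment can be checked with a short script. Formula (1) is available here only as a figure placeholder, so the sketch assumes that the safety value takes the form commonly used with the pre-large concept, floor((upper - lower) × total utility / (1 - upper)). That form is an assumption; it does reproduce the value 21 given above from the thresholds 0.35 and 0.25 and the total utility 141. The function and variable names are hypothetical.

```python
import math


def safety_value(upper, lower, total_utility):
    """Assumed form of formula (1): floor((upper - lower) * TU / (1 - upper))."""
    if not 0 <= lower <= upper < 1:
        raise ValueError("expected 0 <= lower <= upper < 1")
    return math.floor((upper - lower) * total_utility / (1 - upper))


# Per-sequence total utilities of the five original sequences, from the embodiment.
sequence_utilities = [36, 26, 28, 23, 28]
total_utility = sum(sequence_utilities)  # formula (2): sum over all sequences

print(total_utility)                            # 141
print(safety_value(0.35, 0.25, total_utility))  # 21
```

A wider gap between the two thresholds gives a larger safety value, so more data can be inserted before a rescan is triggered, at the price of keeping more pre-large sequences.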
Step 3, scan the database to be inserted (Figure SMS_352), and calculate the total utility (Figure SMS_354) of each sequence in the database to be inserted (Figure SMS_353) and the total utility (Figure SMS_356) of the database to be inserted (Figure SMS_355).
The total utility (Figure SMS_358) of the database to be inserted (Figure SMS_357) is calculated in the same way as formulas (2) and (3); at the same time, (Figure SMS_359) is calculated, and this calculation includes the relevant data of the database to be inserted (Figure SMS_360).
In the embodiment of the invention, (Figure SMS_361) = 10, (Figure SMS_362) = 7, and (Figure SMS_363) = 17.
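A sequence's total utility and a database's total utility, in the sense of formulas (2) and (3), can be sketched as below. The unit profits 3, 2, 4, 2 and 1 come from the embodiment, but the item names a to e and the two inserted sequences are hypothetical: the actual sequences are shown only in figures, and the ones here were chosen solely so that the utilities come out as 10 and 7.

```python
# Unit profits from the embodiment; the item names "a".."e" are assumed.
PROFIT = {"a": 3, "b": 2, "c": 4, "d": 2, "e": 1}


def sequence_utility(sequence):
    """Sum quantity * unit profit over every item of every itemset."""
    return sum(qty * PROFIT[item]
               for itemset in sequence
               for item, qty in itemset.items())


def database_utility(database):
    """Total utility of a database is the sum of its sequence utilities."""
    return sum(sequence_utility(seq) for seq in database)


# Hypothetical inserted database: each sequence is a list of {item: quantity} itemsets.
inserted = [
    [{"a": 2}, {"c": 1}],          # 2*3 + 1*4 = 10
    [{"b": 1, "e": 1}, {"c": 1}],  # 1*2 + 1*1 + 1*4 = 7
]

print([sequence_utility(s) for s in inserted])  # [10, 7]
print(database_utility(inserted))               # 17
```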
Step 4, compare the sum of the total utility value of the new transactions since the original database was last rescanned and (Figure SMS_364) with the safety value (Figure SMS_365), and perform the corresponding operation according to the comparison result. The specific judgment criterion is: let (Figure SMS_366); when the total utility value of the new transactions since the last rescan of the original database satisfies (Figure SMS_367), steps 4.1 and 4.2 are carried out; when (Figure SMS_368), step 4.3 is carried out.
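The branch taken in step 4 can be sketched as a single comparison. The exact inequalities are shown only as figures, so the sketch assumes that the incremental path (steps 4.1 and 4.2) is taken when the sum does not exceed the safety value and that the rescan path (step 4.3) is taken otherwise; how a tie is handled is an assumption. The function name, parameter names and sample inputs are hypothetical, except for the safety value 21 taken from the embodiment.

```python
def choose_branch(utility_since_rescan, max_item_swu, safety_value):
    """Return which part of step 4 applies for the current insertion.

    utility_since_rescan: total utility of new transactions since the last rescan.
    max_item_swu: maximum sequence-weighted utility of a single item in the
                  database to be inserted.
    """
    if utility_since_rescan + max_item_swu <= safety_value:
        return "incremental"  # steps 4.1 and 4.2: update stored information only
    return "rescan"           # step 4.3: merge the databases and rescan


# Safety value 21 is from the embodiment; the other inputs are illustrative.
print(choose_branch(0, 10, 21))   # incremental
print(choose_branch(15, 10, 21))  # rescan
```

After a rescan the accumulated utility is reset to 0, so the next insertions start again with the full safety margin.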
Step 4.1, scan the database to be inserted (Figure SMS_369) to generate the 1-candidate set, and set (Figure SMS_370) = 1, where (Figure SMS_371) denotes the number of items being processed in the sequence set.
In the embodiment of the invention, the generated 1-candidate set is:
Figure SMS_372
Step 4.2, scan the 1-candidate set and update the sequence utility and the sequence-weighted utility of the original information; generate the 2-candidate set in turn, and keep updating the sequence utility and the sequence-weighted utility of the original information until no further candidate set is generated. At the same time, set (Figure SMS_373). The specific process is as follows:
Step 4.2.1, calculate the total utility (Figure SMS_375) of the new database (Figure SMS_374) as follows:
Figure SMS_376 (4);
In the embodiment of the invention, (Figure SMS_377) = 141 + 17 = 158.
For each candidate in the candidate set (Figure SMS_378), calculate the sequence-weighted utility (Figure SMS_381) and the sequence utility (Figure SMS_382) of the sequence (Figure SMS_380) in the database to be inserted (Figure SMS_379) as follows:
Figure SMS_383 (5);
Figure SMS_384 (6);
where (Figure SMS_385) denotes the total utility value of the row of the sequence (Figure SMS_386); (Figure SMS_387) denotes the utility of the subsequence (Figure SMS_389) of the sequence (Figure SMS_388), which is the maximum utility over all of its occurrences in the sequence (Figure SMS_390) and is defined as follows:
Figure SMS_391 (7);
where (Figure SMS_392) denotes the maximum internal utility of an item in a sequence, that is, the largest utility value of the item in the sequence, defined as follows:
Figure SMS_393 (8);
where (Figure SMS_394) denotes the internal utility of the (Figure SMS_397)-th item of the itemset (Figure SMS_396) in the sequence (Figure SMS_395), defined as follows:
Figure SMS_398 (9);
where (Figure SMS_399) denotes the quantity of the (Figure SMS_402)-th item of the itemset (Figure SMS_401) in the sequence (Figure SMS_400), and (Figure SMS_403) denotes the unit profit of the item (Figure SMS_404).
For example in an embodiment of the present invention,
Figure SMS_405
=10,
Figure SMS_406
=8。
For example,
Figure SMS_408
can be expressed as
Figure SMS_409
, where
Figure SMS_411
Figure SMS_412
Figure SMS_413
. In
Figure SMS_414
, the internal utilities of
Figure SMS_415
and
Figure SMS_407
are, respectively:
Figure SMS_410
=3×3=9,
Figure SMS_416
=2×3=6.
In
Figure SMS_417
,
Figure SMS_418
appears twice, and the maximum utility of
Figure SMS_419
in
Figure SMS_420
is expressed as:
Figure SMS_421
=9.
The subsequence
Figure SMS_422
occurs twice in
Figure SMS_423
, with utilities (3 × 3) + (4 × 2) = 17 and (3 × 2) + (4 × 2) = 14, respectively. Therefore,
Figure SMS_424
=17.
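The maximum-utility definition and the worked example above can be checked with a short sketch. The data layout (a sequence as a list of itemsets, each mapping an item to its quantity), the item names `a` and `c`, and the function name `max_utility` are illustrative assumptions; the quantities and unit profits are chosen only to reproduce the values 9, 6, 17 and 14 quoted in the example, because the actual sequence is given in a formula image.

```python
def max_utility(pattern, qseq, profit):
    """Maximum utility of `pattern` over all of its occurrences in `qseq`.

    pattern -- list of itemsets (each a set of items), in order
    qseq    -- list of itemsets, each a dict mapping item -> quantity
    profit  -- dict mapping item -> unit profit
    """
    neg = float("-inf")
    # best[k]: best utility of matching the first k pattern itemsets so far.
    best = [0] + [neg] * len(pattern)
    for element in qseq:
        # Visit long prefixes first so one element extends a match only once.
        for k in range(len(pattern), 0, -1):
            itemset = pattern[k - 1]
            if best[k - 1] == neg or not all(i in element for i in itemset):
                continue
            utility = sum(element[i] * profit[i] for i in itemset)
            best[k] = max(best[k], best[k - 1] + utility)
    return 0 if best[-1] == neg else best[-1]


# Hypothetical data: item a (unit profit 3) occurs with quantities 3 and 2,
# followed by item c (unit profit 2) with quantity 4.
profit = {"a": 3, "c": 2}
qseq = [{"a": 3}, {"a": 2}, {"c": 4}]

print(max_utility([{"a"}], qseq, profit))         # 9  (the larger of 9 and 6)
print(max_utility([{"a"}, {"c"}], qseq, profit))  # 17 (the larger of 17 and 14)
```

The dynamic-programming form avoids enumerating every occurrence explicitly, which matters for long sequences in which the same item repeats many times.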
Step 4.2.2, for each sequence set in the large sequence weighted utility sequence set
Figure SMS_425
of the original database, perform the following sub-steps:
Sub-step 4.2.2.1, update, for the new database
Figure SMS_426
, the sequence weighted utility of the sequence
Figure SMS_427
, denoted
Figure SMS_428
. The calculation formula is as follows:
Figure SMS_429
(10);
where
Figure SMS_430
is, for the original database
Figure SMS_432
, the sequence weighted utility of the sequence
Figure SMS_433
;
Figure SMS_435
stores, for the sequence
Figure SMS_436
, its
Figure SMS_437
;
Figure SMS_438
is, for the database to be inserted
Figure SMS_431
, the sequence weighted utility of the sequence
Figure SMS_434
.
In the embodiment of the present invention, in
Figure SMS_439
, for the sequence
Figure SMS_440
,
Figure SMS_441
=76+7=83.
Sub-step 4.2.2.2, update, for the new database
Figure SMS_442
, the sequence utility of the sequence set
Figure SMS_443
, denoted
Figure SMS_444
, as follows:
Figure SMS_445
(11);
where
Figure SMS_447
represents, for the sequence
Figure SMS_448
in the original database
Figure SMS_450
, its sequence utility;
Figure SMS_451
stores, for the sequence
Figure SMS_452
, its
Figure SMS_453
;
Figure SMS_454
is, for the database to be inserted
Figure SMS_446
, the sequence utility of the sequence
Figure SMS_449
.
In the embodiment of the present invention, in
Figure SMS_455
, for the sequence
Figure SMS_456
,
Figure SMS_457
=30+3=33.
Sub-step 4.2.2.3, if
Figure SMS_459
, the sequence
Figure SMS_460
is put into
Figure SMS_462
, where
Figure SMS_464
is, for the new database
Figure SMS_466
, the large sequence weighted utility
Figure SMS_468
-sequence set; if
Figure SMS_470
, the sequence
Figure SMS_458
is put into
Figure SMS_461
, where
Figure SMS_463
is, for the new database
Figure SMS_465
, the pre-large sequence weighted utility
Figure SMS_467
-sequence set; otherwise, the sequence
Figure SMS_469
is discarded, since it is still small after the database update.
In the embodiment of the present invention,
Figure SMS_471
=52.5%>35%, so the sequence
Figure SMS_472
is still placed in the set
Figure SMS_473
.
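Sub-steps 4.2.2.1 to 4.2.2.3 amount to two additions followed by a ratio test, which the following sketch reproduces with the numbers of the embodiment (76+7=83, 141+17=158, 83/158=52.5%, upper threshold 35%). The function name, the labels, the `>=` comparisons and the lower threshold of 0.25 are assumptions made for illustration; the source states the conditions and the lower threshold only in formula images.

```python
def update_and_classify(swu_original, swu_inserted, tu_original, tu_inserted,
                        upper, lower):
    """Update a sequence's weighted utility and classify it in the new database.

    Returns (swu_new, tu_new, ratio, label), where label is one of
    "large", "pre-large" or "small".
    """
    swu_new = swu_original + swu_inserted   # sub-step 4.2.2.1
    tu_new = tu_original + tu_inserted      # total utility of the new database
    ratio = swu_new / tu_new
    if ratio >= upper:                      # assumed comparison
        label = "large"
    elif ratio >= lower:                    # assumed comparison
        label = "pre-large"
    else:
        label = "small"                     # discarded after the update
    return swu_new, tu_new, ratio, label


# Embodiment values; lower=0.25 is a hypothetical lower threshold.
swu, tu, ratio, label = update_and_classify(76, 7, 141, 17, upper=0.35, lower=0.25)
print(swu, tu, f"{ratio:.1%}", label)  # 83 158 52.5% large
```

The sequence utility of sub-step 4.2.2.2 is updated by the same kind of addition (30+3=33 in the embodiment) and does not affect the classification.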
Step 4.2.3, for each sequence set in the pre-large sequence weighted utility sequence set
Figure SMS_474
of the original database, sub-steps 4.2.2.1 to 4.2.2.3 of step 4.2.2 are likewise performed.
If, for the original database
Figure SMS_477
, the large sequence weighted utility sequence set
Figure SMS_479
and, for the original database
Figure SMS_480
, the pre-large sequence weighted utility sequence set
Figure SMS_482
contain a sequence of the database to be inserted
Figure SMS_484
, namely
Figure SMS_486
, then in
Figure SMS_488
and
Figure SMS_475
the sequence utility
Figure SMS_478
and the sequence weighted utility
Figure SMS_481
of that sequence are updated, and the sequence
Figure SMS_483
is put into the 1-candidate set, which is used to generate the 2-candidate set; if
Figure SMS_485
and
Figure SMS_487
do not contain the sequence of the new database
Figure SMS_489
, namely
Figure SMS_490
, then
Figure SMS_476
is removed from the 1-candidate set.
For example, in the embodiment of the present invention,
Figure SMS_492
Figure SMS_493
is in
Figure SMS_494
, so
Figure SMS_496
is added to the 1-candidate set; otherwise it would be removed. From the 1-candidate set, the 2-candidate sets
Figure SMS_498
Figure SMS_500
Figure SMS_501
and
Figure SMS_491
can be generated, and from the database to be inserted
Figure SMS_495
their
Figure SMS_497
and
Figure SMS_499
are mined; if a candidate does not occur, the value is 0, and so on until the candidate set is empty.
Step 4.2.4, from the
Figure SMS_502
-candidate set, generate the (
Figure SMS_503
+ 1)-candidate set
Figure SMS_504
; set
Figure SMS_505
=
Figure SMS_506
+1, and repeat steps 4.2.1 through 4.2.4 until no updated large or pre-large sequence weighted utility sequence set is found.
Step 4.3, when
Figure SMS_507
, a new database is generated, and the original database needs to be rescanned at this point. Set
Figure SMS_508
to 0 and assign
Figure SMS_509
to
Figure SMS_510
. The specific process is as follows:
Step 4.3.1, merge the database to be inserted
Figure SMS_511
with the original database D to generate a new database U;
Step 4.3.2, for each
Figure SMS_512
, calculate, in the same way as in formula (5), for the new database
Figure SMS_513
, the sequence weighted utility
Figure SMS_514
; then calculate, in the same way as in formula (2), for the new database
Figure SMS_515
, the total utility
Figure SMS_516
.
Step 4.3.3, let the sequence weighted utility ratio be
Figure SMS_518
. If
Figure SMS_519
, put the sequence
Figure SMS_520
into
Figure SMS_521
; if
Figure SMS_522
, put the sequence
Figure SMS_523
into
Figure SMS_524
; otherwise, discard the sequence
Figure SMS_517
, because it is still small after the database update.
Step 4.3.4, execute the recursive mining algorithm, which builds projection databases for multiple sequence sets and generates the
Figure SMS_526
and
Figure SMS_527
sequence sets until no further
Figure SMS_529
and
Figure SMS_530
sequence sets are found. Mining starts from the 1-sequence sets, continues with the 2-sequence sets, and stops when the last sequence set is empty; it then outputs, for the new database
Figure SMS_531
, the large sequence weighted utility sequence set
Figure SMS_532
and the pre-large sequence weighted utility sequence set
Figure SMS_533
.
Figure SMS_525
and
Figure SMS_528
are used for the next data insertion.
The specific process is as follows:
Step 4.3.4.1, traverse
Figure SMS_535
and
Figure SMS_536
; for each sequence belonging to
Figure SMS_537
and
Figure SMS_538
, namely
Figure SMS_539
, construct its projection database
Figure SMS_540
. This reduces the number of candidate sets and speeds up operation, where
Figure SMS_541
denotes the number of items in the sequence sets currently being processed. The projection database is constructed as follows: find every sequence that has the item
Figure SMS_534
as a prefix; a sequence that does not contain the item
Figure SMS_542
is not retained.
Definition: given two sequences
Figure SMS_543
and
Figure SMS_544
, where
Figure SMS_545
. If (1) the sequence has the prefix
Figure SMS_546
, and (2) there is no longer any supersequence of the sequence
Figure SMS_547
that has the prefix
Figure SMS_548
, then the sequence
Figure SMS_549
is called the projection of
Figure SMS_550
, and this relationship is denoted as
Figure SMS_551
. Accordingly, for the sequence
Figure SMS_552
in the new database
Figure SMS_553
, the projection database of the sequence
Figure SMS_554
is the set of all projection sequences of each sequence in the corresponding database, recorded as
Figure SMS_555
.
For example, according to the above definition, to construct the projection database for the sequence
Figure SMS_556
, find every sequence that has
Figure SMS_558
as a prefix; a sequence that does not contain the item
Figure SMS_560
is not retained. For instance,
Figure SMS_562
does not contain the item
Figure SMS_564
, so the projection database of the sequence
Figure SMS_566
does not include
Figure SMS_567
. Accordingly, the projection database of the sequence
Figure SMS_569
contains only the four sequences
Figure SMS_571
Figure SMS_573
Figure SMS_575
Figure SMS_576
, specifically:
the itemsets of the sequence
Figure SMS_577
are
Figure SMS_578
, and the total utility of the sequence
Figure SMS_579
is 36;
the itemsets of the sequence
Figure SMS_557
are
Figure SMS_559
, and the total utility of the sequence
Figure SMS_561
is 9;
the itemsets of the sequence
Figure SMS_563
are
Figure SMS_565
, and the total utility of the sequence
Figure SMS_568
is 9;
the itemsets of the sequence
Figure SMS_570
are
Figure SMS_572
, and the total utility of the sequence
Figure SMS_574
is 22.
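The construction described above (keep only the sequences that contain the prefix item, and record each one's total utility) can be sketched for a single-item prefix as follows. This is a simplified illustration in the style of prefix projection, not the patented procedure: the database contents, item names, unit profits and the function name `project` are hypothetical, and the rule used for the remainder of the matching itemset (keep the items that sort after the prefix item) is an assumption.

```python
def project(db, item, profit):
    """Build the projection database of a single-item prefix.

    db     -- dict: sequence id -> list of itemsets (dict item -> quantity)
    item   -- the prefix item
    profit -- dict: item -> unit profit

    Returns a list of (sequence id, suffix, total utility of the sequence).
    Sequences that do not contain `item` are not retained.
    """
    projected = []
    for sid, seq in db.items():
        for pos, element in enumerate(seq):
            if item not in element:
                continue
            # Remainder of the matching itemset (assumed rule: items after `item`).
            rest = {i: q for i, q in element.items() if i > item}
            suffix = ([rest] if rest else []) + seq[pos + 1:]
            total = sum(q * profit[i] for el in seq for i, q in el.items())
            projected.append((sid, suffix, total))
            break  # project on the first occurrence only
    return projected


# Hypothetical database of three sequences.
profit = {"a": 3, "b": 1, "c": 2}
db = {
    "s1": [{"a": 1, "b": 2}, {"c": 1}],
    "s2": [{"b": 1}, {"c": 2}],        # no item a: dropped
    "s3": [{"c": 3}, {"a": 2}],
}
for row in project(db, "a", profit):
    print(row)
```

Keeping the total utility of the whole sequence next to each suffix means the sequence weighted utility of any extension can be accumulated later without returning to the full database.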
Step 4.3.4.2, calculate, for
Figure SMS_581
, the sequence weighted utility value
Figure SMS_586
, where
Figure SMS_587
is the extended itemset of
Figure SMS_588
; if
Figure SMS_589
, calculate the sequence utility
Figure SMS_590
and put
Figure SMS_591
into the set
Figure SMS_580
; if
Figure SMS_582
, calculate
Figure SMS_583
and put
Figure SMS_584
into the set
Figure SMS_585
; otherwise, no processing is performed.
Step 4.3.4.3, pass in the current parameters and call the mining procedure recursively until the sets
Figure SMS_592
and
Figure SMS_593
are empty, at which point the operation stops.
The pseudo-code of the recursive mining algorithm is as follows:
for each sequence
Figure SMS_594
do;
construct, for the sequence
Figure SMS_595
, its projection database
Figure SMS_596
;
end for;
for each
Figure SMS_597
, where
Figure SMS_598
is a superset of
Figure SMS_599
in the projection database
Figure SMS_600
, do;
calculate
Figure SMS_601
;
if
Figure SMS_602
then;
calculate
Figure SMS_603
;
put the sequence
Figure SMS_604
into the set
Figure SMS_605
;
else if
Figure SMS_606
then;
calculate
Figure SMS_607
;
put the sequence
Figure SMS_608
into the set
Figure SMS_609
;
end if;
end for;
Mining(
Figure SMS_610
Figure SMS_611
Figure SMS_612
Figure SMS_613
Figure SMS_614
Figure SMS_615
);
Step 5, for the new database
Figure SMS_618
, determine, for each sequence in its large sequence weighted utility sequence set
Figure SMS_619
, namely
Figure SMS_620
, whether its utility ratio is greater than or equal to the upper utility threshold
Figure SMS_621
, i.e., whether
Figure SMS_622
holds. If so, the sequence
Figure SMS_623
is a high-utility sequential pattern, and the sequence S is added to the high-utility sequential pattern set
Figure SMS_624
and output; otherwise, no operation is required. Finally, output the updated new database
Figure SMS_616
and its high-utility sequential pattern set
Figure SMS_617
.
In the embodiment of the present invention,
Figure SMS_625
=
Figure SMS_626
=35.4% > 35%, so
Figure SMS_627
is a high-utility sequence and is added to the high-utility sequential pattern set
Figure SMS_628
.
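The filter of step 5 can be sketched with the values reported for the embodiment: a total utility of 158 for the new database, an upper threshold of 35%, and four large sequence sets with sequence utilities 22, 56, 20 and 16. The labels `s1` to `s4` and the function name are placeholders, since the actual sequence names appear only in formula images.

```python
def high_utility_patterns(large_set, total_utility, upper):
    """Keep the sequences whose utility ratio reaches the upper threshold.

    large_set     -- dict: sequence label -> (sequence weighted utility, sequence utility)
    total_utility -- total utility of the new database
    upper         -- upper utility threshold, as a fraction
    """
    return {
        label: su / total_utility
        for label, (_swu, su) in large_set.items()
        if su / total_utility >= upper
    }


# (sequence weighted utility, sequence utility) pairs from the embodiment;
# the labels are placeholders.
large_set = {"s1": (83, 22), "s2": (95, 56), "s3": (77, 20), "s4": (77, 16)}
husp = high_utility_patterns(large_set, total_utility=158, upper=0.35)
print({label: f"{ratio:.1%}" for label, ratio in husp.items()})  # {'s2': '35.4%'}
```

Only the sequence with utility 56 passes (56/158 = 35.4%), which matches the single high-utility pattern reported for the embodiment; the other three stay in the large set because their weighted utility is high, but their actual utility ratio is below the threshold.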
The finally obtained
Figure SMS_629
and
Figure SMS_630
are as follows:
The large sequence weighted utility sequence set
Figure SMS_632
comprises the sequence sets
Figure SMS_633
Figure SMS_635
Figure SMS_636
Figure SMS_637
, where the sequence set
Figure SMS_638
has a sequence weighted utility of 83 and a sequence utility of 22; the sequence set
Figure SMS_639
has a sequence weighted utility of 95 and a sequence utility of 56; the sequence set
Figure SMS_631
has a sequence weighted utility of 77 and a sequence utility of 20; the sequence set
Figure SMS_634
has a sequence weighted utility of 77 and a sequence utility of 16;
The pre-large sequence weighted utility sequence set
Figure SMS_640
comprises the sequence sets
Figure SMS_641
Figure SMS_643
Figure SMS_644
Figure SMS_646
Figure SMS_648
, where the sequence set
Figure SMS_649
has a sequence weighted utility of 53 and a sequence utility of 18; the sequence set
Figure SMS_642
has a sequence weighted utility of 46 and a sequence utility of 17; the sequence set
Figure SMS_645
has a sequence weighted utility of 43 and a sequence utility of 17; the sequence set
Figure SMS_647
has a sequence weighted utility of 52 and a sequence utility of 32; the sequence set
Figure SMS_650
has a sequence weighted utility of 54 and a sequence utility of 38.
For the updated new database
Figure SMS_651
, the high-utility sequential pattern set
Figure SMS_652
comprises only the sequence set
Figure SMS_653
; at this point the sequence set
Figure SMS_654
has a sequence weighted utility of 95, a sequence utility of 56, and a utility ratio of 35.4%.
In the present invention, the pseudo-code of the Pre-HUSPM algorithm is as follows:
Input: the profit table of the items
Figure SMS_656
; the original database
Figure SMS_657
; the upper utility threshold
Figure SMS_659
(the same as the minimum high sequence utility threshold); the lower utility threshold
Figure SMS_661
; for
Figure SMS_663
, its
Figure SMS_664
; the set of large sequence weighted utility sequences
Figure SMS_665
and the set of pre-large sequence weighted utility sequences
Figure SMS_655
, together with their sequence weighted utility values and the actual utility values found from
Figure SMS_658
; the safe transaction utility buffer, which holds the total utility value of the sequences processed last,
Figure SMS_660
; and the database to be inserted
Figure SMS_662
.
Output: for the new database
Figure SMS_666
(
Figure SMS_667
), the high-utility sequential pattern set (
Figure SMS_668
).
compute the safety sequence utility bound
Figure SMS_669
;
for each
Figure SMS_670
do;
scan the database
Figure SMS_671
and calculate
Figure SMS_672
;
end for;
calculate
Figure SMS_673
and
Figure SMS_674
;
if
Figure SMS_675
then;
calculate the total utility
Figure SMS_676
;
set
Figure SMS_677
=1;
generate the 1-item candidate set
Figure SMS_678
Figure SMS_679
;
while
Figure SMS_680
null do;
for each
Figure SMS_681
do;
calculate
Figure SMS_682
;
calculate
Figure SMS_683
;
end for;
for each
Figure SMS_684
do;
call the utility summation algorithm;
end for;
for each
Figure SMS_685
do;
call the utility summation algorithm;
end for;
from the
Figure SMS_686
-candidate set (
Figure SMS_687
), generate the (
Figure SMS_688
+ 1)-candidate set
Figure SMS_689
;
set
Figure SMS_690
=
Figure SMS_691
+1;
end while;
else;
merge the database to be inserted
Figure SMS_692
and the original database
Figure SMS_693
to generate the new database
Figure SMS_694
;
for each
Figure SMS_695
do;
calculate
Figure SMS_696
;
end for;
calculate
Figure SMS_697
;
set
Figure SMS_698
=1;
for each
Figure SMS_699
do;
if
Figure SMS_700
then;
add
Figure SMS_701
to the set
Figure SMS_702
;
else if
Figure SMS_703
then;
add
Figure SMS_704
to the set
Figure SMS_705
;
end if;
end for;
if
Figure SMS_706
is not in
Figure SMS_707
and
Figure SMS_708
, remove
Figure SMS_709
from the new database
Figure SMS_710
to obtain the new database
Figure SMS_711
;
Mining(
Figure SMS_712
Figure SMS_713
Figure SMS_714
Figure SMS_715
Figure SMS_716
Figure SMS_717
);
end if;
for each
Figure SMS_718
do;
if
Figure SMS_719
then;
put the sequence
Figure SMS_720
into the set
Figure SMS_721
;
end if;
end for;
if
Figure SMS_722
then;
set
Figure SMS_723
and
Figure SMS_724
= 0;
else;
set
Figure SMS_725
;
end if;
set
Figure SMS_726
and
Figure SMS_727
;
The pseudo-code of the utility summation algorithm used in the above pseudo-code is as follows:
Figure SMS_728
;
Figure SMS_729
;
if
Figure SMS_730
then;
put the sequence
Figure SMS_731
into the set
Figure SMS_732
;
else if
Figure SMS_733
then;
put the sequence
Figure SMS_734
into the set
Figure SMS_735
;
end if;
To demonstrate the superiority and feasibility of the algorithm of the invention, a comparative experiment was carried out. The Pre-HUSPM algorithm provided by the invention is compared with the P-HUSPM algorithm and the Pre-HUSPM-TSU algorithm. The experiment used 6 real data sets of different scales and with different characteristics, named SIGN, LEVIATHAN, FIFA, BIBLE, Kosarak10k and BMS, all from the SPMF website. SIGN is a dense data set containing many very long sequences; LEVIATHAN and FIFA are medium-density data sets containing many long sequences; BIBLE is a medium-density data set containing many sequences of medium length; BMS and Kosarak10k are sparse data sets with only a few long sequences. All data sets satisfy a Gaussian distribution. In the experiment, each data set was divided into one original data set and 100 new data sets. The characteristic attributes of the data sets are as follows:
SIGN: 730 sequences, 267 different items, average sequence length 52, maximum sequence length 94, 230 sequences in the original database, 5 sequences in the database to be inserted;
LEVIATHAN: 5834 sequences, 9025 different items, average sequence length 33.8, maximum sequence length 100, 2834 sequences in the original database, 30 sequences in the database to be inserted;
FIFA: 20450 sequences, 2990 different items, average sequence length 36.2, maximum sequence length 100, 10450 sequences in the original database, 100 sequences in the database to be inserted;
BIBLE: 36369 sequences, 13905 different items, average sequence length 21.6, maximum sequence length 100, 21369 sequences in the original database, 150 sequences in the database to be inserted;
Kosarak10k: 10000 sequences, 10094 different items, average sequence length 8.1, maximum sequence length 608, 1000 sequences in the original database, 90 sequences in the database to be inserted;
BMS: 59601 sequences, 497 different items, average sequence length 2.5, maximum sequence length 267, 39601 sequences in the original database, 200 sequences in the database to be inserted.
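The split into one original data set and 100 new data sets can be checked arithmetically: for every data set, the reported total equals the original size plus 100 insertion batches of the stated size. The short check below uses only the numbers given above; reading the "number of sequences to be inserted" as a per-batch size is an interpretation that this arithmetic supports, and the names `datasets` and `split_is_consistent` are illustrative.

```python
# (total sequences, sequences in the original database, sequences per insertion)
datasets = {
    "SIGN": (730, 230, 5),
    "LEVIATHAN": (5834, 2834, 30),
    "FIFA": (20450, 10450, 100),
    "BIBLE": (36369, 21369, 150),
    "Kosarak10k": (10000, 1000, 90),
    "BMS": (59601, 39601, 200),
}


def split_is_consistent(total, original, batch, n_batches=100):
    """True if the original database plus all insertion batches gives the total."""
    return original + n_batches * batch == total


for name, (total, original, batch) in datasets.items():
    print(name, split_is_consistent(total, original, batch))
```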
In the experiments, on the six data sets the upper utility threshold
Figure SMS_737
is held fixed as a controlled variable, and different lower utility thresholds
Figure SMS_738
are selected for comparison; the experimental results are shown in Figs. 2-7. The experiments show that the Pre-HUSPM-TSU algorithm has a shorter running time than the HUSPM algorithm. The optimized algorithm proposed by the invention uses
Figure SMS_739
instead of the
Figure SMS_740
used in the Pre-HUSPM-TSU algorithm, which in effect forms the Pre-HUSPM-
Figure SMS_741
algorithm (Pre-HUSPM-
Figure SMS_742
is the Pre-HUSPM algorithm referred to in the present invention). The Pre-HUSPM-
Figure SMS_743
algorithm has a considerably shorter running time than HUSPM and Pre-HUSPM-TSU. In particular, Pre-HUSPM-
Figure SMS_736
runs faster on larger non-dense data sets and performs better in terms of running time.
Experiments with different values of
Figure SMS_745
show that if
Figure SMS_746
is set too small, running speed may decrease when the database is rescanned, because too many pre-large sequence sets are generated. If
Figure SMS_748
is set too close to
Figure SMS_749
, the safety value becomes too small, so the database may have to be rescanned every time new data is added, which also slows operation. Practical applications therefore require reasonable settings of
Figure SMS_750
and
Figure SMS_751
.
Figure SMS_752
and
Figure SMS_744
both lie in the range 0 to 1, and when setting them it must be ensured that
Figure SMS_747
; the specific values are set according to user requirements.
It is to be understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art may make modifications, alterations, additions or substitutions within the spirit and scope of the present invention.

Claims (9)

1. A Pre-HUSPM-based database sequence insertion processing method, characterized in that an incremental algorithm, Pre-HUSPM, is constructed to efficiently mine high-utility sequential patterns, the method specifically comprising the following steps:
Step 1, into the original database
Figure QLYQS_1
, insert the database to be inserted
Figure QLYQS_2
;
Step 2, according to the original database
Figure QLYQS_3
, calculate the safety value
Figure QLYQS_4
;
Step 3, scan the database to be inserted
Figure QLYQS_5
, and calculate, for the database to be inserted
Figure QLYQS_6
, the total utility of each sequence
Figure QLYQS_7
and, for
Figure QLYQS_8
, the total utility
Figure QLYQS_9
;
Step 4, the sum of the total utility value of the new transactions since the original database was last rescanned and, for the database to be inserted
Figure QLYQS_10
, the maximum sequence weighted utility of a single item
Figure QLYQS_11
is compared with the safety value
Figure QLYQS_12
, and the corresponding operation is performed according to the comparison result;
Step 5, for the new database
Figure QLYQS_14
, determine, for each sequence in its large sequence weighted utility sequence set
Figure QLYQS_15
, namely
Figure QLYQS_17
, whether its utility ratio is greater than or equal to the upper utility threshold
Figure QLYQS_18
; if so, the sequence
Figure QLYQS_19
is a high-utility sequential pattern, and the sequence
Figure QLYQS_20
is added to the high-utility sequential pattern set
Figure QLYQS_21
and output; otherwise, no operation is required; finally, output the updated new database
Figure QLYQS_13
and its high-utility sequential pattern set
Figure QLYQS_16
.
2. The Pre-HUSPM-based database sequence insertion processing method according to claim 1, characterized in that, in step 1, the original database is set as
Figure QLYQS_22
, where
Figure QLYQS_23
is the total number of sequences,
Figure QLYQS_24
is the serial number of a sequence, and
Figure QLYQS_25
denotes the
Figure QLYQS_26
-th sequence,
Figure QLYQS_28
; the itemset is set as
Figure QLYQS_29
, where
Figure QLYQS_27
is the total number of items; the itemset
Figure QLYQS_30
is a collection of
Figure QLYQS_31
different items, represented as
Figure QLYQS_32
, where
Figure QLYQS_33
denotes, for the itemset
Figure QLYQS_34
, the
Figure QLYQS_35
-th item.
3. The Pre-HUSPM-based database sequence insertion processing method according to claim 2, characterized in that, in step 2, the safety value
Figure QLYQS_36
is calculated as follows:
Figure QLYQS_37
(1);
where
Figure QLYQS_38
denotes the upper utility threshold,
Figure QLYQS_39
denotes the lower utility threshold, and
Figure QLYQS_40
denotes, for the original database
Figure QLYQS_41
, the total utility; the values of
Figure QLYQS_42
and
Figure QLYQS_43
are preset;
Figure QLYQS_44
is calculated as follows:
Figure QLYQS_45
(2);
where
Figure QLYQS_46
denotes, in the original database
Figure QLYQS_47
, the total utility of the sequence
Figure QLYQS_48
, calculated as follows:
Figure QLYQS_49
(3);
where
Figure QLYQS_50
denotes, in the sequence
Figure QLYQS_51
and the itemset
Figure QLYQS_52
, the utility of the item
Figure QLYQS_53
.
4. The Pre-HUSPM-based database sequence insertion processing method according to claim 3, characterized in that, in step 3, in the same manner as in formulas (2) and (3), the following is calculated for the database to be inserted
Figure QLYQS_54
: the total utility
Figure QLYQS_55
; at the same time, when calculating
Figure QLYQS_56
, the relevant data of the database to be inserted
Figure QLYQS_57
is included.
5. The Pre-HUSPM-based database sequence insertion processing method according to claim 4, characterized in that the specific criterion in step 4 is: let
Figure QLYQS_58
be the total utility value of the new transactions since the last rescan of the original database; when
Figure QLYQS_59
holds, steps 4.1 and 4.2 are performed; when
Figure QLYQS_60
holds, step 4.3 is performed;
Step 4.1, scan the database to be inserted
Figure QLYQS_61
to generate the 1-candidate set, and set
Figure QLYQS_62
=1, where
Figure QLYQS_63
denotes the number of items in the sequence sets currently being processed;
Step 4.2, scan the 1-candidate set and update the sequence utility and the sequence weighted utility of the original information; then generate the 2-candidate set and continue updating the sequence utility and the sequence weighted utility of the original information, and so on until no further candidate set is generated; at the same time, set
Figure QLYQS_64
;
Step 4.3, when
Figure QLYQS_65
, a new database is generated, and the original database is rescanned at this point; set
Figure QLYQS_66
to 0 and assign
Figure QLYQS_67
to
Figure QLYQS_68
.
6. The Pre-HUSPM-based database sequence insertion processing method according to claim 5, characterized in that the specific process of step 4.2 is as follows:
Step 4.2.1, calculate, for the new database
Figure QLYQS_69
, the total utility
Figure QLYQS_70
. The calculation formula is as follows:
Figure QLYQS_71
(4);
for each candidate in the candidate set
Figure QLYQS_72
, calculate, in the database to be inserted
Figure QLYQS_73
, for the sequence
Figure QLYQS_74
, the sequence weighted utility
Figure QLYQS_75
and the sequence utility
Figure QLYQS_76
. The calculation formulas are as follows:
Figure QLYQS_77
(5);
Figure QLYQS_78
(6);
where
Figure QLYQS_79
represents, for the sequence
Figure QLYQS_80
, the total utility value of that row;
Figure QLYQS_81
represents, for the sequence
Figure QLYQS_82
, the utility of its subsequence
Figure QLYQS_83
, taken as the maximum utility over all of its occurrences in the sequence
Figure QLYQS_84
, and is defined as follows:
Figure QLYQS_85
(7);
where
Figure QLYQS_86
represents the maximum internal utility of an item in a sequence, i.e., the largest utility value the item takes in that sequence, and is defined as follows:
Figure QLYQS_87
(8);
where
Figure QLYQS_88
represents, in the sequence
Figure QLYQS_89
and the itemset
Figure QLYQS_90
, the internal utility of the item
Figure QLYQS_91
, and is defined as follows:
Figure QLYQS_92
(9);
where
Figure QLYQS_93
represents, in the sequence
Figure QLYQS_94
and the itemset
Figure QLYQS_95
, the quantity of the item
Figure QLYQS_96
;
Figure QLYQS_97
represents the unit profit of the item
Figure QLYQS_98
;
Step 4.2.2, for each sequence set in the large sequence weighted utility sequence set
Figure QLYQS_99
of the original database, perform sub-steps 4.2.2.1 to 4.2.2.3;
Step 4.2.3, for each sequence set in the pre-large sequence weighted utility sequence set
Figure QLYQS_100
of the original database, likewise perform sub-steps 4.2.2.1 to 4.2.2.3 of step 4.2.2;
if, for the original database
Figure QLYQS_103
, the large sequence weighted utility sequence set
Figure QLYQS_105
and, for the original database
Figure QLYQS_107
, the pre-large sequence weighted utility sequence set
Figure QLYQS_108
contain a sequence of the database to be inserted
Figure QLYQS_110
, namely
Figure QLYQS_112
, then in
Figure QLYQS_113
and
Figure QLYQS_101
the sequence utility
Figure QLYQS_104
and the sequence weighted utility
Figure QLYQS_106
of that sequence are updated, and the sequence
Figure QLYQS_109
is put into the 1-candidate set, which is used to generate the 2-candidate set; if
Figure QLYQS_111
and
Figure QLYQS_114
do not contain the sequence of the new database
Figure QLYQS_115
, namely
Figure QLYQS_116
, then
Figure QLYQS_102
is removed from the 1-candidate set;
Step 4.2.4, from the
Figure QLYQS_117
-candidate set, generate the (
Figure QLYQS_118
+ 1)-candidate set
Figure QLYQS_119
; set
Figure QLYQS_120
=
Figure QLYQS_121
+1, and repeat steps 4.2.1 through 4.2.4 until no updated large or pre-large sequence weighted utility sequence set is found.
7. The Pre-HUSPM-based database sequence insertion processing method according to claim 6, characterized in that the sub-steps of step 4.2.2 are as follows:
Sub-step 4.2.2.1, update, for the new database
Figure QLYQS_122
, the sequence weighted utility of the sequence
Figure QLYQS_123
, denoted
Figure QLYQS_124
. The calculation formula is as follows:
Figure QLYQS_125
(10);
where
Figure QLYQS_127
is, for the original database
Figure QLYQS_128
, the sequence weighted utility of the sequence
Figure QLYQS_129
;
Figure QLYQS_131
stores, for the sequence
Figure QLYQS_132
, its
Figure QLYQS_133
;
Figure QLYQS_134
is, for the database to be inserted
Figure QLYQS_126
, the sequence weighted utility of the sequence
Figure QLYQS_130
;
Sub-step 4.2.2.2, update, for the new database
Figure QLYQS_135
, the sequence utility of the sequence set
Figure QLYQS_136
, denoted
Figure QLYQS_137
, as follows:
Figure QLYQS_138
(11);
where
Figure QLYQS_140
represents, for the sequence
Figure QLYQS_141
in the original database
Figure QLYQS_143
, its sequence utility;
Figure QLYQS_144
stores, for the sequence
Figure QLYQS_145
, its
Figure QLYQS_146
;
Figure QLYQS_147
is, for the database to be inserted
Figure QLYQS_139
, the sequence utility of the sequence
Figure QLYQS_142
;
Sub-step 4.2.2.3, if
Figure QLYQS_148
, the sequence
Figure QLYQS_151
is put into
Figure QLYQS_153
, where
Figure QLYQS_154
is, for the new database
Figure QLYQS_156
, the large sequence weighted utility
Figure QLYQS_158
-sequence set; if
Figure QLYQS_159
, the sequence
Figure QLYQS_149
is put into
Figure QLYQS_150
, where
Figure QLYQS_152
is, for the new database
Figure QLYQS_155
, the pre-large sequence weighted utility
Figure QLYQS_157
-sequence set; otherwise, the sequence
Figure QLYQS_160
is discarded.
8. The Pre-HUSPM-based database sequence insertion processing method according to claim 7, wherein step 4.3 proceeds as follows:
Step 4.3.1: merge the database to be inserted with the original database to generate the new database;
Step 4.3.2: for each sequence, calculate its sequence weighted utility in the new database in the same way as in equation (5), then calculate the value given by equation (2) for the new database;
Step 4.3.3: compute the weighted utility ratio of the sequence. If the ratio satisfies the large-sequence condition, put the sequence into the large sequence weighted utility sequence set of the new database; if it satisfies the pre-large-sequence condition, put the sequence into the pre-large sequence weighted utility sequence set of the new database; otherwise, discard the sequence;
Step 4.3.4: execute the recursive mining algorithm, which builds projection databases for the sequence sets and generates further large and pre-large sequence sets until no more large or pre-large sequences are found. Mining starts from the 1-sequence set, continues with the 2-sequence set, and so on, and stops when the most recently generated sequence set is empty. The output is the large sequence weighted utility sequence set and the pre-large sequence weighted utility sequence set of the new database; both sets are retained for use in the next data insertion.
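Steps 4.3.1 to 4.3.3 amount to a rescan of the merged database. The sketch below is one possible reading, not the patent's definitive procedure. It assumes the usual definitions from the high-utility sequential pattern mining literature, which may differ from the patent's equations (2) and (5): the utility of a quantitative sequence is the sum of its item utilities, the sequence weighted utility of an item is the sum of the utilities of all sequences containing it, and the ratio is taken against the total utility of the new database. The data layout and the names `sequence_utility` and `rescan_items` are hypothetical.

```python
from typing import Dict, List, Tuple

# A quantitative sequence is a list of itemsets; each itemset maps item -> utility.
QSeq = List[Dict[str, float]]


def sequence_utility(qseq: QSeq) -> float:
    """Sum of all item utilities in one quantitative sequence."""
    return sum(u for itemset in qseq for u in itemset.values())


def rescan_items(
    original: List[QSeq],
    inserted: List[QSeq],
    upper: float,
    lower: float,
) -> Tuple[Dict[str, float], Dict[str, float], float]:
    """Merge both databases and classify single items by weighted utility ratio."""
    merged = original + inserted                      # step 4.3.1: merge
    total = sum(sequence_utility(q) for q in merged)  # assumed total utility
    swu: Dict[str, float] = {}
    for q in merged:                                  # step 4.3.2: weighted utility
        su = sequence_utility(q)
        for item in {i for itemset in q for i in itemset}:
            swu[item] = swu.get(item, 0.0) + su
    large: Dict[str, float] = {}
    pre_large: Dict[str, float] = {}
    for item, value in swu.items():                   # step 4.3.3: classify by ratio
        ratio = value / total
        if ratio >= upper:
            large[item] = value
        elif ratio >= lower:
            pre_large[item] = value
    return large, pre_large, total
```

The two returned sets would then seed the recursive mining of step 4.3.4.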
9. The Pre-HUSPM-based database sequence insertion processing method according to claim 8, wherein the recursive mining algorithm of step 4.3.4 proceeds as follows:
Step 4.3.4.1: traverse the large sequence set and the pre-large sequence set and, for each sequence belonging to either set, construct its projection database;
Step 4.3.4.2: for each extended sequence formed from the set of extension items of the current sequence, calculate its sequence weighted utility. If the extended sequence satisfies the large-sequence condition, calculate its sequence utility and put it into the large sequence set; if it satisfies the pre-large-sequence condition, calculate its sequence utility and put it into the pre-large sequence set; otherwise, do nothing;
Step 4.3.4.3: pass in the current parameters and call the mining procedure recursively, stopping when both the large sequence set and the pre-large sequence set produced at the current level are empty. These two sets are, respectively, the large sequence weighted utility sequence set and the pre-large sequence weighted utility sequence set of the new database for sequences one item longer than those at the current level.
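The level-wise recursion of steps 4.3.4.1 to 4.3.4.3 can be sketched in a deliberately simplified form. The assumptions are considerable and are not taken from the claim text: every element of a sequence is a single item, only appending an item after the pattern is considered as an extension, the projection is the suffix after the earliest match, the weighted utility of a pattern is the sum of the utilities of the sequences containing it, and the per-pattern utility calculation is omitted. All function names are hypothetical.

```python
from typing import Dict, List, Optional, Tuple

Pattern = Tuple[str, ...]
Database = List[Tuple[List[str], float]]  # (items in order, sequence utility)


def project(items: List[str], pattern: Pattern) -> Optional[List[str]]:
    """Suffix after the earliest occurrence of pattern, or None if absent."""
    pos = 0
    for p in pattern:
        try:
            pos = items.index(p, pos) + 1
        except ValueError:
            return None
    return items[pos:]


def mine(db: Database, level: Dict[Pattern, float], total: float,
         upper: float, lower: float,
         results: Dict[str, Dict[Pattern, float]]) -> None:
    """Extend every pattern of the current level by one item and recurse."""
    next_large: Dict[Pattern, float] = {}
    next_pre: Dict[Pattern, float] = {}
    for pattern in level:
        swu: Dict[Pattern, float] = {}
        for items, su in db:
            suffix = project(items, pattern)  # projection database entry
            if suffix is None:
                continue
            for item in set(suffix):          # candidate extension items
                ext = pattern + (item,)
                swu[ext] = swu.get(ext, 0.0) + su
        for ext, value in swu.items():
            ratio = value / total
            if ratio >= upper:
                next_large[ext] = value
            elif ratio >= lower:
                next_pre[ext] = value
    if not next_large and not next_pre:       # both sets empty: stop
        return
    results["large"].update(next_large)
    results["pre_large"].update(next_pre)
    mine(db, {**next_large, **next_pre}, total, upper, lower, results)


def mine_all(db: Database, upper: float, lower: float) -> Dict[str, Dict[Pattern, float]]:
    """Start from the empty pattern so that the first level is the 1-sequences."""
    total = sum(su for _, su in db)
    results: Dict[str, Dict[Pattern, float]] = {"large": {}, "pre_large": {}}
    mine(db, {(): total}, total, upper, lower, results)
    return results
```

Because each recursive call works on patterns one item longer than the previous level and the sequences are finite, the recursion ends once no extension reaches the lower threshold.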
CN202310250759.4A 2023-03-16 2023-03-16 Pre-HUSPM-based database sequence insertion processing method Active CN115964415B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202310250759.4A CN115964415B (en) 2023-03-16 2023-03-16 Pre-HUSPM-based database sequence insertion processing method

Publications (2)

Publication Number Publication Date
CN115964415A (en) 2023-04-14
CN115964415B (en) 2023-05-26

Family

ID=85894768

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202310250759.4A Active CN115964415B (en) 2023-03-16 2023-03-16 Pre-HUSPM-based database sequence insertion processing method

Country Status (1)

Country Link
CN (1) CN115964415B (en)

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20030217055A1 (en) * 2002-05-20 2003-11-20 Chang-Huang Lee Efficient incremental method for data mining of a database
CN105590237A (en) * 2015-12-18 2016-05-18 齐鲁工业大学 Application of high utility sequential pattern with negative-profit items in electronic commerce business decision making
CN106777182A (en) * 2016-12-23 2017-05-31 陕西理工学院 A kind of data flow effective item set mining algorithm for reducing candidate
CN108733705A (en) * 2017-04-20 2018-11-02 哈尔滨工业大学深圳研究生院 A kind of effective sequential mode mining method and device
CN109101530A (en) * 2018-06-22 2018-12-28 哈尔滨工业大学(深圳) Effective sequence of events pattern mining algorithm
CN109408563A (en) * 2018-11-07 2019-03-01 哈尔滨工业大学(深圳) High average utility item set mining method, apparatus and computer equipment
CN111475551A (en) * 2020-06-15 2020-07-31 河北工业大学 High average utility sequence pattern mining method under non-overlapping condition
CN111930803A (en) * 2020-08-07 2020-11-13 河北工业大学 Non-overlapping self-adaptive frequent sequence pattern mining method
CN112434031A (en) * 2020-11-16 2021-03-02 宁波财经学院 Uncertain high-utility mode mining method based on information entropy
US20220058716A1 (en) * 2020-08-18 2022-02-24 Qilu University Of Technology Commodity recommendation system based on actionable high utility negative sequential rules mining and its working method
CN114971794A (en) * 2022-05-27 2022-08-30 齐鲁工业大学 Time period-based high-utility sequence mode analysis method and system in group purchase

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
Mu Huanhuan; Chai Yumei; Wang Liming: "A high-utility itemset mining algorithm for data streams", Computer Applications and Software *

Also Published As

Publication number Publication date
CN115964415B (en) 2023-05-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant