CN115964415A - Pre-HUSPM-based database sequence insertion processing method - Google Patents
- Publication number
- CN115964415A (application CN202310250759.4A)
- Authority
- CN
- China
- Prior art keywords
- sequence
- database
- utility
- weighted
- original
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02D—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN INFORMATION AND COMMUNICATION TECHNOLOGIES [ICT], I.E. INFORMATION AND COMMUNICATION TECHNOLOGIES AIMING AT THE REDUCTION OF THEIR OWN ENERGY USE
- Y02D10/00—Energy efficient computing, e.g. low power processors, power management or thermal management
Landscapes
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a database sequence insertion processing method based on Pre-HUSPM, which belongs to the field of data mining and comprises the following steps: inserting a database to be inserted into an original database; calculating a safety value from the information of the original database; scanning the database to be inserted, and calculating the total utility of each of its sequences and the total utility of the database to be inserted as a whole; comparing the safety value with the sum of the total utility of the new transactions inserted since the original database was last rescanned and the maximum sequence-weighted utility of a single item in the database to be inserted, and performing the corresponding operation according to the comparison result; comparing the utility ratio of each sequence in the large sequence-weighted utility sequence sets of the new database with the upper utility threshold; and finally outputting the updated new database and its set of high-utility sequential patterns. The invention reduces the number of database rescans and lowers the maintenance cost.
Description
Technical Field
The invention belongs to the field of data mining, and particularly relates to a database sequence insertion processing method based on Pre-HUSPM.
Background
High Utility Sequential Pattern Mining (HUSPM) algorithms can be used to analyze users' shopping habits, taking into account the weight, unit profit and similar attributes of each item. When the utility of a sequence is greater than the minimum utility threshold set by the user, the sequence is a high-utility sequential pattern. HUSPM algorithms generally run on a static database, but in practical applications new data is added almost every day, which may invalidate previously discovered high-utility sequential patterns or introduce new ones once the database is updated. In conventional dynamic data mining, the original database therefore has to be rescanned every time even a small amount of data arrives, and rescanning the original database consumes a great deal of resources and time. When only a small amount of data is inserted, the database as a whole is essentially unaffected, so updating it wastes resources and raises the maintenance cost. Efficient maintenance and updating of the mined high-utility sequential patterns has therefore become important.
Disclosure of Invention
To solve the above problems, the invention provides a database sequence insertion processing method based on Pre-HUSPM. It combines the Pre-large concept with the projection-based mining algorithm P-HUSPM to construct an incremental algorithm, Pre-HUSPM, which mines high-utility sequential patterns efficiently and reduces the number of rescans of the original database.
The technical scheme of the invention is as follows:
A database sequence insertion processing method based on Pre-HUSPM constructs the incremental algorithm Pre-HUSPM to mine high-utility sequential patterns efficiently, and specifically comprises the following steps:
Step 4: compare the safety value with the sum of the total utility of the new transactions inserted since the original database was last rescanned and the maximum sequence-weighted utility of a single item in the database to be inserted, and perform the corresponding operation according to the comparison result;
Further, in step 1, the original database is defined as a set of sequences with a given total number of sequences; each sequence carries a sequence number and is referred to by that number. The item set contains all items, up to the total number of items. An itemset is a set of distinct items, and each of its items is referred to by its position within the itemset.
In the safety value calculation, the two preset parameters are the upper utility threshold and the lower utility threshold, and the remaining term is the total utility of the original database; the values of both thresholds are preset;
Further, in step 3, the total utility of the database to be inserted is calculated in the same way as in formulas (2) and (3); at the same time, the maximum sequence-weighted utility of a single item is calculated from the data of the database to be inserted.
Further, the specific judgment criterion in step 4 is as follows: when the sum of the total utility of the new transactions since the last rescan of the original database and the maximum sequence-weighted utility of a single item is within the safety value, steps 4.1 and 4.2 are carried out; when the sum exceeds the safety value, step 4.3 is carried out;
Step 4.1: scan the database to be inserted to generate the 1-candidate set, and set the item counter to 1, where the item counter is the number of items being processed in the sequence set;
Step 4.2: scan the 1-candidate set and update the sequence utility and the sequence-weighted utility stored for the original database; then generate the 2-candidate set and the following candidate sets in turn, updating the stored sequence utility and sequence-weighted utility each time, until no further candidate set is generated; at the same time, update the running total utility of the new transactions;
Step 4.3: when the sum exceeds the safety value, generate a new database; the original database must be rescanned at this point. The running total utility of the new transactions is reset to 0, and the values stored for the original database are replaced by those of the new database.
Further, the specific process of step 4.2 is as follows:
Step 4.2.1: calculate the total utility of the new database as the sum of the total utility of the original database and the total utility of the database to be inserted.
For each candidate in the current candidate set, calculate its sequence-weighted utility and its sequence utility in the database to be inserted. The sequence-weighted utility of a candidate is the sum of the total utilities of the sequences that contain it; its sequence utility is the sum of its utilities in those sequences.
Here the total utility of a sequence is the total utility value of that row, and the utility of a subsequence in a sequence is the maximum utility over all of its occurrences in that sequence.
The maximum internal utility of an item in a sequence is the maximum utility value that the item takes in the sequence.
The internal utility of an item in an itemset of a sequence is the quantity of that item in the itemset multiplied by the unit profit of the item;
Step 4.2.2: for each sequence in the large sequence-weighted utility sequence sets of the original database, perform substeps 4.2.2.1 to 4.2.2.3;
Step 4.2.3: for each sequence in the pre-large sequence-weighted utility sequence sets of the original database, likewise perform substeps 4.2.2.1 to 4.2.2.3 of step 4.2.2;
If the large or pre-large sequence-weighted utility sequence sets of the original database contain a sequence of the database to be inserted, the sequence utility and the sequence-weighted utility of that sequence are updated, and the sequence is put into the 1-candidate set, which is used to generate the 2-candidate set; if neither set contains the sequence, it is removed from the 1-candidate set;
Step 4.2.4: from the current candidate set, generate the next candidate set, whose candidates are one item longer; increase the item counter by 1, and repeat steps 4.2.1 to 4.2.4 until no updated large or pre-large sequence-weighted utility sequence set is found.
Further, the substeps of step 4.2.2 are as follows:
Substep 4.2.2.1: update the sequence-weighted utility of the sequence in the new database. It is the sum of two terms: the sequence-weighted utility of the sequence in the original database, which is stored with the sequence, and the sequence-weighted utility of the sequence in the database to be inserted;
Substep 4.2.2.2: update the sequence utility of the sequence in the new database in the same way, as the sum of the sequence utility of the sequence in the original database, which is stored with the sequence, and the sequence utility of the sequence in the database to be inserted;
Substep 4.2.2.3: if the ratio of the updated sequence-weighted utility to the total utility of the new database reaches the upper utility threshold, put the sequence into the large sequence-weighted utility sequence set of the new database for the current number of items; if the ratio lies between the lower and the upper utility threshold, put the sequence into the pre-large sequence-weighted utility sequence set of the new database for the current number of items; otherwise, discard the sequence.
Further, the specific process of step 4.3 is as follows:
Step 4.3.1: merge the database to be inserted and the original database to generate a new database;
Step 4.3.2: for each sequence, calculate its sequence-weighted utility in the new database in the same way as in equation (5), and then calculate the total utility of the new database in the same way as in equation (2);
Step 4.3.3: compute the sequence-weighted utility ratio of each sequence. If the ratio reaches the upper utility threshold, put the sequence into the large sequence-weighted utility 1-sequence set of the new database; if the ratio lies between the lower and the upper utility threshold, put the sequence into the pre-large sequence-weighted utility 1-sequence set of the new database; otherwise, discard the sequence;
Step 4.3.4: execute the recursive mining algorithm, which builds projection databases for the sequence sets and produces further large and pre-large sequence sets until no more are found. Mining starts from the 1-sequence sets, continues with the 2-sequence sets, and so on until the latest sequence set is empty. Mining then stops, and the large and pre-large sequence-weighted utility sequence sets of the new database are output; both are kept for use at the next data insertion.
Further, in step 4.3.4, the specific process of the recursive mining algorithm is as follows:
Step 4.3.4.1: traverse the large and pre-large sequence sets, and construct the projection database of each sequence that belongs to them;
Step 4.3.4.2: calculate the sequence-weighted utility of each extension of the sequence, that is, of the sequence joined with an item from its set of extension items. If the sequence-weighted utility ratio of the extension reaches the upper utility threshold, calculate its sequence utility and put it into the large sequence set for sequences one item longer; if the ratio lies between the lower and the upper utility threshold, calculate its sequence utility and put it into the pre-large sequence set for sequences one item longer; otherwise, no processing is carried out;
Step 4.3.4.3: pass in the current parameters and call the mining procedure recursively until both sets are empty, then stop. The two sets are the large and the pre-large sequence-weighted utility sequence sets of the new database for sequences one item longer than the current ones.
The invention brings the following beneficial technical effects.
A new sequential pattern mining algorithm, Pre-HUSPM, is provided to handle sequence insertion. When a small amount of data is inserted, the whole database does not need to be updated, so resources are not wasted.
The matrix-projection-based high-utility sequential pattern mining algorithm (P-HUSPM) reduces the number of candidate sets generated during sequence mining and thereby shortens the time needed to mine the high-utility sequence sets. Because the database no longer has to be rescanned frequently, the run time is reduced to a large extent.
Drawings
FIG. 1 is a flow chart of the database sequence insertion processing method based on Pre-HUSPM of the present invention.
FIG. 2 compares the run times of the three algorithms on the SIGN data set under different lower utility thresholds, with the upper utility threshold fixed at 15%.
FIG. 3 compares the run times of the three algorithms on the LEVIATHAN data set under different lower utility thresholds, with the upper utility threshold fixed at 18%.
FIG. 4 compares the run times of the three algorithms on the FIFA data set under different lower utility thresholds, with the upper utility threshold fixed at 21%.
FIG. 5 compares the run times of the three algorithms on the BIBLE data set under different lower utility thresholds, with the upper utility threshold fixed at 16%.
FIG. 6 compares the run times of the three algorithms on the Kosarak10k data set under different lower utility thresholds, with the upper utility threshold fixed at 14%.
Detailed Description
The invention is described in further detail below with reference to the following figures and detailed description:
The database referred to in the invention is a sequence database, which contains large sequences, pre-large sequences and small sequences. A sequence whose support is greater than the upper support threshold is a large sequence; a sequence whose support is smaller than the upper support threshold and greater than the lower support threshold is a pre-large sequence; and a sequence whose support is smaller than the lower support threshold is a small sequence. Pre-large sequences are the ones likely to become large sequences in the future.
The invention combines the Pre-large concept with the projection-based mining algorithm P-HUSPM to obtain the Pre-HUSPM algorithm. Its main idea is to use a threshold, the safety value, as the condition for deciding whether the database needs to be rescanned, so that the database sequences are maintained and updated efficiently and the number of database rescans is reduced. The comparison also uses the maximum sequence-weighted utility of a single item in the database to be inserted.
Nine cases may occur when a new sequence database is added to an original sequence database: case 1 is the insertion of a large sequence of a new sequence database into a large sequence of the original sequence database; case 2 is the insertion of a pre-large sequence of a new sequence database into a large sequence of the original sequence database; case 3 is the insertion of a small sequence of the new sequence database into a large sequence of the original sequence database; case 4 is insertion of a large sequence of the new sequence database into a pre-large sequence of the original sequence database; case 5 is inserting the pre-large sequence of the new sequence database into the pre-large sequence of the original sequence database; case 6 is inserting a small sequence of the new sequence database into a pre-large sequence of the original sequence database; case 7 is insertion of a large sequence of the new sequence database into a small sequence of the original sequence database; case 8 is inserting a pre-large sequence of the new sequence database into a small sequence of the original sequence database; case 9 is the insertion of a small sequence of the new sequence database into a small sequence of the original sequence database.
Cases 1, 5, 6, 8 and 9 do not affect the final set of large sequences, because the classification is based on a count-weighted average. Cases 2 and 3 may remove some existing large sequences, while cases 4 and 7 may add new large sequences. Cases 2, 3 and 4 can be handled well when both the large sequence sets and the pre-large sequence sets are retained.
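The nine cases follow a regular pattern, which the following minimal Python sketch makes explicit. The function names `case_number` and `effect` and the string labels are illustrative assumptions, not part of the patent; the numbering and the effects are taken from the two paragraphs above.

```python
from itertools import product

# Order used by the case numbering in the text: large, pre-large, small.
CLASSES = ("large", "pre-large", "small")

def case_number(original_class, inserted_class):
    """Case 1..9 for a sequence with the given class in the original
    database and in the newly inserted database."""
    return 3 * CLASSES.index(original_class) + CLASSES.index(inserted_class) + 1

def effect(case):
    """Effect on the final large sequence set, as described in the text."""
    if case in (2, 3):
        return "may remove an existing large sequence"
    if case in (4, 7):
        return "may add a new large sequence"
    return "no change to the final large sequence set"

# Enumerate all (original, inserted) combinations.
table = {(o, i): case_number(o, i) for o, i in product(CLASSES, repeat=2)}
for (orig_cls, new_cls), case in sorted(table.items(), key=lambda kv: kv[1]):
    print(f"case {case}: original {orig_cls}, inserted {new_cls}: {effect(case)}")
```

Case 7 (small in the original database, large in the inserted one) is the only case that cannot be resolved from the retained large and pre-large sets alone, which is why the method treats it separately.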
Case 7 is the main focus of the invention. When case 7 occurs and the amount of inserted data is small, the database essentially does not need to be updated; the prior art nevertheless updates the database, which wastes resources.
To address this problem, the invention provides a database sequence insertion processing method based on Pre-HUSPM, which relies on the following theorem and its proof.
Theorem: let the lower utility threshold and the upper utility threshold be given, together with the total utility of the original database and the maximum sequence-weighted utility of a single item in the database to be inserted. If the sum of the total utility of the newly inserted transactions and this maximum is within the safety value, then a sequence in case 7 cannot become a high sequence-weighted utility sequence in the updated database as a whole.
Proof: for a sequence in case 7, the sequence is small in the original database, so its sequence-weighted utility in the original database is very small and stays below the level required by the lower utility threshold.
The sequence is large in the database to be inserted, so its sequence-weighted utility there is at least the level required of a large sequence, but at most the total utility of the database to be inserted.
The sequence-weighted utility of the sequence in the new database is the sum of its sequence-weighted utility in the original database and its sequence-weighted utility in the database to be inserted. Therefore, when the sum of the total utility of the new transactions and the maximum sequence-weighted utility of a single item is less than the safety value, there is no need to rescan the original database.
According to this theorem, the sequence in case 7 can be efficiently handled.
A database sequence insertion processing method based on Pre-HUSPM specifically comprises the following steps:
In the embodiment of the invention, the original database is a transaction database, and the database to be inserted is a database of new transactions.
Both the original transaction database and the new transaction database contain a group of sequences. The original database is a set of sequences with a given total number of sequences; each sequence has a sequence number and a unique identifier. The item set contains all items, up to the total number of items. An itemset is a set of distinct items, and each of its items is referred to by its position within the itemset.
The original transaction database contains five sequences over five items, each sequence being given as a list of itemsets. The profits of the five items are 3, 2, 4, 2 and 1 respectively; they are stored in the database as a table, the item profit table.
The database to be inserted contains two sequences, each likewise given as a list of itemsets.
In the safety value calculation, the two preset parameters are the upper utility threshold and the lower utility threshold, and the remaining term is the total utility of the original database; the values of both thresholds are preset.
In the embodiment of the invention, the upper utility threshold is preset to 0.35 (the upper utility threshold is the same as the high-utility sequential pattern threshold) and the lower utility threshold is preset to 0.25. The calculated total utilities of the five sequences are 36, 26, 28, 23 and 28; the total utility of the original database is 141; and the safety value is 21.
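The arithmetic of this embodiment can be checked with a short Python sketch. The sum of the five sequence totals is taken directly from the text. The safety value formula is lost in this copy of the document, so the function below is an assumption: it uses the form common in the pre-large literature, (upper - lower) × total / (1 - upper), rounded down, which happens to reproduce the stated value 21. The name `safety_value` is hypothetical.

```python
import math

# Values stated in the embodiment.
sequence_totals = [36, 26, 28, 23, 28]
upper_threshold = 0.35
lower_threshold = 0.25

# Total utility of the original database: the sum of the sequence totals.
total_original = sum(sequence_totals)

def safety_value(total_utility, upper, lower):
    # ASSUMPTION: pre-large style bound, floored to an integer.
    # The patent's own formula (1) is not recoverable from this text.
    return math.floor((upper - lower) * total_utility / (1 - upper))

print(total_original)                                   # 141
print(safety_value(total_original, upper_threshold, lower_threshold))
```

Under that assumption the unrounded bound is about 21.69, so flooring gives the 21 reported in the embodiment; a different rounding rule or formula in the original cannot be ruled out.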
The total utility of the database to be inserted is calculated in the same way as in formulas (2) and (3); at the same time, the maximum sequence-weighted utility of a single item is calculated from the data of the database to be inserted;
Step 4: compare the safety value with the sum of the total utility of the new transactions since the original database was last rescanned and the maximum sequence-weighted utility of a single item in the database to be inserted, and perform the corresponding operation according to the comparison result. The specific judgment criterion is as follows: when the sum is within the safety value, steps 4.1 and 4.2 are carried out; when the sum exceeds the safety value, step 4.3 is carried out.
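The branch taken in step 4 can be sketched as a small Python function. This is a sketch under stated assumptions: the exact comparison operator is lost in this copy (the code uses "less than or equal"), the running total is assumed to already include the batch being inserted, and the names `choose_branch`, `new_utility_since_rescan` and `max_item_swu` are hypothetical.

```python
def choose_branch(new_utility_since_rescan, max_item_swu, safety_value):
    """Decide between incremental maintenance (steps 4.1 and 4.2) and a
    full rescan (step 4.3).

    new_utility_since_rescan: total utility of all transactions inserted
        since the original database was last rescanned (ASSUMED to include
        the current batch).
    max_item_swu: maximum sequence-weighted utility of a single item in
        the database to be inserted.
    Returns the branch name and the running total to carry forward.
    """
    if new_utility_since_rescan + max_item_swu <= safety_value:
        # Within the safety value: keep accumulating, no rescan needed.
        return "incremental", new_utility_since_rescan
    # Safety value exceeded: rescan and reset the running total to 0.
    return "rescan", 0

print(choose_branch(10, 8, 21))   # ('incremental', 10)
print(choose_branch(17, 8, 21))   # ('rescan', 0)
```

The numbers in the two calls are illustrative only; they are not the values of the embodiment.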
Step 4.1: scan the database to be inserted to generate the 1-candidate set, and set the item counter to 1, where the item counter is the number of items being processed in the sequence set.
Step 4.2: scan the 1-candidate set and update the sequence utility and the sequence-weighted utility stored for the original database; then generate the 2-candidate set and the following candidate sets in turn, updating the stored sequence utility and sequence-weighted utility each time, until no further candidate set is generated. At the same time, update the running total utility of the new transactions. The specific process is as follows:
Step 4.2.1: calculate the total utility of the new database as the sum of the total utility of the original database and the total utility of the database to be inserted.
For each candidate in the current candidate set, calculate its sequence-weighted utility and its sequence utility in the database to be inserted. The sequence-weighted utility of a candidate is the sum of the total utilities of the sequences that contain it; its sequence utility is the sum of its utilities in those sequences.
Here the total utility of a sequence is the total utility value of that row, and the utility of a subsequence in a sequence is the maximum utility over all of its occurrences in that sequence.
The maximum internal utility of an item in a sequence is the maximum utility value that the item takes in the sequence.
The internal utility of an item in an itemset of a sequence is the quantity of that item in the itemset multiplied by the unit profit of the item.
For example, a sequence can be written as a list of itemsets in which each item carries its quantity. For an item that occurs in two itemsets of the sequence, the two internal utilities are 3×3=9 and 2×3=6 respectively.
A subsequence that occurs twice in the sequence has two utilities, (3×3)+(4×2)=17 and (3×2)+(4×2)=14 respectively. Its utility in the sequence is therefore the larger of the two, 17.
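The definitions above can be illustrated with a minimal Python sketch. The item names `a`, `b`, `c`, the quantities and the profit assignment are hypothetical: they were chosen only so that the two occurrences evaluate to 17 and 14 as in the example, since the original sequence is not recoverable from this text. The sketch also simplifies matching to single-item pattern elements in strictly successive itemsets.

```python
def occurrence_utilities(sequence, pattern, profit):
    """Utility of every occurrence of `pattern` in `sequence`.

    sequence: list of itemsets, each a dict mapping item -> quantity.
    pattern:  list of items; each one must match in a strictly later
              itemset than the previous one (simplification).
    profit:   dict mapping item -> unit profit.
    """
    found = []

    def extend(start, index, total):
        if index == len(pattern):
            found.append(total)
            return
        item = pattern[index]
        for pos in range(start, len(sequence)):
            quantity = sequence[pos].get(item)
            if quantity is not None:
                # Internal utility = quantity x unit profit.
                extend(pos + 1, index + 1, total + quantity * profit[item])

    extend(0, 0, 0)
    return found

def sequence_utility(sequence, pattern, profit):
    """Utility of the pattern in one sequence: maximum over occurrences."""
    utilities = occurrence_utilities(sequence, pattern, profit)
    return max(utilities) if utilities else 0

def sequence_total(sequence, profit):
    """Total utility of one sequence (one database row)."""
    return sum(q * profit[i] for itemset in sequence for i, q in itemset.items())

def sequence_weighted_utility(database, pattern, profit):
    """Sum of the total utilities of the sequences containing the pattern."""
    return sum(sequence_total(s, profit) for s in database
               if occurrence_utilities(s, pattern, profit))

# Hypothetical data reproducing the numbers of the example.
profit = {"a": 3, "b": 2, "c": 4}
seq = [{"a": 3}, {"a": 2}, {"c": 2}]
print(occurrence_utilities(seq, ["a", "c"], profit))   # [17, 14]
print(sequence_utility(seq, ["a", "c"], profit))       # 17
```

Taking the maximum rather than the sum over occurrences is what the text specifies for the utility of a subsequence within a single sequence.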
Step 4.2.2: for each sequence in the large sequence-weighted utility sequence sets of the original database, perform the following substeps:
Substep 4.2.2.1: update the sequence-weighted utility of the sequence in the new database. It is the sum of two terms: the sequence-weighted utility of the sequence in the original database, which is stored with the sequence, and the sequence-weighted utility of the sequence in the database to be inserted.
Substep 4.2.2.2: update the sequence utility of the sequence in the new database in the same way, as the sum of the sequence utility of the sequence in the original database, which is stored with the sequence, and the sequence utility of the sequence in the database to be inserted.
Substep 4.2.2.3: if the ratio of the updated sequence-weighted utility to the total utility of the new database reaches the upper utility threshold, put the sequence into the large sequence-weighted utility sequence set of the new database for the current number of items; if the ratio lies between the lower and the upper utility threshold, put the sequence into the pre-large sequence-weighted utility sequence set of the new database for the current number of items; otherwise, discard the sequence, since it is still small after the database update.
In the embodiment of the invention, the sequence-weighted utility ratio of the sequence is 52.5% > 35%, so the sequence is still put into the large sequence-weighted utility sequence set.
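Substeps 4.2.2.1 to 4.2.2.3 can be sketched in Python as follows. The dictionary layout and the names `classify` and `update_sequence` are hypothetical, and "reaches the threshold" is read as "greater than or equal", which the source does not confirm. The total utility 158 of the new database is not stated explicitly in this copy; it is inferred from the reported ratios (83/158 ≈ 52.5%). The split of 83 into 74 + 9 is purely illustrative.

```python
def classify(swu, total_utility, upper, lower):
    """Class of a sequence from its sequence-weighted utility ratio."""
    ratio = swu / total_utility
    if ratio >= upper:
        return "large"
    if ratio >= lower:
        return "pre-large"
    return "small"

def update_sequence(stored, inserted, total_new, upper, lower):
    """Add the utilities found in the inserted database to the stored
    ones (substeps 4.2.2.1 and 4.2.2.2), then reclassify (4.2.2.3)."""
    swu = stored["swu"] + inserted["swu"]
    su = stored["su"] + inserted["su"]
    return {"swu": swu, "su": su,
            "class": classify(swu, total_new, upper, lower)}

# Illustrative values; only the sum 83 and the ratio come from the text.
result = update_sequence({"swu": 74, "su": 20}, {"swu": 9, "su": 2},
                         total_new=158, upper=0.35, lower=0.25)
print(result["class"], round(100 * result["swu"] / 158, 1))   # large 52.5
```

Because only stored values and the small inserted database are touched, this update never reads the original database, which is the point of the incremental branch.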
Step 4.2.3: for each sequence in the pre-large sequence-weighted utility sequence sets of the original database, likewise perform substeps 4.2.2.1 to 4.2.2.3 of step 4.2.2.
If the large or pre-large sequence-weighted utility sequence sets of the original database contain a sequence of the database to be inserted, the sequence utility and the sequence-weighted utility of that sequence are updated, and the sequence is put into the 1-candidate set, which is used to generate the 2-candidate set; if neither set contains the sequence, it is removed from the 1-candidate set.
For example, in the embodiment of the invention, an item that appears in the large or pre-large sets is added to the 1-candidate set, and an item that does not is removed. The 2-candidate set is then generated from the 1-candidate set, and the sequence-weighted utility and sequence utility of its candidates are mined from the database to be inserted; a candidate that does not occur there has the value 0. The process continues in the same way until the candidate set is empty.
Step 4.2.4: from the current candidate set, generate the next candidate set, whose candidates are one item longer; increase the item counter by 1, and repeat steps 4.2.1 to 4.2.4 until no updated large or pre-large sequence-weighted utility sequence set is found.
Step 4.3: when the sum exceeds the safety value, a new database is generated, and the original database needs to be rescanned. The running total utility of the new transactions is reset to 0, and the values stored for the original database are replaced by those of the new database. The specific process is as follows:
Step 4.3.1: merge the database to be inserted and the original database D to generate a new database U;
Step 4.3.2: for each sequence, calculate its sequence-weighted utility in the new database in the same way as in equation (5), and then calculate the total utility of the new database in the same way as in equation (2);
Step 4.3.3: compute the sequence-weighted utility ratio of each sequence. If the ratio reaches the upper utility threshold, put the sequence into the large sequence-weighted utility 1-sequence set; if the ratio lies between the lower and the upper utility threshold, put the sequence into the pre-large sequence-weighted utility 1-sequence set; otherwise, discard the sequence, because it is still small after the database update.
Step 4.3.4: execute the recursive mining algorithm, which builds projection databases for the sequence sets and produces further large and pre-large sequence sets until no more are found. Mining starts from the 1-sequence sets, continues with the 2-sequence sets, and so on until the latest sequence set is empty. Mining then stops, and the large and pre-large sequence-weighted utility sequence sets of the new database are output; both are kept for use at the next data insertion.
The specific process is as follows:
Step 4.3.4.1: traverse the large and pre-large sequence sets, and construct the projection database of each sequence that belongs to them. This reduces the number of candidate sets and increases the running speed. The projection database is constructed by finding every sequence that has the given item as a prefix; a sequence that does not contain the item is not retained.
Definition: consider two sequences, one of which is a subsequence of the other. A subsequence of a sequence is called the projection of that sequence with respect to a prefix if (1) it has the prefix, and (2) no proper supersequence of it that is also a subsequence of the sequence has the prefix. Accordingly, the projection database of a sequence in the new database is the set of the projections, with respect to that sequence, of every sequence in the database.
For example, according to the above definition, the projection database of a 1-sequence is built by finding every sequence prefixed with its item; a sequence that does not contain the item is not retained and therefore has no projection. In the embodiment, the projection database built in this way contains only four sequences, whose total utilities are 36, 9, 9 and 22 respectively.
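A heavily simplified Python sketch of the projection step for a single-item prefix is shown below. It is an illustration under assumptions, not the patent's matrix-based construction: it keeps only the itemsets after the first occurrence of the prefix item and drops any remaining items of the itemset in which the prefix matched. The names `project` and `projection_database` and the example database are hypothetical.

```python
def project(sequence, item):
    """Suffix of `sequence` after the first itemset containing `item`,
    or None if the item does not occur (the sequence is not retained)."""
    for pos, itemset in enumerate(sequence):
        if item in itemset:
            return sequence[pos + 1:]
    return None

def projection_database(database, item):
    """Projection database of the 1-sequence <item>.

    database: dict mapping sequence id -> list of itemsets
              (each itemset is a dict item -> quantity).
    """
    projected = {}
    for sid, sequence in database.items():
        suffix = project(sequence, item)
        if suffix is not None:
            projected[sid] = suffix
    return projected

# Hypothetical three-sequence database.
db = {
    "s1": [{"a": 1}, {"b": 2}],
    "s2": [{"c": 1}],                      # no "a": dropped
    "s3": [{"b": 1}, {"a": 2}, {"c": 3}],
}
print(projection_database(db, "a"))
# {'s1': [{'b': 2}], 's3': [{'c': 3}]}
```

Later recursion levels only need to scan these shorter suffixes, which is why projection reduces the number of candidates that have to be examined.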
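Step 4.3.4.2: calculate the sequence-weighted utility of each extension of the sequence, that is, of the sequence joined with an item from its set of extension items. If the sequence-weighted utility ratio of the extension reaches the upper utility threshold, calculate its sequence utility and put it into the large sequence set for sequences one item longer; if the ratio lies between the lower and the upper utility threshold, calculate its sequence utility and put it into the pre-large sequence set for sequences one item longer; otherwise, no processing is performed.
Step 4.3.4.3: pass in the current parameters and call the mining procedure recursively until both sets are empty, then stop.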
The pseudo-code of the recursive mining algorithm is as follows:
[Listing not recoverable from the source; only the terminators at line 3 (end for), line 12 (end if) and line 13 (end for) survive.]
In this embodiment, the utility ratio of the sequence is 35.4%, which is greater than the upper utility threshold of 35%, so it is a high-utility sequence and is added to the high-utility sequence pattern set.
The large sequence-weighted utility sequence set comprises four sequences: the first has a sequence-weighted utility of 83 and a sequence utility of 22; the second, 95 and 56; the third, 77 and 20; the fourth, 77 and 16.
The pre-large sequence-weighted utility sequence set comprises five sequences: the first has a sequence-weighted utility of 53 and a sequence utility of 18; the second, 46 and 17; the third, 43 and 17; the fourth, 52 and 32; the fifth, 54 and 38.
The high-utility sequence pattern set of the updated new database comprises only one sequence, whose sequence-weighted utility is 95, sequence utility is 56, and utility ratio is 35.4%.
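The figures in this embodiment can be checked with a short sketch. Several values are assumptions rather than statements from the text: the labels `L1`..`L4` and `P1`..`P5` are hypothetical stand-ins for the sequence names, the total utility of 158 is inferred as the only integer for which 56 rounds to 35.4%, and the lower threshold of 0.25 is a hypothetical value chosen so that all five pre-large entries qualify.

```python
from typing import Dict, Tuple

TOTAL_UTILITY = 158   # inferred from 56 / 35.4%, not stated in the text
UPPER = 0.35          # upper utility threshold used in the embodiment
LOWER = 0.25          # hypothetical lower utility threshold

# label -> (sequence-weighted utility, sequence utility); labels are hypothetical
LARGE: Dict[str, Tuple[int, int]] = {
    "L1": (83, 22), "L2": (95, 56), "L3": (77, 20), "L4": (77, 16),
}
PRE_LARGE: Dict[str, Tuple[int, int]] = {
    "P1": (53, 18), "P2": (46, 17), "P3": (43, 17), "P4": (52, 32), "P5": (54, 38),
}


def classify(swu: float, total: float, upper: float, lower: float) -> str:
    """Three-way split on the sequence-weighted utility ratio."""
    ratio = swu / total
    if ratio >= upper:
        return "large"
    if ratio >= lower:
        return "pre-large"
    return "discard"


def high_utility(large: Dict[str, Tuple[int, int]], total: float, upper: float) -> Dict[str, int]:
    """Keep the large sequences whose actual utility ratio reaches the upper threshold."""
    return {name: su for name, (_, su) in large.items() if su / total >= upper}


patterns = high_utility(LARGE, TOTAL_UTILITY, UPPER)
```

Under these assumptions only the entry with sequence utility 56 passes the final check, which matches the single pattern reported for the updated database.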
In the present invention, the pseudo code of the Pre-HUSPM algorithm is as follows:
Input: the profit table of the items; the original database; the upper utility threshold (the same as the minimum high-utility sequence threshold); the lower utility threshold; the set of large sequence-weighted utility sequences and the set of pre-large sequence-weighted utility sequences of the original database, together with their sequence-weighted utility values and the actual utility values found in the original database; the safe transaction utility buffer, which holds the total utility of the sequences processed since the original database was last rescanned; and the database to be inserted.
Output: the set of high-utility sequence patterns of the new database.
4: end for;
14: end for;
Invoking a utility summation algorithm;
17: end for;
calling a utility summation algorithm;
20: end for;
23: end while;
24: else;
28: end for;
36: end if;
37: end for;
38 if it isIs not at> andIn the middle, will->From the new database->Is removed as a new database->;
40: end if;
44: end if;
45: end for;
48: else;
50: end if;
The pseudo code of the utility summation algorithm used in the pseudo code is as follows:
7: end if;
To demonstrate the superiority and feasibility of the algorithm of the invention, a comparative experiment was carried out. The Pre-HUSPM algorithm provided by the invention is compared with the P-HUSPM algorithm and the Pre-HUSPM-TSU algorithm. The experiment used six real data sets of different scales and characteristics, named SIGN, LEVIATHAN, FIFA, BIBLE, Kosarak10k and BMS, all from the SPMF website. SIGN is a dense data set containing many very long sequences; LEVIATHAN and FIFA are medium-density data sets containing many long sequences; BIBLE is a medium-density data set containing many sequences of medium length; BMS and Kosarak10k are sparse data sets with only a few long sequences. All data sets satisfy the Gaussian distribution. In the experiment, each data set was divided into one original data set and 100 new data sets. The characteristic attributes of the data sets are as follows:
Data set | Number of sequences | Number of different items | Average sequence length | Maximum sequence length | Sequences in original database | Sequences in database to be inserted |
---|---|---|---|---|---|---|
SIGN | 730 | 267 | 52 | 94 | 230 | 5 |
LEVIATHAN | 5834 | 9025 | 33.8 | 100 | 2834 | 30 |
FIFA | 20450 | 2990 | 36.2 | 100 | 10450 | 100 |
BIBLE | 36369 | 13905 | 21.6 | 100 | 21369 | 150 |
Kosarak10k | 10000 | 10094 | 8.1 | 608 | 1000 | 90 |
BMS | 59601 | 497 | 2.5 | 267 | 39601 | 200 |
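The reported sizes are consistent with the stated split into one original data set and 100 new data sets. The sketch below checks that arithmetic; reading the last figure for each data set as the size of each of the 100 inserted batches is an assumption, and the function name is hypothetical.

```python
from typing import Dict, Tuple

# data set -> (total sequences, sequences in original database, sequences per inserted batch)
DATASETS: Dict[str, Tuple[int, int, int]] = {
    "SIGN": (730, 230, 5),
    "LEVIATHAN": (5834, 2834, 30),
    "FIFA": (20450, 10450, 100),
    "BIBLE": (36369, 21369, 150),
    "Kosarak10k": (10000, 1000, 90),
    "BMS": (59601, 39601, 200),
}


def split_is_consistent(total: int, original: int, batch: int, batches: int = 100) -> bool:
    """Check that the original part plus all inserted batches adds up to the total."""
    return original + batches * batch == total


consistent = {name: split_is_consistent(*row) for name, row in DATASETS.items()}
```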
In the experiments, the upper utility threshold is held fixed as a controlled variable on the six data sets, and different lower utility thresholds are selected for comparison; the results are shown in Figs. 2 to 7. The experiments show that the Pre-HUSPM-TSU algorithm has a shorter running time than the HUSPM algorithm. The optimized algorithm proposed by the invention substitutes a different quantity for the one used in the Pre-HUSPM-TSU algorithm; the resulting variant is the Pre-HUSPM algorithm referred to in the present invention, and its running time is much better than that of both HUSPM and Pre-HUSPM-TSU. It runs faster on larger, non-dense data sets and performs better in terms of running time.
Experiments with different lower utility thresholds show that if the lower threshold is set too small, rescanning the database may become slower because too many pre-large sequences are generated. If the lower threshold is set too close to the upper threshold, the safety value becomes too small, so the database may have to be rescanned every time new data is added, which also slows operation. In practical applications, the upper and lower utility thresholds therefore need to be set reasonably. Both thresholds lie in the range 0 to 1, the lower threshold must not exceed the upper threshold, and the specific values are set according to user requirements.
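The effect of the gap between the two thresholds on the safety value can be sketched numerically. The patent's own formula is not legible in this text, so the bound below is an assumption: it is the one that follows from the usual pre-large argument, in which a sequence whose ratio is below the lower threshold in the original database cannot reach the upper threshold as long as the inserted utility stays below (upper - lower) x TU / (1 - upper), where TU is the total utility of the original database. The function names and the rescan rule (rescan when the buffered utility plus the maximum single-item sequence-weighted utility exceeds the bound) are likewise illustrative.

```python
def safety_bound(upper: float, lower: float, total_utility_original: float) -> float:
    """Assumed safety value: (upper - lower) * TU_original / (1 - upper)."""
    if not 0 < lower < upper < 1:
        raise ValueError("thresholds must satisfy 0 < lower < upper < 1")
    return (upper - lower) * total_utility_original / (1 - upper)


def needs_rescan(buffer_utility: float, max_item_swu_inserted: float, bound: float) -> bool:
    """Assumed rule: rescan once the accumulated quantity exceeds the safety value."""
    return buffer_utility + max_item_swu_inserted > bound


# With the upper threshold fixed, a lower threshold close to it leaves little room
# before a rescan, while a much smaller one leaves a wide margin (at the cost of
# keeping many more pre-large sequences).
tight = safety_bound(0.35, 0.34, 1000.0)
loose = safety_bound(0.35, 0.20, 1000.0)
```

This is only a worst-case sketch; it says nothing about how many pre-large sequences a given lower threshold produces, which depends on the data.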
It is to be understood that the above description is not intended to limit the present invention, and the present invention is not limited to the above examples, and those skilled in the art may make modifications, alterations, additions or substitutions within the spirit and scope of the present invention.
Claims (9)
1. A Pre-HUSPM-based database sequence insertion processing method, characterized in that an incremental algorithm, Pre-HUSPM, is constructed to efficiently mine high-utility sequential patterns, the method specifically comprising the following steps:
Step 3, scanning the database to be inserted, and calculating the total utility of each sequence in the database to be inserted and the total utility of the database to be inserted;
Step 4, comparing the sum of the total utility value of the new transactions since the original database was last rescanned and the maximum sequence-weighted utility of a single item in the database to be inserted with the safety value, and performing the corresponding operation according to the comparison result;
step 5, determining, for each sequence in the large sequence-weighted utility sequence set of the new database, whether its utility ratio is greater than or equal to the upper utility threshold; if so, the sequence is a high-utility sequential pattern and is added to the high-utility sequential pattern set and output; otherwise, no operation is needed; and finally outputting the updated new database and its high-utility sequential pattern set.
2. The Pre-HUSPM-based database sequence insertion processing method according to claim 1, wherein in step 1 an original database is given, for which the total number of sequences is defined and each sequence is identified by its sequence number; a set of items is given, for which the total number of items is defined; and an itemset is a collection of different items, each of which is identified by its position within the itemset.
3. The Pre-HUSPM-based database sequence insertion processing method according to claim 2, characterized in that in step 2 the safety value is calculated by a formula in terms of the upper utility threshold, the lower utility threshold and the total utility of the original database, the values of the upper and lower utility thresholds being preset;
4. The Pre-HUSPM-based database sequence insertion processing method according to claim 3, wherein in step 3 the total utility of the database to be inserted is calculated in the same manner as formulas (2) and (3), the relevant data of the database to be inserted being included in the calculation.
5. The Pre-HUSPM-based database sequence insertion processing method according to claim 4, wherein the specific judgment criterion in step 4 is as follows: the total utility value of the new transactions since the original database was last rescanned is added to the maximum sequence-weighted utility of a single item in the database to be inserted; when this sum does not exceed the safety value, step 4.1 and step 4.2 are carried out, and when it exceeds the safety value, step 4.3 is carried out;
step 4.1, scanning the database to be inserted to generate a 1-candidate set, and setting the counter to 1, the counter representing the number of items in the sequence sets being processed;
step 4.2, scanning the 1-candidate set, updating the sequence utility and the sequence weighting utility of the original information, sequentially generating a 2-candidate set, and continuously updating the sequence utility and the sequence weighting utility of the original information until no candidate set is generated; at the same time, set up;
6. The Pre-HUSPM-based database sequence insertion processing method according to claim 5, wherein the specific process of step 4.2 is as follows:
step 4.2.1, calculating the total utility of the new database, the calculation formula being as follows:
for each candidate in the candidate set, calculating its sequence-weighted utility and its sequence utility in the sequences of the database to be inserted, the calculation formulas being as follows:
wherein one term represents the total utility value of the sequence in its row, and the other represents the utility of the subsequence in the sequence, namely the maximum utility over all of its occurrences in that sequence, defined as follows:
wherein the maximum internal utility of an item in a sequence is the maximum utility value of that item in the sequence, defined as follows:
wherein the internal utility of an item in an itemset of a sequence is defined as follows:
wherein the two factors are the quantity of the item in the sequence and the unit profit of the item;
step 4.2.2, performing substeps 4.2.2.1 to 4.2.2.3 on each large sequence-weighted utility sequence in the original database;
step 4.2.3, likewise performing substeps 4.2.2.1 to 4.2.2.3 of step 4.2.2 on each pre-large sequence-weighted utility sequence in the original database;
if the large sequence-weighted utility sequence set and the pre-large sequence-weighted utility sequence set of the original database contain a sequence of the database to be inserted, the sequence utility and the sequence-weighted utility of that sequence in these sets are updated, and the sequence is put into the 1-candidate set, which is used to generate the 2-candidate set; if the two sets do not contain the sequence, it is removed from the 1-candidate set;
7. Pre-HUSPM based database sequence insertion processing method according to claim 6, characterized in that the substeps of step 4.2.2 are as follows:
substep 4.2.2.1, updating the sequence-weighted utility of the sequence in the new database, the calculation formula being as follows:
wherein one term is the sequence-weighted utility of the sequence in the original database, which is the value stored for that sequence, and the other is the sequence-weighted utility of the sequence in the database to be inserted;
wherein one term represents the sequence utility of the sequence in the original database, which is the value stored for that sequence, and the other is the sequence utility of the sequence in the database to be inserted;
substep 4.2.2.3, if the updated sequence-weighted utility ratio is greater than or equal to the upper utility threshold, putting the sequence into the large sequence-weighted utility sequence set of the new database; if the ratio is greater than or equal to the lower utility threshold but less than the upper utility threshold, putting the sequence into the pre-large sequence-weighted utility sequence set of the new database; otherwise, discarding the sequence.
8. The Pre-HUSPM-based database sequence insertion processing method according to claim 7, wherein the specific process of step 4.3 is as follows:
step 4.3.1, merging the database to be inserted with the original database to generate the new database;
Step 4.3.2, for each sequence, calculating its sequence-weighted utility in the new database in the same way as in formula (5), and then calculating the total utility of the new database in the same way as in formula (2);
Step 4.3.3, computing the sequence-weighted utility ratio of each sequence: if the ratio is greater than or equal to the upper utility threshold, the sequence is put into the large sequence-weighted utility sequence set of the new database; if the ratio is greater than or equal to the lower utility threshold but less than the upper utility threshold, the sequence is put into the pre-large sequence-weighted utility sequence set of the new database; otherwise, the sequence is discarded;
step 4.3.4, executing the recursive mining algorithm, which generates the projection databases and the large and pre-large sequence-weighted utility sequence sets until no further such sets are found; the mining process starts from the 1-sequence sets, continues with the 2-sequence sets, and stops when the last sequence set is empty; the large sequence-weighted utility sequence set and the pre-large sequence-weighted utility sequence set of the new database are then output, and both sets are used for the next data insertion.
9. The Pre-HUSPM-based database sequence insertion processing method according to claim 8, wherein in the step 4.3.4, the specific process of the recursive mining algorithm is as follows:
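step 4.3.4.1, traversing the large and pre-large sequence-weighted utility sequence sets, and constructing the projection database of each sequence belonging to them;
Step 4.3.4.2, calculating the sequence-weighted utility of each extended sequence formed from the set of extension items of the current sequence; if its sequence-weighted utility ratio is greater than or equal to the upper utility threshold, calculating its sequence utility and putting it into the next-level large sequence-weighted utility sequence set; if the ratio lies between the lower and upper utility thresholds, calculating its sequence utility and putting it into the next-level pre-large sequence-weighted utility sequence set; otherwise, no processing is carried out;
step 4.3.4.3, passing in the current parameters and calling the mining procedure recursively until the next-level large and pre-large sets are both empty, at which point the operation stops; the next-level sets are the large and pre-large sequence-weighted utility sequence sets of the new database whose sequences contain one more item than those currently being processed.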
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202310250759.4A CN115964415B (en) | 2023-03-16 | 2023-03-16 | Pre-HUSPM-based database sequence insertion processing method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN115964415A true CN115964415A (en) | 2023-04-14 |
CN115964415B CN115964415B (en) | 2023-05-26 |
Family
ID=85894768
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202310250759.4A Active CN115964415B (en) | 2023-03-16 | 2023-03-16 | Pre-HUSPM-based database sequence insertion processing method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN115964415B (en) |
Citations (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030217055A1 (en) * | 2002-05-20 | 2003-11-20 | Chang-Huang Lee | Efficient incremental method for data mining of a database |
CN105590237A (en) * | 2015-12-18 | 2016-05-18 | 齐鲁工业大学 | Application of high utility sequential pattern with negative-profit items in electronic commerce business decision making |
CN106777182A (en) * | 2016-12-23 | 2017-05-31 | 陕西理工学院 | A kind of data flow effective item set mining algorithm for reducing candidate |
CN108733705A (en) * | 2017-04-20 | 2018-11-02 | 哈尔滨工业大学深圳研究生院 | A kind of effective sequential mode mining method and device |
CN109101530A (en) * | 2018-06-22 | 2018-12-28 | 哈尔滨工业大学(深圳) | Effective sequence of events pattern mining algorithm |
CN109408563A (en) * | 2018-11-07 | 2019-03-01 | 哈尔滨工业大学(深圳) | High average utility item set mining method, apparatus and computer equipment |
CN111475551A (en) * | 2020-06-15 | 2020-07-31 | 河北工业大学 | High average utility sequence pattern mining method under non-overlapping condition |
CN111930803A (en) * | 2020-08-07 | 2020-11-13 | 河北工业大学 | Non-overlapping self-adaptive frequent sequence pattern mining method |
CN112434031A (en) * | 2020-11-16 | 2021-03-02 | 宁波财经学院 | Uncertain high-utility mode mining method based on information entropy |
US20220058716A1 (en) * | 2020-08-18 | 2022-02-24 | Qilu University Of Technology | Commodity recommendation system based on actionable high utility negative sequential rules mining and its working method |
CN114971794A (en) * | 2022-05-27 | 2022-08-30 | 齐鲁工业大学 | Time period-based high-utility sequence mode analysis method and system in group purchase |
Non-Patent Citations (1)
Title |
---|
Mu Huanhuan; Chai Yumei; Wang Liming: "A high-utility itemset mining algorithm for data streams", Computer Applications and Software * |
Also Published As
Publication number | Publication date |
---|---|
CN115964415B (en) | 2023-05-26 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||