CN113505156A

CN113505156A - Transaction data frequent sequence pattern mining method based on improved PrefixSpan algorithm

Info

Publication number: CN113505156A
Application number: CN202110777271.8A
Authority: CN
Inventors: 何新; 王子龙; 陈琛
Original assignee: Nanjing Rongxin Intelligent Technology Co ltd
Current assignee: Nanjing Rongxin Intelligent Technology Co ltd
Priority date: 2021-07-09
Filing date: 2021-07-09
Publication date: 2021-10-15

Abstract

The invention discloses a method for mining frequent sequence patterns in transaction data based on an improved PrefixSpan algorithm, comprising the following steps: preprocessing commodity transaction data to obtain a commodity transaction data set, and storing the data set in a transaction sequence database; scanning the transaction sequence database, counting each single item to obtain its sequence support, sorting the single items in descending order of support, and selecting the top μ single items that also meet the minimum support as initial prefixes; using depth-first traversal, calculating the position of the first initial prefix, storing it in a prefix position information table, and generating a commodity transaction projection database; iterating on the commodity transaction projection database until no new projection database can be generated, and storing the frequent sequence pattern set generated by each projection database; and repeating the previous step from the second initial prefix until all initial prefixes have been processed. The invention reduces the time and space consumption of frequent sequence pattern mining on transaction data and improves execution efficiency.

Description

Transaction data frequent sequence pattern mining method based on improved PrefixSpan algorithm

Technical Field

The invention relates to the technical field of transaction data mining, in particular to a transaction data frequent sequence pattern mining method based on an improved PrefixSpan algorithm.

Background

The transaction data of a large supermarket chain consists of a series of user transaction databases. Each record contains the user's ID, the time of the transaction, and the items involved in the transaction. If patterns describing the relationships between transactions can be mined, that is, the connections between a user's successive purchases, more targeted marketing measures can be taken.

Experts and scholars have devoted considerable effort to frequent sequence pattern mining for transaction data and have proposed several typical methods, such as the GSP, SPADE, and PrefixSpan algorithms. The GSP algorithm reduces the number of candidate sequences to be scanned and the generation of redundant, useless patterns, but on a large-scale sequence database it still generates a large number of candidate sequence patterns and must scan the sequence database repeatedly. The SPADE algorithm reduces the number of database scans to only three, but generates a large number of vertical databases when the original data is very large. The PrefixSpan algorithm has the advantage that it generates no candidate sequences; compared with the other two algorithms, its memory consumption is relatively stable and its efficiency is higher. However, duplicate projection databases may occur, and each duplicate is mined and partitioned again, which causes repeated computation and increases time and space consumption. A transaction data frequent sequence pattern mining method based on an improved PrefixSpan algorithm is therefore needed.

Disclosure of Invention

The invention aims to provide a transaction data frequent sequence pattern mining method based on an improved PrefixSpan algorithm, which reduces the time and space consumption of frequent sequence pattern mining on transaction data and improves execution efficiency.

In order to achieve the purpose, the invention provides the following scheme:

A transaction data frequent sequence pattern mining method based on an improved PrefixSpan algorithm comprises the following steps:

S1) preprocessing the acquired commodity transaction data to obtain a commodity transaction data set, and storing the commodity transaction data set in a transaction sequence database D;

S2) scanning the transaction sequence database D, counting each single item of length 1 to obtain the sequence support sup of each single item, sorting the single items in descending order of support, and selecting the top μ single items that also meet the minimum support min_sup as initial prefixes;

S3) using depth-first traversal, calculating the position of the first initial prefix, storing it in the prefix position information table, and generating a commodity transaction projection database; iterating on the commodity transaction projection database until no new commodity transaction projection database can be generated, and storing the frequent sequence pattern set generated by each commodity transaction projection database;

S4) repeating step S3) from the second initial prefix until all initial prefixes have been processed;

wherein step S4) specifically includes:

S401) generating the commodity transaction projection database of the second initial prefix; if the commodity transaction projection database is empty, returning from the recursion;

S402) scanning the commodity transaction projection database and counting the single items; if the sequence support sup of every single item is lower than the minimum support min_sup, returning from the recursion;

S403) merging each single item that meets the minimum support min_sup with the current prefix to obtain a plurality of new prefixes, and calculating the prefix position of each new prefix; if the prefix position information table already contains a prefix with the same position as the current prefix, directly returning the frequent sequence pattern set generated by that prefix in the prefix position information table, and returning to step S3); otherwise, storing the new prefix position information in the prefix position information table, generating a new commodity transaction projection database, and returning to step S401).

Optionally, the preprocessing of the obtained commodity transaction data in step S1) includes completing or deleting missing or repeated order records, and correcting data with record errors.

Optionally, μ in step S2) is the number of effective items, which refers to the categories of main commodities sold by the vending machine.

Optionally, the number of effective items μ may be set according to the mechanical structure, container capacity, and number of goods channels of the vending machine, or according to the number of main commodity types set by the administrator, or according to a combination of the two.

Optionally, the prefix location information table in step S3) is stored by a Hash table.

According to the specific embodiment provided by the invention, the invention achieves the following technical effects: by using the prefix position information table together with depth-first traversal, the transaction data frequent sequence pattern mining method based on the improved PrefixSpan algorithm avoids recomputing duplicate projection databases, reduces the time and space consumption of frequent sequence pattern mining on transaction data, and improves execution efficiency.

Drawings

In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings needed in the embodiments are briefly described below. The drawings described below show only some embodiments of the present invention, and those skilled in the art can obtain other drawings from them without inventive effort.

FIG. 1 is a flow chart of a transaction data sequence pattern mining method based on an improved PrefixSpan algorithm according to an embodiment of the present invention;

FIG. 2 is a flow chart of the PrefixSpan algorithm according to the embodiment of the present invention;

FIG. 3 is a partial diagram of merchandise transaction data according to an embodiment of the invention;

FIG. 4 is a partial view of a merchandise transaction data set according to an embodiment of the invention;

FIG. 5 is a comparison graph of the execution efficiency before and after the PrefixSpan algorithm is improved according to the embodiment of the present invention;

FIG. 6 is a comparison graph of the execution space before and after the PrefixSpan algorithm is improved according to the embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

In order to make the aforementioned objects, features and advantages of the present invention comprehensible, embodiments accompanied with figures are described in further detail below.

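As shown in FIG. 1 and FIG. 2, the transaction data frequent sequence pattern mining method based on the improved PrefixSpan algorithm provided by the embodiment of the present invention includes the following steps:

S1) preprocessing the acquired commodity transaction data to obtain a commodity transaction data set, and storing the commodity transaction data set in a transaction sequence database D; the preprocessing includes completing or deleting missing or repeated order records and correcting data with recording errors;

S2) scanning the transaction sequence database D, counting each single item of length 1 to obtain the sequence support sup of each single item, sorting the single items in descending order of support, and selecting the top μ single items that also meet the minimum support min_sup as initial prefixes; μ is the number of effective items and refers to the categories of main commodities sold by the vending machine; the number of effective items μ is set according to the mechanical structure, container capacity, and number of goods channels of the vending machine, or according to the number of main commodity types set by the administrator, or according to a combination of the two;

S3) using depth-first traversal, calculating the position of the first initial prefix, storing it in the prefix position information table, and generating a commodity transaction projection database; iterating on the commodity transaction projection database until no new commodity transaction projection database can be generated, and storing the frequent sequence pattern set generated by each commodity transaction projection database; the prefix position information table is stored in a hash table;

S4) repeating step S3) from the second initial prefix until all initial prefixes have been processed;

wherein step S4) specifically includes:

S401) generating the commodity transaction projection database of the second initial prefix; if the commodity transaction projection database is empty, returning from the recursion;

S402) scanning the commodity transaction projection database and counting the single items; if the sequence support sup of every single item is lower than the minimum support min_sup, returning from the recursion;

S403) merging each single item that meets the minimum support min_sup with the current prefix to obtain a plurality of new prefixes, and calculating the prefix position of each new prefix; if the prefix position information table already contains a prefix with the same position as the current prefix, directly returning the frequent sequence pattern set generated by that prefix in the prefix position information table, and returning to step S3); otherwise, storing the new prefix position information in the prefix position information table, generating a new commodity transaction projection database, and returning to step S401).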

Experiments are used below to verify the characteristics of the improved PrefixSpan algorithm of the embodiment of the present invention (PPFPrefixSpan), comparing it with the PrefixSpan algorithm extended with effective items (PrefixSpan Based on Effective Items, EIPrefixSpan) and with the unmodified PrefixSpan algorithm.

As shown in FIG. 3, the test commodity transaction data for this experiment are the complete sales records of a retail supermarket from January 2015 to April 30, 2015. The data set covers 42,817 purchase records of 2,000 users over the four months, organized into 17 columns, including 'customer number', 'major class code', 'major class name', 'middle class code', 'minor class name', 'sales date', 'sales month', 'goods code', 'specification model', 'goods type', 'unit', 'sales quantity', 'sales amount', 'unit price', and 'on promotion or not'. The data set classifies the goods into 4 goods types, 15 major classes, 176 middle classes, and 759 minor classes. In line with the practical situation of the intelligent vending machine system, the 176 middle classes are selected as the feature item set, and the experimental data are processed with each of three algorithms: PrefixSpan, EIPrefixSpan, and PPFPrefixSpan. All experimental programs are written in Python and run on Windows.

First, the acquired commodity transaction data are preprocessed, which includes completing or deleting missing or repeated order records and correcting data with recording errors. Then, of the 17 preprocessed columns, 5 columns are retained: 'customer number', 'middle class code', 'middle class name', 'sales date', and 'sales month', as shown in FIG. 4.
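The step from the retained columns to a sequence database can be illustrated with a minimal sketch. It is not taken from the patent: the function name `build_sequences`, the three-field row layout, the ISO date strings, and the choice to drop (rather than complete) incomplete or exactly duplicated rows are all assumptions made for illustration.

```python
from collections import defaultdict


def build_sequences(rows):
    """Group (customer, middle_class_code, sale_date) rows into one
    date-ordered item list per customer.

    Assumption: rows with a missing field or exact duplicates are dropped;
    the patent also allows completing such records instead.
    """
    seen = set()
    by_customer = defaultdict(list)
    for row in rows:
        customer, item, sale_date = row
        if None in row or row in seen:
            continue  # skip incomplete or repeated order records
        seen.add(row)
        by_customer[customer].append((sale_date, item))
    # ISO-formatted date strings sort chronologically
    return {c: [item for _, item in sorted(v)] for c, v in by_customer.items()}


# Hypothetical sample rows, not data from the experiment
rows = [
    ("u1", "205", "2015-01-03"),
    ("u1", "101", "2015-01-01"),
    ("u2", "101", "2015-02-01"),
    ("u1", "205", "2015-01-03"),  # repeated record
    ("u2", None, "2015-02-02"),   # missing item code
]
sequences = build_sequences(rows)
```

Each value of `sequences` is one customer's purchase sequence; the list of these values plays the role of the transaction sequence database D in the sketch.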

On the commodity transaction data set, the execution efficiency and execution space of the EIPrefixSpan algorithm and the PPFPrefixSpan algorithm were measured with the support sup set to 2%, 4%, 6%, 8%, 10%, 12%, 14%, and 16%, and with μ = 30. The results of the execution efficiency comparison experiment are shown in Table 1, and the comparison graph is shown in FIG. 5.

Table 1. Results of the execution efficiency comparison experiment

As FIG. 5 shows, the execution efficiency of the PPFPrefixSpan algorithm improves further on that of the EIPrefixSpan algorithm and is significantly better than that of the PrefixSpan algorithm. During sequence pattern mining, the PPFPrefixSpan algorithm uses the prefix position information table and depth-first traversal to avoid regenerating frequent sequences by recursing repeatedly into duplicate projection databases, which reduces the running time. The more duplicate projection databases a data set produces, the more pronounced the effect. The mining efficiency of sequence patterns is thereby improved.

The results of the execution space comparison experiment are shown in Table 2, and the comparison graph is shown in FIG. 6.

Table 2. Results of the execution space comparison experiment

As FIG. 6 shows, the execution space of the PPFPrefixSpan algorithm improves further on that of the EIPrefixSpan algorithm and is significantly better than that of the PrefixSpan algorithm. The PPFPrefixSpan algorithm avoids generating duplicate projection databases during sequence pattern mining, which reduces the memory it needs while running.

The experimental results show that when a data set produces a large number of duplicate projection databases during execution, the PPFPrefixSpan algorithm saves time and space resources compared with the EIPrefixSpan algorithm, which demonstrates that optimizing the algorithm by introducing the prefix position information table is feasible.

The idea of the EIPrefixSpan algorithm is to introduce the concept of the number of effective items μ into the first step of the PrefixSpan algorithm. When a vending machine is stocked in practice, its limited capacity means that only the few best-selling commodities, together with the associated commodities with the highest support, can actually be selected. A retail supermarket's data, however, often includes tens or hundreds of commodity types, so only the best-selling commodities and their high-confidence associated commodities need to be obtained.
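The selection of initial prefixes by effective items can be sketched as follows. This is an illustrative reading of step S2), not the patent's implementation: the function name `initial_prefixes`, the use of an absolute count for min_sup, and the alphabetical tie-break between items of equal support are assumptions.

```python
from collections import Counter


def initial_prefixes(database, min_sup, mu):
    """Return the top-mu single items that also meet min_sup.

    Sequence support counts each item at most once per sequence.
    Assumption: min_sup is an absolute count; ties are broken by item name.
    """
    counts = Counter()
    for seq in database:
        counts.update(set(seq))
    frequent = [(item, n) for item, n in counts.items() if n >= min_sup]
    frequent.sort(key=lambda pair: (-pair[1], pair[0]))  # descending support
    return [item for item, _ in frequent[:mu]]


# Hypothetical toy database: one list of items per customer
database = [["a", "b", "c"], ["a", "b", "d"], ["a", "c"], ["b"]]
top_two = initial_prefixes(database, min_sup=2, mu=2)
```

In this toy database the supports are a = 3, b = 3, c = 2, d = 1, so with min_sup = 2 and μ = 2 only `a` and `b` become initial prefixes, even though `c` is also frequent.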

The PPFPrefixSpan algorithm is a prefix-projection-based sequence pattern mining algorithm built on the EIPrefixSpan algorithm. It differs from the EIPrefixSpan algorithm in that it introduces a prefix position information table, exploiting the fact that identical projection databases generate identical frequent sequence pattern sets. When the position information of any prefix α is the same as that of a prefix β already stored in the prefix position information table, the projection database of α does not need to be mined again: the frequent sequence pattern set already generated for β can be returned directly as the result for α. Depth-first traversal is therefore adopted: a single item is first taken as the initial prefix, and recursion proceeds from items of length 1 to items of length L, with the results serving as an initial reference set. Because prefix positions are much less likely to repeat among new prefixes recursively generated from the same prefix than among prefixes of the same length produced by breadth-first traversal, the PPFPrefixSpan algorithm uses depth-first traversal.
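A small sketch may help show why two different prefixes can share a projection database. It assumes a simplified setting that the patent does not spell out: every sequence element is a single item, and a "prefix position" is taken to be the index in each sequence where the leftmost match of the prefix ends. The names `prefix_positions` and `project` are hypothetical.

```python
def prefix_positions(database, prefix):
    """Return ((sequence_id, end_index), ...) for every sequence that
    contains `prefix` as a subsequence, using the leftmost match."""
    positions = []
    for sid, seq in enumerate(database):
        pos = -1
        for item in prefix:
            try:
                pos = seq.index(item, pos + 1)  # next occurrence after pos
            except ValueError:
                pos = None  # prefix does not occur in this sequence
                break
        if pos is not None:
            positions.append((sid, pos))
    return tuple(positions)


def project(database, positions):
    """Projection database: the suffix of each matched sequence."""
    return [database[sid][pos + 1:] for sid, pos in positions]


# Hypothetical toy database
db = [["a", "b", "c"], ["a", "b", "d"]]
pos_b = prefix_positions(db, ["b"])
pos_ab = prefix_positions(db, ["a", "b"])
```

Here the prefixes `<b>` and `<a, b>` end at the same index in every sequence, so `pos_b == pos_ab` and both yield the same projection database `[["c"], ["d"]]`. Under the stated assumptions, whatever was mined for one can be reused for the other.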

In actual operation, the prefix position information table is stored in a hash table, a data structure that stores and accesses data as key-value pairs and can be queried directly by key, which speeds up lookups. The algorithm involves three important pieces of information: the prefixes, the prefix positions, and the frequent sequences of the projection database corresponding to each prefix. This information is therefore stored in a two-level hash table. The first level maps prefix to prefix position, with the prefix as the key and the prefix position as the value; the second level maps prefix position to frequent sequences, with the prefix position as the key and the frequent sequences of the projection database corresponding to the prefix as the value.
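The two-level table described above might be laid out as in the following sketch, which uses Python dictionaries as the hash tables. The class name `PrefixPositionTable`, its method names, and the tuple encodings of prefixes and positions are assumptions for illustration only.

```python
class PrefixPositionTable:
    """Two-level hash table sketch.

    Level 1: prefix -> prefix position.
    Level 2: prefix position -> frequent sequences of the projection database.
    Keys must be hashable, so prefixes and positions are stored as tuples.
    """

    def __init__(self):
        self.position_of = {}  # level 1
        self.patterns_at = {}  # level 2

    def lookup(self, position):
        """Return the stored frequent sequences for this position, or None."""
        return self.patterns_at.get(position)

    def store(self, prefix, position, patterns):
        self.position_of[prefix] = position
        self.patterns_at[position] = patterns


# Hypothetical usage: a later prefix with the same position reuses the result
table = PrefixPositionTable()
table.store(("a", "b"), ((0, 1), (1, 1)), {("c",): 2})
hit = table.lookup(((0, 1), (1, 1)))
miss = table.lookup(((0, 2),))
```

Keying the second level by position rather than by prefix is what lets a different prefix with the same position find the stored result in a single lookup.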

The algorithm first scans the entire sequence database, counts all single items, sorts them by support, and takes the top μ items with the highest support as initial prefixes. Using depth-first traversal, it first constructs the projection database of the first initial prefix; if the prefix position information table is empty, the prefix position information is stored directly. It then counts the single items in the projection database, combines each single item that meets the minimum support with the original prefix to form a new prefix, stores the prefix position information of the new prefix, and forms a new projection database. The new projection database is processed iteratively until the projection database is empty, and all sequence pattern results are stored. Next, the algorithm obtains the prefix position information of the second initial prefix and scans the prefix position information table. If a prefix with the same position as this initial prefix exists, recursion stops and the sequence pattern set generated by that prefix is returned directly; otherwise, the position information of the initial prefix is stored and recursion continues. This is repeated until all initial prefixes have been processed, and all sequence pattern sets are returned.
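The overall flow can be illustrated with a compact sketch. It is a simplified interpretation rather than the patented implementation: sequences are assumed to be lists of single items, min_sup is an absolute count, the position table is collapsed to one dictionary keyed by prefix position, and the name `mine_ppf` and the returned reuse counter are hypothetical.

```python
from collections import Counter


def mine_ppf(database, min_sup, mu):
    """Depth-first prefix-projection mining with a prefix position table.

    Returns ({pattern tuple: support}, number of table reuses).
    """
    table = {}  # prefix position -> {suffix pattern: support}
    reused = 0

    def grow(positions):
        nonlocal reused
        if positions in table:  # same position: reuse the stored result
            reused += 1
            return table[positions]
        local = Counter()
        for sid, pos in positions:
            local.update(set(database[sid][pos + 1:]))
        suffixes = {}
        for item in sorted(local):
            if local[item] < min_sup:
                continue
            new_positions = tuple(
                (sid, database[sid].index(item, pos + 1))
                for sid, pos in positions
                if item in database[sid][pos + 1:]
            )
            suffixes[(item,)] = local[item]
            for tail, sup in grow(new_positions).items():
                suffixes[(item,) + tail] = sup
        table[positions] = suffixes
        return suffixes

    # First scan: sequence support of single items, top-mu initial prefixes
    counts = Counter()
    for seq in database:
        counts.update(set(seq))
    ranked = sorted((i for i in counts if counts[i] >= min_sup),
                    key=lambda i: (-counts[i], i))
    patterns = {}
    for item in ranked[:mu]:
        positions = tuple((sid, seq.index(item))
                          for sid, seq in enumerate(database) if item in seq)
        patterns[(item,)] = counts[item]
        for tail, sup in grow(positions).items():
            patterns[(item,) + tail] = sup
    return patterns, reused


# Hypothetical toy database
db = [["a", "b", "c"], ["a", "b", "d"], ["a", "c"]]
patterns, reused = mine_ppf(db, min_sup=2, mu=3)
```

In this toy run the initial prefixes `<b>` and `<c>` end at the same positions as `<a, b>` and `<a, c>`, which were already visited while recursing from `<a>`, so their projection databases are not scanned again. The table stores suffix patterns, and the caller prepends its own prefix to them.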

By using the prefix position information table together with depth-first traversal, the transaction data frequent sequence pattern mining method based on the improved PrefixSpan algorithm provided by the invention avoids recomputing duplicate projection databases, reduces the time and space consumption of frequent sequence pattern mining on transaction data, and improves execution efficiency.

The principles and embodiments of the present invention have been described herein using specific examples, which are provided only to help understand the method and the core concept of the present invention; meanwhile, for a person skilled in the art, according to the idea of the present invention, the specific embodiments and the application range may be changed. In view of the above, the present disclosure should not be construed as limiting the invention.

Claims

1. A transaction data frequent sequence pattern mining method based on an improved PrefixSpan algorithm, characterized by comprising the following steps:

wherein, the step S4) specifically includes:

2. The transaction data frequent sequence pattern mining method based on the improved PrefixSpan algorithm as claimed in claim 1, wherein the preprocessing of the acquired commodity transaction data in step S1) includes completing or deleting missing or repeated order records and correcting data with recording errors.

3. The transaction data frequent sequence pattern mining method based on the improved PrefixSpan algorithm as claimed in claim 1, wherein μ in step S2) is the number of effective items, which refers to the categories of main commodities sold by the vending machine.

4. The transaction data frequent sequence pattern mining method based on the improved PrefixSpan algorithm as claimed in claim 3, wherein the number of effective items μ is set according to the mechanical structure, container capacity, and number of goods channels of the vending machine, or according to the number of main commodity types set by the administrator, or according to a combination of the two.

5. The transaction data frequent sequence pattern mining method based on the improved PrefixSpan algorithm as claimed in claim 1, wherein the prefix position information table in step S3) is stored in a hash table.