WO2021102775A1 - A pattern data mining method based on an improved genetic algorithm - Google Patents

A pattern data mining method based on an improved genetic algorithm

Info

Publication number
WO2021102775A1
WO2021102775A1 PCT/CN2019/121475 CN2019121475W WO2021102775A1 WO 2021102775 A1 WO2021102775 A1 WO 2021102775A1 CN 2019121475 W CN2019121475 W CN 2019121475W WO 2021102775 A1 WO2021102775 A1 WO 2021102775A1
Authority
WO
WIPO (PCT)
Prior art keywords
population
huis
individual
algorithm
processing
Prior art date
Application number
PCT/CN2019/121475
Other languages
English (en)
French (fr)
Inventor
方伟
张强
孙俊
吴小俊
Original Assignee
江南大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 江南大学
Priority to PCT/CN2019/121475
Publication of WO2021102775A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2458 Special types of queries, e.g. statistical queries, fuzzy queries or distributed queries
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/12 Computing arrangements based on biological models using genetic models

Definitions

  • The invention relates to a pattern data mining method based on an improved genetic algorithm and belongs to the technical field of data mining.
  • Data mining refers to the process of extracting potentially interesting information or patterns from a large amount of data for further use.
  • An item set is a collection composed of at least one item in the transaction database.
  • A transaction database is a database that can record transactions, news, and other matters.
  • a transaction database usually records at least one transaction, and each transaction includes at least one data item.
  • In a shopping-type transaction database, each data item corresponds to the name of a product, and one transaction record can include several product data items. Because such a transaction database often reflects the user's preferences, the itemsets recommended to a user are often mined from the many itemsets formed in the transaction database.
  • Frequent itemset mining (FIM) has always been an important research direction in data mining. It refers to mining, from a database, the itemsets whose occurrence frequency is not lower than a user-specified threshold. Although FIM can find itemsets that appear frequently in the transaction database, it does not consider the utility value of the itemsets. In practice it is often necessary to find itemsets with higher utility values (high-utility itemsets, HUIs). The purpose of high-utility itemset (pattern) mining is to mine all itemsets in the transaction database whose utility value is not less than the specified minimum utility value.
  • the high-utility item set mining (HUIM) problem is a typical combinatorial optimization problem.
  • Evolutionary computation (EC) is an effective stochastic optimization method. Inspired by the natural evolution process, it applies the principles of natural evolution to search for optimal solutions and has been applied to various combinatorial optimization problems.
  • EC-based methods such as the genetic algorithm (GA), particle swarm optimization (PSO), ant colony optimization (ACO), and the artificial bee colony algorithm have been introduced to solve this problem.
  • However, it is usually expensive for EC-based HUIM algorithms to mine all high-utility itemsets (HUIs) that meet the minimum utility threshold.
  • The reasons for this problem can be summarized in two aspects.
  • First, high-utility itemset mining differs from traditional optimization problems: all itemsets that meet the minimum utility threshold must be found, whereas the original evolutionary computation methods always search in the direction of the best individual of the previous generation, which may miss some results during iteration.
  • Second, EC-based high-utility itemset mining algorithms are inefficient when searching for new HUIs.
  • the present invention uses an improved genetic algorithm to efficiently mine HUIs, and provides a pattern data mining method HUIM-IGA based on the improved genetic algorithm.
  • the method of the present invention improves the efficiency of discovering HUIs, and shortens the total time consumption of mining all HUIs.
  • the pattern data mining method of the present invention which is improved based on genetic algorithm (GA), includes:
  • the database is any database that can record transactions, news and other matters.
  • the database usually records at least one transaction, and each transaction includes at least one data item.
  • the database is a shopping basket database.
  • the data item of the product corresponds to the name of the product
  • a transaction about the transaction record can include several data items of the product.
  • Initializing the population includes initializing each individual according to the TWU value (transaction-weighted utility value in the data set) of each 1-HTWUI (high-transaction-weighted utility 1-itemset).
  • Specifically, each initial individual is first set to all 0s, and an integer k is randomly generated as the number of non-zero positions in the individual; based on the TWU value of each 1-HTWUI, the roulette selection operator decides which items appear in the current individual; the newly generated individual is added to the (initial) population; finally, the initialized population is returned.
  • Performing individual repair processing on the population includes sequential pruning based on the TWU values of the 1-HTWUIs, preserving as far as possible the combinations of items in the individual that may produce utility values.
  • The individual repair processing is specifically: first determine whether the individual to be repaired is all 0s; if so, exit the algorithm directly; otherwise, enter the individual repair stage.
  • The individual repair stage includes: initializing the intermediate variable temp to the bitmap column corresponding to the 1-HTWUI with the largest TWU value; then, in descending order of TWU value, ANDing temp with the bitmap column corresponding to each HTWUI to obtain temp'. If temp' consists of all 0s, the corresponding item is deleted from the current individual; otherwise, the operation continues until all bits of the individual to be repaired have been checked.
  • The fitness value f(x) of each individual x (in a genetic algorithm, fitness is the main indicator of individual performance; individuals are ranked by fitness so that survival of the fittest can be applied).
  • the pattern data mining method further includes:
  • Neighborhood exploration is performed on the repeated HUIs in the population.
  • Performing neighborhood exploration on the repeated HUIs in the population includes: a new individual generated by the genetic operations is not evaluated for fitness directly; instead, it is first checked whether the individual is already in the HUIs set. If it is, neighborhood mutation is applied to generate a new solution near that HUI, exploring its neighborhood space, and fitness evaluation is then performed on the new solution; otherwise, fitness evaluation is performed directly.
  • The neighborhood exploration processing is specifically: assuming that the individual currently undergoing mutation is a HUI (in binary-coded form), presence_indexs stores the indices of the positions equal to 1 in the HUI (Step 1) and absence_indexs stores the indices of the positions equal to 0 (Step 2); a value m is randomly selected from presence_indexs (Step 3) and a value n is randomly selected from absence_indexs (Step 4); finally, the m-th position of the HUI is set to 0 (Step 5) and the n-th position is set to 1 (Step 6).
  • the method further includes:
  • the process of maintaining population diversity includes: in each evolution process, two consecutively stored HUIs are selected from the set of HUIs to replace two randomly selected individuals in the current population.
  • the method further includes:
  • The elite processing includes: merging the previous-generation population P_t with the current population P_{t+1}, deleting duplicate individuals in the merged population, sorting by utility value in descending order, and selecting the individuals with larger utility values to form the next-generation population Q.
  • the pattern data mining method is improved based on genetic algorithm (GA), including:
  • Steps 1-4 calculate the TWU value of each item, find all 1-HTWUIs, and sort the 1-HTWUIs by TWU value.
  • The length of each individual is determined by the number of 1-HTWUIs (Steps 1-4).
  • The population is randomly initialized using the roulette selection method (Step 6).
  • Each individual in the population is repaired in descending order of the TWU values of the 1-HTWUIs. If the repaired individual already belongs to the HUIs set, a solution in its neighborhood is explored; it is then judged whether the individual is a HUI, and if so, it is added to the HUIs set (Steps 8-18).
  • the present invention also provides a method for recommending products to users.
  • The method includes using the data mining method of the present invention to mine high-utility itemsets, and recommending products to users based on the information in the mined high-utility itemsets.
  • The present invention also provides a commodity packaging method, which includes using the data mining method of the present invention to mine high-utility itemsets from shopping basket data, and bundling the commodities in a given high-utility itemset into one package.
  • the present invention also provides an item set mining device, which includes:
  • the first calculation module is used to calculate the TWU value (transaction weighted utility value) of each item
  • the second calculation module is used to calculate the utility value of the individual (that is, the item set);
  • the individual repair module is used to perform sequential pruning based on the TWU values of the 1-HTWUIs, preserving as far as possible the combinations of items in the individual that may produce utility values;
  • the high-utility itemset determining module is configured to determine the itemset as a high-utility itemset (HUI) when the utility value of the itemset corresponding to the individual is ≥ the minimum utility value.
  • the item set mining device further includes:
  • the neighborhood exploration module is used to perform neighborhood mutation on a repaired individual that belongs to the HUIs set, generating a new individual.
  • the item set mining device further includes:
  • the population diversity maintenance module is used to select two continuously stored HUIs from the HUIs collection to replace two randomly selected individuals in the current population during each evolution process.
  • the item set mining device further includes:
  • the population selection and crossover module is used to perform selection and crossover on the population.
  • the item set mining device further includes:
  • the elite module is used to merge the previous generation population Pt with the contemporary population P t+1 , and delete duplicate individuals in the merged population, and then sort according to the utility value from large to small, and select individuals with larger utility values to form the next-generation population Q.
  • the item set mining device further includes: a decoding module, which is used to decode all the mined HUIs in a binary coded form.
  • the present invention also provides a piece of data processing equipment, including the item set mining device of the present invention.
  • the method of the present invention includes using one or more of the following processes: individual repair processing, neighborhood exploration processing, population diversity maintenance processing, and elite processing to mine HUIs.
  • In the population initialization stage, in order to mine HUIs more effectively, the data set is expressed in the form of a bitmap, and each individual is initialized based on the TWU value (transaction-weighted utility value in the data set) of each 1-HTWUI (high-transaction-weighted utility 1-itemset);
  • The method of the present invention adopts individual repair processing based on ranking the 1-HTWUIs by transaction-weighted utility (TWU). On the one hand, this ensures that the repaired individual is a combination that actually occurs in the data set; on the other hand, it prevents the repair strategy from destroying outstanding individuals.
  • By applying neighborhood search processing to repeated HUIs, the method of the present invention makes rational use of these repeated HUIs and thereby improves solution efficiency.
  • These repeated HUIs are replaced by other solutions in their neighborhood search space, which improves the algorithm's local search ability in the region of optimal solutions and speeds up the search for new HUIs.
  • Through the new population diversity maintenance processing, the method of the present invention can also effectively expand the search space of high-quality solutions, avoiding the problem of missing HUIs caused by the algorithm prematurely falling into a local optimum and reducing the number of HUIs missed during the search;
  • Through the use of elite processing, the method of the present invention can also prevent the loss of high-quality itemsets.
  • The method of the present invention can process transaction databases that are common in daily applications, and performs better in terms of the number of high-utility itemsets discovered, the ability to discover high-utility itemsets, and running time.
  • Figure 1 is a coding example of a chromosome
  • Figure 2 is a schematic diagram of the neighborhood search processing for repeated HUIs
  • Figure 3 is a schematic diagram of the processing method for maintaining population diversity
  • Figure 4 is a schematic diagram of the elite processing method
  • Figure 6 is a schematic diagram of the influence of diversity maintenance processing on convergence speed
  • Figure 7 is a schematic diagram of the influence of elite processing on convergence speed
  • The genetic algorithm is a meta-heuristic optimization method inspired by the process of natural selection and is widely used to solve various NP problems. In a genetic algorithm, a certain number of individuals constitute a population, and each individual represents a potential solution. The algorithm starts from an initial population of potential solutions and then applies three genetic operations (crossover, mutation, and selection) to the chromosomes to generate the next-generation population. The genetic operators are applied repeatedly until the stopping condition is met, and the best solution is then output.
  • the main evolution operators of genetic algorithm are as follows:
  • Selection operator: the selection operator is used to select suitable individuals in the population. According to predefined rules, better individuals are more likely to be selected and survive, and worse individuals are less likely to be selected into the next generation.
  • Mutation operator: the mutation operator changes one or more genes of an individual with a certain probability, so that offspring carry genes different from those of their parents. The mutation operator helps to maintain the diversity of the population and increases the possibility of reaching the global optimum.
  • the purpose of HUIM is to explore a combination of items whose utility value is not less than a user-specified threshold from the data set. These explored item sets can help shopping mall decision-makers or managers to formulate reasonable and effective sales strategies.
  • the HUIM problem considers the number and weight of items at the same time. Its formal definition and mathematical model were first given by Yao et al. Liu et al. proposed the TWU model for the first time in the Two-Phase algorithm, and used Transaction-weighted Downward Closure (TWDC) to achieve the goal of reducing the search space.
  • the Two-Phase algorithm has the problem of generating a large number of candidate item sets in the second stage. In order to solve this problem, Li et al.
  • the IHUP algorithm proposed by Ahmed et al. does not require multiple scans of the data set.
  • Tseng et al. improved the IHUP algorithm and proposed the UP-tree structure, with UP-Growth and UP-Growth+ used to discover HUIs.
  • Liu et al. proposed the HUI-Miner algorithm, which transforms the original database into a list structure, avoids candidate generation and repeated scans of the data set, and improves mining efficiency.
  • the EC-based HUIM algorithm is also used to mine HUIs.
  • Kannimuthu et al. were the first to propose a genetic-algorithm-based approach, the HUPEumu-GARM algorithm. It improves the mutation operator by adopting a ranking-based mutation operator, and adaptively adjusts the mutation probability as evolution progresses.
  • Lin et al. proposed the HUIM-BPSOsig algorithm based on the discrete PSO algorithm, in which the length of the particles is determined by the number of high-transaction-weighted utility 1-itemsets (1-HTWUIs); this effectively reduces the search space and improves solution efficiency. Lin et al.
  • Each transaction $T_q \in D$ ($1 \le q \le n$) is a subset of the set $I$, consists of several items, and is marked with a unique identifier TID.
  • Each item in the transaction $T_q$ has a purchase quantity (internal utility), denoted $q(i_j, T_q)$ ($1 \le j \le v$, $1 \le q \le n$), and each item in the set $I$ has an external utility $p(i_j)$, which represents the profit of the commodity.
  • An item set (or pattern) $X = \{i_1, i_2, \ldots, i_k\}$ ($1 \le k \le v$) is a non-empty subset of $I$.
  • Table 1 defines a quantitative database
  • Table 2 defines the profit of different items in the database.
  • The utility value of item set X in data set D is denoted as u(X), which is defined as follows:
  • For example, $u(\{d, f\}) = u(d, T_2) + u(f, T_2) + u(d, T_4) + u(f, T_4) + u(d, T_9) + u(f, T_9) = 2 \times 4 + 3 \times 5 + 2 \times 4 + 1 \times 5 + 2 \times 4 + 1 \times 5 = 49$.
  • The utility value of the transaction $T_q$, denoted as $tu(T_q)$, is defined as follows:
  • The transaction-weighted utility value of item set X in data set D is denoted as TWU(X) and is defined as follows:
  • The user sets the minimum utility threshold according to preference. If the utility value of item set X is not less than the minimum utility value preset by the user, then item set X is a high-utility itemset. The minimum utility value minUti is defined as follows:
  • If TWU(X) is not less than minUti, X is called a high-transaction-weighted utility itemset (HTWUI).
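
The definitions referred to above can be written out as a sketch. The formulas below are the standard ones from the HUIM literature, expressed in the symbols of this section; the symbol $\delta$ for the user-specified threshold ratio is an assumption, and the forms are not guaranteed to match the original equations exactly.

```latex
\begin{align}
u(i_j, T_q) &= q(i_j, T_q) \times p(i_j) \\
u(X, T_q)   &= \sum_{i_j \in X} u(i_j, T_q), \qquad X \subseteq T_q \\
u(X)        &= \sum_{X \subseteq T_q \,\wedge\, T_q \in D} u(X, T_q) \\
tu(T_q)     &= \sum_{i_j \in T_q} u(i_j, T_q) \\
TWU(X)      &= \sum_{X \subseteq T_q \,\wedge\, T_q \in D} tu(T_q) \\
minUti      &= \delta \times \sum_{T_q \in D} tu(T_q)
\end{align}
```

Under these definitions, $TWU(X) \ge u(X)$ for every item set, which is why items whose TWU falls below minUti can be discarded safely.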
  • The HUIM problem can be defined as follows: given a transaction database D, the external utility values pt of all items, and the minimum utility threshold, the purpose of HUIM is to mine all itemsets in D whose utility value is not less than the minimum utility value minUti.
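
As an illustration of these quantities, the following minimal Python sketch computes the utility of an item set, the transaction utility, and the TWU. The toy database and profit values are hypothetical (they are not Table 1 or Table 2), and the function names are chosen only for this example.

```python
from typing import Dict, Iterable, List

Transaction = Dict[str, int]  # item -> purchase quantity (internal utility)

# Hypothetical toy data, NOT the sample database of the source.
transactions: List[Transaction] = [
    {"a": 1, "b": 2},
    {"a": 2, "c": 1},
    {"b": 1, "c": 3},
]
profit: Dict[str, int] = {"a": 5, "b": 2, "c": 1}  # external utility per item


def item_utility(item: str, t: Transaction) -> int:
    """Utility of one item in one transaction: quantity times unit profit."""
    return t[item] * profit[item]


def transaction_utility(t: Transaction) -> int:
    """tu(T_q): sum of the utilities of all items in the transaction."""
    return sum(item_utility(i, t) for i in t)


def utility(itemset: Iterable[str]) -> int:
    """u(X): utility of X summed over every transaction that contains X."""
    items = set(itemset)
    return sum(
        sum(item_utility(i, t) for i in items)
        for t in transactions
        if items.issubset(t)
    )


def twu(itemset: Iterable[str]) -> int:
    """TWU(X): sum of tu(T_q) over every transaction that contains X."""
    items = set(itemset)
    return sum(transaction_utility(t) for t in transactions if items.issubset(t))
```

With this toy data, `utility({"a"})` sums the utility of `a` over the first two transactions, while `twu({"a"})` sums the full utility of those two transactions, so the TWU is an upper bound on the utility.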
  • Example 1 Pattern data mining method HUIM-IGA based on improved genetic algorithm
  • the following multiple processes are used: individual repair processing, neighborhood exploration processing, population diversity maintenance processing, and elite processing to mine HUIs.
  • Steps 1-4 calculate the TWU value of each item, find all 1-HTWUIs, and sort the 1-HTWUIs by TWU value.
  • The length of each individual is determined by the number of 1-HTWUIs (Steps 1-4).
  • The population is randomly initialized using the roulette selection method (Step 6).
  • Each individual in the population is repaired in descending order of the TWU values of the 1-HTWUIs. If the repaired individual already belongs to the HUIs set, a solution in its neighborhood is explored; it is then judged whether the individual is a HUI, and if so, it is added to the HUIs set (Steps 8-18).
  • Algorithm 1 is as follows:
  • The data set is represented in the form of a bitmap.
  • For a transaction data set D composed of n transactions and v different items, its bitmap is an n × v Boolean matrix, denoted B(D).
  • Table 4 shows the bitmap representation of the sample database.
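
A minimal sketch of building such a bitmap in Python follows. It assumes the usual convention that entry (q, j) is 1 when item j occurs in transaction q; the toy transactions and the helper names `build_bitmap` and `column` are hypothetical and do not reproduce Table 4.

```python
from typing import List, Sequence, Set


def build_bitmap(transactions: Sequence[Set[str]], items: Sequence[str]) -> List[List[int]]:
    """Return an n x v 0/1 matrix: row q, column j is 1 if items[j] is in transaction q."""
    return [[1 if item in t else 0 for item in items] for t in transactions]


def column(bitmap: List[List[int]], j: int) -> List[int]:
    """Extract column j, i.e. the transactions in which item j appears."""
    return [row[j] for row in bitmap]


# Hypothetical toy data.
toy_transactions = [{"a", "b"}, {"a", "c"}, {"b", "c"}]
toy_items = ["a", "b", "c"]
toy_bitmap = build_bitmap(toy_transactions, toy_items)
```

Storing the data column-wise makes it cheap to test whether a combination of items ever occurs together: the columns are ANDed and the result is checked for any remaining 1.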
  • The transaction-weighted downward closure (TWDC) property is used to delete items that cannot be part of any HUI.
  • In this way the search space can be significantly reduced and the calculation speed improved.
  • Each chromosome represents an item set and is composed of 0s and 1s: 1 indicates that the item at the corresponding position is present, and 0 indicates that it is absent.
  • The chromosome is used as a potential solution, representing the item set {a, c, e, g}.
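
The encoding can be sketched in a few lines of Python. The item order `a` to `g` used below is an assumption made for illustration (the actual layout of Figure 1 is not reproduced here), and `encode`/`decode` are hypothetical helper names.

```python
from typing import Iterable, List, Sequence


def decode(chromosome: Sequence[int], items: Sequence[str]) -> List[str]:
    """Return the items whose position in the chromosome holds a 1."""
    return [item for bit, item in zip(chromosome, items) if bit == 1]


def encode(itemset: Iterable[str], items: Sequence[str]) -> List[int]:
    """Return the 0/1 chromosome that represents the given item set."""
    members = set(itemset)
    return [1 if item in members else 0 for item in items]


# Assumed item order for the example.
example_items = ["a", "b", "c", "d", "e", "f", "g"]
example_chromosome = [1, 0, 1, 0, 1, 0, 1]
```

Under this assumed order, the chromosome `1010101` decodes to the item set {a, c, e, g}.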
  • each individual is initialized according to the TWU value of each 1-HTWUIs, and the 1-HTWUIs with a high TWU value are more likely to be selected.
  • the pseudo code of population initialization is shown in Algorithm 2.
  • Each initial individual is first set to all 0s, and an integer k is then randomly generated as the number of non-zero positions in the individual (Steps 5-6).
  • Based on the TWU value of each 1-HTWUI, the roulette selection operator determines which items appear in the current individual (Steps 7-13). The newly generated individual is added to the population (Step 14).
  • Finally, the initialized population is returned (Step 17).
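
The initialization steps described above can be sketched as follows. This is an illustrative Python sketch, not the original Algorithm 2 listing: it assumes all TWU values are positive, that k is drawn uniformly from 1 to the chromosome length, and that a roulette draw landing on an already-set position is simply repeated. The names `roulette_pick` and `init_population` are hypothetical.

```python
import random
from typing import List, Optional, Sequence


def roulette_pick(weights: Sequence[float], rng: random.Random) -> int:
    """Pick an index with probability proportional to its weight."""
    threshold = rng.uniform(0, sum(weights))
    cumulative = 0.0
    for idx, w in enumerate(weights):
        cumulative += w
        if threshold <= cumulative:
            return idx
    return len(weights) - 1  # guard against floating-point round-off


def init_population(twu_values: Sequence[float], pop_size: int,
                    seed: Optional[int] = None) -> List[List[int]]:
    """Create pop_size individuals; position j corresponds to the j-th 1-HTWUI."""
    rng = random.Random(seed)
    n = len(twu_values)
    population = []
    for _ in range(pop_size):
        individual = [0] * n              # start from all zeros
        k = rng.randint(1, n)             # number of positions to set to 1
        ones = 0
        while ones < k:                   # assumption: redraw on duplicates
            idx = roulette_pick(twu_values, rng)
            if individual[idx] == 0:
                individual[idx] = 1
                ones += 1
        population.append(individual)
    return population
```

Because selection is proportional to TWU, items with a high TWU value appear in more of the initial individuals.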
  • Algorithm 2 is as follows:
  • The individual repair processing method performs sequential pruning based on the TWU values of the 1-HTWUIs, preserving as far as possible the combinations of items in the individual that may produce utility values. Its pseudocode is shown in Algorithm 3. Algorithm 3 first checks whether the individual to be repaired is all 0s (which means that the corresponding item set is empty; the utility value of the empty item set is obviously 0, so no repair is needed). If the individual is all 0s, the algorithm exits directly (Steps 2-3); otherwise, it enters the individual repair stage (Steps 4-28). The intermediate variable temp is initialized to the bitmap column corresponding to the 1-HTWUI with the largest TWU value (Steps 5-12).
  • temp is then ANDed with the bitmap column corresponding to each HTWUI to obtain temp'. If temp' consists of all 0s, the corresponding item is deleted from the current individual; otherwise, the operation continues until all bits of the individual to be repaired have been checked (Steps 13-28).
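
A minimal Python sketch of this repair procedure is given below. It is an interpretation rather than the original Algorithm 3: it assumes that the positions of the individual are already ordered by descending TWU, that temp starts from the first item present in the individual, and that temp is updated to temp' whenever temp' is not all zeros. The name `repair` and the toy columns are hypothetical.

```python
from typing import List, Optional, Sequence


def repair(individual: Sequence[int], bitmap_columns: Sequence[Sequence[int]]) -> List[int]:
    """Drop items that never co-occur with the items already kept.

    individual:      0/1 list, positions assumed sorted by descending TWU.
    bitmap_columns:  bitmap_columns[j] is the 0/1 column of item j.
    """
    repaired = list(individual)
    if not any(repaired):
        return repaired                    # empty item set: nothing to repair
    temp: Optional[List[int]] = None
    for j, bit in enumerate(repaired):
        if bit == 0:
            continue
        col = bitmap_columns[j]
        if temp is None:
            temp = list(col)               # first (largest-TWU) item present
            continue
        candidate = [a & b for a, b in zip(temp, col)]   # temp'
        if any(candidate):
            temp = candidate               # combination occurs in the data
        else:
            repaired[j] = 0                # combination never occurs: prune
    return repaired
```

For example, with columns `x = [1, 1, 0]`, `y = [1, 0, 1]`, and `z = [0, 1, 0]`, the individual `[1, 1, 1]` is repaired to `[1, 1, 0]`, because no transaction contains x, y, and z together.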
  • Algorithm 3 is as follows:
  • A neighborhood search processing method for repeated HUIs is adopted. A new individual generated by the genetic operations is not evaluated for fitness directly; instead, it is first checked whether the individual is already in the HUIs set. If it is, neighborhood mutation is applied to generate a new solution near that HUI, exploring its neighborhood space, and fitness evaluation is then performed on the new solution; otherwise, fitness evaluation is performed directly.
  • Figure 2 illustrates this neighborhood exploration processing method.
  • The dark circles represent the individuals corresponding to repeated HUIs in the population, and the light circles represent potential HUIs in the neighborhood of these repeated HUIs. All individuals with fitness values lower than minUti are marked as white circles.
  • The purpose of this method is to use these repeated HUIs to explore the potential HUIs in their neighborhood and to improve the local search ability of the algorithm in the region of optimal solutions.
  • Algorithm 4 describes this neighborhood mutation method. Assuming that the individual currently undergoing mutation is a HUI (in binary-coded form), presence_indexs stores the indices of the positions equal to 1 in the HUI (Step 1) and absence_indexs stores the indices of the positions equal to 0 (Step 2). A value m is randomly selected from presence_indexs (Step 3), and a value n is randomly selected from absence_indexs (Step 4). Finally, the m-th position of the HUI is set to 0 (Step 5) and the n-th position is set to 1 (Step 6).
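
The six steps can be sketched in Python as follows. The identifiers `presence_indexs` and `absence_indexs` follow the text; the function name `neighborhood_mutation` and the handling of an individual that is all 1s or all 0s (returned unchanged) are assumptions for this sketch.

```python
import random
from typing import List, Optional, Sequence


def neighborhood_mutation(hui: Sequence[int],
                          rng: Optional[random.Random] = None) -> List[int]:
    """Swap one present item for one absent item to get a neighbor of the HUI."""
    rng = rng or random.Random()
    presence_indexs = [i for i, bit in enumerate(hui) if bit == 1]  # Step 1
    absence_indexs = [i for i, bit in enumerate(hui) if bit == 0]   # Step 2
    neighbor = list(hui)
    if not presence_indexs or not absence_indexs:
        return neighbor              # assumption: no swap is possible
    m = rng.choice(presence_indexs)  # Step 3
    n = rng.choice(absence_indexs)   # Step 4
    neighbor[m] = 0                  # Step 5
    neighbor[n] = 1                  # Step 6
    return neighbor
```

The neighbor keeps the same number of items as the original HUI and differs from it in exactly two positions, so the search stays close to a region already known to contain a high-utility itemset.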
  • Algorithm 4 is as follows:
  • the high-utility itemset mining problem needs to find all itemsets that meet the minimum utility value, which means that the number of final solutions may be far more than one. Since the distribution of HUIs in the solution space is not uniform, it is easy to miss some solutions only by searching in the direction of the best individual of the previous generation. At the same time, with the rapid decline of population diversity, the search space is restricted, and the algorithm is also easy to fall into a local optimum prematurely.
  • this embodiment adopts a method for maintaining population diversity.
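
The diversity maintenance step, in which two consecutively stored HUIs replace two randomly selected individuals of the current population in each evolution, might look like the sketch below. It assumes that "consecutively stored" means adjacent in the order in which HUIs were discovered and that the starting position is chosen at random; the source does not specify either detail. The name `maintain_diversity` is hypothetical.

```python
import random
from typing import List, Optional, Sequence


def maintain_diversity(population: Sequence[Sequence[int]],
                       hui_list: Sequence[Sequence[int]],
                       rng: Optional[random.Random] = None) -> List[List[int]]:
    """Replace two random individuals with two adjacent entries of hui_list."""
    rng = rng or random.Random()
    new_population = [list(ind) for ind in population]
    if len(hui_list) < 2 or len(new_population) < 2:
        return new_population                      # nothing to inject yet
    start = rng.randrange(len(hui_list) - 1)       # assumed: random start position
    pair = hui_list[start:start + 2]               # two consecutively stored HUIs
    targets = rng.sample(range(len(new_population)), 2)
    for target, hui in zip(targets, pair):
        new_population[target] = list(hui)
    return new_population
```

Injecting previously found HUIs gives the genetic operators fresh starting points in regions that are known to be productive, which counteracts the loss of diversity as the population converges.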
  • The HUIM-IGA algorithm also introduces the elite method. As shown in Figure 4, the previous-generation population P_t is first merged with the current population P_{t+1}, duplicate individuals in the merged population are deleted, the remaining individuals are sorted by utility value in descending order, and the individuals with larger utility values are selected to form the next-generation population Q.
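
The elite step (merge, remove duplicates, sort by utility, keep the best) can be sketched in Python as below. The function name `elite_selection`, the `fitness` callback, and the `size` parameter are hypothetical; the source does not state how ties are broken, so the sketch simply keeps first-seen order among equal utilities.

```python
from typing import Callable, List, Sequence, Tuple


def elite_selection(prev_pop: Sequence[Sequence[int]],
                    curr_pop: Sequence[Sequence[int]],
                    fitness: Callable[[Tuple[int, ...]], float],
                    size: int) -> List[List[int]]:
    """Merge P_t and P_{t+1}, drop duplicates, keep the `size` fittest individuals."""
    seen = {}
    for ind in list(prev_pop) + list(curr_pop):
        seen.setdefault(tuple(ind), None)          # dict keeps first-seen order
    unique = sorted(seen, key=fitness, reverse=True)   # stable sort, best first
    return [list(ind) for ind in unique[:size]]
```

Here `fitness` stands in for the utility value of the item set encoded by the individual; in a complete implementation it would be computed from the transaction database.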
  • Example 2 Application example of improved genetic algorithm pattern data mining method
  • The mined 1-HTWUIs are shown in Table 3, the chromosome length is 6 (determined by the number of 1-HTWUIs), and Table 4 shows the bitmap representation of the sample database.
  • TWU(d) < minUti, so the corresponding data column can be deleted.
  • The TWU values of the 1-HTWUIs are sorted in descending order, giving {c:206, f:182, a:170, g:170, b:161, e:146}.
  • The binary codes of the three individuals in the initial population obtained by Algorithm 2 are 110111, 110110, and 001100, respectively.
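
To make the role of these TWU values concrete, the short Python sketch below sorts the 1-HTWUIs and derives roulette selection probabilities from them. It assumes the common roulette-wheel convention that the selection probability of an item is its TWU divided by the total TWU; the source does not spell out the exact formula.

```python
# TWU values of the 1-HTWUIs as listed in the example.
twu_values = {"c": 206, "f": 182, "a": 170, "g": 170, "b": 161, "e": 146}

# Sort by TWU in descending order (Python's sort is stable, so ties keep their order).
ranked = sorted(twu_values.items(), key=lambda kv: kv[1], reverse=True)

# Assumed roulette-wheel probability: TWU of the item divided by the total TWU.
total_twu = sum(twu_values.values())
probabilities = {item: value / total_twu for item, value in ranked}
```

The total TWU here is 1035, so item `c` would be selected with probability 206/1035 (about 0.20) and item `e` with probability 146/1035 (about 0.14).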
  • Example 3 Performance test of pattern data mining method based on improved genetic algorithm
  • The population diversity maintenance strategy is removed from the original algorithm (HUIM-IGA in Example 1), and the resulting variant is marked as HUIM-IGA-DMS.
  • Likewise, the elite strategy is removed from the original algorithm, the resulting variant is marked as HUIM-IGA-ES, and experiments are performed on the data set.
  • HUIM-IGA, HUIM-IGA-DMS, and HUIM-IGA-ES are each run independently 10 times.
  • In the early stage of evolution, HUIM-IGA and HUIM-IGA-DMS have similar convergence speeds, but in the middle and late stages HUIM-IGA converges significantly faster than HUIM-IGA-DMS.
  • HUIM-IGA basically converges after only 15,000 fitness function evaluations, whereas HUIM-IGA-DMS, which lacks the population diversity maintenance strategy, does not converge until the end of 60,000 fitness function evaluations. The reason is that in the early stage of evolution the population is highly diverse and the algorithm can easily find HUIs, so the two converge at similar speeds at this stage.
  • As evolution proceeds, the diversity of the population is gradually lost and the exploration ability of the algorithm weakens, so the convergence speed of HUIM-IGA-DMS gradually decreases.
  • HUIM-IGA, with its population diversity maintenance strategy, always retains the ability to explore new HUIs. Its convergence speed is therefore faster than that of HUIM-IGA-DMS in the middle and late stages of evolution, which means that HUIM-IGA can mine HUIs at a faster speed.
  • The convergence speed of the Bio-HUIF-GA algorithm and the Bio-HUIF-PSO algorithm is slower than that of the HUIM-IGA algorithm of Example 1.
  • The reason is that identical HUIs are evaluated only once in HUIM-IGA, which saves fitness evaluations and thereby improves the convergence speed.
  • The convergence speed gradually decreases in the later stages of evolution.
  • The HUIM-BPSO algorithm performs better than the HUIM-BPSOsig algorithm and the HUPEumu-GARM algorithm, mainly because it adopts an OR/NOR-tree structure, which avoids generating invalid combinations during evolution and accelerates convergence.
  • The HUPEumu-GARM algorithm has the slowest convergence speed, mainly because the standard genetic algorithm lacks an effective search strategy for the HUIM problem, so the algorithm struggles to converge within a limited number of evaluations.
  • The number of HUIs mined by each algorithm is analyzed on four real data sets, with experiments conducted under different minimum utility thresholds. Since the IHUP, UP-Growth, and UP-Hist Growth algorithms can mine the complete set of HUIs from a data set, they are used here to determine the true number of HUIs under different minimum utility thresholds on the different data sets.
  • Table 6 to Table 7 show the best, worst, and average values of the ratio of the number of HUIs mined to the actual number of HUIs in each data set, under the corresponding minimum utility threshold, when each algorithm is run 10 times independently.
  • The Bio-HUIF-GA algorithm and the Bio-HUIF-PSO algorithm need a higher minimum utility threshold to mine all HUIs. The smaller the minimum utility threshold, the greater the number of HUIs that meet the condition, and without an effective search strategy it becomes more difficult to mine them.
  • For the same reason, the Bio-HUIF-GA and Bio-HUIF-PSO algorithms cannot guarantee that 100% of the HUIs are mined when the minimum utility threshold is small.
  • The HUIM-BPSO algorithm can avoid generating invalid combinations and speeds up the mining of HUIs, so it performs better than the HUIM-BPSOsig algorithm.
  • The average accuracy of the HUPEumu-GARM algorithm on each data set is basically less than 20%. This is because the iterative evolution of the algorithm always searches toward the outstanding individuals of the previous generation; for a problem such as HUIM, which has many solutions, solutions are easily lost, and it is difficult for the algorithm to find all HUIs within a limited number of evaluations.
  • The UP-Growth and IHUP algorithms run out of memory on some data sets (for example, the Chess and Connect data sets). In addition, their running time on the Mushroom data set and the Accident_10% data set is longer than that of the HUIM-IGA, Bio-HUIF-GA, and Bio-HUIF-PSO algorithms. On multiple data sets the UP-Hist Growth algorithm performs better than the other algorithms, but the overall performance of HUIM-IGA is better than that of the UP-Hist Growth algorithm.
  • The HUIM-BPSO, HUIM-BPSOsig, and HUPEumu-GARM algorithms take more than 3 hours to mine the complete set of HUIs on most data sets.
  • The experimental results on real data sets show that, compared with state-of-the-art EC-based HUIM algorithms, the HUIM-IGA method performs better in terms of convergence speed, ability to discover HUIs, and running time.
  • Example 4: High-utility itemset mining device
  • An itemset mining device, including:
  • a first calculation module, a second calculation module, a neighborhood exploration module, a population diversity maintenance module, a population selection and crossover module, an elite module, a high-utility itemset determination module, and a decoding module.
  • The first calculation module calculates the TWU value of each item in the transaction database, finds all 1-HTWUIs, and sorts the 1-HTWUIs by TWU value.
  • The length of each individual is determined by the number of 1-HTWUIs.
  • The population is randomly initialized by roulette-wheel selection based on the TWU values of the items.
  • The individual repair module prunes sequentially according to the TWU values of the 1-HTWUIs, preserving as far as possible the combinations of items in the individual that may yield utility. Specifically, it first determines whether the individual to be repaired is an invalid itemset; if it is not, the individual repair stage is entered. The individual repair stage includes: initializing the intermediate variable temp to the bitmap column of the 1-HTWUI with the largest TWU value; then, in descending order of TWU value, ANDing temp with the bitmap column of each HTWUI to obtain temp'. If temp' consists entirely of 0s, the corresponding item is deleted from the current individual; otherwise the operation continues until every bit of the individual to be repaired has been checked.
  • The neighborhood exploration module performs neighborhood mutation to generate a new individual when the repaired individual already belongs to the set of HUIs. Specifically, assuming the individual that has just undergone mutation is an HUI (in binary-encoded form), presence_indexs stores the indices of the positions that are 1 in the HUI and absence_indexs stores the indices of the positions that are 0. A value m is first selected at random from presence_indexs, and a value n is then selected at random from absence_indexs; finally, position m of the HUI is set to 0 and position n is set to 1.
  • The population diversity maintenance module selects, in each evolution step, two consecutively stored HUIs from the set of HUIs to replace two randomly selected individuals in the current population.
  • The population selection and crossover module performs selection and crossover on the population.
  • The elite module merges the previous-generation population P_t with the current population P_{t+1}, deletes duplicate individuals from the merged population, sorts the rest by utility value in descending order, and selects the individuals with the larger utility values to form the next-generation population Q.
  • The second calculation module calculates the utility value of an individual (that is, an itemset); the high-utility itemset determination module determines an itemset to be a high-utility itemset (HUI) when the utility value of the itemset corresponding to the individual is ≥ the minimum utility value; the decoding module decodes all mined HUIs, which are in binary-encoded form.
  • Optionally, the itemset mining device may implement the following algorithm:

Abstract

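A pattern data mining method based on an improved genetic algorithm, belonging to the technical field of data mining. After population initialization, the method uses one or more of the following processes to mine HUIs: individual repair, neighborhood exploration, population diversity maintenance, and elite processing. Experimental results on four real data sets show that, compared with the current state-of-the-art EC-based HUIM algorithms, the proposed HUIM-IGA method performs better in the number of HUIs discovered, the ability to discover HUIs, and running time. The method can be applied to the transaction databases common in everyday applications, such as sales transaction databases, where it performs better in the number of high-utility itemsets discovered, the ability to discover them, and running time.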

Description

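A pattern data mining method based on an improved genetic algorithm
Technical Field
The invention relates to a pattern data mining method based on an improved genetic algorithm and belongs to the technical field of data mining.
Background Art
Data mining is the process of extracting potentially interesting information or patterns from large amounts of data for further use. An itemset is a set formed from at least one item of a transaction database. A transaction database is a database that records transactions such as sales or news events. It usually holds at least one transaction, and each transaction contains at least one data item. In market-basket data, the data item of a product corresponds to the product name, and one transaction record may contain the data items of several products. Because sales-type transaction databases tend to reflect user preferences, the itemsets recommended to a user are often mined from the itemsets formed over such a database.
Frequent itemset mining (FIM) has long been an important research direction in data mining. It is the process of mining from a database the itemsets whose frequency of occurrence is not lower than a user-specified threshold. Although FIM finds the itemsets that occur frequently in a transaction database, it does not consider their utility. In practice, itemsets with high utility (high-utility itemsets, HUIs) often need to be considered. The goal of high-utility itemset (pattern) mining is to find all itemsets in a transaction database whose utility is not less than a specified minimum utility value.
High-utility itemset mining (HUIM) is a typical combinatorial optimization problem. Evolutionary computation (EC) is an effective stochastic optimization approach that is inspired by natural evolution, uses its principles to search for optimal solutions, and has been applied to many combinatorial optimization problems. To explore the huge search space of the HUIM problem, EC-based methods such as genetic algorithms (GA), particle swarm optimization (PSO), ant colony optimization (ACO) and artificial bee colony algorithms have been introduced.
However, EC-based HUIM algorithms are usually very time-consuming when mining all HUIs that satisfy the minimum utility threshold. There are two main reasons. First, HUIM differs from a traditional optimization problem in that every itemset satisfying the minimum utility threshold should be discovered, whereas the original EC methods always search in the direction of the best individual of the previous generation, which may miss some results during iteration. Second, EC-based HUIM algorithms are inefficient at finding new HUIs.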
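Summary of the Invention
To solve at least one technical problem of the prior art, the invention uses an improved genetic algorithm to mine HUIs efficiently and provides HUIM-IGA, a pattern data mining method based on an improved genetic algorithm. The method improves the efficiency of discovering HUIs and shortens the total time needed to mine all HUIs.
The pattern data mining method of the invention is an improvement on the genetic algorithm (GA) and includes:
representing the original database in the form of a bitmap;
initializing the population;
performing individual repair processing on the population;
obtaining the set of HUIs.
In one embodiment, the database is any database that can record transactions such as sales or news events. The database usually holds at least one transaction, and each transaction contains at least one data item.
Optionally, the database is a market-basket database. In market-basket data, the data item of a product corresponds to the product name, and one transaction record may contain the data items of several products.
In one embodiment, initializing the population includes: initializing each individual according to the TWU value (the transaction-weighted utilization in the data set) of each 1-HTWUI (high-transaction-weighted utilization 1-itemset).
Specifically, the population is initialized as follows: every bit of each initial individual is first set to 0, and an integer k is then generated at random as the number of non-zero positions in that individual; a roulette-wheel selection operator driven by the TWU value of each 1-HTWUI decides which items appear in the current individual; the new individual is added to the (initial) population; finally the initialized population is returned.
In one embodiment, the individual repair processing on the population includes: pruning sequentially according to the TWU values of the 1-HTWUIs, so that combinations of items in the individual that may yield utility are preserved as far as possible.
Specifically, the individual repair processing first checks whether the individual to be repaired is all 0s. If it is, the procedure exits directly; otherwise the individual repair stage is entered.
The individual repair stage includes: initializing an intermediate variable temp to the bitmap column of the 1-HTWUI with the largest TWU value; then, in descending order of TWU value, ANDing temp with the bitmap column of each HTWUI to obtain temp'. If temp' consists entirely of 0s, the corresponding item is deleted from the current individual; otherwise the operation continues until every bit of the individual to be repaired has been checked.
After the individual repair processing, the fitness value f(x) of each individual x is computed. (In a genetic algorithm, fitness is the main indicator of an individual's performance; individuals are evaluated by their fitness so that the fitter ones survive. Here the fitness value is the utility of the itemset that the individual represents, i.e. f(x) = utility(x).) If the fitness value is not less than the minimum utility value (minUti), x is added to the set of HUIs.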
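In one embodiment, the pattern data mining method further includes:
performing neighborhood exploration processing on duplicate HUIs in the population.
In one embodiment, the neighborhood exploration processing on duplicate HUIs includes: for a new individual produced by the genetic operations, fitness is not evaluated directly. Instead, the individual is first checked against the set of HUIs. If it is already in the set, neighborhood mutation is applied to produce a new solution near that HUI, so that solutions in its neighborhood are explored, and fitness is then evaluated; otherwise fitness is evaluated directly.
Specifically, the neighborhood exploration processing is as follows: assuming the individual that has just undergone mutation is an HUI (in binary-encoded form), presence_indexs stores the indices of the positions that are 1 in the HUI (Step 1) and absence_indexs stores the indices of the positions that are 0 (Step 2). A value m is first selected at random from presence_indexs (Step 3), and a value n is then selected at random from absence_indexs (Step 4). Finally, position m of the HUI is set to 0 (Step 5) and position n is set to 1 (Step 6).
In one embodiment, the method further includes:
performing population diversity maintenance processing.
The population diversity maintenance processing includes: in each evolution step, selecting two consecutively stored HUIs from the set of HUIs to replace two randomly selected individuals in the current population.
In one embodiment, the method further includes:
performing elite processing.
The elite processing includes: merging the previous-generation population P_t with the current population P_{t+1}, deleting duplicate individuals from the merged population, sorting the rest by utility value in descending order, and selecting the individuals with the larger utility values to form the next-generation population Q.
In one embodiment, the pattern data mining method, which is an improvement on the genetic algorithm (GA), includes:
First, the TWU value of each item is computed, all 1-HTWUIs are found and sorted by TWU value, and the length of each individual is set to the number of 1-HTWUIs (Steps 1-4). The population is then randomly initialized by roulette-wheel selection based on the TWU values of the items (Step 6). In the evolution phase, each individual in the population is repaired in descending order of the TWU values of the 1-HTWUIs; if the repaired individual already belongs to the set of HUIs, solutions in its neighborhood are explored; the individual is then tested, and if it is an HUI it is added to the set of HUIs (Steps 8-18). Next, the population diversity maintenance strategy is applied to avoid losing HUIs because the algorithm has become trapped in a local optimum (Steps 19-27). Selection and crossover are then performed on the current population (Steps 28-29). The repair strategy is applied to the current population in the same way as in Steps 8-17 (Steps 30-38), followed by mutation (Step 39). The elite strategy is then executed to retain the individuals with the larger utility values (Steps 40-41). Finally, all mined HUIs, which are in binary-encoded form, are decoded and returned (Steps 43-44).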
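The invention further provides a method for recommending products to a user, which includes mining high-utility itemsets with the data mining method of the invention and recommending products to the user according to the information mined from the high-utility itemsets.
The invention further provides a product packaging method, which includes mining the high-utility itemsets in market-basket data with the data mining method of the invention and packaging several products from one high-utility itemset together.
The invention further provides an itemset mining device, which includes:
a first calculation module for calculating the TWU value (transaction-weighted utilization) of each item;
a second calculation module for calculating the utility value of an individual (that is, an itemset);
an individual repair module for pruning sequentially according to the TWU values of the 1-HTWUIs, preserving as far as possible the combinations of items in the individual that may yield utility;
a high-utility itemset determination module for determining an itemset to be a high-utility itemset (HUI) when the utility value of the itemset corresponding to the individual is ≥ the minimum utility value.
In one embodiment, the itemset mining device further includes:
a neighborhood exploration module for performing neighborhood mutation on an individual to generate a new individual when the repaired individual already belongs to the set of HUIs.
In one embodiment, the itemset mining device further includes:
a population diversity maintenance module for selecting, in each evolution step, two consecutively stored HUIs from the set of HUIs to replace two randomly selected individuals in the current population.
In one embodiment, the itemset mining device further includes:
a population selection and crossover module for performing selection and crossover on the population.
In one embodiment, the itemset mining device further includes:
an elite module for merging the previous-generation population P_t with the current population P_{t+1}, deleting duplicate individuals from the merged population, sorting the rest by utility value in descending order, and selecting the individuals with the larger utility values to form the next-generation population Q.
In one embodiment, the itemset mining device further includes: a decoding module for decoding all mined HUIs, which are in binary-encoded form.
The invention further provides a data processing apparatus that contains the itemset mining device of the invention.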
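Beneficial effects of the invention:
(1) After population initialization, the method of the invention uses one or more of the following processes to mine HUIs: individual repair, neighborhood exploration, population diversity maintenance, and elite processing.
In the population initialization stage, to mine HUIs more effectively, the data set is represented as a bitmap and individuals are initialized according to the TWU value (the transaction-weighted utilization in the data set) of each 1-HTWUI (high-transaction-weighted utilization 1-itemset).
Further, the individual repair processing, which is ordered by the transaction-weighted utilization (TWU) of the 1-HTWUIs, ensures that a repaired individual is a combination that actually occurs in the data set, and at the same time keeps the repair strategy from destroying good individuals.
Further, the neighborhood exploration processing for duplicate HUIs makes sensible use of those duplicates to improve search efficiency. Specifically, a duplicate HUI is replaced by another solution from its neighborhood in the search space, which strengthens local search around good solutions and speeds up the search for new HUIs.
Further, the new population diversity maintenance processing effectively widens the search space of high-quality solutions, avoids missing HUIs because the algorithm falls into a local optimum prematurely, and reduces the number of HUIs lost during the search.
Further, the elite processing prevents the loss of high-quality itemsets.
(2) Experimental results on four real data sets show that, compared with the current state-of-the-art EC-based HUIM algorithms, the proposed HUIM-IGA method performs better in the number of HUIs discovered, the ability to discover HUIs, and running time.
(3) The method of the invention can process the transaction databases common in everyday applications, such as sales transaction databases, and performs better in the number of high-utility itemsets discovered, the ability to discover them, and running time.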
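Brief Description of the Drawings
Figure 1: example encoding of a chromosome;
Figure 2: schematic of the neighborhood exploration processing for duplicate HUIs;
Figure 3: schematic of the population diversity maintenance processing;
Figure 4: schematic of the elite processing;
Figure 5: schematic of neighborhood mutation;
Figure 6: effect of the diversity maintenance processing on convergence speed;
Figure 7: effect of the elite processing on convergence speed;
Figure 8: comparison of the convergence speed of the algorithms (Chess);
Figure 9: comparison of the convergence speed of the algorithms (Mushroom);
Figure 10: comparison of the convergence speed of the algorithms (Accident_10%);
Figure 11: comparison of the convergence speed of the algorithms (Connect).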
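Detailed Description
I. Terminology and background
1. Genetic algorithm (GA)
The genetic algorithm is a metaheuristic optimization method inspired by natural selection and is widely used to solve NP-hard problems. In a genetic algorithm, a number of individuals form a population, and each individual represents a potential solution. The algorithm starts from an initial population of potential solutions and applies three genetic operations (crossover, mutation and selection) to the chromosomes to produce the next generation. The genetic operators are applied repeatedly until a stopping condition is met, and the best solution is then output. The main evolutionary operators are:
(1) Selection operator: selects suitable individuals from the population. Under a predefined rule, better individuals are more likely to be selected and to survive, and worse individuals are less likely to enter the next generation.
(2) Crossover operator: two individuals chosen by the selection operator recombine their genes, forming new individuals by exchanging part of their genes. The offspring inherit some characteristics from both parents.
(3) Mutation operator: changes one or more genes of an individual with a certain probability, giving the offspring genes that differ from those of the parents. Mutation helps to maintain population diversity and increases the chance of reaching the global optimum.
2. High-utility itemset mining (HUIM)
The goal of HUIM is to find, in a data set, the combinations of items whose utility is not less than a user-specified threshold. The itemsets found can help decision makers such as store managers to design sound and effective sales strategies. As an extension of the FIM problem, HUIM considers both the quantity and the weight of items. Its formal definition and mathematical model were first given by Yao et al. Liu et al. first proposed the TWU model in the Two-Phase algorithm and used the transaction-weighted downward closure (TWDC) property to prune the search space. The Two-Phase algorithm generates a large number of candidate itemsets in its second phase. To address this, Li et al. proposed the IIDS strategy, which reduces the number of candidate itemsets in Two-Phase. The IHUP algorithm proposed by Ahmed et al. does not need to scan the data set many times. Tseng et al. then improved on IHUP and proposed the UP-tree structure together with UP-Growth and UP-Growth+ for discovering HUIs. Liu et al. proposed the HUI-Miners algorithm, which converts the original database into a list structure, avoiding candidate generation and repeated scans of the data set and so improving mining efficiency.
Unlike these traditional HUIM algorithms, EC-based HUIM algorithms have also been used to mine HUIs. Kannimuthu et al. were the first to propose a genetic-algorithm-based approach, the HUPEumu-GARM algorithm, which improves the mutation operator by using rank-based mutation and adapts the mutation probability as evolution proceeds. Lin et al. proposed the HUIM-BPSOsig algorithm based on discrete PSO, in which the length of a particle is determined by the number of high-transaction-weighted utilization 1-itemsets (1-HTWUIs); this effectively reduces the search space and improves efficiency. Lin et al. also designed an OR/NOR-tree structure for generating valid itemsets and proposed the HUIM-BPSO algorithm, which reduces the generation of invalid combinations and further improves mining performance. More recently, Song et al. proposed an HUIM framework containing algorithms such as Bio-HUIF-GA and Bio-HUIF-PSO to mine as many HUIs as possible.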
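3. Preliminaries and problem statement
(1) Preliminaries
Let I = {i_1, i_2, …, i_v} be the set of v distinct items appearing in the transaction data set D = {T_1, T_2, …, T_n}, where n = |D| is the total number of transactions in D. Each transaction T_q ∈ D (1 ≤ q ≤ n) is a subset of I consisting of several items and is labeled with a unique identifier TID. Each item in a transaction T_q has a purchase quantity (internal utility), denoted q(i_j, T_q) (1 ≤ j ≤ v, 1 ≤ q ≤ n), and each item in I has an external utility p(i_j), which represents the profit of the product. An itemset (or pattern) X = {i_1, i_2, …, i_k} (1 ≤ k ≤ v) is a non-empty subset of I. Table 1 defines a quantitative database, and Table 2 defines the profit of each item in the database.
Table 1. Example database.
Figure PCTCN2019121475-appb-000001
Table 2. Profit table.
Figure PCTCN2019121475-appb-000002
The utility of item i_j in transaction T_q, denoted u(i_j, T_q), is defined as:
u(i_j, T_q) = q(i_j, T_q) \times p(i_j)    (1)
For example, in Table 1 and Table 2,
u(a, T_1) = q(a, T_1) × p(a) = 1 × 2 = 2,
u(d, T_2) = q(d, T_2) × p(d) = 2 × 4 = 8.
The utility of itemset X in transaction T_q, denoted u(X, T_q), is defined as:
u(X, T_q) = \sum_{i_j \in X \wedge X \subseteq T_q} u(i_j, T_q)    (2)
For example,
u({b,f}, T_6) = u(b, T_6) + u(f, T_6) = q(b, T_6) × p(b) + q(f, T_6) × p(f) = 2 × 7 + 2 × 5 = 24,
u({f,g}, T_8) = u(f, T_8) + u(g, T_8) = q(f, T_8) × p(f) + q(g, T_8) × p(g) = 2 × 5 + 1 × 13 = 23.
The utility of itemset X in data set D, denoted u(X), is defined as:
u(X) = \sum_{X \subseteq T_q \wedge T_q \in D} u(X, T_q)    (3)
For example,
u({a,c,e}) = u({a,c,e}, T_1) + u({a,c,e}, T_3) = u(a, T_1) + u(c, T_1) + u(e, T_1) + u(a, T_3) + u(c, T_3) + u(e, T_3) = 1 × 2 + 2 × 3 + 2 × 9 + 1 × 2 + 1 × 3 + 1 × 9 = 40,
u({d,f}) = u({d,f}, T_2) + u({d,f}, T_4) + u({d,f}, T_9) = u(d, T_2) + u(f, T_2) + u(d, T_4) + u(f, T_4) + u(d, T_9) + u(f, T_9) = 2 × 4 + 3 × 5 + 2 × 4 + 1 × 5 + 2 × 4 + 1 × 5 = 49.
The utility of transaction T_q, denoted tu(T_q), is defined as:
tu(T_q) = \sum_{i_j \in T_q} u(i_j, T_q)    (4)
For example,
tu(T_2) = u(c, T_2) + u(d, T_2) + u(f, T_2) = 1 × 3 + 2 × 4 + 3 × 5 = 26,
tu(T_7) = u(c, T_7) + u(e, T_7) = 4 × 3 + 1 × 9 = 21.
The transaction-weighted utilization of itemset X in data set D, denoted TWU(X), is defined as:
TWU(X) = \sum_{X \subseteq T_q \wedge T_q \in D} tu(T_q)    (5)
For example,
TWU({e,g}) = tu(T_1) + tu(T_3) = 52 + 41 = 93, and TWU({f}) = tu(T_2) + tu(T_4) + tu(T_6) + tu(T_8) + tu(T_9) + tu(T_10) = 26 + 34 + 52 + 25 + 13 + 32 = 182.
The user sets the minimum utility threshold δ according to preference. If the utility of itemset X is not less than the minimum utility value derived from it, X is a high-utility itemset. The minimum utility value minUti is defined as:
minUti = \delta \times \sum_{T_q \in D} tu(T_q)    (6)
For example, if the minimum utility threshold δ is set to 25%, then
minUti = (52 + 26 + 41 + 34 + 34 + 52 + 21 + 25 + 13 + 32) × 25% = 330 × 25% = 82.5.
Because u({c,e}) = 84 > minUti = 82.5, {c,e} is a high-utility itemset; because u({b,d,f}) = 34 < minUti = 82.5, {b,d,f} is not.
If itemset X satisfies TWU(X) ≥ minUti, X is called a high-transaction-weighted utilization itemset (HTWUI). For example, in Table 1 and Table 2, with a minimum utility value of 330 × 25% = 82.5, the 1-HTWUIs mined are shown in Table 3.
Table 3. 1-HTWUIs of the example database for minUti = 82.5.
Figure PCTCN2019121475-appb-000008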
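The TWU and minUti definitions can be checked numerically with a short Python sketch. It is only an illustration: the transaction utilities are the ten values summed in the minUti example, and the bit columns are the ones quoted in the worked repair example of Example 2 (leftmost bit taken to be T_1, which is an assumption about the bit order). The columns for items c and d are not quoted in the text, so they are left out rather than guessed. The function names are hypothetical.

```python
# Transaction utilities tu(T_1)..tu(T_10), as summed in the minUti example.
TU = [52, 26, 41, 34, 34, 52, 21, 25, 13, 32]

# Bit columns quoted in the worked example of Example 2 (assumed: leftmost bit = T_1).
# Columns for items c and d are not given in the text and are omitted here.
BIT = {
    "a": "1010010100",
    "b": "0011110000",
    "e": "1010001001",
    "f": "0101010111",
    "g": "1010010100",
}


def min_utility(tu, delta):
    """minUti = delta * sum of all transaction utilities (equation 6)."""
    return delta * sum(tu)


def twu(column, tu):
    """TWU of a single item: sum of tu(T_q) over transactions containing it."""
    return sum(t for bit, t in zip(column, tu) if bit == "1")


def one_htwuis(bit_columns, tu, delta):
    """Items whose TWU is at least minUti, sorted by descending TWU."""
    threshold = min_utility(tu, delta)
    scored = [(item, twu(col, tu)) for item, col in bit_columns.items()]
    kept = [pair for pair in scored if pair[1] >= threshold]
    # sorted() is stable, so items with equal TWU keep their input order.
    return sorted(kept, key=lambda pair: -pair[1])


print(min_utility(TU, 0.25))
print(one_htwuis(BIT, TU, 0.25))
```

For the five items covered, the values agree with those quoted in the text (f: 182, a: 170, g: 170, b: 161, e: 146), all of which exceed minUti = 82.5.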
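(2) Problem statement
Based on the definitions above, the HUIM problem can be stated as follows: given a transaction database D, the external utility values of all items (the profit table pt), and a minimum utility threshold δ, the goal of HUIM is to mine all itemsets in D whose utility is not less than the minimum utility minUti.
Example 1: HUIM-IGA, a pattern data mining method based on an improved genetic algorithm
After population initialization, the method of this example uses the following processes to mine HUIs: individual repair, neighborhood exploration, population diversity maintenance, and elite processing.
The HUIM-IGA algorithm used by the pattern data mining method of this example is shown in Algorithm 1.
First, the TWU value of each item is computed, all 1-HTWUIs are found and sorted by TWU value, and the length of each individual is set to the number of 1-HTWUIs (Steps 1-4). The population is then randomly initialized by roulette-wheel selection based on the TWU values of the items (Step 6). In the evolution phase, each individual in the population is repaired in descending order of the TWU values of the 1-HTWUIs; if the repaired individual already belongs to the set of HUIs, solutions in its neighborhood are explored; the individual is then tested, and if it is an HUI it is added to the set of HUIs (Steps 8-18). Next, the population diversity maintenance strategy is applied to avoid losing HUIs because the algorithm has become trapped in a local optimum (Steps 19-27). Selection and crossover are then performed on the current population (Steps 28-29). The repair strategy is applied to the current population in the same way as in Steps 8-17 (Steps 30-38), followed by mutation (Step 39). The elite strategy is then executed to retain the individuals with the larger utility values (Steps 40-41). Finally, all mined HUIs, which are in binary-encoded form, are decoded and returned (Steps 43-44).
Algorithm 1 is as follows:
Figure PCTCN2019121475-appb-000009
Figure PCTCN2019121475-appb-000010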
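The following describes population initialization, individual repair, neighborhood exploration, population diversity maintenance and elite processing in the pattern data mining method based on the improved genetic algorithm.
(1) Population initialization
In the HUIM-IGA algorithm of this example, the data set is represented as a bitmap. For a transaction data set D consisting of n transactions and v distinct items, the bitmap is an n × v Boolean matrix, denoted B(D). The entry in row q (1 ≤ q ≤ n) and column j (1 ≤ j ≤ v) of B(D), written B_{q,j}, is computed as:
B_{q,j} = 1 if i_j \in T_q, and B_{q,j} = 0 otherwise.
Table 4 gives the bitmap representation of the example database.
Table 4. Bitmap representation of the example database
Figure PCTCN2019121475-appb-000012
Figure PCTCN2019121475-appb-000013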
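A minimal Python sketch of the bitmap construction follows. The three-transaction database below is hypothetical and chosen only to keep the example small; it is not Table 1, whose contents are not reproduced in the text. The helper names are likewise assumptions.

```python
def build_bitmap(transactions, items):
    """Return the n x v Boolean matrix B(D): B[q][j] = 1 iff items[j] is in transactions[q]."""
    return [[1 if item in transaction else 0 for item in items]
            for transaction in transactions]


def column(bitmap, j):
    """Column j of the bitmap: one bit per transaction for item j."""
    return [row[j] for row in bitmap]


def and_columns(first, second):
    """Bitwise AND of two columns: marks the transactions that contain both items."""
    return [x & y for x, y in zip(first, second)]


# Hypothetical toy database (not the patent's Table 1).
ITEMS = ["a", "b", "c"]
TRANSACTIONS = [{"a", "c"}, {"b"}, {"a", "b", "c"}]

BITMAP = build_bitmap(TRANSACTIONS, ITEMS)
print(BITMAP)
print(and_columns(column(BITMAP, 0), column(BITMAP, 1)))
```

An all-zero AND result means the two items never occur together in any transaction, which is the property the repair step relies on.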
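The proposed HUIM-IGA algorithm uses the transaction-weighted downward closure (TWDC) property to delete items that cannot be part of any HUI, which significantly reduces the search space and speeds up computation. The length of a chromosome is determined by the number of 1-HTWUIs. For example, with minUti = 82.5 on the example data set, the 1-HTWUIs mined are {a, b, c, e, f, g}, so the chromosome length is 6. Each chromosome represents an itemset and consists of 0s and 1s, where 1 means the item at that position is present and 0 means it is absent. For example, the chromosome in Figure 1 is a potential solution that represents the itemset {a, c, e, g}.
In the population initialization stage, each individual is initialized according to the TWU value of each 1-HTWUI, so that 1-HTWUIs with high TWU values are more likely to be selected. The pseudocode for population initialization is shown in Algorithm 2. In Algorithm 2, every bit of each initial individual is first set to 0, and an integer k is then generated at random as the number of non-zero positions in that individual (Steps 5-6). A roulette-wheel selection operator driven by the TWU value of each 1-HTWUI decides which items appear in the current individual (Steps 7-13). The new individual is added to the population (Step 14). Finally, the initialized population is returned (Step 17).
Algorithm 2 is as follows:
Figure PCTCN2019121475-appb-000014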
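The initialization described above could be sketched in Python as follows. This is a reading of the prose, not a transcription of Algorithm 2, which appears only as an image. Two details are assumptions: k is drawn uniformly from 1 to the chromosome length, and the roulette wheel is spun again whenever it lands on an item that is already set. All names are hypothetical.

```python
import random


def roulette_pick(weights, rng):
    """Return an index with probability proportional to its weight."""
    total = sum(weights)
    r = rng.uniform(0, total)
    acc = 0.0
    for idx, w in enumerate(weights):
        acc += w
        if r <= acc:
            return idx
    return len(weights) - 1  # guard against floating-point round-off


def init_population(twu_values, pop_size, rng):
    """Build pop_size binary individuals, favouring items with high TWU.

    Assumes every TWU value is positive, which holds for 1-HTWUIs because
    their TWU is at least minUti.
    """
    n = len(twu_values)
    population = []
    for _ in range(pop_size):
        individual = [0] * n
        k = rng.randint(1, n)  # assumed range for the number of 1-bits
        ones = 0
        while ones < k:
            idx = roulette_pick(twu_values, rng)
            if individual[idx] == 0:  # re-spin if the item is already present
                individual[idx] = 1
                ones += 1
        population.append(individual)
    return population


# TWU values quoted in Example 2 for the items c, f, a, g, b, e.
population = init_population([206, 182, 170, 170, 161, 146], 20, random.Random(0))
print(population[:3])
```

A seeded `random.Random` is passed in so that runs are repeatable; the patent does not specify a random source.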
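(2) Individual repair processing
Because evolutionary computation is stochastic, the algorithm produces, during population initialization and the genetic operations, many combinations of items that do not occur in the data set. This wastes execution time and enlarges the search space. Song et al. proposed an individual pruning strategy called PEV check, which makes the itemset represented by each individual a valid combination. The higher the TWU value of a 1-HTWUI, the more likely it is to appear in an HUI; however, this pruning strategy may also prune away good genes that could form part of an HUI. This example therefore improves on it and proposes an individual repair processing method.
The individual repair processing method prunes sequentially according to the TWU values of the 1-HTWUIs and preserves as far as possible the combinations of items in the individual that may yield utility. Its pseudocode is shown in Algorithm 3. Algorithm 3 first checks whether the individual to be repaired is all 0s. (This means that the itemset corresponding to the individual is empty; an empty itemset obviously has utility 0, so no repair is needed.) If the individual is all 0s, the algorithm exits directly (Steps 2-3); otherwise the individual repair stage is entered (Steps 4-28). The intermediate variable temp is initialized to the bitmap column of the 1-HTWUI with the largest TWU value (Steps 5-12); as the worked example in Example 2 shows, this is the highest-TWU item present in the individual. Then, in descending order of TWU value, temp is ANDed with the bitmap column of each HTWUI to obtain temp'. If temp' consists entirely of 0s, the corresponding item is deleted from the current individual; otherwise temp takes the value temp' and the operation continues, until every bit of the individual to be repaired has been checked (Steps 13-28).
Algorithm 3 is as follows:
Figure PCTCN2019121475-appb-000015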
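The repair procedure can be sketched in Python and checked against the worked example of Example 2. The sketch follows the prose and that example rather than Algorithm 3 itself, which is only available as an image. It assumes the chromosome order {a, b, c, e, f, g}, that temp starts from the highest-TWU item present in the individual, and that temp is replaced by temp' whenever the AND is non-zero, as the example does. The bit column for item c is not quoted in the text, so the sample individuals avoid c.

```python
ITEMS = ["a", "b", "c", "e", "f", "g"]  # assumed chromosome order
TWU = {"a": 170, "b": 161, "c": 206, "e": 146, "f": 182, "g": 170}

# Bit columns quoted in Example 2; the column for c is not given in the text.
BIT = {
    "a": "1010010100",
    "b": "0011110000",
    "e": "1010001001",
    "f": "0101010111",
    "g": "1010010100",
}


def repair(individual, items, bit_columns, twu_values):
    """Drop items that never co-occur with the higher-TWU items already kept."""
    present = [i for i, bit in enumerate(individual) if bit == 1]
    if not present:
        return list(individual)  # empty itemset: nothing to repair
    # Visit present items in descending TWU order (stable for ties).
    present.sort(key=lambda i: -twu_values[items[i]])
    repaired = list(individual)
    temp = int(bit_columns[items[present[0]]], 2)
    for i in present[1:]:
        candidate = temp & int(bit_columns[items[i]], 2)
        if candidate == 0:
            repaired[i] = 0   # item never appears with the items kept so far
        else:
            temp = candidate  # keep the item and narrow the transaction set
    return repaired


print(repair([1, 1, 0, 1, 1, 1], ITEMS, BIT, TWU))
print(repair([1, 1, 0, 1, 1, 0], ITEMS, BIT, TWU))
```

Both calls reproduce the results stated in Example 2: 110111 is repaired to 110011 and 110110 to 110010. Packing each column into an integer makes the AND a single operation; this is a convenience of the sketch, not something the patent prescribes.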
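(3) Neighborhood exploration processing for duplicate HUIs
Because evolutionary algorithms are inherently stochastic, the iterations produce not only many meaningless combinations of items but also many duplicate HUIs. These duplicates inevitably increase execution time.
To make sensible use of duplicate HUIs and avoid evaluating their fitness again, a neighborhood search processing method for duplicate HUIs is used. For a new individual produced by the genetic operations, fitness is not evaluated directly. Instead, the individual is first checked against the set of HUIs. If it is already in the set, neighborhood mutation is applied to produce a new solution near that HUI, so that solutions in its neighborhood are explored, and fitness is then evaluated; otherwise fitness is evaluated directly.
Figure 2 illustrates this neighborhood exploration processing. Dark circles denote the individuals that correspond to duplicate HUIs in the population, light circles denote potential HUIs in the neighborhoods of those duplicates, and all individuals whose fitness is below minUti are drawn as white circles. The purpose of the method is to use the duplicate HUIs to explore potential HUIs in their neighborhoods and so strengthen local search around good solutions.
Algorithm 4 describes this neighborhood mutation. Assuming the individual that has just undergone mutation is an HUI (in binary-encoded form), presence_indexs stores the indices of the positions that are 1 in the HUI (Step 1) and absence_indexs stores the indices of the positions that are 0 (Step 2). A value m is first selected at random from presence_indexs (Step 3), and a value n is then selected at random from absence_indexs (Step 4). Finally, position m of the HUI is set to 0 (Step 5) and position n is set to 1 (Step 6).
Algorithm 4 is as follows:
Figure PCTCN2019121475-appb-000016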
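The six steps of the neighborhood mutation map almost directly onto Python. In this sketch the names presence_indexs and absence_indexs are taken from the text; returning the individual unchanged when it is all 1s or all 0s is an added assumption, since the text does not say what happens in that case.

```python
import random


def neighborhood_mutation(hui, rng):
    """Swap one present item for one absent item, giving a neighbor of the HUI."""
    presence_indexs = [i for i, bit in enumerate(hui) if bit == 1]  # Step 1
    absence_indexs = [i for i, bit in enumerate(hui) if bit == 0]   # Step 2
    if not presence_indexs or not absence_indexs:
        return list(hui)  # assumed behaviour: no swap is possible
    m = rng.choice(presence_indexs)  # Step 3
    n = rng.choice(absence_indexs)   # Step 4
    neighbor = list(hui)
    neighbor[m] = 0                  # Step 5
    neighbor[n] = 1                  # Step 6
    return neighbor


hui = [0, 0, 1, 1, 0, 0]  # the HUI {c, e} from Example 2
print(neighborhood_mutation(hui, random.Random(1)))
```

The neighbor always has the same number of items as the original HUI and differs from it in exactly two positions, so the search stays close to a region already known to contain an HUI.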
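(4) Population diversity maintenance processing and elite processing
The HUIM problem requires finding all itemsets that satisfy the minimum utility value, which means the final number of solutions may be far greater than one. Because HUIs are not evenly distributed in the solution space, searching only in the direction of the best individual of the previous generation easily misses some solutions. At the same time, as population diversity drops rapidly the search space becomes restricted, and the algorithm easily falls into a local optimum prematurely.
To maintain diversity across generations and prevent HUIs from being missed because the algorithm is trapped in a local optimum, this example uses a population diversity maintenance processing method.
As shown in Figure 3, in each evolution step two consecutively stored HUIs are selected from the set of HUIs to replace two randomly selected individuals in the current population. On the one hand, this gives good individuals more chances to evolve in different directions and widens the search space around good solutions. On the other hand, it keeps a certain level of diversity in the population during evolution, which to some extent prevents HUIs from being missed because the algorithm falls into a local optimum prematurely.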
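A minimal Python sketch of the diversity maintenance step is given below. The text does not say how the two consecutively stored HUIs are chosen, so the sketch assumes that the set of HUIs is kept as a list in discovery order and that the starting position is drawn at random. Leaving the population unchanged when fewer than two HUIs are known is also an assumption.

```python
import random


def maintain_diversity(population, huis, rng):
    """Replace two random individuals with two consecutively stored HUIs."""
    new_population = [list(ind) for ind in population]
    if len(huis) < 2 or len(population) < 2:
        return new_population  # assumed: nothing to inject yet
    start = rng.randrange(len(huis) - 1)                  # assumed random start
    injected = [list(huis[start]), list(huis[start + 1])]
    slots = rng.sample(range(len(population)), 2)         # two distinct positions
    for slot, hui in zip(slots, injected):
        new_population[slot] = hui
    return new_population


# Hypothetical toy data: four identical individuals and three known HUIs.
population = [[0, 0, 0, 1] for _ in range(4)]
huis = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0]]
print(maintain_diversity(population, huis, random.Random(3)))
```

The function returns a new population instead of modifying its argument, which keeps the example easy to test; an in-place update would serve equally well inside the evolution loop.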
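In addition, to prevent the loss of elite individuals, the HUIM-IGA algorithm also introduces an elite method. As shown in Figure 4, the previous-generation population P_t is first merged with the current population P_{t+1} and duplicate individuals are deleted from the merged population; the remaining individuals are then sorted by utility value in descending order, and those with the larger utility values are selected to form the next-generation population Q.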
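The elite step (merge, remove duplicates, sort by utility, truncate) can be sketched as follows. The utility function in the usage example is a hypothetical stand-in, not the itemset utility computed from the database, and the size of Q is passed in as a parameter because the text does not state it explicitly.

```python
def elite_selection(previous, current, utility, size):
    """Form the next population Q from the best distinct individuals of P_t and P_{t+1}."""
    merged = {}
    for individual in previous + current:
        merged.setdefault(tuple(individual), None)  # dict keys drop duplicates, keep order
    ranked = sorted(merged, key=utility, reverse=True)
    return [list(individual) for individual in ranked[:size]]


def toy_utility(individual):
    """Hypothetical utility: the first item is worth 3 and the second 2."""
    return 3 * individual[0] + 2 * individual[1]


previous_generation = [[1, 0], [0, 1]]
current_generation = [[1, 0], [1, 1]]
print(elite_selection(previous_generation, current_generation, toy_utility, 2))
```

In this toy run the duplicate [1, 0] is kept once, and the two survivors are [1, 1] (utility 5) and [1, 0] (utility 3).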
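Example 2: Application example of the pattern data mining method based on the improved genetic algorithm
Taking the example database of Table 1 and the profit table of Table 2, this example briefly illustrates how the proposed method (HUIM-IGA) is applied.
Assume minUti = 82.5. The 1-HTWUIs mined are shown in Table 3, the chromosome length is 6 (determined by the number of 1-HTWUIs), and Table 4 gives the bitmap representation of the example database.
Table 3 shows that TWU(d) < minUti, so the data column of d can be deleted. Sorting the 1-HTWUIs by TWU value in descending order gives {c:206, f:182, a:170, g:170, b:161, e:146}.
Suppose that three individuals of the initial population produced by Algorithm 2 have the binary encodings 110111, 110110 and 001100. For 110111, following the repair procedure of Algorithm 3, the intermediate variable temp is first initialized to the bitmap column of the 1-HTWUI with the largest TWU value among the items of the individual, i.e. temp = Bit(f) = 0101010111. The individual is then repaired item by item in descending order of TWU value. temp ∩ Bit(a) = 0101010111 ∩ 1010010100 = 0000010100; the result is not all 0s, so temp = 0000010100. temp ∩ Bit(g) = 0000010100 ∩ 1010010100 = 0000010100, so temp = 0000010100. temp ∩ Bit(b) = 0000010100 ∩ 0011110000 = 0000010000, so temp = 0000010000. temp ∩ Bit(e) = 0000010000 ∩ 1010001001 = 0000000000; the result is 0, so the position of e in the individual is set to 0. The repaired individual is therefore 110011. Similarly, repairing 110110 gives 110010, and repairing 001100 leaves it unchanged as 001100.
Computing the fitness value (that is, the utility) of each of the three repaired individuals gives utility(110011) = 52 < minUti, utility(110010) = 26 < minUti, and utility(001100) = 84 > minUti. Therefore 001100 is added to the set of HUIs. Following the steps of Algorithm 1, suppose that an individual in the current population is 001100. Because it already exists in the set of HUIs, its fitness does not need to be computed. Applying the neighborhood mutation operation (Algorithm 4) to it gives an effect similar to Figure 5.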
Example 3: Performance test of the pattern data mining method based on the improved genetic algorithm
To verify the performance of the method provided in Example 1, this example compares it experimentally with state-of-the-art EC-based HUIM algorithms, namely Bio-HUIF-GA [Song W, Huang C. Mining high utility itemsets using bio-inspired algorithms: A diverse optimal value framework[J]. IEEE Access, 2018, 6:19568-19582. doi:10.1109/ACCESS.2018.2819162], Bio-HUIF-PSO [Song W, Huang C. Mining high utility itemsets using bio-inspired algorithms: A diverse optimal value framework[J]. IEEE Access, 2018, 6:19568-19582. doi:10.1109/ACCESS.2018.2819162], HUIM-BPSO [Lin C W, Yang L, Fournier-Viger P, et al. A binary PSO approach to mine high-utility itemsets[J]. Soft Computing, 2016], HUIM-BPSOsig [Lin C W, Yang L, Fournier-Viger P, et al. Mining high-utility itemsets based on particle swarm optimization[J]. Engineering Applications of Artificial Intelligence, 2016, 55:320-330.] and HUPEumu-GRAM [Kannimuthu S, Premalatha K. Discovery of High Utility Itemsets Using Genetic Algorithm with Ranked Mutation[J]. Applied Artificial Intelligence, 2014, 28(4):337-359.]. For a comprehensive comparison, three exact HUIM algorithms are also included: UP-Growth [Tseng V S, Shie B E, Wu C W, et al. Efficient algorithms for mining high utility itemsets from transactional databases[J]. IEEE Transactions on Knowledge and Data Engineering, 2012, 25(8):1772-1786.], IHUP [Ahmed C F, Tanbeer S K, Jeong B S, et al. Efficient tree structures for high utility pattern mining in incremental databases[J]. IEEE Transactions on Knowledge and Data Engineering, 2009, 21(12):1708-1721.] and UP-Hist Growth [Dawar S, Goyal V. UP-Hist Tree: An Efficient Data Structure for Mining High Utility Patterns from Transaction Databases.[J]. 2015.]. The algorithms are compared on convergence speed, the number of HUIs mined (which indicates how completely each algorithm discovers the HUIs), and the execution time needed to discover all HUIs.
A. Test method
All experiments were run on a desktop computer with 64-bit Windows 10, an Intel Core i5-8400 CPU @ 2.80 GHz and 8 GB of memory, and all algorithms were implemented in Java. Four public real data sets were used to evaluate the algorithms: Chess, Mushroom, Accident and Connect. All of them can be downloaded from the SPMF data mining library [Fournier-Viger P, Lin C W, Gomariz A, et al. The SPMF Open-Source Data Mining Library Version 2[C]//Joint European Conference on Machine Learning & Knowledge Discovery in Databases. Springer International Publishing, 2016.]. To keep the problem simple, and following earlier work [Lin C W, Yang L, Fournier-Viger P, et al. A binary PSO approach to mine high-utility itemsets[J]. Soft Computing, 2016], [Song W, Huang C. Mining high utility itemsets using bio-inspired algorithms: A diverse optimal value framework[J]. IEEE Access, 2018, 6:19568-19582. doi:10.1109/ACCESS.2018.2819162], only 10% of the Accident data set was used. For each algorithm, the convergence speed, the HUIs mined, and the time taken to mine all HUIs were recorded over 10 independent runs. The parameters and characteristics of the data sets are listed in Table 5. The population size of every algorithm was set to 20, and the number of fitness evaluations was set to 60000.
Table 5. Characteristics of the data sets used
Figure PCTCN2019121475-appb-000017
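B. Experimental results
(1) Effect of the population diversity maintenance strategy and the elite strategy on convergence speed
To show how the population diversity maintenance strategy and the elite strategy affect the efficiency of the search for HUIs, the population diversity maintenance strategy was removed from the original algorithm (HUIM-IGA of Example 1) to give a variant labeled HUIM-IGA^{-DMS}, and the elite strategy was removed from the original algorithm to give a variant labeled HUIM-IGA^{-ES}. Experiments were run on the data sets, with HUIM-IGA, HUIM-IGA^{-DMS} and HUIM-IGA^{-ES} each run independently 10 times.
Figures 6 and 7 show the convergence curves.
Figure 6 shows that HUIM-IGA and HUIM-IGA^{-DMS} converge at similar speeds early in the evolution, but in the middle and late stages HUIM-IGA converges clearly faster. HUIM-IGA has essentially converged after only 15000 fitness evaluations, whereas HUIM-IGA^{-DMS}, which lacks the population diversity maintenance strategy, has still not converged when the 60000 fitness evaluations end. The reason is that early in the evolution the population is diverse and the algorithm finds HUIs easily, so the two converge at similar speeds in this stage. As evolution proceeds, population diversity is gradually lost and the exploration ability of the algorithm weakens, so the convergence of HUIM-IGA^{-DMS} slows down. HUIM-IGA, with the population diversity maintenance strategy, keeps its ability to explore new HUIs throughout. In the middle and late stages HUIM-IGA therefore converges faster than HUIM-IGA^{-DMS}, which means that HUIM-IGA mines HUIs more quickly.
Figure 7 shows that HUIM-IGA^{-ES}, which lacks the elite strategy, converges clearly more slowly than HUIM-IGA from the early stage of evolution onward, and needs more than 30000 evaluations to converge on the Mushroom and Accident_10% data sets. This shows that the elite strategy avoids missing HUIs and losing good solutions, so HUIM-IGA with the elite strategy converges clearly faster than HUIM-IGA^{-ES} without it.
(2) Comparison of convergence speed
To evaluate convergence, the average over 10 independent runs of each algorithm was used. Because IHUP, UP-Growth and UP-Hist Growth are not iterative algorithms, only the six EC-based algorithms were compared. Figures 8 to 11 show, for each algorithm and data set, how the number of HUIs found changes with the number of evaluations.
Figures 8 to 11 show that the HUIM-IGA algorithm of Example 1 converges fastest of the six algorithms. The main reason is that the population diversity maintenance processing effectively prevents the algorithm from falling into a local optimum prematurely, so that it keeps its ability to explore new HUIs. In addition, the elite processing and the neighborhood exploration processing help to avoid losing good individuals and so speed up the exploration of new HUIs.
The Bio-HUIF-GA and Bio-HUIF-PSO algorithms converge more slowly than the HUIM-IGA algorithm of Example 1. One reason is that in HUIM-IGA an identical HUI is evaluated only once, which saves fitness evaluations and so speeds up convergence. Another is that, lacking an effective search strategy, their convergence slows down in the later stages of evolution. The HUIM-BPSO algorithm performs better than the HUIM-BPSOsig and HUPEumu-GRAM algorithms, mainly because it uses an OR/NOR-tree structure that avoids generating invalid combinations during evolution and so converges faster. The HUPEumu-GRAM algorithm converges most slowly, mainly because a standard genetic algorithm lacks an effective search strategy for the HUIM problem and therefore struggles to converge within a limited number of evaluations.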
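(3) Comparison of the number of HUIs mined
The number of HUIs mined by each algorithm was analyzed on the four real data sets under different minimum utility thresholds. Because the IHUP, UP-Growth and UP-Hist Growth algorithms mine the complete set of HUIs from a data set, they are used here to determine the true number of HUIs for each data set and threshold.
Tables 6 and 7 report, over 10 independent runs of each algorithm, the best, worst and average ratio of the number of HUIs mined to the true number of HUIs for each data set and minimum utility threshold.
Table 6. Percentage of HUIs mined by each algorithm
Figure PCTCN2019121475-appb-000018
Table 7
Figure PCTCN2019121475-appb-000019
Figure PCTCN2019121475-appb-000020
Tables 6 and 7 show that, across the thresholds listed for each data set, the proposed HUIM-IGA method finds a more complete set of HUIs than the other algorithms. HUIM-IGA mines all HUIs when δ is high, but when δ is low (for example, δ = 31.3% on Connect and δ = 12.1% on Accident_10%) it still cannot guarantee that all HUIs are mined. Even at low minimum utility thresholds, however, HUIM-IGA mines clearly more HUIs than the other EC-based algorithms. Compared with HUIM-IGA, the Bio-HUIF-GA and Bio-HUIF-PSO algorithms need a higher δ before they can mine all HUIs. The smaller the minimum utility threshold, the more HUIs satisfy it, and without an effective search strategy mining them becomes harder. For the same reason, Bio-HUIF-GA and Bio-HUIF-PSO cannot guarantee that 100% of the HUIs are mined when the minimum utility threshold is small.
The HUIM-BPSO algorithm avoids generating invalid combinations and so mines HUIs faster, which is why it performs better than the HUIM-BPSOsig algorithm. The average accuracy of the HUPEumu-GRAM algorithm is below 20% on almost every data set. During iterative evolution the algorithm always searches toward the outstanding individuals of the previous generation. For a problem such as HUIM, which has many solutions, this easily loses solutions, and the algorithm struggles to find all HUIs within a limited number of evaluations.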
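(4) Comparison of running time
Using different minimum utility thresholds on the four real data sets, the running times of the eight methods for mining all HUIs that satisfy the threshold were compared. The results are shown in Tables 8 to 10. On Connect with δ = 31.3% and on Accident_10% with δ = 12.1%, none of the EC-based algorithms can, on average, mine all HUIs within 6×10^4 evaluations. Therefore, to compare the time each algorithm needs to mine different numbers of HUIs, the time needed to mine 30%, 50%, 70% and 90% of the HUIs was compared in these two settings. The results are shown in Table 11.
Tables 8 to 10 show that, on the four real data sets and for the different minimum utility thresholds, the proposed HUIM-IGA algorithm outperforms the other algorithms in running time. Although the Bio-HUIF-GA and Bio-HUIF-PSO algorithms converge quickly early in the evolution, they converge slowly later on, so finding the complete set of HUIs takes them a long time.
The UP-Growth and IHUP algorithms run out of memory on some data sets (for example, Chess and Connect). In addition, they run slower on the Mushroom and Accident_10% data sets than the HUIM-IGA, Bio-HUIF-GA and Bio-HUIF-PSO algorithms. On several data sets the UP-Hist Growth algorithm outperforms the other algorithms, but the overall performance of HUIM-IGA is better than that of UP-Hist Growth. The HUIM-BPSO, HUIM-BPSOsig and HUPEumu-GRAM algorithms take more than 3 hours on average to mine the complete set of HUIs on most data sets. This is because the standard search of a traditional genetic algorithm or particle swarm algorithm gradually loses search ability as population diversity falls, so some HUIs may be lost and it becomes difficult to find all HUIs within a practical amount of time. Table 11 shows that HUIM-IGA takes less time to mine the same number of HUIs, which indicates that it is more efficient than the other EC-based HUIM algorithms.
Table 8. Time taken by each algorithm to mine all HUIs
Figure PCTCN2019121475-appb-000021
Table 9
Figure PCTCN2019121475-appb-000022
Figure PCTCN2019121475-appb-000023
Table 10
Figure PCTCN2019121475-appb-000024
Figure PCTCN2019121475-appb-000025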
—内存溢出
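Table 11. Time taken by each algorithm to mine different proportions of the HUIs
Figure PCTCN2019121475-appb-000026
In summary, the experimental results on real data sets show that, compared with the current state-of-the-art EC-based HUIM algorithms, the HUIM-IGA method performs better in convergence speed, ability to discover HUIs, and running time.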
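Example 4: High-utility itemset mining device
An itemset mining device, including:
a first calculation module, a second calculation module, a neighborhood exploration module, a population diversity maintenance module, a population selection and crossover module, an elite module, a high-utility itemset determination module, and a decoding module.
The device works as follows:
The first calculation module calculates the TWU value of each item in the transaction database, finds all 1-HTWUIs, and sorts the 1-HTWUIs by TWU value. The length of each individual is determined by the number of 1-HTWUIs, and the population is randomly initialized by roulette-wheel selection based on the TWU values of the items.
The individual repair module prunes sequentially according to the TWU values of the 1-HTWUIs, preserving as far as possible the combinations of items in the individual that may yield utility. Specifically, it first determines whether the individual to be repaired is an invalid itemset; if it is not, the individual repair stage is entered. The individual repair stage includes: initializing the intermediate variable temp to the bitmap column of the 1-HTWUI with the largest TWU value; then, in descending order of TWU value, ANDing temp with the bitmap column of each HTWUI to obtain temp'. If temp' consists entirely of 0s, the corresponding item is deleted from the current individual; otherwise the operation continues until every bit of the individual to be repaired has been checked.
The neighborhood exploration module performs neighborhood mutation to generate a new individual when the repaired individual already belongs to the set of HUIs. Specifically, assuming the individual that has just undergone mutation is an HUI (in binary-encoded form), presence_indexs stores the indices of the positions that are 1 in the HUI and absence_indexs stores the indices of the positions that are 0. A value m is first selected at random from presence_indexs, and a value n is then selected at random from absence_indexs; finally, position m of the HUI is set to 0 and position n is set to 1.
The population diversity maintenance module selects, in each evolution step, two consecutively stored HUIs from the set of HUIs to replace two randomly selected individuals in the current population.
The population selection and crossover module performs selection and crossover on the population.
The elite module merges the previous-generation population P_t with the current population P_{t+1}, deletes duplicate individuals from the merged population, sorts the rest by utility value in descending order, and selects the individuals with the larger utility values to form the next-generation population Q.
The second calculation module calculates the utility value of an individual (that is, an itemset); the high-utility itemset determination module determines an itemset to be a high-utility itemset (HUI) when the utility value of the itemset corresponding to the individual is ≥ the minimum utility value; the decoding module decodes all mined HUIs, which are in binary-encoded form.
Optionally, the itemset mining device may implement the following algorithm:
Figure PCTCN2019121475-appb-000027
Figure PCTCN2019121475-appb-000028
The above are preferred embodiments of the invention and do not limit the invention. The scope of protection of the invention is defined by the claims.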

Claims (10)

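  1. A pattern data mining method, the method being an improvement on a genetic algorithm, characterized in that it includes:
    representing the original database in the form of a bitmap;
    initializing the population;
    performing individual repair processing on the population;
    obtaining the set of high-utility itemsets (HUIs).
  2. The method according to claim 1, wherein the individual repair processing on the population includes: pruning sequentially according to the TWU values of the 1-HTWUIs, preserving as far as possible the combinations of items in the individual that may yield higher utility; optionally, after the individual repair processing, the utility value of each individual x is calculated, and if the utility value is not less than the minimum utility value, x is added to the set of HUIs.
  3. The method according to claim 1, characterized in that the method further includes: performing neighborhood exploration processing on duplicate HUIs in the population; optionally, the neighborhood exploration processing on duplicate HUIs includes: for a new individual produced by the genetic operations, not evaluating fitness directly but first checking whether the individual is contained in the set of HUIs; if it is, applying neighborhood mutation to it to produce a new solution near that HUI, so as to explore solutions in its neighborhood, and then evaluating fitness; otherwise evaluating fitness directly.
  4. The method according to claim 1, characterized in that the method further includes: performing population diversity maintenance processing; optionally, the population diversity maintenance processing includes: in each evolution step, selecting two consecutively stored HUIs from the set of HUIs to replace two randomly selected individuals in the current population.
  5. The method according to claim 1, characterized in that the method further includes: performing elite processing; optionally, the elite processing includes: merging the previous-generation population P_t with the current population P_{t+1}, deleting duplicate individuals from the merged population, sorting the rest by utility value in descending order, and selecting the individuals with the larger utility values to form the next-generation population Q.
  6. A method for recommending products to a user, characterized in that the method includes mining high-utility itemsets with the data mining method of any one of claims 1-5 and recommending products to the user according to the information mined from the high-utility itemsets.
  7. A product packaging method, characterized in that the method includes mining the high-utility itemsets in market-basket data with the data mining method of any one of claims 1-5 and packaging several products from one high-utility itemset together.
  8. An itemset mining device, characterized in that the itemset mining device includes:
    a first calculation module for calculating the TWU value (transaction-weighted utilization) of each item;
    a second calculation module for calculating the utility value of an individual (that is, an itemset);
    an individual repair module for pruning sequentially according to the TWU values of the 1-HTWUIs, preserving as far as possible the combinations of items in the individual that may yield higher utility;
    a high-utility itemset determination module for determining an itemset to be a high-utility itemset (HUI) when the utility value of the itemset corresponding to the individual is ≥ the minimum utility value.
  9. The itemset mining device according to claim 8, characterized in that the itemset mining device further includes any one or more of the following modules:
    a neighborhood exploration module for performing neighborhood mutation on an individual to generate a new individual when the repaired individual belongs to the set of HUIs;
    a population diversity maintenance module for selecting, in each evolution step, two consecutively stored HUIs from the set of HUIs to replace two randomly selected individuals in the current population;
    a population selection and crossover module for performing selection and crossover on the population;
    an elite module for merging the previous-generation population P_t with the current population P_{t+1}, deleting duplicate individuals from the merged population, sorting the rest by utility value in descending order, and selecting the individuals with the larger utility values to form the next-generation population Q;
    a decoding module for decoding all mined HUIs, which are in binary-encoded form.
  10. A data processing apparatus containing the itemset mining device of any one of claims 8-9.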
PCT/CN2019/121475 2019-11-28 2019-11-28 一种基于改进遗传算法的模式数据挖掘方法 WO2021102775A1 (zh)

Priority Applications (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/121475 WO2021102775A1 (zh) 2019-11-28 2019-11-28 一种基于改进遗传算法的模式数据挖掘方法

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/CN2019/121475 WO2021102775A1 (zh) 2019-11-28 2019-11-28 一种基于改进遗传算法的模式数据挖掘方法

Publications (1)

Publication Number Publication Date
WO2021102775A1 true WO2021102775A1 (zh) 2021-06-03

Family

ID=76129816

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2019/121475 WO2021102775A1 (zh) 2019-11-28 2019-11-28 一种基于改进遗传算法的模式数据挖掘方法

Country Status (1)

Country Link
WO (1) WO2021102775A1 (zh)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107870939A (zh) * 2016-09-27 2018-04-03 腾讯科技(深圳)有限公司 一种模式挖掘方法及装置
CN108664330A (zh) * 2018-05-16 2018-10-16 哈尔滨工业大学(威海) 一种基于变邻域搜索策略的云资源分配方法
CN109977165A (zh) * 2019-04-16 2019-07-05 江南大学 一种三目标模式挖掘模型
CN110069498A (zh) * 2019-04-16 2019-07-30 江南大学 基于多目标演化算法的高质量模式挖掘方法

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
TSENG VINCENT S.; WU CHENG-WEI; FOURNIER-VIGER PHILIPPE; YU PHILIP S.: "Efficient Algorithms for Mining Top-K High Utility Itemsets", IEEE TRANSACTIONS ON KNOWLEDGE AND DATA ENGINEERING, IEEE SERVICE CENTRE , LOS ALAMITOS , CA, US, vol. 28, no. 1, 1 January 2016 (2016-01-01), US, pages 54 - 67, XP011592813, ISSN: 1041-4347, DOI: 10.1109/TKDE.2015.2458860 *

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 19954395

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

122 Ep: pct application non-entry in european phase

Ref document number: 19954395

Country of ref document: EP

Kind code of ref document: A1