CN113282686B - Association rule determining method and device for unbalanced sample - Google Patents
- Publication number
- CN113282686B (application CN202110622409.7A)
- Authority
- CN
- China
- Prior art keywords
- frequent item
- determining
- positive
- target
- item sets
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/21—Design, administration or maintenance of databases
- G06F16/215—Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/20—Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
- G06F16/28—Databases characterised by their database models, e.g. relational or object models
- G06F16/284—Relational databases
- G06F16/285—Clustering or classification
-
- Y—GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
- Y02—TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
- Y02P—CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
- Y02P90/00—Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
- Y02P90/30—Computing systems specially adapted for manufacturing
Abstract
The application provides a method and a device for determining association rules of unbalanced samples. The method comprises: converting the original data into transaction data; classifying and marking the positive and negative unbalance sample data in the transaction data to obtain positive and negative unbalance mark samples; and determining, based on the positive and negative unbalance mark samples, classification association rules of a target variable in the positive and negative unbalance sample data. The method and the device address two problems that arise in the related technology when classifying positive and negative unbalanced samples: the traditional association rule classification algorithm consumes considerable system resources, and association rule screening does not consider the information of the whole sample, which impairs the generalization of the model. In the implementation, the application focuses on classification rules that take the small sample as the target variable under unbalanced sample conditions, thereby avoiding mining the whole data set and saving system resources.
Description
Technical Field
The application relates to the field of data processing, in particular to a method and a device for determining association rules of unbalanced samples.
Background
Financial fraud, particularly in fields such as bank cards and credit cards, has long been an important focus of risk control in the financial industry. Although a series of AI anti-fraud models has emerged with the development of deep learning, rule-based systems remain strongly viable because they are comprehensible, easy to roll out, and stable. Finding possible association rules from existing cases through data mining, and then confirming the rules through expert analysis, is currently an effective way to detect financial fraud. Traditional financial institutions still rely on building rule-based risk-control anti-fraud models, so the generation of the rule feature library becomes critical. Unlike traditional manual induction or offline investigation, however, in a big-data setting a combined strategy of automatic rule extraction, models, and expert analysis can improve mining efficiency while preserving the original interpretability.
Association analysis is a data mining method that is widely applied in biology, transportation, telecommunications, and finance because it is easy to understand and generalizes well. Association rules can be applied to classification problems in various ways, such as Apriori-based integration algorithms, classification algorithms based on multiple classification association rules, and algorithms that use a greedy strategy to mine association rules directly from the training data set, thereby reducing computational overhead. In general, classification algorithms based on association rules are accurate and robust.
However, specific application scenarios such as medical diagnosis, rail signals, and financial fraud involve binary classification problems with unbalanced samples. If the target sample makes up only a small proportion of the total sample and the traditional association rule classification algorithm is applied directly, then, especially when the data set contains many instances, combining item sets generates a large number of candidate item sets, and computing frequent item sets from these candidates consumes considerable system resources. In addition, the support and confidence thresholds are difficult to set during pruning: a threshold that is too low retains too many redundant rules and causes overfitting, while a threshold that is too high prunes a large number of association rules on the target sample and impairs the generalization of the model.
No solution has yet been proposed for the problems that, in classification scenarios with unbalanced samples in the related technology, the traditional association rule classification algorithm consumes considerable system resources, and association rule screening does not consider the information of the whole sample, which impairs the generalization of the model.
Disclosure of Invention
The embodiments of the application provide a method and a device for determining association rules of unbalanced samples, to at least solve the problems that, in classification scenarios with unbalanced samples in the related technology, the traditional association rule classification algorithm consumes considerable system resources, and association rule screening does not consider the information of the whole sample, which impairs the generalization of the model.
According to an embodiment of the present application, there is provided an association rule determining method of an unbalanced sample, including:
converting the original data into transaction data;
classifying and marking the positive and negative unbalance sample data in the transaction data to obtain positive and negative unbalance mark samples;
and determining classification association rules of target variables in the positive and negative unbalance sample data based on the positive and negative unbalance mark samples.
Optionally, classifying and marking the positive and negative unbalance sample data in the transaction data, and obtaining the positive and negative unbalance mark sample includes:
dividing the transaction data into a normal client set and an abnormal client set with fraudulent activity;
and marking the abnormal client set and the normal client set in the transaction data to obtain the positive and negative unbalance marking sample.
Optionally, determining, based on the positive and negative unbalance flag samples, a classification association rule of a target variable in the positive and negative unbalance sample data includes:
determining a plurality of frequent item sets on the abnormal client set through FP-Growth;
acquiring a plurality of candidate frequent item sets with the support number larger than a support number threshold value from the plurality of frequent item sets;
determining a comprehensive performance value of the plurality of candidate frequent item sets on the normal client set and the abnormal client set;
and determining the classification association rule of the target variable according to the comprehensive performance value.
Optionally, determining the comprehensive performance values of the plurality of candidate frequent item sets on the normal client set and the abnormal client set includes:
counting, for each frequent item set in the plurality of frequent item sets, the number of times TPk that it occurs in the abnormal client set and the number of times FPk that it occurs in the normal client set;
determining the recall and precision of each frequent item set based on the number TPk and the number FPk;
and determining the comprehensive performance value of each frequent item set according to the recall and precision of each frequent item set.
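The three quantities in this step can be written out explicitly. The block below is a sketch under stated assumptions rather than the patent's own formula: it assumes TP_k counts the abnormal-set transactions that contain item set k, FP_k counts the normal-set transactions that contain it, N_abn (a symbol introduced here only for illustration) is the number of transactions in the abnormal client set, and the comprehensive performance value is the F-score with β = 1. The source names the F-score but does not state a value of β.

```latex
\begin{aligned}
\mathrm{precision}_k &= \frac{TP_k}{TP_k + FP_k}, \\
\mathrm{recall}_k &= \frac{TP_k}{N_{\mathrm{abn}}}, \\
F_k &= \frac{(1+\beta^2)\,\mathrm{precision}_k\,\mathrm{recall}_k}
            {\beta^2\,\mathrm{precision}_k + \mathrm{recall}_k},
\qquad \beta = 1 \ \text{(assumed)} .
\end{aligned}
```

For example, an item set found in 2 of 3 abnormal transactions and in no normal transaction would have precision 1, recall 2/3, and F = 0.8 under these assumptions.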
Optionally, determining the classification association rule of the target variable according to the comprehensive performance value includes:
selecting a target frequent item set with the maximum comprehensive performance value from the plurality of candidate frequent item sets;
and after the comprehensive performance value of the target frequent item set is determined to be larger than a preset minimum threshold value, determining the target frequent item set as a classification association rule of the target variable.
Optionally, selecting the target frequent item set with the maximum comprehensive performance value from the plurality of candidate frequent item sets includes:
sorting the plurality of candidate frequent item sets according to the comprehensive performance value;
and selecting the target frequent item set with the maximum comprehensive performance value from the sorted candidate frequent item sets.
Optionally, after determining the target frequent item set as the classification association rule of the target variable, the method further comprises:
removing the target frequent item set from the plurality of frequent item sets and adjusting the support number threshold;
repeatedly performing the following steps until no classification association rule can be determined from the plurality of frequent item sets:
acquiring a plurality of candidate frequent item sets with support numbers greater than the support number threshold value from the plurality of frequent item sets;
determining a comprehensive performance value of the plurality of candidate frequent item sets on the normal client set and the abnormal client set;
and determining the classification association rule of the target variable according to the comprehensive performance value.
According to another embodiment of the present application, there is also provided an association rule determining apparatus of an unbalanced sample, including:
the conversion module is used for converting the original data into transaction data;
the marking module is used for classifying and marking the positive and negative unbalance sample data in the transaction data to obtain positive and negative unbalance marking samples;
and the determining module is used for determining the classification association rule of the target variable in the positive and negative unbalance sample data based on the positive and negative unbalance mark samples.
Optionally, the marking module includes:
a dividing sub-module, configured to divide the transaction data into a normal client set and an abnormal client set with fraudulent activity;
and the marking sub-module is used for marking the abnormal client set and the normal client set in the transaction data to obtain the positive and negative unbalance marking sample.
Optionally, the determining module includes:
a first determining submodule, configured to determine a plurality of frequent item sets on the abnormal client set through FP-Growth;
an acquisition sub-module, configured to acquire a plurality of candidate frequent item sets with a support number greater than a support number threshold from the plurality of frequent item sets;
a second determining sub-module, configured to determine comprehensive performance values of the plurality of candidate frequent item sets on the normal client set and the abnormal client set;
and a third determining sub-module, configured to determine the classification association rule of the target variable according to the comprehensive performance value.
Optionally, the second determining sub-module is further configured to:
count, for each frequent item set in the plurality of frequent item sets, the number of times TPk that it occurs in the abnormal client set and the number of times FPk that it occurs in the normal client set;
determine the recall and precision of each frequent item set based on the number TPk and the number FPk;
and determine the comprehensive performance value of each frequent item set according to the recall and precision of each frequent item set.
Optionally, the third determining sub-module includes:
an acquisition unit, configured to select a target frequent item set with the maximum comprehensive performance value from the plurality of candidate frequent item sets;
and a determining unit, configured to determine the target frequent item set as the classification association rule of the target variable after determining that the comprehensive performance value of the target frequent item set is larger than a preset minimum threshold value.
Optionally, the acquisition unit is further configured to:
sort the plurality of candidate frequent item sets according to the comprehensive performance value;
and select the target frequent item set with the maximum comprehensive performance value from the sorted candidate frequent item sets.
Optionally, the apparatus further comprises:
a removing module, configured to remove the target frequent item set from the plurality of frequent item sets, and adjust the support number threshold;
an execution module, configured to repeatedly perform the following steps until no classification association rule can be determined from the plurality of frequent item sets:
acquiring a plurality of candidate frequent item sets with support numbers greater than the support number threshold value from the plurality of frequent item sets;
determining a comprehensive performance value of the plurality of candidate frequent item sets on the normal client set and the abnormal client set;
and determining the classification association rule of the target variable according to the comprehensive performance value.
According to a further embodiment of the application, there is also provided a computer-readable storage medium having stored therein a computer program, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
According to a further embodiment of the application, there is also provided an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
According to the application, the original data is converted into transaction data; the positive and negative unbalance sample data in the transaction data are classified and marked to obtain positive and negative unbalance mark samples; and the classification association rule of the target variable in the positive and negative unbalance sample data is determined based on the positive and negative unbalance mark samples. This solves the problems that, in classification scenarios with unbalanced samples in the related technology, the traditional association rule classification algorithm consumes considerable system resources, and association rule screening does not consider the information of the whole sample, which impairs the generalization of the model. Because the application focuses on classification rules that take the small sample as the target variable under unbalanced sample conditions, mining the whole data set is avoided and system resources are saved.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a block diagram of a hardware structure of a mobile terminal of an association rule determining method of an unbalanced sample according to an embodiment of the present application;
FIG. 2 is a flow chart of a method of association rule determination for unbalanced samples according to an embodiment of the present application;
FIG. 3 is a flow chart of unbalanced category rule discovery based on association analysis and association classification in accordance with an embodiment of the present application;
FIG. 4 is a flow chart of an imbalance category label according to an embodiment of the present application;
FIG. 5 is a flow chart of the construction of an FP tree according to an embodiment of the application;
FIG. 6 is a flow chart of mining frequent item sets based on an FP tree according to an embodiment of the application;
FIG. 7 is a schematic diagram of FP-tree construction according to the present embodiment;
FIG. 8 is a flow chart of mining association rules for unbalanced categories using FP-Growth in accordance with an embodiment of the application;
FIG. 9 is a block diagram of an association rule determining apparatus of an unbalanced sample according to an embodiment of the present application.
Detailed Description
The application will be described in detail hereinafter with reference to the drawings in conjunction with embodiments. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
Examples
The method according to the first embodiment of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking execution on a mobile terminal as an example, FIG. 1 is a block diagram of the hardware structure of a mobile terminal for the association rule determining method of an unbalanced sample according to an embodiment of the present application. As shown in FIG. 1, the mobile terminal may include one or more processors 102 (only one is shown in FIG. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data. Optionally, the mobile terminal may further include a transmission device 106 for communication functions and an input/output device 108. Those skilled in the art will appreciate that the structure shown in FIG. 1 is merely illustrative and does not limit the structure of the mobile terminal described above. For example, the mobile terminal may include more or fewer components than shown in FIG. 1, or have a configuration different from that shown in FIG. 1.
The memory 104 may be used to store a computer program, for example, a software program of application software and a module, such as a computer program corresponding to the association rule determining method of the unbalanced sample in the embodiment of the present application, and the processor 102 executes the computer program stored in the memory 104, thereby performing various functional applications and data processing, that is, implementing the above-mentioned method. Memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory remotely located relative to the processor 102, which may be connected to the mobile terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.
In this embodiment, a method for determining association rules of an unbalanced sample running on the mobile terminal or the network architecture is provided, and fig. 2 is a flowchart of a method for determining association rules of an unbalanced sample according to an embodiment of the present application, as shown in fig. 2, where the flowchart includes the following steps:
Step S202, converting the original data into transaction data;
Step S204, classifying and marking the positive and negative unbalance sample data in the transaction data to obtain positive and negative unbalance mark samples;
In this embodiment, step S204 may specifically include: dividing the transaction data into a normal client set and an abnormal client set with fraudulent activity, wherein the unbalanced class data comprises the abnormal client set; and marking the abnormal client set in the transaction data to obtain the positive and negative unbalance mark samples.
Step S206, determining classification association rules of target variables in the positive and negative unbalance sample data based on the positive and negative unbalance mark samples.
Through steps S202 to S206 above, the original data is converted into transaction data; the positive and negative unbalance sample data in the transaction data are classified and marked to obtain positive and negative unbalance mark samples; and the classification association rule of the target variable in the positive and negative unbalance sample data is determined based on the positive and negative unbalance mark samples. This solves the problems that, in classification scenarios with unbalanced samples in the related technology, the traditional association rule classification algorithm consumes considerable system resources, and association rule screening does not consider the information of the whole sample, which impairs the generalization of the model. Because the method focuses on classification rules that take the small sample as the target variable under unbalanced sample conditions, mining the whole data set is avoided and system resources are saved.
In this embodiment, the step S206 may specifically include:
S2061, determining a plurality of frequent item sets on the abnormal client set through FP-Growth;
S2062, acquiring a plurality of candidate frequent item sets with the support number greater than a support number threshold value from the plurality of frequent item sets;
S2063, determining comprehensive performance values of the candidate frequent item sets on the normal client set and the abnormal client set, wherein the comprehensive performance value may specifically be the F-score;
Further, step S2063 may specifically include: counting, for each frequent item set in the plurality of frequent item sets, the number of times TPk that it occurs in the abnormal client set and the number of times FPk that it occurs in the normal client set; determining the recall and precision of each frequent item set based on the number TPk and the number FPk; and determining the comprehensive performance value of each frequent item set according to the recall and precision of each frequent item set.
S2064, determining the classification association rule of the target variable according to the comprehensive performance value, wherein the target variable may be an abnormal client.
Further, step S2064 may specifically include: selecting, from the plurality of candidate frequent item sets, a target frequent item set with the maximum comprehensive performance value, and determining the target frequent item set as the association rule after determining that its comprehensive performance value is larger than a preset minimum threshold. Specifically, the plurality of candidate frequent item sets are sorted according to the comprehensive performance value, in either ascending or descending order, and the target frequent item set with the maximum comprehensive performance value is selected from the sorted item sets; that is, the frequent item set with the maximum comprehensive performance value is determined as an association rule.
In an alternative embodiment, after the target frequent item set is determined as the association rule, the target frequent item set is removed from the plurality of frequent item sets and the support number threshold is adjusted; steps S2062 to S2064 are then repeated until no association rule can be determined from the plurality of frequent item sets, that is, until all association rules in the plurality of frequent item sets have been mined.
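The selection loop of steps S2062 to S2064 might be sketched as follows. This is a minimal illustration under assumptions, not the patented implementation: the function and parameter names (`f_score`, `occurrences`, `select_rules`, `threshold_step`) are hypothetical, transactions are modeled as Python sets of item strings, TP is counted on the abnormal set, the F-score uses β = 1 by default, and the threshold adjustment is modeled as a fixed additive step because the source does not say how the support number threshold is adjusted.

```python
from typing import Dict, FrozenSet, List, Set, Tuple

Itemset = FrozenSet[str]


def f_score(tp: int, fp: int, n_pos: int, beta: float = 1.0) -> float:
    """F-score of an item set read as the rule 'item set -> abnormal client'."""
    if tp == 0 or n_pos == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / n_pos
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


def occurrences(itemset: Itemset, transactions: List[Set[str]]) -> int:
    """Number of transactions that contain every item of the item set."""
    return sum(1 for t in transactions if itemset <= t)


def select_rules(
    frequent: Dict[Itemset, int],   # item set -> support count on the abnormal set
    d1: List[Set[str]],             # abnormal (positive) transactions
    d0: List[Set[str]],             # normal (negative) transactions
    support_threshold: int,
    min_f: float,
    threshold_step: int = 0,        # assumed adjustment policy, not from the source
) -> List[Tuple[Itemset, float]]:
    remaining = dict(frequent)
    rules: List[Tuple[Itemset, float]] = []
    while True:
        # S2062: candidates whose support count exceeds the current threshold
        candidates = [s for s, sup in remaining.items() if sup > support_threshold]
        if not candidates:
            break
        # S2063: score each candidate on both client sets
        scored = [
            (s, f_score(occurrences(s, d1), occurrences(s, d0), len(d1)))
            for s in candidates
        ]
        # S2064: keep the best candidate only if it clears the minimum threshold
        best, best_f = max(scored, key=lambda pair: pair[1])
        if best_f <= min_f:
            break
        rules.append((best, best_f))
        del remaining[best]
        support_threshold += threshold_step
    return rules
```

Because each accepted item set is removed from `remaining`, the loop terminates even when `threshold_step` is 0.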
For classification rule mining on unbalanced samples in financial fraud detection, the embodiment of the application provides a technique for rapidly discovering rules for a target class based on association rules. It consists of FP-Growth-based rule mining and an F-score-based association rule classifier. It mainly involves converting the original data into transaction data, marking the unbalanced categories, rapidly discovering unbalanced category rules, and finally generating association rules and storing them in an association rule knowledge base. First, only rules on the target sample that satisfy the F-score criterion are extracted, which avoids mining the whole data set and reduces the cost of computing frequent item sets. Second, association rules are generated with the F-score as the credibility measure, which incorporates information from the full set of classified samples, screens out more accurately the rules that classify the small target sample well, alleviates to some extent the overfitting caused by redundant association rules, and preserves a degree of generalization. Evaluating unbalanced category rules with the F-score enhances the credibility and practical value of the rules and provides guidance for constructing unbalanced category knowledge bases such as a financial fraud detection rule base. FIG. 3 is a flow chart of unbalanced category rule discovery based on association analysis and association classification according to an embodiment of the application. As shown in FIG. 3, the flow includes:
step S301, converting the original data into transaction data;
in the stage of converting original data into transaction data, customer portrait data, customer behavior data, customer performance data, and the corresponding discrete variables such as financial products, terms, and grades are processed separately and converted into transaction data. The conversion from original data to transaction data includes:
Data preprocessing: missing values in the original data are identified and filled in, data types are identified, and key indicators of continuous data, including the maximum, minimum, and median, are stored.
Data conversion: the conversion is guided by the results of data preprocessing. Specifically, the corresponding conversion option is selected according to the identified data type. Discrete data generates transaction data through a discrete data converter, and continuous data generates transaction data through a continuous data converter. For data that needs custom handling, the related data indicators can be called and the transaction data is generated through a custom data converter. In the automatic conversion process, continuous data uses the median to determine the threshold, and custom data uses the maximum, minimum, median, and similar indicators to determine the number of segments and the thresholds.
Finally, the attribute values of the original data are converted into transaction data recorded in binary form: each item either occurs or does not occur.
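The conversion described above can be sketched as follows. This is a minimal illustration, not the converters of the application: the function name `to_transactions`, the field names `age` and `product`, the item naming scheme, and the choice to fill missing continuous values with the median are all assumptions made for the example.

```python
from statistics import median

def to_transactions(records, continuous_fields, discrete_fields):
    """Convert raw records (dicts) into transactions (sets of item strings).

    Continuous fields are split at their median, following the description
    that continuous data uses the median as its threshold. Discrete fields
    become one item per observed value. All names here are hypothetical.
    """
    # Preprocessing: store the median of each continuous field, ignoring missing values.
    thresholds = {
        f: median(r[f] for r in records if r.get(f) is not None)
        for f in continuous_fields
    }
    transactions = []
    for r in records:
        items = set()
        for f in continuous_fields:
            value = r.get(f)
            if value is None:
                value = thresholds[f]  # assumption: fill missing values with the median
            items.add(f"{f}>median" if value > thresholds[f] else f"{f}<=median")
        for f in discrete_fields:
            if r.get(f) is not None:
                items.add(f"{f}={r[f]}")
        transactions.append(items)
    return transactions

records = [
    {"age": 25, "product": "loan"},
    {"age": 40, "product": "card"},
    {"age": 61, "product": "loan"},
]
transactions = to_transactions(records, ["age"], ["product"])
```

Each resulting transaction lists only the items that occur, which is equivalent to a binary occurs / does-not-occur record over all possible items.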
Step S302, unbalanced category data marking;
in the unbalanced category data marking stage, clients with fraudulent behavior are marked by matching against bank account-opening records, and the data is divided into a client set with fraudulent behavior and a normal performing client set. When the detection rule base is established and updated, more attention is paid to how the rules perform on the positive sample data set, and the rules are expected to screen out potential fraudulent clients more accurately.
In order to extract association rules for the unbalanced categories, the transaction data of the unbalanced categories needs to be marked. The marking is supervised: a record whose class label matches the target class enters the positive sample data set; otherwise it enters the negative sample data set. FIG. 4 is a flow chart of unbalanced category marking according to an embodiment of the application. As shown in FIG. 4, the flow includes:
step S401, dividing transaction data into a client set with fraudulent activity and a normal performing client set;
step S402, judging whether the transaction data is a positive example (namely whether the transaction data is data with fraudulent activity), executing step S403 if the judging result is negative, and executing step S404 if the judging result is positive;
step S403, marking the transaction data and adding it to the negative sample data set D0;
step S404, marking the transaction data and adding it to the positive sample data set D1;
step S405, the data sets D0, D1 are output.
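Steps S401 to S405 amount to a partition of the transactions by a supervised label. A minimal sketch, assuming the fraud label is already available as one boolean per transaction (the function name `split_by_label` and the sample items are hypothetical):

```python
def split_by_label(transactions, is_fraud):
    """Partition transactions into the negative set D0 and the positive set D1.

    `is_fraud` holds one boolean per transaction; True marks a positive
    example (a client with fraudulent activity).
    """
    d0, d1 = [], []
    for transaction, flag in zip(transactions, is_fraud):
        if flag:
            d1.append(transaction)  # S404: enters the positive sample data set
        else:
            d0.append(transaction)  # S403: enters the negative sample data set
    return d0, d1  # S405: output D0 and D1

transactions = [{"a", "b"}, {"c"}, {"a"}, {"b", "c"}]
labels = [True, False, False, True]
d0, d1 = split_by_label(transactions, labels)
```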
Step S303, finding out an unbalanced class association rule;
in the association rule discovery stage, frequent item sets on the client set with fraudulent activity are mined through FP-Growth. With the F-score as the evaluation index, the comprehensive performance of each frequent item set on the client set with fraudulent activity and on the normal performing client set is evaluated. The frequent item sets whose F-score meets the minimum threshold are selected and ranked from largest to smallest F-score to serve as association rules. The association rules are extracted according to the frequent item sets and stored in the association rule knowledge base.
The association rule discovery for the unbalanced category is based on the association rule mining algorithm FP-Growth, which needs only two full scans of the data set to discover the frequent item sets. A frequent item set is a set whose support is greater than or equal to the minimum support, where support is the frequency with which a given set appears across all transactions.
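As a small illustration of the support definition, the following sketch computes the fraction of transactions that contain a given item set. The function name `support` and the sample transactions are assumptions made for the example.

```python
def support(itemset, transactions):
    """Fraction of transactions that contain every element of `itemset`."""
    itemset = frozenset(itemset)
    hits = sum(1 for t in transactions if itemset <= t)
    return hits / len(transactions)

transactions = [{"a", "b"}, {"a", "b", "c"}, {"a"}, {"c"}]
s_ab = support({"a", "b"}, transactions)  # {"a", "b"} appears in 2 of 4 transactions
is_frequent = s_ab >= 0.5                 # frequent under an assumed minimum support of 0.5
```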
Mining association rules with FP-Growth first constructs an FP tree. The root node of the FP tree is the empty set; every other node consists of a single element and the number of times that element occurs in the data set, and elements that occur more often are closer to the root node. The nodes are connected, and the connected elements form a frequent set. FIG. 5 is a flow chart of the construction of an FP tree according to an embodiment of the application. As shown in FIG. 5, the flow includes:
step S501, traversing each set, sorting the elements in the set according to the occurrence times of the elements in the total data set, and removing the elements which do not reach the minimum support degree;
step S502, traversing elements in the sets downwards from the root node in order for each set;
step S503, judging whether the corresponding node exists in the tree, executing step S504 if the judging result is yes, otherwise executing step S505;
step S504, the count value of the node is increased;
step S505, creating a branch;
step S506, determining whether the elements in the set are traversed, returning to step S502 if the determination result is negative, and ending if the determination result is positive.
Each newly added node is looked up in the head pointer table. If the head pointer table has no entry for that element, an entry is created for it. The newly added node is appended to the end of the linked list for that element in the head pointer table, and the count of that element in the head pointer table is increased. The head pointer table records the occurrence counts of all frequent items; each frequent item in the table is the head of a node linked list that points, in order, to the positions where that item occurs in the FP tree;
step S504, the process loops until all the sets are traversed.
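The construction loop can be sketched as below. This is an illustrative simplification rather than the implementation of the application: the names `FPNode` and `build_fp_tree` are hypothetical, ties in occurrence count are broken alphabetically, and the head pointer table is a dictionary of Python lists standing in for the node linked lists.

```python
from collections import Counter

class FPNode:
    """One FP-tree node: a single element, its count, and links to parent and children."""
    def __init__(self, item, parent):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_count):
    # First scan: count every element and drop those below the minimum support count.
    counts = Counter(item for t in transactions for item in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    root = FPNode(None, None)  # the root represents the empty set
    header = {}                # head pointer table: item -> nodes in insertion order
    # Second scan: insert each transaction, most frequent elements first.
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:              # no matching node: create a branch
                child = FPNode(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1               # matching node: increase its count
            node = child
    return root, header, frequent

transactions = [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"b"}]
root, header, frequent = build_fp_tree(transactions, min_count=2)
```

In this toy input, `b` occurs in all four transactions and becomes the only child of the root, while `c` ends up in two different branches, which is why the head pointer table keeps two nodes for it.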
FIG. 6 is a flow chart of mining frequent item sets based on the FP tree, as shown in FIG. 6, according to an embodiment of the application, comprising:
step S601, the first frequent element of the head pointer table is fetched, and each tree path where the first frequent element is located is traversed;
step S602, backtracking to the root node along each tree path to obtain a conditional pattern base, wherein the conditional pattern base is the set of paths that end with the element being searched for;
step S603, creating a new FP tree according to the obtained conditional pattern base;
step S604, recursively mining the FP tree;
step S605, in each loop iteration, the first frequent element is joined with the prefix passed into the recursion to form a frequent item set, and each frequent element in the head pointer table at each level of recursion generates a frequent item set.
FIG. 7 is a schematic diagram of FP tree construction according to this embodiment. As shown in FIG. 7, all transaction sets are recorded in a transaction database. Following the database, each set T0-T8 is added in order downward from the root node of the tree; if a node already exists, its count value is incremented, otherwise a branch is created. While the sets are traversed, each newly added node in the tree is looked up in the head pointer table, which is sorted in descending order; if the head pointer table has no entry for that element, an entry is created for it. The loop continues until all sets have been traversed, which completes the construction of the FP tree.
After the FP tree is constructed, the frequent item sets can be mined from it. Using the FP tree together with its head pointer table, the conditional pattern base of each element item is obtained. The conditional pattern base is the set of paths that end with the element item being searched for; each path is in fact a prefix path. Taking I5 as an example, the conditional pattern base { (I2 I1), (I2 I1 I3) } is obtained with the pattern suffix I5. FP-Growth is then called recursively, and combining the results with the pattern suffix I5 yields all frequent item sets with a support count of at least 2: {I2 I5}, {I1 I5}, and {I2 I1 I5}.
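The I5 example can be reproduced with a compact FP-Growth sketch. The transactions T0-T8 are not listed in this text, so the nine transactions below are an assumption chosen so that the conditional pattern base of I5 matches the one stated above. The names `Node`, `build_tree`, `conditional_pattern_base`, and `fp_growth` are hypothetical, and Python lists stand in for the linked lists of the head pointer table.

```python
from collections import Counter, defaultdict

class Node:
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_tree(weighted, min_count):
    """Build an FP tree from (items, weight) pairs; return (header table, frequent counts)."""
    counts = Counter()
    for items, w in weighted:
        for i in set(items):
            counts[i] += w
    keep = {i: c for i, c in counts.items() if c >= min_count}
    root, header = Node(None, None), defaultdict(list)
    for items, w in weighted:
        node = root
        ordered = sorted((i for i in set(items) if i in keep),
                         key=lambda i: (-keep[i], i))
        for i in ordered:
            if i not in node.children:
                node.children[i] = Node(i, node)
                header[i].append(node.children[i])
            node = node.children[i]
            node.count += w
    return header, keep

def conditional_pattern_base(item, header):
    """Prefix paths that end at `item`, each paired with the count of the `item` node."""
    base = []
    for node in header[item]:
        path, p = [], node.parent
        while p is not None and p.item is not None:  # backtrack to the root
            path.append(p.item)
            p = p.parent
        if path:
            base.append((path[::-1], node.count))
    return base

def fp_growth(weighted, min_count, suffix=()):
    """Return {frozenset(item set): support count} for all frequent item sets."""
    header, keep = build_tree(weighted, min_count)
    result = {}
    for item in keep:
        found = (item,) + suffix  # join the element with the incoming suffix
        result[frozenset(found)] = keep[item]
        base = conditional_pattern_base(item, header)
        result.update(fp_growth(base, min_count, found))  # mine the conditional tree
    return result

# Assumed transactions (not given in the text).
T = [["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"], ["I1", "I3"],
     ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"]]
weighted = [(t, 1) for t in T]
header, _ = build_tree(weighted, 2)
base_i5 = conditional_pattern_base("I5", header)
patterns = fp_growth(weighted, 2)
with_i5 = {s for s in patterns if "I5" in s and len(s) > 1}
```

Each recursive call treats a conditional pattern base as a small weighted data set, so the support count of an element inside a conditional tree is also the support count of that element joined with the current suffix.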
FIG. 8 is a flow chart of mining association rules for unbalanced categories using FP-Growth according to an embodiment of the application, as shown in FIG. 8, comprising:
step S801, obtaining a set S of frequent item sets with the support number greater than a threshold minSupport on a marked positive sample transaction data set D1 by using FP-Growth, wherein the support number is the occurrence number of the frequent item sets;
step S802, counting the times TPk and FPk of each frequent item set Ik in the data set D1 and the data set D0 respectively;
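in step S803, the recall, precision, and F-score of each frequent item set are calculated from TPk and FPk, wherein, with |D1| denoting the number of transactions in the positive sample data set D1,
$\mathrm{recall}_k = \dfrac{TP_k}{|D1|}$;
$\mathrm{precision}_k = \dfrac{TP_k}{TP_k + FP_k}$;
$\text{F-score}_k = \dfrac{2 \cdot \mathrm{precision}_k \cdot \mathrm{recall}_k}{\mathrm{precision}_k + \mathrm{recall}_k}$.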
step S804, sorting the frequent item sets by F-score and selecting the frequent item set with the largest F-score as a rule, provided that the F-score of the rule is larger than the set threshold minF-score;
in step S805, the transaction data hit by the rule is removed from D1 and D0 and the support number threshold is adjusted to find the next rule.
Step S806, repeating steps S801-S805 until the rule meeting the condition can not be mined through the FP-Growth.
Step S304, the association rules of the unbalanced category data are stored in the association rule knowledge base in the order in which they were found.
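The loop of steps S801 to S806 can be sketched as follows. Several parts are assumptions made to keep the example short: a brute-force enumeration (`frequent_itemsets`) stands in for FP-Growth, the F-score is taken to be the balanced harmonic mean of precision and recall, recall is measured against the positives still remaining in D1, and the support number threshold is left unchanged between rounds because the text does not say how it is adjusted. All function names and the sample data are hypothetical.

```python
from itertools import combinations

def frequent_itemsets(transactions, min_count, max_len=3):
    """Brute-force stand-in for FP-Growth: item sets occurring more than min_count times."""
    items = sorted(set().union(*transactions))
    found = []
    for k in range(1, max_len + 1):
        for combo in combinations(items, k):
            s = frozenset(combo)
            if sum(1 for t in transactions if s <= t) > min_count:
                found.append(s)
    return found

def f_score(itemset, d1, d0):
    tp = sum(1 for t in d1 if itemset <= t)  # TPk: occurrences in the positive set D1
    fp = sum(1 for t in d0 if itemset <= t)  # FPk: occurrences in the negative set D0
    if tp == 0:
        return 0.0
    recall = tp / len(d1)
    precision = tp / (tp + fp)
    return 2 * precision * recall / (precision + recall)

def mine_rules(d1, d0, min_count, min_f):
    d1, d0 = list(d1), list(d0)
    rules = []
    while d1:
        candidates = frequent_itemsets(d1, min_count)            # S801
        scored = [(f_score(s, d1, d0), s) for s in candidates]   # S802, S803
        if not scored:
            break                                                # S806: nothing left to mine
        best_f, best = max(scored, key=lambda pair: pair[0])     # S804
        if best_f <= min_f:
            break
        rules.append((best, best_f))
        d1 = [t for t in d1 if not best <= t]                    # S805: drop hit transactions
        d0 = [t for t in d0 if not best <= t]
    return rules

d1 = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"d"}]
d0 = [{"a"}, {"c"}, {"c", "d"}, {"b", "e"}, {"e"}, {"e"}]
rules = mine_rules(d1, d0, min_count=1, min_f=0.5)
```

On this toy input the first round selects the item set {a}, whose precision and recall are both 3/4; after the transactions it hits are removed, no item set in D1 exceeds the support threshold and the loop stops.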
Example 2
According to another embodiment of the present application, there is also provided an association rule determining apparatus of an unbalanced sample, and fig. 9 is a block diagram of the association rule determining apparatus of an unbalanced sample according to an embodiment of the present application, as shown in fig. 9, including:
a conversion module 92 for converting the original data into transaction data;
the marking module 94 is configured to perform classification marking on the positive and negative unbalance sample data in the transaction data to obtain a positive and negative unbalance mark sample;
a determining module 96 is configured to determine a classification association rule of the target variable in the positive and negative imbalance sample data based on the positive and negative imbalance flag sample.
Optionally, the marking module 94 includes:
a dividing sub-module, configured to divide the transaction data into a normal client set and an abnormal client set with fraudulent activity;
and the marking sub-module is used for marking the abnormal client set and the normal client set in the transaction data to obtain the positive and negative unbalance marking sample.
Optionally, the determining module 96 includes:
a first determining submodule, configured to determine a plurality of frequent item sets on the abnormal client set through FP-Growth;
an acquisition sub-module, configured to acquire a plurality of candidate frequent item sets with a support number greater than a support number threshold from the plurality of frequent item sets;
a second determination submodule for determining comprehensive representation values of the plurality of candidate frequent item sets on the normal client set and the abnormal client set;
and the third determination submodule is used for determining the classification association rule of the target variable according to the comprehensive representation value.
Optionally, the second determining sub-module is further configured to:
count, for each of the plurality of frequent item sets, the times TPk and FPk that the frequent item set occurs in the normal client set and the abnormal client set respectively;
determine the recall and precision of each frequent item set based on the number TPk and the number FPk;
and determine the comprehensive representation value of each frequent item set according to its recall and precision.
Optionally, the third determining submodule includes:
an acquiring unit, configured to select the target frequent item set with the maximum comprehensive representation value from a plurality of target frequent item sets;
and a determining unit, configured to determine the target frequent item set as the classification association rule of the target variable after determining that the comprehensive representation value of the target frequent item set is larger than a preset minimum threshold value.
Optionally, the acquiring unit is further configured to:
sort the plurality of target frequent item sets according to the comprehensive representation value;
and select the target frequent item set with the maximum comprehensive representation value from the sorted target frequent item sets.
Optionally, the apparatus further comprises:
a removing module, configured to remove the target frequent item set from the frequent item set, and adjust the support number threshold;
an execution module configured to repeatedly perform the following steps until the classification association rule cannot be determined from the plurality of frequent item sets:
acquiring a plurality of candidate frequent item sets with support numbers greater than the support number threshold value from the plurality of frequent item sets;
determining a composite representation value of the plurality of candidate frequent item sets on the normal client set and the abnormal client set;
and determining the classification association rule of the target variable according to the comprehensive representation value.
It should be noted that each of the above modules may be implemented by software or hardware, and for the latter, it may be implemented by, but not limited to: the modules are all located in the same processor; alternatively, the above modules may be located in different processors in any combination.
Example 3
Embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
Alternatively, in the present embodiment, the above-described storage medium may be configured to store a computer program for performing the steps of:
s1, converting original data into transaction data;
s2, classifying and marking the positive and negative unbalance sample data in the transaction data to obtain positive and negative unbalance mark samples;
s3, determining classification association rules of target variables in the positive and negative unbalance sample data based on the positive and negative unbalance mark samples.
Alternatively, in the present embodiment, the storage medium may include, but is not limited to: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or other various media capable of storing a computer program.
Example 4
An embodiment of the application also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, where the transmission device is connected to the processor, and the input/output device is connected to the processor.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:
s1, converting original data into transaction data;
s2, classifying and marking the positive and negative unbalance sample data in the transaction data to obtain positive and negative unbalance mark samples;
s3, determining classification association rules of target variables in the positive and negative unbalance sample data based on the positive and negative unbalance mark samples.
Optionally, for specific examples in this embodiment, refer to the examples described in the foregoing embodiments and optional implementations; they are not repeated here.
It will be appreciated by those skilled in the art that the modules or steps of the application described above may be implemented on a general-purpose computing device. They may be concentrated on a single computing device or distributed across a network of multiple computing devices. Optionally, they may be implemented as program code executable by a computing device, so that they can be stored in a storage device and executed by the computing device. In some cases, the steps shown or described may be performed in an order different from the one given here. Alternatively, the modules or steps may each be fabricated as an individual integrated circuit module, or several of them may be fabricated as a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.
The above description is only of the preferred embodiments of the present application and is not intended to limit the present application, but various modifications and variations can be made to the present application by those skilled in the art. Any modification, equivalent replacement, improvement, etc. made within the principle of the present application should be included in the protection scope of the present application.
Claims (8)
1. A method for determining association rules of unbalanced samples, comprising:
converting the original data into transaction data;
classifying and marking the positive and negative unbalance sample data in the transaction data to obtain positive and negative unbalance mark samples;
based on the positive and negative unbalance mark samples, determining a classification association rule of a target variable in the positive and negative unbalance sample data, including: determining a plurality of frequent item sets on the abnormal client set through FP-Growth; acquiring a plurality of candidate frequent item sets with the support number larger than a support number threshold value from the plurality of frequent item sets; determining a composite representation value of the plurality of candidate frequent item sets on a normal client set and the abnormal client set; determining a classification association rule of the target variable according to the comprehensive representation value, wherein the positive and negative unbalance marking sample comprises a marked normal client set and a marked abnormal client set; wherein determining the combined performance values of the plurality of candidate frequent item sets on the normal client set and the abnormal client set comprises: counting the times TPk and FPk of each frequent item set in the normal client set and the abnormal client set respectively in the plurality of frequent item sets; determining recall and accuracy for each frequent itemset based on the number TPk and the number FPk; and determining the comprehensive representation value of each frequent item set according to the recall rate and the precision of each frequent item set.
2. The method of claim 1, wherein classifying and marking positive and negative imbalance sample data in the transaction data to obtain positive and negative imbalance marked samples comprises:
dividing the transaction data into a normal client set and an abnormal client set with fraudulent activity;
and marking the abnormal client set and the normal client set in the transaction data to obtain the positive and negative unbalance marking sample.
3. The method of claim 1, wherein determining the classification association rule for the target variable based on the composite representation value comprises:
selecting a target frequent item set with the maximum comprehensive expression value from a plurality of target frequent item sets;
and after the target frequent item set is determined to be larger than a preset minimum threshold value, determining the target frequent item set as a classification association rule of the target variable.
4. The method of claim 3, wherein selecting the target frequent item set having the greatest aggregate performance value from the plurality of target frequent item sets comprises:
sorting the plurality of target frequent item sets according to the comprehensive representation value;
and selecting the target frequent item set with the maximum comprehensive expression value from the sorted target frequent item sets.
5. The method of claim 4, wherein after determining the target frequent item set as the classification association rule for the target variable, the method further comprises:
removing the target frequent item set from the frequent item set and adjusting the support number threshold;
repeatedly performing the following steps until the classification association rule cannot be determined from the plurality of frequent item sets:
acquiring a plurality of candidate frequent item sets with support numbers greater than the support number threshold value from the plurality of frequent item sets;
determining a composite representation value of the plurality of candidate frequent item sets on the normal client set and the abnormal client set;
and determining the classification association rule of the target variable according to the comprehensive representation value.
6. An association rule determining apparatus for an unbalanced sample, comprising:
the conversion module is used for converting the original data into transaction data;
the marking module is used for classifying and marking the positive and negative unbalance sample data in the transaction data to obtain positive and negative unbalance marking samples;
the determining module is configured to determine, based on the positive and negative unbalance flag samples, a classification association rule of a target variable in the positive and negative unbalance sample data, and includes: determining a plurality of frequent item sets on the abnormal client set through FP-Growth; acquiring a plurality of candidate frequent item sets with the support number larger than a support number threshold value from the plurality of frequent item sets; determining a composite representation value of the plurality of candidate frequent item sets on a normal client set and the abnormal client set; determining a classification association rule of the target variable according to the comprehensive representation value, wherein the positive and negative unbalance marking sample comprises a marked normal client set and a marked abnormal client set; wherein determining the combined performance values of the plurality of candidate frequent item sets on the normal client set and the abnormal client set comprises: counting the times TPk and FPk of each frequent item set in the normal client set and the abnormal client set respectively in the plurality of frequent item sets; determining recall and accuracy for each frequent itemset based on the number TPk and the number FPk; and determining the comprehensive representation value of each frequent item set according to the recall rate and the precision of each frequent item set.
7. A computer-readable storage medium, characterized in that the storage medium has stored therein a computer program, wherein the computer program is arranged to execute the method of any of the claims 1 to 5 when run.
8. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to run the computer program to perform the method of any of the claims 1 to 5.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN202110622409.7A CN113282686B (en) | 2021-06-03 | 2021-06-03 | Association rule determining method and device for unbalanced sample |
Publications (2)
Publication Number | Publication Date |
---|---|
CN113282686A (en) | 2021-08-20 |
CN113282686B (en) | 2023-11-07 |
Family
ID=77283445
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN202110622409.7A Active CN113282686B (en) | 2021-06-03 | 2021-06-03 | Association rule determining method and device for unbalanced sample |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN113282686B (en) |
Families Citing this family (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN114221858B (en) * | 2021-12-15 | 2022-09-30 | 中山大学 | SDN network fault positioning method, device, equipment and readable storage medium |
Citations (18)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
WO2007147166A2 (en) * | 2006-06-16 | 2007-12-21 | Quantum Leap Research, Inc. | Consilence of data-mining |
CN103731738A (en) * | 2014-01-23 | 2014-04-16 | 哈尔滨理工大学 | Video recommendation method and device based on user group behavioral analysis |
CN103995882A (en) * | 2014-05-28 | 2014-08-20 | 南京大学 | Probability frequent item set excavating method based on MapReduce |
CN104239437A (en) * | 2014-08-28 | 2014-12-24 | 国家电网公司 | Power-network-dispatching-oriented intelligent warning analysis method |
CN104537025A (en) * | 2014-12-19 | 2015-04-22 | 北京邮电大学 | Frequent sequence mining method |
CN105306475A (en) * | 2015-11-05 | 2016-02-03 | 天津理工大学 | Network intrusion detection method based on association rule classification |
CN105740245A (en) * | 2014-12-08 | 2016-07-06 | 北京邮电大学 | Frequent item set mining method |
CN106529580A (en) * | 2016-10-24 | 2017-03-22 | 浙江工业大学 | EDSVM-based software defect data association classification method |
CN107590516A (en) * | 2017-09-16 | 2018-01-16 | 电子科技大学 | Gas pipeline leak detection recognition methods based on Fibre Optical Sensor data mining |
CN108376347A (en) * | 2018-02-27 | 2018-08-07 | 广西财经学院 | A kind of commodity classification method based on A weighting priori algorithms |
CN108806767A (en) * | 2018-06-15 | 2018-11-13 | 中南大学 | Disease symptoms association analysis method based on electronic health record |
CN108830321A (en) * | 2018-06-15 | 2018-11-16 | 中南大学 | The classification method of unbalanced dataset |
CN110990461A (en) * | 2019-12-12 | 2020-04-10 | 国家电网有限公司大数据中心 | Big data analysis model algorithm model selection method and device, electronic equipment and medium |
CN111309777A (en) * | 2020-01-14 | 2020-06-19 | 哈尔滨工业大学 | Report data mining method for improving association rule based on mutual exclusion expression |
CN111782512A (en) * | 2020-06-23 | 2020-10-16 | 北京高质系统科技有限公司 | Multi-feature software defect comprehensive prediction method based on unbalanced noise set |
CN112380274A (en) * | 2020-11-16 | 2021-02-19 | 北京航空航天大学 | Control process-oriented anomaly detection system |
CN112723075A (en) * | 2021-01-04 | 2021-04-30 | 浙江新再灵科技股份有限公司 | Method for analyzing elevator vibration influence factors with unbalanced data |
CN112884179A (en) * | 2021-03-30 | 2021-06-01 | 北京交通大学 | Urban rail turn-back fault diagnosis method based on machine fault and text topic analysis |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||