CN113282686B - Association rule determining method and device for unbalanced sample - Google Patents


Info

Publication number
CN113282686B
CN113282686B CN202110622409.7A CN202110622409A CN113282686B CN 113282686 B CN113282686 B CN 113282686B CN 202110622409 A CN202110622409 A CN 202110622409A CN 113282686 B CN113282686 B CN 113282686B
Authority
CN
China
Prior art keywords
frequent item
determining
positive
target
item sets
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202110622409.7A
Other languages
Chinese (zh)
Other versions
CN113282686A (en)
Inventor
魏乐
卢格润
李琨
郑方兰
朱良姝
白冰
田江
向小佳
丁永建
李璠
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Everbright Technology Co ltd
Original Assignee
Everbright Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Everbright Technology Co ltd
Priority to CN202110622409.7A
Publication of CN113282686A
Application granted
Publication of CN113282686B

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/21 Design, administration or maintenance of databases
    • G06F16/215 Improving data quality; Data cleansing, e.g. de-duplication, removing invalid entries or correcting typographical errors
    • G06F16/28 Databases characterised by their database models, e.g. relational or object models
    • G06F16/284 Relational databases
    • G06F16/285 Clustering or classification
    • Y GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02P CLIMATE CHANGE MITIGATION TECHNOLOGIES IN THE PRODUCTION OR PROCESSING OF GOODS
    • Y02P90/00 Enabling technologies with a potential contribution to greenhouse gas [GHG] emissions mitigation
    • Y02P90/30 Computing systems specially adapted for manufacturing

Abstract

The application provides a method and a device for determining association rules for unbalanced samples. The method comprises: converting original data into transaction data; classifying and marking the positive and negative unbalanced sample data in the transaction data to obtain marked positive and negative unbalanced samples; and determining, based on the marked positive and negative unbalanced samples, classification association rules for a target variable in the positive and negative unbalanced sample data. The application addresses two problems that arise in the related art when classifying positive and negative unbalanced samples: the traditional association rule classification algorithm consumes substantial system resources, and association rule screening does not take the information of the full sample into account, which harms model generalization. In implementation, the application focuses on classification rules that take the small (minority-class) sample as the target variable under unbalanced-sample conditions, thereby avoiding mining of the entire data set and reducing the consumption of system resources.

Description

Association rule determining method and device for unbalanced sample
Technical Field
The application relates to the field of data processing, in particular to a method and a device for determining association rules of unbalanced samples.
Background
Financial fraud, particularly in fields such as bank cards and credit cards, has long been an important focus of risk control in the financial industry. Although a series of AI anti-fraud models has emerged with the development of deep learning, rule-based systems retain strong vitality because they are interpretable, easy to generalize, and stable. An effective approach to detecting financial fraud today is to discover candidate association rules from existing cases through data mining and to finalize the rules with expert analysis. Traditional financial institutions still rely on rule-based risk-control and anti-fraud models, for which generating the rule feature library is critical. Unlike traditional manual induction or offline investigation, however, in a big-data setting a combined strategy of automatic rule extraction, modeling, and expert analysis can improve mining efficiency while preserving the original interpretability.
Association analysis is a data mining method that is widely applied in biology, transportation, telecommunications, and finance because it is easy to understand and generalizes well. Association rules can be applied to classification problems in several ways, for example ensemble algorithms based on Apriori, classification algorithms based on multiple class-association rules, and algorithms that use a greedy strategy to mine association rules directly from the training data set and thereby reduce computational overhead. In general, classification algorithms based on association rules are accurate and robust.
However, consider binary classification problems that involve unbalanced samples, such as medical diagnosis, rail signals, and financial fraud. If the target samples make up only a small proportion of the total and the traditional association rule classification algorithm is applied directly, then, especially when the data set contains many instances, combining item sets produces a large number of candidate item sets, and computing the frequent item sets among them consumes substantial system resources. In addition, the support and confidence thresholds are difficult to set during pruning: a threshold that is too low retains too many redundant rules and causes overfitting, while a threshold that is too high prunes many association rules on the target samples and harms the generalization of the model.
In the related art, for classification scenarios with unbalanced samples, the traditional association rule classification algorithm consumes substantial system resources, and association rule screening does not consider the information of the full sample, which harms model generalization. No solution to these problems has yet been proposed.
Disclosure of Invention
The embodiments of the application provide a method and a device for determining association rules for unbalanced samples, so as to at least solve the problems in the related art that, in classification scenarios with unbalanced samples, the traditional association rule classification algorithm consumes substantial system resources and association rule screening does not consider the information of the full sample, which harms model generalization.
According to an embodiment of the present application, there is provided an association rule determining method of an unbalanced sample, including:
converting the original data into transaction data;
classifying and marking the positive and negative unbalanced sample data in the transaction data to obtain marked positive and negative unbalanced samples;
and determining classification association rules for a target variable in the positive and negative unbalanced sample data based on the marked positive and negative unbalanced samples.
Optionally, classifying and marking the positive and negative unbalanced sample data in the transaction data to obtain the marked positive and negative unbalanced samples includes:
dividing the transaction data into a normal client set and an abnormal client set with fraudulent behavior;
and marking the abnormal client set and the normal client set in the transaction data to obtain the marked positive and negative unbalanced samples.
Optionally, determining, based on the marked positive and negative unbalanced samples, the classification association rule of the target variable in the positive and negative unbalanced sample data includes:
determining a plurality of frequent item sets on the abnormal client set through FP-Growth;
acquiring, from the plurality of frequent item sets, a plurality of candidate frequent item sets whose support count is greater than a support count threshold;
determining comprehensive performance values of the plurality of candidate frequent item sets on the normal client set and the abnormal client set;
and determining the classification association rule of the target variable according to the comprehensive performance values.
Optionally, determining the comprehensive performance values of the plurality of candidate frequent item sets on the normal client set and the abnormal client set includes:
counting, for each of the plurality of frequent item sets, the numbers of occurrences TPk and FPk in the abnormal client set and the normal client set, respectively;
determining the recall and the precision of each frequent item set based on TPk and FPk;
and determining the comprehensive performance value of each frequent item set according to its recall and precision.
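One possible way to write these quantities down is sketched below. It rests on assumptions that this passage does not state explicitly: the abnormal client set is treated as the positive class, so that TP_k counts the abnormal-client transactions containing item set k and FP_k counts the normal-client transactions containing it; N_abn is a hypothetical symbol for the number of transactions in the abnormal client set; and the comprehensive performance value is taken to be the balanced F-score (F1), although other weightings of recall and precision would also fit the description.

```latex
\[
\mathrm{recall}_k = \frac{TP_k}{N_{\mathrm{abn}}}, \qquad
\mathrm{precision}_k = \frac{TP_k}{TP_k + FP_k}, \qquad
F_k = \frac{2 \cdot \mathrm{precision}_k \cdot \mathrm{recall}_k}
           {\mathrm{precision}_k + \mathrm{recall}_k}
\]
```

Under these assumptions, an item set that appears in 2 of 3 abnormal-client transactions and in no normal-client transaction has recall 2/3, precision 1, and F-score 0.8.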
Optionally, determining the classification association rule of the target variable according to the comprehensive performance values includes:
selecting, from the plurality of candidate frequent item sets, a target frequent item set with the largest comprehensive performance value;
and after determining that the comprehensive performance value of the target frequent item set is greater than a preset minimum threshold, determining the target frequent item set as a classification association rule of the target variable.
Optionally, selecting the target frequent item set with the largest comprehensive performance value from the plurality of candidate frequent item sets includes:
sorting the plurality of candidate frequent item sets according to their comprehensive performance values;
and selecting the target frequent item set with the largest comprehensive performance value from the sorted candidate frequent item sets.
Optionally, after determining the target frequent item set as the classification association rule of the target variable, the method further comprises:
removing the target frequent item set from the plurality of frequent item sets and adjusting the support count threshold;
repeatedly performing the following steps until no further classification association rule can be determined from the plurality of frequent item sets:
acquiring, from the plurality of frequent item sets, a plurality of candidate frequent item sets whose support count is greater than the support count threshold;
determining comprehensive performance values of the plurality of candidate frequent item sets on the normal client set and the abnormal client set;
and determining the classification association rule of the target variable according to the comprehensive performance values.
According to another embodiment of the present application, there is also provided an association rule determining apparatus of an unbalanced sample, including:
the conversion module is used for converting the original data into transaction data;
the marking module is used for classifying and marking the positive and negative unbalanced sample data in the transaction data to obtain marked positive and negative unbalanced samples;
and the determining module is used for determining classification association rules for a target variable in the positive and negative unbalanced sample data based on the marked positive and negative unbalanced samples.
Optionally, the marking module includes:
a dividing sub-module, used for dividing the transaction data into a normal client set and an abnormal client set with fraudulent behavior;
and a marking sub-module, used for marking the abnormal client set and the normal client set in the transaction data to obtain the marked positive and negative unbalanced samples.
Optionally, the determining module includes:
a first determining sub-module, configured to determine a plurality of frequent item sets on the abnormal client set through FP-Growth;
an acquisition sub-module, configured to acquire, from the plurality of frequent item sets, a plurality of candidate frequent item sets whose support count is greater than a support count threshold;
a second determining sub-module, configured to determine comprehensive performance values of the plurality of candidate frequent item sets on the normal client set and the abnormal client set;
and a third determining sub-module, configured to determine the classification association rule of the target variable according to the comprehensive performance values.
Optionally, the second determining sub-module is further configured to:
count, for each of the plurality of frequent item sets, the numbers of occurrences TPk and FPk in the abnormal client set and the normal client set, respectively;
determine the recall and the precision of each frequent item set based on TPk and FPk;
and determine the comprehensive performance value of each frequent item set according to its recall and precision.
Optionally, the third determining sub-module includes:
an acquisition unit, configured to select, from the plurality of candidate frequent item sets, a target frequent item set with the largest comprehensive performance value;
and a determining unit, configured to determine the target frequent item set as the classification association rule of the target variable after determining that the comprehensive performance value of the target frequent item set is greater than a preset minimum threshold.
Optionally, the acquisition unit is further configured to:
sort the plurality of candidate frequent item sets according to their comprehensive performance values;
and select the target frequent item set with the largest comprehensive performance value from the sorted candidate frequent item sets.
Optionally, the apparatus further comprises:
a removing module, configured to remove the target frequent item set from the plurality of frequent item sets and adjust the support count threshold;
an execution module, configured to repeatedly perform the following steps until no further classification association rule can be determined from the plurality of frequent item sets:
acquiring, from the plurality of frequent item sets, a plurality of candidate frequent item sets whose support count is greater than the support count threshold;
determining comprehensive performance values of the plurality of candidate frequent item sets on the normal client set and the abnormal client set;
and determining the classification association rule of the target variable according to the comprehensive performance values.
According to a further embodiment of the application, there is also provided a computer-readable storage medium having stored therein a computer program, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
According to a further embodiment of the application, there is also provided an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
According to the application, original data is converted into transaction data; the positive and negative unbalanced sample data in the transaction data is classified and marked to obtain marked positive and negative unbalanced samples; and classification association rules for the target variable in the positive and negative unbalanced sample data are determined based on the marked samples. This solves the problems in the related art that, in classification scenarios with unbalanced samples, the traditional association rule classification algorithm consumes substantial system resources and association rule screening does not consider the information of the full sample, which harms model generalization. Because the application focuses on classification rules that take the small sample as the target variable under unbalanced-sample conditions, mining of the entire data set is avoided and system resource consumption is reduced.
Drawings
The accompanying drawings, which are included to provide a further understanding of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and do not constitute a limitation on the application. In the drawings:
FIG. 1 is a block diagram of the hardware structure of a mobile terminal for an association rule determining method of an unbalanced sample according to an embodiment of the present application;
FIG. 2 is a flow chart of a method of association rule determination for unbalanced samples according to an embodiment of the present application;
FIG. 3 is a flow chart of unbalanced category rule discovery based on association analysis and association classification in accordance with an embodiment of the present application;
FIG. 4 is a flow chart of unbalanced category marking according to an embodiment of the present application;
FIG. 5 is a flow chart of the construction of an FP tree according to an embodiment of the application;
FIG. 6 is a flow chart of mining frequent item sets based on an FP tree according to an embodiment of the application;
FIG. 7 is a schematic diagram of FP tree construction according to the present embodiment;
FIG. 8 is a flow chart of mining association rules for unbalanced categories using FP-Growth in accordance with an embodiment of the application;
FIG. 9 is a block diagram of an association rule determining apparatus of an unbalanced sample according to an embodiment of the present application.
Detailed Description
The application will be described in detail hereinafter with reference to the drawings in conjunction with embodiments. It should be noted that, without conflict, the embodiments of the present application and features of the embodiments may be combined with each other.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order.
Examples
The method according to the first embodiment of the present application may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking execution on a mobile terminal as an example, FIG. 1 is a block diagram of the hardware structure of a mobile terminal for the association rule determining method of an unbalanced sample according to an embodiment of the present application. As shown in FIG. 1, the mobile terminal may include one or more processors 102 (only one is shown in FIG. 1; the processor 102 may include, but is not limited to, a processing device such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data. Optionally, the mobile terminal may further include a transmission device 106 for communication functions and an input/output device 108. It will be appreciated by those skilled in the art that the structure shown in FIG. 1 is merely illustrative and does not limit the structure of the mobile terminal described above. For example, the mobile terminal may include more or fewer components than shown in FIG. 1, or have a different configuration from that shown in FIG. 1.
The memory 104 may be used to store computer programs, for example software programs and modules of application software, such as the computer program corresponding to the association rule determining method of the unbalanced sample in the embodiment of the present application. The processor 102 runs the computer program stored in the memory 104 to perform various functional applications and data processing, that is, to implement the above method. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the mobile terminal via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or transmit data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal. In one example, the transmission device 106 includes a network adapter (Network Interface Controller, simply referred to as NIC) that can connect to other network devices through a base station to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is configured to communicate with the internet wirelessly.
This embodiment provides a method for determining association rules of an unbalanced sample that runs on the above mobile terminal or network architecture. FIG. 2 is a flowchart of the method according to an embodiment of the present application. As shown in FIG. 2, the flow includes the following steps:
Step S202, converting the original data into transaction data;
Step S204, classifying and marking the positive and negative unbalanced sample data in the transaction data to obtain marked positive and negative unbalanced samples;
In this embodiment, step S204 may specifically include: dividing the transaction data into a normal client set and an abnormal client set with fraudulent behavior, wherein the unbalanced class data comprises the abnormal client set; and marking the abnormal client set in the transaction data to obtain the unbalanced sample.
Step S206, determining classification association rules for the target variable in the positive and negative unbalanced sample data based on the marked positive and negative unbalanced samples.
Through steps S202 to S206 above, the original data is converted into transaction data; the positive and negative unbalanced sample data in the transaction data is classified and marked to obtain marked positive and negative unbalanced samples; and classification association rules for the target variable in the positive and negative unbalanced sample data are determined based on the marked samples. This solves the problems in the related art that, in classification scenarios with unbalanced samples, the traditional association rule classification algorithm consumes substantial system resources and association rule screening does not consider the information of the full sample, which harms model generalization. Because the focus is on classification rules that take the small sample as the target variable under unbalanced-sample conditions, mining of the entire data set is avoided and system resource consumption is reduced.
In this embodiment, step S206 may specifically include:
S2061, determining a plurality of frequent item sets on the abnormal client set through FP-Growth;
S2062, acquiring, from the plurality of frequent item sets, a plurality of candidate frequent item sets whose support count is greater than a support count threshold;
S2063, determining comprehensive performance values of the plurality of candidate frequent item sets on the normal client set and the abnormal client set, where the comprehensive performance value may specifically be an F-score;
Further, step S2063 may specifically include: counting, for each of the plurality of frequent item sets, the numbers of occurrences TPk and FPk in the abnormal client set and the normal client set, respectively; determining the recall and the precision of each frequent item set based on TPk and FPk; and determining the comprehensive performance value of each frequent item set according to its recall and precision.
S2064, determining the classification association rule of the target variable according to the comprehensive performance values, where the target variable may be an abnormal client.
Further, step S2064 may specifically include: selecting, from the plurality of candidate frequent item sets, a target frequent item set with the largest comprehensive performance value, and, after determining that the comprehensive performance value of the target frequent item set is greater than a preset minimum threshold, determining the target frequent item set as the association rule. Specifically, the plurality of candidate frequent item sets are sorted according to their comprehensive performance values, in either ascending or descending order, and the target frequent item set with the largest comprehensive performance value is selected from the sorted candidate frequent item sets; that is, the frequent item set with the largest comprehensive performance value is determined as an association rule.
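Steps S2063 and S2064 can be illustrated with a short Python sketch. It is not the patented implementation: the function names (`f_score`, `count_occurrences`, `select_rule`) are hypothetical, the comprehensive performance value is assumed to be the balanced F-score (F1), and TPk and FPk are assumed to be the numbers of abnormal-client and normal-client transactions, respectively, that contain the item set.

```python
from typing import FrozenSet, Iterable, List, Optional, Set, Tuple

Itemset = FrozenSet[str]


def f_score(tp: int, fp: int, n_pos: int) -> float:
    """Balanced F-score of the rule 'item set present => abnormal client'.

    Assumption: recall = TP / (number of abnormal-client transactions),
    precision = TP / (TP + FP).
    """
    if tp == 0 or n_pos == 0:
        return 0.0
    recall = tp / n_pos
    precision = tp / (tp + fp)
    return 2 * precision * recall / (precision + recall)


def count_occurrences(itemset: Itemset, transactions: Iterable[Set[str]]) -> int:
    """Number of transactions that contain every item of the item set."""
    return sum(1 for t in transactions if itemset <= t)


def select_rule(
    candidates: List[Itemset],
    abnormal: List[Set[str]],
    normal: List[Set[str]],
    min_score: float,
) -> Optional[Tuple[Itemset, float]]:
    """Score every candidate, sort in descending order, and return the best
    candidate if its score exceeds the preset minimum threshold."""
    scored = []
    for items in candidates:
        tp = count_occurrences(items, abnormal)  # hits on the abnormal set
        fp = count_occurrences(items, normal)    # hits on the normal set
        scored.append((f_score(tp, fp, len(abnormal)), items))
    if not scored:
        return None
    scored.sort(key=lambda pair: pair[0], reverse=True)
    best_score, best_items = scored[0]
    return (best_items, best_score) if best_score > min_score else None
```

Sorting is used here only because the text describes it; taking the maximum directly would select the same item set.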
In an alternative embodiment, after the target frequent item set is determined as an association rule, the target frequent item set is removed from the plurality of frequent item sets and the support count threshold is adjusted. Steps S2062 to S2064 are then repeated until no further association rule can be determined from the plurality of frequent item sets, that is, until all association rules in the plurality of frequent item sets have been mined.
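The repeated selection can be sketched as a loop. The following Python sketch is illustrative only: `mine_rules`, `score`, and `adjust` are hypothetical names, and because this passage does not say how the support count threshold is adjusted, the adjustment is left as a caller-supplied function that by default leaves the threshold unchanged.

```python
from typing import Callable, Dict, FrozenSet, List, Tuple

Itemset = FrozenSet[str]


def mine_rules(
    support_count: Dict[Itemset, int],
    score: Callable[[Itemset], float],
    support_threshold: int,
    min_score: float,
    adjust: Callable[[int], int] = lambda threshold: threshold,
) -> List[Tuple[Itemset, float]]:
    """Repeat S2062 to S2064 until no further rule can be determined.

    support_count maps each frequent item set to its support count on the
    abnormal client set; score returns its comprehensive performance value.
    The adjust callback is an assumption, not taken from the source.
    """
    remaining = dict(support_count)
    rules: List[Tuple[Itemset, float]] = []
    while True:
        # S2062: candidates whose support count exceeds the threshold.
        candidates = [s for s, n in remaining.items() if n > support_threshold]
        if not candidates:
            break
        # S2063/S2064: pick the candidate with the largest score.
        best = max(candidates, key=score)
        best_score = score(best)
        if best_score <= min_score:
            break
        rules.append((best, best_score))
        # Remove the selected item set and adjust the threshold.
        del remaining[best]
        support_threshold = adjust(support_threshold)
    return rules
```

Each pass removes one item set from the pool, so the loop ends after at most as many passes as there are frequent item sets.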
For classification rule mining on unbalanced samples in financial fraud detection, the embodiment of the application provides a fast, association-rule-based rule discovery technique for target classification, which consists of FP-Growth-based rule mining and an F-score-based association rule classifier. It mainly involves converting original data into transaction data, marking the unbalanced categories, quickly discovering unbalanced category rules, and finally generating association rules and storing them in an association rule knowledge base. First, only the rules on the target samples that satisfy the F-score requirement are extracted, which avoids mining the entire data set and reduces the cost of computing frequent item sets. Second, association rules are generated with the F-score as the credibility measure, which incorporates information from the full classified sample, screens more accurately for rules that classify the target small sample well, alleviates to some extent the overfitting caused by redundant association rules, and preserves a degree of generalization. Evaluating unbalanced category rules with the F-score strengthens the credibility and practical value of the rules and can guide the construction of unbalanced category knowledge bases such as a financial fraud detection rule base. FIG. 3 is a flowchart of unbalanced category rule discovery based on association analysis and association classification according to an embodiment of the application. As shown in FIG. 3, the flow includes:
Step S301, converting the original data into transaction data;
In the stage of converting original data into transaction data, customer portrait data, customer behavior data, customer performance, and the corresponding discrete variables such as financial products, terms, and grades are processed separately and converted into transaction data. The conversion from original data to transaction data includes:
Data preprocessing: identifying and filling in missing values in the original data, identifying data types, and storing key statistics of continuous data, including the maximum, minimum, and median.
Data conversion: the conversion is guided by the result of data preprocessing. Specifically, each attribute is routed to the corresponding data conversion option according to its identified data type. Discrete data is converted into transaction data by a discrete data converter, and continuous data by a continuous data converter; for data that needs custom handling, the relevant statistics can be retrieved and the transaction data generated by a custom data converter. During automatic conversion, continuous data uses the median to determine the threshold, while custom data uses the maximum, minimum, median, and so on to determine the number of segments and the thresholds.
Finally, the attribute values of the original data are converted into transaction data recorded in binary form (occurs or does not occur).
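The conversion described above might look roughly as follows in Python. This is a minimal sketch under stated assumptions, not the converters of the application: `to_transactions` and the item names (`key=value`, `key>median`) are hypothetical, missing values are assumed to have been filled in already, every row is assumed to have the same attributes, and a continuous attribute is assumed to produce an item only when its value lies strictly above the median. The custom data converter is omitted.

```python
from statistics import median
from typing import Dict, List, Set, Union

Value = Union[str, int, float]


def to_transactions(rows: List[Dict[str, Value]]) -> List[Set[str]]:
    """Convert attribute-value rows into item sets (binary occurrence)."""
    if not rows:
        return []
    # Preprocessing: identify continuous attributes and store their median.
    medians: Dict[str, float] = {}
    for key in rows[0]:
        values = [row[key] for row in rows]
        if all(isinstance(v, (int, float)) and not isinstance(v, bool)
               for v in values):
            medians[key] = median(values)
    # Conversion: one item per discrete value, one thresholded item per
    # continuous attribute.
    transactions: List[Set[str]] = []
    for row in rows:
        items: Set[str] = set()
        for key, value in row.items():
            if key in medians:
                if value > medians[key]:      # continuous data converter
                    items.add(f"{key}>median")
            else:
                items.add(f"{key}={value}")   # discrete data converter
        transactions.append(items)
    return transactions
```

Each resulting set lists the items that occur in a record; an item that is absent from the set is one that does not occur.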
Step S302, marking the unbalanced category data;
In the unbalanced category data marking stage, clients with fraudulent behavior are marked by matching against bank account-opening accounts, and the data is divided into a set of clients with fraudulent behavior and a set of normally performing clients. When establishing and updating the detection rule base, more attention is paid to how the rules perform on the positive sample data set, and the rules are expected to screen out potential fraudulent clients more accurately.
In order to extract association rules for the unbalanced categories, transaction data for the unbalanced categories need to be tagged. The labeling mode is supervised classification, and if the classification label is successfully matched with the classification label, the classification method enters a positive sample data set, and otherwise, enters a negative sample data set. FIG. 4 is a flow chart of an imbalance category label according to an embodiment of the application, as shown in FIG. 4, comprising:
step S401, dividing transaction data into a client set with fraudulent activity and a normal performing client set;
step S402, judging whether the transaction data is a positive example (namely, whether it is data with fraudulent activity); if not, executing step S403, and if so, executing step S404;
step S403, marking the transaction data and adding it to the negative sample data set D0;
step S404, marking the transaction data and adding it to the positive sample data set D1;
step S405, the data sets D0, D1 are output.
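Steps S401 to S405 amount to a supervised split of the transaction data. A minimal sketch follows, assuming the fraud flag of each record is already available as a boolean; the function name `split_by_label` and the sample data are hypothetical.

```python
def split_by_label(transactions, is_fraud):
    """Return (D0, D1): negative (normal) and positive (fraudulent) transaction sets."""
    d0, d1 = [], []
    for transaction, flag in zip(transactions, is_fraud):
        # S402: positive example? S404 -> D1, otherwise S403 -> D0.
        (d1 if flag else d0).append(transaction)
    return d0, d1

# Hypothetical input: three transactions, the second one flagged as fraudulent.
sample = [{"a", "b"}, {"a", "c"}, {"b"}]
d0, d1 = split_by_label(sample, [False, True, False])
```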
Step S303, finding out an unbalanced class association rule;
In the association rule discovery stage, frequent item sets on the set of clients with fraudulent activity are mined with FP-Growth. Using the F-score as the evaluation index, the combined performance of each frequent item set on the fraudulent client set and the normally performing client set is evaluated; the frequent item sets whose F-score meets the minimum threshold are selected, ranked by F-score from largest to smallest, and used as association rules. The association rules extracted from these frequent item sets are stored in the association rule knowledge base.
Association rule discovery for the unbalanced categories is based on the association rule mining algorithm FP-Growth, which needs only two full scans of the data set to discover the frequent item sets. A frequent item set is an item set whose support is equal to or greater than the minimum support, where support is the frequency with which the item set appears among all transactions.
To mine association rules with FP-Growth, an FP tree is constructed first. The root node of the FP tree is an empty set; every other node consists of a single element and the number of times that element occurs in the data set, and elements that occur more often are closer to the root node. Nodes are linked into paths, and the elements along a path form a frequent set. FIG. 5 is a flowchart of the construction of an FP tree according to an embodiment of the application. As shown in FIG. 5, the construction comprises:
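The definition of support can be made concrete with a short sketch. The helper names `support_count` and `support` and the four toy transactions are assumptions for illustration; the computation itself is just the fraction of transactions that contain the item set.

```python
def support_count(itemset, transactions):
    """Number of transactions that contain every element of itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(itemset, transactions):
    """Fraction of all transactions that contain itemset."""
    return support_count(itemset, transactions) / len(transactions)

# Hypothetical transactions.
toy = [{"a", "b"}, {"a"}, {"b", "c"}, {"a", "b", "c"}]
ab_support = support({"a", "b"}, toy)   # {"a", "b"} occurs in 2 of 4 transactions
is_frequent = ab_support >= 0.5          # frequent under a minimum support of 0.5
```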
step S501, traversing each set, sorting the elements in the set according to the occurrence times of the elements in the total data set, and removing the elements which do not reach the minimum support degree;
step S502, traversing elements in the sets downwards from the root node in order for each set;
step S503, judging whether the corresponding node exists in the tree, executing step S504 if the judging result is yes, otherwise executing step S505;
step S504, the count value of the node is increased;
step S505, creating a branch;
step S506, determining whether the elements in the set are traversed, returning to step S502 if the determination result is negative, and ending if the determination result is positive.
While the tree is being built, each newly added node is looked up in the head pointer table. If the head pointer table does not contain the element, an entry for the element is created in the head pointer table; the newly added node is appended to the end of the linked list for that element, and the element's count in the head pointer table is increased. The head pointer table records the number of occurrences of every frequent item; each frequent item in the table is the head of a node linked list that points, in sequence, to the positions where the item occurs in the FP tree;
step S504, the process loops until all the sets are traversed.
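The construction flow of FIG. 5 can be sketched as below. This is an illustrative sketch under stated assumptions, not the embodiment's implementation: the names `Node` and `build_fp_tree` are hypothetical, the head pointer table is modelled as a dictionary of Python lists rather than a hand-built linked list, and ties between equally frequent elements are broken alphabetically.

```python
from collections import Counter

class Node:
    """One FP-tree node: an element, its count, its parent and its children."""
    def __init__(self, item, parent):
        self.item, self.parent = item, parent
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_count):
    # First scan: count elements and drop those below the minimum support.
    counts = Counter(i for t in transactions for i in set(t))
    frequent = {i: c for i, c in counts.items() if c >= min_count}
    root = Node(None, None)          # the root node is an empty set
    header = {}                      # head pointer table: element -> nodes in the tree
    # Second scan: insert each set, most frequent elements first (S501-S506).
    for t in transactions:
        ordered = sorted((i for i in set(t) if i in frequent),
                         key=lambda i: (-frequent[i], i))
        node = root
        for item in ordered:
            child = node.children.get(item)
            if child is None:        # S505: create a branch
                child = Node(item, node)
                node.children[item] = child
                header.setdefault(item, []).append(child)
            child.count += 1         # S504: increase the count of the node
            node = child
    return root, header, frequent

# Hypothetical transactions.
root, header, frequent = build_fp_tree(
    [{"a", "b"}, {"b", "c"}, {"a", "b", "c"}, {"b"}], min_count=2)
```

In this toy input every set contains "b", so all four sets share the single child of the root, and "c" appears on two different branches that the head pointer table links together.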
FIG. 6 is a flow chart of mining frequent item sets based on the FP tree, as shown in FIG. 6, according to an embodiment of the application, comprising:
step S601, the first frequent element of the head pointer table is fetched, and each tree path where the first frequent element is located is traversed;
step S602, backtracking to a root node in each tree path to obtain a condition mode base, wherein the condition mode base is a set of paths with searched elements as the end;
step S603, creating a new FP tree according to the obtained conditional pattern base;
step S604, recursively mining the FP tree;
step S605, in each round of the recursion, the current frequent element is joined with the prefix passed into the recursion to form a frequent item set, so that every frequent element in the head pointer table at every level of recursion generates a frequent item set.
FIG. 7 is a schematic diagram of FP tree construction according to this embodiment. As shown in FIG. 7, all transaction sets are recorded in a transaction database. For each set T0-T8 in the database, the elements are added in order downwards from the root node of the tree; if a node already exists, its count value is incremented, otherwise a branch is created. While the sets are traversed, each node newly added to the tree is looked up in the head pointer table; if the head pointer table does not contain the element, an entry for the element is created in it. The process loops until all sets have been traversed, which completes the construction of the FP tree.
After the FP tree is constructed, the frequent item sets can be mined from it. Using the FP tree with its head pointer table, the conditional pattern base of each element item is obtained. The conditional pattern base is the set of paths that end with the element item being sought; each such path is in fact a prefix path. Taking I5 as an example, with I5 as the pattern suffix the conditional pattern base {(I2, I1), (I2, I1, I3)} is obtained. FP-Growth is then called recursively, and combining the result with the pattern suffix I5 gives all frequent item sets with a support count of at least 2: {I2, I5}, {I1, I5} and {I2, I1, I5}.
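The recursion over conditional pattern bases can be sketched as follows. Two things are assumptions here. First, the sketch is a simplification that works directly on lists of (path, count) pairs instead of rebuilding an FP tree at every level, so it illustrates the pattern-growth idea rather than a full FP-Growth implementation. Second, the nine toy transactions are not taken from FIG. 7; they were chosen so that I5 produces the conditional pattern base quoted above under an assumed minimum support count of 2. The function name `mine` is hypothetical.

```python
from collections import Counter

def mine(pattern_base, min_count, suffix=frozenset()):
    """pattern_base: list of (ordered tuple of items, count). Returns {itemset: count}."""
    counts = Counter()
    for path, n in pattern_base:
        for item in path:
            counts[item] += n
    result = {}
    for item, c in counts.items():
        if c < min_count:
            continue
        itemset = suffix | {item}
        result[itemset] = c
        # Conditional pattern base of `item`: the prefix of every path that contains it.
        conditional = [(path[:path.index(item)], n)
                       for path, n in pattern_base if item in path]
        result.update(mine(conditional, min_count, itemset))
    return result

# Hypothetical transaction database (nine sets, as with T0-T8).
db = [
    ["I1", "I2", "I5"], ["I2", "I4"], ["I2", "I3"], ["I1", "I2", "I4"], ["I1", "I3"],
    ["I2", "I3"], ["I1", "I3"], ["I1", "I2", "I3", "I5"], ["I1", "I2", "I3"],
]
freq = Counter(i for t in db for i in t)
# Order every set by descending element frequency, as in the FP tree.
base = [(tuple(sorted(t, key=lambda i: (-freq[i], i))), 1) for t in db]
patterns = mine(base, min_count=2)
with_i5 = {tuple(sorted(p)): c for p, c in patterns.items() if "I5" in p}
```

Besides the single-element set {I5}, the item sets containing I5 in `with_i5` are the three listed in the text, each with a support count of 2.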
FIG. 8 is a flow chart of mining association rules for unbalanced categories using FP-Growth according to an embodiment of the application, as shown in FIG. 8, comprising:
step S801, obtaining a set S of frequent item sets with the support number greater than a threshold minSupport on a marked positive sample transaction data set D1 by using FP-Growth, wherein the support number is the occurrence number of the frequent item sets;
step S802, counting the times TPk and FPk of each frequent item set Ik in the data set D1 and the data set D0 respectively;
step S803, calculating the recall, precision and F-score of each frequent item set from TPk and FPk;
step S804, sorting the frequent item sets by F-score and selecting the frequent item set with the largest F-score as a rule, provided that its F-score is greater than the set threshold minF-score;
step S805, removing the transaction data hit by the rule from D1 and D0 and adjusting the support number threshold in order to find the next rule;
step S806, repeating steps S801-S805 until no rule meeting the conditions can be mined through FP-Growth.
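The loop of steps S801 to S806 can be sketched as below, under several assumptions that should be read as such. A brute-force enumerator (`frequent_itemsets`, limited to small item sets) stands in for FP-Growth to keep the example short. The F-score is taken to be the usual harmonic mean of precision and recall, with recall measured against the positive samples still remaining in D1. The text does not say how the support number threshold is adjusted in S805, so the sketch leaves it unchanged. All function names and the toy data are hypothetical.

```python
from itertools import combinations

def frequent_itemsets(d1, min_support, max_size=3):
    """Brute-force stand-in for FP-Growth: item sets whose count in d1 exceeds min_support."""
    items = sorted(set().union(*d1)) if d1 else []
    found = []
    for size in range(1, max_size + 1):
        for combo in combinations(items, size):
            s = frozenset(combo)
            if sum(1 for t in d1 if s <= t) > min_support:
                found.append(s)
    return found

def f_score(tp, fp, n_pos):
    """Harmonic mean of precision and recall (assumed form of the F-score)."""
    if tp == 0:
        return 0.0
    recall, precision = tp / n_pos, tp / (tp + fp)
    return 2 * precision * recall / (precision + recall)

def extract_rules(d1, d0, min_support, min_f):
    d1, d0 = list(d1), list(d0)
    rules = []
    while d1:
        best, best_f = None, 0.0
        for s in frequent_itemsets(d1, min_support):        # S801
            tp = sum(1 for t in d1 if s <= t)                # S802: count in D1
            fp = sum(1 for t in d0 if s <= t)                # S802: count in D0
            f = f_score(tp, fp, len(d1))                     # S803
            if f > best_f:
                best, best_f = s, f
        if best is None or best_f <= min_f:                  # S804 / S806
            break
        rules.append((best, best_f))
        d1 = [t for t in d1 if not (best <= t)]              # S805: drop hit records
        d0 = [t for t in d0 if not (best <= t)]
    return rules

# Hypothetical data: five fraudulent and four normal transactions.
pos = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"d", "e"}, {"d", "e"}]
neg = [{"a"}, {"c"}, {"b", "c"}, {"c"}]
rules = extract_rules(pos, neg, min_support=1, min_f=0.5)
```

On this toy data the first rule found is {a}; once the records it hits are removed, {d} becomes the next rule, after which D1 is empty and the loop stops.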
Step S304, the association rules of the unbalanced category data are stored in the association rule knowledge base in the order in which they were found.
Example 2
According to another embodiment of the present application, there is also provided an association rule determining apparatus of an unbalanced sample, and fig. 9 is a block diagram of the association rule determining apparatus of an unbalanced sample according to an embodiment of the present application, as shown in fig. 9, including:
a conversion module 92 for converting the original data into transaction data;
the marking module 94 is configured to perform classification marking on the positive and negative unbalance sample data in the transaction data to obtain a positive and negative unbalance mark sample;
a determining module 96 is configured to determine a classification association rule of the target variable in the positive and negative imbalance sample data based on the positive and negative imbalance flag sample.
Optionally, the marking module 94 includes:
a dividing sub-module, configured to divide the transaction data into a normal client set and an abnormal client set with fraudulent activity;
and the marking sub-module is used for marking the abnormal client set and the normal client set in the transaction data to obtain the positive and negative unbalance marking sample.
Optionally, the determining module 96 includes:
a first determining submodule, configured to determine a plurality of frequent item sets on the abnormal client set through FP-Growth;
an acquisition sub-module, configured to acquire a plurality of candidate frequent item sets with a support number greater than a support number threshold from the plurality of frequent item sets;
a second determination submodule for determining comprehensive representation values of the plurality of candidate frequent item sets on the normal client set and the abnormal client set;
and the third determination submodule is used for determining the classification association rule of the target variable according to the comprehensive representation value.
Optionally, the second determining sub-module is further configured to:
count, for each frequent item set in the plurality of frequent item sets, the number of times TPk it occurs in the abnormal client set and the number of times FPk it occurs in the normal client set;
determine the recall and precision of each frequent item set based on TPk and FPk;
and determine the comprehensive representation value of each frequent item set according to its recall and precision.
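The formulas themselves are not reproduced in this text. Under the standard definitions, and assuming that the comprehensive representation value is the balanced F-score (the harmonic mean of precision and recall), with |D1| denoting the number of records in the abnormal (positive) client set, the quantities for the k-th frequent item set would read:

```latex
\[
\mathrm{recall}_k = \frac{TP_k}{|D_1|}, \qquad
\mathrm{precision}_k = \frac{TP_k}{TP_k + FP_k}, \qquad
F_k = \frac{2 \cdot \mathrm{precision}_k \cdot \mathrm{recall}_k}{\mathrm{precision}_k + \mathrm{recall}_k}
\]
```

For instance, an item set with TPk = 3 and FPk = 1 on a positive set of 5 records has recall 0.6, precision 0.75 and an F-score of about 0.67. A weighted F-score would fit the description equally well; the balanced form is only an assumption.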
Optionally, the third determining submodule includes:
the acquisition unit is used for selecting a target frequent item set with the maximum comprehensive representation value from the plurality of target frequent item sets;
and the determining unit is used for determining the target frequent item set as the classification association rule of the target variable after determining that the comprehensive representation value of the target frequent item set is larger than a preset minimum threshold value.
Optionally, the acquiring unit is further configured to:
sort the plurality of target frequent item sets according to the comprehensive representation value;
and select the target frequent item set with the maximum comprehensive representation value from the sorted target frequent item sets.
Optionally, the apparatus further comprises:
a removing module, configured to remove the target frequent item set from the frequent item set, and adjust the support number threshold;
an execution module configured to repeatedly perform the following steps until the classification association rule cannot be determined from the plurality of frequent item sets:
acquiring a plurality of candidate frequent item sets with support numbers greater than the support number threshold value from the plurality of frequent item sets;
determining a composite representation value of the plurality of candidate frequent item sets on the normal client set and the abnormal client set;
and determining the classification association rule of the target variable according to the comprehensive representation value.
It should be noted that each of the above modules may be implemented by software or hardware. For the latter, this may be achieved in, but is not limited to, the following ways: the modules are all located in the same processor; or the modules are located in different processors in any combination.
Example 3
Embodiments of the present application also provide a computer readable storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the method embodiments described above when run.
Alternatively, in the present embodiment, the above-described storage medium may be configured to store a computer program for performing the steps of:
s1, converting original data into transaction data;
s2, classifying and marking the positive and negative unbalance sample data in the transaction data to obtain positive and negative unbalance mark samples;
s3, determining classification association rules of target variables in the positive and negative unbalance sample data based on the positive and negative unbalance mark samples.
Alternatively, in the present embodiment, the storage medium may include, but is not limited to: a USB flash drive, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, an optical disk, or any other medium capable of storing a computer program.
Example 4
An embodiment of the application also provides an electronic device comprising a memory having stored therein a computer program and a processor arranged to run the computer program to perform the steps of any of the method embodiments described above.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, where the transmission device is connected to the processor, and the input/output device is connected to the processor.
Alternatively, in the present embodiment, the above-described processor may be configured to execute the following steps by a computer program:
s1, converting original data into transaction data;
s2, classifying and marking the positive and negative unbalance sample data in the transaction data to obtain positive and negative unbalance mark samples;
s3, determining classification association rules of target variables in the positive and negative unbalance sample data based on the positive and negative unbalance mark samples.
Optionally, for specific examples in this embodiment, reference may be made to the examples described in the foregoing embodiments and optional implementations, which are not repeated here.
It will be appreciated by those skilled in the art that the modules or steps of the application described above may be implemented on a general purpose computing device. They may be concentrated on a single computing device or distributed across a network of multiple computing devices. Optionally, they may be implemented in program code executable by computing devices, so that they can be stored in a storage device and executed by the computing devices; in some cases the steps shown or described may be performed in an order different from that described here. Alternatively, they may each be fabricated as individual integrated circuit modules, or multiple modules or steps among them may be fabricated as a single integrated circuit module. Thus, the present application is not limited to any specific combination of hardware and software.
The above description covers only the preferred embodiments of the present application and is not intended to limit it; those skilled in the art can make various modifications and variations to the present application. Any modification, equivalent replacement or improvement made within the principle of the present application shall fall within its protection scope.

Claims (8)

1. A method for determining association rules of unbalanced samples, comprising:
converting the original data into transaction data;
classifying and marking the positive and negative unbalance sample data in the transaction data to obtain positive and negative unbalance mark samples;
based on the positive and negative unbalance mark samples, determining a classification association rule of a target variable in the positive and negative unbalance sample data, including: determining a plurality of frequent item sets on the abnormal client set through FP-Growth; acquiring a plurality of candidate frequent item sets with the support number larger than a support number threshold value from the plurality of frequent item sets; determining a composite representation value of the plurality of candidate frequent item sets on a normal client set and the abnormal client set; determining a classification association rule of the target variable according to the comprehensive representation value, wherein the positive and negative unbalance marking sample comprises a marked normal client set and a marked abnormal client set; wherein determining the combined performance values of the plurality of candidate frequent item sets on the normal client set and the abnormal client set comprises: counting the times TPk and FPk of each frequent item set in the normal client set and the abnormal client set respectively in the plurality of frequent item sets; determining recall and accuracy for each frequent itemset based on the number TPk and the number FPk; and determining the comprehensive representation value of each frequent item set according to the recall rate and the precision of each frequent item set.
2. The method of claim 1, wherein classifying and marking positive and negative imbalance sample data in the transaction data to obtain positive and negative imbalance marked samples comprises:
dividing the transaction data into a normal client set and an abnormal client set with fraudulent activity;
and marking the abnormal client set and the normal client set in the transaction data to obtain the positive and negative unbalance marking sample.
3. The method of claim 1, wherein determining the classification association rule for the target variable based on the composite representation value comprises:
selecting a target frequent item set with the maximum comprehensive expression value from a plurality of target frequent item sets;
and after the target frequent item set is determined to be larger than a preset minimum threshold value, determining the target frequent item set as a classification association rule of the target variable.
4. The method of claim 3, wherein selecting the target frequent item set having the greatest aggregate performance value from the plurality of target frequent item sets comprises:
sorting the plurality of target frequent item sets according to the comprehensive representation value;
and selecting the target frequent item set with the maximum comprehensive expression value from the sorted target frequent item sets.
5. The method of claim 4, wherein after determining the target frequent item set as the classification association rule for the target variable, the method further comprises:
removing the target frequent item set from the frequent item set and adjusting the support number threshold;
repeatedly performing the following steps until the classification association rule cannot be determined from the plurality of frequent item sets:
acquiring a plurality of candidate frequent item sets with support numbers greater than the support number threshold value from the plurality of frequent item sets;
determining a composite representation value of the plurality of candidate frequent item sets on the normal client set and the abnormal client set;
and determining the classification association rule of the target variable according to the comprehensive representation value.
6. An association rule determining apparatus for an unbalanced sample, comprising:
the conversion module is used for converting the original data into transaction data;
the marking module is used for classifying and marking the positive and negative unbalance sample data in the transaction data to obtain positive and negative unbalance marking samples;
the determining module is configured to determine, based on the positive and negative unbalance flag samples, a classification association rule of a target variable in the positive and negative unbalance sample data, and includes: determining a plurality of frequent item sets on the abnormal client set through FP-Growth; acquiring a plurality of candidate frequent item sets with the support number larger than a support number threshold value from the plurality of frequent item sets; determining a composite representation value of the plurality of candidate frequent item sets on a normal client set and the abnormal client set; determining a classification association rule of the target variable according to the comprehensive representation value, wherein the positive and negative unbalance marking sample comprises a marked normal client set and a marked abnormal client set; wherein determining the combined performance values of the plurality of candidate frequent item sets on the normal client set and the abnormal client set comprises: counting the times TPk and FPk of each frequent item set in the normal client set and the abnormal client set respectively in the plurality of frequent item sets; determining recall and accuracy for each frequent itemset based on the number TPk and the number FPk; and determining the comprehensive representation value of each frequent item set according to the recall rate and the precision of each frequent item set.
7. A computer-readable storage medium, characterized in that the storage medium has stored therein a computer program, wherein the computer program is arranged to execute the method of any of the claims 1 to 5 when run.
8. An electronic device comprising a memory and a processor, characterized in that the memory has stored therein a computer program, the processor being arranged to run the computer program to perform the method of any of the claims 1 to 5.
CN202110622409.7A 2021-06-03 2021-06-03 Association rule determining method and device for unbalanced sample Active CN113282686B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202110622409.7A CN113282686B (en) 2021-06-03 2021-06-03 Association rule determining method and device for unbalanced sample

Publications (2)

Publication Number Publication Date
CN113282686A CN113282686A (en) 2021-08-20
CN113282686B true CN113282686B (en) 2023-11-07

Family

ID=77283445

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202110622409.7A Active CN113282686B (en) 2021-06-03 2021-06-03 Association rule determining method and device for unbalanced sample

Country Status (1)

Country Link
CN (1) CN113282686B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN114221858B (en) * 2021-12-15 2022-09-30 中山大学 SDN network fault positioning method, device, equipment and readable storage medium

Citations (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
WO2007147166A2 (en) * 2006-06-16 2007-12-21 Quantum Leap Research, Inc. Consilence of data-mining
CN103731738A (en) * 2014-01-23 2014-04-16 哈尔滨理工大学 Video recommendation method and device based on user group behavioral analysis
CN103995882A (en) * 2014-05-28 2014-08-20 南京大学 Probability frequent item set excavating method based on MapReduce
CN104239437A (en) * 2014-08-28 2014-12-24 国家电网公司 Power-network-dispatching-oriented intelligent warning analysis method
CN104537025A (en) * 2014-12-19 2015-04-22 北京邮电大学 Frequent sequence mining method
CN105306475A (en) * 2015-11-05 2016-02-03 天津理工大学 Network intrusion detection method based on association rule classification
CN105740245A (en) * 2014-12-08 2016-07-06 北京邮电大学 Frequent item set mining method
CN106529580A (en) * 2016-10-24 2017-03-22 浙江工业大学 EDSVM-based software defect data association classification method
CN107590516A (en) * 2017-09-16 2018-01-16 电子科技大学 Gas pipeline leak detection recognition methods based on Fibre Optical Sensor data mining
CN108376347A (en) * 2018-02-27 2018-08-07 广西财经学院 A kind of commodity classification method based on A weighting priori algorithms
CN108806767A (en) * 2018-06-15 2018-11-13 中南大学 Disease symptoms association analysis method based on electronic health record
CN108830321A (en) * 2018-06-15 2018-11-16 中南大学 The classification method of unbalanced dataset
CN110990461A (en) * 2019-12-12 2020-04-10 国家电网有限公司大数据中心 Big data analysis model algorithm model selection method and device, electronic equipment and medium
CN111309777A (en) * 2020-01-14 2020-06-19 哈尔滨工业大学 Report data mining method for improving association rule based on mutual exclusion expression
CN111782512A (en) * 2020-06-23 2020-10-16 北京高质系统科技有限公司 Multi-feature software defect comprehensive prediction method based on unbalanced noise set
CN112380274A (en) * 2020-11-16 2021-02-19 北京航空航天大学 Control process-oriented anomaly detection system
CN112723075A (en) * 2021-01-04 2021-04-30 浙江新再灵科技股份有限公司 Method for analyzing elevator vibration influence factors with unbalanced data
CN112884179A (en) * 2021-03-30 2021-06-01 北京交通大学 Urban rail turn-back fault diagnosis method based on machine fault and text topic analysis

Also Published As

Publication number Publication date
CN113282686A (en) 2021-08-20

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant