CN107622121B - Data analysis method and device based on bitmap data structure - Google Patents

Publication number
CN107622121B
CN107622121B CN201710872848.7A CN201710872848A CN107622121B CN 107622121 B CN107622121 B CN 107622121B CN 201710872848 A CN201710872848 A CN 201710872848A CN 107622121 B CN107622121 B CN 107622121B
Authority
CN
China
Prior art keywords
item
transaction
target item
bitmap data
frequent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201710872848.7A
Other languages
Chinese (zh)
Other versions
CN107622121A (en)
Inventor
刘东岳
吴斌
王柏
卜尧
郭志红
杨祎
马艳
辜超
白德盟
林颖
秦佳峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN201710872848.7A
Publication of CN107622121A
Application granted
Publication of CN107622121B
Legal status: Active
Anticipated expiration

Abstract

The embodiment of the invention provides a data analysis method and device based on a bitmap data structure. The method comprises: obtaining a first transaction subset allocated by a master node; obtaining the total number of transactions in the transaction set and the ordering of those transactions; determining bitmap data of each target item according to the ordering; counting, in the bitmap data of each target item, the proportion of first data relative to the total number, and determining the frequent 1 item sets among the target items according to the counted proportion; broadcasting each target item that is a frequent 1 item set, together with its bitmap data, to the master node and the other distributed child nodes; receiving statistical items and the bitmap data of the statistical items; calculating whether a target item set is a frequent item set based on the received statistical items, the bitmap data of the statistical items, and the bitmap data of the target items that are frequent 1 item sets; and, if the target item set is a frequent item set, determining association rules among the items in the target item set. In this way, association rules and association relationships between items can be obtained quickly.

Description

Data analysis method and device based on bitmap data structure
Technical Field
The invention relates to the technical field of data mining, in particular to a data analysis method and device based on a bitmap data structure.
Background
With the explosive growth of data, there is an increasingly urgent need to mine valuable information from large volumes of existing data and to make decisions based on it.
For example, a large retail store generates tens of thousands of transaction records each year, where each transaction record corresponds to an order number and each order number corresponds to multiple item names. Without data mining, one cannot obtain the association rules between items in these transactions (e.g., 60% of those who buy coffee also buy pastry at the same time). With data mining, the purchasing relationships among the items, namely the association rules among them, can be obtained, and these rules can support the store's marketing decisions.
In the data mining process, transactions are typically derived from the transaction records of the large retail store, and the different goods sold in the store are treated as different items. Specifically, for each transaction record, the order number corresponding to the record may be used as the transaction identifier of the corresponding transaction, and each item name corresponding to the order number may be used as an item of the transaction. Thus, from a year of transaction records of the large retail store, a transaction set consisting of multiple transactions may be obtained, and each transaction may contain one or more items. Then, a frequent item set mining algorithm, the ECLAT algorithm, is used to mine the frequent item sets in the transaction set, and the association rules among the items are calculated from the frequent item sets.
Specifically, assume that the transaction set contains 10000 transactions and that the ECLAT algorithm calculates that the item set { item A, item B } occurs 100 times in the transaction set, i.e., item A and item B occur together in 100 transactions. The probability of occurrence of the item set { item A, item B } is then 100/10000 = 0.01, that is, the support of the item set { item A, item B } is 0.01. If 0.01 is greater than the preset minimum support, the item set { item A, item B } is a frequent 2 item set, and the association relationship between item A and item B can be calculated from the frequent 2 item set { item A, item B }.
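The support arithmetic in the paragraph above can be written out as a minimal Python sketch. The function names and the thresholds used in the checks are illustrative assumptions, not part of the patent text; only the figures 100 and 10000 come from the example.

```python
def support(occurrences: int, total_transactions: int) -> float:
    """Fraction of all transactions in which an item set occurs."""
    if total_transactions <= 0:
        raise ValueError("total_transactions must be positive")
    return occurrences / total_transactions


def is_frequent(occurrences: int, total_transactions: int, min_support: float) -> bool:
    """An item set is frequent when its support exceeds the preset minimum."""
    return support(occurrences, total_transactions) > min_support


# Figures from the text: { item A, item B } occurs 100 times in 10000 transactions.
example_support = support(100, 10000)
```

With a hypothetical minimum support of 0.005 the item set would count as frequent; with a minimum support of 0.01 it would not, because the comparison is strict.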
However, the inventor found that, when a frequent item set is calculated using the ECLAT algorithm, item B must be matched against each item in each transaction containing item A; if a match succeeds, the transaction containing item A also contains item B, and the number of occurrences of the item set { item A, item B } is increased by one. Then, item A is matched against each item in each transaction containing item B; if a match succeeds, the transaction containing item B also contains item A, and the number of occurrences of the item set { item A, item B } is increased by one. The number of occurrences of the item set { item A, item B } is obtained in this way. Such matching is very slow, so the association rules are slow to obtain.
Disclosure of Invention
The embodiment of the invention aims to provide a data analysis method and device based on a bitmap data structure, which are used for rapidly analyzing and obtaining association rules so as to rapidly obtain association relations among items.
In a first aspect, an embodiment of the present invention provides a data analysis method based on a bitmap data structure, which is applied to one of the distributed child nodes included in a distributed system, where the distributed system includes a master node and distributed child nodes, and the method includes the following steps:
obtaining a first transaction subset allocated by the master node, wherein the first transaction subset is a subset of a transaction set;
obtaining the total number of the transactions in the transaction set and the ordering of the transactions in the transaction set;
determining bitmap data corresponding to each target item according to the ordering, wherein each bit of the bitmap data corresponding to one target item corresponds to one transaction in the transaction set according to the ordering, and the value of each bit indicates whether the transaction corresponding to the bit is an associated transaction of the target item; the target items are the items contained in the transactions in the first transaction subset; an associated transaction of a target item is a transaction in the transaction set that contains the target item;
counting, for the bitmap data corresponding to each target item, the proportion of first data relative to the total number, and determining the frequent 1 item sets among the target items according to the counted proportion, wherein the first data is the value taken by a bit that corresponds to an associated transaction of an item in the bitmap data;
broadcasting each target item that is a frequent 1 item set, and the bitmap data of that target item, to the master node and the other distributed child nodes;
receiving a statistical item and bitmap data of the statistical item, wherein the statistical item is obtained by another distributed child node from statistics on a second transaction subset allocated by the master node, and the union of the first transaction subset and all second transaction subsets is the transaction set;
calculating whether a target item set is a frequent item set based on the received statistical item, the bitmap data of the statistical item, and the bitmap data of the target item that is a frequent 1 item set, wherein the target item set comprises at least two items;
and, if the target item set is a frequent item set, determining association rules among the items in the target item set.
Optionally, the step of determining bitmap data corresponding to each target item according to the ordering may include:
for each target item, based on the transactions in the first transaction subset that contain the target item, the transactions in the second transaction subsets that contain the target item, and a preset mapping relationship, setting the value of each bit corresponding to a transaction that contains the target item to first data, and setting the value of each bit corresponding to a transaction that does not contain the target item to second data, to obtain the bitmap data of the target item, where the mapping relationship is the correspondence, determined according to the ordering, between the bits in the bitmap data and the transactions in the transaction set.
Optionally, in this embodiment of the present invention, the first data is 1, and the second data is 0.
Optionally, before the step of calculating whether the target item set is a frequent item set, the method may further comprise:
receiving a statistical instruction for the target item set sent by the master node.
Optionally, after the step of determining bitmap data corresponding to each target item according to the ordering, the method may further include:
compressing bitmap data corresponding to each target item into compressed bitmap data;
in which case the step of broadcasting each target item that is a frequent 1 item set, and the bitmap data of that target item, to the master node and the other distributed child nodes comprises:
broadcasting each target item that is a frequent 1 item set, and the compressed bitmap data of that target item, to the master node and the other distributed child nodes.
In a second aspect, an embodiment of the present invention provides a data analysis apparatus based on a bitmap data structure, which is applied to one of the distributed child nodes included in a distributed system, where the distributed system includes a master node and distributed child nodes, and the apparatus may include:
a first obtaining unit, configured to obtain a first transaction subset allocated by the master node, where the first transaction subset is a subset of a transaction set;
a second obtaining unit, configured to obtain the total number of the transactions in the transaction set and the ordering of the transactions in the transaction set;
a first determining unit, configured to determine bitmap data corresponding to each target item according to the ordering, wherein each bit of the bitmap data corresponding to one target item corresponds to one transaction in the transaction set according to the ordering, and the value of each bit indicates whether the transaction corresponding to the bit is an associated transaction of the target item; the target items are the items contained in the transactions in the first transaction subset; an associated transaction of a target item is a transaction in the transaction set that contains the target item;
a statistical unit, configured to count, for the bitmap data corresponding to each target item, the proportion of first data relative to the total number, and to determine the frequent 1 item sets among the target items according to the counted proportion, wherein the first data is the value taken by a bit that corresponds to an associated transaction of an item in the bitmap data;
a broadcasting unit, configured to broadcast each target item that is a frequent 1 item set, and the bitmap data of that target item, to the master node and the other distributed child nodes;
a first receiving unit, configured to receive a statistical item and bitmap data of the statistical item, wherein the statistical item is obtained by another distributed child node from statistics on a second transaction subset allocated by the master node, and the union of the first transaction subset and all second transaction subsets is the transaction set;
a calculating unit, configured to calculate whether a target item set is a frequent item set based on the received statistical item, the bitmap data of the statistical item, and the bitmap data of the target item that is a frequent 1 item set, wherein the target item set comprises at least two items;
and a second determining unit, configured to determine association rules among the items in the target item set when the target item set is a frequent item set.
Optionally, the first determining unit may be specifically configured to:
for each target item, based on the transactions in the first transaction subset that contain the target item, the transactions in the second transaction subsets that contain the target item, and a preset mapping relationship, set the value of each bit corresponding to a transaction that contains the target item to first data, and set the value of each bit corresponding to a transaction that does not contain the target item to second data, to obtain the bitmap data of the target item, where the mapping relationship is the correspondence, determined according to the ordering, between the bits in the bitmap data and the transactions in the transaction set.
Optionally, in an embodiment of the present invention, the apparatus may further include:
and a second receiving unit, configured to receive a statistical instruction for the target item set sent by the master node before calculating whether the target item set is a frequent item set based on the received statistical item, the bitmap data of the statistical item, and the bitmap data of the target item which is a frequent 1 item set.
Optionally, in an embodiment of the present invention, the apparatus may further include:
a compression unit, configured to compress the bitmap data corresponding to each target item into compressed bitmap data after the bitmap data corresponding to each target item is determined according to the ordering;
in which case the broadcasting unit may specifically be configured to:
broadcast each target item that is a frequent 1 item set, and the compressed bitmap data of that target item, to the master node and the other distributed child nodes.
Optionally, in this embodiment of the present invention, the first data is 1, and the second data is 0.
In a third aspect, an embodiment of the present invention further provides a distributed child node, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory complete communication with each other through the communication bus;
a memory for storing a computer program;
a processor, configured to implement the method steps of the data analysis method based on the bitmap data structure according to any one of the first aspect described above when executing a program stored in the memory.
In an embodiment of the invention, a distributed child node in a distributed system may receive a first transaction subset allocated by the master node. The child node then obtains the total number of transactions in the transaction set and the ordering of the transactions in the transaction set, and determines the items contained in the transactions of the first transaction subset as target items. Each target item is associated with a number of bits equal to the total number, and each bit corresponds to one transaction in the transaction set according to the obtained ordering. Each bit uniquely corresponds to one transaction, and no two bits correspond to the same transaction. A transaction containing the target item is determined to be an associated transaction of the target item; the value of each bit corresponding to an associated transaction is set to first data, and the value of every other bit is set to second data, thereby obtaining the bitmap data corresponding to the target item. In this way, the proportion of transactions in the transaction set that contain the target item can be quickly determined as the ratio of the number of first data in the bitmap data to the total number. Whether the target item is a frequent 1 item set can then be determined from this proportion, which greatly improves the speed of obtaining frequent 1 item sets.
When the target item is determined to be a frequent 1 item set, the distributed child node may broadcast the target item and the bitmap data of the target item to the master node and the other distributed child nodes, and may receive the statistical items and the bitmap data of the statistical items broadcast by the other distributed child nodes. Whether a target item set containing at least two items is a frequent item set can then be quickly determined based on the bitmap data of the target item and the bitmap data of the statistical item. If the target item set is a frequent item set, the association rules among the items in the target item set can be determined from the frequent item set, which increases the speed of obtaining association rules.
Drawings
In order to illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings in the following description show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a data analysis method based on a bitmap data structure according to an embodiment of the present invention;
Fig. 2 is a graph comparing the performance of the data analysis method based on the bitmap data structure according to the embodiment of the present invention with the performance of the existing data analysis method;
Fig. 3 is a schematic diagram illustrating the performance of the data analysis method based on the bitmap data structure under various preset minimum support degrees according to the embodiment of the present invention;
Fig. 4 is a schematic structural diagram of a data analysis apparatus based on a bitmap data structure according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a distributed child node according to an embodiment of the present invention.
Detailed Description
The technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
In order to solve the problems in the prior art, embodiments of the present invention provide a data analysis method and apparatus based on a bitmap data structure.
First, a data analysis method based on a bitmap data structure according to an embodiment of the present invention is described below.
The data analysis method based on the bitmap data structure provided by the embodiment of the invention is applied to any one of the distributed child nodes contained in a distributed system, where the distributed system includes a master node and distributed child nodes.
For clarity, the data analysis method based on a bitmap data structure provided by the embodiment of the present invention is described below by taking a distributed system including 1 master node and 31 distributed child nodes as an example. The master node and the distributed child nodes may be servers, or may be user terminals such as computers and mobile phones.
Referring to fig. 1, a data analysis method based on a bitmap data structure according to an embodiment of the present invention may include the following steps:
S101: obtaining a first transaction subset allocated by the master node, wherein the first transaction subset is a subset of a transaction set;
Assume that the distributed system needs to perform association rule analysis on 75000 transaction records, i.e., 75000 transactions, of a large retail store; these 75000 transactions constitute the transaction set. The order number corresponding to each transaction record may be used as the transaction identifier of the transaction corresponding to that record, and each article corresponding to the order number may be used as an item of the transaction corresponding to the order number.
Assume that the master node in the distributed system is Z, and the distributed child nodes are F1, F2, ..., F30 and F31. For the distributed child node F1, the master node Z may form part of the transactions in the transaction set into a first transaction subset and allocate the first transaction subset to the distributed child node F1, so that the distributed child node F1 obtains the first transaction subset allocated by the master node Z.
The master node Z may allocate the first transaction subset to the distributed child node F1 on an equal-share basis, for example, by allocating any 2419 of the 75000 transactions (75000 divided among 31 child nodes is about 2419 each) to the distributed child node F1 as the first transaction subset. Alternatively, the master node Z may allocate the first transaction subset according to the computing power of the distributed child node F1, for example, by allocating 5000 of the 75000 transactions to the distributed child node F1 as the first transaction subset.
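One way the equal-share allocation could look is sketched below in Python. The function name and the contiguous-slice strategy are assumptions made for illustration; the text only requires that the subsets be disjoint and together cover the transaction set.

```python
def split_evenly(transaction_ids, node_count):
    """Partition transaction ids into node_count disjoint, near-equal subsets."""
    base, extra = divmod(len(transaction_ids), node_count)
    subsets, start = [], 0
    for node in range(node_count):
        # The first `extra` nodes take one additional transaction each.
        size = base + (1 if node < extra else 0)
        subsets.append(transaction_ids[start:start + size])
        start += size
    return subsets


# 75000 transactions shared among the 31 child nodes F1..F31 of the example.
subsets = split_evenly(list(range(1, 75001)), 31)
```

Because 75000 is not divisible by 31, this sketch gives some child nodes 2419 transactions and the rest 2420.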
S102: obtaining the total number of the transactions in the transaction set and the ordering of the transactions in the transaction set;
S103: determining bitmap data corresponding to each target item according to the ordering, wherein each bit of the bitmap data corresponding to one target item corresponds to one transaction in the transaction set according to the ordering, and the value of each bit indicates whether the transaction corresponding to the bit is an associated transaction of the target item; the target items are the items contained in the transactions in the first transaction subset; an associated transaction of a target item is a transaction in the transaction set that contains the target item;
After the distributed child node F1 receives the first transaction subset, it may determine which target items the transactions in the first transaction subset contain. For example, when the first transaction subset contains transaction 1 (item A, item B, item C) and transaction 2 (item A, item D), the distributed child node F1 may determine that the transactions in the first transaction subset contain target item A, target item B, target item C, and target item D. The bitmap data corresponding to each target item is then determined.
The following description takes determining the bitmap data of target item A as an example:
The distributed child node F1 may obtain from the master node Z the total number of transactions contained in the transaction set, as well as the ordering of the transactions in the transaction set. For example, the ordering is: the transaction identifier of transaction No. 1, the transaction identifier of transaction No. 2, ..., and the transaction identifier of transaction No. 75000.
Then, a number of bits equal to the total number, i.e., 75000 bits, is allocated and arranged in order. Each bit is made to correspond uniquely to one transaction according to the obtained ordering, so that no two bits correspond to the same transaction. Next, it is determined whether each transaction in the transaction set contains target item A. If a transaction contains target item A, the transaction is an associated transaction of target item A, and the value of the bit corresponding to that transaction may be set to the first data, for example, 1. If a transaction does not contain target item A, the transaction is not an associated transaction of target item A, and the value of the bit corresponding to that transaction may be set to the second data, for example, 0. After the values of all 75000 bits have been set, the bitmap data corresponding to target item A is obtained, and the correspondence between target item A and the bitmap data can be established.
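The bit-setting procedure above can be illustrated with a small Python sketch that uses an arbitrary-precision integer as the bitmap. The function name, the choice of bit 0 for the first transaction in the ordering, and the four-transaction toy data are assumptions for illustration only.

```python
def build_bitmap(item, ordered_transactions):
    """Return an int whose bit i is 1 when the i-th transaction contains item."""
    bitmap = 0
    for position, items in enumerate(ordered_transactions):
        if item in items:
            bitmap |= 1 << position  # first data (1); unset bits stay 0 (second data)
    return bitmap


# Toy transaction set, listed in the agreed ordering.
ordered_transactions = [
    {"A", "B", "C"},  # transaction 1 -> bit 0
    {"A", "D"},       # transaction 2 -> bit 1
    {"B", "E"},       # transaction 3 -> bit 2
    {"A", "E"},       # transaction 4 -> bit 3
]
bitmap_a = build_bitmap("A", ordered_transactions)
```

Here item A appears in transactions 1, 2 and 4, so bits 0, 1 and 3 are set and `bitmap_a` equals `0b1011`.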
The distributed child node F1 may communicate with the other distributed child nodes in the distributed system to learn which transactions in the transaction set are associated transactions of target item A. Alternatively, each distributed child node may report the associated transactions of each target item obtained from its own statistics to the master node Z, and the master node Z then aggregates the associated transactions of each target item, so that the distributed child node F1 can obtain the transaction identifiers of all associated transactions of target item A from the master node Z.
Similarly, the distributed child node F1 can determine bitmap data corresponding to target item B, target item C, and target item D, which will not be described in detail herein.
In addition, a transaction allocated to another distributed child node may also contain item A; for example, transaction 10 (item A, item E) allocated to the distributed child node F2 contains item A. To prevent the distributed child node F2 from also taking item A as a target item and determining the bitmap data of target item A a second time, the master node Z may send bitmap data determination instructions to the distributed child nodes to instruct them to determine the bitmap data of different items. For example, if the distributed child node F1 is instructed to determine the bitmap data of item A, the other distributed child nodes in the distributed system will not determine the bitmap data of item A, which avoids wasting computing resources.
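The text does not say how the master node decides which child node handles which item. As a purely hypothetical illustration, the Python sketch below assigns each distinct item to exactly one child node in round-robin order; the function name and the round-robin policy are assumptions, not the patented method.

```python
def assign_item_owners(items, node_names):
    """Map every distinct item to exactly one child node (round-robin, illustrative)."""
    owners = {}
    for index, item in enumerate(sorted(set(items))):
        owners[item] = node_names[index % len(node_names)]
    return owners


# Hypothetical example with two child nodes and five items.
owners = assign_item_owners(["A", "B", "C", "D", "E", "A"], ["F1", "F2"])
```

Whatever policy is used, the property that matters for avoiding repeated work is that each item has a single owner, so its bitmap data is determined only once.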
S104: respectively counting the proportion between first data and the total number in bitmap data corresponding to each target item, and determining a frequent 1 item set in the target item according to the counted proportion, wherein the first data is as follows: dereferencing a bit corresponding to an associated transaction of an item in the bitmap data;
continuing with the above example, after the bitmap data of the target item a is obtained, the number of the first data in the bitmap data may be determined, and then the ratio of the number to the total number is calculated, resulting in the ratio between the first data and the total number. Then, it may be determined whether the ratio is greater than a preset minimum support, and if so, the target item a is determined to be the frequent 1 item set. If the number of the target item A is less than or equal to the number of the frequent 1 items, the target item A is determined not to be the frequent 1 item set. Therefore, the calculation amount of the frequent 1 item set is greatly reduced, and the calculation efficiency is improved.
Wherein, the skilled person can set the preset minimum support according to the actual situation. For example, the preset minimum support degree may be set to 0.8, but is not limited thereto.
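Step S104 amounts to counting the 1 bits in each bitmap and comparing the ratio with the threshold. A minimal Python sketch follows, assuming bitmaps are stored as integers; the names, the toy bitmaps, and the minimum support of 0.5 are illustrative assumptions.

```python
def count_ones(bitmap: int) -> int:
    """Number of bits set to 1 (the 'first data') in the bitmap."""
    return bin(bitmap).count("1")


def frequent_1_items(bitmaps, total, min_support):
    """Items whose share of associated transactions exceeds min_support."""
    return {
        item
        for item, bitmap in bitmaps.items()
        if count_ones(bitmap) / total > min_support
    }


# Toy bitmaps over a 4-transaction set; the threshold 0.5 is hypothetical.
bitmaps = {"A": 0b1011, "B": 0b0101, "D": 0b0010}
frequent = frequent_1_items(bitmaps, total=4, min_support=0.5)
```

Item A has support 3/4 and passes; item B has support exactly 2/4 and is rejected, because the text requires the proportion to be strictly greater than the minimum support.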
S105: broadcasting each target item that is a frequent 1 item set, and the bitmap data of that target item, to the master node and the other distributed child nodes;
S106: receiving a statistical item and bitmap data of the statistical item, wherein the statistical item is obtained by another distributed child node from statistics on a second transaction subset allocated by the master node, and the union of the first transaction subset and all second transaction subsets is the transaction set;
Assuming that target item A is a frequent 1 item set, the distributed child node F1 may broadcast target item A and the bitmap data of target item A to the master node and the other distributed child nodes. To reduce the memory consumed by storing bitmap data and the bandwidth consumed by transmitting it, the distributed child node F1 may also compress the bitmap data of target item A into compressed bitmap data, and then broadcast target item A and the compressed bitmap data of target item A to the master node and the other distributed child nodes.
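The text does not name a compression scheme. As an assumption for illustration only, the Python sketch below serializes an integer bitmap to bytes and compresses it with the standard-library zlib module; a bitmap-specific encoding could be substituted without changing the surrounding steps.

```python
import zlib

TOTAL = 75000  # number of transactions, hence number of bits per bitmap


def compress_bitmap(bitmap: int, total: int) -> bytes:
    """Serialize the bitmap to ceil(total / 8) bytes, then compress with zlib."""
    # Assumes bitmap < 2**total; otherwise to_bytes raises OverflowError.
    raw = bitmap.to_bytes((total + 7) // 8, "little")
    return zlib.compress(raw)


def decompress_bitmap(payload: bytes) -> int:
    """Inverse of compress_bitmap."""
    return int.from_bytes(zlib.decompress(payload), "little")


# A sparse illustrative bitmap: every 1000th transaction contains the item.
sparse_bitmap = sum(1 << position for position in range(0, TOTAL, 1000))
payload = compress_bitmap(sparse_bitmap, TOTAL)
```

For a sparse bitmap like this one the compressed payload is much smaller than the 9375 raw bytes, which is the kind of saving in storage and transmission that the paragraph above describes.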
The distributed child node F1 also receives the statistical items, and the bitmap data of the statistical items, calculated by the other distributed child nodes. A statistical item may be a frequent 1 item set obtained by another distributed child node from statistics on the second transaction subset allocated to it by the master node. For example, the distributed child node F2 receives the second transaction subset allocated by the master node, determines from the second transaction subset that statistical item E is a frequent 1 item set, and then broadcasts statistical item E and the bitmap data of statistical item E, so that the distributed child node F1 can receive them. A statistical item may also be an item that the other distributed child nodes have not determined to be a frequent 1 item set.
It should be noted that the distributed child nodes F3 to F31 may each also receive a second transaction subset sent by the master node Z; to avoid duplicate computation, the transactions contained in the second transaction subsets received by the distributed child nodes are different from each other.
S107: calculating whether the target item set is a frequent item set or not based on the received statistical items, the bitmap data of the statistical items and the bitmap data of the target item which is the frequent 1 item set, wherein the target item set comprises at least two items;
S108: if the target item set is a frequent item set, determining association rules among the items in the target item set.
For example, after receiving the statistical item E and the bitmap data of the statistical item E broadcast by the distributed child node F2, the distributed child node F1 may determine whether the item set {item A, item E} is a frequent 2 item set. That is, it determines whether the ratio of the number of transactions in which item A and item E both occur to the total number is greater than the preset minimum support degree; if so, the item set {item A, item E} is determined to be a frequent 2 item set.
The number of transactions in which item A and item E both occur may be determined as follows. The bitmap data of item A (i.e., the target item A) is compared with the bitmap data of item E (i.e., the statistical item E). When the bits at the same position in the two bitmaps (e.g., the 10th bit of each) both hold the first data (e.g., both are 1), the transaction corresponding to that bit contains both item A and item E. Counting the positions at which both bitmaps hold 1 therefore gives the number of transactions in which item A and item E occur together. Because this requires very little computation, that number, and hence whether the item set {item A, item E} is a frequent 2 item set, can be determined quickly.
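The comparison described above amounts to a bitwise AND followed by a count of the set bits. A minimal sketch, assuming each bitmap is held as a Python integer in which bit i stands for the i-th transaction in the ordering (the helper names and sample data are hypothetical):

```python
def to_bitmap(positions):
    """Build a bitmap with a 1 at each listed transaction position."""
    bits = 0
    for p in positions:
        bits |= 1 << p
    return bits

def co_occurrence_count(bitmap_a: int, bitmap_e: int) -> int:
    """Count positions where both bitmaps hold 1 (bitwise AND, then popcount)."""
    return bin(bitmap_a & bitmap_e).count("1")

def is_frequent_pair(bitmap_a: int, bitmap_e: int, total: int,
                     min_support: float) -> bool:
    """True when the co-occurrence ratio is greater than the minimum support."""
    return co_occurrence_count(bitmap_a, bitmap_e) / total > min_support

# Hypothetical data: 10 transactions in total.
total = 10
bitmap_a = to_bitmap([0, 2, 3, 5, 6, 7, 9])   # transactions containing item A
bitmap_e = to_bitmap([0, 1, 2, 3, 6, 7, 9])   # transactions containing item E

count = co_occurrence_count(bitmap_a, bitmap_e)  # positions 0, 2, 3, 6, 7, 9
```

Here six of the ten transactions contain both items, so the pair passes a minimum support of 0.5 but not one of 0.6, because the comparison is strict.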
Assuming that the item set {item A, item E} is determined to be a frequent 2 item set, the confidence can be obtained by dividing the number of transactions in which item A and item E both occur by the number of transactions in which item A occurs. Suppose the resulting confidence is 70%. The following association rule is then obtained: of those who buy item A, 70% also buy item E. This association rule is merely an example, and the generated association rules are not limited thereto.
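The confidence calculation can be sketched in the same bitmap representation. The function names and the sample bitmaps below are hypothetical; the data are chosen only so that the result matches the 70% figure used in the example.

```python
def popcount(bits: int) -> int:
    """Number of 1 bits in the bitmap."""
    return bin(bits).count("1")

def confidence(bitmap_antecedent: int, bitmap_consequent: int) -> float:
    """Confidence of the rule antecedent -> consequent.

    Transactions containing both items, divided by transactions
    containing the antecedent.
    """
    antecedent_count = popcount(bitmap_antecedent)
    if antecedent_count == 0:
        raise ValueError("antecedent occurs in no transaction")
    return popcount(bitmap_antecedent & bitmap_consequent) / antecedent_count

# Hypothetical data: item A occurs in transactions 0..9,
# item E occurs in transactions 0..6 and in transaction 15.
bitmap_a = sum(1 << i for i in range(10))
bitmap_e = sum(1 << i for i in range(7)) | (1 << 15)

conf_a_to_e = confidence(bitmap_a, bitmap_e)   # 7 shared / 10 with A
```

Note that confidence is directional: the rule from item E to item A would divide by the number of transactions containing item E instead.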
In conclusion, applying the embodiment of the invention increases the speed of obtaining frequent item sets, so that the association rules implicit in the data can be analyzed and obtained quickly. Here, frequent item sets include frequent 1 item sets and frequent multi-item sets.
The performance of analyzing the association rule by the data analysis method based on the bitmap data structure provided by the embodiment of the invention is described below with reference to table one, table two, fig. 2 and fig. 3.
Figure BDA0001417481500000111
Table 1
Referring to Table 1, the inventors refer to the data analysis method based on the bitmap data structure provided by the embodiment of the present invention as the RBM-Eclat algorithm. With the preset minimum support degree set to 0.8, association rule analysis was performed on a transaction set containing 1 million transactions using the RBM-Eclat algorithm, and analyzing all the association rules took 81 seconds. With the same minimum support degree of 0.8 and the same transaction set of 1 million transactions, the prior-art Eclat algorithm took 182 seconds to analyze all the association rules, and the prior-art Apriori algorithm took 151 seconds.
In the same manner, the inventors further performed association rule analysis with the three algorithms on transaction sets containing 2 million, 4 million, 8 million, and 16 million transactions, obtaining the results shown in Table 1 and FIG. 2. As can be seen from Table 1 and FIG. 2, compared with the existing association rule analysis methods, the data analysis method based on the bitmap data structure provided by the embodiment of the present invention analyzes association rules faster.
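As a worked check of the timings quoted in the text for the transaction set of one million transactions (81, 182, and 151 seconds), the relative slowdown of the two prior-art algorithms can be computed as below. Only these three figures appear in the text; the other entries of Table 1 are not reproduced here, so the sketch does not cover them.

```python
# Timings in seconds for one million transactions at minimum support 0.8,
# as quoted in the description.
timings_s = {"RBM-Eclat": 81, "Eclat": 182, "Apriori": 151}

baseline = timings_s["RBM-Eclat"]
slowdowns = {
    name: seconds / baseline
    for name, seconds in timings_s.items()
    if name != "RBM-Eclat"
}

for name, factor in slowdowns.items():
    print(f"{name} takes {factor:.2f} times as long as RBM-Eclat")
```

On these figures Eclat takes roughly 2.25 times as long and Apriori roughly 1.86 times as long as RBM-Eclat.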
In addition, the performance of the data analysis method based on the bitmap data structure provided by the embodiment of the invention is different under different preset minimum support degrees, which can be specifically referred to in table two and fig. 3.
Figure BDA0001417481500000121
Table 2
Referring to Table 2 and FIG. 3, when the preset minimum support degree is set to 0.6 and association rule analysis is performed on a transaction set containing 500,000 transactions using the data analysis method based on the bitmap data structure provided by the embodiment of the present invention, analyzing all the association rules takes 222 seconds. When the preset minimum support degree is set to 0.65, the same analysis on the transaction set of 500,000 transactions takes 113 seconds. When the preset minimum support degree is set to 0.7, it takes 84 seconds, and so on; the remaining entries are not repeated here.
In addition, when the preset minimum support degree is set to 0.6 and association rule analysis is performed on a transaction set containing 1 million transactions using the data analysis method based on the bitmap data structure provided by the embodiment of the invention, analyzing all the association rules takes 486 seconds. When the preset minimum support degree is set to 0.65, the same analysis on the transaction set of 1 million transactions takes 182 seconds. When the preset minimum support degree is set to 0.7, it takes 126 seconds, and so on; the remaining entries are not repeated here.
As can be seen from the above, when association rule analysis is performed on the same transaction set, the larger the preset minimum support degree, the faster the data analysis method based on the bitmap data structure provided by the embodiment of the present invention runs. For the transaction set of 500,000 transactions, for example, the time falls from 222 seconds at a minimum support degree of 0.6 to 84 seconds at 0.7.
Corresponding to the above method embodiment, an embodiment of the present invention further provides a data analysis apparatus based on a bitmap data structure, which is applied to one of distributed child nodes included in a distributed system, where the distributed system includes: a master node and distributed child nodes, see fig. 4, the apparatus may comprise:
a first obtaining unit 401, configured to obtain a first subset of transactions allocated by a master node, where the first subset of transactions is: a subset of a set of transactions;
a second obtaining unit 402, configured to obtain a total number of transactions in the transaction set and an ordering of the transactions in the transaction set;
a first determining unit 403, configured to determine, according to the ordering, bitmap data corresponding to each target item, where each bit of the bitmap data corresponding to one target item corresponds to one transaction in the transaction set according to the ordering, and a value of each bit indicates whether the transaction corresponding to the bit is an associated transaction of the target item; the target items are: items contained in each transaction in the first transaction subset; the association transaction for a target item is: the transaction set comprises the transactions of the target item;
a counting unit 404, configured to count, for the bitmap data corresponding to each target item, the ratio between the number of first data and the total number, and determine the frequent 1 item sets among the target items according to the counted ratio, where the first data is: the value of a bit in the bitmap data that corresponds to an associated transaction of an item;
a broadcasting unit 405, configured to broadcast the target item that is the frequent 1-item set and bitmap data of the target item that is the frequent 1-item set to the master node and other distributed child nodes;
a first receiving unit 406, configured to receive a statistical item and bitmap data of the statistical item, where the statistical item is obtained by other distributed child nodes based on statistics of a second transaction subset allocated by a master node, and a union of the first transaction subset and each second transaction subset is a transaction set;
a calculating unit 407, configured to calculate whether a target item set is a frequent item set based on the received statistical item, bitmap data of the statistical item, and bitmap data of a target item that is a frequent 1 item set, where the target item set includes at least two items;
a second determining unit 408, configured to determine association rules between items in the target item set when the target item set is a frequent item set.
In an embodiment of the invention, a distributed child node in a distributed system may receive a first transaction subset allocated by the master node. The total number of transactions in the transaction set and the ordering of the transactions in the transaction set are then obtained, and the items contained in the transactions of the first transaction subset are determined to be target items. Each target item is assigned as many bits as the total number, and each bit is mapped to one transaction in the transaction set according to the obtained ordering. Each bit uniquely corresponds to one transaction, and any two bits correspond to different transactions. A transaction containing the target item is determined to be an associated transaction of the target item; the bit corresponding to an associated transaction is set to the first data, and every other bit is set to the second data, thereby obtaining the bitmap data corresponding to the target item. In this way, the proportion of transactions in the transaction set that contain the target item can be determined quickly as the ratio of the number of first data in the bitmap data to the total number. Whether the target item is a frequent 1 item set can then be determined from this proportion, which greatly increases the speed of obtaining frequent 1 item sets.
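The ratio test for frequent 1 item sets can be sketched as follows, assuming the first data is 1, each bitmap is held as a Python integer, and an item is frequent when its ratio is greater than the preset minimum support degree. The function name and sample data are hypothetical.

```python
def frequent_1_items(bitmaps: dict, total: int, min_support: float) -> list:
    """Return the items whose share of 1 bits is greater than min_support.

    bitmaps maps each target item to its bitmap (an int with one bit
    per transaction in the transaction set).
    """
    return sorted(
        item
        for item, bits in bitmaps.items()
        if bin(bits).count("1") / total > min_support
    )

# Hypothetical data: 5 transactions in total.
total = 5
bitmaps = {
    "A": 0b11101,  # 4 of 5 transactions, ratio 0.8
    "B": 0b00011,  # 2 of 5 transactions, ratio 0.4
    "C": 0b10110,  # 3 of 5 transactions, ratio 0.6
}

frequent = frequent_1_items(bitmaps, total, min_support=0.5)
```

With a minimum support degree of 0.5, items A and C qualify and item B does not.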
When the target item is determined to be the frequent 1 item set, the distributed child node may broadcast the target item and bitmap data of the target item to the master node and other distributed child nodes. And may receive the statistics item and the bitmap data of the statistics item broadcasted by other distributed child nodes. It can then be quickly determined whether a target item set containing at least two items is a frequent item set based on the bitmap data of the target item and the bitmap data of the statistical item. If the target item set is a frequent item set, the association rule of each item in the target item set can be determined according to the frequent item set, so that the speed of acquiring the association rule is increased.
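One way to picture the step after the broadcast is to pair every local frequent item with every received statistical item and keep the pairs whose joint support is high enough. This is only an illustrative sketch: the embodiment does not prescribe this enumeration order, and the names and data below are hypothetical.

```python
from itertools import product

def frequent_pairs(local: dict, received: dict, total: int,
                   min_support: float) -> dict:
    """Map each frequent pair to its support.

    local and received map items to bitmaps (ints, one bit per transaction).
    """
    result = {}
    for (a, bits_a), (e, bits_e) in product(local.items(), received.items()):
        if a == e:
            continue  # a pair needs two distinct items
        support = bin(bits_a & bits_e).count("1") / total
        if support > min_support:
            result[frozenset((a, e))] = support
    return result

# Hypothetical data: 8 transactions in total.
total = 8
local = {"A": 0b11110111}                      # held by this child node
received = {"E": 0b01110111, "G": 0b00001001}  # broadcast by other nodes

pairs = frequent_pairs(local, received, total, min_support=0.5)
```

In this data, items A and E share six of eight transactions (support 0.75), while items A and G share only one, so only the pair {A, E} is kept.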
Optionally, the first determining unit 403 may specifically be configured to:
for each target item, based on the transactions in the first transaction subset that include the target item, the transactions in the second transaction subset that include the target item, and a preset mapping relationship, set the value of each bit corresponding to a transaction that includes the target item to the first data, and set the value of each bit corresponding to a transaction that does not include the target item to the second data, to obtain the bitmap data of the target item, where the mapping relationship is: the correspondence, determined according to the ordering, between the bits in the bitmap data and the transactions in the transaction set.
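The mapping relationship and the bit-setting step can be sketched as below. The sketch assumes the first data is 1 and the second data is 0, and that the node knows which transactions of both subsets contain the item; the transaction identifiers, function names, and sample data are hypothetical.

```python
def build_mapping(ordering):
    """Map each transaction id to its bit position according to the ordering."""
    return {tid: pos for pos, tid in enumerate(ordering)}

def build_bitmap(item, transactions, mapping):
    """Build the bitmap of one item.

    transactions maps a transaction id to the set of items it contains.
    Bits start at 0 (second data); a bit is set to 1 (first data) when
    the corresponding transaction contains the item.
    """
    bits = 0
    for tid, items in transactions.items():
        if item in items:
            bits |= 1 << mapping[tid]
    return bits

# Hypothetical ordering of the whole transaction set.
ordering = ["t1", "t2", "t3", "t4"]
mapping = build_mapping(ordering)

first_subset = {"t1": {"A", "B"}, "t3": {"A"}}
second_subset = {"t2": {"B"}, "t4": {"A", "C"}}

all_transactions = {**first_subset, **second_subset}
bitmap_a = build_bitmap("A", all_transactions, mapping)
```

Item A occurs in t1, t3, and t4, so bits 0, 2, and 3 are set and the bitmap is 0b1101. Because every node uses the same ordering, bitmaps built on different nodes line up position by position.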
Optionally, in an embodiment of the present invention, the apparatus may further include:
a second receiving unit, configured to receive a statistical instruction for the target item set sent by the master node before calculating whether the target item set is a frequent item set based on the received statistical item, the bitmap data of the statistical item, and the bitmap data of the target item which is a frequent 1 item set.
Optionally, in an embodiment of the present invention, the apparatus may further include:
a compression unit, configured to compress the bitmap data corresponding to each target item into compressed bitmap data after the bitmap data corresponding to each target item is determined according to the ordering;
the broadcast unit 405 may be specifically configured to:
broadcast the target items that are frequent 1 item sets and the compressed bitmap data of those target items to the master node and the other distributed child nodes.
Optionally, in this embodiment of the present invention, the first data is 1, and the second data is 0.
Corresponding to the above method embodiment, an embodiment of the present invention further provides a distributed sub-node, referring to fig. 5, including a processor 501, a communication interface 502, a memory 503, and a communication bus 504, where the processor 501, the communication interface 502, and the memory 503 complete mutual communication through the communication bus 504;
a memory 503 for storing a computer program;
the processor 501 is configured to implement the method steps of the data analysis method based on the bitmap data structure provided in any one of the above-described method embodiments when executing the program stored in the memory 503.
In an embodiment of the invention, a distributed child node in a distributed system may receive a first transaction subset allocated by the master node. The total number of transactions in the transaction set and the ordering of the transactions in the transaction set are then obtained, and the items contained in the transactions of the first transaction subset are determined to be target items. Each target item is assigned as many bits as the total number, and each bit is mapped to one transaction in the transaction set according to the obtained ordering. Each bit uniquely corresponds to one transaction, and any two bits correspond to different transactions. A transaction containing the target item is determined to be an associated transaction of the target item; the bit corresponding to an associated transaction is set to the first data, and every other bit is set to the second data, thereby obtaining the bitmap data corresponding to the target item. In this way, the proportion of transactions in the transaction set that contain the target item can be determined quickly as the ratio of the number of first data in the bitmap data to the total number. Whether the target item is a frequent 1 item set can then be determined from this proportion, which greatly increases the speed of obtaining frequent 1 item sets.
When the target item is determined to be the frequent 1 item set, the distributed child node may broadcast the target item and bitmap data of the target item to the master node and other distributed child nodes. And may receive the statistics item and the bitmap data of the statistics item broadcasted by other distributed child nodes. It can then be quickly determined whether a target item set containing at least two items is a frequent item set based on the bitmap data of the target item and the bitmap data of the statistical item. If the target item set is a frequent item set, the association rule of each item in the target item set can be determined according to the frequent item set, so that the speed of acquiring the association rule is increased.
The communication bus mentioned in the electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown, but this does not mean that there is only one bus or one type of bus.
The communication interface is used for communication between the electronic equipment and other equipment.
The Memory may include a Random Access Memory (RAM) or a Non-Volatile Memory (NVM), such as at least one disk Memory. Optionally, the memory may also be at least one memory device located remotely from the processor.
The Processor may be a general-purpose Processor, including a Central Processing Unit (CPU), a Network Processor (NP), and the like; but may also be a Digital Signal Processor (DSP), an Application Specific Integrated Circuit (ASIC), a Field Programmable Gate Array (FPGA) or other Programmable logic device, discrete Gate or transistor logic device, discrete hardware component.
It is noted that, herein, relational terms such as first and second, and the like may be used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Also, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising an … …" does not exclude the presence of other identical elements in a process, method, article, or apparatus that comprises the element.
All the embodiments in the present specification are described in a related manner, and the same and similar parts among the embodiments may be referred to each other, and each embodiment focuses on the differences from the other embodiments. Especially, for the device and distributed child node embodiments, since they are basically similar to the method embodiments, the description is simple, and the relevant points can be referred to the partial description of the method embodiments.
The above description is only for the preferred embodiment of the present invention, and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present invention shall fall within the protection scope of the present invention.

Claims (8)

1. A data analysis method based on a bitmap data structure is applied to one of distributed sub-nodes contained in a distributed system, and the distributed system comprises: a master node and distributed child nodes, the method comprising:
obtaining a first subset of transactions allocated by the master node, wherein the first subset of transactions is: a subset of a set of transactions;
obtaining a total number of transactions in the transaction set and an ordering of the transactions in the transaction set;
determining bitmap data corresponding to each target item according to the ordering, wherein each bit of the bitmap data corresponding to one target item corresponds to one transaction in the transaction set according to the ordering, and the value of each bit indicates whether the transaction corresponding to the bit is an associated transaction of the target item; the target items are: items contained in each transaction in the first transaction subset; the association transaction for a target item is: the transaction set comprises the transactions of the target item;
counting, for the bitmap data corresponding to each target item, the ratio between the number of first data and the total number, and determining the frequent 1 item sets among the target items according to the counted ratio, wherein the first data is: the value of a bit in the bitmap data that corresponds to an associated transaction of an item;
broadcasting the target item which is the frequent 1 item set and the bitmap data of the target item which is the frequent 1 item set to the master node and other distributed child nodes;
receiving a statistical item and bitmap data of the statistical item, wherein the statistical item is obtained by other distributed sub-nodes based on statistics of a second transaction sub-set allocated by the master node, and a union of the first transaction sub-set and each second transaction sub-set is the transaction set;
calculating whether a target item set is a frequent item set or not based on the received statistical items, bitmap data of the statistical items and bitmap data of target items which are frequent 1 item sets, wherein the target item set comprises at least two items;
if the target item set is a frequent item set, determining association rules among items in the target item set;
the step of determining the bitmap data corresponding to each target item according to the ranking comprises:
for each target item, based on the transactions in the first transaction subset that include the target item, the transactions in the second transaction subset that include the target item, and a preset mapping relationship, setting the value of each bit corresponding to a transaction that includes the target item to the first data, and setting the value of each bit corresponding to a transaction that does not include the target item to the second data, to obtain the bitmap data of the target item, wherein the mapping relationship is: the correspondence, determined according to the ordering, between the bits in the bitmap data and the transactions in the transaction set.
2. The method of claim 1, wherein the first data is 1 and the second data is 0.
3. The method of claim 1, wherein prior to the step of computing whether the target set of items is a frequent set of items, the method further comprises:
receiving a statistical instruction for the target item set sent by the master node.
4. The method of claim 1, wherein after the step of determining bitmap data corresponding to each target item according to the ordering, the method further comprises:
compressing bitmap data corresponding to each target item into compressed bitmap data;
the step of broadcasting the bitmap data of the target item of the frequent 1 item set and the target item of the frequent 1 item set to the master node and the other distributed child nodes includes:
broadcasting the target items of the frequent 1 item set and the compressed bitmap data of the target items of the frequent 1 item set to the master node and the other distributed child nodes.
5. A data analysis device based on a bitmap data structure is applied to one of distributed sub-nodes contained in a distributed system, and the distributed system comprises: a master node and distributed sub-nodes, the apparatus comprising:
a first obtaining unit, configured to obtain a first subset of transactions allocated by the master node, where the first subset of transactions is: a subset of a set of transactions;
a second obtaining unit, configured to obtain a total number of transactions in the transaction set and an ordering of the transactions in the transaction set;
a first determining unit, configured to determine, according to the ordering, bitmap data corresponding to each target item, where each bit of the bitmap data corresponding to one target item corresponds to one transaction in the transaction set according to the ordering, and a value of each bit indicates whether the transaction corresponding to the bit is an associated transaction of the target item; the target items are: items contained in each transaction in the first transaction subset; the association transaction for a target item is: the transaction set comprises the transactions of the target item;
a counting unit, configured to count, for the bitmap data corresponding to each target item, the ratio between the number of first data and the total number, and determine the frequent 1 item sets among the target items according to the counted ratio, where the first data is: the value of a bit in the bitmap data that corresponds to an associated transaction of an item;
a broadcasting unit, configured to broadcast the target item of the frequent 1-item set and bitmap data of the target item of the frequent 1-item set to the master node and other distributed child nodes;
a first receiving unit, configured to receive a statistical item and bitmap data of the statistical item, where the statistical item is obtained by other distributed sub-nodes based on statistics of a second transaction subset allocated by the master node, and a union of the first transaction subset and each second transaction subset is the transaction set;
a calculating unit, configured to calculate whether a target item set is a frequent item set based on the received statistical items, bitmap data of the statistical items, and bitmap data of target items that are frequent 1 item sets, where the target item set includes at least two items;
a second determining unit, configured to determine association rules among the items in the target item set when the target item set is a frequent item set;
the first determining unit is specifically configured to:
for each target item, based on the transactions in the first transaction subset that include the target item, the transactions in the second transaction subset that include the target item, and a preset mapping relationship, set the value of each bit corresponding to a transaction that includes the target item to the first data, and set the value of each bit corresponding to a transaction that does not include the target item to the second data, to obtain the bitmap data of the target item, where the mapping relationship is: the correspondence, determined according to the ordering, between the bits in the bitmap data and the transactions in the transaction set.
6. The apparatus of claim 5, further comprising:
a second receiving unit, configured to receive a statistical instruction for the target item set sent by the master node before calculating whether the target item set is a frequent item set based on the received statistical item, the bitmap data of the statistical item, and the bitmap data of the target item that is a frequent 1 item set.
7. The apparatus of claim 5, further comprising:
a compression unit, configured to compress the bitmap data corresponding to each target item into compressed bitmap data after the bitmap data corresponding to each target item is determined according to the ordering;
the broadcast unit is specifically configured to:
broadcasting the target items of the frequent 1 item set and the compressed bitmap data of the target items of the frequent 1 item set to the master node and the other distributed child nodes.
8. A distributed sub-node, characterized by comprising a processor, a communication interface, a memory, and a communication bus, wherein the processor, the communication interface, and the memory communicate with one another through the communication bus;
a memory for storing a computer program;
a processor for implementing the method steps of any of claims 1 to 4 when executing a program stored in the memory.
CN201710872848.7A 2017-09-25 2017-09-25 Data analysis method and device based on bitmap data structure Active CN107622121B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710872848.7A CN107622121B (en) 2017-09-25 2017-09-25 Data analysis method and device based on bitmap data structure

Publications (2)

Publication Number Publication Date
CN107622121A CN107622121A (en) 2018-01-23
CN107622121B true CN107622121B (en) 2020-06-23

Family

ID=61090110

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710872848.7A Active CN107622121B (en) 2017-09-25 2017-09-25 Data analysis method and device based on bitmap data structure

Country Status (1)

Country Link
CN (1) CN107622121B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant