CN107622121A - Data analysis method and device based on a bitmap data structure - Google Patents

Info

Publication number
CN107622121A
Authority
CN
China
Prior art keywords
item
transaction
bitmap data
target
frequent
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710872848.7A
Other languages
Chinese (zh)
Other versions
CN107622121B (en
Inventor
刘东岳
吴斌
王柏
卜尧
郭志红
杨祎
马艳
辜超
白德盟
林颖
秦佳峰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN201710872848.7A
Publication of CN107622121A
Application granted
Publication of CN107622121B
Legal status: Active
Anticipated expiration

Landscapes

  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Management, Administration, Business Operations System, And Electronic Commerce (AREA)

Abstract

Embodiments of the present invention provide a data analysis method and device based on a bitmap data structure. The method includes: obtaining a first transaction subset allocated by a master node; obtaining the total number of transactions and their ordering; determining bitmap data for each target item according to the ordering; computing, for each target item, the ratio between the number of first-data bits in its bitmap data and the total number, and determining the frequent 1-itemsets among the target items according to the computed ratios; broadcasting each target item that is a frequent 1-itemset, together with its bitmap data, to the master node and the other distributed child nodes; receiving statistical items and the bitmap data of the statistical items; calculating, based on the received statistical items, their bitmap data, and the bitmap data of the target items that are frequent 1-itemsets, whether a target itemset is a frequent itemset; and, if the target itemset is a frequent itemset, determining the association rules between the items in the target itemset. In this way, association rules, and hence the association relationships between items, can be obtained quickly.

Description

Data analysis method and device based on a bitmap data structure
Technical field
The present invention relates to the field of data mining technology, and in particular to a data analysis method and device based on a bitmap data structure.
Background
With the explosive growth of data, people increasingly want to mine valuable information from large volumes of existing data and to make decisions based on that information.
For example, a large retail store generates tens of thousands of transaction records every year. Each transaction record corresponds to one order number, and one order number corresponds to multiple item names. Without data mining, however, the association rules between the articles purchased cannot be obtained (for example, among the people who buy coffee, 60% also buy cake). With data mining, the purchase relationships between articles, that is, the association rules between them, can be extracted and then used to support the store's marketing decisions.
In data mining, one transaction record of the retail store is usually treated as one transaction, and the different articles in the store are treated as different items. Specifically, for each transaction record, the corresponding order number can serve as the transaction identifier of the corresponding transaction, and each item name under that order number can serve as one item of the transaction. In this way, the store's transaction records for one year yield a transaction set made up of multiple transactions, each of which contains one or more items. A frequent itemset mining algorithm, the ECLAT algorithm, is then used to mine the frequent itemsets in the transaction set, and the association rules between items are calculated from the frequent itemsets.
Specifically, suppose the transaction set contains 10000 transactions and the ECLAT algorithm finds that the itemset {item A, item B} occurs 100 times in the transaction set, that is, item A and item B occur together 100 times. The probability that the itemset {item A, item B} occurs is therefore 0.01, that is, the support of the itemset {item A, item B} is 0.01. If 0.01 is greater than the preset minimum support, the itemset {item A, item B} is a frequent 2-itemset, and the association relationship between item A and item B can then be calculated from this frequent 2-itemset.
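The support calculation in this example can be sketched in a few lines of Python. This is only an illustration of the arithmetic: the function name `support` and the threshold `MIN_SUPPORT = 0.005` are assumptions made for the example and do not come from the patent.

```python
def support(itemset_count: int, total_transactions: int) -> float:
    """Return the fraction of transactions that contain the itemset."""
    if total_transactions <= 0:
        raise ValueError("total_transactions must be positive")
    return itemset_count / total_transactions


# Hypothetical preset minimum support, chosen only for illustration.
MIN_SUPPORT = 0.005

# Itemset {A, B} occurs 100 times among 10000 transactions.
s = support(100, 10_000)

# The itemset is frequent only if its support is greater than the threshold.
is_frequent_2_itemset = s > MIN_SUPPORT
```

With these numbers the support is 0.01, so the itemset counts as frequent under the assumed threshold.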
However, the inventors found that, when frequent itemsets are calculated with the ECLAT algorithm, item B has to be matched against every item in each transaction that contains item A; if a match succeeds, the transaction containing item A also contains item B, and the occurrence count of the itemset {item A, item B} is increased by one. Then item A is matched against every item in each transaction that contains item B; if a match succeeds, the transaction containing item B also contains item A, and the occurrence count of the itemset {item A, item B} is increased by one. In this way the number of occurrences of the itemset {item A, item B} is counted. This matching is very slow, however, so association rules are obtained slowly.
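To make the cost of this item-by-item matching concrete, the following Python sketch counts co-occurrences by scanning transactions. It shows one direction of the matching only (item B matched inside the transactions that contain item A) and is an illustration, not the ECLAT algorithm itself; the function name and the sample data are assumptions.

```python
def count_cooccurrence_by_scan(transactions, item_a, item_b):
    """Count transactions containing both items by matching item by item."""
    count = 0
    for items in transactions.values():
        if item_a not in items:
            continue
        # Match item_b against every item of a transaction containing item_a.
        for item in items:
            if item == item_b:
                count += 1
                break
    return count


# Hypothetical transactions: identifier -> list of items.
transactions = {
    "T1": ["A", "B", "C"],
    "T2": ["A", "D"],
    "T3": ["B", "A"],
    "T4": ["B", "E"],
}

pair_count = count_cooccurrence_by_scan(transactions, "A", "B")
```

Every candidate itemset requires another pass of this kind over the transactions, which is the slowness the passage describes.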
Summary of the invention
The purpose of the embodiments of the present invention is to provide a data analysis method and device based on a bitmap data structure, so that association rules can be obtained quickly by analysis and the association relationships between items can therefore be obtained quickly.
In a first aspect, an embodiment of the present invention provides a data analysis method based on a bitmap data structure, applied to one distributed child node among the distributed child nodes of a distributed system, where the distributed system includes a master node and distributed child nodes. The method may include:
obtaining a first transaction subset allocated by the master node, where the first transaction subset is a subset of a transaction set;
obtaining the total number of transactions in the transaction set and the ordering of the transactions in the transaction set;
determining, according to the ordering, the bitmap data corresponding to each target item, where each bit of the bitmap data corresponding to a target item corresponds, according to the ordering, to one transaction in the transaction set, and the value of each bit indicates whether the transaction corresponding to that bit is an associated transaction of the target item; the target items are the items contained in the transactions of the first transaction subset; and the associated transactions of a target item are the transactions in the transaction set that include the target item;
computing, for each target item, the ratio of the number of bits holding the first data in the corresponding bitmap data to the total number, and determining the frequent 1-itemsets among the target items according to the computed ratios, where the first data is the value of the bits in the bitmap data that correspond to the associated transactions of the item;
broadcasting the target items that are frequent 1-itemsets, and the bitmap data of those target items, to the master node and the other distributed child nodes;
receiving statistical items and the bitmap data of the statistical items, where the statistical items are obtained by the other distributed child nodes through statistics on the second transaction subsets allocated by the master node, and the union of the first transaction subset and all second transaction subsets is the transaction set;
calculating, based on the received statistical items, the bitmap data of the statistical items, and the bitmap data of the target items that are frequent 1-itemsets, whether a target itemset is a frequent itemset, where the target itemset contains at least two items; and
if the target itemset is a frequent itemset, determining the association rules between the items in the target itemset.
Optionally, the step of determining, according to the ordering, the bitmap data corresponding to each target item may include:
for each target item, based on the transactions in the first transaction subset that include the target item, the transactions in the second transaction subsets that include the target item, and a preset mapping relationship, setting the value of the bits corresponding to transactions that include the target item to the first data and the value of the bits corresponding to transactions that do not include the target item to second data, thereby obtaining the bitmap data of the target item, where the mapping relationship is the correspondence, determined according to the ordering, between the bits of the bitmap data and the transactions in the transaction set.
Optionally, in the embodiments of the present invention, the first data is 1 and the second data is 0.
Optionally, before the step of calculating whether the target itemset is a frequent itemset, the method may further include:
receiving a statistics instruction for the target itemset sent by the master node.
Optionally, after the step of determining, according to the ordering, the bitmap data corresponding to each target item, the method may further include:
compressing the bitmap data corresponding to each target item into compressed bitmap data;
in which case the step of broadcasting the target items that are frequent 1-itemsets, and the bitmap data of those target items, to the master node and the other distributed child nodes includes:
broadcasting the target items that are frequent 1-itemsets, and the compressed bitmap data of those target items, to the master node and the other distributed child nodes.
In a second aspect, an embodiment of the present invention provides a data analysis device based on a bitmap data structure, applied to one distributed child node among the distributed child nodes of a distributed system, where the distributed system includes a master node and distributed child nodes. The device may include:
a first obtaining unit, configured to obtain a first transaction subset allocated by the master node, where the first transaction subset is a subset of a transaction set;
a second obtaining unit, configured to obtain the total number of transactions in the transaction set and the ordering of the transactions in the transaction set;
a first determining unit, configured to determine, according to the ordering, the bitmap data corresponding to each target item, where each bit of the bitmap data corresponding to a target item corresponds, according to the ordering, to one transaction in the transaction set, and the value of each bit indicates whether the transaction corresponding to that bit is an associated transaction of the target item; the target items are the items contained in the transactions of the first transaction subset; and the associated transactions of a target item are the transactions in the transaction set that include the target item;
a statistics unit, configured to compute, for each target item, the ratio of the number of bits holding the first data in the corresponding bitmap data to the total number, and to determine the frequent 1-itemsets among the target items according to the computed ratios, where the first data is the value of the bits in the bitmap data that correspond to the associated transactions of the item;
a broadcast unit, configured to broadcast the target items that are frequent 1-itemsets, and the bitmap data of those target items, to the master node and the other distributed child nodes;
a first receiving unit, configured to receive statistical items and the bitmap data of the statistical items, where the statistical items are obtained by the other distributed child nodes through statistics on the second transaction subsets allocated by the master node, and the union of the first transaction subset and all second transaction subsets is the transaction set;
a calculation unit, configured to calculate, based on the received statistical items, the bitmap data of the statistical items, and the bitmap data of the target items that are frequent 1-itemsets, whether a target itemset is a frequent itemset, where the target itemset contains at least two items; and
a second determining unit, configured to determine, when the target itemset is a frequent itemset, the association rules between the items in the target itemset.
Optionally, the first determining unit may be specifically configured to:
for each target item, based on the transactions in the first transaction subset that include the target item, the transactions in the second transaction subsets that include the target item, and a preset mapping relationship, set the value of the bits corresponding to transactions that include the target item to the first data and the value of the bits corresponding to transactions that do not include the target item to second data, thereby obtaining the bitmap data of the target item, where the mapping relationship is the correspondence, determined according to the ordering, between the bits of the bitmap data and the transactions in the transaction set.
Optionally, in the embodiments of the present invention, the device may further include:
a second receiving unit, configured to receive a statistics instruction for the target itemset sent by the master node before it is calculated, based on the received statistical items, the bitmap data of the statistical items, and the bitmap data of the target items that are frequent 1-itemsets, whether the target itemset is a frequent itemset.
Optionally, in the embodiments of the present invention, the device may further include:
a compression unit, configured to compress the bitmap data corresponding to each target item into compressed bitmap data after the bitmap data corresponding to each target item has been determined according to the ordering;
in which case the broadcast unit may be specifically configured to:
broadcast the target items that are frequent 1-itemsets, and the compressed bitmap data of those target items, to the master node and the other distributed child nodes.
Optionally, in the embodiments of the present invention, the first data is 1 and the second data is 0.
In a third aspect, an embodiment of the present invention further provides a distributed child node, including a processor, a communication interface, a memory, and a communication bus, where the processor, the communication interface, and the memory communicate with one another through the communication bus;
the memory is configured to store a computer program; and
the processor is configured to implement, when executing the program stored in the memory, the method steps of the data analysis method based on a bitmap data structure described in any one of the implementations of the first aspect.
In the embodiments of the present invention, a distributed child node in the distributed system can receive the first transaction subset allocated by the master node. It then obtains the total number of transactions in the transaction set and the ordering of the transactions in the transaction set. Next, it determines the items contained in the transactions of the first transaction subset as target items. Each target item is associated with a number of bits equal to the total number, and, according to the obtained ordering, each bit is mapped to one transaction in the transaction set. Each bit corresponds to exactly one transaction, and any two bits correspond to different transactions. The transactions that contain the target item are determined to be the associated transactions of the target item; the bits corresponding to associated transactions are set to the first data, and the bits corresponding to transactions that are not associated transactions are set to the second data, which yields the bitmap data corresponding to the target item. In this way, the proportion of transactions in the transaction set that contain the target item can be determined quickly from the ratio of the number of first-data bits in the bitmap data to the total number. Whether the target item is a frequent 1-itemset can then be determined from this ratio, which greatly increases the speed at which frequent 1-itemsets are obtained.
After determining that the target item is a frequent 1-itemset, the distributed child node can broadcast the target item and its bitmap data to the master node and the other distributed child nodes. It can also receive the statistical items broadcast by the other distributed child nodes together with their bitmap data. Based on the bitmap data of the target item and the bitmap data of the statistical items, it can then quickly determine whether a target itemset containing at least two items is a frequent itemset. If the target itemset is a frequent itemset, the association rules between the items in the target itemset can be determined from the frequent itemset, which increases the speed at which association rules are obtained.
Brief description of the drawings
To describe the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention, and a person of ordinary skill in the art can derive other drawings from them without creative effort.
Fig. 1 is a flowchart of a data analysis method based on a bitmap data structure according to an embodiment of the present invention;
Fig. 2 is a performance test chart comparing the data analysis method based on a bitmap data structure according to an embodiment of the present invention with an existing data analysis method;
Fig. 3 is a schematic diagram of the performance of the data analysis method based on a bitmap data structure according to an embodiment of the present invention under various preset minimum supports;
Fig. 4 is a schematic structural diagram of a data analysis device based on a bitmap data structure according to an embodiment of the present invention;
Fig. 5 is a schematic structural diagram of a distributed node according to an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings in the embodiments of the present invention. Obviously, the described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art based on the embodiments of the present invention without creative effort fall within the scope of protection of the present invention.
To solve the problem in the prior art, the embodiments of the present invention provide a data analysis method and device based on a bitmap data structure.
The data analysis method based on a bitmap data structure provided by the embodiments of the present invention is described first.
The data analysis method based on a bitmap data structure provided by the embodiments of the present invention is applied to any one distributed child node among the distributed child nodes of a distributed system, where the distributed system includes a master node and distributed child nodes.
For clarity, the method is described below using a distributed system with 1 master node and 31 distributed child nodes as an example. The master node and the distributed child nodes may be servers, or user terminals such as computers and mobile phones; either is reasonable.
Referring to Fig. 1, the data analysis method based on a bitmap data structure provided by the embodiments of the present invention may include the following steps:
S101: Obtain the first transaction subset allocated by the master node, where the first transaction subset is a subset of the transaction set;
Suppose the distributed system needs to perform association rule analysis on 75000 transaction records of a large retail store, that is, on 75000 transactions; these 75000 transactions form the transaction set. The order number corresponding to each transaction record can serve as the transaction identifier of the corresponding transaction, and each kind of article under that order number can serve as one item of the transaction corresponding to the order number.
Suppose further that the master node in the distributed system is Z and the distributed child nodes are F1, F2, ..., F30, and F31. For distributed child node F1, the master node Z can form the first transaction subset from part of the transactions in the transaction set and allocate it to F1, so that F1 obtains the first transaction subset allocated by master node Z.
The master node Z may allocate the first transaction subset to distributed child node F1 on an equal-share basis, for example by allocating any 2419 of the 75000 transactions to F1 as the first transaction subset. Alternatively, it may allocate the first transaction subset based on the computing capability of F1, for example by allocating any 5000 of the 75000 transactions to F1 as the first transaction subset. Either is reasonable.
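An equal-share allocation of this kind could be sketched as follows. The helper `split_evenly` is hypothetical and the patent does not specify how the remainder is handled; since 75000 is not divisible by 31, this sketch simply gives one extra transaction to the first few nodes.

```python
def split_evenly(transaction_ids, num_nodes):
    """Split the transaction identifiers into num_nodes contiguous subsets
    whose sizes differ by at most one."""
    base, extra = divmod(len(transaction_ids), num_nodes)
    subsets = []
    start = 0
    for node_index in range(num_nodes):
        # The first `extra` nodes each take one additional transaction.
        size = base + (1 if node_index < extra else 0)
        subsets.append(transaction_ids[start:start + size])
        start += size
    return subsets


# 75000 transactions shared among 31 distributed child nodes (F1 ... F31).
all_ids = list(range(1, 75_001))
subsets = split_evenly(all_ids, 31)
subset_sizes = sorted({len(s) for s in subsets})
```

Under this assumed scheme every node receives 2419 or 2420 transactions, the subsets are disjoint, and their union is the whole transaction set.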
S102: Obtain the total number of transactions in the transaction set and the ordering of the transactions in the transaction set;
S103: Determine, according to the ordering, the bitmap data corresponding to each target item, where each bit of the bitmap data corresponding to a target item corresponds, according to the ordering, to one transaction in the transaction set, and the value of each bit indicates whether the transaction corresponding to that bit is an associated transaction of the target item; the target items are the items contained in the transactions of the first transaction subset; and the associated transactions of a target item are the transactions in the transaction set that include the target item;
After distributed child node F1 receives the first transaction subset, it can determine which target items the transactions in the first transaction subset contain. For example, if the first transaction subset is found to contain transaction 1 (item A, item B, item C) and transaction 2 (item A, item D), the transactions in the first transaction subset contain target item A, target item B, target item C, and target item D. The bitmap data corresponding to each target item is then determined.
The following uses the determination of the bitmap data of target item A as an example:
Distributed child node F1 can obtain from master node Z the total number of transactions in the transaction set and the ordering of the transactions in the transaction set. For example, the ordering may be: the transaction identifier of the 1st transaction, the transaction identifier of the 2nd transaction, ..., the transaction identifier of the 75000th transaction.
Then a number of bits equal to the total number, that is, 75000 bits, is determined, and the bits are arranged in sequence. According to the obtained ordering, each transaction is mapped to exactly one bit, and any two bits correspond to different transactions. It is then determined whether each transaction in the transaction set contains target item A. If a transaction contains target item A, it is an associated transaction of target item A, and the bit corresponding to it can be set to the first data, for example 1. If a transaction does not contain target item A, it is not an associated transaction of target item A, and the bit corresponding to it can be set to the second data, for example 0. Once the values of all the bits have been set, the bitmap data corresponding to target item A is obtained, and the correspondence between target item A and the bitmap data can be established.
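The bitmap construction described above can be sketched in Python by using an arbitrary-precision integer as the bit array. This is a minimal illustration under stated assumptions: the names `build_bitmaps` and `to_bit_string` are hypothetical, bit i of the integer stands for the (i+1)-th transaction in the ordering, and only transactions whose contents are known to the node are considered (the patent also has the node learn about associated transactions held by other nodes).

```python
def build_bitmaps(ordering, transactions):
    """Build one bitmap per item.

    ordering:     list of transaction identifiers in the agreed global order
    transactions: dict mapping transaction identifier -> set of items
    """
    # Mapping relationship: transaction identifier -> bit position.
    position = {tid: index for index, tid in enumerate(ordering)}
    bitmaps = {}
    for tid, items in transactions.items():
        for item in items:
            # Set the bit of this transaction to 1 (the "first data").
            bitmaps[item] = bitmaps.get(item, 0) | (1 << position[tid])
    return bitmaps


def to_bit_string(bitmap, total):
    """Render a bitmap with the 1st transaction leftmost, for inspection."""
    return "".join("1" if (bitmap >> i) & 1 else "0" for i in range(total))


# Small stand-in for the example: 4 transactions in total, two of them known.
ordering = [1, 2, 3, 4]
known = {1: {"A", "B", "C"}, 2: {"A", "D"}}
bitmaps = build_bitmaps(ordering, known)
bitmap_a_text = to_bit_string(bitmaps["A"], len(ordering))
```

Bits that are never set stay 0 (the "second data"), so item A, which appears in transactions 1 and 2, renders as `1100` here.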
Distributed child node F1 can communicate with the other distributed child nodes in the distributed system to learn which transactions in the transaction set are associated transactions of target item A. Alternatively, each distributed child node can report to master node Z the associated transactions it has found for each target item, and the master node can aggregate the associated transactions of each target item, so that F1 can obtain the transaction identifiers of all associated transactions of target item A from the master node. This is also reasonable.
Similarly, distributed child node F1 can determine the bitmap data corresponding to target item B, target item C, and target item D; details are not repeated here.
In addition, the transactions allocated to other distributed child nodes may also contain item A; for example, transaction 10 (item A, item E) allocated to distributed child node F2 contains item A. To prevent F2 from also treating item A as a target item and determining its bitmap data, which would compute the bitmap data of target item A twice, master node Z can send each distributed child node a bitmap data determination instruction that directs the distributed child nodes to determine the bitmap data of different items. For example, if F1 is instructed to determine the bitmap data of item A, the other distributed child nodes in the distributed system do not determine the bitmap data of item A again, which avoids wasting computing resources.
S104: Compute, for each target item, the ratio of the number of bits holding the first data in the corresponding bitmap data to the total number, and determine the frequent 1-itemsets among the target items according to the computed ratios, where the first data is the value of the bits in the bitmap data that correspond to the associated transactions of the item;
Continuing the example above, after the bitmap data of target item A is obtained, the number of first-data bits in the bitmap data can be determined, and the ratio of that number to the total number can be calculated, giving the ratio between the first data and the total number. It is then determined whether the ratio is greater than the preset minimum support: if it is greater, target item A is a frequent 1-itemset; if it is less than or equal to the preset minimum support, target item A is not a frequent 1-itemset. This greatly reduces the amount of computation needed to find frequent 1-itemsets and improves efficiency.
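The frequent 1-itemset test then reduces to counting the 1 bits in a bitmap. The sketch below assumes the bitmap is held as a Python integer; the function names and the 10-transaction sample bitmap are illustrative only.

```python
def bitmap_support(bitmap: int, total: int) -> float:
    """Ratio of 1 bits (associated transactions) to the total number."""
    return bin(bitmap).count("1") / total


def is_frequent(bitmap: int, total: int, min_support: float) -> bool:
    """Frequent only if the ratio is strictly greater than min_support."""
    return bitmap_support(bitmap, total) > min_support


# Hypothetical bitmap of item A over 10 transactions: 7 bits are set.
bitmap_a = 0b1011011101
support_a = bitmap_support(bitmap_a, 10)
```

With this sample, item A has support 0.7: it would be a frequent 1-itemset under an assumed threshold of 0.5, but not under a threshold of 0.8.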
A person skilled in the art can set the preset minimum support according to actual conditions. For example, the preset minimum support may be set to 0.8, although it is not limited to this value.
S105: Broadcast the target items that are frequent 1-itemsets, and the bitmap data of those target items, to the master node and the other distributed child nodes;
S106: Receive statistical items and the bitmap data of the statistical items, where the statistical items are obtained by the other distributed child nodes through statistics on the second transaction subsets allocated by the master node, and the union of the first transaction subset and all second transaction subsets is the transaction set;
Suppose target item A is a frequent 1-itemset. Distributed child node F1 can then broadcast target item A and the bitmap data of target item A to the other distributed child nodes. To reduce the memory consumed in storing bitmap data and the transmission cost of sending it, F1 may also compress the bitmap data of target item A into compressed bitmap data and then broadcast target item A and its compressed bitmap data to the master node and the other distributed child nodes. This is also reasonable.
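The passage does not name a compression scheme. As a stand-in, the sketch below serializes a bitmap to bytes and compresses it with Python's general-purpose `zlib` module before it would be broadcast; a bitmap-specific scheme could be substituted. The function names and the sample bitmap are assumptions for illustration.

```python
import zlib


def compress_bitmap(bitmap: int, total: int) -> bytes:
    """Serialize a bitmap of `total` bits and compress it with zlib."""
    raw = bitmap.to_bytes((total + 7) // 8, "little")
    return zlib.compress(raw)


def decompress_bitmap(payload: bytes) -> int:
    """Restore the integer bitmap from a compressed payload."""
    return int.from_bytes(zlib.decompress(payload), "little")


# Hypothetical sparse bitmap over 75000 transactions: only two bits are set.
total = 75_000
bitmap = (1 << 9) | (1 << 74_999)
payload = compress_bitmap(bitmap, total)
restored = decompress_bitmap(payload)
```

The uncompressed bitmap occupies 9375 bytes; a sparse bitmap like this one compresses to a much smaller payload, and the receiver recovers the original bits exactly.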
Certainly, the distributed child node F1 can also receive other distributed apparatus and statistical item and statistical item is calculated Bitmap data.Wherein, the statistical item can be the second affairs subset that other distributed child nodes are distributed based on host node Close frequent 1 item collection that statistics obtains.For example, distributed child node F2 receives the second affairs subclass of host node distribution, so It is frequent 1 item collection that statistical item E, which is calculated, based on the second affairs subclass afterwards, then by statistical item E and statistical items Mesh E bitmap data is broadcasted, so as to which the distributed child node F1 can receive statistical item E and statistical item E Bitmap data.Certainly, the statistical item can also be that other distributed child nodes are not determined whether for the item of frequent 1 item collection Mesh, this is also rational.
It should be noted that distributed child node F3 to F31 can receive the second affairs subset of host node Z transmissions Close, but in order to avoid computing repeatedly, the affairs included in the second affairs subclass that each distributed child node receives are mutual Differ.
S107:Based on statistical item, the bitmap data of statistical item and the destination item for frequent 1 item collection received Bitmap data, calculates whether target item collection is frequent item set, wherein, target item is concentrated and includes at least two projects;
S108:If target item collection is frequent item set, determine that target item concentrates the correlation rule between each project.
For example, after receiving statistical item E and the bitmap data of statistical item E broadcast by distributed child node F2, the distributed child node F1 can determine whether the itemset {item A, item E} is a frequent 2-itemset. That is, it determines whether the ratio of the number of transactions containing both item A and item E to the total number of transactions is greater than the preset minimum support; if so, the itemset {item A, item E} is determined to be a frequent 2-itemset.
The number of transactions containing both item A and item E may be determined as follows: compare the bitmap data of item A (i.e. target item A) with the bitmap data of item E (i.e. statistical item E). When the bits at the same position (for example, the 10th bit of both bitmaps) both hold the first data (for example, both are 1), the transaction corresponding to that 10th bit contains both item A and item E. In this way, the number of positions at which both bitmaps hold 1 is counted, which gives the number of transactions containing both item A and item E. Because this calculation requires very little computation, the number of transactions containing both items can be determined quickly, and hence whether the itemset {item A, item E} is a frequent 2-itemset can be determined quickly.
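The position-by-position comparison described above amounts to a bitwise AND followed by a count of set bits. The following is a minimal sketch, assuming each bitmap is held as a Python integer whose bit i corresponds to the i-th transaction; the function names are hypothetical.

```python
def cooccurrence_count(bitmap_a: int, bitmap_e: int) -> int:
    """Number of transactions whose bit is 1 in both bitmaps."""
    return bin(bitmap_a & bitmap_e).count("1")


def is_frequent(count: int, total: int, min_support: float) -> bool:
    """True if count / total is greater than the preset minimum support."""
    return count / total > min_support


# Ten transactions; item A and item E co-occur in six of them.
both = cooccurrence_count(0b1011011101, 0b1001011111)  # 6
frequent = is_frequent(both, total=10, min_support=0.5)  # True
```

The strict `>` comparison follows the wording "greater than the preset minimum support" used in the text.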
Assuming that the itemset {item A, item E} is determined to be a frequent 2-itemset, the number of transactions containing both item A and item E can be divided by the number of transactions containing item A; suppose the resulting confidence is 70%. This yields the association rule: among people who buy item A, 70% also buy item E. This association rule is only an example, and the association rules generated are of course not limited to it.
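The confidence calculation above can be sketched on the same integer-bitmap representation. The helper name `confidence` and the error handling for an item that never occurs are assumptions added for illustration.

```python
def confidence(bitmap_a: int, bitmap_e: int) -> float:
    """Confidence of the rule A -> E.

    Computed as (transactions containing both A and E) divided by
    (transactions containing A), using bit counts of the bitmaps.
    """
    count_a = bin(bitmap_a).count("1")
    if count_a == 0:
        raise ValueError("item A occurs in no transaction")
    count_both = bin(bitmap_a & bitmap_e).count("1")
    return count_both / count_a


# Item A occurs in 10 transactions; 7 of them also contain item E.
rule_confidence = confidence(0b1111111111, 0b0001111111)  # 0.7
```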
In summary, applying the embodiment of the present invention increases the speed at which frequent itemsets are obtained, so that the association rules implicit in the data can be derived quickly. Here, the frequent itemsets include frequent 1-itemsets and frequent itemsets with multiple items.
The performance of the data analysis method based on a bitmap data structure provided by the embodiment of the present invention in deriving association rules is described below with reference to Table 1, Table 2, Fig. 2 and Fig. 3.
Table 1
Referring to Table 1, the inventors denote the data analysis method based on a bitmap data structure provided by the embodiment of the present invention as the RBM-Eclat algorithm. With the preset minimum support set to 0.8, association rule analysis was performed with the RBM-Eclat algorithm on a transaction set containing 1,000,000 transactions, and obtaining all association rules took 81 seconds. With the preset minimum support likewise set to 0.8, association rule analysis was performed on the same transaction set with the prior-art Eclat algorithm, and obtaining all association rules took 182 seconds. In addition, again with the preset minimum support set to 0.8, association rule analysis was performed on the same transaction set with the prior-art Apriori algorithm, and obtaining all association rules took 151 seconds.
In the same manner, the inventors also applied the three algorithms to transaction sets containing 2,000,000, 4,000,000, 8,000,000 and 16,000,000 transactions respectively, obtaining the results shown in Table 1 and Fig. 2. As Table 1 and Fig. 2 show, compared with existing association rule analysis methods, the data analysis method based on a bitmap data structure provided by the embodiment of the present invention derives association rules more quickly.
In addition, the performance of the data analysis method based on a bitmap data structure provided by the embodiment of the present invention differs under different preset minimum supports; see Table 2 and Fig. 3 for details.
Table 2
As Table 2 and Fig. 3 show, when the preset minimum support is set to 0.6 and the data analysis method based on a bitmap data structure provided by the embodiment of the present invention is used for association rule analysis on a transaction set containing 500,000 transactions, obtaining all association rules takes 222 seconds. When the preset minimum support is set to 0.65, the same analysis on the same 500,000-transaction set takes 113 seconds. When the preset minimum support is set to 0.7, it takes 84 seconds, and so on. The remaining values are not described one by one here.
Likewise, when the preset minimum support is set to 0.6 and the method is used for association rule analysis on a transaction set containing 1,000,000 transactions, obtaining all association rules takes 486 seconds. When the preset minimum support is set to 0.65, the same analysis on the same 1,000,000-transaction set takes 182 seconds. When the preset minimum support is set to 0.7, it takes 126 seconds, and so on. The remaining values are not described one by one here.
As can be seen from the above, when association rule analysis is performed on the same transaction set, the larger the preset minimum support is set, the faster the data analysis method based on a bitmap data structure provided by the embodiment of the present invention runs (for example, 222 seconds at 0.6 versus 84 seconds at 0.7 for 500,000 transactions).
Corresponding to the above method embodiment, the embodiment of the present invention further provides a data analysis device based on a bitmap data structure, applied to one distributed child node among the distributed child nodes included in a distributed system, the distributed system including a master node and distributed child nodes. Referring to Fig. 4, the device may include:
A first obtaining unit 401, configured to obtain a first transaction subset distributed by the master node, where the first transaction subset is a subset of a transaction set;
A second obtaining unit 402, configured to obtain the total number of transactions in the transaction set and the ordering of the transactions in the transaction set;
A first determining unit 403, configured to determine, according to the ordering, the bitmap data corresponding to each target item, where each bit of the bitmap data corresponding to a target item corresponds, according to the ordering, to one transaction in the transaction set, and the value of each bit indicates whether the transaction corresponding to that bit is an associated transaction of the target item; a target item is an item contained in a transaction of the first transaction subset; an associated transaction of a target item is a transaction in the transaction set that includes the target item;
A statistics unit 404, configured to count, for each target item, the ratio of the number of first data in the corresponding bitmap data to the total number, and to determine the frequent 1-itemsets among the target items according to the counted ratios, where the first data is the value of the bits corresponding to an item's associated transactions in the bitmap data;
A broadcast unit 405, configured to broadcast each target item that is a frequent 1-itemset, together with the bitmap data of that target item, to the master node and the other distributed child nodes;
A first receiving unit 406, configured to receive a statistical item and the bitmap data of the statistical item, where the statistical item is obtained by another distributed child node by counting over a second transaction subset distributed by the master node, and the union of the first transaction subset and all second transaction subsets is the transaction set;
A computing unit 407, configured to calculate, based on the received statistical item, the bitmap data of the statistical item, and the bitmap data of the target item that is a frequent 1-itemset, whether a target itemset is a frequent itemset, where the target itemset contains at least two items;
A second determining unit 408, configured to determine, when the target itemset is a frequent itemset, the association rules among the items in the target itemset.
In the embodiment of the present invention, a distributed child node in the distributed system can receive the first transaction subset distributed by the master node. It then obtains the total number of transactions in the transaction set and the ordering of the transactions in the transaction set. Next, it determines the items contained in the transactions of the first transaction subset and takes them as target items. Each target item is assigned as many bits as the total number of transactions, and, according to the obtained ordering, each bit is mapped to one transaction in the transaction set. Each bit corresponds to exactly one transaction, and no two bits correspond to the same transaction. A transaction that contains the target item is defined as an associated transaction of that target item; the bits corresponding to associated transactions are set to the first data, and the bits corresponding to non-associated transactions are set to the second data, which yields the bitmap data corresponding to the target item. The proportion of transactions in the transaction set that contain the target item can then be determined quickly as the ratio of the number of first data in the bitmap data to the total number. Whether the target item is a frequent 1-itemset can in turn be determined from this ratio, which greatly increases the speed at which frequent 1-itemsets are obtained.
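The bitmap construction and frequent 1-itemset check summarized above can be sketched as follows. This is a minimal illustration under stated assumptions: the first data is 1 and the second data is 0, a bitmap is a Python integer, the ordering is given as a hypothetical dictionary `order` mapping each transaction ID to its bit position, and the function names are invented for the example.

```python
from collections import defaultdict


def build_item_bitmaps(subset, order):
    """Build one bitmap per target item from a node's transaction subset.

    subset: dict mapping transaction ID -> set of items it contains.
    order:  dict mapping transaction ID -> bit position in the global
            ordering of the whole transaction set.
    Bits of transactions that contain the item are set to 1; all other
    bits stay 0.
    """
    bitmaps = defaultdict(int)
    for tx_id, items in subset.items():
        for item in items:
            bitmaps[item] |= 1 << order[tx_id]
    return dict(bitmaps)


def frequent_1_items(bitmaps, total, min_support):
    """Items whose share of 1-bits exceeds the preset minimum support."""
    return {
        item
        for item, bitmap in bitmaps.items()
        if bin(bitmap).count("1") / total > min_support
    }


order = {"t1": 0, "t2": 1, "t3": 2, "t4": 3}
first_subset = {"t1": {"A", "B"}, "t3": {"A"}, "t4": {"A", "C"}}
bitmaps = build_item_bitmaps(first_subset, order)
frequent = frequent_1_items(bitmaps, total=4, min_support=0.5)
```

In this example, item A appears in three of the four transactions, so it is the only item whose ratio exceeds 0.5.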
After determining that the target item is a frequent 1-itemset, the distributed child node can broadcast the target item and the bitmap data of the target item to the master node and the other distributed child nodes. It can also receive the statistical items, and the bitmap data of those statistical items, broadcast by the other distributed child nodes. Then, based on the bitmap data of the target item and the bitmap data of the statistical item, it can quickly determine whether a target itemset containing at least two items is a frequent itemset. If the target itemset is a frequent itemset, the association rules among the items in the target itemset can be determined from that frequent itemset, which increases the speed at which association rules are obtained.
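For a target itemset with more than two items, the same idea extends naturally by intersecting all of the relevant bitmaps. The sketch below assumes integer bitmaps collected in a hypothetical dictionary `bitmaps` (local target items plus received statistical items); `itemset_support` is an illustrative name, not one used in the source.

```python
from functools import reduce
from operator import and_


def itemset_support(itemset, bitmaps, total):
    """Support of an itemset: share of transactions containing every item.

    itemset: non-empty iterable of item names (at least two in practice).
    bitmaps: dict mapping item name -> integer bitmap.
    total:   total number of transactions in the transaction set.
    """
    combined = reduce(and_, (bitmaps[item] for item in itemset))
    return bin(combined).count("1") / total


bitmaps = {"A": 0b1101, "E": 0b0111, "G": 0b0101}
support_ae = itemset_support(("A", "E"), bitmaps, total=4)
support_aeg = itemset_support(("A", "E", "G"), bitmaps, total=4)
```

Comparing the returned support with the preset minimum support then decides whether the target itemset is frequent.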
Optionally, the first determining unit 403 may specifically be configured to:
For each target item, based on the transactions in the first transaction subset that include the target item, the transactions in the second transaction subsets that include the target item, and a preset mapping relation, set the value of the bits corresponding to transactions that include the target item to the first data and the value of the bits corresponding to transactions that do not include the target item to the second data, thereby obtaining the bitmap data of the target item, where the mapping relation is the correspondence, determined according to the ordering, between the bits of the bitmap data and the transactions in the transaction set.
Optionally, in the embodiment of the present invention, the device may further include:
A second receiving unit, configured to receive a statistics instruction for the target itemset sent by the master node, before whether the target itemset is a frequent itemset is calculated based on the received statistical item, the bitmap data of the statistical item, and the bitmap data of the target item that is a frequent 1-itemset.
Optionally, in the embodiment of the present invention, the device may further include:
A compression unit, configured to compress the bitmap data corresponding to each target item into compressed bitmap data after the bitmap data corresponding to each target item is determined according to the ordering;
The broadcast unit 405 may specifically be configured to:
Broadcast each target item that is a frequent 1-itemset, together with the compressed bitmap data of that target item, to the master node and the other distributed child nodes.
Optionally, in the embodiment of the present invention, the first data is 1 and the second data is 0.
Corresponding to the above method embodiment, the embodiment of the present invention further provides a distributed child node. Referring to Fig. 5, it includes a processor 501, a communication interface 502, a memory 503 and a communication bus 504, where the processor 501, the communication interface 502 and the memory 503 communicate with one another through the communication bus 504;
The memory 503 is configured to store a computer program;
The processor 501 is configured to implement, when executing the program stored in the memory 503, the method steps of the data analysis method based on a bitmap data structure provided by any of the above method embodiments.
In the embodiment of the present invention, a distributed child node in the distributed system can receive the first transaction subset distributed by the master node. It then obtains the total number of transactions in the transaction set and the ordering of the transactions in the transaction set. Next, it determines the items contained in the transactions of the first transaction subset and takes them as target items. Each target item is assigned as many bits as the total number of transactions, and, according to the obtained ordering, each bit is mapped to one transaction in the transaction set. Each bit corresponds to exactly one transaction, and no two bits correspond to the same transaction. A transaction that contains the target item is defined as an associated transaction of that target item; the bits corresponding to associated transactions are set to the first data, and the bits corresponding to non-associated transactions are set to the second data, which yields the bitmap data corresponding to the target item. The proportion of transactions in the transaction set that contain the target item can then be determined quickly as the ratio of the number of first data in the bitmap data to the total number. Whether the target item is a frequent 1-itemset can in turn be determined from this ratio, which greatly increases the speed at which frequent 1-itemsets are obtained.
After determining that the target item is a frequent 1-itemset, the distributed child node can broadcast the target item and the bitmap data of the target item to the master node and the other distributed child nodes. It can also receive the statistical items, and the bitmap data of those statistical items, broadcast by the other distributed child nodes. Then, based on the bitmap data of the target item and the bitmap data of the statistical item, it can quickly determine whether a target itemset containing at least two items is a frequent itemset. If the target itemset is a frequent itemset, the association rules among the items in the target itemset can be determined from that frequent itemset, which increases the speed at which association rules are obtained.
The communication bus mentioned for the above electronic device may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The communication bus may be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, the bus is drawn as a single thick line in the figure, but this does not mean that there is only one bus or only one type of bus.
The communication interface is used for communication between the above electronic device and other devices.
The memory may include random access memory (RAM), and may also include non-volatile memory (NVM), for example at least one disk storage device. Optionally, the memory may also be at least one storage device located remotely from the aforementioned processor.
The above processor may be a general-purpose processor, including a central processing unit (CPU), a network processor (NP), and the like; it may also be a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or a discrete hardware component.
It should be noted that, in this document, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another, and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise" and any variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article or device that includes a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article or device. Without further limitation, an element introduced by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article or device that includes the element.
The embodiments in this specification are described in a related manner; for identical or similar parts, the embodiments may be referred to one another, and each embodiment focuses on its differences from the others. In particular, because the device and distributed child node embodiments are substantially similar to the method embodiment, they are described relatively briefly; for relevant details, refer to the corresponding description of the method embodiment.
The foregoing describes only preferred embodiments of the present invention and is not intended to limit its scope of protection. Any modification, equivalent substitution, improvement and the like made within the spirit and principles of the present invention falls within the scope of protection of the present invention.

Claims (10)

1. A data analysis method based on a bitmap data structure, characterized in that it is applied to one distributed child node among the distributed child nodes included in a distributed system, the distributed system comprising a master node and distributed child nodes, the method comprising:
obtaining a first transaction subset distributed by the master node, wherein the first transaction subset is a subset of a transaction set;
obtaining the total number of transactions in the transaction set and the ordering of the transactions in the transaction set;
determining, according to the ordering, the bitmap data corresponding to each target item, wherein each bit of the bitmap data corresponding to a target item corresponds, according to the ordering, to one transaction in the transaction set, and the value of each bit indicates whether the transaction corresponding to that bit is an associated transaction of the target item; a target item is an item contained in a transaction of the first transaction subset; an associated transaction of a target item is a transaction in the transaction set that includes the target item;
counting, for each target item, the ratio of the number of first data in the corresponding bitmap data to the total number, and determining the frequent 1-itemsets among the target items according to the counted ratios, wherein the first data is the value of the bits corresponding to an item's associated transactions in the bitmap data;
broadcasting each target item that is a frequent 1-itemset, together with the bitmap data of that target item, to the master node and the other distributed child nodes;
receiving a statistical item and the bitmap data of the statistical item, wherein the statistical item is obtained by another distributed child node by counting over a second transaction subset distributed by the master node, and the union of the first transaction subset and all second transaction subsets is the transaction set;
calculating, based on the received statistical item, the bitmap data of the statistical item, and the bitmap data of the target item that is a frequent 1-itemset, whether a target itemset is a frequent itemset, wherein the target itemset contains at least two items;
if the target itemset is a frequent itemset, determining the association rules among the items in the target itemset.
2. The method according to claim 1, characterized in that the step of determining, according to the ordering, the bitmap data corresponding to each target item comprises:
for each target item, based on the transactions in the first transaction subset that include the target item, the transactions in the second transaction subsets that include the target item, and a preset mapping relation, setting the value of the bits corresponding to transactions that include the target item to the first data and the value of the bits corresponding to transactions that do not include the target item to second data, thereby obtaining the bitmap data of the target item, wherein the mapping relation is the correspondence, determined according to the ordering, between the bits of the bitmap data and the transactions in the transaction set.
3. The method according to claim 2, characterized in that the first data is 1 and the second data is 0.
4. The method according to claim 1, characterized in that, before the step of calculating whether the target itemset is a frequent itemset, the method further comprises:
receiving a statistics instruction for the target itemset sent by the master node.
5. The method according to claim 1, characterized in that, after the step of determining, according to the ordering, the bitmap data corresponding to each target item, the method further comprises:
compressing the bitmap data corresponding to each target item into compressed bitmap data;
and the step of broadcasting each target item that is a frequent 1-itemset, together with the bitmap data of that target item, to the master node and the other distributed child nodes comprises:
broadcasting each target item that is a frequent 1-itemset, together with the compressed bitmap data of that target item, to the master node and the other distributed child nodes.
6. A data analysis device based on a bitmap data structure, characterized in that it is applied to one distributed child node among the distributed child nodes included in a distributed system, the distributed system comprising a master node and distributed child nodes, the device comprising:
a first obtaining unit, configured to obtain a first transaction subset distributed by the master node, wherein the first transaction subset is a subset of a transaction set;
a second obtaining unit, configured to obtain the total number of transactions in the transaction set and the ordering of the transactions in the transaction set;
a first determining unit, configured to determine, according to the ordering, the bitmap data corresponding to each target item, wherein each bit of the bitmap data corresponding to a target item corresponds, according to the ordering, to one transaction in the transaction set, and the value of each bit indicates whether the transaction corresponding to that bit is an associated transaction of the target item; a target item is an item contained in a transaction of the first transaction subset; an associated transaction of a target item is a transaction in the transaction set that includes the target item;
a statistics unit, configured to count, for each target item, the ratio of the number of first data in the corresponding bitmap data to the total number, and to determine the frequent 1-itemsets among the target items according to the counted ratios, wherein the first data is the value of the bits corresponding to an item's associated transactions in the bitmap data;
a broadcast unit, configured to broadcast each target item that is a frequent 1-itemset, together with the bitmap data of that target item, to the master node and the other distributed child nodes;
a first receiving unit, configured to receive a statistical item and the bitmap data of the statistical item, wherein the statistical item is obtained by another distributed child node by counting over a second transaction subset distributed by the master node, and the union of the first transaction subset and all second transaction subsets is the transaction set;
a computing unit, configured to calculate, based on the received statistical item, the bitmap data of the statistical item, and the bitmap data of the target item that is a frequent 1-itemset, whether a target itemset is a frequent itemset, wherein the target itemset contains at least two items;
a second determining unit, configured to determine, when the target itemset is a frequent itemset, the association rules among the items in the target itemset.
7. The device according to claim 6, characterized in that the first determining unit is specifically configured to:
for each target item, based on the transactions in the first transaction subset that include the target item, the transactions in the second transaction subsets that include the target item, and a preset mapping relation, set the value of the bits corresponding to transactions that include the target item to the first data and the value of the bits corresponding to transactions that do not include the target item to second data, thereby obtaining the bitmap data of the target item, wherein the mapping relation is the correspondence, determined according to the ordering, between the bits of the bitmap data and the transactions in the transaction set.
8. The device according to claim 6, characterized in that the device further comprises:
a second receiving unit, configured to receive a statistics instruction for the target itemset sent by the master node, before whether the target itemset is a frequent itemset is calculated based on the received statistical item, the bitmap data of the statistical item, and the bitmap data of the target item that is a frequent 1-itemset.
9. The device according to claim 6, characterized in that the device further comprises:
a compression unit, configured to compress the bitmap data corresponding to each target item into compressed bitmap data after the bitmap data corresponding to each target item is determined according to the ordering;
the broadcast unit being specifically configured to:
broadcast each target item that is a frequent 1-itemset, together with the compressed bitmap data of that target item, to the master node and the other distributed child nodes.
10. A distributed child node, characterized in that it comprises a processor, a communication interface, a memory and a communication bus, wherein the processor, the communication interface and the memory communicate with one another through the communication bus;
the memory is configured to store a computer program;
the processor is configured to implement, when executing the program stored in the memory, the method steps of any one of claims 1-5.
CN201710872848.7A 2017-09-25 2017-09-25 Data analysis method and device based on bitmap data structure Active CN107622121B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710872848.7A CN107622121B (en) 2017-09-25 2017-09-25 Data analysis method and device based on bitmap data structure

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710872848.7A CN107622121B (en) 2017-09-25 2017-09-25 Data analysis method and device based on bitmap data structure

Publications (2)

Publication Number Publication Date
CN107622121A true CN107622121A (en) 2018-01-23
CN107622121B CN107622121B (en) 2020-06-23

Family

ID=61090110

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710872848.7A Active CN107622121B (en) 2017-09-25 2017-09-25 Data analysis method and device based on bitmap data structure

Country Status (1)

Country Link
CN (1) CN107622121B (en)


Also Published As

Publication number Publication date
CN107622121B (en) 2020-06-23

Similar Documents

Publication Publication Date Title
CN107622121A (en) A kind of data analysing method and device based on bitmap data structure
CN107102941A (en) Test case generation method and device
CN102724219B (en) A network data computer processing method and a system thereof
CN103580939B (en) Unexpected message detection method and equipment based on account attributes
CN105630955A (en) Method for efficiently managing members of dynamic data set
CN104850649B (en) Method and system for point-of-interest sampling on a map
CN103970747B (en) Data processing method for network side computer to order search results
CN103927398A (en) Microblog hype group discovering method based on maximum frequent item set mining
CN105468632B (en) Geocoding method and device
CN107948578A (en) Method and apparatus for adjusting the transmission bandwidth and resolution of a video conferencing system
CN110533116A (en) Adaptive ensemble classification method for unbalanced data based on Euclidean distance
CN109299334A (en) Data processing method and device for a knowledge graph
CN110390585A (en) Method and device for identifying abnormal objects
CN107038649A (en) Friend recommendation method and device for terminal users
CN106682206A (en) Method and system for big data processing
CN104573132B (en) Song lookup method and device
CN113204716A (en) Suspicious money laundering user transaction relation determining method and device
CN110334104A (en) List update method and device, electronic equipment, and storage medium
CN104598580A (en) Method and device for mining IP (Internet Protocol) geographic positioning data
CN115174580B (en) Data processing method and system based on big data
CN109428906A (en) Request processing method, device, system and terminal
CN113242332B (en) Improved method for forming street-level positioning library
CN107155214B (en) Number determination method and device
CN115018502A (en) Virtual currency public link network transaction node IP-based tracing method and system
CN106599289A (en) Method and device for aggregating cartoon information message in search result page

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant