CN103995869B - Data-caching method based on Apriori algorithm - Google Patents

Data-caching method based on Apriori algorithm

Info

Publication number
CN103995869B
Authority
CN
China
Prior art keywords
data
inquiry
frequent
conditional attribute
attribute
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201410214776.3A
Other languages
Chinese (zh)
Other versions
CN103995869A (en)
Inventor
张莉
郭昆
杨乐游
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Northeastern University China
Original Assignee
Northeastern University China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Northeastern University China
Priority to CN201410214776.3A
Publication of CN103995869A
Application granted
Publication of CN103995869B
Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/22 Indexing; Data structures therefor; Storage structures
    • G06F16/2228 Indexing structures
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/20 Information retrieval; Database structures therefor; File system structures therefor of structured data, e.g. relational data
    • G06F16/24 Querying
    • G06F16/245 Query processing
    • G06F16/2455 Query execution
    • G06F16/24552 Database cache management

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Software Systems (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a data caching method based on the Apriori algorithm. A query log is built on disk for the condition attributes of user queries; the query frequency of each data block is computed, and the data blocks with high query frequency form a frequent data block set. The query frequency of each condition attribute in the frequent data block set is then computed, and the condition attributes with high query frequency form a frequent condition attribute set. The Apriori algorithm is applied to this set, with query frequency mapped to the support of the Apriori algorithm, to obtain a frequent condition attribute group set. The data corresponding to the frequent condition attribute group set are cached in memory, and an index is built on the frequent condition attributes. The method markedly improves data query efficiency in frequently queried regions. Caching groups of several condition attributes gives higher query efficiency than caching single condition attributes, and it reduces the retrieval load on the database.

Description

A data caching method based on the Apriori algorithm
Technical field
The invention belongs to the technical field of data query, and in particular relates to a data caching method based on the Apriori algorithm.
Background art
In recent years, with the rapid development of the Internet and especially the rise of social networking applications such as Weibo and WeChat, data volume has grown explosively; in 2011 humanity formally entered the ZB era. We are now living in the era of big data. From its birth, however, big data has been characterized by low value density and wide variety, which means that querying massive data faces many problems. When the data scale is modest, traditional relational databases perform well, are highly stable, and have stood the test of time. Once the data volume reaches a certain scale, however, the efficiency of a relational database becomes unacceptably low. In short, relational databases cannot meet the demands of the big data era: highly concurrent reads and writes, efficient storage of and access to massive data, and high scalability and high availability.
These problems gave rise to a new technology, NoSQL. NoSQL means "Not Only SQL" and is a general term for non-relational data storage. It broke the long-standing dominance of relational databases and ACID theory. NoSQL data stores do not require a fixed table schema and usually have no join operations, which gives them a performance advantage in big data access that relational databases cannot match. However, most current mainstream NoSQL databases implement their data caching mechanism with the LIRS algorithm, and the LIRS algorithm cannot effectively gather statistics on data that is frequently queried over a long period, so it cannot adopt a targeted strategy for caching the data to be queried.
Summary of the invention
In view of the shortcomings of the prior art, the invention provides a data caching method based on the Apriori algorithm.
The technical solution of the invention is as follows:
A data caching method based on the Apriori algorithm comprises the following steps:
Step 1: Record on disk the condition attributes in users' query statements over T days, building T query logs, one per day; these constitute the user query content.
Step 2: Calculate the query frequency of each data block in the query logs and, according to the resulting query frequencies, obtain the data blocks with high query frequency to form a frequent data block set.
Step 2.1: Determine the number of queries on the data in each data block in the T query logs.
Step 2.2: Normalize the query counts of the data in each data block: set a recent-log ratio to distinguish recent log data from history log data. When the query count of history log data in a data block exceeds the upper threshold for history log data query counts, the query count takes the value of that threshold; when the query count of recent log data in a data block exceeds the upper threshold for recent log data query counts, the query count takes the value of that threshold.
Step 2.3: Weight the normalized query counts: for each data block, take the weighted sum of its normalized query counts over the T query logs and average it to obtain the query frequency of the data block.
Step 2.4: Select the data blocks with high query frequency, i.e. the frequent data blocks, according to the query frequency of each data block; the frequent data blocks form the frequent data block set.
Step 3: The condition attributes of the frequent data blocks form a condition attribute set.
Step 4: Calculate the query frequency of each condition attribute in the condition attribute set and, according to the resulting query frequencies, obtain the condition attributes with high query frequency to form a frequent condition attribute set.
Step 4.1: Determine the query count of each condition attribute of the frequent data blocks in the T query logs.
Step 4.2: Normalize the query count of each condition attribute: distinguish recent-log condition attributes from history-log condition attributes according to the recent-log ratio. When the query count of a history-log condition attribute exceeds the upper threshold for history-log condition attribute query counts, the query count takes the value of that threshold; when the query count of a recent-log condition attribute exceeds the upper threshold for recent-log condition attribute query counts, the query count takes the value of that threshold.
Step 4.3: Weight the normalized condition attribute query counts: for each condition attribute, take the weighted sum of its normalized query counts over the T days and average it to obtain the query frequency of the condition attribute.
Step 4.4: According to the query frequency of each condition attribute, select the condition attributes with high query frequency, i.e. the frequent condition attributes; the frequent condition attributes form the frequent condition attribute set.
Step 5: Obtain the frequent condition attribute group set from the frequent condition attribute set using the Apriori algorithm: the query frequency of a condition attribute is mapped to the support in the Apriori algorithm, and the frequent itemsets produced by the Apriori algorithm are the frequent condition attribute group set.
Step 6: Cache the data corresponding to the frequent condition attribute group set in memory, and build an index on the frequent condition attributes in the frequent condition attribute set, completing the data caching.
Step 7: When a client needs to query data, perform the query according to the condition attributes of the data to be queried. If all the condition attributes of the query are frequent condition attributes cached in memory, the query result is obtained directly. If only some of the condition attributes of the query are frequent condition attributes cached in memory, the index on those frequent condition attributes is used to look up, in the on-disk database, the data satisfying that part of the condition attributes, completing the query. If none of the condition attributes of the query are in the frequent condition attribute set cached in memory, the data blocks are loaded from disk to perform the query.
The beneficial effects of the invention are as follows. A new data caching method is proposed which, in combination with a NoSQL database, opens a frequent-data column buffer in the memory of the back-end nodes. The method markedly improves data query efficiency in frequently queried regions, and because data in other regions is not processed in any way, queries on those regions are not affected. Caching groups of several condition attributes gives higher query efficiency than caching single condition attributes. A cache whose condition attribute groups contain a moderate number of condition attributes sacrifices part of the full-hit rate, but it prunes intermediate records more effectively: it reduces the intermediate result set produced in memory by partial condition attribute hits and locates data quickly through the index on the frequent condition attributes, thereby relieving the retrieval load on the database and achieving higher query efficiency.
Brief description of the drawings
Fig. 1 shows how HBase, the running environment, partitions a data table in the specific embodiment of the invention;
Fig. 2 shows the improved data query process in the specific embodiment of the invention;
Fig. 3 is a flow chart of the data caching method based on the Apriori algorithm in the specific embodiment of the invention;
Fig. 4 is a flow chart of the handling of the different hit cases of query condition attributes in the specific embodiment of the invention;
Fig. 5 compares the query efficiency of the different caching schemes in the specific embodiment of the invention;
Fig. 6 compares the condition attribute hit cases of the different caching schemes in the specific embodiment of the invention.
Specific embodiment
A specific embodiment of the invention is described in detail below with reference to the accompanying drawings.
In this embodiment, under a Hadoop-HBase environment, Sina Weibo user data is used to simulate the queried data and the user query behaviour. T is set to 7, and the simulation data is divided into 7 equal parts to simulate the query logs of different days.
HBase is a column-oriented NoSQL database that runs on the HDFS file system as part of the Hadoop project. For data reads, HBase uses column-oriented storage which, compared with row-oriented storage, reduces the reading of redundant data, improves read efficiency, and makes data retrieval faster and more effective. For storage, HBase splits a large data table into several data regions, i.e. data blocks; each region stores a certain number of records of the table in order, and the complete table can be obtained by merging the relevant regions. The HBase table splitting process is shown in Fig. 1.
A Region in HBase corresponds to the data block concept. With the data caching method of this embodiment, frequently queried data regions, i.e. frequent data blocks, are filtered out according to the query activity, and the frequent condition attribute data of the several regions with the highest query frequency are cached in a memory buffer. When data in a data region is accessed, different query operations are performed depending on how the query condition attributes hit the memory cache. The data query process under the HBase environment is shown in Fig. 2: the client sends a query request to the region server, and the region server either returns the query result or queries further, according to the request. If all the condition attributes of the query are frequent condition attributes cached in memory, the query result is obtained directly. If only some of the condition attributes of the query are frequent condition attributes cached in memory, the index on those condition attributes is used to look up, in the on-disk database, the data satisfying that part of the condition attributes, completing the query. If none of the condition attributes of the query are in the frequent condition attribute set cached in memory, the data blocks are loaded from disk to perform the query. The storage nodes in the Hadoop layer are responsible for loading data from disk and executing the query.
The data caching method based on the Apriori algorithm of this embodiment is shown in Fig. 3 and comprises the following steps:
Step 1: Record on disk, day by day, the condition attributes in users' query statements over T days, building T query logs; these constitute the user query content.
In this embodiment the condition attributes are items of personal information of Sina Weibo users: age, sex, location, registration date and online time. In the implementation, 7 query logs are created, representing the users' query records for the most recent 7 days.
Step 2: Calculate the query frequency of each data block (block) in the query logs and, according to the resulting query frequencies, obtain the data blocks with high query frequency, i.e. the frequent data block set block_fd. Assume there are 3 data blocks, block_1, block_2 and block_3; the query frequency of block_1 is calculated as follows:
Step 2.1: Determine the number of queries on the data of each data block in the 7 query logs.
The query counts Count(t) of block_1 over the 7 days are obtained from the query logs. According to the statistics, the query counts Count(t) for block_1 at t = 0, 1, 2, 3, 4, 5, 6 are 1350, 1433, 1236, 1546, 1354, 1029 and 1175 respectively.
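As a rough illustration of Step 2.1, the per-day counts Count(t) could be tallied from log records as sketched below. The record layout (a day index paired with a block name), the function name and the toy entries are all hypothetical; the patent does not specify a log format.

```python
from collections import Counter

def count_queries_per_day(log_records, block, num_days):
    """Return [Count(0), ..., Count(num_days - 1)] for one data block.

    log_records is an iterable of (day_index, block_name) pairs; this
    layout is an assumption made for illustration only.
    """
    per_day = Counter(day for day, name in log_records if name == block)
    return [per_day.get(t, 0) for t in range(num_days)]

# Hypothetical toy log covering three days and two blocks.
toy_log = [(0, "block1"), (0, "block2"), (1, "block1"), (1, "block1"), (2, "block2")]
print(count_queries_per_day(toy_log, "block1", num_days=3))  # [1, 2, 0]
```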
Step 2.2: Normalize the query counts of the data in each data block: set a recent-log ratio to distinguish recent log data from history log data. When the query count of history log data in a data block exceeds the upper threshold for history log data query counts, the query count takes the value of that threshold; when the query count of recent log data in a data block exceeds the upper threshold for recent log data query counts, the query count takes the value of that threshold.
Normalize the query counts Count(t) of the data in each data block as follows. A recent-log ratio q_rec is set to distinguish recent log data from history log data; q_rec is chosen according to the user's actual needs, with 0 < q_rec < 1, and this embodiment takes q_rec = 0.3. For t < (1 − q_rec) × T, i.e. the first 5 days, the query log data is history log data; for t ≥ (1 − q_rec) × T, i.e. the most recent 2 days, it is recent log data. For the history log data in a data block, an upper threshold Max_his on the query count is set; in general Max_his should be 1.5 times the average query count of all records, and here Max_his = 1400. When the query count of the data block exceeds this threshold, the query count of the data block takes the value of the threshold. For the recent log data, an upper threshold Max_rec on the query count is set; Max_rec should be 2 times the average query count of all records, and here Max_rec = 1700. When the query count of the data block exceeds this threshold, the query count of the data block takes the value of the threshold. The query counts Count(t) are normalized according to normalization formula (1):
$$\mathrm{Count}_{std}(t)=\begin{cases}\min\bigl(\mathrm{Count}(t),\,Max_{his}\bigr), & t<(1-q_{rec})\,T\\ \min\bigl(\mathrm{Count}(t),\,Max_{rec}\bigr), & t\ge(1-q_{rec})\,T\end{cases}\qquad(1)$$
In Step 2.1 the query counts at t = 1 and t = 3 exceed the upper threshold for history log data, so Count(1) = Count(3) = Max_his = 1400, and Count_std1(t) is 1350, 1400, 1236, 1400, 1354, 1029, 1175.
Normalizing the query counts avoids, to some extent, an inflated query frequency caused by an unusually high query count on individual days.
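The capping rule of Step 2.2 can be sketched in a few lines. This is a minimal illustration, not the patented implementation: the function name is hypothetical, and the split point t < (1 − q_rec) × T is an assumption chosen because it reproduces the division into 5 history days and 2 recent days stated in the text.

```python
def normalize_counts(counts, q_rec, max_his, max_rec):
    """Cap each day's query count at the history or recent threshold."""
    total_days = len(counts)
    # Assumed split: days with t < (1 - q_rec) * T are history log data.
    split = (1 - q_rec) * total_days
    return [min(c, max_his if t < split else max_rec)
            for t, c in enumerate(counts)]

block1_counts = [1350, 1433, 1236, 1546, 1354, 1029, 1175]
count_std1 = normalize_counts(block1_counts, q_rec=0.3, max_his=1400, max_rec=1700)
print(count_std1)  # [1350, 1400, 1236, 1400, 1354, 1029, 1175]
```

Only the counts at t = 1 and t = 3 change, which matches the normalized series given for block 1.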
Step 2.3: Weight the normalized data block query counts Count_std1(t): for each data block, take the weighted sum of its normalized query counts over the 7 query logs and average it to obtain the query frequency FD_block of the data block:
$$FD_{block}=\frac{1}{T}\sum_{t=0}^{T-1}W(t)\,\mathrm{Count}_{std1}(t)$$
where Count_std1(t) is the normalized query count of the data in the data block and W(t) is the weighting function, which is an increasing function.
This embodiment uses the first-quadrant part of a monotonically increasing linear function as the weighting function, i.e. W(t) = t + 1, where 0 ≤ t ≤ 6.
Calculate the query frequency of block_1:
$$FD_{block_1}=\frac{1}{7}\sum_{t=0}^{6}(t+1)\,\mathrm{Count}_{std1}(t)=\frac{34627}{7}\approx 4946.71$$
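The weighted average of Step 2.3 can be checked numerically with the sketch below. The function name is hypothetical, and the formula (weighted sum divided by the number of days T) is inferred from the wording "weighted sum, then average"; the same form reproduces the value 487.8571 that the text reports for the age attribute.

```python
def query_frequency(normalized_counts, weight=lambda t: t + 1):
    """Average of W(t) * Count_std(t) over all T days, with W(t) = t + 1."""
    total_days = len(normalized_counts)
    weighted_sum = sum(weight(t) * c for t, c in enumerate(normalized_counts))
    return weighted_sum / total_days

# Normalized query counts of block 1 from Step 2.2.
count_std1 = [1350, 1400, 1236, 1400, 1354, 1029, 1175]
print(round(query_frequency(count_std1), 4))  # 4946.7143
```

Because W(t) grows with t, a query made on the most recent day counts seven times as much as one made on the oldest day.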
Step 2.4: Select the data blocks with high query frequency, i.e. the frequent data blocks, according to the query frequency of each data block; the frequent data blocks form the frequent data block set. Here the query frequency of block_2 is 5973.13648 and that of block_3 is 5294.65. The data blocks are sorted by query frequency and their sizes are summed in that order; the high-frequency data blocks whose cumulative size stays within 1 GB are taken, and block_2 is a frequent data block.
Step 3: The condition attributes of the frequent data blocks form the condition attribute set: in this embodiment the condition attributes user age, sex, location, registration date and online time make up the condition attribute set.
Step 4: Calculate the query frequency of each condition attribute in the condition attribute set and, according to the resulting query frequencies, obtain the condition attributes with high query frequency to form the frequent condition attribute set. Taking the age condition attribute as an example, the query frequency of a condition attribute is calculated as follows:
Step 4.1: Determine the query count of each condition attribute of the frequent data blocks in the 7 query logs. In this embodiment, the query counts of queries involving the age condition attribute at t = 0, 1, 2, 3, 4, 5, 6 are 130, 135, 125, 160, 110, 115 and 120 respectively.
Step 4.2: Normalize the query count of each condition attribute: distinguish recent-log condition attributes from history-log condition attributes according to the recent-log ratio. When the query count of a history-log condition attribute exceeds the upper threshold for history-log condition attribute query counts, the query count takes the value of that threshold; when the query count of a recent-log condition attribute exceeds the upper threshold for recent-log condition attribute query counts, the query count takes the value of that threshold.
Normalize the condition attribute query counts as follows. With the recent-log ratio q_rec = 0.3, the condition attributes in the query logs of the first 5 days are history-log condition attributes, and those in the query logs of the most recent 2 days are recent-log condition attributes. For history-log condition attributes, an upper threshold Max_his on the query count is set, here Max_his = 140; when the query count of a condition attribute exceeds this threshold, the query count takes the value of the threshold. For recent-log condition attributes, an upper threshold Max_rec on the query count is set, here Max_rec = 150; when the query count of a condition attribute exceeds this threshold, the query count takes the value of the threshold. The condition attribute query counts Count(t) are normalized according to normalization formula (1). In Step 4.1, Count(3) = 160 at t = 3 exceeds the upper threshold for history-log query counts, so Count(3) = Max_his = 140. Count_std2(t) is therefore 130, 135, 125, 140, 110, 115, 120.
Step 4.3: Weight the normalized condition attribute query counts: for each condition attribute, take the weighted sum of its normalized query counts over the 7 days and average it to obtain the query frequency FD_sa of the condition attribute:
$$FD_{sa}=\frac{1}{T}\sum_{t=0}^{T-1}W(t)\,\mathrm{Count}_{std2}(t)$$
As before, the first-quadrant part of a monotonically increasing linear function is used as the weighting function, i.e. W(t) = t + 1, where 0 ≤ t ≤ 6. Calculate the query frequency of the age condition attribute:
$$FD_{age}=\frac{1}{7}\sum_{t=0}^{6}(t+1)\,\mathrm{Count}_{std2}(t)=\frac{3415}{7}\approx 487.8571$$
Step 4.4: According to the query frequency of each condition attribute, select the condition attributes with high query frequency, i.e. the frequent condition attributes; the frequent condition attributes form the frequent condition attribute set. Here the query frequency of age is 487.8571, that of sex is 539.2857143, that of location is 632.1428571, that of registration date is 217.1429, and that of online time is 103.4923.
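Steps 4.1 to 4.3 can be run end to end for the age attribute to confirm the reported value. The sketch below is illustrative only: the function name is hypothetical, and the history/recent split t < (1 − q_rec) × T is an assumption that yields the 5-day/2-day division described in the text.

```python
def attribute_frequency(counts, q_rec, max_his, max_rec):
    """Cap the daily counts, then average them with weights W(t) = t + 1."""
    total_days = len(counts)
    split = (1 - q_rec) * total_days  # assumed history/recent boundary
    capped = [min(c, max_his if t < split else max_rec)
              for t, c in enumerate(counts)]
    return sum((t + 1) * c for t, c in enumerate(capped)) / total_days

age_counts = [130, 135, 125, 160, 110, 115, 120]
fd_age = attribute_frequency(age_counts, q_rec=0.3, max_his=140, max_rec=150)
print(round(fd_age, 4))  # 487.8571
```

The result agrees with the query frequency of 487.8571 listed for age in Step 4.4.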
Step 5: Obtain the frequent condition attribute group set from the frequent condition attribute set using the Apriori algorithm: the query frequency of a condition attribute is mapped to the support in the Apriori algorithm, and the frequent itemsets produced by the Apriori algorithm are the frequent condition attribute group set.
Step 5.1: Let A_1 = ∅ and let k be the current maximum length of a frequent condition attribute group; for k = 1, A_1 denotes the set of frequent condition attribute groups of length 1.
Step 5.2: Count the query frequency of each condition attribute in the frequent condition attribute set. The frequencies of the condition attributes age, sex, location, registration date and online time are 487.8571, 539.2857, 632.1428, 217.1429 and 103.4923 respectively. Set the minimum frequency threshold min_fd = 175 and put every condition attribute whose frequency is greater than or equal to the threshold, namely age, sex, location and registration date, into A_1, obtaining the set A_1 of frequent condition attribute groups of length 1.
Step 5.3: Sort the elements of A_1 lexicographically by condition attribute name and join them to obtain the candidate set C_2 of frequent condition attribute groups of length 2, where C_2 contains location-registration date, location-age, location-sex, age-sex, age-registration date and sex-registration date.
Step 5.4: Let A_2 = ∅. Look up each condition attribute group in C_2, searching all the frequent condition attribute sets, and count the query frequency of each condition attribute group in C_2. The frequencies of the attribute groups location-registration date, location-age, location-sex, age-sex, age-registration date and sex-registration date are 202.14285, 339.2857, 401.4285, 321.4285, 98.4957 and 135.671 respectively. The condition attribute groups whose frequency is greater than or equal to the minimum frequency threshold, namely location-registration date, location-age, location-sex and age-sex, are put into the set A_2 of frequent condition attribute groups of length 2.
Step 5.5: Sort the elements of A_2 lexicographically by condition attribute name and join them to obtain the candidate set C_3 of frequent condition attribute groups of length 3, where C_3 contains the condition attribute group location-sex-age.
Step 5.6: Let A_3 = ∅. Look up each condition attribute group in C_3, searching all the frequent condition attribute sets, and count the query frequency of each condition attribute group in C_3. The frequency of the condition attribute group location-sex-age is 183.5714286; since this is greater than or equal to the minimum frequency threshold, location-sex-age is put into the set A_3 of frequent condition attribute groups of length 3.
Step 5.7: Sort the elements of A_3 lexicographically by condition attribute name and join them to obtain the candidate set C_4 of frequent condition attribute groups of length 4, where C_4 = ∅.
Step 5.8: Obtain the set A of frequent query condition attribute groups of every length in the query logs, where A = ∪_k A_k = A_1 ∪ A_2 ∪ … ∪ A_k.
The frequent condition attributes of length 1 are age, sex, location and registration date, with frequencies 487.8571, 539.2857, 632.1428 and 217.1429 respectively.
The frequent condition attribute groups of length 2 are location-registration date, location-age, location-sex and age-sex, with frequencies 202.14285, 339.2857, 401.4285 and 321.4285 respectively.
The frequent condition attribute group of length 3 is location-sex-age, with frequency 183.5714286.
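The level-wise search of Steps 5.1 to 5.8 can be sketched as follows, using the frequencies quoted in the text as a lookup table in place of scanning the query logs. Several things here are assumptions for illustration: the English attribute names, the function names, treating any group missing from the table as frequency 0, and generating candidates by enumerating combinations whose every shorter subgroup is frequent, which gives the same candidates as the sorted join followed by pruning.

```python
from itertools import combinations

MIN_FD = 175  # minimum frequency threshold min_fd from Step 5.2

# Query frequencies quoted in Steps 5.2 to 5.6, keyed by attribute group.
FREQUENCY = {
    frozenset({"age"}): 487.8571,
    frozenset({"sex"}): 539.2857,
    frozenset({"location"}): 632.1428,
    frozenset({"registration_date"}): 217.1429,
    frozenset({"online_time"}): 103.4923,
    frozenset({"location", "registration_date"}): 202.14285,
    frozenset({"location", "age"}): 339.2857,
    frozenset({"location", "sex"}): 401.4285,
    frozenset({"age", "sex"}): 321.4285,
    frozenset({"age", "registration_date"}): 98.4957,
    frozenset({"sex", "registration_date"}): 135.671,
    frozenset({"location", "sex", "age"}): 183.5714286,
}

def candidates(prev_frequent, k):
    """Length-k groups whose every (k-1)-subgroup is in prev_frequent."""
    items = sorted(set().union(*prev_frequent))
    return [
        frozenset(group)
        for group in combinations(items, k)
        if all(frozenset(sub) in prev_frequent
               for sub in combinations(group, k - 1))
    ]

def frequent_groups(frequency, min_fd):
    """Return {k: A_k}, the frequent attribute groups of each length k."""
    level = {g for g in frequency if len(g) == 1 and frequency[g] >= min_fd}
    result, k = {}, 1
    while level:
        result[k] = level
        k += 1
        # Groups absent from the table are treated as frequency 0 (assumption).
        level = {g for g in candidates(level, k)
                 if frequency.get(g, 0.0) >= min_fd}
    return result

groups = frequent_groups(FREQUENCY, MIN_FD)
for k, level in groups.items():
    print(k, sorted("-".join(sorted(g)) for g in level))
```

With these inputs the search stops after length 3, because no group of length 4 can be formed from the single frequent group of length 3.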
Step 6: Cache the data corresponding to the frequent condition attribute group set in memory, and build an index on the frequent condition attributes in the frequent condition attribute set, completing the data caching.
Only 3 columns of condition attribute data are cached in memory, and 3 caching schemes are used. The first scheme caches the age, sex and location data. The second scheme caches the location-age and location-sex data; because the location condition attribute is shared by both groups, it takes up no additional memory. The third scheme caches the location-sex-age data.
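One possible shape for the memory cache and index of Step 6 is sketched below: keep only the cached columns of each row and build a value-to-row-keys index per cached attribute. The row layout, the function name and the sample rows are hypothetical; the patent does not describe the index structure.

```python
from collections import defaultdict

def build_cache(rows, cached_attrs):
    """Keep only the cached columns and index each column by value.

    rows maps a row key to a dict of attribute values; this layout is an
    assumption made for illustration.
    """
    columns = {key: {a: row[a] for a in cached_attrs}
               for key, row in rows.items()}
    index = {a: defaultdict(set) for a in cached_attrs}
    for key, row in columns.items():
        for attr, value in row.items():
            index[attr][value].add(key)
    return columns, index

# Hypothetical sample rows.
rows = {
    "u1": {"age": 25, "sex": "F", "location": "Shenyang", "online_time": 40},
    "u2": {"age": 31, "sex": "M", "location": "Beijing", "online_time": 12},
    "u3": {"age": 25, "sex": "M", "location": "Shenyang", "online_time": 7},
}
columns, index = build_cache(rows, cached_attrs=("age", "sex", "location"))
matches = index["location"]["Shenyang"] & index["age"][25]
print(sorted(matches))  # ['u1', 'u3']
```

A query on two cached attributes is then answered by intersecting two sets of row keys, without touching the uncached online_time column.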
Step 7: When a client needs to query data, perform the query according to the condition attributes of the data to be queried. If all the condition attributes of the query are frequent condition attributes cached in memory, the query result is obtained directly. If only some of the condition attributes of the query are frequent condition attributes cached in memory, the index on those frequent condition attributes is used to look up, in the on-disk database, the data satisfying that part of the condition attributes, completing the query. If none of the condition attributes of the query are in the frequent condition attribute set cached in memory, i.e. a miss, the data blocks are loaded from disk to perform the query.
In actual data queries there are 3 possible hit cases, as shown in Fig. 4.
When the user queries on the date-of-birth condition attribute, which is not cached in memory, none of the query's condition attributes are in the memory cache, so the data blocks are loaded from disk to perform the query.
When the user queries on the location-date of birth condition attribute group, only part of the query's condition attributes are in the memory cache, so the index on the location condition attribute is used to look up, in the on-disk database, the data satisfying the location condition attribute, completing the query.
When the user queries on the location condition attribute, all of the query's condition attributes are in the memory cache, so the relevant data is retrieved directly in memory and the result is returned.
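The three hit cases can be told apart with a simple set comparison, sketched below. The function name, the returned labels and the English attribute names are hypothetical; the cached set corresponds to the first caching scheme (age, sex, location).

```python
def classify_hit(query_attrs, cached_attrs):
    """Classify a query as a full hit, partial hit or miss on the cache."""
    query, cached = set(query_attrs), set(cached_attrs)
    hit = query & cached
    if hit == query:
        return "full"      # answer directly from memory
    if hit:
        return "partial"   # use the index on the cached attributes, then disk
    return "miss"          # load the data blocks from disk

cached = {"age", "sex", "location"}
print(classify_hit({"location"}, cached))                   # full
print(classify_hit({"location", "date_of_birth"}, cached))  # partial
print(classify_hit({"date_of_birth"}, cached))              # miss
```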
Under different cache way, average lookup efficiency comparative is as shown in Figure 5.Before application this method, a normal SQL The query time of Select sentence is averagely about 1500 milliseconds.
The data-caching method of this embodiment can significantly improve data query efficiency in frequent regions; queries in other regions receive no special processing, so they are not affected. Caching two- and three-attribute groups yields higher query efficiency than caching single conditional attributes. In actual queries, conditions on a single conditional attribute occur relatively rarely, so the full hit rate of a single-attribute cache is unsatisfactory. Moreover, compared with a multi-attribute query, a single-attribute cache cannot filter out irrelevant records well: the filtered record set remains large, which imposes a large time overhead on the subsequent index lookup in the database.
Figure 6 compares the query hit rates. Although the full hit rate of the two-attribute group cache differs considerably from that of the three-attribute group cache, its partial hit rate reaches 63.93%. A cache whose attribute groups contain a moderate number of conditional attributes sacrifices some full hit rate, but it does a better job of reducing the intermediate records: it shrinks the intermediate result sets produced in memory by partial hits and locates data quickly through the indexes on frequent conditional attributes, which relieves the retrieval load on the database and yields higher query efficiency. This is why the average query efficiency of the two-attribute group cache is slightly higher than that of the three-attribute group cache.
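The two- and three-attribute groups compared above are the frequent itemsets that Step 5 obtains with the Apriori algorithm. The following Python sketch shows one way such groups could be mined. It is a textbook Apriori under stated assumptions, not the patented procedure: each query's set of conditional attributes is treated as one transaction, support is the plain fraction of queries containing a group (the patent instead maps its weighted query frequency to support), and the names `apriori`, `min_support`, and `max_size` are hypothetical.

```python
from itertools import combinations


def apriori(transactions, min_support, max_size=3):
    """Return {frozenset(group): support} for frequent groups up to max_size.

    transactions: iterable of attribute-name collections, one per query.
    Support here is the fraction of transactions that contain the group
    (an assumption; the patent derives support from query frequency).
    """
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(group):
        return sum(group <= t for t in transactions) / n

    # Level 1: frequent single attributes.
    level = {}
    for attr in {a for t in transactions for a in t}:
        s = support(frozenset([attr]))
        if s >= min_support:
            level[frozenset([attr])] = s
    frequent = dict(level)

    k = 2
    while level and k <= max_size:
        # Join step: merge frequent (k-1)-groups into k-group candidates.
        candidates = {a | b for a, b in combinations(list(level), 2)
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        candidates = {c for c in candidates
                      if all(frozenset(s) in level
                             for s in combinations(c, k - 1))}
        level = {}
        for c in candidates:
            s = support(c)
            if s >= min_support:
                level[c] = s
        frequent.update(level)
        k += 1
    return frequent


# Hypothetical query log: the conditional attributes used by five queries.
queries = [
    {"region", "age"},
    {"region", "age", "gender"},
    {"region", "gender"},
    {"region", "age"},
    {"date_of_birth"},
]
for group, s in sorted(apriori(queries, min_support=0.5).items(),
                       key=lambda kv: (len(kv[0]), sorted(kv[0]))):
    print(sorted(group), s)
```

The `max_size` parameter corresponds to the choice between caching two-attribute and three-attribute groups discussed above.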

Claims (3)

1. A data-caching method based on the Apriori algorithm, characterized by comprising the following steps:
Step 1: Record on disk the conditional attributes of the user query statements over T days, and build T query logs, one per day, i.e. the user query content;
Step 2: Calculate the query frequency of each data block in the query logs, and select the data blocks with high query frequency according to the size of the query frequency obtained for each data block, to form the frequent data block set;
Step 3: Form the conditional attribute set from the conditional attributes of the frequent data blocks;
Step 4: Calculate the query frequency of each conditional attribute in the conditional attribute set, and select the conditional attributes with high query frequency according to the size of the query frequency obtained for each conditional attribute, to form the frequent conditional attribute set;
Step 5: Obtain the set of frequent conditional attribute groups from the frequent conditional attribute set using the Apriori algorithm, wherein the query frequency of a conditional attribute is mapped to the support in the Apriori algorithm, and the frequent itemsets obtained by the Apriori algorithm are the set of frequent conditional attribute groups;
Step 6: Cache the data corresponding to the set of frequent conditional attribute groups in memory, and build indexes on the frequent conditional attributes in the frequent conditional attribute set, which completes the data caching;
Step 7: When a client needs to query data, execute the query according to the conditional attributes of the data to be queried: if all conditional attributes of the query are frequent conditional attributes cached in memory, obtain the query result directly; if only some of the conditional attributes of the query are frequent conditional attributes cached in memory, use the indexes of those frequent conditional attributes to retrieve from the disk database the data that satisfies that part of the conditions, which completes the query; if none of the conditional attributes of the query is in the set of frequent conditional attributes cached in memory, load data blocks from disk and execute the query on them.
2. The data-caching method based on the Apriori algorithm according to claim 1, characterized in that step 2 is carried out as follows:
Step 2.1: Determine the query count of the data in each data block in the T query logs;
Step 2.2: Normalize the query counts of the data in each data block: set a recent-log ratio to distinguish recent log data from historical log data; when the query count of historical log data in a data block exceeds the upper threshold for historical log data query counts, the query count of that historical log data takes the value of that upper threshold; when the query count of recent log data in a data block exceeds the upper threshold for recent log data query counts, the query count of that recent log data takes the value of that upper threshold;
Step 2.3: Weight the normalized query counts of the data in each data block: for each data block, take the weighted sum of its normalized query counts over the T query logs and average it, which gives the query frequency of each data block;
Step 2.4: Select the data blocks with high query frequency according to the size of the query frequency of each data block, i.e. the frequent data blocks; the frequent data blocks form the frequent data block set.
3. The data-caching method based on the Apriori algorithm according to claim 1, characterized in that step 4 is carried out as follows:
Step 4.1: Determine the query count of each conditional attribute of the frequent data blocks in the T query logs;
Step 4.2: Normalize the query count of each conditional attribute: distinguish recent-log conditional attributes from historical-log conditional attributes according to the recent-log ratio; when the query count of a historical-log conditional attribute exceeds the upper threshold for historical-log conditional attribute query counts, the query count of that historical-log conditional attribute takes the value of that upper threshold; when the query count of a recent-log conditional attribute exceeds the upper threshold for recent-log conditional attribute query counts, the query count of that recent-log conditional attribute takes the value of that upper threshold;
Step 4.3: Weight the normalized query counts of the conditional attributes: for each conditional attribute, take the weighted sum of its normalized query counts over the T days and average it, which gives the query frequency of each conditional attribute;
Step 4.4: According to the query frequency obtained for each conditional attribute, select the conditional attributes with high query frequency, i.e. the frequent conditional attributes; the frequent conditional attributes form the frequent conditional attribute set.
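The normalization and weighting of steps 2.1 to 2.4 and 4.1 to 4.4 can be sketched in Python as follows. The claims do not state the weights, the thresholds, or how many items are selected, so every parameter value below is an assumption chosen for illustration, and the function names `query_frequency` and `select_frequent` are hypothetical. The same function is used for a data block or a conditional attribute, since the two claims follow the same procedure.

```python
import math


def query_frequency(daily_counts, recent_ratio=0.3,
                    recent_cap=100, history_cap=50,
                    recent_weight=2.0, history_weight=1.0):
    """Query frequency of one data block (or one conditional attribute).

    daily_counts: query counts for T days, ordered oldest to newest.
    The last ceil(T * recent_ratio) days count as recent logs; the rest
    are historical logs. All default values are illustrative assumptions.
    """
    t = len(daily_counts)
    if t == 0:
        return 0.0
    first_recent = t - math.ceil(t * recent_ratio)
    total = 0.0
    for day, count in enumerate(daily_counts):
        if day >= first_recent:
            # Recent log: cap at the recent threshold, then weight.
            total += recent_weight * min(count, recent_cap)
        else:
            # Historical log: cap at the historical threshold, then weight.
            total += history_weight * min(count, history_cap)
    return total / t  # average of the weighted sum over T days


def select_frequent(counts_by_key, top_n, **params):
    """Return the top_n keys (blocks or attributes) by query frequency."""
    scores = {key: query_frequency(counts, **params)
              for key, counts in counts_by_key.items()}
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical four-day logs for two data blocks.
logs = {"block_a": [10, 200, 5, 300], "block_b": [1, 1, 1, 1]}
print(query_frequency(logs["block_a"], recent_ratio=0.5))
print(select_frequent(logs, top_n=1, recent_ratio=0.5))
```

Capping the counts before weighting keeps a single burst of queries on one day from dominating the frequency, which appears to be the purpose of the upper thresholds in steps 2.2 and 4.2.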
CN201410214776.3A 2014-05-20 2014-05-20 Data-caching method based on Apriori algorithm Expired - Fee Related CN103995869B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201410214776.3A CN103995869B (en) 2014-05-20 2014-05-20 Data-caching method based on Apriori algorithm

Publications (2)

Publication Number Publication Date
CN103995869A CN103995869A (en) 2014-08-20
CN103995869B true CN103995869B (en) 2017-02-22

Family

ID=51310034

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201410214776.3A Expired - Fee Related CN103995869B (en) 2014-05-20 2014-05-20 Data-caching method based on Apriori algorithm

Country Status (1)

Country Link
CN (1) CN103995869B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107330041A (en) * 2017-06-27 2017-11-07 达而观信息科技(上海)有限公司 A kind of relevant search word method for digging decayed based on the time and system
CN107577506B (en) * 2017-08-07 2021-03-19 台州市吉吉知识产权运营有限公司 Data preloading method and system
CN109976905B (en) * 2019-03-01 2021-10-22 联想(北京)有限公司 Memory management method and device and electronic equipment

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101533406A (en) * 2009-04-10 2009-09-16 北京锐安科技有限公司 Mass data querying method
CN102081625A (en) * 2009-11-30 2011-06-01 中国移动通信集团北京有限公司 Data query method and query server
EP2622514A2 (en) * 2010-09-28 2013-08-07 Alibaba Group Holding Limited Method and apparatus of ordering search results

Non-Patent Citations (3)

* Cited by examiner, † Cited by third party
Title
"Enhancing the performance of web proxy server through cluster based prefetching techniques";Nanhay Singh 等;《International Conference on Advances in Computing, 2013》;20130825;1158-1165 *
"Web数据缓存与预取一体化的研究与应用";曹英斌;《中国优秀硕士学位论文全文数据库 信息科技辑》;20130415(第4期);I139-172 *
"基于Apriori改进算法的Web日志挖掘系统的研究与实现";郑玮;《中国优秀硕士学位论文全文数据库 信息科技辑》;20111215(第S2期);I138-963 *

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20170222

Termination date: 20210520