CN104750682A  Buffering capacity allocation method for massive logs  Google Patents
 Publication number
 CN104750682A
 Authority
 CN
 China
Abstract
Description
Technical field
The present invention relates to the field of log management and, more particularly, to a buffering capacity allocation method for massive logs.
Background art
IDCs (Internet Data Centers), DNS (Domain Name System) services and similar systems generate massive volumes of logs, which must be imported quickly in real time (1 to 100,000 entries per second) and searched in near real time. Partitioning is generally used to meet these import and search goals. It splits a very large table into multiple smaller tables according to some rule and stores them in different regions, so that a single logical table is physically stored as multiple tables in different locations. This simplifies database management and can also improve application performance. Partitioning is done by data volume (queries are generally limited to a two-hour window, so the threshold can be set to match, letting a search hit a single partition wherever possible and at most two). Once a partition is complete, the values of each indexed field fall within a known range, which enables fast filtering. In addition, if the time field in the logs increases monotonically, it can be given special treatment so that its index occupies little space and filters quickly; the system makes it configurable whether this field is strictly increasing. A query can then access only the specific partitions it needs rather than the whole table, which naturally improves query performance. Because the external interface is still a single table, partitioning is transparent to users and applications, which are unaware that partitions exist. Large-table partitioning is therefore widely used in mass data storage.
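The partition pruning described above can be illustrated with a short Python sketch. It is not part of the patent: the function name `partitions_for_range`, the layout in which each partition is identified by its smallest timestamp, and the sample two-hour boundaries are all assumptions made for illustration.

```python
from bisect import bisect_right


def partitions_for_range(starts, query_start, query_end):
    """Return the indexes of the partitions that overlap [query_start, query_end].

    starts[k] is the smallest timestamp stored in partition k (an assumed
    layout); partition k covers [starts[k], starts[k + 1]) and the last
    partition is open-ended. starts must be sorted in increasing order.
    """
    if not starts or query_start > query_end or query_end < starts[0]:
        return []
    # Last partition whose start is <= the bound; clamp to 0 for early queries.
    first = max(bisect_right(starts, query_start) - 1, 0)
    last = bisect_right(starts, query_end) - 1
    return list(range(first, last + 1))


# Hypothetical two-hour partitions, timestamps in seconds.
starts = [0, 7200, 14400, 21600]
# A query window that straddles one boundary touches two partitions.
print(partitions_for_range(starts, 3600, 9000))
```

Because the time field is increasing, a binary search over the partition boundaries is enough to find the one or two partitions a time-bounded query needs.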
Among existing solutions, MySQL with the MyISAM storage engine reaches only 1,000 to 2,000 entries per second; MongoDB's import speed drops below 2,000 entries per second once the data volume exceeds 10 million entries, and keeps falling as the data grows; NoSQL stores based on key-value pairs cannot index multiple fields; Lucene's import speed peaks at close to 10,000 entries per second, and it also adds a lot of unneeded extra content related to scoring, positions and so on. Another important but less visible metric for mass data is compression. The above solutions are all limited in import speed; they achieve little compression, and their memory, CPU and I/O usage is high. In addition, to save cost, the log system needs to run on existing devices that have little spare capacity. A lightweight design is therefore the ultimate goal, and the buffering capacity allocation strategy is an important aspect of the system.
Summary of the invention
The technical problem to be solved by the present invention is the unreasonable buffering capacity allocation of the prior art; to address this defect, a buffering capacity allocation method for massive logs is provided.
The technical solution adopted by the present invention is a buffering capacity allocation method for massive logs, used to allocate the buffering capacity of a sub-table when massive logs are read in, the method comprising the following steps:
S11: read logs into a sub-table in real time, and store the logs in designated segments of the sub-table;
S12: divide all segments in the sub-table according to the time at which the logs were read in; if logs read into the sub-table contain identical fields, every identical field references the offset of the field's first occurrence in the sub-table; count the number of times the offset of each first-occurring field is referenced;
S13: establish the reference count S_{i} of each segment, where S_{i} is the sum of the numbers of times the offsets of all first-occurring fields in the i-th segment are referenced, i is a positive integer in [1, n], and n is the number of segments in the sub-table; calculate the total reference count of the sub-table: S_{sum} = \sum_{i=1}^{n} S_{i};
S14: arrange the segments and their reference counts S_{i} in the order in which the logs were read in, and perform a linear fit on the relationship between each segment and its reference count S_{i}, obtaining a straight line y = ax + b that characterizes the correspondence between segment and reference count, where the x-axis is the x-th segment in the sub-table and the y-axis is the reference count;
S15: allocate the preset total buffering capacity C_{sum} of the sub-table to the segments according to the correspondence given by the line y = ax + b; the buffering capacity allocated to the i-th segment is C_{i} = C_{sum} × (ai + b) / S_{sum}.
In the buffering capacity allocation method of the present invention, in step S11 the fields of the log include user ID, access time, access IP, requested page and requested function number.
In the buffering capacity allocation method of the present invention, step S12 comprises the following sub-steps:
S12A: divide all segments in the sub-table according to the time at which the logs were read in; if logs read into the sub-table contain identical fields, every identical field references the offset of the field's first occurrence in the sub-table;
S12B: count the number of times z_{ij} that the offset of the j-th first-occurring field in the i-th segment is referenced, and sort these counts, where i is a positive integer in [1, n], n is the total number of segments in the sub-table, j is a positive integer in [1, m], and m is the total number of first-occurring fields in the i-th segment.
In the buffering capacity allocation method of the present invention, step S13 comprises the following sub-steps:
S13A: establish the reference count S_{i}, where S_{i} is the sum of the numbers of times the offsets of all first-occurring fields in the i-th segment are referenced: S_{i} = \sum_{j=1}^{m} z_{ij}, where i is a positive integer in [1, n], n is the total number of segments in the sub-table, j is a positive integer in [1, m], and m is the total number of first-occurring fields in the i-th segment;
S13B: calculate the total reference count of the sub-table: S_{sum} = \sum_{i=1}^{n} S_{i}.
In the buffering capacity allocation method of the present invention, step S14 further comprises:
S14A: arrange the segments and their reference counts S_{i} in the order in which the logs were read in, take the segments containing the fields whose sorted counts fall within a preset range, and perform the linear fit on those segments, obtaining a straight line y = ax + b that characterizes the correspondence between segment and reference count, where the x-axis is the segment in the sub-table and the y-axis is the reference count.
In the buffering capacity allocation method of the present invention, the method further comprises:
S15A: before step S15, judge whether ai + b is greater than 0; if ai + b is greater than 0, perform step S15; if ai + b is less than or equal to 0, perform step S15B;
S15B: translate the fitted line upward along the y-axis by c units until ai + b + c is greater than 0, and correct the fitted line to y = ax + b + c;
S15C: allocate the preset total buffering capacity C_{sum} of the sub-table to the segments; the buffering capacity C_{i} allocated to the i-th segment is:
Implementing the buffering capacity allocation method for massive logs of the present invention has the following beneficial effects: buffering capacity is allocated reasonably according to reference counts, less memory is occupied, and referencing field offsets reduces I/O operations.
Brief description of the drawings
The invention is further described below with reference to the drawings and embodiments, in which:
Fig. 1 is a flowchart of the buffering capacity allocation method for massive logs provided by a preferred embodiment of the present invention;
Fig. 2 is a linear fit plot provided by a preferred embodiment of the present invention;
Fig. 3 is a schematic diagram of the structure in which logs are read into a sub-table;
Fig. 4 is a flowchart of the buffering capacity allocation method for massive logs provided by another preferred embodiment of the present invention;
Fig. 5 is a linear fit plot provided by another preferred embodiment of the present invention.
Detailed description
For a clearer understanding of the technical features, objects and effects of the present invention, specific embodiments are now described in detail with reference to the drawings.
As shown in Fig. 1, in the flowchart of the buffering capacity allocation method for massive logs provided by a preferred embodiment of the present invention, the method is used to allocate the buffering capacity of a sub-table when massive logs are read in. As shown in Fig. 3, the overall structure of the web server comprises a master table; the master table is made up of multiple sub-tables, each sub-table is made up of multiple layers, and each layer is made up of multiple segments, the segment being the basic unit of processing. The method specifically comprises:
S11: read logs into a sub-table in real time, and store the logs in segments of the sub-table. The fields of the log include at least user ID, access time, access IP, requested page and requested function number.
S12: divide the segments in the sub-table according to the time at which the logs were read in; if an identical field appears in logs read into the sub-table, the identical field references the offset of the field's first occurrence in the sub-table, and the number of times the offset of each first-occurring field is referenced is counted. While counting, a two-dimensional table such as <name, number> is used for storage and buffering; when a later field is identical to this field, it is referenced by number. This works because a first-occurring field can be referenced directly by all later segments. When a first-occurring field and fields identical to it appear in the same segment, the identical fields are stored in the segment's resource area and reference the first-occurring field directly. During counting, all counted records can be written to a file in advance in binlog form; when the statistics are written to file, this binlog can be used to complete the final write.
This step specifically comprises the following sub-steps:
S12A: divide the segments in the sub-table according to the time at which the logs were read in; if an identical field appears in logs read into the sub-table, the identical field references the offset of the field's first occurrence in the sub-table;
S12B: count the number of times z_{ij} that the offset of the j-th first-occurring field in the i-th segment is referenced, and sort these counts, where i is a positive integer in [1, n], n is the total number of segments in the sub-table, j is a positive integer in [1, m], and m is the total number of first-occurring fields in the i-th segment. Because each segment contains multiple first-occurring fields, and the number of times each field is referenced affects the speed at which logs are read in and stored, the reference counts in each segment and the corresponding totals need to be tallied.
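The counting in step S12 can be illustrated with a short Python sketch. It is not taken from the patent: the function name `count_references`, the in-memory dictionaries, and the use of list positions as offsets are assumptions made for illustration, and indices are 0-based here whereas the text numbers segments from 1.

```python
def count_references(segments):
    """Count how often each first-occurring field value is referenced later.

    segments is a list of segments in read-in order; each segment is a list
    of field values. Returns (first_seen, z), where first_seen maps a value
    to the (segment index, offset) of its first occurrence, and z[i] maps
    each value first seen in segment i to its number of later references.
    """
    first_seen = {}
    z = [dict() for _ in segments]
    for i, segment in enumerate(segments):
        for offset, value in enumerate(segment):
            if value not in first_seen:
                # First occurrence: remember where it lives, no references yet.
                first_seen[value] = (i, offset)
                z[i][value] = 0
            else:
                # Repeat: reference the stored offset and bump its counter.
                home_segment, _ = first_seen[value]
                z[home_segment][value] += 1
    return first_seen, z


# Hypothetical user-ID fields from three segments.
segments = [["u1", "u2", "u1"], ["u1", "u3", "u2"], ["u3"]]
first_seen, z = count_references(segments)
print(first_seen)
print(z)
```

Each counter is credited to the segment that holds the first occurrence, which is why segments read in early tend to accumulate the highest counts.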
S13: establish the reference count S_{i} of each segment, where S_{i} is the sum of the numbers of times the offsets of all first-occurring fields in the i-th segment are referenced, i is a positive integer in [1, n], and n is the number of segments in the sub-table; calculate the total reference count of the sub-table: S_{sum} = \sum_{i=1}^{n} S_{i}.
This step specifically comprises the following sub-steps:
S13A: establish the reference count S_{i}, where S_{i} is the sum of the numbers of times the offsets of all first-occurring fields in the i-th segment are referenced: S_{i} = \sum_{j=1}^{m} z_{ij}, where i is a positive integer in [1, n], n is the total number of segments in the sub-table, j is a positive integer in [1, m], and m is the total number of first-occurring fields in the i-th segment;
S13B: calculate the total reference count of the sub-table: S_{sum} = \sum_{i=1}^{n} S_{i}.
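As a minimal sketch of step S13, the per-segment and total reference counts are plain sums over the per-field counts. The function name `reference_totals` and the nested-list layout of the counts are assumptions for illustration; indices are 0-based.

```python
def reference_totals(z):
    """Return (s, s_sum) for per-field reference counts z.

    z[i][j] is the number of times the j-th first-occurring field of
    segment i is referenced; s[i] is the sum over j, and s_sum is the
    sum of all s[i].
    """
    s = [sum(row) for row in z]
    return s, sum(s)


# Hypothetical counts for three segments; the last has no first-occurring fields.
s, s_sum = reference_totals([[2, 1], [1], []])
print(s, s_sum)
```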
S14: arrange the segments and their reference counts S_{i} in the order in which the logs were read in, and perform a linear fit on the relationship between each segment and its reference count S_{i}, obtaining the straight line y = ax + b, where the x-axis is the x-th segment in the sub-table and the y-axis is the reference count; the line obtained by the linear fit is shown in Fig. 2. Because fitting the reference counts of all segments is computationally expensive, this step can alternatively arrange the segments and their reference counts S_{i} in the order in which the logs were read in, take the segments containing the fields whose sorted counts fall within a preset range, and perform the linear fit on those segments to obtain the line y = ax + b, where the x-axis is the segment in the sub-table and the y-axis is the reference count. The segments containing the fields within the preset range can be ranked again; the segments read in first generally have the highest reference counts. Taking the highest-ranked segments, arranging them by data volume, and then performing the linear fit not only reduces computation but also preserves the accuracy of the subsequent buffering capacity allocation.
Linear fitting uses a continuous curve to approximately describe or model the functional relationship between coordinates represented by a set of discrete points in the plane. For example, given some discrete values of a function, undetermined coefficients in a candidate function are adjusted so that the difference between that function and the known point set is minimized (in the least-squares sense); if the candidate function is linear, this is a linear fit. In numerical analysis, curve fitting approximates discrete data with an analytical expression, that is, it gives the discrete data a formula. In practice, the discrete points or data are often repeated observations or experimental values of quantities related to physical or statistical problems; they are scattered, which makes them awkward to process and often prevents them from exactly and fully reflecting the underlying regularity. A suitable analytical expression compensates for this shortcoming.
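In general, the linear fit y = ax + b can be calculated with the following least-squares formulas:
a = (l \sum_{k=1}^{l} x_{k} y_{k} - \sum_{k=1}^{l} x_{k} \sum_{k=1}^{l} y_{k}) / (l \sum_{k=1}^{l} x_{k}^{2} - (\sum_{k=1}^{l} x_{k})^{2}), b = (\sum_{k=1}^{l} y_{k} - a \sum_{k=1}^{l} x_{k}) / l
where l is the number of discrete (x, y) values, x_{k} is the k-th x value, and y_{k} is the corresponding k-th y value.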
For example, the linear fit is performed on the segments containing the fields ranked in the top five (that is, within the preset range). The reference counts of the first five segments are taken, those of the first to fifth segments being S_{1}, S_{2}, S_{3}, S_{4} and S_{5} respectively, so l = 5, x_{1} = 1, y_{1} = S_{1}, x_{2} = 2, y_{2} = S_{2}, x_{3} = 3, y_{3} = S_{3}, x_{4} = 4, y_{4} = S_{4}, x_{5} = 5, y_{5} = S_{5}. Substituting these values into the linear fit formulas yields the values of a and b.
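The top-five example can be sketched in Python as follows. The function name `fit_line` and the sample reference counts are hypothetical; the closed-form expressions are the standard ordinary least-squares formulas for a straight line, used here as an illustration rather than as the patent's own implementation.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b to the points (xs[k], ys[k])."""
    l = len(xs)
    if l != len(ys) or l < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xx = sum(x * x for x in xs)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    denom = l * sum_xx - sum_x * sum_x
    if denom == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    a = (l * sum_xy - sum_x * sum_y) / denom
    b = (sum_y - a * sum_x) / l
    return a, b


# Hypothetical reference counts S_1..S_5 for segments 1..5.
xs = [1, 2, 3, 4, 5]
ys = [50, 40, 30, 20, 10]
a, b = fit_line(xs, ys)
print(a, b)
```

With these sample counts the points lie exactly on a line, so the fit recovers a slope of -10 and an intercept of 60.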
S15: allocate the preset total buffering capacity C_{sum} of the sub-table to the segments according to the line y = ax + b; the buffering capacity allocated to the i-th segment is C_{i} = C_{sum} × (ai + b) / S_{sum}. Allocating buffering capacity according to the reference counts of the segments improves memory utilization. In this formula, the buffering capacity allocated to each segment is proportional to the ratio of that segment's reference count, as given by the fitted line, to the total reference count.
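A minimal Python sketch of this allocation rule follows. The function name `allocate_buffers` and the sample numbers are hypothetical; the body applies the formula C_{i} = C_{sum} × (ai + b) / S_{sum} for i = 1..n and does not handle non-positive fitted values, which the second embodiment addresses.

```python
def allocate_buffers(c_sum, a, b, s_sum, n):
    """Return [C_1, ..., C_n] with C_i = c_sum * (a*i + b) / s_sum."""
    if s_sum <= 0:
        raise ValueError("total reference count must be positive")
    return [c_sum * (a * i + b) / s_sum for i in range(1, n + 1)]


# Hypothetical values: line y = -10x + 60 fitted over five segments whose
# reference counts total 150, and a total buffering capacity of 1024 units.
capacities = allocate_buffers(1024, -10.0, 60.0, 150, 5)
print(capacities)
```

When a and b come from a least-squares fit over all n segments, the fitted values sum to S_{sum}, so the shares add up to C_{sum}; if the line is fitted on only a subset of segments, this need not hold and the shares may have to be rescaled.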
The beneficial effects of the method are:
1) combined with an inverted index as used in search technology, and with the massive logs guaranteed to be strictly increasing in time, the method supports high-speed import and real-time search while keeping the index compact;
2) because data compression covers both the index and the data, directly referencing offsets reduces the number of I/O operations, improves read speed, and gives a lower compression ratio;
3) buffering capacity is allocated reasonably: it is distributed according to the size of the reference counts, so fewer system resources are occupied;
4) on systems with less than 2 GB of memory, the method performs comparatively better.
As shown in Fig. 4, in the flowchart of the buffering capacity allocation method for massive logs provided by another preferred embodiment of the present invention, this embodiment builds on the previous one: when the reference count given by the fitted line y = ax + b is less than or equal to 0, a corrected fitted line y = ax + b + c is provided, and the buffering capacity allocation is corrected accordingly. The method is as follows:
S21: read logs into a sub-table in real time, and store the logs in designated segments of the sub-table;
S22: divide all segments in the sub-table according to the time at which the logs were read in; if logs read into the sub-table contain identical fields, every identical field references the offset of the field's first occurrence in the sub-table; count the number of times the offset of each first-occurring field is referenced. This step may also use step S12 of the previous embodiment.
S23: establish the reference count S_{i} of each segment, where S_{i} is the sum of the numbers of times the offsets of all first-occurring fields in the i-th segment are referenced, i is a positive integer in [1, n], and n is the number of segments in the sub-table; calculate the total reference count S_{sum} of the sub-table. This step may also use step S13 of the previous embodiment.
S24: arrange the segments and their reference counts S_{i} in the order in which the logs were read in, and perform a linear fit on the relationship between each segment and its reference count S_{i}, obtaining a straight line y = ax + b that characterizes the correspondence between segment and reference count, where the x-axis is the x-th segment in the sub-table and the y-axis is the reference count. This step may also use step S14 of the previous embodiment.
S25: before the allocation in step S26, judge whether ai + b is greater than 0; if ai + b is greater than 0, perform step S26; if ai + b is less than or equal to 0, perform steps S27 and S28;
S26: allocate the preset total buffering capacity C_{sum} of the sub-table to the segments according to the correspondence given by the line y = ax + b; the buffering capacity allocated to the i-th segment is C_{i} = C_{sum} × (ai + b) / S_{sum}.
S27: translate the fitted line upward along the y-axis by c units until ai + b + c is greater than 0, and correct the fitted line to y = ax + b + c, as shown in Fig. 5. This is needed because, for segments with few references, the reference count given by the fitted line may be negative, which is not meaningful, so the fitted line has to be corrected.
S28: allocate the preset total buffering capacity C_{sum} of the sub-table to the segments; the buffering capacity C_{i} allocated to the i-th segment is:
In addition to the beneficial effects of the previous embodiment, this embodiment corrects the fitted line and the buffering capacity allocation, ensuring that buffering capacity is allocated accurately and resources are used reasonably.
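The upward translation of the fitted line in this embodiment can be sketched in Python as below. The function name `shift_until_positive`, the unit step size, and the sample coefficients are assumptions; the text only states that the line is moved up by c units until ai + b + c is greater than 0. The sketch stops at computing c and does not attempt the corrected allocation.

```python
def shift_until_positive(a, b, n, step=1.0):
    """Return the smallest multiple c of step with a*i + b + c > 0 for i in 1..n."""
    if n < 1 or step <= 0:
        raise ValueError("n must be at least 1 and step must be positive")
    # A straight line takes its minimum over 1..n at one of the two endpoints.
    lowest = min(a * 1 + b, a * n + b)
    c = 0.0
    while lowest + c <= 0:
        c += step
    return c


# Hypothetical line y = -10x + 45 over five segments: the fifth fitted value is -5.
c = shift_until_positive(-10.0, 45.0, 5)
print(c)
```

Checking only the endpoints is sufficient because the fitted values change monotonically with the segment index.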
Embodiments of the present invention have been described above with reference to the drawings, but the invention is not limited to the specific embodiments above, which are merely illustrative rather than restrictive. Guided by the present invention, those of ordinary skill in the art can devise many further forms without departing from the spirit of the invention and the scope protected by the claims, all of which fall within the protection of the present invention.
Claims (6)
Priority Applications (1)
Application Number  Priority Date  Filing Date  Title 

CN201310727354.1A CN104750682B (en)  2013-12-25  2013-12-25  Buffering capacity allocation method for massive logs 
Applications Claiming Priority (1)
Application Number  Priority Date  Filing Date  Title 

CN201310727354.1A CN104750682B (en)  2013-12-25  2013-12-25  Buffering capacity allocation method for massive logs 
Publications (2)
Publication Number  Publication Date 

CN104750682A (en)  2015-07-01 
CN104750682B (en)  2018-04-06 
Family
ID=53590393
Family Applications (1)
Application Number  Title  Priority Date  Filing Date 

CN201310727354.1A CN104750682B (en)  Buffering capacity allocation method for massive logs  2013-12-25  2013-12-25 
Country Status (1)
Country  Link 

CN (1)  CN104750682B (en) 
Citations (4)
Publication number  Priority date  Publication date  Assignee  Title 

US7430741B2 (en) *  2004-01-20  2008-09-30  International Business Machines Corporation  Application-aware system that dynamically partitions and allocates resources on demand 
CN101667198A (en) *  2009-09-18  2010-03-10  浙江大学  Cache optimization method of real-time vertical search engine objects 
US20120323870A1 (en) *  2009-06-10  2012-12-20  At&T Intellectual Property I, L.P.  Incremental Maintenance of Inverted Indexes for Approximate String Matching 
CN103336771A (en) *  2013-04-02  2013-10-02  江苏大学  Data similarity detection method based on sliding window 

2013-12-25  CN  CN201310727354.1A  patent/CN104750682B/en  active  IP Right Grant
Also Published As
Publication number  Publication date 

CN104750682B (en)  2018-04-06 
Legal Events
Date  Code  Title  Description 

PB01  Publication  
C06  Publication  
SE01  Entry into force of request for substantive examination  
C10  Entry into substantive examination  
GR01  Patent grant  
GR01  Patent grant 