CN104750682B - Buffer allocation method for massive logs
Publication number: CN104750682B
Application number: CN201310727354.1A
Authority: CN (China)
Prior art keywords: segment, sub-table, reference count, field, log
Description
Technical field
The present invention relates to the field of log management, and more specifically to a buffer allocation method for massive logs.
Background technology
Systems such as an IDC (Internet Data Center) and DNS (Domain Name System) produce massive volumes of logs, which must be imported quickly in real time (up to 100,000 records per second) and searched in near real time. To meet these import and search targets, partitioning is generally used. Partitioning splits one very large table into multiple small tables according to some rule and stores them in different regions. A single logical table is thus physically stored like multiple tables in different locations, which simplifies database management and can also improve application performance. Partitions are sized by data volume: queries are generally limited to a window of two hours, so the size threshold can be set to match, letting a search hit a single partition where possible and never more than two. Once a partition is complete, the values of each indexed field in it fall within a known range, which allows fast filtering. In addition, if the time field of the logs can be kept increasing over time, it can be given special treatment so that its index occupies little space and filters quickly; the system naturally makes it configurable whether this field is strictly increasing. A table access can then query only the specific partitions it needs. Because the query does not touch the whole table, query performance improves. At the same time the external interface is still a single table, so partitioning is transparent to users, who do not perceive that the partitions exist. Partitioning of large tables is therefore very widely used in mass data storage.
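The range-based filtering described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the `Partition` class, its `t_min`/`t_max` attributes and the `partitions_for_range` helper are hypothetical names, and timestamps are assumed to be plain integers in seconds.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    t_min: int  # smallest timestamp stored in the partition (assumed seconds)
    t_max: int  # largest timestamp stored in the partition


def partitions_for_range(partitions, start, end):
    """Return only the partitions whose [t_min, t_max] range overlaps [start, end]."""
    return [p for p in partitions if p.t_max >= start and p.t_min <= end]


# Three hypothetical two-hour partitions.
parts = [
    Partition("p0", 0, 7199),
    Partition("p1", 7200, 14399),
    Partition("p2", 14400, 21599),
]

# A query window that straddles a partition boundary touches two partitions.
hits = partitions_for_range(parts, 7000, 7500)
print([p.name for p in hits])  # ['p0', 'p1']
```

Because each completed partition records the range of its time field, a query never has to open partitions that cannot contain matching records.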
Among existing schemes, MySQL with the MyISAM storage engine reaches only 1,000 to 2,000 records per second; MongoDB drops below 2,000 records per second once the data volume exceeds 10 million records, and keeps slowing as the data grows; NoSQL stores are based on key-value pairs and cannot index multiple fields; Lucene's import speed peaks at just under 10,000 records per second, and it also adds a great deal of unneeded extra content related to scoring, positions and so on. Another important but less visible metric for mass data is compression, and these schemes compress little. Existing schemes are therefore limited in import speed, compress poorly, and consume a lot of memory, CPU and I/O. Furthermore, to save cost, a log system needs to run on existing devices that have few idle resources, so being lightweight is clearly the ultimate goal. The buffer allocation strategy is an important aspect of such a system.
Summary of the invention
The technical problem to be solved by the present invention is the unreasonable allocation of buffer space in the prior art; to address this defect, a buffer allocation method for massive logs is provided.
The technical solution adopted by the present invention is a buffer allocation method for massive logs, used to allocate the buffer of a sub-table when massive logs are read in. The method comprises the following steps:
S11. Read logs into the sub-table in real time, and store the logs in the designated segments of the sub-table;
S12. Divide all segments in the sub-table according to the time at which the logs were read in. If logs read into the sub-table contain identical fields, every identical field references the offset of the first occurrence of that field in the sub-table; count the number of times the offset of each first-occurring field is referenced;
S13. Establish the reference count S_{i} of each segment, where S_{i} is the sum of the numbers of times the offsets of all first-occurring fields in the i-th segment are referenced, i is a positive integer in [1, n], and n is the number of segments in the sub-table; compute the total reference count S_{sum} of the sub-table:
S_{sum} = \sum_{i=1}^{n} S_{i}
S14. Arrange the segments in the order in which their logs were read in, together with each segment's reference count S_{i}, and perform a linear fit on the relation between segment and reference count to obtain the line y=ax+b that characterizes the correspondence between segment and reference count, where the x-axis is the x-th segment in the sub-table and the y-axis is the reference count;
S15. According to the correspondence defined by the line y=ax+b, distribute the preset total buffer size C_{sum} of the sub-table among the segments; the buffer size C_{i} allocated to the i-th segment is: C_{i}=C_{sum}×(ai+b)/S_{sum}.
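As a consistency check on step S15, the allocations can be shown to add up to the total buffer. The derivation below is a sketch under two stated assumptions: S_{sum} is taken as the sum of the per-segment counts S_{i}, and the line in step S14 is an ordinary least-squares fit with an intercept over all n segments.

```latex
% Assumption: (a, b) minimise \sum_{i=1}^{n} (S_i - a i - b)^2 over all n segments.
\begin{aligned}
\frac{\partial}{\partial b}\sum_{i=1}^{n}(S_i - a i - b)^2 = 0
  &\;\Longrightarrow\; \sum_{i=1}^{n}(a i + b) = \sum_{i=1}^{n} S_i = S_{sum},\\
\sum_{i=1}^{n} C_i
  &= \frac{C_{sum}}{S_{sum}}\sum_{i=1}^{n}(a i + b) = C_{sum}.
\end{aligned}
```

If the line is fitted on only some of the segments, the fitted values no longer sum exactly to S_{sum}, so the allocations sum to C_{sum} only approximately.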
In the buffer allocation method of the present invention, in step S11 the fields of the log include user ID, access time, access IP, requested page and requested function number.
In the buffer allocation method of the present invention, step S12 comprises the following sub-steps:
S12A. Divide all segments in the sub-table according to the time at which the logs were read in. If logs read into the sub-table contain identical fields, every identical field references the offset of the first occurrence of that field in the sub-table;
S12B. Count the number of times z_{ij} that the offset of the j-th first-occurring field in the i-th segment is referenced, and sort these counts, where i is a positive integer in [1, n], n is the total number of segments in the sub-table, j is a positive integer in [1, m], and m is the total number of first-occurring fields in the i-th segment.
In the buffer allocation method of the present invention, step S13 comprises the following sub-steps:
S13A. Establish the reference count S_{i}, where S_{i} is the sum of the numbers of times the offsets of all first-occurring fields in the i-th segment are referenced:
S_{i} = \sum_{j=1}^{m} z_{ij}
where i is a positive integer in [1, n], n is the total number of segments in the sub-table, j is a positive integer in [1, m], and m is the total number of first-occurring fields in the i-th segment;
S13B. Compute the total reference count S_{sum} of the sub-table:
S_{sum} = \sum_{i=1}^{n} S_{i}
In the buffer allocation method of the present invention, step S14 further comprises:
S14A. With the segments arranged in the order in which their logs were read in, together with each segment's reference count S_{i}, take the segments that contain the fields whose rank in the sort falls within a preset range and perform the linear fit on those segments only, obtaining the line y=ax+b that characterizes the correspondence between segment and reference count, where the x-axis is the segment in the sub-table and the y-axis is the reference count.
In the buffer allocation method of the present invention, the method further comprises:
S15A. Before step S15, determine whether ai+b is greater than 0; if ai+b is greater than 0, perform step S15; if ai+b is less than or equal to 0, perform step S15B;
S15B. Translate the fitted line upward along the y-axis by c units until ai+b+c is greater than 0, and correct the fitted line to y=ax+b+c;
S15C. Distribute the preset total buffer size C_{sum} of the sub-table among the segments; the buffer size C_{i} allocated to the i-th segment is: C_{i}=C_{sum}×(ai+b+c)/(S_{sum}+n×c).
Implementing the buffer allocation method for massive logs of the present invention has the following beneficial effects: the buffer is allocated reasonably according to reference counts, so less memory is used, and referencing field offsets reduces I/O operations.
Brief description of the drawings
The invention is further described below with reference to the drawings and embodiments. In the drawings:
Fig. 1 is a flow chart of a buffer allocation method for massive logs provided by a preferred embodiment of the present invention;
Fig. 2 is the linear-fit coordinate diagram provided by a preferred embodiment of the present invention;
Fig. 3 is a structural diagram of logs being read into a sub-table;
Fig. 4 is a flow chart of a buffer allocation method for massive logs provided by another preferred embodiment of the present invention;
Fig. 5 is the linear-fit coordinate diagram provided by another preferred embodiment of the present invention.
Detailed description
For a clearer understanding of the technical features, objects and effects of the present invention, specific embodiments are now described in detail with reference to the drawings.
Fig. 1 shows the flow chart of a buffer allocation method for massive logs provided by a preferred embodiment of the present invention. The method is used to allocate the buffer of a sub-table when massive logs are read in. The general storage structure of a web server is shown in Fig. 3: it comprises a master table made up of multiple sub-tables; each sub-table is made up of multiple layers, each layer is made up of multiple segments, and the segment is the basic unit of processing. The method specifically comprises:
S11. Read logs into the sub-table in real time, and store the logs in the segments of the sub-table. The fields of the log include at least ID, access time, access IP, requested page and requested function number.
S12. Divide the segments in the sub-table according to the time at which the logs were read in. If identical fields appear in the logs read into the sub-table, each identical field references the offset of the first occurrence of that field in the sub-table; count the number of times the offset of each first-occurring field is referenced. While counting, a two-dimensional table is used for storage and buffering, for example <Name, number>. If a field that appears later is identical to this field, it is referenced through number, because a first-occurring field can be referenced directly by all later segments. When a first-occurring field and a field identical to it appear in the same segment, the identical field is stored in the resource area of the segment and references the first-occurring field directly. During counting, all counted records can first be written to a file in binlog form; when the counts are written to file, this binlog can be used to complete the final write.
This step comprises the following sub-steps:
S12A. Divide the segments in the sub-table according to the time at which the logs were read in. If identical fields appear in the logs read into the sub-table, each identical field references the offset of the first occurrence of that field in the sub-table;
S12B. Count the number of times z_{ij} that the offset of the j-th first-occurring field in the i-th segment is referenced, and sort these counts, where i is a positive integer in [1, n], n is the total number of segments in the sub-table, j is a positive integer in [1, m], and m is the total number of first-occurring fields in the i-th segment. Because each segment contains multiple first-occurring fields, and the number of times each one is referenced affects how fast logs are read and stored, the reference counts within each segment and the corresponding totals must be counted.
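The first-occurrence referencing and counting in step S12 can be sketched as follows. This is an illustrative reading rather than the patent's implementation: the function name `count_references`, the record layout (a dict per log) and the running `offset` counter are assumptions, and two fields are treated as identical when both field name and value match.

```python
def count_references(segments):
    """Count references to first-occurring fields.

    segments: list of segments in read order; each segment is a list of log
    records; each record is a dict mapping field name -> value.

    Returns (first_seen, z): first_seen maps a (field, value) pair to the
    (segment index, offset) of its first occurrence; z[i] maps each pair first
    seen in segment i to the number of later references to it.
    """
    first_seen = {}
    z = [dict() for _ in segments]
    offset = 0  # hypothetical offset: one slot per newly stored field value
    for i, segment in enumerate(segments):
        for record in segment:
            for key in record.items():  # key is a (field, value) tuple
                if key in first_seen:
                    seg_idx, _ = first_seen[key]
                    z[seg_idx][key] += 1  # reference the first occurrence
                else:
                    first_seen[key] = (i, offset)
                    z[i][key] = 0
                    offset += 1  # only new values consume storage
    return first_seen, z


segments = [
    [{"user": "u1", "ip": "10.0.0.1"}, {"user": "u1", "ip": "10.0.0.2"}],
    [{"user": "u1", "ip": "10.0.0.1"}, {"user": "u2", "ip": "10.0.0.1"}],
]
first_seen, z = count_references(segments)
print([sum(counts.values()) for counts in z])  # [4, 0]
```

In this toy input every repeated value was first seen in the first segment, which matches the observation in the text that segments read earliest tend to collect the most references.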
S13. Establish the reference count S_{i} of each segment, where S_{i} is the sum of the numbers of times the offsets of all first-occurring fields in the i-th segment are referenced, i is a positive integer in [1, n], and n is the number of segments in the sub-table; compute the total reference count S_{sum} of the sub-table:
S_{sum} = \sum_{i=1}^{n} S_{i}
This step comprises the following sub-steps:
S13A. Establish the reference count S_{i}, where S_{i} is the sum of the numbers of times the offsets of all first-occurring fields in the i-th segment are referenced:
S_{i} = \sum_{j=1}^{m} z_{ij}
where i is a positive integer in [1, n], n is the total number of segments in the sub-table, j is a positive integer in [1, m], and m is the total number of first-occurring fields in the i-th segment;
S13B. Compute the total reference count S_{sum} of the sub-table:
S_{sum} = \sum_{i=1}^{n} S_{i}
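The two sums in S13A and S13B can be sketched in a few lines, taking S_{i} as the sum of z_{ij} over the first-occurring fields of segment i and S_{sum} as the sum of the S_{i}. The helper name `reference_totals` and the sample counts are hypothetical.

```python
def reference_totals(z):
    """z[i][j] is the reference count of the j-th first-occurring field
    in segment i+1. Returns (per-segment counts S, total S_sum)."""
    s = [sum(row) for row in z]
    return s, sum(s)


# Hypothetical counts for three segments with 3, 2 and 1 first-occurring fields.
z = [[5, 3, 2], [4, 1], [2]]
s, s_sum = reference_totals(z)
print(s, s_sum)  # [10, 5, 2] 17
```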
S14. Arrange the segments in the order in which their logs were read in, together with each segment's reference count S_{i}, and perform a linear fit on the relation between segment and reference count to obtain the line y=ax+b, where the x-axis is the x-th segment in the sub-table and the y-axis is the reference count. The line y=ax+b obtained by the fit is shown in Fig. 2. Because fitting the reference counts of all segments is computationally expensive, this step can instead take only the segments that contain the fields whose rank in the sort falls within a preset range and fit those, obtaining the line y=ax+b, where the x-axis is the segment in the sub-table and the y-axis is the reference count. The segments selected in this way can be ranked again. The segments read in first generally have the highest reference counts, so taking the highly ranked segments and their data volumes, ordering them, and then fitting not only reduces computation but also preserves the accuracy of the subsequent buffer allocation.
Fitting uses a continuous curve to approximately describe the functional relation between the coordinates represented by a set of discrete points in the plane. For example, given some discrete values of a function, the undetermined coefficients of a candidate function are adjusted so that the difference between that function and the known point set is minimal in the least-squares sense; if the candidate function is linear, this is called a linear fit. In numerical analysis, curve fitting means approximating discrete data with an analytical expression, that is, putting the discrete data into formula form. In practice, discrete points or data are often repeated observations or experimental values from physical or statistical problems. They are scattered, so they are not only hard to process but usually cannot exactly and fully reflect the underlying law. An appropriate analytical expression can make up for this defect.
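For a general line y=ax+b, the linear fit can be computed with the following formulas:
a = \frac{l \sum_{k=1}^{l} x_{k} y_{k} - \sum_{k=1}^{l} x_{k} \sum_{k=1}^{l} y_{k}}{l \sum_{k=1}^{l} x_{k}^{2} - \left(\sum_{k=1}^{l} x_{k}\right)^{2}}, \quad b = \frac{1}{l}\left(\sum_{k=1}^{l} y_{k} - a \sum_{k=1}^{l} x_{k}\right)
where l is the number of discrete (x, y) values, x_{k} is the k-th x value, and y_{k} is the corresponding k-th y value.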
For example, perform the linear fit on the segments containing the fields ranked in the top five (the preset range), that is, on the first five segments, whose reference counts from the first segment to the fifth are S_{1}, S_{2}, S_{3}, S_{4} and S_{5}. Then l=5, x_{1}=1, y_{1}=S_{1}, x_{2}=2, y_{2}=S_{2}, x_{3}=3, y_{3}=S_{3}, x_{4}=4, y_{4}=S_{4}, x_{5}=5, y_{5}=S_{5}. Substituting these values into the linear-fit formulas gives the values of a and b.
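The five-segment example can be worked through with a short least-squares routine. This is a sketch using the standard closed-form solution for a line; the function name `fit_line` and the sample reference counts are illustrative and do not come from the patent.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    l = len(xs)
    if l != len(ys) or l < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    sx, sy = sum(xs), sum(ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    sxx = sum(x * x for x in xs)
    denom = l * sxx - sx * sx
    if denom == 0:
        raise ValueError("x values must not all be equal")
    a = (l * sxy - sx * sy) / denom
    b = (sy - a * sx) / l
    return a, b


# Hypothetical reference counts S_1..S_5 for the first five segments.
xs = [1, 2, 3, 4, 5]
ys = [50, 40, 30, 20, 10]
a, b = fit_line(xs, ys)
print(a, b)  # -10.0 60.0
```

Here the counts fall by 10 per segment, so the fitted slope is negative, consistent with earlier segments carrying more references.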
S15. According to the line y=ax+b, distribute the preset total buffer size C_{sum} of the sub-table among the segments; the buffer size C_{i} allocated to the i-th segment is: C_{i}=C_{sum}×(ai+b)/S_{sum}. Allocating the buffer according to the reference counts of the segments improves memory utilization. In this formula, the share of the total buffer that each segment receives equals the ratio of that segment's reference count, as given by the fitted line, to the total reference count.
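The allocation formula of step S15 can be sketched directly. The function name `allocate_buffers` and the numbers are hypothetical; the coefficients are assumed to come from a fit over all five segments, so the fitted values sum to S_{sum}.

```python
def allocate_buffers(c_sum, a, b, s_sum, n):
    """Return [C_1, ..., C_n] with C_i = c_sum * (a*i + b) / s_sum."""
    return [c_sum * (a * i + b) / s_sum for i in range(1, n + 1)]


# Hypothetical values: line y = -10x + 60 over 5 segments, S_sum = 150,
# and a total buffer of 1500 units (the unit is not specified here).
buffers = allocate_buffers(c_sum=1500, a=-10.0, b=60.0, s_sum=150, n=5)
print(buffers)  # [500.0, 400.0, 300.0, 200.0, 100.0]
```

The first segment, with the highest fitted reference count, receives the largest share, and the shares add up to the total buffer in this example.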
The beneficial effects of this method are:
1) Combined with the inverted list used in search technology, massive logs are kept strictly increasing in time, which satisfies high-speed import while supporting real-time search with a compact index;
2) Because compression covers both the index and the data, directly referencing offsets reduces the number of I/O operations, speeds up reads and improves compression;
3) The buffer is allocated reasonably, in proportion to reference count, so fewer system resources are occupied;
4) The performance ratio is better on systems with less than 2 GB of memory.
Fig. 4 shows the flow chart of a buffer allocation method for massive logs provided by another preferred embodiment of the present invention. This embodiment builds on the previous one: when the reference count given by the fitted line y=ax+b is less than or equal to 0, the fitted line is corrected to y=ax+b+c and the buffer allocation is corrected accordingly. The method is as follows:
S21. Read logs into the sub-table in real time, and store the logs in the designated segments of the sub-table;
S22. Divide all segments in the sub-table according to the time at which the logs were read in. If logs read into the sub-table contain identical fields, every identical field references the offset of the first occurrence of that field in the sub-table; count the number of times the offset of each first-occurring field is referenced. This step can reuse step S12 of the previous embodiment.
S23. Establish the reference count S_{i} of each segment, where S_{i} is the sum of the numbers of times the offsets of all first-occurring fields in the i-th segment are referenced, i is a positive integer in [1, n], and n is the number of segments in the sub-table; compute the total reference count S_{sum} of the sub-table:
S_{sum} = \sum_{i=1}^{n} S_{i}
This step can reuse step S13 of the previous embodiment.
S24. Arrange the segments in the order in which their logs were read in, together with each segment's reference count S_{i}, and perform a linear fit on the relation between segment and reference count to obtain the line y=ax+b that characterizes the correspondence between segment and reference count, where the x-axis is the x-th segment in the sub-table and the y-axis is the reference count. This step can reuse step S14 of the previous embodiment.
S25. Before allocating the buffer, determine whether ai+b is greater than 0; if ai+b is greater than 0, perform step S26; if ai+b is less than or equal to 0, perform steps S27 and S28;
S26. According to the correspondence defined by the line y=ax+b, distribute the preset total buffer size C_{sum} of the sub-table among the segments; the buffer size C_{i} allocated to the i-th segment is: C_{i}=C_{sum}×(ai+b)/S_{sum}.
S27. Translate the fitted line upward along the y-axis by c units until ai+b+c is greater than 0, and correct the fitted line to y=ax+b+c, as shown in Fig. 5. This is needed because, for segments with few references, the fitted line may give a negative reference count, which makes no sense, so the fitted line must be corrected.
S28. Distribute the preset total buffer size C_{sum} of the sub-table among the segments; the buffer size C_{i} allocated to the i-th segment is: C_{i}=C_{sum}×(ai+b+c)/(S_{sum}+n×c). Because the line has been translated upward along the y-axis by c units, the area enclosed by the line, the y-axis and the x-axis no longer characterizes the total reference count S_{sum}, so the total reference count is also corrected to S_{sum}+n×c. The ratio (ai+b+c)/(S_{sum}+n×c) then characterizes the share of the i-th segment's reference count in the total reference count, and the total buffer is distributed to each segment in that proportion.
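The correction in steps S27 and S28 can be sketched as below. Several points are assumptions rather than statements from the source: the corrected total is taken to be S_{sum}+n×c (each of the n fitted values is raised by c), the shift c is chosen as the smallest value that lifts the lowest fitted value to a hypothetical `margin` above zero, and the function name `allocate_with_shift` is illustrative.

```python
def allocate_with_shift(c_sum, a, b, s_sum, n, margin=1.0):
    """Allocate buffers, shifting the fitted line up when it is not positive.

    Returns (c, buffers). Assumes the corrected total reference count is
    s_sum + n*c, and that `margin` (hypothetical) is the value the lowest
    fitted point should have after the shift.
    """
    fitted = [a * i + b for i in range(1, n + 1)]
    lowest = min(fitted)
    c = 0.0 if lowest > 0 else margin - lowest  # shift so lowest == margin
    total = s_sum + n * c
    return c, [c_sum * (f + c) / total for f in fitted]


# Hypothetical line y = -10x + 35 over 5 segments: fitted values are
# 25, 15, 5, -5, -15, which sum to 25 (used here as S_sum).
c, buffers = allocate_with_shift(c_sum=1050, a=-10, b=35, s_sum=25, n=5)
print(c, buffers)
```

With these numbers the line is raised by 16 units, every segment receives a positive buffer, and the allocations still add up to the total of 1050.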
In addition to the beneficial effects of the previous embodiment, this embodiment corrects the fitted line and the buffer allocation, which ensures that the buffer is allocated accurately and that resources are used reasonably.
The embodiments of the present invention have been described above with reference to the drawings, but the invention is not limited to these specific embodiments, which are merely illustrative rather than restrictive. Guided by the present invention, and without departing from its concept or the scope protected by the claims, a person of ordinary skill in the art can devise many other forms, all of which fall within the protection of the present invention.
Priority Applications (1)
Application number: CN201310727354.1A
Publication: CN104750682B (en)
Priority date: 2013-12-25
Filing date: 2013-12-25
Title: Buffer allocation method for massive logs
Publications (2)
CN104750682A (en), published 2015-07-01
CN104750682B (en), published 2018-04-06
Family
Family ID: 53590393
Country status: CN (1), CN104750682B (en)
2013-12-25: CN application CN201310727354.1A, patent CN104750682B (en); status: active, IP right grant
Also Published As
CN104750682A (en), published 2015-07-01
Legal Events
C06: Publication
PB01: Publication
C10: Entry into substantive examination
SE01: Entry into force of request for substantive examination
GR01: Patent grant