CN104750682B - A buffering capacity distribution method for massive logs - Google Patents

Publication number
CN104750682B
Authority
CN
China
Prior art keywords
section
sublist
reference amount
domain
log
Prior art date
Application number
CN201310727354.1A
Other languages
Chinese (zh)
Other versions
CN104750682A (en)
Inventor
吕成云
唐新民
沈智杰
景晓军
Original Assignee
任子行网络技术股份有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 任子行网络技术股份有限公司
Priority to CN201310727354.1A
Publication of CN104750682A
Application granted
Publication of CN104750682B

Abstract

The invention discloses a buffering capacity distribution method for massive logs. The method includes: S11, reading logs into a sublist in real time; S12, counting the occurrences of identical domain values; S13, establishing the reference amount of each section and calculating the total reference amount of the sublist; S14, performing a linear fit on the reference amounts of the sections; S15, distributing the preset total buffering capacity of the sublist to every section. The beneficial effects of the invention are that buffering capacity is allocated reasonably, fewer memory resources are occupied, and referencing domain offsets reduces I/O operations.

Description

A buffering capacity distribution method for massive logs

Technical field

The present invention relates to the field of log management, and more specifically to a buffering capacity distribution method for massive logs.

Background art

Systems such as IDC (Internet Data Center) and DNS (Domain Name System) produce massive volumes of logs that must be imported quickly in real time (10,000 to 100,000 records per second) and searched in near real time. To meet these import and search targets, partitioning is generally used. Partitioning splits one very large table into multiple small tables according to certain rules and stores them in different regions. The table remains a single table logically but is stored physically as multiple tables in different locations, which simplifies database management and can also improve application performance. Partitions are divided by data volume: queries are generally limited to a span of two hours, so the size threshold can be set to match, letting a search hit a single partition where possible and at most two. Once a partition is complete, the value of each indexed domain (log field) has a corresponding range, which is used for fast filtering. In addition, if the time domain of the logs can be kept increasing over time, it can be given special treatment so that its index occupies little space and filters quickly; the system makes it configurable whether this domain is strictly increasing. A query on the table can then go directly to the specific partitions concerned without touching the whole table, which naturally improves query performance. Because the external interface is still a single table, partitioning is transparent to users, who are unaware that the partitions exist. For these reasons, partitioning of large tables is widely used in mass data storage.
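
The range-based partition filtering described above can be sketched in Python. The `Partition` class, the two-hour partition boundaries, and the function name `partitions_for_query` are illustrative assumptions and do not come from the patent text; the sketch only shows how a per-partition value range for the time domain lets a query skip partitions that cannot match.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    name: str
    t_min: int  # smallest timestamp stored in the partition (seconds)
    t_max: int  # largest timestamp stored in the partition (seconds)


def partitions_for_query(partitions, q_start, q_end):
    """Return the partitions whose time range overlaps [q_start, q_end]."""
    return [p for p in partitions if p.t_min <= q_end and p.t_max >= q_start]


# Hypothetical partitions, each covering two hours of log time.
parts = [
    Partition("p0", 0, 7199),
    Partition("p1", 7200, 14399),
    Partition("p2", 14400, 21599),
]

# A query that straddles the p0/p1 boundary touches exactly two partitions.
print([p.name for p in partitions_for_query(parts, 7000, 8000)])  # ['p0', 'p1']
```

With partitions sized to the usual query span, a query touches one partition, or two when it straddles a boundary, which matches the behaviour the paragraph describes.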

Among existing solutions, MySQL with the MyISAM storage engine imports only 1,000 to 2,000 records per second. MongoDB drops below 2,000 records per second once the data volume exceeds 10,000,000 records, and keeps declining as the data volume grows. NoSQL stores are based on key-value pairs and cannot index multiple domains. Lucene's import speed peaks at only about 10,000 records per second, and it also adds a great deal of unneeded content related to scoring, positions, and so on. Another important but less visible metric for mass data is the compression ratio. The solutions above are limited in import speed, compress poorly, and consume a large amount of memory, CPU, and I/O. Moreover, to save cost, a log system needs to run on existing devices that have few idle resources, so being lightweight is the ultimate goal. The buffering capacity allocation strategy is an important aspect of such a system.

Summary of the invention

The technical problem to be solved by the present invention is to provide a buffering capacity distribution method for massive logs that addresses the unreasonable buffering capacity allocation of the prior art.

The technical solution adopted by the present invention to solve this problem is a buffering capacity distribution method for massive logs, used to allocate the buffering capacity of a sublist when massive logs are read in. The method comprises the following steps:

S11. Read logs into the sublist in real time, and store the logs in the designated sections of the sublist;

S12. Divide all sections in the sublist according to the time at which the logs are read in. If logs read into the sublist have identical domains, every identical domain references the offset of the domain that first occurred in the sublist; count the number of times the offset of each first-occurring domain is referenced;

S13. Establish the reference amount S_i of each section, where S_i is the sum of the numbers of times the offsets of all first-occurring domains in the i-th section are referenced, i is a positive integer in [1, n], and n is the number of sections in the sublist; calculate the total reference amount S_sum of the sublist;

S14. Arrange the reference amounts S_i section by section in the order in which the logs were read in, and perform a linear fit on the relation between each section and its reference amount S_i, obtaining the straight line y = ax + b that characterizes the correspondence between section and reference amount, where the x-axis is the x-th section in the sublist and the y-axis is the reference amount;

S15. According to the correspondence defined by the line y = ax + b, distribute the preset total buffering capacity C_sum of the sublist to every section; the buffering capacity C_i allocated to the i-th section is: C_i = C_sum × (a·i + b) / S_sum.

In the buffering capacity distribution method of the present invention, in step S11, the domains of the logs include user ID, access time, access IP, requested page, and request function number.

In the buffering capacity distribution method of the present invention, step S12 includes the following sub-steps:

S12A. Divide all sections in the sublist according to the time at which the logs are read in. If logs read into the sublist have identical domains, every identical domain references the offset of the domain that first occurred in the sublist;

S12B. Count the number of times z_ij that the offset of the j-th first-occurring domain in the i-th section is referenced, and sort these numbers, where i is a positive integer in [1, n], n is the total number of sections in the sublist, j is a positive integer in [1, m], and m is the total number of first-occurring domains in the i-th section.

In the buffering capacity distribution method of the present invention, step S13 includes the following sub-steps:

S13A. Establish the reference amount S_i, which is the sum of the numbers of times the offsets of all first-occurring domains in the i-th section are referenced:

$$S_i = \sum_{j=1}^{m} z_{ij}$$

where i is a positive integer in [1, n], n is the total number of sections in the sublist, j is a positive integer in [1, m], and m is the total number of first-occurring domains in the i-th section;

S13B. Calculate the total reference amount S_sum of the sublist:

$$S_{sum} = \sum_{i=1}^{n} S_i$$

In the buffering capacity distribution method of the present invention, step S14 also includes:

S14A. Arrange the reference amounts S_i section by section in the order in which the logs were read in, take the sections containing the domains whose rank falls within a preset range, and perform a linear fit on them, obtaining the straight line y = ax + b that characterizes the correspondence between section and reference amount, where the x-axis is the section in the sublist and the y-axis is the reference amount.

In the buffering capacity distribution method of the present invention, the method also includes:

S15A. Before step S15, judge whether a·i + b is greater than 0. If a·i + b is greater than 0, perform step S15; if a·i + b is less than or equal to 0, perform step S15B;

S15B. Translate the fitted line upward along the y-axis by c units, until a·i + b + c is greater than 0, and correct the fitted line to y = ax + b + c;

S15C. Distribute the preset total buffering capacity C_sum of the sublist to every section; the buffering capacity C_i allocated to the i-th section is: [formula missing in source]

Implementing the buffering capacity distribution method for massive logs of the present invention has the following beneficial effects: buffering capacity is allocated reasonably according to the reference amount, fewer memory resources are occupied, and referencing domain offsets reduces I/O operations.

Brief description of the drawings

The invention is further described below with reference to the drawings and embodiments. In the drawings:

Fig. 1 is a flow chart of a buffering capacity distribution method for massive logs provided by a preferred embodiment of the present invention;

Fig. 2 is the linear fit coordinate diagram provided by a preferred embodiment of the present invention;

Fig. 3 is a structural diagram of reading logs into a sublist;

Fig. 4 is a flow chart of a buffering capacity distribution method for massive logs provided by another preferred embodiment of the present invention;

Fig. 5 is the linear fit coordinate diagram provided by another preferred embodiment of the present invention.

Detailed description of the embodiments

For a clearer understanding of the technical features, purpose, and effects of the present invention, embodiments of the invention are now described in detail with reference to the drawings.

Fig. 1 shows the flow chart of a buffering capacity distribution method for massive logs provided by a preferred embodiment of the present invention. The method is used to allocate the buffering capacity of a sublist when massive logs are read in. The general structure of a WEB server is shown in Fig. 3: it includes a summary table composed of multiple sublists; each sublist is composed of multiple layers, each layer is composed of multiple sections, and the section is the basic unit of processing. The method specifically includes:

S11. Read logs into the sublist in real time, and store the logs in sections of the sublist. The domains of the logs include at least user ID, access time, access IP, requested page, and request function number.

S12. Divide the sections in the sublist according to the time at which the logs are read in. If logs read into the sublist have identical domains, the identical domains reference the offset of the domain that first occurred in the sublist; count the number of times the offset of each first-occurring domain is referenced. When counting, a two-dimensional table is used for storage and buffering, for example <name, number>. If a domain that occurs later is identical to this domain, it is referenced by number, because a first-occurring domain can be referenced directly by all later sections. When a first-occurring domain and domains identical to it occur in the same section, the identical domains are stored in the resource area of the section and reference the first-occurring domain directly. During counting, all counted records can be written to a file in advance in binlog form; when the counts are written to file, this binlog can be used to complete the final write.

This step specifically includes the following sub-steps:

S12A. Divide the sections in the sublist according to the time at which the logs are read in. If logs read into the sublist have identical domains, the identical domains reference the offset of the domain that first occurred in the sublist;

S12B. Count the number of times z_ij that the offset of the j-th first-occurring domain in the i-th section is referenced, and sort these numbers, where i is a positive integer in [1, n], n is the total number of sections in the sublist, j is a positive integer in [1, m], and m is the total number of first-occurring domains in the i-th section. Because each section contains multiple first-occurring domains, and the number of times each one is referenced affects the speed of reading and storing logs, the reference counts within each section and the corresponding totals must be counted.
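
The counting in S12A and S12B can be sketched as follows. This is an illustrative Python sketch under simplifying assumptions: each section is modelled as a list of domain values in read order, and the function name `count_references` and the sample values are hypothetical. It does not model byte offsets, the resource area of a section, or the binlog.

```python
def count_references(sections):
    """Count how often each first-occurring domain value is referenced later.

    sections: list of sections in read order; each section is a list of
    domain values. Returns one dict per section, mapping every value that
    first occurs in that section to its reference count z_ij.
    """
    first_seen = {}                      # value -> index of the section where it first occurred
    counts = [dict() for _ in sections]  # counts[i][value] plays the role of z_ij
    for i, section in enumerate(sections):
        for value in section:
            if value in first_seen:
                # A repeated value references the first occurrence instead of
                # being stored again, so that occurrence's count goes up.
                counts[first_seen[value]][value] += 1
            else:
                first_seen[value] = i
                counts[i][value] = 0
    return counts


# Hypothetical domain values read into three sections.
sections = [["a", "b", "a"], ["a", "c"], ["b", "c", "c"]]
z = count_references(sections)
print(z)  # [{'a': 2, 'b': 1}, {'c': 2}, {}]

# S12B also sorts the counts; here, section 1 in descending order.
print(sorted(z[0].items(), key=lambda kv: kv[1], reverse=True))  # [('a', 2), ('b', 1)]
```

Attributing every repeat to the section of first occurrence is what makes early sections accumulate larger counts, which the later steps rely on.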

S13. Establish the reference amount S_i of each section, where S_i is the sum of the numbers of times the offsets of all first-occurring domains in the i-th section are referenced, i is a positive integer in [1, n], and n is the number of sections in the sublist; calculate the total reference amount S_sum of the sublist.

This step specifically includes the following sub-steps:

S13A. Establish the reference amount S_i, which is the sum of the numbers of times the offsets of all first-occurring domains in the i-th section are referenced:

$$S_i = \sum_{j=1}^{m} z_{ij}$$

where i is a positive integer in [1, n], n is the total number of sections in the sublist, j is a positive integer in [1, m], and m is the total number of first-occurring domains in the i-th section;

S13B. Calculate the total reference amount S_sum of the sublist:

$$S_{sum} = \sum_{i=1}^{n} S_i$$
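
As a small worked illustration of S13A and S13B, the per-section reference amounts and their total can be computed from the counts z_ij. The function name `reference_amounts` and the sample counts are hypothetical; the total is taken to be the sum of the per-section amounts, which is what "total reference amount of the sublist" suggests.

```python
def reference_amounts(z):
    """z[i] maps each first-occurring domain value of section i to its count z_ij.

    Returns (s, s_sum): s[i] is the reference amount of section i, and
    s_sum is the total reference amount of the sublist.
    """
    s = [sum(section_counts.values()) for section_counts in z]
    return s, sum(s)


# Hypothetical counts for three sections.
z_example = [{"a": 2, "b": 1}, {"c": 2}, {}]
s, s_sum = reference_amounts(z_example)
print(s, s_sum)  # [3, 2, 0] 5
```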

S14. Arrange the reference amounts S_i section by section in the order in which the logs were read in, and perform a linear fit on the relation between each section and its reference amount S_i, obtaining the straight line y = ax + b, where the x-axis is the x-th section in the sublist and the y-axis is the reference amount. The line y = ax + b obtained by the linear fit is shown in Fig. 2. Because a linear fit over the reference amounts of all sections is computationally expensive, this step may instead arrange the reference amounts S_i in read order, take only the sections containing the domains whose rank falls within a preset range, and fit those, obtaining the line y = ax + b, where the x-axis is the section in the sublist and the y-axis is the reference amount. The sections containing the domains ranked within the preset range can be ranked again. The sections read first generally have the highest reference amounts, so taking the highly ranked sections and their data and then performing the linear fit not only reduces the amount of computation but also preserves the accuracy of the subsequent buffering capacity allocation.

Linear fitting uses a continuous curve to approximate the functional relationship between the coordinates represented by a set of discrete points on a plane. For example, given some discrete values of a function, undetermined coefficients in the function are adjusted so that the difference between the function and the known point set is minimal in the least squares sense; if the undetermined function is linear, this is a linear fit. In numerical analysis, curve fitting approximates discrete data with an analytical expression, that is, it puts the discrete data into formula form. In practice, the discrete points or data are often repeated observations or experimental values of physical or statistical quantities. They are scattered, which makes them hard to handle, and they generally cannot express the underlying rule exactly and completely. An appropriate analytical expression makes up for this defect.

In general, the linear fit y = ax + b can be calculated with the following formulas:

$$a = \frac{l\sum_{k=1}^{l} x_k y_k - \sum_{k=1}^{l} x_k \sum_{k=1}^{l} y_k}{l\sum_{k=1}^{l} x_k^2 - \left(\sum_{k=1}^{l} x_k\right)^2}, \qquad b = \frac{\sum_{k=1}^{l} y_k - a\sum_{k=1}^{l} x_k}{l}$$

where l is the number of discrete (x, y) values, x_k is the k-th x value, and y_k is the corresponding k-th y value.

For example, take the sections containing the domains ranked in the top five (the preset range) for the linear fit, and suppose these are the first five sections, whose reference amounts from the first section to the fifth are S_1, S_2, S_3, S_4, and S_5. Then l = 5, x_1 = 1, y_1 = S_1, x_2 = 2, y_2 = S_2, x_3 = 3, y_3 = S_3, x_4 = 4, y_4 = S_4, x_5 = 5, y_5 = S_5. Substituting these values into the linear fit formulas gives the values of a and b.
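
The five-section example can be worked through with an ordinary least-squares fit. The sketch below uses the standard closed-form least-squares formulas for a line; the function name `fit_line` and the reference amounts assigned to S_1 through S_5 are hypothetical values chosen for illustration, not data from the patent.

```python
def fit_line(xs, ys):
    """Least-squares fit of y = a*x + b through the points (xs[k], ys[k])."""
    l = len(xs)
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_xx = sum(x * x for x in xs)
    a = (l * sum_xy - sum_x * sum_y) / (l * sum_xx - sum_x ** 2)
    b = (sum_y - a * sum_x) / l
    return a, b


# x_k is the section index; y_k is the (hypothetical) reference amount S_k.
xs = [1, 2, 3, 4, 5]
ys = [50, 38, 31, 19, 12]
a, b = fit_line(xs, ys)
print(a, b)  # -9.5 58.5
```

The negative slope reflects the observation in the text that sections read first generally carry the highest reference amounts.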

S15. According to the line y = ax + b, distribute the preset total buffering capacity C_sum of the sublist to every section; the buffering capacity C_i allocated to the i-th section is: C_i = C_sum × (a·i + b) / S_sum. Allocating buffering capacity according to the reference amount of each section improves memory utilization. In this formula, the share of the total buffering capacity that a section receives equals the ratio of that section's fitted reference amount to the total reference amount.
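
A minimal sketch of the S15 allocation, assuming the line was fitted over all n sections. The function name `allocate_buffers`, the coefficients, and the 1024-unit total are hypothetical. Under that assumption the fitted values a·i + b sum to S_sum (a property of least squares with an intercept), so the allocations add up to C_sum; when only a subset of sections is fitted, that need not hold exactly.

```python
def allocate_buffers(a, b, n, c_sum, s_sum):
    """Return [C_1, ..., C_n] with C_i = c_sum * (a*i + b) / s_sum."""
    return [c_sum * (a * i + b) / s_sum for i in range(1, n + 1)]


# Hypothetical fit over five sections whose reference amounts total 150.
alloc = allocate_buffers(a=-9.5, b=58.5, n=5, c_sum=1024, s_sum=150)
for i, c_i in enumerate(alloc, start=1):
    print(f"section {i}: {c_i:.2f}")
print(round(sum(alloc), 6))  # 1024.0
```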

The beneficial effects of this method are:

1) Combined with the inverted-list search technique, massive logs are kept strictly increasing in time, which supports high-speed import and real-time search with a compact index;

2) Because the compressed size covers both index and data, directly referencing offsets reduces the number of I/O operations, improves reading speed, and gives a lower compression ratio;

3) Buffering capacity is allocated reasonably, in proportion to the reference amount, so fewer system resources are occupied;

4) Better cost-performance on systems with less than 2 GB of memory.

Fig. 4 shows the flow chart of a buffering capacity distribution method for massive logs provided by another preferred embodiment of the present invention. This embodiment builds on the previous one: when the reference amount characterized by the fitted line y = ax + b is less than or equal to 0, the fitted line is corrected to y = ax + b + c, and the buffering capacity allocation is corrected accordingly. The method is as follows:

S21. Read logs into the sublist in real time, and store the logs in the designated sections of the sublist;

S22. Divide all sections in the sublist according to the time at which the logs are read in. If logs read into the sublist have identical domains, every identical domain references the offset of the domain that first occurred in the sublist; count the number of times the offset of each first-occurring domain is referenced. This step may also be carried out as step S12 of the previous embodiment.

S23. Establish the reference amount S_i of each section, where S_i is the sum of the numbers of times the offsets of all first-occurring domains in the i-th section are referenced, i is a positive integer in [1, n], and n is the number of sections in the sublist; calculate the total reference amount S_sum of the sublist. This step may also be carried out as step S13 of the previous embodiment.

S24. Arrange the reference amounts S_i section by section in the order in which the logs were read in, and perform a linear fit on the relation between each section and its reference amount S_i, obtaining the straight line y = ax + b that characterizes the correspondence between section and reference amount, where the x-axis is the x-th section in the sublist and the y-axis is the reference amount. This step may also be carried out as step S14 of the previous embodiment.

S25. Before allocating the buffering capacity, judge whether a·i + b is greater than 0. If a·i + b is greater than 0, perform step S26; if a·i + b is less than or equal to 0, perform steps S27 and S28;

S26. According to the correspondence defined by the line y = ax + b, distribute the preset total buffering capacity C_sum of the sublist to every section; the buffering capacity C_i allocated to the i-th section is: C_i = C_sum × (a·i + b) / S_sum.

S27. Translate the fitted line upward along the y-axis by c units, until a·i + b + c is greater than 0, and correct the fitted line to y = ax + b + c, as shown in Fig. 5. This is needed because, for sections with few references, the reference amount given by the fitted line may be negative, which does not make sense, so the fitted line must be corrected.

S28. Distribute the preset total buffering capacity C_sum of the sublist to every section; the buffering capacity C_i allocated to the i-th section is: [formula missing in source]. Because the line has been translated upward along the y-axis by c units, the area enclosed by the line, the y-axis, and the x-axis no longer characterizes the total reference amount S_sum, so the total reference amount is also corrected to [formula missing in source]. The resulting ratio [formula missing in source] characterizes the share of the i-th section's reference amount in the total reference amount, and the total buffering capacity is allocated to every section in that proportion.
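
The correction in S27 and S28 can be sketched as below, with two loudly stated assumptions, because the corrected formulas are not legible in this text. First, the corrected total reference amount is taken to be the sum of the shifted line values over all sections (which equals S_sum + n·c when the line was fitted over all n sections). Second, the shift c is chosen as the smallest value that lifts the lowest fitted value to a hypothetical `margin`. The function name `allocate_with_shift` and the sample numbers are also hypothetical.

```python
def allocate_with_shift(a, b, n, c_sum, margin=1.0):
    """Shift the fitted line up if needed, then allocate proportionally.

    Returns (c, allocations). c is the upward shift applied to y = a*x + b
    so that every section's value a*i + b + c is positive.
    ASSUMPTION: the corrected total is the sum of the shifted values.
    """
    values = [a * i + b for i in range(1, n + 1)]
    lowest = min(values)
    # ASSUMPTION: lift the lowest value to `margin`; the source only
    # requires that a*i + b + c end up greater than 0.
    c = 0.0 if lowest > 0 else margin - lowest
    shifted = [v + c for v in values]
    corrected_total = sum(shifted)
    return c, [c_sum * v / corrected_total for v in shifted]


# Hypothetical fit whose last two sections would get negative values.
c, alloc_shifted = allocate_with_shift(a=-10, b=35, n=5, c_sum=1050)
print(c, alloc_shifted)  # 16.0 [410.0, 310.0, 210.0, 110.0, 10.0]
```

Normalizing by the shifted total keeps every allocation positive while the allocations still add up to C_sum.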

In addition to the beneficial effects of the previous embodiment, this embodiment corrects the fitted line and the buffering capacity allocation, which ensures that buffering capacity is allocated accurately and that resources are used reasonably.

Embodiments of the invention have been described above with reference to the drawings, but the invention is not limited to these specific embodiments, which are illustrative rather than restrictive. Guided by the present invention, a person of ordinary skill in the art can devise many other forms without departing from the concept of the invention or the scope of protection of the claims, and all of these fall within the protection of the present invention.

Claims (1)

1. A buffering capacity distribution method for massive logs, used to allocate the buffering capacity of a sublist when massive logs are read in, characterized in that the method comprises the following steps:
S11. Read logs into the sublist in real time, and store the logs in the designated sections of the sublist; the domains of the logs include user ID, access time, access IP, requested page, and request function number;
S12. Divide all sections in the sublist according to the time at which the logs are read in; if logs read into the sublist have identical domains, every identical domain references the offset of the domain that first occurred in the sublist; count the number of times the offset of each first-occurring domain is referenced; step S12 includes the following sub-steps:
S12A. Divide all sections in the sublist according to the time at which the logs are read in; if logs read into the sublist have identical domains, every identical domain references the offset of the domain that first occurred in the sublist;
S12B. Count the number of times z_ij that the offset of the j-th first-occurring domain in the i-th section is referenced, and sort these numbers, where i is a positive integer in [1, n], n is the total number of sections in the sublist, j is a positive integer in [1, m], and m is the total number of first-occurring domains in the i-th section;
S13. Establish the reference amount S_i of each section, where S_i is the sum of the numbers of times the offsets of all first-occurring domains in the i-th section are referenced, i is a positive integer in [1, n], and n is the number of sections in the sublist; calculate the total reference amount S_sum of the sublist; step S13 includes the following sub-steps:
S13A. Establish the reference amount S_i, which is the sum of the numbers of times the offsets of all first-occurring domains in the i-th section are referenced: $S_i = \sum_{j=1}^{m} z_{ij}$, where i is a positive integer in [1, n], n is the total number of sections in the sublist, j is a positive integer in [1, m], and m is the total number of first-occurring domains in the i-th section;
S13B. Calculate the total reference amount S_sum of the sublist: $S_{sum} = \sum_{i=1}^{n} S_i$;
S14. Arrange the reference amounts S_i section by section in the order in which the logs were read in, and perform a linear fit on the relation between each section and its reference amount S_i, obtaining the straight line y = ax + b that characterizes the correspondence between section and reference amount, where the x-axis is the x-th section in the sublist and the y-axis is the reference amount; step S14 includes:
S14A. Arrange the reference amounts S_i section by section in the order in which the logs were read in, take the sections containing the domains whose rank falls within a preset range, and perform a linear fit on them, obtaining the straight line y = ax + b that characterizes the correspondence between section and reference amount, where the x-axis is the section in the sublist and the y-axis is the reference amount;
S15. According to the correspondence defined by the line y = ax + b, distribute the preset total buffering capacity C_sum of the sublist to every section; the buffering capacity C_i allocated to the i-th section is: C_i = C_sum × (a·i + b) / S_sum;
the method also includes:
S15A. Before step S15, judge whether a·i + b is greater than 0; if a·i + b is greater than 0, perform step S15; if a·i + b is less than or equal to 0, perform step S15B;
S15B. Translate the fitted line upward along the y-axis by c units, until a·i + b + c is greater than 0, and correct the fitted line to y = ax + b + c;
S15C. Distribute the preset total buffering capacity C_sum of the sublist to every section; the buffering capacity C_i allocated to the i-th section is: [formula missing in source]

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310727354.1A CN104750682B (en) 2013-12-25 2013-12-25 A kind of buffering capacity distribution method of massive logs

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310727354.1A CN104750682B (en) 2013-12-25 2013-12-25 A kind of buffering capacity distribution method of massive logs

Publications (2)

Publication Number Publication Date
CN104750682A (en) 2015-07-01
CN104750682B (en) 2018-04-06

Family

ID=53590393

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310727354.1A CN104750682B (en) 2013-12-25 2013-12-25 A kind of buffering capacity distribution method of massive logs

Country Status (1)

Country Link
CN (1) CN104750682B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7430741B2 (en) * 2004-01-20 2008-09-30 International Business Machines Corporation Application-aware system that dynamically partitions and allocates resources on demand
CN101667198A (en) * 2009-09-18 2010-03-10 浙江大学 Cache optimization method of real-time vertical search engine objects
CN103336771A (en) * 2013-04-02 2013-10-02 江苏大学 Data similarity detection method based on sliding window

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8271499B2 (en) * 2009-06-10 2012-09-18 At&T Intellectual Property I, L.P. Incremental maintenance of inverted indexes for approximate string matching

Also Published As

Publication number Publication date
CN104750682A (en) 2015-07-01

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
GR01 Patent grant
GR01 Patent grant