CN107341258A - Log data acquisition method and system - Google Patents

Log data acquisition method and system

Info

Publication number
CN107341258A
CN107341258A (application CN201710564475.7A)
Authority
CN
China
Prior art keywords
data
log
flume
branch
read
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201710564475.7A
Other languages
Chinese (zh)
Other versions
CN107341258B (en)
Inventor
袁一
沈贇
张学舟
游枫
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Industrial and Commercial Bank of China Ltd ICBC
Original Assignee
Industrial and Commercial Bank of China Ltd ICBC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Industrial and Commercial Bank of China Ltd ICBC filed Critical Industrial and Commercial Bank of China Ltd ICBC
Priority to CN201710564475.7A priority Critical patent/CN107341258B/en
Publication of CN107341258A publication Critical patent/CN107341258A/en
Application granted granted Critical
Publication of CN107341258B publication Critical patent/CN107341258B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical


Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/13 File access structures, e.g. distributed indices
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/172 Caching, prefetching or hoarding of files
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/1805 Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815 Journaling file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q20/00 Payment architectures, schemes or protocols
    • G06Q20/38 Payment protocols; Details thereof
    • G06Q20/389 Keeping log of transactions for guaranteeing non-repudiation of a transaction
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06Q INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
    • G06Q20/00 Payment architectures, schemes or protocols
    • G06Q20/38 Payment protocols; Details thereof
    • G06Q20/40 Authorisation, e.g. identification of payer or payee, verification of customer or shop credentials; Review and approval of payers, e.g. check credit lines or negative lists
    • G06Q20/401 Transaction verification
    • G06Q20/4016 Transaction verification involving fraud or risk level assessment in transaction processing

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Business, Economics & Management (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Accounting & Taxation (AREA)
  • General Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Finance (AREA)
  • Strategic Management (AREA)
  • General Business, Economics & Management (AREA)
  • Computer Security & Cryptography (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention provides a log data acquisition method and system. The method includes: dividing, in advance, the target area from which log data is to be collected into at least two collection regions, each collection region comprising a data center and at least one branch office, with the log storage system located at the data center of a first collection region among the at least two collection regions; collecting the log data of the web servers in each branch office and data center, and transmitting the collected log data of each branch office to the data center of the collection region where that branch office is located; and storing the log data of each collection region to the log storage system through the first-level Flume receiver of the data center of the first collection region. The log collection scheme of the present invention combines characteristics such as high availability, high reliability and high timeliness, and improves operating efficiency.

Description

Log data acquisition method and system
Technical field
The present invention relates to data processing technology, and in particular to a log data collection method and system.
Background technology
At present, e-commerce and internet finance are developing rapidly. While online transactions bring convenience to users, they also carry potential risks, such as theft of user accounts, financial fraud, and money laundering. Enterprises therefore have an increasingly strong need to monitor transaction risk. Traditionally, an enterprise monitors transaction risk by setting up an internal risk-control department that analyzes user transactions offline and intervenes after suspicious data is found. With the development of big data technology, transaction risk control is gradually becoming digital and intelligent. Using big data, transaction risk monitoring not only saves manpower and material resources and improves work efficiency, but also effectively reduces losses caused by economic crime. Log data is an important source of the information that transaction risk monitoring needs when performing big data mining and analysis; for this reason, various log collection systems are widely used inside enterprises.
Among the many log collection products of the prior art, Flume is a popular high-performance distributed open-source product. It provides an easily configurable multi-layer collection framework and can efficiently collect log data from multiple data sources and save it to a central data warehouse. However, the log collection mode of the prior art gathers data at file granularity, i.e. a file is collected only after it has been fully generated. Because the collection speed is limited, this cannot meet the real-time requirements of businesses with high timeliness demands, such as real-time marketing in e-commerce (recommending commodities to online users) or transaction risk monitoring in financial systems that track customer transaction data.
To speed up log collection, the prior-art Flume offers a line-by-line collection method in the manner of tail -F, achieving continuous log collection. However, this approach has the following disadvantage: once an abnormal event occurs, such as the application service being restarted or the log content being overwritten or deleted, data may be lost or incorrect half-line data may be collected, causing subsequent log analysis to go wrong.
Summary of the invention
To overcome the shortcoming that traditional log collection cannot cope with abnormal events, which causes problems such as data loss and log analysis errors, embodiments of the present invention provide a log data acquisition method, the method including:

dividing, in advance, the target area from which log data is to be collected into at least two collection regions, each collection region including a data center and at least one branch office, the log storage system being located at the data center of a first collection region among the at least two collection regions;

collecting the log data of the web servers in each branch office and data center, and transmitting the collected log data of each branch office to the data center of the collection region where that branch office is located;

storing the log data of each collection region to the log storage system through the first-level Flume receiver of the data center of the first collection region.
In an embodiment of the present invention, collecting the log data of the web servers in each branch office and data center, and transmitting the collected log data of each branch office to the data center of the collection region where the branch office is located, includes:

transmitting the collected log data of the web servers of a branch office, through the Flume receiver located at the branch office, to the Flume receiver of the data center of the collection region where that branch office is located.
In an embodiment of the present invention, storing the log data of each collection region to the log storage system through the first-level Flume receiver of the data center of the first collection region includes:

transmitting the collected log data of the web servers of a data center, through the second-level Flume receiver of that data center, to the first-level Flume receiver of the data center of the first collection region, to be stored to the log storage system;

transmitting the collected log data of the web servers of the branch offices of the first collection region, through the Flume receivers of those branch offices, to the first-level Flume receiver, to be stored to the log storage system;

transmitting the collected log data of the web servers of the branch offices of non-first collection regions, through the second-level Flume receivers of the corresponding data centers, to the first-level Flume receiver, to be stored to the log storage system.

In an embodiment of the present invention, the second-level Flume receivers of the data centers of the non-first collection regions access the first-level Flume receiver through a high-speed dedicated network line.
In an embodiment of the present invention, collecting the log data of the web servers in each branch office and data center, and transmitting the collected log data of each branch office to the data center of the collection region where the branch office is located, includes:

reading the log data of a web server in units of data blocks and writing it to a transfer queue;

sending the log data in the transfer queue to a Flume receiver;

determining the downstream destination of the log data according to the location type of the Flume receiver.
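The second of the steps above, draining the transfer queue toward a Flume receiver, can be sketched as follows. This is only an illustrative sketch: the `send` callable, the newline framing, and the timeout are assumptions not fixed by the patent (for instance, a Flume netcat-style source accepts newline-delimited text over a socket, so `send` could be a socket write).

```python
import queue

def drain_transfer_queue(transfer_queue, send, timeout=0.1):
    """Minimal transit-module sketch: pull buffered log lines off the
    transfer queue and hand each one to `send` (e.g. a socket write
    toward a Flume receiver). Stops once the queue stays empty past the
    timeout, and returns the number of lines sent."""
    sent = 0
    while True:
        try:
            line = transfer_queue.get(timeout=timeout)
        except queue.Empty:
            return sent
        send(line + "\n")  # newline framing is an assumption of this sketch
        sent += 1
```

Injecting `send` keeps the sketch testable without a live Flume receiver; a real deployment would replace it with the actual transport.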
In an embodiment of the present invention, the method includes:

presetting a splitting principle for the log data of the web server, the splitting principle including: splitting the log data by size or by time;

generating the log files that store the log data according to the preset splitting principle of the log data of the web server.
In an embodiment of the present invention, reading the log data of the web server in units of data blocks and writing it to the transfer queue includes:

step 1, pointing a file pointer at the log to be collected;

step 2, reading the log data in the current log file in units of data blocks from a specified offset;

step 3, reading characters one by one from the data block and putting them into a buffer;

step 4, extracting the characters in the buffer line by line and writing them to the transfer queue.
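The four steps above can be sketched as a small reader. Everything concrete here is an assumption for illustration: the block size, the UTF-8 decoding, and the simplification that a trailing half line left in the buffer is not carried over between calls.

```python
import queue

def extract_into_queue(block, buf, transfer_queue):
    """Steps 3-4: take characters one by one from the data block, append
    them to the buffer, and on each newline flush the buffered line to
    the transfer queue. A trailing half line stays in the buffer."""
    for byte in block:
        if byte == ord("\n"):
            transfer_queue.put(buf.decode("utf-8", errors="replace"))
            buf.clear()
        else:
            buf.append(byte)

def read_log(path, offset, transfer_queue, block_size=4096):
    """Steps 1-2: point the file pointer at the log and read data blocks
    from the given offset; block_size is an assumed value. Returns the
    new offset after the data read so far."""
    buf = bytearray()
    with open(path, "rb") as f:
        f.seek(offset)                  # step 1: file pointer at the log
        while True:
            block = f.read(block_size)  # step 2: one data block at the offset
            if not block:               # no new character: block end reached
                break
            offset += len(block)
            extract_into_queue(block, buf, transfer_queue)  # steps 3-4
    return offset
```

Reading only up to the last newline before forwarding is what avoids the half-line problem the background section attributes to plain tail -F collection.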
In an embodiment of the present invention, reading characters one by one from the data block and putting them into the buffer includes:

judging, by whether a new character can be read, whether the end of the data block has been reached;

if it is determined that the end of the data block has been reached, performing step 2.
In an embodiment of the present invention, extracting the characters in the buffer line by line and writing them to the transfer queue includes:

judging whether the newly read character is a newline;

if it is determined that the character read is a newline, extracting the characters in the buffer and writing them to the transfer queue;

if it is determined that the character read is not a newline, performing step 3.
In an embodiment of the present invention, before step 2 is performed upon determining that the end of the data block has been reached, log anomaly detection is further performed; if a log anomaly is determined, step 2 is performed after the pointer offset has been reset.
In an embodiment of the present invention, before performing step 2, the method further includes: determining whether there is a newly added log file; wherein

if it is determined that there is no newly added log file, step 2 is performed;

if it is determined that there is a newly added log file, the newly added log file is designated as the next log file to be read after the current log file has been read through.
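The new-file check above can be sketched as picking the file to switch to once the current file is exhausted. The sketch assumes the directory's file names sort oldest-to-newest (e.g. a fixed prefix with a single-digit numeric suffix); multi-digit suffixes would need numeric rather than lexicographic ordering.

```python
def pick_next_log(files, current):
    """Sketch of the newly-added-file check: `files` are the log file
    names in the collection directory, assumed to sort oldest-to-newest.
    If a file newer than the current one exists, it becomes the next
    file to read once the current file has been read through; otherwise
    there is nothing to switch to yet."""
    ordered = sorted(files)
    if current not in ordered:
        return ordered[0] if ordered else None
    i = ordered.index(current)
    return ordered[i + 1] if i + 1 < len(ordered) else None
```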
In an embodiment of the present invention, before performing step 3, the method includes:

judging whether a data block has been read;

if it is determined that a data block has been read, performing step 3;

if it is determined that no data block has been read, waiting for a preset specified time.
In an embodiment of the present invention, if it is determined that no data block has been read, log anomaly detection is performed after the preset specified time has been waited; if a log anomaly is detected, the pointer offset is reset and then the step of determining whether there is a newly added log file is performed.
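One concrete form the log anomaly detection could take is an offset sanity check: if the file has shrunk below the saved read offset, its content was overwritten or truncated, so the offset must be reset. The patent does not specify the detection mechanism; this is an assumed, minimal realization.

```python
import os

def checked_offset(path, offset):
    """Log anomaly detection sketch: if the log file has been deleted, or
    is now shorter than the saved read offset (truncated or overwritten),
    reset the offset to 0 so collection restarts from the file's
    beginning; otherwise keep the saved offset."""
    try:
        size = os.path.getsize(path)
    except OSError:            # file deleted, or replaced and not yet recreated
        return 0
    return 0 if size < offset else offset
```

A size check like this catches restarts and overwrites cheaply, though it cannot detect an in-place rewrite that leaves the file the same length; a stronger variant could also compare an inode number or a checksum of the first bytes.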
Meanwhile the present invention also provides a kind of log data acquisition system, including:
Region division device, for the target area of daily record data to be collected to be divided into at least two pickup areas, respectively Pickup area includes:Data center and at least one branch, daily record storage system are located at least two acquisition zone The data center of first pickup area in one of domain;
Log data acquisition device, for gathering the daily record data of the web server in each branch and data center, And the daily record data of each branch of collection is transferred to the data center of branch place pickup area;
Flume Primary Receives end, the data center of first pickup area is arranged at, for by the day of each pickup area Will data storage is to daily record storage system.
In an embodiment of the present invention, the log data acquisition device includes:

collection clients, arranged at each branch office and data center, for collecting the log data of the web servers in each branch office and data center;

Flume receivers, arranged at each branch office and data center; wherein

the collected log data of the web servers of a data center is transmitted through the second-level Flume receiver of that data center to the first-level Flume receiver, to be stored to the log storage system;

the collected log data of the web servers of the branch offices of the first collection region is transmitted through the Flume receivers of those branch offices to the first-level Flume receiver, to be stored to the log storage system;

the collected log data of the web servers of the branch offices of non-first collection regions is transmitted through the Flume receivers of those branch offices to the second-level receivers of the corresponding data centers, and is then transmitted by the second-level Flume receivers of those data centers to the first-level Flume receiver, to be stored to the log storage system.
In an embodiment of the present invention, the second-level Flume receivers of the data centers of the non-first collection regions access the first-level Flume receiver through a high-speed dedicated network line.
In an embodiment of the present invention, the acquisition device includes:

a reading module for reading the log data of the web server in units of data blocks and writing it to a transfer queue;

a transfer module for sending the log data in the transfer queue to the Flume receivers;

the Flume receivers determining the downstream destination of the log data according to their location type.
In an embodiment of the present invention, the reading module includes:

a principle presetting unit for presetting the splitting principle of the log data of the web server, the splitting principle including: splitting the log data by size or by time;

a splitting unit for generating the log files that store the log data according to the preset splitting principle of the log data of the web server.
In an embodiment of the present invention, the reading module reading the log data of the web server in units of data blocks and writing it to the transfer queue includes:

step 1, pointing a file pointer at the log to be collected;

step 2, reading the log data in the current log file in units of data blocks from a specified offset;

step 3, reading characters one by one from the data block and putting them into a buffer;

step 4, extracting the characters in the buffer line by line and writing them to the transfer queue.
In an embodiment of the present invention, the reading module also includes a log anomaly detection module for performing log anomaly detection.
In an embodiment of the present invention, the reading module also includes:

a block-end judging unit for judging, by whether a new character can be read, whether the end of the data block has been reached; wherein

if it is determined that the end of the data block has been reached, step 2 is performed.
In an embodiment of the present invention, the reading module also includes:

a newline judging unit for judging whether the newly read character is a newline; wherein

if it is determined that the character read is a newline, the characters in the buffer are extracted and written to the transfer queue;

if it is determined that the character read is not a newline, step 3 is performed.
In an embodiment of the present invention, before step 2 is performed upon determining that the end of the data block has been reached, the log anomaly detection module performs log anomaly detection; if a log anomaly is determined, step 2 is performed after the pointer offset has been reset.
In an embodiment of the present invention, the reading module also includes:

a new-log judging unit for determining whether there is a newly added log file; wherein

if it is determined that there is no newly added log file, step 2 is performed;

if it is determined that there is a newly added log file, the newly added log file is designated as the next log file to be read after the current log file has been read through.
In an embodiment of the present invention, the reading module also includes:

a data-block judging unit for judging whether a data block has been read; wherein

if it is determined that a data block has been read, step 3 is performed;

if it is determined that no data block has been read, a wait of a preset specified time is carried out;

if it is determined that no data block has been read, log anomaly detection is performed by the log anomaly detection module after the wait; if a log anomaly is detected, the pointer offset is reset and then the step of determining whether there is a newly added log file is performed.
The present invention proposes an improved log collection scheme that combines characteristics such as high availability, high reliability and high timeliness, and improves operating efficiency. With this technique, log data acquisition efficiency and reliability can be improved, ensuring normal operation of the production system.
To make the above and other objects, features and advantages of the present invention more apparent, preferred embodiments are described in detail below in conjunction with the accompanying drawings.
Brief description of the drawings
To explain the embodiments of the present invention or the technical solutions of the prior art more clearly, the accompanying drawings required for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below show only some embodiments of the present invention; those of ordinary skill in the art can obtain other drawings from these drawings without creative work.
Fig. 1 is a flow chart of the log data acquisition method disclosed by the present invention;

Fig. 2 is a schematic diagram of the real-time log acquisition system of an embodiment of the present invention;

Fig. 3 is a flow chart of real-time log collection in an embodiment of the present invention;

Fig. 4 is a flow chart of the collection algorithm of the real-time log acquisition system in an embodiment of the present invention.
Detailed description of the embodiments
The technical solutions in the embodiments of the present invention are described clearly and completely below in conjunction with the accompanying drawings in the embodiments of the present invention. Obviously, the described embodiments are only some of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative work fall within the protection scope of the present invention.
The present invention provides a log data acquisition method; as shown in Fig. 1, the method includes:

step S101, dividing, in advance, the target area from which log data is to be collected into at least two collection regions, each collection region including a data center and at least one branch office, the log storage system being located at the data center of a first collection region among the at least two collection regions;

step S102, collecting the log data of the web servers in each branch office and data center, and transmitting the collected log data of each branch office to the data center of the collection region where that branch office is located;

step S103, storing the log data of each collection region to the log storage system through the first-level Flume receiver of the data center of the first collection region.
Wherein, collecting the log data of the web servers in each branch office and data center, and transmitting the collected log data of each branch office to the data center of the collection region where the branch office is located, includes:

transmitting the collected log data of the web servers of a branch office, through the Flume receiver located at the branch office, to the Flume receiver of the data center of the collection region where that branch office is located;

transmitting the collected log data of the web servers of a data center, through the second-level Flume receiver of that data center, to the first-level Flume receiver of the data center of the first collection region, to be stored to the log storage system;

transmitting the collected log data of the web servers of the branch offices of the first collection region, through the Flume receivers of those branch offices, to the first-level Flume receiver, to be stored to the log storage system;

transmitting the collected log data of the web servers of the branch offices of non-first collection regions, through the second-level Flume receivers of the corresponding data centers, to the first-level Flume receiver, to be stored to the log storage system.
In an embodiment of the present invention, the second-level Flume receivers of the data centers of the non-first collection regions access the first-level Flume receiver through a high-speed dedicated network line.
Fig. 2 is a schematic diagram of the real-time log acquisition system of an embodiment of the present invention. In this embodiment, multiple regions are first marked off according to the geographic position of collection: the target area from which logs are to be collected is preliminarily divided into region 1 and region 2, and a data center is set up in each. Data center 1 is responsible for collecting the log data of region 1, while data center 2 is responsible for collecting and storing the log data of both region 1 and region 2.
In this embodiment, a dedicated network line is set up between data center 1 and data center 2, through which the log data of data center 1 can be transferred to data center 2 at high speed. Further, when the collection operation is actually deployed, four classes of log collection regions are distinguished: data center 1, data center 2, branch office 1 and branch office 2. In the embodiment of the present invention, the difference between the branch offices and the data centers is that a branch office only collects the log data inside that institution, whereas a data center collects and aggregates the log data of the data center itself and of all branch offices in its region (if multiple branch offices are deployed).
In this embodiment, data center 2 is where the log storage system is located; on top of handling the log data of its own region, data center 2 further aggregates the log data of the other data centers. It should be noted that in the embodiment of the present invention the above division method is not absolute; the optimal division can be made, based on the idea of "divide and rule", according to the actual conditions of each collection case.
Inside each collection region, a small acquisition system is provided. The acquisition system consists of multiple collection clients and multiple Flume receivers and runs in a LAN intranet environment. A collection client runs on a Web application server in a non-intrusive way and is responsible for collecting the logs on the local server in real time; because it occupies few computing resources of the Web application server, it does not put pressure on the service applications of the server.
The present invention makes innovative designs for high availability, high reliability and high efficiency in client log collection. To keep the consumption of long-distance data transmission low, the invention introduces a high-speed network channel through which the log data of distant collection regions can be transmitted to the location of the log storage system. To relieve the file-write pressure and system overhead on the HDFS of the log storage system, the present invention designs a multi-layer Flume pattern in which data can only be written to HDFS step by step through the Flume receivers. With this technique, log data acquisition efficiency and reliability can be improved, ensuring normal operation of the production system.
In this embodiment, the Flume receivers run on Flume servers and are responsible for collecting and merging data in real time and sending it to downstream nodes; the technique used is the mature Flume collection technique. Here, a downstream node is the terminal to which a Flume receiver sends its log data; a downstream node may be another Flume receiver or the log storage system. The data of each small acquisition system is finally summarized into the log storage system of one of the acquisition systems for subsequent mining and processing; in the case of the present invention, that system is the HDFS file system of data center 2. To use server resources rationally, the above deployment strategy follows a simple, flexible principle: in both data centers and branch offices, the number of collection clients and Flume receivers is adjusted flexibly according to the size of the collection workload.
According to the position of the Flume receivers relative to the HDFS file system, Flume is deployed in multiple tiers. The first-level Flume receivers are responsible for receiving the log data from the second-level Flume receivers, aggregating it and writing it to the HDFS system; the second-level Flume receivers are responsible for receiving the log data of the third-level Flume receivers and forwarding it to the first-level Flume receivers; and so on. Here, the tier number only indicates how many Flume receivers (including the local one) the log data of that tier must pass through before it is finally stored to the HDFS file system; the functions of the tiers do not differ. For example, the data of a third-level Flume receiver must pass through three Flume receivers (including the local one) before it is finally stored to the HDFS file system. Direct data transfer between Flume receivers uses the AVRO protocol. Taking data center 2 in the figure as an example, two tiers of Flume receivers are deployed at this center: the second-level Flume receivers forward the logs collected by the collection clients to the first-level Flume receivers, and within the center the first-level Flume receivers are responsible for receiving the data of data center 1 and of the second-level Flume receivers of branch office 2 and writing it to HDFS. Inside data center 1 and branch office 2, only a single layer of second-level Flume receivers is deployed, responsible for forwarding the log data to the first-level receivers of data center 2, which then store it to HDFS. The difference is that branch office 2 is geographically close to data center 2 and its data transfer uses ordinary network lines, while the data of data center 1 obtains higher transfer performance through the dedicated network line between data centers 1 and 2. As for branch office 1, the data of its collection clients is first forwarded via third-level Flume receivers to the second-level Flume receivers of data center 1, and is finally stored into the HDFS of data center 2 via the high-speed network channel, together with the data of data center 1.
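The tiered Avro forwarding described above maps naturally onto standard Flume agent properties. The fragment below is only an illustrative sketch of a two-tier arrangement: the agent names, hostnames, ports, channel sizing and HDFS path are all invented for the example, and a real deployment would likely prefer file-backed channels and tuned HDFS sink options.

```properties
# Second-level agent (e.g. at data center 1): receive Avro, forward upstream
tier2.sources = in
tier2.channels = ch
tier2.sinks = up

tier2.sources.in.type = avro
tier2.sources.in.bind = 0.0.0.0
tier2.sources.in.port = 4141
tier2.sources.in.channels = ch

tier2.channels.ch.type = memory
tier2.channels.ch.capacity = 10000

tier2.sinks.up.type = avro
tier2.sinks.up.hostname = dc2-tier1.example.internal
tier2.sinks.up.port = 4141
tier2.sinks.up.channel = ch

# First-level agent at data center 2: receive Avro, write step by step to HDFS
tier1.sources = in
tier1.channels = ch
tier1.sinks = hdfs

tier1.sources.in.type = avro
tier1.sources.in.bind = 0.0.0.0
tier1.sources.in.port = 4141
tier1.sources.in.channels = ch

tier1.channels.ch.type = memory

tier1.sinks.hdfs.type = hdfs
tier1.sinks.hdfs.hdfs.path = hdfs://namenode/flume/logs/%Y%m%d
tier1.sinks.hdfs.hdfs.fileType = DataStream
tier1.sinks.hdfs.hdfs.useLocalTimeStamp = true
tier1.sinks.hdfs.channel = ch
```

A third tier would follow the same pattern: an Avro source paired with an Avro sink pointing at a second-level agent.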
It should be noted that the numbers and arrangement of the collection clients and Flume receivers in the above example are not drawn from actual performance figures; in an actual implementation the specific numbers of collection clients and Flume receivers may vary arbitrarily, and the arrangement may be considerably more complex.
In the real-time log acquisition system of the present invention, the log data stream passes in order through the collection clients and the third-level, second-level and first-level Flume receivers, and finally flows into the HDFS log storage system. In this embodiment, the collection clients and the Flume receivers share the same internal structure: each is divided into a reading module and a sending module. A further feature is that the two modules jointly manage a log data transfer queue, which is used to buffer log data; the reading module and the sending module perform read/write operations on the queue asynchronously, without interfering with each other's operation. Fig. 3 is a flow chart of real-time log collection in an embodiment of the present invention.
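The shared, asynchronously accessed transfer queue can be sketched with two threads and a standard FIFO queue. The sentinel-based shutdown and the stand-in `out.append` for the sending side are assumptions for the demonstration, not details from the patent.

```python
import queue
import threading

def run_pipeline(lines, out):
    """Sketch of the shared transfer queue: a reading thread enqueues log
    lines while a sending thread dequeues and 'sends' them (here it just
    appends to `out`); the two sides interact only through the queue, so
    neither blocks the other except when the queue is empty or full."""
    transfer_queue = queue.Queue()
    done = object()  # sentinel telling the sender to stop

    def reader():
        for line in lines:
            transfer_queue.put(line)
        transfer_queue.put(done)

    def sender():
        while True:
            item = transfer_queue.get()
            if item is done:
                return
            out.append(item)

    threads = [threading.Thread(target=reader), threading.Thread(target=sender)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
```

Because `queue.Queue` is thread-safe and FIFO, the sender receives the lines in exactly the order the reader produced them, which is what keeps downstream log analysis consistent.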
Step 101: The reading module of the log collection client points its file pointer at the log to be collected, dynamically reads and parses the log in units of data blocks, and puts the result into the transfer queue.
In this embodiment, the application program at the web server end manages the log data of the application system in units of log files, which are stored in a log collection directory. To prevent a single log file from growing too large to inspect, the capacity of each log file can be controlled by a specified size or time interval; the log collection directory therefore contains multiple log files, one of which is the current log file while the rest are log archive files. The application program presets a log splitting rule; splitting by size and splitting by time are currently supported. When splitting by size, each log file is guaranteed to be of the set size; when splitting by time, one log file is generated per minute. Once the amount of data in the current log file reaches the splitting threshold, writing to the current file stops, and a new log file is generated for subsequent writes. The application program turns the current log file into a historical archive file by appending a numeric suffix to its name, and the newly generated file takes over the name of the current log file.
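The size-based splitting and suffix renaming described above can be sketched as follows. The file name, suffix format, and threshold are illustrative assumptions; the patent does not specify them.

```python
import os
import tempfile

def rotate(log_dir, current_name="app.log", max_bytes=20):
    """If the current log reached the split threshold, archive it with a
    numeric suffix and start a fresh file under the same name."""
    path = os.path.join(log_dir, current_name)
    if os.path.getsize(path) < max_bytes:
        return False
    n = 1
    while os.path.exists(f"{path}.{n}"):  # next free numeric suffix
        n += 1
    os.rename(path, f"{path}.{n}")        # current file becomes archive
    open(path, "w").close()               # new file takes the same name
    return True

d = tempfile.mkdtemp()
p = os.path.join(d, "app.log")
with open(p, "w") as f:
    f.write("x" * 25)                     # exceeds the 20-byte threshold
rotated = rotate(d)
print(rotated, sorted(os.listdir(d)))     # True ['app.log', 'app.log.1']
```

The key property for the reading module downstream is that the "current" name always refers to the file still being written, while archives accumulate numbered suffixes.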
When collecting log data, the reading module reads newly generated log files one by one, block by block. Specifically, the module points its read pointer at the current log file and reads one segment of log content at a time, with the data block as the basic unit, parsing log lines out of the content that has been read. If a read finds that no new log data has been generated, the module points the pointer at the next read position, stops for a specified time, and waits for new log content to be generated. After the waiting period ends, the module first checks whether a new log file was generated during the wait. If so, the previously read log file will receive no further data; the module first reads the remaining content of the earlier file from the last pointer position, then points the pointer at the newly generated file and starts reading the new data. If not, the log file being read has not yet reached the switching threshold, and the module continues reading and parsing its content in units of data blocks. Furthermore, a log anomaly detection step is added to the collection process, so that problems such as a log being deleted or its content being overwritten can be handled by correcting the acquisition strategy in time, guaranteeing the correctness and completeness of the collected data. The detailed steps of the above acquisition algorithm are shown in Fig. 3.
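The block-by-block reading with a resumable offset can be sketched as follows; the block size and names are assumptions for illustration. An empty read corresponds to the "no new data, wait" branch above, where the caller would record the current offset and retry later.

```python
import io

BLOCK_SIZE = 8  # assumed fixed block size

def read_blocks(f, offset=0):
    """Yield fixed-size blocks starting at `offset`; stop when no new
    data is available (the caller would wait, then resume at f.tell())."""
    f.seek(offset)
    while True:
        block = f.read(BLOCK_SIZE)
        if not block:          # nothing new generated yet
            break
        yield block

log = io.StringIO("line one\nline two\n")
blocks = list(read_blocks(log))
resume_at = log.tell()         # offset to resume from after waiting
print(blocks, resume_at)       # ['line one', '\nline tw', 'o\n'] 18
```

Note that blocks do not align with line boundaries; the half lines at block edges are handled by the dynamic character array in steps 10105 to 10108.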
Step 102: The sending module of the log collection client reads data lines from the client's transfer queue and sends them to a downstream Flume server.
Each client is internally configured with a list of downstream Flume receiver addresses to which data can be sent, and the connections between the sending module and the Flume receivers use a load-balancing mechanism. After the collection client starts, the sending module randomly selects an address from this list and establishes a long-lived connection to the Flume receiver at that address. To improve transmission efficiency, the sending module reads several data lines from the transfer queue at a time; the number of lines is a preset value, and if fewer lines remain in the transfer queue, the remaining lines are read. The module packs the lines that were read into a data packet and sends it to the corresponding Flume receiver over the TCP protocol.
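A sketch of the selection-and-batching logic above. The addresses, packet format (JSON here), and batch size are assumptions; the patent only specifies random selection, a preset batch size, and TCP transmission.

```python
import json
import random

RECEIVERS = ["10.0.0.1:4141", "10.0.0.2:4141"]  # assumed address list
BATCH = 2                                        # assumed preset batch size

def make_packets(lines):
    """Pack up to BATCH lines per packet; a short final batch is allowed."""
    packets = []
    for i in range(0, len(lines), BATCH):
        packets.append(json.dumps(lines[i:i + BATCH]))
    return packets

random.seed(7)
endpoint = random.choice(RECEIVERS)  # load balancing by random selection
packets = make_packets(["a", "b", "c"])
print(len(packets))                  # 2 packets: ['a','b'] and ['c']
```

In a real deployment the packet would be written to the long-lived TCP connection; here packet construction is shown separately so the batching rule is visible.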
Step 103: Select the downstream destination according to the type of location of the Flume receiver server.

According to the different characteristics of the regions in which the Flume receiver servers are located, the Flume receivers of different regions are handled in different ways in the data transmission phase.

As shown in Fig. 3, if the Flume server is located in branch 1, the data is sent to data center 1 and the flow turns to step 104; if the Flume server is located in data center 1 or branch 2, the data is sent to data center 2 and the flow turns to step 105; if the Flume server is located in data center 2, the data need not be sent to any external region, and the flow turns to step 106.
Step 104: The tertiary Flume receivers send the data to the secondary Flume receivers located at data center 1.

In the system hierarchy, the Flume receivers located in branch 1 are called tertiary receivers. The higher the tier of a Flume receiver, the more Flume receivers its data must pass through, and the more transmission time is consumed. If the data were sent directly to data center 2 over ordinary lines, the timeliness of the data transfer would be hard to guarantee. To avoid high latency, the Flume receivers of branch 1 forward the data to data center 1, and the Flume receivers of data center 1 then carry out the next hop of the transfer over the high-speed network channel. Experimental tests have verified that the transmission time of this two-hop scheme is much smaller than that of a direct long-distance transmission.
Step 105: Data center 1 and branch 2 transmit data to the primary Flume receiver of data center 2 in different ways, over the high-speed channel and over the ordinary network respectively.

In the system hierarchy, the Flume receivers located in data center 1 and branch 2 are called secondary receivers. The data received by a secondary Flume receiver is transferred directly to the primary Flume receiver of data center 2. In this embodiment, because a high-speed network path has been established directly between data center 1 and data center 2, the Flume receivers of data center 1 can send data to data center 2 over this high-speed channel, whereas branch 2 transmits its data to data center 2 over ordinary network lines.
Step 106: The primary Flume receiver of data center 2 receives the data from inside and outside this center and finally uploads it to HDFS.

The primary Flume receiver of data center 2 receives not only the log data from the local secondary receivers but also the log data from the secondary receivers of branch 2 and data center 1. The present system aggregates the log data of each region through the primary Flume receiver, rather than having each region upload its log data to the HDFS log storage system independently. One reason for this design is that, while aggregating the data, the primary Flume receiver can merge log data belonging to related applications, reducing the number of final log files and thereby relieving the pressure that storing a large number of small files places on the HDFS file system. A second reason is that the system allows only the primary Flume receiver to access HDFS, which not only simplifies HDFS access management but also, from a network security standpoint, reduces the possibility of an external server damaging the HDFS system.
Fig. 4 is a flow chart of the acquisition algorithm of the real-time log acquisition system in this embodiment of the invention. After the collection client starts, it carries out log collection continuously unless it is forcibly interrupted from outside. Its reading module repeatedly applies the same acquisition algorithm, whose flow is as follows:

Step 10101: The file pointer points to the log to be collected, and the initial pointer offset is set to 0.

Under normal circumstances, log data is collected from the beginning of the file; the read pointer therefore initially points to the first character of the first line of the log file data.
Step 10102: Determine whether any new log file has been generated.

Each time before the reading module starts reading the data of a new log file from the beginning, or continues reading after waiting because no data could be read, the above determination must be made. If there is currently only one log to be collected, the flow turns to step 10103; if a new log file has been generated in addition to the log the pointer points to, the flow turns to step 10112.
Step 10103: The file pointer reads the current log file data in units of blocks from the specified offset.

The present invention reads data with the data block as its basic unit. A data block is a segment of log data with a fixed byte length; only the last block read at the end of the file may be shorter than the specified byte length. A data block contains several data lines, and the head and tail of the block may be incomplete half lines; the steps for extracting data lines are given in steps 10105 to 10108.
Step 10104: Determine whether a data block was read.

Because the web server application on the client side writes the log file as a dynamic process, the reading module cannot guarantee that every read returns a block of log data. Therefore, after each read operation completes, a check is made to see whether data was actually read. If data was read, the flow turns to step 10105; if no data was read, it turns to step 10110.
Step 10105: Read characters one by one from the data block and append them to the end of a dynamic character array.

After a data block is read, it is placed in the collection client's cache, which holds the data waiting to be parsed. A dedicated pointer scans the characters read into the cache: the read pointer reads characters starting from the first character of the cache, appends each character to the end of a dynamic character array, and then scans the next character. The dynamic character array holds the intermediate state of a data line; its length grows as characters are stored.

If partial data remains in the dynamic character array after the previous data block has been read, the array retains the remaining data that was not yet extracted when the current block is parsed; the newly scanned characters are appended to the end of the dynamic character array, and extraction is performed again once a full line of data has been parsed.
Step 10106: Determine whether the end of the data block has been reached.

Because the pointer's scan of the block cache is a cyclic read operation, a check is made after each read to see whether a new character was read. If a character was read, the block has not yet been fully consumed, and the flow turns to step 10107. If no new character can be read, that is, the end of the data block has been reached, the current block has been fully read, a new data block must be fetched, and the flow turns to step 10109.
Step 10107: Determine whether the newly read character satisfies the data-line extraction condition.

If the most recently read character is the newline character '\n', the dynamic character array satisfies the data-line extraction condition, and the flow turns to step 10108; if the character is an ordinary character other than a newline, the next character is read and the flow turns to step 10105.
Step 10108: Parse and extract one data line, and write the data line into the client's transfer queue.

The data stored in the dynamic character array now satisfies the extraction condition for a data line; the data line is copied into the transfer queue, and the content of the dynamic character array is then emptied.
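The line-extraction loop of steps 10105 to 10108 can be sketched as follows; the function and variable names are illustrative. A newline triggers extraction of one complete line, and a trailing half line is carried over into the next block, as described above.

```python
def extract_lines(blocks, carry=""):
    """Append characters from each block to a dynamic buffer; emit a
    line on each newline; return (lines, unfinished remainder)."""
    buf = list(carry)             # the "dynamic character array"
    lines = []
    for block in blocks:
        for ch in block:
            buf.append(ch)
            if ch == "\n":        # extraction condition met
                lines.append("".join(buf[:-1]))
                buf = []          # empty the array after extraction
    return lines, "".join(buf)    # remainder spans into the next block

lines, rest = extract_lines(["alpha\nbe", "ta\ngam"])
print(lines, repr(rest))          # ['alpha', 'beta'] 'gam'
```

This is why the method never emits half lines: a line only leaves the buffer once its terminating newline has been seen, regardless of where the block boundaries fall.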
Step 10109: Perform log anomaly detection; reset the pointer offset if an anomaly is found.

Before the next data block is read, log anomaly detection must be performed. This is because the web server application on the client side may restart or shut down abnormally while running, in which case the content of the corresponding log file changes; for example, the original log file may be overwritten from the beginning by new log data. The flow of log anomaly detection is as follows. First, the reading module updates the offset for the next log read: the new offset equals this read's offset plus the byte length of the data block that was read. Second, the current log file size is compared with this offset; for clarity, let the new offset be value A and the current log file size be value B. If A is greater than B, the log file is judged to have been overwritten: the log data must be read from the beginning again, and the pointer offset for the next read of the log file is reset to 0. If A is less than B, the number of characters stored in the dynamic character array (value C) must additionally be obtained. The read pointer of the log file is moved to offset A − C (the original pointer position); the character at that position should be the start of a line, so the pointer moves back one character position, and the character there is scanned and checked. If that character is not the newline '\n', an overwrite anomaly has occurred, and the pointer offset for the next log read should be reset to 0. If the character is the newline '\n', no anomaly has occurred, and the pointer offset for the next file read is value A.
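The offset arithmetic of step 10109 can be sketched as follows, with A, B and C as defined above; the function name and byte-oriented representation are assumptions for the sketch. A > B means the file shrank (overwritten from the beginning); otherwise the byte just before position A − C must be the newline that ended the last complete line.

```python
def detect_anomaly(data: bytes, a: int, c: int) -> bool:
    """A = next read offset, C = characters buffered but not yet
    extracted; `data` stands in for the current log file content."""
    b = len(data)                       # B: current log file size
    if a > b:
        return True                     # file shrank: overwritten
    if a - c - 1 >= 0 and data[a - c - 1:a - c] != b"\n":
        return True                     # no newline where one must be
    return False

log = b"first line\nsecon"              # 16 bytes; "secon" (C=5) buffered
print(detect_anomaly(log, a=16, c=5))   # False: byte 10 is '\n'
print(detect_anomaly(log, a=20, c=5))   # True: offset beyond file size
```

In the normal case the check is cheap (one stat and one one-byte read), so running it before every block keeps the overwrite window small without slowing collection.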
Step 10110: No data could be read; wait for the specified time.

If the web server end writes no log data to the file for a period of time, the pointer stands at the end of the file and no data content can be read. Since the collection client cannot predict when new data will become readable, it waits for a period of time and then attempts the read again.
Step 10111: Perform log anomaly detection; reset the pointer offset if an anomaly is found.

Step 10111 is similar to step 10109 and is not described again. The difference is that, if no log anomaly is detected, the offset for the next log read in step 10109 is this read's offset increased by the byte length of the data block that was read (value D), that is, A + D, whereas the offset for the next log read in step 10111 remains the offset of this read, that is, A.
Furthermore, for clients deployed on the Linux operating system, step 10111 adds one more check than step 10109: monitoring whether the log has been deleted. Specifically, if the acquisition module still obtains no data block after reading a preset number of times, file-deletion detection is triggered. During detection, it is first determined whether a file with the specified file name exists. If no such file exists, the current file is judged to have been deleted and no new file has yet been generated; the module then waits for a new file to be generated. If a file with the specified name does exist, the "device ID + inode number" string of the file with that name is obtained with the existing stat command and compared with the corresponding string of the opened file. If the two differ, the original file has been deleted and a new file has been generated; the file read pointer is then pointed at the new file, and the offset for the next log read is set to 0.
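A sketch of the identity check above (helper names are illustrative): on Linux, the pair "device ID + inode number" identifies a file, so if the name now resolves to a different identity than the file that was opened, the original log was deleted and replaced. Keeping the original file open in the sketch guarantees its inode is not reused, making the outcome deterministic.

```python
import os
import tempfile

def identity(path):
    """The 'device id + inode number' pair that `stat` reports."""
    st = os.stat(path)
    return (st.st_dev, st.st_ino)

d = tempfile.mkdtemp()
path = os.path.join(d, "app.log")
f = open(path, "w")                      # the file we are reading
opened = identity(path)                  # identity when we opened it

os.remove(path)                          # log deleted by name...
g = open(path, "w")                      # ...and regenerated, same name

replaced = identity(path) != opened      # old inode still held by f
print(replaced)                          # True: point at new file, offset 0
f.close(); g.close()
```

This mirrors the familiar "file was rotated under us" check used by log tailers: comparing identities rather than names is what distinguishes a regenerated file from continued writes to the original.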
Step 10112: From the specified offset, read the remaining data of the current log through to the end in units of blocks in a loop, then designate the newly generated log as the next file to be processed.

It should be noted that, in this embodiment, when step 10112 is entered, the system first finishes reading the data of the log the pointer currently points to, repeatedly reading data blocks and parsing data lines; the process is similar to steps 10103 to 10108 above. The difference is that, because a new log has been generated, the current log must already be fully written, so step 10104 need not be performed, that is, it is no longer necessary to check whether a data block was read on each read. The pointer is then pointed at the newly generated log file and data reading proceeds.
To improve the timeliness of log collection, the present invention dynamically collects log data while the log file is being generated, reading one data block at a time, parsing it line by line, and transmitting log data continuously. To ensure the completeness of data acquisition, the invention monitors anomalies in real time during collection; as soon as a log is found to have been deleted or its content overwritten, the acquisition strategy is adjusted in time to avoid reading erroneous log data.
Meanwhile invention additionally discloses a kind of log data acquisition system, including:
Region division device, the target area of daily record data to be collected is divided at least two pickup areas in advance, respectively Pickup area includes:Data center and at least one branch, daily record storage system are located at least two acquisition zone The data center of first pickup area in one of domain;
Log data acquisition device, each branch, data center are arranged at, for gathering in each branch and data The daily record data of web server in the heart, and the daily record data of each branch of collection is transferred to where the branch The data center of pickup area;
Flume Primary Receives end, the data center of first pickup area is arranged at, for by each acquisition zone of collection The daily record data in domain is stored to daily record storage system.
In this embodiment of the invention, the log data acquisition device comprises:

collection clients, arranged at each branch and data center, which collect the log data of the web servers in each branch and data center;

Flume receivers, arranged at each branch and data center; wherein,

the collected log data of the web servers of a data center is transmitted through the secondary Flume receiver of that data center to the primary Flume receiver for storage in the log storage system;

the collected log data of the web servers of a branch of the first collection region is transmitted through the Flume receivers of that branch to the primary Flume receiver for storage in the log storage system;

the collected log data of the web servers of a branch of a non-first collection region is transmitted through the Flume receivers of that branch to the secondary Flume receiver of the corresponding data center, and is transmitted by the secondary Flume receiver of that data center to the primary Flume receiver for storage in the log storage system.

In this embodiment of the invention, the secondary Flume receiver of the data center of a non-first collection region accesses the primary Flume receiver through a dedicated high-speed network line.
In its overall deployment strategy, the real-time log acquisition system of the present invention uses a multi-tier Flume server mode: several Flume servers inside each collection region relay the collected logs, and several Flume servers in the collection region where HDFS is located aggregate the logs of all collection regions and write them to HDFS. This design pattern, first, avoids the system pressure that a large number of Flume servers reading from and writing to HDFS directly would cause, and likewise avoids the NameNode performance degradation that storing a large number of small files causes in HDFS; second, by decoupling the two expensive I/O operations of long-distance data transmission and writing to HDFS, it avoids the large communication delay that would arise if data center 1 sent its data directly to the HDFS of data center 2; third, it facilitates unified management of HDFS access while, from a network security standpoint, reducing the number of addresses on the firewall whitelist.

As for the design of the log acquisition algorithm inside the collection client, the acquisition method of the present invention not only collects logs line by line, improving the timeliness of log collection, but also shields the acquisition system from anomalies such as restarts and shutdowns of the web server application. The acquisition system continues to operate normally after an anomaly occurs and collects log content accurately; compared with the acquisition method bundled with Flume, it effectively prevents half lines of data from being read.
Those skilled in the art should understand that embodiments of the present invention may be provided as a method, a system, or a computer program product. Therefore, the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware. Moreover, the present invention may take the form of a computer program product implemented on one or more computer-usable storage media (including but not limited to disk storage, CD-ROM, optical storage, and the like) containing computer-usable program code.

The present invention is described with reference to flow charts and/or block diagrams of methods, devices (systems) and computer program products according to embodiments of the invention. It should be understood that each flow and/or block in the flow charts and/or block diagrams, and combinations of flows and/or blocks therein, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general-purpose computer, a special-purpose computer, an embedded processor, or another programmable data processing device to produce a machine, so that the instructions executed by the processor of the computer or other programmable data processing device produce a device for realizing the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.

These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing device to work in a particular manner, so that the instructions stored in the computer-readable memory produce an article of manufacture including an instruction device that realizes the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, so that a series of operation steps are performed on the computer or other programmable device to produce computer-implemented processing, whereby the instructions executed on the computer or other programmable device provide steps for realizing the functions specified in one or more flows of the flow charts and/or one or more blocks of the block diagrams.
The principles and embodiments of the present invention have been set forth herein through specific examples. The above description of the embodiments is intended only to help in understanding the method of the present invention and its core idea. Meanwhile, for those of ordinary skill in the art, there will be changes in the specific embodiments and the scope of application according to the idea of the present invention. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (26)

1. A log data acquisition method, characterized in that the method comprises:
dividing a target area of log data to be collected into at least two collection regions in advance, each collection region comprising a data center and at least one branch, a log storage system being located at the data center of a first collection region among the at least two collection regions;
collecting the log data of web servers in each branch and data center, and transferring the collected log data of each branch to the data center of the collection region where the branch is located;
storing the log data of each collection region into the log storage system through a primary Flume receiver of the data center of the first collection region.
2. The log data acquisition method according to claim 1, characterized in that said collecting the log data of the web servers in each branch and data center and transferring the collected log data of each branch to the data center of the collection region where the branch is located comprises:
transferring the collected log data of the web servers of a branch, through a Flume receiver located at the branch, to a Flume receiver of the data center of the collection region where the branch is located.
3. The log data acquisition method according to claim 2, characterized in that said storing the log data of each collection region into the log storage system through the primary Flume receiver of the data center of the first collection region comprises:
transmitting the collected log data of the web servers of a data center, through a secondary Flume receiver of the data center, to the primary Flume receiver of the data center of the first collection region for storage in the log storage system;
transmitting the collected log data of the web servers of a branch of the first collection region, through the Flume receiver of the branch, to the primary Flume receiver for storage in the log storage system;
transmitting the collected log data of the web servers of a branch of a non-first collection region, through the secondary Flume receiver of the corresponding data center, to the primary Flume receiver for storage in the log storage system.
4. The log data acquisition method according to claim 3, characterized in that the secondary Flume receiver of the data center of a non-first collection region accesses the primary Flume receiver through a dedicated high-speed network line.
5. The log data acquisition method according to claim 1 or 4, characterized in that said collecting the log data of the web servers in each branch and data center and transferring the collected log data of each branch to the data center of the collection region where the branch is located comprises:
reading the log data of a web server in units of data blocks and writing it into a transfer queue;
sending the log data in the transfer queue to a Flume receiver;
determining the downstream destination of the log data according to the type of location of the Flume receiver.
6. The log data acquisition method according to claim 5, characterized in that the method comprises:
presetting a splitting rule for the log data of the web server, said splitting rule comprising: splitting the log data by size or by time;
generating log files storing the log data according to the set splitting rule for the log data of the web server.
7. The log data acquisition method according to claim 6, characterized in that said reading the log data of the web server in units of data blocks and writing it into the transfer queue comprises:
step 1: pointing a file pointer at a log to be collected;
step 2: reading the log data in the current log file in units of data blocks from a specified offset;
step 3: reading characters one by one from a data block and putting them into a cache;
step 4: extracting the characters in the cache by lines and writing them into the transfer queue.
8. The log data acquisition method according to claim 7, characterized in that said reading characters one by one from the data block and putting them into the cache comprises:
determining whether a new character can be read, so as to determine whether the tail of the data block has been reached;
if it is determined that the tail of the data block has been reached, performing step 2.
9. The log data acquisition method according to claim 8, characterized in that said extracting the characters in the cache by lines and writing them into the transfer queue comprises:
determining whether the newly read character is a newline;
if it is determined that the read character is a newline, extracting the characters in the cache and writing them into the transfer queue;
if it is determined that the read character is not a newline, performing step 3.
10. The log data acquisition method according to claim 8 or 9, characterized in that, before performing step 2 upon determining that the tail of the data block has been reached, log anomaly detection is further performed,
and, if a log anomaly is determined, step 2 is performed after the pointer offset is reset.
11. The log data acquisition method according to claim 7, characterized in that, before performing step 2, the method further comprises: determining whether there is a newly generated log file; wherein,
if it is determined that there is no newly generated log file, step 2 is performed;
if it is determined that there is a newly generated log file, the newly generated log file is designated as the next log file to be read after the current log file has been fully read.
12. The log data acquisition method according to claim 11, characterized in that, before step 3 is performed, the method comprises:
determining whether a data block has been read;
if it is determined that a data block has been read, performing step 3;
if it is determined that no data block has been read, waiting for a preset specified time.
13. The log data acquisition method according to claim 12, characterized in that, upon determining that no data block has been read, after waiting for the preset specified time, log anomaly detection is performed; if a log anomaly is found, the pointer offset is reset and then the determination of whether there is a newly generated log file is performed.
14. A log data acquisition system, characterized in that the system comprises:
a region division device, for dividing a target area of log data to be collected into at least two collection regions, each collection region comprising a data center and at least one branch, a log storage system being located at the data center of a first collection region among the at least two collection regions;
a log data acquisition device, for collecting the log data of web servers in each branch and data center and transferring the collected log data of each branch to the data center of the collection region where the branch is located;
a primary Flume receiver, arranged at the data center of the first collection region, for storing the log data of each collection region into the log storage system.
15. The log data acquisition system of claim 14, wherein the log data acquisition device comprises:
an acquisition client, arranged at each branch and data center, configured to collect the log data of the web servers in each branch and data center;
Flume receiving ends, arranged at each branch and data center; wherein,
the collected log data of the web servers of a data center is transmitted by the second-level Flume receiving end of that data center to the first-level Flume receiving end for storage in the log storage system;
the collected log data of the web servers of a branch in the first collection region is transmitted by the Flume receiving end of that branch to the first-level Flume receiving end for storage in the log storage system;
the collected log data of the web servers of a branch in a non-first collection region is transmitted by the Flume receiving end of that branch to the second-level Flume receiving end of the corresponding data center, and is then transmitted by that second-level Flume receiving end to the first-level Flume receiving end for storage in the log storage system.
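The tiered transmission paths of claim 15 can be summarized in a short sketch. This is an illustrative model only; the `Node` type, the region numbering, and the receiving-end labels are assumptions for illustration, not part of the patent.

```python
# Illustrative sketch of the tiered Flume routing described in claim 15.
# The Node type and the labels below are assumptions, not from the patent.
from dataclasses import dataclass

@dataclass
class Node:
    kind: str     # "branch" or "data_center"
    region: int   # 1 = first collection region (hosts the log storage system)

def flume_path(node: Node) -> list:
    """Return the chain of Flume receiving ends a node's log data traverses."""
    if node.kind == "data_center":
        # Data-center logs go via that center's second-level receiving end.
        return ["second_level", "first_level"]
    if node.region == 1:
        # Branches of the first collection region send directly to the
        # first-level receiving end co-located with the log storage system.
        return ["branch_end", "first_level"]
    # Branches of other regions relay through their region's second-level end.
    return ["branch_end", "second_level", "first_level"]
```

The longest chain is three hops: branch receiving end, regional second-level end, then the first-level end at the first collection region's data center.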
16. The log data acquisition system of claim 15, wherein the second-level Flume receiving end of the data center of a non-first collection region accesses the first-level Flume receiving end through a dedicated high-speed network line.
17. The log data acquisition system of claim 14 or 16, wherein the acquisition device comprises:
a read module, configured to read the log data of a web server in units of data blocks and to write it into a transfer queue;
a transit module, configured to send the log data in the transfer queue to a Flume receiving end;
wherein the Flume receiving end determines the downstream destination of the log data according to its location type.
18. The log data acquisition system of claim 17, wherein the read module comprises:
a principle presetting unit, configured to preset a segmentation principle for the log data of the web server, the segmentation principle comprising: splitting the log data by size or by time;
a splitting unit, configured to generate, according to the preset segmentation principle for the log data of the web server, the log files that store the log data.
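Claim 18's split-by-size-or-time principle amounts to a simple rollover test. A minimal sketch follows; the 64 MB and one-hour thresholds are example values I have assumed, not figures from the patent.

```python
# Minimal sketch of the size-or-time log segmentation principle (claim 18).
# MAX_BYTES and MAX_AGE_S are illustrative values, not from the patent.
MAX_BYTES = 64 * 1024 * 1024   # split when the current log file reaches 64 MB
MAX_AGE_S = 3600               # or when it has been open for one hour

def should_split(current_size, opened_at, now):
    """True when a new log file should be generated for incoming log data."""
    return current_size >= MAX_BYTES or (now - opened_at) >= MAX_AGE_S
```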
19. The log data acquisition system of claim 18, wherein the read module reading the log data of the web server in units of data blocks and writing it into the transfer queue comprises:
step 1, pointing a file pointer at the log file to be collected;
step 2, reading the log data in the current log file in units of data blocks, starting from the specified offset;
step 3, reading characters one by one from the data block and placing them into a cache;
step 4, extracting the characters in the cache line by line and writing them into the transfer queue.
20. The log data acquisition system of claim 19, wherein the read module further comprises: a log anomaly detection module, configured to perform log anomaly detection.
21. The log data acquisition system of claim 20, wherein the read module further comprises:
a block-tail judging unit, configured to judge whether a new character can be read, so as to determine whether the end of the data block has been reached;
upon determining that the end of the data block has been reached, step 2 is performed.
22. The log data acquisition system of claim 21, wherein the read module further comprises:
a newline judging unit, configured to judge whether a newly read character is a newline;
upon determining that the character read is a newline, the characters in the cache are extracted and written into the transfer queue;
upon determining that the character read is not a newline, step 3 is performed.
23. The log data acquisition system of claim 21 or 22, wherein, before step 2 is performed upon determining that the end of the data block has been reached, the log anomaly detection module performs log anomaly detection; upon determining a log anomaly, the pointer offset is reset before step 2 is performed.
24. The log data acquisition system of claim 20, wherein the read module further comprises:
a new-log judging unit, configured to determine whether there is a newly added log file; wherein,
upon determining that there is no newly added log file, step 2 is performed;
upon determining that there is a newly added log file, the newly added log file is designated as the next log file to be read after the current log file has been fully read.
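Claim 24's handling of newly added log files can be sketched as a small tracker that queues new files and advances only when the current file is fully read. The class and method names are hypothetical, chosen for illustration only:

```python
# Hypothetical sketch of claim 24: a newly added log file is queued and
# becomes the next file to read only once the current file is fully read.
# LogFileTracker and its method names are illustrative, not from the patent.
from collections import deque

class LogFileTracker:
    def __init__(self, current):
        self.current = current
        self.pending = deque()            # newly added log files, oldest first

    def on_new_log_file(self, path):
        self.pending.append(path)         # designate as a later read target

    def current_file_done(self):
        """Advance once the current log file has been fully read."""
        if self.pending:
            self.current = self.pending.popleft()
        return self.current
```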
25. The log data acquisition system of claim 24, wherein the read module further comprises:
a data block judging unit, configured to judge whether a data block has been read;
upon determining that a data block has been read, step 3 is performed;
upon determining that no data block has been read, a preset specified time is waited.
26. The log data acquisition system of claim 25, wherein, upon determining that no data block has been read, after waiting the preset specified time, log anomaly detection is performed by the log anomaly detection module; if a log anomaly is found, the pointer offset is reset and the step of determining whether there is a newly added log file is performed.
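One common anomaly in this waiting path is the log file shrinking below the saved offset (for example, truncation during rotation), which is a natural trigger for resetting the pointer offset. Modeling the "log anomaly" this way is an assumption for illustration; the patent does not define the anomaly condition:

```python
# Sketch of the claim 26 flow: after waiting with no data block read,
# run the log anomaly check and reset the pointer offset on an anomaly.
# Treating "anomaly" as file truncation is an illustrative assumption.
import os

def checked_offset(path, offset):
    """Return the offset to read from after the log anomaly check."""
    if os.path.getsize(path) < offset:   # file shrank under our pointer
        return 0                         # anomaly: reset the pointer offset
    return offset                        # normal: keep reading from here
```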
CN201710564475.7A 2017-07-12 2017-07-12 Log data acquisition method and system Active CN107341258B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710564475.7A CN107341258B (en) 2017-07-12 2017-07-12 Log data acquisition method and system


Publications (2)

Publication Number Publication Date
CN107341258A true CN107341258A (en) 2017-11-10
CN107341258B CN107341258B (en) 2020-03-13

Family

ID=60218597

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710564475.7A Active CN107341258B (en) 2017-07-12 2017-07-12 Log data acquisition method and system

Country Status (1)

Country Link
CN (1) CN107341258B (en)

Cited By (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107943647A (en) * 2017-11-21 2018-04-20 北京小度互娱科技有限公司 A kind of reliable distributed information log collection method and system
CN108241744A (en) * 2018-01-04 2018-07-03 北京奇艺世纪科技有限公司 A kind of log read method and apparatus
CN108563527A (en) * 2018-03-21 2018-09-21 四川斐讯信息技术有限公司 A kind of detection method and system of data processing situation
CN109960622A (en) * 2017-12-22 2019-07-02 南京欣网互联网络科技有限公司 A kind of method of data capture based on big data visual control platform
WO2019134341A1 (en) * 2018-01-05 2019-07-11 平安科技(深圳)有限公司 Log text processing method and apparatus, and storage medium
CN110162448A (en) * 2018-02-13 2019-08-23 北京京东尚科信息技术有限公司 The method and apparatus of log collection
CN111427903A (en) * 2020-03-27 2020-07-17 四川虹美智能科技有限公司 Log information acquisition method and device
CN111464558A (en) * 2020-04-20 2020-07-28 公安部交通管理科学研究所 Data acquisition and transmission method for traffic safety comprehensive service management platform
CN111880440A (en) * 2020-07-31 2020-11-03 仲刚 Serial link data acquisition method and system
CN112396429A (en) * 2020-11-09 2021-02-23 中国南方电网有限责任公司 Statistical analysis system for enterprise operation business
CN115225471A (en) * 2022-07-15 2022-10-21 中国工商银行股份有限公司 Log analysis method and device
CN116366308A (en) * 2023-03-10 2023-06-30 广东堡塔安全技术有限公司 Cloud computing-based server security monitoring system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8234256B2 (en) * 2003-11-26 2012-07-31 Loglogic, Inc. System and method for parsing, summarizing and reporting log data
CN105868075A (en) * 2016-03-31 2016-08-17 浪潮通信信息系统有限公司 System and method for monitoring and analyzing great deal of logs in real time
CN106250496A (en) * 2016-08-02 2016-12-21 北京集奥聚合科技有限公司 A kind of method and system of the data collection in journal file
CN106569936A (en) * 2016-09-26 2017-04-19 深圳盒子支付信息技术有限公司 Method and system for acquiring scrolling log in real time


Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
YZGYJYW: "Flume日志采集多级Agent" ("Flume log collection with multi-level Agents"), CSDN Blog *

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107943647A (en) * 2017-11-21 2018-04-20 北京小度互娱科技有限公司 A kind of reliable distributed information log collection method and system
CN109960622A (en) * 2017-12-22 2019-07-02 南京欣网互联网络科技有限公司 A kind of method of data capture based on big data visual control platform
CN108241744A (en) * 2018-01-04 2018-07-03 北京奇艺世纪科技有限公司 A kind of log read method and apparatus
WO2019134341A1 (en) * 2018-01-05 2019-07-11 平安科技(深圳)有限公司 Log text processing method and apparatus, and storage medium
CN110162448A (en) * 2018-02-13 2019-08-23 北京京东尚科信息技术有限公司 The method and apparatus of log collection
CN108563527A (en) * 2018-03-21 2018-09-21 四川斐讯信息技术有限公司 A kind of detection method and system of data processing situation
CN111427903A (en) * 2020-03-27 2020-07-17 四川虹美智能科技有限公司 Log information acquisition method and device
CN111427903B (en) * 2020-03-27 2023-04-21 四川虹美智能科技有限公司 Log information acquisition method and device
CN111464558B (en) * 2020-04-20 2022-03-01 公安部交通管理科学研究所 Data acquisition and transmission method for traffic safety comprehensive service management platform
CN111464558A (en) * 2020-04-20 2020-07-28 公安部交通管理科学研究所 Data acquisition and transmission method for traffic safety comprehensive service management platform
CN111880440A (en) * 2020-07-31 2020-11-03 仲刚 Serial link data acquisition method and system
CN111880440B (en) * 2020-07-31 2021-08-03 仲刚 Serial link data acquisition method and system
CN112396429A (en) * 2020-11-09 2021-02-23 中国南方电网有限责任公司 Statistical analysis system for enterprise operation business
CN115225471A (en) * 2022-07-15 2022-10-21 中国工商银行股份有限公司 Log analysis method and device
CN116366308A (en) * 2023-03-10 2023-06-30 广东堡塔安全技术有限公司 Cloud computing-based server security monitoring system
CN116366308B (en) * 2023-03-10 2023-11-03 广东堡塔安全技术有限公司 Cloud computing-based server security monitoring system

Also Published As

Publication number Publication date
CN107341258B (en) 2020-03-13

Similar Documents

Publication Publication Date Title
CN107341258A (en) A kind of log data acquisition method and system
US11194552B1 (en) Assisted visual programming for iterative message processing system
US11113353B1 (en) Visual programming for iterative message processing system
US11474673B1 (en) Handling modifications in programming of an iterative message processing system
US11036591B2 (en) Restoring partitioned database tables from backup
CN111917864B (en) Service verification method and device
CN105138615B (en) A kind of method and system constructing big data distributed information log
US9639589B1 (en) Chained replication techniques for large-scale data streams
CN104243425B (en) A kind of method, apparatus and system carrying out Content Management in content distributing network
US8290994B2 (en) Obtaining file system view in block-level data storage systems
CN107704196A (en) Block chain data-storage system and method
CN103970788A (en) Webpage-crawling-based crawler technology
CN109964216A (en) Identify unknown data object
CN106953758A (en) A kind of dynamic allocation management method and system based on Nginx servers
CN104011701A (en) Content delivery network
US20220179991A1 (en) Automated log/event-message masking in a distributed log-analytics system
CN111459986B (en) Data computing system and method
CN107710215A (en) The method and apparatus of mobile computing device safety in test facilities
CN110784498B (en) Personalized data disaster tolerance method and device
CN103020257A (en) Implementation method and device for data operation
US20240106893A1 (en) Filecoin cluster data transmission method and system based on remote direct memory access
US10031948B1 (en) Idempotence service
CN108664914A (en) Face retrieval method, apparatus and server
CN113254320A (en) Method and device for recording user webpage operation behaviors
CN108268497A (en) The method of data synchronization and device of relevant database

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant