CN106599244B

CN106599244B - General original log cleaning device and method

Info

Publication number: CN106599244B
Application number: CN201611183585.0A
Authority: CN
Inventors: 张亚军; 田文宝; 夏鹏
Original assignee: Feihu Information Technology Tianjin Co Ltd
Current assignee: Feihu Information Technology Tianjin Co Ltd
Priority date: 2016-12-20
Filing date: 2016-12-20
Publication date: 2024-01-05
Anticipated expiration: 2036-12-20
Also published as: CN106599244A

Abstract

The invention discloses a general original log cleaning device comprising: a variable storage module for storing the metadata corresponding to each type of log, together with the regular expressions and matched fields corresponding to each set of metadata; a configuration module; and a cleaning module for identifying the corresponding metadata according to the log type, completing the cleaning logic with a MapReduce program according to the task configuration, and storing the result in the preset manner. The invention is managed through metadata: a set of metadata is established for each type of log, the logs and variables are stored and configured accordingly, and this information is configured in a management backend. The regular expressions are used to select the logs that satisfy the rules, extract the important parameters, and finally map them to the variables in the variable storage.

Description

General original log cleaning device and method

Technical Field

The invention relates to the technical field of big data processing, in particular to a general original log cleaning device and method.

Background

When logs are analyzed, the log data is unorganized; in other words, not all of the data in a log is needed. The data must therefore be cleaned, i.e. the character strings in it are filtered and structured.

Large internet companies have many kinds of logs, all of which need to be cleaned. The volume is huge, with logs occupying several terabytes of storage per day, and this raises two problems. First, there are many types of log and each must be cleaned; handling each one with separate, dedicated processing wastes time. Second, the large volume of logs occupies a large amount of storage, and reading the logs again consumes a large amount of network I/O.

Disclosure of Invention

The invention aims to overcome the technical shortcomings of the prior art by providing a general original log cleaning device and method in which the cleaning of different logs is completed through flexible custom configuration.

The technical scheme adopted for realizing the purpose of the invention is as follows:

A general original log cleaning device comprises:

a variable storage module for storing the metadata corresponding to each type of log, together with the regular expressions and matched fields corresponding to the metadata;

a configuration module for configuring a plurality of cleaning tasks and, for each cleaning task, the storage path, storage format and compression format of the log before and after cleaning, wherein the cleaning tasks correspond one-to-one with the metadata;

and a cleaning module for identifying the corresponding metadata according to the log type, completing the cleaning logic with a MapReduce program according to the task configuration, and storing the result in the preset manner.

The configuration is stored in ZooKeeper.

A general original log cleaning method, comprising,

establishing metadata corresponding to each type of log, and storing regular expressions and matched fields corresponding to the metadata;

configuring and storing a plurality of cleaning tasks corresponding to the metadata one by one and storage paths, storage formats and compression formats corresponding to each cleaning task;

and identifying the corresponding metadata according to the log type, completing the cleaning step with a MapReduce program according to the cleaning task configuration, and storing the result in the preset manner.

The configuration is stored in ZooKeeper.

In the cleaning step, the MapReduce program automatically determines the number of reduce tasks according to the size of the input data.

The data to be cleaned is stored in an HDFS directory.

Compared with the prior art, the invention has the beneficial effects that:

The invention is managed through metadata: a set of metadata is established for each type of log, the logs and variables are stored and configured accordingly, and this information is configured in a management backend. The regular expressions are used to select the logs that satisfy the rules, extract the important parameters, and finally map them to the variables in the variable storage. In addition, a MapReduce program is used: the number of reduce tasks required is calculated from the size of the original log file, and the cleaning logic is driven by the variable storage and the configuration, which completes the cleaning flow.

Drawings

Fig. 1 is a flow chart of a general original log cleaning method according to the present invention.

Detailed Description

The invention is described in further detail below with reference to the drawings and the specific examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.

To reduce the size of data files, the most widely used compression formats at present are LZO and Snappy. Hadoop is a big data platform for distributed storage and distributed computation. On this platform, irregular logs are structured by a MapReduce program and then stored in HDFS for later use in a custom storage format and compression format. This overcomes a shortcoming of the prior art, in which logs synchronized for different business requirements each receive separate processing and the code repetition rate is high, and it reduces the workload of developers.

The general original log cleaning device comprises a variable storage module, a configuration module and a cleaning module, wherein,

the regular expressions are stored in the variable storage module, separately from the variables. A regular expression serves to capture the required fields, so its correctness must be ensured, for example:

^([0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3})--\\[(.*)\\]\"GET/hdpb.gif\\?(.*)HTTP.*\"[0-9]{3}[0-9]{1,5}\"(.*)\"$"

The parenthesized groups mark the fields to be extracted, and these fields are classified, for example as an IP field, a time field, a parameter field and a UA (user agent) field. Multiple sets of metadata are defined according to the type of data to be cleaned and the cleaning target, and the appropriate metadata is selected according to business requirements. Using metadata simplifies the model of the data to be cleaned and allows similar or near-identical cleaning jobs to be configured quickly.
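The group-to-field mapping described above can be sketched in Python as follows. This is only an illustration under stated assumptions: the whitespace between log fields is assumed (the published pattern shows none), the dots in the IP address are escaped so that they match literally, and the field names ip, time, params and ua as well as the sample log line are hypothetical.

```python
import re
from typing import Dict, Optional

# Assumption: single spaces between log fields and escaped dots.
# The published pattern shows neither, so this is one plausible reading.
ACCESS_LOG_PATTERN = re.compile(
    r'^([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}) - - \[(.*)\] '
    r'"GET /hdpb\.gif\?(.*) HTTP.*" [0-9]{3} [0-9]{1,5} "(.*)"$'
)

# One name per capture group, in group order (hypothetical names).
GROUP_FIELDS = ("ip", "time", "params", "ua")


def extract_fields(line: str) -> Optional[Dict[str, str]]:
    """Return the captured fields, or None if the line does not match the rule."""
    match = ACCESS_LOG_PATTERN.match(line)
    if match is None:
        return None  # lines that do not satisfy the rule are screened out
    return dict(zip(GROUP_FIELDS, match.groups()))


# Hypothetical sample line in the assumed layout.
sample = (
    '10.0.0.1 - - [20/Dec/2016:10:00:00 +0800] '
    '"GET /hdpb.gif?uid=1&vid=2&pid=3 HTTP/1.1" 200 43 "Mozilla/5.0"'
)
print(extract_fields(sample))
```

Lines that fail the pattern return None, which corresponds to the screening role of the regular expression; matching lines yield one value per parenthesized group.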

The configuration module is used for configuring a plurality of cleaning tasks and, for each cleaning task, the storage path, storage format and compression format of the log before and after cleaning; the cleaning tasks correspond one-to-one with the metadata. Each cleaning requirement is thus turned directly into a concrete, stored task, together with its storage format, compression format and other necessary settings, so that the corresponding cleaning process can be carried out simply by invoking the matching task.

The cleaning module is used for identifying the corresponding metadata according to the log type, completing the cleaning logic with a MapReduce program according to the task configuration, and storing the result in the preset manner.
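As an illustration of the configuration module, the following Python sketch models a cleaning task with its before-cleaning and after-cleaning settings and enforces the one-to-one correspondence between tasks and metadata. All class names, attribute names, paths and format strings here are hypothetical; the text does not prescribe a data model.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class LogLocation:
    """Storage settings for a log on one side of the cleaning step."""
    path: str
    storage_format: str
    compression_format: str


@dataclass(frozen=True)
class CleaningTask:
    name: str            # unique cleaning task name
    metadata_id: str     # the metadata (log type) this task belongs to
    before: LogLocation  # raw log, before cleaning
    after: LogLocation   # structured log, after cleaning


class TaskRegistry:
    """Keeps cleaning tasks and metadata in one-to-one correspondence."""

    def __init__(self) -> None:
        self._by_name: Dict[str, CleaningTask] = {}
        self._by_metadata: Dict[str, CleaningTask] = {}

    def register(self, task: CleaningTask) -> None:
        if task.name in self._by_name:
            raise ValueError(f"task name already used: {task.name}")
        if task.metadata_id in self._by_metadata:
            raise ValueError(f"metadata already has a task: {task.metadata_id}")
        self._by_name[task.name] = task
        self._by_metadata[task.metadata_id] = task

    def for_metadata(self, metadata_id: str) -> CleaningTask:
        """Look up the single task configured for a given log type."""
        return self._by_metadata[metadata_id]


registry = TaskRegistry()
registry.register(CleaningTask(
    name="hdpb",
    metadata_id="hdpb_access_log",
    before=LogLocation("/logs/raw/hdpb", "text", "none"),
    after=LogLocation("/logs/clean/hdpb", "text", "snappy"),
))
print(registry.for_metadata("hdpb_access_log").after.compression_format)
```

Rejecting a second task for the same metadata is one simple way to keep the correspondence one-to-one; a real deployment would hold this information in the configuration store rather than in memory.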

Both the variable storage module and the configuration are stored in ZooKeeper, with the following directory layout:

(1) First-level directory: /etl

(2) Second-level directory: /etl/$task, where $task is the name of a specific cleaning task; each cleaning task name is unique.

(3) Third-level directories: /etl/$task/field, /etl/$task/outformat and /etl/$task/regexp. For example, regular expressions are stored in /etl/$task/field and variables are stored in /etl/$task/regexp. The storage format and compression format are set in /etl/$task/outformat.

The third-level directories are the most important: they hold the variable values and the format configuration.

The storage format of regexp matches the number of groups in the regular expression in one-to-one correspondence, and the type of each group is stored. There are three types: ips, strs and common.

The storage format of field is a comma-separated list of fields, for example uid,vid,pid.

The storage format of outformat holds exactly two class names, separated by a comma, which represent the compression format and the storage format respectively.
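The directory layout and value formats above can be illustrated with a small Python sketch that builds the three third-level paths for a task and parses the stored values. It works on plain strings rather than a live ZooKeeper connection. The comma separator for the group types, the task name hdpb, the example pattern and the two Hadoop class names used as sample values are assumptions chosen for illustration.

```python
import re
from typing import Dict, List, NamedTuple

# The three group types named in the text.
GROUP_TYPES = {"ips", "strs", "common"}


class TaskConfig(NamedTuple):
    fields: List[str]
    group_types: List[str]
    compression_class: str
    storage_class: str


def task_paths(task: str) -> Dict[str, str]:
    """Third-level paths under /etl/$task."""
    base = f"/etl/{task}"
    return {name: f"{base}/{name}" for name in ("field", "outformat", "regexp")}


def parse_task_config(field_value: str, outformat_value: str,
                      group_types_value: str, pattern: str) -> TaskConfig:
    """Parse the stored strings and check them against the pattern."""
    fields = [f.strip() for f in field_value.split(",") if f.strip()]

    # outformat: exactly two class names, compression first, storage second.
    out = [v.strip() for v in outformat_value.split(",")]
    if len(out) != 2 or not all(out):
        raise ValueError("outformat must hold exactly two class names")

    # Assumption: group types are comma-separated, one per regex group.
    types = [t.strip() for t in group_types_value.split(",") if t.strip()]
    unknown = set(types) - GROUP_TYPES
    if unknown:
        raise ValueError(f"unknown group types: {sorted(unknown)}")
    if len(types) != re.compile(pattern).groups:
        raise ValueError("one type is required per regular-expression group")

    return TaskConfig(fields, types, out[0], out[1])


paths = task_paths("hdpb")  # hypothetical task name
config = parse_task_config(
    field_value="uid,vid,pid",
    outformat_value=(
        "org.apache.hadoop.io.compress.SnappyCodec,"
        "org.apache.hadoop.mapreduce.lib.output.TextOutputFormat"
    ),
    group_types_value="ips,strs,common",
    pattern=r"^(\S+) (\S+) (.*)$",  # hypothetical three-group pattern
)
print(paths["outformat"], config.compression_class)
```

Validating the type list against the number of groups at load time catches a mismatched configuration before any cleaning job is submitted.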

The invention also discloses a general original log cleaning method, comprising the following steps:

configuring and storing a plurality of cleaning tasks in one-to-one correspondence with the metadata, together with the storage path, storage format and compression format for each cleaning task; the configuration is stored in ZooKeeper.

identifying the corresponding metadata according to the log type, completing the cleaning step with a MapReduce program according to the cleaning task configuration, and storing the result in the preset manner. At the same time, the MapReduce program automatically determines the number of reduce tasks from the size of the input data so as to produce block files of a suitable size.

Number of reduce tasks = (total size of input data / (HDFS block size × 3)) + 1, where the default block size is 128 MB.

Here 3 is the compression ratio. This value can be adjusted according to the compression algorithm so that the file produced by each reduce task is slightly smaller than the block size.
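A worked example of this calculation is sketched below in Python. It reads the formula as the input size divided by the product of the block size and the compression ratio, plus one, and assumes integer (floor) division; the function and parameter names are hypothetical.

```python
DEFAULT_BLOCK_BYTES = 128 * 1024 * 1024  # default HDFS block size of 128 MB


def reduce_task_count(input_bytes: int,
                      block_bytes: int = DEFAULT_BLOCK_BYTES,
                      compression_ratio: int = 3) -> int:
    """(input size / (block size * compression ratio)) + 1, with floor division."""
    if input_bytes < 0 or block_bytes <= 0 or compression_ratio <= 0:
        raise ValueError("sizes and ratio must be positive")
    return input_bytes // (block_bytes * compression_ratio) + 1


one_tib = 1024 ** 4
print(reduce_task_count(one_tib))  # 1 TiB of input with the defaults
```

With the defaults, each reduce task receives at most about 384 MB of input, so after compression at a ratio of roughly 3 its output stays below one 128 MB block. The trailing +1 guarantees at least one reduce task even for very small inputs.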

Before the cleaning job is submitted, the ZooKeeper configuration is read according to the cleaning task information; the compression and storage configuration is looked up first and used to submit the job. In the MapReduce stage, the field configuration and the regular-expression matching relation are read from ZooKeeper according to the cleaning task information, and the data is cleaned. The principles of MapReduce are not discussed here, since they are not a key technical point of the invention; the invention merely uses this technology. The invention improves development efficiency, automatically optimizes the size of the HDFS block files, and reduces the likelihood of mistakes by inexperienced staff.

The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are also to be regarded as falling within the scope of the present invention.

Claims

1. A general original log cleaning method is characterized by comprising the following steps of,

configuring and storing a plurality of cleaning tasks in one-to-one correspondence with metadata, together with the storage path, storage format and compression format corresponding to each cleaning task, wherein the configuration is stored in ZooKeeper;

and identifying the corresponding metadata according to the log type, completing a cleaning step with a MapReduce program according to the cleaning task configuration, and performing preset storage, wherein the MapReduce program structures an irregular log and stores the structured log in HDFS for later use according to a custom storage format and compression format, and wherein, in the cleaning step, the MapReduce program automatically determines the number of reduce tasks according to the size of the input data, the number of reduce tasks = (total size of the input data / (HDFS block size × 3)) + 1.

2. The general original log cleaning method as set forth in claim 1, wherein the data to be cleaned is stored in an HDFS directory.