CN106599244B - General original log cleaning device and method - Google Patents


Info

Publication number
CN106599244B
Authority
CN
China
Prior art keywords
log
cleaning
metadata
storage
storing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201611183585.0A
Other languages
Chinese (zh)
Other versions
CN106599244A (en)
Inventor
张亚军
田文宝
夏鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Feihu Information Technology Tianjin Co Ltd
Original Assignee
Feihu Information Technology Tianjin Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Feihu Information Technology Tianjin Co Ltd
Priority to CN201611183585.0A
Publication of CN106599244A
Application granted
Publication of CN106599244B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/1805 Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815 Journaling file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a general original log cleaning device comprising: a variable storage module for storing the metadata corresponding to each type of log, together with the regular expressions and matched fields corresponding to each piece of metadata; a configuration module; and a cleaning module for identifying the corresponding metadata according to the log type, executing the cleaning logic with a MapReduce program according to the task configuration, and storing the result in the preset manner. The invention is managed through metadata: a set of metadata is established for each type of log, the logs and variables are stored and configured accordingly, and this information is configured in a management backend. The regular expressions screen out the logs that satisfy the rules, extract the important parameters, and finally map them to the variables in the variable storage.

Description

General original log cleaning device and method
Technical Field
The invention relates to the technical field of big data processing, in particular to a general original log cleaning device and method.
Background
When logs are analyzed, the log data is unorganized, and not all of it is needed. The data must therefore be cleaned; that is, the character strings in the logs are filtered and structured.
Large internet companies have many kinds of logs, all of which need to be cleaned. The volume of log data is huge, occupying several terabytes of storage per day, and this raises two problems. First, there are many types of logs and each needs cleaning; handling each one with its own dedicated process wastes time. Second, the large volume of logs consumes a great deal of storage space, and reading the logs back incurs high network I/O.
Disclosure of Invention
The invention aims to overcome the technical defects of the prior art by providing a general original log cleaning device and method that clean different kinds of logs through flexible, custom configuration.
The technical scheme adopted for realizing the purpose of the invention is as follows:
A general original log cleaning device comprises:
a variable storage module for storing the metadata corresponding to each type of log, together with the regular expressions and matched fields corresponding to the metadata;
a configuration module for configuring a plurality of cleaning tasks and, for each cleaning task, the storage path, storage format and compression format of the log before and after cleaning, wherein the cleaning tasks correspond one-to-one with the metadata;
and a cleaning module for identifying the corresponding metadata according to the log type, executing the cleaning logic with a MapReduce program according to the task configuration, and storing the result in the preset manner.
The configuration is stored in ZooKeeper.
A general original log cleaning method comprises:
establishing metadata corresponding to each type of log, and storing the regular expressions and matched fields corresponding to the metadata;
configuring and storing a plurality of cleaning tasks in one-to-one correspondence with the metadata, together with the storage path, storage format and compression format corresponding to each cleaning task;
and identifying the corresponding metadata according to the log type, completing the cleaning step with a MapReduce program according to the cleaning task configuration, and storing the result in the preset manner.
The configuration is stored in ZooKeeper.
In the cleaning step, the MapReduce program automatically determines the number of reduce tasks according to the size of the input data.
The data to be cleaned is stored in an HDFS directory.
Compared with the prior art, the invention has the beneficial effects that:
The invention is managed through metadata: a set of metadata is established for each type of log, the logs and variables are stored and configured accordingly, and this information is configured in a management backend. The regular expressions screen out the logs that satisfy the rules, extract the important parameters, and finally map them to the variables in the variable storage. In addition, a MapReduce program is used: the number of reduce tasks required is calculated from the size of the original log file, and the cleaning logic is driven by the variable storage and the configuration to complete the cleaning flow.
Drawings
Fig. 1 is a flow chart of a general original log cleaning method according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and the specific examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
To reduce the volume of data files, the most widely used compression schemes at present are LZO and Snappy. Hadoop is a big data platform architecture for distributed storage and distributed computation. On this platform, a MapReduce program structures the irregular logs and then stores them in HDFS for later use, according to a custom storage format and compression format. This overcomes the drawbacks of the prior art, in which logs synchronized for different business requirements are each processed separately and the code repetition rate is high, and it reduces the workload of developers.
The general original log cleaning device comprises a variable storage module, a configuration module and a cleaning module, wherein,
the variable storage module is used for storing metadata corresponding to each type of log, regular expressions corresponding to the metadata and matched fields;
The regular expression is stored in the variable storage module, separately from the variables. Its function is to capture the required fields, so its correctness must be ensured. For example:
^([0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3})--\\[(.*)\\]\"GET/hdpb.gif\\?(.*)HTTP.*\"[0-9]{3}[0-9]{1,5}\"(.*)\"$"
The parenthesized groups mark the fields to be extracted, each with its own category: here the ip field, the time field, the parameter field and the ua (user agent) field. Multiple sets of metadata are defined according to the type of data to be cleaned and the cleaning targets, and the appropriate metadata is selected according to business requirements. Using metadata simplifies the data model to be cleaned and allows similar or closely related cleaning jobs to be configured quickly.
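The group-to-field mapping described above can be sketched in Python. This is an illustration only, not the patent's implementation: the pattern below is an adaptation of the example expression that assumes single spaces between log fields and escaped dots, and the names `LOG_PATTERN`, `FIELD_NAMES`, `extract_fields` and the sample line are hypothetical.

```python
import re

# Assumption: the example expression with spaces restored between fields and
# the dots escaped; the published text does not show the exact spacing.
LOG_PATTERN = re.compile(
    r'^([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}) - - \[(.*)\] '
    r'"GET /hdpb\.gif\?(.*) HTTP.*" [0-9]{3} [0-9]{1,5} "(.*)"$'
)

# One name per capture group, in group order (hypothetical names).
FIELD_NAMES = ("ip", "time", "params", "ua")


def extract_fields(line):
    """Return a dict of the captured fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # lines that do not satisfy the rule are screened out
    return dict(zip(FIELD_NAMES, match.groups()))


# Hypothetical sample line in an access-log style.
sample = (
    '10.0.0.1 - - [20/Dec/2016:10:00:00 +0800] '
    '"GET /hdpb.gif?uid=1&vid=2 HTTP/1.1" 200 43 "Mozilla/5.0"'
)
record = extract_fields(sample)
```

Lines that fail to match return `None`, which corresponds to the screening role the description gives the regular expression.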
The configuration module is used for configuring a plurality of cleaning tasks and, for each cleaning task, the storage path, storage format and compression format of the log before and after cleaning; the cleaning tasks correspond one-to-one with the metadata. Each cleaning requirement is turned directly into a concrete, stored task, together with the storage format, compression format and other necessary settings for that task, so that the corresponding cleaning process can be run simply by invoking the matching task.
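As a rough sketch of what such a task configuration could look like, the Python below models one cleaning task and enforces the one-to-one relation between tasks and metadata. The class names, field names and example values are all hypothetical; the patent does not specify a data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CleaningTask:
    """One cleaning task as described above (all field names are hypothetical)."""
    name: str
    metadata_id: str          # the metadata this task is paired with
    input_path: str           # storage path of the log before cleaning
    output_path: str          # storage path of the log after cleaning
    storage_format: str
    compression_format: str


class TaskRegistry:
    """Keeps tasks and metadata in one-to-one correspondence."""

    def __init__(self):
        self._by_name = {}
        self._by_metadata = {}

    def register(self, task):
        if task.name in self._by_name:
            raise ValueError(f"task name already used: {task.name}")
        if task.metadata_id in self._by_metadata:
            raise ValueError(f"metadata already has a task: {task.metadata_id}")
        self._by_name[task.name] = task
        self._by_metadata[task.metadata_id] = task

    def for_metadata(self, metadata_id):
        return self._by_metadata[metadata_id]


registry = TaskRegistry()
registry.register(CleaningTask(
    name="play_log",                  # hypothetical task name
    metadata_id="play_log_meta",      # hypothetical metadata identifier
    input_path="/logs/raw/play_log",
    output_path="/logs/clean/play_log",
    storage_format="sequencefile",
    compression_format="snappy",
))
```

Rejecting a second task for the same metadata is one simple way to keep the correspondence one-to-one; other designs would work equally well.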
The cleaning module is used for identifying the corresponding metadata according to the log type, executing the cleaning logic with a MapReduce program according to the task configuration, and storing the result in the preset manner.
Both the variable storage module and the configuration are stored in ZooKeeper, using the following directory layout:
(1) First-level directory: /etl
(2) Second-level directory: /etl/$task, where $task is the name of a specific cleaning task; each cleaning task name is unique.
(3) Third-level directories: /etl/$task/field, /etl/$task/outformat and /etl/$task/regexp. For example, regular expressions are deposited to /etl/$task/field, and variables are stored at /etl/$task/regexp. The storage format and the compression format are set in /etl/$task/outformat.
The third-level directories are the most important: they hold the variable values and the configuration formats.
The storage format of regexp matches the number of groups in the regular expression, in one-to-one correspondence, and the type corresponding to each group is stored. There are three types: ip, str and common.
The storage format of field is, for example, uid,vid,pid, with the fields separated by commas.
The storage format of outformat contains exactly 2 class names, separated by commas, which represent the compression format and the storage format respectively.
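The directory layout and value formats above can be illustrated with a small Python sketch. It uses a plain dict as a stand-in for ZooKeeper nodes, so no ZooKeeper client API is shown or implied. The task name `play_log`, the helper names and the node values are hypothetical; the two class names are real Hadoop classes used only as sample values.

```python
# Allowed group types, as listed in the description.
GROUP_TYPES = {"ip", "str", "common"}


def task_paths(task):
    """Build the three third-level paths for a task under /etl."""
    return {name: f"/etl/{task}/{name}" for name in ("field", "outformat", "regexp")}


def parse_csv(value):
    """Split a comma-separated node value into trimmed, non-empty parts."""
    return [part.strip() for part in value.split(",") if part.strip()]


def load_task_config(znodes, task):
    """Read and validate one task's settings from a dict standing in for ZooKeeper."""
    paths = task_paths(task)
    fields = parse_csv(znodes[paths["field"]])
    outformat = parse_csv(znodes[paths["outformat"]])
    group_types = parse_csv(znodes[paths["regexp"]])
    if len(outformat) != 2:
        raise ValueError("outformat must hold exactly two class names")
    unknown = set(group_types) - GROUP_TYPES
    if unknown:
        raise ValueError(f"unknown group types: {sorted(unknown)}")
    compression, storage = outformat  # compression first, then storage
    return {
        "fields": fields,
        "group_types": group_types,
        "compression": compression,
        "storage": storage,
    }


# Hypothetical node contents following the formats described above.
znodes = {
    "/etl/play_log/field": "uid,vid,pid",
    "/etl/play_log/outformat": (
        "org.apache.hadoop.io.compress.SnappyCodec,"
        "org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat"
    ),
    "/etl/play_log/regexp": "ip,str,common",
}
config = load_task_config(znodes, "play_log")
```

Validating the two-value outformat and the three allowed group types at load time catches a malformed configuration before a job is submitted.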
The invention also discloses a general original log cleaning method, which comprises the following steps:
establishing metadata corresponding to each type of log, and storing the regular expressions and matched fields corresponding to the metadata;
configuring and storing a plurality of cleaning tasks in one-to-one correspondence with the metadata, together with the storage path, storage format and compression format corresponding to each cleaning task, the configuration being stored in ZooKeeper;
identifying the corresponding metadata according to the log type, completing the cleaning step with a MapReduce program according to the cleaning task configuration, and storing the result in the preset manner. At the same time, the MapReduce program automatically determines the number of reduce tasks according to the size of the input data, so as to produce block files of a suitable size.
Number of reduce tasks = (total size of input data / (HDFS block size × 3)) + 1, where the default block size is 128 MB.
Here 3 is the compression ratio. This value can be adjusted to suit the compression algorithm, so that the file produced by each reduce task is as close as possible to, and slightly smaller than, the block size.
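The formula can be worked through in a few lines of Python. This is a sketch under two assumptions not stated in the source: the division is integer (floor) division, and sizes are given in bytes with 128 MB taken as 128 × 1024 × 1024. The function name `reduce_count` is hypothetical.

```python
DEFAULT_BLOCK_BYTES = 128 * 1024 * 1024  # default HDFS block size of 128 MB


def reduce_count(total_input_bytes, block_bytes=DEFAULT_BLOCK_BYTES, compression_ratio=3):
    """Number of reduce tasks = (input size / (block size * ratio)) + 1.

    Assumption: the division is floor division, so the +1 guarantees at
    least one reduce task and keeps each output just under one block.
    """
    return total_input_bytes // (block_bytes * compression_ratio) + 1


one_tib = 2 ** 40
reducers = reduce_count(one_tib)  # 2**40 // (2**27 * 3) + 1 = 2730 + 1

# Expected compressed output per reduce task under the assumed 3:1 ratio.
output_per_reducer = one_tib / reducers / 3
```

For 1 TiB of input this gives 2731 reduce tasks, and at a 3:1 compression ratio each one writes slightly less than one 128 MB block.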
Before the cleaning job is submitted, the ZooKeeper configuration is read according to the cleaning task information; the compression and storage configuration is looked up first and used to submit the job. In the MapReduce stage, the field configuration and the regular expression matching relation are read from ZooKeeper according to the cleaning task information, and the data is cleaned accordingly. The principles of MapReduce are not discussed here: they are not a key technical point of the invention, which merely uses these techniques. The invention improves development efficiency, automatically optimizes the size of the HDFS block files, and reduces the chance of mistakes by new staff.
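The two-phase use of the configuration can be sketched in plain Python, without Hadoop. Everything here is hypothetical scaffolding: `TASK_CONFIG` stands in for the values read from ZooKeeper, the pattern and sample lines are invented, and `map_phase` only imitates what a map function would do per input line.

```python
import re

# Hypothetical stand-in for one task's settings as read from ZooKeeper.
TASK_CONFIG = {
    "outformat": (
        "org.apache.hadoop.io.compress.SnappyCodec,"
        "org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat"
    ),
    "field": "uid,vid,pid",
    "pattern": r"^uid=(\w+)&vid=(\w+)&pid=(\w+)$",  # invented example pattern
}


def prepare_submission(config):
    """Phase 1, before submission: only compression and storage settings are needed."""
    compression, storage = config["outformat"].split(",")
    return {"compression": compression, "storage": storage}


def map_phase(config, lines):
    """Phase 2, inside the job: match each line and emit one structured record."""
    fields = config["field"].split(",")
    pattern = re.compile(config["pattern"])
    for line in lines:
        match = pattern.match(line.strip())
        if match is None:
            continue  # drop lines that do not satisfy the rule
        yield dict(zip(fields, match.groups()))


job_settings = prepare_submission(TASK_CONFIG)
rows = list(map_phase(TASK_CONFIG, ["uid=a1&vid=b2&pid=c3", "not a log line"]))
```

Splitting the reads this way mirrors the description: the output settings are resolved once at submission time, while the field names and the pattern are applied to every record during the job.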
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are intended to fall within the scope of the present invention.

Claims (2)

1. A general original log cleaning method, characterized by comprising the following steps:
establishing metadata corresponding to each type of log, and storing regular expressions and matched fields corresponding to the metadata;
configuring and storing a plurality of cleaning tasks in one-to-one correspondence with the metadata, together with the storage path, storage format and compression format corresponding to each cleaning task, wherein the configuration is stored in ZooKeeper;
and identifying the corresponding metadata according to the log type, completing a cleaning step with a MapReduce program according to the cleaning task configuration, and performing preset storage, the MapReduce program structuring the irregular logs and storing the structured logs in HDFS for later use according to a custom storage format and compression format, wherein in the cleaning step the MapReduce program automatically determines the number of reduce tasks according to the size of the input data, with number of reduce tasks = (total size of the input data / (HDFS block size × 3)) + 1.
2. The general original log cleaning method as set forth in claim 1, wherein the data to be cleaned is stored in an HDFS directory.
CN201611183585.0A 2016-12-20 2016-12-20 General original log cleaning device and method Active CN106599244B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611183585.0A CN106599244B (en) 2016-12-20 2016-12-20 General original log cleaning device and method

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201611183585.0A CN106599244B (en) 2016-12-20 2016-12-20 General original log cleaning device and method

Publications (2)

Publication Number Publication Date
CN106599244A CN106599244A (en) 2017-04-26
CN106599244B true CN106599244B (en) 2024-01-05

Family

ID=58600257

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611183585.0A Active CN106599244B (en) 2016-12-20 2016-12-20 General original log cleaning device and method

Country Status (1)

Country Link
CN (1) CN106599244B (en)

Families Citing this family (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109359103A (en) * 2018-09-04 2019-02-19 河南智云数据信息技术股份有限公司 A kind of data aggregate cleaning method and system
CN115509851A (en) * 2022-09-14 2022-12-23 易纳购科技(北京)有限公司 Page monitoring method, device and equipment

Citations (8)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1908935A (en) * 2006-08-01 2007-02-07 华为技术有限公司 Search method and system of a natural language
CN1983952A (en) * 2005-12-14 2007-06-20 中兴通讯股份有限公司 Method and system for synchronizing network administration data in network optimizing system
CN104298771A (en) * 2014-10-30 2015-01-21 南京信息工程大学 Massive web log data query and analysis method
CN104391989A (en) * 2014-12-16 2015-03-04 浪潮电子信息产业股份有限公司 Distributed ETL all-in-one machine system
CN105447099A (en) * 2015-11-11 2016-03-30 中国建设银行股份有限公司 Log structured information extraction method and apparatus
CN105706045A (en) * 2013-07-19 2016-06-22 泰必高软件公司 Semantics-oriented analysis of log message content
CN106021554A (en) * 2016-05-30 2016-10-12 北京奇艺世纪科技有限公司 Log analysis method and device
CN106227862A (en) * 2016-07-29 2016-12-14 浪潮软件集团有限公司 E-commerce data integration method based on distribution

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
KR20140042428A (en) * 2012-09-28 2014-04-07 삼성전자주식회사 Computing system and data management method thereof

Also Published As

Publication number Publication date
CN106599244A (en) 2017-04-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant