CN106599244A - Universal original log cleaning device and method - Google Patents

Universal original log cleaning device and method

Info

Publication number
CN106599244A
Authority
CN
China
Prior art keywords
cleaning
metadata
configuration
storage
task
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201611183585.0A
Other languages
Chinese (zh)
Other versions
CN106599244B (en)
Inventor
张亚军
田文宝
夏鹏
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Feihu Information Technology Tianjin Co Ltd
Original Assignee
Feihu Information Technology Tianjin Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Feihu Information Technology Tianjin Co Ltd filed Critical Feihu Information Technology Tianjin Co Ltd
Priority to CN201611183585.0A priority Critical patent/CN106599244B/en
Publication of CN106599244A publication Critical patent/CN106599244A/en
Application granted granted Critical
Publication of CN106599244B publication Critical patent/CN106599244B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/1805 Append-only file systems, e.g. using logs or journals to store data
    • G06F16/1815 Journaling file systems
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/17 Details of further file system functions
    • G06F16/174 Redundancy elimination performed by the file system
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10 File systems; File servers
    • G06F16/18 File system types
    • G06F16/182 Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a universal original log cleaning device comprising: a variable storage module for storing the metadata corresponding to each type of log, the regular expressions corresponding to the metadata, and the matched fields; a configuration module; and a cleaning module for identifying the corresponding metadata according to the log type, completing the cleaning logic with a MapReduce program according to the task configuration, and performing preset storage. Through metadata management, the device establishes a set of metadata for each type of log, so that logs, variable storage, and configuration are managed in an orderly way, and this information can be configured in a management back end. Moreover, the regular expressions screen out the logs that satisfy the rules and extract the important parameters, which are finally mapped to the variables in the variable storage.

Description

Universal original log cleaning device and method
Technical field
The present invention relates to the field of big data processing, and more particularly to a universal original log cleaning device and method.
Background art
When logs are analyzed, the log data is disorganized, and not all of it is what the analyst wants to see. The data therefore needs to be cleaned: the strings in it are filtered and the result is structured.
Some large Internet companies have many kinds of logs, all of which need cleaning, and some logs are very large, amounting to several terabytes of storage per day. This raises two problems. First, there are many log formats and every type of log must be cleaned; writing dedicated processing for each log takes a great deal of time. Second, the log volume is large and occupies a great deal of storage space, so the network I/O spent reading these logs is also very high.
Summary of the invention
The purpose of the present invention is to address the technical deficiencies of the prior art by providing a universal original log cleaning device and method that clean different logs through flexible, user-defined configuration.
The technical solution adopted to achieve this purpose is as follows:
A universal original log cleaning device, comprising:
a variable storage module for storing the metadata corresponding to each type of log, the regular expression corresponding to each set of metadata, and the matched fields;
a configuration module for configuring multiple cleaning tasks and, for each cleaning task, the storage paths of the log before and after cleaning, the storage format, and the compression format, the cleaning tasks being in one-to-one correspondence with the metadata;
a cleaning module that identifies the corresponding metadata according to the log type, completes the cleaning logic with a MapReduce program according to the task configuration, and performs preset storage.
The configuration is stored in ZooKeeper.
A universal original log cleaning method, comprising:
establishing and storing the metadata corresponding to each type of log, the regular expression corresponding to each set of metadata, and the matched fields;
configuring and storing multiple cleaning tasks in one-to-one correspondence with the metadata and, for each cleaning task, the storage path, storage format, and compression format;
identifying the corresponding metadata according to the log type, completing the cleaning step with a MapReduce program according to the cleaning task configuration, and performing preset storage.
The configuration is stored in ZooKeeper.
In the cleaning step, the MapReduce program automatically determines the number of reducers according to the size of the input data.
The data to be cleaned is stored in an HDFS directory.
Compared with the prior art, the beneficial effects of the present invention are as follows:
Through metadata management, the present invention establishes a set of metadata for each type of log, so that logs, variable storage, and configuration are managed in an orderly way, and this information can be configured in a management back end. The regular expressions screen out the logs that satisfy the rules and extract the important parameters, which are finally mapped to the variables in the variable storage. In addition, the MapReduce program calculates the required number of reducers from the size of the original log file, builds the cleaning logic from the variable storage and the configuration, and completes the cleaning process.
Brief description of the drawings
Fig. 1 is a schematic flow chart of the universal original log cleaning method of the present invention.
Detailed description of the embodiments
The present invention is described in further detail below with reference to the drawings and specific embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
To reduce the size of data files, the most widely used compression schemes at present are LZO and Snappy. Hadoop is a big data platform architecture for distributed storage and distributed computing. On this platform, the present invention uses a MapReduce program to structure irregular logs and store them in HDFS, in a user-defined storage format and compression format, for later use. This overcomes the defect of the prior art, in which each log synchronized for a business need received its own processing and code redundancy was high, and it reduces the workload of developers.
The universal original log cleaning device of the present invention comprises a variable storage module, a configuration module, and a cleaning module, wherein:
The variable storage module is used to store the metadata corresponding to each type of log, the regular expression corresponding to each set of metadata, and the matched fields.
The regular expressions are stored in the variable storage module, separately from the variables. The role of a regular expression is to obtain the required fields, so it must be correct. For example:
^([0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3})--\\[(.*)\\]\"GET/ hdpb.gif\\(.*)HTTP.*\"[0-9]{3}[0-9]{1,5}\"(.*)\"$"
Each pair of round brackets encloses a field to be extracted. These fields can be classified, for example as an IP field, a time field, a parameter field, and a UA field. Multiple sets of metadata are defined according to the type of data to be cleaned and the cleaning target, and suitable metadata is selected and matched according to the specific business requirements. Using metadata simplifies the model of the data to be cleaned and allows similar or nearly identical cleaning jobs to be configured quickly.
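As an illustration of how bracketed groups can be mapped to named fields, the following is a minimal Python sketch. The pattern is a hand-written approximation of the example expression, with whitespace restored and dots escaped by assumption; the sample log line, the `FIELDS` names, and the `extract_fields` helper are hypothetical and do not come from the patent.

```python
import re

# Assumed approximation of the example expression: spaces restored and
# literal dots escaped. The expression in the patent may differ.
PATTERN = re.compile(
    r'^([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}) - - \[(.*)\] '
    r'"GET /hdpb\.gif\?(.*) HTTP.*" [0-9]{3} [0-9]{1,5} "(.*)"$'
)

# Hypothetical names for the four capture groups, in group order.
FIELDS = ["ip", "time", "params", "ua"]


def extract_fields(line):
    """Return a dict of field name -> captured text, or None if no match."""
    match = PATTERN.match(line)
    if match is None:
        return None  # lines that do not satisfy the rule are screened out
    return dict(zip(FIELDS, match.groups()))


# Made-up log line, for illustration only.
sample = ('10.0.0.1 - - [20/Dec/2016:10:00:00 +0800] '
          '"GET /hdpb.gif?uid=1&vid=2 HTTP/1.1" 200 43 "Mozilla/5.0"')
record = extract_fields(sample)
```

Keeping the field names in a list separate from the pattern mirrors the idea of storing the expression and the variables separately: either can be changed without touching the extraction code.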
The configuration module is used to configure multiple cleaning tasks and, for each cleaning task, the storage paths of the log before and after cleaning, the storage format, and the compression format; the cleaning tasks are in one-to-one correspondence with the metadata. Each cleaning requirement is turned directly into a concrete task and stored, and each task carries the necessary settings such as storage format and compression format, so that invoking the matching task carries out the corresponding cleaning process.
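A cleaning task as described above could be modeled roughly as follows. This is only a sketch: the class names, attribute names, and the sample values are assumptions made for illustration, and the patent does not prescribe any particular data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CleaningTask:
    """One cleaning task; every attribute name here is illustrative."""
    name: str            # unique task name, paired with one metadata set
    input_path: str      # where the log is stored before cleaning
    output_path: str     # where the log is stored after cleaning
    storage_format: str  # storage format identifier
    compression: str     # compression format identifier


class TaskRegistry:
    """Keeps tasks keyed by name so each name maps to exactly one task."""

    def __init__(self):
        self._tasks = {}

    def register(self, task):
        if task.name in self._tasks:
            raise ValueError(f"cleaning task {task.name!r} already exists")
        self._tasks[task.name] = task

    def lookup(self, name):
        return self._tasks[name]


# Hypothetical task definition.
registry = TaskRegistry()
registry.register(
    CleaningTask("demo", "/logs/raw/demo", "/logs/clean/demo", "text", "lzo")
)
```

Rejecting a second task with the same name is one simple way to keep the one-to-one relation between tasks and metadata sets.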
The cleaning module identifies the corresponding metadata according to the log type, completes the cleaning logic with a MapReduce program according to the task configuration, and performs preset storage.
The variable storage module and the configuration module both store their data in ZooKeeper. The directory layout is as follows:
(1) First-level directory: /etl
(2) Second-level directory: /etl/$task, where $task is the name of a specific cleaning task; each cleaning task name is unique.
(3) Third-level directories: /etl/$task/field, /etl/$task/outformat, and /etl/$task/regexp. For example, the regular expression is stored in /etl/$task/field and the variables are stored in /etl/$task/regexp. The storage format and compression format are set in /etl/$task/outformat.
The third-level directories are the most important: the stored variable values and the configured formats are kept in them.
The value stored in regexp corresponds one-to-one with the group numbers of the regular expression, and can record the type of each group. There are three types: ip, str, and common.
The value stored in field looks like uid,vid,pid, with the fields separated by commas.
The value stored in outformat contains exactly two class names, separated by a comma, which represent the compression format and the storage format respectively.
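The layout and value formats above can be sketched in Python as follows. No ZooKeeper client is used; the functions only build node paths and parse node values. The helper names are hypothetical, and the assumption that group types are stored as a comma-separated list is ours, since the text does not state the separator.

```python
ROOT = "/etl"
GROUP_TYPES = {"ip", "str", "common"}  # the three group types named in the text


def task_paths(task):
    """Return the three third-level node paths for one cleaning task."""
    base = f"{ROOT}/{task}"
    return {leaf: f"{base}/{leaf}" for leaf in ("field", "outformat", "regexp")}


def parse_field(value):
    """Split a comma-separated field list such as 'uid,vid,pid'."""
    return [name.strip() for name in value.split(",") if name.strip()]


def parse_outformat(value):
    """Split the two class names: compression format, then storage format."""
    parts = [part.strip() for part in value.split(",")]
    if len(parts) != 2:
        raise ValueError("outformat must contain exactly two class names")
    compression, storage = parts
    return compression, storage


def parse_group_types(value):
    """Parse per-group types; comma separation is an assumption."""
    types = parse_field(value)
    unknown = [t for t in types if t not in GROUP_TYPES]
    if unknown:
        raise ValueError(f"unknown group types: {unknown}")
    return types


paths = task_paths("demo")
fields = parse_field("uid,vid,pid")
```

Validating the group types when the node is read means a mistyped type is reported before any cleaning job runs rather than during the map stage.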
Through metadata management, the present invention establishes a set of metadata for each type of log, so that logs, variable storage, and configuration are managed in an orderly way, and this information can be configured in a management back end. The regular expressions screen out the logs that satisfy the rules and extract the important parameters, which are finally mapped to the variables in the variable storage. In addition, the MapReduce program calculates the required number of reducers from the size of the original log file, builds the cleaning logic from the variable storage and the configuration, and completes the cleaning process.
The present invention also discloses a universal original log cleaning method, comprising:
establishing and storing the metadata corresponding to each type of log, the regular expression corresponding to each set of metadata, and the matched fields;
configuring and storing multiple cleaning tasks in one-to-one correspondence with the metadata and, for each cleaning task, the storage path, storage format, and compression format, the configuration being stored in ZooKeeper;
identifying the corresponding metadata according to the log type, completing the cleaning step with a MapReduce program according to the cleaning task configuration, and performing preset storage, while the MapReduce program automatically determines the number of reducers according to the size of the input data so as to produce block files of a suitable size.
Number of reducers = (total size of input data / (HDFS block size × 3)) + 1, where the default HDFS block size is 128 MB.
Here 3 is the compression ratio. This value can be adjusted according to the compression algorithm so that the file produced by each reducer is, as far as possible, slightly smaller than one block.
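The calculation can be sketched in Python as below. The function name and parameter names are hypothetical, and the use of integer (floor) division is an assumption, since the text does not say how the quotient is rounded.

```python
def reducer_count(input_bytes, block_bytes=128 * 1024 * 1024, compression_ratio=3):
    """Estimate the number of reducers from the total input size.

    Floor division is assumed; adding 1 keeps the count at least 1 and
    keeps each reducer's compressed output just under one block.
    """
    if input_bytes < 0 or block_bytes <= 0 or compression_ratio <= 0:
        raise ValueError("sizes and ratio must be positive")
    return int(input_bytes // (block_bytes * compression_ratio)) + 1


# 1 TiB of raw logs with the default 128 MiB block and a ratio of 3.
count = reducer_count(1024 ** 4)
```

For 1 TiB of input this gives 2731 reducers, so each reducer handles about 384 MiB of raw data, which at a compression ratio of 3 comes out slightly under one 128 MiB block.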
Before the cleaning job is submitted, the ZooKeeper configuration is read according to the cleaning task information: the compression and storage configuration is looked up first and used to submit the cleaning job. In the MapReduce stage, the configuration of the fields and their matching relationship with the regular expression is then read from ZooKeeper according to the cleaning task information, and the data is cleaned. The principles of MapReduce are not discussed further here; they are not the key technical point of this invention, which simply makes use of these technologies. The present invention improves development efficiency, automatically optimizes the HDFS block file size, and reduces the chance that new employees make mistakes.
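The two reads described above, one before submission and one in the map stage, can be sketched as follows. A plain dictionary stands in for ZooKeeper, and the key names, the sample pattern, and the functions `prepare_job` and `map_lines` are all hypothetical; the sketch runs in a single process and is not a MapReduce job.

```python
import re

# In-memory stand-in for the stored configuration; contents are made up.
CONFIG = {
    "demo": {
        "outformat": "org.apache.hadoop.io.compress.SnappyCodec,"
                     "org.apache.hadoop.mapreduce.lib.output.TextOutputFormat",
        "fields": "ip,uid",
        "pattern": r"^(\S+) uid=(\d+)$",
    }
}


def prepare_job(task):
    """Phase 1, before submission: read compression and storage settings."""
    compression, storage = CONFIG[task]["outformat"].split(",")
    return {"task": task, "compression": compression, "storage": storage}


def map_lines(task, lines):
    """Phase 2, map stage: read fields and pattern, then clean each line."""
    fields = CONFIG[task]["fields"].split(",")
    pattern = re.compile(CONFIG[task]["pattern"])
    for line in lines:
        match = pattern.match(line.rstrip("\n"))
        if match:  # lines that do not match the rule are dropped
            yield dict(zip(fields, match.groups()))


job = prepare_job("demo")
rows = list(map_lines("demo", ["10.0.0.1 uid=7\n", "bad line\n"]))
```

Because both phases look everything up by task name, adding a new log type in this sketch only requires a new configuration entry, not new cleaning code.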
The above is only the preferred embodiment of the present invention. It should be noted that a person of ordinary skill in the art can make improvements and modifications without departing from the principles of the present invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.

Claims (6)

1. A universal original log cleaning device, characterized by comprising:
a variable storage module for storing the metadata corresponding to each type of log, the regular expression corresponding to each set of metadata, and the matched fields;
a configuration module for configuring multiple cleaning tasks and, for each cleaning task, the storage paths of the log before and after cleaning, the storage format, and the compression format, the cleaning tasks being in one-to-one correspondence with the metadata;
a cleaning module that identifies the corresponding metadata according to the log type, completes the cleaning logic with a MapReduce program according to the task configuration, and performs preset storage.
2. The universal original log cleaning device according to claim 1, characterized in that the configuration is stored in ZooKeeper.
3. A universal original log cleaning method, characterized by comprising:
establishing and storing the metadata corresponding to each type of log, the regular expression corresponding to each set of metadata, and the matched fields;
configuring and storing multiple cleaning tasks in one-to-one correspondence with the metadata and, for each cleaning task, the storage path, storage format, and compression format;
identifying the corresponding metadata according to the log type, completing the cleaning step with a MapReduce program according to the cleaning task configuration, and performing preset storage.
4. The universal original log cleaning method according to claim 3, characterized in that the configuration is stored in ZooKeeper.
5. The universal original log cleaning method according to claim 3, characterized in that, in the cleaning step, the MapReduce program automatically determines the number of reducers according to the size of the input data.
6. The universal original log cleaning method according to claim 3, characterized in that the data to be cleaned is stored in an HDFS directory.
CN201611183585.0A 2016-12-20 2016-12-20 General original log cleaning device and method Active CN106599244B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201611183585.0A CN106599244B (en) 2016-12-20 2016-12-20 General original log cleaning device and method

Publications (2)

Publication Number Publication Date
CN106599244A true CN106599244A (en) 2017-04-26
CN106599244B CN106599244B (en) 2024-01-05

Family

ID=58600257

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201611183585.0A Active CN106599244B (en) 2016-12-20 2016-12-20 General original log cleaning device and method

Country Status (1)

Country Link
CN (1) CN106599244B (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109359103A (en) * 2018-09-04 2019-02-19 河南智云数据信息技术股份有限公司 A kind of data aggregate cleaning method and system
CN115509851A (en) * 2022-09-14 2022-12-23 易纳购科技(北京)有限公司 Page monitoring method, device and equipment

Citations (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN1908935A (en) * 2006-08-01 2007-02-07 华为技术有限公司 Search method and system of a natural language
CN1983952A (en) * 2005-12-14 2007-06-20 中兴通讯股份有限公司 Method and system for synchronizing network administration data in network optimizing system
US20140095558A1 (en) * 2012-09-28 2014-04-03 Samsung Electronics Co., Ltd. Computing system and method of managing data thereof
CN104298771A (en) * 2014-10-30 2015-01-21 南京信息工程大学 Massive web log data query and analysis method
CN104391989A (en) * 2014-12-16 2015-03-04 浪潮电子信息产业股份有限公司 Distributed ETL (extract transform load) all-in-one machine system
CN105447099A (en) * 2015-11-11 2016-03-30 中国建设银行股份有限公司 Log structured information extraction method and apparatus
CN105706045A (en) * 2013-07-19 2016-06-22 泰必高软件公司 Semantics-oriented analysis of log message content
CN106021554A (en) * 2016-05-30 2016-10-12 北京奇艺世纪科技有限公司 Log analysis method and device
CN106227862A (en) * 2016-07-29 2016-12-14 浪潮软件集团有限公司 E-commerce data integration method based on distribution

Also Published As

Publication number Publication date
CN106599244B (en) 2024-01-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant