CN106599244A - Universal original log cleaning device and method - Google Patents
Universal original log cleaning device and method
- Publication number
- CN106599244A
- Authority
- CN
- China
- Prior art keywords
- cleaning
- metadata
- configuration
- storage
- task
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/1805—Append-only file systems, e.g. using logs or journals to store data
- G06F16/1815—Journaling file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/174—Redundancy elimination performed by the file system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The invention discloses a universal original log cleaning device comprising: a variable storage module used for storing metadata corresponding to each type of log, together with the regular expressions corresponding to the metadata and the matched fields; a configuration module; and a cleaning module used for identifying the corresponding metadata according to the log type, completing the cleaning logic with a MapReduce program according to the task configuration, and performing preset storage. Through metadata management, the device establishes a set of metadata for each type of log, so that logs, variable storage, and configuration are managed in a reasonable way, and this information can be configured in a management backend. Moreover, the use of regular expressions makes it possible to screen the logs that satisfy the rules and to extract the important parameters, which are finally mapped to the variables in the variable storage.
Description
Technical field
The present invention relates to the field of big data processing technology, and in particular to a universal original log cleaning device and method.
Background
In log analysis, log data is often disorganized, or not all of it is of interest. The data therefore has to be cleaned: the character strings in it are filtered and processed into a structured form.
Large Internet companies have many kinds of logs, all of which require cleaning, and some logs are huge, taking up several terabytes of storage per day. This raises two problems. First, there are many log formats and every type of log must be cleaned; handling each log with its own dedicated processing takes a great deal of time. Second, the log volume is large, so it occupies a great deal of storage space, and the network I/O spent reading these logs is also high.
Summary of the invention
The purpose of the present invention is to address the technical deficiencies of the prior art by providing a universal original log cleaning method that cleans different logs through flexible, user-defined configuration.
The technical solution adopted to achieve this purpose is as follows:
A universal original log cleaning device, comprising:
a variable storage module, used for storing metadata corresponding to each type of log, the regular expressions corresponding to each item of metadata, and the matched fields;
a configuration module, used for configuring multiple cleaning tasks and, for each cleaning task, the storage paths of the logs before and after cleaning, the storage format, and the compression format, the cleaning tasks being in one-to-one correspondence with the metadata;
a cleaning module, which identifies the corresponding metadata according to the log type, completes the cleaning logic with a MapReduce program according to the task configuration, and performs preset storage.
The configuration is stored in ZooKeeper.
A universal original log cleaning method, comprising:
establishing and storing metadata corresponding to each type of log, the regular expressions corresponding to each item of metadata, and the matched fields;
configuring and storing multiple cleaning tasks in one-to-one correspondence with the metadata, together with the storage path, storage format, and compression format corresponding to each cleaning task;
identifying the corresponding metadata according to the log type, completing the cleaning step with a MapReduce program according to the cleaning task configuration, and performing preset storage.
The configuration is stored in ZooKeeper.
In the cleaning step, the MapReduce program automatically determines the number of reducers according to the size of the input data.
The data to be cleaned is stored in an HDFS directory.
Compared with the prior art, the beneficial effects of the present invention are as follows:
Through metadata management, the present invention establishes a set of metadata for each type of log, so that logs, variable storage, and configuration are managed in a reasonable way, and this information can be configured in the management backend. The use of regular expressions makes it possible to screen the logs that satisfy the rules and to extract the important parameters, which are finally mapped to the variables in the variable storage. In addition, a MapReduce program calculates the required number of reducers according to the size of the original log files, builds the cleaning logic from the variable storage and the configuration, and finally completes the cleaning process.
Description of the drawings
Fig. 1 is a schematic flowchart of the universal original log cleaning method of the present invention.
Detailed description
The present invention is described in further detail below with reference to the drawing and specific embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention and are not intended to limit it.
To reduce the size of data files, the most commonly used compression methods at present are LZO and Snappy. Hadoop is a big data platform architecture for distributed storage and distributed computing. On this platform, the present invention uses a MapReduce program to structure irregular logs and then store them in HDFS in a user-defined storage format and compression format for later use. This overcomes the high code redundancy of the prior art, in which separate log processing was written for each log synchronized according to business needs, and reduces the workload of developers.
The universal original log cleaning device of the present invention comprises a variable storage module, a configuration module, and a cleaning module, wherein:
The variable storage module is used for storing metadata corresponding to each type of log, the regular expressions corresponding to each item of metadata, and the matched fields.
The regular expressions are stored in the variable storage module, separately from the variables. The role of a regular expression is to obtain the required fields, so it must be correct. For example:
^([0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3})--\\[(.*)\\]\"GET/hdpb.gif\\(.*)HTTP.*\"[0-9]{3}[0-9]{1,5}\"(.*)\"$"
Each pair of parentheses marks a field to be extracted. These fields can be classified, for example as IP fields, time fields, parameter fields, and UA fields. Multiple sets of metadata are defined according to the type of data to be cleaned and the cleaning goal, and suitable metadata is selected and matched according to the specific business requirements. Using metadata simplifies the model of the data to be cleaned and allows identical or similar cleaning jobs to be configured quickly.
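The capture-group idea described above can be sketched in Python. This is only an illustration under stated assumptions: the pattern below is a simplified, hypothetical access-log expression (not the expression given in this document), and the field names and sample log line are invented for the example.

```python
import re

# Hypothetical, simplified access-log pattern (an assumption for illustration).
# Each pair of parentheses is one capture group, i.e. one field to extract.
LOG_PATTERN = re.compile(
    r'^(\d{1,3}(?:\.\d{1,3}){3}) - - \[(.*?)\] '
    r'"GET /hdpb\.gif\?(.*?) HTTP[^"]*" \d{3} \d{1,5} "(.*)"$'
)

# Assumed field names, listed in the same order as the capture groups.
FIELD_NAMES = ["ip", "time", "params", "ua"]


def extract_fields(line):
    """Return a dict of extracted fields, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None  # lines that do not satisfy the rule are screened out
    return dict(zip(FIELD_NAMES, match.groups()))


# Invented sample line, used only to exercise the pattern.
sample = ('10.0.0.1 - - [20/Dec/2016:10:00:00 +0800] '
          '"GET /hdpb.gif?uid=1&vid=2 HTTP/1.1" 200 43 "Mozilla/5.0"')
record = extract_fields(sample)
```

Pairing the group order with a separate list of field names mirrors the one-to-one relationship between groups and stored variables, so a new log type would only need a new pattern and a new name list.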
The configuration module is used for configuring multiple cleaning tasks and, for each cleaning task, the storage paths of the logs before and after cleaning, the storage format, and the compression format; the cleaning tasks are in one-to-one correspondence with the metadata. Each cleaning requirement is turned directly into a concrete task and stored, and each task stores its required elements, such as the compression format, so that the corresponding cleaning process is carried out by invoking the matching task.
The cleaning module identifies the corresponding metadata according to the log type, completes the cleaning logic with a MapReduce program according to the task configuration, and performs preset storage.
The variable storage module and the configuration module both store their data in ZooKeeper. The directory layout is as follows:
(1) First-level directory: /etl
(2) Second-level directory: /etl/$task, where $task is the name of a specific cleaning task; each cleaning task name is unique.
(3) Third-level directories: /etl/$task/field, /etl/$task/outformat, and /etl/$task/regexp. For example, the regular expression is stored in /etl/$task/field and the variables are stored in /etl/$task/regexp. The storage format and compression format are set in /etl/$task/outformat.
The third-level directories are the most important: the stored variable values and the configured formats are kept in them.
The content stored in regexp corresponds one-to-one with the groups of the regular expression and can record the type of each group. There are three types: ip, str, and common.
The content stored in field is a list of names such as uid, vid, pid, separated by commas.
The content stored in outformat consists of only two class names, separated by commas, which represent the compression format and the storage format respectively.
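As a rough sketch of how this layout might be consumed, the Python below builds the three third-level paths for a task and parses the comma-separated values. It is not a ZooKeeper client: a plain dict stands in for the stored node data, and the task name `pv_log` and the stored values are hypothetical. The two class names are existing Hadoop classes chosen only as plausible examples of a compression format and a storage format.

```python
from dataclasses import dataclass

ROOT = "/etl"  # first-level directory from the description


def task_paths(task):
    """Build the three third-level node paths for one cleaning task."""
    base = f"{ROOT}/{task}"
    return {name: f"{base}/{name}" for name in ("field", "outformat", "regexp")}


@dataclass
class OutFormat:
    compression: str  # class name of the compression format
    storage: str      # class name of the storage format


def parse_outformat(value):
    """Parse the two comma-separated class names: compression, then storage."""
    compression, storage = (part.strip() for part in value.split(","))
    return OutFormat(compression, storage)


def parse_fields(value):
    """Parse a comma-separated name list such as 'uid,vid,pid'."""
    return [name.strip() for name in value.split(",") if name.strip()]


# In-memory stand-in for ZooKeeper node data (hypothetical task and values).
fake_zk = {
    "/etl/pv_log/field": "uid,vid,pid",
    "/etl/pv_log/outformat": (
        "org.apache.hadoop.io.compress.SnappyCodec,"
        "org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat"
    ),
}

paths = task_paths("pv_log")
fields = parse_fields(fake_zk[paths["field"]])
out = parse_outformat(fake_zk[paths["outformat"]])
```

Keeping path construction in one helper means a task is fully identified by its unique name, which matches the requirement that each cleaning task name is unique.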
The invention also discloses a universal original log cleaning method, comprising:
establishing and storing metadata corresponding to each type of log, the regular expressions corresponding to each item of metadata, and the matched fields;
configuring and storing multiple cleaning tasks in one-to-one correspondence with the metadata, together with the storage path, storage format, and compression format corresponding to each cleaning task, the configuration being stored in ZooKeeper;
identifying the corresponding metadata according to the log type, completing the cleaning step with a MapReduce program according to the cleaning task configuration, and performing preset storage, while the MapReduce program automatically determines the number of reducers according to the size of the input data so as to produce block files of a suitable size.
Number of reducers = total size of input data / (HDFS block size × 3) + 1, where the default HDFS block size is 128 MB.
Here 3 is the compression ratio. This value can be adjusted according to the compression algorithm so that the file produced by each reducer is, as far as possible, slightly smaller than the block size.
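The formula above can be written as a small Python function. This is a sketch under assumptions: the function name and parameters are hypothetical, sizes are taken in bytes, and the division is assumed to be integer division, which the text does not state.

```python
DEFAULT_BLOCK_SIZE = 128 * 1024 * 1024  # 128 MB, the default given in the text
DEFAULT_COMPRESSION_RATIO = 3           # adjustable per compression algorithm


def reducer_count(total_input_bytes,
                  block_size=DEFAULT_BLOCK_SIZE,
                  compression_ratio=DEFAULT_COMPRESSION_RATIO):
    """Number of reducers = total input / (block size * ratio) + 1.

    Integer division is an assumption of this sketch.
    """
    if total_input_bytes < 0 or block_size <= 0 or compression_ratio <= 0:
        raise ValueError("sizes and ratio must be positive")
    return total_input_bytes // (block_size * compression_ratio) + 1


ten_gib = 10 * 1024 ** 3
reducers = reducer_count(ten_gib)  # 10240 MiB // 384 MiB + 1 = 27
```

With 10 GiB of input and the defaults, the function returns 27. If the output really compresses by a factor of 3, each reducer then writes about 10240 / 3 / 27 ≈ 126 MiB, just under the 128 MB block size, which is the effect the added 1 is meant to achieve.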
Before the cleaning job is submitted, the ZooKeeper configuration is read according to the cleaning task information: the compression and storage configuration is looked up first and used to submit the cleaning job. In the MapReduce stage, the field configuration and the regular-expression matching relationships are then read from ZooKeeper according to the cleaning task information, and the data is cleaned. The principle of MapReduce is not discussed further here, as it is not the key technical point of this invention, which merely makes use of it. The present invention improves development efficiency, automatically optimizes the size of the HDFS block files, and reduces the chance of mistakes by new employees.
The above is only a preferred embodiment of the present invention. It should be noted that those of ordinary skill in the art can make improvements and modifications without departing from the principle of the present invention, and such improvements and modifications shall also be regarded as falling within the protection scope of the present invention.
Claims (6)
1. A universal original log cleaning device, characterized by comprising:
a variable storage module, used for storing metadata corresponding to each type of log, the regular expressions corresponding to each item of metadata, and the matched fields;
a configuration module, used for configuring multiple cleaning tasks and, for each cleaning task, the storage paths of the logs before and after cleaning, the storage format, and the compression format, the cleaning tasks being in one-to-one correspondence with the metadata;
a cleaning module, which identifies the corresponding metadata according to the log type, completes the cleaning logic with a MapReduce program according to the task configuration, and performs preset storage.
2. The universal original log cleaning device according to claim 1, characterized in that the configuration is stored in ZooKeeper.
3. A universal original log cleaning method, characterized by comprising:
establishing and storing metadata corresponding to each type of log, the regular expressions corresponding to each item of metadata, and the matched fields;
configuring and storing multiple cleaning tasks in one-to-one correspondence with the metadata, together with the storage path, storage format, and compression format corresponding to each cleaning task;
identifying the corresponding metadata according to the log type, completing the cleaning step with a MapReduce program according to the cleaning task configuration, and performing preset storage.
4. The universal original log cleaning method according to claim 3, characterized in that the configuration is stored in ZooKeeper.
5. The universal original log cleaning method according to claim 3, characterized in that, in the cleaning step, the MapReduce program automatically determines the number of reducers according to the size of the input data.
6. The universal original log cleaning method according to claim 3, characterized in that the data to be cleaned is stored in an HDFS directory.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611183585.0A CN106599244B (en) | 2016-12-20 | 2016-12-20 | General original log cleaning device and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106599244A true CN106599244A (en) | 2017-04-26 |
CN106599244B CN106599244B (en) | 2024-01-05 |
Family
ID=58600257
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611183585.0A Active CN106599244B (en) | 2016-12-20 | 2016-12-20 | General original log cleaning device and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106599244B (en) |
Cited By (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109359103A (en) * | 2018-09-04 | 2019-02-19 | 河南智云数据信息技术股份有限公司 | A kind of data aggregate cleaning method and system |
CN115509851A (en) * | 2022-09-14 | 2022-12-23 | 易纳购科技(北京)有限公司 | Page monitoring method, device and equipment |
Citations (9)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1908935A (en) * | 2006-08-01 | 2007-02-07 | 华为技术有限公司 | Search method and system of a natural language |
CN1983952A (en) * | 2005-12-14 | 2007-06-20 | 中兴通讯股份有限公司 | Method and system for synchronizing network administration data in network optimizing system |
US20140095558A1 (en) * | 2012-09-28 | 2014-04-03 | Samsung Electronics Co., Ltd. | Computing system and method of managing data thereof |
CN104298771A (en) * | 2014-10-30 | 2015-01-21 | 南京信息工程大学 | Massive web log data query and analysis method |
CN104391989A (en) * | 2014-12-16 | 2015-03-04 | 浪潮电子信息产业股份有限公司 | Distributed ETL (extract transform load) all-in-one machine system |
CN105447099A (en) * | 2015-11-11 | 2016-03-30 | 中国建设银行股份有限公司 | Log structured information extraction method and apparatus |
CN105706045A (en) * | 2013-07-19 | 2016-06-22 | 泰必高软件公司 | Semantics-oriented analysis of log message content |
CN106021554A (en) * | 2016-05-30 | 2016-10-12 | 北京奇艺世纪科技有限公司 | Log analysis method and device |
CN106227862A (en) * | 2016-07-29 | 2016-12-14 | 浪潮软件集团有限公司 | E-commerce data integration method based on distribution |
-
2016
- 2016-12-20 CN CN201611183585.0A patent/CN106599244B/en active Active
Also Published As
Publication number | Publication date |
---|---|
CN106599244B (en) | 2024-01-05 |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |