CN106599244B - General original log cleaning device and method - Google Patents
- Publication number
- CN106599244B CN106599244B CN201611183585.0A CN201611183585A CN106599244B CN 106599244 B CN106599244 B CN 106599244B CN 201611183585 A CN201611183585 A CN 201611183585A CN 106599244 B CN106599244 B CN 106599244B
- Authority
- CN
- China
- Prior art keywords
- log
- cleaning
- metadata
- storage
- storing
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/1805—Append-only file systems, e.g. using logs or journals to store data
- G06F16/1815—Journaling file systems
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/17—Details of further file system functions
- G06F16/174—Redundancy elimination performed by the file system
-
- G—PHYSICS
- G06—COMPUTING; CALCULATING OR COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/10—File systems; File servers
- G06F16/18—File system types
- G06F16/182—Distributed file systems
Abstract
The invention discloses a general original log cleaning device, comprising: a variable storage module for storing the metadata corresponding to each type of log, together with the regular expressions and matched fields corresponding to each set of metadata; a configuration module; and a cleaning module for identifying the corresponding metadata according to the log type, executing the cleaning logic with a MapReduce program according to the task configuration, and storing the result in a preset manner. The invention manages logs through metadata: a set of metadata is established for each type of log, the log and its variables are stored and reasonably configured, and this information is configured in a management background. Regular expressions screen out the logs that satisfy the rules, extract the important parameters, and finally establish the correspondence with the variables in variable storage.
Description
Technical Field
The invention relates to the technical field of big data processing, and in particular to a general original log cleaning device and method.
Background
When log analysis is performed, the log data is unstructured, and not all of it is wanted. The data must be cleaned, i.e. the character strings in it must be filtered and structured.
Large internet companies produce many kinds of logs, all of which need cleaning, and the data volume is huge, occupying roughly several terabytes of storage per day. Two problems follow: first, there are many log types and each needs its own cleaning, so processing every log with dedicated code wastes time; second, the sheer volume of logs occupies large storage resources, and the network I/O consumed when the logs are read back is high.
Disclosure of Invention
The invention aims to overcome the above technical defects in the prior art, and provides a general original log cleaning method in which flexible, custom configuration of the device completes the cleaning work for different logs.
The technical scheme adopted for realizing the purpose of the invention is as follows:
a general original log cleaning device comprises:
a variable storage module for storing the metadata corresponding to each type of log, together with the regular expressions and matched fields corresponding to the metadata;
a configuration module for configuring a plurality of cleaning tasks and, for each cleaning task, the storage path, storage format and compression format of the log before and after cleaning, the cleaning tasks corresponding one-to-one with the metadata;
and a cleaning module for identifying the corresponding metadata according to the log type, executing the cleaning logic with a MapReduce program according to the task configuration, and storing the result in the preset manner.
The configurations are stored using ZooKeeper.
A general original log cleaning method comprises:
establishing metadata corresponding to each type of log, and storing the regular expressions and matched fields corresponding to the metadata;
configuring and storing a plurality of cleaning tasks corresponding one-to-one with the metadata, along with the storage path, storage format and compression format of each cleaning task;
and identifying the corresponding metadata according to the log type, completing the cleaning step with a MapReduce program according to the cleaning task configuration, and storing the result in the preset manner.
The configurations are stored using ZooKeeper.
In the cleaning step, the MapReduce program automatically determines the number of reducers according to the size of the input data.
The data to be cleaned is stored in an HDFS directory.
Compared with the prior art, the invention has the beneficial effects that:
the invention manages through metadata: and establishing a set of metadata corresponding to each type of log, storing the log and the variable and reasonably configuring the log, and configuring the information in a management background. And the regular expression can be used for screening logs meeting the rule, intercepting important parameters and finally establishing a corresponding relation with the variables in the variable storage. And simultaneously, a mapreduce program is adopted, the number of required reduce is calculated according to the size of an original log file, and a cleaning logic is written through variable storage and configuration to finally finish a cleaning flow.
Drawings
Fig. 1 is a flow chart of a general original log cleaning method according to the present invention.
Detailed Description
The invention is described in further detail below with reference to the drawings and the specific examples. It should be understood that the specific embodiments described herein are for purposes of illustration only and are not intended to limit the scope of the invention.
To reduce the volume of data files, the compression formats currently in widest use are LZO and Snappy. Hadoop is a big-data platform architecture for distributed storage and distributed computation; on this platform, irregular logs are structured by a MapReduce program and then stored into HDFS for later use according to a custom storage format and compression format. This overcomes the prior-art defect that logs synchronized for different service requirements each receive separate processing with a high code-repetition rate, and it reduces the workload of developers.
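As a rough, illustrative sketch of the space saving that motivates compressed storage (gzip from the standard library stands in for LZO or Snappy, whose Python bindings are third-party packages; the log line is a made-up example):

```python
import gzip

# A made-up, repetitive access-log sample; real raw logs compress similarly well.
line = ('10.0.0.1 - - [20/Dec/2016:10:00:00 +0800] '
        '"GET /hdpb.gif?uid=1&vid=2 HTTP/1.1" 200 43\n')
raw = (line * 10_000).encode("utf-8")

compressed = gzip.compress(raw)
print(f"raw: {len(raw)} bytes, compressed: {len(compressed)} bytes, "
      f"ratio: {len(raw) / len(compressed):.0f}x")
```

In the patented flow the MapReduce job would emit files in the configured compression format rather than compressing in-process like this.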
The general original log cleaning device comprises a variable storage module, a configuration module and a cleaning module, wherein:
the variable storage module is used for storing the metadata corresponding to each type of log, together with the regular expressions and matched fields corresponding to the metadata;
the regular expression is stored in the variable storage module and is stored separately from the variables, and the function of the regular expression is to acquire the required fields, so that correctness must be ensured, for example:
^([0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3}.[0-9]{1,3})--\\[(.*)\\]\"GET/hdpb.gif\\?(.*)HTTP.*\"[0-9]{3}[0-9]{1,5}\"(.*)\"$"
the fields within brackets representing the fields to be extracted are classified, such as ip field, time field, parameter field, ua field. The metadata is set to be a plurality of according to the type of the data to be cleaned and the cleaning targets, and the reasonable metadata is selected according to the service requirements. The adoption of metadata simplifies the data model to be cleaned, and can realize the quick configuration of similar or approximate cleaning.
The configuration module is used for configuring a plurality of cleaning tasks and, for each cleaning task, the storage path, storage format and compression format of the log before and after cleaning; the cleaning tasks correspond one-to-one with the metadata. Each cleaning requirement is turned directly into a concrete, stored task together with its compression, storage format and other necessary factors, so the corresponding cleaning process can be run simply by invoking the matching task.
The cleaning module is used for identifying the corresponding metadata according to the log type, executing the cleaning logic with a MapReduce program according to the task configuration, and storing the result in the preset manner.
Both the configuration and the variable storage module are stored in ZooKeeper; the directory layout is as follows:
(1) First-level directory: /etl
(2) Second-level directory: /etl/$task, where $task is a specific cleaning task name; each cleaning task is unique.
(3) Third-level directories: /etl/$task/field, /etl/$task/outformat and /etl/$task/regexp. The regular expressions are stored at /etl/$task/regexp, the field variables are stored at /etl/$task/field, and the storage format and compression format are set in /etl/$task/outformat.
The third-level directories are the most important: they hold the variable values and the format configuration.
The content of regexp corresponds one-to-one with the groups of the regular expression and stores the type of each group; there are three types: ip, str and common.
The content of field is of the form uid,vid,pid, with field names separated by commas.
The content of outformat has only two values, separated by a comma, representing the compression format and the storage format respectively.
The invention manages logs through metadata: a set of metadata is established for each type of log, the log and its variables are stored and reasonably configured, and this information is configured in a management background. Regular expressions screen out the logs that satisfy the rules, extract the important parameters, and finally establish the correspondence with the variables in variable storage. Meanwhile, a MapReduce program is adopted that calculates the number of reducers required from the size of the original log file; the cleaning logic is written against the variable storage and configuration, finally completing the cleaning flow.
Meanwhile, the invention also discloses a general original log cleaning method, comprising the following steps:
establishing metadata corresponding to each type of log, and storing the regular expressions and matched fields corresponding to the metadata;
configuring and storing a plurality of cleaning tasks corresponding one-to-one with the metadata, along with the storage path, storage format and compression format of each cleaning task; the configurations are stored using ZooKeeper;
and identifying the corresponding metadata according to the log type, completing the cleaning step with a MapReduce program according to the cleaning task configuration, and storing the result in the preset manner. Meanwhile, the MapReduce program automatically determines the number of reducers according to the size of the input data, so as to produce block files of an appropriate size.
Number of reducers = (total input data size / (HDFS block size × 3)) + 1, where the default block size is 128 MB.
Here 3 is the compression ratio; this value can be adjusted for the compression algorithm so that the file produced by each reducer is slightly smaller than the block size.
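The reducer-count rule can be written out directly; the function below assumes the 128 MB default block size and the compression ratio of 3 stated above:

```python
def reducer_count(input_bytes: int,
                  block_bytes: int = 128 * 1024 * 1024,
                  compression_ratio: int = 3) -> int:
    """Number of reducers = input size / (block size * compression ratio) + 1,
    so each reducer's compressed output lands slightly under one block."""
    return input_bytes // (block_bytes * compression_ratio) + 1

# e.g. 1 TiB of raw logs at ratio 3 -> 2731 reducers
print(reducer_count(1024 ** 4))
```

Even an empty input yields one reducer, so the job always produces at least one output file.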
Before the cleaning job is submitted, the ZooKeeper configuration is read according to the cleaning task information: the compression and storage configuration is found first and used to submit the cleaning job; then, in the MapReduce stage, the field configuration and the regular-expression matching relation are read from ZooKeeper according to the cleaning task information, completing the cleaning of the data. The principle of MapReduce is not discussed here; it is not a key technical point of this invention, which merely uses these techniques. The invention improves development efficiency, automatically optimizes the HDFS block size, and reduces the probability of mistakes by new staff.
The foregoing is merely a preferred embodiment of the present invention. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present invention, and such modifications and adaptations are also to be regarded as within the scope of the present invention.
Claims (2)
1. A general original log cleaning method, characterized by comprising the following steps:
establishing metadata corresponding to each type of log, and storing the regular expressions and matched fields corresponding to the metadata;
configuring and storing a plurality of cleaning tasks corresponding one-to-one with the metadata, along with the storage path, storage format and compression format of each cleaning task, the configurations being stored using zookeeper;
and identifying the corresponding metadata according to the log type, completing the cleaning step with a mapreduce program according to the cleaning task configuration, and performing preset storage, the mapreduce program structuring the irregular logs and storing them into hdfs for later use according to a custom storage format and compression format, wherein in the cleaning step the mapreduce program automatically determines the number of reducers according to the size of the input data, and the number of reducers = (total input data size / (hdfs block size × 3)) + 1.
2. The general original log cleaning method according to claim 1, wherein the data to be cleaned is stored in an hdfs directory.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201611183585.0A CN106599244B (en) | 2016-12-20 | 2016-12-20 | General original log cleaning device and method |
Publications (2)
Publication Number | Publication Date |
---|---|
CN106599244A CN106599244A (en) | 2017-04-26 |
CN106599244B true CN106599244B (en) | 2024-01-05 |
Family
ID=58600257
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201611183585.0A Active CN106599244B (en) | 2016-12-20 | 2016-12-20 | General original log cleaning device and method |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN106599244B (en) |
Families Citing this family (2)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN109359103A (en) * | 2018-09-04 | 2019-02-19 | 河南智云数据信息技术股份有限公司 | A kind of data aggregate cleaning method and system |
CN115509851A (en) * | 2022-09-14 | 2022-12-23 | 易纳购科技(北京)有限公司 | Page monitoring method, device and equipment |
Citations (8)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN1908935A (en) * | 2006-08-01 | 2007-02-07 | 华为技术有限公司 | Search method and system of a natural language |
CN1983952A (en) * | 2005-12-14 | 2007-06-20 | 中兴通讯股份有限公司 | Method and system for synchronizing network administration data in network optimizing system |
CN104298771A (en) * | 2014-10-30 | 2015-01-21 | 南京信息工程大学 | Massive web log data query and analysis method |
CN104391989A (en) * | 2014-12-16 | 2015-03-04 | 浪潮电子信息产业股份有限公司 | Distributed ETL all-in-one machine system |
CN105447099A (en) * | 2015-11-11 | 2016-03-30 | 中国建设银行股份有限公司 | Log structured information extraction method and apparatus |
CN105706045A (en) * | 2013-07-19 | 2016-06-22 | 泰必高软件公司 | Semantics-oriented analysis of log message content |
CN106021554A (en) * | 2016-05-30 | 2016-10-12 | 北京奇艺世纪科技有限公司 | Log analysis method and device |
CN106227862A (en) * | 2016-07-29 | 2016-12-14 | 浪潮软件集团有限公司 | E-commerce data integration method based on distribution |
Family Cites Families (1)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
KR20140042428A (en) * | 2012-09-28 | 2014-04-07 | 삼성전자주식회사 | Computing system and data management method thereof |
2016
- 2016-12-20 CN CN201611183585.0A patent/CN106599244B/en active Active
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||