CN109918349A - Log processing method, device, storage medium and electronic device - Google Patents
- Publication number
- CN109918349A (application CN201910138347.5A)
- Authority
- CN
- China
- Prior art keywords
- log
- processed
- logs
- data
- kafka
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Abstract
The invention discloses a log processing method, device, storage medium, and electronic device. The method comprises: receiving multiple to-be-processed logs sent by an open-source component, where the format of the to-be-processed logs results from the open-source component converting the original logs into a preset log format; performing data cleansing on the to-be-processed logs to obtain multiple first target logs; and partitioning the first target logs at a preset time interval to obtain multiple second target logs. The invention thereby solves the technical problem in the related art that log processing is inefficient and cannot meet the demand for log data.
Description
Technical field
The present invention relates to the field of computing, and in particular to a log processing method, device, storage medium, and electronic device.
Background technique
With the arrival of the Internet era, the value of data has become increasingly prominent. Product data grows exponentially and is largely unstructured. Distributed processing platforms built on Spark and Hadoop form the storage and processing core of a big-data platform, providing powerful data-handling capacity and meeting interactive data demands. At the same time, Spark Streaming can effectively satisfy an enterprise's real-time data requirements and support the construction of a real-time metrics system.
However, existing log storage approaches are not real-time enough, scale poorly and tolerate faults poorly in distributed systems, and do not lend themselves to convenient ETL cleansing of big-data logs (ETL, short for Extract-Transform-Load, describes the process of extracting data from a source, transforming it, and loading it to a destination).
No effective solution to the above problems has yet been proposed.
Summary of the invention
Embodiments of the invention provide a log processing method, device, storage medium, and electronic device, to at least solve the technical problem in the related art that log processing is inefficient and cannot meet the demand for log data.
According to one aspect of the embodiments of the invention, a log processing method is provided, comprising: receiving multiple to-be-processed logs sent by an open-source component, where the format of the to-be-processed logs results from the open-source component converting the original logs into a preset log format; performing data cleansing on the to-be-processed logs to obtain multiple first target logs; and partitioning the first target logs at a preset time interval to obtain multiple second target logs.
According to another aspect of the embodiments of the invention, another log processing method is provided, comprising: formatting multiple collected logs using the preset log format to obtain multiple to-be-processed logs; storing each to-be-processed log into its corresponding topic folder; and sending the to-be-processed logs in the topic folders to a distributed processing platform.
According to another aspect of the embodiments of the invention, a log processing device is provided, comprising: a receiving module, configured to receive multiple to-be-processed logs sent by an open-source component, where the format of the to-be-processed logs results from the open-source component converting the original logs into a preset log format; a first determining module, configured to perform data cleansing on the to-be-processed logs to obtain multiple first target logs; and a second determining module, configured to partition the first target logs at a preset time interval to obtain multiple second target logs.
According to another aspect of the embodiments of the invention, another log processing device is provided, comprising: a third determining module, configured to format multiple collected logs using the preset log format to obtain multiple to-be-processed logs; a storage module, configured to store each to-be-processed log into its corresponding topic folder; and a sending module, configured to send the to-be-processed logs in the topic folders to a distributed processing platform.
According to another aspect of the embodiments of the invention, a log processing system is provided, comprising: a distributed processing platform, Spark, arranged to execute the first method above when running; and an open-source component, Kafka, connected to the distributed processing platform and arranged to execute the second method above when running.
According to yet another embodiment of the invention, a storage medium is provided, in which a computer program is stored, where the computer program is arranged to execute the steps of any of the above method embodiments when run.
According to yet another embodiment of the invention, an electronic device is provided, comprising a memory and a processor, where a computer program is stored in the memory and the processor is arranged to run the computer program to execute the steps of any of the above method embodiments.
In the embodiments of the invention, collected logs are formatted by the open-source component to obtain multiple to-be-processed logs, which are sent to the distributed processing platform; the distributed processing platform performs data cleansing on the to-be-processed logs to obtain multiple first target logs, and partitions the first target logs at a preset time interval to obtain multiple second target logs. Logs can thus be processed in real time and processing efficiency improved, solving the technical problem in the related art that log processing is inefficient and cannot meet the demand for log data.
Detailed description of the invention
The drawings described here provide a further understanding of the invention and constitute part of this application; the illustrative embodiments of the invention and their descriptions explain the invention and do not improperly limit it. In the drawings:
Fig. 1 is a hardware block diagram of a mobile terminal running a log processing method according to an embodiment of the invention;
Fig. 2 is a flow diagram (one) of a log processing method provided according to an embodiment of the invention;
Fig. 3 is a flow diagram (two) of a log processing method provided according to an embodiment of the invention;
Fig. 4 is a structural schematic diagram (one) of a log processing device provided according to an embodiment of the invention;
Fig. 5 is a structural schematic diagram (two) of a log processing device provided according to an embodiment of the invention;
Fig. 6 is a structural schematic diagram of a log processing system provided according to an embodiment of the invention.
Specific embodiment
To enable those skilled in the art to better understand the solution of the invention, the technical solutions in the embodiments of the invention are described below clearly and completely with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the invention rather than all of them. All other embodiments obtained by those of ordinary skill in the art based on the embodiments of the invention without creative effort shall fall within the scope of protection of the invention.
It should be noted that the terms "first", "second", and the like in the description, claims, and drawings are used to distinguish similar objects, not to describe a particular order or sequence. Data so labelled are interchangeable where appropriate, so that the embodiments described here can be implemented in orders other than those illustrated or described. In addition, the terms "comprise" and "have" and their variants are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device containing a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units not expressly listed or inherent to it.
According to the embodiments of the invention, a log processing method embodiment is provided. It should be noted that the steps illustrated in the flowcharts may be executed in a computer system such as a set of computer-executable instructions, and that although a logical order is shown in the flowcharts, the steps may in some cases be executed in an order different from that shown or described.
The method embodiments provided in the embodiments of the invention may be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking a mobile terminal as an example, Fig. 1 is a hardware block diagram of a mobile terminal running a log processing method according to an embodiment of the invention. As shown in Fig. 1, the mobile terminal 10 may include one or more processors 102 (only one is shown in Fig. 1; the processor 102 may include, but is not limited to, a microprocessor (MCU), a programmable logic device (FPGA), or another processing unit) and a memory 104 for storing data. Optionally, the mobile terminal may also include a transmission device 106 for communication and an input/output device 108. Those of ordinary skill in the art will appreciate that the structure shown in Fig. 1 is only illustrative and does not limit the structure of the mobile terminal; for example, the mobile terminal 10 may include more or fewer components than shown in Fig. 1, or a different configuration.
The memory 104 may be used to store computer programs, for example the software programs and modules of application software, such as the computer program corresponding to the log processing method in the embodiments of the invention. By running the computer program stored in the memory 104, the processor 102 executes various functional applications and data processing, thereby realising the above method. The memory 104 may include high-speed random-access memory and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, connected to the mobile terminal 10 over a network. Examples of such networks include, but are not limited to, the Internet, intranets, local-area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or send data via a network. Specific examples of such networks include the wireless network provided by the communications provider of the mobile terminal 10. In one example, the transmission device 106 includes a network interface controller (NIC), which can connect to other network devices through a base station and thereby communicate with the Internet. In another example, the transmission device 106 may be a radio-frequency (RF) module used to communicate with the Internet wirelessly.
Fig. 2 is a flow diagram (one) of the log processing method provided according to an embodiment of the invention. As shown in Fig. 2, the method includes the following steps:
Step S202: receive multiple to-be-processed logs sent by the open-source component, where the format of the to-be-processed logs results from the open-source component converting the original logs into the preset log format;
Step S204: perform data cleansing on the to-be-processed logs to obtain multiple first target logs;
Step S206: partition the first target logs at the preset time interval to obtain multiple second target logs.
Through the above steps, the collected logs are formatted by the open-source component to obtain multiple to-be-processed logs, which are sent to the distributed processing platform; the distributed processing platform performs data cleansing on the to-be-processed logs to obtain multiple first target logs, and partitions the first target logs at the preset time interval to obtain multiple second target logs. Logs can thus be processed in real time and processing efficiency improved, solving the technical problem in the related art that log processing is inefficient and cannot meet the demand for log data.
It should be noted that the executing subject of the above may be, but is not limited to, the distributed processing platform Spark.
It should be noted that the open-source component Kafka mentioned above is an open-source stream-processing platform developed by the Apache Software Foundation and written in Scala and Java. The project aims to provide a unified, high-throughput, low-latency platform for handling real-time data. Its data-persistence layer is essentially a massive publish/subscribe message queue built on a distributed transaction-log architecture, which makes it very valuable as enterprise-grade infrastructure for processing streaming data. In addition, Kafka can connect to external systems (for data input/output) through Kafka Connect, and provides Kafka Streams, a Java stream-processing library. Kafka stores messages that come from any number of processes called producers. The data can be assigned to different partitions within different topics. A topic is used to classify messages: every message entering Kafka is placed under a topic.
It should be noted that the format of the to-be-processed logs follows a Kafka-friendly specification that is convenient for structured parsing. The preset log format may be: [logtime] [operation] JSON, for example: [2013-04-10 01:00:09] [Click] {"time":12344343,"server":"1001"}.
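The preset format above can be parsed line by line before cleansing. The following is a minimal Python sketch under the assumption that every line looks like [logtime] [operation] {json}; the function name and regular expression are illustrative, not part of the patent.

```python
import json
import re

# Assumed shape of the preset format described above: [logtime] [operation] {json payload}
LOG_PATTERN = re.compile(
    r"^\[(?P<logtime>[^\]]+)\]\s*\[(?P<operation>[^\]]+)\]\s*(?P<payload>\{.*\})$")

def parse_preset_log(line):
    """Parse one line in the preset log format; return None for malformed lines."""
    m = LOG_PATTERN.match(line.strip())
    if m is None:
        return None
    try:
        payload = json.loads(m.group("payload"))
    except ValueError:
        return None
    return {"logtime": m.group("logtime"),
            "operation": m.group("operation"),
            "payload": payload}
```

For instance, parse_preset_log('[2013-04-10 01:00:09] [Click] {"time":12344343,"server":"1001"}') yields a record whose operation is "Click", while a malformed line yields None.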
In this embodiment, an application can write logs through a logging module, or collect client logs via the nginx access log, and store them locally; open-source collectors such as rsyslog, filebeat, or the scribe agent then gather the logs and produce them into a Kafka topic.
It should be noted that Apache Spark is an open-source cluster-computing framework originally developed by AMPLab at the University of California, Berkeley. In contrast to Hadoop MapReduce, which stores intermediate data on disk after each job, Spark uses in-memory computing and can perform analytic operations in memory before data is written to disk.
In an alternative embodiment, Spark performs data cleansing on the to-be-processed logs as follows: a preset action operator triggers the cleansing of the to-be-processed logs, and preset transformation operators clean the triggered logs. For example, Spark's rich transformations (transform), such as map, flatMap, filter, and user-defined functions (UDFs), implement the cleansing of the data, while action operators such as collect and save trigger the cleansing process.
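The transform/action split above can be sketched in plain Python, with no Spark dependency: lazy generators stand in for transformations, and materialisation stands in for an action. The function names and the "server" field are illustrative assumptions.

```python
import json

def clean_logs(raw_lines):
    """Sketch of the cleansing chain: filter out lines that are not JSON
    (the filter step), parse survivors into dicts (the map step), and drop
    records missing a required field. Generators keep the chain lazy, like
    Spark transformations."""
    parsed = (json.loads(line) for line in raw_lines if line.strip().startswith("{"))
    return (rec for rec in parsed if "server" in rec)

def collect(records):
    """Sketch of an 'action' such as collect(): materialise the lazy chain."""
    return list(records)
```

Nothing is parsed until collect() runs, mirroring how Spark actions trigger the transformation pipeline.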
In an alternative embodiment, the first target logs are partitioned at the preset time interval as follows: the log type and log time of each first target log are determined, where the log time is the time at which the log was obtained; each first target log is then partitioned by log type and log time and stored under a first predetermined directory, yielding the second target logs. For example, by setting a reasonable Spark Streaming batch interval (the preset time interval, e.g. 5 minutes), each batch is secondarily partitioned by logtype (log type) and logtime (log time) and written to the HDFS directory /warehouses/{logtype}/{logtime} (the first predetermined directory). The latency of the data is roughly the batch interval: with a 5-minute interval as above, the data appears in Hive with about a 5-minute delay. Setting a reasonable batch interval therefore achieves near-real-time data collection. In this embodiment, the Spark computing engine stores into the Hadoop storage system.
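The secondary-partition directory scheme above can be sketched as a small path-building function. The exact granularity of the logtime component is an assumption (day-level here); the /warehouses/{logtype}/{logtime} layout comes from the text.

```python
from datetime import datetime

def partition_path(logtype, logtime, base="/warehouses"):
    """Build the secondary-partition directory described above:
    /warehouses/{logtype}/{logtime}, with logtime truncated to the day.
    Day-level truncation is an illustrative assumption."""
    day = datetime.strptime(logtime, "%Y-%m-%d %H:%M:%S").strftime("%Y-%m-%d")
    return "{}/{}/{}".format(base, logtype, day)
```

Every record in one batch with the same logtype and day lands in the same directory, which is what makes the Hive partitions queryable as soon as the batch completes.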
It should be noted that the time slice, or batch interval, is an artificial quantisation of the stream data: the time slice is the unit by which the stream is split, and the data of one time slice corresponds to one RDD instance.
It should be noted that HDFS is the abbreviation of Hadoop Distributed File System, the distributed file system of Hadoop.
In an alternative embodiment, after the second target logs are obtained, they must be stored exactly once. It should be understood that before version 2.0, Spark Streaming could only achieve at-least-once delivery; the Spark framework alone cannot easily achieve exactly-once, which additionally requires a reliable data source and idempotent sinks. Spark Structured Streaming, however, can achieve exactly-once simply.
It should be noted that exactly-once means each piece of data is processed only once; it is one of the difficulties of real-time computing, the goal being that every record is handled exactly one time.
Specifically, the following approaches are included:
1) The second target logs are stored into a second predetermined directory in an exactly-once manner through a preset source interface. For example, using the Kafka source: in the source code, Source is an abstract trait (the preset source interface) defining the functions that Structured Streaming needs in order to achieve end-to-end exactly-once processing. In Kafka's implementation (KafkaSource), getOffset() reads the Kafka offsets saved by the previous batch, and a long-running consumer on the driver fetches the latest offsets of each topic from the Kafka brokers; the getBatch method returns a DataFrame of data according to the offsets; and commit saves a file on HDFS through the checkpoint mechanism, recording the Kafka offsets so that getOffset can obtain valid offset values after a failure. In this embodiment, getOffset() works as follows: each time, it reads the checkpoint file on HDFS to obtain the start position for this read of Kafka, and after the read completes, the end position of this read is updated into the checkpoint file.
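The getOffset()/commit cycle above can be sketched with a local file standing in for the HDFS checkpoint. The class name and the JSON checkpoint format are illustrative assumptions; real Structured Streaming checkpoints use their own on-disk layout.

```python
import json
import os

class CheckpointedOffsets:
    """Sketch of the checkpoint cycle described above: read the offsets the
    previous batch saved, and persist the end position once the batch is done,
    so a restart resumes without loss or duplication."""

    def __init__(self, path):
        self.path = path

    def get_offset(self):
        # Start of this read: offsets saved by the previous batch, or empty.
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)

    def commit(self, offsets):
        # End of this read: persist the new end positions.
        with open(self.path, "w") as f:
            json.dump(offsets, f)
```

On recovery, get_offset() returns the last committed positions, which is exactly the information the text says getOffset() obtains from the checkpoint file.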
2) The second target logs are stored into the second predetermined directory in an exactly-once manner through a preset batch function. For example, in an HDFS sink, the addBatch() method is what Structured Streaming needs in order to achieve end-to-end exactly-once processing. A concrete implementation of addBatch() keeps metadata under an HDFS directory recording the maximum batchId completed so far; when recovering from a failure, if the batchId of the batch being run is less than or equal to the maximum batchId in the metadata, the batch is skipped, which makes the write idempotent. In this embodiment, addBatch() behaves as follows: if the submitted batchId is not greater than the batchId recorded in the HDFS metadata, the batch is considered already written and is skipped directly; otherwise the log data is written to the partition directory and the batchId in the HDFS metadata is advanced.
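The addBatch() idempotence rule above can be sketched in a few lines, with in-memory fields standing in for the HDFS metadata file and partition directory; the class and field names are illustrative assumptions.

```python
class IdempotentSink:
    """Sketch of the idempotent addBatch() described above: write a batch only
    if its batchId is newer than the last one recorded in metadata, so that
    batches replayed after a failure are skipped."""

    def __init__(self):
        self.max_batch_id = -1   # stands in for the hdfs metadata record
        self.written = []        # stands in for the partition directory

    def add_batch(self, batch_id, rows):
        if batch_id <= self.max_batch_id:
            return False  # already written before the failure: skip
        self.written.extend(rows)
        self.max_batch_id = batch_id
        return True
```

Replaying the same batchId after recovery is a no-op, which is what turns at-least-once delivery from the source into exactly-once storage overall.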
It should be noted that Spark Streaming is an extension of the Spark core application programming interface (API) that enables high-throughput, fault-tolerant processing of real-time streaming data. It supports obtaining data from many sources, including Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP sockets; after data is obtained, high-level functions such as map, reduce, join, and window can be used for complex processing. Results can finally be stored to file systems, databases, and live dashboards. On the basis of "one stack to rule them all", Spark's other sub-frameworks, such as machine learning and graph computation, can also be applied to the stream data.
In an alternative embodiment, when it is determined that storing the second target logs to the second predetermined directory has failed, the to-be-processed logs can be recovered and re-processed in the following ways:
1) Within a first preset number of days, re-acquire the to-be-processed logs from the local cache. For example, logs are written to the local file system, rolled and compressed daily, and kept for 10 days; production machines have operating-system alarms that detect disk-capacity problems in time and avoid failed disk writes. Keeping 10 days guards against downstream failures; in the worst case, such as a problem going unnoticed over a long holiday, recovery can be re-driven from the data source.
2) Recover the to-be-processed logs from local disk using a log collection system. For example, the scribe agent's heartbeat detection and log-delay detection can surface problems in time; the agent records the metadata of the files it sends, most importantly the send position within each file, which enables precise recovery of failed data, and it transmits data over TCP.
It should be noted that Scribe is Facebook's open-source log collection system, applied extensively inside Facebook. It collects logs from various log sources and stores them on a central storage system (such as NFS or a distributed file system) for centralised statistical analysis. It provides a scalable, highly fault-tolerant scheme for "distributed collection, unified processing" of logs. The architecture of Scribe is fairly simple, comprising three parts: the scribe agent, scribe itself, and the storage system.
3) Recover the to-be-processed logs from the replicas stored in the open-source component. For example, with a replication factor of 3 on the Kafka cluster and log retention of 3 days, if Spark Structured Streaming fails, the data can be recovered directly from Kafka.
4) Recover the to-be-processed logs by reading their metadata files and offsets. For example, the Spark checkpoint mechanism saves state such as the Kafka offsets; when a failed task restarts, the Spark job reads the metadata and offset files under the checkpoint directory, and the Kafka consumer reads data from the specified offset of each partition, so nothing is repeated or omitted.
5) Obtain the to-be-processed logs from the stored replica files, where the replicas are stored on one of: the local node, a node in the local rack, or a node in a different rack. For example, under the HDFS block mechanism with a replication factor of 3, the HDFS replica-placement policy is: the first replica is placed on the local node, the second on another node in the local rack, and the third on a node in a different rack. This reduces inter-rack write traffic and improves write performance. Since the probability of rack failure is far smaller than that of node failure, this policy does not compromise data reliability and availability, while genuinely reducing the aggregate network bandwidth used by read operations.
6) Re-extract the data using Hive's insert overwrite. Compared with insert into, the covering write of insert overwrite is idempotent and can be re-run safely.
In an alternative embodiment, to increase log-processing throughput, Kafka can be scaled horizontally, specifically:
(1) Increase the number of Kafka brokers and the partitions of the topics. With Kafka's default partitioner, abs(key.hashCode) % numPartitions, a topic's data can be distributed across more brokers and machines, increasing the data-handling capacity upstream and downstream.
(2) After the partitions of Kafka are increased, Spark can, according to Kafka's consumer-group mechanism, set a matching number of RDD partitions, increasing the data-processing capacity.
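The default partitioning rule in (1) can be sketched directly. Note this is only an illustration of abs(key.hashCode) % numPartitions: Python's hash() stands in for Java's String.hashCode, so the actual placement would differ from a real Kafka cluster.

```python
def default_partition(key, num_partitions):
    """Sketch of Kafka's default key partitioning described above:
    abs(key.hashCode) % numPartitions. hash() is a stand-in for
    Java's hashCode, used only to show the modular distribution."""
    return abs(hash(key)) % num_partitions
```

Adding partitions (raising num_partitions) spreads the same key space over more brokers, though records with the same key stay on one partition, preserving per-key ordering.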
In conclusion high performance kafka information middleware, spark on yarn's is excellent by the real-time collecting of log
Different distributed nature: it is high fault-tolerant, it easily extends, extractly once is solved and aimed at dilatation in distributed system day, is held
The problem of mistake is lacking, and can not easily carry out the ETL cleaning of big data log.
In an alternative embodiment, to increase Spark's log-processing efficiency, the Spark application can be tuned as follows.
Tuning the parameters of a long-running Spark program:
spark-submit --master yarn --deploy-mode cluster \
  --conf spark.yarn.maxAppAttempts=4 \
  --conf spark.yarn.am.attemptFailuresValidityInterval=1h \
  --conf spark.yarn.max.executor.failures={8*num_executors} \
  --conf spark.yarn.executor.failuresValidityInterval=1h \
  --conf spark.task.maxFailures=8
When running in cluster mode, any error in the Spark driver can stop a long-running job. Fortunately, spark.yarn.maxAppAttempts=4 configures the maximum number of attempts to rerun the application. If the application runs for days or weeks without restart or redeployment in a heavily used cluster, the 4 attempts could be exhausted within a few hours; to avoid this, the attempt counter should be reset every hour (spark.yarn.am.attemptFailuresValidityInterval=1h). Another important setting is the maximum number of executor failures before the application fails: by default it is max(2 * numExecutors, 3), which suits batch jobs but not long-running ones, and the property likewise has a corresponding validity interval. For long-running jobs, one should also consider raising the maximum number of task failures before giving up (spark.task.maxFailures=8). With these configuration items, a Spark application can keep running reliably and long-term in a busy distributed environment.
In an alternative embodiment, in Spark, a first table is established according to the secondary-partition storage format, where the first table stores the summary information of the second target logs; a first sub-table is established under the first table, where the sub-table stores the query information for querying the second target logs; and partitions are added to the sub-table on a schedule, where each partition stores the query information of newly added logs.
For example, according to the secondary-partition storage format of the ETL above, a summary table is established to obtain the summary information of the logs, and a sub-table per logtype directory is then established for tasks to query and use. Partitions are added to the tables by a scheduler on a timer. On this basis, a python script is written to recognise newly added logs and log formats, establish an automatic mapping from the JSON format to the Hive schema, automatically generate CREATE TABLE statements, and establish the tables in the Hive warehouse. This automation reduces manual maintenance and lets new data come online quickly.
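The JSON-to-Hive-schema mapping performed by the script above can be sketched as follows. The type-mapping table and function names are illustrative assumptions; a production script would also handle nested objects and sample several records.

```python
def hive_type(value):
    """Map a sample JSON value to a Hive column type (illustrative mapping)."""
    if isinstance(value, bool):
        return "BOOLEAN"
    if isinstance(value, int):
        return "BIGINT"
    if isinstance(value, float):
        return "DOUBLE"
    return "STRING"

def create_table_ddl(table, sample):
    """Generate a CREATE TABLE statement from one sample JSON record,
    as the automated table-creation script described above does."""
    cols = ", ".join("`{}` {}".format(k, hive_type(v)) for k, v in sample.items())
    return "CREATE TABLE IF NOT EXISTS {} ({})".format(table, cols)
```

Feeding in one representative record per new logtype yields the DDL needed to register the table in the Hive warehouse without manual work.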
In an alternative embodiment, the small log files can be compressed and merged: each second target log is split to obtain multiple decomposition logs; the decomposition logs are parsed; the parsed decomposition logs within a preset time period are merged to obtain a merged log; and the merged log is compressed.
For example, to pursue a near-real-time effect, Spark Streaming splits each day's logs into many small decompositions, which is unfriendly to the NameNode's memory and the DataNodes' disks. A scheduling system therefore merges and compresses the previous week's small files. In particular, the optimisation sets the data volume and concurrency of the MapReduce map and reduce stages, configures dynamic partitioning, and applies snappy compression to intermediate and output results; insert overwrite statements are then used so that merging and compression do not affect queries in progress.
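The compaction step above amounts to grouping small decomposition files into merge batches of a bounded size. A minimal sketch, assuming files are given as (name, size) pairs and a byte threshold; the threshold and greedy grouping strategy are illustrative, not from the patent.

```python
def plan_merges(files, max_size):
    """Greedily group small decomposition files into merge batches whose
    combined size stays within max_size bytes, as the scheduled
    compaction job described above would."""
    batches, current, current_size = [], [], 0
    for name, size in files:
        if current and current_size + size > max_size:
            batches.append(current)
            current, current_size = [], 0
        current.append(name)
        current_size += size
    if current:
        batches.append(current)
    return batches
```

Each resulting batch becomes one merged (and then compressed) output file, shrinking the file count the NameNode must track.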
In an alternative embodiment, the storage of the logs is monitored: the storage of the second target logs is monitored, and when storage fails, an alarm is raised. For example, Spark's event listener is rewritten for monitoring, metrics are sent through statsd and stored in graphite, and alarm rules are configured through cabot; once a rule is triggered, the alarm service raises the alert. In addition, the RESTful interface provided by YARN is polled in real time for the job state and the service is pulled up automatically, greatly improving the reliability of the job.
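The statsd/alarm path above can be sketched as two small helpers: one formats a metric line as it would be sent to statsd over UDP, the other applies a cabot-style threshold rule. The metric name and threshold are illustrative assumptions.

```python
def statsd_gauge(name, value):
    """Format one statsd gauge line in the plain-text statsd wire format
    (name:value|g); the metric name scheme is an assumption."""
    return "{}:{}|g".format(name, value)

def should_alert(failed_batches, threshold=1):
    """Sketch of a threshold alarm rule: alert once the number of failed
    storage batches reaches the threshold."""
    return failed_batches >= threshold
```

A rewritten event listener would emit statsd_gauge("...", n) on each batch completion, and the alarm rule fires as soon as storage failures appear.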
In conclusion realizing fault-tolerant, the dilatation of storage by the technology of spark on yarn, monitoring distributed is healthy and strong
System schema, the effective stability that product is provided, real-time, the extensive reliable memory of log of data analysis.Simultaneously
In view of the merging and compression of cold and hot data, in the performance for not influencing data analysis, the reasonable load for reducing hadoop.
Fig. 3 is a flow diagram (two) of the log processing method provided according to an embodiment of the present invention. As shown in Fig. 3, the method includes the following steps:
Step S302: the multiple collected logs are formatted using the default log format to obtain multiple logs to be processed;
Step S304: the multiple logs to be processed are stored into topic folders each corresponding to one of the logs to be processed;
Step S306: the multiple logs to be processed in the topic folders are sent to the distributed processing platform.
Through the above steps, the collected logs are formatted by the open source component to obtain multiple logs to be processed, and the logs to be processed are sent to the distributed processing platform; the distributed processing platform performs data cleansing on the multiple logs to be processed to obtain multiple first target logs, and performs partition processing on the multiple first target logs according to the preset time interval to obtain multiple second target logs. Logs can thus be processed in real time and the efficiency of log processing is improved, which solves the technical problem in the related art that the processing efficiency of logs is low and cannot meet the demands placed on log data.
It should be noted that the executing subject in the above may be the open source component kafka, but is not limited thereto.
It should be noted that the open source component Kafka mentioned above is an open source stream processing platform developed by the Apache Software Foundation and written in Scala and Java. The goal of the project is to provide a unified, high-throughput, low-latency platform for handling real-time data. Its persistence layer is essentially a large-scale publish/subscribe message queue built on a distributed transaction log architecture, which makes it very valuable as enterprise-level infrastructure for processing streaming data. In addition, Kafka can connect to external systems (for data input/output) through Kafka Connect, and provides Kafka Streams, a Java stream processing library. Kafka stores messages coming from any number of processes called "producers" (Producer). The data can thus be assigned to different "partitions" (Partition) under different "Topics". A Topic is used to classify messages, and every piece of information entering Kafka can be placed under a Topic.
It should be noted that the format of the multiple logs to be processed is the Kafka specification, which facilitates structured parsing. The default log format may be: [logtime] [operation] JSON; for example: [2013-04-10 1:00:09] [Click] {"time":12344343,"server":"1001"}.
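The default format above can be parsed with a short routine. This is a sketch under stated assumptions: the bracketed-fields-then-JSON layout follows the example line, and the function/regex names are invented for illustration.

```python
import json
import re

# Assumed line layout: [logtime] [operation] {json payload}
_LOG_RE = re.compile(
    r"^\[(?P<logtime>[^\]]+)\]\s*\[(?P<operation>[^\]]+)\]\s*(?P<payload>\{.*\})$"
)

def parse_log_line(line):
    """Split one formatted log line into (logtime, operation, payload dict);
    malformed lines return None and can be dropped by the cleansing step."""
    m = _LOG_RE.match(line)
    if m is None:
        return None
    return m.group("logtime"), m.group("operation"), json.loads(m.group("payload"))

rec = parse_log_line('[2013-04-10 1:00:09] [Click] {"time":12344343,"server":"1001"}')
# → ('2013-04-10 1:00:09', 'Click', {'time': 12344343, 'server': '1001'})
```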
In this embodiment, an application program can write logs through a logging module, or collect client logs through the nginx access log and store them locally, and then use an agent tool such as rsyslog, filebeat or the scribe agent to collect the logs and produce them into a topic of kafka.
It should be noted that Apache Spark is an open source cluster computing framework originally developed by AMPLab at the University of California, Berkeley. In contrast to Hadoop's MapReduce, which stores intermediate data on disk after a job has run, Spark uses in-memory computing and can analyze data in memory before it has been written to disk.
It should be noted that Kafka stores messages coming from any number of processes called "producers" (Producer). The data can thus be assigned to different "partitions" (Partition) under different "Topics". Within a partition, the messages are indexed and stored together with their timestamps. Other processes, called "consumers" (Consumer), can query messages from a partition. Kafka runs on a cluster composed of one or more servers, and partitions can be distributed across cluster nodes.
Kafka efficiently processes real-time streaming data and can be integrated with Storm, HBase and Spark. Deployed as a cluster on multiple servers, Kafka handles all of its publish and subscribe messaging through four APIs, namely the Producer API, Consumer API, Streams API and Connector API. It can deliver large-scale streaming messages with built-in fault tolerance, and has replaced some conventional message systems such as JMS and AMQP.
The main terms of the Kafka framework include Topic, Record and Broker. A Topic consists of Records, Records hold different information, and Brokers are responsible for replicating messages. Kafka has four main APIs:
Producer API: supports an application publishing a stream of Records.
Consumer API: supports an application subscribing to Topics and processing streams of Records.
Streams API: converts input streams into output streams and produces results.
Connector API: implements reusable producer and consumer APIs that can link Topics to existing applications.
Topic: used to classify messages; every piece of information entering Kafka can be placed under a Topic.
Broker: the host server used to realize data storage.
Partition: the messages in each Topic can be divided into several Partitions to improve the processing efficiency of messages.
An embodiment of the present invention also provides a log processing device. Fig. 4 is a structural schematic diagram (one) of the log processing device provided according to an embodiment of the present invention. As shown in Fig. 4, the device includes:
a receiving module 42, configured to receive multiple logs to be processed sent by the open source component, wherein the log format of the multiple logs to be processed results from the open source component converting the multiple logs using the default log format;
a first determining module 44, configured to perform data cleansing on the multiple logs to be processed to obtain multiple first target logs;
a second determining module 46, configured to perform partition processing on the multiple first target logs according to the preset time interval to obtain multiple second target logs.
Through the above modules, the collected logs are formatted by the open source component to obtain multiple logs to be processed, and the logs to be processed are sent to the distributed processing platform; the distributed processing platform performs data cleansing on the multiple logs to be processed to obtain multiple first target logs, and performs partition processing on the multiple first target logs according to the preset time interval to obtain multiple second target logs. Logs can thus be processed in real time and the efficiency of log processing is improved, which solves the technical problem in the related art that the processing efficiency of logs is low and cannot meet the demands placed on log data.
It should be noted that the open source component Kafka mentioned above is an open source stream processing platform developed by the Apache Software Foundation and written in Scala and Java. The goal of the project is to provide a unified, high-throughput, low-latency platform for handling real-time data. Its persistence layer is essentially a large-scale publish/subscribe message queue built on a distributed transaction log architecture, which makes it very valuable as enterprise-level infrastructure for processing streaming data. In addition, Kafka can connect to external systems (for data input/output) through Kafka Connect, and provides Kafka Streams, a Java stream processing library. Kafka stores messages coming from any number of processes called "producers" (Producer). The data can thus be assigned to different "partitions" (Partition) under different "Topics". A Topic is used to classify messages, and every piece of information entering Kafka can be placed under a Topic.
It should be noted that the format of the multiple logs to be processed is the Kafka specification, which facilitates structured parsing. The default log format may be: [logtime] [operation] JSON; for example: [2013-04-10 1:00:09] [Click] {"time":12344343,"server":"1001"}.
In this embodiment, an application program can write logs through a logging module, or collect client logs through the nginx access log and store them locally, and then use an open source agent tool such as rsyslog, filebeat or the scribe agent to collect the logs and produce them into a topic of kafka.
It should be noted that Apache Spark is an open source cluster computing framework originally developed by AMPLab at the University of California, Berkeley. In contrast to Hadoop's MapReduce, which stores intermediate data on disk after a job has run, Spark uses in-memory computing and can analyze data in memory before it has been written to disk.
In an alternative embodiment, Spark performs data cleansing on the multiple logs to be processed in the following manner: the data cleansing of the multiple logs to be processed is triggered using default action algorithms, and the triggered multiple logs to be processed are cleansed using default transformation algorithms. For example, spark's rich transformation operators (transform), such as map, flatmap, filter and user-defined functions (User-Defined Function, UDF for short), realize the cleansing of the data, and action operators (action) such as collect and save trigger the data cleansing process.
In an alternative embodiment, partition processing is performed on the multiple first target logs according to the preset time interval in the following manner: the log type and the log time of each first target log among the multiple first target logs are determined, wherein the log time is the time at which each first target log was obtained; and each first target log is partitioned and stored into the first predetermined directory based on the log type and the log time, to obtain the multiple second target logs. For example: by setting a reasonable batch interval for spark streaming (i.e., the preset time interval, for example 5 minutes), each batch is secondarily partitioned according to logtype (log type) and logtime (log time) and written under the HDFS directory /warehouses/{logtype}/{logtime} (the first predetermined directory) for storage. The delay of the data is roughly equal to the batch interval: with the 5-minute setting above, the presentation of the data in hive is also delayed by 5 minutes. By setting a reasonable batch interval, near-real-time data collection can be achieved. In this embodiment, the Spark computing engine stores the data into the hadoop storage system.
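The secondary-partition directory layout can be sketched as a small path builder. Day-level granularity for logtime and the function name are assumptions for illustration.

```python
from datetime import datetime

def partition_path(logtype, batch_time, root="/warehouses"):
    """Build the /warehouses/{logtype}/{logtime} output directory for one
    batch, partitioned first by log type and then by log time (day-level
    granularity assumed here)."""
    return "%s/%s/%s" % (root, logtype, batch_time.strftime("%Y-%m-%d"))

p = partition_path("click", datetime(2013, 4, 10, 1, 0, 9))
# → '/warehouses/click/2013-04-10'
```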
It should be noted that the time slice, or batch processing interval (batch interval), is an artificial quantitative standard applied to streaming data; the time slice serves as the basis on which the stream of data is split. The data of one time slice corresponds to one RDD instance.
It should be noted that HDFS is the abbreviation of Hadoop Distributed File System, that is, a distributed file system of Hadoop.
In an alternative embodiment, after the multiple second target logs are obtained, exactly-once storage needs to be realized for the multiple second target logs. It should be understood that before 2.0, Spark Streaming could only achieve at-least-once (at least once) semantics; the spark framework itself can hardly achieve exactly once (exactly once) for you, which requires the program to rely directly on a reliable data source and a sink that supports idempotent operations. Spark Structured Streaming, however, can realize Exactly Once simply.
It should be noted that Exactly-once means that every piece of data is processed only once; it is one of the difficult points of real-time computing, with the aim that each record is handled exactly one time.
Specifically, the following manners are included:
1) the multiple second target logs are stored into the second predetermined directory in an exactly-once (Exactly once) manner through the default feature source interface. For example: kafka Source is utilized. Specifically, in the source code, Source is an abstract interface trait Source (i.e., the default feature source interface), which contains the functions that Structured Streaming necessarily requires to realize end-to-end exactly-once processing. getOffset() of Kafka's source (KafkaSource) reads the kafka offsets saved by the previous batch; a long-running consumer at the driver end can fetch the latest offsets of each topic from the kafka brokers; the getBatch method returns a DataFrame of data according to the offsets; and commit saves a file on hdfs through the checkpoint mechanism, recording the kafka offsets, so that getOffset can obtain valid offset values after a failure. In this embodiment, the meaning of getOffset() is as follows: each time, the checkpoint file on hdfs is read to obtain the start position of this read from kafka; after the read completes, the end position of this read is updated into the checkpoint file.
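The getOffset()/commit() cycle described above can be sketched with a file-backed checkpoint. This is a stand-in, not KafkaSource's real implementation: a local JSON file plays the role of the checkpoint file on hdfs, and the class/method names are invented for illustration.

```python
import json
import os
import tempfile

class OffsetCheckpoint:
    """Each batch starts from the offsets the previous batch recorded
    (get_offset) and records its own end position when done (commit)."""

    def __init__(self, path):
        self.path = path

    def get_offset(self):
        # Read the offsets saved by the previous batch; empty on first run.
        if not os.path.exists(self.path):
            return {}
        with open(self.path) as f:
            return json.load(f)

    def commit(self, offsets):
        # Record the end position of this read for the next batch to resume from.
        with open(self.path, "w") as f:
            json.dump(offsets, f)

ckpt = OffsetCheckpoint(os.path.join(tempfile.mkdtemp(), "offsets.json"))
ckpt.commit({"logs-0": 42})      # end of this batch's read
restored = ckpt.get_offset()     # after a failure, resume from here
```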
2) the multiple second target logs are stored into the second predetermined directory in an exactly-once (Exactly once) manner through the default batch processing function. For example: hdfsSink: its addBatch() method alone supports the function that Structured Streaming necessarily requires to realize end-to-end exactly-once processing. The specific implementation of hdfsSink's addBatch() is that metadata under an hdfs directory records the maximum batchId completed so far; when recovering from a failure, if the batchId of a job is less than or equal to the maximum batchId in the metadata, the job can be skipped, thereby realizing the idempotence of data writing. In this embodiment, addBatch() includes the following functions: if the batchId submitted this time is not greater than the batchId recorded in the hdfs metadata, the task with this batchId is considered to have already been written and is skipped directly; if it is greater, the log data is written into the partition directory, and the batchId in the hdfs metadata is incremented by 1.
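The idempotence rule above reduces to a few lines. This sketch keeps the "metadata" in memory where the real sink would keep it in a file on hdfs; names are illustrative.

```python
class IdempotentSink:
    """addBatch()-style idempotence: a batch whose id is not greater than the
    last recorded id was already written, so a replay after failure is skipped."""

    def __init__(self):
        self.max_batch_id = -1   # stands in for the metadata file on hdfs
        self.rows = []           # stands in for the partition directory

    def add_batch(self, batch_id, data):
        if batch_id <= self.max_batch_id:
            return False         # replayed batch: already written, skip
        self.rows.extend(data)   # write the partition directory
        self.max_batch_id = batch_id
        return True

sink = IdempotentSink()
sink.add_batch(0, ["a"])
sink.add_batch(0, ["a"])  # failure replay: ignored, no duplicate rows
# sink.rows → ['a']
```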
It should be noted that Spark Streaming is an extension of the Spark core application programming interface (Application Programming Interface, API for short) that realizes high-throughput, fault-tolerant processing of real-time streaming data. It supports obtaining data from multiple data sources, including Kafka, Flume, Twitter, ZeroMQ, Kinesis and TCP sockets; after data is obtained from a data source, complex algorithms can be processed using high-level functions such as map, reduce, join and window. The processing results can finally be stored to file systems, databases and live dashboards. On the basis of "One Stack to rule them all", the other sub-frameworks of Spark, such as machine learning and graph computation, can also be applied to the streaming data.
In an alternative embodiment, when it is determined that storing the multiple second target logs to the second predetermined directory has failed, the multiple logs to be processed can be recovered in the following manners and the logs processed again:
1) within a first preset number of days, the multiple logs to be processed are re-obtained from the local cache. For example: logs are printed to the local file system, rolled daily, and kept compressed for 10 days. First, the online machines have complete operating-system alarms that detect hard-disk capacity problems in time, avoiding failures when writing to disk. Keeping 10 days guards against problems in downstream systems; the worst case is a failure that goes unnoticed over a long holiday such as the National Day holiday, after which recovery can be pushed again from the data source.
2) the multiple logs to be processed are recovered from the local disk using the log collection system. For example: the scribe agent's heartbeat detection and log delay detection can discover problems in time; the scribe agent records the metadata of the file being sent, most importantly the send position within the file, which can be used to recover precisely from a data failure, and the TCP protocol is adopted to send the data.
It should be noted that Scribe is the log collection system open-sourced by facebook, which has been applied extensively inside facebook. It can collect logs from various log sources and store them on a central storage system (which can be NFS, a distributed file system, etc.) for centralized statistical analysis and processing. It provides a scalable, highly fault-tolerant scheme for the "distributed collection and unified processing" of logs. The architecture of scribe is fairly simple and mainly includes three parts, namely the scribe agent, scribe, and the storage system.
3) the multiple logs to be processed are recovered from the copies stored in the open source component. For example: the number of copies on the kafka line is 3, and logs are persisted for 3 days. If spark structured streaming fails, the data can be recovered directly from kafka.
4) the metadata file (metadata) and the offsets (offset) of the multiple logs to be processed are read to recover the multiple logs to be processed. For example: the spark checkpoint mechanism saves state information such as the kafka offsets; when a task fails, the spark task on start-up can read the metadata and offset files under the checkpoint file, and the kafkaConsumer can read data from the specified offset of each partition, so that nothing is repeated or omitted.
5) the multiple logs to be processed are obtained from multiple backup files in storage, wherein the multiple backup files are stored on one of the following: the local node, a node in the local rack, and a node in a different rack. For example, the block mechanism of hdfs: the number of file copies is 3, and the replica placement strategy of HDFS is to place the first copy on the local node, the second copy on another node in the local rack, and the third copy on a node in a different rack. This reduces the write traffic between racks, thereby improving write performance; the probability of a rack failure is far smaller than that of a node failure. This manner does not compromise data reliability and availability, while it does reduce the aggregate network bandwidth used by read operations.
6) insert overwrite: the data extraction of Hive is operated with the covering write of insert overwrite; compared with insert into, it can realize idempotence and can be run repeatedly.
In an alternative embodiment, in order to increase the efficiency of log processing, kafka can be scaled horizontally, which specifically includes:
(1) increasing the number of kafka brokers and the partitions of the topics; according to kafka's default partitioner, abs(key.hashCode) % numPartitions, the data of a topic can be distributed over more brokers and machines, increasing the data processing capacity of the upstream and downstream.
(2) after the partitions of kafka are increased, spark can set the same number of partitions for the rdd according to kafka's consumer group mechanism, increasing the throughput for the data.
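The partitioner formula above can be sketched directly. Java's `key.hashCode()` is replaced here by a simple stable byte sum so the sketch is self-contained; the real kafka partitioner hashes differently, so the exact partition numbers below are illustrative only.

```python
def default_partition(key, num_partitions):
    """Kafka-default-style choice, abs(hash(key)) % numPartitions, with a
    stable byte sum standing in for Java's key.hashCode()."""
    h = sum(key.encode("utf-8"))
    return abs(h) % num_partitions

# Adding partitions spreads the same key space over more brokers/machines:
p4 = default_partition("server-1001", 4)   # → 2
p8 = default_partition("server-1001", 8)   # → 6
```

Because the partition is a pure function of the key, all messages with the same key still land in the same partition after the modulus, preserving per-key ordering within a partition.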
In conclusion high performance kafka information middleware, spark on yarn's is excellent by the real-time collecting of log
Different distributed nature: it is high fault-tolerant, it easily extends, extractly once is solved and aimed at dilatation in distributed system day, is held
The problem of mistake is lacking, and can not easily carry out the ETL cleaning of big data log.
In an alternative embodiment, in order to increase the efficiency of spark's log processing, the spark application can be optimized, specifically as follows.
Optimization of the running parameters of a long-running spark program:
spark-submit --master yarn --deploy-mode cluster \
  --conf spark.yarn.maxAppAttempts=4 \
  --conf spark.yarn.am.attemptFailuresValidityInterval=1h \
  --conf spark.yarn.max.executor.failures={8*num_executors} \
  --conf spark.yarn.executor.failuresValidityInterval=1h \
  --conf spark.task.maxFailures=8
When the program runs in cluster mode, any error in the Spark driver stops the long-running job. Fortunately, spark.yarn.maxAppAttempts=4 configures the maximum number of attempts to rerun the application. If the application is to run for days or weeks without restarts or redeployments, the 4 attempts could be exhausted within a few hours in a heavily used cluster. To avoid this situation, the attempt counter should be reset every hour (spark.yarn.am.attemptFailuresValidityInterval=1h). Another important setting is the maximum number of executor failures before the application fails: by default it is max(2 * numExecutors, 3), which is well suited to batch jobs but not to long-running ones, and this property likewise has a corresponding validity interval. For long-running jobs, raising the maximum number of task failures (spark.task.maxFailures=8) should also be considered before giving up the job. Through these configuration items, a spark application can keep running reliably and long-term in a distributed environment.
In an alternative embodiment, in spark, a first table is established according to the secondary-partition storage format, wherein the first table is used to store the summary information of the multiple second target logs; a first sub-table is established under the first table, wherein the first sub-table is used to store the query information for querying the multiple second target logs; and partitions are added for the first sub-table periodically, wherein a partition is used to store the query information of newly added logs.
For example: according to the secondary-partition storage format of the preceding ETL, a summary table is established for obtaining the summary information of the logs, and then a sub-table is established on each logtype directory for the querying and use of tasks. Partitions are added for the tables by scheduled scheduling. On this basis, a python script is written to identify newly added logs and log formats, an automatic mapping from the json format to the hive schema is established, the CREATE TABLE statement is generated automatically, and the table is created in the hive warehouse. This automates the process, reducing manual maintenance and allowing new data to be brought online quickly.
In an alternative embodiment, the small log files can be merged and compressed: each second target log among the multiple second target logs is split to obtain multiple decomposed logs; the multiple decomposed logs are parsed; the parsed decomposed logs falling within a preset time period are merged to obtain a merged log; and the merged log is compressed.
For example: to pursue a near-real-time effect, spark streaming splits each day's log into many small files, which is unfriendly to the namenode's memory and the datanode's disks. A scheduling system is therefore used to merge and compress the previous week's small files. In particular, the data volume and concurrency handled by the map and reduce stages of mapreduce are configured, dynamic partitioning is enabled, and snappy compression is set for intermediate results and output results. An insert overwrite statement is then used, so that queries are not affected while the merge and compression are in progress.
In an alternative embodiment, the storage of logs is monitored: the storage of the multiple second target logs is monitored, and an alarm is raised if storage fails. For example: the event listener of spark is rewritten for monitoring, metrics are sent through statsd and stored in graphite, and alarm rules are configured through cabot; once a rule is triggered, an alarm is raised through the alarm service. In addition, the restful interface provided by yarn is polled in real time for the job state and the service is pulled up automatically, which greatly improves the reliability of the job.
In summary, the spark on yarn technology realizes a fault-tolerant, scalable, monitored, distributed and robust system architecture, which effectively guarantees the stability of the product, the real-time performance of data analysis, and the large-scale reliable storage of logs. At the same time, the merging and compression of hot and cold data is taken into account, reasonably reducing the load on hadoop without affecting the performance of data analysis.
An embodiment of the present invention also provides a log processing device. Fig. 5 is a structural schematic diagram (two) of the log processing device provided according to an embodiment of the present invention. As shown in Fig. 5, the device includes:
a third determining module 52, configured to format the multiple collected logs using the default log format to obtain multiple logs to be processed;
a storage module 54, configured to store the multiple logs to be processed into topic folders each corresponding to one of the logs to be processed;
a sending module 56, configured to send the multiple logs to be processed in the topic folders to the distributed processing platform.
Through the above modules, the collected logs are formatted by the open source component to obtain multiple logs to be processed, and the logs to be processed are sent to the distributed processing platform; the distributed processing platform performs data cleansing on the multiple logs to be processed to obtain multiple first target logs, and performs partition processing on the multiple first target logs according to the preset time interval to obtain multiple second target logs. Logs can thus be processed in real time and the efficiency of log processing is improved, which solves the technical problem in the related art that the processing efficiency of logs is low and cannot meet the demands placed on log data.
It should be noted that the executing subject in the above may be the open source component kafka, but is not limited thereto.
It should be noted that the open source component Kafka mentioned above is an open source stream processing platform developed by the Apache Software Foundation and written in Scala and Java. The goal of the project is to provide a unified, high-throughput, low-latency platform for handling real-time data. Its persistence layer is essentially a large-scale publish/subscribe message queue built on a distributed transaction log architecture, which makes it very valuable as enterprise-level infrastructure for processing streaming data. In addition, Kafka can connect to external systems (for data input/output) through Kafka Connect, and provides Kafka Streams, a Java stream processing library. Kafka stores messages coming from any number of processes called "producers" (Producer). The data can thus be assigned to different "partitions" (Partition) under different "Topics". A Topic is used to classify messages, and every piece of information entering Kafka can be placed under a Topic.
It should be noted that the format of the multiple logs to be processed is the Kafka specification, which facilitates structured parsing. The default log format may be: [logtime] [operation] JSON; for example: [2013-04-10 1:00:09] [Click] {"time":12344343,"server":"1001"}.
In this embodiment, an application program can write logs through a logging module, or collect client logs through the nginx access log and store them locally, and then use an agent tool such as rsyslog, filebeat or the scribe agent to collect the logs and produce them into a topic of kafka.
It should be noted that Apache Spark is an open source cluster computing framework originally developed by AMPLab at the University of California, Berkeley. In contrast to Hadoop's MapReduce, which stores intermediate data on disk after a job has run, Spark uses in-memory computing and can analyze data in memory before it has been written to disk.
It should be noted that Kafka stores messages coming from any number of processes called "producers" (Producer). The data can thus be assigned to different "partitions" (Partition) under different "Topics". Within a partition, the messages are indexed and stored together with their timestamps. Other processes, called "consumers" (Consumer), can query messages from a partition. Kafka runs on a cluster composed of one or more servers, and partitions can be distributed across cluster nodes.
Kafka efficiently processes real-time streaming data and can be integrated with Storm, HBase and Spark. Deployed as a cluster on multiple servers, Kafka handles all of its publish and subscribe messaging through four APIs, namely the Producer API, Consumer API, Streams API and Connector API. It can deliver large-scale streaming messages with built-in fault tolerance, and has replaced some conventional message systems such as JMS and AMQP.
The main terms of the Kafka framework include Topic, Record and Broker. A Topic consists of Records, Records hold different information, and Brokers are responsible for replicating messages. Kafka has four main APIs:
Producer API: supports an application publishing a stream of Records.
Consumer API: supports an application subscribing to Topics and processing streams of Records.
Streams API: converts input streams into output streams and produces results.
Connector API: implements reusable producer and consumer APIs that can link Topics to existing applications.
Topic: used to classify messages; every piece of information entering Kafka can be placed under a Topic.
Broker: the host server used to realize data storage.
Partition: the messages in each Topic can be divided into several Partitions to improve the processing efficiency of messages.
An embodiment of the present invention also provides a log processing system. Fig. 6 is a structural schematic diagram of the log processing system provided according to an embodiment of the present invention. As shown in Fig. 6, the system includes:
the distributed processing platform spark, wherein spark is configured to perform, when running, the method of any one of claims 1 to 8;
the open source component kafka, connected to the distributed processing platform, wherein kafka is configured to perform, when running, the method of claim 9.
As shown in Fig. 6, the system further includes a server; kafka collects the logs of clients from the server, transmits the logs to spark for log processing, and the processed logs are transmitted to Hadoop for further storage.
An embodiment of the present invention also provides a storage medium in which a computer program is stored, wherein the computer program is configured to execute, when run, the steps in any one of the above method embodiments.
Optionally, in this embodiment, the above storage medium may be configured to store a computer program for executing each of the above steps.
Optionally, in this embodiment, the above storage medium may include, but is not limited to: a USB flash disk, a read-only memory (Read-Only Memory, ROM for short), a random access memory (Random Access Memory, RAM for short), a removable hard disk, a magnetic disk, an optical disk, and various other media that can store a computer program.
An embodiment of the present invention also provides an electronic device including a memory and a processor; a computer program is stored in the memory, and the processor is configured to run the computer program to execute the steps in any one of the above method embodiments.
Optionally, the above electronic device may further include a transmission device and an input/output device, wherein the transmission device is connected with the above processor, and the input/output device is connected with the above processor.
Optionally, in this embodiment, the above processor may be configured to execute each of the above steps through a computer program.
Optionally, for specific examples in this embodiment, reference can be made to the examples described in the above embodiments and optional implementations, and details are not repeated here.
The serial numbers of the above embodiments of the invention are for description only and do not represent the relative merits of the embodiments.
In the above embodiments of the invention, the description of each embodiment has its own emphasis; for parts not described in detail in one embodiment, reference may be made to the related descriptions of other embodiments.
In the several embodiments provided in this application, it should be understood that the disclosed technical content may be implemented in other ways. The device embodiments described above are merely illustrative. For example, the division of the units may be a division by logical function, and there may be other division manners in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not executed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, units, or modules, and may be electrical or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; they may be located in one place or distributed over multiple units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the invention may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit. The above integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated unit is implemented in the form of a software functional unit and is sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the invention, in essence, or the part that contributes to the existing technology, or all or part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions for causing a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods of the embodiments of the invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a removable hard disk, a magnetic disk, or an optical disk.
The above are only preferred embodiments of the invention. It should be noted that those of ordinary skill in the art may make various improvements and modifications without departing from the principle of the invention, and these improvements and modifications should also be regarded as falling within the protection scope of the invention.
Claims (14)
1. A log processing method, comprising:
receiving a plurality of logs to be processed sent by an open source component, wherein the log format of the plurality of logs to be processed results from the open source component converting a plurality of logs using a preset log format;
performing data cleansing on the plurality of logs to be processed to obtain a plurality of first target logs;
performing partition processing on the plurality of first target logs according to a preset time interval to obtain a plurality of second target logs.
2. The method according to claim 1, wherein performing data cleansing on the plurality of logs to be processed to obtain the plurality of first target logs comprises:
triggering the data cleansing of the plurality of logs to be processed using a preset activity algorithm;
performing data cleansing on the triggered plurality of logs to be processed using a preset transfer algorithm.
3. The method according to claim 1, wherein performing partition processing on the plurality of first target logs according to the preset time interval to obtain the plurality of second target logs comprises:
determining a log type and a log time of each first target log in the plurality of first target logs, wherein the log time is the time at which each first target log is obtained;
storing each first target log in partitions into a first preset directory based on the log type and the log time to obtain the plurality of second target logs.
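A minimal sketch of the partition storage step recited in claim 3: each log is routed to a directory keyed by its log type and its interval-truncated log time. The path layout and five-minute interval below are hypothetical choices for illustration, not prescribed by the claim.

```python
# Sketch of partitioned storage by log type and log time: the timestamp
# is truncated to a preset interval so that logs from the same window
# share a directory. The "/logs/<type>/<date>/<window>" layout is a
# hypothetical example, not the patent's actual directory scheme.

from datetime import datetime

def partition_path(log_type: str, log_time: datetime,
                   interval_minutes: int = 5) -> str:
    """Build a storage path with the minute truncated to the interval."""
    truncated = log_time.replace(
        minute=(log_time.minute // interval_minutes) * interval_minutes,
        second=0, microsecond=0)
    return f"/logs/{log_type}/{truncated:%Y%m%d/%H%M}"

# A log obtained at 10:13:42 falls into the 10:10 five-minute window.
p = partition_path("access", datetime(2019, 2, 25, 10, 13, 42))
```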
4. The method according to claim 1, wherein after performing partition processing on the plurality of first target logs according to the preset time interval to obtain the plurality of second target logs, the method further comprises one of the following:
storing the plurality of second target logs into a second preset directory in an exactly-once manner through a preset feature source interface;
storing the plurality of second target logs into the second preset directory in an exactly-once manner through a preset batch processing function.
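The exactly-once storage of claim 4 can be sketched with the common idempotent-write pattern: a batch is written only if its identifier has not been committed before, so a redelivered batch causes no duplicate write. The data structures and names here are illustrative assumptions, not the patent's actual interface.

```python
# Sketch of "exactly once" storage via idempotent writes: a batch is
# persisted only if its identifier has not been committed before, so
# at-least-once delivery upstream never produces duplicates downstream.
# All names are illustrative.

committed = set()   # batch identifiers already persisted
storage = []        # stands in for the second preset directory

def write_exactly_once(batch_id: int, records: list) -> bool:
    """Write records unless this batch was already committed."""
    if batch_id in committed:
        return False        # redelivery: skip, no double write
    storage.extend(records)
    committed.add(batch_id) # commit only after a successful write
    return True

write_exactly_once(1, ["a", "b"])
write_exactly_once(1, ["a", "b"])   # redelivered batch is ignored
```

In a real pipeline the commit set would itself be persisted atomically with the data (e.g. in the output's metadata), so a crash between write and commit cannot split the two.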
5. The method according to claim 4, wherein after performing real-time partition processing on the plurality of first target logs according to the preset time interval to obtain the plurality of second target logs, and in a case where it is determined that storing the plurality of second target logs into the second preset directory has failed, the method further comprises one of the following:
within a first preset number of days, reacquiring the plurality of logs to be processed from a local cache;
restoring the plurality of logs to be processed from a local disk using a result collection system;
restoring the plurality of logs to be processed from a copy stored in the open source component;
reading metadata from a metadata file of the plurality of logs to be processed and restoring the plurality of logs to be processed through a makeup-function offset;
obtaining the plurality of logs to be processed from a plurality of backup files in storage, wherein the plurality of backup files are stored on one of the following: a local node, a node in the local rack, a node in a different rack.
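The offset-based recovery option of claim 5 can be sketched as re-reading a retained source from the last committed position recorded in metadata. The in-memory list stands in for a copy retained by the open source component, and all names are hypothetical.

```python
# Sketch of offset-based recovery: after a storage failure, the
# unprocessed logs are re-read from the retained source starting at
# the last committed offset recorded in metadata. The list "source"
# stands in for a Kafka-style retained copy; names are hypothetical.

source = ["log-0", "log-1", "log-2", "log-3"]   # retained copy
checkpoint = {"offset": 2}                      # committed metadata

def recover_from_offset(records, meta):
    """Return every record at or after the last committed offset."""
    return records[meta["offset"]:]

recovered = recover_from_offset(source, checkpoint)
```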
6. The method according to claim 1, wherein after performing partition processing on the plurality of first target logs according to the preset time interval to obtain the plurality of second target logs, the method further comprises:
establishing a first table according to a secondary-partition storage format, wherein the first table is used to store summary information of the plurality of second target logs;
establishing a first sub-table in the first table, wherein the first sub-table is used to store query information for querying the plurality of second target logs;
periodically adding partitions to the first sub-table, wherein the partitions are used to store the query information of newly added logs.
7. The method according to claim 1, wherein after performing partition processing on the plurality of first target logs according to the preset time interval to obtain the plurality of second target logs, the method further comprises:
splitting each second target log in the plurality of second target logs to obtain a plurality of decomposed logs;
parsing the plurality of decomposed logs;
merging the parsed plurality of decomposed logs within a preset time period to obtain a merged log;
compressing the merged log.
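A minimal sketch of claim 7's split, parse, merge, and compress chain using only the Python standard library; the `key=value` field layout and the `;` delimiter are hypothetical examples, not a format taken from the patent.

```python
# Sketch of the split -> parse -> merge -> compress chain of claim 7.
# The ";"-delimited "key=value" record layout is a hypothetical example.

import gzip

def split_and_parse(target_log: str) -> list:
    """Split a combined log on ';' and parse each piece as key=value."""
    parsed = []
    for piece in target_log.split(";"):
        key, _, value = piece.partition("=")
        parsed.append({key: value})
    return parsed

def merge_and_compress(decomposed: list) -> bytes:
    """Merge parsed records back into one line and gzip the result."""
    merged = " ".join(f"{k}={v}" for rec in decomposed
                      for k, v in rec.items())
    return gzip.compress(merged.encode("utf-8"))

blob = merge_and_compress(split_and_parse("user=alice;action=login"))
```

Compressing after the merge, rather than per record, is what makes the final step worthwhile: one larger buffer compresses far better than many tiny ones.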
8. The method according to claim 1, wherein after performing partition processing on the plurality of first target logs according to the preset time interval to obtain the plurality of second target logs, the method further comprises:
monitoring the storage of the plurality of second target logs;
in a case of storage failure, issuing alarm information.
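The monitoring and alarming of claim 8 can be sketched as wrapping the storage step and emitting alarm information on failure; the alarm channel here is a stand-in list, and all names are illustrative.

```python
# Sketch of storage monitoring with alarm on failure: the storage step
# is wrapped, and any exception is turned into alarm information.
# The "alarms" list stands in for a real alarm channel (mail, SMS,
# dashboard, ...); all names are illustrative.

alarms = []

def alarm(message: str) -> None:
    """Record alarm information on the stand-in alarm channel."""
    alarms.append(message)

def monitored_store(store_fn, logs) -> bool:
    """Run the storage step; on failure, emit an alarm and report it."""
    try:
        store_fn(logs)
        return True
    except Exception as exc:
        alarm(f"log storage failed: {exc}")
        return False

def failing_store(_logs):
    raise IOError("HDFS unreachable")

ok = monitored_store(failing_store, ["log-1"])
```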
9. A log processing method, comprising:
formatting a plurality of acquired logs using a preset log format to obtain a plurality of logs to be processed;
storing the plurality of logs to be processed respectively into topic folders corresponding to each log to be processed;
sending the plurality of logs to be processed in the topic folders to a distributed processing platform.
10. A log processing device, comprising:
a receiving module, configured to receive a plurality of logs to be processed sent by an open source component, wherein the log format of the plurality of logs to be processed results from the open source component converting a plurality of logs using a preset log format;
a first determining module, configured to perform data cleansing on the plurality of logs to be processed to obtain a plurality of first target logs;
a second determining module, configured to perform partition processing on the plurality of first target logs according to a preset time interval to obtain a plurality of second target logs.
11. A log processing device, comprising:
a third determining module, configured to format a plurality of acquired logs using a preset log format to obtain a plurality of logs to be processed;
a storage module, configured to store the plurality of logs to be processed respectively into topic folders corresponding to each log to be processed;
a sending module, configured to send the plurality of logs to be processed in the topic folders to a distributed processing platform.
12. A log processing system, comprising:
a distributed processing platform spark, wherein the spark is arranged to execute, when running, the method according to any one of claims 1-8;
an open source component kafka, connected to the distributed processing platform, wherein the kafka is arranged to execute, when running, the method according to claim 9.
13. A storage medium, wherein a computer program is stored in the storage medium, and the computer program is arranged to execute, when running, the method according to any one of claims 1 to 8, or the method according to claim 9.
14. An electronic device, including a memory and a processor, wherein a computer program is stored in the memory, and the processor is arranged to run the computer program to execute the method according to any one of claims 1 to 8, or the method according to claim 9.
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
CN201910138347.5A CN109918349B (en) | 2019-02-25 | 2019-02-25 | Log processing method, log processing device, storage medium and electronic device |
Publications (2)
Publication Number | Publication Date |
---|---|
CN109918349A true CN109918349A (en) | 2019-06-21 |
CN109918349B CN109918349B (en) | 2021-05-25 |
Family
ID=66962220
Family Applications (1)
Application Number | Title | Priority Date | Filing Date |
---|---|---|---|
CN201910138347.5A Active CN109918349B (en) | 2019-02-25 | 2019-02-25 | Log processing method, log processing device, storage medium and electronic device |
Country Status (1)
Country | Link |
---|---|
CN (1) | CN109918349B (en) |
Cited By (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110908788A (en) * | 2019-12-02 | 2020-03-24 | 北京锐安科技有限公司 | Spark Streaming based data processing method and device, computer equipment and storage medium |
CN111581173A (en) * | 2020-05-09 | 2020-08-25 | 深圳市卡数科技有限公司 | Distributed storage method and device for log system, server and storage medium |
CN111831617A (en) * | 2020-07-16 | 2020-10-27 | 福建天晴数码有限公司 | Method for guaranteeing uniqueness of log data based on distributed system |
CN112506862A (en) * | 2020-12-28 | 2021-03-16 | 浪潮云信息技术股份公司 | Method for custom saving Kafka Offset |
CN112612677A (en) * | 2020-12-28 | 2021-04-06 | 北京天融信网络安全技术有限公司 | Log storage method and device, electronic equipment and readable storage medium |
CN113190726A (en) * | 2021-04-16 | 2021-07-30 | 珠海格力精密模具有限公司 | Method for reading CAE (computer aided engineering) modular flow analysis data, electronic equipment and storage medium |
CN113312353A (en) * | 2021-06-10 | 2021-08-27 | 中国民航信息网络股份有限公司 | Storage method and system for tracking journal |
WO2021189954A1 (en) * | 2020-10-12 | 2021-09-30 | 平安科技(深圳)有限公司 | Log data processing method and apparatus, computer device, and storage medium |
WO2021238273A1 (en) * | 2020-05-28 | 2021-12-02 | 苏州浪潮智能科技有限公司 | Message fault tolerance method and system based on spark streaming computing framework |
CN113760832A (en) * | 2020-06-03 | 2021-12-07 | 富泰华工业(深圳)有限公司 | File processing method, computer device and readable storage medium |
CN113778810A (en) * | 2021-09-27 | 2021-12-10 | 杭州安恒信息技术股份有限公司 | Log collection method, device and system |
CN113806434A (en) * | 2021-09-22 | 2021-12-17 | 平安科技(深圳)有限公司 | Big data processing method, device, equipment and medium |
Citations (5)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20140344622A1 (en) * | 2013-05-20 | 2014-11-20 | Vmware, Inc. | Scalable Log Analytics |
CN104298771A (en) * | 2014-10-30 | 2015-01-21 | 南京信息工程大学 | Massive web log data query and analysis method |
CN105389352A (en) * | 2015-10-30 | 2016-03-09 | 北京奇艺世纪科技有限公司 | Log processing method and apparatus |
CN105824744A (en) * | 2016-03-21 | 2016-08-03 | 焦点科技股份有限公司 | Real-time log collection and analysis method on basis of B2B (Business to Business) platform |
CN108600300A (en) * | 2018-03-06 | 2018-09-28 | 北京思空科技有限公司 | Daily record data processing method and processing device |
Cited By (16)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
CN110908788A (en) * | 2019-12-02 | 2020-03-24 | 北京锐安科技有限公司 | Spark Streaming based data processing method and device, computer equipment and storage medium |
CN110908788B (en) * | 2019-12-02 | 2022-04-08 | 北京锐安科技有限公司 | Spark Streaming based data processing method and device, computer equipment and storage medium |
CN111581173A (en) * | 2020-05-09 | 2020-08-25 | 深圳市卡数科技有限公司 | Distributed storage method and device for log system, server and storage medium |
CN111581173B (en) * | 2020-05-09 | 2023-10-20 | 深圳市卡数科技有限公司 | Method, device, server and storage medium for distributed storage of log system |
WO2021238273A1 (en) * | 2020-05-28 | 2021-12-02 | 苏州浪潮智能科技有限公司 | Message fault tolerance method and system based on spark streaming computing framework |
CN113760832A (en) * | 2020-06-03 | 2021-12-07 | 富泰华工业(深圳)有限公司 | File processing method, computer device and readable storage medium |
CN111831617B (en) * | 2020-07-16 | 2022-08-09 | 福建天晴数码有限公司 | Method for guaranteeing uniqueness of log data based on distributed system |
CN111831617A (en) * | 2020-07-16 | 2020-10-27 | 福建天晴数码有限公司 | Method for guaranteeing uniqueness of log data based on distributed system |
WO2021189954A1 (en) * | 2020-10-12 | 2021-09-30 | 平安科技(深圳)有限公司 | Log data processing method and apparatus, computer device, and storage medium |
CN112612677A (en) * | 2020-12-28 | 2021-04-06 | 北京天融信网络安全技术有限公司 | Log storage method and device, electronic equipment and readable storage medium |
CN112506862A (en) * | 2020-12-28 | 2021-03-16 | 浪潮云信息技术股份公司 | Method for custom saving Kafka Offset |
CN113190726A (en) * | 2021-04-16 | 2021-07-30 | 珠海格力精密模具有限公司 | Method for reading CAE (computer aided engineering) modular flow analysis data, electronic equipment and storage medium |
CN113312353A (en) * | 2021-06-10 | 2021-08-27 | 中国民航信息网络股份有限公司 | Storage method and system for tracking journal |
CN113806434A (en) * | 2021-09-22 | 2021-12-17 | 平安科技(深圳)有限公司 | Big data processing method, device, equipment and medium |
CN113806434B (en) * | 2021-09-22 | 2023-09-05 | 平安科技(深圳)有限公司 | Big data processing method, device, equipment and medium |
CN113778810A (en) * | 2021-09-27 | 2021-12-10 | 杭州安恒信息技术股份有限公司 | Log collection method, device and system |
Also Published As
Publication number | Publication date |
---|---|
CN109918349B (en) | 2021-05-25 |
Similar Documents
Publication | Publication Date | Title |
---|---|---|
CN109918349A (en) | Log processing method, device, storage medium and electronic device | |
CN110321387B (en) | Data synchronization method, equipment and terminal equipment | |
CN112507029B (en) | Data processing system and data real-time processing method | |
CN109034993A (en) | Account checking method, equipment, system and computer readable storage medium | |
CN105824744A (en) | Real-time log collection and analysis method on basis of B2B (Business to Business) platform | |
CN109710614A (en) | A kind of method and device of real-time data memory and inquiry | |
CN111459986B (en) | Data computing system and method | |
CN102750326A (en) | Log management optimization method of cluster system based on downsizing strategy | |
CN109190025B (en) | Information monitoring method, device, system and computer readable storage medium | |
CN113360554B (en) | Method and equipment for extracting, converting and loading ETL (extract transform load) data | |
CN110019267A (en) | A kind of metadata updates method, apparatus, system, electronic equipment and storage medium | |
CN112559475B (en) | Data real-time capturing and transmitting method and system | |
CN105138691B (en) | Analyze the method and system of subscriber traffic | |
CN113535856B (en) | Data synchronization method and system | |
CN110704400A (en) | Real-time data synchronization method and device and server | |
WO2020263370A1 (en) | Parallel processing of filtered transaction logs | |
CN112506743A (en) | Log monitoring method and device and server | |
CN112019605A (en) | Data distribution method and system of data stream | |
CN114048217A (en) | Incremental data synchronization method and device, electronic equipment and storage medium | |
CN109451078A (en) | Transaction methods and device under a kind of distributed structure/architecture | |
CN109167672B (en) | Return source error positioning method, device, storage medium and system | |
CN114090529A (en) | Log management method, device, system and storage medium | |
CN104205775A (en) | A system for high reliability and high performance application message delivery | |
CN107480189A (en) | A kind of various dimensions real-time analyzer and method | |
CN111049846A (en) | Data processing method and device, electronic equipment and computer readable storage medium |
Legal Events
Date | Code | Title | Description |
---|---|---|---|
PB01 | Publication | ||
PB01 | Publication | ||
SE01 | Entry into force of request for substantive examination | ||
SE01 | Entry into force of request for substantive examination | ||
GR01 | Patent grant | ||
GR01 | Patent grant |