CN109918349A - Log processing method, device, storage medium and electronic device - Google Patents


Info

Publication number
CN109918349A
Authority
CN
China
Prior art keywords
log
processed
logs
data
kafka
Legal status
Granted
Application number
CN201910138347.5A
Other languages
Chinese (zh)
Other versions
CN109918349B (en)
Inventor
刘晶晶
Current Assignee
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd
Priority to CN201910138347.5A
Publication of CN109918349A
Application granted
Publication of CN109918349B
Legal status: Active
Anticipated expiration


Abstract

The invention discloses a log processing method, device, storage medium and electronic device. The method includes: receiving a plurality of to-be-processed logs sent by an open-source component, where the log format of the to-be-processed logs results from the open-source component converting the logs using a preset log format; performing data cleansing on the to-be-processed logs to obtain a plurality of first target logs; and partitioning the first target logs according to a preset time interval to obtain a plurality of second target logs. The invention solves the technical problem in the related art that log processing efficiency is low and cannot meet the demands placed on log data.

Description

Log processing method, device, storage medium and electronic device
Technical field
The present invention relates to the computer field, and in particular to a log processing method, device, storage medium and electronic device.
Background art
With the arrival of the Internet+ era, the value of data has become increasingly prominent. Product data exhibits exponential growth and unstructured characteristics. Building a big data platform on the distributed processing platforms Spark and Hadoop provides the most central storage and processing capability for basic data, delivers powerful data-processing capacity, and meets the demand for interactive access to data. At the same time, Spark Streaming can effectively satisfy an enterprise's real-time data requirements and support building a real-time indicator system for enterprise development.
However, existing log storage approaches are not real-time enough, lack capacity expansion and fault tolerance in distributed systems, and cannot conveniently perform ETL cleansing of big-data logs (ETL, the abbreviation of Extract-Transform-Load, describes the process of extracting data from a source, transforming it, and loading it to a destination).
No effective solution to the above problems has yet been proposed.
Summary of the invention
The embodiments of the present invention provide a log processing method, device, storage medium and electronic device, so as to at least solve the technical problem in the related art that log processing efficiency is low and cannot meet the demands placed on log data.
According to one aspect of the embodiments of the present invention, a log processing method is provided, including: receiving a plurality of to-be-processed logs sent by an open-source component, where the log format of the to-be-processed logs results from the open-source component converting the logs using a preset log format; performing data cleansing on the to-be-processed logs to obtain a plurality of first target logs; and partitioning the first target logs according to a preset time interval to obtain a plurality of second target logs.
According to another aspect of the embodiments of the present invention, another log processing method is provided, including: formatting a plurality of collected logs using a preset log format to obtain a plurality of to-be-processed logs; storing the to-be-processed logs into topic folders corresponding to each to-be-processed log; and sending the to-be-processed logs in the topic folders to a distributed processing platform.
According to another aspect of the embodiments of the present invention, a log processing device is provided, including: a receiving module, configured to receive a plurality of to-be-processed logs sent by an open-source component, where the log format of the to-be-processed logs results from the open-source component converting the logs using a preset log format; a first determining module, configured to perform data cleansing on the to-be-processed logs to obtain a plurality of first target logs; and a second determining module, configured to partition the first target logs according to a preset time interval to obtain a plurality of second target logs.
According to another aspect of the embodiments of the present invention, another log processing device is provided, including: a third determining module, configured to format a plurality of collected logs using a preset log format to obtain a plurality of to-be-processed logs; a storage module, configured to store the to-be-processed logs into topic folders corresponding to each to-be-processed log; and a sending module, configured to send the to-be-processed logs in the topic folders to a distributed processing platform.
According to another aspect of the embodiments of the present invention, a log processing system is provided, including: a distributed processing platform Spark, where Spark is configured to perform the above method when running; and an open-source component Kafka connected to the distributed processing platform, where Kafka is configured to perform the above method when running.
According to still another embodiment of the present invention, a storage medium is provided, in which a computer program is stored, where the computer program is configured to perform the steps in any of the above method embodiments when run.
According to still another embodiment of the present invention, an electronic device is provided, including a memory and a processor, where a computer program is stored in the memory and the processor is configured to run the computer program to perform the steps in any of the above method embodiments.
In the embodiments of the present invention, the collected logs are formatted by the open-source component to obtain a plurality of to-be-processed logs, and the to-be-processed logs are sent to the distributed processing platform; the distributed processing platform performs data cleansing on the to-be-processed logs to obtain a plurality of first target logs, and partitions the first target logs according to a preset time interval to obtain a plurality of second target logs. Logs can thus be processed in real time and log processing efficiency is improved, thereby solving the technical problem in the related art that log processing efficiency is low and cannot meet the demands placed on log data.
Brief description of the drawings
The drawings described herein are used to provide a further understanding of the present invention and constitute a part of this application. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of the present invention. In the drawings:
Fig. 1 is a hardware block diagram of a mobile terminal for a log processing method according to an embodiment of the present invention;
Fig. 2 is a flowchart (one) of the log processing method provided according to an embodiment of the present invention;
Fig. 3 is a flowchart (two) of the log processing method provided according to an embodiment of the present invention;
Fig. 4 is a structural schematic diagram (one) of the log processing device provided according to an embodiment of the present invention;
Fig. 5 is a structural schematic diagram (two) of the log processing device provided according to an embodiment of the present invention;
Fig. 6 is a structural schematic diagram of the log processing system provided according to an embodiment of the present invention.
Detailed description of the embodiments
In order to enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description, claims, and drawings of this specification are used to distinguish similar objects, and are not used to describe a particular order or precedence. It should be understood that data so used are interchangeable under appropriate circumstances, so that the embodiments of the present invention described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units not expressly listed or inherent to such a process, method, product, or device.
According to an embodiment of the present invention, a log processing method embodiment is provided. It should be noted that the steps illustrated in the flowcharts of the drawings can be executed in a computer system, such as a set of computer-executable instructions, and although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from that described herein.
The method embodiments provided by the embodiments of the present invention can be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking running on a mobile terminal as an example, Fig. 1 is a hardware block diagram of a mobile terminal for a log processing method according to an embodiment of the present invention. As shown in Fig. 1, the mobile terminal 10 may include one or more processors 102 (only one is shown in Fig. 1; the processor 102 may include, but is not limited to, a processing unit such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data. Optionally, the mobile terminal may further include a transmission device 106 for communication functions and an input/output device 108. Those of ordinary skill in the art will appreciate that the structure shown in Fig. 1 is merely illustrative and does not limit the structure of the above mobile terminal; for example, the mobile terminal 10 may include more or fewer components than shown in Fig. 1, or have a configuration different from that shown in Fig. 1.
The memory 104 can be used to store computer programs, for example, software programs and modules of application software, such as the computer program corresponding to the log processing method in the embodiments of the present invention. The processor 102 runs the computer programs stored in the memory 104, thereby executing various functional applications and data processing, i.e., implementing the above method. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, and these remote memories may be connected to the mobile terminal 10 through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or send data via a network. Specific examples of the above network may include a wireless network provided by the communication provider of the mobile terminal 10. In one example, the transmission device 106 includes a network interface controller (NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In another example, the transmission device 106 may be a radio frequency (RF) module, which is used to communicate with the Internet wirelessly.
Fig. 2 is a flowchart (one) of the log processing method provided according to an embodiment of the present invention. As shown in Fig. 2, the method includes the following steps:
Step S202: receiving a plurality of to-be-processed logs sent by an open-source component, where the log format of the to-be-processed logs results from the open-source component converting the logs using a preset log format;
Step S204: performing data cleansing on the to-be-processed logs to obtain a plurality of first target logs;
Step S206: partitioning the first target logs according to a preset time interval to obtain a plurality of second target logs.
Through the above steps, the collected logs are formatted by the open-source component to obtain a plurality of to-be-processed logs, and the to-be-processed logs are sent to the distributed processing platform; the distributed processing platform performs data cleansing on the to-be-processed logs to obtain a plurality of first target logs, and partitions the first target logs according to the preset time interval to obtain a plurality of second target logs. Logs can thus be processed in real time and log processing efficiency is improved, thereby solving the technical problem in the related art that log processing efficiency is low and cannot meet the demands placed on log data.
It should be noted that the execution subject of the above may be the distributed processing platform Spark, but is not limited thereto.
It should be noted that the above open-source component Kafka is an open-source stream-processing platform developed by the Apache Software Foundation, written in Scala and Java. The goal of the project is to provide a unified, high-throughput, low-latency platform for handling real-time data. Its persistence layer is essentially a "massive-scale publish/subscribe message queue architected as a distributed transaction log," which makes it highly valuable as enterprise-grade infrastructure for processing streaming data. In addition, Kafka can connect to external systems (for data import/export) via Kafka Connect, and provides Kafka Streams, a Java stream-processing library. Kafka stores messages that come from an arbitrary number of processes called "producers." The data can be assigned to different "partitions" under different "topics." A topic is used to classify messages; every message entering Kafka is placed under a topic.
It should be noted that the format of the to-be-processed logs follows the Kafka specification to facilitate structured parsing. The preset log format may be: [logtime] [operation] JSON; for example: [2013-04-10 01:00:09] [Click] {"time":12344343,"server":"1001"}.
In this embodiment, an application can store logs locally through a logging module, or collect client logs through nginx access logs, and then use open-source collection agents such as rsyslog, filebeat, or scribe agent to gather the logs and produce them into a Kafka topic, as sketched below.
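As an illustration of that last hop, the following sketch formats one event into the preset layout and produces it into a Kafka topic; the topic name app-logs and the broker address are assumptions for illustration, not specified by the patent.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object LogProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "kafka-broker:9092") // placeholder address
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    // Render a raw event into the preset "[logtime] [operation] JSON" layout
    // before producing it, so downstream parsing stays structured.
    val line = """[2013-04-10 01:00:09] [Click] {"time":12344343,"server":"1001"}"""
    producer.send(new ProducerRecord[String, String]("app-logs", "Click", line))
    producer.close()
  }
}
```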
It should be noted that Apache Spark is an open-source cluster-computing framework originally developed by AMPLab at the University of California, Berkeley. Unlike Hadoop's MapReduce, which writes intermediate data to disk after a job has run, Spark uses in-memory computing and can perform analytic operations in memory before the data is written to the hard disk.
In an alternative embodiment, Spark performs data cleansing on the multiple to-be-processed logs as follows: the data cleansing of the to-be-processed logs is triggered by a preset action operator, and a preset transformation operator performs the cleansing on the triggered logs. For example, Spark's rich transformation operators (transform), such as map, flatMap, filter, and user-defined functions (UDF), implement the cleansing of the data, while action operators (action), such as collect and save, trigger the cleansing process (a minimal sketch follows).
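A minimal sketch of such a transform-plus-action cleansing pass; the input and output paths and the line-parsing regex are assumptions for illustration, not from the patent.

```scala
import org.apache.spark.sql.SparkSession

object CleanLogs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("log-etl").getOrCreate()
    import spark.implicits._

    val raw = spark.read.textFile("/tmp/raw-logs") // placeholder input

    val LinePattern = """\[(.+?)\] \[(.+?)\] (\{.*\})""".r

    // flatMap doubles as map + filter: malformed lines yield an empty Seq
    // and are dropped, which is the "data cleansing" step.
    val cleaned = raw.flatMap { line =>
      line match {
        case LinePattern(logtime, logtype, json) => Seq((logtime, logtype, json))
        case _                                   => Seq.empty[(String, String, String)]
      }
    }.toDF("logtime", "logtype", "payload")

    cleaned.write.json("/tmp/cleaned-logs") // the save action triggers execution
  }
}
```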
In an alternative embodiment, the first target logs are partitioned according to the preset time interval as follows: the log type and log time of each first target log are determined, where the log time is the time at which each first target log was obtained; each first target log is partitioned and stored into a first preset directory based on its log type and log time, yielding the second target logs. For example, by setting a reasonable Spark Streaming batch interval (the preset time interval, e.g., 5 minutes), each batch is secondarily partitioned according to logtype (log type) and logtime (log time) and written under the HDFS directory /warehouses/{logtype}/{logtime} (the first preset directory) for storage. The delay of the data is roughly equal to the batch interval; with the 5-minute setting above, the data shown in Hive is likewise delayed by 5 minutes. Setting a reasonable batch interval thus achieves near-real-time collection of the data. In this embodiment, the Spark computing engine stores into the Hadoop storage system (see the streaming sketch below).
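The write path might look like the following Structured Streaming sketch. It is a sketch under stated assumptions: the spark-sql-kafka connector is on the classpath, the app-logs topic and broker address come from the earlier hypothetical producer, and Spark writes key=value partition directories (logtype=.../logtime=...), which correspond to the /warehouses/{logtype}/{logtime} layout described above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.regexp_extract
import org.apache.spark.sql.streaming.Trigger

object PartitionedSink {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("partitioned-sink").getOrCreate()
    import spark.implicits._

    val lines = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka-broker:9092") // placeholder
      .option("subscribe", "app-logs")                        // assumed topic
      .load()
      .selectExpr("CAST(value AS STRING) AS line")

    // Pull logtime and logtype out of the preset "[logtime] [operation] JSON"
    // layout; in practice logtime would be truncated to a day or hour first.
    val cleaned = lines
      .withColumn("logtime", regexp_extract($"line", """^\[(.+?)\]""", 1))
      .withColumn("logtype", regexp_extract($"line", """^\[.+?\] \[(.+?)\]""", 1))

    cleaned.writeStream
      .format("parquet")
      .option("path", "/warehouses")                  // first preset directory
      .option("checkpointLocation", "/checkpoints/log-etl")
      .partitionBy("logtype", "logtime")              // the secondary partition
      .trigger(Trigger.ProcessingTime("5 minutes"))   // the batch interval
      .start()
      .awaitTermination()
  }
}
```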
It should be noted that the time slice, or batch interval, is the standard by which the streaming data is artificially quantized, and serves as the basis on which we split the stream; the data of one time slice corresponds to one RDD instance.
It should be noted that HDFS is the abbreviation of Hadoop Distributed File System, a distributed file system of Hadoop.
In an alternative embodiment, after the plurality of second target logs are obtained, exactly-once storage of the second target logs is required. It should be understood that before version 2.0, Spark Streaming could only achieve at-least-once delivery; the Spark framework itself can hardly achieve exactly-once, which requires the program to directly implement a reliable data source and a downstream that supports idempotent operations. Spark Structured Streaming, however, can achieve exactly-once simply.
It should be noted that exactly-once means each piece of data is processed only once; it is one of the difficulties of real-time computing, with the goal that each record is handled exactly once.
Specifically, the following approaches are included:
1) The plurality of second target logs are stored into a second preset directory in an exactly-once manner through a preset Source trait interface. For example, using the Kafka source: in the source code, Source is an abstract interface, trait Source (the preset Source trait interface), containing the functions that Structured Streaming necessarily needs to achieve end-to-end exactly-once processing. getOffset() of the Kafka source (KafkaSource) reads the offsets saved by the previous batch; a long-running consumer on the driver side fetches the latest offsets of each topic from the Kafka brokers; the getBatch method returns a DataFrame according to the offsets; and commit saves a file on HDFS through the checkpoint mechanism, recording Kafka's offset positions so that getOffset can obtain valid offset values after a failure. In this embodiment, getOffset() works as follows: each time, it reads the HDFS checkpoint file to obtain the start position of this read of Kafka, and after reading, it updates the end position of this read into the checkpoint file.
2) The plurality of second target logs are stored into the second preset directory in an exactly-once manner through a preset batch-processing function. For example, hdfsSink: its addBatch() method supports the function that Structured Streaming necessarily needs to achieve end-to-end exactly-once processing. The concrete implementation of hdfsSink's addBatch() keeps metadata under an HDFS directory recording the maximum batchId completed so far; when recovering from a failure, if a job's batchId is less than or equal to the maximum batchId in the metadata, the job is skipped, thereby making the data writes idempotent. In this embodiment, addBatch() includes the following behavior, as sketched below: if the submitted batchId is not greater than the batchId recorded in the HDFS metadata, the batch is considered already written by the task and is skipped directly; otherwise, the log data is written into the partition directory and the HDFS metadata batchId is incremented by 1.
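The following is a schematic rendering of that addBatch() contract. It is a sketch under stated assumptions: the org.apache.spark.sql.execution.streaming.Sink trait is Spark-internal API, and a single local metadata file stands in here for the HDFS metadata directory; this is not Spark's actual file sink.

```scala
import java.nio.file.{Files, Paths, StandardOpenOption}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Sink

class IdempotentSink(dataPath: String, metaFile: String) extends Sink {

  // Read the maximum batchId committed so far (-1 if nothing committed yet).
  private def lastCommittedBatchId: Long = {
    val p = Paths.get(metaFile)
    if (Files.exists(p)) new String(Files.readAllBytes(p)).trim.toLong else -1L
  }

  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // A replayed batch after failure recovery: already written, skip it.
    if (batchId <= lastCommittedBatchId) return

    // Write the batch into the partition directories.
    data.write.mode("append").parquet(dataPath)

    // Record the batch as committed only after the data is safely written.
    Files.write(Paths.get(metaFile), batchId.toString.getBytes("UTF-8"),
      StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)
  }
}
```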
It should be noted that Spark Streaming is an extension of the Spark core application programming interface (API) that enables high-throughput, fault-tolerant processing of real-time streaming data. It supports obtaining data from multiple sources, including Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP sockets; after data is obtained from a source, high-level functions such as map, reduce, join, and window can be used for complex algorithmic processing. Finally, the processing results can be stored to file systems, databases, and live dashboards. On the basis of "One Stack to rule them all," Spark's other sub-frameworks, such as machine learning and graph computation, can also be used to process streaming data.
In an alternative embodiment, when it is determined that storing the plurality of second target logs to the second preset directory has failed, the to-be-processed logs can be recovered and processed again in the following ways:
1) Within a first preset number of days, the to-be-processed logs are re-obtained from a local cache. For example, logs are printed to the local file system, rolled daily, and kept compressed for 10 days; production machines have comprehensive operating-system alarms, so hard-disk capacity problems are detected in time and disk-write failures are avoided. Keeping 10 days guards against problems in downstream systems; the worst case is a problem not being discovered in time over a long holiday such as the National Day, after which recovery can be pushed again from the data source.
2) The to-be-processed logs are recovered from the local disk by the log collection system. For example, the scribe agent's heartbeat detection and log-delay detection can discover problems in time; the scribe agent records the metadata of the files it sends, most importantly each file's send position, which can be used for precise recovery from data failures, and it sends data over TCP.
It should be noted that Scribe is Facebook's open-source log collection system, which is used extensively inside Facebook. It can collect logs from various log sources and store them on a central storage system (which can be NFS, a distributed file system, etc.) for centralized statistical analysis. It provides a scalable, highly fault-tolerant solution for the "distributed collection, unified processing" of logs. Scribe's architecture is fairly simple, mainly comprising three parts: the scribe agent, scribe, and the storage system.
3) The to-be-processed logs are recovered from the replicas stored in the open-source component. For example, the replica count on the Kafka cluster is 3 and logs are persisted for 3 days; if Spark Structured Streaming fails, data can be recovered directly from Kafka.
4) The metadata files and offsets of the to-be-processed logs are read to recover them. For example, Spark's checkpoint mechanism saves state information such as Kafka offsets; when a task fails, the restarted Spark task reads the metadata and offset files under the checkpoint directory, and the KafkaConsumer can read data from the specified partitions at those offsets, neither repeating nor omitting data.
5) The to-be-processed logs are obtained from multiple replica files in storage, where the replica files are stored on one of the following: the local node, a node in the local rack, and a node in a different rack. For example, the HDFS block mechanism: the file replica count is 3, and HDFS's replica placement strategy is to place the first replica on the local node, the second replica on another node in the local rack, and the third replica on a node in a different rack. This reduces inter-rack write traffic and thereby improves write performance. Since the probability of rack failure is far smaller than that of node failure, this approach does not affect the guarantees of data reliability and availability, while genuinely reducing the aggregate network bandwidth of read operations.
6) insert overwrite: data extracted into Hive is written with the INSERT OVERWRITE operation, which, unlike INSERT INTO, achieves idempotence and can be run repeatedly (see the sketch after this list).
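A minimal illustration of that idempotence, assuming a Hive-enabled SparkSession named spark; the table and partition names are hypothetical.

```scala
// Rerunning this exact statement replaces the same partition instead of
// appending duplicate rows, so a failed-and-retried load stays correct.
spark.sql(
  """INSERT OVERWRITE TABLE hive_logs PARTITION (logtype = 'Click', dt = '2013-04-10')
    |SELECT logtime, payload FROM staging_logs WHERE dt = '2013-04-10'
    |""".stripMargin)
```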
In an alternative embodiment, in order to increase log processing efficiency, Kafka can be scaled horizontally, specifically including:
(1) Increasing the number of Kafka brokers and the partitions of topics. According to Kafka's default partitioning, abs(key.hashCode) % numPartitions, the data of a topic can then be distributed over more brokers and machines, increasing the data-processing capacity of the upstream and downstream.
(2) After Kafka's partitions are increased, Spark can set the same number of RDD partitions according to Kafka's consumer-group mechanism, increasing data throughput (the routing rule is sketched after this list).
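A one-line sketch of the routing rule quoted above; note that Kafka's actual DefaultPartitioner hashes with murmur2, but the modulo principle is the same.

```scala
// Mirrors the abs(key.hashCode) % numPartitions rule quoted above: more
// partitions spread the same keys over more brokers and machines.
def defaultPartition(key: String, numPartitions: Int): Int =
  math.abs(key.hashCode) % numPartitions

// defaultPartition("Click", 12) always routes "Click" records to the same
// one of the 12 partitions.
```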
In conclusion high performance kafka information middleware, spark on yarn's is excellent by the real-time collecting of log Different distributed nature: it is high fault-tolerant, it easily extends, extractly once is solved and aimed at dilatation in distributed system day, is held The problem of mistake is lacking, and can not easily carry out the ETL cleaning of big data log.
In an alternative embodiment, in order to increase spark to the efficiency of log processing, spark can be applied into Row optimization: specific as follows:
Optimization of the running parameters of a long-running Spark program:

  spark-submit --master yarn --deploy-mode cluster \
    --conf spark.yarn.maxAppAttempts=4 \
    --conf spark.yarn.am.attemptFailuresValidityInterval=1h \
    --conf spark.yarn.max.executor.failures={8*num_executors} \
    --conf spark.yarn.executor.failuresValidityInterval=1h \
    --conf spark.task.maxFailures=8
When the program runs in cluster mode, any error in the Spark driver stops the long-running job. Fortunately, spark.yarn.maxAppAttempts=4 can be configured as the maximum number of attempts to rerun the application. If the application runs for days or weeks without restarting or redeploying in a heavily used cluster, the 4 attempts could be exhausted within a few hours. To avoid this situation, the attempt counter should be reset every hour (spark.yarn.am.attemptFailuresValidityInterval=1h). Another important setting is the maximum number of executor failures before the application fails; by default it is max(2*num executors, 3), which is well suited to batch jobs but not to long-running ones. This attribute also has a corresponding validity period (spark.yarn.executor.failuresValidityInterval=1h). For long-running jobs, you may also consider raising the maximum number of task failures before giving up the job (spark.task.maxFailures=8). Through these configuration items, the Spark application can keep running reliably and long-term in a busy distributed environment.
In an alternative embodiment, in Spark, a first table is established according to the secondary-partition storage format, where the first table is used to store the summary information of the plurality of second target logs; first sub-tables are established under the first table, where the sub-tables are used to store the query information for querying the second target logs; and partitions are added to the sub-tables on a schedule, where each partition is used to store the query information of newly added logs.
For example, according to the secondary-partition storage format of the preceding ETL, a summary table is established for obtaining the summary information of the logs, and then sub-tables are established against the logtype directories for the queries and use of tasks. Partitions are added to the tables periodically by the scheduler. On this basis, we write a python script that recognizes newly added logs and log formats, establishes an automatic mapping from the JSON format to the Hive schema, and automatically creates the table-building statements, building the tables in the Hive warehouse. This automation reduces manual maintenance and quickly brings new data online (a sketch of the table layout follows).
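A sketch of the table layout this describes, assuming a Hive-enabled SparkSession named spark; all table, column, and path names are illustrative, and in practice the patent's python script would generate such statements from the JSON schema.

```scala
// Summary table over the secondary-partition layout written by the ETL job.
spark.sql(
  """CREATE TABLE IF NOT EXISTS warehouse_summary (logtime STRING, payload STRING)
    |PARTITIONED BY (logtype STRING, dt STRING)
    |STORED AS PARQUET
    |LOCATION '/warehouses'
    |""".stripMargin)

// The scheduler registers each newly written directory as a partition.
spark.sql(
  """ALTER TABLE warehouse_summary
    |ADD IF NOT EXISTS PARTITION (logtype = 'Click', dt = '2013-04-10')
    |LOCATION '/warehouses/Click/2013-04-10'
    |""".stripMargin)
```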
In an alternative embodiment, the small log files can be compressed and merged: each second target log among the plurality of second target logs is split to obtain multiple decomposed logs; the decomposed logs are parsed; the parsed decomposed logs within a preset time period are merged to obtain a merged log; and the merged log is compressed.
For example, in pursuit of a near-real-time effect, Spark Streaming splits each day's logs into many small pieces, which is unfriendly to the namenode's memory and the datanodes' disks. We use the scheduling system to merge and compress the previous week's small files as an optimization, in particular by setting the data volume and concurrency handled by MapReduce's map and reduce phases, setting dynamic partitioning, and setting snappy compression for intermediate results and output results. An INSERT OVERWRITE statement is then cleverly used so that queries are not affected during the merge-and-compress process (a sketch follows).
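A hedged sketch of the compaction pass under stated assumptions: the table names, the staging source, and the one-week window are illustrative, and spark is a Hive-enabled SparkSession.

```scala
// Enable dynamic partitioning and snappy-compressed output, then rewrite
// last week's partitions in one INSERT OVERWRITE so many small streaming
// files collapse into a few large compressed ones.
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
spark.sql("SET parquet.compression = SNAPPY")

spark.sql(
  """INSERT OVERWRITE TABLE warehouse_summary PARTITION (logtype, dt)
    |SELECT logtime, payload, logtype, dt
    |FROM warehouse_summary_staging
    |WHERE dt >= date_sub(current_date(), 7)
    |""".stripMargin)
```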
In an alternative embodiment, log storage is monitored: the storage of the plurality of second target logs is monitored, and when storage fails, alarm information is issued. For example, Spark's event listener is overridden for monitoring; metrics are sent through statsd and stored in graphite, and alarm rules are then configured through cabot so that, once triggered, alerts are sent through the alarm service. In addition, the RESTful interface provided by YARN is polled in real time for the job state and the service is pulled up automatically, substantially improving job reliability (a listener sketch follows).
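A sketch of such listener-based monitoring, assuming a statsd daemon on localhost:8125; the metric names are illustrative, and the graphite storage and cabot alarm rules live outside this code.

```scala
import java.net.{DatagramPacket, DatagramSocket, InetAddress}
import org.apache.spark.scheduler.{JobSucceeded, SparkListener, SparkListenerJobEnd}

class JobMonitor extends SparkListener {
  // Minimal statsd counter over UDP (statsd's plain-text "name:value|c" format).
  private def statsd(metric: String): Unit = {
    val socket = new DatagramSocket()
    try {
      val bytes = metric.getBytes("UTF-8")
      socket.send(new DatagramPacket(bytes, bytes.length,
        InetAddress.getByName("localhost"), 8125))
    } finally socket.close()
  }

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    jobEnd.jobResult match {
      case JobSucceeded => statsd("log_etl.job.success:1|c")
      case _            => statsd("log_etl.job.failure:1|c") // alarm rule keys on this
    }
}

// registration: spark.sparkContext.addSparkListener(new JobMonitor)
```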
In conclusion realizing fault-tolerant, the dilatation of storage by the technology of spark on yarn, monitoring distributed is healthy and strong System schema, the effective stability that product is provided, real-time, the extensive reliable memory of log of data analysis.Simultaneously In view of the merging and compression of cold and hot data, in the performance for not influencing data analysis, the reasonable load for reducing hadoop.
Fig. 3 is a flowchart (two) of the log processing method provided according to an embodiment of the present invention. As shown in Fig. 3, the method includes the following steps:
Step S302: formatting a plurality of collected logs using a preset log format to obtain a plurality of to-be-processed logs;
Step S304: storing the to-be-processed logs into topic folders corresponding to each to-be-processed log;
Step S306: sending the to-be-processed logs in the topic folders to a distributed processing platform.
Through the above steps, the collected logs are formatted by the open-source component to obtain a plurality of to-be-processed logs, and the to-be-processed logs are sent to the distributed processing platform; the distributed processing platform performs data cleansing on the to-be-processed logs to obtain a plurality of first target logs, and partitions the first target logs according to a preset time interval to obtain a plurality of second target logs. Logs can thus be processed in real time and log processing efficiency is improved, thereby solving the technical problem in the related art that log processing efficiency is low and cannot meet the demands placed on log data.
It should be noted that the execution subject of the above may be the open-source component Kafka, but is not limited thereto.
It should be noted that the above open-source component Kafka is an open-source stream-processing platform developed by the Apache Software Foundation, written in Scala and Java. The goal of the project is to provide a unified, high-throughput, low-latency platform for handling real-time data. Its persistence layer is essentially a "massive-scale publish/subscribe message queue architected as a distributed transaction log," which makes it highly valuable as enterprise-grade infrastructure for processing streaming data. In addition, Kafka can connect to external systems (for data import/export) via Kafka Connect, and provides Kafka Streams, a Java stream-processing library. Kafka stores messages that come from an arbitrary number of processes called "producers." The data can be assigned to different "partitions" under different "topics." A topic is used to classify messages; every message entering Kafka is placed under a topic.
It should be noted that the format of the to-be-processed logs follows the Kafka specification to facilitate structured parsing. The preset log format may be: [logtime] [operation] JSON; for example: [2013-04-10 01:00:09] [Click] {"time":12344343,"server":"1001"}.
In this embodiment, an application can store logs locally through a logging module, or collect client logs through nginx access logs, and then use collection tools such as rsyslog, filebeat, or scribe agent to gather the logs and produce them into a Kafka topic.
It should be noted that Apache Spark is an open-source cluster-computing framework originally developed by AMPLab at the University of California, Berkeley. Unlike Hadoop's MapReduce, which writes intermediate data to disk after a job has run, Spark uses in-memory computing and can perform analytic operations in memory before the data is written to the hard disk.
It should be noted that Kafka stores messages that come from an arbitrary number of processes called "producers." The data can be assigned to different "partitions" under different "topics." Within a partition, the messages are indexed and stored together with their timestamps. Other processes called "consumers" can query messages from a partition. Kafka runs on a cluster composed of one or more servers, and partitions can be distributed across cluster nodes.
Kafka efficiently processes real-time streaming data and can be integrated with Storm, HBase, and Spark. Deployed as a cluster on multiple servers, Kafka handles its entire publish/subscribe message system through four APIs, namely the Producer API, the Consumer API, the Streams API, and the Connector API. It can deliver large-scale streaming messages with fault tolerance, replacing some conventional message systems such as JMS and AMQP.
The main terms of the Kafka architecture include Topic, Record, and Broker. A Topic consists of Records, Records hold the information, and Brokers are responsible for replicating messages. Kafka has four main APIs:
Producer API: allows an application to publish a stream of Records.
Consumer API: allows an application to subscribe to Topics and process the stream of Records.
Streams API: converts input streams into output streams and produces results.
Connector API: implements reusable producer and consumer APIs that can link Topics to existing applications.
A Topic is used to classify messages; every message entering Kafka is placed under a Topic.
A Broker is the host server that implements data storage.
Partition: the messages in each Topic can be divided into several Partitions to improve message-processing efficiency.
An embodiment of the present invention further provides a log processing device. Fig. 4 is a structural schematic diagram (one) of the log processing device provided according to an embodiment of the present invention. As shown in Fig. 4, the device includes:
a receiving module 42, configured to receive a plurality of to-be-processed logs sent by an open-source component, where the log format of the to-be-processed logs results from the open-source component converting the logs using a preset log format;
a first determining module 44, configured to perform data cleansing on the to-be-processed logs to obtain a plurality of first target logs;
a second determining module 46, configured to partition the first target logs according to a preset time interval to obtain a plurality of second target logs.
Through the above modules, the collected logs are formatted by the open-source component to obtain a plurality of to-be-processed logs, and the to-be-processed logs are sent to the distributed processing platform; the distributed processing platform performs data cleansing on the to-be-processed logs to obtain a plurality of first target logs, and partitions the first target logs according to the preset time interval to obtain a plurality of second target logs. Logs can thus be processed in real time and log processing efficiency is improved, thereby solving the technical problem in the related art that log processing efficiency is low and cannot meet the demands placed on log data.
It should be noted that the above open-source component Kafka is an open-source stream-processing platform developed by the Apache Software Foundation, written in Scala and Java. The goal of the project is to provide a unified, high-throughput, low-latency platform for handling real-time data. Its persistence layer is essentially a "massive-scale publish/subscribe message queue architected as a distributed transaction log," which makes it highly valuable as enterprise-grade infrastructure for processing streaming data. In addition, Kafka can connect to external systems (for data import/export) via Kafka Connect, and provides Kafka Streams, a Java stream-processing library. Kafka stores messages that come from an arbitrary number of processes called "producers." The data can be assigned to different "partitions" under different "topics." A topic is used to classify messages; every message entering Kafka is placed under a topic.
It should be noted that the format of the to-be-processed logs follows the Kafka specification to facilitate structured parsing. The preset log format may be: [logtime] [operation] JSON; for example: [2013-04-10 01:00:09] [Click] {"time":12344343,"server":"1001"}.
In this embodiment, an application can store logs locally through a logging module, or collect client logs through nginx access logs, and then use open-source collection agents such as rsyslog, filebeat, or scribe agent to gather the logs and produce them into a Kafka topic.
It should be noted that Apache Spark is an open-source cluster-computing framework originally developed by AMPLab at the University of California, Berkeley. Unlike Hadoop's MapReduce, which writes intermediate data to disk after a job has run, Spark uses in-memory computing and can perform analytic operations in memory before the data is written to the hard disk.
In an alternative embodiment, Spark performs data cleansing on the multiple to-be-processed logs as follows: the data cleansing of the to-be-processed logs is triggered by a preset action operator, and a preset transformation operator performs the cleansing on the triggered logs. For example, Spark's rich transformation operators (transform), such as map, flatMap, filter, and user-defined functions (UDF), implement the cleansing of the data, while action operators (action), such as collect and save, trigger the cleansing process.
In an alternative embodiment, the first target logs are partitioned according to the preset time interval as follows: the log type and log time of each first target log are determined, where the log time is the time at which each first target log was obtained; each first target log is partitioned and stored into a first preset directory based on its log type and log time, yielding the second target logs. For example, by setting a reasonable Spark Streaming batch interval (the preset time interval, e.g., 5 minutes), each batch is secondarily partitioned according to logtype (log type) and logtime (log time) and written under the HDFS directory /warehouses/{logtype}/{logtime} (the first preset directory) for storage. The delay of the data is roughly equal to the batch interval; with the 5-minute setting above, the data shown in Hive is likewise delayed by 5 minutes. Setting a reasonable batch interval thus achieves near-real-time collection of the data. In this embodiment, the Spark computing engine stores into the Hadoop storage system.
It should be noted that the time slice, or batch interval, is the standard by which the streaming data is artificially quantized, and serves as the basis on which we split the stream; the data of one time slice corresponds to one RDD instance.
It should be noted that HDFS is the abbreviation of Hadoop Distributed File System, a distributed file system of Hadoop.
In an alternative embodiment, after the plurality of second target logs are obtained, exactly-once storage of the second target logs is required. It should be understood that before version 2.0, Spark Streaming could only achieve at-least-once delivery; the Spark framework itself can hardly achieve exactly-once, which requires the program to directly implement a reliable data source and a downstream that supports idempotent operations. Spark Structured Streaming, however, can achieve exactly-once simply.
It should be noted that exactly-once means each piece of data is processed only once; it is one of the difficulties of real-time computing, with the goal that each record is handled exactly once.
Specifically, the following approaches are included:
1) The plurality of second target logs are stored into a second preset directory in an exactly-once manner through a preset Source trait interface. For example, using the Kafka source: in the source code, Source is an abstract interface, trait Source (the preset Source trait interface), containing the functions that Structured Streaming necessarily needs to achieve end-to-end exactly-once processing. getOffset() of the Kafka source (KafkaSource) reads the offsets saved by the previous batch; a long-running consumer on the driver side fetches the latest offsets of each topic from the Kafka brokers; the getBatch method returns a DataFrame according to the offsets; and commit saves a file on HDFS through the checkpoint mechanism, recording Kafka's offset positions so that getOffset can obtain valid offset values after a failure. In this embodiment, getOffset() works as follows: each time, it reads the HDFS checkpoint file to obtain the start position of this read of Kafka, and after reading, it updates the end position of this read into the checkpoint file.
2) The plurality of second target logs are stored into the second preset directory in an exactly-once manner through a preset batch-processing function. For example, hdfsSink: its addBatch() method supports the function that Structured Streaming necessarily needs to achieve end-to-end exactly-once processing. The concrete implementation of hdfsSink's addBatch() keeps metadata under an HDFS directory recording the maximum batchId completed so far; when recovering from a failure, if a job's batchId is less than or equal to the maximum batchId in the metadata, the job is skipped, thereby making the data writes idempotent. In this embodiment, addBatch() includes the following behavior: if the submitted batchId is not greater than the batchId recorded in the HDFS metadata, the batch is considered already written by the task and is skipped directly; otherwise, the log data is written into the partition directory and the HDFS metadata batchId is incremented by 1.
It should be noted that Spark Streaming is an extension of the Spark core application programming interface (API) that enables high-throughput, fault-tolerant processing of real-time streaming data. It supports obtaining data from multiple sources, including Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP sockets; after data is obtained from a source, high-level functions such as map, reduce, join, and window can be used for complex algorithmic processing. Finally, the processing results can be stored to file systems, databases, and live dashboards. On the basis of "One Stack to rule them all," Spark's other sub-frameworks, such as machine learning and graph computation, can also be used to process streaming data.
In an alternative embodiment, when it is determined that storing the plurality of second target logs to the second preset directory has failed, the to-be-processed logs can be recovered and processed again in the following ways:
1) Within a first preset number of days, the to-be-processed logs are re-obtained from a local cache. For example, logs are printed to the local file system, rolled daily, and kept compressed for 10 days; production machines have comprehensive operating-system alarms, so hard-disk capacity problems are detected in time and disk-write failures are avoided. Keeping 10 days guards against problems in downstream systems; the worst case is a problem not being discovered in time over a long holiday such as the National Day, after which recovery can be pushed again from the data source.
2) The to-be-processed logs are recovered from the local disk by the log collection system. For example, the scribe agent's heartbeat detection and log-delay detection can discover problems in time; the scribe agent records the metadata of the files it sends, most importantly each file's send position, which can be used for precise recovery from data failures, and it sends data over TCP.
It should be noted that Scribe is Facebook's open-source log collection system, which is used extensively inside Facebook. It can collect logs from various log sources and store them on a central storage system (which can be NFS, a distributed file system, etc.) for centralized statistical analysis. It provides a scalable, highly fault-tolerant solution for the "distributed collection, unified processing" of logs. Scribe's architecture is fairly simple, mainly comprising three parts: the scribe agent, scribe, and the storage system.
3) The to-be-processed logs are recovered from the replicas stored in the open-source component. For example, the replica count on the Kafka cluster is 3 and logs are persisted for 3 days; if Spark Structured Streaming fails, data can be recovered directly from Kafka.
4) The metadata files and offsets of the to-be-processed logs are read to recover them. For example, Spark's checkpoint mechanism saves state information such as Kafka offsets; when a task fails, the restarted Spark task reads the metadata and offset files under the checkpoint directory, and the KafkaConsumer can read data from the specified partitions at those offsets, neither repeating nor omitting data.
5) The to-be-processed logs are obtained from multiple replica files in storage, where the replica files are stored on one of the following: the local node, a node in the local rack, and a node in a different rack. For example, the HDFS block mechanism: the file replica count is 3, and HDFS's replica placement strategy is to place the first replica on the local node, the second replica on another node in the local rack, and the third replica on a node in a different rack. This reduces inter-rack write traffic and thereby improves write performance. Since the probability of rack failure is far smaller than that of node failure, this approach does not affect the guarantees of data reliability and availability, while genuinely reducing the aggregate network bandwidth of read operations.
6) insert overwrite: data extracted into Hive is written with the INSERT OVERWRITE operation, which, unlike INSERT INTO, achieves idempotence and can be run repeatedly.
In an alternative embodiment, in order to increase log processing efficiency, Kafka can be scaled horizontally, specifically including:
(1) Increasing the number of Kafka brokers and the partitions of topics. According to Kafka's default partitioning, abs(key.hashCode) % numPartitions, the data of a topic can then be distributed over more brokers and machines, increasing the data-processing capacity of the upstream and downstream.
(2) After Kafka's partitions are increased, Spark can set the same number of RDD partitions according to Kafka's consumer-group mechanism, increasing data throughput.
In summary, real-time collection of logs through the high-performance Kafka message middleware, combined with the excellent distributed properties of Spark on YARN (high fault tolerance, easy scaling, and exactly-once semantics), solves the problems that log storage in distributed systems is hard to expand and poorly fault-tolerant, and that ETL cleansing of big-data logs cannot be performed conveniently.
In an alternative embodiment, in order to increase Spark's log processing efficiency, the Spark application can be optimized, specifically as follows:
Optimization of the running parameters of a long-running Spark program:

  spark-submit --master yarn --deploy-mode cluster \
    --conf spark.yarn.maxAppAttempts=4 \
    --conf spark.yarn.am.attemptFailuresValidityInterval=1h \
    --conf spark.yarn.max.executor.failures={8*num_executors} \
    --conf spark.yarn.executor.failuresValidityInterval=1h \
    --conf spark.task.maxFailures=8
When the program runs in cluster mode, any error in the Spark driver stops the long-running job. Fortunately, spark.yarn.maxAppAttempts=4 can be configured as the maximum number of attempts to rerun the application. If the application runs for days or weeks without restarting or redeploying in a heavily used cluster, the 4 attempts could be exhausted within a few hours. To avoid this situation, the attempt counter should be reset every hour (spark.yarn.am.attemptFailuresValidityInterval=1h). Another important setting is the maximum number of executor failures before the application fails; by default it is max(2*num executors, 3), which is well suited to batch jobs but not to long-running ones. This attribute also has a corresponding validity period (spark.yarn.executor.failuresValidityInterval=1h). For long-running jobs, you may also consider raising the maximum number of task failures before giving up the job (spark.task.maxFailures=8). Through these configuration items, the Spark application can keep running reliably and long-term in a busy distributed environment.
In an alternative embodiment, in spark, the first table is established according to secondary partition storage format, wherein First table is used to store the summary information of multiple second target journalings;The first sub-table is established in the first table, wherein the One sub-table is for storing the query information for inquiring multiple second target journalings;Timing is that the first sub-table adds subregion, wherein Subregion is used to store the query information of the log of letter addition.
Such as: according to the secondary partition storage format of the ETL of front, establishes a summary table and believe for obtaining the summary of log Then breath establishes inquiry and use of each sublist for task to logtype catalogue.It is table addition by scheduling timing Subregion.Face on this basis, we write the newly-increased log and journal format of python script identification, establish json format and arrive The automatic mapping of hive schema, automatically creates and builds table statement, establishes table in the warehouse hive.It automates, reduces artificial in this way The Quick thread of maintenance and new data.
In an alternative embodiment, the small documents of log can be compressed and is merged, by multiple second targets Each second target journaling in log is split, and obtains multiple decomposition logs;Parse multiple decomposition logs;By preset time Multiple decomposition logs in section after parsing merge, and obtain merging log;Compression merges log.
Such as: the effect in order to pursue near real-time, spark streaming can be divided into daily log multiple small points Solution, this is used the memory of namenode and the disk of datanode is disagreeableness.We are merged and are pressed using scheduling system The small documents of contracting the last week, optimize.Particularly pass through the data of the processing of the map and reduce of setting mapreduce Dynamic partition, setting intermediate result and the snappy compression for exporting result is arranged in amount, concurrent number.Then it cleverly uses The sentence of insert overwrite is merging and will not influence inquiry in compression process.
In an alternative embodiment, the storage of the logs is monitored: the storage of the multiple second target logs is monitored, and an alarm is raised in the case of a storage failure. For example: we override spark's event listener for monitoring, send metrics through statsd and store them in graphite, then configure alarm rules with cabot; once a rule is triggered, an alarm is raised through the alarm service. In addition, the restful interface provided by yarn is polled in real time for the job state and the service is pulled up automatically, which greatly improves the reliability of the job.
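A minimal sketch of the polling side follows, assuming the requests and statsd python packages, YARN's standard application REST endpoint, and a hypothetical restart script; the host names and the application id are placeholders:

import subprocess
import time

import requests
import statsd  # pip install statsd

RM = "http://resourcemanager:8088"          # assumed ResourceManager address
APP_ID = "application_1550000000000_0001"   # placeholder application id
metrics = statsd.StatsClient("statsd-host", 8125, prefix="log_etl")

while True:
    app = requests.get(f"{RM}/ws/v1/cluster/apps/{APP_ID}").json()["app"]
    state = app["state"]                    # e.g. RUNNING, FINISHED, FAILED
    metrics.gauge("job.running", 1 if state == "RUNNING" else 0)
    if state in ("FAILED", "KILLED"):
        metrics.incr("job.failures")        # cabot alarms on this metric
        # hypothetical script that automatically pulls the service back up
        subprocess.run(["/opt/etl/restart_job.sh"], check=False)
        break
    time.sleep(60)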
In conclusion realizing fault-tolerant, the dilatation of storage by the technology of spark on yarn, monitoring distributed is healthy and strong System schema, the effective stability that product is provided, real-time, the extensive reliable memory of log of data analysis.Simultaneously In view of the merging and compression of cold and hot data, in the performance for not influencing data analysis, the reasonable load for reducing hadoop.
The embodiment of the invention also provides a log processing device. Fig. 5 is a structural schematic diagram (two) of the log processing device provided according to an embodiment of the present invention. As shown in fig. 5, the device includes:

a third determining module 52, configured to format the multiple collected logs using a default log format to obtain multiple logs to be processed;

a storage module 54, configured to store the multiple logs to be processed respectively into the topic folders corresponding to each log to be processed;

a sending module 56, configured to send the multiple logs to be processed in the topic folders to the distributed processing platform.
Through the above steps, the collected logs are formatted by the open source component to obtain multiple logs to be processed, and the logs to be processed are sent to the distributed processing platform; the distributed processing platform performs data cleansing on the multiple logs to be processed to obtain multiple first target logs, and performs partition processing on the multiple first target logs according to a preset time interval to obtain multiple second target logs. This makes real-time log processing possible and improves the efficiency of log processing, thereby solving the technical problem in the related art that the processing efficiency of logs is low and cannot meet the demands on log data.
It should be noted that the executing subject above can be the open source component kafka, but is not limited thereto.
It should be noted that the open source component Kafka above is an open source stream processing platform developed by the Apache Software Foundation, written in Scala and Java. The goal of the project is to provide a unified, high-throughput, low-latency platform for processing real-time data. Its data persistence layer is essentially a massive publish/subscribe "message queue" following a distributed transaction log architecture, which makes it very valuable as enterprise-level infrastructure for processing streaming data. In addition, Kafka can connect to external systems (for data input/output) through Kafka Connect, and provides Kafka Streams, a Java stream processing library. Kafka stores messages coming from any number of processes called "producers" (Producer). The data can thereby be assigned to different "partitions" (Partition) under different "Topics". A Topic is used to classify messages; every message entering Kafka can be placed under a Topic.
It should be noted that the format of the multiple logs to be processed follows the Kafka specification, which is convenient for structured parsing. The default log format can be: [logtime] [operation] JSON; for example: [2013-04-10 01:00:09] [Click] {"time":12344343,"server":"1001"}.
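A minimal sketch of parsing one line in this format; the regular expression is an illustrative assumption based on the example above:

import json
import re

# [logtime] [operation] JSON  ->  three capture groups
LOG_LINE = re.compile(
    r"^\[(?P<logtime>[^\]]+)\]\s*\[(?P<operation>[^\]]+)\]\s*(?P<body>\{.*\})$"
)

def parse_log_line(line):
    match = LOG_LINE.match(line)
    if match is None:
        return None  # malformed line, left for the data cleansing step
    return {
        "logtime": match.group("logtime"),
        "operation": match.group("operation"),
        **json.loads(match.group("body")),
    }

print(parse_log_line('[2013-04-10 01:00:09] [Click] {"time":12344343,"server":"1001"}'))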
In the present embodiment, the application programs in front of Kafka can store the client logs locally, either through a logging module or through the nginx access log, and then use agent collectors such as rsyslog, filebeat or scribe to collect the logs and produce them into a kafka topic.
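A minimal producer sketch with the kafka-python package; the broker address and topic name are placeholders:

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers=["kafka-broker:9092"],   # placeholder broker address
    value_serializer=lambda v: v.encode("utf-8"),
)

# one log line in the default format described above
line = '[2013-04-10 01:00:09] [Click] {"time":12344343,"server":"1001"}'
producer.send("click_log", line)               # illustrative topic name
producer.flush()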
It should be noted that Apache Spark is an open source cluster computing framework, originally developed by AMPLab at the University of California, Berkeley. In contrast to Hadoop's MapReduce, which stores intermediate data on disk after a job has run, Spark uses in-memory computing and can perform analytic operations in memory before the data has been written to disk.
It should be noted that Kafka stores the messages coming from any number of processes called "producers" (Producer), and the data can thereby be assigned to different "partitions" (Partition) under different "Topics". Within a partition, the messages are indexed and stored together with their timestamps. Other processes called "consumers" (Consumer) can query messages from the partitions. Kafka runs on a cluster composed of one or more servers, and partitions can be distributed across the cluster nodes.
Kafka efficiently processes real-time streaming data and can be integrated with Storm, HBase and Spark. Deployed as a cluster on multiple servers, Kafka handles all of its publish and subscribe messaging through four APIs, namely the Producer API, Consumer API, Streams API and Connector API. It can deliver large-scale streaming messages with fault-tolerance, replacing some conventional message systems such as JMS and AMQP.
The main terms of the Kafka architecture include Topic, Record and Broker. A Topic consists of Records, Records hold the individual pieces of information, and Brokers are responsible for replicating messages. Kafka has four main APIs:

Producer API: supports an application publishing streams of Records.

Consumer API: supports an application subscribing to Topics and processing streams of Records.

Streams API: converts input streams into output streams and produces results.

Connector API: implements reusable producer and consumer APIs that can link Topics to existing applications.

A Topic is used to classify messages; every message entering Kafka can be placed under a Topic.

A Broker is the host server that realizes the data storage.

Partition: the messages in each Topic can be divided into several Partitions to improve the efficiency of message processing.
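As a companion illustration of the Consumer API, a minimal kafka-python sketch; the broker, topic and group names are placeholders:

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "click_log",                               # illustrative topic name
    bootstrap_servers=["kafka-broker:9092"],   # placeholder broker address
    group_id="log-etl",                        # placeholder consumer group
    auto_offset_reset="earliest",
)

for message in consumer:
    # each record carries its partition, offset and timestamp, as described
    print(message.partition, message.offset, message.value.decode("utf-8"))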
The embodiment of the invention also provides a log processing system. Fig. 6 is a structural schematic diagram of the log processing system provided according to an embodiment of the present invention. As shown in fig. 6, the system includes:

a distributed processing platform spark, wherein the spark is arranged to perform, at runtime, the method of any one of claims 1 to 8;

an open source component kafka, connected to the distributed processing platform, wherein the kafka is arranged to perform, at runtime, the method of claim 9.

As shown in fig. 6, the system further includes a server; kafka collects the logs of the clients from the server, transmits the logs to spark for log processing, and transmits the processed logs to Hadoop for further storage.
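A minimal sketch of this pipeline with spark structured streaming follows; the broker address, topic name and HDFS paths are placeholders, and structured streaming is one possible realization of the spark streaming stage described here:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# read the raw log lines from the kafka topic filled by the collectors
logs = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka-broker:9092")   # placeholder
    .option("subscribe", "click_log")                         # illustrative topic
    .load()
    .selectExpr("CAST(value AS STRING) AS line")
)

# persist the stream to Hadoop for further storage, with checkpointing
query = (
    logs.writeStream.format("parquet")
    .option("path", "hdfs:///warehouse/logs/click")           # placeholder path
    .option("checkpointLocation", "hdfs:///checkpoints/click")
    .start()
)
query.awaitTermination()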
The embodiments of the present invention also provide a storage medium in which a computer program is stored, wherein the computer program is arranged to execute, at runtime, the steps in any one of the above method embodiments.

Optionally, in the present embodiment, the above storage medium can be set to store a computer program for executing each of the above steps.

Optionally, in the present embodiment, the above storage medium can include but is not limited to: a USB flash disk, a read-only memory (Read-Only Memory, ROM for short), a random access memory (Random Access Memory, RAM for short), a mobile hard disk, a magnetic disk, an optical disk, and other media that can store a computer program.

The embodiments of the present invention also provide an electronic device, including a memory and a processor; a computer program is stored in the memory, and the processor is arranged to run the computer program to execute the steps in any one of the above method embodiments.

Optionally, the above electronic device can also include a transmission device and an input-output device, wherein the transmission device is connected with the above processor, and the input-output device is connected with the above processor.

Optionally, in the present embodiment, the above processor can be set to execute each of the above steps through a computer program.

Optionally, for specific examples in the present embodiment, reference can be made to the examples described in the above embodiments and optional implementations; details are not described here again in the present embodiment.
The serial numbers of the above embodiments of the invention are only for description and do not represent the advantages or disadvantages of the embodiments.

In the above embodiments of the invention, the description of each embodiment has its own emphasis; for parts not described in detail in a certain embodiment, reference can be made to the related descriptions of other embodiments.

In the several embodiments provided in this application, it should be understood that the disclosed technical contents can be realized in other ways. The apparatus embodiments described above are merely exemplary; for example, the division of the units can be a division by logical function, and there can be other division manners in actual implementation: multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed can be an indirect coupling or communication connection through some interfaces, units or modules, and can be electrical or in other forms.

The units described as separate members may or may not be physically separated, and the components shown as units may or may not be physical units; they can be located in one place or distributed over multiple units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the various embodiments of the present invention may be integrated into one processing unit, or each unit may physically exist alone, or two or more units may be integrated into one unit. The above integrated unit can be realized in the form of hardware, or in the form of a software functional unit.

If the integrated unit is realized in the form of a software functional unit and is sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part contributing to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes some instructions for causing a computer device (which can be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods of the various embodiments of the present invention. The aforementioned storage medium includes: a USB flash disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a mobile hard disk, a magnetic disk, an optical disk, and other media that can store program code.

The above are only preferred embodiments of the present invention. It should be noted that, for those of ordinary skill in the art, various improvements and modifications can be made without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (14)

1. A log processing method, characterized by comprising:
receiving multiple logs to be processed sent by an open source component, wherein the log format of the multiple logs to be processed is the result of the open source component converting multiple logs using a default log format;
performing data cleansing on the multiple logs to be processed to obtain multiple first target logs;
performing partition processing on the multiple first target logs according to a preset time interval to obtain multiple second target logs.
2. The method according to claim 1, characterized in that performing data cleansing on the multiple logs to be processed to obtain the multiple first target logs comprises:
triggering the data cleansing of the multiple logs to be processed using a preset activity algorithm;
performing data cleansing on the triggered multiple logs to be processed using a preset transfer algorithm.
3. The method according to claim 1, characterized in that performing partition processing on the multiple first target logs according to the preset time interval to obtain the multiple second target logs comprises:
determining the log type and the logging time of each first target log among the multiple first target logs, wherein the logging time is the time at which each first target log was obtained;
storing each first target log, partitioned based on the log type and the logging time, into a first preset directory to obtain the multiple second target logs.
4. The method according to claim 1, characterized in that after performing partition processing on the multiple first target logs according to the preset time interval to obtain the multiple second target logs, the method further comprises one of the following:
storing the multiple second target logs into a second preset directory in an exactly-once manner through a preset feature source interface;
storing the multiple second target logs into a second preset directory in an exactly-once manner through a preset batch processing function.
5. The method according to claim 4, characterized in that after performing real-time partition processing on the multiple first target logs according to the preset time interval to obtain the multiple second target logs, and in the case where it is determined that storing the multiple second target logs into the second preset directory has failed, the method further comprises one of the following:
reacquiring the multiple logs to be processed from a local cache within a first preset number of days;
restoring the multiple logs to be processed from a local disk using a log collection system;
restoring the multiple logs to be processed from the copies stored in the open source component;
reading the metadata file (metadata) of the multiple logs to be processed and the compensation function offset to restore the multiple logs to be processed;
obtaining the multiple logs to be processed from multiple stored replica files, wherein the multiple replica files are stored in one of the following: the local node, a node in the local rack, a node in a different rack.
6. The method according to claim 1, characterized in that after performing partition processing on the multiple first target logs according to the preset time interval to obtain the multiple second target logs, the method further comprises:
establishing a first table according to a secondary partition storage format, wherein the first table is used to store the summary information of the multiple second target logs;
establishing a first sub-table under the first table, wherein the first sub-table is used to store the query information for querying the multiple second target logs;
adding partitions to the first sub-table on a schedule, wherein the partitions are used to store the query information of newly added logs.
7. The method according to claim 1, characterized in that after performing partition processing on the multiple first target logs according to the preset time interval to obtain multiple second target logs, the method further comprises:
splitting each second target log among the multiple second target logs to obtain multiple decomposed logs;
parsing the multiple decomposed logs;
merging the multiple decomposed logs parsed within a preset time period to obtain a merged log;
compressing the merged log.
8. The method according to claim 1, characterized in that after performing partition processing on the multiple first target logs according to the preset time interval to obtain the multiple second target logs, the method further comprises:
monitoring the storage of the multiple second target logs;
raising an alarm in the case where the storage fails.
9. A log processing method, characterized by comprising:
formatting multiple collected logs using a default log format to obtain multiple logs to be processed;
storing the multiple logs to be processed respectively into topic folders corresponding to each log to be processed;
sending the multiple logs to be processed in the topic folders to a distributed processing platform.
10. A log processing device, characterized by comprising:
a receiving module, configured to receive multiple logs to be processed sent by an open source component, wherein the log format of the multiple logs to be processed is the result of the open source component converting multiple logs using a default log format;
a first determining module, configured to perform data cleansing on the multiple logs to be processed to obtain multiple first target logs;
a second determining module, configured to perform partition processing on the multiple first target logs according to a preset time interval to obtain multiple second target logs.
11. A log processing device, characterized by comprising:
a third determining module, configured to format multiple collected logs using a default log format to obtain multiple logs to be processed;
a storage module, configured to store the multiple logs to be processed respectively into topic folders corresponding to each log to be processed;
a sending module, configured to send the multiple logs to be processed in the topic folders to a distributed processing platform.
12. A log processing system, characterized by comprising:
a distributed processing platform spark, wherein the spark is arranged to perform, at runtime, the method of any one of claims 1 to 8;
an open source component kafka, connected to the distributed processing platform, wherein the kafka is arranged to perform, at runtime, the method of claim 9.
13. A storage medium, characterized in that a computer program is stored in the storage medium, wherein the computer program is arranged to perform, at runtime, the method of any one of claims 1 to 8, or the method of claim 9.
14. An electronic device, including a memory and a processor, characterized in that a computer program is stored in the memory, and the processor is arranged to run the computer program to perform the method of any one of claims 1 to 8, or the method of claim 9.
CN201910138347.5A 2019-02-25 2019-02-25 Log processing method, log processing device, storage medium and electronic device Active CN109918349B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910138347.5A CN109918349B (en) 2019-02-25 2019-02-25 Log processing method, log processing device, storage medium and electronic device

Publications (2)

Publication Number Publication Date
CN109918349A (en) 2019-06-21
CN109918349B CN109918349B (en) 2021-05-25

Family

ID=66962220

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910138347.5A Active CN109918349B (en) 2019-02-25 2019-02-25 Log processing method, log processing device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN109918349B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140344622A1 (en) * 2013-05-20 2014-11-20 Vmware, Inc. Scalable Log Analytics
CN104298771A (en) * 2014-10-30 2015-01-21 南京信息工程大学 Massive web log data query and analysis method
CN105389352A (en) * 2015-10-30 2016-03-09 北京奇艺世纪科技有限公司 Log processing method and apparatus
CN105824744A (en) * 2016-03-21 2016-08-03 焦点科技股份有限公司 Real-time log collection and analysis method on basis of B2B (Business to Business) platform
CN108600300A (en) * 2018-03-06 2018-09-28 北京思空科技有限公司 Daily record data processing method and processing device

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110908788A (en) * 2019-12-02 2020-03-24 北京锐安科技有限公司 Spark Streaming based data processing method and device, computer equipment and storage medium
CN110908788B (en) * 2019-12-02 2022-04-08 北京锐安科技有限公司 Spark Streaming based data processing method and device, computer equipment and storage medium
CN111581173A (en) * 2020-05-09 2020-08-25 深圳市卡数科技有限公司 Distributed storage method and device for log system, server and storage medium
CN111581173B (en) * 2020-05-09 2023-10-20 深圳市卡数科技有限公司 Method, device, server and storage medium for distributed storage of log system
WO2021238273A1 (en) * 2020-05-28 2021-12-02 苏州浪潮智能科技有限公司 Message fault tolerance method and system based on spark streaming computing framework
CN113760832A (en) * 2020-06-03 2021-12-07 富泰华工业(深圳)有限公司 File processing method, computer device and readable storage medium
CN111831617B (en) * 2020-07-16 2022-08-09 福建天晴数码有限公司 Method for guaranteeing uniqueness of log data based on distributed system
CN111831617A (en) * 2020-07-16 2020-10-27 福建天晴数码有限公司 Method for guaranteeing uniqueness of log data based on distributed system
WO2021189954A1 (en) * 2020-10-12 2021-09-30 平安科技(深圳)有限公司 Log data processing method and apparatus, computer device, and storage medium
CN112612677A (en) * 2020-12-28 2021-04-06 北京天融信网络安全技术有限公司 Log storage method and device, electronic equipment and readable storage medium
CN112506862A (en) * 2020-12-28 2021-03-16 浪潮云信息技术股份公司 Method for custom saving Kafka Offset
CN113190726A (en) * 2021-04-16 2021-07-30 珠海格力精密模具有限公司 Method for reading CAE (computer aided engineering) modular flow analysis data, electronic equipment and storage medium
CN113312353A (en) * 2021-06-10 2021-08-27 中国民航信息网络股份有限公司 Storage method and system for tracking journal
CN113806434A (en) * 2021-09-22 2021-12-17 平安科技(深圳)有限公司 Big data processing method, device, equipment and medium
CN113806434B (en) * 2021-09-22 2023-09-05 平安科技(深圳)有限公司 Big data processing method, device, equipment and medium
CN113778810A (en) * 2021-09-27 2021-12-10 杭州安恒信息技术股份有限公司 Log collection method, device and system

Also Published As

Publication number Publication date
CN109918349B (en) 2021-05-25

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant