CN109918349A - Log processing method, device, storage medium and electronic device - Google Patents


Info

Publication number
CN109918349A
Authority
CN
China
Prior art keywords
log
processed
logs
data
kafka
Legal status
Granted
Application number
CN201910138347.5A
Other languages
Chinese (zh)
Other versions
CN109918349B (en)
Inventor
刘晶晶
Current Assignee
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd
Priority to CN201910138347.5A
Publication of CN109918349A
Application granted
Publication of CN109918349B
Legal status: Active
Anticipated expiration


Abstract

The invention discloses a log processing method, device, storage medium and electronic device. The method includes: receiving a plurality of to-be-processed logs sent by an open-source component, where the log format of the to-be-processed logs results from the open-source component converting the logs using a preset log format; performing data cleansing on the to-be-processed logs to obtain a plurality of first target logs; and partitioning the first target logs according to a preset time interval to obtain a plurality of second target logs. The invention solves the technical problem in the related art that log processing efficiency is low and cannot meet the demands placed on log data.

Description

Log processing method, device, storage medium and electronic device
Technical field
The present invention relates to the computer field, and in particular to a log processing method, device, storage medium and electronic device.
Background art
With the arrival of the Internet+ era, the value of data has become increasingly prominent. Product data exhibits exponential growth and unstructured characteristics. Building a big data platform on the distributed processing platforms Spark and Hadoop provides the most central storage and processing capability for basic data, delivers powerful data-processing capacity, and meets the demand for interactive access to data. At the same time, Spark Streaming can effectively satisfy an enterprise's real-time data requirements and support building a real-time indicator system for enterprise development.
However, existing log storage approaches are not real-time enough, lack capacity expansion and fault tolerance in distributed systems, and cannot conveniently perform ETL cleansing of big-data logs (ETL, the abbreviation of Extract-Transform-Load, describes the process of extracting data from a source, transforming it, and loading it to a destination).
No effective solution to the above problems has yet been proposed.
Summary of the invention
The embodiments of the present invention provide a log processing method, device, storage medium and electronic device, so as to at least solve the technical problem in the related art that log processing efficiency is low and cannot meet the demands placed on log data.
According to one aspect of the embodiments of the present invention, a log processing method is provided, including: receiving a plurality of to-be-processed logs sent by an open-source component, where the log format of the to-be-processed logs results from the open-source component converting the logs using a preset log format; performing data cleansing on the to-be-processed logs to obtain a plurality of first target logs; and partitioning the first target logs according to a preset time interval to obtain a plurality of second target logs.
According to another aspect of the embodiments of the present invention, another log processing method is provided, including: formatting a plurality of collected logs using a preset log format to obtain a plurality of to-be-processed logs; storing the to-be-processed logs into topic folders corresponding to each to-be-processed log; and sending the to-be-processed logs in the topic folders to a distributed processing platform.
According to another aspect of the embodiments of the present invention, a log processing device is provided, including: a receiving module, configured to receive a plurality of to-be-processed logs sent by an open-source component, where the log format of the to-be-processed logs results from the open-source component converting the logs using a preset log format; a first determining module, configured to perform data cleansing on the to-be-processed logs to obtain a plurality of first target logs; and a second determining module, configured to partition the first target logs according to a preset time interval to obtain a plurality of second target logs.
According to another aspect of the embodiments of the present invention, another log processing device is provided, including: a third determining module, configured to format a plurality of collected logs using a preset log format to obtain a plurality of to-be-processed logs; a storage module, configured to store the to-be-processed logs into topic folders corresponding to each to-be-processed log; and a sending module, configured to send the to-be-processed logs in the topic folders to a distributed processing platform.
According to another aspect of the embodiments of the present invention, a log processing system is provided, including: a distributed processing platform Spark, where Spark is configured to perform the above method when running; and an open-source component Kafka connected to the distributed processing platform, where Kafka is configured to perform the above method when running.
According to still another embodiment of the present invention, a storage medium is provided, in which a computer program is stored, where the computer program is configured to perform the steps in any of the above method embodiments when run.
According to still another embodiment of the present invention, an electronic device is provided, including a memory and a processor, where a computer program is stored in the memory and the processor is configured to run the computer program to perform the steps in any of the above method embodiments.
In the embodiments of the present invention, the collected logs are formatted by the open-source component to obtain a plurality of to-be-processed logs, and the to-be-processed logs are sent to the distributed processing platform; the distributed processing platform performs data cleansing on the to-be-processed logs to obtain a plurality of first target logs, and partitions the first target logs according to a preset time interval to obtain a plurality of second target logs. Logs can thus be processed in real time and log processing efficiency is improved, thereby solving the technical problem in the related art that log processing efficiency is low and cannot meet the demands placed on log data.
Brief description of the drawings
The drawings described herein are used to provide a further understanding of the present invention and constitute a part of this application. The illustrative embodiments of the present invention and their descriptions are used to explain the present invention and do not constitute an improper limitation of the present invention. In the drawings:
Fig. 1 is a hardware block diagram of a mobile terminal for a log processing method according to an embodiment of the present invention;
Fig. 2 is a flowchart (one) of the log processing method provided according to an embodiment of the present invention;
Fig. 3 is a flowchart (two) of the log processing method provided according to an embodiment of the present invention;
Fig. 4 is a structural schematic diagram (one) of the log processing device provided according to an embodiment of the present invention;
Fig. 5 is a structural schematic diagram (two) of the log processing device provided according to an embodiment of the present invention;
Fig. 6 is a structural schematic diagram of the log processing system provided according to an embodiment of the present invention.
Detailed description of the embodiments
In order to enable those skilled in the art to better understand the solution of the present invention, the technical solutions in the embodiments of the present invention will be described clearly and completely below with reference to the accompanying drawings. Obviously, the described embodiments are only a part of the embodiments of the present invention, not all of them. Based on the embodiments of the present invention, all other embodiments obtained by those of ordinary skill in the art without creative effort shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description, claims, and drawings of this specification are used to distinguish similar objects, and are not used to describe a particular order or precedence. It should be understood that data so used are interchangeable under appropriate circumstances, so that the embodiments of the present invention described herein can be implemented in orders other than those illustrated or described herein. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion; for example, a process, method, system, product, or device that contains a series of steps or units is not necessarily limited to the steps or units expressly listed, but may include other steps or units not expressly listed or inherent to such a process, method, product, or device.
According to an embodiment of the present invention, a log processing method embodiment is provided. It should be noted that the steps illustrated in the flowcharts of the drawings can be executed in a computer system, such as a set of computer-executable instructions, and although a logical order is shown in the flowcharts, in some cases the steps shown or described may be executed in an order different from that described herein.
The method embodiments provided by the embodiments of the present invention can be executed in a mobile terminal, a computer terminal, or a similar computing device. Taking running on a mobile terminal as an example, Fig. 1 is a hardware block diagram of a mobile terminal for a log processing method according to an embodiment of the present invention. As shown in Fig. 1, the mobile terminal 10 may include one or more processors 102 (only one is shown in Fig. 1; the processor 102 may include, but is not limited to, a processing unit such as a microprocessor (MCU) or a programmable logic device (FPGA)) and a memory 104 for storing data. Optionally, the mobile terminal may further include a transmission device 106 for communication functions and an input/output device 108. Those of ordinary skill in the art will appreciate that the structure shown in Fig. 1 is merely illustrative and does not limit the structure of the above mobile terminal; for example, the mobile terminal 10 may include more or fewer components than shown in Fig. 1, or have a configuration different from that shown in Fig. 1.
The memory 104 can be used to store computer programs, for example, software programs and modules of application software, such as the computer program corresponding to the log processing method in the embodiments of the present invention. The processor 102 runs the computer programs stored in the memory 104, thereby executing various functional applications and data processing, i.e., implementing the above method. The memory 104 may include high-speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some examples, the memory 104 may further include memory located remotely from the processor 102, and these remote memories may be connected to the mobile terminal 10 through a network. Examples of such networks include, but are not limited to, the Internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used to receive or send data via a network. Specific examples of the above network may include a wireless network provided by the communication provider of the mobile terminal 10. In one example, the transmission device 106 includes a network interface controller (NIC), which can be connected to other network devices through a base station so as to communicate with the Internet. In another example, the transmission device 106 may be a radio frequency (RF) module, which is used to communicate with the Internet wirelessly.
Fig. 2 is a flowchart (one) of the log processing method provided according to an embodiment of the present invention. As shown in Fig. 2, the method includes the following steps:
Step S202: receiving a plurality of to-be-processed logs sent by an open-source component, where the log format of the to-be-processed logs results from the open-source component converting the logs using a preset log format;
Step S204: performing data cleansing on the to-be-processed logs to obtain a plurality of first target logs;
Step S206: partitioning the first target logs according to a preset time interval to obtain a plurality of second target logs.
Through the above steps, the collected logs are formatted by the open-source component to obtain a plurality of to-be-processed logs, and the to-be-processed logs are sent to the distributed processing platform; the distributed processing platform performs data cleansing on the to-be-processed logs to obtain a plurality of first target logs, and partitions the first target logs according to the preset time interval to obtain a plurality of second target logs. Logs can thus be processed in real time and log processing efficiency is improved, thereby solving the technical problem in the related art that log processing efficiency is low and cannot meet the demands placed on log data.
It should be noted that the execution subject of the above may be the distributed processing platform Spark, but is not limited thereto.
It should be noted that the above open-source component Kafka is an open-source stream-processing platform developed by the Apache Software Foundation, written in Scala and Java. The goal of the project is to provide a unified, high-throughput, low-latency platform for handling real-time data. Its persistence layer is essentially a "massive-scale publish/subscribe message queue architected as a distributed transaction log," which makes it highly valuable as enterprise-grade infrastructure for processing streaming data. In addition, Kafka can connect to external systems (for data import/export) via Kafka Connect, and provides Kafka Streams, a Java stream-processing library. Kafka stores messages that come from an arbitrary number of processes called "producers." The data can be assigned to different "partitions" under different "topics." A topic is used to classify messages; every message entering Kafka is placed under a topic.
It should be noted that the format of the to-be-processed logs follows the Kafka specification to facilitate structured parsing. The preset log format may be: [logtime] [operation] JSON; for example: [2013-04-10 01:00:09] [Click] {"time":12344343,"server":"1001"}.
In this embodiment, an application can store logs locally through a logging module, or collect client logs through nginx access logs, and then use open-source collection agents such as rsyslog, filebeat, or scribe agent to gather the logs and produce them into a Kafka topic, as sketched below.
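As an illustration of that last hop, the following sketch formats one event into the preset layout and produces it into a Kafka topic; the topic name app-logs and the broker address are assumptions for illustration, not specified by the patent.

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

object LogProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "kafka-broker:9092") // placeholder address
    props.put("key.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer",
      "org.apache.kafka.common.serialization.StringSerializer")
    val producer = new KafkaProducer[String, String](props)

    // Render a raw event into the preset "[logtime] [operation] JSON" layout
    // before producing it, so downstream parsing stays structured.
    val line = """[2013-04-10 01:00:09] [Click] {"time":12344343,"server":"1001"}"""
    producer.send(new ProducerRecord[String, String]("app-logs", "Click", line))
    producer.close()
  }
}
```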
It should be noted that Apache Spark is an open-source cluster-computing framework originally developed by AMPLab at the University of California, Berkeley. Unlike Hadoop's MapReduce, which writes intermediate data to disk after a job has run, Spark uses in-memory computing and can perform analytic operations in memory before the data is written to the hard disk.
In an alternative embodiment, Spark performs data cleansing on the multiple to-be-processed logs as follows: the data cleansing of the to-be-processed logs is triggered by a preset action operator, and a preset transformation operator performs the cleansing on the triggered logs. For example, Spark's rich transformation operators (transform), such as map, flatMap, filter, and user-defined functions (UDF), implement the cleansing of the data, while action operators (action), such as collect and save, trigger the cleansing process (a minimal sketch follows).
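A minimal sketch of such a transform-plus-action cleansing pass; the input and output paths and the line-parsing regex are assumptions for illustration, not from the patent.

```scala
import org.apache.spark.sql.SparkSession

object CleanLogs {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("log-etl").getOrCreate()
    import spark.implicits._

    val raw = spark.read.textFile("/tmp/raw-logs") // placeholder input

    val LinePattern = """\[(.+?)\] \[(.+?)\] (\{.*\})""".r

    // flatMap doubles as map + filter: malformed lines yield an empty Seq
    // and are dropped, which is the "data cleansing" step.
    val cleaned = raw.flatMap { line =>
      line match {
        case LinePattern(logtime, logtype, json) => Seq((logtime, logtype, json))
        case _                                   => Seq.empty[(String, String, String)]
      }
    }.toDF("logtime", "logtype", "payload")

    cleaned.write.json("/tmp/cleaned-logs") // the save action triggers execution
  }
}
```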
In an alternative embodiment, the first target logs are partitioned according to the preset time interval as follows: the log type and log time of each first target log are determined, where the log time is the time at which each first target log was obtained; each first target log is partitioned and stored into a first preset directory based on its log type and log time, yielding the second target logs. For example, by setting a reasonable Spark Streaming batch interval (the preset time interval, e.g., 5 minutes), each batch is secondarily partitioned according to logtype (log type) and logtime (log time) and written under the HDFS directory /warehouses/{logtype}/{logtime} (the first preset directory) for storage. The delay of the data is roughly equal to the batch interval; with the 5-minute setting above, the data shown in Hive is likewise delayed by 5 minutes. Setting a reasonable batch interval thus achieves near-real-time collection of the data. In this embodiment, the Spark computing engine stores into the Hadoop storage system (see the streaming sketch below).
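The write path might look like the following Structured Streaming sketch. It is a sketch under stated assumptions: the spark-sql-kafka connector is on the classpath, the app-logs topic and broker address come from the earlier hypothetical producer, and Spark writes key=value partition directories (logtype=.../logtime=...), which correspond to the /warehouses/{logtype}/{logtime} layout described above.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.regexp_extract
import org.apache.spark.sql.streaming.Trigger

object PartitionedSink {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("partitioned-sink").getOrCreate()
    import spark.implicits._

    val lines = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "kafka-broker:9092") // placeholder
      .option("subscribe", "app-logs")                        // assumed topic
      .load()
      .selectExpr("CAST(value AS STRING) AS line")

    // Pull logtime and logtype out of the preset "[logtime] [operation] JSON"
    // layout; in practice logtime would be truncated to a day or hour first.
    val cleaned = lines
      .withColumn("logtime", regexp_extract($"line", """^\[(.+?)\]""", 1))
      .withColumn("logtype", regexp_extract($"line", """^\[.+?\] \[(.+?)\]""", 1))

    cleaned.writeStream
      .format("parquet")
      .option("path", "/warehouses")                  // first preset directory
      .option("checkpointLocation", "/checkpoints/log-etl")
      .partitionBy("logtype", "logtime")              // the secondary partition
      .trigger(Trigger.ProcessingTime("5 minutes"))   // the batch interval
      .start()
      .awaitTermination()
  }
}
```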
It should be noted that the time slice, or batch interval, is the standard by which the streaming data is artificially quantized, and serves as the basis on which we split the stream; the data of one time slice corresponds to one RDD instance.
It should be noted that HDFS is the abbreviation of Hadoop Distributed File System, a distributed file system of Hadoop.
In an alternative embodiment, after the plurality of second target logs are obtained, exactly-once storage of the second target logs is required. It should be understood that before version 2.0, Spark Streaming could only achieve at-least-once delivery; the Spark framework itself can hardly achieve exactly-once, which requires the program to directly implement a reliable data source and a downstream that supports idempotent operations. Spark Structured Streaming, however, can achieve exactly-once simply.
It should be noted that exactly-once means each piece of data is processed only once; it is one of the difficulties of real-time computing, with the goal that each record is handled exactly once.
Specifically, the following approaches are included:
1) The plurality of second target logs are stored into a second preset directory in an exactly-once manner through a preset Source trait interface. For example, using the Kafka source: in the source code, Source is an abstract interface, trait Source (the preset Source trait interface), containing the functions that Structured Streaming necessarily needs to achieve end-to-end exactly-once processing. getOffset() of the Kafka source (KafkaSource) reads the offsets saved by the previous batch; a long-running consumer on the driver side fetches the latest offsets of each topic from the Kafka brokers; the getBatch method returns a DataFrame according to the offsets; and commit saves a file on HDFS through the checkpoint mechanism, recording Kafka's offset positions so that getOffset can obtain valid offset values after a failure. In this embodiment, getOffset() works as follows: each time, it reads the HDFS checkpoint file to obtain the start position of this read of Kafka, and after reading, it updates the end position of this read into the checkpoint file.
2) The plurality of second target logs are stored into the second preset directory in an exactly-once manner through a preset batch-processing function. For example, hdfsSink: its addBatch() method supports the function that Structured Streaming necessarily needs to achieve end-to-end exactly-once processing. The concrete implementation of hdfsSink's addBatch() keeps metadata under an HDFS directory recording the maximum batchId completed so far; when recovering from a failure, if a job's batchId is less than or equal to the maximum batchId in the metadata, the job is skipped, thereby making the data writes idempotent. In this embodiment, addBatch() includes the following behavior, as sketched below: if the submitted batchId is not greater than the batchId recorded in the HDFS metadata, the batch is considered already written by the task and is skipped directly; otherwise, the log data is written into the partition directory and the HDFS metadata batchId is incremented by 1.
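The following is a schematic rendering of that addBatch() contract. It is a sketch under stated assumptions: the org.apache.spark.sql.execution.streaming.Sink trait is Spark-internal API, and a single local metadata file stands in here for the HDFS metadata directory; this is not Spark's actual file sink.

```scala
import java.nio.file.{Files, Paths, StandardOpenOption}
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.execution.streaming.Sink

class IdempotentSink(dataPath: String, metaFile: String) extends Sink {

  // Read the maximum batchId committed so far (-1 if nothing committed yet).
  private def lastCommittedBatchId: Long = {
    val p = Paths.get(metaFile)
    if (Files.exists(p)) new String(Files.readAllBytes(p)).trim.toLong else -1L
  }

  override def addBatch(batchId: Long, data: DataFrame): Unit = {
    // A replayed batch after failure recovery: already written, skip it.
    if (batchId <= lastCommittedBatchId) return

    // Write the batch into the partition directories.
    data.write.mode("append").parquet(dataPath)

    // Record the batch as committed only after the data is safely written.
    Files.write(Paths.get(metaFile), batchId.toString.getBytes("UTF-8"),
      StandardOpenOption.CREATE, StandardOpenOption.TRUNCATE_EXISTING)
  }
}
```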
It should be noted that Spark Streaming is an extension of the Spark core application programming interface (API) that enables high-throughput, fault-tolerant processing of real-time streaming data. It supports obtaining data from multiple sources, including Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP sockets; after data is obtained from a source, high-level functions such as map, reduce, join, and window can be used for complex algorithmic processing. Finally, the processing results can be stored to file systems, databases, and live dashboards. On the basis of "One Stack to rule them all," Spark's other sub-frameworks, such as machine learning and graph computation, can also be used to process streaming data.
In an alternative embodiment, when it is determined that storing the plurality of second target logs to the second preset directory has failed, the to-be-processed logs can be recovered and processed again in the following ways:
1) Within a first preset number of days, the to-be-processed logs are re-obtained from a local cache. For example, logs are printed to the local file system, rolled daily, and kept compressed for 10 days; production machines have comprehensive operating-system alarms, so hard-disk capacity problems are detected in time and disk-write failures are avoided. Keeping 10 days guards against problems in downstream systems; the worst case is a problem not being discovered in time over a long holiday such as the National Day, after which recovery can be pushed again from the data source.
2) The to-be-processed logs are recovered from the local disk by the log collection system. For example, the scribe agent's heartbeat detection and log-delay detection can discover problems in time; the scribe agent records the metadata of the files it sends, most importantly each file's send position, which can be used for precise recovery from data failures, and it sends data over TCP.
It should be noted that Scribe is Facebook's open-source log collection system, which is used extensively inside Facebook. It can collect logs from various log sources and store them on a central storage system (which can be NFS, a distributed file system, etc.) for centralized statistical analysis. It provides a scalable, highly fault-tolerant solution for the "distributed collection, unified processing" of logs. Scribe's architecture is fairly simple, mainly comprising three parts: the scribe agent, scribe, and the storage system.
3) The to-be-processed logs are recovered from the replicas stored in the open-source component. For example, the replica count on the Kafka cluster is 3 and logs are persisted for 3 days; if Spark Structured Streaming fails, data can be recovered directly from Kafka.
4) The metadata files and offsets of the to-be-processed logs are read to recover them. For example, Spark's checkpoint mechanism saves state information such as Kafka offsets; when a task fails, the restarted Spark task reads the metadata and offset files under the checkpoint directory, and the KafkaConsumer can read data from the specified partitions at those offsets, neither repeating nor omitting data.
5) The to-be-processed logs are obtained from multiple replica files in storage, where the replica files are stored on one of the following: the local node, a node in the local rack, and a node in a different rack. For example, the HDFS block mechanism: the file replica count is 3, and HDFS's replica placement strategy is to place the first replica on the local node, the second replica on another node in the local rack, and the third replica on a node in a different rack. This reduces inter-rack write traffic and thereby improves write performance. Since the probability of rack failure is far smaller than that of node failure, this approach does not affect the guarantees of data reliability and availability, while genuinely reducing the aggregate network bandwidth of read operations.
6) insert overwrite: data extracted into Hive is written with the INSERT OVERWRITE operation, which, unlike INSERT INTO, achieves idempotence and can be run repeatedly (see the sketch after this list).
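A minimal illustration of that idempotence, assuming a Hive-enabled SparkSession named spark; the table and partition names are hypothetical.

```scala
// Rerunning this exact statement replaces the same partition instead of
// appending duplicate rows, so a failed-and-retried load stays correct.
spark.sql(
  """INSERT OVERWRITE TABLE hive_logs PARTITION (logtype = 'Click', dt = '2013-04-10')
    |SELECT logtime, payload FROM staging_logs WHERE dt = '2013-04-10'
    |""".stripMargin)
```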
In an alternative embodiment, in order to increase log processing efficiency, Kafka can be scaled horizontally, specifically including:
(1) Increasing the number of Kafka brokers and the partitions of topics. According to Kafka's default partitioning, abs(key.hashCode) % numPartitions, the data of a topic can then be distributed over more brokers and machines, increasing the data-processing capacity of the upstream and downstream.
(2) After Kafka's partitions are increased, Spark can set the same number of RDD partitions according to Kafka's consumer-group mechanism, increasing data throughput (the routing rule is sketched after this list).
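A one-line sketch of the routing rule quoted above; note that Kafka's actual DefaultPartitioner hashes with murmur2, but the modulo principle is the same.

```scala
// Mirrors the abs(key.hashCode) % numPartitions rule quoted above: more
// partitions spread the same keys over more brokers and machines.
def defaultPartition(key: String, numPartitions: Int): Int =
  math.abs(key.hashCode) % numPartitions

// defaultPartition("Click", 12) always routes "Click" records to the same
// one of the 12 partitions.
```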
In conclusion high performance kafka information middleware, spark on yarn's is excellent by the real-time collecting of log Different distributed nature: it is high fault-tolerant, it easily extends, extractly once is solved and aimed at dilatation in distributed system day, is held The problem of mistake is lacking, and can not easily carry out the ETL cleaning of big data log.
In an alternative embodiment, in order to increase spark to the efficiency of log processing, spark can be applied into Row optimization: specific as follows:
Optimization of the running parameters of a long-running Spark program:

  spark-submit --master yarn --deploy-mode cluster \
    --conf spark.yarn.maxAppAttempts=4 \
    --conf spark.yarn.am.attemptFailuresValidityInterval=1h \
    --conf spark.yarn.max.executor.failures={8*num_executors} \
    --conf spark.yarn.executor.failuresValidityInterval=1h \
    --conf spark.task.maxFailures=8
When the program runs in cluster mode, any error in the Spark driver stops the long-running job. Fortunately, spark.yarn.maxAppAttempts=4 can be configured as the maximum number of attempts to rerun the application. If the application runs for days or weeks without restarting or redeploying in a heavily used cluster, the 4 attempts could be exhausted within a few hours. To avoid this situation, the attempt counter should be reset every hour (spark.yarn.am.attemptFailuresValidityInterval=1h). Another important setting is the maximum number of executor failures before the application fails; by default it is max(2*num executors, 3), which is well suited to batch jobs but not to long-running ones. This attribute also has a corresponding validity period (spark.yarn.executor.failuresValidityInterval=1h). For long-running jobs, you may also consider raising the maximum number of task failures before giving up the job (spark.task.maxFailures=8). Through these configuration items, the Spark application can keep running reliably and long-term in a busy distributed environment.
In an alternative embodiment, in Spark, a first table is established according to the secondary-partition storage format, where the first table is used to store the summary information of the plurality of second target logs; first sub-tables are established under the first table, where the sub-tables are used to store the query information for querying the second target logs; and partitions are added to the sub-tables on a schedule, where each partition is used to store the query information of newly added logs.
For example, according to the secondary-partition storage format of the preceding ETL, a summary table is established for obtaining the summary information of the logs, and then sub-tables are established against the logtype directories for the queries and use of tasks. Partitions are added to the tables periodically by the scheduler. On this basis, we write a python script that recognizes newly added logs and log formats, establishes an automatic mapping from the JSON format to the Hive schema, and automatically creates the table-building statements, building the tables in the Hive warehouse. This automation reduces manual maintenance and quickly brings new data online (a sketch of the table layout follows).
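A sketch of the table layout this describes, assuming a Hive-enabled SparkSession named spark; all table, column, and path names are illustrative, and in practice the patent's python script would generate such statements from the JSON schema.

```scala
// Summary table over the secondary-partition layout written by the ETL job.
spark.sql(
  """CREATE TABLE IF NOT EXISTS warehouse_summary (logtime STRING, payload STRING)
    |PARTITIONED BY (logtype STRING, dt STRING)
    |STORED AS PARQUET
    |LOCATION '/warehouses'
    |""".stripMargin)

// The scheduler registers each newly written directory as a partition.
spark.sql(
  """ALTER TABLE warehouse_summary
    |ADD IF NOT EXISTS PARTITION (logtype = 'Click', dt = '2013-04-10')
    |LOCATION '/warehouses/Click/2013-04-10'
    |""".stripMargin)
```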
In an alternative embodiment, the small log files can be compressed and merged: each second target log among the plurality of second target logs is split to obtain multiple decomposed logs; the decomposed logs are parsed; the parsed decomposed logs within a preset time period are merged to obtain a merged log; and the merged log is compressed.
For example, in pursuit of a near-real-time effect, Spark Streaming splits each day's logs into many small pieces, which is unfriendly to the namenode's memory and the datanodes' disks. We use the scheduling system to merge and compress the previous week's small files as an optimization, in particular by setting the data volume and concurrency handled by MapReduce's map and reduce phases, setting dynamic partitioning, and setting snappy compression for intermediate results and output results. An INSERT OVERWRITE statement is then cleverly used so that queries are not affected during the merge-and-compress process (a sketch follows).
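A hedged sketch of the compaction pass under stated assumptions: the table names, the staging source, and the one-week window are illustrative, and spark is a Hive-enabled SparkSession.

```scala
// Enable dynamic partitioning and snappy-compressed output, then rewrite
// last week's partitions in one INSERT OVERWRITE so many small streaming
// files collapse into a few large compressed ones.
spark.sql("SET hive.exec.dynamic.partition = true")
spark.sql("SET hive.exec.dynamic.partition.mode = nonstrict")
spark.sql("SET parquet.compression = SNAPPY")

spark.sql(
  """INSERT OVERWRITE TABLE warehouse_summary PARTITION (logtype, dt)
    |SELECT logtime, payload, logtype, dt
    |FROM warehouse_summary_staging
    |WHERE dt >= date_sub(current_date(), 7)
    |""".stripMargin)
```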
In an alternative embodiment, log storage is monitored: the storage of the plurality of second target logs is monitored, and when storage fails, alarm information is issued. For example, Spark's event listener is overridden for monitoring; metrics are sent through statsd and stored in graphite, and alarm rules are then configured through cabot so that, once triggered, alerts are sent through the alarm service. In addition, the RESTful interface provided by YARN is polled in real time for the job state and the service is pulled up automatically, substantially improving job reliability (a listener sketch follows).
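A sketch of such listener-based monitoring, assuming a statsd daemon on localhost:8125; the metric names are illustrative, and the graphite storage and cabot alarm rules live outside this code.

```scala
import java.net.{DatagramPacket, DatagramSocket, InetAddress}
import org.apache.spark.scheduler.{JobSucceeded, SparkListener, SparkListenerJobEnd}

class JobMonitor extends SparkListener {
  // Minimal statsd counter over UDP (statsd's plain-text "name:value|c" format).
  private def statsd(metric: String): Unit = {
    val socket = new DatagramSocket()
    try {
      val bytes = metric.getBytes("UTF-8")
      socket.send(new DatagramPacket(bytes, bytes.length,
        InetAddress.getByName("localhost"), 8125))
    } finally socket.close()
  }

  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit =
    jobEnd.jobResult match {
      case JobSucceeded => statsd("log_etl.job.success:1|c")
      case _            => statsd("log_etl.job.failure:1|c") // alarm rule keys on this
    }
}

// registration: spark.sparkContext.addSparkListener(new JobMonitor)
```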
In conclusion realizing fault-tolerant, the dilatation of storage by the technology of spark on yarn, monitoring distributed is healthy and strong System schema, the effective stability that product is provided, real-time, the extensive reliable memory of log of data analysis.Simultaneously In view of the merging and compression of cold and hot data, in the performance for not influencing data analysis, the reasonable load for reducing hadoop.
Fig. 3 is a flowchart (two) of the log processing method provided according to an embodiment of the present invention. As shown in Fig. 3, the method includes the following steps:
Step S302: formatting a plurality of collected logs using a preset log format to obtain a plurality of to-be-processed logs;
Step S304: storing the to-be-processed logs into topic folders corresponding to each to-be-processed log;
Step S306: sending the to-be-processed logs in the topic folders to a distributed processing platform.
Through the above steps, the collected logs are formatted by the open-source component to obtain a plurality of to-be-processed logs, and the to-be-processed logs are sent to the distributed processing platform; the distributed processing platform performs data cleansing on the to-be-processed logs to obtain a plurality of first target logs, and partitions the first target logs according to a preset time interval to obtain a plurality of second target logs. Logs can thus be processed in real time and log processing efficiency is improved, thereby solving the technical problem in the related art that log processing efficiency is low and cannot meet the demands placed on log data.
It should be noted that the execution subject of the above may be the open-source component Kafka, but is not limited thereto.
It should be noted that the above open-source component Kafka is an open-source stream-processing platform developed by the Apache Software Foundation, written in Scala and Java. The goal of the project is to provide a unified, high-throughput, low-latency platform for handling real-time data. Its persistence layer is essentially a "massive-scale publish/subscribe message queue architected as a distributed transaction log," which makes it highly valuable as enterprise-grade infrastructure for processing streaming data. In addition, Kafka can connect to external systems (for data import/export) via Kafka Connect, and provides Kafka Streams, a Java stream-processing library. Kafka stores messages that come from an arbitrary number of processes called "producers." The data can be assigned to different "partitions" under different "topics." A topic is used to classify messages; every message entering Kafka is placed under a topic.
It should be noted that the format of the to-be-processed logs follows the Kafka specification to facilitate structured parsing. The preset log format may be: [logtime] [operation] JSON; for example: [2013-04-10 01:00:09] [Click] {"time":12344343,"server":"1001"}.
In this embodiment, an application can store logs locally through a logging module, or collect client logs through nginx access logs, and then use collection tools such as rsyslog, filebeat, or scribe agent to gather the logs and produce them into a Kafka topic.
It should be noted that Apache Spark is an open-source cluster-computing framework originally developed by AMPLab at the University of California, Berkeley. Unlike Hadoop's MapReduce, which writes intermediate data to disk after a job has run, Spark uses in-memory computing and can perform analytic operations in memory before the data is written to the hard disk.
It should be noted that Kafka stores messages that come from an arbitrary number of processes called "producers." The data can be assigned to different "partitions" under different "topics." Within a partition, the messages are indexed and stored together with their timestamps. Other processes called "consumers" can query messages from a partition. Kafka runs on a cluster composed of one or more servers, and partitions can be distributed across cluster nodes.
Kafka efficiently processes real-time streaming data and can be integrated with Storm, HBase, and Spark. Deployed as a cluster on multiple servers, Kafka handles its entire publish/subscribe message system through four APIs, namely the Producer API, the Consumer API, the Streams API, and the Connector API. It can deliver large-scale streaming messages with fault tolerance, replacing some conventional message systems such as JMS and AMQP.
The main terms of the Kafka architecture include Topic, Record, and Broker. A Topic consists of Records, Records hold the information, and Brokers are responsible for replicating messages. Kafka has four main APIs:
Producer API: allows an application to publish a stream of Records.
Consumer API: allows an application to subscribe to Topics and process the stream of Records.
Streams API: converts input streams into output streams and produces results.
Connector API: implements reusable producer and consumer APIs that can link Topics to existing applications.
A Topic is used to classify messages; every message entering Kafka is placed under a Topic.
A Broker is the host server that implements data storage.
Partition: the messages in each Topic can be divided into several Partitions to improve message-processing efficiency.
An embodiment of the present invention further provides a log processing device. Fig. 4 is a structural schematic diagram (one) of the log processing device provided according to an embodiment of the present invention. As shown in Fig. 4, the device includes:
a receiving module 42, configured to receive a plurality of to-be-processed logs sent by an open-source component, where the log format of the to-be-processed logs results from the open-source component converting the logs using a preset log format;
a first determining module 44, configured to perform data cleansing on the to-be-processed logs to obtain a plurality of first target logs;
a second determining module 46, configured to partition the first target logs according to a preset time interval to obtain a plurality of second target logs.
Through the above modules, the collected logs are formatted by the open-source component to obtain a plurality of to-be-processed logs, and the to-be-processed logs are sent to the distributed processing platform; the distributed processing platform performs data cleansing on the to-be-processed logs to obtain a plurality of first target logs, and partitions the first target logs according to the preset time interval to obtain a plurality of second target logs. Logs can thus be processed in real time and log processing efficiency is improved, thereby solving the technical problem in the related art that log processing efficiency is low and cannot meet the demands placed on log data.
It should be noted that the above open-source component Kafka is an open-source stream-processing platform developed by the Apache Software Foundation, written in Scala and Java. The goal of the project is to provide a unified, high-throughput, low-latency platform for handling real-time data. Its persistence layer is essentially a "massive-scale publish/subscribe message queue architected as a distributed transaction log," which makes it highly valuable as enterprise-grade infrastructure for processing streaming data. In addition, Kafka can connect to external systems (for data import/export) via Kafka Connect, and provides Kafka Streams, a Java stream-processing library. Kafka stores messages that come from an arbitrary number of processes called "producers." The data can be assigned to different "partitions" under different "topics." A topic is used to classify messages; every message entering Kafka is placed under a topic.
It should be noted that the format of the to-be-processed logs follows the Kafka specification to facilitate structured parsing. The preset log format may be: [logtime] [operation] JSON; for example: [2013-04-10 01:00:09] [Click] {"time":12344343,"server":"1001"}.
In this embodiment, an application can store logs locally through a logging module, or collect client logs through nginx access logs, and then use open-source collection agents such as rsyslog, filebeat, or scribe agent to gather the logs and produce them into a Kafka topic.
It should be noted that Apache Spark is an open-source cluster-computing framework originally developed by AMPLab at the University of California, Berkeley. Unlike Hadoop's MapReduce, which writes intermediate data to disk after a job has run, Spark uses in-memory computing and can perform analytic operations in memory before the data is written to the hard disk.
In an alternative embodiment, Spark performs data cleansing on the multiple to-be-processed logs as follows: the data cleansing of the to-be-processed logs is triggered by a preset action operator, and a preset transformation operator performs the cleansing on the triggered logs. For example, Spark's rich transformation operators (transform), such as map, flatMap, filter, and user-defined functions (UDF), implement the cleansing of the data, while action operators (action), such as collect and save, trigger the cleansing process.
In an alternative embodiment, the first target logs are partitioned according to the preset time interval as follows: the log type and log time of each first target log are determined, where the log time is the time at which each first target log was obtained; each first target log is partitioned and stored into a first preset directory based on its log type and log time, yielding the second target logs. For example, by setting a reasonable Spark Streaming batch interval (the preset time interval, e.g., 5 minutes), each batch is secondarily partitioned according to logtype (log type) and logtime (log time) and written under the HDFS directory /warehouses/{logtype}/{logtime} (the first preset directory) for storage. The delay of the data is roughly equal to the batch interval; with the 5-minute setting above, the data shown in Hive is likewise delayed by 5 minutes. Setting a reasonable batch interval thus achieves near-real-time collection of the data. In this embodiment, the Spark computing engine stores into the Hadoop storage system.
It should be noted that the time slice, or batch interval, is the standard by which the streaming data is artificially quantized, and serves as the basis on which we split the stream; the data of one time slice corresponds to one RDD instance.
It should be noted that HDFS is the abbreviation of Hadoop Distributed File System, a distributed file system of Hadoop.
In an alternative embodiment, after the plurality of second target logs are obtained, exactly-once storage of the second target logs is required. It should be understood that before version 2.0, Spark Streaming could only achieve at-least-once delivery; the Spark framework itself can hardly achieve exactly-once, which requires the program to directly implement a reliable data source and a downstream that supports idempotent operations. Spark Structured Streaming, however, can achieve exactly-once simply.
It should be noted that exactly-once means each piece of data is processed only once; it is one of the difficulties of real-time computing, with the goal that each record is handled exactly once.
Specifically, the following approaches are included:
1) The plurality of second target logs are stored into a second preset directory in an exactly-once manner through a preset Source trait interface. For example, using the Kafka source: in the source code, Source is an abstract interface, trait Source (the preset Source trait interface), containing the functions that Structured Streaming necessarily needs to achieve end-to-end exactly-once processing. getOffset() of the Kafka source (KafkaSource) reads the offsets saved by the previous batch; a long-running consumer on the driver side fetches the latest offsets of each topic from the Kafka brokers; the getBatch method returns a DataFrame according to the offsets; and commit saves a file on HDFS through the checkpoint mechanism, recording Kafka's offset positions so that getOffset can obtain valid offset values after a failure. In this embodiment, getOffset() works as follows: each time, it reads the HDFS checkpoint file to obtain the start position of this read of Kafka, and after reading, it updates the end position of this read into the checkpoint file.
2) The plurality of second target logs are stored into the second preset directory in an exactly-once manner through a preset batch-processing function. For example, hdfsSink: its addBatch() method supports the function that Structured Streaming necessarily needs to achieve end-to-end exactly-once processing. The concrete implementation of hdfsSink's addBatch() keeps metadata under an HDFS directory recording the maximum batchId completed so far; when recovering from a failure, if a job's batchId is less than or equal to the maximum batchId in the metadata, the job is skipped, thereby making the data writes idempotent. In this embodiment, addBatch() includes the following behavior: if the submitted batchId is not greater than the batchId recorded in the HDFS metadata, the batch is considered already written by the task and is skipped directly; otherwise, the log data is written into the partition directory and the HDFS metadata batchId is incremented by 1.
It should be noted that Spark Streaming is an extension of the Spark core application programming interface (API) that enables high-throughput, fault-tolerant processing of real-time streaming data. It supports obtaining data from multiple sources, including Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP sockets; after data is obtained from a source, high-level functions such as map, reduce, join, and window can be used for complex algorithmic processing. Finally, the processing results can be stored to file systems, databases, and live dashboards. On the basis of "One Stack to rule them all," Spark's other sub-frameworks, such as machine learning and graph computation, can also be used to process streaming data.
In an alternative embodiment, when it is determined that storing the plurality of second target logs to the second preset directory has failed, the to-be-processed logs can be recovered and processed again in the following ways:
1) Within a first preset number of days, the to-be-processed logs are re-obtained from a local cache. For example, logs are printed to the local file system, rolled daily, and kept compressed for 10 days; production machines have comprehensive operating-system alarms, so hard-disk capacity problems are detected in time and disk-write failures are avoided. Keeping 10 days guards against problems in downstream systems; the worst case is a problem not being discovered in time over a long holiday such as the National Day, after which recovery can be pushed again from the data source.
2) The to-be-processed logs are recovered from the local disk by the log collection system. For example, the scribe agent's heartbeat detection and log-delay detection can discover problems in time; the scribe agent records the metadata of the files it sends, most importantly each file's send position, which can be used for precise recovery from data failures, and it sends data over TCP.
It should be noted that Scribe is Facebook's open-source log collection system, which is used extensively inside Facebook. It can collect logs from various log sources and store them on a central storage system (which can be NFS, a distributed file system, etc.) for centralized statistical analysis. It provides a scalable, highly fault-tolerant solution for the "distributed collection, unified processing" of logs. Scribe's architecture is fairly simple, mainly comprising three parts: the scribe agent, scribe, and the storage system.
3) The to-be-processed logs are recovered from the replicas stored in the open-source component. For example, the replica count on the Kafka cluster is 3 and logs are persisted for 3 days; if Spark Structured Streaming fails, data can be recovered directly from Kafka.
4) The metadata files and offsets of the to-be-processed logs are read to recover them. For example, Spark's checkpoint mechanism saves state information such as Kafka offsets; when a task fails, the restarted Spark task reads the metadata and offset files under the checkpoint directory, and the KafkaConsumer can read data from the specified partitions at those offsets, neither repeating nor omitting data.
5) The to-be-processed logs are obtained from multiple replica files in storage, where the replica files are stored on one of the following: the local node, a node in the local rack, and a node in a different rack. For example, the HDFS block mechanism: the file replica count is 3, and HDFS's replica placement strategy is to place the first replica on the local node, the second replica on another node in the local rack, and the third replica on a node in a different rack. This reduces inter-rack write traffic and thereby improves write performance. Since the probability of rack failure is far smaller than that of node failure, this approach does not affect the guarantees of data reliability and availability, while genuinely reducing the aggregate network bandwidth of read operations.
6) insert overwrite: data extracted into Hive is written with the INSERT OVERWRITE operation, which, unlike INSERT INTO, achieves idempotence and can be run repeatedly.
In an alternative embodiment, in order to increase log processing efficiency, Kafka can be scaled horizontally, specifically including:
(1) Increasing the number of Kafka brokers and the partitions of topics. According to Kafka's default partitioning, abs(key.hashCode) % numPartitions, the data of a topic can then be distributed over more brokers and machines, increasing the data-processing capacity of the upstream and downstream.
(2) After Kafka's partitions are increased, Spark can set the same number of RDD partitions according to Kafka's consumer-group mechanism, increasing data throughput.
In summary, real-time collection of logs through the high-performance Kafka message middleware, combined with the excellent distributed properties of Spark on YARN (high fault tolerance, easy scaling, and exactly-once semantics), solves the problems that log storage in distributed systems is hard to expand and poorly fault-tolerant, and that ETL cleansing of big-data logs cannot be performed conveniently.
In an alternative embodiment, in order to increase Spark's log processing efficiency, the Spark application can be optimized, specifically as follows:
Optimization of the running parameters of a long-running Spark program:

  spark-submit --master yarn --deploy-mode cluster \
    --conf spark.yarn.maxAppAttempts=4 \
    --conf spark.yarn.am.attemptFailuresValidityInterval=1h \
    --conf spark.yarn.max.executor.failures={8*num_executors} \
    --conf spark.yarn.executor.failuresValidityInterval=1h \
    --conf spark.task.maxFailures=8
When the program runs in cluster mode, any error in the Spark driver stops the long-running job. Fortunately, spark.yarn.maxAppAttempts=4 can be configured as the maximum number of attempts to rerun the application. If the application runs for days or weeks without restarting or redeploying in a heavily used cluster, the 4 attempts could be exhausted within a few hours. To avoid this situation, the attempt counter should be reset every hour (spark.yarn.am.attemptFailuresValidityInterval=1h). Another important setting is the maximum number of executor failures before the application fails; by default it is max(2*num executors, 3), which is well suited to batch jobs but not to long-running ones. This attribute also has a corresponding validity period (spark.yarn.executor.failuresValidityInterval=1h). For long-running jobs, you may also consider raising the maximum number of task failures before giving up the job (spark.task.maxFailures=8). Through these configuration items, the Spark application can keep running reliably and long-term in a busy distributed environment.
In an alternative embodiment, in spark, the first table is established according to secondary partition storage format, wherein First table is used to store the summary information of multiple second target journalings;The first sub-table is established in the first table, wherein the One sub-table is for storing the query information for inquiring multiple second target journalings;Timing is that the first sub-table adds subregion, wherein Subregion is used to store the query information of the log of letter addition.
Such as: according to the secondary partition storage format of the ETL of front, establishes a summary table and believe for obtaining the summary of log Then breath establishes inquiry and use of each sublist for task to logtype catalogue.It is table addition by scheduling timing Subregion.Face on this basis, we write the newly-increased log and journal format of python script identification, establish json format and arrive The automatic mapping of hive schema, automatically creates and builds table statement, establishes table in the warehouse hive.It automates, reduces artificial in this way The Quick thread of maintenance and new data.
In an alternative embodiment, the small documents of log can be compressed and is merged, by multiple second targets Each second target journaling in log is split, and obtains multiple decomposition logs;Parse multiple decomposition logs;By preset time Multiple decomposition logs in section after parsing merge, and obtain merging log;Compression merges log.
Such as: the effect in order to pursue near real-time, spark streaming can be divided into daily log multiple small points Solution, this is used the memory of namenode and the disk of datanode is disagreeableness.We are merged and are pressed using scheduling system The small documents of contracting the last week, optimize.Particularly pass through the data of the processing of the map and reduce of setting mapreduce Dynamic partition, setting intermediate result and the snappy compression for exporting result is arranged in amount, concurrent number.Then it cleverly uses The sentence of insert overwrite is merging and will not influence inquiry in compression process.
In an alternative embodiment, the storage of the logs is monitored: the storage of the multiple second target logs is monitored, and an alarm is raised in the case of a storage failure. For example: we override spark's event listener for monitoring, send metrics through statsd and store them in graphite, then configure alarm rules with cabot; once a rule is triggered, an alarm is raised through the alarm service. In addition, the restful interface provided by yarn is polled in real time for the job state and the service is pulled up automatically, which greatly improves the reliability of the job.
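A minimal sketch of the polling side follows, assuming the requests and statsd python packages, YARN's standard application REST endpoint, and a hypothetical restart script; the host names and the application id are placeholders:

import subprocess
import time

import requests
import statsd  # pip install statsd

RM = "http://resourcemanager:8088"          # assumed ResourceManager address
APP_ID = "application_1550000000000_0001"   # placeholder application id
metrics = statsd.StatsClient("statsd-host", 8125, prefix="log_etl")

while True:
    app = requests.get(f"{RM}/ws/v1/cluster/apps/{APP_ID}").json()["app"]
    state = app["state"]                    # e.g. RUNNING, FINISHED, FAILED
    metrics.gauge("job.running", 1 if state == "RUNNING" else 0)
    if state in ("FAILED", "KILLED"):
        metrics.incr("job.failures")        # cabot alarms on this metric
        # hypothetical script that automatically pulls the service back up
        subprocess.run(["/opt/etl/restart_job.sh"], check=False)
        break
    time.sleep(60)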
In conclusion realizing fault-tolerant, the dilatation of storage by the technology of spark on yarn, monitoring distributed is healthy and strong System schema, the effective stability that product is provided, real-time, the extensive reliable memory of log of data analysis.Simultaneously In view of the merging and compression of cold and hot data, in the performance for not influencing data analysis, the reasonable load for reducing hadoop.
The embodiment of the invention also provides a log processing device. Fig. 5 is a structural schematic diagram (two) of the log processing device provided according to an embodiment of the present invention. As shown in fig. 5, the device includes:

a third determining module 52, configured to format the multiple collected logs using a default log format to obtain multiple logs to be processed;

a storage module 54, configured to store the multiple logs to be processed respectively into the topic folders corresponding to each log to be processed;

a sending module 56, configured to send the multiple logs to be processed in the topic folders to the distributed processing platform.
Through the above steps, the collected logs are formatted by the open source component to obtain multiple logs to be processed, and the logs to be processed are sent to the distributed processing platform; the distributed processing platform performs data cleansing on the multiple logs to be processed to obtain multiple first target logs, and performs partition processing on the multiple first target logs according to a preset time interval to obtain multiple second target logs. This makes real-time log processing possible and improves the efficiency of log processing, thereby solving the technical problem in the related art that the processing efficiency of logs is low and cannot meet the demands on log data.
It should be noted that the executing subject above can be the open source component kafka, but is not limited thereto.
It should be noted that the open source component Kafka above is an open source stream processing platform developed by the Apache Software Foundation, written in Scala and Java. The goal of the project is to provide a unified, high-throughput, low-latency platform for processing real-time data. Its data persistence layer is essentially a massive publish/subscribe "message queue" following a distributed transaction log architecture, which makes it very valuable as enterprise-level infrastructure for processing streaming data. In addition, Kafka can connect to external systems (for data input/output) through Kafka Connect, and provides Kafka Streams, a Java stream processing library. Kafka stores messages coming from any number of processes called "producers" (Producer). The data can thereby be assigned to different "partitions" (Partition) under different "Topics". A Topic is used to classify messages; every message entering Kafka can be placed under a Topic.
It should be noted that the format of the multiple logs to be processed follows the Kafka specification, which is convenient for structured parsing. The default log format can be: [logtime] [operation] JSON; for example: [2013-04-10 01:00:09] [Click] {"time":12344343,"server":"1001"}.
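A minimal sketch of parsing one line in this format; the regular expression is an illustrative assumption based on the example above:

import json
import re

# [logtime] [operation] JSON  ->  three capture groups
LOG_LINE = re.compile(
    r"^\[(?P<logtime>[^\]]+)\]\s*\[(?P<operation>[^\]]+)\]\s*(?P<body>\{.*\})$"
)

def parse_log_line(line):
    match = LOG_LINE.match(line)
    if match is None:
        return None  # malformed line, left for the data cleansing step
    return {
        "logtime": match.group("logtime"),
        "operation": match.group("operation"),
        **json.loads(match.group("body")),
    }

print(parse_log_line('[2013-04-10 01:00:09] [Click] {"time":12344343,"server":"1001"}'))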
In the present embodiment, the application programs in front of Kafka can store the client logs locally, either through a logging module or through the nginx access log, and then use agent collectors such as rsyslog, filebeat or scribe to collect the logs and produce them into a kafka topic.
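A minimal producer sketch with the kafka-python package; the broker address and topic name are placeholders:

from kafka import KafkaProducer  # pip install kafka-python

producer = KafkaProducer(
    bootstrap_servers=["kafka-broker:9092"],   # placeholder broker address
    value_serializer=lambda v: v.encode("utf-8"),
)

# one log line in the default format described above
line = '[2013-04-10 01:00:09] [Click] {"time":12344343,"server":"1001"}'
producer.send("click_log", line)               # illustrative topic name
producer.flush()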
It should be noted that Apache Spark is an open source cluster computing framework, originally developed by AMPLab at the University of California, Berkeley. In contrast to Hadoop's MapReduce, which stores intermediate data on disk after a job has run, Spark uses in-memory computing and can perform analytic operations in memory before the data has been written to disk.
It should be noted that Kafka stores the messages coming from any number of processes called "producers" (Producer), and the data can thereby be assigned to different "partitions" (Partition) under different "Topics". Within a partition, the messages are indexed and stored together with their timestamps. Other processes called "consumers" (Consumer) can query messages from the partitions. Kafka runs on a cluster composed of one or more servers, and partitions can be distributed across the cluster nodes.
Kafka efficiently processes real-time streaming data and can be integrated with Storm, HBase and Spark. Deployed as a cluster on multiple servers, Kafka handles all of its publish and subscribe messaging through four APIs, namely the Producer API, Consumer API, Streams API and Connector API. It can deliver large-scale streaming messages with fault-tolerance, replacing some conventional message systems such as JMS and AMQP.
The main terms of the Kafka architecture include Topic, Record and Broker. A Topic consists of Records, Records hold the individual pieces of information, and Brokers are responsible for replicating messages. Kafka has four main APIs:

Producer API: supports an application publishing streams of Records.

Consumer API: supports an application subscribing to Topics and processing streams of Records.

Streams API: converts input streams into output streams and produces results.

Connector API: implements reusable producer and consumer APIs that can link Topics to existing applications.

A Topic is used to classify messages; every message entering Kafka can be placed under a Topic.

A Broker is the host server that realizes the data storage.

Partition: the messages in each Topic can be divided into several Partitions to improve the efficiency of message processing.
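As a companion illustration of the Consumer API, a minimal kafka-python sketch; the broker, topic and group names are placeholders:

from kafka import KafkaConsumer  # pip install kafka-python

consumer = KafkaConsumer(
    "click_log",                               # illustrative topic name
    bootstrap_servers=["kafka-broker:9092"],   # placeholder broker address
    group_id="log-etl",                        # placeholder consumer group
    auto_offset_reset="earliest",
)

for message in consumer:
    # each record carries its partition, offset and timestamp, as described
    print(message.partition, message.offset, message.value.decode("utf-8"))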
The embodiment of the invention also provides a log processing system. Fig. 6 is a structural schematic diagram of the log processing system provided according to an embodiment of the present invention. As shown in fig. 6, the system includes:

a distributed processing platform spark, wherein the spark is arranged to perform, at runtime, the method of any one of claims 1 to 8;

an open source component kafka, connected to the distributed processing platform, wherein the kafka is arranged to perform, at runtime, the method of claim 9.

As shown in fig. 6, the system further includes a server; kafka collects the logs of the clients from the server, transmits the logs to spark for log processing, and transmits the processed logs to Hadoop for further storage.
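A minimal sketch of this pipeline with spark structured streaming follows; the broker address, topic name and HDFS paths are placeholders, and structured streaming is one possible realization of the spark streaming stage described here:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

# read the raw log lines from the kafka topic filled by the collectors
logs = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "kafka-broker:9092")   # placeholder
    .option("subscribe", "click_log")                         # illustrative topic
    .load()
    .selectExpr("CAST(value AS STRING) AS line")
)

# persist the stream to Hadoop for further storage, with checkpointing
query = (
    logs.writeStream.format("parquet")
    .option("path", "hdfs:///warehouse/logs/click")           # placeholder path
    .option("checkpointLocation", "hdfs:///checkpoints/click")
    .start()
)
query.awaitTermination()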
The embodiments of the present invention also provide a storage medium in which a computer program is stored, wherein the computer program is arranged to execute, at runtime, the steps in any one of the above method embodiments.

Optionally, in the present embodiment, the above storage medium can be set to store a computer program for executing each of the above steps.

Optionally, in the present embodiment, the above storage medium can include but is not limited to: a USB flash disk, a read-only memory (Read-Only Memory, ROM for short), a random access memory (Random Access Memory, RAM for short), a mobile hard disk, a magnetic disk, an optical disk, and other media that can store a computer program.

The embodiments of the present invention also provide an electronic device, including a memory and a processor; a computer program is stored in the memory, and the processor is arranged to run the computer program to execute the steps in any one of the above method embodiments.

Optionally, the above electronic device can also include a transmission device and an input-output device, wherein the transmission device is connected with the above processor, and the input-output device is connected with the above processor.

Optionally, in the present embodiment, the above processor can be set to execute each of the above steps through a computer program.

Optionally, for specific examples in the present embodiment, reference can be made to the examples described in the above embodiments and optional implementations; details are not described here again in the present embodiment.
The serial numbers of the above embodiments of the invention are only for description and do not represent the advantages or disadvantages of the embodiments.

In the above embodiments of the invention, the description of each embodiment has its own emphasis; for parts not described in detail in a certain embodiment, reference can be made to the related descriptions of other embodiments.

In the several embodiments provided in this application, it should be understood that the disclosed technical contents can be realized in other ways. The apparatus embodiments described above are merely exemplary; for example, the division of the units can be a division by logical function, and there can be other division manners in actual implementation: multiple units or components can be combined or integrated into another system, or some features can be ignored or not executed. In addition, the mutual coupling, direct coupling or communication connection shown or discussed can be an indirect coupling or communication connection through some interfaces, units or modules, and can be electrical or in other forms.

The units described as separate members may or may not be physically separated, and the components shown as units may or may not be physical units; they can be located in one place or distributed over multiple units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of this embodiment.

In addition, the functional units in the various embodiments of the present invention may be integrated into one processing unit, or each unit may physically exist alone, or two or more units may be integrated into one unit. The above integrated unit can be realized in the form of hardware, or in the form of a software functional unit.

If the integrated unit is realized in the form of a software functional unit and is sold or used as an independent product, it can be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part contributing to the existing technology, or all or part of the technical solution, can be embodied in the form of a software product. The computer software product is stored in a storage medium and includes some instructions for causing a computer device (which can be a personal computer, a server, a network device, or the like) to execute all or part of the steps of the methods of the various embodiments of the present invention. The aforementioned storage medium includes: a USB flash disk, a read-only memory (ROM, Read-Only Memory), a random access memory (RAM, Random Access Memory), a mobile hard disk, a magnetic disk, an optical disk, and other media that can store program code.

The above are only preferred embodiments of the present invention. It should be noted that, for those of ordinary skill in the art, various improvements and modifications can be made without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention.

Claims (14)

1. A log processing method, characterized by comprising:
receiving multiple logs to be processed sent by an open source component, wherein the log format of the multiple logs to be processed is the result of the open source component converting multiple logs using a default log format;
performing data cleansing on the multiple logs to be processed to obtain multiple first target logs;
performing partition processing on the multiple first target logs according to a preset time interval to obtain multiple second target logs.
2. The method according to claim 1, characterized in that performing data cleansing on the multiple logs to be processed to obtain the multiple first target logs comprises:
triggering the data cleansing of the multiple logs to be processed using a preset activity algorithm;
performing data cleansing on the triggered multiple logs to be processed using a preset transfer algorithm.
3. The method according to claim 1, characterized in that performing partition processing on the multiple first target logs according to the preset time interval to obtain the multiple second target logs comprises:
determining the log type and the logging time of each first target log among the multiple first target logs, wherein the logging time is the time at which each first target log was obtained;
storing each first target log, partitioned based on the log type and the logging time, into a first preset directory to obtain the multiple second target logs.
4. The method according to claim 1, characterized in that after performing partition processing on the multiple first target logs according to the preset time interval to obtain the multiple second target logs, the method further comprises one of the following:
storing the multiple second target logs into a second preset directory in an exactly-once manner through a preset feature source interface;
storing the multiple second target logs into a second preset directory in an exactly-once manner through a preset batch processing function.
5. The method according to claim 4, characterized in that after performing real-time partition processing on the multiple first target logs according to the preset time interval to obtain the multiple second target logs, and in the case where it is determined that storing the multiple second target logs into the second preset directory has failed, the method further comprises one of the following:
reacquiring the multiple logs to be processed from a local cache within a first preset number of days;
restoring the multiple logs to be processed from a local disk using a log collection system;
restoring the multiple logs to be processed from the copies stored in the open source component;
reading the metadata file (metadata) of the multiple logs to be processed and the compensation function offset to restore the multiple logs to be processed;
obtaining the multiple logs to be processed from multiple stored replica files, wherein the multiple replica files are stored in one of the following: the local node, a node in the local rack, a node in a different rack.
6. The method according to claim 1, characterized in that after performing partition processing on the multiple first target logs according to the preset time interval to obtain the multiple second target logs, the method further comprises:
establishing a first table according to a secondary partition storage format, wherein the first table is used to store the summary information of the multiple second target logs;
establishing a first sub-table under the first table, wherein the first sub-table is used to store the query information for querying the multiple second target logs;
adding partitions to the first sub-table on a schedule, wherein the partitions are used to store the query information of newly added logs.
7. The method according to claim 1, characterized in that after performing partition processing on the multiple first target logs according to the preset time interval to obtain multiple second target logs, the method further comprises:
splitting each second target log among the multiple second target logs to obtain multiple decomposed logs;
parsing the multiple decomposed logs;
merging the multiple decomposed logs parsed within a preset time period to obtain a merged log;
compressing the merged log.
8. The method according to claim 1, characterized in that after performing partition processing on the multiple first target logs according to the preset time interval to obtain the multiple second target logs, the method further comprises:
monitoring the storage of the multiple second target logs;
raising an alarm in the case where the storage fails.
9. A log processing method, characterized by comprising:
formatting multiple collected logs using a default log format to obtain multiple logs to be processed;
storing the multiple logs to be processed respectively into topic folders corresponding to each log to be processed;
sending the multiple logs to be processed in the topic folders to a distributed processing platform.
10. A log processing device, characterized by comprising:
a receiving module, configured to receive multiple logs to be processed sent by an open source component, wherein the log format of the multiple logs to be processed is the result of the open source component converting multiple logs using a default log format;
a first determining module, configured to perform data cleansing on the multiple logs to be processed to obtain multiple first target logs;
a second determining module, configured to perform partition processing on the multiple first target logs according to a preset time interval to obtain multiple second target logs.
11. A log processing device, characterized by comprising:
a third determining module, configured to format multiple collected logs using a default log format to obtain multiple logs to be processed;
a storage module, configured to store the multiple logs to be processed respectively into topic folders corresponding to each log to be processed;
a sending module, configured to send the multiple logs to be processed in the topic folders to a distributed processing platform.
12. A log processing system, characterized by comprising:
a distributed processing platform spark, wherein the spark is arranged to perform, at runtime, the method of any one of claims 1 to 8;
an open source component kafka, connected to the distributed processing platform, wherein the kafka is arranged to perform, at runtime, the method of claim 9.
13. A storage medium, characterized in that a computer program is stored in the storage medium, wherein the computer program is arranged to perform, at runtime, the method of any one of claims 1 to 8, or the method of claim 9.
14. An electronic device, including a memory and a processor, characterized in that a computer program is stored in the memory, and the processor is arranged to run the computer program to perform the method of any one of claims 1 to 8, or the method of claim 9.
CN201910138347.5A 2019-02-25 2019-02-25 Log processing method, log processing device, storage medium and electronic device Active CN109918349B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910138347.5A CN109918349B (en) 2019-02-25 2019-02-25 Log processing method, log processing device, storage medium and electronic device

Publications (2)

Publication Number Publication Date
CN109918349A (en) 2019-06-21
CN109918349B CN109918349B (en) 2021-05-25

Family

ID=66962220

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910138347.5A Active CN109918349B (en) 2019-02-25 2019-02-25 Log processing method, log processing device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN109918349B (en)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20140344622A1 (en) * 2013-05-20 2014-11-20 Vmware, Inc. Scalable Log Analytics
CN104298771A (en) * 2014-10-30 2015-01-21 南京信息工程大学 Massive web log data query and analysis method
CN105389352A (en) * 2015-10-30 2016-03-09 北京奇艺世纪科技有限公司 Log processing method and apparatus
CN105824744A (en) * 2016-03-21 2016-08-03 焦点科技股份有限公司 Real-time log collection and analysis method on basis of B2B (Business to Business) platform
CN108600300A (en) * 2018-03-06 2018-09-28 北京思空科技有限公司 Daily record data processing method and processing device

Cited By (16)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110908788A (en) * 2019-12-02 2020-03-24 北京锐安科技有限公司 Spark Streaming based data processing method and device, computer equipment and storage medium
CN110908788B (en) * 2019-12-02 2022-04-08 北京锐安科技有限公司 Spark Streaming based data processing method and device, computer equipment and storage medium
CN111581173A (en) * 2020-05-09 2020-08-25 深圳市卡数科技有限公司 Distributed storage method and device for log system, server and storage medium
CN111581173B (en) * 2020-05-09 2023-10-20 深圳市卡数科技有限公司 Method, device, server and storage medium for distributed storage of log system
WO2021238273A1 (en) * 2020-05-28 2021-12-02 苏州浪潮智能科技有限公司 Message fault tolerance method and system based on spark streaming computing framework
CN113760832A (en) * 2020-06-03 2021-12-07 富泰华工业(深圳)有限公司 File processing method, computer device and readable storage medium
CN111831617B (en) * 2020-07-16 2022-08-09 福建天晴数码有限公司 Method for guaranteeing uniqueness of log data based on distributed system
CN111831617A (en) * 2020-07-16 2020-10-27 福建天晴数码有限公司 Method for guaranteeing uniqueness of log data based on distributed system
WO2021189954A1 (en) * 2020-10-12 2021-09-30 平安科技(深圳)有限公司 Log data processing method and apparatus, computer device, and storage medium
CN112612677A (en) * 2020-12-28 2021-04-06 北京天融信网络安全技术有限公司 Log storage method and device, electronic equipment and readable storage medium
CN112506862A (en) * 2020-12-28 2021-03-16 浪潮云信息技术股份公司 Method for custom saving Kafka Offset
CN113190726A (en) * 2021-04-16 2021-07-30 珠海格力精密模具有限公司 Method for reading CAE (computer aided engineering) modular flow analysis data, electronic equipment and storage medium
CN113312353A (en) * 2021-06-10 2021-08-27 中国民航信息网络股份有限公司 Storage method and system for tracking journal
CN113806434A (en) * 2021-09-22 2021-12-17 平安科技(深圳)有限公司 Big data processing method, device, equipment and medium
CN113806434B (en) * 2021-09-22 2023-09-05 平安科技(深圳)有限公司 Big data processing method, device, equipment and medium
CN113778810A (en) * 2021-09-27 2021-12-10 杭州安恒信息技术股份有限公司 Log collection method, device and system

Also Published As

Publication number Publication date
CN109918349B (en) 2021-05-25

Legal Events

Code Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant