CN109918349B - Log processing method, log processing device, storage medium and electronic device - Google Patents

Log processing method, log processing device, storage medium and electronic device

Info

Publication number
CN109918349B
CN109918349B (application CN201910138347.5A)
Authority
CN
China
Prior art keywords
logs
log
target
processed
preset
Prior art date
Legal status
Active
Application number
CN201910138347.5A
Other languages
Chinese (zh)
Other versions
CN109918349A (en)
Inventor
刘晶晶
Current Assignee
Netease Hangzhou Network Co Ltd
Original Assignee
Netease Hangzhou Network Co Ltd
Priority date
Filing date
Publication date
Application filed by Netease Hangzhou Network Co Ltd filed Critical Netease Hangzhou Network Co Ltd
Priority to CN201910138347.5A priority Critical patent/CN109918349B/en
Publication of CN109918349A publication Critical patent/CN109918349A/en
Application granted granted Critical
Publication of CN109918349B publication Critical patent/CN109918349B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Abstract

The invention discloses a log processing method and device, a storage medium and an electronic device. The method comprises the following steps: receiving a plurality of logs to be processed sent by an open source component, wherein the plurality of logs to be processed are obtained by the open source component converting a plurality of collected logs into a preset log format; performing data cleaning on the plurality of logs to be processed to obtain a plurality of first target logs; and partitioning the plurality of first target logs according to a preset time interval to obtain a plurality of second target logs. The invention solves the technical problems in the related art that log processing efficiency is low and requirements on log data cannot be met.

Description

Log processing method, log processing device, storage medium and electronic device
Technical Field
The present invention relates to the field of computers, and in particular, to a log processing method, device, storage medium, and electronic device.
Background
With the arrival of the Internet+ era, the value of data has become increasingly prominent. Product data exhibits exponential growth and is largely unstructured. By using the Spark and Hadoop technologies of a distributed processing platform, a big data platform can be built as the core center of basic data storage and processing capacity, providing strong data processing capability and meeting data interaction requirements. Meanwhile, Spark Streaming can effectively meet an enterprise's real-time data requirements and support a real-time indicator system for enterprise development.
However, existing log storage methods have shortcomings: log processing is not sufficiently real-time, capacity expansion and fault tolerance in a distributed system are insufficient, and ETL (Extract-Transform-Load, describing the process of extracting, transforming, and loading data from a source to a destination) cleaning of big data logs cannot be performed conveniently.
In view of the above problems, no effective solution has been proposed.
Disclosure of Invention
Embodiments of the present invention provide a log processing method, device, storage medium, and electronic device, so as to at least solve the technical problems in the related art that the processing efficiency of logs is low and the needs of log data cannot be met.
According to an aspect of an embodiment of the present invention, there is provided a log processing method, including: receiving a plurality of logs to be processed sent by an open source component, wherein the plurality of logs to be processed are obtained by the open source component converting a plurality of logs into a preset log format; performing data cleaning on the plurality of logs to be processed to obtain a plurality of first target logs; and partitioning the plurality of first target logs according to a preset time interval to obtain a plurality of second target logs.
According to another aspect of the embodiments of the present invention, there is also provided a log processing method, including: carrying out format conversion on the obtained logs by using a preset log format to obtain a plurality of logs to be processed; respectively storing a plurality of logs to be processed into theme folders corresponding to the logs to be processed; and sending the plurality of to-be-processed logs in the theme folder to the distributed processing platform.
According to another aspect of the embodiments of the present invention, there is also provided a log processing apparatus, including: a receiving module, configured to receive a plurality of logs to be processed sent by an open source component, wherein the plurality of logs to be processed are obtained by the open source component converting a plurality of logs into a preset log format; a first determining module, configured to perform data cleaning on the plurality of logs to be processed to obtain a plurality of first target logs; and a second determining module, configured to partition the plurality of first target logs according to a preset time interval to obtain a plurality of second target logs.
According to another aspect of the embodiments of the present invention, there is also provided a log processing apparatus, including: the third determining module is used for performing format conversion on the obtained logs by using a preset log format to obtain a plurality of logs to be processed; the storage module is used for respectively storing the multiple logs to be processed into the theme folders corresponding to the logs to be processed; and the sending module is used for sending the plurality of to-be-processed logs in the theme folder to the distributed processing platform.
According to another aspect of the embodiments of the present invention, there is also provided a log processing system, including: a distributed processing platform spark, wherein the spark is configured to execute the above method when running; and an open source component kafka coupled to the distributed processing platform, wherein the kafka is configured to execute the above method when running.
according to a further embodiment of the present invention, there is also provided a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
According to yet another embodiment of the present invention, there is also provided an electronic device, including a memory in which a computer program is stored and a processor configured to execute the computer program to perform the steps in any of the above method embodiments.
In the embodiment of the invention, format conversion is carried out on the collected logs by utilizing the open source component to obtain a plurality of logs to be processed, and the logs to be processed are sent to the distributed processing platform; the distributed processing platform performs data cleaning on a plurality of logs to be processed to obtain a plurality of first target logs; and partitioning the plurality of first target logs according to a preset time interval to obtain a plurality of second target logs. The log processing method and the log processing device can realize real-time log processing, improve log processing efficiency, and further solve the technical problems that log processing efficiency is low and the requirement on log data cannot be met in the related technology.
Drawings
The accompanying drawings, which are included to provide a further understanding of the invention and are incorporated in and constitute a part of this application, illustrate embodiment(s) of the invention and together with the description serve to explain the invention without limiting the invention. In the drawings:
fig. 1 is a block diagram of a hardware structure of a mobile terminal of a log processing method according to an embodiment of the present invention;
FIG. 2 is a flowchart (I) of a log processing method according to an embodiment of the present invention;
FIG. 3 is a flowchart (II) of a log processing method according to an embodiment of the present invention;
FIG. 4 is a schematic structural diagram (I) of a log processing apparatus according to an embodiment of the present invention;
FIG. 5 is a schematic structural diagram (II) of a log processing apparatus according to an embodiment of the present invention;
fig. 6 is a schematic structural diagram of a log processing system according to an embodiment of the present invention.
Detailed Description
In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
In accordance with an embodiment of the present invention, there is provided a log processing method embodiment, it is noted that the steps illustrated in the flowchart of the figure may be performed in a computer system such as a set of computer executable instructions, and that while a logical order is illustrated in the flowchart, in some cases the steps illustrated or described may be performed in an order different than here.
The method provided by the embodiment of the invention can be executed in a mobile terminal, a computer terminal or a similar computing device. Taking the example of the present invention running on a mobile terminal, fig. 1 is a block diagram of a hardware structure of the mobile terminal of a log processing method according to an embodiment of the present invention. As shown in fig. 1, the mobile terminal 10 may include one or more (only one shown in fig. 1) processors 102 (the processor 102 may include, but is not limited to, a processing device such as a microprocessor MCU or a programmable logic device FPGA) and a memory 104 for storing data, and optionally may also include a transmission device 106 for communication functions and an input-output device 108. It will be understood by those skilled in the art that the structure shown in fig. 1 is only an illustration, and does not limit the structure of the mobile terminal. For example, the mobile terminal 10 may also include more or fewer components than shown in FIG. 1, or have a different configuration than shown in FIG. 1.
The memory 104 may be used to store a computer program, for example, a software program and a module of application software, such as a computer program corresponding to the log processing method in the embodiment of the present invention, and the processor 102 executes various functional applications and data processing by running the computer program stored in the memory 104, so as to implement the method described above. The memory 104 may include high speed random access memory, and may also include non-volatile memory, such as one or more magnetic storage devices, flash memory, or other non-volatile solid-state memory. In some instances, the memory 104 may further include memory located remotely from the processor 102, which may be connected to the mobile terminal 10 via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The transmission device 106 is used for receiving or transmitting data via a network. Specific examples of the network described above may include a wireless network provided by a communication provider of the mobile terminal 10. In one example, the transmission device 106 includes a Network adapter (NIC), which can be connected to other Network devices through a base station so as to communicate with the internet. In one example, the transmission device 106 may be a Radio Frequency (RF) module, which is used for communicating with the internet in a wireless manner.
Fig. 2 is a flowchart illustrating a log processing method according to an embodiment of the present invention (i), as shown in fig. 2, the method includes the following steps:
step S202, receiving a plurality of logs to be processed sent by the open source component, wherein the plurality of logs to be processed are obtained by the open source component converting a plurality of logs into a preset log format;
step S204, performing data cleaning on a plurality of logs to be processed to obtain a plurality of first target logs;
step S206, the multiple first target logs are subjected to partition processing according to a preset time interval, and multiple second target logs are obtained.
Through the above steps, format conversion is carried out on the collected logs by utilizing the open source component to obtain a plurality of logs to be processed, and the logs to be processed are sent to the distributed processing platform; the distributed processing platform performs data cleaning on the plurality of logs to be processed to obtain a plurality of first target logs; and partitions the plurality of first target logs according to a preset time interval to obtain a plurality of second target logs. The log processing method and the log processing device can realize real-time log processing, improve log processing efficiency, and further solve the technical problems in the related art that log processing efficiency is low and requirements on log data cannot be met.
It should be noted that, the execution subject in the foregoing may be a distributed processing platform spark, but is not limited thereto.
It should be noted that the open source component Kafka mentioned above is an open-source stream processing platform developed by the Apache Software Foundation and written in Scala and Java. The goal of the project is to provide a unified, high-throughput, low-latency platform for processing real-time data. Its persistence layer is essentially a "large-scale publish/subscribe message queue with a distributed transaction log architecture," which makes it very valuable as enterprise-level infrastructure for handling streaming data. Furthermore, Kafka can connect to external systems (for data input/output) through Kafka Connect and provides Kafka Streams, a Java stream-processing library. Kafka stores messages that come from any number of processes called "producers" (Producers). The data can thus be assigned to different "partitions" (Partition) and different "Topics". A Topic is used to classify messages, and each message that enters Kafka is placed under one Topic.
It should be noted that the format of the multiple logs to be processed follows the Kafka specification, which facilitates structured parsing. The preset log format may be: [logtime] [operation], JSON; for example: [2013-04-10 1:00:09] [Click], {"time":12344343, "server":"1001"}.
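Purely for illustration, the following minimal Python sketch shows how one line in the preset [logtime] [operation], JSON format can be parsed into structured fields; the regular expression and the field names are assumptions for this sketch, not part of the claimed method.

import json
import re

# Assumed layout of the preset format: "[logtime] [operation], JSON-payload"
LOG_PATTERN = re.compile(r"^\[(?P<logtime>[^\]]+)\]\s*\[(?P<operation>[^\]]+)\],\s*(?P<payload>\{.*\})$")

def parse_preset_log(line: str) -> dict:
    """Split one raw log line into structured fields (logtime, operation, payload)."""
    match = LOG_PATTERN.match(line.strip())
    if match is None:
        raise ValueError(f"line does not match the preset log format: {line!r}")
    return {
        "logtime": match.group("logtime"),
        "operation": match.group("operation"),
        "payload": json.loads(match.group("payload")),
    }

if __name__ == "__main__":
    sample = '[2013-04-10 1:00:09] [Click], {"time":12344343, "server":"1001"}'
    print(parse_preset_log(sample))
    # {'logtime': '2013-04-10 1:00:09', 'operation': 'Click', 'payload': {...}}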
In this embodiment, an application may collect the client's logs locally through a logging module or through the nginx access log, then collect the logs with open source components such as rsyslog, filebeat, or scribe, and produce them to a topic of Kafka.
It should be noted that Apache Spark is an open-source cluster computing framework originally developed by AMPLab at the University of California, Berkeley. Unlike Hadoop MapReduce, which writes intermediate data to disk after each operation finishes, Spark uses in-memory computation and can analyze and operate on data in memory before it is written to disk.
In an alternative embodiment, Spark performs data cleaning on the plurality of logs to be processed in the following manner: triggering the data cleaning of the plurality of logs to be processed by using a preset action operator; and performing data cleaning on the plurality of logs to be processed by using preset transformation operators. For example, Spark's rich transformation operators (transform), such as map, flatMap, filter, and User-Defined Functions (UDF for short), are used to clean the data, and an action operator, such as collect or save, triggers the data cleaning process.
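As an illustration of the cleaning step above, a minimal PySpark sketch follows; the input columns (logtime, operation, payload) and the specific cleaning rules (filtering malformed timestamps, normalizing the operation name, de-duplication) are assumptions, not the patent's exact rules.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import StringType

spark = SparkSession.builder.appName("log-cleaning-sketch").getOrCreate()

# Illustrative input: raw lines already split into (logtime, operation, payload)
raw = spark.createDataFrame(
    [("2013-04-10 1:00:09", "Click", '{"time":12344343, "server":"1001"}'),
     ("bad-time", "Click", "not-json")],
    ["logtime", "operation", "payload"],
)

# A user-defined function (UDF) as one of the transformation operators
normalize_op = F.udf(lambda op: (op or "").strip().lower(), StringType())

cleaned = (
    raw
    .filter(F.col("logtime").rlike(r"^\d{4}-\d{2}-\d{2} \d{1,2}:\d{2}:\d{2}$"))  # drop malformed rows
    .withColumn("operation", normalize_op(F.col("operation")))                    # map-like column transform
    .dropDuplicates(["logtime", "operation", "payload"])                           # de-duplicate
)

# An action (collect/save) triggers the lazy transformations above
for row in cleaned.collect():
    print(row)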
In an alternative embodiment, the plurality of first target logs are partitioned at the preset time interval in the following manner: determining the log type and the log time of each first target log in the plurality of first target logs, wherein the log time is the time at which each first target log was collected; and storing each first target log, partitioned by the log type and the log time, into a first preset directory to obtain the plurality of second target logs. For example, by setting a reasonable batch interval for Spark Streaming (e.g., 5 minutes), batches are written with a secondary partition by logtype and logtime into the HDFS directory /warehouse/{logtype}/{logtime} (the first preset directory). The delay of the data is roughly equal to the batch interval; if it was previously set to 5 minutes, the data also appears in Hive with a 5-minute delay. By setting a reasonable batch interval, near-real-time data collection can be achieved. In this embodiment, the output of the Spark computation engine is stored in the Hadoop storage system.
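A minimal Structured Streaming sketch of the secondary-partition write described above; the Kafka broker address, the topic name log_topic, the Parquet output format, and the /warehouse path are assumptions for illustration.

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Requires the spark-sql-kafka connector package on the classpath
spark = SparkSession.builder.appName("partitioned-log-sink-sketch").getOrCreate()

# Assumed Kafka source; broker and topic names are placeholders
stream = (
    spark.readStream.format("kafka")
    .option("kafka.bootstrap.servers", "broker1:9092")
    .option("subscribe", "log_topic")
    .load()
)

# Derive the two partition columns: logtype from the topic, logtime from the record time
logs = (
    stream.selectExpr("CAST(value AS STRING) AS raw", "topic AS logtype", "timestamp")
    .withColumn("logtime", F.date_format(F.col("timestamp"), "yyyyMMddHHmm"))
)

# Write each ~5-minute micro-batch into /warehouse/{logtype}/{logtime} on HDFS
query = (
    logs.writeStream
    .format("parquet")
    .option("path", "hdfs:///warehouse")                 # first preset directory (assumed)
    .option("checkpointLocation", "hdfs:///checkpoints/logs")
    .partitionBy("logtype", "logtime")
    .trigger(processingTime="5 minutes")                 # batch interval
    .start()
)
query.awaitTermination()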
Note that a time slice, or batch interval, is the artificial criterion for quantizing streaming data: the time slice is the basis on which the stream is split, and the data of one time slice corresponds to one RDD instance.
It should be noted that HDFS is the abbreviation of Hadoop Distributed File System, that is, the distributed file system of Hadoop.
In an alternative embodiment, after the plurality of second target logs are obtained, they need to be stored exactly once. Before version 2.0, Spark Streaming could only guarantee at-least-once processing; the Spark framework alone could hardly achieve exactly-once, and the program had to provide a reliable, replayable data source and a downstream sink that supports idempotent writes. Spark Structured Streaming, however, can implement exactly-once processing straightforwardly.
It should be noted that exactly-once is one of the difficulties of real-time computation: each piece of data is processed only once, achieving the goal that each record takes effect exactly once.
The method specifically comprises the following steps:
1) storing the plurality of second target logs into a second preset directory exactly once by using a preset source interface; for example, by using the Kafka Source. Specifically, in the source code, Source is an abstract interface (trait Source, namely the preset source interface) containing the functions Structured Streaming needs in order to achieve end-to-end exactly-once processing: a long-running consumer at the driver end reads from the Kafka brokers the offsets stored for the previous batch, obtains the latest offsets of each topic, returns a DataFrame for that offset range through the getBatch method, stores a file on HDFS through the checkpoint mechanism, and records the Kafka offset positions so that valid offset values can be obtained even when getOffset fails. In the present embodiment, getOffset() has the following meaning: each time, the checkpoint file on HDFS is read to obtain the starting position of the current Kafka read, and after the read is finished, the end position of the current read is updated into the checkpoint file.
2) storing the plurality of second target logs into the second preset directory exactly once by using a preset batch-processing function. For example, HdfsSink: its addBatch() method alone is enough to support the functionality Structured Streaming needs to implement end-to-end exactly-once processing. A specific implementation of HdfsSink.addBatch() keeps a metadata file under the HDFS directory that records the currently completed maximum batchId; when recovering from a failure, if the batchId of a job is less than or equal to the maximum batchId in the metadata, the job is skipped, thereby enabling idempotent writing of the data. In the present embodiment, addBatch() performs the following: if the batchId submitted this time is less than or equal to the batchId recorded in the HDFS metadata, the batch is considered already written and is skipped directly; otherwise, the log data is written to the partition directory and the batchId in the HDFS metadata is incremented by 1.
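The same idempotent idea can be sketched with Structured Streaming's public foreachBatch hook instead of the internal HdfsSink; the marker-file location, its plain-text format, and the assumption that each batch already carries logtype/logtime columns are illustrative choices for this sketch.

import os
from pyspark.sql import DataFrame

# Assumed marker file; a real deployment would keep this on HDFS next to the data
MAX_BATCH_ID_FILE = "/tmp/log_sink_max_batch_id"

def read_max_batch_id() -> int:
    if not os.path.exists(MAX_BATCH_ID_FILE):
        return -1
    with open(MAX_BATCH_ID_FILE) as f:
        return int(f.read().strip())

def write_max_batch_id(batch_id: int) -> None:
    with open(MAX_BATCH_ID_FILE, "w") as f:
        f.write(str(batch_id))

def idempotent_add_batch(batch_df: DataFrame, batch_id: int) -> None:
    """foreachBatch callback: skip already-committed batches, otherwise write and advance the marker."""
    if batch_id <= read_max_batch_id():
        return  # this batch was already written once: skip it (idempotent replay)
    (batch_df.write.mode("append")
        .partitionBy("logtype", "logtime")     # assumes these columns exist in the batch
        .parquet("hdfs:///warehouse"))
    write_max_batch_id(batch_id)

# Usage with the streaming DataFrame `logs` from the earlier sketch:
# logs.writeStream.foreachBatch(idempotent_add_batch).trigger(processingTime="5 minutes").start()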
It should be noted that Spark Streaming is an extension of the Spark core Application Programming Interface (API) that can process real-time streaming data with high throughput and a fault-tolerance mechanism. It supports acquiring data from a variety of data sources, including Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP sockets; after data is acquired from a data source, complex algorithmic processing can be performed using high-level functions such as map, reduce, join, and window. Finally, the processing results can be stored in file systems, databases, and live dashboards. On the basis of "One Stack to rule them all", Spark's other sub-frameworks, such as machine learning and graph computation, can also be used to process the streaming data.
In an optional embodiment, in a case that it is determined that the plurality of second target logs fail to be stored in the second preset directory, the plurality of logs to be processed may be recovered in the following manner to process the logs again:
1) within a first preset number of days, the plurality of logs to be processed are obtained again from the local cache; for example, the logs are printed to the local file system and retained for 10 days with daily rolling and compression. First, the online machines have a complete operation-and-maintenance alarm system, so that hard-disk capacity problems are detected in time and disk-write failures are avoided. Keeping 10 days of logs guards against problems in downstream systems; in the worst case, for instance when a problem is not noticed in time during a long holiday such as the National Day break, the data can still be pushed and recovered again from the data source.
2) recovering the plurality of logs to be processed from the local disk by using a log collection system; for example, scribe's advantage is that problems can be found in time through heartbeat detection and log-delay detection. Scribe records the metadata of the transmitted file, most importantly the transmission position within the file, which can be used to recover data accurately after a failure, and it transmits data over the TCP protocol.
It should be noted that scribe is an open-source log collection system from Facebook, where it is already used by a large number of applications. It can collect logs from various log sources and store them in a central storage system (which may be NFS, a distributed file system, etc.) to facilitate centralized statistical analysis and processing. It provides an extensible, highly fault-tolerant scheme for the "distributed collection and unified processing" of logs. The structure of scribe is simple, consisting mainly of three parts: the scribe agent, the scribe server, and the storage system.
3) recovering the plurality of logs to be processed from the replicas stored in the open source component; for example, the number of replicas on the Kafka cluster is 3, and the logs are persisted for 3 days. If Spark Structured Streaming fails, data can be recovered directly from Kafka.
4) reading the metadata file and offset file of the plurality of logs to be processed to recover the plurality of logs to be processed; for example, the Spark checkpoint mechanism stores state information such as the Kafka offsets. When a task fails, the Spark task starts by reading the metadata and offset files under the checkpoint folder, and the KafkaConsumer can read data from a specified partition offset, so that no duplication or omission occurs.
5) acquiring the plurality of logs to be processed from a plurality of stored replica files, wherein the plurality of replica files are stored in one of the following: the local node, a node on the local rack, or a node on a different rack. For example, the block mechanism of HDFS: the number of replicas of a file is 3, and the HDFS replica placement strategy is that the first replica is placed on the local node, the second replica on another node on the local rack, and the third replica on a node on a different rack. This reduces write traffic between racks and thereby improves write performance. The probability of a rack failure is much smaller than that of a node failure, so this approach does not affect the data reliability and availability guarantees, while it does reduce the aggregate network bandwidth used by read operations.
6) insert overwrite: Hive data extraction uses the overwrite operation insert overwrite, which, compared with insert into, is idempotent, so the operation can be repeated safely.
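A hedged Spark SQL sketch of the idempotent insert overwrite extraction mentioned in item 6); the table names, columns, and partition values are assumptions.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("insert-overwrite-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Re-running this statement for the same partition replaces the previous result
# instead of appending to it, which is what makes the extraction idempotent.
spark.sql("""
    INSERT OVERWRITE TABLE ods_logs PARTITION (logtype = 'click', logtime = '201904100100')
    SELECT operation, payload
    FROM   staging_logs
    WHERE  logtype = 'click' AND logtime = '201904100100'
""")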
In an optional embodiment, in order to increase the efficiency of log processing, the kafka may be expanded laterally, specifically including:
(1) increasing the number of Kafka brokers and the number of partitions of the topics, and distributing topic data to more brokers and machines according to Kafka's default partitioner (which maps a record to the hash of its key modulo numPartitions), thereby increasing the upstream and downstream data processing capacity.
(2) after increasing the partitions of Kafka, Spark can set the same number of RDD partitions according to Kafka's consumer group mechanism to increase the message processing capacity for the data.
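For illustration, a sketch of the lateral expansion in step (1) using the third-party kafka-python admin client; the broker address, topic name, and target partition count are assumptions.

from kafka.admin import KafkaAdminClient, NewPartitions

# Assumed broker address and topic name
admin = KafkaAdminClient(bootstrap_servers="broker1:9092")

# Grow the topic to 12 partitions; more partitions let more brokers
# and more downstream Spark tasks consume in parallel.
admin.create_partitions({"log_topic": NewPartitions(total_count=12)})
admin.close()

With the Structured Streaming Kafka source, each Kafka partition then corresponds to one Spark input partition, so the processing parallelism of step (2) grows with the partition count.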
In summary, through real-time log collection, the high-performance Kafka message middleware, and the excellent distributed properties of Spark on YARN (high fault tolerance, easy expansion, exactly-once processing), the problems that logs in a distributed system are somewhat deficient in capacity expansion and fault tolerance, and that ETL cleaning of big data logs cannot be performed conveniently, are solved.
In an alternative embodiment, in order to improve the efficiency of Spark in processing the logs, the Spark application may be optimized. The specific steps are as follows:
optimizing the operation parameters of a long-running Spark program: spark-submit --master yarn --deploy-mode cluster \
--conf spark.yarn.maxAppAttempts=4\
--conf spark.yarn.am.attemptFailuresValidityInterval=1h\
--conf spark.yarn.max.executor.failures={8*num_executors}\
--conf spark.yarn.executor.failuresValidityInterval=1h\
--conf spark.task.maxFailures=8\
When running the program in cluster mode, any error in the Spark driver stops our long-running job. Fortunately, spark.yarn.maxAppAttempts configures the maximum number of attempts before the application is considered failed. If an application runs for days or weeks without restart or redeployment on a heavily used cluster, 4 attempts can be exhausted within a few hours. To avoid this, the attempt counter should be reset every hour (spark.yarn.am.attemptFailuresValidityInterval=1h). Another important setting is the maximum number of executor failures before the application fails; the default, max(2 x num executors, 3), is well suited to batch jobs but not to jobs that run for long periods of time. This attribute also has a corresponding validity interval (spark.yarn.executor.failuresValidityInterval). For long-running jobs, you can also consider increasing the maximum number of task failures before abandoning the job (spark.task.maxFailures). Through these optimized configuration items, the Spark application can keep running reliably for a long time in a complex distributed environment.
In an optional embodiment, in spark, a first table is established according to a secondary partition storage format, wherein the first table is used for storing summary information of a plurality of second target logs; establishing a first sub-table in the first table, wherein the first sub-table is used for storing query information for querying a plurality of second target logs; and adding a partition for the first sub-table at regular time, wherein the partition is used for storing the query information of the added log.
For example: according to the secondary-partition storage format of the ETL, a general table is established for obtaining summary information of the logs, and sub-tables are then established for each logtype directory for query tasks to use. Partitions are added to the tables periodically by the scheduling system. On this basis, a Python script is written to recognize newly added logs and their log format, build an automatic mapping from the JSON format to a Hive schema, automatically generate the table-creation statements, and create the tables in the Hive warehouse. This achieves automation and reduces manual maintenance, allowing new data to come online quickly.
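A hedged sketch of the table creation and timed partition addition described above, issued through Spark SQL; the table name ods_click_log, its columns, and the Parquet storage format are assumptions.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("hive-table-sketch")
         .enableHiveSupport()
         .getOrCreate())

# Sub-table for one log type, partitioned the same way the ETL writes the files:
# /warehouse/{logtype}/{logtime}  ->  PARTITIONED BY (logtype, logtime)
spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS ods_click_log (
        operation STRING,
        payload   STRING
    )
    PARTITIONED BY (logtype STRING, logtime STRING)
    STORED AS PARQUET
    LOCATION 'hdfs:///warehouse'
""")

# The scheduling system periodically registers the newly written partitions
spark.sql("""
    ALTER TABLE ods_click_log ADD IF NOT EXISTS
    PARTITION (logtype = 'click', logtime = '201904100100')
""")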
In an optional embodiment, the small files of the logs may be compressed and merged, and each of the second target logs in the plurality of second target logs may be divided to obtain a plurality of decomposed logs; analyzing a plurality of decomposition logs; merging a plurality of decomposed logs analyzed within a preset time period to obtain merged logs; and compressing the merged log.
For example: in pursuit of a near-real-time effect, Spark Streaming splits each day's logs into many small files, which is unfriendly to the NameNode's memory usage and the DataNodes' disks. A scheduling system is used to merge and compress the small files of the previous week as an optimization. Specifically, the amount of data processed and the concurrency of the MapReduce map and reduce stages are configured, and snappy compression of the intermediate results and the output results is enabled. By using insert overwrite statements cleverly, queries are not affected while the files are being merged and compressed.
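A hedged sketch of the small-file merge described above; the table and partition names, the coalesce factor, and the staging-table approach are assumptions, chosen so that the final insert overwrite leaves the partition queryable throughout.

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("small-file-merge-sketch")
         .enableHiveSupport()
         .config("spark.sql.parquet.compression.codec", "snappy")  # compress the merged output
         .getOrCreate())

# Stage one old partition of last week with far fewer, larger files
(spark.table("ods_click_log")
    .where("logtype = 'click' AND logtime = '201904030100'")
    .coalesce(4)
    .write.mode("overwrite")
    .saveAsTable("staging_click_log_merge"))

# Swap the merged files back in; insert overwrite replaces the many small files
# with the compacted copy in a single atomic rewrite of the partition.
spark.sql("""
    INSERT OVERWRITE TABLE ods_click_log PARTITION (logtype = 'click', logtime = '201904030100')
    SELECT operation, payload FROM staging_click_log_merge
""")

spark.sql("DROP TABLE IF EXISTS staging_click_log_merge")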
In an alternative embodiment, the storage of the logs is monitored: the storage of the plurality of second target logs is monitored, and alarm information is sent when storage fails. For example, Spark's event listener is overridden to monitor events, metrics are sent through statsd and stored in Graphite, and alarm rules are configured through Cabot; once a rule is triggered, an alarm is raised through the alarm service. In addition, the status of the job is polled in real time through the RESTful interface provided by YARN and the service is pulled up again automatically, which greatly improves the reliability of the job.
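A hedged sketch of the YARN status polling and automatic pull-up described above, using the ResourceManager REST interface (/ws/v1/cluster/apps); the ResourceManager address, the application name, and the resubmission script are assumptions, and a real deployment would also page through the alarm service.

import subprocess
import time

import requests

RM_APPS_URL = "http://resourcemanager:8088/ws/v1/cluster/apps"   # assumed ResourceManager address
APP_NAME = "log-etl-structured-streaming"                        # assumed Spark application name
RESTART_CMD = ["bash", "submit_log_etl.sh"]                      # assumed resubmission script

def streaming_job_is_running() -> bool:
    """Ask YARN whether an application with our name is currently RUNNING."""
    resp = requests.get(RM_APPS_URL, params={"states": "RUNNING"}, timeout=10)
    resp.raise_for_status()
    apps = (resp.json().get("apps") or {}).get("app") or []
    return any(app.get("name") == APP_NAME for app in apps)

if __name__ == "__main__":
    while True:
        if not streaming_job_is_running():
            # The long-running job disappeared: pull the service up again and alert.
            subprocess.Popen(RESTART_CMD)
            print("job not running, resubmitted")
        time.sleep(60)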
In conclusion, a robust distributed system scheme with storage fault tolerance, capacity expansion, and monitoring is realized with the Spark on YARN technology, effectively improving product stability, the real-time performance of data analysis, and the large-scale reliable storage of logs. Meanwhile, the merging and compression of cold and hot data are taken into account, reasonably reducing the Hadoop load without affecting the performance of data analysis.
Fig. 3 is a schematic flow chart (ii) of a log processing method according to an embodiment of the present invention, as shown in fig. 3, the method includes the following steps:
step S302: carrying out format conversion on the obtained logs by using a preset log format to obtain a plurality of logs to be processed;
step S304: respectively storing a plurality of logs to be processed into theme folders corresponding to the logs to be processed;
step S306: and sending the plurality of to-be-processed logs in the theme folder to the distributed processing platform.
Through the above steps, format conversion is carried out on the collected logs by utilizing the open source component to obtain a plurality of logs to be processed, and the logs to be processed are sent to the distributed processing platform; the distributed processing platform performs data cleaning on the plurality of logs to be processed to obtain a plurality of first target logs; and partitions the plurality of first target logs according to a preset time interval to obtain a plurality of second target logs. The log processing method and the log processing device can realize real-time log processing, improve log processing efficiency, and further solve the technical problems in the related art that log processing efficiency is low and requirements on log data cannot be met.
It should be noted that the execution subject in the above may be the open source component kafka, but is not limited thereto.
It should be noted that the open source component Kafka mentioned above is an open-source stream processing platform developed by the Apache Software Foundation and written in Scala and Java. The goal of the project is to provide a unified, high-throughput, low-latency platform for processing real-time data. Its persistence layer is essentially a "large-scale publish/subscribe message queue with a distributed transaction log architecture," which makes it very valuable as enterprise-level infrastructure for handling streaming data. Furthermore, Kafka can connect to external systems (for data input/output) through Kafka Connect and provides Kafka Streams, a Java stream-processing library. Kafka stores messages that come from any number of processes called "producers" (Producers). The data can thus be assigned to different "partitions" (Partition) and different "Topics". A Topic is used to classify messages, and each message that enters Kafka is placed under one Topic.
It should be noted that the format of the multiple logs to be processed follows the Kafka specification, which facilitates structured parsing. The preset log format may be: [logtime] [operation], JSON; for example: [2013-04-10 1:00:09] [Click], {"time":12344343, "server":"1001"}.
In this embodiment, an application may collect the client's logs locally through a logging module or through the nginx access log, then collect the logs with tools such as rsyslog, filebeat, or scribe, and produce them to a topic of Kafka.
It should be noted that Apache Spark is an open-source cluster computing framework originally developed by AMPLab at the University of California, Berkeley. Unlike Hadoop MapReduce, which writes intermediate data to disk after each operation finishes, Spark uses in-memory computation and can analyze and operate on data in memory before it is written to disk.
It should be noted that Kafka stores messages from any number of processes called "producers" (producers). The data can thus be assigned to different "partitions" (Partition), different "Topic". Within a partition, the messages are indexed and stored along with a timestamp. Other processes, referred to as "consumers" (consumers), may query messages from the partitions. Kafka runs on a cluster of one or more servers and partitions can be distributed across the cluster nodes.
Kafka efficiently processes real-time streaming data and can achieve integration with Storm, HBase and Spark. Deployed as a cluster on multiple servers, Kafka's system for processing all of its publish and subscribe messages uses four APIs, namely, a producer API, a consumer API, a Stream API, and a Connector API. The system can transfer large-scale streaming messages, has a fault-tolerant function, and replaces some traditional message systems such as JMS, AMQP and the like.
The main terms of the Kafka architecture include Topic, Record, and Broker. Topic consists of Record, which holds different information, while Broker is responsible for copying messages. Kafka has four major APIs.
Producer API: supporting applications to publish Record streams.
Consumer API: supports applications subscribing to Topics and processing Record streams.
Stream API: converts input streams into output streams and produces results.
Connector API: builds reusable producers and consumers so that Topics can be linked to existing applications.
Topic: used to classify messages; each message that enters Kafka is placed under one Topic.
Broker: the host server on which the data is stored.
Partition: the messages in each Topic are divided into several partitions to improve message processing efficiency.
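For illustration, a minimal producer sketch with the third-party kafka-python client, converting one collected log into the preset [logtime] [operation], JSON format and producing it to its topic; the broker address and the log_{logtype} topic naming are assumptions.

import json
from kafka import KafkaProducer

# Assumed broker address; one topic per log type (theme folder)
producer = KafkaProducer(
    bootstrap_servers="broker1:9092",
    value_serializer=lambda v: v.encode("utf-8"),
)

def send_log(logtime: str, operation: str, payload: dict, logtype: str = "click") -> None:
    """Convert one collected log into the preset format and produce it to its topic."""
    line = f"[{logtime}] [{operation}], {json.dumps(payload)}"
    producer.send(f"log_{logtype}", value=line)

send_log("2013-04-10 1:00:09", "Click", {"time": 12344343, "server": "1001"})
producer.flush()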
Fig. 4 is a schematic structural diagram (a) of the log processing apparatus provided in the embodiment of the present invention, and as shown in fig. 4, the apparatus includes:
the receiving module 42 is configured to receive a plurality of logs to be processed sent by the open source component, wherein the plurality of logs to be processed are obtained by the open source component converting a plurality of logs into a preset log format;
the first determining module 44 is configured to perform data cleaning on the multiple logs to be processed to obtain multiple first target logs;
the second determining module 46 is configured to perform partition processing on the multiple first target logs according to a preset time interval to obtain multiple second target logs.
Through the above modules, format conversion is carried out on the collected logs by utilizing the open source component to obtain a plurality of logs to be processed, and the logs to be processed are sent to the distributed processing platform; the distributed processing platform performs data cleaning on the plurality of logs to be processed to obtain a plurality of first target logs; and partitions the plurality of first target logs according to a preset time interval to obtain a plurality of second target logs. The log processing method and the log processing device can realize real-time log processing, improve log processing efficiency, and further solve the technical problems in the related art that log processing efficiency is low and requirements on log data cannot be met.
It should be noted that the open source component Kafka mentioned above is an open-source stream processing platform developed by the Apache Software Foundation and written in Scala and Java. The goal of the project is to provide a unified, high-throughput, low-latency platform for processing real-time data. Its persistence layer is essentially a "large-scale publish/subscribe message queue with a distributed transaction log architecture," which makes it very valuable as enterprise-level infrastructure for handling streaming data. Furthermore, Kafka can connect to external systems (for data input/output) through Kafka Connect and provides Kafka Streams, a Java stream-processing library. Kafka stores messages that come from any number of processes called "producers" (Producers). The data can thus be assigned to different "partitions" (Partition) and different "Topics". A Topic is used to classify messages, and each message that enters Kafka is placed under one Topic.
It should be noted that the format of the multiple logs to be processed follows the Kafka specification, which facilitates structured parsing. The preset log format may be: [logtime] [operation], JSON; for example: [2013-04-10 1:00:09] [Click], {"time":12344343, "server":"1001"}.
In this embodiment, an application may collect the client's logs locally through a logging module or through the nginx access log, then collect the logs with open source components such as rsyslog, filebeat, or scribe, and produce them to a topic of Kafka.
It should be noted that Apache Spark is an open-source cluster computing framework originally developed by AMPLab at the University of California, Berkeley. Unlike Hadoop MapReduce, which writes intermediate data to disk after each operation finishes, Spark uses in-memory computation and can analyze and operate on data in memory before it is written to disk.
In an alternative embodiment, Spark performs data cleaning on the plurality of logs to be processed in the following manner: triggering the data cleaning of the plurality of logs to be processed by using a preset action operator; and performing data cleaning on the plurality of logs to be processed by using preset transformation operators. For example, Spark's rich transformation operators (transform), such as map, flatMap, filter, and User-Defined Functions (UDF for short), are used to clean the data, and an action operator, such as collect or save, triggers the data cleaning process.
In an alternative embodiment, the plurality of first target logs are partitioned at the preset time interval in the following manner: determining the log type and the log time of each first target log in the plurality of first target logs, wherein the log time is the time at which each first target log was collected; and storing each first target log, partitioned by the log type and the log time, into a first preset directory to obtain the plurality of second target logs. For example, by setting a reasonable batch interval for Spark Streaming (e.g., 5 minutes), batches are written with a secondary partition by logtype and logtime into the HDFS directory /warehouse/{logtype}/{logtime} (the first preset directory). The delay of the data is roughly equal to the batch interval; if it was previously set to 5 minutes, the data also appears in Hive with a 5-minute delay. By setting a reasonable batch interval, near-real-time data collection can be achieved. In this embodiment, the output of the Spark computation engine is stored in the Hadoop storage system.
Note that a time slice, or batch interval, is the artificial criterion for quantizing streaming data: the time slice is the basis on which the stream is split, and the data of one time slice corresponds to one RDD instance.
It should be noted that HDFS is the abbreviation of Hadoop Distributed File System, that is, the distributed file system of Hadoop.
In an alternative embodiment, after the plurality of second target logs are obtained, they need to be stored exactly once. Before version 2.0, Spark Streaming could only guarantee at-least-once processing; the Spark framework alone could hardly achieve exactly-once, and the program had to provide a reliable, replayable data source and a downstream sink that supports idempotent writes. Spark Structured Streaming, however, can implement exactly-once processing straightforwardly.
It should be noted that exactly-once is one of the difficulties of real-time computation: each piece of data is processed only once, achieving the goal that each record takes effect exactly once.
The method specifically comprises the following steps:
1) storing the plurality of second target logs into a second preset directory exactly once by using a preset source interface; for example, by using the Kafka Source. Specifically, in the source code, Source is an abstract interface (trait Source, namely the preset source interface) containing the functions Structured Streaming needs in order to achieve end-to-end exactly-once processing: a long-running consumer at the driver end reads from the Kafka brokers the offsets stored for the previous batch, obtains the latest offsets of each topic, returns a DataFrame for that offset range through the getBatch method, stores a file on HDFS through the checkpoint mechanism, and records the Kafka offset positions so that valid offset values can be obtained even when getOffset fails. In the present embodiment, getOffset() has the following meaning: each time, the checkpoint file on HDFS is read to obtain the starting position of the current Kafka read, and after the read is finished, the end position of the current read is updated into the checkpoint file.
2) storing the plurality of second target logs into the second preset directory exactly once by using a preset batch-processing function. For example, HdfsSink: its addBatch() method alone is enough to support the functionality Structured Streaming needs to implement end-to-end exactly-once processing. A specific implementation of HdfsSink.addBatch() keeps a metadata file under the HDFS directory that records the currently completed maximum batchId; when recovering from a failure, if the batchId of a job is less than or equal to the maximum batchId in the metadata, the job is skipped, thereby enabling idempotent writing of the data. In the present embodiment, addBatch() performs the following: if the batchId submitted this time is less than or equal to the batchId recorded in the HDFS metadata, the batch is considered already written and is skipped directly; otherwise, the log data is written to the partition directory and the batchId in the HDFS metadata is incremented by 1.
It should be noted that Spark Streaming is an extension of the Spark core Application Programming Interface (API) that can process real-time streaming data with high throughput and a fault-tolerance mechanism. It supports acquiring data from a variety of data sources, including Kafka, Flume, Twitter, ZeroMQ, Kinesis, and TCP sockets; after data is acquired from a data source, complex algorithmic processing can be performed using high-level functions such as map, reduce, join, and window. Finally, the processing results can be stored in file systems, databases, and live dashboards. On the basis of "One Stack to rule them all", Spark's other sub-frameworks, such as machine learning and graph computation, can also be used to process the streaming data.
In an optional embodiment, in a case that it is determined that the plurality of second target logs fail to be stored in the second preset directory, the plurality of logs to be processed may be recovered in the following manner to process the logs again:
1) within a first preset number of days, the plurality of logs to be processed are obtained again from the local cache; for example, the logs are printed to the local file system and retained for 10 days with daily rolling and compression. First, the online machines have a complete operation-and-maintenance alarm system, so that hard-disk capacity problems are detected in time and disk-write failures are avoided. Keeping 10 days of logs guards against problems in downstream systems; in the worst case, for instance when a problem is not noticed in time during a long holiday such as the National Day break, the data can still be pushed and recovered again from the data source.
2) recovering the plurality of logs to be processed from the local disk by using a log collection system; for example, scribe's advantage is that problems can be found in time through heartbeat detection and log-delay detection. Scribe records the metadata of the transmitted file, most importantly the transmission position within the file, which can be used to recover data accurately after a failure, and it transmits data over the TCP protocol.
It should be noted that scribe is an open-source log collection system from Facebook, where it is already used by a large number of applications. It can collect logs from various log sources and store them in a central storage system (which may be NFS, a distributed file system, etc.) to facilitate centralized statistical analysis and processing. It provides an extensible, highly fault-tolerant scheme for the "distributed collection and unified processing" of logs. The structure of scribe is simple, consisting mainly of three parts: the scribe agent, the scribe server, and the storage system.
3) recovering the plurality of logs to be processed from the replicas stored in the open source component; for example, the number of replicas on the Kafka cluster is 3, and the logs are persisted for 3 days. If Spark Structured Streaming fails, data can be recovered directly from Kafka.
4) reading the metadata file and offset file of the plurality of logs to be processed to recover the plurality of logs to be processed; for example, the Spark checkpoint mechanism stores state information such as the Kafka offsets. When a task fails, the Spark task starts by reading the metadata and offset files under the checkpoint folder, and the KafkaConsumer can read data from a specified partition offset, so that no duplication or omission occurs.
5) acquiring the plurality of logs to be processed from a plurality of stored replica files, wherein the plurality of replica files are stored in one of the following: the local node, a node on the local rack, or a node on a different rack. For example, the block mechanism of HDFS: the number of replicas of a file is 3, and the HDFS replica placement strategy is that the first replica is placed on the local node, the second replica on another node on the local rack, and the third replica on a node on a different rack. This reduces write traffic between racks and thereby improves write performance. The probability of a rack failure is much smaller than that of a node failure, so this approach does not affect the data reliability and availability guarantees, while it does reduce the aggregate network bandwidth used by read operations.
6) insert overwrite: Hive data extraction uses the overwrite operation insert overwrite, which, compared with insert into, is idempotent, so the operation can be repeated safely.
In an optional embodiment, in order to increase the efficiency of log processing, the kafka may be expanded laterally, specifically including:
(1) increasing the number of Kafka brokers and the number of partitions of the topics, and distributing topic data to more brokers and machines according to Kafka's default partitioner (which maps a record to the hash of its key modulo numPartitions), thereby increasing the upstream and downstream data processing capacity.
(2) after increasing the partitions of Kafka, Spark can set the same number of RDD partitions according to Kafka's consumer group mechanism to increase the message processing capacity for the data.
In summary, through real-time log collection, the high-performance Kafka message middleware, and the excellent distributed properties of Spark on YARN (high fault tolerance, easy expansion, exactly-once processing), the problems that logs in a distributed system are somewhat deficient in capacity expansion and fault tolerance, and that ETL cleaning of big data logs cannot be performed conveniently, are solved.
In an alternative embodiment, in order to improve the efficiency of Spark in processing the logs, the Spark application may be optimized. The specific steps are as follows:
optimizing the operation parameters of a long-running Spark program: spark-submit --master yarn --deploy-mode cluster \
--conf spark.yarn.maxAppAttempts=4\
--conf spark.yarn.am.attemptFailuresValidityInterval=1h\
--conf spark.yarn.max.executor.failures={8*num_executors}\
--conf spark.yarn.executor.failuresValidityInterval=1h\
--conf spark.task.maxFailures=8\
When running the program in cluster mode, any error in the Spark driver stops our long-running job. Fortunately, spark.yarn.maxAppAttempts configures the maximum number of attempts before the application is considered failed. If an application runs for days or weeks without restart or redeployment on a heavily used cluster, 4 attempts can be exhausted within a few hours. To avoid this, the attempt counter should be reset every hour (spark.yarn.am.attemptFailuresValidityInterval=1h). Another important setting is the maximum number of executor failures before the application fails; the default, max(2 x num executors, 3), is well suited to batch jobs but not to jobs that run for long periods of time. This attribute also has a corresponding validity interval (spark.yarn.executor.failuresValidityInterval). For long-running jobs, you can also consider increasing the maximum number of task failures before abandoning the job (spark.task.maxFailures). Through these optimized configuration items, the Spark application can keep running reliably for a long time in a complex distributed environment.
In an optional embodiment, in spark, a first table is established according to a secondary partition storage format, wherein the first table is used for storing summary information of a plurality of second target logs; establishing a first sub-table in the first table, wherein the first sub-table is used for storing query information for querying a plurality of second target logs; and adding a partition for the first sub-table at regular time, wherein the partition is used for storing the query information of the added log.
For example: according to the secondary-partition storage format of the ETL, a general table is established for obtaining summary information of the logs, and sub-tables are then established for each logtype directory for query tasks to use. Partitions are added to the tables periodically by the scheduling system. On this basis, a Python script is written to recognize newly added logs and their log format, build an automatic mapping from the JSON format to a Hive schema, automatically generate the table-creation statements, and create the tables in the Hive warehouse. This achieves automation and reduces manual maintenance, allowing new data to come online quickly.
In an optional embodiment, the small files of the logs may be compressed and merged, and each of the second target logs in the plurality of second target logs may be divided to obtain a plurality of decomposed logs; analyzing a plurality of decomposition logs; merging a plurality of decomposed logs analyzed within a preset time period to obtain merged logs; and compressing the merged log.
For example: in pursuit of a near-real-time effect, Spark Streaming splits each day's logs into many small files, which is unfriendly to the NameNode's memory usage and the DataNodes' disks. A scheduling system is used to merge and compress the small files of the previous week as an optimization. Specifically, the amount of data processed and the concurrency of the MapReduce map and reduce stages are configured, and snappy compression of the intermediate results and the output results is enabled. By using insert overwrite statements cleverly, queries are not affected while the files are being merged and compressed.
In an alternative embodiment, the storage of the logs is monitored: the storage of the plurality of second target logs is monitored, and alarm information is sent when storage fails. For example, Spark's event listener is overridden to monitor events, metrics are sent through statsd and stored in Graphite, and alarm rules are configured through Cabot; once a rule is triggered, an alarm is raised through the alarm service. In addition, the status of the job is polled in real time through the RESTful interface provided by YARN and the service is pulled up again automatically, which greatly improves the reliability of the job.
In conclusion, a robust distributed scheme covering fault tolerance, capacity expansion and monitoring of storage is realized with the spark on yarn technology, effectively improving product stability, the real-time performance of data analysis and the large-scale reliable storage of logs. Meanwhile, the merging and compression of cold and hot data are taken into account, reasonably reducing the hadoop load without affecting data analysis performance.
An embodiment of the present invention further provides a log processing apparatus, and fig. 5 is a schematic structural diagram (ii) of the log processing apparatus provided in the embodiment of the present invention, and as shown in fig. 5, the apparatus includes:
the third determining module 52 is configured to perform format conversion on the obtained multiple logs by using a preset log format to obtain multiple logs to be processed;
the storage module 54 is configured to store the multiple logs to be processed into the theme folders corresponding to the respective logs to be processed respectively;
and the sending module 56 is configured to send the multiple to-be-processed logs in the theme folder to the distributed processing platform.
Through the above steps, format conversion is carried out on the collected logs by the open source component to obtain a plurality of logs to be processed, and the logs to be processed are sent to the distributed processing platform; the distributed processing platform performs data cleaning on the plurality of logs to be processed to obtain a plurality of first target logs, and partitions the plurality of first target logs according to a preset time interval to obtain a plurality of second target logs. The log processing method and device can thus realize real-time log processing and improve log processing efficiency, thereby solving the technical problems in the related art that log processing efficiency is low and the requirements on log data cannot be met.
It should be noted that the execution subject in the above may be the open source component kafka, but is not limited thereto.
It should be noted that the above open source component Kafka is an open source stream processing platform developed by the Apache software foundation and written in Scala and Java. The goal of the project is to provide a unified, high-throughput, low-latency platform for processing real-time data. Its persistence layer is essentially a "large-scale publish/subscribe message queue in a distributed transaction log architecture", which makes it very valuable as enterprise-level infrastructure for handling streaming data. Furthermore, Kafka can be connected to external systems (for data input/output) through Kafka Connect and provides Kafka Streams, a Java stream-processing library. Kafka stores messages coming from any number of processes called "producers" (Producers); the data can be assigned to different "partitions" (Partition) of different "Topics". A Topic is used to classify messages, and each message that goes into Kafka is placed under one Topic.
It should be noted that the format of the multiple logs to be processed follows the Kafka specification, which facilitates structured parsing, and the preset log format may be: [logtime] [operation], JSON; for example: [2013-04-10 01:00:09] [Click], {"time": 12344343, "server": "1001"}.
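A hedged parsing sketch for this "[logtime] [operation], JSON" layout, with an assumed regular expression, might be:

```python
# Parse one log line of the preset format into a structured record; malformed lines return None.
import json
import re

LINE_RE = re.compile(r"^\[(?P<logtime>[^\]]+)\]\s*\[(?P<operation>[^\]]+)\]\s*,\s*(?P<body>\{.*\})$")

def parse_line(line):
    match = LINE_RE.match(line.strip())
    if match is None:
        return None                      # malformed lines can be dropped during cleaning
    record = json.loads(match.group("body"))
    record["logtime"] = match.group("logtime")
    record["operation"] = match.group("operation")
    return record

print(parse_line('[2013-04-10 01:00:09] [Click], {"time": 12344343, "server": "1001"}'))
```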
In this embodiment, on the Kafka side, the application may collect the client's logs locally through a logging module or through the nginx access log, then gather them with tools such as rsyslog, filebeat or scripts, and produce them to a Kafka topic.
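A minimal producer-side sketch, assuming the kafka-python client, a per-operation topic naming convention and an illustrative local log path (none of which are specified by the embodiment), and reusing the parse_line helper sketched above, could be:

```python
# Hypothetical producer: ship parsed log records to per-logtype Kafka topics.
import json
from kafka import KafkaProducer

producer = KafkaProducer(
    bootstrap_servers=["kafka1:9092", "kafka2:9092"],      # assumed broker addresses
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

def ship(record):
    topic = f"log-{record['operation'].lower()}"            # assumed topic naming convention
    producer.send(topic, value=record)

with open("/var/log/app/access.log") as fh:                 # assumed local log path
    for line in fh:
        record = parse_line(line)                           # parser from the sketch above
        if record:
            ship(record)
producer.flush()
```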
It should be noted that Apache Spark is an open source cluster computing framework originally developed by AMPLab at the University of California, Berkeley. Unlike Hadoop's MapReduce, which writes intermediate data to disk after each job finishes, Spark uses in-memory computing and can analyze and operate on data in memory before it is written to disk.
It should be noted that Kafka stores messages coming from any number of processes called "producers" (Producers); the data can be assigned to different "partitions" (Partition) of different "Topics". Within a partition, messages are indexed and stored together with a timestamp. Other processes, called "consumers" (Consumers), can query messages from the partitions. Kafka runs on a cluster of one or more servers, and partitions can be distributed across the cluster nodes.
Kafka efficiently processes real-time streaming data and can be integrated with Storm, HBase and Spark. Deployed as a cluster on multiple servers, Kafka handles all of its publish and subscribe messages through four APIs, namely the Producer API, Consumer API, Streams API and Connector API. It can transfer large-scale streaming messages, is fault tolerant, and replaces some traditional message systems such as JMS and AMQP.
The main terms of the Kafka architecture include Topic, Record and Broker. A Topic consists of Records, which hold the individual pieces of information, while Brokers are responsible for replicating messages. Kafka has four major APIs:
Producer API: allows applications to publish Record streams.
Consumer API: allows applications to subscribe to Topics and process Record streams.
Streams API: converts input streams into output streams and produces results.
Connector API: provides reusable producers and consumers so that Topics can be linked to existing applications.
Topic is used to classify messages, and each message that goes into Kafka is placed under one Topic.
Broker: the host server on which the data is stored.
Partition: the messages in each Topic are divided into several partitions to improve message processing efficiency.
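For illustration, a consumer-side sketch with the kafka-python client shows how Records are read back from a Topic's partitions; the topic and group names are assumptions:

```python
# Hypothetical consumer: read records back from one per-logtype topic.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "log-click",                                   # assumed topic
    bootstrap_servers=["kafka1:9092"],             # assumed broker address
    group_id="log-etl",                            # assumed consumer group
    auto_offset_reset="earliest",
    value_deserializer=lambda v: json.loads(v.decode("utf-8")),
)

for message in consumer:
    # Each message carries its partition, offset and timestamp, as described above.
    print(message.topic, message.partition, message.offset, message.value)
```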
An embodiment of the present invention further provides a log processing system. Fig. 6 is a schematic structural diagram of the log processing system provided in the embodiment of the present invention. As shown in fig. 6, the system includes:
a distributed processing platform spark, wherein the spark is arranged to perform the method of any of claims 1-7 when run;
an open source component kafka coupled to the distributed processing platform, wherein the kafka is configured to perform the method of claim 8 when run.
As shown in fig. 6, the system further includes servers: kafka collects the clients' logs from the servers and transmits them to spark for log processing, and the processed logs are transmitted to Hadoop for further storage.
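An end-to-end sketch of this fig. 6 pipeline in Spark Structured Streaming is given below; the topic pattern, schema, HDFS paths and 5-minute trigger are assumptions used only to illustrate the Kafka-to-Hadoop flow, and the spark-sql-kafka connector package is required:

```python
# Hedged sketch: read the per-logtype Kafka topics, clean the records, and write
# time-partitioned parquet files to Hadoop.
from pyspark.sql import SparkSession, functions as F
from pyspark.sql.types import LongType, StringType, StructField, StructType

spark = SparkSession.builder.appName("kafka-to-hdfs").getOrCreate()

schema = StructType([                                # assumed JSON body schema
    StructField("time", LongType()),
    StructField("server", StringType()),
])

raw = (spark.readStream.format("kafka")
       .option("kafka.bootstrap.servers", "kafka1:9092")
       .option("subscribePattern", "log-.*")         # all per-logtype topics
       .load())

cleaned = (raw
    .select(F.col("topic").alias("logtype"),
            F.from_json(F.col("value").cast("string"), schema).alias("body"),
            F.col("timestamp"))
    .where(F.col("body").isNotNull())                # drop unparseable records during cleaning
    .select("logtype", "body.*",
            F.date_format("timestamp", "yyyy-MM-dd").alias("day")))

query = (cleaned.writeStream
    .format("parquet")
    .option("path", "hdfs:///warehouse/logs")        # assumed HDFS warehouse location
    .option("checkpointLocation", "hdfs:///checkpoints/logs")
    .partitionBy("logtype", "day")                   # two-level partitioning by type and day
    .trigger(processingTime="5 minutes")             # stands in for the preset time interval
    .start())
query.awaitTermination()
```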
Embodiments of the present invention also provide a storage medium having a computer program stored therein, wherein the computer program is arranged to perform the steps of any of the above method embodiments when executed.
Alternatively, in the present embodiment, the storage medium may be configured to store a computer program for executing the above steps.
Optionally, in this embodiment, the storage medium may include, but is not limited to: various media capable of storing computer programs, such as a usb disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic disk, or an optical disk.
Embodiments of the present invention also provide an electronic device comprising a memory having a computer program stored therein and a processor arranged to run the computer program to perform the steps of any of the above method embodiments.
Optionally, the electronic apparatus may further include a transmission device and an input/output device, wherein the transmission device is connected to the processor, and the input/output device is connected to the processor.
Alternatively, in this embodiment, the processor may be configured to execute the above steps through a computer program.
Optionally, the specific examples in this embodiment may refer to the examples described in the above embodiments and optional implementation manners, and this embodiment is not described herein again.
The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.
In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.
In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.
The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.
In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.
The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.
The foregoing is only a preferred embodiment of the present invention, and it should be noted that, for those skilled in the art, various modifications and decorations can be made without departing from the principle of the present invention, and these modifications and decorations should also be regarded as the protection scope of the present invention.

Claims (13)

1. A log processing method, comprising:
receiving a plurality of logs to be processed sent by an open source component, wherein the log format of the plurality of logs to be processed is obtained by the open source component converting the plurality of logs using a preset log format;
performing data cleaning on the logs to be processed to obtain a plurality of first target logs;
partitioning the plurality of first target logs according to a preset time interval to obtain a plurality of second target logs;
after the multiple first target logs are partitioned according to the preset time interval to obtain multiple second target logs, the method further includes:
establishing a first table according to a two-level partition storage format, wherein the first table is used for storing summary information of the plurality of second target logs;
establishing a first sub-table in the first table, wherein the first sub-table is used for storing query information for querying the plurality of second target logs;
and adding a partition for the first sub-table at regular time, wherein the partition is used for storing the query information of the added log.
2. The method of claim 1, wherein performing data cleansing on the plurality of to-be-processed logs to obtain the plurality of first target logs comprises:
triggering data cleaning of the logs to be processed by using a preset activity algorithm;
and performing data cleaning on the plurality of logs to be processed after triggering by using a preset conversion algorithm.
3. The method of claim 1, wherein partitioning the first target logs according to the preset time interval to obtain the second target logs comprises:
determining a log type and a log time of each first target log in the plurality of first target logs, wherein the log time is the time for acquiring each first target log;
and storing each first target log partition into a first preset directory based on the log type and the log time to obtain a plurality of second target logs.
4. The method according to claim 1, wherein after the first target logs are partitioned according to the preset time interval to obtain the second target logs, the method further comprises one of:
storing the plurality of second target logs into a second preset directory in an exactly-once manner by using a preset feature source interface;
and storing the plurality of second target logs into a second preset directory in an exactly-once manner by using a preset batch processing function.
5. The method according to claim 4, wherein after the real-time partition processing is performed on the plurality of first target logs according to the preset time interval to obtain the plurality of second target logs, and in a case that it is determined that the storage of the plurality of second target logs to the second preset directory fails, the method further comprises one of:
within a first preset number of days, the plurality of logs to be processed are obtained again from the local cache;
recovering the plurality of logs to be processed from the local disk by using a log collection system;
recovering the plurality of logs to be processed from the copy stored in the open source component;
reading metadata files metadata and offset of the multiple logs to be processed to recover the multiple logs to be processed;
obtaining the multiple logs to be processed from multiple stored duplicate files, wherein the multiple duplicate files are stored in one of the following: local nodes, nodes on a local rack, nodes on different racks.
6. The method of claim 1, wherein after the partitioning the first target logs according to the preset time interval to obtain a second target logs, the method further comprises:
dividing each second target log in the plurality of second target logs to obtain a plurality of decomposition logs;
parsing the plurality of decomposition logs;
merging the plurality of decomposed logs analyzed within a preset time period to obtain merged logs;
and compressing the merged log.
7. The method according to claim 1, wherein after the partitioning processing is performed on the first target logs according to the preset time interval to obtain the second target logs, the method further comprises:
monitoring storage of the second plurality of target logs;
and sending alarm information under the condition that the storage fails.
8. A log processing method, comprising:
carrying out format conversion on the obtained logs by using a preset log format to obtain a plurality of logs to be processed;
respectively storing the multiple logs to be processed into theme folders corresponding to the logs to be processed;
sending the plurality of to-be-processed logs in the theme folder to a distributed processing platform;
the distributed processing platform performs data cleaning on the logs to be processed to obtain a plurality of first target logs;
partitioning the plurality of first target logs according to a preset time interval to obtain a plurality of second target logs;
establishing a first table according to a secondary partition storage format in the distributed processing platform, wherein the first table is used for storing summary information of the plurality of second target logs;
establishing a first sub-table in the first table, wherein the first sub-table is used for storing query information for querying the plurality of second target logs;
and adding a partition for the first sub-table at regular time, wherein the partition is used for storing the query information of the added log.
9. A log processing apparatus, comprising:
the receiving module is used for receiving a plurality of logs to be processed sent by an open source component, wherein the log format of the plurality of logs to be processed is obtained by the open source component converting the plurality of logs using a preset log format;
the first determining module is used for carrying out data cleaning on the logs to be processed to obtain a plurality of first target logs;
the second determining module is used for partitioning the plurality of first target logs according to a preset time interval to obtain a plurality of second target logs;
after the multiple first target logs are subjected to partition processing according to the preset time interval and multiple second target logs are obtained, a first table is established according to a two-level partition storage format, wherein the first table is used for storing summary information of the multiple second target logs;
establishing a first sub-table in the first table, wherein the first sub-table is used for storing query information for querying the plurality of second target logs;
and adding a partition for the first sub-table at regular time, wherein the partition is used for storing the query information of the added log.
10. A log processing apparatus, comprising:
the third determining module is used for performing format conversion on the obtained logs by using a preset log format to obtain a plurality of logs to be processed;
the storage module is used for respectively storing the multiple logs to be processed into the theme folders corresponding to the logs to be processed;
the sending module is used for sending the plurality of to-be-processed logs in the theme folder to a distributed processing platform;
the distributed processing platform performs data cleaning on the logs to be processed to obtain a plurality of first target logs;
partitioning the plurality of first target logs according to a preset time interval to obtain a plurality of second target logs;
establishing a first table according to a secondary partition storage format in the distributed processing platform, wherein the first table is used for storing summary information of the plurality of second target logs;
establishing a first sub-table in the first table, wherein the first sub-table is used for storing query information for querying the plurality of second target logs;
and adding a partition for the first sub-table at regular time, wherein the partition is used for storing the query information of the added log.
11. A log processing system, comprising:
a distributed processing platform spark, wherein the spark is arranged to perform the method of any of claims 1-7 when run;
an open source component kafka coupled to the distributed processing platform, wherein the kafka is configured to perform the method of claim 8 at runtime.
12. A storage medium having stored thereon a computer program, wherein the computer program is arranged to perform the method of any of claims 1 to 7 when executed, or to perform the method of claim 8.
13. An electronic device comprising a memory and a processor, wherein the memory has stored therein a computer program, and wherein the processor is arranged to execute the computer program to perform the method of any one of claims 1 to 7 or to perform the method of claim 8.
CN201910138347.5A 2019-02-25 2019-02-25 Log processing method, log processing device, storage medium and electronic device Active CN109918349B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910138347.5A CN109918349B (en) 2019-02-25 2019-02-25 Log processing method, log processing device, storage medium and electronic device

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910138347.5A CN109918349B (en) 2019-02-25 2019-02-25 Log processing method, log processing device, storage medium and electronic device

Publications (2)

Publication Number Publication Date
CN109918349A CN109918349A (en) 2019-06-21
CN109918349B true CN109918349B (en) 2021-05-25

Family

ID=66962220

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910138347.5A Active CN109918349B (en) 2019-02-25 2019-02-25 Log processing method, log processing device, storage medium and electronic device

Country Status (1)

Country Link
CN (1) CN109918349B (en)

Families Citing this family (12)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN110908788B (en) * 2019-12-02 2022-04-08 北京锐安科技有限公司 Spark Streaming based data processing method and device, computer equipment and storage medium
CN111581173B (en) * 2020-05-09 2023-10-20 深圳市卡数科技有限公司 Method, device, server and storage medium for distributed storage of log system
CN111752752B (en) * 2020-05-28 2022-07-19 苏州浪潮智能科技有限公司 Message fault tolerance method and system based on Spark stream computing framework
CN113760832A (en) * 2020-06-03 2021-12-07 富泰华工业(深圳)有限公司 File processing method, computer device and readable storage medium
CN111831617B (en) * 2020-07-16 2022-08-09 福建天晴数码有限公司 Method for guaranteeing uniqueness of log data based on distributed system
CN112148674B (en) * 2020-10-12 2023-12-19 平安科技(深圳)有限公司 Log data processing method, device, computer equipment and storage medium
CN112612677A (en) * 2020-12-28 2021-04-06 北京天融信网络安全技术有限公司 Log storage method and device, electronic equipment and readable storage medium
CN112506862A (en) * 2020-12-28 2021-03-16 浪潮云信息技术股份公司 Method for custom saving Kafka Offset
CN113190726A (en) * 2021-04-16 2021-07-30 珠海格力精密模具有限公司 Method for reading CAE (computer aided engineering) modular flow analysis data, electronic equipment and storage medium
CN113312353A (en) * 2021-06-10 2021-08-27 中国民航信息网络股份有限公司 Storage method and system for tracking journal
CN113806434B (en) * 2021-09-22 2023-09-05 平安科技(深圳)有限公司 Big data processing method, device, equipment and medium
CN113778810A (en) * 2021-09-27 2021-12-10 杭州安恒信息技术股份有限公司 Log collection method, device and system

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104298771A (en) * 2014-10-30 2015-01-21 南京信息工程大学 Massive web log data query and analysis method
CN105389352A (en) * 2015-10-30 2016-03-09 北京奇艺世纪科技有限公司 Log processing method and apparatus
CN105824744A (en) * 2016-03-21 2016-08-03 焦点科技股份有限公司 Real-time log collection and analysis method on basis of B2B (Business to Business) platform
CN108600300A (en) * 2018-03-06 2018-09-28 北京思空科技有限公司 Daily record data processing method and processing device

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9244755B2 (en) * 2013-05-20 2016-01-26 Vmware, Inc. Scalable log analytics


Also Published As

Publication number Publication date
CN109918349A (en) 2019-06-21

Similar Documents

Publication Publication Date Title
CN109918349B (en) Log processing method, log processing device, storage medium and electronic device
CN111723160B (en) Multi-source heterogeneous incremental data synchronization method and system
CN110321387B (en) Data synchronization method, equipment and terminal equipment
CN107038162B (en) Real-time data query method and system based on database log
US11507594B2 (en) Bulk data distribution system
US9130971B2 (en) Site-based search affinity
US9124612B2 (en) Multi-site clustering
CN107544984B (en) Data processing method and device
CN108694195B (en) Management method and system of distributed data warehouse
CN108881477B (en) Distributed file acquisition monitoring method
US20140358844A1 (en) Workflow controller compatibility
CN111125260A (en) Data synchronization method and system based on SQL Server
CN112231402A (en) Real-time synchronization method, device, equipment and storage medium for heterogeneous data
CN109298978B (en) Recovery method and system for database cluster of specified position
CN111209352A (en) Data processing method and device, electronic equipment and storage medium
CN113010565B (en) Server real-time data processing method and system based on server cluster
CN109190025B (en) Information monitoring method, device, system and computer readable storage medium
CN112328702B (en) Data synchronization method and system
CN112069264A (en) Heterogeneous data source acquisition method and device, electronic equipment and storage medium
CN114048217A (en) Incremental data synchronization method and device, electronic equipment and storage medium
CN113672668A (en) Log real-time processing method and device in big data scene
CN110309206B (en) Order information acquisition method and system
CN116701352A (en) Database data migration method and system
CN114238018B (en) Method, system and device for detecting integrity of log collection file and storage medium
CN115221116A (en) Data writing method, device and equipment and readable storage medium

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant