CN110727568A - Multi-source log data processing system and method in cloud environment - Google Patents

Multi-source log data processing system and method in cloud environment

Info

Publication number
CN110727568A
Authority
CN
China
Prior art keywords
log
plug
data
processing
module
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201910880320.3A
Other languages
Chinese (zh)
Inventor
Luo Ping (罗平)
Ji Tongkai (季统凯)
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
G Cloud Technology Co Ltd
Original Assignee
G Cloud Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by G Cloud Technology Co Ltd filed Critical G Cloud Technology Co Ltd
Priority to CN201910880320.3A priority Critical patent/CN110727568A/en
Publication of CN110727568A publication Critical patent/CN110727568A/en
Pending legal-status Critical Current

Classifications

    • G - PHYSICS
    • G06 - COMPUTING; CALCULATING OR COUNTING
    • G06F - ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 - Error detection; Error correction; Monitoring
    • G06F11/30 - Monitoring
    • G06F11/34 - Recording or statistical evaluation of computer activity, e.g. of down time, of input/output operation; Recording or statistical evaluation of user activity, e.g. usability assessment
    • G06F11/3466 - Performance evaluation by tracing or monitoring
    • G06F11/3476 - Data logging

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention relates to a multi-source log data processing system and method in a cloud environment. The log source input module of the system provides multi-source log access; the data preprocessing module provides data classification; the data processing module provides management of log processing plug-ins; the data storage module provides back-end data storage management. The invention tags each text log data stream with a two-dimensional vector <ip, path>; parses a plugin_playbook.yml file to define the log data stream processing chain and determine the order in which logs are processed by the different plug-ins; scans and loads all plug-ins in the plugins directory; and, from the parsed plugin_playbook.yml file, constructs a complete data stream analysis chain along which the original log files, identified by the chain, flow to the different plug-ins for step-by-step processing. The invention solves the problems of high coupling and insufficient extensibility in multi-source log processing, and can be used for multi-source log data processing.

Description

Multi-source log data processing system and method in cloud environment
Technical Field
The invention relates to the technical field of log data processing, in particular to a multi-source log data processing system and method in a cloud environment.
Background
With the rapid development of various distributed technologies and the maturing of rich open-source distributed frameworks, traditional large monolithic programs are gradually being decomposed and shifted toward service-oriented architecture (SOA), of which the microservice architecture is a typical representative. However, such an architecture has a significant problem: because each service component is deployed in a distributed manner, the workload of checking exception logs when the system misbehaves is very heavy. It is therefore necessary to perform secondary parsing and structured storage on multi-source heterogeneous log data in order to support later operation and maintenance work, and a unified, dedicated platform is needed to carry out this log management. Existing log platforms, however, have the following problems:
First, log platform resource constraints
Current log analysis platforms are computation-intensive, network-intensive and storage-intensive, so their demand for hardware resources is very high, which also drives up cost.
Second, all functional modules in the log platform are highly coupled
The log acquisition module, the log processing module, the storage module and the other modules are highly coupled, which hinders product upgrades and iteration.
Third, the log analysis functions lack extensibility
The high coupling among the modules of a log platform ties each platform to a specific language environment, which quietly raises the difficulty of becoming familiar with the product code. Moreover, because the log analysis functions are intermingled, fixed functions cannot easily be extended, reduced or reused; this raises the cost of upgrading the overall product framework and of product iteration, and results in poor robustness and extensibility.
Disclosure of Invention
The technical problem to be solved by the invention is to provide a multi-source log data processing system and method in a cloud environment that realize secondary processing of multi-source log data while avoiding problems such as high coupling and insufficient extensibility.
The technical scheme for solving the technical problems is as follows:
the system comprises a log source input module, a data preprocessing module, a data processing module and a data storage module; the log source input module provides a multi-source log access function; the data preprocessing module provides a data classification function; the data processing module provides a flexible management function of the log processing plug-in; the data storage module provides a back-end data storage management function.
The log source access module uniformly connects original, heterogeneous text-format logs from different sources to the log data stream processing platform.
The data preprocessing module uniformly and centrally manages logs from different sources at the log receiving end and identifies each by a characteristic tag; the tag is represented by a two-dimensional vector <ip, path>, where ip is a default network-connection parameter and path is a general log parameter.
The data processing module performs secondary processing on the log data, realized through managed log processing plug-ins; it comprises a multi-language plug-in module, specific-function plug-in modules and a plug-in management module.
The multi-language plug-in module provides a universal core plug-in library and plug-in APIs (application programming interfaces) for multiple language versions such as Java, Python, Ruby and Go, realizing a cross-language universal plug-in platform.
The plug-in management module copies plug-in code into the engineering directory plugins; by periodically scanning the plugins directory, the system realizes dynamic plug-in management, including plug-in loading, unloading, exception management and behavior management. Plug-in loading and unloading load or unload the plug-in modules with specific functions; plug-in behavior management provides a plugin_playbook.yml file that uniformly plans the whole data flow.
A plug-in is a log analysis module whose input and output are json data and whose function is completely independent; a specific combination of plug-ins is realized by parsing the plugin_playbook.yml file.
A plug-in is divided into four parts: first, a tag check that judges whether the log data source meets the requirements; second, the preceding plug-ins, indicating which plug-ins processed the data earlier; third, the core processing logic inside the plug-in, which realizes the core log-analysis service; and fourth, the follow-on plug-ins, indicating that the plug-in's output data is processed by subsequent plug-ins in the next stage.
The method comprises the following steps:
Step 1: the cloud platform builds the log data stream secondary processing system;
Step 2: access text log data from different sources;
Step 3: tag the text log data streams from different sources with the two-dimensional vector <ip, path>;
Step 4: parse the plugin_playbook.yml file, define the log data stream processing chain, and determine the order in which logs are processed by the different plug-ins;
Step 5: the plug-in management module scans and loads all plug-ins in the plugins directory;
Step 6: from the parsed plugin_playbook.yml file, construct the complete data stream analysis chain; following the chain identification, the original log files flow step by step through the different plug-ins for processing;
Step 7: store the processed logs in the back-end data storage module.
The invention provides a system and method for secondary processing of multi-source log data streams in a cloud environment. Source logs are centrally managed at the data preprocessing end, and a two-dimensional vector composed of ip and path uniquely identifies each log source, which reduces the operation and maintenance workload of the software and avoids excessively invasive operations on the log collection end. The log processing part is decoupled from the overall workflow into an independent module managed on its own; log analysis is completed by combining different, completely independent plug-ins, and loading and unloading of plug-ins is realized by dynamically scanning the plugins directory, so the whole service need not be restarted; the software thus has a dynamic plug-in loading characteristic, which improves product flexibility. By defining a unified plug-in specification and providing plug-in APIs for multiple language versions, the system gains a cross-language characteristic. The log processing part has high cohesion, low coupling and high extensibility, which favors product iteration. The log processing part also provides a data chain processing script, plugin_playbook.yml, to define the log stream processing chain, so that the log processing flow is simple and intuitive, the flow is controllable, and exception tracking is simplified.
Drawings
The invention is further described below with reference to the accompanying drawings:
FIG. 1 is a system framework diagram of the present invention;
FIG. 2 is a flow chart of the method of the present invention;
FIG. 3 is a diagram of the plug-in logic structure of the present invention.
Detailed Description
As shown in fig. 1, the system for secondary processing of multi-source log data stream of the present invention includes 4 functional modules: 1. the log source input module is used for providing a multi-source log access function; 2. the data preprocessing module is used for providing a data classification function; 3. the data processing module provides a flexible management function of the log processing plug-in; 4. and the data storage module provides a function of back-end data storage management.
1. Log source access module
The log source access module uniformly accesses the text format logs of different sources to the log data stream processing platform. At this point, the original, heterogeneous logs generated by the different service components are accessed.
2. Data preprocessing module
The data preprocessing module uniformly and centrally manages logs from different sources at the log receiving end and identifies each with a characteristic tag. In the invention, the tag is represented by a two-dimensional vector <ip, path>. Because a log source contains a large amount of heterogeneous data that cannot be processed in a uniform manner, the logs that need processing must be identified through this tagging feature.
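As a minimal sketch of this tagging step (the record layout and function name are assumptions for illustration; the patent publishes no code):

```python
# Minimal sketch of <ip, path> tagging at the log receiving end; the
# record layout and function name are assumptions, as the patent
# publishes no code.
import json

def tag_record(raw_line: str, ip: str, path: str) -> dict:
    """Wrap one raw text log line in a json record carrying the
    two-dimensional vector <ip, path>: ip is the default network-
    connection parameter, path the general log-file parameter."""
    return {"tag": {"ip": ip, "path": path}, "message": raw_line}

# Example: a line arriving from host 10.0.0.5, file /var/log/app/api.log
record = tag_record("ERROR instance spawn failed", "10.0.0.5",
                    "/var/log/app/api.log")
print(json.dumps(record))
```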
3. Data processing module
The data processing module performs secondary processing on the log data, realized through the managed log processing plug-ins; it comprises the supported multi-language plug-in module, the plug-in modules with specific functions, and the plug-in management module.
A plugin is a log analysis module whose input and output are json data and whose function is completely independent; these modules can be flexibly combined and shared, and a specific combination is realized by parsing the plugin_playbook.yml file.
The multi-language plugin module is a universal plugin platform that the system realizes by providing a universal core plugin specification and plugin APIs for multiple language versions such as Java, Python, Ruby and Go.
The plugin management module copies plugin code into the plugins directory to realize dynamic loading of plugins; the module periodically scans this directory, thereby achieving dynamic loading. Plugin management includes plugin loading, unloading, exception management and behavior management. Plugin loading and unloading load and unload the plugin modules with specific functions. A plugin_playbook.yml file (referred to herein as the log processing chain playbook) is provided to uniformly plan the processing flow of the whole data stream, and plugin behavior management is realized through this playbook file.
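A minimal sketch of such periodic scanning and dynamic loading, with an assumed one-file-per-plugin layout (the directory convention and names are illustrative, not the patent's):

```python
# Sketch: periodically scan the plugins directory so that new plug-ins
# are loaded and removed ones dropped without restarting the service.
# The one-file-per-plugin layout and names are illustrative assumptions.
import importlib.util
import pathlib
import time

PLUGINS_DIR = pathlib.Path("plugins")  # the engineering catalog

def scan_and_sync(loaded: dict) -> None:
    """Load any .py plug-in that appeared; forget any that vanished
    (a sketch-level unload; Python cannot fully unload a module)."""
    present = {p.stem: p for p in PLUGINS_DIR.glob("*.py")}
    for name in set(loaded) - set(present):          # plug-in removed
        del loaded[name]
    for name, path in present.items():               # plug-in added
        if name not in loaded:
            spec = importlib.util.spec_from_file_location(name, path)
            module = importlib.util.module_from_spec(spec)
            spec.loader.exec_module(module)
            loaded[name] = module

loaded_plugins: dict = {}
while True:            # the periodic dynamic scan described above
    scan_and_sync(loaded_plugins)
    time.sleep(30)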
4. Data storage module
The data storage module provides an open back-end data storage architecture.
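As an illustration of what an open, pluggable storage backend could look like (the interface shape is an assumption, since the patent does not name a concrete backend):

```python
# Sketch of an open back-end storage interface; the adapter shape is an
# assumption, since the patent does not name a concrete storage backend.
import json

class Store:
    def save(self, record: dict) -> None:
        raise NotImplementedError

class FileStore(Store):
    """Minimal backend: append structured json records to a local file;
    any other backend can be swapped in behind the same interface."""
    def __init__(self, path: str):
        self.path = path

    def save(self, record: dict) -> None:
        with open(self.path, "a") as f:
            f.write(json.dumps(record) + "\n")
```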
As shown in fig. 2, the secondary processing method for multi-source log data streams of the present invention comprises the following basic steps (an illustrative end-to-end sketch follows the steps):
Step 1: the cloud platform provisions a cloud server with a multi-core CPU and ample RAM to deploy the log data stream secondary processing system;
Step 2: access text log data from different sources;
Step 3: tag the text log data streams from different sources with the two-dimensional vector <ip, path>;
Step 4: parse the plugin_playbook.yml file and define the log data stream processing chain to determine the order in which logs are processed by the different plugins;
Step 5: the plugin management module scans and loads all plug-ins in the plugins directory;
Step 6: from the parsed plugin_playbook.yml file, construct the complete data stream analysis chain; following the chain identification, the original log files flow step by step through the different plug-ins for processing;
Step 7: store the logs after secondary processing in the back-end data storage module.
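The following driver illustrates steps 2 to 7 end to end; every name in it (process(), save(), the playbook's "chain" key) is an assumption of the sketch, not an interface disclosed by the patent:

```python
# Illustrative end-to-end driver for steps 2-7. Every name below
# (process(), save(), the playbook's "chain" key) is an assumption
# of this sketch, not an interface disclosed by the patent.
import yaml  # PyYAML, assumed available for parsing the playbook

def run_pipeline(playbook_path, loaded_plugins, records, store):
    """Stream tagged json log records through the plug-in chain
    defined in plugin_playbook.yml, then store the results."""
    with open(playbook_path) as f:
        playbook = yaml.safe_load(f)       # steps 4/6: build the chain
    chain = [loaded_plugins[step["plugin"]] for step in playbook["chain"]]
    for record in records:                 # step 3: tagged records arrive
        for plugin in chain:               # step 6: flow plug-in by plug-in
            record = plugin.process(record)   # json in, json out
            if record is None:             # tag check rejected the record
                break
        if record is not None:
            store.save(record)             # step 7: back-end storage
```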
The log source is identified by the two-dimensional vector <ip, path> serving as its characteristic tag, where ip in the vector is a default network-connection parameter and path is a general log parameter. The uniqueness of a log source is thus established with simple parameters, without introducing any additional parameter to represent the source. Log sources with different unique identifiers can be recognized through centralized, unified configuration at the receiving end, which simplifies the configuration steps at the log collection end and reduces the extra configuration that operation and maintenance personnel would otherwise apply manually and invasively to the collection end. Compared with the traditional mode, this is a clear technical advantage: when configuring identification, the traditional log collection mode usually has to configure extra parameters such as id and tag at the collection end to identify the log source, without which the source cannot be uniquely identified.
A conventional log processing platform couples the code of its log acquisition and processing modules very tightly and can be regarded as a customized log processing platform, which is extremely unfriendly to universality, maintainability and extensibility. In the invention, log receiving and secondary processing are decoupled into two functional modules managed separately, providing architectural support for product upgrade iteration. Secondary log processing loads and unloads specific-function plugins in plug-in fashion, which brings strong flexibility and extensibility to the module. Simply by scanning the engineering directory plugins, the system recognizes the introduction of new plugins and the removal of old ones, realizing dynamic plugin loading.
The invention uses a core plugin lib to define the plugin specification in a unified way and provides plugin APIs in multiple language versions, making it convenient for developers to implement a plugin API for a specific language while following the specification, so plug-in development is cross-language. Each plug-in is a completely independent functional module; taking a GeoIP plugin that queries geographic information for an ip as an example, a developer can implement versions in different languages such as Java and Python and uniformly package the related dependencies under the GeoIP directory.
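A Python version of such a GeoIP plug-in might look like the sketch below; the process() convention and the stand-in lookup table are assumptions, and a real plug-in would query the GeoIP database packaged under the GeoIP directory:

```python
# Sketch of a GeoIP plug-in: json record in, json record out, with a
# completely independent function. The process() convention and the
# stand-in lookup table are assumptions of this sketch.
GEO_DB = {"10.0.0.5": {"country": "CN", "city": "Dongguan"}}  # stand-in data

class GeoIPPlugin:
    name = "GeoIP"

    def process(self, record: dict) -> dict:
        """Enrich the record with geographic info for its source ip;
        a real plug-in would query the bundled GeoIP database
        instead of this in-memory table."""
        ip = record.get("tag", {}).get("ip")
        record["geoip"] = GEO_DB.get(ip, {})
        return record
```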
As shown in fig. 3, the invention provides a flexible plugin logic architecture divided into 4 parts: first, a tag check that judges whether the log data source meets the requirements; second, the preceding plug-ins, indicating which plug-ins processed the data earlier (note: effective only in debug mode); third, the core processing logic inside the plugin, which realizes the core log-analysis service; and fourth, the follow-on plug-ins, indicating that the plugin's output data is processed by subsequent plug-ins in the next stage (note: effective only in debug mode). The input and output of every plug-in are json data, which guarantees data transfer between plug-ins.
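The four parts might be expressed as in the following sketch (all names are illustrative assumptions; parts 2 and 4 are carried as declared chain links, consistent with their debug-only role):

```python
# Sketch of the four-part plug-in logic of fig. 3; all names are
# illustrative assumptions, not the patent's published interface.
class Plugin:
    def __init__(self, name, tag_filter, core, pre=None, post=None):
        self.name = name
        self.tag_filter = tag_filter  # part 1: which log sources to accept
        self.pre = pre or []          # part 2: preceding plug-ins (debug only)
        self.core = core              # part 3: core log-analysis logic
        self.post = post or []        # part 4: follow-on plug-ins (debug only)

    def process(self, record: dict):
        # Part 1: judge whether the log data source meets the requirements.
        if record.get("tag") != self.tag_filter:
            return None               # record is not for this plug-in
        # Part 3: the core service; json in, json out guarantees transfer.
        return self.core(record)
```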
The log data stream processing chain provides a flexible way to define the log processing data chain. The main points are as follows. First, the plug-in data chain script plugin_playbook.yml defines the flow of log data through the different plug-ins; for each plug-in, the plugin_playbook.yml file specifies the log source it processes and which plug-in receives the log for subsequent processing, through a from_tag (tag), a pre_plugin name (unique), and input and output fields in array form. Second, the front and rear plug-ins of a single plugin can be defined dynamically (this function is effective only in system debug mode), mainly to make plug-in development and debugging convenient for developers, realizing flexible dynamic definition of the data stream processing chain for a single plug-in.
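A plugin_playbook.yml of this kind might look like the sketch below; the field names follow the description above (from_tag, pre_plugin, input, output), while the concrete plug-in names and values are invented for illustration:

```python
# Sketch of a plugin_playbook.yml chain definition and its parsing.
# Field names follow the description (from_tag, pre_plugin, input,
# output); the plug-in names and values are invented for illustration.
import yaml  # PyYAML, assumed available

PLAYBOOK = """
chain:
  - plugin: grok_parse              # unique plug-in name
    from_tag: {ip: 10.0.0.5, path: /var/log/app/api.log}
    input: [message]                # input fields, array form
    output: [level, module, msg]    # output fields, array form
  - plugin: GeoIP
    pre_plugin: grok_parse          # receives grok_parse's output
    input: [msg]
    output: [geoip]
"""

playbook = yaml.safe_load(PLAYBOOK)
for step in playbook["chain"]:      # reconstruct the processing order
    print(step["plugin"], "<-", step.get("pre_plugin", "log source"))
```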

Claims (8)

1. A multisource log data processing system under a cloud environment is characterized in that: the system comprises a log source input module, a data preprocessing module, a data processing module and a data storage module; the log source input module provides a multi-source log access function; the data preprocessing module provides a data classification function; the data processing module provides a flexible management function of the log processing plug-in; the data storage module provides a back-end data storage management function.
2. The system of claim 1, wherein: the log source access module uniformly connects original, heterogeneous text-format logs from different sources to the log data stream processing platform.
3. The system of claim 1, wherein: the data preprocessing module uniformly and centrally manages logs from different sources at the log receiving end and identifies each by a characteristic tag; the tag is represented by a two-dimensional vector <ip, path>; ip is a default network-connection parameter, and path is a general log parameter.
4. The system of claim 2, wherein: the data preprocessing module uniformly and centrally manages logs from different sources at the log receiving end and identifies each by a characteristic tag; the tag is represented by a two-dimensional vector <ip, path>; ip is a default network-connection parameter, and path is a general log parameter.
5. The system according to any one of claims 1 to 4, wherein: the data processing module performs secondary processing on the log data, realized through managed log processing plug-ins; it comprises a multi-language plug-in module, specific-function plug-in modules and a plug-in management module;
the multi-language plug-in module provides a universal core plug-in library and plug-in APIs (application programming interfaces) for multiple language versions such as Java, Python, Ruby and Go, realizing a cross-language universal plug-in platform;
the plug-in management module copies plug-in code into the engineering directory plugins; by periodically scanning the plugins directory, the system realizes dynamic plug-in management, including plug-in loading, unloading, exception management and behavior management; plug-in loading and unloading load or unload the plug-in modules with specific functions; plug-in behavior management provides a plugin_playbook.yml file that uniformly plans the whole data flow.
6. The system of claim 5, wherein: a plug-in is a log analysis module with json data as both input and output and a completely independent function; a specific combination of plug-ins is realized by parsing the plugin_playbook.yml file.
7. The system of claim 5, wherein: a plug-in is divided into four parts: first, a tag check that judges whether the log data source meets the requirements; second, the preceding plug-ins, indicating which plug-ins processed the data earlier; third, the core processing logic inside the plug-in, which realizes the core log-analysis service; and fourth, the follow-on plug-ins, indicating that the plug-in's output data is processed by subsequent plug-ins in the next stage.
8. A multi-source log data processing method in a cloud environment, characterized by comprising the following steps:
Step 1: the cloud platform builds a log data stream secondary processing system;
Step 2: access text log data from different sources;
Step 3: tag the text log data streams from different sources with the two-dimensional vector <ip, path>;
Step 4: parse the plugin_playbook.yml file, define the log data stream processing chain, and determine the order in which logs are processed by the different plug-ins;
Step 5: the plug-in management module scans and loads all plug-ins in the plugins directory;
Step 6: from the parsed plugin_playbook.yml file, construct the complete data stream analysis chain; following the chain identification, the original log files flow step by step through the different plug-ins for processing;
Step 7: store the processed logs in the back-end data storage module.
CN201910880320.3A 2019-09-18 2019-09-18 Multi-source log data processing system and method in cloud environment Pending CN110727568A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201910880320.3A CN110727568A (en) 2019-09-18 2019-09-18 Multi-source log data processing system and method in cloud environment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201910880320.3A CN110727568A (en) 2019-09-18 2019-09-18 Multi-source log data processing system and method in cloud environment

Publications (1)

Publication Number Publication Date
CN110727568A true CN110727568A (en) 2020-01-24

Family

ID=69219190

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201910880320.3A Pending CN110727568A (en) 2019-09-18 2019-09-18 Multi-source log data processing system and method in cloud environment

Country Status (1)

Country Link
CN (1) CN110727568A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112463772A (en) * 2021-02-02 2021-03-09 北京信安世纪科技股份有限公司 Log processing method and device, log server and storage medium
CN113064869A (en) * 2021-03-23 2021-07-02 网易(杭州)网络有限公司 Log processing method and device, sending end, receiving end equipment and storage medium


Similar Documents

Publication Publication Date Title
CN110297689B (en) Intelligent contract execution method, device, equipment and medium
CN108845940B (en) Enterprise-level information system automatic function testing method and system
US8832714B1 (en) Automated service interface optimization
US11366713B2 (en) System and method for automatically identifying and resolving computing errors
Candido et al. Test suite parallelization in open-source projects: A study on its usage and impact
CN111144839A (en) Project construction method, continuous integration system and terminal equipment
US11934287B2 (en) Method, electronic device and computer program product for processing data
CN111796855B (en) Incremental version updating method and device, storage medium and computer equipment
Zaccarelli et al. Stream2segment: An open‐source tool for downloading, processing, and visualizing massive event‐based seismic waveform datasets
CN113076253A (en) Test method and test device
KR20100002259A (en) A method and system for populating a software catalogue with related product information
CN110727568A (en) Multi-source log data processing system and method in cloud environment
CN110764760B (en) Method, apparatus, computer system, and medium for drawing program flow chart
CN113297081B (en) Execution method and device of continuous integrated pipeline
CN113419740A (en) Program data stream analysis method and device, electronic device and readable storage medium
US20110246967A1 (en) Methods and systems for automation framework extensibility
CN115291928A (en) Task automatic integration method and device of multiple technology stacks and electronic equipment
CN113821486B (en) Method and device for determining dependency relationship between pod libraries and electronic equipment
CN112835606B (en) Gray release method and device, electronic equipment and medium
CN110674024A (en) Electronic equipment integration test system and method thereof
US10958514B2 (en) Generating application-server provisioning configurations
US9720660B2 (en) Binary interface instrumentation
Behnamghader et al. A scalable and efficient approach for compiling and analyzing commit history
KR102614060B1 (en) Automatic analysis method for converting general applications into software-as-a-service applications
CN116452208B (en) Method, device, equipment and medium for determining change transaction code

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination