CN115801562B - Efficient and scalable CDN log processing method and system - Google Patents


Info

Publication number
CN115801562B
CN115801562B CN202211363596.2A
Authority
CN
China
Prior art keywords
log
module
distributed
cdn
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211363596.2A
Other languages
Chinese (zh)
Other versions
CN115801562A (en)
Inventor
李文宇
刘亮为
沈志华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Upyun Technology Co ltd
Original Assignee
Hangzhou Upyun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Upyun Technology Co ltd filed Critical Hangzhou Upyun Technology Co ltd
Priority to CN202211363596.2A priority Critical patent/CN115801562B/en
Publication of CN115801562A publication Critical patent/CN115801562A/en
Application granted granted Critical
Publication of CN115801562B publication Critical patent/CN115801562B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The invention discloses an efficient and scalable CDN log processing method and system, comprising the following steps: the distributed message engine module receives a log processing task and caches it in a task processing queue to await pulling; the original log reading module pulls the log processing task from the distributed message engine module, parses and executes it, reads the log file stored in the distributed storage module as a stream, and forwards the logs it reads; the log diversion delivery module parses the log and calculates the index number of a message topic, and the calculated index number determines the corresponding message topic; the log aggregation processing module parses, extracts and aggregates the log messages, then writes them into separate log files and saves them. With the invention, the number of processing-node instances can be scaled dynamically according to different service scales, machine resources are used more efficiently, the overall availability of the system is improved, and the burden on maintenance staff is reduced.

Description

Efficient and scalable CDN log processing method and system
Technical Field
The invention relates to the field of CDN log processing, and in particular to an efficient and scalable CDN log processing method and system.
Background
As a cloud service provider, one must serve a massive number of customers, and CDN acceleration traffic generates a large volume of access logs. The CDN access log of each accelerated domain name is an important basis for customers to troubleshoot problems, analyze data, verify bills, and carry out other business tasks.
The access logs collected from the CDN edge nodes, called original logs, take the form of compressed files, each file recording all HTTP request information received by a CDN edge host over a period of time. The original logs of CDN nodes across the whole network are continuously reported to the data center and finally stored in a distributed storage cluster.
Prior technical solutions adopt big-data processing frameworks popular in the industry for unified log processing, but in practical use this approach has been found to have the following defects:
(1) To cover generic processing scenarios, resource consumption is huge, requiring more hardware resources and higher maintenance costs;
(2) The data model is too idealized, so actual problems are not solved efficiently;
(3) Development flexibility is limited by the constraints of the framework;
(4) Log hotspot analysis is time-consuming and resource-intensive.
Disclosure of Invention
The first objective of the present invention is to provide an efficient and scalable CDN log processing method and system that can dynamically scale the number of processing-node instances according to different service scales, use machine resources more efficiently, improve the overall availability of the system, and reduce the burden on maintenance staff.
The technical scheme for realizing the first object of the invention is as follows:
An efficient and scalable CDN log processing method comprises the following steps:
1) Acquiring a log file reported by an edge node, and storing the log file in a distributed storage module;
2) The storage event monitoring module detects that a new log file has been stored in the distributed storage module, includes the newly stored log file path in a newly created log processing task, and sends the log processing task to the distributed message engine module;
3) The distributed message engine module receives the log processing task and caches it in a task processing queue to await pulling;
4) The original log reading module pulls the log processing task from the distributed message engine module, parses and executes it, reads the log file stored in the distributed storage module as a stream, and sends the logs it reads to the log diversion delivery module in batches;
5) The log diversion delivery module parses the domain name field or domain name bucket field in each log, computes a hash of the field value and takes the modulus to obtain the index number of a message topic, then sends the log to the distributed message engine module, where the calculated index number determines the corresponding message topic;
In step 5), hashing the field value and taking the modulus to determine the topic index ensures that logs of the same domain name are sent to the same message topic, which facilitates the subsequent log aggregation;
It also assigns roughly the same number of domain names to each message topic, achieving load balancing among the topics.
6) The log aggregation processing module pulls the log messages of a message topic, parses each message according to a specific format, extracts the domain name field and domain name bucket field, aggregates the messages according to the configured aggregation rules and the domain name field or domain name bucket field, writes them into separate log files in a standard format and saves them, and uploads the saved log files to the distributed storage module; a log saved at this point is called a standard log and is ultimately provided to customers;
In step 6), multiple log aggregation processing modules can be assigned to the same message topic; since the log aggregation processing module is a stateless service, the number of its instances can be scaled flexibly according to the log volume of different domain names.
7) The log hotspot analysis module scans the standard logs stored in the distributed storage module and generates an analysis report.
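The hash-and-modulo routing of step 5) can be sketched in a few lines. This is an illustrative Python sketch: the patent does not fix a hash function or a topic count, so MD5 and a topic count of 8 are assumptions here, not part of the claimed method.

```python
import hashlib

NUM_TOPICS = 8  # illustrative topic count; the patent does not fix a number

def topic_index(domain: str, num_topics: int = NUM_TOPICS) -> int:
    """Hash the domain (or domain-bucket) field value and take the modulus
    to obtain the message-topic index, as described in step 5).

    The hash function is unspecified in the patent; MD5 is used here
    purely for illustration.
    """
    digest = hashlib.md5(domain.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer, then reduce modulo the
    # topic count: same domain -> same topic, domains spread evenly.
    return int.from_bytes(digest[:8], "big") % num_topics
```

Because the mapping is deterministic, all logs of one domain name land in one topic, while distinct domain names distribute roughly uniformly across topics.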
In step 1), acquiring the log file reported by an edge node specifically includes:
1.1) The edge node log reporting module uploads the log file on the edge node to the log storage proxy service module via an HTTP request;
The HTTP request comprises an HTTP request body and HTTP request parameters;
1.2) The log storage proxy service module obtains the log meta information from the parameters in the HTTP request, splices a target path from the log meta information, constructs an RPC request with the target path and the HTTP request body, and sends the RPC request to the distributed storage module;
The log meta information includes the host IP of the edge node and the type of the log file.
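The path splicing in step 1.2) can be sketched as follows. The date/type/IP layout and the `/raw-logs` prefix are assumed conventions for illustration only; the patent states merely that the target path is derived from the log meta information.

```python
from datetime import datetime

def target_path(host_ip: str, log_type: str, ts: datetime) -> str:
    """Splice a distributed-storage target path from the log meta
    information (edge host IP and log file type).

    The directory layout here is hypothetical; only the inputs are
    named in the patent.
    """
    return f"/raw-logs/{ts:%Y%m%d}/{log_type}/{host_ip}_{ts:%H%M}.gz"
```

For example, `target_path("10.0.0.7", "access", datetime(2022, 11, 2, 10, 5))` yields a date-partitioned path that keeps logs of one node and one type together.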
In step 7), scanning and generating the analysis report specifically includes:
7.1) The log hotspot analysis module selects a domain name directory and reads the standard log files one by one;
7.2) The log hotspot analysis module parses each index from the logs and uses the Filtered Space-Saving (FSS) streaming algorithm to count the request counts and request sizes of the different hot indexes. After all standard log files have been read, the top-K hotspot statistics can be exported and formatted as a hotspot analysis report.
In step 7.2), a TopK operator is created for each index using the FSS streaming algorithm, and a 64-bit hash digest of the corresponding index value parsed from each log is computed with the SipHash algorithm and used as the storage key. The advantages are:
The traditional way to count hot indexes is a min-heap algorithm, but under massive log request volumes it consumes a large amount of memory, and its efficiency degrades as the data volume grows. The FSS streaming algorithm keeps memory consumption within a stable range at the cost of only a small loss of accuracy: the generated hotspot analysis report still reaches 99% accuracy, which meets the requirement, so the demand for computing resources stays under control and hot-resource statistics become more efficient. In addition, preprocessing the storage key with the SipHash algorithm further improves computing performance.
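A bounded-memory TopK counter in this family can be sketched as below. Note the hedges: this is a plain Space-Saving sketch, not the patented Filtered Space-Saving variant (FSS adds a filter stage in front of the counters), and `blake2b` stands in for SipHash since both yield a short fixed-size digest usable as the storage key.

```python
import hashlib

class SpaceSavingTopK:
    """Bounded-memory heavy-hitter counter in the Space-Saving family.

    A simplified stand-in for the FSS TopK operator described above;
    blake2b replaces SipHash here purely for illustration.
    """

    def __init__(self, k: int, slack: float = 2.0):
        # Allocate more than K slots (the patent suggests 1.5K to 2K)
        # to trade a little memory for better accuracy.
        self.capacity = int(k * slack)
        self.k = k
        self.counts: dict[bytes, tuple[str, int]] = {}

    @staticmethod
    def _key(value: str) -> bytes:
        # 64-bit digest used as the storage key (SipHash in the patent).
        return hashlib.blake2b(value.encode(), digest_size=8).digest()

    def add(self, value: str, weight: int = 1) -> None:
        key = self._key(value)
        if key in self.counts:
            label, c = self.counts[key]
            self.counts[key] = (label, c + weight)
        elif len(self.counts) < self.capacity:
            self.counts[key] = (value, weight)
        else:
            # Evict the minimum counter and inherit its count; this
            # over-estimates new arrivals, the classic Space-Saving trade.
            min_key = min(self.counts, key=lambda k2: self.counts[k2][1])
            _, min_c = self.counts.pop(min_key)
            self.counts[key] = (value, min_c + weight)

    def topk(self) -> list:
        """Return the top-K (value, count) pairs, hottest first."""
        return sorted(self.counts.values(), key=lambda x: -x[1])[: self.k]
```

Memory stays fixed at `capacity` slots regardless of how many distinct URLs or UserAgents stream through, which is the property the patent relies on to replace the min-heap approach.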
An efficient and scalable CDN log processing system comprises an edge node log reporting program, a log storage proxy service, a distributed storage cluster, a storage event monitoring module, a distributed message engine cluster, an original log reading module, a log diversion delivery module, a log aggregation processing module and a log hotspot analysis module.
Compared with the prior art, the invention has the following advantages:
1. the invention provides an efficient and scalable CDN log processing system that meets specific CDN log processing requirements without consuming excessive machine resources;
2. each processing module of the invention is multi-instance and easy to scale, with no hidden single point of failure, which greatly improves the overall availability of the system;
3. the combination of multiple modules improves flexibility, making it easier to iterate on the hotspot analysis function;
4. the invention improves hotspot analysis efficiency and saves computing resources by adopting the FSS streaming algorithm.
Drawings
FIG. 1 is a data flow diagram of an efficient and scalable CDN log processing system of the present invention.
FIG. 2 is a flow chart of the process of the log proxy service of the present invention.
FIG. 3 is a flow chart illustrating a method for efficient hotspot analysis according to the present invention.
Detailed Description
The present invention specifically provides an efficient and scalable CDN log processing system, which comprises the following modules:
The edge node log reporting module uploads the log file on the edge node to the log storage proxy service module via an HTTP request; the HTTP request comprises an HTTP request body and HTTP request parameters;
The log storage proxy service module acquires log meta information according to parameters in the HTTP request, splices a target path according to the log meta information, constructs an RPC request by using the target path and the HTTP request body, and sends the RPC request to the distributed storage module; the log meta information comprises the host IP of the edge node and the type of the log file;
the distributed storage module receives the RPC request sent by the log storage proxy service module and stores the log file to a target position;
The storage event monitoring module detects that the distributed storage module has stored a new log file, includes the newly stored log file path in a newly created log processing task, and sends the log processing task to the distributed message engine module;
The distributed message engine module receives the log processing tasks;
The original log reading module pulls the log processing task from the distributed message engine module, parses and executes it, reads the log file stored in the distributed storage module as a stream, and sends the logs it reads to the log diversion delivery module in batches;
The log diversion delivery module parses the domain name field or domain name bucket field in each log, computes a hash of the field value and takes the modulus to obtain the index number of a message topic, then sends the log to the distributed message engine module, where the calculated index number determines the corresponding message topic;
The log aggregation processing module pulls the log messages of a message topic, aggregates them according to the domain name field or domain name bucket field, writes them into separate log files in a standard format and saves them, and uploads the saved log files to the distributed storage module; the saved log is called a standard log;
And the log hot spot analysis module scans the standard log stored in the distributed storage module and generates an analysis report.
The system comprises the following processing flows:
(1) The edge node log reporting program runs on the CDN edge hosts; every 5 minutes it cuts, saves, compresses and uploads the CDN access log to the log storage proxy service of the data center.
(2) The log storage proxy service receives the HTTP request for uploading the log and transfers the received request body content to the distributed storage cluster.
(3) The storage event monitoring module monitors that a new log file is stored in the distributed storage cluster, includes the newly stored log file path in a newly created log processing task, and sends the log processing task to the distributed message engine cluster;
(4) The original log reading module, as a consumer of the distributed message engine, pulls a log processing task;
(5) The original log reading module parses and executes the log processing task and reads the log file stored in the distributed storage cluster as a stream;
(6) The original log reading module sends the logs it reads to the log diversion delivery module in batches;
(7) The log diversion delivery module parses the domain name field or domain name bucket field in each log, computes a hash of the field value and takes the modulus to obtain the index number of the message topic;
(8) The log diversion delivery module sends the log to the distributed message engine cluster, and the corresponding message topic is determined by the index number calculated in step (7);
(9) The log aggregation processing module, as a consumer of the distributed message engine, pulls the log messages of a specific message topic;
(10) The log aggregation processing module aggregates the logs according to the domain name field or domain name bucket field and saves them into separate files;
(11) The log aggregation processing module records the last write time of each file; if a file has had no new writes for more than 1 minute, writing to it is considered finished, and the saved log file is then uploaded to the distributed storage cluster; the log saved at this point is called a standard log;
(12) The log hotspot analysis module scans the standard logs stored in the distributed storage cluster every 24 hours and generates an analysis report.
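The idle-file completion check of step (11) can be sketched as a pure function over recorded write timestamps. This is an illustrative Python sketch; the `finished_files` helper and its dictionary input are hypothetical names, and only the 1-minute threshold comes from the flow above.

```python
IDLE_SECONDS = 60  # "no new writes for more than 1 minute", from step (11)

def finished_files(last_write: dict, now: float) -> list:
    """Return the files whose last recorded write is older than the idle
    threshold; these are considered complete and ready to upload.

    last_write maps an open log-file path to its last write timestamp
    (seconds); results are sorted for determinism.
    """
    return [path for path, ts in sorted(last_write.items())
            if now - ts > IDLE_SECONDS]
```

An aggregation worker would call this periodically, upload the returned files to the distributed storage cluster, and drop them from `last_write`.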
In step (2), the log storage proxy service is a multi-instance, scalable HTTP server program, and transferring the log file includes the following steps:
2.1) Receive the HTTP PUT request sent by the log reporting program;
2.2) Obtain the log metadata from the HTTP request parameters, including the node IP the log belongs to, the log generation time and the log file format;
2.3) Read the HTTP request body into memory;
2.4) Request the distributed storage cluster, create the target file to be written, and keep the TCP connection open;
2.5) Write the data in memory into the target file in the distributed storage cluster;
2.6) If the file write completes, close the TCP connection and go to step 2.7); if the file write fails, close the TCP connection and go to step 2.8);
2.7) Respond to the log reporting program with a 200 status code, indicating that the log was saved successfully, and go to step 2.9);
2.8) Respond to the log reporting program with a 500 status code and return the failure reason;
2.9) End.
In step 2.4), the target file path to be written is spliced from the log metadata obtained in step 2.2).
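The proxy's PUT flow above can be condensed into one function. Everything here is a sketch under stated assumptions: `storage` is a hypothetical client for the distributed storage cluster exposing `write(path, data)` (the real RPC interface is not described in the patent), and the path layout is an assumed convention.

```python
def handle_put(params: dict, body: bytes, storage) -> tuple:
    """Condensed sketch of the proxy's PUT flow (steps 2.1-2.9)."""
    # 2.2) splice the target path from the request-parameter metadata
    path = "/raw-logs/{date}/{fmt}/{ip}.gz".format(**params)
    try:
        # 2.4-2.5) create the target file on the storage cluster and write
        storage.write(path, body)
    except OSError as exc:
        return 500, str(exc)   # 2.8) report the failure reason
    return 200, "saved"        # 2.7) log saved successfully


class MemStorage:
    """In-memory stand-in for the distributed storage cluster."""
    def __init__(self):
        self.files = {}

    def write(self, path: str, data: bytes) -> None:
        self.files[path] = data
```

With `MemStorage` in place of the real cluster, a successful upload returns the 200 path of step 2.7) and a write failure maps to the 500 path of step 2.8).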
A second objective of the present invention is to provide an efficient hotspot analysis method that trades a tolerable loss of accuracy for a considerable gain in hotspot analysis performance.
The technical scheme for realizing the second objective of the invention is as follows: an efficient hotspot analysis method comprising the following steps:
(a) The hotspot analysis module selects a domain name directory and reads the standard log files one by one;
(b) The hotspot analysis module parses information such as the request size, URL, request status code and UserAgent from each log;
(c) The hotspot analysis module uses the Filtered Space-Saving (FSS) streaming algorithm to count the request counts and request sizes of different hot indexes;
(d) The top-K hotspot values are counted with the FSS algorithm; to balance accuracy loss against memory consumption, the analysis program allocates more than K node slots, typically 1.5K to 2K;
(e) After all standard log files have been read, the top-K hotspot statistics can be exported and formatted as a hotspot analysis report.
As shown in fig. 1, the embodiment includes an edge node log reporting program, a log storage proxy service, a distributed storage cluster, a storage event monitoring module, a distributed message engine cluster, an original log reading module, a log distribution delivery module, a log aggregation processing module, a log hot spot analysis module, and a user.
The log storage proxy service is a high-performance HTTP gateway service; the distributed storage cluster is a commercial object storage service cluster or a large-scale HDFS cluster; the distributed message engine cluster typically employs a Kafka cluster.
The embodiment comprises the following steps:
(1) The edge node log reporting program collects CDN access logs every 5 minutes and uploads the CDN access logs to log storage proxy service of a data center;
(2) The log storage proxy service transfers the uploaded log file to a distributed storage cluster;
(3) The storage event monitoring module sends a new log processing task to the distributed message engine cluster;
(4) The original log reading module receives and executes log processing tasks, reads log files stored in the distributed storage cluster row by row, and sends the log files to the log diversion delivery module in batches;
(5) The log diversion delivery module calculates an index number of a message theme through a specific field in the log, and sends the log to the distributed message engine cluster;
(6) The log aggregation processing module pulls log information of a specific information subject, aggregates the logs according to domain name fields or domain name bucket fields, respectively stores the log information in different files, and finally uploads the log information to the distributed storage cluster;
(7) And the log hot spot analysis module scans the standard logs stored in the distributed storage cluster every 24 hours and generates an analysis report.
Here, the logs uploaded in step (1) are called original logs; the logs uploaded in step (6) are called standard logs, and only standard logs are provided for users to download; in step (7), one report is generated for each of the user's domain names every 24 hours and can be viewed by the user.
As shown in fig. 2, in step (2), the log storage proxy service saves the uploaded log file to the distributed storage cluster, including the steps of:
(2.1) receiving an HTTP PUT request sent by a log reporting program;
(2.2) obtaining log metadata information from the HTTP request parameters;
(2.3) reading the HTTP request body to the memory;
(2.4) requesting the distributed storage cluster to create a target file to be written;
(2.5) writing the data in the memory into a target file in the distributed storage cluster;
(2.6) if the file write completes, respond to the log reporting program with a 200 status code; if the file write fails, respond to the log reporting program with a 500 status code.
As shown in fig. 3, the workflow of the log hotspot analysis module in step (7) includes the following steps:
(7.1) selecting a domain name directory for traversing;
(7.2) reading a standard log file row by row;
(7.3) parsing information such as the request size, URL, request status code and UserAgent from the log;
(7.4) using the Filtered Space-Saving (FSS) streaming algorithm: first check whether it has been initialized, and if not, allocate 1.5K to 2K node slots to satisfy the TopK statistics requirement;
(7.5) feeding the index data to be counted into the FSS algorithm and counting the request counts and request sizes of different hot indexes;
(7.6) if there are more files to read, returning to step (7.2);
(7.7) after all standard log files have been read, exporting the top-K hotspot statistics and formatting them as a hotspot analysis report.
As shown in Table 4, adopting this efficient and scalable CDN log processing system achieves cost reduction and efficiency gains, and every module in the whole log processing chain can flexibly scale its number of instances according to the actual service scale.
TABLE 4 Table 4

Claims (7)

1. An efficient and scalable CDN log processing method, characterized by comprising the following steps:
1) Acquiring a log file reported by an edge node, and storing the log file in a distributed storage module;
2) The storage event monitoring module detects that a new log file has been stored in the distributed storage module, includes the newly stored log file path in a newly created log processing task, and sends the log processing task to the distributed message engine module;
3) The distributed message engine module receives the log processing task and caches it in a task processing queue to await pulling;
4) The original log reading module pulls the log processing task from the distributed message engine module, parses and executes it, reads the log file stored in the distributed storage module as a stream, and sends the logs it reads to the log diversion delivery module in batches;
5) The log diversion delivery module parses the log and calculates the index number of a message topic, then sends the log to the distributed message engine module, where the calculated index number determines the corresponding message topic;
6) The log aggregation processing module pulls the log messages of the message topic, parses, extracts and aggregates them, writes them into separate log files, saves them, and uploads the saved log files to the distributed storage module; the logs in the saved log files are called standard logs and are ultimately provided to customers;
7) The log hot spot analysis module scans the standard log stored in the distributed storage module and generates an analysis report;
scanning and generating the analysis report specifically includes:
7.1) The log hotspot analysis module selects a domain name directory and reads the standard log files one by one;
7.2) The log hotspot analysis module parses each index from the logs, uses the Filtered Space-Saving streaming algorithm to count the request counts and request sizes of different hot indexes, and after all standard log files have been read, exports the top-K hotspot statistics and formats them as a hotspot analysis report;
A TopK operator is created for each index using the Filtered Space-Saving streaming algorithm, and a 64-bit hash digest of the corresponding index value parsed from each log is computed with the SipHash algorithm and used as the storage key.
2. The efficient and scalable CDN log processing method of claim 1, wherein in step 1), obtaining a log file reported by an edge node specifically includes:
1.1 The edge node log reporting module uploads a log file on the edge node to the log storage proxy service module through an HTTP request;
1.2 The log storage proxy service module obtains log meta information according to parameters in the HTTP request, splices a target path according to the log meta information, constructs an RPC request by using the target path and the HTTP request body, and sends the RPC request to the distributed storage module.
3. The efficient scalable CDN log processing method of claim 2, wherein in step 1.1), the HTTP request includes an HTTP request body and HTTP request parameters.
4. The efficient scalable CDN log processing method of claim 2, wherein in step 1.2), the log meta information includes a host IP of the edge node and a type of the log file.
5. The efficient and scalable CDN log processing method of claim 1, wherein in step 5), parsing and calculating the index number of the message topic comprises:
computing a hash of the field value of the domain name field or domain name bucket field parsed from the log and taking the modulus to obtain the index number of the message topic.
6. The efficient and scalable CDN log processing method of claim 1, wherein in step 6), parsing, extracting and aggregating the log messages and then writing them into different log files and saving them specifically comprises:
parsing the log messages according to the format, extracting the domain name field and the domain name bucket field, aggregating the log messages according to the configured aggregation rules and the domain name field or domain name bucket field, and writing them into separate log files in the standard format and saving them.
7. A system for implementing the efficient scalable CDN log processing method of any one of claims 1 to 6, which is characterized by comprising an edge node log reporting program, a log storage proxy service, a distributed storage cluster, a storage event monitoring module, a distributed message engine cluster, an original log reading module, a log splitting delivery module, a log aggregation processing module, and a log hotspot analysis module.
CN202211363596.2A 2022-11-02 2022-11-02 Efficient and scalable CDN log processing method and system Active CN115801562B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211363596.2A CN115801562B (en) 2022-11-02 2022-11-02 Efficient and scalable CDN log processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211363596.2A CN115801562B (en) 2022-11-02 2022-11-02 Efficient and scalable CDN log processing method and system

Publications (2)

Publication Number Publication Date
CN115801562A (en) 2023-03-14
CN115801562B (en) 2024-08-16

Family

ID=85435044

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211363596.2A Active CN115801562B (en) 2022-11-02 2022-11-02 Efficient and scalable CDN log processing method and system

Country Status (1)

Country Link
CN (1) CN115801562B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241078A (en) * 2020-01-07 2020-06-05 网易(杭州)网络有限公司 Data analysis system, data analysis method and device
CN114221988A (en) * 2021-11-03 2022-03-22 新浪网技术(中国)有限公司 Content distribution network hotspot analysis method and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9729671B2 (en) * 2014-10-05 2017-08-08 YScope Inc. Systems and processes for computer log analysis
US11665047B2 (en) * 2020-11-18 2023-05-30 Vmware, Inc. Efficient event-type-based log/event-message processing in a distributed log-analytics system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241078A (en) * 2020-01-07 2020-06-05 网易(杭州)网络有限公司 Data analysis system, data analysis method and device
CN114221988A (en) * 2021-11-03 2022-03-22 新浪网技术(中国)有限公司 Content distribution network hotspot analysis method and system

Also Published As

Publication number Publication date
CN115801562A (en) 2023-03-14

Similar Documents

Publication Publication Date Title
US20230004434A1 (en) Automated reconfiguration of real time data stream processing
CN105824744B (en) A kind of real-time logs capturing analysis method based on B2B platform
Lee et al. Toward scalable internet traffic measurement and analysis with hadoop
US20160350385A1 (en) System and method for transparent context aware filtering of data requests
US9130971B2 (en) Site-based search affinity
US9697316B1 (en) System and method for efficient data aggregation with sparse exponential histogram
WO2017198227A1 (en) Interactive internet protocol television system and real-time acquisition method for user data
CN109710731A (en) A kind of multidirectional processing system of data flow based on Flink
US8179799B2 (en) Method for partitioning network flows based on their time information
CN113010565B (en) Server real-time data processing method and system based on server cluster
CN113312376B (en) Method and terminal for real-time processing and analysis of Nginx logs
US11188443B2 (en) Method, apparatus and system for processing log data
CN113360554A (en) Method and equipment for extracting, converting and loading ETL (extract transform load) data
CN110609782B (en) Micro-service optimization system and method based on big data
CN114090529A (en) Log management method, device, system and storage medium
WO2022156542A1 (en) Data access method and system, and storage medium
CN114971714A (en) Accurate customer operation method based on big data label and computer equipment
CN114390033A (en) Loop state patrol instrument acquisition system and method based on extensible communication protocol
CN115801562B (en) Efficient and scalable CDN log processing method and system
CN110309206B (en) Order information acquisition method and system
CN107480189A (en) A kind of various dimensions real-time analyzer and method
CN108430067A (en) A kind of Internet service mass analysis method and system based on XDR
CN108614820A (en) The method and apparatus for realizing the parsing of streaming source data
CN108959041B (en) Method for transmitting information, server and computer readable storage medium
CN115391429A (en) Time sequence data processing method and device based on big data cloud computing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant