CN115801562B - Efficient and scalable CDN log processing method and system - Google Patents


Info

Publication number
CN115801562B
CN115801562B CN202211363596.2A
Authority
CN
China
Prior art keywords
log
module
distributed
cdn
processing
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202211363596.2A
Other languages
Chinese (zh)
Other versions
CN115801562A (en)
Inventor
李文宇
刘亮为
沈志华
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hangzhou Upyun Technology Co ltd
Original Assignee
Hangzhou Upyun Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hangzhou Upyun Technology Co ltd filed Critical Hangzhou Upyun Technology Co ltd
Priority to CN202211363596.2A priority Critical patent/CN115801562B/en
Publication of CN115801562A publication Critical patent/CN115801562A/en
Application granted granted Critical
Publication of CN115801562B publication Critical patent/CN115801562B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Landscapes

  • Debugging And Monitoring (AREA)

Abstract

The invention discloses an efficient and scalable CDN log processing method and system, comprising the following steps: the distributed message engine module receives a log processing task and caches it in a task processing queue to await pulling; the original log reading module pulls the log processing task from the distributed message engine module, parses and executes it, reads the log file stored in the distributed storage module as a stream, and forwards the logs it reads; the log diversion delivery module parses the log and calculates the index number of a message topic, and the calculated index number determines the corresponding message topic; the log aggregation processing module parses, extracts and aggregates the log messages, then writes them into separate log files and saves them. With the invention, the number of processing-node instances can be scaled dynamically according to different service scales, machine resources are used more efficiently, the overall availability of the system is improved, and the burden on maintenance staff is reduced.

Description

Efficient and scalable CDN log processing method and system
Technical Field
The invention relates to the field of CDN log processing, and in particular to an efficient and scalable CDN log processing method and system.
Background
As a cloud service provider, one must serve a massive number of customers, and CDN acceleration traffic generates a large volume of access logs. The CDN access log of each accelerated domain name is an important basis for customers to troubleshoot problems, analyze data, verify bills, and carry out other business tasks.
The access logs collected from the CDN edge nodes, called original logs, take the form of compressed files, each file recording all HTTP request information received by a CDN edge host over a period of time. The original logs of CDN nodes across the whole network are continuously reported to the data center and finally stored in a distributed storage cluster.
Prior technical solutions adopt big-data processing frameworks popular in the industry for unified log processing, but in practical use this approach has been found to have the following defects:
(1) To cover generic processing scenarios, resource consumption is huge, requiring more hardware resources and higher maintenance costs;
(2) The data model is too idealized, so actual problems are not solved efficiently;
(3) Development flexibility is limited by the constraints of the framework;
(4) Log hotspot analysis is time-consuming and resource-intensive.
Disclosure of Invention
The first objective of the present invention is to provide an efficient and scalable CDN log processing method and system that can dynamically scale the number of processing-node instances according to different service scales, use machine resources more efficiently, improve the overall availability of the system, and reduce the burden on maintenance staff.
The technical scheme for realizing the first object of the invention is as follows:
An efficient and scalable CDN log processing method comprises the following steps:
1) Acquiring a log file reported by an edge node, and storing the log file in a distributed storage module;
2) The storage event monitoring module detects that a new log file has been stored in the distributed storage module, includes the newly stored log file path in a newly created log processing task, and sends the log processing task to the distributed message engine module;
3) The distributed message engine module receives the log processing task and caches it in a task processing queue to await pulling;
4) The original log reading module pulls the log processing task from the distributed message engine module, parses and executes it, reads the log file stored in the distributed storage module as a stream, and sends the logs it reads to the log diversion delivery module in batches;
5) The log diversion delivery module parses the domain name field or domain name bucket field in each log, computes a hash of the field value and takes the modulus to obtain the index number of a message topic, then sends the log to the distributed message engine module, where the calculated index number determines the corresponding message topic;
In step 5), hashing the field value and taking the modulus to determine the topic index ensures that logs of the same domain name are sent to the same message topic, which facilitates the subsequent log aggregation;
It also assigns roughly the same number of domain names to each message topic, achieving load balancing among the topics.
6) The log aggregation processing module pulls the log messages of a message topic, parses each message according to a specific format, extracts the domain name field and domain name bucket field, aggregates the messages according to the configured aggregation rules and the domain name field or domain name bucket field, writes them into separate log files in a standard format and saves them, and uploads the saved log files to the distributed storage module; a log saved at this point is called a standard log and is ultimately provided to customers;
In step 6), multiple log aggregation processing modules can be assigned to the same message topic; since the log aggregation processing module is a stateless service, the number of its instances can be scaled flexibly according to the log volume of different domain names.
7) The log hotspot analysis module scans the standard logs stored in the distributed storage module and generates an analysis report.
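The hash-and-modulo routing of step 5) can be sketched in a few lines. This is an illustrative Python sketch: the patent does not fix a hash function or a topic count, so MD5 and a topic count of 8 are assumptions here, not part of the claimed method.

```python
import hashlib

NUM_TOPICS = 8  # illustrative topic count; the patent does not fix a number

def topic_index(domain: str, num_topics: int = NUM_TOPICS) -> int:
    """Hash the domain (or domain-bucket) field value and take the modulus
    to obtain the message-topic index, as described in step 5).

    The hash function is unspecified in the patent; MD5 is used here
    purely for illustration.
    """
    digest = hashlib.md5(domain.encode("utf-8")).digest()
    # Interpret the first 8 bytes as an integer, then reduce modulo the
    # topic count: same domain -> same topic, domains spread evenly.
    return int.from_bytes(digest[:8], "big") % num_topics
```

Because the mapping is deterministic, all logs of one domain name land in one topic, while distinct domain names distribute roughly uniformly across topics.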
In step 1), acquiring the log file reported by an edge node specifically includes:
1.1) The edge node log reporting module uploads the log file on the edge node to the log storage proxy service module via an HTTP request;
The HTTP request comprises an HTTP request body and HTTP request parameters;
1.2) The log storage proxy service module obtains the log meta information from the parameters in the HTTP request, splices a target path from the log meta information, constructs an RPC request with the target path and the HTTP request body, and sends the RPC request to the distributed storage module;
The log meta information includes the host IP of the edge node and the type of the log file.
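The path splicing in step 1.2) can be sketched as follows. The date/type/IP layout and the `/raw-logs` prefix are assumed conventions for illustration only; the patent states merely that the target path is derived from the log meta information.

```python
from datetime import datetime

def target_path(host_ip: str, log_type: str, ts: datetime) -> str:
    """Splice a distributed-storage target path from the log meta
    information (edge host IP and log file type).

    The directory layout here is hypothetical; only the inputs are
    named in the patent.
    """
    return f"/raw-logs/{ts:%Y%m%d}/{log_type}/{host_ip}_{ts:%H%M}.gz"
```

For example, `target_path("10.0.0.7", "access", datetime(2022, 11, 2, 10, 5))` yields a date-partitioned path that keeps logs of one node and one type together.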
In step 7), scanning and generating the analysis report specifically includes:
7.1) The log hotspot analysis module selects a domain name directory and reads the standard log files one by one;
7.2) The log hotspot analysis module parses each index from the logs and uses the Filtered Space-Saving (FSS) streaming algorithm to count the request counts and request sizes of the different hot indexes. After all standard log files have been read, the top-K hotspot statistics can be exported and formatted as a hotspot analysis report.
In step 7.2), a TopK operator is created for each index using the FSS streaming algorithm, and a 64-bit hash digest of the corresponding index value parsed from each log is computed with the SipHash algorithm and used as the storage key. The advantages are:
The traditional way to count hot indexes is a min-heap algorithm, but under massive log request volumes it consumes a large amount of memory, and its efficiency degrades as the data volume grows. The FSS streaming algorithm keeps memory consumption within a stable range at the cost of only a small loss of accuracy: the generated hotspot analysis report still reaches 99% accuracy, which meets the requirement, so the demand for computing resources stays under control and hot-resource statistics become more efficient. In addition, preprocessing the storage key with the SipHash algorithm further improves computing performance.
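A bounded-memory TopK counter in this family can be sketched as below. Note the hedges: this is a plain Space-Saving sketch, not the patented Filtered Space-Saving variant (FSS adds a filter stage in front of the counters), and `blake2b` stands in for SipHash since both yield a short fixed-size digest usable as the storage key.

```python
import hashlib

class SpaceSavingTopK:
    """Bounded-memory heavy-hitter counter in the Space-Saving family.

    A simplified stand-in for the FSS TopK operator described above;
    blake2b replaces SipHash here purely for illustration.
    """

    def __init__(self, k: int, slack: float = 2.0):
        # Allocate more than K slots (the patent suggests 1.5K to 2K)
        # to trade a little memory for better accuracy.
        self.capacity = int(k * slack)
        self.k = k
        self.counts: dict[bytes, tuple[str, int]] = {}

    @staticmethod
    def _key(value: str) -> bytes:
        # 64-bit digest used as the storage key (SipHash in the patent).
        return hashlib.blake2b(value.encode(), digest_size=8).digest()

    def add(self, value: str, weight: int = 1) -> None:
        key = self._key(value)
        if key in self.counts:
            label, c = self.counts[key]
            self.counts[key] = (label, c + weight)
        elif len(self.counts) < self.capacity:
            self.counts[key] = (value, weight)
        else:
            # Evict the minimum counter and inherit its count; this
            # over-estimates new arrivals, the classic Space-Saving trade.
            min_key = min(self.counts, key=lambda k2: self.counts[k2][1])
            _, min_c = self.counts.pop(min_key)
            self.counts[key] = (value, min_c + weight)

    def topk(self) -> list:
        """Return the top-K (value, count) pairs, hottest first."""
        return sorted(self.counts.values(), key=lambda x: -x[1])[: self.k]
```

Memory stays fixed at `capacity` slots regardless of how many distinct URLs or UserAgents stream through, which is the property the patent relies on to replace the min-heap approach.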
An efficient and scalable CDN log processing system comprises an edge node log reporting program, a log storage proxy service, a distributed storage cluster, a storage event monitoring module, a distributed message engine cluster, an original log reading module, a log diversion delivery module, a log aggregation processing module and a log hotspot analysis module.
Compared with the prior art, the invention has the following advantages:
1. the invention provides an efficient and scalable CDN log processing system that meets specific CDN log processing requirements without consuming excessive machine resources;
2. each processing module of the invention is multi-instance and easy to scale, with no hidden single point of failure, which greatly improves the overall availability of the system;
3. the combination of multiple modules improves flexibility, making it easier to iterate on the hotspot analysis function;
4. the invention improves hotspot analysis efficiency and saves computing resources by adopting the FSS streaming algorithm.
Drawings
FIG. 1 is a data flow diagram of an efficient and scalable CDN log processing system of the present invention.
FIG. 2 is a flow chart of the process of the log proxy service of the present invention.
FIG. 3 is a flow chart illustrating a method for efficient hotspot analysis according to the present invention.
Detailed Description
The present invention specifically provides an efficient and scalable CDN log processing system, which comprises the following modules:
The edge node log reporting module uploads the log file on the edge node to the log storage proxy service module via an HTTP request; the HTTP request comprises an HTTP request body and HTTP request parameters;
The log storage proxy service module acquires log meta information according to parameters in the HTTP request, splices a target path according to the log meta information, constructs an RPC request by using the target path and the HTTP request body, and sends the RPC request to the distributed storage module; the log meta information comprises the host IP of the edge node and the type of the log file;
the distributed storage module receives the RPC request sent by the log storage proxy service module and stores the log file to a target position;
The storage event monitoring module detects that the distributed storage module has stored a new log file, includes the newly stored log file path in a newly created log processing task, and sends the log processing task to the distributed message engine module;
The distributed message engine module receives the log processing tasks;
The original log reading module pulls the log processing task from the distributed message engine module, parses and executes it, reads the log file stored in the distributed storage module as a stream, and sends the logs it reads to the log diversion delivery module in batches;
The log diversion delivery module parses the domain name field or domain name bucket field in each log, computes a hash of the field value and takes the modulus to obtain the index number of a message topic, then sends the log to the distributed message engine module, where the calculated index number determines the corresponding message topic;
The log aggregation processing module pulls the log messages of a message topic, aggregates them according to the domain name field or domain name bucket field, writes them into separate log files in a standard format and saves them, and uploads the saved log files to the distributed storage module; the saved log is called a standard log;
And the log hot spot analysis module scans the standard log stored in the distributed storage module and generates an analysis report.
The system comprises the following processing flows:
(1) The edge node log reporting program runs on the CDN edge hosts; every 5 minutes it cuts, saves, compresses and uploads the CDN access log to the log storage proxy service of the data center.
(2) The log storage proxy service receives the HTTP request for uploading the log and transfers the received request body content to the distributed storage cluster.
(3) The storage event monitoring module monitors that a new log file is stored in the distributed storage cluster, includes the newly stored log file path in a newly created log processing task, and sends the log processing task to the distributed message engine cluster;
(4) The original log reading module, as a consumer of the distributed message engine, pulls a log processing task;
(5) The original log reading module parses and executes the log processing task and reads the log file stored in the distributed storage cluster as a stream;
(6) The original log reading module sends the logs it reads to the log diversion delivery module in batches;
(7) The log diversion delivery module parses the domain name field or domain name bucket field in each log, computes a hash of the field value and takes the modulus to obtain the index number of the message topic;
(8) The log diversion delivery module sends the log to the distributed message engine cluster, and the corresponding message topic is determined by the index number calculated in step (7);
(9) The log aggregation processing module, as a consumer of the distributed message engine, pulls the log messages of a specific message topic;
(10) The log aggregation processing module aggregates the logs according to the domain name field or domain name bucket field and saves them into separate files;
(11) The log aggregation processing module records the last write time of each file; if a file has had no new writes for more than 1 minute, writing to it is considered finished, and the saved log file is then uploaded to the distributed storage cluster; the log saved at this point is called a standard log;
(12) The log hotspot analysis module scans the standard logs stored in the distributed storage cluster every 24 hours and generates an analysis report.
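The idle-file completion check of step (11) can be sketched as a pure function over recorded write timestamps. This is an illustrative Python sketch; the `finished_files` helper and its dictionary input are hypothetical names, and only the 1-minute threshold comes from the flow above.

```python
IDLE_SECONDS = 60  # "no new writes for more than 1 minute", from step (11)

def finished_files(last_write: dict, now: float) -> list:
    """Return the files whose last recorded write is older than the idle
    threshold; these are considered complete and ready to upload.

    last_write maps an open log-file path to its last write timestamp
    (seconds); results are sorted for determinism.
    """
    return [path for path, ts in sorted(last_write.items())
            if now - ts > IDLE_SECONDS]
```

An aggregation worker would call this periodically, upload the returned files to the distributed storage cluster, and drop them from `last_write`.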
In step (2), the log storage proxy service is a multi-instance, scalable HTTP server program, and transferring the log file includes the following steps:
2.1) Receive the HTTP PUT request sent by the log reporting program;
2.2) Obtain the log metadata from the HTTP request parameters, including the node IP the log belongs to, the log generation time and the log file format;
2.3) Read the HTTP request body into memory;
2.4) Request the distributed storage cluster, create the target file to be written, and keep the TCP connection open;
2.5) Write the data in memory into the target file in the distributed storage cluster;
2.6) If the file write completes, close the TCP connection and go to step 2.7); if the file write fails, close the TCP connection and go to step 2.8);
2.7) Respond to the log reporting program with a 200 status code, indicating that the log was saved successfully, and go to step 2.9);
2.8) Respond to the log reporting program with a 500 status code and return the failure reason;
2.9) End.
In step 2.4), the target file path to be written is spliced from the log metadata obtained in step 2.2).
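The proxy's PUT flow above can be condensed into one function. Everything here is a sketch under stated assumptions: `storage` is a hypothetical client for the distributed storage cluster exposing `write(path, data)` (the real RPC interface is not described in the patent), and the path layout is an assumed convention.

```python
def handle_put(params: dict, body: bytes, storage) -> tuple:
    """Condensed sketch of the proxy's PUT flow (steps 2.1-2.9)."""
    # 2.2) splice the target path from the request-parameter metadata
    path = "/raw-logs/{date}/{fmt}/{ip}.gz".format(**params)
    try:
        # 2.4-2.5) create the target file on the storage cluster and write
        storage.write(path, body)
    except OSError as exc:
        return 500, str(exc)   # 2.8) report the failure reason
    return 200, "saved"        # 2.7) log saved successfully


class MemStorage:
    """In-memory stand-in for the distributed storage cluster."""
    def __init__(self):
        self.files = {}

    def write(self, path: str, data: bytes) -> None:
        self.files[path] = data
```

With `MemStorage` in place of the real cluster, a successful upload returns the 200 path of step 2.7) and a write failure maps to the 500 path of step 2.8).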
A second objective of the present invention is to provide an efficient hotspot analysis method that trades a tolerable loss of accuracy for a considerable gain in hotspot analysis performance.
The technical scheme for realizing the second objective of the invention is as follows: an efficient hotspot analysis method comprising the following steps:
(a) The hotspot analysis module selects a domain name directory and reads the standard log files one by one;
(b) The hotspot analysis module parses information such as the request size, URL, request status code and UserAgent from each log;
(c) The hotspot analysis module uses the Filtered Space-Saving (FSS) streaming algorithm to count the request counts and request sizes of different hot indexes;
(d) The top-K hotspot values are counted with the FSS algorithm; to balance accuracy loss against memory consumption, the analysis program allocates more than K node slots, typically 1.5K to 2K;
(e) After all standard log files have been read, the top-K hotspot statistics can be exported and formatted as a hotspot analysis report.
As shown in fig. 1, the embodiment includes an edge node log reporting program, a log storage proxy service, a distributed storage cluster, a storage event monitoring module, a distributed message engine cluster, an original log reading module, a log distribution delivery module, a log aggregation processing module, a log hot spot analysis module, and a user.
The log storage proxy service is a high-performance HTTP gateway service; the distributed storage cluster is a commercial object storage service cluster or a large-scale HDFS cluster; the distributed message engine cluster typically employs a Kafka cluster.
The embodiment comprises the following steps:
(1) The edge node log reporting program collects CDN access logs every 5 minutes and uploads the CDN access logs to log storage proxy service of a data center;
(2) The log storage proxy service transfers the uploaded log file to a distributed storage cluster;
(3) The storage event monitoring module sends a new log processing task to the distributed message engine cluster;
(4) The original log reading module receives and executes log processing tasks, reads log files stored in the distributed storage cluster row by row, and sends the log files to the log diversion delivery module in batches;
(5) The log diversion delivery module calculates an index number of a message theme through a specific field in the log, and sends the log to the distributed message engine cluster;
(6) The log aggregation processing module pulls log information of a specific information subject, aggregates the logs according to domain name fields or domain name bucket fields, respectively stores the log information in different files, and finally uploads the log information to the distributed storage cluster;
(7) And the log hot spot analysis module scans the standard logs stored in the distributed storage cluster every 24 hours and generates an analysis report.
Here, the logs uploaded in step (1) are called original logs; the logs uploaded in step (6) are called standard logs, and only standard logs are provided for users to download; in step (7), one report is generated for each of the user's domain names every 24 hours and can be viewed by the user.
As shown in fig. 2, in step (2), the log storage proxy service saves the uploaded log file to the distributed storage cluster, including the steps of:
(2.1) receiving an HTTP PUT request sent by a log reporting program;
(2.2) obtaining log metadata information from the HTTP request parameters;
(2.3) reading the HTTP request body to the memory;
(2.4) requesting the distributed storage cluster to create a target file to be written;
(2.5) writing the data in the memory into a target file in the distributed storage cluster;
(2.6) if the file write completes, respond to the log reporting program with a 200 status code; if the file write fails, respond to the log reporting program with a 500 status code.
As shown in fig. 3, the workflow of the log hotspot analysis module in step (7) includes the following steps:
(7.1) selecting a domain name directory for traversing;
(7.2) reading a standard log file row by row;
(7.3) parsing information such as the request size, URL, request status code and UserAgent from the log;
(7.4) using the Filtered Space-Saving (FSS) streaming algorithm: first check whether it has been initialized, and if not, allocate 1.5K to 2K node slots to satisfy the TopK statistics requirement;
(7.5) feeding the index data to be counted into the FSS algorithm and counting the request counts and request sizes of different hot indexes;
(7.6) if there are more files to read, returning to step (7.2);
(7.7) after all standard log files have been read, exporting the top-K hotspot statistics and formatting them as a hotspot analysis report.
As shown in Table 4, adopting this efficient and scalable CDN log processing system achieves cost reduction and efficiency gains, and every module in the whole log processing chain can flexibly scale its number of instances according to the actual service scale.
TABLE 4 Table 4

Claims (7)

1. An efficient and scalable CDN log processing method, characterized by comprising the following steps:
1) Acquiring a log file reported by an edge node, and storing the log file in a distributed storage module;
2) The storage event monitoring module detects that a new log file has been stored in the distributed storage module, includes the newly stored log file path in a newly created log processing task, and sends the log processing task to the distributed message engine module;
3) The distributed message engine module receives the log processing task and caches it in a task processing queue to await pulling;
4) The original log reading module pulls the log processing task from the distributed message engine module, parses and executes it, reads the log file stored in the distributed storage module as a stream, and sends the logs it reads to the log diversion delivery module in batches;
5) The log diversion delivery module parses the log and calculates the index number of a message topic, then sends the log to the distributed message engine module, where the calculated index number determines the corresponding message topic;
6) The log aggregation processing module pulls the log messages of the message topic, parses, extracts and aggregates them, writes them into separate log files, saves them, and uploads the saved log files to the distributed storage module; the logs in the saved log files are called standard logs and are ultimately provided to customers;
7) The log hot spot analysis module scans the standard log stored in the distributed storage module and generates an analysis report;
scanning and generating the analysis report specifically includes:
7.1) The log hotspot analysis module selects a domain name directory and reads the standard log files one by one;
7.2) The log hotspot analysis module parses each index from the logs, uses the Filtered Space-Saving streaming algorithm to count the request counts and request sizes of different hot indexes, and after all standard log files have been read, exports the top-K hotspot statistics and formats them as a hotspot analysis report;
A TopK operator is created for each index using the Filtered Space-Saving streaming algorithm, and a 64-bit hash digest of the corresponding index value parsed from each log is computed with the SipHash algorithm and used as the storage key.
2. The efficient and scalable CDN log processing method of claim 1, wherein in step 1), obtaining a log file reported by an edge node specifically includes:
1.1 The edge node log reporting module uploads a log file on the edge node to the log storage proxy service module through an HTTP request;
1.2 The log storage proxy service module obtains log meta information according to parameters in the HTTP request, splices a target path according to the log meta information, constructs an RPC request by using the target path and the HTTP request body, and sends the RPC request to the distributed storage module.
3. The efficient scalable CDN log processing method of claim 2, wherein in step 1.1), the HTTP request includes an HTTP request body and HTTP request parameters.
4. The efficient scalable CDN log processing method of claim 2, wherein in step 1.2), the log meta information includes a host IP of the edge node and a type of the log file.
5. The efficient and scalable CDN log processing method of claim 1, wherein in step 5), parsing and calculating the index number of the message topic comprises:
computing a hash of the field value of the domain name field or domain name bucket field parsed from the log and taking the modulus to obtain the index number of the message topic.
6. The efficient and scalable CDN log processing method of claim 1, wherein in step 6), parsing, extracting and aggregating the log messages and then writing them into different log files and saving them specifically comprises:
parsing the log messages according to the format, extracting the domain name field and the domain name bucket field, aggregating the log messages according to the configured aggregation rules and the domain name field or domain name bucket field, and writing them into separate log files in the standard format and saving them.
7. A system for implementing the efficient scalable CDN log processing method of any one of claims 1 to 6, which is characterized by comprising an edge node log reporting program, a log storage proxy service, a distributed storage cluster, a storage event monitoring module, a distributed message engine cluster, an original log reading module, a log splitting delivery module, a log aggregation processing module, and a log hotspot analysis module.
CN202211363596.2A 2022-11-02 2022-11-02 Efficient and scalable CDN log processing method and system Active CN115801562B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202211363596.2A CN115801562B (en) 2022-11-02 2022-11-02 Efficient and scalable CDN log processing method and system

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202211363596.2A CN115801562B (en) 2022-11-02 2022-11-02 Efficient and scalable CDN log processing method and system

Publications (2)

Publication Number Publication Date
CN115801562A (en) 2023-03-14
CN115801562B (en) 2024-08-16

Family

ID=85435044

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202211363596.2A Active CN115801562B (en) 2022-11-02 2022-11-02 Efficient and scalable CDN log processing method and system

Country Status (1)

Country Link
CN (1) CN115801562B (en)

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241078A (en) * 2020-01-07 2020-06-05 网易(杭州)网络有限公司 Data analysis system, data analysis method and device
CN114221988A (en) * 2021-11-03 2022-03-22 新浪网技术(中国)有限公司 Content distribution network hotspot analysis method and system

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9729671B2 (en) * 2014-10-05 2017-08-08 YScope Inc. Systems and processes for computer log analysis
US11665047B2 (en) * 2020-11-18 2023-05-30 Vmware, Inc. Efficient event-type-based log/event-message processing in a distributed log-analytics system

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111241078A (en) * 2020-01-07 2020-06-05 网易(杭州)网络有限公司 Data analysis system, data analysis method and device
CN114221988A (en) * 2021-11-03 2022-03-22 新浪网技术(中国)有限公司 Content distribution network hotspot analysis method and system

Also Published As

Publication number Publication date
CN115801562A (en) 2023-03-14

Similar Documents

Publication Publication Date Title
US20230004434A1 (en) Automated reconfiguration of real time data stream processing
CN105824744B (en) A kind of real-time logs capturing analysis method based on B2B platform
Lee et al. Toward scalable internet traffic measurement and analysis with hadoop
US20160350385A1 (en) System and method for transparent context aware filtering of data requests
US9130971B2 (en) Site-based search affinity
US9697316B1 (en) System and method for efficient data aggregation with sparse exponential histogram
WO2017198227A1 (en) Interactive internet protocol television system and real-time acquisition method for user data
CN109710731A (en) A kind of multidirectional processing system of data flow based on Flink
US8179799B2 (en) Method for partitioning network flows based on their time information
CN113010565B (en) Server real-time data processing method and system based on server cluster
CN113312376B (en) Method and terminal for real-time processing and analysis of Nginx logs
US11188443B2 (en) Method, apparatus and system for processing log data
CN113360554A (en) Method and equipment for extracting, converting and loading ETL (extract transform load) data
CN110609782B (en) Micro-service optimization system and method based on big data
CN114090529A (en) Log management method, device, system and storage medium
WO2022156542A1 (en) Data access method and system, and storage medium
CN114971714A (en) Accurate customer operation method based on big data label and computer equipment
CN114390033A (en) Loop state patrol instrument acquisition system and method based on extensible communication protocol
CN115801562B (en) Efficient and scalable CDN log processing method and system
CN110309206B (en) Order information acquisition method and system
CN107480189A (en) A kind of various dimensions real-time analyzer and method
CN108430067A (en) A kind of Internet service mass analysis method and system based on XDR
CN108614820A (en) The method and apparatus for realizing the parsing of streaming source data
CN108959041B (en) Method for transmitting information, server and computer readable storage medium
CN115391429A (en) Time sequence data processing method and device based on big data cloud computing

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant