CN114697070B

CN114697070B - Method and system for dynamically compressing and storing HTTP protocol traffic

Info

Publication number: CN114697070B
Application number: CN202111665961.0A
Authority: CN
Inventors: 章明珠; 钟立; 钟志成
Original assignee: Chengdu Siwei Century Technology Co ltd
Current assignee: Chengdu Siwei Century Technology Co ltd
Priority date: 2021-12-31
Filing date: 2021-12-31
Publication date: 2024-04-02
Anticipated expiration: 2041-12-31
Also published as: CN114697070A

Abstract

The invention discloses a method and a system for dynamically compressing and storing HTTP traffic, wherein the method comprises the following steps: reading log data of an HTTP protocol; preprocessing log data; matching the log data with service fingerprint features in a fingerprint feature library, and performing fingerprint matching on the log data to obtain service fingerprints corresponding to the log data; when the fingerprint matching is successful, matching the log data with rules in a rule model library, and if the rule matching is unsuccessful, compressing the log data by adopting a static compression method; if the rule matching is successful, respectively compressing by adopting built-in compression methods, obtaining the corresponding score and weight of each compression method according to the compression execution time and compression ratio, obtaining the comprehensive score of each compression method through weighted summation, and compressing by adopting the compression method with the highest score. The invention preferentially selects the compression method according to the dynamic content of the service corresponding to the dynamic data, thereby further improving the compression performance and saving the storage space.

Description

Method and system for dynamically compressing and storing HTTP protocol traffic

Technical Field

The invention relates to the technical field of WEB service security, in particular to a method and a system for dynamically compressing and storing HTTP traffic.

Background

When HTTP protocol traffic is analyzed and the uplink and downlink content data must be stored, a static compression method is generally adopted: all data content is compressed directly to reduce the storage space. The static compression format is usually the ZIP format, the RAR format, etc., and the corresponding compression method may be the DEFLATE method, the LZ77 method, the HUFFMAN encoding method, the RLE method, etc. The flow of the current static compression method is shown in fig. 1. As the figure shows, the static compression method preprocesses the request data, extracts data such as the request content and the response content after preprocessing, and compresses and stores the extracted data using a specific compression format and a specific compression method so as to save storage space. The unstructured data (e.g., request header, response body, etc.) corresponding to the HTTP protocol traffic log is thus compressed and stored. However, the WEB services carried by the HTTP protocol contain a large amount of unstructured content that may be irrelevant to the display function, such as frame data, code data, annotation data, and fixed labels, and the size and format of this unstructured data directly affect the compression result.

Disclosure of Invention

The invention aims to provide a method and a system for dynamically compressing and storing HTTP protocol traffic, which analyze and process characteristic parameters of WEB business, uniformly extract static data content and only compress and store the dynamic data content.

The invention adopts the following technical scheme:

the invention provides a method for dynamically compressing and storing HTTP protocol traffic, which comprises the following steps:

reading log data of an HTTP protocol;

preprocessing log data;

matching the log data with service fingerprint features in a fingerprint feature library, and performing fingerprint matching on the log data to obtain service fingerprints corresponding to the log data;

when the fingerprint matching is successful, matching the log data with rules in a rule model library, and if the rule matching is unsuccessful, compressing the log data by adopting a static compression method; if the rule matching is successful, judging whether the service fingerprint matched with the log data has a corresponding compression rule, and if so, adopting the compression rule to compress; otherwise, all the built-in compression methods are adopted to compress respectively, the corresponding score and weight of each compression method are obtained according to the compression execution time and the compression ratio, the comprehensive score of each compression method is obtained through weighted summation, and the compression method with the highest score is adopted to compress.

In some embodiments, before reading the log data of the HTTP protocol, the method further includes:

collecting data flow of a network by using a DPI technology, outputting the data flow to a storage medium, reading and analyzing the data flow in a polling mode, judging whether the data flow is log data of an HTTP protocol, and if so, reading; otherwise, the data traffic is discarded.

In some embodiments, preprocessing includes information completion and data formatting, and key fields of the completion information include request paths, service systems, request header information, request body information, and response body information.

In some embodiments, when fingerprint matching is unsuccessful, the current log data is collected and fingerprint feature learning is performed on it.

In some embodiments, the compression method with the highest score is taken as the compression rule corresponding to the service fingerprint of the current log data while the compression is performed by the compression method with the highest score.

The invention provides a system for dynamically compressing and storing HTTP protocol traffic, which comprises:

the reading module is used for reading the log data of the HTTP protocol;

the preprocessing module is used for preprocessing the log data;

the service fingerprint matching module is used for matching the log data with service fingerprint characteristics in the fingerprint characteristic library, and fingerprint matching is carried out on the log data to obtain service fingerprints corresponding to the log data;

the compression module is used for matching the log data with rules in the rule model library when the fingerprint matching is successful, and adopting a static compression method to compress the log data if the rule matching is unsuccessful; if the rule matching is successful, judging whether the service fingerprint matched with the log data has a corresponding compression rule, and if so, adopting the compression rule to compress; otherwise, all the built-in compression methods are adopted to compress respectively, the corresponding score and weight of each compression method are obtained according to the compression execution time and the compression ratio, the comprehensive score of each compression method is obtained through weighted summation, and the compression method with the highest score is adopted to compress.

The invention can automatically classify the WEB services of the HTTP protocol and automatically learn the WEB service fingerprint corresponding to each service. Sample information of each WEB service fingerprint is collected according to the characteristic parameters of the fingerprint. When the number of samples reaches the preset number, algorithm matching is performed on the sample contents of the WEB service fingerprint. A rule model is then formed from the matching result in combination with the extracted dynamic data content and stored in a model library.

Compared with the prior art, the invention has the following characteristics and beneficial effects:

1. In the prior art, the full data is usually compressed and stored, whereas the invention separates the static data from the dynamic data and retains only the effective dynamic data for compression and storage, so that accurate compression can be realized.

2. The invention carries out secondary analysis on the extracted dynamic data, and preferentially selects a more proper compression method according to the dynamic content of the corresponding service of the dynamic data, thereby further improving the compression performance and saving the storage space.

Drawings

FIG. 1 is a flow chart of a current static compression method;

FIG. 2 is a flow chart of the method of the present invention;

FIG. 3 is a schematic view of the number of layers and depth;

FIG. 4 is a schematic diagram of a matrix calculation process.

Detailed Description

The following detailed description of the embodiments of the invention refers to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art based on the described embodiments without inventive effort fall within the scope of the present invention.

The implementation of the invention needs to use the following prior art means:

(1) DPI technology (Deep Packet Inspection), which parses data traffic to obtain specific field data;

(2) Bypass mirroring, which acquires bypass-collected data traffic through mirroring;

(3) Bypass beam splitting, which acquires bypass-collected data traffic by using a beam splitter.

The invention is attached to a core network link in bypass mode, and the acquisition, reassembly, and restoration of HTTP data traffic in the network are realized by using DPI technology (deep packet inspection); network traffic is replicated for analysis by bypass beam splitting, bypass mirroring, or the like. Currently, the method is applicable to service system environments based on HTTP interaction. Bypass acquisition is the foundation: the original traffic of user requests to the target system is acquired through the bypass, so the online service is not negatively affected. The core module automatically learns from the collected original data in combination with preset feature dimensions, and establishes an original service fingerprint feature model for each service module according to the different parameters of different access source objects and the details of the access service responses. Samples are then collected for every service fingerprint. When the samples reach a preset number, the core module performs text algorithm matching on the grouped feature model samples, extracts the static content in them, and generates a frame extraction rule model.

Referring to fig. 2, a flow chart of the method of the present invention is shown, and the implementation of the present invention will be described in detail with reference to fig. 2.

1. Fingerprint feature learning

The invention uses a "service fingerprint" to interpret and determine the uniqueness of a single operational service, for example a system login fingerprint, an add-user fingerprint, a delete-role fingerprint, and so on. Every service fingerprint has a corresponding 32-bit HASH unique identifier as its fingerprint number. Only when each initiated operation request can be mapped to the unique fingerprint to which it belongs can requests with the same fingerprint be classified by fingerprint number; the corresponding calculation method is then applied to each fingerprint group to obtain the final desired calculation result.

A target system is a system discovered in the traffic. Traffic is split by domain name, and different domain names belong to different target systems. For example, traffic with the domain name www.baidu.com belongs to one target system and traffic with the domain name www.sina.com.cn belongs to another. Target systems are independent and do not interfere with each other: different target systems may have the same characteristic fingerprint, but the same characteristic fingerprint may not appear twice within one target system. For example, the www.baidu.com target system can have only one "login fingerprint", and the www.sina.com.cn target system can also have its own "login fingerprint".

(1) Characteristic parameter

After the traffic is collected and restored, data is automatically extracted according to pre-configured policy rules, the characteristic parameter fields are automatically completed after extraction, and the completed data is stored in a cache database as basic analysis data. There are currently 6 built-in types of characteristic parameters: domain name, destination port, link address, request type, request parameter name, and response type; analysis and processing are then performed according to these characteristic parameters. The specific characteristic parameters and examples are shown in table 1; all of them are static characteristics.

Table 1 characteristic parameters and examples

Sequence number	Field name	Parameter name	Parameter examples
1	domain	Domain name	search.sina.com.cn
2	destport	Target port	80, 8080
3	url	Link address	/
4	requesttype	Request type	get, post
5	requestparams	Request parameter name	c=news&q=test, c=img&q=test
6	responsetype	Response type	http, js, css

Learning analysis is carried out on the content of the characteristic parameter fields, and the uniqueness of the service fingerprint is determined from the final analysis result.

(2) Dynamic learning

Because different target service systems are built to different standards, and based on accumulated practical experience in the WEB service security field, it has been found that in some systems the link address feature dimension field may contain random content. The invention solves this problem by statistical analysis and grouped deduplication so as to determine the uniqueness of a service more accurately.

In some service systems, dynamic content such as dynamic identity marks and tokens is stored in the link address, and the corresponding marks are randomly reallocated after each re-login, so the link address of each user may change randomly. The following are examples of link addresses within one system:

address 1: CD5141ae53cd45e897482bda c.e. 92/Home/Index

Address 2: 31 dc8f4b74b8880729c59b772945e/Home/Index

Address 3: per 8394aca18fe04a9192df164fd95da4f3/Home/Index

As in the address examples above, a target system may store a SESSION flag in the link address. For such dynamic link addresses, the invention introduces the concepts of "depth", "layer number", and "grouping" to analyze the link address comprehensively and determine the dynamic content in it for processing. The URL path of a single access log is split into layers by the path separation symbol "/"; the total number of layers is the depth. Depth and layer number are illustrated in fig. 3: the address shown there has 3 layers after layering, so its depth is 3. Grouping means that a calculation is performed on each parameter dimension of a single access log, and different access logs are put into the same group when their calculation results are consistent.

In the dynamic learning of the invention, grouping training is carried out according to the static characteristic combination of the target access log, and the adopted static characteristic combination is a domain name, a target port, a request type, a request parameter name and a response type.

The specific steps of dynamic learning are as follows:

Step 1: When the number of cached access logs of a training channel reaches a preset threshold (100000 by default), preparation for training begins, and all link addresses of the current batch are layered. Layering means splitting a link address by the path separation symbol "/"; see fig. 3 for a schematic. The result of layering address 1 in the above example is [ "cd5141ae53cd45e897482bda c.e 92", "Home", "Index" ].

Step 2: After layering, the training engine converts all single-layer string objects of the current layer to numbers, that is, identical strings are converted to the same numeric hash value. The hash value is a 32-bit integer (value range -2147483648 to 2147483647). In theory the probability of a hash collision is very small, and if the size of the training set stays within the 32-bit integer range, the collision probability is negligible. The converted array object is H = [h1, h2, h3, ..., hn], where n is the total number of logs in the current training channel.

Step 3: Grouping statistics are performed on the current array object H. When the number of groups is less than 10% of the total number of elements in H (10% by default, manually adjustable), the current layer is considered converged and is judged to be a static layer; otherwise it is judged to be a dynamic layer. A static layer is represented by its original characters, and a dynamic layer is represented by a wildcard. For the example above, once enough access records have been collected, the final learning result may be: /*/Home/Index.

The grouping statistics in this step work as follows: statistics groups are tied to the layer number, and the same layer of different logs is grouped together. Layer 1 forms one group, layer 2 another, layer 3 another, and so on. Each group performs its own grouping statistics to obtain its result, which is the value of the current layer. After all statistics groups (all layers) have been calculated, the results are arranged in layer order to obtain the learning result of the current group.

Step 4: and outputting a learning result, and providing a manual auditing entrance for auditing and confirming. And automatically entering the dynamic link address with the verification into the fingerprint feature library, automatically clearing all log records of the current cache channel if the verification is neglected, and re-collecting the cache log.

Step 5: Return to step 1 to perform dynamic training of other channels.
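The layering, hashing, and grouping statistics of steps 1 to 3 can be illustrated with a minimal Python sketch. It is not the patented implementation: the function names, the use of CRC32 as the 32-bit hash, the "*" wildcard, and the choice of the most frequent value to represent a static layer are all assumptions made for illustration.

```python
import zlib
from collections import Counter


def split_layers(url_path):
    """Split a link address into layers on the path separator '/'."""
    return [part for part in url_path.split("/") if part]


def hash32(text):
    """Map a layer string to a signed 32-bit integer (CRC32 is an assumed stand-in)."""
    value = zlib.crc32(text.encode("utf-8"))
    return value - 2**32 if value >= 2**31 else value


def learn_link_pattern(paths, ratio=0.10, wildcard="*"):
    """Learn a link pattern from paths that share the same depth.

    A layer is treated as static when the number of distinct hash groups is
    below `ratio` of the number of logs; otherwise it is treated as dynamic.
    """
    layered = [split_layers(p) for p in paths]
    depth = len(layered[0])
    if any(len(layers) != depth for layers in layered):
        raise ValueError("all paths in one training group must share the same depth")
    total = len(layered)
    pattern = []
    for level in range(depth):
        values = [layers[level] for layers in layered]
        groups = Counter(hash32(v) for v in values)
        if len(groups) < ratio * total:
            # Static layer: keep the original characters of the most frequent value.
            top_hash = groups.most_common(1)[0][0]
            pattern.append(next(v for v in values if hash32(v) == top_hash))
        else:
            # Dynamic layer: replace with the wildcard.
            pattern.append(wildcard)
    return "/" + "/".join(pattern)
```

For example, 100 logs whose first layer is a different session identifier each time and whose remaining layers are always Home and Index would be learned as a wildcard layer followed by the two static layers.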

The result obtained when dynamic learning completes is stored in the fingerprint feature library. Whenever any service of the target system (a login service, a recharge service, a query service, a transaction service, etc.) is learned, it is automatically added to the fingerprint feature library, so the amount of data in the library grows with learning time.

(3) Service standardization

To manage services better, the invention standardizes all determined service fingerprints. After all characteristic parameters are determined, a unique-value operation is performed on the parameter combination; the input parameter order of the HASH algorithm is domain, destport, url, requesttype, requestparams, responsetype. Because HTML is not particularly case-sensitive, the service standardization algorithm converts all parameter values to lower case before the calculation, so that the final service fingerprints converge better. The unique calculation formula for determining the service fingerprint is as follows:

md5(domain, destport, url, requesttype, requestparams, responsetype) (1)
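Formula (1) can be sketched in Python with the standard hashlib module. The formula does not state how the six values are joined before hashing, so the "|" separator here is an assumption, as are the function and constant names. The hexadecimal MD5 digest is 32 characters long.

```python
import hashlib

# Input order of the hash, as given in the description.
FIELD_ORDER = ("domain", "destport", "url", "requesttype", "requestparams", "responsetype")


def service_fingerprint(params, separator="|"):
    """Lower-case the six characteristic parameters and hash them in fixed order.

    `separator` is an assumed joining character; formula (1) does not specify one.
    """
    normalized = [str(params[name]).lower() for name in FIELD_ORDER]
    joined = separator.join(normalized)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()
```

Because every value is lower-cased first, two logs that differ only in letter case (for example GET versus get) map to the same fingerprint number.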

2. Automatic sample collection

If the characteristic parameters of the current HTTP access log completely match those of a service fingerprint, the unstructured content data is automatically extracted from the log's access data and stored at a designated location in its original format. Repeated accesses from the same access source to the same object within a given period are filtered automatically: only 1-2 samples are collected per access source in a fixed period, to ensure the diversity of sample contents. The automatic sample collection module manages all service fingerprint samples automatically, including automated matching, automated collection and storage, automated sample filtering, automated sample expiration, automated sample updating, and the like.
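The per-source filtering described above can be sketched as a small in-memory collector. The class name, the one-hour default window, and the dictionary-based storage are assumptions for illustration; the description only states that 1-2 samples are kept per access source in a fixed period.

```python
class SampleCollector:
    """Keep at most `max_per_source` samples per (fingerprint, source) in each time window."""

    def __init__(self, window_seconds=3600, max_per_source=2):
        self.window = window_seconds
        self.max_per_source = max_per_source
        self._seen = {}    # (fingerprint, source) -> (window start, accepted count)
        self.samples = {}  # fingerprint -> list of collected sample contents

    def offer(self, fingerprint, source, content, now):
        """Return True if the sample was stored, False if it was filtered out."""
        key = (fingerprint, source)
        start, count = self._seen.get(key, (now, 0))
        if now - start >= self.window:
            # The previous window has expired: start a new one for this source.
            start, count = now, 0
        if count >= self.max_per_source:
            self._seen[key] = (start, count)
            return False
        self._seen[key] = (start, count + 1)
        self.samples.setdefault(fingerprint, []).append(content)
        return True
```

Passing the timestamp in as `now` keeps the sketch deterministic; a real collector would more likely read a clock and persist the samples in the sample feature library.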

In the invention, the fingerprint feature library stores the service fingerprints corresponding to specific services in a target system, and the specific services correspond one-to-one with the service fingerprints. The sample feature library stores the specific access log records corresponding to each fingerprint feature; the relation is one-to-many, that is, one fingerprint feature corresponds to a plurality of specific access log records. For example, the "login fingerprint" corresponds to the "Zhang San login log", the "Liu Si login log", the "Wang Wu login log", and so on in the sample feature library, which serve as sample feature data.

3. Rule model learning

In traffic transmitted over the HTTP protocol, the data content often includes a large amount of interference content such as frames, and depending on platform architecture and system construction, the static content data may be equal to or even several times the size of the actual dynamic data. If secondary analysis and content storage were performed on the content of all access logs, the storage pressure on the server would increase and the accuracy of secondary analysis would decrease. After the static data is extracted through the rule model, the amount of stored data is greatly reduced, which relieves the storage pressure on the server. Interference content such as frame strings and fixed code is eliminated, which improves the accuracy of secondary analysis.

The frame model mainly uses the patience diff method to perform correlation analysis on all sample contents of a service fingerprint: for each group of fingerprint samples, the sample contents are scanned line by line and matched against each other to identify the static content data.

The specific steps are as follows:

Step 1: Acquire all service fingerprints that meet the conditions. The conditions are that the number of samples reaches the standard, that the frame model has never been executed or was not executed correctly, and that the existing model is outside its validity period. Initialize the service fingerprint set F = [f1, f2, f3, ..., fm], where m is the number of fingerprints whose model is to be learned.

Step 2: Initialize the training set. Extract single fingerprints f from the service fingerprint set F one by one and divide them into the preset training sets (the training set is 10 by default). Each training set extracts the sample set S of fingerprint f from the sample data, where S = [s1, s2, s3, ..., sp] and p is the number of samples.

Step 3: the method comprises the steps of calculating a current training set by adopting a patience difference method, firstly constructing a matrix A, scanning the content among samples row by row and column by column, calculating the characteristic information quantity among each row, namely the longest public subsequence, and comparing the similarity degree of the content of each sample corresponding to the same fingerprint by utilizing the longest public subsequence. The longest common subsequence LCS (Longest Common Subsequence), defined as: for sequence S, if two or more subsequences of the known sequence, respectively, are the longest of all sequences that meet this condition, then S is referred to as the longest common subsequence of the known sequence.

Fig. 4 shows a matrix a, and black boxes indicate hit feature data.

Step 4: When the difference data obtained between the samples exceeds the allowable range (the error amount defaults to 0.98), all samples are cleared and step 2 is repeated. If the difference data is within the allowed range, proceed to step 5.

Difference data: if the data of a row is completely consistent with the data of the comparison row, the difference value of the row is 0. When the rows are not completely equal, the number of differing characters is derived from the calculated characteristic information quantity (LCS) as follows: difference value = (number of characters added + number of characters deleted) / total number of characters. When the difference value exceeds 98% (configurable, default 0.98), the row is considered variable; otherwise it is considered invariable.

Step 5: Merge the data across the samples and calculate the final rule model, where the data service rule model = { service fingerprint number, invariable row set, variable-row and invariable-column set, rule model validity period (default 30 days) }.

In the invention, the content of a single service fingerprint is a text consisting of N lines of data, each line having M columns. An invariable row entry records the number of a row of a service fingerprint log that never changes; the relation between a fingerprint and its invariable rows is one-to-many, and all columns in such a row are invariable. A variable-row and invariable-column entry records the number of a variable row of a service fingerprint together with the numbers of the invariable columns in that row; the relation between the row and these columns is one-to-many.
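The difference value used in step 4 can be illustrated with a short Python sketch. It computes the LCS length with plain dynamic programming rather than the patience diff method named in the description, and it assumes that "total number of characters" means the combined length of the two rows being compared; both choices are assumptions for illustration.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two strings (dynamic programming)."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def difference_value(row_a, row_b):
    """(characters added + characters deleted) / total characters, derived from the LCS."""
    if row_a == row_b:
        return 0.0
    common = lcs_length(row_a, row_b)
    deleted = len(row_a) - common  # characters of row_a that are not in the LCS
    added = len(row_b) - common    # characters of row_b that are not in the LCS
    total = len(row_a) + len(row_b)  # assumed meaning of "total number of characters"
    return (added + deleted) / total
```

Under this reading, two identical rows give 0, two rows with nothing in common give 1, and the result can be compared directly with the configurable 0.98 threshold.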

4. Differentiated compressed storage

For a log that matches a service fingerprint that has a rule model, only the corresponding dynamic data content is extracted; otherwise all data content is extracted. A comprehensive score is then calculated from the execution performance and compression ratio of the different compression algorithms (7 lossless compression algorithms are built in: 1-RLE, 2-LZ77, 3-LZF, 4-FLATE, 5-bzip2, 6-LZMA, and 7-LZO), which yields 7 differential compression scores. Each service fingerprint preferentially selects the compression algorithm with the highest score. The indicators used for the differential compression score are listed in table 2.

Table 2 list of performance metrics and compression ratio metrics

The execution performance index and the compression ratio index are uniformly divided into 9 layers, the parameters such as the score, the weight and the like of each layer can be dynamically adjusted according to actual conditions, and default values are automatically adopted under the condition of no adjustment.

For each index dimension, a comprehensive score (weighted average) is calculated from the deviation of the measured value from the bounds of its index interval and the weight of that interval; the value range of a single index dimension is [0,1]. The maximum total score is therefore 1 multiplied by the number of index dimensions (for example, with the current 2 index dimensions, the maximum differential compression score is 2*1). The compression algorithm selected by the maximum score has a default validity period of 30 days. After it expires, the differentiated compression module automatically re-evaluates the compression algorithm on a schedule, so that the historical result can be adjusted when the service environment data changes and a compression algorithm matching the data characteristics is selected again.

The specific steps of the invention for dynamic compression storage of HTTP protocol traffic are as follows:
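The tiered scoring can be sketched in Python as follows. The contents of table 2 are not reproduced in this text, so every bound, score, and weight in the sketch is a hypothetical placeholder; only three tiers per index are shown instead of nine, the compression ratio is assumed to mean compressed size divided by original size, and the deviation-within-interval adjustment is omitted.

```python
import bisect

# Hypothetical tiers: (upper bound, score, weight). Table 2 defines 9 tiers per index.
TIME_TIERS_MS = [(10.0, 1.0, 1.0), (100.0, 0.6, 1.0), (float("inf"), 0.2, 1.0)]
RATIO_TIERS = [(0.3, 1.0, 1.0), (0.6, 0.6, 1.0), (float("inf"), 0.2, 1.0)]


def tier_score(value, tiers):
    """Look up the tier whose upper bound first covers `value`; return score * weight."""
    bounds = [upper for upper, _, _ in tiers]
    index = bisect.bisect_left(bounds, value)
    _, score, weight = tiers[index]
    return score * weight


def compression_score(elapsed_ms, ratio):
    """Weighted sum over the two index dimensions; the maximum is 2 * 1."""
    return tier_score(elapsed_ms, TIME_TIERS_MS) + tier_score(ratio, RATIO_TIERS)


def best_method(measurements):
    """measurements maps method name -> (elapsed_ms, ratio); return the top-scoring name."""
    return max(measurements, key=lambda name: compression_score(*measurements[name]))
```

With these placeholder tiers, a method that is slower but compresses much better can still win, which is the trade-off the comprehensive score is meant to capture.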

(1) The DPI technology is utilized to collect the data traffic of the network and output the data traffic to the storage medium.

(2) Reading and analyzing the data flow in a polling mode, judging whether the data flow is log data of the HTTP protocol, and if so, entering the step (3); otherwise, the data traffic is discarded.

(3) The HTTP protocol log data is preprocessed, including information completion, data formatting, and the like. The key fields to be completed include: request path, service system, request header information, request body information, and response body information. The fields are separated by the '\t' character, and the method then proceeds to step (4). Preprocessing of the HTTP protocol log data is consistent with the existing static compression method and is not described again here. The key fields corresponding to the service system comprise the domain name, the destination IP, and the destination port.

(4) Fingerprint matching is performed on the preprocessed HTTP protocol log data using the fingerprint features in the fingerprint feature library. If no fingerprint matches, the log data is collected, fingerprint feature learning is performed on it, and the learning result is output to the fingerprint feature library; if the matching succeeds, the method proceeds to step (5) and step (7).

(5) Access logs that hit the service fingerprint are collected, logs from the same source are filtered according to preset logic, and the remaining logs are stored in the sample feature library.

(6) The rule model learning module collects the synchronized sample feature data, performs rule model learning, and stores the learning result in the rule model library.

(7) The rule model library receives the HTTP protocol log data and matches it against all rules according to the characteristic information. If a rule is hit, the method proceeds to step (8); if no rule is hit, the log content is compressed and stored using the static compression method.

(8) The formatted log data is received, and it is judged whether the current fingerprint has a compression rule. If so, the method proceeds to step (9); otherwise, all built-in compression methods are tried, the compression score of each method is calculated, the algorithm with the highest score is selected and recorded in the compression rule base, and the method then proceeds to step (9).

The compression score is obtained as follows:

The execution time and compression ratio of each compression method are measured, the scores and weights corresponding to the execution time and the compression ratio are looked up in table 2, and the compression score of each compression method is obtained through weighted summation.

(9) The current log content is compressed and stored using the compression algorithm specified in the rule base for the current fingerprint.
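The decision flow of steps (4), (7), (8), and (9) can be sketched in Python. This is only an illustration under stated assumptions: three standard-library codecs (zlib, bz2, lzma) stand in for the seven built-in algorithms, DEFLATE via zlib is assumed as the static method, the fingerprint and rule libraries are plain sets and a dictionary, the caller supplies the scoring function, and the extraction of dynamic content by the rule model is left out.

```python
import bz2
import lzma
import time
import zlib

# Stand-ins for the built-in compression methods (assumption: stdlib codecs only).
COMPRESSORS = {"deflate": zlib.compress, "bzip2": bz2.compress, "lzma": lzma.compress}
STATIC_METHOD = "deflate"  # assumed static compression method


def store_log(fingerprint, data, known_fingerprints, rule_models, compression_rules, score):
    """Return (method name, compressed bytes), or None when the fingerprint is unknown.

    `score(elapsed_seconds, ratio)` is a caller-supplied scoring function;
    `ratio` is compressed size divided by original size.
    """
    if fingerprint not in known_fingerprints:
        # Step (4): unmatched logs are handed to fingerprint feature learning instead.
        return None
    if fingerprint not in rule_models:
        # Step (7): no rule hit, fall back to the static compression method.
        return STATIC_METHOD, COMPRESSORS[STATIC_METHOD](data)
    method = compression_rules.get(fingerprint)
    if method is None:
        # Step (8): try every built-in method and record the best-scoring one.
        scores = {}
        for name, compress in COMPRESSORS.items():
            start = time.perf_counter()
            compressed = compress(data)
            elapsed = time.perf_counter() - start
            scores[name] = score(elapsed, len(compressed) / max(len(data), 1))
        method = max(scores, key=scores.get)
        compression_rules[fingerprint] = method
    # Step (9): compress with the method recorded for this fingerprint.
    return method, COMPRESSORS[method](data)
```

Because the chosen method is written back to `compression_rules`, later logs with the same fingerprint skip the trial compression and reuse the recorded algorithm until the rule is refreshed.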

The invention focuses on both the data and the service itself. It performs cluster learning using mathematical analysis methods such as self-learning, sampling, and statistics in combination with the relevant characteristic parameters, determines and interprets individual services, and automatically constructs the corresponding dynamic data rule model. This is a key technology of the patent and is protected.

For different data types, a suitable algorithm is selected automatically according to the differential comprehensive score, and the automatic selection result is optimized and updated regularly. The technical core, namely differentiated compression, comprehensive scoring, and automatic updating, is protected.

It will be appreciated by those skilled in the art that all or part of the steps of the method in the above embodiments may be implemented by a program that instructs the related hardware. The program may be stored in a computer readable storage medium and, when executed, performs the steps of the method. The storage medium may be, for example, ROM/RAM, a magnetic disk, an optical disk, etc.

The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims

1. A method for dynamically compressing and storing HTTP protocol traffic, comprising:

reading log data of an HTTP protocol;

preprocessing log data;

2. The method for dynamic compression storage of HTTP protocol traffic according to claim 1, wherein:

before reading the log data of the HTTP protocol, the method further includes:

3. The method for dynamic compression storage of HTTP protocol traffic according to claim 1, wherein:

the preprocessing comprises information completion and data formatting, and key fields of the completion information comprise a request path, a service system, request header information, request body information and response body information.

4. The method for dynamic compression storage of HTTP protocol traffic according to claim 1, wherein:

when the fingerprint matching is unsuccessful, current log data are collected and fingerprint feature learning is carried out on the current log data.

5. The method for dynamic compression storage of HTTP protocol traffic according to claim 1, wherein:

the compression method with the highest score is used for compression and, at the same time, is taken as the compression rule corresponding to the service fingerprint of the current log data.

6. A system for dynamic compression storage of HTTP protocol traffic, comprising:

the reading module is used for reading the log data of the HTTP protocol;

the preprocessing module is used for preprocessing the log data;