CN114697070B - Method and system for dynamically compressing and storing HTTP protocol traffic

Info

Publication number
CN114697070B
Authority
CN
China
Prior art keywords
compression
log data
fingerprint
matching
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202111665961.0A
Other languages
Chinese (zh)
Other versions
CN114697070A (en)
Inventor
章明珠
钟立
钟志成
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Chengdu Siwei Century Technology Co ltd
Original Assignee
Chengdu Siwei Century Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Chengdu Siwei Century Technology Co ltd
Priority to CN202111665961.0A
Publication of CN114697070A
Application granted
Publication of CN114697070B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/08 Network architectures or network communication protocols for network security for authentication of entities
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/069 Management of faults, events, alarms or notifications using logs of notifications; Post-processing of notifications
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/08 Network architectures or network communication protocols for network security for authentication of entities
    • H04L63/0807 Network architectures or network communication protocols for network security for authentication of entities using tickets, e.g. Kerberos
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/08 Network architectures or network communication protocols for network security for authentication of entities
    • H04L63/0861 Network architectures or network communication protocols for network security for authentication of entities using biometrical features, e.g. fingerprint, retina-scan
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/04 Protocols for data compression, e.g. ROHC

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Computer Hardware Design (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • General Health & Medical Sciences (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses a method and a system for dynamically compressing and storing HTTP protocol traffic. The method comprises the following steps: reading log data of the HTTP protocol; preprocessing the log data; matching the log data against the service fingerprint features in a fingerprint feature library to obtain the service fingerprint corresponding to the log data; when the fingerprint matching succeeds, matching the log data against the rules in a rule model library; if the rule matching fails, compressing the log data with a static compression method; if the rule matching succeeds, compressing the log data with each built-in compression method, obtaining a score and a weight for each compression method from its compression execution time and compression ratio, obtaining a comprehensive score for each method by weighted summation, and compressing with the method that has the highest score. The invention selects the compression method according to the dynamic content of the service to which the data belongs, thereby further improving compression performance and saving storage space.

Description

Method and system for dynamically compressing and storing HTTP protocol traffic
Technical Field
The invention relates to the technical field of WEB service security, and in particular to a method and a system for dynamically compressing and storing HTTP protocol traffic.
Background
When HTTP protocol traffic is analyzed and the uplink and downlink content data must be stored, a static compression method is almost always used: all data content is compressed directly in order to reduce storage space. The static compression format is usually ZIP, RAR or the like, and the corresponding compression method may be DEFLATE, LZ77, HUFFMAN encoding, RLE or the like. The flow of the current static compression method is shown in fig. 1. As the figure shows, the static method preprocesses the request data, extracts data such as the request content and response content, and then compresses and stores the extracted data with a specific compression format and compression method, so that the unstructured data (e.g., request header, response body, etc.) of the HTTP protocol traffic log is compressed and stored. The WEB services carried by the HTTP protocol contain a large amount of unstructured data content that may be irrelevant to the displayed function, such as framework data, code data, annotation data and fixed tags, and the size and format of this unstructured data directly affect the compression result.
Disclosure of Invention
The invention aims to provide a method and a system for dynamically compressing and storing HTTP protocol traffic, which analyze and process the characteristic parameters of WEB services, extract the static data content uniformly, and compress and store only the dynamic data content.
The invention adopts the following technical scheme:
The invention provides a method for dynamically compressing and storing HTTP protocol traffic, which comprises the following steps:
reading log data of the HTTP protocol;
preprocessing the log data;
matching the log data against the service fingerprint features in a fingerprint feature library to obtain the service fingerprint corresponding to the log data;
when the fingerprint matching succeeds, matching the log data against the rules in a rule model library; if the rule matching fails, compressing the log data with a static compression method; if the rule matching succeeds, judging whether the service fingerprint matched by the log data has a corresponding compression rule and, if so, compressing with that compression rule; otherwise, compressing with each of the built-in compression methods, obtaining the score and weight of each compression method from its compression execution time and compression ratio, obtaining the comprehensive score of each compression method by weighted summation, and compressing with the compression method that has the highest score.
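The branching in the steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the three standard-library compressors stand in for the built-in methods, `smallest_output` is a placeholder selector (the text instead uses a weighted score of execution time and compression ratio), and falling back to static compression when no fingerprint matches is an assumption, since the steps only describe the successful-match branch. All function and variable names are hypothetical.

```python
import bz2
import lzma
import zlib
from typing import Callable, Dict, Optional, Tuple

Compressor = Callable[[bytes], bytes]

# Stand-ins from the Python standard library (assumption, not the patent's set).
COMPRESSORS: Dict[str, Compressor] = {
    "deflate": zlib.compress,
    "bzip2": bz2.compress,
    "lzma": lzma.compress,
}


def smallest_output(data: bytes, compressors: Dict[str, Compressor]) -> str:
    """Placeholder selector: pick the method that yields the smallest output."""
    return min(compressors, key=lambda name: len(compressors[name](data)))


def compress_log(
    data: bytes,
    fingerprint: Optional[str],
    rule_matched: bool,
    compression_rules: Dict[str, str],
    select: Callable[[bytes, Dict[str, Compressor]], str] = smallest_output,
    static_method: str = "deflate",
) -> Tuple[str, bytes]:
    """Return (method name, compressed bytes) following the described branches."""
    # No rule model matched (or, by assumption, no fingerprint): static compression.
    if fingerprint is None or not rule_matched:
        return static_method, COMPRESSORS[static_method](data)
    # A compression rule already exists for this service fingerprint: reuse it.
    method = compression_rules.get(fingerprint)
    if method is None:
        # Otherwise try every built-in method, keep the best, and remember it.
        method = select(data, COMPRESSORS)
        compression_rules[fingerprint] = method
    return method, COMPRESSORS[method](data)


rules: Dict[str, str] = {}
name, blob = compress_log(b"<html>hello</html>" * 50, "fp-login", True, rules)
```

After the first call the chosen method is cached in `rules`, so later log entries with the same fingerprint skip the trial compression.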
In some embodiments, before reading the log data of the HTTP protocol, the method further includes:
collecting the data traffic of the network with DPI technology and outputting it to a storage medium; reading and parsing the data traffic by polling, and judging whether it is log data of the HTTP protocol; if so, reading it; otherwise, discarding the data traffic.
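A simplified sketch of the "is this HTTP?" check during polling is given below. It assumes each polled record is a raw payload as bytes and recognises HTTP only by its request line or status line; a real DPI engine does considerably more. The names `looks_like_http` and `poll_once` are hypothetical.

```python
from typing import Iterable, List

HTTP_METHODS = (
    b"GET", b"POST", b"PUT", b"DELETE", b"HEAD",
    b"OPTIONS", b"PATCH", b"TRACE", b"CONNECT",
)


def looks_like_http(payload: bytes) -> bool:
    """Heuristic check on the first line of a payload."""
    first_line = payload.split(b"\r\n", 1)[0]
    # Response: status line such as "HTTP/1.1 200 OK".
    if first_line.startswith(b"HTTP/"):
        return True
    # Request: "<METHOD> <target> HTTP/<version>".
    parts = first_line.split(b" ")
    return (
        len(parts) == 3
        and parts[0] in HTTP_METHODS
        and parts[2].startswith(b"HTTP/")
    )


def poll_once(records: Iterable[bytes]) -> List[bytes]:
    """Keep HTTP records and discard everything else."""
    return [record for record in records if looks_like_http(record)]


kept = poll_once([
    b"GET /Home/Index HTTP/1.1\r\nHost: example.com\r\n\r\n",
    b"\x16\x03\x01\x02\x00",  # not HTTP, discarded
])
```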
In some embodiments, preprocessing includes information completion and data formatting, and the key fields that are completed include the request path, the service system, the request header information, the request body information and the response body information.
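As an illustration of information completion and data formatting, the sketch below parses one tab-separated log line into named fields. The field order, the field names and the default values (port 80, path "/") are assumptions made for the example; the text only states which kinds of key fields are completed.

```python
from typing import Dict

# Hypothetical field layout of one log line (assumption).
FIELD_NAMES = (
    "domain", "destport", "url",
    "request_header", "request_body", "response_body",
)


def preprocess(line: str) -> Dict[str, str]:
    """Split a tab-separated log line and fill in missing key fields."""
    parts = line.rstrip("\n").split("\t")
    # Information completion: pad short records so every key field exists.
    parts += [""] * (len(FIELD_NAMES) - len(parts))
    record = dict(zip(FIELD_NAMES, parts))
    # Data formatting: normalise the domain and apply assumed defaults.
    record["domain"] = record["domain"].strip().lower()
    record["destport"] = record["destport"] or "80"
    record["url"] = record["url"] or "/"
    return record


record = preprocess("WWW.Example.com\t\t/Home/Index\n")
```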
In some embodiments, when the fingerprint matching fails, the current log data is collected and used for fingerprint feature learning.
In some embodiments, when compression is performed with the highest-scoring compression method, that method is also recorded as the compression rule for the service fingerprint of the current log data.
The invention provides a system for dynamically compressing and storing HTTP protocol traffic, which comprises:
a reading module, used for reading the log data of the HTTP protocol;
a preprocessing module, used for preprocessing the log data;
a service fingerprint matching module, used for matching the log data against the service fingerprint features in the fingerprint feature library to obtain the service fingerprint corresponding to the log data;
a compression module, used for matching the log data against the rules in the rule model library when the fingerprint matching succeeds, and compressing the log data with a static compression method if the rule matching fails; if the rule matching succeeds, judging whether the service fingerprint matched by the log data has a corresponding compression rule and, if so, compressing with that compression rule; otherwise, compressing with each of the built-in compression methods, obtaining the score and weight of each compression method from its compression execution time and compression ratio, obtaining the comprehensive score of each compression method by weighted summation, and compressing with the compression method that has the highest score.
The invention can automatically classify the WEB services carried by the HTTP protocol and automatically learn the WEB service fingerprint of each service. Sample information for each WEB service fingerprint is collected according to the characteristic parameters of that fingerprint. When the number of samples reaches a preset number, algorithmic matching is performed on the sample contents of the WEB service fingerprint. A rule model is then formed from the matching result together with the extracted dynamic data content, and is stored in the model library.
Compared with the prior art, the invention has the following characteristics and beneficial effects:
1. The prior art usually compresses and stores the full data, whereas the invention separates the static data from the dynamic data and keeps only the effective dynamic data for compression and storage, which enables accurate compression.
2. The invention carries out secondary analysis on the extracted dynamic data, and preferentially selects a more proper compression method according to the dynamic content of the corresponding service of the dynamic data, thereby further improving the compression performance and saving the storage space.
Drawings
FIG. 1 is a flow chart of a current static compression method;
FIG. 2 is a flow chart of the method of the present invention;
FIG. 3 is a schematic view of the number of layers and depth;
FIG. 4 is a schematic diagram of a matrix calculation process;
Detailed Description
The embodiments of the invention are described in detail below with reference to the accompanying drawings. The described embodiments are only some, not all, of the embodiments of the invention. All other embodiments obtained by those skilled in the art from the described embodiments without inventive effort fall within the scope of protection of the invention.
The implementation of the invention needs to use the following prior art means:
(1) DPI technology (Deep Packet Inspection), which parses data traffic to obtain specific field data;
(2) Bypass mirroring, which obtains bypass-collected data traffic by mirroring;
(3) Bypass beam splitting, which obtains bypass-collected data traffic with an optical splitter.
The invention is attached to a core network link in bypass mode. DPI technology (deep packet inspection) is used to collect, reassemble and restore the HTTP data traffic in the network, and the network traffic is replicated for analysis by bypass beam splitting, bypass mirroring or the like. At present, the method is applicable to service system environments based on HTTP interaction. Bypass collection is the foundation: it obtains the original traffic of user requests to the target system without negatively affecting the online service. The core module automatically learns from the collected original data according to preset feature dimensions, and establishes an original service fingerprint feature model for each service module according to the different parameters of different access sources and the details of the service responses. Samples are then collected for every service fingerprint; when a preset number of samples has been collected, the core module performs text algorithm matching on the grouped feature model samples, extracts the static content in them and generates a frame extraction rule model.
Referring to fig. 2, a flow chart of the method of the present invention is shown, and the implementation of the present invention will be described in detail with reference to fig. 2.
1. Fingerprint feature learning
The invention uses a "service fingerprint" to describe and determine the uniqueness of a single operational service, for example a system login fingerprint, an add-user fingerprint or a delete-role fingerprint. Every service fingerprint has a corresponding 32-character HASH value (the MD5 digest of formula (1)) as its unique identifier, which serves as the fingerprint number. Only when each initiated operation request can be mapped to the unique fingerprint to which it belongs can requests with the same fingerprint be classified by fingerprint number, and the corresponding calculation method be applied to each fingerprint group to obtain the desired final result.
The target system is a system discovered in the traffic. The traffic is split by domain name, and different domain names belong to different target systems. For example, traffic with the domain name www.baidu.com belongs to one target system and traffic with the domain name www.sina.com.cn belongs to another. Target systems are independent of each other and do not interfere with one another. Different target systems may have the same feature fingerprint, but the same feature fingerprint may not appear more than once within one target system. For example, the www.baidu.com target system can have only one "login fingerprint", and the www.sina.com.cn target system can also have a "login fingerprint".
(1) Characteristic parameter
After the traffic is collected and restored, data is extracted automatically according to pre-configured policy rules, the characteristic parameter fields are completed automatically after extraction, and the completed data is stored in a cache database as the basic analysis data. There are currently 6 built-in types of characteristic parameters: domain name, destination port, link address, request type, request parameter name and response type; analysis and processing are then carried out on these parameter types. The specific characteristic parameters and examples are shown in Table 1; all of them are static features.
Table 1 Characteristic parameters and examples
Sequence number | Field name | Parameter name | Parameter examples
1 | domain | Domain name | search.sina.com.cn
2 | destport | Target port | 80, 8080
3 | url | Link address | /
4 | requesttype | Request type | get, post
5 | requestparams | Request parameter name | c=news&q=test, c=img&q=test
6 | responsetype | Response type | http, js, css
Learning analysis is carried out on the content of the characteristic parameter fields, and the uniqueness of the service fingerprint is determined from the final analysis result.
(2) Dynamic learning
Because different target service systems are built to different standards, and based on practical experience accumulated in the WEB service security field, it was found that in some systems the link address feature dimension field may contain random content. The invention solves this problem with statistical analysis and grouped deduplication, so that the uniqueness of a service can be determined more accurately.
In some service systems, dynamic content such as dynamic identity identifiers and tokens is stored in the link address, and the identifier is reassigned randomly after each login, so the link address of each user may change randomly. The following are examples of link addresses within one system:
Address 1: CD5141ae53cd45e897482bda c.e. 92/Home/Index
Address 2: 31 dc8f4b74b8880729c59b772945e/Home/Index
Address 3: /8394aca18fe04a9192df164fd95da4f3/Home/Index
As in the address examples above, a target system stores a SESSION identifier in the link address. To handle such dynamic link addresses, the invention introduces the concepts of "depth", "layer number" and "grouping" and analyzes the link address comprehensively to determine the dynamic content in it. The URL path of a single access log is split into layers at the path separator "/"; the total number of layers is the depth. Depth and layer number are illustrated in fig. 3: the address shown in fig. 3 has 3 layers after splitting, so its depth is also 3. Grouping means that a calculation is made for each parameter dimension of a single access log, and access logs whose calculation results are identical are put into the same group.
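The layering, depth and grouping concepts can be illustrated with a short Python sketch. The grouping key used here (domain, destination port, request type, request parameter name, response type) is an illustrative choice of parameter dimensions, and the helper names and the dict-based log record are assumptions.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Parameter dimensions used as the grouping key (illustrative choice).
STATIC_FEATURES = ("domain", "destport", "requesttype", "requestparams", "responsetype")


def split_layers(url_path: str) -> List[str]:
    """Split a URL path into layers at '/', ignoring empty segments."""
    return [segment for segment in url_path.split("/") if segment]


def depth(url_path: str) -> int:
    """The depth is the total number of layers."""
    return len(split_layers(url_path))


def group_logs(logs: List[Dict[str, str]]) -> Dict[Tuple[str, ...], List[Dict[str, str]]]:
    """Put logs whose parameter values are identical into the same group."""
    groups = defaultdict(list)
    for log in logs:
        key = tuple(log[name] for name in STATIC_FEATURES)
        groups[key].append(log)
    return dict(groups)


layers = split_layers("/8394aca18fe04a9192df164fd95da4f3/Home/Index")
# layers has three entries, so the depth of this address is 3
```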
In the dynamic learning of the invention, grouping training is carried out according to the static characteristic combination of the target access log, and the adopted static characteristic combination is a domain name, a target port, a request type, a request parameter name and a response type.
The specific steps of dynamic learning are as follows:
Step 1: when the number of cached access logs of a training channel reaches a preset threshold (100000 by default), preparation for training begins and all link addresses of the current batch are layered. Layering means splitting a link address at the path separator "/"; see fig. 3 for an illustration. The layering result of address 1 in the example above is [ "cd5141ae53cd45e897482bda c.e 92", "Home", "Index" ].
Step 2: after layering, the training engine digitizes all single-layer string objects of the current layer, that is, it converts each string to be compared into a numeric hash. The hash value is a 32-bit integer (range -2147483648 to 2147483647). In theory the probability of a hash collision is very small, and as long as the size of the training set stays within the 32-bit integer range the collision probability is negligible. The result of the conversion is the array object H = [h1, h2, h3, …, hn], where n is the total number of logs in the current training channel.
Step 3: perform grouping statistics on the current array object H. When the number of groups is less than 10% of the total number of elements in H (10% by default, manually adjustable), the current layer is considered to have converged and is judged to be a static layer; otherwise it is judged to be a dynamic layer. A static layer is represented by its original characters, and a dynamic layer is represented by a wildcard. For the example above, once enough access records have been collected, the final learning result may be: /Home/Index.
The grouping statistics in this step work as follows: the statistical groups are tied to the layer number, and the same layer of different logs is placed in one group. Layer 1 forms one group, layer 2 another, layer 3 another, and so on. Each group is counted independently to obtain its result, which is the value of the current layer. After all statistical groups (all layers) have been calculated, their results are reassembled in order to obtain the learning result of the current group.
Step 4: output the learning result and provide a manual audit entry for review and confirmation. A dynamic link address that passes the audit is automatically entered into the fingerprint feature library; if the audit result is to ignore it, all log records of the current cache channel are cleared automatically and cache logs are collected again.
Step 5: return to step 1 and perform dynamic training for the other channels.
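Steps 2 and 3 can be sketched as follows. The text does not name the hash function or the wildcard symbol, so CRC32 (mapped to a signed 32-bit value) and "*" are stand-ins chosen for the example; the function names are hypothetical as well.

```python
import zlib
from typing import Iterable, List, Set

WILDCARD = "*"  # assumed wildcard symbol


def hash32(text: str) -> int:
    """Map a string to a signed 32-bit integer (CRC32 as a stand-in hash)."""
    unsigned = zlib.crc32(text.encode("utf-8"))
    return unsigned - (1 << 32) if unsigned >= (1 << 31) else unsigned


def learn_dynamic_layers(paths: Iterable[str], threshold: float = 0.10) -> Set[int]:
    """Return the indexes of layers judged dynamic for one group of logs."""
    layered: List[List[str]] = [[s for s in p.split("/") if s] for p in paths]
    max_depth = max((len(layers) for layers in layered), default=0)
    dynamic: Set[int] = set()
    for index in range(max_depth):
        # The same layer of different logs forms one statistical group.
        hashes = [hash32(layers[index]) for layers in layered if len(layers) > index]
        distinct = len(set(hashes))
        # Fewer distinct values than the threshold share: converged, static layer.
        if distinct / len(hashes) >= threshold:
            dynamic.add(index)
    return dynamic


def normalize(path: str, dynamic: Set[int]) -> str:
    """Keep static layers as they are and replace dynamic layers by the wildcard."""
    layers = [s for s in path.split("/") if s]
    return "/" + "/".join(WILDCARD if i in dynamic else s for i, s in enumerate(layers))


sample_paths = [f"/{i:032x}/Home/Index" for i in range(100)]
dynamic_layers = learn_dynamic_layers(sample_paths)
pattern = normalize(sample_paths[0], dynamic_layers)
```

With a small batch even a constant layer can exceed the 10% share (one distinct value in five logs is 20%), which is one reason a large batch is collected before training starts.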
The result of dynamic learning is stored in the fingerprint feature library. Whenever any service of the target system (a login service, a recharging service, an inquiry service, a transaction service, etc.) has been learned, its result is added to the fingerprint feature library automatically, so the amount of data in the library grows as learning time increases.
(3) Service standardization
To manage services better, the invention standardizes all determined service fingerprints. After all characteristic parameters have been determined, a unique value is calculated over the parameter combination; the input parameter order of the HASH algorithm is domain, destport, url, requesttype, requestparams, responsetype. Because HTML is not particularly case-sensitive, the service standardization algorithm converts all parameter values to lowercase before the calculation, so that the final service fingerprints converge better. The service fingerprint is uniquely determined by the following formula:
md5(domain, destport, url, requesttype, requestparams, responsetype)    (1)
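Formula (1) can be sketched in Python with the standard `hashlib` module. The text fixes the parameter order and the lowercase conversion but not how the values are joined, so the "|" separator is an assumption, as are the function name and the dict-based input.

```python
import hashlib
from typing import Dict

FIELD_ORDER = ("domain", "destport", "url", "requesttype", "requestparams", "responsetype")


def service_fingerprint(features: Dict[str, str], separator: str = "|") -> str:
    """Compute the fingerprint number as the MD5 of the lowercased parameters."""
    # Lowercase every value so that case differences converge to one fingerprint.
    parts = [str(features[name]).lower() for name in FIELD_ORDER]
    joined = separator.join(parts)  # the separator is an assumption
    return hashlib.md5(joined.encode("utf-8")).hexdigest()


fingerprint = service_fingerprint({
    "domain": "search.sina.com.cn",
    "destport": "80",
    "url": "/",
    "requesttype": "GET",
    "requestparams": "c&q",
    "responsetype": "html",
})
```

The hexadecimal MD5 digest is 32 characters long, so it can serve directly as the fingerprint number.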
2. Automatic sample collection
If the characteristic parameters of the current HTTP access log completely match those of a service fingerprint, the unstructured content data is extracted automatically from the access data of the log and stored at a designated location in its original format. Requests from the same access source to the same object within a given period are filtered automatically: only 1-2 samples are collected from a single access source within a fixed time, to ensure the diversity of the sample content. The automatic sample collection module manages all service fingerprint samples automatically, including automated matching, automated collection and storage, automated sample filtering, automated sample expiration and automated sample updating.
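The per-source filtering described above could look like the following sketch. The window length (one hour), the limit of 2 samples and the class and method names are assumptions for illustration; the text only says that 1-2 samples are kept per access source within a fixed time.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


class SampleCollector:
    """Keep at most `max_per_source` samples per (fingerprint, source) per window."""

    def __init__(self, max_per_source: int = 2, window_seconds: float = 3600.0):
        self.max_per_source = max_per_source
        self.window_seconds = window_seconds
        self.samples: Dict[str, List[str]] = defaultdict(list)
        # (fingerprint, source) -> (window start time, samples taken in window)
        self._windows: Dict[Tuple[str, str], Tuple[float, int]] = {}

    def offer(self, fingerprint: str, source: str, content: str, now: float) -> bool:
        """Store the sample unless this source already reached its quota."""
        key = (fingerprint, source)
        start, count = self._windows.get(key, (now, 0))
        if now - start >= self.window_seconds:
            start, count = now, 0  # the window expired, start a new one
        if count >= self.max_per_source:
            return False  # filtered out to keep the samples diverse
        self._windows[key] = (start, count + 1)
        self.samples[fingerprint].append(content)
        return True


collector = SampleCollector(max_per_source=2, window_seconds=60.0)
accepted = [
    collector.offer("fp-login", "10.0.0.1", "body-a", now=0.0),
    collector.offer("fp-login", "10.0.0.1", "body-b", now=1.0),
    collector.offer("fp-login", "10.0.0.1", "body-c", now=2.0),   # over quota
    collector.offer("fp-login", "10.0.0.2", "body-d", now=3.0),   # other source
    collector.offer("fp-login", "10.0.0.1", "body-e", now=61.0),  # new window
]
```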
In the invention, the fingerprint feature library stores the service fingerprints corresponding to specific services in the target system, and specific services correspond one-to-one with service fingerprints. The sample feature library stores the specific access log records corresponding to each fingerprint feature; the relationship is one-to-many, that is, one fingerprint feature corresponds to multiple access log records. For example, the "login fingerprint" corresponds in the sample feature library to sample feature data such as "Zhang San login log", "Liu Si login log" and "Wang Wu login log".
3. Rule model learning
In traffic transmitted over the HTTP protocol, the data content often contains a great deal of interfering content such as frameworks. Depending on the platform architecture and the way the system was built, the static content data may be as large as, or even several times larger than, the actual dynamic data. If secondary analysis and content storage were carried out on the full content of every access log, the storage pressure on the server would increase and the accuracy of the secondary analysis would fall. Once the static data has been extracted by the rule model, the amount of stored data drops sharply and the storage pressure on the server is reduced; interfering content such as framework strings and fixed code is removed, which improves the accuracy of the secondary analysis.
The frame model mainly uses the patience difference method to perform correlation analysis on all sample contents of a service fingerprint: for each group of fingerprint samples it scans the sample contents line by line and identifies the static content data by matching the samples against one another.
The specific steps are as follows:
Step 1: acquire all service fingerprints that meet the conditions. The conditions include: the number of samples reaches the required standard; the frame model has never been executed or was not executed correctly; the model is outside its validity period. Initialize the service fingerprint set F = [f1, f2, f3, …, fm], where m is the number of fingerprints whose model is to be learned.
Step 2: initialize the training sets. Extract the single fingerprints f from the service fingerprint set F one by one and divide them into training sets of a preset size (10 by default). For each training set, extract the sample set S of fingerprint f from the sample data, where S = [s1, s2, s3, …, sp] and p is the number of samples.
Step 3: process the current training set with the patience difference method. First construct a matrix A and scan the content of the samples row by row and column by column; then calculate the characteristic information quantity between rows, that is, the longest common subsequence, and use it to compare the similarity of the sample contents that belong to the same fingerprint. The longest common subsequence (LCS) is defined as follows: if a sequence S is a subsequence of each of two or more known sequences and is the longest of all sequences that satisfy this condition, then S is called the longest common subsequence of the known sequences.
Fig. 4 shows a matrix a, and black boxes indicate hit feature data.
Step 4: when the difference data obtained between the samples exceeds the allowable range (the error amount defaults to 0.98), clear all samples and repeat step 2. If the difference data is within the allowable range, go to step 5.
Difference data: if the data of a row is identical to the data of the comparison row, the difference value of the row is 0. If the two are not identical, the number of differing characters is derived from the calculated characteristic information quantity (LCS) as follows: difference value = (number of characters added + number of characters deleted) / total number of characters. When the difference value exceeds 98% (configurable, default 0.98), the row is considered variable; otherwise it is considered invariable.
Step 5: merge the data between the samples and calculate the final rule model, where rule model = {service fingerprint number, invariable row set, variable-row invariable-column set, rule model validity period (30 days by default)}.
In the invention, the content of a single service fingerprint is a text that consists of N rows of data, each row having M columns. The invariable row set records the numbers of the rows of a service fingerprint log that do not change; one fingerprint corresponds to many invariable rows, and every column in an invariable row is invariable. The variable-row invariable-column set records, for a service fingerprint, the number of each variable row and the numbers of the invariable columns within that row; one variable row corresponds to many invariable columns.
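The LCS-based difference value used in steps 3 and 4 can be sketched with the classic dynamic-programming algorithm. Two points are assumptions of this sketch: rows are compared as plain character strings, and "total number of characters" is taken to be the combined length of both rows, which keeps the value in the range 0 to 1. The function names are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings."""
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0]
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def difference_value(row: str, reference: str) -> float:
    """(characters added + characters deleted) / total characters."""
    if row == reference:
        return 0.0
    common = lcs_length(row, reference)
    deleted = len(reference) - common  # characters only in the reference row
    added = len(row) - common          # characters only in the compared row
    total = len(row) + len(reference)  # assumed meaning of "total characters"
    return (added + deleted) / total


same = difference_value("<div class='nav'>", "<div class='nav'>")
partly = difference_value("abcd", "abxd")
```

Here `same` is 0.0 because the rows are identical, while `partly` is 0.25: the LCS has length 3, so one character is added and one deleted out of 8 characters in total.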
4. Differentiated compressed storage
For log data that matches a service fingerprint with a rule model, only the corresponding dynamic data content is extracted; otherwise all data content is extracted. A comprehensive score is then calculated from the execution performance and compression ratio of the different compression algorithms (7 lossless compression algorithms are built in: 1-RLE, 2-LZ77, 3-LZF, 4-FLATE, 5-bzip2, 6-LZMA and 7-LZO), which yields 7 differential compression scores. Each service fingerprint then preferentially selects the compression algorithm with the highest score. The indicators currently used for the differential compression score are listed in Table 2.
Table 2 list of performance metrics and compression ratio metrics
The execution performance index and the compression ratio index are each divided into 9 tiers. Parameters such as the score and weight of each tier can be adjusted dynamically according to actual conditions; default values are used automatically when no adjustment is made.
For each index dimension, a comprehensive score (weighted average) is calculated from the degree of deviation between the measured value and the value of its index interval, together with the weight of that interval; the score of a single index dimension lies in the range [0, 1]. The maximum total score is therefore the number of index dimensions multiplied by 1 (e.g., with the current 2 feature dimensions, the maximum differential compression score is 2 × 1 = 2). The compression algorithm selected by the highest score has a default validity period of 30 days. After the validity period expires, the differentiated compression module automatically re-evaluates the compression algorithm on a schedule, so that earlier results can be adjusted when the service environment data changes and a compression algorithm matching the data characteristics is selected again.
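The tiered scoring can be sketched roughly as follows. Because the contents of Table 2 are not reproduced in this text, every tier boundary, score and weight below is a placeholder, and the names `TIME_TIERS`, `RATIO_TIERS`, `dimension_score` and `total_score` are hypothetical. The sketch also simplifies the deviation-based weighted average to a plain lookup of score times weight, clipped to [0, 1].

```python
from bisect import bisect_left

# Nine placeholder tier scores per dimension: 1.0, 0.875, ..., 0.0.
SCORES = [1 - i / 8 for i in range(9)]

# Each tier is (upper bound, score, weight). All values are assumptions.
# Execution time in milliseconds (lower is better).
TIME_TIERS = [(bound, score, 1.0) for bound, score in
              zip([1, 2, 5, 10, 20, 50, 100, 200, float("inf")], SCORES)]
# Compression ratio as compressed size / original size (lower is better).
RATIO_TIERS = [(bound, score, 1.0) for bound, score in
               zip([0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8, float("inf")], SCORES)]


def dimension_score(value, tiers):
    """Look up the tier containing `value` and return score * weight in [0, 1]."""
    bounds = [upper for upper, _, _ in tiers]
    index = min(bisect_left(bounds, value), len(tiers) - 1)
    _, score, weight = tiers[index]
    return max(0.0, min(1.0, score * weight))


def total_score(exec_ms, ratio):
    """Sum of the per-dimension scores; the maximum is 2 for two dimensions."""
    return dimension_score(exec_ms, TIME_TIERS) + dimension_score(ratio, RATIO_TIERS)
```

With these placeholder tiers, a method that runs in under 1 ms and shrinks the data to under 10% of its size reaches the maximum total of 2, matching the 2 × 1 bound stated above.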
The specific steps of the present invention for dynamically compressing and storing HTTP protocol traffic are as follows:
(1) Collect the data traffic of the network using DPI technology and output it to the storage medium.
(2) Read and parse the data traffic by polling, and determine whether it is log data of the HTTP protocol. If so, proceed to step (3); otherwise, discard the data traffic.
(3) Preprocess the HTTP protocol log data, including information completion, data formatting and the like. The completed key fields include the request path, service system, request header information, request body information and response body information; the fields are separated by the '\t' character. Then proceed to step (4). Preprocessing of the HTTP protocol log data is consistent with the existing static compression method and is not described again here. The key fields corresponding to the service system comprise the domain name, destination IP and destination port.
(4) Match the preprocessed HTTP protocol log data against the fingerprint features in the fingerprint feature library. If no fingerprint matches, collect the preprocessed log data, perform fingerprint feature learning on it, and output the learning result to the fingerprint feature library. If the matching succeeds, proceed to step (5) and step (7).
(5) Collect the access logs that hit the service fingerprint, filter logs from the same source according to preset logic, and store them in the sample feature library.
(6) The rule model learning module collects the synchronized sample feature data, performs rule model learning, and stores the learning result in the rule model library.
(7) The rule model library receives the HTTP protocol log data and matches it against all rules according to the feature information. If a rule is hit, proceed to step (8); if no rule is hit, compress and store the log content using the static compression method.
(8) Receive the formatted log data and determine whether the current fingerprint has a compression rule. If so, proceed to step (9); otherwise, try all built-in compression methods, calculate the compression score of each, select the algorithm with the highest score, record it in the compression rule base, and then proceed to step (9).
The compression score is obtained as follows:
The execution time and compression ratio of each compression method are measured, the corresponding scores and weights for the execution time and the compression ratio are looked up in Table 2, and the compression score of each compression method is obtained by weighted summation.
(9) Compress and store the current log content using the compression algorithm specified in the rule base for the current fingerprint.
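Steps (8) and (9) can be sketched in Python as below. This is a hedged illustration, not the patented implementation: it uses only the three codecs available in the Python standard library (`zlib` for DEFLATE, `bz2` and `lzma`) rather than the seven built-in methods named in the description, and the `score` function with its equal weights is a stand-in for the Table 2 lookup, which is not reproduced in this text. The names `CODECS`, `measure`, `score`, `compression_rules` and `compress_for` are hypothetical.

```python
import bz2
import lzma
import time
import zlib

# Stand-in for the built-in compression methods (assumption: stdlib codecs only).
CODECS = {"deflate": zlib.compress, "bzip2": bz2.compress, "lzma": lzma.compress}

# Compression rule base: service fingerprint -> chosen codec name.
compression_rules = {}


def measure(data):
    """Run every codec once and record (execution time in ms, compression ratio)."""
    results = {}
    for name, compress in CODECS.items():
        start = time.perf_counter()
        packed = compress(data)
        elapsed_ms = (time.perf_counter() - start) * 1000
        results[name] = (elapsed_ms, len(packed) / max(len(data), 1))
    return results


def score(elapsed_ms, ratio, w_time=0.5, w_ratio=0.5):
    """Placeholder weighted sum; faster and smaller both raise the score."""
    time_score = 1 / (1 + elapsed_ms)
    ratio_score = max(0.0, 1 - ratio)
    return w_time * time_score + w_ratio * ratio_score


def compress_for(fingerprint, data):
    """Step (8): pick a codec if no rule exists. Step (9): compress with it."""
    name = compression_rules.get(fingerprint)
    if name is None:
        scores = {n: score(*m) for n, m in measure(data).items()}
        name = max(scores, key=scores.get)
        compression_rules[fingerprint] = name  # record the rule for later logs
    return name, CODECS[name](data)
```

Once a codec has been recorded for a fingerprint, later logs with the same fingerprint skip the trial run and are compressed directly, which mirrors the branch in step (8). The periodic expiry of the recorded rule described earlier is omitted from this sketch.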
The invention focuses on the data and on the service itself. It performs cluster learning using mathematical analysis methods such as self-learning, sampling and statistics, combined with the relevant feature parameters, so as to identify and interpret individual services and automatically construct the corresponding dynamic data rule model. This is a key technology of this patent and is to be protected.
For different data types, a suitable algorithm is selected automatically and preferentially according to the differential comprehensive scores, and the selection result is optimized and updated periodically. The technical core is therefore differentiated compression, comprehensive scoring and automatic updating, which is to be protected.
It will be appreciated by those skilled in the art that all or part of the steps of the method of the above embodiments may be implemented by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium and, when executed, performs the steps of the method. The storage medium may be, for example, ROM/RAM, a magnetic disk or an optical disk.
The foregoing description is only of the preferred embodiments of the present invention and is not intended to limit the scope of the present invention. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention are included in the protection scope of the present invention.

Claims (6)

1. A method for dynamically compressing and storing HTTP protocol traffic, comprising:
reading log data of an HTTP protocol;
preprocessing log data;
matching the log data against the service fingerprint features in a fingerprint feature library to obtain the service fingerprint corresponding to the log data;
when the fingerprint matching is successful, matching the log data with rules in a rule model library, and if the rule matching is unsuccessful, compressing the log data by adopting a static compression method; if the rule matching is successful, judging whether the service fingerprint matched with the log data has a corresponding compression rule, and if so, adopting the compression rule to compress; otherwise, all the built-in compression methods are adopted to compress respectively, the corresponding score and weight of each compression method are obtained according to the compression execution time and the compression ratio, the comprehensive score of each compression method is obtained through weighted summation, and the compression method with the highest score is adopted to compress.
2. The method for dynamic compression storage of HTTP protocol traffic according to claim 1, wherein:
before reading the log data of the HTTP protocol, the method further includes:
collecting data flow of a network by using a DPI technology, outputting the data flow to a storage medium, reading and analyzing the data flow in a polling mode, judging whether the data flow is log data of an HTTP protocol, and if so, reading; otherwise, the data traffic is discarded.
3. The method for dynamic compression storage of HTTP protocol traffic according to claim 1, wherein:
the preprocessing comprises information completion and data formatting, and key fields of the completion information comprise a request path, a service system, request header information, request body information and response body information.
4. The method for dynamic compression storage of HTTP protocol traffic according to claim 1, wherein:
when the fingerprint matching is unsuccessful, the current log data is collected and fingerprint feature learning is performed on it.
5. The method for dynamic compression storage of HTTP protocol traffic according to claim 1, wherein:
after the compression method with the highest score is used for compression, that compression method is recorded as the compression rule corresponding to the service fingerprint of the current log data.
6. A system for dynamic compression storage of HTTP protocol traffic, comprising:
the reading module is used for reading the log data of the HTTP protocol;
the preprocessing module is used for preprocessing the log data;
the service fingerprint matching module is used for matching the log data against the service fingerprint features in the fingerprint feature library to obtain the service fingerprint corresponding to the log data;
the compression module is used for matching the log data with rules in the rule model library when the fingerprint matching is successful, and adopting a static compression method to compress the log data if the rule matching is unsuccessful; if the rule matching is successful, judging whether the service fingerprint matched with the log data has a corresponding compression rule, and if so, adopting the compression rule to compress; otherwise, all the built-in compression methods are adopted to compress respectively, the corresponding score and weight of each compression method are obtained according to the compression execution time and the compression ratio, the comprehensive score of each compression method is obtained through weighted summation, and the compression method with the highest score is adopted to compress.
CN202111665961.0A 2021-12-31 2021-12-31 Method and system for dynamically compressing and storing HTTP protocol traffic Active CN114697070B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111665961.0A CN114697070B (en) 2021-12-31 2021-12-31 Method and system for dynamically compressing and storing HTTP protocol traffic

Publications (2)

Publication Number Publication Date
CN114697070A CN114697070A (en) 2022-07-01
CN114697070B true CN114697070B (en) 2024-04-02

Family

ID=82137445

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111665961.0A Active CN114697070B (en) 2021-12-31 2021-12-31 Method and system for dynamically compressing and storing HTTP protocol traffic

Country Status (1)

Country Link
CN (1) CN114697070B (en)

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102761540A (en) * 2012-05-30 2012-10-31 北京奇虎科技有限公司 Data compression method, device and system and server
CN109062774A (en) * 2018-06-21 2018-12-21 平安科技(深圳)有限公司 Log processing method, device and storage medium, server
CN109101504A (en) * 2017-06-20 2018-12-28 恒为科技(上海)股份有限公司 A kind of efficient log compression and indexing means
CN111526151A (en) * 2020-04-28 2020-08-11 网易(杭州)网络有限公司 Data transmission method and device, electronic equipment and storage medium
CN111817722A (en) * 2020-07-09 2020-10-23 北京奥星贝斯科技有限公司 Data compression method and device and computer equipment

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11163624B2 (en) * 2017-01-27 2021-11-02 Pure Storage, Inc. Dynamically adjusting an amount of log data generated for a storage system

Also Published As

Publication number Publication date
CN114697070A (en) 2022-07-01

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant