CN110891030A - HTTP traffic characteristic identification and extraction method based on machine learning - Google Patents

HTTP traffic characteristic identification and extraction method based on machine learning

Info

Publication number
CN110891030A
Authority
CN
China
Prior art keywords
rule
http
data
feature
format
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Granted
Application number
CN201911364419.4A
Other languages
Chinese (zh)
Other versions
CN110891030B (en)
Inventor
祝远鉴
王懿
韩震
汪洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
NANJING FIBERHOME INFORMATION DEVELOPMENT Co Ltd
Original Assignee
NANJING FIBERHOME INFORMATION DEVELOPMENT Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by NANJING FIBERHOME INFORMATION DEVELOPMENT Co Ltd
Priority to CN201911364419.4A
Publication of CN110891030A
Application granted
Publication of CN110891030B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/24 Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2483 Traffic characterised by specific attributes, e.g. priority or QoS involving identification of individual flows
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/04 Protocols for data compression, e.g. ROHC
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/06 Notations for structuring of protocol data, e.g. abstract syntax notation one [ASN.1]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L69/00 Network arrangements, protocols or services independent of the application payload and not provided for in the other groups of this subclass
    • H04L69/22 Parsing or analysis of headers
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computer Security & Cryptography (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The invention discloses an HTTP traffic feature identification and extraction method based on machine learning, comprising the following steps: step 1, identifying and collecting HTTP traffic; step 2, performing feature detection and generating rules; step 3, extracting HTTP traffic features. Compared with the regular-expression-based feature extraction currently on the market, the method improves feature accuracy and reduces the probability that dirty data is extracted by mistake; compared with manual feature labeling, it reduces labor cost and shortens the response time for newly appearing features. In addition, feature/rule generation is separated from feature extraction, so a dedicated extraction engine can be designed and extraction efficiency improved.

Description

HTTP traffic characteristic identification and extraction method based on machine learning
Technical Field
The invention relates to an HTTP traffic feature identification and extraction method based on machine learning.
Background
On today's Internet, a large amount of HTTP traffic travels over the network, and it carries a large amount of valuable data. Collecting this data and integrating it into a knowledge base makes it possible to learn of information promptly, respond to events, and make decisions. Many methods currently exist for parsing HTTP data and extracting valid features, such as regular-expression-based extraction, feature-matching-based extraction, and machine-learning-based feature recognition.
Although some feature recognition and extraction products are already on the market, each has shortcomings. Extraction based on feature formats such as regular expressions and state machines is prone to extracting features incorrectly, which pollutes the feature library and can render the whole library unusable. Feature-matching extraction based on manual analysis has a high labor cost and responds slowly to newly added features. Methods based on machine learning put extraction efficiency under heavy strain and cannot withstand high-volume traffic.
Disclosure of Invention
To respond quickly to newly added features, improve the accuracy of feature identification, raise the efficiency of feature extraction, and reduce the cost of manual intervention, the invention provides an HTTP traffic feature identification and extraction method based on machine learning, which generates extraction rules, extracts features quickly, and stores them in a database.
The invention comprises the following steps:
step 1, identifying and collecting HTTP traffic;
step 2, performing feature detection and generating rules;
step 3, extracting HTTP traffic features.
The step 1 comprises the following steps:
step 1-1, traffic sampling: the incoming traffic is usually large and contains protocols from every layer of the five-layer TCP/IP model; the header information from the link layer to the transport layer is parsed to obtain the five-tuple (source IP, destination IP, source port, destination port, and protocol), non-TCP traffic is filtered out, and the TCP traffic is sampled per session;
step 1-2, session reassembly: when TCP segments transmitted over the network arrive out of order or are lost, they must be reassembled to obtain the complete application-layer data. Session reassembly is performed according to the sequence number and the acknowledgment number in the TCP segments. After the connection is established, the client actively sends a data packet to the server; the sequence number and acknowledgment number in this packet are the same as those in the packet of the third step of connection establishment. The server receives the data packet and sends an acknowledgment to the client; in that packet, the sequence number is the acknowledgment number of the previous packet, and the acknowledgment number is the sum of the sequence number of the packet sent by the client and the size of the data it carries. The TCP session is reassembled in order according to this relation between sequence numbers and acknowledgment numbers, and reassembly is complete once the four-way teardown (FIN) segments have been received. Out-of-order segments are reordered. For lost segments, the whole session is buffered for 60 s to determine whether packets are actually missing; if the four teardown segments have been seen but the session is incomplete and the segments are still missing after 60 s, the message is discarded;
step 1-3, detecting the application-layer payload and identifying HTTP traffic according to the HTTP protocol format defined in the RFC (Request for Comments) 7230, 7231, 7232, 7233, 7234, and 7235 specifications;
step 1-4, HTTP deduplication: the HOST and URL fields are extracted from the identified HTTP traffic according to the HTTP structure; if the URL carries parameters, the parameters are removed from the URL; deduplication is then performed on the HOST and URL fields, for example, at most 10 sessions with identical HOST and URL are admitted within 30 minutes, and further sessions with the same HOST and URL are discarded directly;
step 1-5, HTTP validity pre-screening: used to filter out worthless data. The pre-screening matches and scores traffic against a keyword knowledge base, which contains two kinds of data: keywords, accumulated by manually analyzing HTTP data and screening features, and the score corresponding to each keyword, computed as the ratio of that keyword's occurrence frequency to the occurrences of all keywords, ignoring letter case. The HTTP session is searched for keywords; for each keyword found, its score is added to a running total. If the final score is greater than the threshold 50, the HTTP traffic is judged valuable and passed on to steps 2 and 3 for further analysis.
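The pre-screening in step 1-5 can be sketched as a simple accumulation over a keyword table. This is a minimal illustration, not the patented implementation: the names `prescreen_score`, `is_valuable`, and the contents of `KEYWORD_SCORES` are hypothetical, and the scores are made-up values rather than frequency ratios derived from real traffic.

```python
# Hypothetical keyword knowledge base: keyword -> score (illustrative values only).
KEYWORD_SCORES = {"username": 30, "imei": 25, "phone": 20}

def prescreen_score(session_text, keyword_scores):
    """Accumulate the score of every knowledge-base keyword found in the session.

    Matching ignores letter case, as the description requires.
    """
    text = session_text.lower()
    return sum(score for kw, score in keyword_scores.items() if kw.lower() in text)

def is_valuable(session_text, keyword_scores, threshold=50):
    """Traffic is kept only when the accumulated score is strictly above the threshold."""
    return prescreen_score(session_text, keyword_scores) > threshold
```

For example, a session containing both `UserName` and `IMEI` scores 55 under this table and passes, while a session containing only `phone` scores 20 and is filtered out.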
In step 1-1, the TCP traffic is sampled per session as follows: the SYN and ACK flags in the TCP header are used to determine whether a packet is the first packet of the three-way handshake; if so, a random number in [1, 100] is generated and the session is accepted or rejected according to the sampling ratio. For example, with a sampling ratio of 10%, the session is accepted when the random number falls within [1, 10], and all packets of the session, from the three-way handshake through the four-way teardown, are forwarded.
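A minimal sketch of this per-session sampling follows. It assumes the first handshake packet is the one with SYN set and ACK clear, and that packets have already been parsed into a five-tuple plus a flags byte; `SessionSampler` and `session_key` are hypothetical names, and cleanup of finished sessions is omitted.

```python
import random

SYN, ACK = 0x02, 0x10  # TCP flag bits

def session_key(src_ip, dst_ip, src_port, dst_port, proto):
    """Direction-independent key so both halves of a session map to one entry."""
    a, b = sorted([(src_ip, src_port), (dst_ip, dst_port)])
    return (a, b, proto)

class SessionSampler:
    def __init__(self, ratio_percent, rng=None):
        self.ratio_percent = ratio_percent      # e.g. 10 for a 10% sampling ratio
        self.rng = rng or random.Random()
        self.accepted = set()                   # keys of sessions being forwarded

    def keep(self, five_tuple, tcp_flags):
        """Return True if this packet belongs to a sampled session."""
        key = session_key(*five_tuple)
        first_packet = bool(tcp_flags & SYN) and not (tcp_flags & ACK)
        if first_packet:
            # Draw from [1, 100]; accept when the draw falls inside the ratio.
            if self.rng.randint(1, 100) <= self.ratio_percent:
                self.accepted.add(key)
            else:
                self.accepted.discard(key)
        return key in self.accepted
```

Deciding once on the first packet and then looking up the session key keeps every later packet of an accepted session, which is what allows the session to be reassembled afterwards.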
The step 2 comprises the following steps:
step 2-1, message compression and encoding detection: the HTTP header usually indicates the compression format of the message content in the Content-Encoding header, and the common deflate, gzip, and zlib formats are decompressed accordingly. If the HTTP header contains no Content-Encoding header, the HTTP content is checked for gzip and zlib magic bytes; if they are present, decoding in the corresponding format is attempted, and if an exception occurs during decoding, the encoding detection process exits; if they are absent, decompression is skipped. Common encodings such as URL encoding, base64, and escape are decoded. The HTTP header indicates the character set (charset) in the Content-Type header; non-UTF-8 character sets are converted to UTF-8, and UTF-8 is assumed when no character set is indicated;
step 2-2, message word segmentation: the message is decomposed according to the MIME (Multipurpose Internet Mail Extensions) type specified by its Content-Type; MIME types not defined in the HTTP specification are also decomposed, and messages without MIME information are decomposed as application/x-www-form-urlencoded by default;
step 2-3, feature identification: each segmented field is identified along 2 dimensions, linear classifier classification and knowledge base judgment; a score is calculated from the weight of each dimension, and the field passes detection if the score is greater than the threshold 80;
step 2-4, feature marking: recording the feature tag as ten-tuple information (HOST, URL, location, encoding, prefix, offset, suffix, association, ordering, field meaning);
step 2-5, feature integration: the feature tag information is archived and sorted, each field of the rule ten-tuple is used as training data, and K-Means clustering is applied to reduce the number of features and obtain feature similarity (reference: Dharmendra S. Modha, W. Scott Spangler, Feature Weighting in k-Means Clustering, Machine Learning, 2003, 52: 217-237);
step 2-6, rule generation: feature difference is judged from the feature similarity, and marked features whose difference is lower than the minimum threshold 10 are merged. The rules produced by merging fall into two types:
the first type: the URL fields are highly similar, so long common prefixes and common suffixes can be extracted; how much is extracted depends on the URL length and is usually half of it. The common prefix and common suffix are kept, the middle part is replaced with a wildcard to produce a new URL, and the new URL replaces the URL field in the original rules to generate a new rule;
the second type: the suffix fields share a common prefix (for example, abc and aef share the prefix a); the common prefix is used as the suffix of the rule to generate a new rule, which replaces the original rules. Rules whose difference is not lower than the minimum threshold are left unprocessed, and the rule format still conforms to the ten-tuple format of the feature tag;
step 2-7, rule merging: newly generated rules are integrated into the tree structure of already evaluated rules, and a rule file is generated in which every rule records its update time. In the tree structure, the HOST field forms the first layer, the URL the second layer, and the other fields of the rule the third layer; when the HOST is the same, the HOST layers are merged, and so on downward layer by layer until the data fields differ. If a newly generated rule already exists in the rule tree, its time is refreshed to keep the rule current. The rule file is as follows:
[Figure: rule file listing (two images, not reproduced in this text)]
step 2-8, data extraction: an extraction engine conforming to the ten-tuple rule format is designed. It loads the rule file generated in step 2-7 to build the rule tree, takes in traffic, and obtains HTTP traffic as in steps 1-2 and 1-3. It parses the HOST and URL fields in the upstream (request) header and compares them with the name attributes of the HOST and URL layers of the rule tree as a fast filter: if they match, the features are extracted according to the location, encoding, prefix, offset, and suffix attributes given in the rule layer; if not, the HTTP traffic is discarded directly, which achieves fast screening and extraction of traffic;
step 2-9, rule evaluation: verification is performed by format and by knowledge base. Format verification is applied to individual features; it usually leaves two or more features, which are then checked for relevance against the knowledge base. The rule is evaluated by calculating the weighted harmonic mean of precision and recall and judging by its value; the calculation formula is as follows:
$$F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R}$$
where $F_\beta$ is the weighted harmonic mean, P is the precision, R is the recall, and β measures the relative importance of recall to precision; because precision is more important here, β < 1 is set.
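As a worked illustration, the weighted harmonic mean of precision and recall is commonly computed with the standard F-beta formula, (1 + β²)·P·R / (β²·P + R). The sketch below assumes that standard form; the function name `f_beta` and the default β = 0.5 are illustrative choices, not values given in the source.

```python
def f_beta(precision, recall, beta=0.5):
    """Weighted harmonic mean of precision and recall (standard F-beta).

    beta < 1 weights precision more heavily than recall.
    """
    if precision == 0 and recall == 0:
        return 0.0  # avoid division by zero when nothing was extracted correctly
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

With β = 0.5, a rule with P = 0.9 and R = 0.5 scores about 0.776, while a rule with P = 0.5 and R = 0.9 scores about 0.549, so the more precise rule is preferred, which matches the stated priority on precision.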
In step 2-3, the score is calculated from the weight of each dimension as follows:
Score = α*LR + β*LC, α + β = 1
where α and β are the weights, LR is the linear classifier score (between 0 and 100), and LC is the knowledge base check score (between 0 and 100), so Score also lies between 0 and 100. The linear classifier is trained on a corpus prepared in advance: the length, format information, and context string of the field being judged are collected as features and converted to numeric form with a bag-of-words model, and the linear classifier is then obtained by training. The knowledge base stores keywords and weight information; when the HTTP message body is found to contain keywords stored in the knowledge base, the corresponding weights are accumulated.
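The score fusion can be illustrated with a few lines of arithmetic. In this sketch the classifier and knowledge-base scores are taken as given inputs; the function names and the default weight α = 0.6 are assumptions for illustration, since the source does not state the weights.

```python
def fused_score(lr_score, lc_score, alpha=0.6):
    """Score = alpha * LR + beta * LC with alpha + beta = 1; inputs are in [0, 100]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    beta = 1.0 - alpha
    return alpha * lr_score + beta * lc_score

def passes_detection(lr_score, lc_score, alpha=0.6, threshold=80):
    """A field passes only when the fused score is strictly greater than the threshold."""
    return fused_score(lr_score, lc_score, alpha) > threshold
```

With α = 0.5, scores of LR = 90 and LC = 70 fuse to exactly 80, which does not pass because the threshold must be exceeded; LR = 100 and LC = 80 fuse to 90 and pass.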
In step 2-4, the HOST field represents the HTTP domain name; the URL field indicates the resource path; the location, encoding, prefix, offset, and suffix fields are interrelated: the location indicates where the field occurs (upstream header, upstream content, downstream header, or downstream content), the encoding indicates the encoding type used at that location, the prefix is the text that precedes the identified field, the offset is the number of bytes by which the identified field is offset from the prefix, and the suffix is the text that follows the identified field; the association field describes the interdependence between ten-tuples; the ordering field marks the output order of the extracted fields; the field meaning indicates what the identified field means.
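One way to picture the ten-tuple is as a small record type together with a prefix/offset/suffix extraction routine. The sketch below is illustrative only: the class name `FeatureTag`, the attribute names, the `LOCATIONS` strings, and the choice to measure the offset in characters of already-decoded text are all assumptions.

```python
from dataclasses import dataclass
from typing import Optional

# The four locations named in the description (identifier spellings are assumed).
LOCATIONS = ("upstream_header", "upstream_content",
             "downstream_header", "downstream_content")

@dataclass(frozen=True)
class FeatureTag:
    host: str                    # HTTP domain name
    url: str                     # resource path
    location: str                # one of LOCATIONS
    encoding: str                # encoding used at that location
    prefix: str                  # text preceding the field
    offset: int                  # distance from the end of the prefix to the field
    suffix: str                  # text following the field
    association: Optional[str]   # dependency on another ten-tuple, if any
    ordering: int                # output order of the extracted field
    meaning: str                 # what the field means

def extract_field(text, tag):
    """Return the substring located by prefix + offset and ended by suffix, or None."""
    start = text.find(tag.prefix)
    if start < 0:
        return None
    begin = start + len(tag.prefix) + tag.offset
    end = text.find(tag.suffix, begin) if tag.suffix else len(text)
    if end < 0:
        return None
    return text[begin:end]
```

For instance, a tag with prefix `uid=`, offset 0, and suffix `&` applied to `uid=12345&token=x` yields `12345`.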
The step 3 comprises the following steps:
step 3-1, rule loading: two rule objects are supported by default, one active and one inactive. When a new rule is generated, the inactive rule object detects the update and loads the rules; once loading has finished, that object is set to active and the previously active object is set to inactive. Data extraction only ever uses the active rules, so rule loading and extraction proceed without interruption;
step 3-2, extraction marking: after the extraction engine has extracted the data, a service tag can be added as supplementary information for the extracted data; the tag is stored in a Tag-Length-Value (TLV) encoding structure, which makes the service tags easy to extend;
step 3-3, data storage: the data is written into a database cluster. Because the extracted fields are not known in advance, the data structure cannot be fixed, and the database must be chosen according to actual business requirements: if transactional requirements are weak, a non-relational database can be chosen to store the extracted fields; if transactional requirements are strong, a relational database is chosen and two tables are created, one storing the extracted data and the other storing the meanings of the extracted fields.
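The two-object rule loading of step 3-1 resembles a double-buffer swap. The sketch below is one possible reading, with a hypothetical `RuleStore` class; how rule updates are detected and how rule files are parsed is left out.

```python
import threading

class RuleStore:
    """Holds two rule slots; readers only ever see the active one."""

    def __init__(self, initial_rules):
        self._slots = [initial_rules, None]
        self._active = 0                  # index of the slot currently in effect
        self._lock = threading.Lock()     # serializes concurrent reloads

    def active_rules(self):
        """Rules used by the extraction path."""
        return self._slots[self._active]

    def reload(self, new_rules):
        """Load new rules into the inactive slot, then flip it to active."""
        with self._lock:
            standby = 1 - self._active
            self._slots[standby] = new_rules   # extraction still uses the old slot here
            self._active = standby             # single assignment makes the new rules live
```

Because the new rules are fully loaded before the index flips, extraction never observes a half-loaded rule set.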
Advantageous effects: compared with the regular-expression-based feature extraction currently on the market, the method improves feature accuracy and reduces the probability that dirty data is extracted by mistake; compared with manual feature labeling, it reduces labor cost and shortens the response time for newly appearing features. In addition, feature/rule generation is separated from feature extraction, so a dedicated extraction engine can be designed and extraction efficiency improved.
Drawings
The foregoing and other advantages of the invention will become more apparent from the following detailed description of the invention when taken in conjunction with the accompanying drawings.
FIG. 1 is a diagram of the process architecture of the present invention.
Detailed Description
The invention is further explained below with reference to the drawings and the embodiments.
As shown in FIG. 1, the present invention provides an HTTP traffic feature identification and extraction method based on machine learning, which specifically includes:
1. HTTP traffic identification and collection
The HTTP traffic identification and collection module is the data source for the feature detection and rule generation module. The traffic it supplies to the back end must be both diverse and valid: diversity means covering as many types of HTTP traffic as possible, and validity means filtering out meaningless messages to reduce the processing load on the back end. The specific process is as follows:
1.1 HTTP traffic identification
(1) The ports of the incoming traffic are not restricted, and the full traffic is sampled;
(2) TCP sessions are reassembled to restore the session context;
(3) the application-layer payload is inspected (DPI) and HTTP traffic is identified;
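The session reassembly in item (2) can be sketched for one direction of a connection as sorting segments by sequence number, dropping retransmissions, and stopping at the first gap. This is a simplified illustration: the function name `reassemble` is hypothetical, sequence-number wraparound at 2^32 is ignored, and the 60 s buffering for late segments is not modeled.

```python
def reassemble(segments, initial_seq):
    """Rebuild one direction of a TCP stream.

    segments: iterable of (sequence_number, payload_bytes), in arrival order.
    initial_seq: sequence number of the first payload byte.
    Returns (data, complete); complete is False when a gap (lost segment) is found.
    """
    expected = initial_seq
    out = bytearray()
    for seq, payload in sorted(segments, key=lambda s: s[0]):
        if not payload:
            continue                      # pure ACKs carry no data
        end = seq + len(payload)
        if end <= expected:
            continue                      # retransmission of data already seen
        if seq > expected:
            return bytes(out), False      # gap: a segment is missing
        out += payload[expected - seq:]   # skip any overlapping bytes
        expected = end
    return bytes(out), True
```

A caller could keep a session whose result is incomplete in a buffer and retry later, discarding it if the gap persists.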
1.2 HTTP traffic sampling
(1) HTTP traffic is deduplicated by HOST and URL, which reduces the load of processing repeated traffic;
(2) HTTP traffic validity pre-screening filters out worthless data: traffic is matched and scored against pre-screened keywords, and traffic whose score exceeds a threshold is considered valuable. The screening granularity is deliberately coarse so that the traffic supplied to the back end remains diverse;
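The HOST and URL deduplication in item (1) can be sketched as a per-key counter over a time window. The class name `HostUrlDeduplicator`, the fixed (rather than sliding) window, and lowercasing the host are assumptions; the limits of 10 sessions per 30 minutes follow the example given earlier in the description.

```python
import time
from urllib.parse import urlsplit

class HostUrlDeduplicator:
    def __init__(self, max_per_window=10, window_seconds=1800):
        self.max_per_window = max_per_window
        self.window_seconds = window_seconds
        self._seen = {}   # (host, path) -> (window_start, count)

    def admit(self, host, url, now=None):
        """Return True if this session should be kept, False if it is a surplus duplicate."""
        now = time.time() if now is None else now
        key = (host.lower(), urlsplit(url).path)   # drop URL parameters before comparing
        start, count = self._seen.get(key, (now, 0))
        if now - start >= self.window_seconds:
            start, count = now, 0                  # window expired: start a new one
        if count >= self.max_per_window:
            return False
        self._seen[key] = (start, count + 1)
        return True
```

Stripping the query string first means `/p?x=1` and `/p?x=2` count as the same resource.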
2. Feature detection and rule generation
The feature detection and rule generation module is the core module of the whole system, and the rules it outputs directly affect the extraction results. The module can generate rules automatically and also allows manual intervention, in which the result of each intermediate stage is inspected and corrected. The specific process is as follows:
2.1 Feature detection
(1) Message compression and encoding detection: common compression formats and encodings are decoded so that the content is presented in plaintext;
(2) Message word segmentation: the message is decomposed according to the MIME type specified by its Content-Type, and exploratory decomposition is attempted for unknown formats;
(3) Feature identification: identification is based on 2 dimensions; a score is calculated from the weight of each dimension, and the field passes detection if the score is greater than a threshold;
Score = α*LR + β*LC, α + β = 1
where α and β are the weights, LR (logistic regression) is the linear classifier score, between 0 and 100, and LC (library check) is the knowledge base check score, between 0 and 100, so Score also lies between 0 and 100. The linear classifier is trained on a corpus prepared in advance: the length, format information, and context string of the field being judged are collected as features and converted to numeric form with a bag-of-words model, and the linear classifier is then obtained by training;
(4) Feature marking: recording the feature tag as ten-tuple information (HOST, URL, location, encoding, prefix, offset, suffix, association, ordering, field meaning);
where: the HOST field represents the HTTP domain name; the URL field indicates the resource path, which may contain parts that can be handled with fuzzy matching, such as dates and numbers; the location and encoding fields are associated and indicate levels, each level comprising location and encoding information; the association field describes the interdependence between ten-tuples; the ordering field marks the output order of the extracted fields and mainly addresses message content fields appearing out of order.
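The compression detection of item (1) can be sketched with the Python standard library. The function name `decode_body` and the convention of returning `None` on a failed decode are assumptions; the gzip magic bytes `1f 8b` and the zlib header check follow the published gzip and zlib formats.

```python
import gzip
import zlib

def decode_body(body, content_encoding=None):
    """Decompress an HTTP body, using Content-Encoding or magic bytes.

    Returns the decoded bytes, the body unchanged if it is not compressed,
    or None when a decode attempt raises (detection is abandoned).
    """
    enc = (content_encoding or "").strip().lower()
    try:
        if enc == "gzip" or (not enc and body[:2] == b"\x1f\x8b"):
            return gzip.decompress(body)
        if enc == "deflate":
            try:
                return zlib.decompress(body)                    # zlib-wrapped deflate
            except zlib.error:
                return zlib.decompress(body, -zlib.MAX_WBITS)   # raw deflate stream
        # zlib streams start with 0x78 and a 16-bit header divisible by 31.
        if (not enc and len(body) >= 2 and body[0] == 0x78
                and ((body[0] << 8) | body[1]) % 31 == 0):
            return zlib.decompress(body)
    except (OSError, EOFError, zlib.error):
        return None
    return body
```

Character-set conversion and URL/base64 decoding would follow as separate passes over the returned bytes.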
2.2 Rule learning
(1) Feature integration: the feature tag information is archived and sorted, and K-Means clustering is applied to reduce the number of features and obtain feature similarity;
(2) Rule generation: feature difference is judged from the feature similarity; marked features whose difference is below a threshold are merged and those above the threshold are split, forming the final feature extraction rules, whose format still conforms to the ten-tuple format of the feature tag.
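Merging similar URL fields into one wildcard rule can be sketched by keeping the common prefix and common suffix and blurring the middle. The function name `merge_urls` and the `*` wildcard are illustrative assumptions; the similarity test that decides which URLs to merge is not shown.

```python
import os

def merge_urls(urls, wildcard="*"):
    """Collapse similar URLs into prefix + wildcard + suffix."""
    if len(set(urls)) == 1:
        return urls[0]                       # nothing to blur
    prefix = os.path.commonprefix(urls)      # character-wise common prefix
    suffix = os.path.commonprefix([u[::-1] for u in urls])[::-1]
    # Keep prefix and suffix from overlapping on the shortest URL.
    overlap = len(prefix) + len(suffix) - min(len(u) for u in urls)
    if overlap > 0:
        suffix = suffix[overlap:]
    return prefix + wildcard + suffix
```

For example, `/api/v1/user/1001/profile.json` and `/api/v1/user/2002/profile.json` merge into `/api/v1/user/*/profile.json`.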
2.3 Rule evaluation
(1) Rule merging: the rule learning submodule continuously generates rules, which must be integrated into the already evaluated rules. For a rule that already exists, its time is refreshed to keep the rule current; a rule that differs is evaluated promptly to keep the rule set diverse;
(2) Data extraction: an extraction engine conforming to the ten-tuple rule format is designed to extract the features in HTTP traffic according to the rules. The engine first filters quickly on the HOST and URL of the rule ten-tuple so that data is screened fast, and then extracts normalized features according to the location and encoding information;
(3) Rule evaluation: verification is performed by format and by knowledge base. Format verification is applied to individual features and usually leaves several features; these can then be checked for relevance against the knowledge base, since features that are related to one another make the data more reliable. The rule is evaluated by calculating the weighted harmonic mean of precision and recall and judging by its value; the calculation formula is as follows:
$$F_\beta = \frac{(1+\beta^2)\,P\,R}{\beta^2 P + R}$$
where $F_\beta$ is the weighted harmonic mean, P is the precision, R is the recall, and β measures the relative importance of recall to precision; because precision is more important here, β < 1 is set.
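The rule merging and HOST/URL fast filter described in items (1) and (2) of this subsection can be sketched as a three-level dictionary. `RuleTree` and its methods are hypothetical names, URL matching here is exact (wildcard URLs would need pattern matching instead), and each rule's remaining fields are represented as a plain tuple.

```python
import time

class RuleTree:
    """HOST -> URL -> {rule_fields: last_update_time}."""

    def __init__(self):
        self._tree = {}

    def merge(self, host, url, fields, now=None):
        """Insert a rule, or refresh its timestamp if it already exists.

        Returns True when the rule is new, False when only the time was refreshed.
        """
        now = time.time() if now is None else now
        leaf = self._tree.setdefault(host, {}).setdefault(url, {})
        is_new = fields not in leaf
        leaf[fields] = now
        return is_new

    def updated_at(self, host, url, fields):
        return self._tree[host][url][fields]

    def lookup(self, host, url):
        """Fast filter: rules for this HOST and URL, or an empty list to discard the traffic."""
        return list(self._tree.get(host, {}).get(url, {}))
```

Two dictionary lookups reject unmatched traffic before any field-level extraction is attempted, which is the point of placing HOST and URL at the top of the tree.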
2.4 Manual detection and intervention
(1) Manual detection: manual detection is available at 4 points, before and after the feature detection, rule learning, and rule evaluation submodules. It lets an operator search for specified data manually to check the state of feature detection and rule generation, and also to check the quality of data recommended by the system; recommendations are ranked by a weighted score of feature quality, data volume, and the importance of the HTTP traffic, with higher scores recommended first;
(2) Manual intervention: functions are provided for importing into, correcting the results of, and cutting off the workflow. User-specified or third-party data can be imported into the workflow as needed; output results can be corrected after manual detection; and for invalid output the workflow can be terminated early so that no subsequent operations are performed;
3. HTTP traffic feature extraction
3.1 Feature extraction
(1) Rule loading: hot loading of rules is supported, so that rule loading and extraction proceed without interruption;
(2) Extraction marking: after the extraction engine has extracted the data, a service tag is added.
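The service tag added after extraction is stored, according to the description, in a Tag-Length-Value (TLV) structure. The sketch below assumes a 2-byte tag and a 2-byte length in network byte order; those field widths and the function names are illustrative choices, since the source does not specify the layout.

```python
import struct

def tlv_encode(tag, value):
    """Pack one item as 2-byte tag, 2-byte length, then the value bytes."""
    return struct.pack("!HH", tag, len(value)) + value

def tlv_decode(buf):
    """Parse a buffer of concatenated TLV items into a list of (tag, value)."""
    items, pos = [], 0
    while pos < len(buf):
        if pos + 4 > len(buf):
            raise ValueError("truncated TLV header")
        tag, length = struct.unpack_from("!HH", buf, pos)
        pos += 4
        if pos + length > len(buf):
            raise ValueError("truncated TLV value")
        items.append((tag, buf[pos:pos + length]))
        pos += length
    return items
```

Because each item carries its own length, a reader can skip tags it does not recognize, which is what makes new service tags easy to add.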
3.2 Feature warehousing
Data storage: the data is written into the database cluster and stored according to actual business requirements.
The invention realizes the following innovations:
Feature detection: HTTP traffic is preprocessed by decompression, decoding, and word segmentation, HTTP traffic features are extracted, and the features are judged by combining a machine learning algorithm with knowledge base scoring;
Rule learning: features are merged and split by similarity to generate the rule ten-tuples;
Rule evaluation: an extraction engine is designed to extract data, and extraction quality is evaluated by the weighted harmonic mean;
Manual detection and intervention: a means of manual intervention is provided; feature detection, rule learning, and rule evaluation can each be run independently, and adding manual intervention improves the validity and reliability of the generated rules at a later stage.
The present invention provides an HTTP traffic feature identification and extraction method based on machine learning, and there are many methods and ways of implementing this technical solution. The above description is only a preferred embodiment of the present invention. It should be noted that those skilled in the art may make a number of improvements and modifications without departing from the principle of the present invention, and these improvements and modifications should also be regarded as falling within the protection scope of the present invention. All components not specified in the present embodiment can be realized with the prior art.

Claims (7)

1. An HTTP traffic feature identification and extraction method based on machine learning, characterized by comprising the following steps:
step 1, identifying and collecting HTTP traffic;
step 2, performing feature detection and generating rules;
step 3, extracting HTTP traffic features.
2. The method of claim 1, wherein step 1 comprises:
step 1-1, flow sampling: analyzing the header information from a link layer to a transmission layer, acquiring quintuple information (source IP, sink IP, source port, sink port and protocol), filtering the traffic of non-TCP, and sampling the TCP traffic according to a session;
step 1-2, session reorganization: according to the serial number and the confirmation serial number in the TCP message, the session reorganization establishes connection and sends a data packet to the client and the server actively, the serial number in the data packet is the same as the serial number and the confirmation serial number in the data packet in the third step of establishing connection, the server receives the data packet and sends confirmation data to the client, the serial number in the data packet is the confirmation number in the previous data packet, the confirmation number is the sum of the serial number in the data packet sent by the client and the size of data carried in the data packet, according to the relation between the serial number and the confirmation serial number, the TCP session is reorganized according to the sequence, the session reorganization is completed when 4 hand-waving messages are received, the order of disordered messages is reorganized, for the lost messages, the whole session needs to be stored for 60s temporarily to judge whether the phenomenon of packet loss exists or not, if the 4 hand-waving messages are found and the session is incomplete, and the loss exists, discarding the message;
step 1-3, detecting application layer load, and identifying HTTP flow according to an HTTP protocol format formulated by RFC protocol specification;
step 1-4, HTTP deduplication: extracting the HOST and URL fields in the identified HTTP traffic according to an HTTP structure, wherein if the URL fields carry parameters, the parameters in a URL protocol are required to be removed, and duplication is removed according to the HOST and the URL fields;
step 1-5, HTTP validity preliminary screening: the method is used for filtering worthless data, matching and scoring are carried out according to a keyword knowledge base through preliminary screening, the keyword knowledge base comprises two types of data, one type is keywords, the other type is corresponding scores of the keywords, whether the keywords are contained in the HTTP session or not is searched, if yes, scoring is obtained in a mode of accumulating the corresponding scores, and the value of the HTTP flow is judged if the final scoring is larger than a threshold value 50, and the HTTP flow is provided for further analysis in the step 2 and the step 3.
3. The method according to claim 2, wherein in step 1-1 the TCP traffic is sampled per session as follows: the SYN and ACK flags in the TCP header are used to judge whether a packet is the first packet of a three-way handshake, and if so, the session is accepted or rejected according to the sampling ratio by means of a random number generated in the range [1, 100].
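The per-session sampling of claim 3 could be sketched as follows. This is an assumption-laden illustration: the function names are hypothetical, the sampling ratio is taken to be a percentage, and the "first packet" is taken to be the one with SYN set and ACK clear. The flag bit values are the standard TCP ones.

```python
import random

SYN = 0x02  # TCP SYN flag bit
ACK = 0x10  # TCP ACK flag bit


def is_handshake_first_packet(flags: int) -> bool:
    """First packet of the three-way handshake: SYN set, ACK clear."""
    return bool(flags & SYN) and not (flags & ACK)


def accept_session(flags: int, sampling_ratio: int, rng: random.Random) -> bool:
    """Decide whether a packet starts a session that should be tracked.

    sampling_ratio is assumed to be a percentage between 0 and 100.
    """
    if not is_handshake_first_packet(flags):
        return False  # not the start of a new session
    # Draw a random number in [1, 100] and accept if it falls within the ratio.
    return rng.randint(1, 100) <= sampling_ratio
```

Later packets of an accepted session would be matched against a session table rather than re-sampled; that table is outside this sketch.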
4. The method of claim 3, wherein step 2 comprises:
step 2-1, message compression and encoding detection: the HTTP traffic header indicates the compression format of the message content in the Content-Encoding header, and the content is decompressed according to that format; if the HTTP header does not contain a Content-Encoding header, it is judged whether the HTTP content contains a gzip or zlib magic header; if so, decoding in the corresponding format is attempted, and the encoding detection process is exited if an exception occurs during decoding; if not, the decompression process is skipped directly; the encoded format is then decoded: the HTTP traffic header indicates the character set (charset) in the Content-Type header, a non-UTF-8 character set is uniformly converted into the UTF-8 character set, and where no character set is indicated, UTF-8 encoding is used by default;
step 2-2, message word segmentation: the message is decomposed according to the MIME type specified in the message Content-Type; where the MIME type is not one specified by the HTTP specification, or no MIME information is present, the message is decomposed in the default application/x-www-form-urlencoded format;
step 2-3, feature identification: each segmented field is identified along 2 dimensions, namely linear-classifier classification and knowledge-base judgment; a score is calculated from the weight of each dimension, and the detection is judged to have passed if the score is greater than a threshold of 80;
step 2-4, feature marking: the feature mark is recorded as ten-tuple information (HOST, URL, position, encoding, prefix, offset, suffix, association, ordering, field meaning);
step 2-5, feature integration: the feature mark information is filed and sorted, each field of the rule ten-tuple is input as training data, and K-Means clustering is adopted to reduce the number of features and obtain the feature similarity;
step 2-6, rule generation: the feature difference is judged according to the feature similarity, and marked features whose difference is lower than a minimum threshold of 10 are merged; the rules generated by merging fall into two types:
the first type: a large number of common prefixes and common suffixes can be extracted from the URL field; the amount of extracted data is considered in relation to the URL length and must reach half of the URL length; in this case the common prefix and the common suffix are extracted, the middle part is blurred with a wildcard to generate a new URL, and the new URL replaces the URL field of the original rules to generate a new rule;
the second type: a common prefix can be obtained from the suffix field; the common prefix is used as the suffix of the rule to generate a new rule, and the new rule replaces the original multiple rules;
rules whose difference is not lower than the minimum threshold are left unprocessed, and the rule format still conforms to the ten-tuple format of the feature mark;
step 2-7, rule merging: newly generated rules are integrated into the evaluated rule tree structure and a rule file is generated; each rule records its update time; the tree structure is represented in XML format, with the HOST field as the first layer, the URL as the second layer and the other fields of the rule as the third layer; when the HOST values are the same, the HOST layers are merged, and merging proceeds downward layer by layer until the data of a layer differ; if a newly generated rule already exists in the rule tree, its rule time is refreshed to guarantee the timeliness of the rule;
step 2-8, data extraction: an extraction engine conforming to the ten-tuple rule format is designed; the rule file generated in step 2-7 is loaded to form the rule tree structure; traffic is accessed and HTTP traffic is extracted according to step 1-2 and step 1-3; the HOST and URL fields in the upstream header of the HTTP traffic are parsed and compared with the attributes of the HOST and URL layers of the rule tree structure; if they are consistent, the internal features are extracted according to the (position, encoding, prefix, offset, suffix) attribute fields indicated in the rule layer; if not, the HTTP traffic is discarded directly, thereby achieving rapid screening and extraction of the traffic;
step 2-9, rule evaluation: verification is performed by format and by knowledge base; the format verification is applied to the features; where more than two features remain after the format verification, association verification is performed through the knowledge base; the rule is evaluated by calculating the weighted harmonic mean of the precision and the recall and judging according to the size of its value, the calculation formula being:
Fβ=(1+β²)*P*R/(β²*P+R)
wherein Fβ is the weighted harmonic mean, P is the precision, R is the recall, and β measures the relative importance of the recall with respect to the precision, with β<1 being set.
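The rule evaluation measure of step 2-9 can be illustrated with a short Python sketch. It assumes that the weighted harmonic mean referred to is the standard F-beta measure; the function name, the default β of 0.5 and the zero-division convention are hypothetical choices, the claim only requiring β<1.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Weighted harmonic mean of precision and recall (standard F-beta).

    With beta < 1, precision is weighted more heavily than recall.
    """
    if precision == 0.0 and recall == 0.0:
        return 0.0  # convention chosen here to avoid division by zero
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)
```

For a rule with high precision and lower recall, a β below 1 yields a higher value than β = 1, which is consistent with favouring rules that rarely extract wrong data.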
5. The method according to claim 4, wherein in step 2-3 the Score is calculated from the two dimensions according to their respective weights by the following formula:
Score=α*LR+β*LC,α+β=1
wherein α and β are the weights; LR is the linear classifier score, between 0 and 100; LC is the knowledge base check score, between 0 and 100; the resulting Score is therefore also between 0 and 100; the linear classifier is trained on a preset corpus: the length, the format information and the context character strings of the field to be judged are collected as features and digitized with a bag-of-words model, and the linear classifier is then obtained by training; the knowledge base stores keywords and weight information, and if the HTTP message body is judged to contain keywords stored in the knowledge base, the corresponding weight information is accumulated.
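A minimal Python sketch of the combined score in claim 5 follows. The linear classifier itself is not reproduced; its output LR is simply passed in. The function names, the example weights and the capping of the knowledge base score at 100 are assumptions made so that LC stays within the stated 0 to 100 range; the threshold of 80 is the one given in step 2-3.

```python
from typing import Dict

PASS_THRESHOLD = 80  # threshold stated in step 2-3


def knowledge_base_score(body: str, keyword_weights: Dict[str, float]) -> float:
    """Accumulate the weights of knowledge-base keywords found in the body.

    Capping at 100 is an assumption, made to keep LC within [0, 100].
    """
    total = sum(weight for keyword, weight in keyword_weights.items() if keyword in body)
    return min(100.0, float(total))


def combined_score(lr: float, lc: float, alpha: float) -> float:
    """Score = alpha * LR + beta * LC, with alpha + beta = 1."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    beta = 1.0 - alpha
    return alpha * lr + beta * lc


def passes_detection(lr: float, lc: float, alpha: float) -> bool:
    """A field passes when the combined score exceeds the threshold."""
    return combined_score(lr, lc, alpha) > PASS_THRESHOLD
```

Because both inputs lie between 0 and 100 and the weights sum to 1, the combined score is a convex combination and also lies between 0 and 100.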
6. The method according to claim 5, wherein in step 2-4 the HOST field represents the HTTP domain name; the URL field indicates the resource path; the position, encoding, prefix, offset and suffix fields are associated with one another: the position indicates the environment in which the field is located, including the upstream header, the upstream content, the downstream header and the downstream content; the encoding indicates the encoding type used at that position; the prefix indicates the information preceding the identified field; the offset indicates the number of bytes by which the identified field is offset from the preceding information; and the suffix indicates the information following the identified field; the association field describes the interdependence between ten-tuples; the ordering field marks the output order of the extracted fields; and the field meaning indicates the meaning of the identified field.
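The ten-tuple of claim 6 and the way its prefix, offset and suffix fields might drive extraction can be sketched in Python as below. The class name, attribute names and types are hypothetical renderings of the claimed fields. The sketch works on decoded text rather than raw bytes, and treating a missing suffix as "take the rest of the text" is an assumption.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class FeatureRule:
    """Hypothetical rendering of the ten-tuple feature mark."""
    host: str         # HTTP domain name
    url: str          # resource path
    position: str     # upstream/downstream header or content
    encoding: str     # encoding type used at that position
    prefix: str       # information preceding the field
    offset: int       # distance from the prefix to the field
    suffix: str       # information following the field
    association: str  # dependence on other ten-tuples
    ordering: int     # output order of the extracted field
    meaning: str      # meaning of the field


def extract_field(text: str, rule: FeatureRule) -> Optional[str]:
    """Return the text between prefix (+ offset) and suffix, or None."""
    start = text.find(rule.prefix)
    if start < 0:
        return None  # prefix not present, nothing to extract
    start += len(rule.prefix) + rule.offset
    if start > len(text):
        return None
    end = text.find(rule.suffix, start) if rule.suffix else -1
    if end < 0:
        end = len(text)  # assumption: no suffix found means take the remainder
    return text[start:end]
```

For example, a rule with prefix `phone=`, offset 0 and suffix `&` applied to a form-encoded body returns the value of the `phone` parameter.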
7. The method of claim 6, wherein step 3 comprises:
step 3-1, rule loading: two rule objects are supported by default, one in effect and the other not in effect; when a new rule is generated, the rule object that is not in effect detects the rule update and loads the rule; after the rule loading is finished, this rule object is set to be in effect and the rule object previously in effect is set to be not in effect; data extraction is performed only according to the rule object in effect, so that rule loading and extraction are seamlessly connected;
step 3-2, extraction and marking: after the extraction engine has extracted the data, a service tag can be added as supplementary information to the extracted data, the tag being stored in a Tag-Length-Value encoding structure;
step 3-3, data storage: the data is written to the database cluster.
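Steps 3-1 and 3-2 can be sketched in Python as follows. This is an illustration under stated assumptions: the class and function names are hypothetical, the rules are represented as an opaque object, and the Tag-Length-Value layout uses a 2-byte tag and a 2-byte length in network byte order, a width the claim does not specify.

```python
import struct
import threading
from typing import Any, List, Tuple


class RuleHolder:
    """Two rule slots; extraction only ever reads the slot in effect."""

    def __init__(self, initial_rules: Any) -> None:
        self._slots = [initial_rules, None]
        self._active = 0
        self._lock = threading.Lock()

    def active_rules(self) -> Any:
        return self._slots[self._active]

    def reload(self, new_rules: Any) -> None:
        """Load new rules into the idle slot, then switch it into effect."""
        with self._lock:
            standby = 1 - self._active
            self._slots[standby] = new_rules  # extraction still uses the old slot
            self._active = standby            # switch once loading is finished


def tlv_encode(tag: int, value: bytes) -> bytes:
    """Encode one tag as Tag (2 bytes) + Length (2 bytes) + Value."""
    return struct.pack("!HH", tag, len(value)) + value


def tlv_decode(buf: bytes) -> List[Tuple[int, bytes]]:
    """Decode a concatenation of TLV items into (tag, value) pairs."""
    items = []
    pos = 0
    while pos < len(buf):
        tag, length = struct.unpack_from("!HH", buf, pos)
        pos += 4
        items.append((tag, buf[pos:pos + length]))
        pos += length
    return items
```

Keeping the previous rule object intact until the switch means that an extraction already in progress never sees a half-loaded rule set, which is the seamless hand-over described in step 3-1.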
CN201911364419.4A 2019-12-26 2019-12-26 HTTP traffic characteristic identification and extraction method based on machine learning Active CN110891030B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911364419.4A CN110891030B (en) 2019-12-26 2019-12-26 HTTP traffic characteristic identification and extraction method based on machine learning

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201911364419.4A CN110891030B (en) 2019-12-26 2019-12-26 HTTP traffic characteristic identification and extraction method based on machine learning

Publications (2)

Publication Number Publication Date
CN110891030A true CN110891030A (en) 2020-03-17
CN110891030B CN110891030B (en) 2021-03-16

Family

ID=69753207

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911364419.4A Active CN110891030B (en) 2019-12-26 2019-12-26 HTTP traffic characteristic identification and extraction method based on machine learning

Country Status (1)

Country Link
CN (1) CN110891030B (en)

Citations (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104506538A (en) * 2014-12-26 2015-04-08 北京奇虎科技有限公司 Machine learning type domain name system security defense method and device
CN105007282A (en) * 2015-08-10 2015-10-28 济南大学 Malicious software network behavior detection method specific to network service provider and system thereof
CN105022960A (en) * 2015-08-10 2015-11-04 济南大学 Multi-feature mobile terminal malicious software detecting method based on network flow and multi-feature mobile terminal malicious software detecting system based on network flow
CN106341343A (en) * 2016-09-14 2017-01-18 晶赞广告(上海)有限公司 Automatic service degradation system and method thereof
CN108039957A (en) * 2017-11-10 2018-05-15 上海华讯网络系统有限公司 Complex network flow bag intelligent analysis system
CN108259367A (en) * 2018-01-11 2018-07-06 重庆邮电大学 A kind of Flow Policy method for customizing of the service-aware based on software defined network
CN109286576A (en) * 2018-10-10 2019-01-29 北京理工大学 A kind of network agent encryption traffic characteristic extracting method of data packet frequency analysis
CN109672687A (en) * 2018-12-31 2019-04-23 南京理工大学 HTTP based on suspicious degree assessment obscures flow rate testing methods
CN110011931A (en) * 2019-01-25 2019-07-12 中国科学院信息工程研究所 A kind of encryption traffic classes detection method and system
US20190260683A1 (en) * 2017-02-06 2019-08-22 Silver Peak Systems, Inc. Multi-level learning for predicting and classifying traffic flows from first packet data

Cited By (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111641624A (en) * 2020-05-25 2020-09-08 西安电子科技大学 Network protocol header compression method based on decision tree
CN113726486A (en) * 2021-11-03 2021-11-30 湖南麒麟信安科技股份有限公司 Message duplication removing method, system and storage medium in parallel redundant network
CN113784294A (en) * 2021-11-12 2021-12-10 南京信息工程大学 Mobile phone position information extraction method under WIFI environment
CN117097821A (en) * 2023-10-19 2023-11-21 深圳市佳贤通信科技股份有限公司 Base station message parameter updating and storing method based on TR069 protocol
CN117097821B (en) * 2023-10-19 2023-12-19 深圳市佳贤通信科技股份有限公司 Base station message parameter updating and storing method based on TR069 protocol

Also Published As

Publication number Publication date
CN110891030B (en) 2021-03-16

Similar Documents

Publication Publication Date Title
CN110891030B (en) HTTP traffic characteristic identification and extraction method based on machine learning
US7860872B2 (en) Automated media analysis and document management system
KR100883261B1 (en) Content information analysis method, system and recording medium
US8335779B2 (en) Method and apparatus for gathering, categorizing and parameterizing data
US20160055213A1 (en) System and method for performing longest common prefix strings searches
CN104391881B (en) A kind of daily record analytic method and system based on segmentation methods
US20090222426A1 (en) Computer-Implemented System And Method For Analyzing Search Queries
US20150095359A1 (en) Volume Reducing Classifier
CN106844640B (en) Webpage data analysis processing method
CN103281213A (en) Method for extracting, analyzing and searching network flow and content
JP2005525657A (en) Managing expressions in database systems
JP2001134575A (en) Method and system for detecting frequently appearing pattern
JP2014502753A (en) Web page information detection method and system
CN112261645B (en) Mobile application fingerprint automatic extraction method and system based on grouping and domain division
WO2017114282A1 (en) Information search device and method, search server and machine-readable storage medium
CN102945246A (en) Method and device for processing network information data
CN108804527A (en) Based on wechat region circle of friends data analysis system and method
CN107301245B (en) Power information video search system
JPWO2010150910A1 (en) Information search device, information search method, information search program, and recording medium on which information search program is recorded
CN112822121A (en) Traffic identification method, traffic determination method and knowledge graph establishment method
CN108563795B (en) Pairs method for accelerating matching of regular expressions of compressed flow
CN112464036B (en) Method and device for auditing violation data
CN107491538B (en) Storage process command and parameter value extraction method of DB2 database
CN109740147B (en) Duplicate removal matching analysis method for large-number talent resume
CN117171650A (en) Document data processing method, system and medium based on web crawler technology

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant