CN110602038B

CN110602038B - Abnormal UA detection and analysis method and system based on rules

Info

Publication number: CN110602038B
Application number: CN201910706278.3A
Authority: CN
Inventors: 苟高鹏; 熊刚; 陈洁; 李镇; 徐安林
Original assignee: Institute of Information Engineering of CAS
Current assignee: Institute of Information Engineering of CAS
Priority date: 2019-08-01
Filing date: 2019-08-01
Publication date: 2020-12-04
Anticipated expiration: 2039-08-01
Also published as: CN110602038A

Abstract

The invention provides a rule-based method and system for detecting and analyzing abnormal UAs (User-Agents). Network traffic is captured on a Spark-based network traffic capture platform, HTTP traffic is filtered from all network traffic according to the HTTP format, and the UA fields of the HTTP traffic are extracted. Abnormal UAs in the network traffic can thereby be detected and analyzed effectively, which facilitates network management and malware detection.

Description

Abnormal UA detection and analysis method and system based on rules

Technical Field

The invention belongs to the technical field of network information, and particularly relates to a rule-based method and system for detecting and analyzing abnormal UAs (User-Agents).

Background

Key fields in network traffic play a crucial role in traffic analysis. Key fields in the Domain Name System (DNS) can be used to analyze the residual trust in domains, to observe the evolution of DNS resolution, and to detect malware behavior in the network. Similarly, key fields in the HyperText Transfer Protocol (HTTP) and the Transport Layer Security/Secure Sockets Layer (TLS/SSL) protocols, such as UA, cookie, and Server Name Indication (SNI), play a crucial role in network behavior analysis and malicious behavior detection.

HTTP accounts for nearly half of all protocol traffic generated every day, is used very frequently, and involves a large number of users. The User-Agent field in HTTP carries information about the client, including the client's operating system and version, CPU type, browser and version, browser rendering engine, browser language, and browser plug-ins. Studying the User-Agent field can therefore serve as a study of abnormal characters in key traffic fields in general, and it also allows the cause of the abnormal characters to be analyzed from the perspective of the client, because a client that generates abnormal characters may be behaving maliciously. For these reasons, the User-Agent field of the HTTP protocol is used as the data to be detected and analyzed when studying the phenomenon of abnormal characters in the key fields of various protocols in network traffic. Since the UA contains client information, it can also be used to identify malware, and the client's preferences can be revealed by compiling statistics on the client's operating system, browser, and device.

In a high-speed network environment, deep parsing of network protocols and extraction of key field contents are the primary prerequisite for mapping and labeling network and traffic attributes. However, because network protocols are complex, existing parsing tools often produce key fields that contain abnormal characters when parsing protocols in a high-speed network environment, and these abnormal characters introduce polluted, erroneous information that hinders effective mapping and labeling of network traffic.

Previous research on UAs has generally ignored key fields containing abnormal characters and has not processed them directly. Because these UAs also reflect client behavior to some extent and are closely tied to the client, they should not be ignored; they too are part of the UA ecosystem in network traffic.

Disclosure of Invention

The invention aims to provide a rule-based method and system for detecting and analyzing abnormal UAs, which can effectively detect and analyze the abnormal UAs in network traffic by extracting the UA fields from the HTTP traffic in a network, thereby facilitating network management and malware detection.

To achieve this purpose, the invention adopts the following technical solution:

a method of rule-based abnormal UA detection and analysis, comprising the steps of:

capturing network traffic based on a Spark network traffic capturing platform;

performing protocol parsing on the captured traffic, filtering the HTTP traffic from all network traffic according to the HTTP format, extracting the UA fields and the IP information of the clients, and storing them as a log;

performing anomaly detection on the extracted UAs with a regular expression to judge whether each UA contains abnormal characters, and, if it does, judging the UA to be abnormal;

for the detected abnormal UAs, calculating the similarity between each abnormal UA and the normal UAs in the log, and storing the normal UAs whose similarity to an abnormal UA is greater than 0;

analyzing the top several clients with the largest numbers of abnormal UAs to find the causes of the abnormal characters;

and defining custom categories for the stored normal UAs and classifying them accordingly, re-judging as abnormal any normal UA that does not fit the custom categories, performing preference analysis on the device types and browser types used by the clients that have abnormal UAs, and detecting malicious clients.

Further, the UA field and the IP information of the client form a log in the format of < client ID, UA >.

Further, the similarity between the abnormal UAs and the normal UAs in the log is calculated using the Levenshtein distance.

Further, the several clients are those whose abnormal UAs together account for 80% of the total number of abnormal UAs.

Further, the abnormal UAs are stored separately and counted.

Further, the causes of the abnormal characters include: malware itself generating abnormal UAs, and mismatched encoding and decoding of the UA.
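The similarity step can be sketched as follows. This is a minimal illustration, not the patented implementation. The text names the Levenshtein distance but does not say how a distance is turned into a similarity, so the normalisation `1 - distance / longer length` and the function names `levenshtein`, `similarity` and `similar_normal_uas` are assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance (insert, delete, substitute) via dynamic programming."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # delete ch_a
                               current[j - 1] + 1,       # insert ch_b
                               previous[j - 1] + cost))  # substitute
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Assumed normalisation: 1 - distance / longer length, in [0, 1]."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return 1.0 - levenshtein(a, b) / longest


def similar_normal_uas(abnormal_uas, normal_uas):
    """Keep the normal UAs whose similarity to any abnormal UA exceeds 0."""
    return [n for n in normal_uas
            if any(similarity(x, n) > 0 for x in abnormal_uas)]
```

Under this assumed normalisation, a similarity of 0 means the two strings share no aligned character at all, so the threshold "greater than 0" is a very permissive filter; a stricter cutoff could be substituted if needed.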

A system for rule-based abnormal UA detection and analysis, comprising:

a Spark network traffic capture platform for capturing network traffic;

a filter for performing protocol parsing on the captured traffic, extracting the UA fields and the IP information of the clients and storing them as a log, performing anomaly detection on the extracted UAs with a regular expression, and identifying the abnormal UAs that contain abnormal characters as well as the normal UAs whose similarity to an abnormal UA is greater than 0;

an analyzer for analyzing the top several clients with the largest numbers of abnormal UAs to find the causes of the abnormal characters, and for defining custom categories for the stored normal UAs and classifying them accordingly, re-judging as abnormal any normal UA that does not fit the custom categories, performing preference analysis on the device types and browser types used by the clients that have abnormal UAs, and detecting malicious clients.

Further, the filter comprises an HTTP extractor, a UA extractor and an IP extractor, wherein the HTTP extractor is used for filtering HTTP traffic from all network traffic according to an HTTP format, the UA extractor is used for extracting a UA field from the HTTP traffic, and the IP extractor is used for extracting the IP information of the client from the HTTP traffic.
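The three extractors of the filter can be illustrated with a small sketch. It assumes that the capture platform hands over each packet as a `(client_ip, payload)` pair, that a payload is treated as HTTP when it starts with a request line, and that the client IP serves as the client ID in the log. The patent does not specify these details, and the names `is_http_request`, `extract_user_agent` and `build_log` are hypothetical.

```python
import re

# Assumption: a payload is an HTTP request if its first line looks like
# "<METHOD> <target> HTTP/<major>.<minor>".
REQUEST_LINE = re.compile(rb"^[A-Z]+ \S+ HTTP/\d\.\d\r?\n")
# Header names are case-insensitive; capture the value up to the line end.
UA_HEADER = re.compile(rb"^User-Agent:[ \t]*(.*?)[ \t]*\r?$",
                       re.IGNORECASE | re.MULTILINE)


def is_http_request(payload: bytes) -> bool:
    """HTTP extractor: keep only payloads that start with a request line."""
    return REQUEST_LINE.match(payload) is not None


def extract_user_agent(payload: bytes):
    """UA extractor: return the User-Agent value, or None if it is absent."""
    head = payload.split(b"\r\n\r\n", 1)[0]  # header section only
    match = UA_HEADER.search(head)
    if match is None:
        return None
    # latin-1 maps every byte to one code point, so unusual bytes survive
    # unchanged for the later abnormal-character check.
    return match.group(1).decode("latin-1")


def build_log(packets):
    """Combine the extractors into <client ID, UA> records."""
    log = []
    for client_ip, payload in packets:
        if not is_http_request(payload):
            continue
        ua = extract_user_agent(payload)
        if ua is not None:
            log.append((client_ip, ua))
    return log
```

Decoding with latin-1 rather than UTF-8 is a deliberate choice in this sketch: it never fails on malformed input, which matters when the goal is to inspect malformed fields rather than discard them.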

The method directs attention to the UAs containing abnormal characters that are usually ignored. Using a rule-based approach (namely a regular expression), it filters the UAs containing abnormal characters out of all UAs in network traffic, counts them, and identifies malicious clients from them. The method performs passive measurement of high-speed network traffic: it captures the traffic with a Spark-based high-speed network traffic capture platform, identifies and deeply parses HTTP, and extracts the UA field from HTTP. It studies both how to detect the usually ignored abnormal UAs and why they arise; the regular expression successfully separates UAs containing abnormal characters from normal UAs. The similarity between each abnormal UA and the normal UAs is calculated using the Levenshtein distance, and the normal UAs whose similarity to an abnormal UA is greater than 0 are saved for analysis. The causes of abnormal characters in network traffic are explained from the perspectives of encoding and of malicious users.

The method of the invention has the following advantages:

(1) Attention to detail: UAs containing abnormal characters in network traffic are measured and analyzed.

(2) UAs containing abnormal characters are detected among all UA fields using rule-based regular expressions, which can be faster than a statistics-based approach. Moreover, because normal UAs have a fixed format and character set, a correctly written rule does not produce misjudgments.

(3) The top several (for example, the top 20) clients with the largest numbers of abnormal UAs are analyzed, which prevents UAs that contain abnormal characters because of accidental factors from distorting the analysis result, and allows the causes of abnormal UAs to be analyzed from the perspective of the client.

(4) Not only the abnormal UAs are analyzed, but also the normal UAs whose similarity to an abnormal UA, calculated using the Levenshtein distance, is greater than 0. UAs that are normal in form may still be abnormal in meaning; the method detects both formally abnormal and semantically abnormal UAs and shows the 'ecosystem' of abnormal UAs in network traffic.

Drawings

Fig. 1 is a flow diagram of a method for rule-based abnormal UA detection and analysis.

Fig. 2 is a framework diagram of a system for rule-based abnormal UA detection and analysis.

Detailed Description

To make the aforementioned features and advantages of the present invention more comprehensible, the rule-based method for detecting and analyzing abnormal UAs disclosed in the present invention is described in detail below with reference to the accompanying drawings. As shown in the flowchart of Fig. 1, the method includes the following steps:

First, the detection stage:

(1) Capturing network traffic: high-speed traffic is captured using a Spark-based high-speed network traffic capture platform and queued for processing.

(2) Network traffic filtering and key field extraction: protocol parsing is performed on the captured traffic, HTTP traffic is filtered from all network traffic according to the HTTP format, the UA fields and the IP information of the clients are extracted according to the HTTP format, and logs in the < client ID, UA > format are formed and stored.

(3) Abnormal UA detection: each extracted UA is checked with a regular expression to judge whether it contains abnormal characters. A UA that does not conform to the established rule is judged to contain abnormal characters; such abnormal UAs are stored separately and counted.

(4) Normal UA extraction: for the detected abnormal UAs, the similarity between each abnormal UA and the normal UAs in the collected logs is calculated using the Levenshtein distance, and the normal UAs whose similarity to an abnormal UA is greater than 0 are saved.
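Step (3) of the detection stage can be sketched as below. The patent does not publish its regular expression, so the rule used here (a normal UA consists only of printable ASCII, bytes 0x20 to 0x7E) is an assumption chosen for illustration, as are the names `ABNORMAL_CHAR`, `is_abnormal` and `split_log`. The log is taken to be a list of `(client_id, ua)` pairs.

```python
import re

# Assumption: any character outside printable ASCII (0x20-0x7E) is
# "abnormal". Substitute the real rule where it is known.
ABNORMAL_CHAR = re.compile(r"[^\x20-\x7e]")


def is_abnormal(ua: str) -> bool:
    """True if the UA contains at least one character the rule rejects."""
    return ABNORMAL_CHAR.search(ua) is not None


def split_log(log):
    """Separate <client ID, UA> records into normal and abnormal lists."""
    normal, abnormal = [], []
    for client_id, ua in log:
        (abnormal if is_abnormal(ua) else normal).append((client_id, ua))
    return normal, abnormal
```

A possible use, with made-up records: `normal, abnormal = split_log(log)` followed by `len(abnormal)` gives the count of abnormal UAs that the step asks to be kept alongside the separately stored records.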

Second, the analysis stage:

(1) Abnormal UA analysis: to rule out abnormal characters that appear in UAs because of accidental factors in the network, the 20 clients with the largest numbers of abnormal UAs among all clients are selected for analysis of the causes of the abnormal characters; the abnormal UAs of these 20 clients account for about 80% of all abnormal UAs. Analysis of the filtered abnormal UAs reveals two causes. The main one is that malware itself generates the abnormal UAs, as indicated by clients that produce large numbers of identical abnormal UAs; the other is mismatched encoding and decoding of the UA. The abnormal UAs make it possible to detect and track the malicious behavior of malware, which helps maintain network security, and to show the ecosystem of abnormal UAs.

(2) Normal UA analysis: because these clients produce many abnormal UAs, their formally normal UAs are not necessarily normal in meaning either. These UAs are therefore assigned to custom categories and analyzed in terms of the device types and browsers the clients use.
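Selecting the clients for the analysis stage can be sketched as follows. The input is assumed to be the list of abnormal `(client_id, ua)` records produced by the detection stage; the function names `top_clients` and `dominant_ua` are hypothetical, and the default of 20 follows the example in the text.

```python
from collections import Counter


def top_clients(abnormal_log, n=20):
    """Return the n clients with the most abnormal UAs, plus the share of
    all abnormal UAs that those clients account for."""
    counts = Counter(client_id for client_id, _ua in abnormal_log)
    top = counts.most_common(n)
    total = sum(counts.values())
    share = sum(c for _cid, c in top) / total if total else 0.0
    return top, share


def dominant_ua(abnormal_log, client_id):
    """Most frequent abnormal UA of one client and its count, or None.
    Many identical abnormal UAs from one client hint at automated
    generation, which the text associates with malware."""
    uas = Counter(ua for cid, ua in abnormal_log if cid == client_id)
    return uas.most_common(1)[0] if uas else None
```

The returned share makes it easy to check whether a chosen `n` covers a large fraction of the abnormal UAs (about 80% for the top 20 in the text) before the per-client causes are examined.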

The method is implemented by a rule-based system for detecting and analyzing abnormal UAs, shown in Fig. 1 and Fig. 2, which specifically includes the following parts:

a filter for performing protocol parsing on the captured traffic, extracting the UA fields and the IP information of the clients and storing them as a log, performing anomaly detection on the extracted UAs with a regular expression, and identifying the abnormal UAs that contain abnormal characters as well as the normal UAs whose similarity to an abnormal UA is greater than 0; specifically, the filter includes an HTTP extractor, a UA extractor, and an IP extractor, where the HTTP extractor is configured to filter HTTP traffic from all network traffic according to the HTTP format, the UA extractor is configured to extract the UA field from the HTTP traffic, and the IP extractor is configured to extract the IP information of the client from the HTTP traffic.

The process of the invention is further illustrated by the following specific example:

As shown in Fig. 2, traffic was captured for 2 months using the traffic capture platform, and more than 150 billion UAs were collected in total, of which nearly 22 million contain abnormal characters; the ratio of these abnormal UAs to normal UAs is about 0.1485 ‰. About 91,000 clients produced UAs containing abnormal characters, and they are distributed around the world.
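As a rough consistency check on these figures, the ratio can be recomputed from rounded counts. This assumes the counts are read as about 1.5 × 10^11 UAs in total and about 2.2 × 10^7 abnormal UAs, which is the reading that agrees with the stated 0.1485 ‰. The symbols N_abnormal and N_total are introduced here for illustration only. Because the abnormal share is so small, dividing by the total rather than by the normal count changes the result only negligibly.

```latex
\[
\frac{N_{\text{abnormal}}}{N_{\text{total}}}
\approx \frac{2.2 \times 10^{7}}{1.5 \times 10^{11}}
\approx 1.47 \times 10^{-4}
\approx 0.147\ \text{per mille}
\]
```

The small gap between 0.147 ‰ and the reported 0.1485 ‰ is what would be expected from counts stated only as "more than" and "nearly".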

The 20 clients with the largest numbers of abnormal UAs among all clients are selected to investigate the causes, which avoids interference from accidental factors in the analysis. Two causes of abnormal UAs are found: one is that the encoding and decoding methods of the UA do not match; the other is that users or applications themselves produce these abnormal UAs. Malicious users are more likely to generate abnormal UAs for malicious activities, and the formats of such UAs differ from those of normal UAs.

Finally, custom categories are defined for the filtered normal UAs, the normal UAs are classified into these categories using regular expressions, and 3 UA types that are abnormal in meaning (that is, that do not fit the custom categories) are found, showing the ecosystem of abnormal UAs.
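The custom classification of normal UAs can be illustrated with a short sketch. The patent does not list its categories or rules, so the three categories and their regular expressions below are hypothetical, as are the names `CATEGORY_RULES`, `classify` and `semantically_abnormal`. A UA that matches no category is treated as abnormal in meaning.

```python
import re

# Hypothetical custom categories; the first matching rule wins, so the
# mobile rule is checked before the desktop rule.
CATEGORY_RULES = [
    ("mobile browser",
     re.compile(r"^Mozilla/5\.0 \(.*(Android|iPhone|iPad).*\)")),
    ("desktop browser",
     re.compile(r"^Mozilla/5\.0 \(.*(Windows NT|Macintosh|X11).*\)")),
    ("command-line tool",
     re.compile(r"^(curl|Wget)/\d")),
]


def classify(ua: str):
    """Return the name of the first matching category, or None."""
    for name, rule in CATEGORY_RULES:
        if rule.search(ua):
            return name
    return None


def semantically_abnormal(normal_uas):
    """Formally normal UAs that fit none of the custom categories."""
    return [ua for ua in normal_uas if classify(ua) is None]
```

Counting the category names returned by `classify` per client would give the device-type and browser preference statistics that the analysis stage describes, under whatever categories are actually chosen.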

The above embodiments are intended only to illustrate the technical solution of the present invention, not to limit it. A person skilled in the art can modify the technical solution or substitute equivalents without departing from the spirit and scope of the present invention, and the scope of protection shall be determined by the claims.

Claims

1. A method of rule-based abnormal UA detection and analysis, comprising the steps of:

capturing network traffic based on a Spark network traffic capturing platform;

2. The method of claim 1, wherein the UA field and the client's IP information are logged in a format of < client ID, UA >.

3. The method of claim 1, wherein the similarity of the abnormal UA and the normal UA is calculated for data in the log using Levenshtein distance.

4. The method of claim 1, wherein the several clients are those whose abnormal UAs together account for 80% of the total number of abnormal UAs.

5. The method of claim 1, wherein the abnormal UAs are stored separately and counted.

6. The method of claim 1, wherein the causes of the abnormal characters include: malware itself generating abnormal UAs, and mismatched encoding and decoding of the UA.

7. A system for rule-based abnormal UA detection and analysis, comprising:

a filter for performing protocol parsing on the captured traffic, extracting the UA fields and the IP information of the clients and storing them as a log, performing anomaly detection on the extracted UAs with a regular expression, detecting the abnormal UAs that contain abnormal characters, calculating the similarity between the abnormal UAs and the normal UAs, and storing the normal UAs whose similarity to an abnormal UA is greater than 0;

8. The system of claim 7, wherein the filter comprises an HTTP extractor for filtering HTTP traffic from all network traffic according to the HTTP format, a UA extractor for extracting UA fields from the HTTP traffic, and an IP extractor for extracting client IP information from the HTTP traffic.

9. The system of claim 8, wherein the UA field and the client's IP information are logged in a format of < client ID, UA >.

10. The system of claim 7, wherein the several clients are those whose abnormal UAs together account for 80% of the total number of abnormal UAs.