WO2022143511A1 - Malicious traffic identification method and related apparatus

Info

Publication number
WO2022143511A1
Authority
WO
WIPO (PCT)
Prior art keywords
traffic
alarm
http
feature
alarm traffic
Prior art date
Application number
PCT/CN2021/141587
Other languages
English (en)
French (fr)
Inventor
万荣飞
朱安南
张甲
段海新
Original Assignee
华为技术有限公司
清华大学
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司, 清华大学
Priority to EP21914247.8A (EP4258610A4)
Publication of WO2022143511A1
Priority to US18/345,853 (US20230353585A1)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425 Traffic logging, e.g. anomaly detection
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1416 Event detection, e.g. attack signature detection
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F21/00 Security arrangements for protecting computers, components thereof, programs or data against unauthorised activity
    • G06F21/50 Monitoring users, programs or devices to maintain the integrity of platforms, e.g. of processors, firmware or operating systems
    • G06F21/55 Detecting local intrusion or implementing counter-measures
    • G06F21/56 Computer malware detection or handling, e.g. anti-virus arrangements
    • G06F21/566 Dynamic detection, i.e. detection performed at run-time, e.g. emulation, suspicious activities
    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00 Machine learning
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/064 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis involving time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1441 Countermeasures against malicious traffic
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00 Network arrangements or protocols for supporting network services or applications
    • H04L67/01 Protocols
    • H04L67/02 Protocols based on web technology, e.g. hypertext transfer protocol [HTTP]
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security
    • H04L63/14 Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1433 Vulnerability analysis

Definitions

  • The present application relates to the field of communication technologies, and in particular, to a malicious traffic identification method and a related apparatus.
  • HTTP: HyperText Transfer Protocol
  • various types of malicious software such as Trojan horse viruses
  • HTTP communication methods Communication between C2
  • After an update iteration, a Trojan virus's communication traffic can differ significantly from its previous communication traffic.
  • Detection at the traffic layer, that is, detection by extracting features from the traffic.
  • Detection from host behavior, that is, detection by extracting features from the behavior of infected hosts.
  • There are mainly two methods for processing the extracted features: 1) detection methods based on unsupervised clustering; 2) detection methods based on supervised models.
  • Both the unsupervised clustering detection method and the supervised model detection method consider only the characteristics of a single stream, that is, the characteristics of one HTTP flow. They do not consider the multi-flow network behavior characteristics of malicious CC communications, that is, the characteristics of multiple HTTP flows.
  • As a result, the current detection methods draw on insufficiently rich basic information and cannot effectively and accurately identify whether traffic is malicious traffic.
  • In addition, the behavior of much rogue software resembles the characteristic behavior of CC traffic at the single-stream level, so single-stream feature analysis alone cannot effectively distinguish rogue software from malware.
  • Embodiments of the present application provide a malicious traffic identification method and a related device, so as to improve the accuracy of malicious traffic identification.
  • an embodiment of the present application provides a method for identifying malicious traffic, which may include:
  • Determining the receiving time of first alarm traffic; acquiring, based on a preset policy, multiple pieces of second alarm traffic corresponding to the first alarm traffic within a target time period, where the target time period is a time period determined based on the receiving time, and the similarity between each piece of second alarm traffic and the first alarm traffic is greater than a preset threshold; performing feature extraction on the multiple pieces of second alarm traffic to obtain first feature information; and determining, based on the first feature information, whether the first alarm traffic is malicious traffic.
  • Starting from the time at which a single piece of traffic (that is, the first alarm traffic) is received, the malicious traffic identification device can backtrack, according to the preset policy, to multiple pieces of traffic that match it (that is, the multiple pieces of second alarm traffic). Feature extraction is then performed on the backtracked traffic to obtain feature information, so that the malicious traffic identification device can classify the single piece of traffic according to the feature information and thereby determine whether it is malicious traffic.
  • The similarity between each of the multiple pieces of second alarm traffic and the first alarm traffic is greater than a preset threshold.
  • Classifying a single piece of traffic according to the feature information of multiple pieces of traffic similar to it lets the malicious traffic identification device take full account of the multi-stream network behavior characteristics of malicious CC communication traffic, so that malicious traffic in the live network can be detected and distinguished more accurately.
  • Because the traffic situation in the live network is relatively complex, this avoids the prior-art detection process, which examines only a single HTTP stream.
  • The embodiment of the present application observes the communication behavior of traffic from a multi-flow perspective, backtracks multiple pieces of alarm traffic into different clusters based on one or more preset policies, and uses the statistical information of the clusters to which each piece of alarm traffic belongs as feature information. Whether the alarm traffic is positive or negative (that is, whether it is malicious traffic) is determined from this feature information, which helps prevent accidental errors.
  • The target time period is either a time period that starts at the receiving time and extends a preset duration after it, or a time period that ends at the receiving time and covers a preset duration before it.
  • In other words, the receiving time of the first alarm traffic may be used as an endpoint, with a preset duration taken forward or backward from it, so that as many records similar to the first alarm traffic as possible are obtained.
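  • As an illustration only, the target time period described above can be sketched in Python as follows. The function names `target_window` and `in_window`, the `direction` parameter, and the 30-minute duration in the usage note are assumptions made for this sketch; the application does not specify them.

```python
from datetime import datetime, timedelta


def target_window(receive_time, preset, direction="before"):
    """Return (start, end) of the target time period.

    direction="before": the window ends at receive_time.
    direction="after":  the window starts at receive_time.
    Both names are hypothetical labels for the two options in the text.
    """
    if direction == "before":
        return receive_time - preset, receive_time
    if direction == "after":
        return receive_time, receive_time + preset
    raise ValueError("direction must be 'before' or 'after'")


def in_window(timestamp, window):
    """Check whether a flow timestamp falls inside the window (inclusive)."""
    start, end = window
    return start <= timestamp <= end
```

  • For example, `target_window(t, timedelta(minutes=30), "before")` yields the half hour that ends at the receiving time `t`; flows are then kept only if `in_window` returns true for their timestamps.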
  • The preset policy includes one or more of a first policy, a second policy, and a third policy. The first policy is a policy for acquiring the multiple pieces of second alarm traffic based on the Internet Protocol (IP) address and the user agent (UA) information of the first alarm traffic; the second policy is a policy for acquiring the multiple pieces of second alarm traffic based on the IP address of the first alarm traffic and a preset generalization rule; the third policy is a policy for acquiring the multiple pieces of second alarm traffic based on the IP address of the first alarm traffic and the HyperText Transfer Protocol (HTTP) Header information of the first alarm traffic.
  • These traffic backtracking methods can accurately backtrack to multiple flows from the same source as the first alarm traffic, so that whether the first alarm traffic is malicious traffic can be identified from the behavior characteristics of those flows, which improves the accuracy of identifying malicious traffic.
  • The preset policy includes the first policy, and acquiring, according to the preset policy, multiple pieces of second alarm traffic corresponding to the first alarm traffic within the target time period includes: acquiring the IP address and UA information of the first alarm traffic; and collecting, from the multiple HTTP streams sent by the IP address within the target time period, the HTTP streams whose UA information is the same as that of the first alarm traffic as the second alarm traffic.
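  • A minimal sketch of the first policy, assuming each HTTP stream is represented by a small record with a source IP address, a User-Agent string, and a timestamp. The `HttpFlow` type and the function `backtrack_by_ip_and_ua` are hypothetical names introduced for this illustration, not part of the application.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HttpFlow:
    """Hypothetical record for one HTTP stream."""
    src_ip: str
    user_agent: str
    timestamp: float  # seconds since the epoch


def backtrack_by_ip_and_ua(alert, flows, window):
    """Keep flows sent by the alert's IP inside the window with the same UA."""
    start, end = window
    return [
        flow
        for flow in flows
        if flow.src_ip == alert.src_ip
        and flow.user_agent == alert.user_agent
        and start <= flow.timestamp <= end
    ]
```

  • Whether the first alarm traffic itself is counted among the returned streams is left open here; the sketch simply returns every candidate that passes the three checks.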
  • The preset policy includes the second policy, and acquiring, according to the preset policy, multiple pieces of second alarm traffic corresponding to the first alarm traffic within the target time period includes: acquiring the IP address of the first alarm traffic; collecting multiple first HTTP streams sent by the IP address within the target time period; performing generalization processing on the multiple first HTTP streams according to the preset generalization rule to obtain multiple second HTTP streams, where the preset generalization rule is to uniformly replace, using a preset standard, the target string corresponding to each of the multiple first HTTP streams; and filtering out, from the multiple second HTTP streams, the target second HTTP streams whose similarity with the first alarm traffic is greater than a preset threshold as the second alarm traffic.
  • Calculating the similarity between flows makes it possible to determine the multiple flows that fall in the same cluster as the first alarm traffic (flows whose similarity exceeds the preset threshold, sent by the same software and different applications), and then to determine whether the first alarm traffic is malicious traffic according to the behavior characteristics of those flows, thereby improving the accuracy of identifying malicious traffic.
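  • The second policy can be sketched as follows. The application does not state the concrete generalization rule or similarity measure in this passage, so both are assumptions here: digit runs and long hexadecimal runs are replaced with placeholders, and similarity is measured with `difflib.SequenceMatcher` from the Python standard library. The threshold value 0.8 is likewise only an example.

```python
import re
from difflib import SequenceMatcher


def generalize(request_line):
    """Assumed rule: replace long hex runs and digit runs with placeholders."""
    text = re.sub(r"[0-9a-fA-F]{16,}", "<HEX>", request_line)
    return re.sub(r"\d+", "<NUM>", text)


def similarity(a, b):
    """Similarity in [0, 1] between two generalized strings."""
    return SequenceMatcher(None, a, b).ratio()


def backtrack_by_generalization(alert_request, candidates, threshold=0.8):
    """Keep candidates whose generalized form is close to the alert's."""
    target = generalize(alert_request)
    return [
        request
        for request in candidates
        if similarity(generalize(request), target) > threshold
    ]
```

  • Generalizing before comparing means that two requests differing only in identifiers or timestamps collapse to the same string, so the similarity score reflects the request structure and not the variable values.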
  • The preset policy includes the third policy, and acquiring, according to the preset policy, multiple pieces of second alarm traffic corresponding to the first alarm traffic within the target time period includes: acquiring the IP address and the HTTP Header information of the first alarm traffic; collecting multiple third HTTP streams sent by the IP address within the target time period; performing N-gram processing on the HTTP Header corresponding to each of the multiple third HTTP streams to obtain a first matrix, where the first matrix includes the HTTP Header sequence information corresponding to each third HTTP stream; performing dimensionality reduction processing on the first matrix; extracting, from the first matrix after dimensionality reduction, the target HTTP Header sequence information that matches the HTTP Header information of the first alarm traffic; and acquiring, based on the target HTTP Header sequence information, the third HTTP streams corresponding to it as the second alarm traffic.
  • Backtracking by extracting the HTTP Header sequence information in the flows makes it possible to backtrack to multiple flows sent by different applications in the same software, and then to determine whether the first alarm traffic is malicious traffic according to the behavior characteristics of those flows, which improves the accuracy of identifying malicious traffic.
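  • A simplified sketch of the third policy is given below. It builds an N-gram count matrix over the ordered header names of each stream and matches rows against the alert's row with cosine similarity. The dimensionality reduction step is omitted for brevity, and the use of cosine similarity with a 0.9 threshold as the matching criterion is an assumption of this sketch; all function names are hypothetical.

```python
from collections import Counter
from math import sqrt


def header_ngrams(header_names, n=2):
    """N-grams over the ordered, lower-cased header names of one stream."""
    names = [name.lower() for name in header_names]
    return [tuple(names[i:i + n]) for i in range(len(names) - n + 1)]


def build_matrix(streams_headers, n=2):
    """One row of N-gram counts per stream, over a shared vocabulary."""
    counts = [Counter(header_ngrams(h, n)) for h in streams_headers]
    vocab = sorted({gram for c in counts for gram in c})
    return vocab, [[c[gram] for gram in vocab] for c in counts]


def cosine(u, v):
    """Cosine similarity; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = sqrt(sum(a * a for a in u))
    norm_v = sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def backtrack_by_header_sequence(alert_headers, streams_headers,
                                 threshold=0.9, n=2):
    """Return indices of streams whose header order matches the alert's."""
    _, matrix = build_matrix([alert_headers] + streams_headers, n)
    alert_row = matrix[0]
    return [
        index
        for index, row in enumerate(matrix[1:])
        if cosine(alert_row, row) > threshold
    ]
```

  • Header order is used because it tends to be fixed by the HTTP client library that produced the request, so streams from the same software often share a header sequence even when URLs and User-Agent strings differ.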
  • The first feature information is a feature representation vector, and performing feature extraction on the multiple pieces of second alarm traffic to obtain the first feature information includes: performing feature extraction on the multiple pieces of second alarm traffic to obtain behavior feature information corresponding to them, where the behavior feature information includes one or more of a connection behavior feature, a request difference feature, and a request response feature; and obtaining the feature representation vector according to the behavior feature information.
  • Determining whether the first alarm traffic is malicious traffic according to the first feature information includes: performing detection through a backtracking model based on the first feature information to obtain a first detection result; performing detection through a baseline model based on the multiple pieces of second alarm traffic to obtain a second detection result, where the baseline model is a detection model pre-trained on historical traffic; and determining, based on the first detection result and the second detection result, whether the first alarm traffic is malicious traffic.
  • The method further includes: if the first alarm traffic is malicious traffic, performing preset generalization processing on the first alarm traffic to obtain generalized first alarm traffic; and classifying the generalized first alarm traffic to determine the malicious traffic type matched by the first alarm traffic.
  • Before determining the receiving time of the first alarm traffic, the method further includes: receiving multiple fourth HTTP streams; performing feature extraction on each of the multiple fourth HTTP streams according to a preset feature extraction rule to obtain a second feature set, where the second feature set includes second feature information corresponding to each of the multiple fourth HTTP streams; and filtering out, based on the second feature set and through a first classification model, the first alarm traffic from the multiple fourth HTTP streams.
  • the first alarm traffic, that is, the suspected malicious traffic
  • the plurality of fourth HTTP streams, that is, single-stream filtering
  • The second feature information includes manual feature information and/or representation learning feature information. The manual feature information includes one or more of the following features corresponding to the fourth HTTP stream: domain name readability features, uniform resource locator (URL) structure features, behavior indication features, and HTTP Header features. The representation learning feature information includes high-dimensional features corresponding to the fourth HTTP stream.
  • an apparatus for identifying malicious traffic including:
  • a determining unit configured to determine the receiving time of the first alarm traffic
  • a backtracking unit configured to obtain, according to a preset policy, multiple pieces of second alarm traffic corresponding to the first alarm traffic within a target time period; the target time period is a time period determined based on the receiving time; the similarity between each piece of second alarm traffic in the multiple pieces of second alarm traffic and the first alarm traffic is greater than a preset threshold;
  • an extraction unit configured to perform feature extraction on the multiple pieces of second alarm traffic to obtain first feature information
  • a judgment unit configured to judge whether the first alarm traffic is malicious traffic based on the first feature information.
  • The preset policy includes one or more of a first policy, a second policy, and a third policy. The first policy is a policy for acquiring the multiple pieces of second alarm traffic based on the Internet Protocol (IP) address and the user agent (UA) information of the first alarm traffic; the second policy is a policy for acquiring the multiple pieces of second alarm traffic based on the IP address of the first alarm traffic and a preset generalization rule; the third policy is a policy for acquiring the multiple pieces of second alarm traffic based on the IP address of the first alarm traffic and the HyperText Transfer Protocol (HTTP) Header information of the first alarm traffic.
  • The preset policy includes the first policy, and the backtracking unit is specifically configured to: acquire the IP address and UA information of the first alarm traffic; and collect, from the multiple HTTP streams sent by the IP address within the target time period, the HTTP streams whose UA information is the same as that of the first alarm traffic as the second alarm traffic.
  • The preset policy includes the second policy, and the backtracking unit is specifically configured to: acquire the IP address of the first alarm traffic; collect multiple first HTTP streams sent by the IP address within the target time period; perform generalization processing on the multiple first HTTP streams according to the preset generalization rule to obtain multiple second HTTP streams, where the preset generalization rule is to uniformly replace, using a preset standard, the target string corresponding to each of the multiple first HTTP streams; and filter out, from the multiple second HTTP streams, the target second HTTP streams whose similarity with the first alarm traffic is greater than the preset threshold as the second alarm traffic.
  • The preset policy includes the third policy, and the backtracking unit is specifically configured to: acquire the IP address and the HTTP Header information of the first alarm traffic; collect multiple third HTTP streams sent by the IP address within the target time period; perform N-gram processing on the HTTP Header corresponding to each of the multiple third HTTP streams to obtain a first matrix, where the first matrix includes the HTTP Header sequence information corresponding to each third HTTP stream; perform dimensionality reduction processing on the first matrix; extract, from the first matrix after dimensionality reduction, the target HTTP Header sequence information that matches the HTTP Header information of the first alarm traffic; and acquire, based on the target HTTP Header sequence information, the third HTTP streams corresponding to it as the second alarm traffic.
  • The first feature information is a feature representation vector, and the extraction unit is specifically configured to: perform feature extraction on the multiple pieces of second alarm traffic to obtain behavior feature information corresponding to them, where the behavior feature information includes one or more of a connection behavior feature, a request difference feature, and a request response feature; and obtain the feature representation vector according to the behavior feature information.
  • The judging unit is specifically configured to: perform detection through a backtracking model based on the first feature information to obtain a first detection result; perform detection through a baseline model based on the multiple pieces of second alarm traffic to obtain a second detection result, where the baseline model is a detection model pre-trained on historical traffic; and determine, based on the first detection result and the second detection result, whether the first alarm traffic is malicious traffic.
  • The apparatus further includes: a generalization unit, configured to perform preset generalization processing on the first alarm traffic if the first alarm traffic is malicious traffic, to obtain generalized first alarm traffic; and a classification unit, configured to classify the generalized first alarm traffic and determine the malicious traffic type matched by the first alarm traffic.
  • The apparatus further includes an alarm traffic unit, configured to: receive multiple fourth HTTP streams before the receiving time of the first alarm traffic is determined; perform feature extraction on each of the multiple fourth HTTP streams according to a preset feature extraction rule to obtain a second feature set, where the second feature set includes second feature information corresponding to each of the multiple fourth HTTP streams; and filter out, based on the second feature set and through a first classification model, the first alarm traffic from the multiple fourth HTTP streams.
  • The second feature information includes manual feature information and/or representation learning feature information. The manual feature information includes one or more of the following features corresponding to the fourth HTTP stream: domain name readability features, uniform resource locator (URL) structure features, behavior indication features, and HTTP Header features. The representation learning feature information includes high-dimensional features corresponding to the fourth HTTP stream.
  • an embodiment of the present application provides a service device, the service device includes a processor, and the processor is configured to support the service device to implement corresponding functions in the malicious traffic identification method provided in the first aspect.
  • the service device may also include memory, coupled to the processor, which holds program instructions and data necessary for the service device.
  • the service device may also include a communication interface for the service device to communicate with other devices or a communication network.
  • an embodiment of the present application provides a computer-readable storage medium for storing computer software instructions used by the malicious traffic identification device provided in the second aspect above, which includes a program for executing the above aspect.
  • an embodiment of the present application provides a computer program, where the computer program includes instructions, when the computer program is executed by a computer, the computer can execute the process performed by the malicious traffic identification device in the second aspect.
  • the present application provides a chip system
  • the chip system includes a processor for supporting a terminal device to implement the functions involved in the first aspect above, for example, generating or processing the information involved in the above malicious traffic identification method.
  • the chip system further includes a memory for storing necessary program instructions and data of the data sending device.
  • the chip system may be composed of chips, or may include chips and other discrete devices.
  • FIG. 1 is a schematic structural diagram of a malicious traffic identification system provided by an embodiment of the present application.
  • FIG. 2 is a schematic flowchart of a method for identifying malicious traffic provided by an embodiment of the present application.
  • FIG. 3 is a schematic diagram of a framework for identifying malicious traffic provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of a feature extraction provided by an embodiment of the present application.
  • FIG. 5 is a schematic flowchart of backtracking traffic according to a first policy according to an embodiment of the present application.
  • FIG. 6 is a schematic diagram of multiple pieces of traffic backtracked according to a first policy according to an embodiment of the present application.
  • FIG. 7 is a schematic flowchart of backtracking traffic according to a second strategy according to an embodiment of the present application.
  • FIG. 8 is a schematic diagram before and after a traffic generalization provided by an embodiment of the present application.
  • FIG. 9 is a schematic flowchart of backtracking traffic according to a third strategy according to an embodiment of the present application.
  • FIG. 10 is a schematic flowchart of a method for obtaining first feature information provided by an embodiment of the present application.
  • FIG. 11 is a function image provided by an embodiment of the present application with En as an independent variable and an as a dependent variable.
  • FIG. 12 is a schematic flowchart of determining a type of malicious traffic according to an embodiment of the present application.
  • FIG. 13 is a schematic structural diagram of an apparatus for identifying malicious traffic provided by an embodiment of the present application.
  • FIG. 14 is a schematic structural diagram of another device for identifying malicious traffic provided by an embodiment of the present application.
  • At least one of a, b, or c may represent: a, b, c, a and b, a and c, b and c, or a, b and c, where each of a, b, and c may be single or multiple.
  • a component may be, but is not limited to, a process running on a processor, a processor, an object, an executable, a thread of execution, a program, and/or a computer.
  • an application running on a computing device and the computing device may be components.
  • One or more components may reside within a process and/or thread of execution, and a component may be localized on one computer and/or distributed between 2 or more computers.
  • these components can execute from various computer readable media having various data structures stored thereon.
  • a component may, for example, communicate through local and/or remote processes based on a signal having one or more data packets (e.g., data from two components interacting with another component in a local system, in a distributed system, and/or across a network such as the Internet that interacts with other systems via signals).
  • HTTP HyperText Transfer Protocol
  • HTML: HyperText Markup Language
  • CC/C2: a remote command and control server. The target machine can receive commands from the server, which allows the server to control the target machine. This method is often used by Trojan viruses to control infected machines.
  • IRC Internet Relay Chat
  • N-gram an n-gram, which refers to n words that appear consecutively in the text.
  • the n-gram model is a probabilistic language model based on (n-1) order Markov chains, which infers the structure of sentences by the probability of occurrence of n words.
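  • To make the Markov assumption concrete for n = 2, the sketch below estimates bigram probabilities P(next | previous) from token counts, so each token depends only on the one token before it (a first-order chain). The function name `bigram_probabilities` is hypothetical, and the maximum-likelihood estimate without smoothing is a choice made for this illustration.

```python
from collections import Counter


def bigram_probabilities(tokens):
    """Estimate P(b | a) = count(a, b) / count(a as a context) for bigrams."""
    pair_counts = Counter(zip(tokens, tokens[1:]))
    context_counts = Counter(tokens[:-1])
    return {
        (a, b): count / context_counts[a]
        for (a, b), count in pair_counts.items()
    }
```

  • For the token sequence `["get", "host", "get", "accept"]`, the token "get" is followed once by "host" and once by "accept", so each of those two bigrams receives probability 0.5.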
  • Content-Type content type
  • Content-Type generally refers to the Content-Type present in a web page, which defines the type of the network file and the encoding of the web page, and determines in what form and with what encoding the browser reads the file. This is why clicking on some web pages results in a file or a picture being downloaded.
  • the ContentType property specifies the HTTP content type of the response. If ContentType is not specified, it defaults to TEXT/HTML.
  • Representation Learning also known as learning representation.
  • representation refers to the form and method used to represent the input observation sample X of the model through the parameters of the model.
  • Representation learning refers to learning a representation that is valid for the observed sample X.
  • the low-dimensional vector representation obtained by representation learning is a distributed representation. The reason why it is so named is because each dimension in a vector has no clear corresponding meaning when viewed in isolation; and a vector is formed by combining each dimension, which can represent the semantic information of the object.
  • Decision tree: a decision analysis method that, based on the known probabilities of various situations, builds a decision tree to find the probability that the expected net present value is greater than or equal to zero, in order to evaluate project risk and judge feasibility. It is a graphical method that applies probability analysis intuitively. Because the decision branches are drawn like the branches of a tree, it is called a decision tree. In machine learning, a decision tree is a predictive model that represents a mapping relationship between object attributes and object values. A classification tree (decision tree) is a very common classification method. It is a kind of supervised learning.
  • Supervised learning: given a set of samples, each with a set of attributes and a category that is determined in advance, a classifier is obtained through learning, and this classifier can assign the correct classification to an object. Such machine learning is called supervised learning.
  • User Agent (UA): refers to browsers and search engines. Its information includes the hardware platform, system software, application software, and the user's personal preferences.
  • URL Uniform Resource Locator
  • Content type generally refers to the Content-Type field of a web page, which defines the type of the network file and the encoding of the web page, and determines in what form and with what encoding the browser reads the file. This is why clicking on some ASP web pages results in a file or picture being downloaded.
  • TF-IDF Term Frequency–Inverse Document Frequency
  • TF-IDF is a commonly used weighting technique for information retrieval and data mining, used to evaluate the importance of a word to one of the documents in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency with which it appears in the corpus.
  • Various forms of TF-IDF weighting are often applied by search engines as a measure or rating of the degree of relevance between documents and user queries.
  • In addition to TF-IDF, search engines on the Internet also use ranking methods based on link analysis to determine the order in which documents appear in search results.
  • The bag-of-words (BOW) model is a commonly used document representation method in the field of information retrieval.
  • The BOW model ignores the word order, grammar, syntax, and other such elements of a document, and regards the document only as a collection of words.
  • The appearance of each word in the document is independent and does not depend on whether other words appear (that is, it is unrelated to order). In other words, any word at any position in the document is selected independently, without being affected by the semantics of the document.
  • ROC: Receiver Operating Characteristic curve
  • the ROC curve and AUC coefficient are mainly used to test the ability of the model to correctly rank customers.
  • the ROC curve describes the proportion of cumulative bad customers under a certain cumulative proportion of good customers. The stronger the model's ability to distinguish, the closer the ROC curve is to the upper left corner.
  • the AUC coefficient represents the area under the ROC curve. The higher the AUC coefficient, the stronger the risk discrimination ability of the model.
  • KS: Kolmogorov-Smirnov test. The K-S test is mainly used to verify the model's ability to distinguish default objects. Usually, after the model predicts the credit score of the entire sample, the sample is divided into two groups, default and non-default, and the KS statistic is then used to test whether the credit score distributions of the two groups differ significantly.
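  • As a minimal sketch of the two evaluation measures defined above, the AUC can be computed as the fraction of (positive, negative) pairs that the model ranks correctly, and the KS statistic as the largest gap between the empirical score distributions of the two classes. The function names and the sample scores are illustrative assumptions, not part of the application.

```python
def roc_auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 0.5."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def ks_statistic(scores, labels):
    """Largest absolute gap between the empirical score CDFs of the two classes."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best = 0.0
    for threshold in sorted(set(scores)):
        cdf_pos = sum(1 for s in pos if s <= threshold) / len(pos)
        cdf_neg = sum(1 for s in neg if s <= threshold) / len(neg)
        best = max(best, abs(cdf_pos - cdf_neg))
    return best


# Made-up model scores; label 1 marks the class the model should score higher.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)
ks = ks_statistic(scores, labels)
```

  • The pairwise loop is quadratic in the number of samples, which is acceptable for illustration; a rank-based computation would be preferable for large evaluation sets.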
  • FIG. 1 is a schematic structural diagram of a malicious traffic identification system provided by an embodiment of the present application.
  • The client in this application may include the first service device 001, the second service device 002, and the third service device 003 in FIG. 1.
  • The communication connection can be wired or wireless, and both the second service device 002 and the third service device 003 can send a HyperText Transfer Protocol (HTTP) request to the first service device.
  • The first service device 001 may include, but is not limited to, a background server, a component server, a data processing server, and the like, that is, a device that provides various local service programs for clients.
  • The first service device 001 may receive or respond to HyperText Transfer Protocol (HTTP) requests sent by one or more service devices, so as to provide corresponding application services for other service devices.
  • The first service device 001 needs to identify whether the HTTP requests sent by other service devices are malicious traffic; if a request is malicious traffic, the first service device does not respond to it, in order to ensure network security.
  • The first service device 001 is configured with a local service for identifying malicious traffic, where the local service may include, but is not limited to: determining the receiving time of the first alarm traffic; acquiring, according to a preset policy, multiple pieces of second alarm traffic corresponding to the first alarm traffic within a target time period, where the target time period is a time period determined based on the receiving time, and the similarity between each of the multiple pieces of second alarm traffic and the first alarm traffic is greater than a preset threshold; performing feature extraction on the multiple pieces of second alarm traffic to obtain first feature information; and determining, based on the first feature information, whether the first alarm traffic is malicious traffic.
  • The second service device 002 may also include, but is not limited to, a backend server, a component server, a data processing server, and the like, that is, a device that provides various local service programs for clients. Relevant applications can be installed and run on it, and it can send an HTTP request to the first service device so as to obtain the corresponding service after the first service device responds.
  • The third service device 003 can be a Trojan horse command and control server (Command and Control Server, CC/C2). Other service devices can receive commands from the third service device 003 (the CC server), so that the third service device 003 achieves the purpose of controlling those service devices. Such a server is often used by viruses and Trojans to control infected service devices.
  • the third service device 003 may send an HTTP request to the first service device, so that the first service device receives an HTTP stream, which is malicious traffic and can be identified by the first service device.
  • the network architecture in FIG. 1 is only an exemplary implementation in the embodiments of the present application, and the malicious traffic identification system architecture in the embodiments of the present application includes but is not limited to the above malicious traffic identification system architecture.
  • FIG. 2 is a schematic flowchart of a malicious traffic identification method provided by an embodiment of the present application. The method can be applied to the malicious traffic identification system architecture described in FIG. 1, wherein the first service device 001 may be used to support and execute the method flow steps S201-S209 shown in FIG. 2. The following description is given from the side of the first service device with reference to FIG. 2.
  • the method may include the following steps S201-S209.
  • Step S201 Receive multiple fourth HTTP streams.
  • the malicious traffic identification device receives a plurality of fourth hypertext transfer protocol HTTP streams.
  • the fourth HTTP stream may be a hypertext transfer protocol HTTP stream received by the first service device and sent from one or more second service devices and/or third service devices.
  • Step S202 Perform feature extraction on each of the plurality of fourth HTTP streams to obtain a second feature set.
  • The malicious traffic identification device performs feature extraction on each of the plurality of fourth HTTP streams to obtain a second feature set, wherein the second feature set includes the second feature information corresponding to each of the plurality of fourth HTTP streams.
  • The first service device 001 can perform feature extraction according to preset feature extraction rules to obtain the corresponding non-numeric feature vectors and other numeric feature vectors. These feature vectors are spliced together according to a uniform rule to obtain a final single-stream feature vector, that is, the second feature information corresponding to the fourth HTTP stream. Please refer to FIG.
  • The existing network traffic is first processed by a single-flow classifier (feature processing and single-flow classification, that is, single-flow filtering, are performed by models trained on multiple black and white flows) to obtain the first alarm traffic, which is suspected malicious traffic. For example, for single-stream data traffic, a feature processor extracts the single-stream features of the traffic to form a feature vector, and this feature vector is input into the classifier to preliminarily determine whether the traffic is CC communication traffic of malware (that is, the first alarm traffic). Multi-stream feature extraction is then performed based on the first alarm traffic to obtain a multi-stream feature representation (multi-stream backtracking). Finally, based on the multi-stream feature representation, whether the first alarm traffic is malicious traffic is determined through the backtracking model and the baseline model.
  • The features extracted by the above models may further be passed to a malicious family classifier to finally determine the type to which the first alarm traffic belongs.
  • the second feature information includes manual feature information and/or representation learning feature information; wherein, the manual feature information includes: domain name readability feature corresponding to the fourth HTTP stream, uniform resource locator URL structure feature , one or more of behavior indicating features, and HTTP Header features; the representation learning feature information includes high-dimensional features corresponding to the fourth HTTP stream.
  • FIG. 4 is a schematic diagram of a feature extraction provided by an embodiment of the present application.
  • a feature engineering method can be used to extract manual features and a representation learning method can be used to perform feature extraction.
  • (1) manual feature information includes one or more of the following features: domain name readability feature corresponding to the fourth HTTP stream, uniform resource locator URL structure feature, behavior indication feature, HTTP Header feature (HTTP response feature) ;
  • the URL statistical features include one or more of the following features: length, proportion of vowels, proportion of consonants, proportion of special characters, proportion of uppercase letters, proportion of lowercase letters, proportion of numbers, domain name level, domain name character distribution, top-level domain name , path length, path layer number, file suffix, number of parameters, average parameter value length, whether there is base64, whether it follows common patterns;
  • HTTP Header features include one or more of the following features: content type (Content Type), user agent (UA), HTTP return status code, N-gram of the Header sequence.
  • (2) Representation learning feature information: with the assistance of representation learning (Representation Learning), the high-dimensional features of the fourth HTTP stream are extracted, so as to maximize the features extracted from the existing data set and to correlate them in higher dimensions.
  • the white traffic shown in Figure 4 refers to normal traffic
  • the black traffic refers to malicious traffic.
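  • To make the manual URL features listed above more concrete, the following sketch computes a subset of them (length, character proportions, domain name level, top-level domain name, path length, path layer number, file suffix, number of parameters, and average parameter value length) with the Python standard library. The function name, the dictionary keys, and the example URL are illustrative assumptions; the application does not prescribe a particular implementation.

```python
from urllib.parse import urlsplit, parse_qsl

VOWELS = set("aeiou")


def url_features(url):
    """Compute a few hand-crafted statistical features of a URL."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    path = parts.path
    params = parse_qsl(parts.query, keep_blank_values=True)
    n = len(url) or 1  # avoid division by zero for an empty URL
    last_segment = path.rsplit("/", 1)[-1]
    suffix = last_segment.rsplit(".", 1)[-1] if "." in last_segment else ""
    return {
        "length": len(url),
        "vowel_ratio": sum(c.lower() in VOWELS for c in url) / n,
        "digit_ratio": sum(c.isdigit() for c in url) / n,
        "upper_ratio": sum(c.isupper() for c in url) / n,
        "special_ratio": sum(not c.isalnum() for c in url) / n,
        "domain_levels": host.count(".") + 1 if host else 0,
        "top_level_domain": host.rsplit(".", 1)[-1] if "." in host else "",
        "path_length": len(path),
        "path_depth": len([seg for seg in path.split("/") if seg]),
        "file_suffix": suffix,
        "param_count": len(params),
        "avg_param_value_length": (
            sum(len(v) for _, v in params) / len(params) if params else 0.0
        ),
    }


# Hypothetical URL used only for illustration.
features = url_features("http://cdn.example.com/a/b/update.php?id=42&key=abcd")
```

  • String-valued entries such as the top-level domain name and the file suffix remain non-numeric at this point and still need the text processing described in the following paragraphs.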
  • performing feature extraction on each of the multiple fourth HTTP streams to obtain a second feature set includes: performing feature extraction on each of the multiple fourth HTTP streams, Obtain an initial feature set; perform text processing on the non-numeric features in the initial feature set to obtain the second feature set.
  • Because the classification model generally processes numeric input, text features or other non-numeric features need text-to-number conversion to turn them into numeric vectors that the classification model can process.
  • the method of performing feature extraction on each of the plurality of fourth HTTP streams may be manual feature extraction and/or representation learning feature extraction.
  • The above-mentioned text features include but are not limited to: top-level domain name, file suffix, Content Type, UA, etc. Since these four fields are all strings, and a machine learning classifier cannot process strings directly, the strings need to be converted into numeric vectors that the classification model can process.
  • the method used in the above text processing process is: TF-IDF.
  • In TF-IDF, term frequency (Term Frequency, abbreviated as TF) is the number of times a word appears in a document (often normalized by the total number of words in the document), which reflects how frequent the word is in that document.
  • Inverse document frequency (Inverse Document Frequency, abbreviated as IDF) is the logarithm of the total number of documents divided by the number of documents that contain the word. It is inversely related to how common the word is across the corpus, and therefore down-weights words that occur frequently but carry little meaning.
  • For example, TF-IDF transformation is first performed on these features, and their vector representations are calculated based on term frequency and document frequency. It should be noted that, during classification and identification, the data processed by TF-IDF needs to be compared with the basic TF-IDF library of the detection model to find abnormalities.
  • The basic TF-IDF library can be obtained statistically from white traffic (identified normal data traffic, which can be confirmed by technical means and is used for model training or correctness verification), and can be generated from specific white traffic in a specific detection scenario.
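  • The TF-IDF conversion of string features can be sketched as follows. This is a minimal illustration that assumes each string field (here, made-up UA strings) has already been split into tokens, and it uses one common variant of the formula: term frequency normalized by document length, multiplied by log(N / df). The function name and sample data are assumptions.

```python
import math
from collections import Counter


def tfidf_vectors(documents):
    """documents: list of token lists. Returns (vocabulary, one vector per document)."""
    vocabulary = sorted({tok for doc in documents for tok in doc})
    n_docs = len(documents)
    # Document frequency: in how many documents each token occurs at least once.
    doc_freq = Counter(tok for doc in documents for tok in set(doc))
    idf = {tok: math.log(n_docs / doc_freq[tok]) for tok in vocabulary}
    vectors = []
    for doc in documents:
        counts = Counter(doc)
        total = len(doc) or 1
        vectors.append([counts[tok] / total * idf[tok] for tok in vocabulary])
    return vocabulary, vectors


# Made-up, pre-tokenized UA strings.
user_agents = [
    ["mozilla", "windows", "chrome"],
    ["mozilla", "linux", "firefox"],
    ["mozilla", "windows", "bot"],
]
vocab, vectors = tfidf_vectors(user_agents)
```

  • A token that appears in every document, such as "mozilla" here, receives a weight of zero, which is the down-weighting effect that the IDF term is intended to provide.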
  • Performing text processing on non-numeric features in the initial feature set to obtain the second feature set includes: performing text processing on the non-numeric features in the initial feature set to obtain a numeric feature vector set; and performing dimensionality reduction on the numeric feature vector set to obtain the second feature set. After TF-IDF processing is performed on the extracted initial feature set, the resulting vectors have a relatively large dimension; such high-dimensional vectors consume resources in the classification model and subsequent processing, and processing efficiency is low. Dimensionality reduction can therefore be performed to map the high-dimensional vectors into a low-dimensional vector space.
  • The dimensionality reduction method may include, but is not limited to, singular value decomposition (Singular Value Decomposition, SVD), principal component analysis (Principal Component Analysis, PCA), and the like. For example, a dimensionality reduction operation reduces the vector produced by TF-IDF from a high-dimensional space to a ten-dimensional space.
  • The features extracted by different methods for each fourth HTTP stream are combined and filtered to obtain the second feature information corresponding to each fourth HTTP stream.
  • For example, feature engineering features and representation learning features are combined, and a feature selection algorithm such as minimum redundancy maximum relevance (mRMR) is used to filter out the best-performing feature set corresponding to each fourth HTTP stream.
  • a single-flow traffic feature is extracted from the current network traffic, text processing is performed on non-numeric features, and a second feature set is obtained by combining and filtering traffic features.
  • Step S203 Based on the second feature set, filter out the first alarm traffic from the plurality of fourth HTTP streams through the first classification model.
  • the apparatus for identifying malicious traffic may, based on the second feature set, filter out the first alarm traffic from the plurality of fourth HTTP streams through a first classification model.
  • The first alarm traffic is the suspected malicious traffic that the first classification model screens out from the plurality of fourth HTTP streams.
  • the traffic feature vector (ie, the second feature information) of each fourth HTTP stream obtained above is input into the first classification model.
  • The first classification model can use the stacking method: different classifiers are trained on different features to make judgments, and a decision tree mechanism is then applied to the judgment results of the individual classifiers to finally obtain the first-layer detection result based on the HTTP session.
  • the first classification model may be a model trained by using the marked black and white traffic training data set. This preprocessing of the data enables initial screening of normal traffic.
  • For single-stream data traffic, composite features are extracted and selected based on manual experience and representation learning methods to form single-stream feature vectors, which are then input into the classifier. This first step determines whether the traffic is suspected to be CC communication traffic of malware; only if so is it judged further in the next step, which greatly improves the efficiency of judging whether traffic is malicious traffic.
  • Step S204 Determine the receiving time of the first alarm traffic.
  • The malicious traffic identification device determines the receiving time of the first alarm traffic. After the first alarm traffic is filtered out, its receiving time can be determined so that multiple related flows can be traced back.
  • Step S205 Acquire multiple pieces of second alarm traffic corresponding to the first alarm traffic within the target time period according to a preset policy.
  • The malicious traffic identification device obtains, according to a preset policy, multiple pieces of second alarm traffic corresponding to the first alarm traffic within a target time period, where the target time period is a time period determined based on the receiving time, and the similarity between each of the multiple pieces of second alarm traffic and the first alarm traffic is greater than a preset threshold.
  • In the detection process, because traffic in the existing network is relatively complex, detection based on a single HTTP stream is contingent to a certain extent. If the communication behavior of a malicious sample is observed from a multi-stream perspective, multiple requests can be traced back and grouped into different clusters, and the combined statistical features of the clusters to which each alarm flow belongs are used to decide whether the alarm is a true positive or a false positive, thus eliminating accidental errors. That is, by observing the overall communication behavior of a malicious sample within a certain period of time, the sample can be judged more accurately from a behavioral perspective, which makes the final multi-stream result more robust and its behavior more interpretable.
  • The target time period is a time period determined based on the receiving time. For example, the target time period is a time period of a preset duration starting from the receiving time, or a time period of a preset duration ending at the receiving time.
  • the target time period may also be a time period including the receiving time.
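  • The three ways of deriving the target time period from the receiving time can be sketched as follows. The function name, the `mode` values, and the example timestamp are hypothetical and chosen only for illustration.

```python
from datetime import datetime, timedelta


def target_period(receive_time, minutes, mode="around"):
    """Return (start, end) of the target time period for a given receiving time."""
    span = timedelta(minutes=minutes)
    if mode == "after":   # preset duration starting from the receiving time
        return receive_time, receive_time + span
    if mode == "before":  # preset duration ending at the receiving time
        return receive_time - span, receive_time
    if mode == "around":  # period that includes the receiving time
        return receive_time - span, receive_time + span
    raise ValueError(f"unknown mode: {mode}")


# Made-up receiving time of a first alarm flow.
alarm_time = datetime(2021, 12, 27, 10, 0, 0)
start, end = target_period(alarm_time, 5, mode="before")
```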
  • The preset policy includes one or more of a first policy, a second policy, and a third policy. The first policy is a policy for obtaining the multiple pieces of second alarm traffic based on the Internet Protocol (IP) address and user agent (UA) information of the first alarm traffic; the second policy is a policy for obtaining the multiple pieces of second alarm traffic based on the IP address of the first alarm traffic and a preset generalization rule; the third policy is a policy for obtaining the multiple pieces of second alarm traffic based on the IP address of the first alarm traffic and the HyperText Transfer Protocol (HTTP) Header information of the first alarm traffic.
  • The traffic backtracking method collects the CC communication traffic for a period of time before and/or after the first alarm traffic, and multi-stream features are then extracted.
  • The first policy traces back based on the IP address and UA information of the first alarm traffic, and can trace back to multiple flows sent by the same software, the same service device, or the same application. The second policy traces back multiple flows based on the IP address of the first alarm traffic and then generalizes the backtracked flows according to the preset generalization rule, so as to filter out multiple flows sent by the same software and different applications as the first alarm traffic. The third policy traces back based on the IP address and HTTP header information of the first alarm traffic, and can trace back to multiple flows sent by different applications in the same software.
  • These traffic backtracking methods can accurately trace back to multiple flows from the same source as the first alarm traffic, so that whether the first alarm traffic is malicious traffic can be identified according to the behavior characteristics of the multiple flows, which improves the accuracy of identifying malicious traffic.
  • The preset policy includes the first policy; the acquiring, according to the preset policy, multiple pieces of second alarm traffic corresponding to the first alarm traffic within a target time period includes: acquiring the IP address and UA information of the first alarm traffic; and collecting, from multiple HTTP streams sent by the IP address within the target time period, the HTTP streams whose UA information is the same as that of the first alarm traffic as the second alarm traffic.
  • FIG. 5 is a schematic flowchart of backtracking traffic according to a first policy according to an embodiment of the present application. As shown in FIG. 5, the UA information and source IP address of the first alarm traffic can be used as a unique index for traffic backtracking. Because the UA header information identifies the application that generated the traffic, all HTTP flows with the same UA information that the source IP address (src-ip) sent within N minutes before or N minutes after the first alarm traffic are extracted as the second alarm traffic for retrospective analysis.
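  • The first policy can be sketched as a simple filter over recorded flows: keep the flows that share the source IP address and UA information of the first alarm traffic and that fall within N minutes before or after it. The `HttpFlow` record, its field names, and the sample flows are hypothetical; excluding the alarm flow itself from the result is also an assumption of this sketch.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class HttpFlow:
    src_ip: str
    user_agent: str
    timestamp: datetime
    url: str


def backtrack_by_ua(alarm, flows, minutes):
    """Collect flows with the same (src_ip, UA) index within +/- `minutes` of the alarm."""
    window = timedelta(minutes=minutes)
    return [
        f for f in flows
        if f is not alarm
        and f.src_ip == alarm.src_ip
        and f.user_agent == alarm.user_agent
        and abs(f.timestamp - alarm.timestamp) <= window
    ]


# Made-up flows for illustration.
t0 = datetime(2021, 1, 1, 12, 0, 0)
alarm = HttpFlow("10.0.0.5", "UA1", t0, "/gate.php")
flows = [
    alarm,
    HttpFlow("10.0.0.5", "UA1", t0 + timedelta(minutes=2), "/gate.php?x=1"),
    HttpFlow("10.0.0.5", "UA2", t0 + timedelta(minutes=1), "/index"),       # other UA
    HttpFlow("10.0.0.9", "UA1", t0 + timedelta(minutes=1), "/gate.php"),    # other host
    HttpFlow("10.0.0.5", "UA1", t0 + timedelta(minutes=30), "/gate.php"),   # outside window
]
second_alarm_traffic = backtrack_by_ua(alarm, flows, minutes=5)
```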
  • FIG. 6 is a schematic diagram of a plurality of traffic backtracked according to a first policy provided by an embodiment of the present application.
  • Multi-flow packet 1 corresponds to the first alarm traffic whose IP address is IP and UA information is UA1;
  • the UA information is the first alarm traffic of UA3.
  • HTTP request 1 to HTTP request 4 correspond to a typical site polling plus URL change mode, HTTP request 5 to HTTP request 7 correspond to a typical stable heartbeat mode, and HTTP request 8 to HTTP request 10 correspond to the communication behavior of some specific samples.
  • The preset policy includes the second policy; the acquiring, according to the preset policy, multiple pieces of second alarm traffic corresponding to the first alarm traffic within a target time period includes: acquiring the IP address of the first alarm traffic; collecting multiple first HTTP streams sent by the IP address within the target time period; performing generalization processing on the multiple first HTTP streams according to the preset generalization rule to obtain multiple second HTTP streams, where the preset generalization rule is to uniformly replace, using a preset standard, the target string corresponding to each of the multiple first HTTP streams; and filtering out, from the multiple second HTTP streams, the target second HTTP streams whose similarity with the first alarm traffic is greater than the preset threshold as the second alarm traffic. Please refer to FIG. 7. FIG. 7 is a schematic flowchart of backtracking traffic according to a second policy according to an embodiment of the present application.
  • The historical traffic of the source IP (for example, the traffic data of the same source IP in the target time period) is searched for all HTTP flows most similar to the first alarm traffic, that is, the second alarm traffic.
  • Generalization means using a single standard to replace the variable character-string positions in the traffic (for example, in the embodiment of this application, all lowercase letters can be replaced by x, special characters by T, and uppercase letters by X).
  • Please refer to FIG. 8. FIG. 8 is a schematic diagram before and after a traffic generalization provided by an embodiment of the present application. As shown in FIG. 8, after a plurality of first HTTP streams are generalized according to a unified generalization rule, their corresponding second HTTP streams are obtained. Further, the similarity between each of the plurality of second HTTP streams and the first alarm traffic can be calculated.
  • The malicious traffic identification device may first vectorize the flows using the bag-of-words (BOW) model, and then use the cosine similarity of the vector space model (VSM) to calculate the similarity between each of the plurality of second HTTP streams and the first alarm traffic: $\cos\theta = \dfrac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert}$, where A is the template vector of the alarm flow and B is the vector of the backtracked flow.
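  • The second policy can be sketched end to end: generalize each request, build a bag-of-words vector, and compare it with the alarm template using cosine similarity, A·B / (|A| |B|). The replacement rule for letters and special characters follows the example given in this embodiment; the tokenization on URL delimiters, the decision to leave digits unchanged, and the sample request lines are assumptions of this sketch.

```python
import math
import re
from collections import Counter


def generalize(token):
    """Replace lowercase letters with x, uppercase with X, other symbols with T."""
    out = []
    for ch in token:
        if ch.islower():
            out.append("x")
        elif ch.isupper():
            out.append("X")
        elif ch.isdigit():
            out.append(ch)  # assumption: the text gives no rule for digits, so keep them
        else:
            out.append("T")
    return "".join(out)


def bag_of_words(request_line):
    """Split on URL delimiters (an assumption), then count the generalized tokens."""
    tokens = [t for t in re.split(r"[/?&=.\s]+", request_line) if t]
    return Counter(generalize(t) for t in tokens)


def cosine_similarity(a, b):
    dot = sum(a[k] * b[k] for k in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


# Made-up request lines: the first two share a template, the third does not.
alarm = bag_of_words("GET /news/index.php?id=abc&token=QwErTy")
similar = bag_of_words("GET /blog/index.php?id=xyz&token=AsDfGh")
different = bag_of_words("POST /upload")
sim_same_template = cosine_similarity(alarm, similar)
sim_other = cosine_similarity(alarm, different)
```

  • Flows whose similarity to the alarm template exceeds the preset threshold would then be kept as second alarm traffic; the threshold value itself is not specified here.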
  • The preset policy includes the third policy; the acquiring, according to the preset policy, multiple pieces of second alarm traffic corresponding to the first alarm traffic within a target time period includes: acquiring the IP address and HTTP Header information of the first alarm traffic; collecting multiple third HTTP streams sent by the IP address within the target time period; performing N-gram processing on the HTTP Header of each of the multiple third HTTP streams to obtain a first matrix, where the first matrix includes the HTTP Header sequence information corresponding to each third HTTP stream; performing dimensionality reduction on the first matrix, and extracting from the dimension-reduced first matrix the target HTTP Header sequence information that matches the HTTP Header information of the first alarm traffic; and obtaining, based on the target HTTP Header sequence information, the third HTTP stream corresponding to the target HTTP Header sequence information as the second alarm traffic.
  • FIG. 9 is a schematic flowchart of backtracking traffic according to a third policy according to an embodiment of the present application.
  • N-gram processing is performed on the HTTP Headers of the HTTP requests of the source IP; that is, the HTTP Header sequence information in the traffic is extracted, with N taking different values (depending on performance considerations), to form the sample-header combination matrix (HTTP header sequence N-gram matrix) shown in Table 1.
  • The hash trick is then used for dimensionality reduction, and the HTTP streams with the same sequence after dimensionality reduction are extracted.
  • The hash trick can be used to reduce the dimension of the N-gram matrix. For example, a MinHash is performed on a random transformation of the feature vector x to obtain a hash result, and the last b bits of the hash result (expressed in binary) are kept; this is the b-bit MinHash process. Repeating this process k times lets each sample be represented by k*b bits, which greatly reduces the processing time and space requirements.
  • This method of backtracking by extracting the HTTP header sequence information in the traffic can trace back to multiple flows sent by different applications in the same software; the behavior characteristics of those flows are then used to determine whether the first alarm traffic is malicious traffic, which improves the accuracy of identifying malicious traffic.
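  • The header-sequence N-gram step and the b-bit MinHash reduction described above can be sketched as follows. Using SHA-256 with k different seeds in place of k random transformations is an assumption of this sketch, as are the function names, the default values of k and b, and the sample header lists.

```python
import hashlib


def header_ngrams(header_names, n):
    """Set of n-grams over the ordered, case-normalized HTTP header names."""
    names = [h.lower() for h in header_names]
    return {tuple(names[i:i + n]) for i in range(len(names) - n + 1)}


def _seeded_hash(seed, gram):
    # A seeded SHA-256 stands in for one random transformation (assumption).
    data = f"{seed}|{'|'.join(gram)}".encode("utf-8")
    return int.from_bytes(hashlib.sha256(data).digest()[:8], "big")


def b_bit_minhash(grams, k=16, b=2):
    """Signature of k values, each the lowest b bits of one MinHash (k*b bits in total)."""
    if not grams:
        return tuple(0 for _ in range(k))
    mask = (1 << b) - 1
    return tuple(min(_seeded_hash(seed, g) for g in grams) & mask for seed in range(k))


# Made-up header orders: the first two differ only in letter case.
alarm_headers = ["Host", "User-Agent", "Accept", "Connection"]
same_order = ["host", "user-agent", "accept", "connection"]
sig_alarm = b_bit_minhash(header_ngrams(alarm_headers, 2))
sig_same = b_bit_minhash(header_ngrams(same_order, 2))
```

  • Streams whose signatures equal that of the first alarm traffic can then be selected as second alarm traffic. Because only b bits per hash are kept, unrelated header sequences may occasionally collide, so a larger k reduces the chance of a false match.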
  • Step S206 Perform feature extraction on multiple pieces of second alarm traffic to obtain first feature information.
  • the malicious traffic identification device performs feature extraction on multiple pieces of second alarm traffic to obtain first feature information.
  • the HTTP streams obtained are respectively input to the next stage for feature extraction.
  • FIG. 10 is a schematic flowchart of a method for obtaining first feature information provided by an embodiment of the present application. As shown in FIG. 10, based on the first alarm traffic (that is, the pre-classification result), the multiple pieces of second alarm traffic (multi-stream data) are obtained through the first policy (that is, UA aggregation), the second policy (that is, traffic template similarity clustering), and the third policy (HTTP header N-gram). Feature extraction is performed on the multiple pieces of second alarm traffic to obtain the feature representation vector corresponding to each policy (Vector-traceback), and these vectors are then combined into a multi-stream feature representation vector, which is the first feature information.
  • The first feature information is a feature representation vector; the performing feature extraction on the multiple pieces of second alarm traffic to obtain the first feature information includes: performing feature extraction on the multiple pieces of second alarm traffic to obtain behavior feature information corresponding to the multiple pieces of second alarm traffic, where the behavior feature information includes one or more of connection behavior features, request difference features, and request response features; and obtaining the feature representation vector according to the behavior feature information.
  • In this way, the malicious traffic identification device can fully consider the multi-flow network behavior characteristics of malicious CC communication traffic when identifying traffic, so that malicious traffic in the existing network can be detected and distinguished more accurately. Please refer to Table 2 below.
  • Table 2 is a multi-stream behavior feature information table provided by the embodiment of the present application.
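  • As a rough illustration of how backtracked flows might be turned into a behavior feature vector, the sketch below computes one or two hypothetical features for each of the three categories named above (connection behavior, request difference, request response). The specific features, the tuple layout, and the sample data are assumptions made for illustration and are not taken from Table 2.

```python
from statistics import mean, pstdev


def behavior_features(flows):
    """flows: list of (timestamp_seconds, url, status_code) tuples, in any order."""
    flows = sorted(flows)
    times = [t for t, _, _ in flows]
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    urls = [u for _, u, _ in flows]
    statuses = [s for _, _, s in flows]
    return [
        float(len(flows)),                       # connection behavior: request count
        mean(gaps) if gaps else 0.0,             # connection behavior: mean interval
        pstdev(gaps) if gaps else 0.0,           # connection behavior: interval jitter
        len(set(urls)) / len(urls) if urls else 0.0,                         # request difference
        sum(s == 200 for s in statuses) / len(statuses) if statuses else 0.0,  # request response
    ]


# Made-up flows resembling a stable heartbeat: same URL every 60 seconds.
heartbeat = [(0, "/ping", 200), (60, "/ping", 200), (120, "/ping", 200), (180, "/ping", 200)]
vector = behavior_features(heartbeat)
```

  • A regular interval with near-zero jitter and a low share of distinct URLs, as in this example, is the kind of pattern that a stable heartbeat mode would produce.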
  • Step S207 Based on the first feature information, determine whether the first alarm traffic is malicious traffic.
  • the malicious traffic identification device may determine whether the first alarm traffic is malicious traffic based on the first feature information.
  • the first feature information may be used to represent behavior feature information of the multi-flow traffic corresponding to the first alarm traffic. Based on the behavior feature information, by performing detection through a backtracking model, it may be determined whether the first alarm traffic is malicious traffic.
  • the obtained multi-stream behavior feature information such as: vector representation
  • the stacking method can be used to train multiple times on the behavior features extracted into the vector; that is, more detection results can be obtained based on the backtracking model.
  • the backtracking model may be a pre-trained classification model for identifying whether the traffic is malicious traffic.
  • determining whether the first alarm traffic is malicious traffic according to the first feature information includes: performing detection through a backtracking model based on the first feature information to obtain a first detection result; performing detection through a baseline model based on the multiple pieces of second alarm traffic to obtain a second detection result, where the baseline model is a detection model pre-trained on historical traffic; and determining, based on the first detection result and the second detection result, whether the first alarm traffic is malicious traffic.
  • traffic from the live-network production environment is accumulated for a period of time. On this basis, the multi-stream features of the live-network traffic are extracted and used as training data to construct a one-class model of the live network's historical data (that is, the baseline model). This model represents the behavior baseline of the live network, so that traffic that deviates from normal behavior can be identified from the baseline perspective.
  • the backtracking model may be a pre-trained multi-flow classifier for identifying whether the traffic is malicious traffic.
  • FIG. 11 is a graph, provided by an embodiment of the present application, of a function with E_n as the independent variable and a_n as the dependent variable, where E_n ∈ (0,1). As shown in Fig.
  • Step S208 If the first alarm traffic is malicious traffic, perform preset generalization processing on the first alarm traffic to obtain the generalized first alarm traffic.
  • the malicious traffic identification device performs preset generalization processing on the first alarm traffic to obtain the generalized first alarm traffic. It can be understood that, if it is determined that the first alarm traffic is malicious traffic, it is also possible to identify which type of malicious traffic the malicious traffic belongs to.
  • Step S209 Classify the generalized first alarm traffic to determine a malicious traffic type matched by the first alarm traffic.
  • the malicious traffic identification device classifies the generalized first alarm traffic, and determines the malicious traffic type that matches the first alarm traffic.
  • the malicious traffic identification device classifies the generalized first alarm traffic by using the trained category classification model.
  • the multi-family classification model that has been trained using the features extracted from the model (backtracking model) involved in the above step S207 is used to determine the family to which the malicious traffic belongs. Therefore, please refer to FIG. 12 .
  • FIG. 12 is a schematic flowchart of determining a type of malicious traffic according to an embodiment of the present application. As shown in FIG. 12, after the malicious traffic samples pass through generalization, traffic template extraction, representation learning, feature extraction, feature identification, and a multi-class classifier, the malicious traffic type matching the alarm traffic can be determined.
  • the malicious traffic identification device performs preset generalization processing on the first alarm traffic, obtains the generalized first alarm traffic, and performs feature extraction on the generalized first alarm traffic (equivalent to feature extraction) to obtain the corresponding feature representation vector; finally, the feature representation vector is input into the above-mentioned multi-family classification model to identify the type of malicious traffic.
  • the malicious traffic identification device may, starting from the receiving time of a single piece of traffic (that is, the first alarm traffic), backtrack to multiple pieces of traffic (that is, the multiple pieces of second alarm traffic). Feature extraction is then performed on the backtracked traffic to obtain feature information, so that the malicious traffic identification device can classify the single piece of traffic according to the feature information and thereby determine whether it is malicious traffic.
  • the similarity between each of the multiple pieces of second alarm traffic and the first alarm traffic is greater than a preset threshold.
  • classifying a single piece of traffic according to the feature information of multiple pieces of traffic similar to it enables the malicious traffic identification device to take full account of the multi-stream network behavior of malicious CC communication traffic when identifying traffic, so as to detect and distinguish malicious traffic in the live network more accurately.
  • this avoids the randomness of detecting a single HTTP stream in the prior art, which arises because the traffic situation in the live network is relatively complex.
  • the embodiment of the present application observes the communication behavior of traffic from a multi-stream perspective, backtracks multiple pieces of alarm traffic into different clusters based on one or more policies, computes the feature information of the clusters to which each alarm stream belongs, and judges from that feature information whether the alarm traffic is positive or negative (that is, whether it is malicious traffic), thereby preventing accidental errors.
  • Table 3 provides a performance data table of a single-stream model provided by the embodiment of the present application.
  • the accuracy of the detection algorithm can be estimated to be about 80% in actual network operation. (The above-mentioned X campus network confirmed more than 40 flow alarms)
  • Table 3 shows that for all HTTP communications, the ACC value in the experimental environment (test set) reaches more than 99.99%, and the ROC value is close to 1 (0.99999).
  • the ROC value (the area under the ROC curve) generally lies between 0.5 and 1.0. The larger the value, the more accurate the model's judgments; that is, the closer to 1, the better.
  • an ROC value of 0.5 indicates that the predictive power of the model is indistinguishable from random guessing.
  • the KS value represents the ability of the model to separate positive and negative samples. The larger the KS value, the better the prediction accuracy of the model. Generally, if KS > 0.2, the model can be considered to have good prediction accuracy.
  • IP cluster infection behavior of the live network is successfully discovered.
  • identification accuracy of the backtracking model of an X campus network reaches 100%.
  • the IP addresses detected in Table 4, 166.***.**.111 and 166.***.***.191, correspond to two clusters of malicious HTTP streams.
  • the traffic separation method based on multi-stream backtracking can, first, separate the HTTP traffic of the same malware or application communication over a continuous period of time; second, the backtracking-based multi-level detection framework (single-stream filtering first, then multi-stream backtracking) effectively reduces the storage and detection of large numbers of irrelevant data streams during detection (backtracking only needs the suspicious traffic detected by the first layer, which accounts for a small proportion), improves analysis efficiency, and is better suited to a corporate network environment.
  • the traffic separation method based on multi-stream backtracking can distinguish the communication traffic of rogue software and the communication traffic of malware from the characteristics of multi-stream behavior.
  • FIG. 13 is a schematic structural diagram of a malicious traffic identification device provided by an embodiment of the present application.
  • the malicious traffic identification device 10 may include a determination unit 101, a backtracking unit 102, an extraction unit 103, and a judgment unit 104, and may further include a generalization unit 105, a classification unit 106, and an alarm traffic unit 107.
  • the detailed description of each unit is as follows.
  • a determining unit 101 configured to determine the receiving time of the first alarm traffic
  • the backtracking unit 102 is configured to acquire, according to a preset policy, multiple pieces of second alarm traffic corresponding to the first alarm traffic within a target time period; the target time period is a time period determined based on the receiving time; The similarity between each second alarm traffic in the second alarm traffic and the first alarm traffic is greater than a preset threshold;
  • an extraction unit 103 configured to perform feature extraction on the plurality of second alarm flows to obtain first feature information
  • the determining unit 104 is configured to determine whether the first alarm traffic is malicious traffic based on the first feature information.
  • the preset policy includes one or more of a first policy, a second policy, and a third policy, where the first policy is a policy of acquiring the multiple pieces of second alarm traffic based on the Internet Protocol (IP) address and user agent (UA) information of the first alarm traffic; the second policy is a policy of acquiring the multiple pieces of second alarm traffic based on the IP address of the first alarm traffic and a preset generalization rule; and the third policy is a policy of acquiring the multiple pieces of second alarm traffic based on the IP address of the first alarm traffic and the HyperText Transfer Protocol (HTTP) Header information of the first alarm traffic.
  • the preset policy includes the first policy; the backtracking unit 102 is specifically configured to: acquire the IP address and UA information of the first alarm traffic; and collect, from the multiple HTTP streams sent by the IP address within the target time period, the HTTP streams whose UA information is the same as that of the first alarm traffic as the second alarm traffic.
  • the preset policy includes the second policy
  • the backtracking unit 102 is specifically configured to: acquire the IP address of the first alarm traffic; collect multiple first HTTP streams sent by the IP address within the target time period; perform generalization processing on the multiple first HTTP streams according to the preset generalization rule to obtain multiple second HTTP streams, where the preset generalization rule is to uniformly replace, using a preset standard, the target character string corresponding to each of the multiple first HTTP streams; and select, from the multiple second HTTP streams, the target second HTTP streams whose similarity to the first alarm traffic is greater than the preset threshold as the second alarm traffic.
  • the preset policy includes the third policy
  • the backtracking unit 102 is specifically configured to: acquire the IP address and the HTTP Header information of the first alarm traffic; collect multiple third HTTP streams sent by the IP address within the target time period; perform N-gram processing on the HTTP Header corresponding to each of the multiple third HTTP streams to obtain a first matrix, where the first matrix includes the HTTP Header sequence information corresponding to each third HTTP stream; perform dimensionality reduction on the first matrix, and extract from the reduced first matrix the target HTTP Header sequence information that matches the HTTP Header information of the first alarm traffic; and acquire, based on the target HTTP Header sequence information, the third HTTP streams corresponding to the target HTTP Header sequence information as the second alarm traffic.
  • the first feature information is a feature representation vector
  • the extraction unit 103 is specifically configured to: perform feature extraction on the multiple pieces of second alarm traffic to obtain behavior feature information corresponding to the multiple pieces of second alarm traffic, where the behavior feature information includes one or more of a connection behavior feature, a request difference feature, and a request response feature; and obtain the feature representation vector according to the behavior feature information.
  • the judgment unit 104 is specifically configured to: perform detection through a backtracking model based on the first feature information to obtain a first detection result; perform detection through a baseline model based on the multiple pieces of second alarm traffic to obtain a second detection result, where the baseline model is a detection model pre-trained on historical traffic; and determine, based on the first detection result and the second detection result, whether the first alarm traffic is malicious traffic.
  • the apparatus further includes: a generalization unit 105, configured to perform preset generalization processing on the first alarm traffic if the first alarm traffic is malicious traffic, to obtain generalized first alarm traffic;
  • the classification unit 106 is configured to classify the generalized first alarm traffic, and determine the malicious traffic type matched by the first alarm traffic.
  • the apparatus further includes an alarm traffic unit 107, and the alarm traffic unit 107 is configured to: receive multiple fourth HTTP streams before the receiving time of the first alarm traffic is determined; perform feature extraction on each of the multiple fourth HTTP streams according to a preset feature extraction rule to obtain a second feature set, where the second feature set includes the second feature information corresponding to each of the multiple fourth HTTP streams; and filter out, based on the second feature set, the first alarm traffic from the multiple fourth HTTP streams through a first classification model.
  • the second feature information includes manual feature information and/or representation learning feature information, where the manual feature information includes one or more of a domain name readability feature, a uniform resource locator (URL) structure feature, a behavior indication feature, and an HTTP Header feature corresponding to the fourth HTTP stream; and the representation learning feature information includes high-dimensional features corresponding to the fourth HTTP stream.
  • FIG. 14 is a schematic structural diagram of another malicious traffic identification device provided by an embodiment of the present application.
  • the device 20 includes at least one processor 201 , at least one memory 202 , and at least one communication interface 203 .
  • the device may also include general components such as an antenna, which will not be described in detail here.
  • the processor 201 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling the execution of the programs in the above solutions.
  • the communication interface 203 is used to communicate with other devices or communication networks, such as Ethernet, radio access network (RAN), core network, wireless local area network (Wireless Local Area Networks, WLAN) and the like.
  • the memory 202 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM) or another type of dynamic storage device that can store information and instructions. It may also be an electrically erasable programmable read-only memory (Electrically Erasable Programmable Read-Only Memory, EEPROM), a compact disc read-only memory (CD-ROM) or other optical disk storage, optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
  • the memory can exist independently and be connected to the processor through a bus.
  • the memory can also be integrated with the processor.
  • the memory 202 is used for storing the application program code for executing the above solution, and the execution is controlled by the processor 201 .
  • the processor 201 is configured to execute the application code stored in the memory 202 .
  • the code stored in the memory 202 can execute the malicious traffic identification method provided in FIG. 2 above, for example: determining the receiving time of the first alarm traffic; acquiring, according to a preset policy, multiple pieces of second alarm traffic corresponding to the first alarm traffic within the target time period, where the target time period is a time period determined based on the receiving time, and the similarity between each piece of second alarm traffic and the first alarm traffic is greater than a preset threshold; performing feature extraction on the multiple pieces of second alarm traffic to obtain first feature information; and determining, based on the first feature information, whether the first alarm traffic is malicious traffic.
  • the disclosed apparatus may be implemented in other manners.
  • the apparatus embodiments described above are only illustrative. For example, the division of the above units is only a division by logical function, and other divisions are possible in actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be ignored or not implemented.
  • the shown or discussed mutual coupling or direct coupling or communication connection may be through some interfaces, indirect coupling or communication connection of devices or units, and may be in electrical or other forms.
  • the above-mentioned units described as separate components may or may not be physically separated, and components displayed as units may or may not be physical units, that is, may be located in one place, or may be distributed to multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution in this embodiment.
  • each functional unit in each embodiment of the present application may be integrated into one processing unit, or each unit may exist physically alone, or two or more units may be integrated into one unit.
  • the above-mentioned integrated units may be implemented in the form of hardware, or may be implemented in the form of software functional units.
  • if the integrated units are implemented in the form of software functional units and sold or used as independent products, they may be stored in a computer-readable storage medium.
  • the technical solutions of the present application in essence, or the part that contributes to the prior art, or all or part of the technical solutions, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that enable a computer device (which may be a personal computer, a server, or a network device, etc., specifically a processor in the computer device) to execute all or part of the steps of the above methods in the embodiments of the present application.
  • the aforementioned storage medium may include a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a read-only memory (Read-Only Memory, ROM), a random access memory (Random Access Memory, RAM), or any other medium that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Hardware Design (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Artificial Intelligence (AREA)
  • Health & Medical Sciences (AREA)
  • Virology (AREA)
  • General Health & Medical Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Medical Informatics (AREA)
  • Mathematical Physics (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

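Embodiments of the present application provide a malicious traffic identification method and a related device. The malicious traffic identification method may include: determining a receiving time of first alarm traffic; acquiring, based on a preset policy, multiple pieces of second alarm traffic corresponding to the first alarm traffic within a target time period, where the target time period is a time period determined based on the receiving time, and the similarity between each piece of second alarm traffic and the first alarm traffic is greater than a preset threshold; performing feature extraction on the multiple pieces of second alarm traffic to obtain first feature information; and determining, based on the first feature information, whether the first alarm traffic is malicious traffic. By implementing the embodiments of the present application, the accuracy of identifying malicious traffic in a live network can be improved through multi-stream backtracking.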

Description

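Malicious traffic identification method and related device
This application claims priority to Chinese Patent Application No. 202011639885.1, filed with the China Patent Office on December 31, 2020 and entitled "Malicious traffic identification method and related device", and to Chinese Patent Application No. 202111573232.2, filed with the China Patent Office on December 21, 2021 and entitled "Malicious traffic identification method and related device", both of which are incorporated herein by reference in their entirety.
Technical Field
The present application relates to the field of communication technologies, and in particular, to a malicious traffic identification method and a related device.
Background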
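The HyperText Transfer Protocol (HTTP), as the most important protocol at present, is widely used on the Internet. To facilitate communication and conceal malicious behavior, various kinds of malware, such as Trojans, often communicate over HTTP; this mainly refers to communication between a controlled node and the Trojan's command and control server (Command and Control Server, CC/C2). Because Trojans are updated and iterated very quickly, the communication traffic of an updated Trojan differs noticeably from its earlier traffic. There are currently two approaches to detecting malicious traffic that uses HTTP: 1) detection at the traffic layer, that is, by extracting features from the traffic; 2) detection based on host behavior, that is, by extracting features from the behavior of the infected host. There are two main methods for processing the extracted features: 1) detection based on unsupervised clustering; 2) detection based on a supervised model.
However, whether detection uses traffic-layer features or host-behavior features, both the unsupervised clustering method and the supervised model method consider only single-stream features, that is, the features of one HTTP stream, and do not consider the multi-stream network behavior features of malicious CC communication, that is, the features of multiple HTTP streams. Current detection methods lack sufficiently rich basic information and cannot identify effectively and accurately whether traffic is malicious. Moreover, the behavior of much rogue software resembles the characteristic behavior of CC traffic at the single-stream level, so single-stream feature analysis alone cannot effectively distinguish rogue software from malware.
Therefore, how to detect malicious traffic in the live network more accurately is a problem that urgently needs to be solved.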
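Summary
Embodiments of the present application provide a malicious traffic identification method and a related device, so as to improve the accuracy of malicious traffic identification.
In a first aspect, an embodiment of the present application provides a malicious traffic identification method, which may include:
determining a receiving time of first alarm traffic; acquiring, based on a preset policy, multiple pieces of second alarm traffic corresponding to the first alarm traffic within a target time period, where the target time period is a time period determined based on the receiving time, and the similarity between each piece of second alarm traffic and the first alarm traffic is greater than a preset threshold; performing feature extraction on the multiple pieces of second alarm traffic to obtain first feature information; and determining, based on the first feature information, whether the first alarm traffic is malicious traffic.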
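By implementing the embodiment of the first aspect, the malicious traffic identification device can, starting from the receiving time of a single piece of traffic (that is, the first alarm traffic), backtrack according to the preset policy to obtain multiple pieces of traffic that match it (that is, the multiple pieces of second alarm traffic). Feature extraction is then performed on the backtracked traffic to obtain feature information, so that the device can classify the single piece of traffic according to that feature information and thereby determine whether it is malicious traffic. The similarity between each piece of second alarm traffic and the first alarm traffic is greater than the preset threshold. Classifying a single piece of traffic according to the feature information of multiple similar pieces of traffic allows the device to take full account of the multi-stream network behavior of malicious CC communication traffic, and thus to detect and distinguish malicious traffic in the live network more accurately. This avoids the randomness of single-HTTP-stream detection in the prior art, which arises because traffic in the live network is relatively complex. In addition, the embodiments of the present application observe the communication behavior of traffic from a multi-stream perspective: multiple pieces of alarm traffic are backtracked into different clusters based on one or more preset policies, the feature information of the clusters to which each alarm stream belongs is computed, and whether the alarm traffic is positive or negative (that is, whether it is malicious traffic) is judged from that feature information, thereby eliminating accidental errors. Observing the overall communication behavior of malicious traffic over a period of time makes it possible to judge malicious samples from a behavioral perspective, so that the final multi-stream judgment is more robust and is also behaviorally interpretable. Moreover, whether multi-stream traffic is detected using traffic-layer features or host-behavior features, the basic information is rich enough for the device to identify effectively and accurately whether traffic is malicious. The communication traffic of rogue software and that of malware can therefore be distinguished by their multi-stream features, improving the accuracy of malicious traffic identification.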
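In a possible implementation, the target time period is a time period of a preset duration that starts at the receiving time and extends later, or a time period of a preset duration that ends at the receiving time and extends earlier. In the embodiments of the present application, the receiving time of the first alarm traffic can be taken as an endpoint, and a time period of the preset duration can be taken before or after it, so as to obtain as many pieces of second alarm traffic similar to the first alarm traffic as possible.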
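The two ways of deriving the target time period can be illustrated with a minimal Python sketch. The function names `target_time_period` and `in_period` and the `anchor` parameter are hypothetical and chosen only for illustration; the source does not prescribe any particular interface or duration.

```python
from datetime import datetime, timedelta


def target_time_period(receiving_time, preset_duration, anchor="end"):
    """Return (start, end) of the target time period.

    anchor="start": the period begins at the receiving time and runs later.
    anchor="end":   the period ends at the receiving time and runs earlier.
    """
    if anchor == "start":
        return receiving_time, receiving_time + preset_duration
    if anchor == "end":
        return receiving_time - preset_duration, receiving_time
    raise ValueError("anchor must be 'start' or 'end'")


def in_period(timestamp, period):
    """True if a stream's timestamp falls inside the (start, end) period."""
    start, end = period
    return start <= timestamp <= end
```

For example, `target_time_period(t, timedelta(minutes=10), anchor="end")` yields the ten minutes leading up to the alarm, which is the window a backtracking step would then search.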
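In a possible implementation, the preset policy includes one or more of a first policy, a second policy, and a third policy. The first policy is a policy of acquiring the multiple pieces of second alarm traffic based on the Internet Protocol (IP) address and user agent (UA) information of the first alarm traffic; the second policy is a policy of acquiring the multiple pieces of second alarm traffic based on the IP address of the first alarm traffic and a preset generalization rule; the third policy is a policy of acquiring the multiple pieces of second alarm traffic based on the IP address of the first alarm traffic and the HyperText Transfer Protocol (HTTP) Header information of the first alarm traffic. By implementing the embodiments of the present application, each of these backtracking methods can accurately backtrack to multiple pieces of traffic from the same source as the first alarm traffic, so that whether the first alarm traffic is malicious can be identified from the behavior features of those pieces of traffic, improving the precision of malicious traffic identification.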
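In a possible implementation, the preset policy includes the first policy, and acquiring, according to the preset policy, the multiple pieces of second alarm traffic corresponding to the first alarm traffic within the target time period includes: acquiring the IP address and UA information of the first alarm traffic; and collecting, from the multiple HTTP streams sent by the IP address within the target time period, the HTTP streams whose UA information is the same as that of the first alarm traffic as the second alarm traffic. By implementing the embodiments of the present application, backtracking traffic with the same source IP address and the same UA information as the first alarm traffic makes it possible to backtrack to multiple pieces of traffic sent by the same software, the same service device, or the same application, so that whether the first alarm traffic is malicious can be determined from the behavior features of the backtracked traffic, improving the accuracy of malicious traffic identification.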
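The first policy amounts to a filter on source IP address, UA string, and time. The sketch below assumes a hypothetical `HttpStream` record and a `backtrack_by_ua` helper; both names and the field layout are illustrative assumptions, not structures defined in the source.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HttpStream:
    src_ip: str        # IP address that sent the stream
    user_agent: str    # value of the User-Agent header
    timestamp: float   # receiving time, seconds since the epoch
    url: str


def backtrack_by_ua(alarm, streams, period_start, period_end):
    """Return streams from the alarm's IP with the alarm's UA inside the period."""
    return [
        s for s in streams
        if s.src_ip == alarm.src_ip
        and s.user_agent == alarm.user_agent
        and period_start <= s.timestamp <= period_end
    ]
```

Whether the alarm stream itself should be kept in the result is not stated in the source; this sketch keeps every stream that matches.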
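In a possible implementation, the preset policy includes the second policy, and acquiring, according to the preset policy, the multiple pieces of second alarm traffic corresponding to the first alarm traffic within the target time period includes: acquiring the IP address of the first alarm traffic; collecting multiple first HTTP streams sent by the IP address within the target time period; performing generalization processing on the multiple first HTTP streams according to the preset generalization rule to obtain multiple second HTTP streams, where the preset generalization rule is to uniformly replace, using a preset standard, the target character string corresponding to each of the multiple first HTTP streams; and selecting, from the multiple second HTTP streams, the target second HTTP streams whose similarity to the first alarm traffic is greater than the preset threshold as the second alarm traffic. By implementing the embodiments of the present application, the similarity between pieces of traffic is computed after generalization to determine multiple pieces of traffic in the same cluster as the first alarm traffic (sent by the same software but different applications, with similarity exceeding the preset threshold), and whether the first alarm traffic is malicious is then determined from the behavior features of those pieces of traffic, improving the accuracy of malicious traffic identification.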
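A rough sketch of the second policy follows. The concrete generalization rules (long hexadecimal runs become `<HEX>`, digit runs become `<NUM>`), the use of `difflib.SequenceMatcher` as the similarity measure, and the 0.8 threshold are all assumptions made for illustration; the source only states that target strings are uniformly replaced using a preset standard and that similarity is compared with a preset threshold.

```python
import re
from difflib import SequenceMatcher

# Hypothetical generalization rules, applied in order.
_RULES = [
    (re.compile(r"[0-9a-fA-F]{16,}"), "<HEX>"),  # long hex tokens
    (re.compile(r"\d+"), "<NUM>"),               # remaining digit runs
]


def generalize(request_line):
    """Replace variable substrings with fixed placeholders."""
    for pattern, placeholder in _RULES:
        request_line = pattern.sub(placeholder, request_line)
    return request_line


def similar_streams(alarm_line, candidate_lines, threshold=0.8):
    """Return candidates whose generalized form is similar to the alarm's."""
    template = generalize(alarm_line)
    return [
        line for line in candidate_lines
        if SequenceMatcher(None, template, generalize(line)).ratio() > threshold
    ]
```

After generalization, two requests that differ only in identifiers or tokens collapse to the same template, which is what lets a simple string similarity group them.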
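In a possible implementation, the preset policy includes the third policy, and acquiring, according to the preset policy, the multiple pieces of second alarm traffic corresponding to the first alarm traffic within the target time period includes: acquiring the IP address and the HTTP Header information of the first alarm traffic; collecting multiple third HTTP streams sent by the IP address within the target time period; performing N-gram processing on the HTTP Header corresponding to each of the multiple third HTTP streams to obtain a first matrix, where the first matrix includes the HTTP Header sequence information corresponding to each third HTTP stream; performing dimensionality reduction on the first matrix, and extracting from the reduced first matrix the target HTTP Header sequence information that matches the HTTP Header information of the first alarm traffic; and acquiring, based on the target HTTP Header sequence information, the third HTTP streams corresponding to the target HTTP Header sequence information as the second alarm traffic. By implementing the embodiments of the present application, backtracking by extracting the HTTP Header sequence information in the traffic makes it possible to backtrack to multiple pieces of traffic sent by different applications in the same software, and whether the first alarm traffic is malicious is then determined from the behavior features of those pieces of traffic, improving the accuracy of malicious traffic identification.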
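The N-gram matrix of the third policy can be sketched as follows, under stated assumptions: each stream is represented by its ordered list of header names, matching uses cosine similarity with a hypothetical 0.9 threshold, and the dimensionality-reduction step is omitted because the source does not name a method. All function names are illustrative.

```python
from collections import Counter
from math import sqrt


def ngrams(seq, n=2):
    """Consecutive n-grams of a header-name sequence."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]


def build_matrix(header_seqs, n=2):
    """One row per stream, one column per distinct n-gram (raw counts)."""
    counts = [Counter(ngrams(seq, n)) for seq in header_seqs]
    vocab = sorted(set().union(*counts))
    return vocab, [[c[g] for g in vocab] for c in counts]


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu, nv = sqrt(sum(a * a for a in u)), sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def backtrack_by_headers(alarm_seq, stream_seqs, n=2, threshold=0.9):
    """Indices of streams whose header n-gram row matches the alarm's row."""
    _, matrix = build_matrix([alarm_seq] + list(stream_seqs), n)
    alarm_row = matrix[0]
    return [i for i, row in enumerate(matrix[1:])
            if cosine(alarm_row, row) >= threshold]
```

Header order is a useful signal here because HTTP client libraries tend to emit headers in a fixed order, so streams from the same program share most of their n-grams.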
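In a possible implementation, the first feature information is a feature representation vector, and performing feature extraction on the multiple pieces of second alarm traffic to obtain the first feature information includes: performing feature extraction on the multiple pieces of second alarm traffic to obtain behavior feature information corresponding to the multiple pieces of second alarm traffic, where the behavior feature information includes one or more of a connection behavior feature, a request difference feature, and a request response feature; and obtaining the feature representation vector according to the behavior feature information. By implementing the embodiments of the present application, extracting behavior features from multi-stream traffic makes it possible to clearly distinguish traffic corresponding to rogue software from traffic corresponding to malware, improving the accuracy of malicious traffic identification.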
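To make the three feature families concrete, the sketch below computes a small feature representation vector from a group of backtracked streams. The specific features (stream count and inter-arrival statistics for connection behavior, distinct-URL ratio for request difference, success ratio and mean response length for request response) and the dictionary keys are hypothetical; the actual feature table of the embodiment is not reproduced here.

```python
from statistics import mean, pstdev


def behavior_vector(streams):
    """streams: list of dicts with keys 'ts', 'url', 'status', 'resp_len'."""
    if not streams:
        raise ValueError("at least one stream is required")
    ts = sorted(s["ts"] for s in streams)
    gaps = [b - a for a, b in zip(ts, ts[1:])]
    # Connection behavior: how many streams and how regularly they arrive.
    mean_gap = mean(gaps) if gaps else 0.0
    gap_std = pstdev(gaps) if gaps else 0.0
    # Request difference: how varied the requested URLs are.
    distinct_url_ratio = len({s["url"] for s in streams}) / len(streams)
    # Request response: how the server answers.
    ok_ratio = sum(1 for s in streams if s["status"] == 200) / len(streams)
    mean_resp_len = mean(s["resp_len"] for s in streams)
    return [float(len(streams)), mean_gap, gap_std,
            distinct_url_ratio, ok_ratio, mean_resp_len]
```

A beacon that polls one URL at a fixed interval would show a near-zero gap deviation and a low distinct-URL ratio, which is the kind of multi-stream regularity a single stream cannot reveal.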
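In a possible implementation, determining whether the first alarm traffic is malicious traffic according to the first feature information includes: performing detection through a backtracking model based on the first feature information to obtain a first detection result; performing detection through a baseline model based on the multiple pieces of second alarm traffic to obtain a second detection result, where the baseline model is a detection model pre-trained on historical traffic; and determining, based on the first detection result and the second detection result, whether the first alarm traffic is malicious traffic. By implementing the embodiments of the present application, the first detection result from the backtracking model and the second detection result from the baseline model are considered together to finally determine whether the first alarm traffic is malicious, which greatly improves the accuracy of malicious traffic identification.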
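In a possible implementation, the method further includes: if the first alarm traffic is malicious traffic, performing preset generalization processing on the first alarm traffic to obtain generalized first alarm traffic; and classifying the generalized first alarm traffic to determine the malicious traffic type that matches the first alarm traffic. By implementing the embodiments of the present application, classifying the generalized first alarm traffic makes it possible to determine the malicious traffic type that matches the first alarm traffic, so as to better maintain network security.
In a possible implementation, before the receiving time of the first alarm traffic is determined, the method further includes: receiving multiple fourth HTTP streams; performing feature extraction on each of the multiple fourth HTTP streams according to a preset feature extraction rule to obtain a second feature set, where the second feature set includes the second feature information corresponding to each of the multiple fourth HTTP streams; and filtering out, based on the second feature set, the first alarm traffic from the multiple fourth HTTP streams through a first classification model. By implementing the embodiments of the present application, the first classification model filters out, according to single-stream traffic features (such as manual features and/or representation learning features), the first alarm traffic suspected of being malicious from the multiple fourth HTTP streams (that is, single-stream filtering), which effectively reduces the storage and detection of large numbers of irrelevant data streams during detection and improves the efficiency of malicious traffic analysis.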
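In a possible implementation, the second feature information includes manual feature information and/or representation learning feature information, where the manual feature information includes one or more of a domain name readability feature, a uniform resource locator (URL) structure feature, a behavior indication feature, and an HTTP Header feature corresponding to the fourth HTTP stream; and the representation learning feature information includes high-dimensional features corresponding to the fourth HTTP stream. By implementing the embodiments of the present application, when single-stream filtering extracts the first alarm traffic suspected of being malicious from live-network traffic, this can be done by identifying the manual features and/or representation learning features corresponding to the traffic, for example, by extracting one or more of the domain name readability features, URL structure features, behavior indication features, and HTTP Header features corresponding to the multiple fourth HTTP streams, or by extracting the high-dimensional features corresponding to the multiple fourth HTTP streams based on a representation learning model. This improves the accuracy with which single-stream filtering identifies the first alarm traffic suspected of being malicious, and improves the efficiency of malicious traffic analysis.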
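As an illustration of what manual single-stream features might look like, the sketch below derives a few numbers from a URL and a header dictionary. The chosen features (vowel and digit ratios as a crude readability proxy, path depth and parameter count as URL structure, header count and Referer presence as header features) and the function name `manual_features` are assumptions for illustration; the source names the feature families but not their exact definitions.

```python
from urllib.parse import urlsplit, parse_qsl

VOWELS = set("aeiou")


def manual_features(url, headers):
    """Compute a small dictionary of hand-crafted features for one HTTP stream."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    letters = [c for c in host if c.isalpha()]
    # Domain name readability: random-looking names tend to have few vowels.
    vowel_ratio = sum(c in VOWELS for c in letters) / len(letters) if letters else 0.0
    digit_ratio = sum(c.isdigit() for c in host) / len(host) if host else 0.0
    # URL structure.
    path_depth = len([seg for seg in parts.path.split("/") if seg])
    n_params = len(parse_qsl(parts.query, keep_blank_values=True))
    return {
        "domain_len": len(host),
        "vowel_ratio": vowel_ratio,
        "digit_ratio": digit_ratio,
        "url_len": len(url),
        "path_depth": path_depth,
        "n_params": n_params,
        # HTTP Header features.
        "n_headers": len(headers),
        "has_referer": int(any(k.lower() == "referer" for k in headers)),
    }
```

The resulting values can be concatenated with any learned representation features to form the single-stream vector that a first-stage classifier consumes.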
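In a second aspect, an embodiment of the present application provides a malicious traffic identification device, including:
a determination unit, configured to determine a receiving time of first alarm traffic;
a backtracking unit, configured to acquire, according to a preset policy, multiple pieces of second alarm traffic corresponding to the first alarm traffic within a target time period, where the target time period is a time period determined based on the receiving time, and the similarity between each piece of second alarm traffic and the first alarm traffic is greater than a preset threshold;
an extraction unit, configured to perform feature extraction on the multiple pieces of second alarm traffic to obtain first feature information; and
a judgment unit, configured to determine, based on the first feature information, whether the first alarm traffic is malicious traffic.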
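In a possible implementation, the preset policy includes one or more of a first policy, a second policy, and a third policy. The first policy is a policy of acquiring the multiple pieces of second alarm traffic based on the Internet Protocol (IP) address and user agent (UA) information of the first alarm traffic; the second policy is a policy of acquiring the multiple pieces of second alarm traffic based on the IP address of the first alarm traffic and a preset generalization rule; the third policy is a policy of acquiring the multiple pieces of second alarm traffic based on the IP address of the first alarm traffic and the HyperText Transfer Protocol (HTTP) Header information of the first alarm traffic.
In a possible implementation, the preset policy includes the first policy, and the backtracking unit is specifically configured to: acquire the IP address and UA information of the first alarm traffic; and collect, from the multiple HTTP streams sent by the IP address within the target time period, the HTTP streams whose UA information is the same as that of the first alarm traffic as the second alarm traffic.
In a possible implementation, the preset policy includes the second policy, and the backtracking unit is specifically configured to: acquire the IP address of the first alarm traffic; collect multiple first HTTP streams sent by the IP address within the target time period; perform generalization processing on the multiple first HTTP streams according to the preset generalization rule to obtain multiple second HTTP streams, where the preset generalization rule is to uniformly replace, using a preset standard, the target character string corresponding to each of the multiple first HTTP streams; and select, from the multiple second HTTP streams, the target second HTTP streams whose similarity to the first alarm traffic is greater than the preset threshold as the second alarm traffic.
In a possible implementation, the preset policy includes the third policy, and the backtracking unit is specifically configured to: acquire the IP address and the HTTP Header information of the first alarm traffic; collect multiple third HTTP streams sent by the IP address within the target time period; perform N-gram processing on the HTTP Header corresponding to each of the multiple third HTTP streams to obtain a first matrix, where the first matrix includes the HTTP Header sequence information corresponding to each third HTTP stream; perform dimensionality reduction on the first matrix, and extract from the reduced first matrix the target HTTP Header sequence information that matches the HTTP Header information of the first alarm traffic; and acquire, based on the target HTTP Header sequence information, the third HTTP streams corresponding to the target HTTP Header sequence information as the second alarm traffic.
In a possible implementation, the first feature information is a feature representation vector, and the extraction unit is specifically configured to: perform feature extraction on the multiple pieces of second alarm traffic to obtain behavior feature information corresponding to the multiple pieces of second alarm traffic, where the behavior feature information includes one or more of a connection behavior feature, a request difference feature, and a request response feature; and obtain the feature representation vector according to the behavior feature information.
In a possible implementation, the judgment unit is specifically configured to: perform detection through a backtracking model based on the first feature information to obtain a first detection result; perform detection through a baseline model based on the multiple pieces of second alarm traffic to obtain a second detection result, where the baseline model is a detection model pre-trained on historical traffic; and determine, based on the first detection result and the second detection result, whether the first alarm traffic is malicious traffic.
In a possible implementation, the device further includes: a generalization unit, configured to perform, if the first alarm traffic is malicious traffic, preset generalization processing on the first alarm traffic to obtain generalized first alarm traffic; and a classification unit, configured to classify the generalized first alarm traffic and determine the malicious traffic type that matches the first alarm traffic.
In a possible implementation, the device further includes an alarm traffic unit, configured to: receive multiple fourth HTTP streams before the receiving time of the first alarm traffic is determined; perform feature extraction on each of the multiple fourth HTTP streams according to a preset feature extraction rule to obtain a second feature set, where the second feature set includes the second feature information corresponding to each of the multiple fourth HTTP streams; and filter out, based on the second feature set, the first alarm traffic from the multiple fourth HTTP streams through a first classification model.
In a possible implementation, the second feature information includes manual feature information and/or representation learning feature information, where the manual feature information includes one or more of a domain name readability feature, a uniform resource locator (URL) structure feature, a behavior indication feature, and an HTTP Header feature corresponding to the fourth HTTP stream; and the representation learning feature information includes high-dimensional features corresponding to the fourth HTTP stream.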
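In a third aspect, an embodiment of the present application provides a service device. The service device includes a processor configured to support the service device in implementing the corresponding functions in the malicious traffic identification method provided in the first aspect. The service device may further include a memory, which is coupled to the processor and stores the program instructions and data necessary for the service device. The service device may further include a communication interface for communication between the service device and other devices or communication networks.
In a fourth aspect, an embodiment of the present application provides a computer-readable storage medium for storing computer software instructions used by the malicious traffic identification device provided in the second aspect, including a program designed to execute the above aspects.
In a fifth aspect, an embodiment of the present application provides a computer program. The computer program includes instructions that, when the computer program is executed by a computer, enable the computer to execute the process executed by the malicious traffic identification device in the second aspect.
In a sixth aspect, the present application provides a chip system. The chip system includes a processor configured to support a terminal device in implementing the functions involved in the first aspect, for example, generating or processing the information involved in the malicious traffic identification method. In a possible design, the chip system further includes a memory configured to store the program instructions and data necessary for the data sending device. The chip system may consist of a chip, or may include a chip and other discrete components.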
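Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application or in the background more clearly, the accompanying drawings used in the embodiments of the present application or in the background are described below.
FIG. 1 is a schematic diagram of a malicious traffic identification system architecture provided by an embodiment of the present application.
FIG. 2 is a schematic flowchart of a malicious traffic identification method provided by an embodiment of the present application.
FIG. 3 is a schematic diagram of a malicious traffic identification framework provided by an embodiment of the present application.
FIG. 4 is a schematic diagram of feature extraction provided by an embodiment of the present application.
FIG. 5 is a schematic flowchart of backtracking traffic according to the first policy provided by an embodiment of the present application.
FIG. 6 is a schematic diagram of multiple pieces of traffic backtracked according to the first policy provided by an embodiment of the present application.
FIG. 7 is a schematic flowchart of backtracking traffic according to the second policy provided by an embodiment of the present application.
FIG. 8 is a schematic diagram of traffic before and after generalization provided by an embodiment of the present application.
FIG. 9 is a schematic flowchart of backtracking traffic according to the third policy provided by an embodiment of the present application.
FIG. 10 is a schematic flowchart of a method for obtaining first feature information provided by an embodiment of the present application.
FIG. 11 is a graph, provided by an embodiment of the present application, of a function with E_n as the independent variable and a_n as the dependent variable.
FIG. 12 is a schematic flowchart of determining the type of malicious traffic provided by an embodiment of the present application.
FIG. 13 is a schematic structural diagram of a malicious traffic identification device provided by an embodiment of the present application.
FIG. 14 is a schematic structural diagram of another malicious traffic identification device provided by an embodiment of the present application.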
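Detailed Description
The embodiments of the present application are described below with reference to the accompanying drawings in the embodiments of the present application.
The terms "first", "second", "third", "fourth", and the like in the specification, claims, and accompanying drawings of the present application are used to distinguish different objects, not to describe a specific order. In addition, the terms "include" and "have" and any variations thereof are intended to cover non-exclusive inclusion. For example, a process, method, system, product, or device that includes a series of steps or units is not limited to the listed steps or units, but optionally further includes steps or units that are not listed, or optionally further includes other steps or units inherent to the process, method, product, or device.
The technical solutions in the embodiments of the present application are described below with reference to the accompanying drawings in the embodiments of the present application. In the present application, "at least one" means one or more, and "multiple" means two or more. "And/or" describes an association between associated objects and indicates that three relationships may exist; for example, A and/or B may indicate: A alone, both A and B, or B alone, where A and B may be singular or plural. The character "/" generally indicates an "or" relationship between the associated objects. "At least one of the following" or a similar expression refers to any combination of these items, including any combination of a single item or plural items. For example, at least one of a, b, or c may indicate: a; b; c; a and b; a and c; b and c; or a, b, and c, where a, b, and c may each be single or multiple.
Reference herein to an "embodiment" means that a particular feature, structure, or characteristic described in connection with the embodiment may be included in at least one embodiment of the present application. The appearances of this phrase in various places in the specification do not necessarily all refer to the same embodiment, nor to separate or alternative embodiments that are mutually exclusive with other embodiments. Those skilled in the art understand, explicitly and implicitly, that the embodiments described herein may be combined with other embodiments.
The terms "component", "module", "system", and the like used in this specification denote a computer-related entity, hardware, firmware, a combination of hardware and software, software, or software in execution. For example, a component may be, but is not limited to, a process running on a processor, a processor, an object, an executable file, a thread of execution, a program, and/or a computer. By way of illustration, both an application running on a computing device and the computing device itself may be components. One or more components may reside within a process and/or thread of execution, and a component may be located on one computer and/or distributed between two or more computers. In addition, these components may execute from various computer-readable media having various data structures stored thereon. Components may communicate by way of local and/or remote processes, for example according to a signal having one or more data packets (for example, data from two components interacting with another component in a local system, in a distributed system, and/or across a network such as the Internet, which interacts with other systems by way of the signal).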
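First, some terms used in the present application are explained to facilitate understanding by those skilled in the art.
(1) HyperText Transfer Protocol (HTTP): an application-layer protocol for distributed, collaborative, hypermedia information systems. It is the foundation of data communication for the World Wide Web and the most widely used network transfer protocol on the Internet. HTTP was originally designed to provide a way to publish and receive HTML pages.
(2) Trojan command and control server (Command and Control Server, CC/C2): a remote command and control server. A target machine can receive commands from the server, so that the server controls the target machine. This method is commonly used by viruses and Trojans to control infected machines.
(3) Internet Relay Chat (IRC): an application-layer protocol mainly used for group chat. IRC users connect to an IRC server with specific client chat software and communicate, relayed by the server, with other users connected to that server; hence the name "Internet Relay Chat".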
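(4) N-gram: a sequence of n words that appear consecutively in a text. The n-gram model is a probabilistic language model based on an (n-1)-order Markov chain, which infers the structure of a sentence from the probability of occurrence of n words.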
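(5) Content-Type: generally refers to the Content-Type present in a web page. It defines the type of a network file and the encoding of a web page, and determines in what form and with what encoding the browser reads the file. This is why clicking on some web pages results in a file or picture being downloaded. The ContentType attribute specifies the HTTP content type of the response; if ContentType is not specified, the default is TEXT/HTML.
(6) Representation learning, also called learning representations. In deep learning, a representation refers to the form and manner in which the parameters of a model represent the input observation sample X. Representation learning means learning an effective representation of the observation sample X. The low-dimensional vector representation obtained by representation learning is a distributed representation. It is so named because each dimension of the vector, viewed in isolation, has no clearly corresponding meaning, whereas the dimensions taken together as a vector can express the semantic information of an object.
(7) Decision tree: a decision analysis method that, given the known probabilities of various situations, builds a decision tree to obtain the probability that the expected net present value is greater than or equal to zero, evaluates project risk, and judges feasibility; it is a graphical method that applies probability analysis intuitively. It is called a decision tree because the decision branches, when drawn, resemble the branches of a tree. In machine learning, a decision tree is a predictive model that represents a mapping between object attributes and object values. The classification tree (decision tree) is a very commonly used classification method. It is a kind of supervised learning: given a set of samples, each with a set of attributes and a predetermined category, a classifier is obtained through learning that can correctly classify newly appearing objects. Such machine learning is called supervised learning.
(8) User agent (User Agent, UA): refers to a browser and also includes search engines. Its information includes the hardware platform, system software, application software, and the user's personal preferences.
(9) Uniform resource locator (Uniform Resource Locator, URL): also called a web address, the standard address of a resource on the Internet. Every file on the Internet has a unique URL, which contains information indicating the location of the file and how the browser should handle it. The URL was originally invented by Tim Berners-Lee as the address for the World Wide Web.
(10) Content-Type: generally refers to the Content-Type present in a web page. It defines the type of a network file and the encoding of a web page, and determines in what form and with what encoding the browser reads the file. This is why clicking on some ASP web pages results in a file or picture being downloaded. The ContentType attribute specifies the HTTP content type of the response; if ContentType is not specified, the default is TEXT/HTML.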
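(11) TF-IDF (Term Frequency-Inverse Document Frequency): a weighting technique commonly used in information retrieval and data mining to evaluate how important a word is to one document in a document set or corpus. The importance of a word increases in proportion to the number of times it appears in the document, but decreases in inverse proportion to the frequency with which it appears in the corpus. Various forms of TF-IDF weighting are often applied by search engines as a measure or rating of the relevance between a document and a user query. In addition to TF-IDF, Internet search engines also use rating methods based on link analysis to determine the order in which documents appear in search results.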
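A small worked example of the weighting may help. The sketch below uses one common unsmoothed variant, tf = count / document length and idf = ln(N / df); other variants (for example with smoothing) exist, and the source does not say which one the embodiment uses. The function name `tf_idf` is illustrative.

```python
from collections import Counter
from math import log


def tf_idf(docs):
    """docs: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        weights.append({
            term: (count / len(doc)) * log(n_docs / df[term])
            for term, count in tf.items()
        })
    return weights
```

A term that appears in every document, such as a header present in all requests, receives weight 0 under this variant, while a term unique to one document receives the largest weight.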
(12)词袋模型(Bag-of-words,BOW),该Bag-of-words模型是信息检索领域常用的文档表示方法。在信息检索中,BOW模型假定对于一个文档,忽略它的单词顺序和语法、句法等要素,将其仅仅看作是若干个词汇的集合,文档中每个单词的出现都是独立的,不依赖于其它单词是否出现。(是不关顺序的)也就是说,文档中任意一个位置出现的任何单词,都不受该文档语意影响而独立选择的。
(13)ROC(Receiver Operating Characteristic Curve):接受者操作特征曲线。ROC曲线及AUC系数主要用来检验模型对客户进行正确排序的能力。ROC曲线描述了在一定累计好客户比例下的累计坏客户的比例,模型的分别能力越强,ROC曲线越往左上角靠近。AUC系数表示ROC曲线下方的面积。AUC系数越高,模型的风险区分能力越强。
(14)KS(Kolmogorov-Smirnov)检验:K-S检验主要是验证模型对违约对象的区分能力,通常是在模型预测全体样本的信用评分后,将全体样本按违约与非违约分为两部分,然后用KS统计量来检验这两组样本信用评分的分布是否有显著差异。
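Based on the technical problem set out above, and to aid understanding of the embodiments of this application, the following first describes one malicious traffic identification system architecture on which the embodiments are based. Refer to FIG. 1, which is a schematic diagram of a malicious traffic identification system architecture according to an embodiment of this application. The client in this application may include the first service device 001, the second service device 002, and the third service device 003 in FIG. 1. The three devices may be communicatively connected in a wired or wireless manner, and both the second service device 002 and the third service device 003 may send HyperText Transfer Protocol (HTTP) requests to the first service device 001. Specifically:
The first service device 001 may include, but is not limited to, a background server, a component server, a data processing server, or another device that provides various local service programs to customers. In addition, the first service device 001 may receive or respond to HTTP requests sent by one or more service devices, so as to provide corresponding application services to those devices. However, the first service device 001 needs to identify whether an HTTP request sent by another service device is malicious traffic; if it is, the request must not be answered, in order to protect network security. The first service device 001 is therefore configured with a local malicious traffic identification service, which may include, but is not limited to: determining a receiving time of first alarm traffic; obtaining, according to a preset policy, multiple pieces of second alarm traffic that correspond to the first alarm traffic within a target time period, where the target time period is determined based on the receiving time, and the similarity between each piece of second alarm traffic and the first alarm traffic is greater than a preset threshold; performing feature extraction on the multiple pieces of second alarm traffic to obtain first feature information; and determining, based on the first feature information, whether the first alarm traffic is malicious traffic.
The second service device 002 may likewise include, but is not limited to, a background server, a component server, a data processing server, or another device that provides various local service programs to customers. It may install and run related applications and may send HTTP requests to the first service device, so as to obtain the corresponding service once the first service device responds.
The third service device 003 may be a Trojan command and control server (Command and Control Server, CC/C2). Other service devices may receive commands from the third service device 003 (the CC server), so that the third service device 003 controls those devices; this is commonly used by viruses and Trojans to control infected service devices. For example, in the embodiments of this application, the third service device 003 may send an HTTP request to the first service device, so that the first service device receives an HTTP flow. That HTTP flow is malicious traffic and can be identified by the first service device.
It can be understood that the network architecture in FIG. 1 is only an example implementation of the embodiments of this application; the malicious traffic identification system architecture in the embodiments includes, but is not limited to, the architecture described above.
Based on the malicious traffic identification system architecture provided in FIG. 1, and with reference to the malicious traffic identification method provided in this application, the technical problem raised in this application is analyzed and solved in detail below.
Refer to FIG. 2, which is a schematic flowchart of a malicious traffic identification method according to an embodiment of this application. The method may be applied to the malicious traffic identification system architecture described in FIG. 1, in which the first service device 001 may support and perform steps S201 to S209 of the method flow shown in FIG. 2. The following description, with reference to FIG. 2, is given from the side of the first service device. The method may include the following steps S201 to S209.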
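Step S201: Receive multiple fourth HTTP flows.
Specifically, the malicious traffic identification apparatus receives multiple fourth HyperText Transfer Protocol (HTTP) flows. A fourth HTTP flow may be an HTTP flow that the first service device receives from one or more second service devices and/or third service devices.
Step S202: Perform feature extraction on each of the multiple fourth HTTP flows to obtain a second feature set.
Specifically, the malicious traffic identification apparatus performs feature extraction on each of the multiple fourth HTTP flows to obtain a second feature set, where the second feature set includes the second feature information corresponding to each of the fourth HTTP flows. It can be understood that, for each fourth HTTP flow, the first service device 001 may perform feature extraction according to a preset feature extraction rule, obtaining the corresponding non-numeric feature vectors and the other, numeric feature vectors. Concatenating these vectors according to a uniform rule yields the final single-flow feature vector, that is, the second feature information of the fourth HTTP flow. Refer to FIG. 3, which is a schematic diagram of a malicious traffic identification framework according to an embodiment of this application. As shown in FIG. 3, the embodiment first passes live-network traffic through a single-flow classifier (a model trained on multiple black and white traffic samples, which performs feature processing and single-flow classification, that is, single-flow filtering) to obtain first alarm traffic that is suspected to be malicious. For example, on the basis of single-flow traffic data, a feature processor extracts the single-flow features of the traffic to form a feature vector, and this feature vector is input to the classifier for a preliminary judgment of whether the traffic is CC communication traffic of malware (that is, first alarm traffic). Next, multi-flow feature extraction is performed based on the first alarm traffic to obtain a multi-flow feature representation (multi-flow backtracking). Finally, based on the multi-flow feature representation, a backtracking model and a baseline model determine whether the first alarm traffic is malicious traffic. In addition, the features extracted by the foregoing models may further be passed to a malware family classifier to determine the type to which the first alarm traffic belongs. For the specific implementation, refer to the steps below; it is not described here.
Optionally, the second feature information includes manual feature information and/or representation learning feature information. The manual feature information includes one or more of the following for the fourth HTTP flow: a domain name readability feature, a Uniform Resource Locator (URL) structure feature, a behavior indication feature, and an HTTP Header feature. The representation learning feature information includes a high-dimensional feature of the fourth HTTP flow.
For example, refer to FIG. 4, which is a schematic diagram of feature extraction according to an embodiment of this application. As shown in FIG. 4, for the received fourth HTTP flows, features may be extracted both by feature engineering (manual features) and by representation learning. (1) The manual feature information includes one or more of the following for the fourth HTTP flow: a domain name readability feature, a URL structure feature, a behavior indication feature, and an HTTP Header feature (HTTP response feature). The URL statistical features include one or more of: length, vowel ratio, consonant ratio, special character ratio, uppercase letter ratio, lowercase letter ratio, digit ratio, number of domain levels, domain name character distribution, top-level domain, path length, number of path levels, file suffix, number of parameters, average parameter value length, whether base64 is present, and whether a common pattern is followed. The HTTP Header features include one or more of: Content Type, user agent (UA), HTTP return status code, and N-grams of the Header sequence. (2) The representation learning feature information is obtained by using representation learning as an aid: the high-dimensional features of the fourth HTTP flow are taken from the layer before the output layer of a neural network, so that features are extracted from the existing data set as fully as possible and associated in a higher dimension. In FIG. 4, white traffic denotes normal traffic and black traffic denotes malicious traffic. Each of the fourth HTTP flows undergoes feature extraction (manual and representation learning), the extracted features are preprocessed (for example, numeric feature processing and non-numeric feature conversion), and the features are then combined and filtered to obtain the second feature set.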
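A few of the URL statistical features listed above can be illustrated with a minimal sketch. The function name `url_features`, the dictionary keys, and the choice to compute the vowel ratio over the letters of the host name are assumptions made for illustration; the source does not specify how each feature is computed.

```python
from urllib.parse import urlsplit, parse_qsl

VOWELS = set("aeiou")

def url_features(url: str) -> dict:
    """Compute a handful of URL statistics (hypothetical feature names)."""
    parts = urlsplit(url)
    host = parts.hostname or ""
    letters = [c for c in host if c.isalpha()]
    n = len(url) or 1
    # Non-empty path segments, e.g. "/images/logo.gif" -> ["images", "logo.gif"]
    segments = [s for s in parts.path.split("/") if s]
    last = segments[-1] if segments else ""
    params = parse_qsl(parts.query, keep_blank_values=True)
    return {
        "length": len(url),
        "vowel_ratio": sum(c in VOWELS for c in letters) / (len(letters) or 1),
        "digit_ratio": sum(c.isdigit() for c in url) / n,
        "upper_ratio": sum(c.isupper() for c in url) / n,
        "domain_levels": host.count(".") + 1 if host else 0,
        "tld": host.rsplit(".", 1)[-1] if "." in host else "",
        "path_length": len(parts.path),
        "path_depth": len(segments),
        "file_suffix": last.rsplit(".", 1)[1] if "." in last else "",
        "param_count": len(params),
        "avg_param_value_len": sum(len(v) for _, v in params) / (len(params) or 1),
    }

features = url_features("http://www.example.com/images/logo.gif?a=12&b=3456")
```

The numeric entries can be used directly in a feature vector, whereas string entries such as `tld` and `file_suffix` still need a text-to-number conversion.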
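Optionally, performing feature extraction on each of the multiple fourth HTTP flows to obtain the second feature set includes: performing feature extraction on each of the fourth HTTP flows to obtain an initial feature set; and performing text processing on the non-numeric features in the initial feature set to obtain the second feature set. Because a classification model generally takes numeric input, the text or non-numeric features need a text-to-number conversion that turns them into numeric vectors the model can process. The feature extraction on each fourth HTTP flow may be performed by manual feature extraction and/or representation learning feature extraction.
Optionally, the text features mentioned above include, but are not limited to, the top-level domain, file suffix, Content Type, and UA. The input of all four fields is a string, and a machine learning classifier cannot process strings, so each string must be converted into a numeric vector that the classification model can process. The text processing uses TF-IDF. Term frequency (TF) is the number of times a term appears in a document divided by the total number of terms in that document, and reflects how often the term occurs in the document. Inverse document frequency (IDF) is the logarithm of the total number of documents divided by the number of documents that contain the term (usually plus one), and is inversely related to how common the term is; it effectively discounts terms that occur frequently but carry little meaning. In the embodiments of this application, TF-IDF = TF * IDF may be used to reflect how distinctive a given string is within one field of the traffic. For example, these features are first converted with TF-IDF, and their vector representation is computed from the term frequency and the document frequency. Note that during classification, the data processed with TF-IDF must be compared with the base TF-IDF library of the detection model to find anomalies. The base TF-IDF library may be obtained during training from statistics over white traffic (normal traffic that can be positively identified by some technical means, used for model training or correctness verification), and may be generated from the specific white traffic of a specific detection scenario.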
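As a rough illustration of the TF-IDF conversion of string fields, the following sketch weights the tokens of a few User-Agent-like strings. The function name `tfidf`, the whitespace tokenization, and the smoothed form of the inverse document frequency, log((1 + N) / (1 + df)) + 1, are assumptions chosen so that no weight becomes zero or negative; the source does not state which TF-IDF variant is used.

```python
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists. Returns one {token: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: number of documents that contain each token.
    df = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc) or 1
        weights.append({
            tok: (cnt / total) * (math.log((1 + n_docs) / (1 + df[tok])) + 1)
            for tok, cnt in counts.items()
        })
    return weights

# Hypothetical UA strings, split on whitespace for illustration only.
user_agents = ["mozilla windows", "mozilla curl"]
weights = tfidf([ua.split() for ua in user_agents])
```

Here `mozilla` appears in every document and therefore receives a lower weight than `windows` or `curl`, which each appear in only one.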
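In a possible implementation, performing text processing on the non-numeric features in the initial feature set to obtain the second feature set includes: performing text processing on the non-numeric features in the initial feature set to obtain a set of numeric feature vectors; and performing dimensionality reduction on the set of numeric feature vectors to obtain the second feature set. It can be understood that the vectors obtained after TF-IDF processing of the initial feature set have a large dimension. Such high-dimensional vectors consume considerable resources in the classification model and in later processing, and are inefficient to handle, so dimensionality reduction may be applied to map each high-dimensional vector into a low-dimensional vector space. The dimensionality reduction method may include, but is not limited to, singular value decomposition (SVD) and principal component analysis (PCA). For example, in the embodiments of this application, because the vectors computed by TF-IDF are too large and prone to the dimension explosion problem, a reduction operation lowers the TF-IDF vectors from the high-dimensional space to a ten-dimensional space.
Optionally, the features extracted from each fourth HTTP flow by the different methods are combined and filtered to obtain the second feature information of that flow. For example, the feature engineering features and the representation learning features are combined, and a feature selection algorithm such as minimum redundancy maximum relevance (mRMR) selects the best-performing feature set for each fourth HTTP flow. As shown in FIG. 4, the embodiment extracts single-flow traffic features from live-network traffic, performs text processing on the non-numeric features, and combines and filters the traffic features to obtain the second feature set.
Step S203: Based on the second feature set, filter out the first alarm traffic from the multiple fourth HTTP flows by using a first classification model.
Specifically, the malicious traffic identification apparatus may, based on the second feature set, filter out the first alarm traffic from the multiple fourth HTTP flows by using the first classification model. The first alarm traffic is the traffic, among the fourth HTTP flows, that the first classification model flags as suspected malicious traffic. For example, the traffic feature vector of each fourth HTTP flow obtained above (that is, the second feature information) is input to the first classification model. The first classification model may use a stacking scheme, in which different classifiers are trained on different features, and a decision tree mechanism combines the verdicts of the individual classifiers into the first-layer detection result for the HTTP session. The first classification model may be trained on a labeled training data set of black and white traffic. Preprocessing the data in this way provides an initial screening of normal traffic. On this basis, composite features are extracted and selected from the single-flow traffic by manual experience and representation learning to form a single-flow feature vector, which is input to the classifier for a first judgment of whether the traffic is suspected CC communication traffic of malware. Only if it is does the next stage of judgment run, which greatly improves the efficiency of deciding whether the traffic is malicious.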
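Step S204: Determine the receiving time of the first alarm traffic.
Specifically, the malicious traffic identification apparatus determines the receiving time of the first alarm traffic. After the first alarm traffic has been filtered out, its receiving time can be determined so that multiple related flows can be traced back.
Step S205: Obtain, according to a preset policy, multiple pieces of second alarm traffic that correspond to the first alarm traffic within a target time period.
Specifically, the malicious traffic identification apparatus obtains, according to the preset policy, multiple pieces of second alarm traffic that correspond to the first alarm traffic within the target time period. The target time period is determined based on the receiving time, and the similarity between each piece of second alarm traffic and the first alarm traffic is greater than a preset threshold. Because traffic on a live network is relatively complex, detection on a single HTTP flow is to some extent a matter of chance. By observing the communication behavior of a malicious sample from a multi-flow perspective, multiple requests can be traced back into different clusters by different methods, and the combined statistical features of the clusters to which each alarm flow belongs can be used to judge whether the flow is positive or negative, which reduces chance errors. In other words, observing the overall communication behavior of a malicious sample over a period of time allows it to be judged more accurately from its behavior, making the final multi-flow result more robust and behaviorally interpretable.
Optionally, the target time period is a time period determined based on the receiving time. For example, the target time period may be a period of preset length that starts at the receiving time and extends forward in time, or a period of preset length that ends at the receiving time. As another example, the target time period may be a period that contains the receiving time. Collecting the second alarm traffic around the time at which the first alarm traffic was received helps obtain as many pieces of second alarm traffic similar to the first alarm traffic as possible.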
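The three ways of deriving the target time period from the receiving time can be sketched as follows. The function name `target_window` and the `mode` values are hypothetical, and the symmetric window used for the "contains the receiving time" case is only one possible choice.

```python
from datetime import datetime, timedelta

def target_window(received_at: datetime, minutes: int, mode: str = "around"):
    """Return (start, end) of the target time period for a given receiving time."""
    delta = timedelta(minutes=minutes)
    if mode == "after":    # starts at the receiving time
        return received_at, received_at + delta
    if mode == "before":   # ends at the receiving time
        return received_at - delta, received_at
    if mode == "around":   # contains the receiving time (symmetric, an assumption)
        return received_at - delta, received_at + delta
    raise ValueError(f"unknown mode: {mode}")

start, end = target_window(datetime(2021, 12, 27, 12, 0), minutes=10, mode="before")
```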
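Optionally, the preset policy includes one or more of a first policy, a second policy, and a third policy. The first policy obtains the multiple pieces of second alarm traffic based on the Internet Protocol (IP) address and user agent (UA) information of the first alarm traffic. The second policy obtains them based on the IP address of the first alarm traffic and a preset generalization rule. The third policy obtains them based on the IP address of the first alarm traffic and the HTTP Header information of the first alarm traffic. On top of the original detection method, once the first classification model reports a result, a traffic backtracking method collects the CC communication traffic for a period before and/or after the first alarm traffic, and multi-flow feature extraction is then performed. It can be understood that the first policy traces back by the IP address and UA information of the first alarm traffic and can recover multiple flows sent by the same software, the same service device, or the same application. The second policy traces back multiple flows by the IP address of the first alarm traffic and then generalizes them according to the preset generalization rule, so as to filter out flows sent by different applications of the same software as the first alarm traffic. The third policy traces back by the IP address and HTTP Header information of the first alarm traffic and can recover multiple flows sent by different applications of the same software. Each backtracking method can accurately recover multiple flows that share an origin with the first alarm traffic, so that whether the first alarm traffic is malicious can be determined from the behavioral features of those flows, which improves identification accuracy.
Optionally, the preset policy includes the first policy, and collecting, according to the preset policy, the multiple pieces of second alarm traffic corresponding to the first alarm traffic within the target time period includes: obtaining the IP address and UA information of the first alarm traffic; and, among the HTTP flows sent from that IP address within the target time period, collecting the HTTP flows whose UA information is the same as that of the first alarm traffic as the second alarm traffic. Refer to FIG. 5, which is a schematic flowchart of backtracking traffic according to the first policy. As shown in FIG. 5, when the preset policy includes the first policy, the UA information and source IP address of the first alarm traffic serve as a unique index for backtracking. Application traffic is identified by the UA Header information, and all HTTP flows with the same UA information sent from the source IP address (src-ip) in the preceding or following N minutes are extracted as second alarm traffic for backtracking analysis. This recovers multiple flows sent by the same software, the same service device, or the same application, and improves identification accuracy. Refer to FIG. 6, which is a schematic diagram of multiple flows traced back according to the first policy. As shown in FIG. 6, ten HTTP requests are traced back from three pieces of first alarm traffic according to the first policy. By IP address and UA information, the ten requests fall into three groups, one per piece of first alarm traffic. Multi-flow group 1 corresponds to the first alarm traffic with IP address IP and UA information UA1; group 2 to IP address IP and UA information UA2; group 3 to IP address IP and UA information UA3. In addition, HTTP requests 1 to 4 correspond to a typical site polling plus URL variation pattern; HTTP requests 5 to 7 correspond to a typical stable heartbeat pattern; and HTTP requests 8 to 10 correspond to certain sample-specific communication behaviors.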
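The first policy amounts to filtering stored flows by source IP address, UA, and time. A minimal sketch follows, assuming each flow is a dictionary with the hypothetical keys `src_ip`, `ua`, `ts` (seconds), and `url`, and assuming a window that extends both before and after the alarm.

```python
def backtrace_by_ua(alert, flows, window_seconds):
    """Return flows from the alert's source IP with the same UA inside the window."""
    lo = alert["ts"] - window_seconds
    hi = alert["ts"] + window_seconds
    return [
        f for f in flows
        if f["src_ip"] == alert["src_ip"]
        and f["ua"] == alert["ua"]
        and lo <= f["ts"] <= hi
    ]

# Illustrative records only; addresses come from the documentation range 192.0.2.0/24.
flows = [
    {"src_ip": "192.0.2.1", "ua": "UA1", "ts": 100.0, "url": "/a"},
    {"src_ip": "192.0.2.1", "ua": "UA2", "ts": 110.0, "url": "/b"},
    {"src_ip": "192.0.2.1", "ua": "UA1", "ts": 500.0, "url": "/c"},
    {"src_ip": "192.0.2.2", "ua": "UA1", "ts": 120.0, "url": "/d"},
    {"src_ip": "192.0.2.1", "ua": "UA1", "ts": 130.0, "url": "/e"},
]
second_alarm_traffic = backtrace_by_ua(flows[0], flows, window_seconds=60)
```

In this example only the flows for `/a` and `/e` are returned: the others differ in UA, in source IP, or fall outside the window.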
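Optionally, the preset policy includes the second policy, and collecting, according to the preset policy, the multiple pieces of second alarm traffic corresponding to the first alarm traffic within the target time period includes: obtaining the IP address of the first alarm traffic; collecting multiple first HTTP flows sent from that IP address within the target time period; generalizing the first HTTP flows according to the preset generalization rule to obtain multiple second HTTP flows, where the preset generalization rule uniformly replaces, by a preset standard, the target string of each first HTTP flow; and, from the second HTTP flows, selecting as the second alarm traffic the target second HTTP flows whose similarity to the first alarm traffic is greater than the preset threshold. Refer to FIG. 7, which is a schematic flowchart of backtracking traffic according to the second policy. As shown in FIG. 7, the traffic is generalized by replacing its varying fields with placeholder characters. The flows sent from the same source IP are generalized uniformly, and the string similarity between the resulting templates is computed, so that all of the most similar HTTP flows are matched in the historical traffic of the source IP (for example, the traffic of the same source IP within the target time period); these are the second alarm traffic. Generalization here means replacing the varying string positions in the traffic by one common standard (in the embodiments of this application, every lowercase letter may be replaced with x, every special character with T, and every uppercase letter with X). Refer to FIG. 8, which is a schematic diagram of traffic before and after generalization. As shown in FIG. 8, after the first HTTP flows are generalized by the uniform rule, the corresponding second HTTP flows are obtained. The similarity between each second HTTP flow and the first alarm traffic can then be computed. Computing similarity after generalization identifies the flows in the same cluster as the first alarm traffic (sent by the same software but different applications, with similarity above the preset threshold), and whether the first alarm traffic is malicious is then determined from the behavioral features of those flows, which improves identification accuracy.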
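The character replacement rule described above can be sketched as follows. The function name `generalize` is hypothetical, and leaving digits and whitespace unchanged is an assumption, since the source only specifies the replacements for lowercase letters, uppercase letters, and special characters. The sketch applies the rule to the whole string it is given; choosing which fields of a request count as varying is left to the caller.

```python
def generalize(text: str) -> str:
    """Replace lowercase letters with 'x', uppercase with 'X', special characters with 'T'."""
    out = []
    for ch in text:
        if ch.islower():
            out.append("x")
        elif ch.isupper():
            out.append("X")
        elif ch.isdigit() or ch.isspace():
            out.append(ch)  # assumption: digits and whitespace are kept as-is
        else:
            out.append("T")
    return "".join(out)

template = generalize("/Logo.gif?f5=-11")
```

Two requests that differ only in the letters of a random token map to the same template, which is what makes the later similarity comparison meaningful.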
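Optionally, selecting, from the multiple second HTTP flows, the target second HTTP flows whose similarity to the first alarm traffic is greater than the preset threshold as the second alarm traffic includes: vectorizing the second HTTP flows, and then computing the similarity between each vectorized second HTTP flow and the first alarm traffic. The malicious traffic identification apparatus may first vectorize with the bag-of-words (BOW) model and then use the cosine similarity of the vector space model (VSM) to compute the similarity between each second HTTP flow and the first alarm traffic. To measure string similarity, the vector representations of the two generalized requests under the same BOW vocabulary are obtained and their cosine similarity is computed. The vector space model is an algebraic model that represents a text document as a vector of identifiers (such as index terms); it is used in information filtering, information retrieval, indexing, and relevance ranking. The cosine similarity is:
$$\cos(A, B) = \frac{A \cdot B}{\lVert A \rVert \, \lVert B \rVert} = \frac{\sum_{i=1}^{n} A_i B_i}{\sqrt{\sum_{i=1}^{n} A_i^2} \, \sqrt{\sum_{i=1}^{n} B_i^2}}$$
where A is the template vector of the alarm flow and B is the vector of a traced-back flow.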
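The bag-of-words vectorization and cosine similarity can be sketched with the standard library alone. Tokenizing a generalized template into character bigrams is an assumption made for this example; the source does not say how templates are split into words.

```python
import math
from collections import Counter

def char_bigrams(text):
    """Split a template into overlapping two-character tokens (assumed tokenizer)."""
    return [text[i:i + 2] for i in range(len(text) - 1)]

def bow_vectors(tokens_a, tokens_b):
    """Count vectors of both token lists over their shared vocabulary."""
    vocab = sorted(set(tokens_a) | set(tokens_b))
    count_a, count_b = Counter(tokens_a), Counter(tokens_b)
    return [count_a[t] for t in vocab], [count_b[t] for t in vocab]

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty vector is treated as dissimilar to everything
    return dot / (norm_a * norm_b)

def template_similarity(template_a, template_b):
    a, b = bow_vectors(char_bigrams(template_a), char_bigrams(template_b))
    return cosine_similarity(a, b)

score = template_similarity("TxxxxTxxxTx5TT11", "TxxxxTxxxTx7TT20")
```

A traced-back flow would be kept as second alarm traffic when `score` exceeds the preset threshold.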
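Optionally, the preset policy includes the third policy, and collecting, according to the preset policy, the multiple pieces of second alarm traffic corresponding to the first alarm traffic within the target time period includes: obtaining the IP address and the HTTP Header information of the first alarm traffic; collecting multiple third HTTP flows sent from that IP address within the target time period; performing N-gram processing on the HTTP Header of each third HTTP flow to obtain a first matrix, where the first matrix includes the HTTP Header sequence information of each third HTTP flow; performing dimensionality reduction on the first matrix, and extracting from the reduced first matrix the target HTTP Header sequence information that matches the HTTP Header information of the first alarm traffic; and, based on the target HTTP Header sequence information, taking the third HTTP flows that correspond to it as the second alarm traffic. Refer to FIG. 9, which is a schematic flowchart of backtracking traffic according to the third policy. As shown in FIG. 9, N-gram processing is applied to the HTTP Header of the HTTP requests from the source IP, that is, the HTTP Header sequence information of the traffic is extracted with different values of N (chosen with performance in mind), forming the sample-by-header-combination matrix (HTTP header sequence N-gram matrix) shown in Table 1 below. A hash trick is used for dimensionality reduction, and the HTTP flows with the same sequence after reduction are extracted.
Table 1: HTTP header sequence N-gram matrix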
Figure PCTCN2021141587-appb-000002
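As shown in FIG. 9, because the combination matrix has a high dimension, a hash trick may be used to reduce it, yielding the reduced N-gram matrix. For example, a random transformation is applied to the feature vector x to perform one MinHash, and the last b bits of the hash result (which can be written in binary) are kept; this is the b-bit MinHash procedure. Repeating the procedure k times lets each sample be represented with k*b bits, which greatly lowers the time and space required for processing. Backtracking by the HTTP Header sequence information of the traffic (for example, the traffic of the same source IP within the target time period) recovers multiple flows sent by different applications of the same software, and whether the first alarm traffic is malicious is then determined from the behavioral features of those flows, which improves identification accuracy.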
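The header-sequence N-grams and the dimensionality reduction can be illustrated with a simplified sketch. It uses plain feature hashing (each N-gram is hashed into one of `dim` buckets) rather than the b-bit MinHash procedure described above, so it is a stand-in for the idea and not the method of the embodiment. The function names, the `headers` key, the bucket count of 32, and the use of MD5 as a stable hash are all assumptions.

```python
import hashlib

def header_ngrams(names, n):
    """All runs of n consecutive header names, in order."""
    return [tuple(names[i:i + n]) for i in range(len(names) - n + 1)]

def hashed_vector(names, n_values=(2, 3), dim=32):
    """Count N-grams into a fixed number of buckets (feature hashing)."""
    vec = [0] * dim
    for n in n_values:
        for gram in header_ngrams(names, n):
            digest = hashlib.md5("|".join(gram).encode("utf-8")).digest()
            vec[int.from_bytes(digest[:4], "big") % dim] += 1
    return vec

def same_sequence_flows(alert_headers, flows):
    """Flows whose hashed header-sequence vector equals that of the alert."""
    target = hashed_vector(alert_headers)
    return [f for f in flows if hashed_vector(f["headers"]) == target]

alert_headers = ["Host", "User-Agent", "Accept"]
candidates = [
    {"id": 1, "headers": ["Host", "User-Agent", "Accept"]},
    {"id": 2, "headers": ["Accept", "Host", "Cookie", "User-Agent"]},
]
matches = same_sequence_flows(alert_headers, candidates)
```

Exact vector equality is the simplest matching rule; a similarity threshold over the hashed vectors would be a natural alternative.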
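Step S206: Perform feature extraction on the multiple pieces of second alarm traffic to obtain first feature information.
Specifically, the malicious traffic identification apparatus performs feature extraction on the multiple pieces of second alarm traffic to obtain the first feature information. It can be understood that the HTTP flows obtained by one or more of the backtracking policies above are each passed to the next stage for feature extraction. The representation vectors of the HTTP flows obtained by one or more of the three backtracking methods are computed and concatenated into a single vector, which is the first feature information. Refer to FIG. 10, which is a schematic flowchart of a method for obtaining the first feature information. As shown in FIG. 10, the single-flow classifier yields the first alarm traffic, that is, the pre-classification result. From the first alarm traffic, multiple pieces of second alarm traffic (multi-flow data) are traced back by the first policy (UA aggregation), the second policy (traffic template similarity clustering), and/or the third policy (HTTP header N-gram). Feature extraction on the second alarm traffic yields one feature representation vector per policy (Vector-traceback), and the multi-flow feature representation vector formed by combining them is the first feature information.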
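The concatenation of per-policy representation vectors can be sketched as follows. The three statistics computed per group (flow count, number of distinct URLs, mean gap between requests) are placeholders invented for this example; the actual multi-flow features are listed in Table 2 of the source, which is not reproduced in the text. Flows are assumed to be dictionaries with hypothetical `ts` and `url` keys.

```python
def group_vector(flows):
    """Placeholder statistics for one traced-back group of flows."""
    times = sorted(f["ts"] for f in flows)
    gaps = [later - earlier for earlier, later in zip(times, times[1:])]
    mean_gap = sum(gaps) / len(gaps) if gaps else 0.0
    return [float(len(flows)), float(len({f["url"] for f in flows})), mean_gap]

def multi_flow_vector(groups):
    """Concatenate the vectors of the groups, given in a fixed policy order."""
    vec = []
    for flows in groups:
        vec.extend(group_vector(flows))
    return vec

by_ua = [{"ts": 0, "url": "/a"}, {"ts": 10, "url": "/a"}, {"ts": 30, "url": "/b"}]
by_template = []  # a policy that traced back nothing contributes zeros
first_feature_information = multi_flow_vector([by_ua, by_template])
```

Keeping the policy order fixed ensures that each position in the concatenated vector always carries the same meaning for the downstream classifier.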
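Optionally, the first feature information is a feature representation vector, and performing feature extraction on the multiple pieces of second alarm traffic to obtain the first feature information includes: performing feature extraction on the second alarm traffic to obtain the corresponding behavioral feature information, which includes one or more of a connection behavior feature, a request difference feature, and a request response feature; and obtaining the feature representation vector from the behavioral feature information. This lets the malicious traffic identification apparatus take full account of the multi-flow network behavior of malicious CC communication traffic, so that malicious traffic on the live network is detected and distinguished more accurately. Refer to Table 2 below, which lists the multi-flow behavioral feature information provided in an embodiment of this application.
Table 2: Multi-flow model features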
Figure PCTCN2021141587-appb-000003
Figure PCTCN2021141587-appb-000004
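Step S207: Determine, based on the first feature information, whether the first alarm traffic is malicious traffic.
Specifically, the malicious traffic identification apparatus may determine, based on the first feature information, whether the first alarm traffic is malicious traffic. The first feature information characterizes the behavior of the multiple flows that correspond to the first alarm traffic; detection on this behavioral feature information by the backtracking model determines whether the first alarm traffic is malicious. For example, the multi-flow behavioral feature information, in vector form, is input to the backtracking model (a multi-flow classifier). To make full use of the vector features, stacking may be used to train several times and extract the behavioral features of the vector, which yields the detection result of the backtracking model. The backtracking model may be a pre-trained classification model for identifying whether traffic is malicious.
Optionally, determining, based on the first feature information, whether the first alarm traffic is malicious traffic includes: performing detection with the backtracking model based on the first feature information to obtain a first detection result; performing detection with a baseline model based on the multiple pieces of second alarm traffic to obtain a second detection result, where the baseline model is a detection model pre-trained on historical traffic; and determining, based on the first and second detection results, whether the first alarm traffic is malicious traffic. Traffic data is accumulated from the live production environment over a period of time, the multi-flow features of the live-network traffic are extracted from it, and these serve as training data for a one-class model of the live network's historical data (the baseline model). This model represents the behavioral baseline of the live network, so that traffic deviating from normal behavior can be judged against the baseline. The backtracking model may be a pre-trained multi-flow classifier for identifying whether traffic is malicious. The first detection result y 1(x) of the backtracking model and the second detection result y 2(x) of the one-class baseline anomaly detection model pre-trained on historical traffic are then smoothly integrated to obtain the final result T(x). The integrated value is computed with a decision formula, and whether the first alarm traffic is malicious is determined from T(x). The decision formula is: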
Figure PCTCN2021141587-appb-000005
where
Figure PCTCN2021141587-appb-000006
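Refer to FIG. 11, which is a graph of the function with E n as the independent variable and a n as the dependent variable, where E n ∈ (0,1). As shown in FIG. 11, the larger the error rate E n, the smaller a n becomes, which reduces the decision weight of the corresponding model. The weights and the output values of the different models are arithmetically averaged and fed into the smooth sign function sigmoid to compute the final mapped value, which yields an output of 0 (white sample label, normal traffic) or 1 (black sample label, malicious traffic). Considering both the first detection result from the backtracking model and the second detection result from the baseline model before deciding whether the first alarm traffic is malicious greatly improves the accuracy of malicious traffic identification.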
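The combination of the two detection results can be illustrated with a hedged sketch. The decision formula and the weight function are given only as images in the source and are not reproduced here, so the sketch substitutes an AdaBoost-style weight, a = 0.5 * ln((1 - E) / E), which merely shares the stated property that the weight shrinks as the error rate E grows. The encoding of model outputs as -1 (normal) and +1 (malicious), the 0.5 cut-off, and all function names are likewise assumptions.

```python
import math

def model_weight(error_rate):
    """Assumed weight function: decreases as the error rate grows (0 < E < 1)."""
    if not 0.0 < error_rate < 1.0:
        raise ValueError("error rate must lie in (0, 1)")
    return 0.5 * math.log((1.0 - error_rate) / error_rate)

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def combine(y1, y2, e1, e2):
    """y1, y2 in {-1, +1}: outputs of the backtracking and baseline models.

    Returns 1 (malicious) or 0 (normal). A tie maps to 0.
    """
    z = (model_weight(e1) * y1 + model_weight(e2) * y2) / 2.0
    return 1 if sigmoid(z) > 0.5 else 0

verdict = combine(y1=+1, y2=-1, e1=0.1, e2=0.4)
```

With these assumed weights, a disagreement between the two models is resolved in favor of the model with the lower error rate, so `verdict` is 1 in the example.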
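Step S208: If the first alarm traffic is malicious traffic, perform preset generalization processing on the first alarm traffic to obtain generalized first alarm traffic.
Specifically, if the first alarm traffic is malicious traffic, the malicious traffic identification apparatus performs preset generalization processing on it to obtain the generalized first alarm traffic. It can be understood that, once the first alarm traffic has been confirmed as malicious, the category of malicious traffic to which it belongs can also be identified.
Step S209: Classify the generalized first alarm traffic to determine the malicious traffic type that the first alarm traffic matches.
Specifically, the malicious traffic identification apparatus classifies the generalized first alarm traffic and determines the malicious traffic type that the first alarm traffic matches. The apparatus classifies the generalized first alarm traffic with a trained category classification model. This model is a multi-family classification model: the communication traffic of known malicious samples is generalized, and the model is trained on features extracted by the model involved in step S207 (the backtracking model); it is used to determine the family to which malicious traffic belongs. Refer to FIG. 12, which is a schematic flowchart of determining the category of malicious traffic. As shown in FIG. 12, a malicious traffic sample passes through generalization, traffic template extraction, representation learning, feature extraction, feature labeling, and a multi-class classifier, after which the malicious traffic type matching the alarm traffic can be determined. That is, in the embodiments of this application, the malicious traffic identification apparatus generalizes the first alarm traffic, performs feature extraction on the generalized first alarm traffic to obtain the corresponding feature representation vector, and finally inputs this vector to the multi-family classification model to identify the type of the malicious traffic.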
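By implementing the embodiment of the first aspect, the malicious traffic identification apparatus can, starting from the receiving time of a single flow (the first alarm traffic), trace back according to a preset policy the multiple flows within the target time period that match that flow (the multiple pieces of second alarm traffic). It then performs feature extraction on the traced-back flows to obtain feature information, and classifies the single flow based on that feature information to determine whether it is malicious. The similarity between each piece of second alarm traffic and the first alarm traffic is greater than the preset threshold. Classifying a single flow by the feature information of multiple similar flows lets the apparatus take full account of the multi-flow network behavior of malicious CC communication traffic, so that malicious traffic on the live network is detected and distinguished more accurately. This avoids the element of chance that affects prior-art detection on a single HTTP flow, given the relative complexity of live-network traffic. In addition, the embodiments observe the communication behavior of traffic from a multi-flow perspective: multiple pieces of alarm traffic are traced back into different clusters by one or more methods, the feature information of the clusters to which each alarm flow belongs is collected, and the flow is judged positive or negative (that is, whether the alarm traffic is malicious) from that feature information, which reduces chance errors. Observing the overall communication behavior of malicious traffic over a period of time allows malicious samples to be judged from their behavior, making the final multi-flow verdict more robust and behaviorally interpretable. Moreover, whether multi-flow traffic is examined by features at the traffic layer or by features of host behavior, the underlying information is rich enough for the apparatus to identify malicious traffic effectively and accurately. The communication traffic of rogue software can thus be distinguished from that of malware by multi-flow features, which improves the accuracy of malicious traffic identification.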
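In addition, the following experimental data was obtained in an application scenario in which the prior art and the embodiments of this application were each used to identify network data, on 16 million normal live-network records and more than 10,000 malicious traffic samples collected from a campus network X.
1. Using only the prior-art single-flow detection model:
Refer to Table 3 below, which is a performance data table of a single-flow model. However, based on indirect confirmation such as threat intelligence, the precision of the detection algorithm in actual network operation is estimated at about 80%. (More than 40 flow alarms were confirmed on the campus network X mentioned above.)
Table 3: Single-flow model performance data
Accuracy 0.9999664730928924
F1 0.9999831782138391
Precision 0.9999728493273421
Recall 0.999993507313716
Table 3 shows that, for all HTTP communication, the ACC value in the experimental environment (test set) exceeds 99.99% and the ROC (AUC) value is close to 1 (0.99999). The ROC value generally lies between 0.5 and 1.0; the larger the value, the more accurate the model's judgment, so the closer to 1 the better. ROC = 0.5 means that the model's predictive ability is no different from a random result. The KS value expresses the model's ability to separate positives from negatives; the larger the KS value, the better the predictive accuracy of the model. In general, KS > 0.2 indicates reasonably good predictive accuracy.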
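2. Using the malicious traffic identification method described in the embodiments of this application:
Multi-flow judgment was performed on top of the single-layer detection model. In the experimental environment, clustered infection behavior among live-network IP addresses was successfully discovered, and across all final alarm samples the identification precision of the backtracking model on the campus network X reached 100%. Refer to the examples of detected malicious traffic in Table 4 below, which shows two clusters of malicious HTTP flows detected for the IP addresses 166.***.**.111 and 166.***.***.191.
Table 4: Malicious traffic sample data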
http://arimaexim.com/logo.gif?f5da****=-119****187 158****498 166.***.**.111
http://arimaexim.com/logo.gif?faa7****=-89****66 158****025 166.***.**.111
http://arimaexim.com/logo.gif?f69c****=-110****150 158****218 166.***.**.111
http://www.arimaexim.com/logo.gif?faa7****=-89****66 158****026 166.***.**.111
http://ampyazilim.com.tr/images/xs2.jpg?cdd****=21****164 158****717 166.***.***.191
http://ahmediye.net/xs.jpg?857****=559****96 158****826 166.***.***.191
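In summary, by implementing the embodiments of this application, the traffic separation method based on multi-flow backtracking first accurately separates the HTTP traffic of the same malware or application communicating over a continuous time period. Second, the backtracking-based multi-level detection framework (single-flow filtering followed by multi-flow backtracking) effectively reduces the storage and inspection of large numbers of irrelevant data flows during detection (backtracking needs only the suspicious traffic found by the first layer, which is a very small share), improves analysis efficiency, and is better suited to enterprise network environments. In addition, the traffic separation method based on multi-flow backtracking distinguishes the communication traffic of rogue software from that of malware by multi-flow behavioral features.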
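The method of the embodiments of this application has been described in detail above; the related apparatus of the embodiments is described below.
Refer to FIG. 13, which is a schematic structural diagram of a malicious traffic identification apparatus according to an embodiment of this application. The malicious traffic identification apparatus 10 may include a determining unit 101, a backtracking unit 102, an extraction unit 103, and a judging unit 104, and may further include a generalization unit 105, a classification unit 106, and an alarm traffic unit 107. The units are described in detail as follows.
The determining unit 101 is configured to determine a receiving time of first alarm traffic.
The backtracking unit 102 is configured to obtain, according to a preset policy, multiple pieces of second alarm traffic that correspond to the first alarm traffic within a target time period, where the target time period is determined based on the receiving time, and the similarity between each piece of second alarm traffic and the first alarm traffic is greater than a preset threshold.
The extraction unit 103 is configured to perform feature extraction on the multiple pieces of second alarm traffic to obtain first feature information.
The judging unit 104 is configured to determine, based on the first feature information, whether the first alarm traffic is malicious traffic.
In a possible implementation, the preset policy includes one or more of a first policy, a second policy, and a third policy. The first policy obtains the multiple pieces of second alarm traffic based on the Internet Protocol (IP) address and user agent (UA) information of the first alarm traffic; the second policy obtains them based on the IP address of the first alarm traffic and a preset generalization rule; and the third policy obtains them based on the IP address of the first alarm traffic and the HTTP Header information of the first alarm traffic.
In a possible implementation, the preset policy includes the first policy, and the backtracking unit 102 is specifically configured to: obtain the IP address and UA information of the first alarm traffic; and, among the HTTP flows sent from that IP address within the target time period, collect the HTTP flows whose UA information is the same as that of the first alarm traffic as the second alarm traffic.
In a possible implementation, the preset policy includes the second policy, and the backtracking unit 102 is specifically configured to: obtain the IP address of the first alarm traffic; collect multiple first HTTP flows sent from that IP address within the target time period; generalize the first HTTP flows according to the preset generalization rule to obtain multiple second HTTP flows, where the preset generalization rule uniformly replaces, by a preset standard, the target string of each first HTTP flow; and, from the second HTTP flows, select as the second alarm traffic the target second HTTP flows whose similarity to the first alarm traffic is greater than the preset threshold.
In a possible implementation, the preset policy includes the third policy, and the backtracking unit 102 is specifically configured to: obtain the IP address and the HTTP Header information of the first alarm traffic; collect multiple third HTTP flows sent from that IP address within the target time period; perform N-gram processing on the HTTP Header of each third HTTP flow to obtain a first matrix, where the first matrix includes the HTTP Header sequence information of each third HTTP flow; perform dimensionality reduction on the first matrix, and extract from the reduced first matrix the target HTTP Header sequence information that matches the HTTP Header information of the first alarm traffic; and, based on the target HTTP Header sequence information, take the third HTTP flows that correspond to it as the second alarm traffic.
In a possible implementation, the first feature information is a feature representation vector, and the extraction unit 103 is specifically configured to: perform feature extraction on the multiple pieces of second alarm traffic to obtain the corresponding behavioral feature information, which includes one or more of a connection behavior feature, a request difference feature, and a request response feature; and obtain the feature representation vector from the behavioral feature information.
In a possible implementation, the judging unit 104 is specifically configured to: perform detection with a backtracking model based on the first feature information to obtain a first detection result; perform detection with a baseline model based on the multiple pieces of second alarm traffic to obtain a second detection result, where the baseline model is a detection model pre-trained on historical traffic; and determine, based on the first and second detection results, whether the first alarm traffic is malicious traffic.
In a possible implementation, the apparatus further includes: a generalization unit 105, configured to perform preset generalization processing on the first alarm traffic if it is malicious traffic, to obtain generalized first alarm traffic; and a classification unit 106, configured to classify the generalized first alarm traffic and determine the malicious traffic type that the first alarm traffic matches.
In a possible implementation, the apparatus further includes an alarm traffic unit 107, configured to: before the receiving time of the first alarm traffic is determined, receive multiple fourth HTTP flows; perform feature extraction on each of the fourth HTTP flows according to a preset feature extraction rule to obtain a second feature set, where the second feature set includes the second feature information corresponding to each of the fourth HTTP flows; and, based on the second feature set, filter out the first alarm traffic from the fourth HTTP flows by using a first classification model.
In a possible implementation, the second feature information includes manual feature information and/or representation learning feature information. The manual feature information includes one or more of the following for the fourth HTTP flow: a domain name readability feature, a Uniform Resource Locator (URL) structure feature, a behavior indication feature, and an HTTP Header feature. The representation learning feature information includes a high-dimensional feature of the fourth HTTP flow.
It should be noted that, for the functions of the functional units of the malicious traffic identification apparatus 10 described in the embodiments of this application, reference may be made to the descriptions of steps S201 to S209 in the method embodiment of FIG. 2; details are not repeated here.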
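FIG. 14 is a schematic structural diagram of another malicious traffic identification apparatus according to an embodiment of this application. The apparatus 20 includes at least one processor 201, at least one memory 202, and at least one communication interface 203. The device may further include general-purpose components such as an antenna, which are not detailed here.
The processor 201 may be a general-purpose central processing unit (CPU), a microprocessor, an application-specific integrated circuit (ASIC), or one or more integrated circuits for controlling execution of the programs of the above solution.
The communication interface 203 is configured to communicate with other devices or communication networks, such as Ethernet, a radio access network (RAN), a core network, or a wireless local area network (WLAN).
The memory 202 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, or an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, and Blu-ray discs), a magnetic disk storage medium or other magnetic storage device, or any other medium that can carry or store desired program code in the form of instructions or data structures and can be accessed by a computer, but is not limited thereto. The memory may exist independently and be connected to the processor through a bus, or may be integrated with the processor.
The memory 202 is configured to store the application program code for executing the above solution, and execution is controlled by the processor 201. The processor 201 is configured to execute the application program code stored in the memory 202.
The code stored in the memory 202 can perform the malicious traffic identification method provided in FIG. 2, for example: determining a receiving time of first alarm traffic; obtaining, according to a preset policy, multiple pieces of second alarm traffic that correspond to the first alarm traffic within a target time period, where the target time period is determined based on the receiving time, and the similarity between each piece of second alarm traffic and the first alarm traffic is greater than a preset threshold; performing feature extraction on the multiple pieces of second alarm traffic to obtain first feature information; and determining, based on the first feature information, whether the first alarm traffic is malicious traffic.
It should be noted that, for the functions of the functional units of the malicious traffic identification apparatus 20 described in the embodiments of this application, reference may be made to the descriptions of steps S201 to S209 in the method embodiment of FIG. 2; details are not repeated here.
In the foregoing embodiments, the description of each embodiment has its own emphasis. For a part not detailed in one embodiment, refer to the related descriptions of other embodiments.
It should be noted that, for brevity, the foregoing method embodiments are each expressed as a series of action combinations. A person skilled in the art should understand, however, that this application is not limited by the described order of actions, because according to this application some steps may be performed in another order or simultaneously. A person skilled in the art should also understand that the embodiments described in the specification are all preferred embodiments, and the actions and modules involved are not necessarily required by this application.
In the several embodiments provided in this application, it should be understood that the disclosed apparatus may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is only a division by logical function; in an actual implementation there may be other divisions, for example multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical or take other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of an embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in hardware or as a software functional unit.
If the integrated unit is implemented as a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part that contributes to the prior art, or all or part of the technical solution, may be embodied as a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like, and specifically may be a processor in the computer device) to perform all or some of the steps of the methods of the embodiments of this application. The storage medium may include a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a read-only memory (ROM), a random access memory (RAM), or any other medium that can store program code.
The foregoing embodiments are intended only to illustrate the technical solutions of this application, not to limit them. Although this application has been described in detail with reference to the foregoing embodiments, a person of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of this application.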

Claims (24)

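  1. A malicious traffic identification method, comprising:
    determining a receiving time of first alarm traffic;
    obtaining, based on a preset policy, multiple pieces of second alarm traffic that correspond to the first alarm traffic within a target time period, wherein the target time period is a time period determined based on the receiving time, and the similarity between each piece of second alarm traffic and the first alarm traffic is greater than a preset threshold;
    performing feature extraction on the multiple pieces of second alarm traffic to obtain first feature information; and
    determining, based on the first feature information, whether the first alarm traffic is malicious traffic.
  2. The method according to claim 1, wherein the preset policy comprises one or more of a first policy, a second policy, and a third policy, wherein the first policy is a policy of obtaining the multiple pieces of second alarm traffic based on an Internet Protocol (IP) address and user agent (UA) information of the first alarm traffic; the second policy is a policy of obtaining the multiple pieces of second alarm traffic based on the IP address of the first alarm traffic and a preset generalization rule; and the third policy is a policy of obtaining the multiple pieces of second alarm traffic based on the IP address of the first alarm traffic and HyperText Transfer Protocol (HTTP) Header information of the first alarm traffic.
  3. The method according to claim 2, wherein the preset policy is the first policy, and the obtaining, based on a preset policy, multiple pieces of second alarm traffic that correspond to the first alarm traffic within a target time period comprises:
    obtaining the IP address and the UA information of the first alarm traffic; and
    collecting, among multiple HTTP flows sent from the IP address within the target time period, the HTTP flows whose UA information is the same as that of the first alarm traffic as the second alarm traffic.
  4. The method according to claim 2, wherein the preset policy is the second policy, and the obtaining, based on a preset policy, multiple pieces of second alarm traffic that correspond to the first alarm traffic within a target time period comprises:
    obtaining the IP address of the first alarm traffic;
    collecting multiple first HTTP flows sent from the IP address within the target time period;
    generalizing the multiple first HTTP flows according to the preset generalization rule to obtain multiple second HTTP flows, wherein the preset generalization rule is to uniformly replace, by a preset standard, a target string corresponding to each of the multiple first HTTP flows; and
    selecting, from the multiple second HTTP flows, target second HTTP flows whose similarity to the first alarm traffic is greater than the preset threshold as the second alarm traffic.
  5. The method according to claim 2, wherein the preset policy is the third policy, and the obtaining, based on a preset policy, multiple pieces of second alarm traffic that correspond to the first alarm traffic within a target time period comprises:
    obtaining the IP address and the HTTP Header information of the first alarm traffic;
    collecting multiple third HTTP flows sent from the IP address within the target time period;
    performing N-gram processing on the HTTP Header corresponding to each of the multiple third HTTP flows to obtain a first matrix, wherein the first matrix comprises HTTP Header sequence information corresponding to each third HTTP flow;
    performing dimensionality reduction on the first matrix, and extracting, from the reduced first matrix, target HTTP Header sequence information that matches the HTTP Header information of the first alarm traffic; and
    obtaining, based on the target HTTP Header sequence information, the third HTTP flows corresponding to the target HTTP Header sequence information as the second alarm traffic.
  6. The method according to any one of claims 1 to 5, wherein the first feature information is a feature representation vector, and the performing feature extraction on the multiple pieces of second alarm traffic to obtain first feature information comprises:
    performing feature extraction on the multiple pieces of second alarm traffic to obtain behavioral feature information corresponding to the multiple pieces of second alarm traffic, wherein the behavioral feature information comprises one or more of a connection behavior feature, a request difference feature, and a request response feature; and
    obtaining the feature representation vector according to the behavioral feature information.
  7. The method according to any one of claims 1 to 6, wherein the determining, based on the first feature information, whether the first alarm traffic is malicious traffic comprises:
    performing detection with a backtracking model based on the first feature information to obtain a first detection result;
    performing detection with a baseline model based on the multiple pieces of second alarm traffic to obtain a second detection result, wherein the baseline model is a detection model pre-trained on historical traffic; and
    determining, based on the first detection result and the second detection result, whether the first alarm traffic is malicious traffic.
  8. The method according to any one of claims 1 to 7, wherein the method further comprises:
    if the first alarm traffic is malicious traffic, performing preset generalization processing on the first alarm traffic to obtain generalized first alarm traffic; and
    classifying the generalized first alarm traffic to determine a malicious traffic type that the first alarm traffic matches.
  9. The method according to claim 1, wherein before the determining a receiving time of first alarm traffic, the method further comprises:
    receiving multiple fourth HTTP flows;
    performing feature extraction on each of the multiple fourth HTTP flows to obtain a second feature set, wherein the second feature set comprises second feature information corresponding to each of the multiple fourth HTTP flows; and
    filtering out, based on the second feature set, the first alarm traffic from the multiple fourth HTTP flows by using a first classification model.
  10. The method according to claim 9, wherein the second feature information comprises manual feature information and/or representation learning feature information, wherein the manual feature information comprises one or more of a domain name readability feature, a Uniform Resource Locator (URL) structure feature, a behavior indication feature, and an HTTP Header feature corresponding to the fourth HTTP flow, and the representation learning feature information comprises a high-dimensional feature corresponding to the fourth HTTP flow.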
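  11. A malicious traffic identification apparatus, comprising:
    a determining unit, configured to determine a receiving time of first alarm traffic;
    a backtracking unit, configured to obtain, based on a preset policy, multiple pieces of second alarm traffic that correspond to the first alarm traffic within a target time period, wherein the target time period is a time period determined based on the receiving time, and the similarity between each piece of second alarm traffic and the first alarm traffic is greater than a preset threshold;
    an extraction unit, configured to perform feature extraction on the multiple pieces of second alarm traffic to obtain first feature information; and
    a judging unit, configured to determine, based on the first feature information, whether the first alarm traffic is malicious traffic.
  12. The apparatus according to claim 11, wherein the preset policy comprises one or more of a first policy, a second policy, and a third policy, wherein the first policy is a policy of obtaining the multiple pieces of second alarm traffic based on an Internet Protocol (IP) address and user agent (UA) information of the first alarm traffic; the second policy is a policy of obtaining the multiple pieces of second alarm traffic based on the IP address of the first alarm traffic and a preset generalization rule; and the third policy is a policy of obtaining the multiple pieces of second alarm traffic based on the IP address of the first alarm traffic and HyperText Transfer Protocol (HTTP) Header information of the first alarm traffic.
  13. The apparatus according to claim 12, wherein the preset policy is the first policy, and the backtracking unit is specifically configured to:
    obtain the IP address and the UA information of the first alarm traffic; and
    collect, among multiple HTTP flows sent from the IP address within the target time period, the HTTP flows whose UA information is the same as that of the first alarm traffic as the second alarm traffic.
  14. The apparatus according to claim 12, wherein the preset policy is the second policy, and the backtracking unit is specifically configured to:
    obtain the IP address of the first alarm traffic;
    collect multiple first HTTP flows sent from the IP address within the target time period;
    generalize the multiple first HTTP flows according to the preset generalization rule to obtain multiple second HTTP flows, wherein the preset generalization rule is to uniformly replace, by a preset standard, a target string corresponding to each of the multiple first HTTP flows; and
    select, from the multiple second HTTP flows, target second HTTP flows whose similarity to the first alarm traffic is greater than the preset threshold as the second alarm traffic.
  15. The apparatus according to claim 12, wherein the preset policy is the third policy, and the backtracking unit is specifically configured to:
    obtain the IP address and the HTTP Header information of the first alarm traffic;
    collect multiple third HTTP flows sent from the IP address within the target time period;
    perform N-gram processing on the HTTP Header corresponding to each of the multiple third HTTP flows to obtain a first matrix, wherein the first matrix comprises HTTP Header sequence information corresponding to each third HTTP flow;
    perform dimensionality reduction on the first matrix, and extract, from the reduced first matrix, target HTTP Header sequence information that matches the HTTP Header information of the first alarm traffic; and
    obtain, based on the target HTTP Header sequence information, the third HTTP flows corresponding to the target HTTP Header sequence information as the second alarm traffic.
  16. The apparatus according to any one of claims 11 to 15, wherein the first feature information is a feature representation vector, and the extraction unit is specifically configured to:
    perform feature extraction on the multiple pieces of second alarm traffic to obtain behavioral feature information corresponding to the multiple pieces of second alarm traffic, wherein the behavioral feature information comprises one or more of a connection behavior feature, a request difference feature, and a request response feature; and
    obtain the feature representation vector according to the behavioral feature information.
  17. The apparatus according to any one of claims 11 to 16, wherein the judging unit is specifically configured to:
    perform detection with a backtracking model based on the first feature information to obtain a first detection result;
    perform detection with a baseline model based on the multiple pieces of second alarm traffic to obtain a second detection result, wherein the baseline model is a detection model pre-trained on historical traffic; and
    determine, based on the first detection result and the second detection result, whether the first alarm traffic is malicious traffic.
  18. The apparatus according to any one of claims 11 to 17, wherein the apparatus further comprises:
    a generalization unit, configured to perform preset generalization processing on the first alarm traffic if the first alarm traffic is malicious traffic, to obtain generalized first alarm traffic; and
    a classification unit, configured to classify the generalized first alarm traffic to determine a malicious traffic type that the first alarm traffic matches.
  19. The apparatus according to claim 11, wherein the apparatus further comprises an alarm traffic unit, configured to:
    before the receiving time of the first alarm traffic is determined, receive multiple fourth HTTP flows;
    perform feature extraction on each of the multiple fourth HTTP flows to obtain a second feature set, wherein the second feature set comprises second feature information corresponding to each of the multiple fourth HTTP flows; and
    filter out, based on the second feature set, the first alarm traffic from the multiple fourth HTTP flows by using a first classification model.
  20. The apparatus according to claim 19, wherein the second feature information comprises manual feature information and/or representation learning feature information, wherein the manual feature information comprises one or more of a domain name readability feature, a Uniform Resource Locator (URL) structure feature, a behavior indication feature, and an HTTP Header feature corresponding to the fourth HTTP flow, and the representation learning feature information comprises a high-dimensional feature corresponding to the fourth HTTP flow.
  21. A service device, comprising a processor and a memory, wherein the memory is configured to store malicious traffic identification program code, and the processor is configured to invoke the malicious traffic identification program code to perform the method according to any one of claims 1 to 10.
  22. A chip system, comprising at least one processor, a memory, and an interface circuit, wherein the memory, the interface circuit, and the at least one processor are interconnected by lines, and the memory stores instructions; when the instructions are executed by the processor, the method according to any one of claims 1 to 10 is implemented.
  23. A computer-readable storage medium, wherein the computer-readable storage medium stores a computer program, and when the computer program is executed by a processor, the method according to any one of claims 1 to 10 is implemented.
  24. A computer program, wherein the computer program comprises instructions, and when the computer program is executed by a computer, the computer is caused to perform the method according to any one of claims 1 to 10.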
PCT/CN2021/141587 2020-12-31 2021-12-27 一种恶意流量识别方法及相关装置 WO2022143511A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP21914247.8A EP4258610A4 (en) 2020-12-31 2021-12-27 MALICIOUS TRAFFIC IDENTIFICATION METHOD AND ASSOCIATED APPARATUS
US18/345,853 US20230353585A1 (en) 2020-12-31 2023-06-30 Malicious traffic identification method and related apparatus

Applications Claiming Priority (4)

Application Number Priority Date Filing Date Title
CN202011639885.1 2020-12-31
CN202011639885 2020-12-31
CN202111573232.2 2021-12-21
CN202111573232.2A CN114697068A (zh) 2020-12-31 2021-12-21 一种恶意流量识别方法及相关装置

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US18/345,853 Continuation US20230353585A1 (en) 2020-12-31 2023-06-30 Malicious traffic identification method and related apparatus

Publications (1)

Publication Number Publication Date
WO2022143511A1 true WO2022143511A1 (zh) 2022-07-07

Family

ID=82136169

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/141587 WO2022143511A1 (zh) 2020-12-31 2021-12-27 一种恶意流量识别方法及相关装置

Country Status (4)

Country Link
US (1) US20230353585A1 (zh)
EP (1) EP4258610A4 (zh)
CN (1) CN114697068A (zh)
WO (1) WO2022143511A1 (zh)

Cited By (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115277856A (zh) * 2022-07-25 2022-11-01 每日互动股份有限公司 一种流量筛选方法和系统
CN115632832A (zh) * 2022-09-30 2023-01-20 温州佳润科技发展有限公司 一种应用于云服务的大数据攻击处理方法及系统
CN115665286A (zh) * 2022-12-26 2023-01-31 深圳红途科技有限公司 接口聚类方法、装置、计算机设备及存储介质
CN116346452A (zh) * 2023-03-17 2023-06-27 中国电子产业工程有限公司 一种基于stacking的多特征融合恶意加密流量识别方法和装置

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
TWI774582B (zh) * 2021-10-13 2022-08-11 財團法人工業技術研究院 惡意超文本傳輸協定請求的偵測裝置和偵測方法

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN103905415A (zh) * 2013-10-25 2014-07-02 哈尔滨安天科技股份有限公司 一种防范远控类木马病毒的方法及系统
WO2016014178A1 (en) * 2014-07-21 2016-01-28 Heilig David Identifying malware-infected network devices through traffic monitoring
US9825989B1 (en) * 2015-09-30 2017-11-21 Fireeye, Inc. Cyber attack early warning system
CN111031071A (zh) * 2019-12-30 2020-04-17 杭州迪普科技股份有限公司 恶意流量的识别方法、装置、计算机设备及存储介质
CN112104628A (zh) * 2020-09-04 2020-12-18 福州林科斯拉信息技术有限公司 一种自适应特征规则匹配的实时恶意流量检测方法

Family Cites Families (10)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9930065B2 (en) * 2015-03-25 2018-03-27 University Of Georgia Research Foundation, Inc. Measuring, categorizing, and/or mitigating malware distribution paths
US11086310B2 (en) * 2015-05-27 2021-08-10 Honeywell International Inc. Method and apparatus for real time model predictive control operator support in industrial process control and automation systems
US10536357B2 (en) * 2015-06-05 2020-01-14 Cisco Technology, Inc. Late data detection in data center
US9699205B2 (en) * 2015-08-31 2017-07-04 Splunk Inc. Network security system
US10523609B1 (en) * 2016-12-27 2019-12-31 Fireeye, Inc. Multi-vector malware detection and analysis
US11374944B2 (en) * 2018-12-19 2022-06-28 Cisco Technology, Inc. Instant network threat detection system
US20200236131A1 (en) * 2019-01-18 2020-07-23 Cisco Technology, Inc. Protecting endpoints with patterns from encrypted traffic analytics
CN110753064B (zh) * 2019-10-28 2021-05-07 中国科学技术大学 机器学习和规则匹配融合的安全检测系统
CN111447190A (zh) * 2020-03-20 2020-07-24 北京观成科技有限公司 一种加密恶意流量的识别方法、设备及装置
CN112003824B (zh) * 2020-07-20 2023-04-18 中国银联股份有限公司 攻击检测方法、装置及计算机可读存储介质

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4258610A4

Cited By (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115277856A (zh) * 2022-07-25 2022-11-01 每日互动股份有限公司 一种流量筛选方法和系统
CN115277856B (zh) * 2022-07-25 2023-08-18 每日互动股份有限公司 一种流量筛选方法和系统
CN115632832A (zh) * 2022-09-30 2023-01-20 温州佳润科技发展有限公司 一种应用于云服务的大数据攻击处理方法及系统
CN115632832B (zh) * 2022-09-30 2023-09-12 上海豹云网络信息服务有限公司 一种应用于云服务的大数据攻击处理方法及系统
CN115665286A (zh) * 2022-12-26 2023-01-31 深圳红途科技有限公司 接口聚类方法、装置、计算机设备及存储介质
CN116346452A (zh) * 2023-03-17 2023-06-27 中国电子产业工程有限公司 一种基于stacking的多特征融合恶意加密流量识别方法和装置
CN116346452B (zh) * 2023-03-17 2023-12-01 中国电子产业工程有限公司 一种基于stacking的多特征融合恶意加密流量识别方法和装置

Also Published As

Publication number Publication date
EP4258610A1 (en) 2023-10-11
CN114697068A (zh) 2022-07-01
EP4258610A4 (en) 2024-03-20
US20230353585A1 (en) 2023-11-02

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 21914247

Country of ref document: EP

Kind code of ref document: A1

ENP Entry into the national phase

Ref document number: 2021914247

Country of ref document: EP

Effective date: 20230704

NENP Non-entry into the national phase

Ref country code: DE