WO2021047402A1 - Application identification method, device, and storage medium - Google Patents

Application identification method, device, and storage medium

Info

Publication number
WO2021047402A1
Authority
WO
WIPO (PCT)
Prior art keywords
service
port
services
address
domain name
Prior art date
Application number
PCT/CN2020/112316
Inventor
王璐
罗奇
华卓隽
王春桃
黄林杰
Original Assignee
华为技术有限公司
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by 华为技术有限公司
Priority to EP20863933.6A (EP4012980A4)
Publication of WO2021047402A1
Priority to US17/691,463 (US11863439B2)

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/02 Capturing of monitoring data
    • H04L43/028 Capturing of monitoring data by filtering
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/46 Cluster building
    • H04L47/00 Traffic control in data switching networks
    • H04L47/10 Flow control; Congestion control
    • H04L47/24 Traffic characterised by specific attributes, e.g. priority or QoS
    • H04L47/2483 Traffic characterised by specific attributes, e.g. priority or QoS involving identification of individual flows
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/02 Capturing of monitoring data
    • H04L43/026 Capturing of monitoring data using flow identification
    • H04L43/00 Arrangements for monitoring or testing data switching networks
    • H04L43/08 Monitoring or testing based on specific metrics, e.g. QoS, energy consumption or environmental parameters
    • H04L43/0876 Network utilisation, e.g. volume of load or congestion level
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/302 Route determination based on requested QoS
    • H04L45/306 Route determination based on the nature of the carried application
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/54 Organization of routing tables
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/74 Address processing for routing
    • H04L45/00 Routing or path finding of packets in data switching networks
    • H04L45/76 Routing in software-defined topologies, e.g. routing between virtual machines
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/50 Network service management, e.g. ensuring proper service fulfilment according to agreements
    • H04L41/5003 Managing SLA; Interaction between SLA and QoS
    • H04L41/5019 Ensuring fulfilment of SLA
    • H04L41/5025 Ensuring fulfilment of SLA by proactively reacting to service quality change, e.g. by reconfiguration after service quality degradation or upgrade

Definitions

  • This application relates to the field of communication technology, and in particular to an application identification method, device and storage medium.
  • deep packet inspection (DPI)
  • DPI technology needs to maintain a traffic characteristic database. When a new application appears, the database must be manually updated before the new application can be recognized, resulting in a low application recognition rate.
  • This application provides an application identification method, device, and storage medium, which can solve the problem in the related art that DPI technology identifies applications inefficiently.
  • the technical solution is as follows:
  • an application identification method includes:
  • the flow table includes a plurality of flow table entries, and each flow table entry in the plurality of flow table entries includes a quintuple and a flow start time;
  • the domain name table includes a plurality of domain name table entries, and each domain name table entry in the plurality of domain name table entries includes a source Internet Protocol (IP) address, a destination domain name, a destination IP address, and a domain name type;
  • a label corresponding to each application type in the multiple application types is determined, where the label is used to identify the application to which the data stream belongs.
  • a data stream can include one or more packets, and the five-tuples of the one or more packets are the same. In other words, one or more packets with the same quintuple can form a data stream.
  • the five-tuple in the flow table includes source IP address, source port, destination IP address, destination port, and protocol number.
  • the source IP address and source port are the client's IP address and port
  • the destination IP address and destination port are the server's IP address and port
  • the protocol number is the number of the transmission protocol used when the client communicates with the server.
  • the flow start time of each data flow is the reception time of the first message in each data flow.
  • the first message of each data stream is not necessarily the first message of the entire data stream, but the first message among the messages currently received when extracting features.
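The grouping of packets into data flows described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the `FiveTuple` and `Packet` types, the `build_flow_table` helper, and the use of seconds for timestamps are all hypothetical choices made for this example.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, NamedTuple


class FiveTuple(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: int  # transport protocol number, e.g. 6 for TCP, 17 for UDP


@dataclass(frozen=True)
class Packet:
    five_tuple: FiveTuple
    received_at: float  # reception time; seconds are an assumed unit


def build_flow_table(packets: Iterable[Packet]) -> Dict[FiveTuple, float]:
    """Map each five-tuple to its flow start time.

    The start time is the earliest reception time among the packets
    observed so far, which is not necessarily the true first packet
    of the whole flow.
    """
    flow_table: Dict[FiveTuple, float] = {}
    for pkt in packets:
        start = flow_table.get(pkt.five_tuple)
        if start is None or pkt.received_at < start:
            flow_table[pkt.five_tuple] = pkt.received_at
    return flow_table
```

Packets that share a five-tuple collapse into one entry, so the table has one row per data flow rather than one per packet.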
  • A.name resolves a host name or domain name to one IP address
  • C.name resolves multiple host names or domain names to another domain name.
  • a domain name is resolved by another domain name to an IP address, which is the same as the IP address resolved by A.name.
  • multiple C.names are equivalent to a branch of A.name.
  • each service is composed of an IP address and a port identifier, and an application can usually be composed of a group of services
  • multiple application types can be obtained.
  • Each application type includes multiple services, and each application type corresponds to one application.
  • the label of each application type in the multiple application types can be determined, so that the application to which the data stream belongs can be identified through the label. It can be seen that the process of identifying applications does not require a traffic characteristic database; instead, applications can be identified based on traffic behavior characteristics. In this way, when a new application appears, it can be identified directly based on the IP address and port of the server accessed by the new application, thereby improving the application recognition rate.
  • the analysis of traffic behavior characteristics according to the flow table to obtain multiple services includes:
  • a single client access service set and a first multi-client access service set are determined according to the flow table. Each service in the single client access service set is accessed by a single client, the IP address and port of the service belong to the same end, and the port does not belong to the loop port set
  • each service in the first multi-client access service set is accessed by multiple clients, the IP address and port of the service belong to the same end, and the port does not belong to the loop port set;
  • a second multi-client access service set is determined according to the flow table. Each service in the second multi-client access service set is accessed by multiple clients, the IP address and port of the service belong to different ends, and the port does not belong to the loop port set;
  • the determining a port with a loop according to the flow table includes:
  • if the first total flow number is greater than the first threshold, determine the total flow number of all data flows via the port, and use the determined total flow number as the second total flow number;
  • the port is a port with a loop.
  • the same-end IP address set of the port refers to a set of IP addresses that belong to the same side as the port
  • the peer IP address set of the port refers to a set of IP addresses that belong to a different side than the port.
  • a loop port means that the source IP and destination IP of most data flows through the port are the same, that is, the source and destination devices of most data flows through the port are the same device.
  • the multiple IP addresses determined by the intersection of the same end IP address set and the opposite end IP address set are the IP addresses of these devices as both the source and the destination.
  • the port can be considered a potential loop port.
  • the first total flow number can be determined. If the first total flow number is greater than the first threshold, the second total flow number can be further determined, along with the ratio between the first total flow number and the second total flow number.
  • if the ratio is greater than the second threshold, this indicates that the source IP and destination IP of most of the data flows through the port are the same, that is, the source device and the destination device of most of the data flows through the port are the same device, and it can be determined that the port is a port with a loop.
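The loop-port test above can be sketched as a two-threshold check. The excerpt does not spell out exactly which flows make up the first total flow number, so this sketch assumes it counts the flows via the port whose source and destination IP addresses both fall in the intersection of the same-end and peer IP address sets. The `Flow` type and the `is_loop_port` helper are hypothetical names used only for illustration.

```python
from typing import Iterable, NamedTuple


class Flow(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def is_loop_port(flows: Iterable[Flow], port: int,
                 first_threshold: int, second_threshold: float) -> bool:
    """Return True if `port` looks like a loop port under the stated assumption."""
    via_port = [f for f in flows if port in (f.src_port, f.dst_port)]
    same_end, peer = set(), set()
    for f in via_port:
        if f.src_port == port:
            same_end.add(f.src_ip)
            peer.add(f.dst_ip)
        if f.dst_port == port:
            same_end.add(f.dst_ip)
            peer.add(f.src_ip)
    both = same_end & peer  # IPs acting as both source and destination

    # Assumption: the first total counts flows whose two IPs are both in `both`.
    first_total = sum(1 for f in via_port
                      if f.src_ip in both and f.dst_ip in both)
    if not via_port or first_total <= first_threshold:
        return False
    second_total = len(via_port)  # all data flows via the port
    return first_total / second_total > second_threshold
```

The first threshold filters out ports with too few candidate flows before the ratio is computed, so a port with one or two self-addressed flows is not flagged on ratio alone.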
  • the determining a single-client access service set and a first multi-client access service set based on the loop port set according to the flow table includes:
  • according to the flow table, determine the services that are accessed by a single client and whose IP address and port belong to the same end, and the services that are accessed by multiple clients and whose IP address and port belong to the same end, to obtain a single-client access potential service set and a first multi-client access potential service set;
  • the determining, according to the flow table, a service that is accessed by a single client and whose IP address and port belong to the same end, and a service that is accessed by multiple clients and whose IP address and port belong to the same end includes:
  • each target service corresponds to an IP address and port belonging to the same end in a flow table entry, and each target service corresponds to multiple data flows;
  • if the port of the target service is randomly generated, determining whether the number of same-side ports corresponding to the IP address of the target service is greater than a third threshold
  • the target service is a service that is accessed by multiple clients and whose IP address and port belong to the same end;
  • the target service is a service accessed by a single client and the IP address and port belong to the same end.
  • a target service may be composed of a source IP address and a source port, or it may be composed of a destination IP address and a destination port.
  • the above target service corresponds to the IP address and port belonging to the same end in a flow table entry
  • the target service corresponds to multiple data flows. That is, if multiple data flows in the flow table share the same IP address and port belonging to the same end of a flow table entry, that IP address and port can be used as a target service. For example, if multiple data flows in the flow table share the destination IP address and destination port of one flow table entry, that destination IP address and destination port can be determined as a target service. Similarly, if multiple data flows in the flow table share the source IP address and source port of one flow table entry, that source IP address and source port can be determined as a target service.
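The selection of target services described above can be sketched as a count over same-end (IP address, port) pairs. The `Flow` type and the `find_target_services` helper are hypothetical names; the sketch covers only the "corresponds to multiple data flows" condition and none of the later single-client or multi-client checks.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Set, Tuple

Service = Tuple[str, int]  # (IP address, port)


class Flow(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def find_target_services(flows: Iterable[Flow]) -> Set[Service]:
    """Return same-end (IP, port) pairs shared by more than one data flow."""
    counts: Counter = Counter()
    for f in flows:
        counts[(f.src_ip, f.src_port)] += 1  # source end of the entry
        counts[(f.dst_ip, f.dst_port)] += 1  # destination end of the entry
    return {svc for svc, n in counts.items() if n > 1}
```

With three clients each opening one flow to 10.0.0.9:443, only that destination pair is returned, because each client-side pair appears in a single flow.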
  • the IP address and port of the target service may be the IP address and port of the server. In other words, the target service may be a service that is accessed by multiple clients and whose IP address and port belong to the same end.
  • the target service may be a service that is accessed by multiple clients and the IP address and port belong to the same end.
  • the method further includes:
  • the target service is a service that is accessed by multiple clients and whose IP address and port belong to the same end.
  • if the port number of the port of the target service is less than 1024, it is determined that the port is a well-known port, that is, the port is not randomly generated. If the port number of the port of the target service is greater than 1024, it is determined that the port is randomly generated.
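The port check above reduces to a single comparison. The text does not say how port 1024 itself is classified; this sketch assumes it counts as randomly generated, which matches the conventional well-known port range of 0 to 1023. The function name `is_randomly_generated` is hypothetical.

```python
def is_randomly_generated(port: int) -> bool:
    """Classify a port as randomly generated (True) or well-known (False).

    Assumption: port 1024 is treated as randomly generated, since the
    source text only covers "less than 1024" and "greater than 1024".
    """
    if not 0 <= port <= 65535:
        raise ValueError(f"port out of range: {port}")
    return port >= 1024
```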
  • the method further includes:
  • the target service is a service that is accessed by multiple clients and the IP address and port belong to the same end.
  • the determining a second multi-client access service set based on the loop port set and the first multi-client access service set according to the flow table includes:
  • the determining, based on the first multi-client access service set, according to the flow table, that the services that are accessed by multiple clients and whose IP addresses and ports belong to different ends include:
  • taking an IP address and a port that are located in the same flow table entry and belong to different ends as one reference service, to obtain multiple reference services;
  • for each of the multiple reference services, determining whether the reference service corresponds to multiple data streams
  • if the reference service corresponds to multiple data streams, determining whether the port of the reference service is randomly generated
  • the reference service is a service that is accessed by multiple clients and the IP address and port belong to different ends.
  • the reference service is also composed of an IP address and a port, but unlike the target service, the IP address and port of the reference service belong to different ends.
  • the destination IP address and source port in the same flow table entry can constitute a reference service, and the source IP address and destination port in the same flow table entry can also constitute a reference service.
  • the source IP address and the destination IP address may be reversed in the flow table obtained by feature extraction, so the source IP address that constitutes a reference service may actually be a destination IP address, or the destination IP address may actually be a source IP address. As a result, the flow table alone cannot establish whether the IP address and port in the reference service belong to different ends of the same flow table entry. The source IP address in the domain name table, however, is always the correct source IP address. Therefore, if it is determined that the IP address of the reference service is not among the source IP addresses of the domain name table, it can be determined that the reference service is a service that is accessed by multiple clients and whose IP address and port belong to different ends.
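The reference-service check can be sketched as follows. This is a deliberately simplified illustration: it keeps only the "multiple data streams" condition and the domain-name-table check described above, and omits the random-port test, the loop port set, and the comparison against the first multi-client access service set. The `Flow` type and the `find_cross_end_services` helper are hypothetical names.

```python
from collections import Counter
from typing import Iterable, NamedTuple, Set, Tuple

Service = Tuple[str, int]  # (IP address, port)


class Flow(NamedTuple):
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int


def find_cross_end_services(flows: Iterable[Flow],
                            domain_table_source_ips: Set[str]) -> Set[Service]:
    """Return reference services that pass the simplified checks."""
    counts: Counter = Counter()
    for f in flows:
        # A reference service pairs an IP with the port of the other end.
        counts[(f.dst_ip, f.src_port)] += 1
        counts[(f.src_ip, f.dst_port)] += 1
    return {
        svc for svc, n in counts.items()
        if n > 1 and svc[0] not in domain_table_source_ips
    }
```

In the example below the flow table has the two IP addresses swapped (the server 10.0.0.9 appears as the source), yet pairing the source IP with the destination port still recovers 10.0.0.9:443, because 10.0.0.9 never appears as a source in the domain name table.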
  • the clustering of the multiple services according to the flow table and the domain name table to obtain multiple application types includes:
  • multiple first services and multiple second services are obtained from the multiple services, and the multiple first services refer to services that are accessed by multiple clients and have corresponding domain names.
  • the multiple second services include services accessed by multiple clients without a corresponding domain name and services accessed by a single client;
  • the time correlation clustering result, the periodic clustering result, the semantic correlation clustering result, and the client similarity clustering result are merged to obtain the multiple application types.
  • the multiple services include services accessed by multiple clients as well as services accessed by a single client, and each domain name table entry in the domain name table includes a source IP address, a destination domain name, a destination IP address, and a domain name type. Therefore, for the services accessed by multiple clients, it can be determined from the domain name table whether each service has a corresponding domain name. The services that are accessed by multiple clients and have a corresponding domain name are filtered out to obtain the multiple first services. At the same time, the services that are accessed by multiple clients but have no corresponding domain name can also be filtered out.
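The split into first and second services can be sketched as a lookup against the domain name table. The sketch assumes that a service "has a corresponding domain name" when its IP address appears as a destination IP address in some domain name table entry; the excerpt does not state the matching rule explicitly. `DomainEntry` and `split_services` are hypothetical names.

```python
from typing import Iterable, NamedTuple, Set, Tuple

Service = Tuple[str, int]  # (IP address, port)


class DomainEntry(NamedTuple):
    src_ip: str
    dst_domain: str
    dst_ip: str
    domain_type: str


def split_services(multi_client: Iterable[Service],
                   single_client: Iterable[Service],
                   domain_table: Iterable[DomainEntry],
                   ) -> Tuple[Set[Service], Set[Service]]:
    """Return (first_services, second_services)."""
    # Assumption: match services to domain names by destination IP address.
    ips_with_domain = {e.dst_ip for e in domain_table}
    multi = set(multi_client)
    first = {s for s in multi if s[0] in ips_with_domain}
    # Second services: multi-client services without a domain name,
    # plus every single-client service.
    second = (multi - first) | set(single_client)
    return first, second
```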
  • the performing time correlation clustering on the multiple services according to the flow table and the domain name table to obtain a time correlation clustering result includes:
  • the selecting a periodic service from the multiple services according to the flow table includes:
  • if the periodicity of the service is strong periodicity, it is determined that the service is a periodic service.
  • the performing semantic relevance clustering on the multiple first services to obtain a semantic relevance clustering result includes:
  • clustering the un-clustered services among the plurality of first services with the plurality of second clustering results according to the semantic relevance of their domain names, to obtain the semantic relevance clustering result.
  • the clustering the multiple first services according to the semantic relevance of the domain name includes:
  • the services corresponding to the non-combinable domain names are clustered.
  • the domain name corresponding to the first service can be determined from the domain name table. If the domain name corresponding to the first service is unique, the first service is determined to be a third service, that is, a service accessed by multiple clients that corresponds to a unique domain name. If the domain name corresponding to the first service is not unique, the first service is determined to be a fourth service, that is, a service accessed by multiple clients that corresponds to multiple domain names.
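The distinction between third and fourth services can be sketched by counting the distinct domain names recorded for each service. As an assumption, domain names are matched to a service through the destination IP address of the domain name table entries. `DomainEntry` and `split_by_domain_uniqueness` are hypothetical names.

```python
from typing import Dict, Iterable, NamedTuple, Set, Tuple

Service = Tuple[str, int]  # (IP address, port)


class DomainEntry(NamedTuple):
    src_ip: str
    dst_domain: str
    dst_ip: str
    domain_type: str


def split_by_domain_uniqueness(first_services: Iterable[Service],
                               domain_table: Iterable[DomainEntry],
                               ) -> Tuple[Set[Service], Set[Service]]:
    """Return (third_services, fourth_services)."""
    domains_by_ip: Dict[str, Set[str]] = {}
    for e in domain_table:
        domains_by_ip.setdefault(e.dst_ip, set()).add(e.dst_domain)

    third: Set[Service] = set()   # exactly one domain name
    fourth: Set[Service] = set()  # more than one domain name
    for svc in first_services:
        n = len(domains_by_ip.get(svc[0], ()))
        if n == 1:
            third.add(svc)
        elif n > 1:
            fourth.add(svc)
    return third, fourth
```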
  • the performing client similarity clustering on the multiple second services to obtain a client similarity clustering result includes:
  • the determining a label corresponding to each application type in the multiple application types includes:
  • the multiple application types are divided into a first application group, a second application group, and a third application group.
  • the services included in each application type in the first application group have a corresponding domain name, the services included in each application type in the second application group have no corresponding domain name, and each application type in the third application group corresponds to an un-clustered service;
  • the label corresponding to each application type in the second application group and the third application group is determined.
  • After the network device determines the label corresponding to each application type in the multiple application types, it may also display the label corresponding to each application type in the multiple application types.
  • an application recognition device is provided, and the application recognition device has the function of realizing the behavior of the application recognition method in the first aspect.
  • the application identification device includes at least one module, and the at least one module is used to implement the application identification method provided in the above-mentioned first aspect.
  • in a third aspect, a network device is provided. The network device includes a processor and a memory, and the memory is configured to store a program for executing the application identification method provided in the first aspect, and to store the data involved in implementing the application identification method provided in the first aspect.
  • the processor is configured to execute a program stored in the memory.
  • the network device may further include a communication bus, and the communication bus is used to establish a connection between the processor and the memory.
  • a computer-readable storage medium is provided, and instructions are stored in the computer-readable storage medium.
  • When the instructions are run on a computer, the computer executes the application identification method described in the first aspect.
  • a computer program product containing instructions is provided. When the computer program product runs on a computer, it causes the computer to execute the application identification method described in the first aspect.
  • each service is composed of an IP address and a port identifier, and an application can usually be composed of a group of services
  • multiple application types can be obtained.
  • Each application type includes multiple services, and each application type corresponds to one application.
  • the label of each application type in the multiple application types can be determined, so that the application to which the data stream belongs can be identified through the label. It can be seen that this application does not need a traffic characteristic database in the process of identifying applications; instead, applications can be identified based on traffic behavior characteristics. In this way, when a new application appears, it can be identified directly based on the IP address and port of the server accessed by the new application, thereby improving the application recognition rate.
  • FIG. 1 is an architecture diagram of an application identification system provided by an embodiment of the present application
  • FIG. 2 is a schematic structural diagram of a computer device provided by an embodiment of the present application.
  • FIG. 3 is a flowchart of an application identification method provided by an embodiment of the present application.
  • FIG. 4 is a schematic diagram of an undirected graph provided by an embodiment of the present application.
  • FIG. 5 is a schematic structural diagram of an application identification device provided by an embodiment of the present application.
  • FIG. 6 is a schematic diagram of an analysis module provided by an embodiment of the present application.
  • FIG. 7 is a schematic diagram of a clustering module provided by an embodiment of the present application.
  • the key business of an enterprise is usually the business corresponding to the enterprise's private applications. Therefore, to improve the quality of the key business, it is usually necessary to identify the enterprise's private applications, so that the enterprise's network manager can configure policies that improve the quality of the key business.
  • for private applications, the bandwidth required for normal operation can be guaranteed, and for public network applications, rate limiting can be applied. That is, the data flows corresponding to private applications are not rate limited, while the data flows corresponding to public network applications are, thereby improving the quality of key services.
  • FIG. 1 is an architecture diagram of an application identification system provided by an embodiment of the present application.
  • the system includes multiple clients 101, a network device 102, and multiple servers 103.
  • Each client 101 and the network device 102 are connected in a wired or wireless manner for communication, and each server 103 and the network device 102 are also connected in a wired or wireless manner for communication.
  • an application is installed on the client 101, and data streams are generated when the client 101 runs the application.
  • the client 101 can send these data streams to the network device 102.
  • after the network device 102 receives these data streams, it can process them to identify the applications corresponding to the data streams. Later, when these data streams are transmitted to the server 103, the server 103 can process them in response to the operation of the client 101.
  • the application installed on the client 101 may be a private application or a public network application.
  • private applications refer to applications used internally by the enterprise
  • public network applications refer to applications that anyone can use.
  • a private application can be an application used for communication within an enterprise
  • a public network application can be an application used for communication both within and outside the enterprise.
  • the client 101 can be any electronic product that can interact with the user through one or more of a keyboard, touch panel, touch screen, remote control, voice interaction, or handwriting device, such as a personal computer (PC), mobile phone, smartphone, personal digital assistant (PDA), wearable device, pocket PC (PPC), tablet, smart car, smart TV, or smart speaker.
  • the network device 102 may be a core switch, an access switch, a router, and other devices.
  • the server 103 may be a server, a server cluster composed of multiple servers, or a cloud computing service center.
  • FIG. 1 only uses 3 clients and 3 servers to illustrate the application identification system, which does not constitute a limitation to the embodiment of the present application.
  • the application identification method provided by the embodiments of the present application can be used in the identification of private applications of enterprises, as well as the identification of public network applications.
  • FIG. 2 is a schematic structural diagram of a computer device according to an embodiment of the present application.
  • the computer device may be the client 101, the network device 102, or the server 103 shown in FIG. 1.
  • the computer device includes at least one processor 201, a communication bus 202, a memory 203, and at least one communication interface 204.
  • the processor 201 may be a general-purpose central processing unit (CPU), a network processor (NP), a microprocessor, or one or more integrated circuits used to implement the solution of the present application, for example, an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof.
  • the above-mentioned PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), a generic array logic (GAL) or any combination thereof.
  • the communication bus 202 is used to transfer information between the above-mentioned components.
  • the communication bus 202 can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in the figure, but it does not mean that there is only one bus or one type of bus.
  • the memory 203 can be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, or a random access memory (RAM) or another type of dynamic storage device that can store information and instructions. It can also be an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, Blu-ray discs, etc.), magnetic disk storage media or other magnetic storage devices, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto.
  • the memory 203 may exist independently and is connected to the processor 201 through the communication bus 202.
  • the memory 203 may also be integrated with the processor 201.
  • the communication interface 204 is used to communicate with other devices or a communication network.
  • the communication interface 204 includes a wired communication interface, and may also include a wireless communication interface.
  • the wired communication interface may be, for example, an Ethernet interface.
  • the Ethernet interface can be an optical interface, an electrical interface, or a combination thereof.
  • the wireless communication interface may be a wireless local area network (WLAN) interface, a cellular network communication interface, or a combination thereof.
  • the processor 201 may include one or more CPUs, such as CPU0 and CPU1 as shown in FIG. 2.
  • the computer device may include multiple processors, such as the processor 201 and the processor 205 as shown in FIG. 2.
  • each of these processors can be a single-core processor (single-CPU) or a multi-core processor (multi-CPU).
  • the processor here may refer to one or more devices, circuits, and/or processing cores for processing data (such as computer program instructions).
  • the computer device may further include an output device 206 and an input device 207.
  • the output device 206 communicates with the processor 201 and can display information in a variety of ways.
  • the output device 206 may be a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, or a projector, etc.
  • the input device 207 communicates with the processor 201, and can receive user input in a variety of ways.
  • the input device 207 may be a mouse, a keyboard, a touch screen device, a sensor device, or the like.
  • the memory 203 is used to store the program code 210 for executing the solution of the present application, and the processor 201 can execute the program code 210 stored in the memory 203.
  • the computer device may implement the application identification method provided in the embodiment of FIG. 3 below through the processor 201 and the program code 210 in the memory 203.
  • FIG. 3 is a flowchart of an application identification method provided by an embodiment of the present application. The method is applied to a network device in the application identification system shown in FIG. 1. Referring to FIG. 3, the method includes the following steps.
  • Step 301 The network device extracts features from multiple data flows to obtain a flow table and a domain name table.
  • the flow table includes multiple flow entries, and each of the multiple flow entries includes a five-tuple and a flow start time.
  • the domain name table includes multiple domain name table entries, and each of the multiple domain name table entries includes a source IP address, a destination domain name, a destination IP address, and a domain name type.
  • a data stream can include one or more packets, and the five-tuples of the one or more packets are the same.
  • one or more packets with the same quintuple can form a data stream.
  • the network device can extract the quintuple, destination domain name, and domain name type from any packet included in each data flow, and determine the reception time of the first packet in each data flow as the flow start time.
  • the network device can then obtain the source IP address and the destination IP address from the quintuple, generate a flow table based on the quintuple and flow start time of each data flow, and generate a domain name table based on the source IP address, destination domain name, destination IP address, and domain name type of each data flow.
  • the network device can obtain the source IP address and the destination IP address of the data stream from the quintuple.
  • the source IP address and source port of the message are the client's IP address and port
  • the destination IP address and destination port are the server's IP address and port.
  • the protocol number is the number of the transmission protocol used when the client and the server communicate.
  • the stream start time of each data stream is the receiving time of the first packet in each data stream.
  • the first packet of each data stream is not necessarily the first packet of the entire data stream, but the first packet among the packets received in the period for which features are being extracted. For example, suppose features are to be extracted for all data streams collected between 1:30 and 2:00. If the first packet of data stream A was received at 1:00, and the first packet of data stream A received between 1:30 and 2:00 arrived at 1:31, then the flow start time of data stream A is 1:31.
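The flow-table part of step 301 can be sketched as follows. This is a minimal illustration, not the patented implementation: the `Packet` class, the `build_flow_table` function, and the use of plain numbers for reception times are all assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Packet:
    # Hypothetical packet record; field names are illustrative only.
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: int
    recv_time: float  # reception time, e.g. minutes since midnight


def build_flow_table(packets, window_start, window_end):
    """Map each five-tuple to its flow start time within the collection window.

    The flow start time is the reception time of the first packet of the
    flow that falls inside [window_start, window_end], not necessarily the
    first packet the flow ever sent.
    """
    flow_table = {}
    for p in packets:
        if not (window_start <= p.recv_time <= window_end):
            continue  # packet lies outside the current collection period
        five_tuple = (p.src_ip, p.src_port, p.dst_ip, p.dst_port, p.protocol)
        if five_tuple not in flow_table or p.recv_time < flow_table[five_tuple]:
            flow_table[five_tuple] = p.recv_time
    return flow_table


# Data stream A: first packet at 1:00 (minute 60), next packets at 1:31 and 1:35.
stream_a = [
    Packet("IP01", 5000, "IP001", 80, 6, 60.0),
    Packet("IP01", 5000, "IP001", 80, 6, 91.0),
    Packet("IP01", 5000, "IP001", 80, 6, 95.0),
]
flow_table = build_flow_table(stream_a, window_start=90.0, window_end=120.0)
```

With a 1:30 to 2:00 window, the packet received at 1:00 is ignored and the flow start time of data stream A is the 1:31 reception time, matching the example in the text.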
  • A.name resolves a host name or domain name to an IP address
  • C.name resolves multiple host names or domain names to another domain name, which is in turn resolved to an IP address; this IP address is the same as the IP address resolved through A.name.
  • multiple C.names are equivalent to a branch of A.name.
  • the network device may also set a trigger condition for feature extraction, that is, when the trigger condition is met, features are extracted from multiple data streams.
  • the network device can determine whether the data volume of the currently collected data streams reaches the data volume threshold. When it does, the network device can extract features from each of the collected data streams. For example, if the data volume threshold set by the network device is 200M, the network device can check, while collecting data streams, whether the data volume of the collected data streams has reached 200M and, if so, extract features from each of the collected data streams.
  • the network device may track the time difference between the start time of collection and the current time, and when the time difference reaches the first time threshold, extract features from the multiple collected data streams. For example, if the first time threshold set by the network device is 30 minutes, the network device can track the time difference between the start time of collection and the current time, and when the time difference reaches 30 minutes, extract features from each of the collected data streams.
  • the foregoing data volume threshold and the first time threshold can be set according to requirements.
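The two trigger conditions can be sketched as a single check. The function name `should_extract_features`, the choice to combine both conditions with a logical OR, and the reading of "200M" as 200 MiB are assumptions made for illustration; the text presents the two conditions as alternatives and does not fix the unit.

```python
def should_extract_features(collected_bytes, collection_start, now,
                            data_volume_threshold=200 * 1024 * 1024,
                            first_time_threshold=30 * 60):
    """Return True when either trigger condition for feature extraction is met.

    collected_bytes: data volume of the data streams collected so far.
    collection_start, now: timestamps in seconds.
    Both thresholds are configurable, as stated in the text.
    """
    volume_reached = collected_bytes >= data_volume_threshold
    time_reached = (now - collection_start) >= first_time_threshold
    return volume_reached or time_reached
```

A device that uses only one of the two conditions could pass a very large value for the other threshold.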
  • the flow table and the domain name table may be preprocessed.
  • the network device can de-duplicate the flow table entries in the flow table and merge the flow table entries. After the merging, the flow table entries with incomplete information can be deleted to obtain the preprocessed flow table.
  • the network device can de-duplicate the domain name table entries in the domain name table and retain only the domain name table entries whose domain name type is A.name. After this filtering, the domain name table entries with incomplete information can be deleted to obtain the preprocessed domain name table.
  • the flow table and domain name table obtained after the feature extraction can be as shown in the following Table 1 and Table 2.
  • the first flow table entry and the second flow table entry in Table 1 are duplicates, so either the first flow table entry or the second flow table entry can be deleted.
  • the quintuples of the second flow table entry and the third flow table entry in Table 1 are the same, but their flow start times are different.
  • the flow table entries with incomplete information in the flow table need to be deleted. See Table 1.
  • the first domain name table entry and the second domain name table entry in Table 2 are duplicates, so either the first domain name table entry or the second domain name table entry can be deleted.
  • the domain name type of the 3rd, 5th, and 7th domain name table entries in Table 2 is C.name. Because a domain name table entry whose domain name type is C.name is equivalent to a branch of a domain name table entry whose domain name type is A.name, the entries whose domain name type is A.name can be kept and the entries whose domain name type is C.name can be deleted. This retains the data streams with prominent characteristics for the analysis of flow behavior characteristics, so that the multiple services can be determined more accurately.
  • Table 2

      Source IP   Destination domain name   Destination IP   Domain type
      IP01        abpd-jap.xxx.com          IP001            A.name
      IP02        acnd-jap.xxx.com          IP002            C.name
      IP03        abnc-jp.xxx.com           IP003            A.name
      IP04        afnd-hx.xxx.com           IP004            C.name
      IP05        -                         IP005            A.name
      IP06        acjd-jap.xxx.com          IP006            C.name
      IP07        abed-jap.xxx.com          -                A.name
      IP08        abnd-hx.xxx.com           IP008            A.name
      IP09        abnt-jap.xxx.com          IP009            A.name
      IP10        asnd-jp.xxx.com           IP011            A.name
      IP11        atnd-jap.xxx.com          IP012            A.name
      IP12        abyd-jnp.xxx.com          IP013            A.name
      ...         ...                       ...              ...
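The preprocessing of the domain name table can be sketched as follows. The tuple layout, the function name `preprocess_domain_table`, and the use of `None` for a missing field are assumptions for this example; the sample rows reuse values from the domain name table.

```python
def preprocess_domain_table(entries):
    """De-duplicate, keep only A.name entries, and drop incomplete entries.

    entries: list of (source_ip, destination_domain, destination_ip, domain_type)
    tuples, where a missing field is represented as None (an assumption).
    """
    seen = set()
    result = []
    for entry in entries:
        if entry in seen:
            continue  # exact duplicate of an earlier entry
        seen.add(entry)
        source_ip, destination_domain, destination_ip, domain_type = entry
        if domain_type != "A.name":
            continue  # C.name entries are treated as branches of A.name entries
        if not (source_ip and destination_domain and destination_ip):
            continue  # incomplete information
        result.append(entry)
    return result


domain_table = [
    ("IP01", "abpd-jap.xxx.com", "IP001", "A.name"),
    ("IP01", "abpd-jap.xxx.com", "IP001", "A.name"),  # duplicate
    ("IP02", "acnd-jap.xxx.com", "IP002", "C.name"),  # C.name, removed
    ("IP05", None, "IP005", "A.name"),                # missing domain name
    ("IP08", "abnd-hx.xxx.com", "IP008", "A.name"),
]
preprocessed = preprocess_domain_table(domain_table)
```

Only the first IP01 entry and the IP08 entry survive in this sample. The flow table can be handled along the same lines, although the rule for merging entries that share a quintuple is not spelled out in the text and is therefore left out here.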
  • preprocessing the flow table and domain name table is optional, that is, the subsequent steps can be implemented based on the flow table and domain name table before preprocessing, or can be implemented based on the preprocessed flow table and domain name table.
  • the implementation process of the two is similar.
  • the preprocessed flow table and domain name table are taken as examples to explain the subsequent steps. That is, the flow table and domain name table mentioned in the subsequent steps are all preprocessed flow table and domain name table.
  • although the flow table and the domain name table can be obtained through step 301, the volume of data in the two tables is huge, the number of flows is large, and the flow behavior characteristics are complex, so it is difficult to identify the application corresponding to a data flow directly from the flow table and the domain name table.
  • an application is usually composed of a set of services, and a service is composed of an IP address and a port identifier, so the application corresponding to the data stream can be identified by a step-by-step process.
  • specifically, the flow behavior characteristics are first analyzed according to the flow table in step 302 to obtain multiple services, the multiple services are then clustered in step 303 to obtain multiple application types, and finally the label corresponding to each application type is determined, thereby identifying the application corresponding to the data stream.
  • Step 302 The network device analyzes the flow behavior characteristics according to the flow table to obtain multiple services, and each service is composed of an IP address and a port identifier.
  • step 302 can be implemented by the following steps (1)-(4):
  • for any port in the flow table, the network device may obtain the same-end IP address set and the opposite-end IP address set of the port according to the flow table, and determine the intersection of the two sets to obtain multiple IP addresses.
  • among all data flows passing through the port, the total flow number of the data flows corresponding to the multiple IP addresses is determined and used as the first total flow number. If the first total flow number is greater than the first threshold, the total flow number of all data flows passing through the port is determined and used as the second total flow number. If the ratio of the first total flow number to the second total flow number is greater than the second threshold, the port is determined to be a port with a loop. After that, the ports with loops in the flow table can be grouped into a loop port set.
  • the process by which the network device determines, according to the flow table, the total flow number of all data flows passing through the port and the total flow number of the data flows corresponding to the multiple IP addresses may be as follows: select from the flow table the flow table entries in which the port is located, count the selected flow table entries, and take the count as the total flow number of all data flows passing through the port. Then count the selected flow table entries whose source IP address or destination IP address is any one of the multiple IP addresses, and take that count as the total flow number of the data flows corresponding to the multiple IP addresses among all data flows passing through the port.
  • for example, when the port is a source port, the flow table entries whose source port is the port are selected, the selected flow table entries are counted, and the count is taken as the total flow number of all data flows passing through the port.
  • the same-end IP address set of the port refers to a set of IP addresses that belong to the same side as the port
  • the peer IP address set of the port refers to a set of IP addresses that belong to a different side than the port.
  • when the port is a source port, the same-end IP address set of the port refers to the source IP address set, and the opposite-end IP address set of the port refers to the destination IP address set.
  • when the port is a destination port, the same-end IP address set of the port refers to the destination IP address set, and the opposite-end IP address set of the port refers to the source IP address set.
  • the foregoing first threshold and second threshold may be set according to requirements, for example, the first threshold may be 20, and the second threshold may be 0.2.
  • the multiple IP addresses obtained from the intersection of the same-end IP address set and the opposite-end IP address set are the IP addresses of devices that act as both source and destination. In this case, the port can be considered a potential port with a loop.
  • the first total flow number can then be determined. If the first total flow number is greater than the first threshold, the second total flow number can be further determined, and the ratio of the first total flow number to the second total flow number can be calculated.
  • if the ratio is greater than the second threshold, this indicates that the source IP and destination IP of most of the data flows passing through the port are the same, that is, the source device and the destination device of most of the data flows passing through the port are the same device, so the port can be determined to be a port with a loop.
  • for example, suppose the destination port of 100 flow entries in Table 3 is Port001, and the destination IP address and source IP address of 30 of these flow entries are the same. Then the first total flow number is 30, and the second total flow number is 100.
  • the first threshold set by the network device according to requirements is 20, and the second threshold is 0.2.
  • the first total flow number 30 is greater than the first threshold value 20, so it is necessary to further determine the ratio of the first total flow number to the second total flow number.
  • the ratio is 0.3 and the ratio is greater than the second threshold 0.2, so the port Port001 is determined as a port with a loop.
  • as another example, suppose the source port of 150 flow entries in Table 3 is Port02, and 50 of these flow entries have the same source IP address and destination IP address. Then the first total flow number is 50, and the second total flow number is 150. The first total flow number 50 is greater than the first threshold 20, so the ratio of the first total flow number to the second total flow number needs to be determined. The ratio is approximately 0.33, which is greater than the second threshold 0.2, so the port Port02 is determined to be a port with a loop.
  • if the first total flow number is not greater than the first threshold, or the ratio is not greater than the second threshold, it is determined that the port is not a port with a loop.
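The loop-port check described above can be sketched as follows. The function name `is_loop_port`, the tuple layout of a flow entry, and the decision to build the two IP address sets only from the entries in which the port appears are assumptions for this example.

```python
def is_loop_port(flows, port, side, first_threshold=20, second_threshold=0.2):
    """Decide whether `port` is a port with a loop.

    flows: iterable of (src_ip, src_port, dst_ip, dst_port, protocol) tuples.
    side: "src" if `port` is a source port, "dst" if it is a destination port.
    """
    port_idx = 1 if side == "src" else 3
    same_idx, opposite_idx = (0, 2) if side == "src" else (2, 0)

    # Flow entries in which the port is located.
    selected = [f for f in flows if f[port_idx] == port]
    same_end_ips = {f[same_idx] for f in selected}
    opposite_end_ips = {f[opposite_idx] for f in selected}
    both_ends = same_end_ips & opposite_end_ips

    # First total flow number: flows through the port that involve an
    # IP address appearing on both ends.
    first_total = sum(1 for f in selected if f[0] in both_ends or f[2] in both_ends)
    if first_total <= first_threshold:
        return False

    # Second total flow number: all flows through the port.
    second_total = len(selected)
    return first_total / second_total > second_threshold


# 100 flows to destination port 8080, 30 of which have identical source
# and destination IP addresses (the Port001 example with made-up values).
loop_flows = [("IP_L", 40000 + i, "IP_L", 8080, 6) for i in range(30)]
normal_flows = [(f"C{i}", 50000 + i, "IP_S", 8080, 6) for i in range(70)]
port_has_loop = is_loop_port(loop_flows + normal_flows, 8080, "dst")
```

Here the first total flow number is 30 and the second is 100, so the ratio 0.3 exceeds the second threshold 0.2 and the port is reported as a port with a loop.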
  • each service in the single-client access service set is accessed by a single client, the IP address and port of the service belong to the same end, and the port does not belong to the loop port set. Each service in the first multi-client access service set is accessed by multiple clients, the IP address and port of the service belong to the same end, and the port does not belong to the loop port set.
  • the network device can determine, according to the flow table, the services that are accessed by a single client and whose IP address and port belong to the same end, and the services that are accessed by multiple clients and whose IP address and port belong to the same end, to obtain a single-client access potential service set and a first multi-client access potential service set. The services whose ports are in the loop port set are filtered out of the single-client access potential service set to obtain the single-client access service set. The services whose ports are in the loop port set are filtered out of the first multi-client access potential service set to obtain the first multi-client access service set.
  • the process by which the network device determines these two kinds of services according to the flow table may be as follows: determine multiple target services according to the flow table, where each target service corresponds to an IP address and port belonging to the same end in a flow table entry, and each target service corresponds to multiple data flows. For each target service among the multiple target services, determine whether the port of the target service is randomly generated. If the port of the target service is randomly generated, determine whether the number of same-side ports corresponding to the IP address of the target service is greater than the third threshold.
  • if the number of same-side ports corresponding to the IP address of the target service is greater than the third threshold, determine whether the number of peers corresponding to the target service is greater than the fourth threshold. If the number of peers corresponding to the target service is greater than the fourth threshold, the target service is determined to be a service that is accessed by multiple clients and whose IP address and port belong to the same end. If the number of peers corresponding to the target service is not greater than the fourth threshold, determine whether the peer IP address of the target service is unique. If the peer IP address of the target service is unique, the target service is determined to be a service that is accessed by a single client and whose IP address and port belong to the same end.
  • as stated above, each target service corresponds to an IP address and port belonging to the same end in a flow table entry, and each target service corresponds to multiple data flows. That is, if multiple data flows in the flow table correspond to the same-end IP address and port of one flow table entry, that IP address and port can be taken as a target service. For example, if multiple data flows in the flow table correspond to the destination IP address and destination port of one flow table entry, that destination IP address and destination port can be determined as a target service. Similarly, if multiple data flows in the flow table correspond to the source IP address and source port of one flow table entry, that source IP address and source port can be determined as a target service.
  • the process by which the network device determines whether the target service corresponds to multiple data flows may be: the network device determines the number of flow table entries in the flow table in which the target service is located. If the determined number of flow table entries is greater than the fifth threshold, it is determined that the target service corresponds to multiple data flows; if the determined number is not greater than the fifth threshold, it is determined that the target service does not correspond to multiple data flows.
  • the foregoing third, fourth, and fifth thresholds can be set according to requirements.
  • the third threshold can be 20
  • the fourth threshold can be 5
  • the fifth threshold can be 10.
  • the IP address and port of the target service may be the IP address and port of the server; in other words, the target service may be a service that is accessed by multiple clients and whose IP address and port belong to the same end.
  • the implementation process of determining whether the port of the target service is randomly generated may be: determine whether the port number of the port of the target service is greater than 1024. If the port number is less than 1024, the port is determined to be a well-known port, that is, the port is not randomly generated. If the port number is greater than 1024, the port is determined to be randomly generated.
  • even if the port of the target service is randomly generated, it still cannot be confirmed whether the target service is a service that is accessed by multiple clients and whose IP address and port belong to the same end. It is also necessary to determine whether the number of same-side ports corresponding to the IP address of the target service is greater than the third threshold, which first requires determining the number of same-side ports corresponding to the IP address of the target service.
  • the process of determining the number of same-side ports corresponding to the IP address of the target service may be: select from the flow table the flow table entries in which the IP address of the target service is located, determine the number of ports in the selected flow table entries that belong to the same end as the IP address of the target service, and use the determined number of ports as the number of same-side ports corresponding to the IP address of the target service.
  • the target service may be composed of a source IP address and a source port, or may be composed of a destination IP address and a destination port.
  • when the target service is composed of a source IP address and a source port, the process of determining the number of same-side ports corresponding to the IP address of the target service may be: select from the flow table the flow table entries in which the IP address of the target service is located, determine the number of source ports in the selected flow table entries, and use the determined number of source ports as the number of same-side ports corresponding to the IP address of the target service.
  • when the target service is composed of a destination IP address and a destination port, the process of determining the number of same-side ports corresponding to the IP address of the target service may be: select from the flow table the flow table entries in which the IP address of the target service is located, determine the number of destination ports in the selected flow table entries, and use the determined number of destination ports as the number of same-side ports corresponding to the IP address of the target service.
  • the process of determining the number of peers corresponding to the target service may be: select from the flow table the flow table entries in which the target service is located, determine the number of IP addresses in the selected flow table entries that belong to a different end from the target service, and use the determined number of IP addresses as the number of peers corresponding to the target service.
  • when the target service is composed of a source IP address and a source port, the process of determining the number of peers corresponding to the target service may be: select from the flow table the flow table entries in which the target service is located, determine the number of destination IP addresses in the selected flow table entries, and use the determined number of destination IP addresses as the number of peers corresponding to the target service.
  • when the target service is composed of a destination IP address and a destination port, the process of determining the number of peers corresponding to the target service may be: select from the flow table the flow table entries in which the target service is located, determine the number of source IP addresses in the selected flow table entries, and use the determined number of source IP addresses as the number of peers corresponding to the target service.
  • when it is determined that the number of peers corresponding to the target service is greater than the fourth threshold, the target service can be determined to be a service that is accessed by multiple clients and whose IP address and port belong to the same end. That is, through the above level-by-level checks, services that are accessed by multiple clients and whose IP address and port belong to the same end can be accurately determined.
  • if the number of peers corresponding to the target service is not greater than the fourth threshold, this can indicate that the target service may be accessed by a single client, and the IP address and port of the target service may be the IP address and port of the server.
  • a target service composed of a destination IP address and a destination port is: IP001+Port001.
  • if the destination port Port001 is randomly generated, the number of destination ports corresponding to its IP address IP001 can be determined. Assume that the number of destination ports corresponding to the IP address IP001 is 25 and the third threshold set by the network device is 20. Since the number of destination ports corresponding to the IP address IP001 is greater than 20, the number of source ends corresponding to the target service can be further determined.
  • if the target service IP001+Port001 has 10 flow entries in the flow table whose source IP addresses and source ports are not exactly the same, and the fourth threshold set by the network device is 5, the number of source ends corresponding to the target service is 10, which is greater than 5, so the target service IP001+Port001 is determined to be a service that is accessed by multiple clients and whose IP address and port belong to the same end.
  • if the target service IP001+Port001 has 3 flow table entries in the flow table with different source IP addresses and source ports, the number of source ends corresponding to the target service is 3, which is less than 5. It is then determined whether the source IP addresses of these three flow table entries are the same. If the source IP addresses of the three flow table entries are the same but the source ports are different, the target service IP001+Port001 is determined to be a service that is accessed by a single client and whose IP address and port belong to the same end.
  • because well-known ports are ports reserved on the server, if the port of the target service is not randomly generated, that is, the port of the target service is a well-known port, the target service can be directly determined to be a service that is accessed by multiple clients and whose IP address and port belong to the same end.
  • the target service is a service that is accessed by multiple clients and the IP address and port belong to the same end.
  • if the peer IP address of the target service is not unique, it is determined that the target service is neither a service that is accessed by a single client and whose IP address and port belong to the same end, nor a service that is accessed by multiple clients and whose IP address and port belong to the same end.
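The level-by-level checks on a target service can be sketched as follows. This is an illustration under several assumptions, not the definitive procedure: the function name `classify_target_service` and the tuple layout are hypothetical, the candidate is assumed to have already been confirmed to correspond to multiple data flows, port 1024 itself is treated as randomly generated because the text only covers "less than" and "greater than" 1024, peers are counted as distinct (IP address, port) pairs as in the IP001+Port001 example, and `None` is returned where the text does not assign a category.

```python
MULTI_CLIENT = "multi-client, same end"
SINGLE_CLIENT = "single-client, same end"


def classify_target_service(flows, ip, port, side,
                            third_threshold=20, fourth_threshold=5):
    """Classify the target service (ip, port) located on `side` of the flows.

    flows: iterable of (src_ip, src_port, dst_ip, dst_port, protocol) tuples.
    side: "src" for a source IP + source port service, "dst" otherwise.
    """
    if side == "src":
        own_ip, own_port, peer_ip, peer_port = 0, 1, 2, 3
    else:
        own_ip, own_port, peer_ip, peer_port = 2, 3, 0, 1

    # A well-known port is not randomly generated: multi-client service.
    if port < 1024:
        return MULTI_CLIENT

    # Number of same-side ports corresponding to the service's IP address.
    same_side_ports = {f[own_port] for f in flows if f[own_ip] == ip}
    if len(same_side_ports) <= third_threshold:
        return None  # outcome not specified in the text (assumption)

    # Peers of the service, counted as distinct peer endpoints.
    peers = {(f[peer_ip], f[peer_port]) for f in flows
             if f[own_ip] == ip and f[own_port] == port}
    if len(peers) > fourth_threshold:
        return MULTI_CLIENT

    # Few peers: single-client service only if the peer IP address is unique.
    if len({peer[0] for peer in peers}) == 1:
        return SINGLE_CLIENT
    return None


flows = [(f"C{i}", 30000 + i, "IP001", 2000, 6) for i in range(10)]
flows += [("C0", 31000 + i, "IP001", 2001, 6) for i in range(3)]
flows += [("C1", 32000, "IP001", 2002, 6), ("C2", 32001, "IP001", 2002, 6)]
# Filler flows so that IP001 has 25 distinct destination ports in total.
flows += [("C3", 33000, "IP001", p, 6) for p in range(2003, 2025)]
```

In this sample, `IP001` with port 2000 has 10 peers and is classified as multi-client, port 2001 is reached from one client IP address over three source ports and is classified as single-client, and port 2002 is reached from two client IP addresses and falls into neither category.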
  • the second multi-client access service set is determined according to the flow table.
  • the network device may determine, based on the first multi-client access service set and the flow table, the services that are accessed by multiple clients and whose IP addresses and ports belong to different ends, to obtain the second multi-client access potential service set. The services whose ports are in the loop port set are filtered out of the second multi-client access potential service set to obtain the second multi-client access service set.
  • the process by which the network device determines, based on the first multi-client access service set and the flow table, the services that are accessed by multiple clients and whose IP address and port belong to different ends may be as follows: an IP address and a port that are located in the same flow table entry of the flow table and belong to different ends are determined as one reference service, so that multiple reference services are obtained. For each of the multiple reference services, determine whether the reference service corresponds to multiple data streams. If the reference service corresponds to multiple data streams, determine whether the port of the reference service is randomly generated. If the port of the reference service is not randomly generated, determine whether the port of the reference service is included in the first multi-client access service set.
  • if the port of the reference service is not included in the first multi-client access service set, determine whether the IP address of the reference service is included in the source IP addresses of the domain name table. If the IP address of the reference service is not included in the source IP addresses of the domain name table, the reference service is determined to be a service that is accessed by multiple clients and whose IP address and port belong to different ends.
  • the above reference service is also composed of an IP address and a port, but unlike the target service, the IP address and port of the reference service belong to different ends.
  • the destination IP address and source port in the same flow table entry can constitute a reference service
  • the source IP address and destination port in the same flow table entry can also constitute a reference service.
  • the process by which the network device determines whether the reference service corresponds to multiple data flows may be: the network device determines the number of flow table entries in the flow table in which the reference service is located. If the determined number of flow table entries is greater than the fifth threshold, it is determined that the reference service corresponds to multiple data streams; if the determined number is not greater than the fifth threshold, it is determined that the reference service does not correspond to multiple data streams.
  • the operation of the network device to determine whether the port of the reference service is randomly generated may refer to the above-mentioned operation of determining whether the port of the target service is randomly generated, which is not described in detail in this embodiment of the application.
  • the source IP address and the destination IP address may be reversed in the flow table obtained by feature extraction, so the source IP address constituting a reference service may actually be a destination IP address, or the destination IP address may actually be a source IP address. As a result, the flow table alone cannot establish whether the IP address and port of the reference service really belong to different ends of the same flow table entry. The source IP address in the domain name table, however, is always the correct source IP address. Therefore, if it is determined that the IP address of the reference service is not included in the source IP addresses of the domain name table, the reference service can be determined to be a service that is accessed by multiple clients and whose IP address and port belong to different ends.
  • if the reference service does not correspond to multiple data streams, or the port of the reference service is randomly generated, or the port of the reference service is included in the first multi-client access service set, or the IP address of the reference service is included in the source IP addresses of the domain name table, it can be determined that the reference service is not a service that is accessed by multiple clients and whose IP address and port belong to different ends.
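The four conditions on a reference service can be sketched as one predicate. The function name `is_cross_end_multi_client_service`, the tuple layout, and the parameter names are hypothetical, and port 1024 itself is treated as randomly generated because the text leaves that boundary open.

```python
def is_cross_end_multi_client_service(flows, ip, port, ip_side,
                                      first_multi_client_ports,
                                      domain_table_source_ips,
                                      fifth_threshold=10):
    """Check whether the reference service (ip, port) is a service accessed
    by multiple clients whose IP address and port belong to different ends.

    flows: iterable of (src_ip, src_port, dst_ip, dst_port, protocol) tuples.
    ip_side: "src" for source IP + destination port,
             "dst" for destination IP + source port.
    first_multi_client_ports: ports of services in the first multi-client
        access service set.
    domain_table_source_ips: source IP addresses found in the domain name table.
    """
    ip_idx, port_idx = (0, 3) if ip_side == "src" else (2, 1)

    # 1. The reference service must correspond to multiple data flows.
    entry_count = sum(1 for f in flows if f[ip_idx] == ip and f[port_idx] == port)
    if entry_count <= fifth_threshold:
        return False
    # 2. Its port must not be randomly generated (well-known port).
    if port >= 1024:
        return False
    # 3. Its port must not already be in the first multi-client access service set.
    if port in first_multi_client_ports:
        return False
    # 4. Its IP address must not be a source IP address in the domain name table.
    if ip in domain_table_source_ips:
        return False
    return True


# 11 flow entries whose source IP is IP9 and whose destination port is 80.
reference_flows = [("IP9", 5000 + i, f"C{i}", 80, 6) for i in range(11)]
```

For instance, with `first_multi_client_ports={443}` and `domain_table_source_ips={"C0"}`, the reference service formed by `IP9` and port 80 passes all four checks.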
  • the first multi-client access service set, the second multi-client access service set, and the single-client access service set each contain one or more services.
  • Step 303 The network device clusters the multiple services according to the flow table and the domain name table to obtain multiple application types.
  • step 303 can be implemented by the following steps (1)-(6):
  • the network device may obtain the flow start time in the flow table entry where each service of the multiple services is located according to the flow table. According to the obtained stream start time, the time difference between each two of the multiple services is determined. Through the determined time difference, the time correlation between each two of the multiple services is determined. According to the time correlation between every two services in the plurality of services, a service that satisfies the time correlation condition is selected from the plurality of services. According to the time correlation between the selected services, a similarity matrix is generated. According to the similarity matrix and the spectral clustering analysis, the spectral clustering results of the multiple services are determined. According to the domain name table, the similarity between every two of the multiple services is determined. Determine the temporal correlation clustering result according to the similarity and spectral clustering result between each two of the multiple services.
  • the process by which the network device determines the time correlation between every two of the multiple services based on the determined time differences may be: for any two of the multiple services, determine whether the time difference between the two services is less than the second time threshold. If the time difference between the two services is less than the second time threshold, it is determined that the two services have time correlation. If the time difference between the two services is not less than the second time threshold, it is determined that the two services do not have time correlation. The same method can be used to determine whether any other two services have time correlation.
  • the second time threshold can be set according to requirements, which is not limited in the embodiment of the present application.
  • The network device may select the services that satisfy the time correlation condition from the multiple services as follows: according to the time correlation between every two of the multiple services, an undirected graph is generated. The undirected graph includes multiple nodes in one-to-one correspondence with the multiple services, and an edge for every two services that have time correlation; each edge connects the two nodes corresponding to the two time-correlated services.
  • Here, the largest clique refers to the connected region that includes the largest number of nodes after the nodes are connected by edges.
  • The services corresponding to the nodes in the largest clique are determined as the services that satisfy the time correlation condition.
  • For example, the multiple services are service A to service G, where service A and service B are time-correlated, service B and service C are time-correlated, service C is time-correlated with service A and service D, service D and service A are time-correlated, service E and service F are time-correlated, and service E and service G are time-correlated.
  • In the undirected graph, node A' corresponding to service A and node B' corresponding to service B are connected by an edge; node B' corresponding to service B and node C' corresponding to service C are connected by an edge; node C' corresponding to service C is connected by edges to node A' corresponding to service A and to node D' corresponding to service D; node D' corresponding to service D and node A' corresponding to service A are connected by an edge; node E' corresponding to service E and node F' corresponding to service F are connected by an edge; and node E' corresponding to service E and node G' corresponding to service G are connected by an edge.
  • In this case, the connected region that includes the largest number of nodes after the nodes are connected by edges is the region composed of node A', node B', node C', and node D'.
  • This connected region is the largest clique in the above undirected graph.
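  • The selection step above can be sketched as follows. This is a minimal illustration, not the embodiment's implementation: the function name `largest_connected_region` and the representation of time-correlated services as name pairs are assumptions made for the example. It follows the definition given in the text, under which the "largest clique" is the connected region with the most nodes.

```python
from collections import defaultdict, deque


def largest_connected_region(services, correlated_pairs):
    """Return the services in the connected region that has the most nodes."""
    adjacency = defaultdict(set)
    for a, b in correlated_pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)

    visited = set()
    best = []
    for start in services:
        if start in visited:
            continue
        # Breadth-first search collects one connected region.
        region = []
        queue = deque([start])
        visited.add(start)
        while queue:
            node = queue.popleft()
            region.append(node)
            for neighbor in adjacency[node]:
                if neighbor not in visited:
                    visited.add(neighbor)
                    queue.append(neighbor)
        if len(region) > len(best):
            best = region
    return sorted(best)


# Services A to G and the time-correlated pairs from the example.
services = ["A", "B", "C", "D", "E", "F", "G"]
pairs = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("D", "A"),
         ("E", "F"), ("E", "G")]
selected = largest_connected_region(services, pairs)
print(selected)
```

  • With the example edges, the region containing A, B, C, and D has four nodes and the region containing E, F, and G has three, so the first region is selected.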
  • The network device may generate the similarity matrix according to the time correlation between the selected services as follows: for every two selected services, if the two services have time correlation, the element corresponding to the two services in the similarity matrix is determined to be 1; if the two services do not have time correlation, the element corresponding to the two services in the similarity matrix is determined to be 0. The element corresponding to the same service in the similarity matrix is 1.
  • On the basis of the above largest clique, the network device may generate the similarity matrix according to the time correlation between the selected services as follows: for every two selected services, if there is a connecting edge between the nodes corresponding to the two services in the largest clique, the element corresponding to the two services in the similarity matrix is determined to be 1; if there is no connecting edge between the nodes corresponding to the two services in the largest clique, the element corresponding to the two services in the similarity matrix is determined to be 0.
  • For example, service A and service B have time correlation, so the corresponding element of service A and service B in the similarity matrix is 1.
  • Service B and service C have time correlation, so the corresponding element of service B and service C in the similarity matrix is 1.
  • Service C and Service A have time correlation, then the corresponding element of Service C and Service A in the similarity matrix is 1.
  • Service C and service D have time correlation, so the corresponding element of service C and service D in the similarity matrix is 1.
  • The above similarity matrix is a matrix with n rows and n columns, where n is the number of services selected from the multiple services according to the time correlation between every two of the multiple services.
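  • As an illustration of the matrix construction described above, the following sketch builds the n-by-n matrix for the selected services. The function name `similarity_matrix` and the pair-list input are assumptions for the example; the edge between service D and service A follows the undirected-graph description above.

```python
def similarity_matrix(selected, correlated_pairs):
    """Build the n x n 0/1 matrix for the selected services."""
    index = {service: i for i, service in enumerate(selected)}
    n = len(selected)
    # The element for the same service (the diagonal) is 1.
    matrix = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for a, b in correlated_pairs:
        # Pairs involving services that were not selected are ignored.
        if a in index and b in index:
            matrix[index[a]][index[b]] = 1
            matrix[index[b]][index[a]] = 1
    return matrix


selected = ["A", "B", "C", "D"]
pairs = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("D", "A"),
         ("E", "F"), ("E", "G")]
matrix = similarity_matrix(selected, pairs)
for row in matrix:
    print(row)
```

  • In this example only the pair of service B and service D has no edge, so their element is 0 and every other element is 1.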
  • The above spectral clustering analysis clusters these n services based on the similarity matrix; that is, it determines which of the n services can be grouped into one category. For the implementation of spectral clustering analysis, refer to related technologies.
  • The network device may determine the similarity between every two of the multiple services according to the domain name table as follows: determine, from the domain name table, the domain name corresponding to the IP address of each of the multiple services; then determine the similarity between the domain names corresponding to the IP addresses of every two of the multiple services, to obtain the similarity between every two of the multiple services.
  • The process of determining the similarity between the domain names corresponding to the IP addresses of two services may be: determine, from the domain name table, the domain names corresponding to the IP addresses of the two services; perform word segmentation on the domain name corresponding to the IP address of each service to obtain all the word segments in that domain name; remove the word segments that serve as the domain name suffix to obtain the word segment group of the domain name corresponding to each IP address; then determine the intersection-over-union ratio of the word segment groups of the two domain names, and determine the similarity between the domain names corresponding to the IP addresses of the two services from this ratio.
  • The intersection-over-union ratio of two word segment groups is the ratio of the number of elements in their intersection to the number of elements in their union.
  • For example, the domain name corresponding to the IP address of service 1 is abnd-jap.xxx.com, and the word segment group of this domain name may include abnd, jap, and xxx; the domain name corresponding to the IP address of service 2 is abnd-hk.xxx.com, and the word segment group of this domain name may include abnd, hk, and xxx.
  • The intersection elements of these two word segment groups are abnd and xxx, and the union elements are abnd, jap, hk, and xxx; that is, the intersection has 2 elements and the union has 4. Therefore, the intersection-over-union ratio of the word segment groups of the domain names corresponding to the IP addresses of these two services is 2/4.
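  • The domain name comparison above can be sketched as follows. The helper names `word_group` and `iou`, the splitting on dots and hyphens, and the `suffixes` parameter (here only "com") are assumptions made for the illustration; the text does not specify how word segmentation or suffix detection is performed.

```python
import re


def word_group(domain, suffixes=("com",)):
    """Split a domain name on dots and hyphens and drop suffix tokens."""
    tokens = [t for t in re.split(r"[.\-]", domain.lower()) if t]
    return {t for t in tokens if t not in suffixes}


def iou(group_a, group_b):
    """Number of shared elements divided by number of elements in the union."""
    union = group_a | group_b
    if not union:
        return 0.0
    return len(group_a & group_b) / len(union)


similarity = iou(word_group("abnd-jap.xxx.com"), word_group("abnd-hk.xxx.com"))
print(similarity)  # 2 shared tokens out of 4 distinct tokens
```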
  • the spectral clustering result may include multiple types of services, and each category includes multiple similar services.
  • the spectral clustering result may include multiple first service sets, and each first service set includes multiple similar services.
  • The network device may determine the time correlation clustering result according to the similarity between every two of the multiple services and the spectral clustering result as follows: for each service in each first service set in the spectral clustering result, determine, based on the similarity between every two of the multiple services, the similarity between the service and the other services in the first service set where it is located. If the similarity between the service and the other services in the first service set is not greater than the similarity threshold, the service is excluded from the first service set. After every service in every first service set in the spectral clustering result has been traversed in this way, multiple second service sets are obtained. For each excluded service, the similarity between the service and each service in each second service set may be determined based on the similarity between every two of the multiple services. If the similarity between the service and each service in one of the second service sets is greater than the similarity threshold, the service is added to that second service set. Services that have not undergone spectral clustering analysis are clustered in the same way as the excluded services, so as to obtain the time correlation clustering result.
  • the similarity threshold can be set according to requirements, which is not limited in the embodiment of the present application.
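  • One possible reading of this refinement step is sketched below. The text does not say whether "other services" means all or some of the other members when a service is excluded, so the sketch assumes that a service stays in its first service set when its similarity to at least one other member exceeds the threshold; re-adding an excluded service requires its similarity to every member of a second service set to exceed the threshold, as stated above. The function name `refine_clusters` and the lookup-table similarity are hypothetical.

```python
def refine_clusters(first_sets, similarity, threshold):
    """Return (second service sets, services that fit no second set)."""
    second_sets = []
    excluded = []
    for members in first_sets:
        kept = []
        for service in members:
            others = [m for m in members if m != service]
            # Assumption: one sufficiently similar member is enough to stay.
            if any(similarity(service, o) > threshold for o in others):
                kept.append(service)
            else:
                excluded.append(service)
        if kept:
            second_sets.append(kept)

    unplaced = []
    for service in excluded:
        for members in second_sets:
            # Re-adding requires similarity above the threshold to every member.
            if all(similarity(service, m) > threshold for m in members):
                members.append(service)
                break
        else:
            unplaced.append(service)
    return second_sets, unplaced


# Hypothetical pairwise domain name similarities; missing pairs count as 0.
table = {
    frozenset(("s1", "s2")): 0.8,
    frozenset(("s4", "s5")): 0.7,
    frozenset(("s3", "s4")): 0.9,
    frozenset(("s3", "s5")): 0.6,
}
sim = lambda a, b: table.get(frozenset((a, b)), 0.0)
second_sets, unplaced = refine_clusters([["s1", "s2", "s3"], ["s4", "s5"]], sim, 0.5)
print(second_sets, unplaced)
```

  • In this example s3 is excluded from the first set because it resembles neither s1 nor s2, and it is then added to the set of s4 and s5 because it is similar to both.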
  • The network device may also process the above flow table again based on the multiple services determined in step 302.
  • For example, the flow table entries whose destination IP address and destination port are not the IP address and port of any of the multiple services can be deleted, and/or the entries in which the source IP address and the destination IP address are reversed, and the entries in which the source port and the destination port are reversed, can be corrected.
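  • A minimal sketch of this cleanup is shown below. The dictionary field names (`src_ip`, `src_port`, `dst_ip`, `dst_port`, `start_time`) are hypothetical, and the way a reversed entry is detected (its source endpoint is a known service while its destination endpoint is not) is an assumption; the text does not state the detection rule.

```python
def clean_flow_table(flow_table, services):
    """Keep entries that point at a known service, fixing reversed ones."""
    cleaned = []
    for entry in flow_table:
        src = (entry["src_ip"], entry["src_port"])
        dst = (entry["dst_ip"], entry["dst_port"])
        if dst in services:
            cleaned.append(dict(entry))
        elif src in services:
            # Assumed reversal: swap source and destination endpoints.
            fixed = dict(entry)
            fixed["src_ip"], fixed["src_port"] = dst
            fixed["dst_ip"], fixed["dst_port"] = src
            cleaned.append(fixed)
        # Entries that touch no known service are dropped.
    return cleaned


flow_table = [
    {"src_ip": "IP08", "src_port": 50001, "dst_ip": "IP20", "dst_port": 443, "start_time": 0},
    {"src_ip": "IP20", "src_port": 443, "dst_ip": "IP08", "dst_port": 50002, "start_time": 5},
    {"src_ip": "IP09", "src_port": 50003, "dst_ip": "IP30", "dst_port": 8080, "start_time": 7},
]
services = {("IP20", 443)}
cleaned = clean_flow_table(flow_table, services)
print(cleaned)
```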
  • According to the flow table, periodic services are selected from the multiple services to obtain a periodic clustering result.
  • The flow start times of the multiple data flows with which the same client accesses the service are obtained.
  • The network device may obtain, according to the flow table, the flow start times of the multiple data flows with which the same client accesses the service as follows: from the flow table, select the flow table entries that have the IP address and port of the service as the destination IP address and destination port; determine the number of selected flow table entries in which each source IP address appears; obtain the flow start times from the flow table entries of the source IP address with the largest number; and determine the obtained flow start times as the flow start times of the multiple data flows with which the same client accesses the service.
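  • For example, if the source IP address with the largest number of flow table entries is IP08, the flow start times can be obtained from the flow table entries where IP08 is located, and the obtained flow start times are determined as the flow start times of the multiple data flows with which the same client accesses the service.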
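  • The selection of the most active client can be sketched as follows. The entry field names and the function name `client_start_times` are hypothetical; the logic mirrors the steps above (filter by destination, count entries per source IP address, take the start times of the most frequent one).

```python
from collections import Counter


def client_start_times(flow_table, service):
    """Return the busiest client of a service and its sorted flow start times."""
    selected = [e for e in flow_table
                if (e["dst_ip"], e["dst_port"]) == service]
    if not selected:
        return None, []
    counts = Counter(e["src_ip"] for e in selected)
    client, _ = counts.most_common(1)[0]
    times = sorted(e["start_time"] for e in selected if e["src_ip"] == client)
    return client, times


flow_table = [
    {"src_ip": "IP08", "dst_ip": "IP20", "dst_port": 443, "start_time": 0},
    {"src_ip": "IP02", "dst_ip": "IP20", "dst_port": 443, "start_time": 3},
    {"src_ip": "IP08", "dst_ip": "IP20", "dst_port": 443, "start_time": 10},
    {"src_ip": "IP08", "dst_ip": "IP20", "dst_port": 443, "start_time": 20},
    {"src_ip": "IP08", "dst_ip": "IP21", "dst_port": 80, "start_time": 4},
]
client, times = client_start_times(flow_table, ("IP20", 443))
print(client, times)
```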
  • Whether the periodicity of the service is strong periodicity may be determined through Fourier transform as follows: a coordinate system is established with the determined time difference as the horizontal axis and the determined time difference as the vertical axis, and the determined time differences are plotted in the coordinate system to obtain a discrete signal. Fourier transform is performed on the discrete signal, and it is determined whether the number of peaks in the transformed signal is less than the sixth threshold. If it is less than the sixth threshold, the periodicity of the service is determined to be strong periodicity; otherwise, the periodicity of the service is determined not to be strong periodicity.
  • the sixth threshold can be set according to requirements, which is not limited in the embodiment of the present application.
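  • The peak-counting idea can be illustrated with the sketch below, which rests on several assumptions not stated in the text: the discrete signal is taken to be the sequence of differences between consecutive flow start times, indexed by their order; the constant component is removed before the transform; and a peak is a local maximum of the spectrum magnitude that reaches a fraction (`relative_height`, hypothetical) of the tallest magnitude. The function names are also hypothetical.

```python
import cmath


def count_spectrum_peaks(time_differences, relative_height=0.5):
    """Count dominant local maxima in the magnitude spectrum of the signal."""
    n = len(time_differences)
    if n < 3:
        return 0
    mean = sum(time_differences) / n
    signal = [d - mean for d in time_differences]  # remove the constant part
    # Discrete Fourier transform magnitudes for frequencies 1 .. n // 2.
    magnitudes = []
    for k in range(1, n // 2 + 1):
        coefficient = sum(x * cmath.exp(-2j * cmath.pi * k * t / n)
                          for t, x in enumerate(signal))
        magnitudes.append(abs(coefficient))
    top = max(magnitudes)
    if top < 1e-9:
        return 0  # constant interval: no oscillating component at all
    peaks = 0
    for i, m in enumerate(magnitudes):
        left = magnitudes[i - 1] if i > 0 else 0.0
        right = magnitudes[i + 1] if i + 1 < len(magnitudes) else 0.0
        if m >= relative_height * top and m > left and m >= right:
            peaks += 1
    return peaks


def is_strongly_periodic(start_times, sixth_threshold):
    """Apply the rule: fewer peaks than the threshold means strong periodicity."""
    differences = [b - a for a, b in zip(start_times, start_times[1:])]
    return count_spectrum_peaks(differences) < sixth_threshold


regular = is_strongly_periodic([0, 10, 20, 30, 40, 50, 60, 70, 80], sixth_threshold=2)
alternating_peaks = count_spectrum_peaks([10, 30] * 4)
print(regular, alternating_peaks)
```

  • A client that connects at a fixed interval produces a constant signal with no spectral peak, and one that alternates between two intervals produces a single peak, so both fall below a threshold of 2 in this sketch.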
  • Multiple first services refer to services that are accessed by multiple clients and have corresponding domain names.
  • The multiple second services include services that are accessed by multiple clients but have no corresponding domain name, and services accessed by a single client.
  • The multiple services include services accessed by multiple clients and services accessed by a single client, and each domain name entry in the domain name table includes the source IP address, the destination domain name, the destination IP address, and the domain name type. Therefore, for the services accessed by multiple clients, whether each of these services corresponds to a domain name can be determined from the domain name table, and the services that are accessed by multiple clients and have a corresponding domain name can then be filtered out to obtain the multiple first services. At the same time, the services that are accessed by multiple clients but have no corresponding domain name can also be filtered out.
  • The network device may cluster the above multiple first services according to the semantic relevance of the domain names to obtain multiple first clustering results. Based on the similarity of the domain names among the multiple first clustering results, the multiple first clustering results are merged to obtain multiple second clustering results. According to the semantic relevance of the domain names between the unclustered services among the multiple first services and each second clustering result, the unclustered services are clustered with the multiple second clustering results to obtain the semantic relevance clustering result.
  • the implementation process of clustering multiple first services may be: obtaining multiple third services that are accessed by multiple clients and correspond to a unique domain name from the multiple first services, and Obtain multiple fourth services that are accessed by multiple clients and correspond to multiple domain names from the multiple first services.
  • the services corresponding to the multiple second domain names are clustered.
  • According to the semantic relevance of the non-mergeable domain names among the domain names corresponding to the multiple fourth services, the services corresponding to the non-mergeable domain names are clustered.
  • The domain name corresponding to a first service can be determined from the domain name table. If the domain name corresponding to the first service is unique, the first service is determined to be a third service, that is, a service that is accessed by multiple clients and corresponds to a unique domain name. If the domain name corresponding to the first service is not unique, the first service is determined to be a fourth service, that is, a service that is accessed by multiple clients and corresponds to multiple domain names.
  • Obtaining the mergeable domain names from the domain names corresponding to the multiple fourth services and merging them to obtain multiple first domain names may be implemented as follows: obtain, from the domain names corresponding to the multiple fourth services, multiple domain names with the same first-level domain name, and perform word segmentation on them to obtain all the word segments in each domain name. Remove the word segments belonging to the first-level domain name from all the word segments in each domain name to obtain the word segment group of each domain name. Determine the intersection-over-union ratio of the word segment groups of the obtained domain names. If the ratio is greater than the first intersection-over-union ratio threshold, the multiple domain names with the same first-level domain name are determined to be mergeable domain names. Then determine the intersection of the word segment groups and add it in front of the first-level domain name to obtain the merged domain name, that is, the first domain name.
  • For example, the domain names corresponding to the multiple fourth services are abpd-jap.xxx.com, acnd-jap.xxx.com, and abed-jap.xxx.com. Word segmentation on these three domain names gives abpd, jap, xxx, and com for abpd-jap.xxx.com; acnd, jap, xxx, and com for acnd-jap.xxx.com; and abed, jap, xxx, and com for abed-jap.xxx.com.
  • After the word segments belonging to the first-level domain name (xxx and com) are removed, the word segment groups are {abpd, jap}, {acnd, jap}, and {abed, jap}. Their intersection is {jap} and their union is {abpd, acnd, abed, jap}, so the intersection-over-union ratio is 1/4. Assuming that this ratio is greater than the first intersection-over-union ratio threshold, the three domain names are determined to be mergeable domain names. Then the intersection of the word segment groups is determined and added in front of the first-level domain name to obtain the merged domain name, that is, the first domain name is jap.xxx.com.
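  • The merging procedure can be sketched as follows, under stated assumptions: the first-level domain name is taken to be the last two dot-separated labels (real domains may need a public suffix lookup), word segments are obtained by splitting on dots and hyphens, and several shared segments would be joined with hyphens. The function names `split_domain` and `merge_domains` are hypothetical.

```python
def split_domain(domain, first_level_labels=2):
    """Return (first-level domain name, word segments in front of it)."""
    labels = domain.lower().split(".")
    first_level = ".".join(labels[-first_level_labels:])
    tokens = []
    for label in labels[:-first_level_labels]:
        tokens.extend(t for t in label.split("-") if t)
    return first_level, tokens


def merge_domains(domains, threshold):
    """Return the merged domain name, or None if the domains cannot be merged."""
    splits = [split_domain(d) for d in domains]
    first_levels = {first_level for first_level, _ in splits}
    if len(first_levels) != 1:
        return None  # only domains sharing a first-level domain name qualify
    groups = [set(tokens) for _, tokens in splits]
    intersection = set.intersection(*groups)
    union = set.union(*groups)
    if not union or len(intersection) / len(union) <= threshold:
        return None
    # Keep shared segments in the order they appear in the first domain.
    ordered = [t for t in dict.fromkeys(splits[0][1]) if t in intersection]
    return "-".join(ordered) + "." + first_levels.pop()


domains = ["abpd-jap.xxx.com", "acnd-jap.xxx.com", "abed-jap.xxx.com"]
merged = merge_domains(domains, threshold=0.2)
print(merged)
```

  • With a threshold of 0.2 the ratio of 1/4 from the example is large enough, so the three domain names merge into jap.xxx.com; with a threshold of 0.25 they would not, because the ratio must be strictly greater.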
  • The semantic relevance of the multiple second domain names can also be determined according to the intersection-over-union ratio of the corresponding word segment groups, and the second domain names whose intersection-over-union ratio is greater than the second intersection-over-union ratio threshold are then classified into one category, so as to cluster the services corresponding to the multiple second domain names.
  • The method for determining the intersection-over-union ratio of the word segment groups corresponding to the multiple second domain names may refer to the foregoing method, which is not described in detail in the embodiment of the present application.
  • The semantic relevance of the non-mergeable domain names among the domain names corresponding to the multiple fourth services can also be determined according to the intersection-over-union ratio of the corresponding word segment groups, and the non-mergeable domain names whose intersection-over-union ratio is greater than the third intersection-over-union ratio threshold are then classified into one category, so as to cluster the services corresponding to the non-mergeable domain names.
  • The method for determining the intersection-over-union ratio of the word segment groups corresponding to the non-mergeable domain names among the domain names corresponding to the multiple fourth services may refer to the foregoing method, which is not described in detail in the embodiment of the present application.
  • The first, second, and third intersection-over-union ratio thresholds can be set according to requirements, and they can be the same or different.
  • The network device may determine the IP addresses of the clients accessing each of the multiple second services, and cluster the multiple second services according to the intersection-over-union ratio of the IP addresses of the clients accessing each second service, to obtain the client similarity clustering result.
  • The network device may classify the second services whose client IP addresses have an intersection-over-union ratio greater than the fourth intersection-over-union ratio threshold into one category, so as to obtain the client similarity clustering result.
  • For example, the second service A has three client IP addresses: IP01, IP02, and IP03, and the second service B also has three client IP addresses: IP01, IP03, and IP05. Two of the client IP addresses of the second service A are the same as two of the client IP addresses of the second service B, so the intersection of the client IP addresses of the second service A and the second service B is {IP01, IP03}, the union is {IP01, IP02, IP03, IP05}, and the intersection-over-union ratio of the client IP addresses of the second service A and the second service B is 2/4. Assuming that this ratio is greater than the fourth intersection-over-union ratio threshold, these two second services are clustered together to obtain the client similarity clustering result.
  • the network device may determine the intersection ratio between every two clustering results, and merge the two clustering results with the intersection ratio greater than the fifth intersection ratio threshold. After processing all the clustering results in this way, multiple application types can be obtained.
  • the intersection ratio between every two clustering results refers to the intersection ratio of services in the two clustering results. That is, the ratio between the number of intersection services and the number of union services in the two clustering results.
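  • The merging of clustering results can be sketched as follows. Each clustering result is represented as a set of service names, and merging is repeated until no pair exceeds the threshold; the repetition and the function name `merge_clusters` are assumptions of this sketch, since the text only says that all clustering results are processed in this way.

```python
def merge_clusters(clusters, threshold):
    """Merge sets of services whose intersection ratio exceeds the threshold."""
    merged = [set(c) for c in clusters]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                union = merged[i] | merged[j]
                shared = merged[i] & merged[j]
                if union and len(shared) / len(union) > threshold:
                    merged[i] = union
                    del merged[j]
                    changed = True
                    break
            if changed:
                break  # restart the scan after every merge
    return merged


clusters = [{"s1", "s2", "s3"}, {"s2", "s3", "s4"}, {"s7", "s8"}]
application_types = merge_clusters(clusters, threshold=0.4)
print(application_types)
```

  • Here the first two sets share 2 of 4 services (ratio 0.5), which exceeds the assumed fifth threshold of 0.4, so they merge into one application type; the third set shares nothing and stays separate.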
  • Each clustering result includes multiple application types. Therefore, after the above four clustering results are merged, multiple application types can be obtained, and each application type can correspond to multiple services.
  • The fourth and fifth intersection-over-union ratio thresholds can be set according to requirements. They can be the same as or different from each other, and can be the same as or different from the above first, second, and third intersection-over-union ratio thresholds.
  • Step 304 The network device determines a label corresponding to each of the multiple application types, where the label is used to identify the application to which the data stream belongs.
  • the network device may divide the multiple application types into a first application group, a second application group, and a third application group.
  • the services included in each application type in the first application group have corresponding domain names.
  • the services included in each application type in the second application group do not have a corresponding domain name, and each application type in the third application group corresponds to an un-clustered service.
  • Based on the domain names corresponding to the services included in each application type in the first application group, the label corresponding to each application type in the first application group is determined, and the label corresponding to each application type in the second application group and the third application group is determined.
  • The network device may determine the label corresponding to each application type based on the domain names corresponding to the services included in each application type in the first application group as follows: for each application type in the first application group, determine the first-level domain name in the domain name corresponding to each service included in the application type. If the determined first-level domain names are the same, determine the enterprise name corresponding to the first-level domain name, and use the enterprise name as the label corresponding to the application type. If the determined first-level domain names are different, the enterprise name corresponding to the first-level domain name with the largest proportion among these first-level domain names is determined as the label corresponding to the application type.
  • For example, a certain application type in the first application group includes 20 services, and the first-level domain name in the domain name corresponding to each service is xxx.com. Assuming that the enterprise name corresponding to this first-level domain name is xxx, xxx is used as the label corresponding to this application type.
  • As another example, a certain application type in the first application group includes 50 services, of which the first-level domain name in the domain names corresponding to 40 services is xxx.com, and the first-level domain name in the domain names corresponding to the remaining 10 services is scmd.com. Assume that the enterprise name corresponding to the first-level domain name xxx.com is xxx, and the enterprise name corresponding to the first-level domain name scmd.com is scmd. Since the first-level domain name xxx.com accounts for the largest proportion, xxx can be used as the label corresponding to the application type.
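  • The majority rule for the first application group can be sketched as follows. The mapping from first-level domain name to enterprise name (`enterprise_names`), the two-label definition of a first-level domain name, the fallback to the domain name itself when no enterprise name is known, and the host names in the example are all hypothetical.

```python
from collections import Counter


def first_level_domain(domain):
    """Assume the first-level domain name is the last two labels."""
    return ".".join(domain.lower().split(".")[-2:])


def label_for_application_type(service_domains, enterprise_names):
    """Label an application type by its most frequent first-level domain name."""
    counts = Counter(first_level_domain(d) for d in service_domains)
    dominant, _ = counts.most_common(1)[0]
    # Fall back to the domain name itself if no enterprise name is known.
    return enterprise_names.get(dominant, dominant)


# 40 services under xxx.com and 10 under scmd.com, as in the example.
domains = ["a%d.xxx.com" % i for i in range(40)] + \
          ["b%d.scmd.com" % i for i in range(10)]
label = label_for_application_type(domains, {"xxx.com": "xxx", "scmd.com": "scmd"})
print(label)
```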
  • The network device may determine the label corresponding to each application type in the second application group as follows: for each application type in the second application group, determine, from the services included in the application type, the services on well-known ports, the services belonging to the loop port set, the services whose source port and destination port are the same, and the services belonging to the second multi-client access service set. Determine whether the number of these services is greater than the seventh threshold. If it is not greater than the seventh threshold, a label corresponding to the application type is generated according to the first format based on the first target character. If it is greater than the seventh threshold, a label corresponding to the application type is generated according to the second format based on the first target character.
  • the first target character can be set according to requirements, for example, the first target character is NN.
  • the first format and the second format can also be set according to requirements.
  • For example, the first format can be: first target character/{IP01+Port001, IP02+Port002, IP01+Port003}, and the second format can be: first target character/{IP: 001, 002, 003, 004}.
  • For example, a certain application type in the second application group includes 30 services. From these 30 services, it is determined that the services on well-known ports are IP01+Port001 and IP02+Port002, the service belonging to the loop port set is IP12+Port002, the service whose source port and destination port are the same is IP14+Port011, and the service belonging to the second multi-client access service set is IP05+Port005. The number of these services is therefore 5.
  • If 5 is not greater than the seventh threshold, the label corresponding to the application type is NN/{IP01+Port001, IP02+Port002, IP12+Port002, IP14+Port011, IP05+Port005}.
  • If 5 is greater than the seventh threshold, the label corresponding to the application type is NN/{IP: 001, 002, 002, 011, 005}.
  • The network device may determine the label corresponding to each application type in the third application group as follows: for each application type in the third application group, determine whether the number of services included in the application type is greater than the seventh threshold. If it is not greater than the seventh threshold, a label of the application type is generated according to the first format based on the second target character. If it is greater than the seventh threshold, a label of the application type is generated according to the second format based on the second target character.
  • the second target character can be set according to requirements, for example, the second target character is UKN.
  • the first format and the second format can also be set according to requirements.
  • For example, the first format can be: second target character/{IP01+Port001, IP02+Port002, IP01+Port003}, and the second format can be: second target character/{IP: 001, 002, 003, 004}.
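  • The two label formats used for the second and third application groups can be sketched as follows. Representing a service as an (IP, port) pair of strings and deriving the short form by stripping the "Port" prefix are assumptions based on the example labels above; the function name `format_label` is hypothetical.

```python
def short_port(port):
    """Turn 'Port001' into '001'; leave other strings unchanged."""
    return port[len("Port"):] if port.startswith("Port") else port


def format_label(target_character, services, seventh_threshold):
    """Use the first format for few services and the second format otherwise."""
    if len(services) <= seventh_threshold:
        body = ", ".join(f"{ip}+{port}" for ip, port in services)
    else:
        body = "IP: " + ", ".join(short_port(port) for _, port in services)
    return f"{target_character}/{{{body}}}"


services = [("IP01", "Port001"), ("IP02", "Port002"), ("IP12", "Port002"),
            ("IP14", "Port011"), ("IP05", "Port005")]
long_label = format_label("NN", services, seventh_threshold=10)
short_label = format_label("NN", services, seventh_threshold=3)
print(long_label)
print(short_label)
```

  • The same function covers the third application group by passing the second target character, for example UKN, instead of NN.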
  • After the network device determines the label corresponding to each of the multiple application types, it may also display the label corresponding to each of the multiple application types.
  • The network device can obtain multiple services by analyzing the traffic behavior characteristics in the flow table. Since each service is identified by an IP address and a port identifier, and an application is usually composed of a group of services, clustering the multiple services according to the flow table and the domain name table yields multiple application types, where each application type includes multiple services and corresponds to one application. The label of each of the multiple application types can then be determined, so that the application to which a data flow belongs can be identified through the label. It can be seen that the process of identifying applications in the embodiment of this application does not require a traffic characteristic database; identification is based on traffic behavior characteristics. In this way, when a new application appears, it can be identified directly from the IP address and port of the server accessed by the new application, thereby improving the recognition rate of applications.
  • FIG. 5 is a schematic structural diagram of an application identification device provided by an embodiment of the present application.
  • the application identification device can be implemented as part or all of a network device by software, hardware, or a combination of the two.
  • the device includes: an extraction module 501, an analysis module 502, a clustering module 503, and a determination module 504.
  • the functions of the extraction module 501, the analysis module 502, the clustering module 503, and the determination module 504 can all be implemented by the processor in the embodiment of FIG. 2.
  • the extraction module 501 is configured to perform the operation of step 301 in the embodiment of FIG. 3;
  • the analysis module 502 is configured to perform the operation of step 302 in the embodiment of FIG. 3;
  • the clustering module 503 is configured to perform the operation of step 303 in the embodiment of FIG. 3;
  • the determining module 504 is configured to perform the operation of step 304 in the embodiment of FIG. 3.
  • the analysis module 502 includes:
  • the first determining submodule 5021 is configured to determine ports with loops according to the flow table to obtain a loop port set;
  • the second determining sub-module 5022 is used to determine, according to the flow table and based on the loop port set, the single-client access service set and the first multi-client access service set, where each service in the single-client access service set is accessed by a single client, the IP address and port of the service belong to the same end, and the port does not belong to the loop port set; and each service in the first multi-client access service set is accessed by multiple clients, the IP address and port of the service belong to the same end, and the port does not belong to the loop port set;
  • the third determining sub-module 5023 is used to determine, according to the flow table and based on the loop port set and the first multi-client access service set, the second multi-client access service set, where each service in the second multi-client access service set is accessed by multiple clients, the IP address and port of the service belong to different ends, and the port does not belong to the loop port set;
  • the merging submodule 5024 is used to merge the first multi-client access service set, the second multi-client access service set and the single client access service set to obtain multiple services.
  • the first determining submodule 5021 is mainly used for:
  • according to the flow table, determine the total number of data flows that correspond to multiple IP addresses among all the data flows through the port, and use the determined total number as the first total flow number;
  • if the first total flow number is greater than the first threshold, determine the total number of all data flows through the port, and use the determined total number as the second total flow number;
  • the port is a port with a loop.
  • the second determining submodule 5022 is mainly used for:
  • according to the flow table, determine the services that are accessed by a single client and whose IP address and port belong to the same end, and the services that are accessed by multiple clients and whose IP address and port belong to the same end, to obtain the single-client access potential service set and the first multi-client access potential service set;
  • the second determining submodule 5022 is further configured to:
  • determine multiple target services according to the flow table, where each target service corresponds to an IP address and port belonging to the same end in a flow table entry, and each target service corresponds to multiple data flows;
  • if the port of the target service is randomly generated, determine whether the number of same-side ports corresponding to the IP address of the target service is greater than the third threshold;
  • if the number of same-side ports is greater than the third threshold, determine whether the number of peers corresponding to the target service is greater than the fourth threshold;
  • if the number of peers is greater than the fourth threshold, determine that the target service is a service that is accessed by multiple clients and whose IP address and port belong to the same end;
  • if the peer IP address of the target service is unique, determine that the target service is a service that is accessed by a single client and whose IP address and port belong to the same end.
  • the second determining submodule 5022 is further configured to:
  • if the port of the target service is not randomly generated, determine that the target service is a service that is accessed by multiple clients and whose IP address and port belong to the same end.
  • the second determining submodule 5022 is further configured to:
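  • if the number of same-side ports corresponding to the IP address of the target service is not greater than the third threshold, determine that the target service is a service that is accessed by multiple clients and whose IP address and port belong to the same end.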
  • the third determining submodule 5023 is mainly used for:
  • the third determining submodule 5023 is further configured to:
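  • for each of the multiple reference services, determine whether the reference service corresponds to multiple data flows;
  • if the IP address of the reference service is not contained in the source IP addresses of the domain name table, determine that the reference service is a service that is accessed by multiple clients and whose IP address and port belong to different ends.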
  • the clustering module 503 includes:
  • the first clustering sub-module 5031 is configured to perform time correlation clustering on multiple services according to the flow table and the domain name table to obtain a time correlation clustering result;
  • the second clustering sub-module 5032 is used to select periodic services from a plurality of services according to the flow table to obtain periodic clustering results;
  • the obtaining submodule 5033 is used to obtain multiple first services and multiple second services from the multiple services according to the domain name table.
  • The multiple first services are services that are accessed by multiple clients and have corresponding domain names.
  • The multiple second services include services that are accessed by multiple clients and have no corresponding domain name, and services that are accessed by a single client;
  • the third clustering submodule 5034 is used to perform semantic relevance clustering on multiple first services to obtain semantic relevance clustering results
  • the fourth clustering submodule 5035 is used to perform client similarity clustering on multiple second services to obtain a client similarity clustering result
  • the fusion sub-module 5036 is used to fuse time correlation clustering results, periodic clustering results, semantic correlation clustering results, and client similarity clustering results to obtain multiple application types.
  • the first clustering submodule 5031 is mainly used for:
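  • according to the similarity between every two of the multiple services and the spectral clustering result, determine the time correlation clustering result.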
  • the second clustering submodule 5032 is mainly used for:
  • if the periodicity of the service is strong periodicity, determine that the service is a periodic service.
  • the third clustering submodule 5034 is mainly used for:
  • cluster the un-clustered services with the multiple second clustering results to obtain the semantic relevance clustering result.
  • the third clustering submodule 5034 is also used to:
  • from the multiple first services, obtain multiple third services that are accessed by multiple clients and correspond to a unique domain name, and obtain multiple fourth services that are accessed by multiple clients and correspond to multiple domain names;
  • according to the semantic relevance of the non-mergeable domain names among the domain names corresponding to the multiple fourth services, cluster the services corresponding to the non-mergeable domain names.
  • the fourth clustering submodule 5035 is mainly used for:
  • the determining module 504 is used to:
  • the services included in each application type in the first application group have corresponding domain names;
  • none of the services included in each application type in the second application group has a corresponding domain name;
  • each application type in the third application group corresponds to one un-clustered service.
  • the network device can obtain multiple services by analyzing the flow behavior characteristics of the flow table. Because each service is identified by one IP address and one port, and an application is usually composed of a group of services, clustering the multiple services according to the flow table and the domain name table yields multiple application types, where each application type includes multiple services and corresponds to one application. The label of each application type can then be determined, so that the application to which a data flow belongs can be identified through the label. The process of identifying applications in the embodiments of this application therefore does not require a traffic signature database and relies only on flow behavior characteristics. When a new application appears, it can be identified directly from the IP address and port of the server that the new application accesses, which improves the application recognition rate.
  • when the application identification device provided in the above embodiment identifies an application, the division into the functional modules described above is only an example. In practice, the functions can be allocated to different functional modules as needed, that is, the internal structure of the device can be divided into different functional modules to complete all or part of the functions described above.
  • the application identification device provided in the foregoing embodiment and the application identification method embodiment belong to the same concept, and the specific implementation process is detailed in the method embodiment, and will not be repeated here.
  • the embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof.
  • when implemented by software, the embodiments can be implemented in whole or in part in the form of a computer program product.
  • the computer program product includes one or more computer instructions.
  • the computer may be a general-purpose computer, a special-purpose computer, a computer network, or other programmable devices.
  • the computer instructions may be stored in a computer-readable storage medium, or transmitted from one computer-readable storage medium to another computer-readable storage medium.
  • the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center by wired or wireless means.
  • the computer-readable storage medium may be any available medium that can be accessed by a computer, or a data storage device such as a server or a data center integrated with one or more available media.

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Environmental & Geological Engineering (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

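This application discloses an application identification method, apparatus, and storage medium, and belongs to the field of communication technologies. A network device analyzes the flow behavior characteristics of a flow table to obtain multiple services. Because each service is identified by one IP address and one port, and an application is usually composed of a group of services, the network device clusters the multiple services according to the flow table and a domain name table to obtain multiple application types, where each application type includes multiple services and corresponds to one application. Further, the network device can determine a label for each of the application types, and the label can be used to identify the application to which a data flow belongs. The method identifies the application to which a data flow belongs from flow behavior characteristics alone, without a traffic signature database. When a new application appears, the network device identifies it from the IP address and port of the server that the new application accesses, which improves the application recognition rate.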

Description

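Application identification method, apparatus, and storage medium
This application claims priority to Chinese Patent Application No. 201910853338.4, filed with the China National Intellectual Property Administration on September 10, 2019 and entitled "Application identification method, apparatus, and storage medium", which is incorporated herein by reference in its entirety.
Technical Field
This application relates to the field of communication technologies, and in particular to an application identification method, apparatus, and storage medium.
Background
Private networks such as enterprise campuses may suffer congestion and packet loss when traffic bursts occur. The launch of new, unknown applications may aggravate these problems. Such new unknown applications are usually the enterprise's private applications, so identifying an enterprise's private applications has become an important concern for enterprise users.
Currently, deep packet inspection (DPI) can be used to identify applications. DPI unpacks and parses a data flow in depth to extract traffic features, and then matches the extracted features against a stored traffic signature database to identify the application corresponding to the data flow.
Because DPI requires a traffic signature database to be maintained, the database must be updated manually before a new application can be recognized, which results in a low application recognition rate.
Summary
This application provides an application identification method, apparatus, and storage medium, which can address the low efficiency of DPI-based application identification in the related art. The technical solutions are as follows: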
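According to a first aspect, an application identification method is provided. The method includes:
extracting features from each of multiple data flows to obtain a flow table and a domain name table, where the flow table includes multiple flow table entries, each of which includes a 5-tuple and a flow start time, and the domain name table includes multiple domain name table entries, each of which includes a source Internet Protocol (IP) address, a destination domain name, a destination IP address, and a domain name type;
analyzing flow behavior characteristics according to the flow table to obtain multiple services, where each service is identified by one IP address and one port;
clustering the multiple services according to the flow table and the domain name table to obtain multiple application types; and
determining a label corresponding to each of the multiple application types, where the label is used to identify the application to which a data flow belongs.
A data flow may include one or more packets that share the same 5-tuple. In other words, one or more packets with the same 5-tuple form one data flow.
The 5-tuple in the flow table includes a source IP address, a source port, a destination IP address, a destination port, and a protocol number. For example, if a client needs to send a packet to a server, the source IP address and source port are the client's IP address and port, the destination IP address and destination port are the server's IP address and port, and the protocol number is the number of the transport protocol used between the client and the server.
The flow start time of a data flow is the receive time of the first packet of that data flow. However, this first packet is not necessarily the first packet of the entire data flow; it is the first packet among those received during the current feature extraction.
The domain name type in the domain name table takes one of two forms: A.name and C.name. A.name resolves a host name or domain name to an IP address. C.name resolves multiple host names or domain names to another domain name, which is in turn resolved to an IP address, and this IP address is the same as the one that the A.name resolves to. That is, multiple C.name entries are equivalent to branches of one A.name.
Multiple services are obtained by analyzing the flow behavior characteristics of the flow table. Because each service is identified by one IP address and one port, and an application is usually composed of a group of services, clustering the multiple services according to the flow table and the domain name table yields multiple application types, where each application type includes multiple services and corresponds to one application. The label of each application type can then be determined, so that the application to which a data flow belongs can be identified through the label. The identification process therefore does not require a traffic signature database and relies only on flow behavior characteristics. When a new application appears, it can be identified directly from the IP address and port of the server that it accesses, which improves the application recognition rate.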
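Optionally, the analyzing flow behavior characteristics according to the flow table to obtain multiple services includes:
determining ports with loops according to the flow table to obtain a loop port set;
determining, based on the loop port set and according to the flow table, a single-client access service set and a first multi-client access service set, where each service in the single-client access service set is accessed by a single client, its IP address and port belong to the same end, and its port does not belong to the loop port set, and each service in the first multi-client access service set is accessed by multiple clients, its IP address and port belong to the same end, and its port does not belong to the loop port set;
determining, based on the loop port set and the first multi-client access service set and according to the flow table, a second multi-client access service set, where each service in the second multi-client access service set is accessed by multiple clients, its IP address and port belong to different ends, and its port does not belong to the loop port set; and
merging the first multi-client access service set, the second multi-client access service set, and the single-client access service set to obtain the multiple services.
Optionally, the determining ports with loops according to the flow table includes:
for each port in the flow table, obtaining a same-end IP address set and a peer-end IP address set of the port according to the flow table;
determining the intersection of the same-end IP address set and the peer-end IP address set of the port to obtain multiple IP addresses;
determining, according to the flow table, the total number of data flows that correspond to the multiple IP addresses among all data flows passing through the port, and using it as a first total flow number;
if the first total flow number is greater than a first threshold, determining the total number of all data flows passing through the port, and using it as a second total flow number; and
if the ratio of the first total flow number to the second total flow number is greater than a second threshold, determining that the port is a port with a loop.
The same-end IP address set of a port is the set of IP addresses on the same side as the port, and the peer-end IP address set is the set of IP addresses on the opposite side. A loop port is a port for which most of the data flows passing through it have the same source IP and destination IP, that is, the source device and the destination device of most of those data flows are the same device.
Generally, the multiple IP addresses obtained from the intersection of the same-end and peer-end IP address sets are the IP addresses of devices that act as both source and destination, in which case the port can be regarded as a potential loop port. To further verify whether the port has a loop, the first total flow number is determined. If it is greater than the first threshold, the second total flow number and the ratio of the first total flow number to the second total flow number are determined. If the ratio is greater than the second threshold, most of the data flows passing through the port have the same source IP and destination IP, that is, their source and destination devices are the same device, and the port is determined to be a port with a loop.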
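Optionally, the determining, based on the loop port set and according to the flow table, a single-client access service set and a first multi-client access service set includes:
determining, according to the flow table, the services that are accessed by a single client and whose IP address and port belong to the same end, and the services that are accessed by multiple clients and whose IP address and port belong to the same end, to obtain a single-client access potential service set and a first multi-client access potential service set; and
filtering out, from the single-client access potential service set, the services whose ports are in the loop port set to obtain the single-client access service set, and filtering out, from the first multi-client access potential service set, the services whose ports are in the loop port set to obtain the first multi-client access service set.
Optionally, the determining, according to the flow table, the services that are accessed by a single client and whose IP address and port belong to the same end, and the services that are accessed by multiple clients and whose IP address and port belong to the same end includes:
determining multiple target services according to the flow table, where each target service corresponds to an IP address and port belonging to the same end in a flow table entry, and each target service corresponds to multiple data flows;
for each of the multiple target services, determining whether the port of the target service is randomly generated;
if the port of the target service is randomly generated, determining whether the number of same-side ports corresponding to the IP address of the target service is greater than a third threshold;
if the number of same-side ports corresponding to the IP address of the target service is greater than the third threshold, determining whether the number of peers corresponding to the target service is greater than a fourth threshold;
if the number of peers corresponding to the target service is greater than the fourth threshold, determining that the target service is a service that is accessed by multiple clients and whose IP address and port belong to the same end;
if the number of peers corresponding to the target service is not greater than the fourth threshold, determining whether the peer IP address of the target service is unique; and
if the peer IP address of the target service is unique, determining that the target service is a service that is accessed by a single client and whose IP address and port belong to the same end.
In the flow table, if multiple data flows correspond to an IP address and port belonging to the same end in a flow table entry, that IP address and port can be used as a target service. That is, a target service may consist of a source IP address and source port, or of a destination IP address and destination port.
The target service corresponds to an IP address and port belonging to the same end in a flow table entry, and corresponds to multiple data flows. That is, if multiple data flows in the flow table correspond to an IP address and port belonging to the same end in a flow table entry, that IP address and port can be used as a target service. For example, if multiple data flows correspond to the destination IP address and destination port of a flow table entry, that destination IP address and destination port can be determined as a target service. Similarly, if multiple data flows correspond to the source IP address and source port of a flow table entry, that source IP address and source port can be determined as a target service.
Because the target service corresponds to an IP address and port belonging to the same end in a flow table entry and corresponds to multiple data flows, its IP address and port may be those of a server. In other words, the target service may be a service that is accessed by multiple clients and whose IP address and port belong to the same end. To confirm this, it can further be determined whether the port of the target service is randomly generated. If the port is randomly generated, it cannot yet be confirmed whether the target service is such a service, and it is also necessary to determine whether the number of same-side ports corresponding to the IP address of the target service is greater than the third threshold.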
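Optionally, after the determining whether the port of the target service is randomly generated, the method further includes:
if the port of the target service is not randomly generated, determining that the target service is a service that is accessed by multiple clients and whose IP address and port belong to the same end.
If the port number of the target service's port is less than 1024, the port is determined to be a well-known port, that is, it is not randomly generated. If the port number is greater than 1024, the port is determined to be randomly generated.
Optionally, after the determining whether the number of same-side ports corresponding to the IP address of the target service is greater than the third threshold, the method further includes:
if the number of same-side ports corresponding to the IP address of the target service is not greater than the third threshold, determining that the target service is a service that is accessed by multiple clients and whose IP address and port belong to the same end.
Optionally, the determining, based on the loop port set and the first multi-client access service set and according to the flow table, a second multi-client access service set includes:
determining, based on the first multi-client access service set and according to the flow table, the services that are accessed by multiple clients and whose IP address and port belong to different ends, to obtain a second multi-client access potential service set; and
filtering out, from the second multi-client access potential service set, the services whose ports are in the loop port set to obtain the second multi-client access service set.
Optionally, the determining, based on the first multi-client access service set and according to the flow table, the services that are accessed by multiple clients and whose IP address and port belong to different ends includes:
determining an IP address and a port that are located in the same flow table entry and belong to different ends as one reference service, to obtain multiple reference services;
for each of the multiple reference services, determining whether the reference service corresponds to multiple data flows;
if the reference service corresponds to multiple data flows, determining whether the port of the reference service is randomly generated;
if the port of the reference service is not randomly generated, determining whether the port of the reference service is contained in the first multi-client access service set;
if the port of the reference service is not contained in the first multi-client access service set, determining whether the IP address of the reference service is contained in the source IP addresses of the domain name table; and
if the IP address of the reference service is not contained in the source IP addresses of the domain name table, determining that the reference service is a service that is accessed by multiple clients and whose IP address and port belong to different ends.
A reference service also consists of one IP address and one port, but its IP address and port belong to different ends. For example, the destination IP address and source port of the same flow table entry can form a reference service, and the source IP address and destination port of the same flow table entry can also form a reference service.
The reference service also consists of one IP address and one port, but unlike a target service, its IP address and port belong to different ends. For example, the destination IP address and source port of the same flow table entry can form a reference service, and the source IP address and destination port of the same flow table entry can also form a reference service.
In some cases, the source IP address and destination IP address may be swapped in the flow table obtained from feature extraction, so that the source IP address of a reference service may be recorded as the destination IP address, or the destination IP address as the source IP address. It then cannot be judged whether the IP address and port of the reference service really belong to different ends of the same flow table entry. However, the source IP addresses in the domain name table are always correct. Therefore, if the IP address of the reference service is not contained in the source IP addresses of the domain name table, the reference service can be determined to be a service that is accessed by multiple clients and whose IP address and port belong to different ends.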
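Optionally, the clustering the multiple services according to the flow table and the domain name table to obtain multiple application types includes:
performing time correlation clustering on the multiple services according to the flow table and the domain name table to obtain a time correlation clustering result;
selecting periodic services from the multiple services according to the flow table to obtain a periodic clustering result;
obtaining multiple first services and multiple second services from the multiple services according to the domain name table, where the multiple first services are services that are accessed by multiple clients and have corresponding domain names, and the multiple second services include services that are accessed by multiple clients and have no corresponding domain name and services that are accessed by a single client;
performing semantic relevance clustering on the multiple first services to obtain a semantic relevance clustering result;
performing client similarity clustering on the multiple second services to obtain a client similarity clustering result; and
fusing the time correlation clustering result, the periodic clustering result, the semantic relevance clustering result, and the client similarity clustering result to obtain the multiple application types.
As described above, the multiple services include services accessed by multiple clients as well as services accessed by a single client, and each domain name table entry includes a source IP address, a destination domain name, a destination IP address, and a domain name type. Therefore, for the services accessed by multiple clients, the domain name table can be used to determine whether each service has a corresponding domain name, and the services that are accessed by multiple clients and have corresponding domain names are selected as the multiple first services. The services that are accessed by multiple clients and have no corresponding domain name can be selected at the same time.
Optionally, the performing time correlation clustering on the multiple services according to the flow table and the domain name table to obtain a time correlation clustering result includes:
obtaining, according to the flow table, the flow start time in the flow table entry of each of the multiple services;
determining the time difference between every two of the multiple services according to the obtained flow start times;
determining the time correlation between every two of the multiple services from the determined time differences;
selecting, from the multiple services, the services that satisfy a time correlation condition according to the time correlation between every two services;
generating a similarity matrix according to the time correlation between the selected services;
determining a spectral clustering result of the multiple services by spectral clustering analysis of the similarity matrix;
determining the similarity between every two of the multiple services according to the domain name table; and
determining the time correlation clustering result according to the similarity between every two of the multiple services and the spectral clustering result.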
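Optionally, the selecting periodic services from the multiple services according to the flow table includes:
for each of the multiple services, obtaining, according to the flow table, the flow start times of multiple data flows with which the same client accesses the service;
determining the time difference between every two adjacent flow start times, in chronological order of the flow start times of the multiple data flows accessing the service;
determining, based on the determined time differences and by Fourier transform, whether the periodicity of the service is strong periodicity; and
if the periodicity of the service is strong periodicity, determining that the service is a periodic service.
Optionally, the performing semantic relevance clustering on the multiple first services to obtain a semantic relevance clustering result includes:
clustering the multiple first services according to the semantic relevance of their domain names to obtain multiple first clustering results;
merging the multiple first clustering results based on the domain name similarity between them to obtain multiple second clustering results; and
clustering the un-clustered services among the multiple first services with the multiple second clustering results according to the domain name semantic relevance between each un-clustered service and each second clustering result, to obtain the semantic relevance clustering result.
Optionally, the clustering the multiple first services according to the semantic relevance of their domain names includes:
obtaining, from the multiple first services, multiple third services that are accessed by multiple clients and correspond to a unique domain name, and multiple fourth services that are accessed by multiple clients and correspond to multiple domain names;
obtaining mergeable domain names from the domain names corresponding to the multiple fourth services, and merging them to obtain multiple first domain names;
removing the digits and symbols from each first domain name and from the domain name corresponding to each third service to obtain multiple second domain names;
clustering the services corresponding to the multiple second domain names according to the semantic relevance of the multiple second domain names; and
clustering the services corresponding to the non-mergeable domain names according to the semantic relevance of the non-mergeable domain names among the domain names corresponding to the multiple fourth services.
Because the multiple first services are all accessed by multiple clients, the domain name corresponding to each first service can be determined from the domain name table. If the first service corresponds to a unique domain name, it is determined to be a third service, that is, a service accessed by multiple clients that corresponds to a unique domain name. If the first service corresponds to more than one domain name, it is determined to be a fourth service, that is, a service accessed by multiple clients that corresponds to multiple domain names.
Optionally, the performing client similarity clustering on the multiple second services to obtain a client similarity clustering result includes:
determining the IP addresses of the clients that access each of the multiple second services; and
clustering the multiple second services according to the intersection over union of the IP addresses of the clients that access each second service, to obtain the client similarity clustering result.
Optionally, the determining a label corresponding to each of the multiple application types includes:
dividing the multiple application types into a first application group, a second application group, and a third application group, where the services included in each application type in the first application group have corresponding domain names, none of the services included in each application type in the second application group has a corresponding domain name, and each application type in the third application group corresponds to one un-clustered service;
determining the label corresponding to each application type in the first application group based on the domain names corresponding to the services included in that application type; and
determining the label corresponding to each application type in the second application group and the third application group.
After the network device determines the label corresponding to each of the multiple application types, it can also display these labels.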
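According to a second aspect, an application identification apparatus is provided. The apparatus has the function of implementing the behavior of the application identification method in the first aspect, and includes at least one module configured to implement the application identification method provided in the first aspect.
According to a third aspect, a network device is provided. The network device includes a processor and a memory. The memory is configured to store a program for executing the application identification method provided in the first aspect and to store the data involved in implementing that method. The processor is configured to execute the program stored in the memory. The network device may further include a communication bus, which is used to establish a connection between the processor and the memory.
According to a fourth aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores instructions that, when run on a computer, cause the computer to execute the application identification method of the first aspect.
According to a fifth aspect, a computer program product containing instructions is provided. When the computer program product runs on a computer, it causes the computer to execute the application identification method of the first aspect.
The technical solutions provided in this application bring at least the following beneficial effects:
Multiple services are obtained by analyzing the flow behavior characteristics of the flow table. Because each service is identified by one IP address and one port, and an application is usually composed of a group of services, clustering the multiple services according to the flow table and the domain name table yields multiple application types, where each application type includes multiple services and corresponds to one application. The label of each application type can then be determined, so that the application to which a data flow belongs can be identified through the label. The identification process in this application therefore does not require a traffic signature database and relies only on flow behavior characteristics. When a new application appears, it can be identified directly from the IP address and port of the server that it accesses, which improves the application recognition rate.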
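Brief Description of Drawings
FIG. 1 is an architecture diagram of an application identification system according to an embodiment of this application;
FIG. 2 is a schematic structural diagram of a computer device according to an embodiment of this application;
FIG. 3 is a flowchart of an application identification method according to an embodiment of this application;
FIG. 4 is a schematic diagram of an undirected graph according to an embodiment of this application;
FIG. 5 is a schematic structural diagram of an application identification apparatus according to an embodiment of this application;
FIG. 6 is a schematic diagram of an analysis module according to an embodiment of this application;
FIG. 7 is a schematic diagram of a clustering module according to an embodiment of this application.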
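Detailed Description of Embodiments
To make the objectives, technical solutions, and advantages of this application clearer, the implementations of this application are described in further detail below with reference to the accompanying drawings.
Before the application identification method provided in the embodiments of this application is explained, its application scenario is introduced.
Private networks such as enterprise campuses usually carry both critical and non-critical services. When non-critical services occupy a large share of bandwidth, less bandwidth is left for critical services, which may degrade their quality. An enterprise's critical services usually correspond to its private applications. To improve the quality of critical services, it is therefore usually necessary to identify the enterprise's private applications so that network administrators can configure policies that protect them. For example, the bandwidth required for normal operation of the enterprise's private applications can be guaranteed, while public network applications are rate-limited. That is, data flows of private applications are not rate-limited and data flows of public network applications are, which improves the quality of critical services.
Guaranteeing the quality of critical services through rate limiting is only one application scenario of this application. This application can also be applied in other scenarios, which are not listed one by one here.
FIG. 1 is an architecture diagram of an application identification system according to an embodiment of this application. Referring to FIG. 1, the system includes multiple clients 101, a network device 102, and multiple servers 103. Each client 101 communicates with the network device 102 over a wired or wireless connection, and each server 103 also communicates with the network device 102 over a wired or wireless connection.
Applications are installed on each of the clients 101. When a client 101 runs an application, data flows are generated, and the client 101 can send these data flows to the network device 102. On receiving the data flows, the network device 102 can process them to identify the corresponding applications. When the data flows are then transmitted to a server 103, the server 103 can process them to respond to the operations of the client 101.
An application installed on the client 101 may be a private application or a public network application. A private application is an application used inside the enterprise, and a public network application is an application that anyone can use. For example, a private application may be an application used for communication within the enterprise, and a public network application may be an application used for communication between the inside and the outside of the enterprise.
The client 101 may be any electronic product that can interact with a user through one or more of a keyboard, touchpad, touchscreen, remote control, voice interaction, or handwriting device, for example, a personal computer (PC), mobile phone, smartphone, personal digital assistant (PDA), wearable device, pocket PC (PPC), tablet computer, smart in-vehicle unit, smart TV, or smart speaker.
The network device 102 may be a core switch, an access switch, a router, or a similar device. The server 103 may be a single server, a server cluster composed of multiple servers, or a cloud computing service center.
FIG. 1 uses only three clients and three servers to illustrate the application identification system, which does not limit the embodiments of this application. In addition to identifying enterprise private applications, the application identification method provided in the embodiments of this application can also be used to identify public network applications.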
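Referring to FIG. 2, FIG. 2 is a schematic structural diagram of a computer device according to an embodiment of this application. The computer device may be the client 101, the network device 102, or the server 103 shown in FIG. 1. The computer device includes at least one processor 201, a communication bus 202, a memory 203, and at least one communication interface 204.
The processor 201 may be a general-purpose central processing unit (CPU), a network processor (NP), a microprocessor, or one or more integrated circuits for implementing the solutions of this application, for example, an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof.
The communication bus 202 is used to transfer information between the above components. The communication bus 202 can be divided into an address bus, a data bus, a control bus, and so on. For ease of representation, only one thick line is used in the figure, but this does not mean that there is only one bus or one type of bus.
The memory 203 may be a read-only memory (ROM) or another type of static storage device that can store static information and instructions, a random access memory (RAM) or another type of dynamic storage device that can store information and instructions, an electrically erasable programmable read-only memory (EEPROM), a compact disc read-only memory (CD-ROM) or other optical disc storage (including compact discs, laser discs, optical discs, digital versatile discs, and Blu-ray discs), a magnetic disk storage medium or other magnetic storage device, or any other medium that can be used to carry or store desired program code in the form of instructions or data structures and that can be accessed by a computer, but is not limited thereto. The memory 203 may exist independently and be connected to the processor 201 through the communication bus 202. The memory 203 may also be integrated with the processor 201.
The communication interface 204 is used to communicate with other devices or communication networks. The communication interface 204 includes a wired communication interface and may also include a wireless communication interface. The wired communication interface may be, for example, an Ethernet interface, which may be an optical interface, an electrical interface, or a combination thereof. The wireless communication interface may be a wireless local area network (WLAN) interface, a cellular network communication interface, or a combination thereof.
In a specific implementation, as an embodiment, the processor 201 may include one or more CPUs, such as CPU0 and CPU1 shown in FIG. 2.
In a specific implementation, as an embodiment, the computer device may include multiple processors, such as the processor 201 and the processor 205 shown in FIG. 2. Each of these processors may be a single-core processor (single-CPU) or a multi-core processor (multi-CPU). A processor here may refer to one or more devices, circuits, and/or processing cores for processing data (such as computer program instructions).
In a specific implementation, as an embodiment, the computer device may further include an output device 206 and an input device 207. The output device 206 communicates with the processor 201 and can display information in various ways. For example, the output device 206 may be a liquid crystal display (LCD), a light emitting diode (LED) display device, a cathode ray tube (CRT) display device, or a projector. The input device 207 communicates with the processor 201 and can receive user input in various ways. For example, the input device 207 may be a mouse, a keyboard, a touchscreen device, or a sensing device.
In some embodiments, the memory 203 is used to store program code 210 for executing the solutions of this application, and the processor 201 can execute the program code 210 stored in the memory 203. For example, the computer device can implement the application identification method provided in the embodiment of FIG. 3 below through the processor 201 and the program code 210 in the memory 203.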
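FIG. 3 is a flowchart of an application identification method according to an embodiment of this application. The method is applied to the network device in the application identification system shown in FIG. 1. Referring to FIG. 3, the method includes the following steps.
Step 301: The network device extracts features from each of multiple data flows to obtain a flow table and a domain name table. The flow table includes multiple flow table entries, each of which includes a 5-tuple and a flow start time. The domain name table includes multiple domain name table entries, each of which includes a source IP address, a destination domain name, a destination IP address, and a domain name type.
Generally, a data flow may include one or more packets that share the same 5-tuple. In other words, one or more packets with the same 5-tuple form one data flow. In addition, because the destination domain name is the domain name corresponding to the destination IP address, the domain name type is unique when the destination domain name is unique. The network device can therefore extract the 5-tuple, destination domain name, and domain name type from any packet of each data flow, and determine the receive time of the first packet of each data flow as the flow start time. The network device can then obtain the source IP address and destination IP address from the 5-tuple, generate the flow table from the 5-tuple and flow start time of each data flow, and generate the domain name table from the source IP address, destination domain name, destination IP address, and domain name type of each data flow.
Because the 5-tuple includes a source IP address, a source port, a destination IP address, a destination port, and a protocol number, the network device can obtain the source IP address and destination IP address of a data flow from the 5-tuple. For example, if a client needs to send a packet to a server, the source IP address and source port of the packet are the client's IP address and port, the destination IP address and destination port are the server's IP address and port, and the protocol number is the number of the transport protocol used between the client and the server.
As described above, the flow start time of a data flow is the receive time of the first packet of that data flow. However, this first packet is not necessarily the first packet of the entire data flow; it is the first packet among those received during the current feature extraction. For example, suppose features are to be extracted from all data flows captured between 1:30 and 2:00. If the first packet of data flow A was received at 1:00, and the first packet of data flow A within 1:30 to 2:00 was received at 1:31, then the flow start time of data flow A is 1:31.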
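The flow-start-time rule above can be sketched in a few lines. This is a minimal illustration, not the patent's implementation: the names `FiveTuple` and `build_flow_table`, the packet representation as `(receive_time, five_tuple)` pairs, and the use of minutes as the time unit are all assumptions made for the example.

```python
from collections import namedtuple

# Hypothetical record type for the 5-tuple described in the text.
FiveTuple = namedtuple("FiveTuple", "src_ip src_port dst_ip dst_port proto")

def build_flow_table(packets, window_start, window_end):
    """Group packets by 5-tuple and keep, per flow, the receive time of the
    first packet seen inside the extraction window."""
    flow_table = {}
    for recv_time, key in packets:
        if not (window_start <= recv_time <= window_end):
            continue  # packet is outside the current extraction window
        if key not in flow_table or recv_time < flow_table[key]:
            flow_table[key] = recv_time
    return flow_table

# Data flow A: first packet at 1:00 (minute 60), next packets at 1:31 and 1:40.
flow_a = FiveTuple("IP01", 40001, "IP001", 443, 6)
packets = [(60, flow_a), (91, flow_a), (100, flow_a)]
table = build_flow_table(packets, 90, 120)  # window 1:30 to 2:00
```

With the window set to 1:30 to 2:00, the recorded start time of flow A is minute 91 (1:31), matching the example in the text, even though the flow actually began at 1:00.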
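The domain name type takes one of two forms: A.name and C.name. A.name resolves a host name or domain name to an IP address. C.name resolves multiple host names or domain names to another domain name, which is in turn resolved to an IP address, and this IP address is the same as the one that the A.name resolves to. That is, multiple C.name entries are equivalent to branches of one A.name.
In the embodiments of this application, the network device can also set a trigger condition for feature extraction, that is, it extracts features from the data flows when the trigger condition is met. As one example, while capturing data flows, the network device can determine whether the amount of captured data has reached a data volume threshold, and when it has, extract features from the captured data flows. For example, if the data volume threshold is 200M, the network device checks during capture whether the amount of captured data has reached 200M and, if so, extracts features from the captured data flows.
As another example, the network device can track the time difference between the capture start time and the current time, and extract features from the captured data flows when the time difference reaches a first time threshold. For example, if the first time threshold is 30 minutes, the network device extracts features from the captured data flows once 30 minutes have elapsed since capture started.
The data volume threshold and the first time threshold can be set as required.
In some embodiments, after the flow table and the domain name table are obtained, they can be preprocessed. As an example, for the flow table, the network device can deduplicate and merge the flow table entries and, after merging, delete the entries with incomplete information to obtain the preprocessed flow table. For the domain name table, the network device can deduplicate the domain name table entries, keep the entries whose domain name type is A.name, and then delete the entries with incomplete information to obtain the preprocessed domain name table.
For example, the flow table and domain name table obtained after feature extraction may be as shown in Table 1 and Table 2 below. The first and second flow table entries in Table 1 are duplicates, so either of them can be deleted. The second and third flow table entries in Table 1 have the same 5-tuple but different flow start times. Assuming the second entry has the earlier flow start time, the second entry is kept as the merged entry and the third is deleted. After deduplication and merging, flow table entries with incomplete information must also be deleted. In Table 1, the 5-tuples of the 6th, 8th, and 11th flow table entries are incomplete, so these three entries are deleted. This completes the preprocessing of the flow table and yields the preprocessed flow table shown in Table 3.
The first and second domain name table entries in Table 2 are duplicates, so either of them can be deleted. The domain name type of the 3rd, 5th, and 7th domain name table entries in Table 2 is C.name. Because an entry of type C.name is equivalent to a branch of an entry of type A.name, only the entries of type A.name are kept and the entries of type C.name are deleted, so that the data flows with prominent features are retained for flow behavior analysis and the multiple services can be determined more accurately. After this processing, domain name table entries with incomplete information must also be deleted. In Table 2, the 6th and 8th domain name table entries are incomplete, so these two entries are deleted. This completes the preprocessing of the domain name table and yields the preprocessed domain name table shown in Table 4.
Table 1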
Figure PCTCN2020112316-appb-000001
Table 2
Source IP   Destination domain name   Destination IP   Domain name type
IP01        abpd-jap.xxx.com          IP001            A.name
IP02        acnd-jap.xxx.com          IP002            C.name
IP03        abnc-jp.xxx.com           IP003            A.name
IP04        afnd-hx.xxx.com           IP004            C.name
IP05        (missing)                 IP005            A.name
IP06        acjd-jap.xxx.com          IP006            C.name
IP07        abed-jap.xxx.com          (missing)        A.name
IP08        abnd-hx.xxx.com           IP008            A.name
IP09        abnt-jap.xxx.com          IP009            A.name
IP10        asnd-jp.xxx.com           IP011            A.name
IP11        atnd-jap.xxx.com          IP012            A.name
IP12        abyd-jnp.xxx.com          IP013            A.name
…           …                         …                …
Table 3
Figure PCTCN2020112316-appb-000002
Table 4
Source IP   Destination domain name   Destination IP   Domain name type
IP01        abpd-jap.xxx.com          IP001            A.name
IP03        abnc-jp.xxx.com           IP003            A.name
IP08        abnd-hx.xxx.com           IP008            A.name
IP09        abnt-jap.xxx.com          IP009            A.name
IP10        asnd-jp.xxx.com           IP011            A.name
IP11        atnd-jap.xxx.com          IP012            A.name
IP12        abyd-jnp.xxx.com          IP013            A.name
…           …                         …                …
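The domain name table preprocessing (deduplicate, keep only A.name entries, drop incomplete entries) can be sketched as follows. This is only an illustration under stated assumptions: the function name `preprocess_domain_table` and the tuple layout `(source_ip, domain, destination_ip, type)` are hypothetical, and missing cells are represented as `None`. The rows are the ones listed in Table 2.

```python
def preprocess_domain_table(entries):
    """Deduplicate entries, keep only type A.name, and drop incomplete ones.
    Each entry is (source_ip, domain, destination_ip, domain_type)."""
    seen = set()
    result = []
    for entry in entries:
        if entry in seen:
            continue  # duplicate entry
        seen.add(entry)
        src_ip, domain, dst_ip, domain_type = entry
        if domain_type != "A.name":
            continue  # C.name entries are treated as branches of an A.name
        if not (src_ip and domain and dst_ip):
            continue  # incomplete information
        result.append(entry)
    return result

table2 = [
    ("IP01", "abpd-jap.xxx.com", "IP001", "A.name"),
    ("IP02", "acnd-jap.xxx.com", "IP002", "C.name"),
    ("IP03", "abnc-jp.xxx.com", "IP003", "A.name"),
    ("IP04", "afnd-hx.xxx.com", "IP004", "C.name"),
    ("IP05", None, "IP005", "A.name"),
    ("IP06", "acjd-jap.xxx.com", "IP006", "C.name"),
    ("IP07", "abed-jap.xxx.com", None, "A.name"),
    ("IP08", "abnd-hx.xxx.com", "IP008", "A.name"),
    ("IP09", "abnt-jap.xxx.com", "IP009", "A.name"),
    ("IP10", "asnd-jp.xxx.com", "IP011", "A.name"),
    ("IP11", "atnd-jap.xxx.com", "IP012", "A.name"),
    ("IP12", "abyd-jnp.xxx.com", "IP013", "A.name"),
]
table4 = preprocess_domain_table(table2)
```

Applied to the listed rows, the sketch keeps the seven entries with source IPs IP01, IP03, IP08, IP09, IP10, IP11, and IP12, which are the rows shown in Table 4.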
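Preprocessing the flow table and the domain name table is optional. That is, the subsequent steps can be carried out on the flow table and domain name table either before or after preprocessing, and the two implementations are similar. In the embodiments of this application, the subsequent steps are explained using the preprocessed tables. That is, the flow table and domain name table mentioned in the subsequent steps are the preprocessed flow table and domain name table.
Although the flow table and the domain name table can be obtained through step 301, they contain a large amount of data, with many flows and complex flow behavior characteristics, so it is difficult to identify the application corresponding to a data flow directly from them. However, an application is usually composed of a group of services, and a service is identified by one IP address and one port, so the application corresponding to a data flow can be identified in stages. That is, first analyze the flow behavior characteristics of the flow table according to step 302 to obtain multiple services, then cluster the multiple services according to step 303 to obtain multiple application types, and then determine the label corresponding to each application type according to step 304 to identify the application corresponding to the data flow.
Step 302: The network device analyzes flow behavior characteristics according to the flow table to obtain multiple services, where each service is identified by one IP address and one port.
In some embodiments, step 302 can be implemented through the following steps (1) to (4):
(1) Determine ports with loops according to the flow table to obtain a loop port set.
In some embodiments, for each port in the flow table, the network device can obtain the same-end IP address set and the peer-end IP address set of the port according to the flow table, and determine the intersection of the two sets to obtain multiple IP addresses. According to the flow table, it determines the total number of data flows corresponding to the multiple IP addresses among all data flows passing through the port, and uses it as the first total flow number. If the first total flow number is greater than the first threshold, it determines the total number of all data flows passing through the port and uses it as the second total flow number. If the ratio of the first total flow number to the second total flow number is greater than the second threshold, the port is determined to be a port with a loop. The ports with loops in the flow table then form the loop port set.
The network device can determine the two totals as follows. From the flow table, select the flow table entries that contain the port, count them, and take this count as the total number of all data flows passing through the port. Then, among the selected entries, count those whose source IP address or destination IP address is any one of the multiple IP addresses, and take this count as the total number of data flows corresponding to the multiple IP addresses among all data flows passing through the port.
For example, when the port is a source port, select from the flow table the entries whose source port is this port, count them, and take this count as the total number of all data flows passing through the port. Then, among the selected entries, count those whose source IP address or destination IP address is any one of the multiple IP addresses, and take this count as the total number of data flows corresponding to the multiple IP addresses.
The same-end IP address set of a port is the set of IP addresses on the same side as the port, and the peer-end IP address set is the set of IP addresses on the opposite side. For example, if the port is a source port, its same-end IP address set is the set of source IP addresses and its peer-end IP address set is the set of destination IP addresses. Similarly, if the port is a destination port, its same-end IP address set is the set of destination IP addresses and its peer-end IP address set is the set of source IP addresses.
The first threshold and the second threshold can be set as required. For example, the first threshold may be 20 and the second threshold may be 0.2.
Generally, the multiple IP addresses obtained from the intersection of the same-end and peer-end IP address sets are the IP addresses of devices that act as both source and destination, in which case the port can be regarded as a potential loop port. To further verify whether the port has a loop, the first total flow number is determined. If it is greater than the first threshold, the second total flow number and the ratio of the first total flow number to the second total flow number are determined. If the ratio is greater than the second threshold, most of the data flows passing through the port have the same source IP and destination IP, that is, their source and destination devices are the same device, and the port is determined to be a port with a loop.
For example, consider destination port Port001 in Table 3. Suppose 100 flow table entries in Table 3 have destination port Port001, and in 30 of them the destination IP address and source IP address are the same. The first total flow number is then 30 and the second total flow number is 100. Suppose the network device sets the first threshold to 20 and the second threshold to 0.2. The first total flow number 30 is greater than the first threshold 20, so the ratio of the first total flow number to the second total flow number is determined. The ratio is 0.3, which is greater than the second threshold 0.2, so Port001 is determined to be a port with a loop. Similarly, consider source port Port02 in Table 3. Suppose 150 flow table entries in Table 3 have source port Port02, and in 50 of them the source IP address and destination IP address are the same. The first total flow number is then 50 and the second total flow number is 150. The first total flow number 50 is greater than the first threshold 20, so the ratio is determined. The ratio is approximately 0.33, which is greater than the second threshold 0.2, so Port02 is determined to be a port with a loop.
Further, if the first total flow number is not greater than the first threshold, or the ratio of the first total flow number to the second total flow number is not greater than the second threshold, the port is determined not to be a port with a loop.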
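The loop-port test in step (1) can be sketched as below. This is a hedged illustration rather than the patent's code: the function name `is_loop_port`, the `side` parameter, and the flow entry layout `(src_ip, src_port, dst_ip, dst_port)` are assumptions. The default thresholds 20 and 0.2 are the example values given in the text.

```python
def is_loop_port(flows, port, side, first_threshold=20, second_threshold=0.2):
    """Decide whether `port` (a source port if side == "src", otherwise a
    destination port) has a loop. Each flow is (src_ip, src_port, dst_ip, dst_port)."""
    if side == "src":
        through = [f for f in flows if f[1] == port]
        same_end = {f[0] for f in through}
        peer_end = {f[2] for f in through}
    else:
        through = [f for f in flows if f[3] == port]
        same_end = {f[2] for f in through}
        peer_end = {f[0] for f in through}
    shared = same_end & peer_end  # IPs acting as both source and destination
    # First total flow number: flows through the port that involve a shared IP.
    first_total = sum(1 for f in through if f[0] in shared or f[2] in shared)
    if first_total <= first_threshold:
        return False
    second_total = len(through)  # all flows through the port
    return first_total / second_total > second_threshold

# 100 flows to destination port "Port001"; in 30 of them source IP == destination IP.
loop_flows = [("IPloop", 30000 + i, "IPloop", "Port001") for i in range(30)]
normal_flows = [("IPc%d" % i, 40000 + i, "IPsrv", "Port001") for i in range(70)]
result = is_loop_port(loop_flows + normal_flows, "Port001", "dst")
```

In this constructed data set the first total flow number is 30 and the second is 100, so the ratio 0.3 exceeds 0.2 and `result` is `True`, mirroring the Port001 example.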
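(2) Based on the loop port set, determine the single-client access service set and the first multi-client access service set according to the flow table. Each service in the single-client access service set is accessed by a single client, its IP address and port belong to the same end, and its port does not belong to the loop port set. Each service in the first multi-client access service set is accessed by multiple clients, its IP address and port belong to the same end, and its port does not belong to the loop port set.
In some embodiments, the network device can determine, according to the flow table, the services that are accessed by a single client and whose IP address and port belong to the same end, and the services that are accessed by multiple clients and whose IP address and port belong to the same end, to obtain the single-client access potential service set and the first multi-client access potential service set. It then filters out, from the single-client access potential service set, the services whose ports are in the loop port set to obtain the single-client access service set, and filters out, from the first multi-client access potential service set, the services whose ports are in the loop port set to obtain the first multi-client access service set.
The network device can determine these services as follows. Determine multiple target services according to the flow table, where each target service corresponds to an IP address and port belonging to the same end in a flow table entry, and each target service corresponds to multiple data flows. For each target service, determine whether its port is randomly generated. If the port is randomly generated, determine whether the number of same-side ports corresponding to the IP address of the target service is greater than the third threshold. If it is, determine whether the number of peers corresponding to the target service is greater than the fourth threshold. If the number of peers is greater than the fourth threshold, the target service is determined to be a service that is accessed by multiple clients and whose IP address and port belong to the same end. If the number of peers is not greater than the fourth threshold, determine whether the peer IP address of the target service is unique. If it is unique, the target service is determined to be a service that is accessed by a single client and whose IP address and port belong to the same end.
The target service corresponds to an IP address and port belonging to the same end in a flow table entry, and corresponds to multiple data flows. That is, if multiple data flows in the flow table correspond to an IP address and port belonging to the same end in a flow table entry, that IP address and port can be used as a target service. For example, if multiple data flows correspond to the destination IP address and destination port of a flow table entry, that destination IP address and destination port can be determined as a target service. Similarly, if multiple data flows correspond to the source IP address and source port of a flow table entry, that source IP address and source port can be determined as a target service.
In some embodiments, the network device can determine whether a target service corresponds to multiple data flows as follows. The network device determines the number of flow table entries that contain the target service. If this number is greater than a fifth threshold, the target service is determined to correspond to multiple data flows. Otherwise, the target service is determined not to correspond to multiple data flows.
The third, fourth, and fifth thresholds can be set as required. For example, the third threshold may be 20, the fourth threshold may be 5, and the fifth threshold may be 10.
Because the target service corresponds to an IP address and port belonging to the same end in a flow table entry and corresponds to multiple data flows, its IP address and port may be those of a server. In other words, the target service may be a service that is accessed by multiple clients and whose IP address and port belong to the same end. To confirm this, it can further be determined whether the port of the target service is randomly generated. In some embodiments, this is done by checking whether the port number is greater than 1024. If the port number is less than 1024, the port is determined to be a well-known port, that is, it is not randomly generated. If the port number is greater than 1024, the port is determined to be randomly generated.
If the port of the target service is randomly generated, it cannot yet be confirmed whether the target service is a service that is accessed by multiple clients and whose IP address and port belong to the same end, and it is also necessary to determine whether the number of same-side ports corresponding to the IP address of the target service is greater than the third threshold. To do so, the number of same-side ports must first be determined. In some embodiments, this is done as follows: select from the flow table the entries that contain the IP address of the target service, count the ports in the selected entries that belong to the same end as that IP address, and take this count as the number of same-side ports corresponding to the IP address of the target service.
For example, as described above, a target service may consist of a source IP address and source port, or of a destination IP address and destination port. When the target service consists of a source IP address and source port, select from the flow table the entries that contain the IP address of the target service, count the source ports in the selected entries, and take this count as the number of same-side ports. When the target service consists of a destination IP address and destination port, select from the flow table the entries that contain the IP address of the target service, count the destination ports in the selected entries, and take this count as the number of same-side ports.
Because a server usually does not have many ports while the clients accessing a server are usually numerous, when the number of same-side ports corresponding to the IP address of the target service is greater than the third threshold, it can further be judged whether the number of peers corresponding to the target service is greater than the fourth threshold. To do so, the number of peers must first be determined. In some embodiments, this is done as follows: select from the flow table the entries that contain the target service, count the IP addresses in the selected entries that belong to the opposite end from the target service, and take this count as the number of peers corresponding to the target service.
For example, when the target service consists of a source IP address and source port, select from the flow table the entries that contain the target service, count the destination IP addresses in the selected entries, and take this count as the number of peers. When the target service consists of a destination IP address and destination port, select from the flow table the entries that contain the target service, count the source IP addresses in the selected entries, and take this count as the number of peers.
When the number of peers corresponding to the target service is greater than the fourth threshold, the target service can be determined to be a service that is accessed by multiple clients and whose IP address and port belong to the same end. That is, the successive checks above make it possible to determine such services accurately. When the number of peers is not greater than the fourth threshold, the target service may be accessed by a single client, and its IP address and port may be those of a server. In this case, it can be judged whether the peer IP address of the target service is unique. If it is unique, the target service can be directly determined to be a service that is accessed by a single client and whose IP address and port belong to the same end.
For example, consider a target service consisting of a destination IP address and destination port: IP001+Port001. Assume the destination port Port001 is randomly generated, so the number of destination ports corresponding to IP address IP001 must be determined. Suppose IP001 corresponds to 25 destination ports and the network device sets the third threshold to 20. Because the number of destination ports corresponding to IP001 is greater than 20, the number of source ends corresponding to the target service is determined next. Suppose the target service IP001+Port001 has 10 flow table entries whose source IP address and source port are not all identical, and the network device sets the fourth threshold to 5. The number of source ends corresponding to the target service is then 10, which is greater than 5, so the target service IP001+Port001 is determined to be a service that is accessed by multiple clients and whose IP address and port belong to the same end.
If the target service IP001+Port001 has 3 flow table entries whose source IP address and source port are not all identical, the number of source ends corresponding to the target service is 3, which is less than 5, so it is determined whether the source IP addresses of these 3 entries are the same. If the 3 entries have the same source IP address and differ only in source port, the target service IP001+Port001 is determined to be a service that is accessed by a single client and whose IP address and port belong to the same end.
Further, because well-known ports are ports reserved on servers, if the port of the target service is not randomly generated, that is, it is a well-known port, the target service can be directly determined to be a service that is accessed by multiple clients and whose IP address and port belong to the same end.
Further, if the number of same-side ports corresponding to the IP address of the target service is not greater than the third threshold, the target service is determined to be a service that is accessed by multiple clients and whose IP address and port belong to the same end. Alternatively, if the peer IP address of the target service is not unique, the target service is determined to be neither a service that is accessed by a single client and whose IP address and port belong to the same end, nor a service that is accessed by multiple clients and whose IP address and port belong to the same end.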
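The decision sequence for a target service in step (2) can be sketched as a small function. This is an illustration under assumptions, not the patent's implementation: the names `classify_target_service` and `is_random_port` are hypothetical, flows are `(src_ip, src_port, dst_ip, dst_port)` tuples with integer ports, port 1024 itself is treated as not random (the text only covers "less than" and "greater than" 1024), and the number of peers is counted as distinct peer (IP, port) pairs, following the IP001+Port001 example. The candidate is assumed to have already passed the "multiple data flows" check.

```python
def is_random_port(port):
    # Assumption: port 1024 itself is treated as not random.
    return port > 1024

def classify_target_service(flows, ip, port, side,
                            third_threshold=20, fourth_threshold=5):
    """Return "multi", "single", or None for a same-end (ip, port) candidate.
    `side` is "src" or "dst", indicating which end the candidate was taken from."""
    if not is_random_port(port):
        return "multi"  # well-known port: multi-client service
    ip_idx, port_idx, peer_idx = (0, 1, 2) if side == "src" else (2, 3, 0)
    same_side_ports = {f[port_idx] for f in flows if f[ip_idx] == ip}
    if len(same_side_ports) <= third_threshold:
        return "multi"
    service_flows = [f for f in flows if f[ip_idx] == ip and f[port_idx] == port]
    peers = {(f[peer_idx], f[peer_idx + 1]) for f in service_flows}
    if len(peers) > fourth_threshold:
        return "multi"
    peer_ips = {peer_ip for peer_ip, _ in peers}
    return "single" if len(peer_ips) == 1 else None

# IP001 is seen with 25 destination ports: 8080 plus 24 others.
other_ports = [("IPc", 5000 + i, "IP001", 2000 + i) for i in range(24)]
multi_flows = other_ports + [("IP%02d" % i, 40000 + i, "IP001", 8080) for i in range(10)]
single_flows = other_ports + [("IP01", 40000 + i, "IP001", 8080) for i in range(3)]
multi_result = classify_target_service(multi_flows, "IP001", 8080, "dst")
single_result = classify_target_service(single_flows, "IP001", 8080, "dst")
```

Here `multi_result` is `"multi"` (10 peers exceed the fourth threshold of 5) and `single_result` is `"single"` (3 peers that share one source IP), following the two cases in the example. The final results would still need the loop-port filtering described in the text.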
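(3) Based on the loop port set and the first multi-client access service set, determine the second multi-client access service set according to the flow table.
In some embodiments, the network device can determine, based on the first multi-client access service set and according to the flow table, the services that are accessed by multiple clients and whose IP address and port belong to different ends, to obtain the second multi-client access potential service set. It then filters out, from the second multi-client access potential service set, the services whose ports are in the loop port set to obtain the second multi-client access service set.
The network device can determine these services as follows. Determine an IP address and a port that are located in the same flow table entry and belong to different ends as one reference service, to obtain multiple reference services. For each reference service, determine whether it corresponds to multiple data flows. If it does, determine whether its port is randomly generated. If the port is not randomly generated, determine whether the port is contained in the first multi-client access service set. If the port is not contained in the first multi-client access service set, determine whether the IP address of the reference service is contained in the source IP addresses of the domain name table. If it is not, the reference service is determined to be a service that is accessed by multiple clients and whose IP address and port belong to different ends.
The reference service also consists of one IP address and one port, but unlike a target service, its IP address and port belong to different ends. For example, the destination IP address and source port of the same flow table entry can form a reference service, and the source IP address and destination port of the same flow table entry can also form a reference service.
In some embodiments, the network device can determine whether a reference service corresponds to multiple data flows as follows. The network device determines the number of flow table entries that contain the reference service. If this number is greater than the fifth threshold, the reference service is determined to correspond to multiple data flows. Otherwise, the reference service is determined not to correspond to multiple data flows.
For how the network device determines whether the port of a reference service is randomly generated, refer to the operation described above for the port of a target service, which is not repeated here.
In some cases, the source IP address and destination IP address may be swapped in the flow table obtained from feature extraction, so that the source IP address of a reference service may be recorded as the destination IP address, or the destination IP address as the source IP address. It then cannot be judged whether the IP address and port of the reference service really belong to different ends of the same flow table entry. However, the source IP addresses in the domain name table are always correct. Therefore, if the IP address of the reference service is not contained in the source IP addresses of the domain name table, the reference service can be determined to be a service that is accessed by multiple clients and whose IP address and port belong to different ends.
Further, if the reference service does not correspond to multiple data flows, or its port is randomly generated, or its port is contained in the first multi-client access service set, or its IP address is contained in the source IP addresses of the domain name table, the reference service is determined not to be a service that is accessed by multiple clients and whose IP address and port belong to different ends.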
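The chain of checks for a reference service in step (3) can be sketched as follows. The sketch is illustrative only: the function name `is_cross_end_service`, the flow layout `(src_ip, src_port, dst_ip, dst_port)` with integer ports, and the treatment of port 1024 as not random are assumptions; the fifth threshold default of 10 is the example value from the text.

```python
def is_cross_end_service(flows, ip, port, first_multi_services,
                         dns_source_ips, fifth_threshold=10):
    """Check whether the reference service (ip, port), whose IP and port come
    from different ends of a flow entry, is a multi-client service.
    `first_multi_services` is a set of (ip, port) pairs."""
    # Count entries where the IP and port sit on opposite ends.
    count = sum(1 for s_ip, s_port, d_ip, d_port in flows
                if (s_ip == ip and d_port == port) or (d_ip == ip and s_port == port))
    if count <= fifth_threshold:
        return False  # does not correspond to multiple data flows
    if port > 1024:
        return False  # randomly generated port (1024 treated as not random)
    if port in {p for _, p in first_multi_services}:
        return False  # port already covered by a same-end multi-client service
    if ip in dns_source_ips:
        return False  # IP is a known client (a source IP in the domain name table)
    return True

# 12 entries in which the server IP was recorded on the source side
# while the server port stayed on the destination side.
swapped = [("IP001", 50000 + i, "IPc%d" % i, 443) for i in range(12)]
found = is_cross_end_service(swapped, "IP001", 443, {("IP002", 80)}, {"IP01"})
```

With these inputs `found` is `True`; any one of the four checks failing would make the function return `False`, matching the exclusion conditions listed above.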
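(4) Merge the first multi-client access service set, the second multi-client access service set, and the single-client access service set to obtain the multiple services.
Because the first multi-client access service set, the second multi-client access service set, and the single-client access service set each contain one or more services, merging the three sets yields the multiple services that belong to servers.
Step 303: The network device clusters the multiple services according to the flow table and the domain name table to obtain multiple application types.
In some embodiments, step 303 can be implemented through the following steps (1) to (6):
(1) Perform time correlation clustering on the multiple services according to the flow table and the domain name table to obtain a time correlation clustering result.
In some embodiments, the network device can obtain, according to the flow table, the flow start time in the flow table entry of each of the multiple services, and determine the time difference between every two services from the obtained flow start times. From the determined time differences, it determines the time correlation between every two services. According to these time correlations, it selects from the multiple services those that satisfy the time correlation condition, and generates a similarity matrix from the time correlation between the selected services. It then determines the spectral clustering result of the multiple services by spectral clustering analysis of the similarity matrix, determines the similarity between every two services according to the domain name table, and determines the time correlation clustering result from these similarities and the spectral clustering result.
The network device can determine the time correlation between every two services as follows. For any two of the multiple services, determine whether the time difference between them is less than a second time threshold. If it is, the two services are determined to be time-correlated. If it is not, the two services are determined not to be time-correlated. Any other pair of services can be checked in the same way.
The second time threshold can be set as required, which is not limited in the embodiments of this application.
In some embodiments, the network device can select the services that satisfy the condition as follows. According to the time correlation between every two of the multiple services, generate an undirected graph that includes multiple nodes in one-to-one correspondence with the multiple services, and edges corresponding to pairs of time-correlated services, where each edge connects the two nodes corresponding to two time-correlated services. Determine the maximum clique in the undirected graph, where the maximum clique refers to the connected region that contains the largest number of nodes after the nodes are connected by edges. The services corresponding to the nodes in the maximum clique are determined to be the services that satisfy the time correlation condition.
For example, suppose the multiple services are service A to service G. Service A is time-correlated with service B, service B with service C, service C with both service A and service D, service D with service A, service E with service F, and service E with service G. The undirected graph shown in FIG. 4 can then be generated. In this graph, an edge connects node A' (service A) and node B' (service B), an edge connects node B' and node C' (service C), edges connect node C' to node A' and to node D' (service D), an edge connects node D' and node A', an edge connects node E' (service E) and node F' (service F), and an edge connects node E' and node G' (service G). The connected region with the largest number of nodes is the one formed by nodes A', B', C', and D'. This connected region is the maximum clique of the undirected graph, so services A, B, C, and D are determined to be the services that satisfy the time correlation condition.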
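The selection step above can be sketched with a breadth-first search. Note that the text defines its "maximum clique" as the connected region with the most nodes, so the sketch computes the largest connected component rather than a clique in the strict graph-theoretic sense. The function name `largest_connected_region` and the pair-list input format are assumptions for illustration.

```python
from collections import defaultdict, deque

def largest_connected_region(services, correlated_pairs):
    """Return the set of services in the connected region with the most nodes."""
    adjacency = defaultdict(set)
    for a, b in correlated_pairs:
        adjacency[a].add(b)
        adjacency[b].add(a)
    visited, best = set(), set()
    for start in services:
        if start in visited:
            continue
        region, queue = {start}, deque([start])
        while queue:  # breadth-first traversal of one connected region
            current = queue.popleft()
            for neighbor in adjacency[current]:
                if neighbor not in region:
                    region.add(neighbor)
                    queue.append(neighbor)
        visited |= region
        if len(region) > len(best):
            best = region
    return best

services = list("ABCDEFG")
pairs = [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D"), ("D", "A"),
         ("E", "F"), ("E", "G")]
selected = largest_connected_region(services, pairs)
```

For the service A to service G example, `selected` is the set of services A, B, C, and D, while E, F, and G form a smaller region and are not selected.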
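In some embodiments, the network device can generate the similarity matrix from the time correlation between the selected services as follows. For every two selected services, if the two services are time-correlated, the corresponding element in the similarity matrix is 1. If they are not time-correlated, the corresponding element is 0. The element corresponding to a service and itself is 1.
In other embodiments, on the basis of the maximum clique described above, the network device can generate the similarity matrix as follows. For every two selected services, if there is an edge between the nodes corresponding to the two services in the maximum clique, the corresponding element in the similarity matrix is 1. If there is no edge between them, the corresponding element is 0.
For example, service A is time-correlated with service B, so the element corresponding to services A and B in the similarity matrix is 1. Service B is time-correlated with service C, so the element corresponding to services B and C is 1. Service C is time-correlated with service A, so the element corresponding to services C and A is 1. Service C is time-correlated with service D, so the element corresponding to services C and D is 1. The similarity matrix generated from the time correlation among services A, B, C, and D is: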
Figure PCTCN2020112316-appb-000003
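The similarity matrix is a matrix with n rows and n columns, that is, n services are selected from the multiple services according to the time correlation between every two services. The spectral clustering analysis clusters these n services based on the similarity matrix, that is, it determines which of the n services can be grouped into one class. For the implementation of spectral clustering analysis, refer to the related art.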
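Building the 0/1 similarity matrix from pairwise time correlation can be sketched as follows. The function name `build_similarity_matrix` and the pair-list input are assumptions. The example input uses only the four correlated pairs listed in the preceding example (A-B, B-C, C-A, C-D), so the resulting matrix is derived from that text and is not a transcription of the matrix figure in the original.

```python
def build_similarity_matrix(services, correlated_pairs):
    """Build a symmetric 0/1 matrix: 1 on the diagonal and for every
    time-correlated pair, 0 elsewhere."""
    index = {service: i for i, service in enumerate(services)}
    n = len(services)
    matrix = [[1 if i == j else 0 for j in range(n)] for i in range(n)]
    for a, b in correlated_pairs:
        i, j = index[a], index[b]
        matrix[i][j] = matrix[j][i] = 1
    return matrix

matrix = build_similarity_matrix(
    ["A", "B", "C", "D"],
    [("A", "B"), ("B", "C"), ("C", "A"), ("C", "D")],
)
```

The resulting n-by-n matrix (here n = 4) is the kind of input that a spectral clustering routine would consume; the clustering itself is left to the related art, as the text states.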
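In some embodiments, the network device can determine the similarity between every two of the multiple services according to the domain name table as follows. From the domain name table, determine the domain name corresponding to the IP address of each service, and determine the similarity between the domain names corresponding to the IP addresses of every two services, to obtain the similarity between every two services.
For any two of the multiple services, the similarity between the domain names corresponding to their IP addresses can be determined as follows. Determine from the domain name table the domain names corresponding to the IP addresses of the two services, and split each domain name into tokens to obtain all tokens of each domain name. The network device then removes the tokens that serve as the domain name suffix to obtain the token group of each domain name. It determines the intersection over union of the token groups of the two domain names and takes it as the similarity between the domain names corresponding to the IP addresses of the two services.
Most domain names are likely to share the same domain name suffix, so the similarity of the tokens other than the suffix reflects the similarity of two domain names more accurately.
The intersection over union of two token groups is the ratio of the number of elements in their intersection to the number of elements in their union. For example, the domain name corresponding to the IP address of service 1 is abnd-jap.xxx.com, whose token group may include abnd, jap, and xxx. The domain name corresponding to the IP address of service 2 is abnd-hx.xxx.com, whose token group may include abnd, hx, and xxx. The intersection of the two token groups is abnd and xxx, and the union is abnd, jap, hx, and xxx. That is, the intersection has 2 elements and the union has 4, so the intersection over union of the token groups of the two domain names is 2/4.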
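The domain name similarity described above is an intersection over union of token sets, which can be sketched as follows. The function names `domain_tokens` and `domain_similarity` are hypothetical, and two details are assumptions: tokens are obtained by splitting on dots and hyphens, and the suffix is taken to be the last dot-separated label (such as `com`).

```python
import re

def domain_tokens(domain):
    """Split a domain name into tokens and drop the suffix (last label)."""
    labels = domain.lower().split(".")
    tokens = set()
    for label in labels[:-1]:  # assumption: the suffix is the last label
        tokens.update(t for t in re.split(r"-", label) if t)
    return tokens

def domain_similarity(domain_a, domain_b):
    """Intersection over union of the two domains' token groups."""
    tokens_a, tokens_b = domain_tokens(domain_a), domain_tokens(domain_b)
    union = tokens_a | tokens_b
    return len(tokens_a & tokens_b) / len(union) if union else 0.0

similarity = domain_similarity("abnd-jap.xxx.com", "abnd-hx.xxx.com")
```

For the two example domain names, the token groups are {abnd, jap, xxx} and {abnd, hx, xxx}, so `similarity` is 2/4 = 0.5.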
由于谱聚类分析是将较为相似的多个服务划分为一类,这样,谱聚类结果就可能包括多类服务,每类服务包括较为相似的多个服务。换句话说,谱聚类结果就可能包括多个第一服务集合,每个第一服务集合包括较为相似的多个服务。但是,通过谱聚类分析之后,两个服务集合中也可能存在更相似的服务,而且,还有一些服务未进行谱聚类分析,因此,根据该多个服务中每两个服务之间的相似度和谱聚类结果,确定时间相关性聚类结果。在一些实施例中,网络设备根据该多个服务中每两个服务之间的相似度和谱聚类结果,确定时间相关性聚类结果的实现过程可以为:对于谱聚类结果包括的多个第一服务集合中的每个第一服务集合中的每个服务,基于该多个服务中每两个服务之间的相似度,确定该服务与所在的第一服务集合中其他服务之间的相似度。如果该服务与所在的第一服务集合中的其他服务之间的相似度均不大于相似度阈值,则将该服务从这个第一服务集合中剔除。当按照上述方式遍历完谱聚类结果包括每个第一服务集合中的每个服务之后,可以得到多个第二服务集合。对于剔除出的每个服务,可以基于该多个服务中每两个服务之间的相似度,确定该服务与每个第二服务集合中的每个服务之间的相似度。如果该服务与其中一个第二服务集合中的每个服务之间的相似度均大于相似度阈值,则将该服务添加至这个第二服务集合中。对于未进行谱聚类分析的服务,也按照上述剔除出的服务的处理方式进行聚类,从而得到时间相关性聚类结果。
相似度阈值可以根据需求设置,本申请实施例对此不做限定。
可选地,在本申请实施例中,网络设备根据流表和域名表,对该多个服务进行时间相关性聚类之前,还可以基于步骤302确定出的多个服务,对上述流表再次处理。在一些实施例中,可以删除流表中目的IP地址和目的端口不是该多个服务中任一服务的IP地址和端口的流表项,和/或,对源IP地址和目的IP地址颠倒的表项进行纠正,以及对源端口和目的端口颠倒的表项进行纠正。
(2)根据流表,从该多个服务中选择具有周期性的服务,得到周期性聚类结果。
在一些实施例中,对于该多个服务中的每个服务,根据流表,获取同一客户端访问该服务的多条数据流的流起始时间。按照访问该服务的多条数据流的流起始时间的先后顺序,确定每相邻两个流起始时间之间的时间差。基于确定的时间差,通过傅里叶变换,确定该服务的周期性是否为强周期性。如果该服务的周期性为强周期性,则确定该服务为具有周期性的服务。
在一些实施例中,网络设备根据流表,获取同一客户端访问该服务的多条数据流的流起始时间的实现过程可以为:从流表中,选择该服务的IP地址和端口作为目的IP地址和目的端口的流表项,确定选择的流表项中的每个源IP地址所在的流表项的数量,从最大数量对应的源IP地址所在的流表项中获取流起始时间,将获取的流起始时间确定为同一客户端访问该服务的多条数据流的流起始时间。
比如,对于服务IP008+Port008,流表中该服务的IP地址和端口作为目的IP地址和目的端口的流表项有20个,如果这20个流表项中源IP地址为IP08的流表项的数量为15个,这个20个流表项中源IP地址为IP01的流表项的数量为5个,那么,可以获取IP08所在的流表项中的流起始时间,将获取的流起始时间确定为同一客户端访问该服务的多条数据流的流起始时间。
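In some embodiments, determining through a Fourier transform whether the service is strongly periodic, based on the determined time differences, may be implemented as follows: set up a coordinate system with the sequence number of each time difference on the horizontal axis and the value of the time difference on the vertical axis, plot the determined time differences in this coordinate system to obtain a discrete signal, apply a Fourier transform to the discrete signal, and determine whether the number of peaks in the transformed signal is less than a sixth threshold. If it is, the service is determined to be strongly periodic; otherwise, the service is determined not to be strongly periodic.
The sixth threshold may be set as required, which is not limited in the embodiments of this application.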
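The peak test can be illustrated with a small discrete Fourier transform written against the standard library. The text does not define what counts as a peak, so this sketch assumes that a peak is any non-DC frequency bin whose magnitude reaches `rel_height` times the largest magnitude; both `rel_height` and the mean removal are assumptions made for the illustration.

```python
import cmath

def spectrum_peak_count(gaps, rel_height=0.5):
    """Count dominant bins in the DFT of the mean-removed gap sequence."""
    n = len(gaps)
    if n < 2:
        return 0
    mean = sum(gaps) / n
    centred = [g - mean for g in gaps]
    magnitudes = []
    for k in range(1, n // 2 + 1):  # skip the DC bin
        x_k = sum(v * cmath.exp(-2j * cmath.pi * k * i / n)
                  for i, v in enumerate(centred))
        magnitudes.append(abs(x_k))
    top = max(magnitudes)
    if top <= 1e-9 * max(1.0, abs(mean) * n):
        return 0  # flat spectrum: perfectly regular gaps
    return sum(1 for m in magnitudes if m >= rel_height * top)

def is_strongly_periodic(gaps, sixth_threshold):
    """Strongly periodic when the peak count is below the sixth threshold."""
    return spectrum_peak_count(gaps) < sixth_threshold
```

With this definition, constant gaps give no peak, gaps that alternate between two values give a single peak, and a single outlier spreads energy over every bin and is rejected for small thresholds.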
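(3) Obtain, according to the domain name table, multiple first services and multiple second services from the multiple services, where the first services are services that are accessed by multiple clients and have a corresponding domain name, and the second services include services that are accessed by multiple clients and have no corresponding domain name as well as services that are accessed by a single client.
As can be seen from step (4) in step 302, the multiple services include services accessed by multiple clients and services accessed by a single client, and each domain name table entry in the domain name table includes a source IP address, a destination domain name, a destination IP address, and a domain name type. Therefore, for the services accessed by multiple clients, the network device can determine from the domain name table whether each service has a corresponding domain name, and then filter out the services that are accessed by multiple clients and have a corresponding domain name to obtain the multiple first services. At the same time, it can also filter out the services that are accessed by multiple clients and have no corresponding domain name.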
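A minimal sketch of this split is shown below, assuming each service is an `(ip, port)` pair and each domain name table entry is a dictionary with hypothetical keys `dst_ip` and `dst_domain`.

```python
def split_services(multi_client, single_client, domain_table):
    """Return (first_services, second_services) as described in step (3)."""
    # IP addresses that appear as a destination with a resolved domain name.
    ips_with_domain = {e["dst_ip"] for e in domain_table if e.get("dst_domain")}
    first = [s for s in multi_client if s[0] in ips_with_domain]
    second = [s for s in multi_client if s[0] not in ips_with_domain]
    second += list(single_client)  # single-client services are always "second"
    return first, second
```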
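(4) Perform semantic-correlation clustering on the multiple first services to obtain a semantic-correlation clustering result.
In some embodiments, the network device may cluster the multiple first services by the semantic correlation of their domain names to obtain multiple first clustering results. Based on the domain name similarity between the first clustering results, it merges the first clustering results to obtain multiple second clustering results. Based on the semantic correlation between the domain names of the first services that are not yet clustered and each second clustering result, it clusters the unclustered services with the second clustering results to obtain the semantic-correlation clustering result.
Clustering the multiple first services by the semantic correlation of their domain names may be implemented as follows: obtain, from the first services, multiple third services that are accessed by multiple clients and correspond to a unique domain name, and multiple fourth services that are accessed by multiple clients and correspond to multiple domain names. From the domain names corresponding to the fourth services, obtain the mergeable domain names and merge them to obtain multiple first domain names. Remove the digits and symbols from each first domain name and from the domain name corresponding to each third service to obtain multiple second domain names. Cluster the services corresponding to the second domain names by the semantic correlation of the second domain names. Cluster the services corresponding to the non-mergeable domain names of the fourth services by the semantic correlation of those non-mergeable domain names.
Because all the first services are accessed by multiple clients, for each first service the network device can determine the corresponding domain name from the domain name table. If the first service corresponds to a unique domain name, it is determined to be a third service, that is, a service accessed by multiple clients and corresponding to a unique domain name. If the first service corresponds to more than one domain name, it is determined to be a fourth service, that is, a service accessed by multiple clients and corresponding to multiple domain names.
In some embodiments, obtaining the mergeable domain names from the domain names corresponding to the fourth services and merging them to obtain the first domain names may be implemented as follows: obtain, from the domain names corresponding to the fourth services, multiple domain names with the same first-level domain, and perform word segmentation on them to obtain all the tokens in each domain name. Remove the tokens that belong to the first-level domain from each domain name to obtain the token group of that domain name. Determine the IoU of the token groups of the obtained domain names. If the IoU is greater than a first IoU threshold, the domain names with the same first-level domain are determined to be mergeable. Then determine the intersection of the token groups of the obtained domain names and prepend it to the first-level domain to obtain the merged domain name, that is, the first domain name.
For example, the domain names corresponding to the fourth services are abpd-jap.xxx.com, acnd-jap.xxx.com, and abed-jap.xxx.com. Word segmentation gives the tokens abpd, jap, xxx, and com for abpd-jap.xxx.com; acnd, jap, xxx, and com for acnd-jap.xxx.com; and abed, jap, xxx, and com for abed-jap.xxx.com. Removing the tokens xxx and com, which belong to the first-level domain, gives the token group {abpd, jap} for abpd-jap.xxx.com, {acnd, jap} for acnd-jap.xxx.com, and {abed, jap} for abed-jap.xxx.com. The intersection of these three token groups contains jap, so it has 1 element; the union contains abpd, acnd, abed, and jap, so it has 4 elements. The IoU of the three token groups is therefore 1/4. Assuming this IoU is greater than the first IoU threshold, the three domain names are determined to be mergeable. The intersection of their token groups is then prepended to the first-level domain, which gives the merged domain name, that is, the first domain name jap.xxx.com.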
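The merge rule in this example can be sketched as follows. Two points are assumptions rather than statements from the text: the first-level domain is taken to be the last two dot-separated labels, and when several tokens are shared they are joined with a hyphen in the order of the first domain name.

```python
def first_level_domain(domain):
    # Assumption: the last two labels, e.g. "xxx.com".
    return ".".join(domain.split(".")[-2:])

def sub_tokens(domain):
    """Tokens of the labels in front of the first-level domain."""
    labels = domain.split(".")[:-2]
    return [t for label in labels for t in label.split("-") if t]

def merge_domains(domains, first_iou_threshold):
    """Return the merged first domain name, or None if not mergeable."""
    base = first_level_domain(domains[0])
    if any(first_level_domain(d) != base for d in domains):
        raise ValueError("domains must share one first-level domain")
    groups = [sub_tokens(d) for d in domains]
    common = set(groups[0]).intersection(*groups[1:])
    union = set().union(*groups)
    iou = len(common) / len(union) if union else 0.0
    if iou <= first_iou_threshold:
        return None
    prefix = "-".join(t for t in groups[0] if t in common)
    return f"{prefix}.{base}" if prefix else base

names = ["abpd-jap.xxx.com", "acnd-jap.xxx.com", "abed-jap.xxx.com"]
print(merge_domains(names, 0.2))  # jap.xxx.com (IoU is 1/4)
```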
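The semantic correlation of the second domain names may also be determined from the IoU of their token groups: second domain names whose IoU is greater than a second IoU threshold are placed in one class, which clusters the services corresponding to the second domain names. The IoU of the token groups of the second domain names can be determined in the way described above, which is not detailed again in the embodiments of this application.
The semantic correlation of the non-mergeable domain names of the fourth services may likewise be determined from the IoU of their token groups: non-mergeable domain names whose IoU is greater than a third IoU threshold are placed in one class, which clusters the services corresponding to the non-mergeable domain names. The IoU of the token groups of these non-mergeable domain names can be determined in the way described above, which is not detailed again in the embodiments of this application.
The first IoU threshold, the second IoU threshold, and the third IoU threshold may be set as required, and they may be the same or different.
(5) Perform client-similarity clustering on the multiple second services to obtain a client-similarity clustering result.
In some embodiments, the network device may determine the IP addresses of the clients that access each of the second services, and cluster the second services by the IoU of the client IP addresses of each second service to obtain the client-similarity clustering result.
As an example, the network device may place second services whose client IP address IoU is greater than a fourth IoU threshold in one class to obtain the client-similarity clustering result.
For example, second service A has three client IP addresses, IP01, IP02, and IP03, and second service B also has three client IP addresses, IP01, IP03, and IP05. Two of the client IP addresses of second service A are the same as two of the client IP addresses of second service B, so the intersection of their client IP addresses is {IP01, IP03} and the union is {IP01, IP02, IP03, IP05}, which gives a client IP address IoU of 2/4. Assuming this IoU is greater than the fourth IoU threshold, the two second services are clustered together to obtain the client-similarity clustering result.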
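A sketch of the client-similarity grouping follows. The text only gives the pairwise rule, so linking services transitively (single-link grouping through a small union-find) is an assumption of this illustration, as are the names `cluster_by_clients` and `clients_of`.

```python
def iou(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def cluster_by_clients(clients_of, fourth_iou_threshold):
    """clients_of maps each second service to the set of its client IP addresses."""
    services = list(clients_of)
    parent = {s: s for s in services}

    def find(s):
        while parent[s] != s:
            parent[s] = parent[parent[s]]  # path halving
            s = parent[s]
        return s

    for i, a in enumerate(services):
        for b in services[i + 1:]:
            if iou(clients_of[a], clients_of[b]) > fourth_iou_threshold:
                parent[find(a)] = find(b)

    groups = {}
    for s in services:
        groups.setdefault(find(s), set()).add(s)
    return list(groups.values())
```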
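(6) Fuse the time-correlation clustering result, the periodicity clustering result, the semantic-correlation clustering result, and the client-similarity clustering result to obtain multiple application types.
In some embodiments, the network device may determine the IoU between every two clustering results and merge two clustering results whose IoU is greater than a fifth IoU threshold. After all the clustering results have been processed in this way, the multiple application types are obtained. The IoU between two clustering results is the IoU of the services they contain, that is, the ratio of the number of services in their intersection to the number of services in their union.
Each clustering result includes multiple application types, so fusing the four clustering results yields multiple application types, and each application type may correspond to multiple services.
In addition, the fourth IoU threshold and the fifth IoU threshold may be set as required. They may be the same as or different from each other, and the same as or different from the first, second, and third IoU thresholds.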
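The fusion step can be illustrated as a repeated pairwise merge. The sketch assumes that every cluster from the four clustering results is placed in one flat list of service sets and that merging continues until no pair exceeds the threshold; the text does not spell out the iteration order.

```python
def fuse_clusters(clusters, fifth_iou_threshold):
    """Merge service sets whose IoU exceeds the threshold until none remain."""
    merged = [set(c) for c in clusters if c]
    changed = True
    while changed:
        changed = False
        for i in range(len(merged)):
            for j in range(i + 1, len(merged)):
                a, b = merged[i], merged[j]
                if len(a & b) / len(a | b) > fifth_iou_threshold:
                    merged[i] = a | b
                    del merged[j]
                    changed = True
                    break
            if changed:
                break  # restart the scan after every merge
    return merged
```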
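Step 304: The network device determines a label corresponding to each of the multiple application types, where the label is used to identify the application to which a data flow belongs.
In some embodiments, the network device may divide the multiple application types into a first application group, a second application group, and a third application group. The services included in each application type in the first application group have corresponding domain names, none of the services included in each application type in the second application group has a corresponding domain name, and each application type in the third application group corresponds to one unclustered service. The network device determines the label corresponding to each application type in the first application group based on the domain names corresponding to the services that the application type includes, and also determines the label corresponding to each application type in the second and third application groups.
In some embodiments, the network device may determine the label of each application type in the first application group as follows: for each application type in the first application group, determine the first-level domains in the domain names corresponding to the services that the application type includes. If all the determined first-level domains are the same, determine the enterprise name corresponding to that first-level domain and use it as the label of the application type. If the determined first-level domains differ, use the enterprise name corresponding to the first-level domain with the largest share as the label of the application type.
For example, an application type in the first application group includes 20 services, and the first-level domain in the domain name corresponding to each service is xxx.com. Assuming the enterprise name corresponding to this first-level domain is xxx, xxx is used as the label of the application type. As another example, an application type in the first application group includes 50 services, of which 40 have the first-level domain xxx.com and the remaining 10 have the first-level domain scmd.com. Assume the enterprise name corresponding to xxx.com is xxx and the enterprise name corresponding to scmd.com is scmd. Because xxx.com has the largest share, xxx can be used as the label of the application type.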
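The majority rule for the first application group can be sketched with a counter. The mapping `org_names` from first-level domain to enterprise name is a hypothetical input, since the text does not say where enterprise names come from, and the first-level domain is again assumed to be the last two labels.

```python
from collections import Counter

def label_for_type(domains, org_names):
    """domains: non-empty list with one domain name per service in the application type."""
    counts = Counter(".".join(d.split(".")[-2:]) for d in domains)
    top_domain, _ = counts.most_common(1)[0]
    # Fall back to the first-level domain itself when no enterprise name is known.
    return org_names.get(top_domain, top_domain)

domains = ["a.xxx.com"] * 40 + ["b.scmd.com"] * 10
print(label_for_type(domains, {"xxx.com": "xxx", "scmd.com": "scmd"}))  # xxx
```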
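In some embodiments, the network device may determine the label of each application type in the second application group as follows: for each application type in the second application group, determine, among the services that the application type includes, the services on well-known ports, the services that belong to the loop port set, the services whose source port and destination port are the same, and the services that belong to the second multi-client access service set. Determine whether the number of these services is greater than a seventh threshold. If it is not greater than the seventh threshold, generate the label of the application type from the first target characters in a first format. If it is greater than the seventh threshold, generate the label of the application type from the first target characters in a second format.
The first target characters may be set as required, for example NN. The first format and the second format may also be set as required. For example, the first format may be: first target characters/{IP01+Port001, IP02+Port002, IP01+Port003}, and the second format may be: first target characters/{IP:001, 002, 003, 004}.
For example, an application type in the second application group includes 30 services. Among these 30 services, the services on well-known ports are IP01+Port001 and IP02+Port002, the service that belongs to the loop port set is IP12+Port002, the service whose source port and destination port are the same is IP14+Port011, and the service that belongs to the second multi-client access service set is IP05+Port005. The number of these services is 5. Assuming that 5 is not greater than the seventh threshold and the first target characters are NN, the label of the application type is NN/{IP01+Port001, IP02+Port002, IP12+Port002, IP14+Port011, IP05+Port005}. Assuming that 5 is greater than the seventh threshold, the label of the application type is NN/{IP:001, 002, 002, 011, 005}.
In some embodiments, the network device may determine the label of each application type in the third application group as follows: for each application type in the third application group, determine whether the number of services that the application type includes is greater than the seventh threshold. If it is not greater than the seventh threshold, generate the label of the application type from the second target characters in the first format. If it is greater than the seventh threshold, generate the label of the application type from the second target characters in the second format.
The second target characters may be set as required, for example UKN. The first format and the second format may also be set as required. For example, the first format may be: second target characters/{IP01+Port001, IP02+Port002, IP01+Port003}, and the second format may be: second target characters/{IP:001, 002, 003, 004}.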
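The two label formats can be reproduced with a short formatter. This is a sketch under assumptions: services are passed as `(ip, port)` string pairs such as `("IP01", "Port001")`, the list separator is a comma followed by a space, and the second format lists the port numbers after the literal prefix `IP:`, which is how the worked example for the second application group reads.

```python
def make_label(target_chars, services, seventh_threshold):
    """Build a label in the first or second format, depending on the service count."""
    if len(services) <= seventh_threshold:
        # First format: every IP+port pair is spelled out.
        body = ", ".join(f"{ip}+{port}" for ip, port in services)
    else:
        # Second format: only the port numbers are listed.
        numbers = [p[4:] if p.startswith("Port") else p for _, p in services]
        body = "IP:" + ", ".join(numbers)
    return f"{target_chars}/{{{body}}}"

services = [("IP01", "Port001"), ("IP02", "Port002"), ("IP12", "Port002"),
            ("IP14", "Port011"), ("IP05", "Port005")]
print(make_label("NN", services, 10))
print(make_label("NN", services, 3))
```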
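After determining the label corresponding to each of the multiple application types, the network device may also display these labels.
In the embodiments of this application, the network device can obtain multiple services by analyzing the flow behavior features of the flow table. Each service consists of one IP address and one port identifier, and an application is usually made up of a group of services. Therefore, clustering the multiple services according to the flow table and the domain name table yields multiple application types, where each application type includes multiple services and corresponds to one application. The label of each application type can then be determined, and the application to which a data flow belongs can be identified through the label. As can be seen, the embodiments of this application do not need a traffic feature database to identify applications; flow behavior features are sufficient. When a new application appears, it can be identified directly from the IP address and port of the server that it accesses, which improves the application identification rate.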
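FIG. 5 is a schematic structural diagram of an application identification apparatus according to an embodiment of this application. The apparatus may be implemented as part or all of a network device by software, hardware, or a combination of the two. Referring to FIG. 5, the apparatus includes an extraction module 501, an analysis module 502, a clustering module 503, and a determining module 504. The functions of these four modules may all be implemented by the processor in the embodiment of FIG. 2.
The extraction module 501 is configured to perform the operation of step 301 in the embodiment of FIG. 3;
the analysis module 502 is configured to perform the operation of step 302 in the embodiment of FIG. 3;
the clustering module 503 is configured to perform the operation of step 303 in the embodiment of FIG. 3;
the determining module 504 is configured to perform the operation of step 304 in the embodiment of FIG. 3.
Optionally, referring to FIG. 6, the analysis module 502 includes:
a first determining submodule 5021, configured to determine the ports that have a loop according to the flow table to obtain a loop port set;
a second determining submodule 5022, configured to determine, based on the loop port set and according to the flow table, a single-client access service set and a first multi-client access service set, where each service in the single-client access service set is accessed by a single client, its IP address and port belong to the same end, and its port does not belong to the loop port set, and each service in the first multi-client access service set is accessed by multiple clients, its IP address and port belong to the same end, and its port does not belong to the loop port set;
a third determining submodule 5023, configured to determine, based on the loop port set and the first multi-client access service set and according to the flow table, a second multi-client access service set, where each service in the second multi-client access service set is accessed by multiple clients, its IP address and port belong to different ends, and its port does not belong to the loop port set;
a merging submodule 5024, configured to merge the first multi-client access service set, the second multi-client access service set, and the single-client access service set to obtain the multiple services.
Optionally, the first determining submodule 5021 is mainly configured to:
for each port in the flow table, obtain the same-end IP address set and the peer-end IP address set of the port according to the flow table;
determine the intersection of the same-end IP address set and the peer-end IP address set of the port to obtain multiple IP addresses;
determine, according to the flow table, the total number of data flows that correspond to the multiple IP addresses among all the data flows passing through the port, and take it as a first total flow count;
if the first total flow count is greater than a first threshold, determine the total number of all the data flows passing through the port, and take it as a second total flow count;
if the ratio of the first total flow count to the second total flow count is greater than a second threshold, determine that the port is a port that has a loop.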
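The loop-port test performed by the first determining submodule can be sketched as follows. The field names are hypothetical, and two readings are assumptions: the "same-end" IP address of a port is the address on the same side of the flow table entry as that port, and a flow "corresponds to" the shared IP addresses when either of its endpoints is one of them.

```python
from collections import defaultdict

def find_loop_ports(flow_table, first_threshold, second_threshold):
    same_end = defaultdict(set)   # port -> IP addresses on the port's own side
    peer_end = defaultdict(set)   # port -> IP addresses on the opposite side
    flows = defaultdict(dict)     # port -> {entry index: (src_ip, dst_ip)}
    for idx, e in enumerate(flow_table):
        sides = ((e["src_port"], e["src_ip"], e["dst_ip"]),
                 (e["dst_port"], e["dst_ip"], e["src_ip"]))
        for port, own_ip, other_ip in sides:
            same_end[port].add(own_ip)
            peer_end[port].add(other_ip)
            flows[port][idx] = (e["src_ip"], e["dst_ip"])

    loop_ports = set()
    for port, entries in flows.items():
        shared = same_end[port] & peer_end[port]
        first_total = sum(1 for src, dst in entries.values()
                          if src in shared or dst in shared)
        if first_total <= first_threshold:
            continue
        second_total = len(entries)
        if first_total / second_total > second_threshold:
            loop_ports.add(port)
    return loop_ports
```

Keying the per-port flows by entry index keeps a flow from being counted twice when its source port and destination port are equal.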
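Optionally, the second determining submodule 5022 is mainly configured to:
determine, according to the flow table, the services that are accessed by a single client and whose IP address and port belong to the same end, and the services that are accessed by multiple clients and whose IP address and port belong to the same end, to obtain a single-client access potential service set and a first multi-client access potential service set;
filter out, from the single-client access potential service set, the services on ports in the loop port set to obtain the single-client access service set, and filter out, from the first multi-client access potential service set, the services on ports in the loop port set to obtain the first multi-client access service set.
Optionally, the second determining submodule 5022 is further configured to:
determine multiple target services according to the flow table, where each target service corresponds to an IP address and a port that belong to the same end in one flow table entry, and each target service corresponds to multiple data flows;
for each of the multiple target services, determine whether the port of the target service is randomly generated;
if the port of the target service is randomly generated, determine whether the number of same-side ports corresponding to the IP address of the target service is greater than a third threshold;
if the number of same-side ports corresponding to the IP address of the target service is greater than the third threshold, determine whether the number of peers corresponding to the target service is greater than a fourth threshold;
if the number of peers corresponding to the target service is greater than the fourth threshold, determine that the target service is a service that is accessed by multiple clients and whose IP address and port belong to the same end;
if the number of peers corresponding to the target service is not greater than the fourth threshold, determine whether the peer IP address of the target service is unique;
if the peer IP address of the target service is unique, determine that the target service is a service that is accessed by a single client and whose IP address and port belong to the same end.
Optionally, the second determining submodule 5022 is further configured to:
if the port of the target service is not randomly generated, determine that the target service is a service that is accessed by multiple clients and whose IP address and port belong to the same end.
Optionally, the second determining submodule 5022 is further configured to:
if the number of same-side ports corresponding to the IP address of the target service is not greater than the third threshold, determine that the target service is a service that is accessed by multiple clients and whose IP address and port belong to the same end.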
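The branching applied to each target service can be written as one small decision function. How a port is judged to be randomly generated is not described in this part of the text, so `port_is_random` is taken as a precomputed input; the return values `"multi"`, `"single"`, and `None` are labels chosen for this sketch.

```python
def classify_target_service(port_is_random, same_side_port_count,
                            peer_count, peer_ip_count,
                            third_threshold, fourth_threshold):
    """Return "multi", "single", or None when no rule in the text applies."""
    if not port_is_random:
        return "multi"
    if same_side_port_count <= third_threshold:
        return "multi"
    if peer_count > fourth_threshold:
        return "multi"
    if peer_ip_count == 1:
        return "single"
    return None  # the text leaves this case undetermined
```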
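Optionally, the third determining submodule 5023 is mainly configured to:
determine, based on the first multi-client access service set and according to the flow table, the services that are accessed by multiple clients and whose IP address and port belong to different ends, to obtain a second multi-client access potential service set;
filter out, from the second multi-client access potential service set, the services on ports in the loop port set to obtain the second multi-client access service set.
Optionally, the third determining submodule 5023 is further configured to:
determine an IP address and a port that are located in the same flow table entry and belong to different ends as one reference service, to obtain multiple reference services;
for each of the multiple reference services, determine whether the reference service corresponds to multiple data flows;
if the reference service corresponds to multiple data flows, determine whether the port of the reference service is randomly generated;
if the port of the reference service is not randomly generated, determine whether the port of the reference service is included in the first multi-client access service set;
if the port of the reference service is not included in the first multi-client access service set, determine whether the IP address of the reference service is included in the source IP addresses of the domain name table;
if the IP address of the reference service is not included in the source IP addresses of the domain name table, determine that the reference service is a service that is accessed by multiple clients and whose IP address and port belong to different ends.
Optionally, referring to FIG. 7, the clustering module 503 includes:
a first clustering submodule 5031, configured to perform time-correlation clustering on the multiple services according to the flow table and the domain name table to obtain a time-correlation clustering result;
a second clustering submodule 5032, configured to select, according to the flow table, the services that are periodic from the multiple services to obtain a periodicity clustering result;
an obtaining submodule 5033, configured to obtain, according to the domain name table, multiple first services and multiple second services from the multiple services, where the first services are services that are accessed by multiple clients and have a corresponding domain name, and the second services include services that are accessed by multiple clients and have no corresponding domain name as well as services that are accessed by a single client;
a third clustering submodule 5034, configured to perform semantic-correlation clustering on the multiple first services to obtain a semantic-correlation clustering result;
a fourth clustering submodule 5035, configured to perform client-similarity clustering on the multiple second services to obtain a client-similarity clustering result;
a fusion submodule 5036, configured to fuse the time-correlation clustering result, the periodicity clustering result, the semantic-correlation clustering result, and the client-similarity clustering result to obtain multiple application types.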
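Optionally, the first clustering submodule 5031 is mainly configured to:
obtain, according to the flow table, the flow start time in the flow table entry of each of the multiple services;
determine, from the obtained flow start times, the time difference between every two of the multiple services;
determine, from the determined time differences, the time correlation between every two of the multiple services;
select, according to the time correlation between every two of the multiple services, the services that meet a time correlation condition from the multiple services;
generate a similarity matrix according to the time correlation between the selected services;
determine a spectral clustering result of the multiple services through spectral clustering according to the similarity matrix;
determine the similarity between every two of the multiple services according to the domain name table;
determine the time-correlation clustering result according to the similarity between every two of the multiple services and the spectral clustering result.
Optionally, the second clustering submodule 5032 is mainly configured to:
for each of the multiple services, obtain, according to the flow table, the flow start times of the multiple data flows with which one client accesses the service;
determine, in order of the flow start times of the multiple data flows that access the service, the time difference between every two adjacent flow start times;
determine, based on the determined time differences and through a Fourier transform, whether the service is strongly periodic;
if the service is strongly periodic, determine that the service is a periodic service.
Optionally, the third clustering submodule 5034 is mainly configured to:
cluster the multiple first services by the semantic correlation of their domain names to obtain multiple first clustering results;
merge the first clustering results based on the domain name similarity between them to obtain multiple second clustering results;
cluster the first services that are not yet clustered with the second clustering results, according to the semantic correlation between the domain names of the unclustered services and each second clustering result, to obtain the semantic-correlation clustering result.
Optionally, the third clustering submodule 5034 is further configured to:
obtain, from the multiple first services, multiple third services that are accessed by multiple clients and correspond to a unique domain name, and multiple fourth services that are accessed by multiple clients and correspond to multiple domain names;
obtain the mergeable domain names from the domain names corresponding to the fourth services, and merge them to obtain multiple first domain names;
remove the digits and symbols from each first domain name and from the domain name corresponding to each third service to obtain multiple second domain names;
cluster the services corresponding to the second domain names by the semantic correlation of the second domain names;
cluster the services corresponding to the non-mergeable domain names of the fourth services by the semantic correlation of those non-mergeable domain names.
Optionally, the fourth clustering submodule 5035 is mainly configured to:
determine the IP addresses of the clients that access each of the multiple second services;
cluster the multiple second services by the IoU of the client IP addresses of each second service to obtain the client-similarity clustering result.
Optionally, the determining module 504 is configured to:
divide the multiple application types into a first application group, a second application group, and a third application group, where the services included in each application type in the first application group have corresponding domain names, none of the services included in each application type in the second application group has a corresponding domain name, and each application type in the third application group corresponds to one unclustered service;
determine the label corresponding to each application type in the first application group based on the domain names corresponding to the services that the application type includes;
determine the label corresponding to each application type in the second application group and the third application group.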
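In the embodiments of this application, the network device can obtain multiple services by analyzing the flow behavior features of the flow table. Each service consists of one IP address and one port identifier, and an application is usually made up of a group of services. Therefore, clustering the multiple services according to the flow table and the domain name table yields multiple application types, where each application type includes multiple services and corresponds to one application. The label of each application type can then be determined, and the application to which a data flow belongs can be identified through the label. As can be seen, the embodiments of this application do not need a traffic feature database to identify applications; flow behavior features are sufficient. When a new application appears, it can be identified directly from the IP address and port of the server that it accesses, which improves the application identification rate.
When the application identification apparatus provided in the foregoing embodiment identifies applications, the division into the functional modules described above is only an example. In practice, the functions may be assigned to different functional modules as needed, that is, the internal structure of the apparatus may be divided into different functional modules to complete all or some of the functions described above. In addition, the application identification apparatus provided in the foregoing embodiment and the application identification method embodiments are based on the same concept. For the specific implementation process, see the method embodiments; details are not repeated here.
The embodiments may be implemented in whole or in part by software, hardware, firmware, or any combination thereof. When software is used, the embodiments may be implemented in whole or in part in the form of a computer program product. The computer program product includes one or more computer instructions. When the computer instructions are loaded and executed on a computer, the processes or functions according to the embodiments of this application are generated in whole or in part. The computer may be a general-purpose computer, a special-purpose computer, a computer network, or another programmable apparatus. The computer instructions may be stored in a computer-readable storage medium or transmitted from one computer-readable storage medium to another. For example, the computer instructions may be transmitted from one website, computer, server, or data center to another website, computer, server, or data center in a wired or wireless manner. The computer-readable storage medium may be any usable medium that a computer can access, or a data storage device such as a server or data center that integrates one or more usable media.
It should be understood that "multiple" herein means two or more. In the description of this application, unless otherwise specified, "/" means "or"; for example, A/B may mean A or B. The term "and/or" herein merely describes an association between associated objects and indicates that three relationships are possible; for example, A and/or B may mean that only A exists, that both A and B exist, or that only B exists. In addition, to describe the technical solutions of the embodiments of this application clearly, words such as "first" and "second" are used to distinguish between identical or similar items whose functions and effects are basically the same. A person skilled in the art will understand that words such as "first" and "second" do not limit quantity or execution order, and do not require the items to be different.
The foregoing are embodiments provided in this application and are not intended to limit this application. Any modification, equivalent replacement, or improvement made within the principles of this application shall fall within the protection scope of this application.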

Claims (23)

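  1. An application identification method, wherein the method comprises:
    extracting features from each of multiple data flows to obtain a flow table and a domain name table, wherein the flow table comprises multiple flow table entries, each of the multiple flow table entries comprises a 5-tuple and a flow start time, the domain name table comprises multiple domain name table entries, and each of the multiple domain name table entries comprises a source Internet Protocol (IP) address, a destination domain name, a destination IP address, and a domain name type;
    analyzing flow behavior features according to the flow table to obtain multiple services, wherein each service consists of one IP address and one port identifier;
    clustering the multiple services according to the flow table and the domain name table to obtain multiple application types; and
    determining a label corresponding to each of the multiple application types, wherein the label is used to identify the application to which a data flow belongs.
  2. The method according to claim 1, wherein the analyzing flow behavior features according to the flow table to obtain multiple services comprises:
    determining the ports that have a loop according to the flow table to obtain a loop port set;
    determining, based on the loop port set and according to the flow table, a single-client access service set and a first multi-client access service set, wherein each service in the single-client access service set is accessed by a single client, its IP address and port belong to the same end, and its port does not belong to the loop port set, and each service in the first multi-client access service set is accessed by multiple clients, its IP address and port belong to the same end, and its port does not belong to the loop port set;
    determining, based on the loop port set and the first multi-client access service set and according to the flow table, a second multi-client access service set, wherein each service in the second multi-client access service set is accessed by multiple clients, its IP address and port belong to different ends, and its port does not belong to the loop port set; and
    merging the first multi-client access service set, the second multi-client access service set, and the single-client access service set to obtain the multiple services.
  3. The method according to claim 2, wherein the determining the ports that have a loop according to the flow table comprises:
    for each port in the flow table, obtaining the same-end IP address set and the peer-end IP address set of the port according to the flow table;
    determining the intersection of the same-end IP address set and the peer-end IP address set of the port to obtain multiple IP addresses;
    determining, according to the flow table, the total number of data flows that correspond to the multiple IP addresses among all the data flows passing through the port, and taking it as a first total flow count;
    if the first total flow count is greater than a first threshold, determining the total number of all the data flows passing through the port, and taking it as a second total flow count; and
    if the ratio of the first total flow count to the second total flow count is greater than a second threshold, determining that the port is a port that has a loop.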
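  4. The method according to claim 2, wherein the determining, based on the loop port set and according to the flow table, a single-client access service set and a first multi-client access service set comprises:
    determining, according to the flow table, the services that are accessed by a single client and whose IP address and port belong to the same end, and the services that are accessed by multiple clients and whose IP address and port belong to the same end, to obtain a single-client access potential service set and a first multi-client access potential service set; and
    filtering out, from the single-client access potential service set, the services on ports in the loop port set to obtain the single-client access service set, and filtering out, from the first multi-client access potential service set, the services on ports in the loop port set to obtain the first multi-client access service set.
  5. The method according to claim 4, wherein the determining, according to the flow table, the services that are accessed by a single client and whose IP address and port belong to the same end, and the services that are accessed by multiple clients and whose IP address and port belong to the same end comprises:
    determining multiple target services according to the flow table, wherein each target service corresponds to an IP address and a port that belong to the same end in one flow table entry, and each target service corresponds to multiple data flows;
    for each of the multiple target services, determining whether the port of the target service is randomly generated;
    if the port of the target service is randomly generated, determining whether the number of same-side ports corresponding to the IP address of the target service is greater than a third threshold;
    if the number of same-side ports corresponding to the IP address of the target service is greater than the third threshold, determining whether the number of peers corresponding to the target service is greater than a fourth threshold;
    if the number of peers corresponding to the target service is greater than the fourth threshold, determining that the target service is a service that is accessed by multiple clients and whose IP address and port belong to the same end;
    if the number of peers corresponding to the target service is not greater than the fourth threshold, determining whether the peer IP address of the target service is unique; and
    if the peer IP address of the target service is unique, determining that the target service is a service that is accessed by a single client and whose IP address and port belong to the same end.
  6. The method according to claim 5, wherein after the determining whether the port of the target service is randomly generated, the method further comprises:
    if the port of the target service is not randomly generated, determining that the target service is a service that is accessed by multiple clients and whose IP address and port belong to the same end.
  7. The method according to claim 5, wherein after the determining whether the number of same-side ports corresponding to the IP address of the target service is greater than a third threshold, the method further comprises:
    if the number of same-side ports corresponding to the IP address of the target service is not greater than the third threshold, determining that the target service is a service that is accessed by multiple clients and whose IP address and port belong to the same end.
  8. The method according to claim 2, wherein the determining, based on the loop port set and the first multi-client access service set and according to the flow table, a second multi-client access service set comprises:
    determining, based on the first multi-client access service set and according to the flow table, the services that are accessed by multiple clients and whose IP address and port belong to different ends, to obtain a second multi-client access potential service set; and
    filtering out, from the second multi-client access potential service set, the services on ports in the loop port set to obtain the second multi-client access service set.
  9. The method according to claim 8, wherein the determining, based on the first multi-client access service set and according to the flow table, the services that are accessed by multiple clients and whose IP address and port belong to different ends comprises:
    determining an IP address and a port that are located in the same flow table entry and belong to different ends as one reference service, to obtain multiple reference services;
    for each of the multiple reference services, determining whether the reference service corresponds to multiple data flows;
    if the reference service corresponds to multiple data flows, determining whether the port of the reference service is randomly generated;
    if the port of the reference service is not randomly generated, determining whether the port of the reference service is included in the first multi-client access service set;
    if the port of the reference service is not included in the first multi-client access service set, determining whether the IP address of the reference service is included in the source IP addresses of the domain name table; and
    if the IP address of the reference service is not included in the source IP addresses of the domain name table, determining that the reference service is a service that is accessed by multiple clients and whose IP address and port belong to different ends.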
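  10. The method according to any one of claims 1 to 9, wherein the clustering the multiple services according to the flow table and the domain name table to obtain multiple application types comprises:
    performing time-correlation clustering on the multiple services according to the flow table and the domain name table to obtain a time-correlation clustering result;
    selecting, according to the flow table, the services that are periodic from the multiple services to obtain a periodicity clustering result;
    obtaining, according to the domain name table, multiple first services and multiple second services from the multiple services, wherein the first services are services that are accessed by multiple clients and have a corresponding domain name, and the second services comprise services that are accessed by multiple clients and have no corresponding domain name as well as services that are accessed by a single client;
    performing semantic-correlation clustering on the multiple first services to obtain a semantic-correlation clustering result;
    performing client-similarity clustering on the multiple second services to obtain a client-similarity clustering result; and
    fusing the time-correlation clustering result, the periodicity clustering result, the semantic-correlation clustering result, and the client-similarity clustering result to obtain the multiple application types.
  11. The method according to any one of claims 1 to 10, wherein the determining a label corresponding to each of the multiple application types comprises:
    dividing the multiple application types into a first application group, a second application group, and a third application group, wherein the services included in each application type in the first application group have corresponding domain names, none of the services included in each application type in the second application group has a corresponding domain name, and each application type in the third application group corresponds to one unclustered service;
    determining the label corresponding to each application type in the first application group based on the domain names corresponding to the services that the application type includes; and
    determining the label corresponding to each application type in the second application group and the third application group.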
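  12. An application identification apparatus, wherein the apparatus comprises:
    an extraction module, configured to extract features from each of multiple data flows to obtain a flow table and a domain name table, wherein the flow table comprises multiple flow table entries, each of the multiple flow table entries comprises a 5-tuple and a flow start time, the domain name table comprises multiple domain name table entries, and each of the multiple domain name table entries comprises a source Internet Protocol (IP) address, a destination domain name, a destination IP address, and a domain name type;
    an analysis module, configured to analyze flow behavior features according to the flow table to obtain multiple services, wherein each service consists of one IP address and one port identifier;
    a clustering module, configured to cluster the multiple services according to the flow table and the domain name table to obtain multiple application types; and
    a determining module, configured to determine a label corresponding to each of the multiple application types, wherein the label is used to identify the application to which a data flow belongs.
  13. The apparatus according to claim 12, wherein the analysis module comprises:
    a first determining submodule, configured to determine the ports that have a loop according to the flow table to obtain a loop port set;
    a second determining submodule, configured to determine, based on the loop port set and according to the flow table, a single-client access service set and a first multi-client access service set, wherein each service in the single-client access service set is accessed by a single client, its IP address and port belong to the same end, and its port does not belong to the loop port set, and each service in the first multi-client access service set is accessed by multiple clients, its IP address and port belong to the same end, and its port does not belong to the loop port set;
    a third determining submodule, configured to determine, based on the loop port set and the first multi-client access service set and according to the flow table, a second multi-client access service set, wherein each service in the second multi-client access service set is accessed by multiple clients, its IP address and port belong to different ends, and its port does not belong to the loop port set; and
    a merging submodule, configured to merge the first multi-client access service set, the second multi-client access service set, and the single-client access service set to obtain the multiple services.
  14. The apparatus according to claim 13, wherein the first determining submodule is mainly configured to:
    for each port in the flow table, obtain the same-end IP address set and the peer-end IP address set of the port according to the flow table;
    determine the intersection of the same-end IP address set and the peer-end IP address set of the port to obtain multiple IP addresses;
    determine, according to the flow table, the total number of data flows that correspond to the multiple IP addresses among all the data flows passing through the port, and take it as a first total flow count;
    if the first total flow count is greater than a first threshold, determine the total number of all the data flows passing through the port, and take it as a second total flow count; and
    if the ratio of the first total flow count to the second total flow count is greater than a second threshold, determine that the port is a port that has a loop.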
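  15. The apparatus according to claim 13, wherein the second determining submodule is mainly configured to:
    determine, according to the flow table, the services that are accessed by a single client and whose IP address and port belong to the same end, and the services that are accessed by multiple clients and whose IP address and port belong to the same end, to obtain a single-client access potential service set and a first multi-client access potential service set; and
    filter out, from the single-client access potential service set, the services on ports in the loop port set to obtain the single-client access service set, and filter out, from the first multi-client access potential service set, the services on ports in the loop port set to obtain the first multi-client access service set.
  16. The apparatus according to claim 15, wherein the second determining submodule is further configured to:
    determine multiple target services according to the flow table, wherein each target service corresponds to an IP address and a port that belong to the same end in one flow table entry, and each target service corresponds to multiple data flows;
    for each of the multiple target services, determine whether the port of the target service is randomly generated;
    if the port of the target service is randomly generated, determine whether the number of same-side ports corresponding to the IP address of the target service is greater than a third threshold;
    if the number of same-side ports corresponding to the IP address of the target service is greater than the third threshold, determine whether the number of peers corresponding to the target service is greater than a fourth threshold;
    if the number of peers corresponding to the target service is greater than the fourth threshold, determine that the target service is a service that is accessed by multiple clients and whose IP address and port belong to the same end;
    if the number of peers corresponding to the target service is not greater than the fourth threshold, determine whether the peer IP address of the target service is unique; and
    if the peer IP address of the target service is unique, determine that the target service is a service that is accessed by a single client and whose IP address and port belong to the same end.
  17. The apparatus according to claim 16, wherein the second determining submodule is further configured to:
    if the port of the target service is not randomly generated, determine that the target service is a service that is accessed by multiple clients and whose IP address and port belong to the same end.
  18. The apparatus according to claim 16, wherein the second determining submodule is further configured to:
    if the number of same-side ports corresponding to the IP address of the target service is not greater than the third threshold, determine that the target service is a service that is accessed by multiple clients and whose IP address and port belong to the same end.
  19. The apparatus according to claim 13, wherein the third determining submodule is configured to:
    determine, based on the first multi-client access service set and according to the flow table, the services that are accessed by multiple clients and whose IP address and port belong to different ends, to obtain a second multi-client access potential service set; and
    filter out, from the second multi-client access potential service set, the services on ports in the loop port set to obtain the second multi-client access service set.
  20. The apparatus according to claim 19, wherein the third determining submodule is further configured to:
    determine an IP address and a port that are located in the same flow table entry and belong to different ends as one reference service, to obtain multiple reference services;
    for each of the multiple reference services, determine whether the reference service corresponds to multiple data flows;
    if the reference service corresponds to multiple data flows, determine whether the port of the reference service is randomly generated;
    if the port of the reference service is not randomly generated, determine whether the port of the reference service is included in the first multi-client access service set;
    if the port of the reference service is not included in the first multi-client access service set, determine whether the IP address of the reference service is included in the source IP addresses of the domain name table; and
    if the IP address of the reference service is not included in the source IP addresses of the domain name table, determine that the reference service is a service that is accessed by multiple clients and whose IP address and port belong to different ends.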
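  21. The apparatus according to any one of claims 12 to 20, wherein the clustering module comprises:
    a first clustering submodule, configured to perform time-correlation clustering on the multiple services according to the flow table and the domain name table to obtain a time-correlation clustering result;
    a second clustering submodule, configured to select, according to the flow table, the services that are periodic from the multiple services to obtain a periodicity clustering result;
    an obtaining submodule, configured to obtain, according to the domain name table, multiple first services and multiple second services from the multiple services, wherein the first services are services that are accessed by multiple clients and have a corresponding domain name, and the second services comprise services that are accessed by multiple clients and have no corresponding domain name as well as services that are accessed by a single client;
    a third clustering submodule, configured to perform semantic-correlation clustering on the multiple first services to obtain a semantic-correlation clustering result;
    a fourth clustering submodule, configured to perform client-similarity clustering on the multiple second services to obtain a client-similarity clustering result; and
    a fusion submodule, configured to fuse the time-correlation clustering result, the periodicity clustering result, the semantic-correlation clustering result, and the client-similarity clustering result to obtain the multiple application types.
  22. The apparatus according to any one of claims 12 to 21, wherein the determining module is configured to:
    divide the multiple application types into a first application group, a second application group, and a third application group, wherein the services included in each application type in the first application group have corresponding domain names, none of the services included in each application type in the second application group has a corresponding domain name, and each application type in the third application group corresponds to one unclustered service;
    determine the label corresponding to each application type in the first application group based on the domain names corresponding to the services that the application type includes; and
    determine the label corresponding to each application type in the second application group and the third application group.
  23. A computer-readable storage medium, wherein the storage medium stores a computer program, and when the computer program is executed by a processor, the steps of the method according to any one of claims 1 to 11 are implemented.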
PCT/CN2020/112316 2019-09-10 2020-08-29 Application identification method, apparatus, and storage medium WO2021047402A1 (zh)

Priority Applications (2)

Application Number Priority Date Filing Date Title
EP20863933.6A EP4012980A4 (en) 2019-09-10 2020-08-29 APPLICATION RECOGNITION METHOD AND DEVICE AND STORAGE MEDIUM
US17/691,463 US11863439B2 (en) 2019-09-10 2022-03-10 Method, apparatus and storage medium for application identification

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN201910853338.4 2019-09-10
CN201910853338.4A CN112564991A (zh) 2019-09-10 2019-09-10 Application identification method, apparatus, and storage medium

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US17/691,463 Continuation US11863439B2 (en) 2019-09-10 2022-03-10 Method, apparatus and storage medium for application identification

Publications (1)

Publication Number Publication Date
WO2021047402A1 (zh) 2021-03-18

Family

ID=74867254

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2020/112316 WO2021047402A1 (zh) 2019-09-10 2020-08-29 Application identification method, apparatus, and storage medium

Country Status (4)

Country Link
US (1) US11863439B2 (zh)
EP (1) EP4012980A4 (zh)
CN (1) CN112564991A (zh)
WO (1) WO2021047402A1 (zh)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
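CN113157540A (zh) * 2021-03-31 2021-07-23 国家计算机网络与信息安全管理中心 User behavior analysis method and system
CN113177206A (zh) * 2021-05-21 2021-07-27 滨州职业学院 Computer application identification method, apparatus, and storage medium
CN114039906B (zh) * 2021-09-27 2023-09-22 网宿科技股份有限公司 Traffic steering method, electronic device, and readable storage medium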
US20230336793A1 (en) * 2022-04-14 2023-10-19 Oxylabs, Uab Streaming proxy service
CN116708369B (zh) * 2023-08-02 2023-10-27 闪捷信息科技有限公司 Network application information merging method and apparatus, electronic device, and storage medium

Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US8291495B1 (en) * 2007-08-08 2012-10-16 Juniper Networks, Inc. Identifying applications for intrusion detection systems
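CN102984243A (zh) * 2012-11-20 2013-03-20 杭州迪普科技有限公司 Method and apparatus for automatically identifying applications in the SSL protocol
CN103051725A (zh) * 2012-12-31 2013-04-17 华为技术有限公司 Application identification method, data mining method, apparatus, and system
CN106534145A (zh) * 2016-11-28 2017-03-22 北京天行网安信息技术有限责任公司 Application identification method and device
CN108173705A (zh) * 2017-11-28 2018-06-15 北京天融信网络安全技术有限公司 First-packet identification method, apparatus, device, and medium for traffic diversion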

Family Cites Families (14)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20100192225A1 (en) * 2009-01-28 2010-07-29 Juniper Networks, Inc. Efficient application identification with network devices
CN101873259B (zh) * 2010-06-01 2013-01-09 华为技术有限公司 SCTP packet identification method and apparatus
US9477718B2 (en) * 2012-12-31 2016-10-25 Huawei Technologies Co., Ltd Application identification method, and data mining method, apparatus, and system
CN103297270A (zh) * 2013-05-24 2013-09-11 华为技术有限公司 Application type identification method and network device
US9569368B2 (en) * 2013-12-13 2017-02-14 Nicira, Inc. Installing and managing flows in a flow table cache
US10972437B2 (en) * 2016-08-08 2021-04-06 Talari Networks Incorporated Applications and integrated firewall design in an adaptive private network (APN)
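CN107864168B (zh) * 2016-09-22 2021-05-18 华为技术有限公司 Method and system for classifying network data flows
CN107426063A (zh) * 2017-09-22 2017-12-01 中国联合网络通信集团有限公司 System and method for identifying Internet application traffic
CN115037575A (zh) * 2017-12-26 2022-09-09 华为技术有限公司 Packet processing method and apparatus
CN108650195B (zh) * 2018-04-17 2021-08-24 南京烽火星空通信发展有限公司 Method for constructing an automatic app traffic identification model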
US10855604B2 (en) * 2018-11-27 2020-12-01 Xaxar Inc. Systems and methods of data flow classification
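CN109905288B (zh) * 2018-12-21 2021-09-14 中国科学院信息工程研究所 Application service classification method and apparatus
CN110197234B (zh) * 2019-06-13 2020-05-19 四川大学 Encrypted traffic classification method based on a dual-channel convolutional neural network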
US11456952B2 (en) * 2020-08-04 2022-09-27 Pensando Systems, Inc. Methods and systems for removing expired flow table entries using an extended packet processing pipeline

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
See also references of EP4012980A4

Also Published As

Publication number Publication date
US20220200902A1 (en) 2022-06-23
EP4012980A1 (en) 2022-06-15
CN112564991A (zh) 2021-03-26
US11863439B2 (en) 2024-01-02
EP4012980A4 (en) 2022-10-05

Legal Events

Date Code Title Description
121 Ep: the epo has been informed by wipo that ep was designated in this application

Ref document number: 20863933

Country of ref document: EP

Kind code of ref document: A1

NENP Non-entry into the national phase

Ref country code: DE

ENP Entry into the national phase

Ref document number: 2020863933

Country of ref document: EP

Effective date: 20220311