WO2022078042A1 - Traffic subdivision identification method, system, electronic device and storage medium - Google Patents

Traffic subdivision identification method, system, electronic device and storage medium

Info

Publication number
WO2022078042A1
Authority
WO
WIPO (PCT)
Prior art keywords
traffic
service
feature vector
sub
identified
Prior art date
Application number
PCT/CN2021/112328
Other languages
English (en)
French (fr)
Inventor
何鸿业
Original Assignee
中兴通讯股份有限公司
Priority date
Filing date
Publication date
Priority claimed from CN202011085864.XA
Application filed by 中兴通讯股份有限公司
Publication of WO2022078042A1

Classifications

    • G PHYSICS
    • G06 COMPUTING; CALCULATING OR COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F18/00 Pattern recognition
    • G06F18/20 Analysing
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/14 Network analysis or design
    • H04L41/142 Network analysis or design using statistical or mathematical methods
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00 Network architectures or network communication protocols for network security

Definitions

  • The embodiments of the present application relate to the field of communications, and in particular to a traffic subdivision identification method, system, electronic device, and storage medium.
  • Traffic subdivision identification based on machine learning (ML) encounters irrelevant traffic from unknown sources during classification. Because classification algorithms make a closed-set assumption, every input sample is necessarily labeled as one of the known categories during identification. In practice, a large amount of irrelevant traffic from unknown sources is input, which causes a large number of mislabelings. This is the open set recognition (OSR) problem, and it reduces the accuracy of traffic subdivision identification.
  • the embodiment of the present application provides a traffic subdivision identification method, which includes: acquiring service traffic of a pre-specified service; processing the service traffic to obtain a feature vector of the service traffic; passing the feature vector of the service traffic through a preset first anomaly detection model to obtain a sub-feature vector; training on the sub-feature vector through a preset classification training model to obtain a traffic classifier; passing the sub-feature vector through a preset second anomaly detection model to obtain a traffic filter; obtaining a feature vector of the traffic to be identified, and passing it through the traffic classifier to obtain the corresponding service label; and passing the feature vector of the traffic to be identified through the traffic filter corresponding to the service label to obtain the traffic subdivision identification result.
  • the embodiment of the present application also proposes a traffic subdivision identification system, including: a traffic acquisition module, used for acquiring the service traffic of a pre-specified service; a feature extraction module, used for processing the service traffic obtained by the traffic acquisition module to obtain the feature vector of the service traffic, and for processing the traffic to be identified to obtain the feature vector of the traffic to be identified; a first anomaly detection module, used for processing the feature vector of the service traffic obtained by the feature extraction module to obtain the sub-feature vector; a classification training module, used for training on the sub-feature vector obtained by the first anomaly detection module to obtain a traffic classifier; a second anomaly detection module, used for processing the sub-feature vector obtained by the first anomaly detection module to obtain a traffic filter; and a traffic identification module, used for passing the feature vector of the traffic to be identified, obtained by the feature extraction module, through the traffic classifier obtained by the classification training module to obtain the corresponding service label, and for passing the same feature vector through the traffic filter of the corresponding service label, obtained by the second anomaly detection module, to obtain the traffic subdivision identification result.
  • An embodiment of the present application also provides an electronic device, the device includes: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions that can be executed by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the above traffic identification method.
  • Fig. 1 is a flow chart of a traffic subdivision identification method provided according to a first embodiment of the present application
  • FIG. 2 is a flowchart of a method for identifying traffic segments provided according to a second embodiment of the present application
  • FIG. 3 is a flowchart of a method for identifying traffic segments provided according to a third embodiment of the present application.
  • FIG. 4 is a flowchart of a traffic subdivision identification method provided according to a fourth embodiment of the present application.
  • FIG. 5 is a flowchart of a method for identifying traffic segments according to a fifth embodiment of the present application.
  • FIG. 6 is a schematic structural diagram of a traffic subdivision identification system provided according to a sixth embodiment of the present application.
  • FIG. 7 is a schematic structural diagram of an electronic device provided according to a seventh embodiment of the present application.
  • traffic identification methods are already widely used in the field of network security, for example for network anomaly detection, malicious traffic identification, and marking the application source of mobile phone traffic. For operators, however, subdivision identification of traffic matters more. For example, within the broad "WeChat traffic" category, it identifies which specific business action generated the traffic, such as "sending WeChat messages" or "WeChat video call". Such subdivision identification helps operators monitor network conditions more specifically.
  • the machine learning-based traffic segmentation identification method extracts the general statistical information of traffic through feature engineering to construct traffic feature vectors, and uses machine learning algorithms for classification.
  • However, ML-based traffic subdivision identification encounters irrelevant traffic from unknown sources during classification. Because classification algorithms make a closed-set assumption, every input sample is necessarily labeled as one of the known categories during identification. In practice, a large amount of irrelevant traffic from unknown sources is input, which causes a large number of mislabelings. This is the open set recognition problem, and it reduces the accuracy of traffic subdivision identification.
  • the main purpose of the embodiments of the present application is to propose a traffic subdivision identification method, system, electronic device and storage medium, which solve the OSR problem in the traffic subdivision identification process and improve the accuracy of the traffic subdivision identification.
  • the first embodiment of the present application relates to a traffic subdivision identification method, as shown in FIG. 1 , which specifically includes:
  • Step 101 Acquire the service flow of the pre-specified service.
  • the pre-specified services can be specific business actions such as sending WeChat messages, watching a video on iQIYI, and making video calls on WeChat. These are only examples; in actual use, the services can be specified according to operator needs or the actual application environment, and they are not enumerated here.
  • Step 102 Process the service traffic to obtain a feature vector of the service traffic.
  • Step 103 Pass the feature vector of the service traffic through a preset first anomaly detection model to obtain a sub-feature vector.
  • step 104 the sub-feature vector is trained through a preset classification training model to obtain a traffic classifier.
  • the classification training model in step 104 can use the gradient boosting decision tree (GBDT) algorithm, which is a tree-based classifier. It splits on each feature dimension separately and is insensitive to normalization of the features as a whole, so its classification accuracy is higher.
  • Other classification algorithms can also be used, such as: XGBOOST algorithm, LightGBM algorithm and so on.
  • Step 105 Pass the sub-feature vector through a preset second anomaly detection model to obtain a traffic filter.
  • the second anomaly detection model includes a one-class support vector machine algorithm, a method of fitting a prior distribution, and so on.
  • a traffic filter may be obtained by any method in the second anomaly detection model.
  • the obtained traffic filters correspond to pre-specified services. For example, if there are N pre-specified services, there are corresponding traffic filters for N services.
  • for example, the first traffic filter is the traffic filter for sending WeChat messages, and the second traffic filter is the traffic filter for watching a video on iQIYI.
  • Step 106 Obtain a feature vector of the traffic to be identified, and use the feature vector of the traffic to be identified to obtain a corresponding service label through a traffic classifier.
  • Step 107 Pass the feature vector of the traffic to be identified through the traffic filter of the corresponding service label to obtain a traffic subdivision identification result.
  • this embodiment is mainly applied to traffic subdivision identification, and can also be applied alongside category identification, so that category identification assists subdivision identification of the traffic of the entire network.
  • for example, a two-stage identification method assisted by category identification can be adopted: first use a traditional model, such as DPI, to identify the categories of traffic on the entire network, and then, according to specific needs, add downstream support for subdividing specific categories in the form of extension modules.
  • the first anomaly detection model and the second anomaly detection model eliminate irrelevant traffic in the training samples and irrelevant traffic in the traffic to be identified, effectively solve the OSR problem, and improve the accuracy of traffic subdivision identification.
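  • The second embodiment of the present application relates to a traffic subdivision identification method. This embodiment is substantially the same as the first embodiment, except that, as shown in FIG. 2, step 101 includes sub-step 201 to sub-step 204.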
  • sub-step 201 the application that generates the service traffic is controlled by the control program.
  • in sub-step 201, the control program can connect to the application that generates the service traffic in order to control that application, or it can connect to a terminal device and thereby control an application on the terminal device.
  • the control program can connect to an application or device using technical means such as Appium or UIAutomator2. These are only examples; in practice, any existing access technology can be used to control the application, and no specific restriction is imposed here.
  • Sub-step 202 Execute a pre-designated service on the application generating the service flow, and obtain the service flow of the pre-designated service.
  • a pre-specified service is performed on an application according to actual operator requirements or user requirements, for example, a service operation of "sending WeChat messages" is manually performed.
  • network packet capture can be enabled in the background to obtain the service traffic.
  • sub-step 203 the operation steps for executing the pre-specified service are recorded, and an operation script is generated and saved.
  • sub-step 204 the operation script is imported into the control program for automatic execution, and the service flow of the pre-specified service is acquired.
  • sub-step 202 obtains service traffic through manual operations, and only a small amount of service traffic can be obtained, while sub-step 204 can obtain a large amount of service traffic by automatically and repeatedly executing services through a program.
  • traffic subdivision identification requires a large amount of data that is related to the subdivided services and carries subdivided service labels. Traffic data obtained in the usual way contains a large amount of irrelevant traffic, and is mostly cleaned and labeled using expert experience, which is difficult to carry out. The service traffic obtained in this embodiment is produced by a specific application executing a specific pre-specified service operation, so the obtained service traffic directly carries a service label and no additional labeling is required.
  • on top of the beneficial effects of the first embodiment, this embodiment obtains service traffic with service labels by directly executing pre-specified services, which avoids the difficulty of manual labeling and minimizes the reliance on expert experience when constructing the data.
  • the third embodiment of the present application relates to a traffic subdivision identification method.
  • This embodiment is substantially the same as the first embodiment, except that, as shown in FIG. 3 , step 102 includes sub-step 301 to sub-step 303 .
  • Sub-step 301 Acquire quintuple information of service traffic, wherein the quintuple information includes source IP, source port, destination IP, destination port, and transmission protocol.
  • Sub-step 302 Group the service traffic according to the quintuple information to obtain traffic samples.
  • all data packets in the service traffic are grouped according to their quintuple information: data packets with identical quintuple information form one traffic sample, so every packet in a traffic sample shares the same quintuple.
  • the data packets can be arranged in order of transmission time.
  • sub-step 303 feature extraction is performed on the traffic samples to obtain feature vectors of the service traffic.
  • sub-step 303 can extract basic statistical features from the traffic sample, for example, the packet lengths of all data packets in the traffic sample, the average interval between transmitted packets, the average packet length, the maximum packet length, etc., and organize them in the form of a feature vector.
  • sequence features of the traffic sample, such as the port information of data packets and the transmission direction of data packets, can also be extracted and organized in the form of a feature vector.
  • the feature vectors formed by the two extraction methods can also be concatenated to form the feature vector of the service traffic.
  • data is processed in units of streams, and feature extraction is performed on traffic samples to obtain feature vectors, so as to facilitate subsequent training and identification of data.
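  • The fourth embodiment of the present application relates to a traffic subdivision identification method. This embodiment is substantially the same as the first embodiment, except that, as shown in FIG. 4, step 103 includes sub-step 401 to sub-step 404.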
  • Sub-step 401 Pass the feature vector of the service traffic through a preset first anomaly detection model to obtain a first anomaly score of the service traffic.
  • the first anomaly detection model may include an isolation forest algorithm, a local outlier factor algorithm, a K-means-based clustering algorithm, etc., and the first anomaly score can be obtained by any algorithm in the first anomaly detection model.
  • taking the isolation forest algorithm as an example, the average search depth over the isolation trees can be normalized and used as the first anomaly score.
  • sub-step 402 it is judged whether the first anomaly score of the service traffic is greater than the preset first threshold; if yes, go to sub-step 403; otherwise, go to sub-step 404.
  • sub-step 403 the feature vector of the service traffic is eliminated and a sub-feature vector is obtained.
  • sub-step 403 removes service traffic because, when the service traffic of the pre-specified service is acquired, a lot of irrelevant traffic is generated at the same time, such as background traffic and in-application advertisement traffic. This traffic is unrelated to the specified service and directly affects the subsequent classification training, causing deviations in the traffic identification results.
  • Sub-step 404 determine the first abnormal score of the next service flow.
  • on top of the beneficial effects of the first embodiment, the first anomaly detection model removes traffic unrelated to the pre-specified services, such as background traffic and advertising traffic, to obtain purer sub-feature vectors, so that subsequent classification training works better and the accuracy of traffic subdivision identification is further improved.
  • the fifth embodiment of the present application relates to a traffic subdivision identification method.
  • This embodiment is substantially the same as the first embodiment, except that, as shown in FIG. 5 , step 107 includes sub-step 501 to sub-step 504 .
  • Sub-step 501 Pass the feature vector of the traffic to be identified through the traffic filter of the corresponding service label to obtain the second abnormal score of the traffic to be identified.
  • sub-step 502 it is judged whether the second abnormal score is greater than the preset second threshold value, if yes, go to sub-step 503 ; otherwise, go to sub-step 504 .
  • Sub-step 503 remove the traffic to be identified corresponding to the service label.
  • the traffic to be identified includes traffic of various service types.
  • when the second anomaly score is greater than the preset second threshold, the service label obtained by the traffic classifier is wrong; that is, the feature distribution of the traffic to be identified deviates substantially from that of the real traffic corresponding to the service label, so the traffic to be identified, which is irrelevant to the identification target, is removed.
  • Sub-step 504 output the service label corresponding to the traffic to be identified, and obtain the traffic subdivision identification result.
  • on top of the beneficial effects of the first embodiment, this embodiment uses the traffic filter obtained from the second anomaly detection model in the identification stage to remove traffic irrelevant to the identification target, which further avoids a large number of false hits on irrelevant traffic during identification and improves identification accuracy.
  • the sixth embodiment of the present application relates to a traffic subdivision identification system, as shown in FIG. 6 , including:
  • the traffic acquisition module 601 is configured to acquire the service traffic of the pre-specified service.
  • the feature extraction module 602 is configured to process the service traffic obtained by the traffic acquisition module 601, obtain the feature vector of the service traffic, process the traffic to be identified, and obtain the feature vector of the traffic to be identified.
  • the first anomaly detection module 603 is configured to process the feature vector of the service traffic obtained by the feature extraction module 602 to obtain a sub-feature vector.
  • the classification training module 604 is used for training the sub-feature vector obtained by the first anomaly detection module 603 to obtain a traffic classifier.
  • the second anomaly detection module 605 is configured to process the sub-feature vector obtained by the first anomaly detection module 603 to obtain a traffic filter.
  • the traffic identification module 606 is used to pass the feature vector of the traffic to be identified, obtained by the feature extraction module 602, through the traffic classifier obtained by the classification training module 604 to obtain the corresponding service label, and to pass the same feature vector through the traffic filter of the corresponding service label, obtained by the second anomaly detection module 605, to obtain the traffic subdivision identification result.
  • this embodiment is a system embodiment corresponding to the first embodiment, and this embodiment can be implemented in cooperation with the first embodiment.
  • the related technical details mentioned in the first embodiment are still valid in this embodiment, and are not repeated here in order to reduce repetition.
  • the relevant technical details mentioned in this embodiment can also be applied in the first embodiment.
  • the modules involved in this embodiment are all logical modules. In practical applications, a logical unit may be a physical unit, a part of a physical unit, or a combination of multiple physical units.
  • in order to highlight the innovative part of the present application, this embodiment does not introduce units that are not closely related to solving the technical problem raised by the present application, but this does not mean that there are no other units in this embodiment.
  • the seventh embodiment of the present application relates to an electronic device, as shown in FIG. 7, comprising: at least one processor 701; and a memory 702 communicatively connected with the at least one processor 701; wherein the memory 702 stores instructions executable by the at least one processor 701, and the instructions are executed by the at least one processor 701, so that the at least one processor 701 can execute the traffic subdivision identification method described in any of the above method embodiments.
  • the memory and the processor are connected by a bus, and the bus may include any number of interconnected buses and bridges, and the bus connects one or more processors and various circuits of the memory.
  • the bus may also connect together various other circuits, such as peripherals, voltage regulators, and power management circuits, which are well known in the art and therefore will not be described further herein.
  • the bus interface provides the interface between the bus and the transceiver.
  • a transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, providing a means for communicating with various other devices over a transmission medium.
  • the data processed by the processor is transmitted on the wireless medium through the antenna, and further, the antenna also receives the data and transmits the data to the processor.
  • the processor is responsible for managing the bus and general processing, and can also provide various functions, including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory may be used to store data used by the processor in performing operations.
  • the eighth embodiment of the present application relates to a computer-readable storage medium storing a computer program.
  • the above method embodiments are implemented when the computer program is executed by the processor.
  • the aforementioned storage medium includes: USB flash drive, removable hard disk, read-only memory (ROM), random access memory (RAM), magnetic disk, optical disk, and other media that can store program code.

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Evolutionary Computation (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Evolutionary Biology (AREA)
  • Signal Processing (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Optimization (AREA)
  • Computing Systems (AREA)
  • Pure & Applied Mathematics (AREA)
  • Computer Security & Cryptography (AREA)
  • Probability & Statistics with Applications (AREA)
  • Mathematical Physics (AREA)
  • Mathematical Analysis (AREA)
  • Algebra (AREA)
  • Computer Hardware Design (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

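The present application proposes a traffic subdivision identification method, system, electronic device and storage medium, and relates to the field of communications. The method includes: acquiring service traffic of a pre-specified service; processing the service traffic to obtain a feature vector of the service traffic; passing the feature vector of the service traffic through a preset first anomaly detection model to obtain a sub-feature vector; training on the sub-feature vector through a preset classification training model to obtain a traffic classifier; passing the sub-feature vector through a preset second anomaly detection model to obtain a traffic filter; obtaining a feature vector of the traffic to be identified, and passing it through the traffic classifier to obtain the corresponding service label; and passing the feature vector of the traffic to be identified through the traffic filter of the corresponding service label to obtain the traffic subdivision identification result.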

Description

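Traffic subdivision identification method, system, electronic device and storage medium
Cross-Reference
This application is based on, and claims priority to, Chinese patent application No. 202011085864.X, filed on October 12, 2020, the entire contents of which are incorporated herein by reference.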
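Technical Field
The embodiments of the present application relate to the field of communications, and in particular to a traffic subdivision identification method, system, electronic device and storage medium.
Background
Traffic subdivision identification based on machine learning (ML) encounters irrelevant traffic from unknown sources during classification. Because classification algorithms make a closed-set assumption, every input sample is necessarily labeled as one of the known categories during identification. In practice, a large amount of irrelevant traffic from unknown sources is input, which causes a large number of mislabelings. This is the open set recognition (OSR) problem, and it reduces the accuracy of traffic subdivision identification.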
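Summary
An embodiment of the present application provides a traffic subdivision identification method, including: acquiring service traffic of a pre-specified service; processing the service traffic to obtain a feature vector of the service traffic; passing the feature vector of the service traffic through a preset first anomaly detection model to obtain a sub-feature vector; training on the sub-feature vector through a preset classification training model to obtain a traffic classifier; passing the sub-feature vector through a preset second anomaly detection model to obtain a traffic filter; obtaining a feature vector of the traffic to be identified, and passing it through the traffic classifier to obtain the corresponding service label; and passing the feature vector of the traffic to be identified through the traffic filter of the corresponding service label to obtain the traffic subdivision identification result.
An embodiment of the present application also proposes a traffic subdivision identification system, including: a traffic acquisition module, used for acquiring the service traffic of a pre-specified service; a feature extraction module, used for processing the service traffic obtained by the traffic acquisition module to obtain the feature vector of the service traffic, and for processing the traffic to be identified to obtain the feature vector of the traffic to be identified; a first anomaly detection module, used for processing the feature vector of the service traffic obtained by the feature extraction module to obtain the sub-feature vector; a classification training module, used for training on the sub-feature vector obtained by the first anomaly detection module to obtain a traffic classifier; a second anomaly detection module, used for processing the sub-feature vector obtained by the first anomaly detection module to obtain a traffic filter; and a traffic identification module, used for passing the feature vector of the traffic to be identified, obtained by the feature extraction module, through the traffic classifier obtained by the classification training module to obtain the corresponding service label, and for passing the same feature vector through the traffic filter of the corresponding service label, obtained by the second anomaly detection module, to obtain the traffic subdivision identification result.
An embodiment of the present application also proposes an electronic device, including: at least one processor; and a memory communicatively connected to the at least one processor; wherein the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor to enable the at least one processor to execute the above traffic identification method.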
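Brief Description of the Drawings
One or more embodiments are illustrated by the figures in the corresponding drawings; these illustrations do not limit the embodiments.
FIG. 1 is a flowchart of the traffic subdivision identification method provided according to the first embodiment of the present application;
FIG. 2 is a flowchart of the traffic subdivision identification method provided according to the second embodiment of the present application;
FIG. 3 is a flowchart of the traffic subdivision identification method provided according to the third embodiment of the present application;
FIG. 4 is a flowchart of the traffic subdivision identification method provided according to the fourth embodiment of the present application;
FIG. 5 is a flowchart of the traffic subdivision identification method provided according to the fifth embodiment of the present application;
FIG. 6 is a schematic structural diagram of the traffic subdivision identification system provided according to the sixth embodiment of the present application;
FIG. 7 is a schematic structural diagram of the electronic device provided according to the seventh embodiment of the present application.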
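Detailed Description
To make the objectives, technical solutions and advantages of the embodiments of the present application clearer, the embodiments are described in detail below with reference to the drawings. Those of ordinary skill in the art will understand that many technical details are given in the embodiments so that the reader can better understand the present application. The technical solutions claimed in the present application can nevertheless be implemented without these technical details and with various changes and modifications based on the following embodiments. The division into embodiments below is for convenience of description and should not limit the specific implementation of the present application; the embodiments may be combined with and refer to one another where they do not contradict.
Traffic identification methods are already widely used in the field of network security, for example for network anomaly detection, malicious traffic identification, and marking the application source of mobile phone traffic. For operators, however, subdivision identification of traffic matters more. For example, within the broad "WeChat traffic" category, it identifies which specific business action generated the traffic, such as "sending WeChat messages" or "WeChat video call". Such subdivision identification helps operators monitor network conditions more specifically. Machine-learning-based traffic subdivision identification methods use feature engineering to extract general statistical information about the traffic, construct traffic feature vectors from it, and classify them with machine learning algorithms.
However, ML-based traffic subdivision identification encounters irrelevant traffic from unknown sources during classification. Because classification algorithms make a closed-set assumption, every input sample is necessarily labeled as one of the known categories during identification. In practice, a large amount of irrelevant traffic from unknown sources is input, which causes a large number of mislabelings. This is the open set recognition problem, and it reduces the accuracy of traffic subdivision identification.
The main purpose of the embodiments of the present application is to propose a traffic subdivision identification method, system, electronic device and storage medium that solve the OSR problem in traffic subdivision identification and improve its accuracy.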
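The first embodiment of the present application relates to a traffic subdivision identification method which, as shown in FIG. 1, specifically includes:
Step 101: acquire the service traffic of the pre-specified service.
Specifically, the pre-specified services can be specific business actions such as sending WeChat messages, watching a video on iQIYI, or making a WeChat video call. These are only examples; in actual use, the services can be specified according to operator needs or the actual application environment, and they are not enumerated here.
Step 102: process the service traffic to obtain the feature vector of the service traffic.
Step 103: pass the feature vector of the service traffic through a preset first anomaly detection model to obtain a sub-feature vector.
Step 104: train on the sub-feature vector through a preset classification training model to obtain a traffic classifier.
Specifically, the classification training model in step 104 can use the gradient boosting decision tree (GBDT) algorithm, which is a tree-based classifier. It splits on each feature dimension separately and is insensitive to normalization of the features as a whole, so its classification accuracy is higher. Other classification algorithms can of course also be used, such as the XGBoost algorithm, the LightGBM algorithm, and so on.
Step 105: pass the sub-feature vector through a preset second anomaly detection model to obtain a traffic filter.
Specifically, the second anomaly detection model includes a one-class support vector machine algorithm, a method of fitting a prior distribution, and so on; step 105 can obtain the traffic filter using any method in the second anomaly detection model. In addition, the obtained traffic filters correspond to the pre-specified services. If there are N pre-specified services, there are traffic filters for N services; for example, the first traffic filter is the traffic filter for sending WeChat messages, and the second traffic filter is the traffic filter for watching a video on iQIYI.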
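One way to picture the per-service traffic filters of step 105 is the sketch below. It is only an illustration of the "fitting a prior distribution" option under strong assumptions: each feature dimension is modelled as an independent Gaussian per service, and the anomaly score is the mean absolute z-score. The function names, the labels and the scoring rule are hypothetical and are not taken from the application.

```python
import math
from collections import defaultdict


def fit_service_filters(samples):
    """Fit one filter per service label.

    samples: list of (service_label, feature_vector) pairs.
    Returns {label: (means, stds)}, i.e. N filters for N services.
    """
    grouped = defaultdict(list)
    for label, vec in samples:
        grouped[label].append(vec)

    filters = {}
    for label, vecs in grouped.items():
        dims = list(zip(*vecs))  # one tuple of values per feature dimension
        means = [sum(d) / len(d) for d in dims]
        # Population standard deviation; the floor avoids division by zero.
        stds = [
            max(math.sqrt(sum((x - m) ** 2 for x in d) / len(d)), 1e-9)
            for d, m in zip(dims, means)
        ]
        filters[label] = (means, stds)
    return filters


def anomaly_score(service_filter, vec):
    """Mean absolute z-score of vec under the fitted per-dimension Gaussians."""
    means, stds = service_filter
    return sum(abs(x - m) / s for x, m, s in zip(vec, means, stds)) / len(vec)
```

A vector close to the training data of a service scores near zero under that service's filter, while traffic from another service scores much higher, which is what lets a threshold separate the two.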
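Step 106: obtain the feature vector of the traffic to be identified, and pass it through the traffic classifier to obtain the corresponding service label.
Step 107: pass the feature vector of the traffic to be identified through the traffic filter of the corresponding service label to obtain the traffic subdivision identification result.
It should be noted that this embodiment is mainly applied to traffic subdivision identification, and can also be applied alongside category identification, so that category identification assists subdivision identification of the traffic of the entire network. For example, a two-stage identification method assisted by category identification can be adopted: first use a traditional model, such as DPI, to identify the categories of traffic on the entire network, and then, according to specific needs, add downstream support for subdividing specific categories in the form of extension modules.
Through the first anomaly detection model and the second anomaly detection model, this embodiment removes irrelevant traffic from the training samples and from the traffic to be identified, which effectively solves the OSR problem and improves the accuracy of traffic subdivision identification.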
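The second embodiment of the present application relates to a traffic subdivision identification method. This embodiment is substantially the same as the first embodiment, except that, as shown in FIG. 2, step 101 includes sub-step 201 to sub-step 204.
Sub-step 201: control the application that generates the service traffic through a control program.
Specifically, in sub-step 201 the control program can connect to the application that generates the service traffic in order to control that application, or it can connect to a terminal device and thereby control an application on the terminal device. The control program can connect to an application or device using technical means such as Appium or UIAutomator2. These are only examples; in practice, any existing access technology can be used to control the application, and no specific restriction is imposed here.
Sub-step 202: execute the pre-specified service on the application that generates the service traffic, and acquire the service traffic of the pre-specified service.
In this implementation, the pre-specified service is executed on an application according to actual operator or user requirements; for example, the "sending WeChat messages" operation is performed manually. To acquire the service traffic of the pre-specified service, network packet capture can be enabled in the background while the business action is performed.
Sub-step 203: record the operation steps of executing the pre-specified service, and generate and save an operation script.
Sub-step 204: import the operation script into the control program for automatic execution, and acquire the service traffic of the pre-specified service.
Specifically, sub-step 202 acquires service traffic through manual operation and can only produce a small amount of it, whereas sub-step 204 can acquire a large amount of service traffic by having the program execute the service automatically and repeatedly.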
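The replay loop of sub-step 204 can be sketched as follows. This is a minimal illustration, not the application's implementation: `run_step`, `start_capture` and `stop_capture` are hypothetical callables standing in for whatever control program (for example one built on Appium or UIAutomator2) and packet-capture tool are actually used, and no real API of those tools is shown.

```python
def collect_labelled_traffic(script, service_label, run_step,
                             start_capture, stop_capture, repeats):
    """Replay a recorded operation script and label each capture.

    script:        ordered list of recorded operation steps (sub-step 203)
    service_label: the pre-specified service the script performs
    run_step:      hypothetical callable that executes one step on the app
    start_capture: hypothetical callable that starts background packet capture
    stop_capture:  hypothetical callable that stops capture and returns packets
    repeats:       number of automatic executions
    """
    dataset = []
    for _ in range(repeats):
        start_capture()
        for step in script:
            run_step(step)
        packets = stop_capture()
        # The label is known in advance because the script performs exactly
        # one pre-specified service, so no manual labelling is needed.
        dataset.append({"label": service_label, "packets": packets})
    return dataset
```

Because the label is attached at capture time, every replay yields another labelled sample without expert annotation.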
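It should be noted that traffic subdivision identification requires a large amount of data that is related to the subdivided services and carries subdivided service labels. Traffic data obtained in the usual way contains a large amount of irrelevant traffic, and is mostly cleaned and labeled using expert experience, which is difficult to carry out. The service traffic obtained in this embodiment is produced by a specific application executing a specific pre-specified service operation, so the obtained service traffic directly carries a service label and no additional labeling is required.
On top of the beneficial effects of the first embodiment, this embodiment obtains service traffic with service labels by directly executing pre-specified services, which avoids the difficulty of manual labeling and minimizes the reliance on expert experience when constructing the data.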
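The third embodiment of the present application relates to a traffic subdivision identification method. This embodiment is substantially the same as the first embodiment, except that, as shown in FIG. 3, step 102 includes sub-step 301 to sub-step 303.
Sub-step 301: acquire the quintuple information of the service traffic, where the quintuple information includes source IP, source port, destination IP, destination port, and transport protocol.
Sub-step 302: group the service traffic according to the quintuple information to obtain traffic samples.
In this implementation, all data packets in the service traffic are grouped according to their quintuple information: data packets with identical quintuple information form one traffic sample, so every packet in a traffic sample shares the same quintuple. Within a traffic sample, the data packets can be arranged in order of transmission time.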
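The grouping in sub-step 302 can be sketched as below. The `Packet` record and its field names are assumptions made for illustration. The sketch groups strictly by identical quintuple, as the text describes, so the two directions of one connection land in separate groups; whether the authors merge them is not stated.

```python
from collections import defaultdict
from typing import NamedTuple


class Packet(NamedTuple):
    """Hypothetical minimal packet record; field names are illustrative."""
    timestamp: float
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    length: int


def group_by_five_tuple(packets):
    """Group packets into traffic samples keyed by their quintuple."""
    flows = defaultdict(list)
    for pkt in packets:
        key = (pkt.src_ip, pkt.src_port, pkt.dst_ip, pkt.dst_port, pkt.protocol)
        flows[key].append(pkt)
    # Within each traffic sample, order packets by transmission time.
    for pkts in flows.values():
        pkts.sort(key=lambda p: p.timestamp)
    return dict(flows)
```

Each value in the returned dictionary is one traffic sample, ready for feature extraction in sub-step 303.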
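Sub-step 303: perform feature extraction on the traffic samples to obtain the feature vector of the service traffic.
Specifically, sub-step 303 can extract basic statistical features from the traffic sample, for example, the packet lengths of all data packets in the traffic sample, the average interval between transmitted packets, the average packet length, the maximum packet length, and so on, and organize them in the form of a feature vector.
Further, sequence features of the traffic sample, such as the port information of data packets and the transmission direction of data packets, can also be extracted and organized in the form of a feature vector. The feature vectors formed by the two extraction methods can also be concatenated to form the feature vector of the service traffic.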
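The two kinds of features and their concatenation can be illustrated with the sketch below. The particular features chosen (three statistics plus the direction of the first few packets), the `(timestamp, length, direction)` tuple layout and the `seq_len` parameter are assumptions for the example; the application does not fix a concrete feature set.

```python
def extract_features(flow, seq_len=4):
    """Build one feature vector from one traffic sample.

    flow: list of (timestamp, length, direction) tuples sorted by time,
          where direction is +1 (outbound) or -1 (inbound). This layout
          is a hypothetical choice for the sketch.
    """
    lengths = [length for _, length, _ in flow]
    gaps = [later[0] - earlier[0] for earlier, later in zip(flow, flow[1:])]

    # Statistical features.
    stats = [
        sum(lengths) / len(lengths),              # average packet length
        float(max(lengths)),                      # maximum packet length
        sum(gaps) / len(gaps) if gaps else 0.0,   # average inter-packet interval
    ]

    # Sequence features: direction of the first seq_len packets, zero-padded
    # so that every sample yields a vector of the same size.
    seq = [float(direction) for _, _, direction in flow[:seq_len]]
    seq += [0.0] * (seq_len - len(seq))

    # Concatenate the two feature groups into one vector.
    return stats + seq
```

For a three-packet sample `[(0.0, 100, 1), (0.5, 300, -1), (1.5, 200, 1)]` this yields `[200.0, 300.0, 0.75, 1.0, -1.0, 1.0, 0.0]`.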
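On top of the beneficial effects of the first embodiment, this embodiment processes data in units of flows and extracts features from the traffic samples to obtain feature vectors, which facilitates subsequent training and identification.
The fourth embodiment of the present application relates to a traffic subdivision identification method. This embodiment is substantially the same as the first embodiment, except that, as shown in FIG. 4, step 103 includes sub-step 401 to sub-step 404.
Sub-step 401: pass the feature vector of the service traffic through the preset first anomaly detection model to obtain the first anomaly score of the service traffic.
Specifically, the first anomaly detection model may include an isolation forest algorithm, a local outlier factor algorithm, a K-means-based clustering algorithm, and so on; the first anomaly score can be obtained by any algorithm in the first anomaly detection model. Taking the isolation forest algorithm as an example, the average search depth over the isolation trees can be normalized and used as the first anomaly score.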
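The text only says that the average search depth is normalized. One common normalization, assumed here rather than taken from the application, is the standard isolation forest score s = 2^(-E[h(x)] / c(n)), where E[h(x)] is the average depth at which x is isolated and c(n) is the average path length of an unsuccessful binary-search-tree lookup over n points. A sketch of just that normalization:

```python
import math

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def average_path_length(n):
    """c(n) = 2*H(n-1) - 2*(n-1)/n, with H(i) approximated by ln(i) + gamma."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0  # exact value, since H(1) = 1
    harmonic = math.log(n - 1) + EULER_GAMMA
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def isolation_score(mean_depth, n):
    """Normalized anomaly score in (0, 1]; requires n >= 2.

    mean_depth: average depth at which the sample was isolated across trees
    n:          number of points each tree was built from
    Shallow isolation (small mean_depth) gives a score near 1 (anomalous);
    mean_depth equal to c(n) gives exactly 0.5.
    """
    return 2.0 ** (-mean_depth / average_path_length(n))
```

With this convention a larger score means a more anomalous sample, which matches the "greater than the first threshold" test in sub-step 402.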
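Sub-step 402: judge whether the first anomaly score of the service traffic is greater than the preset first threshold; if yes, go to sub-step 403; otherwise, go to sub-step 404.
Specifically, if the first anomaly score of the service traffic is greater than the preset first threshold, sub-step 403 is executed; if it is not greater than the preset first threshold, sub-step 404 is executed.
Sub-step 403: remove the feature vector of the service traffic and obtain the sub-feature vector.
In this implementation, sub-step 403 removes service traffic because, when the service traffic of the pre-specified service is acquired, a lot of irrelevant traffic is generated at the same time, such as background traffic and in-application advertisement traffic. This traffic is unrelated to the specified service and directly affects the subsequent classification training, causing deviations in the traffic identification results.
Sub-step 404: judge the first anomaly score of the next service traffic.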
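Sub-steps 401 to 404 amount to a thresholded filter over the training feature vectors. A minimal sketch, assuming `score_fn` is any callable that returns the first anomaly score of a vector (the name and signature are hypothetical):

```python
def select_sub_feature_vectors(feature_vectors, score_fn, first_threshold):
    """Keep only the vectors whose first anomaly score does not exceed the threshold.

    feature_vectors: iterable of feature vectors of the service traffic
    score_fn:        hypothetical callable, vector -> first anomaly score
    first_threshold: preset first threshold
    """
    sub_feature_vectors = []
    for vec in feature_vectors:
        score = score_fn(vec)            # sub-step 401
        if score > first_threshold:      # sub-step 402
            continue                     # sub-step 403: remove this vector
        sub_feature_vectors.append(vec)  # keep it, then move on (sub-step 404)
    return sub_feature_vectors
```

The vectors that survive are the sub-feature vectors used both to train the traffic classifier and to fit the traffic filters.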
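On top of the beneficial effects of the first embodiment, this embodiment uses the first anomaly detection model to remove traffic unrelated to the pre-specified services, such as background traffic and advertising traffic, and obtains purer sub-feature vectors, so that subsequent classification training works better and the accuracy of traffic subdivision identification is further improved.
The fifth embodiment of the present application relates to a traffic subdivision identification method. This embodiment is substantially the same as the first embodiment, except that, as shown in FIG. 5, step 107 includes sub-step 501 to sub-step 504.
Sub-step 501: pass the feature vector of the traffic to be identified through the traffic filter of the corresponding service label to obtain the second anomaly score of the traffic to be identified.
Sub-step 502: judge whether the second anomaly score is greater than the preset second threshold; if yes, go to sub-step 503; otherwise, go to sub-step 504.
Specifically, if the second anomaly score is greater than the preset second threshold, sub-step 503 is executed; if it is not greater than the preset second threshold, sub-step 504 is executed.
Sub-step 503: remove the traffic to be identified that corresponds to the service label.
In this implementation, the traffic to be identified includes service traffic of various service types. When the second anomaly score is greater than the preset second threshold, the service label obtained by the traffic classifier is wrong; that is, the feature distribution of the traffic to be identified deviates substantially from that of the real traffic corresponding to the service label, so the traffic to be identified, which is irrelevant to the identification target, is removed.
Sub-step 504: output the service label corresponding to the traffic to be identified, and obtain the traffic subdivision identification result.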
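The identification stage (step 106 followed by sub-steps 501 to 504) can be sketched as below. `classify` and the entries of `filters` are hypothetical callables standing in for the trained traffic classifier and the per-service traffic filters; returning `None` for rejected traffic is a choice made for this sketch.

```python
def identify(vec, classify, filters, second_threshold):
    """Return the service label for vec, or None if the traffic is rejected.

    classify:         hypothetical callable, vector -> service label
    filters:          dict mapping service label -> callable returning the
                      second anomaly score of a vector for that service
    second_threshold: preset second threshold
    """
    label = classify(vec)            # step 106: closed-set classification
    score = filters[label](vec)      # sub-step 501: filter of that label
    if score > second_threshold:     # sub-step 502
        return None                  # sub-step 503: irrelevant traffic removed
    return label                     # sub-step 504: output the service label
```

The classifier alone always produces some known label; the per-label filter is what allows unknown traffic to be rejected, which is how the scheme addresses the open set problem.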
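On top of the beneficial effects of the first embodiment, this embodiment uses the traffic filter obtained from the second anomaly detection model in the identification stage to remove traffic irrelevant to the identification target, which further avoids a large number of false hits on irrelevant traffic during identification and improves identification accuracy.
In addition, it should be understood that the division of the above methods into steps is only for clarity of description. In implementation, steps may be combined into one step, or a step may be split into multiple steps; as long as the same logical relationship is included, they fall within the protection scope of this patent. Adding insignificant modifications to an algorithm or process, or introducing insignificant designs, without changing the core design of the algorithm and process, also falls within the protection scope of this patent.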
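The sixth embodiment of the present application relates to a traffic subdivision identification system which, as shown in FIG. 6, includes:
a traffic acquisition module 601, used for acquiring the service traffic of the pre-specified service;
a feature extraction module 602, used for processing the service traffic obtained by the traffic acquisition module 601 to obtain the feature vector of the service traffic, and for processing the traffic to be identified to obtain the feature vector of the traffic to be identified;
a first anomaly detection module 603, used for processing the feature vector of the service traffic obtained by the feature extraction module 602 to obtain the sub-feature vector;
a classification training module 604, used for training on the sub-feature vector obtained by the first anomaly detection module 603 to obtain a traffic classifier;
a second anomaly detection module 605, used for processing the sub-feature vector obtained by the first anomaly detection module 603 to obtain a traffic filter;
a traffic identification module 606, used for passing the feature vector of the traffic to be identified, obtained by the feature extraction module 602, through the traffic classifier obtained by the classification training module 604 to obtain the corresponding service label, and for passing the same feature vector through the traffic filter of the corresponding service label, obtained by the second anomaly detection module 605, to obtain the traffic subdivision identification result.
It is easy to see that this embodiment is a system embodiment corresponding to the first embodiment, and it can be implemented in cooperation with the first embodiment. The relevant technical details mentioned in the first embodiment remain valid in this embodiment and are not repeated here. Correspondingly, the relevant technical details mentioned in this embodiment can also be applied in the first embodiment.
It is worth mentioning that the modules involved in this embodiment are all logical modules. In practical applications, a logical unit may be a physical unit, a part of a physical unit, or a combination of multiple physical units. In addition, in order to highlight the innovative part of the present application, this embodiment does not introduce units that are not closely related to solving the technical problem raised by the present application, but this does not mean that there are no other units in this embodiment.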
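A seventh embodiment of the present application relates to an electronic device which, as shown in FIG. 7, includes: at least one processor 701; and a memory 702 communicatively connected to the at least one processor 701. The memory 702 stores instructions executable by the at least one processor 701, and the instructions are executed by the at least one processor 701 so that the at least one processor 701 can perform the traffic subdivision identification method described in any of the above method embodiments.
The memory and the processor are connected by a bus. The bus may include any number of interconnected buses and bridges, and it connects the various circuits of the one or more processors and the memory together. The bus may also connect various other circuits, such as peripheral devices, voltage regulators, and power management circuits; these are well known in the art and are therefore not described further herein. A bus interface provides an interface between the bus and a transceiver. The transceiver may be a single element or multiple elements, such as multiple receivers and transmitters, and provides a unit for communicating with various other apparatuses over a transmission medium. Data processed by the processor is transmitted over a wireless medium through an antenna; the antenna also receives data and passes it to the processor.
The processor is responsible for managing the bus and general processing, and may also provide various functions, including timing, peripheral interfaces, voltage regulation, power management, and other control functions. The memory may be used to store data used by the processor when performing operations.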
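An eighth embodiment of the present application relates to a computer-readable storage medium storing a computer program. When executed by a processor, the computer program implements the above method embodiments.
That is, those skilled in the art will understand that all or part of the steps of the methods in the above embodiments can be completed by a program instructing the relevant hardware. The program is stored in a storage medium and includes several instructions that cause a device (which may be a single-chip microcomputer, a chip, or the like) or a processor to perform all or part of the steps of the methods of the embodiments of the present application. The aforementioned storage medium includes various media that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
Those of ordinary skill in the art will understand that the above embodiments are specific embodiments for implementing the present application, and that in practical applications various changes in form and detail may be made to them without departing from the spirit and scope of the present application.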

Claims (10)

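  1. A traffic subdivision identification method, comprising:
    acquiring service traffic of a pre-specified service;
    processing the service traffic to obtain a feature vector of the service traffic;
    passing the feature vector of the service traffic through a preset first anomaly detection model to obtain sub-feature vectors;
    training on the sub-feature vectors with a preset classification training model to obtain a traffic classifier;
    passing the sub-feature vectors through a preset second anomaly detection model to obtain a traffic screener;
    obtaining a feature vector of traffic to be identified, and passing the feature vector of the traffic to be identified through the traffic classifier to obtain a corresponding service label;
    passing the feature vector of the traffic to be identified through the traffic screener of the corresponding service label to obtain a traffic subdivision identification result.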
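  2. The traffic subdivision identification method according to claim 1, wherein the acquiring service traffic of a pre-specified service comprises:
    controlling, by a control program, an application that generates service traffic;
    performing the pre-specified service on the application that generates service traffic, and acquiring the service traffic of the pre-specified service;
    recording the operation steps of performing the pre-specified service, and generating and saving an operation script;
    importing the operation script into the control program to automatically perform the pre-specified service, and acquiring the service traffic of the pre-specified service.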
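  3. The traffic subdivision identification method according to claim 1 or 2, wherein the processing the service traffic to obtain a feature vector of the service traffic comprises:
    obtaining five-tuple information of the service traffic, wherein the five-tuple information includes a source IP, a source port, a destination IP, a destination port, and a transport protocol;
    grouping the service traffic according to the five-tuple information to obtain traffic samples;
    performing feature extraction on the traffic samples to obtain the feature vector of the service traffic.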
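  4. The traffic subdivision identification method according to any one of claims 1 to 3, wherein the passing the feature vector of the service traffic through a preset first anomaly detection model to obtain sub-feature vectors comprises:
    passing the feature vector of the service traffic through the preset first anomaly detection model to obtain a first anomaly score of the service traffic;
    determining whether the first anomaly score is greater than a preset first threshold, wherein, if the first anomaly score is greater than the preset first threshold, the feature vector of the service traffic is removed and the sub-feature vectors are obtained, and if the first anomaly score is not greater than the preset first threshold, the first anomaly score of the next service traffic is evaluated.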
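  5. The traffic subdivision identification method according to any one of claims 1 to 4, wherein the passing the feature vector of the traffic to be identified through the traffic screener of the corresponding service label to obtain a traffic subdivision identification result comprises:
    passing the feature vector of the traffic to be identified through the traffic screener of the corresponding service label to obtain a second anomaly score of the traffic to be identified;
    determining whether the second anomaly score is greater than a preset second threshold, wherein, if the second anomaly score is greater than the preset second threshold, the traffic to be identified with the corresponding service label is removed, and if the second anomaly score is not greater than the preset second threshold, the service label corresponding to the traffic to be identified is output and the traffic subdivision identification result is obtained.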
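  6. The traffic subdivision identification method according to any one of claims 1 to 5, wherein the first anomaly detection model includes: an isolation forest algorithm, a local outlier factor algorithm, a K-means-based clustering algorithm.
  7. The traffic subdivision identification method according to any one of claims 1 to 6, wherein the second anomaly detection model includes: a one-class support vector machine, a fitted prior distribution.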
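  8. A traffic subdivision identification system, comprising:
    a traffic acquisition module, configured to acquire service traffic of a pre-specified service;
    a feature extraction module, configured to process the service traffic obtained by the traffic acquisition module to obtain a feature vector of the service traffic, and to process traffic to be identified to obtain a feature vector of the traffic to be identified;
    a first anomaly detection module, configured to process the feature vector of the service traffic obtained by the feature extraction module to obtain sub-feature vectors;
    a classification training module, configured to train on the sub-feature vectors obtained by the first anomaly detection module to obtain a traffic classifier;
    a second anomaly detection module, configured to process the sub-feature vectors obtained by the first anomaly detection module to obtain a traffic screener;
    a traffic identification module, configured to pass the feature vector of the traffic to be identified, obtained by the feature extraction module, through the traffic classifier obtained by the classification training module to obtain a corresponding service label, and to pass the feature vector of the traffic to be identified, obtained by the feature extraction module, through the traffic screener of the corresponding service label obtained by the second anomaly detection module to obtain a traffic subdivision identification result.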
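  9. An electronic device, comprising:
    at least one processor; and
    a memory communicatively connected to the at least one processor; wherein
    the memory stores instructions executable by the at least one processor, and the instructions are executed by the at least one processor so that the at least one processor can perform the traffic subdivision identification method according to any one of claims 1 to 7.
  10. A computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the traffic subdivision identification method according to any one of claims 1 to 7.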
PCT/CN2021/112328 2020-10-12 2021-08-12 Traffic subdivision identification method, system, electronic device, and storage medium WO2022078042A1 (zh)

Applications Claiming Priority (2)

Application Number Priority Date Filing Date Title
CN202011085864.XA CN114362982B (zh) 2020-10-12 Traffic subdivision identification method, system, electronic device, and storage medium
CN202011085864.X 2020-10-12

Publications (1)

Publication Number Publication Date
WO2022078042A1 (zh) 2022-04-21

Family

ID=81090230

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/CN2021/112328 WO2022078042A1 (zh) 2020-10-12 2021-08-12 Traffic subdivision identification method, system, electronic device, and storage medium

Country Status (1)

Country Link
WO (1) WO2022078042A1 (zh)

Patent Citations (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
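CN102045363A (zh) * 2010-12-31 2011-05-04 成都市华为赛门铁克科技有限公司 Method for establishing network traffic feature identification rules, identification control method, and apparatus
CN108650195A (zh) * 2018-04-17 2018-10-12 南京烽火天地通信科技有限公司 Method for constructing an automatic app traffic identification model
CN109151880A (zh) * 2018-11-08 2019-01-04 中国人民解放军国防科技大学 Mobile application traffic identification method based on a multi-layer classifier
WO2020119662A1 (zh) * 2018-12-14 2020-06-18 深圳先进技术研究院 Network traffic classification method
CN111259985A (zh) * 2020-02-19 2020-06-09 腾讯科技(深圳)有限公司 Classification model training method and apparatus based on service security, and storage medium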

Non-Patent Citations (1)

* Cited by examiner, † Cited by third party
Title
WANG QINGLONG; YAHYAVI AMIR; KEMME BETTINA; HE WENBO: "I know what you did on your smartphone: Inferring app usage over encrypted data traffic", 2015 IEEE CONFERENCE ON COMMUNICATIONS AND NETWORK SECURITY (CNS), IEEE, 28 September 2015 (2015-09-28), pages 433 - 441, XP032825425, DOI: 10.1109/CNS.2015.7346855 *

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
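CN116170829A (zh) * 2023-04-26 2023-05-26 浙江省公众信息产业有限公司 Operation and maintenance scenario identification method and apparatus for independent private network services
CN116170829B (zh) * 2023-04-26 2023-07-04 浙江省公众信息产业有限公司 Operation and maintenance scenario identification method and apparatus for independent private network services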

Also Published As

Publication number Publication date
CN114362982A (zh) 2022-04-15

Legal Events

Date Code Title Description
NENP Non-entry into the national phase

Ref country code: DE

32PN Ep: public notification in the ep bulletin as address of the addressee cannot be established

Free format text: NOTING OF LOSS OF RIGHTS PURSUANT TO RULE 112(1) EPC (EPO FORM 1205A DATED 05/09/2023)

122 Ep: pct application non-entry in european phase

Ref document number: 21879089

Country of ref document: EP

Kind code of ref document: A1